Prediction of All-NBA Player using 22-23 NBA Player Stats Per Game + Advanced Stats

Author: Xi Cheng

Course Project, UC Irvine, Math 10, F23

Introduction

The All-NBA Team is an annual National Basketball Association (NBA) honor bestowed on the best players in the league following every NBA season. The voting is conducted by a global panel of sportswriters and broadcasters. Since 1988 it has typically been composed of three five-man lineups: a first, second, and third team. Voters selected two guards, two forwards, and one center for each team.

Nowadays, NBA statistics are rich: there are many basic stats and a growing number of powerful advanced stats. Each stat shows a different aspect of the game.

In this project, I am going to predict the All-NBA players of the 22-23 season using the datasets 22-23 NBA Player Stats Per Game and Advanced Stats from basketball-reference.com.

Data Cleaning

import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns
import matplotlib.pyplot as plt

I manually exported the datasets 22-23 NBA Player Stats Per Game and Advanced Stats to Notepad on my PC and then saved them as csv files (a method I learnt from the wikiHow guide listed in the References). As a result, the data in 2022-23 NBA Player Stats - Advanced.csv and 2022-23 NBA Player Stats - Per Game.csv is not perfect.

There is one mistaken column, Player-additional, which I dropped.

df1 = pd.read_csv("2022-23 NBA Player Stats - Per Game.csv").copy()
df1 = df1.drop(columns=['Player-additional'])
df1
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 1 Precious Achiuwa C 23 TOR 55 12 20.7 3.6 7.3 ... 0.702 1.8 4.1 6.0 0.9 0.6 0.5 1.1 1.9 9.2
1 2 Steven Adams C 29 MEM 42 42 27.0 3.7 6.3 ... 0.364 5.1 6.5 11.5 2.3 0.9 1.1 1.9 2.3 8.6
2 3 Bam Adebayo C 25 MIA 75 75 34.6 8.0 14.9 ... 0.806 2.5 6.7 9.2 3.2 1.2 0.8 2.5 2.8 20.4
3 4 Ochai Agbaji SG 22 UTA 59 22 20.5 2.8 6.5 ... 0.812 0.7 1.3 2.1 1.1 0.3 0.3 0.7 1.7 7.9
4 5 Santi Aldama PF 22 MEM 77 20 21.8 3.2 6.8 ... 0.750 1.1 3.7 4.8 1.3 0.6 0.6 0.8 1.9 9.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
674 535 Thaddeus Young PF 34 TOR 54 9 14.7 2.0 3.7 ... 0.692 1.3 1.8 3.1 1.4 1.0 0.1 0.8 1.6 4.4
675 536 Trae Young PG 24 ATL 73 73 34.8 8.2 19.0 ... 0.886 0.8 2.2 3.0 10.2 1.1 0.1 4.1 1.4 26.2
676 537 Omer Yurtseven C 24 MIA 9 0 9.2 1.8 3.0 ... 0.833 0.9 1.7 2.6 0.2 0.2 0.2 0.4 1.8 4.4
677 538 Cody Zeller C 30 MIA 15 2 14.5 2.5 3.9 ... 0.686 1.7 2.6 4.3 0.7 0.2 0.3 0.9 2.2 6.5
678 539 Ivica Zubac C 25 LAC 76 76 28.6 4.3 6.8 ... 0.697 3.1 6.8 9.9 1.0 0.4 1.3 1.5 2.9 10.8

679 rows × 30 columns

Again, there are mistaken columns ('Player-additional', 'Unnamed: 19', 'Unnamed: 24'), which I dropped.

I use df2.columns[7:] to exclude the first seven columns, because I intend to combine the two dataframes into one with both basic and advanced stats, and those first columns are repeated in df1.

df2 = pd.read_csv("2022-23 NBA Player Stats - Advanced.csv").copy()
df2 = df2.drop(columns=['Player-additional','Unnamed: 19','Unnamed: 24'])
df2_1 = df2[df2.columns[7:]]
df2_1
PER TS% 3PAr FTr ORB% DRB% TRB% AST% STL% BLK% TOV% USG% OWS DWS WS WS/48 OBPM DBPM BPM VORP
0 15.2 0.554 0.267 0.307 9.3 24.4 16.3 6.3 1.3 2.6 11.4 19.4 0.8 1.4 2.2 0.093 -1.4 -0.8 -2.3 -0.1
1 17.5 0.564 0.004 0.490 20.1 25.3 22.7 11.2 1.5 3.7 19.8 14.6 1.3 2.1 3.4 0.144 -0.3 0.9 0.6 0.7
2 20.1 0.592 0.011 0.361 8.0 23.6 15.5 15.9 1.7 2.4 12.7 25.2 3.6 3.8 7.4 0.137 0.8 0.8 1.5 2.3
3 9.5 0.561 0.591 0.179 3.9 6.9 5.4 7.5 0.6 1.0 9.0 15.8 0.9 0.4 1.3 0.053 -1.7 -1.4 -3.0 -0.3
4 13.9 0.591 0.507 0.274 5.4 18.0 11.7 7.6 1.3 2.6 9.3 16.0 2.1 2.4 4.6 0.130 -0.3 0.8 0.5 1.1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
674 14.1 0.573 0.172 0.131 9.4 14.6 11.8 12.9 3.4 0.6 16.7 13.5 0.7 1.1 1.8 0.109 -1.8 1.9 0.1 0.4
675 22.0 0.573 0.331 0.460 2.4 7.0 4.7 42.5 1.5 0.3 15.2 32.6 5.3 1.4 6.7 0.126 5.3 -2.0 3.3 3.4
676 16.7 0.675 0.259 0.222 10.9 21.9 16.2 3.9 1.2 2.5 11.9 18.0 0.2 0.1 0.3 0.159 -2.5 -1.5 -3.9 0.0
677 16.4 0.659 0.034 0.593 13.0 21.8 17.3 7.2 0.7 1.9 15.8 18.1 0.4 0.3 0.7 0.147 -2.0 -0.7 -2.8 0.0
678 16.7 0.661 0.004 0.463 12.3 26.5 19.5 5.1 0.7 4.0 15.9 14.8 3.8 2.9 6.7 0.149 -1.1 0.2 -0.9 0.6

679 rows × 20 columns

I combine the two dataframes into one with both basic and advanced stats, dropping the rows with NA values at the same time. Now the df is clean.

df = pd.concat([df1,df2_1], axis=1).dropna(axis=0).copy()
df
Rk Player Pos Age Tm G GS MP FG FGA ... TOV% USG% OWS DWS WS WS/48 OBPM DBPM BPM VORP
0 1 Precious Achiuwa C 23 TOR 55 12 20.7 3.6 7.3 ... 11.4 19.4 0.8 1.4 2.2 0.093 -1.4 -0.8 -2.3 -0.1
1 2 Steven Adams C 29 MEM 42 42 27.0 3.7 6.3 ... 19.8 14.6 1.3 2.1 3.4 0.144 -0.3 0.9 0.6 0.7
2 3 Bam Adebayo C 25 MIA 75 75 34.6 8.0 14.9 ... 12.7 25.2 3.6 3.8 7.4 0.137 0.8 0.8 1.5 2.3
3 4 Ochai Agbaji SG 22 UTA 59 22 20.5 2.8 6.5 ... 9.0 15.8 0.9 0.4 1.3 0.053 -1.7 -1.4 -3.0 -0.3
4 5 Santi Aldama PF 22 MEM 77 20 21.8 3.2 6.8 ... 9.3 16.0 2.1 2.4 4.6 0.130 -0.3 0.8 0.5 1.1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
674 535 Thaddeus Young PF 34 TOR 54 9 14.7 2.0 3.7 ... 16.7 13.5 0.7 1.1 1.8 0.109 -1.8 1.9 0.1 0.4
675 536 Trae Young PG 24 ATL 73 73 34.8 8.2 19.0 ... 15.2 32.6 5.3 1.4 6.7 0.126 5.3 -2.0 3.3 3.4
676 537 Omer Yurtseven C 24 MIA 9 0 9.2 1.8 3.0 ... 11.9 18.0 0.2 0.1 0.3 0.159 -2.5 -1.5 -3.9 0.0
677 538 Cody Zeller C 30 MIA 15 2 14.5 2.5 3.9 ... 15.8 18.1 0.4 0.3 0.7 0.147 -2.0 -0.7 -2.8 0.0
678 539 Ivica Zubac C 25 LAC 76 76 28.6 4.3 6.8 ... 15.9 14.8 3.8 2.9 6.7 0.149 -1.1 0.2 -0.9 0.6

624 rows × 50 columns
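
Because pd.concat with axis=1 pairs rows by position, it is worth confirming that both tables list the same players in the same order before combining them. The following is only a minimal sketch on toy data: the frames per_game and advanced and their values are made up for illustration and are not the real CSV contents.

```python
import pandas as pd

# Toy stand-ins for df1 (per game) and df2 (advanced); values are illustrative only.
per_game = pd.DataFrame({"Player": ["A", "B", "C"],
                         "Tm": ["X", "Y", "Z"],
                         "PTS": [9.2, 8.6, 20.4]})
advanced = pd.DataFrame({"Player": ["A", "B", "C"],
                         "Tm": ["X", "Y", "Z"],
                         "PER": [15.2, 17.5, 20.1]})

# A positional concat is only safe if the identifying columns match row by row.
keys = ["Player", "Tm"]
rows_aligned = per_game[keys].equals(advanced[keys])

# Keep only the columns that are not already present in the per-game table.
new_cols = [c for c in advanced.columns if c not in per_game.columns]
combined = pd.concat([per_game, advanced[new_cols]], axis=1)
print(rows_aligned)
print(combined)
```

Selecting the new columns by name, as above, is an alternative to slicing with columns[7:] that does not depend on how many shared columns come first.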

Feature Engineering

Now my df has every basic and advanced stat, but not whether each player made the All-NBA Team. Thus, I use a method inspired by ChatGPT, together with the list of the 22-23 All-NBA Teams, to generate a new column.

I found that some player names are spelled differently in my original df and in the new list; for example, the list has Luka Doncic while my df has Luka Dončić. Thus, I manually replaced the mismatched names.

As the introduction says, there are first, second, and third All-NBA teams. I choose to set my y later as membership in any of the three teams, i.e., all_nba_team.

# List of All-NBA Team players for the 2022-2023 season
all_nba_first_team = ["Giannis Antetokounmpo", "Jayson Tatum", "Joel Embiid", "Shai Gilgeous-Alexander", "Luka Doncic"]
all_nba_second_team = ["Jimmy Butler", "Jaylen Brown", "Nikola Jokic", "Donovan Mitchell", "Stephen Curry"]
all_nba_third_team = ["Julius Randle", "LeBron James", "Domantas Sabonis", "De’Aaron Fox", "Damian Lillard"]
# Substitute the spellings used in df; before this fix only 12 of the 15 names matched
all_nba_first_team[-1] = 'Luka Dončić'
all_nba_second_team[2] = 'Nikola Jokić'
all_nba_third_team[3] = "De'Aaron Fox"

all_nba_team = all_nba_first_team + all_nba_second_team +all_nba_third_team
all_nba_team
['Giannis Antetokounmpo',
 'Jayson Tatum',
 'Joel Embiid',
 'Shai Gilgeous-Alexander',
 'Luka Dončić',
 'Jimmy Butler',
 'Jaylen Brown',
 'Nikola Jokić',
 'Donovan Mitchell',
 'Stephen Curry',
 'Julius Randle',
 'LeBron James',
 'Domantas Sabonis',
 "De'Aaron Fox",
 'Damian Lillard']
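
As an alternative to replacing each name by hand, both the list and the Player column could be normalized to plain ASCII before matching. This is a sketch of that idea, not what the project does; the helper to_ascii_name is a hypothetical name introduced here, and it only handles accents that Unicode can decompose plus the curly apostrophe.

```python
import unicodedata

def to_ascii_name(name: str) -> str:
    """Remove accents and straighten curly apostrophes (hypothetical helper)."""
    # Right single quotation mark -> ASCII apostrophe.
    name = name.replace("\u2019", "'")
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks and keep the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

normalized = [to_ascii_name(n)
              for n in ["Luka Dončić", "Nikola Jokić", "De\u2019Aaron Fox"]]
print(normalized)
```

Applying the same function to both sides (for example with df['Player'].map(to_ascii_name)) would let the comparison ignore accent differences in either source.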

Define a function that generates a new column indicating whether a player is on the All-NBA Team of the year.

# Function to determine All-NBA Team status
def determine_all_nba_team(player):
    if player in all_nba_team:
        return 1
    else:
        return 0

Apply the function to the dataset

# Apply the function to the dataset
df['All-NBA Team'] = df['Player'].apply(determine_all_nba_team)
df
Rk Player Pos Age Tm G GS MP FG FGA ... USG% OWS DWS WS WS/48 OBPM DBPM BPM VORP All-NBA Team
0 1 Precious Achiuwa C 23 TOR 55 12 20.7 3.6 7.3 ... 19.4 0.8 1.4 2.2 0.093 -1.4 -0.8 -2.3 -0.1 0
1 2 Steven Adams C 29 MEM 42 42 27.0 3.7 6.3 ... 14.6 1.3 2.1 3.4 0.144 -0.3 0.9 0.6 0.7 0
2 3 Bam Adebayo C 25 MIA 75 75 34.6 8.0 14.9 ... 25.2 3.6 3.8 7.4 0.137 0.8 0.8 1.5 2.3 0
3 4 Ochai Agbaji SG 22 UTA 59 22 20.5 2.8 6.5 ... 15.8 0.9 0.4 1.3 0.053 -1.7 -1.4 -3.0 -0.3 0
4 5 Santi Aldama PF 22 MEM 77 20 21.8 3.2 6.8 ... 16.0 2.1 2.4 4.6 0.130 -0.3 0.8 0.5 1.1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
674 535 Thaddeus Young PF 34 TOR 54 9 14.7 2.0 3.7 ... 13.5 0.7 1.1 1.8 0.109 -1.8 1.9 0.1 0.4 0
675 536 Trae Young PG 24 ATL 73 73 34.8 8.2 19.0 ... 32.6 5.3 1.4 6.7 0.126 5.3 -2.0 3.3 3.4 0
676 537 Omer Yurtseven C 24 MIA 9 0 9.2 1.8 3.0 ... 18.0 0.2 0.1 0.3 0.159 -2.5 -1.5 -3.9 0.0 0
677 538 Cody Zeller C 30 MIA 15 2 14.5 2.5 3.9 ... 18.1 0.4 0.3 0.7 0.147 -2.0 -0.7 -2.8 0.0 0
678 539 Ivica Zubac C 25 LAC 76 76 28.6 4.3 6.8 ... 14.8 3.8 2.9 6.7 0.149 -1.1 0.2 -0.9 0.6 0

624 rows × 51 columns
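
The same 0/1 column can also be built without a helper function, using the pandas method Series.isin. Below is a small sketch on a toy frame; the names players and selected are placeholders for df and all_nba_team.

```python
import pandas as pd

# Toy stand-ins for df and all_nba_team.
players = pd.DataFrame({"Player": ["Jayson Tatum", "Trae Young", "Joel Embiid"]})
selected = ["Jayson Tatum", "Joel Embiid"]

# isin returns True/False per row; astype(int) turns that into 1/0.
players["All-NBA Team"] = players["Player"].isin(selected).astype(int)
print(players)
```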

Check: I can now identify all 15 players who are on the All-NBA Teams of the year. As I said above, only 12 players were identified before I manually substituted the mismatched names.

print(len(df[df['All-NBA Team']==1]))
df[df['All-NBA Team']==1]
15
Rk Player Pos Age Tm G GS MP FG FGA ... USG% OWS DWS WS WS/48 OBPM DBPM BPM VORP All-NBA Team
12 11 Giannis Antetokounmpo PF 28 MIL 63 63 32.1 11.2 20.3 ... 38.8 4.9 3.7 8.6 0.204 5.8 2.7 8.5 5.4 1
85 64 Jaylen Brown SF 26 BOS 67 67 35.9 10.1 20.6 ... 31.4 1.6 3.4 5.0 0.100 1.5 -0.2 1.3 2.0 1
100 75 Jimmy Butler PF 33 MIA 64 64 33.4 7.5 13.9 ... 25.6 9.4 2.9 12.3 0.277 6.7 2.0 8.7 5.8 1
139 108 Stephen Curry PG 34 GSW 56 56 34.7 10.0 20.2 ... 31.0 5.8 2.0 7.8 0.192 7.5 0.1 7.5 4.7 1
160 125 Luka Dončić PG 23 DAL 66 66 36.2 10.9 22.0 ... 37.6 7.3 2.9 10.2 0.204 7.6 1.4 8.9 6.6 1
184 143 Joel Embiid C 28 PHI 66 66 34.6 11.0 20.1 ... 37.0 8.4 3.9 12.3 0.259 6.8 2.3 9.2 6.4 1
198 153 De'Aaron Fox PG 25 SAC 73 73 33.4 9.3 18.2 ... 30.1 5.5 1.8 7.4 0.146 3.4 -0.9 2.5 2.7 1
209 164 Shai Gilgeous-Alexander PG 24 OKC 68 68 35.5 10.4 20.3 ... 32.8 8.4 3.0 11.4 0.226 5.7 1.5 7.3 5.6 1
306 239 LeBron James PF 38 LAL 55 54 35.5 11.1 22.2 ... 33.3 3.2 2.4 5.6 0.138 5.5 0.6 6.1 4.0 1
318 249 Nikola Jokić C 27 DEN 69 69 33.7 9.4 14.8 ... 27.2 11.2 3.8 14.9 0.308 8.5 4.5 13.0 8.8 1
373 292 Damian Lillard PG 32 POR 58 58 36.3 9.6 20.7 ... 33.8 8.2 0.8 9.0 0.205 8.3 -1.2 7.1 4.9 1
427 338 Donovan Mitchell SG 26 CLE 68 68 35.8 10.0 20.6 ... 32.1 5.4 3.5 8.9 0.176 5.6 0.6 6.3 5.0 1
514 405 Julius Randle PF 28 NYK 77 77 35.5 8.5 18.6 ... 29.5 5.0 3.1 8.1 0.142 3.9 -0.2 3.7 3.9 1
549 428 Domantas Sabonis C 26 SAC 79 79 34.6 7.3 11.9 ... 21.3 9.6 3.0 12.6 0.221 4.1 1.8 5.8 5.4 1
590 465 Jayson Tatum SF 24 BOS 74 74 36.9 9.8 21.1 ... 32.7 6.2 4.3 10.5 0.185 4.8 0.7 5.5 5.1 1

15 rows × 51 columns

I want to set some limits, using the stats that I have, so that the df used for training and testing the machine learning models is a sample that makes sense. Originally there are 624 rows/players in the sample; without additional limits there is too much noise in the dataset.

I set the conditions:

  • df['GS']/df['G'] > 0.9 means (Games_Started / Games_Played) > 90%, which indicates that the player is a solid starter.

  • df['G']>41 means Games_Played > 41 out of the 82 games per season, which indicates that the player played in at least 50% of the season's games.

  • df['MP']>30 means Minutes_Played per game > 30 out of the 48 minutes per game, which indicates that the player gets substantial playing time per game.

After filtering, I have 92 players left in my sample. Almost all of them are well-known, high-quality starters, which fits the standard of an All-NBA player, i.e., one of the BEST NBA players. I further sorted the df by Tm (team) so that NBA fans can read it more easily, using a method I learnt from ChatGPT.

df = df[(df['GS']/df['G'] > 0.9) & (df['G']>41) & (df['MP']>30)]
df.sort_values(by='Tm')
Rk Player Pos Age Tm G GS MP FG FGA ... USG% OWS DWS WS WS/48 OBPM DBPM BPM VORP All-NBA Team
675 536 Trae Young PG 24 ATL 73 73 34.8 8.2 19.0 ... 32.6 5.3 1.4 6.7 0.126 5.3 -2.0 3.3 3.4 0
285 224 De'Andre Hunter SF 25 ATL 67 67 31.7 5.7 12.3 ... 19.7 1.4 1.2 2.5 0.057 -1.9 -1.4 -3.3 -0.7 0
442 351 Dejounte Murray SG 26 ATL 74 74 36.4 8.3 17.8 ... 24.4 2.4 2.3 4.7 0.084 1.1 0.0 1.0 2.1 0
278 217 Al Horford C 36 BOS 63 63 30.5 3.6 7.6 ... 11.9 3.6 2.7 6.3 0.157 1.6 1.7 3.3 2.5 0
590 465 Jayson Tatum SF 24 BOS 74 74 36.9 9.8 21.1 ... 32.7 6.2 4.3 10.5 0.185 4.8 0.7 5.5 5.1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
122 93 Jordan Clarkson SG 30 UTA 61 61 32.6 7.5 16.9 ... 27.8 0.9 0.8 1.7 0.042 1.0 -2.4 -1.4 0.3 0
392 307 Lauri Markkanen PF 25 UTA 66 66 34.4 8.7 17.3 ... 26.6 6.3 1.9 8.2 0.173 4.9 -1.0 3.8 3.3 0
502 393 Kristaps Porziņģis C 27 WAS 65 65 32.6 7.8 15.7 ... 27.4 5.0 2.7 7.7 0.174 3.8 0.5 4.3 3.4 0
355 278 Kyle Kuzma PF 27 WAS 64 64 35.0 8.0 17.8 ... 27.9 -1.0 1.9 0.9 0.019 -0.2 -1.5 -1.7 0.2 0
44 35 Bradley Beal SG 29 WAS 50 50 33.5 8.9 17.6 ... 29.2 2.2 1.2 3.4 0.099 2.9 -1.2 1.8 1.6 0

92 rows × 51 columns

Feature Selection

I want to evaluate how significant each stat is by its correlation with my y, a method I learnt from How to Create a Correlation Matrix using Pandas.

df_corr = df.corr(numeric_only=True)  # skip text columns such as Player, Pos and Tm
df_corr[['All-NBA Team']]
All-NBA Team
Rk -0.009642
Age 0.095926
G 0.001436
GS 0.008730
MP 0.292817
FG 0.556520
FGA 0.449231
FG% 0.300232
3P 0.082090
3PA 0.115434
3P% 0.008667
2P 0.502456
2PA 0.423059
2P% 0.392843
eFG% 0.265468
FT 0.573517
FTA 0.589323
FT% 0.007220
ORB 0.165056
DRB 0.416201
TRB 0.375298
AST 0.260654
STL 0.173772
BLK 0.032665
TOV 0.432466
PF 0.130330
PTS 0.588607
PER 0.635366
TS% 0.387752
3PAr -0.164366
FTr 0.441352
ORB% 0.142607
DRB% 0.366110
TRB% 0.325540
AST% 0.339411
STL% 0.131656
BLK% -0.002979
TOV% 0.019838
USG% 0.523896
OWS 0.580939
DWS 0.299737
WS 0.583851
WS/48 0.548067
OBPM 0.600178
DBPM 0.402551
BPM 0.633779
VORP 0.676011
All-NBA Team 1.000000

I learnt how to draw a Seaborn correlation heatmap using sns and plt from How to Create a Seaborn Correlation Heatmap in Python?

plt.figure(figsize=(3, 14))
sns.heatmap(df_corr[['All-NBA Team']], annot=True, linewidths=.5)
plt.title('Correlation Matrix')
plt.show()
[Figure: heatmap titled "Correlation Matrix" showing each stat's correlation with All-NBA Team]

This shows how powerful some of the advanced stats are: they effectively show how good a player is. After researching, I found that VORP is calculated directly from BPM, VORP = [BPM - (-2.0)] * (% of possessions played) * (team games/82). Since BPM and VORP are so closely related, I decided to use only the top 2 advanced stats, VORP and PER, for my prediction.
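
To make the VORP formula concrete, here is a small worked example. The function name vorp_from_bpm and the input values are made up for illustration; they are not taken from the dataset.

```python
def vorp_from_bpm(bpm, possession_share, team_games):
    """VORP = [BPM - (-2.0)] * (% of possessions played) * (team games / 82)."""
    return (bpm - (-2.0)) * possession_share * (team_games / 82)

# Hypothetical player: BPM of 5.0, on the floor for 70% of possessions, 82 team games.
example_vorp = vorp_from_bpm(bpm=5.0, possession_share=0.70, team_games=82)
print(round(example_vorp, 2))  # (5.0 + 2.0) * 0.70 * 1.0 = 4.9
```

The -2.0 term is the replacement level, so a player with a BPM of exactly -2.0 has a VORP of 0 no matter how much he plays.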

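I learnt how to write this pandas sort-and-index code with help from ChatGPT.

ser_corr = df_corr['All-NBA Team']
# Keep entries above 0.634 (just above BPM's 0.633779). The boolean mask is aligned by label,
# so the result keeps ser_corr's original order and 'All-NBA Team' itself comes last; [:-1] drops it.
X1_top2_adv = ser_corr[ser_corr.sort_values(ascending=False) > 0.634].index[:-1]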
X1_top2_adv
Index(['PER', 'VORP'], dtype='object')
ser_corr.sort_values(ascending=False)
All-NBA Team    1.000000
VORP            0.676011
PER             0.635366
BPM             0.633779
OBPM            0.600178
FTA             0.589323
PTS             0.588607
WS              0.583851
OWS             0.580939
FT              0.573517
FG              0.556520
WS/48           0.548067
USG%            0.523896
2P              0.502456
FGA             0.449231
FTr             0.441352
TOV             0.432466
2PA             0.423059
DRB             0.416201
DBPM            0.402551
2P%             0.392843
TS%             0.387752
TRB             0.375298
DRB%            0.366110
AST%            0.339411
TRB%            0.325540
FG%             0.300232
DWS             0.299737
MP              0.292817
eFG%            0.265468
AST             0.260654
STL             0.173772
ORB             0.165056
ORB%            0.142607
STL%            0.131656
PF              0.130330
3PA             0.115434
Age             0.095926
3P              0.082090
BLK             0.032665
TOV%            0.019838
GS              0.008730
3P%             0.008667
FT%             0.007220
G               0.001436
BLK%           -0.002979
Rk             -0.009642
3PAr           -0.164366
Name: All-NBA Team, dtype: float64
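
A threshold such as 0.634 has to be re-tuned whenever the data changes. One alternative, shown here only as a sketch, is to drop the target from the correlation series and take the k largest values with Series.nlargest. The series below copies the top five correlations listed above; note that the resulting order (VORP first) differs from the Index(['PER', 'VORP']) produced by the original code.

```python
import pandas as pd

# Top correlations copied from the sorted output above.
corr = pd.Series({"All-NBA Team": 1.000000, "VORP": 0.676011, "PER": 0.635366,
                  "BPM": 0.633779, "OBPM": 0.600178})

# Remove the target itself, then keep the two most correlated features.
top2 = corr.drop("All-NBA Team").nlargest(2).index.tolist()
print(top2)
```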

Machine Learning Model Creation/Evaluation

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[X1_top2_adv], df['All-NBA Team'], train_size=0.7, random_state=33)
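
With only 15 positive labels among 92 players, a purely random split can put very few All-NBA players in the test set. The stratify argument of train_test_split keeps the class ratio similar in both parts. The sketch below uses random toy features with the same class counts; it is not the split used in this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with the same shape as the filtered sample: 92 players, 15 positives.
rng = np.random.default_rng(0)
X = rng.random((92, 2))
y = np.array([1] * 15 + [0] * 77)

# stratify=y asks for roughly the same share of positives in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=33
)
print(len(y_tr), len(y_te), int(y_tr.sum()), int(y_te.sum()))
```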

Logistic Classification

The scores look pretty good. On the one hand, this good score demonstrates how powerful the advanced stats are at showing how good a player is. On the other hand, I doubt whether I set up my prediction target y well, because if I simply check the top 15 players by VORP (Value Over Replacement Player, an advanced stat) or PTS (points per game, the most important basic stat), roughly 10 to 12 of the 15 All-NBA players already appear in that top 15. If I sorted by some of the less significant stats instead, the accuracy would not be this high. Anyway, good job.

from sklearn.linear_model import LogisticRegression

clf_lr = LogisticRegression(max_iter=400)
clf_lr.fit(X_train, y_train)

print("Logistic Regression Model")
print("Training accuracy:", clf_lr.score(X_train, y_train))
print("Test accuracy:", clf_lr.score(X_test, y_test))
Logistic Regression Model
Training accuracy: 0.921875
Test accuracy: 0.9285714285714286
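
To put these accuracies in context, it helps to compare them with a model that always predicts "not All-NBA". The sketch below assumes that all 15 All-NBA players are among the 92 filtered rows (the table of the 15 players shown earlier suggests they all pass the filters) and uses a dummy feature column, since the baseline ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

# Assumed label counts: 15 All-NBA players out of 92 filtered players.
y = np.array([1] * 15 + [0] * 77)
X_dummy = np.zeros((len(y), 1))  # features are irrelevant for this baseline

# Always predict the most frequent class (0 = not All-NBA).
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y)
baseline_accuracy = baseline.score(X_dummy, y)
baseline_recall = recall_score(y, baseline.predict(X_dummy))
print(round(baseline_accuracy, 3), baseline_recall)
```

Under that assumption the baseline is already right about 84% of the time while finding none of the All-NBA players, so recall on the positive class is a useful number to report next to accuracy.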

Let's use altair to plot my two Xs, PER and VORP, against the y, All-NBA Team. I also include a tooltip with Player, Team, and Points per game to give NBA fans more information.

From this graph, we can tell that the higher a player's PER, and especially his VORP, the more likely he is to be on an All-NBA Team.

alt.Chart(df).mark_circle().encode(
    x=alt.X("VORP", scale=alt.Scale(zero=False)),
    y=alt.Y("PER", scale=alt.Scale(zero=False)),
    color="All-NBA Team:N",
    tooltip = ['Player','Tm','PTS',"VORP","PER"]
).properties(
    title="All-NBA Team Shown By VORP and PER"
)

df = df.copy()  # df was created by filtering; copy it so the assignment below does not trigger SettingWithCopyWarning
df['lr_Result'] = clf_lr.predict(df[X1_top2_adv])
df
Rk Player Pos Age Tm G GS MP FG FGA ... OWS DWS WS WS/48 OBPM DBPM BPM VORP All-NBA Team lr_Result
2 3 Bam Adebayo C 25 MIA 75 75 34.6 8.0 14.9 ... 3.6 3.8 7.4 0.137 0.8 0.8 1.5 2.3 0 0
9 8 Jarrett Allen C 24 CLE 68 68 32.6 5.9 9.2 ... 5.5 4.0 9.5 0.205 1.3 1.0 2.4 2.4 0 0
12 11 Giannis Antetokounmpo PF 28 MIL 63 63 32.1 11.2 20.3 ... 4.9 3.7 8.6 0.204 5.8 2.7 8.5 5.4 1 1
15 14 OG Anunoby SF 25 TOR 67 67 35.6 6.3 13.2 ... 1.8 2.9 4.7 0.094 -0.3 0.7 0.4 1.5 0 0
20 17 Deandre Ayton C 24 PHO 67 67 30.4 7.8 13.2 ... 3.3 3.0 6.2 0.147 0.9 0.0 0.9 1.5 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
618 485 Fred VanVleet PG 28 TOR 69 69 36.7 6.3 16.1 ... 3.7 2.8 6.5 0.123 2.0 0.5 2.5 2.9 0 0
622 489 Nikola Vučević C 32 CHI 82 82 33.5 7.3 14.0 ... 3.9 4.4 8.3 0.145 1.9 0.7 2.7 3.2 0 0
624 491 Franz Wagner SF 21 ORL 80 80 32.6 6.8 14.0 ... 3.1 2.3 5.4 0.099 0.4 -0.5 -0.1 1.3 0 0
635 500 P.J. Washington PF 24 CHO 73 73 32.6 5.9 13.4 ... 0.4 2.2 2.6 0.052 -1.0 -0.3 -1.3 0.4 0 0
675 536 Trae Young PG 24 ATL 73 73 34.8 8.2 19.0 ... 5.3 1.4 6.7 0.126 5.3 -2.0 3.3 3.4 0 0

92 rows × 52 columns

Decision Tree Classifier

The scores of the Decision Tree Classifier are even better than those of the logistic model.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

clf_tree = DecisionTreeClassifier(max_leaf_nodes=3, random_state=2)
clf_tree.fit(X_train, y_train)

print("Decision Tree Model")
print("Training accuracy:", clf_tree.score(X_train, y_train))
print("Test accuracy:", clf_tree.score(X_test, y_test))
Decision Tree Model
Training accuracy: 0.953125
Test accuracy: 0.9642857142857143

I use the method Prof. Sha taught in class to view the decision tree and evaluate whether there is an overfitting problem.

The tree's conditions look reasonable and there is no obvious overfitting problem, because they agree with what we expect. First, the higher the VORP, the more likely a player is on the All-NBA Team. Second, a player whose VORP is moderately high but whose PER is high can still be on the All-NBA Team. Third, if both VORP and PER are low, we predict the player is not on the All-NBA Team.

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig = plt.figure()

# Convert class_names to strings if they are not already
class_names = [str(class_name) for class_name in clf_tree.classes_]

_ = plot_tree(clf_tree, 
              feature_names=clf_tree.feature_names_in_,
              class_names=class_names,  # Use the converted class_names
              filled=True)
[Figure: plot of the fitted decision tree, with splits on VORP and PER]
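
The same tree can also be read as plain text with export_text from sklearn.tree, which is handy when the image is not available. The sketch below fits a tree on synthetic PER/VORP values with a made-up labeling rule (VORP above 4), so the printed thresholds are not the ones from this project.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic features: column 0 plays the role of PER, column 1 of VORP.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(8, 32, 200), rng.uniform(-1, 9, 200)])
y = (X[:, 1] > 4).astype(int)  # made-up rule, for illustration only

tree = DecisionTreeClassifier(max_leaf_nodes=3, random_state=2).fit(X, y)

# export_text prints one line per split or leaf, indented by depth.
tree_rules = export_text(tree, feature_names=["PER", "VORP"])
print(tree_rules)
```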

Random Forest Classifier

Let's use a Random Forest Classifier to check the good score from the Decision Tree.

The scores match those of the single tree.

from sklearn.ensemble import RandomForestClassifier
clf_forest = RandomForestClassifier(n_estimators=200, max_leaf_nodes=3,random_state=2)
clf_forest.fit(X_train, y_train)

print("Random Forest Model")
print("Training accuracy:", clf_forest.score(X_train, y_train))
print("Test accuracy:", clf_forest.score(X_test, y_test))
Random Forest Model
Training accuracy: 0.953125
Test accuracy: 0.9642857142857143
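
The test set here has only 28 players, so a single train/test split gives a noisy estimate. Stratified k-fold cross-validation averages over several splits. The sketch below shows the mechanics on synthetic data with a made-up labeling rule; the scores it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the 92 filtered players (columns: PER-like, VORP-like).
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(8, 32, 92), rng.uniform(-1, 9, 92)])
y = (X[:, 1] > 4.5).astype(int)  # made-up rule, for illustration only

# Five folds, each keeping a similar share of positives.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, max_leaf_nodes=3, random_state=2)
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean(), scores.std())
```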

Finally, let's visualize the decision boundary of our Random Forest.

rng = np.random.default_rng()
arr = rng.random(size = (5000,2))
df3 = pd.DataFrame(arr, columns=X1_top2_adv)
df3.head(5)
PER VORP
0 0.529989 0.916397
1 0.961482 0.378611
2 0.928898 0.584705
3 0.583184 0.104014
4 0.516506 0.038385
df3["PER"] *= 40
df3["VORP"] *= 10
df3['pred'] = clf_forest.predict(df3[X1_top2_adv])
df3
PER VORP pred
0 21.199574 9.163968 1
1 38.459287 3.786108 0
2 37.155906 5.847048 1
3 23.327379 1.040143 0
4 20.660227 0.383845 0
... ... ... ...
4995 34.041293 5.940254 1
4996 37.745077 8.526217 1
4997 8.708129 1.684154 0
4998 27.385987 6.370831 1
4999 20.922994 0.348717 0

5000 rows × 3 columns

alt.Chart(df3).mark_circle().encode(
    x = "VORP",
    y = "PER",
    color = 'pred:N'
)

The boundary looks good, with little sign of overfitting.
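
Random points work, but an evenly spaced grid covers the plane with fewer rows and gives a cleaner picture of the boundary. The sketch below builds such a grid with np.meshgrid; the classifier clf is a small tree trained on synthetic data and only stands in for clf_forest.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Stand-in classifier trained on synthetic PER/VORP values (made-up labeling rule).
rng = np.random.default_rng(0)
train = pd.DataFrame({"PER": rng.uniform(8, 32, 200),
                      "VORP": rng.uniform(-1, 9, 200)})
labels = (train["VORP"] > 4).astype(int)
clf = DecisionTreeClassifier(max_leaf_nodes=3, random_state=2).fit(train, labels)

# 50 x 50 evenly spaced grid over the same ranges as the random points above.
per_grid, vorp_grid = np.meshgrid(np.linspace(0, 40, 50), np.linspace(0, 10, 50))
grid = pd.DataFrame({"PER": per_grid.ravel(), "VORP": vorp_grid.ravel()})
grid["pred"] = clf.predict(grid[["PER", "VORP"]])
print(grid.head())
```

The resulting grid frame can be passed to alt.Chart in the same way as df3.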

Summary

Some advanced stats are powerful indicators of how good a player is, and the All-NBA Team is made up of the best players. That is why models using only VORP and PER predicted All-NBA selection well in this project.

References

  • What is the source of your dataset(s)?

All-NBA Player of 22-23 season from NBA.com https://www.nba.com/news/history-all-nba-teams

22-23 NBA Player Stats Per Game from basketball-reference.com https://www.basketball-reference.com/leagues/NBA_2023_per_game.html

Advanced Stats from basketball-reference.com https://www.basketball-reference.com/leagues/NBA_2023_advanced.html

  • List any other references that you found helpful.

All-NBA Team Wiki (https://en.wikipedia.org/wiki/All-NBA_Team)

How to Create a Correlation Matrix using Pandas Dataframe (https://datatofish.com/correlation-matrix-pandas/)

How to manually export from Notepad to csv (https://www.wikihow.com/Create-a-CSV-File)

How to Create a Seaborn Correlation Heatmap in Python? (https://medium.com/@szabo.bibor/how-to-create-a-seaborn-correlation-heatmap-in-python-834c0686b88e)

VORP Formula, direct relation with BPM (https://www.basketball-reference.com/about/bpm2.html)

Pandas DataFrame sort, via ChatGPT (https://chat.openai.com/)