Comparing NBA 3 Point Shots For Two Given Seasons
Author: Hieu Le Dang
Course Project, UC Irvine, Math 10, F23
Introduction
I use per-36-minute NBA statistics to examine how the recency of a season relates to the number of 3 point field goals attempted and made. A per-36-minute statistic rescales a player's production to 36 minutes of playing time: a player who averages 9 points in 18 minutes per game averages 18 points per 36 minutes on the court. Using 3 point makes and attempts, together with overall field goal makes and attempts and the player's position, I want to see whether a model can guess which of two NBA seasons a player's stat line comes from.
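As a small illustration of the per-36 arithmetic described above, here is a minimal sketch. The function name `per_36` is hypothetical and is not part of the dataset or of any library.

```python
def per_36(stat_per_game: float, minutes_per_game: float) -> float:
    """Scale a per-game average to a per-36-minute rate."""
    if minutes_per_game <= 0:
        raise ValueError("minutes_per_game must be positive")
    return stat_per_game * 36 / minutes_per_game


# The example from the text: 9 points in 18 minutes per game.
print(per_36(9, 18))  # 18.0
```

The dataset already stores its columns in this per-36 form, so the sketch only documents how those values relate to per-game averages.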
Organizing the Data
import seaborn as sns
import numpy as np
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
df = pd.read_csv("Per 36 Minutes.csv")
df
| | seas_id | season | player_id | player | birth_year | pos | age | experience | lg | tm | ... | ft_percent | orb_per_36_min | drb_per_36_min | trb_per_36_min | ast_per_36_min | stl_per_36_min | blk_per_36_min | tov_per_36_min | pf_per_36_min | pts_per_36_min |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 31136 | 2024 | 5025 | A.J. Green | NaN | SG | 24.0 | 2 | NBA | MIL | ... | 1.000 | 0.4 | 3.1 | 3.6 | 3.6 | 0.0 | 0.0 | 0.0 | 3.6 | 13.8 |
1 | 31137 | 2024 | 5027 | AJ Griffin | NaN | SF | 20.0 | 2 | NBA | ATL | ... | 1.000 | 0.7 | 2.9 | 3.6 | 1.1 | 0.4 | 0.0 | 1.4 | 1.8 | 10.7 |
2 | 31138 | 2024 | 4219 | Aaron Gordon | NaN | PF | 28.0 | 10 | NBA | DEN | ... | 0.520 | 2.8 | 4.8 | 7.6 | 4.1 | 1.2 | 0.9 | 1.9 | 2.1 | 13.9 |
3 | 31139 | 2024 | 4582 | Aaron Holiday | NaN | PG | 27.0 | 6 | NBA | HOU | ... | 0.857 | 0.3 | 3.3 | 3.6 | 3.9 | 1.1 | 0.1 | 0.8 | 3.6 | 11.9 |
4 | 31140 | 2024 | 4805 | Aaron Nesmith | NaN | SF | 24.0 | 4 | NBA | IND | ... | 0.654 | 1.5 | 3.5 | 5.0 | 1.5 | 1.7 | 0.8 | 0.9 | 5.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
31628 | 200 | 1947 | 157 | Walt Miller | NaN | F | 31.0 | 1 | BAA | PIT | ... | 0.500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
31629 | 201 | 1947 | 158 | Warren Fenley | NaN | F | 24.0 | 1 | BAA | BOS | ... | 0.511 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
31630 | 202 | 1947 | 159 | Wilbert Kautz | NaN | G-F | 31.0 | 1 | BAA | CHS | ... | 0.534 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
31631 | 203 | 1947 | 160 | Woody Grimshaw | NaN | G | 27.0 | 1 | BAA | PRO | ... | 0.477 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
31632 | 204 | 1947 | 161 | Wyndol Gray | NaN | G-F | 24.0 | 1 | BAA | BOS | ... | 0.581 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
31633 rows × 34 columns
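Here I take the portion of the dataframe covering two seasons from different eras of basketball: 2008, before Stephen Curry was drafted, and 2023, which concluded not long ago. I keep only players listed at one of the five standard positions (PG, SG, SF, PF, C), fill missing values with 0, and add a numeric code for each position.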
df2 = df[((df['season'] == 2008) | (df['season'] == 2023)) & ((df['pos'] == 'PG') | (df['pos'] == 'SG') | (df['pos'] == 'SF') | (df['pos'] == 'PF') | (df['pos'] == 'C'))]
df2 = df2.loc[:, ['season', 'player', 'pos', 'x3p_per_36_min', 'x3pa_per_36_min', 'fg_per_36_min', 'fga_per_36_min']]
df2 = df2.fillna(0)
df2['pos_number'] = df['pos'].map({'PG': 1, 'SG': 2, 'SF': 3, 'PF': 4, 'C': 5})
df2
| | season | player | pos | x3p_per_36_min | x3pa_per_36_min | fg_per_36_min | fga_per_36_min | pos_number |
---|---|---|---|---|---|---|---|---|
498 | 2023 | A.J. Green | SG | 4.6 | 11.0 | 5.5 | 13.0 | 2.0 |
499 | 2023 | A.J. Lawson | SG | 3.3 | 8.3 | 7.3 | 14.7 | 2.0 |
500 | 2023 | A.J. Lawson | SG | 0.0 | 0.0 | 18.0 | 18.0 | 2.0 |
501 | 2023 | A.J. Lawson | SG | 3.4 | 8.5 | 7.1 | 14.6 | 2.0 |
502 | 2023 | AJ Griffin | SF | 2.6 | 6.7 | 6.4 | 13.7 | 3.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
10650 | 2008 | Yao Ming | C | 0.0 | 0.0 | 7.6 | 15.0 | 5.0 |
10651 | 2008 | Yi Jianlian | PF | 0.1 | 0.5 | 4.8 | 11.4 | 4.0 |
10652 | 2008 | Zach Randolph | C | 0.4 | 1.3 | 7.8 | 17.0 | 5.0 |
10653 | 2008 | Zaza Pachulia | C | 0.0 | 0.1 | 4.1 | 9.3 | 5.0 |
10654 | 2008 | Zydrunas Ilgauskas | C | 0.0 | 0.0 | 6.8 | 14.2 | 5.0 |
1255 rows × 8 columns
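The season and position filter above can also be written with `isin`, which some readers may find easier to scan. This is only a sketch on a made-up toy table (the player labels and rows are invented); it maps positions on the filtered subset itself, so the codes come out as integers rather than the floats shown in the output above.

```python
import pandas as pd

# Toy stand-in for the real table; values are made up for illustration.
toy = pd.DataFrame({
    "season": [2008, 2008, 2023, 2023, 1998],
    "player": ["A", "B", "C", "D", "E"],
    "pos": ["PG", "SG-SF", "C", "PF", "SG"],
})

seasons = [2008, 2023]
positions = ["PG", "SG", "SF", "PF", "C"]

# Same logic as the chained == and | comparisons, written with isin.
mask = toy["season"].isin(seasons) & toy["pos"].isin(positions)
subset = toy[mask].copy()

# Map on the subset itself so no index alignment with the full table is needed.
pos_codes = {p: i for i, p in enumerate(positions, start=1)}
subset["pos_number"] = subset["pos"].map(pos_codes)
print(subset)
```

Rows with a combined position such as "SG-SF" or with a season outside the chosen pair are dropped, as in the original filter.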
For each season's data, I remove duplicate player names so that players who played for multiple teams in a single season are counted only once. Note that `drop_duplicates` keeps the first row it finds for each player.
df3 = df2[df2['season'] == 2008].drop_duplicates(subset='player')
df4 = df2[df2['season'] == 2023].drop_duplicates(subset='player')
df5 = pd.concat([df3, df4], axis=0, ignore_index=True)
df5
| | season | player | pos | x3p_per_36_min | x3pa_per_36_min | fg_per_36_min | fga_per_36_min | pos_number |
---|---|---|---|---|---|---|---|---|
0 | 2008 | Aaron Brooks | PG | 2.1 | 6.5 | 5.5 | 13.3 | 1.0 |
1 | 2008 | Aaron Gray | C | 0.0 | 0.1 | 6.0 | 12.0 | 5.0 |
2 | 2008 | Aaron Williams | C | 0.0 | 0.0 | 3.3 | 6.7 | 5.0 |
3 | 2008 | Acie Law | PG | 0.3 | 1.4 | 4.0 | 9.9 | 1.0 |
4 | 2008 | Adonal Foyle | C | 0.0 | 0.0 | 3.3 | 7.1 | 5.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
983 | 2023 | Zach Collins | C | 1.4 | 3.7 | 7.1 | 13.7 | 5.0 |
984 | 2023 | Zach LaVine | SG | 2.7 | 7.1 | 8.8 | 18.1 | 2.0 |
985 | 2023 | Zeke Nnaji | PF | 0.8 | 3.2 | 5.4 | 9.7 | 4.0 |
986 | 2023 | Ziaire Williams | SF | 1.6 | 6.2 | 5.4 | 12.6 | 3.0 |
987 | 2023 | Zion Williamson | PF | 0.3 | 0.7 | 10.7 | 17.7 | 4.0 |
988 rows × 8 columns
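To make the effect of the de-duplication step concrete, here is a minimal sketch on invented rows. The player names and team labels (including a "TOT" combined row listed first) are assumptions made for illustration, not values checked against the dataset.

```python
import pandas as pd

# Hypothetical player with one combined row and two per-team rows.
toy = pd.DataFrame({
    "season": [2023, 2023, 2023, 2023],
    "player": ["Player X", "Player X", "Player X", "Player Y"],
    "tm": ["TOT", "AAA", "BBB", "CCC"],
    "x3pa_per_36_min": [8.3, 0.0, 8.5, 6.7],
})

# keep="first" is the default: only the first row per player survives.
deduped = toy.drop_duplicates(subset="player")
print(deduped)
```

Which row survives depends entirely on row order, so it may be worth checking that the first row for a multi-team player is the one you want to keep.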
To visualize the difference in 3 point shooting between these two seasons, which are 15 years apart, I chart the data faceted by player position, since players at each position have different roles on the court.
alt.Chart(df5).mark_circle().encode(
x = 'x3pa_per_36_min',
y = 'x3p_per_36_min',
color = 'season:N',
column = 'pos',
tooltip = 'player'
).properties(
title = "3 Pointers by Position"
)
In this chart faceted by position, most if not all positions show noticeably more 2023 points at higher values of 3 point field goals attempted and made per 36 minutes. This is especially evident in the C (Center) and PF (Power Forward) facets, where the vast majority of points away from the origin belong to the 2023 season. In every facet, the further right the graph goes, the more the points belong to the 2023 season.
cols = ['pos_number', 'x3p_per_36_min', 'x3pa_per_36_min', 'fg_per_36_min', 'fga_per_36_min']
Constructing Machine Learning Models
I'm going to train several models to see which kind is best at predicting which of two given NBA seasons a player's statistics belong to, based mainly on how many 3 pointers they make and attempt per 36 minutes. I will try Logistic Regression, a K-Neighbors Classifier, and a Decision Tree Classifier.
X_train, X_test, y_train, y_test = train_test_split(df5[cols], df5["season"], test_size=0.5, random_state=42)
log = LogisticRegression()
log.fit(X_train, y_train)
LogisticRegression()
log.score(X_train, y_train)
0.7813765182186235
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=10)
knn.score(X_train, y_train)
0.7732793522267206
clf = DecisionTreeClassifier(max_depth=10, random_state=42)
clf.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=10, random_state=42)
clf.score(X_train, y_train)
0.937246963562753
clf.score(X_test, y_test)
0.7145748987854251
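The Decision Tree Classifier has the highest training accuracy of the three models (about 0.94, against 0.78 for Logistic Regression and 0.77 for K-Neighbors), so it is the one I use going forward. Its accuracy on the test set is about 0.71, well below its training accuracy, which suggests the tree is overfitting to some degree. Test scores were not computed for the other two models, so this comparison rests on training accuracy alone.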
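One way to compare the three models on equal footing is to score each of them on both the training and the test split. The sketch below does this on synthetic data from `make_classification`, which stands in for `df5[cols]` and `df5["season"]`; the numbers it prints say nothing about the NBA data, and `max_iter=1000` is an added assumption to help the logistic regression converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for df5[cols] and df5["season"].
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=10),
    "tree": DecisionTreeClassifier(max_depth=10, random_state=42),
}

# Store (train accuracy, test accuracy) for every model.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name:8s} train={scores[name][0]:.3f} test={scores[name][1]:.3f}")
```

Swapping the synthetic arrays for the real splits would give a test score for every model, which is the number that matters when picking between them.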
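The sample below shows the model's predictions next to the actual seasons. The predictions are made on all of df5, so some of these rows were part of the training set.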
df5['pred'] = clf.predict(df5[cols])
df5.loc[:, ['season', 'player', 'pred']].sample(9)
| | season | player | pred |
---|---|---|---|
169 | 2008 | Hilton Armstrong | 2008 |
275 | 2008 | Luke Walton | 2008 |
902 | 2023 | Saddiq Bey | 2023 |
436 | 2008 | Viktor Khryapa | 2008 |
375 | 2008 | Ronny Turiaf | 2008 |
186 | 2008 | James Posey | 2023 |
804 | 2023 | Marcus Morris | 2023 |
213 | 2008 | Joe Johnson | 2008 |
219 | 2008 | John Salmons | 2008 |
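The nine sampled rows above can be summarized as a small confusion table. This sketch retypes the `season` and `pred` values from that sample by hand, so it describes only those nine rows and not the model's overall accuracy.

```python
import pandas as pd

# season and pred copied from the nine sampled rows shown above.
sample = pd.DataFrame({
    "season": [2008, 2008, 2023, 2008, 2008, 2008, 2023, 2008, 2008],
    "pred":   [2008, 2008, 2023, 2008, 2008, 2023, 2023, 2008, 2008],
})

# Rows are actual seasons, columns are predicted seasons.
confusion = pd.crosstab(sample["season"], sample["pred"])
accuracy = (sample["season"] == sample["pred"]).mean()

print(confusion)
print(f"accuracy on this sample: {accuracy:.3f}")
```

Here the only miss is James Posey, a 2008 player predicted as 2023. Running the same `crosstab` on the test split alone would give a fairer picture than a sample that mixes training and test rows.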
Here are two players who appeared in both the 2008 and the 2023 NBA seasons. For both players, the model predicts the correct season for each stat line.
df5[df5['player'] == 'Al Horford']
| | season | player | pos | x3p_per_36_min | x3pa_per_36_min | fg_per_36_min | fga_per_36_min | pos_number | pred |
---|---|---|---|---|---|---|---|---|---|
7 | 2008 | Al Horford | C | 0.0 | 0.1 | 4.7 | 9.5 | 5.0 | 2008 |
458 | 2023 | Al Horford | C | 2.7 | 6.1 | 4.3 | 9.0 | 5.0 | 2023 |
df5[df5['player'] == 'LeBron James']
| | season | player | pos | x3p_per_36_min | x3pa_per_36_min | fg_per_36_min | fga_per_36_min | pos_number | pred |
---|---|---|---|---|---|---|---|---|---|
262 | 2008 | LeBron James | SF | 1.3 | 4.3 | 9.4 | 19.5 | 3.0 | 2008 |
782 | 2023 | LeBron James | PF | 2.2 | 6.9 | 11.2 | 22.5 | 4.0 | 2023 |
More Examples Using Decision Trees
Having settled on the decision tree for the initial pair of seasons, I now want to see how it performs on other pairs of seasons. For each pair, I prepare the data the same way as for the initial pair and then refit the same decision tree classifier on the new training split.
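Because the same preparation steps are repeated for every pair of seasons, they could be wrapped in a helper. The sketch below is one possible version; `build_pair`, `POSITIONS`, and `STAT_COLS` are hypothetical names, the toy rows are invented, and the row order it produces may differ from the cells that follow, which would change the exact train/test split.

```python
import pandas as pd

POSITIONS = ["PG", "SG", "SF", "PF", "C"]
STAT_COLS = ["x3p_per_36_min", "x3pa_per_36_min", "fg_per_36_min", "fga_per_36_min"]


def build_pair(df: pd.DataFrame, season_a: int, season_b: int) -> pd.DataFrame:
    """Return one row per player per season for the two requested seasons."""
    keep = df["season"].isin([season_a, season_b]) & df["pos"].isin(POSITIONS)
    out = df.loc[keep, ["season", "player", "pos"] + STAT_COLS].fillna(0)
    out["pos_number"] = out["pos"].map({p: i for i, p in enumerate(POSITIONS, start=1)})
    # Keep the first row for each player within each season.
    out = out.drop_duplicates(subset=["season", "player"])
    # Stable sort keeps the original player order inside each season.
    return out.sort_values("season", kind="stable").reset_index(drop=True)


# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "season": [2023, 2023, 2008, 2008, 1998],
    "player": ["P1", "P1", "P2", "P3", "P4"],
    "pos": ["SG", "SG", "C", "G-F", "PG"],
    "x3p_per_36_min": [3.3, 0.0, 0.0, 1.0, None],
    "x3pa_per_36_min": [8.3, 0.0, 0.1, 3.0, None],
    "fg_per_36_min": [7.3, 18.0, 6.0, 5.0, 4.0],
    "fga_per_36_min": [14.7, 18.0, 12.0, 11.0, 9.0],
})
pair = build_pair(toy, 2008, 2023)
print(pair)
```

With a helper like this, each comparison reduces to one call such as `build_pair(df, 1998, 2023)` followed by the split and fit.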
df2 = df[((df['season'] == 1998) | (df['season'] == 2023)) & ((df['pos'] == 'PG') | (df['pos'] == 'SG') | (df['pos'] == 'SF') | (df['pos'] == 'PF') | (df['pos'] == 'C'))]
df2 = df2.loc[:, ['season', 'player', 'pos', 'x3p_per_36_min', 'x3pa_per_36_min', 'fg_per_36_min', 'fga_per_36_min']]
df2 = df2.fillna(0)
df2['pos_number'] = df['pos'].map({'PG': 1, 'SG': 2, 'SF': 3, 'PF': 4, 'C': 5})
df3 = df2[df2['season'] == 1998].drop_duplicates(subset='player')
df4 = df2[df2['season'] == 2023].drop_duplicates(subset='player')
df6 = pd.concat([df3, df4], axis=0, ignore_index=True)
df6
| | season | player | pos | x3p_per_36_min | x3pa_per_36_min | fg_per_36_min | fga_per_36_min | pos_number |
---|---|---|---|---|---|---|---|---|
0 | 1998 | A.C. Green | PF | 0.0 | 0.1 | 3.3 | 7.3 | 4.0 |
1 | 1998 | Aaron McKie | SF | 0.2 | 1.3 | 3.3 | 7.9 | 3.0 |
2 | 1998 | Aaron Williams | PF | 0.0 | 0.0 | 5.5 | 10.5 | 4.0 |
3 | 1998 | Adam Keefe | C | 0.0 | 0.0 | 4.0 | 7.5 | 5.0 |
4 | 1998 | Adonal Foyle | C | 0.0 | 0.1 | 3.8 | 9.3 | 5.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
972 | 2023 | Zach Collins | C | 1.4 | 3.7 | 7.1 | 13.7 | 5.0 |
973 | 2023 | Zach LaVine | SG | 2.7 | 7.1 | 8.8 | 18.1 | 2.0 |
974 | 2023 | Zeke Nnaji | PF | 0.8 | 3.2 | 5.4 | 9.7 | 4.0 |
975 | 2023 | Ziaire Williams | SF | 1.6 | 6.2 | 5.4 | 12.6 | 3.0 |
976 | 2023 | Zion Williamson | PF | 0.3 | 0.7 | 10.7 | 17.7 | 4.0 |
977 rows × 8 columns
X_train, X_test, y_train, y_test = train_test_split(df6[cols], df6["season"], test_size=0.5, random_state=42)
clf.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=10, random_state=42)
clf.score(X_train, y_train)
0.9774590163934426
clf.score(X_test, y_test)
0.7566462167689162
df2 = df[((df['season'] == 1984) | (df['season'] == 2008)) & ((df['pos'] == 'PG') | (df['pos'] == 'SG') | (df['pos'] == 'SF') | (df['pos'] == 'PF') | (df['pos'] == 'C'))]
df2 = df2.loc[:, ['season', 'player', 'pos', 'x3p_per_36_min', 'x3pa_per_36_min', 'fg_per_36_min', 'fga_per_36_min']]
df2 = df2.fillna(0)
df2['pos_number'] = df['pos'].map({'PG': 1, 'SG': 2, 'SF': 3, 'PF': 4, 'C': 5})
df3 = df2[df2['season'] == 1984].drop_duplicates(subset='player')
df4 = df2[df2['season'] == 2008].drop_duplicates(subset='player')
df7 = pd.concat([df3, df4], axis=0, ignore_index=True)
df7
| | season | player | pos | x3p_per_36_min | x3pa_per_36_min | fg_per_36_min | fga_per_36_min | pos_number |
---|---|---|---|---|---|---|---|---|
0 | 1984 | Adrian Dantley | SF | 0.0 | 0.0 | 9.7 | 17.3 | 3.0 |
1 | 1984 | Al Wood | SG | 0.0 | 0.3 | 7.5 | 15.2 | 2.0 |
2 | 1984 | Albert King | SF | 0.1 | 0.4 | 8.0 | 16.2 | 3.0 |
3 | 1984 | Alex English | SF | 0.0 | 0.1 | 11.4 | 21.5 | 3.0 |
4 | 1984 | Allen Leavell | PG | 0.2 | 1.3 | 6.3 | 13.1 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
754 | 2008 | Yao Ming | C | 0.0 | 0.0 | 7.6 | 15.0 | 5.0 |
755 | 2008 | Yi Jianlian | PF | 0.1 | 0.5 | 4.8 | 11.4 | 4.0 |
756 | 2008 | Zach Randolph | C | 0.4 | 1.3 | 7.8 | 17.0 | 5.0 |
757 | 2008 | Zaza Pachulia | C | 0.0 | 0.1 | 4.1 | 9.3 | 5.0 |
758 | 2008 | Zydrunas Ilgauskas | C | 0.0 | 0.0 | 6.8 | 14.2 | 5.0 |
759 rows × 8 columns
X_train, X_test, y_train, y_test = train_test_split(df7[cols], df7["season"], test_size=0.5, random_state=42)
clf.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=10, random_state=42)
clf.score(X_train, y_train)
0.9340369393139841
clf.score(X_test, y_test)
0.7263157894736842
df2 = df[((df['season'] == 1998) | (df['season'] == 2008)) & ((df['pos'] == 'PG') | (df['pos'] == 'SG') | (df['pos'] == 'SF') | (df['pos'] == 'PF') | (df['pos'] == 'C'))]
df2 = df2.loc[:, ['season', 'player', 'pos', 'x3p_per_36_min', 'x3pa_per_36_min', 'fg_per_36_min', 'fga_per_36_min']]
df2 = df2.fillna(0)
df2['pos_number'] = df['pos'].map({'PG': 1, 'SG': 2, 'SF': 3, 'PF': 4, 'C': 5})
df3 = df2[df2['season'] == 1998].drop_duplicates(subset='player')
df4 = df2[df2['season'] == 2008].drop_duplicates(subset='player')
df8 = pd.concat([df3, df4], axis=0, ignore_index=True)
df8
| | season | player | pos | x3p_per_36_min | x3pa_per_36_min | fg_per_36_min | fga_per_36_min | pos_number |
---|---|---|---|---|---|---|---|---|
0 | 1998 | A.C. Green | PF | 0.0 | 0.1 | 3.3 | 7.3 | 4.0 |
1 | 1998 | Aaron McKie | SF | 0.2 | 1.3 | 3.3 | 7.9 | 3.0 |
2 | 1998 | Aaron Williams | PF | 0.0 | 0.0 | 5.5 | 10.5 | 4.0 |
3 | 1998 | Adam Keefe | C | 0.0 | 0.0 | 4.0 | 7.5 | 5.0 |
4 | 1998 | Adonal Foyle | C | 0.0 | 0.1 | 3.8 | 9.3 | 5.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
884 | 2008 | Yao Ming | C | 0.0 | 0.0 | 7.6 | 15.0 | 5.0 |
885 | 2008 | Yi Jianlian | PF | 0.1 | 0.5 | 4.8 | 11.4 | 4.0 |
886 | 2008 | Zach Randolph | C | 0.4 | 1.3 | 7.8 | 17.0 | 5.0 |
887 | 2008 | Zaza Pachulia | C | 0.0 | 0.1 | 4.1 | 9.3 | 5.0 |
888 | 2008 | Zydrunas Ilgauskas | C | 0.0 | 0.0 | 6.8 | 14.2 | 5.0 |
889 rows × 8 columns
X_train, X_test, y_train, y_test = train_test_split(df8[cols], df8["season"], test_size=0.5, random_state=42)
clf.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=10, random_state=42)
clf.score(X_train, y_train)
0.9099099099099099
clf.score(X_test, y_test)
0.5910112359550562
When predicting the season for 1998 vs. 2008, the test score (about 0.59) is noticeably worse than for the other pairs of seasons. For the other pairs, the scores are close to those of the initial 2008 vs. 2023 comparison: training accuracy between about 0.93 and 0.98, and test accuracy between about 0.71 and 0.76.
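The gap between training and test accuracy can be tabulated directly from the scores reported above. The sketch below retypes those scores, rounded to four decimals, and reports the gap for each pair of seasons.

```python
# (train accuracy, test accuracy) for each pair, rounded from the outputs above.
scores = {
    "2008 vs 2023": (0.9372, 0.7146),
    "1998 vs 2023": (0.9775, 0.7566),
    "1984 vs 2008": (0.9340, 0.7263),
    "1998 vs 2008": (0.9099, 0.5910),
}

# A large positive gap is a common sign of overfitting.
gaps = {pair: round(train - test, 4) for pair, (train, test) in scores.items()}
for pair, gap in gaps.items():
    print(f"{pair}: gap = {gap:.4f}")

worst = max(gaps, key=gaps.get)
print("largest gap:", worst)
```

Every pair shows a gap of more than 0.2, and the 1998 vs. 2008 pair has the largest, which matches the observation above.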
Now I want to visualize how 3 point field goal makes and attempts per 36 minutes differ across the seasons compared above.
c1 = alt.Chart(df5[df5['pos'] == 'PG']).mark_bar().encode(
x = 'season:N',
y = 'x3p_per_36_min',
color = 'season:N',
size = alt.value(20)
)
c2 = alt.Chart(df6[df6['pos'] == 'PG']).mark_line().encode(
x = 'season:N',
y = 'x3p_per_36_min',
color = 'season:N',
size = alt.value(20)
)
c3 = alt.Chart(df7[df7['pos'] == 'PG']).mark_line().encode(
x = 'season:N',
y = 'x3p_per_36_min',
color = 'season:N',
size = alt.value(20)
).properties(
title = 'PG 3P Made'
)
c4 = c1 + c2 + c3
d1 = alt.Chart(df5[df5['pos'] == 'SG']).mark_bar().encode(
x = 'season:N',
y = 'x3p_per_36_min',
color = 'season:N',
size = alt.value(20)
)
d2 = alt.Chart(df6[df6['pos'] == 'SG']).mark_line().encode(
x = 'season:N',
y = 'x3p_per_36_min',
color = 'season:N',
size = alt.value(20)
)
d3 = alt.Chart(df7[df7['pos'] == 'SG']).mark_line().encode(
x = 'season:N',
y = 'x3p_per_36_min',
color = 'season:N',
size = alt.value(20)
).properties(
title = 'SG 3P Made'
)
d4 = d1 + d2 + d3
e1 = alt.Chart(df5[df5['pos'] == 'SF']).mark_bar().encode(
x = 'season:N',
y = 'x3p_per_36_min',
color = 'season:N',
size = alt.value(20)
)
e2 = alt.Chart(df6[df6['pos'] == 'SF']).mark_line().encode(
x = 'season:N',
y = 'x3p_per_36_min',
color = 'season:N',
size = alt.value(20)
)
e3 = alt.Chart(df7[df7['pos'] == 'SF']).mark_line().encode(
x = 'season:N',
y = 'x3p_per_36_min',
color = 'season:N',
size = alt.value(20)
).properties(
title = 'SF 3P Made'
)
e4 = e1 + e2 + e3
f1 = alt.Chart(df5[df5['pos'] == 'PF']).mark_bar().encode(
x = 'season:N',
y = 'x3p_per_36_min',
color = 'season:N',
size = alt.value(20)
)
f2 = alt.Chart(df6[df6['pos'] == 'PF']).mark_line().encode(
x = 'season:N',
y = 'x3p_per_36_min',
color = 'season:N',
size = alt.value(20)
)
f3 = alt.Chart(df7[df7['pos'] == 'PF']).mark_line().encode(
x = 'season:N',
y = 'x3p_per_36_min',
color = 'season:N',
size = alt.value(20)
).properties(
title = 'PF 3P Made'
)
f4 = f1 + f2 + f3
g1 = alt.Chart(df5[df5['pos'] == 'C']).mark_bar().encode(
x = 'season:N',
y = 'x3p_per_36_min',
color = 'season:N',
size = alt.value(20)
)
g2 = alt.Chart(df6[df6['pos'] == 'C']).mark_line().encode(
x = 'season:N',
y = 'x3p_per_36_min',
color = 'season:N',
size = alt.value(20)
)
g3 = alt.Chart(df7[df7['pos'] == 'C']).mark_line().encode(
x = 'season:N',
y = 'x3p_per_36_min',
color = 'season:N',
size = alt.value(20)
).properties(
title = 'C 3P Made'
)
g4 = g1 + g2 + g3
total = c4 | d4 | e4 | f4 | g4
total
For 3 point makes, there is a clear trend: the number of 3 pointers made per 36 minutes has increased over the years that the 3 point shot has existed in the NBA.
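The charts above draw one mark per player row, so per-position averages are not shown directly. A numeric summary can complement them; the sketch below computes the mean 3 point makes per 36 minutes by position and season on invented toy values, assuming the same column names as `df5`.

```python
import pandas as pd

# Toy values, made up for illustration; the real input would be df5, df6, or df7.
toy = pd.DataFrame({
    "season": [2008, 2008, 2023, 2023, 2008, 2023],
    "pos": ["PG", "PG", "PG", "PG", "C", "C"],
    "x3p_per_36_min": [1.0, 2.0, 2.0, 3.0, 0.0, 1.0],
})

# One row per position, one column per season, mean makes in each cell.
mean_makes = (
    toy.groupby(["pos", "season"])["x3p_per_36_min"]
    .mean()
    .unstack("season")
)
print(mean_makes)
```

Replacing the column with `x3pa_per_36_min` would give the matching summary for attempts.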
c1 = alt.Chart(df5[df5['pos'] == 'PG']).mark_bar().encode(
x = 'season:N',
y = 'x3pa_per_36_min',
color = 'season:N',
size = alt.value(20)
)
c2 = alt.Chart(df6[df6['pos'] == 'PG']).mark_line().encode(
x = 'season:N',
y = 'x3pa_per_36_min',
color = 'season:N',
size = alt.value(20)
)
c3 = alt.Chart(df7[df7['pos'] == 'PG']).mark_line().encode(
x = 'season:N',
y = 'x3pa_per_36_min',
color = 'season:N',
size = alt.value(20)
).properties(
title = 'PG 3PA'
)
c4 = c1 + c2 + c3
d1 = alt.Chart(df5[df5['pos'] == 'SG']).mark_bar().encode(
x = 'season:N',
y = 'x3pa_per_36_min',
color = 'season:N',
size = alt.value(20)
)
d2 = alt.Chart(df6[df6['pos'] == 'SG']).mark_line().encode(
x = 'season:N',
y = 'x3pa_per_36_min',
color = 'season:N',
size = alt.value(20)
)
d3 = alt.Chart(df7[df7['pos'] == 'SG']).mark_line().encode(
x = 'season:N',
y = 'x3pa_per_36_min',
color = 'season:N',
size = alt.value(20)
).properties(
title = 'SG 3PA'
)
d4 = d1 + d2 + d3
e1 = alt.Chart(df5[df5['pos'] == 'SF']).mark_bar().encode(
x = 'season:N',
y = 'x3pa_per_36_min',
color = 'season:N',
size = alt.value(20)
)
e2 = alt.Chart(df6[df6['pos'] == 'SF']).mark_line().encode(
x = 'season:N',
y = 'x3pa_per_36_min',
color = 'season:N',
size = alt.value(20)
)
e3 = alt.Chart(df7[df7['pos'] == 'SF']).mark_line().encode(
x = 'season:N',
y = 'x3pa_per_36_min',
color = 'season:N',
size = alt.value(20)
).properties(
title = 'SF 3PA'
)
e4 = e1 + e2 + e3
f1 = alt.Chart(df5[df5['pos'] == 'PF']).mark_bar().encode(
x = 'season:N',
y = 'x3pa_per_36_min',
color = 'season:N',
size = alt.value(20)
)
f2 = alt.Chart(df6[df6['pos'] == 'PF']).mark_line().encode(
x = 'season:N',
y = 'x3pa_per_36_min',
color = 'season:N',
size = alt.value(20)
)
f3 = alt.Chart(df7[df7['pos'] == 'PF']).mark_line().encode(
x = 'season:N',
y = 'x3pa_per_36_min',
color = 'season:N',
size = alt.value(20)
).properties(
title = 'PF 3PA'
)
f4 = f1 + f2 + f3
g1 = alt.Chart(df5[df5['pos'] == 'C']).mark_bar().encode(
x = 'season:N',
y = 'x3pa_per_36_min',
color = 'season:N',
size = alt.value(20)
)
g2 = alt.Chart(df6[df6['pos'] == 'C']).mark_line().encode(
x = 'season:N',
y = 'x3pa_per_36_min',
color = 'season:N',
size = alt.value(20)
)
g3 = alt.Chart(df7[df7['pos'] == 'C']).mark_line().encode(
x = 'season:N',
y = 'x3pa_per_36_min',
color = 'season:N',
size = alt.value(20)
).properties(
title = 'C 3PA'
)
g4 = g1 + g2 + g3
total = c4 | d4 | e4 | f4 | g4
total
From this chart, it seems that shooting guards (SG) attempted noticeably more 3 point field goals per 36 minutes in 1998 than in 2008, which could help explain why the decision tree had a noticeably lower test score when comparing the 1998 season to the 2008 season. I also notice that in 1998, 2008, and 2023, small forwards (SF) attempted a similar number of 3 point field goals per 36 minutes.
Summary
In this project, I took pairs of distant NBA seasons and used per-36-minute stats to build a model that predicts which of the two seasons a player played in; the model I settled on was a Decision Tree Classifier. For most pairs, the Decision Tree Classifier had an accuracy of roughly 90% or higher on the training set and roughly 70% to 75% on the test set. The 1998 vs. 2008 pair was the exception, at about 59% on the test set.
I also visualized 3 point makes and attempts across these seasons to see how the role of the 3 point shot in the NBA has changed over time. For 3 point makes, there is a clear upward trend over time, while for 3 point attempts the results vary by position (notably SG and SF).
References
What is the source of your dataset(s)?
https://www.kaggle.com/datasets/sumitrodatta/nba-aba-baa-stats?select=Per+36+Minutes.csv
List any other references that you found helpful.
Sources on the history of the 3 point line in the NBA and other helpful season statistics.
https://www.teamrankings.com/nba/stat/three-pointers-attempted-per-game?date=2023-12-13