Comparing NBA 3 Point Shots For Two Given Seasons#

Author: Hieu Le Dang

Course Project, UC Irvine, Math 10, F23

Introduction#

I want to use a subset of NBA statistics recorded per 36 minutes to see how the recency of an NBA season relates to how many 3 point field goals are attempted and made. To understand Per 36 Minute statistics, consider a player who averages 9 points per game while playing 18 minutes per game: that player averages 18 points per 36 minutes on the court. Using statistics on 3 point field goals made and attempted, I want to see whether we can build a model that, given two seasons, can guess which of them a player's stat line comes from.
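The per 36 minute scaling described above can be written as a one-line calculation. This is only a small sketch of the arithmetic; the function name `per_36` is my own and is not part of the dataset or any library.

```python
def per_36(per_game_value: float, minutes_per_game: float) -> float:
    """Scale a per-game average to a per-36-minute rate."""
    if minutes_per_game <= 0:
        raise ValueError("minutes_per_game must be positive")
    return per_game_value * 36 / minutes_per_game


# The example from the text: 9 points per game in 18 minutes per game.
print(per_36(9, 18))  # 18.0
```

Scaling this way lets players with very different playing time be compared on the same footing, although rates for players with very few minutes can be noisy.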

Organizing the Data#

import seaborn as sns
import numpy as np
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("Per 36 Minutes.csv")
df
seas_id season player_id player birth_year pos age experience lg tm ... ft_percent orb_per_36_min drb_per_36_min trb_per_36_min ast_per_36_min stl_per_36_min blk_per_36_min tov_per_36_min pf_per_36_min pts_per_36_min
0 31136 2024 5025 A.J. Green NaN SG 24.0 2 NBA MIL ... 1.000 0.4 3.1 3.6 3.6 0.0 0.0 0.0 3.6 13.8
1 31137 2024 5027 AJ Griffin NaN SF 20.0 2 NBA ATL ... 1.000 0.7 2.9 3.6 1.1 0.4 0.0 1.4 1.8 10.7
2 31138 2024 4219 Aaron Gordon NaN PF 28.0 10 NBA DEN ... 0.520 2.8 4.8 7.6 4.1 1.2 0.9 1.9 2.1 13.9
3 31139 2024 4582 Aaron Holiday NaN PG 27.0 6 NBA HOU ... 0.857 0.3 3.3 3.6 3.9 1.1 0.1 0.8 3.6 11.9
4 31140 2024 4805 Aaron Nesmith NaN SF 24.0 4 NBA IND ... 0.654 1.5 3.5 5.0 1.5 1.7 0.8 0.9 5.0 16.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
31628 200 1947 157 Walt Miller NaN F 31.0 1 BAA PIT ... 0.500 NaN NaN NaN NaN NaN NaN NaN NaN NaN
31629 201 1947 158 Warren Fenley NaN F 24.0 1 BAA BOS ... 0.511 NaN NaN NaN NaN NaN NaN NaN NaN NaN
31630 202 1947 159 Wilbert Kautz NaN G-F 31.0 1 BAA CHS ... 0.534 NaN NaN NaN NaN NaN NaN NaN NaN NaN
31631 203 1947 160 Woody Grimshaw NaN G 27.0 1 BAA PRO ... 0.477 NaN NaN NaN NaN NaN NaN NaN NaN NaN
31632 204 1947 161 Wyndol Gray NaN G-F 24.0 1 BAA BOS ... 0.581 NaN NaN NaN NaN NaN NaN NaN NaN NaN

31633 rows × 34 columns

Here I take a portion of the dataframe covering two seasons from two different eras of basketball: 2008, before Stephen Curry was drafted, and 2023, which concluded not long ago. I also keep only players listed at one of the five standard positions (PG, SG, SF, PF, C) and encode the position as a number.

df2 = df[((df['season'] == 2008) | (df['season'] == 2023)) & ((df['pos'] == 'PG') | (df['pos'] == 'SG') | (df['pos'] == 'SF') | (df['pos'] == 'PF') | (df['pos'] == 'C'))]
df2 = df2.loc[:, ['season', 'player', 'pos', 'x3p_per_36_min', 'x3pa_per_36_min', 'fg_per_36_min', 'fga_per_36_min']]
df2 = df2.fillna(0)
df2['pos_number'] = df['pos'].map({'PG': 1, 'SG': 2, 'SF': 3, 'PF': 4, 'C': 5})
df2
season player pos x3p_per_36_min x3pa_per_36_min fg_per_36_min fga_per_36_min pos_number
498 2023 A.J. Green SG 4.6 11.0 5.5 13.0 2.0
499 2023 A.J. Lawson SG 3.3 8.3 7.3 14.7 2.0
500 2023 A.J. Lawson SG 0.0 0.0 18.0 18.0 2.0
501 2023 A.J. Lawson SG 3.4 8.5 7.1 14.6 2.0
502 2023 AJ Griffin SF 2.6 6.7 6.4 13.7 3.0
... ... ... ... ... ... ... ... ...
10650 2008 Yao Ming C 0.0 0.0 7.6 15.0 5.0
10651 2008 Yi Jianlian PF 0.1 0.5 4.8 11.4 4.0
10652 2008 Zach Randolph C 0.4 1.3 7.8 17.0 5.0
10653 2008 Zaza Pachulia C 0.0 0.1 4.1 9.3 5.0
10654 2008 Zydrunas Ilgauskas C 0.0 0.0 6.8 14.2 5.0

1255 rows × 8 columns

For each of the seasons’ datasets, I want to remove duplicate player names to avoid redundancies in the data pertaining to players who played for multiple teams in a single season.

df3 = df2[df2['season'] == 2008].drop_duplicates(subset='player')
df4 = df2[df2['season'] == 2023].drop_duplicates(subset='player')
df5 = pd.concat([df3, df4], axis=0, ignore_index=True)
df5
season player pos x3p_per_36_min x3pa_per_36_min fg_per_36_min fga_per_36_min pos_number
0 2008 Aaron Brooks PG 2.1 6.5 5.5 13.3 1.0
1 2008 Aaron Gray C 0.0 0.1 6.0 12.0 5.0
2 2008 Aaron Williams C 0.0 0.0 3.3 6.7 5.0
3 2008 Acie Law PG 0.3 1.4 4.0 9.9 1.0
4 2008 Adonal Foyle C 0.0 0.0 3.3 7.1 5.0
... ... ... ... ... ... ... ... ...
983 2023 Zach Collins C 1.4 3.7 7.1 13.7 5.0
984 2023 Zach LaVine SG 2.7 7.1 8.8 18.1 2.0
985 2023 Zeke Nnaji PF 0.8 3.2 5.4 9.7 4.0
986 2023 Ziaire Williams SF 1.6 6.2 5.4 12.6 3.0
987 2023 Zion Williamson PF 0.3 0.7 10.7 17.7 4.0

988 rows × 8 columns
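One thing worth checking with this approach is which row survives. By default `drop_duplicates` keeps the first row it sees for each player, so the result depends on the row order in the CSV. The sketch below uses a small made-up frame (the player names and values are placeholders, not real data) to show the behavior and to count how many rows each player would lose.

```python
import pandas as pd

# Toy rows (made-up values): one player appears three times in the same
# season, as happens when a player is listed once per team.
toy = pd.DataFrame({
    "season": [2023, 2023, 2023, 2023],
    "player": ["Player A", "Player A", "Player A", "Player B"],
    "x3pa_per_36_min": [8.3, 0.0, 8.5, 6.7],
})

# keep="first" is the default: the first row per player survives.
deduped = toy.drop_duplicates(subset="player", keep="first")
print(deduped)

# Number of extra rows per player that the call above discards.
extra_rows = toy.groupby("player").size() - 1
print(extra_rows[extra_rows > 0])
```

If the first row for a multi-team player is not the one you want, sorting the frame before calling `drop_duplicates` is one way to control which row is kept.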

To see the difference in 3 point shooting between these two seasons, which are 15 years apart, I want to chart the data faceted by player position, since players at each position do different things on the court.

alt.Chart(df5).mark_circle().encode(
    x = 'x3pa_per_36_min',
    y = 'x3p_per_36_min',
    color = 'season:N',
    column = 'pos',
    tooltip = 'player'
).properties(
    title = "3 Pointers by Position"
)

With this chart faceted by position, we see that for most if not all positions, the points with higher 3 point field goals attempted and made per 36 minutes belong noticeably more often to the 2023 NBA season. This is especially evident in the C (Center) and PF (Power Forward) facets, where a vast majority of the points away from the origin correspond to the 2023 season. In general, the further to the right a point is (more attempts), the more likely it is to come from the 2023 season.
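To put numbers on what the scatter plot suggests, one option is to compare mean attempts per 36 minutes by position and season. The sketch below uses a tiny made-up frame with the same column names as `df5`; the values are placeholders for illustration only, and on the real data `toy` would be replaced by `df5`.

```python
import pandas as pd

# Made-up values, only to show the shape of the summary table.
toy = pd.DataFrame({
    "season": [2008, 2008, 2023, 2023, 2008, 2023],
    "pos": ["C", "C", "C", "C", "PG", "PG"],
    "x3pa_per_36_min": [0.0, 0.2, 3.0, 1.0, 4.0, 6.0],
})

# Rows are positions, columns are seasons, cells are mean attempts.
summary = (
    toy.groupby(["pos", "season"])["x3pa_per_36_min"]
    .mean()
    .unstack("season")
)
print(summary)
```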

cols = ['pos_number', 'x3p_per_36_min', 'x3pa_per_36_min', 'fg_per_36_min', 'fga_per_36_min']

Constructing Machine Learning Models#

I'm going to fit a few models to see which kind is best at predicting which of two given NBA seasons a player's statistics belong to, based on how many 3 pointers they make and attempt per 36 minutes (together with the other columns in cols: field goals made and attempted per 36 minutes, and position). I will try Logistic Regression, a K-Neighbors Classifier, and a Decision Tree Classifier.

X_train, X_test, y_train, y_test = train_test_split(df5[cols], df5["season"], test_size=0.5, random_state=42)
log = LogisticRegression()
log.fit(X_train, y_train)
LogisticRegression()
log.score(X_train, y_train)
0.7813765182186235
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=10)
knn.score(X_train, y_train)
0.7732793522267206
clf = DecisionTreeClassifier(max_depth=10, random_state=42)
clf.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=10, random_state=42)
clf.score(X_train, y_train)
0.937246963562753
clf.score(X_test, y_test)
0.7145748987854251

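On the training set, the Decision Tree Classifier is the most accurate of the three models (about 0.94, versus about 0.78 for Logistic Regression and 0.77 for K-Neighbors). Its accuracy on the testing set is about 0.71, noticeably lower than on the training set, which suggests that the depth-10 tree overfits to some degree. The other two models were only scored on the training set here, so this comparison is based on training accuracy alone.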
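A fairer comparison would score every model on both the training and the testing set. The sketch below shows one way to do that in a loop. It runs on synthetic data from `make_classification` as a stand-in for `df5[cols]` and `df5["season"]`, so the printed scores say nothing about the NBA data; `max_iter=1000` is my own choice to avoid convergence warnings and is not a setting used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for df5[cols] and df5["season"] (not the NBA data).
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=10),
    "tree": DecisionTreeClassifier(max_depth=10, random_state=42),
}

# Store (train accuracy, test accuracy) for each model.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.3f} test={scores[name][1]:.3f}")
```

Looking at the two numbers side by side makes it easier to tell a model that generalizes from one that mostly memorizes the training rows.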

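The sample table below shows some of the model's predictions next to the actual season. Note that df5 includes the rows the tree was trained on, so this table is an illustration rather than a measure of test accuracy.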

df5['pred'] = clf.predict(df5[cols])
df5.loc[:, ['season', 'player', 'pred']].sample(9)
season player pred
169 2008 Hilton Armstrong 2008
275 2008 Luke Walton 2008
902 2023 Saddiq Bey 2023
436 2008 Viktor Khryapa 2008
375 2008 Ronny Turiaf 2008
186 2008 James Posey 2023
804 2023 Marcus Morris 2023
213 2008 Joe Johnson 2008
219 2008 John Salmons 2008
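A nine-row sample only gives a rough impression. As a sketch of how the same two columns could be summarized over every row, the example below computes the fraction of matching predictions and a table of actual versus predicted seasons. The frame here is made up; on the real data the `season` and `pred` columns of `df5` would take its place.

```python
import pandas as pd

# Toy frame with the same two columns as the sample table (values made up).
toy = pd.DataFrame({
    "season": [2008, 2008, 2008, 2023, 2023],
    "pred":   [2008, 2023, 2008, 2023, 2023],
})

# Fraction of rows where the prediction matches the actual season.
accuracy = (toy["season"] == toy["pred"]).mean()

# Rows are actual seasons, columns are predicted seasons.
table = pd.crosstab(
    toy["season"], toy["pred"], rownames=["actual"], colnames=["predicted"]
)
print(accuracy)
print(table)
```

The off-diagonal cells show which season is mistaken for the other more often, which a single accuracy number hides.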

Here are some examples of players who played in both the 2008 and 2023 NBA seasons. For both players shown, the model predicted the correct season for each of their two stat lines.

df5[df5['player'] == 'Al Horford']
season player pos x3p_per_36_min x3pa_per_36_min fg_per_36_min fga_per_36_min pos_number pred
7 2008 Al Horford C 0.0 0.1 4.7 9.5 5.0 2008
458 2023 Al Horford C 2.7 6.1 4.3 9.0 5.0 2023
df5[df5['player'] == 'LeBron James']
season player pos x3p_per_36_min x3pa_per_36_min fg_per_36_min fga_per_36_min pos_number pred
262 2008 LeBron James SF 1.3 4.3 9.4 19.5 3.0 2008
782 2023 LeBron James PF 2.2 6.9 11.2 22.5 4.0 2023

More Examples Using Decision Trees#

Seeing that the decision tree model was the most accurate of the models on the initial pair of seasons, I now want to see how it performs on other pairs of seasons. For each comparison, I set up the data the same way as for the initial pair and then refit the same decision tree classifier (clf) on the new training set.
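Since the same setup steps are repeated for every pair of seasons, they could be collected into one function. The sketch below is a hypothetical helper, `build_comparison`, that mirrors those steps under a few assumptions of mine: it maps positions from the filtered frame itself (so `pos_number` comes out as integers), and it removes duplicates per season and player in one call. It is demonstrated on a made-up frame rather than the real CSV.

```python
import pandas as pd

POSITIONS = {"PG": 1, "SG": 2, "SF": 3, "PF": 4, "C": 5}
STAT_COLS = ["x3p_per_36_min", "x3pa_per_36_min", "fg_per_36_min", "fga_per_36_min"]


def build_comparison(df: pd.DataFrame, season_a: int, season_b: int) -> pd.DataFrame:
    """Hypothetical helper mirroring the filtering steps used in this notebook."""
    mask = df["season"].isin([season_a, season_b]) & df["pos"].isin(list(POSITIONS))
    out = df.loc[mask, ["season", "player", "pos"] + STAT_COLS].fillna(0)
    out["pos_number"] = out["pos"].map(POSITIONS)
    # Drop repeated player names within each season, keeping the first row.
    out = out.drop_duplicates(subset=["season", "player"])
    return out.sort_values("season", kind="stable").reset_index(drop=True)


# Made-up rows: a duplicated player, a missing value, a hybrid position,
# and a season outside the requested pair.
toy = pd.DataFrame({
    "season": [2023, 2023, 2008, 2008, 1998],
    "player": ["Player A", "Player A", "Player B", "Player C", "Player D"],
    "pos": ["SG", "SG", "C", "G-F", "PG"],
    "x3p_per_36_min": [3.3, 0.0, None, 1.0, 0.5],
    "x3pa_per_36_min": [8.3, 0.0, None, 3.0, 1.5],
    "fg_per_36_min": [7.3, 18.0, 6.8, 5.0, 4.0],
    "fga_per_36_min": [14.7, 18.0, 14.2, 11.0, 9.0],
})
result = build_comparison(toy, 2008, 2023)
print(result)
```

With a helper like this, each new comparison becomes a single call such as `build_comparison(df, 1998, 2023)`, which also reduces the chance of editing one season number and forgetting another.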

df2 = df[((df['season'] == 1998) | (df['season'] == 2023)) & ((df['pos'] == 'PG') | (df['pos'] == 'SG') | (df['pos'] == 'SF') | (df['pos'] == 'PF') | (df['pos'] == 'C'))]
df2 = df2.loc[:, ['season', 'player', 'pos', 'x3p_per_36_min', 'x3pa_per_36_min', 'fg_per_36_min', 'fga_per_36_min']]
df2 = df2.fillna(0)
df2['pos_number'] = df['pos'].map({'PG': 1, 'SG': 2, 'SF': 3, 'PF': 4, 'C': 5})
df3 = df2[df2['season'] == 1998].drop_duplicates(subset='player')
df4 = df2[df2['season'] == 2023].drop_duplicates(subset='player')
df6 = pd.concat([df3, df4], axis=0, ignore_index=True)
df6
season player pos x3p_per_36_min x3pa_per_36_min fg_per_36_min fga_per_36_min pos_number
0 1998 A.C. Green PF 0.0 0.1 3.3 7.3 4.0
1 1998 Aaron McKie SF 0.2 1.3 3.3 7.9 3.0
2 1998 Aaron Williams PF 0.0 0.0 5.5 10.5 4.0
3 1998 Adam Keefe C 0.0 0.0 4.0 7.5 5.0
4 1998 Adonal Foyle C 0.0 0.1 3.8 9.3 5.0
... ... ... ... ... ... ... ... ...
972 2023 Zach Collins C 1.4 3.7 7.1 13.7 5.0
973 2023 Zach LaVine SG 2.7 7.1 8.8 18.1 2.0
974 2023 Zeke Nnaji PF 0.8 3.2 5.4 9.7 4.0
975 2023 Ziaire Williams SF 1.6 6.2 5.4 12.6 3.0
976 2023 Zion Williamson PF 0.3 0.7 10.7 17.7 4.0

977 rows × 8 columns

X_train, X_test, y_train, y_test = train_test_split(df6[cols], df6["season"], test_size=0.5, random_state=42)
clf.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=10, random_state=42)
clf.score(X_train, y_train)
0.9774590163934426
clf.score(X_test, y_test)
0.7566462167689162
df2 = df[((df['season'] == 1984) | (df['season'] == 2008)) & ((df['pos'] == 'PG') | (df['pos'] == 'SG') | (df['pos'] == 'SF') | (df['pos'] == 'PF') | (df['pos'] == 'C'))]
df2 = df2.loc[:, ['season', 'player', 'pos', 'x3p_per_36_min', 'x3pa_per_36_min', 'fg_per_36_min', 'fga_per_36_min']]
df2 = df2.fillna(0)
df2['pos_number'] = df['pos'].map({'PG': 1, 'SG': 2, 'SF': 3, 'PF': 4, 'C': 5})
df3 = df2[df2['season'] == 1984].drop_duplicates(subset='player')
df4 = df2[df2['season'] == 2008].drop_duplicates(subset='player')
df7 = pd.concat([df3, df4], axis=0, ignore_index=True)
df7
season player pos x3p_per_36_min x3pa_per_36_min fg_per_36_min fga_per_36_min pos_number
0 1984 Adrian Dantley SF 0.0 0.0 9.7 17.3 3.0
1 1984 Al Wood SG 0.0 0.3 7.5 15.2 2.0
2 1984 Albert King SF 0.1 0.4 8.0 16.2 3.0
3 1984 Alex English SF 0.0 0.1 11.4 21.5 3.0
4 1984 Allen Leavell PG 0.2 1.3 6.3 13.1 1.0
... ... ... ... ... ... ... ... ...
754 2008 Yao Ming C 0.0 0.0 7.6 15.0 5.0
755 2008 Yi Jianlian PF 0.1 0.5 4.8 11.4 4.0
756 2008 Zach Randolph C 0.4 1.3 7.8 17.0 5.0
757 2008 Zaza Pachulia C 0.0 0.1 4.1 9.3 5.0
758 2008 Zydrunas Ilgauskas C 0.0 0.0 6.8 14.2 5.0

759 rows × 8 columns

X_train, X_test, y_train, y_test = train_test_split(df7[cols], df7["season"], test_size=0.5, random_state=42)
clf.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=10, random_state=42)
clf.score(X_train, y_train)
0.9340369393139841
clf.score(X_test, y_test)
0.7263157894736842
df2 = df[((df['season'] == 1998) | (df['season'] == 2008)) & ((df['pos'] == 'PG') | (df['pos'] == 'SG') | (df['pos'] == 'SF') | (df['pos'] == 'PF') | (df['pos'] == 'C'))]
df2 = df2.loc[:, ['season', 'player', 'pos', 'x3p_per_36_min', 'x3pa_per_36_min', 'fg_per_36_min', 'fga_per_36_min']]
df2 = df2.fillna(0)
df2['pos_number'] = df['pos'].map({'PG': 1, 'SG': 2, 'SF': 3, 'PF': 4, 'C': 5})
df3 = df2[df2['season'] == 1998].drop_duplicates(subset='player')
df4 = df2[df2['season'] == 2008].drop_duplicates(subset='player')
df8 = pd.concat([df3, df4], axis=0, ignore_index=True)
df8
season player pos x3p_per_36_min x3pa_per_36_min fg_per_36_min fga_per_36_min pos_number
0 1998 A.C. Green PF 0.0 0.1 3.3 7.3 4.0
1 1998 Aaron McKie SF 0.2 1.3 3.3 7.9 3.0
2 1998 Aaron Williams PF 0.0 0.0 5.5 10.5 4.0
3 1998 Adam Keefe C 0.0 0.0 4.0 7.5 5.0
4 1998 Adonal Foyle C 0.0 0.1 3.8 9.3 5.0
... ... ... ... ... ... ... ... ...
884 2008 Yao Ming C 0.0 0.0 7.6 15.0 5.0
885 2008 Yi Jianlian PF 0.1 0.5 4.8 11.4 4.0
886 2008 Zach Randolph C 0.4 1.3 7.8 17.0 5.0
887 2008 Zaza Pachulia C 0.0 0.1 4.1 9.3 5.0
888 2008 Zydrunas Ilgauskas C 0.0 0.0 6.8 14.2 5.0

889 rows × 8 columns

X_train, X_test, y_train, y_test = train_test_split(df8[cols], df8["season"], test_size=0.5, random_state=42)
clf.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=10, random_state=42)
clf.score(X_train, y_train)
0.9099099099099099
clf.score(X_test, y_test)
0.5910112359550562

When trying to predict the NBA season between 1998 and 2008 based on 3 pointers per 36 minutes, the test score (about 0.59) is noticeably worse than for the other pairs of seasons. For the other comparisons, the training and testing accuracies were similar to those of the original 2008 versus 2023 model.
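In every comparison the training score is well above the testing score, which hints that a depth of 10 may be more than the data supports. One way to explore this, which is not something done in this notebook, is to compare a few depths with cross-validation. The sketch below does so on synthetic data standing in for the real features, so the printed numbers are illustrative only; the list of depths is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a two-season feature table (not the NBA data).
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, random_state=42
)

# Mean 5-fold cross-validated accuracy for a few tree depths.
cv_means = {}
for depth in (2, 4, 6, 10):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    cv_means[depth] = cross_val_score(tree, X, y, cv=5).mean()
    print(depth, round(cv_means[depth], 3))
```

If a shallower tree scores about as well under cross-validation, it is usually the safer choice, since it is less likely to be fitting noise in the training rows.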

Now I want to visualize the differences of the amount of 3 point field goal makes and attempts per 36 minutes between all the different seasons we compared to one another.

c1 = alt.Chart(df5[df5['pos'] == 'PG']).mark_bar().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
c2 = alt.Chart(df6[df6['pos'] == 'PG']).mark_line().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
c3 = alt.Chart(df7[df7['pos'] == 'PG']).mark_line().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
).properties(
    title = 'PG 3P Made'
)
c4 = c1 + c2 + c3

d1 = alt.Chart(df5[df5['pos'] == 'SG']).mark_bar().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
d2 = alt.Chart(df6[df6['pos'] == 'SG']).mark_line().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
d3 = alt.Chart(df7[df7['pos'] == 'SG']).mark_line().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
).properties(
    title = 'SG 3P Made'
)
d4 = d1 + d2 + d3

e1 = alt.Chart(df5[df5['pos'] == 'SF']).mark_bar().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
e2 = alt.Chart(df6[df6['pos'] == 'SF']).mark_line().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
e3 = alt.Chart(df7[df7['pos'] == 'SF']).mark_line().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
).properties(
    title = 'SF 3P Made'
)
e4 = e1 + e2 + e3

f1 = alt.Chart(df5[df5['pos'] == 'PF']).mark_bar().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
f2 = alt.Chart(df6[df6['pos'] == 'PF']).mark_line().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
f3 = alt.Chart(df7[df7['pos'] == 'PF']).mark_line().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
).properties(
    title = 'PF 3P Made'
)
f4 = f1 + f2 + f3

g1 = alt.Chart(df5[df5['pos'] == 'C']).mark_bar().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
g2 = alt.Chart(df6[df6['pos'] == 'C']).mark_line().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
g3 = alt.Chart(df7[df7['pos'] == 'C']).mark_line().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
).properties(
    title = 'C 3P Made'
)
g4 = g1 + g2 + g3

total = c4 | d4 | e4 | f4 | g4
total

With 3 point makes, there seems to be a clear trend: the number of 3 pointers made per 36 minutes has increased over the years that the 3 point shot has existed in the NBA.

c1 = alt.Chart(df5[df5['pos'] == 'PG']).mark_bar().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
c2 = alt.Chart(df6[df6['pos'] == 'PG']).mark_line().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
c3 = alt.Chart(df7[df7['pos'] == 'PG']).mark_line().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
).properties(
    title = 'PG 3PA'
)
c4 = c1 + c2 + c3

d1 = alt.Chart(df5[df5['pos'] == 'SG']).mark_bar().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
d2 = alt.Chart(df6[df6['pos'] == 'SG']).mark_line().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
d3 = alt.Chart(df7[df7['pos'] == 'SG']).mark_line().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
).properties(
    title = 'SG 3PA'
)
d4 = d1 + d2 + d3

e1 = alt.Chart(df5[df5['pos'] == 'SF']).mark_bar().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
e2 = alt.Chart(df6[df6['pos'] == 'SF']).mark_line().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
e3 = alt.Chart(df7[df7['pos'] == 'SF']).mark_line().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
).properties(
    title = 'SF 3PA'
)
e4 = e1 + e2 + e3

f1 = alt.Chart(df5[df5['pos'] == 'PF']).mark_bar().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
f2 = alt.Chart(df6[df6['pos'] == 'PF']).mark_line().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
f3 = alt.Chart(df7[df7['pos'] == 'PF']).mark_line().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
).properties(
    title = 'PF 3PA'
)
f4 = f1 + f2 + f3

g1 = alt.Chart(df5[df5['pos'] == 'C']).mark_bar().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
g2 = alt.Chart(df6[df6['pos'] == 'C']).mark_line().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
g3 = alt.Chart(df7[df7['pos'] == 'C']).mark_line().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
).properties(
    title = 'C 3PA'
)
g4 = g1 + g2 + g3

total = c4 | d4 | e4 | f4 | g4
total

From this chart, it seems that shooting guards (SG) attempted noticeably more 3 point field goals per 36 minutes in 1998 than in 2008, which could help explain why the decision tree scored noticeably lower on the testing set for the 1998 versus 2008 comparison. I also notice that small forwards (SF) attempted a similar number of 3 point field goals per 36 minutes in 1998, 2008, and 2023.

Summary#

In this project I took pairs of NBA seasons several years apart and used Per 36 Minutes statistics to build a model that predicts which of the two seasons a player's stat line comes from, based mainly on 3 point makes and attempts per 36 minutes. Of the three models I tried, the Decision Tree Classifier had the highest training accuracy. In most comparisons it scored about 90% or higher on the training set and about 70% to 75% on the testing set; the 1998 versus 2008 comparison was the exception, at about 59% on the testing set.

I also visualized 3 point makes and attempts across the seasons to see how the role of the 3 point shot in the NBA has changed over time. For 3 point makes, there is a clear trend toward more 3 pointers being made in later seasons, while for 3 point attempts the results vary by position (for example, SG and SF).

References#

  • What is the source of your dataset(s)?

https://www.kaggle.com/datasets/sumitrodatta/nba-aba-baa-stats?select=Per+36+Minutes.csv

  • List any other references that you found helpful.

Sources on the history of the 3 point line in the NBA and other helpful season statistics.

https://www.teamrankings.com/nba/stat/three-pointers-attempted-per-game?date=2023-12-13

https://www.basketball-reference.com/leagues/