Comparing NBA 3 Point Shots For Two Given Seasons

Author: Hieu Le Dang

Course Project, UC Irvine, Math 10, F23

Introduction#

This project uses a subset of NBA statistics recorded per 36 minutes to examine how the recency of an NBA season relates to how many 3 point field goals are attempted and made. Per 36 Minute statistics scale a player’s per-game averages to 36 minutes of playing time: for example, a player who averages 9 points per game in 18 minutes per game averages 18 points per 36 minutes on the court. Using 3 point field goals made and attempted per 36 minutes, along with overall field goal numbers and position, I want to see whether a model can guess which of two NBA seasons a player’s statistics come from.
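The per-36 conversion described above can be written as a one-line scaling. This is a minimal sketch; the function name `per_36` is hypothetical and is not part of the dataset or the notebook.

```python
def per_36(stat_per_game: float, minutes_per_game: float) -> float:
    """Scale a per-game average to a per-36-minute rate."""
    if minutes_per_game <= 0:
        raise ValueError("minutes_per_game must be positive")
    return stat_per_game / minutes_per_game * 36


# The example from the text: 9 points per game in 18 minutes per game.
example_rate = per_36(9, 18)
print(example_rate)  # 18.0
```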

Organizing the Data#

import seaborn as sns
import numpy as np
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("Per 36 Minutes.csv")
df

	seas_id	season	player_id	player	birth_year	pos	age	experience	lg	tm	...	ft_percent	orb_per_36_min	drb_per_36_min	trb_per_36_min	ast_per_36_min	stl_per_36_min	blk_per_36_min	tov_per_36_min	pf_per_36_min	pts_per_36_min
0	31136	2024	5025	A.J. Green	NaN	SG	24.0	2	NBA	MIL	...	1.000	0.4	3.1	3.6	3.6	0.0	0.0	0.0	3.6	13.8
1	31137	2024	5027	AJ Griffin	NaN	SF	20.0	2	NBA	ATL	...	1.000	0.7	2.9	3.6	1.1	0.4	0.0	1.4	1.8	10.7
2	31138	2024	4219	Aaron Gordon	NaN	PF	28.0	10	NBA	DEN	...	0.520	2.8	4.8	7.6	4.1	1.2	0.9	1.9	2.1	13.9
3	31139	2024	4582	Aaron Holiday	NaN	PG	27.0	6	NBA	HOU	...	0.857	0.3	3.3	3.6	3.9	1.1	0.1	0.8	3.6	11.9
4	31140	2024	4805	Aaron Nesmith	NaN	SF	24.0	4	NBA	IND	...	0.654	1.5	3.5	5.0	1.5	1.7	0.8	0.9	5.0	16.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
31628	200	1947	157	Walt Miller	NaN	F	31.0	1	BAA	PIT	...	0.500	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31629	201	1947	158	Warren Fenley	NaN	F	24.0	1	BAA	BOS	...	0.511	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31630	202	1947	159	Wilbert Kautz	NaN	G-F	31.0	1	BAA	CHS	...	0.534	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31631	203	1947	160	Woody Grimshaw	NaN	G	27.0	1	BAA	PRO	...	0.477	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31632	204	1947	161	Wyndol Gray	NaN	G-F	24.0	1	BAA	BOS	...	0.581	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

31633 rows × 34 columns

Here I take the portion of the dataframe covering two seasons from two different eras of basketball: 2008, before Stephen Curry was drafted, and 2023, which concluded not long ago.

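# Keep only the two seasons of interest and the five standard positions
positions = ['PG', 'SG', 'SF', 'PF', 'C']
df2 = df[df['season'].isin([2008, 2023]) & df['pos'].isin(positions)]
# Keep the identifying columns plus the 3 point and field goal stats
df2 = df2.loc[:, ['season', 'player', 'pos', 'x3p_per_36_min', 'x3pa_per_36_min', 'fg_per_36_min', 'fga_per_36_min']]
# Treat missing stats as 0
df2 = df2.fillna(0)
# Encode position as a number; the assignment aligns on the row index, so only rows kept in df2 are filled
df2['pos_number'] = df['pos'].map({'PG': 1, 'SG': 2, 'SF': 3, 'PF': 4, 'C': 5})
df2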

	season	player	pos	x3p_per_36_min	x3pa_per_36_min	fg_per_36_min	fga_per_36_min	pos_number
498	2023	A.J. Green	SG	4.6	11.0	5.5	13.0	2.0
499	2023	A.J. Lawson	SG	3.3	8.3	7.3	14.7	2.0
500	2023	A.J. Lawson	SG	0.0	0.0	18.0	18.0	2.0
501	2023	A.J. Lawson	SG	3.4	8.5	7.1	14.6	2.0
502	2023	AJ Griffin	SF	2.6	6.7	6.4	13.7	3.0
...	...	...	...	...	...	...	...	...
10650	2008	Yao Ming	C	0.0	0.0	7.6	15.0	5.0
10651	2008	Yi Jianlian	PF	0.1	0.5	4.8	11.4	4.0
10652	2008	Zach Randolph	C	0.4	1.3	7.8	17.0	5.0
10653	2008	Zaza Pachulia	C	0.0	0.1	4.1	9.3	5.0
10654	2008	Zydrunas Ilgauskas	C	0.0	0.0	6.8	14.2	5.0

1255 rows × 8 columns

For each of the seasons’ datasets, I want to remove duplicate player names to avoid redundancies in the data pertaining to players who played for multiple teams in a single season.

df3 = df2[df2['season'] == 2008].drop_duplicates(subset='player')
df4 = df2[df2['season'] == 2023].drop_duplicates(subset='player')
df5 = pd.concat([df3, df4], axis=0, ignore_index=True)
df5

	season	player	pos	x3p_per_36_min	x3pa_per_36_min	fg_per_36_min	fga_per_36_min	pos_number
0	2008	Aaron Brooks	PG	2.1	6.5	5.5	13.3	1.0
1	2008	Aaron Gray	C	0.0	0.1	6.0	12.0	5.0
2	2008	Aaron Williams	C	0.0	0.0	3.3	6.7	5.0
3	2008	Acie Law	PG	0.3	1.4	4.0	9.9	1.0
4	2008	Adonal Foyle	C	0.0	0.0	3.3	7.1	5.0
...	...	...	...	...	...	...	...	...
983	2023	Zach Collins	C	1.4	3.7	7.1	13.7	5.0
984	2023	Zach LaVine	SG	2.7	7.1	8.8	18.1	2.0
985	2023	Zeke Nnaji	PF	0.8	3.2	5.4	9.7	4.0
986	2023	Ziaire Williams	SF	1.6	6.2	5.4	12.6	3.0
987	2023	Zion Williamson	PF	0.3	0.7	10.7	17.7	4.0

988 rows × 8 columns
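The deduplication step above keeps only the first row it encounters for each player. The sketch below illustrates that behavior on made-up rows; the player names, team codes, and the idea that a combined row (labeled `TOT` here) comes first are assumptions for illustration, so which row survives in the real data depends on the row order in the CSV.

```python
import pandas as pd

# Hypothetical rows: one player with a combined row plus two team rows.
toy = pd.DataFrame({
    "season": [2023, 2023, 2023, 2023],
    "player": ["Player A", "Player A", "Player A", "Player B"],
    "tm": ["TOT", "AAA", "BBB", "CCC"],
    "x3pa_per_36_min": [8.3, 0.0, 8.5, 6.7],
})

# keep="first" is the default, so only the first row per player survives.
deduped = toy.drop_duplicates(subset="player")
print(deduped[["player", "tm", "x3pa_per_36_min"]])
```

If a different row should be kept, sorting the frame before calling `drop_duplicates` is one way to control which row comes first.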

To see how 3 point shooting differs between these two seasons, which are 15 years apart, I chart the data faceted by player position, since players at each position have different roles on the court.

alt.Chart(df5).mark_circle().encode(
    x = 'x3pa_per_36_min',
    y = 'x3p_per_36_min',
    color = 'season:N',
    column = 'pos',
    tooltip = 'player'
).properties(
    title = "3 Pointers by Position"
)

In this chart faceted by position, most if not all positions show noticeably more points from the 2023 NBA season at higher values of 3 point field goals attempted and made per 36 minutes. This is especially evident in the C (Center) and PF (Power Forward) facets, where the vast majority of points away from the origin belong to the 2023 season. In every facet, the share of 2023 points also grows further to the right, where attempts per 36 minutes are higher.

cols = ['pos_number', 'x3p_per_36_min', 'x3pa_per_36_min', 'fg_per_36_min', 'fga_per_36_min']

Constructing Machine Learning Models#

I’m going to train several models to see which kind is best at predicting which of two given NBA seasons a player’s statistics belong to, based on how many 3 pointers they make and attempt per 36 minutes. I will try Logistic Regression, a K-Neighbors Classifier, and a Decision Tree Classifier.

X_train, X_test, y_train, y_test = train_test_split(df5[cols], df5["season"], test_size=0.5, random_state=42)
log = LogisticRegression()
log.fit(X_train, y_train)

LogisticRegression()

log.score(X_train, y_train)

0.7813765182186235

knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=10)

knn.score(X_train, y_train)

0.7732793522267206

clf = DecisionTreeClassifier(max_depth=10, random_state=42)
clf.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=10, random_state=42)

clf.score(X_train, y_train)

0.937246963562753

clf.score(X_test, y_test)

0.7145748987854251

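Of the three models, the Decision Tree Classifier has the highest accuracy on the training set (about 0.94, versus 0.78 for Logistic Regression and 0.77 for K-Neighbors). Its accuracy on the testing set is lower, about 0.71, so the tree overfits the training data to some degree at max_depth=10. Test scores were not computed for the other two models, so this comparison rests on training accuracy alone. I continue with the Decision Tree for the rest of the project.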
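The cells above report a test score only for the decision tree. One way to put all three models on the same footing is to score each on both splits. The sketch below is an illustration, not the notebook's method: the helper name `train_test_scores` is hypothetical, the data is a synthetic stand-in from `make_classification`, and `max_iter=1000` is my own choice to help the logistic regression converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def train_test_scores(models, X_train, X_test, y_train, y_test):
    """Fit each model and return {name: (train_accuracy, test_accuracy)}."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    return scores


# Synthetic stand-in data (an assumption); swap in df5[cols] and df5["season"].
X, y = make_classification(n_samples=400, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=10),
    "tree": DecisionTreeClassifier(max_depth=10, random_state=42),
}
scores = train_test_scores(models, X_train, X_test, y_train, y_test)
for name, (train_acc, test_acc) in scores.items():
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f}")
```

A large gap between the two numbers for a model is a sign of overfitting, and the test column is the fairer basis for choosing between models.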

The table below shows the model’s predictions for a random sample of players. Note that df5 contains both the training and the testing rows.

df5['pred'] = clf.predict(df5[cols])
df5.loc[:, ['season', 'player', 'pred']].sample(9)

	season	player	pred
169	2008	Hilton Armstrong	2008
275	2008	Luke Walton	2008
902	2023	Saddiq Bey	2023
436	2008	Viktor Khryapa	2008
375	2008	Ronny Turiaf	2008
186	2008	James Posey	2023
804	2023	Marcus Morris	2023
213	2008	Joe Johnson	2008
219	2008	John Salmons	2008

Here are two players who played in both the 2008 NBA season and the 2023 NBA season. In both cases, the model predicted the correct season for each of the player’s two rows.

df5[df5['player'] == 'Al Horford']

	season	player	pos	x3p_per_36_min	x3pa_per_36_min	fg_per_36_min	fga_per_36_min	pos_number	pred
7	2008	Al Horford	C	0.0	0.1	4.7	9.5	5.0	2008
458	2023	Al Horford	C	2.7	6.1	4.3	9.0	5.0	2023

df5[df5['player'] == 'LeBron James']

	season	player	pos	x3p_per_36_min	x3pa_per_36_min	fg_per_36_min	fga_per_36_min	pos_number	pred
262	2008	LeBron James	SF	1.3	4.3	9.4	19.5	3.0	2008
782	2023	LeBron James	PF	2.2	6.9	11.2	22.5	4.0	2023
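Rather than picking names by hand, the players who appear in both seasons can be found programmatically. The sketch below assumes a frame with `season`, `player`, and `pred` columns like `df5`; the rows are hypothetical. Matching on name alone would also merge two different players who share a name.

```python
import pandas as pd

# Hypothetical rows in the same shape as df5[['season', 'player', 'pred']].
toy = pd.DataFrame({
    "season": [2008, 2008, 2023, 2023],
    "player": ["Player A", "Player B", "Player A", "Player C"],
    "pred": [2008, 2023, 2023, 2023],
})

# Players whose rows cover two distinct seasons.
seasons_per_player = toy.groupby("player")["season"].nunique()
both = seasons_per_player[seasons_per_player == 2].index

# Prediction accuracy restricted to those players.
subset = toy[toy["player"].isin(both)]
accuracy = (subset["season"] == subset["pred"]).mean()
print(list(both), accuracy)
```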

More Examples Using Decision Trees#

Since the decision tree had the highest training accuracy of the models for the initial pair of seasons, I now want to see how it performs on other pairs of seasons. For each comparison, I set up the data the same way as for the initial pair and then refit the same decision tree classifier (max_depth=10, random_state=42) on the new training set.

df2 = df[((df['season'] == 1998) | (df['season'] == 2023)) & ((df['pos'] == 'PG') | (df['pos'] == 'SG') | (df['pos'] == 'SF') | (df['pos'] == 'PF') | (df['pos'] == 'C'))]
df2 = df2.loc[:, ['season', 'player', 'pos', 'x3p_per_36_min', 'x3pa_per_36_min', 'fg_per_36_min', 'fga_per_36_min']]
df2 = df2.fillna(0)
df2['pos_number'] = df['pos'].map({'PG': 1, 'SG': 2, 'SF': 3, 'PF': 4, 'C': 5})

df3 = df2[df2['season'] == 1998].drop_duplicates(subset='player')
df4 = df2[df2['season'] == 2023].drop_duplicates(subset='player')
df6 = pd.concat([df3, df4], axis=0, ignore_index=True)
df6

	season	player	pos	x3p_per_36_min	x3pa_per_36_min	fg_per_36_min	fga_per_36_min	pos_number
0	1998	A.C. Green	PF	0.0	0.1	3.3	7.3	4.0
1	1998	Aaron McKie	SF	0.2	1.3	3.3	7.9	3.0
2	1998	Aaron Williams	PF	0.0	0.0	5.5	10.5	4.0
3	1998	Adam Keefe	C	0.0	0.0	4.0	7.5	5.0
4	1998	Adonal Foyle	C	0.0	0.1	3.8	9.3	5.0
...	...	...	...	...	...	...	...	...
972	2023	Zach Collins	C	1.4	3.7	7.1	13.7	5.0
973	2023	Zach LaVine	SG	2.7	7.1	8.8	18.1	2.0
974	2023	Zeke Nnaji	PF	0.8	3.2	5.4	9.7	4.0
975	2023	Ziaire Williams	SF	1.6	6.2	5.4	12.6	3.0
976	2023	Zion Williamson	PF	0.3	0.7	10.7	17.7	4.0

977 rows × 8 columns

X_train, X_test, y_train, y_test = train_test_split(df6[cols], df6["season"], test_size=0.5, random_state=42)
clf.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=10, random_state=42)

clf.score(X_train, y_train)

0.9774590163934426

clf.score(X_test, y_test)

0.7566462167689162

df2 = df[((df['season'] == 1984) | (df['season'] == 2008)) & ((df['pos'] == 'PG') | (df['pos'] == 'SG') | (df['pos'] == 'SF') | (df['pos'] == 'PF') | (df['pos'] == 'C'))]
df2 = df2.loc[:, ['season', 'player', 'pos', 'x3p_per_36_min', 'x3pa_per_36_min', 'fg_per_36_min', 'fga_per_36_min']]
df2 = df2.fillna(0)
df2['pos_number'] = df['pos'].map({'PG': 1, 'SG': 2, 'SF': 3, 'PF': 4, 'C': 5})

df3 = df2[df2['season'] == 1984].drop_duplicates(subset='player')
df4 = df2[df2['season'] == 2008].drop_duplicates(subset='player')
df7 = pd.concat([df3, df4], axis=0, ignore_index=True)
df7

	season	player	pos	x3p_per_36_min	x3pa_per_36_min	fg_per_36_min	fga_per_36_min	pos_number
0	1984	Adrian Dantley	SF	0.0	0.0	9.7	17.3	3.0
1	1984	Al Wood	SG	0.0	0.3	7.5	15.2	2.0
2	1984	Albert King	SF	0.1	0.4	8.0	16.2	3.0
3	1984	Alex English	SF	0.0	0.1	11.4	21.5	3.0
4	1984	Allen Leavell	PG	0.2	1.3	6.3	13.1	1.0
...	...	...	...	...	...	...	...	...
754	2008	Yao Ming	C	0.0	0.0	7.6	15.0	5.0
755	2008	Yi Jianlian	PF	0.1	0.5	4.8	11.4	4.0
756	2008	Zach Randolph	C	0.4	1.3	7.8	17.0	5.0
757	2008	Zaza Pachulia	C	0.0	0.1	4.1	9.3	5.0
758	2008	Zydrunas Ilgauskas	C	0.0	0.0	6.8	14.2	5.0

759 rows × 8 columns

X_train, X_test, y_train, y_test = train_test_split(df7[cols], df7["season"], test_size=0.5, random_state=42)
clf.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=10, random_state=42)

clf.score(X_train, y_train)

0.9340369393139841

clf.score(X_test, y_test)

0.7263157894736842

df2 = df[((df['season'] == 1998) | (df['season'] == 2008)) & ((df['pos'] == 'PG') | (df['pos'] == 'SG') | (df['pos'] == 'SF') | (df['pos'] == 'PF') | (df['pos'] == 'C'))]
df2 = df2.loc[:, ['season', 'player', 'pos', 'x3p_per_36_min', 'x3pa_per_36_min', 'fg_per_36_min', 'fga_per_36_min']]
df2 = df2.fillna(0)
df2['pos_number'] = df['pos'].map({'PG': 1, 'SG': 2, 'SF': 3, 'PF': 4, 'C': 5})

df3 = df2[df2['season'] == 1998].drop_duplicates(subset='player')
df4 = df2[df2['season'] == 2008].drop_duplicates(subset='player')
df8 = pd.concat([df3, df4], axis=0, ignore_index=True)
df8

	season	player	pos	x3p_per_36_min	x3pa_per_36_min	fg_per_36_min	fga_per_36_min	pos_number
0	1998	A.C. Green	PF	0.0	0.1	3.3	7.3	4.0
1	1998	Aaron McKie	SF	0.2	1.3	3.3	7.9	3.0
2	1998	Aaron Williams	PF	0.0	0.0	5.5	10.5	4.0
3	1998	Adam Keefe	C	0.0	0.0	4.0	7.5	5.0
4	1998	Adonal Foyle	C	0.0	0.1	3.8	9.3	5.0
...	...	...	...	...	...	...	...	...
884	2008	Yao Ming	C	0.0	0.0	7.6	15.0	5.0
885	2008	Yi Jianlian	PF	0.1	0.5	4.8	11.4	4.0
886	2008	Zach Randolph	C	0.4	1.3	7.8	17.0	5.0
887	2008	Zaza Pachulia	C	0.0	0.1	4.1	9.3	5.0
888	2008	Zydrunas Ilgauskas	C	0.0	0.0	6.8	14.2	5.0

889 rows × 8 columns

X_train, X_test, y_train, y_test = train_test_split(df8[cols], df8["season"], test_size=0.5, random_state=42)
clf.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=10, random_state=42)

clf.score(X_train, y_train)

0.9099099099099099

clf.score(X_test, y_test)

0.5910112359550562

When predicting the season between 1998 and 2008, the test score (about 0.59) is noticeably worse than for the other pairs of seasons. For the other comparisons, the training and testing accuracies were similar to those from the initial 2008 and 2023 comparison: roughly 0.93 to 0.98 on the training set and 0.73 to 0.76 on the testing set.
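The setup for each pair of seasons repeats the same filtering, column selection, and deduplication. A sketch of how that could be wrapped in one helper follows; the name `two_season_frame` and the constants are hypothetical, and the toy frame only mimics the column layout of the real data. One difference from the cells above: mapping positions on the filtered frame yields integer `pos_number` values instead of floats.

```python
import pandas as pd

POSITIONS = {"PG": 1, "SG": 2, "SF": 3, "PF": 4, "C": 5}
STAT_COLS = ["x3p_per_36_min", "x3pa_per_36_min", "fg_per_36_min", "fga_per_36_min"]


def two_season_frame(df: pd.DataFrame, season_a: int, season_b: int) -> pd.DataFrame:
    """Return one row per player per season for the two requested seasons."""
    mask = df["season"].isin([season_a, season_b]) & df["pos"].isin(list(POSITIONS))
    out = df.loc[mask, ["season", "player", "pos"] + STAT_COLS].fillna(0)
    out["pos_number"] = out["pos"].map(POSITIONS)
    # Deduplicate within each season so a player appears once per season.
    out = out.drop_duplicates(subset=["season", "player"])
    return out.sort_values("season", kind="stable").reset_index(drop=True)


# Hypothetical rows, including a non-standard position and an unrelated season.
toy = pd.DataFrame({
    "season": [1998, 1998, 2008, 2008, 2008, 2010],
    "player": ["P1", "P2", "P1", "P3", "P3", "P4"],
    "pos": ["PG", "G-F", "PG", "C", "C", "SF"],
    "x3p_per_36_min": [1.0, 0.5, 1.5, None, 0.0, 2.0],
    "x3pa_per_36_min": [3.0, 1.5, 4.0, None, 0.1, 5.0],
    "fg_per_36_min": [5.0, 4.0, 6.0, 7.0, 7.1, 6.0],
    "fga_per_36_min": [11.0, 9.0, 12.0, 14.0, 14.2, 13.0],
})
frame = two_season_frame(toy, 1998, 2008)
print(frame)
```

With a helper like this, each comparison reduces to one call followed by the same `train_test_split` and `fit` steps.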

Now I want to visualize how 3 point field goal makes and attempts per 36 minutes differ across all the seasons we compared.

c1 = alt.Chart(df5[df5['pos'] == 'PG']).mark_bar().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
c2 = alt.Chart(df6[df6['pos'] == 'PG']).mark_line().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
c3 = alt.Chart(df7[df7['pos'] == 'PG']).mark_line().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
).properties(
    title = 'PG 3P Made'
)
c4 = c1 + c2 + c3

d1 = alt.Chart(df5[df5['pos'] == 'SG']).mark_bar().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
d2 = alt.Chart(df6[df6['pos'] == 'SG']).mark_line().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
d3 = alt.Chart(df7[df7['pos'] == 'SG']).mark_line().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
).properties(
    title = 'SG 3P Made'
)
d4 = d1 + d2 + d3

e1 = alt.Chart(df5[df5['pos'] == 'SF']).mark_bar().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
e2 = alt.Chart(df6[df6['pos'] == 'SF']).mark_line().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
e3 = alt.Chart(df7[df7['pos'] == 'SF']).mark_line().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
).properties(
    title = 'SF 3P Made'
)
e4 = e1 + e2 + e3

f1 = alt.Chart(df5[df5['pos'] == 'PF']).mark_bar().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
f2 = alt.Chart(df6[df6['pos'] == 'PF']).mark_line().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
f3 = alt.Chart(df7[df7['pos'] == 'PF']).mark_line().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
).properties(
    title = 'PF 3P Made'
)
f4 = f1 + f2 + f3

g1 = alt.Chart(df5[df5['pos'] == 'C']).mark_bar().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
g2 = alt.Chart(df6[df6['pos'] == 'C']).mark_line().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
g3 = alt.Chart(df7[df7['pos'] == 'C']).mark_line().encode(
    x = 'season:N',
    y = 'x3p_per_36_min',
    color = 'season:N',
    size = alt.value(20)
).properties(
    title = 'C 3P Made'
)
g4 = g1 + g2 + g3

total = c4 | d4 | e4 | f4 | g4
total

For 3 point makes, there is a clear trend: the number of 3 pointers made per 36 minutes has increased over the years that the 3 point shot has existed in the NBA.

c1 = alt.Chart(df5[df5['pos'] == 'PG']).mark_bar().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
c2 = alt.Chart(df6[df6['pos'] == 'PG']).mark_line().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
c3 = alt.Chart(df7[df7['pos'] == 'PG']).mark_line().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
).properties(
    title = 'PG 3PA'
)
c4 = c1 + c2 + c3

d1 = alt.Chart(df5[df5['pos'] == 'SG']).mark_bar().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
d2 = alt.Chart(df6[df6['pos'] == 'SG']).mark_line().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
d3 = alt.Chart(df7[df7['pos'] == 'SG']).mark_line().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
).properties(
    title = 'SG 3PA'
)
d4 = d1 + d2 + d3

e1 = alt.Chart(df5[df5['pos'] == 'SF']).mark_bar().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
e2 = alt.Chart(df6[df6['pos'] == 'SF']).mark_line().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
e3 = alt.Chart(df7[df7['pos'] == 'SF']).mark_line().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
).properties(
    title = 'SF 3PA'
)
e4 = e1 + e2 + e3

f1 = alt.Chart(df5[df5['pos'] == 'PF']).mark_bar().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
f2 = alt.Chart(df6[df6['pos'] == 'PF']).mark_line().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
f3 = alt.Chart(df7[df7['pos'] == 'PF']).mark_line().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
).properties(
    title = 'PF 3PA'
)
f4 = f1 + f2 + f3

g1 = alt.Chart(df5[df5['pos'] == 'C']).mark_bar().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
g2 = alt.Chart(df6[df6['pos'] == 'C']).mark_line().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
)
g3 = alt.Chart(df7[df7['pos'] == 'C']).mark_line().encode(
    x = 'season:N',
    y = 'x3pa_per_36_min',
    color = 'season:N',
    size = alt.value(20)
).properties(
    title = 'C 3PA'
)
g4 = g1 + g2 + g3

total = c4 | d4 | e4 | f4 | g4
total

From this chart, it seems that shooting guards (SG) attempted noticeably more 3 point field goals per 36 minutes in 1998 than in 2008, which could explain why the decision tree had a noticeably lower score on the testing set when comparing the 1998 season to the 2008 season. I also notice that across 1998, 2008, and 2023, small forwards (SF) attempted a similar number of 3 point field goals per 36 minutes.
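A per-position mean for each season is one way to check a reading like this numerically, since it does not depend on how many players each season contributes. The sketch below uses made-up numbers that do not reflect the real data; only the column names match the frames above.

```python
import pandas as pd

# Hypothetical rows in the same shape as the season frames above.
toy = pd.DataFrame({
    "season": [1998, 1998, 2008, 2008, 1998, 2008],
    "pos": ["SG", "SG", "SG", "SG", "SF", "SF"],
    "x3pa_per_36_min": [4.0, 2.0, 3.0, 5.0, 2.5, 2.5],
})

# Mean attempts per 36 minutes for each (position, season) pair,
# pivoted so the seasons sit side by side.
mean_3pa = (
    toy.groupby(["pos", "season"])["x3pa_per_36_min"]
    .mean()
    .unstack("season")
)
print(mean_3pa)
```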

Summary#

In this project, I took pairs of distant NBA seasons and used Per 36 Minute statistics to build a model that predicts which of the two seasons a player played in, based mainly on their 3 point makes and attempts per 36 minutes. The model I settled on was a Decision Tree Classifier. Across the comparisons, it had an accuracy of roughly 91% to 98% on the training set and, in most comparisons, about 71% to 76% on the testing set; the 1998 versus 2008 comparison was the exception at about 59%.

I also visualized 3 point makes and attempts across the seasons to see how the role of the 3 point shot in the NBA has changed over time. In terms of 3 point makes, the charts show a clear increase over time, while in terms of 3 point attempts, the results vary by position (such as SG and SF).

References#

What is the source of your dataset(s)?

https://www.kaggle.com/datasets/sumitrodatta/nba-aba-baa-stats?select=Per+36+Minutes.csv

List any other references that you found helpful.

Sources on the history of the 3 point line in the NBA and other helpful season statistics.

https://www.teamrankings.com/nba/stat/three-pointers-attempted-per-game?date=2023-12-13

https://www.basketball-reference.com/leagues/
