Predicting a Team’s March Madness finish#
Author: Yorick Herrmann, yherrman@uci.edu
Course Project, UC Irvine, Math 10, F23
Introduction#
In this project I will build a model that predicts how far a college basketball team advances in the March Madness tournament. I will also examine which statistical categories are most correlated with a team's success in the tournament. I will be using a dataset that contains various statistics for every NCAA Division 1 team from 2013 to 2023, which I will filter down to the teams that qualified for the tournament.
Loading the Data#
First, we need to load the relevant libraries/packages.
import pandas as pd
import numpy as np
import altair as alt
df = pd.read_csv('cbb.csv')
As mentioned in the introduction, this DataFrame contains data from every single NCAA D1 college basketball team from 2013 to the end of the 2022/2023 season. The data is from: https://www.kaggle.com/datasets/andrewsundberg/college-basketball-dataset
df
TEAM | CONF | G | W | ADJOE | ADJDE | BARTHAG | EFG_O | EFG_D | TOR | ... | FTRD | 2P_O | 2P_D | 3P_O | 3P_D | ADJ_T | WAB | POSTSEASON | SEED | YEAR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | North Carolina | ACC | 40 | 33 | 123.3 | 94.9 | 0.9531 | 52.6 | 48.1 | 15.4 | ... | 30.4 | 53.9 | 44.6 | 32.7 | 36.2 | 71.7 | 8.6 | 2ND | 1.0 | 2016 |
1 | Wisconsin | B10 | 40 | 36 | 129.1 | 93.6 | 0.9758 | 54.8 | 47.7 | 12.4 | ... | 22.4 | 54.8 | 44.7 | 36.5 | 37.5 | 59.3 | 11.3 | 2ND | 1.0 | 2015 |
2 | Michigan | B10 | 40 | 33 | 114.4 | 90.4 | 0.9375 | 53.9 | 47.7 | 14.0 | ... | 30.0 | 54.7 | 46.8 | 35.2 | 33.2 | 65.9 | 6.9 | 2ND | 3.0 | 2018 |
3 | Texas Tech | B12 | 38 | 31 | 115.2 | 85.2 | 0.9696 | 53.5 | 43.0 | 17.7 | ... | 36.6 | 52.8 | 41.9 | 36.5 | 29.7 | 67.5 | 7.0 | 2ND | 3.0 | 2019 |
4 | Gonzaga | WCC | 39 | 37 | 117.8 | 86.3 | 0.9728 | 56.6 | 41.1 | 16.2 | ... | 26.9 | 56.3 | 40.0 | 38.2 | 29.0 | 71.5 | 7.7 | 2ND | 1.0 | 2017 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3518 | Toledo | MAC | 34 | 27 | 119.9 | 109.6 | 0.7369 | 56.3 | 52.9 | 13.6 | ... | 27.5 | 54.6 | 52.1 | 39.7 | 36.1 | 69.5 | -1.2 | NaN | NaN | 2023 |
3519 | Liberty | ASun | 33 | 27 | 111.4 | 97.3 | 0.8246 | 55.5 | 49.3 | 16.0 | ... | 27.8 | 56.4 | 48.6 | 36.4 | 33.6 | 64.4 | -2.0 | NaN | NaN | 2023 |
3520 | Utah Valley | WAC | 34 | 28 | 107.1 | 94.6 | 0.8065 | 51.7 | 44.0 | 19.3 | ... | 28.7 | 52.5 | 42.8 | 33.4 | 31.1 | 69.8 | -0.3 | NaN | NaN | 2023 |
3521 | UAB | CUSA | 38 | 29 | 112.4 | 97.0 | 0.8453 | 50.3 | 47.3 | 17.3 | ... | 28.9 | 48.8 | 47.2 | 35.6 | 31.6 | 70.7 | -0.5 | NaN | NaN | 2023 |
3522 | North Texas | CUSA | 36 | 31 | 110.0 | 93.8 | 0.8622 | 51.2 | 44.5 | 19.8 | ... | 40.2 | 49.6 | 44.2 | 35.7 | 30.1 | 58.7 | 1.1 | NaN | NaN | 2023 |
3523 rows × 24 columns
df.shape
(3523, 24)
For this project, we will only be considering teams that have a non-null value for the “POSTSEASON” column, as a non-null value indicates that a team qualified for the March Madness tournament.
dfPost = df[df["POSTSEASON"].notna()]
dfPost.shape
(680, 24)
from pandas.api.types import is_numeric_dtype
features = [i for i in dfPost.columns if is_numeric_dtype(dfPost[i])]
features
['G',
'W',
'ADJOE',
'ADJDE',
'BARTHAG',
'EFG_O',
'EFG_D',
'TOR',
'TORD',
'ORB',
'DRB',
'FTR',
'FTRD',
'2P_O',
'2P_D',
'3P_O',
'3P_D',
'ADJ_T',
'WAB',
'SEED',
'YEAR']
dfPost.describe()
G | W | ADJOE | ADJDE | BARTHAG | EFG_O | EFG_D | TOR | TORD | ORB | ... | FTR | FTRD | 2P_O | 2P_D | 3P_O | 3P_D | ADJ_T | WAB | SEED | YEAR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | ... | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 |
mean | 33.302941 | 24.066176 | 111.223824 | 96.555294 | 0.795007 | 52.092206 | 47.766324 | 17.489706 | 19.000000 | 31.304559 | ... | 35.617794 | 32.412206 | 51.337500 | 46.907206 | 35.636912 | 32.928529 | 67.531912 | 1.393676 | 8.801471 | 2017.800000 |
std | 3.631720 | 4.493494 | 6.278295 | 5.277004 | 0.167810 | 2.688772 | 2.356140 | 1.897230 | 2.432209 | 4.225676 | ... | 5.560964 | 5.924822 | 2.984119 | 2.907143 | 2.565072 | 2.075223 | 3.049348 | 4.802096 | 4.673461 | 3.252007 |
min | 15.000000 | 12.000000 | 90.600000 | 84.000000 | 0.152200 | 44.700000 | 39.600000 | 12.400000 | 13.100000 | 17.700000 | ... | 21.300000 | 19.100000 | 42.500000 | 37.700000 | 26.600000 | 26.100000 | 58.400000 | -15.600000 | 1.000000 | 2013.000000 |
25% | 32.000000 | 21.000000 | 107.075000 | 93.075000 | 0.744375 | 50.100000 | 46.200000 | 16.300000 | 17.400000 | 28.700000 | ... | 31.575000 | 28.000000 | 49.400000 | 45.000000 | 33.875000 | 31.500000 | 65.400000 | -1.100000 | 5.000000 | 2015.000000 |
50% | 34.000000 | 24.000000 | 111.300000 | 96.000000 | 0.855050 | 52.000000 | 47.800000 | 17.300000 | 18.800000 | 31.300000 | ... | 35.300000 | 32.000000 | 51.200000 | 46.900000 | 35.700000 | 32.900000 | 67.500000 | 1.700000 | 9.000000 | 2017.500000 |
75% | 35.000000 | 27.000000 | 115.400000 | 99.900000 | 0.911550 | 53.900000 | 49.300000 | 18.700000 | 20.400000 | 34.125000 | ... | 39.300000 | 36.100000 | 53.100000 | 48.900000 | 37.400000 | 34.300000 | 69.600000 | 4.300000 | 13.000000 | 2021.000000 |
max | 40.000000 | 38.000000 | 129.100000 | 115.600000 | 0.984200 | 61.000000 | 55.700000 | 23.700000 | 28.500000 | 43.600000 | ... | 55.500000 | 55.500000 | 64.000000 | 56.700000 | 43.700000 | 38.700000 | 77.300000 | 13.100000 | 16.000000 | 2023.000000 |
8 rows × 21 columns
Features of the Dataset#
Before we start fitting classifiers to all the numeric features in the data, we should first take a closer look at the features.
First, let's examine which numeric features do not belong in the model. Right off the bat, we can see that the year, although technically numeric, is really a categorical variable: it merely records which season a team was playing in. The year column could still be useful later on, so we will keep track of it, but it does not belong among the features the random forest classifier (RFC) will be trained on.
features.remove('YEAR')
Next, let's take a look at the 'G' column, which corresponds to the number of games a team played during the season. We can immediately tell this feature should be cut because it is directly related to other features and isn't as informative. For example, a good team that has won a lot of games is more likely to advance in its conference championship (games unrelated to March Madness), which in turn increases the number of games that team has played. However, just to confirm that this feature should be cut, let's use the groupby function.
dfPost.groupby('POSTSEASON').mean()["G"]
POSTSEASON
2ND 37.80000
Champions 37.90000
E8 35.85000
F4 36.95000
R32 33.63125
R64 32.16250
R68 31.60000
S16 34.73750
Name: G, dtype: float64
This output indicates that the 'G' column actually includes March Madness games, which means the target variable is influencing the number of games played rather than the other way around. Because of this confounding, the 'G' column needs to be cut.
features.remove('G')
We also need to cut the 'W' column for similar reasons, as it too is directly affected by a team's finish in the March Madness tournament, not the other way around.
features.remove('W')
Granted, most of the in-game statistics columns unfortunately also include data from March Madness games, which makes this dataset less than ideal. However, the regular season should dominate those statistics, since most of a team's games are regular season games and these columns are rates or averages rather than counts (unlike the 'G' and 'W' columns).
Moving on, we will keep the 'SEED' feature in for now, because seeds are determined before the tournament actually begins. Next, let's examine the 'WAB' feature, which stands for Wins Above Bubble (the bubble being the cutoff between making the March Madness tournament and missing it). In dfPost we are only looking at teams that made the cutoff, but this feature could still be useful: it is determined before the tournament begins, yet it may well influence how far a team advances.
dfPost.groupby('POSTSEASON').median()["WAB"]
POSTSEASON
2ND 6.95
Champions 8.95
E8 6.50
F4 5.55
R32 2.60
R64 0.30
R68 -3.00
S16 4.55
Name: WAB, dtype: float64
Interestingly enough, the median for teams eliminated in the R68 is negative, suggesting that they didn't have enough wins to qualify outright. This makes sense: if they had a positive WAB, we'd expect them to have a guaranteed spot in the round of 64 rather than having to fight for their lives in the R68. We will keep the 'WAB' column in for now.
Next, let's examine the BARTHAG column, which gives a team's power rating (the chance of beating an average Division I team).
dfPost.groupby('POSTSEASON').median()["BARTHAG"]
POSTSEASON
2ND 0.95265
Champions 0.96730
E8 0.93450
F4 0.93365
R32 0.87510
R64 0.79440
R68 0.61525
S16 0.91675
Name: BARTHAG, dtype: float64
This column will remain in for now: even though it shows a strong trend, it seems to act as an independent (predictor) variable, although it is unclear whether the March Madness games themselves affect it. We may cut it later down the line if the model is not performing well.
The last single numerical feature we will examine is the 'ADJ_T' column, the adjusted tempo (an estimate of the tempo, in possessions per 40 minutes, a team would have against a team that wants to play at an average Division I tempo). We will again use groupby to see whether this column has any impact.
dfPost.groupby('POSTSEASON').median()["ADJ_T"]
POSTSEASON
2ND 66.70
Champions 66.90
E8 67.10
F4 65.95
R32 67.20
R64 67.60
R68 68.35
S16 67.25
Name: ADJ_T, dtype: float64
Let’s visualize this with a bar chart.
#This is just to make the chart more ordered
CorrectSort = ['Champions', '2ND', 'F4', 'E8', 'S16', 'R32', 'R64', 'R68']
alt.Chart(dfPost).mark_bar().encode(
x = alt.X("POSTSEASON:O", sort = CorrectSort),
y = alt.Y("median(ADJ_T)", scale=alt.Scale(domain=[60, 70])),
color = alt.Color("POSTSEASON", sort = CorrectSort),
tooltip = "median(ADJ_T)"
).properties(
width = 200,
title = "Median of Adjusted Tempo vs a Team's Postseason Finish"
)
There seems to be a slight trend, so we will also leave this feature in for now.
Now, let's take a look at the 'paired' features. By paired features, I mean statistics for which a team has both an offensive and a defensive side. An example is the 'EFG_O'/'EFG_D' pair: 'EFG_O' measures a team's effective field goal percentage, while 'EFG_D' measures the effective field goal percentage a team allows. However, a team's EFG_O and EFG_D on their own are likely not as important as the difference between them. For instance, if a team has a 20% EFG_O, we would probably expect them to lose every single game; but if that same team allows a superhuman 10% EFG_D, their defense would likely carry them to a lot of wins. Therefore, instead of the raw values of EFG_O and EFG_D, our model should analyze a feature containing the difference between the two.
EFG_diff = df["EFG_O"] - df["EFG_D"]
df["EFG_diff"] = EFG_diff
Let’s make columns with the difference of other paired features.
ORB: Offensive Rebound Rate vs DRB: Offensive Rebound Rate Allowed
FTR : Free Throw Rate (How often the given team shoots Free Throws) vs FTRD: Free Throw Rate Allowed
2P_O: Two-Point Shooting Percentage vs 2P_D: Two-Point Shooting Percentage Allowed
3P_O: Three-Point Shooting Percentage vs 3P_D: Three-Point Shooting Percentage Allowed
ADJOE: Adjusted Offensive Efficiency (An estimate of the offensive efficiency (points scored per 100 possessions) a team would have against the average Division I defense) vs ADJDE: Adjusted Defensive Efficiency (An estimate of the defensive efficiency (points allowed per 100 possessions) a team would have against the average Division I offense)
TORD: Turnover Percentage Committed (Steal Rate) vs TOR: Turnover Percentage Allowed (Turnover Rate)
df["RB_diff"] = df["ORB"] - df["DRB"]
df["FTR_diff"] = df["FTR"] - df["FTRD"]
df["2P_diff"] = df["2P_O"] - df["2P_D"]
df["3P_diff"] = df["3P_O"] - df["3P_D"]
df["ADJE_diff"] = df["ADJOE"] - df["ADJDE"]
df["TOR_diff"] = df["TORD"] - df["TOR"]
We create these differential columns in the original DataFrame df (rather than in dfPost, which is a slice of df) in order to avoid pandas' SettingWithCopyWarning and other slicing problems.
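A minimal sketch of the alternative approach, assuming we preferred to work on the subset directly (the name dfPost_copy is just for illustration):
#Hypothetical alternative: take an explicit copy of the subset so that
#adding columns does not trigger the SettingWithCopyWarning
dfPost_copy = df[df["POSTSEASON"].notna()].copy()
dfPost_copy["EFG_diff"] = dfPost_copy["EFG_O"] - dfPost_copy["EFG_D"]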
Lastly, we will take a look at the target column, 'POSTSEASON'. It's clear that this is ordinal data, with 'Champions' being the best finish and 'R68' the worst. Therefore, we will create an ordinally encoded column that corresponds to the 'POSTSEASON' column.
mapping = {'R68': 0, 'R64':1,'R32':2,'S16':3, 'E8':4, 'F4': 5, '2ND':6, 'Champions':7}
df["POST_OE"] = df["POSTSEASON"].map(mapping)
This will not only be important for a certain model type we will be using later on, but it will also bring an element of regression analysis into the project, as we can examine how far off certain predictions were (e.g. if the true value for a team was 'Champions', a prediction of 5 would be a better prediction than 3).
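A quick illustration of this ordinal-distance idea, using the mapping we just defined:
#With the ordinal encoding, "how wrong" a prediction is becomes a distance
true_value = mapping['Champions'] #7
print(abs(true_value - 5)) #an 'F4' prediction is off by 2
print(abs(true_value - 3)) #an 'S16' prediction is off by 4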
Now, let’s redefine dfPost with our new columns and make a new and hopefully improved list of features:
dfPost = df[df["POSTSEASON"].notna()]
dfPost.columns
Index(['TEAM', 'CONF', 'G', 'W', 'ADJOE', 'ADJDE', 'BARTHAG', 'EFG_O', 'EFG_D',
'TOR', 'TORD', 'ORB', 'DRB', 'FTR', 'FTRD', '2P_O', '2P_D', '3P_O',
'3P_D', 'ADJ_T', 'WAB', 'POSTSEASON', 'SEED', 'YEAR', 'EFG_diff',
'RB_diff', 'FTR_diff', '2P_diff', '3P_diff', 'ADJE_diff', 'TOR_diff',
'POST_OE'],
dtype='object')
features = ["BARTHAG", "WAB", "SEED", "ADJ_T", "EFG_diff", "RB_diff", 'FTR_diff', '2P_diff', '3P_diff', 'ADJE_diff', 'TOR_diff']
Building a Random Forest Classifier with our features#
Before we start making the RFC, we first need a baseline: the accuracy we would get by guessing the most frequent value of the POSTSEASON column for every row. Any model we train should be compared against this.
print(dfPost["POSTSEASON"].value_counts().index[0])
dfPost["POSTSEASON"].value_counts()[0]
R64
320
len(dfPost["POSTSEASON"])
680
320 out of 680 teams were eliminated in the round of 64.
BaseScore = 320/680
BaseScore
0.47058823529411764
As we can see, around 47% of teams were eliminated in the round of 64. This means that any model we train should have at least 47% accuracy, or else it's not a very good model (as it would then perform worse than simply guessing R64 for every team).
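As a quick sanity check, scikit-learn's DummyClassifier reproduces this most-frequent-class baseline; a minimal sketch:
#A classifier that always predicts the most common class should score ~0.47
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(dfPost[features], dfPost["POST_OE"])
dummy.score(dfPost[features], dfPost["POST_OE"])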
We will now construct a random forest classifier with 400 estimators and see how it performs with the new features. As a side note, we set train_size to 0.6 so that we keep a reasonably large test set while still having enough training data. We will use the ordinally encoded column as the target column because it will be more helpful in evaluating the model's performance.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(dfPost[features], dfPost["POST_OE"], train_size=0.6, random_state = 4)
rfc = RandomForestClassifier(n_estimators = 400, random_state = 0)
rfc.fit(X_train, y_train)
RFC_score = rfc.score(X_test, y_test)
RFC_score
0.5110294117647058
A 51.1% accuracy is already about a 4 percentage point improvement over the 'R64' method of guessing, which suggests our model is learning something useful. Now, let's examine how far off the model's predictions were on average.
y_pred = rfc.predict(X_test)
from sklearn.metrics import mean_absolute_error
RFC_MAE = mean_absolute_error(y_test,y_pred)
RFC_MAE
0.7095588235294118
Our model is off by about 0.71 per prediction on average. This is a good sign: even when the model's classification is incorrect, its prediction still tends to be relatively close to a team's actual tournament finish. For reference, the baseline would be the error from guessing that every team lost in the R64 (ordinal class 1), since this is the most common value.
Base_AE = np.abs(y_test-1).mean()
Base_AE
0.9889705882352942
Our model’s MAE is around 70% of this, which already is a noticeable improvement.
Attempting to Use Principal Component Analysis to improve performance#
dfPost[features].shape
(680, 11)
We currently have 11 features (i.e. 11 dimensions). Such a high number of dimensions for a relatively small DataFrame like dfPost could potentially lead to overfitting. As such, we will use Principal Component Analysis for dimension reduction. We will first reduce to 2 dimensions, in order to aid with visualizations.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(dfPost[features])
X_pca.shape
(680, 2)
The above block of code reduces our 11 features to 2 principal components. Now, we will turn X_pca into a DataFrame.
a = dfPost["POSTSEASON"]
c = np.asarray(a)
b = dfPost["POST_OE"]
d = np.asarray(b)
#Meant to make visualization more understandable
dfPost_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
dfPost_pca["POSTSEASON"] = c
dfPost_pca["POST_OE"] = d
dfPost_pca
PC1 | PC2 | POSTSEASON | POST_OE | |
---|---|---|---|---|
0 | -17.912580 | 2.843415 | 2ND | 6.0 |
1 | -26.161231 | -8.014326 | 2ND | 6.0 |
2 | -12.026691 | 4.225785 | 2ND | 6.0 |
3 | -17.757074 | 8.579842 | 2ND | 6.0 |
4 | -24.052305 | -8.248363 | 2ND | 6.0 |
... | ... | ... | ... | ... |
675 | 20.066996 | 6.606842 | R64 | 1.0 |
676 | 22.629749 | 2.186891 | R64 | 1.0 |
677 | 21.446552 | 5.778752 | R64 | 1.0 |
678 | 31.801340 | 4.289538 | R68 | 0.0 |
679 | 29.801159 | 6.687634 | R68 | 0.0 |
680 rows × 4 columns
Now, we will graph PC2 vs PC1, with the color of the graph being a team’s finish in March Madness.
alt.Chart(dfPost_pca).mark_circle(size = 60).encode(
x = 'PC1',
y = 'PC2',
color = alt.Color('POSTSEASON:N', sort = CorrectSort, scale=alt.Scale(scheme='viridis')),
tooltip = 'POSTSEASON'
).interactive().properties(
title = 'PCA of NCAA Basketball Dataset'
)
The graph is interesting: it shows that teams tend to fare better in the NCAA tournament the lower their PC1 score is and the higher their PC2 score is (although the PC2 trend isn't as noticeable as the PC1 trend). The explained_variance_ratio_ attribute shows how much of the original data's variance each principal component captured.
pca.explained_variance_ratio_
array([0.56290596, 0.16871374])
In other words, PC1 captured 56.3% of the variance, while PC2 captured 16.9%. This is encouraging, because it means we were able to capture around 73% of the information contained in the original 11 features in a DataFrame with only 2 features! This in turn means that any model we train on PC1 and PC2 will be simpler (2 features vs 11), and therefore less prone to overfitting. Let's now train a model.
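Before we do, a quick check that the cumulative explained variance matches the ~73% figure computed above:
#Cumulative explained variance of the principal components
np.cumsum(pca.explained_variance_ratio_)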
X = dfPost_pca[["PC1","PC2"]]
y = dfPost_pca["POST_OE"]
We will limit the max_depth because there are only 2 features: PC1 and PC2.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6, random_state = 0)
rfc_pca = RandomForestClassifier(n_estimators = 400, max_depth = 3, random_state = 9)
rfc_pca.fit(X_train, y_train)
PCA_score = rfc_pca.score(X_test, y_test)
PCA_score
0.5036764705882353
The above model scored at a 50.4% accuracy, which is still an improvement over ‘R64’ guessing despite only capturing around 73% of the dataset’s information. Let’s now examine this model’s mean absolute error.
y_pca_pred = rfc_pca.predict(X_test)
PCA_MAE = mean_absolute_error(y_test,y_pca_pred)
PCA_MAE
0.7904411764705882
An MAE of 0.79 is worse than our full random forest's, but it is still a significant improvement over the baseline MAE we would get from 'R64' guessing (reminder: the baseline is roughly 0.99).
Let’s now quickly build an RFC on a DataFrame with principal components that capture at least 90% of dfPost’s information and see how it fares.
pca2 = PCA(n_components=0.9)
X_pca2 = pca2.fit_transform(dfPost[features])
X_pca2.shape
(680, 5)
The shape indicates that we need 5 principal components to capture at least 90% of the dataset’s information. Let’s now create a DataFrame, fit a model to it, and predict the postseason finish.
dfPost_pca2 = pd.DataFrame(X_pca2, columns=['PC1', 'PC2','PC3','PC4', 'PC5'])
dfPost_pca2["POST_OE"] = d
dfPost_pca2
PC1 | PC2 | PC3 | PC4 | PC5 | POST_OE | |
---|---|---|---|---|---|---|
0 | -17.912580 | 2.843415 | 6.105287 | 2.698300 | 7.445344 | 6.0 |
1 | -26.161231 | -8.014326 | 2.371395 | 3.689217 | -4.832190 | 6.0 |
2 | -12.026691 | 4.225785 | -4.060744 | 0.901865 | -0.303206 | 6.0 |
3 | -17.757074 | 8.579842 | -7.770423 | -4.696640 | -0.238826 | 6.0 |
4 | -24.052305 | -8.248363 | -5.598479 | -8.952695 | 3.721472 | 6.0 |
... | ... | ... | ... | ... | ... | ... |
675 | 20.066996 | 6.606842 | 0.106943 | 4.536261 | -4.991573 | 1.0 |
676 | 22.629749 | 2.186891 | 2.801789 | -1.721679 | 0.113177 | 1.0 |
677 | 21.446552 | 5.778752 | 5.688093 | 4.014113 | -1.256441 | 1.0 |
678 | 31.801340 | 4.289538 | 5.937357 | 1.191128 | 3.988012 | 0.0 |
679 | 29.801159 | 6.687634 | -0.929121 | -0.061041 | 4.808986 | 0.0 |
680 rows × 6 columns
pca2.explained_variance_ratio_
array([0.56290596, 0.16871374, 0.0853449 , 0.0709192 , 0.03845252])
The third, fourth, and fifth PCs don’t seem to capture as much information as the first two, which makes sense.
X_train, X_test, y_train, y_test = train_test_split(dfPost_pca2[dfPost_pca2.columns[:5]], y, train_size=0.6, random_state = 4)
rfc_pca = RandomForestClassifier(n_estimators = 400, random_state = 0)
rfc_pca.fit(X_train, y_train)
rfc_pca.score(X_test, y_test)
0.48161764705882354
y_pca2_pred = rfc_pca.predict(X_test)
mean_absolute_error(y_test, y_pca2_pred)
0.8051470588235294
Using the same seeds as our original 11-feature RFC, this model performs worse, even though the amount of information captured by the PCs increased compared to the 2-component model. Although this is a small dataset, this could be an example of overfitting, as the model might be capturing more noise from the training data's principal components. Another possible explanation is that this is just a bad seed for this particular model.
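One way to test the 'bad seed' explanation would be to average the accuracy over several splits with cross-validation; a minimal sketch, assuming 5 folds on the 5-component data:
#cross_val_score averages accuracy over several train/test splits, reducing
#the dependence of the score on any single random_state
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=400, random_state=0),
                            dfPost_pca2[dfPost_pca2.columns[:5]], y, cv=5)
cv_scores.mean()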
Making predictions using XGBClassifier#
For this section, we will examine whether or not the XGBClassifier (Extreme Gradient Boosting Classifier) performs better than the RandomForestClassifiers we've been using.
XGBoost (XGB)* is similar to random forest classifiers, but it has some differences. For instance, XGB uses a boosting technique, which means it builds a series of weak models (usually decision trees) sequentially. The idea is that each subsequent model corrects the errors of the previous one, and they work together to improve overall prediction accuracy.
*see references
!pip install xgboost
from xgboost import XGBClassifier
Successfully installed xgboost-2.0.2
So far, our most accurate model has been the RFC that predicted the postseason finish based off our 11 features. Let’s use the same train_test_split. As a side note, XGBClassifiers can only work with numerical classes, which is one of the reasons why we ordinally encoded the POSTSEASON column.
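(As an aside, if we had not built the mapping by hand, scikit-learn's LabelEncoder could produce an integer encoding automatically, but it numbers the labels alphabetically rather than by tournament finish, so our manual mapping is the better choice here. A quick sketch:)
#LabelEncoder assigns integers in alphabetical order of the labels,
#e.g. '2ND' -> 0, 'Champions' -> 1, ..., which loses the ordinal meaning
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit_transform(dfPost["POSTSEASON"])
le.classes_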
X_train, X_test, y_train, y_test = train_test_split(dfPost[features], dfPost["POST_OE"], train_size=0.6, random_state = 4)
XGBCl = XGBClassifier(n_estimators = 400, random_state = 0)
XGBCl.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, device=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=None, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=400, n_jobs=None, num_parallel_tree=None, objective='multi:softprob', ...)
The XGBClassifier has a number of parameters that can be fine-tuned to try to improve performance. For now, we will only set the random seed and n_estimators (400).
XGBCl.score(X_test,y_test)
0.44485294117647056
Such a poor performance indicates that we are on a bad seed, that we need to change some parameters in the classifier, or both. Let's adjust the max_depth and learning_rate parameters to try to improve performance (in essence, a smaller learning rate requires more boosting rounds but can help the algorithm generalize better to the data).
X_train, X_test, y_train, y_test = train_test_split(dfPost[features], dfPost["POST_OE"], train_size=0.6, random_state = 7)
XGBCl = XGBClassifier(n_estimators = 400, max_depth = 8, learning_rate = 0.09, random_state = 9)
XGBCl.fit(X_train, y_train)
XGB_score = XGBCl.score(X_test,y_test)
XGB_score
0.5183823529411765
This is much better than the first XGBClassifier: in fact, with these specific seeds and parameters, the classifier actually has a slightly higher accuracy than our random forest classifier. Let’s now take a look at the mean absolute error.
y_xgb_pred = XGBCl.predict(X_test)
XGB_MAE = mean_absolute_error(y_test, y_xgb_pred)
XGB_MAE
0.7095588235294118
This XGBClassifier actually has around the same MAE as the random forest, which suggests their performances are extremely similar.
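Rather than adjusting the parameters by hand, a more systematic option would be a small grid search with cross-validation; a minimal sketch (the parameter grid below is just an illustrative choice, not the tuned values above):
#GridSearchCV tries every combination in param_grid and keeps the one with
#the best average cross-validated accuracy
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth': [3, 5, 8], 'learning_rate': [0.05, 0.09, 0.3]}
search = GridSearchCV(XGBClassifier(n_estimators=400, random_state=9), param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)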
Examining Feature importances#
Now, let’s examine the relative importance of the features in our model, in order to see which features tend to be the most impactful in determining the March Madness finish of a team. Since our XGBClassifier had a slight edge in classifying teams, we will use this model.
FeatImp = XGBCl.feature_importances_ #Idea/command from ChatGPT
Classes = XGBCl.feature_names_in_
bo = pd.DataFrame([])
bo["Feature"] = Classes
bo["Importance"] = FeatImp
bo_sorted = bo.sort_values(by='Importance', ascending=False)
bo_sorted
Feature | Importance | |
---|---|---|
0 | BARTHAG | 0.161660 |
9 | ADJE_diff | 0.151615 |
2 | SEED | 0.127000 |
1 | WAB | 0.079642 |
8 | 3P_diff | 0.076092 |
5 | RB_diff | 0.072741 |
4 | EFG_diff | 0.070629 |
7 | 2P_diff | 0.069117 |
10 | TOR_diff | 0.065977 |
3 | ADJ_T | 0.064668 |
6 | FTR_diff | 0.060859 |
A higher importance indicates that the model relied on that feature more heavily when splitting, which loosely corresponds to a stronger relationship with the target variable. Let's visualize the importances.
#Code to make the chart more ordered
OrderSort = bo_sorted["Feature"].to_list()
alt.Chart(bo_sorted).mark_bar().encode(
x = alt.X("Feature:N", sort=None),
y = "Importance:Q",
color = alt.Color("Feature:N", sort=None),
tooltip = "Importance"
).properties(
title = "Feature Importances"
)
As we can see, ADJE_diff, BARTHAG, and SEED seem to be the three most impactful features, while FTR_diff and adjusted tempo (ADJ_T) don't seem to matter as much. However, we shouldn't outright cut any features, as each one still appears to have at least some impact on the model's decisions.
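As a side note, gain-based importances like these can be skewed when features are correlated (and many of ours are), so permutation importance on the test set is a common cross-check; a minimal sketch:
#Permutation importance: shuffle one feature at a time in the test set and
#measure how much the model's accuracy drops
from sklearn.inspection import permutation_importance
perm = permutation_importance(XGBCl, X_test, y_test, n_repeats=10, random_state=0)
pd.Series(perm.importances_mean, index=features).sort_values(ascending=False)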
Examining Specific Features and Decision Tree Boundaries#
Let’s take a closer look at the three most impactful features in XGBCl, to see how they affect the predicted finish of a team. First, we’ll consider the seed of a team.
SEED = pd.DataFrame(dfPost.groupby('POSTSEASON').mean()["SEED"]).reset_index()
alt.Chart(SEED).mark_bar().encode(
x = alt.X("POSTSEASON", sort = CorrectSort),
y = "SEED:Q",
color = alt.Color("POSTSEASON:N", sort = CorrectSort),
tooltip = "SEED"
).properties(
width = 200,
title = "Average Seed for each Tournament Finish"
)
The chart suggests that the higher a team's seed (higher meaning closer to 1), the better their finish in the tournament tends to be. This makes sense, given that better teams usually earn the higher seeds. One interesting nugget is that Final Four teams had, on average, lower seeds (closer to 16) than Elite Eight teams; this is likely just an example of the 'madness' and unpredictability of March Madness, plus the fact that our data is only 680 rows long.
Let’s now move on to BARTHAG.
BART = pd.DataFrame(dfPost.groupby('POSTSEASON').mean()["BARTHAG"]).reset_index()
BART
POSTSEASON | BARTHAG | |
---|---|---|
0 | 2ND | 0.947810 |
1 | Champions | 0.962280 |
2 | E8 | 0.914007 |
3 | F4 | 0.922440 |
4 | R32 | 0.859177 |
5 | R64 | 0.730332 |
6 | R68 | 0.580652 |
7 | S16 | 0.901175 |
alt.Chart(BART).mark_bar().encode(
x = alt.X("POSTSEASON", sort = CorrectSort),
y = "BARTHAG:Q",
color = alt.Color("POSTSEASON:N", sort = CorrectSort),
tooltip = "BARTHAG"
).properties(
width = 200,
title = "Average BARTHAG for each Tournament Finish"
)
As we can see, the higher a team's BARTHAG, the better they tend to fare in the March Madness tournament. Although this seems obvious when you consider that BARTHAG measures a team's chance of beating an average D1 opponent, the chart provides more specific information: for instance, teams eliminated in the First Four (R68) average a BARTHAG well below 0.7, while every other group averages above that mark.
Let's now look at the second most impactful feature, the ADJE_diff column, which measures a team's net points per 100 possessions (adjusted offensive efficiency minus adjusted defensive efficiency).
ADJE_diff = pd.DataFrame(dfPost.groupby('POSTSEASON').mean()["ADJE_diff"]).reset_index()
ADJE_diff
POSTSEASON | ADJE_diff | |
---|---|---|
0 | 2ND | 27.600000 |
1 | Champions | 30.380000 |
2 | E8 | 23.235000 |
3 | F4 | 23.680000 |
4 | R32 | 17.511250 |
5 | R64 | 10.423437 |
6 | R68 | 3.867500 |
7 | S16 | 21.247500 |
alt.Chart(ADJE_diff).mark_bar().encode(
x = alt.X("POSTSEASON", sort = CorrectSort),
y = "ADJE_diff:Q",
color = alt.Color("POSTSEASON:N", sort = CorrectSort),
tooltip = "ADJE_diff"
).properties(
width = 200,
title = "Average Adjusted Efficiency Difference for each Tournament Finish"
)
Again, there seems to be a very direct correlation between the adjusted efficiency difference and a team’s finish, with champions and finalists tending to outscore their opponents over the course of a season by a whopping 28-30 points per 100 possessions. This is useful because it suggests the eventual winner will likely not be a team who wins a lot of close games, but rather a team that wins a lot of blowouts.
Finally, we will make a decision boundary graph using two of our most important continuous, quantitative features: BARTHAG and the adjusted efficiency difference. We will use a single Decision Tree Classifier this time so that the boundaries are clearer.
First, we need to generate arrays of sample points for our chart, drawn uniformly at random across each feature's observed range.
print(dfPost["ADJE_diff"].min())
print(dfPost["ADJE_diff"].max())
-15.299999999999997
36.3
ADJElin = np.random.uniform(-16, 37, 5000)
Since BARTHAG is a probability, we will sample it between 0 and 1:
BARTHAGlin = np.random.uniform(0,1,5000)
Boundary = pd.DataFrame({'ADJE_diff':ADJElin, 'BARTHAG':BARTHAGlin})
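As an aside, a more common way to build the background points for a decision-boundary plot is a regular grid via np.meshgrid rather than random sampling, which gives even coverage of the plane; a minimal sketch (BoundaryGrid is just an illustrative name):
#A regular grid over the two features, flattened into a DataFrame
adje_grid, barthag_grid = np.meshgrid(np.linspace(-16, 37, 100), np.linspace(0, 1, 100))
BoundaryGrid = pd.DataFrame({'ADJE_diff': adje_grid.ravel(), 'BARTHAG': barthag_grid.ravel()})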
Now, we fit another model on the original data that only takes into account these two features.
from sklearn.tree import DecisionTreeClassifier
X_train, X_test, y_train, y_test = train_test_split(dfPost[["ADJE_diff","BARTHAG"]], dfPost["POST_OE"], train_size=0.6, random_state = 7)
Tree = DecisionTreeClassifier(max_leaf_nodes = 10, random_state=5)
Tree.fit(X_train, y_train)
Tree_score = Tree.score(X_test,y_test)
Tree_score
0.5404411764705882
Remarkably, this simple decision tree has the best performance of all the models we've trained so far, with about 54% accuracy, around 7 percentage points better than our base score. Let's look at its MAE.
y_tree_pred = Tree.predict(X_test)
Tree_MAE = mean_absolute_error(y_test, y_tree_pred)
Tree_MAE
0.6691176470588235
Sure enough, the tree also has the lowest mean absolute error, which means its predictions are closer to the true values than those of any other model we've trained. This is a sign that most of our previous models may have been overfitting.
Boundary["Prediction_OE"] = Tree.predict(Boundary[["ADJE_diff", "BARTHAG"]])
inverted_mapping = {value: key for key, value in mapping.items()}
Boundary["Predictions"] = Boundary["Prediction_OE"].map(inverted_mapping)
Boundary
ADJE_diff | BARTHAG | Prediction_OE | Predictions | |
---|---|---|---|---|
0 | 17.153968 | 0.473926 | 1.0 | R64 |
1 | 25.676365 | 0.891518 | 1.0 | R64 |
2 | 15.643836 | 0.534441 | 1.0 | R64 |
3 | 26.227244 | 0.510742 | 1.0 | R64 |
4 | 15.106568 | 0.989752 | 1.0 | R64 |
... | ... | ... | ... | ... |
4995 | -7.499594 | 0.406876 | 0.0 | R68 |
4996 | 15.435096 | 0.339736 | 1.0 | R64 |
4997 | 2.888958 | 0.312349 | 1.0 | R64 |
4998 | -8.423352 | 0.666879 | 0.0 | R68 |
4999 | -8.142750 | 0.120803 | 0.0 | R68 |
5000 rows × 4 columns
alt.Chart(Boundary).mark_circle(size=60).encode(
x = "ADJE_diff:Q",
y = "BARTHAG",
color = alt.Color("Predictions", sort = CorrectSort),
tooltip = "Predictions"
).properties(
title = "Tree's Predicted Tournament Finish based off ADJE_diff and BARTHAG"
)
Here, we see the downside of this tree: because we limited the maximum number of leaf nodes, it never predicts 'F4', '2ND', or 'Champions', likely because of the relatively small number of these classes in the dataset. Let's train one more tree with more leaf nodes in order to get the model to make some of these predictions.
Tree2 = DecisionTreeClassifier(max_leaf_nodes = 70, random_state=5)
Tree2.fit(X_train, y_train)
print(Tree2.score(X_test,y_test))
print(mean_absolute_error(Tree2.predict(X_test), y_test))
0.4889705882352941
0.7610294117647058
As we can see, this tree performs worse: allowing many more leaf nodes made the tree more complex, which likely led to more overfitting.
Boundary["Prediction2_OE"] = Tree2.predict(Boundary[["ADJE_diff", "BARTHAG"]])
Boundary["Predictions2"] = Boundary["Prediction2_OE"].map(inverted_mapping)
alt.Chart(Boundary).mark_circle(size=60).encode(
x = "ADJE_diff:Q",
y = "BARTHAG",
color = alt.Color("Predictions2:N", sort = CorrectSort, scale = alt.Scale(scheme='category20')),
tooltip = "Predictions2"
).properties(
title = "Tree2's Predicted Tournament Finish based off ADJE_diff and BARTHAG"
)
This chart, although it contains predictions from every class, clearly displays signs of overfitting, such as the many narrow, striped decision regions and the fact that a couple of 'R68' predictions appear to the left of the 'E8' prediction block. For these reasons, we should not use this tree.
Conclusions#
Let’s now visualize the performances of our various models.
ModelScores = pd.DataFrame({'ModelType': ['Base', 'RFC', 'PCA', 'XGB', 'Tree'], 'Accuracy': [BaseScore, RFC_score, PCA_score, XGB_score, Tree_score], 'MAE': [Base_AE, RFC_MAE, PCA_MAE, XGB_MAE, Tree_MAE]})
ModelScores
ModelType | Accuracy | MAE | |
---|---|---|---|
0 | Base | 0.470588 | 0.988971 |
1 | RFC | 0.511029 | 0.709559 |
2 | PCA | 0.503676 | 0.790441 |
3 | XGB | 0.518382 | 0.709559 |
4 | Tree | 0.540441 | 0.669118 |
alt.Chart(ModelScores).mark_bar(width=30).encode(
x = "ModelType:N",
y = alt.Y("Accuracy:Q", scale= alt.Scale(domain = [0.40, 0.55])),
color = "ModelType:N",
tooltip ="Accuracy:Q"
).properties(
width = 200,
height = 300,
title = 'Accuracy Scores for Different Models'
)
alt.Chart(ModelScores).mark_bar(width=30).encode(
x = "ModelType:N",
y = alt.Y("MAE:Q", scale= alt.Scale(domain = [0, 1])),
color = "ModelType:N",
tooltip ="MAE:Q"
).properties(
width = 200,
height = 300,
title = 'Mean Absolute Error for Different Models'
)
Ultimately, because it had the best performance on the test data and displayed the least overfitting, we should use our simple decision tree ('Tree') when making predictions on the data, even though it is less detailed than some of the other models we trained. This shows that relatively 'primitive' models are sometimes the best choice for certain data, because they are less prone to overfitting on the training set.
We also learned that certain features in the dataset are more important than others, these being the BARTHAG, adjusted efficiency difference, and seed columns. So, I will probably take a look at these columns the next time I fill out my March Madness bracket, as they seem to have the most impact.
However, the main thing we learned is that the NCAA tournament is called ‘March Madness’ for a reason, as even our best model could only predict the finish of a postseason team with 54% accuracy. In other words, don’t expect to fill out a perfect bracket anytime soon, no matter how many models you build.
On the bright side, the mean absolute error of our best model's predictions was only about 0.67, which shows that on average, our model was very close to the correct prediction even when it was wrong (reminder: predicting '2ND' for the actual champions would be considered close, while predicting 'R68' would not - ordinal data!)
Lastly, it is important to remember that the dataset we used is flawed, in that it contained data from the actual March Madness games. Therefore, the conclusions we drew from this dataset are not entirely reliable, even though we did remove some of the issues in the ‘Features of the Dataset’ section.
Summary#
In this project, I attempted to build a model to predict where a team would finish in the March Madness tournament. First, I removed and added some features before constructing a variety of models to see which one would work best (the decision tree on two features ended up working best). I then looked at which features were most important in determining a team's predicted finish (the three most important were BARTHAG, adjusted efficiency difference, and seed).
References#
Dataset source:
https://www.kaggle.com/datasets/andrewsundberg/college-basketball-dataset
Other helpful references:
Course notes
XGB source: https://www.kaggle.com/code/alexisbcook/xgboost
ChatGPT helped with feature_importances_ and various other smaller bugs that I ran into.