Predicting a Team’s March Madness finish#
Author: Yorick Herrmann, yherrman@uci.edu
Course Project, UC Irvine, Math 10, F23
Introduction#
In this project I will build a model that predicts how far a college basketball team advances in the March Madness tournament. I will also examine which statistical categories are most correlated with a team's success in the tournament. I will be using a dataset that contains various statistics for every NCAA Division 1 team from 2013 to 2023, which I will filter down to the teams that qualified for the tournament.
Loading the Data#
First, we need to load the relevant libraries/packages.
import pandas as pd
import numpy as np
import altair as alt
df = pd.read_csv('cbb.csv')
As mentioned in the introduction, this DataFrame contains data from every single NCAA D1 college basketball team from 2013 to the end of the 2022/2023 season. The data is from: https://www.kaggle.com/datasets/andrewsundberg/college-basketball-dataset
df
TEAM | CONF | G | W | ADJOE | ADJDE | BARTHAG | EFG_O | EFG_D | TOR | ... | FTRD | 2P_O | 2P_D | 3P_O | 3P_D | ADJ_T | WAB | POSTSEASON | SEED | YEAR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | North Carolina | ACC | 40 | 33 | 123.3 | 94.9 | 0.9531 | 52.6 | 48.1 | 15.4 | ... | 30.4 | 53.9 | 44.6 | 32.7 | 36.2 | 71.7 | 8.6 | 2ND | 1.0 | 2016 |
1 | Wisconsin | B10 | 40 | 36 | 129.1 | 93.6 | 0.9758 | 54.8 | 47.7 | 12.4 | ... | 22.4 | 54.8 | 44.7 | 36.5 | 37.5 | 59.3 | 11.3 | 2ND | 1.0 | 2015 |
2 | Michigan | B10 | 40 | 33 | 114.4 | 90.4 | 0.9375 | 53.9 | 47.7 | 14.0 | ... | 30.0 | 54.7 | 46.8 | 35.2 | 33.2 | 65.9 | 6.9 | 2ND | 3.0 | 2018 |
3 | Texas Tech | B12 | 38 | 31 | 115.2 | 85.2 | 0.9696 | 53.5 | 43.0 | 17.7 | ... | 36.6 | 52.8 | 41.9 | 36.5 | 29.7 | 67.5 | 7.0 | 2ND | 3.0 | 2019 |
4 | Gonzaga | WCC | 39 | 37 | 117.8 | 86.3 | 0.9728 | 56.6 | 41.1 | 16.2 | ... | 26.9 | 56.3 | 40.0 | 38.2 | 29.0 | 71.5 | 7.7 | 2ND | 1.0 | 2017 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3518 | Toledo | MAC | 34 | 27 | 119.9 | 109.6 | 0.7369 | 56.3 | 52.9 | 13.6 | ... | 27.5 | 54.6 | 52.1 | 39.7 | 36.1 | 69.5 | -1.2 | NaN | NaN | 2023 |
3519 | Liberty | ASun | 33 | 27 | 111.4 | 97.3 | 0.8246 | 55.5 | 49.3 | 16.0 | ... | 27.8 | 56.4 | 48.6 | 36.4 | 33.6 | 64.4 | -2.0 | NaN | NaN | 2023 |
3520 | Utah Valley | WAC | 34 | 28 | 107.1 | 94.6 | 0.8065 | 51.7 | 44.0 | 19.3 | ... | 28.7 | 52.5 | 42.8 | 33.4 | 31.1 | 69.8 | -0.3 | NaN | NaN | 2023 |
3521 | UAB | CUSA | 38 | 29 | 112.4 | 97.0 | 0.8453 | 50.3 | 47.3 | 17.3 | ... | 28.9 | 48.8 | 47.2 | 35.6 | 31.6 | 70.7 | -0.5 | NaN | NaN | 2023 |
3522 | North Texas | CUSA | 36 | 31 | 110.0 | 93.8 | 0.8622 | 51.2 | 44.5 | 19.8 | ... | 40.2 | 49.6 | 44.2 | 35.7 | 30.1 | 58.7 | 1.1 | NaN | NaN | 2023 |
3523 rows × 24 columns
df.shape
(3523, 24)
For this project, we will only be considering teams that have a non-null value for the “POSTSEASON” column, as a non-null value indicates that a team qualified for the March Madness tournament.
dfPost = df[df["POSTSEASON"].notna()]
dfPost.shape
(680, 24)
from pandas.api.types import is_numeric_dtype
features = [i for i in dfPost.columns if is_numeric_dtype(dfPost[i])]
features
['G',
'W',
'ADJOE',
'ADJDE',
'BARTHAG',
'EFG_O',
'EFG_D',
'TOR',
'TORD',
'ORB',
'DRB',
'FTR',
'FTRD',
'2P_O',
'2P_D',
'3P_O',
'3P_D',
'ADJ_T',
'WAB',
'SEED',
'YEAR']
dfPost.describe()
G | W | ADJOE | ADJDE | BARTHAG | EFG_O | EFG_D | TOR | TORD | ORB | ... | FTR | FTRD | 2P_O | 2P_D | 3P_O | 3P_D | ADJ_T | WAB | SEED | YEAR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | ... | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 | 680.000000 |
mean | 33.302941 | 24.066176 | 111.223824 | 96.555294 | 0.795007 | 52.092206 | 47.766324 | 17.489706 | 19.000000 | 31.304559 | ... | 35.617794 | 32.412206 | 51.337500 | 46.907206 | 35.636912 | 32.928529 | 67.531912 | 1.393676 | 8.801471 | 2017.800000 |
std | 3.631720 | 4.493494 | 6.278295 | 5.277004 | 0.167810 | 2.688772 | 2.356140 | 1.897230 | 2.432209 | 4.225676 | ... | 5.560964 | 5.924822 | 2.984119 | 2.907143 | 2.565072 | 2.075223 | 3.049348 | 4.802096 | 4.673461 | 3.252007 |
min | 15.000000 | 12.000000 | 90.600000 | 84.000000 | 0.152200 | 44.700000 | 39.600000 | 12.400000 | 13.100000 | 17.700000 | ... | 21.300000 | 19.100000 | 42.500000 | 37.700000 | 26.600000 | 26.100000 | 58.400000 | -15.600000 | 1.000000 | 2013.000000 |
25% | 32.000000 | 21.000000 | 107.075000 | 93.075000 | 0.744375 | 50.100000 | 46.200000 | 16.300000 | 17.400000 | 28.700000 | ... | 31.575000 | 28.000000 | 49.400000 | 45.000000 | 33.875000 | 31.500000 | 65.400000 | -1.100000 | 5.000000 | 2015.000000 |
50% | 34.000000 | 24.000000 | 111.300000 | 96.000000 | 0.855050 | 52.000000 | 47.800000 | 17.300000 | 18.800000 | 31.300000 | ... | 35.300000 | 32.000000 | 51.200000 | 46.900000 | 35.700000 | 32.900000 | 67.500000 | 1.700000 | 9.000000 | 2017.500000 |
75% | 35.000000 | 27.000000 | 115.400000 | 99.900000 | 0.911550 | 53.900000 | 49.300000 | 18.700000 | 20.400000 | 34.125000 | ... | 39.300000 | 36.100000 | 53.100000 | 48.900000 | 37.400000 | 34.300000 | 69.600000 | 4.300000 | 13.000000 | 2021.000000 |
max | 40.000000 | 38.000000 | 129.100000 | 115.600000 | 0.984200 | 61.000000 | 55.700000 | 23.700000 | 28.500000 | 43.600000 | ... | 55.500000 | 55.500000 | 64.000000 | 56.700000 | 43.700000 | 38.700000 | 77.300000 | 13.100000 | 16.000000 | 2023.000000 |
8 rows × 21 columns
Features of the Dataset#
Before we start fitting classifiers to all the numeric features in the data, we should first take a closer look at the features.
First, let's examine which numeric features do not belong in the model. Right off the bat, we can see that the year, although technically numeric, is really a categorical variable: it merely records which season a team was playing in. The year column could still be useful later on, so we will keep track of it, but it does not belong among the features the random forest classifier (RFC) will be trained on.
features.remove('YEAR')
Next, let's take a look at the 'G' column, which corresponds to the number of games a team played during the season. We can immediately tell this feature should be cut because it is directly related to other features and isn't as informative. For example, a good team that has won a lot of games is more likely to advance in its conference championship (games unrelated to March Madness), which in turn increases the number of games that team has played. However, just to confirm that this feature should be cut, let's use the groupby function.
dfPost.groupby('POSTSEASON').mean()["G"]
POSTSEASON
2ND 37.80000
Champions 37.90000
E8 35.85000
F4 36.95000
R32 33.63125
R64 32.16250
R68 31.60000
S16 34.73750
Name: G, dtype: float64
This output indicates that the 'G' column actually includes March Madness games, which means the target variable is influencing the number of games played rather than the other way around. Because of this confounding, the 'G' column needs to be cut.
features.remove('G')
We also need to cut the 'W' column for similar reasons, as it too is directly affected by a team's finish in the March Madness tournament, not the other way around.
features.remove('W')
Granted, most of the in-game statistics columns unfortunately also include data from March Madness games, which makes this dataset less than ideal. However, the regular season should dominate those statistics, since most of a team's games are regular season games and these columns are rates or averages rather than counts (unlike the 'G' and 'W' columns).
Moving on, we will keep the 'SEED' feature in for now, because seeds are determined before the tournament actually begins. Next, let's examine the 'WAB' feature, which stands for Wins Above Bubble (the bubble being the cutoff between making the March Madness tournament and missing it). In dfPost we are only looking at teams that made the cutoff, but this feature could still be useful: it is determined before the tournament begins, yet it may well influence how far a team advances.
dfPost.groupby('POSTSEASON').median()["WAB"]
POSTSEASON
2ND 6.95
Champions 8.95
E8 6.50
F4 5.55
R32 2.60
R64 0.30
R68 -3.00
S16 4.55
Name: WAB, dtype: float64
Interestingly enough, the median for teams eliminated in the R68 is negative, suggesting that they didn't have enough wins to qualify outright. This makes sense: if they had a positive WAB, we'd expect them to have a guaranteed spot in the round of 64 rather than having to fight for their lives in the R68. We will keep the 'WAB' column in for now.
Next, let's examine the BARTHAG column, which gives a team's power rating (the chance of beating an average Division I team).
dfPost.groupby('POSTSEASON').median()["BARTHAG"]
POSTSEASON
2ND 0.95265
Champions 0.96730
E8 0.93450
F4 0.93365
R32 0.87510
R64 0.79440
R68 0.61525
S16 0.91675
Name: BARTHAG, dtype: float64
This column will remain in for now: even though it shows a strong trend, it seems to act as an independent (predictor) variable, although it is unclear whether the March Madness games themselves affect it. We may cut it later down the line if the model is not performing well.
The last single numerical feature we will examine is the 'ADJ_T' column, the adjusted tempo (an estimate of the tempo, in possessions per 40 minutes, a team would have against a team that wants to play at an average Division I tempo). We will again use groupby to see whether this column has any impact.
dfPost.groupby('POSTSEASON').median()["ADJ_T"]
POSTSEASON
2ND 66.70
Champions 66.90
E8 67.10
F4 65.95
R32 67.20
R64 67.60
R68 68.35
S16 67.25
Name: ADJ_T, dtype: float64
Let’s visualize this with a bar chart.
#This is just to make the chart more ordered
CorrectSort = ['Champions', '2ND', 'F4', 'E8', 'S16', 'R32', 'R64', 'R68']
alt.Chart(dfPost).mark_bar().encode(
x = alt.X("POSTSEASON:O", sort = CorrectSort),
y = alt.Y("median(ADJ_T)", scale=alt.Scale(domain=[60, 70])),
color = alt.Color("POSTSEASON", sort = CorrectSort),
tooltip = "median(ADJ_T)"
).properties(
width = 200,
title = "Median of Adjusted Tempo vs a Team's Postseason Finish"
)
There seems to be a slight trend, so we will also leave this feature in for now.
Now, let's take a look at the 'paired' features. By paired features, I mean statistics for which a team has both an offensive and a defensive side. An example is the 'EFG_O'/'EFG_D' pair: 'EFG_O' measures a team's effective field goal percentage, while 'EFG_D' measures the effective field goal percentage a team allows. However, a team's EFG_O and EFG_D on their own are likely not as important as the difference between them. For instance, if a team has a 20% EFG_O, we would probably expect them to lose every single game; but if that same team allows a superhuman 10% EFG_D, their defense would likely carry them to a lot of wins. Therefore, instead of the raw values of EFG_O and EFG_D, our model should analyze a feature containing the difference between the two.
EFG_diff = df["EFG_O"] - df["EFG_D"]
df["EFG_diff"] = EFG_diff
Let’s make columns with the difference of other paired features.
ORB: Offensive Rebound Rate vs DRB: Offensive Rebound Rate Allowed
FTR : Free Throw Rate (How often the given team shoots Free Throws) vs FTRD: Free Throw Rate Allowed
2P_O: Two-Point Shooting Percentage vs 2P_D: Two-Point Shooting Percentage Allowed
3P_O: Three-Point Shooting Percentage vs 3P_D: Three-Point Shooting Percentage Allowed
ADJOE: Adjusted Offensive Efficiency (An estimate of the offensive efficiency (points scored per 100 possessions) a team would have against the average Division I defense) vs ADJDE: Adjusted Defensive Efficiency (An estimate of the defensive efficiency (points allowed per 100 possessions) a team would have against the average Division I offense)
TORD: Turnover Percentage Committed (Steal Rate) vs TOR: Turnover Percentage Allowed (Turnover Rate)
df["RB_diff"] = df["ORB"] - df["DRB"]
df["FTR_diff"] = df["FTR"] - df["FTRD"]
df["2P_diff"] = df["2P_O"] - df["2P_D"]
df["3P_diff"] = df["3P_O"] - df["3P_D"]
df["ADJE_diff"] = df["ADJOE"] - df["ADJDE"]
df["TOR_diff"] = df["TORD"] - df["TOR"]
We create these differential columns in the original DataFrame df (rather than in dfPost, which is a slice of df) in order to avoid pandas' SettingWithCopyWarning and other slicing problems.
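A minimal sketch of the alternative approach, assuming we preferred to work on the subset directly (the name dfPost_copy is just for illustration):
#Hypothetical alternative: take an explicit copy of the subset so that
#adding columns does not trigger the SettingWithCopyWarning
dfPost_copy = df[df["POSTSEASON"].notna()].copy()
dfPost_copy["EFG_diff"] = dfPost_copy["EFG_O"] - dfPost_copy["EFG_D"]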
Lastly, we will take a look at the target column, 'POSTSEASON'. It's clear that this is ordinal data, with 'Champions' being the best finish and 'R68' the worst. Therefore, we will create an ordinally encoded column that corresponds to the 'POSTSEASON' column.
mapping = {'R68': 0, 'R64':1,'R32':2,'S16':3, 'E8':4, 'F4': 5, '2ND':6, 'Champions':7}
df["POST_OE"] = df["POSTSEASON"].map(mapping)
This will not only be important for a certain model type we will be using later on, but it will also bring an element of regression analysis into the project, as we can examine how far off certain predictions were (e.g. if the true value for a team was 'Champions', a prediction of 5 would be a better prediction than 3).
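A quick illustration of this ordinal-distance idea, using the mapping we just defined:
#With the ordinal encoding, "how wrong" a prediction is becomes a distance
true_value = mapping['Champions'] #7
print(abs(true_value - 5)) #an 'F4' prediction is off by 2
print(abs(true_value - 3)) #an 'S16' prediction is off by 4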
Now, let’s redefine dfPost with our new columns and make a new and hopefully improved list of features:
dfPost = df[df["POSTSEASON"].notna()]
dfPost.columns
Index(['TEAM', 'CONF', 'G', 'W', 'ADJOE', 'ADJDE', 'BARTHAG', 'EFG_O', 'EFG_D',
'TOR', 'TORD', 'ORB', 'DRB', 'FTR', 'FTRD', '2P_O', '2P_D', '3P_O',
'3P_D', 'ADJ_T', 'WAB', 'POSTSEASON', 'SEED', 'YEAR', 'EFG_diff',
'RB_diff', 'FTR_diff', '2P_diff', '3P_diff', 'ADJE_diff', 'TOR_diff',
'POST_OE'],
dtype='object')
features = ["BARTHAG", "WAB", "SEED", "ADJ_T", "EFG_diff", "RB_diff", 'FTR_diff', '2P_diff', '3P_diff', 'ADJE_diff', 'TOR_diff']
Building a Random Forest Classifier with our features#
Before we start making the RFC, we first need a baseline: the accuracy we would get by guessing the most frequent value of the POSTSEASON column for every row. Any model we train should be compared against this.
print(dfPost["POSTSEASON"].value_counts().index[0])
dfPost["POSTSEASON"].value_counts()[0]
R64
320
len(dfPost["POSTSEASON"])
680
320 out of 680 teams were eliminated in the round of 64.
BaseScore = 320/680
BaseScore
0.47058823529411764
As we can see, around 47% of teams were eliminated in the round of 64. This means that any model we train should have at least 47% accuracy, or else it's not a very good model (as it would then perform worse than simply guessing R64 for every team).
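As a quick sanity check, scikit-learn's DummyClassifier reproduces this most-frequent-class baseline; a minimal sketch:
#A classifier that always predicts the most common class should score ~0.47
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(dfPost[features], dfPost["POST_OE"])
dummy.score(dfPost[features], dfPost["POST_OE"])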
We will now construct a random forest classifier with 400 estimators and see how it performs with the new features. As a side note, we set train_size to 0.6 so that we keep a reasonably large test set while still having enough training data. We will use the ordinally encoded column as the target column because it will be more helpful in evaluating the model's performance.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(dfPost[features], dfPost["POST_OE"], train_size=0.6, random_state = 4)
rfc = RandomForestClassifier(n_estimators = 400, random_state = 0)
rfc.fit(X_train, y_train)
RFC_score = rfc.score(X_test, y_test)
RFC_score
0.5110294117647058
A 51.1% accuracy is already about a 4 percentage point improvement over the 'R64' method of guessing, which suggests our model is learning something useful. Now, let's examine how far off the model's predictions were on average.
y_pred = rfc.predict(X_test)
from sklearn.metrics import mean_absolute_error
RFC_MAE = mean_absolute_error(y_test,y_pred)
RFC_MAE
0.7095588235294118
Our model is off by about 0.71 per prediction on average. This is a good sign: even when the model's classification is incorrect, its prediction still tends to be relatively close to a team's actual tournament finish. For reference, the baseline would be the error from guessing that every team lost in the R64 (ordinal class 1), since this is the most common value.
Base_AE = np.abs(y_test-1).mean()
Base_AE
0.9889705882352942
Our model’s MAE is around 70% of this, which already is a noticeable improvement.
Attempting to Use Principal Component Analysis to improve performance#
dfPost[features].shape
(680, 11)
We currently have 11 features (i.e. 11 dimensions). Such a high number of dimensions for a relatively small DataFrame like dfPost could potentially lead to overfitting. As such, we will use Principal Component Analysis for dimension reduction. We will first reduce to 2 dimensions, in order to aid with visualizations.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(dfPost[features])
X_pca.shape
(680, 2)
The above block of code reduces our 11 features to 2 principal components. Now, we will turn X_pca into a DataFrame.
a = dfPost["POSTSEASON"]
c = np.asarray(a)
b = dfPost["POST_OE"]
d = np.asarray(b)
#Meant to make visualization more understandable
dfPost_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
dfPost_pca["POSTSEASON"] = c
dfPost_pca["POST_OE"] = d
dfPost_pca
PC1 | PC2 | POSTSEASON | POST_OE | |
---|---|---|---|---|
0 | -17.912580 | 2.843415 | 2ND | 6.0 |
1 | -26.161231 | -8.014326 | 2ND | 6.0 |
2 | -12.026691 | 4.225785 | 2ND | 6.0 |
3 | -17.757074 | 8.579842 | 2ND | 6.0 |
4 | -24.052305 | -8.248363 | 2ND | 6.0 |
... | ... | ... | ... | ... |
675 | 20.066996 | 6.606842 | R64 | 1.0 |
676 | 22.629749 | 2.186891 | R64 | 1.0 |
677 | 21.446552 | 5.778752 | R64 | 1.0 |
678 | 31.801340 | 4.289538 | R68 | 0.0 |
679 | 29.801159 | 6.687634 | R68 | 0.0 |
680 rows × 4 columns
Now, we will graph PC2 vs PC1, with the color of the graph being a team’s finish in March Madness.
alt.Chart(dfPost_pca).mark_circle(size = 60).encode(
x = 'PC1',
y = 'PC2',
color = alt.Color('POSTSEASON:N', sort = CorrectSort, scale=alt.Scale(scheme='viridis')),
tooltip = 'POSTSEASON'
).interactive().properties(
title = 'PCA of NCAA Basketball Dataset'
)
The graph is interesting: it shows that teams tend to fare better in the NCAA tournament the lower their PC1 score is and the higher their PC2 score is (although the PC2 trend isn't as noticeable as the PC1 trend). The explained_variance_ratio_ attribute shows how much of the original data's variance each principal component captured.
pca.explained_variance_ratio_
array([0.56290596, 0.16871374])
In other words, PC1 captured 56.3% of the variance, while PC2 captured 16.9%. This is encouraging, because it means we were able to capture around 73% of the information contained in the original 11 features in a DataFrame with only 2 features! This in turn means that any model we train on PC1 and PC2 will be simpler (2 features vs 11), and therefore less prone to overfitting. Let's now train a model.
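Before we do, a quick check that the cumulative explained variance matches the ~73% figure computed above:
#Cumulative explained variance of the principal components
np.cumsum(pca.explained_variance_ratio_)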
X = dfPost_pca[["PC1","PC2"]]
y = dfPost_pca["POST_OE"]
We will limit the max_depth because there are only 2 features: PC1 and PC2.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6, random_state = 0)
rfc_pca = RandomForestClassifier(n_estimators = 400, max_depth = 3, random_state = 9)
rfc_pca.fit(X_train, y_train)
PCA_score = rfc_pca.score(X_test, y_test)
PCA_score
0.5036764705882353
The above model scored at a 50.4% accuracy, which is still an improvement over ‘R64’ guessing despite only capturing around 73% of the dataset’s information. Let’s now examine this model’s mean absolute error.
y_pca_pred = rfc_pca.predict(X_test)
PCA_MAE = mean_absolute_error(y_test,y_pca_pred)
PCA_MAE
0.7904411764705882
An MAE of 0.79 is worse than our full random forest's, but it is still a significant improvement over the baseline MAE we would get from 'R64' guessing (reminder: the baseline is roughly 0.99).
Let’s now quickly build an RFC on a DataFrame with principal components that capture at least 90% of dfPost’s information and see how it fares.
pca2 = PCA(n_components=0.9)
X_pca2 = pca2.fit_transform(dfPost[features])
X_pca2.shape
(680, 5)
The shape indicates that we need 5 principal components to capture at least 90% of the dataset’s information. Let’s now create a DataFrame, fit a model to it, and predict the postseason finish.
dfPost_pca2 = pd.DataFrame(X_pca2, columns=['PC1', 'PC2','PC3','PC4', 'PC5'])
dfPost_pca2["POST_OE"] = d
dfPost_pca2
PC1 | PC2 | PC3 | PC4 | PC5 | POST_OE | |
---|---|---|---|---|---|---|
0 | -17.912580 | 2.843415 | 6.105287 | 2.698300 | 7.445344 | 6.0 |
1 | -26.161231 | -8.014326 | 2.371395 | 3.689217 | -4.832190 | 6.0 |
2 | -12.026691 | 4.225785 | -4.060744 | 0.901865 | -0.303206 | 6.0 |
3 | -17.757074 | 8.579842 | -7.770423 | -4.696640 | -0.238826 | 6.0 |
4 | -24.052305 | -8.248363 | -5.598479 | -8.952695 | 3.721472 | 6.0 |
... | ... | ... | ... | ... | ... | ... |
675 | 20.066996 | 6.606842 | 0.106943 | 4.536261 | -4.991573 | 1.0 |
676 | 22.629749 | 2.186891 | 2.801789 | -1.721679 | 0.113177 | 1.0 |
677 | 21.446552 | 5.778752 | 5.688093 | 4.014113 | -1.256441 | 1.0 |
678 | 31.801340 | 4.289538 | 5.937357 | 1.191128 | 3.988012 | 0.0 |
679 | 29.801159 | 6.687634 | -0.929121 | -0.061041 | 4.808986 | 0.0 |
680 rows × 6 columns
pca2.explained_variance_ratio_
array([0.56290596, 0.16871374, 0.0853449 , 0.0709192 , 0.03845252])
The third, fourth, and fifth PCs don’t seem to capture as much information as the first two, which makes sense.
X_train, X_test, y_train, y_test = train_test_split(dfPost_pca2[dfPost_pca2.columns[:5]], y, train_size=0.6, random_state = 4)
rfc_pca = RandomForestClassifier(n_estimators = 400, random_state = 0)
rfc_pca.fit(X_train, y_train)
rfc_pca.score(X_test, y_test)
0.48161764705882354
y_pca2_pred = rfc_pca.predict(X_test)
mean_absolute_error(y_test, y_pca2_pred)
0.8051470588235294
Using the same seeds as our original 11-feature RFC, this model performs worse, even though the amount of information captured by the PCs increased compared to the 2-component model. Although this is a small dataset, this could be an example of overfitting, as the model might be capturing more noise from the training data's principal components. Another possible explanation is that this is just a bad seed for this particular model.
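One way to test the 'bad seed' explanation would be to average the accuracy over several splits with cross-validation; a minimal sketch, assuming 5 folds on the 5-component data:
#cross_val_score averages accuracy over several train/test splits, reducing
#the dependence of the score on any single random_state
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=400, random_state=0),
                            dfPost_pca2[dfPost_pca2.columns[:5]], y, cv=5)
cv_scores.mean()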
Making predictions using XGBClassifier#
For this section, we will examine whether or not the XGBClassifier (Extreme Gradient Boosting Classifier) performs better than the RandomForestClassifiers we've been using.
XGBoost (XGB)* is similar to random forest classifiers, but it has some differences. For instance, XGB uses a boosting technique, which means it builds a series of weak models (usually decision trees) sequentially. The idea is that each subsequent model corrects the errors of the previous one, and they work together to improve overall prediction accuracy.
*see references
!pip install xgboost
from xgboost import XGBClassifier
Successfully installed xgboost-2.0.2
So far, our most accurate model has been the RFC that predicted the postseason finish based off our 11 features. Let’s use the same train_test_split. As a side note, XGBClassifiers can only work with numerical classes, which is one of the reasons why we ordinally encoded the POSTSEASON column.
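(As an aside, if we had not built the mapping by hand, scikit-learn's LabelEncoder could produce an integer encoding automatically, but it numbers the labels alphabetically rather than by tournament finish, so our manual mapping is the better choice here. A quick sketch:)
#LabelEncoder assigns integers in alphabetical order of the labels,
#e.g. '2ND' -> 0, 'Champions' -> 1, ..., which loses the ordinal meaning
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit_transform(dfPost["POSTSEASON"])
le.classes_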
X_train, X_test, y_train, y_test = train_test_split(dfPost[features], dfPost["POST_OE"], train_size=0.6, random_state = 4)
XGBCl = XGBClassifier(n_estimators = 400, random_state = 0)
XGBCl.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, device=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=None, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=400, n_jobs=None, num_parallel_tree=None, objective='multi:softprob', ...)
The XGBClassifier has a number of parameters that can be fine-tuned to try to improve performance. For now, we will only set the random seed and n_estimators (400).
XGBCl.score(X_test,y_test)
0.44485294117647056
Such a poor performance indicates that we are on a bad seed, that we need to change some parameters in the classifier, or both. Let's adjust the max_depth and learning_rate parameters to try to improve performance (in essence, a smaller learning rate requires more boosting rounds but can help the algorithm generalize better to the data).
X_train, X_test, y_train, y_test = train_test_split(dfPost[features], dfPost["POST_OE"], train_size=0.6, random_state = 7)
XGBCl = XGBClassifier(n_estimators = 400, max_depth = 8, learning_rate = 0.09, random_state = 9)
XGBCl.fit(X_train, y_train)
XGB_score = XGBCl.score(X_test,y_test)
XGB_score
0.5183823529411765
This is much better than the first XGBClassifier: in fact, with these specific seeds and parameters, the classifier actually has a slightly higher accuracy than our random forest classifier. Let’s now take a look at the mean absolute error.
y_xgb_pred = XGBCl.predict(X_test)
XGB_MAE = mean_absolute_error(y_test, y_xgb_pred)
XGB_MAE
0.7095588235294118
This XGBClassifier actually has around the same MAE as the random forest, which suggests their performances are extremely similar.
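Rather than adjusting the parameters by hand, a more systematic option would be a small grid search with cross-validation; a minimal sketch (the parameter grid below is just an illustrative choice, not the tuned values above):
#GridSearchCV tries every combination in param_grid and keeps the one with
#the best average cross-validated accuracy
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth': [3, 5, 8], 'learning_rate': [0.05, 0.09, 0.3]}
search = GridSearchCV(XGBClassifier(n_estimators=400, random_state=9), param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)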
Examining Feature importances#
Now, let’s examine the relative importance of the features in our model, in order to see which features tend to be the most impactful in determining the March Madness finish of a team. Since our XGBClassifier had a slight edge in classifying teams, we will use this model.
FeatImp = XGBCl.feature_importances_ #Idea/command from ChatGPT
Classes = XGBCl.feature_names_in_
bo = pd.DataFrame([])
bo["Feature"] = Classes
bo["Importance"] = FeatImp
bo_sorted = bo.sort_values(by='Importance', ascending=False)
bo_sorted
Feature | Importance | |
---|---|---|
0 | BARTHAG | 0.161660 |
9 | ADJE_diff | 0.151615 |
2 | SEED | 0.127000 |
1 | WAB | 0.079642 |
8 | 3P_diff | 0.076092 |
5 | RB_diff | 0.072741 |
4 | EFG_diff | 0.070629 |
7 | 2P_diff | 0.069117 |
10 | TOR_diff | 0.065977 |
3 | ADJ_T | 0.064668 |
6 | FTR_diff | 0.060859 |
A higher importance indicates that the model relied on that feature more heavily when splitting, which loosely corresponds to a stronger relationship with the target variable. Let's visualize the importances.
#Code to make the chart more ordered
OrderSort = bo_sorted["Feature"].to_list()
alt.Chart(bo_sorted).mark_bar().encode(
x = alt.X("Feature:N", sort=None),
y = "Importance:Q",
color = alt.Color("Feature:N", sort=None),
tooltip = "Importance"
).properties(
title = "Feature Importances"
)
As we can see, ADJE_diff, BARTHAG, and SEED seem to be the three most impactful features, while FTR_diff and adjusted tempo (ADJ_T) don't seem to matter as much. However, we shouldn't outright cut any features, as each one still appears to have at least some impact on the model's decisions.
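As a side note, gain-based importances like these can be skewed when features are correlated (and many of ours are), so permutation importance on the test set is a common cross-check; a minimal sketch:
#Permutation importance: shuffle one feature at a time in the test set and
#measure how much the model's accuracy drops
from sklearn.inspection import permutation_importance
perm = permutation_importance(XGBCl, X_test, y_test, n_repeats=10, random_state=0)
pd.Series(perm.importances_mean, index=features).sort_values(ascending=False)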
Examining Specific Features and Decision Tree Boundaries#
Let’s take a closer look at the three most impactful features in XGBCl, to see how they affect the predicted finish of a team. First, we’ll consider the seed of a team.
SEED = pd.DataFrame(dfPost.groupby('POSTSEASON').mean()["SEED"]).reset_index()
alt.Chart(SEED).mark_bar().encode(
x = alt.X("POSTSEASON", sort = CorrectSort),
y = "SEED:Q",
color = alt.Color("POSTSEASON:N", sort = CorrectSort),
tooltip = "SEED"
).properties(
width = 200,
title = "Average Seed for each Tournament Finish"
)
The chart suggests that the higher a team's seed (higher meaning closer to 1), the better their finish in the tournament tends to be. This makes sense, given that better teams usually earn the higher seeds. One interesting nugget is that Final Four teams had, on average, lower seeds (closer to 16) than Elite Eight teams; this is likely just an example of the 'madness' and unpredictability of March Madness, plus the fact that our data is only 680 rows long.
Let’s now move on to BARTHAG.
BART = pd.DataFrame(dfPost.groupby('POSTSEASON').mean()["BARTHAG"]).reset_index()
BART
POSTSEASON | BARTHAG | |
---|---|---|
0 | 2ND | 0.947810 |
1 | Champions | 0.962280 |
2 | E8 | 0.914007 |
3 | F4 | 0.922440 |
4 | R32 | 0.859177 |
5 | R64 | 0.730332 |
6 | R68 | 0.580652 |
7 | S16 | 0.901175 |
alt.Chart(BART).mark_bar().encode(
x = alt.X("POSTSEASON", sort = CorrectSort),
y = "BARTHAG:Q",
color = alt.Color("POSTSEASON:N", sort = CorrectSort),
tooltip = "BARTHAG"
).properties(
width = 200,
title = "Average BARTHAG for each Tournament Finish"
)
As we can see, the higher a team's BARTHAG, the better they tend to fare in the March Madness tournament. Although this seems obvious when you consider that BARTHAG measures a team's chance of beating an average D1 opponent, the chart provides more specific information: for instance, teams eliminated in the First Four (R68) average a BARTHAG well below 0.7, while every other group averages above that mark.
Let's now look at the second most impactful feature, the ADJE_diff column, which measures a team's net points per 100 possessions (adjusted offensive efficiency minus adjusted defensive efficiency).
ADJE_diff = pd.DataFrame(dfPost.groupby('POSTSEASON').mean()["ADJE_diff"]).reset_index()
ADJE_diff
POSTSEASON | ADJE_diff | |
---|---|---|
0 | 2ND | 27.600000 |
1 | Champions | 30.380000 |
2 | E8 | 23.235000 |
3 | F4 | 23.680000 |
4 | R32 | 17.511250 |
5 | R64 | 10.423437 |
6 | R68 | 3.867500 |
7 | S16 | 21.247500 |
alt.Chart(ADJE_diff).mark_bar().encode(
x = alt.X("POSTSEASON", sort = CorrectSort),
y = "ADJE_diff:Q",
color = alt.Color("POSTSEASON:N", sort = CorrectSort),
tooltip = "ADJE_diff"
).properties(
width = 200,
title = "Average Adjusted Efficiency Difference for each Tournament Finish"
)
Again, there seems to be a very direct correlation between the adjusted efficiency difference and a team’s finish, with champions and finalists tending to outscore their opponents over the course of a season by a whopping 28-30 points per 100 possessions. This is useful because it suggests the eventual winner will likely not be a team who wins a lot of close games, but rather a team that wins a lot of blowouts.
Finally, we will make a decision boundary graph using two of our most important continuous, quantitative features: BARTHAG and the adjusted efficiency difference. We will use a single Decision Tree Classifier this time so that the boundaries are clearer.
First, we need to generate arrays of sample points for our chart, drawn uniformly at random across each feature's observed range.
print(dfPost["ADJE_diff"].min())
print(dfPost["ADJE_diff"].max())
-15.299999999999997
36.3
ADJElin = np.random.uniform(-16, 37, 5000)
Since BARTHAG is a probability, we will sample it between 0 and 1:
BARTHAGlin = np.random.uniform(0,1,5000)
Boundary = pd.DataFrame({'ADJE_diff':ADJElin, 'BARTHAG':BARTHAGlin})
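As an aside, a more common way to build the background points for a decision-boundary plot is a regular grid via np.meshgrid rather than random sampling, which gives even coverage of the plane; a minimal sketch (BoundaryGrid is just an illustrative name):
#A regular grid over the two features, flattened into a DataFrame
adje_grid, barthag_grid = np.meshgrid(np.linspace(-16, 37, 100), np.linspace(0, 1, 100))
BoundaryGrid = pd.DataFrame({'ADJE_diff': adje_grid.ravel(), 'BARTHAG': barthag_grid.ravel()})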
Now, we fit another model on the original data that only takes into account these two features.
from sklearn.tree import DecisionTreeClassifier
X_train, X_test, y_train, y_test = train_test_split(dfPost[["ADJE_diff","BARTHAG"]], dfPost["POST_OE"], train_size=0.6, random_state = 7)
Tree = DecisionTreeClassifier(max_leaf_nodes = 10, random_state=5)
Tree.fit(X_train, y_train)
Tree_score = Tree.score(X_test,y_test)
Tree_score
0.5404411764705882
Remarkably, this simple decision tree has the best performance of all the models we've trained so far, with about 54% accuracy, around 7 percentage points better than our base score. Let's look at its MAE.
y_tree_pred = Tree.predict(X_test)
Tree_MAE = mean_absolute_error(y_test, y_tree_pred)
Tree_MAE
0.6691176470588235
Sure enough, the tree also has the lowest mean absolute error, which means its predictions are closer to the true values than those of any other model we've trained. This is a sign that most of our previous models may have been overfitting.
Boundary["Prediction_OE"] = Tree.predict(Boundary[["ADJE_diff", "BARTHAG"]])
inverted_mapping = {value: key for key, value in mapping.items()}
Boundary["Predictions"] = Boundary["Prediction_OE"].map(inverted_mapping)
Boundary
ADJE_diff | BARTHAG | Prediction_OE | Predictions | |
---|---|---|---|---|
0 | 17.153968 | 0.473926 | 1.0 | R64 |
1 | 25.676365 | 0.891518 | 1.0 | R64 |
2 | 15.643836 | 0.534441 | 1.0 | R64 |
3 | 26.227244 | 0.510742 | 1.0 | R64 |
4 | 15.106568 | 0.989752 | 1.0 | R64 |
... | ... | ... | ... | ... |
4995 | -7.499594 | 0.406876 | 0.0 | R68 |
4996 | 15.435096 | 0.339736 | 1.0 | R64 |
4997 | 2.888958 | 0.312349 | 1.0 | R64 |
4998 | -8.423352 | 0.666879 | 0.0 | R68 |
4999 | -8.142750 | 0.120803 | 0.0 | R68 |
5000 rows × 4 columns
alt.Chart(Boundary).mark_circle(size=60).encode(
x = "ADJE_diff:Q",
y = "BARTHAG",
color = alt.Color("Predictions", sort = CorrectSort),
tooltip = "Predictions"
).properties(
title = "Tree's Predicted Tournament Finish based off ADJE_diff and BARTHAG"
)
Here, we see the downside of this tree: because we limited the maximum number of leaf nodes, it never predicts 'F4', '2ND', or 'Champions', likely because of the relatively small number of these classes in the dataset. Let's train one more tree with more leaf nodes in order to get the model to make some of these predictions.
Tree2 = DecisionTreeClassifier(max_leaf_nodes = 70, random_state=5)
Tree2.fit(X_train, y_train)
print(Tree2.score(X_test,y_test))
print(mean_absolute_error(Tree2.predict(X_test), y_test))
0.4889705882352941
0.7610294117647058
As we can see, this tree performs worse: allowing many more leaf nodes made the tree more complex, which likely led to more overfitting.
Boundary["Prediction2_OE"] = Tree2.predict(Boundary[["ADJE_diff", "BARTHAG"]])
Boundary["Predictions2"] = Boundary["Prediction2_OE"].map(inverted_mapping)
alt.Chart(Boundary).mark_circle(size=60).encode(
x = "ADJE_diff:Q",
y = "BARTHAG",
color = alt.Color("Predictions2:N", sort = CorrectSort, scale = alt.Scale(scheme='category20')),
tooltip = "Predictions2"
).properties(
title = "Tree2's Predicted Tournament Finish based off ADJE_diff and BARTHAG"
)
This chart, although it contains predictions from every class, clearly displays signs of overfitting, such as the many narrow, striped decision regions and the fact that a couple of 'R68' predictions appear to the left of the 'E8' prediction block. For these reasons, we should not use this tree.
Conclusions#
Let’s now visualize the performances of our various models.
ModelScores = pd.DataFrame({'ModelType': ['Base', 'RFC', 'PCA', 'XGB', 'Tree'], 'Accuracy': [BaseScore, RFC_score, PCA_score, XGB_score, Tree_score], 'MAE': [Base_AE, RFC_MAE, PCA_MAE, XGB_MAE, Tree_MAE]})
ModelScores
ModelType | Accuracy | MAE | |
---|---|---|---|
0 | Base | 0.470588 | 0.988971 |
1 | RFC | 0.511029 | 0.709559 |
2 | PCA | 0.503676 | 0.790441 |
3 | XGB | 0.518382 | 0.709559 |
4 | Tree | 0.540441 | 0.669118 |
alt.Chart(ModelScores).mark_bar(width=30).encode(
x = "ModelType:N",
y = alt.Y("Accuracy:Q", scale= alt.Scale(domain = [0.40, 0.55])),
color = "ModelType:N",
tooltip ="Accuracy:Q"
).properties(
width = 200,
height = 300,
title = 'Accuracy Scores for Different Models'
)
alt.Chart(ModelScores).mark_bar(width=30).encode(
x = "ModelType:N",
y = alt.Y("MAE:Q", scale= alt.Scale(domain = [0, 1])),
color = "ModelType:N",
tooltip ="MAE:Q"
).properties(
width = 200,
height = 300,
title = 'Mean Absolute Error for Different Models'
)
Ultimately, because it had the best performance on the test data and displayed the least overfitting, we should use our simple decision tree ('Tree') when making predictions on the data, even though it is less detailed than some of the other models we trained. This shows that relatively 'primitive' models are sometimes the best choice for certain data, because they are less prone to overfitting on the training set.
We also learned that certain features in the dataset are more important than others, these being the BARTHAG, adjusted efficiency difference, and seed columns. So, I will probably take a look at these columns the next time I fill out my March Madness bracket, as they seem to have the most impact.
However, the main thing we learned is that the NCAA tournament is called ‘March Madness’ for a reason, as even our best model could only predict the finish of a postseason team with 54% accuracy. In other words, don’t expect to fill out a perfect bracket anytime soon, no matter how many models you build.
On the bright side, the mean absolute error of our best model's predictions was only about 0.67, which shows that on average, our model was very close to the correct prediction even when it was wrong (reminder: predicting '2ND' for the actual champions would be considered close, while predicting 'R68' would not - ordinal data!)
Lastly, it is important to remember that the dataset we used is flawed, in that it contained data from the actual March Madness games. Therefore, the conclusions we drew from this dataset are not entirely reliable, even though we did remove some of the issues in the ‘Features of the Dataset’ section.
Summary#
In this project, I attempted to build a model to predict where a team would finish in the March Madness tournament. First, I removed and added some features before constructing a variety of models to see which one would work best (the decision tree on two features ended up working best). I then looked at which features were most important in determining a team's predicted finish (the three most important were BARTHAG, adjusted efficiency difference, and seed).
References#
Dataset source:
https://www.kaggle.com/datasets/andrewsundberg/college-basketball-dataset
Other helpful references:
Course notes
XGB source: https://www.kaggle.com/code/alexisbcook/xgboost
ChatGPT helped with feature_importances_ and various other smaller bugs that I ran into.