Predicting a Team’s March Madness finish#

Author: Yorick Herrmann, yherrman@uci.edu

Course Project, UC Irvine, Math 10, F23

Introduction#


In this project I will build a model that predicts how far a college basketball team advances in the March Madness tournament. I will also examine which statistical categories are most correlated to a team’s success in the tournament. I will be using a dataset that contains various statistics for every NCAA Division 1 team that qualified for the tournament from the years 2013-2023.

Loading the Data#

First, we need to load the relevant libraries/packages.

import pandas as pd
import numpy as np
import altair as alt
df = pd.read_csv('cbb.csv')

As mentioned in the introduction, this DataFrame contains data from every single NCAA D1 college basketball team from 2013 to the end of the 2022/2023 season. The data is from: https://www.kaggle.com/datasets/andrewsundberg/college-basketball-dataset

df
TEAM CONF G W ADJOE ADJDE BARTHAG EFG_O EFG_D TOR ... FTRD 2P_O 2P_D 3P_O 3P_D ADJ_T WAB POSTSEASON SEED YEAR
0 North Carolina ACC 40 33 123.3 94.9 0.9531 52.6 48.1 15.4 ... 30.4 53.9 44.6 32.7 36.2 71.7 8.6 2ND 1.0 2016
1 Wisconsin B10 40 36 129.1 93.6 0.9758 54.8 47.7 12.4 ... 22.4 54.8 44.7 36.5 37.5 59.3 11.3 2ND 1.0 2015
2 Michigan B10 40 33 114.4 90.4 0.9375 53.9 47.7 14.0 ... 30.0 54.7 46.8 35.2 33.2 65.9 6.9 2ND 3.0 2018
3 Texas Tech B12 38 31 115.2 85.2 0.9696 53.5 43.0 17.7 ... 36.6 52.8 41.9 36.5 29.7 67.5 7.0 2ND 3.0 2019
4 Gonzaga WCC 39 37 117.8 86.3 0.9728 56.6 41.1 16.2 ... 26.9 56.3 40.0 38.2 29.0 71.5 7.7 2ND 1.0 2017
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3518 Toledo MAC 34 27 119.9 109.6 0.7369 56.3 52.9 13.6 ... 27.5 54.6 52.1 39.7 36.1 69.5 -1.2 NaN NaN 2023
3519 Liberty ASun 33 27 111.4 97.3 0.8246 55.5 49.3 16.0 ... 27.8 56.4 48.6 36.4 33.6 64.4 -2.0 NaN NaN 2023
3520 Utah Valley WAC 34 28 107.1 94.6 0.8065 51.7 44.0 19.3 ... 28.7 52.5 42.8 33.4 31.1 69.8 -0.3 NaN NaN 2023
3521 UAB CUSA 38 29 112.4 97.0 0.8453 50.3 47.3 17.3 ... 28.9 48.8 47.2 35.6 31.6 70.7 -0.5 NaN NaN 2023
3522 North Texas CUSA 36 31 110.0 93.8 0.8622 51.2 44.5 19.8 ... 40.2 49.6 44.2 35.7 30.1 58.7 1.1 NaN NaN 2023

3523 rows × 24 columns

df.shape
(3523, 24)

For this project, we will only be considering teams that have a non-null value for the “POSTSEASON” column, as a non-null value indicates that a team qualified for the March Madness tournament.

dfPost = df[df["POSTSEASON"].notna()]
dfPost.shape
(680, 24)
from pandas.api.types import is_numeric_dtype 
features = [i for i in dfPost.columns if is_numeric_dtype(dfPost[i])]
features
['G',
 'W',
 'ADJOE',
 'ADJDE',
 'BARTHAG',
 'EFG_O',
 'EFG_D',
 'TOR',
 'TORD',
 'ORB',
 'DRB',
 'FTR',
 'FTRD',
 '2P_O',
 '2P_D',
 '3P_O',
 '3P_D',
 'ADJ_T',
 'WAB',
 'SEED',
 'YEAR']
dfPost.describe()
G W ADJOE ADJDE BARTHAG EFG_O EFG_D TOR TORD ORB ... FTR FTRD 2P_O 2P_D 3P_O 3P_D ADJ_T WAB SEED YEAR
count 680.000000 680.000000 680.000000 680.000000 680.000000 680.000000 680.000000 680.000000 680.000000 680.000000 ... 680.000000 680.000000 680.000000 680.000000 680.000000 680.000000 680.000000 680.000000 680.000000 680.000000
mean 33.302941 24.066176 111.223824 96.555294 0.795007 52.092206 47.766324 17.489706 19.000000 31.304559 ... 35.617794 32.412206 51.337500 46.907206 35.636912 32.928529 67.531912 1.393676 8.801471 2017.800000
std 3.631720 4.493494 6.278295 5.277004 0.167810 2.688772 2.356140 1.897230 2.432209 4.225676 ... 5.560964 5.924822 2.984119 2.907143 2.565072 2.075223 3.049348 4.802096 4.673461 3.252007
min 15.000000 12.000000 90.600000 84.000000 0.152200 44.700000 39.600000 12.400000 13.100000 17.700000 ... 21.300000 19.100000 42.500000 37.700000 26.600000 26.100000 58.400000 -15.600000 1.000000 2013.000000
25% 32.000000 21.000000 107.075000 93.075000 0.744375 50.100000 46.200000 16.300000 17.400000 28.700000 ... 31.575000 28.000000 49.400000 45.000000 33.875000 31.500000 65.400000 -1.100000 5.000000 2015.000000
50% 34.000000 24.000000 111.300000 96.000000 0.855050 52.000000 47.800000 17.300000 18.800000 31.300000 ... 35.300000 32.000000 51.200000 46.900000 35.700000 32.900000 67.500000 1.700000 9.000000 2017.500000
75% 35.000000 27.000000 115.400000 99.900000 0.911550 53.900000 49.300000 18.700000 20.400000 34.125000 ... 39.300000 36.100000 53.100000 48.900000 37.400000 34.300000 69.600000 4.300000 13.000000 2021.000000
max 40.000000 38.000000 129.100000 115.600000 0.984200 61.000000 55.700000 23.700000 28.500000 43.600000 ... 55.500000 55.500000 64.000000 56.700000 43.700000 38.700000 77.300000 13.100000 16.000000 2023.000000

8 rows × 21 columns

Features of the Dataset#

Before we start fitting classifiers to all the numeric features in the data, we should first take a closer look at the features.

First, let's examine which numeric features do not belong in the model. Right off the bat, we can see that the year, although technically numeric, is really a categorical variable, as it merely records which season a team was playing in. The year column could still be useful later on, so we will keep track of it, but it does not belong in the features the random forest classifier (RFC) will be trained on.

features.remove('YEAR')

Next, let's take a look at the 'G' column, which corresponds to the number of games a team played during the season. Again, we can immediately tell this feature should be cut because it is directly related to other features and isn't very informative on its own. For example, a good team that has won a lot of games is more likely to advance in its conference championship (games unrelated to March Madness), which in turn increases the number of games that team has played. However, just to confirm that this feature should be cut, let's use the groupby function.

dfPost.groupby('POSTSEASON').mean()["G"]
POSTSEASON
2ND          37.80000
Champions    37.90000
E8           35.85000
F4           36.95000
R32          33.63125
R64          32.16250
R68          31.60000
S16          34.73750
Name: G, dtype: float64

This output indicates that the 'G' column actually includes March Madness games, which means the target variable is partly determining the number of games played rather than the other way around. Because of this confounding, the 'G' column needs to be cut.

features.remove('G')

We also need to cut the 'W' column for similar reasons, as the win total is likewise partly determined by a team's finish in the March Madness tournament, not the other way around.

features.remove('W')

Unfortunately, most of the remaining in-game statistics columns also include data from March Madness games, which makes this dataset less than ideal. However, these columns are rate or percentage statistics rather than counts (unlike the 'G' and 'W' columns), so the regular season, which supplies the vast majority of a team's games, should dominate them.
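
To get a rough sense of how small that tournament contribution is, here is a quick sketch (the per-finish game counts below are my own approximations; for example, First Four participants play one extra game) that estimates the share of each team's games that came from March Madness itself:

# Approximate number of tournament games played for each finish
approx_tourney_games = {'R68': 1, 'R64': 1, 'R32': 2, 'S16': 3,
                        'E8': 4, 'F4': 5, '2ND': 6, 'Champions': 6}
# Share of each team's total games that were tournament games
tourney_share = dfPost["POSTSEASON"].map(approx_tourney_games) / dfPost["G"]
print(tourney_share.mean())  # a small fraction of a typical 30+ game season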

Moving on, we will keep the 'SEED' feature for now, because seeds are determined before the tournament actually begins. Next, let's examine the 'WAB' feature, which refers to Wins Above Bubble (the bubble being the cutoff between making the NCAA March Madness Tournament and missing it). dfPost only contains teams that made the cutoff, but this feature could still be useful, as it is not affected by the postseason finish yet may well influence it.

dfPost.groupby('POSTSEASON').median()["WAB"]
POSTSEASON
2ND          6.95
Champions    8.95
E8           6.50
F4           5.55
R32          2.60
R64          0.30
R68         -3.00
S16          4.55
Name: WAB, dtype: float64

Interestingly enough, the median for teams eliminated in the R68 is negative, suggesting that they didn't have enough wins to qualify outright. This makes sense: if they had a positive WAB, we'd expect them to have a guaranteed spot in the round of 64 rather than having to fight for their lives in the R68. We will keep the 'WAB' column in for now.

Next, let's examine the 'BARTHAG' column, which contains a team's power rating (the chance of beating an average Division I team).

dfPost.groupby('POSTSEASON').median()["BARTHAG"]
POSTSEASON
2ND          0.95265
Champions    0.96730
E8           0.93450
F4           0.93365
R32          0.87510
R64          0.79440
R68          0.61525
S16          0.91675
Name: BARTHAG, dtype: float64

This column will stay in for now, since it appears to act as an independent variable even though it shows a strong trend (it is unclear whether the March Madness games themselves impact this column). We may cut it later if the model is not performing well.

The last standalone numerical feature we will examine is the 'ADJ_T' column, the Adjusted Tempo (an estimate of the tempo, in possessions per 40 minutes, a team would have against a team that wants to play at an average Division I tempo). We will use groupby to see whether this column has any impact.

dfPost.groupby('POSTSEASON').median()["ADJ_T"]
POSTSEASON
2ND          66.70
Champions    66.90
E8           67.10
F4           65.95
R32          67.20
R64          67.60
R68          68.35
S16          67.25
Name: ADJ_T, dtype: float64

Let’s visualize this with a bar chart.

#Sort the finishes from best (Champions) to worst (R68) so the charts read in order
CorrectSort = ['Champions', '2ND', 'F4', 'E8', 'S16', 'R32', 'R64', 'R68']
alt.Chart(dfPost).mark_bar().encode(
    x = alt.X("POSTSEASON:O", sort = CorrectSort),
    y = alt.Y("median(ADJ_T)", scale=alt.Scale(domain=[60, 70])),
    color = alt.Color("POSTSEASON", sort = CorrectSort),
    tooltip = "median(ADJ_T)"
).properties(
    width = 200,
    title = "Median of Adjusted Tempo vs a Team's Postseason Finish"
)

There seems to be a slight trend, so we will also leave this feature in for now.

Now, let's take a look at the 'paired' features. By paired features, I am referring to statistics for which a team has both an offensive and a defensive side. One example is the 'EFG_O'/'EFG_D' pair: 'EFG_O' measures a team's effective field goal percentage, while 'EFG_D' measures the effective field goal percentage a team allows. A team's EFG_O and EFG_D on their own are likely not as important as the difference between them. For instance, if a team has a 20% EFG_O, we would probably expect them to lose every single game; but if that same team allows a superhuman 10% EFG_D, their defense would likely carry them to a lot of wins. Therefore, instead of the raw values of EFG_O and EFG_D, our model should analyze a feature containing the difference between the two.

EFG_diff = df["EFG_O"] - df["EFG_D"]
df["EFG_diff"] = EFG_diff

Let’s make columns with the difference of other paired features.

  1. ORB: Offensive Rebound Rate vs DRB: Offensive Rebound Rate Allowed

  2. FTR : Free Throw Rate (How often the given team shoots Free Throws) vs FTRD: Free Throw Rate Allowed

  3. 2P_O: Two-Point Shooting Percentage vs 2P_D: Two-Point Shooting Percentage Allowed

  4. 3P_O: Three-Point Shooting Percentage vs 3P_D: Three-Point Shooting Percentage Allowed

  5. ADJOE: Adjusted Offensive Efficiency (An estimate of the offensive efficiency (points scored per 100 possessions) a team would have against the average Division I defense) vs ADJDE: Adjusted Defensive Efficiency (An estimate of the defensive efficiency (points allowed per 100 possessions) a team would have against the average Division I offense)

  6. TORD: Turnover Percentage Committed (Steal Rate) vs TOR: Turnover Percentage Allowed (Turnover Rate)

#Create a differential (offense minus defense) column for each paired statistic
df["RB_diff"] = df["ORB"] - df["DRB"]
df["FTR_diff"] = df["FTR"] - df["FTRD"]
df["2P_diff"] = df["2P_O"] - df["2P_D"]
df["3P_diff"] = df["3P_O"] - df["3P_D"]
df["ADJE_diff"] = df["ADJOE"] - df["ADJDE"]
df["TOR_diff"] = df["TORD"] - df["TOR"]

We create these differential columns in the original DataFrame (rather than in the dfPost slice) in order to avoid pandas' chained-assignment (SettingWithCopy) issues.
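
For reference, an equivalent approach (just a sketch, not what is done here) would be to build the differential columns on an explicit copy of the postseason slice, which also sidesteps pandas' SettingWithCopyWarning:

# Alternative: work on an explicit copy of the postseason slice
dfPost_alt = df[df["POSTSEASON"].notna()].copy()
dfPost_alt["EFG_diff"] = dfPost_alt["EFG_O"] - dfPost_alt["EFG_D"]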

Lastly, we will take a look at the target column, 'POSTSEASON'. It is clearly ordinal data, with 'Champions' being the best finish and 'R68' the worst. Therefore, we will create an ordinally encoded column that corresponds to the 'POSTSEASON' column.

mapping = {'R68': 0, 'R64':1,'R32':2,'S16':3, 'E8':4, 'F4': 5, '2ND':6, 'Champions':7}
df["POST_OE"] = df["POSTSEASON"].map(mapping)

This will not only be important for a certain model type we will be using later on, but it will also allow us to bring in an element of regression analysis into the project, as we can examine how far off certain predictions were (e.g. if the true value for a team was ‘Champions’, a prediction of 5 would be a better prediction than 3.)
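
As a tiny illustration of this point, using the mapping defined above:

# If the true finish is Champions (encoded as 7), predicting 5 (F4) is a
# smaller ordinal error than predicting 3 (S16)
true_value = mapping['Champions']                 # 7
print(abs(true_value - 5), abs(true_value - 3))   # 2 vs. 4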

Now, let’s redefine dfPost with our new columns and make a new and hopefully improved list of features:

dfPost = df[df["POSTSEASON"].notna()]
dfPost.columns
Index(['TEAM', 'CONF', 'G', 'W', 'ADJOE', 'ADJDE', 'BARTHAG', 'EFG_O', 'EFG_D',
       'TOR', 'TORD', 'ORB', 'DRB', 'FTR', 'FTRD', '2P_O', '2P_D', '3P_O',
       '3P_D', 'ADJ_T', 'WAB', 'POSTSEASON', 'SEED', 'YEAR', 'EFG_diff',
       'RB_diff', 'FTR_diff', '2P_diff', '3P_diff', 'ADJE_diff', 'TOR_diff',
       'POST_OE'],
      dtype='object')
features = ["BARTHAG", "WAB", "SEED", "ADJ_T", "EFG_diff", "RB_diff", 'FTR_diff', '2P_diff', '3P_diff', 'ADJE_diff', 'TOR_diff']

Building a Random Forest Classifier with our features#

Before building the RFC, we first need a baseline: the accuracy we would get by guessing the most frequent value of the POSTSEASON column for every row. Any model we train can then be judged against this baseline.

print(dfPost["POSTSEASON"].value_counts().index[0])
dfPost["POSTSEASON"].value_counts()[0]
R64
320
len(dfPost["POSTSEASON"])
680

320 out of 680 teams were eliminated in the round of 64.

BaseScore = 320/680
BaseScore
0.47058823529411764

As we can see, around 47% of teams were eliminated in the round of 64. This means that any model we train should have an accuracy above 47%, or else it is not a very good model (it would then perform worse than simply guessing R64 for every team).
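
As a cross-check (a sketch, not part of the workflow below), scikit-learn's DummyClassifier reproduces this most-frequent-class baseline directly:

# Always predict the most frequent class ('R64'); the features are ignored
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(dfPost[features], dfPost["POSTSEASON"])
print(dummy.score(dfPost[features], dfPost["POSTSEASON"]))  # about 0.47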

We will now construct a random forest classifier with 400 estimators and see how it performs with the new features. As a side note, we set train_size to 0.6 so that there is enough data to fit the model while still leaving a sizable test set for evaluation. We will use the ordinally encoded column as the target because it will be more helpful for evaluating the model's performance.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(dfPost[features], dfPost["POST_OE"], train_size=0.6, random_state = 4)
rfc = RandomForestClassifier(n_estimators = 400, random_state = 0)
rfc.fit(X_train, y_train)
RFC_score = rfc.score(X_test, y_test)
RFC_score
0.5110294117647058

A 51.1% accuracy is already about a 4 percentage point improvement over the 'R64' method of guessing, which suggests our model has some predictive value. Now, let's examine how far off the model's predictions were on average.

y_pred = rfc.predict(X_test)
from sklearn.metrics import mean_absolute_error
RFC_MAE = mean_absolute_error(y_test,y_pred)
RFC_MAE
0.7095588235294118

On average, our model's predictions are off by about 0.71 rounds. This is a good sign: even when the model's classification is incorrect, its prediction still tends to be relatively close to a team's actual tournament finish. For reference, the baseline MAE comes from guessing that every team lost in the R64 (ordinally encoded class 1), since that is the most common value.

Base_AE = np.abs(y_test-1).mean()
Base_AE
0.9889705882352942

Our model's MAE is around 70% of this baseline, which is already a noticeable improvement.

Attempting to Use Principal Component Analysis to improve performance#

dfPost[features].shape
(680, 11)

We currently have 11 features (i.e. 11 dimensions). Such a relatively high number of dimensions for a small DataFrame like dfPost could potentially lead to overfitting. As such, we will use Principal Component Analysis (PCA) for dimension reduction. We will first reduce to 2 dimensions, in order to aid with visualization.
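
One caveat worth noting before we run PCA on the raw features: PCA is sensitive to feature scale, and our features live on very different scales (BARTHAG is a probability in [0, 1], while ADJE_diff is measured in points per 100 possessions). A standardized variant would look like the sketch below; the analysis that follows sticks with the raw features.

# Optional variant (not used below): standardize features before PCA so no
# single large-scale column dominates the principal components
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2))
X_scaled_pca = scaled_pca.fit_transform(dfPost[features])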

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(dfPost[features])
X_pca.shape
(680, 2)

The above block of code reduces our 11 features to 2 principal components. Now, we will turn X_pca into a DataFrame.

a = dfPost["POSTSEASON"]
c = np.asarray(a)
b = dfPost["POST_OE"]
d = np.asarray(b)
#Converting to plain arrays drops dfPost's original index, so these labels line up with the freshly indexed PCA DataFrame below
dfPost_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
dfPost_pca["POSTSEASON"] = c
dfPost_pca["POST_OE"] = d
dfPost_pca
PC1 PC2 POSTSEASON POST_OE
0 -17.912580 2.843415 2ND 6.0
1 -26.161231 -8.014326 2ND 6.0
2 -12.026691 4.225785 2ND 6.0
3 -17.757074 8.579842 2ND 6.0
4 -24.052305 -8.248363 2ND 6.0
... ... ... ... ...
675 20.066996 6.606842 R64 1.0
676 22.629749 2.186891 R64 1.0
677 21.446552 5.778752 R64 1.0
678 31.801340 4.289538 R68 0.0
679 29.801159 6.687634 R68 0.0

680 rows × 4 columns

Now, we will graph PC2 vs PC1, with the color of the graph being a team’s finish in March Madness.

alt.Chart(dfPost_pca).mark_circle(size = 60).encode(
    x = 'PC1',
    y = 'PC2',
    color = alt.Color('POSTSEASON:N', sort = CorrectSort, scale=alt.Scale(scheme='viridis')),
    tooltip = 'POSTSEASON'
).interactive().properties(
    title = 'PCA of NCAA Basketball Dataset'
)

The graph is interesting: it shows that a team tends to fare better in the NCAA tournament the lower its PC1 score is and the higher its PC2 score is (although the PC2 trend is not as noticeable as the PC1 trend). The explained_variance_ratio_ attribute will show how much of the original data's variance/information the principal components captured.

pca.explained_variance_ratio_
array([0.56290596, 0.16871374])

In other words, PC1 captured 56.3% of the variance, while PC2 captured 16.9%. This is encouraging, because it means we were able to capture around 73% of the information contained in the original 11 features in a DataFrame with only 2 features! This in turn means that any model we train on PC1 and PC2 will be simpler (2 features vs 11), and therefore less prone to overfitting. Let's now train a model.

X = dfPost_pca[["PC1","PC2"]]
y = dfPost_pca["POST_OE"]

We will limit the max_depth because there are only 2 features: PC1 and PC2.

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6, random_state = 0)
rfc_pca = RandomForestClassifier(n_estimators = 400, max_depth = 3, random_state = 9)
rfc_pca.fit(X_train, y_train)
PCA_score = rfc_pca.score(X_test, y_test)
PCA_score
0.5036764705882353

The above model scored 50.4% accuracy, which is still an improvement over 'R64' guessing despite capturing only around 73% of the dataset's information. Let's now examine this model's mean absolute error.

y_pca_pred = rfc_pca.predict(X_test)
PCA_MAE = mean_absolute_error(y_test,y_pca_pred)
PCA_MAE
0.7904411764705882

An MAE of 0.79 is worse than our original random forest's, but it is still a significant improvement over the baseline MAE we would get from 'R64' guessing (reminder: the baseline is roughly 0.99).

Let’s now quickly build an RFC on a DataFrame with principal components that capture at least 90% of dfPost’s information and see how it fares.

pca2 = PCA(n_components=0.9)
X_pca2 = pca2.fit_transform(dfPost[features])
X_pca2.shape
(680, 5)

The shape indicates that we need 5 principal components to capture at least 90% of the dataset’s information. Let’s now create a DataFrame, fit a model to it, and predict the postseason finish.

dfPost_pca2 = pd.DataFrame(X_pca2, columns=['PC1', 'PC2','PC3','PC4', 'PC5'])
dfPost_pca2["POST_OE"] = d
dfPost_pca2
PC1 PC2 PC3 PC4 PC5 POST_OE
0 -17.912580 2.843415 6.105287 2.698300 7.445344 6.0
1 -26.161231 -8.014326 2.371395 3.689217 -4.832190 6.0
2 -12.026691 4.225785 -4.060744 0.901865 -0.303206 6.0
3 -17.757074 8.579842 -7.770423 -4.696640 -0.238826 6.0
4 -24.052305 -8.248363 -5.598479 -8.952695 3.721472 6.0
... ... ... ... ... ... ...
675 20.066996 6.606842 0.106943 4.536261 -4.991573 1.0
676 22.629749 2.186891 2.801789 -1.721679 0.113177 1.0
677 21.446552 5.778752 5.688093 4.014113 -1.256441 1.0
678 31.801340 4.289538 5.937357 1.191128 3.988012 0.0
679 29.801159 6.687634 -0.929121 -0.061041 4.808986 0.0

680 rows × 6 columns

pca2.explained_variance_ratio_
array([0.56290596, 0.16871374, 0.0853449 , 0.0709192 , 0.03845252])

The third, fourth, and fifth PCs don’t seem to capture as much information as the first two, which makes sense.

X_train, X_test, y_train, y_test = train_test_split(dfPost_pca2[dfPost_pca2.columns[:5]], y, train_size=0.6, random_state = 4)
rfc_pca = RandomForestClassifier(n_estimators = 400, random_state = 0)
rfc_pca.fit(X_train, y_train)
rfc_pca.score(X_test, y_test)
0.48161764705882354
y_pca2_pred = rfc_pca.predict(X_test)
mean_absolute_error(y_test, y_pca2_pred)
0.8051470588235294

Using the same seeds as our original random forest classifier, this model performs worse, even though the amount of information captured by the PCs increased. Although this is a small dataset, this could be an example of overfitting, as the model might be capturing too much noise through the extra principal components. Another possible explanation is that this is simply a bad seed for this particular model.
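
One way to separate the 'bad seed' explanation from genuine overfitting (a sketch, not something done in the rest of this project) is to average accuracy over several cross-validation folds instead of relying on a single split:

# Cross-validated accuracy on the 5-component PCA features; averaging over
# folds reduces the dependence on any one train/test split
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=400, random_state=0),
                            dfPost_pca2[dfPost_pca2.columns[:5]], y, cv=5)
print(cv_scores.mean())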

Making predictions using XGBClassifier#

For this section, we will examine whether or not the XGBClassifier (Extreme Gradient Boosting Classifier) performs better than the RandomForestClassifiers we've been using.

XGBoost (XGB)* is similar to random forest classifiers, but it has some differences. For instance, XGB uses a boosting technique, which means it builds a series of weak models (usually decision trees) sequentially. The idea is that each subsequent model corrects the errors of the previous one, and they work together to improve overall prediction accuracy.

*see references
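
To make the boosting idea concrete, here is a toy sketch (plain scikit-learn on made-up 1-D data, not XGBoost itself) in which each new shallow tree is fit to the residual errors of the running ensemble:

# Toy gradient boosting for a 1-D regression problem: each weak tree is fit
# to the current residuals, and its scaled predictions are added to the
# ensemble, so later trees correct the mistakes of earlier ones
from sklearn.tree import DecisionTreeRegressor
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy).ravel() + rng.normal(0, 0.1, 200)
ensemble_pred = np.zeros_like(y_toy)
learning_rate = 0.1
for _ in range(50):
    residuals = y_toy - ensemble_pred
    weak_tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, residuals)
    ensemble_pred += learning_rate * weak_tree.predict(X_toy)
print(np.mean((y_toy - ensemble_pred) ** 2))  # training error shrinks as trees are added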

!pip install xgboost
from xgboost import XGBClassifier
Successfully installed xgboost-2.0.2

So far, our most accurate model has been the RFC that predicted the postseason finish based on our 11 features, so we will use the same train_test_split. As a side note, XGBClassifiers can only work with numerical classes, which is one of the reasons we ordinally encoded the POSTSEASON column.

X_train, X_test, y_train, y_test = train_test_split(dfPost[features], dfPost["POST_OE"], train_size=0.6, random_state = 4)
XGBCl = XGBClassifier(n_estimators = 400, random_state = 0)
XGBCl.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=400, n_jobs=None,
              num_parallel_tree=None, objective='multi:softprob', ...)

The XGBClassifier has a number of parameters that can be fine-tuned to try and improve performance. For now, we only set the random seed and n_estimators (400).

XGBCl.score(X_test,y_test)
0.44485294117647056

Such a poor performance indicates that we either landed on a bad seed, need to change some parameters in the classifier, or both. Let's adjust the max_depth and learning_rate parameters to try to improve performance (in essence, a smaller learning rate requires more boosting rounds but can help the algorithm generalize better to the data).
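
A more systematic alternative to manual fiddling (a sketch; the grid values below are illustrative, not tuned) would be a small cross-validated grid search:

# Grid search over max_depth and learning_rate for the XGBClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {"max_depth": [3, 5, 8], "learning_rate": [0.05, 0.09, 0.3]}
search = GridSearchCV(XGBClassifier(n_estimators=400, random_state=9), param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)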

X_train, X_test, y_train, y_test = train_test_split(dfPost[features], dfPost["POST_OE"], train_size=0.6, random_state = 7)
XGBCl = XGBClassifier(n_estimators = 400, max_depth = 8, learning_rate = 0.09, random_state = 9)
XGBCl.fit(X_train, y_train)
XGB_score = XGBCl.score(X_test,y_test)
XGB_score
0.5183823529411765

This is much better than the first XGBClassifier: in fact, with these specific seeds and parameters, the classifier actually has a slightly higher accuracy than our random forest classifier. Let’s now take a look at the mean absolute error.

y_xgb_pred = XGBCl.predict(X_test)
XGB_MAE = mean_absolute_error(y_test, y_xgb_pred)
XGB_MAE
0.7095588235294118

This XGBClassifier actually has around the same MAE as the random forest, which suggests their performances are extremely similar.

Examining Feature Importances#

Now, let’s examine the relative importance of the features in our model, in order to see which features tend to be the most impactful in determining the March Madness finish of a team. Since our XGBClassifier had a slight edge in classifying teams, we will use this model.

FeatImp = XGBCl.feature_importances_  #Idea/command from ChatGPT
Classes = XGBCl.feature_names_in_
bo = pd.DataFrame([])
bo["Feature"] = Classes
bo["Importance"] = FeatImp
bo_sorted = bo.sort_values(by='Importance', ascending=False)
bo_sorted
Feature Importance
0 BARTHAG 0.161660
9 ADJE_diff 0.151615
2 SEED 0.127000
1 WAB 0.079642
8 3P_diff 0.076092
5 RB_diff 0.072741
4 EFG_diff 0.070629
7 2P_diff 0.069117
10 TOR_diff 0.065977
3 ADJ_T 0.064668
6 FTR_diff 0.060859

A higher importance indicates that the model leaned more heavily on that feature when making its splits (which loosely tracks how strongly the feature relates to the target variable). Let's visualize the importances.

#Code to make the chart more ordered
OrderSort = bo_sorted["Feature"].to_list()
alt.Chart(bo_sorted).mark_bar().encode(
    x = alt.X("Feature:N", sort=None),
    y = "Importance:Q",
    color = alt.Color("Feature:N", sort=None),
    tooltip = "Importance"
).properties(
    title = "Feature Importances"
)

As we can see, the BARTHAG, ADJE_diff, and SEED columns are the three most impactful features, while the FTR_diff and Adjusted Tempo columns don't seem to matter much. However, we shouldn't outright cut any features, as each one still appears to have at least some impact on the model's decisions.
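
As a complementary check (a sketch, not part of the analysis above), permutation importance measures how much test accuracy drops when a single feature is shuffled, which depends less on how the boosted trees happened to split:

# Permutation importance of the fitted XGBClassifier on the held-out test set
from sklearn.inspection import permutation_importance
perm = permutation_importance(XGBCl, X_test, y_test, n_repeats=10, random_state=0)
print(pd.Series(perm.importances_mean, index=features).sort_values(ascending=False))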

Examining Specific Features and Decision Tree Boundaries#

Let’s take a closer look at the three most impactful features in XGBCl, to see how they affect the predicted finish of a team. First, we’ll consider the seed of a team.

SEED = pd.DataFrame(dfPost.groupby('POSTSEASON').mean()["SEED"]).reset_index()
alt.Chart(SEED).mark_bar().encode(
    x = alt.X("POSTSEASON", sort = CorrectSort),
    y = "SEED:Q",
    color = alt.Color("POSTSEASON:N", sort = CorrectSort),
    tooltip = "SEED"
).properties(
    width = 200,
    title = "Average Seed for each Tournament Finish"
)

The chart suggests that the higher a team's seed (higher meaning closer to 1), the better its finish in the tournament tends to be, which makes sense given that better teams usually receive the higher seeds. One interesting nugget is that Final Four teams had, on average, lower seeds (closer to 16) than Elite Eight teams. This is likely just an example of the 'madness' and unpredictability of March Madness, combined with the fact that our data is only 680 rows long.

Let’s now move on to BARTHAG.

BART = pd.DataFrame(dfPost.groupby('POSTSEASON').mean()["BARTHAG"]).reset_index()
BART
POSTSEASON BARTHAG
0 2ND 0.947810
1 Champions 0.962280
2 E8 0.914007
3 F4 0.922440
4 R32 0.859177
5 R64 0.730332
6 R68 0.580652
7 S16 0.901175
alt.Chart(BART).mark_bar().encode(
    x = alt.X("POSTSEASON", sort = CorrectSort),
    y = "BARTHAG:Q",
    color = alt.Color("POSTSEASON:N", sort = CorrectSort),
    tooltip = "BARTHAG"
).properties(
    width = 200,
    title = "Average BARTHAG for each Tournament Finish"
)

As we can see, the higher a team’s BARTHAG is, the better they tend to fare in the March Madness tournament. Although this seems obvious when you consider that BARTHAG measures a team’s chances of beating an average D1 opponent, the graph provides more specific information, such as the fact that teams with a BARTHAG below 0.7 are not likely to make it into the round of 64.

Let's now look at the second most impactful feature, the ADJE_diff column, which measures the net points a team scores per 100 possessions (adjusted offensive efficiency minus adjusted defensive efficiency).

ADJE_diff = pd.DataFrame(dfPost.groupby('POSTSEASON').mean()["ADJE_diff"]).reset_index()
ADJE_diff
POSTSEASON ADJE_diff
0 2ND 27.600000
1 Champions 30.380000
2 E8 23.235000
3 F4 23.680000
4 R32 17.511250
5 R64 10.423437
6 R68 3.867500
7 S16 21.247500
alt.Chart(ADJE_diff).mark_bar().encode(
    x = alt.X("POSTSEASON", sort = CorrectSort),
    y = "ADJE_diff:Q",
    color = alt.Color("POSTSEASON:N", sort = CorrectSort),
    tooltip = "ADJE_diff"
).properties(
    width = 200,
    title = "Average Adjusted Efficiency Difference for each Tournament Finish"
)

Again, there seems to be a very direct correlation between the adjusted efficiency difference and a team’s finish, with champions and finalists tending to outscore their opponents over the course of a season by a whopping 28-30 points per 100 possessions. This is useful because it suggests the eventual winner will likely not be a team who wins a lot of close games, but rather a team that wins a lot of blowouts.

Finally, we will make a decision boundary graph with our two most important continuous, quantitative features: BARTHAG and the adjusted efficiency difference. We will use a single Decision Tree Classifier this time so that the boundaries are clearer.

First, we need to generate arrays of data points to cover the chart area; we will sample them uniformly at random with np.random.uniform.

print(dfPost["ADJE_diff"].min())
print(dfPost["ADJE_diff"].max())
-15.299999999999997
36.3
ADJElin = np.random.uniform(-16, 37, 5000)

Since BARTHAG is a probability, we will sample it between 0 and 1:

BARTHAGlin = np.random.uniform(0,1,5000)
Boundary = pd.DataFrame({'ADJE_diff':ADJElin, 'BARTHAG':BARTHAGlin})
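
As an aside, an evenly spaced grid built with np.linspace and np.meshgrid (a sketch, not used below) is a common alternative to random sampling for decision-boundary plots, since it covers the plane uniformly:

# Alternative: a regular 75x75 grid of points over the same ranges
adje_grid, barthag_grid = np.meshgrid(np.linspace(-16, 37, 75), np.linspace(0, 1, 75))
BoundaryGrid = pd.DataFrame({'ADJE_diff': adje_grid.ravel(), 'BARTHAG': barthag_grid.ravel()})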

Now, we fit another model on the original data that only takes into account these two features.

from sklearn.tree import DecisionTreeClassifier
X_train, X_test, y_train, y_test = train_test_split(dfPost[["ADJE_diff","BARTHAG"]], dfPost["POST_OE"], train_size=0.6, random_state = 7)
Tree = DecisionTreeClassifier(max_leaf_nodes = 10, random_state=5)
Tree.fit(X_train, y_train)
Tree_score = Tree.score(X_test,y_test)
Tree_score
0.5404411764705882

Remarkably, this simple decision tree has the best performance of all the models we've trained so far, with 54% accuracy, around 7 percentage points better than our base score. Let's look at its MAE.

y_tree_pred = Tree.predict(X_test)
Tree_MAE = mean_absolute_error(y_test, y_tree_pred)
Tree_MAE
0.6691176470588235

Sure enough, the tree also has the lowest mean absolute error, which means its predictions are, on average, closer to the true value than those of any other model we've trained. This is a sign that most of our previous models may have been overfitting.
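
A quick way to check that intuition (a sketch): compare the simple tree's accuracy on the training set to its accuracy on the test set; a small gap suggests it is not overfitting much.

# Train vs. test accuracy for the 10-leaf tree; a large gap would indicate overfitting
print(Tree.score(X_train, y_train), Tree.score(X_test, y_test))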

Boundary["Prediction_OE"] = Tree.predict(Boundary[["ADJE_diff", "BARTHAG"]])
inverted_mapping = {value: key for key, value in mapping.items()}
Boundary["Predictions"] = Boundary["Prediction_OE"].map(inverted_mapping)
Boundary
ADJE_diff BARTHAG Prediction_OE Predictions
0 17.153968 0.473926 1.0 R64
1 25.676365 0.891518 1.0 R64
2 15.643836 0.534441 1.0 R64
3 26.227244 0.510742 1.0 R64
4 15.106568 0.989752 1.0 R64
... ... ... ... ...
4995 -7.499594 0.406876 0.0 R68
4996 15.435096 0.339736 1.0 R64
4997 2.888958 0.312349 1.0 R64
4998 -8.423352 0.666879 0.0 R68
4999 -8.142750 0.120803 0.0 R68

5000 rows × 4 columns

alt.Chart(Boundary).mark_circle(size=60).encode(
    x = "ADJE_diff:Q",
    y = "BARTHAG",
    color = alt.Color("Predictions", sort = CorrectSort),
    tooltip = "Predictions"
).properties(
    title = "Tree's Predicted Tournament Finish based off ADJE_diff and BARTHAG"
)

Here, we see the downside of this tree: because we limited the maximum number of leaf nodes, it never predicts 'F4', '2ND', or 'Champions', likely because those classes are relatively rare in the dataset. Let's train one more tree with more leaf nodes to get the model to make some of these predictions.

Tree2 = DecisionTreeClassifier(max_leaf_nodes = 70, random_state=5)
Tree2.fit(X_train, y_train)
print(Tree2.score(X_test,y_test))
print(mean_absolute_error(Tree2.predict(X_test), y_test))
0.4889705882352941
0.7610294117647058

As we can see, this tree performs worse: allowing many more leaf nodes lets the tree grow far more complex, which leads to more overfitting.
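
A small sweep over max_leaf_nodes (a sketch) makes the trade-off visible; on this split, accuracy with 70 leaves comes out lower than with 10:

# How test accuracy changes as the tree is allowed more leaves
for leaves in [5, 10, 20, 40, 70]:
    t = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=5).fit(X_train, y_train)
    print(leaves, round(t.score(X_test, y_test), 3))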

Boundary["Prediction2_OE"] = Tree2.predict(Boundary[["ADJE_diff", "BARTHAG"]])
Boundary["Predictions2"] = Boundary["Prediction2_OE"].map(inverted_mapping)
alt.Chart(Boundary).mark_circle(size=60).encode(
    x = "ADJE_diff:Q",
    y = "BARTHAG",
    color = alt.Color("Predictions2:N", sort = CorrectSort, scale = alt.Scale(scheme='category20')),
    tooltip = "Predictions2"
).properties(
    title = "Tree2's Predicted Tournament Finish based off ADJE_diff and BARTHAG"
)

This chart, although it contains predictions from every class, clearly displays signs of overfitting, such as the many narrow, fragmented decision regions and the fact that there are a couple of 'R68' predictions to the left of the 'E8' prediction block. For these reasons, we should not use this tree.

Conclusions#

Let’s now visualize the performances of our various models.

ModelScores = pd.DataFrame({'ModelType': ['Base', 'RFC', 'PCA', 'XGB', 'Tree'], 'Accuracy': [BaseScore, RFC_score, PCA_score, XGB_score, Tree_score], 'MAE': [Base_AE, RFC_MAE, PCA_MAE, XGB_MAE, Tree_MAE]})
ModelScores
ModelType Accuracy MAE
0 Base 0.470588 0.988971
1 RFC 0.511029 0.709559
2 PCA 0.503676 0.790441
3 XGB 0.518382 0.709559
4 Tree 0.540441 0.669118
alt.Chart(ModelScores).mark_bar(width=30).encode(
    x = "ModelType:N",
    y = alt.Y("Accuracy:Q", scale= alt.Scale(domain = [0.40, 0.55])),
    color = "ModelType:N",
    tooltip ="Accuracy:Q"
).properties(
    width = 200,
    height = 300,
    title = 'Accuracy Scores for Different Models'
)
alt.Chart(ModelScores).mark_bar(width=30).encode(
    x = "ModelType:N",
    y = alt.Y("MAE:Q", scale= alt.Scale(domain = [0, 1])),
    color = "ModelType:N",
    tooltip ="MAE:Q"
).properties(
    width = 200,
    height = 300,
    title = 'Mean Absolute Error for Different Models'
)

Ultimately, because it had the best performance on the test set and displayed the least overfitting, we should use our simple decision tree ('Tree') when making predictions on this data, even though it is not as detailed as some of the other models we trained. This shows that relatively 'primitive' models are sometimes the best choice for certain data, because they are less prone to overfitting the training set.

We also learned that certain features in the dataset are more important than others, these being the BARTHAG, adjusted efficiency difference, and seed columns. So, I will probably take a look at these columns the next time I fill out my March Madness bracket, as they seem to have the most impact.

However, the main thing we learned is that the NCAA tournament is called ‘March Madness’ for a reason, as even our best model could only predict the finish of a postseason team with 54% accuracy. In other words, don’t expect to fill out a perfect bracket anytime soon, no matter how many models you build.

On the bright side, the mean absolute error of our best model's predictions was only about 0.67, which shows that on average, the model was very close to the correct prediction even when it was wrong (reminder: predicting '2ND' for the actual champions would be considered close, while predicting 'R68' would not - ordinal data!)

Lastly, it is important to remember that the dataset we used is flawed, in that it contained data from the actual March Madness games. Therefore, the conclusions we drew from this dataset are not entirely reliable, even though we did remove some of the issues in the ‘Features of the Dataset’ section.

Summary#

In this project, I attempted to build a model to predict where a team would finish in the March Madness tournament. First, I removed and added some features before constructing a variety of models to see which one would work best (the decision tree on two features ended up performing best). I then looked at which features were most important in determining a team's predicted finish (the three most important features were BARTHAG, adjusted efficiency difference, and seed).

References#


  • Dataset source: https://www.kaggle.com/datasets/andrewsundberg/college-basketball-dataset

  • Course notes

  • XGBoost reference: https://www.kaggle.com/code/alexisbcook/xgboost

  • ChatGPT helped with feature_importances_ and various other smaller bugs that I ran into.