Predicting Success of MLB Teams#
Author: Eric Cao
Course Project, UC Irvine, Math 10, F23
Introduction#
Among baseball executives and fans, there is much debate about the way a MLB team’s roster should be constructed. Some abide by the slogan “Defense wins championships”, while others place more emphasis on offense. The goal of this project is to apply machine learning methods to a dataset containing teams’ stats to determine which stats, if any, have the largest effect on a team’s success.
Cleaning the Data#
First import all the necessary modules and functions.
import pandas as pd
import numpy as np
import altair as alt
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
import tensorflow as tf
from sklearn.metrics import mean_squared_error, mean_absolute_error, accuracy_score, confusion_matrix, classification_report
2023-12-14 08:52:56.790851: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-14 08:52:57.097621: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-12-14 08:52:57.097645: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-12-14 08:52:57.187274: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-14 08:52:58.839821: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-14 08:52:58.839909: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-14 08:52:58.839919: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
df_raw = pd.read_csv('mlb_teams.csv')
df_raw
year | league_id | division_id | rank | games_played | home_games | wins | losses | division_winner | wild_card_winner | ... | hits_allowed | homeruns_allowed | walks_allowed | strikeouts_by_pitchers | errors | double_plays | fielding_percentage | team_name | ball_park | home_attendance | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1876 | NL | NaN | 4 | 70 | NaN | 39 | 31 | NaN | NaN | ... | 732 | 7 | 104 | 77 | 442 | 42 | 0.860 | Boston Red Caps | South End Grounds I | NaN |
1 | 1876 | NL | NaN | 1 | 66 | NaN | 52 | 14 | NaN | NaN | ... | 608 | 6 | 29 | 51 | 282 | 33 | 0.899 | Chicago White Stockings | 23rd Street Grounds | NaN |
2 | 1876 | NL | NaN | 8 | 65 | NaN | 9 | 56 | NaN | NaN | ... | 850 | 9 | 34 | 60 | 469 | 45 | 0.841 | Cincinnati Reds | Avenue Grounds | NaN |
3 | 1876 | NL | NaN | 2 | 69 | NaN | 47 | 21 | NaN | NaN | ... | 570 | 2 | 27 | 114 | 337 | 27 | 0.888 | Hartford Dark Blues | Hartford Ball Club Grounds | NaN |
4 | 1876 | NL | NaN | 5 | 69 | NaN | 30 | 36 | NaN | NaN | ... | 605 | 3 | 38 | 125 | 397 | 44 | 0.875 | Louisville Grays | Louisville Baseball Park | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2779 | 2020 | NL | C | 3 | 58 | 27.0 | 30 | 28 | N | Y | ... | 376 | 69 | 204 | 464 | 33 | 46 | 0.983 | St. Louis Cardinals | Busch Stadium III | 0.0 |
2780 | 2020 | AL | E | 1 | 60 | 29.0 | 40 | 20 | Y | N | ... | 475 | 70 | 168 | 552 | 33 | 52 | 0.985 | Tampa Bay Rays | Tropicana Field | 0.0 |
2781 | 2020 | AL | W | 5 | 60 | 30.0 | 22 | 38 | N | N | ... | 479 | 81 | 236 | 489 | 40 | 40 | 0.981 | Texas Rangers | Globe Life Field | 0.0 |
2782 | 2020 | AL | E | 3 | 60 | 26.0 | 32 | 28 | N | Y | ... | 517 | 81 | 250 | 519 | 38 | 47 | 0.982 | Toronto Blue Jays | Sahlen Field | 0.0 |
2783 | 2020 | NL | E | 4 | 60 | 33.0 | 26 | 34 | N | N | ... | 548 | 94 | 216 | 508 | 39 | 48 | 0.981 | Washington Nationals | Nationals Park | 0.0 |
2784 rows Ă— 41 columns
My dataset contains data on teams spanning back to 1876. I will remove the unimportant columns. Some stats were not yet created and collected back then, so I will also remove the rows that do not contain this data.
df_count = df_raw.drop(['league_id', 'division_id', 'rank', 'home_games', 'team_name', 'ball_park', 'home_attendance'], axis=1)
df_count = df_count[~df_count['caught_stealing'].isna() & ~df_count['sacrifice_flies'].isna()]
df_count.isna().sum(axis=0)
year 0
games_played 0
wins 0
losses 0
division_winner 28
wild_card_winner 640
league_winner 28
world_series_winner 28
runs_scored 0
at_bats 0
hits 0
doubles 0
triples 0
homeruns 0
walks 0
strikeouts_by_batters 0
stolen_bases 0
caught_stealing 0
batters_hit_by_pitch 0
sacrifice_flies 0
opponents_runs_scored 0
earned_runs_allowed 0
earned_run_average 0
complete_games 0
shutouts 0
saves 0
outs_pitches 0
hits_allowed 0
homeruns_allowed 0
walks_allowed 0
strikeouts_by_pitchers 0
errors 0
double_plays 0
fielding_percentage 0
dtype: int64
Note: There are missing values in the columns division_winner
, wild_card_winner
, league_winner
, and world_series_winner
. I have not removed the rows corresponding to these missing values because I will not be using these columns for the first two models.
Linear Regression Model#
One way to determine a team’s success is by its winning percentage. Since this dataset only contains number of wins, I will add a win_percentage
column to it.
df_count['win_percentage'] = df_count['wins'] / df_count['games_played']
Many of the stats use different scales, so I will normalize all the stats on a scale of 0 to 1 using MinMaxScaler
. This will allow the data to be correctly analyzed via machine learning methods. I got this idea from ChatGPT.
df_rate = df_count.copy()
scaler = MinMaxScaler()
cols = list(df_rate.loc[:, 'runs_scored': 'fielding_percentage'].columns)
df_rate[cols] = scaler.fit_transform(df_rate[cols])
df_rate
year | games_played | wins | losses | division_winner | wild_card_winner | league_winner | world_series_winner | runs_scored | at_bats | ... | saves | outs_pitches | hits_allowed | homeruns_allowed | walks_allowed | strikeouts_by_pitchers | errors | double_plays | fielding_percentage | win_percentage | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1370 | 1970 | 162 | 76 | 86 | N | NaN | N | N | 0.654430 | 0.941673 | ... | 0.290323 | 0.937052 | 0.791605 | 0.547170 | 0.521127 | 0.440339 | 0.675978 | 0.491329 | 0.391304 | 0.469136 |
1371 | 1970 | 162 | 108 | 54 | Y | NaN | Y | Y | 0.725316 | 0.941425 | ... | 0.403226 | 0.984018 | 0.692931 | 0.373585 | 0.507042 | 0.425712 | 0.541899 | 0.664740 | 0.565217 | 0.666667 |
1372 | 1970 | 162 | 87 | 75 | N | NaN | N | N | 0.717722 | 0.938943 | ... | 0.612903 | 0.952381 | 0.747423 | 0.437736 | 0.702660 | 0.473441 | 0.759777 | 0.566474 | 0.260870 | 0.537037 |
1373 | 1970 | 162 | 86 | 76 | N | NaN | N | N | 0.521519 | 0.938198 | ... | 0.693548 | 0.968037 | 0.665685 | 0.430189 | 0.647887 | 0.411085 | 0.597765 | 0.786127 | 0.521739 | 0.530864 |
1374 | 1970 | 162 | 56 | 106 | N | NaN | N | N | 0.524051 | 0.933730 | ... | 0.387097 | 0.936725 | 0.867452 | 0.467925 | 0.643192 | 0.287914 | 0.810056 | 0.890173 | 0.304348 | 0.345679 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2779 | 2020 | 58 | 30 | 28 | N | Y | N | N | 0.026582 | 0.000000 | ... | 0.112903 | 0.000000 | 0.000000 | 0.109434 | 0.092332 | 0.058507 | 0.072626 | 0.075145 | 0.652174 | 0.517241 |
2780 | 2020 | 60 | 40 | 20 | Y | N | Y | N | 0.088608 | 0.055349 | ... | 0.274194 | 0.053490 | 0.072901 | 0.113208 | 0.035994 | 0.126251 | 0.072626 | 0.109827 | 0.739130 | 0.666667 |
2781 | 2020 | 60 | 22 | 38 | N | N | N | N | 0.006329 | 0.045669 | ... | 0.064516 | 0.042727 | 0.075847 | 0.154717 | 0.142410 | 0.077752 | 0.111732 | 0.040462 | 0.565217 | 0.366667 |
2782 | 2020 | 60 | 32 | 28 | N | Y | N | N | 0.105063 | 0.067262 | ... | 0.177419 | 0.050554 | 0.103829 | 0.154717 | 0.164319 | 0.100847 | 0.100559 | 0.080925 | 0.608696 | 0.533333 |
2783 | 2020 | 60 | 26 | 34 | N | N | N | N | 0.093671 | 0.053611 | ... | 0.096774 | 0.030007 | 0.126657 | 0.203774 | 0.111111 | 0.092379 | 0.106145 | 0.086705 | 0.565217 | 0.433333 |
1414 rows Ă— 35 columns
Because the target variable win_percentage
is quantitative, determining the relationship between the stats and win_percentage
is a regression problem. Therefore, I need to fit a linear regression model between each stat and win_percentage
, returning each model’s coefficient, mean squared error, and mean absolute error.
reg = LinearRegression()
coefs = []
mse = []
mae = []
for col in cols:
reg.fit(df_rate[[col]], df_rate['win_percentage'])
coefs.append(reg.coef_[0])
mse.append(mean_squared_error(df_rate['win_percentage'], reg.predict(df_rate[[col]])))
mae.append(mean_absolute_error(df_rate['win_percentage'], reg.predict(df_rate[[col]])))
df_rate_reg = pd.DataFrame({'Stat': cols, 'Coefficient': coefs, 'Mean Squared Error': mse, 'Mean Absolute Error': mae})
df_rate_reg
Stat | Coefficient | Mean Squared Error | Mean Absolute Error | |
---|---|---|---|---|
0 | runs_scored | 0.188315 | 0.004267 | 0.052642 |
1 | at_bats | 0.011933 | 0.005020 | 0.058330 |
2 | hits | 0.087089 | 0.004877 | 0.057019 |
3 | doubles | 0.076541 | 0.004881 | 0.057288 |
4 | triples | 0.017559 | 0.005017 | 0.058268 |
5 | homeruns | 0.128334 | 0.004606 | 0.055547 |
6 | walks | 0.156944 | 0.004541 | 0.054853 |
7 | strikeouts_by_batters | -0.024567 | 0.005004 | 0.058242 |
8 | stolen_bases | 0.068245 | 0.004951 | 0.057895 |
9 | caught_stealing | -0.024082 | 0.005011 | 0.058374 |
10 | batters_hit_by_pitch | 0.038366 | 0.004974 | 0.058146 |
11 | sacrifice_flies | 0.122088 | 0.004706 | 0.055930 |
12 | opponents_runs_scored | -0.229026 | 0.004115 | 0.051231 |
13 | earned_runs_allowed | -0.214156 | 0.004214 | 0.052116 |
14 | earned_run_average | -0.253572 | 0.003590 | 0.047939 |
15 | complete_games | 0.036581 | 0.004986 | 0.058083 |
16 | shutouts | 0.163535 | 0.004180 | 0.052013 |
17 | saves | 0.206479 | 0.003990 | 0.050757 |
18 | outs_pitches | 0.016956 | 0.005016 | 0.058276 |
19 | hits_allowed | -0.117936 | 0.004771 | 0.056385 |
20 | homeruns_allowed | -0.099719 | 0.004814 | 0.056881 |
21 | walks_allowed | -0.177109 | 0.004461 | 0.054053 |
22 | strikeouts_by_pitchers | 0.064297 | 0.004907 | 0.057443 |
23 | errors | -0.105935 | 0.004768 | 0.056594 |
24 | double_plays | -0.048609 | 0.004975 | 0.058041 |
25 | fielding_percentage | 0.126001 | 0.004620 | 0.055467 |
Each stat’s corresponding coefficient, mean squared error, and mean absolute error are visualized below in three bar charts.
alt.Chart(df_rate_reg).mark_bar().encode(
x = 'Stat',
y = 'Coefficient'
)
alt.Chart(df_rate_reg).mark_bar(color = 'lightgreen').encode(
x = 'Stat',
y = 'Mean Squared Error'
)
alt.Chart(df_rate_reg).mark_bar(color = 'darkgreen').encode(
x = 'Stat',
y = 'Mean Absolute Error'
)
earned_run_average
, earned_runs_allowed
, and opponents_run_scored
have the largest absolute value of all coefficients, with an average coefficient of \(-0.22\). The negative value of the coefficient means that higher values of each of the three stats correlates to a lower win_percentage
, so successful teams have lower earned runs allowed. Furthermore, these three stats have the lowest mean squared error and lowest mean absolute error of all stats, meaning that these stats are the best predictors of win_percentage
.
Random Forest Classifier#
A binary way to determine a team’s success is to determine if it is a winning team, which means that its win percentage is over 50%. This will be represented by the is_win
column.
df_rate['is_win'] = (df_rate['win_percentage'] >= 0.5)
df_rate
year | games_played | wins | losses | division_winner | wild_card_winner | league_winner | world_series_winner | runs_scored | at_bats | ... | outs_pitches | hits_allowed | homeruns_allowed | walks_allowed | strikeouts_by_pitchers | errors | double_plays | fielding_percentage | win_percentage | is_win | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1370 | 1970 | 162 | 76 | 86 | N | NaN | N | N | 0.654430 | 0.941673 | ... | 0.937052 | 0.791605 | 0.547170 | 0.521127 | 0.440339 | 0.675978 | 0.491329 | 0.391304 | 0.469136 | False |
1371 | 1970 | 162 | 108 | 54 | Y | NaN | Y | Y | 0.725316 | 0.941425 | ... | 0.984018 | 0.692931 | 0.373585 | 0.507042 | 0.425712 | 0.541899 | 0.664740 | 0.565217 | 0.666667 | True |
1372 | 1970 | 162 | 87 | 75 | N | NaN | N | N | 0.717722 | 0.938943 | ... | 0.952381 | 0.747423 | 0.437736 | 0.702660 | 0.473441 | 0.759777 | 0.566474 | 0.260870 | 0.537037 | True |
1373 | 1970 | 162 | 86 | 76 | N | NaN | N | N | 0.521519 | 0.938198 | ... | 0.968037 | 0.665685 | 0.430189 | 0.647887 | 0.411085 | 0.597765 | 0.786127 | 0.521739 | 0.530864 | True |
1374 | 1970 | 162 | 56 | 106 | N | NaN | N | N | 0.524051 | 0.933730 | ... | 0.936725 | 0.867452 | 0.467925 | 0.643192 | 0.287914 | 0.810056 | 0.890173 | 0.304348 | 0.345679 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2779 | 2020 | 58 | 30 | 28 | N | Y | N | N | 0.026582 | 0.000000 | ... | 0.000000 | 0.000000 | 0.109434 | 0.092332 | 0.058507 | 0.072626 | 0.075145 | 0.652174 | 0.517241 | True |
2780 | 2020 | 60 | 40 | 20 | Y | N | Y | N | 0.088608 | 0.055349 | ... | 0.053490 | 0.072901 | 0.113208 | 0.035994 | 0.126251 | 0.072626 | 0.109827 | 0.739130 | 0.666667 | True |
2781 | 2020 | 60 | 22 | 38 | N | N | N | N | 0.006329 | 0.045669 | ... | 0.042727 | 0.075847 | 0.154717 | 0.142410 | 0.077752 | 0.111732 | 0.040462 | 0.565217 | 0.366667 | False |
2782 | 2020 | 60 | 32 | 28 | N | Y | N | N | 0.105063 | 0.067262 | ... | 0.050554 | 0.103829 | 0.154717 | 0.164319 | 0.100847 | 0.100559 | 0.080925 | 0.608696 | 0.533333 | True |
2783 | 2020 | 60 | 26 | 34 | N | N | N | N | 0.093671 | 0.053611 | ... | 0.030007 | 0.126657 | 0.203774 | 0.111111 | 0.092379 | 0.106145 | 0.086705 | 0.565217 | 0.433333 | False |
1414 rows Ă— 36 columns
Because the target variable is_win
is categorical, the model will use each row’s stats to determine its class in is_win
, which is a classification problem. Therefore, I will use a random forest to model the relationship between stats and is_win
. One of the main advantages of a random forest is its robustness, which counters overfitting. I also separated the data into training and testing data, which allow us to see if the random forest has overfit our data by testing the model on the testing data.
X_train, X_test, y_train, y_test = train_test_split(df_rate[cols], df_rate['is_win'], test_size=0.2)
rfc = RandomForestClassifier(n_estimators=100, max_leaf_nodes=len(cols))
rfc.fit(X_train, y_train)
rfc_train_pred = rfc.predict(X_train)
rfc_test_pred = rfc.predict(X_test)
I will determine the accuracy of the random forest model on both the training and testing data.
accuracy_score(y_train, rfc_train_pred)
0.9540229885057471
accuracy_score(y_test, rfc_test_pred)
0.9293286219081273
The accuracy scores of the random forest classifier on the training data and testing data are both over \(0.90\), which does not suggest that the random forest has overfit the data. Next, I will visualize this with a confusion matrix, the code for which I obtained from: https://medium.com/analytics-vidhya/evaluating-a-random-forest-model-9d165595ad56
matrix = confusion_matrix(y_test, rfc_test_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(matrix, annot=True, cmap='Blues', fmt='d', cbar=False)
class_names = ['Losing Team', 'Winning Team']
tick_marks = np.arange(len(class_names)) + 0.5
plt.xticks(tick_marks, class_names, rotation=0)
plt.yticks(tick_marks, class_names, rotation=0)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix for Random Forest Classifier')
plt.show()
A classification report will sum up the confusion matrix with three metrics: precision, recall, and f1 score. According to https://medium.com/analytics-vidhya/evaluating-a-random-forest-model-9d165595ad56:
Precision is the number of correctly-identified members of a class divided by all the times the model predicted that class.
Recall is the number of members of a class that the classifier identified correctly divided by the total number of members in that class.
F1 score combines precision and recall into one metric. If precision and recall are both high, F1 will be high, too. If they are both low, F1 will be low. If one is high and the other low, F1 will be low.
print(classification_report(y_test, rfc_test_pred))
precision recall f1-score support
False 0.97 0.89 0.93 144
True 0.89 0.97 0.93 139
accuracy 0.93 283
macro avg 0.93 0.93 0.93 283
weighted avg 0.93 0.93 0.93 283
Because the precision, recall, and f1 scores for both classes are high (\(\geq0.89\)), we can conclude that the random forest model is accurate at predicting whether a team has a winning record based on their stats. We can now analyze which stats were most important in determining a team’s success.
pd.Series(rfc.feature_importances_, index=rfc.feature_names_in_).sort_values(ascending=False)
runs_scored 0.192366
earned_run_average 0.134273
opponents_runs_scored 0.113034
earned_runs_allowed 0.091288
saves 0.067500
hits 0.064919
walks_allowed 0.048037
homeruns 0.044631
hits_allowed 0.035169
walks 0.033901
shutouts 0.024170
outs_pitches 0.021774
doubles 0.014424
at_bats 0.014272
homeruns_allowed 0.012837
errors 0.012371
sacrifice_flies 0.012179
strikeouts_by_batters 0.008981
fielding_percentage 0.008807
triples 0.007699
strikeouts_by_pitchers 0.007344
complete_games 0.006878
caught_stealing 0.006677
stolen_bases 0.006267
batters_hit_by_pitch 0.005247
double_plays 0.004955
dtype: float64
runs_scored
, opponents_runs_scored
, and earned_run_average
had the highest feature importances, which means that they were the most significant stats towards a team’s success. This agrees with the previous results found in the linear regression model, which found that earned_run_average
had the greatest negative effect on winning percentage and runs_scored
had one of the greatest positive effects on winning percentage.
Neural Network#
Another criteria to judge a team’s success is how far it goes in the playoffs. Now I will drop the rows with missing values in wild_card_winner
.
df_playoffs = df_rate[~df_rate['wild_card_winner'].isna()]
df_playoffs.isna().sum(axis=0)
year 0
games_played 0
wins 0
losses 0
division_winner 0
wild_card_winner 0
league_winner 0
world_series_winner 0
runs_scored 0
at_bats 0
hits 0
doubles 0
triples 0
homeruns 0
walks 0
strikeouts_by_batters 0
stolen_bases 0
caught_stealing 0
batters_hit_by_pitch 0
sacrifice_flies 0
opponents_runs_scored 0
earned_runs_allowed 0
earned_run_average 0
complete_games 0
shutouts 0
saves 0
outs_pitches 0
hits_allowed 0
homeruns_allowed 0
walks_allowed 0
strikeouts_by_pitchers 0
errors 0
double_plays 0
fielding_percentage 0
win_percentage 0
is_win 0
dtype: int64
I will create a new column in df_playoffs
called playoffs
that correlates to how far each team made in the playoffs. \(0\) correlates to no playoff series wins, \(1\) correlates to winning the wild card, \(2\) correlates to winning the league championship series, and \(3\) correlates to winning the world series. I created playoffs
using code from: https://stackoverflow.com/questions/26886653/create-new-column-based-on-values-from-other-columns-apply-a-function-of-multi
conditions = [df_playoffs['world_series_winner'] == 'Y', df_playoffs['league_winner'] == 'Y', df_playoffs['wild_card_winner'] == 'Y']
outputs = [3, 2, 1]
df_playoffs['playoffs'] = np.select(conditions, outputs)
df_playoffs
/tmp/ipykernel_37/169299682.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_playoffs['playoffs'] = np.select(conditions, outputs)
year | games_played | wins | losses | division_winner | wild_card_winner | league_winner | world_series_winner | runs_scored | at_bats | ... | hits_allowed | homeruns_allowed | walks_allowed | strikeouts_by_pitchers | errors | double_plays | fielding_percentage | win_percentage | is_win | playoffs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2010 | 1995 | 144 | 90 | 54 | Y | N | Y | Y | 0.539241 | 0.759990 | ... | 0.594993 | 0.252830 | 0.455399 | 0.538106 | 0.446927 | 0.462428 | 0.608696 | 0.625000 | True | 3 |
2011 | 1995 | 144 | 71 | 73 | N | N | N | N | 0.613924 | 0.765699 | ... | 0.581001 | 0.411321 | 0.591549 | 0.417244 | 0.290503 | 0.624277 | 0.782609 | 0.493056 | False | 0 |
2012 | 1995 | 144 | 86 | 58 | Y | N | N | N | 0.724051 | 0.805411 | ... | 0.708395 | 0.328302 | 0.517997 | 0.384911 | 0.558659 | 0.682081 | 0.434783 | 0.597222 | True | 0 |
2013 | 1995 | 145 | 78 | 67 | N | N | N | N | 0.736709 | 0.810871 | ... | 0.687776 | 0.464151 | 0.533646 | 0.394919 | 0.418994 | 0.502890 | 0.608696 | 0.537931 | True | 0 |
2014 | 1995 | 145 | 68 | 76 | N | N | N | N | 0.678481 | 0.821047 | ... | 0.734904 | 0.467925 | 0.738654 | 0.387991 | 0.491620 | 0.566474 | 0.521739 | 0.468966 | False | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2779 | 2020 | 58 | 30 | 28 | N | Y | N | N | 0.026582 | 0.000000 | ... | 0.000000 | 0.109434 | 0.092332 | 0.058507 | 0.072626 | 0.075145 | 0.652174 | 0.517241 | True | 1 |
2780 | 2020 | 60 | 40 | 20 | Y | N | Y | N | 0.088608 | 0.055349 | ... | 0.072901 | 0.113208 | 0.035994 | 0.126251 | 0.072626 | 0.109827 | 0.739130 | 0.666667 | True | 2 |
2781 | 2020 | 60 | 22 | 38 | N | N | N | N | 0.006329 | 0.045669 | ... | 0.075847 | 0.154717 | 0.142410 | 0.077752 | 0.111732 | 0.040462 | 0.565217 | 0.366667 | False | 0 |
2782 | 2020 | 60 | 32 | 28 | N | Y | N | N | 0.105063 | 0.067262 | ... | 0.103829 | 0.154717 | 0.164319 | 0.100847 | 0.100559 | 0.080925 | 0.608696 | 0.533333 | True | 1 |
2783 | 2020 | 60 | 26 | 34 | N | N | N | N | 0.093671 | 0.053611 | ... | 0.126657 | 0.203774 | 0.111111 | 0.092379 | 0.106145 | 0.086705 | 0.565217 | 0.433333 | False | 0 |
774 rows Ă— 37 columns
Because the target variable is_win
is categorical, the model will use each row’s stats to determine its class in playoffs
, which is a classification problem. Therefore, I will try to use a neural network to model the relationship between stats and playoffs
, using the code from https://www.tensorflow.org/tutorials/keras/classification. I also separated the data into training and testing data, which allow us to see if the neural network has overfit our data by testing the model on the testing data.
X_train, X_test, y_train, y_test = train_test_split(df_playoffs[cols], df_playoffs['playoffs'], test_size=0.2)
model = tf.keras.Sequential([
tf.keras.layers.Dense(30, activation='relu'),
tf.keras.layers.Dense(4)
])
model.compile(optimizer='adam',
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10);
Epoch 1/10
20/20 [==============================] - 1s 5ms/step - loss: 0.9253 - accuracy: 0.8336
Epoch 2/10
20/20 [==============================] - 0s 5ms/step - loss: 0.6009 - accuracy: 0.8627
Epoch 3/10
20/20 [==============================] - 0s 5ms/step - loss: 0.5611 - accuracy: 0.8627
Epoch 4/10
20/20 [==============================] - 0s 5ms/step - loss: 0.5538 - accuracy: 0.8627
Epoch 5/10
20/20 [==============================] - 0s 1ms/step - loss: 0.5485 - accuracy: 0.8627
Epoch 6/10
20/20 [==============================] - 0s 2ms/step - loss: 0.5459 - accuracy: 0.8627
Epoch 7/10
20/20 [==============================] - 0s 2ms/step - loss: 0.5434 - accuracy: 0.8627
Epoch 8/10
20/20 [==============================] - 0s 5ms/step - loss: 0.5411 - accuracy: 0.8627
Epoch 9/10
20/20 [==============================] - 0s 4ms/step - loss: 0.5399 - accuracy: 0.8627
Epoch 10/10
20/20 [==============================] - 0s 5ms/step - loss: 0.5374 - accuracy: 0.8627
The model’s accuracy on the training data is \(0.8627\) (above), while the model’s accuracy on the testing data is \(0.8065\) (below).
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=2)
print('Test accuracy:', test_acc)
5/5 - 0s - loss: 0.6691 - accuracy: 0.8065 - 86ms/epoch - 17ms/step
Test accuracy: 0.8064516186714172
Now let’s look at the values predicted by the model.
probability_model = tf.keras.Sequential([model,
tf.keras.layers.Softmax()])
predictions = probability_model.predict(X_test)
lst = [np.argmax(row) for row in predictions]
np.unique(lst, return_counts=True)
5/5 [==============================] - 0s 2ms/step
(array([0]), array([155]))
This indicates that the model has predicted all teams in the testing data to belong to the \(0\) class of playoffs
. Evidently, the neuron network was not able to distinguish a significant difference in the stats between teams that had playoff success and teams that did not. Therefore, none of the stats are statistically significant in determining a team’s playoff success.
Summary#
Using linear regression and a random forest, I determined that a MLB team’s success during the regular season can be predicted by some stats. In particular, runs scored and saves positively correlate to wins, while runs allowed and walks allowed negatively correlate to wins. However, there appears to be no correlation between any of the stats and a team’s playoff success, as shown by a neural network.
References#
Your code above should include references. Here is some additional space for references.
What is the source of your dataset(s)? https://www.retrosheet.org
List any other references that you found helpful.