Predicting Success of MLB Teams#

Author: Eric Cao

Course Project, UC Irvine, Math 10, F23

Introduction#

Among baseball executives and fans, there is much debate about how an MLB team’s roster should be constructed. Some abide by the slogan “Defense wins championships”, while others place more emphasis on offense. The goal of this project is to apply machine learning methods to a dataset of team stats to determine which stats, if any, have the largest effect on a team’s success.

Cleaning the Data#

First import all the necessary modules and functions.

import pandas as pd
import numpy as np
import altair as alt
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
import tensorflow as tf
from sklearn.metrics import mean_squared_error, mean_absolute_error, accuracy_score, confusion_matrix, classification_report
df_raw = pd.read_csv('mlb_teams.csv')
df_raw
year league_id division_id rank games_played home_games wins losses division_winner wild_card_winner ... hits_allowed homeruns_allowed walks_allowed strikeouts_by_pitchers errors double_plays fielding_percentage team_name ball_park home_attendance
0 1876 NL NaN 4 70 NaN 39 31 NaN NaN ... 732 7 104 77 442 42 0.860 Boston Red Caps South End Grounds I NaN
1 1876 NL NaN 1 66 NaN 52 14 NaN NaN ... 608 6 29 51 282 33 0.899 Chicago White Stockings 23rd Street Grounds NaN
2 1876 NL NaN 8 65 NaN 9 56 NaN NaN ... 850 9 34 60 469 45 0.841 Cincinnati Reds Avenue Grounds NaN
3 1876 NL NaN 2 69 NaN 47 21 NaN NaN ... 570 2 27 114 337 27 0.888 Hartford Dark Blues Hartford Ball Club Grounds NaN
4 1876 NL NaN 5 69 NaN 30 36 NaN NaN ... 605 3 38 125 397 44 0.875 Louisville Grays Louisville Baseball Park NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2779 2020 NL C 3 58 27.0 30 28 N Y ... 376 69 204 464 33 46 0.983 St. Louis Cardinals Busch Stadium III 0.0
2780 2020 AL E 1 60 29.0 40 20 Y N ... 475 70 168 552 33 52 0.985 Tampa Bay Rays Tropicana Field 0.0
2781 2020 AL W 5 60 30.0 22 38 N N ... 479 81 236 489 40 40 0.981 Texas Rangers Globe Life Field 0.0
2782 2020 AL E 3 60 26.0 32 28 N Y ... 517 81 250 519 38 47 0.982 Toronto Blue Jays Sahlen Field 0.0
2783 2020 NL E 4 60 33.0 26 34 N N ... 548 94 216 508 39 48 0.981 Washington Nationals Nationals Park 0.0

2784 rows × 41 columns

My dataset contains team data going back to 1876. I will remove the columns that are not relevant to this analysis. Some stats were not yet tracked or collected in the earliest seasons, so I will also remove the rows that are missing this data.

# Drop identifying and administrative columns that will not be used as predictors
df_count = df_raw.drop(['league_id', 'division_id', 'rank', 'home_games',  'team_name', 'ball_park', 'home_attendance'], axis=1)
# Keep only the seasons in which caught_stealing and sacrifice_flies were recorded
df_count = df_count[~df_count['caught_stealing'].isna() & ~df_count['sacrifice_flies'].isna()]
df_count.isna().sum(axis=0)
year                        0
games_played                0
wins                        0
losses                      0
division_winner            28
wild_card_winner          640
league_winner              28
world_series_winner        28
runs_scored                 0
at_bats                     0
hits                        0
doubles                     0
triples                     0
homeruns                    0
walks                       0
strikeouts_by_batters       0
stolen_bases                0
caught_stealing             0
batters_hit_by_pitch        0
sacrifice_flies             0
opponents_runs_scored       0
earned_runs_allowed         0
earned_run_average          0
complete_games              0
shutouts                    0
saves                       0
outs_pitches                0
hits_allowed                0
homeruns_allowed            0
walks_allowed               0
strikeouts_by_pitchers      0
errors                      0
double_plays                0
fielding_percentage         0
dtype: int64

Note: There are missing values in the columns division_winner, wild_card_winner, league_winner, and world_series_winner. I have not removed the rows corresponding to these missing values because I will not be using these columns for the first two models.

Linear Regression Model#

One way to measure a team’s success is its winning percentage. Since the dataset contains only raw win totals, I will add a win_percentage column (wins divided by games played).

df_count['win_percentage'] = df_count['wins'] / df_count['games_played']

Many of the stats use different scales, so I will normalize each stat to the range 0 to 1 using MinMaxScaler, which rescales a column via \(x' = (x - x_{\min})/(x_{\max} - x_{\min})\). Putting all the features on a common scale allows the machine learning methods below to treat them comparably. I got this idea from ChatGPT.
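As a quick illustrative check (mine, not part of the original analysis), MinMaxScaler sends a column’s minimum to 0 and its maximum to 1, with everything in between mapped linearly:

# Illustrative only: 600 -> 0, 700 -> 0.5, 800 -> 1
demo = pd.DataFrame({'runs': [600, 700, 800]})
print(MinMaxScaler().fit_transform(demo))  # [[0. ] [0.5] [1. ]]

The actual scaling of the stat columns follows.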

df_rate = df_count.copy()
scaler = MinMaxScaler()

# Select every stat column, from runs_scored through fielding_percentage
cols = list(df_rate.loc[:, 'runs_scored': 'fielding_percentage'].columns)
df_rate[cols] = scaler.fit_transform(df_rate[cols])
df_rate
year games_played wins losses division_winner wild_card_winner league_winner world_series_winner runs_scored at_bats ... saves outs_pitches hits_allowed homeruns_allowed walks_allowed strikeouts_by_pitchers errors double_plays fielding_percentage win_percentage
1370 1970 162 76 86 N NaN N N 0.654430 0.941673 ... 0.290323 0.937052 0.791605 0.547170 0.521127 0.440339 0.675978 0.491329 0.391304 0.469136
1371 1970 162 108 54 Y NaN Y Y 0.725316 0.941425 ... 0.403226 0.984018 0.692931 0.373585 0.507042 0.425712 0.541899 0.664740 0.565217 0.666667
1372 1970 162 87 75 N NaN N N 0.717722 0.938943 ... 0.612903 0.952381 0.747423 0.437736 0.702660 0.473441 0.759777 0.566474 0.260870 0.537037
1373 1970 162 86 76 N NaN N N 0.521519 0.938198 ... 0.693548 0.968037 0.665685 0.430189 0.647887 0.411085 0.597765 0.786127 0.521739 0.530864
1374 1970 162 56 106 N NaN N N 0.524051 0.933730 ... 0.387097 0.936725 0.867452 0.467925 0.643192 0.287914 0.810056 0.890173 0.304348 0.345679
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2779 2020 58 30 28 N Y N N 0.026582 0.000000 ... 0.112903 0.000000 0.000000 0.109434 0.092332 0.058507 0.072626 0.075145 0.652174 0.517241
2780 2020 60 40 20 Y N Y N 0.088608 0.055349 ... 0.274194 0.053490 0.072901 0.113208 0.035994 0.126251 0.072626 0.109827 0.739130 0.666667
2781 2020 60 22 38 N N N N 0.006329 0.045669 ... 0.064516 0.042727 0.075847 0.154717 0.142410 0.077752 0.111732 0.040462 0.565217 0.366667
2782 2020 60 32 28 N Y N N 0.105063 0.067262 ... 0.177419 0.050554 0.103829 0.154717 0.164319 0.100847 0.100559 0.080925 0.608696 0.533333
2783 2020 60 26 34 N N N N 0.093671 0.053611 ... 0.096774 0.030007 0.126657 0.203774 0.111111 0.092379 0.106145 0.086705 0.565217 0.433333

1414 rows × 35 columns

Because the target variable win_percentage is quantitative, determining the relationship between the stats and win_percentage is a regression problem. I will therefore fit a simple linear regression between each individual stat and win_percentage, recording each model’s coefficient, mean squared error, and mean absolute error.

reg = LinearRegression()
coefs = []
mse = []
mae = []

# Fit a one-variable regression per stat; record its coefficient and
# the in-sample errors of its predictions
for col in cols:
    reg.fit(df_rate[[col]], df_rate['win_percentage'])
    coefs.append(reg.coef_[0])
    mse.append(mean_squared_error(df_rate['win_percentage'], reg.predict(df_rate[[col]])))
    mae.append(mean_absolute_error(df_rate['win_percentage'], reg.predict(df_rate[[col]])))

df_rate_reg = pd.DataFrame({'Stat': cols, 'Coefficient': coefs, 'Mean Squared Error': mse, 'Mean Absolute Error': mae})
df_rate_reg
Stat Coefficient Mean Squared Error Mean Absolute Error
0 runs_scored 0.188315 0.004267 0.052642
1 at_bats 0.011933 0.005020 0.058330
2 hits 0.087089 0.004877 0.057019
3 doubles 0.076541 0.004881 0.057288
4 triples 0.017559 0.005017 0.058268
5 homeruns 0.128334 0.004606 0.055547
6 walks 0.156944 0.004541 0.054853
7 strikeouts_by_batters -0.024567 0.005004 0.058242
8 stolen_bases 0.068245 0.004951 0.057895
9 caught_stealing -0.024082 0.005011 0.058374
10 batters_hit_by_pitch 0.038366 0.004974 0.058146
11 sacrifice_flies 0.122088 0.004706 0.055930
12 opponents_runs_scored -0.229026 0.004115 0.051231
13 earned_runs_allowed -0.214156 0.004214 0.052116
14 earned_run_average -0.253572 0.003590 0.047939
15 complete_games 0.036581 0.004986 0.058083
16 shutouts 0.163535 0.004180 0.052013
17 saves 0.206479 0.003990 0.050757
18 outs_pitches 0.016956 0.005016 0.058276
19 hits_allowed -0.117936 0.004771 0.056385
20 homeruns_allowed -0.099719 0.004814 0.056881
21 walks_allowed -0.177109 0.004461 0.054053
22 strikeouts_by_pitchers 0.064297 0.004907 0.057443
23 errors -0.105935 0.004768 0.056594
24 double_plays -0.048609 0.004975 0.058041
25 fielding_percentage 0.126001 0.004620 0.055467

Each stat’s corresponding coefficient, mean squared error, and mean absolute error are visualized below in three bar charts.

# Build the three charts, then stack them with Altair's & (vconcat) operator
# so that all three render from a single cell
coef_chart = alt.Chart(df_rate_reg).mark_bar().encode(
    x = 'Stat',
    y = 'Coefficient'
)
mse_chart = alt.Chart(df_rate_reg).mark_bar(color = 'lightgreen').encode(
    x = 'Stat',
    y = 'Mean Squared Error'
)
mae_chart = alt.Chart(df_rate_reg).mark_bar(color = 'darkgreen').encode(
    x = 'Stat',
    y = 'Mean Absolute Error'
)
coef_chart & mse_chart & mae_chart

earned_run_average, earned_runs_allowed, and opponents_runs_scored have the largest coefficients in absolute value, averaging about \(-0.23\). The negative sign means that higher values of each of these three stats correlate with a lower win_percentage, so successful teams allow fewer runs. Furthermore, these three stats have the lowest mean squared error and lowest mean absolute error of all stats, meaning that they are the best single-stat predictors of win_percentage.
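To make that ranking explicit, the results table can be sorted by absolute coefficient (a quick check, not part of the original analysis):

# Quick check: rank the stats by |coefficient| to confirm the strongest predictors
df_rate_reg.assign(abs_coef = df_rate_reg['Coefficient'].abs()).sort_values('abs_coef', ascending=False).head()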

Random Forest Classifier#

A binary way to measure a team’s success is whether it is a winning team, meaning its win percentage is at least 50%. This will be represented by the is_win column.

df_rate['is_win'] = (df_rate['win_percentage'] >= 0.5)
df_rate
year games_played wins losses division_winner wild_card_winner league_winner world_series_winner runs_scored at_bats ... outs_pitches hits_allowed homeruns_allowed walks_allowed strikeouts_by_pitchers errors double_plays fielding_percentage win_percentage is_win
1370 1970 162 76 86 N NaN N N 0.654430 0.941673 ... 0.937052 0.791605 0.547170 0.521127 0.440339 0.675978 0.491329 0.391304 0.469136 False
1371 1970 162 108 54 Y NaN Y Y 0.725316 0.941425 ... 0.984018 0.692931 0.373585 0.507042 0.425712 0.541899 0.664740 0.565217 0.666667 True
1372 1970 162 87 75 N NaN N N 0.717722 0.938943 ... 0.952381 0.747423 0.437736 0.702660 0.473441 0.759777 0.566474 0.260870 0.537037 True
1373 1970 162 86 76 N NaN N N 0.521519 0.938198 ... 0.968037 0.665685 0.430189 0.647887 0.411085 0.597765 0.786127 0.521739 0.530864 True
1374 1970 162 56 106 N NaN N N 0.524051 0.933730 ... 0.936725 0.867452 0.467925 0.643192 0.287914 0.810056 0.890173 0.304348 0.345679 False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2779 2020 58 30 28 N Y N N 0.026582 0.000000 ... 0.000000 0.000000 0.109434 0.092332 0.058507 0.072626 0.075145 0.652174 0.517241 True
2780 2020 60 40 20 Y N Y N 0.088608 0.055349 ... 0.053490 0.072901 0.113208 0.035994 0.126251 0.072626 0.109827 0.739130 0.666667 True
2781 2020 60 22 38 N N N N 0.006329 0.045669 ... 0.042727 0.075847 0.154717 0.142410 0.077752 0.111732 0.040462 0.565217 0.366667 False
2782 2020 60 32 28 N Y N N 0.105063 0.067262 ... 0.050554 0.103829 0.154717 0.164319 0.100847 0.100559 0.080925 0.608696 0.533333 True
2783 2020 60 26 34 N N N N 0.093671 0.053611 ... 0.030007 0.126657 0.203774 0.111111 0.092379 0.106145 0.086705 0.565217 0.433333 False

1414 rows × 36 columns

Because the target variable is_win is categorical, the model will use each row’s stats to predict its class in is_win, which is a classification problem. Therefore, I will use a random forest to model the relationship between the stats and is_win. One of the main advantages of a random forest is its robustness to overfitting. I also split the data into training and testing sets, which allows us to check whether the random forest has overfit by evaluating it on the held-out testing data.

# Hold out 20% of the teams for testing
X_train, X_test, y_train, y_test = train_test_split(df_rate[cols], df_rate['is_win'], test_size=0.2)

# Cap the tree size with max_leaf_nodes to limit overfitting
rfc = RandomForestClassifier(n_estimators=100, max_leaf_nodes=len(cols))
rfc.fit(X_train, y_train)
rfc_train_pred = rfc.predict(X_train)
rfc_test_pred = rfc.predict(X_test)

I will determine the accuracy of the random forest model on both the training and testing data.

accuracy_score(y_train, rfc_train_pred)
0.9540229885057471
accuracy_score(y_test, rfc_test_pred)
0.9293286219081273
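Before interpreting these numbers, one extra guard against a lucky train/test split (a sketch, not in the original notebook) is to average the accuracy over several splits with k-fold cross-validation:

from sklearn.model_selection import cross_val_score

# Average the accuracy of a fresh forest over 5 different train/test folds
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100, max_leaf_nodes=len(cols)), df_rate[cols], df_rate['is_win'], cv=5)
print(cv_scores.mean().round(3), '+/-', cv_scores.std().round(3))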

The accuracy scores of the random forest classifier on the training data and the testing data are both over \(0.90\), which does not suggest that the random forest has overfit the data. Next, I will visualize the test predictions with a confusion matrix, using code adapted from: https://medium.com/analytics-vidhya/evaluating-a-random-forest-model-9d165595ad56

matrix = confusion_matrix(y_test, rfc_test_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(matrix, annot=True, cmap='Blues', fmt='d', cbar=False)

class_names = ['Losing Team', 'Winning Team']
tick_marks = np.arange(len(class_names)) + 0.5
plt.xticks(tick_marks, class_names, rotation=0)
plt.yticks(tick_marks, class_names, rotation=0)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix for Random Forest Classifier')
plt.show()
[Figure: confusion matrix heatmap for the random forest classifier on the test data]

A classification report sums up the confusion matrix with three metrics: precision, recall, and F1 score (recomputed by hand after the report below). According to https://medium.com/analytics-vidhya/evaluating-a-random-forest-model-9d165595ad56:

  1. Precision is the number of correctly-identified members of a class divided by all the times the model predicted that class.

  2. Recall is the number of members of a class that the classifier identified correctly divided by the total number of members in that class.

  3. F1 score combines precision and recall into one metric. If precision and recall are both high, F1 will be high, too. If they are both low, F1 will be low. If one is high and the other low, F1 will be low.

print(classification_report(y_test, rfc_test_pred))
              precision    recall  f1-score   support

       False       0.97      0.89      0.93       144
        True       0.89      0.97      0.93       139

    accuracy                           0.93       283
   macro avg       0.93      0.93      0.93       283
weighted avg       0.93      0.93      0.93       283
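As promised, the metrics for the True (“Winning Team”) class can be recomputed by hand from the confusion matrix; sklearn orders the boolean labels as [False, True], so matrix.ravel() unpacks as shown:

# Recompute precision, recall, and F1 for the True class from the matrix
tn, fp, fn, tp = matrix.ravel()
precision = tp / (tp + fp)  # of all predicted winners, the fraction correct
recall = tp / (tp + fn)     # of all actual winners, the fraction identified
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))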

Because the precision, recall, and F1 scores for both classes are high (\(\geq0.89\)), we can conclude that the random forest model is accurate at predicting whether a team has a winning record from its stats. We can now analyze which stats were most important in determining a team’s success.

pd.Series(rfc.feature_importances_, index=rfc.feature_names_in_).sort_values(ascending=False)
runs_scored               0.192366
earned_run_average        0.134273
opponents_runs_scored     0.113034
earned_runs_allowed       0.091288
saves                     0.067500
hits                      0.064919
walks_allowed             0.048037
homeruns                  0.044631
hits_allowed              0.035169
walks                     0.033901
shutouts                  0.024170
outs_pitches              0.021774
doubles                   0.014424
at_bats                   0.014272
homeruns_allowed          0.012837
errors                    0.012371
sacrifice_flies           0.012179
strikeouts_by_batters     0.008981
fielding_percentage       0.008807
triples                   0.007699
strikeouts_by_pitchers    0.007344
complete_games            0.006878
caught_stealing           0.006677
stolen_bases              0.006267
batters_hit_by_pitch      0.005247
double_plays              0.004955
dtype: float64

runs_scored, opponents_runs_scored, and earned_run_average had the highest feature importances, meaning they contributed most to predicting a team’s success. This agrees with the results of the linear regression model, which found that earned_run_average had the greatest negative effect on winning percentage and runs_scored had one of the greatest positive effects.
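To visualize this ranking, the importances could be plotted in the style of the earlier Altair bar charts (an optional sketch, not in the original notebook):

# Optional: bar chart of the feature importances, mirroring the earlier charts
imp = pd.Series(rfc.feature_importances_, index=rfc.feature_names_in_)
df_imp = imp.rename('Importance').rename_axis('Stat').reset_index()
alt.Chart(df_imp).mark_bar().encode(x = 'Stat', y = 'Importance')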

Neural Network#

Another criterion for judging a team’s success is how far it advances in the playoffs. Now I will drop the rows with missing values in wild_card_winner.

# Use .copy() so the later column assignment does not raise a SettingWithCopyWarning
df_playoffs = df_rate[~df_rate['wild_card_winner'].isna()].copy()
df_playoffs.isna().sum(axis=0)
year                      0
games_played              0
wins                      0
losses                    0
division_winner           0
wild_card_winner          0
league_winner             0
world_series_winner       0
runs_scored               0
at_bats                   0
hits                      0
doubles                   0
triples                   0
homeruns                  0
walks                     0
strikeouts_by_batters     0
stolen_bases              0
caught_stealing           0
batters_hit_by_pitch      0
sacrifice_flies           0
opponents_runs_scored     0
earned_runs_allowed       0
earned_run_average        0
complete_games            0
shutouts                  0
saves                     0
outs_pitches              0
hits_allowed              0
homeruns_allowed          0
walks_allowed             0
strikeouts_by_pitchers    0
errors                    0
double_plays              0
fielding_percentage       0
win_percentage            0
is_win                    0
dtype: int64

I will create a new column in df_playoffs called playoffs that records how far each team advanced in the playoffs: \(0\) corresponds to no playoff series wins, \(1\) to winning the wild card, \(2\) to winning the league championship series, and \(3\) to winning the world series. I created playoffs using code from: https://stackoverflow.com/questions/26886653/create-new-column-based-on-values-from-other-columns-apply-a-function-of-multi

# np.select takes the first matching condition, so order from best outcome
# (world series) down to wild card; unmatched rows default to 0
conditions = [df_playoffs['world_series_winner'] == 'Y', df_playoffs['league_winner'] == 'Y', df_playoffs['wild_card_winner'] == 'Y']
outputs = [3, 2, 1]
df_playoffs['playoffs'] = np.select(conditions, outputs)
df_playoffs
year games_played wins losses division_winner wild_card_winner league_winner world_series_winner runs_scored at_bats ... hits_allowed homeruns_allowed walks_allowed strikeouts_by_pitchers errors double_plays fielding_percentage win_percentage is_win playoffs
2010 1995 144 90 54 Y N Y Y 0.539241 0.759990 ... 0.594993 0.252830 0.455399 0.538106 0.446927 0.462428 0.608696 0.625000 True 3
2011 1995 144 71 73 N N N N 0.613924 0.765699 ... 0.581001 0.411321 0.591549 0.417244 0.290503 0.624277 0.782609 0.493056 False 0
2012 1995 144 86 58 Y N N N 0.724051 0.805411 ... 0.708395 0.328302 0.517997 0.384911 0.558659 0.682081 0.434783 0.597222 True 0
2013 1995 145 78 67 N N N N 0.736709 0.810871 ... 0.687776 0.464151 0.533646 0.394919 0.418994 0.502890 0.608696 0.537931 True 0
2014 1995 145 68 76 N N N N 0.678481 0.821047 ... 0.734904 0.467925 0.738654 0.387991 0.491620 0.566474 0.521739 0.468966 False 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2779 2020 58 30 28 N Y N N 0.026582 0.000000 ... 0.000000 0.109434 0.092332 0.058507 0.072626 0.075145 0.652174 0.517241 True 1
2780 2020 60 40 20 Y N Y N 0.088608 0.055349 ... 0.072901 0.113208 0.035994 0.126251 0.072626 0.109827 0.739130 0.666667 True 2
2781 2020 60 22 38 N N N N 0.006329 0.045669 ... 0.075847 0.154717 0.142410 0.077752 0.111732 0.040462 0.565217 0.366667 False 0
2782 2020 60 32 28 N Y N N 0.105063 0.067262 ... 0.103829 0.154717 0.164319 0.100847 0.100559 0.080925 0.608696 0.533333 True 1
2783 2020 60 26 34 N N N N 0.093671 0.053611 ... 0.126657 0.203774 0.111111 0.092379 0.106145 0.086705 0.565217 0.433333 False 0

774 rows × 37 columns

Because the target variable playoffs is categorical, the model will use each row’s stats to predict its class in playoffs, which is a classification problem. Therefore, I will try a neural network to model the relationship between the stats and playoffs, using the code from https://www.tensorflow.org/tutorials/keras/classification. I again split the data into training and testing sets, which allows us to check whether the neural network has overfit by evaluating it on the held-out testing data.

X_train, X_test, y_train, y_test = train_test_split(df_playoffs[cols], df_playoffs['playoffs'], test_size=0.2)

# One hidden layer of 30 ReLU units; the output layer produces 4 logits,
# one for each playoffs class (0-3)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(30, activation='relu'),
    tf.keras.layers.Dense(4)
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10);
Epoch 1/10
20/20 [==============================] - 1s 5ms/step - loss: 0.9253 - accuracy: 0.8336
Epoch 2/10
20/20 [==============================] - 0s 5ms/step - loss: 0.6009 - accuracy: 0.8627
Epoch 3/10
20/20 [==============================] - 0s 5ms/step - loss: 0.5611 - accuracy: 0.8627
Epoch 4/10
20/20 [==============================] - 0s 5ms/step - loss: 0.5538 - accuracy: 0.8627
Epoch 5/10
20/20 [==============================] - 0s 1ms/step - loss: 0.5485 - accuracy: 0.8627
Epoch 6/10
20/20 [==============================] - 0s 2ms/step - loss: 0.5459 - accuracy: 0.8627
Epoch 7/10
20/20 [==============================] - 0s 2ms/step - loss: 0.5434 - accuracy: 0.8627
Epoch 8/10
20/20 [==============================] - 0s 5ms/step - loss: 0.5411 - accuracy: 0.8627
Epoch 9/10
20/20 [==============================] - 0s 4ms/step - loss: 0.5399 - accuracy: 0.8627
Epoch 10/10
20/20 [==============================] - 0s 5ms/step - loss: 0.5374 - accuracy: 0.8627

The model’s accuracy on the training data is \(0.8627\) (above), while the model’s accuracy on the testing data is \(0.8065\) (below).

test_loss, test_acc = model.evaluate(X_test,  y_test, verbose=2)
print('Test accuracy:', test_acc)
5/5 - 0s - loss: 0.6691 - accuracy: 0.8065 - 86ms/epoch - 17ms/step
Test accuracy: 0.8064516186714172

Now let’s look at the values predicted by the model.

# Append a softmax layer to convert the logits into class probabilities,
# then take the most probable class for each test row
probability_model = tf.keras.Sequential([model, 
                                         tf.keras.layers.Softmax()])
predictions = probability_model.predict(X_test)
lst = [np.argmax(row) for row in predictions]
np.unique(lst, return_counts=True)
5/5 [==============================] - 0s 2ms/step
(array([0]), array([155]))

This indicates that the model predicted every team in the testing data to belong to the \(0\) class of playoffs. Evidently, the neural network was not able to find a meaningful difference in the stats between teams that had playoff success and teams that did not. Note that the classes are heavily imbalanced: most teams win no playoff series, so always predicting \(0\) already yields high accuracy, and the stats gave the model nothing strong enough to beat that baseline. On this evidence, none of the stats appear useful for predicting a team’s playoff success.
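The imbalance is easy to quantify (a quick check, not part of the original notebook):

# Class frequencies, and the accuracy of a model that always predicts class 0
print(df_playoffs['playoffs'].value_counts(normalize=True))
print('Majority-class baseline accuracy:', (y_test == 0).mean().round(3))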

Summary#

Using linear regression and a random forest, I determined that an MLB team’s regular-season success can be predicted from some of its stats. In particular, runs scored and saves correlate positively with wins, while runs allowed and walks allowed correlate negatively with wins. However, as shown by the neural network, there appears to be no detectable relationship between any of the stats and a team’s playoff success.

References#

The following sources, cited above, were used in this project:

  • Evaluating a Random Forest Model: https://medium.com/analytics-vidhya/evaluating-a-random-forest-model-9d165595ad56

  • Create new column based on values from other columns (Stack Overflow): https://stackoverflow.com/questions/26886653/create-new-column-based-on-values-from-other-columns-apply-a-function-of-multi

  • TensorFlow tutorial, Basic classification: https://www.tensorflow.org/tutorials/keras/classification