The Effects of Every Part of a Car on Its Acceleration#
Author: Mianfu Zhong
Course Project, UC Irvine, Math 10, F23
Introduction#
This project uses a dataset of car specifications to identify the factors that affect a car’s acceleration. It also discusses in detail how the analysis is done and why those factors are related.
The Preparation Step#
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
# Import all the tools we might need, including pandas, NumPy, matplotlib, seaborn, and scikit-learn.
df = pd.read_table('auto-mpg.data', delim_whitespace=True, header=None, names=['mpg', 'cylinders', 'displacement','horsepower','weight','acceleration','model_year','origin','car_name'])
# Because the file uses the ".data" extension, we have to read it with an explicit format like the above.
# If you download and inspect the data, it is obvious that the values are separated by whitespace,
# so we pass "delim_whitespace=True" to tell pandas how to split the fields.
# Let's see what it looks like first
df
|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | 27.0 | 4 | 140.0 | 86.00 | 2790.0 | 15.6 | 82 | 1 | ford mustang gl |
| 394 | 44.0 | 4 | 97.0 | 52.00 | 2130.0 | 24.6 | 82 | 2 | vw pickup |
| 395 | 32.0 | 4 | 135.0 | 84.00 | 2295.0 | 11.6 | 82 | 1 | dodge rampage |
| 396 | 28.0 | 4 | 120.0 | 79.00 | 2625.0 | 18.6 | 82 | 1 | ford ranger |
| 397 | 31.0 | 4 | 119.0 | 82.00 | 2720.0 | 19.4 | 82 | 1 | chevy s-10 |

398 rows × 9 columns
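(A side note, sketched under the assumption of a newer pandas version: delim_whitespace is deprecated there, and the same read can also fold in the missing-value handling we do below by treating "?" as NaN at load time. df_alt is just an illustrative name.)

# Equivalent read on newer pandas: split on any whitespace and mark "?" as missing up front
df_alt = pd.read_csv('auto-mpg.data', sep=r'\s+', header=None, na_values='?',
                     names=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                            'acceleration', 'model_year', 'origin', 'car_name'])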
# Before deciding which method to use to answer our question,
# there are still things to do beforehand, such as:
# data exploration, and dealing with missing and duplicate values.
# Remove rows with missing values.
# !!!! Notice that in the second line below, I replace every "?" with "NA", because some columns have dtype=object and "?" cannot be detected as a missing value.
df = df.dropna()
df = df.replace('?', pd.NA).dropna()
# Data Cleaning
df = df.drop_duplicates()
# Data exploration
print("Data Exploration:")
print(df.info())
print(df.describe())
df
Data Exploration:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 397
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mpg 392 non-null float64
1 cylinders 392 non-null int64
2 displacement 392 non-null float64
3 horsepower 392 non-null object
4 weight 392 non-null float64
5 acceleration 392 non-null float64
6 model_year 392 non-null int64
7 origin 392 non-null int64
8 car_name 392 non-null object
dtypes: float64(4), int64(3), object(2)
memory usage: 30.6+ KB
None
mpg cylinders displacement weight acceleration \
count 392.000000 392.000000 392.000000 392.000000 392.000000
mean 23.445918 5.471939 194.411990 2977.584184 15.541327
std 7.805007 1.705783 104.644004 849.402560 2.758864
min 9.000000 3.000000 68.000000 1613.000000 8.000000
25% 17.000000 4.000000 105.000000 2225.250000 13.775000
50% 22.750000 4.000000 151.000000 2803.500000 15.500000
75% 29.000000 8.000000 275.750000 3614.750000 17.025000
max 46.600000 8.000000 455.000000 5140.000000 24.800000
model_year origin
count 392.000000 392.000000
mean 75.979592 1.576531
std 3.683737 0.805518
min 70.000000 1.000000
25% 73.000000 1.000000
50% 76.000000 1.000000
75% 79.000000 2.000000
max 82.000000 3.000000
|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | 27.0 | 4 | 140.0 | 86.00 | 2790.0 | 15.6 | 82 | 1 | ford mustang gl |
| 394 | 44.0 | 4 | 97.0 | 52.00 | 2130.0 | 24.6 | 82 | 2 | vw pickup |
| 395 | 32.0 | 4 | 135.0 | 84.00 | 2295.0 | 11.6 | 82 | 1 | dodge rampage |
| 396 | 28.0 | 4 | 120.0 | 79.00 | 2625.0 | 18.6 | 82 | 1 | ford ranger |
| 397 | 31.0 | 4 | 119.0 | 82.00 | 2720.0 | 19.4 | 82 | 1 | chevy s-10 |

392 rows × 9 columns
# Note that besides the other factors that might influence acceleration,
# there is a column named "origin" that might be an indirect factor,
# which may require fitting the model both with and without it
# so that we can see whether it matters.
# !!!! Not to spoil anything, but if you look at the names of our features, you might notice that
# some of them are not related to acceleration yet might still show a high goodness of fit.
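Before the charts, a minimal sketch of a first look at the data: seaborn is already imported above, and a correlation heatmap hints at which features move together with acceleration. (Note that horsepower is still object dtype at this point, so select_dtypes leaves it out.)

corr = df.select_dtypes('number').corr()  # horsepower is excluded here: it is still object dtype
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')  # annotate each cell with its correlation
plt.title('Correlations among the numeric columns')
plt.show()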
Call to Our Adventure: A First Guess at the Result#
What might the main factors be? Horsepower? Cylinders? Or weight?
Because we are studying acceleration, let’s recall the formula from Physics 7: F = ma, so a = F/m. In the rotational case, “angular acceleration” replaces “acceleration” and “torque” replaces “force”.
However, torque is not in the dataset. The columns most related to torque might be “weight” and “model_year”, since model_year represents the technology level. (Remember, in the fitting step, to fit each model year separately and then average the results, so that the year does not bias the other terms.)
In that case, common sense suggests that cylinders and horsepower could be the main factors. Let’s check that with an Altair chart.
BUT, before all of the above, we should first find out what the acceleration value means: is it the time from 0 to 60 mph, or a rate of speed gain (e.g., in m/s²)?
# To see what the ACCELERATION value really means in this data, we first look at the relationship between weight and acceleration.
# It is intuitive that the heavier a car is, the longer it needs to accelerate.
c1 = alt.Chart(df).mark_point().encode(
x = 'weight:Q',
y = 'acceleration:Q',
tooltip = ['weight', 'acceleration','cylinders','model_year','car_name','horsepower']
)
trendline = c1.transform_regression('weight', 'acceleration').mark_line(color='red')
chart_with_trendline1 = (c1 + trendline)
chart_with_trendline1
Weird! Why does the acceleration value go down as weight increases? Or is the acceleration value NOT the time needed to go from 0 to 60 mph?
Don’t worry, we can check using horsepower, which should be the clearest indicator of acceleration!!
c2 = alt.Chart(df).mark_point().encode(
x = 'horsepower:Q',
y = 'acceleration:Q',
tooltip = ['weight', 'acceleration','cylinders','model_year','car_name']
)
c2
# From a rough look at the chart, it seems that as horsepower goes up, the acceleration value goes down,
# which is really weird, because it is the opposite of the weight-acceleration relationship we just saw.
# !! Let's check their relationship using a regression line.
trendline2 = c2.transform_regression('horsepower', 'acceleration').mark_line(color='red', size=3)  # size sets the line's stroke width in pixels
chart_with_trendline2 = (c2 + trendline2)
chart_with_trendline2
Nice! Now we can be confident that the acceleration value represents acceleration ability: the larger the value, the faster the car accelerates!
Also, we verify our first guess that horsepower and weight do affect acceleration.
The next step is to find the best factors to feed into our model.
Approaching the Inner Cave: the OLS Linear Regression Model#
Find out which factors should be used to train our model (EXTRA COMPONENT)
WHY use the OLS model from ‘statsmodels’ NOW?
We now want to find out which factors to put inside our scikit-learn model.
The ‘statsmodels’ library emphasizes statistical modeling and hypothesis testing: it provides a detailed statistical summary of the regression results, letting us analyze the statistical significance of the coefficients.
scikit-learn cannot give us what statsmodels provides here (e.g., the two points above).
pip install statsmodels
Collecting statsmodels
Downloading statsmodels-0.14.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.1/10.1 MB 72.2 MB/s eta 0:00:00
Requirement already satisfied: scipy!=1.9.2,>=1.4 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from statsmodels) (1.9.3)
Requirement already satisfied: numpy>=1.18 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from statsmodels) (1.23.4)
Requirement already satisfied: pandas>=1.0 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from statsmodels) (1.2.5)
Collecting patsy>=0.5.2
Downloading patsy-0.5.4-py2.py3-none-any.whl (233 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 233.9/233.9 kB 50.9 MB/s eta 0:00:00
Requirement already satisfied: packaging>=21.3 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from statsmodels) (21.3)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from packaging>=21.3->statsmodels) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from pandas>=1.0->statsmodels) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from pandas>=1.0->statsmodels) (2022.5)
Requirement already satisfied: six in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from patsy>=0.5.2->statsmodels) (1.16.0)
Installing collected packages: patsy, statsmodels
Successfully installed patsy-0.5.4 statsmodels-0.14.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip is available: 23.0.1 -> 23.3.1
[notice] To update, run: python -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
import statsmodels.api as sm
allfactors= df[['weight','mpg','cylinders','model_year','origin','horsepower']]
X = sm.add_constant(allfactors)
y = df['acceleration']
olsmodel = sm.OLS(y,X).fit()
print(olsmodel.summary())
# This errors out because the horsepower column is recognized as object dtype; let's convert it to numeric first and try again.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In [16], line 5
3 X = sm.add_constant(allfactors)
4 y = df['acceleration']
----> 5 olsmodel = sm.OLS(y,X).fit()
6 print(olsmodel.summary())
File /usr/local/lib/python3.9/site-packages/statsmodels/regression/linear_model.py:922, in OLS.__init__(self, endog, exog, missing, hasconst, **kwargs)
919 msg = ("Weights are not supported in OLS and will be ignored"
920 "An exception will be raised in the next version.")
921 warnings.warn(msg, ValueWarning)
--> 922 super(OLS, self).__init__(endog, exog, missing=missing,
923 hasconst=hasconst, **kwargs)
924 if "weights" in self._init_keys:
925 self._init_keys.remove("weights")
File /usr/local/lib/python3.9/site-packages/statsmodels/regression/linear_model.py:748, in WLS.__init__(self, endog, exog, weights, missing, hasconst, **kwargs)
746 else:
747 weights = weights.squeeze()
--> 748 super(WLS, self).__init__(endog, exog, missing=missing,
749 weights=weights, hasconst=hasconst, **kwargs)
750 nobs = self.exog.shape[0]
751 weights = self.weights
File /usr/local/lib/python3.9/site-packages/statsmodels/regression/linear_model.py:202, in RegressionModel.__init__(self, endog, exog, **kwargs)
201 def __init__(self, endog, exog, **kwargs):
--> 202 super(RegressionModel, self).__init__(endog, exog, **kwargs)
203 self.pinv_wexog: Float64Array | None = None
204 self._data_attr.extend(['pinv_wexog', 'wendog', 'wexog', 'weights'])
File /usr/local/lib/python3.9/site-packages/statsmodels/base/model.py:270, in LikelihoodModel.__init__(self, endog, exog, **kwargs)
269 def __init__(self, endog, exog=None, **kwargs):
--> 270 super().__init__(endog, exog, **kwargs)
271 self.initialize()
File /usr/local/lib/python3.9/site-packages/statsmodels/base/model.py:95, in Model.__init__(self, endog, exog, **kwargs)
93 missing = kwargs.pop('missing', 'none')
94 hasconst = kwargs.pop('hasconst', None)
---> 95 self.data = self._handle_data(endog, exog, missing, hasconst,
96 **kwargs)
97 self.k_constant = self.data.k_constant
98 self.exog = self.data.exog
File /usr/local/lib/python3.9/site-packages/statsmodels/base/model.py:135, in Model._handle_data(self, endog, exog, missing, hasconst, **kwargs)
134 def _handle_data(self, endog, exog, missing, hasconst, **kwargs):
--> 135 data = handle_data(endog, exog, missing, hasconst, **kwargs)
136 # kwargs arrays could have changed, easier to just attach here
137 for key in kwargs:
File /usr/local/lib/python3.9/site-packages/statsmodels/base/data.py:675, in handle_data(endog, exog, missing, hasconst, **kwargs)
672 exog = np.asarray(exog)
674 klass = handle_data_class_factory(endog, exog)
--> 675 return klass(endog, exog=exog, missing=missing, hasconst=hasconst,
676 **kwargs)
File /usr/local/lib/python3.9/site-packages/statsmodels/base/data.py:84, in ModelData.__init__(self, endog, exog, missing, hasconst, **kwargs)
82 self.orig_endog = endog
83 self.orig_exog = exog
---> 84 self.endog, self.exog = self._convert_endog_exog(endog, exog)
86 self.const_idx = None
87 self.k_constant = 0
File /usr/local/lib/python3.9/site-packages/statsmodels/base/data.py:509, in PandasData._convert_endog_exog(self, endog, exog)
507 exog = exog if exog is None else np.asarray(exog)
508 if endog.dtype == object or exog is not None and exog.dtype == object:
--> 509 raise ValueError("Pandas data cast to numpy dtype of object. "
510 "Check input data with np.asarray(data).")
511 return super(PandasData, self)._convert_endog_exog(endog, exog)
ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
# Now the code above should work.
import statsmodels.api as sm
allfactors= df[['weight','mpg','cylinders','model_year','origin','horsepower']]
X = sm.add_constant(allfactors)
y = df['acceleration']
olsmodel = sm.OLS(y,X).fit()
print(olsmodel.summary())
OLS Regression Results
==============================================================================
Dep. Variable: acceleration R-squared: 0.615
Model: OLS Adj. R-squared: 0.609
Method: Least Squares F-statistic: 102.7
Date: Thu, 14 Dec 2023 Prob (F-statistic): 9.17e-77
Time: 07:17:22 Log-Likelihood: -766.23
No. Observations: 392 AIC: 1546.
Df Residuals: 385 BIC: 1574.
Df Model: 6
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 21.3419 2.192 9.738 0.000 17.033 25.651
weight 0.0030 0.000 9.735 0.000 0.002 0.004
mpg 0.0142 0.026 0.543 0.587 -0.037 0.066
cylinders -0.3920 0.124 -3.168 0.002 -0.635 -0.149
model_year -0.0478 0.033 -1.464 0.144 -0.112 0.016
origin 0.1015 0.140 0.724 0.469 -0.174 0.377
horsepower -0.0912 0.005 -18.202 0.000 -0.101 -0.081
==============================================================================
Omnibus: 46.182 Durbin-Watson: 1.672
Prob(Omnibus): 0.000 Jarque-Bera (JB): 67.916
Skew: 0.782 Prob(JB): 1.79e-15
Kurtosis: 4.309 Cond. No. 7.80e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.8e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Definitions: “t”: the t-statistic tests the null hypothesis that a variable’s coefficient equals zero. “P>|t|”: the p-value associated with that t-statistic.
Summary: a small p-value (typically < 0.05) suggests you can reject the null hypothesis. A large absolute t-statistic (with a correspondingly small p-value) suggests the variable is likely a significant predictor in the model.
Now, with those definitions in mind, let’s look at the table above, focusing only on the “t” and “P>|t|” columns.
We find that only “weight”, “horsepower”, “cylinders”, and “const” have relatively large absolute t-values together with p-values smaller than 5%, as the sketch below confirms.
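As a small programmatic check, a sketch using the fitted olsmodel from above: the significant predictors at the 5% level can be read straight off the results object.

significant = olsmodel.pvalues[olsmodel.pvalues < 0.05]  # keep only p-values below 0.05
print(significant)  # expect const, weight, cylinders, horsepower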
It is easy to understand why weight, horsepower, and cylinders would affect acceleration, but what is “const”?
“const” stands for the constant, i.e. the intercept β₀ in the OLS formula Y = β₀ + β₁X₁ + … + βₚXₚ + ε (where ε is the error term). In our case, we know that if all the other factors were zero, the acceleration should also be zero, which means the intercept has no physical meaning and exists only because it makes the model fit better.
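If the intercept really is physically meaningless, one quick experiment (a sketch, not part of the original analysis) is to refit without add_constant; note that statsmodels then reports an uncentered R², so the two fits can only be compared loosely.

olsmodel_nc = sm.OLS(y, allfactors).fit()  # no constant: regression through the origin
print(olsmodel_nc.rsquared)  # uncentered R^2; compare with olsmodel.rsquared only loosely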
How exciting! We now know which factors affect the acceleration most: “horsepower”, “cylinders”, “weight”. LET’S run them through a scikit-learn model.
Aiming at Our Goal: Use scikit-learn to Predict and Verify Our Guess#
Let’s try the linear regression model first, as it is simple and explainable.
X = df[['weight','cylinders','horsepower']]  # predictor variables
y = df["acceleration"]  # target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)  # train-test split
model =LinearRegression()
model.fit(X_train, y_train)
model.score(X_test,y_test)
0.6354558360449379
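For reference, a quick sketch that prints the fitted coefficients so they can be compared against the OLS table above (we would hope for the same sign pattern):

for name, coef in zip(X.columns, model.coef_):
    print(f'{name}: {coef:.4f}')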
Seemingly, the model is not performing well. It might be because we used only the 3 predictors we subjectively consider important. Let’s try fitting it with all the variables.
X = allfactors  # predictor variables
y = df["acceleration"]  # target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)  # train-test split
model =LinearRegression()
model.fit(X_train, y_train)
model.score(X_test,y_test)
0.6277449368818655
Not a good sign!! It seems that we chose the wrong model, or that the variables are simply not that relevant. Or maybe it is just that the linear regression model assumes the variables are linearly related. Before switching models, a quick sanity check is sketched below.
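A hedged aside: a single train/test split can be noisy, so 5-fold cross-validation (a sketch) gives a steadier estimate of the linear model’s R².

from sklearn.model_selection import cross_val_score
scores = cross_val_score(LinearRegression(), X, y, cv=5)  # 5-fold cross-validated R^2
print(scores.mean(), scores.std())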
Let’s try decision trees, because they can handle non-linear relationships!
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
model.score(X_test,y_test)
0.44422080245934037
EVEN WORSE!! (No need to test whether it is the X input: changing X to the 3 variables we found matter most keeps the score at approximately 0.44.)
This is probably because the tree overfits easily. Let’s check whether it is overfitted.
model.score(X_train, y_train)
0.9975355913598829
model.score(X_test, y_test)
0.44422080245934037
Well, it sure did overfit. One quick fix, sketched below, is to cap the tree’s depth; another is a random forest, since averaging many trees tends to solve the problem a single decision tree has. Let’s try both.
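A minimal sketch first (max_depth=4 is an arbitrary illustrative choice, not a tuned value): capping the tree’s depth usually narrows the train/test gap.

shallow = DecisionTreeRegressor(max_depth=4, random_state=12)  # limit depth to curb overfitting
shallow.fit(X_train, y_train)
print(shallow.score(X_train, y_train), shallow.score(X_test, y_test))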
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
model.score(X_test,y_test)
0.681217641805061
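As an aside, a sketch using the forest just fitted: feature importances give another view of which inputs matter most, and they should roughly echo the OLS findings.

for name, imp in zip(X.columns, model.feature_importances_):
    print(f'{name}: {imp:.3f}')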
Let’s try yet another kind of method!
from sklearn.neural_network import MLPRegressor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=1000, random_state=42)
model.fit(X_train, y_train)
model.score(X_test, y_test)
0.40588689134599465
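One hedged suggestion before giving up on the MLP: neural networks are sensitive to feature scale, and StandardScaler and Pipeline are already imported above. A sketch with the same hyperparameters but scaling added (no guarantee of improvement on a dataset this small):

pipe = Pipeline([
    ('scale', StandardScaler()),  # standardize features before the network
    ('mlp', MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=1000, random_state=42)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))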
Summary#
In this project, the aim was to find the factors that influence the acceleration of a car. I first made a guess, then used OLS linear regression to find which variables should be put into the scikit-learn model. Finally, I fed those variables into scikit-learn models and found all of them bad at predicting unseen data. At first I thought I had chosen the wrong model. After several failed trials, I turned to the Multi-layer Perceptron Regressor, a neural-network method, and the result was even worse, which suggests that more data is needed to actually train a machine to predict acceleration from ‘weight’, ‘cylinders’, and ‘horsepower’. But through the OLS linear regression model, we still achieved our goal: to find out which factors in the dataset affect the acceleration of a car!!
References#
Quinlan, R. (1993). Auto MPG. UCI Machine Learning Repository. https://doi.org/10.24432/C5859H