The Effect of Each Part of a Car on Its Acceleration#

Author: Mianfu Zhong

Course Project, UC Irvine, Math 10, F23

Introduction#

This project uses a dataset of car specifications to identify the factors that affect a car’s acceleration. It also discusses in detail how the analysis is done and why the identified factors are related to acceleration.

The Preparation Step#

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
# Import all the tools we might need, including pandas, numpy, matplotlib, seaborn, altair, and scikit-learn.
df = pd.read_table('auto-mpg.data', delim_whitespace=True, header=None, names=['mpg', 'cylinders', 'displacement','horsepower','weight','acceleration','model_year','origin','car_name'])
# Because the file uses the ".data" extension, we have to read it with explicit options like the ones above.
# If you download and inspect the file, you can see that the values are separated by whitespace,
# so we pass delim_whitespace=True to tell pandas how to split the columns.
# Let's see what the data looks like first.
df
mpg cylinders displacement horsepower weight acceleration model_year origin car_name
0 18.0 8 307.0 130.0 3504.0 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693.0 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150.0 3436.0 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150.0 3433.0 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140.0 3449.0 10.5 70 1 ford torino
... ... ... ... ... ... ... ... ... ...
393 27.0 4 140.0 86.00 2790.0 15.6 82 1 ford mustang gl
394 44.0 4 97.0 52.00 2130.0 24.6 82 2 vw pickup
395 32.0 4 135.0 84.00 2295.0 11.6 82 1 dodge rampage
396 28.0 4 120.0 79.00 2625.0 18.6 82 1 ford ranger
397 31.0 4 119.0 82.00 2720.0 19.4 82 1 chevy s-10

398 rows × 9 columns
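(A side note: newer pandas releases deprecate the delim_whitespace keyword in favor of an explicit separator. If the call above warns or fails on your version, the equivalent read below — a sketch, not part of the original notebook — should behave the same.)

# Equivalent read with an explicit whitespace separator (newer pandas style; sketch)
df = pd.read_csv('auto-mpg.data', sep=r'\s+', header=None,
                 names=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                        'acceleration', 'model_year', 'origin', 'car_name'])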

# Before deciding which method to use to answer our question,
# there are still a few things to do beforehand:
# data exploration, and handling missing and duplicate values.

# Remove rows with missing values.
# !!!! Notice that in the second line below, I replace every "?" with NA first,
# because some columns have dtype object, so "?" would not otherwise be detected as a missing value.
df = df.dropna()
df = df.replace('?', pd.NA).dropna()

# Data Cleaning
df = df.drop_duplicates()

# Data exploration
print("Data Exploration:")
print(df.info())
print(df.describe())
df
Data Exploration:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    object 
 4   weight        392 non-null    float64
 5   acceleration  392 non-null    float64
 6   model_year    392 non-null    int64  
 7   origin        392 non-null    int64  
 8   car_name      392 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 30.6+ KB
None
              mpg   cylinders  displacement       weight  acceleration  \
count  392.000000  392.000000    392.000000   392.000000    392.000000   
mean    23.445918    5.471939    194.411990  2977.584184     15.541327   
std      7.805007    1.705783    104.644004   849.402560      2.758864   
min      9.000000    3.000000     68.000000  1613.000000      8.000000   
25%     17.000000    4.000000    105.000000  2225.250000     13.775000   
50%     22.750000    4.000000    151.000000  2803.500000     15.500000   
75%     29.000000    8.000000    275.750000  3614.750000     17.025000   
max     46.600000    8.000000    455.000000  5140.000000     24.800000   

       model_year      origin  
count  392.000000  392.000000  
mean    75.979592    1.576531  
std      3.683737    0.805518  
min     70.000000    1.000000  
25%     73.000000    1.000000  
50%     76.000000    1.000000  
75%     79.000000    2.000000  
max     82.000000    3.000000  
mpg cylinders displacement horsepower weight acceleration model_year origin car_name
0 18.0 8 307.0 130.0 3504.0 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693.0 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150.0 3436.0 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150.0 3433.0 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140.0 3449.0 10.5 70 1 ford torino
... ... ... ... ... ... ... ... ... ...
393 27.0 4 140.0 86.00 2790.0 15.6 82 1 ford mustang gl
394 44.0 4 97.0 52.00 2130.0 24.6 82 2 vw pickup
395 32.0 4 135.0 84.00 2295.0 11.6 82 1 dodge rampage
396 28.0 4 120.0 79.00 2625.0 18.6 82 1 ford ranger
397 31.0 4 119.0 82.00 2720.0 19.4 82 1 chevy s-10

392 rows × 9 columns
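(If you are curious where those "?" placeholders actually were, the sketch below re-reads the raw file into a throwaway df_raw and counts them per column; in this dataset they turn out to live only in the horsepower column, which is also why that column shows up as dtype object above.)

# Re-read the raw file and count '?' placeholders per column (sketch; df_raw is only for this check)
df_raw = pd.read_table('auto-mpg.data', delim_whitespace=True, header=None,
                       names=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                              'acceleration', 'model_year', 'origin', 'car_name'])
print((df_raw == '?').sum())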

# Note that besides the factors that obviously might influence acceleration,
# there is a column named "origin", which could be an indirect factor.
# We may need to fit the model both with and without it
# to see whether it matters.

# !!!! Not trying to spoil anything, but if you look at the names of our features, you might notice that
# some of them are not causally related to acceleration yet could still show a high goodness of fit.

Call to Adventure: A First Guess at the Result#

What might the main factors be? Horsepower? Cylinders? Or weight?

Because we are studying acceleration, let’s recall the formula from Physics 7: F = ma, so a = F/m. (For the rotating parts of a car, the analogous relation has torque in place of force and angular acceleration in place of linear acceleration.)

However, torque is not included in the dataset; the columns that might stand in for it are “weight” and “model_year”, since model_year represents the technology level. (Remember, in the fitting step, to consider fitting each model year separately and then averaging the results, so that the year does not contaminate the other terms.)

In this case, common sense suggests that cylinders and horsepower could be the main factors. Let’s check that with an Altair chart.

BUT, before any of that, we should first figure out what the acceleration value actually means: is it the time to go from 0 to 60 mph, or a rate in units like m/s²?

# To see what the ACCELERATION value really means in this data, we first look at the relationship between weight and acceleration.
# Intuitively, the heavier a car is, the longer it should need to accelerate.
c1 = alt.Chart(df).mark_point().encode(
    x = 'weight:Q',
    y = 'acceleration:Q',
    tooltip = ['weight', 'acceleration','cylinders','model_year','car_name','horsepower']
)
trendline = c1.transform_regression('weight', 'acceleration').mark_line(color='red')
chart_with_trendline1 = (c1 + trendline)
chart_with_trendline1
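(Before reacting to the plot, it may help to put a number on the trend; the one-line check below is a small sketch that was not in the original flow.)

# Correlation between weight and the acceleration value (sketch)
print(df['weight'].corr(df['acceleration']))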

Weird! Why does the acceleration value go down as weight increases? Or is the acceleration value NOT the time a car needs to go from 0 to 60 mph?

Don’t worry, we can check it using horsepower, which should be the clearest indicator of acceleration!

c2 = alt.Chart(df).mark_point().encode(
    x = 'horsepower:Q',
    y = 'acceleration:Q',
    tooltip = ['weight', 'acceleration','cylinders','model_year','car_name']
)
c2
# From a rough look at the chart, it seems that as horsepower goes up, the acceleration value goes down,
# which is confusing at first, because it suggests the opposite interpretation from the weight-acceleration chart above.
# !! Let's check their relationship with a regression trendline.
trendline2 = c2.transform_regression('horsepower', 'acceleration').mark_line(color='red', size=3)
chart_with_trendline2 = (c2 + trendline2)
chart_with_trendline2
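(The same one-line check for horsepower; note that horsepower is still stored as strings at this point, so this sketch converts it on the fly.)

# Correlation between horsepower and the acceleration value (sketch; horsepower converted from strings)
hp = pd.to_numeric(df['horsepower'], errors='coerce')
print(hp.corr(df['acceleration']))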

Now we can make sense of the column: more horsepower goes with a smaller acceleration value, which fits the common description of this field as the time (in seconds) a car needs to accelerate from 0 to 60 mph — so a larger value means a slower car. The puzzling weight trend is most likely a confound, since heavier cars tend to carry larger, more powerful engines.

Either way, this confirms our first guess that horsepower and weight do affect the acceleration.

The next step is to figure out which factors are the best ones to feed into our model.

Approaching the Inner Cave: An OLS Linear Regression Model#

Find out which factors should be used to train our model (EXTRA COMPONENT).

WHY use the OLS model from ‘statsmodels’ NOW?

  1. We now want to decide which factors to put inside our scikit-learn model.

  2. The ‘statsmodels’ library emphasizes statistical modeling and hypothesis testing: it provides a detailed statistical summary of the regression results, which lets us assess the statistical significance of each coefficient.

  3. scikit-learn does not provide this kind of summary out of the box (see points 1 and 2 above).

pip install statsmodels
Collecting statsmodels
  Downloading statsmodels-0.14.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.1/10.1 MB 72.2 MB/s eta 0:00:00
Requirement already satisfied: scipy!=1.9.2,>=1.4 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from statsmodels) (1.9.3)
Requirement already satisfied: numpy>=1.18 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from statsmodels) (1.23.4)
Requirement already satisfied: pandas>=1.0 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from statsmodels) (1.2.5)
Collecting patsy>=0.5.2
  Downloading patsy-0.5.4-py2.py3-none-any.whl (233 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 233.9/233.9 kB 50.9 MB/s eta 0:00:00
Requirement already satisfied: packaging>=21.3 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from statsmodels) (21.3)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from packaging>=21.3->statsmodels) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from pandas>=1.0->statsmodels) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from pandas>=1.0->statsmodels) (2022.5)
Requirement already satisfied: six in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from patsy>=0.5.2->statsmodels) (1.16.0)
Installing collected packages: patsy, statsmodels
Successfully installed patsy-0.5.4 statsmodels-0.14.0
Note: you may need to restart the kernel to use updated packages.
import statsmodels.api as sm
allfactors= df[['weight','mpg','cylinders','model_year','origin','horsepower']]
X = sm.add_constant(allfactors)
y = df['acceleration']
olsmodel = sm.OLS(y,X).fit()
print(olsmodel.summary())
# This errors out because the horsepower column is recognized as object dtype; let's try again, converting it to numbers first.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [16], line 5
      3 X = sm.add_constant(allfactors)
      4 y = df['acceleration']
----> 5 olsmodel = sm.OLS(y,X).fit()
      6 print(olsmodel.summary())

File /usr/local/lib/python3.9/site-packages/statsmodels/regression/linear_model.py:922, in OLS.__init__(self, endog, exog, missing, hasconst, **kwargs)
    919     msg = ("Weights are not supported in OLS and will be ignored"
    920            "An exception will be raised in the next version.")
    921     warnings.warn(msg, ValueWarning)
--> 922 super(OLS, self).__init__(endog, exog, missing=missing,
    923                           hasconst=hasconst, **kwargs)
    924 if "weights" in self._init_keys:
    925     self._init_keys.remove("weights")

File /usr/local/lib/python3.9/site-packages/statsmodels/regression/linear_model.py:748, in WLS.__init__(self, endog, exog, weights, missing, hasconst, **kwargs)
    746 else:
    747     weights = weights.squeeze()
--> 748 super(WLS, self).__init__(endog, exog, missing=missing,
    749                           weights=weights, hasconst=hasconst, **kwargs)
    750 nobs = self.exog.shape[0]
    751 weights = self.weights

File /usr/local/lib/python3.9/site-packages/statsmodels/regression/linear_model.py:202, in RegressionModel.__init__(self, endog, exog, **kwargs)
    201 def __init__(self, endog, exog, **kwargs):
--> 202     super(RegressionModel, self).__init__(endog, exog, **kwargs)
    203     self.pinv_wexog: Float64Array | None = None
    204     self._data_attr.extend(['pinv_wexog', 'wendog', 'wexog', 'weights'])

File /usr/local/lib/python3.9/site-packages/statsmodels/base/model.py:270, in LikelihoodModel.__init__(self, endog, exog, **kwargs)
    269 def __init__(self, endog, exog=None, **kwargs):
--> 270     super().__init__(endog, exog, **kwargs)
    271     self.initialize()

File /usr/local/lib/python3.9/site-packages/statsmodels/base/model.py:95, in Model.__init__(self, endog, exog, **kwargs)
     93 missing = kwargs.pop('missing', 'none')
     94 hasconst = kwargs.pop('hasconst', None)
---> 95 self.data = self._handle_data(endog, exog, missing, hasconst,
     96                               **kwargs)
     97 self.k_constant = self.data.k_constant
     98 self.exog = self.data.exog

File /usr/local/lib/python3.9/site-packages/statsmodels/base/model.py:135, in Model._handle_data(self, endog, exog, missing, hasconst, **kwargs)
    134 def _handle_data(self, endog, exog, missing, hasconst, **kwargs):
--> 135     data = handle_data(endog, exog, missing, hasconst, **kwargs)
    136     # kwargs arrays could have changed, easier to just attach here
    137     for key in kwargs:

File /usr/local/lib/python3.9/site-packages/statsmodels/base/data.py:675, in handle_data(endog, exog, missing, hasconst, **kwargs)
    672     exog = np.asarray(exog)
    674 klass = handle_data_class_factory(endog, exog)
--> 675 return klass(endog, exog=exog, missing=missing, hasconst=hasconst,
    676              **kwargs)

File /usr/local/lib/python3.9/site-packages/statsmodels/base/data.py:84, in ModelData.__init__(self, endog, exog, missing, hasconst, **kwargs)
     82     self.orig_endog = endog
     83     self.orig_exog = exog
---> 84     self.endog, self.exog = self._convert_endog_exog(endog, exog)
     86 self.const_idx = None
     87 self.k_constant = 0

File /usr/local/lib/python3.9/site-packages/statsmodels/base/data.py:509, in PandasData._convert_endog_exog(self, endog, exog)
    507 exog = exog if exog is None else np.asarray(exog)
    508 if endog.dtype == object or exog is not None and exog.dtype == object:
--> 509     raise ValueError("Pandas data cast to numpy dtype of object. "
    510                      "Check input data with np.asarray(data).")
    511 return super(PandasData, self)._convert_endog_exog(endog, exog)

ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
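(Before applying the fix below, it is worth seeing why statsmodels complained: the horsepower column was read in as strings because of the original "?" entries, and dropping those rows does not change the column's dtype. A quick inspection, sketched here, makes that visible.)

# horsepower is still object dtype: its values are strings such as '130.0' (sketch)
print(df['horsepower'].dtype)
print(df['horsepower'].head().tolist())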
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
# Convert horsepower to numeric; now the OLS fit below should work.
import statsmodels.api as sm
allfactors= df[['weight','mpg','cylinders','model_year','origin','horsepower']]
X = sm.add_constant(allfactors)
y = df['acceleration']
olsmodel = sm.OLS(y,X).fit()
print(olsmodel.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           acceleration   R-squared:                       0.615
Model:                            OLS   Adj. R-squared:                  0.609
Method:                 Least Squares   F-statistic:                     102.7
Date:                Thu, 14 Dec 2023   Prob (F-statistic):           9.17e-77
Time:                        07:17:22   Log-Likelihood:                -766.23
No. Observations:                 392   AIC:                             1546.
Df Residuals:                     385   BIC:                             1574.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         21.3419      2.192      9.738      0.000      17.033      25.651
weight         0.0030      0.000      9.735      0.000       0.002       0.004
mpg            0.0142      0.026      0.543      0.587      -0.037       0.066
cylinders     -0.3920      0.124     -3.168      0.002      -0.635      -0.149
model_year    -0.0478      0.033     -1.464      0.144      -0.112       0.016
origin         0.1015      0.140      0.724      0.469      -0.174       0.377
horsepower    -0.0912      0.005    -18.202      0.000      -0.101      -0.081
==============================================================================
Omnibus:                       46.182   Durbin-Watson:                   1.672
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               67.916
Skew:                           0.782   Prob(JB):                     1.79e-15
Kurtosis:                       4.309   Cond. No.                     7.80e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.8e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Definitions: “t”: the t-statistic, which tests the null hypothesis that the coefficient of a variable is equal to zero. “P>|t|”: the p-value associated with that t-statistic.

Summary: a small p-value (typically < 0.05) suggests that we can reject the null hypothesis. A large absolute t-statistic (with a correspondingly small p-value) suggests that the variable is likely a significant predictor in the model.

With these definitions in mind, let’s look at the table above, focusing only on the “t” and “P>|t|” columns.

We find that only “weight”, “horsepower”, “cylinders”, and “const” have relatively large absolute t-values and p-values smaller than 5%.

It is easy to accept that weight, horsepower, and cylinders would affect the acceleration, but what is “const”?

“const” stands for the constant term: it is the intercept β0 in the OLS linear model Y = β0 + Σj=1..p βjXj + ε (it is not the error term ε). It is the value the model predicts when every predictor is zero. In our case a car with zero weight, zero horsepower, and zero cylinders is not physically meaningful, so the constant has no real-world interpretation here; it exists only to let the model fit the data better.

How exciting! We now know which factors affect the acceleration most: “horsepower”, “cylinders”, and “weight”. LET’S run them through a scikit-learn model.

Aiming at Our Goal: Use scikit-learn to Predict and Verify Our Guess#

Let’s try the linear regression model first, as it is simple and explainable.

X = df[['weight', 'cylinders', 'horsepower']]  # predictor variables
y = df["acceleration"]  # target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)  # train/test split
model = LinearRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)
0.6354558360449379
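(The .score value above is the R² on the test set. Since mean_squared_error is already imported, it is cheap to also express the error in the units of the target; this is a small sketch, not part of the original analysis.)

# Root mean squared error on the test set, in the same units as the acceleration column (sketch)
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)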

Seemingly the model is not performing well. The reason might be that we used only the 3 predictors we subjectively consider important, so let’s try fitting it with all the variables.

X = allfactors  # now use all candidate predictors
y = df["acceleration"]  # target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)  # train/test split
model = LinearRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)
0.6277449368818655
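(Before blaming the model, it is worth checking that these numbers are not an artifact of one particular train/test split. The sketch below uses cross_val_score, a helper that was not imported above, to average R² over five splits.)

from sklearn.model_selection import cross_val_score

# Average R^2 over five different splits to get a steadier estimate (sketch)
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(scores.mean(), scores.std())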

Not a good sign!! It seems that either we chose the wrong model or the variables are just not that relevant. Maybe the problem is that a LinearRegression model assumes the variables are linearly related.

Let’s try a decision tree, because it can handle non-linear relationships!

from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)  # mean squared error, computed for reference
model.score(X_test, y_test)
0.44422080245934037

EVEN WORSE!! (No need to test whether the choice of X is to blame: switching X to the 3 variables we found matter most still leaves the score at approximately 0.44.)

This is probably because a decision tree overfits easily. Let’s check whether it has overfitted.

model.score(X_train, y_train)
0.9975355913598829
model.score(X_test, y_test)
0.44422080245934037

Well, it sure did overfit. Let’s try a random forest, since averaging many trees might solve the problem a single decision tree has (a simpler stopgap is sketched just below).
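(As sketched here, an even simpler remedy is to cap the depth of a single tree so that it cannot memorize the training set; max_depth=4 is an arbitrary illustrative choice, not a tuned value.)

# Restrict the tree depth to curb overfitting (sketch; max_depth chosen arbitrarily)
shallow_tree = DecisionTreeRegressor(max_depth=4, random_state=12)
shallow_tree.fit(X_train, y_train)
print(shallow_tree.score(X_train, y_train), shallow_tree.score(X_test, y_test))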

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
model.score(X_test,y_test)
0.681217641805061
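(A side benefit of the random forest: it exposes feature_importances_, which gives another answer to the project’s main question of which factors matter. A small sketch:)

# Which features does the fitted forest rely on most? (sketch)
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)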

Let’s try some other method!

from sklearn.neural_network import MLPRegressor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=1000, random_state=42)
model.fit(X_train, y_train)
model.score(X_test, y_test)
0.40588689134599465
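(One caveat about this last attempt: neural networks are sensitive to feature scale, and our features range from single digits to thousands. Since StandardScaler and Pipeline were already imported at the top, a scaled version is cheap to try; this is a sketch, and the score may or may not improve.)

# Standardize the features before the MLP; unscaled inputs often hurt neural networks (sketch)
scaled_mlp = Pipeline([
    ('scale', StandardScaler()),
    ('mlp', MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=1000, random_state=42)),
])
scaled_mlp.fit(X_train, y_train)
print(scaled_mlp.score(X_test, y_test))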

Summary#

In this project, the aim is to find the factors that influence the acceleration of a car. I first made a guess, then used OLS linear regression to find out which variables should be put into the scikit-learn models. Finally, I fed those variables into scikit-learn and found that all of the models were mediocre at predicting unseen data. At first I thought I had simply chosen the wrong model. After several failed attempts I turned to the Multi-layer Perceptron Regressor, a neural network method, and the result was even worse, which suggests that more data (or more careful preprocessing) would be needed to train a machine to predict acceleration from ‘weight’, ‘cylinders’, and ‘horsepower’. But through the OLS linear regression model we still achieved our goal, which was to “find out which factors in the dataset affect the acceleration of a car”!!

References#


  • What is the source of your dataset(s)?

Quinlan, R. (1993). Auto MPG. UCI Machine Learning Repository. https://doi.org/10.24432/C5859H