Week 6 Monday#
Announcements#
HW5 due Wednesday.
Quiz tomorrow is based on HW5.
Plan:#
Performance measures for regression
More on polynomial regression
import numpy as np
import pandas as pd
import altair as alt
Performance measures for regression#
Here is a dataset with only 4 data points.
df = pd.DataFrame({
"x":np.arange(4),
"y":[0,2,-10,6]},
)
df
|   | x | y |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 2 |
| 2 | 2 | -10 |
| 3 | 3 | 6 |
Here is how the data looks.
alt.Chart(df).mark_circle(size=150).encode(
x="x",
y="y"
)
Let's figure out which of the following linear models better fits the data:
Line A: \(f(x) = 2x\)
Line B: \(g(x) = 0.6x - 1.4\)
Add columns to df corresponding to these lines. Name the new columns "lineA" and "lineB".
The following works, but is overly complicated. There's no need to use the map method for these basic examples.
df["lineA"] = df['x'].map(lambda x : 2*x)
df
|   | x | y | lineA |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 1 | 2 | 2 |
| 2 | 2 | -10 | 4 |
| 3 | 3 | 6 | 6 |
This is the most natural way to include the Line A points.
df["lineA"] = 2*df['x']
df
|   | x | y | lineA |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 1 | 2 | 2 |
| 2 | 2 | -10 | 4 |
| 3 | 3 | 6 | 6 |
Similarly for Line B: no map method is needed.
df["lineB"] = 0.6*df['x']-1.4
df
|   | x | y | lineA | lineB |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | -1.4 |
| 1 | 1 | 2 | 2 | -0.8 |
| 2 | 2 | -10 | 4 | -0.2 |
| 3 | 3 | 6 | 6 | 0.4 |
Plot the data together with these lines, using the color red for Line A and the color black for Line B. Use a base chart so that you are not repeating the same code three times.
Here is a trick that will save us some typing. This trick is not that important and I don't expect it to show up on a quiz or exam, but it is nice to know it exists. We define a base chart that contains the components common to all of our charts.
base = alt.Chart(df).encode(
x = "x"
)
If we try to display base on its own, we get an error, because by itself it is not a valid chart (it is missing a mark).
base
---------------------------------------------------------------------------
SchemaValidationError Traceback (most recent call last)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:2020, in Chart.to_dict(self, *args, **kwargs)
2018 copy.data = core.InlineData(values=[{}])
2019 return super(Chart, copy).to_dict(*args, **kwargs)
-> 2020 return super().to_dict(*args, **kwargs)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:393, in TopLevelMixin.to_dict(self, *args, **kwargs)
391 if dct is None:
392 kwargs["validate"] = "deep"
--> 393 dct = super(TopLevelMixin, copy).to_dict(*args, **kwargs)
395 # TODO: following entries are added after validation. Should they be validated?
396 if is_top_level:
397 # since this is top-level we add $schema if it's missing
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/schemapi.py:360, in SchemaBase.to_dict(self, validate, ignore, context)
358 self.validate(result)
359 except jsonschema.ValidationError as err:
--> 360 raise SchemaValidationError(self, err)
361 return result
SchemaValidationError: Invalid specification
altair.vegalite.v4.api.Chart, validating 'required'
'mark' is a required property
alt.Chart(...)
Now we can layer these three charts on top of each other.
c1 = base.mark_circle(size = 150).encode(y = 'y')
c2 = base.mark_line(color='red').encode(y = "lineA")
c3 = base.mark_line(color='black').encode(y = "lineB")
c1 + c2 + c3
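By the way, the + syntax here is shorthand for Altair's alt.layer function; the following should produce the same layered chart.
# Equivalent to c1 + c2 + c3, using the explicit layering function
alt.layer(c1, c2, c3)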
Which line fits the data better?
There's no single correct answer to this; as we will see, it depends on which "performance measure" we are using.
Using scikit-learn, find the line of best fit for this data. How does it compare to the above lines?
Here we return to the routine we used several times last week: Import, instantiate, fit.
We need an instance of LinearRegression to:
Store the specific model data (like coefficients) after fitting it to your data.
Use the methods that belong to LinearRegression (like fit and predict) on that specific model data.
Here we fit it to the data in the "y" column. It wouldn't make much sense to fit it to the "lineA" or "lineB" data, because that data already lies on a line, and we already know the equation. Linear regression is really only interesting when the data does not perfectly lie on a line.
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(df[["x"]], df["y"])
LinearRegression()
Here is the coefficient of the line of "best" fit to this data.
reg.coef_
array([0.6])
Here is the y-intercept.
reg.intercept_
-1.4
Notice that this describes exactly the black line above. So from the point of view of scikit-learn's linear regression, the black line (Line B) fits the data better than the red line, and in fact better than every other possible line.
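As a quick sanity check (just a sketch using the reg object we fitted above), the model's predictions at our four x values should match the "lineB" column exactly, since the slope is 0.6 and the intercept is -1.4.
# Predictions from the fitted model; these should equal 0.6*df["x"] - 1.4, i.e., the "lineB" column
reg.predict(df[["x"]])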
Import the mean_squared_error function from sklearn.metrics. Which of our lines fits the data better according to this metric?
from sklearn.metrics import mean_squared_error
Here is an example of computing the mean squared error (MSE) between the true data and Line A.
Important: when computing errors (or loss functions), a smaller number is better.
mean_squared_error(df['y'], df["lineA"])
49.0
Here is the same computation with Line B. Because the Line B value is smaller, from the point of view of Mean Squared Error, Line B fits the data better than Line A.
When scikit-learn performs linear regression, it is seeking the line that minimizes the mean squared error.
mean_squared_error(df['y'], df["lineB"])
34.300000000000004
The best hypothetical error is 0, but no line achieves a mean squared error of 0 for this data, because no line passes through all four data points.
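To see exactly what mean_squared_error is computing, here is a sketch of the same quantity for Line A done by hand with pandas; it should agree with the 49.0 above.
# Mean of the squared residuals for Line A
((df["y"] - df["lineA"])**2).mean()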
Import the mean_absolute_error function from sklearn.metrics. Which of our lines fits the data better according to this metric?
from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(df['y'], df["lineA"]))
print(mean_absolute_error(df['y'], df["lineB"]))
3.5
4.9
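As with the mean squared error, here is a sketch of the mean absolute error for Line A computed directly; it should agree with the 3.5 above.
# Mean of the absolute residuals for Line A
(df["y"] - df["lineA"]).abs().mean()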
There is no single correct answer to which line fits the data better: it depends on which performance measure is used.
Summary:
Line B is better (in fact the best possible) with respect to mean squared error.
Line A is better with respect to mean absolute error.
Different performance measures (different loss functions) will lead to different lines of "best" fit, as the sketch below illustrates.
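To see this concretely, recent versions of scikit-learn (1.0 and later) provide a QuantileRegressor class in sklearn.linear_model; with quantile=0.5 and no regularization, it finds the line minimizing the mean absolute error instead of the mean squared error. The following sketch assumes that class is available in your installation; on this tiny dataset it should essentially recover Line A rather than Line B.
from sklearn.linear_model import QuantileRegressor

# quantile=0.5 with alpha=0 (no regularization) minimizes the mean absolute error
reg_mae = QuantileRegressor(quantile=0.5, alpha=0)
reg_mae.fit(df[["x"]], df["y"])
print(reg_mae.coef_, reg_mae.intercept_)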
More on polynomial regression#
Polynomial features#
We saw how to perform polynomial regression "by hand" last week. The process is much easier if we take advantage of some additional functionality in scikit-learn.
Demonstrate the PolynomialFeatures class from sklearn.preprocessing by evaluating it on the "x" column in df. Use a degree value of 3. (Below we actually demonstrate it on the "horsepower" column of the mpg dataset.)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
import seaborn as sns
df = sns.load_dataset("mpg").dropna(axis=0).copy()  # note: this replaces our small four-point DataFrame
poly = PolynomialFeatures(degree=3)
help(poly.fit_transform)
Help on method fit_transform in module sklearn.base:
fit_transform(X, y=None, **fit_params) method of sklearn.preprocessing._polynomial.PolynomialFeatures instance
Fit to data, then transform it.
Fits transformer to `X` and `y` with optional parameters `fit_params`
and returns a transformed version of `X`.
Parameters
----------
X : array-like of shape (n_samples, n_features)
Input samples.
y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
**fit_params : dict
Additional fit parameters.
Returns
-------
X_new : ndarray array of shape (n_samples, n_features_new)
Transformed array.
X = df[["horsepower"]]
y = df["mpg"]
#Create a PolynomialFeatures object with degree=3
poly = PolynomialFeatures(degree = 3)
#Transform 'horsepower' into polynomial features
X_poly = poly.fit_transform(X)
#To make the result easier to read, put the transformed data into a pandas DataFrame
df_poly = pd.DataFrame(X_poly)
df_poly
|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1.0 | 130.0 | 16900.0 | 2197000.0 |
| 1 | 1.0 | 165.0 | 27225.0 | 4492125.0 |
| 2 | 1.0 | 150.0 | 22500.0 | 3375000.0 |
| 3 | 1.0 | 150.0 | 22500.0 | 3375000.0 |
| 4 | 1.0 | 140.0 | 19600.0 | 2744000.0 |
| ... | ... | ... | ... | ... |
| 387 | 1.0 | 86.0 | 7396.0 | 636056.0 |
| 388 | 1.0 | 52.0 | 2704.0 | 140608.0 |
| 389 | 1.0 | 84.0 | 7056.0 | 592704.0 |
| 390 | 1.0 | 79.0 | 6241.0 | 493039.0 |
| 391 | 1.0 | 82.0 | 6724.0 | 551368.0 |

392 rows × 4 columns
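The integer column names 0 through 3 stand for horsepower raised to the powers 0, 1, 2, and 3. If your scikit-learn version is recent enough (1.0 or later), poly.get_feature_names_out() returns descriptive names for these columns, which we could use to label the DataFrame; this is just a sketch, and older versions have a similar get_feature_names method instead.
# Label the transformed columns (assumes scikit-learn 1.0+ for get_feature_names_out)
df_poly_named = pd.DataFrame(X_poly, columns=poly.get_feature_names_out())
df_poly_named.head()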
#Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.1, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
#Make predictions using the model on the test set
y_pred_test = model.predict(X_test)
#Calculate MSE for the predictions
mean_squared_error(y_test, y_pred_test)
19.680376038411936
model.intercept_
63.46111330910941
Here are the coefficients. Based on what we were saying with the taxis dataset ("Oh look, the number of passengers is not very meaningful to the duration of the taxi ride, because the coefficient is pretty small"), you might think the "horsepower"\(^2\) and "horsepower"\(^3\) columns are not very meaningful here, because their coefficients are so small. But look at the values in these columns: they are huge (often over a million in the "horsepower"\(^3\) column), so even with these small coefficients, these columns still have a meaningful impact.
model.coef_
array([ 0.00000000e+00, -6.31876607e-01, 2.52981240e-03, -3.15086840e-06])
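To make the point about the huge column values concrete, here is a rough sketch of the contribution of each term to the prediction for a horsepower of 130 (the value in the first row of the dataset): multiplying the coefficients by the corresponding feature values shows that the squared and cubed terms still shift the prediction by several miles per gallon or more, even though their coefficients look tiny.
# Contribution of each polynomial term to the prediction when horsepower is 130
row = poly.transform(pd.DataFrame({"horsepower": [130]}))[0]  # [1, 130, 130**2, 130**3]
model.coef_ * row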