Week 6 Monday#

Announcements#

  • HW5 due Wednesday.

  • Quiz tomorrow is based on HW5.

Plan#

  • Performance measures for regression

  • More on polynomial regression

import numpy as np
import pandas as pd
import altair as alt

Performance measures for regression#

Here is a dataset with only 4 data points.

df = pd.DataFrame({
    "x":np.arange(4),
    "y":[0,2,-10,6]},
)

df
x y
0 0 0
1 1 2
2 2 -10
3 3 6

Here is how the data looks.

alt.Chart(df).mark_circle(size=150).encode(
    x="x",
    y="y"
)

Let’s figure out which of the following linear models better fits the data:

  • Line A: \(f(x) = 2x\)

  • Line B: \(g(x) = 0.6x - 1.4\)

  • Add columns to df corresponding to these lines. Name the new columns “lineA” and “lineB”.

The following works, but is overly complicated. There’s no need to use the map method for these basic examples.

df["lineA"] = df['x'].map(lambda x : 2*x)
df
x y lineA
0 0 0 0
1 1 2 2
2 2 -10 4
3 3 6 6

This is the most natural way to include the Line A points.

df["lineA"] = 2*df['x']
df
x y lineA
0 0 0 0
1 1 2 2
2 2 -10 4
3 3 6 6

Similarly for Line B: no map method is needed.

df["lineB"] = 0.6*df['x']-1.4
df
x y lineA lineB
0 0 0 0 -1.4
1 1 2 2 -0.8
2 2 -10 4 -0.2
3 3 6 6 0.4
  • Plot the data together with these lines, using the color red for Line A and the color black for Line B. Use a base chart so that you are not repeating the same code three times.

Here is a trick that will save us some typing. This trick is not that important and I don’t expect it to show up on a quiz or exam, but it is nice to know it exists. We define a base chart that contains the components common to all of our charts.

base = alt.Chart(df).encode(
    x = "x"
)

If we try to display base on its own, we get an error, because by itself this is not a valid chart (it does not specify a mark).

base
---------------------------------------------------------------------------
SchemaValidationError                     Traceback (most recent call last)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:2020, in Chart.to_dict(self, *args, **kwargs)
   2018     copy.data = core.InlineData(values=[{}])
   2019     return super(Chart, copy).to_dict(*args, **kwargs)
-> 2020 return super().to_dict(*args, **kwargs)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:393, in TopLevelMixin.to_dict(self, *args, **kwargs)
    391 if dct is None:
    392     kwargs["validate"] = "deep"
--> 393     dct = super(TopLevelMixin, copy).to_dict(*args, **kwargs)
    395 # TODO: following entries are added after validation. Should they be validated?
    396 if is_top_level:
    397     # since this is top-level we add $schema if it's missing

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/schemapi.py:360, in SchemaBase.to_dict(self, validate, ignore, context)
    358         self.validate(result)
    359     except jsonschema.ValidationError as err:
--> 360         raise SchemaValidationError(self, err)
    361 return result

SchemaValidationError: Invalid specification

        altair.vegalite.v4.api.Chart, validating 'required'

        'mark' is a required property
        
alt.Chart(...)

Now we can layer these three charts on top of each other.

c1 = base.mark_circle(size = 150).encode(y = 'y')
c2 = base.mark_line(color='red').encode(y = "lineA")
c3 = base.mark_line(color='black').encode(y = "lineB")
c1 + c2 + c3
  • Which line fits the data better?

There’s no single correct answer to this; as we will see, it depends on which “performance measure” we are using.

  • Using scikit-learn, find the line of best fit for this data. How does it compare to the above lines?

Here we return to the routine we used several times last week: Import, instantiate, fit.

We need an instance of LinearRegression to:

  1. Store the specific model data (like coefficients) after fitting it to your data.

  2. Use the methods that belong to LinearRegression (like fit and predict) on that specific model data.

Here we fit it to the data in the “y” column. It wouldn’t make much sense to fit it to the “lineA” or “lineB” data, because that data already lies on a line, and we already know the equation. Linear regression is really only interesting when the data does not perfectly lie on a line.

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(df[["x"]], df["y"])
LinearRegression()

Here is the coefficient of the line of “best” fit to this data.

reg.coef_
array([0.6])

Here is the y-intercept.

reg.intercept_
-1.4

Notice that this describes exactly the black line above. So from the point of view of scikit-learn’s linear regression, the black line (Line B) fits the data better than the red line, but also better than every other possible line.
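As a quick sanity check (just a sketch using the reg object we fit above), we can compare the model’s predictions at our four x-values to the “lineB” column; they should agree up to floating-point rounding.

# The fitted model's predictions should match the Line B values computed earlier
np.allclose(reg.predict(df[["x"]]), df["lineB"])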

  • Import the mean_squared_error function from sklearn.metrics. Which of our lines fits the data better according to this metric?

from sklearn.metrics import mean_squared_error

Here is an example of computing the mean squared error (MSE) between the true data and Line A.

Important: when computing errors (or loss functions), a smaller number is better.

mean_squared_error(df['y'], df["lineA"])
49.0

Here is the same computation with Line B. Because the Line B value is smaller, Line B fits the data better than Line A from the point of view of mean squared error.

When scikit-learn performs linear regression, it is seeking the line that minimizes the mean squared error.

mean_squared_error(df['y'], df["lineB"])
34.300000000000004

The best hypothetical error is 0, but there is no line that achieves a mean squared error of 0 for this data (because no line passes through all four data points).
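As a sanity check (a sketch, not part of the original worksheet), here is the Line A computation done directly with pandas: mean squared error is just the average of the squared residuals, so this should reproduce the 49.0 from above.

# Average of the squared residuals for Line A; should match mean_squared_error above
((df["y"] - df["lineA"])**2).mean()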

  • Import the mean_absolute_error function from sklearn.metrics. Which of our lines fits the data better according to this metric?

from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(df['y'], df["lineA"]))
print(mean_absolute_error(df['y'], df["lineB"]))
3.5
4.9
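Again as a sketch, the Line A value can be computed directly: mean absolute error is just the average of the absolute residuals, so this should reproduce the 3.5 printed above.

# Average of the absolute residuals for Line A; should match the printed value for Line A
(df["y"] - df["lineA"]).abs().mean()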

There is no correct answer to which line fits the data better: it depends on which performance measure is used.

Summary:

  • Line B is better (actually the best possible) with respect to mean squared error.

  • Line A is better with respect to mean absolute error.

  • Different performance measures (different loss functions) will lead to different lines of “best” fit.
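As an optional illustration (a sketch, not something from class), here is one way to search numerically for a slope and intercept that minimize mean absolute error instead. It assumes scipy is available (it is installed alongside scikit-learn), and Nelder-Mead only gives an approximate answer because the MAE objective is not smooth. For this dataset the result should come out close to Line A, which passes through three of the four points, since MAE is less affected by the single outlier.

from scipy.optimize import minimize

# Mean absolute error of the line y = m*x + b on our four data points
def mae_loss(params):
    m, b = params
    return np.mean(np.abs(df["y"] - (m*df["x"] + b)))

# Start the search from Line B (the MSE-optimal line) and let the optimizer move it
res = minimize(mae_loss, x0=[0.6, -1.4], method="Nelder-Mead")
res.x  # approximate slope and intercept of an MAE-minimizing line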

More on polynomial regression#

Polynomial features#

We saw how to perform polynomial regression “by hand” last week. The process is much easier if we take advantage of some additional functionality in scikit-learn.

  • Demonstrate the PolynomialFeatures class from sklearn.preprocessing by evaluating it on the “x” column in df. Use a degree value of 3.

For this demonstration we switch from the toy dataset above to the mpg dataset from seaborn, and we will use its “horsepower” column.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
import seaborn as sns

# Load the mpg dataset and drop rows with missing values
df = sns.load_dataset("mpg").dropna(axis=0).copy()

Here is the documentation for the fit_transform method of a PolynomialFeatures object. (The poly object used here is the one we create a couple of cells below.)

help(poly.fit_transform)
Help on method fit_transform in module sklearn.base:

fit_transform(X, y=None, **fit_params) method of sklearn.preprocessing._polynomial.PolynomialFeatures instance
    Fit to data, then transform it.
    
    Fits transformer to `X` and `y` with optional parameters `fit_params`
    and returns a transformed version of `X`.
    
    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        Input samples.
    
    y :  array-like of shape (n_samples,) or (n_samples, n_outputs),                 default=None
        Target values (None for unsupervised transformations).
    
    **fit_params : dict
        Additional fit parameters.
    
    Returns
    -------
    X_new : ndarray array of shape (n_samples, n_features_new)
        Transformed array.
X = df[["horsepower"]]
y = df["mpg"]

# Create a PolynomialFeatures object with degree=3
poly = PolynomialFeatures(degree=3)
# Transform 'horsepower' into polynomial features
X_poly = poly.fit_transform(X)

# To make the result easier to read, put the transformed data into a pandas DataFrame
df_poly = pd.DataFrame(X_poly)
df_poly
0 1 2 3
0 1.0 130.0 16900.0 2197000.0
1 1.0 165.0 27225.0 4492125.0
2 1.0 150.0 22500.0 3375000.0
3 1.0 150.0 22500.0 3375000.0
4 1.0 140.0 19600.0 2744000.0
... ... ... ... ...
387 1.0 86.0 7396.0 636056.0
388 1.0 52.0 2704.0 140608.0
389 1.0 84.0 7056.0 592704.0
390 1.0 79.0 6241.0 493039.0
391 1.0 82.0 6724.0 551368.0

392 rows × 4 columns
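For comparison, here is a sketch of what the “by hand” approach from last week might look like applied to the same column. PolynomialFeatures is doing nothing more than building these powers for us (the column names here are just ones I made up for readability).

# Build the same four columns manually: a constant, horsepower, horsepower^2, horsepower^3
df_by_hand = pd.DataFrame({
    "const": 1.0,
    "hp1": df["horsepower"],
    "hp2": df["horsepower"]**2,
    "hp3": df["horsepower"]**3,
})
df_by_hand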

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.1, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions using the model on the test set
y_pred_test = model.predict(X_test)
# Calculate the MSE for these predictions
mean_squared_error(y_test, y_pred_test)
19.680376038411936
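Out of curiosity (a sketch, not something we computed above), we can also look at the error on the training set. We would expect it to be of a similar size to the test error here, since a degree 3 polynomial is not flexible enough to badly overfit several hundred training points.

# MSE on the training set, for comparison with the test MSE above
y_pred_train = model.predict(X_train)
mean_squared_error(y_train, y_pred_train)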
model.intercept_
63.46111330910941

Here are the coefficients. Based on what we were saying with the taxis dataset (“Oh look, the number of passengers is not very meaningful to the duration of the taxi ride, because the coefficient is pretty small”), you might think the ‘horsepower’\(^2\) and ‘horsepower’\(^3\) columns are not very meaningful here, because their coefficients are so small. But look at the values in those columns: they are huge (often over a million in the ‘horsepower’\(^3\) column), so even with these small coefficients, these columns still have a meaningful impact.

model.coef_
array([ 0.00000000e+00, -6.31876607e-01,  2.52981240e-03, -3.15086840e-06])
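To make that concrete, here is a sketch of the contribution of each term at a single example input, say 150 horsepower (150 is just a value I chose): multiplying each coefficient by the corresponding feature value shows that the squared and cubed terms contribute amounts comparable to the prediction itself, so they are far from negligible.

# Contribution of each term (1, hp, hp^2, hp^3) to the prediction at horsepower = 150
example = pd.DataFrame({"horsepower": [150]})
model.coef_ * poly.transform(example)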