Week 6 Wednesday#

Announcements#

  • HW5 due Wednesday.

  • HW6 is posted (due next Wednesday).

Plan#

  • Using Pipeline to combine multiple steps

import numpy as np
import pandas as pd
import altair as alt

Generating random data#

Here we make some data that follows a random polynomial. Can we use scikit-learn to estimate the underlying polynomial?

Here are some comments about the code:

  • It’s written so that if you change deg to another integer, the rest should work the same.

  • The “y_true” column values exactly follow a degree deg polynomial (here, degree 3).

  • The “y” column values are obtained by adding random noise to the “y_true” values.

  • We use two different size keyword arguments, one for getting the coefficients, and one for getting a different random value for each row in the DataFrame.

  • It’s better to use normally distributed random values, rather than uniformly distributed values in [0,1], so that the data points are not all within a band of width 1 from the true polynomial.

  • In general in Python, if you find yourself writing range(len(???)), you’re probably not writing your code in a “Pythonic” way. We will see an elegant way to replace range(len(???)) below.

# Preview of the x-values we will use below. Note that the right endpoint 2 is
# excluded, and the value near index 30 is a floating-point approximation of 0.
np.arange(-3, 2, 0.1)
array([-3.00000000e+00, -2.90000000e+00, -2.80000000e+00, -2.70000000e+00,
       -2.60000000e+00, -2.50000000e+00, -2.40000000e+00, -2.30000000e+00,
       -2.20000000e+00, -2.10000000e+00, -2.00000000e+00, -1.90000000e+00,
       -1.80000000e+00, -1.70000000e+00, -1.60000000e+00, -1.50000000e+00,
       -1.40000000e+00, -1.30000000e+00, -1.20000000e+00, -1.10000000e+00,
       -1.00000000e+00, -9.00000000e-01, -8.00000000e-01, -7.00000000e-01,
       -6.00000000e-01, -5.00000000e-01, -4.00000000e-01, -3.00000000e-01,
       -2.00000000e-01, -1.00000000e-01,  2.66453526e-15,  1.00000000e-01,
        2.00000000e-01,  3.00000000e-01,  4.00000000e-01,  5.00000000e-01,
        6.00000000e-01,  7.00000000e-01,  8.00000000e-01,  9.00000000e-01,
        1.00000000e+00,  1.10000000e+00,  1.20000000e+00,  1.30000000e+00,
        1.40000000e+00,  1.50000000e+00,  1.60000000e+00,  1.70000000e+00,
        1.80000000e+00,  1.90000000e+00])
deg = 3
rng = np.random.default_rng(seed=27)
# random integers in [-5,5)
m = rng.integers(low=-5, high=5, size=deg+1)
print(m)
# Create a pandas DataFrame df with a column "x" ranging from -3 up to
# (but not including) 2 in increments of 0.1.
df = pd.DataFrame({"x": np.arange(-3, 2, 0.1)})

# Calculate the true polynomial values y_true
df["y_true"] = 0
for i in range(len(m)):
    df["y_true"] += m[i]*df["x"]**i
# Add noise to generate y:
df["y"] = df["y_true"] + rng.normal(scale=5, size=len(df))
[-5  1 -3 -2]

At the end of that process, here is how df looks.

df
x y_true y
0 -3.000000e+00 19.000 23.824406
1 -2.900000e+00 15.648 10.237108
2 -2.800000e+00 12.584 16.919087
3 -2.700000e+00 9.796 8.955196
4 -2.600000e+00 7.272 6.323695
5 -2.500000e+00 5.000 10.602832
6 -2.400000e+00 2.968 0.784105
7 -2.300000e+00 1.164 -5.234227
8 -2.200000e+00 -0.424 -2.771499
9 -2.100000e+00 -1.808 -7.792136
10 -2.000000e+00 -3.000 -12.199286
11 -1.900000e+00 -4.012 -4.739785
12 -1.800000e+00 -4.856 -2.864605
13 -1.700000e+00 -5.544 -16.354306
14 -1.600000e+00 -6.088 -6.015613
15 -1.500000e+00 -6.500 -5.224009
16 -1.400000e+00 -6.792 -5.926045
17 -1.300000e+00 -6.976 -13.326468
18 -1.200000e+00 -7.064 -8.618807
19 -1.100000e+00 -7.068 -7.999078
20 -1.000000e+00 -7.000 -4.835120
21 -9.000000e-01 -6.872 -13.308757
22 -8.000000e-01 -6.696 -4.778936
23 -7.000000e-01 -6.484 -1.524600
24 -6.000000e-01 -6.248 -17.686227
25 -5.000000e-01 -6.000 -4.880304
26 -4.000000e-01 -5.752 -13.385067
27 -3.000000e-01 -5.516 -6.368335
28 -2.000000e-01 -5.304 -8.704742
29 -1.000000e-01 -5.128 1.240175
30 2.664535e-15 -5.000 -4.714457
31 1.000000e-01 -4.932 -0.955557
32 2.000000e-01 -4.936 -10.169895
33 3.000000e-01 -5.024 -2.721364
34 4.000000e-01 -5.208 1.903864
35 5.000000e-01 -5.500 -9.036285
36 6.000000e-01 -5.912 -2.582631
37 7.000000e-01 -6.456 -4.049151
38 8.000000e-01 -7.144 -7.467370
39 9.000000e-01 -7.988 -14.245467
40 1.000000e+00 -9.000 -7.584490
41 1.100000e+00 -10.192 -11.543440
42 1.200000e+00 -11.576 -8.565901
43 1.300000e+00 -13.164 -9.269588
44 1.400000e+00 -14.968 -7.295300
45 1.500000e+00 -17.000 -21.182783
46 1.600000e+00 -19.272 -14.946791
47 1.700000e+00 -21.796 -18.291899
48 1.800000e+00 -24.584 -24.206884
49 1.900000e+00 -27.648 -24.589037

Aside: If you are using range(len(???)) in Python, there is almost always a more elegant way to accomplish the same thing.

  • Rewrite the code above using enumerate(m) instead of range(len(m)).

Recall that m holds the four randomly chosen coefficients for our true polynomial. Why couldn’t we use just for c in m: above? Because we needed to know both the value in m and its index. For example, we needed to know that -3 corresponded to the x**2 column (m[2] is -3).

This is such a common pattern in Python that a function is provided to accomplish it, called enumerate. When we iterate through enumerate(m), pairs of elements are returned: the index and the value. For example, in our case m = [-5, 1, -3, -2], so the initial pair returned will be (0, -5), the next pair will be (1, 1), the next will be (2, -3), and the last will be (3, -2). We assign the values in these pairs to i and c, respectively.
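
As a quick check, here is a minimal sketch printing the pairs that enumerate produces (using the m defined above):

for i, c in enumerate(m):
    # Each pair is (index, value): (0, -5), (1, 1), (2, -3), (3, -2)
    print(i, c)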

# Create a pandas DataFrame df with a column "x" ranging from -3 up to
# (but not including) 2 in increments of 0.1.
df = pd.DataFrame({"x": np.arange(-3, 2, 0.1)})

# Calculate the true polynomial values y_true
df["y_true"] = 0
for i, c in enumerate(m):  # c is m[i]
    df["y_true"] += c*df["x"]**i
# Add noise to generate y:
df["y"] = df["y_true"] + rng.normal(scale=5, size=len(df))
df
x y_true y
0 -3.000000e+00 19.000 19.658465
1 -2.900000e+00 15.648 19.004528
2 -2.800000e+00 12.584 14.544002
3 -2.700000e+00 9.796 6.982885
4 -2.600000e+00 7.272 7.806070
5 -2.500000e+00 5.000 8.322150
6 -2.400000e+00 2.968 -1.289692
7 -2.300000e+00 1.164 -0.827392
8 -2.200000e+00 -0.424 6.149442
9 -2.100000e+00 -1.808 -0.183967
10 -2.000000e+00 -3.000 -6.093065
11 -1.900000e+00 -4.012 -6.817643
12 -1.800000e+00 -4.856 2.249072
13 -1.700000e+00 -5.544 -3.184134
14 -1.600000e+00 -6.088 -15.528922
15 -1.500000e+00 -6.500 -3.867985
16 -1.400000e+00 -6.792 -11.173514
17 -1.300000e+00 -6.976 -7.492278
18 -1.200000e+00 -7.064 -14.988038
19 -1.100000e+00 -7.068 -8.428815
20 -1.000000e+00 -7.000 -10.778717
21 -9.000000e-01 -6.872 -12.853986
22 -8.000000e-01 -6.696 -2.085862
23 -7.000000e-01 -6.484 5.928608
24 -6.000000e-01 -6.248 -3.419772
25 -5.000000e-01 -6.000 -5.006722
26 -4.000000e-01 -5.752 -11.888357
27 -3.000000e-01 -5.516 -6.781925
28 -2.000000e-01 -5.304 -8.647650
29 -1.000000e-01 -5.128 -14.463928
30 2.664535e-15 -5.000 -2.247368
31 1.000000e-01 -4.932 -5.226225
32 2.000000e-01 -4.936 -8.567178
33 3.000000e-01 -5.024 -12.507355
34 4.000000e-01 -5.208 -2.985618
35 5.000000e-01 -5.500 1.248474
36 6.000000e-01 -5.912 -5.436481
37 7.000000e-01 -6.456 -9.005758
38 8.000000e-01 -7.144 -4.079082
39 9.000000e-01 -7.988 -12.877670
40 1.000000e+00 -9.000 -19.902710
41 1.100000e+00 -10.192 -1.632357
42 1.200000e+00 -11.576 -3.138947
43 1.300000e+00 -13.164 -14.674297
44 1.400000e+00 -14.968 -14.626226
45 1.500000e+00 -17.000 -8.663278
46 1.600000e+00 -19.272 -24.778477
47 1.700000e+00 -21.796 -21.197001
48 1.800000e+00 -24.584 -20.253230
49 1.900000e+00 -27.648 -18.957032
  • Here is what the data looks like.

Based on the values in m above, we know these points are approximately following the curve \(y = -2x^3 - 3x^2 + x - 5\). For example, because the leading coefficient is negative, we know the outputs should be getting more negative as x increases, which seems to match what we see in the plotted data.

c1 = alt.Chart(df).mark_circle().encode(
    x="x",
    y="y"
)

c1

Polynomial regression using PolynomialFeatures#

  • Use the include_bias keyword argument in PolynomialFeatures so we do not get a column for \(x^0\).

First let’s see the default behavior. Notice how the columns of the transformed data correspond to powers of the “x” column, including the zero-th power (the column of all 1s).

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3)
pd.DataFrame(poly.fit_transform(df[['x']]))
0 1 2 3
0 1.0 -3.000000e+00 9.000000e+00 -2.700000e+01
1 1.0 -2.900000e+00 8.410000e+00 -2.438900e+01
2 1.0 -2.800000e+00 7.840000e+00 -2.195200e+01
3 1.0 -2.700000e+00 7.290000e+00 -1.968300e+01
4 1.0 -2.600000e+00 6.760000e+00 -1.757600e+01
5 1.0 -2.500000e+00 6.250000e+00 -1.562500e+01
6 1.0 -2.400000e+00 5.760000e+00 -1.382400e+01
7 1.0 -2.300000e+00 5.290000e+00 -1.216700e+01
8 1.0 -2.200000e+00 4.840000e+00 -1.064800e+01
9 1.0 -2.100000e+00 4.410000e+00 -9.261000e+00
10 1.0 -2.000000e+00 4.000000e+00 -8.000000e+00
11 1.0 -1.900000e+00 3.610000e+00 -6.859000e+00
12 1.0 -1.800000e+00 3.240000e+00 -5.832000e+00
13 1.0 -1.700000e+00 2.890000e+00 -4.913000e+00
14 1.0 -1.600000e+00 2.560000e+00 -4.096000e+00
15 1.0 -1.500000e+00 2.250000e+00 -3.375000e+00
16 1.0 -1.400000e+00 1.960000e+00 -2.744000e+00
17 1.0 -1.300000e+00 1.690000e+00 -2.197000e+00
18 1.0 -1.200000e+00 1.440000e+00 -1.728000e+00
19 1.0 -1.100000e+00 1.210000e+00 -1.331000e+00
20 1.0 -1.000000e+00 1.000000e+00 -1.000000e+00
21 1.0 -9.000000e-01 8.100000e-01 -7.290000e-01
22 1.0 -8.000000e-01 6.400000e-01 -5.120000e-01
23 1.0 -7.000000e-01 4.900000e-01 -3.430000e-01
24 1.0 -6.000000e-01 3.600000e-01 -2.160000e-01
25 1.0 -5.000000e-01 2.500000e-01 -1.250000e-01
26 1.0 -4.000000e-01 1.600000e-01 -6.400000e-02
27 1.0 -3.000000e-01 9.000000e-02 -2.700000e-02
28 1.0 -2.000000e-01 4.000000e-02 -8.000000e-03
29 1.0 -1.000000e-01 1.000000e-02 -1.000000e-03
30 1.0 2.664535e-15 7.099748e-30 1.891753e-44
31 1.0 1.000000e-01 1.000000e-02 1.000000e-03
32 1.0 2.000000e-01 4.000000e-02 8.000000e-03
33 1.0 3.000000e-01 9.000000e-02 2.700000e-02
34 1.0 4.000000e-01 1.600000e-01 6.400000e-02
35 1.0 5.000000e-01 2.500000e-01 1.250000e-01
36 1.0 6.000000e-01 3.600000e-01 2.160000e-01
37 1.0 7.000000e-01 4.900000e-01 3.430000e-01
38 1.0 8.000000e-01 6.400000e-01 5.120000e-01
39 1.0 9.000000e-01 8.100000e-01 7.290000e-01
40 1.0 1.000000e+00 1.000000e+00 1.000000e+00
41 1.0 1.100000e+00 1.210000e+00 1.331000e+00
42 1.0 1.200000e+00 1.440000e+00 1.728000e+00
43 1.0 1.300000e+00 1.690000e+00 2.197000e+00
44 1.0 1.400000e+00 1.960000e+00 2.744000e+00
45 1.0 1.500000e+00 2.250000e+00 3.375000e+00
46 1.0 1.600000e+00 2.560000e+00 4.096000e+00
47 1.0 1.700000e+00 2.890000e+00 4.913000e+00
48 1.0 1.800000e+00 3.240000e+00 5.832000e+00
49 1.0 1.900000e+00 3.610000e+00 6.859000e+00

Let’s get rid of the column of all 1 values. We do this by setting include_bias=False when we instantiate the PolynomialFeatures object.

We can perform both the fit and the transform steps at once using fit_transform.
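
For comparison, here is a minimal sketch of the equivalent two-step version, calling fit and transform separately:

poly = PolynomialFeatures(degree=3, include_bias=False)
poly.fit(df[['x']])              # fit: record the input column and degrees
arr = poly.transform(df[['x']])  # transform: compute x, x^2, x^3 as an array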

poly = PolynomialFeatures(degree=3, include_bias=False)
df_pow = pd.DataFrame(poly.fit_transform(df[['x']]))
df_pow
0 1 2
0 -3.000000e+00 9.000000e+00 -2.700000e+01
1 -2.900000e+00 8.410000e+00 -2.438900e+01
2 -2.800000e+00 7.840000e+00 -2.195200e+01
3 -2.700000e+00 7.290000e+00 -1.968300e+01
4 -2.600000e+00 6.760000e+00 -1.757600e+01
5 -2.500000e+00 6.250000e+00 -1.562500e+01
6 -2.400000e+00 5.760000e+00 -1.382400e+01
7 -2.300000e+00 5.290000e+00 -1.216700e+01
8 -2.200000e+00 4.840000e+00 -1.064800e+01
9 -2.100000e+00 4.410000e+00 -9.261000e+00
10 -2.000000e+00 4.000000e+00 -8.000000e+00
11 -1.900000e+00 3.610000e+00 -6.859000e+00
12 -1.800000e+00 3.240000e+00 -5.832000e+00
13 -1.700000e+00 2.890000e+00 -4.913000e+00
14 -1.600000e+00 2.560000e+00 -4.096000e+00
15 -1.500000e+00 2.250000e+00 -3.375000e+00
16 -1.400000e+00 1.960000e+00 -2.744000e+00
17 -1.300000e+00 1.690000e+00 -2.197000e+00
18 -1.200000e+00 1.440000e+00 -1.728000e+00
19 -1.100000e+00 1.210000e+00 -1.331000e+00
20 -1.000000e+00 1.000000e+00 -1.000000e+00
21 -9.000000e-01 8.100000e-01 -7.290000e-01
22 -8.000000e-01 6.400000e-01 -5.120000e-01
23 -7.000000e-01 4.900000e-01 -3.430000e-01
24 -6.000000e-01 3.600000e-01 -2.160000e-01
25 -5.000000e-01 2.500000e-01 -1.250000e-01
26 -4.000000e-01 1.600000e-01 -6.400000e-02
27 -3.000000e-01 9.000000e-02 -2.700000e-02
28 -2.000000e-01 4.000000e-02 -8.000000e-03
29 -1.000000e-01 1.000000e-02 -1.000000e-03
30 2.664535e-15 7.099748e-30 1.891753e-44
31 1.000000e-01 1.000000e-02 1.000000e-03
32 2.000000e-01 4.000000e-02 8.000000e-03
33 3.000000e-01 9.000000e-02 2.700000e-02
34 4.000000e-01 1.600000e-01 6.400000e-02
35 5.000000e-01 2.500000e-01 1.250000e-01
36 6.000000e-01 3.600000e-01 2.160000e-01
37 7.000000e-01 4.900000e-01 3.430000e-01
38 8.000000e-01 6.400000e-01 5.120000e-01
39 9.000000e-01 8.100000e-01 7.290000e-01
40 1.000000e+00 1.000000e+00 1.000000e+00
41 1.100000e+00 1.210000e+00 1.331000e+00
42 1.200000e+00 1.440000e+00 1.728000e+00
43 1.300000e+00 1.690000e+00 2.197000e+00
44 1.400000e+00 1.960000e+00 2.744000e+00
45 1.500000e+00 2.250000e+00 3.375000e+00
46 1.600000e+00 2.560000e+00 4.096000e+00
47 1.700000e+00 2.890000e+00 4.913000e+00
48 1.800000e+00 3.240000e+00 5.832000e+00
49 1.900000e+00 3.610000e+00 6.859000e+00
  • Name the columns using the get_feature_names_out method of the PolynomialFeatures object.

poly.get_feature_names_out()
array(['x', 'x^2', 'x^3'], dtype=object)
df_pow.columns = poly.get_feature_names_out()
df_pow
x x^2 x^3
0 -3.000000e+00 9.000000e+00 -2.700000e+01
1 -2.900000e+00 8.410000e+00 -2.438900e+01
2 -2.800000e+00 7.840000e+00 -2.195200e+01
3 -2.700000e+00 7.290000e+00 -1.968300e+01
4 -2.600000e+00 6.760000e+00 -1.757600e+01
5 -2.500000e+00 6.250000e+00 -1.562500e+01
6 -2.400000e+00 5.760000e+00 -1.382400e+01
7 -2.300000e+00 5.290000e+00 -1.216700e+01
8 -2.200000e+00 4.840000e+00 -1.064800e+01
9 -2.100000e+00 4.410000e+00 -9.261000e+00
10 -2.000000e+00 4.000000e+00 -8.000000e+00
11 -1.900000e+00 3.610000e+00 -6.859000e+00
12 -1.800000e+00 3.240000e+00 -5.832000e+00
13 -1.700000e+00 2.890000e+00 -4.913000e+00
14 -1.600000e+00 2.560000e+00 -4.096000e+00
15 -1.500000e+00 2.250000e+00 -3.375000e+00
16 -1.400000e+00 1.960000e+00 -2.744000e+00
17 -1.300000e+00 1.690000e+00 -2.197000e+00
18 -1.200000e+00 1.440000e+00 -1.728000e+00
19 -1.100000e+00 1.210000e+00 -1.331000e+00
20 -1.000000e+00 1.000000e+00 -1.000000e+00
21 -9.000000e-01 8.100000e-01 -7.290000e-01
22 -8.000000e-01 6.400000e-01 -5.120000e-01
23 -7.000000e-01 4.900000e-01 -3.430000e-01
24 -6.000000e-01 3.600000e-01 -2.160000e-01
25 -5.000000e-01 2.500000e-01 -1.250000e-01
26 -4.000000e-01 1.600000e-01 -6.400000e-02
27 -3.000000e-01 9.000000e-02 -2.700000e-02
28 -2.000000e-01 4.000000e-02 -8.000000e-03
29 -1.000000e-01 1.000000e-02 -1.000000e-03
30 2.664535e-15 7.099748e-30 1.891753e-44
31 1.000000e-01 1.000000e-02 1.000000e-03
32 2.000000e-01 4.000000e-02 8.000000e-03
33 3.000000e-01 9.000000e-02 2.700000e-02
34 4.000000e-01 1.600000e-01 6.400000e-02
35 5.000000e-01 2.500000e-01 1.250000e-01
36 6.000000e-01 3.600000e-01 2.160000e-01
37 7.000000e-01 4.900000e-01 3.430000e-01
38 8.000000e-01 6.400000e-01 5.120000e-01
39 9.000000e-01 8.100000e-01 7.290000e-01
40 1.000000e+00 1.000000e+00 1.000000e+00
41 1.100000e+00 1.210000e+00 1.331000e+00
42 1.200000e+00 1.440000e+00 1.728000e+00
43 1.300000e+00 1.690000e+00 2.197000e+00
44 1.400000e+00 1.960000e+00 2.744000e+00
45 1.500000e+00 2.250000e+00 3.375000e+00
46 1.600000e+00 2.560000e+00 4.096000e+00
47 1.700000e+00 2.890000e+00 4.913000e+00
48 1.800000e+00 3.240000e+00 5.832000e+00
49 1.900000e+00 3.610000e+00 6.859000e+00
  • Concatenate the “y” and “y_true” columns from df onto the end of df_pow using pd.concat((???, ???), axis=???). Name the result df_both.

Notice how we use axis=1, because the column labels are changing but the row labels are staying the same.

df_both = pd.concat((df_pow, df[["y", "y_true"]]), axis=1)
df_both
x x^2 x^3 y y_true
0 -3.000000e+00 9.000000e+00 -2.700000e+01 19.658465 19.000
1 -2.900000e+00 8.410000e+00 -2.438900e+01 19.004528 15.648
2 -2.800000e+00 7.840000e+00 -2.195200e+01 14.544002 12.584
3 -2.700000e+00 7.290000e+00 -1.968300e+01 6.982885 9.796
4 -2.600000e+00 6.760000e+00 -1.757600e+01 7.806070 7.272
5 -2.500000e+00 6.250000e+00 -1.562500e+01 8.322150 5.000
6 -2.400000e+00 5.760000e+00 -1.382400e+01 -1.289692 2.968
7 -2.300000e+00 5.290000e+00 -1.216700e+01 -0.827392 1.164
8 -2.200000e+00 4.840000e+00 -1.064800e+01 6.149442 -0.424
9 -2.100000e+00 4.410000e+00 -9.261000e+00 -0.183967 -1.808
10 -2.000000e+00 4.000000e+00 -8.000000e+00 -6.093065 -3.000
11 -1.900000e+00 3.610000e+00 -6.859000e+00 -6.817643 -4.012
12 -1.800000e+00 3.240000e+00 -5.832000e+00 2.249072 -4.856
13 -1.700000e+00 2.890000e+00 -4.913000e+00 -3.184134 -5.544
14 -1.600000e+00 2.560000e+00 -4.096000e+00 -15.528922 -6.088
15 -1.500000e+00 2.250000e+00 -3.375000e+00 -3.867985 -6.500
16 -1.400000e+00 1.960000e+00 -2.744000e+00 -11.173514 -6.792
17 -1.300000e+00 1.690000e+00 -2.197000e+00 -7.492278 -6.976
18 -1.200000e+00 1.440000e+00 -1.728000e+00 -14.988038 -7.064
19 -1.100000e+00 1.210000e+00 -1.331000e+00 -8.428815 -7.068
20 -1.000000e+00 1.000000e+00 -1.000000e+00 -10.778717 -7.000
21 -9.000000e-01 8.100000e-01 -7.290000e-01 -12.853986 -6.872
22 -8.000000e-01 6.400000e-01 -5.120000e-01 -2.085862 -6.696
23 -7.000000e-01 4.900000e-01 -3.430000e-01 5.928608 -6.484
24 -6.000000e-01 3.600000e-01 -2.160000e-01 -3.419772 -6.248
25 -5.000000e-01 2.500000e-01 -1.250000e-01 -5.006722 -6.000
26 -4.000000e-01 1.600000e-01 -6.400000e-02 -11.888357 -5.752
27 -3.000000e-01 9.000000e-02 -2.700000e-02 -6.781925 -5.516
28 -2.000000e-01 4.000000e-02 -8.000000e-03 -8.647650 -5.304
29 -1.000000e-01 1.000000e-02 -1.000000e-03 -14.463928 -5.128
30 2.664535e-15 7.099748e-30 1.891753e-44 -2.247368 -5.000
31 1.000000e-01 1.000000e-02 1.000000e-03 -5.226225 -4.932
32 2.000000e-01 4.000000e-02 8.000000e-03 -8.567178 -4.936
33 3.000000e-01 9.000000e-02 2.700000e-02 -12.507355 -5.024
34 4.000000e-01 1.600000e-01 6.400000e-02 -2.985618 -5.208
35 5.000000e-01 2.500000e-01 1.250000e-01 1.248474 -5.500
36 6.000000e-01 3.600000e-01 2.160000e-01 -5.436481 -5.912
37 7.000000e-01 4.900000e-01 3.430000e-01 -9.005758 -6.456
38 8.000000e-01 6.400000e-01 5.120000e-01 -4.079082 -7.144
39 9.000000e-01 8.100000e-01 7.290000e-01 -12.877670 -7.988
40 1.000000e+00 1.000000e+00 1.000000e+00 -19.902710 -9.000
41 1.100000e+00 1.210000e+00 1.331000e+00 -1.632357 -10.192
42 1.200000e+00 1.440000e+00 1.728000e+00 -3.138947 -11.576
43 1.300000e+00 1.690000e+00 2.197000e+00 -14.674297 -13.164
44 1.400000e+00 1.960000e+00 2.744000e+00 -14.626226 -14.968
45 1.500000e+00 2.250000e+00 3.375000e+00 -8.663278 -17.000
46 1.600000e+00 2.560000e+00 4.096000e+00 -24.778477 -19.272
47 1.700000e+00 2.890000e+00 4.913000e+00 -21.197001 -21.796
48 1.800000e+00 3.240000e+00 5.832000e+00 -20.253230 -24.584
49 1.900000e+00 3.610000e+00 6.859000e+00 -18.957032 -27.648
  • Find the “best” coefficient values for modeling \(y \approx c_3 x^3 + c_2 x^2 + c_1 x + c_0\).

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(df_both[["x","x^2","x^3"]], df_both["y"])
LinearRegression()
  • How do these values compare to the true coefficient values?

reg.coef_
array([ 1.6446191 , -2.03654891, -1.83956523])
reg.intercept_
-6.358623201499859

The true values follow the polynomial \(y = -2x^3 - 3x^2 + x - 5\). In our case, we have found approximately \(-1.8 x^3 - 2x^2 + 1.6 x - 6.4\). These two sequences of coefficients are remarkably similar.
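
As a sanity check, here is a minimal sketch lining up the fitted values with the true coefficients in m (recall m lists the coefficients from degree 0 up to degree 3):

# m = [c_0, c_1, c_2, c_3]; reg.coef_ holds the fitted coefficients of x, x^2, x^3.
print("constant:", m[0], "fitted:", reg.intercept_)
for i, c in enumerate(m[1:], start=1):
    print(f"x^{i}:", c, "fitted:", reg.coef_[i-1])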

We will see a more efficient way to do all of these steps below, using Pipeline.

Using Pipeline to combine multiple steps#

The above process is a little awkward. We can achieve the same thing much more efficiently by using another data type defined by scikit-learn, Pipeline. (The tradeoff is that it is less explicit what is happening.)

  • Import the Pipeline class from sklearn.pipeline.

from sklearn.pipeline import Pipeline
  • Make an instance of this Pipeline class. Pass to the constructor a list of length-2 tuples, where each tuple provides a name for the step (as a string) and the corresponding object (constructed like PolynomialFeatures(???)).

pipe = Pipeline(
    [
        ('poly', PolynomialFeatures(degree=3, include_bias=False)),
        ('reg', LinearRegression())
    ]
)
  • Fit this object to the data.

This is where we really benefit from Pipeline. The following call of pipe.fit first fits and transforms the data using PolynomialFeatures, and then fits LinearRegression to that transformed data.
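
Conceptually, that single call replaces a two-step sketch like the following (the poly_step and reg_step names are ours, and this is not scikit-learn’s literal internal code):

poly_step = PolynomialFeatures(degree=3, include_bias=False)
reg_step = LinearRegression()
X_poly = poly_step.fit_transform(df[["x"]])  # step 1: fit and transform the input
reg_step.fit(X_poly, df["y"])                # step 2: fit on the transformed data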

pipe.fit(df[["x"]], df['y'])
Pipeline(steps=[('poly', PolynomialFeatures(degree=3, include_bias=False)),
                ('reg', LinearRegression())])
  • Do the coefficients match what we found above? Use the named_steps attribute, or just use the name directly.

If you try calling pipe.coef_, you will get an error message. It’s not the Pipeline object itself that has the fitted coefficients, but the LinearRegression object within it.

pipe.coef_
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [20], line 1
----> 1 pipe.coef_

AttributeError: 'Pipeline' object has no attribute 'coef_'

The information is recorded in a Python dictionary stored in the named_steps attribute of our Pipeline object.

pipe.named_steps
{'poly': PolynomialFeatures(degree=3, include_bias=False),
 'reg': LinearRegression()}

The point of all that: now that we know how to access the LinearRegression object, we can get its coef_ attribute just as we usually do with linear regression. (Remember that this attribute only exists after we call fit.)
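
For example, the named_steps route is equivalent to the indexing used in the next cell:

pipe.named_steps["reg"].coef_  # the same LinearRegression object as pipe['reg']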

Notice that these are the exact same values as what we found above. It’s worth looking over both procedures and noticing how much shorter this procedure using Pipeline is.

pipe['reg'].coef_
array([ 1.6446191 , -2.03654891, -1.83956523])
pipe['reg'].intercept_
-6.3586232014998565
  • Call the predict method, and add the resulting values to a new column in df named “pred”.

The following simple code evaluates our “best fit” degree three polynomial \(-1.8 x^3 - 2x^2 + 1.6 x - 6.4\) for every value in the “x” column. Notice how we don’t need to explicitly type "x^2" or anything like that; the polynomial part of this polynomial regression happens “under the hood”.
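
Roughly speaking, pipe.predict is the following sketch (not the literal implementation): transform with the already-fitted PolynomialFeatures step, then predict with the LinearRegression step.

X_poly = pipe['poly'].transform(df[["x"]])  # x -> (x, x^2, x^3)
pred_manual = pipe['reg'].predict(X_poly)   # same values as pipe.predict(df[["x"]])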

df["pred"] = pipe.predict(df[["x"]])
  • Plot the resulting predictions using a red line. Name the chart c2.

This curve does lie perfectly on a cubic polynomial; more specifically, that cubic polynomial is approximately \(-1.8 x^3 - 2x^2 + 1.6 x - 6.4\). This is our cubic polynomial of “best fit” (meaning the Mean Squared Error between the data and this polynomial is minimized). For the given data, as measured by Mean Squared Error, this polynomial fits the data “better” than the true underlying polynomial \(-2x^3 - 3x^2 + x - 5\) does.

c2 = alt.Chart(df).mark_line(color="red").encode(
    x="x",
    y="pred"
)
c2
  • Plot the true values using a dashed black line, using strokeDash=[10,5] as an argument to mark_line. Name the chart c3.

Don’t focus too much on the strokeDash=[10,5] part; I just wanted to show you an example of an option that exists. Here the dashes are made of 10 black pixels followed by a gap of 5 pixels.

This curve represents the true underlying polynomial that we used to generate the data (before adding the random noise to it).

c3 = alt.Chart(df).mark_line(color="black", strokeDash=[10,5]).encode(
    x="x",
    y="y_true"
)
c3
  • Layer these plots on top of the above scatter plot c1.

Notice how similar our two polynomial curves are. If we had used more data points or a smaller standard deviation for our random noise, we would expect the curves to be even closer to each other. (A sketch for experimenting with this appears after the chart.)

c1 + c2 + c3
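
Here is a minimal sketch for experimenting with that claim (the refit helper and its step and noise_scale parameters are ours, not from lecture):

def refit(step=0.01, noise_scale=1):
    # Regenerate the data with more points (smaller step) and/or less noise,
    # then refit the same Pipeline and return the fitted coefficients.
    df2 = pd.DataFrame({"x": np.arange(-3, 2, step)})
    df2["y"] = sum(c*df2["x"]**i for i, c in enumerate(m))
    df2["y"] += rng.normal(scale=noise_scale, size=len(df2))
    pipe2 = Pipeline([
        ("poly", PolynomialFeatures(degree=3, include_bias=False)),
        ("reg", LinearRegression()),
    ])
    pipe2.fit(df2[["x"]], df2["y"])
    return pipe2["reg"].coef_, pipe2["reg"].intercept_

# Expect coefficients close to (1, -3, -2) and an intercept close to -5.
refit()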