Week 6 Wednesday#
Announcements#
HW5 due Wednesday.
HW6 is posted (due next Wednesday).
Plan:#
Using `Pipeline` to combine multiple steps
import numpy as np
import pandas as pd
import altair as alt
Generating random data#
Here we make some data that follows a random polynomial. Can we use scikit-learn to estimate the underlying polynomial?
Here are some comments about the code:
- It’s written so that if you change `deg` to another integer, the rest should work the same.
- The “y_true” column values follow a degree 3 polynomial exactly.
- The “y” column values are obtained by adding random noise to the “y_true” values.
- We use two different `size` keyword arguments: one for getting the coefficients, and one for getting a different random value for each row in the DataFrame.
- It’s better to use normally distributed random values, rather than uniformly distributed values in [0,1], so that the data points are not all within a band of width 1 from the true polynomial.
- In general in Python, if you find yourself writing `range(len(???))`, you’re probably not writing your code in a “Pythonic” way. We will see an elegant way to replace `range(len(???))` below.
np.arange(-3, 2, 0.1)
array([-3.00000000e+00, -2.90000000e+00, -2.80000000e+00, -2.70000000e+00,
-2.60000000e+00, -2.50000000e+00, -2.40000000e+00, -2.30000000e+00,
-2.20000000e+00, -2.10000000e+00, -2.00000000e+00, -1.90000000e+00,
-1.80000000e+00, -1.70000000e+00, -1.60000000e+00, -1.50000000e+00,
-1.40000000e+00, -1.30000000e+00, -1.20000000e+00, -1.10000000e+00,
-1.00000000e+00, -9.00000000e-01, -8.00000000e-01, -7.00000000e-01,
-6.00000000e-01, -5.00000000e-01, -4.00000000e-01, -3.00000000e-01,
-2.00000000e-01, -1.00000000e-01, 2.66453526e-15, 1.00000000e-01,
2.00000000e-01, 3.00000000e-01, 4.00000000e-01, 5.00000000e-01,
6.00000000e-01, 7.00000000e-01, 8.00000000e-01, 9.00000000e-01,
1.00000000e+00, 1.10000000e+00, 1.20000000e+00, 1.30000000e+00,
1.40000000e+00, 1.50000000e+00, 1.60000000e+00, 1.70000000e+00,
1.80000000e+00, 1.90000000e+00])
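As an aside, the odd-looking entry 2.66453526e-15 in the middle of this array is accumulated floating-point error from np.arange; that value “should” be 0. If we wanted the endpoints computed exactly, np.linspace would be an alternative (not used below) for building essentially the same 50 points.
# Hypothetical alternative: 50 evenly spaced points from -3 to 1.9, endpoints exact.
np.linspace(-3, 1.9, 50)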
deg = 3
rng = np.random.default_rng(seed=27)
# random integers in [-5,5)
m = rng.integers(low=-5, high=5, size=deg+1)
print(m)
# A pandas DataFrame df is created with a column x that ranges from -3 up to (but not including) 2 in increments of 0.1.
df = pd.DataFrame({"x": np.arange(-3, 2, 0.1)})
# Calculate the true polynomial values y_true
df["y_true"] = 0
for i in range(len(m)):
    df["y_true"] += m[i]*df["x"]**i
# Add noise to generate y:
df["y"] = df["y_true"] + rng.normal(scale=5, size=len(df))
[-5 1 -3 -2]
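As a quick aside (not part of the original code), NumPy can display the polynomial that these coefficients represent; np.polynomial.Polynomial interprets the array in increasing order of degree, matching how we use m below.
# Coefficients are read as [constant, x, x^2, x^3].
print(np.polynomial.Polynomial(m))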
At the end of that process, here is how `df` looks.
df
 | x | y_true | y |
---|---|---|---|
0 | -3.000000e+00 | 19.000 | 23.824406 |
1 | -2.900000e+00 | 15.648 | 10.237108 |
2 | -2.800000e+00 | 12.584 | 16.919087 |
3 | -2.700000e+00 | 9.796 | 8.955196 |
4 | -2.600000e+00 | 7.272 | 6.323695 |
5 | -2.500000e+00 | 5.000 | 10.602832 |
6 | -2.400000e+00 | 2.968 | 0.784105 |
7 | -2.300000e+00 | 1.164 | -5.234227 |
8 | -2.200000e+00 | -0.424 | -2.771499 |
9 | -2.100000e+00 | -1.808 | -7.792136 |
10 | -2.000000e+00 | -3.000 | -12.199286 |
11 | -1.900000e+00 | -4.012 | -4.739785 |
12 | -1.800000e+00 | -4.856 | -2.864605 |
13 | -1.700000e+00 | -5.544 | -16.354306 |
14 | -1.600000e+00 | -6.088 | -6.015613 |
15 | -1.500000e+00 | -6.500 | -5.224009 |
16 | -1.400000e+00 | -6.792 | -5.926045 |
17 | -1.300000e+00 | -6.976 | -13.326468 |
18 | -1.200000e+00 | -7.064 | -8.618807 |
19 | -1.100000e+00 | -7.068 | -7.999078 |
20 | -1.000000e+00 | -7.000 | -4.835120 |
21 | -9.000000e-01 | -6.872 | -13.308757 |
22 | -8.000000e-01 | -6.696 | -4.778936 |
23 | -7.000000e-01 | -6.484 | -1.524600 |
24 | -6.000000e-01 | -6.248 | -17.686227 |
25 | -5.000000e-01 | -6.000 | -4.880304 |
26 | -4.000000e-01 | -5.752 | -13.385067 |
27 | -3.000000e-01 | -5.516 | -6.368335 |
28 | -2.000000e-01 | -5.304 | -8.704742 |
29 | -1.000000e-01 | -5.128 | 1.240175 |
30 | 2.664535e-15 | -5.000 | -4.714457 |
31 | 1.000000e-01 | -4.932 | -0.955557 |
32 | 2.000000e-01 | -4.936 | -10.169895 |
33 | 3.000000e-01 | -5.024 | -2.721364 |
34 | 4.000000e-01 | -5.208 | 1.903864 |
35 | 5.000000e-01 | -5.500 | -9.036285 |
36 | 6.000000e-01 | -5.912 | -2.582631 |
37 | 7.000000e-01 | -6.456 | -4.049151 |
38 | 8.000000e-01 | -7.144 | -7.467370 |
39 | 9.000000e-01 | -7.988 | -14.245467 |
40 | 1.000000e+00 | -9.000 | -7.584490 |
41 | 1.100000e+00 | -10.192 | -11.543440 |
42 | 1.200000e+00 | -11.576 | -8.565901 |
43 | 1.300000e+00 | -13.164 | -9.269588 |
44 | 1.400000e+00 | -14.968 | -7.295300 |
45 | 1.500000e+00 | -17.000 | -21.182783 |
46 | 1.600000e+00 | -19.272 | -14.946791 |
47 | 1.700000e+00 | -21.796 | -18.291899 |
48 | 1.800000e+00 | -24.584 | -24.206884 |
49 | 1.900000e+00 | -27.648 | -24.589037 |
Aside: If you are using `range(len(???))` in Python, there is almost always a more elegant way to accomplish the same thing.
Rewrite the code above using `enumerate(m)` instead of `range(len(m))`.
Recall that `m` holds the four randomly chosen coefficients for our true polynomial. Why couldn’t we use just `for c in m:` above? Because we needed to know both the value in `m` and its index. For example, we needed to know that `-3` corresponded to the `x**2` column (`m[2]` is `-3`).
This is such a common pattern in Python that a function is provided to accomplish it, called `enumerate`. When we iterate through `enumerate(m)`, pairs of elements are returned: the index and the value. For example, in our case `m = [-5, 1, -3, -2]`, so the first pair returned will be `(0, -5)`, the next pair will be `(1, 1)`, the next pair will be `(2, -3)`, and the last pair will be `(3, -2)`. We assign the values in these pairs to `i` and `c`, respectively.
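Here is a quick check of what `enumerate(m)` produces (the coefficients come back as NumPy integers, since `m` is a NumPy array).
# Pairs of (index, coefficient): (0, -5), (1, 1), (2, -3), (3, -2)
list(enumerate(m))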
# A pandas DataFrame df is created with a column x that ranges from -3 up to (but not including) 2 in increments of 0.1.
df = pd.DataFrame({"x": np.arange(-3, 2, 0.1)})
# Calculate the true polynomial values y_true
df["y_true"] = 0
for i, c in enumerate(m):  # c is m[i]
    df["y_true"] += c*df["x"]**i
# Add noise to generate y:
df["y"] = df["y_true"] + rng.normal(scale=5, size=len(df))
df
 | x | y_true | y |
---|---|---|---|
0 | -3.000000e+00 | 19.000 | 19.658465 |
1 | -2.900000e+00 | 15.648 | 19.004528 |
2 | -2.800000e+00 | 12.584 | 14.544002 |
3 | -2.700000e+00 | 9.796 | 6.982885 |
4 | -2.600000e+00 | 7.272 | 7.806070 |
5 | -2.500000e+00 | 5.000 | 8.322150 |
6 | -2.400000e+00 | 2.968 | -1.289692 |
7 | -2.300000e+00 | 1.164 | -0.827392 |
8 | -2.200000e+00 | -0.424 | 6.149442 |
9 | -2.100000e+00 | -1.808 | -0.183967 |
10 | -2.000000e+00 | -3.000 | -6.093065 |
11 | -1.900000e+00 | -4.012 | -6.817643 |
12 | -1.800000e+00 | -4.856 | 2.249072 |
13 | -1.700000e+00 | -5.544 | -3.184134 |
14 | -1.600000e+00 | -6.088 | -15.528922 |
15 | -1.500000e+00 | -6.500 | -3.867985 |
16 | -1.400000e+00 | -6.792 | -11.173514 |
17 | -1.300000e+00 | -6.976 | -7.492278 |
18 | -1.200000e+00 | -7.064 | -14.988038 |
19 | -1.100000e+00 | -7.068 | -8.428815 |
20 | -1.000000e+00 | -7.000 | -10.778717 |
21 | -9.000000e-01 | -6.872 | -12.853986 |
22 | -8.000000e-01 | -6.696 | -2.085862 |
23 | -7.000000e-01 | -6.484 | 5.928608 |
24 | -6.000000e-01 | -6.248 | -3.419772 |
25 | -5.000000e-01 | -6.000 | -5.006722 |
26 | -4.000000e-01 | -5.752 | -11.888357 |
27 | -3.000000e-01 | -5.516 | -6.781925 |
28 | -2.000000e-01 | -5.304 | -8.647650 |
29 | -1.000000e-01 | -5.128 | -14.463928 |
30 | 2.664535e-15 | -5.000 | -2.247368 |
31 | 1.000000e-01 | -4.932 | -5.226225 |
32 | 2.000000e-01 | -4.936 | -8.567178 |
33 | 3.000000e-01 | -5.024 | -12.507355 |
34 | 4.000000e-01 | -5.208 | -2.985618 |
35 | 5.000000e-01 | -5.500 | 1.248474 |
36 | 6.000000e-01 | -5.912 | -5.436481 |
37 | 7.000000e-01 | -6.456 | -9.005758 |
38 | 8.000000e-01 | -7.144 | -4.079082 |
39 | 9.000000e-01 | -7.988 | -12.877670 |
40 | 1.000000e+00 | -9.000 | -19.902710 |
41 | 1.100000e+00 | -10.192 | -1.632357 |
42 | 1.200000e+00 | -11.576 | -3.138947 |
43 | 1.300000e+00 | -13.164 | -14.674297 |
44 | 1.400000e+00 | -14.968 | -14.626226 |
45 | 1.500000e+00 | -17.000 | -8.663278 |
46 | 1.600000e+00 | -19.272 | -24.778477 |
47 | 1.700000e+00 | -21.796 | -21.197001 |
48 | 1.800000e+00 | -24.584 | -20.253230 |
49 | 1.900000e+00 | -27.648 | -18.957032 |
Here is what the data looks like.
Based on the values in `m` above, we know these points are approximately following the curve \(y = -2x^3 - 3x^2 + x - 5\). For example, because the leading coefficient is negative, we know the outputs should be getting more negative as `x` increases, which seems to match what we see in the plotted data.
c1 = alt.Chart(df).mark_circle().encode(
x="x",
y="y"
)
c1
Polynomial regression using `PolynomialFeatures`#
Use the `include_bias` keyword argument in `PolynomialFeatures` so we do not get a column for \(x^0\).
Notice how these values correspond to powers of the “x” column, including the zero-th power (all 1s).
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3)
pd.DataFrame(poly.fit_transform(df[['x']]))
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 1.0 | -3.000000e+00 | 9.000000e+00 | -2.700000e+01 |
1 | 1.0 | -2.900000e+00 | 8.410000e+00 | -2.438900e+01 |
2 | 1.0 | -2.800000e+00 | 7.840000e+00 | -2.195200e+01 |
3 | 1.0 | -2.700000e+00 | 7.290000e+00 | -1.968300e+01 |
4 | 1.0 | -2.600000e+00 | 6.760000e+00 | -1.757600e+01 |
5 | 1.0 | -2.500000e+00 | 6.250000e+00 | -1.562500e+01 |
6 | 1.0 | -2.400000e+00 | 5.760000e+00 | -1.382400e+01 |
7 | 1.0 | -2.300000e+00 | 5.290000e+00 | -1.216700e+01 |
8 | 1.0 | -2.200000e+00 | 4.840000e+00 | -1.064800e+01 |
9 | 1.0 | -2.100000e+00 | 4.410000e+00 | -9.261000e+00 |
10 | 1.0 | -2.000000e+00 | 4.000000e+00 | -8.000000e+00 |
11 | 1.0 | -1.900000e+00 | 3.610000e+00 | -6.859000e+00 |
12 | 1.0 | -1.800000e+00 | 3.240000e+00 | -5.832000e+00 |
13 | 1.0 | -1.700000e+00 | 2.890000e+00 | -4.913000e+00 |
14 | 1.0 | -1.600000e+00 | 2.560000e+00 | -4.096000e+00 |
15 | 1.0 | -1.500000e+00 | 2.250000e+00 | -3.375000e+00 |
16 | 1.0 | -1.400000e+00 | 1.960000e+00 | -2.744000e+00 |
17 | 1.0 | -1.300000e+00 | 1.690000e+00 | -2.197000e+00 |
18 | 1.0 | -1.200000e+00 | 1.440000e+00 | -1.728000e+00 |
19 | 1.0 | -1.100000e+00 | 1.210000e+00 | -1.331000e+00 |
20 | 1.0 | -1.000000e+00 | 1.000000e+00 | -1.000000e+00 |
21 | 1.0 | -9.000000e-01 | 8.100000e-01 | -7.290000e-01 |
22 | 1.0 | -8.000000e-01 | 6.400000e-01 | -5.120000e-01 |
23 | 1.0 | -7.000000e-01 | 4.900000e-01 | -3.430000e-01 |
24 | 1.0 | -6.000000e-01 | 3.600000e-01 | -2.160000e-01 |
25 | 1.0 | -5.000000e-01 | 2.500000e-01 | -1.250000e-01 |
26 | 1.0 | -4.000000e-01 | 1.600000e-01 | -6.400000e-02 |
27 | 1.0 | -3.000000e-01 | 9.000000e-02 | -2.700000e-02 |
28 | 1.0 | -2.000000e-01 | 4.000000e-02 | -8.000000e-03 |
29 | 1.0 | -1.000000e-01 | 1.000000e-02 | -1.000000e-03 |
30 | 1.0 | 2.664535e-15 | 7.099748e-30 | 1.891753e-44 |
31 | 1.0 | 1.000000e-01 | 1.000000e-02 | 1.000000e-03 |
32 | 1.0 | 2.000000e-01 | 4.000000e-02 | 8.000000e-03 |
33 | 1.0 | 3.000000e-01 | 9.000000e-02 | 2.700000e-02 |
34 | 1.0 | 4.000000e-01 | 1.600000e-01 | 6.400000e-02 |
35 | 1.0 | 5.000000e-01 | 2.500000e-01 | 1.250000e-01 |
36 | 1.0 | 6.000000e-01 | 3.600000e-01 | 2.160000e-01 |
37 | 1.0 | 7.000000e-01 | 4.900000e-01 | 3.430000e-01 |
38 | 1.0 | 8.000000e-01 | 6.400000e-01 | 5.120000e-01 |
39 | 1.0 | 9.000000e-01 | 8.100000e-01 | 7.290000e-01 |
40 | 1.0 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 |
41 | 1.0 | 1.100000e+00 | 1.210000e+00 | 1.331000e+00 |
42 | 1.0 | 1.200000e+00 | 1.440000e+00 | 1.728000e+00 |
43 | 1.0 | 1.300000e+00 | 1.690000e+00 | 2.197000e+00 |
44 | 1.0 | 1.400000e+00 | 1.960000e+00 | 2.744000e+00 |
45 | 1.0 | 1.500000e+00 | 2.250000e+00 | 3.375000e+00 |
46 | 1.0 | 1.600000e+00 | 2.560000e+00 | 4.096000e+00 |
47 | 1.0 | 1.700000e+00 | 2.890000e+00 | 4.913000e+00 |
48 | 1.0 | 1.800000e+00 | 3.240000e+00 | 5.832000e+00 |
49 | 1.0 | 1.900000e+00 | 3.610000e+00 | 6.859000e+00 |
Let’s get rid of the column of all 1 values. We do this by setting `include_bias=False` when we instantiate the `PolynomialFeatures` object.
We can perform both the `fit` and the `transform` steps at once using `fit_transform`.
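For illustration only, here is the equivalent two-step version, using the hypothetical name poly2 (the combined fit_transform call below is what we will actually use).
# Equivalent to fit_transform, written as separate fit and transform steps.
poly2 = PolynomialFeatures(degree=3, include_bias=False)
poly2.fit(df[["x"]])        # "learn" from the input column
poly2.transform(df[["x"]])  # produce the powers x, x^2, x^3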
poly = PolynomialFeatures(degree=3, include_bias=False)
df_pow = pd.DataFrame(poly.fit_transform(df[['x']]))
df_pow
 | 0 | 1 | 2 |
---|---|---|---|
0 | -3.000000e+00 | 9.000000e+00 | -2.700000e+01 |
1 | -2.900000e+00 | 8.410000e+00 | -2.438900e+01 |
2 | -2.800000e+00 | 7.840000e+00 | -2.195200e+01 |
3 | -2.700000e+00 | 7.290000e+00 | -1.968300e+01 |
4 | -2.600000e+00 | 6.760000e+00 | -1.757600e+01 |
5 | -2.500000e+00 | 6.250000e+00 | -1.562500e+01 |
6 | -2.400000e+00 | 5.760000e+00 | -1.382400e+01 |
7 | -2.300000e+00 | 5.290000e+00 | -1.216700e+01 |
8 | -2.200000e+00 | 4.840000e+00 | -1.064800e+01 |
9 | -2.100000e+00 | 4.410000e+00 | -9.261000e+00 |
10 | -2.000000e+00 | 4.000000e+00 | -8.000000e+00 |
11 | -1.900000e+00 | 3.610000e+00 | -6.859000e+00 |
12 | -1.800000e+00 | 3.240000e+00 | -5.832000e+00 |
13 | -1.700000e+00 | 2.890000e+00 | -4.913000e+00 |
14 | -1.600000e+00 | 2.560000e+00 | -4.096000e+00 |
15 | -1.500000e+00 | 2.250000e+00 | -3.375000e+00 |
16 | -1.400000e+00 | 1.960000e+00 | -2.744000e+00 |
17 | -1.300000e+00 | 1.690000e+00 | -2.197000e+00 |
18 | -1.200000e+00 | 1.440000e+00 | -1.728000e+00 |
19 | -1.100000e+00 | 1.210000e+00 | -1.331000e+00 |
20 | -1.000000e+00 | 1.000000e+00 | -1.000000e+00 |
21 | -9.000000e-01 | 8.100000e-01 | -7.290000e-01 |
22 | -8.000000e-01 | 6.400000e-01 | -5.120000e-01 |
23 | -7.000000e-01 | 4.900000e-01 | -3.430000e-01 |
24 | -6.000000e-01 | 3.600000e-01 | -2.160000e-01 |
25 | -5.000000e-01 | 2.500000e-01 | -1.250000e-01 |
26 | -4.000000e-01 | 1.600000e-01 | -6.400000e-02 |
27 | -3.000000e-01 | 9.000000e-02 | -2.700000e-02 |
28 | -2.000000e-01 | 4.000000e-02 | -8.000000e-03 |
29 | -1.000000e-01 | 1.000000e-02 | -1.000000e-03 |
30 | 2.664535e-15 | 7.099748e-30 | 1.891753e-44 |
31 | 1.000000e-01 | 1.000000e-02 | 1.000000e-03 |
32 | 2.000000e-01 | 4.000000e-02 | 8.000000e-03 |
33 | 3.000000e-01 | 9.000000e-02 | 2.700000e-02 |
34 | 4.000000e-01 | 1.600000e-01 | 6.400000e-02 |
35 | 5.000000e-01 | 2.500000e-01 | 1.250000e-01 |
36 | 6.000000e-01 | 3.600000e-01 | 2.160000e-01 |
37 | 7.000000e-01 | 4.900000e-01 | 3.430000e-01 |
38 | 8.000000e-01 | 6.400000e-01 | 5.120000e-01 |
39 | 9.000000e-01 | 8.100000e-01 | 7.290000e-01 |
40 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 |
41 | 1.100000e+00 | 1.210000e+00 | 1.331000e+00 |
42 | 1.200000e+00 | 1.440000e+00 | 1.728000e+00 |
43 | 1.300000e+00 | 1.690000e+00 | 2.197000e+00 |
44 | 1.400000e+00 | 1.960000e+00 | 2.744000e+00 |
45 | 1.500000e+00 | 2.250000e+00 | 3.375000e+00 |
46 | 1.600000e+00 | 2.560000e+00 | 4.096000e+00 |
47 | 1.700000e+00 | 2.890000e+00 | 4.913000e+00 |
48 | 1.800000e+00 | 3.240000e+00 | 5.832000e+00 |
49 | 1.900000e+00 | 3.610000e+00 | 6.859000e+00 |
Name the columns using the `get_feature_names_out` method of the `PolynomialFeatures` object.
poly.get_feature_names_out()
array(['x', 'x^2', 'x^3'], dtype=object)
df_pow.columns = poly.get_feature_names_out()
df_pow
 | x | x^2 | x^3 |
---|---|---|---|
0 | -3.000000e+00 | 9.000000e+00 | -2.700000e+01 |
1 | -2.900000e+00 | 8.410000e+00 | -2.438900e+01 |
2 | -2.800000e+00 | 7.840000e+00 | -2.195200e+01 |
3 | -2.700000e+00 | 7.290000e+00 | -1.968300e+01 |
4 | -2.600000e+00 | 6.760000e+00 | -1.757600e+01 |
5 | -2.500000e+00 | 6.250000e+00 | -1.562500e+01 |
6 | -2.400000e+00 | 5.760000e+00 | -1.382400e+01 |
7 | -2.300000e+00 | 5.290000e+00 | -1.216700e+01 |
8 | -2.200000e+00 | 4.840000e+00 | -1.064800e+01 |
9 | -2.100000e+00 | 4.410000e+00 | -9.261000e+00 |
10 | -2.000000e+00 | 4.000000e+00 | -8.000000e+00 |
11 | -1.900000e+00 | 3.610000e+00 | -6.859000e+00 |
12 | -1.800000e+00 | 3.240000e+00 | -5.832000e+00 |
13 | -1.700000e+00 | 2.890000e+00 | -4.913000e+00 |
14 | -1.600000e+00 | 2.560000e+00 | -4.096000e+00 |
15 | -1.500000e+00 | 2.250000e+00 | -3.375000e+00 |
16 | -1.400000e+00 | 1.960000e+00 | -2.744000e+00 |
17 | -1.300000e+00 | 1.690000e+00 | -2.197000e+00 |
18 | -1.200000e+00 | 1.440000e+00 | -1.728000e+00 |
19 | -1.100000e+00 | 1.210000e+00 | -1.331000e+00 |
20 | -1.000000e+00 | 1.000000e+00 | -1.000000e+00 |
21 | -9.000000e-01 | 8.100000e-01 | -7.290000e-01 |
22 | -8.000000e-01 | 6.400000e-01 | -5.120000e-01 |
23 | -7.000000e-01 | 4.900000e-01 | -3.430000e-01 |
24 | -6.000000e-01 | 3.600000e-01 | -2.160000e-01 |
25 | -5.000000e-01 | 2.500000e-01 | -1.250000e-01 |
26 | -4.000000e-01 | 1.600000e-01 | -6.400000e-02 |
27 | -3.000000e-01 | 9.000000e-02 | -2.700000e-02 |
28 | -2.000000e-01 | 4.000000e-02 | -8.000000e-03 |
29 | -1.000000e-01 | 1.000000e-02 | -1.000000e-03 |
30 | 2.664535e-15 | 7.099748e-30 | 1.891753e-44 |
31 | 1.000000e-01 | 1.000000e-02 | 1.000000e-03 |
32 | 2.000000e-01 | 4.000000e-02 | 8.000000e-03 |
33 | 3.000000e-01 | 9.000000e-02 | 2.700000e-02 |
34 | 4.000000e-01 | 1.600000e-01 | 6.400000e-02 |
35 | 5.000000e-01 | 2.500000e-01 | 1.250000e-01 |
36 | 6.000000e-01 | 3.600000e-01 | 2.160000e-01 |
37 | 7.000000e-01 | 4.900000e-01 | 3.430000e-01 |
38 | 8.000000e-01 | 6.400000e-01 | 5.120000e-01 |
39 | 9.000000e-01 | 8.100000e-01 | 7.290000e-01 |
40 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 |
41 | 1.100000e+00 | 1.210000e+00 | 1.331000e+00 |
42 | 1.200000e+00 | 1.440000e+00 | 1.728000e+00 |
43 | 1.300000e+00 | 1.690000e+00 | 2.197000e+00 |
44 | 1.400000e+00 | 1.960000e+00 | 2.744000e+00 |
45 | 1.500000e+00 | 2.250000e+00 | 3.375000e+00 |
46 | 1.600000e+00 | 2.560000e+00 | 4.096000e+00 |
47 | 1.700000e+00 | 2.890000e+00 | 4.913000e+00 |
48 | 1.800000e+00 | 3.240000e+00 | 5.832000e+00 |
49 | 1.900000e+00 | 3.610000e+00 | 6.859000e+00 |
Concatenate the “y” and “y_true” columns from `df` onto the end of `df_pow` using `pd.concat((???, ???), axis=???)`. Name the result `df_both`.
Notice how we use `axis=1`, because the column labels are changing but the row labels are staying the same.
df_both = pd.concat((df_pow, df[["y", "y_true"]]), axis=1)
df_both
 | x | x^2 | x^3 | y | y_true |
---|---|---|---|---|---|
0 | -3.000000e+00 | 9.000000e+00 | -2.700000e+01 | 19.658465 | 19.000 |
1 | -2.900000e+00 | 8.410000e+00 | -2.438900e+01 | 19.004528 | 15.648 |
2 | -2.800000e+00 | 7.840000e+00 | -2.195200e+01 | 14.544002 | 12.584 |
3 | -2.700000e+00 | 7.290000e+00 | -1.968300e+01 | 6.982885 | 9.796 |
4 | -2.600000e+00 | 6.760000e+00 | -1.757600e+01 | 7.806070 | 7.272 |
5 | -2.500000e+00 | 6.250000e+00 | -1.562500e+01 | 8.322150 | 5.000 |
6 | -2.400000e+00 | 5.760000e+00 | -1.382400e+01 | -1.289692 | 2.968 |
7 | -2.300000e+00 | 5.290000e+00 | -1.216700e+01 | -0.827392 | 1.164 |
8 | -2.200000e+00 | 4.840000e+00 | -1.064800e+01 | 6.149442 | -0.424 |
9 | -2.100000e+00 | 4.410000e+00 | -9.261000e+00 | -0.183967 | -1.808 |
10 | -2.000000e+00 | 4.000000e+00 | -8.000000e+00 | -6.093065 | -3.000 |
11 | -1.900000e+00 | 3.610000e+00 | -6.859000e+00 | -6.817643 | -4.012 |
12 | -1.800000e+00 | 3.240000e+00 | -5.832000e+00 | 2.249072 | -4.856 |
13 | -1.700000e+00 | 2.890000e+00 | -4.913000e+00 | -3.184134 | -5.544 |
14 | -1.600000e+00 | 2.560000e+00 | -4.096000e+00 | -15.528922 | -6.088 |
15 | -1.500000e+00 | 2.250000e+00 | -3.375000e+00 | -3.867985 | -6.500 |
16 | -1.400000e+00 | 1.960000e+00 | -2.744000e+00 | -11.173514 | -6.792 |
17 | -1.300000e+00 | 1.690000e+00 | -2.197000e+00 | -7.492278 | -6.976 |
18 | -1.200000e+00 | 1.440000e+00 | -1.728000e+00 | -14.988038 | -7.064 |
19 | -1.100000e+00 | 1.210000e+00 | -1.331000e+00 | -8.428815 | -7.068 |
20 | -1.000000e+00 | 1.000000e+00 | -1.000000e+00 | -10.778717 | -7.000 |
21 | -9.000000e-01 | 8.100000e-01 | -7.290000e-01 | -12.853986 | -6.872 |
22 | -8.000000e-01 | 6.400000e-01 | -5.120000e-01 | -2.085862 | -6.696 |
23 | -7.000000e-01 | 4.900000e-01 | -3.430000e-01 | 5.928608 | -6.484 |
24 | -6.000000e-01 | 3.600000e-01 | -2.160000e-01 | -3.419772 | -6.248 |
25 | -5.000000e-01 | 2.500000e-01 | -1.250000e-01 | -5.006722 | -6.000 |
26 | -4.000000e-01 | 1.600000e-01 | -6.400000e-02 | -11.888357 | -5.752 |
27 | -3.000000e-01 | 9.000000e-02 | -2.700000e-02 | -6.781925 | -5.516 |
28 | -2.000000e-01 | 4.000000e-02 | -8.000000e-03 | -8.647650 | -5.304 |
29 | -1.000000e-01 | 1.000000e-02 | -1.000000e-03 | -14.463928 | -5.128 |
30 | 2.664535e-15 | 7.099748e-30 | 1.891753e-44 | -2.247368 | -5.000 |
31 | 1.000000e-01 | 1.000000e-02 | 1.000000e-03 | -5.226225 | -4.932 |
32 | 2.000000e-01 | 4.000000e-02 | 8.000000e-03 | -8.567178 | -4.936 |
33 | 3.000000e-01 | 9.000000e-02 | 2.700000e-02 | -12.507355 | -5.024 |
34 | 4.000000e-01 | 1.600000e-01 | 6.400000e-02 | -2.985618 | -5.208 |
35 | 5.000000e-01 | 2.500000e-01 | 1.250000e-01 | 1.248474 | -5.500 |
36 | 6.000000e-01 | 3.600000e-01 | 2.160000e-01 | -5.436481 | -5.912 |
37 | 7.000000e-01 | 4.900000e-01 | 3.430000e-01 | -9.005758 | -6.456 |
38 | 8.000000e-01 | 6.400000e-01 | 5.120000e-01 | -4.079082 | -7.144 |
39 | 9.000000e-01 | 8.100000e-01 | 7.290000e-01 | -12.877670 | -7.988 |
40 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | -19.902710 | -9.000 |
41 | 1.100000e+00 | 1.210000e+00 | 1.331000e+00 | -1.632357 | -10.192 |
42 | 1.200000e+00 | 1.440000e+00 | 1.728000e+00 | -3.138947 | -11.576 |
43 | 1.300000e+00 | 1.690000e+00 | 2.197000e+00 | -14.674297 | -13.164 |
44 | 1.400000e+00 | 1.960000e+00 | 2.744000e+00 | -14.626226 | -14.968 |
45 | 1.500000e+00 | 2.250000e+00 | 3.375000e+00 | -8.663278 | -17.000 |
46 | 1.600000e+00 | 2.560000e+00 | 4.096000e+00 | -24.778477 | -19.272 |
47 | 1.700000e+00 | 2.890000e+00 | 4.913000e+00 | -21.197001 | -21.796 |
48 | 1.800000e+00 | 3.240000e+00 | 5.832000e+00 | -20.253230 | -24.584 |
49 | 1.900000e+00 | 3.610000e+00 | 6.859000e+00 | -18.957032 | -27.648 |
Find the “best” coefficient values for modeling \(y \approx c_3 x^3 + c_2 x^2 + c_1 x + c_0\).
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(df_both[["x","x^2","x^3"]], df_both["y"])
LinearRegression()
How do these values compare to the true coefficient values?
reg.coef_
array([ 1.6446191 , -2.03654891, -1.83956523])
reg.intercept_
-6.358623201499859
The true values follow the polynomial \(y = -2x^3 - 3x^2 + x - 5\). In our case, we have found approximately \(-1.8 x^3 - 2x^2 + 1.6 x - 6.4\). These two sequences of coefficients are remarkably similar.
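Here is one hypothetical way to put the two sets of coefficients side by side; np.append puts the intercept first, so both columns are ordered from the constant term up to \(x^3\).
# Compare the true coefficients in m with the fitted intercept and coefficients.
pd.DataFrame({
    "true": m,
    "fitted": np.append(reg.intercept_, reg.coef_)
}, index=["1", "x", "x^2", "x^3"])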
We will see a more efficient way to do all of these steps below, using `Pipeline`.
Using `Pipeline` to combine multiple steps#
The above process is a little awkward. We can achieve the same thing much more efficiently by using another data type defined by scikit-learn, `Pipeline`. (The tradeoff is that it is less explicit what is happening.)
Import the `Pipeline` class from `sklearn.pipeline`.
from sklearn.pipeline import Pipeline
Make an instance of this `Pipeline` class. Pass to the constructor a list of length-2 tuples, where each tuple provides a name for the step (as a string) and an instance of the corresponding class (like `PolynomialFeatures(???)`).
pipe = Pipeline(
[
('poly', PolynomialFeatures(degree=3, include_bias=False)),
('reg', LinearRegression())
]
)
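As an aside, scikit-learn also provides make_pipeline, which builds the same kind of object but generates the step names automatically (lower-cased class names); we will stick with the explicitly named version above.
from sklearn.pipeline import make_pipeline
# Same pipeline, with automatically generated step names
# ("polynomialfeatures" and "linearregression").
pipe_alt = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    LinearRegression()
)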
Fit this object to the data.
This is where we really benefit from `Pipeline`. The following call of `pipe.fit` first fits and transforms the data using `PolynomialFeatures`, and then fits that transformed data using `LinearRegression`.
pipe.fit(df[["x"]], df['y'])
Pipeline(steps=[('poly', PolynomialFeatures(degree=3, include_bias=False)), ('reg', LinearRegression())])
Do the coefficients match what we found above? Use the `named_steps` attribute, or just use the name directly.
If you try calling `pipe.coef_`, you get an error message. It’s not the `Pipeline` object itself that has the fit coefficients, but the `LinearRegression` object within it.
pipe.coef_
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In [20], line 1
----> 1 pipe.coef_
AttributeError: 'Pipeline' object has no attribute 'coef_'
The information is recorded in a Python dictionary stored in the `named_steps` attribute of our `Pipeline` object.
pipe.named_steps
{'poly': PolynomialFeatures(degree=3, include_bias=False),
'reg': LinearRegression()}
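Since named_steps is a dictionary, we can index into it with the step name to reach the fitted LinearRegression object; this is equivalent to the pipe["reg"] shorthand used below.
# Same fitted LinearRegression object, accessed through the dictionary.
pipe.named_steps["reg"].coef_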
The point of all that is, now that we know how to access the `LinearRegression` object, we can get its `coef_` attribute just like usual when performing linear regression. (Remember that this attribute only exists after we call `fit`.)
Notice that these are the exact same values as what we found above. It’s worth looking over both procedures and noticing how much shorter this procedure using `Pipeline` is.
pipe['reg'].coef_
array([ 1.6446191 , -2.03654891, -1.83956523])
pipe['reg'].intercept_
-6.3586232014998565
Call the predict method, and add the resulting values to a new column in `df` named “pred”.
The following simple code evaluates our “best fit” degree three polynomial \(-1.8 x^3 - 2x^2 + 1.6 x - 6.4\) for every value in the “x” column. Notice how we don’t need to explicitly type `"x^2"` or anything like that; the polynomial part of this polynomial regression is happening “under the hood”.
df["pred"] = pipe.predict(df[["x"]])
Plot the resulting predictions using a red line. Name the chart `c2`.
This one does lie perfectly on a cubic polynomial; more specifically, that cubic polynomial is approximately \(-1.8 x^3 - 2x^2 + 1.6 x - 6.4\). This is our cubic polynomial of “best fit” (meaning the Mean Squared Error between the data and this polynomial is minimized). For the given data, using Mean Squared Error, this polynomial fits the data “better” than the true underlying polynomial \(-2x^3 - 3x^2 + x - 5\).
c2 = alt.Chart(df).mark_line(color = "red").encode(
x = "x",
y = "pred"
)
c2
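As a quick sketch to back up the Mean Squared Error claim above (assuming sklearn.metrics is available), we can compare the error of our fitted predictions to the error of the true polynomial, both measured against the noisy “y” values; the first number should be no larger than the second.
from sklearn.metrics import mean_squared_error
# MSE of the fitted cubic vs. MSE of the true underlying cubic.
(mean_squared_error(df["y"], df["pred"]), mean_squared_error(df["y"], df["y_true"]))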
Plot the true values using a dashed black line, using `strokeDash=[10,5]` as an argument to `mark_line`. Name the chart `c3`.
Don’t focus too much on the `strokeDash=[10,5]` part; I just wanted to show you an example of an option that exists. Here the dashes are made with 10 black pixels followed by a gap of 5 pixels.
This curve represents the true underlying polynomial that we used to generate the data (before adding the random noise to it).
c3 = alt.Chart(df).mark_line(color = "black", strokeDash = [10,5]).encode(
x = "x",
y = "y_true"
)
c3
Layer these plots on top of the above scatter plot `c1`.
Notice how similar our two polynomial curves are. If we had used more data points or a smaller standard deviation for our random noise, we would expect the curves to be even closer to each other.
c1 + c2 + c3