Week 5 Wednesday#

Announcements#

  • Midterms should be returned this Friday.

  • Midterm evaluation.

  • HW5 has been posted. Due next Wednesday.

  • There is an in-class quiz next Tuesday, based on this week’s HW.

Introduction#

We’re starting the Machine Learning portion of Math 10. Here is a quote from Hands-On Machine Learning by Aurélien Géron:

Machine Learning is the science (and art) of programming computers so they can learn from data.

Machine Learning is roughly divided into two large categories:

  • supervised (the main focus of Math 10): regression and classification

  • unsupervised

In regression problems, we are trying to compute a continuous quantitative value (":Q" in the Altair encoding type syntax). In classification problems, we are trying to compute a discrete value (":N" or ":O" in the Altair encoding type syntax).
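For instance, here is a minimal illustration of those encoding types (a sketch using a small made-up DataFrame, not the taxis data we load below; the column names are hypothetical).

import altair as alt
import pandas as pd

# A tiny made-up DataFrame just to illustrate the encoding type syntax.
df_toy = pd.DataFrame({"miles": [1.2, 3.4, 2.0], "payment": ["cash", "card", "card"]})

# ":Q" marks a continuous quantitative field; ":N" marks a discrete nominal field.
alt.Chart(df_toy).mark_circle().encode(
    x="miles:Q",
    y="payment:N"
)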

We’ll start with regression. Linear regression in particular is a nice first example of machine learning. Here we’ll use a real dataset, and we’ll see how to perform linear regression using scikit-learn.

Motivating Example I: Single-variable (1D) Linear Regression#

Problem#

Given the training dataset \((x^{(i)}\in\mathbb{R},y^{(i)}\in\mathbb{R}), i= 1,2,..., N\), we want to find the linear function

\[y\approx f(x)=wx +b\]

that fits the relation between \(x^{(i)}\) and \(y^{(i)}\), so that given any new \(x^{test}\) in the test dataset, we can make the prediction

\[y^{pred} = w x^{test}+b\]

Training the model#

  • With the training dataset, define the loss function \(L(w,b)\) of the parameters \(w\) and \(b\), which is also called the mean squared error (MSE):

\[ L(w,b)=\frac{1}{N}\sum_{i=1}^N\big(\hat{y}^{(i)}-y^{(i)}\big)^2=\frac{1}{N}\sum_{i=1}^N\big((wx^{(i)}+b)-y^{(i)}\big)^2, \]

where \(\hat{y}^{(i)}\) denotes the predicted value of \(y\) at \(x^{(i)}\), i.e. \(\hat{y}^{(i)} = wx^{(i)}+b\).

  • Then find the minimum of the loss function. Note that \(L\) is a quadratic function of \(w\) and \(b\), so we can analytically solve \(\partial_{w}L = \partial_{b}L =0\), which yields

\[ w^* =\frac{\sum_{i=1}^{N}(x^{(i)}-\bar{x})(y^{(i)}-\bar{y})}{\sum_{i=1}^{N}(x^{(i)}-\bar{x})^{2}} = \frac{\frac{1}{N}\sum_{i=1}^{N}(x^{(i)}-\bar{x})(y^{(i)}-\bar{y})}{\frac{1}{N}\sum_{i=1}^{N}(x^{(i)}-\bar{x})^{2}} = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}, \]
\[ b^* = \bar{y} - w^*\bar{x}, \]

where \(\bar{x}\) and \(\bar{y}\) are the means of \(x\) and \(y\), and \(\text{Cov}(X,Y)\) denotes the estimated covariance (also called the sample covariance) between \(X\) and \(Y\). One small difference from what you may have learned in statistics is that we use the normalization factor \(1/N\) instead of \(1/(N-1)\); likewise \(\text{Var}(X)\) denotes the sample variance of \(X\), again with normalization factor \(1/N\). This is just a matter of convention (in statistics, the \(1/(N-1)\) factor gives an unbiased estimator), and since the same factor appears in the numerator and denominator of \(w^*\), it does not change the result.
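As a quick check of these formulas, here is a minimal NumPy sketch (using made-up data, not a real dataset) that computes \(w^*\) and \(b^*\) directly.

import numpy as np

# Made-up one-dimensional training data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Closed-form solution: w* = Cov(X, Y) / Var(X), b* = ybar - w* * xbar.
xbar, ybar = x.mean(), y.mean()
w_star = ((x - xbar) * (y - ybar)).sum() / ((x - xbar) ** 2).sum()
b_star = ybar - w_star * xbar
w_star, b_star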

Evaluating the model#

  • MSE: a smaller MSE indicates better performance (see the short computation below).
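Continuing the sketch above, the MSE of the fitted line can be computed directly from the training data.

# Predictions of the fitted line on the training inputs, and their MSE.
y_pred = w_star * x + b_star
mse = ((y_pred - y) ** 2).mean()
mse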

Performing linear regression using scikit-learn#

  • Load the taxis dataset from Seaborn. Drop rows with missing values.

import seaborn as sns
import altair as alt
import pandas as pd

df = sns.load_dataset("taxis")
df = df.dropna(axis=0)
  • Using Altair, make a scatter plot with “fare” on the y-axis and with “distance” on the x-axis. Choose 5000 random rows to avoid the max_rows error.

Here is the error we get if we don’t restrict the DataFrame or don’t change the limit of maximum number of rows.

#alt.data_transformers.enable('default', max_rows = 7000)

alt.Chart(df).mark_circle().encode(
    x="distance",
    y="fare"
)
---------------------------------------------------------------------------
MaxRowsError                              Traceback (most recent call last)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:2020, in Chart.to_dict(self, *args, **kwargs)
   2018     copy.data = core.InlineData(values=[{}])
   2019     return super(Chart, copy).to_dict(*args, **kwargs)
-> 2020 return super().to_dict(*args, **kwargs)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:374, in TopLevelMixin.to_dict(self, *args, **kwargs)
    372 copy = self.copy(deep=False)
    373 original_data = getattr(copy, "data", Undefined)
--> 374 copy.data = _prepare_data(original_data, context)
    376 if original_data is not Undefined:
    377     context["data"] = original_data

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:89, in _prepare_data(data, context)
     87 # convert dataframes  or objects with __geo_interface__ to dict
     88 if isinstance(data, pd.DataFrame) or hasattr(data, "__geo_interface__"):
---> 89     data = _pipe(data, data_transformers.get())
     91 # convert string input to a URLData
     92 if isinstance(data, str):

File /shared-libs/python3.9/py/lib/python3.9/site-packages/toolz/functoolz.py:628, in pipe(data, *funcs)
    608 """ Pipe a value through a sequence of functions
    609 
    610 I.e. ``pipe(data, f, g, h)`` is equivalent to ``h(g(f(data)))``
   (...)
    625     thread_last
    626 """
    627 for func in funcs:
--> 628     data = func(data)
    629 return data

File /shared-libs/python3.9/py/lib/python3.9/site-packages/toolz/functoolz.py:304, in curry.__call__(self, *args, **kwargs)
    302 def __call__(self, *args, **kwargs):
    303     try:
--> 304         return self._partial(*args, **kwargs)
    305     except TypeError as exc:
    306         if self._should_curry(args, kwargs, exc):

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/data.py:19, in default_data_transformer(data, max_rows)
     17 @curried.curry
     18 def default_data_transformer(data, max_rows=5000):
---> 19     return curried.pipe(data, limit_rows(max_rows=max_rows), to_values)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/toolz/functoolz.py:628, in pipe(data, *funcs)
    608 """ Pipe a value through a sequence of functions
    609 
    610 I.e. ``pipe(data, f, g, h)`` is equivalent to ``h(g(f(data)))``
   (...)
    625     thread_last
    626 """
    627 for func in funcs:
--> 628     data = func(data)
    629 return data

File /shared-libs/python3.9/py/lib/python3.9/site-packages/toolz/functoolz.py:304, in curry.__call__(self, *args, **kwargs)
    302 def __call__(self, *args, **kwargs):
    303     try:
--> 304         return self._partial(*args, **kwargs)
    305     except TypeError as exc:
    306         if self._should_curry(args, kwargs, exc):

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/data.py:80, in limit_rows(data, max_rows)
     78         return data
     79 if max_rows is not None and len(values) > max_rows:
---> 80     raise MaxRowsError(
     81         "The number of rows in your dataset is greater "
     82         "than the maximum allowed ({}). "
     83         "For information on how to plot larger datasets "
     84         "in Altair, see the documentation".format(max_rows)
     85     )
     86 return data

MaxRowsError: The number of rows in your dataset is greater than the maximum allowed (5000). For information on how to plot larger datasets in Altair, see the documentation
alt.Chart(...)

Here we are using the sample method to get 5000 random rows from the DataFrame.

As a fun aside, notice the plateau around fare 50. By adding a tooltip, we can see that these mostly correspond to airport rides, where there is (or was) a fixed rate.

alt.Chart(df.sample(5000)).mark_circle().encode(
    x = "distance",
    y = "fare",
    tooltip=["pickup_zone","dropoff_zone","fare"]
)
  • What would you estimate is the slope of the “line of best fit” for this data?

The line might go from (2,7) to (4,12), for example, so maybe the slope will be 2.5?

  • Find this slope using the LinearRegression class from scikit-learn.

There is a routine in scikit-learn that we will see many times:

  1. Import

  2. Instantiate (meaning create an object, aka, an instance, of the appropriate class)

  3. Fit

  4. Predict

We will see this pattern many times, starting right now. The more comfortable you get with this routine, the more comfortable you will be with conventions in Object Oriented Programming.
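Here is the skeleton of that routine collapsed into one place (a sketch using the taxis columns from above; the rest of this section carries out these steps one at a time, including a common error along the way).

# 1. Import the class
from sklearn.linear_model import LinearRegression

# 2. Instantiate: create an object (an instance) of the class
reg = LinearRegression()

# 3. Fit: learn the coefficients from the training data
reg.fit(df[["distance"]], df["fare"])

# 4. Predict: evaluate the learned linear function on inputs
reg.predict(df[["distance"]])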

Here we import the LinearRegression class.

# import
from sklearn.linear_model import LinearRegression

Here we create a LinearRegression object, and name it reg (for regression).

# instantiate (make an instance)
reg = LinearRegression()

Notice that this is a LinearRegression object. (That class does not exist in base Python; it is defined by scikit-learn.) This emphasis on objects (as opposed to, for example, functions) is the standard in Object Oriented Programming.

type(reg)
sklearn.linear_model._base.LinearRegression
help(reg.fit)
Help on method fit in module sklearn.linear_model._base:

fit(X, y, sample_weight=None) method of sklearn.linear_model._base.LinearRegression instance
    Fit linear model.
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        Training data.
    
    y : array-like of shape (n_samples,) or (n_samples, n_targets)
        Target values. Will be cast to X's dtype if necessary.
    
    sample_weight : array-like of shape (n_samples,), default=None
        Individual weights for each sample.
    
        .. versionadded:: 0.17
           parameter *sample_weight* support to LinearRegression.
    
    Returns
    -------
    self : object
        Fitted Estimator.

Now we try to fit the LinearRegression object, but an error is raised.

reg.fit(df["distance"],df["fare"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [4], line 1
----> 1 reg.fit(df["distance"],df["fare"])

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/linear_model/_base.py:684, in LinearRegression.fit(self, X, y, sample_weight)
    680 n_jobs_ = self.n_jobs
    682 accept_sparse = False if self.positive else ["csr", "csc", "coo"]
--> 684 X, y = self._validate_data(
    685     X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
    686 )
    688 sample_weight = _check_sample_weight(
    689     sample_weight, X, dtype=X.dtype, only_non_negative=True
    690 )
    692 X, y, X_offset, y_offset, X_scale = _preprocess_data(
    693     X,
    694     y,
   (...)
    698     sample_weight=sample_weight,
    699 )

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/base.py:596, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
    594         y = check_array(y, input_name="y", **check_y_params)
    595     else:
--> 596         X, y = check_X_y(X, y, **check_params)
    597     out = X, y
    599 if not no_val_X and check_params.get("ensure_2d", True):

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:1074, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
   1069         estimator_name = _check_estimator_name(estimator)
   1070     raise ValueError(
   1071         f"{estimator_name} requires y to be passed, but the target y is None"
   1072     )
-> 1074 X = check_array(
   1075     X,
   1076     accept_sparse=accept_sparse,
   1077     accept_large_sparse=accept_large_sparse,
   1078     dtype=dtype,
   1079     order=order,
   1080     copy=copy,
   1081     force_all_finite=force_all_finite,
   1082     ensure_2d=ensure_2d,
   1083     allow_nd=allow_nd,
   1084     ensure_min_samples=ensure_min_samples,
   1085     ensure_min_features=ensure_min_features,
   1086     estimator=estimator,
   1087     input_name="X",
   1088 )
   1090 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
   1092 check_consistent_length(X, y)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:879, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    877     # If input is 1D raise error
    878     if array.ndim == 1:
--> 879         raise ValueError(
    880             "Expected 2D array, got 1D array instead:\narray={}.\n"
    881             "Reshape your data either using array.reshape(-1, 1) if "
    882             "your data has a single feature or array.reshape(1, -1) "
    883             "if it contains a single sample.".format(array)
    884         )
    886 if dtype_numeric and array.dtype.kind in "USV":
    887     raise ValueError(
    888         "dtype='numeric' is not compatible with arrays of bytes/strings."
    889         "Convert your data to numeric values explicitly instead."
    890     )

ValueError: Expected 2D array, got 1D array instead:
array=[1.6  0.79 1.37 ... 4.14 1.12 3.85].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

The problem is that scikit-learn wants two-dimensional input data, whereas we passed a pandas Series, which should be viewed as one-dimensional. (This two-dimensionality is also why the documentation uses a capital letter X for the input features, but a lower-case y for the target. The capital letter is a reminder that X should be two-dimensional.)

df["distance"].shape
(6341,)
df[["distance"]].shape
(6341, 1)

Notice the subtle difference in how the following is presented. The above was a pandas Series, but the following is a one-column pandas DataFrame.

type(df[["distance"]])
pandas.core.frame.DataFrame

Using these double square brackets (which you should think of as a list inside the outer square brackets), we are able to call the fit method without error.

reg.fit(df[["distance"]],df["fare"])
LinearRegression()
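As an aside, there are other equivalent ways to make this input two-dimensional; any of the following could be passed as the first argument to fit (a sketch).

df[["distance"]]                       # one-column DataFrame (what we used)
df["distance"].to_frame()              # convert the Series to a one-column DataFrame
df["distance"].values.reshape(-1, 1)   # two-dimensional NumPy array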

Now reg has done all the hard work of finding a linear equation that approximates our taxi data (just the “fare” column as a linear function of the “distance” column). The slope of the resulting line is stored in the coef_ attribute.

reg.coef_
array([2.73248966])
reg.coef_[0]
2.7324896558936222
  • Find the intercept.

The intercept is stored in the intercept_ attribute.

reg.intercept_
4.696727594311989
  • What are the predicted outputs for the first 5 rows? What are the actual outputs?

df[:5] #first 5 rows
|  | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
| 1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
| 2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
| 3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
| 4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |

When calling the predict method, we do not pass the target values, just the input values. The predict method then returns the outputs of the linear equation.

Here are the first 5 outputs.

reg.predict(df[["distance"]][:5])
array([ 9.06871104,  6.85539442,  8.44023842, 25.73689794, 10.59890525])

Something fancy is happening when we call reg.fit, but nothing fancy is happening when we call reg.predict: scikit-learn is simply evaluating the linear function. For example, notice that one of the rows had a distance of 7.70 miles (I assume the unit is miles, but I’m not certain).

Here is the corresponding output. Looking in the above DataFrame, we see that the true output was 27.0, so we are pretty close with our 25.7 estimate.

reg.coef_*7.7 + reg.intercept_
array([25.73689794])

Notice that reg.coef_ is a length-one NumPy array. If we just want a number, we can extract the element out of it using indexing.

reg.coef_[0]*7.7 + reg.intercept_
25.736897944692878
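To connect with the evaluation step described earlier, we could also compute the MSE of this fit over the whole dataset. Here is a sketch using scikit-learn's mean_squared_error; a smaller value would indicate a better fit to the training data.

from sklearn.metrics import mean_squared_error

# Compare the actual fares to the fares predicted from distance.
mean_squared_error(df["fare"], reg.predict(df[["distance"]]))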

Motivating Example II: Multi-variable Linear Regression#

Problem#

Given the training dataset \((x^{(i)},y^{(i)}), i= 1,2,..., N\), this time with \(y^{(i)}\in \mathbb{R}\) and \(x^{(i)}\in\mathbb{R}^{p}\), we fit the multi-variable linear function

\[ y\approx\mathbf{f}(x)=\beta_{0}+\beta_{1}x_{1}+..+\beta_{p}x_{p} = \tilde{x}\beta, \]
\[ \tilde{x}=(1,x_{1},..,x_{p})\in\mathbb{R}^{1\times (p+1)},\beta = (\beta_{0},\beta_{1},..,\beta_{p})^{T}\in\mathbb{R}^{(p+1)\times 1}. \]

Here \(\beta\) is called the vector of regression coefficients, and \(\beta_{0}\) in particular is referred to as the intercept.

Using the whole training dataset, we can write this as

\[\begin{split} Y=\left( \begin{matrix} y^{(1)}\\ y^{(2)} \\ \cdots \\ y^{(N)} \end{matrix} \right)\approx\left( \begin{matrix} \mathbf{f}(x^{(1)})\\ \mathbf{f}(x^{(2)})\\ \cdots \\ \mathbf{f}(x^{(N)}) \end{matrix} \right)=\left( \begin{matrix} \tilde{x}^{(1)}\beta\\ \tilde{x}^{(2)}\beta\\ \cdots \\ \tilde{x}^{(N)}\beta \end{matrix} \right)=\left( \begin{matrix} \tilde{x}^{(1)}\\ \tilde{x}^{(2)}\\ \cdots \\ \tilde{x}^{(N)} \end{matrix} \right)\beta = \tilde{X}\beta, \end{split}\]

where

\[\begin{split} \tilde{X}=\left( \begin{matrix} 1& x_{1}^{(1)} & \cdots & x_{p}^{(1)}\\ 1& x_{1}^{(2)} & \cdots & x_{p}^{(2)}\\ \vdots & \vdots & & \vdots \\ 1& x_{1}^{(N)} & \cdots & x_{p}^{(N)} \end{matrix} \right) \end{split}\]

is also called the augmented data matrix.

  • Question: To get the unknown \(\beta\), can we directly solve the linear equation \(\tilde{X}\beta = Y\)?

  • Answer: Most of the time, no, because 1) typically there are many more equations than unknowns (\(N \gg p+1\)); 2) the linear model is merely an approximation to the real mapping; 3) there is noise in the data points. It is therefore very likely that there is NO exact solution at all!

  • Strategy: Instead of solving \(\tilde{X}\beta = Y\) exactly, we want to find \(\beta\) such that \(\tilde{X}\beta\) is as close to \(Y\) as possible.

Training the model#

  • With the training dataset, define the loss function \(L(\beta)\) of the parameters \(\beta\), which is also called the mean squared error (MSE):

\[ L(\beta)=\frac{1}{N}\sum_{i=1}^N\big(\hat{y}^{(i)}-y^{(i)}\big)^2 = \frac{1}{N}\sum_{i=1}^{N}(y^{(i)}-\tilde{x}^{(i)}\beta)^{2}, \]

where \(\hat{y}^{(i)}\) denotes the predicted value of \(y\) at \(x^{(i)}\), i.e. \(\hat{y}^{(i)} = \beta_{0}+\beta_{1}x^{(i)}_{1}+\cdots+\beta_{p}x^{(i)}_{p} = \tilde{x}^{(i)}\beta\).

Now the problem becomes \(\min_{\beta}L(\beta)\), i.e. finding the minimizer of a multi-variable (\((p+1)\)-dimensional) function.

  • Then find the minimum of the loss function. There are two ways: numerical optimization, or solving a linear system (introduced below), which is also called the normal equation approach.

To find the critical points, we solve \(\nabla L(\beta)=0\):

\[ \begin{aligned} \frac{\partial L}{\partial \beta_{0}}&=\frac{2}{N}\sum_{i=1}^{N}(\tilde{x}^{(i)}\beta-y^{(i)})=0,\\ \frac{\partial L}{\partial \beta_{k}}&=\frac{2}{N}\sum_{i=1}^{N} x_{k}^{(i)}(\tilde{x}^{(i)}\beta-y^{(i)})=0,\quad k=1,2,\ldots,p. \end{aligned} \]

In matrix form, it can be expressed as (left as an exercise)

\[ \tilde{X}^{T}\tilde{X}\beta=\tilde{X}^{T}Y, \]

which is also called the normal equation of linear regression. The optimal parameter \(\hat{\beta}=\text{argmin}_{\beta}\, L(\beta)\) is also called the ordinary least squares (OLS) estimator in the statistics community.

Then, assuming \(\tilde{X}^{T}\tilde{X}\) is invertible, the OLS estimator can be solved as

\[ \hat{\beta}=(\tilde{X}^{T}\tilde{X})^{-1}\tilde{X}^{T}Y. \]
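Here is a minimal NumPy sketch of the normal equation approach, assuming the df and reg from the single-variable taxis example above are still in memory (so we are redoing that same fit by hand). In practice, np.linalg.lstsq is numerically preferable to forming the inverse explicitly.

import numpy as np

# Augmented data matrix: a column of ones, then the input feature.
X_tilde = np.column_stack([np.ones(len(df)), df["distance"]])
Y = df["fare"].to_numpy()

# Solve the normal equation X~^T X~ beta = X~^T Y.
beta_hat = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ Y)
beta_hat   # should match (reg.intercept_, reg.coef_[0]) from above

# The residual is orthogonal to the columns of X~ (the geometric picture below).
X_tilde.T @ (Y - X_tilde @ beta_hat)   # approximately zero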

Geometrical Interpretation

Denote \(\tilde{X}=(\tilde{X}_{0},\tilde{X}_{1},\ldots,\tilde{X}_{p})\), so that \(\tilde{X}\beta=\sum_{k=0}^{p}\beta_{k}\tilde{X}_{k}\). We require that the residual \(Y-\tilde{X}\beta\) is orthogonal to the subspace spanned by the columns \(\tilde{X}_{k}\), which yields

\[ \tilde{X}_{k}^{T}(Y-\tilde{X}\beta)=0,\quad k = 0,1,\ldots,p. \]

Notice: Check that when \(p=1\), the solution is equivalent to the single-variable regression.

Prediction in Test Data#

Given a new observation \(x^{(test)}\), we make the prediction

\[ \hat{y}^{(test)}=\hat{\beta}_{0}+\hat{\beta}_{1}x^{(test)}_{1}+\cdots+\hat{\beta}_{p}x^{(test)}_{p} = \tilde{x}^{(test)}\hat{\beta}. \]

Evaluating the model#

  • MSE: a smaller MSE indicates better performance, just as in the single-variable case.

Interpreting linear regression coefficients#

  • Add a new column to the DataFrame, called “hour”, which contains the hour at which the pickup occurred.

Notice that we already have date-time data type values in the first two columns, so there is no need to use pd.to_datetime.

df.dtypes
df["hour"] = df["pickup"].dt.hour
  • Remove all rows from the DataFrame where the hour is 16 or earlier. (So we are only using late afternoon and evening taxi rides.)

Using .copy() ensures that the two DataFrames are independent of each other. Changes to df2 will not affect df, and vice versa.

df2 = df[df["hour"] > 16].copy()  # .copy() makes df2 independent of df
  • Add a new column to the DataFrame, called “duration”, which contains the amount of time in minutes of the taxi ride.

Hint 1. Because the “dropoff” and “pickup” columns are already date-time values, we can subtract one from the other and pandas will know what to do.

Hint 2. I expected there to be a minutes attribute (after using the dt accessor) but there wasn’t. Call dir(df2["duration"].dt) to see some options.

df2["duration"] = df2["dropoff"] - df2["pickup"]
df2
|  | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | hour | duration |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | 20 | 0 days 00:06:15 |
| 2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | 17 | 0 days 00:07:24 |
| 6 | 2019-03-26 21:07:31 | 2019-03-26 21:17:29 | 1 | 3.65 | 13.0 | 2.00 | 0.0 | 18.80 | yellow | credit card | Battery Park City | Two Bridges/Seward Park | Manhattan | Manhattan | 21 | 0 days 00:09:58 |
| 11 | 2019-03-20 19:39:42 | 2019-03-20 19:45:36 | 1 | 1.53 | 6.5 | 2.16 | 0.0 | 12.96 | yellow | credit card | Upper West Side South | Manhattan Valley | Manhattan | Manhattan | 19 | 0 days 00:05:54 |
| 12 | 2019-03-18 21:27:14 | 2019-03-18 21:34:16 | 1 | 1.05 | 6.5 | 1.00 | 0.0 | 11.30 | yellow | credit card | Murray Hill | Midtown Center | Manhattan | Manhattan | 21 | 0 days 00:07:02 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6424 | 2019-03-30 20:52:15 | 2019-03-30 20:59:55 | 1 | 1.70 | 8.0 | 0.00 | 0.0 | 9.30 | green | cash | Central Harlem | Central Harlem North | Manhattan | Manhattan | 20 | 0 days 00:07:40 |
| 6427 | 2019-03-23 18:26:09 | 2019-03-23 18:49:12 | 1 | 7.07 | 20.0 | 0.00 | 0.0 | 20.00 | green | cash | Parkchester | East Harlem South | Bronx | Manhattan | 18 | 0 days 00:23:03 |
| 6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 1 | 18.74 | 58.0 | 0.00 | 0.0 | 58.80 | green | credit card | Jamaica | East Concourse/Concourse Village | Queens | Bronx | 17 | 0 days 00:56:23 |
| 6430 | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 | 1 | 4.14 | 16.0 | 0.00 | 0.0 | 17.30 | green | cash | Crown Heights North | Bushwick North | Brooklyn | Brooklyn | 22 | 0 days 00:19:07 |
| 6432 | 2019-03-13 19:31:22 | 2019-03-13 19:48:02 | 1 | 3.85 | 15.0 | 3.36 | 0.0 | 20.16 | green | credit card | Boerum Hill | Windsor Terrace | Brooklyn | Brooklyn | 19 | 0 days 00:16:40 |

2524 rows × 16 columns

We are then going to get the number of seconds, and divide by 60 to get the number of minutes (no advantage to rounding here, we’re okay with decimals).

df2["duration"]=df2["duration"].dt.seconds/60

Notice how the right-hand side now shows the duration of the taxi ride, in minutes.

df2
|  | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | hour | duration |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | 20 | 6.250000 |
| 2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | 17 | 7.400000 |
| 6 | 2019-03-26 21:07:31 | 2019-03-26 21:17:29 | 1 | 3.65 | 13.0 | 2.00 | 0.0 | 18.80 | yellow | credit card | Battery Park City | Two Bridges/Seward Park | Manhattan | Manhattan | 21 | 9.966667 |
| 11 | 2019-03-20 19:39:42 | 2019-03-20 19:45:36 | 1 | 1.53 | 6.5 | 2.16 | 0.0 | 12.96 | yellow | credit card | Upper West Side South | Manhattan Valley | Manhattan | Manhattan | 19 | 5.900000 |
| 12 | 2019-03-18 21:27:14 | 2019-03-18 21:34:16 | 1 | 1.05 | 6.5 | 1.00 | 0.0 | 11.30 | yellow | credit card | Murray Hill | Midtown Center | Manhattan | Manhattan | 21 | 7.033333 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6424 | 2019-03-30 20:52:15 | 2019-03-30 20:59:55 | 1 | 1.70 | 8.0 | 0.00 | 0.0 | 9.30 | green | cash | Central Harlem | Central Harlem North | Manhattan | Manhattan | 20 | 7.666667 |
| 6427 | 2019-03-23 18:26:09 | 2019-03-23 18:49:12 | 1 | 7.07 | 20.0 | 0.00 | 0.0 | 20.00 | green | cash | Parkchester | East Harlem South | Bronx | Manhattan | 18 | 23.050000 |
| 6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 1 | 18.74 | 58.0 | 0.00 | 0.0 | 58.80 | green | credit card | Jamaica | East Concourse/Concourse Village | Queens | Bronx | 17 | 56.383333 |
| 6430 | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 | 1 | 4.14 | 16.0 | 0.00 | 0.0 | 17.30 | green | cash | Crown Heights North | Bushwick North | Brooklyn | Brooklyn | 22 | 19.116667 |
| 6432 | 2019-03-13 19:31:22 | 2019-03-13 19:48:02 | 1 | 3.85 | 15.0 | 3.36 | 0.0 | 20.16 | green | credit card | Boerum Hill | Windsor Terrace | Brooklyn | Brooklyn | 19 | 16.666667 |

2524 rows × 16 columns

  • Fit a new LinearRegression object, this time using “distance”, “hour”, “passengers” as the input features, and using “duration” as the target value.

Here we instantiate a new LinearRegression object. (It would also be okay to overwrite the old one, just by calling fit on it with the new data.)

reg2 = LinearRegression()

Now we are using three input variables, rather than just one. We also have a new target variable: “duration”. Our overall question is: how can we predict “duration” using “distance”, “hour”, and “passengers”?

reg2.fit(df2[[ "distance", "hour", "passengers"]], df2["duration"])
LinearRegression()

Here are the resulting coefficients.

reg2.coef_
array([ 2.40296826, -0.56721926, -0.01271133])

Pay particular attention to their signs. It certainly makes intuitive sense that a greater distance corresponds to a longer taxi ride. This corresponds to the first coefficient, 2.4, being positive. These coefficients are like partial derivatives: this one says that for each additional mile of distance (I think the unit is miles), the duration of the ride goes up by about 2.4 minutes.

The reason we restricted to times after 16:00 is that I wanted us to be getting further and further from rush hour as the hour gets larger. So you would expect the taxi rides to take less time as the hour increases. That’s why we have a negative number in the “hour” coefficient.

Lastly, we have a negative coefficient for passengers. But notice also that it is quite a small number, which makes intuitive sense: the number of passengers does not have much effect on the duration of the taxi ride.
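To keep track of which coefficient goes with which input variable, it can help to label them, for example with a pandas Series (a small sketch).

# Pair each coefficient with the name of its input column.
pd.Series(reg2.coef_, index=["distance", "hour", "passengers"])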
