Week 7 Monday#
Plan#
Logistic Regression and Classification
Logistic Regression and Classification#
In a classification problem, the response variable \(y\) is discrete, representing different categories.
Why not use linear regression for a classification problem?
The range of \(y\) is restricted to a discrete set, while a linear model outputs arbitrary real values.
The MSE loss function is inappropriate, especially for multi-class classification: it does not make sense that misclassifying a 1 as a 9 should incur a larger penalty than misclassifying it as a 7.
There is no ordering among the values of \(y\) in classification – they are just categories (we can permute the label numbers as we like, while such a permutation would change the regression results).
Therefore, for a classification problem, we may want to:
replace the mapping assumption between \(y\) and \(x\)
replace the loss function in regression
Logistic regression is a linear classification method and a direct generalization of linear regression.
Binary Classification#
For simplicity, we will first introduce the binary classification case – \(y\) has only two categories, denoted as \(0\) and \(1\).
Assumption 1: Depending on the variable \(x\), the response variable \(y\) has different probabilities of taking the value 0 or 1. Instead of predicting an exact value of 0 or 1, we are actually predicting these probabilities.
Assumption 2: Logistic function assumption. Given \(x\), the probability of observing \(y=1\) is modelled as
\[P(y=1|\mathbf{x}) = f(\mathbf{x};\beta) = \sigma(\beta^{T}\tilde{x}) = \frac{1}{1+\exp(-\beta^{T}\tilde{x})},\]
where \(\sigma(z)=\frac{1}{1+\exp{(-z)}}\) is called the standard logistic function, or sigmoid function in deep learning. Recall that \(\beta\in\mathbb{R}^{p+1}\) and \(\tilde{x}\) is the “augmented” sample with first element one to incorporate an intercept in the linear function.
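As a quick numerical sanity check (a minimal sketch, not part of the original notes; the values of beta and the sample below are made up for illustration), we can evaluate this modelled probability with NumPy:
import numpy as np

def sigmoid(z):
    # standard logistic (sigmoid) function: 1 / (1 + exp(-z))
    return 1 / (1 + np.exp(-z))

beta = np.array([0.5, -1.0, 2.0])      # made-up (intercept, coefficient 1, coefficient 2)
x_tilde = np.array([1.0, 0.3, 1.2])    # augmented sample: leading 1 for the intercept
print(sigmoid(beta @ x_tilde))         # modelled P(y = 1 | x), a number strictly between 0 and 1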
Equivalent expression:
Denote \(p = P(y=1|\mathbf{x})\); then we can write the model in linear form (the LHS is called the log-odds, or logit, in statistics):
\[\log\frac{p}{1-p} = \beta^{T}\tilde{x}.\]
Since \(y\) only takes the values 0 or 1, we have
\[P(y|\mathbf{x}) = p^{y}(1-p)^{1-y}.\]
MLE (Maximum Likelihood Estimation)
Assume the samples are independent. The overall probability of observing the whole training dataset (the likelihood) is
\[\mathcal{L}(\beta) = \prod_{i=1}^{N} P(y^{i}|\mathbf{x}^{i}) = \prod_{i=1}^{N} (p^{i})^{y^{i}}(1-p^{i})^{1-y^{i}}, \qquad p^{i} = f(\mathbf{x}^{i};\beta).\]
By maximizing the logarithm of the likelihood function, we derive the loss function to be minimized:
\[L(\beta) = -\frac{1}{N}\sum_{i=1}^{N}\left[y^{i}\log p^{i} + (1-y^{i})\log(1-p^{i})\right].\]
The loss function also has a clear probabilistic interpretation. For the \(i\)-th sample, the vector of true labels \((y^{i},1-y^{i})\) can be viewed as a probability distribution. The loss function is then the mean cross entropy across samples, i.e. the “distance” between the observed label distribution and the probability distribution modelled by logistic regression.
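As an illustration of this formula (a minimal NumPy sketch under the notation above; the function name cross_entropy_loss is just a label used here), the loss could be computed as follows:
import numpy as np

def cross_entropy_loss(beta, X_tilde, y):
    # X_tilde: (N, p+1) array of augmented samples, y: (N,) array of 0/1 labels
    p = 1 / (1 + np.exp(-X_tilde @ beta))   # modelled P(y = 1 | x) for each sample
    # mean cross entropy between the observed labels and the modelled probabilities
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))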
Algorithm#
Take the gradient (left as an exercise – if you like):
\[\nabla L(\beta) = \frac{1}{N}\sum_{i=1}^{N}\left(f(\mathbf{x}^{i};\beta) - y^{i}\right)\tilde{x}^{i}.\]
In vector form, with \(\tilde{X}\) the matrix whose rows are the augmented samples and \(\mathbf{p}\) the vector of modelled probabilities,
\[\nabla L(\beta) = \frac{1}{N}\tilde{X}^{T}(\mathbf{p}-\mathbf{y}).\]
The solution here is obtained by numerical optimization. The simplest algorithm in optimization is gradient descent (GD):
\[\beta^{k+1}=\beta^{k}-\eta\nabla L(\beta^{k}).\]
Here the step size \(\eta\) is also called the learning rate in machine learning. Note that GD is exactly the forward Euler scheme for solving the ODE
\[\dot{\beta} = -\nabla L(\beta).\]
By setting a suitable stopping criterion for GD, we obtain an approximation of the minimizer \(\hat{\beta}\).
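Here is a minimal NumPy sketch of the whole procedure – computing the gradient and running the GD update – assuming labels in {0, 1}; the function name fit_logistic_gd, the default learning rate, and the fixed iteration count are illustrative choices, not prescriptions:
import numpy as np

def fit_logistic_gd(X, y, eta=0.1, n_steps=1000):
    # X: (N, p) array of raw samples, y: (N,) array of labels in {0, 1}
    X_tilde = np.column_stack([np.ones(len(X)), X])   # augment each sample with a leading 1
    beta = np.zeros(X_tilde.shape[1])
    for _ in range(n_steps):
        p = 1 / (1 + np.exp(-X_tilde @ beta))         # modelled probabilities
        grad = X_tilde.T @ (p - y) / len(y)           # gradient of the loss L(beta)
        beta = beta - eta * grad                      # gradient-descent update
    return beta
In practice one would scale the features and choose the learning rate and stopping criterion more carefully; this sketch simply fixes the number of iterations.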
Making predictions and Evaluation of Performance#
Now, with the estimated \(\hat{\beta}\) and a new data point \(\mathbf{x}^{test}\), we calculate the probability that \(y^{test}=1\) as \(f(\mathbf{x}^{test};\hat{\beta})\). If this probability is greater than 0.5, we predict \(y^{test}=1\).
For the test dataset, the accuracy is defined as the ratio of the number of correct predictions to the total number of samples.
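Continuing the sketch above (the helper names predict_labels and accuracy are hypothetical), the prediction rule and the accuracy could look like this:
import numpy as np

def predict_labels(beta_hat, X):
    # predict 1 whenever the modelled probability exceeds 0.5
    X_tilde = np.column_stack([np.ones(len(X)), X])
    p = 1 / (1 + np.exp(-X_tilde @ beta_hat))
    return (p > 0.5).astype(int)

def accuracy(y_pred, y_true):
    # ratio of the number of correct predictions to the total number of samples
    return np.mean(y_pred == y_true)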
import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns
Example: Predicting if a penguin is in the Chinstrap species#
Today we will use Logistic Regression with the flipper length and bill length columns in the penguins dataset to predict if a penguin is in the Chinstrap species.
df = sns.load_dataset("penguins").dropna(axis=0).copy()
Here are the columns we’re going to use. I didn’t want to keep typing these repeatedly, so I stored them as a length two list.
cols = ["flipper_length_mm", "bill_length_mm"]
Here is what the true data looks like.
alt.Chart(df).mark_circle().encode(
x=alt.X(cols[0], scale=alt.Scale(zero=False)),
y=alt.Y(cols[1], scale=alt.Scale(zero=False)),
color="species:N"
)
The procedure is basically the same as for Linear Regression.
First we import.
from sklearn.linear_model import LogisticRegression
Then we instantiate. The convention is to name this object clf, for “classifier”, since we will be performing classification, not regression.
clf = LogisticRegression()
Here are the values we will be predicting. Logistic regression is easiest to understand (especially its formula) when we are predicting a binary variable (meaning a variable with only two options). Later we will see an example of predicting the “species” variable itself, which in this case has three options.
df["isChinstrap"] = (df["species"] == "Chinstrap")
df.head(5)
 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | isChinstrap |
---|---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male | False |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female | False |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female | False |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female | False |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male | False |
Notice how there is a new Boolean column named “isChinstrap” which indicates whether or not the penguin is a Chinstrap. I’m using sample here rather than head because the penguins at the beginning of the dataset are all the same species.
df.sample(10, random_state=1)
 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | isChinstrap |
---|---|---|---|---|---|---|---|---|
65 | Adelie | Biscoe | 41.6 | 18.0 | 192.0 | 3950.0 | Male | False |
276 | Gentoo | Biscoe | 43.8 | 13.9 | 208.0 | 4300.0 | Female | False |
186 | Chinstrap | Dream | 49.7 | 18.6 | 195.0 | 3600.0 | Male | True |
198 | Chinstrap | Dream | 50.1 | 17.9 | 190.0 | 3400.0 | Female | True |
293 | Gentoo | Biscoe | 46.5 | 14.8 | 217.0 | 5200.0 | Female | False |
183 | Chinstrap | Dream | 54.2 | 20.8 | 201.0 | 4300.0 | Male | True |
98 | Adelie | Dream | 33.1 | 16.1 | 178.0 | 2900.0 | Female | False |
193 | Chinstrap | Dream | 46.2 | 17.5 | 187.0 | 3650.0 | Female | True |
95 | Adelie | Dream | 40.8 | 18.9 | 208.0 | 4300.0 | Male | False |
195 | Chinstrap | Dream | 45.5 | 17.0 | 196.0 | 3500.0 | Female | True |
We now do the fit step, after splitting the data into training and test sets. Notice how we are using two input features and we are using our new Boolean “isChinstrap” column for the target.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[cols], df['isChinstrap'], test_size=0.3, random_state=42)
clf.fit(X_train, y_train)
LogisticRegression()
Here we can compare the actual values of “isChinstrap” to the values of “pred”.
df["pred"] = clf.predict(df[cols])
df.sample(10, random_state = 1)
 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | isChinstrap | pred |
---|---|---|---|---|---|---|---|---|---|
65 | Adelie | Biscoe | 41.6 | 18.0 | 192.0 | 3950.0 | Male | False | False |
276 | Gentoo | Biscoe | 43.8 | 13.9 | 208.0 | 4300.0 | Female | False | False |
186 | Chinstrap | Dream | 49.7 | 18.6 | 195.0 | 3600.0 | Male | True | True |
198 | Chinstrap | Dream | 50.1 | 17.9 | 190.0 | 3400.0 | Female | True | True |
293 | Gentoo | Biscoe | 46.5 | 14.8 | 217.0 | 5200.0 | Female | False | False |
183 | Chinstrap | Dream | 54.2 | 20.8 | 201.0 | 4300.0 | Male | True | True |
98 | Adelie | Dream | 33.1 | 16.1 | 178.0 | 2900.0 | Female | False | False |
193 | Chinstrap | Dream | 46.2 | 17.5 | 187.0 | 3650.0 | Female | True | True |
95 | Adelie | Dream | 40.8 | 18.9 | 208.0 | 4300.0 | Male | False | False |
195 | Chinstrap | Dream | 45.5 | 17.0 | 196.0 | 3500.0 | Female | True | False |
A simple way to find the proportion of correct predictions is to compare the predictions with the true values and call the mean method of the resulting Boolean Series. (Remember that True is treated as 1 and False is treated as 0.)
To evaluate the accuracy of a logistic regression model (or any classification model), you typically use the test dataset, not the training dataset. The reason for this is that the accuracy on the training set can be misleading; it may be high due to overfitting, which means the model has learned the training data too well, including its noise and outliers, and may not generalize well to unseen data.
y_test_pred = clf.predict(X_test)
(y_test_pred == y_test).mean()
0.98
There is also a way to use clf directly to find the proportion of correct predictions, by calling its score method. We pass as inputs to score the input features along with the desired outputs. Notice that we get the same number as above.
clf.score(X_test, y_test)
0.98
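Returning to the earlier point about evaluating on the test set: a quick check one could run (its output is not shown in these notes) is to compare the training accuracy with the test accuracy; a large gap would suggest overfitting.
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy:", clf.score(X_test, y_test))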
X_test.index
Int64Index([ 30, 317, 79, 201, 63, 304, 289, 186, 217, 203, 81, 14, 328,
132, 272, 138, 120, 152, 82, 282, 115, 143, 323, 205, 6, 116,
268, 332, 169, 331, 174, 309, 69, 90, 285, 296, 188, 260, 52,
124, 38, 48, 178, 187, 119, 78, 281, 51, 107, 320, 100, 235,
226, 21, 114, 255, 12, 61, 153, 185, 305, 325, 288, 4, 83,
319, 66, 230, 84, 303, 22, 29, 257, 334, 244, 183, 88, 117,
149, 247, 122, 146, 182, 96, 321, 265, 36, 308, 191, 173, 240,
341, 151, 65, 125, 20, 7, 215, 99, 35],
dtype='int64')
y_test_pred != y_test
30 False
317 False
79 False
201 False
63 False
...
20 False
7 False
215 False
99 False
35 False
Name: isChinstrap, Length: 100, dtype: bool
Find the indices (row labels) where predictions and actual values don’t match:
misclass_index = X_test[y_test_pred != y_test].index
misclass_index
Int64Index([122, 182], dtype='int64')
df.loc[misclass_index]
 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | isChinstrap | pred |
---|---|---|---|---|---|---|---|---|---|
122 | Adelie | Torgersen | 40.2 | 17.0 | 176.0 | 3450.0 | Female | False | True |
182 | Chinstrap | Dream | 40.9 | 16.6 | 187.0 | 3200.0 | Female | True | False |
There are 2 rows above. If we want to know what proportion of predictions are incorrect, we can divide by the number of test samples. One minus that proportion should then equal the score value we computed above.
1 - 2/len(X_test)
0.98
df.loc[121:123]
 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | isChinstrap | pred |
---|---|---|---|---|---|---|---|---|---|
121 | Adelie | Torgersen | 37.7 | 19.8 | 198.0 | 3500.0 | Male | False | False |
122 | Adelie | Torgersen | 40.2 | 17.0 | 176.0 | 3450.0 | Female | False | True |
123 | Adelie | Torgersen | 41.4 | 18.5 | 202.0 | 3875.0 | Male | False | False |
A great feature of logistic regression is that it indicates not just a predicted output, but also a confidence. We can get these confidences by using the predict_proba method of the fit LogisticRegression object.
clf.predict_proba(df.loc[121:123, cols])
array([[9.99885723e-01, 1.14276848e-04],
[2.85110398e-01, 7.14889602e-01],
[9.98695439e-01, 1.30456058e-03]])
The column at index 0 corresponds to False predictions and the column at index 1 corresponds to True predictions. I don’t think there is any way to know that in advance, but we can check it using the classes_ attribute.
clf.classes_
array([False, True])
The whole point of logistic regression is to find parameters for estimating probabilities. That is also the difficult part (it happens when we call fit).
In our specific case, we have two input variables, so we need to find two coefficients and one intercept (or bias), so three parameters total.
Here are the coefficients. They are in the same order as the variables in cols.
clf.coef_
array([[-0.33801769, 1.0238539 ]])
cols
['flipper_length_mm', 'bill_length_mm']
clf.intercept_
array([19.25143827])
Let \(x_1\) denote the flipper length and let \(x_2\) denote the bill length. We are plugging approximately \(19.25 - 0.34 x_1 + 1.02 x_2\) into the logistic function \(\sigma(z) = 1/(1 + e^{-z})\). The result will be a probability estimate for the penguin being a Chinstrap penguin.
Let’s see this formula in action for the row with label 122, where we know our classifier makes a mistake.
df.loc[122, cols]
flipper_length_mm 176.0
bill_length_mm 40.2
Name: 122, dtype: object
When we plug in the values above, we get the following. Here we use np.exp(x) to compute \(e^x\).
1/(1+np.exp(-(clf.intercept_[0] + clf.coef_[0][0]*176 + clf.coef_[0][1] *40.2)))
0.7148896017002825
This result is the exact predicted probability.
clf.predict_proba(df.loc[[122], cols])
array([[0.2851104, 0.7148896]])
Interpretation question:
As flipper length increases, is the penguin more or less likely to be a Chinstrap penguin, according to our model? What about bill length?
If you look at the overall formula, this can get pretty confusing. It is better to notice that the logistic function \(\sigma(z) = 1/(1+e^{-z})\) is an increasing function, and our formula is \(\sigma(L(x))\) for some linear term \(L(x)\). The interpretation question is much easier when we focus on the linear part: all we need to do is look at the signs of the coefficients.
Answer: as flipper length increases, the probability of being a Chinstrap decreases, because its coefficient (about -0.34) is negative. As bill length increases, the probability increases, because its coefficient (about 1.02) is positive.
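To see this numerically, one could evaluate the fitted model on a few made-up penguins (the measurement values below are hypothetical, chosen to vary only the flipper length):
# same bill length, increasing flipper length: the predicted P(isChinstrap) should decrease down the rows
demo = pd.DataFrame({"flipper_length_mm": [180, 190, 200], "bill_length_mm": [45, 45, 45]})
clf.predict_proba(demo)[:, 1]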