Week 7 Monday#

Plan#

  • Logistic Regression and Classification

Logistic Regression and Classification#

In classification problems, the response variable \(y\) is discrete, representing different categories.

Why not use linear regression for classification problems?

  • The range of \(y\) is a problem: a linear model’s output is continuous and unbounded, so it need not land on a valid category label.

  • The MSE loss function is inappropriate, especially for multi-class classification. It does not make sense to assume that misclassifying a 1 as a 9 should incur a larger penalty than misclassifying it as a 7.

  • There is no ordering among the \(y\) values in classification – they are just categories (we can permute the label numbers as we like, while such a permutation would certainly change the regression results).

Therefore, for classification problems, we may want to:

  • replace the mapping assumption between \(y\) and \(x\)

  • replace the loss function in regression

Logistic regression is a linear classification method and a direct generalization of linear regression.

Binary Classification#

For simplicity, we will first introduce the binary classification case – \(y\) has only two categories, denoted as \(0\) and \(1\).

Assumption 1: Depending on the variable \(x\), the response variable \(y\) has different probabilities of taking the value 0 or 1. Instead of predicting an exact value of 0 or 1, we are actually predicting these probabilities.

Assumption 2: Logistic function assumption. Given \(x\), what is the probability of observing \(y=1\)?

\[ P(y=1|\mathbf{x})=f(\mathbf{x};\mathbf{\beta}) = \frac{1}{1 + \exp(-\tilde{x}\mathbf{\beta})} =: \sigma(\tilde{x}\mathbf{\beta}). \]

where \(\sigma(z)=\frac{1}{1+\exp{(-z)}}\) is called the standard logistic function, or the sigmoid function in deep learning. Recall that \(\beta\in\mathbb{R}^{p+1}\) and \(\tilde{x}\) is the “augmented” sample whose first element is one, so that the intercept is incorporated into the linear function.
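
As a quick aside, the sigmoid takes one line of NumPy; the function name sigmoid below is just a convenience for these notes, not something defined in the lecture code.

import numpy as np

def sigmoid(z):
    # Standard logistic (sigmoid) function: 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

sigmoid(0.0)  # 0.5 -- the decision boundary corresponds to x_tilde @ beta = 0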

Equivalent expression:

  • Denote \(p = P(y=1|\mathbf{x})\); then we can write the model in linear form (the LHS is called the log-odds, or logit, in statistics; a short derivation is given after this list)

\[ \ln\frac{p}{1-p}=\tilde{x}\beta \]
  • Since \(y\) only takes the values 0 or 1, we have

\[ P(y|\mathbf{x},\beta) = f(\mathbf{x};\beta)^y \big(1 - f(\mathbf{x};\beta) \big)^{1-y} \]
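
To see where the linear form in the first bullet comes from, here is the short derivation from the sigmoid assumption:

\[ p = \frac{1}{1+\exp(-\tilde{x}\beta)} \;\Longleftrightarrow\; \frac{1-p}{p} = \exp(-\tilde{x}\beta) \;\Longleftrightarrow\; \ln\frac{p}{1-p} = \tilde{x}\beta. \]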

MLE (Maximum Likelihood Estimation)

Assume the samples are independent. The overall probability of observing the whole training dataset is

\[\begin{split} {\begin{aligned} &P(\mathbf{y}\; | \; \mathbf{X};\beta )\\ =&\prod _{i=1}^N P\left(y^{(i)}\mid \mathbf{x}^{(i)};\beta\right)\\ =&\prod_{i=1}^N f\big(\mathbf{x}^{(i)};\beta \big)^{y^{(i)}} \Big(1-f\big(\mathbf{x}^{(i)};\beta\big) \Big)^{\big(1-y^{(i)}\big)} \end{aligned}}. \end{split}\]

Maximizing the logarithm of the likelihood function is equivalent to minimizing the following loss function (the negative average log-likelihood):

\[ L (\beta) = L (\beta; X,\mathbf{y}) = - \frac{1}{N}\sum_{i=1}^N \Bigl\{y^{(i)} \ln\big( f(\mathbf{x}^{(i)};\beta) \big) + (1 - y^{(i)}) \ln\big( 1 - f(\mathbf{x}^{(i)};\beta) \big) \Bigr\}. \]

The loss function also has a clear probabilistic interpretation. For the \(i\)-th sample, the vector \((y^{(i)}, 1-y^{(i)})\) can be viewed as a probability distribution over the two labels. The loss function is then the mean cross entropy across samples, i.e. a “distance” between the observed label distribution and the probability distribution modelled by the logistic function.
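
For concreteness, here is a minimal NumPy sketch of this loss; the name logistic_loss is illustrative, and X_tilde is assumed to already contain the leading column of ones.

import numpy as np

def logistic_loss(beta, X_tilde, y):
    # Average cross entropy L(beta); X_tilde includes the leading column of ones
    p = 1.0 / (1.0 + np.exp(-X_tilde @ beta))  # modelled P(y=1|x) for every sample
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))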

Algorithm#

Take the gradient (left as an exercise – if you like):

\[ \frac{\partial L (\beta)}{\partial \beta_{k}} =\frac{1}{N}\sum_{i=1}^N \big(\sigma(\tilde{x}^{(i)}\beta) - y^{(i)} \big) \tilde{x}^{(i)}_k. \]

In vector form

\[ \nabla_{\beta} \big( L (\beta) \big) = \frac{1}{N}\sum_{i=1}^N \big(\sigma(\tilde{x}^{(i)}\beta) - y^{(i)} \big) \tilde{x}^{(i)} =\frac{1}{N}\sum_{i=1}^N \big( f(\mathbf{x}^{(i)};\beta) - y^{(i)} \big) \tilde{x}^{(i)}. \]

The minimizer has no closed-form solution, so we turn to numerical optimization. The simplest algorithm in optimization is gradient descent (GD):

\[ \beta^{k+1}=\beta^{k}-\eta\nabla L(\beta^{k}). \]

Here the step size \(\eta\) is also called the learning rate in machine learning. Note that this is exactly Euler’s scheme for solving the ODE

\[ \dot{\beta} = -\nabla L(\beta). \]

By setting a suitable stopping criterion for GD, we obtain an approximation of the minimizer \(\hat{\beta}\).
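
Here is a minimal gradient-descent sketch matching the formulas above; the learning rate, iteration cap, and tolerance are illustrative choices, not prescribed values.

import numpy as np

def fit_logistic_gd(X_tilde, y, eta=0.1, n_iter=10000, tol=1e-8):
    # Gradient descent for logistic regression; X_tilde includes the leading column of ones
    beta = np.zeros(X_tilde.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X_tilde @ beta))   # sigma(x_tilde beta) for every sample
        grad = X_tilde.T @ (p - y) / len(y)         # gradient of L(beta)
        beta_new = beta - eta * grad                # GD update
        if np.max(np.abs(beta_new - beta)) < tol:   # simple stopping criterion
            return beta_new
        beta = beta_new
    return beta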

Making predictions and Evaluation of Performance#

Now, with the estimated \(\hat{\beta}\) and a new data point \(x^{test}\), we calculate the probability that \(y^{test}=1\) as \(f(\mathbf{x}^{test};\hat{\beta})\). If it is greater than 0.5, we predict \(y^{test}=1\).

For the test dataset, the accuracy is defined as the ratio of the number of correct predictions to the total number of samples.
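
Before switching to scikit-learn, here is how prediction and accuracy could look for the hypothetical gradient-descent fit above; the names are placeholders, not library functions.

import numpy as np

def predict_and_score(beta_hat, X_tilde, y):
    # Threshold the modelled probabilities at 0.5 and report the accuracy
    p = 1.0 / (1.0 + np.exp(-X_tilde @ beta_hat))   # P(y=1|x) under the fitted model
    y_pred = (p > 0.5).astype(int)                  # predicted labels
    return y_pred, np.mean(y_pred == y)             # accuracy = fraction of correct predictions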

import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns

Example: Predicting if a penguin is in the Chinstrap species#

Today we will use Logistic Regression with the flipper length and bill length columns in the penguins dataset to predict if a penguin is in the Chinstrap species.

df = sns.load_dataset("penguins").dropna(axis=0).copy()

Here are the columns we’re going to use. I didn’t want to keep typing these repeatedly, so I stored them as a length two list.

cols = ["flipper_length_mm", "bill_length_mm"]

Here is what the true data looks like.

alt.Chart(df).mark_circle().encode(
    x=alt.X(cols[0], scale=alt.Scale(zero=False)),
    y=alt.Y(cols[1], scale=alt.Scale(zero=False)),
    color="species:N"
)

The procedure is basically the same as for Linear Regression.

First we import.

from sklearn.linear_model import LogisticRegression

Then we instantiate. The convention is to name this object clf, for “classifier”, since we will be performing classification, not regression.

clf = LogisticRegression()

Here are the values we will be predicting. Logistic regression (especially its formula) is easiest to understand when we are predicting a binary variable (meaning a variable with only two options). We will see later an example of predicting the “species” variable itself, which in this case has three options.

df["isChinstrap"] = (df["species"] == "Chinstrap")
df.head(5)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex isChinstrap
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male False
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female False
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female False
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female False
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 Male False

Notice how there is a new Boolean column named “isChinstrap” which indicates whether or not the penguin is a Chinstrap. In the next cell I’m using sample rather than head, because the penguins at the beginning of the dataset are all the same species.

df.sample(10, random_state=1)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex isChinstrap
65 Adelie Biscoe 41.6 18.0 192.0 3950.0 Male False
276 Gentoo Biscoe 43.8 13.9 208.0 4300.0 Female False
186 Chinstrap Dream 49.7 18.6 195.0 3600.0 Male True
198 Chinstrap Dream 50.1 17.9 190.0 3400.0 Female True
293 Gentoo Biscoe 46.5 14.8 217.0 5200.0 Female False
183 Chinstrap Dream 54.2 20.8 201.0 4300.0 Male True
98 Adelie Dream 33.1 16.1 178.0 2900.0 Female False
193 Chinstrap Dream 46.2 17.5 187.0 3650.0 Female True
95 Adelie Dream 40.8 18.9 208.0 4300.0 Male False
195 Chinstrap Dream 45.5 17.0 196.0 3500.0 Female True

We now split the data into training and test sets and do the fit step. Notice how we are using two input features and our new Boolean “isChinstrap” column as the target.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[cols], df['isChinstrap'], test_size=0.3, random_state=42)
clf.fit(X_train, y_train)
LogisticRegression()

Here we can compare the actual values of “isChinstrap” to the values of “pred”.

df["pred"] = clf.predict(df[cols])
df.sample(10, random_state = 1)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex isChinstrap pred
65 Adelie Biscoe 41.6 18.0 192.0 3950.0 Male False False
276 Gentoo Biscoe 43.8 13.9 208.0 4300.0 Female False False
186 Chinstrap Dream 49.7 18.6 195.0 3600.0 Male True True
198 Chinstrap Dream 50.1 17.9 190.0 3400.0 Female True True
293 Gentoo Biscoe 46.5 14.8 217.0 5200.0 Female False False
183 Chinstrap Dream 54.2 20.8 201.0 4300.0 Male True True
98 Adelie Dream 33.1 16.1 178.0 2900.0 Female False False
193 Chinstrap Dream 46.2 17.5 187.0 3650.0 Female True True
95 Adelie Dream 40.8 18.9 208.0 4300.0 Male False False
195 Chinstrap Dream 45.5 17.0 196.0 3500.0 Female True False

A simple way to find the proportion of correct predictions is to call the mean method of a Boolean Series that compares the predictions to the true values. (Remember that True is treated as 1 and False is treated as 0.)

To evaluate the accuracy of a logistic regression model (or any classification model), you typically use the test dataset, not the training dataset. The reason for this is that the accuracy on the training set can be misleading; it may be high due to overfitting, which means the model has learned the training data too well, including its noise and outliers, and may not generalize well to unseen data.

y_test_pred = clf.predict(X_test)
(y_test_pred == y_test).mean()
0.98

There is also a way to use clf directly to find the proportion of correct predictions, by calling its score method. We pass as inputs to score the input features along with the desired outputs. Notice that we get the same number as above.

clf.score(X_test, y_test)
0.98
X_test.index
Int64Index([ 30, 317,  79, 201,  63, 304, 289, 186, 217, 203,  81,  14, 328,
            132, 272, 138, 120, 152,  82, 282, 115, 143, 323, 205,   6, 116,
            268, 332, 169, 331, 174, 309,  69,  90, 285, 296, 188, 260,  52,
            124,  38,  48, 178, 187, 119,  78, 281,  51, 107, 320, 100, 235,
            226,  21, 114, 255,  12,  61, 153, 185, 305, 325, 288,   4,  83,
            319,  66, 230,  84, 303,  22,  29, 257, 334, 244, 183,  88, 117,
            149, 247, 122, 146, 182,  96, 321, 265,  36, 308, 191, 173, 240,
            341, 151,  65, 125,  20,   7, 215,  99,  35],
           dtype='int64')
y_test_pred != y_test
30     False
317    False
79     False
201    False
63     False
       ...  
20     False
7      False
215    False
99     False
35     False
Name: isChinstrap, Length: 100, dtype: bool

Find the indices (row labels) where predictions and actual values don’t match:

misclass_index = X_test[y_test_pred != y_test].index
misclass_index
Int64Index([122, 182], dtype='int64')
df.loc[misclass_index]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex isChinstrap pred
122 Adelie Torgersen 40.2 17.0 176.0 3450.0 Female False True
182 Chinstrap Dream 40.9 16.6 187.0 3200.0 Female True False

There are 2 rows above. The proportion of incorrect predictions is 2 divided by the number of test samples, so subtracting it from 1 should recover the score value \(x\) we computed above.

1 - 2/len(X_test)
0.98
df.loc[121:123]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex isChinstrap pred
121 Adelie Torgersen 37.7 19.8 198.0 3500.0 Male False False
122 Adelie Torgersen 40.2 17.0 176.0 3450.0 Female False True
123 Adelie Torgersen 41.4 18.5 202.0 3875.0 Male False False

A great feature of logistic regression is that it indicates not just a predicted output, but also a confidence. We can get these confidences by using the predict_proba method of the fitted LogisticRegression object.

clf.predict_proba(df.loc[121:123, cols])
array([[9.99885723e-01, 1.14276848e-04],
       [2.85110398e-01, 7.14889602e-01],
       [9.98695439e-01, 1.30456058e-03]])

The column at index 0 corresponds to False predictions and the column at index 1 corresponds to True predictions. The ordering of the columns matches the classifier’s classes_ attribute, which we can check:

clf.classes_
array([False,  True])

The whole point of logistic regression is to find parameters for estimating probabilities. That is also the difficult part (it happens when we call fit).

In our specific case, we have two input variables, so we need to find two coefficients and one intercept (or bias): three parameters in total.

Here are the coefficients. They are in the same order as the variables in cols.

clf.coef_
array([[-0.33801769,  1.0238539 ]])
cols
['flipper_length_mm', 'bill_length_mm']
clf.intercept_
array([19.25143827])

Let \(x_1\) denote the flipper length and let \(x_2\) denote the bill length. We are plugging approximately \(19.25 - 0.34 \cdot x_1 + 1.02 \cdot x_2\) into the logistic function \(1/(1 + e^{-x})\). The result will be a probability estimate for the penguin being a Chinstrap penguin.

Let’s see this formula in action for the row with label 122 where we know our classifier makes a mistake.

df.loc[122, cols]
flipper_length_mm    176.0
bill_length_mm        40.2
Name: 122, dtype: object

When we plug in the values above, we get the following. Here we use np.exp(x) to compute \(e^x\).

1/(1+np.exp(-(clf.intercept_[0] + clf.coef_[0][0]*176 + clf.coef_[0][1] *40.2)))
0.7148896017002825

This result is the exact predicted probability.

clf.predict_proba(df.loc[[122], cols])
array([[0.2851104, 0.7148896]])

Interpretation question:

  • As flipper length increases, is the penguin more or less likely to be a Chinstrap penguin, according to our model? What about bill length?

If you look at the overall formula, this can get pretty confusing. It is better to notice that the logistic function \(\sigma(x) = 1/(1+e^{-x})\) is an increasing function, and that our formula is \(\sigma(L(x))\) for some linear term \(L(x)\). The interpretation question is much easier when we focus on the linear part: all we need to do is look at the signs of the coefficients.

Answer: as flipper length increases, the probability of being a Chinstrap decreases. Why? Its coefficient, -0.34, is negative. As bill length increases, the probability increases (1.02 is positive).
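
As a quick sanity check of this interpretation, we can compare the model’s predicted probabilities for two made-up penguins that differ only in flipper length (the specific measurements below are hypothetical, chosen only for illustration).

demo = pd.DataFrame([[190, 48], [210, 48]], columns=cols)
clf.predict_proba(demo)[:, 1]  # the Chinstrap probability is lower for the longer flipper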
