Week 7 Monday#
Plan#
Logistic Regression and Classification
Logistic Regression and Classification#
In a classification problem, the response variable \(y\) is discrete, representing different categories.
Why not use linear regression for a classification problem?
The range of \(y\) is restricted to a discrete set, while a linear model outputs arbitrary real values.
The MSE loss function is inappropriate, especially for multi-class classification: it does not make sense that misclassifying a 1 as a 9 should incur a larger penalty than misclassifying it as a 7.
There is no ordering among the values of \(y\) in classification – they are just categories (we can permute the label numbers as we like, while such a permutation would change the regression results).
Therefore, for a classification problem, we may want to:
replace the mapping assumption between \(y\) and \(x\)
replace the loss function in regression
Logistic regression is a linear classification method and a direct generalization of linear regression.
Binary Classification#
For simplicity, we will first introduce the binary classification case – \(y\) has only two categories, denoted as \(0\) and \(1\).
Assumption 1: Depending on the variable \(x\), the response variable \(y\) has different probabilities of taking the value 0 or 1. Instead of predicting an exact value of 0 or 1, we are actually predicting these probabilities.
Assumption 2: Logistic function assumption. Given \(x\), the probability of observing \(y=1\) is modelled as
\[P(y=1|\mathbf{x}) = f(\mathbf{x};\beta) = \sigma(\beta^{T}\tilde{x}) = \frac{1}{1+\exp(-\beta^{T}\tilde{x})},\]
where \(\sigma(z)=\frac{1}{1+\exp{(-z)}}\) is called the standard logistic function, or sigmoid function in deep learning. Recall that \(\beta\in\mathbb{R}^{p+1}\) and \(\tilde{x}\) is the “augmented” sample with first element one to incorporate an intercept in the linear function.
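As a quick numerical sanity check (a minimal sketch, not part of the original notes; the values of beta and the sample below are made up for illustration), we can evaluate this modelled probability with NumPy:
import numpy as np

def sigmoid(z):
    # standard logistic (sigmoid) function: 1 / (1 + exp(-z))
    return 1 / (1 + np.exp(-z))

beta = np.array([0.5, -1.0, 2.0])      # made-up (intercept, coefficient 1, coefficient 2)
x_tilde = np.array([1.0, 0.3, 1.2])    # augmented sample: leading 1 for the intercept
print(sigmoid(beta @ x_tilde))         # modelled P(y = 1 | x), a number strictly between 0 and 1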
Equivalent expression:
Denote \(p = P(y=1|\mathbf{x})\); then we can write the model in linear form (the LHS is called the log-odds, or logit, in statistics):
\[\log\frac{p}{1-p} = \beta^{T}\tilde{x}.\]
Since \(y\) only takes the values 0 or 1, we have
\[P(y|\mathbf{x}) = p^{y}(1-p)^{1-y}.\]
MLE (Maximum Likelihood Estimation)
Assume the samples are independent. The overall probability of observing the whole training dataset (the likelihood) is
\[\mathcal{L}(\beta) = \prod_{i=1}^{N} P(y^{i}|\mathbf{x}^{i}) = \prod_{i=1}^{N} (p^{i})^{y^{i}}(1-p^{i})^{1-y^{i}}, \qquad p^{i} = f(\mathbf{x}^{i};\beta).\]
By maximizing the logarithm of the likelihood function, we derive the loss function to be minimized:
\[L(\beta) = -\frac{1}{N}\sum_{i=1}^{N}\left[y^{i}\log p^{i} + (1-y^{i})\log(1-p^{i})\right].\]
The loss function also has a clear probabilistic interpretation. For the \(i\)-th sample, the vector of true labels \((y^{i},1-y^{i})\) can be viewed as a probability distribution. The loss function is then the mean cross entropy across samples, i.e. the “distance” between the observed label distribution and the probability distribution modelled by logistic regression.
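As an illustration of this formula (a minimal NumPy sketch under the notation above; the function name cross_entropy_loss is just a label used here), the loss could be computed as follows:
import numpy as np

def cross_entropy_loss(beta, X_tilde, y):
    # X_tilde: (N, p+1) array of augmented samples, y: (N,) array of 0/1 labels
    p = 1 / (1 + np.exp(-X_tilde @ beta))   # modelled P(y = 1 | x) for each sample
    # mean cross entropy between the observed labels and the modelled probabilities
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))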
Algorithm#
Take the gradient (left as an exercise – if you like):
\[\nabla L(\beta) = \frac{1}{N}\sum_{i=1}^{N}\left(f(\mathbf{x}^{i};\beta) - y^{i}\right)\tilde{x}^{i}.\]
In vector form, with \(\tilde{X}\) the matrix whose rows are the augmented samples and \(\mathbf{p}\) the vector of modelled probabilities,
\[\nabla L(\beta) = \frac{1}{N}\tilde{X}^{T}(\mathbf{p}-\mathbf{y}).\]
The solution here is obtained by numerical optimization. The simplest algorithm in optimization is gradient descent (GD):
\[\beta^{k+1}=\beta^{k}-\eta\nabla L(\beta^{k}).\]
Here the step size \(\eta\) is also called the learning rate in machine learning. Note that GD is exactly the forward Euler scheme for solving the ODE
\[\dot{\beta} = -\nabla L(\beta).\]
By setting a suitable stopping criterion for GD, we obtain an approximation of the minimizer \(\hat{\beta}\).
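Here is a minimal NumPy sketch of the whole procedure – computing the gradient and running the GD update – assuming labels in {0, 1}; the function name fit_logistic_gd, the default learning rate, and the fixed iteration count are illustrative choices, not prescriptions:
import numpy as np

def fit_logistic_gd(X, y, eta=0.1, n_steps=1000):
    # X: (N, p) array of raw samples, y: (N,) array of labels in {0, 1}
    X_tilde = np.column_stack([np.ones(len(X)), X])   # augment each sample with a leading 1
    beta = np.zeros(X_tilde.shape[1])
    for _ in range(n_steps):
        p = 1 / (1 + np.exp(-X_tilde @ beta))         # modelled probabilities
        grad = X_tilde.T @ (p - y) / len(y)           # gradient of the loss L(beta)
        beta = beta - eta * grad                      # gradient-descent update
    return beta
In practice one would scale the features and choose the learning rate and stopping criterion more carefully; this sketch simply fixes the number of iterations.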
Making predictions and Evaluation of Performance#
Now, with the estimated \(\hat{\beta}\) and a new data point \(\mathbf{x}^{test}\), we calculate the probability that \(y^{test}=1\) as \(f(\mathbf{x}^{test};\hat{\beta})\). If this probability is greater than 0.5, we predict \(y^{test}=1\).
For the test dataset, the accuracy is defined as the ratio of the number of correct predictions to the total number of samples.
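Continuing the sketch above (the helper names predict_labels and accuracy are hypothetical), the prediction rule and the accuracy could look like this:
import numpy as np

def predict_labels(beta_hat, X):
    # predict 1 whenever the modelled probability exceeds 0.5
    X_tilde = np.column_stack([np.ones(len(X)), X])
    p = 1 / (1 + np.exp(-X_tilde @ beta_hat))
    return (p > 0.5).astype(int)

def accuracy(y_pred, y_true):
    # ratio of the number of correct predictions to the total number of samples
    return np.mean(y_pred == y_true)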
import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns
Example: Predicting if a penguin is in the Chinstrap species#
Today we will use Logistic Regression with the flipper length and bill length columns in the penguins dataset to predict if a penguin is in the Chinstrap species.
df = sns.load_dataset("penguins").dropna(axis=0).copy()
Here are the columns we’re going to use. I didn’t want to keep typing these repeatedly, so I stored them as a length two list.
cols = ["flipper_length_mm", "bill_length_mm"]
Here is what the true data looks like.
alt.Chart(df).mark_circle().encode(
x=alt.X(cols[0], scale=alt.Scale(zero=False)),
y=alt.Y(cols[1], scale=alt.Scale(zero=False)),
color="species:N"
)
The procedure is basically the same as for Linear Regression.
First we import.
from sklearn.linear_model import LogisticRegression
Then we instantiate. The convention is to name this object clf, for “classifier”, since we will be performing classification, not regression.
clf = LogisticRegression()
Here are the values we will be predicting. Logistic regression is easiest to understand (especially its formula) when we are predicting a binary variable (meaning a variable with only two options). Later we will see an example of predicting the “species” variable itself, which in this case has three options.
df["isChinstrap"] = (df["species"] == "Chinstrap")
df.head(5)
 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | isChinstrap |
---|---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male | False |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female | False |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female | False |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female | False |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male | False |
Notice how there is a new Boolean column named “isChinstrap” which indicates whether or not the penguin is a Chinstrap. I’m using sample here rather than head because the penguins at the beginning of the dataset are all the same species.
df.sample(10, random_state=1)
 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | isChinstrap |
---|---|---|---|---|---|---|---|---|
65 | Adelie | Biscoe | 41.6 | 18.0 | 192.0 | 3950.0 | Male | False |
276 | Gentoo | Biscoe | 43.8 | 13.9 | 208.0 | 4300.0 | Female | False |
186 | Chinstrap | Dream | 49.7 | 18.6 | 195.0 | 3600.0 | Male | True |
198 | Chinstrap | Dream | 50.1 | 17.9 | 190.0 | 3400.0 | Female | True |
293 | Gentoo | Biscoe | 46.5 | 14.8 | 217.0 | 5200.0 | Female | False |
183 | Chinstrap | Dream | 54.2 | 20.8 | 201.0 | 4300.0 | Male | True |
98 | Adelie | Dream | 33.1 | 16.1 | 178.0 | 2900.0 | Female | False |
193 | Chinstrap | Dream | 46.2 | 17.5 | 187.0 | 3650.0 | Female | True |
95 | Adelie | Dream | 40.8 | 18.9 | 208.0 | 4300.0 | Male | False |
195 | Chinstrap | Dream | 45.5 | 17.0 | 196.0 | 3500.0 | Female | True |
We now do the fit step, after splitting the data into training and test sets. Notice how we are using two input features and we are using our new Boolean “isChinstrap” column for the target.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[cols], df['isChinstrap'], test_size=0.3, random_state=42)
clf.fit(X_train, y_train)
LogisticRegression()
Here we can compare the actual values of “isChinstrap” to the values of “pred”.
df["pred"] = clf.predict(df[cols])
df.sample(10, random_state = 1)
 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | isChinstrap | pred |
---|---|---|---|---|---|---|---|---|---|
65 | Adelie | Biscoe | 41.6 | 18.0 | 192.0 | 3950.0 | Male | False | False |
276 | Gentoo | Biscoe | 43.8 | 13.9 | 208.0 | 4300.0 | Female | False | False |
186 | Chinstrap | Dream | 49.7 | 18.6 | 195.0 | 3600.0 | Male | True | True |
198 | Chinstrap | Dream | 50.1 | 17.9 | 190.0 | 3400.0 | Female | True | True |
293 | Gentoo | Biscoe | 46.5 | 14.8 | 217.0 | 5200.0 | Female | False | False |
183 | Chinstrap | Dream | 54.2 | 20.8 | 201.0 | 4300.0 | Male | True | True |
98 | Adelie | Dream | 33.1 | 16.1 | 178.0 | 2900.0 | Female | False | False |
193 | Chinstrap | Dream | 46.2 | 17.5 | 187.0 | 3650.0 | Female | True | True |
95 | Adelie | Dream | 40.8 | 18.9 | 208.0 | 4300.0 | Male | False | False |
195 | Chinstrap | Dream | 45.5 | 17.0 | 196.0 | 3500.0 | Female | True | False |
A simple way to find the proportion of correct predictions is to compare the predictions with the true values and call the mean method of the resulting Boolean Series. (Remember that True is treated as 1 and False is treated as 0.)
To evaluate the accuracy of a logistic regression model (or any classification model), you typically use the test dataset, not the training dataset. The reason for this is that the accuracy on the training set can be misleading; it may be high due to overfitting, which means the model has learned the training data too well, including its noise and outliers, and may not generalize well to unseen data.
y_test_pred = clf.predict(X_test)
(y_test_pred == y_test).mean()
0.98
There is also a way to use clf directly to find the proportion of correct predictions, by calling its score method. We pass as inputs to score the input features along with the desired outputs. Notice that we get the same number as above.
clf.score(X_test, y_test)
0.98
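Returning to the earlier point about evaluating on the test set: a quick check one could run (its output is not shown in these notes) is to compare the training accuracy with the test accuracy; a large gap would suggest overfitting.
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy:", clf.score(X_test, y_test))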
X_test.index
Int64Index([ 30, 317, 79, 201, 63, 304, 289, 186, 217, 203, 81, 14, 328,
132, 272, 138, 120, 152, 82, 282, 115, 143, 323, 205, 6, 116,
268, 332, 169, 331, 174, 309, 69, 90, 285, 296, 188, 260, 52,
124, 38, 48, 178, 187, 119, 78, 281, 51, 107, 320, 100, 235,
226, 21, 114, 255, 12, 61, 153, 185, 305, 325, 288, 4, 83,
319, 66, 230, 84, 303, 22, 29, 257, 334, 244, 183, 88, 117,
149, 247, 122, 146, 182, 96, 321, 265, 36, 308, 191, 173, 240,
341, 151, 65, 125, 20, 7, 215, 99, 35],
dtype='int64')
y_test_pred != y_test
30 False
317 False
79 False
201 False
63 False
...
20 False
7 False
215 False
99 False
35 False
Name: isChinstrap, Length: 100, dtype: bool
Find the indices (row labels) where predictions and actual values don’t match:
misclass_index = X_test[y_test_pred != y_test].index
misclass_index
Int64Index([122, 182], dtype='int64')
df.loc[misclass_index]
 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | isChinstrap | pred |
---|---|---|---|---|---|---|---|---|---|
122 | Adelie | Torgersen | 40.2 | 17.0 | 176.0 | 3450.0 | Female | False | True |
182 | Chinstrap | Dream | 40.9 | 16.6 | 187.0 | 3200.0 | Female | True | False |
There are 2 rows above. If we want to know what proportion of predictions are incorrect, we can divide by the number of test samples. One minus that proportion should then equal the score value we computed above.
1 - 2/len(X_test)
0.98
df.loc[121:123]
 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | isChinstrap | pred |
---|---|---|---|---|---|---|---|---|---|
121 | Adelie | Torgersen | 37.7 | 19.8 | 198.0 | 3500.0 | Male | False | False |
122 | Adelie | Torgersen | 40.2 | 17.0 | 176.0 | 3450.0 | Female | False | True |
123 | Adelie | Torgersen | 41.4 | 18.5 | 202.0 | 3875.0 | Male | False | False |
A great feature of logistic regression is that it indicates not just a predicted output, but also a confidence. We can get these confidences by using the predict_proba method of the fit LogisticRegression object.
clf.predict_proba(df.loc[121:123, cols])
array([[9.99885723e-01, 1.14276848e-04],
[2.85110398e-01, 7.14889602e-01],
[9.98695439e-01, 1.30456058e-03]])
The column at index 0 corresponds to False predictions and the column at index 1 corresponds to True predictions. I don’t think there is any way to know that in advance, but we can check it using the classes_ attribute.
clf.classes_
array([False, True])
The whole point of logistic regression is to find parameters for estimating probabilities. That is also the difficult part (it happens when we call fit).
In our specific case, we have two input variables, so we need to find two coefficients and one intercept (or bias), so three parameters total.
Here are the coefficients. They are in the same order as the variables in cols.
clf.coef_
array([[-0.33801769, 1.0238539 ]])
cols
['flipper_length_mm', 'bill_length_mm']
clf.intercept_
array([19.25143827])
Let \(x_1\) denote the flipper length and let \(x_2\) denote the bill length. We are plugging approximately \(19.25 - 0.34 x_1 + 1.02 x_2\) into the logistic function \(\sigma(z) = 1/(1 + e^{-z})\). The result will be a probability estimate for the penguin being a Chinstrap penguin.
Let’s see this formula in action for the row with label 122, where we know our classifier makes a mistake.
df.loc[122, cols]
flipper_length_mm 176.0
bill_length_mm 40.2
Name: 122, dtype: object
When we plug in the values above, we get the following. Here we use np.exp(x) to compute \(e^x\).
1/(1+np.exp(-(clf.intercept_[0] + clf.coef_[0][0]*176 + clf.coef_[0][1] *40.2)))
0.7148896017002825
This result is the exact predicted probability.
clf.predict_proba(df.loc[[122], cols])
array([[0.2851104, 0.7148896]])
Interpretation question:
As flipper length increases, is the penguin more or less likely to be a Chinstrap penguin, according to our model? What about bill length?
If you look at the overall formula, this can get pretty confusing. It is better to notice that the logistic function \(\sigma(z) = 1/(1+e^{-z})\) is an increasing function, and our formula is \(\sigma(L(x))\) for some linear term \(L(x)\). The interpretation question is much easier when we focus on the linear part: all we need to do is look at the signs of the coefficients.
Answer: as flipper length increases, the probability of being a Chinstrap decreases, because its coefficient (about -0.34) is negative. As bill length increases, the probability increases, because its coefficient (about 1.02) is positive.
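To see this numerically, one could evaluate the fitted model on a few made-up penguins (the measurement values below are hypothetical, chosen to vary only the flipper length):
# same bill length, increasing flipper length: the predicted P(isChinstrap) should decrease down the rows
demo = pd.DataFrame({"flipper_length_mm": [180, 190, 200], "bill_length_mm": [45, 45, 45]})
clf.predict_proba(demo)[:, 1]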