Week 8 Monday#
Rough plan for this week’s classes#
Today: Decision trees
Wednesday: Random forests
Decision Tree#
How do human beings make classifications? Instead of using mathematical equations, we usually make a series of “decisions” based on important features drawn from our past experience.
Intuition: by repeatedly setting thresholds for different features (multiple if-else conditions, forming a flow chart or decision tree structure), we can naturally carry out the classification task.
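Here is a minimal sketch of that idea in plain Python. The function name and the threshold values are made up for illustration only; below we will let scikit-learn find good thresholds from the data.
def classify_iris(petal_length, petal_width):
    # Hypothetical thresholds, chosen by eye rather than learned from data.
    if petal_length < 2.5:
        return "setosa"
    elif petal_width < 1.7:
        return "versicolor"
    else:
        return "virginica"
classify_iris(4.5, 1.3)  # returns 'versicolor'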
For the basic requirements in this course, we only ask you to call the library functions and understand how to interpret the results.
Loading the iris dataset#
The iris dataset (also available from Seaborn) is smaller than most of the datasets we work with in Math 10 (it only contains 150 rows/observations/data points). But it is one of the most classic datasets in Machine Learning, so we should see it at some point.
import pandas as pd
import seaborn as sns
import altair as alt
df = sns.load_dataset("iris")
This dataset is fairly similar to the penguins dataset, but with fewer rows and columns. The penguins dataset has some missing data, but the iris dataset does not have any missing values. Overall, the iris dataset is a little easier to analyze.
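Since that claim is easy to check directly, here is a quick sketch counting the missing values in each column (every count should be zero).
# Count missing values in each column; every count should be 0.
df.isna().sum()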
df
|     | sepal_length | sepal_width | petal_length | petal_width | species   |
|-----|--------------|-------------|--------------|-------------|-----------|
| 0   | 5.1          | 3.5         | 1.4          | 0.2         | setosa    |
| 1   | 4.9          | 3.0         | 1.4          | 0.2         | setosa    |
| 2   | 4.7          | 3.2         | 1.3          | 0.2         | setosa    |
| 3   | 4.6          | 3.1         | 1.5          | 0.2         | setosa    |
| 4   | 5.0          | 3.6         | 1.4          | 0.2         | setosa    |
| ... | ...          | ...         | ...          | ...         | ...       |
| 145 | 6.7          | 3.0         | 5.2          | 2.3         | virginica |
| 146 | 6.3          | 2.5         | 5.0          | 1.9         | virginica |
| 147 | 6.5          | 3.0         | 5.2          | 2.0         | virginica |
| 148 | 6.2          | 3.4         | 5.4          | 2.3         | virginica |
| 149 | 5.9          | 3.0         | 5.1          | 1.8         | virginica |
150 rows × 5 columns
Take a quick look at the data.
alt.Chart(df).mark_circle().encode(
    x = "petal_length",
    y = "petal_width",
    color = "species:N",
    tooltip = ["petal_length", "petal_width"]
)
Visualizing how a decision tree splits the data#
Our goal is to divide the iris data by species.
First we will divide by petal length.
Then we will divide by petal width.
Where would you make these divisions?
Guesses: petal length 2.5 and petal width 1.65 or 1.75.
(Be sure to look over the above chart and make sure you agree with these predictions.)
How many “leaves” will the corresponding decision tree have? (How many regions or cells does our domain get divided into?)
Three regions
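As a quick sanity check on these guesses (a sketch, not something required), we can look at the range of each petal measurement within each species.
# Range of each petal measurement within each species.
df.groupby("species")[["petal_length", "petal_width"]].agg(["min", "max"])
In the output, the setosa petal lengths should stop well below 2.5, which is why that first guessed split separates setosa so cleanly.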
A decision tree with two splits#
Create an instance of a DecisionTreeClassifier with max_leaf_nodes as above. Specify random_state=2 (which will help me know that the first split happens on “petal_length”…).
Fit the classifier to the training data using cols = ["petal_length", "petal_width"] for the input features and "species" for the target.
from sklearn.tree import DecisionTreeClassifier
Notice that we are putting a (heavy) constraint on the decision tree, by only allowing it to have 3 regions (3 leaf nodes). This will help reduce the potential for overfitting.
(Reminder: on Thursday last week, we saw two different pieces of evidence for overfitting. One was just intuition: we had a very random-looking collection of data divided into four classes, and we achieved ~93% accuracy. The other evidence was more precise: on the test set, which was not used when fitting the classifier, the accuracy dropped to ~29%. In general, if you don’t put a constraint on a decision tree, it is likely to overfit the data.)
We fit the classifier to the data, just like usual, by calling the fit method.
cols = ["petal_length", "petal_width"]
clf = DecisionTreeClassifier(max_leaf_nodes=3, random_state=2)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[cols], df["species"], test_size=0.2, random_state=1)
clf.fit(X_train, y_train)
DecisionTreeClassifier(max_leaf_nodes=3, random_state=2)
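As a quick check that random_state=2 really does put the first split on “petal_length”, here is a sketch using the fitted tree’s tree_ attribute (not something you need to memorize).
# The feature used at the root node (node 0) of the fitted tree.
clf.feature_names_in_[clf.tree_.feature[0]]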
Illustrate the resulting tree using the following.
Does it match what we expected from the Altair chart?
Don’t bother trying to memorize this syntax (I don’t have it memorized), but it’s very important to be able to interpret the decision tree plot once you have it displayed.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
fig = plt.figure()
plot_tree(clf,
          feature_names=clf.feature_names_in_,
          class_names=clf.classes_,
          filled=True)
[Text(0.4, 0.8333333333333334, 'petal_length <= 2.6\ngini = 0.665\nsamples = 120\nvalue = [39, 37, 44]\nclass = virginica'),
Text(0.2, 0.5, 'gini = 0.0\nsamples = 39\nvalue = [39, 0, 0]\nclass = setosa'),
Text(0.6, 0.5, 'petal_width <= 1.65\ngini = 0.496\nsamples = 81\nvalue = [0, 37, 44]\nclass = virginica'),
Text(0.4, 0.16666666666666666, 'gini = 0.18\nsamples = 40\nvalue = [0, 36, 4]\nclass = versicolor'),
Text(0.8, 0.16666666666666666, 'gini = 0.048\nsamples = 41\nvalue = [0, 1, 40]\nclass = virginica')]
The nodes show splits based on features, and the branches represent the decision paths. The coloring (when filled=True) helps in understanding the class distribution at each node.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
fig = plt.figure()
_ = plot_tree(clf,
              feature_names=clf.feature_names_in_,
              class_names=clf.classes_,
              filled=True)
# Assigning the result to the underscore (_) discards the unwanted return value.
The Gini score gives you an idea of how good a split is by measuring how mixed the class labels are at that node. A lower Gini impurity is generally better.
Samples represent the number of samples (or instances) that reached the node.
Value shows the class distribution of the samples in that node.
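As a sanity check on the Gini number shown at the root node, here is a sketch recomputing it from the class counts value = [39, 37, 44], assuming the usual definition \(1 - \sum_k p_k^2\).
import numpy as np
# Class proportions at the root node (120 training samples in total).
p = np.array([39, 37, 44]) / 120
1 - (p**2).sum()  # approximately 0.665, matching the gini shown at the root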
clf.score(X_test, y_test)
0.9666666666666667
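To connect this score back to the overfitting discussion above, here is a sketch (not part of the required workflow; the name clf_big is just for this illustration) comparing our constrained tree to an unconstrained one on the same train/test split. On a dataset as small and clean as iris the gap may be minor, but the comparison is a good habit.
# An unconstrained tree: no limit on the number of leaf nodes.
clf_big = DecisionTreeClassifier(random_state=2)
clf_big.fit(X_train, y_train)
# (train accuracy, test accuracy) for each classifier.
print("constrained:  ", clf.score(X_train, y_train), clf.score(X_test, y_test))
print("unconstrained:", clf_big.score(X_train, y_train), clf_big.score(X_test, y_test))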
What is the depth of the corresponding tree? We can answer by looking at the diagram, or by using the get_depth method.
Assuming we count the top row as row 0, then we see that the bottom row is row 2. That is where the depth of 2 comes from.
clf.get_depth()
2
What is the theoretical maximum number of leaf nodes for a Decision Tree with depth \(m\)?
In the above picture, the only leaf we don’t split further is the one at the left, because it already contains only irises from the setosa species; but theoretically we could split such a node in two, and then we would have four nodes on the bottom row. In general, if the depth is \(m\), then the maximum number of leaf nodes is \(2^m\).
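We can check both quantities on our fitted tree with a quick sketch: the get_n_leaves method reports the actual number of leaves, and 2 raised to get_depth gives the theoretical maximum.
# Actual number of leaves vs. the theoretical maximum 2**depth.
clf.get_n_leaves(), 2**clf.get_depth()
For our tree this is 3 leaves out of a possible \(2^2 = 4\).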
Predictions and predicted probabilities#
What species will be predicted for an iris with the following (physically impossible) values?
{"petal_length": 4, "petal_width": 3}
Here is one way to make a one-row DataFrame. We pass to the constructor a length-one list with a dictionary in it.
pd.DataFrame([ {"petal_length": 4, "petal_width": 3} ])
|   | petal_length | petal_width |
|---|--------------|-------------|
| 0 | 4            | 3           |
Here is another way to make a one-row DataFrame: we pass a dictionary to the constructor, whose values are length-one lists. Notice that this creates the same DataFrame as in the previous cell.
df_fake = pd.DataFrame({"petal_length": [4], "petal_width": [3]})
df_fake
|   | petal_length | petal_width |
|---|--------------|-------------|
| 0 | 4            | 3           |
Here is a common warning if we input something without column names.
clf.predict([[4, 3]])
/shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but DecisionTreeClassifier was fitted with feature names
warnings.warn(
array(['virginica'], dtype=object)
The warning goes away if we input a DataFrame with appropriate column names (meaning the same column names that were used when we called fit). Notice how we get the same prediction as in the previous cell.
clf.predict(df_fake)
array(['virginica'], dtype=object)
What are the corresponding predicted probabilities?
We can find this probability by calling the predict_proba
method. There are three numbers, corresponding to the probabilities of the three classes.
clf.predict_proba(df_fake)
array([[0. , 0.02439024, 0.97560976]])
Here are the corresponding classes. Notice how the highest probability corresponds to virginica, which was the prediction above.
clf.classes_
array(['setosa', 'versicolor', 'virginica'], dtype=object)
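One way to make the pairing explicit (a small sketch, nothing new conceptually) is to label the probability array with the class names.
# Label the predicted probabilities with the corresponding class names.
pd.Series(clf.predict_proba(df_fake)[0], index=clf.classes_)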
Could you have predicted those probabilities from the visualization above?
Important: where does this probability come from? If you look at the tree diagram above, you see that we start at the top and move right (because \(4 > 2.6\)), and then move right again (because \(3 > 1.65\)). We wind up in a leaf with value = [0, 1, 40], which stands for 0 setosa, 1 versicolor, and 40 virginica.
The predicted probability of virginica then corresponds to 40/41 (and the probability of versicolor is 1/41).
40/41
0.975609756097561
The decision boundary for our decision tree#
Illustrate the decision boundary for our decision tree classifier using 5000 random values for the petal length and petal width.
Here we instantiate a random number generator. This will only be used to generate 5000 random points for our chart, so there is no real advantage to using a seed value. (We aren’t using these values to produce any precise numbers.)
import numpy as np
rng = np.random.default_rng()
# We want 5000 points, each with two features.
arr = rng.random(size = (5000,2))
We convert this NumPy array to a DataFrame and name the columns. (If we didn’t name the columns, then when we call predict, we would get the same warning as we saw above.)
df2 = pd.DataFrame(arr, columns=cols)
df2.head(5)
|   | petal_length | petal_width |
|---|--------------|-------------|
| 0 | 0.108106     | 0.154910    |
| 1 | 0.306966     | 0.329880    |
| 2 | 0.170410     | 0.020553    |
| 3 | 0.966794     | 0.379437    |
| 4 | 0.278400     | 0.103418    |
I want these columns to have the same kind of range as in the original dataset. They currently go from 0 to 1. We multiply the “petal_length” column by 7. (Be careful not to execute this cell more than once.) Similarly we multiply the petal width by 2.5.
df2["petal_length"] *= 7
df2["petal_width"] *= 2.5
# We now add a prediction column.
# (This is the step which would raise a warning if we didn't have correct column names.)
df2['pred'] = clf.predict(df2[cols])
df2.head(5)
|   | petal_length | petal_width | pred       |
|---|--------------|-------------|------------|
| 0 | 0.756739     | 0.387274    | setosa     |
| 1 | 2.148765     | 0.824700    | setosa     |
| 2 | 1.192868     | 0.051384    | setosa     |
| 3 | 6.767558     | 0.948593    | versicolor |
| 4 | 1.948797     | 0.258545    | setosa     |
Now we make the chart. This shows the decision tree with three leaf nodes (three regions). The decision boundaries are straight lines, as for logistic regression, and moreover these straight lines are always parallel to the coordinate axes. So it might sound like decision trees are less flexible than logistic regression classifiers. In fact, though, decision trees are generally more prone to overfitting than logistic regression classifiers, because as the number of leaf nodes increases, the regions can become more and more complex and specialized.
It’s worth staring at the following image, together with the illustrated decision tree above (the one produced by plot_tree
), and seeing how they compare. For example, do you see why the boundary line at petal width 1.65 only goes to the right? (Why doesn’t it continue all the way to the left of petal length 2.6, in terms of the above visualization of the decision tree?)
alt.Chart(df2).mark_circle().encode(
    x = "petal_length",
    y = "petal_width",
    color = "pred"
)
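For comparison, here is a sketch of the same kind of chart for an unconstrained tree (the names clf_deep and pred_deep are just for this illustration). Its regions are typically more fragmented than the three clean regions above.
# Fit an unconstrained tree and plot its predictions on the same random points.
clf_deep = DecisionTreeClassifier(random_state=2)
clf_deep.fit(X_train, y_train)
df2["pred_deep"] = clf_deep.predict(df2[cols])
alt.Chart(df2).mark_circle().encode(
    x = "petal_length",
    y = "petal_width",
    color = "pred_deep"
)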
Bias-variance tradeoff in decision trees#
The tree depth is an important parameter for determining the final classification performance; it can be understood as controlling the complexity of the model.
Deep tree: more complexity, low bias, high variance.
Shallow tree: less complexity, high bias, low variance.
Typically, an unconstrained decision tree grows to a large depth, which makes it prone to overfitting (meaning the model is too complicated for the data, the variance is too large, and the error on the test dataset is therefore large).
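As a rough illustration of this tradeoff, here is a sketch that watches the train and test accuracy as we allow the tree more leaf nodes. (The exact numbers depend on the train/test split, and on a dataset as small and clean as iris the test accuracy may not drop much; on noisier data the gap between train and test accuracy typically widens as the tree grows.)
# Train and test accuracy as the allowed number of leaf nodes grows.
for n in [2, 3, 4, 8, 16, 32]:
    tree = DecisionTreeClassifier(max_leaf_nodes=n, random_state=2)
    tree.fit(X_train, y_train)
    print(n, tree.score(X_train, y_train), tree.score(X_test, y_test))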