Week 9 Wednesday#
Unsupervised Learning: Clustering#
In addition to dimension reduction, another typical task of unsupervised machine learning is clustering: partitioning the data samples into several groups (called clusters) based on their similarity – similar samples should be in the same cluster, and dissimilar samples should be in different clusters.
Caution: Don’t confuse clustering (unsupervised) with classification (supervised)!
K-Means Clustering#
Mathematical Description: Given a set of observations \((x^{(1)}, x^{(2)}, ..., x^{(n)})\), where each observation is a \(p\)-dimensional real vector, k-means clustering aims to partition the \(n\) samples into \(K (\leq n)\) sets \(S = \{S_1, S_2, ..., S_K\}\) so as to minimize the within-cluster sum of squares (i.e. variance). Formally, the objective is to find the partition \(S\) that minimizes the “loss function”

\[
\underset{S}{\operatorname{arg\,min}} \; \sum_{i=1}^{K} \sum_{x \in S_i} \left\| x - \mu_{i} \right\|^{2},
\]

where \(\mu_{i}\) is the mean of the points in \(S_i\).
How to solve it: Finding the exact solution to the k-means problem is NP-hard. In practice, the common approach is to apply Lloyd’s algorithm, which finds a heuristic solution (a local minimum).
The algorithm is iterative, and each iteration consists of two steps:

(assignment step) Given the cluster centers, assign each sample to its nearest center.

(update step) Given the cluster assignments, update each center to the mean of the samples in its cluster.

The algorithm has converged when the assignments no longer change.

Because the algorithm may get stuck in a local minimum, the practical strategy is to run it several times from different random initializations and keep the partition with the smallest “loss function”.
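To make the two steps concrete, here is a minimal NumPy sketch of Lloyd’s algorithm (an illustrative implementation, not the one scikit-learn uses; the function name lloyd_kmeans and the simple convergence check are our own choices):

import numpy as np

def lloyd_kmeans(X, K, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: X is an (n, p) NumPy array; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialize the centers with K distinct samples chosen at random.
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step: send each sample to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # converged: the assignments no longer change
        labels = new_labels
        # Update step: move each center to the mean of its assigned samples.
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers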
Below we will apply the functions in scikit-learn. Note that to visualize the results of k-means on high-dimensional data, it is often combined with dimension reduction. Another practical strategy is to use dimension reduction to pre-process the data and then apply clustering to the reduced dataset.
It is also worth noting that, in practice, determining the true number of clusters (\(K\)) is a very hard problem.
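One common heuristic is the “elbow method”: run k-means for several candidate values of \(K\) and watch how quickly the within-cluster sum of squares (exposed as inertia_ in scikit-learn) decreases. A minimal sketch, assuming the features have already been collected in a NumPy array or DataFrame X:

from sklearn.cluster import KMeans

# Within-cluster sum of squares (inertia) for K = 1, ..., 10;
# a "knee" in this curve is a rough guide for choosing the number of clusters.
inertias = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in range(1, 11)]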
K-means clustering using scikit-learn#
Using KMeans from scikit-learn, cluster the iris data using the columns ‘sepal_length’, ‘sepal_width’, ‘petal_length’, ‘petal_width’.
import seaborn as sns
import pandas as pd
import altair as alt
# Load the Iris dataset using Seaborn
df = sns.load_dataset("iris")
df.head(5)
|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
df.columns[0:4]
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], dtype='object')
cols = list(df.columns[0:4])
cols
['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
Divide the data into three clusters.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=0)
y_km = kmeans.fit_predict(df[cols])
y_km
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,
2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2,
2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0], dtype=int32)
df['kmeans_cluster'] = y_km
df.head(5)
|   | sepal_length | sepal_width | petal_length | petal_width | species | kmeans_cluster |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 1 |
Visualize the clustering result on the first two principal components (PCs).
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(df[cols])
X_pca.shape
(150, 2)
df[['PC1','PC2']] = X_pca
df.head(5)
|   | sepal_length | sepal_width | petal_length | petal_width | species | kmeans_cluster | PC1 | PC2 |
|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 | -2.684126 | 0.319397 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 | -2.714142 | -0.177001 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 | -2.888991 | -0.144949 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1 | -2.745343 | -0.318299 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 1 | -2.728717 | 0.326755 |
c1 = alt.Chart(df).mark_circle(size = 60).encode(
x = 'PC1',
y = 'PC2',
color = alt.Color('kmeans_cluster:N', scale = alt.Scale(scheme='set1')),
tooltip = ['kmeans_cluster:N']
).properties(
title = 'K-means clustering'
)
c1
c2 = alt.Chart(df).mark_circle(size = 60).encode(
x = 'PC1',
y = 'PC2',
color = alt.Color('species:N', scale = alt.Scale(scheme='set1')),
tooltip = ['species:N']
).properties(
title = 'Species'
)
c2
alt.hconcat(c1, c2).properties(
title = 'PCA: k-means clustering vs species label'
)
To quantitatively measure the performance, it is not a good idea to naively compute “accuracy” as in classification, because permuting the cluster label values does not change the clustering itself but can change the “accuracy” drastically. There are several measures designed to account for this effect.
Adjusted Rand Index (ARI): The Adjusted Rand Index measures the similarity between two label assignments, ignoring permutations and with chance normalization. It is a common way to compare a clustering result with the true labels.
from sklearn import metrics
metrics.adjusted_rand_score(df['species'],df['kmeans_cluster'])
0.7302382722834697
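As a quick illustration of the permutation issue mentioned above (a hypothetical relabeling, not part of the original analysis), swapping two of the cluster ids leaves the ARI unchanged, whereas a naive “accuracy” against the species labels would change completely:

# Swap cluster ids 0 and 2; the partition itself is unchanged.
relabeled = df['kmeans_cluster'].map({0: 2, 1: 1, 2: 0})
# ARI is invariant to this relabeling, so it returns the same value as above.
metrics.adjusted_rand_score(df['species'], relabeled)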
Visualization with t-SNE
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0, n_jobs=-1)
X_tsne = tsne.fit_transform(df[cols])
df[['tsne1','tsne2']] = X_tsne
df.head(5)
|   | sepal_length | sepal_width | petal_length | petal_width | species | kmeans_cluster | PC1 | PC2 | tsne1 | tsne2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 | -2.684126 | 0.319397 | 8.037639 | 26.252501 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 | -2.714142 | -0.177001 | 8.577435 | 23.864986 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 | -2.888991 | -0.144949 | 7.902760 | 23.480476 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1 | -2.745343 | -0.318299 | 7.691000 | 23.256388 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 1 | -2.728717 | 0.326755 | 7.589659 | 26.284225 |
c1 = alt.Chart(df).mark_circle(size = 60).encode(
x = 'tsne1',
y = 'tsne2',
color = alt.Color('kmeans_cluster:N', scale = alt.Scale(scheme='set1')),
tooltip = ['kmeans_cluster:N']
).properties(
title = 'K-means clustering'
)
c2 = alt.Chart(df).mark_circle(size = 60).encode(
x = 'tsne1',
y = 'tsne2',
color = alt.Color('species:N', scale = alt.Scale(scheme='set1')),
tooltip = ['species:N']
).properties(
title = 'Species'
)
alt.hconcat(c1, c2).properties(
title = 'tsne: k-means clustering vs species label'
)
How about clustering on the t-SNE embedding instead of the original features?
y_km_tsne = kmeans.fit_predict(X_tsne)
metrics.adjusted_rand_score(df['species'], y_km_tsne)
0.7591987071071522
How about trying other clustering methods?
Gaussian Mixture Model (GMM)#
GMMs provide a flexible and probabilistic approach to clustering, making them suitable for complex datasets where the assumption of spherical clusters (as in k-means) is too restrictive.
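As a starting point, here is a minimal sketch using scikit-learn’s GaussianMixture on the same four iris features (the choice of 3 components and the random seed are our own assumptions):

from sklearn.mixture import GaussianMixture

# Fit a 3-component Gaussian mixture and take the most likely component
# for each sample as its cluster label.
gmm = GaussianMixture(n_components=3, random_state=0)
y_gmm = gmm.fit_predict(df[cols])
# Compare to the species labels with the Adjusted Rand Index.
metrics.adjusted_rand_score(df['species'], y_gmm)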
Explore yourself and you may upload your results to Kaggle!