Week 9 Wednesday#

Unsupervised Learning: Clustering#

In addition to dimension reduction, another typical task of unsupervised machine learning is clustering: assigning the data samples to several groups (called clusters) based on their similarity – similar samples should be in the same cluster, and dissimilar samples should be in different clusters.

Caution: Don’t confuse clustering (unsupervised) with classification (supervised)!

K-Means Clustering#

Mathematical Description: Given a set of observations \((x^{(1)}, x^{(2)}, \ldots, x^{(n)})\), where each observation is a \(p\)-dimensional real vector, k-means clustering aims to partition the \(n\) samples into \(K\ (\leq n)\) sets \(S = \{S_1, S_2, \ldots, S_K\}\) so as to minimize the within-cluster sum of squares (i.e. variance). Formally, the objective is to find the partition \(S\) that minimizes the “loss function”

\[\min_{S}\sum_{i=1}^{K}\sum_{x\in S_{i}}\|x-\mu_{i}\|^{2}\]

where \(\mu_{i}\) is the mean of points in \(S_i\).

How to solve it: Finding the exact solution to the k-means problem is NP-hard. In practice, the common approach is to apply Lloyd’s algorithm, which finds a heuristic solution (a local minimum).

In the iterative algorithm, each iteration contains two steps:

  • (assignment step) Given the cluster centers, assign each data point to its nearest cluster center.

  • (update step) Given the cluster assignments, update each cluster center to the mean of the points assigned to it.

The algorithm has converged when no further adjustments can be made, i.e. when the assignments no longer change.

Because the algorithm may get stuck in a local minimum, the practical strategy is to run it multiple times (possibly in parallel) from different random initializations and keep the partition with the smallest “loss function”.
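For concreteness, here is a minimal NumPy sketch of a single run of Lloyd’s algorithm (the function name lloyd_kmeans and its arguments are illustrative, not part of scikit-learn):

import numpy as np

def lloyd_kmeans(X, K, n_iter=100, seed=0):
    """One run of Lloyd's algorithm on an (n, p) NumPy array X."""
    rng = np.random.default_rng(seed)
    # Initialize the centers with K randomly chosen data points
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: send each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        new_centers = centers.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:  # keep the old center if a cluster goes empty
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):  # converged: no further adjustment
            break
        centers = new_centers
    loss = ((X - centers[labels]) ** 2).sum()  # within-cluster sum of squares
    return labels, centers, loss

Running this from several random seeds and keeping the partition with the smallest loss is essentially what scikit-learn’s KMeans does for you via its n_init argument.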

Below we will apply the functions in scikit-learn. Note that to visualize the results of k-means on high-dimensional data, it is often combined with dimension reduction. Another practical strategy is to use dimension reduction to pre-process the data and then apply clustering to the reduced dataset.
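For instance, a minimal sketch of the “reduce first, then cluster” strategy using a scikit-learn pipeline (the choice of 2 components and 3 clusters is purely illustrative, and X stands for any feature matrix):

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Project onto 2 principal components first, then run k-means in the reduced space
reduce_then_cluster = make_pipeline(PCA(n_components=2), KMeans(n_clusters=3, random_state=0))
# labels = reduce_then_cluster.fit_predict(X)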

It is also worth noting that, in practice, determining the true number of clusters (\(K\)) is a very hard problem.
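One rough heuristic (an aside, not a solution) is the “elbow method”: fit k-means for several values of \(K\) and look for a bend in the loss. A minimal sketch, assuming X holds the feature matrix:

from sklearn.cluster import KMeans

# The fitted attribute inertia_ is the within-cluster sum of squares (the "loss function" above);
# a pronounced bend ("elbow") as K grows is a rough hint for the number of clusters.
inertias = {k: KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in range(1, 9)}
print(inertias)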

K-means clustering using scikit-learn#

Using KMeans from scikit-learn, cluster the iris data using the columns ‘sepal_length’, ‘sepal_width’, ‘petal_length’, ‘petal_width’.

import seaborn as sns
import pandas as pd
import altair as alt
# Load the Iris dataset using Seaborn
df = sns.load_dataset("iris")
df.head(5)
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
df.columns[0:4]
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], dtype='object')
cols = list(df.columns[0:4])
cols
['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
  • Divide the data into three clusters.

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=0)
y_km = kmeans.fit_predict(df[cols])
y_km
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,
       2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2,
       2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0], dtype=int32)
df['kmeans_cluster'] = y_km
df.head(5)
sepal_length sepal_width petal_length petal_width species kmeans_cluster
0 5.1 3.5 1.4 0.2 setosa 1
1 4.9 3.0 1.4 0.2 setosa 1
2 4.7 3.2 1.3 0.2 setosa 1
3 4.6 3.1 1.5 0.2 setosa 1
4 5.0 3.6 1.4 0.2 setosa 1
  • Visualize the clustering result on the principal components (PCs)

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(df[cols])
X_pca.shape
(150, 2)
df[['PC1','PC2']] = X_pca
df.head(5)
sepal_length sepal_width petal_length petal_width species kmeans_cluster PC1 PC2
0 5.1 3.5 1.4 0.2 setosa 1 -2.684126 0.319397
1 4.9 3.0 1.4 0.2 setosa 1 -2.714142 -0.177001
2 4.7 3.2 1.3 0.2 setosa 1 -2.888991 -0.144949
3 4.6 3.1 1.5 0.2 setosa 1 -2.745343 -0.318299
4 5.0 3.6 1.4 0.2 setosa 1 -2.728717 0.326755
c1 = alt.Chart(df).mark_circle(size = 60).encode(
    x = 'PC1',
    y = 'PC2',
    color = alt.Color('kmeans_cluster:N', scale = alt.Scale(scheme='set1')),
    tooltip = ['kmeans_cluster:N']
).properties(
    title = 'K-means clustering'
)
c1
c2 = alt.Chart(df).mark_circle(size = 60).encode(
    x = 'PC1',
    y = 'PC2',
    color = alt.Color('species:N', scale = alt.Scale(scheme='set1')),
    tooltip = ['species:N']
).properties(
    title = 'Species'
)
c2
alt.hconcat(c1, c2).properties(
    title = 'PCA: k-means clustering vs species label'
)

To quantitatively measure the performance, it is not a good idea to naively compute “accuracy” as in the classification case, because permuting the cluster label values does not affect the clustering result, while it severely affects the “accuracy”. There are many good measures that take this permutation invariance into account.

Adjusted Rand Index (ARI): The Adjusted Rand Index is a function that measures the similarity of two assignments, ignoring permutations and with chance normalization. It is a common way to compare a clustering result with the true labels.

from sklearn import metrics
metrics.adjusted_rand_score(df['species'],df['kmeans_cluster'])
0.7302382722834697
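To see the permutation invariance concretely, here is a small illustrative check: swapping the cluster labels destroys a naive “accuracy” but leaves the ARI unchanged.

import numpy as np
from sklearn import metrics

true_labels = np.array([0, 0, 0, 1, 1, 1])
pred_a = np.array([0, 0, 0, 1, 1, 1])   # perfect partition, labels happen to match
pred_b = np.array([1, 1, 1, 0, 0, 0])   # the same partition with labels swapped

print((true_labels == pred_a).mean(), (true_labels == pred_b).mean())  # naive "accuracy": 1.0 vs 0.0
print(metrics.adjusted_rand_score(true_labels, pred_a),
      metrics.adjusted_rand_score(true_labels, pred_b))                # ARI: 1.0 for both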
  • Visualization on t-SNE

from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0, n_jobs=-1)
X_tsne = tsne.fit_transform(df[cols])

df[['tsne1','tsne2']] = X_tsne
df.head(5)
sepal_length sepal_width petal_length petal_width species kmeans_cluster PC1 PC2 tsne1 tsne2
0 5.1 3.5 1.4 0.2 setosa 1 -2.684126 0.319397 8.037639 26.252501
1 4.9 3.0 1.4 0.2 setosa 1 -2.714142 -0.177001 8.577435 23.864986
2 4.7 3.2 1.3 0.2 setosa 1 -2.888991 -0.144949 7.902760 23.480476
3 4.6 3.1 1.5 0.2 setosa 1 -2.745343 -0.318299 7.691000 23.256388
4 5.0 3.6 1.4 0.2 setosa 1 -2.728717 0.326755 7.589659 26.284225
c1 = alt.Chart(df).mark_circle(size = 60).encode(
    x = 'tsne1',
    y = 'tsne2',
    color = alt.Color('kmeans_cluster:N', scale = alt.Scale(scheme='set1')),
    tooltip = ['kmeans_cluster:N']
).properties(
    title = 'K-means clustering'
)
c2 = alt.Chart(df).mark_circle(size = 60).encode(
    x = 'tsne1',
    y = 'tsne2',
    color = alt.Color('species:N', scale = alt.Scale(scheme='set1')),
    tooltip = ['species:N']
).properties(
    title = 'Species'
)

alt.hconcat(c1, c2).properties(
    title = 't-SNE: k-means clustering vs species label'
)
  • How about clustering on the t-SNE coordinates instead of the original features?

y_km_tsne = kmeans.fit_predict(X_tsne)
metrics.adjusted_rand_score(df['species'], y_km_tsne)
0.7591987071071522
  • How about trying other clustering methods?

Gaussian Mixture Model (GMM)#

GMMs provide a flexible and probabilistic approach to clustering, making them suitable for complex datasets where the assumption of spherical clusters (as in k-means) is too restrictive.
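As a minimal starting point (a sketch only; df[cols] and metrics are as defined above, and the resulting ARI will depend on the fit):

from sklearn.mixture import GaussianMixture

# Fit a 3-component Gaussian mixture on the four iris features and compare with the species labels
gmm = GaussianMixture(n_components=3, random_state=0)
y_gmm = gmm.fit_predict(df[cols])
metrics.adjusted_rand_score(df['species'], y_gmm)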

Explore yourself and you may upload your results to Kaggle!
