Prediction of the class of the wine#

Author: Yiding Huang

Course Project, UC Irvine, Math 10, F23

Introduction#

This project mainly explores the performance of different models on the wine dataset from sklearn. It also explores the pattern of the data using PCA and tSNE. The goal is to use the first 5 features in the wine dataset to predict the class of each sample.

Feature Engineering#

First, I am going to load the dataset from sklearn.datasets.

from sklearn.datasets import load_wine
import pandas as pd
wine_dataset = load_wine()
wine_dataset
{'data': array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
         1.065e+03],
        [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
         1.050e+03],
        [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
         1.185e+03],
        ...,
        [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
         8.350e+02],
        [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
         8.400e+02],
        [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
         5.600e+02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2]),
 'frame': None,
 'target_names': array(['class_0', 'class_1', 'class_2'], dtype='<U7'),
 'DESCR': '.. _wine_dataset:\n\nWine recognition dataset\n------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 178\n    :Number of Attributes: 13 numeric, predictive attributes and the class\n    :Attribute Information:\n \t\t- Alcohol\n \t\t- Malic acid\n \t\t- Ash\n\t\t- Alcalinity of ash  \n \t\t- Magnesium\n\t\t- Total phenols\n \t\t- Flavanoids\n \t\t- Nonflavanoid phenols\n \t\t- Proanthocyanins\n\t\t- Color intensity\n \t\t- Hue\n \t\t- OD280/OD315 of diluted wines\n \t\t- Proline\n\n    - class:\n            - class_0\n            - class_1\n            - class_2\n\t\t\n    :Summary Statistics:\n    \n    ============================= ==== ===== ======= =====\n                                   Min   Max   Mean     SD\n    ============================= ==== ===== ======= =====\n    Alcohol:                      11.0  14.8    13.0   0.8\n    Malic Acid:                   0.74  5.80    2.34  1.12\n    Ash:                          1.36  3.23    2.36  0.27\n    Alcalinity of Ash:            10.6  30.0    19.5   3.3\n    Magnesium:                    70.0 162.0    99.7  14.3\n    Total Phenols:                0.98  3.88    2.29  0.63\n    Flavanoids:                   0.34  5.08    2.03  1.00\n    Nonflavanoid Phenols:         0.13  0.66    0.36  0.12\n    Proanthocyanins:              0.41  3.58    1.59  0.57\n    Colour Intensity:              1.3  13.0     5.1   2.3\n    Hue:                          0.48  1.71    0.96  0.23\n    OD280/OD315 of diluted wines: 1.27  4.00    2.61  0.71\n    Proline:                       278  1680     746   315\n    ============================= ==== ===== ======= =====\n\n    :Missing Attribute Values: None\n    :Class Distribution: class_0 (59), class_1 (71), class_2 (48)\n    :Creator: R.A. Fisher\n    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n    :Date: July, 1988\n\nThis is a copy of UCI ML Wine recognition datasets.\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data\n\nThe data is the results of a chemical analysis of wines grown in the same\nregion in Italy by three different cultivators. There are thirteen different\nmeasurements taken for different constituents found in the three types of\nwine.\n\nOriginal Owners: \n\nForina, M. et al, PARVUS - \nAn Extendible Package for Data Exploration, Classification and Correlation. \nInstitute of Pharmaceutical and Food Analysis and Technologies,\nVia Brigata Salerno, 16147 Genoa, Italy.\n\nCitation:\n\nLichman, M. (2013). UCI Machine Learning Repository\n[https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,\nSchool of Information and Computer Science. \n\n.. topic:: References\n\n  (1) S. Aeberhard, D. Coomans and O. de Vel, \n  Comparison of Classifiers in High Dimensional Settings, \n  Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of  \n  Mathematics and Statistics, James Cook University of North Queensland. \n  (Also submitted to Technometrics). \n\n  The data was used with many others for comparing various \n  classifiers. The classes are separable, though only RDA \n  has achieved 100% correct classification. \n  (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data)) \n  (All results using the leave-one-out technique) \n\n  (2) S. Aeberhard, D. Coomans and O. de Vel, \n  "THE CLASSIFICATION PERFORMANCE OF RDA" \n  Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of \n  Mathematics and Statistics, James Cook University of North Queensland. 
\n  (Also submitted to Journal of Chemometrics).\n',
 'feature_names': ['alcohol',
  'malic_acid',
  'ash',
  'alcalinity_of_ash',
  'magnesium',
  'total_phenols',
  'flavanoids',
  'nonflavanoid_phenols',
  'proanthocyanins',
  'color_intensity',
  'hue',
  'od280/od315_of_diluted_wines',
  'proline']}

We can see that the dataset in this form is hard to read. Let's transform the data into a pandas DataFrame. To access the data and the name of each column, we can use the data and feature_names attributes. Since the type of wine_dataset is outside the scope of Math 10, I found these two attributes with the help of ChatGPT.

type(wine_dataset)
sklearn.utils._bunch.Bunch
df = pd.DataFrame(data=wine_dataset.data, columns=wine_dataset.feature_names)
df
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0
3 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480.0
4 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
173 13.71 5.65 2.45 20.5 95.0 1.68 0.61 0.52 1.06 7.70 0.64 1.74 740.0
174 13.40 3.91 2.48 23.0 102.0 1.80 0.75 0.43 1.41 7.30 0.70 1.56 750.0
175 13.27 4.28 2.26 20.0 120.0 1.59 0.69 0.43 1.35 10.20 0.59 1.56 835.0
176 13.17 2.59 2.37 20.0 120.0 1.65 0.68 0.53 1.46 9.30 0.60 1.62 840.0
177 14.13 4.10 2.74 24.5 96.0 2.05 0.76 0.56 1.35 9.20 0.61 1.60 560.0

178 rows × 13 columns
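As a side note, newer versions of scikit-learn (0.23 and later, which I am assuming here) can skip the manual conversion: load_wine(as_frame=True) returns the same data with a ready-made DataFrame in its frame attribute.

from sklearn.datasets import load_wine

# Alternative loading sketch: ask sklearn to build the DataFrame for us
wine_frame = load_wine(as_frame=True)
df_alt = wine_frame.frame   # 13 feature columns plus a "target" column
df_alt.shape                # (178, 14)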

There are 13 features and 178 samples. However, the class of the wine for each sample is not shown in the DataFrame. Let's add a new column named class to the DataFrame.

df['class'] = wine_dataset.target
df['class']
0      0
1      0
2      0
3      0
4      0
      ..
173    2
174    2
175    2
176    2
177    2
Name: class, Length: 178, dtype: int64

There are three classes in this dataset: 0, 1, and 2. Let's check the data type of each column and make sure there are no missing values in any row.

df.dropna(axis=0, inplace=True)
df.shape
(178, 14)
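Since the shape is still (178, 14), dropna removed nothing, which confirms there were no missing rows. A more direct check (a small sketch using standard pandas) is to count the missing values in each column:

# Count missing values per column; every count should be 0 for this dataset
df.isna().sum()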
df.dtypes
alcohol                         float64
malic_acid                      float64
ash                             float64
alcalinity_of_ash               float64
magnesium                       float64
total_phenols                   float64
flavanoids                      float64
nonflavanoid_phenols            float64
proanthocyanins                 float64
color_intensity                 float64
hue                             float64
od280/od315_of_diluted_wines    float64
proline                         float64
class                             int64
dtype: object

We will only be exploring the first 5 of the 13 features, so let's save these column names here to avoid redundant code.

columns = list(df.columns[0:5])
columns
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium']

Logistic Regression#

Even though the class column has type int, this is actually a classification problem. Therefore, let's apply Logistic Regression to predict the class of each sample based on the 5 features.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

We use train_test_split to get a train set and a test set for our model, so later on we can use the test set to check the accuracy of the model. I set max_iter to 5000 to avoid the maximum-iterations convergence warning.

X = df[columns]
y = df["class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
reg= LogisticRegression(max_iter=5000)
reg.fit(X_train, y_train)
LogisticRegression(max_iter=5000)

Then we check the accuracy of the model.

reg.score(X_train, y_train)
0.8661971830985915
reg.score(X_test, y_test)
0.8888888888888888
reg.coef_
array([[ 1.54585303e+00, -1.06491444e-01,  2.06049833e+00,
        -4.56947436e-01,  2.10866426e-02],
       [-1.83081135e+00, -4.85981649e-01, -1.67076083e+00,
         2.66545555e-01, -2.00940638e-02],
       [ 2.84958317e-01,  5.92473093e-01, -3.89737501e-01,
         1.90401881e-01, -9.92578768e-04]])
reg.classes_
array([0, 1, 2])
reg.intercept_
array([-18.35259791,  25.87706018,  -7.52446227])
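The raw coefficients are hard to read on their own. As an illustrative check that is not part of the original notebook, predict_proba shows the probability the model assigns to each class (columns ordered as in reg.classes_) for a few test samples:

import numpy as np

# Class probabilities for the first five test samples, rounded for readability
np.round(reg.predict_proba(X_test[:5]), 3)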

Let's visualize the data for the first two columns, coloring the points once by the true class and once by the predicted class.

import altair as alt
df["predict"] = reg.predict(df[columns])
c= alt.Chart(df).mark_circle().encode(
    x = "alcohol",
    y = "malic_acid",
    color = "class:N"
)

c2= alt.Chart(df).mark_circle().encode(
    x = "alcohol",
    y = "malic_acid",
    color = "predict:N"
)
c|c2

The right-hand chart shows the predicted class. We can see that some points of class 0 around (13, 4) are predicted to be class 2, and comparing the two charts suggests why: in these two features those points overlap with the class 2 region.
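To see exactly which samples are misclassified, we can filter the rows where the prediction disagrees with the true class (a small sketch that reuses the predict column created above):

# Rows where the logistic regression prediction disagrees with the true class
mislabeled = df[df["class"] != df["predict"]]
mislabeled[["alcohol", "malic_acid", "class", "predict"]]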

Conclusion for the Logistic Regression model#

The accuracy on both the train set and the test set is pretty high, so it is plausible that there is an overfitting problem. However, considering the model uses 5 features to predict only 3 classes, and the test set is pretty small (178 × 0.2 = 35.6, which rounds to 36 samples), the high accuracy is understandable. Checking the score on the test set also helps us inspect overfitting: for this model, the test score is consistent with the train score. We should check the model on a larger test set to see its real performance.
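Since the test set only contains 36 samples, one hedged way to get a more stable estimate (not done in the original analysis) is k-fold cross-validation, which averages the score over several different splits:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation with the same five features and the same model settings
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
scores.mean(), scores.std()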

Random Forest#

Using a Random Forest model is a good way to avoid overfitting problems, so let's check its performance on this dataset. Considering it is a pretty small dataset, I am going to set the max depth to 5 and the max number of leaf nodes to 5. Let's also use the first five columns instead of all 13 to predict the classes, and see how the accuracy differs from the Logistic Regression model.

from sklearn.ensemble import RandomForestClassifier
columns = list(df.columns[0:5])
X = df[columns]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rfc = RandomForestClassifier(n_estimators=100, max_depth=5, max_leaf_nodes=5 ,random_state=42)
rfc.fit(X_train, y_train)
RandomForestClassifier(max_depth=5, max_leaf_nodes=5, random_state=42)
rfc.score(X_test, y_test), rfc.score(X_train, y_train)
(0.9166666666666666, 0.8802816901408451)

We can also access one particular Decision Tree.

clf = rfc.estimators_[0]
clf
DecisionTreeClassifier(max_depth=5, max_features='sqrt', max_leaf_nodes=5,
                       random_state=1608637542)
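Beyond inspecting a single tree, the fitted forest also reports how much each of the five features contributed to its splits. This is a quick sketch using scikit-learn's built-in feature_importances_ attribute, which was not part of the original notebook:

import pandas as pd

# Impurity-based importance of each feature; the values sum to 1
pd.Series(rfc.feature_importances_, index=columns).sort_values(ascending=False)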

Conclusion#

Generally, the Random Forest is also a pretty good model for predicting the class of the wine. Since the test score is pretty good, and the max depth and max number of leaf nodes are limited, there should be no overfitting.

Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (tSNE)#

We want to visualize the relationship between the features and the class. Because there are 13 features, which is too many dimensions to plot directly, we can first reduce the data using PCA and then apply tSNE. This idea is from the Week 9 Monday notes: https://deepnote.com/workspace/yidingh-e4771a0c-f6a2-4e56-a1c1-1e3b722443e1/project/W9Monday-Duplicate-5d113670-1a44-40d4-b14b-25fc975f1627/notebook/82aac0e734df489fa063c3b95c94cda1

Now we first use PCA to process the data.

from sklearn.decomposition import PCA
import altair as alt
columns= list(df.columns[0:13])
X=df[columns]
y = df['class']
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
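Before plotting, it is worth checking how much of the total variance the two components keep. This is a brief sketch using the standard explained_variance_ratio_ attribute; it was not part of the original notebook:

# Fraction of the total variance captured by each of the two principal components
pca.explained_variance_ratio_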

Then we plot the transformed data in an Altair chart.

pca_df = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2'])
pca_df['class'] = y
alt.Chart(pca_df).mark_point().encode(
    x = 'PC1',
    y = 'PC2',
    color = alt.Color('class:N'),
    tooltip = ['class']
)

Then we apply tSNE.

from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state= 42, n_jobs=-1)
X_tsne = tsne.fit_transform(X_pca)
/shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:800: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
  warnings.warn(
/shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:810: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
  warnings.warn(
df_tsne=pd.DataFrame(X_tsne, columns=['tsne1','tsne2'])
df_tsne["Class"] = y
alt.Chart(df_tsne).mark_circle().encode(
    x = 'tsne1',
    y = 'tsne2',
    color = 'Class:N',
    tooltip = ['Class:N']
)

Conclusion#

From the PCA chart we can see that class 2 is easier to distinguish from the other classes, while class 0 and class 1 are more similar to each other. The shape of the tSNE chart was a little unexpected, but we can still observe that class 0 is more clearly separated from classes 1 and 2, and classes 1 and 2 are blended together. This result is also consistent with the chart in the Logistic Regression part.

K-Nearest Neighbors Classifier#

I found this model in the extra topics listed in the project instructions, and I followed the professor's notes at https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html, but I use a classifier for this dataset instead of a regressor.

import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

We try to use score to check the accuracy of the model, but referencing it without calling it doesn't give us a number, only the bound method. So we use the accuracy_score function to inspect the accuracy instead.

X=df[columns]
y=df["class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf2 = KNeighborsClassifier(n_neighbors=10)
clf2.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=10)
clf2.score 
<bound method ClassifierMixin.score of KNeighborsClassifier(n_neighbors=10)>
y_pred_test = clf2.predict(X_test)
accuracy_score(y_test, y_pred_test) 
0.7222222222222222
y_pred_train = clf2.predict(X_train)
accuracy_score(y_train, y_pred_train) 
0.7676056338028169
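For reference, the score method does return a number once it is actually called with parentheses and the data; it computes the same mean accuracy as accuracy_score:

# Equivalent to accuracy_score(y_test, clf2.predict(X_test))
clf2.score(X_test, y_test)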

The accuracy is consistent between the train and test sets, so there should be no overfitting problem.

classification_report(y_test, y_pred_test)
'              precision    recall  f1-score   support\n\n           0       0.82      1.00      0.90        14\n           1       0.82      0.64      0.72        14\n           2       0.38      0.38      0.38         8\n\n    accuracy                           0.72        36\n   macro avg       0.67      0.67      0.67        36\nweighted avg       0.72      0.72      0.71        36\n'
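The report comes back as one long string, so wrapping the call in print displays it as a readable table (a formatting note only; the numbers are the same):

# Print the precision/recall/f1 table with proper line breaks
print(classification_report(y_test, y_pred_test))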

Conclusion#

The K-Nearest Neighbors Classifier surprisingly has a low accuracy compared to the models used previously (Logistic Regression and Random Forest).
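One plausible explanation, which is my own assumption and was not tested in this project, is that KNN relies on raw distances, so features with large numeric ranges such as proline and magnesium can dominate the neighbor search. A sketch of the standard remedy would be to standardize the features inside a pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical variant: scale each feature to zero mean and unit variance before KNN
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
knn_scaled.fit(X_train, y_train)
knn_scaled.score(X_test, y_test)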

Summary#

In this project, I mainly explore the performance of three models on the wine dataset for predicting the class of each sample. Overall, Logistic Regression and Random Forest have a pretty high accuracy on the dataset (both around 88 to 92 percent on the test set), while the accuracy of the K-Nearest Neighbors Classifier is relatively low. We can also see that class 0 is more distinguishable than the other two classes.

  • What is the source of your dataset(s)?

sklearn.datasets (load_wine)

  • List any other references that you found helpful.

  1. https://deepnote.com/workspace/yidingh-e4771a0c-f6a2-4e56-a1c1-1e3b722443e1/project/W9Monday-Duplicate-5d113670-1a44-40d4-b14b-25fc975f1627/notebook/82aac0e734df489fa063c3b95c94cda1

  2. https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html

  3. https://christopherdavisuci.github.io/UCI-Math-10-S23/Proj/StudentProjects/SethAbuhamdeh.html#logistic-regression