Email Segmentation Forecasting and Bank Subscriber Forecasting#

Author: Shengqi You

Course Project, UC Irvine, Math 10, F23

Introduction#

Our dataset comes from the UCI’s machine learning related website and is a dataset on email text and email categorization. In this project I need to read the text and extract features from the text, which may use some deep learning stuff. At the end we still use machine learning models. I will compare some machine learning models to see how accurate their predictions are.

But after that , I have re-selected a dataset of bank users, in which personal information about the users is recorded and I will analyze and make predictions about whether or not they are loanable users, which will be more relevant to the content of this course than the previous project.

Bank Subscriber Forecasting#

Adding Necessary Packages#

import numpy as np 
import pandas as pd # data processing
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import random
import os
import altair as alt
from sklearn.model_selection import train_test_split

Generate Training and Test Data#

df = pd.read_csv("bank-full.csv")

#Creating User Columns
df_user = pd.DataFrame(np.arange(0,len(df)), columns=['user'])
df = pd.concat([df_user, df], axis=1)
df.head(20)
user age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 1 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 4 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no
5 5 35 management married tertiary no 231 yes no unknown 5 may 139 1 -1 0 unknown no
6 6 28 management single tertiary no 447 yes yes unknown 5 may 217 1 -1 0 unknown no
7 7 42 entrepreneur divorced tertiary yes 2 yes no unknown 5 may 380 1 -1 0 unknown no
8 8 58 retired married primary no 121 yes no unknown 5 may 50 1 -1 0 unknown no
9 9 43 technician single secondary no 593 yes no unknown 5 may 55 1 -1 0 unknown no
10 10 41 admin. divorced secondary no 270 yes no unknown 5 may 222 1 -1 0 unknown no
11 11 29 admin. single secondary no 390 yes no unknown 5 may 137 1 -1 0 unknown no
12 12 53 technician married secondary no 6 yes no unknown 5 may 517 1 -1 0 unknown no
13 13 58 technician married unknown no 71 yes no unknown 5 may 71 1 -1 0 unknown no
14 14 57 services married secondary no 162 yes no unknown 5 may 174 1 -1 0 unknown no
15 15 51 retired married primary no 229 yes no unknown 5 may 353 1 -1 0 unknown no
16 16 45 admin. single unknown no 13 yes no unknown 5 may 98 1 -1 0 unknown no
17 17 57 blue-collar married primary no 52 yes no unknown 5 may 38 1 -1 0 unknown no
18 18 60 retired married primary no 60 yes no unknown 5 may 219 1 -1 0 unknown no
19 19 33 services married secondary no 0 yes no unknown 5 may 54 1 -1 0 unknown no
df.columns.values
array(['user', 'age', 'job', 'marital', 'education', 'default', 'balance',
       'housing', 'loan', 'contact', 'day', 'month', 'duration',
       'campaign', 'pdays', 'previous', 'poutcome', 'y'], dtype=object)

Users have been tagged good or bad.(Not yet defined in the original data)

df.groupby('y').mean()
user age balance day duration campaign pdays previous
y
no 21197.503081 40.838986 1303.714969 15.892290 221.182806 2.846350 36.421372 0.502154
yes 33228.953867 41.670070 1804.267915 15.158253 537.294574 2.141047 68.702968 1.170354

Feature Engineering#

#Define X and y
X = df.drop(['y','user','job','marital', 'education', 'contact', 
             'housing', 'loan', 'day', 'month', 'poutcome' ], axis=1)

y = df['y']

X = pd.get_dummies(X)
y = pd.get_dummies(y)

X.columns
X = X.drop(['default_no'], axis= 1)
X = X.rename(columns = {'default_yes': 'default'})
y.columns
y = y.drop(['yes'], axis=1)
y = y.rename(columns= {'no': 'y'})

Dummy Variable Trap can influence negatively in our analyses. Dummy Variable trap is a scenario in which the independent variables are multicollinear - a scenario in which two or more variables are highly correlated; in simple terms, one variable can be predicted from the others. The best definition for Dummy Variable

Visualising Data#

hist_subscribed = alt.Chart(df.head(5000)).mark_bar(color='green').encode(
    alt.X('age', bin=alt.Bin(step=10), title='Age'),
    alt.Y('count()', title='Count'),
    alt.Tooltip(['age', 'count()'])
).properties(
    title="Age Distribution "
)

hist_subscribed

# Only 5000 examples can be shown
scatter_plot = alt.Chart(df.head(10)).mark_point().encode(
    x=alt.X('age', title='Age'),
    y=alt.Y('duration:Q', title='Duration'),
    
    tooltip=['age', 'duration']
).properties(
    title='Scatter Plot of Age vs Duration ',
    width=600,
    height=400
)

scatter_plot.display()

Show more relationships between different data

# Assuming df is your existing DataFrame
variables = ['age', 'balance', 'duration']

# Create a base chart
base = alt.Chart(df.head(5000)).mark_point().encode(
    color='y:N'
)

# Create a repeated chart for pair plot
pairplot = base.encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative')
).properties(
    width=200,
    height=200
).repeat(
    row=variables,
    column=variables
).resolve_scale(color='independent')

# Display the chart
pairplot.display()

Splitting the Dataset#

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

Balancing the Trainng Set#

In machine learning tasks, we often encounter this nuisance: the data imbalance problem.

The data imbalance problem exists mainly in supervised machine learning tasks. When encountering unbalanced data, traditional classification algorithms, which have overall classification accuracy as their learning goal, focus too much on the majority class, which results in a degradation of the classification performance of the minority class samples. The vast majority of common machine learning algorithms do not work well with unbalanced datasets.

y_train['y'].value_counts()
1    31937
0     4231
Name: y, dtype: int64
pos_index = y_train[y_train.values == 1].index
neg_index = y_train[y_train.values == 0].index

if len(pos_index) > len(neg_index):
    higher = pos_index
    lower = neg_index
else:
    higher = neg_index
    lower = pos_index

random.seed(0)
higher = np.random.choice(higher, size=len(lower))
lower = np.asarray(lower)
new_indexes = np.concatenate((lower, higher))

X_train = X_train.loc[new_indexes]
y_train = y_train.loc[new_indexes]

Determine the Majority and Minority Class:

The if-else block compares the lengths of pos_index and neg_index to determine which class has more instances (i.e., the majority class). higher is set to the index list of the majority class, and lower is set to the index list of the minority class. Random Sampling from the Majority Class:

np.random.choice(higher, size=len(lower)) is used to randomly sample instances from the majority class so that its size matches the minority class. This is done to balance the classes. random.seed(0) ensures that the random sampling is reproducible; it will produce the same results each time the code is run. Create a Balanced Dataset:

lower is converted to a NumPy array to ensure compatibility for concatenation. new_indexes is created by concatenating the down-sampled majority class indexes (higher) with the minority class indexes (lower). This creates a new index array that represents a balanced dataset.

y_train['y'].value_counts()
0    4231
1    4231
Name: y, dtype: int64

Feature Scaling#

StandardScaler is a class from the scikit-learn library used to standardize features by removing the mean and scaling to unit variance. Standardizing features is an important step, especially for algorithms that are sensitive to the scale of input features, like SVMs, k-nearest neighbors, and principal component analysis.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train2 = pd.DataFrame(sc.fit_transform(X_train))
X_test2 = pd.DataFrame(sc.transform(X_test))
X_train2.columns = X_train.columns.values
X_test2.columns = X_test.columns.values
X_train2.index = X_train.index.values
X_test2.index = X_test.index.values
X_train = X_train2
X_test = X_test2

LogisticRegression#

We’re very familiar with this piece.

## Logistic Regression
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

# Predicting Test Set
y_pred = classifier.predict(X_test)


from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

results = pd.DataFrame([['Logistic Regression ', acc, prec, rec, f1]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

/shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:1111: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
classifier.coef_
array([[-0.09572692, -0.21519623, -1.62131306,  0.32479807, -0.19420282,
        -0.34889164,  0.09061903]])

ages and balance may be much more important.

classifier.classes_
array([0, 1], dtype=uint8)
classifier.intercept_
array([-0.18344598])

KNN#

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=15, metric='minkowski', p= 2)
classifier.fit(X_train, y_train)

# Predicting Test Set
y_pred = classifier.predict(X_test)

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

model_results = pd.DataFrame([['K-Nearest Neighbors ', acc, prec, rec, f1]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

results = results.append(model_results, ignore_index = True)
/shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/neighbors/_classification.py:207: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  return self._fit(X, y)

Decision Tree#


from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

#Predicting the best set result
y_pred = classifier.predict(X_test)

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

model_results = pd.DataFrame([['Decision Tree', acc, prec, rec, f1]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

results = results.append(model_results, ignore_index = True)

result#

results
Model Accuracy Precision Recall F1 Score
0 Logistic Regression 0.797412 0.951431 0.812023 0.876216
1 K-Nearest Neighbors 0.779056 0.963893 0.778961 0.861615
2 Decision Tree 0.716908 0.950956 0.716343 0.817143

As you can see from the graph, Logistic Regression and KNN will perform better in terms of performance. But the fact that all three models did not perform as well as I expected is unfortunate.

I finally wanted to visualize the decision boundaries of our different machine learning models. After processing the training data multiple times, it was really hard for me to merge it into the initial table and use the Altair package, so I processed the data again here and then used the matplotlib package

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Selecting two features for the visualization
# For demonstration, let's select 'age' and 'balance' as two features. 

selected_features = ['age', 'balance']

# Preprocessing
df['y'] = df['y'].map({'yes': 1, 'no': 0})
X = df[selected_features]
y = df['y']

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# Standardizing the features
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Training models
log_reg = LogisticRegression(random_state=0)
knn = KNeighborsClassifier(n_neighbors=15, metric='minkowski', p=2)
dtree = DecisionTreeClassifier(criterion='entropy', random_state=0)

# Fitting the models
log_reg.fit(X_train, y_train)
knn.fit(X_train, y_train)
dtree.fit(X_train, y_train)
DecisionTreeClassifier(criterion='entropy', random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

We have mapped the decision boundary here for specific reference:https://blog.csdn.net/weixin_45891612/article/details/128858765?ops_request_misc=&request_id=&biz_id=102&utm_term=decision_boundary作图方法&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduweb~default-0-128858765.142^v96^control&spm=1018.2226.3001.4187

# Function to plot decision boundaries
def plot_decision_boundary(X, y, classifier, title):
    X_set, y_set = X, y
    X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                         np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
    plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
                 alpha=0.75, cmap=ListedColormap(('red', 'blue')))
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                    c=ListedColormap(('red', 'blue'))(i), label=j)
    plt.title(title)
    plt.xlabel('Age (Standardized)')
    plt.ylabel('Balance (Standardized)')
    plt.legend()
    plt.show()

# Plotting decision boundaries for each model
plot_decision_boundary(X_train, y_train, log_reg, 'Logistic Regression')
*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*.  Please use the *color* keyword-argument or provide a 2D array with a single row if you intend to specify the same RGB or RGBA value for all points.
*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*.  Please use the *color* keyword-argument or provide a 2D array with a single row if you intend to specify the same RGB or RGBA value for all points.
../../_images/ad8f91287befdc59885c854d02aafa8fb98fb65c9d3fe4984ff3c70a7f7781e9.png
  • The KNN boundary can not be display , I thought it’s very weird.

plot_decision_boundary(X_train, y_train, dtree, 'Decision Tree')
*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*.  Please use the *color* keyword-argument or provide a 2D array with a single row if you intend to specify the same RGB or RGBA value for all points.
*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*.  Please use the *color* keyword-argument or provide a 2D array with a single row if you intend to specify the same RGB or RGBA value for all points.
../../_images/c0089952db89b1f62d558309d5f253b7ba01f0e0713115a2ebcd455fcbdc2644.png

As we can see, the decision boundaries for the each graph are very different, only Logistic Regression has clear decision boundaries while the DecisionTreeClassifier has very mixed boundaries. While this doesn’t completely show how each model makes predictions, it does offer some very interesting insight into how each of these models work and the differences between them.

Mail Classification#

Regression#

Adding Necessary Packages#

!pip install wordcloud==1.9.3
Requirement already satisfied: wordcloud==1.9.3 in /root/venv/lib/python3.9/site-packages (1.9.3)
Requirement already satisfied: matplotlib in /shared-libs/python3.9/py/lib/python3.9/site-packages (from wordcloud==1.9.3) (3.6.0)
Requirement already satisfied: pillow in /shared-libs/python3.9/py/lib/python3.9/site-packages (from wordcloud==1.9.3) (9.2.0)
Requirement already satisfied: numpy>=1.6.1 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from wordcloud==1.9.3) (1.23.4)
Requirement already satisfied: packaging>=20.0 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from matplotlib->wordcloud==1.9.3) (21.3)
Requirement already satisfied: fonttools>=4.22.0 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from matplotlib->wordcloud==1.9.3) (4.37.4)
Requirement already satisfied: pyparsing>=2.2.1 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from matplotlib->wordcloud==1.9.3) (3.0.9)
Requirement already satisfied: kiwisolver>=1.0.1 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from matplotlib->wordcloud==1.9.3) (1.4.4)
Requirement already satisfied: cycler>=0.10 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from matplotlib->wordcloud==1.9.3) (0.11.0)
Requirement already satisfied: contourpy>=1.0.1 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from matplotlib->wordcloud==1.9.3) (1.0.5)
Requirement already satisfied: python-dateutil>=2.7 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from matplotlib->wordcloud==1.9.3) (2.8.2)
Requirement already satisfied: six>=1.5 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from python-dateutil>=2.7->matplotlib->wordcloud==1.9.3) (1.16.0)

[notice] A new release of pip is available: 23.0.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
!pip install wget
import tensorflow as tf
import os, sys, shutil, tempfile, wget
from zipfile import ZipFile
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import pandas as pd
import nltk
nltk.download('stopwords')
from wordcloud import WordCloud
from nltk.corpus import stopwords
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import matplotlib.pyplot as plt
import altair as alt
Requirement already satisfied: wget in /root/venv/lib/python3.9/site-packages (3.2)

[notice] A new release of pip is available: 23.0.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
2023-12-14 08:02:23.659091: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-14 08:02:23.865421: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-12-14 08:02:23.865455: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-12-14 08:02:23.918215: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-14 08:02:26.571139: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-14 08:02:26.571231: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-14 08:02:26.571241: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Generate Training and Test Data#

# Load the dataset
df = pd.read_csv('SMSSpamCollection', sep='\t', names=["Label", "SMS"])

print(df.shape)
df1 = df.head(5000)


# Convert labels to binary values , Feature engineering
df['Label'] = df['Label'].map({'ham': 0, 'spam': 1})

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df['SMS'], df['Label'], test_size=0.2, random_state=42)

import altair as alt

# Visualize the distribution of labels in the dataset using Altair
label_distribution_chart = alt.Chart(df1).mark_bar().encode(
    alt.X('Label:N'),
    alt.Y('count():Q'),
    color='Label:N',
    tooltip=['Label:N', 'count()']
).properties(
    title='Distribution of Labels in the Dataset'
)

label_distribution_chart
(5572, 2)

The data has only two columns, so we have to figure out how to extract features from the text data.

label_distribution_chart1 = alt.Chart(df1).mark_point().encode(
    alt.X('Label:N'),
    alt.Y('count():Q'),
    color='Label:N',
    tooltip=['Label:N', 'count()']
).properties(
    title='Distribution of Labels in the Dataset'
)
label_distribution_chart1

Vectorize the SMS data#

Here we introduce the feature extraction tool,we use TF-idf an optimized word frequency statistical vectorization method, is a way to turn text into numerical data, direct use on the line. Take out X and y, X word vectorization processing.

vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'))
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)

X_train_vect.shape
(4457, 7567)

As you can see, X, which was originally a column of text, has now become a feature variable, and they are all numeric, which can be used directly for modeling calculations.

Initialize Logistic Regression Model#

Since the predictions are for string type data, we should start with a LogisticRegression as our model for our machine.

model = LogisticRegression()

Train the Model#

model.fit(X_train_vect, y_train)
LogisticRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Validate the Model#

# Predict on test data
y_pred = model.predict(X_test_vect)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print('')

print(f'Accuracy: {accuracy}')

import altair as alt

# Sample a subset of the data to avoid MaxRowsError
df_sample = df.sample(1115, random_state=42)

# Dot plot: Length of SMS messages for ham and spam
df_sample['SMS_length'] = df_sample['SMS'].apply(len)
df_sample["pred"] = y_pred

dot_plot = alt.Chart(df_sample).mark_point(size=60).encode(
    x=alt.X('pred', title='Message Type'),
    y=alt.Y('SMS_length:Q', title='SMS Length'),
    color=alt.Color('pred:N', scale=alt.Scale(range=['#1f77b4', '#ff7f0e']), legend=None),
    tooltip=['pred:N', 'count()']
).properties(
    title='Dot Plot of SMS Length for Ham and Spam Messages',
    width=300
)
dot_plot
Accuracy: 0.9721973094170404
y_pred
array([0, 0, 0, ..., 0, 0, 0])
model.intercept_
array([-2.47425518])
model.coef_
array([[ 0.65752426,  1.29532624, -0.01725343, ..., -0.01139043,
        -0.01464767,  0.13616753]])

Word cloud: Most frequent words in SMS messages#

The wordcloud library is an excellent third-party library for displaying word clouds, which can turn a piece of text into a word cloud (a word cloud uses words as the basic unit for a more visual and artistic presentation of the text).I think it’s very interesting.

text = ' '.join(df['SMS'])
wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=ENGLISH_STOP_WORDS).generate(text)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud - Most Frequent Words in SMS Messages')

# Display charts
plt.show()
../../_images/c755f803c325cb80c4510e9a5342ccd8cbdc380c8719c849d1e60e40de4b81ab.png

Decison Tree Classifier#

Next I’ll use a DecisionTreeClassifier since this machine learning model is best model to analyze this type of data that we have used in class.

from sklearn.tree import DecisionTreeClassifier

clf1 = DecisionTreeClassifier(max_leaf_nodes= 25)
clf1.fit(X_train_vect, y_train)
DecisionTreeClassifier(max_leaf_nodes=25)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
y_pred2 = clf1.predict(X_test_vect)
accuracy2 = accuracy_score(y_test, y_pred2)
print('')
print('')

print(f'Accuracy: {accuracy2}')
Accuracy: 0.9542600896860987

Swap the positions of the training set with the test set to see if there is any overfitting.

clf2 = DecisionTreeClassifier(max_leaf_nodes= 25)
clf2.fit(X_test_vect, y_test)
y_pred3 = clf2.predict(X_train_vect)
accuracy3 = accuracy_score(y_train, y_pred3)
print('')
print('')

print(f'Accuracy: {accuracy3}')
Accuracy: 0.951312542068656

The model behaves roughly the same after the swap and there should be no overfitting.

Multi-level Classification#

Again,adding Necessary Packages#

import os, sys, wget
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

Generate Training and Test Data#

# Load the data into a DataFrame
df = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'message'])

# Encode the labels
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

# Features and labels
X = df['message']
y = df['label']

Feature Engineering#

Tokenizer is one of the core components of NLP pipeline. The goal is to convert text into data that the model can process. That is, convert text input to digital input

# Tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
X_seq = tokenizer.texts_to_sequences(X)

Build a model#

Define the neural network architecture: The following lines are defining a Sequential model. Sequential is a Keras model that represents a linear stack of layers. You can create a Sequential model by passing a list of layer instances to the constructor, or by using the .add() method to add layers one by one.

# Pad sequences
X_pad = pad_sequences(X_seq, maxlen=50)

# Define the neural network architecture
model = Sequential() #This line initializes the Sequential model.

#This line adds the first layer to the model, which is a Dense (fully connected) layer with 512 neurons.
model.add(Dense(512, input_shape=(50,), activation='relu'))

#This line adds a second Dense layer with 256 neurons, also with ReLU activation.
model.add(Dense(256, activation='relu'))

#This line adds a third Dense layer with a single neuron and uses the sigmoid activation function. 
model.add(Dense(1, activation='sigmoid'))
2023-12-14 08:02:34.290748: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2023-12-14 08:02:34.290782: W tensorflow/stream_executor/cuda/cuda_driver.cc:263] failed call to cuInit: UNKNOWN ERROR (303)
2023-12-14 08:02:34.290799: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (p-2ad65188-06c4-4ed7-a144-8a7c420abd2f): /proc/driver/nvidia/version does not exist
2023-12-14 08:02:34.291052: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

Compile the model#

This line of code is configuring the model for training by defining the loss function, the optimizer, and the evaluation metrics.

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Data allocation#

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_pad, y, test_size=0.2, random_state=0)

Train the model#

model.fit(X_train, y_train, epochs=10, batch_size=64, validation_data=(X_test, y_test))
Epoch 1/10
70/70 [==============================] - 1s 9ms/step - loss: 40.7405 - accuracy: 0.8088 - val_loss: 20.3003 - val_accuracy: 0.6933
Epoch 2/10
70/70 [==============================] - 1s 7ms/step - loss: 8.4930 - accuracy: 0.8616 - val_loss: 11.1212 - val_accuracy: 0.8323
Epoch 3/10
70/70 [==============================] - 0s 7ms/step - loss: 4.1421 - accuracy: 0.9008 - val_loss: 11.6610 - val_accuracy: 0.7865
Epoch 4/10
70/70 [==============================] - 1s 7ms/step - loss: 3.5643 - accuracy: 0.9020 - val_loss: 9.9862 - val_accuracy: 0.7722
Epoch 5/10
70/70 [==============================] - 0s 6ms/step - loss: 2.8527 - accuracy: 0.9042 - val_loss: 9.9826 - val_accuracy: 0.8655
Epoch 6/10
70/70 [==============================] - 1s 7ms/step - loss: 2.0686 - accuracy: 0.9233 - val_loss: 8.5287 - val_accuracy: 0.8816
Epoch 7/10
70/70 [==============================] - 0s 7ms/step - loss: 1.4942 - accuracy: 0.9363 - val_loss: 6.8574 - val_accuracy: 0.8637
Epoch 8/10
70/70 [==============================] - 1s 8ms/step - loss: 1.1440 - accuracy: 0.9488 - val_loss: 7.9643 - val_accuracy: 0.8673
Epoch 9/10
70/70 [==============================] - 0s 6ms/step - loss: 0.9372 - accuracy: 0.9518 - val_loss: 7.5382 - val_accuracy: 0.8915
Epoch 10/10
70/70 [==============================] - 1s 8ms/step - loss: 0.9461 - accuracy: 0.9545 - val_loss: 7.5730 - val_accuracy: 0.8933
<keras.callbacks.History at 0x7f3ca53f6520>

# Evaluate the model#

loss, accuracy = model.evaluate(X_test, y_test)
print(f'Loss: {loss}, Accuracy: {accuracy}')
35/35 [==============================] - 0s 7ms/step - loss: 7.5730 - accuracy: 0.8933
Loss: 7.573015213012695, Accuracy: 0.8932735323905945

It looks like the regression model performs better on email classification. Because there are only two columns of data in this dataset, and all the features were extracted using a deep learning model, it is difficult to discuss the impact of each feature in the project for this class. For each model I’ve raised a few small issues of note that have actually been quite rewarding.

Summary#

Either summarize what you did, or summarize the results. Maybe 3 sentences.

In this project, I used two very different datasets for the analysis, but the results of the exploration tended to be that the simpler model performed better.

I skillfully utilized packages such as scikit-learn,pandas, and Altair to perform operations such as assigning data, cleaning data, performing ad hoc feature engineering, and plotting graphs.

Similarly, I have used interesting tools like StandardScaler, matplotlib, tensorflow.keras, wordcloud, Tokenizer etc. for data analysis which have benefited me a lot.

References#

Your code above should include references. Here is some additional space for references.

  • What is the source of your dataset(s)?

1.https://www.kaggle.com/datasets/sonujha090/bank-marketing/data

2.https://archive.ics.uci.edu/dataset/228/sms+spam+collection

  • List any other references that you found helpful.

https://www.kaggle.com/datasets/sonujha090/bank-marketing/code

request_id=&biz_id=102&utm_term=decision_boundary作图方法&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduweb~default-0-128858765.142^v96^control&spm=1018.2226.3001.4187

https://so.csdn.net/so/search?spm=1000.2115.3001.4498&q=Tokenizer&t=&u=