Email Classification and Bank Subscriber Forecasting#
Author: Shengqi You
Course Project, UC Irvine, Math 10, F23
Introduction#
Our first dataset comes from the UCI Machine Learning Repository and contains message text together with a spam/ham label. In that project I need to read the text and extract features from it, which involves some deep learning tools; in the end I still use machine learning models and compare how accurately they predict.
After that, I selected a second dataset of bank users, which records personal information about each user. I analyze it and predict whether or not each user subscribes (the y column), which is more relevant to the content of this course than the previous project.
Bank Subscriber Forecasting#
Adding Necessary Packages#
import numpy as np
import pandas as pd # data processing
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import random
import os
import altair as alt
from sklearn.model_selection import train_test_split
Generate Training and Test Data#
df = pd.read_csv("bank-full.csv")
#Creating User Columns
df_user = pd.DataFrame(np.arange(0,len(df)), columns=['user'])
df = pd.concat([df_user, df], axis=1)
df.head(20)
user | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
1 | 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
2 | 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
3 | 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
4 | 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |
5 | 5 | 35 | management | married | tertiary | no | 231 | yes | no | unknown | 5 | may | 139 | 1 | -1 | 0 | unknown | no |
6 | 6 | 28 | management | single | tertiary | no | 447 | yes | yes | unknown | 5 | may | 217 | 1 | -1 | 0 | unknown | no |
7 | 7 | 42 | entrepreneur | divorced | tertiary | yes | 2 | yes | no | unknown | 5 | may | 380 | 1 | -1 | 0 | unknown | no |
8 | 8 | 58 | retired | married | primary | no | 121 | yes | no | unknown | 5 | may | 50 | 1 | -1 | 0 | unknown | no |
9 | 9 | 43 | technician | single | secondary | no | 593 | yes | no | unknown | 5 | may | 55 | 1 | -1 | 0 | unknown | no |
10 | 10 | 41 | admin. | divorced | secondary | no | 270 | yes | no | unknown | 5 | may | 222 | 1 | -1 | 0 | unknown | no |
11 | 11 | 29 | admin. | single | secondary | no | 390 | yes | no | unknown | 5 | may | 137 | 1 | -1 | 0 | unknown | no |
12 | 12 | 53 | technician | married | secondary | no | 6 | yes | no | unknown | 5 | may | 517 | 1 | -1 | 0 | unknown | no |
13 | 13 | 58 | technician | married | unknown | no | 71 | yes | no | unknown | 5 | may | 71 | 1 | -1 | 0 | unknown | no |
14 | 14 | 57 | services | married | secondary | no | 162 | yes | no | unknown | 5 | may | 174 | 1 | -1 | 0 | unknown | no |
15 | 15 | 51 | retired | married | primary | no | 229 | yes | no | unknown | 5 | may | 353 | 1 | -1 | 0 | unknown | no |
16 | 16 | 45 | admin. | single | unknown | no | 13 | yes | no | unknown | 5 | may | 98 | 1 | -1 | 0 | unknown | no |
17 | 17 | 57 | blue-collar | married | primary | no | 52 | yes | no | unknown | 5 | may | 38 | 1 | -1 | 0 | unknown | no |
18 | 18 | 60 | retired | married | primary | no | 60 | yes | no | unknown | 5 | may | 219 | 1 | -1 | 0 | unknown | no |
19 | 19 | 33 | services | married | secondary | no | 0 | yes | no | unknown | 5 | may | 54 | 1 | -1 | 0 | unknown | no |
df.columns.values
array(['user', 'age', 'job', 'marital', 'education', 'default', 'balance',
'housing', 'loan', 'contact', 'day', 'month', 'duration',
'campaign', 'pdays', 'previous', 'poutcome', 'y'], dtype=object)
Users have already been tagged good or bad by the y column (this labeling is not otherwise explained in the original data). Grouping by y shows how the numeric features differ between the two classes.
df.groupby('y').mean()
user | age | balance | day | duration | campaign | pdays | previous | |
---|---|---|---|---|---|---|---|---|
y | ||||||||
no | 21197.503081 | 40.838986 | 1303.714969 | 15.892290 | 221.182806 | 2.846350 | 36.421372 | 0.502154 |
yes | 33228.953867 | 41.670070 | 1804.267915 | 15.158253 | 537.294574 | 2.141047 | 68.702968 | 1.170354 |
Feature Engineering#
#Define X and y
X = df.drop(['y','user','job','marital', 'education', 'contact',
'housing', 'loan', 'day', 'month', 'poutcome' ], axis=1)
y = df['y']
X = pd.get_dummies(X)
y = pd.get_dummies(y)
X.columns
X = X.drop(['default_no'], axis= 1)
X = X.rename(columns = {'default_yes': 'default'})
y.columns
y = y.drop(['yes'], axis=1)
y = y.rename(columns= {'no': 'y'})
The dummy variable trap can negatively influence our analysis. It is a scenario in which the independent variables are multicollinear, that is, two or more variables are highly correlated, so one variable can be predicted from the others. Dropping one dummy column per categorical variable, as done above for default and y, avoids the trap.
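For example, pd.get_dummies has a drop_first option that achieves the same thing as the manual drop above; a small sketch with made-up toy data:
# Toy example (made-up values): one-hot encoding a binary column
toy = pd.DataFrame({'default': ['yes', 'no', 'yes']})
print(pd.get_dummies(toy))                    # default_no and default_yes: redundant, collinear columns
print(pd.get_dummies(toy, drop_first=True))   # keeps only default_yes, avoiding the trap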
Visualising Data#
hist_subscribed = alt.Chart(df.head(5000)).mark_bar(color='green').encode(
alt.X('age', bin=alt.Bin(step=10), title='Age'),
alt.Y('count()', title='Count'),
alt.Tooltip(['age', 'count()'])
).properties(
title="Age Distribution "
)
hist_subscribed
# Only 5000 examples can be shown
scatter_plot = alt.Chart(df.head(10)).mark_point().encode(
x=alt.X('age', title='Age'),
y=alt.Y('duration:Q', title='Duration'),
tooltip=['age', 'duration']
).properties(
title='Scatter Plot of Age vs Duration ',
width=600,
height=400
)
scatter_plot.display()
Show more relationships between the different variables with a pair plot.
# Assuming df is your existing DataFrame
variables = ['age', 'balance', 'duration']
# Create a base chart
base = alt.Chart(df.head(5000)).mark_point().encode(
color='y:N'
)
# Create a repeated chart for pair plot
pairplot = base.encode(
alt.X(alt.repeat("column"), type='quantitative'),
alt.Y(alt.repeat("row"), type='quantitative')
).properties(
width=200,
height=200
).repeat(
row=variables,
column=variables
).resolve_scale(color='independent')
# Display the chart
pairplot.display()
Splitting the Dataset#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
Balancing the Training Set#
In machine learning tasks we often run into a common nuisance: the data imbalance problem.
The problem arises mainly in supervised learning. When the classes are unbalanced, traditional classification algorithms, whose learning goal is overall classification accuracy, focus too much on the majority class, so classification performance on the minority-class samples degrades. The vast majority of common machine learning algorithms do not work well on unbalanced datasets.
y_train['y'].value_counts()
1 31937
0 4231
Name: y, dtype: int64
pos_index = y_train[y_train.values == 1].index
neg_index = y_train[y_train.values == 0].index

if len(pos_index) > len(neg_index):
    higher = pos_index
    lower = neg_index
else:
    higher = neg_index
    lower = pos_index

np.random.seed(0)  # np.random.seed (not random.seed) controls np.random.choice
higher = np.random.choice(higher, size=len(lower))
lower = np.asarray(lower)
new_indexes = np.concatenate((lower, higher))

X_train = X_train.loc[new_indexes]
y_train = y_train.loc[new_indexes]
Determine the majority and minority class: the if-else block compares the lengths of pos_index and neg_index to find which class has more instances (the majority class); higher is set to the index of the majority class and lower to the index of the minority class.
Random sampling from the majority class: np.random.choice(higher, size=len(lower)) randomly samples instances from the majority class so that its size matches the minority class, which balances the classes; np.random.seed(0) makes the sampling reproducible, so the code produces the same result each run.
Create a balanced dataset: lower is converted to a NumPy array for concatenation, and new_indexes is created by concatenating the down-sampled majority-class indexes (higher) with the minority-class indexes (lower), giving an index array that represents a balanced dataset.
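As an aside, the same down-sampling idea can be written with sklearn.utils.resample. This is just a sketch, assuming X_train and y_train are as they were right after train_test_split (class 1 is the majority here, as the value counts above show):
from sklearn.utils import resample

# Sketch: down-sample the majority class (label 1 here) with sklearn.utils.resample
majority = X_train[y_train['y'] == 1]
minority = X_train[y_train['y'] == 0]

majority_down = resample(majority,
                         replace=False,            # sample without replacement
                         n_samples=len(minority),  # match the minority class size
                         random_state=0)           # reproducible

X_bal = pd.concat([majority_down, minority])
y_bal = y_train.loc[X_bal.index]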
y_train['y'].value_counts()
0 4231
1 4231
Name: y, dtype: int64
Feature Scaling#
StandardScaler is a class from the scikit-learn library used to standardize features by removing the mean and scaling to unit variance. Standardizing features is an important step, especially for algorithms that are sensitive to the scale of input features, like SVMs, k-nearest neighbors, and principal component analysis.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train2 = pd.DataFrame(sc.fit_transform(X_train))
X_test2 = pd.DataFrame(sc.transform(X_test))
X_train2.columns = X_train.columns.values
X_test2.columns = X_test.columns.values
X_train2.index = X_train.index.values
X_test2.index = X_test.index.values
X_train = X_train2
X_test = X_test2
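StandardScaler transforms each column as z = (x - mean) / std, so after fitting, every training column should have mean close to 0 and standard deviation close to 1. A quick sanity check:
# Sanity check: standardized training columns should have mean ~0 and std ~1
print(X_train.mean().round(3))
print(X_train.std().round(3))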
LogisticRegression#
We’re very familiar with this piece.
## Logistic Regression
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
# Predicting Test Set
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
results = pd.DataFrame([['Logistic Regression ', acc, prec, rec, f1]],
columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])
/shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:1111: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
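confusion_matrix was imported above but not displayed; a short sketch to look at it for the logistic regression predictions:
# Confusion matrix for the logistic regression predictions
cm = confusion_matrix(y_test.values.ravel(), y_pred)
print(pd.DataFrame(cm, index=['actual 0', 'actual 1'], columns=['pred 0', 'pred 1']))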
classifier.coef_
array([[-0.09572692, -0.21519623, -1.62131306, 0.32479807, -0.19420282,
-0.34889164, 0.09061903]])
Because the features were standardized, the coefficient magnitudes are roughly comparable: duration has by far the largest coefficient, while age and balance appear to matter much less.
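To read the coefficients more easily, they can be paired with the feature names; a small sketch using the scaled training columns:
# Pair each coefficient with its feature name for easier reading
coef_table = pd.Series(classifier.coef_[0], index=X_train.columns)
print(coef_table.sort_values())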
classifier.classes_
array([0, 1], dtype=uint8)
classifier.intercept_
array([-0.18344598])
KNN#
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=15, metric='minkowski', p= 2)
classifier.fit(X_train, y_train)
# Predicting Test Set
y_pred = classifier.predict(X_test)
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
model_results = pd.DataFrame([['K-Nearest Neighbors ', acc, prec, rec, f1]],
columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])
results = pd.concat([results, model_results], ignore_index=True)
/shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/neighbors/_classification.py:207: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
return self._fit(X, y)
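The choice n_neighbors=15 is somewhat arbitrary; one way to check it is cross-validation on the balanced training set. A sketch (this can be slow to run):
from sklearn.model_selection import cross_val_score

# Compare a few values of k by mean cross-validated accuracy on the balanced training set
for k in [5, 10, 15, 25]:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn_k, X_train, y_train.values.ravel(), cv=5)
    print(k, round(scores.mean(), 3))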
Decision Tree#
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)
#Predicting the best set result
y_pred = classifier.predict(X_test)
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
model_results = pd.DataFrame([['Decision Tree', acc, prec, rec, f1]],
columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])
results = pd.concat([results, model_results], ignore_index=True)
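The fitted tree also exposes feature_importances_, which can be compared with the logistic regression coefficients; a quick sketch:
# Feature importances of the fitted decision tree (they sum to 1)
importances = pd.Series(classifier.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))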
Results#
results
Model | Accuracy | Precision | Recall | F1 Score | |
---|---|---|---|---|---|
0 | Logistic Regression | 0.797412 | 0.951431 | 0.812023 | 0.876216 |
1 | K-Nearest Neighbors | 0.779056 | 0.963893 | 0.778961 | 0.861615 |
2 | Decision Tree | 0.716908 | 0.950956 | 0.716343 | 0.817143 |
As the results table shows, Logistic Regression and KNN perform better, but it is unfortunate that none of the three models performed as well as I expected.
Finally, I wanted to visualize the decision boundaries of the different machine learning models. After processing the training data multiple times it was hard to merge everything back into the initial table and use the Altair package, so I reprocess the data here and then use matplotlib.
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# Selecting two features for the visualization
# For demonstration, let's select 'age' and 'balance' as two features.
selected_features = ['age', 'balance']
# Preprocessing
df['y'] = df['y'].map({'yes': 1, 'no': 0})
X = df[selected_features]
y = df['y']
# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
# Standardizing the features
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Training models
log_reg = LogisticRegression(random_state=0)
knn = KNeighborsClassifier(n_neighbors=15, metric='minkowski', p=2)
dtree = DecisionTreeClassifier(criterion='entropy', random_state=0)
# Fitting the models
log_reg.fit(X_train, y_train)
knn.fit(X_train, y_train)
dtree.fit(X_train, y_train)
DecisionTreeClassifier(criterion='entropy', random_state=0)
The decision boundary plots follow the approach from this reference: https://blog.csdn.net/weixin_45891612/article/details/128858765?ops_request_misc=&request_id=&biz_id=102&utm_term=decision_boundary作图方法&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduweb~default-0-128858765.142^v96^control&spm=1018.2226.3001.4187
# Function to plot decision boundaries
def plot_decision_boundary(X, y, classifier, title):
    X_set, y_set = X, y
    X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                         np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
    plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
                 alpha=0.75, cmap=ListedColormap(('red', 'blue')))
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                    c=ListedColormap(('red', 'blue'))(i), label=j)
    plt.title(title)
    plt.xlabel('Age (Standardized)')
    plt.ylabel('Balance (Standardized)')
    plt.legend()
    plt.show()
# Plotting decision boundaries for each model
plot_decision_boundary(X_train, y_train, log_reg, 'Logistic Regression')
*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*. Please use the *color* keyword-argument or provide a 2D array with a single row if you intend to specify the same RGB or RGBA value for all points.
The KNN boundary could not be displayed, possibly because predicting over such a fine grid is very slow for KNN; I found this strange.
plot_decision_boundary(X_train, y_train, dtree, 'Decision Tree')
*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*. Please use the *color* keyword-argument or provide a 2D array with a single row if you intend to specify the same RGB or RGBA value for all points.
As we can see, the decision boundaries of the two plots are very different: Logistic Regression has a clear, smooth boundary while the DecisionTreeClassifier produces very fragmented regions. While this does not completely show how each model makes its predictions, it offers some interesting insight into how these models work and how they differ.
Mail Classification#
Regression#
Adding Necessary Packages#
!pip install wordcloud==1.9.3
!pip install wget
import tensorflow as tf
import os, sys, shutil, tempfile, wget
from zipfile import ZipFile
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import pandas as pd
import nltk
nltk.download('stopwords')
from wordcloud import WordCloud
from nltk.corpus import stopwords
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import matplotlib.pyplot as plt
import altair as alt
Generate Training and Test Data#
# Load the dataset
df = pd.read_csv('SMSSpamCollection', sep='\t', names=["Label", "SMS"])
print(df.shape)
df1 = df.head(5000)
# Convert labels to binary values , Feature engineering
df['Label'] = df['Label'].map({'ham': 0, 'spam': 1})
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df['SMS'], df['Label'], test_size=0.2, random_state=42)
import altair as alt
# Visualize the distribution of labels in the dataset using Altair
label_distribution_chart = alt.Chart(df1).mark_bar().encode(
alt.X('Label:N'),
alt.Y('count():Q'),
color='Label:N',
tooltip=['Label:N', 'count()']
).properties(
title='Distribution of Labels in the Dataset'
)
label_distribution_chart
(5572, 2)
The data has only two columns, so we have to figure out how to extract features from the text data.
label_distribution_chart1 = alt.Chart(df1).mark_point().encode(
alt.X('Label:N'),
alt.Y('count():Q'),
color='Label:N',
tooltip=['Label:N', 'count()']
).properties(
title='Distribution of Labels in the Dataset'
)
label_distribution_chart1
Vectorize the SMS data#
Here we introduce the feature extraction tool. We use TF-IDF, a refined word-frequency vectorization method that turns text into numerical data and can be used directly. We take out X and y and apply the word vectorization to X.
vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'))
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)
X_train_vect.shape
(4457, 7567)
As you can see, X, which was originally a column of text, has become a matrix of numeric feature variables that can be used directly for modeling.
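To make the transformation concrete, here is a tiny toy example (the sentences are made up) of what TfidfVectorizer produces:
# Toy example: three made-up "messages" turned into a TF-IDF matrix
toy_corpus = ["free prize call now", "see you at lunch", "call me now"]
toy_vect = TfidfVectorizer()
toy_matrix = toy_vect.fit_transform(toy_corpus)
print(toy_vect.get_feature_names_out())  # vocabulary learned from the toy corpus (get_feature_names on older scikit-learn)
print(toy_matrix.toarray().round(2))     # one row per message, one column per word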
Initialize Logistic Regression Model#
Since we are predicting a binary label from the vectorized text, we start with LogisticRegression as our model.
model = LogisticRegression()
Train the Model#
model.fit(X_train_vect, y_train)
LogisticRegression()
Validate the Model#
# Predict on test data
y_pred = model.predict(X_test_vect)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print('')
print(f'Accuracy: {accuracy}')
import altair as alt
# Sample a subset of the data to avoid MaxRowsError
df_sample = df.sample(1115, random_state=42)
# Dot plot: Length of SMS messages for ham and spam
df_sample['SMS_length'] = df_sample['SMS'].apply(len)
df_sample["pred"] = y_pred
dot_plot = alt.Chart(df_sample).mark_point(size=60).encode(
x=alt.X('pred', title='Message Type'),
y=alt.Y('SMS_length:Q', title='SMS Length'),
color=alt.Color('pred:N', scale=alt.Scale(range=['#1f77b4', '#ff7f0e']), legend=None),
tooltip=['pred:N', 'count()']
).properties(
title='Dot Plot of SMS Length for Ham and Spam Messages',
width=300
)
dot_plot
Accuracy: 0.9721973094170404
y_pred
array([0, 0, 0, ..., 0, 0, 0])
model.intercept_
array([-2.47425518])
model.coef_
array([[ 0.65752426, 1.29532624, -0.01725343, ..., -0.01139043,
-0.01464767, 0.13616753]])
Word cloud: Most frequent words in SMS messages#
The wordcloud library is an excellent third-party library for displaying word clouds: it turns a piece of text into a word cloud, using words as the basic unit for a more visual and artistic presentation of the text. I think it is very interesting.
text = ' '.join(df['SMS'])
wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=ENGLISH_STOP_WORDS).generate(text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud - Most Frequent Words in SMS Messages')
# Display charts
plt.show()
Decision Tree Classifier#
Next I'll use a DecisionTreeClassifier, since among the models we used in class it is well suited to this type of data.
from sklearn.tree import DecisionTreeClassifier
clf1 = DecisionTreeClassifier(max_leaf_nodes= 25)
clf1.fit(X_train_vect, y_train)
DecisionTreeClassifier(max_leaf_nodes=25)
y_pred2 = clf1.predict(X_test_vect)
accuracy2 = accuracy_score(y_test, y_pred2)
print('')
print('')
print(f'Accuracy: {accuracy2}')
Accuracy: 0.9542600896860987
Swap the positions of the training set with the test set to see if there is any overfitting.
clf2 = DecisionTreeClassifier(max_leaf_nodes= 25)
clf2.fit(X_test_vect, y_test)
y_pred3 = clf2.predict(X_train_vect)
accuracy3 = accuracy_score(y_train, y_pred3)
print('')
print('')
print(f'Accuracy: {accuracy3}')
Accuracy: 0.951312542068656
The model behaves roughly the same after the swap, so there is probably no serious overfitting.
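A more standard check than swapping the sets is cross-validation on the training data; a sketch comparing cross-validated accuracy with training accuracy (a large gap would suggest overfitting):
from sklearn.model_selection import cross_val_score

# Cross-validated accuracy vs. accuracy on the data the tree was trained on
clf_cv = DecisionTreeClassifier(max_leaf_nodes=25)
scores = cross_val_score(clf_cv, X_train_vect, y_train, cv=5)
print("CV accuracy:   ", round(scores.mean(), 3))
print("Train accuracy:", round(clf1.score(X_train_vect, y_train), 3))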
Multi-level Classification#
Again, Adding Necessary Packages#
import os, sys, wget
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
Generate Training and Test Data#
# Load the data into a DataFrame
df = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'message'])
# Encode the labels
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
# Features and labels
X = df['message']
y = df['label']
Feature Engineering#
The Tokenizer is one of the core components of an NLP pipeline. The goal is to convert text into data that the model can process, that is, to turn text input into numerical input.
# Tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
X_seq = tokenizer.texts_to_sequences(X)
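A small illustration of what the fitted tokenizer does (the sentences here are made up):
# What the fitted tokenizer produces for a couple of made-up messages
sample = ["free entry in a weekly competition", "are we meeting today"]
print(tokenizer.texts_to_sequences(sample))    # each known word is replaced by its integer id
print(list(tokenizer.word_index.items())[:5])  # a few (word, id) pairs from the learned vocabulary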
Build a model#
Define the neural network architecture: The following lines are defining a Sequential model. Sequential is a Keras model that represents a linear stack of layers. You can create a Sequential model by passing a list of layer instances to the constructor, or by using the .add() method to add layers one by one.
# Pad sequences
X_pad = pad_sequences(X_seq, maxlen=50)
# Define the neural network architecture
model = Sequential() #This line initializes the Sequential model.
#This line adds the first layer to the model, which is a Dense (fully connected) layer with 512 neurons.
model.add(Dense(512, input_shape=(50,), activation='relu'))
#This line adds a second Dense layer with 256 neurons, also with ReLU activation.
model.add(Dense(256, activation='relu'))
#This line adds a third Dense layer with a single neuron and uses the sigmoid activation function.
model.add(Dense(1, activation='sigmoid'))
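model.summary() can be used to inspect the layer output shapes and parameter counts before compiling; a quick sketch:
# Inspect layer output shapes and parameter counts of the network defined above
model.summary()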
Compile the model#
This line of code is configuring the model for training by defining the loss function, the optimizer, and the evaluation metrics.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Data allocation#
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_pad, y, test_size=0.2, random_state=0)
Train the model#
model.fit(X_train, y_train, epochs=10, batch_size=64, validation_data=(X_test, y_test))
Epoch 1/10
70/70 [==============================] - 1s 9ms/step - loss: 40.7405 - accuracy: 0.8088 - val_loss: 20.3003 - val_accuracy: 0.6933
Epoch 2/10
70/70 [==============================] - 1s 7ms/step - loss: 8.4930 - accuracy: 0.8616 - val_loss: 11.1212 - val_accuracy: 0.8323
Epoch 3/10
70/70 [==============================] - 0s 7ms/step - loss: 4.1421 - accuracy: 0.9008 - val_loss: 11.6610 - val_accuracy: 0.7865
Epoch 4/10
70/70 [==============================] - 1s 7ms/step - loss: 3.5643 - accuracy: 0.9020 - val_loss: 9.9862 - val_accuracy: 0.7722
Epoch 5/10
70/70 [==============================] - 0s 6ms/step - loss: 2.8527 - accuracy: 0.9042 - val_loss: 9.9826 - val_accuracy: 0.8655
Epoch 6/10
70/70 [==============================] - 1s 7ms/step - loss: 2.0686 - accuracy: 0.9233 - val_loss: 8.5287 - val_accuracy: 0.8816
Epoch 7/10
70/70 [==============================] - 0s 7ms/step - loss: 1.4942 - accuracy: 0.9363 - val_loss: 6.8574 - val_accuracy: 0.8637
Epoch 8/10
70/70 [==============================] - 1s 8ms/step - loss: 1.1440 - accuracy: 0.9488 - val_loss: 7.9643 - val_accuracy: 0.8673
Epoch 9/10
70/70 [==============================] - 0s 6ms/step - loss: 0.9372 - accuracy: 0.9518 - val_loss: 7.5382 - val_accuracy: 0.8915
Epoch 10/10
70/70 [==============================] - 1s 8ms/step - loss: 0.9461 - accuracy: 0.9545 - val_loss: 7.5730 - val_accuracy: 0.8933
<keras.callbacks.History at 0x7f3ca53f6520>
Evaluate the model#
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Loss: {loss}, Accuracy: {accuracy}')
35/35 [==============================] - 0s 7ms/step - loss: 7.5730 - accuracy: 0.8933
Loss: 7.573015213012695, Accuracy: 0.8932735323905945
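As a usage example, a new (made-up) message can be scored by passing it through the same tokenizer and padding before calling model.predict:
# Score a new, made-up message with the trained network
new_msg = ["Congratulations! You have won a free prize, call now"]
new_seq = pad_sequences(tokenizer.texts_to_sequences(new_msg), maxlen=50)
prob = model.predict(new_seq)[0][0]
print("Predicted spam probability:", float(prob))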
It looks like the logistic regression model performs better on this classification task. Because this dataset has only two columns, and all the features had to be extracted from the text, it is difficult to discuss the impact of each individual feature within the scope of this class. For each model I raised a few small points of note, which turned out to be quite rewarding.
Summary#
In this project I analyzed two very different datasets, and in both explorations the simpler model tended to perform better.
I used packages such as scikit-learn, pandas, and Altair to split data, clean data, perform ad hoc feature engineering, and plot graphs.
I also used interesting tools such as StandardScaler, matplotlib, tensorflow.keras, wordcloud, and Tokenizer for the data analysis, which benefited me a lot.
References#
Dataset sources:
1.https://www.kaggle.com/datasets/sonujha090/bank-marketing/data
2.https://archive.ics.uci.edu/dataset/228/sms+spam+collection
Other references I found helpful:
https://www.kaggle.com/datasets/sonujha090/bank-marketing/code
https://blog.csdn.net/weixin_45891612/article/details/128858765?ops_request_misc=&request_id=&biz_id=102&utm_term=decision_boundary作图方法&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduweb~default-0-128858765.142^v96^control&spm=1018.2226.3001.4187
https://so.csdn.net/so/search?spm=1000.2115.3001.4498&q=Tokenizer&t=&u=