Bank Direct Marketing: Term Deposit Subscription Forecast#

Author: Yiqun Su, yiquns3@uci.edu

Course Project, UC Irvine, Math 10, F23

Introduction#

This project analyzes data from a Portuguese banking institution's direct marketing campaigns, obtained from the UCI Machine Learning Repository. The main objective is to predict whether clients will subscribe to a term deposit based on clients' personal features and the bank's interactions with them, revealing key patterns that influence subscription decisions. Ultimately, this analysis aims to offer valuable insights to enhance the institution's marketing strategies in the banking sector.

Main Section#

Now we are going to start our project!

Part I. Download and Clean the Database#

  • Before diving into analysis, the initial step involves downloading our data, typically stored as a CSV file, using the pd.read_csv() method in Python. In this project, we will name the dataframe df.

  • Once we've acquired the data, the next crucial step is to check whether any NaN values exist inside our data frame. We can check with the isna().any() method. If any column contains NaN, we can clean it using methods like dropna() to remove empty or irrelevant entries; otherwise, we can keep going since the data is clean. This cleaning process ensures accuracy, consistency, and completeness, setting the stage for reliable analysis and insights.

import pandas as pd
df = pd.read_csv('bank-full.csv')
df.head(5)
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no
df.isna().any()
age          False
job          False
marital      False
education    False
default      False
balance      False
housing      False
loan         False
contact      False
day          False
month        False
duration     False
campaign     False
pdays        False
previous     False
poutcome     False
y            False
dtype: bool

In our case, since all columns are False, the data doesn't contain missing values. The output above also shows each column's name. We can check each column's name and data type with df.dtypes:

df.dtypes
age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

In case you don't know the meaning of each variable, I provide the explanations below.

variables:

  • bank client data:

    • 1 - age (numeric)

    • 2 - job: type of job (categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services")

    • 3 - marital: marital status (categorical: "married", "divorced", "single"; note: "divorced" means divorced or widowed)

    • 4 - education (categorical: "unknown", "secondary", "primary", "tertiary")

    • 5 - default: has credit in default? (binary: "yes", "no")

    • 6 - balance: average yearly balance, in euros (numeric)

    • 7 - housing: has housing loan? (binary: "yes", "no")

    • 8 - loan: has personal loan? (binary: "yes", "no")

  • related with the last contact of the current campaign:

    • 9 - contact: contact communication type (categorical: "unknown", "telephone", "cellular")

    • 10 - day: last contact day of the month (numeric)

    • 11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

    • 12 - duration: last contact duration, in seconds (numeric)

  • other attributes:

    • 13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

    • 14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

    • 15 - previous: number of contacts performed before this campaign and for this client (numeric)

    • 16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown", "other", "failure", "success")

  • desired target:

    • 17 - y: has the client subscribed a term deposit? (binary: "yes", "no")

From the data information shown above, we can see that some of the columns have the wrong data type.

  • For example, day and month are int64 and object; usually we would want to combine them into a single datetime value. However, since we are not going to use them as variables in the following analysis, we can remove those two columns with the method drop(columns = ['column name'], inplace = True)

  • In addition, we can see that the columns ['default', 'housing', 'loan', 'y'] all have the data type object, but we want to convert them to boolean values with the following method (see reference).

df.drop(columns = ['day','month'],inplace = True)
df[['default','housing','loan','y']] =  df[['default','housing','loan','y']].replace({'yes': True, 'no': False}).astype(bool)
df.head(5)
age job marital education default balance housing loan contact duration campaign pdays previous poutcome y
0 58 management married tertiary False 2143 True False unknown 261 1 -1 0 unknown False
1 44 technician single secondary False 29 True False unknown 151 1 -1 0 unknown False
2 33 entrepreneur married secondary False 2 True True unknown 76 1 -1 0 unknown False
3 47 blue-collar married unknown False 1506 True False unknown 92 1 -1 0 unknown False
4 33 unknown single unknown False 1 False False unknown 198 1 -1 0 unknown False

Let's take an overview of our data frame again. You can see that the string 'unknown' appears in many of the object columns.

  • Remark: Even though we used the method isna() to check for missing values, there may still be elements that carry a similar meaning to NaN, such as 'unknown' in our case. To better clean up our data, we can check the number of 'unknown' values in each column and decide whether to keep or erase those values.

  • To get the columns that might include that string, we use df.select_dtypes(include=object).columns

  • To check the number of 'unknown' values in each column, we use a for loop and df.value_counts('column name')

for i in list(df.select_dtypes(include=object).columns): #reference
    print(df.value_counts(i))
job
blue-collar      9732
management       9458
technician       7597
admin.           5171
services         4154
retired          2264
self-employed    1579
entrepreneur     1487
unemployed       1303
housemaid        1240
student           938
unknown           288
dtype: int64
marital
married     27214
single      12790
divorced     5207
dtype: int64
education
secondary    23202
tertiary     13301
primary       6851
unknown       1857
dtype: int64
contact
cellular     29285
unknown      13020
telephone     2906
dtype: int64
poutcome
unknown    36959
failure     4901
other       1840
success     1511
dtype: int64

Now, we can see the string ‘unknown’:

  • takes a small proportion in column 'job'

  • doesn't exist in column 'marital'

  • takes a relatively small proportion in column 'education'

  • takes a large proportion in column 'contact'

  • takes an extremely large proportion in column 'poutcome'

Based on our observations, we can remove the columns 'contact' and 'poutcome' since they have too many unknown values, which won't be helpful in the later analysis; for columns 'job' and 'education', we can erase the corresponding rows that contain 'unknown', in order to ensure a better performance in the later regression.

  • To remove the columns, we will use the same method as before: drop(columns = [...], inplace = True)

  • To remove the rows that contain 'unknown', we will use Boolean masks to get the result

df.drop(columns = ['contact', 'poutcome'],inplace = True)
df = df[~((df['job']=='unknown') | (df['education']=='unknown'))]
df.head(5)
age job marital education default balance housing loan duration campaign pdays previous y
0 58 management married tertiary False 2143 True False 261 1 -1 0 False
1 44 technician single secondary False 29 True False 151 1 -1 0 False
2 33 entrepreneur married secondary False 2 True True 76 1 -1 0 False
5 35 management married tertiary False 231 True False 139 1 -1 0 False
6 28 management single tertiary False 447 True True 217 1 -1 0 False

Finally, we obtain a useful data frame! Let's move to the next stage.

Part II. Client Feature Comparison: Dummy Variables and Logistic Regression#

In this section, we'll use Logistic Regression, a method from supervised learning, to analyze client data and see how different client features affect whether a client subscribes to a term deposit. We'll also use dummy variables. Here are some explanations:

  • Supervised learning: It’s a machine learning approach where algorithms learn patterns from labeled data.

  • Logistic Regression: It’s a statistical method used to model the relationship between a categorical outcome (subscription to a term deposit: “yes” or “no”) and independent variables.

  • Dummy variable: It’s a numeric representation of categorical data. For example, our ‘job’ column contains different job categories as strings. To use this categorical data in Logistic Regression, we convert it into dummy variables (binary values 0 or 1) to help the model understand these categories.

\[ \log\frac{p}{1-p} = β_0 + β_1 \cdot \text{age} + β_2 \cdot \text{job} + β_3 \cdot \text{marital} + β_4 \cdot \text{education} + β_5 \cdot \text{default} + β_6 \cdot \text{balance} + β_7 \cdot \text{housing} + β_8 \cdot \text{loan} \]

where p is the probability that the client subscribes to a term deposit (y = "yes").

This formula doesn't perform well because we have categorical variables (for example, the "job" column contains 12 different categories) and Logistic Regression cannot work with string values directly. Therefore, we want to convert those categorical variables into dummy variables, one per category.
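As a quick illustration of what this looks like (a minimal sketch, assuming the 'marital' categories listed earlier), pd.get_dummies() expands one categorical column into one indicator column per category:

# preview: one 0/1 indicator column per marital-status category
pd.get_dummies(df['marital']).head(3)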

Also, to better handle the numerical variable 'age', we will convert it into intervals based on common sense or the mean of the original data (for example, converting the 'age' column into 'young', 'middle_age', and 'elderly' columns).
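As a side note, here is a hedged alternative sketch of the same age binning using pd.cut (the bin edges mirror the <=30 / 30-60 / >60 cut-offs used later; this is an illustration only, not the code used below):

# hypothetical alternative: bin age into three labelled intervals, then dummy-encode
age_group = pd.cut(df['age'], bins=[0, 30, 60, 200], labels=['young', 'middle_age', 'elderly'])
pd.get_dummies(age_group).head(3)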

  • Here is the final formula:

\[ \log\frac{p}{1-p} = β_0 + β_{11} \cdot \text{young} + ... + β_{21} \cdot \text{job}_\text{admin} + ... + β_{31} \cdot \text{marital}_\text{married} +...+ β_{41} \cdot \text{education}_\text{secondary}+...+ β_{51} \cdot \text{default}_\text{yes} +...+ β_{61} \cdot \text{balance} +...+ β_{71} \cdot \text{housing}_\text{yes} +...+ β_{81} \cdot \text{loan}_\text{yes} + β_{82} \cdot \text{loan}_\text{no} \]

After understanding the basic structure of our Logistic Regression model, we can start our analysis:

  • 1st Step: construct a new data frame that includes all the variables (in the form of dummy variables). The method we are going to use is pd.get_dummies() (reference)

client_col = list(df.columns)[:8]
client_col
['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan']
#separate columns
age_col = ['young','middle_age','elderly']
balance_col = ['balance']
dum_col = client_col[1:4]
binary_col = [client_col[4]] + client_col[-2:]

#new data frame
df_new = pd.DataFrame()

#age
df_new['young'] = df['age']<=30
df_new['middle_age'] = (df['age']>30) & (df['age']<=60)
df_new['elderly'] = df['age']>60

#balance
df_new['balance'] = df['balance']

#dummy variables
df_dummy = pd.DataFrame()
for i in dum_col:
    x = pd.get_dummies(df[i])
    df_dummy = pd.concat([df_dummy,x],axis = 1)

df_new = pd.concat([df_new,df_dummy],axis = 1)

#binary variabes
df_new[binary_col] = df[binary_col]
var_col = list(df_new.columns)
df_new['y'] = df['y']
df_new
young middle_age elderly balance admin. blue-collar entrepreneur housemaid management retired ... divorced married single primary secondary tertiary default housing loan y
0 False True False 2143 0 0 0 0 1 0 ... 0 1 0 0 0 1 False True False False
1 False True False 29 0 0 0 0 0 0 ... 0 0 1 0 1 0 False True False False
2 False True False 2 0 0 1 0 0 0 ... 0 1 0 0 1 0 False True True False
5 False True False 231 0 0 0 0 1 0 ... 0 1 0 0 0 1 False True False False
6 True False False 447 0 0 0 0 1 0 ... 0 0 1 0 0 1 False True True False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
45206 False True False 825 0 0 0 0 0 0 ... 0 1 0 0 0 1 False False False True
45207 False False True 1729 0 0 0 0 0 1 ... 1 0 0 1 0 0 False False False True
45208 False False True 5715 0 0 0 0 0 1 ... 0 1 0 0 1 0 False False False True
45209 False True False 668 0 1 0 0 0 0 ... 0 1 0 0 1 0 False False False False
45210 False True False 2971 0 0 1 0 0 0 ... 0 1 0 0 1 0 False False False False

43193 rows × 25 columns

(From the above code, we obtain our new column list var_col (which includes all the dummy variables) and our new data frame df_new.)

Once we have finished preparing our new data frame for Logistic Regression, we move on to step 2.

  • 2nd Step: split training and testing data

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_new[var_col], df_new['y'],test_size=0.3, random_state=42)
  • 3rd Step: import LogisticRegression and use the data to train our model

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train,y_train)
LogisticRegression()
  • 4th Step: check each coefficient and its corresponding feature name; check the intercept. If a coefficient is positive, the feature contributes a positive impact on the client's subscription choice (in other words, makes a deposit more likely); if it is negative, the impact is negative.

clf.intercept_
array([-0.68674773])
clf.coef_
array([[-9.80980660e-02, -5.97110165e-01,  8.46051278e-03,
         3.31166290e-05, -8.17098589e-02, -1.77400992e-01,
        -2.62199512e-02, -2.14099132e-02, -1.23483082e-01,
        -1.30836822e-02, -2.26880901e-02, -7.72627289e-02,
        -3.62317038e-03, -1.21942664e-01, -1.79235852e-02,
        -8.19359555e-02, -4.33841177e-01, -1.70970586e-01,
        -1.23193378e-01, -3.94802409e-01, -1.68751931e-01,
        -2.14060154e-02, -4.54891049e-01, -1.48492591e-01]])
clf.feature_names_in_
array(['young', 'middle_age', 'elderly', 'balance', 'admin.',
       'blue-collar', 'entrepreneur', 'housemaid', 'management',
       'retired', 'self-employed', 'services', 'student', 'technician',
       'unemployed', 'divorced', 'married', 'single', 'primary',
       'secondary', 'tertiary', 'default', 'housing', 'loan'],
      dtype=object)
df_coef = pd.DataFrame()
df_coef['variables'] = clf.feature_names_in_
df_coef['coef'] = (clf.coef_).squeeze()
df_coef
variables coef
0 young -0.098098
1 middle_age -0.597110
2 elderly 0.008461
3 balance 0.000033
4 admin. -0.081710
5 blue-collar -0.177401
6 entrepreneur -0.026220
7 housemaid -0.021410
8 management -0.123483
9 retired -0.013084
10 self-employed -0.022688
11 services -0.077263
12 student -0.003623
13 technician -0.121943
14 unemployed -0.017924
15 divorced -0.081936
16 married -0.433841
17 single -0.170971
18 primary -0.123193
19 secondary -0.394802
20 tertiary -0.168752
21 default -0.021406
22 housing -0.454891
23 loan -0.148493

By checking the sign (+ or -) of each variable in the data frame df_coef, we get a general sense of the connection between each variable and "y" (True/False: whether the client subscribes).
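For a slightly more quantitative reading (a hedged extra step, not part of the original analysis), exponentiating a logistic regression coefficient gives an approximate odds ratio: values above 1 increase the odds of subscribing, values below 1 decrease them.

import numpy as np
# e.g. exp(-0.45) ≈ 0.63 for 'housing': clients with a housing loan have noticeably lower odds of subscribing
df_coef['odds_ratio'] = np.exp(df_coef['coef'])
df_coef.sort_values('odds_ratio').head()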

  • 5th step: check how well our model fits the data using the method clf.score().

clf.score(X_test,y_test)
0.883392498842414

Conclusion:

  • From the above score, a classification score of 0.88 (or 88%) indicates that our logistic regression model is correctly predicting the target variable (whether clients will subscribe to a term deposit or not) about 88% of the time on the dataset we used for evaluation.

  • The score falls in a neutral range, neither indicating a notably strong performance nor suggesting a poor one.
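One caveat worth checking (a hedged sketch reusing the train/test split defined above): most clients in this dataset do not subscribe, so a natural point of comparison is a baseline model that always predicts the majority class. If that baseline also scores close to 0.88, the logistic regression adds little beyond the majority-class guess.

from sklearn.dummy import DummyClassifier
baseline = DummyClassifier(strategy='most_frequent')  # always predicts the most common class in y_train
baseline.fit(X_train, y_train)
baseline.score(X_test, y_test)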

Part III. Bank’s Interactions: Logistic Regression#

  • In this part, we are still using logistic regression.

  • We’re digging into client interactions to find out what really encourages them to sign up. This information can be super helpful for the bank to improve how they approach their customers.

  • This time, we only consider 2 variables: “duration” & “campaign”

    • duration: last contact duration, in seconds (numeric)

    • campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

In order to help you understand the data better, let's visualize it using Altair. (Notice: we have a huge dataset of over 5,000 rows, which exceeds Altair's default max_rows limit. In order to plot our data, we need to call alt.data_transformers.disable_max_rows(), which removes that constraint. I have put the link for this method in the references.)

import altair as alt
alt.data_transformers.disable_max_rows() #dataset is too large (more than 5000 rows); this call removes that constraint
c = alt.Chart(df).mark_circle().encode(
    x = 'duration',
    y = 'campaign',
    color = 'y:N'
).interactive()
c

By looking at this chart, we can make a simple guess:

  • Clients are more likely to sign up if they receive less contacts but have longer durations.

Now, let’s use our model to verify our guess.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[['duration','campaign']],df['y'],test_size=0.3,random_state=11)
clf.fit(X_train,y_train)
clf.score(X_test,y_test)
0.8901836703194937

(Score = 89%) The model seems fair enough for us to make a general prediction. Let’s check its coefficients.

clf.coef_
array([[ 0.00355202, -0.1206294 ]])
clf.feature_names_in_
array(['duration', 'campaign'], dtype=object)

The signs of the coefficients confirm our guess.

  • duration: (+)

  • campaign: (-)

If you think the first chart contains too many points, which can make the color classification hard to read, we can extract a sample to show a similar image:

df_sample = df[['duration','campaign','y']].sample(5000)
c2 = alt.Chart(df_sample).mark_circle().encode(
    x = 'duration',
    y = 'campaign',
    color = 'y:N'
).interactive()
c2
  • Conclusion: From the model, the coefficients suggest that longer individual contact times might slightly increase the likelihood of sign-ups. However, making numerous contacts during the campaign tends to decrease the probability of subscription. This implies that finding a balance between longer, more impactful individual interactions and minimizing the overall number of campaign contacts could be crucial in optimizing the bank’s approach to drive more subscriptions.
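To make this trade-off concrete (a hedged sketch; the grid values below are arbitrary examples, not from the original analysis), we can ask the fitted model for predicted subscription probabilities at a few combinations of duration and campaign:

# predicted probability of subscribing (y = True) for short vs. long calls and few vs. many contacts
grid = pd.DataFrame({'duration': [100, 100, 600, 600], 'campaign': [1, 10, 1, 10]})
grid['p_subscribe'] = clf.predict_proba(grid)[:, 1]
grid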

Part IV. Bank’s Interactions: Decision Tree & Random Forest#

  • Decision Tree: A Decision Tree is a tool used in machine learning to make decisions based on features in the data. It creates a tree-like structure where each step narrows down possibilities until it predicts an outcome. It’s easy to understand and works with different types of data to predict things accurately.

  • Random Forest: a powerful ensemble method that combines multiple decision trees (it can be used for both classification and regression).

  • Goal: Same as the last part, we're digging into client interactions to find out what really encourages them to sign up. This information can be super helpful for the bank to improve how they approach their customers.

  • Chosen Columns:

    • duration: last contact duration, in seconds (numeric)

    • campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

1. Decision Tree

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df[['duration','campaign']], df['y'],test_size=0.3, random_state=189)

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_leaf_nodes=4,max_depth=10)
clf.fit(x_train, y_train)
DecisionTreeClassifier(max_depth=10, max_leaf_nodes=4)
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig = plt.figure()
_ = plot_tree(clf, 
                   feature_names=['duration', 'campaign'],
                    class_names=['False', 'True'],
                   filled=True)
(Figure: the fitted decision tree with four leaf nodes, showing split thresholds, gini values, sample counts, and predicted classes.)
clf.score(x_test,y_test)
0.8908782219478315
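If the rendered tree image is hard to read, a text dump of the same fitted tree (a minimal sketch using scikit-learn's export_text) shows the learned split thresholds and leaf classes directly:

from sklearn.tree import export_text
print(export_text(clf, feature_names=['duration', 'campaign']))  # split rules and predicted class at each of the 4 leaves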
  • Conclusion: The score obtained from our decision tree method is similar to the Logistic Regression score in the last part. By looking at the decision tree, you can see the gini value is relatively large in 3 out of 4 leaf nodes. Therefore, we can say it is still hard to predict the outcome through this method. However, the problem may be the data itself: if the data is not linearly correlated, we cannot get a good estimate from these methods.

2. Random Forest

X_tree = df_new.iloc[:,:-1]
X_tree.columns
Index(['young', 'middle_age', 'elderly', 'balance', 'admin.', 'blue-collar',
       'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed',
       'services', 'student', 'technician', 'unemployed', 'divorced',
       'married', 'single', 'primary', 'secondary', 'tertiary', 'default',
       'housing', 'loan'],
      dtype='object')
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X_tree, df['y'],test_size=0.3, random_state=78)

from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier(n_estimators=500,max_leaf_nodes=10)
rfc.fit(x_train, y_train)

rfc.score(x_test,y_test)
0.8850902917116839
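Before drawing a conclusion, one optional check (a hedged sketch, not in the original analysis): a fitted random forest exposes feature_importances_, which gives a rough sense of which client features the trees actually relied on.

importances = pd.Series(rfc.feature_importances_, index=X_tree.columns)
importances.sort_values(ascending=False).head(10)  # the largest values are the features the forest split on most often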
  • Conclusion: I tried many combinations of **n_estimators= , max_leaf_nodes= ** to maximize the final score; however, it always stays close to 0.88. (When I increase max_leaf_nodes to a huge value, the score drops due to overfitting.) I would say random forest can provide a rough prediction, but it is not accurate enough.

Part V. Data Relevance Visualization: Dimension Reduction#

In this part, we aim to utilize Dimension Reduction to explore which types of customers are likely to sign up for a bank's term deposit. Given the multitude of variables in our dataset, we need dimension reduction techniques to organize them more efficiently. These methods fall under unsupervised learning. Here's a brief introduction to these terms:

  • Unsupervised learning: discovers patterns in data without specific guidance

  • Dimension reduction: simplifies complex data

    • PCA (Principal Component Analysis): simplifies complex datasets by condensing them into fewer dimensions while keeping the most important information

    • TSNE (t-distributed Stochastic Neighbor Embedding): a dimension reduction technique that focuses on visualizing high-dimensional data in a lower-dimensional space while preserving local structures.

    • PCA focuses on capturing overall variance in data to condense it, while t-SNE emphasizes revealing local patterns and clusters in a way that's easier to visualize.

In this part, we will try Dimension Reduction first:

  • PCA: We want to reduce all variable columns into 2 columns, so we set n_components=2. We use pca.fit_transform(), which both fits the PCA and returns the transformed two-column array.

X_PCA = df_new.iloc[:,:-1]
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_PCA)

In the code below, since df_new['y'] keeps the original (non-consecutive) index from df, we need to reset that index so it aligns with the new PCA data frame. Here, we use the method .reset_index(drop=True):

df_pca = pd.DataFrame(X_pca,columns=['pc1','pc2'])
df_pca_y = df_new['y'].reset_index(drop = True) #reference

df_pca = pd.concat([df_pca,df_pca_y],axis = 1)
df_pca
pc1 pc2 y
0 788.972683 -0.711166 False
1 -1325.027359 0.133277 False
2 -1352.027360 0.734173 False
3 -1123.027316 -0.746432 False
4 -907.027340 -1.505725 False
... ... ... ...
43188 -529.027317 -0.479142 True
43189 374.972668 -0.309321 True
43190 4360.972664 0.531843 True
43191 -686.027346 0.759329 False
43192 1616.972660 0.576338 False

43193 rows × 3 columns

Now, let’s visualize the above data frame using altair:

alt.Chart(df_pca).mark_circle(size=60).encode(
    x = 'pc1',
    y = 'pc2',
    color = 'y:N',
    tooltip = ['y:N']
).interactive().properties(
    title = 'PCA of Dataset'
)
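Before trying another method, a quick diagnostic (a hedged check, not part of the original workflow) is to look at how much variance each principal component explains. Because the balance column is on a much larger scale than the 0/1 dummy columns and was not standardized, it likely dominates the first component.

pca.explained_variance_ratio_  # if the first value is close to 1, pc1 is essentially the unscaled 'balance' column
# standardizing the features first (e.g. with sklearn.preprocessing.StandardScaler) would change this picture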

Since the PCA doesn’t give a clear image of classification, we will try another method:

  • TSNE: TSNE should provide a better image. But first, we need to import TSNE into our project and transform our data with it. Notice: TSNE takes a long time on large data, so to save some time we will fit the TSNE model on a random sample from df_new.

TSNE_sample = df_new.sample(10000,random_state=123)
TSNE_X_sample = TSNE_sample.iloc[:,:-1]
TSNE_Y_sample = (TSNE_sample['y']).reset_index(drop=True)
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2,random_state=110,n_jobs=-1) #use all available CPU cores
X_tsne = tsne.fit_transform(TSNE_X_sample)
/shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:800: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
  warnings.warn(
/shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:810: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
  warnings.warn(
df_tsne = pd.DataFrame(X_tsne,columns=['tsne1','tsne2'])
df_tsne['y'] = TSNE_Y_sample
df_tsne
tsne1 tsne2 y
0 70.630997 36.882130 False
1 26.073503 -5.131013 False
2 54.076099 15.021196 False
3 13.031662 -16.288887 False
4 -53.456001 -2.219901 False
... ... ... ...
9995 -65.966759 -52.339767 False
9996 -39.448551 -54.611374 True
9997 -65.475075 -24.158939 False
9998 18.116796 -36.984108 False
9999 15.613144 12.081707 False

10000 rows × 3 columns

alt.Chart(df_tsne).mark_circle(size=60).encode(
    x = 'tsne1',
    y = 'tsne2',
    color = 'y:N',
    tooltip = ['y:N']
).interactive().properties(
    title = 'TSNE of Dataset'
)

In this case, TSNE also gives us a strange figure.

  • Analysis:

    • PCA might not generate a clear classification image when the data doesn’t exhibit distinct linear relationships between variables or when the variance critical for classification isn’t well-aligned with the principal components. Additionally, in scenarios where non-linear relationships exist within the data, PCA might not capture these complex structures effectively, impacting its ability to provide a clear classification image.

    • t-SNE might struggle to create a clear classification image when dealing with very high-dimensional data or datasets with noisy or sparse features. Additionally, if the clusters in the data are inherently overlapping or have complex, non-linear structures, t-SNE might not represent them accurately in a lower-dimensional space, leading to a less distinct classification image.

  • Possible Solution:

    • Reduce the size of our input data. This time, instead of using all features, I will use fewer columns/variables. Hopefully it works out this time!

Let’s choose our new data set and redo the above process with smaller samples:

DR_column = ['duration', 'campaign','y']
DR_data_new = (df.loc[:,DR_column]).sample(5000,random_state=999)
DR_X = DR_data_new.iloc[:,:-1]
DR_Y = DR_data_new['y'].reset_index(drop=True)
New_x_tsne = tsne.fit_transform(DR_X)
/shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:800: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
  warnings.warn(
/shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:810: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
  warnings.warn(
df_New_tsne = pd.DataFrame(New_x_tsne,columns=['tsne1','tsne2'])
df_New_tsne['y'] = DR_Y
df_New_tsne
tsne1 tsne2 y
0 -0.782646 6.951202 False
1 79.616730 10.005441 False
2 -49.484390 -53.164055 True
3 38.418545 -65.913544 False
4 -45.744884 37.791634 False
... ... ... ...
4995 3.745734 -30.745201 False
4996 -52.594669 -50.808937 False
4997 -3.334065 7.638218 False
4998 -19.823957 -2.482768 False
4999 -56.856602 -44.818787 False

5000 rows × 3 columns

alt.Chart(df_New_tsne).mark_circle(size=60).encode(
    x = 'tsne1',
    y = 'tsne2',
    color = 'y:N',
    tooltip = ['y:N']
).interactive().properties(
    title = 'TSNE of Dataset'
)
  • Conclusion: Even with the smaller data and fewer variables, the image still doesn't illustrate distinct areas for the different outputs. Therefore, we can say the classes in the data are inherently overlapping or have complex, non-linear structure.

Part VI. Clustering: K-Means Clustering#

  • K-Means Clustering is an unsupervised machine learning algorithm used for clustering or grouping similar data points together. It aims to partition a dataset into K clusters where each data point belongs to the cluster with the nearest mean (centroid)

  • Here, we want to use K-Means for client segmentation, where finding natural groupings or patterns within the data is crucial.

  • Variables that we use this time (in df):

    • all numerical variables in df

    • [‘age’, ‘duration’, ‘campaign’, ‘pdays’, ‘previous’]

df_K_x = df.loc[:,['age', 'duration', 'campaign', 'pdays', 'previous']]
df_K_y = df['y'].reset_index(drop=True)
# Use PCA
pca_y = pca.fit_transform(df_K_x)

df_KM = pd.DataFrame(pca_y,columns=['pca1','pca2'])
df_KM['y'] = df_K_y

#Use K-mean
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2,random_state=0)
y_Km = kmeans.fit_predict(df_K_x)

df_KM['y_km'] = y_Km

# image (y)
c_y = alt.Chart(df_KM).mark_circle().encode(
    x = 'pca1',
    y = 'pca2',
    color = alt.Color('y:N',scale = alt.Scale(scheme='set1')),
).properties(
    title = 'PCA: Y (True/False)'
)
c_y

# image (K-m)
c_km = alt.Chart(df_KM).mark_circle().encode(
    x = 'pca1',
    y = 'pca2',
    color = alt.Color('y_km:N',scale = alt.Scale(scheme='set1')),
).properties(
    title = 'PCA: K-mean (1/0)'
)
c_km

alt.hconcat(c_y,c_km).properties(
    title = 'PCA: K-means Clustering vs. y Bool'
)

Test the accuracy: Adjusted Rand Index (ARI) (reference) The Adjusted Rand Index is a function that measures the similarity of the two assignments, ignoring permutations and with chance normalization. It’s a common way to compare the clustering result with the true labels.

from sklearn import metrics
metrics.adjusted_rand_score(df_KM['y_km'],df_KM['y'])
0.2945078298429357
  • Conclusion: Visually, the clusters produced by K-Means (image PCA:K-mean) appear distinct and well-separated, indicating clear boundaries between groups. However, when quantitatively assessing the similarity between the K-Means clusters and the true categories using the Adjusted Rand Index (ARI), the agreement between the two is relatively low (ARI of 0.295).

  • The ARI score of 0.295 suggests a moderate level of agreement beyond random chance, indicating that the clustering produced by K-Means does share some similarity with the true categories but doesn’t perfectly align with them.
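As an optional extra check (a hedged sketch, not in the original analysis), the silhouette score measures how well-separated the K-Means clusters themselves are, independent of the true y labels; it complements the ARI comparison above.

from sklearn.metrics import silhouette_score
silhouette_score(df_K_x, y_Km, sample_size=5000, random_state=0)  # internal cluster quality on a 5,000-row subsample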

Extra: Hierarchical clustering: Agglomerative Clustering#

Hierarchical clustering methods are versatile and adaptable, making them valuable for handling binary or categorical data.

  • Agglomerative hierarchical clustering, for instance, starts with each data point as a single cluster and progressively merges similar clusters until a stopping criterion is met, creating a tree-like structure (dendrogram).

  • Divisive clustering, on the other hand, starts with all data points in a single cluster and recursively divides them into smaller clusters based on dissimilarity until individual points form separate clusters. These methods use specific similarity measures suitable for categorical or binary data, such as the Hamming distance or other appropriate metrics for non-numeric variables. They’re advantageous for such data types because they don’t rely on assumptions related to distance metrics for continuous variables, making them well-suited for handling categorical features.

  • Before we start this method, I extract the data frame which contains all binary / boolean columns:

all_bool = df_new.drop(columns = 'balance').copy()
all_bool = all_bool.sample(10000,random_state = 1314)
bool_x = all_bool.iloc[:,:-1]
  • Now, we can apply the AgglomerativeClustering method, plot the result, and compare it with K-Means.

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
agg_cluster = AgglomerativeClustering(n_clusters=2, linkage='ward')  # Adjust linkage and n_clusters as needed
agg_cluster.fit(bool_x)
AgglomerativeClustering()
# Use PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2,random_state=555)
pca_bool = pca.fit_transform(bool_x)
df_bool = pd.DataFrame(pca_bool,columns=['pca1','pca2'])
df_bool['y'] = all_bool['y'].reset_index(drop = True)
df_bool['y_agg'] = agg_cluster.labels_
(df_bool['y']==df_bool['y_agg']).mean()
0.6989
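Since the K-Means result above was evaluated with the Adjusted Rand Index, a like-for-like check for the agglomerative labels may be more informative than raw accuracy (a hedged sketch; note that cluster labels 0/1 are arbitrary, so plain accuracy can flip depending on how the clusters happen to be numbered):

from sklearn import metrics
metrics.adjusted_rand_score(df_bool['y_agg'], df_bool['y'])  # label-permutation-invariant agreement with the true classes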
import altair as alt
alt.data_transformers.disable_max_rows() #reference 4
c_bool = alt.Chart(df_bool).mark_circle().encode(
    x = 'pca1',
    y = 'pca2',
    color = alt.Color('y_agg:N',scale = alt.Scale(scheme='set1')),
).properties(
    title = 'PCA: Agg'
)
c_bool
  • Conclusion: Visually inspecting the scatter plots resulting from the agglomerative hierarchical clustering method reveals a noticeable separation between two outcome categories. However, when quantitatively assessing the similarity between the assigned agg_cluster_labels and the true binary labels (y), the accuracy yields approximately 0.7. Surprisingly, this accuracy, although not perfect, surpasses the performance obtained by the K-Means method. Therefore, while the agglomerative hierarchical clustering visually demonstrates a discernible separation, the relatively higher accuracy in aligning with the true binary labels signifies its improved performance compared to K-Means in this context.

Summary#

  • In this project, I began by cleaning the data and then constructed a Logistic Regression model with many variables (including dummy variables). Training this model on the data enabled me to discern whether each variable positively or negatively influenced the outcome. Subsequently, I attempted various methods to visually classify the data, encompassing both supervised and unsupervised machine learning approaches. However, the final results were unsatisfactory due to an excessive number of categorical variables (transformed into numerous dummy variables). The abundance of variables resulted in many models being disrupted by irrelevant features, thereby neglecting the genuinely significant variables. Hence, in conducting data analysis, the alignment between data selection and methodology is pivotal for effective outcomes.

  • Finally, addressing the primary question of this project—regarding whether customers opt for a bank’s term deposit subscription—it becomes evident that customer characteristics play a predominant role. Yet, the bank’s promotional efforts subtly influence their decisions. Crucial factors to emphasize include loan status, defaults, and similar aspects. If the bank intends to boost customer subscription inclination through campaign contacts, our recommended strategy lies in striking a balance between extended, more impactful personal interactions and minimizing the overall count of contacts. This balancing act is pivotal in optimizing the bank’s approach to attract more subscriptions.

References#


  • Dataset source: Moro, S., Rita, P., and Cortez, P. (2012). Bank Marketing. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306

  • Other references that I found helpful:

1. https://www.geeksforgeeks.org/replace-the-column-contains-the-values-yes-and-no-with-true-and-false-in-python-pandas/
2. https://note.nkmk.me/en/python-pandas-dtype-select/
3. https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
4. https://altair-viz.github.io/user_guide/large_datasets.html
5. https://pandas.pydata.org/docs/reference/api/pandas.Series.reset_index.html
6. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html
7. https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html