Predicting Income Above a Threshold#

Author: Timothy Cho

Course Project, UC Irvine, Math 10, F23

Introduction#

We have one main goal in this section: given a set of people and a little information about their background (work, education, and hours worked per week), can we predict with high accuracy whether someone makes more than $50000 per year in income? Our first approach will be numerical, similar to what we did in Math 10. We will then explore the categorical pieces of the data and learn how to solve a classification problem whose inputs are categorical rather than numerical.

Setting up the Data#

In this section, we import the Adult dataset from the UC Irvine Machine Learning Repository and convert the data into our DataFrame df. Next, we clean out any missing values, so that our data will be ready for classification later.

We start by importing the necessary modules.

!pip install ucimlrepo
Requirement already satisfied: ucimlrepo in /root/venv/lib/python3.9/site-packages (0.0.3)

import pandas as pd
import numpy as np
import altair as alt

Next, we load the data from the UCI repository and convert it to a DataFrame.

from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
adult = fetch_ucirepo(id=2)
# data (as pandas dataframes) 
X = adult.data.features 
y = adult.data.targets 

df = pd.DataFrame(data = X)
df['income'] = y
df.sample(20) # take a look at the data
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country income
25956 32 Private 27207 10th 6 Never-married Craft-repair Not-in-family White Male 0 0 50 United-States <=50K
20391 36 Local-gov 251091 Some-college 10 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States >50K
171 28 State-gov 175325 HS-grad 9 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States <=50K
33047 20 Private 273989 HS-grad 9 Never-married Transport-moving Own-child White Male 0 0 40 United-States <=50K.
39628 36 Local-gov 254202 Prof-school 15 Divorced Prof-specialty Unmarried White Female 0 0 24 Germany <=50K.
13797 30 Private 54929 HS-grad 9 Married-civ-spouse Sales Husband White Male 0 0 55 United-States <=50K
46905 33 Private 133861 HS-grad 9 Married-civ-spouse Craft-repair Husband White Male 0 0 62 United-States <=50K.
35796 21 Private 287681 HS-grad 9 Never-married Machine-op-inspct Own-child White Male 0 0 40 Mexico <=50K.
905 46 Private 171550 HS-grad 9 Divorced Machine-op-inspct Not-in-family White Female 0 0 38 United-States <=50K
12903 50 Private 209320 Bachelors 13 Married-civ-spouse Sales Husband White Male 0 0 40 United-States >50K
13038 58 Private 214502 9th 5 Married-civ-spouse Handlers-cleaners Husband White Male 0 0 50 United-States >50K
216 50 Private 313321 Assoc-acdm 12 Divorced Sales Not-in-family White Female 0 0 40 United-States <=50K
28843 22 Private 218343 HS-grad 9 Never-married Other-service Own-child White Female 0 0 40 United-States <=50K
37620 27 Private 315640 Masters 14 Never-married Prof-specialty Not-in-family Asian-Pac-Islander Male 0 0 20 China <=50K.
26259 18 ? 91670 Some-college 10 Never-married ? Own-child Asian-Pac-Islander Female 0 0 40 United-States <=50K
13299 44 Private 214838 HS-grad 9 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 45 United-States >50K
47874 27 Private 160291 Some-college 10 Never-married Adm-clerical Unmarried Black Female 0 0 40 Germany <=50K.
21459 26 Private 113571 HS-grad 9 Never-married Transport-moving Not-in-family White Male 0 0 70 United-States <=50K
35292 34 Self-emp-not-inc 264351 HS-grad 9 Married-civ-spouse Farming-fishing Husband White Male 0 0 40 Mexico <=50K.
21063 64 State-gov 111795 Bachelors 13 Married-civ-spouse Craft-repair Husband White Male 0 0 45 United-States >50K

Cleaning the Data#

As we can see from the data sample above, our DataFrame df is quite messy: it contains a lot of categorical/nominal data with confusing category names, as well as many missing values (some loaded as NaN, others marked with a ? placeholder). Additionally, the df['income'] column should be binary, but the way the data is encoded makes it an object. Our goal in this section is to clean up the data, so that clearly binary categories are stored with True/False values. We will also get rid of all rows with missing values.
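Before dropping anything, it helps to see how much is actually missing. A minimal check, assuming the missing entries were loaded as NaN (the reduced row count after dropna below suggests they were):

df.isna().sum() # count the missing values in each column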

df = df.dropna() # drops any row with a missing value
df.sample(10)
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country income
15150 29 Private 213842 Some-college 10 Married-civ-spouse Craft-repair Husband White Male 0 0 40 United-States <=50K
11767 38 State-gov 346766 HS-grad 9 Divorced Adm-clerical Unmarried White Female 0 0 40 United-States <=50K
1551 17 Private 130806 10th 6 Never-married Handlers-cleaners Own-child White Male 0 0 24 United-States <=50K
4052 31 Private 225053 HS-grad 9 Divorced Transport-moving Not-in-family White Male 0 0 45 United-States <=50K
20249 80 Private 252466 Assoc-voc 11 Married-civ-spouse Craft-repair Husband White Male 0 0 24 United-States <=50K
32027 46 Local-gov 140219 Masters 14 Divorced Prof-specialty Not-in-family White Female 8614 0 55 United-States >50K
18858 53 Private 96062 Some-college 10 Married-civ-spouse Exec-managerial Husband White Male 0 1740 40 United-States <=50K
36115 37 Private 108282 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 45 United-States <=50K.
5269 59 Private 43221 9th 5 Married-civ-spouse Transport-moving Husband White Male 0 0 60 United-States >50K
19163 19 Private 170800 HS-grad 9 Never-married Craft-repair Own-child White Male 0 0 60 United-States <=50K

The income column has the issue that <=50K and <=50K. represent the same value (and similarly >50K and >50K.).
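A quick check confirms which spellings actually occur (we assume, based on the samples above, that these four are the only variants):

df['income'].value_counts() # expect <=50K, <=50K., >50K, and >50K.

We now convert this column into a True/False binary column.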

income = df['income'].str.contains(">50K")
df['income>50k'] = income
df.sample(10)
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country income income>50k
31093 45 State-gov 144351 Masters 14 Married-civ-spouse Prof-specialty Husband White Male 0 0 40 United-States >50K True
14049 38 Private 117528 Bachelors 13 Never-married Other-service Other-relative White Female 0 0 45 United-States <=50K False
24214 20 Private 316043 11th 7 Never-married Other-service Own-child Black Male 594 0 20 United-States <=50K False
35309 24 Private 99697 HS-grad 9 Never-married Handlers-cleaners Own-child White Female 0 0 40 United-States <=50K. False
36406 55 Private 119344 HS-grad 9 Married-civ-spouse Prof-specialty Own-child White Female 0 0 40 United-States <=50K. False
15892 90 Private 88991 Bachelors 13 Married-civ-spouse Exec-managerial Wife White Female 0 0 40 England >50K True
15279 52 Self-emp-inc 334273 Doctorate 16 Married-civ-spouse Prof-specialty Husband White Male 99999 0 65 United-States >50K True
23945 21 Private 163333 Some-college 10 Never-married Other-service Own-child White Female 0 0 35 United-States <=50K False
18463 74 Private 188709 Prof-school 15 Married-civ-spouse Prof-specialty Husband White Male 99999 0 50 United-States >50K True
13845 65 ? 105017 Bachelors 13 Married-civ-spouse ? Husband White Male 0 0 40 United-States <=50K False

The old df['income'] column is redundant now, so we drop it from our DataFrame.

df = df.drop('income', axis=1)
df
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country income>50k
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States False
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States False
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States False
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States False
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48836 33 Private 245211 Bachelors 13 Never-married Prof-specialty Own-child White Male 0 0 40 United-States False
48837 39 Private 215419 Bachelors 13 Divorced Prof-specialty Not-in-family White Female 0 0 36 United-States False
48839 38 Private 374983 Bachelors 13 Married-civ-spouse Prof-specialty Husband White Male 0 0 50 United-States False
48840 44 Private 83891 Bachelors 13 Divorced Adm-clerical Own-child Asian-Pac-Islander Male 5455 0 40 United-States False
48841 35 Self-emp-inc 182148 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 60 United-States True

47621 rows × 15 columns

Next, since the sex column in this dataset is binary, we also convert it into a True/False column.

df['is-male'] = df['sex'].str.contains('Male')
df = df.drop('sex', axis=1)
df.sample(10)
age workclass fnlwgt education education-num marital-status occupation relationship race capital-gain capital-loss hours-per-week native-country income>50k is-male
3275 58 Private 241056 Some-college 10 Divorced Adm-clerical Unmarried White 0 0 46 United-States False False
22806 46 Private 403911 Assoc-voc 11 Married-civ-spouse Craft-repair Husband Black 0 0 40 United-States True True
19170 52 State-gov 314627 Masters 14 Divorced Prof-specialty Not-in-family Asian-Pac-Islander 0 0 40 United-States False False
23728 44 Private 152150 Assoc-acdm 12 Separated Exec-managerial Not-in-family White 0 0 40 United-States False True
24312 20 State-gov 147280 HS-grad 9 Never-married Other-service Other-relative Other 0 0 40 United-States False True
9352 41 State-gov 116520 Doctorate 16 Married-civ-spouse Prof-specialty Husband White 0 0 40 United-States True True
19123 20 Private 62865 HS-grad 9 Never-married Priv-house-serv Not-in-family White 0 0 45 United-States False False
19610 22 Private 340543 HS-grad 9 Never-married Tech-support Not-in-family White 0 0 40 United-States False False
35840 19 Private 382688 10th 6 Never-married Handlers-cleaners Not-in-family White 0 0 20 United-States False True
38961 57 Private 157271 11th 7 Divorced Other-service Not-in-family Black 0 0 54 United-States False True

Now, our data is ready to be analyzed.

Time Investment versus Income: Logistic Regression#

In this section, we examine whether a person’s education level and the number of hours they work per week determine whether they make more than $50000 a year. This is a standard binary classification problem, like the ones we have been doing in Math 10; later, we will use the categorical data to make classifications instead, and we will encounter some new techniques when we get there.

Intuitively, we should expect that the more hours a person works and the higher their education level, the more likely their income is to be above $50000. That is, the more time they invest in educating themselves and going to work, the higher we expect their income to be. Let us see if this is actually the case.

Visualizing the Data#

Unfortunately, our dataset is far too large for Deepnote to render as a single chart. To work around this, we will make several plots by randomly drawing 500 samples, 5 separate times.

cols = ['education-num', 'hours-per-week'] # these are the columns we are interested in
rng = np.random.default_rng(seed=63)
seed_list = [rng.integers(1, 10**6) for i in range(5)]
n = 500
circ_size = 60
chart_list = []
# generate our 5 randomized charts
for s in seed_list:
    chart_list.append(alt.Chart(df.sample(n, random_state=s)).mark_circle(size=circ_size).encode(
        x = cols[0],
        y = cols[1],
        color = 'income>50k:N'
        ).properties(title=f'Education Level/Hours per Week, 500 samples, seed={s}'))
alt.hconcat(*chart_list)

As we can see, there seems to be a slight correlation between education-num and income: we see more orange points towards the right of each chart. On the other hand, the number of hours worked per week does not seem to affect whether someone makes more than $50000.
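We can quantify this impression on the full DataFrame (rather than a 500-row sample) with a quick groupby; a minimal sketch:

# fraction of people making >50K at each education level
df.groupby('education-num')['income>50k'].mean()

Now, let us implement logistic regression to see the exact correlations.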

Analysis#

We now fit our data to the logistic regression model. We will let 90% of the data be training data, and the remaining 10% be testing data.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
clf = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(df[cols], df['income>50k'], test_size=0.1, random_state=63)
clf.fit(X_train, y_train)
LogisticRegression()

Let us see some information about what we got for clf. Remember that we set cols = ['education-num', 'hours-per-week'].

clf.classes_
array([False,  True])
clf.coef_
array([[0.34405795, 0.0401318 ]])
clf.intercept_
array([-6.49326522])

The above tells us that if \(x_1\) is the education level and \(x_2\) is the number of hours worked per week, then the probability that someone gets paid over $50000 a year is given approximately by the function

$$f(x_1, x_2) = \frac{1}{1+\exp(-0.344x_1 - 0.040x_2 + 6.493)}.$$
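As a sanity check, we can evaluate this sigmoid by hand and compare it against clf.predict_proba; the inputs below (education level 13, a 40-hour week) are just an arbitrary example.

# manually evaluate the fitted sigmoid for one hypothetical person
x1, x2 = 13, 40
z = clf.coef_[0, 0]*x1 + clf.coef_[0, 1]*x2 + clf.intercept_[0]
print(1/(1 + np.exp(-z)))
# the second column of predict_proba (the probability of True) should match
print(clf.predict_proba(pd.DataFrame([[x1, x2]], columns=cols))[0, 1])

Now, let us see how well our model performs on the test data.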

clf.score(X_test, y_test)
0.7772412345160613

The performance of our model is not bad, but also not great: the correlation is weaker than what we might have intuitively expected. Looking back at the charts we made, it seemed as if the number of hours worked barely had an effect on income. What if we tried removing the hours-per-week column and running the regression again?

clf2 = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(df[['education-num']], df['income>50k'], test_size=0.1, random_state=63)
clf2.fit(X_train, y_train)
LogisticRegression()

For this new model, we see that the coefficient for the education level is still around 0.35:

clf2.coef_
array([[0.36081249]])
clf2.intercept_
array([-4.97816245])

How well does this perform on the test set?

clf2.score(X_test, y_test)
0.7753516691161033

This score is about the same as when we included the number of hours worked. Hence, there might be a correlation between education level and a person’s income, but this correlation seems a lot weaker than we expected. What if we approach this dataset using the categorical data instead?

Types of Education versus Income: A Categorical Approach#

In the last section, we saw that when we plugged in the education level of a person (education-num), our models were able to predict with around 77% accuracy whether a person made more than $50000 a year. This is much better than random guessing, but 77% accuracy does not suggest a very strong correlation. One thing the education-num column leaves out, however, is the type of school the person attended: for example, we would expect a PhD to have different job opportunities than someone who went to a professional training school, yet this distinction is completely missing from the numeric data we were provided. Can we get a stronger correlation using the categorical education column instead?

Of course, we cannot use logistic regression here, as our inputs are nominal data. We also face a second problem: the standard DecisionTreeClassifier we learned in Math 10, which we plan to use, does not accept categorical data directly, so we cannot fit the education column into the classifier as-is. Instead, we use the pandas function get_dummies to convert our education data into a set of binary indicator variables that DecisionTreeClassifier can properly process. This technique is known as one-hot encoding.
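To see what get_dummies does on a small scale, here is a toy example with made-up values:

# toy illustration of one-hot encoding on a hypothetical mini-series
pd.get_dummies(pd.Series(['HS-grad', 'Bachelors', 'HS-grad']))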

from sklearn.tree import DecisionTreeClassifier

Making Our Data Understandable to DecisionTreeClassifier: One-Hot Encoding#

In the code below, we one-hot encode our education data. Notice that each unique entry in the education column becomes its own column in the DataFrame below, and the single 1 in each row marks the category that appeared in the original DataFrame.

encoded = pd.get_dummies(df['education'])
encoded
10th 11th 12th 1st-4th 5th-6th 7th-8th 9th Assoc-acdm Assoc-voc Bachelors Doctorate HS-grad Masters Preschool Prof-school Some-college
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
3 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48836 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
48837 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
48839 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
48840 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
48841 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0

47621 rows × 16 columns

Analysis#

Now, our data is ready to be analyzed. There is one slight technicality that we need to be careful with: notice that encoded is a DataFrame in its own right, so our input features should be drawn from encoded, while our prediction target is still the column df['income>50k']. We will set the max_depth of our classifier clf3 to 4 to prevent overfitting.
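One quick sanity check before fitting: encoded was built from df, so the two should share the same row index, which keeps the features and the target aligned through train_test_split.

# confirm the one-hot features and the target share the same index
print(encoded.index.equals(df.index))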

clf3 = DecisionTreeClassifier(max_depth=4)
X_train, X_test, y_train, y_test = train_test_split(encoded, df['income>50k'], test_size=0.1, random_state=60)
clf3.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=4)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Let us visualize how our classifier makes decisions below. We will write a function for this, as we will repeat the same code several times later on. The code below for plotting the decision tree was adapted from the class notes.

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# helper to visualize a fitted decision tree
def plot(classifier):
    classes = [str(cls) for cls in classifier.classes_]
    plot_tree(classifier,
              feature_names=classifier.feature_names_in_,
              class_names=classes,
              filled=True)

plot(clf3)
(Figure: decision tree for clf3, splitting on the one-hot encoded education columns.)

We should remember that the categories in the education column are mutually exclusive, so each split effectively asks whether the person holds a particular degree. With that in mind, the decision tree’s choices seem logical: the tree first checks whether the person’s highest education level is a Bachelors. If not, it checks for higher and higher degrees, up to the Doctorate. If the tree finds that the person holds none of the degrees it checks for, it renders a False decision.
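We can corroborate this reading by ranking the one-hot columns by the importance the tree assigned to them; a short sketch using scikit-learn’s feature_importances_ attribute:

# rank the education indicator columns by their importance in clf3
importances = pd.Series(clf3.feature_importances_, index=clf3.feature_names_in_)
importances.sort_values(ascending=False).head()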

Now, let us check its accuracy on the test set.

clf3.score(X_test, y_test)
0.7751417174049968

This accuracy is almost the same as what we obtained with logistic regression. Hence, whether we approach the problem categorically or numerically, education let us predict with around 77% accuracy whether a person made more than $50000 a year, so we are confident in saying that there is a slight correlation between education and income. We also notice a curiosity: the accuracy scores of our decision tree and our logistic classifier agree to three decimal places. It seems the decision tree behaved a lot like the logistic classifier in this case.

Does the Job Worked Affect Income?#

In this short section, we apply the techniques from the previous section to see whether the type of job a person works really determines whether they make more than $50000.

Preparing Our Data#

As before, we need to one-hot encode our data in order to use DecisionTreeClassifier.

encoded = pd.get_dummies(df['occupation'])
encoded
? Adm-clerical Armed-Forces Craft-repair Exec-managerial Farming-fishing Handlers-cleaners Machine-op-inspct Other-service Priv-house-serv Prof-specialty Protective-serv Sales Tech-support Transport-moving
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48836 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
48837 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
48839 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
48840 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
48841 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

47621 rows × 15 columns

We have a slight issue here: one of the columns in our one-hot encoded data is labeled ?, which corresponds to missing occupation entries rather than an actual job type. Hence, we are justified in dropping that column to clean the data.

encoded = encoded.drop("?", axis=1)

Now, our data is ready for analysis.

Analysis#

We do the same steps as we did in the previous section.

clf4 = DecisionTreeClassifier(max_depth=6)
X_train, X_test, y_train, y_test = train_test_split(encoded, df['income>50k'], test_size=0.1, random_state=57)
clf4.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=6)

Let us take a look at how the decisions were made.

plot(clf4)
(Figure: decision tree for clf4, splitting on the one-hot encoded occupation columns.)

Immediately, we should see that something is off about the way decisions are being made: the tree renders a False decision for every job. This suggests that occupation alone is not a great way to determine someone’s income, at least for the jobs listed here. However, there are still patterns: the tree is less sure about its False decision when the person works an executive/managerial or a professional/specialty position, which makes intuitive sense.
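To check this impression against the raw data, we can compute the empirical share of high earners in each occupation; a minimal sketch:

# fraction of people making >50K within each occupation
df.groupby('occupation')['income>50k'].mean().sort_values(ascending=False)

Let us now see the classifier’s score on the test data.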

clf4.score(X_test, y_test)
0.7579256770942683

Perhaps surprisingly, assigning a False decision to everyone performs far better than randomly guessing with 50% probability of being correct; this is because most people in the dataset make at most $50000, so the majority-class prediction is usually right. The correlation between job type and income, however, still seems slight rather than genuinely predictive.
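We can make this comparison precise by computing the majority-class baseline, i.e., the accuracy of a trivial classifier that predicts False for everyone:

# fraction of people making at most $50000 = accuracy of always predicting False
(~df['income>50k']).mean()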

Conclusion#

In this dataset, we saw two factors that have some impact on income: the job worked and the education level. However, the correlation was fairly weak in both cases: our accuracy on the test data was around 75% to 77%, which beats a 50/50 guess at whether someone makes more than $50000 a year, but is still not great compared to some of the examples we saw in class. We still expect a real relationship to exist. If the dataset gave income as a numeric value, we could fit a regression model that predicts income directly, judging its performance by a continuous measure such as mean squared error rather than a single accuracy score. Hence, our conclusion is that while classification works out okay here, this problem may be better addressed by a regression model, where we predict the income itself instead of determining whether it passes a certain threshold.

References#


  • Class notes (source of the decision tree plotting code); all other sources are linked above.