Gender Inequality#

Author: Khanh Linh Bui

Course Project, UC Irvine, Math 10, F23

Introduction#

My project examines gender inequality worldwide by presenting and comparing the Gender Inequality Index (GII) across countries. I also fit a linear regression that predicts the GII from female secondary education, and I train classifiers that assign countries to human development levels based on the indicators in the dataset. Together, these steps aim to give a clearer picture of gender disparities and human development patterns around the world.

Gender Inequality Around The World#

I. Data Cleaning#

# import library
import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression

df = pd.read_csv("Gender_Inequality_Index.csv")
df = df.dropna()
df = df.drop_duplicates()
df.head(5)

	Country	Human_development	GII	Rank	Maternal_mortality	Adolescent_birth_rate	Seats_parliament	F_secondary_educ	M_secondary_educ	F_Labour_force	M_Labour_force
0	Switzerland	Very high	0.018	3.0	5.0	2.2	39.8	96.9	97.5	61.7	72.7
1	Norway	Very high	0.016	2.0	2.0	2.3	45.0	99.1	99.3	60.3	72.0
2	Iceland	Very high	0.043	8.0	4.0	5.4	47.6	99.8	99.7	61.7	70.5
4	Australia	Very high	0.073	19.0	6.0	8.1	37.9	94.6	94.4	61.1	70.5
5	Denmark	Very high	0.013	1.0	4.0	1.9	39.7	95.1	95.2	57.7	66.7

df.shape

(170, 11)
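Before dropping rows, it can help to see how many rows each cleaning step removes. The sketch below uses a small made-up frame rather than the actual CSV: the column names match the dataset, but the values and the name `raw` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw CSV; values are illustrative only.
raw = pd.DataFrame({
    "Country": ["A", "B", "B", "C"],
    "GII": [0.10, 0.25, 0.25, np.nan],
    "F_secondary_educ": [90.0, 70.0, 70.0, 40.0],
})

missing_per_column = raw.isna().sum()      # NaN count for each column
after_dropna = raw.dropna()                # drops country C (GII missing)
cleaned = after_dropna.drop_duplicates()   # drops the repeated row for B

print(missing_per_column)
print(len(raw), len(after_dropna), len(cleaned))
```

Running the same three steps on the real file would show how many countries are lost to missing values before the 170 rows above remain.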

II. Data Visualization#

df.dtypes

Country                   object
Human_development         object
GII                      float64
Rank                     float64
Maternal_mortality       float64
Adolescent_birth_rate    float64
Seats_parliament         float64
F_secondary_educ         float64
M_secondary_educ         float64
F_Labour_force           float64
M_Labour_force           float64
dtype: object

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 170 entries, 0 to 190
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Country                170 non-null    object 
 1   Human_development      170 non-null    object 
 2   GII                    170 non-null    float64
 3   Rank                   170 non-null    float64
 4   Maternal_mortality     170 non-null    float64
 5   Adolescent_birth_rate  170 non-null    float64
 6   Seats_parliament       170 non-null    float64
 7   F_secondary_educ       170 non-null    float64
 8   M_secondary_educ       170 non-null    float64
 9   F_Labour_force         170 non-null    float64
 10  M_Labour_force         170 non-null    float64
dtypes: float64(9), object(2)
memory usage: 15.9+ KB

df['Human_development'].unique()

array(['Very high', 'High', 'Medium', 'Low'], dtype=object)

a) Overview of dataset#

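# Plot the mean GII per development level; without an aggregate,
# every country would contribute its own bar within its level.
c1 = alt.Chart(df).mark_bar().encode(
    x = alt.X('Human_development:N', sort=['Very high', 'High', 'Medium', 'Low']),
    y = 'mean(GII):Q',
    color = "Human_development:N"
)
c1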

The lower the human development level, the higher the Gender Inequality Index, and vice versa.
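The same comparison can be checked numerically with a group mean. This is a minimal sketch on six rows copied from the tables in this notebook (three "Very high" and three "Low" countries), so it only illustrates the calculation; the name `sample` is an assumption.

```python
import pandas as pd

# A few rows copied from the tables in this notebook.
sample = pd.DataFrame({
    "Country": ["Switzerland", "Norway", "Iceland", "Burundi", "Niger", "Chad"],
    "Human_development": ["Very high"] * 3 + ["Low"] * 3,
    "GII": [0.018, 0.016, 0.043, 0.505, 0.611, 0.652],
})

# Mean GII per development level, highest inequality first.
mean_gii = (
    sample.groupby("Human_development")["GII"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_gii)
```

Replacing `sample` with the full `df` would give the mean GII for all four development levels.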

fig1 = go.Figure(data=go.Choropleth(
    locations = df['Country'],
    locationmode = 'country names',
    z = df['GII'],
    colorscale = 'RdBu',
    colorbar_title = 'GII',
))
fig1.update_layout(
    title_text='Gender Inequality Index'
)

fig1

A geographic map of worldwide GII

b) Correlation between data#

First, we convert Human Development into integers so that it can be compared with the other numeric data.

development_mapping = {
    "Very high": 1,
    "High": 2,
    "Medium": 3,
    "Low": 4
}

df['Human_development_int'] = df['Human_development'].map(development_mapping)

df

	Country	Human_development	GII	Rank	Maternal_mortality	Adolescent_birth_rate	Seats_parliament	F_secondary_educ	M_secondary_educ	F_Labour_force	M_Labour_force	Human_development_int
0	Switzerland	Very high	0.018	3.0	5.0	2.2	39.8	96.9	97.5	61.7	72.7	1
1	Norway	Very high	0.016	2.0	2.0	2.3	45.0	99.1	99.3	60.3	72.0	1
2	Iceland	Very high	0.043	8.0	4.0	5.4	47.6	99.8	99.7	61.7	70.5	1
4	Australia	Very high	0.073	19.0	6.0	8.1	37.9	94.6	94.4	61.1	70.5	1
5	Denmark	Very high	0.013	1.0	4.0	1.9	39.7	95.1	95.2	57.7	66.7	1
...	...	...	...	...	...	...	...	...	...	...	...	...
186	Burundi	Low	0.505	127.0	548.0	53.6	38.9	7.8	13.0	79.0	77.4	4
187	Central African Republic	Low	0.672	166.0	829.0	160.5	12.9	13.9	31.6	63.3	79.5	4
188	Niger	Low	0.611	153.0	509.0	170.5	25.9	9.2	15.2	61.7	84.3	4
189	Chad	Low	0.652	165.0	1140.0	138.3	32.3	7.7	24.4	46.9	69.9	4
190	South Sudan	Low	0.587	150.0	1150.0	99.2	32.3	26.5	36.4	70.4	73.6	4

170 rows × 12 columns
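An alternative to a hand-written dictionary is an ordered categorical, which keeps the labels and their order together. The sketch below is one possible variation, not the approach used in this notebook; the names `levels`, `s`, `ordered` and `as_int` are assumptions.

```python
import pandas as pd

levels = ["Very high", "High", "Medium", "Low"]
s = pd.Series(["Very high", "Low", "Medium", "High", "Low"])

# Ordered categorical: keeps the labels and records their order.
ordered = pd.Series(pd.Categorical(s, categories=levels, ordered=True))

# .cat.codes are 0-based, so add 1 to match the 1..4 mapping used above.
as_int = ordered.cat.codes + 1
print(as_int.tolist())
```

The resulting integers agree with `development_mapping`, with 1 for "Very high" and 4 for "Low".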

# Same comparison with the integer-coded level treated as an ordered category.
HD = alt.Chart(df).mark_bar().encode(
    x = 'Human_development_int:O',
    y = 'mean(GII):Q',
    color = "Human_development"
)
HD

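# Restrict to numeric columns; Country and Human_development are strings.
df_cor = df.select_dtypes(include='number').corr()
df_cor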

	GII	Rank	Maternal_mortality	Adolescent_birth_rate	Seats_parliament	F_secondary_educ	M_secondary_educ	F_Labour_force	M_Labour_force	Human_development_int
GII	1.000000	0.996755	0.713515	0.806791	-0.424116	-0.809278	-0.782130	-0.070970	0.158270	0.861164
Rank	0.996755	1.000000	0.733578	0.820780	-0.419945	-0.811686	-0.781625	-0.050753	0.160078	0.865994
Maternal_mortality	0.713515	0.733578	1.000000	0.752769	-0.162221	-0.698014	-0.642572	0.230511	0.106192	0.748021
Adolescent_birth_rate	0.806791	0.820780	0.752769	1.000000	-0.094736	-0.728725	-0.691760	0.260471	0.263851	0.798041
Seats_parliament	-0.424116	-0.419945	-0.162221	-0.094736	1.000000	0.169483	0.168823	0.279015	0.059662	-0.176859
F_secondary_educ	-0.809278	-0.811686	-0.698014	-0.728725	0.169483	1.000000	0.972921	-0.098711	-0.269952	-0.835847
M_secondary_educ	-0.782130	-0.781625	-0.642572	-0.691760	0.168823	0.972921	1.000000	-0.083176	-0.283553	-0.803273
F_Labour_force	-0.070970	-0.050753	0.230511	0.260471	0.279015	-0.098711	-0.083176	1.000000	0.426839	0.081366
M_Labour_force	0.158270	0.160078	0.106192	0.263851	0.059662	-0.269952	-0.283553	0.426839	1.000000	0.156376
Human_development_int	0.861164	0.865994	0.748021	0.798041	-0.176859	-0.835847	-0.803273	0.081366	0.156376	1.000000
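To read the table more easily, the GII row can be sorted by absolute correlation. The values below are copied from the table above; dropping `Rank` is a judgment call on my part, since it appears to be the country's GII ranking rather than an independent indicator.

```python
import pandas as pd

# GII row of the correlation table above (GII itself left out).
corr_with_gii = pd.Series({
    "Rank": 0.996755,
    "Maternal_mortality": 0.713515,
    "Adolescent_birth_rate": 0.806791,
    "Seats_parliament": -0.424116,
    "F_secondary_educ": -0.809278,
    "M_secondary_educ": -0.782130,
    "F_Labour_force": -0.070970,
    "M_Labour_force": 0.158270,
    "Human_development_int": 0.861164,
})

# Rank closely tracks GII, so drop it before ranking candidate predictors.
candidates = corr_with_gii.drop("Rank")
ranked = candidates.reindex(candidates.abs().sort_values(ascending=False).index)
print(ranked)
```

In this ordering, female secondary education is the strongest single indicator after the development level itself, and the two labour force columns are the weakest.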

charts = []

for column in df.columns[4:12]:
    chart = alt.Chart(df).mark_circle().encode(
        x=alt.X(column, title=column),
        y='GII',
        color='Human_development_int:N',
        tooltip=['Country', 'GII', column]
    ).properties(
        title=f'{column} vs GII by Country'
    )
    charts.append(chart)

final_chart = alt.hconcat(*charts)
final_chart

In summary, the charts show a clear inverse relationship between country development and the Gender Inequality Index: countries with higher human development have a lower GII. In the labour force plots, female participation rates sit further to the left than male rates, which means that a smaller share of women than men are in the labour force in many countries. This may point to weaker access to paid work for women, although participation rates alone do not measure job security.
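The labour force comparison can also be expressed as a per-country gap. This is a small sketch on five rows copied from the tables in this notebook; the `gap` column and the name `labour` are assumptions introduced for illustration.

```python
import pandas as pd

# Rows copied from the tables in this notebook.
labour = pd.DataFrame({
    "Country": ["Switzerland", "Norway", "Iceland", "Burundi", "Chad"],
    "F_Labour_force": [61.7, 60.3, 61.7, 79.0, 46.9],
    "M_Labour_force": [72.7, 72.0, 70.5, 77.4, 69.9],
})

# Positive gap: male participation exceeds female participation.
labour["gap"] = labour["M_Labour_force"] - labour["F_Labour_force"]
share_positive = (labour["gap"] > 0).mean()
print(labour[["Country", "gap"]])
print(share_positive)
```

Among these five rows, Burundi is the only country where female participation is higher than male participation. Applying the same calculation to the full `df` would show how common that is.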

Machine Learning#

1) Gender Inequality Index Prediction#

This section uses scikit-learn to fit a linear regression that predicts the Gender Inequality Index from female secondary education.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

X = df[['F_secondary_educ']]
y = df['GII']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

Mean Squared Error: 0.013371863289269646
R-squared: 0.6810502795300843
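Two follow-up checks make these numbers easier to interpret: the root mean squared error, which is in GII units, and the fitted slope and intercept. The sketch below takes the MSE printed above, then fits a separate demonstration model on the ten rows displayed in this notebook. Those rows are the top and bottom of the table, so `demo` is an assumption for illustration and its coefficients are not those of `model`.

```python
import math
import numpy as np
from sklearn.linear_model import LinearRegression

# RMSE is in GII units, which is easier to read than the MSE.
mse = 0.013371863289269646          # value printed above
rmse = math.sqrt(mse)
print(round(rmse, 3))

# The ten rows displayed in this notebook (not the full dataset).
f_educ = np.array([96.9, 99.1, 99.8, 94.6, 95.1, 7.8, 13.9, 9.2, 7.7, 26.5])
gii = np.array([0.018, 0.016, 0.043, 0.073, 0.013, 0.505, 0.672, 0.611, 0.652, 0.587])

demo = LinearRegression().fit(f_educ.reshape(-1, 1), gii)
print(demo.coef_[0], demo.intercept_)  # slope: change in GII per percentage point
```

For the actual model, `model.coef_[0]` and `model.intercept_` give the corresponding values. A negative slope means that predicted GII falls as female secondary education rises.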

plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.xlabel('Female Secondary Education')
plt.ylabel('Gender Inequality Index')
plt.title('Linear Regression: Female Secondary Education vs GII')
plt.show()

[Figure: test-set scatter plot of Female Secondary Education against Gender Inequality Index, with the fitted regression line]

2) Human Development Classification#

Machine learning lets us classify a country’s human development level from several criteria, such as maternal mortality. The main goal is to predict or classify a country’s development level in the future, since these metrics change as countries develop over time.

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import validation_curve

df1 = pd.read_csv('Gender_Inequality_Index.csv',index_col=0)
df1 = df1.dropna()
df1 = df1.drop_duplicates()
df1

	Human_development	GII	Rank	Maternal_mortality	Adolescent_birth_rate	Seats_parliament	F_secondary_educ	M_secondary_educ	F_Labour_force	M_Labour_force
Country
Switzerland	Very high	0.018	3.0	5.0	2.2	39.8	96.9	97.5	61.7	72.7
Norway	Very high	0.016	2.0	2.0	2.3	45.0	99.1	99.3	60.3	72.0
Iceland	Very high	0.043	8.0	4.0	5.4	47.6	99.8	99.7	61.7	70.5
Australia	Very high	0.073	19.0	6.0	8.1	37.9	94.6	94.4	61.1	70.5
Denmark	Very high	0.013	1.0	4.0	1.9	39.7	95.1	95.2	57.7	66.7
...	...	...	...	...	...	...	...	...	...	...
Burundi	Low	0.505	127.0	548.0	53.6	38.9	7.8	13.0	79.0	77.4
Central African Republic	Low	0.672	166.0	829.0	160.5	12.9	13.9	31.6	63.3	79.5
Niger	Low	0.611	153.0	509.0	170.5	25.9	9.2	15.2	61.7	84.3
Chad	Low	0.652	165.0	1140.0	138.3	32.3	7.7	24.4	46.9	69.9
South Sudan	Low	0.587	150.0	1150.0	99.2	32.3	26.5	36.4	70.4	73.6

170 rows × 10 columns

development_mapping = {
    "Very high":1,
    "High": 2,
    "Medium": 3,
    "Low": 4
}

df1['Human_development_int'] = df1['Human_development'].map(development_mapping)
df1

	Human_development	GII	Rank	Maternal_mortality	Adolescent_birth_rate	Seats_parliament	F_secondary_educ	M_secondary_educ	F_Labour_force	M_Labour_force	Human_development_int
Country
Switzerland	Very high	0.018	3.0	5.0	2.2	39.8	96.9	97.5	61.7	72.7	1
Norway	Very high	0.016	2.0	2.0	2.3	45.0	99.1	99.3	60.3	72.0	1
Iceland	Very high	0.043	8.0	4.0	5.4	47.6	99.8	99.7	61.7	70.5	1
Australia	Very high	0.073	19.0	6.0	8.1	37.9	94.6	94.4	61.1	70.5	1
Denmark	Very high	0.013	1.0	4.0	1.9	39.7	95.1	95.2	57.7	66.7	1
...	...	...	...	...	...	...	...	...	...	...	...
Burundi	Low	0.505	127.0	548.0	53.6	38.9	7.8	13.0	79.0	77.4	4
Central African Republic	Low	0.672	166.0	829.0	160.5	12.9	13.9	31.6	63.3	79.5	4
Niger	Low	0.611	153.0	509.0	170.5	25.9	9.2	15.2	61.7	84.3	4
Chad	Low	0.652	165.0	1140.0	138.3	32.3	7.7	24.4	46.9	69.9	4
South Sudan	Low	0.587	150.0	1150.0	99.2	32.3	26.5	36.4	70.4	73.6	4

170 rows × 11 columns

df1.drop(columns='Human_development',inplace=True)
df1

	GII	Rank	Maternal_mortality	Adolescent_birth_rate	Seats_parliament	F_secondary_educ	M_secondary_educ	F_Labour_force	M_Labour_force	Human_development_int
Country
Switzerland	0.018	3.0	5.0	2.2	39.8	96.9	97.5	61.7	72.7	1
Norway	0.016	2.0	2.0	2.3	45.0	99.1	99.3	60.3	72.0	1
Iceland	0.043	8.0	4.0	5.4	47.6	99.8	99.7	61.7	70.5	1
Australia	0.073	19.0	6.0	8.1	37.9	94.6	94.4	61.1	70.5	1
Denmark	0.013	1.0	4.0	1.9	39.7	95.1	95.2	57.7	66.7	1
...	...	...	...	...	...	...	...	...	...	...
Burundi	0.505	127.0	548.0	53.6	38.9	7.8	13.0	79.0	77.4	4
Central African Republic	0.672	166.0	829.0	160.5	12.9	13.9	31.6	63.3	79.5	4
Niger	0.611	153.0	509.0	170.5	25.9	9.2	15.2	61.7	84.3	4
Chad	0.652	165.0	1140.0	138.3	32.3	7.7	24.4	46.9	69.9	4
South Sudan	0.587	150.0	1150.0	99.2	32.3	26.5	36.4	70.4	73.6	4

170 rows × 10 columns

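# Features: every column except the target; target: the integer-coded development level.
x_df1, y_df1 = df1.iloc[:, :-1], df1['Human_development_int']
# No random_state is set, so the split and the scores below vary between runs.
x_train, x_test, y_train, y_test = train_test_split(x_df1, y_df1, train_size=0.6)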

#Decision Tree Classifier
dtc = DecisionTreeClassifier(max_leaf_nodes = 9, random_state = 27)
dtc.fit(x_train,y_train)

DecisionTreeClassifier(max_leaf_nodes=9, random_state=27)

dtc.score(x_train,y_train)

0.9117647058823529

dtc.score(x_test,y_test)

0.6764705882352942
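Because a single 60/40 split on 170 rows gives a noisy estimate, cross-validation is one way to get a steadier accuracy figure. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification` with the same shape as the project data (170 rows, 9 features, 4 classes), not the real dataset, and the names `X_demo`, `y_demo`, `tree` and `scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as the project data:
# 170 rows, 9 features, 4 classes. Not the real dataset.
X_demo, y_demo = make_classification(
    n_samples=170, n_features=9, n_informative=5, n_redundant=0,
    n_classes=4, random_state=0,
)

tree = DecisionTreeClassifier(max_leaf_nodes=9, random_state=27)
scores = cross_val_score(tree, X_demo, y_demo, cv=5)  # accuracy per fold
print(scores.round(3), scores.mean().round(3))
```

With the real data, `cross_val_score(dtc, x_df1, y_df1, cv=5)` would be the equivalent call.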

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# A larger figure keeps the node labels readable.
fig = plt.figure(figsize=(16, 8))
plot_tree(dtc, feature_names=list(x_train.columns), class_names=['1','2','3','4'], filled=True)

[Text(0.375, 0.9, 'Rank <= 82.5\ngini = 0.714\nsamples = 102\nvalue = [41, 26, 19, 16]\nclass = 1'),
 Text(0.16666666666666666, 0.7, 'Maternal_mortality <= 22.5\ngini = 0.404\nsamples = 57\nvalue = [41, 16, 0, 0]\nclass = 1'),
 Text(0.08333333333333333, 0.5, 'gini = 0.184\nsamples = 39\nvalue = [35, 4, 0, 0]\nclass = 1'),
 Text(0.25, 0.5, 'Seats_parliament <= 20.8\ngini = 0.444\nsamples = 18\nvalue = [6, 12, 0, 0]\nclass = 2'),
 Text(0.16666666666666666, 0.3, 'M_Labour_force <= 68.5\ngini = 0.408\nsamples = 7\nvalue = [5, 2, 0, 0]\nclass = 1'),
 Text(0.08333333333333333, 0.1, 'gini = 0.0\nsamples = 2\nvalue = [0, 2, 0, 0]\nclass = 2'),
 Text(0.25, 0.1, 'gini = 0.0\nsamples = 5\nvalue = [5, 0, 0, 0]\nclass = 1'),
 Text(0.3333333333333333, 0.3, 'gini = 0.165\nsamples = 11\nvalue = [1, 10, 0, 0]\nclass = 2'),
 Text(0.5833333333333334, 0.7, 'F_secondary_educ <= 27.75\ngini = 0.646\nsamples = 45\nvalue = [0, 10, 19, 16]\nclass = 3'),
 Text(0.5, 0.5, 'gini = 0.0\nsamples = 14\nvalue = [0, 0, 0, 14]\nclass = 4'),
 Text(0.6666666666666666, 0.5, 'Maternal_mortality <= 119.0\ngini = 0.516\nsamples = 31\nvalue = [0, 10, 19, 2]\nclass = 3'),
 Text(0.5, 0.3, 'M_secondary_educ <= 52.7\ngini = 0.444\nsamples = 12\nvalue = [0, 8, 4, 0]\nclass = 2'),
 Text(0.4166666666666667, 0.1, 'gini = 0.444\nsamples = 6\nvalue = [0, 2, 4, 0]\nclass = 3'),
 Text(0.5833333333333334, 0.1, 'gini = 0.0\nsamples = 6\nvalue = [0, 6, 0, 0]\nclass = 2'),
 Text(0.8333333333333334, 0.3, 'Rank <= 155.0\ngini = 0.355\nsamples = 19\nvalue = [0, 2, 15, 2]\nclass = 3'),
 Text(0.75, 0.1, 'gini = 0.208\nsamples = 17\nvalue = [0, 2, 15, 0]\nclass = 3'),
 Text(0.9166666666666666, 0.1, 'gini = 0.0\nsamples = 2\nvalue = [0, 0, 0, 2]\nclass = 4')]

[Figure: plot of the fitted decision tree]

# K-Nearest Neighbors
# A validation curve over n_neighbors helps choose a value that neither overfits nor underfits.
train_scores_KNN1, test_scores_KNN1 = validation_curve(
    KNeighborsClassifier(), x_train, y_train,
    param_name='n_neighbors', param_range=[1, 5, 7, 9, 11], cv=4
)
print('avg train acc for each param val:',train_scores_KNN1.mean(axis=1).round(3))
print('avg test acc for each param val:',test_scores_KNN1.mean(axis=1).round(3))

avg train acc for each param val: [1.    0.82  0.794 0.755 0.729]
avg test acc for each param val: [0.667 0.696 0.697 0.716 0.677]
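The printed averages can be turned into an explicit selection rule. The sketch below copies the values from the output above and picks the `n_neighbors` with the highest mean validation accuracy; the variable names are assumptions, and the gap between training and validation accuracy is only a rough overfitting indicator.

```python
# Values copied from the validation-curve output above.
param_range = [1, 5, 7, 9, 11]
mean_train = [1.0, 0.82, 0.794, 0.755, 0.729]
mean_valid = [0.667, 0.696, 0.697, 0.716, 0.677]

# Pick the n_neighbors with the highest mean validation accuracy.
best_idx = max(range(len(param_range)), key=lambda i: mean_valid[i])
best_k = param_range[best_idx]

# Train/validation gap for each candidate, as a rough overfitting check.
gaps = [round(t - v, 3) for t, v in zip(mean_train, mean_valid)]
print(best_k, gaps)
```

By this criterion alone the choice would be `n_neighbors=9`. The value 7 used in this notebook has a validation accuracy (0.697) close to that maximum (0.716).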

knn=KNeighborsClassifier(n_neighbors=7)
knn.fit(x_train,y_train)
knn.score(x_train,y_train)

0.8137254901960784

knn.score(x_test,y_test)

0.6617647058823529
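KNN relies on distances, so features with large ranges can dominate: Maternal_mortality exceeds 1000 in the table above, while GII stays below 1. One possible variation, which is not part of the original analysis, is to standardize the features inside a pipeline. The sketch below uses synthetic data from `make_classification` with one column inflated to mimic that scale difference; all names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (170 rows, 9 features, 4 classes), not the real data.
X_demo, y_demo = make_classification(
    n_samples=170, n_features=9, n_informative=5, n_redundant=0,
    n_classes=4, random_state=0,
)
X_demo[:, 0] *= 1000  # mimic one feature on a much larger scale

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.6, random_state=0
)

# The scaler is fitted on the training rows only, then applied to the test rows.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
scaled_knn.fit(X_tr, y_tr)
print(scaled_knn.score(X_tr, y_tr), scaled_knn.score(X_te, y_te))
```

Whether scaling helps on the real data would need to be checked by fitting the same pipeline on `x_train` and `y_train`.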

Comparing the two methods, the Decision Tree Classifier overfits more: its training accuracy (0.91) is well above its test accuracy (0.68). K-Nearest Neighbors with n_neighbors=7 shows a smaller gap (0.81 on the training data, 0.66 on the test data), although its test accuracy is slightly lower than the tree's. With human development as the target variable, KNN appears to be the more stable method for predicting future human development classes from the same predictor variables.
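The overfitting comparison comes down to the difference between training and test accuracy for each model. A minimal sketch using the accuracies reported above (the dictionary keys and variable names are assumptions):

```python
# Accuracies reported above (training, test).
scores = {
    "decision_tree": (0.9117647058823529, 0.6764705882352942),
    "knn_7": (0.8137254901960784, 0.6617647058823529),
}

for name, (train_acc, test_acc) in scores.items():
    gap = train_acc - test_acc  # larger gap suggests more overfitting
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")

tree_gap = scores["decision_tree"][0] - scores["decision_tree"][1]
knn_gap = scores["knn_7"][0] - scores["knn_7"][1]
```

The tree's gap is roughly 0.24 and KNN's is roughly 0.15. Since the train/test split is not seeded, these figures will shift somewhat from run to run.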

References#

The Gender Inequality Index dataset used in this project comes from Kaggle:

https://www.kaggle.com/datasets/gianinamariapetrascu/gender-inequality-index

I used the following Plotly documentation as a reference for the map: https://plotly.com/python/map-configuration/