Analyzing the Interplay Between Meteorological Factors and Urban Air Quality#

Author: Boyu Ren

Course Project, UC Irvine, Math 10, F23

Introduction#

In this data analysis project, we examine the relationship between air quality and meteorological conditions. The core aim of this investigation is to assess how weather factors such as temperature affect the concentration of air pollutants, with a specific focus on nitrogen oxides (NOx), which are key indicators of air pollution and have significant implications for environmental health and public policy. With increasing awareness of climate change and its impact on urban living, this analysis could be useful for city planners, environmentalists, and health professionals who seek to design effective interventions to improve air quality.

Main Part#

Import packages and data#

import pandas as pd
import altair as alt
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
from sklearn.linear_model import LinearRegression
from pandas.api.types import is_numeric_dtype
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

Check the dataset’s shape, understand data types for each column, identify missing values#

air_quality_data = pd.read_csv('AirQualityUCI.csv',delimiter=';')

# Inspect the first few rows and the general information of the dataset
air_quality_data.head()
Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH Unnamed: 15 Unnamed: 16
0 10/03/2004 18.00.00 2,6 1360.0 150.0 11,9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13,6 48,9 0,7578 NaN NaN
1 10/03/2004 19.00.00 2 1292.0 112.0 9,4 955.0 103.0 1174.0 92.0 1559.0 972.0 13,3 47,7 0,7255 NaN NaN
2 10/03/2004 20.00.00 2,2 1402.0 88.0 9,0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11,9 54,0 0,7502 NaN NaN
3 10/03/2004 21.00.00 2,2 1376.0 80.0 9,2 948.0 172.0 1092.0 122.0 1584.0 1203.0 11,0 60,0 0,7867 NaN NaN
4 10/03/2004 22.00.00 1,6 1272.0 51.0 6,5 836.0 131.0 1205.0 116.0 1490.0 1110.0 11,2 59,6 0,7888 NaN NaN
air_quality_data.info
<bound method DataFrame.info of             Date      Time CO(GT)  PT08.S1(CO)  NMHC(GT) C6H6(GT)  \
0     10/03/2004  18.00.00    2,6       1360.0     150.0     11,9   
1     10/03/2004  19.00.00      2       1292.0     112.0      9,4   
2     10/03/2004  20.00.00    2,2       1402.0      88.0      9,0   
3     10/03/2004  21.00.00    2,2       1376.0      80.0      9,2   
4     10/03/2004  22.00.00    1,6       1272.0      51.0      6,5   
...          ...       ...    ...          ...       ...      ...   
9466         NaN       NaN    NaN          NaN       NaN      NaN   
9467         NaN       NaN    NaN          NaN       NaN      NaN   
9468         NaN       NaN    NaN          NaN       NaN      NaN   
9469         NaN       NaN    NaN          NaN       NaN      NaN   
9470         NaN       NaN    NaN          NaN       NaN      NaN   

      PT08.S2(NMHC)  NOx(GT)  PT08.S3(NOx)  NO2(GT)  PT08.S4(NO2)  \
0            1046.0    166.0        1056.0    113.0        1692.0   
1             955.0    103.0        1174.0     92.0        1559.0   
2             939.0    131.0        1140.0    114.0        1555.0   
3             948.0    172.0        1092.0    122.0        1584.0   
4             836.0    131.0        1205.0    116.0        1490.0   
...             ...      ...           ...      ...           ...   
9466            NaN      NaN           NaN      NaN           NaN   
9467            NaN      NaN           NaN      NaN           NaN   
9468            NaN      NaN           NaN      NaN           NaN   
9469            NaN      NaN           NaN      NaN           NaN   
9470            NaN      NaN           NaN      NaN           NaN   

      PT08.S5(O3)     T    RH      AH  Unnamed: 15  Unnamed: 16  
0          1268.0  13,6  48,9  0,7578          NaN          NaN  
1           972.0  13,3  47,7  0,7255          NaN          NaN  
2          1074.0  11,9  54,0  0,7502          NaN          NaN  
3          1203.0  11,0  60,0  0,7867          NaN          NaN  
4          1110.0  11,2  59,6  0,7888          NaN          NaN  
...           ...   ...   ...     ...          ...          ...  
9466          NaN   NaN   NaN     NaN          NaN          NaN  
9467          NaN   NaN   NaN     NaN          NaN          NaN  
9468          NaN   NaN   NaN     NaN          NaN          NaN  
9469          NaN   NaN   NaN     NaN          NaN          NaN  
9470          NaN   NaN   NaN     NaN          NaN          NaN  

[9471 rows x 17 columns]>

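Note that info is written without parentheses here, so the output is the printed DataFrame rather than the summary that air_quality_data.info() produces, which lists the data type and non-null count of each column and helps identify missing values. The printout still shows that the dataset has 9471 rows and 17 columns, that the trailing rows are entirely NaN, and that the two Unnamed columns hold NaN in every row shown.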
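As a small illustration of the summary that DataFrame.info() prints when it is actually called, here is a minimal sketch on a toy frame. The values are made up for illustration and only mimic the structure of the real file (a string column, a numeric column, and an all-NaN column).

```python
import io
import pandas as pd

# Toy frame (hypothetical values) with a string column, a numeric column,
# and an all-NaN column, mimicking the structure of the raw file.
toy = pd.DataFrame({
    "CO(GT)": ["2,6", "2", None],
    "NOx(GT)": [166.0, 103.0, None],
    "Unnamed: 15": [float("nan")] * 3,
})

# info() prints to stdout by default; pass a buffer to capture the text.
buffer = io.StringIO()
toy.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# Missing values per column, as a complement to the summary.
missing_counts = toy.isna().sum()
print(missing_counts)
```

The per-column missing counts make it easy to spot columns such as the Unnamed ones that carry no data at all.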

After inspecting the data, I decided to remove the unnecessary columns. Following a suggestion from ChatGPT, I converted the columns that use a comma as the decimal separator to float, which makes the data easier to handle in later analysis. I also noticed that date and time are stored in separate columns, which is inconvenient for time-based visualizations, so I combined them into a single DateTime column.

air_quality_data_cleaned = air_quality_data.drop(columns=['Unnamed: 15', 'Unnamed: 16'])

cols_to_convert = ['CO(GT)', 'C6H6(GT)', 'T', 'RH', 'AH']
for col in cols_to_convert:
    air_quality_data_cleaned[col] = air_quality_data_cleaned[col].str.replace(',', '.').astype(float)

air_quality_data_cleaned['DateTime'] = pd.to_datetime(air_quality_data_cleaned['Date'] + ' ' + air_quality_data_cleaned['Time'], format='%d/%m/%Y %H.%M.%S')
air_quality_data_cleaned = air_quality_data_cleaned.drop(columns=['Date', 'Time'])

air_quality_data_cleaned.head()
CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH DateTime
0 2.6 1360.0 150.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578 2004-03-10 18:00:00
1 2.0 1292.0 112.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255 2004-03-10 19:00:00
2 2.2 1402.0 88.0 9.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.9 54.0 0.7502 2004-03-10 20:00:00
3 2.2 1376.0 80.0 9.2 948.0 172.0 1092.0 122.0 1584.0 1203.0 11.0 60.0 0.7867 2004-03-10 21:00:00
4 1.6 1272.0 51.0 6.5 836.0 131.0 1205.0 116.0 1490.0 1110.0 11.2 59.6 0.7888 2004-03-10 22:00:00
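An alternative to converting the comma decimals column by column is to let the parser handle them at load time: pandas' read_csv accepts a decimal argument. The sketch below assumes a short in-memory sample in the same layout as the raw file (only a subset of the columns is included, and the variable names are illustrative), so it runs without the CSV.

```python
import io
import pandas as pd

# Short sample in the same layout as the raw file (subset of columns).
sample = (
    "Date;Time;CO(GT);T;RH;AH\n"
    "10/03/2004;18.00.00;2,6;13,6;48,9;0,7578\n"
    "10/03/2004;19.00.00;2;13,3;47,7;0,7255\n"
)

# decimal=',' tells the parser to treat the comma as the decimal mark.
df = pd.read_csv(io.StringIO(sample), sep=";", decimal=",")

# Combine the date and time strings and parse them with an explicit format.
df["DateTime"] = pd.to_datetime(
    df["Date"] + " " + df["Time"], format="%d/%m/%Y %H.%M.%S"
)
df = df.drop(columns=["Date", "Time"])
print(df.dtypes)
```

With this approach the numeric columns arrive as floats directly, so the explicit list of columns to convert is no longer needed.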
air_quality_data_cleaned.describe()
CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
count 9357.000000 9357.000000 9357.000000 9357.000000 9357.000000 9357.000000 9357.000000 9357.000000 9357.000000 9357.000000 9357.000000 9357.000000 9357.000000
mean -34.207524 1048.990061 -159.090093 1.865683 894.595276 168.616971 794.990168 58.148873 1391.479641 975.072032 9.778305 39.485380 -6.837604
std 77.657170 329.832710 139.789093 41.380206 342.333252 257.433866 321.993552 126.940455 467.210125 456.938184 43.203623 51.216145 38.976670
min -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000
25% 0.600000 921.000000 -200.000000 4.000000 711.000000 50.000000 637.000000 53.000000 1185.000000 700.000000 10.900000 34.100000 0.692300
50% 1.500000 1053.000000 -200.000000 7.900000 895.000000 141.000000 794.000000 96.000000 1446.000000 942.000000 17.200000 48.600000 0.976800
75% 2.600000 1221.000000 -200.000000 13.600000 1105.000000 284.000000 960.000000 133.000000 1662.000000 1255.000000 24.100000 61.900000 1.296200
max 11.900000 2040.000000 1189.000000 63.700000 2214.000000 1479.000000 2683.000000 340.000000 2775.000000 2523.000000 44.600000 88.700000 2.231000

Count: There are 9357 observations for each variable. Mean and Standard Deviation: These values normally give an idea of the central tendency and dispersion of each variable, but here they are distorted by placeholder values; the mean of CO(GT), for example, is negative.

We observe that the minimum value of every variable is -200, which the dataset uses to mark missing values. For simplicity, we drop every row that has a missing value in any column.

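# Iterate over the column names; -200 is the placeholder for missing values
key_variables = air_quality_data_cleaned.columns
air_quality_data_filtered = air_quality_data_cleaned

# Keep only the rows in which no column equals -200
for var in key_variables:
    air_quality_data_filtered = air_quality_data_filtered[air_quality_data_filtered[var] != -200]
# Drop the remaining rows that contain NaN (the empty rows at the end of the file)
air_quality_data_filtered = air_quality_data_filtered.dropna()
air_quality_data_filtered.shape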
(827, 14)

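The data is now clean and ready for further analysis, although only 827 of the 9357 recorded observations remain after dropping every row that contains a placeholder value.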
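In the describe() output, the 25%, 50%, and 75% quantiles of NMHC(GT) are all -200, so that column is missing in most rows, which likely accounts for much of the reduction to 827 rows. A possible alternative, sketched here on toy data with made-up values, is to convert -200 to NaN, count the gaps per column, and drop rows only when the variables actually used in the model are missing.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); -200 marks a missing reading.
toy = pd.DataFrame({
    "T": [13.6, 13.3, -200.0, 11.0],
    "NOx(GT)": [166.0, -200.0, 131.0, 172.0],
    "NMHC(GT)": [150.0, -200.0, -200.0, -200.0],
})

# Turn the placeholder into a real missing value.
with_nan = toy.replace(-200, np.nan)

# Count missing readings per column to see which column drives the loss.
missing_per_column = with_nan.isna().sum()
print(missing_per_column)

# Option A: drop rows with a gap in any column.
strict = with_nan.dropna()

# Option B: drop rows only when the modelled variables are missing.
lenient = with_nan.dropna(subset=["T", "NOx(GT)"])

print(len(strict), len(lenient))
```

Option B keeps more observations for a temperature-versus-NOx model, at the cost of leaving gaps in columns that the model does not use.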

alt.data_transformers.disable_max_rows()

alt.Chart(air_quality_data_filtered).mark_bar().encode(
    alt.X('NOx(GT)', bin=True),
    alt.Y('count()'),
    tooltip=['NOx(GT)', 'count()']
).properties(
    title='Histogram of NOx Levels'
)

Most NOx concentration values fall in the lower bins, so the histogram is right-skewed: high NOx levels are uncommon, and the most extreme values are outliers or occur infrequently.
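The visual impression of skew can also be checked numerically. The sketch below uses a short made-up series rather than the real NOx column: a positive sample skewness and a mean above the median both point to a long right tail.

```python
import pandas as pd

# Hypothetical concentrations: mostly low values with a few large ones.
values = pd.Series([10, 20, 20, 30, 30, 30, 40, 50, 200, 400], name="NOx(GT)")

skewness = values.skew()   # sample skewness; positive means a right tail
mean_value = values.mean()
median_value = values.median()

print(f"skewness={skewness:.2f} mean={mean_value:.1f} median={median_value:.1f}")
```

On the real data the same three calls could be applied to air_quality_data_filtered['NOx(GT)'].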

alt.Chart(air_quality_data_filtered).mark_boxplot().encode(
    alt.X('T'),
    tooltip=['T']
).properties(
    title='Boxplot of Temperature'
)

The boxplot indicates that most temperature values fall within a relatively narrow interval.

base = alt.Chart(air_quality_data_filtered).encode(
    alt.X('DateTime:T')
)

line_NOx = base.mark_line(color='blue').encode(
    alt.Y('NOx(GT)'),
    tooltip=['DateTime:T', 'NOx(GT)']
)

line_Temp = base.mark_line(color='red').encode(
    alt.Y('T'),
    tooltip=['DateTime:T', 'T']
)

chart = alt.layer(line_NOx, line_Temp).resolve_scale(
    y='independent'
).properties(
    title='Time Series of NOx Levels and Temperature'
)
chart

The time series plot shows how NOx levels and temperature fluctuate over time. NOx levels vary considerably, with several peaks that indicate episodes of high pollution. No clear correlation between temperature and NOx is visible in this plot. However, NOx appears to follow a cyclical pattern that is not directly mirrored by temperature.
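One way to probe a suspected cyclical pattern is to average the series by hour of day and to resample it to daily means. The sketch below runs on a synthetic hourly series with an artificial daily cycle (all values are hypothetical), so it only illustrates the technique and says nothing about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series (hypothetical): two peaks per day plus noise.
rng = np.random.default_rng(0)
index = pd.date_range(
    "2004-03-10 18:00", periods=24 * 14, freq=pd.Timedelta(hours=1)
)
peak_hours = np.isin(index.hour, [8, 9, 18, 19])
signal = 150 + 80 * peak_hours + rng.normal(0, 10, len(index))
series = pd.Series(signal, index=index, name="NOx(GT)")

# Average by hour of day: a daily cycle shows up as a non-flat profile.
hourly_profile = series.groupby(series.index.hour).mean()

# Daily means smooth out the within-day cycle and expose slower trends.
daily_mean = series.resample("D").mean()

print(hourly_profile.idxmax())
print(daily_mean.head())
```

Applied to the filtered data, the same grouping would need the DateTime column set as the index first.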

Now, let’s run a regression analysis to investigate further.

X = air_quality_data_filtered[['T']]
y = air_quality_data_filtered['NOx(GT)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
Mean Absolute Error: 61.246507897236604
Mean Squared Error: 6017.096172269701
scatter_plot = alt.Chart(air_quality_data_filtered).mark_circle().encode(
    x='T',
    y='NOx(GT)',
    tooltip=['T', 'NOx(GT)']
)

regression_line = scatter_plot.transform_regression('T', 'NOx(GT)').mark_line()

scatter_plot + regression_line

Here we have a scatter plot along with a regression line. The data points are widely scattered, but the upward trend of the regression line suggests that higher temperatures are generally associated with higher NOx levels. Nonetheless, the broad spread of the points around the line indicates that temperature alone does not predict NOx levels with high precision, and other factors are also at play.

The figure shows that temperature and NOx are only weakly related, so a linear model on temperature alone cannot predict NOx well: on the test set the mean absolute error is about 61 and the root mean squared error is about 78 (the square root of the MSE of 6017). This is consistent with underfitting. Underfitting occurs when a model is too simple to capture the underlying patterns in the data, and as a result it performs poorly on both the training and testing datasets. Only the test error was computed here, so the training error would have to be checked to confirm it.

A single indicator in a weather system may be influenced by multiple variables, so there is not just one relationship to consider.

To understand the relationships among the key indicators, I use a correlation matrix heatmap, which presents them clearly.
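Underfitting is usually diagnosed by comparing errors on the training and test splits: similar, high errors on both suggest a model that is too simple. The sketch below shows that comparison on synthetic data (a weak linear signal in heavy noise, with hypothetical values), using the same split settings as above. The helper name report is an illustrative choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in (hypothetical): a weak linear signal buried in noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 40, size=(500, 1))
y = 100 + 2.0 * X[:, 0] + rng.normal(0, 80, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
reg = LinearRegression().fit(X_train, y_train)

def report(name, X_part, y_part):
    """Print MAE, RMSE, and R^2 for one split and return them."""
    pred = reg.predict(X_part)
    mae = mean_absolute_error(y_part, pred)
    rmse = np.sqrt(mean_squared_error(y_part, pred))
    r2 = r2_score(y_part, pred)
    print(f"{name}: MAE={mae:.1f} RMSE={rmse:.1f} R^2={r2:.3f}")
    return mae, rmse, r2

train_scores = report("train", X_train, y_train)
test_scores = report("test", X_test, y_test)
```

In the project itself, the same function could be called with the real X_train, y_train, X_test, and y_test after fitting reg.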

correlation_matrix = air_quality_data_filtered[['T', 'RH', 'AH', 'NOx(GT)', 'CO(GT)', 'NO2(GT)', 'C6H6(GT)']].corr()
correlation_matrix
T RH AH NOx(GT) CO(GT) NO2(GT) C6H6(GT)
T 1.000000 -0.769869 0.159964 0.238395 0.318261 0.406807 0.418409
RH -0.769869 1.000000 0.475776 -0.041975 -0.105157 -0.223033 -0.178410
AH 0.159964 0.475776 1.000000 0.270679 0.295591 0.214559 0.313415
NOx(GT) 0.238395 -0.041975 0.270679 1.000000 0.951342 0.857425 0.927304
CO(GT) 0.318261 -0.105157 0.295591 0.951342 1.000000 0.861432 0.972660
NO2(GT) 0.406807 -0.223033 0.214559 0.857425 0.861432 1.000000 0.846743
C6H6(GT) 0.418409 -0.178410 0.313415 0.927304 0.972660 0.846743 1.000000
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Meteorological Factors and Pollutants')
plt.show()
[Figure: heatmap titled 'Correlation Matrix of Meteorological Factors and Pollutants']

The heat map visualizes the correlation coefficients between various meteorological factors and pollutants. A darker red color indicates a stronger positive correlation and a darker blue color indicates a stronger negative correlation. The strong positive correlation between different pollutants (e.g. NOx, CO, NO2 and C6H6) indicates that when the level of one pollutant is high, the level of other pollutants tends to be high as well.

The correlation coefficient between NOx levels and temperature is 0.24, a weak positive correlation. This agrees with the earlier observation that higher temperatures are generally associated with higher NOx levels, but it also shows that the association is weak.
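To put the coefficient in perspective: for a simple linear regression with one predictor, the in-sample R^2 equals the squared Pearson correlation. Squaring the reported 0.238395 gives about 0.057, so temperature would explain under 6% of the variance in NOx on the data the correlation was computed from. The sketch below prints that value and checks the identity on synthetic data with hypothetical values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Correlation between T and NOx(GT), copied from the matrix above.
r_reported = 0.238395
print(f"r^2 = {r_reported ** 2:.4f}")  # share of variance explained

# Check the identity R^2 == r^2 on synthetic data (hypothetical values).
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

r = np.corrcoef(x, y)[0, 1]
features = x.reshape(-1, 1)
r_squared_from_fit = LinearRegression().fit(features, y).score(features, y)

print(np.isclose(r ** 2, r_squared_from_fit))
```

The identity holds only in-sample and with a single predictor; the regression above was fitted on a training split, so its test-set R^2 may differ.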

Summary#

Structure Summary: My research began with importing the analytical libraries and the dataset, followed by a preliminary data review. The data was then cleaned: extraneous columns were removed, data types were normalized, and rows with missing values were excluded. In the exploratory data analysis, a histogram and a boxplot provided visual insight into the distributions of NOx and temperature, and a time series plot showed temporal fluctuations and a possible cyclical pattern in NOx. The impact of temperature on NOx levels was assessed with a linear regression model, and inter-variable correlations were shown with a correlation matrix and heatmap. The Altair charts include tooltips for interactive exploration.

Analysis Summary: The analysis found fluctuating NOx levels and a weak positive association with temperature (a correlation of about 0.24), suggesting that higher temperatures may be associated with higher pollution levels. The correlation heatmap indicated strong interdependence among pollutants, pointing to shared sources of pollution. The regression model showed that temperature alone predicts NOx poorly, highlighting the complexity of air quality problems. The interactive plots highlighted the dynamic nature of air pollution, emphasizing the importance of taking several factors into account for successful air quality management.

References#


  • What is the source of your dataset(s)?

From UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/dataset/360/air+quality

  • List any other references that you found helpful.

I asked ChatGPT for help, as noted above where its suggestion was used: converting columns from strings to numeric values, particularly columns that use a comma as the decimal separator.

The correlation heatmap is adapted from https://www.geeksforgeeks.org/how-to-create-a-seaborn-correlation-heatmap-in-python/