Avocado Analysis#
Author: Katie Kim
Course Project, UC Irvine, Math 10, F23
Introduction#
In this project, I am examining the Avocado dataset, which contains avocado prices, demand, regions, dates, etc. First, I am going to use linear regression in an attempt to predict an avocado’s price. Then, I am going to use KNeighborsClassifier to predict an avocado’s region of origin.
Data Cleaning#
# import necessary libraries
import pandas as pd
import altair as alt
import numpy as np
# load data
df = pd.read_csv('avocado.csv')
df.columns
Index(['Unnamed: 0', 'Date', 'AveragePrice', 'Total Volume', '4046', '4225',
'4770', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags', 'type',
'year', 'region'],
dtype='object')
df.sample(10)
 | Unnamed: 0 | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10146 | 32 | 2015-05-17 | 1.19 | 2363.41 | 111.73 | 870.72 | 0.00 | 1380.96 | 258.10 | 1122.86 | 0.00 | organic | 2015 | Indianapolis |
9449 | 11 | 2015-10-11 | 2.07 | 70004.38 | 14400.49 | 48933.21 | 3.10 | 6667.58 | 6607.58 | 60.00 | 0.00 | organic | 2015 | California |
7862 | 20 | 2017-08-13 | 1.29 | 821395.74 | 164086.95 | 569847.08 | 9914.90 | 77546.81 | 72666.82 | 1062.21 | 3817.78 | conventional | 2017 | SanFrancisco |
982 | 46 | 2015-02-08 | 0.70 | 1180723.02 | 525259.71 | 535768.93 | 5175.33 | 114519.05 | 103507.51 | 11011.54 | 0.00 | conventional | 2015 | Houston |
5246 | 46 | 2016-02-07 | 0.62 | 6847218.72 | 3614202.25 | 1364491.37 | 22097.49 | 1846427.61 | 740216.91 | 1106150.27 | 60.43 | conventional | 2016 | Southeast |
12929 | 8 | 2016-10-30 | 2.10 | 3125.55 | 150.44 | 1634.81 | 0.00 | 1340.30 | 1054.45 | 285.85 | 0.00 | organic | 2016 | Indianapolis |
17288 | 3 | 2017-12-10 | 1.72 | 5723.01 | 745.23 | 1060.12 | 0.00 | 3917.66 | 1241.12 | 2676.54 | 0.00 | organic | 2017 | StLouis |
12172 | 31 | 2016-05-22 | 1.57 | 18030.82 | 8.23 | 593.13 | 0.00 | 17429.46 | 15470.61 | 1958.85 | 0.00 | organic | 2016 | Boston |
8147 | 40 | 2017-03-26 | 1.09 | 93209.14 | 20638.90 | 24121.86 | 21.01 | 48427.37 | 48412.48 | 9.53 | 5.36 | conventional | 2017 | Spokane |
16343 | 12 | 2017-10-08 | 1.52 | 21089.70 | 10.89 | 194.84 | 0.00 | 20883.97 | 20879.53 | 4.44 | 0.00 | organic | 2017 | NorthernNewEngland |
# check what percentage of each column is NaN
{col: f'{df[col].isna().mean() * 100}%' for col in df.columns}
{'Unnamed: 0': '0.0%',
'Date': '0.0%',
'AveragePrice': '0.0%',
'Total Volume': '0.0%',
'4046': '0.0%',
'4225': '0.0%',
'4770': '0.0%',
'Total Bags': '0.0%',
'Small Bags': '0.0%',
'Large Bags': '0.0%',
'XLarge Bags': '0.0%',
'type': '0.0%',
'year': '0.0%',
'region': '0.0%'}
The data does not appear to have any missing values. Looking at the ‘type’ column, we have both conventional and organic avocados in our rows, and their prices will clearly differ even when the other properties are identical.
# groupby type, check average of the demand
grouped = df.groupby(['type']).mean()['Total Volume']
grouped
type
conventional 1.653213e+06
organic 4.781121e+04
Name: Total Volume, dtype: float64
# ratio of average of demand of each type (organic, conventional)
round((grouped['organic'] / grouped['conventional'] * 100), 2)
2.89
# percentage of number of rows that are organic to the total number of rows
round((len(df[(df['type']=='organic')]) / len(df) *100), 2)
49.99
# average price of conventional and organic
df.groupby(['type']).mean()['AveragePrice']
type
conventional 1.158040
organic 1.653999
Name: AveragePrice, dtype: float64
Organic avocados sell only about 3% of the volume that conventional avocados do, yet they make up about 50% of our rows. Because organic avocados account for such a small share of real-world sales volume, they are probably not a good sample for representing an arbitrary avocado. So it is probably best to analyze only a random sample of 5,000 conventional-avocado rows.
df.shape #18249 total rows
(18249, 14)
We would like to use mainly Altair for data visualization, but it looks like we have far too many rows: Altair refuses to plot more than 5,000 rows by default.
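(As an aside, Altair’s row limit can also be lifted instead of sampling; the one-line sketch below uses Altair’s data-transformer registry. We stick with sampling here because embedding all 18,249 rows would make the saved charts quite heavy.)

alt.data_transformers.disable_max_rows()  # optional: lift the 5000-row cap (not used in this project)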
# filter to conventional avocados only, then select 5000 random rows
# fix random_state so the results are reproducible
df = (df[(df['type']=='conventional')]).sample(5000, random_state=21)
Data Visualization, Quick Analysis#
df
 | Unnamed: 0 | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2276 | 40 | 2015-03-22 | 1.13 | 514196.67 | 192630.73 | 229939.77 | 8716.19 | 82909.98 | 82898.73 | 0.00 | 11.25 | conventional | 2015 | Seattle |
6582 | 12 | 2017-10-08 | 1.29 | 1042700.75 | 491561.89 | 146966.38 | 1449.29 | 402723.19 | 320849.72 | 81873.47 | 0.00 | conventional | 2017 | Houston |
3923 | 23 | 2016-07-17 | 1.08 | 283074.00 | 118910.27 | 52023.77 | 7495.98 | 104643.98 | 71629.96 | 31540.69 | 1473.33 | conventional | 2016 | LasVegas |
8291 | 25 | 2017-07-09 | 1.45 | 323653.24 | 211134.84 | 42947.50 | 137.48 | 69433.42 | 21337.25 | 37338.39 | 10757.78 | conventional | 2017 | Tampa |
1230 | 34 | 2015-05-03 | 1.03 | 95550.51 | 2031.68 | 66286.54 | 2641.74 | 24590.55 | 8761.52 | 15348.83 | 480.20 | conventional | 2015 | Louisville |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3920 | 20 | 2016-08-07 | 1.11 | 311533.30 | 144862.85 | 57549.16 | 7063.01 | 102058.28 | 64258.99 | 37604.85 | 194.44 | conventional | 2016 | LasVegas |
5059 | 15 | 2016-09-11 | 1.00 | 644049.31 | 154068.23 | 167761.10 | 2713.10 | 319506.88 | 319298.96 | 135.14 | 72.78 | conventional | 2016 | Seattle |
7249 | 43 | 2017-03-05 | 1.38 | 384828.85 | 9417.54 | 318710.70 | 4245.38 | 52455.23 | 52455.23 | 0.00 | 0.00 | conventional | 2017 | NorthernNewEngland |
8772 | 6 | 2018-02-11 | 1.00 | 856022.49 | 580099.01 | 76468.67 | 2153.49 | 197301.32 | 106427.69 | 90813.63 | 60.00 | conventional | 2018 | MiamiFtLauderdale |
4955 | 15 | 2016-09-11 | 1.08 | 498555.72 | 164685.23 | 144231.63 | 12611.44 | 177027.42 | 161784.53 | 15187.89 | 55.00 | conventional | 2016 | SanDiego |
5000 rows × 14 columns
df.describe()
 | Unnamed: 0 | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | year |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 5000.000000 | 5000.000000 | 5.000000e+03 | 5.000000e+03 | 5.000000e+03 | 5.000000e+03 | 5.000000e+03 | 5.000000e+03 | 5.000000e+03 | 5000.000000 | 5000.000000 |
mean | 24.315400 | 1.159906 | 1.782880e+06 | 6.257781e+05 | 6.183192e+05 | 4.899875e+04 | 4.897835e+05 | 3.758316e+05 | 1.071213e+05 | 6830.621146 | 2016.141000 |
std | 15.425586 | 0.263544 | 5.110268e+06 | 1.886614e+06 | 1.757329e+06 | 1.535306e+05 | 1.470644e+06 | 1.111716e+06 | 3.636076e+05 | 26063.996693 | 0.937067 |
min | 0.000000 | 0.460000 | 3.369968e+04 | 7.206000e+01 | 1.518000e+03 | 1.000000e+00 | 3.311770e+03 | 1.119180e+03 | 0.000000e+00 | 0.000000 | 2015.000000 |
25% | 10.000000 | 0.980000 | 2.005931e+05 | 3.195186e+04 | 5.033716e+04 | 5.802250e+02 | 5.730982e+04 | 4.304135e+04 | 2.456845e+03 | 0.000000 | 2015.000000 |
50% | 24.000000 | 1.130000 | 4.131249e+05 | 1.078396e+05 | 1.419972e+05 | 6.261165e+03 | 9.972553e+04 | 7.664851e+04 | 1.534843e+04 | 136.255000 | 2016.000000 |
75% | 38.000000 | 1.320000 | 1.109403e+06 | 3.866971e+05 | 4.286676e+05 | 2.245689e+04 | 3.186164e+05 | 2.346596e+05 | 6.115988e+04 | 2658.975000 | 2017.000000 |
max | 52.000000 | 2.220000 | 6.250565e+07 | 2.274362e+07 | 2.044550e+07 | 1.993645e+06 | 1.937313e+07 | 1.338459e+07 | 5.719097e+06 | 551693.650000 | 2018.000000 |
df.dtypes
Unnamed: 0 int64
Date object
AveragePrice float64
Total Volume float64
4046 float64
4225 float64
4770 float64
Total Bags float64
Small Bags float64
Large Bags float64
XLarge Bags float64
type object
year int64
region object
dtype: object
In this project, I am planning to use machine learning to:
1. predict the price of avocados
2. predict the region of origin of avocados
Let’s get a feel for the data with some visualization before we start fitting machine learning models.
# convert object into date time, extract month
df['Date'] = pd.to_datetime(df['Date'])
df['month'] = df['Date'].dt.month_name()
# relate time (date) sold and its prices
alt.Chart(df, title='Avocado Prices by Year/Month').mark_boxplot().encode(
x = 'month',
y = 'AveragePrice',
column = 'year',
tooltip = 'AveragePrice',
color="month"
)
# total volume and average price, check region as well with color
alt.Chart(df, title='Avocado Prices by Demand').mark_circle().encode(
x = 'Total Volume',
y = 'AveragePrice',
tooltip = 'region',
color="region"
)
Using the tooltip, it seems like the TotalUS rows are skewing the scatter plot. Let’s remove the region “TotalUS” and look again.
# removed 'total US' chart
temp = df[df['region'] != 'TotalUS']
alt.Chart(temp, title='Avocado Prices by Demand').mark_circle().encode(
x = 'Total Volume',
y = 'AveragePrice',
tooltip = 'region',
color="region"
)
Looks like linear regression might work out for this data. Also, looking at the color coding, it looks like we might be able to predict the region with the Total Volume and AveragePrice (and maybe date as well).
# build a data frame with the mean AveragePrice and mean Total Volume for each region
temp = df[df['region'] != 'TotalUS']
temp2 = pd.DataFrame(temp.groupby('region').mean()['AveragePrice'])
temp2['region'] = temp2.index
temp2['Total Volume'] = temp.groupby('region').mean()['Total Volume']
temp2
region | AveragePrice | region | Total Volume |
---|---|---|---|
Albany | 1.354100 | Albany | 9.349567e+04 |
Atlanta | 1.065361 | Atlanta | 5.072671e+05 |
BaltimoreWashington | 1.359767 | BaltimoreWashington | 7.704632e+05 |
Boise | 1.066105 | Boise | 8.218555e+04 |
Boston | 1.294943 | Boston | 5.685672e+05 |
BuffaloRochester | 1.384353 | BuffaloRochester | 1.268634e+05 |
California | 1.109725 | California | 5.955764e+06 |
Charlotte | 1.292222 | Charlotte | 2.005410e+05 |
Chicago | 1.344468 | Chicago | 7.620783e+05 |
CincinnatiDayton | 1.031034 | CincinnatiDayton | 2.485378e+05 |
Columbus | 1.070617 | Columbus | 1.675162e+05 |
DallasFtWorth | 0.844494 | DallasFtWorth | 1.206192e+06 |
Denver | 1.089615 | Denver | 7.891559e+05 |
Detroit | 1.121443 | Detroit | 3.519631e+05 |
GrandRapids | 1.329800 | GrandRapids | 1.766684e+05 |
GreatLakes | 1.176733 | GreatLakes | 3.382353e+06 |
HarrisburgScranton | 1.252222 | HarrisburgScranton | 2.389051e+05 |
HartfordSpringfield | 1.379341 | HartfordSpringfield | 2.990995e+05 |
Houston | 0.829130 | Houston | 1.166488e+06 |
Indianapolis | 1.166222 | Indianapolis | 1.718719e+05 |
Jacksonville | 1.202245 | Jacksonville | 1.652072e+05 |
LasVegas | 0.991047 | LasVegas | 3.227662e+05 |
LosAngeles | 0.983222 | LosAngeles | 2.915848e+06 |
Louisville | 1.106082 | Louisville | 9.166587e+04 |
MiamiFtLauderdale | 1.233902 | MiamiFtLauderdale | 5.774169e+05 |
Midsouth | 1.215435 | Midsouth | 2.894809e+06 |
Nashville | 1.040119 | Nashville | 1.953408e+05 |
NewOrleansMobile | 1.114409 | NewOrleansMobile | 2.619054e+05 |
NewYork | 1.386632 | NewYork | 1.386456e+06 |
Northeast | 1.353900 | Northeast | 4.064766e+06 |
NorthernNewEngland | 1.252308 | NorthernNewEngland | 4.144214e+05 |
Orlando | 1.230000 | Orlando | 3.354096e+05 |
Philadelphia | 1.401111 | Philadelphia | 4.042902e+05 |
PhoenixTucson | 0.712889 | PhoenixTucson | 1.181876e+06 |
Pittsburgh | 1.238632 | Pittsburgh | 1.051180e+05 |
Plains | 1.167182 | Plains | 1.784841e+06 |
Portland | 1.047959 | Portland | 6.222032e+05 |
RaleighGreensboro | 1.219072 | RaleighGreensboro | 2.826816e+05 |
RichmondNorfolk | 1.115455 | RichmondNorfolk | 2.414427e+05 |
Roanoke | 1.087204 | Roanoke | 1.399972e+05 |
Sacramento | 1.269691 | Sacramento | 4.290801e+05 |
SanDiego | 1.074615 | SanDiego | 5.125842e+05 |
SanFrancisco | 1.413444 | SanFrancisco | 7.823022e+05 |
Seattle | 1.161848 | Seattle | 6.157131e+05 |
SouthCarolina | 1.159184 | SouthCarolina | 3.486652e+05 |
SouthCentral | 0.874124 | SouthCentral | 5.887291e+06 |
Southeast | 1.153011 | Southeast | 3.524644e+06 |
Spokane | 1.120306 | Spokane | 9.016937e+04 |
StLouis | 1.184845 | StLouis | 1.828080e+05 |
Syracuse | 1.403537 | Syracuse | 6.096562e+04 |
Tampa | 1.186333 | Tampa | 3.891342e+05 |
West | 0.982529 | West | 6.146695e+06 |
WestTexNewMexico | 0.839195 | WestTexNewMexico | 8.430621e+05 |
# x: region, y: Average price, color: demand
us_avg_price = df.groupby('region').mean()['AveragePrice']['TotalUS']
dot = alt.Chart(temp2, title='Average of Price and Demand by Region').mark_circle(size=80).encode(
x = 'region',
y = 'AveragePrice',
tooltip = ['Total Volume', 'region'],
color = 'Total Volume'
)
line = alt.Chart(pd.DataFrame({'y':[us_avg_price]})).mark_rule().encode(
y='y'
)
dot+line
Machine Learning#
part 1. predict the price of avocados based on date, total volume (amount of avocados sold), and region#
–> numerical to numerical: try linear regression and polynomial regression

part 2. predict the region of origin of avocados based on total volume (amount of avocados sold) and average price

–> numerical to categorical: try k-nearest neighbors, logistic regression, random forest, KMeans

The plan is to use the features I think matter and to brute-force each of these methods.
Part 1. Prediction of Avocado Prices#
The input features I have selected are the date, total volume, and region:

- ‘Date’ is a datetime object, so the regression can’t process it directly; we have to convert it to a number using datetime.toordinal.
- ‘region’ is categorical data, but I still do want to use it for machine learning. So we are going to convert it into numbers using ‘LabelEncoder’ from ‘sklearn.preprocessing’ (a short sketch follows this list).
- Since we are going from numerical features to a numerical target, we are going to try linear regression and polynomial regression.
- Then we are going to measure the error on the ‘test’ dataset.
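Below is a minimal sketch of those two conversions, for illustration only; the names demo, date_numerical, and region_code are just example names, and the date conversion actually used later in this notebook is the .toordinal() map shown in the k-nearest-neighbors section.

# sketch: turn 'Date' and 'region' into numbers a regressor can use
from sklearn.preprocessing import LabelEncoder

demo = df.copy()
demo['date_numerical'] = demo['Date'].map(lambda d: d.toordinal())   # datetime -> ordinal day number
demo['region_code'] = LabelEncoder().fit_transform(demo['region'])   # region name -> integer label
demo[['Date', 'date_numerical', 'region', 'region_code']].head()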
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
linear regression only on total volume#
df_temp = df.copy()
df_temp = df_temp[df_temp['region'] != 'TotalUS']
# recall the chart
c = alt.Chart(df_temp).mark_circle().encode(
x = 'Total Volume',
y = 'AveragePrice'
)
c
# simple linear regression
reg = LinearRegression()
reg.fit(df_temp[['Total Volume']], df_temp['AveragePrice'])
df_temp['pred1'] = reg.predict(df_temp[['Total Volume']])
# linear regression graph
c1 = alt.Chart(df_temp).mark_line(strokeWidth=2.5).encode(
x = 'Total Volume',
y = 'pred1',
color = alt.value('blue')
)
c+c1
# polynomial linear regression
pipe = Pipeline(
[
('poly', PolynomialFeatures(degree=10, include_bias=False)),
('reg', LinearRegression())
]
)
pipe.fit(df_temp[['Total Volume']], df_temp['AveragePrice'])
df_temp['pred2'] = pipe.predict(df_temp[['Total Volume']])
# polynomial linear regression line + linear regression line
c2 = alt.Chart(df_temp).mark_line(strokeWidth=2.5).encode(
x = 'Total Volume',
y = 'pred2',
color=alt.value('red')
)
c+c2+c1
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
# print out each error value and its ratios
print('Mean Squared Error')
print('linear: ', mean_squared_error(df_temp['AveragePrice'], df_temp['pred1']))
print('poly: ', mean_squared_error(df_temp['AveragePrice'], df_temp['pred2']))
print(mean_squared_error(df_temp['AveragePrice'], df_temp['pred1']) / mean_squared_error(df_temp['AveragePrice'], df_temp['pred2']))
print('\n'+"Mean Absolute Error")
print('linear: ', mean_absolute_error(df_temp['AveragePrice'], df_temp['pred1']))
print('poly: ', mean_absolute_error(df_temp['AveragePrice'], df_temp['pred2']))
print(mean_absolute_error(df_temp['AveragePrice'], df_temp['pred1']) / mean_absolute_error(df_temp['AveragePrice'], df_temp['pred2']))
Mean Squared Error
linear: 0.06734598754057898
poly: 0.06888489340753648
0.9776597481562027
Mean Absolute Error
linear: 0.20413352982576358
poly: 0.20596939608489562
0.991086703684972
# make a data frame of the error rates: linear/polynomial, squared/absolute error
err_dict = {'type': ['linear','linear','poly','poly'],
'err': ['squared', 'absolute','squared', 'absolute'],
'value': [mean_squared_error(df_temp['AveragePrice'], df_temp['pred1']),
mean_absolute_error(df_temp['AveragePrice'], df_temp['pred1']),
mean_squared_error(df_temp['AveragePrice'], df_temp['pred2']),
mean_absolute_error(df_temp['AveragePrice'], df_temp['pred2'])]}
err_df = pd.DataFrame(err_dict)
err_df
 | type | err | value |
---|---|---|---|
0 | linear | squared | 0.067346 |
1 | linear | absolute | 0.204134 |
2 | poly | squared | 0.068885 |
3 | poly | absolute | 0.205969 |
# maybe it's better to visualize it
alt.Chart(err_df, title='Error Rate for Regression').mark_bar().encode(
x = 'type',
y = alt.Y('value', scale=alt.Scale(domain=[0, 1])),
column = 'err',
tooltip = 'value',
color = 'type'
)
It is interesting how similar the linear and polynomial error values are. Let us try changing the polynomial degree and observe what happens.
my_dict = {'polynomial degree': [], 'err': [], 'value': []}
for d in range(1,26,4):
pipe = Pipeline(
[
('poly', PolynomialFeatures(degree=d, include_bias=False)),
('reg', LinearRegression())
]
)
pipe.fit(df_temp[['Total Volume']], df_temp['AveragePrice'])
pred = pipe.predict(df_temp[['Total Volume']])
# squared error
my_dict['polynomial degree'].append(d)
my_dict['err'].append('squared')
my_dict['value'].append(mean_squared_error(df_temp['AveragePrice'], pred))
# abs error
my_dict['polynomial degree'].append(d)
my_dict['err'].append('absolute')
my_dict['value'].append(mean_absolute_error(df_temp['AveragePrice'], pred))
print(d)
print(mean_squared_error(df_temp['AveragePrice'], pred))
print(mean_absolute_error(df_temp['AveragePrice'], pred), '\n')
1
0.06734598754057898
0.20413352982576358
5
0.06773665223164686
0.2038320557786262
9
0.06850735945045253
0.2052835081851185
13
0.0696888787437351
0.20743873030710425
17
0.06997968560904548
0.20791419282499893
21
0.07004430999171733
0.20800576085056485
25
0.07006955607815679
0.20804343767339145
my_df = pd.DataFrame(my_dict)
my_df
 | polynomial degree | err | value |
---|---|---|---|
0 | 1 | squared | 0.067346 |
1 | 1 | absolute | 0.204134 |
2 | 5 | squared | 0.067737 |
3 | 5 | absolute | 0.203832 |
4 | 9 | squared | 0.068507 |
5 | 9 | absolute | 0.205284 |
6 | 13 | squared | 0.069689 |
7 | 13 | absolute | 0.207439 |
8 | 17 | squared | 0.069980 |
9 | 17 | absolute | 0.207914 |
10 | 21 | squared | 0.070044 |
11 | 21 | absolute | 0.208006 |
12 | 25 | squared | 0.070070 |
13 | 25 | absolute | 0.208043 |
# visualization for this one as well
alt.Chart(my_df).mark_line(point=True).encode(
x = 'polynomial degree',
y = 'value',
color = 'err',
tooltip = 'value'
)
Interestingly, both the mean squared error and the mean absolute error stay almost the same as I change the degree of the polynomial regression; the degree-1 and degree-25 polynomials have nearly identical error rates. I did not expect overfitting, especially for a simple linear regression, so this similarity is quite interesting. (One likely factor is that Total Volume takes very large, widely spread values, so the unscaled higher-degree features are numerically ill-conditioned and end up contributing little to the least-squares fit.)
# trying to check for overfitting
X_train, X_test, y_train, y_test = train_test_split(
df_temp[['Total Volume']], df_temp["AveragePrice"], test_size=0.2, random_state=42)
reg.fit(X_train, y_train)
pipe = Pipeline(
[
('poly', PolynomialFeatures(degree=10, include_bias=False)),
('reg', LinearRegression())
]
)
pipe.fit(X_train, y_train)
Pipeline(steps=[('poly', PolynomialFeatures(degree=10, include_bias=False)), ('reg', LinearRegression())])
print('Reg')
print('squared error:',mean_squared_error(reg.predict(X_test), y_test))
print('absolute error:',mean_absolute_error(reg.predict(X_test), y_test), '\n')
print('Pipe')
print('squared error:',mean_squared_error(pipe.predict(X_test), y_test))
print('absolute error:',mean_absolute_error(pipe.predict(X_test), y_test))
Reg
squared error: 0.06865000585663716
absolute error: 0.20524305640524532
Pipe
squared error: 0.08108453085669562
absolute error: 0.20986569507238603
These numbers feel very familiar. Let’s check whether this intuition is correct.
# data frame of error rate on training data without splitting data
err_df['split'] = ['no_test','no_test','no_test','no_test']
err_df
 | type | err | value | split |
---|---|---|---|---|
0 | linear | squared | 0.067346 | no_test |
1 | linear | absolute | 0.204134 | no_test |
2 | poly | squared | 0.068885 | no_test |
3 | poly | absolute | 0.205969 | no_test |
# combine the no-split error rates with the split error rates, where the split errors are measured on the test data
my_dict = {'type': ['linear','linear', 'poly', 'poly'],
'err': ['squared','absolute','squared','absolute'],
'value': [mean_squared_error(reg.predict(X_test), y_test),
mean_absolute_error(reg.predict(X_test), y_test),
mean_squared_error(pipe.predict(X_test), y_test),
mean_absolute_error(pipe.predict(X_test), y_test)],
'split': ['test','test','test','test']}
my_df = pd.DataFrame(my_dict)
my_df = pd.concat([err_df, my_df])
my_df
 | type | err | value | split |
---|---|---|---|---|
0 | linear | squared | 0.067346 | no_test |
1 | linear | absolute | 0.204134 | no_test |
2 | poly | squared | 0.068885 | no_test |
3 | poly | absolute | 0.205969 | no_test |
0 | linear | squared | 0.068650 | test |
1 | linear | absolute | 0.205243 | test |
2 | poly | squared | 0.081085 | test |
3 | poly | absolute | 0.209866 | test |
# compare error rate of same conditions
alt.Chart(my_df, title='Overfitting Check').mark_circle(size=35).encode(
x = 'type',
y = 'value',
column = 'err',
color = 'split'
)
Comparing the errors on the data used for training with the errors on the held-out test data, the model does not seem to be overfitted.
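As an extra sanity check (a sketch using scikit-learn’s cross-validation utilities, which were not part of the analysis above), we could also estimate the error with 5-fold cross-validation instead of a single train/test split:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated MSE for the plain linear model; the sign is flipped
# because scikit-learn reports the *negative* MSE so that larger is better
cv_mse = -cross_val_score(LinearRegression(),
                          df_temp[['Total Volume']], df_temp['AveragePrice'],
                          cv=5, scoring='neg_mean_squared_error')
cv_mse.mean()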
K-nearest neighbors#
Now we are going to use k-nearest neighbors to predict the region, using the date, price, and demand features as input.
I realized I can’t feed a datetime object directly into the model, but I still want to use it as a feature, so I am going to turn it into a numerical form first.
# convert the datetime objects to a numerical form (ordinal day number)
df['date_numerical'] = df['Date'].map(lambda x: x.to_pydatetime().toordinal())
# select features
cols = ['date_numerical', 'AveragePrice', 'Total Volume']
# split training and test data
X_train, X_test, y_train, y_test = train_test_split(
df[cols], df["region"], test_size=0.2, random_state=42
)
# instantiate and fit KNeighbors classifier object
from sklearn.neighbors import KNeighborsClassifier
d = len(df['region'].unique())
neigh = KNeighborsClassifier(n_neighbors=d)
neigh.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=54)
# make prediction on test set, check accuracy on test data
pred = neigh.predict(X_test)
neigh.score(X_test, y_test)
0.191
# number of categories of regions
print(d)
54
# accuracy (%) if we guessed a region uniformly at random
1/54*100
1.8518518518518516
We have an accuracy of 19.1% on the test data, which does not seem very good at first. However, there are 54 categories, so guessing at random would give only 1/54 ≈ 1.85% accuracy. So maybe the classifier is reasonably effective after all.
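To make the comparison with random guessing concrete, scikit-learn’s DummyClassifier gives a baseline we can score in exactly the same way (a sketch; this baseline was not part of the original analysis):

from sklearn.dummy import DummyClassifier

# baseline that ignores the features and guesses a region uniformly at random
dummy = DummyClassifier(strategy='uniform', random_state=0)
dummy.fit(X_train, y_train)
dummy.score(X_test, y_test)   # should land near 1/54, i.e. about 0.0185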
# let's check the score on the training data
neigh.score(X_train, y_train)
0.244
The score is not very high on either the test data or the training data. Maybe 5,000 data points are too few to predict 54 categories. We can load a new data frame with all of the rows (but, of course, only conventional avocados); remember, we previously limited df to 5,000 rows so it would be usable in Altair.
# load new data
df2 = pd.read_csv('avocado.csv')
df2 = (df2[(df2['type']=='conventional')])
# 9126 rows
df2.shape
(9126, 14)
# refit, taking the neighbor count from the full conventional data frame
# (note: X_train / X_test still come from the 5000-row sample split above)
d = len(df2['region'].unique())
neigh = KNeighborsClassifier(n_neighbors=d)
neigh.fit(X_train, y_train)
pred = neigh.predict(X_test)
neigh.score(X_test, y_test)
0.191
# score on the training data (same split as before)
neigh.score(X_train, y_train)
0.244
The accuracy did not improve; in fact the scores are identical, because the train/test split above was still built from the 5,000-row sample, so the larger data frame never actually entered the fit (a fair comparison would require recomputing date_numerical on df2 and re-splitting from it). Beyond that, the modest scores are probably due to the overwhelming number of categories to predict, the limited sample size, and/or a weak relationship between [date, price, demand] and region.
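One more factor worth noting: k-nearest neighbors is distance-based, and here Total Volume is on the order of 10^5–10^6 while AveragePrice is around 1, so the unscaled volume dominates the distance computation. A minimal sketch of a scaled version (an idea for further work, not something evaluated above):

from sklearn.preprocessing import StandardScaler

# put all three features on comparable scales before measuring distances
knn_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=d))
])
knn_pipe.fit(X_train, y_train)
knn_pipe.score(X_test, y_test)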
Summary#
First, I performed data cleaning and then a quick visual analysis of the data. Using linear regression on “Total Volume”, I was able to predict the average price reasonably well without overfitting, and I was surprised to find that the linear and polynomial regressions had similar error rates. I then tried k-nearest neighbors to predict the region, which was not very effective, probably because of the overwhelming number of categories to classify relative to the limited size of the data.
References#
What is the source of your dataset(s)?#
List any other references that you found helpful.#
date object into numerical form: https://stackoverflow.com/questions/40217369/python-linear-regression-predict-by-date
date timestamp to datetime: https://www.statology.org/pandas-timestamp-to-datetime/
general inspirations: https://www.codersarts.com/post/avocado-prices-dataset-regression, https://www.kaggle.com/code/mukulsaini01/random-forest-king-of-all-algo, https://github.com/rohinegi548/EDA-and-Machine-Learning-Avocado-Prices-Predictions/blob/master/Avocado Dataset Analysis and ML Predictions.ipynb
k-neighbors: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html, https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn