Predicting the Housing Price and House Type in London#
Author: Siyu Wu
Course Project, UC Irvine, Math 10, F23
Introduction#
This dataset comprises of various house listings in London and neighbouring regions. It also includes various parameters like their property name, price, house type,location, area and etc. I use pandas to clean the data and perform feature engineering, then I use Altair to visualize the results. I use linear regression to predict the housing price per sq ft and random forest to predict the house type.
Main portion of the project#
Data Loading#
import pandas as pd
df = pd.read_csv('London.csv')
df = df.dropna(axis = 0).copy()
df
Unnamed: 0 | Property Name | Price | House Type | Area in sq ft | No. of Bedrooms | No. of Bathrooms | No. of Receptions | Location | City/County | Postal Code | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Queens Road | 1675000 | House | 2716 | 5 | 5 | 5 | Wimbledon | London | SW19 8NY |
1 | 1 | Seward Street | 650000 | Flat / Apartment | 814 | 2 | 2 | 2 | Clerkenwell | London | EC1V 3PA |
2 | 2 | Hotham Road | 735000 | Flat / Apartment | 761 | 2 | 2 | 2 | Putney | London | SW15 1QL |
3 | 3 | Festing Road | 1765000 | House | 1986 | 4 | 4 | 4 | Putney | London | SW15 1LP |
4 | 4 | Spencer Walk | 675000 | Flat / Apartment | 700 | 2 | 2 | 2 | Putney | London | SW15 1PL |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3473 | 3473 | Park Lane | 19950000 | Flat / Apartment | 3698 | 5 | 5 | 5 | Mayfair | London | W1K 7AJ |
3476 | 3476 | St. James's Street | 5275000 | Flat / Apartment | 1749 | 3 | 3 | 3 | St James's | London | SW1A 1JT |
3477 | 3477 | Ingram Avenue | 5995000 | House | 4435 | 6 | 6 | 6 | Hampstead Garden Suburb | London | NW11 6TG |
3478 | 3478 | Cork Street | 6300000 | New development | 1506 | 3 | 3 | 3 | Mayfair | London | W1S 3AR |
3479 | 3479 | Courtenay Avenue | 8650000 | House | 5395 | 6 | 6 | 6 | Highgate | London | N6 4LP |
2518 rows × 11 columns
Using Altair to Visualize the Average Housing Price per sq ft in Different County#
import altair as alt
df["Price per sq ft"] = df["Price"]/df["Area in sq ft"]
alt.Chart(df).mark_bar().encode(
x = "City/County:N",
y = "mean(Price per sq ft)",
tooltip = ["mean(Price per sq ft)","City/County"]
).interactive().properties(
title = 'Average Housing Price per sq ft in Different County'
)
The graph shows that Oxshott has the highest average housing price,while Fitzrovia has the lowest average housing price.#
Using Altair to Visualize the Number of Housing Within Different Price Range in London#
Here I use the Price per sq ft data that I calculated above. I first divide Price per sq ft into 4 categories according to their percentiles. Then I create a new column called ‘Price Range’ which labels its range in df.#
percentiles = [0,25,50,75,100]
percentile_values = [df['Price per sq ft'].quantile(i/100) for i in percentiles]
percentile_values
[241.61073825503357,
713.2344515742341,
902.516708437761,
1184.5623359473075,
7069.180152943269]
labels = ['242~713','713~903','903~1185','1185~7070']
df['Price Range'] = pd.qcut(df['Price per sq ft'],4,labels)
df
Unnamed: 0 | Property Name | Price | House Type | Area in sq ft | No. of Bedrooms | No. of Bathrooms | No. of Receptions | Location | City/County | Postal Code | Price per sq ft | Price Range | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Queens Road | 1675000 | House | 2716 | 5 | 5 | 5 | Wimbledon | London | SW19 8NY | 616.715758 | 242~713 |
1 | 1 | Seward Street | 650000 | Flat / Apartment | 814 | 2 | 2 | 2 | Clerkenwell | London | EC1V 3PA | 798.525799 | 713~903 |
2 | 2 | Hotham Road | 735000 | Flat / Apartment | 761 | 2 | 2 | 2 | Putney | London | SW15 1QL | 965.834428 | 903~1185 |
3 | 3 | Festing Road | 1765000 | House | 1986 | 4 | 4 | 4 | Putney | London | SW15 1LP | 888.721047 | 713~903 |
4 | 4 | Spencer Walk | 675000 | Flat / Apartment | 700 | 2 | 2 | 2 | Putney | London | SW15 1PL | 964.285714 | 903~1185 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3473 | 3473 | Park Lane | 19950000 | Flat / Apartment | 3698 | 5 | 5 | 5 | Mayfair | London | W1K 7AJ | 5394.808004 | 1185~7070 |
3476 | 3476 | St. James's Street | 5275000 | Flat / Apartment | 1749 | 3 | 3 | 3 | St James's | London | SW1A 1JT | 3016.009148 | 1185~7070 |
3477 | 3477 | Ingram Avenue | 5995000 | House | 4435 | 6 | 6 | 6 | Hampstead Garden Suburb | London | NW11 6TG | 1351.747463 | 1185~7070 |
3478 | 3478 | Cork Street | 6300000 | New development | 1506 | 3 | 3 | 3 | Mayfair | London | W1S 3AR | 4183.266932 | 1185~7070 |
3479 | 3479 | Courtenay Avenue | 8650000 | House | 5395 | 6 | 6 | 6 | Highgate | London | N6 4LP | 1603.336423 | 1185~7070 |
2518 rows × 13 columns
The usage of the quantile and pd.qcut function are adapted from chatgdp [https://chat.openai.com/c/e14df8ae-65af-4ecf-be0f-9038695fcfcd]#
Here I only focus on the housing price in London#
df2 = df[df["City/County"]=='London']
df2
Unnamed: 0 | Property Name | Price | House Type | Area in sq ft | No. of Bedrooms | No. of Bathrooms | No. of Receptions | Location | City/County | Postal Code | Price per sq ft | Price Range | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Queens Road | 1675000 | House | 2716 | 5 | 5 | 5 | Wimbledon | London | SW19 8NY | 616.715758 | 242~713 |
1 | 1 | Seward Street | 650000 | Flat / Apartment | 814 | 2 | 2 | 2 | Clerkenwell | London | EC1V 3PA | 798.525799 | 713~903 |
2 | 2 | Hotham Road | 735000 | Flat / Apartment | 761 | 2 | 2 | 2 | Putney | London | SW15 1QL | 965.834428 | 903~1185 |
3 | 3 | Festing Road | 1765000 | House | 1986 | 4 | 4 | 4 | Putney | London | SW15 1LP | 888.721047 | 713~903 |
4 | 4 | Spencer Walk | 675000 | Flat / Apartment | 700 | 2 | 2 | 2 | Putney | London | SW15 1PL | 964.285714 | 903~1185 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3473 | 3473 | Park Lane | 19950000 | Flat / Apartment | 3698 | 5 | 5 | 5 | Mayfair | London | W1K 7AJ | 5394.808004 | 1185~7070 |
3476 | 3476 | St. James's Street | 5275000 | Flat / Apartment | 1749 | 3 | 3 | 3 | St James's | London | SW1A 1JT | 3016.009148 | 1185~7070 |
3477 | 3477 | Ingram Avenue | 5995000 | House | 4435 | 6 | 6 | 6 | Hampstead Garden Suburb | London | NW11 6TG | 1351.747463 | 1185~7070 |
3478 | 3478 | Cork Street | 6300000 | New development | 1506 | 3 | 3 | 3 | Mayfair | London | W1S 3AR | 4183.266932 | 1185~7070 |
3479 | 3479 | Courtenay Avenue | 8650000 | House | 5395 | 6 | 6 | 6 | Highgate | London | N6 4LP | 1603.336423 | 1185~7070 |
2058 rows × 13 columns
alt.Chart(df2).mark_bar().encode(
x= alt.X("Price Range",sort = ['242~713','713~903','903~1185','1185~7070']),
y= "count()",
tooltip = ["count()"]
).interactive().properties(
title = 'Number of Housing Within Different Price Range in London'
)
From the graph above, there are 615 housings that have a price range per sq ft between 1185 and 7070 in London. And 1205 out of 2058 housings that is over 55% of the total housings have prices above 903 in London.#
Using Altair to Visualize the Properties of Different Housing in London#
alt.Chart(df2).mark_circle().encode(
x = 'Area in sq ft',
y = 'Price',
color = 'House Type',
tooltip = ['Area in sq ft','Price','House Type']
).interactive().properties(
title = 'Housing in London'
)
The scatter plot above shows that the price increases as the Area in sq ft increases in general.#
Using Linear Regression to Predict Housing Prices per sq ft from No.of Bedrooms, No.of Bathrooms, No. of Receptions,Location,City/County,House Type and Property Name#
Since Location, City/County, House Type and Property Name are categorical data,in order to fit them into the linear regression model as inputs, I use label encoding to change them into numerical data type.#
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df['Location_encoded'] = label_encoder.fit_transform(df['Location'])
df['City/County_encoded'] = label_encoder.fit_transform(df['City/County'])
df['House Type_encoded'] = label_encoder.fit_transform(df['House Type'])
df['Property Name_encoded'] = label_encoder.fit_transform(df['Property Name'])
The idea of using label encoding to change the categorical data into numerical data type is adapted from the chatgdp [https://chat.openai.com/c/000f18cc-7297-489c-87c9-724ede56147a].#
cols = ["No. of Bedrooms","No. of Bathrooms","No. of Receptions","Location_encoded","City/County_encoded","House Type_encoded","Property Name_encoded"]
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(df[cols],df['Price per sq ft'])
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
reg.coef_
array([-1.13548039e+01, -1.13548039e+01, -1.13548039e+01, 1.18086599e-01,
-2.05733993e+01, 7.76405280e+01, -5.15776331e-02])
reg.intercept_
1366.2166259982616
df['Pred Price per sq ft'] = reg.predict(df[cols])
from sklearn.metrics import mean_absolute_error
mean_absolute_error(df['Pred Price per sq ft'], df["Price per sq ft"])
365.8094894613128
From the coefficient above, we know that the housing price per sq ft will decrease as the no.of bedrooms, no. of bathrooms and no. of receptions increases. The House Type has a strong correlation with the housing price per sq ft because its coefficient is much larger than others. Since the mean_absolute_error is still very large, therefore it may not be a very good method for predicting the housing prices per sq ft.#
Using Random Forest to Predict House Type from Price, Area in sq ft, No. of Bedrooms, No. of Bathrooms, No. of Receptions and Price per sq ft#
cols2 = ["Price","Area in sq ft","No. of Bedrooms","No. of Bathrooms","No. of Receptions","Price per sq ft"]
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[cols2], df["House Type"], test_size=0.2, random_state=1)
rfc = RandomForestClassifier(n_estimators = 2000, max_leaf_nodes=36,random_state=2)
rfc.fit(X_train,y_train)
RandomForestClassifier(max_leaf_nodes=36, n_estimators=2000, random_state=2)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_leaf_nodes=36, n_estimators=2000, random_state=2)
rfc.score(X_test,y_test)
0.7738095238095238
From above, we use random forest to predict house type from Price,Area in sq ft,No. of Bedrooms, No. of Bathrooms, No. of Receptions and Price per sq ft. We get a test score about 77% which is much better than random guessing that is 12.5%.
df["House Type"].value_counts()
Flat / Apartment 1103
House 1005
New development 306
Penthouse 82
Bungalow 9
Studio 7
Duplex 4
Mews 2
Name: House Type, dtype: int64
df["Pred House Type"] = rfc.predict(df[cols2])
df["Pred House Type"].value_counts()
Flat / Apartment 1395
House 1092
New development 18
Studio 7
Penthouse 6
Name: Pred House Type, dtype: int64
From above, we can see that our model can distinguish Flat/Apartment, House and Studio from other house types well, but it is very difficult for it to predict other house types like Bungalow, Duplex and etc.#
Summary#
Either summarize what you did, or summarize the results. Maybe 3 sentences.
I use pandas to clean the data and perform feature engineering, then I use Altair to visualize the results.
I use linear regression to predict the housing price per sq ft and random forest to predict the house type
References#
Your code above should include references. Here is some additional space for references.
What is the source of your dataset(s)?
https://www.kaggle.com/datasets/arnavkulkarni/housing-prices-in-london?rvi=1
List any other references that you found helpful.