Predicting the Housing Price and House Type in London

Predicting the Housing Price and House Type in London#

Author: Siyu Wu

Course Project, UC Irvine, Math 10, F23

Introduction#

This dataset comprises of various house listings in London and neighbouring regions. It also includes various parameters like their property name, price, house type,location, area and etc. I use pandas to clean the data and perform feature engineering, then I use Altair to visualize the results. I use linear regression to predict the housing price per sq ft and random forest to predict the house type.

Main portion of the project#

Data Loading#

import pandas as pd
df = pd.read_csv('London.csv')

df = df.dropna(axis = 0).copy()

df

	Unnamed: 0	Property Name	Price	House Type	Area in sq ft	No. of Bedrooms	No. of Bathrooms	No. of Receptions	Location	City/County	Postal Code
0	0	Queens Road	1675000	House	2716	5	5	5	Wimbledon	London	SW19 8NY
1	1	Seward Street	650000	Flat / Apartment	814	2	2	2	Clerkenwell	London	EC1V 3PA
2	2	Hotham Road	735000	Flat / Apartment	761	2	2	2	Putney	London	SW15 1QL
3	3	Festing Road	1765000	House	1986	4	4	4	Putney	London	SW15 1LP
4	4	Spencer Walk	675000	Flat / Apartment	700	2	2	2	Putney	London	SW15 1PL
...	...	...	...	...	...	...	...	...	...	...	...
3473	3473	Park Lane	19950000	Flat / Apartment	3698	5	5	5	Mayfair	London	W1K 7AJ
3476	3476	St. James's Street	5275000	Flat / Apartment	1749	3	3	3	St James's	London	SW1A 1JT
3477	3477	Ingram Avenue	5995000	House	4435	6	6	6	Hampstead Garden Suburb	London	NW11 6TG
3478	3478	Cork Street	6300000	New development	1506	3	3	3	Mayfair	London	W1S 3AR
3479	3479	Courtenay Avenue	8650000	House	5395	6	6	6	Highgate	London	N6 4LP

2518 rows × 11 columns

Using Altair to Visualize the Average Housing Price per sq ft in Different County#

import altair as alt

df["Price per sq ft"] = df["Price"]/df["Area in sq ft"]

alt.Chart(df).mark_bar().encode(
    x = "City/County:N",
    y = "mean(Price per sq ft)",
    tooltip = ["mean(Price per sq ft)","City/County"]
).interactive().properties(
    title = 'Average Housing Price per sq ft in Different County'
)

The graph shows that Oxshott has the highest average housing price,while Fitzrovia has the lowest average housing price.#

Using Altair to Visualize the Number of Housing Within Different Price Range in London#

Here I use the Price per sq ft data that I calculated above. I first divide Price per sq ft into 4 categories according to their percentiles. Then I create a new column called ‘Price Range’ which labels its range in df.#

percentiles = [0,25,50,75,100]

percentile_values = [df['Price per sq ft'].quantile(i/100) for i in percentiles]

percentile_values

[241.61073825503357,
2344515742341,
516708437761,
5623359473075,
180152943269]

labels = ['242~713','713~903','903~1185','1185~7070']

df['Price Range'] = pd.qcut(df['Price per sq ft'],4,labels)
df

	Unnamed: 0	Property Name	Price	House Type	Area in sq ft	No. of Bedrooms	No. of Bathrooms	No. of Receptions	Location	City/County	Postal Code	Price per sq ft	Price Range
0	0	Queens Road	1675000	House	2716	5	5	5	Wimbledon	London	SW19 8NY	616.715758	242~713
1	1	Seward Street	650000	Flat / Apartment	814	2	2	2	Clerkenwell	London	EC1V 3PA	798.525799	713~903
2	2	Hotham Road	735000	Flat / Apartment	761	2	2	2	Putney	London	SW15 1QL	965.834428	903~1185
3	3	Festing Road	1765000	House	1986	4	4	4	Putney	London	SW15 1LP	888.721047	713~903
4	4	Spencer Walk	675000	Flat / Apartment	700	2	2	2	Putney	London	SW15 1PL	964.285714	903~1185
...	...	...	...	...	...	...	...	...	...	...	...	...	...
3473	3473	Park Lane	19950000	Flat / Apartment	3698	5	5	5	Mayfair	London	W1K 7AJ	5394.808004	1185~7070
3476	3476	St. James's Street	5275000	Flat / Apartment	1749	3	3	3	St James's	London	SW1A 1JT	3016.009148	1185~7070
3477	3477	Ingram Avenue	5995000	House	4435	6	6	6	Hampstead Garden Suburb	London	NW11 6TG	1351.747463	1185~7070
3478	3478	Cork Street	6300000	New development	1506	3	3	3	Mayfair	London	W1S 3AR	4183.266932	1185~7070
3479	3479	Courtenay Avenue	8650000	House	5395	6	6	6	Highgate	London	N6 4LP	1603.336423	1185~7070

2518 rows × 13 columns

The usage of the quantile and pd.qcut function are adapted from chatgdp [https://chat.openai.com/c/e14df8ae-65af-4ecf-be0f-9038695fcfcd]#

Here I only focus on the housing price in London#

df2 = df[df["City/County"]=='London']
df2

	Unnamed: 0	Property Name	Price	House Type	Area in sq ft	No. of Bedrooms	No. of Bathrooms	No. of Receptions	Location	City/County	Postal Code	Price per sq ft	Price Range
0	0	Queens Road	1675000	House	2716	5	5	5	Wimbledon	London	SW19 8NY	616.715758	242~713
1	1	Seward Street	650000	Flat / Apartment	814	2	2	2	Clerkenwell	London	EC1V 3PA	798.525799	713~903
2	2	Hotham Road	735000	Flat / Apartment	761	2	2	2	Putney	London	SW15 1QL	965.834428	903~1185
3	3	Festing Road	1765000	House	1986	4	4	4	Putney	London	SW15 1LP	888.721047	713~903
4	4	Spencer Walk	675000	Flat / Apartment	700	2	2	2	Putney	London	SW15 1PL	964.285714	903~1185
...	...	...	...	...	...	...	...	...	...	...	...	...	...
3473	3473	Park Lane	19950000	Flat / Apartment	3698	5	5	5	Mayfair	London	W1K 7AJ	5394.808004	1185~7070
3476	3476	St. James's Street	5275000	Flat / Apartment	1749	3	3	3	St James's	London	SW1A 1JT	3016.009148	1185~7070
3477	3477	Ingram Avenue	5995000	House	4435	6	6	6	Hampstead Garden Suburb	London	NW11 6TG	1351.747463	1185~7070
3478	3478	Cork Street	6300000	New development	1506	3	3	3	Mayfair	London	W1S 3AR	4183.266932	1185~7070
3479	3479	Courtenay Avenue	8650000	House	5395	6	6	6	Highgate	London	N6 4LP	1603.336423	1185~7070

2058 rows × 13 columns

alt.Chart(df2).mark_bar().encode(
    x= alt.X("Price Range",sort = ['242~713','713~903','903~1185','1185~7070']),
    y= "count()",
    tooltip = ["count()"]
).interactive().properties(
    title = 'Number of Housing Within Different Price Range in London'
)

From the graph above, there are 615 housings that have a price range per sq ft between 1185 and 7070 in London. And 1205 out of 2058 housings that is over 55% of the total housings have prices above 903 in London.#

Using Altair to Visualize the Properties of Different Housing in London#

alt.Chart(df2).mark_circle().encode(
    x = 'Area in sq ft',
    y = 'Price',
    color = 'House Type',
    tooltip = ['Area in sq ft','Price','House Type']
).interactive().properties(
    title = 'Housing in London'
)

The scatter plot above shows that the price increases as the Area in sq ft increases in general.#

Using Linear Regression to Predict Housing Prices per sq ft from No.of Bedrooms, No.of Bathrooms, No. of Receptions,Location,City/County,House Type and Property Name#

Since Location, City/County, House Type and Property Name are categorical data,in order to fit them into the linear regression model as inputs, I use label encoding to change them into numerical data type.#

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df['Location_encoded'] = label_encoder.fit_transform(df['Location'])
df['City/County_encoded'] = label_encoder.fit_transform(df['City/County'])
df['House Type_encoded'] = label_encoder.fit_transform(df['House Type'])
df['Property Name_encoded'] = label_encoder.fit_transform(df['Property Name'])

The idea of using label encoding to change the categorical data into numerical data type is adapted from the chatgdp [https://chat.openai.com/c/000f18cc-7297-489c-87c9-724ede56147a].#

cols = ["No. of Bedrooms","No. of Bathrooms","No. of Receptions","Location_encoded","City/County_encoded","House Type_encoded","Property Name_encoded"]

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(df[cols],df['Price per sq ft'])

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

reg.coef_

array([-1.13548039e+01, -1.13548039e+01, -1.13548039e+01,  1.18086599e-01,
       -2.05733993e+01,  7.76405280e+01, -5.15776331e-02])

reg.intercept_

1366.2166259982616

df['Pred Price per sq ft'] = reg.predict(df[cols])

from sklearn.metrics import mean_absolute_error
mean_absolute_error(df['Pred Price per sq ft'], df["Price per sq ft"])

365.8094894613128

From the coefficient above, we know that the housing price per sq ft will decrease as the no.of bedrooms, no. of bathrooms and no. of receptions increases. The House Type has a strong correlation with the housing price per sq ft because its coefficient is much larger than others. Since the mean_absolute_error is still very large, therefore it may not be a very good method for predicting the housing prices per sq ft.#

Using Random Forest to Predict House Type from Price, Area in sq ft, No. of Bedrooms, No. of Bathrooms, No. of Receptions and Price per sq ft#

cols2 = ["Price","Area in sq ft","No. of Bedrooms","No. of Bathrooms","No. of Receptions","Price per sq ft"]

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[cols2], df["House Type"], test_size=0.2, random_state=1)
rfc = RandomForestClassifier(n_estimators = 2000, max_leaf_nodes=36,random_state=2)
rfc.fit(X_train,y_train)

RandomForestClassifier(max_leaf_nodes=36, n_estimators=2000, random_state=2)