Predicting the Housing Price and House Type in London#

Author: Siyu Wu

Course Project, UC Irvine, Math 10, F23

Introduction#

This dataset comprises of various house listings in London and neighbouring regions. It also includes various parameters like their property name, price, house type,location, area and etc. I use pandas to clean the data and perform feature engineering, then I use Altair to visualize the results. I use linear regression to predict the housing price per sq ft and random forest to predict the house type.

Main portion of the project#

Data Loading#

import pandas as pd
df = pd.read_csv('London.csv')
df = df.dropna(axis = 0).copy()
df
Unnamed: 0 Property Name Price House Type Area in sq ft No. of Bedrooms No. of Bathrooms No. of Receptions Location City/County Postal Code
0 0 Queens Road 1675000 House 2716 5 5 5 Wimbledon London SW19 8NY
1 1 Seward Street 650000 Flat / Apartment 814 2 2 2 Clerkenwell London EC1V 3PA
2 2 Hotham Road 735000 Flat / Apartment 761 2 2 2 Putney London SW15 1QL
3 3 Festing Road 1765000 House 1986 4 4 4 Putney London SW15 1LP
4 4 Spencer Walk 675000 Flat / Apartment 700 2 2 2 Putney London SW15 1PL
... ... ... ... ... ... ... ... ... ... ... ...
3473 3473 Park Lane 19950000 Flat / Apartment 3698 5 5 5 Mayfair London W1K 7AJ
3476 3476 St. James's Street 5275000 Flat / Apartment 1749 3 3 3 St James's London SW1A 1JT
3477 3477 Ingram Avenue 5995000 House 4435 6 6 6 Hampstead Garden Suburb London NW11 6TG
3478 3478 Cork Street 6300000 New development 1506 3 3 3 Mayfair London W1S 3AR
3479 3479 Courtenay Avenue 8650000 House 5395 6 6 6 Highgate London N6 4LP

2518 rows × 11 columns

Using Altair to Visualize the Average Housing Price per sq ft in Different County#

import altair as alt
df["Price per sq ft"] = df["Price"]/df["Area in sq ft"]
alt.Chart(df).mark_bar().encode(
    x = "City/County:N",
    y = "mean(Price per sq ft)",
    tooltip = ["mean(Price per sq ft)","City/County"]
).interactive().properties(
    title = 'Average Housing Price per sq ft in Different County'
)

The graph shows that Oxshott has the highest average housing price,while Fitzrovia has the lowest average housing price.#

Using Altair to Visualize the Number of Housing Within Different Price Range in London#

Here I use the Price per sq ft data that I calculated above. I first divide Price per sq ft into 4 categories according to their percentiles. Then I create a new column called ‘Price Range’ which labels its range in df.#

percentiles = [0,25,50,75,100]
percentile_values = [df['Price per sq ft'].quantile(i/100) for i in percentiles]
percentile_values
[241.61073825503357,
 713.2344515742341,
 902.516708437761,
 1184.5623359473075,
 7069.180152943269]
labels = ['242~713','713~903','903~1185','1185~7070']
df['Price Range'] = pd.qcut(df['Price per sq ft'],4,labels)
df
Unnamed: 0 Property Name Price House Type Area in sq ft No. of Bedrooms No. of Bathrooms No. of Receptions Location City/County Postal Code Price per sq ft Price Range
0 0 Queens Road 1675000 House 2716 5 5 5 Wimbledon London SW19 8NY 616.715758 242~713
1 1 Seward Street 650000 Flat / Apartment 814 2 2 2 Clerkenwell London EC1V 3PA 798.525799 713~903
2 2 Hotham Road 735000 Flat / Apartment 761 2 2 2 Putney London SW15 1QL 965.834428 903~1185
3 3 Festing Road 1765000 House 1986 4 4 4 Putney London SW15 1LP 888.721047 713~903
4 4 Spencer Walk 675000 Flat / Apartment 700 2 2 2 Putney London SW15 1PL 964.285714 903~1185
... ... ... ... ... ... ... ... ... ... ... ... ... ...
3473 3473 Park Lane 19950000 Flat / Apartment 3698 5 5 5 Mayfair London W1K 7AJ 5394.808004 1185~7070
3476 3476 St. James's Street 5275000 Flat / Apartment 1749 3 3 3 St James's London SW1A 1JT 3016.009148 1185~7070
3477 3477 Ingram Avenue 5995000 House 4435 6 6 6 Hampstead Garden Suburb London NW11 6TG 1351.747463 1185~7070
3478 3478 Cork Street 6300000 New development 1506 3 3 3 Mayfair London W1S 3AR 4183.266932 1185~7070
3479 3479 Courtenay Avenue 8650000 House 5395 6 6 6 Highgate London N6 4LP 1603.336423 1185~7070

2518 rows × 13 columns

The usage of the quantile and pd.qcut function are adapted from chatgdp [https://chat.openai.com/c/e14df8ae-65af-4ecf-be0f-9038695fcfcd]#

Here I only focus on the housing price in London#

df2 = df[df["City/County"]=='London']
df2
Unnamed: 0 Property Name Price House Type Area in sq ft No. of Bedrooms No. of Bathrooms No. of Receptions Location City/County Postal Code Price per sq ft Price Range
0 0 Queens Road 1675000 House 2716 5 5 5 Wimbledon London SW19 8NY 616.715758 242~713
1 1 Seward Street 650000 Flat / Apartment 814 2 2 2 Clerkenwell London EC1V 3PA 798.525799 713~903
2 2 Hotham Road 735000 Flat / Apartment 761 2 2 2 Putney London SW15 1QL 965.834428 903~1185
3 3 Festing Road 1765000 House 1986 4 4 4 Putney London SW15 1LP 888.721047 713~903
4 4 Spencer Walk 675000 Flat / Apartment 700 2 2 2 Putney London SW15 1PL 964.285714 903~1185
... ... ... ... ... ... ... ... ... ... ... ... ... ...
3473 3473 Park Lane 19950000 Flat / Apartment 3698 5 5 5 Mayfair London W1K 7AJ 5394.808004 1185~7070
3476 3476 St. James's Street 5275000 Flat / Apartment 1749 3 3 3 St James's London SW1A 1JT 3016.009148 1185~7070
3477 3477 Ingram Avenue 5995000 House 4435 6 6 6 Hampstead Garden Suburb London NW11 6TG 1351.747463 1185~7070
3478 3478 Cork Street 6300000 New development 1506 3 3 3 Mayfair London W1S 3AR 4183.266932 1185~7070
3479 3479 Courtenay Avenue 8650000 House 5395 6 6 6 Highgate London N6 4LP 1603.336423 1185~7070

2058 rows × 13 columns

alt.Chart(df2).mark_bar().encode(
    x= alt.X("Price Range",sort = ['242~713','713~903','903~1185','1185~7070']),
    y= "count()",
    tooltip = ["count()"]
).interactive().properties(
    title = 'Number of Housing Within Different Price Range in London'
)

From the graph above, there are 615 housings that have a price range per sq ft between 1185 and 7070 in London. And 1205 out of 2058 housings that is over 55% of the total housings have prices above 903 in London.#

Using Altair to Visualize the Properties of Different Housing in London#

alt.Chart(df2).mark_circle().encode(
    x = 'Area in sq ft',
    y = 'Price',
    color = 'House Type',
    tooltip = ['Area in sq ft','Price','House Type']
).interactive().properties(
    title = 'Housing in London'
)

The scatter plot above shows that the price increases as the Area in sq ft increases in general.#

Using Linear Regression to Predict Housing Prices per sq ft from No.of Bedrooms, No.of Bathrooms, No. of Receptions,Location,City/County,House Type and Property Name#

Since Location, City/County, House Type and Property Name are categorical data,in order to fit them into the linear regression model as inputs, I use label encoding to change them into numerical data type.#

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df['Location_encoded'] = label_encoder.fit_transform(df['Location'])
df['City/County_encoded'] = label_encoder.fit_transform(df['City/County'])
df['House Type_encoded'] = label_encoder.fit_transform(df['House Type'])
df['Property Name_encoded'] = label_encoder.fit_transform(df['Property Name'])
The idea of using label encoding to change the categorical data into numerical data type is adapted from the chatgdp [https://chat.openai.com/c/000f18cc-7297-489c-87c9-724ede56147a].#
cols = ["No. of Bedrooms","No. of Bathrooms","No. of Receptions","Location_encoded","City/County_encoded","House Type_encoded","Property Name_encoded"]
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(df[cols],df['Price per sq ft'])
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
reg.coef_
array([-1.13548039e+01, -1.13548039e+01, -1.13548039e+01,  1.18086599e-01,
       -2.05733993e+01,  7.76405280e+01, -5.15776331e-02])
reg.intercept_
1366.2166259982616
df['Pred Price per sq ft'] = reg.predict(df[cols])
from sklearn.metrics import mean_absolute_error
mean_absolute_error(df['Pred Price per sq ft'], df["Price per sq ft"])
365.8094894613128

From the coefficient above, we know that the housing price per sq ft will decrease as the no.of bedrooms, no. of bathrooms and no. of receptions increases. The House Type has a strong correlation with the housing price per sq ft because its coefficient is much larger than others. Since the mean_absolute_error is still very large, therefore it may not be a very good method for predicting the housing prices per sq ft.#

Using Random Forest to Predict House Type from Price, Area in sq ft, No. of Bedrooms, No. of Bathrooms, No. of Receptions and Price per sq ft#

cols2 = ["Price","Area in sq ft","No. of Bedrooms","No. of Bathrooms","No. of Receptions","Price per sq ft"]
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[cols2], df["House Type"], test_size=0.2, random_state=1)
rfc = RandomForestClassifier(n_estimators = 2000, max_leaf_nodes=36,random_state=2)
rfc.fit(X_train,y_train)
RandomForestClassifier(max_leaf_nodes=36, n_estimators=2000, random_state=2)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
rfc.score(X_test,y_test)
0.7738095238095238

From above, we use random forest to predict house type from Price,Area in sq ft,No. of Bedrooms, No. of Bathrooms, No. of Receptions and Price per sq ft. We get a test score about 77% which is much better than random guessing that is 12.5%.

df["House Type"].value_counts()
Flat / Apartment    1103
House               1005
New development      306
Penthouse             82
Bungalow               9
Studio                 7
Duplex                 4
Mews                   2
Name: House Type, dtype: int64
df["Pred House Type"] = rfc.predict(df[cols2])
df["Pred House Type"].value_counts()
Flat / Apartment    1395
House               1092
New development       18
Studio                 7
Penthouse              6
Name: Pred House Type, dtype: int64

From above, we can see that our model can distinguish Flat/Apartment, House and Studio from other house types well, but it is very difficult for it to predict other house types like Bungalow, Duplex and etc.#

Summary#

Either summarize what you did, or summarize the results. Maybe 3 sentences.

  1. I use pandas to clean the data and perform feature engineering, then I use Altair to visualize the results.

  2. I use linear regression to predict the housing price per sq ft and random forest to predict the house type

References#

Your code above should include references. Here is some additional space for references.

  • What is the source of your dataset(s)?

https://www.kaggle.com/datasets/arnavkulkarni/housing-prices-in-london?rvi=1

  • List any other references that you found helpful.