Building a Linear Regression Model for Predicting King County House Prices

Jonna Wang
5 min read · Mar 29, 2021

Have you ever thought about becoming a homebuyer in the next few years? Oh, yes? Great! Have you thought about predicting the house price yourself? Well, I got a chance to simulate that at Flatiron for the Phase 2 project.

Introduction

The project data contains house sale prices in King County, Washington, between May 2014 and May 2015. It combines various house information including price, sale date, year built, area, rooms, quality, surroundings, city, and location coordinates. Each house has a unique identifier. I used multiple linear regression to predict house prices and a train-test split to evaluate how well the model performs.

King County Map

Data Understanding and Cleaning

The original dataset contains 21,597 house sale prices with 21 feature columns.

  • id was dropped after checking duplicates
  • sqft_above and sqft_basement were combined into a binary column: basement (whether the house has a basement or not)
  • yr_built was converted to the house age
  • yr_renovated was converted to a binary column: rev (whether the house was renovated or not)
  • zipcode was translated to actual city names
  • The rest of the columns were kept
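
The cleaning steps above can be sketched in plain Python on a single record. This is only an illustration, not the project's actual code: it assumes that yr_renovated is 0 when a house was never renovated, that a basement exists when sqft_basement is greater than 0, and that age is measured relative to the sale year (the reference year is not stated above). The function name clean_row and the sample values are hypothetical.

```python
def clean_row(row, sale_year):
    """Return a cleaned copy of one raw sale record (a plain dict)."""
    cleaned = dict(row)
    cleaned.pop("id", None)  # identifier is dropped once duplicates are checked

    # basement: 1 if the house has any basement area, else 0 (assumed rule)
    cleaned["basement"] = int(cleaned.pop("sqft_basement") > 0)
    cleaned.pop("sqft_above", None)

    # age: years between construction and sale (assumed reference year)
    cleaned["age"] = sale_year - cleaned.pop("yr_built")

    # rev: 1 if a renovation year is recorded, 0 otherwise (assumed encoding)
    cleaned["rev"] = int(cleaned.pop("yr_renovated") > 0)
    return cleaned


# Invented record, for illustration only.
raw = {
    "id": 1,
    "sqft_above": 1200,
    "sqft_basement": 400,
    "yr_built": 1990,
    "yr_renovated": 0,
    "price": 500000,
}
print(clean_row(raw, sale_year=2015))
```

With pandas, the same ideas would typically be written as column-wise operations on the whole DataFrame rather than row by row.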

Questions to consider:

Q1. What is the best month to purchase a house?
Q2. Which region in King County should homebuyers look at first based on their budget?
Q3. What features help predict house prices?
Q4. Which features have the strongest relationship with house price?

Linear Modeling

A baseline model was created for later comparison. I used only continuous variables for the baseline model.

Heatmap 1

Heatmap 1 shows a multicollinearity issue (correlation ≥ 0.5): some independent variables are correlated with one or more of the other independent variables. Except for sqft_lot, every variable in the baseline model shows some multicollinearity. To address this, I combined bedrooms and bathrooms into one column, rm_ratio, which divides the number of bedrooms by the number of bathrooms.
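
As a rough illustration of the check behind such a heatmap, the sketch below computes pairwise Pearson correlations and flags every pair at or above the 0.5 threshold, then builds the room ratio. The helper names pearson and correlated_pairs and all numbers are made up for the example; the real analysis would run on the full dataset.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.5):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Toy values, invented for illustration only.
columns = {
    "bedrooms": [2, 3, 3, 4, 5, 4],
    "bathrooms": [1.0, 1.5, 2.0, 2.5, 3.0, 2.0],
    "sqft_lot": [5000, 4000, 9000, 3000, 6000, 8000],
}
print(correlated_pairs(columns))

# Room ratio: bedrooms divided by bathrooms.
# Assumes no record has zero bathrooms, which would raise ZeroDivisionError.
rm_ratio = [bed / bath for bed, bath in zip(columns["bedrooms"], columns["bathrooms"])]
print(rm_ratio)
```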

Heatmap 2

Heatmap 2 shows that, after this modification, the multicollinearity issue was resolved.

I performed a train-test split to see how well the baseline model performed on unseen data.
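
A train-test evaluation of this kind can be sketched as follows. To stay dependency-free, the example swaps the multiple regression for a single-feature least-squares line and uses synthetic data, so its numbers have nothing to do with the project's results. The helper names train_test_split, fit_line, and r_squared are defined here and are not library calls.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return mean_y - slope * mean_x, slope


def r_squared(points, intercept, slope):
    """R² = 1 - SS_res / SS_tot, evaluated on the given points."""
    mean_y = sum(y for _, y in points) / len(points)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot


# Synthetic (sqft_living, price) pairs, invented for illustration only.
rng = random.Random(0)
data = [(x, 100 * x + rng.gauss(0, 50_000)) for x in range(800, 4000, 40)]

train, test = train_test_split(data)
intercept, slope = fit_line(train)  # fit on the training rows only
print(f"train R2 = {r_squared(train, intercept, slope):.3f}")
print(f"test  R2 = {r_squared(test, intercept, slope):.3f}")
```

The key point is that the model is fitted on the training rows only, and the test R² is computed on rows the model has never seen.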

The baseline model performed poorly, with a test R² of 0.529. Using only continuous variables to build the linear regression model was not enough, so I decided to use all the variables.

I separated the variables into:

continuous = ['sqft_living', 'sqft_lot', 'rm_ratio', 'age']
categorical = ['waterfront', 'basement', 'rev', 'gd', 'city']
target = ['price']

I performed one-hot encoding on the categorical columns.

cat_ohe = pd.get_dummies(data[categorical], prefix=categorical, drop_first=True)
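
To make the effect of drop_first=True concrete, here is a small plain-Python sketch that mimics the result for one column: each level becomes a 0/1 indicator, and the first level (in sorted order) is dropped so it serves as the reference category. The function one_hot_drop_first is a hypothetical helper written for this example, not part of pandas, and pandas' own implementation differs internally.

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode a list of labels, dropping the first sorted level."""
    levels = sorted(set(values))
    kept = levels[1:]  # the dropped first level becomes the reference category
    return [
        {f"{prefix}_{level}": int(value == level) for level in kept}
        for value in values
    ]


# Toy input, for illustration only.
cities = ["Seattle", "Bellevue", "Medina", "Seattle"]
for encoded in one_hot_drop_first(cities, "city"):
    print(encoded)
```

A row whose indicators are all 0 belongs to the dropped reference level ("Bellevue" in this toy input). Dropping one level avoids perfectly collinear dummy columns in a regression that includes an intercept.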

A train-test split was performed for the first iteration. The test R² improved to 0.731. But when I checked ResidualPlot 1, I found that the residuals were poorly distributed and fanned out in a cone-like shape. These signs indicate a violation of the linear regression assumptions, in particular constant residual variance (homoscedasticity).

ResidualPlot 1

To address this, I log-transformed the continuous columns that were highly skewed. After this partial log transformation, no pattern could be seen in ResidualPlot 3, which suggests the model satisfied the linear regression assumptions. The test R² for the third iteration reached 0.749, a clear improvement over the baseline model (0.529).
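
The idea behind log-transforming a skewed column can be illustrated with a small sketch that compares skewness before and after taking the natural log. The values are invented, the skewness helper is defined here rather than taken from a library, and which columns were actually transformed in the project is not specified above.

```python
from math import log


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Invented right-skewed lot sizes, for illustration only.
sqft_lot = [3000, 4000, 5000, 5500, 6000, 7500, 9000, 15000, 40000, 120000]
log_sqft_lot = [log(v) for v in sqft_lot]  # requires strictly positive values

print(f"skewness before log: {skewness(sqft_lot):.2f}")
print(f"skewness after log:  {skewness(log_sqft_lot):.2f}")
```

A long right tail is compressed by the log, which usually pulls a few extreme values closer to the bulk of the data and tends to stabilize the spread of the residuals.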

ResidualPlot 3

Back to the questions…

Question 1

What is the best month to purchase a house?

  • February has the lowest average house sale price (perhaps houses are harder to sell during the cold winter?)
  • April has the highest average house sale price (perhaps families need to move before school starts?)
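
A monthly comparison like this comes down to grouping sales by month and averaging the price. The sketch below shows one way to do that in plain Python. It assumes sale dates are stored as month/day/year strings, which may not match the real file, and the sample sales are invented. The function mean_price_by_month is a hypothetical helper.

```python
from collections import defaultdict
from datetime import datetime


def mean_price_by_month(sales, date_format="%m/%d/%Y"):
    """Map month number (1-12) to the mean sale price in that month."""
    totals = defaultdict(lambda: [0.0, 0])  # month -> [price sum, count]
    for date_str, price in sales:
        month = datetime.strptime(date_str, date_format).month
        totals[month][0] += price
        totals[month][1] += 1
    return {month: total / count for month, (total, count) in sorted(totals.items())}


# Invented sales, for illustration only.
sales = [
    ("2/3/2015", 400000),
    ("2/20/2015", 420000),
    ("4/1/2015", 560000),
    ("4/15/2015", 540000),
]
averages = mean_price_by_month(sales)
cheapest_month = min(averages, key=averages.get)
print(averages, cheapest_month)
```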

Question 2

Which region in King County should homebuyers look first based on their budget?

  • Higher-priced houses (Groups A and B) are more likely to be found in the north region
  • Mid-priced houses (Group C) are more scattered and spread widely across King County
  • Lower-priced houses (Groups D and E) are more likely to be found in the south region

Question 3

What features help predict house prices?

  • area of living space (sqft_living)
  • area of lot space (sqft_lot)
  • room ratio (rm_ratio)
  • house age (age)
  • house grade (gd)
  • basement
  • renovation (rev)
  • waterfront view (waterfront)
  • city

Question 4

Which features have the strongest relationship with house price?

  • sqft_living, grade, waterfront, and city
  • Construction materials and level of craftsmanship determine the grade of the house, so a higher grade means better quality
  • There is a large price difference between house grades
  • Geographic location makes waterfront important, since not every city has water nearby (check the map)
  • The top 3 cities are Medina, Mercer Island, and Bellevue, which have relatively higher house prices
