House Prices: Advanced Regression Techniques (Haiyang Shi, Apr. 17, 2018)



SLIDE 1

House Prices: Advanced Regression Techniques

Haiyang Shi

  • Apr. 17, 2018
SLIDE 2

The Ohio State University

Outline

  • Introduction
  • ML Techniques
  • Feature Engineering
  • Experiments
  • Observations

SLIDE 3

Introduction

  • Goal: predicting the final price of each house using advanced regression techniques.
  • Data: a Kaggle competition based on property data from Ames, Iowa, covering sales from 2006 to 2010.
  • Evaluation: Root Mean Square Error (RMSE) on log prices (taking the log reduces the impact of expensive houses on the error):

      RMSE = sqrt( (1/n) · Σ_{i=1}^{n} ( log(ŷ_i) − log(y_i) )² )

    where ŷ_i is the predicted price and y_i the actual price of house i.
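In code, the competition metric can be sketched as below (a minimal version for illustration, not Kaggle's official implementation):

```python
import numpy as np

def log_rmse(y_true, y_pred):
    """RMSE between the logs of actual and predicted sale prices."""
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))
```

Because the error is computed in log space, over-predicting a $100k house by $50k costs the same as over-predicting a $500k house by $250k.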

SLIDE 4

ML Techniques

  • Linear Regression: Ridge, Lasso
  • Support Vector Regression
  • Random Forest
  • Adaptive Boosting
  • Gradient Boosted Decision Tree
  • K Nearest Neighbors
  • Neural Network

SLIDE 5

Feature Engineering

  • Impute missing values
  • Clean outliers
  • Categorize categorical attributes
  • Transform skewed attributes
  • Generate features*
  • Select feature subset

SLIDE 6

Missing Values and Highly Correlated Attributes

  Attribute      Missing Values
  BsmtFullBath   2
  BsmtHalfBath   2
  GarageYrBlt    159
  GarageCars     1
  LotFrontage    486
  MasVnrArea     23
  BsmtFinSF1     1
  BsmtFinSF2     1
  BsmtUnfSF      1
  TotalBsmtSF    1
  GarageArea     1

  Attribute 1    Attribute 2    Correlation
  MSSubClass     BldgType       0.75
  OverallQual    ExterQual      0.72
  OverallQual    SalePrice      0.82
  YearBuilt      GarageYrBlt    0.78
  Exterior1st    Exterior2nd    0.86
  ExterQual      KitchenQual    0.72
  TotalBsmtSF    1stFlrSF       0.78
  GrLivArea      TotRmsAbvGrd   0.82
  GrLivArea      SalePrice      0.73
  Fireplaces     FireplaceQu    0.80
  GarageCars     GarageArea     0.89
  GarageQual     GarageCond     0.90
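The imputation step for the attributes listed above might look like the following sketch. The fill rules here are common choices for this dataset (a missing basement/garage value usually means the house has none; LotFrontage is filled per neighborhood), not necessarily the author's exact ones; column names follow the Ames data dictionary:

```python
import pandas as pd

def impute(df):
    # Missing basement/garage numerics typically mean "no basement/garage": fill with 0.
    zero_fill = ["BsmtFullBath", "BsmtHalfBath", "GarageCars", "GarageArea",
                 "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "MasVnrArea"]
    for col in zero_fill:
        df[col] = df[col].fillna(0)
    # A garage built alongside the house: fall back to the house's construction year.
    df["GarageYrBlt"] = df["GarageYrBlt"].fillna(df["YearBuilt"])
    # LotFrontage varies by location: fill with the median of the house's neighborhood.
    df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
        lambda s: s.fillna(s.median()))
    return df
```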

SLIDE 7


Outliers

SLIDE 8

Skewness

  • Log Transformation
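A sketch of this step, assuming the usual log1p approach for this competition and a skewness threshold of 0.75 (the threshold and the use of log1p rather than plain log are assumptions, not stated on the slide):

```python
import numpy as np
import pandas as pd

def transform_skewed(df, threshold=0.75):
    """Apply log1p to numeric columns whose skewness exceeds the threshold."""
    numeric = df.select_dtypes(include=[np.number]).columns
    skewed = [c for c in numeric if abs(df[c].skew()) > threshold]
    df[skewed] = np.log1p(df[skewed])  # log(1 + x) handles zero values safely
    return df
```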

SLIDE 9

Skewness (Cont.)

[Figures: attribute distributions before and after the log transformation]

SLIDE 10


Bivariate Relationship Analysis

SLIDE 11


Experiments

  • GridSearchCV to select hyperparameters
  • 10-fold cross validation
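This setup can be sketched as follows; synthetic data stands in for the Ames features, and the alpha grid is illustrative rather than the slides' actual search space:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression data in place of the Ames features.
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

# Exhaustive grid search, scored by RMSE over 10 folds.
grid = GridSearchCV(Lasso(max_iter=10000),
                    param_grid={"alpha": [0.0005, 0.001, 0.01, 0.1]},
                    scoring="neg_root_mean_squared_error",
                    cv=10)
grid.fit(X, y)
print(grid.best_params_)
```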
SLIDE 12


Experiments

  • Lasso Regression

– Most important feature-engineering steps

  • Transformation of skewed data
  • Categorization of categorical attributes
  • Imputation of missing values

– Score

  • 0.12789

– Most important features

  • Above grade (ground) living area square feet
  • Lot size in square feet
  • Rates the overall material and finish of the house
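For a Lasso model, "most important features" reduces to inspecting coefficient magnitudes. A sketch with synthetic data, reusing the three feature names from the slide (GrLivArea, LotArea, OverallQual); the coefficients below are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)),
                 columns=["GrLivArea", "LotArea", "OverallQual"])
# Synthetic target: depends strongly on GrLivArea, weakly on OverallQual.
y = 3.0 * X["GrLivArea"] + 1.5 * X["OverallQual"] + rng.normal(scale=0.1, size=300)

model = Lasso(alpha=0.01).fit(X, y)
# Rank features by absolute coefficient size.
importance = pd.Series(model.coef_, index=X.columns).abs().sort_values(ascending=False)
print(importance)
```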
SLIDE 13

Experiments

  • Random Forest

– Main hyperparameters

  • n_estimators (800): number of trees in the forest
  • max_features (0.3): fraction of features considered when looking for the best split
  • max_depth (20): maximum tree depth

– Deeper trees with a smaller max_features perform better
– Resilient to data preprocessing when max_features is small
– Score

  • 0.14169
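The Random Forest configuration above, sketched in scikit-learn with synthetic data standing in for the Ames set:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=20, random_state=0)

# The slide's hyperparameters: 800 trees, 30% of features per split, depth cap 20.
rf = RandomForestRegressor(n_estimators=800, max_features=0.3, max_depth=20,
                           random_state=0, n_jobs=-1)
rf.fit(X, y)
```

With max_features=0.3 each split only sees a random 30% of the columns, which decorrelates the trees; the deeper depth cap lets individual trees stay expressive.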

SLIDE 14

Experiments

  • Gradient Boosted Decision Tree

– Main hyperparameters: n_estimators (3000), learning_rate (0.05), max_features (log2), and max_depth (3)
– Score: 0.12365

  • Support Vector Regression

– Main hyperparameters: kernel (linear)
– Score: 0.15413

  • Adaptive Boosting

– Main hyperparameters: base_estimator (DecisionTree(max_features=0.3))
– Score: 0.14149
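The GBDT was the strongest single model on these slides; its settings translate directly to scikit-learn (again sketched on synthetic data, not the Ames set):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, n_features=20, random_state=0)

# The slide's GBDT settings: many shallow trees, learned slowly.
gbdt = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                 max_features="log2", max_depth=3,
                                 random_state=0)
gbdt.fit(X, y)
```

The combination of a small learning_rate with a large n_estimators is the classic boosting trade: each tree corrects only 5% of the remaining residual, so thousands of depth-3 trees are needed, but the ensemble overfits less than fewer aggressive ones.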

SLIDE 15

Experiments

  • K Nearest Neighbors

– Main hyperparameters: n_neighbors (11)
– Score: 0.24084

  • Neural Network

– Main hyperparameters: hidden_layer_sizes ((30, 30, 30, 30))
– Score: 0.23495
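These two models, with the hyperparameters stated above, sketched in scikit-learn (synthetic data; max_iter is an added assumption so the MLP trains to convergence, and the scores here are unrelated to the Kaggle scores):

```python
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=300, n_features=20, random_state=0)

# KNN: predict as the mean of the 11 nearest training houses.
knn = KNeighborsRegressor(n_neighbors=11).fit(X, y)

# MLP: four hidden layers of 30 units each.
mlp = MLPRegressor(hidden_layer_sizes=(30, 30, 30, 30), max_iter=2000,
                   random_state=0).fit(X, y)
```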

SLIDE 16

Observations

  • Feature engineering is very important

– Feature selection
– Feature creation

  • Transforming the neighborhood attribute into a geographical location

– Feature combination

  • Overfitting is harmful, and cross validation alone is not enough to prevent it
  • Tuning hyperparameters is very time consuming

More details: http://www.shihaiyang.me/2018/04/16/house-prices/

SLIDE 17


Thank You!