Revenue Prediction of House Resale
Bairong Lei, University of Waterloo
November 6, 2012
Overview
- Motivation
- Previous Work
- Project Goals
- Dataset
- Plan for Analysis
Motivation
- People favor owning a valuable property.
- Home investment is treated as a hedge against inflation.
- House resale is expected to turn a profit.
[Figure: Census structure of private households by household type. Values shown in the chart: 30.5%, 26.5%, 25.7%, 27.6%, 28.5%, 26.8%, 28.5%, 29.5%, 29.0%. Source: Statistics Canada]
Motivation Cont’
New home purchase vs. resale home purchase:
Issue of Concern                 New Homes                   Resold Homes
Registration for a home          Needed                      Not needed
Registration for a home builder  Needed                      Not needed
List prices                      Unknown                     Known
Renovation                       Cost of upgrade             Usually included in price
Appliances                       May or may not be included  Usually included in price
Locations                        Unpredictable               Fixed
Offer presentation               Not needed                  Needed
Previous Work
- Basu, S. and Thibodeau, T. Analysis of Spatial Auto-correlation in House Prices. Journal of Real Estate Finance and Economics, Vol. 17:1, 61-85 (1998).
Structural characteristics increase the accuracy of hedonic house price prediction.
- Chopra, S., Thampy, T., Leahy, J., Caplin, A., and LeCun, Y. Machine Learning and the Spatial Structure of House Prices and Housing Returns (2008).
Applies a linear regression model that accounts for geographic factors to reduce price prediction error over a long period.
- Question: How can the revenue from selling a house be predicted? Which factors affect the revenue when selling a house?
Project Goals
- Predict the difference between the sold price and the listing price of resold houses (regression problem)
- Predict whether the sold price is greater than the asking price (classification problem)
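The two targets can be derived from the Ask Price and Sold Price fields of each record. The following is a minimal sketch, not the author's code: the function name `resale_targets` and the sign convention (sold minus ask) are assumptions, and the sample values are those of the example record shown in the deck.

```python
def resale_targets(ask_price, sold_price):
    """Return (price difference, sold-above-asking flag) for one record."""
    # Regression target; the sign convention (sold minus ask) is an assumption.
    difference = sold_price - ask_price
    # Classification target: 1 if the house sold above the asking price, else 0.
    above_asking = int(sold_price > ask_price)
    return difference, above_asking


# Ask and sold prices of the example record from the deck.
diff, label = resale_targets(ask_price=329777, sold_price=320000)
print(diff, label)  # -9777 0
```

A house that sells exactly at the asking price is labeled 0 here, which matches the strict "greater than" wording of the classification goal.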
Raw Dataset
- Source: realmarketwatch.com
- Description: resold house records in the Greater Toronto Area from the most recent two weeks
- Fields include: MLS Number, City, Street Number, Street Name, Street Type, Area, House Type, House Style, Number of Bedrooms, Number of Bathrooms, Contract Date, Sold Date, Ask Price, Sold Price
Raw Data Cont’
Overview of the raw dataset
- Number of Records: 4194
- City: 54 distinct names
- Area: 340 districts
- Street Types: 38 distinct types
- House Types: 15
- House Styles: 10
- No. of Bedrooms: 0 ~ 9
- No. of Washrooms: 0 ~ 11
- Ask Price: 89900 ~ 7995000
- Sold Price: 2200 ~ 7025000
Raw Dataset Cont’
- Example:
- City: Aurora
- St. No.: 51
- St. Name: Cashel
- St. Type: Crt
- Area: Aurora Hig
- Ask Price: 329777
- Contract Date: 10/09/2012
- Sold Price: 320000
- Sold Date: 24/09/2012
- House Type: Att/Row/Tw
- House Style: 2-Storey
- Bedroom: 3
- Washroom: 2
Challenges of Raw Data
- No house records for Halton Region in the GTA
- Fields with invalid data: 0 bedrooms, 0 bathrooms
- Ambiguous data: house style recorded as "Vacant land"
- Any suggestions on how to impute these misleading values? (mean, hot-deck, or machine learning methods?)
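As one illustration of the simplest option raised above, mean imputation could look like the sketch below. It assumes that a value of 0 marks a missing entry; the function name and the toy values are made up for illustration and are not taken from the dataset.

```python
from statistics import mean


def impute_zero_with_mean(values):
    """Replace zeros (treated as missing) with the rounded mean of the non-zero values."""
    observed = [v for v in values if v != 0]
    if not observed:
        # Nothing to estimate the mean from; return the data unchanged.
        return list(values)
    fill = round(mean(observed))
    return [fill if v == 0 else v for v in values]


bedrooms = [3, 0, 4, 2, 0, 3]  # made-up toy values, not from the dataset
print(impute_zero_with_mean(bedrooms))  # [3, 3, 4, 2, 3, 3]
```

Hot-deck imputation would instead copy the value from a similar record (for example, one with the same house type), which avoids producing the same fill value for every missing entry.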
Plan for Analysis
- Overview
- Pre-processing raw data
- Potential machine learning methods
- Validation
Overview
Feature reconstruction from raw data
- Goal: group categorical and qualitative data into levels that are easier for machine learning models to use (feature encoding)
- Focus on the features city, house type, house style, number of bedrooms, and number of bathrooms
Primarily focus on supervised learning methods
- Regression method
- Classification method
Pre-processing raw data
Feature Encoding
- Apply dummy variables to construct features for qualitative variables
- City names are categorized into four regions (City of Toronto, Peel, York, Durham)
- Z1 = 1 if the house is in Peel, else Z1 = 0
- Z2 = 1 if the house is in York, else Z2 = 0
- Z3 = 1 if the house is in Durham, else Z3 = 0
- City of Toronto is the baseline region (Z1 = Z2 = Z3 = 0)
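The region dummies could be computed as in the sketch below. The lookup table `CITY_TO_REGION` and the function `region_dummies` are hypothetical names, and only a handful of GTA municipalities are listed; the real dataset has 54 distinct city names that would all need to be mapped.

```python
# Hypothetical city-to-region lookup; only a few GTA municipalities are listed.
CITY_TO_REGION = {
    "Toronto": "City of Toronto",
    "Mississauga": "Peel",
    "Brampton": "Peel",
    "Aurora": "York",
    "Markham": "York",
    "Oshawa": "Durham",
    "Ajax": "Durham",
}


def region_dummies(city):
    """Return (Z1, Z2, Z3) for Peel, York, Durham; City of Toronto is (0, 0, 0)."""
    region = CITY_TO_REGION[city]  # raises KeyError for cities missing from the lookup
    return (int(region == "Peel"), int(region == "York"), int(region == "Durham"))


print(region_dummies("Aurora"))   # (0, 1, 0)
print(region_dummies("Toronto"))  # (0, 0, 0)
```

Three dummies are enough for four regions because the baseline region is identified by all three being zero; adding a fourth dummy would make the columns linearly dependent with the intercept.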
Machine Learning Methods
Regression problem
- Multivariate Linear Regression
Classification problem
- Support Vector Machine
- Decision Tree
Multivariate Linear Regression
- Use the encoded qualitative variables to build models
- Recall: city names are categorized into four regions (City of Toronto, Peel, York, Durham)
  Z1 = 1 if the house is in Peel, else Z1 = 0
  Z2 = 1 if the house is in York, else Z2 = 0
  Z3 = 1 if the house is in Durham, else Z3 = 0
  Yi = α0 + α1Z1 + α2Z2 + α3Z3
- Train the models on the training samples
- Draw conclusions from the results on the test sets
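When the region dummies are the only regressors, the least-squares fit of Yi = α0 + α1Z1 + α2Z2 + α3Z3 has a closed form: α0 is the mean of Y in the baseline region (City of Toronto), and each other coefficient is that region's mean minus the baseline mean. The sketch below uses this shortcut instead of a general regression solver; the function names and the training values are made up for illustration, and a full model with further features would need a proper least-squares routine.

```python
from statistics import mean


def fit_region_model(samples):
    """samples: list of (region, y) pairs. Returns (alpha0, alpha1, alpha2, alpha3).

    Assumes every one of the four regions appears at least once in the samples.
    """
    by_region = {}
    for region, y in samples:
        by_region.setdefault(region, []).append(y)
    baseline = mean(by_region["City of Toronto"])
    # Coefficients for Peel (Z1), York (Z2), Durham (Z3), relative to the baseline.
    offsets = tuple(mean(by_region[r]) - baseline for r in ("Peel", "York", "Durham"))
    return (baseline,) + offsets


def predict(alphas, z1, z2, z3):
    """Evaluate Y = alpha0 + alpha1*Z1 + alpha2*Z2 + alpha3*Z3."""
    a0, a1, a2, a3 = alphas
    return a0 + a1 * z1 + a2 * z2 + a3 * z3


# Made-up (region, sold price minus ask price) pairs, not from the dataset.
train = [
    ("City of Toronto", -5000), ("City of Toronto", 3000),
    ("Peel", -8000), ("Peel", -4000),
    ("York", -10000), ("York", -6000),
    ("Durham", 1000), ("Durham", 3000),
]
alphas = fit_region_model(train)
print(alphas)                    # (-1000, -5000, -7000, 3000)
print(predict(alphas, 0, 1, 0))  # -8000, the York group mean
```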
Support Vector Machine
- Select features and generate feature subsets
- Build models with the various feature subsets and a Gaussian radial basis kernel
- Apply K-fold cross-validation when training the models
- Compare the averaged misclassification rates for each feature subset
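The evaluation loop described above can be sketched without committing to a particular SVM library. In the sketch below, all function names are assumptions; `fit_predict` is a placeholder for whatever SVM implementation is used, and the `majority_class` stand-in is explicitly not an SVM. It only exists so the cross-validation code can be run end to end. The Gaussian radial basis kernel itself is K(x, y) = exp(-γ‖x − y‖²).

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian radial basis kernel: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; fold i holds out every k-th sample from i."""
    for i in range(k):
        test = list(range(i, n, k))
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        yield train, test


def cv_misclassification(X, y, k, fit_predict):
    """Average misclassification rate over k folds.

    fit_predict(X_train, y_train, X_test) must return predicted labels for X_test.
    """
    rates = []
    for train, test in k_fold_indices(len(X), k):
        preds = fit_predict([X[j] for j in train], [y[j] for j in train],
                            [X[j] for j in test])
        errors = sum(p != y[j] for p, j in zip(preds, test))
        rates.append(errors / len(test))
    return sum(rates) / k


def majority_class(X_train, y_train, X_test):
    """Stand-in classifier (NOT an SVM): predicts the most common training label."""
    top = max(set(y_train), key=y_train.count)
    return [top] * len(X_test)


X = [[0], [1], [2], [3], [4], [5]]  # made-up single-feature samples
y = [1, 1, 1, 1, 0, 1]              # made-up labels (1 = sold above asking)
rate = cv_misclassification(X, y, k=3, fit_predict=majority_class)
```

Running the same call once per feature subset and comparing the returned rates gives the comparison named in the last step.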
Decision Tree
- Build up the tree with C4.5 algorithm
- Handle training sets with unknown attribute values by evaluating the gain or gain ratio for that attribute
- Pruning runs after the tree is created
- Pseudocode of the C4.5 algorithm:
1. Check for base cases
2. For each attribute a, find the normalized information gain from splitting on a
3. Let a_best be the attribute with the highest normalized information gain
4. Create a decision node that splits on a_best
5. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of node
- Source: C4.5 Algorithm http://en.wikipedia.org/wiki/C4.5_algorithm
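Steps 2 and 3 of the pseudocode, computing the normalized information gain (gain ratio) and picking the best attribute, can be sketched as follows. This covers only split selection for categorical attributes; base cases, recursion, unknown values, continuous attributes, and pruning are left out. The function names and the toy rows are made up for illustration.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(attribute_values, labels):
    """Information gain from splitting on an attribute, divided by the split information."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(attribute_values)
    # An attribute with a single value cannot split the data.
    return gain / split_info if split_info > 0 else 0.0


def best_attribute(rows, labels):
    """rows: list of dicts with identical keys. Return the key with the highest gain ratio."""
    return max(rows[0], key=lambda a: gain_ratio([r[a] for r in rows], labels))


# Made-up toy rows; labels mark whether the house sold above asking.
rows = [
    {"region": "Peel", "style": "2-Storey"},
    {"region": "Peel", "style": "Bungalow"},
    {"region": "York", "style": "2-Storey"},
    {"region": "York", "style": "Bungalow"},
]
labels = [1, 1, 0, 0]
print(best_attribute(rows, labels))  # region
```

In this toy data, "region" separates the labels perfectly (gain ratio 1.0) while "style" carries no information about them (gain ratio 0.0).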
Validation
- Test data set: new real data released on the website
- Used to test the prediction accuracy of the models built with these machine learning methods
Reference
- Statistics Canada. Distribution (in percentage) of private households by household type, 2001 to 2011. http://www12.statcan.gc.ca/census-recensement/2011/as-sa/98-312-x/2011003/fig/fig3_2-1-eng.cfm
- Basu, S. and Thibodeau, T. Analysis of Spatial Auto-correlation in House Prices. Journal of Real Estate Finance and Economics, Vol. 17:1, 61-85 (1998).
- Chopra, S., Thampy, T., Leahy, J., Caplin, A., and LeCun, Y. Machine Learning and the Spatial Structure of House Prices and Housing Returns (2008).
- Antipov, E. and Pokryshevskaya, E. Mass appraisal of residential apartments: An application of Random forest for valuation and a CART-based approach for model diagnostics. Working Paper (2010).
- RealMarketWatch http://realmarketwatch.com/
- C4.5 Algorithm http://en.wikipedia.org/wiki/C4.5_algorithm