Blue Book for Bulldozers: Predicting Auction Sale Price to Create a “Blue Book” for a Piece of Heavy Equipment for Bulldozers




SLIDE 1

Blue Book for Bulldozers

Predicting Auction Sale Price to Create a “Blue Book” for a Piece of Heavy Equipment for Bulldozers

Yoojong Bang, Joon Lim, Benedict Lim, Samuel Hills, Eun Hee Ko

MASTER IN ANALYTICS | NORTHWESTERN UNIVERSITY

SLIDE 2

Project Overview

  • Kaggle competition sponsored by FastIron
  • Predict auction price of bulldozers
  • Training set
      – Over 400k observations
      – 52 predictor variables
  • Predictor variables consist of information on machine size, usage and configuration of equipment

SLIDE 3

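Forecast Goal

The validation criterion for this competition is the root mean squared logarithmic error (RMSLE):

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(Y_i + 1) - \log(\hat{Y}_i + 1) \right)^2}$$

where $Y_i$ is the actual sale price, $\hat{Y}_i$ is the predicted sale price, and $n$ is the number of observations.

The current top-ranked model has an RMSLE of 0.2209. Our cross-validated estimate of RMSLE beats this value, but our models do not reach it on the validation set provided by Kaggle.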
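As a minimal sketch of the metric above, the RMSLE can be computed in a few lines of Python. The function name `rmsle` and the plain-list inputs are illustrative choices, not part of the original project (which appears to have been carried out in R).

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error of two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    # Squared difference of log(value + 1) for each observation.
    squared_log_errors = [
        (math.log(y + 1) - math.log(y_hat + 1)) ** 2
        for y, y_hat in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))
```

Because the error is taken on the log scale, a prediction that is off by a fixed ratio is penalized equally for a cheap and an expensive machine.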

SLIDE 4

Challenges

Variable Sparsity

  • The majority of predictor variables are very sparse
  • No observations contain values for all predictors
  • Even after subsetting predictors, most models do not accept null values

Multicollinearity

  • Many predictors are identical

Categorical Variables

  • Almost all predictors are categorical
  • Most local linear regression models do not accept categorical variables

SLIDE 5

Data Description

Response Variable – Sale Price

Min.    1st Qu.   Median   Mean     3rd Qu.   Max.
4,750   14,500    24,000   31,100   40,000    142,000

MachineHoursCurrentMeter

Min.   1st Qu.   Median   Mean    3rd Qu.   Max.      NA’s
                          3,458   3,025     248,300   258,360

YearMade

Min.   1st Qu.   Median   Mean   3rd Qu.   Max.   NA’s
1919   1988      1996     1994   2001      2013   38,185

Enclosure: five values, NA’s = 2

Data transformed: log10

fiProductClassDesc: 74 values, NA’s = 0

State: 53 values, NA’s = 2,801

SLIDE 6

Linear Regression

Model Setup Parameters

  • No Parameters

Data Description

  • Split data into 57 Product Classes
  • Linear regression on
      • Number of Machine Hours on the Current Meter
      • Year made

RMSLE Results

  • Min: 0.220586345
  • Max: 0.730789603
  • Average: 0.400408126

Additional Remarks

  • Average coefficient
      • Number of machine hours = -245.7014
      • Year made = 11585.12
  • These coefficients make sense: the longer a machine has been used, the lower its price, and the ‘younger’ the machine, the more valuable it is.
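The per-class setup on this slide (one linear model per product class, with machine hours and year made as predictors) could be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the record layout `(product_class, hours, year, price)` and all function names are hypothetical, and the fit uses plain normal equations, which a real analysis would replace with a numerically robust library routine.

```python
from collections import defaultdict


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(rows):
    """rows: (hours, year, price) tuples. Returns [intercept, b_hours, b_year]."""
    design = [(1.0, hours, year) for hours, year, _ in rows]
    targets = [price for _, _, price in rows]
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(3)] for i in range(3)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(3)]
    return solve_linear_system(xtx, xty)


def fit_per_class(records):
    """records: (product_class, hours, year, price). One model per class."""
    by_class = defaultdict(list)
    for product_class, hours, year, price in records:
        by_class[product_class].append((hours, year, price))
    return {cls: fit_ols(rows) for cls, rows in by_class.items()}
```

A negative hours coefficient and a positive year coefficient, as reported in the remarks above, correspond to the second and third entries of each returned list.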

SLIDE 7

Ridge Regression

Model Setup Parameters

  • 20 values of lambda
      • X = 1:20
      • Lambda = 1/(1.5^(X-1))

Data Description

  • Split data into 57 Product Classes
  • Linear regression on
      • Number of Machine Hours on the Current Meter
      • Year made

RMSLE Results

  • Min: 0.511227413
  • Max: 1.818759302
  • Average: 1.01693174

Additional Remarks

  • Little correlation between predictor variables
  • The lambda with the lowest RMSLE for all 57 product classes was the smallest lambda, 0.00045
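The lambda grid from the setup parameters is easy to reproduce, which also confirms that its smallest value matches the 0.00045 quoted in the remarks. The snippet below is a Python transcription of the slide's `X=1:20`, `Lambda=1/(1.5^(X-1))`; the variable names are illustrative.

```python
# Lambda grid from the slide: X = 1..20, lambda = 1 / 1.5**(X - 1).
lambdas = [1 / 1.5 ** (x - 1) for x in range(1, 21)]

largest = max(lambdas)   # X = 1
smallest = min(lambdas)  # X = 20

print(f"number of lambdas: {len(lambdas)}")
print(f"largest lambda:    {largest:.5f}")
print(f"smallest lambda:   {smallest:.5f}")
```

Since the best value sits at the edge of the grid for every product class, the grid itself may be the limiting factor: extending it toward smaller lambdas would show whether the penalty helps at all for these two predictors.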

SLIDE 8

K-Nearest Neighbor Classification (KNN)

Model Setup Parameters

  • Number of Nearest Neighbors: 3 to 10

Data Description

  • Split data into 57 Product Classes
  • KNN on
      • Number of Machine Hours on the Current Meter
      • Year made

RMSLE Results

  • Min: 0.206888639
  • Max: 0.657082542
  • Average: 0.34215316

Additional Remarks

Number of Nearest Neighbors                           5    6    7    8    9    10
Number of Product Classes with Associated Optimal K   1    1    4    6    12   33
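Choosing K per product class, as tabulated above, amounts to scoring each candidate K on held-out data and keeping the best one. The sketch below shows one way this could look. It assumes that a prediction is the plain average of the neighbours' sale prices and that distances are Euclidean on unscaled features; the slide states neither, and all names are hypothetical.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error."""
    total = sum(
        (math.log(y + 1) - math.log(p + 1)) ** 2
        for y, p in zip(actual, predicted)
    )
    return math.sqrt(total / len(actual))


def knn_predict(train_x, train_y, query, k):
    """Average the sale prices of the k training rows closest to `query`."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]  # fewer than k rows if the training set is small
    return sum(train_y[i] for i in nearest) / len(nearest)


def choose_k(train_x, train_y, valid_x, valid_y, candidates=range(3, 11)):
    """Return the k with the lowest hold-out RMSLE, plus all scores."""
    scores = {}
    for k in candidates:
        predictions = [knn_predict(train_x, train_y, q, k) for q in valid_x]
        scores[k] = rmsle(valid_y, predictions)
    return min(scores, key=scores.get), scores
```

Running `choose_k` once per product class and counting the winners would yield a table of the same shape as the one above. The fact that 33 of 57 classes prefer the largest K tried suggests that values above 10 might be worth testing.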

SLIDE 9

Support Vector Machines Classification (SVM)

Model Setup Parameters

  • Four types of kernels (Tuning Parameters)
      • Polynomial (Degree and Gamma)
      • Sigmoid (Gamma and Coefficient)
      • Radial (Gamma)
      • Linear (Gamma)
  • Gamma Range: 10^-6 to 0.1
  • Degree Range: 2 to 6
  • Coefficient Range: 0 to 3

Data Description

  • Split data into 57 Product Classes
  • SVM on:
      • State of sale
      • Type of enclosure
      • Number of Machine Hours on the Current Meter
      • Year made

RMSLE Results

  • Min: 0.208401176
  • Max: 0.519523311
  • Average: 0.329685653
SLIDE 10

Boosting (GBM)

Model Setup Parameters

  • Interaction terms {1, 2, 3, 4}
  • Shrinkage parameter {0.1, 0.2, …, 1.0}

Data Description

  • Single model run
  • Subset of 26 variables
  • Mix of quantitative and qualitative

RMSLE Results

  • Min: 0.1447
  • Max: 0.1597
  • Average: 0.1483

Additional Remarks

  • Even with variable selection, only a few predictor variables are dominant
  • Chose 100 trees to create model
      • Chosen heuristically
      • Tried 10, 100, and 1000 on a few models
SLIDE 11

Regression Trees (CART)

Model Setup Parameters

  • Prune the tree based on Cp = 0.1

Data Description

  • Used all 52 Variables

RMSLE Results

  • 0.3318572

Additional Remarks

  • fiProductClassDesc is the most important predictor variable.
  • Error is randomly distributed: E[error] = 0.

SLIDE 12

GAM

Model Setup Parameters

  • No parameters

Data Description

  • Split data into 57 Product Classes
  • Variables used to fit GAM:
      • Number of Machine Hours on the Current Meter
      • Year made

RMSLE Results

  • Min: 0.214957079
  • Max: 0.683211532
  • Average: 0.35196
SLIDE 13

MARS

Model Setup Parameters

  • Degrees of interaction {1, 2, 3, 4}

Data Description

  • Single model run
  • Subset of 5 variables
  • Mix of quantitative and qualitative

RMSLE Results

  • Min: 0.1497
  • Max: 0.1512
  • Average: 0.1501

Additional Remarks

  • Subset of variables chosen because the R package can only be run on observations without null values
  • Because of the small number of variables, the models with 2, 3, and 4 degrees of interaction were identical (due to the nature of the backward pass)

SLIDE 14

Random Forest

Model Setup Parameters

  • N.tree=1000

Data Description

  • Random Forest on:
      • MachineID
      • ProductGroup
      • YearMade
      • Saledate

RMSLE Results

  • 0.4819

Additional Remarks

  • Two R packages: randomForest vs. party (the difference lies in variable importance and the base tree)
  • Very high computational power required, especially RAM: 8 GB was not enough
  • randomForest() requires non-null variables with fewer than 8 categories.
  • cforest() can handle missing values and more categories, but takes far too long to run.
SLIDE 15

Stacked Generalization - Stacking

We stacked three models with a squared-error loss function.

  1) Random Forest: 10-fold CV RMSLE = 0.4819
  2) Regression Tree: 10-fold CV RMSLE = 0.3365
  3) Gradient Boosted Model: 10-fold CV RMSLE = 0.1447

Average RMSLE of the above three models = 0.321

The Stacked Generalization Model has a 10-fold RMSLE of 0.2646284.

Coefficients:

        Estimate    Std. Error   t value   Pr(>|t|)
RandF   -0.046102   0.001138     -40.52    <2e-16 ***
CART     0.391332   0.001374     284.86    <2e-16 ***
GBM      0.655589   0.001504     435.97    <2e-16 ***
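To make the coefficient table concrete, the sketch below applies the reported weights to a set of base-model predictions. It assumes the stacked model is a plain weighted sum with no intercept (none is listed on the slide) and leaves open whether the inputs are prices or log prices, which the slide does not state. The names `STACK_WEIGHTS` and `stacked_prediction` are hypothetical.

```python
# Coefficients as reported on the slide; no intercept is listed there,
# so this sketch assumes a weighted sum without one.
STACK_WEIGHTS = {"RandF": -0.046102, "CART": 0.391332, "GBM": 0.655589}


def stacked_prediction(base_predictions):
    """Combine one prediction per base model into a stacked prediction.

    base_predictions maps each model name in STACK_WEIGHTS to that
    model's prediction for a single observation.
    """
    return sum(
        weight * base_predictions[name] for name, weight in STACK_WEIGHTS.items()
    )
```

The weights sum to roughly 1.0, with the gradient boosted model carrying about two thirds of the total and the random forest contributing a small negative correction.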