Blue Book for Bulldozers
Predicting Auction Sale Price to Create a “Blue Book” for a Piece of Heavy Equipment for Bulldozers
Yoojong Bang, Joon Lim, Benedict Lim, Samuel Hills, Eun Hee Ko
MASTER IN ANALYTICS | NORTHWESTERN UNIVERSITY

Project Overview
Kaggle competition sponsored by FastIron
Predict auction price of bulldozers
Training set
– Over 400k observations
– 52 predictor variables
Predictor variables consist of information on machine size, usage and configuration of equipment

Forecast Goal
The validation criterion for this competition is root mean squared logarithmic error (RMSLE):
$$\mathrm{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(Y_i + 1) - \log(\hat{Y}_i + 1) \right)^2}$$
The current top-ranked model has an RMSLE of 0.2209. Our cross-validated estimate of RMSLE beats this value, but we do not reach it on the validation set provided by Kaggle.
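The metric can be computed directly from its definition. The following is a minimal sketch, assuming actual and predicted prices arrive as plain Python lists; the function name `rmsle` and the sample prices are illustrative and not taken from the slides.

```python
import math

def rmsle(actual, predicted):
    """Root mean squared logarithmic error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, y_hat in zip(actual, predicted):
        # log1p(x) computes log(x + 1), matching the +1 inside each logarithm
        total += (math.log1p(y) - math.log1p(y_hat)) ** 2
    return math.sqrt(total / len(actual))

# Hypothetical sale prices, not taken from the competition data
actual = [24000.0, 14500.0, 40000.0]
predicted = [26000.0, 14500.0, 35000.0]
print(rmsle(actual, predicted))
```

Because the error is measured on the log scale, a prediction that is off by a fixed percentage is penalized about equally for a cheap machine and an expensive one.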

Challenges
Variable Sparsity
– Majority of predictor variables are very sparse
– No observations contain values for all predictors
– Even after subsetting predictors, most models do not accept null values
Multicollinearity
– Many predictors are identical
Categorical Variables
– Almost all predictors are categorical
– Most local linear regression models do not accept categorical variables

Data Description

Response Variable – Sale Price
Data transformed: log10
Min.     1st Qu.   Median   Mean     3rd Qu.   Max.
4,750    14,500    24,000   31,100   40,000    142,000

MachineHoursCurrentMeter
Min.     1st Qu.   Median   Mean     3rd Qu.   Max.      NA’s
0        0         0        3,458    3,025     248,300   258,360

YearMade
Min.     1st Qu.   Median   Mean     3rd Qu.   Max.      NA’s
1919     1988      1996     1994     2001      2013      38,185

Categorical variables
Variable             Distinct values   NA’s
Enclosure            5                 2
State                53                0
fiProductClassDesc   74                2,801

Linear Regression
Model Setup Parameters
• No parameters
Data Description
• Split data into 57 Product Classes
• Linear regression on
  – Number of Machine Hours on the Current Meter
  – Year made
RMSLE Results
• Min: 0.220586345
• Max: 0.730789603
• Average: 0.400408126
Additional Remarks
• Average coefficients
  – Number of machine hours = -245.7014
  – Year made = 11585.12
• These coefficients make sense: the longer a machine has been used, the lower its price, and the ‘younger’ the machine, the more valuable it is.
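The per-class setup described on this slide can be sketched as one least-squares fit per product class. This is a minimal illustration, assuming records arrive as (product class, machine hours, year made, price) tuples; the helper names `solve`, `fit_ols` and `fit_per_class` and the sample data are hypothetical, and the predictors are centered only to keep the normal equations well conditioned.

```python
from collections import defaultdict

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_ols(rows, targets):
    """Fit target ~ intercept + hours + year; returns (intercept, b_hours, b_year).

    Assumes at least three rows that are not collinear in (hours, year).
    """
    n = len(rows)
    mean_h = sum(h for h, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    design = [[1.0, h - mean_h, y - mean_y] for h, y in rows]
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(design, targets)) for i in range(3)]
    b0, b_h, b_y = solve(xtx, xty)
    # Undo the centering so the intercept refers to the raw predictors
    return b0 - b_h * mean_h - b_y * mean_y, b_h, b_y

def fit_per_class(records):
    """Group records by product class and fit one regression per class."""
    groups = defaultdict(list)
    for cls, hours, year, price in records:
        groups[cls].append((hours, year, price))
    return {
        cls: fit_ols([(h, y) for h, y, _ in g], [p for _, _, p in g])
        for cls, g in groups.items()
    }

# Synthetic records for two made-up classes, not the competition data
records = [
    (cls, float(h), float(y), base + slope_h * h + slope_y * (y - 1990))
    for cls, base, slope_h, slope_y in [("A", 20000.0, -2.0, 1000.0), ("B", 50000.0, -5.0, 2000.0)]
    for h in (0, 1000, 2000, 3000)
    for y in (1995, 2000, 2005)
]
coefs = fit_per_class(records)
for cls, (b0, b_h, b_y) in sorted(coefs.items()):
    print(cls, round(b_h, 4), round(b_y, 4))
```

A negative hours coefficient and a positive year coefficient in each class correspond to the interpretation given in the remarks.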

Ridge Regression
Model Setup Parameters
• 20 values of lambda
  – X = 1:20
  – Lambda = 1/(1.5^(X-1))
Data Description
• Split data into 57 Product Classes
• Linear regression on
  – Number of Machine Hours on the Current Meter
  – Year made
RMSLE Results
• Min: 0.511227413
• Max: 1.818759302
• Average: 1.01693174
Additional Remarks
• Little correlation between predictor variables
• The lambda with the lowest RMSLE for all 57 product classes was the smallest lambda, 0.00045.
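The lambda grid follows directly from the formula on the slide. A short sketch of how the 20 values could be generated (the variable name `lambdas` is illustrative):

```python
# X = 1..20 and lambda = 1 / 1.5^(X - 1), as listed on the slide
lambdas = [1.0 / 1.5 ** (x - 1) for x in range(1, 21)]

print(len(lambdas))           # number of candidate values
print(lambdas[0])             # largest lambda (X = 1)
print(round(lambdas[-1], 5))  # smallest lambda (X = 20), about 0.00045
```

The smallest value in the grid matches the 0.00045 reported in the remarks, so the best-performing penalty sat at the edge of the search range.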

K-Nearest Neighbor Classification (KNN)
Model Setup Parameters
• Number of Nearest Neighbors: 3 to 10
Data Description
• Split data into 57 Product Classes
• KNN on
  – Number of Machine Hours on the Current Meter
  – Year made
RMSLE Results
• Min: 0.206888639
• Max: 0.657082542
• Average: 0.34215316
Additional Remarks
Number of Nearest Neighbors (K)                           5    6    7    8    9    10
Number of Product Classes with Associated Optimal K       1    1    4    6    12   33
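Choosing K per product class amounts to scoring each candidate K on held-out data and keeping the best. The sketch below assumes the neighbours' prices are averaged (sale price is numeric), uses Euclidean distance on the raw predictors, and runs on synthetic data; the names `knn_predict` and `best_k` are hypothetical, and none of these choices is confirmed by the slides.

```python
import math

def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training points closest to `query`."""
    by_distance = sorted(
        range(len(train_x)),
        key=lambda i: math.dist(train_x[i], query),
    )
    nearest = by_distance[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def rmsle(actual, predicted):
    """Root mean squared logarithmic error."""
    sq = [(math.log1p(a) - math.log1p(p)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(sq) / len(sq))

def best_k(train_x, train_y, valid_x, valid_y, candidates=range(3, 11)):
    """Return (k, error) with the lowest validation RMSLE over the candidates."""
    scores = {}
    for k in candidates:
        preds = [knn_predict(train_x, train_y, q, k) for q in valid_x]
        scores[k] = rmsle(valid_y, preds)
    k = min(scores, key=scores.get)
    return k, scores[k]

# Synthetic (machine hours, year made) points for one made-up product class
train_x = [(float(h), float(y)) for h in range(0, 5000, 500) for y in (1990, 1995, 2000, 2005)]
train_y = [30000.0 - 2.0 * h + 500.0 * (y - 1990) for h, y in train_x]
valid_x = [(750.0, 1993.0), (2250.0, 2001.0), (4100.0, 1997.0)]
valid_y = [30000.0 - 2.0 * h + 500.0 * (y - 1990) for h, y in valid_x]

k, err = best_k(train_x, train_y, valid_x, valid_y)
print(k, round(err, 4))
```

Machine hours and year made sit on very different scales, so in practice the predictors would likely need rescaling before distances are computed; the sketch leaves them raw for brevity.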

Support Vector Machines Classification (SVM)
Model Setup Parameters
• Four types of kernels (Tuning Parameters)
  – Polynomial (Degree and Gamma)
  – Sigmoid (Gamma and Coefficient)
  – Radial (Gamma)
  – Linear (Gamma)
• Gamma Range: 10^-6 to 0.1
• Degree Range: 2 to 6
• Coefficient Range: 0 to 3
Data Description
• Split data into 57 Product Classes
• SVM on:
  – State of sale
  – Type of enclosure
  – Number of Machine Hours on the Current Meter
  – Year made
RMSLE Results
• Min: 0.208401176
• Max: 0.519523311
• Average: 0.329685653

Boosting (GBM)
Model Setup Parameters
• Interaction terms {1, 2, 3, 4}
• Shrinkage parameter {0.1, 0.2, …, 1.0}
Data Description
• Single model run
• Subset of 26 variables
• Mix of quantitative and qualitative
RMSLE Results
• Min: 0.1447
• Max: 0.1597
• Average: 0.1483
Additional Remarks
• Even with variable selection, only a few predictor variables are dominant
• Chose 100 trees to create model
  – Chosen heuristically
  – Tried 10, 100, and 1000 on a few models

Regression Trees (CART)
Model Setup Parameters
• Prune the tree based on Cp = 0.1
Data Description
• Used all 52 variables
RMSLE Results
• 0.3318572
Additional Remarks
• fiProductClassDesc is the most important predictor variable.
• Error is randomly distributed. E[error] = 0.

GAM
Model Setup Parameters
• No parameters
Data Description
• Split data into 57 Product Classes
• Variables used to fit GAM:
  – Number of Machine Hours on the Current Meter
  – Year made
RMSLE Results
• Min: 0.214957079
• Max: 0.683211532
• Average: 0.35196

MARS
Model Setup Parameters
• Degrees of interaction {1, 2, 3, 4}
Data Description
• Single model run
• Subset of 5 variables
• Mix of quantitative and qualitative
RMSLE Results
• Min: 0.1497
• Max: 0.1512
• Average: 0.1501
Additional Remarks
• Subset of variables chosen because the R package can only be run on observations without null values
• Because of the small number of variables, the models with 2, 3, and 4 degrees of interaction were identical (due to the nature of the backward pass)

Random Forest
Model Setup Parameters
• N.tree = 1000
Data Description
• Random Forest on:
  – MachineID
  – ProductGroup
  – YearMade
  – Saledate
RMSLE Results
• 0.4819
Additional Remarks
• Two R packages: randomForest vs. party (the difference lies in variable importance and base tree)
• Very high computational power required, especially RAM; 8 GB was not enough
• randomForest() requires non-null variables with fewer than 8 categories.
• cforest() can handle missing values and more categories but takes far too long.

Stacked Generalization - Stacking
We stacked three models with a squared-error loss function.
1) Random Forest: 10-fold CV RMSLE = 0.4819
2) Regression Tree: 10-fold CV RMSLE = 0.3365
3) Gradient Boosted Model: 10-fold CV RMSLE = 0.1447
Average RMSLE of the above three models = 0.321

Coefficients:
        Estimate    Std. Error   t value   Pr(>|t|)
RandF   -0.046102   0.001138     -40.52    <2e-16 ***
CART     0.391332   0.001374     284.86    <2e-16 ***
GBM      0.655589   0.001504     435.97    <2e-16 ***

The Stacked Generalization Model has a 10-fold RMSLE of 0.2646284.
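With a squared-error loss, stacking reduces to regressing the response on the base models' predictions. The sketch below assumes out-of-fold predictions from each base model are available as lists and fits the weights by least squares without an intercept (none appears in the coefficient table); the helper names and the numbers are illustrative and are not the authors' code or data.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def stack_weights(base_preds, target):
    """Least-squares weights (no intercept) for combining base-model predictions.

    base_preds holds one list of predictions per base model.
    """
    p = len(base_preds)
    xtx = [
        [sum(u * v for u, v in zip(base_preds[i], base_preds[j])) for j in range(p)]
        for i in range(p)
    ]
    xty = [sum(u * t for u, t in zip(base_preds[i], target)) for i in range(p)]
    return solve(xtx, xty)

def stack_predict(weights, base_preds):
    """Weighted sum of the base-model predictions for every observation."""
    n_obs = len(base_preds[0])
    return [sum(w * col[k] for w, col in zip(weights, base_preds)) for k in range(n_obs)]

# Made-up log-price predictions from three base models
rf = [9.8, 10.4, 10.1, 9.5, 10.9]
cart = [10.0, 10.2, 10.3, 9.7, 10.6]
gbm = [10.1, 10.5, 10.0, 9.6, 10.8]
# Synthetic response built as a known blend, so the fit can be checked
target = [0.4 * c + 0.6 * g for c, g in zip(cart, gbm)]

weights = stack_weights([rf, cart, gbm], target)
print([round(w, 4) for w in weights])
```

Weights fitted this way are not constrained to be positive or to sum to one, which is consistent with the small negative estimate shown for the random forest.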
