  1. Categorical inputs
     SUPERVISED LEARNING IN R: REGRESSION
     Nina Zumel and John Mount, Win-Vector, LLC

  2. Example: Effect of Diet on Weight Loss
     WtLoss24 ~ Diet + Age + BMI

     Diet     | Age | BMI   | WtLoss24
     ---------|-----|-------|---------
     Med      | 59  | 30.67 |   -6.7
     Low-Carb | 48  | 29.59 |    8.4
     Low-Fat  | 52  | 32.9  |    6.3
     Med      | 53  | 28.92 |    8.3
     Low-Fat  | 47  | 30.20 |    6.3

  3. model.matrix()
     model.matrix(WtLoss24 ~ Diet + Age + BMI, data = diet)
     - All numerical values
     - Converts a categorical variable with N levels into N-1 indicator variables

  4. Indicator Variables to Represent Categories

     Original Data                 Model Matrix
     Diet     | Age | ...          (Int) | DietLow-Fat | DietMed | ...
     Med      | 59  | ...          1     | 0           | 1       | ...
     Low-Carb | 48  | ...          1     | 0           | 0       | ...
     Low-Fat  | 52  | ...          1     | 1           | 0       | ...
     Med      | 53  | ...          1     | 0           | 1       | ...
     Low-Fat  | 47  | ...          1     | 1           | 0       | ...

     Reference level: "Low-Carb"
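
A minimal runnable sketch of the encoding above; the toy data frame simply re-creates the five rows shown on slide 2, so the names and values follow the slides:

    diet <- data.frame(
      Diet = c("Med", "Low-Carb", "Low-Fat", "Med", "Low-Fat"),
      Age  = c(59, 48, 52, 53, 47),
      BMI  = c(30.67, 29.59, 32.9, 28.92, 30.20),
      WtLoss24 = c(-6.7, 8.4, 6.3, 8.3, 6.3)
    )

    # Diet has 3 levels; model.matrix() emits 2 indicator columns,
    # treating the alphabetically first level ("Low-Carb") as the reference
    model.matrix(WtLoss24 ~ Diet + Age + BMI, data = diet)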

  5. Interpreting the Indicator Variables
     Linear Model:
     lm(WtLoss24 ~ Diet + Age + BMI, data = diet)

     Coefficients:
     (Intercept)  DietLow-Fat      DietMed          Age          BMI
        -1.37149     -2.32130     -0.97883      0.12648      0.01262

  6. Issues with one-hot-encoding
     - Too many levels can be a problem
     - Example: ZIP code (about 40,000 codes)
     - Don't hash categorical levels when using geometric methods (like linear regression)!

  7. Let's practice!

  8. Interactions
     SUPERVISED LEARNING IN R: REGRESSION
     Nina Zumel and John Mount, Win-Vector, LLC

  9. Additive relationships
     Example of an additive relationship: plant_height ~ bacteria + sun
     - Change in height is the sum of the effects of bacteria and sunlight
     - A change in sunlight causes the same change in height, independent of bacteria
     - A change in bacteria causes the same change in height, independent of sunlight

  10. What is an Interaction?
      The simultaneous influence of two variables on the outcome is not additive.
      plant_height ~ bacteria + sun + bacteria:sun
      - Change in height is more (or less) than the sum of the effects due to sun/bacteria
      - At higher levels of sunlight, a 1-unit change in bacteria causes more change in height

  11. What is an Interaction?
      The simultaneous influence of two variables on the outcome is not additive.
      plant_height ~ bacteria + sun + bacteria:sun
      sun: categorical {"sun", "shade"}
      - In sun, a 1-unit change in bacteria causes m units of change in height
      - In shade, a 1-unit change in bacteria causes n units of change in height
      - Like two separate models: one for sun, one for shade.

  12. Example of no Interaction: Soybean Yield
      yield ~ Stress + SO2 + O3

  13. Example of an Interaction: Alcohol Metabolism
      Metabol ~ Gastric + Sex

  14. Expressing Interactions in Formulae
      Interaction - colon (:):
      y ~ a:b
      Main effects and interaction - asterisk (*):
      y ~ a*b            # same as y ~ a + b + a:b
      Expressing the product of two variables - I():
      y ~ I(a*b)         # models y as proportional to ab
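
A short sketch of the three formula forms on synthetic data; the data-generating process is an assumption, chosen so that a real interaction exists:

    set.seed(42)
    df <- data.frame(a = runif(100), b = runif(100))
    df$y <- 2*df$a + 3*df$b + 4*df$a*df$b + rnorm(100, sd = 0.1)

    lm(y ~ a:b,    data = df)  # interaction term only
    lm(y ~ a*b,    data = df)  # main effects plus interaction: a + b + a:b
    lm(y ~ I(a*b), data = df)  # literal product ab as a single variable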

  15. Finding the Correct Interaction Pattern

      Formula                         | RMSE (cross-validation)
      --------------------------------|------------------------
      Metabol ~ Gastric + Sex         | 1.46
      Metabol ~ Gastric * Sex         | 1.48
      Metabol ~ Gastric + Gastric:Sex | 1.39

  16. Let's practice!

  17. Transforming the response before modeling
      SUPERVISED LEARNING IN R: REGRESSION
      Nina Zumel and John Mount, Win-Vector, LLC

  18. The Log Transform for Monetary Data
      - Monetary values: lognormally distributed
      - Long tail, wide dynamic range (60 - 700K)

  19. Lognormal Distributions
      - mean > median (~50K vs 39K)
      - Predicting the mean will overpredict typical values

  20. Back to the Normal Distribution
      For a normal distribution:
      - mean ≈ median (here: 4.53 vs 4.59)
      - more reasonable dynamic range (1.8 - 5.8)

  21. The Procedure
      1. Log the outcome and fit a model
      model <- lm(log(y) ~ x, data = train)

  22. The Procedure
      1. Log the outcome and fit a model
      model <- lm(log(y) ~ x, data = train)
      2. Make the predictions in log space
      logpred <- predict(model, newdata = test)

  23. The Procedure
      1. Log the outcome and fit a model
      model <- lm(log(y) ~ x, data = train)
      2. Make the predictions in log space
      logpred <- predict(model, newdata = test)
      3. Transform the predictions to outcome space
      pred <- exp(logpred)
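
Putting the three steps together, a minimal end-to-end sketch on synthetic lognormal-style data (the data-generating process is an assumption for illustration):

    set.seed(1)
    x <- runif(200)
    y <- exp(1 + 2*x + rnorm(200, sd = 0.3))
    train <- data.frame(x = x[1:150],   y = y[1:150])
    test  <- data.frame(x = x[151:200], y = y[151:200])

    model   <- lm(log(y) ~ x, data = train)    # 1. fit in log space
    logpred <- predict(model, newdata = test)  # 2. predict in log space
    pred    <- exp(logpred)                    # 3. back to outcome space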

  24. Predicting Log-transformed Outcomes: Multiplicative Error
      log(a) + log(b) = log(ab)
      log(a) - log(b) = log(a/b)
      Multiplicative error: pred/y
      Relative error: (pred - y)/y = pred/y - 1
      Reducing multiplicative error reduces relative error.

  25. Root Mean Squared Relative Error
      RMS-relative error = sqrt(mean(((pred - y)/y)^2))
      - Predicting the log-outcome reduces RMS-relative error
      - But the model will often have larger RMSE
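
Both error measures in R, continuing the sketch above (pred is the back-transformed prediction, test$y the true outcome):

    rmse       <- sqrt(mean((pred - test$y)^2))            # root mean squared error
    rms_relerr <- sqrt(mean(((pred - test$y)/test$y)^2))   # RMS relative error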

  26. Example: Model Income Directly
      modIncome <- lm(Income ~ AFQT + Educ, data = train)
      - AFQT: score on a proficiency test 25 years before the survey
      - Educ: years of education at the time of the survey
      - Income: income at the time of the survey

  27. Model Performance
      test %>%
        mutate(pred = predict(modIncome, newdata = test),
               err = pred - Income) %>%
        summarize(rmse = sqrt(mean(err^2)),
                  rms.relerr = sqrt(mean((err/Income)^2)))

      RMSE: 36,819.39    RMS-relative error: 3.295189

  28. Model log(Income)
      modLogIncome <- lm(log(Income) ~ AFQT + Educ, data = train)

  29. Model Performance
      test %>%
        mutate(predlog = predict(modLogIncome, newdata = test),
               pred = exp(predlog),
               err = pred - Income) %>%
        summarize(rmse = sqrt(mean(err^2)),
                  rms.relerr = sqrt(mean((err/Income)^2)))

      RMSE: 38,906.61    RMS-relative error: 2.276865

  30. Compare Errors
      The log(Income) model has smaller RMS-relative error but larger RMSE.

      Model          | RMSE      | RMS-relative error
      ---------------|-----------|-------------------
      On Income      | 36,819.39 | 3.295189
      On log(Income) | 38,906.61 | 2.276865

  31. Let's practice!

  32. Transforming inputs before modeling
      SUPERVISED LEARNING IN R: REGRESSION
      Nina Zumel and John Mount, Win-Vector, LLC

  33. Why To Transform Input Variables
      Domain knowledge / synthetic variables:
      Intelligence ~ mass.brain / mass.body^(2/3)

  34. Why To Transform Input Variables
      Domain knowledge / synthetic variables:
      Intelligence ~ mass.brain / mass.body^(2/3)
      Pragmatic reasons:
      - Log transform to reduce dynamic range
      - Log transform because meaningful changes in the variable are multiplicative

  35. Why To Transform Input Variables
      Domain knowledge / synthetic variables:
      Intelligence ~ mass.brain / mass.body^(2/3)
      Pragmatic reasons:
      - Log transform to reduce dynamic range
      - Log transform because meaningful changes in the variable are multiplicative
      - y approximately linear in f(x) rather than in x
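
A small sketch of input transforms inside a formula; the synthetic data and the animals data frame name are assumptions for illustration:

    df <- data.frame(x = seq(1, 100))
    df$y <- 5 * log(df$x) + rnorm(100, sd = 0.5)
    lm(y ~ log(x), data = df)   # y approximately linear in log(x)

    # a domain-knowledge synthetic variable uses I() for literal arithmetic:
    # lm(Intelligence ~ I(mass.brain / mass.body^(2/3)), data = animals)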

  36. Example: Predicting Anxiety

  37. Transforming the hassles variable

  38. Different possible fits
      Which is best?
      anx ~ I(hassles^2)
      anx ~ I(hassles^3)
      anx ~ I(hassles^2) + I(hassles^3)
      anx ~ exp(hassles)
      ...
      I(): treat an expression literally (not as an interaction)

  39. Compare different models
      Linear, quadratic, and cubic models:
      mod_lin <- lm(anx ~ hassles, data = hassleframe)
      summary(mod_lin)$r.squared       # 0.5334847
      mod_quad <- lm(anx ~ I(hassles^2), data = hassleframe)
      summary(mod_quad)$r.squared      # 0.6241029
      mod_cubic <- lm(anx ~ I(hassles^3), data = hassleframe)
      summary(mod_cubic)$r.squared     # 0.6474421

  40. Compare different models
      Use cross-validation to evaluate the models:

      Model                 | RMSE
      ----------------------|-----
      Linear (hassles)      | 7.69
      Quadratic (hassles^2) | 6.89
      Cubic (hassles^3)     | 6.70
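
A minimal base-R k-fold cross-validation sketch that parallels the comparison above (the fold logic is an assumption; hassleframe and anx come from the slides):

    kfold_rmse <- function(fmla, data, k = 5) {
      outcome <- all.vars(fmla)[1]                # left-hand-side variable
      folds <- sample(rep(1:k, length.out = nrow(data)))
      errs <- lapply(1:k, function(i) {
        model <- lm(fmla, data = data[folds != i, ])
        predict(model, newdata = data[folds == i, ]) - data[folds == i, outcome]
      })
      sqrt(mean(unlist(errs)^2))
    }

    kfold_rmse(anx ~ hassles,      hassleframe)  # linear
    kfold_rmse(anx ~ I(hassles^2), hassleframe)  # quadratic
    kfold_rmse(anx ~ I(hassles^3), hassleframe)  # cubic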

  41. Let's practice!
