Categorical inputs
SUPERVISED LEARNING IN R: REGRESSION
Nina Zumel and John Mount
Win-Vector, LLC
SUPERVISED LEARNING IN R: REGRESSION
Example: Effect of Diet on Weight Loss

WtLoss24 ~ Diet + Age + BMI
Diet      Age  BMI    WtLoss24
Med       59   30.67  -6.7
Low-Carb  48   29.59  8.4
Low-Fat   52   32.9   6.3
Med       53   28.92  8.3
Low-Fat   47   30.20  6.3
SUPERVISED LEARNING IN R: REGRESSION
model.matrix(WtLoss24 ~ Diet + Age + BMI, data = diet)
- Converts the data to all numerical values
- Converts a categorical variable with N levels into N - 1 indicator variables
SUPERVISED LEARNING IN R: REGRESSION
Original Data:

Diet      Age  ...
Med       59   ...
Low-Carb  48   ...
Low-Fat   52   ...
Med       53   ...
Low-Fat   47   ...

Model Matrix:

(Int)  DietLow-Fat  DietMed  ...
1      0            1        ...
1      0            0        ...
1      1            0        ...
1      0            1        ...
1      1            0        ...

reference level: "Low-Carb"
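A minimal sketch of this encoding; the data frame below is hand-built from the example rows above:

# Rebuild the example data (values taken from the slides)
diet <- data.frame(
  Diet     = c("Med", "Low-Carb", "Low-Fat", "Med", "Low-Fat"),
  Age      = c(59, 48, 52, 53, 47),
  BMI      = c(30.67, 29.59, 32.9, 28.92, 30.20),
  WtLoss24 = c(-6.7, 8.4, 6.3, 8.3, 6.3)
)

# Diet has 3 levels, so model.matrix() creates 2 indicator columns;
# the first level alphabetically ("Low-Carb") becomes the reference level
model.matrix(WtLoss24 ~ Diet + Age + BMI, data = diet)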
SUPERVISED LEARNING IN R: REGRESSION
Linear Model:
lm(WtLoss24 ~ Diet + Age + BMI, data = diet)

Coefficients:
(Intercept)  DietLow-Fat      DietMed          Age          BMI
          …            …            …      0.12648      0.01262
SUPERVISED LEARNING IN R: REGRESSION
Too many levels can be a problem
- Example: ZIP code (about 40,000 codes) would expand to about 40,000 indicator variables
- Don't hash with geometric methods!
Interactions
SUPERVISED LEARNING IN R: REGRESSION
Nina Zumel and John Mount
Win-Vector, LLC
SUPERVISED LEARNING IN R: REGRESSION
Example of an additive relationship:
plant_height ~ bacteria + sun
- Change in height is the sum of the effects of bacteria and sunlight
- A change in sunlight causes the same change in height, independent of bacteria
- A change in bacteria causes the same change in height, independent of sunlight
SUPERVISED LEARNING IN R: REGRESSION
An interaction: the simultaneous influence of two variables on the outcome is not additive.
plant_height ~ bacteria + sun + bacteria:sun
- Change in height is more (or less) than the sum of the effects due to sun and bacteria
- At higher levels of sunlight, a 1 unit change in bacteria causes more change in height
SUPERVISED LEARNING IN R: REGRESSION
An interaction: the simultaneous influence of two variables on the outcome is not additive.
plant_height ~ bacteria + sun + bacteria:sun

sun: categorical, with levels {"sun", "shade"}
- In sun, a 1 unit change in bacteria causes m units change in height
- In shade, a 1 unit change in bacteria causes n units change in height
- Like two separate models: one for sun, one for shade (see the sketch below)
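One way to see the "two separate models" point, writing s for the indicator variable that model.matrix() would create for the "sun" level:

plant_height = b0 + b1*bacteria + b2*s + b3*(bacteria*s)

In shade (s = 0): the slope of bacteria is b1 (the n above)
In sun (s = 1): the slope of bacteria is b1 + b3 (the m above)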
SUPERVISED LEARNING IN R: REGRESSION
yield ~ Stress + SO2 + O3
SUPERVISED LEARNING IN R: REGRESSION
Metabol ~ Gastric + Sex
SUPERVISED LEARNING IN R: REGRESSION
Interaction - Colon ( : )
y ~ a:b
Main effects and interaction - Asterisk ( * )

y ~ a*b
y ~ a + b + a:b   # both mean the same
Expressing the product of two variables - I()
y ~ I(a*b)
same as y ∝ ab
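A small sketch of how the three operators expand, using a hypothetical data frame with numeric columns a and b:

# Hypothetical data, just to inspect the columns each formula produces
df <- data.frame(y = rnorm(6), a = rnorm(6), b = rnorm(6))

colnames(model.matrix(y ~ a:b, df))     # "(Intercept)" "a:b"            interaction only
colnames(model.matrix(y ~ a*b, df))     # "(Intercept)" "a" "b" "a:b"    main effects + interaction
colnames(model.matrix(y ~ I(a*b), df))  # "(Intercept)" "I(a * b)"       literal product

For numeric variables, a:b and I(a*b) produce the same column of numbers; the distinction matters most when the variables are categorical.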
SUPERVISED LEARNING IN R: REGRESSION
Formula                           RMSE (cross-validation)
Metabol ~ Gastric + Sex           1.46
Metabol ~ Gastric * Sex           1.48
Metabol ~ Gastric + Gastric:Sex   1.39
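A minimal sketch of how such a comparison can be run, assuming a data frame alcohol with columns Metabol, Gastric, and Sex (the data and the exact cross-validation scheme are not shown in the slides):

# k-fold cross-validated RMSE for a given formula
kfold_rmse <- function(fmla, data, outcome, k = 3) {
  n <- nrow(data)
  fold <- sample(rep(1:k, length.out = n))   # random fold assignment
  pred <- numeric(n)
  for (i in 1:k) {
    model <- lm(fmla, data = data[fold != i, ])                      # train on k-1 folds
    pred[fold == i] <- predict(model, newdata = data[fold == i, ])   # predict the held-out fold
  }
  sqrt(mean((pred - data[[outcome]])^2))
}

kfold_rmse(Metabol ~ Gastric + Sex,         alcohol, "Metabol")
kfold_rmse(Metabol ~ Gastric * Sex,         alcohol, "Metabol")
kfold_rmse(Metabol ~ Gastric + Gastric:Sex, alcohol, "Metabol")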
Transforming the response before modeling
SUPERVISED LEARNING IN R: REGRESSION
Nina Zumel and John Mount
Win-Vector, LLC
SUPERVISED LEARNING IN R: REGRESSION
Monetary values: lognormally distributed
- Long tail, wide dynamic range (60 - 700K)
SUPERVISED LEARNING IN R: REGRESSION
mean > median (about 50K vs 39K)
Predicting the mean will overpredict typical values
SUPERVISED LEARNING IN R: REGRESSION
After the log transform, the distribution is approximately normal:
- mean ≈ median (here: 4.53 vs 4.59)
- more reasonable dynamic range (1.8 - 5.8)
SUPERVISED LEARNING IN R: REGRESSION
model <- lm(log(y) ~ x, data = train)
SUPERVISED LEARNING IN R: REGRESSION
model <- lm(log(y) ~ x, data = train)
logpred <- predict(model, newdata = test)
SUPERVISED LEARNING IN R: REGRESSION
model <- lm(log(y) ~ x, data = train)
logpred <- predict(model, newdata = test)
pred <- exp(logpred)
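A self-contained sketch of the full recipe on synthetic lognormal data (the data-generating step is invented for illustration):

set.seed(1)

# Synthetic lognormal outcome: log(y) is linear in x plus Gaussian noise
x <- runif(200, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(200, sd = 0.4))
train <- data.frame(x = x[1:150],   y = y[1:150])
test  <- data.frame(x = x[151:200], y = y[151:200])

model   <- lm(log(y) ~ x, data = train)    # fit on the log scale
logpred <- predict(model, newdata = test)  # predictions of log(y)
pred    <- exp(logpred)                    # transform back to the original scale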
SUPERVISED LEARNING IN R: REGRESSION
log(a) + log(b) = log(ab)
log(a) - log(b) = log(a/b)

Multiplicative error: pred/y
Relative error: (pred - y)/y = pred/y - 1

Reducing multiplicative error reduces relative error.
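A quick worked check of the identity, with made-up numbers:

pred <- 120; y <- 100
pred / y         # multiplicative error: 1.2
(pred - y) / y   # relative error: 0.2, i.e. 1.2 - 1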
SUPERVISED LEARNING IN R: REGRESSION
RMS-relative error = sqrt(mean(((pred - y)/y)^2))

Predicting log-outcome reduces RMS-relative error
But the model will often have larger RMSE
SUPERVISED LEARNING IN R: REGRESSION
modIncome <- lm(Income ~ AFQT + Educ, data = train)

- AFQT: score on a proficiency test taken 25 years before the survey
- Educ: years of education at the time of the survey
- Income: income at the time of the survey
SUPERVISED LEARNING IN R: REGRESSION
test %>%
  mutate(pred = predict(modIncome, newdata = test),
         err = pred - Income) %>%
  summarize(rmse = sqrt(mean(err^2)),
            rms.relerr = sqrt(mean((err/Income)^2)))

RMSE: 36,819.39
RMS-relative error: 3.295189
SUPERVISED LEARNING IN R: REGRESSION
modLogIncome <- lm(log(Income) ~ AFQT + Educ, data = train)
SUPERVISED LEARNING IN R: REGRESSION
test %>%
  mutate(predlog = predict(modLogIncome, newdata = test),
         pred = exp(predlog),
         err = pred - Income) %>%
  summarize(rmse = sqrt(mean(err^2)),
            rms.relerr = sqrt(mean((err/Income)^2)))

RMSE: 38,906.61
RMS-relative error: 2.276865
SUPERVISED LEARNING IN R: REGRESSION
log(Income) model: smaller RMS-relative error, larger RMSE
Model            RMSE       RMS-relative error
On Income        36,819.39  3.295189
On log(Income)   38,906.61  2.276865
Transforming inputs before modeling
SUPERVISED LEARNING IN R: REGRESSION
Nina Zumel and John Mount
Win-Vector, LLC
SUPERVISED LEARNING IN R: REGRESSION
Domain knowledge/synthetic variables
- Intelligence ~ mass.brain / mass.body^(2/3)
SUPERVISED LEARNING IN R: REGRESSION
Domain knowledge/synthetic variables
- Intelligence ~ mass.brain / mass.body^(2/3)

Pragmatic reasons
- Log transform to reduce dynamic range
- Log transform because meaningful changes in variable are multiplicative
SUPERVISED LEARNING IN R: REGRESSION
Domain knowledge/synthetic variables
- Intelligence ~ mass.brain / mass.body^(2/3)

Pragmatic reasons
- Log transform to reduce dynamic range
- Log transform because meaningful changes in variable are multiplicative
- y approximately linear in f(x) rather than in x
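As a sketch, both kinds of input transform go directly into the formula; the data frame below is invented purely for illustration:

# Hypothetical brain/body data, made up to show a synthetic variable
animals <- data.frame(
  mass.brain   = c(0.4, 62, 425, 1320),
  mass.body    = c(300, 5000, 52000, 62000),
  Intelligence = c(1.0, 2.2, 2.4, 7.4)
)

# Synthetic variable from domain knowledge; I() makes / and ^ arithmetic
lm(Intelligence ~ I(mass.brain / mass.body^(2/3)), data = animals)

# Pragmatic transform: y approximately linear in log(x) rather than in x
# lm(y ~ log(x), data = train)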
SUPERVISED LEARNING IN R: REGRESSION
Which is best?
anx ~ I(hassles^2)
anx ~ I(hassles^3)
anx ~ I(hassles^2) + I(hassles^3)
anx ~ exp(hassles)
...
I() : treat an expression literally (not as an interaction)
SUPERVISED LEARNING IN R: REGRESSION
Linear, Quadratic, and Cubic models
mod_lin <- lm(anx ~ hassles, hassleframe)
summary(mod_lin)$r.squared
0.5334847

mod_quad <- lm(anx ~ I(hassles^2), hassleframe)
summary(mod_quad)$r.squared
0.6241029

mod_cubic <- lm(anx ~ I(hassles^3), hassleframe)
summary(mod_cubic)$r.squared
0.6474421
SUPERVISED LEARNING IN R: REGRESSION
Use cross-validation to evaluate the models

Model                  RMSE
Linear (hassles)       7.69
Quadratic (hassles^2)  6.89
Cubic (hassles^3)      6.70
SUPERVISED LEARNING IN R: REGRESSION