SLIDE 1

Categorical inputs

SUPERVISED LEARNING IN R: REGRESSION

Nina Zumel and John Mount

Win-Vector, LLC

SLIDE 2

Example: Effect of Diet on Weight Loss

WtLoss24 ~ Diet + Age + BMI

Diet      Age  BMI    WtLoss24
Med       59   30.67  -6.7
Low-Carb  48   29.59   8.4
Low-Fat   52   32.9    6.3
Med       53   28.92   8.3
Low-Fat   47   30.20   6.3

SLIDE 3

model.matrix()

model.matrix(WtLoss24 ~ Diet + Age + BMI, data = diet)

All numerical values
Converts a categorical variable with N levels into N - 1 indicator variables
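A minimal sketch of what model.matrix() produces. The toy data frame below is made up for illustration (it is not the course's diet data):

```r
# Toy data frame mirroring the Diet example (hypothetical rows).
toy <- data.frame(
  Diet = factor(c("Med", "Low-Carb", "Low-Fat")),
  Age  = c(59, 48, 52)
)

# model.matrix() turns the 3-level Diet factor into 2 indicator columns;
# the first level alphabetically ("Low-Carb") becomes the reference level.
mm <- model.matrix(~ Diet + Age, data = toy)
colnames(mm)  # "(Intercept)" "DietLow-Fat" "DietMed" "Age"
```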

SLIDE 4

Indicator Variables to Represent Categories

Original Data:

Diet      Age  ...
Med       59   ...
Low-Carb  48   ...
Low-Fat   52   ...
Med       53   ...
Low-Fat   47   ...

Model Matrix:

(Int)  DietLow-Fat  DietMed  ...
1      0            1        ...
1      0            0        ...
1      1            0        ...
1      0            1        ...
1      1            0        ...

reference level: "Low-Carb"

SLIDE 5

Interpreting the Indicator Variables

Linear Model:

lm(WtLoss24 ~ Diet + Age + BMI, data = diet)

Coefficients:
(Intercept)  DietLow-Fat      DietMed          Age          BMI
   -1.37149     -2.32130     -0.97883      0.12648      0.01262
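The indicator coefficients are differences from the reference level ("Low-Carb" here). A small sketch, on a made-up factor, of changing the reference with relevel():

```r
toy <- data.frame(
  Diet = factor(c("Med", "Low-Carb", "Low-Fat", "Med", "Low-Fat"))
)
levels(toy$Diet)  # "Low-Carb" "Low-Fat" "Med" -- Low-Carb is the reference

# Make "Med" the reference instead; a refit model would then report
# DietLow-Carb and DietLow-Fat coefficients measured against "Med".
toy$Diet <- relevel(toy$Diet, ref = "Med")
levels(toy$Diet)  # "Med" "Low-Carb" "Low-Fat"
```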

SLIDE 6

Issues with one-hot-encoding

Too many levels can be a problem
Example: ZIP code (about 40,000 codes)
Don't hash with geometric methods!

SLIDE 7

Let's practice!

SLIDE 8

Interactions

Nina Zumel and John Mount

Win-Vector, LLC

SLIDE 9

Additive relationships

Example of an additive relationship:

plant_height ~ bacteria + sun

Change in height is the sum of the effects of bacteria and sunlight
A change in sunlight causes the same change in height, independent of bacteria
A change in bacteria causes the same change in height, independent of sunlight

SLIDE 10

What is an Interaction?

The simultaneous influence of two variables on the outcome is not additive.

plant_height ~ bacteria + sun + bacteria:sun

Change in height is more (or less) than the sum of the effects due to sun/bacteria
At higher levels of sunlight, a 1 unit change in bacteria causes more change in height

SLIDE 11

What is an Interaction?

The simultaneous influence of two variables on the outcome is not additive.

plant_height ~ bacteria + sun + bacteria:sun

sun: categorical {"sun", "shade"}

In sun, a 1 unit change in bacteria causes m units change in height
In shade, a 1 unit change in bacteria causes n units change in height
Like two separate models: one for sun, one for shade.
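A runnable sketch of this with simulated data (the slopes, noise level, and sample size below are invented for illustration):

```r
set.seed(1)
n <- 100
sun      <- factor(sample(c("sun", "shade"), n, replace = TRUE))
bacteria <- runif(n, 0, 10)

# True slopes: 2 units of height per unit bacteria in shade, 5 in sun.
plant_height <- ifelse(sun == "sun", 5, 2) * bacteria + rnorm(n, sd = 0.5)
d <- data.frame(plant_height, bacteria, sun)

fit <- lm(plant_height ~ bacteria + sun + bacteria:sun, data = d)
coef(fit)["bacteria"]         # slope in the reference level ("shade"), near 2
coef(fit)["bacteria:sunsun"]  # extra slope in "sun", near 5 - 2 = 3
```

The interaction coefficient is exactly the difference between the two per-group slopes, which is what "two separate models" means here.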

SLIDE 12

Example of no Interaction: Soybean Yield

yield ~ Stress + SO2 + O3

SLIDE 13

Example of an Interaction: Alcohol Metabolism

Metabol ~ Gastric + Sex

SLIDE 14

Expressing Interactions in Formulae

Interaction - Colon ( : )

y ~ a:b

Main effects and interaction - Asterisk ( * )

y ~ a*b
y ~ a + b + a:b   # both mean the same

Expressing the product of two variables - I()

y ~ I(a*b)

same as y ∝ ab
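These equivalences can be checked by looking at the columns each formula generates; the toy data frame below is made up:

```r
d <- data.frame(a = 1:5, b = c(2, 4, 6, 8, 10))

# a*b expands to the main effects plus the interaction column a:b ...
colnames(model.matrix(~ a * b, data = d))        # "(Intercept)" "a" "b" "a:b"
colnames(model.matrix(~ a + b + a:b, data = d))  # identical columns

# ... while I(a*b) produces a single column holding the product ab.
colnames(model.matrix(~ I(a * b), data = d))     # "(Intercept)" "I(a * b)"
```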

SLIDE 15

Finding the Correct Interaction Pattern

Formula                            RMSE (cross-validation)
Metabol ~ Gastric + Sex            1.46
Metabol ~ Gastric * Sex            1.48
Metabol ~ Gastric + Gastric:Sex    1.39

SLIDE 16

Let's practice!

SLIDE 17

Transforming the response before modeling

Nina Zumel and John Mount

Win-Vector, LLC

SLIDE 18

The Log Transform for Monetary Data

Monetary values: lognormally distributed
Long tail, wide dynamic range (60 - 700K)

SLIDE 19

Lognormal Distributions

mean > median (~50K vs 39K)
Predicting the mean will overpredict typical values
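A quick simulated illustration; the meanlog/sdlog parameters below are invented, chosen only to give values roughly like the income figures above:

```r
set.seed(1)
# 100,000 draws from a lognormal distribution
income <- rlnorm(100000, meanlog = 10.5, sdlog = 0.75)

mean(income)    # around 48,000 -- pulled up by the long right tail
median(income)  # around 36,000 -- the "typical" value
```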

SLIDE 20

Back to the Normal Distribution

For a Normal Distribution:
mean = median (here: 4.53 vs 4.59)
more reasonable dynamic range (1.8 - 5.8)

SLIDE 21

The Procedure

1. Log the outcome and fit a model

model <- lm(log(y) ~ x, data = train)

SLIDE 22

The Procedure

1. Log the outcome and fit a model

model <- lm(log(y) ~ x, data = train)

2. Make the predictions in log space

logpred <- predict(model, newdata = test)

SLIDE 23

The Procedure

1. Log the outcome and fit a model

model <- lm(log(y) ~ x, data = train)

2. Make the predictions in log space

logpred <- predict(model, newdata = test)

3. Transform the predictions to outcome space

pred <- exp(logpred)
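The three steps, end to end, on simulated data (the data-generating function and the train/test split here are made up for illustration):

```r
set.seed(2)
make_data <- function(n) {
  x <- runif(n, 0, 10)
  # y is lognormal around exp(0.3 * x), so log(y) is linear in x
  data.frame(x = x, y = exp(0.3 * x + rnorm(n, sd = 0.2)))
}
train <- make_data(200)
test  <- make_data(50)

model   <- lm(log(y) ~ x, data = train)    # 1. fit in log space
logpred <- predict(model, newdata = test)  # 2. predict in log space
pred    <- exp(logpred)                    # 3. transform back to outcome space
```

A side benefit: exponentiating guarantees the predictions are positive, which a direct linear model on y does not.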

SLIDE 24

Predicting Log-transformed Outcomes: Multiplicative Error

log(a) + log(b) = log(ab) log(a) − log(b) = log(a/b)

Multiplicative error: pred/y
Relative error: (pred − y)/y = pred/y − 1

Reducing multiplicative error reduces relative error.

SLIDE 25

Root Mean Squared Relative Error

RMS-relative error = √(mean(((pred − y)/y)²))

Predicting the log-outcome reduces RMS-relative error
But the model will often have a larger RMSE
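Both error measures, computed on a small hypothetical prediction vector (the numbers are made up):

```r
y    <- c(100, 200, 50000)   # outcomes spanning a wide range
pred <- c(110, 180, 60000)   # hypothetical predictions
err  <- pred - y

rmse       <- sqrt(mean(err^2))        # dominated by the largest outcome
rms.relerr <- sqrt(mean((err / y)^2))  # a 10% miss counts the same at any scale
```

Here the relative errors are 0.1, -0.1, and 0.2, so rms.relerr = √0.02 ≈ 0.14, while rmse is driven almost entirely by the 10,000-unit miss on the large outcome.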

SLIDE 26

Example: Model Income Directly

modIncome <- lm(Income ~ AFQT + Educ, data = train)

AFQT: score on a proficiency test taken 25 years before the survey
Educ: years of education at the time of the survey
Income: income at the time of the survey

SLIDE 27

Model Performance

test %>%
  mutate(pred = predict(modIncome, newdata = test),
         err = pred - Income) %>%
  summarize(rmse = sqrt(mean(err^2)),
            rms.relerr = sqrt(mean((err/Income)^2)))

RMSE        RMS-relative error
36,819.39   3.295189

SLIDE 28

Model log(Income)

modLogIncome <- lm(log(Income) ~ AFQT + Educ, data = train)

SLIDE 29

Model Performance

test %>%
  mutate(predlog = predict(modLogIncome, newdata = test),
         pred = exp(predlog),
         err = pred - Income) %>%
  summarize(rmse = sqrt(mean(err^2)),
            rms.relerr = sqrt(mean((err/Income)^2)))

RMSE        RMS-relative error
38,906.61   2.276865

SLIDE 30

Compare Errors

log(Income) model: smaller RMS-relative error, larger RMSE

Model            RMSE        RMS-relative error
On Income        36,819.39   3.295189
On log(Income)   38,906.61   2.276865

SLIDE 31

Let's practice!

SLIDE 32

Transforming inputs before modeling

Nina Zumel and John Mount

Win-Vector LLC

SLIDE 33

Why To Transform Input Variables

Domain knowledge/synthetic variables
Intelligence ~ mass.brain / mass.body^(2/3)

SLIDE 34

Why To Transform Input Variables

Domain knowledge/synthetic variables
Intelligence ~ mass.brain / mass.body^(2/3)
Pragmatic reasons
Log transform to reduce dynamic range
Log transform because meaningful changes in the variable are multiplicative

SLIDE 35

Why To Transform Input Variables

Domain knowledge/synthetic variables
Intelligence ~ mass.brain / mass.body^(2/3)
Pragmatic reasons
Log transform to reduce dynamic range
Log transform because meaningful changes in the variable are multiplicative

y approximately linear in f(x) rather than in x

SLIDE 36

Example: Predicting Anxiety

SLIDE 37

Transforming the hassles variable

SLIDE 38

Different possible fits

Which is best?

anx ~ I(hassles^2)
anx ~ I(hassles^3)
anx ~ I(hassles^2) + I(hassles^3)
anx ~ exp(hassles)

...

I() : treat an expression literally (not as an interaction)
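A small check of why I() matters: in formula syntax, ^ crosses a term with itself rather than squaring it, so the power is silently lost without I(). The tiny data frame below is made up:

```r
d <- data.frame(hassles = c(1, 2, 3, 4))

# Without I(), hassles^2 collapses back to just hassles:
colnames(model.matrix(~ hassles^2, data = d))     # "(Intercept)" "hassles"

# With I(), the column really is hassles squared:
colnames(model.matrix(~ I(hassles^2), data = d))  # "(Intercept)" "I(hassles^2)"
model.matrix(~ I(hassles^2), data = d)[, 2]       # 1 4 9 16
```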

SLIDE 39

Compare different models

Linear, Quadratic, and Cubic models

mod_lin <- lm(anx ~ hassles, hassleframe)
summary(mod_lin)$r.squared
# 0.5334847

mod_quad <- lm(anx ~ I(hassles^2), hassleframe)
summary(mod_quad)$r.squared
# 0.6241029

mod_cubic <- lm(anx ~ I(hassles^3), hassleframe)
summary(mod_cubic)$r.squared
# 0.6474421

SLIDE 40

Compare different models

Use cross-validation to evaluate the models

Model                   RMSE
Linear (hassles)        7.69
Quadratic (hassles^2)   6.89
Cubic (hassles^3)       6.70

SLIDE 41

Let's practice!