SLIDE 1

Evaluating a model graphically

SUPERVISED LEARNING IN R: REGRESSION

Nina Zumel and John Mount

Win-Vector LLC

SLIDE 2

SUPERVISED LEARNING IN R: REGRESSION

Plotting Ground Truth vs. Predictions

A well-fitting model: the x = y line runs through the center of the points ("line of perfect prediction")
A poorly fitting model: points are all on one side of the x = y line (systematic errors)
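As an illustrative sketch (not the course's exact code), this kind of plot could be drawn with ggplot2, assuming the houseprices data frame with prediction and price columns used later in these slides:

library(ggplot2)

# Outcome vs. prediction, with the x = y "line of perfect prediction"
ggplot(houseprices, aes(x = prediction, y = price)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "blue") +
  ggtitle("price vs. prediction")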

SLIDE 3

SUPERVISED LEARNING IN R: REGRESSION

The Residual Plot

A well-fitting model
Residual: actual outcome − prediction
Good fit: no systematic errors
A poorly fitting model: systematic errors
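A minimal sketch of such a residual plot, again assuming the houseprices data frame from the following slides:

library(ggplot2)

# Residual = actual outcome - prediction; a good fit scatters around 0
houseprices$residual <- houseprices$price - houseprices$prediction
ggplot(houseprices, aes(x = prediction, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  ggtitle("residuals vs. prediction")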

SLIDE 4

SUPERVISED LEARNING IN R: REGRESSION

The Gain Curve

Measures how well the model sorts the outcome
x-axis: houses in model-sorted order (decreasing)
y-axis: fraction of total accumulated home sales
Wizard curve: perfect model

SLIDE 5

SUPERVISED LEARNING IN R: REGRESSION

Reading the Gain Curve

library(WVPlots)  # GainCurvePlot() comes from the WVPlots package
GainCurvePlot(houseprices, "prediction", "price", "Home price model")

SLIDE 6

Let's practice!

SUPERVISED LEARNING IN R: REGRESSION

SLIDE 7

Root Mean Squared Error (RMSE)

SUPERVISED LEARNING IN R: REGRESSION

Nina Zumel and John Mount

Win-Vector LLC

SLIDE 8

SUPERVISED LEARNING IN R: REGRESSION

What is Root Mean Squared Error (RMSE)?

RMSE = \sqrt{\overline{(pred - y)^2}}

where

pred − y: the error, or residuals vector
\overline{(pred - y)^2}: the mean value of (pred − y)^2
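The formula translates directly into R. A minimal helper function (an illustrative sketch, not part of the course code):

# RMSE: square the residuals, take the mean, then the square root
rmse <- function(pred, y) {
  sqrt(mean((pred - y)^2))
}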

SLIDE 9

SUPERVISED LEARNING IN R: REGRESSION

RMSE of the Home Sales Price Model

# Calculate error
err <- houseprices$prediction - houseprices$price

price: column of actual sale prices (in thousands)
prediction: column of predicted sale prices (in thousands)

SLIDE 10

SUPERVISED LEARNING IN R: REGRESSION

RMSE of the Home Sales Price Model

# Calculate error
err <- houseprices$prediction - houseprices$price

# Square the error vector
err2 <- err^2

SLIDE 11

SUPERVISED LEARNING IN R: REGRESSION

RMSE of the Home Sales Price Model

# Calculate error
err <- houseprices$prediction - houseprices$price

# Square the error vector
err2 <- err^2

# Take the mean, and sqrt it
(rmse <- sqrt(mean(err2)))
# 58.33908

RMSE ≈ 58.3

SLIDE 12

SUPERVISED LEARNING IN R: REGRESSION

Is the RMSE Large or Small?

# Take the mean, and sqrt it
(rmse <- sqrt(mean(err2)))
# 58.33908

# The standard deviation of the outcome
(sdtemp <- sd(houseprices$price))
# 135.2694

RMSE ≈ 58.3; sd(price) ≈ 135
The RMSE is well below the standard deviation of the outcome, so the model predicts prices substantially better than simply guessing the average price.

SLIDE 13

Let's practice!

SUPERVISED LEARNING IN R: REGRESSION

SLIDE 14

R-Squared (R²)

SUPERVISED LEARNING IN R: REGRESSION

Nina Zumel and John Mount

Win-Vector LLC

SLIDE 15

SUPERVISED LEARNING IN R: REGRESSION

What is R²?

A measure of how well the model fits or explains the data
A value between 0 and 1
near 1: model fits well
near 0: no better than guessing the average value

SLIDE 16

SUPERVISED LEARNING IN R: REGRESSION

Calculating R²

R² is the variance explained by the model.

R^2 = 1 - \frac{RSS}{SS_{Tot}}

where

RSS = \sum (y - prediction)^2 : residual sum of squares (variance from the model)
SS_{Tot} = \sum (y - \bar{y})^2 : total sum of squares (variance of the data)
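As a sketch, this definition can be wrapped in a small R function (illustrative, not from the slides):

# R-squared: 1 minus (residual sum of squares / total sum of squares)
r_squared <- function(pred, y) {
  rss   <- sum((y - pred)^2)      # residual sum of squares
  sstot <- sum((y - mean(y))^2)   # total sum of squares
  1 - rss / sstot
}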

SLIDE 17

SUPERVISED LEARNING IN R: REGRESSION

Calculate R² of the House Price Model: RSS

Calculate error

err <- houseprices$prediction - houseprices$price

Square it and take the sum

rss <- sum(err^2)

price: column of actual sale prices (in thousands)
prediction: column of predicted sale prices (in thousands)

RSS ≈ 136138

SLIDE 18

SUPERVISED LEARNING IN R: REGRESSION

Calculate R² of the House Price Model: SSTot

Take the difference of prices from the mean price

toterr <- houseprices$price - mean(houseprices$price)

Square it and take the sum

sstot <- sum(toterr^2)

RSS ≈ 136138
SSTot ≈ 713615

SLIDE 19

SUPERVISED LEARNING IN R: REGRESSION

Calculate R² of the House Price Model

(r_squared <- 1 - (rss / sstot))
# 0.8092278

RSS ≈ 136138
SSTot ≈ 713615
R² ≈ 0.809

SLIDE 20

SUPERVISED LEARNING IN R: REGRESSION

Reading R² from the lm() model

# From summary()
summary(hmodel)
# ...
# Residual standard error: 60.66 on 37 degrees of freedom
# Multiple R-squared: 0.8092,  Adjusted R-squared: 0.7989
# F-statistic: 78.47 on 2 and 37 DF,  p-value: 4.893e-14

summary(hmodel)$r.squared
# 0.8092278

# From glance() (in the broom package)
library(broom)
glance(hmodel)$r.squared
# 0.8092278

SLIDE 21

SUPERVISED LEARNING IN R: REGRESSION

Correlation and R²

(rho <- cor(houseprices$prediction, houseprices$price))
# 0.8995709
rho^2
# 0.8092278

ρ = cor(prediction, price) = 0.8995709
ρ² = 0.8092278 = R²

SLIDE 22

SUPERVISED LEARNING IN R: REGRESSION

Correlation and R²

True for models that minimize squared error:
Linear regression
GAM regression
Tree-based algorithms that minimize squared error
True for training data; NOT true for future application data
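To see this identity on training data, here is a quick illustration using the built-in mtcars dataset (an assumption for demonstration; not the course's house price data):

# For lm on its own training data, squared correlation equals R-squared
model <- lm(mpg ~ wt + hp, data = mtcars)
rho <- cor(predict(model), mtcars$mpg)  # predict() with no newdata gives training predictions
rho^2                                   # same value as...
summary(model)$r.squared                # ...the reported R-squared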

SLIDE 23

Let's practice!

SUPERVISED LEARNING IN R: REGRESSION

SLIDE 24

Properly Training a Model

SUPERVISED LEARNING IN R: REGRESSION

Nina Zumel and John Mount

Win-Vector LLC

SLIDE 25

SUPERVISED LEARNING IN R: REGRESSION

Models can perform much better on training than they do on future data.

Training R²: 0.9; Test R²: 0.15 -- Overfit

SLIDE 26

SUPERVISED LEARNING IN R: REGRESSION

Test/Train Split

Recommended method when data is plentiful
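A minimal sketch of one common way to make such a split, assuming a data frame df (the name is illustrative):

# Assign each row a uniform random number, then split ~75/25
gp <- runif(nrow(df))
train <- df[gp < 0.75, ]   # about 75% of rows for training
test  <- df[gp >= 0.75, ]  # the rest held out for testing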

SLIDE 27

SUPERVISED LEARNING IN R: REGRESSION

Example: Model Female Unemployment

Train on 66 rows, test on 30 rows

SLIDE 28

SUPERVISED LEARNING IN R: REGRESSION

Model Performance: Train vs. Test

Training: RMSE 0.71, R² 0.8
Test: RMSE 0.93, R² 0.75
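These numbers come from evaluating the same model on both splits. A sketch of how that might be done, where model, train, test, and the outcome column y are illustrative names:

# RMSE and R-squared for a given set of predictions
perf <- function(pred, y) {
  c(rmse = sqrt(mean((pred - y)^2)),
    r_squared = 1 - sum((y - pred)^2) / sum((y - mean(y))^2))
}
perf(predict(model, newdata = train), train$y)
perf(predict(model, newdata = test),  test$y)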

SLIDE 29

SUPERVISED LEARNING IN R: REGRESSION

Cross-Validation

Preferred when data is not large enough to split off a test set

SLIDE 33

SUPERVISED LEARNING IN R: REGRESSION

Create a cross-validation plan

library(vtreat)
splitPlan <- kWayCrossValidation(nRows, nSplits, NULL, NULL)

nRows: number of rows in the training data
nSplits: number of folds (partitions) in the cross-validation, e.g., nSplits = 3 for 3-way cross-validation
The remaining two arguments are not needed here

SLIDE 34

SUPERVISED LEARNING IN R: REGRESSION

Create a cross-validation plan

library(vtreat)
splitPlan <- kWayCrossValidation(10, 3, NULL, NULL)

First fold (A and B to train, C to test):

splitPlan[[1]]
# $train
# 1 2 4 5 7 9 10
# $app
# 3 6 8

Train on A and B, test on C, etc...

split <- splitPlan[[1]]
model <- lm(fmla, data = df[split$train, ])
df$pred.cv[split$app] <- predict(model, newdata = df[split$app, ])
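Extending that pattern over every fold gives an out-of-sample prediction for every row. A sketch, reusing df, fmla, and splitPlan from this example:

# Each fold trains on its $train rows and predicts its $app rows,
# so every row ends up with a cross-validated prediction
df$pred.cv <- 0
for (split in splitPlan) {
  model <- lm(fmla, data = df[split$train, ])
  df$pred.cv[split$app] <- predict(model, newdata = df[split$app, ])
}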

SLIDE 35

SUPERVISED LEARNING IN R: REGRESSION

Final Model

SLIDE 36

SUPERVISED LEARNING IN R: REGRESSION

Example: Unemployment Model

Measure type       RMSE        R²
train              0.7082675   0.8029275
test               0.9349416   0.7451896
cross-validation   0.8175714   0.7635331

SLIDE 37

Let's practice!

SUPERVISED LEARNING IN R: REGRESSION