SLIDE 1

Learning From Data Lecture 13 Validation and Model Selection

The Validation Set · Model Selection · Cross Validation

M. Magdon-Ismail

CSCI 4100/6100

SLIDE 2

recap: Regularization

Regularization combats the effects of noise by putting a leash on the algorithm:

Eaug(h) = Ein(h) + (λ/N) Ω(h)

The regularizer Ω(h) favors smooth, simple h, because noise is rough and complex.

Different regularizers give different results, and we can choose λ, the amount of regularization.

[Figure: fits to the same data (with target) for λ = 0, 0.0001, 0.01, 1, ranging from overfitting to underfitting.]

Optimal λ balances approximation and generalization, bias and variance.
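For linear regression with the weight-decay regularizer Ω(w) = wᵀw, minimizing the augmented error has the closed form wreg = (ZᵀZ + λI)⁻¹Zᵀy that reappears later in this lecture. A minimal numpy sketch (the function name and interface are illustrative):

```python
import numpy as np

def fit_weight_decay(Z, y, lam):
    """Weight-decay regularized least squares:
    minimize ||Zw - y||^2 + lam * ||w||^2, giving
    w_reg = (Z^T Z + lam*I)^{-1} Z^T y."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)
```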

SLIDE 3

Validation: A Sneak Peek at Eout

Eout(g) = Ein(g) + overfit penalty

• The VC bound bounds the overfit penalty with a complexity error bar, Ω(H).
• Regularization estimates it through a heuristic complexity penalty, Ω(g).
• Validation goes directly for the jugular: it estimates Eout(g) itself.

An in-sample estimate of Eout is the Holy Grail of learning from data.

SLIDE 6

The Test Set

The data D (N points) is used for learning and produces g; a separate test set Dtest (K points) is never used in learning. Each test point gives an error ek = e(g(xk), yk), and

Etest = (1/K) Σ_{k=1}^{K} ek

Etest is an estimate of Eout(g):

E_Dtest[ek] = Eout(g), so E[Etest] = (1/K) Σ_{k=1}^{K} E[ek] = Eout(g).

Because e1, . . . , eK are independent,

Var[Etest] = (1/K²) Σ_{k=1}^{K} Var[ek] = (1/K) Var[e],

which decreases like 1/K. Bigger K ⇒ more reliable Etest.
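As a small sketch of this computation (the hypothesis g and the squared-error measure here are illustrative placeholders), the test error is just the average pointwise error on Dtest:

```python
import numpy as np

def test_error(g, X_test, y_test, err=lambda yhat, y: (yhat - y) ** 2):
    """E_test = (1/K) * sum_k e(g(x_k), y_k): an estimate of E_out(g)
    whose variance shrinks like 1/K for independent test points."""
    errors = [err(g(x), y) for x, y in zip(X_test, y_test)]
    return float(np.mean(errors))
```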

SLIDE 7

The Validation Set

Split the data D (N points) into Dtrain (N − K training points) and Dval (K validation points). Learning on Dtrain produces g⁻; each validation point gives ek = e(g⁻(xk), yk), and

Eval = (1/K) Σ_{k=1}^{K} ek, which estimates Eout(g⁻).

1. Remove K points from D: D = Dtrain ∪ Dval.
2. Learn using Dtrain → g⁻.
3. Test g⁻ on Dval → Eval.
4. Use the error Eval to estimate Eout(g⁻).
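A minimal numpy sketch of these four steps; the `train` argument (any routine that returns a callable hypothesis) and the squared-error measure are illustrative assumptions:

```python
import numpy as np

def validation_error(X, y, K, train, err=lambda yhat, yk: (yhat - yk) ** 2, seed=0):
    """Split off K validation points, learn g_minus on the other N-K points,
    and return (g_minus, E_val), where E_val estimates E_out(g_minus)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    val, trn = idx[:K], idx[K:]                      # step 1: D = D_train U D_val
    g_minus = train(X[trn], y[trn])                  # step 2: learn on D_train
    e = [err(g_minus(xk), yk) for xk, yk in zip(X[val], y[val])]  # step 3
    return g_minus, float(np.mean(e))                # step 4: E_val
```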

SLIDE 9

The Validation Set

With the same setup (learn g⁻ on Dtrain, compute ek = e(g⁻(xk), yk) on Dval, and Eval = (1/K) Σ_{k=1}^{K} ek), Eval is an estimate of Eout(g⁻):

E_Dval[ek] = Eout(g⁻), so E[Eval] = (1/K) Σ_{k=1}^{K} E[ek] = Eout(g⁻).

Because e1, . . . , eK are independent,

Var[Eval] = (1/K²) Σ_{k=1}^{K} Var[ek] = (1/K) Var[e(g⁻)],

which decreases like 1/K but now depends on g⁻, not on H. Bigger K ⇒ more reliable Eval? Not so fast: taking more points for validation leaves fewer for training.

SLIDE 10

Choosing K

[Figure: expected Eval versus the validation set size K.]

Rule of thumb: K* = N/5.

SLIDE 11

Restoring D

After validating, put Dval back: train on all of D (N points) to get g, and hand g to the customer; Dtrain (N − K points) only produced g⁻ and its estimate Eval(g⁻).

Primary goal: output the best hypothesis. That is g, which was trained on all the data.

Secondary goal: estimate Eout(g). But g is behind closed doors: no data was held out from it, so it cannot be tested directly.

Eout(g) is estimated by Ein(g); Eout(g⁻) is estimated by Eval(g⁻). Which should we use?

SLIDE 12

Eval Versus Ein

Eout(g) ≤ Ein(g) + O(√(dvc log N / N))   ← biased error bar; depends on H.

Eout(g) ≤ Eout(g⁻) ≤ Eval(g⁻) + O(1/√K)   ← unbiased error bar; depends on g⁻.

(The first inequality in the second line holds because the learning curve is decreasing, a practical truth rather than a theorem.)

Eval(g⁻) usually wins as an estimate of Eout(g), especially when the learning curve is not steep.

SLIDE 13

Model Selection

The most important use of validation: choosing among M models H1, H2, . . . , HM. The data is split into Dtrain and Dval; Dtrain is used to learn a hypothesis from each model, and Dval is used to evaluate it.

SLIDE 14

Validation Estimate for (H1, g1)

From the first model H1, learn g1⁻ using Dtrain, then evaluate it on Dval to obtain Eval(g1⁻).

SLIDE 15

Validation Estimate for (H1, g1)

Call this validation estimate E1 = Eval(g1⁻).

SLIDE 16

Compute Validation Estimates for All Models

Repeat for every model: Dtrain produces g1⁻, g2⁻, g3⁻, . . . , gM⁻, and Dval produces the validation estimates E1, E2, E3, . . . , EM.

SLIDE 17

Pick The Best Model According to Validation Error

Select the model with the smallest validation error: m∗ = argmin_m Em. The winner is (Hm∗, gm∗⁻) with validation estimate Em∗. (A sketch of this selection loop follows.)
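A minimal numpy sketch of validation-based model selection. Each entry of `models` is assumed to be a training routine that returns a callable hypothesis; both that interface and the squared-error measure are illustrative:

```python
import numpy as np

def select_model(models, X_trn, y_trn, X_val, y_val,
                 err=lambda yhat, y: (yhat - y) ** 2):
    """Learn g_m^- from each H_m on D_train, score it on D_val,
    and return m*, the winning hypothesis, and all validation errors E_m."""
    hypotheses, E = [], []
    for train_m in models:
        g_m = train_m(X_trn, y_trn)                      # g_m^- learned on D_train only
        E_m = np.mean([err(g_m(x), y) for x, y in zip(X_val, y_val)])
        hypotheses.append(g_m)
        E.append(E_m)
    m_star = int(np.argmin(E))
    return m_star, hypotheses[m_star], E
```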

SLIDE 18

Eval(gm∗⁻) Is Not Unbiased for Eout(gm∗⁻)

[Figure: expected error versus validation set size K; the Eval(gm∗⁻) curve lies below the Eout(gm∗⁻) curve.]

The optimistic bias arises because we choose one of the M finalists.

Eout(gm∗⁻) ≤ Eval(gm∗⁻) + O(√(ln M / K))

This is the VC error bar for selecting a hypothesis from among M using a data set of size K.

SLIDE 19

Restoring D

After selecting m∗, restore D and retrain: training Hm∗ on all of D gives gm∗, which is what we output.

• The model with the best gm⁻ also has the best gm ← a leap of faith.
• We can find the model with the best gm⁻ using validation ← true modulo the Eval error bar.

SLIDE 20

Comparing Ein and Eval for Model Selection

[Figure: expected Eout versus validation set size K, comparing selection by in-sample error, selection by validation (gm∗⁻ and gm∗), and the optimal choice. Inset: the selection pipeline, in which Dtrain yields g1⁻ . . . gM⁻, Dval yields E1 . . . EM, the best (Hm∗, Em∗) is picked, and D is restored.]

Selecting by validation error gives a lower expected Eout than selecting by in-sample error.

SLIDE 21

Application to Selecting λ

Which regularization parameter should we use: λ1, λ2, . . . , λM? This is a special case of model selection over M models, (H, λ1), (H, λ2), (H, λ3), . . . , (H, λM), each yielding its own g1⁻, g2⁻, g3⁻, . . . , gM⁻. Picking a model amounts to choosing the optimal λ.
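Combining the earlier sketches, λ can be chosen on a validation set for linear regression with weight decay; the grid of λ values and the squared-error scoring below are illustrative choices, not part of the lecture:

```python
import numpy as np

def select_lambda(Z_trn, y_trn, Z_val, y_val, lambdas=(0.0, 1e-4, 1e-2, 1.0)):
    """Treat each (H, lambda_m) as a model: fit weight decay on D_train,
    score on D_val, and return the lambda with the smallest E_val."""
    d = Z_trn.shape[1]
    E_val = []
    for lam in lambdas:
        w = np.linalg.solve(Z_trn.T @ Z_trn + lam * np.eye(d), Z_trn.T @ y_trn)
        E_val.append(np.mean((Z_val @ w - y_val) ** 2))
    return lambdas[int(np.argmin(E_val))], E_val
```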

SLIDE 22

The Dilemma When Choosing K

Validation relies on the following chain of reasoning:

Eout(g) ≈ Eout(g⁻)   (needs small K)
Eout(g⁻) ≈ Eval(g⁻)   (needs large K)

The first approximation wants K small, so that g⁻ is trained on nearly all of D; the second wants K large, so that Eval is reliable. Hence the dilemma.

SLIDE 23

Can we get away with K = 1?

Yes, almost!

SLIDE 24

The Leave One Out Error (K = 1)

[Figure: a fit g1⁻ trained with the single point (x1, y1) left out; e1 is its error on that left-out point.]

E[e1] = Eout(g1⁻)

. . . but a single point gives a wild (high-variance) estimate.

SLIDE 25

The Leave One Out Errors

[Figure: three of the N leave-one-out fits, each with its error e1, e2, e3 on the corresponding left-out point.]

Each en satisfies E[en] = Eout(gn⁻). Averaging them gives the cross validation error:

Ecv = (1/N) Σ_{n=1}^{N} en
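A direct (and expensive) numpy sketch of Ecv; the `train` routine, which returns a callable hypothesis, and the squared-error measure are assumed for illustration:

```python
import numpy as np

def loo_cv_error(X, y, train, err=lambda yhat, yn: (yhat - yn) ** 2):
    """E_cv = (1/N) * sum_n e_n, where e_n is the error of g_n^-
    (trained with point n left out) on the left-out point (x_n, y_n)."""
    N = len(y)
    e = np.empty(N)
    for n in range(N):
        keep = np.arange(N) != n
        g_n = train(X[keep], y[keep])   # learn on the other N-1 points
        e[n] = err(g_n(X[n]), y[n])     # validate on the single left-out point
    return float(e.mean())
```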

SLIDE 26

Cross Validation is Unbiased

Theorem. Ecv is an unbiased estimate of Ēout(N − 1), the expected Eout when learning with N − 1 points.

SLIDE 27

Reliability of Ecv

en and em are not independent: en depends on gn⁻, which was trained on (xm, ym), while em is evaluated on (xm, ym). Still, the dependence is weak, and one can estimate the effective number of fresh examples that would give a comparable estimate of Eout.

SLIDE 28

Cross Validation is Computationally Intensive

Leave-one-out requires N rounds of learning, each on a data set of size N − 1. Two ways to ease the burden:

Analytic approaches. For linear regression with weight decay,

wreg = (ZᵀZ + λI)⁻¹Zᵀy,

the leave-one-out error has the closed form

Ecv = (1/N) Σ_{n=1}^{N} ( (ŷn − yn) / (1 − Hnn(λ)) )²,   where H(λ) = Z(ZᵀZ + λI)⁻¹Zᵀ and ŷ = H(λ)y,

so Ecv comes from a single fit with no retraining.
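A minimal numpy sketch of this closed form, computing Ecv from one matrix solve (the function name is illustrative):

```python
import numpy as np

def analytic_loo_cv(Z, y, lam):
    """Closed-form leave-one-out error for weight decay:
    E_cv = (1/N) * sum_n ((yhat_n - y_n) / (1 - H_nn(lam)))^2,
    with H(lam) = Z (Z^T Z + lam*I)^{-1} Z^T and yhat = H(lam) y."""
    d = Z.shape[1]
    H = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T)
    y_hat = H @ y
    return float(np.mean(((y_hat - y) / (1.0 - np.diag(H))) ** 2))
```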

10-fold cross validation. Partition D into D1, D2, . . . , D10; in turn, validate on each Dk using a hypothesis trained on the other nine folds, and average the errors.
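A generic V-fold sketch along these lines (again, the `train` routine and the error measure are placeholders):

```python
import numpy as np

def k_fold_cv_error(X, y, train, n_folds=10,
                    err=lambda yhat, yk: (yhat - yk) ** 2, seed=0):
    """Split D into n_folds parts; validate each part with a hypothesis
    trained on the remaining folds, and average all pointwise errors."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    errors = []
    for fold in np.array_split(idx, n_folds):
        trn = np.setdiff1d(idx, fold)              # train on the other folds
        g = train(X[trn], y[trn])
        errors.extend(err(g(xk), yk) for xk, yk in zip(X[fold], y[fold]))
    return float(np.mean(errors))
```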

SLIDE 29

Restoring D

[Diagram: from D, the N leave-one-out sets D1, . . . , DN each produce gn⁻ and its error en on (xn, yn); averaging the en gives Ecv; the hypothesis g delivered to the customer is trained on all of D.]

Eout(g) ≤ Ēout(N − 1) ≤ Ecv + O(1/√N)

The first inequality uses the (decreasing) learning curve; the error bar uses the fact that the en are nearly independent.

Ecv can be used for model selection just as Eval, for example to choose λ.
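For instance, a sketch of choosing λ by the analytic Ecv above and then restoring D by retraining on all N points with the chosen λ; the λ grid is illustrative:

```python
import numpy as np

def select_lambda_by_cv(Z, y, lambdas=(0.0, 1e-4, 1e-2, 1.0)):
    """Pick lambda by leave-one-out E_cv (closed form for weight decay),
    then restore D: refit on all N points with the chosen lambda."""
    d = Z.shape[1]
    def e_cv(lam):
        H = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T)
        return np.mean(((H @ y - y) / (1.0 - np.diag(H))) ** 2)
    lam_star = min(lambdas, key=e_cv)
    w = np.linalg.solve(Z.T @ Z + lam_star * np.eye(d), Z.T @ y)  # final g uses all of D
    return lam_star, w
```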

SLIDE 30

Digits Problem: ‘1’ Versus ‘Not 1’

[Figures: the digits data plotted by average intensity and symmetry ('1' versus 'not 1'), and error (Ein, Ecv, Eout) versus the number of features used.]

x = (1, x1, x2)
z = (1, x1, x2, x1², x1x2, x2², x1³, x1²x2, x1x2², x2³, . . . , x1⁵, x1⁴x2, x1³x2², x1²x2³, x1x2⁴, x2⁵)

The 5th-order polynomial transform gives a 20-dimensional nonlinear feature space.
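A small sketch of this transform; the ordering of monomials follows the slide and the function name is illustrative:

```python
import numpy as np

def poly5_transform(x1, x2):
    """5th-order polynomial transform of (x1, x2):
    z = (1, x1, x2, x1^2, x1*x2, x2^2, ..., x1*x2^4, x2^5),
    i.e. the constant plus the 20 features of the slide's feature space."""
    z = [x1 ** (d - j) * x2 ** j for d in range(6) for j in range(d + 1)]
    return np.array(z)   # 21 components in total
```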

SLIDE 31

Validation Wins In the Real World

[Figures: decision boundaries in the (average intensity, symmetry) plane for the two classifiers below.]

No validation (20 features): Ein = 0%, Eout = 2.5%.
Cross validation (6 features): Ein = 0.8%, Eout = 1.5%.
