slide-1
SLIDE 1 Review of Lecture 7

• VC dimension d_vc(H): the most points H can shatter
• Scope of VC analysis

[Diagram: unknown target function f: X → Y, probability distribution P on X, training examples (x_1, y_1), ..., (x_N, y_N), hypothesis set H, learning algorithm A, final hypothesis g ≈ f]

• Utility of VC dimension

[Plot: sample complexity grows as N ∝ d_vc]

Rule of thumb: N ≥ 10 d_vc

• Generalization bound: E_out ≤ E_in + Ω
slide-2
SLIDE 2 Learning From Data

Yaser S. Abu-Mostafa, California Institute of Technology

Lecture 8: Bias-Variance Tradeoff

Sponsored by Caltech's Provost Office, E&AS Division, and IST
Thursday, April 26, 2012
slide-3
SLIDE 3 Outline

• Bias and Variance
• Learning Curves

Creator: Yaser Abu-Mostafa, LFD Lecture 8, 2/22
slide-4
SLIDE 4 Approximation-generalization tradeoff

Small E_out: good approximation of f out of sample.

• More complex H ⇒ better chance of approximating f
• Less complex H ⇒ better chance of generalizing out of sample

Ideal: H = {f} (winning lottery ticket)
slide-5
SLIDE 5 Quantifying the tradeoff

VC analysis was one approach: E_out ≤ E_in + Ω

Bias-variance analysis is another: decomposing E_out into
1. How well H can approximate f
2. How well we can zoom in on a good h ∈ H

Applies to real-valued targets and uses squared error.
slide-6
SLIDE 6 Start with E_out

E_out(g^(D)) = E_x[(g^(D)(x) − f(x))²]

E_D[E_out(g^(D))] = E_D[E_x[(g^(D)(x) − f(x))²]]
                  = E_x[E_D[(g^(D)(x) − f(x))²]]

Now, let us focus on:

E_D[(g^(D)(x) − f(x))²]
slide-7
SLIDE 7 The average hypothesis

To evaluate E_D[(g^(D)(x) − f(x))²], we define the 'average' hypothesis ḡ(x):

ḡ(x) = E_D[g^(D)(x)]

Imagine many data sets D_1, D_2, ..., D_K:

ḡ(x) ≈ (1/K) Σ_{k=1}^K g^(D_k)(x)
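The finite-sample approximation of ḡ(x) above can be sketched in a few lines. This is a minimal illustration, not from the slides: it assumes the sine target of the later example, the constant model H0, and a hypothetical helper `learn_constant`; the averaging over K data sets is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(np.pi * x)

def learn_constant(x_train, y_train):
    """H0: h(x) = b; the least-squares constant is the mean of the y's."""
    b = y_train.mean()
    return lambda x: np.full_like(x, b)

# g_bar(x) ~ (1/K) sum_k g^(D_k)(x), over K data sets of size N = 2
K, N = 10_000, 2
x_grid = np.linspace(-1, 1, 201)
g_bar = np.zeros_like(x_grid)
for _ in range(K):
    x_train = rng.uniform(-1, 1, N)
    g = learn_constant(x_train, target(x_train))
    g_bar += g(x_grid)
g_bar /= K

# By the symmetry of sin(pi x) on [-1, 1], g_bar should be close to 0
print(np.abs(g_bar).max())
```

As K grows, the Monte Carlo average converges to the true ḡ(x), here the zero function.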
slide-8
SLIDE 8 Using ḡ(x)

E_D[(g^(D)(x) − f(x))²]
  = E_D[(g^(D)(x) − ḡ(x) + ḡ(x) − f(x))²]
  = E_D[(g^(D)(x) − ḡ(x))² + (ḡ(x) − f(x))² + 2 (g^(D)(x) − ḡ(x)) (ḡ(x) − f(x))]
  = E_D[(g^(D)(x) − ḡ(x))²] + (ḡ(x) − f(x))²

(the cross term vanishes in expectation, since E_D[g^(D)(x) − ḡ(x)] = 0)
slide-9
SLIDE 9 Bias and variance

E_D[(g^(D)(x) − f(x))²] = E_D[(g^(D)(x) − ḡ(x))²] + (ḡ(x) − f(x))²
                        =         var(x)          +      bias(x)

Therefore,

E_D[E_out(g^(D))] = E_x[E_D[(g^(D)(x) − f(x))²]]
                  = E_x[bias(x) + var(x)]
                  = bias + var
slide-10
SLIDE 10 The tradeoff

bias = E_x[(ḡ(x) − f(x))²]        var = E_x[E_D[(g^(D)(x) − ḡ(x))²]]

H ↑  ⇒  bias ↓, var ↑

[Figures: a small H near f (high bias, low var) versus a large H spread around f (low bias, high var)]
slide-11
SLIDE 11 Example: sine target

f : [−1, 1] → ℝ,    f(x) = sin(πx)

Only two training examples!   N = 2

Two models used for learning:

H0: h(x) = b
H1: h(x) = ax + b

Which is better, H0 or H1?

[Figure: f(x) = sin(πx) on [−1, 1]]
slide-12
SLIDE 12 Approximation: H0 versus H1

[Figures: best fit to sin(πx) on [−1, 1] under H0 and under H1]

H0: E_out = 0.50        H1: E_out = 0.20
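The two E_out values on this slide can be reproduced numerically. A minimal sketch, assuming a fine uniform grid as a stand-in for the continuous expectation over x (my choice of method, not the slides'): fit the entire target by least squares under each model and measure the squared error.

```python
import numpy as np

# Approximation: fit f(x) = sin(pi x) itself, not a 2-point sample
x = np.linspace(-1, 1, 100_001)
f = np.sin(np.pi * x)

# H0: h(x) = b  ->  the best constant is the mean of f (0 by symmetry)
b0 = f.mean()
e_out_h0 = ((b0 - f) ** 2).mean()

# H1: h(x) = ax + b  ->  ordinary least squares on (x, f)
a1, b1 = np.polyfit(x, f, 1)
e_out_h1 = ((a1 * x + b1 - f) ** 2).mean()

print(round(e_out_h0, 2), round(e_out_h1, 2))   # close to 0.5 and 0.2
```

The exact values are E_out = 1/2 for H0 and 1/2 − 3/π² ≈ 0.196 for H1, matching the slide's 0.50 and 0.20.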
slide-13
SLIDE 13 Learning: H0 versus H1

[Figures: hypotheses learned from two training examples, for H0 and for H1]
slide-14
SLIDE 14 Bias and variance: H0

[Figures: the hypotheses g^(D) learned from many data sets, and the average hypothesis ḡ(x) versus sin(πx)]
slide-15
SLIDE 15 Bias and variance: H1

[Figures: the hypotheses g^(D) learned from many data sets, and the average hypothesis ḡ(x) versus sin(πx)]
slide-16
SLIDE 16 And the winner is . . .

[Figures: ḡ(x) versus sin(πx), with the spread of g^(D) around ḡ, for H0 and for H1]

H0: bias = 0.50, var = 0.25        H1: bias = 0.21, var = 1.69
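These four numbers can be checked by simulation: draw many 2-point data sets from the sine target, learn under each model, and estimate bias and var as defined on slide 10. A sketch under my own assumptions (uniform sampling of x on [−1, 1], a fixed evaluation grid, Monte Carlo averaging), not the lecture's exact computation:

```python
import numpy as np

rng = np.random.default_rng(0)
x_grid = np.linspace(-1.0, 1.0, 201)
f = np.sin(np.pi * x_grid)

# K data sets, each with N = 2 noiseless examples of the sine target
K, N = 20_000, 2
x_tr = rng.uniform(-1.0, 1.0, (K, N))
y_tr = np.sin(np.pi * x_tr)

def bias_var(g):
    """g: (K, len(x_grid)) array of learned hypotheses on the grid."""
    g_bar = g.mean(axis=0)                 # the average hypothesis
    bias = ((g_bar - f) ** 2).mean()       # E_x[(g_bar(x) - f(x))^2]
    var = ((g - g_bar) ** 2).mean()        # E_x[E_D[(g(x) - g_bar(x))^2]]
    return bias, var

# H0: h(x) = b -- the least-squares constant is the mean of the two y's
b0 = y_tr.mean(axis=1, keepdims=True)
bias0, var0 = bias_var(np.repeat(b0, x_grid.size, axis=1))

# H1: h(x) = ax + b -- the line through the two points
a = (y_tr[:, 1] - y_tr[:, 0]) / (x_tr[:, 1] - x_tr[:, 0])
b = y_tr[:, 0] - a * x_tr[:, 0]
bias1, var1 = bias_var(a[:, None] * x_grid + b[:, None])

print(bias0, var0)   # close to 0.50 and 0.25
print(bias1, var1)   # close to 0.21 and 1.69
```

With these estimates, bias + var favors H0 (about 0.75 versus about 1.90), which is the slide's verdict.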
slide-17
SLIDE 17 Lesson learned

Match the 'model complexity' to the data resources, not to the target complexity.
slide-18
SLIDE 18 Outline

• Bias and Variance
• Learning Curves
slide-19
SLIDE 19 Expected E_out and E_in

Data set D of size N

Expected out-of-sample error: E_D[E_out(g^(D))]
Expected in-sample error: E_D[E_in(g^(D))]

How do they vary with N?
slide-20
SLIDE 20 The curves

[Figures: expected error versus number of data points N, for a simple model and a complex model; in each plot E_out decreases and E_in increases toward a common asymptote]
slide-21
SLIDE 21 VC versus bias-variance

[Figures: expected error versus N. The VC analysis view splits E_out into in-sample error plus generalization error; the bias-variance view splits it into bias plus variance]
slide-22
SLIDE 22 Linear regression case

Noisy target: y = w*ᵀx + noise

Data set: D = {(x_1, y_1), ..., (x_N, y_N)}

Linear regression solution: w = (XᵀX)⁻¹Xᵀy

In-sample error vector = Xw − y
'Out-of-sample' error vector = Xw − y′ (same inputs, a fresh noise realization)
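The quantities on this slide map directly onto code. A minimal sketch with hypothetical parameter choices (d, N, σ are mine; the slide fixes none of them):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy linear target y = w*^T x + noise, in d dimensions plus a bias x0 = 1
d, N, sigma = 5, 50, 0.5
w_star = rng.normal(size=d + 1)

X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = X @ w_star + sigma * rng.normal(size=N)

# Linear regression solution: w = (X^T X)^{-1} X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

# In-sample error vector Xw - y; the 'out-of-sample' version reuses the
# same inputs with a fresh noise realization y'
e_in = X @ w - y
y_prime = X @ w_star + sigma * rng.normal(size=N)
e_out = X @ w - y_prime

print((e_in ** 2).mean(), (e_out ** 2).mean())
```

`np.linalg.solve` on the normal equations is used for clarity; `np.linalg.lstsq` is the numerically safer choice in practice.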
slide-23
SLIDE 23 Learning curves for linear regression

[Figure: expected error versus number of data points N; E_out decreases toward σ² from above, E_in increases toward σ² from below, with σ² and d + 1 marked on the plot]

Best approximation error = σ²
Expected in-sample error = σ² (1 − (d+1)/N)
Expected out-of-sample error = σ² (1 + (d+1)/N)
Expected generalization error = 2σ² (d+1)/N
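The two closed-form curves above can be verified by averaging over many data sets, per the setup of the previous slide. A sketch, assuming Gaussian inputs and my own choices of d, N, σ, and trial count:

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_errors(d, N, sigma, trials=2000):
    """Monte Carlo estimate of E_D[E_in] and E_D[E_out] for linear
    regression on a noisy linear target ('out of sample' = same inputs,
    fresh noise, matching the previous slide)."""
    e_in = e_out = 0.0
    for _ in range(trials):
        X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
        w_star = rng.normal(size=d + 1)
        y = X @ w_star + sigma * rng.normal(size=N)
        w = np.linalg.lstsq(X, y, rcond=None)[0]
        e_in += ((X @ w - y) ** 2).mean()
        y_prime = X @ w_star + sigma * rng.normal(size=N)
        e_out += ((X @ w - y_prime) ** 2).mean()
    return e_in / trials, e_out / trials

d, N, sigma = 3, 40, 0.5
ein, eout = expected_errors(d, N, sigma)
print(ein, sigma**2 * (1 - (d + 1) / N))   # simulated vs. formula
print(eout, sigma**2 * (1 + (d + 1) / N))
```

With d + 1 = 4 fitted parameters, the simulated averages should land near σ²(1 − 4/N) and σ²(1 + 4/N) respectively, with the gap between them shrinking as 2σ²(d+1)/N.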