Introduction to Machine Learning: 13. Learning Theory


SLIDE 1

Introduction to Machine Learning

13. Learning Theory

Geoff Gordon and Alex Smola, Carnegie Mellon University
http://alex.smola.org/teaching/cmu2013-10-701x

10-701

SLIDE 2

The Problem

  • Training data $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ drawn iid from $p(x, y)$
  • Loss function $l(x, y, f(x))$
  • Function class $\mathcal{F} = \{f : \Omega[f] \le c\}$
  • Empirical risk minimization problem: $\min_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} l(x_i, y_i, f(x_i))$
  • Testing: evaluate the expected risk $\mathbb{E}_{(x,y) \sim p(x,y)}[l(x, y, f(x))]$
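To make the setup concrete, here is a minimal sketch of ERM, assuming squared loss and a degree-3 polynomial function class; the data-generating function and every name in it are illustrative, not from the slides.

```python
import numpy as np

# Illustrative ERM setup: squared loss, degree-3 polynomial class.
rng = np.random.default_rng(0)
m = 50
x = rng.uniform(-1, 1, m)
y = np.sin(3 * x) + 0.3 * rng.normal(size=m)      # training data ~ p(x, y)

def empirical_risk(coeffs, xs, ys):
    """(1/m) sum_i l(x_i, y_i, f(x_i)) with squared loss."""
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

# Empirical risk minimization: least squares = ERM for squared loss.
f_star = np.polyfit(x, y, deg=3)

# Testing: estimate the expected risk E_{(x,y)~p}[l] on fresh samples.
x_new = rng.uniform(-1, 1, 100_000)
y_new = np.sin(3 * x_new) + 0.3 * rng.normal(size=100_000)
print("empirical risk:", empirical_risk(f_star, x, y))
print("expected risk (estimate):", empirical_risk(f_star, x_new, y_new))
```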

SLIDE 3

[Figure: data]

SLIDE 4

[Figure: a classifier fit to the data (polynomial regression)]

SLIDE 5

[Figure: a linear classifier (underfitting)]

SLIDE 6

[Figure: a quadratic classifier]

SLIDE 7

Typical behavior

[Plot: error vs. model complexity]

SLIDE 8

Typical behavior

[Plot: error vs. model complexity, with the training error curve]

SLIDE 9

Typical behavior

[Plot: error vs. model complexity, with training and test error curves]

SLIDE 10

Typical behavior

[Plot: training and test error vs. model complexity; "How do we find this?" marks the test-error minimum]
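This picture is easy to reproduce numerically; the sketch below uses polynomial degree as a stand-in for model complexity (same illustrative data-generating assumptions as the earlier sketch).

```python
import numpy as np

# Train/test error vs. model complexity: training error falls
# monotonically, test error is U-shaped (underfit -> sweet spot -> overfit).
rng = np.random.default_rng(1)
m = 30
x = rng.uniform(-1, 1, m)
y = np.sin(3 * x) + 0.3 * rng.normal(size=m)
x_te = rng.uniform(-1, 1, 5_000)
y_te = np.sin(3 * x_te) + 0.3 * rng.normal(size=5_000)

for degree in (1, 2, 3, 5, 9, 15):
    c = np.polyfit(x, y, degree)
    train = np.mean((np.polyval(c, x) - y) ** 2)
    test = np.mean((np.polyval(c, x_te) - y_te) ** 2)
    print(f"degree {degree:2d}: train {train:.3f}  test {test:.3f}")
```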


SLIDE 12

A broken argument

  • Hoeffding bound for a bounded random variable: $\Pr(|\hat{\mu}_m - \mu| > \epsilon) \le 2 \exp\left(-\frac{2m\epsilon^2}{c^2}\right)$
  • Let $f^*$ be the function that minimizes the empirical risk
  • Loss bounded by $L$
  • Apply the bound to get, with high probability, $\epsilon \le L \sqrt{\log(2/\delta)/2m}$
  • Why does this bound disagree with reality?
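A simulation makes the flaw visible: the bound holds for each fixed function, but fails for the minimizer chosen after seeing the data. A sketch under illustrative assumptions: losses uniform on [0, 1] (so c = 1) and N candidate functions that all have true risk 0.5.

```python
import numpy as np

# Hoeffding holds for a FIXED f; choosing f* by empirical risk
# minimization breaks the premise that f was fixed in advance.
rng = np.random.default_rng(2)
m, eps = 100, 0.08

# Fixed function: deviation of the empirical mean of m losses in [0, 1].
dev = np.abs(rng.uniform(0, 1, (10_000, m)).mean(axis=1) - 0.5)
print("fixed f :", np.mean(dev > eps))                  # tiny
print("Hoeffding:", 2 * np.exp(-2 * m * eps**2))        # valid upper bound

# Minimizer over N functions, all with true risk exactly 0.5.
N = 1000
emp = rng.uniform(0, 1, (100, N, m)).mean(axis=2)       # empirical risks
dev_star = np.abs(emp.min(axis=1) - 0.5)                # deviation of f*
print("chosen f*:", np.mean(dev_star > eps))            # exceeds the bound
```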


SLIDE 14

Multiple testing

  • Tossing an unbiased coin 10 times

[Bar chart: scores of 20 random 'strategies' over 10 tosses, best 'strategy' highlighted]

SLIDE 15

Multiple testing

  • Tossing an unbiased coin 100 times

[Bar chart: scores of 20 random 'strategies' over 100 tosses, best 'strategy' highlighted]
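A quick simulation of these two slides; assuming the 20 bars are 20 random 'strategies', everything else is illustrative.

```python
import numpy as np

# Best of N random 'strategies' on an unbiased coin: the winner looks
# skilled on 10 tosses and much less so on 100 -- selection bias fades
# as m grows, at rate sqrt(log N / m).
rng = np.random.default_rng(3)
N = 20
for m in (10, 100):
    scores = rng.binomial(m, 0.5, size=(100_000, N))    # correct calls
    best = scores.max(axis=1) / m
    print(f"m={m:3d}: best-of-{N} accuracy averages {best.mean():.3f}"
          " (truth: 0.500)")
```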

SLIDE 16

Multiple testing

  • We invoke the bound each time we test
  • Picking the best out of N gives us N opportunities to get it wrong!
  • Union bound
  • Testing over all functions in the function class
  • Split the error probability up among all functions
  • Take the supremum over all terms

$\Pr\left\{\sup_{f \in \mathcal{F}} |R_{\mathrm{emp}}[f] - R[f]| > \epsilon\right\} \le \sum_{f' \in \mathcal{F}} \Pr\left\{|R_{\mathrm{emp}}[f'] - R[f']| > \epsilon\right\}$

SLIDE 17

Multiple testing

  • Our first generalization bound: $\epsilon \le L \sqrt{\frac{\log|\mathcal{F}| + \log(2/\delta)}{2m}}$
  • Putting it all together: $R[f^*] \le \inf_{f \in \mathcal{F}} R_{\mathrm{emp}}[f] + L \sqrt{\frac{\log|\mathcal{F}| + \log(2/\delta)}{2m}}$
  • What if the function class is not discrete?
  • What if we have binary loss?
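Plugging numbers into the confidence term shows why the bound is usable: it shrinks as $1/\sqrt{m}$ and grows only logarithmically in $|\mathcal{F}|$. A sketch with illustrative values L = 1 and δ = 0.05:

```python
import numpy as np

# Confidence term of the finite-class bound:
#   eps = L * sqrt((log|F| + log(2/delta)) / (2m))
def eps_bound(size_F, m, L=1.0, delta=0.05):
    return L * np.sqrt((np.log(size_F) + np.log(2 / delta)) / (2 * m))

for m in (100, 1_000, 10_000):
    row = {size: round(eps_bound(size, m), 3) for size in (10, 10**3, 10**6)}
    print(f"m={m:6d}: {row}")
```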


SLIDE 19

Covering Numbers

  • What if we have an uncountable function class?
  • Approximate by finite cover

SLIDE 21

Covering Numbers

  • What if we have an uncountable function class?
  • Approximate by finite cover
  • Now bound depends on discretization, too

SLIDE 22

Covering Numbers

  • Approximation error $\epsilon$
  • Covering number $N(\mathcal{F}, \epsilon)$ (actually we need a metric)

$R[f^*] \le \inf_{f \in \mathcal{F}} R_{\mathrm{emp}}[f] + L \sqrt{\frac{\log N(\mathcal{F}, \epsilon) + \log(2/\delta)}{2m}} + L' \epsilon$
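A sketch of the resulting trade-off for a hypothetical class that is 1-Lipschitz in a scalar parameter w ∈ [0, 1], so that a parameter grid of spacing 2ε is an ε-cover and N(F, ε) ≤ ⌈1/(2ε)⌉; the values of L, L′ and δ are illustrative.

```python
import numpy as np

# Discretization trade-off: a finer cover shrinks the approximation
# term L' * eps but inflates log N(F, eps) in the confidence term.
def excess(eps, m, L=1.0, Lp=1.0, delta=0.05):
    N = np.ceil(1.0 / (2.0 * eps))               # cover of w in [0, 1]
    conf = L * np.sqrt((np.log(N) + np.log(2 / delta)) / (2 * m))
    return conf + Lp * eps

m = 1_000
grid = np.linspace(1e-3, 0.5, 500)
vals = np.array([excess(e, m) for e in grid])
print(f"best eps ~ {grid[vals.argmin()]:.3f}, bound ~ {vals.min():.3f}")
```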

SLIDE 23

VC Dimension

  • Binary classification problem
  • Given locations, enumerate all possible ways these points can be separated
  • Example: linear separation

SLIDE 25

VC Dimension

  • Binary classification problem
  • Given locations, enumerate all possible ways these points can be separated
  • Exponential growth up to the VC dimension h, then polynomial
  • Examples
  • d-dimensional linear functions have h = d
  • $\sin(x/w)$ has infinite h

$R[f^*] \le \inf_{f \in \mathcal{F}} R_{\mathrm{emp}}[f] + \sqrt{\frac{h(\log(2m/h) + 1) + \log(4/\delta)}{m}}$

[Plot: number of separations vs. m, exponential up to h, polynomial growth afterwards]
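The shattering claim can be checked numerically. Below is a brute-force sketch for homogeneous linear separators sign⟨w, x⟩ in d = 2; random sampling of w can only undercount labelings, but for small generic configurations it finds them all.

```python
import numpy as np

# Growth function check for sign(<w, x>) in d = 2: all 2^m labelings
# are realized for m <= 2 (shattering), only O(m) of them afterwards,
# so the VC dimension is h = d = 2.
rng = np.random.default_rng(4)

def realized_labelings(points, samples=200_000):
    w = rng.normal(size=(samples, points.shape[1]))
    patterns = np.sign(w @ points.T)             # (samples, m) label rows
    return len(np.unique(patterns, axis=0))

for m in (2, 3, 4):
    pts = rng.normal(size=(m, 2))                # points in general position
    print(f"m={m}: {realized_labelings(pts)} of {2**m} labelings realized")
```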

SLIDE 26

Rademacher Averages

  • Nontrivial bound (state of the art)
  • Reasonably easy to compute
  • Recall McDiarmid's inequality: if $|f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x'_i, \ldots, x_m)| \le c_i$ for all i, then
    $\Pr\left(|f(x_1, \ldots, x_m) - \mathbb{E}_{X_1, \ldots, X_m}[f(x_1, \ldots, x_m)]| > \epsilon\right) \le 2 \exp\left(-2\epsilon^2 C^{-2}\right)$, where $C^2 = \sum_{i=1}^{m} c_i^2$
  • Bound the worst-case deviation
    $\Pr\left\{\sup_{f \in \mathcal{F}} \left|\frac{1}{m} \sum_{i=1}^{m} l(x_i, y_i, f(x_i)) - \mathbb{E}_{(x,y)}[l(x, y, f(x))]\right| > \epsilon\right\}$

SLIDE 27

Rademacher Averages

  • Worst-case deviation: $\Xi(X, Y) := \sup_{f \in \mathcal{F}} \left|\frac{1}{m} \sum_{i=1}^{m} l(x_i, y_i, f(x_i)) - \mathbb{E}_{(x,y)}[l(x, y, f(x))]\right|$
  • If we change a single observation pair: $\left|\Xi(X, Y) - \Xi(X^i \cup \{x'_i\}, Y^i \cup \{y'_i\})\right| \le L/m$
  • Apply McDiarmid's bound to get $\Pr\left\{|\Xi(X, Y) - \mathbb{E}_{X,Y}[\Xi(X, Y)]| > \epsilon\right\} \le 2 \exp\left(-2m\epsilon^2 L^{-2}\right)$
  • The worst-case deviation is not far from the typical case
SLIDE 29

Rademacher Averages

\begin{align*}
&\mathbb{E}_{X,Y}\left[\sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} l(x_i, y_i, f(x_i)) - \mathbb{E}_{(x,y)}[l(x, y, f(x))]\right] \\
&\quad= \mathbb{E}_{X,Y}\left[\sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} l(x_i, y_i, f(x_i)) - \mathbb{E}_{X',Y'}\left[\frac{1}{m} \sum_{i=1}^{m} l(x'_i, y'_i, f(x'_i))\right]\right] \\
&\quad\le \mathbb{E}_{X,Y,X',Y'}\left[\sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \left(l(x_i, y_i, f(x_i)) - l(x'_i, y'_i, f(x'_i))\right)\right] \\
&\quad= \mathbb{E}_{X,Y,X',Y'}\,\mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i \left(l(x_i, y_i, f(x_i)) - l(x'_i, y'_i, f(x'_i))\right)\right] \\
&\quad\le \frac{2}{m}\,\mathbb{E}_{X,Y}\,\mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \sigma_i\, l(x_i, y_i, f(x_i))\right]
\end{align*}
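A Monte Carlo check of this chain on a small concrete class: threshold classifiers $f_t(x) = 1\{x > t\}$ under 0-1 loss, with labels $1\{x > 1/2\}$ flipped with probability 0.1. The whole setup, including the hand-derived true risk, is illustrative.

```python
import numpy as np

# Monte Carlo check of symmetrization for thresholds f_t(x) = 1{x > t}
# under 0-1 loss; labels are 1{x > 1/2} flipped with probability 0.1.
rng = np.random.default_rng(5)
ts = np.linspace(0, 1, 21)                        # the function class F
m, reps = 50, 2000

# True risk R(t) for this distribution, derived by hand.
true_risk = np.where(ts <= 0.5, 0.5 - 0.8 * ts, 0.8 * ts - 0.3)

sup_dev, rad = [], []
for _ in range(reps):
    x = rng.uniform(size=m)
    y = (x > 0.5) ^ (rng.uniform(size=m) < 0.1)   # noisy labels
    losses = ((x[None, :] > ts[:, None]) != y).astype(float)   # (|F|, m)
    sup_dev.append(np.max(losses.mean(axis=1) - true_risk))
    sigma = rng.choice([-1.0, 1.0], size=m)
    rad.append(np.max(losses @ sigma) / m)

print(f"E sup_f (R_emp[f] - R[f]) ~ {np.mean(sup_dev):.3f}")
print(f"2/m E sup_f sum sigma_i l ~ {2 * np.mean(rad):.3f}")
```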

SLIDE 30

Rademacher Averages

  • Putting it all together: $R[f] \le R_{\mathrm{emp}}[f] + 2\mathcal{R}[\mathcal{F}, m] + L \sqrt{\frac{\log(2/\delta)}{2m}}$ (the Rademacher term captures behavior for random labels; the last term comes from averaging)
  • The Rademacher average can be bounded easily for linear function classes by solving a convex optimization problem.
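For the linear class $\{x \mapsto \langle w, x \rangle : \|w\|_2 \le B\}$ the inner supremum even has a closed form, $\sup_{\|w\| \le B} \frac{1}{m}\sum_i \sigma_i \langle w, x_i \rangle = \frac{B}{m}\|\sum_i \sigma_i x_i\|_2$, so the Rademacher average is cheap to estimate. A sketch with illustrative B and data:

```python
import numpy as np

# Empirical Rademacher average of {x -> <w, x> : ||w||_2 <= B}: the sup
# over w is attained at w = B * v / ||v|| with v = sum_i sigma_i x_i.
rng = np.random.default_rng(6)
m, d, B = 200, 5, 1.0
X = rng.normal(size=(m, d))

sigmas = rng.choice([-1.0, 1.0], size=(5000, m))  # 5000 draws of sigma
sups = B * np.linalg.norm(sigmas @ X, axis=1) / m
print(f"R(F, m) ~ {sups.mean():.4f}")
# Jensen gives the classical closed-form bound B * sqrt(sum ||x_i||^2) / m.
print(f"bound   : {B * np.sqrt((X ** 2).sum()) / m:.4f}")
```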

SLIDE 31

Some Alternatives

  • Validation set
    • Train on the training set (e.g. 90% of the data)
    • Check performance on the remaining 10%
    • Use only if the dataset is huge and few tests are needed
  • Cross-validation
    • Average over validation sets (e.g. 10-fold)
    • Nested cross-validation for model selection (e.g. 10-fold within each fold to find parameters)
  • Bayesian statistics
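A minimal scikit-learn sketch of the nested scheme; ridge regression and all parameter values are illustrative stand-ins.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Nested cross-validation: the inner loop selects the regularization
# parameter, the outer loop estimates generalization performance.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0,
                       random_state=0)

inner = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                     cv=KFold(n_splits=10, shuffle=True, random_state=1))
outer_scores = cross_val_score(inner, X, y,
                               cv=KFold(n_splits=10, shuffle=True,
                                        random_state=2))
print("nested 10-fold CV score:", outer_scores.mean())
```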