VC Dimension and classification (Prof. John Duchi)



SLIDE 1

VC Dimension and classification

John Duchi

  • Prof. John Duchi
SLIDE 2

Outline

I. Setting: classification problems
II. Finite hypothesis classes
    1. Union bounds
    2. Zero error case
III. Shatter coefficients and Rademacher complexity
IV. VC Dimension

SLIDE 3

Setting for the lecture

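Binary classification problems: data $X \in \mathcal{X}$ and labels $Y \in \{-1, 1\}$.

Hypothesis class $\mathcal{H} \subset \{h : \mathcal{X} \to \mathbb{R}\}$.

Goal: find $h \in \mathcal{H}$ with $L(h) := \mathbb{E}[\mathbf{1}\{h(X)Y \le 0\}]$ small.

Loss is always

$$\ell(h; (x, y)) = \mathbf{1}\{h(x)y \le 0\} = \begin{cases} 1 & \text{if } \operatorname{sign}(h(x)) \ne y \\ 0 & \text{if } \operatorname{sign}(h(x)) = y \end{cases}$$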
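As a minimal sketch of the setting, the zero-one loss and its sample average can be written in a few lines. The function names, the sample, and the threshold rule $h(x) = x - 0.5$ are illustrative assumptions, not part of the lecture.

```python
def zero_one_loss(score, label):
    """Return 1 if score * label <= 0 (a mistake or a zero margin), else 0."""
    return 1 if score * label <= 0 else 0

def empirical_risk(h, xs, ys):
    """Average zero-one loss of the real-valued hypothesis h over the sample."""
    return sum(zero_one_loss(h(x), y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: four points on the line with labels in {-1, 1}.
xs = [0.1, 0.4, 0.6, 0.9]
ys = [-1, 1, 1, 1]

# Hypothetical hypothesis: h(x) = x - 0.5, which misclassifies only x = 0.4.
risk = empirical_risk(lambda x: x - 0.5, xs, ys)
print(risk)
```

The empirical risk computed here is the quantity written $\widehat{L}_n(h)$ in the slides that follow.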

SLIDE 4

Finite hypothesis classes

Theorem

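Let $\mathcal{H}$ be a finite class. Then

$$P\left(\exists\, h \in \mathcal{H} \text{ s.t. } |L(h) - \widehat{L}_n(h)| \ge \sqrt{\frac{\log |\mathcal{H}| + t}{2n}}\right) \le 2e^{-t}.$$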
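To get a feel for the scale of this bound, one can set $2e^{-t} = \delta$, so $t = \log(2/\delta)$, and evaluate the deviation width numerically. The sketch below does only that; the function name and the example values of $|\mathcal{H}|$, $n$, and $\delta$ are assumptions chosen for illustration.

```python
import math

def finite_class_deviation(card_h, n, delta):
    """Width eps such that, with probability at least 1 - delta, every h in a
    finite class of size card_h satisfies |L(h) - L_hat_n(h)| < eps.

    Uses t = log(2 / delta) in the theorem above.
    """
    t = math.log(2.0 / delta)
    return math.sqrt((math.log(card_h) + t) / (2.0 * n))

# Hypothetical numbers: 1000 hypotheses, confidence 95%.
eps_small = finite_class_deviation(card_h=1000, n=1_000, delta=0.05)
eps_large = finite_class_deviation(card_h=1000, n=100_000, delta=0.05)
print(eps_small, eps_large)
```

Multiplying the sample size by 100 shrinks the width by a factor of 10, which is the $1/\sqrt{n}$ rate in the theorem.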

SLIDE 5

Finite hypothesis classes: generalization

Corollary

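Let $\mathcal{H}$ be a finite class, $\widehat{h}_n \in \operatorname{argmin}_{h \in \mathcal{H}} \widehat{L}_n(h)$. Then (for a numerical constant $C < \infty$)

$$L(\widehat{h}_n) \le \min_{h \in \mathcal{H}} L(h) + C \sqrt{\frac{\log(|\mathcal{H}| / \delta)}{n}} \quad \text{w.p. } \ge 1 - \delta.$$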

SLIDE 6

Finite hypothesis classes: perfect classifiers

Possible to give better guarantees if there are good classifiers! We won’t bother looking at bad ones.

Theorem

Let $\mathcal{H}$ be a finite hypothesis class and assume $\min_h L(h) = 0$. Then for $t \ge 0$

$$P\left(L(\widehat{h}_n) \ge L(h^\star) + \frac{\log |\mathcal{H}| + t}{n}\right) \le e^{-t}.$$
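The improvement from a $1/\sqrt{n}$ rate to a $1/n$ rate is easiest to see as a sample-size requirement. The sketch below inverts the zero-error bound (with $t = \log(1/\delta)$) and, for comparison, the uniform deviation bound for general finite classes (with $t = \log(2/\delta)$). The function names and numbers are illustrative assumptions.

```python
import math

def realizable_sample_size(card_h, eps, delta):
    """Smallest n with (log|H| + log(1/delta)) / n <= eps (zero-error case)."""
    return math.ceil((math.log(card_h) + math.log(1.0 / delta)) / eps)

def agnostic_sample_size(card_h, eps, delta):
    """Smallest n with sqrt((log|H| + log(2/delta)) / (2n)) <= eps (general case)."""
    return math.ceil((math.log(card_h) + math.log(2.0 / delta)) / (2.0 * eps ** 2))

# Hypothetical target: accuracy 0.01 at confidence 95% with 1000 hypotheses.
n_realizable = realizable_sample_size(1000, eps=0.01, delta=0.05)
n_agnostic = agnostic_sample_size(1000, eps=0.01, delta=0.05)
print(n_realizable, n_agnostic)
```

The first requirement scales like $1/\epsilon$ and the second like $1/\epsilon^2$, so the gap widens as the target accuracy tightens.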

SLIDE 7

Do not pick the bad ones

SLIDE 8

Finite function classes: Rademacher complexity

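Idea: Use Rademacher complexity to understand generalization even for these?

Let $\mathcal{F}$ be finite with $|f| \le 1$ for $f \in \mathcal{F}$. Then

$$R_n(\mathcal{F}) := \mathbb{E}\left[\max_{f \in \mathcal{F}} \left|\frac{1}{n} \sum_{i=1}^n \varepsilon_i f(Z_i)\right|\right]$$

satisfies

$$P\left(\max_{f \in \mathcal{F}} \left|\frac{1}{n} \sum_{i=1}^n \big(f(X_i) - \mathbb{E}[f(X_i)]\big)\right| \ge 2R_n(\mathcal{F}) + t\right) \le 2\exp(-cnt^2).$$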

SLIDE 9

Finite function classes: sub-Gaussianity

  • Let $P_n$ be the empirical distribution
  • Define $\|f\|_{L^2(P_n)}^2 = \frac{1}{n} \sum_{i=1}^n f(x_i)^2$
  • What about the sum $\frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i f(x_i)$?
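The slide leaves the question about the normalized sum open. A standard way to answer it, sketched here under the assumption that the $\varepsilon_i$ are i.i.d. uniform random signs and the $x_i$ are fixed, is to bound the moment generating function using $\cosh(u) \le e^{u^2/2}$:

```latex
\begin{align*}
\mathbb{E}\left[\exp\left(\frac{\lambda}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i f(x_i)\right)\right]
&= \prod_{i=1}^n \mathbb{E}\left[\exp\left(\frac{\lambda \varepsilon_i f(x_i)}{\sqrt{n}}\right)\right]
 = \prod_{i=1}^n \cosh\left(\frac{\lambda f(x_i)}{\sqrt{n}}\right) \\
&\le \prod_{i=1}^n \exp\left(\frac{\lambda^2 f(x_i)^2}{2n}\right)
 = \exp\left(\frac{\lambda^2}{2} \|f\|_{L^2(P_n)}^2\right).
\end{align*}
```

So the normalized sum is sub-Gaussian with variance proxy $\|f\|_{L^2(P_n)}^2$, which is the quantity that appears in bounds for a maximum over finitely many $f$.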

SLIDE 10

Finite function classes: Rademacher complexity

Proposition (Massart’s finite class bound)

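Let $\mathcal{F}$ be finite with $M := \max_{f \in \mathcal{F}} \|f\|_{L^2(P_n)}$. Then

$$\widehat{R}_n(\mathcal{F}) \le \sqrt{\frac{2M^2 \log(2 \operatorname{card}(\mathcal{F}))}{n}}.$$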
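A quick numerical check can make the proposition concrete. The sketch below estimates the empirical Rademacher complexity (with the absolute value inside the maximum) by Monte Carlo and compares it with Massart's bound. The function names, the number of draws, and the example class of step-shaped sign vectors are assumptions made for illustration only.

```python
import math
import random

def empirical_rademacher(values, num_draws=2000, seed=0):
    """Monte Carlo estimate of E max_f |(1/n) sum_i eps_i f(x_i)|.

    `values` holds one vector (f(x_1), ..., f(x_n)) per function f.
    """
    rng = random.Random(seed)
    n = len(values[0])
    total = 0.0
    for _ in range(num_draws):
        eps = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(abs(sum(e * v for e, v in zip(eps, vec))) / n for vec in values)
    return total / num_draws

def massart_bound(values):
    """sqrt(2 M^2 log(2 card(F)) / n) with M the largest L2(P_n) norm."""
    n = len(values[0])
    m_sq = max(sum(v * v for v in vec) / n for vec in values)
    return math.sqrt(2.0 * m_sq * math.log(2 * len(values)) / n)

# Hypothetical class: the n + 1 sign vectors that are +1 on a prefix, -1 after.
n = 20
values = [[1 if i < k else -1 for i in range(n)] for k in range(n + 1)]
estimate = empirical_rademacher(values)
bound = massart_bound(values)
print(estimate, bound)
```

The estimate is random only through the seeded sign draws; the bound depends on the class solely through its cardinality and the largest empirical norm.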

SLIDE 11

Infinite classes with finite labels

What if we had a classifier $h : \mathcal{X} \to \{-1, 1\}$ that could only give a certain number of different labelings to a data set?

Example (Sketchy)

Say $\mathcal{X} = \mathbb{R}$ and $h_t(x) = \operatorname{sign}(x - t)$. Complexity of $\mathcal{F} := \{f(x) = \mathbf{1}\{h_t(x) \le 0\}\}$?

SLIDE 12

Complexity of function classes

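Define $\mathcal{F}(x_{1:n}) := \{(f(x_1), \ldots, f(x_n)) \mid f \in \mathcal{F}\}$. Then $\widehat{R}_n(\mathcal{F}) = \widehat{R}_n(\mathcal{F}')$ whenever $\mathcal{F}(x_{1:n}) = \mathcal{F}'(x_{1:n})$.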

Proposition

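Rademacher complexity depends on values of $\mathcal{F}$: if $|f(x)| \le M$ for all $x$ then

$$R_n(\mathcal{F}) \le c \cdot M \sup_{x_1, \ldots, x_n \in \mathcal{X}} \sqrt{\frac{\log \operatorname{card}(\mathcal{F}(x_{1:n}))}{n}}.$$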

SLIDE 13

Proof of complexity

SLIDE 14

Shatter coefficients

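Given a function class $\mathcal{F}$, the shattering coefficient (growth function) is

$$s_n(\mathcal{F}) := \sup_{x_1, \ldots, x_n \in \mathcal{X}} \operatorname{card}(\mathcal{F}(x_{1:n})) = \sup_{x_{1:n} \in \mathcal{X}^n} \operatorname{card}\big(\{(f(x_1), \ldots, f(x_n)) \mid f \in \mathcal{F}\}\big)$$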

Example

Thresholds in R
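The slide gives only the title of this example. As a rough illustration, the sketch below enumerates the label vectors that the indicator class $f_t(x) = \mathbf{1}\{x \le t\}$ can produce on a fixed set of distinct points; the function name and the sample points are assumptions. Only $n + 1$ of the $2^n$ possible vectors appear, which suggests $s_n(\mathcal{F}) = n + 1$ for thresholds.

```python
def threshold_labelings(points):
    """Set of vectors (f_t(x_1), ..., f_t(x_n)) with f_t(x) = 1{x <= t}, over all real t."""
    # Trying one threshold below every point plus each point itself covers
    # every distinct behavior, because f_t only changes when t crosses a point.
    candidates = [min(points) - 1.0] + list(points)
    return {tuple(1 if x <= t else 0 for x in points) for t in candidates}

# Hypothetical sample of five distinct points on the line.
points = [0.3, 1.2, 2.5, 4.0, 7.1]
labelings = threshold_labelings(points)
print(len(labelings), 2 ** len(points))
```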

SLIDE 15

Shatter coefficients and Rademacher complexity

Proposition

For any function class $\mathcal{F}$ with $|f(x)| \le M$ we have

$$R_n(\mathcal{F}) \le cM \sqrt{\frac{\log s_n(\mathcal{F})}{n}}.$$

SLIDE 16

VC Dimension

How do we use shatter coefficients to give complexity guarantees?

Definition (VC Dimension)

Let $\mathcal{H}$ be a collection of boolean functions. The Vapnik-Chervonenkis (VC) Dimension of $\mathcal{H}$ is

$$\operatorname{VC}(\mathcal{H}) := \sup\{n \in \mathbb{N} : s_n(\mathcal{H}) = 2^n\}.$$

SLIDE 17

VC Dimension: examples

Example (Thresholds in R)

Example (Intervals in R)
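The slide lists these two examples by title only. A brute-force check on a small grid is one way to explore them: a set of points is shattered when the class realizes all $2^k$ labelings of it. In the sketch below the grid, the finite families of thresholds and intervals, and all function names are assumptions; the search gives a lower bound on the VC dimension in general, which for these two families matches the well-known values 1 and 2.

```python
from itertools import combinations

def labelings(points, hypotheses):
    """Distinct label vectors the hypotheses produce on the given points."""
    return {tuple(h(x) for x in points) for h in hypotheses}

def is_shattered(points, hypotheses):
    """True when all 2^k labelings of the k points are realized."""
    return len(labelings(points, hypotheses)) == 2 ** len(points)

def vc_dimension_on_grid(grid, hypotheses, max_size=4):
    """Largest k <= max_size such that some k-subset of `grid` is shattered."""
    best = 0
    for k in range(1, max_size + 1):
        if any(is_shattered(subset, hypotheses) for subset in combinations(grid, k)):
            best = k
        else:
            break  # subsets of shattered sets are shattered, so larger k fails too
    return best

grid = [0, 1, 2, 3, 4]
cuts = [c - 0.5 for c in range(6)]  # cut points between and around grid values

# Thresholds: h_t(x) = +1 if x >= t, else -1.
thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in cuts]
# Intervals: h_{a,b}(x) = +1 if a <= x <= b, else -1 (a = b gives all -1 on the grid).
intervals = [lambda x, a=a, b=b: 1 if a <= x <= b else -1
             for a in cuts for b in cuts if a <= b]

vc_thresholds = vc_dimension_on_grid(grid, thresholds)
vc_intervals = vc_dimension_on_grid(grid, intervals)
print(vc_thresholds, vc_intervals)
```

Thresholds fail on two points because the labeling $(+1, -1)$ in increasing order is never produced; intervals fail on three points because $(+1, -1, +1)$ is never produced.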

SLIDE 18

VC Dimension: examples

Example (Half-spaces in $\mathbb{R}^2$)

SLIDE 19

Finite dimensional hypothesis classes

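Let $\mathcal{F}$ be functions $f : \mathcal{X} \to \mathbb{R}$ and suppose $\dim(\mathcal{F}) = d$

  • Definition of dimension:

Example (Linear functionals)

If $\mathcal{F} = \{f(x) = w^\top x,\ w \in \mathbb{R}^d\}$ then $\dim(\mathcal{F}) = d$

Example (Nonlinear functionals)

If $\mathcal{F} = \{f(x) = w^\top \phi(x),\ w \in \mathbb{R}^d\}$ then $\dim(\mathcal{F}) = d$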

SLIDE 20

VC dimension of finite dimensional classes

Let $\mathcal{F}$ have $\dim(\mathcal{F}) = d$ and let $\mathcal{H} := \{h : \mathcal{X} \to \{-1, 1\} \text{ s.t. } h(x) = \operatorname{sign}(f(x)),\ f \in \mathcal{F}\}$.

Proposition (Dimension bounds VC dimension)

VC(H)  dim(F)

SLIDE 21

Finite dimensional hypothesis classes: proof

SLIDE 22

Sauer-Shelah Lemma

Theorem

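Let $\mathcal{H}$ be boolean functions with $\operatorname{VC}(\mathcal{H}) = d$. Then

$$s_n(\mathcal{H}) \le \sum_{i=0}^d \binom{n}{i} \le \begin{cases} 2^n & \text{if } n \le d \\ \left(\frac{ne}{d}\right)^d & \text{if } n > d \end{cases}$$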
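The point of the lemma is that the growth function switches from exponential to polynomial once $n$ exceeds $d$. The sketch below simply evaluates the three quantities side by side; the function names and the choice $d = 3$ are illustrative assumptions.

```python
import math

def sauer_sum(n, d):
    """sum_{i=0}^{d} C(n, i), the Sauer-Shelah bound on the shatter coefficient."""
    return sum(math.comb(n, i) for i in range(d + 1))

def polynomial_bound(n, d):
    """(n e / d)^d, an upper bound on sauer_sum(n, d) when n > d >= 1."""
    return (n * math.e / d) ** d

# Hypothetical VC dimension d = 3 at a few sample sizes.
for n in (5, 10, 100):
    print(n, sauer_sum(n, 3), round(polynomial_bound(n, 3)), 2 ** n)
```

When $n \le d$ the binomial sum equals $2^n$ exactly, since `math.comb(n, i)` is zero for $i > n$.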

SLIDE 23

Rademacher complexity of VC classes

Proposition

Let $\mathcal{H}$ be a collection of boolean functions with $\operatorname{VC}(\mathcal{H}) = d$. Then

$$R_n(\mathcal{H}) \le c \sqrt{\frac{d \log \frac{n}{d}}{n}}.$$

Proof is immediate (but a tighter result is possible):
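The slide does not write the proof out. One way to see it, assuming $n \ge e \cdot d$ so that $\log(ne/d) \le 2 \log(n/d)$ and taking $M = 1$ for boolean functions, is to chain the shatter-coefficient bound on Rademacher complexity with the Sauer-Shelah lemma:

```latex
\begin{align*}
R_n(\mathcal{H})
&\le c \sqrt{\frac{\log s_n(\mathcal{H})}{n}}
 \le c \sqrt{\frac{d \log \frac{ne}{d}}{n}}
 \le c \sqrt{\frac{2 d \log \frac{n}{d}}{n}},
\end{align*}
```

with the factor $\sqrt{2}$ absorbed into the numerical constant $c$.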

SLIDE 24

Generalization bounds for VC classes

Proposition

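Let $\mathcal{H}$ have VC-dimension $d$ and $\ell(h; (x, y)) = \mathbf{1}\{h(x) \ne y\}$. Then

$$P\left(\exists\, h \in \mathcal{H} \text{ s.t. } |\widehat{L}_n(h) - L(h)| \ge c \sqrt{\frac{d \log \frac{n}{d}}{n}} + t\right) \le 2e^{-nt^2}$$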

SLIDE 25

Things we have not addressed

  • Multiclass problems (Natarajan dimension, due to Bala Natarajan; see also Multiclass Learnability and the ERM Principle by Daniely et al.)
  • Extending “zero error” results to infinite classes
  • Non-boolean classes

SLIDE 26

Reading and bibliography

  1. M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
  2. P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
  3. S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
  4. A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York, 1996 (Ch. 2.6).
  5. Scribe notes for Statistics 300b: http://web.stanford.edu/class/stats300b/
