

SLIDE 1

Advanced Introduction to Machine Learning, CMU-10715

Vapnik–Chervonenkis Theory

Barnabás Póczos

SLIDE 2

Learning Theory

We have explored many ways of learning from data. But…

– How good is our classifier, really?
– How much data do we need to make it “good enough”?

SLIDE 3

Review of what we have learned so far

SLIDE 4

Notation

We will need these definitions, so please copy them!

This is what the learning algorithm produces.
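For later reference, this is the standard notation such results are usually stated in; the symbols below are conventional choices rather than ones read off the slide images:

  D_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}  (i.i.d. training sample, labels Y_i \in \{0,1\})
  R(f) = \Pr(f(X) \neq Y)  (true risk of a classifier f)
  \hat R_n(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{f(X_i) \neq Y_i\}  (empirical risk)
  \mathcal{F}  (the function / hypothesis class)
  \hat f_n = \arg\min_{f \in \mathcal{F}} \hat R_n(f)  (the empirical risk minimizer, i.e. what the learning algorithm produces)
  R^* = \inf_{f} R(f)  (the Bayes risk, infimum over all classifiers)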

SLIDE 5

Big Picture

Ultimate goal: drive the risk of the learned classifier down to the Bayes risk.

The gap to the Bayes risk splits into two parts: the estimation error and the approximation error.
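In the notation of Slide 4, the decomposition behind this picture is the standard identity

  R(\hat f_n) - R^* = \underbrace{R(\hat f_n) - \inf_{f \in \mathcal{F}} R(f)}_{\text{estimation error}} + \underbrace{\inf_{f \in \mathcal{F}} R(f) - R^*}_{\text{approximation error}}.

Learning theory asks when the estimation error goes to 0; the approximation error depends on how rich \mathcal{F} is.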

SLIDE 6

Big Picture: Illustration of Risks

(Figure: illustration of the true risk, its empirical upper bound, and the goal of learning.)

SLIDE 7

Learning Theory

SLIDE 8

Outline

Theorem: from Hoeffding’s inequality and a union bound, we have seen a deviation bound for a finite class of N functions.

These results are useless if N is big, or infinite (e.g. all possible hyper-planes).

Today we will see how to fix this with the shattering coefficient and VC dimension.
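A sketch of the finite-class result referred to above, assuming a 0-1 loss and |\mathcal{F}| = N:

  \Pr\Big( \sup_{f \in \mathcal{F}} |\hat R_n(f) - R(f)| > \varepsilon \Big) \le 2 N e^{-2 n \varepsilon^2},

equivalently, with probability at least 1 - \delta,

  \sup_{f \in \mathcal{F}} |\hat R_n(f) - R(f)| \le \sqrt{ \frac{\log(2N/\delta)}{2n} }.

The \log N term is what becomes useless for huge or infinite classes.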

SLIDE 9

Outline

Theorem: from Hoeffding’s inequality, we have seen a deviation bound for a fixed (or finite) set of functions.

After this fix, we can say something meaningful about this too: the function the learning algorithm produces and its true risk.

SLIDE 10

Hoeffding inequality

Theorem:
Observation!
Definition:
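One standard form of Hoeffding’s inequality, for i.i.d. random variables Z_1, …, Z_n taking values in [0, 1] (the slide’s exact formulation may differ):

  \Pr\Big( \Big| \frac{1}{n} \sum_{i=1}^{n} Z_i - \mathbb{E}[Z_1] \Big| > \varepsilon \Big) \le 2 e^{-2 n \varepsilon^2}.

Applied with Z_i = \mathbf{1}\{f(X_i) \neq Y_i\} for a single fixed f, this bounds |\hat R_n(f) - R(f)|.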

SLIDE 11

McDiarmid’s Bounded Difference Inequality
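A standard statement of the bounded difference inequality, with the usual constants (a representative form rather than a transcript of the slide): if changing the i-th argument of g changes its value by at most c_i, then for independent Z_1, …, Z_n

  \Pr\big( |g(Z_1, \dots, Z_n) - \mathbb{E}\, g(Z_1, \dots, Z_n)| > \varepsilon \big) \le 2 \exp\Big( - \frac{2 \varepsilon^2}{\sum_{i=1}^{n} c_i^2} \Big).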

It follows, for example, that Hoeffding’s inequality is the special case where g is the sample mean of variables in [0, 1], so that each c_i = 1/n.

SLIDE 12

Bounded Difference Condition

Our main goal is to bound the supremum, over the function class, of the deviation between the true risk and the empirical risk.

Lemma:

Let g denote the following function: the supremum of that deviation, viewed as a function of the sample.

Observation:  Proof:

⇒ McDiarmid can be applied for g!
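A sketch of the argument, assuming the 0-1 loss:

  g(z_1, \dots, z_n) = \sup_{f \in \mathcal{F}} \big| R(f) - \hat R_n(f) \big|, \qquad z_i = (x_i, y_i).

Changing a single sample point changes \hat R_n(f) by at most 1/n for every f, so |g(z) - g(z')| \le 1/n whenever z and z' differ in one coordinate. McDiarmid with c_i = 1/n then gives

  \Pr\big( |g - \mathbb{E} g| > \varepsilon \big) \le 2 e^{-2 n \varepsilon^2}.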

SLIDE 13

Bounded Difference Condition

Corollary: the supremum deviation concentrates around its expected value, so what remains is to bound that expectation. The Vapnik-Chervonenkis inequality does that with the shatter coefficient (and VC dimension)!

SLIDE 14

Concentration and Expected Value
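Putting the last two slides together (same assumptions as above): with probability at least 1 - \delta,

  \sup_{f \in \mathcal{F}} |\hat R_n(f) - R(f)| \le \mathbb{E}\Big[ \sup_{f \in \mathcal{F}} |\hat R_n(f) - R(f)| \Big] + \sqrt{ \frac{\log(2/\delta)}{2n} },

so everything reduces to bounding the expected supremum deviation.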

SLIDE 15

Vapnik-Chervonenkis inequality

Our main goal is to bound the expected supremum deviation; we already know from McDiarmid that the supremum deviation concentrates around it.

Vapnik-Chervonenkis inequality:
Corollary:
Vapnik-Chervonenkis theorem:
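Representative statements of these results, using the shatter coefficient S_\mathcal{F}(n) defined on the next slides (constants vary across textbooks, so treat these as one common version rather than the slide’s exact constants):

  \mathbb{E}\Big[ \sup_{f \in \mathcal{F}} |\hat R_n(f) - R(f)| \Big] \le 2 \sqrt{ \frac{2 \log\big( 2 S_{\mathcal{F}}(n) \big)}{n} }   (VC inequality),

  \Pr\Big( \sup_{f \in \mathcal{F}} |\hat R_n(f) - R(f)| > \varepsilon \Big) \le 8\, S_{\mathcal{F}}(n)\, e^{-n \varepsilon^2 / 32}   (VC theorem).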

SLIDE 16

Shattering

SLIDE 17

How many points can a linear boundary classify exactly in 1D?

(Figure: 2 pts vs. 3 pts on a line with +/- labels; a single threshold cannot realize the labeling + - + of three collinear points.)

There exists a placement s.t. all labelings can be classified.

The answer is 2.

SLIDE 18

How many points can a linear boundary classify exactly in 2D?

(Figure: 3 pts vs. 4 pts in the plane with +/- labels; 3 points in general position can be labeled arbitrarily by a line, but some labeling of 4 points cannot.)

There exists a placement s.t. all labelings can be classified.

The answer is 3.

SLIDE 19

How many points can a linear boundary classify exactly in 3D?

The answer is 4 (e.g. the vertices of a tetrahedron).

How many points can a linear boundary classify exactly in d dimensions?

The answer is d+1.
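Summarizing the last three slides: affine linear classifiers in R^d can shatter a suitable set of d+1 points but no set of d+2 points, so

  \mathrm{VC}\big( \text{halfspaces in } \mathbb{R}^d \big) = d + 1.

For instance, in 2D no placement of 4 points works: either one point lies inside the triangle of the other three, or the four points form a quadrilateral and the labeling that gives opposite corners the same sign is not linearly separable.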

SLIDE 20

Growth function, Shatter coefficient

Definition: the growth function (shatter coefficient) is the maximum number of behaviors on n points.

Example: behaviors (label patterns) produced on 3 points, one row per classifier:

0 0 0
0 1 0
1 1 1
1 0 0
0 1 1
0 1 0
1 1 1

Number of different behaviors = 5 in this example.
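In symbols (notation as on Slide 4):

  S_{\mathcal{F}}(n) = \max_{x_1, \dots, x_n} \big| \{ (f(x_1), \dots, f(x_n)) : f \in \mathcal{F} \} \big| \le 2^n.

In the example above the class produces 5 distinct label patterns on that particular sample of 3 points, so its shatter coefficient at n = 3 is at least 5.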

SLIDE 21

Growth function, Shatter coefficient

Definition: the growth function (shatter coefficient) is the maximum number of behaviors on n points.

Example: half spaces in 2D.

(Figure: a few points in the plane and the +/- labelings that half planes can produce on them.)
SLIDE 22

VC-dimension

Definition: the growth function (shatter coefficient) is the maximum number of behaviors on n points.

Definition: Shattering. A set of n points is shattered if all 2^n behaviors (labelings) can be produced on it.

Definition: VC-dimension. The size of the largest point set that can be shattered, i.e. on which the number of behaviors equals 2^n.
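In symbols:

  \{x_1, \dots, x_n\} \text{ is shattered by } \mathcal{F} \iff \big| \{ (f(x_1), \dots, f(x_n)) : f \in \mathcal{F} \} \big| = 2^n,

  \mathrm{VC}(\mathcal{F}) = \max\{\, n : S_{\mathcal{F}}(n) = 2^n \,\}.

If some set of size d is shattered then every smaller subset of it is shattered too, so S_\mathcal{F}(n) = 2^n for all n \le \mathrm{VC}(\mathcal{F}).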

SLIDE 23

VC-dimension

Definition # behaviors

SLIDE 24

VC-dimension

SLIDE 25

Examples

SLIDE 26

VC dim of decision stumps (axis aligned linear separator) in 2d

What’s the VC dim. of decision stumps in 2d?

(Figure: 3 points in the plane and the +/- labelings that a single axis-aligned threshold can produce on them.)

There is a placement of 3 pts that can be shattered ⇒ VC dim ≥ 3 (see the enumeration sketch below).
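As a sanity check of that claim, here is a small Python sketch (the function names and the particular point placement are illustrative choices, not taken from the slides). It enumerates every labeling that an axis-aligned decision stump h(x) = s * sign(x_j - t) can produce on a finite point set and tests whether the set is shattered:

import numpy as np

def stump_labelings(points):
    """All distinct labelings of `points` realizable by axis-aligned decision
    stumps h(x) = s * sign(x[j] - t), with axis j, threshold t, orientation s."""
    points = np.asarray(points, dtype=float)
    labelings = set()
    for j in range(points.shape[1]):                  # axis the stump looks at
        vals = np.unique(points[:, j])
        # candidate thresholds: below all values, between consecutive values, above all
        thresholds = np.concatenate(([vals[0] - 1.0],
                                     (vals[:-1] + vals[1:]) / 2.0,
                                     [vals[-1] + 1.0]))
        for t in thresholds:
            base = tuple(np.where(points[:, j] > t, 1, -1))
            labelings.add(base)                        # orientation s = +1
            labelings.add(tuple(-v for v in base))     # orientation s = -1
    return labelings

def is_shattered(points):
    """True iff stumps realize all 2^n labelings of the given points."""
    return len(stump_labelings(points)) == 2 ** len(points)

# A placement of 3 points that decision stumps shatter (so VC dim >= 3):
print(is_shattered([(0.0, 0.0), (1.0, 2.0), (2.0, 1.0)]))   # True

On the placement shown it prints True, so VC dim ≥ 3. Along each axis the positively labeled points of a stump form a prefix or suffix of the points sorted by that coordinate, so only a few of the 16 labelings of 4 points can ever arise; no placement of 4 points is shattered, matching the case analysis on the next slide.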

SLIDE 27

VC dim of decision stumps (axis aligned linear separator) in 2d

What’s the VC dim. of decision stumps in 2d?

If VC dim = 3, then for all placements of 4 pts there exists a labeling that can’t be realized. Case analysis on the 4 points: 3 collinear, 1 in the convex hull of the other 3, or a quadrilateral.

(Figure: the impossible labeling in each case.)

SLIDE 28

VC dim. of axis parallel rectangles in 2d

What’s the VC dim. of axis parallel rectangles in 2d?

(Figure: 3 points and rectangles realizing every labeling.)

There is a placement of 3 pts that can be shattered ⇒ VC dim ≥ 3

SLIDE 29

VC dim. of axis parallel rectangles in 2d

There is a placement of 4 pts that can be shattered ⇒ VC dim ≥ 4

SLIDE 30

VC dim. of axis parallel rectangles in 2d

What’s the VC dim. of axis parallel rectangles in 2d?

If VC dim = 4, then for all placements of 5 pts there exists a labeling that can’t be realized. Case analysis on the 5 points: 4 collinear, 2 in the convex hull, 1 in the convex hull, or a pentagon.

(Figure: the impossible labeling in each case.)
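The standard argument behind all of these cases: among any 5 points, label the left-most, right-most, top-most and bottom-most points + and the remaining point -. Every axis-parallel rectangle containing the four extreme points also contains the fifth, so this labeling cannot be realized. Combined with the shatterable 4-point placement on Slide 29,

  \mathrm{VC}\big( \text{axis-parallel rectangles in } \mathbb{R}^2 \big) = 4.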

SLIDE 31

Sauer’s Lemma

The VC dimension can be used to upper bound the shattering coefficient.

Sauer’s lemma:  Corollary:

We already know the trivial bound 2^n [exponential in n]; Sauer’s lemma replaces it with a bound that is [polynomial in n].
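The statement as it is usually given (d denotes the VC dimension):

  S_{\mathcal{F}}(n) \le \sum_{i=0}^{d} \binom{n}{i}   (Sauer’s lemma),

  S_{\mathcal{F}}(n) \le \Big( \frac{e n}{d} \Big)^{d} \ \text{ for } n \ge d   (corollary: polynomial in n, versus the exponential 2^n).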

SLIDE 32

Proof of Sauer’s Lemma

Write all different behaviors on a sample (x1, x2, …, xn) in a matrix: one column per sample point, one row per behavior. In the running example (n = 3), the observed label vectors

0 0 0
0 1 0
1 1 1
1 0 0
0 1 0
1 1 1
0 1 1

reduce to the 5 different behaviors

0 0 0
0 1 0
1 1 1
1 0 0
0 1 1

SLIDE 33

Proof of Sauer’s Lemma

We will prove that the number of rows (different behaviors) is at most the number of shattered subsets of columns, which in turn is at most the Sauer bound; therefore the lemma follows.

0 0 0
0 1 0
1 1 1
1 0 0
0 1 1

Shattered subsets of columns: ∅, {1}, {2}, {3}, {1,2}, {1,3} (6 in this example).

SLIDE 34

Proof of Sauer’s Lemma

0 0 0
0 1 0
1 1 1
1 0 0
0 1 1

Shattered subsets of columns: ∅, {1}, {2}, {3}, {1,2}, {1,3}.

Lemma 1: the number of shattered subsets of columns is at most C(n,0) + C(n,1) + … + C(n,d), where d is the VC dimension. In this example: 6 ≤ 1+3+3 = 7.

Lemma 2: the number of rows is at most the number of shattered subsets of columns, for any binary matrix with no repeated rows. In this example: 5 ≤ 6.

SLIDE 35

Proof of Lemma 1

Lemma 1: the number of shattered subsets of columns is at most C(n,0) + C(n,1) + … + C(n,d).

Proof: no subset of more than d columns can be shattered (that would contradict the definition of the VC dimension d), and there are only C(n,0) + … + C(n,d) subsets of size at most d.

0 0 0
0 1 0
1 1 1
1 0 0
0 1 1

Shattered subsets of columns: ∅, {1}, {2}, {3}, {1,2}, {1,3}. In this example: 6 ≤ 1+3+3 = 7.

SLIDE 36

Proof of Lemma 2

Lemma 2: the number of rows is at most the number of shattered subsets of columns, for any binary matrix with no repeated rows.

Proof: induction on the number of columns.

Base case: A has one column. There are three cases: the single row (0), the single row (1), or both rows.
⇒ 1 ≤ 1   ⇒ 1 ≤ 1   ⇒ 2 ≤ 2

SLIDE 37

Proof of Lemma 2

Inductive case: A has at least two columns. We split the rows according to the last column and apply the induction hypothesis to the resulting matrices, which have fewer columns (one standard way to do this is sketched below).

0 0 0
0 1 0
1 1 1
1 0 0
0 1 1
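A sketch of the standard induction step (the slides’ exact bookkeeping may differ): let B be the set of distinct rows of A with the last column deleted, and let C ⊆ B be those truncated rows that occur in A with both a 0 and a 1 in the last column. Then

  \#\mathrm{rows}(A) = \#\mathrm{rows}(B) + \#\mathrm{rows}(C).

Every column subset shattered in B is shattered in A, and if S is shattered in C then S \cup \{ \text{last column} \} is shattered in A. The first family never contains the last column and the second always does, so they are disjoint, and by the induction hypothesis

  \#\mathrm{rows}(A) \le \#\mathrm{sh}(B) + \#\mathrm{sh}(C) \le \#\mathrm{sh}(A).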

SLIDE 38

Proof of Lemma 2

…because every subset of columns shattered in the smaller matrices is also shattered (possibly together with the last column) in the original matrix.

0 0 0
0 1 0
1 1 1
1 0 0
0 1 1

SLIDE 39

Vapnik-Chervonenkis inequality

Vapnik-Chervonenkis inequality:  [We don’t prove this]

From Sauer’s lemma the shatter coefficient is only polynomial in n. Since the bound depends on the shatter coefficient only through its logarithm, the estimation error therefore goes to 0 as n grows.
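Combining the VC inequality (as stated on Slide 15) with Sauer’s lemma:

  \mathbb{E}\Big[ \sup_{f \in \mathcal{F}} |\hat R_n(f) - R(f)| \Big] \le 2 \sqrt{ \frac{2 \log\big( 2 S_{\mathcal{F}}(n) \big)}{n} } \le 2 \sqrt{ \frac{2 \big( d \log(e n / d) + \log 2 \big)}{n} } \to 0,

and since R(\hat f_n) - \inf_{f \in \mathcal{F}} R(f) \le 2 \sup_{f \in \mathcal{F}} |\hat R_n(f) - R(f)|, the estimation error of empirical risk minimization vanishes whenever the VC dimension d is finite.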

SLIDE 40

Linear (hyperplane) classifiers

We already know the VC dimension of hyperplane classifiers, so we can plug it into the bound on the estimation error.
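A sketch of that plug-in step, with the same representative constants as before: hyperplane classifiers in R^d have VC dimension d+1, so

  \mathbb{E}\Big[ \sup_{f} |\hat R_n(f) - R(f)| \Big] \le 2 \sqrt{ \frac{2 \big( (d+1) \log\frac{e n}{d+1} + \log 2 \big)}{n} } = O\Big( \sqrt{ \tfrac{d \log n}{n} } \Big),

so the estimation error of ERM over hyperplanes is of order \sqrt{d \log n / n}.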

SLIDE 41

Vapnik-Chervonenkis Theorem

We already know from McDiarmid that the supremum deviation concentrates around its expectation.

Summary of the bounds [we don’t prove them]:
– Hoeffding + union bound for a finite function class
– Vapnik-Chervonenkis inequality (bound on the expected supremum deviation)
– Vapnik-Chervonenkis theorem and its corollary (tail bound on the supremum deviation)

SLIDE 42

PAC Bound for the Estimation Error

VC theorem:

Inversion: set the probability bound equal to δ and solve for ε; this yields a bound on the estimation error that holds with probability at least 1 - δ.
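A worked example of the inversion, starting from the tail form quoted on Slide 15 (so the constants are again representative): setting 8 S_\mathcal{F}(n) e^{-n ε^2 / 32} = δ and solving for ε gives, with probability at least 1 - δ,

  \sup_{f \in \mathcal{F}} |\hat R_n(f) - R(f)| \le \sqrt{ \frac{32 \big( \log S_{\mathcal{F}}(n) + \log(8/\delta) \big)}{n} },

and hence

  R(\hat f_n) \le \hat R_n(\hat f_n) + \sqrt{ \frac{32 \big( \log S_{\mathcal{F}}(n) + \log(8/\delta) \big)}{n} },

with the estimation error bounded by twice the same square-root term.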

SLIDE 43

Structural Risk Minimization

Ultimate goal: make the gap to the Bayes risk small, i.e. both the estimation error and the approximation error.

So far we studied when the estimation error → 0, but we also want the approximation error → 0.

Many different variants exist… the common idea is to penalize too complex models to avoid overfitting.
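A generic sketch of the idea (the penalty shown is an illustrative VC-type choice, not the slide’s specific formula): fix a nested sequence of model classes \mathcal{F}_1 \subset \mathcal{F}_2 \subset … with VC dimensions d_1 \le d_2 \le …, and select

  \hat f = \arg\min_{k} \; \min_{f \in \mathcal{F}_k} \Big[ \hat R_n(f) + \mathrm{pen}(d_k, n) \Big], \qquad \text{e.g. } \mathrm{pen}(d, n) \approx \sqrt{ \frac{d \log n}{n} }.

Larger classes shrink the approximation error, while the penalty keeps the estimation error under control.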

SLIDE 44

What you need to know

Complexity of the classifier depends on the number of points that can be classified exactly.

Finite case – number of hypotheses
Infinite case – shattering coefficient, VC dimension

PAC bounds on the true error in terms of the empirical/training error and the complexity of the hypothesis space.

Empirical and Structural Risk Minimization.

SLIDE 45

Thanks for your attention