SLIDE 1

Introduction to Machine Learning

Vapnik–Chervonenkis Theory

Barnabás Póczos

SLIDE 2

Empirical Risk and True Risk

SLIDE 3

Empirical Risk

True risk of f (deterministic):

Bayes risk:

Empirical risk (shorthand):

Let us use the empirical counterpart of the true risk.
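The formulas on this slide were images in the original deck and did not survive extraction. A minimal reconstruction in standard notation, assuming the 0-1 loss used later in the deck:

$R(f) = \Pr\bigl(f(X) \neq Y\bigr)$  (true risk of a fixed classifier $f$)

$R^{*} = \inf_{f} R(f)$  (Bayes risk, infimum over all measurable classifiers)

$\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{f(X_i) \neq Y_i\}$  (empirical risk on the i.i.d. sample $(X_1,Y_1),\dots,(X_n,Y_n)$)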

SLIDE 4

Empirical Risk Minimization

Law of Large Numbers: for each fixed f, the empirical risk converges to the true risk.
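A sketch of the two missing formulas, in the notation assumed above (standard statements, not recovered from the slides):

$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f)$  (empirical risk minimization over the class $\mathcal{F}$)

$\hat{R}_n(f) \xrightarrow{a.s.} R(f)$ as $n \to \infty$, for each fixed $f$  (strong law of large numbers)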

SLIDE 5

Overfitting in Classification with ERM

Generative model:

Bayes classifier:

Bayes risk:

[Figure omitted; picture from David Pal]

SLIDE 6

Overfitting in Classification with ERM

Function class: n-order thresholded polynomials

Bayes risk:

Empirical risk:

[Figure omitted; picture from David Pal]

SLIDE 7

Overfitting in Regression

If we allow very complicated predictors, we could overfit the training data.

Example: regression with polynomials of degree k-1: k=1 (constant), k=2 (linear), k=3 (quadratic), k=7 (6th order).

[Figures omitted: fits of increasing polynomial order to the same training data]

SLIDE 8

Solutions to Overfitting

SLIDE 9

Solutions to Overfitting: Structural Risk Minimization

Notation: risk, empirical risk

Goal: (model error, approximation error)

Solution: Structural Risk Minimization (SRM)
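The SRM objective itself was an image. A generic sketch of the standard form, assuming a nested sequence of model classes $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots$ and some complexity penalty $\mathrm{pen}(k, n)$ (the deck's exact penalty is not recoverable):

$\hat{f}_n^{SRM} = \arg\min_{k \ge 1,\; f \in \mathcal{F}_k} \bigl[\, \hat{R}_n(f) + \mathrm{pen}(k, n) \,\bigr]$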

SLIDE 10

Big Picture

Risk decomposition: estimation error + approximation error + Bayes risk

Ultimate goal: the risk of the learned classifier should approach the Bayes risk, i.e., both the estimation error and the approximation error should vanish.
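The decomposition shown on this slide is standard; a reconstruction of the missing formula, in the notation used above:

$R(\hat{f}_n) - R^{*} = \underbrace{\bigl[R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f)\bigr]}_{\text{estimation error}} + \underbrace{\bigl[\inf_{f \in \mathcal{F}} R(f) - R^{*}\bigr]}_{\text{approximation error}}$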

SLIDE 11

Effect of Model Complexity

If we allow very complicated predictors, we could overfit the training data: the empirical risk is no longer a good indicator of the true risk.

[Figure omitted: prediction error on training data vs. true risk as a function of model complexity, for a fixed # of training data]

SLIDE 12

Classification using the 0-1 loss

SLIDE 13

The Bayes Classifier

Lemma I:

Lemma II:

Proofs: Lemma I is trivial from the definition; Lemma II is a surprisingly long calculation.
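The lemma statements were images. They are presumably the standard facts about the Bayes classifier under the 0-1 loss (an assumption; stated here with $\eta(x) = \Pr(Y = 1 \mid X = x)$):

$f^{*}(x) = \mathbf{1}\{\eta(x) \ge 1/2\}$ minimizes the risk, and $R^{*} = R(f^{*}) = E\bigl[\min(\eta(X),\, 1-\eta(X))\bigr]$

For any classifier $f$:  $R(f) - R^{*} = E\bigl[\,|2\eta(X)-1|\;\mathbf{1}\{f(X) \neq f^{*}(X)\}\,\bigr] \ge 0$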

SLIDE 14

The Bayes Classifier

This is what the learning algorithm produces:

We will need these definitions, so please copy them!
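The definitions themselves were images; the standard trio this deck needs later is (assumed notation):

$f^{*} = \arg\min_{f} R(f)$  (Bayes classifier, over all measurable classifiers)

$\bar{f} = \arg\min_{f \in \mathcal{F}} R(f)$  (best classifier in the class $\mathcal{F}$)

$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f)$  (what the learning algorithm, ERM, produces)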

SLIDE 15

The Bayes Classifier

Theorem I: Bound on the estimation error, i.e., on the true risk of what the learning algorithm produces.
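The theorem itself was an image. The standard ERM estimation-error bound, which the following slides (union bound, VC inequality) make quantitative, reads (assumed form):

$R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) \;\le\; 2 \sup_{f \in \mathcal{F}} \bigl|\hat{R}_n(f) - R(f)\bigr|$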

SLIDE 16

Proof of Theorem I

Theorem I: Bound on the estimation error (the true risk of what the learning algorithm produces).

Proof:
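The proof was an image. The standard two-line argument, assuming the statement above and writing $\bar{f}$ for a minimizer of $R$ over $\mathcal{F}$:

$R(\hat{f}_n) - R(\bar{f}) = \bigl[R(\hat{f}_n) - \hat{R}_n(\hat{f}_n)\bigr] + \bigl[\hat{R}_n(\hat{f}_n) - \hat{R}_n(\bar{f})\bigr] + \bigl[\hat{R}_n(\bar{f}) - R(\bar{f})\bigr]$

The middle bracket is $\le 0$ because $\hat{f}_n$ minimizes the empirical risk; each of the other two is at most $\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|$, which gives the factor of 2.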

SLIDE 17

The Bayes Classifier

Theorem II: (about what the learning algorithm produces)

Proof: Trivial.

SLIDE 18

Corollary

Corollary:

Main message: it is enough to derive upper bounds for the worst-case deviation between the empirical risk and the true risk over the function class.
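In the notation above, the quantity the following slides bound is presumably:

$\sup_{f \in \mathcal{F}} \bigl|\hat{R}_n(f) - R(f)\bigr|$  (a random variable, since it depends on the training sample)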

SLIDE 19

Illustration of the Risks

SLIDE 20

It is enough to derive upper bounds for the worst-case deviation.

It is a random variable that we need to bound! We will bound it with tail bounds!

SLIDE 21

Hoeffding's inequality (1963)

Special case:
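The inequality was an image. The standard statement (the deck's exact form may differ slightly): for independent random variables $Z_1, \dots, Z_n$ with $Z_i \in [a_i, b_i]$,

$\Pr\Bigl(\sum_{i=1}^{n} (Z_i - E Z_i) \ge t\Bigr) \le \exp\Bigl(-\frac{2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\Bigr)$

Special case ($Z_i \in [0,1]$, two-sided, for the average $\bar{Z}_n$):  $\Pr\bigl(|\bar{Z}_n - E\bar{Z}_n| \ge \varepsilon\bigr) \le 2 e^{-2n\varepsilon^2}$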

SLIDE 22

Binomial distributions

Our goal is to bound:

For a fixed classifier, the 0-1 losses are i.i.d. Bernoulli(p) random variables.

Therefore, from Hoeffding we have:

Yuppie!!!
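Reconstructing the missing formulas in the notation above: for a fixed $f$, the losses $\mathbf{1}\{f(X_i) \neq Y_i\}$ are i.i.d. Bernoulli with parameter $p = R(f)$, so

$n \hat{R}_n(f) \sim \mathrm{Binomial}\bigl(n, R(f)\bigr)$, and by Hoeffding:  $\Pr\bigl(|\hat{R}_n(f) - R(f)| \ge \varepsilon\bigr) \le 2 e^{-2n\varepsilon^2}$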

SLIDE 23

Inversion

From Hoeffding we have:

Therefore,
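A sketch of the inversion step (standard, not recovered from the slide): set the right-hand side to $\delta$ and solve for $\varepsilon$.

$\delta = 2 e^{-2n\varepsilon^2} \;\Longleftrightarrow\; \varepsilon = \sqrt{\frac{\log(2/\delta)}{2n}}$

Therefore, with probability at least $1-\delta$:  $|\hat{R}_n(f) - R(f)| \le \sqrt{\frac{\log(2/\delta)}{2n}}$  (for a fixed $f$)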

SLIDE 24

Union Bound

Our goal is to bound:

We already know:

Theorem: [tail bound on the 'deviation' in the worst case]

Worst-case error. Proof:

Note: this is not the worst classifier in terms of classification accuracy! Worst case means that the empirical risk of classifier f is the furthest from its true risk!
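For a finite class with $N = |\mathcal{F}|$ elements, the union bound argument the slide refers to is (standard form):

$\Pr\Bigl(\max_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| \ge \varepsilon\Bigr) \le \sum_{f \in \mathcal{F}} \Pr\bigl(|\hat{R}_n(f) - R(f)| \ge \varepsilon\bigr) \le 2 N e^{-2n\varepsilon^2}$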

SLIDE 25

Inversion of Union Bound

We already know:

Therefore,
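Inverting the union bound as on Slide 23 (a sketch under the same assumptions): with probability at least $1-\delta$,

$\max_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| \le \sqrt{\frac{\log(2N/\delta)}{2n}}$, and hence by Theorem I:  $R(\hat{f}_n) - \min_{f \in \mathcal{F}} R(f) \le 2\sqrt{\frac{\log(2N/\delta)}{2n}}$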

SLIDE 26

Inversion of Union Bound

  • The larger the N, the looser the bound.
  • This result is distribution free: it is true for all P(X,Y) distributions.
  • It is useless if N is big or infinite… (e.g. all possible hyperplanes)

It can be fixed with the McDiarmid inequality and the VC dimension…

SLIDE 27

Concentration and Expected Value

SLIDE 28

The Expected Error

Our goal is to bound:

We already know:

Theorem: [expected 'deviation' in the worst case] Worst-case deviation

Proof: we already know a tail bound (a concentration inequality); from that we actually get a slightly weaker inequality… oh well.

Note: this is not the worst classifier in terms of classification accuracy! Worst case means that the empirical risk of classifier f is the furthest from its true risk!

SLIDE 29

Function classes with infinitely many elements

SLIDE 30

McDiarmid's Bounded Difference Inequality

It follows that
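The inequality was an image; the standard statement (assumed form): let $Z_1, \dots, Z_n$ be independent, and let $g$ satisfy the bounded difference condition, i.e., changing the $i$-th argument changes $g$ by at most $c_i$. Then

$\Pr\bigl(g(Z_1,\dots,Z_n) - E\,g(Z_1,\dots,Z_n) \ge \varepsilon\bigr) \le \exp\Bigl(-\frac{2\varepsilon^2}{\sum_{i=1}^{n} c_i^2}\Bigr)$  (and the same bound holds for the lower tail)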

SLIDE 31

Bounded Difference Condition

Our main goal is to bound:

Let g denote the following function:

Lemma:

Observation:

Proof:

=> McDiarmid can be applied to g!
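Reconstructing the intended $g$ and the lemma in the notation above (a standard argument, assumed to match the slide):

$g(Z_1,\dots,Z_n) = \sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|$, where $Z_i = (X_i, Y_i)$

Changing one sample point changes each $\hat{R}_n(f)$ by at most $1/n$, hence changes $g$ by at most $c_i = 1/n$; McDiarmid then gives  $\Pr\bigl(g - E g \ge \varepsilon\bigr) \le e^{-2n\varepsilon^2}$.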

SLIDE 32

Bounded Difference Condition

Corollary:

It remains to bound the expected worst-case deviation; the Vapnik-Chervonenkis inequality does that with the shatter coefficient (and VC dimension)!

SLIDE 33

Vapnik-Chervonenkis inequality

Our main goal is to bound:

We already know:

Vapnik-Chervonenkis inequality:

Corollary:

Vapnik-Chervonenkis theorem:
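The statements were images. One standard form of the expectation bound, via symmetrization and Massart's finite-class lemma (the deck's constants may differ), with $S_{\mathcal{F}}(n)$ the shatter coefficient defined on Slide 38:

$E\Bigl[\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|\Bigr] \le 2\sqrt{\frac{2\log S_{\mathcal{F}}(n)}{n}}$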

SLIDE 34

Shattering

SLIDE 35

How many points can a linear boundary classify exactly in 1D?

2 pts: there exists a placement s.t. all labelings can be classified.

3 pts: ??

The answer is 2.

[Figure omitted: 1D point placements with +/- labelings]

SLIDE 36

How many points can a linear boundary classify exactly in 2D?

3 pts: there exists a placement s.t. all labelings can be classified.

4 pts: no matter how we place 4 points, there is a labeling that cannot be classified.

The answer is 3.

[Figure omitted: 2D point placements with +/- labelings]

SLIDE 37

How many points can a linear boundary classify exactly in 3D?

The answer is 4 (place the points as a tetrahedron).

How many points can a linear boundary classify exactly in d dimensions?

The answer is d+1.

SLIDE 38

Growth function, Shatter coefficient

Definition (growth function, shatter coefficient): the maximum number of behaviors on n points.

[Table omitted: binary matrix of the distinct behaviors; the shatter coefficient is 5 in this example]
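In symbols (standard definition, assumed to match the deck):

$S_{\mathcal{F}}(n) = \max_{x_1,\dots,x_n} \bigl|\{\, (f(x_1),\dots,f(x_n)) : f \in \mathcal{F} \,\}\bigr|$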

SLIDE 39

Growth function, Shatter coefficient

Definition (growth function, shatter coefficient): the maximum number of behaviors on n points.

Example: half spaces in 2D.

[Figure omitted: labelings of points by half spaces in 2D]
SLIDE 40

VC-dimension

Definition (growth function, shatter coefficient): the maximum number of behaviors on n points.

Definition (shattering):

Definition (VC-dimension): (# behaviors)

Note:
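The two definitions were images; the standard versions are:

$\mathcal{F}$ shatters $\{x_1,\dots,x_n\}$ if all $2^n$ labelings are realized, i.e., $|\{(f(x_1),\dots,f(x_n)) : f \in \mathcal{F}\}| = 2^n$

$\mathrm{VC}(\mathcal{F}) = \max\{\, n : S_{\mathcal{F}}(n) = 2^n \,\}$  (the largest n such that some set of n points is shattered; possibly infinite)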

SLIDE 41

VC-dimension

Definition (# behaviors):

SLIDE 42

VC-dimension

(such that you want to maximize the # of different behaviors)

SLIDE 43

Examples

SLIDE 44

VC dim of decision stumps (axis-aligned linear separator) in 2d

What's the VC dim. of decision stumps in 2d?

  • There is a placement of 3 pts that can be shattered => VC dim ≥ 3

[Figure omitted: 3 points with all +/- labelings realized by stumps]

SLIDE 45

VC dim of decision stumps (axis-aligned linear separator) in 2d

What's the VC dim. of decision stumps in 2d?

  • If VC dim = 3, then for all placements of 4 pts there exists a labeling that can't be realized. Cases: 3 collinear; 1 in the convex hull of the other 3; quadrilateral.

=> VC dim = 3

SLIDE 46

VC dim. of axis-parallel rectangles in 2d

What's the VC dim. of axis-parallel rectangles in 2d?

  • There is a placement of 3 pts that can be shattered => VC dim ≥ 3

[Figure omitted]

SLIDE 47

VC dim. of axis-parallel rectangles in 2d

There is a placement of 4 pts that can be shattered => VC dim ≥ 4

SLIDE 48

VC dim. of axis-parallel rectangles in 2d

What's the VC dim. of axis-parallel rectangles in 2d?

  • If VC dim = 4, then for all placements of 5 pts there exists a labeling that can't be realized. Cases: 4 collinear; 2 in the convex hull; 1 in the convex hull; pentagon.

=> VC dim = 4

SLIDE 49

Sauer's Lemma

The VC dimension can be used to upper bound the shattering coefficient.

We already know that: [exponential in n]

Sauer's lemma: [polynomial in n]

Corollary:
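The statements were images; the standard forms (with $d = \mathrm{VC}(\mathcal{F})$):

We already know:  $S_{\mathcal{F}}(n) \le 2^n$  (exponential in n)

Sauer's lemma:  $S_{\mathcal{F}}(n) \le \sum_{i=0}^{d} \binom{n}{i}$

Corollary: for $n \ge d$,  $S_{\mathcal{F}}(n) \le \left(\frac{en}{d}\right)^{d}$  (polynomial in n)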

SLIDE 50

Vapnik-Chervonenkis inequality

Vapnik-Chervonenkis inequality: [we don't prove this]

From Sauer's lemma:

Since … therefore … (a bound on the estimation error)
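Plugging Sauer's lemma into the VC inequality above gives, up to constants that vary by textbook (a sketch, not the deck's exact bound): for $n \ge d$,

$E\Bigl[\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|\Bigr] \le 2\sqrt{\frac{2\,d \log(en/d)}{n}}$, so the expected estimation error is $O\Bigl(\sqrt{\tfrac{d \log n}{n}}\Bigr)$.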

SLIDE 51

Linear (hyperplane) classifiers

We already know the VC dimension of linear classifiers (d+1 in d dimensions, Slide 37), which, plugged into the bound above, gives a bound on the estimation error.

SLIDE 52

Vapnik-Chervonenkis Theorem

We already know: Hoeffding + union bound for a finite function class:

We already know from McDiarmid:

Vapnik-Chervonenkis inequality:

Corollary:

Vapnik-Chervonenkis theorem: [we don't prove them]
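One common form of the VC theorem (constants differ between sources; this version is assumed here, not recovered from the slide):

$\Pr\Bigl(\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| > \varepsilon\Bigr) \le 8\, S_{\mathcal{F}}(n)\, e^{-n\varepsilon^2/32}$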

SLIDE 53

PAC Bound for the Estimation Error

VC theorem:

Inversion:

Estimation error:
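A sketch of the inversion, using the VC theorem form quoted above together with Theorem I and Sauer's lemma (constants are illustrative): with probability at least $1-\delta$,

$R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) \le 2\sqrt{\frac{32\,\bigl(\log S_{\mathcal{F}}(n) + \log(8/\delta)\bigr)}{n}} = O\Bigl(\sqrt{\tfrac{d\log n + \log(1/\delta)}{n}}\Bigr)$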

SLIDE 54

What you need to know

Complexity of the classifier depends on the number of points that can be classified exactly.

Finite case: number of hypotheses. Infinite case: shattering coefficient, VC dimension.

SLIDE 55

Thanks for your attention ☺

SLIDE 56

Attic

SLIDE 57

Proof of Sauer's Lemma

Write all different behaviors on a sample (x1, x2, …, xn) in a matrix:

[Matrix omitted: a binary matrix whose rows are the distinct behaviors; in the running example, VC dim = 2]

SLIDE 58

Proof of Sauer's Lemma

We will prove that the number of rows of the behavior matrix (the number of behaviors) is at most $\sum_{i=0}^{d} \binom{n}{i}$, where d is the VC dimension.

Shattered subsets of columns:

Therefore,

In this example: 5 ≤ 1+3+3 = 7, since VC dim = 2 and n = 3.

SLIDE 59

Proof of Sauer's Lemma

Lemma 1: (number of shattered subsets of columns) ≤ $\sum_{i=0}^{d} \binom{n}{i}$. In this example: 6 ≤ 1+3+3 = 7.

Lemma 2: for any binary matrix with no repeated rows, (number of rows) ≤ (number of shattered subsets of columns). In this example: 5 ≤ 6.

SLIDE 60

Proof of Lemma 1

Lemma 1: (number of shattered subsets of columns) ≤ $\sum_{i=0}^{d} \binom{n}{i}$.

Shattered subsets of columns:

Proof:

In this example: 6 ≤ 1+3+3 = 7.

Q.E.D.
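The proof body was an image; the argument is presumably the standard one: a shattered subset of columns of size k witnesses VC dimension at least k, so every shattered subset has size at most d, and there are at most $\sum_{i=0}^{d}\binom{n}{i}$ subsets of that size. In symbols:

$\#\{\text{shattered subsets of columns}\} \le \#\{S \subseteq \{1,\dots,n\} : |S| \le d\} = \sum_{i=0}^{d} \binom{n}{i}$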

SLIDE 61

Proof of Lemma 2

Lemma 2: for any binary matrix with no repeated rows, (number of rows) ≤ (number of shattered subsets of columns).

Proof: induction on the number of columns.

Base case: A has one column. There are three cases (the column contains only 0s, only 1s, or both values): 1 ≤ 1, 1 ≤ 1, 2 ≤ 2.

SLIDE 62

Proof of Lemma 2

Inductive case: A has at least two columns. We have:

By induction (fewer columns):

SLIDE 63

Proof of Lemma 2

because

Q.E.D.

SLIDE 64

Solution to Overfitting

2nd issue:

Solution:

SLIDE 65

Approximation with the Hinge loss and quadratic loss

Picture is taken from R. Herbrich

SLIDE 66

Underfitting

Bayes risk = 0.1

SLIDE 67

Underfitting

Best linear classifier:

The empirical risk of the best linear classifier:

SLIDE 68

Underfitting

Best quadratic classifier:

Same as the Bayes risk => good fit!

SLIDE 69

Structural Risk Minimization

Risk decomposition: estimation error + approximation error + Bayes risk

Ultimate goal: so far we studied when the estimation error → 0, but we also want the approximation error → 0.

Many different variants… penalize too complex models to avoid overfitting.