

SLIDE 1

Machine Learning

Computational Learning Theory: Shattering and VC Dimensions

Slides based on material from Dan Roth, Avrim Blum, Tom Mitchell and others

SLIDES 2-3

This lecture: Computational Learning Theory

  • The Theory of Generalization
  • Probably Approximately Correct (PAC) learning
  • Positive and negative learnability results
  • Agnostic Learning
  • Shattering and the VC dimension

SLIDES 4-7

Infinite Hypothesis Space

  • The previous analysis was restricted to finite hypothesis spaces
  • Some infinite hypothesis spaces are more expressive than others
    – E.g., rectangles vs. 17-sided convex polygons vs. general convex polygons
    – Linear threshold functions vs. combinations of LTUs
  • We need a measure of the expressiveness of an infinite hypothesis space other than its size
  • The Vapnik-Chervonenkis dimension (VC dimension) provides such a measure
    – “What is the expressive capacity of a set of functions?”
  • Analogous to |H|, there are bounds for sample complexity using VC(H)

SLIDES 8-16

Learning Rectangles

  • Assume the target concept is an axis-parallel rectangle in the X-Y plane
  • Points inside the rectangle are positive; points outside are negative
  • Given a sample of labeled points, will we be able to learn the target rectangle? Can we come close?

[Figure: an axis-parallel rectangle in the plane, with positive points scattered inside it and negative points outside]
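As an aside (the slides do not prescribe an algorithm here): a classic consistent learner for this class predicts positive exactly inside the tightest axis-parallel rectangle around the positive examples. Since that rectangle always sits inside the target, it never errs on negatives, and with enough samples the region it misses has small probability, which is the PAC sense of "coming close". A minimal Python sketch, with illustrative names:

    def learn_rectangle(examples):
        """examples: list of ((x, y), label) pairs with label in {+1, -1}.
        Returns the tightest axis-parallel rectangle (x_min, x_max, y_min, y_max)
        around the positive points, or None if there are none."""
        positives = [p for p, label in examples if label == +1]
        if not positives:
            return None
        xs = [x for x, _ in positives]
        ys = [y for _, y in positives]
        return (min(xs), max(xs), min(ys), max(ys))

    def predict(rect, point):
        """Label a point +1 iff it falls inside the learned rectangle."""
        if rect is None:
            return -1
        x_min, x_max, y_min, y_max = rect
        x, y = point
        return +1 if (x_min <= x <= x_max and y_min <= y <= y_max) else -1

    data = [((2, 3), +1), ((4, 5), +1), ((3, 4), +1), ((0, 0), -1), ((9, 9), -1)]
    rect = learn_rectangle(data)
    print(rect)                   # (2, 4, 3, 5)
    print(predict(rect, (3, 3)))  # +1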

SLIDES 17-21

Let’s think about expressivity of functions

Suppose we have two points. Can linear classifiers correctly classify any labeling of these points?

There are four ways to label two points, and it is possible to draw a line that separates positive and negative points in all four cases. We say that linear functions are expressive enough to shatter two points.

What about fourteen points?

SLIDES 22-31

Shattering

What about this labeling? This particular labeling of the fourteen points cannot be separated by any line.

Linear functions are not expressive enough to shatter fourteen points, because there is at least one labeling that cannot be separated by them. Of course, a more complex function could separate them.

SLIDES 32-33

Shattering

Definition: A set S of examples is shattered by a set of functions H if for every partition of the examples in S into positive and negative examples there is a function in H that gives exactly these labels to the examples

Intuition: A rich set of functions shatters large sets of points

Example 1: Hypothesis class of left-bounded intervals on the real axis: [0, a) for some real number a > 0

[Figure: points inside [0, a) are labeled positive; points outside the shaded region are labeled negative]

SLIDES 34-44

Left-bounded intervals

Example 1: Hypothesis class of left-bounded intervals on the real axis: [0, a) for some real number a > 0

If we have a set S with only one point:
  • If the point is labeled +, we can find an a that is to the right of that point; the hypothesis [0, a) correctly labels the point as positive
  • If the point is labeled −, we can find an a that is to the left of that point; the hypothesis [0, a) correctly labels the point as negative

Any set of one point can be shattered by the hypothesis class of left-bounded intervals.

Now consider a set S with two points. We can label the points such that no hypothesis in our class can match the labels: label the left point − and the right point +
  • Any a to the left of the + point gives a hypothesis that incorrectly labels that point as negative
  • Any a to the right of the − point gives a hypothesis that incorrectly labels that point as positive

SLIDES 45-47

Shattering

Definition: A set S of examples is shattered by a set of functions H if for every partition of the examples in S into positive and negative examples there is a function in H that gives exactly these labels to the examples

Intuition: A rich set of functions shatters large sets of points

Example 1: Hypothesis class of left-bounded intervals on the real axis: [0, a) for some real number a > 0
  • Sets with one point can be shattered. That is: given one point, for any labeling of the point, we can find a concept in this class that is consistent with it
  • Sets with two points cannot be shattered. That is: given two points, you can label them in such a way that no concept in this class will be consistent with their labeling
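As an editorial aside (not from the slides), both claims can be checked mechanically. Below is a minimal Python sketch of a brute-force shattering test; names are illustrative. On a fixed finite point set, only the position of the threshold relative to the points matters, so a small pool of candidate thresholds stands in for the infinite class.

    from itertools import product

    def is_shattered(points, hypotheses):
        """Return True iff every +/- labeling of `points` is realized by
        some hypothesis in the (finite) pool `hypotheses`."""
        achievable = {tuple(h(x) for x in points) for h in hypotheses}
        return all(labeling in achievable
                   for labeling in product([True, False], repeat=len(points)))

    # Left-bounded intervals [0, a): one candidate threshold around each
    # point realizes every behavior the full class can produce on that set.
    def left_interval(a):
        return lambda x: 0 <= x < a

    pool = [left_interval(a) for a in (0.5, 1.5, 2.5, 3.5)]

    print(is_shattered([1.0], pool))        # True: one point can be shattered
    print(is_shattered([1.0, 2.0], pool))   # False: the labeling (-, +) is unrealizable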

SLIDES 48-52

Shattering

Definition: A set S of examples is shattered by a set of functions H if for every partition of the examples in S into positive and negative examples there is a function in H that gives exactly these labels to the examples

Example 2: Hypothesis class is the set of intervals on the real axis: [a, b], for some real numbers b > a

[Figure: points inside [a, b] are labeled positive; points outside the shaded region are labeled negative]

All sets of one or two points can be shattered, but sets of three points cannot be shattered.

Proof? For any three points, consider the labeling + − + (in left-to-right order): any interval containing both + points must also contain the − point between them.
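The same brute-force test confirms Example 2, reusing is_shattered from the sketch above. The endpoint grid is an assumption; it just needs one candidate between and beyond each pair of points.

    # Intervals [a, b]: a small endpoint grid captures every behavior of
    # the infinite class on these fixed points.
    def interval(a, b):
        return lambda x: a <= x <= b

    grid = (0.5, 1.5, 2.5, 3.5)
    pool = [interval(a, b) for a in grid for b in grid if a <= b]

    print(is_shattered([1.0, 2.0], pool))        # True: two points can be shattered
    print(is_shattered([1.0, 2.0, 3.0], pool))   # False: (+, -, +) is unrealizable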

SLIDES 53-57

Shattering

Definition: A set S of examples is shattered by a set of functions H if for every partition of the examples in S into positive and negative examples there is a function in H that gives exactly these labels to the examples

Example 3: Half-spaces in the plane

Can one point be shattered? Two points? Three points? Can any three points be shattered?
  • Any three points that are not collinear can be shattered

[Figure: half-spaces on a plane shattering three points]

Can four points be shattered?
  • If three of them lie on the same line, label the two outside points + and the inner one −: no half-space can match this labeling
  • Otherwise, if one point lies inside the convex hull (a triangle) of the other three, label the hull points + and the inner one −
  • If all four points lie on the convex hull, label one diagonal pair + and the other pair − (the XOR pattern): no line separates the two pairs

Four points cannot be shattered!

[Figure: half-spaces on a plane failing to shatter four points]
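Half-space shattering can also be checked mechanically. A sketch, assuming SciPy is available and with illustrative names: a labeling is realizable by a half-space iff a small linear program is feasible, so we can test every labeling of a point set.

    from itertools import product
    import numpy as np
    from scipy.optimize import linprog

    def linearly_separable(points, labels):
        """Return True iff some half-space realizes the labeling, i.e. there
        exist (w, b) with label * (w . x + b) >= 1 for every point. This is
        a pure feasibility LP (the objective is constant), and it works in
        any dimension, not just the plane."""
        X = np.asarray(points, dtype=float)
        y = np.asarray(labels, dtype=float)
        n, d = X.shape
        # Variables z = (w_1, ..., w_d, b); constraints -y_i*(x_i . w + b) <= -1.
        A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
        res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                      bounds=[(None, None)] * (d + 1))
        return res.success

    def shattered_by_half_spaces(points):
        return all(linearly_separable(points, labels)
                   for labels in product([+1, -1], repeat=len(points)))

    print(shattered_by_half_spaces([(0, 0), (1, 0), (0, 1)]))          # True
    print(shattered_by_half_spaces([(0, 0), (1, 0), (0, 1), (1, 1)]))  # False: XOR fails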

SLIDES 58-62

Shattering: The adversarial game

You: Hypothesis class H can shatter these d points.
Adversary: That’s what you think! Here is a labeling that will defeat you.
You: Aha! There is a function h ∈ H that correctly predicts your evil labeling.
Adversary: Argh! You win this round. But I’ll be back…

SLIDES 63-64

Some functions can shatter infinitely many points!

If arbitrarily large finite subsets of the instance space X can be shattered by a hypothesis space H, then H can shatter sets of unbounded size.

An unbiased hypothesis space H shatters the entire instance space X, i.e., it can induce every possible partition on the set of all possible instances.

The larger the subset of X that can be shattered, the more expressive the hypothesis space is, i.e., the less biased it is.

Intuition: A rich set of functions shatters large sets of points

SLIDES 65-67

Vapnik-Chervonenkis Dimension

Recall: A set S of examples is shattered by a set of functions H if for every partition of the examples in S into positive and negative examples there is a function in H that gives exactly these labels to the examples

Definition: The VC dimension of hypothesis space H over instance space X is the size of the largest finite subset of X that is shattered by H

  • If there exists any subset of size d that can be shattered, then VC(H) ≥ d
    – Even one such subset will do
  • If no subset of size d can be shattered, then VC(H) < d

SLIDE 68

What we have managed to prove

Concept class             VC dimension   Why?
Half intervals            1              There is a dataset of size 1 that can be shattered; no dataset of size 2 can be shattered
Intervals                 2              There is a dataset of size 2 that can be shattered; no dataset of size 3 can be shattered
Half-spaces in the plane  3              There is a dataset of size 3 that can be shattered; no dataset of size 4 can be shattered
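The lower bounds in this table can also be searched for mechanically. A sketch reusing is_shattered, left_interval, and interval from the earlier sketches; it brute-forces subsets of a candidate pool, so it can only certify lower bounds on VC(H):

    from itertools import combinations

    def vc_lower_bound(pool_points, hypotheses):
        """Size of the largest subset of `pool_points` shattered by the
        hypothesis pool. A subset of a shattered set is itself shattered,
        so we can stop at the first size with no shattered subset. Larger
        shattered sets may exist outside the candidate pool."""
        best = 0
        for d in range(1, len(pool_points) + 1):
            if any(is_shattered(list(s), hypotheses)
                   for s in combinations(pool_points, d)):
                best = d
            else:
                break
        return best

    points = [1.0, 2.0, 3.0, 4.0]
    grid = (0.5, 1.5, 2.5, 3.5, 4.5)
    print(vc_lower_bound(points, [left_interval(a) for a in grid]))                        # 1
    print(vc_lower_bound(points, [interval(a, b) for a in grid for b in grid if a <= b]))  # 2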

SLIDES 69-73

More VC dimensions

Concept class                            VC dimension
Linear threshold units in d dimensions   d + 1
Neural networks                          Number of parameters
1-nearest neighbor                       Infinite

What is the number of parameters needed to specify a linear threshold unit in d dimensions? d + 1.

Local minima in learning mean that neural networks may not find the best parameters.

Exercise: Try to prove the 1-nearest-neighbor result after we see nearest neighbors.

Intuition: A rich set of functions shatters large sets of points
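A sanity check for the first row, reusing linearly_separable from the half-space sketch: in d dimensions, the origin plus the d unit vectors can be shattered by linear threshold units, so VC ≥ d + 1. (The matching upper bound needs Radon's theorem and is not checked here.)

    from itertools import product

    def shatters_d_plus_one(d):
        """Check that the origin plus the d unit vectors are shattered by
        linear threshold units in d dimensions (so VC >= d + 1)."""
        points = [tuple(0.0 for _ in range(d))]
        points += [tuple(1.0 if j == i else 0.0 for j in range(d))
                   for i in range(d)]
        return all(linearly_separable(points, labels)
                   for labels in product([+1, -1], repeat=d + 1))

    print(shatters_d_plus_one(2))   # True
    print(shatters_d_plus_one(3))   # True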

SLIDE 74

Why VC dimension?

  • Remember sample complexity
    – Occam’s razor
    – Agnostic learning
  • Sample complexity in both cases depends on the log of the size of the hypothesis space
  • For infinite hypothesis spaces, the VC dimension behaves like log(|H|)

SLIDE 75

VC dimension and Occam’s razor for consistent learners

  • Using VC(H) as a measure of expressiveness, we have an Occam theorem for infinite hypothesis spaces
  • Given a sample D with m examples, find some h ∈ H that is consistent with all m examples. If

      m > (1/ε) · (8 · VC(H) · log(13/ε) + 4 · log(2/δ))

    then with probability at least 1 − δ, the hypothesis h has error less than ε.

That is, if m is polynomial we have a PAC learning algorithm; to be efficient, we need to produce the hypothesis h efficiently
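For concreteness, a small calculator for this bound (an editorial sketch; the slide does not fix the log base, and base 2 is the usual convention for this theorem):

    from math import ceil, log

    def occam_sample_bound(vc, epsilon, delta):
        """m > (1/eps) * (8 * VC(H) * log2(13/eps) + 4 * log2(2/delta))."""
        return ceil((8 * vc * log(13 / epsilon, 2) + 4 * log(2 / delta, 2)) / epsilon)

    # e.g. intervals (VC = 2), 10% error, 95% confidence:
    print(occam_sample_bound(vc=2, epsilon=0.1, delta=0.05))   # 1337 examples suffice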

SLIDE 76

VC dimension and Agnostic Learning

A similar statement holds in the agnostic setting as well. If we have m examples, then with probability 1 − δ, the true error of a hypothesis h with training error err_S(h) is bounded by

    err_D(h) ≤ err_S(h) + √[ ( VC(H) · (ln(2m/VC(H)) + 1) + ln(4/δ) ) / m ]

(Phew!)
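A small sketch of how this bound behaves (illustrative only): the square-root term shrinks like 1/√m and grows with VC(H).

    from math import log, sqrt

    def vc_generalization_gap(vc, m, delta):
        """The square-root term above: bound on err_D(h) - err_S(h)."""
        return sqrt((vc * (log(2 * m / vc) + 1) + log(4 / delta)) / m)

    # The gap shrinks as the sample grows:
    for m in (100, 1000, 10000):
        print(m, round(vc_generalization_gap(vc=3, m=m, delta=0.05), 3))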

SLIDE 77

Exercises

  • What is the VC dimension of axis-parallel rectangles (which we saw at the beginning of this lecture)?
  • Your homework asks you to compute the VC dimension of different classes of functions

SLIDE 78

PAC learning: What you need to know

  • What is PAC learning?
    – Remember: We care about generalization error, not training error
  • Finite hypothesis spaces
    – Connection between the size of the hypothesis space and sample complexity
    – Derive and understand the sample complexity bounds
    – Count the number of hypotheses in a hypothesis class
  • Infinite hypothesis classes
    – What is shattering and VC dimension?
    – How to find the VC dimension of simple concept classes?
    – Higher VC dimension ⇒ higher sample complexity

SLIDE 79

Computational Learning Theory

  • Probably Approximately Correct (PAC) learning
    – A general definition that assumes a fixed, but perhaps unknown, distribution
  • Occam’s razor for consistent learners in finite hypothesis spaces
    – Positive and negative learnability results in this setting
  • Agnostic Learning and the associated Occam razor
  • Shattering and the VC dimension
  • Many extensions to the theory exist
    – Noisy data, known data distributions, probabilistic models
    – One important extension: PAC-Bayes theory, which makes assumptions about the prior distribution over hypothesis spaces

SLIDES 80-81

Why computational learning theory?

  • Answers questions such as
    – What is learnability? How good is my class of functions?
    – Is a concept learnable? How many examples do I need?
  • Mistake bounds imply PAC-learnability
  • Raises interesting theoretical questions
    – If a concept class is weakly learnable (i.e., there is a learning algorithm that can produce a classifier that does slightly better than chance), does this mean that the concept class is strongly learnable? (Boosting)
    – We have seen bounds of the form: true error < training error + (a term with ε, δ and the VC dimension). Can we use this to define a learning algorithm? (The Structural Risk Minimization principle and Support Vector Machines)