Computational Learning Theory: An Analysis of a Conjunction Learner


SLIDE 1

Machine Learning

Computational Learning Theory: An Analysis of a Conjunction Learner

1

Slides based on material from Dan Roth, Avrim Blum, Tom Mitchell and others

SLIDE 2

This lecture: Computational Learning Theory

  • The Theory of Generalization
  • Probably Approximately Correct (PAC) learning
  • Positive and negative learnability results
  • Agnostic Learning
  • Shattering and the VC dimension

2

SLIDE 3

Where are we?

  • The Theory of Generalization

    – When can we trust the learning algorithm?
    – What functions can be learned?
    – Batch Learning

  • Probably Approximately Correct (PAC) learning
  • Positive and negative learnability results
  • Agnostic Learning
  • Shattering and the VC dimension

3

SLIDE 4

This section

  • 1. Analyze a simple algorithm for learning conjunctions
  • 2. Define the PAC model of learning
  • 3. Make formal connections to the principle of Occam’s razor

4


SLIDE 6

Learning Conjunctions

Training data

  – <(1,1,1,1,1,1,…,1,1), 1>
  – <(1,1,1,0,0,0,…,0,0), 0>
  – <(1,1,1,1,1,0,...0,1,1), 1>
  – <(1,0,1,1,1,0,...0,1,1), 0>
  – <(1,1,1,1,1,0,...0,0,1), 1>
  – <(1,0,1,0,0,0,...0,1,1), 0>
  – <(1,1,1,1,1,1,…,0,1), 1>
  – <(0,1,0,1,0,0,...0,1,1), 0>

f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

6

The true function


SLIDE 9

Learning Conjunctions

Training data

  – <(1,1,1,1,1,1,…,1,1), 1>
  – <(1,1,1,0,0,0,…,0,0), 0>
  – <(1,1,1,1,1,0,...0,1,1), 1>
  – <(1,0,1,1,1,0,...0,1,1), 0>
  – <(1,1,1,1,1,0,...0,0,1), 1>
  – <(1,0,1,0,0,0,...0,1,1), 0>
  – <(1,1,1,1,1,1,…,0,1), 1>
  – <(0,1,0,1,0,0,...0,1,1), 0>

f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

A simple learning algorithm (Elimination)

  • Discard all negative examples
  • Build a conjunction using the features that are common to all positive examples

9

Positive examples eliminate irrelevant features

SLIDE 10

Learning Conjunctions

Training data

  – <(1,1,1,1,1,1,…,1,1), 1>
  – <(1,1,1,0,0,0,…,0,0), 0>
  – <(1,1,1,1,1,0,...0,1,1), 1>
  – <(1,0,1,1,1,0,...0,1,1), 0>
  – <(1,1,1,1,1,0,...0,0,1), 1>
  – <(1,0,1,0,0,0,...0,1,1), 0>
  – <(1,1,1,1,1,1,…,0,1), 1>
  – <(0,1,0,1,0,0,...0,1,1), 0>

f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

A simple learning algorithm:

  • Discard all negative examples
  • Build a conjunction using the features that are common to all positive examples

10

Clearly this algorithm produces a conjunction that is consistent with the data, that is, errS(h) = 0, if the target function is a monotone conjunction. (Exercise: Why?)
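
To make the procedure concrete, here is a minimal sketch of the Elimination algorithm in Python. It is illustrative only; the function names and data layout are ours, not from the slides.

```python
# Minimal sketch of the Elimination algorithm for monotone conjunctions.
def learn_monotone_conjunction(examples):
    """examples: list of (x, y) pairs, where x is a tuple of 0/1 features and y is 0 or 1.
    Returns the set of feature indices kept in the learned conjunction h."""
    positives = [x for x, y in examples if y == 1]      # discard all negative examples
    n = len(examples[0][0])
    # keep exactly the features that are 1 in every positive example
    return {i for i in range(n) if all(x[i] == 1 for x in positives)}

def predict(h, x):
    """h(x) = 1 iff every feature in the conjunction h is 1 in x."""
    return int(all(x[i] == 1 for i in h))

# tiny usage example (made-up data)
data = [((1, 1, 1), 1), ((1, 0, 1), 1), ((0, 1, 1), 0)]
h = learn_monotone_conjunction(data)                    # {0, 2}
print(predict(h, (1, 1, 1)), predict(h, (0, 0, 1)))     # 1 0
```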


SLIDE 12

Learning Conjunctions: Analysis

Claim 1: Any hypothesis consistent with the training data will only make mistakes on future positive examples

Why?

12

f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

A mistake will occur only if some literal z (in our example, x1) is present in h but not in f

Such a literal can cause a positive example to be predicted as negative by h

The reverse situation can never happen

For an example to be labeled positive in the training set, every relevant literal (every literal of f) must have been present, so those literals are never eliminated: h contains every literal of f, and whenever h predicts positive, f is satisfied as well

Specifically, the mistakes are on examples with x1 = 0, x2 = 1, x3 = 1, x4 = 1, x5 = 1, x100 = 1
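
A quick illustration of the claim (our own toy check, not from the slides): h contains the extra literal x1, so it can only err by predicting a true positive as negative.

```python
# Toy check of Claim 1 (illustrative; the variable names are ours, not from the slides).
f_literals = {2, 3, 4, 5, 100}          # the true conjunction f
h_literals = {1, 2, 3, 4, 5, 100}       # the learned conjunction h (extra literal x1)

def predict(literals, x):
    return int(all(x.get(i, 0) == 1 for i in literals))

# A positive example with x1 = 0: f says 1, h says 0 -> a false negative.
x = {i: 1 for i in f_literals}
print(predict(f_literals, x), predict(h_literals, x))   # 1 0

# For h to output 1, every literal in h (hence every literal in f) must be 1,
# so f also outputs 1: h can never produce a false positive.
```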



SLIDE 16

Learning Conjunctions: Analysis

Theorem: Suppose we are learning a conjunctive concept with n-dimensional Boolean features using m training examples. If m > (n/ε)(ln n + ln(1/δ)), then with probability > 1 - δ, the error of the learned hypothesis errD(h) will be less than ε.

16

If we see this many training examples, then the algorithm will produce a conjunction that, with high probability, will make few errors

Poly in n, 1/δ, 1/ε

SLIDE 17

Learning Conjunctions: Analysis

Theorem: Suppose we are learning a conjunctive concept with n-dimensional Boolean features using m training examples. If m > (n/ε)(ln n + ln(1/δ)), then with probability > 1 - δ, the error of the learned hypothesis errD(h) will be less than ε.

17

Let’s prove this assertion

SLIDE 18

Proof Intuition

What kinds of examples would drive a hypothesis to make a mistake? Positive examples, where x1 is absent

f would say true and h would say false

None of these examples appeared during training

Otherwise x1 would have been eliminated

If they never appeared during training, maybe their appearance in the future would also be rare!

Let’s quantify our surprise at seeing such examples

18

f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100



SLIDE 27

Learning Conjunctions: Analysis

Let p(z) be the probability that, in an example drawn from D, the feature z is absent but the example has a positive label

  • That is, after training is done, p(z) is the probability that, in a randomly drawn example, the literal z causes a mistake

  • For any z in the target function, p(z) = 0

27

f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

<(0,1,1,1,1,0,...0,1,1), 1>

p(x1): Probability that this situation occurs

h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

Remember that there will only be mistakes on positive examples for this toy problem
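
As a concrete (made-up) illustration of p(z): under a hypothetical distribution D in which each feature is independently 1 with probability 0.9, p(x1) can be estimated by sampling. The distribution and all names below are our own, not from the slides.

```python
import random

def estimate_p(z, sample, f_literals):
    """Monte Carlo estimate of p(z): the fraction of examples in which feature z is 0
    while the true conjunction over f_literals is satisfied (so the label is positive)."""
    hits = sum(1 for x in sample if x[z] == 0 and all(x[i] == 1 for i in f_literals))
    return hits / len(sample)

n = 100
random.seed(0)
# hypothetical D: each feature independently 1 with probability 0.9
sample = [tuple(int(random.random() < 0.9) for _ in range(n)) for _ in range(100_000)]
f_literals = [1, 2, 3, 4, 99]              # 0-based stand-ins for x2, x3, x4, x5, x100
print(estimate_p(0, sample, f_literals))   # roughly 0.1 * 0.9**5 = 0.059
```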

SLIDE 30

Learning Conjunctions: Analysis

Let p(z) be the probability that, in an example drawn from D, the feature z is absent but the example has a positive label

  • That is, after training is done, p(z) is the probability that, in a randomly drawn example, the literal z causes a mistake
  • For any z in the target function, p(z) = 0

We know that errD(h) ≤ Σ_{z ∈ h} p(z), via direct application of the union bound

30

Union bound: For a set of events, the probability that at least one of them happens is at most the sum of the probabilities of the individual events
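
Spelling out that step (our notation; by Claim 1, every mistake of h is caused by some literal z in h that is absent from a positive example):

```latex
\mathrm{err}_D(h)
  = \Pr_{x \sim D}\bigl[h(x) \neq f(x)\bigr]
  = \Pr_{x \sim D}\bigl[\exists\, z \in h :\ z \text{ absent in } x,\ f(x) = 1\bigr]
  \le \sum_{z \in h} \Pr_{x \sim D}\bigl[z \text{ absent in } x,\ f(x) = 1\bigr]
  = \sum_{z \in h} p(z)
```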

SLIDE 31

Learning Conjunctions: Analysis

  • Call a literal z bad if p(z) ≥ ε/n
  • Intuitively, a bad literal is one that has a significant probability of not appearing with a positive example
    – (And if it survives training because it appears in all positive training examples, it can cause errors)

If there are no bad literals, then errD(h) < ε

  – Why? Because errD(h) ≤ Σ_{z ∈ h} p(z) < n · (ε/n) = ε

31

n = dimensionality

Let us try to see when this will not happen



SLIDE 37

Learning Conjunctions: Analysis

  • Call a literal z bad if p(z) ≥ ε/n
  • Intuitively, a bad literal is one that has a significant probability of not appearing with a positive example
    – (And if it survives training because it appears in all positive training examples, it can cause errors)

What if there are bad literals?

Let z be a bad literal. What is the probability that it will not be eliminated by one training example?

37

n = dimensionality

<(1,1,1,1,1,0,...0,1,1), 1>

There was one example of this kind

SLIDE 38

Learning Conjunctions: Analysis

What we know so far: a bad literal z is eliminated by a single random training example with probability p(z) ≥ ε/n, so the probability that one training example does not eliminate it is at most 1 - ε/n

But say we have m training examples. Then Pr[z survives all m examples] ≤ (1 - ε/n)^m

There are at most n bad literals. So Pr[some bad literal survives all m examples] ≤ n(1 - ε/n)^m

38

n = dimensionality


SLIDE 44

Learning Conjunctions: Analysis

We want this probability to be small. Why? So that we can choose enough training examples to make the probability that any bad literal z survives all of them less than some δ

That is, we want n(1 - ε/n)^m < δ

We know that 1 - x < e^(-x). So it is sufficient to require n · e^(-εm/n) < δ

Or equivalently, m > (n/ε)(ln n + ln(1/δ))

44
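
The algebra behind that last step, written out (our notation):

```latex
n\, e^{-\epsilon m / n} < \delta
\iff \ln n - \frac{\epsilon m}{n} < \ln \delta
\iff \frac{\epsilon m}{n} > \ln n + \ln \frac{1}{\delta}
\iff m > \frac{n}{\epsilon}\left(\ln n + \ln \frac{1}{\delta}\right)
```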

SLIDE 49

Learning Conjunctions: Analysis

To guarantee a probability of failure (i.e., error > ε) that is less than δ, the number of examples we need is m > (n/ε)(ln n + ln(1/δ)). That is, if m has this property, then

  • With probability 1 - δ, no bad literal will survive training (i.e., none will remain in h),
  • Or equivalently, with probability 1 - δ, we will have errD(h) < ε

How to use this:

  • If ε = 0.1 and δ = 0.1, then for n = 100, we need 6908 training examples
  • If ε = 0.1 and δ = 0.1, then for n = 10, we need only 461 examples
  • If ε = 0.1 and δ = 0.01, then for n = 10, we need 691 examples

49

What we have here is a PAC guarantee: our algorithm is Probably Approximately Correct.
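
A short sanity check of the bound and of the numbers above (our own snippet, not from the slides):

```python
import math

def sample_size(n, eps, delta):
    """Smallest integer m that is at least (n/eps) * (ln n + ln(1/delta))."""
    return math.ceil((n / eps) * (math.log(n) + math.log(1 / delta)))

print(sample_size(100, 0.1, 0.1))   # 6908
print(sample_size(10,  0.1, 0.1))   # 461
print(sample_size(10,  0.1, 0.01))  # 691

# With that many examples, the failure probability n * (1 - eps/n)**m is below delta.
n, eps, delta = 100, 0.1, 0.1
m = sample_size(n, eps, delta)
print(n * (1 - eps / n) ** m < delta)   # True
```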