Computational Learning Theory: The Theory of Generalization


SLIDE 1

Machine Learning

Computational Learning Theory: The Theory of Generalization

1

Slides based on material from Dan Roth, Avrim Blum, Tom Mitchell and others

SLIDE 2

Checkpoint: The bigger picture

  • Supervised learning: instances, concepts, and hypotheses
  • Specific learners
    – Decision trees
    – Perceptron
    – Winnow
  • General ML ideas
    – Features as high-dimensional vectors
    – Overfitting
    – Mistake-bound: one way of asking "Can my problem be learned?"

[Diagram: labeled data → learning algorithm → hypothesis/model h; a new example is given to h to produce a prediction]

2


SLIDE 6

Computational Learning Theory

  • The Theory of Generalization
  • Probably Approximately Correct (PAC) learning
  • Positive and negative learnability results
  • Agnostic Learning
  • Shattering and the VC dimension

6

SLIDE 7

This lecture: Computational Learning Theory

  • The Theory of Generalization
    – When can we trust the learning algorithm?
    – Errors of hypotheses
    – Batch learning

  • Probably Approximately Correct (PAC) learning
  • Positive and negative learnability results
  • Agnostic Learning
  • Shattering and the VC dimension

7

SLIDE 8

Computational Learning Theory

Are there general "laws of nature" related to learnability? We want a theory that can relate:

  – Probability of successful learning
  – Number of training examples
  – Complexity of the hypothesis space
  – Accuracy to which the target concept is approximated
  – Manner in which training examples are presented

8

SLIDE 9

Learning Conjunctions

Teacher (Nature) provides the labels (f(x)):

  – <(1,1,1,1,1,1,…,1,1), 1>
  – <(1,1,1,0,0,0,…,0,0), 0>
  – <(1,1,1,1,1,0,...0,1,1), 1>
  – <(1,0,1,1,1,0,...0,1,1), 0>
  – <(1,1,1,1,1,0,...0,0,1), 1>
  – <(1,0,1,0,0,0,...0,1,1), 0>
  – <(1,1,1,1,1,1,…,0,1), 1>
  – <(0,1,0,1,0,0,...0,1,1), 0>

9

Some random source (nature) provides training examples.

Notation: <example, label>

How good is our learning algorithm?


SLIDE 12

Learning Conjunctions

Teacher (Nature) provides the labels (f(x)):

  – <(1,1,1,1,1,1,…,1,1), 1>
  – <(1,1,1,0,0,0,…,0,0), 0>
  – <(1,1,1,1,1,0,...0,1,1), 1>
  – <(1,0,1,1,1,0,...0,1,1), 0>
  – <(1,1,1,1,1,0,...0,0,1), 1>
  – <(1,0,1,0,0,0,...0,1,1), 0>
  – <(1,1,1,1,1,1,…,0,1), 1>
  – <(0,1,0,1,0,0,...0,1,1), 0>

12

Some random source (nature) provides training examples.

Whenever the output is 1, x1 is present. For a reasonable learning algorithm (by elimination), the final hypothesis will be the conjunction of the variables that are active in every positive example.

With the given data, we only learned an approximation to the true concept. Is it good enough?

How good is our learning algorithm?
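The elimination step just described can be written down directly. Below is a minimal sketch (illustrative code, not taken from the slides): start from the conjunction of all variables and drop every variable that is 0 in some positive example; negative examples are never used.

```python
def learn_conjunction(examples, n):
    """Learn a monotone conjunction over n Boolean variables by elimination.

    examples: iterable of (x, label) pairs, where x is a length-n 0/1 sequence
    and label is 1 for positive, 0 for negative examples.
    Returns the set of variable indices kept in the final hypothesis.
    """
    kept = set(range(n))                     # start with the conjunction of all variables
    for x, label in examples:
        if label == 1:                       # elimination only looks at positive examples
            kept -= {i for i in kept if x[i] == 0}
    return kept

def predict(kept, x):
    """The learned hypothesis: predict 1 iff every kept variable is active."""
    return int(all(x[i] == 1 for i in kept))

# Toy run on examples shaped like the ones above (shortened to 6 variables).
examples = [((1, 1, 1, 1, 1, 1), 1), ((1, 1, 1, 0, 0, 0), 0), ((1, 1, 1, 1, 0, 1), 1)]
h = learn_conjunction(examples, n=6)
print(h, predict(h, (1, 1, 1, 1, 0, 1)))    # {0, 1, 2, 3, 5} 1
```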

SLIDE 13

Two Directions

  • Can analyze the probabilistic intuition
    – Never saw x1 = 0 in positive examples; maybe we'll never see it
    – And if we do, it will be with small probability, so the concepts we learn may be pretty good
  • Pretty good: in terms of performance on future data
    – PAC framework
  • Mistake-driven learning algorithms
    – Update your hypothesis only when you make mistakes
    – Define "good" in terms of how many mistakes you make before you stop

13

How good is our learning algorithm?


SLIDE 15

The mistake bound approach

  • The mistake bound model is a theoretical approach
    – We may be able to determine the number of mistakes the learning algorithm can make before converging
  • But no answer to "How many examples do you need before converging to a good hypothesis?"
  • Because the mistake-bound model makes no assumptions about the order or distribution of training examples
    – Both a strength and a weakness of the mistake bound model

15
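As a concrete illustration of the kind of guarantee this model gives (a standard result about the elimination algorithm, added here as an aside rather than taken from these slides): when learning a monotone conjunction over n Boolean variables by elimination,

  – the hypothesis always contains every literal of the target, so it never predicts 1 on a negative example;
  – each mistake, which can therefore only happen on a positive example, removes at least one of the n literals;

hence the total number of mistakes is at most n, regardless of the order in which examples arrive — but this says nothing about how many examples are needed.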

SLIDE 16

PAC learning

  • A model for batch learning
    – Train on a fixed training set
    – Then deploy it in the wild
  • How well will your learning algorithm do on future instances?

16

SLIDE 17

The setup

  • Instance Space: X, the set of examples
  • Concept Space: C, the set of possible target functions: f ∈ C is the hidden target function
    – E.g.: all n-conjunctions; all n-dimensional linear functions, …
  • Hypothesis Space: H, the set of possible hypotheses
    – This is the set that the learning algorithm explores
  • Training instances: S ⊆ X × {−1, 1}: positive and negative examples of the target concept (S is a finite subset of X)
    <x1, f(x1)>, <x2, f(x2)>, …, <xm, f(xm)>
  • What we want: A hypothesis h ∈ H such that h(x) = f(x)
    – A hypothesis h ∈ H such that h(x) = f(x) for all x ∈ S?
    – A hypothesis h ∈ H such that h(x) = f(x) for all x ∈ X?

17


SLIDE 22

The setup

  • Instance Space: X, the set of examples
  • Concept Space: C, the set of possible target functions: f ∈ C is the hidden target function
    – E.g.: all n-conjunctions; all n-dimensional linear functions, …
  • Hypothesis Space: H, the set of possible hypotheses
    – This is the set that the learning algorithm explores
  • Training instances: S ⊆ X × {−1, 1}: positive and negative examples of the target concept (S is a finite subset of X)
    – Training instances are generated by a fixed unknown probability distribution D over X
  • What we want: A hypothesis h ∈ H such that h(x) = f(x)
    – Evaluate h on subsequent examples x ∈ X drawn according to D

22
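To make the setup concrete, here is a small sketch of how such a training set arises (illustrative code with made-up choices for D and f; none of these names come from the slides): a fixed distribution D generates instances, and the hidden target f labels them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                        # number of Boolean features
target_vars = (0, 2)          # hypothetical hidden target f: x1 AND x3

def f(x):
    """Hidden target concept: a monotone conjunction over target_vars."""
    return 1 if all(x[i] == 1 for i in target_vars) else -1

def draw_instance():
    """One instance from a fixed (but, to the learner, unknown) distribution D over X."""
    return rng.binomial(1, 0.8, size=n)      # the distribution need not be uniform

def draw_training_set(m):
    """S: m labeled examples <x, f(x)> drawn i.i.d. from D."""
    return [(x, f(x)) for x in (draw_instance() for _ in range(m))]

S = draw_training_set(5)
for x, y in S:
    print(x, y)
```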


SLIDE 29

Distribution over the instance space

29

Consider a two-dimensional instance space. Not all points in the space are equally likely to exist as instances: for example, not every sequence of words is an email, and not every sequence of letters is a name. That is, there is some probability that a point in the space of instances is an instance (this can also be drawn as a contour plot). We assume that any finite set of examples is drawn i.i.d. from this distribution. We may not know what the distribution is, but we assume one exists and is fixed.

SLIDE 30

PAC Learning – Intuition

The assumption of a fixed distribution is important for two reasons:

  1. It gives us hope that what we learn on the training data will be meaningful on future examples
  2. It also gives a well-defined notion of the error of a hypothesis with respect to the target function

  • "The future will be like the past": We have seen many examples (drawn according to the distribution D)
    – Since in all the positive examples x1 was active, it is very likely that it will be active in future positive examples
    – If not, in any case, x1 is active only in a small percentage of the examples, so our error will be small

30
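One way to make this intuition precise (a sketch of the standard argument, not a statement from the slides): the conjunction learned by elimination only predicts 1 when the target f also does, so every error is a positive example that the hypothesis rejects. The error attributable to keeping x1 is therefore at most

    Pr_{x ~ D}[ f(x) = 1 and x1 = 0 ],

and if no such example appeared in a reasonably large training sample, this probability is likely to be small. The PAC framework quantifies exactly this kind of "likely" and "small".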



SLIDE 36

Error of a hypothesis

Definition: Given a distribution D over examples, the error of a hypothesis h with respect to a target concept f is

    err_D(h) = Pr_{x ~ D}[ h(x) ≠ f(x) ]

36

[Figure: the instance space X; the target concept f labels one region as +ve, a hypothesis h labels another region as +ve, and the error is the region where f and h disagree.]

SLIDE 37

Empirical error

Contrast the true error with the empirical error. For a target concept f, the empirical error of a hypothesis h on a training set S is the fraction of examples x in S for which f and h disagree, that is, h(x) ≠ f(x). It is denoted by err_S(h).

Overfitting: when the empirical error on the training set, err_S(h), is substantially lower than the true error err_D(h).

37
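The two quantities can be computed side by side. A small sketch (illustrative code; D, f, and h below are stand-ins, not definitions from the slides): err_S(h) is measured on the training set, while err_D(h) can only be approximated, for example by evaluating h on a large fresh sample from D. A hypothesis chosen because it fits S well can make err_S(h) much smaller than err_D(h) — exactly the overfitting gap.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10

def f(x):                          # hypothetical target concept
    return int(x[0] == 1 and x[2] == 1)

def h(x):                          # some hypothesis to be evaluated
    return int(x[0] == 1)

def draw(m):                       # i.i.d. instances from a fixed distribution D
    return [rng.binomial(1, 0.8, size=n) for _ in range(m)]

def error(hyp, xs):
    """Fraction of the given examples on which hyp disagrees with the target f."""
    return sum(hyp(x) != f(x) for x in xs) / len(xs)

S = draw(20)                       # training set: empirical error err_S(h)
fresh = draw(100_000)              # large fresh sample: Monte Carlo estimate of err_D(h)
print("err_S(h) ≈", error(h, S))
print("err_D(h) ≈", error(h, fresh))
```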

SLIDE 38

The goal of batch learning

To devise good learning algorithms that avoid overfitting
  – Not fooled by functions that only appear to be good because they explain the training set very well

38

SLIDE 39

Online learning vs. Batch learning

Online learning

  • No assumptions about the distribution of examples
  • Learning is a sequence of trials
    – Learner sees a single example, makes a prediction
    – If mistake, update hypothesis
  • Goal: To bound the total number of mistakes over time

Batch learning

  • Examples are drawn from a fixed (perhaps unknown) probability distribution D over the instance space
  • Learning uses a training set S, drawn i.i.d. from the distribution D
  • Goal: To find a hypothesis that has a low chance of making a mistake on a new example drawn from D

39
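The two protocols can be contrasted schematically (a sketch under the assumption of a generic learner interface; none of these function names come from the slides):

```python
def online_learning(stream, h, update):
    """Online protocol: predict on each example as it arrives; update only on mistakes."""
    mistakes = 0
    for x, y in stream:             # no assumption about where the examples come from
        if h(x) != y:
            mistakes += 1
            h = update(h, x, y)     # mistake-driven update
    return h, mistakes              # goal: bound the total number of mistakes

def batch_learning(S, fit):
    """Batch protocol: fit once on a training set S drawn i.i.d. from D, then deploy."""
    h = fit(S)                      # goal: low err_D(h) on future examples drawn from D
    return h
```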
