CS485/685 Lecture 15: Feb 28, 2012

Probably Approximately Correct Learning [BDSS] Chapter 1


Quick Recap

  • Tom Mitchell (1998): A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

– Experience:
– Task:
– Performance measure:


Performance Measure

  • So far, we measured the performance of algorithms empirically

– Train with a training set and measure performance with a separate test set
– K‐fold cross validation (a minimal sketch follows below):
  • Can reuse the data for training and testing
  • Average performance over multiple splits of the data to improve statistical reliability
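
A minimal sketch of k‐fold cross validation, assuming generic `train` and `accuracy` callables (both hypothetical placeholders for any learner and performance metric):

```python
import random

def k_fold_cv(data, k, train, accuracy):
    """Estimate performance by averaging over k train/test splits."""
    data = data[:]                  # copy so shuffling doesn't mutate the caller's list
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]       # k roughly equal folds
    scores = []
    for i in range(k):
        test = folds[i]                          # hold out fold i for testing
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        h = train(training)                      # fit on the remaining k-1 folds
        scores.append(accuracy(h, test))         # evaluate on the held-out fold
    return sum(scores) / k                       # average for statistical reliability
```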

  • Open questions:

– How much data do we need to learn a task?
– When is a task learnable?


Computational Complexity

  • Computational Complexity: branch of the theory of computation that focuses on classifying computational problems based on their inherent difficulty

– Time complexity
– Space complexity

  • In machine learning, we also consider

– Data complexity (a.k.a. sample complexity)


Computational Complexity

  • Time/space complexity

– How do time/space requirements vary with the size of the input?

  • Data complexity

– How do data requirements (size of the input) vary with the performance level?

  • Problem: we can’t guarantee a performance level because the training data is usually different from the data that the algorithm will encounter in the future

  • Idea: study data requirements as a function of a probabilistic performance level


Formal Model (Supervised Classification)

  • 1. The learner’s input

a. Domain set $\mathcal{X}$ (e.g., possible emails in spam filtering)
b. Label set $\mathcal{Y}$ (e.g., spam, ~spam)

For convenience assume that $\mathcal{Y} = \{0,1\}$ or $\{-1,+1\}$

c. Training data $S = ((x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m))$, a sequence of pairs in $\mathcal{X} \times \mathcal{Y}$

  • 2. The learner’s output:

hypothesis or prediction rule $h : \mathcal{X} \to \mathcal{Y}$, e.g., decision tree, k‐NN rule, linear separator
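
As a reading aid, the objects above can be written as Python type aliases (a sketch; the concrete feature types are illustrative assumptions, not from the slides):

```python
from typing import Callable, List, Tuple

X = Tuple[float, float]          # domain set: e.g., a feature vector per email/papaya
Y = int                          # label set: {0, 1}
Sample = List[Tuple[X, Y]]       # training data S: a sequence of (x, y) pairs
Hypothesis = Callable[[X], Y]    # learner's output h : X -> Y
```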


Formal Model (Supervised Classification)

  • 3. Data generation model: training and testing data is sampled independently and identically (i.i.d.) from an unknown distribution $D$, i.e., $(x_i, y_i) \sim D \;\; \forall i$

  • 4. Performance measure: probability of error

Empirical loss: $L_S(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}[h(x_i) \neq y_i]$

$L_D(h) = \Pr_{(x,y) \sim D}[h(x) \neq y]$: the true loss, but $D$ is unknown
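
The empirical loss translates directly into code; a minimal sketch (the hypothesis `h` and labeled sample `S` are placeholders). No analogous one-liner exists for the true loss, since $D$ is unknown:

```python
def empirical_loss(h, S):
    """L_S(h): fraction of labeled pairs (x, y) in S that h misclassifies."""
    return sum(1 for x, y in S if h(x) != y) / len(S)
```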


Empirical Risk Minimization

  • $L_D(h)$ is unknown, but $L_S(h)$ is known

  • Empirical risk minimization (ERM): find $h_S$ that minimizes $L_S(h)$ (see the sketch below)

  • How good is ERM?

– It can be pretty bad (due to overfitting)
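
A minimal ERM sketch for a finite hypothesis class, reusing the empirical loss from the previous sketch (the threshold class `H` and toy sample `S` are illustrative assumptions, not from the slides):

```python
def empirical_loss(h, S):
    """L_S(h): fraction of pairs in S that h misclassifies."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(H, S):
    """Return a hypothesis in H minimizing the empirical loss L_S(h)."""
    return min(H, key=lambda h: empirical_loss(h, S))

# Illustrative finite class: integer-threshold classifiers on the reals.
H = [lambda x, t=t: 1 if x >= t else 0 for t in range(11)]
S = [(2, 0), (4, 0), (7, 1), (9, 1)]
h_S = erm(H, S)          # picks a threshold consistent with the sample
```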


Papaya example

  • Consider a papaya prediction problem: given a papaya’s features (e.g., its color and softness), predict whether it will taste good


Papaya example

  • Hypothesis $h_S$: if a papaya is identical to a previously tasted papaya, predict the same taste. Otherwise, assume that it tastes bad.

  • Let $h_S(x) = y_i$ if $\exists i$ such that $x = x_i$, and $h_S(x) = 0$ otherwise

  • Then $L_S(h_S) = 0$ but $L_D(h_S)$ can be large

  • This is an example of poor generalization (overfitting)
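
The memorizing hypothesis $h_S$ is easy to write down explicitly; a sketch, with papayas encoded as hypothetical feature tuples:

```python
def memorize(S):
    """h_S: repeat the label of any previously tasted papaya, else predict 'bad' (0)."""
    table = {x: y for x, y in S}
    return lambda x: table.get(x, 0)

S = [((0.6, 0.4), 1), ((0.2, 0.9), 0)]   # illustrative (features, taste) pairs
h_S = memorize(S)
# L_S(h_S) = 0 by construction: every training papaya is looked up exactly.
# On any papaya not in S, h_S blindly predicts 0, so L_D(h_S) can be large.
```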


Generalization

  • How does the accuracy of $h_S$ vary with the amount of data?

– As $|S| \uparrow$, then $|L_D(h_S) - L_S(h_S)| \downarrow$ (simulated in the sketch below)

  • How much data do we need to make sure that the hypothesis $h_S$ found by ERM is not much worse than the best hypothesis $h^*$ most of the time?

$L_D(h_S) \leq L_D(h^*) + \epsilon$ where $h^* = \arg\min_{h \in \mathcal{H}} L_D(h)$
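
A toy simulation of the first point, under assumed conditions (inputs uniform on [0,1], a true threshold at 0.5, and a finite threshold class): the average gap $|L_D(h_S) - L_S(h_S)|$ shrinks as $|S|$ grows:

```python
import random

def gap(m, trials=200):
    """Average |L_D(h_S) - L_S(h_S)| for ERM over a threshold class, sample size m."""
    H = [t / 20 for t in range(21)]           # finite class of 21 thresholds
    label = lambda x: 1 if x >= 0.5 else 0    # true (realizable) labeling rule
    total = 0.0
    for _ in range(trials):
        S = [(x, label(x)) for x in (random.random() for _ in range(m))]
        t_hat = min(H, key=lambda t: sum((x >= t) != y for x, y in S))  # ERM
        L_S = sum((x >= t_hat) != y for x, y in S) / m                  # training loss
        L_D = abs(t_hat - 0.5)                # exact true loss under uniform D
        total += abs(L_D - L_S)
    return total / trials

# The gap decreases with |S|:
print([round(gap(m), 3) for m in (5, 20, 80, 320)])
```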


Assumptions

  • 1. Finite hypothesis class

– Assume $\mathcal{H}$ is finite (and chosen before receiving $S$)

  • 2. Realizable assumption: there exists a perfect hypothesis $h^* \in \mathcal{H}$

– i.e., $\exists h^* \in \mathcal{H}$ such that $L_D(h^*) = 0$
– This implies that for any training set $S$, $L_S(h^*) = 0$
– Since $h^*$ is deterministic, this implies that $y \mid x$ is deterministic

  • 3. i.i.d. assumption:

– Data is sampled independently and identically distributed from $D$


Analysis

  • Find sample size $|S|$ such that $L_D(h_S) \leq \epsilon$

– Here $\epsilon$ is a bound on the true loss

  • Problem: since $S$ is obtained by a random process, $h_S$ and $L_D(h_S)$ are random.

  • Instead: find sample size $|S|$ such that

$\Pr[L_D(h_S) > \epsilon] \leq \delta$

– Here $\delta$ is a bound on the probability that we obtain a sample for which $h_S$ is bad (i.e., $L_D(h_S) > \epsilon$)
– Hence $1 - \delta$ is our confidence in the bound


Bound

Corollary: Let $\mathcal{H}$ be finite, $\delta \in (0,1)$, $\epsilon > 0$ and $m \geq \frac{\log(|\mathcal{H}|/\delta)}{\epsilon}$;

  • then for any $D$ (for which the realizable assumption holds), with probability at least $1 - \delta$ we have that $L_D(h_S) \leq \epsilon$
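
Plugging concrete (illustrative) numbers into the corollary’s sample-size requirement:

```python
from math import ceil, log

def sample_size(H_size, eps, delta):
    """Smallest integer m with m >= log(|H| / delta) / eps."""
    return ceil(log(H_size / delta) / eps)

# e.g. |H| = 2**20 hypotheses, 5% error, 99% confidence:
print(sample_size(2**20, eps=0.05, delta=0.01))   # 370: a few hundred examples suffice
```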


Proof

Proof: we need to show that $\Pr_{S \sim D^m}[L_D(h_S) > \epsilon] \leq \delta$

  • Let $\mathcal{H}_B = \{h \in \mathcal{H} \mid L_D(h) > \epsilon\}$ be the set of bad hypotheses

  • By the realizable assumption, $L_S(h_S) = 0$.

  • This implies that $L_D(h_S) > \epsilon$ can only happen if for some $h \in \mathcal{H}_B$ we have $L_S(h) = 0$.

  • Hence $\{S \mid L_D(h_S) > \epsilon\} \subseteq \{S \mid \exists h \in \mathcal{H}_B,\ L_S(h) = 0\}$

$\Longrightarrow \{S \mid L_D(h_S) > \epsilon\} \subseteq \bigcup_{h \in \mathcal{H}_B} \{S \mid L_S(h) = 0\}$


Proof (continued)

  • Bound the learning failure:

$\Pr[L_D(h_S) > \epsilon] \leq \Pr[\bigcup_{h \in \mathcal{H}_B} \{S \mid L_S(h) = 0\}] \leq \sum_{h \in \mathcal{H}_B} \Pr[L_S(h) = 0]$

by the union bound

  • Union bound: $\Pr[A \cup B] \leq \Pr[A] + \Pr[B]$

Proof (continued)

$\sum_{h \in \mathcal{H}_B} \Pr_{S \sim D^m}[L_S(h) = 0] = \sum_{h \in \mathcal{H}_B} \Pr_{S \sim D^m}[\forall i,\ h(x_i) = y_i]$

$= \sum_{h \in \mathcal{H}_B} \prod_{i=1}^{m} \Pr_{x_i \sim D}[h(x_i) = y_i]$ (i.i.d. assumption)

$= \sum_{h \in \mathcal{H}_B} \prod_{i=1}^{m} (1 - L_D(h)) \leq \sum_{h \in \mathcal{H}_B} (1 - \epsilon)^m \leq |\mathcal{H}| (1 - \epsilon)^m \leq |\mathcal{H}| e^{-\epsilon m} \leq \delta$

since $1 - \epsilon \leq e^{-\epsilon}$ and since $m \geq \log(|\mathcal{H}|/\delta)/\epsilon$
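
A quick numerical check of the final chain of inequalities, with illustrative values of $|\mathcal{H}|$, $\epsilon$, and $\delta$:

```python
from math import ceil, exp, log

H_size, eps, delta = 1000, 0.1, 0.05          # illustrative values
m = ceil(log(H_size / delta) / eps)           # sample size from the corollary (m = 100)
print(H_size * (1 - eps) ** m)                # |H| (1 - eps)^m   ~ 0.027
print(H_size * exp(-eps * m))                 # <= |H| e^{-eps m} ~ 0.045
print(delta)                                  # <= delta = 0.05, as the proof guarantees
```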


Probably Approximately Correct (PAC) Learning

  • Definition: A hypothesis class $\mathcal{H}$ is PAC learnable if for any $\epsilon > 0$, $\delta \in (0,1)$ there exists a function $m_{\mathcal{H}}(\epsilon, \delta)$ and a learning algorithm such that for any distribution $D$ over $\mathcal{X}$ which satisfies the realizability assumption, when running the algorithm on $m \geq m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples it returns $h \in \mathcal{H}$ such that with probability at least $1 - \delta$, $L_D(h) \leq \epsilon$.

  • By Corollary 1, finite hypothesis classes are PAC learnable
