
CS446 Introduction to Machine Learning (Fall 2013)
University of Illinois at Urbana-Champaign
http://courses.engr.illinois.edu/cs446

Prof. Julia Hockenmaier
juliahmr@illinois.edu

LECTURE 15: LEARNING THEORY


Announcements

Midterm grades are available on Compass.
Regrade requests: send us email, and come and see me next Tuesday.


Learning theory questions

– Sample complexity:

How many training examples are needed for a learner to converge (with high probability) to a successful hypothesis?

– Computational complexity:

How much computational effort is required for a learner to converge (with high probability) to a successful hypothesis?

– Mistake bounds:

How many training examples will the learner misclassify before converging to a successful hypothesis?


PAC learning

(Probably Approximately Correct)


Terminology

– The instance space X is the set of all instances x; assume each x is of size n.
– Instances are drawn i.i.d. from an unknown probability distribution D over X: x ~ D.
– A concept c: X → {0,1} is a Boolean function (it identifies a subset of X).
– A concept class C is a set of concepts.
– The hypothesis space H is the (sub)set of Boolean functions considered by the learner L.
– We evaluate L by its performance on new instances drawn i.i.d. from D.


What can a learner learn?

We can’t expect to learn concepts exactly:

– Many concepts may be consistent with the data.
– Unseen examples could have any label.

We can’t expect to always learn close approximations to the target concept:

– Sometimes the data will not be representative

We can only expect to learn with high probability a close approximation to the target concept.


True error of a hypothesis

The true error error_D(h) of hypothesis h with respect to target concept c and distribution D is the probability that h misclassifies an instance drawn at random according to D:

error_D(h) = Pr_{x~D}(c(x) ≠ h(x))
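
Because D is unknown, error_D(h) can never be computed exactly; with sampling access to D, however, it can be estimated. A minimal sketch, not from the slides (the target, hypothesis, and sampler below are illustrative):

```python
import random

def estimate_true_error(h, c, draw_x, m=100_000):
    """Monte Carlo estimate of error_D(h): the fraction of m instances
    x ~ D on which hypothesis h disagrees with the target concept c."""
    errors = 0
    for _ in range(m):
        x = draw_x()              # draw one instance x ~ D
        errors += h(x) != c(x)
    return errors / m

# Toy example: D = uniform on [0, 1), target threshold at 0.5,
# hypothesis threshold at 0.6, so error_D(h) = Pr(0.5 <= x < 0.6) = 0.1.
c = lambda x: x >= 0.5
h = lambda x: x >= 0.6
print(estimate_true_error(h, c, random.random))   # roughly 0.1
```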


PAC learnability

Consider:
– a concept class C over a set of instances X (each x of length n), and
– a learner L that uses hypothesis space H.

C is PAC-learnable by L if, for all c ∈ C and any distribution D over X, L will output, with probability at least (1−δ) and in time polynomial in 1/ε, 1/δ, n and size(c), a hypothesis h ∈ H with error_D(h) ≤ ε (for 0 < δ < 0.5 and 0 < ε < 0.5).



PAC learnability in plain English

– L must, with arbitrarily high probability (1−δ), output a hypothesis h with arbitrarily low error ε.
– L must learn h efficiently

(using a polynomial amount of time per example, and a polynomial number of examples)


Sample complexity (for finite hypothesis spaces and consistent learners)

Consistent learner: returns hypotheses that perfectly fit the training data (whenever possible).


Version space VS_{H,D}

The version space VS_{H,D} is the set of all hypotheses h ∈ H that correctly classify the training data D:

VS_{H,D} = { h ∈ H | ∀⟨x, c(x)⟩ ∈ D: h(x) = c(x) }

Every consistent learner outputs a hypothesis belonging to the version space. We therefore only need to bound the number of examples required to ensure that the version space does not contain any unacceptable hypotheses.
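
When H is small and finite, the definition is directly executable; a brute-force sketch (the threshold class and training data below are illustrative, not from the slides):

```python
def version_space(H, D):
    """All hypotheses in H that agree with the label c(x)
    on every training pair <x, c(x)> in D."""
    return {name: h for name, h in H.items()
            if all(h(x) == y for x, y in D)}

# Finite H: integer threshold functions h_t(x) = 1 iff x >= t, for t = 0..5.
H = {t: (lambda x, t=t: int(x >= t)) for t in range(6)}
D = [(1, 0), (4, 1)]                 # labeled pairs <x, c(x)>
print(sorted(version_space(H, D)))   # thresholds consistent with D: [2, 3, 4]
```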


Sample complexity (finite H)

– The version space VS_{H,D} is said to be ε-exhausted with respect to concept c and distribution D if every h ∈ VS_{H,D} has true error < ε with respect to c and D.
– If H is finite, and the data D is a sequence of m i.i.d. samples of c, then for any 0 ≤ ε ≤ 1, the probability that VS_{H,D} is not ε-exhausted with respect to c is at most |H|e^(−εm).
– Number of training examples required to reduce the probability of failure below δ: find m such that |H|e^(−εm) < δ.
– So a consistent learner needs m ≥ (1/ε)(ln|H| + ln(1/δ)) examples to output, with probability at least 1−δ, a hypothesis with error at most ε.

(often an overestimate; |H| can be very large)
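
The bound is easy to evaluate numerically; a quick sketch (the hypothesis-class size in the example is illustrative):

```python
from math import ceil, log

def sample_complexity(H_size, eps, delta):
    """m >= (1/eps)(ln|H| + ln(1/delta)): number of i.i.d. examples after
    which a consistent learner's hypothesis has true error <= eps with
    probability >= 1 - delta (finite H). math.log is the natural log."""
    return ceil((log(H_size) + log(1 / delta)) / eps)

# Example: conjunctions over n = 10 Boolean variables, |H| = 3^10:
print(sample_complexity(3**10, eps=0.1, delta=0.05))   # 140 examples
```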



PAC learning: intuition

A hypothesis h is bad if its true error is greater than ε:

Pr_{x~D}(h(x) ≠ h*(x)) > ε

A hypothesis h looks good if it is correct on our training set S:

∀s ∈ S: h(s) = h*(s), where |S| = m

We want the probability that a bad hypothesis looks good to be smaller than δ


PAC learning: intuition

We want the probability that a bad hypothesis looks good to be smaller than δ.

– Probability of one bad h getting one x ~ D correct: Pr_D(h(x) = h*(x)) ≤ 1−ε
– Probability of one bad h getting m examples x ~ D correct: ≤ (1−ε)^m
– Probability that any bad h ∈ H gets m examples correct: ≤ |H|(1−ε)^m

(union bound: Pr(A ∨ B) ≤ Pr(A) + Pr(B))

Set |H|(1−ε)^m ≤ δ and solve for m.
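
The step from |H|(1−ε)^m ≤ δ to the earlier bound on m uses the inequality 1−ε ≤ e^(−ε); written out:

```latex
|H|(1-\epsilon)^m \;\le\; |H|\,e^{-\epsilon m} \le \delta
\quad\Longleftrightarrow\quad
-\epsilon m \le \ln\frac{\delta}{|H|}
\quad\Longleftrightarrow\quad
m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)
```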


Vapnik-Chervonenkis (VC) dimension


VC dimension (basic idea)

The VC dimension of a hypothesis space H measures the complexity of H not by the number of distinct hypotheses (|H|), but by the number of distinct instances from X that can be completely discriminated using H.



Shattering a set of instances

A set of instances S is shattered by the hypothesis space H if and only if for every dichotomy of S there is a hypothesis h in H that is consistent with this dichotomy.

(dichotomy: label instances in S as + or -)

The ability of H to shatter S is a measure of its capacity to represent concepts over S


VC dimension of H

The VC dimension of the hypothesis space H, VC(H), is the size of the largest finite subset of the instance space X that can be shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) = ∞.


VC dimension if H is finite

If H is finite: VC(H) ≤ log2 |H|

– Shattering d instances requires 2^d distinct hypotheses.
– So if VC(H) = d, then 2^d ≤ |H|, and hence d = VC(H) ≤ log2 |H|.
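
As a concrete check (an illustrative example, not from the slide): for monotone conjunctions over n Boolean variables, each variable is either included or left out, so |H| = 2^n and the bound gives VC(H) ≤ log2 2^n = n. The bound is tight here: the n instances with exactly one coordinate set to 0 are shattered, since the conjunction of the variables indexing the 0-labeled instances realizes any dichotomy.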


VC Dimension of linear classifiers in 2 dimensions

The VC dimension of a 2-d linear classifier is 3: there is a set of three points whose labels can be assigned arbitrarily, but no set of four points can be shattered (see the brute-force check below). Note that |H| is infinite, but its expressiveness is quite low.
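
A minimal sketch of that brute-force check, assuming scipy is available (each dichotomy is tested for linear separability as an LP feasibility problem; the point sets are illustrative):

```python
import itertools
from scipy.optimize import linprog

def separable(pos, neg):
    """Is there (w1, w2, b) with w.x + b >= 1 on pos and <= -1 on neg?
    Posed as a linear-programming feasibility problem."""
    A_ub, b_ub = [], []
    for (x1, x2) in pos:
        A_ub.append([-x1, -x2, -1.0]); b_ub.append(-1.0)
    for (x1, x2) in neg:
        A_ub.append([x1, x2, 1.0]); b_ub.append(-1.0)
    res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3)   # w1, w2, b unbounded
    return res.success

def shattered(points):
    """True iff every +/- dichotomy of the points is linearly separable."""
    return all(separable([p for p, s in zip(points, signs) if s],
                         [p for p, s in zip(points, signs) if not s])
               for signs in itertools.product([False, True],
                                              repeat=len(points)))

print(shattered([(0, 0), (1, 0), (0, 1)]))          # True: 3 points shattered
print(shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))  # False: XOR dichotomy fails
```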
