SLIDE 1

Computational Learning Theory

  • For which tasks is successful learning possible?
  • Under what conditions is successful learning guaranteed?
  • What is successful learning?
  • Probably approximately correct (PAC) framework
    – Bounds on number of training examples needed
  • Mistake bound framework
    – Bounds on training errors for intermediate hypotheses

SLIDE 2

Problem

  • Given
    – Size or complexity of hypothesis space considered by learner
    – Accuracy to which target concept must be approximated
    – Probability that learner will output successful hypothesis
    – Manner in which training examples presented to learner
  • Find
    – Sample complexity
      ∗ Number of training examples needed for learner to converge (with high probability) to successful hypothesis
    – Computational complexity
      ∗ Amount of computational effort needed for learner to converge (with high probability) to successful hypothesis
    – Mistake bound
      ∗ Number of training examples misclassified by learner before converging to successful hypothesis

SLIDE 3

Problem Details

  • Successful hypothesis
    – Equals target concept
    – Usually agrees with target concept
  • How training examples obtained
    – Helpful teacher (near misses)
    – Learner-generated queries
    – Random sample

SLIDE 4

Probably Learning an Approximately Correct Hypothesis

  • Probably approximately correct (PAC) learning model
  • E.g., boolean-valued concepts from noise-free training data
  • Problem setting

    – X = set of all possible instances
    – C = set of possible target concepts
      ∗ Each c ∈ C corresponds to a boolean-valued function c : X → {0, 1}
      ∗ c(x) = 1 → positive example
      ∗ c(x) = 0 → negative example
    – Instances randomly sampled from X according to probability distribution D
      ∗ D is stationary (does not change over time)
    – Training examples consist of ⟨x, c(x)⟩
      ∗ x randomly drawn from X according to D
    – Learner L considers possible hypotheses from H
    – Learner’s output h evaluated on a test set randomly drawn from X according to D
    – Looking for successful combinations of L, H, and C
    – Worst-case analysis over all possible C and D

SLIDE 5

Error of Hypothesis

[Figure: instance space X, with the shaded regions showing where target concept c and hypothesis h disagree]

  • True error (errorD(h))

    – Of hypothesis h with respect to target concept c and distribution D: the probability that h will misclassify an instance drawn at random according to D
    – errorD(h) = Prx∈D[c(x) ≠ h(x)]

  • D can be any distribution, not necessarily uniform
  • L can only see training examples
  • Training error = fraction of training examples misclassified by h
  • Analysis centers around how well training error estimates true error (both are sketched below)
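
A minimal sketch of both quantities, assuming the target concept and hypotheses are 0/1-valued Python callables and that D is represented by a sampling function (all names here are illustrative, not from the slides):

```python
import random

def true_error(h, c, draw_from_D, n_trials=100_000):
    """Monte Carlo estimate of errorD(h) = Pr over x ~ D that c(x) != h(x)."""
    mistakes = 0
    for _ in range(n_trials):
        x = draw_from_D()
        if h(x) != c(x):
            mistakes += 1
    return mistakes / n_trials

def training_error(h, examples):
    """Fraction of labeled training examples (x, c(x)) misclassified by h."""
    return sum(1 for x, label in examples if h(x) != label) / len(examples)

# Hypothetical example: c = x1 AND x2, h = x1, instances uniform over {0,1}^2.
c = lambda x: int(x[0] and x[1])
h = lambda x: int(x[0])
draw = lambda: (random.randint(0, 1), random.randint(0, 1))
print(true_error(h, c, draw))   # ~0.25: h and c disagree exactly when x1=1, x2=0
```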

SLIDE 6

PAC Learnability

  • What classes of target concepts can be reliably learned with a reasonable amount of time and training examples?
  • Learnability constraints
    – errorD(h) = 0
      ∗ Impossible unless we see the entire X
      ∗ Small chance the training sample is misleading
    – errorD(h) ≤ ε
      ∗ Probability of failure ≤ δ
      ∗ I.e., probably learn an approximately correct hypothesis (PAC)

SLIDE 7

Definition

  • Given concept class C over instances X of length n and learner L using hypothesis space H: C is PAC-learnable by L using H if, for all c ∈ C, all distributions D over X, all ε with 0 < ε < 1/2, and all δ with 0 < δ < 1/2, learner L will with probability at least (1 − δ) output a hypothesis h ∈ H such that errorD(h) ≤ ε, in time polynomial in 1/ε, 1/δ, n, and size(c)

  • n = size of an instance (e.g., number of boolean attributes)
  • size(c) = length of some encoding of elements in C
  • Definition limits number of training examples to be polynomial too

SLIDE 8

Sample Complexity for Finite Hypothesis Spaces

  • Sample complexity

– Number of training examples needed for learner to produce PAC hypothesis

  • Sample complexity for consistent learner

    – Consistent learner
      ∗ Outputs a hypothesis with no errors on the training data (when possible)

  • Bound on sample complexity of ANY consistent learner

    – Recall the version space VSH,D
      ∗ VSH,D = {h ∈ H | ∀⟨x, c(x)⟩ ∈ D, h(x) = c(x)}
    – Every consistent learner outputs some h ∈ VSH,D, for any X, H, and D
    – So bound the number of examples needed for every hypothesis remaining in VSH,D to be acceptable (a sketch of computing VSH,D follows)
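
A minimal sketch of the version space for a finite H, assuming H is a finite iterable of 0/1-valued callables and the training data is a list of (x, c(x)) pairs (names are illustrative):

```python
def version_space(H, training_examples):
    """VS_{H,D}: every hypothesis in the finite space H that agrees with all
    training examples (x, c(x)). Any consistent learner outputs some member."""
    return [h for h in H if all(h(x) == label for x, label in training_examples)]
```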

SLIDE 9

ε-Exhausted Version Space

[Figure: hypothesis space H with version space VSH,D; each hypothesis is labeled with its training error r and its true error (r = training error, error = true error)]

  • Given hypothesis space H, target concept c, instance distribution D, and set of training examples D of c, the version space VSH,D is ε-exhausted with respect to c and D if every hypothesis h ∈ VSH,D has true error less than ε with respect to c and D:
    ∀h ∈ VSH,D, errorD(h) < ε

  • Can bound the probability that VSH,D is ε-exhausted after some number of training examples

SLIDE 10
Thm 7.1: ε-Exhausting the Version Space
  • If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent randomly drawn examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space VSH,D is not ε-exhausted (with respect to c) is ≤ |H|e^(−εm)

  • Proof:

    – Let h1, ..., hk be all hypotheses in H with true error > ε w.r.t. c
    – VSH,D fails to be ε-exhausted only if at least one hi is in VSH,D
      ∗ I.e., hi is consistent with all m training examples
      ∗ Probability of this for a single hi is ≤ (1 − ε)^m
    – Probability that at least one of the hi is in VSH,D is ≤ k(1 − ε)^m
    – Since k ≤ |H|, k(1 − ε)^m ≤ |H|(1 − ε)^m
    – Since (1 − ε) ≤ e^(−ε) for 0 ≤ ε ≤ 1, |H|(1 − ε)^m ≤ |H|e^(−εm) ✷

  • Result:

    – Want |H|e^(−εm) ≤ δ
      ∗ Solving for m gives the sample complexity m ≥ (1/ε)(ln |H| + ln(1/δ))
    – Given this many training examples, any consistent learner will output a hypothesis that is probably approximately correct (a minimal calculator is sketched below)
      ∗ Typically overestimates the sample complexity because of the |H| term
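
A small calculator for this bound, a sketch assuming |H|, ε, and δ are supplied as numbers (the function name is hypothetical):

```python
from math import ceil, log

def sample_complexity_consistent(H_size, eps, delta):
    """m >= (1/eps) * (ln|H| + ln(1/delta)): training-set size sufficient so that,
    with probability at least 1 - delta, every hypothesis still in the version
    space (hence any consistent learner's output) has true error at most eps."""
    return ceil((1.0 / eps) * (log(H_size) + log(1.0 / delta)))
```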

SLIDE 11

Agnostic Learner

  • Finds hypothesis with minimum training error, making no assumption that c ∈ H
  • Pr[errorD(h) > errorS(h) + ε] ≤ e^(−2mε²), where errorS(h) is h’s training error on the sample S
  • Pr[(∃h ∈ H) errorD(h) > errorS(h) + ε] ≤ |H|e^(−2mε²)
  • Letting this probability be δ

    – m ≥ (1/(2ε²))(ln |H| + ln(1/δ))
    – m grows with the square of 1/ε instead of linearly as before (a minimal calculator is sketched below)
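
A corresponding sketch of the agnostic bound, under the same assumptions as the earlier calculator (the function name is hypothetical):

```python
from math import ceil, log

def sample_complexity_agnostic(H_size, eps, delta):
    """m >= (1/(2*eps^2)) * (ln|H| + ln(1/delta)): Hoeffding-based bound for a
    learner that simply outputs the hypothesis with minimum training error."""
    return ceil((log(H_size) + log(1.0 / delta)) / (2.0 * eps ** 2))
```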

SLIDE 12

Example

C = conjunctions of boolean literals (a or ¬a)

  • Is C PAC learnable?

    – Show a polynomial number of training examples suffices for any c ∈ C
    – Design a consistent learner using polynomial time per training example

  • |H| = 3^n for n boolean attributes
    – m ≥ (1/ε)(n ln 3 + ln(1/δ))
    – E.g., n = 10, δ = 0.05, ε = 0.1 ⇒ m ≥ 140
    – E.g., n = 10, δ = 0.01, ε = 0.01 ⇒ m ≥ 1560
    (both figures are checked in the sketch after this list)

  • Algorithm Find-S is a consistent, poly time learner
  • Thus C is PAC-learnable by Find-S with H = C
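
A quick check of the two figures above, plugging the numbers into m ≥ (1/ε)(n ln 3 + ln(1/δ)) directly:

```python
from math import ceil, log

n = 10
m1 = ceil((1 / 0.1)  * (n * log(3) + log(1 / 0.05)))   # delta=0.05, eps=0.1
m2 = ceil((1 / 0.01) * (n * log(3) + log(1 / 0.01)))   # delta=0.01, eps=0.01
print(m1, m2)   # 140 1560
```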

SLIDE 13

How About EnjoySport?

m ≥ (1/ε)(ln |H| + ln(1/δ))

If H is as given in EnjoySport, then |H| = 973, and

m ≥ (1/ε)(ln 973 + ln(1/δ))

To assure with probability 95% that VS contains only hypotheses with errorD(h) ≤ .1, it is sufficient to have m examples, where

m ≥ (1/.1)(ln 973 + ln(1/.05))
m ≥ 10(ln 973 + ln 20)
m ≥ 10(6.88 + 3.00)
m ≥ 98.8

SLIDE 14

PAC-Learnability of Other Concept Classes

  • Unbiased concept class: |C| = 2^|X|
    – E.g., for n boolean attributes, |X| = 2^n
    – If H = C, then |H| = 2^(2^n)
    – m ≥ (1/ε)(2^n ln 2 + ln(1/δ))
      ∗ Exponential in n ⇒ not PAC-learnable
      ∗ Can be proven that m = Θ(2^n)

  • k-term DNF

    – Concepts of the form T1 ∨ T2 ∨ ... ∨ Tk
      ∗ Each Ti is a conjunction of literals over the n boolean attributes
    – |H| = (3^n)^k = 3^(nk)
      ∗ An overestimate: includes cases where Ti = Tj and where Ti >g Tj
    – m ≥ (1/ε)(nk ln 3 + ln(1/δ))
    – However, learning k-term DNF is NP-hard
    – Thus, not PAC-learnable when H = k-term DNF, but ...

  • k-CNF

    – Concepts of the form T1 ∧ ... ∧ Tj for arbitrarily large j
      ∗ Each Ti is a disjunction of at most k literals
    – k-CNF has a polynomial-time learner and polynomial sample complexity
    – Thus H = k-CNF is PAC-learnable
    – Since any k-term DNF can be rewritten as a k-CNF, k-term DNF is PAC-learnable by H = k-CNF

SLIDE 15

Sample Complexity for Infinite Hypothesis Spaces

  • Weakness in above result

    – Weak bound
    – Inapplicable for infinite H

  • Consider second measure of complexity of H (other than |H|)

    – Vapnik-Chervonenkis (VC) dimension of H, VC(H)
    – Gives a tighter bound than the one above
    – Finite for some infinite H’s

SLIDE 16

Shattering a Set of Instances

  • Number of distinct instances from X that can be completely discriminated using H

  • Given sample S from X

    – There are 2^|S| possible dichotomies of S
    – I.e., 2^|S| different ways of assigning (+, −) classes to the members of S
  • H shatters S if every possible dichotomy of S can be expressed by some hypothesis from H

  • Definition

– A set of instances S is shattered by hypothesis space H iff for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy
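
A brute-force sketch of this definition for a finite pool of hypotheses, assuming each hypothesis is a 0/1-valued callable (the helper name shatters is hypothetical):

```python
def shatters(H, S):
    """True iff the finite, iterable hypothesis space H realizes every one of
    the 2^|S| possible dichotomies of the instance set S."""
    dichotomies = {tuple(h(x) for x in S) for h in H}
    return len(dichotomies) == 2 ** len(S)
```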


SLIDE 17

VC Dimension

  • Ability to shatter related to inductive bias
  • Unbiased hypothesis space shatters X
  • What if H can shatter only some large subset of X?

– The larger this subset, the more expressive H is

  • VC dimension measures this expressiveness
  • Definition

    – VC(H), the VC dimension of hypothesis space H defined over instance space X, is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) = ∞

  • For any finite H, VC(H) ≤ lg |H|
    – Shattering d = VC(H) points requires 2^d distinct hypotheses, so 2^d ≤ |H|

SLIDE 18

Example

  • X = points in the x, y plane; H = linear decision surfaces in the x, y plane
    – I.e., H = two-input perceptron
    – Up to 3 points can be shattered
      ∗ Although 3 collinear points cannot be shattered, we only need to find some set of 3 points that can be shattered
    – In general, VC(r-input perceptron) = r + 1


SLIDE 19

More Examples

  • X = ℜ, H = {(a, b) | a < x < b, where a, b ∈ ℜ}
    – Can realize the dichotomies (+,+), (−,−), (+,−), (−,+) on two points, so VC(H) is at least 2
    – Cannot realize (+,−,+) on any three points, so VC(H) = 2
    – Note |H| is infinite, but VC(H) is finite
    (see the sketch after this list)

  • X = instances described by conjunctions of exactly 3 boolean literals
    – H = conjunctions of up to 3 boolean literals
    – VC(H) = 3
    – In general, VC(conjunctions of n boolean literals) = n
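
An illustrative check of the interval example, reusing the shatters helper sketched earlier; the finite grid of candidate endpoints is an assumption made only so the hypothesis pool is finite:

```python
from itertools import product

# H = open intervals (a, b) on the real line, over a grid of candidate endpoints.
endpoints = [i * 0.5 for i in range(-2, 9)]          # -1.0, -0.5, ..., 4.0
H = [(lambda x, a=a, b=b: int(a < x < b))
     for a, b in product(endpoints, repeat=2) if a < b]

print(shatters(H, [1.0, 2.0]))        # True  -> VC(H) >= 2
print(shatters(H, [1.0, 2.0, 3.0]))   # False -> the (+,-,+) dichotomy is impossible
```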

SLIDE 20

Sample Complexity and VC Dimension

  • Upper bound: m ≥ (1/ε)(4 lg(2/δ) + 8 VC(H) lg(13/ε))
  • Thm 7.3 lower bound on sample complexity

    – Given any concept class C such that VC(C) ≥ 2, any learner L, any 0 < ε < 1/8, and any 0 < δ < 1/100, there exists a distribution D and a target concept in C such that if L observes fewer examples than max((1/ε) log(1/δ), (VC(C) − 1)/(32ε)), then with probability at least δ, L outputs a hypothesis h having errorD(h) > ε
  • VC dimension and neural networks

  • V C(networks of r-input perceptrons) ≤ 2(r + 1)s log(es)

– Where s = number of internal nodes

  • Sample complexity for sigmoid-based networks is at least this much (both VC-based bounds above are sketched below)
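
A sketch of both bounds as calculators; the base-2 logarithm in the lower bound is an assumption, since the slide leaves the base of "log" unspecified (function names are hypothetical):

```python
from math import ceil, log2

def vc_upper_bound(vc_dim, eps, delta):
    """Sufficient m: (1/eps) * (4 lg(2/delta) + 8 VC(H) lg(13/eps))."""
    return ceil((1.0 / eps) * (4 * log2(2.0 / delta) + 8 * vc_dim * log2(13.0 / eps)))

def vc_lower_bound(vc_dim, eps, delta):
    """Necessary m (Thm 7.3): max((1/eps) lg(1/delta), (VC(C) - 1) / (32 eps));
    base-2 log assumed here."""
    return max((1.0 / eps) * log2(1.0 / delta), (vc_dim - 1) / (32.0 * eps))
```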

SLIDE 21

Mistake Bound Model of Learning

  • Learner evaluated by the total number of mistakes made until it converges to the correct hypothesis
  • Learner must predict the class of each training example before being given c(x)
    – How many mistakes will the learner make before learning the target concept?

  • Useful for online learning
  • Mistake bound learning studied in different settings

    – Mistakes made until a PAC hypothesis is learned
    – Mistakes made until the exact target concept is found (the setting used here)

SLIDE 22

Mistake Bound for Find-S

  • H = conjunctions of n boolean literals
  • Find-S

    – Initialize h to the most specific hypothesis: l1 ∧ ¬l1 ∧ ... ∧ ln ∧ ¬ln
    – For each positive training instance x
      ∗ Remove from h any literal not satisfied by x
    – Output hypothesis h
    (a runnable sketch appears after this list)

  • Find-S converges to target concept if

    – C ⊆ H
    – Training data is noise free

  • Mistake bound is n + 1

    – The first positive example (always classified negative) eliminates n of the 2n literals
    – Each subsequently misclassified positive example eliminates at least one more literal
    – The (n + 1) worst case occurs when the target concept is (∀x, c(x) = 1)
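
A runnable sketch of Find-S for this hypothesis space, representing a hypothesis as a set of (attribute index, required value) literals; the representation and names are illustrative, not from the slide:

```python
def find_s(examples, n):
    """Find-S for conjunctions of boolean literals over n attributes.
    The initial h holds all 2n literals (the most specific hypothesis); each
    positive example removes every literal it does not satisfy."""
    h = {(i, v) for i in range(n) for v in (0, 1)}
    for x, label in examples:                 # x is a tuple of n values in {0, 1}
        if label == 1:
            h = {(i, v) for (i, v) in h if x[i] == v}
    return h

def predict(h, x):
    """Classify positive iff x satisfies every literal in h."""
    return int(all(x[i] == v for (i, v) in h))
```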

SLIDE 23

Mistake Bound for Halving Algorithm

  • Halving algorithm

– Candidate-Elimination algorithm with majority vote

  • Mistakes until |VS| = 1?
    – lg |H| worst case
      ∗ Each misclassified example reduces |VS| by at least half
      ∗ In fact, the mistake bound is ⌊lg |H|⌋
    – Number of mistakes may also be 0
    (one online step is sketched below)
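
A sketch of a single predict-then-update step of the Halving algorithm, assuming the version space is kept as an explicit list of 0/1-valued hypotheses (names are illustrative):

```python
def halving_step(V, x, label):
    """Predict the majority vote of the current version space V, then eliminate
    every hypothesis that disagrees with the revealed label.
    Returns (prediction, updated version space)."""
    prediction = int(sum(h(x) for h in V) * 2 >= len(V))   # majority vote, ties -> 1
    return prediction, [h for h in V if h(x) == label]
```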

SLIDE 24

Optimal Mistake Bounds

  • Lowest worst case mistake bound over all possible learners
  • For arbitrary concept class C, assuming H = C
  • Let MA(c) = maximum number of mistakes made by algorithm A to exactly learn c, over all possible sequences of the training examples

  • For any nonempty concept class C, MA(C) = maxc∈C MA(c)

    – E.g., MFind−S(C) = n + 1 for C = conjunctions of n boolean literals
    – E.g., MHalving(C) ≤ lg(|C|) for any concept class C

  • Optimal mistake bound Opt(C)

    – Opt(C) = minA∈learning algs MA(C)
    – C is an arbitrary nonempty concept class
    – I.e., Opt(C) is the number of mistakes made for the hardest target concept in C, using the hardest training sequence, by the best algorithm

  • VC(C) ≤ Opt(C) ≤ MHalving(C) ≤ lg |C|
    – All four are equal when C is the power set of a finite X

SLIDE 25

Weighted-Majority Algorithm

  • Predicts using weighted majority vote of a pool of prediction algs
  • Inconsistent hypotheses not eliminated, but reduced in weight

– Can accommodate inconsistent training data

  • Mistake bound ≃ mistake bound of best prediction alg
  • Algorithm [Table 7.1, p. 224]

    – Weights initialized to 1
    – Whenever a predictor makes a mistake, its weight w is reduced to wβ
      ∗ Where 0 ≤ β < 1
      ∗ If β = 0, then WM = Halving algorithm
      ∗ If β > 0, then no predictor is ever completely eliminated
    (a minimal sketch follows)
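
A minimal sketch of Weighted-Majority as described above, assuming the pool is a list of 0/1-valued prediction functions and the training sequence is a stream of (x, c(x)) pairs (names are illustrative):

```python
def weighted_majority(predictors, stream, beta=0.5):
    """Predict by weighted vote of the pool, then multiply the weight of every
    predictor that erred by beta (0 <= beta < 1).
    Yields WM's prediction for each example in the stream."""
    w = [1.0] * len(predictors)
    for x, label in stream:
        vote1 = sum(wi for wi, p in zip(w, predictors) if p(x) == 1)
        vote0 = sum(wi for wi, p in zip(w, predictors) if p(x) == 0)
        yield int(vote1 >= vote0)                       # WM's prediction (ties -> 1)
        w = [wi * beta if p(x) != label else wi         # penalize mistaken predictors
             for wi, p in zip(w, predictors)]
```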

SLIDE 26

Thm 7.5 Relative Mistake Bound for WM

  • Let D be any sequence of training examples, A any set of n prediction algorithms, and k the minimum number of mistakes made by any algorithm in A on the sequence D. Then the number of mistakes over D made by WM using β = 1/2 is at most 2.4(k + lg n)

  • Proof

    – The best algorithm’s final weight will be (1/2)^k
    – The total weight W = Σᵢ wᵢ = n initially
    – Each WM mistake reduces W to ≤ (3/4)W
      ∗ When WM errs, the predictors in the weighted majority hold at least half of the total weight W
      ∗ Each of those weights is reduced by β = 1/2, so
        W_new = Σ_majority wᵢ · (1/2) + Σ_minority wᵢ ≤ W − (W/2)(1/2) = (3/4)W
    – If WM makes M mistakes in total, then the final W ≤ n(3/4)^M
    – The best algorithm’s weight cannot exceed the total weight, so (1/2)^k ≤ n(3/4)^M
    – Solving for M gives M ≤ 2.4(k + lg n)

  • For arbitrary 0 ≤ β < 1,
    mistakes ≤ (k lg(1/β) + lg n) / lg(2/(1 + β))

SLIDE 27

Summary

  • PAC learning framework

    – Sample complexity bounds
    – VC dimension

  • Mistake bound framework

– WM pools alternative prediction algs

  • Online resources for Computational Learning Theory (COLT)

– www.learningtheory.org
