CSCE 478/878 Lecture 3: Computational Learning Theory

Stephen D. Scott
(Adapted from Tom Mitchell's slides)

September 8, 2003



Introduction

• Combines machine learning with:
  – Algorithm design and analysis
  – Computational complexity
• Examines the worst-case minimum and maximum data and time requirements for learning
  – Number of examples needed, number of mistakes made before convergence
• Tries to relate:
  – Probability of successful learning
  – Number of training examples
  – Complexity of hypothesis space
  – Accuracy to which target concept is approximated
  – Manner in which training examples are presented
• Some average-case analyses have been done as well

Outline

• Probably approximately correct (PAC) learning
• Sample complexity
• Agnostic learning
• Vapnik-Chervonenkis (VC) dimension
• Mistake bound model
• Note: as with the previous lecture, we assume no noise, though most of the results can be made to hold in a noisy setting

PAC Learning: The Problem Setting

Given:
• a set of instances X
• a set of hypotheses H
• a set of possible target concepts C (typically, C ⊆ H)
• training instances independently generated by a fixed, unknown, arbitrary probability distribution D over X

The learner observes a sequence D of training examples of the form ⟨x, c(x)⟩ for some target concept c ∈ C:
• instances x are drawn from distribution D
• the teacher provides the target value c(x) for each
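To make the setting concrete, here is a minimal Python sketch (an illustration added here, not from the slides): instances are drawn i.i.d. from a fixed distribution D over X = {0,1}^n, and a hypothetical target conjunction c labels each one. The uniform distribution, the value n = 5, and the particular target are all illustrative assumptions.

    import random

    n = 5  # number of boolean attributes (illustrative choice)

    def draw_instance():
        # Draw x from a fixed distribution D over X = {0,1}^n; uniform here,
        # though PAC learning allows D to be arbitrary (but fixed).
        return tuple(random.randint(0, 1) for _ in range(n))

    def target_c(x):
        # A hypothetical target concept c in C: the conjunction x1 AND (NOT x3).
        return x[0] == 1 and x[2] == 0

    def training_examples(m):
        # The learner observes m examples <x, c(x)> drawn i.i.d. from D.
        return [(x, target_c(x)) for x in (draw_instance() for _ in range(m))]

    print(training_examples(3))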

PAC Learning: The Problem Setting (cont'd)

• Learner must output a hypothesis h ∈ H approximating c ∈ C
• h is evaluated by its performance on subsequent instances drawn according to D
• Note: probabilistic instances, noise-free classifications

True Error of a Hypothesis

[Figure: instance space X, with the + region of c and the + region of h overlapping; the shaded area where c and h disagree is c △ h, the symmetric difference of c and h.]

Definition: The true error (denoted error_D(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D:

    error_D(h) ≡ Pr_{x∈D}[c(x) ≠ h(x)]

(where the example x ∈ X is drawn randomly according to D)

Two Notions of Error

Training error of hypothesis h with respect to target concept c:
• How often h(x) ≠ c(x) over the training instances

True error of hypothesis h with respect to c:
• How often h(x) ≠ c(x) over future random instances

Our concern:
• Can we bound the true error of h given the training error of h?
• First consider the case when the training error of h is zero (i.e., h ∈ VS_{H,D})

PAC Learning

Consider a class C of possible target concepts defined over a set of instances X of size n, and a learner L using hypothesis space H.

Definition: C is PAC-learnable by L using H if for all c ∈ C, distributions D over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will, with probability at least (1 − δ), output a hypothesis h ∈ H such that error_D(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c).
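The gap between the two notions of error can be seen in a small simulation (again an added sketch, with an assumed target and hypothesis): the training error is measured on the observed examples, while the true error error_D(h) is a probability under D, approximated below by Monte Carlo sampling since the learner never observes it directly.

    import random

    n = 5

    def draw_instance():
        # Fixed distribution D over {0,1}^n (uniform, for illustration).
        return tuple(random.randint(0, 1) for _ in range(n))

    def c(x):
        # Hypothetical target concept: x1 AND (NOT x3).
        return x[0] == 1 and x[2] == 0

    def h(x):
        # Hypothetical hypothesis that ignores x3, so it errs when x1=1, x3=1.
        return x[0] == 1

    train = [draw_instance() for _ in range(20)]
    training_error = sum(h(x) != c(x) for x in train) / len(train)

    # Estimate error_D(h) = Pr_{x in D}[c(x) != h(x)] on fresh samples.
    fresh = [draw_instance() for _ in range(100000)]
    true_error = sum(h(x) != c(x) for x in fresh) / len(fresh)

    print(training_error, true_error)  # true_error should be near 0.25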

Exhausting the Version Space

[Figure: hypothesis space H with the version space VS_{H,D} inside; each hypothesis is annotated with its training error r and true error, e.g. error = .3, r = .4. Hypotheses with r = 0 make up VS_{H,D}. (r = training error, error = true error)]

Definition: The version space VS_{H,D} is said to be ε-exhausted with respect to c and D if every hypothesis h ∈ VS_{H,D} has error less than ε with respect to c and D:

    (∀ h ∈ VS_{H,D}) error_D(h) < ε

How many examples m will ε-exhaust the VS?

• Let h_1, ..., h_k ∈ H be all the hypotheses with true error > ε w.r.t. c and D (i.e., the ε-bad hypotheses)
• The VS is not ε-exhausted iff at least one of these hypotheses is consistent with all m examples
• The probability that an ε-bad hypothesis is consistent with one random example is ≤ (1 − ε)
• Since the random draws are independent, the probability that a particular ε-bad hypothesis is consistent with m examples is ≤ (1 − ε)^m
• So the probability that any ε-bad hypothesis is in the VS is ≤ k(1 − ε)^m ≤ |H|(1 − ε)^m
• Given (1 − ε) ≤ e^{−ε} for ε ∈ [0, 1]:  |H|(1 − ε)^m ≤ |H| e^{−mε}

Theorem [Haussler, 1988]: If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space with respect to H and D is not ε-exhausted (with respect to c) is at most |H| e^{−mε}.

This bounds the probability that any consistent learner will output a hypothesis h with error(h) ≥ ε. If we want this probability to be ≤ δ (for PAC):

    |H| e^{−mε} ≤ δ,   i.e.,   m ≥ (1/ε)(ln |H| + ln(1/δ))

Learning Conjunctions of Boolean Literals

How many examples are sufficient to assure, with probability at least (1 − δ), that every h in VS_{H,D} satisfies error_D(h) ≤ ε? Use the theorem:

    m ≥ (1/ε)(ln |H| + ln(1/δ))

Suppose H contains conjunctions of constraints on up to n boolean attributes (i.e., n boolean literals). Then |H| = 3^n (why? each attribute can appear as a positive literal, appear negated, or be absent), so

    m ≥ (1/ε)(ln 3^n + ln(1/δ)) = (1/ε)(n ln 3 + ln(1/δ))

suffices. We still need to find a hypothesis from the VS!
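The bound is easy to evaluate numerically. Below is a small calculator (added here as an illustration, not part of the slides) for the general bound m ≥ (1/ε)(ln |H| + ln(1/δ)) and its specialization to conjunctions of up to n boolean literals, where |H| = 3^n:

    import math

    def sample_complexity(ln_H, eps, delta):
        # Smallest integer m with m >= (1/eps)(ln|H| + ln(1/delta)).
        # Takes ln|H| directly so that huge hypothesis spaces don't overflow.
        return math.ceil((ln_H + math.log(1.0 / delta)) / eps)

    def conjunctions_bound(n, eps, delta):
        # Conjunctions of up to n boolean literals: |H| = 3^n, ln|H| = n ln 3.
        return sample_complexity(n * math.log(3.0), eps, delta)

    print(conjunctions_bound(10, 0.1, 0.05))  # 140: n=10, eps=.1, delta=.05

Note how m grows only linearly in n even though |H| = 3^n is exponential: the dependence on the hypothesis space is only through ln |H|.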

Unbiased Learners

• Recall the unbiased concept class C = 2^X, i.e., the set of all subsets of X
• If each instance x ∈ X is described by n boolean features, then |X| = 2^n, so |C| = 2^{2^n}
• Also, to ensure c ∈ H, we need H = C, so the theorem gives

    m ≥ (1/ε)(2^n ln 2 + ln(1/δ)),

  i.e., exponentially large sample complexity
• Note that the above is only sufficient; the theorem does not give the necessary sample complexity
• (The necessary sample complexity is still exponential)
⇒ Further evidence for the need of bias (as if we needed more)

How About EnjoySport?

    m ≥ (1/ε)(ln |H| + ln(1/δ))

If H is as given in EnjoySport, then |H| = 973, and

    m ≥ (1/ε)(ln 973 + ln(1/δ))

If we want to assure that, with probability 95%, the VS contains only hypotheses with error_D(h) ≤ .1, then it is sufficient to have m examples, where

    m ≥ (1/.1)(ln 973 + ln(1/.05))
    m ≥ 10(ln 973 + ln 20)
    m ≥ 10(6.88 + 3.00)
    m ≥ 98.8

Again, how do we find a consistent hypothesis? (This calculation, and the agnostic bound below, are evaluated in the sketch following the shattering definitions.)

Agnostic Learning

So far we have assumed c ∈ H; the agnostic learning setting drops this assumption.

• What do we want then?
  – The hypothesis h that makes the fewest errors on the training data (i.e., the one that minimizes disagreements, which can be harder than finding a consistent hypothesis)
• What is the sample complexity in this case?

    m ≥ (1/(2ε²))(ln |H| + ln(1/δ)),

  derived from Hoeffding bounds, which bound the probability of a large deviation from an expected value:

    Pr[error_D(h) > error_train(h) + ε] ≤ e^{−2mε²}

  (error_train(h) denotes h's training error on the m examples)

Vapnik-Chervonenkis Dimension: Shattering a Set of Instances

Definition: A dichotomy of a set S is a partition of S into two disjoint subsets, i.e., into a set of + examples and a set of − examples.

Definition: A set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.
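As referenced above, here is a quick numeric check (an added sketch) of the EnjoySport arithmetic and of the cost of dropping the assumption c ∈ H: the agnostic bound replaces the 1/ε factor with 1/(2ε²).

    import math

    def m_realizable(H_size, eps, delta):
        # m >= (1/eps)(ln|H| + ln(1/delta)); assumes some h in H is consistent.
        return (math.log(H_size) + math.log(1 / delta)) / eps

    def m_agnostic(H_size, eps, delta):
        # m >= (1/(2 eps^2))(ln|H| + ln(1/delta)); c need not be in H.
        return (math.log(H_size) + math.log(1 / delta)) / (2 * eps ** 2)

    # EnjoySport: |H| = 973, eps = 0.1, delta = 0.05
    print(m_realizable(973, 0.1, 0.05))  # ~98.8, matching the slide
    print(m_agnostic(973, 0.1, 0.05))    # ~493.8, about 5x more examples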

The Vapnik-Chervonenkis Dimension

Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X, is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) ≡ ∞.

• So to show that VC(H) = d, we must show that there exists some subset X′ ⊂ X of size d that H can shatter, and that there exists no subset of X of size > d that H can shatter
• Note that VC(H) ≤ log₂ |H| (why? shattering d points requires 2^d distinct hypotheses, so 2^{VC(H)} ≤ |H|)

Example: Three Instances Shattered

[Figure: an instance space X with three instances; hypotheses realize all eight dichotomies of the three points.]

Example: Intervals on ℜ

• Let H be the set of closed intervals on the real line (each hypothesis is a single interval), X = ℜ, and a point x ∈ X is positive iff it lies in the target interval

[Figure: (a) two points a and b on the real line, with intervals realizing each of the four dichotomies pos/pos, pos/neg, neg/pos, neg/neg; (b) three points labeled pos, neg, pos, a dichotomy no single interval can realize.]

• We can't shatter the configuration in (b); but first, what is the lower bound on the VCD? We can shatter the 2 points in (a), so VC(H) ≥ 2
• What about the upper bound? For any 3 points, no interval contains the outer two without also containing the middle one, so we can't shatter any 3 points, and VC(H) < 3
• Thus VC(H) = 2 (also note that |H| is infinite); a brute-force check appears in the first sketch below

VCD of Linear Decision Surfaces (Halfspaces)

[The slide's figure did not survive extraction; see the second sketch below.]
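As referenced in the intervals example above, the claim VC(H) = 2 for closed intervals can be checked by brute force (an added sketch, not from the slides). Only the position of an interval's endpoints relative to the points matters, so it suffices to try endpoints slightly below and above each point:

    from itertools import product

    def interval_labels(points, l, u):
        # The dichotomy induced by the interval [l, u]: True = positive.
        return tuple(l <= p <= u for p in points)

    def shattered_by_intervals(points):
        # Every dichotomy of `points` must be realized by some interval.
        eps = 1e-9
        cands = sorted({p - eps for p in points} | {p + eps for p in points})
        realized = {interval_labels(points, l, u)
                    for l in cands for u in cands if l <= u}
        return all(d in realized
                   for d in product([False, True], repeat=len(points)))

    print(shattered_by_intervals([1.0, 2.0]))       # True: so VC(H) >= 2
    print(shattered_by_intervals([1.0, 2.0, 3.0]))  # False: +,-,+ impossible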

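The figure for the final slide did not survive extraction, but its title points to a standard result: halfspaces h(x) = sign(w · x + b) in the plane can shatter three non-collinear points but no set of four, so their VC dimension is 3 (and n + 1 in ℜ^n). The sketch below (an added illustration) verifies the shattering half with a perceptron search, which is guaranteed to converge on every linearly separable dichotomy:

    from itertools import product

    def find_halfspace(points, labels, max_epochs=1000):
        # Perceptron: seek (w, b) with sign(w.x + b) matching the +/-1 labels.
        # Converges whenever the dichotomy is linearly separable.
        w, b = [0.0, 0.0], 0.0
        for _ in range(max_epochs):
            mistakes = 0
            for (x1, x2), y in zip(points, labels):
                if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # wrong or on boundary
                    w[0] += y * x1
                    w[1] += y * x2
                    b += y
                    mistakes += 1
            if mistakes == 0:
                return w, b
        return None

    pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # three non-collinear points
    ok = all(find_halfspace(pts, lab) is not None
             for lab in product([-1, 1], repeat=3))
    print("halfspaces shatter these 3 points:", ok)  # True, so VCD >= 3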