Lecture 15: Learning Theory



CS446 Introduction to Machine Learning (Fall 2013)
University of Illinois at Urbana-Champaign
http://courses.engr.illinois.edu/cs446
LECTURE 15: LEARNING THEORY
Prof. Julia Hockenmaier, juliahmr@illinois.edu

Announcements
– Midterm grades are available on Compass.
– Regrade requests: send us email, and come and see me next Tuesday.

Learning theory questions
– Sample complexity: How many training examples are needed for a learner to converge (with high probability) to a successful hypothesis?
– Computational complexity: How much computational effort is required for a learner to converge (with high probability) to a successful hypothesis?
– Mistake bounds: How many training examples will the learner misclassify before converging to a successful hypothesis?
PAC (Probably Approximately Correct) learning gives a framework for answering these questions.

Terminology
– The instance space X is the set of all instances x. Assume each x is of size n.
– Instances are drawn i.i.d. from an unknown probability distribution D over X: x ~ D.
– A concept c: X → {0,1} is a Boolean function (it identifies a subset of X).
– A concept class C is a set of concepts.
– The hypothesis space H is the (sub)set of Boolean functions considered by the learner L.
– We evaluate L by its performance on new instances drawn i.i.d. from D.

What can a learner learn?
– We can't expect to learn concepts exactly: many concepts may be consistent with the data, and unseen examples could have any label.
– We can't expect to always learn close approximations to the target concept: sometimes the data will not be representative.
– We can only expect to learn, with high probability, a close approximation to the target concept.

True error of a hypothesis
– The true error error_D(h) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D:
  error_D(h) = Pr_{x~D}( c(x) ≠ h(x) )

PAC learnability
– Consider a concept class C over a set of instances X (each x is of length n), and a learner L that uses hypothesis space H.
– C is PAC-learnable by L if, for all c ∈ C and any distribution D over X, L will output, with probability at least 1 − δ and in time polynomial in 1/ε, 1/δ, n and size(c), a hypothesis h ∈ H with error_D(h) ≤ ε (for 0 < δ < 0.5 and 0 < ε < 0.5).
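The true error is defined over the whole distribution D, so in practice it can only be estimated from fresh samples. Below is a minimal Python sketch (not from the slides) that estimates error_D(h) by Monte Carlo sampling; the concept c, hypothesis h, and distribution D used here are made-up illustrations.

```python
# Monte Carlo estimate of error_D(h) = Pr_{x~D}(c(x) != h(x)).
# The target concept c, hypothesis h, and distribution D are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def c(x):
    """Hypothetical target concept: is the point inside the unit circle?"""
    return float(x[0]**2 + x[1]**2 <= 1.0)

def h(x):
    """Hypothetical learned hypothesis: an axis-aligned square approximation."""
    return float(abs(x[0]) <= 0.9 and abs(x[1]) <= 0.9)

def true_error(h, c, n_samples=100_000):
    """Estimate Pr_{x~D}(c(x) != h(x)) with D = uniform on [-1.5, 1.5]^2."""
    xs = rng.uniform(-1.5, 1.5, size=(n_samples, 2))
    disagreements = sum(c(x) != h(x) for x in xs)
    return disagreements / n_samples

print(f"estimated error_D(h) ≈ {true_error(h, c):.3f}")
```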

PAC learnability in plain English
– L must, with arbitrarily high probability 1 − δ, output a hypothesis h with arbitrarily low error ε.
– L must learn h efficiently (using a polynomial amount of time per example, and a polynomial number of examples).

Sample complexity (for finite hypothesis spaces and consistent learners)
– Consistent learner: returns hypotheses that perfectly fit the training data (whenever possible).

Version space VS_{H,D}
– The version space VS_{H,D} is the set of all hypotheses h ∈ H that correctly classify the training data D:
  VS_{H,D} = { h ∈ H | ∀ ⟨x, c(x)⟩ ∈ D: h(x) = c(x) }
– Every consistent learner outputs a hypothesis belonging to the version space.
– We therefore only need to bound the number of examples required to ensure that the version space contains no unacceptable hypotheses.

Sample complexity (finite H)
– The version space VS_{H,D} is ε-exhausted with respect to concept c and distribution D if every h ∈ VS_{H,D} has true error < ε with respect to c and D.
– If H is finite, and the data D is a sequence of m i.i.d. samples of c, then for any 0 ≤ ε ≤ 1, the probability that VS_{H,D} is not ε-exhausted with respect to c is ≤ |H| e^(−εm).
– Number of training examples required to reduce the probability of failure below δ: find m such that |H| e^(−εm) < δ.
– So a consistent learner needs m ≥ (1/ε)(ln|H| + ln(1/δ)) examples to get error below ε with probability at least 1 − δ (often an overestimate; |H| can be very large).
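As a quick numerical illustration of the bound above, here is a minimal Python sketch (not part of the slides) that computes the smallest m satisfying m ≥ (1/ε)(ln|H| + ln(1/δ)); the values of |H|, ε and δ are arbitrary choices.

```python
# Sample-complexity bound for a consistent learner over a finite H:
#   m >= (1/eps) * (ln|H| + ln(1/delta))
import math

def sample_complexity(h_size, eps, delta):
    """Smallest integer m guaranteed by the finite-H bound."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# Example: |H| = 2^10 Boolean hypotheses, eps = 0.1, delta = 0.05
print(sample_complexity(2**10, 0.1, 0.05))  # (ln 1024 + ln 20)/0.1 ≈ 99.3 -> 100
```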

PAC learning: intuition
– A hypothesis h is bad if its true error is > ε: Pr_{x~D}( h(x) ≠ h*(x) ) > ε.
– A hypothesis h looks good if it is correct on our training set S (with |S| = m): ∀ s ∈ S: h(s) = h*(s).
– We want the probability that a bad hypothesis looks good to be smaller than δ.

PAC learning: intuition (continued)
– Probability of one bad h getting one example x ~ D correct: Pr_D( h(x) = h*(x) ) ≤ 1 − ε.
– Probability of one bad h getting m examples x ~ D correct: ≤ (1 − ε)^m.
– Probability that any bad h gets m examples correct: ≤ |H| (1 − ε)^m (union bound: P(A ∨ B) ≤ P(A) + P(B)).
– Set |H| (1 − ε)^m ≤ δ and solve for m.

VC dimension (basic idea)
– The Vapnik-Chervonenkis (VC) dimension of a hypothesis space H measures the complexity of H not by the number of distinct hypotheses (|H|), but by the number of distinct instances from X that can be completely discriminated using H.
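The link to the earlier slide is the inequality 1 − ε ≤ e^(−ε): replacing (1 − ε)^m by e^(−εm) and solving |H| e^(−εm) ≤ δ gives the slightly more conservative m ≥ (1/ε)(ln|H| + ln(1/δ)). A minimal Python sketch (not from the slides; the values of |H|, ε and δ are arbitrary) comparing the two solutions:

```python
# Solve |H|(1 - eps)^m <= delta for m, and compare with the looser
# |H| e^{-eps*m} <= delta bound used on the sample-complexity slide.
import math

H_size, eps, delta = 1000, 0.1, 0.05

# From |H|(1 - eps)^m <= delta:  m >= (ln|H| + ln(1/delta)) / (-ln(1 - eps))
m_exact = math.ceil((math.log(H_size) + math.log(1 / delta)) / -math.log(1 - eps))

# Using (1 - eps) <= e^{-eps}:   m >= (ln|H| + ln(1/delta)) / eps
m_bound = math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

print(m_exact, m_bound)  # the e^{-eps*m} version is slightly more conservative
```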

Shattering a set of instances
– A set of instances S is shattered by the hypothesis space H if and only if for every dichotomy of S there is a hypothesis h in H that is consistent with this dichotomy. (Dichotomy: a labeling of the instances in S as + or −.)
– The ability of H to shatter S is a measure of its capacity to represent concepts over S.

VC dimension of H
– The VC dimension of the hypothesis space H, VC(H), is the size of the largest finite subset of the instance space X that can be shattered by H.
– If arbitrarily large finite subsets of X can be shattered by H, then VC(H) = ∞.

VC dimension if H is finite
– If H is finite: VC(H) ≤ log2 |H|.
– H requires 2^d distinct hypotheses to shatter d instances, so if VC(H) = d, then 2^d ≤ |H|, hence d = VC(H) ≤ log2 |H|.

VC dimension of linear classifiers in 2 dimensions
– The VC dimension of a 2-d linear classifier is 3: three is the largest number of points that can be labeled arbitrarily.
– Note that |H| is infinite here, but the expressiveness of the class is quite low.
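To make the shattering definition concrete, here is a minimal Python sketch (not from the slides; the helper names separable and shattered are made up, and scipy/numpy are assumed available) that brute-forces every dichotomy of a point set and checks linear separability with a small feasibility LP.

```python
# Brute-force shattering check for 2-d linear classifiers:
# 3 points in general position can be shattered; a 4-point (XOR-like) set cannot.
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """Is there (w, b) with labels[i] * (w @ points[i] + b) >= 1 for all i?"""
    A_ub = np.array([-y * np.append(x, 1.0) for x, y in zip(points, labels)])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3)  # feasibility problem, objective 0
    return res.success

def shattered(points):
    """True if every +/- labeling (dichotomy) of the points is linearly separable."""
    return all(separable(points, labels)
               for labels in itertools.product([1, -1], repeat=len(points)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four  = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # XOR layout
print(shattered(three))  # True  -> consistent with VC dimension >= 3
print(shattered(four))   # False -> this 4-point set cannot be shattered
                         # (no 4 points in the plane can be, so VC = 3)
```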
