Lecture 6: − Learning Theory − Probability Review. Aykut Erdem, October 2016, Hacettepe University
Last time… Regularization, Cross-Validation

E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \frac{\lambda}{2} \|w\|^2

where \|w\|^2 \equiv w^T w = w_0^2 + w_1^2 + \cdots + w_M^2, and λ governs the importance of the regularization term compared with the data-fit term.

Polynomial coefficients w⋆ for varying λ:

         ln λ = −∞     ln λ = −18   ln λ = 0
w⋆_0     0.35          0.35         0.13
w⋆_1     232.37        4.74         -0.05
w⋆_2     -5321.83      -0.77        -0.06
w⋆_3     48568.31      -31.97       -0.05
w⋆_4     -231639.30    -3.89        -0.03
w⋆_5     640042.26     55.28        -0.02
w⋆_6     -1061800.52   41.32        -0.01
w⋆_7     1042400.18    -45.95       -0.00
w⋆_8     -557682.99    -91.53       0.00
w⋆_9     125201.43     72.68        0.01

[Figure: NN classifier vs. 5-NN classifier. Figure credit: Fei-Fei Li, Andrej Karpathy, Justin Johnson] 2
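The shrinking coefficients in the table above can be reproduced numerically with the closed-form ridge solution w⋆ = (Φ^T Φ + λI)^{-1} Φ^T t. A minimal sketch; the sin(2πx) target, noise level, and sample size are illustrative assumptions, not the exact setup behind the slide's numbers:

```python
import numpy as np

# Fit a degree-9 polynomial to noisy sin(2*pi*x) data with L2 regularization.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)
Phi = np.vander(x, 10, increasing=True)  # design matrix [1, x, ..., x^9]

def ridge_fit(lam):
    """Closed-form regularized least squares: (Phi^T Phi + lam I)^{-1} Phi^T t."""
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(10), Phi.T @ t)

for ln_lam in [-18, 0]:
    w = ridge_fit(np.exp(ln_lam))
    print(f"ln lambda = {ln_lam:3d}  ||w||^2 = {w @ w:.4f}")
```

As λ grows, the squared norm of the weight vector drops sharply, mirroring the table.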
Today • Learning Theory − Why ML works • Probability Review 3
Learning Theory: Why ML Works 4
Computational Learning Theory • Entire subfield devoted to the mathematical analysis of machine learning algorithms • Has led to several practical methods: − PAC (probably approximately correct) learning → boosting − VC (Vapnik–Chervonenkis) theory → support vector machines • Annual conference: Conference on Learning Theory (COLT) slide by Eric Eaton 5
Computational Learning Theory • Is learning always possible? • How many training examples will I need to do a good job learning? • Is my test performance going to be much worse than my training performance? The key idea that underlies all these answers is that simple functions generalize well. adapted from Hal Daume III 6
The Role of Theory • Theory can serve two roles: − It can justify and help us understand why common practice works (theory after practice). − It can also suggest new algorithms and approaches that later turn out to work well in practice (theory before practice). Often, it turns out to be a mix! adapted from Hal Daume III 7
The Role of Theory • Practitioners discover something that works surprisingly well. • Theorists figure out why it works and prove something about it. − In the process, they make it better or find new algorithms. • Theory can also help you understand what's possible and what's not possible. adapted from Hal Daume III 8
Induction is Impossible • From an algorithmic perspective, a natural question is − whether there is an "ultimate" learning algorithm, A_awesome, that solves the Binary Classification problem. • Have you been wasting your time learning about methods like KNN, Perceptron, and decision trees, when A_awesome is out there? • What would such an ultimate learning algorithm do? − Take in a data set D and produce a function f. − No matter what D looks like, this function f should get perfect classification on all future examples drawn from the same distribution that produced D. adapted from Hal Daume III 9
Induction is Impossible • From an algorithmic perspective, a natural question is − whether there is an "ultimate" learning algorithm, A_awesome, that solves the Binary Classification problem. → Impossible! • Have you been wasting your time learning about methods like KNN, Perceptron, and decision trees, when A_awesome is out there? • What would such an ultimate learning algorithm do? − Take in a data set D and produce a function f. − No matter what D looks like, this function f should get perfect classification on all future examples drawn from the same distribution that produced D. adapted from Hal Daume III 10
Label Noise • Let X = {−1, +1} (i.e., a one-dimensional, binary input space) with distribution D given by: D(⟨+1⟩, +1) = 0.4, D(⟨−1⟩, −1) = 0.4, D(⟨+1⟩, −1) = 0.1, D(⟨−1⟩, +1) = 0.1 − 80% of data points in this distribution have x = y and 20% don't. • No matter what function your learning algorithm produces, there's no way that it can do better than 20% error on this data. − No A_awesome exists that always achieves an error rate of zero. − The best that we can hope is that the error rate is not "too large." adapted from Hal Daume III 11
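The 20% error floor on this slide can be checked by simulation. A minimal sketch that samples from the distribution above and measures the error of the best possible predictor f(x) = x (the sample size and random seed are arbitrary choices):

```python
import numpy as np

# Sample from the noisy distribution on X = {-1, +1}:
# P(+1,+1) = 0.4, P(-1,-1) = 0.4, P(+1,-1) = 0.1, P(-1,+1) = 0.1
rng = np.random.default_rng(42)
pairs = [(+1, +1), (-1, -1), (+1, -1), (-1, +1)]
probs = [0.4, 0.4, 0.1, 0.1]
idx = rng.choice(4, size=100_000, p=probs)
data = np.array(pairs)[idx]
x, y = data[:, 0], data[:, 1]

# The best possible predictor here is f(x) = x; its error rate is
# exactly the label-noise rate of 20%.
err = np.mean(x != y)
print(f"error of f(x) = x: {err:.3f}")
```

No classifier can beat this predictor on expectation, since the noise is irreducible.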
Sampling • Another source of difficulty comes from the fact that the only access we have to the data distribution is through sampling. − When trying to learn about a distribution, you only get to see data points drawn from that distribution. − You know that "eventually" you will see enough data points that your sample is representative of the distribution, but it might not happen immediately. • For instance, even though a fair coin will come up heads only with probability 1/2, it's completely plausible that in a sequence of four coin flips you never see a tails, or perhaps only see one tails. adapted from Hal Daume III 12
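The coin-flip claim can be made exact: the chance that four fair flips contain at most one tails follows directly from the binomial distribution.

```python
from math import comb

# Probability that 4 fair coin flips show at most one tails:
# P(0 tails) + P(1 tails) = (C(4,0) + C(4,1)) / 2^4 = 5/16
p = (comb(4, 0) + comb(4, 1)) / 2**4
print(p)  # 0.3125
```

So nearly a third of the time, a four-flip sample badly misrepresents a fair coin, which is exactly the sampling difficulty the slide describes.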
Induction is Impossible • We need to understand that A_awesome will not always work. − In particular, if we happen to get a lousy sample of data from D, we need to allow A_awesome to do something completely unreasonable. • We cannot hope that A_awesome will do perfectly every time. The best we can reasonably hope of A_awesome is that it will do pretty well, most of the time. adapted from Hal Daume III 13
Probably Approximately Correct (PAC) Learning • A formalism based on the realization that the best we can hope of an algorithm is that − It does a good job most of the time (probably approximately correct) • Consider a hypothetical learning algorithm − We have 10 different binary classification data sets. − For each one, it comes back with functions f1, f2, …, f10. ✦ For some reason, whenever you run f4 on a test point, it crashes your computer. For the other learned functions, their performance on test data is always at most 5% error. ✦ If this situation is guaranteed to happen, then this hypothetical learning algorithm is a PAC learning algorithm. ✤ It satisfies probably because it only failed in one out of ten cases, and it's approximately correct because it achieved low, but non-zero, error on the remainder of the cases. adapted from Hal Daume III 14
PAC Learning Definitions: An algorithm A is an (ε, δ)-PAC learning algorithm if, for all distributions D: given samples from D, the probability that it returns a "bad function" is at most δ, where a "bad" function is one with test error rate more than ε on D. • Two notions of efficiency − Computational complexity: prefer an algorithm that runs quickly to one that takes forever − Sample complexity: the number of examples required for your algorithm to achieve its goals Definition: An algorithm A is an efficient (ε, δ)-PAC learning algorithm if it is an (ε, δ)-PAC learning algorithm whose runtime is polynomial in 1/ε and 1/δ. In other words, if you want your algorithm to achieve 4% error rather than 5%, the runtime required to do so should not go up by an exponential factor! adapted from Hal Daume III 15
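For intuition about sample complexity, a classical PAC bound for finite hypothesis classes (a standard result for consistent learners in the noise-free setting, not stated on this slide) says m ≥ (1/ε)(ln|H| + ln(1/δ)) examples suffice. A minimal sketch evaluating it for Boolean conjunctions, where each of the D variables appears positively, negated, or not at all, so |H| = 3^D:

```python
from math import ceil, log

# Standard PAC sample-complexity bound for a finite hypothesis class H
# (realizable case, consistent learner): m >= (1/eps) * (ln|H| + ln(1/delta))
def pac_sample_bound(log_H_size, eps, delta):
    return ceil((log_H_size + log(1 / delta)) / eps)

# Conjunctions over D = 10 binary variables: ln|H| = D * ln 3
D = 10
m = pac_sample_bound(D * log(3), eps=0.05, delta=0.01)
print(m)  # 312
```

Note the bound is polynomial in 1/ε and 1/δ, matching the definition of an efficient PAC learner above.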
Example: PAC Learning of Conjunctions • Data points are binary vectors, for instance x = ⟨0, 1, 1, 0, 1⟩ • Some Boolean conjunction defines the true labeling of this data (e.g., x1 ∧ x2 ∧ x5) • There is some distribution D_X over binary data points (vectors) x = ⟨x1, x2, …, x_D⟩. • There is a fixed concept conjunction c that we are trying to learn. • There is no noise, so for any example x, its true label is simply y = c(x) • Example (Table 10.1: data set for learning conjunctions):

 y  x1 x2 x3 x4
+1   0  0  1  1
+1   0  1  1  1
−1   1  1  0  1

− Clearly, the true formula cannot include the terms x1, x2, ¬x3, ¬x4 adapted from Hal Daume III 16
Example: PAC Learning of Conjunctions

Algorithm 30: BinaryConjunctionTrain(D)
1: f ← x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ ⋯ ∧ x_D ∧ ¬x_D   // initialize function
2: for all positive examples (x, +1) in D do
3:   for d = 1 … D do                            // "throw out bad terms"
4:     if x_d = 0 then
5:       f ← f without term "x_d"
6:     else
7:       f ← f without term "¬x_d"
8:     end if
9:   end for
10: end for
11: return f

Run on the data set of Table 10.1:
f0(x) = x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ x3 ∧ ¬x3 ∧ x4 ∧ ¬x4
f1(x) = ¬x1 ∧ ¬x2 ∧ x3 ∧ x4
f2(x) = ¬x1 ∧ x3 ∧ x4
f3(x) = ¬x1 ∧ x3 ∧ x4

• After processing an example, the algorithm is guaranteed to classify that example correctly (provided that there is no noise) • Computationally very efficient − Given a data set of N examples in D dimensions, it takes O(ND) time to process the data. This is linear in the size of the data set. adapted from Hal Daume III 17
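The pseudocode on this slide translates directly into code. A minimal sketch of the "throw out bad terms" learner; the literal encoding ('x', d) for x_d and ('not', d) for ¬x_d is a hypothetical representation chosen for illustration:

```python
# Learn a Boolean conjunction from noise-free examples by starting with
# every literal and discarding any literal contradicted by a positive example.

def train_conjunction(data, D):
    # Initialize f = x_1 AND not-x_1 AND ... AND x_D AND not-x_D
    f = {('x', d) for d in range(1, D + 1)} | {('not', d) for d in range(1, D + 1)}
    for x, y in data:
        if y != +1:                     # only positive examples remove terms
            continue
        for d in range(1, D + 1):
            if x[d - 1] == 0:
                f.discard(('x', d))     # x_d = 0 contradicts the term "x_d"
            else:
                f.discard(('not', d))   # x_d = 1 contradicts the term "not x_d"
    return f

def predict(f, x):
    # f labels x positive iff every remaining literal is satisfied
    ok = all(x[d - 1] == (1 if kind == 'x' else 0) for kind, d in f)
    return +1 if ok else -1

# The data set from Table 10.1: ((x1, x2, x3, x4), label)
data = [((0, 0, 1, 1), +1), ((0, 1, 1, 1), +1), ((1, 1, 0, 1), -1)]
f = train_conjunction(data, D=4)
print(sorted(f))  # learned conjunction: not-x1 AND x3 AND x4
```

The two nested loops visit each of the N examples and D dimensions once, giving the O(ND) runtime claimed on the slide.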