

  1. Lecture 6: − Learning Theory − Probability Review Aykut Erdem October 2016 Hacettepe University

  2. Last time… Regularization, Cross-Validation
  • The regularized sum-of-squares error:

     \tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2

    where \|\mathbf{w}\|^2 \equiv \mathbf{w}^{T}\mathbf{w} = w_0^2 + w_1^2 + \dots + w_M^2, and \lambda governs the importance of the regularization term compared to the data (sum-of-squares error) term.
  • Polynomial coefficients w* for three settings of the regularization strength:

                  ln λ = −∞    ln λ = −18    ln λ = 0
     w*_0              0.35          0.35        0.13
     w*_1            232.37          4.74       -0.05
     w*_2          -5321.83         -0.77       -0.06
     w*_3          48568.31        -31.97       -0.05
     w*_4        -231639.30         -3.89       -0.03
     w*_5         640042.26         55.28       -0.02
     w*_6       -1061800.52         41.32       -0.01
     w*_7        1042400.18        -45.95       -0.00
     w*_8        -557682.99        -91.53        0.00
     w*_9         125201.43         72.68        0.01

  [Figure: decision boundaries of an NN classifier vs. a 5-NN classifier. Figure credit: Fei-Fei Li, Andrej Karpathy, Justin Johnson]
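  As a quick numerical check of the formula above, here is a minimal NumPy sketch (function and variable names are my own, not from the lecture) that evaluates the regularized error for a polynomial model:

     import numpy as np

     def regularized_error(w, x, t, lam):
         # E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2 + lam/2 * ||w||^2
         # w holds polynomial coefficients w_0 ... w_M (lowest degree first);
         # x, t are the training inputs and targets; lam is lambda.
         y = np.polyval(w[::-1], x)           # y(x_n, w) = sum_m w_m * x_n**m
         data_term = 0.5 * np.sum((y - t) ** 2)
         reg_term = 0.5 * lam * np.dot(w, w)  # penalizes large coefficients
         return data_term + reg_term

  Increasing lam shrinks the optimal coefficients toward zero, which is exactly the behavior visible across the three columns of the table above.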

  3. Today • Learning Theory − Why ML works • Probability Review

  4. Learning Theory: Why ML Works

  5. Computational Learning Theory
  • Entire subfield devoted to the mathematical analysis of machine learning algorithms
  • Has led to several practical methods:
     − PAC (probably approximately correct) learning → boosting
     − VC (Vapnik–Chervonenkis) theory → support vector machines
  • Annual conference: Conference on Learning Theory (COLT)
  slide by Eric Eaton

  6. Computational Learning Theory
  • Is learning always possible?
  • How many training examples will I need to do a good job learning?
  • Is my test performance going to be much worse than my training performance?
  • The key idea that underlies all these answers is that simple functions generalize well.
  adapted from Hal Daume III

  7. The Role of Theory
  • Theory can serve two roles:
     − It can justify and help understand why common practice works (theory after practice).
     − It can also serve to suggest new algorithms and approaches that turn out to work well in practice (theory before practice).
  • Often, it turns out to be a mix!
  adapted from Hal Daume III

  8. The Role of Theory
  • Practitioners discover something that works surprisingly well.
  • Theorists figure out why it works and prove something about it.
     − In the process, they make it better or find new algorithms.
  • Theory can also help you understand what's possible and what's not possible.
  adapted from Hal Daume III

  9. Induction is Impossible
  • From an algorithmic perspective, a natural question is whether there is an "ultimate" learning algorithm, A_awesome, that solves the Binary Classification problem.
  • Have you been wasting your time learning about KNN, Perceptron, decision trees, and other methods, when A_awesome is out there?
  • What would such an ultimate learning algorithm do?
     − Take in a data set D and produce a function f.
     − No matter what D looks like, this function f should get perfect classification on all future examples drawn from the same distribution that produced D.
  adapted from Hal Daume III

  10. Induction is Impossible
  [Same slide as above, with the answer stamped across it: Impossible.]
  adapted from Hal Daume III

  11. Label Noise
  • Let X = {−1, +1} (i.e., a one-dimensional, binary input space) with data distribution:
     D(⟨+1⟩, +1) = 0.4      D(⟨−1⟩, −1) = 0.4
     D(⟨+1⟩, −1) = 0.1      D(⟨−1⟩, +1) = 0.1
     − 80% of data points in this distribution have x = y and 20% don't.
  • No matter what function your learning algorithm produces, there's no way that it can do better than 20% error on this data (see the sketch below).
     − No A_awesome exists that always achieves an error rate of zero.
     − The best that we can hope is that the error rate is not "too large."
  adapted from Hal Daume III
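  To see where the 20% floor comes from, the following sketch (my own illustration, not from the slides) computes the error of the best possible predictor, which labels each x with its most probable y:

     # Joint distribution D(x, y) from the slide; keys are (x, y) pairs.
     D = {(+1, +1): 0.4, (-1, -1): 0.4, (+1, -1): 0.1, (-1, +1): 0.1}

     # The best predictor's error is the minority-label mass at each x.
     best_error = sum(min(D[(x, +1)], D[(x, -1)]) for x in (+1, -1))
     print(best_error)  # 0.2 -- no classifier can beat 20% error here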

  12. Sampling
  • Another source of difficulty comes from the fact that the only access we have to the data distribution is through sampling.
     − When trying to learn about a distribution, you only get to see data points drawn from that distribution.
     − You know that "eventually" you will see enough data points that your sample is representative of the distribution, but it might not happen immediately.
  • For instance, even though a fair coin will come up heads only with probability 1/2, it's completely plausible that in a sequence of four coin flips you never see a tails, or perhaps only see one tails.
  adapted from Hal Daume III
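  The coin-flip claim is easy to check exactly; this small sketch (my own, not from the lecture) computes the probability of such an unrepresentative four-flip sample:

     from math import comb

     # P(at most one tails in n fair flips) = [C(n, 0) + C(n, 1)] / 2**n
     n = 4
     p = (comb(n, 0) + comb(n, 1)) / 2 ** n
     print(p)  # 0.3125 -- nearly a one-in-three chance

  So even for a distribution as simple as a fair coin, a small sample is quite likely to be misleading.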

  13. Induction is Impossible
  • We need to understand that A_awesome will not always work.
     − In particular, if we happen to get a lousy sample of data from D, we need to allow A_awesome to do something completely unreasonable.
  • We cannot hope that A_awesome will do perfectly, every time.
  • The best we can reasonably hope of A_awesome is that it will do pretty well, most of the time.
  adapted from Hal Daume III

  14. Probably Approximately Correct (PAC) Learning
  • A formalism based on the realization that the best we can hope of an algorithm is that it does a good job most of the time (probably approximately correct).
  • Consider a hypothetical learning algorithm:
     − We have 10 different binary classification data sets.
     − For each one, it comes back with functions f_1, f_2, ..., f_10.
        ✦ For some reason, whenever you run f_4 on a test point, it crashes your computer. For the other learned functions, their performance on test data is always at most 5% error.
        ✦ If this situation is guaranteed to happen, then this hypothetical learning algorithm is a PAC learning algorithm.
           ✤ It satisfies probably because it only failed in one out of ten cases, and it's approximate because it achieved low, but non-zero, error on the remainder of the cases.
  adapted from Hal Daume III

  15. PAC Learning Definitions
  • Definition: An algorithm A is an (ε, δ)-PAC learning algorithm if, for all distributions D: given samples from D, the probability that it returns a "bad function" is at most δ, where a "bad" function is one with a test error rate of more than ε on D.
  • Two notions of efficiency:
     − Computational complexity: Prefer an algorithm that runs quickly to one that takes forever.
     − Sample complexity: The number of examples required for your algorithm to achieve its goals.
  • Definition: An algorithm A is an efficient (ε, δ)-PAC learning algorithm if it is an (ε, δ)-PAC learning algorithm whose runtime is polynomial in 1/ε and 1/δ.
  • In other words, to let your algorithm achieve 4% error rather than 5%, the runtime required to do so should not go up by an exponential factor!
  adapted from Hal Daume III
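  The first definition can be stated compactly; the display below is my own LaTeX formalization of the slide's wording, where A(S) denotes the function the algorithm returns on a training sample S of size N:

     % For every distribution D: with probability at least 1 - delta over
     % the draw of the training sample S ~ D^N, the returned function
     % A(S) has test error at most epsilon.
     \Pr_{S \sim \mathcal{D}^{N}}\left[ \operatorname{err}_{\mathcal{D}}\bigl(A(S)\bigr) > \epsilon \right] \le \delta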

  16. Example: PAC Learning of Conjunctions
  • Data points are binary vectors, for instance x = ⟨0, 1, 1, 0, 1⟩.
  • Some Boolean conjunction defines the true labeling of this data (e.g., x_1 ⋀ x_2 ⋀ x_5).
  • There is some distribution D_X over binary data points (vectors) x = ⟨x_1, x_2, ..., x_D⟩.
  • There is a fixed concept conjunction c that we are trying to learn.
  • There is no noise, so for any example x, its true label is simply y = c(x).
  • Example (Table 10.1: Data set for learning conjunctions):

        y    x_1  x_2  x_3  x_4
       +1     0    0    1    1
       +1     0    1    1    1
       −1     1    1    0    1

     − Clearly, the true formula cannot include the terms x_1, x_2, ¬x_3, ¬x_4.
  adapted from Hal Daume III

  17. Example: PAC Learning Algorithm for Conjunctions

     BinaryConjunctionTrain(D)
      1: f ← x_1 ∧ ¬x_1 ∧ x_2 ∧ ¬x_2 ∧ · · · ∧ x_D ∧ ¬x_D   // initialize function
      2: for all positive examples (x, +1) in D do
      3:    for d = 1 . . . D do
      4:       if x_d = 0 then
      5:          f ← f without term "x_d"
      6:       else
      7:          f ← f without term "¬x_d"
      8:       end if
      9:    end for                                          // "throw out bad terms"
     10: end for
     11: return f

  • Trace on the data set of the previous slide (Table 10.1):
     f_0(x) = x_1 ⋀ ¬x_1 ⋀ x_2 ⋀ ¬x_2 ⋀ x_3 ⋀ ¬x_3 ⋀ x_4 ⋀ ¬x_4
     f_1(x) = ¬x_1 ⋀ ¬x_2 ⋀ x_3 ⋀ x_4
     f_2(x) = ¬x_1 ⋀ x_3 ⋀ x_4
     f_3(x) = ¬x_1 ⋀ x_3 ⋀ x_4
  • After processing an example, it is guaranteed to classify that example correctly (provided that there is no noise).
  • Computationally very efficient:
     − Given a data set of N examples in D dimensions, it takes O(ND) time to process the data. This is linear in the size of the data set.
  adapted from Hal Daume III
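  The pseudocode translates almost line for line into Python; the sketch below is my own rendering (names are mine, not from the lecture), followed by a run on the Table 10.1 data:

     def binary_conjunction_train(data):
         # Learn a Boolean conjunction from noise-free labeled examples.
         # data: list of (x, y) pairs; x is a tuple of 0/1 values, y is +1/-1.
         # A conjunction is a set of literals: d means "x_d", -d means "not x_d".
         dims = len(data[0][0])
         f = {lit for d in range(1, dims + 1) for lit in (d, -d)}  # all 2D literals
         for x, y in data:
             if y != +1:
                 continue  # only positive examples prune terms
             for d in range(1, dims + 1):
                 # "Throw out bad terms": drop the literal contradicted by x.
                 f.discard(d if x[d - 1] == 0 else -d)
         return f

     def conjunction_predict(f, x):
         # x satisfies f iff every remaining literal agrees with it.
         return +1 if all((x[abs(l) - 1] == 1) == (l > 0) for l in f) else -1

     # The three examples from Table 10.1:
     data = [((0, 0, 1, 1), +1), ((0, 1, 1, 1), +1), ((1, 1, 0, 1), -1)]
     f = binary_conjunction_train(data)
     print(sorted(f))                                     # [-1, 3, 4]: not x_1 AND x_3 AND x_4
     print([conjunction_predict(f, x) for x, _ in data])  # [1, 1, -1]

  Each example is scanned once over its D features, matching the O(ND) runtime claimed on the slide.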
