Lecture 6: Learning Theory, Probability Review. Aykut Erdem, October 2016, Hacettepe University. PowerPoint PPT Presentation.


SLIDE 1

Lecture 6:

− Learning Theory
− Probability Review

Aykut Erdem

October 2016 Hacettepe University

SLIDE 2

Last time… Regularization, Cross-Validation

E(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) − t_n}² + (λ/2) ∥w∥²

where ∥w∥² ≡ wᵀw = w_0² + w_1² + · · · + w_M²

λ governs the relative importance of the regularization term compared with the sum-of-squares error term.

Coefficients w⋆ of the fitted polynomial for varying regularization strength:

          ln λ = −∞    ln λ = −18    ln λ = 0
w⋆0            0.35          0.35        0.13
w⋆1          232.37          4.74       −0.05
w⋆2        −5321.83         −0.77       −0.06
w⋆3        48568.31        −31.97       −0.05
w⋆4      −231639.30         −3.89       −0.03
w⋆5       640042.26         55.28       −0.02
w⋆6     −1061800.52         41.32       −0.01
w⋆7      1042400.18        −45.95       −0.00
w⋆8      −557682.99        −91.53        0.00
w⋆9       125201.43         72.68        0.01

[Figure: the data, NN classifier, 5-NN classifier. Figure credit: Fei-Fei Li, Andrej Karpathy, Justin Johnson]

SLIDE 3

Today

  • Learning Theory

− Why ML works

  • Probability Review

SLIDE 4

Learning Theory: 
 Why ML Works

SLIDE 5

Computational Learning 
 Theory

  • Entire subfield devoted to the mathematical analysis of machine learning algorithms

  • Has led to several practical methods:

− PAC (probably approximately correct) learning → boosting
− VC (Vapnik–Chervonenkis) theory → support vector machines


slide by Eric Eaton

Annual conference: Conference on Learning Theory (COLT)

SLIDE 6

Computational Learning Theory

  • Is learning always possible?
  • How many training examples will I need to do a good job learning?
  • Is my test performance going to be much worse than my training performance?

The key idea that underlies all these answers is that simple functions generalize well.

adapted from Hal Daume III

SLIDE 7

The Role of Theory

  • Theory can serve two roles:

− It can justify and help understand why common practice works.

− It can also serve to suggest new algorithms and approaches that turn out to work well in practice.

adapted from Hal Daume III


Often, it turns out to be a mix!

SLIDE 8

The Role of Theory

  • Practitioners discover something that works surprisingly well.
  • Theorists figure out why it works and prove something about it.

− In the process, they make it better or find new algorithms.

  • Theory can also help you understand what’s possible and what’s not possible.

adapted from Hal Daume III

SLIDE 9

Induction is Impossible

  • From an algorithmic perspective, a natural question is

− whether there is an “ultimate” learning algorithm, A^awesome, that solves the Binary Classification problem.

  • Have you been wasting your time learning about KNN and other methods like the Perceptron and decision trees, when A^awesome is out there?
  • What would such an ultimate learning algorithm do?

− Take in a data set D and produce a function f.
− No matter what D looks like, this function f should get perfect classification on all future examples drawn from the same distribution that produced D.

adapted from Hal Daume III

SLIDE 10

Induction is Impossible

(The same slide as before, now with the verdict stamped across it: Impossible.)

adapted from Hal Daume III

SLIDE 11

Label Noise

  • Let X = {−1, +1} (i.e., data points are one-dimensional and binary), with distribution D:

D(⟨+1⟩, +1) = 0.4    D(⟨+1⟩, −1) = 0.1
D(⟨−1⟩, −1) = 0.4    D(⟨−1⟩, +1) = 0.1

− 80% of data points in this distribution have x = y and 20% don’t.

  • No matter what function your learning algorithm produces, there’s no way that it can do better than 20% error on this data.

− No A^awesome exists that always achieves an error rate of zero.
− The best that we can hope is that the error rate is not “too large.”

adapted from Hal Daume III
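The 20% irreducible error can be checked numerically. Here is a minimal simulation (my own illustration, not from the slides) that samples from the distribution D above and scores the best possible predictor, f(x) = x:

```python
import random

# The four (x, y) outcomes of the distribution D above, with probabilities.
outcomes = [(+1, +1, 0.4), (+1, -1, 0.1), (-1, -1, 0.4), (-1, +1, 0.1)]

def sample(rng):
    r, acc = rng.random(), 0.0
    for x, y, p in outcomes:
        acc += p
        if r < acc:
            return x, y
    return outcomes[-1][0], outcomes[-1][1]

rng = random.Random(0)
n = 100_000
# The best possible predictor here is f(x) = x; it errs exactly when the
# label is flipped, which happens with probability 0.2.
err = sum(x != y for x, y in (sample(rng) for _ in range(n))) / n
print(err)  # close to 0.2
```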

SLIDE 12

Sampling

  • Another source of difficulty comes from the fact that the only access we have to the data distribution is through sampling.

− When trying to learn about a distribution, you only get to see data points drawn from that distribution.
− You know that “eventually” you will see enough data points that your sample is representative of the distribution, but it might not happen immediately.

  • For instance, even though a fair coin will come up heads only with probability 1/2, it’s completely plausible that in a sequence of four coin flips you never see a tails, or perhaps only see one tails.

adapted from Hal Daume III
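The four-flips claim is a one-line binomial computation; a quick sketch (mine, not from the slides):

```python
from math import comb

# P(at most one tails in four fair flips) = P(0 tails) + P(1 tails)
# under Bin(k | n=4, theta=0.5); each sequence has probability 0.5**4.
p = sum(comb(4, k) * 0.5**4 for k in (0, 1))
print(p)  # 5/16 = 0.3125
```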

SLIDE 13

Induction is Impossible

  • We need to understand that A^awesome will not always work.

− In particular, if we happen to get a lousy sample of data from D, we need to allow A^awesome to do something completely unreasonable.

  • We cannot hope that A^awesome will do perfectly, every time.

adapted from Hal Daume III

The best we can reasonably hope of Aawesome is that it will do pretty well, most of the time.

SLIDE 14

Probably Approximately Correct 
 (PAC) Learning

  • A formalism based on the realization that the best we can hope of an algorithm is that

− it does a good job most of the time (probably approximately correct).

  • Consider a hypothetical learning algorithm:

− We have 10 different binary classification data sets.
− For each one, it comes back with functions f1, f2, . . . , f10.

✦ For some reason, whenever you run f4 on a test point, it crashes your computer.
✦ For the other learned functions, their performance on test data is always at most 5% error.
✦ If this situation is guaranteed to happen, then this hypothetical learning algorithm is a PAC learning algorithm.

✤ It satisfies probably because it only failed in one out of ten cases, and it’s approximate because it achieved low, but non-zero, error on the remainder of the cases.

adapted from Hal Daume III

SLIDE 15

PAC Learning

  • Two notions of efficiency

− Computational complexity: Prefer an algorithm that runs quickly

to one that takes forever

− Sample complexity: The number of examples required for your

algorithm to achieve its goals

adapted from Hal Daume III

Definition 1: An algorithm A is an (ε, δ)-PAC learning algorithm if, for all distributions D: given samples from D, the probability that it returns a “bad function” is at most δ; where a “bad” function is one with test error rate more than ε on D.

Definition: An algorithm A is an efficient (ε, δ)-PAC learning algorithm if it is an (ε, δ)-PAC learning algorithm whose runtime is polynomial in 1/ε and 1/δ.

In other words, to let your algorithm achieve 4% error rather than 5%, the runtime required to do so should not go up by an exponential factor!

SLIDE 16

Example: PAC Learning of Conjunctions

  • Data points are binary vectors, for instance x = ⟨0, 1, 1, 0, 1⟩.
  • Some Boolean conjunction defines the true labeling of this data (e.g., x1 ⋀ x2 ⋀ x5).
  • There is some distribution D_X over binary data points (vectors) x = ⟨x1, x2, . . . , xD⟩.
  • There is a fixed concept conjunction c that we are trying to learn.
  • There is no noise, so for any example x, its true label is simply y = c(x).
  • Example (see Table 10.1):

− Clearly, the true formula cannot include the terms x1, x2, ¬x3, ¬x4.
adapted from Hal Daume III

y   x1  x2  x3  x4
+1   0   0   1   1
+1   0   1   1   1
−1   1   1   0   1

Table 10.1: Data set for learning conjunctions.

SLIDE 17

Example: PAC Learning of Conjunctions

f0(x) = x1 ⋀ ¬x1 ⋀ x2 ⋀ ¬x2 ⋀ x3 ⋀ ¬x3 ⋀ x4 ⋀ ¬x4
f1(x) = ¬x1 ⋀ ¬x2 ⋀ x3 ⋀ x4
f2(x) = ¬x1 ⋀ x3 ⋀ x4
f3(x) = ¬x1 ⋀ x3 ⋀ x4

  • After processing an example, it is guaranteed to classify that example correctly (provided that there is no noise).
  • Computationally very efficient:

− Given a data set of N examples in D dimensions, it takes O(ND) time to process the data. This is linear in the size of the data set.

Algorithm 30 BinaryConjunctionTrain(D)
1: f ← x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ · · · ∧ xD ∧ ¬xD    // initialize function
2: for all positive examples (x, +1) in D do
3:   for d = 1 . . . D do
4:     if xd = 0 then
5:       f ← f without term “xd”
6:     else
7:       f ← f without term “¬xd”
8:     end if
9:   end for
10: end for
11: return f

adapted from Hal Daume III


“Throw Out Bad Terms”
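Algorithm 30 translates almost line for line into Python. The sketch below is my own rendering (the function names are mine), run on the Table 10.1 data set:

```python
# A direct sketch of Algorithm 30 (BinaryConjunctionTrain).
# A conjunction is a set of literals: d means "x_d", -d means "not x_d"
# (features are numbered from 1).

def binary_conjunction_train(data):
    """data: list of (x, y) pairs; x is a tuple of 0/1, y is +1 or -1."""
    D = len(data[0][0])
    # Start with the most specific conjunction: every literal and its negation.
    f = {lit for d in range(1, D + 1) for lit in (d, -d)}
    for x, y in data:
        if y != +1:
            continue  # only positive examples shrink the conjunction
        for d in range(1, D + 1):
            # Throw out the literal this positive example contradicts.
            f.discard(d if x[d - 1] == 0 else -d)
    return f

def predict(f, x):
    return +1 if all((x[abs(l) - 1] == 1) == (l > 0) for l in f) else -1

# The data set from Table 10.1:
data = [((0, 0, 1, 1), +1), ((0, 1, 1, 1), +1), ((1, 1, 0, 1), -1)]
f = binary_conjunction_train(data)
print(sorted(f))  # [-1, 3, 4], i.e. not x1 AND x3 AND x4
```

This reproduces f3(x) = ¬x1 ⋀ x3 ⋀ x4 from the slide, and the learned conjunction classifies all three training examples correctly.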

SLIDE 18
  • Is this an efficient (ε, δ)-PAC learning algorithm?
  • What about sample complexity?

− How many examples N do you need to see in order to guarantee that it achieves an error rate of at most ε (in all but δ-many cases)?
− Perhaps N has to be gigantic (like 2^(2D)/ε) to (probably) guarantee a small error.

adapted from Hal Daume III


SLIDE 19
  • Prove that the number of samples N required to (probably) achieve a small error is not too big.
  • Sketch of the proof:

− Say there is some term (say ¬x8) that should have been thrown out, but wasn’t.
− If this was the case, then you must not have seen any positive training examples with x8 = 1.
− So examples with x8 = 1 must have low probability (otherwise you would have seen them). So such a thing is not that common.

adapted from Hal Daume III


SLIDE 20

Occam’s Razor

  • Simple solutions generalize well.
  • The hypothesis class H is the set of all Boolean formulae over D-many variables.

− The hypothesis class for Boolean conjunctions is finite; the hypothesis class for linear classifiers is infinite.
− For Occam’s razor, we can only work with finite hypothesis classes.

adapted from Hal Daume III

William of Occam 
 (c. 1288 – c. 1348)

“If one can explain a phenomenon without assuming this or that hypothetical entity, then there is no ground for assuming it; i.e., one should always opt for an explanation in terms of the fewest possible number of causes, factors, or variables.”

Theorem 14 (Occam’s Bound). Suppose A is an algorithm that learns a function f from some finite hypothesis class H. Suppose the learned function always gets zero error on the training data. Then, the sample complexity of f is at most log |H|.
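Occam’s Bound is usually applied in its (ε, δ) form, N ≥ (ln|H| + ln(1/δ))/ε; that form, and the worked numbers below, are my own addition rather than part of the slide. For conjunctions over D Boolean variables, |H| = 3^D, since each variable appears positively, negatively, or not at all:

```python
from math import ceil, log

def occam_samples(log_H, eps, delta):
    # N >= (ln|H| + ln(1/delta)) / eps, rounded up.
    return ceil((log_H + log(1.0 / delta)) / eps)

D = 20               # number of Boolean variables (my example value)
log_H = D * log(3)   # ln|H| for conjunctions: |H| = 3^D
print(occam_samples(log_H, eps=0.05, delta=0.01))  # 532
```

So a few hundred examples suffice here, not the feared exponential 2^(2D)/ε.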

SLIDE 21

Complexity of Infinite 
 Hypothesis Spaces

  • Occam’s Bound is completely useless when |H| = ∞.
  • In our example, instead of representing your hypothesis as a Boolean conjunction, represent it as a conjunction of inequalities.

− Instead of having x1 ∧ ¬x2 ∧ x5, you have [x1 > 0.2] ∧ [x2 < 0.77] ∧ [x5 < π/4].
− In this representation, for each feature, you need to choose an inequality (< or >) and a threshold.
− Since the thresholds can be arbitrary real values, there are now infinitely many possibilities: |H| = 2D×∞ = ∞.

adapted from Hal Daume III

SLIDE 22

Vapnik-Chervonenkis 
 (VC) Dimension

  • A classic measure of the complexity of infinite hypothesis classes, based on this intuition.
  • The VC dimension is a very classification-oriented notion of complexity.

− The idea is to look at a finite set of unlabeled examples and ask: no matter how these points were labeled, would we be able to find a hypothesis that correctly classifies them?

  • The idea is that as you add more points, being able to represent an arbitrary labeling becomes harder and harder.

adapted from Hal Daume III

Definition 2. For data drawn from some space 𝒳, the VC dimension of a hypothesis space H over 𝒳 is the maximal K such that: there exists a set X ⊆ 𝒳 of size |X| = K, such that for all binary labelings of X, there exists a function f ∈ H that matches this labeling.

SLIDE 23

VC Dimension Example

  • The first 3 examples show that the class of lines in the plane can shatter 3 points.
  • However, the last example shows that this class cannot shatter 4 points.
  • Hence the VC dimension of the class of straight lines in the plane is 3.
  • Note that a class of nonlinear curves could shatter four points, and hence has VC dimension greater than 3.

adapted from Trevor Hastie, Robert Tibshirani, Jerome Friedman
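These shattering claims can be checked by brute force. The sketch below (my own, not from the slides) tests every labeling with a perceptron, which is guaranteed to converge whenever the labeling is linearly separable; the XOR-style labelings of four points have no linear separator, so the iteration cap flags them:

```python
import itertools

# Linear classifier sign(w1*x1 + w2*x2 + b), trained by the perceptron rule.
def separable(points, labels, epochs=1000):
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w1 * x1 + w2 * x2 + b) <= 0:
                w1 += y * x1; w2 += y * x2; b += y
                mistakes += 1
        if mistakes == 0:
            return True   # perfect separator found
    return False          # no separator within the cap (non-separable here)

three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
four = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]

shatters3 = all(separable(three, lab)
                for lab in itertools.product((-1, 1), repeat=3))
shatters4 = all(separable(four, lab)
                for lab in itertools.product((-1, 1), repeat=4))
print(shatters3, shatters4)  # True False
```

All 8 labelings of the 3 points are realized by some line, while the 4 corner points fail on the XOR labeling, matching the VC dimension of 3.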

SLIDE 24

Basic Probability
 Review

SLIDE 25

Probability

  • A is a non-deterministic event

− Can think of A as a boolean-valued variable

  • Examples

− A = your next patient has cancer
− A = Rafael Nadal wins French Open 2015

slide by Dhruv Batra

SLIDE 26

Interpreting Probabilities

“If I flip this coin, the probability that it will come up heads is 0.5.”

  • Frequentist interpretation: If we flip this coin many times, it will come up heads about half the time. Probabilities are the expected frequencies of events over repeated trials.
  • Bayesian interpretation: I believe that my next toss of this coin is equally likely to come up heads or tails. Probabilities quantify subjective beliefs about single events.
  • The viewpoints play complementary roles in machine learning:

− The Bayesian view is used to build models based on domain knowledge, and to automatically derive learning algorithms.
− The frequentist view is used to analyze the worst-case behavior of learning algorithms, in the limit of large datasets.

  • From either view, the basic mathematics is the same!

slide by Erik Suddherth

SLIDE 27

The Axioms of Probability

slide by Andrew Moore

SLIDE 28

Axioms of Probability

  • 0 ≤ P(A) ≤ 1
  • P(empty-set) = 0
  • P(everything) = 1
  • P(A or B) = P(A) + P(B) − P(A and B)

slide by Dhruv Batra
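The axioms can be verified mechanically on any finite "worlds" model; here is a tiny sketch (my own illustration, not from the slides) with six equally likely worlds, where an event is a set of worlds and P is normalized counting:

```python
from fractions import Fraction

# Six equally likely "worlds"; events are subsets; P is normalized counting.
worlds = frozenset(range(6))

def P(event):
    return Fraction(len(event & worlds), len(worlds))

A, B = {0, 1, 2}, {2, 3}  # two example events (my own choice)

assert 0 <= P(A) <= 1
assert P(set()) == 0                         # P(empty-set) = 0
assert P(worlds) == 1                        # P(everything) = 1
assert P(A | B) == P(A) + P(B) - P(A & B)    # inclusion-exclusion
print(P(A | B))  # 2/3
```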

SLIDE 29

Interpreting the Axioms

  • Event space of all possible worlds; its area is 1.
  • Worlds in which A is false; worlds in which A is true.
  • P(A) = area of the reddish oval.

slide by Dhruv Batra

SLIDE 30

Interpreting the Axioms

  • The area of A can’t get any smaller than 0, and a zero area would mean no world could ever have A true.

slide by Dhruv Batra

SLIDE 31

Interpreting the Axioms

  • The area of A can’t get any bigger than 1, and an area of 1 would mean all worlds would have A true.

slide by Dhruv Batra

SLIDE 32

Interpreting the Axioms

  • P(A or B) is the total area covered by A and B; the overlap P(A and B) is counted twice, so subtract it once. Simple addition and subtraction:
  • P(A or B) = P(A) + P(B) − P(A and B)

slide by Dhruv Batra

SLIDE 33

Discrete Random Variables

X : a discrete random variable
𝒳 : the sample space of possible outcomes, which may be finite or countably infinite
x ∈ 𝒳 : an outcome (a sample) of the discrete random variable

slide by Erik Suddherth

SLIDE 34

Discrete Random Variables

p(X = x) : probability distribution (probability mass function)
p(x) : shorthand used when no ambiguity

0 ≤ p(x) ≤ 1 for all x ∈ 𝒳
Σ_{x∈𝒳} p(x) = 1

Example, for 𝒳 = {1, 2, 3, 4}: a uniform distribution; a degenerate distribution.

slide by Erik Suddherth
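The two pmf conditions can be checked directly on the 𝒳 = {1, 2, 3, 4} example; the specific uniform and degenerate pmfs below are my own illustration:

```python
from fractions import Fraction

# Two pmfs on the sample space X = {1, 2, 3, 4}.
uniform = {x: Fraction(1, 4) for x in (1, 2, 3, 4)}
degenerate = {x: Fraction(x == 2) for x in (1, 2, 3, 4)}  # all mass on x = 2

for p in (uniform, degenerate):
    assert all(0 <= v <= 1 for v in p.values())  # 0 <= p(x) <= 1
    assert sum(p.values()) == 1                  # sum over X is 1
print(uniform[3], degenerate[3])  # 1/4 0
```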

SLIDE 35

Joint Distribution


slide by Dhruv Batra

SLIDE 36

Marginalization

  • Marginalization

− Events: P(A) = P(A and B) + P(A and not B)
− Random variables: P(X = x) = Σ_y P(X = x, Y = y)

slide by Dhruv Batra

SLIDE 37

Marginal Distributions

p(x, y) = Σ_{z∈𝒵} p(x, y, z)

p(x) = Σ_{y∈𝒴} p(x, y)

slide by Erik Suddherth
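Marginalization is just summing a joint table along one axis; the sketch below uses a made-up joint distribution (my own example values, not from the slides):

```python
from fractions import Fraction as F

# A made-up joint table p(x, y) over weather and temperature.
p_xy = {("sun", "hot"): F(4, 10), ("sun", "cold"): F(1, 10),
        ("rain", "hot"): F(1, 10), ("rain", "cold"): F(4, 10)}

def marginal_x(p_xy):
    p_x = {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, F(0)) + p  # p(x) = sum over y of p(x, y)
    return p_x

p_x = marginal_x(p_xy)
print(p_x["sun"], p_x["rain"])  # 1/2 1/2
```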

SLIDE 38

Conditional Probabilities

  • P(Y = y | X = x)
  • What do you believe about Y = y, if I tell you X = x?
  • P(Rafael Nadal wins French Open 2015)?
  • What if I tell you:

− He has won the French Open 9 of the 10 times he has played there.
− Novak Djokovic is ranked 1, and just won the Australian Open.
− I offered a similar analysis last year and Nadal won.

slide by Dhruv Batra

SLIDE 39

Conditional Probabilities

  • P(A | B) = in worlds where B is true, the fraction where A is also true.
  • Example:

− H: “Have a headache”
− F: “Coming down with flu”
− P(H) = 1/10, P(F) = 1/40, P(H | F) = 1/2

  • Headaches are rare and flu is rarer, but if you’re coming down with flu there’s a 50-50 chance you’ll have a headache.

slide by Dhruv Batra
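The three numbers on the slide determine the reverse conditional as well; deriving P(F | H) from them via Bayes' rule (the derivation is my addition, the inputs are the slide's):

```python
from fractions import Fraction as F

# The quantities given on the slide.
P_H, P_F, P_H_given_F = F(1, 10), F(1, 40), F(1, 2)

# Bayes' rule: P(F | H) = P(H | F) P(F) / P(H).
P_F_given_H = P_H_given_F * P_F / P_H
print(P_F_given_H)  # 1/8
```

So observing a headache raises the probability of flu from 1/40 to 1/8.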

SLIDE 40

Conditional Distributions


slide by Erik Suddherth

SLIDE 41

Independent Random Variables

X ⊥ Y  ⟺  p(x, y) = p(x) p(y) for all x ∈ 𝒳, y ∈ 𝒴

Equivalent conditions on conditional probabilities:
p(x | Y = y) = p(x) for all y ∈ 𝒴 with p(y) > 0
p(y | X = x) = p(y) for all x ∈ 𝒳 with p(x) > 0

slide by Erik Suddherth

SLIDE 42

Bayes Rule (Bayes Theorem)

  • A basic identity that follows from the definition of conditional probability:

p(x, y) = p(x) p(y | x) = p(y) p(x | y)

p(y | x) = p(x, y) / p(x) = p(x | y) p(y) / Σ_{y′∈𝒴} p(y′) p(x | y′) ∝ p(x | y) p(y)

  • Used in ways that have nothing to do with Bayesian statistics!
  • Typical application to learning and data analysis:

− Y : unknown parameters we would like to infer
− X = x : observed data available for learning
− p(y) : prior distribution (domain knowledge)
− p(x | y) : likelihood function (measurement model)
− p(y | x) : posterior distribution (learned information)

slide by Erik Suddherth
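The prior/likelihood/posterior recipe above can be sketched end to end; the two-hypothesis coin example and its numbers are my own illustration:

```python
from fractions import Fraction as F

# y: which coin we hold (two hypotheses); x: an observed heads.
prior = {"fair": F(1, 2), "biased": F(1, 2)}        # p(y)
likelihood = {"fair": F(1, 2), "biased": F(9, 10)}  # p(x = heads | y)

# Bayes' rule: p(y | x) = p(y) p(x | y) / sum over y' of p(y') p(x | y').
evidence = sum(prior[y] * likelihood[y] for y in prior)             # p(x)
posterior = {y: prior[y] * likelihood[y] / evidence for y in prior}
print(posterior["biased"])  # 9/14
```

One observed heads shifts belief from 1/2 to 9/14 in favor of the biased coin.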

SLIDE 43

Binary Random Variables

  • Bernoulli distribution: single toss of a (possibly biased) coin:

Ber(x | θ) = θ^{δ(x,1)} (1 − θ)^{δ(x,0)},   𝒳 = {0, 1},   0 ≤ θ ≤ 1

  • Binomial distribution: toss a single (possibly biased) coin n times, and report the number k of times it comes up heads:

Bin(k | n, θ) = (n choose k) θ^k (1 − θ)^{n−k},   (n choose k) = n! / ((n − k)! k!),   k ∈ {0, 1, 2, . . . , n}
slide by Erik Suddherth
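Both pmfs are one-liners; this sketch implements the formulas as defined above and sanity-checks that each sums to 1 over its sample space (the θ values are my own example choices):

```python
from math import comb

def bernoulli(x, theta):
    # Ber(x | theta) = theta if x = 1, else 1 - theta.
    return theta if x == 1 else 1.0 - theta

def binomial(k, n, theta):
    # Bin(k | n, theta) = C(n, k) theta^k (1 - theta)^(n - k).
    return comb(n, k) * theta**k * (1.0 - theta)**(n - k)

# Each pmf sums to 1 over its sample space.
assert abs(sum(bernoulli(x, 0.3) for x in (0, 1)) - 1.0) < 1e-12
assert abs(sum(binomial(k, 10, 0.25) for k in range(11)) - 1.0) < 1e-12
print(round(binomial(2, 10, 0.25), 4))  # 0.2816, the mode for theta = 0.25
```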

SLIDE 44

Binomial Distributions

[Figure: Binomial pmfs Bin(k | n = 10, θ) for θ = 0.250, θ = 0.500, and θ = 0.900.]
slide by Erik Suddherth

SLIDE 45

Bean Machine (Sir Francis Galton)


http://en.wikipedia.org/wiki/Bean_machine

SLIDE 46

Categorical Random Variables

  • Multinoulli distribution: single roll of a (possibly biased) die, with the outcome written as a one-hot binary vector:

x ∈ 𝒳 = {0, 1}^K,   Σ_{k=1}^{K} x_k = 1
θ = (θ_1, θ_2, . . . , θ_K),   θ_k ≥ 0,   Σ_{k=1}^{K} θ_k = 1

Cat(x | θ) = Π_{k=1}^{K} θ_k^{x_k}

  • Multinomial distribution: roll a single (possibly biased) die n times, and record the number n_k of each possible outcome:

n_k = Σ_{i=1}^{n} x_{ik}

Mu(x | n, θ) = (n choose n_1 . . . n_K) Π_{k=1}^{K} θ_k^{n_k}

slide by Erik Suddherth
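The Cat and Mu formulas can be sketched directly; the 3-sided die and its θ values below are my own example:

```python
from math import factorial, prod

def cat(x, theta):
    # Cat(x | theta) for a one-hot x: product of theta_k ** x_k.
    return prod(t**xk for t, xk in zip(theta, x))

def multinomial_pmf(counts, n, theta):
    # Mu(counts | n, theta): multinomial coefficient times product of
    # theta_k ** n_k, with counts = (n_1, ..., n_K) summing to n.
    coef = factorial(n) // prod(factorial(nk) for nk in counts)
    return coef * prod(t**nk for t, nk in zip(theta, counts))

theta = (0.5, 0.25, 0.25)            # a biased 3-sided die
print(cat((0, 1, 0), theta))         # 0.25, one roll landing on side 2
print(multinomial_pmf((2, 1, 0), 3, theta))  # 3 * 0.5**2 * 0.25 = 0.1875
```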

SLIDE 47

Aligned DNA Sequences


slide by Erik Suddherth

SLIDE 48

Multinomial Model of DNA

[Figure: multinomial model of DNA; x-axis: sequence position (1-15), y-axis: bits (0-2).]

slide by Erik Suddherth