

  1. Learning From Data, Lecture 3: Is Learning Feasible?
     Outside the Data · Probability to the Rescue · Learning vs. Verification · Selection Bias - A Cartoon
     M. Magdon-Ismail, CSCI 4100/6100

  2. Recap: The Perceptron Learning Algorithm
     [Figure: one PLA update step on the Age/Income data, w(t+1) = w(t) + y* x* for a misclassified point (x*, y*), shown for y* = +1 and y* = −1.]
     PLA finds a linear separator in finite time.
     • What if the data is not linearly separable?
     • We want g ≈ f.
       – Separating the data amounts to "memorizing the data": g ≈ f only on D.
       – g ≈ f means we are interested in what happens outside the data.
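A minimal Python sketch of the PLA update described on this slide; the function name, the random choice of misclassified point, and the max_iters cap are illustrative choices, not from the lecture:

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron Learning Algorithm.

    X: (N, d) array of inputs; y: length-N array of labels in {-1, +1}.
    Returns weights w (bias stored in w[0]). If the data is linearly
    separable, PLA terminates with a separating w; otherwise it stops
    after max_iters updates.
    """
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend 1 for the bias term
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iters):
        misclassified = np.where(np.sign(Xb @ w) != y)[0]
        if misclassified.size == 0:              # every point classified correctly
            break
        i = np.random.choice(misclassified)      # pick any misclassified point x*
        w = w + y[i] * Xb[i]                     # PLA update: w(t+1) = w(t) + y* x*
    return w
```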

  3. Outside the Data Set
     [Figure: training examples labeled f = −1 and f = +1, and a new point where f = ?; both ±1 are possible.]

  4. Outside the Data Set
     [Figure: the same examples labeled f = −1 and f = +1, and the new point where f = ?]
     • Did you say f = +1? (f is measuring symmetry.)
     • Did you say f = −1? (f only cares about the top left pixel.)
     Who is correct? We cannot rule out either possibility.

  5. Outside the Data Set
     • An easy visual learning problem just got very messy. For every f that fits the data and is "+1" on the new point, there is one that is "−1". Since f is unknown, it can take on any value outside the data, no matter how large the data set.
     • This is called No Free Lunch (NFL). You cannot know anything for sure about f outside the data without making assumptions.
     • What now? Is there any hope of knowing anything about f outside the data set without making assumptions about f? Yes, if we are willing to give up the "for sure".

  6. Can we infer something outside the data using only D?

  7. Population Mean from Sample Mean: The Bin Model
     • A bin with red and green marbles.
     • Pick a sample of N marbles independently.
     • µ: the probability of picking a red marble (a property of the bin, i.e. outside the data).
     • ν: the fraction of red marbles in the sample (the data set).
     Can we say anything about µ (outside the data) after observing ν (the data)?
     ANSWER: No. It is possible for the sample to be all green marbles while the bin is mostly red.
     Then why do we trust polling (e.g. to predict the outcome of a presidential election)?
     ANSWER: The bad case is possible, but not probable.
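A small simulation of the bin model; the value of µ and the sample sizes are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.6                      # true probability of a red marble (unknown in practice)

for N in (10, 100, 1000):     # sample sizes
    nu = np.mean(rng.random(N) < mu)     # fraction of red marbles in one sample
    print(f"N = {N:4d}: nu = {nu:.3f}")

# The "bad case", an all-green sample from a mostly-red bin, is possible
# but not probable: its probability is (1 - mu)^N, already ~1e-4 at N = 10.
print("P[all-green sample], N = 10:", (1 - mu) ** 10)
```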

  8. Probability to the Rescue: Hoeffding's Inequality
     Hoeffding/Chernoff proved that, most of the time, ν cannot be too far from µ:
     P[ |ν − µ| > ε ] ≤ 2 exp(−2ε²N),   for any ε > 0.   (box it and memorize it)
     Equivalently, P[ |ν − µ| ≤ ε ] ≥ 1 − 2 exp(−2ε²N),   for any ε > 0.
     We get to select any ε we want.
     Newsflash: ν ≈ µ ⟹ µ ≈ ν.
     µ ≈ ν is probably approximately correct (PAC).
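A quick empirical check of the bound; the values of µ, N, and ε are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)

mu, N, eps = 0.6, 100, 0.1
trials = 100_000

# Repeatedly draw samples of size N and record how often |nu - mu| > eps.
nus = rng.binomial(N, mu, size=trials) / N
empirical = np.mean(np.abs(nus - mu) > eps)
bound = 2 * np.exp(-2 * eps**2 * N)

print(f"empirical P[|nu - mu| > eps] = {empirical:.4f}")
print(f"Hoeffding bound              = {bound:.4f}")
```

The empirical failure frequency comes out well below the bound, as Hoeffding guarantees for any µ.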

  9. Probability to the Rescue: Hoeffding's Inequality
     P[ |ν − µ| > ε ] ≤ 2 exp(−2ε²N),   for any ε > 0.
     P[ |ν − µ| ≤ ε ] ≥ 1 − 2 exp(−2ε²N),   for any ε > 0.
     Example: N = 1,000; draw a sample and observe ν.
     • 99% of the time, µ − 0.05 ≤ ν ≤ µ + 0.05   (ε = 0.05).
     • 99.9999996% of the time, µ − 0.10 ≤ ν ≤ µ + 0.10   (ε = 0.10).
     What does this mean? If I repeatedly pick a sample of size 1,000, observe ν, and claim that µ ∈ [ν − 0.05, ν + 0.05] (the error bar is ±0.05), I will be right 99% of the time.
     On any particular sample you may be wrong, but not often.
     We learned something. From ν, we reached outside the data to µ.
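The two confidence levels quoted above follow directly from 1 − 2 exp(−2ε²N) with N = 1,000:

```python
import numpy as np

N = 1000
for eps in (0.05, 0.10):
    confidence = 1 - 2 * np.exp(-2 * eps**2 * N)
    print(f"eps = {eps:.2f}: confidence >= {confidence:.9f}")

# eps = 0.05: confidence >= 0.986524106   (roughly the 99% quoted on the slide)
# eps = 0.10: confidence >= 0.999999996
```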

  10. How Did Probability Rescue Us?
     • The key ingredient: the samples must be independent. If the sample is constructed in some arbitrary fashion, then indeed we cannot say anything. Even with independence, ν can take on arbitrary values; but some values are far more likely than others. This is what allows us to learn something: it is likely that ν ≈ µ.
     • The bound 2 exp(−2ε²N) does not depend on µ or on the size of the bin. The bin can be infinite. It is great that the bound does not depend on µ, because µ is unknown; and we mean unknown.
     • The key player in the bound 2 exp(−2ε²N) is N. As N → ∞, µ ≈ ν with very, very high probability, but not for sure. Can you live with a 10⁻¹⁰⁰ probability of error?
     We should probably have said "independence to the rescue".
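Since N is the key player, it is useful to invert the bound: requiring 2 exp(−2ε²N) ≤ δ gives N ≥ ln(2/δ) / (2ε²). A small sketch (the particular ε and δ values are illustrative):

```python
import numpy as np

def samples_needed(eps, delta):
    """Smallest N for which the Hoeffding bound 2*exp(-2*eps^2*N) is at most delta."""
    return int(np.ceil(np.log(2 / delta) / (2 * eps**2)))

print(samples_needed(eps=0.05, delta=0.01))    # ~1060 samples for a +/-0.05 error bar at 99%
print(samples_needed(eps=0.05, delta=1e-100))  # ~46,191 samples even for 1e-100 failure probability
```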

  11. Relating the Bin to Learning
     [Figure: the target function f and a fixed hypothesis h, each shown as a ±1 region on the Age/Income plane; f is UNKNOWN, h is KNOWN.]
     In learning, the unknown is an entire function f; in the bin, it was a single number µ.

  12. Relating the Bin to Learning - The Error Function
     [Figure: f and the fixed h overlaid on the Age/Income plane; green where h(x) = f(x), red where h(x) ≠ f(x).]
     E(h) = P_x[ h(x) ≠ f(x) ]   (the "size" of the red region, measured using the input distribution P(x)). UNKNOWN.

  13. Relating the Bin to Learning - The Error Function
     [Figure: the input space X is the bin; a green "marble" is an x with h(x) = f(x), a red "marble" is an x with h(x) ≠ f(x).]
     E_out(h) = P_x[ h(x) ≠ f(x) ]   (the out-of-sample error). UNKNOWN.
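E_out(h) is unknown in a real problem because f and P(x) are unknown, but the definition can be illustrated with a synthetic target. The f, h, and P(x) below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def f(X):               # synthetic "target" (unknown in a real learning problem)
    return np.sign(X[:, 0] + X[:, 1] - 1.0)

def h(X):               # a fixed hypothesis
    return np.sign(X[:, 0] - 0.5)

# P(x): here, uniform on the unit square (an arbitrary choice for illustration).
X = rng.random((1_000_000, 2))
E_out = np.mean(h(X) != f(X))       # fraction of the "red region" under P(x)
print(f"E_out(h) ~ {E_out:.3f}")
```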

  14. Relating the Bin to Learning - The Data
     [Figure: the target f, the fixed hypothesis h, and the data set plotted on the Age/Income plane.]

  15. Relating the Bin to Learning - The Data
     [Figure: the data points colored by agreement with h.]
     green data: h(x_n) = f(x_n);   red data: h(x_n) ≠ f(x_n).
     E_in(h) = fraction of red (misclassified) data points   (the in-sample error). KNOWN!
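E_in(h), by contrast, can always be computed from the data. A self-contained sketch, with the same kind of invented f, h, and P(x) as above:

```python
import numpy as np

rng = np.random.default_rng(3)

f = lambda X: np.sign(X[:, 0] + X[:, 1] - 1.0)   # synthetic target (unknown in practice)
h = lambda X: np.sign(X[:, 0] - 0.5)             # the fixed hypothesis

X_D = rng.random((100, 2))        # a data set D of N = 100 points drawn from P(x)
y_D = f(X_D)                      # labels provided with the data
E_in = np.mean(h(X_D) != y_D)     # fraction of "red" (misclassified) points -- fully known
print(f"E_in(h) = {E_in:.2f}")
```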

  16. Relating the Bin to Learning
     (f and P(x) are unknown; h is fixed.)
     Learning                                            Bin Model
     input space X                                       the bin
     x for which h(x) = f(x)                             a green marble
     x for which h(x) ≠ f(x)                             a red marble
     drawing x from P(x)                                 randomly picking a marble
     data set D                                          a sample of N marbles
     E_out(h) = P_x[ h(x) ≠ f(x) ]                       µ = probability of picking a red marble
     E_in(h) = (1/N) Σ_{n=1}^{N} ⟦h(x_n) ≠ f(x_n)⟧       ν = fraction of red marbles in the sample

  17. Hoeffding Says That E_in(h) ≈ E_out(h)
     P[ |E_in(h) − E_out(h)| > ε ] ≤ 2 exp(−2ε²N),   for any ε > 0.
     P[ |E_in(h) − E_out(h)| ≤ ε ] ≥ 1 − 2 exp(−2ε²N),   for any ε > 0.
     E_in is random, but known; E_out is fixed, but unknown.
     • If E_in ≈ 0, then E_out ≈ 0 (with high probability), i.e. P_x[ h(x) ≠ f(x) ] ≈ 0. We have learned something about the entire f: f ≈ h over X (outside D).
     • If E_in ≫ 0, we are out of luck. But we have still learned something about the entire f: f ≉ h; it is just not very useful.
     Questions: Suppose that E_in ≈ 1; have we learned something about the entire f that is useful? What is the worst E_in for inferring about f?
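A small experiment, again with an invented f, h, and P(x), showing E_in concentrating around E_out for a fixed h, with the failure frequency below the Hoeffding bound:

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda X: np.sign(X[:, 0] + X[:, 1] - 1.0)   # synthetic target
h = lambda X: np.sign(X[:, 0] - 0.5)             # fixed hypothesis

N, eps, trials = 200, 0.05, 10_000

# A large sample stands in for the whole input space to estimate E_out.
X_big = rng.random((1_000_000, 2))
E_out = np.mean(h(X_big) != f(X_big))

# Draw many independent data sets of size N and compute E_in on each.
E_ins = np.array([np.mean(h(X) != f(X))
                  for X in rng.random((trials, N, 2))])

print(f"estimated P[|E_in - E_out| > eps] = {np.mean(np.abs(E_ins - E_out) > eps):.4f}")
print(f"Hoeffding bound                   = {2 * np.exp(-2 * eps**2 * N):.4f}")
```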

  18. That's Verification, not Real Learning
     The entire previous argument assumed a FIXED h, and then came the data.
     • Given h ∈ H, a sample can verify whether or not it is good (w.r.t. f): if E_in is small, h is good with high confidence; if E_in is large, h is bad with high confidence. We have no control over E_in. It is what it is.
     • In learning, you actually try to fit the data, as with the perceptron model: g results from searching an entire hypothesis set H for a hypothesis with small E_in.
     Verification                              Real Learning
     a fixed, single hypothesis h              a fixed hypothesis set H
     h is to be certified                      g is to be certified
     h does not depend on D                    g results after searching H to fit D
     no control over E_in                      pick the best E_in
     Verification: we can say something outside the data about h.
     Learning: can we say something outside the data about g?
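To see why the fixed-h guarantee does not automatically cover the g picked by a search, here is a tiny illustrative experiment (entirely invented, in the spirit of the "selection bias" cartoon named in the lecture title): M hypotheses that are all no better than coin flips out of sample, yet the best in-sample one looks good.

```python
import numpy as np

rng = np.random.default_rng(5)

N, M = 20, 1000           # N data points, M hypotheses in the "hypothesis set"

# Suppose every hypothesis in H is no better than a coin flip: E_out = 0.5 for all.
# Whether a hypothesis gets a given point right is then an independent fair coin.
correct = rng.random((M, N)) < 0.5
E_ins = 1 - correct.mean(axis=1)

print(f"E_in of a single fixed h : {E_ins[0]:.2f}")     # hovers around 0.5, as Hoeffding says
print(f"best E_in over all of H  : {E_ins.min():.2f}")  # looks much better than 0.5

# The selected g has a small E_in, but its E_out is still 0.5: the fixed-h
# Hoeffding guarantee does not apply to a hypothesis chosen after the search.
```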
