

  1. Summary
  ◮ Overfitting arises when we evaluate and train on the same data.
  ◮ We can bound the error of a fixed function with Hoeffding’s inequality.
  ◮ Next lecture we’ll get a version sensitive to function class size.

  2. Part 3. . .

  3. Overfitting, in pictures
With SVM, the model size scales with C. We had our best test error in the middle.
[Figures: misclassification rate (train and test) as a function of C, plus decision-boundary contour plots for three settings:
C = 1.481101, λ = 0.006752; train 0.010000, test 0.060000 (the middle, best test error);
C = 0.040000, λ = 0.250000; train 0.050000, test 0.076000;
C = 20.480000, λ = 0.000488; train 0.000000, test 0.076000.]
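The U-shape in the figure can be reproduced without the original SVM code. Here is a minimal, self-contained sketch (standard library only, no scikit-learn) in which k-nearest neighbors stands in for the SVM: small k plays the role of large C (high capacity), large k the role of small C.

```python
import random

random.seed(0)

def make_data(n):
    """Noisy 1-D data: true label is 1[x > 0], flipped with probability 0.2."""
    data = []
    for _ in range(n):
        x = random.uniform(-1, 1)
        y = 1 if x > 0 else 0
        if random.random() < 0.2:
            y = 1 - y
        data.append((x, y))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return 1 if 2 * sum(y for _, y in nearest) > k else 0

def error(train, data, k):
    wrong = sum(1 for x, y in data if knn_predict(train, x, k) != y)
    return wrong / len(data)

train, test = make_data(100), make_data(1000)
for k in (1, 9, 99):  # high, medium, low capacity
    print(k, error(train, train, k), error(train, test, k))
```

At k = 1 the model memorizes its training set (train error exactly 0) but tests poorly; at k = 99 it is nearly a constant predictor; the best test error sits in the middle, mirroring the middle value of C on the slide.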

  4.–5. Bernoulli walks
Let Z_i be Bernoulli with E[Z_i] = 1/2; consider the walk Σ_{i=1}^t (2 Z_i − 1).
[Figure: several sample paths of the walk for t from 0 to 1000, ranging roughly between −80 and 60.]
Fact: with probability ≥ 1 − 1/e, the position is ≤ √(2n).
Thus: with probability ≥ 1 − 1/e, R(h) − R̂(h) ≤ √(1/(2n)).
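The Fact is easy to check by simulation. A minimal sketch (standard library only; 2000 walks of n = 1000 steps, the trial count chosen just for the demo):

```python
import random

random.seed(0)

n, trials = 1000, 2000
threshold = (2 * n) ** 0.5  # the sqrt(2n) level from the Fact

below = 0
for _ in range(trials):
    # One walk: sum of n steps, each step is 2*Z_i - 1 with Z_i Bernoulli(1/2).
    pos = sum(2 * random.randint(0, 1) - 1 for _ in range(n))
    if pos <= threshold:
        below += 1

print(below / trials)
```

The printed fraction comfortably exceeds 1 − 1/e ≈ 0.632; the guarantee in the Fact is quite loose for the final position.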

  6. Two ways to get that Bernoulli walk “Fact”
Theorem (via Chebyshev). Given IID Z_i ∈ [a, b], with probability ≥ 1 − δ,
  | E Z_1 − (1/n) Σ_{i=1}^n Z_i | ≤ (b − a) √( (1/δ) / (4n) ).
Theorem (via Hoeffding). Given IID Z_i ∈ [a, b], with probability ≥ 1 − δ,
  E Z_1 − (1/n) Σ_{i=1}^n Z_i ≤ (b − a) √( ln(1/δ) / (2n) ).
Remarks.
  ◮ Defining Z_i := 1[h(X_i) ≠ Y_i] for a fixed h chosen without seeing ((X_i, Y_i))_{i=1}^n, the left-hand side becomes R(h) − R̂(h).
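Plugging numbers into the two theorems makes the comparison concrete. A small sketch (standard library only) evaluating both deviation radii; since 2δ ln(1/δ) ≤ 2/e < 1 for all δ ∈ (0, 1), the Hoeffding radius is always the smaller of the two, dramatically so for small δ:

```python
import math

def chebyshev_dev(n, delta, width=1.0):
    """Radius (b - a) * sqrt((1/delta) / (4n)) from the Chebyshev theorem."""
    return width * math.sqrt((1 / delta) / (4 * n))

def hoeffding_dev(n, delta, width=1.0):
    """Radius (b - a) * sqrt(ln(1/delta) / (2n)) from the Hoeffding theorem."""
    return width * math.sqrt(math.log(1 / delta) / (2 * n))

# Chebyshev's radius blows up polynomially in 1/delta;
# Hoeffding's grows only like sqrt(ln(1/delta)).
for delta in (0.5, 0.1, 0.01, 0.001):
    print(delta, chebyshev_dev(1000, delta), hoeffding_dev(1000, delta))
```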

  7. Overfitting
  ◮ These bounds require IID (Z_i)_{i=1}^n, where Z_i := 1[h(X_i) ≠ Y_i].
  ◮ If h depends on ((X_i, Y_i))_{i=1}^n, we can’t guarantee independence of the Z_i.
  ◮ E.g., suppose h memorizes the training data and outputs “bear” on new data; we can force R̂(h) = 0 and R(h) = 0, and also R̂(h) = 0 but R(h) = 1.
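The memorization pathology is easy to exhibit concretely. A minimal sketch (standard library only; coin-flip labels so every predictor has true risk 1/2, with hypothetical integer “inputs” used just for the demo):

```python
import random

random.seed(1)

# Labels are fair coin flips: no predictor can have true risk below 1/2.
train = [(i, random.randint(0, 1)) for i in range(100)]
test = [(i + 100, random.randint(0, 1)) for i in range(1000)]

memory = dict(train)

def memorizer(x):
    """Replay the memorized label; output a fixed guess on unseen inputs."""
    return memory.get(x, 0)

train_err = sum(memorizer(x) != y for x, y in train) / len(train)
test_err = sum(memorizer(x) != y for x, y in test) / len(test)
print(train_err, test_err)
```

The empirical risk R̂(h) is exactly 0 while the true risk sits near 1/2: the Z_i computed from this data-dependent h are not independent draws, so Hoeffding’s inequality simply does not apply.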

  8. 11. Finite classes

  9.–10. Controlling k predictors
Theorem. Let Z_{i,j} ∈ [a, b] be given, where (Z_{i,j})_{i=1}^n are independent for each j (but nothing is said across j). With probability at least 1 − δ,
  max_{j ∈ {1,…,k}} ( E Z_{1,j} − (1/n) Σ_{i=1}^n Z_{i,j} ) ≤ (b − a) √( (ln k + ln(1/δ)) / (2n) ).
Theorem. Let predictors (h_1, …, h_k) be given. With probability ≥ 1 − δ over an IID draw ((X_i, Y_i))_{i=1}^n,
  R(h_j) ≤ R̂(h_j) + √( (ln k + ln(1/δ)) / (2n) )   ∀ j ∈ {1, …, k}.
Remarks.
  ◮ We pick (h_1, …, h_k) without seeing data!
  ◮ This is how all our generalization guarantees will go: we prove a guarantee on all possible things the algorithm can output, and thus avoid the issue of “h depends on data”. Called “uniform deviations” or a “uniform law of large numbers”.
  ◮ For this approach to work, we must build the tightest possible estimate of what the algorithm considers (on particular data).
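The first theorem can be sanity-checked numerically. A hedged sketch (standard library only; k = 50 “predictors” whose errors Z_{i,j} are fair coins fixed independently of everything else, so each true risk E Z_{1,j} is 1/2):

```python
import math
import random

random.seed(2)

n, k, delta = 500, 50, 0.01
bound = math.sqrt((math.log(k) + math.log(1 / delta)) / (2 * n))

# Z_{i,j}: indicator that predictor j errs on example i.
# Largest one-sided deviation E Z_{1,j} - (1/n) sum_i Z_{i,j} over all j.
max_dev = max(
    0.5 - sum(random.randint(0, 1) for _ in range(n)) / n
    for _ in range(k)
)
print(max_dev, bound)
```

With probability at least 1 − δ the maximum deviation over all k predictors stays below the bound; note that k enters only through an additive ln k under the square root.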

  11.–17. Proof of finite class bound
Proof. Fix any h_j and some confidence level δ_j > 0. Define a failure event
  F_j := [ R(h_j) > R̂(h_j) + ε_j ]   where   ε_j := √( ln(1/δ_j) / (2n) ).
By Hoeffding’s inequality (which requires independence!), Pr(F_j) ≤ δ_j.
The events (F_1, …, F_k) are not independent, but it doesn’t matter:
  Pr(∀j · ¬F_j) = 1 − Pr(∃j · F_j) ≥ 1 − Σ_{j=1}^k Pr(F_j) ≥ 1 − Σ_{j=1}^k δ_j.
To finish the proof, set δ_j = δ/k. ∎
We can also prove the first (abstract) theorem, and then plug in the random variables Z_{i,j} := 1[h_j(X_i) ≠ Y_i].
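The last step of the proof, splitting the failure budget as δ_j = δ/k, can be verified arithmetically. A small sketch (standard library only, with arbitrary n, k, δ chosen for the demo):

```python
import math

def eps(delta_j, n):
    """Per-predictor radius from Hoeffding at confidence delta_j."""
    return math.sqrt(math.log(1 / delta_j) / (2 * n))

n, k, delta = 1000, 100, 0.05
delta_j = delta / k  # even split of the failure budget

# Union bound: the k failure probabilities sum back to delta...
total = k * delta_j

# ...and the per-predictor radius equals the ln k + ln(1/delta) form,
# because ln(1/(delta/k)) = ln k + ln(1/delta).
lhs = eps(delta_j, n)
rhs = math.sqrt((math.log(k) + math.log(1 / delta)) / (2 * n))
print(total, lhs, rhs)
```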

  18.–23. Picture behind proof
Each predictor h_j had a failure event F_j := [ R(h_j) > R̂(h_j) + ε_j ].
Concretely, F_j is a subset of all possible ((X_i, Y_i))_{i=1}^n.
Some other h_l might have a different failure event F_l! (But F_l is still a subset of the same sample space!)
[Figure: the sample space drawn as a box, with five small regions F_1, …, F_5 scattered inside it.]
Looking ahead: for infinitely many predictors, the picture still works if the failure events overlap!
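The picture’s point, that overlap among failure events only helps the union bound, can be illustrated with explicit sets. A toy sketch (standard library only; a 10,000-point sample space and five random “failure events”, all sizes chosen arbitrarily for the demo):

```python
import random

random.seed(3)

# Finite stand-in for the sample space of all datasets ((X_i, Y_i))_{i=1}^n.
space = range(10000)
events = [set(random.sample(space, 500)) for _ in range(5)]  # F_1, ..., F_5

pr_union = len(set().union(*events)) / len(space)
pr_sum = sum(len(f) for f in events) / len(space)
print(pr_union, pr_sum)
```

Pr(∃j · F_j) = pr_union never exceeds Σ_j Pr(F_j) = pr_sum; where the events overlap (as they must once there are infinitely many predictors), the union bound is loose but still valid.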

  24. Finite class bound: summary
Theorem. Let predictors (h_1, …, h_k) be given. With probability ≥ 1 − δ over an IID draw ((X_i, Y_i))_{i=1}^n,
  R(h_j) ≤ R̂(h_j) + √( (ln k + ln(1/δ)) / (2n) )   ∀ j.
Remarks.
  ◮ If we choose (h_1, …, h_k) before seeing ((X_i, Y_i))_{i=1}^n, we can use this bound.
  ◮ Example: train k classifiers, pick the best on a validation set!
  ◮ This approach (“produce a bound for all possible algorithm outputs”) may seem sloppy, but it’s the best we have!
  ◮ Letting F = (h_1, …, h_k) denote our set of predictors, the bound reads: with probability ≥ 1 − δ, every f ∈ F satisfies
      R(f) ≤ R̂(f) + √( (ln |F| + ln(1/δ)) / (2n) ).
    In the next sections, we’ll handle |F| = ∞ by replacing ln |F| with complexity(F), whose meaning will vary.
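The logarithmic growth in k is worth seeing numerically. A minimal sketch (standard library only) of the bound as the class grows, with an arbitrary empirical risk of 0.1 assumed for the demo:

```python
import math

def finite_class_bound(emp_risk, k, n, delta):
    """Upper bound R(h_j) <= R_hat(h_j) + sqrt((ln k + ln(1/delta)) / (2n))."""
    return emp_risk + math.sqrt((math.log(k) + math.log(1 / delta)) / (2 * n))

# Multiplying the number of candidates by 10 costs only an additive
# ln 10 inside the square root.
n, delta = 2000, 0.05
for k in (1, 10, 100, 1000, 10**6):
    print(k, round(finite_class_bound(0.1, k, n, delta), 4))
```

Going from one candidate predictor to a million inflates the penalty term by less than a factor of 2.5 here, which is why the “train k classifiers, pick the best on a validation set” recipe above is statistically cheap.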

  25. 12. VC Dimension
