
Summary

◮ Overfitting arises when we evaluate and train on the same data.
◮ We can bound the error of a fixed function with Hoeffding's inequality.
◮ Next lecture we'll get a version sensitive to function class size.

41 / 61


Part 3. . .


Overfitting, in pictures

With SVM, the model size scales with C. We had our best test error in the middle.

[Figure: misclassification rate (train and test) vs. C, with three decision-boundary plots:
  C = 1.481101, λ = 0.006752; train 0.010000, test 0.060000
  C = 0.040000, λ = 0.250000; train 0.050000, test 0.076000
  C = 20.480000, λ = 0.000488; train 0.000000, test 0.076000]

42 / 61
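Below is a minimal sketch of the kind of sweep behind this figure, assuming scikit-learn and a synthetic two-moons dataset rather than the slide's actual data; the C grid and all numbers are illustrative only.

```python
# Sketch (not the slide's exact experiment): sweep the SVM's C on synthetic data
# and compare train/test misclassification rates. Assumes scikit-learn.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    clf = SVC(C=C, kernel="rbf").fit(X_tr, y_tr)
    err_tr = 1 - clf.score(X_tr, y_tr)   # training misclassification rate
    err_te = 1 - clf.score(X_te, y_te)   # test misclassification rate
    print(f"C={C:7.2f}  train={err_tr:.3f}  test={err_te:.3f}")
```

On such data, training error typically keeps falling as C grows while test error bottoms out somewhere in the middle, which is the picture above.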


Bernoulli walks

Let $Z_i$ be Bernoulli with $\mathbb{E} Z_i = 1/2$; consider the walk $\sum_{i=1}^{t} (2 Z_i - 1)$.

[Figure: sample walks for t up to 1000.]

Fact: with probability ≥ 1 − 1/e, the final position is at most $\sqrt{2n}$.
Thus: with probability ≥ 1 − 1/e, $R(h) - \hat{R}(h) \le \sqrt{1/(2n)}$.

43 / 61
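A quick simulation sketch of the Fact (the seed, n, and trial count are arbitrary): the fraction of walks whose endpoint exceeds √(2n) should come out below 1/e.

```python
# Sketch: simulate walks sum_{i<=n}(2 Z_i - 1) with Z_i ~ Bernoulli(1/2) and check
# how often the endpoint exceeds sqrt(2 n), per the "Fact" on this slide.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 10000
Z = rng.integers(0, 2, size=(trials, n))          # Bernoulli(1/2) draws
endpoints = (2 * Z - 1).sum(axis=1)               # walk position at time n
frac_exceed = np.mean(endpoints > np.sqrt(2 * n))
print(f"fraction with position > sqrt(2n): {frac_exceed:.4f}  (1/e = {np.exp(-1):.4f})")
```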


Two ways to get that Bernoulli walk “Fact”

Theorem (via Chebyshev). Given IID $Z_i \in [a, b]$, with probability ≥ 1 − δ,
  $\mathbb{E} Z_1 - \frac{1}{n} \sum_{i=1}^{n} Z_i \;\le\; (b - a) \sqrt{\frac{1/\delta}{4n}}$.

Theorem (via Hoeffding). Given IID $Z_i \in [a, b]$, with probability ≥ 1 − δ,
  $\mathbb{E} Z_1 - \frac{1}{n} \sum_{i=1}^{n} Z_i \;\le\; (b - a) \sqrt{\frac{\ln(1/\delta)}{2n}}$.

Remarks.
◮ Defining $Z_i := 1[h(X_i) \neq Y_i]$ for a fixed $h$ chosen without seeing $((X_i, Y_i))_{i=1}^{n}$, the left-hand side becomes $R(h) - \hat{R}(h)$.

44 / 61
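A small sketch comparing the two deviation bounds for b − a = 1 (i.e. Z_i ∈ [0, 1]) at a few confidence levels; Chebyshev's √(1/δ) dependence blows up much faster than Hoeffding's √(ln(1/δ)).

```python
# Sketch: compare the Chebyshev-style and Hoeffding deviation bounds for the mean
# of n bounded variables, as functions of the confidence parameter delta.
import numpy as np

def chebyshev_dev(n, delta, a=0.0, b=1.0):
    # (b - a) * sqrt((1/delta) / (4 n))
    return (b - a) * np.sqrt((1.0 / delta) / (4 * n))

def hoeffding_dev(n, delta, a=0.0, b=1.0):
    # (b - a) * sqrt(ln(1/delta) / (2 n))
    return (b - a) * np.sqrt(np.log(1.0 / delta) / (2 * n))

n = 1000
for delta in [0.1, 0.01, 0.001]:
    print(f"delta={delta:6.3f}  Chebyshev={chebyshev_dev(n, delta):.4f}  "
          f"Hoeffding={hoeffding_dev(n, delta):.4f}")
```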


Overfitting

◮ These bounds require IID $(Z_i)_{i=1}^{n}$ where $Z_i := 1[h(X_i) \neq Y_i]$.
◮ If $h$ depends on $((X_i, Y_i))_{i=1}^{n}$, we can't guarantee independence of the $Z_i$.
◮ E.g., suppose $h$ memorizes the training data and outputs "bear" on new data; we can force $\hat{R}(h) = 0$ and $R(h) = 0$, and also $\hat{R}(h) = 0$ but $R(h) = 1$ (see the sketch below).

45 / 61
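A hedged sketch of the memorization remark: random labels, a dictionary lookup standing in for "memorizes the training data", and a constant default answer standing in for outputting "bear" on new data. Training error is 0 by construction, while test error is whatever the default answer earns (about 1/2 here).

```python
# Sketch: a "memorizer" has zero training error by construction, but its test
# error depends entirely on the default answer it gives on unseen points.
import numpy as np

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(100, 2))
y_tr = rng.integers(0, 2, size=100)          # labels are pure noise here

memory = {tuple(x): y for x, y in zip(X_tr, y_tr)}

def memorizer(x, default=0):
    return memory.get(tuple(x), default)     # recall if seen, else guess

train_err = np.mean([memorizer(x) != y for x, y in zip(X_tr, y_tr)])
X_te = rng.normal(size=(1000, 2))
y_te = rng.integers(0, 2, size=1000)
test_err = np.mean([memorizer(x) != y for x, y in zip(X_te, y_te)])
print(f"train error = {train_err:.2f}, test error = {test_err:.2f}")  # ~0.00 vs ~0.50
```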

11. Finite classes
Controlling k predictors

Theorem. Let $Z_{i,j} \in [a, b]$ be given, where $(Z_{i,j})_{i=1}^{n}$ are independent for each $j$ (but nothing is said across $j$). With probability at least 1 − δ,
  $\max_{j \in \{1, \ldots, k\}} \; \mathbb{E} Z_{1,j} - \frac{1}{n} \sum_{i=1}^{n} Z_{i,j} \;\le\; (b - a) \sqrt{\frac{\ln k + \ln(1/\delta)}{2n}}$.

Theorem. Let predictors $(h_1, \ldots, h_k)$ be given. With probability ≥ 1 − δ over an IID draw $((X_i, Y_i))_{i=1}^{n}$,
  $R(h_j) \;\le\; \hat{R}(h_j) + \sqrt{\frac{\ln k + \ln(1/\delta)}{2n}} \qquad \forall j \in \{1, \ldots, k\}$.

Remarks.
◮ We pick $(h_1, \ldots, h_k)$ without seeing the data!
◮ This is how all our generalization guarantees will go: we prove a guarantee on all possible things the algorithm can output, and thus avoid the issue of "$h$ depends on the data". This is called "uniform deviations" or a "uniform law of large numbers".
◮ For this approach to work, we must build the tightest possible estimate of what the algorithm considers (on particular data).

46 / 61
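A simulation sketch of the first theorem: k columns of independent Bernoulli draws play the role of the Z_{i,j}, and the worst deviation of the empirical means is compared against the √(ln k/(2n)) scaling (with b − a = 1 and the δ term dropped). The values of n, k, p are arbitrary.

```python
# Sketch: for k fixed "predictors" (here, k independent Bernoulli columns), the
# worst deviation of the empirical mean grows roughly like sqrt(ln k / (2 n)),
# matching the finite-class bound with (b - a) = 1 and the delta term dropped.
import numpy as np

rng = np.random.default_rng(0)
n, k, p = 500, 1000, 0.3
Z = rng.binomial(1, p, size=(n, k))                  # Z[i, j] in {0, 1}
max_dev = np.max(np.abs(Z.mean(axis=0) - p))         # worst-case over j
print(f"observed max deviation: {max_dev:.4f}")
print(f"sqrt(ln k / (2 n))    : {np.sqrt(np.log(k) / (2 * n)):.4f}")
```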

Proof of finite class bound

Proof. Fix any $h_j$ and some confidence level $\delta_j > 0$. Define a failure event
  $F_j := \left[ R(h_j) > \hat{R}(h_j) + \epsilon_j \right]$, where $\epsilon_j := \sqrt{\frac{\ln(1/\delta_j)}{2n}}$.
By Hoeffding's inequality (which requires independence!), $\Pr(F_j) \le \delta_j$.
The events $(F_1, \ldots, F_k)$ are not independent, but it doesn't matter:
  $\Pr(\forall j \; \neg F_j) \;=\; 1 - \Pr(\exists j \; F_j) \;\ge\; 1 - \sum_{j=1}^{k} \Pr(F_j) \;\ge\; 1 - \sum_{j=1}^{k} \delta_j$.
To finish the proof, set $\delta_j = \delta/k$.

◮ We can also prove the first (abstract) Theorem, and then plug in the random variables $Z_{i,j} := 1[h_j(X_i) \neq Y_i]$.

47 / 61
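For completeness, the last step written out: with δ_j = δ/k the failure probabilities sum to exactly δ, and each ε_j becomes the quantity in the theorem statement.

```latex
% The union bound leaves total failure probability sum_j delta_j; the choice
% delta_j = delta/k makes this exactly delta, and epsilon_j matches the theorem:
\[
\sum_{j=1}^{k} \delta_j = k \cdot \frac{\delta}{k} = \delta,
\qquad
\epsilon_j
  = \sqrt{\frac{\ln(1/\delta_j)}{2n}}
  = \sqrt{\frac{\ln(k/\delta)}{2n}}
  = \sqrt{\frac{\ln k + \ln(1/\delta)}{2n}} .
\]
```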

Picture behind proof

Each predictor $h_j$ had a failure event $F_j := [R(h_j) > \hat{R}(h_j) + \epsilon_j]$.
Concretely, $F_j$ is a subset of all possible $((X_i, Y_i))_{i=1}^{n}$.
Some other $h_l$ might have a different failure event $F_l$! (But $F_l$ is still a subset of the same sample space!)

[Figure: the sample space drawn as a box, with (possibly overlapping) failure regions $F_1, \ldots, F_5$ inside it.]

Looking ahead: for infinitely many predictors, the picture still works if the failure events overlap!

48 / 61

Finite class bound — summary

Theorem. Let predictors $(h_1, \ldots, h_k)$ be given. With probability ≥ 1 − δ over an IID draw $((X_i, Y_i))_{i=1}^{n}$,
  $R(h_j) \;\le\; \hat{R}(h_j) + \sqrt{\frac{\ln k + \ln(1/\delta)}{2n}} \qquad \forall j$.

Remarks.
◮ If we choose $(h_1, \ldots, h_k)$ before seeing $((X_i, Y_i))_{i=1}^{n}$, we can use this bound.
◮ Example: train k classifiers, then pick the best on a validation set (see the sketch after this slide)!
◮ This approach, "produce a bound for all possible algorithm outputs", may seem sloppy, but it's the best we have!
◮ Letting $F = (h_1, \ldots, h_k)$ denote our set of predictors, the bound reads: with probability ≥ 1 − δ, every $f \in F$ satisfies
  $R(f) \;\le\; \hat{R}(f) + \sqrt{\frac{\ln |F| + \ln(1/\delta)}{2n}}$.
In the next sections, we'll handle $|F| = \infty$ by replacing $\ln |F|$ with $\mathrm{complexity}(F)$, whose meaning will vary.

49 / 61
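A sketch of the validation-set example from the remarks, assuming scikit-learn, a synthetic dataset, and an arbitrary grid of candidate C values: the k models are fixed before the validation split is examined, so the finite-class bound applies to their validation errors.

```python
# Sketch: train k classifiers on a training split, then apply the finite-class
# bound to their errors on a held-out validation split (the k predictors are
# "fixed" relative to the validation data). Assumes scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1200, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

Cs = [0.01, 0.1, 1.0, 10.0, 100.0]                      # k = 5 candidate predictors
models = [SVC(C=C).fit(X_tr, y_tr) for C in Cs]
val_errs = [1 - m.score(X_val, y_val) for m in models]

k, n, delta = len(models), len(y_val), 0.05
slack = np.sqrt((np.log(k) + np.log(1 / delta)) / (2 * n))
best = int(np.argmin(val_errs))
print(f"best C = {Cs[best]}, validation error = {val_errs[best]:.3f}")
print(f"with prob >= {1 - delta}, true error <= {val_errs[best] + slack:.3f}")
```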

12. VC Dimension
VC dimension overview

Theorem. With probability at least 1 − δ, every $f \in F$ satisfies
  $R(f) \;\le\; \hat{R}(f) + O\left( \sqrt{\frac{\mathrm{VC}(F) + \ln(1/\delta)}{n}} \right)$,
where $\mathrm{VC}(F)$, the Vapnik-Chervonenkis dimension of $F$, is the largest number of points where $F$ can realize all labelings:
  $\mathrm{VC}(F) := \sup\left\{ n \in \mathbb{Z} : \exists (x_1, \ldots, x_n), \; \forall (y_1, \ldots, y_n), \; \exists f \in F, \; \hat{R}_{0/1}(f) = 0 \right\}$.

Remarks.
◮ $|F|$ can be infinite!
◮ The definition only requires some set of points we can label in every way; this set is unrelated to the IID sample in the bound.
◮ We say that $F$ shatters $(x_1, \ldots, x_n)$ when it can realize all labelings.

50 / 61

VC example: intervals

$\mathrm{VC}(\{ x \mapsto 1[x \in [a, b]] : a, b \in \mathbb{R} \}) = 2$.

Usual approach to VC bounds: establish upper and lower bounds separately.
◮ Upper bound: no set of 3 points can be shattered. Order the points so that $x_1 \le x_2 \le x_3$; we can't label them according to $(+1, -1, +1)$.
◮ Lower bound: we can shatter $\{0, 1\}$ (see the brute-force check after this slide). (Indeed, we can shatter any two distinct points, but that doesn't matter.)

51 / 61
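A brute-force sketch of both directions: an interval classifier realizes a labeling exactly when the points labeled +1 are contiguous in sorted order, so shattering of a candidate point set can be checked by enumerating all labelings.

```python
# Sketch: brute-force check of which point sets intervals can shatter.
# An interval classifier x -> 1[a <= x <= b] realizes a labeling iff the
# points labeled 1 are contiguous once the points are sorted.
from itertools import product

def interval_shatters(points):
    pts = sorted(points)
    for labels in product([0, 1], repeat=len(pts)):
        ones = [x for x, lab in zip(pts, labels) if lab == 1]
        if not ones:
            continue                      # all-zero labeling: use an interval containing no point
        a, b = min(ones), max(ones)       # tightest candidate interval
        realized = tuple(1 if a <= x <= b else 0 for x in pts)
        if realized != labels:
            return False                  # e.g. (+1, -1, +1) on three points
    return True

print(interval_shatters([0.0, 1.0]))        # True: VC >= 2
print(interval_shatters([0.0, 0.5, 1.0]))   # False: no 3 points shattered
```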

Linear (affine) classifiers

$\mathrm{VC}\left( \{ x \mapsto 1[a^\mathsf{T} x + b \ge 0] : a \in \mathbb{R}^d, b \in \mathbb{R} \} \right) = d + 1$.

Again proceed with separate upper and lower bounds.
◮ Upper bound: can't shatter any set of d + 2 points. Geometric fact: any d + 2 points can be grouped into two sets whose convex hulls intersect ("Radon's lemma"), and this gives a labeling which linear classifiers cannot realize.
◮ Lower bound: there exists a set of d + 1 points we can shatter. It suffices to consider $(e_1, \ldots, e_d, 0)$ (see the sketch below).

52 / 61
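A sketch verifying the lower-bound construction numerically: for any labeling of (e_1, ..., e_d, 0), taking a_i = ±1 from the desired label of e_i and b = ±1/2 from the desired label of the origin realizes that labeling.

```python
# Sketch: the lower-bound construction. For points (e_1, ..., e_d, 0) and any
# desired labeling, set a_i = +1 if e_i should be positive else -1, and set
# b = +1/2 or -1/2 according to the origin's label; then 1[a^T x + b >= 0]
# realizes the labeling (a^T e_i + b = a_i + b keeps the sign of a_i).
from itertools import product
import numpy as np

d = 3
points = list(np.eye(d)) + [np.zeros(d)]              # e_1, ..., e_d, 0
shattered = True
for labels in product([0, 1], repeat=d + 1):
    a = np.array([1.0 if lab == 1 else -1.0 for lab in labels[:d]])
    b = 0.5 if labels[-1] == 1 else -0.5              # handles the origin
    realized = tuple(int(a @ x + b >= 0) for x in points)
    shattered &= (realized == labels)
print(f"(e_1,...,e_{d}, 0) shattered: {shattered}")   # True, so VC >= d + 1
```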

Deep networks

◮ Binary activation networks: O(p ln p), where p is #parameters.
◮ ReLU networks: O(pL ln(pL)), where L is #layers.
◮ Sigmoid networks: O(p²m²), where m is #nodes.
◮ Arbitrary continuous convex-concave activation: ∞, even with one node.

Remarks.
◮ These bounds are "tight", but alone they are garbage. E.g., often p ≥ n, but we need VC(F) = o(n).
◮ With deep networks, we don't have a handle on the set of functions investigated by standard methods on standard data with standard algorithms; if we had this, we could plug it into VC and also into the tool in the next section.

53 / 61

VC dimension summary

Theorem. With probability at least 1 − δ, every $f \in F$ satisfies
  $R(f) \;\le\; \hat{R}(f) + O\left( \sqrt{\frac{\mathrm{VC}(F) + \ln(1/\delta)}{n}} \right)$,
where $\mathrm{VC}(F)$, the Vapnik-Chervonenkis dimension of $F$, is the largest number of points where $F$ can realize all labelings:
  $\mathrm{VC}(F) := \sup\left\{ n \in \mathbb{Z} : \exists (x_1, \ldots, x_n), \; \forall (y_1, \ldots, y_n), \; \exists f \in F, \; \hat{R}_{0/1}(f) = 0 \right\}$.

Remarks.
◮ To determine VC(F), prove upper and lower bounds separately.
◮ Examples: intervals, linear separators, deep networks.
◮ Lower bounds: one can hand-craft distributions for which the above theorem's inequality is reversed.

54 / 61

13. Rademacher complexity
What we have so far:
◮ A bound for finitely many predictors.
◮ A bound for classifiers with "finite VC dimension".

Things we are missing:
◮ Infinite classes and non-classification!
◮ Fine-grained sensitivity to model classes; e.g., affine classifier VC dimension is d + 1, which is insensitive to the SVM's C.

55 / 61

Rademacher complexity

Definition. Given examples $(x_1, \ldots, x_n)$ and functions $F$,
  $\mathrm{Rad}(F) = \mathbb{E}_\epsilon \max_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \epsilon_i f(x_i)$,
where $(\epsilon_1, \ldots, \epsilon_n)$ are IID Rademacher random variables ($\Pr[\epsilon_i = 1] = \Pr[\epsilon_i = -1] = \frac{1}{2}$).

Remarks.
◮ Interpretation: the ability of F to fit random signs (see the Monte Carlo sketch after this slide).
◮ We should make F as tight as possible; e.g., for the SVM, we'll incorporate C.
◮ Compared to VC: depends on $(x_1, \ldots, x_n)$, doesn't require classification, and is sensitive to the scale of f.
◮ The general form (not presented here) can handle labels and multiclass, amongst other things.

56 / 61
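A Monte Carlo sketch of the "fit random signs" interpretation, on a toy finite class of threshold functions over fixed one-dimensional points (all sizes and values arbitrary): draw many sign vectors ε and average the best normalized correlation any f ∈ F achieves with them.

```python
# Sketch: Monte Carlo estimate of Rad(F) as "ability to fit random signs",
# here for a tiny finite class of threshold functions on fixed 1-d points.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.sort(rng.uniform(size=n))                        # fixed examples x_1..x_n
thresholds = np.linspace(0, 1, 11)                      # finite class: f_t(x) = +/-1 thresholds
F = np.array([np.where(x >= t, 1.0, -1.0) for t in thresholds])   # shape (|F|, n)

trials = 5000
eps = rng.choice([-1.0, 1.0], size=(trials, n))         # Rademacher signs
# For each sign draw, the best correlation any f in F achieves with the signs:
rad_hat = np.mean(np.max(eps @ F.T, axis=1) / n)
print(f"Monte Carlo Rad(F) ~ {rad_hat:.4f}")
```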

Generalization via Rademacher complexity

Theorem (simplified Rademacher generalization). Let predictors F and a distribution on (X, Y) be given. Suppose for (almost) any (x, y), the loss ℓ satisfies:
◮ There exists ρ ≥ 0 so that for any $f, g \in F$, $|\ell(f(x), y) - \ell(g(x), y)| \le \rho |f(x) - g(x)|$ ("ρ-Lipschitz").
◮ There exists [a, b] so that $\ell(f(x), y) \in [a, b]$ for any $f \in F$.
With probability ≥ 1 − δ, every $f \in F$ satisfies
  $R_\ell(f) \;\le\; \hat{R}_\ell(f) + 2\rho \, \mathrm{Rad}(F) + 3(b - a) \sqrt{\frac{\ln(2/\delta)}{2n}}$.

Remarks.
◮ Can get a bound in terms of $1/(\lambda n)$ for linear SVM and ridge regression. (Homework problems?)
◮ Kernel SVM is okay as well.

57 / 61
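A small sketch that just evaluates the right-hand side of this bound; every number below (ρ, Rad(F), [a, b], δ, n, the empirical risk) is hypothetical and chosen only to show the plug-in.

```python
# Sketch: plugging illustrative numbers into the simplified Rademacher bound.
# All quantities here (rho, rad, a, b, delta, n, emp_risk) are hypothetical.
import numpy as np

def rademacher_bound(emp_risk, rho, rad, a, b, delta, n):
    # hat-R_l(f) + 2*rho*Rad(F) + 3*(b - a)*sqrt(ln(2/delta) / (2 n))
    return emp_risk + 2 * rho * rad + 3 * (b - a) * np.sqrt(np.log(2 / delta) / (2 * n))

# e.g. a 1-Lipschitz loss bounded in [0, 1], and Rad(F) = RW/sqrt(n) with R = W = 1:
n = 10_000
print(f"risk bound: {rademacher_bound(0.10, 1.0, 1/np.sqrt(n), 0.0, 1.0, 0.05, n):.4f}")
```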

Rademacher complexity examples

Definition. Given examples $(x_1, \ldots, x_n)$ and functions $F$,
  $\mathrm{Rad}(F) = \mathbb{E}_\epsilon \max_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \epsilon_i f(x_i)$,
where $(\epsilon_1, \ldots, \epsilon_n)$ are IID Rademacher random variables ($\Pr[\epsilon_i = 1] = \Pr[\epsilon_i = -1] = \frac{1}{2}$).

Examples.
◮ If $\|x\| \le R$, then $\mathrm{Rad}(\{ x \mapsto x^\mathsf{T} w : \|w\| \le W \}) \le \frac{RW}{\sqrt{n}}$. For the SVM, we can set $W = \sqrt{2/\lambda}$ (see the sketch after this slide).
◮ For deep networks, we have $\mathrm{Rad}(F) \le \mathrm{Lipschitz} \cdot \sqrt{\mathrm{Junk}/n}$; still very loose.

58 / 61
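A sketch checking the first example: for the bounded linear class the inner maximum has a closed form, max_{‖w‖≤W} (1/n)Σ_i ε_i wᵀx_i = (W/n)·‖Σ_i ε_i x_i‖, so a Monte Carlo average can be compared directly to RW/√n. Data and sizes are arbitrary, with R = 1 enforced by normalizing the x_i.

```python
# Sketch: Monte Carlo estimate of Rad(F) for the bounded linear class
# F = { x -> w^T x : ||w|| <= W }, using the closed form of the inner max.
import numpy as np

rng = np.random.default_rng(0)
n, d, W = 200, 5, 1.0
X = rng.normal(size=(n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)   # ||x_i|| <= R = 1

trials = 2000
eps = rng.choice([-1.0, 1.0], size=(trials, n))                    # Rademacher signs
rad_hat = np.mean(W / n * np.linalg.norm(eps @ X, axis=1))         # (W/n) * ||sum_i eps_i x_i||
print(f"Monte Carlo Rad(F) ~ {rad_hat:.4f},  bound RW/sqrt(n) = {W / np.sqrt(n):.4f}")
```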

14. ML Theory summary / retrospective
No free lunch

Theorem (informal). For any n, any sample space with at least 2n points, and any learning algorithm, there is a distribution over data so that:
◮ some function g is perfect, meaning $R_{0/1}(g) = 0$,
◮ the algorithm returns f with $R_{0/1}(f) = 1/4$.

Getting around this.
◮ This does not require g to be in the learning method's model class; learning methods operate optimistically under some inductive bias.
◮ Inductive bias is a consequence of the model class and the learning algorithm.
◮ Ideally, we know something about the data, and use it to drive these choices.

59 / 61

Decomposition of error

◮ We decomposed the error of a learning problem into many terms; the main terms were approximation error (which improves as the model class grows) and generalization error (which worsens as the model class grows).
◮ Approximation error can be made arbitrarily small with most machine learning model classes (RBF SVM with C ↑ ∞, ReLU networks with depth or width ↑ ∞).
◮ Generalization requires algorithmic care.
◮ For the SVM, we have to be careful about C.
◮ For deep networks, careful architecture choice plus gradient descent often seems magically sufficient; this is not well understood, and there is good practical guidance only in some settings (e.g., computer vision).

60 / 61

Summary; what to know

◮ Remember always that the goal in ML is low test error, not low training error.
◮ The main sources of error: approximation (model class too weak) and generalization (model class too powerful).
◮ How to control model size (e.g., with the SVM, we shrink C or change kernels).
◮ Hoeffding's inequality; the concept of concentration of measure.
◮ A concrete source of overfitting (IID violation).
◮ Ways to establish generalization: finite classes, VC dimension, Rademacher complexity.

61 / 61