Learning From Data Lecture 11: Overfitting


  1. Learning From Data, Lecture 11: Overfitting. What is overfitting? When does overfitting occur? Stochastic and deterministic noise. M. Magdon-Ismail, CSCI 4100/6100.

  2. recap: Nonlinear Transforms. The transform Φ maps the X-space (ℝ^d) to the Z-space (ℝ^d̃):

$$ x = \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} \;\xrightarrow{\ \Phi\ }\; z = \Phi(x) = \begin{bmatrix} \Phi_1(x) \\ \vdots \\ \Phi_{\tilde d}(x) \end{bmatrix} $$

1. Original data: x_1, x_2, …, x_N ∈ X with labels y_1, y_2, …, y_N; no weights live in the X-space, where d_vc = d + 1.
2. Transform the data: z_n = Φ(x_n) ∈ Z; the labels are unchanged.
3. Separate the data in Z-space: g̃(z) = sign(w̃ᵀz), with weights w̃ = (w̃_0, w̃_1, …, w̃_d̃) and d_vc = d̃ + 1.
4. Classify in X-space (via ‘Φ⁻¹’): g(x) = g̃(Φ(x)) = sign(w̃ᵀΦ(x)).
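The four steps map to a few lines of code. Here is a minimal sketch, assuming a 1-D input, a cubic polynomial transform, and least squares as a stand-in for the separating step; all names and values are illustrative, not the lecture's.

```python
# Minimal sketch of the four steps, assuming a 1-D input and the
# polynomial transform Phi(x) = (1, x, x^2, x^3).
import numpy as np

def phi(x, degree=3):
    """Map inputs x (shape (N,)) to Z-space features (1, x, ..., x^degree)."""
    return np.vander(x, degree + 1, increasing=True)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 20)                  # 1. original data in X-space
y = np.sign(x**3 - 0.5 * x + 0.1)           # illustrative +/-1 labels

Z = phi(x)                                  # 2. transform: z_n = Phi(x_n)
w_tilde, *_ = np.linalg.lstsq(Z, y, rcond=None)  # 3. separate in Z-space
                                            #    (least squares as a proxy for a linear separator)

def g(x_new):                               # 4. classify in X-space
    return np.sign(phi(np.atleast_1d(x_new)) @ w_tilde)

print(g(0.7), g(-0.9))
```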

  3. recap: Digits Data, “1” versus “all”. [Figure: two scatter plots of symmetry versus average intensity, one with a linear separator, one with a 3rd order polynomial separator.]

Linear model: E_in = 2.13%, E_out = 2.38%.
3rd order polynomial model: E_in = 1.75%, E_out = 1.87%.

  4. Superstitions: Myth or Reality?

• Paraskevedekatriaphobia: fear of Friday the 13th. Are future Friday the 13ths really more dangerous?
• OCD [medical journal, citation lost, can you find it?]: the subject performs an action that leads to a good outcome and generalizes it as cause and effect, concluding that the action will always give good results. Having overfit the data, the subject compulsively engages in that activity.

Humans are overfitting machines, very good at “finding coincidences”.

  5. An Illustration of Overfitting on a Simple Example. [Figure: a quadratic target f and 5 data points with a little noise (measurement error).] The 5 data points are fit with a 4th order polynomial. Classic overfitting: a simple target with an excessively complex H. The noise did us in. (Why?)

  6. An Illustration of Overfitting on a Simple Example, with the fit shown. [Figure: the quadratic target, the 5 noisy data points, and the 4th order polynomial fit passing through all of them.] Classic overfitting: a simple target with an excessively complex H. E_in ≈ 0; E_out ≫ 0. The noise did us in. (Why?)
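The example is easy to reproduce. Here is a hedged reconstruction: the quadratic target, the noise level, and the sample locations are assumptions rather than the lecture's exact values, but the qualitative outcome (E_in ≈ 0, E_out ≫ 0) is the same.

```python
# Classic overfitting: 5 noisy points from a quadratic target,
# interpolated by a 4th order polynomial.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: x**2                       # assumed quadratic target
x = np.linspace(-1, 1, 5)
y = f(x) + rng.normal(0, 0.1, 5)         # a little measurement noise

coef = np.polyfit(x, y, deg=4)           # 4th order fit through 5 points
x_test = np.linspace(-1, 1, 1000)

E_in = np.mean((np.polyval(coef, x) - y) ** 2)
E_out = np.mean((np.polyval(coef, x_test) - f(x_test)) ** 2)
print(f"E_in  ~ {E_in:.2e}")             # ~ 0: the fit interpolates the data
print(f"E_out ~ {E_out:.2e}")            # >> 0: the fit chased the noise
```

With 5 points and 5 polynomial coefficients the fit interpolates exactly, so E_in is zero to machine precision while E_out blows up: fitting the data more than is warranted.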

  7. What is Overfitting? Fitting the data more than is warranted.

  8. Overfitting is Not Just Bad Generalization. [Figure: error versus VC dimension d_vc; the gap between the out-of-sample and in-sample error curves is marked “bad generalization”.] VC analysis covers bad generalization, but with lots of slack: the VC bound is loose.

  9. Overfitting is Not Just Bad Generalization. [Figure: the same error-versus-d_vc plot, with the region where E_in keeps falling while E_out rises marked “overfitting”.] Overfitting: going for lower and lower E_in results in higher and higher E_out.

  10. Case Study: 2nd vs 10th Order Polynomial Fit. [Figure: two data sets with their targets: a 10th order f with noise, and a 50th order f with no noise.]

H_2: 2nd order polynomial fit, a special case of linear models with the feature transform x ↦ (1, x, x², …).
H_10: 10th order polynomial fit.

Which model do you pick for which problem, and why?


  12. Case Study: 2nd vs 10th Order Polynomial Fit. [Figure: the 2nd and 10th order fits overlaid on each data set.]

              simple noisy target           complex noiseless target
              2nd order     10th order      2nd order     10th order
    E_in      0.050         0.034           0.029         10^-5
    E_out     0.127         9.00            0.120         7680

Go figure: the simpler H is better even for the more complex target with no noise.
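The table comes from an experiment one can sketch in a few lines. The target, noise level, and N below are assumptions (the table averages many runs of the course's own setup), so the numbers will not match; only the pattern should, with H_10 driving E_in down and E_out up.

```python
# Sketch of the H_2 vs H_10 experiment: fit both models to N noisy
# samples of a randomly drawn 10th order target and compare errors.
import numpy as np

rng = np.random.default_rng(2)
coef_f = rng.normal(0, 1, 11)                    # assumed 10th order target
f = lambda x: np.polyval(coef_f, x)

N, sigma = 15, 0.5                               # assumed sample size and noise
x = rng.uniform(-1, 1, N)
y = f(x) + rng.normal(0, sigma, N)

x_test = rng.uniform(-1, 1, 10_000)              # estimate E_out by sampling
for deg in (2, 10):
    c = np.polyfit(x, y, deg)
    E_in = np.mean((np.polyval(c, x) - y) ** 2)
    E_out = np.mean((np.polyval(c, x_test) - f(x_test)) ** 2)
    print(f"H_{deg:<2}: E_in = {E_in:.3f}, E_out = {E_out:.3f}")
```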

  13. Is There Really “No Noise” with the Complex f? [Figure: the two data sets with their targets: the simple f with noise, and the complex f with no noise.] H should match the quantity and quality of the data, not f.

  14. Is There Really “No Noise” with the Complex f? Look only at the data. [Figure: the same two data sets with the targets hidden: the simple f with noise, and the complex f with no noise.] H should match the quantity and quality of the data, not f.

  15. When is H_2 Better than H_10? [Figure: learning curves for H_2 and for H_10; each panel plots expected E_in and E_out against the number of data points N.] Overfitting: E_out(H_10) > E_out(H_2).

  16. Overfit Measure: E_out(H_10) − E_out(H_2). [Figure: the overfit measure as a function of the number of data points N (ticks at 80, 100, 120) and the noise level σ² (0 to 2); the color scale runs from −0.2 to 0.2.]

  17. Overfit Measure: E_out(H_10) − E_out(H_2). [Figure: two panels against the number of data points N. Left: noise level σ² (0 to 2). Right: target complexity Q_f (0 to 100). Color scale from −0.2 to 0.2.]

Number of data points ↑: overfitting ↓.
Noise ↑: overfitting ↑.
Target complexity ↑: overfitting ↑.

A simulation sketch of the left panel follows.
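Here is how the left panel's quantity could be estimated, assuming the overfit measure is averaged over many data sets on a small grid of (N, σ²) with a fixed random 10th order target; grid ranges and run counts are illustrative, not the lecture's.

```python
# Estimate the overfit measure E_out(H_10) - E_out(H_2), averaged over
# many data sets, on a small grid of sample sizes N and noise levels.
import numpy as np

rng = np.random.default_rng(3)
coef_f = rng.normal(0, 1, 11)                    # fixed 10th order target
f = lambda x: np.polyval(coef_f, x)
x_test = rng.uniform(-1, 1, 2_000)

def overfit_measure(N, sigma, runs=50):
    total = 0.0
    for _ in range(runs):
        x = rng.uniform(-1, 1, N)
        y = f(x) + rng.normal(0, sigma, N)
        e = {}
        for deg in (2, 10):
            c = np.polyfit(x, y, deg)
            e[deg] = np.mean((np.polyval(c, x_test) - f(x_test)) ** 2)
        total += e[10] - e[2]
    return total / runs

for N in (20, 60, 120):
    for sigma in (0.0, 0.5, 1.0):
        print(f"N={N:3d}, sigma={sigma}: {overfit_measure(N, sigma):+.3f}")
```

More data tends to push the measure down and more noise tends to push it up, which is the pattern the arrows above describe.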

  18. Noise: that part of y we cannot model. It has two sources . . .

  19. Stochastic Noise: Data Error. We would like to learn from the noiseless values y_n = f(x_n). Unfortunately, we only observe y_n = f(x_n) + ‘stochastic noise’, and no one can model this. [Figure: the target f, with the noiseless points on the curve and the observed noisy points off it.] Stochastic noise: fluctuations/measurement errors we cannot model.

  20. Deterministic Noise: Model Error. Let h* be the best approximation to f in H. We would like to learn from y_n = h*(x_n). Unfortunately, we only observe y_n = f(x_n) = h*(x_n) + ‘deterministic noise’, and H cannot model this. [Figure: the target f, its best approximation h* in H, and the gap between them.] Deterministic noise: the part of f we cannot model.
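A minimal sketch of deterministic noise, assuming H is the 2nd order polynomials and an illustrative target outside H; h* is approximated by a least-squares fit on a dense grid, a stand-in for the true best approximation in H.

```python
# Deterministic noise: the gap between the target f and its best
# approximation h* within H (here, 2nd order polynomials).
import numpy as np

f = lambda x: np.sin(np.pi * x)                 # assumed target outside H
x_grid = np.linspace(-1, 1, 2001)

c_star = np.polyfit(x_grid, f(x_grid), deg=2)   # h*: best 2nd order fit
det_noise = f(x_grid) - np.polyval(c_star, x_grid)

print(f"mean squared deterministic noise ~ {np.mean(det_noise**2):.3f}")
```

Re-running this with a richer H (say deg=5) shrinks the deterministic noise, while re-drawing data would not change it; the reverse holds for stochastic noise, as the next slide tabulates.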

  21. Stochastic and Deterministic Noise Both Hurt Learning.

Stochastic noise: y = f(x) + stoch. noise. Source: random measurement errors. Re-measure y_n: the stochastic noise changes. Change H: the stochastic noise stays the same.

Deterministic noise: y = h*(x) + det. noise. Source: the learner's H cannot model f. Re-measure y_n: the deterministic noise stays the same. Change H: the deterministic noise changes.

We have a single D and a fixed H, so we cannot distinguish the two.

  22. Noise and the Bias-Variance Decomposition. With measurement error ε (zero mean, variance σ², independent of D),

$$ y = f(x) + \epsilon $$

$$ \mathbb{E}[E_{\text{out}}(x)] = \mathbb{E}_{\mathcal{D},\epsilon}\big[(g^{(\mathcal{D})}(x) - f(x) - \epsilon)^2\big] = \mathbb{E}_{\mathcal{D},\epsilon}\big[(g^{(\mathcal{D})}(x) - f(x))^2 - 2(g^{(\mathcal{D})}(x) - f(x))\epsilon + \epsilon^2\big] $$

The first term gives bias + var, the cross term has expectation 0, and the last term gives σ².

  23. Noise and the Bias-Variance Decomposition.

$$ \mathbb{E}[E_{\text{out}}(x)] = \underbrace{\sigma^2}_{\substack{\text{stochastic}\\ \text{noise}}} + \underbrace{\text{bias}}_{\substack{\text{deterministic}\\ \text{noise}}} + \underbrace{\text{var}}_{\substack{\text{indirect impact}\\ \text{of noise}}} $$
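The three terms can be estimated by simulation. Below is a sketch at a single test point x_0, assuming a sinusoidal target, lines (1st order polynomials) for H, and illustrative values of N and σ; none of these choices come from the lecture.

```python
# Estimate sigma^2, bias, and var at one test point by drawing many
# data sets D, fitting g^(D) to each, and looking at the spread of g(x0).
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(np.pi * x)              # assumed target
N, sigma, x0, runs = 10, 0.3, 0.25, 5000     # assumed experiment parameters

g_x0 = np.empty(runs)
for i in range(runs):
    x = rng.uniform(-1, 1, N)
    y = f(x) + rng.normal(0, sigma, N)       # y = f(x) + eps
    c = np.polyfit(x, y, deg=1)              # g^(D): a line fit to D
    g_x0[i] = np.polyval(c, x0)

bias = (g_x0.mean() - f(x0)) ** 2            # deterministic noise, squared
var = g_x0.var()                             # indirect impact of the noise
print(f"E[E_out(x0)] ~ sigma^2 + bias + var = "
      f"{sigma**2:.3f} + {bias:.3f} + {var:.3f}")
```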

  24. Noise is the Culprit. Overfitting is the disease; noise is the cause. Learning is led astray by fitting the noise more than the signal.

Cures:
Regularization: putting on the brakes.
Validation: a reality check from peeking at E_out (the bottom line).

  25. Regularization (teaser). [Figure: the 5-point example fit without regularization: the wild 4th order fit against the data and the target.]

  26. Regularization (teaser). [Figure: the same data fit side by side, without regularization (the wild fit) and with regularization (a fit much closer to the target).]
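To make the teaser concrete, here is a sketch on the earlier 5-point example, assuming weight-decay (ridge) regularization with an illustrative λ = 0.1; this is only a preview of the effect, not the lecture's method.

```python
# Weight-decay (ridge) regularization on the 4th order fit of the
# 5-point example: shrink the weights instead of interpolating the noise.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: x**2                            # assumed quadratic target
x = np.linspace(-1, 1, 5)
y = f(x) + rng.normal(0, 0.1, 5)

Z = np.vander(x, 5, increasing=True)          # features (1, x, ..., x^4)
lam = 0.1                                     # illustrative regularization strength
w_reg = np.linalg.solve(Z.T @ Z + lam * np.eye(5), Z.T @ y)
w_unreg, *_ = np.linalg.lstsq(Z, y, rcond=None)

x_test = np.linspace(-1, 1, 1000)
Z_test = np.vander(x_test, 5, increasing=True)
for name, w in (("no regularization", w_unreg), ("regularization", w_reg)):
    E_out = np.mean((Z_test @ w - f(x_test)) ** 2)
    print(f"{name:>18}: E_out ~ {E_out:.3f}")
```

The penalty λ‖w‖² puts on the brakes: the regularized weights no longer chase the noise, and E_out drops even though E_in rises slightly.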
