Learning From Data, Lecture 12: Regularization


  1. Learning From Data, Lecture 12: Regularization. Constraining the Model, Weight Decay, Augmented Error. M. Magdon-Ismail, CSCI 4100/6100.

  2. recap: Overfitting. Fitting the data more than is warranted. [Figure: Data, Target, and Fit plotted as y versus x; the fit chases the noisy data away from the target.]

  3. recap: Noise is Part of y We Cannot Model. [Figure: two panels plotting y versus x, Stochastic Noise and Deterministic Noise; y = f(x) + stochastic noise, and y = h*(x) + deterministic noise, where h* is the best approximation to f within the model.] Stochastic and deterministic noise both hurt learning. Human: good at extracting the simple pattern, ignoring the noise and complications. Computer: pays equal attention to all pixels; needs help simplifying (features, regularization).

  4. Regularization. What is regularization? A cure for our tendency to fit (get distracted by) the noise, hence improving E_out. How does it work? By constraining the model so that we cannot fit the noise (putting on the brakes). Side effects? The medication will have side effects: if we cannot fit the noise, maybe we cannot fit f (the signal)?

  5. Constraining the Model: Does it Help? [Figure: a fit to noisy data, y versus x.] ... and the winner is:

  6. Constraining the Model: Does it Help? [Figure: two panels, y versus x; the unconstrained fit versus the fit with the weights constrained to be smaller.] ... and the winner is:

  7. Bias Goes Up A Little. [Figure: two panels plotting the average hypothesis ḡ(x) against sin(x); no regularization: bias = 0.21; regularization: bias = 0.23 (the side effect).] (The constant model had bias = 0.5 and var = 0.25.)

  8. Variance Drop is Dramatic! [Figure: the same two panels of ḡ(x) against sin(x); no regularization: bias = 0.21, var = 1.69; regularization: bias = 0.23 (the side effect), var = 0.33 (the treatment).] (The constant model had bias = 0.5 and var = 0.25.)
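
The bias and variance numbers above can be estimated with a short simulation: learn on many small datasets, average the resulting hypotheses, and compare against the target. A minimal sketch in Python/numpy, assuming a setup similar to the course's running example (target sin(πx), tiny datasets of N = 2 points, a linear fit with weight decay); the function name fit_line, the choice lam = 0.1, and the number of datasets are illustrative assumptions, so the printed numbers will only roughly track the slide's.

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(np.pi * x)            # assumed target function
    lam = 0.1                                  # illustrative weight-decay amount
    xs = np.linspace(-1, 1, 200)               # grid over which bias/var are averaged

    def fit_line(x, y, lam):
        Z = np.column_stack([np.ones_like(x), x])                  # h(x) = w0 + w1*x
        return np.linalg.solve(Z.T @ Z + lam * np.eye(2), Z.T @ y)

    gs = []
    for _ in range(10000):                     # many datasets of N = 2 points each
        x = rng.uniform(-1, 1, size=2)
        w = fit_line(x, f(x), lam)
        gs.append(w[0] + w[1] * xs)            # the learned g(x) evaluated on the grid

    gs = np.array(gs)
    gbar = gs.mean(axis=0)                     # average hypothesis g-bar(x)
    bias = np.mean((gbar - f(xs)) ** 2)        # average of (g-bar(x) - f(x))^2 over x
    var = np.mean(gs.var(axis=0))              # average over x of the variance across datasets
    print(f"bias = {bias:.2f}, var = {var:.2f}")

Setting lam = 0 in the same sketch gives the unregularized column of the comparison.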

  9. Regularization in a Nutshell. VC analysis: E_out(g) ≤ E_in(g) + Ω(H). If you use a simpler H and get a good fit, then your E_out is better. Regularization takes this a step further: if you use a 'simpler' h and get a good fit, is your E_out better?

  10. Polynomials of Order Q: A Useful Testbed. H_Q: polynomials of order Q (we're using linear regression). Standard polynomial: z = [1, x, x^2, ..., x^Q]^T and h(x) = w^T z(x) = w_0 + w_1 x + ... + w_Q x^Q. Legendre polynomial: z = [1, L_1(x), L_2(x), ..., L_Q(x)]^T and h(x) = w^T z(x) = w_0 + w_1 L_1(x) + ... + w_Q L_Q(x); the Legendre basis allows us to treat the weights 'independently'. The first few Legendre polynomials: L_1 = x, L_2 = (1/2)(3x^2 − 1), L_3 = (1/2)(5x^3 − 3x), L_4 = (1/8)(35x^4 − 30x^2 + 3), L_5 = (1/8)(63x^5 − ...).
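
To make the Legendre transform concrete, a minimal sketch in Python/numpy; the names Q, x, and Z are illustrative, and legvander is numpy's builder for the matrix whose q-th column is L_q(x).

    import numpy as np
    from numpy.polynomial import legendre

    Q = 5
    x = np.linspace(-1, 1, 11)        # sample inputs in [-1, 1]
    Z = legendre.legvander(x, Q)      # column q of Z holds L_q(x); column 0 is all ones
    # each row of Z is the feature vector z(x) = [1, L_1(x), ..., L_Q(x)]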

  11. recap: Linear Regression. The data (x_1, y_1), ..., (x_N, y_N) are transformed to (z_1, y_1), ..., (z_N, y_N), collected into the matrix Z and the vector y. Minimize E_in(w) = (1/N) Σ_{n=1}^{N} (w^T z_n − y_n)^2 = (1/N)(Zw − y)^T(Zw − y). The linear regression fit is w_lin = (Z^T Z)^{-1} Z^T y.
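
A minimal sketch of this fit in Python/numpy; the data x, y and the order-5 Legendre transform are illustrative assumptions.

    import numpy as np
    from numpy.polynomial import legendre

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=15)                          # illustrative inputs
    y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(15)    # illustrative noisy targets
    Z = legendre.legvander(x, 5)                             # Legendre features up to order 5

    # w_lin = (Z^T Z)^{-1} Z^T y, computed with a linear solve rather than an explicit inverse
    w_lin = np.linalg.solve(Z.T @ Z, Z.T @ y)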

  12. Constraining the Model: H_10 versus H_2. H_10 = {h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + w_3 Φ_3(x) + ... + w_10 Φ_10(x)}. H_2 = {h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + w_3 Φ_3(x) + ... + w_10 Φ_10(x), such that w_3 = w_4 = ... = w_10 = 0}: a 'hard' order constraint that sets some weights to zero. H_2 ⊂ H_10.

  13. Soft Order Constraint. Don't set weights explicitly to zero (e.g. w_3 = 0). Give a budget and let the learning choose: Σ_{q=0}^{Q} w_q^2 ≤ C, where C is the budget for the weights. The soft order constraint allows 'intermediate' models between H_2 and H_10; as C → ∞ we recover H_10.

  14. Soft Order Constrained Model H_C. H_10 = {h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + ... + w_10 Φ_10(x)}. H_2 = {h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + ... + w_10 Φ_10(x), such that w_3 = w_4 = ... = w_10 = 0}. H_C = {h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + ... + w_10 Φ_10(x), such that Σ_{q=0}^{10} w_q^2 ≤ C}: a 'soft' budget constraint on the sum of squared weights. VC perspective: H_C is smaller than H_10, so better generalization.

  15. Fitting the Data. The optimal weights w_reg ∈ H_C ('reg' for regularized) should minimize the in-sample error while staying within the budget: w_reg is a solution to min_w E_in(w) = (1/N)(Zw − y)^T(Zw − y) subject to w^T w ≤ C.

  16. Solving for w_reg. Minimize E_in(w) = (1/N)(Zw − y)^T(Zw − y) subject to w^T w ≤ C. [Figure: contours of E_in = const. centered at w_lin, the sphere w^T w = C, and the vectors w, the surface normal, and ∇E_in.] Observations: 1. The optimal w tries to get as 'close' to w_lin as possible, so it uses the full budget and lies on the surface w^T w = C. 2. At the optimal w, the surface w^T w = C should be perpendicular to ∇E_in; otherwise we could move along the surface and decrease E_in. 3. The normal to the surface w^T w = C is the vector w. 4. The surface is perpendicular to ∇E_in and perpendicular to its normal, so ∇E_in is parallel to the normal but points in the opposite direction: ∇E_in(w_reg) = −2λ_C w_reg, where the Lagrange multiplier λ_C is positive and the 2 is for mathematical convenience.

  17. Solving for w_reg. E_in(w) is minimized subject to w^T w ≤ C ⇔ ∇E_in(w_reg) + 2λ_C w_reg = 0 ⇔ ∇(E_in(w) + λ_C w^T w) evaluated at w = w_reg is 0 ⇔ E_in(w) + λ_C w^T w is minimized unconditionally. There is a correspondence: C ↑ goes with λ_C ↓.

  18. The Augmented Error. Pick a C and minimize E_in(w) subject to w^T w ≤ C, or equivalently pick a λ_C and minimize E_aug(w) = E_in(w) + λ_C w^T w unconditionally. The term λ_C w^T w is a penalty for the 'complexity' of h, measured by the size of the weights. We can pick any budget C; translation: we are free to pick any multiplier λ_C. What's the right C? What's the right λ_C?
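
As a sketch, the augmented error is just the in-sample squared error plus the weight penalty; the function and argument names below are illustrative.

    import numpy as np

    def augmented_error(w, Z, y, lam_C):
        # E_aug(w) = E_in(w) + lambda_C * w^T w, with squared-error E_in
        residual = Z @ w - y
        E_in = residual @ residual / len(y)
        return E_in + lam_C * (w @ w)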

  19. Linear Regression With Soft Order Constraint. E_aug(w) = (1/N)(Zw − y)^T(Zw − y) + λ_C w^T w. It is convenient to set λ_C = λ/N, giving E_aug(w) = (1/N)[(Zw − y)^T(Zw − y) + λ w^T w]. This is called 'weight decay', as the penalty encourages smaller weights. Unconditionally minimize E_aug(w).

  20. The Solution for w_reg. Up to the factor 1/N, ∇E_aug(w) = 2Z^T(Zw − y) + 2λw = 2(Z^T Z + λI)w − 2Z^T y. Setting ∇E_aug(w) = 0 gives w_reg = (Z^T Z + λI)^{-1} Z^T y, where λ determines the amount of regularization. Recall the unconstrained solution (λ = 0): w_lin = (Z^T Z)^{-1} Z^T y.
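
The regularized solution is the same one-line solve as w_lin, with λI added to Z^T Z; a minimal sketch (the function name weight_decay_fit is illustrative).

    import numpy as np

    def weight_decay_fit(Z, y, lam):
        # w_reg = (Z^T Z + lambda*I)^{-1} Z^T y; lam = 0 recovers w_lin
        d = Z.shape[1]
        return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)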

  21. A Little Regularization... Minimizing E_in(w) + (λ/N) w^T w with different λ's. [Figure: Data, Target, and Fit for λ = 0: overfitting.]

  22. ...Goes a Long Way. Minimizing E_in(w) + (λ/N) w^T w with different λ's. [Figure: two panels of Data, Target, and Fit; λ = 0: overfitting; λ = 0.0001: wow!]

  23. Don't Overdose. Minimizing E_in(w) + (λ/N) w^T w with different λ's. [Figure: four panels of Data, Target, and Fit for λ = 0, 0.0001, 0.01, 1; from overfitting at λ = 0 to underfitting at λ = 1.]
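
One way to see the over- to under-fitting transition numerically is to refit with each of the λ values above and measure error on held-out points; a sketch assuming weight_decay_fit and the Legendre features Z, y from the earlier sketches (the test-set check is an illustrative stand-in, not the slide's experiment).

    import numpy as np
    from numpy.polynomial import legendre

    x_test = np.linspace(-1, 1, 200)
    y_test = np.sin(np.pi * x_test)              # illustrative noiseless target for evaluation
    Z_test = legendre.legvander(x_test, 5)

    for lam in [0.0, 1e-4, 1e-2, 1.0]:           # the lambda values from the slide
        w = weight_decay_fit(Z, y, lam)          # Z, y from the earlier linear-regression sketch
        e_test = np.mean((Z_test @ w - y_test) ** 2)
        print(f"lambda = {lam:g}: test error = {e_test:.3f}")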

  24. Overfitting and Underfitting. [Figure: expected E_out versus the regularization parameter λ (from 0 to 2); too little regularization overfits, too much underfits, with a minimum in between.]

  25. More Noise Needs More Medicine. [Figure: expected E_out versus the regularization parameter λ for noise levels σ^2 = 0, 0.25, 0.5; the noisier the data, the more regularization is needed.]
