
Learning From Data, Lecture 12: Regularization - PowerPoint PPT Presentation

recap: Overfitting — fitting the data more than is warranted. Learning From Data, Lecture 12: Regularization, by M. Magdon-Ismail (CSCI 4100/6100). Outline: Constraining the Model; Weight Decay; the Augmented Error.


1. Recap and motivation (slides 2–4).

   Title slide — Learning From Data, Lecture 12: Regularization (M. Magdon-Ismail, CSCI 4100/6100). Outline: Constraining the Model, Weight Decay, Augmented Error. recap: Overfitting is fitting the data more than is warranted.

   recap: Noise is the part of y we cannot model. Stochastic noise: y = f(x) + stochastic noise. Deterministic noise: y = h*(x) + deterministic noise, where h* is the best hypothesis in the model. Both stochastic and deterministic noise hurt learning. A human is good at extracting the simple pattern while ignoring the noise and complications; a computer pays equal attention to all pixels and needs help simplifying (features, regularization).

   What is regularization? A cure for our tendency to fit (get distracted by) the noise, hence improving E_out. How does it work? By constraining the model so that we cannot fit the noise — putting on the brakes. Side effects? The medication will have side effects: if we cannot fit the noise, maybe we cannot fit f (the signal) either.

2. Constraining the model: does it help? (slides 5–8).

   Fit the same data twice, once unconstrained and once with the weights constrained to be smaller, and compare which hypothesis tracks the target better. And the winner is: the constrained fit.

   On the earlier sin(x) bias–variance example (fitting a line to two-point datasets from a sinusoid), regularization changes the bias and variance of the average hypothesis ḡ(x):
     - no regularization: bias = 0.21, var = 1.69
     - regularization:    bias = 0.23, var = 0.33
   The bias goes up a little (the side effect); the variance drop is dramatic (the treatment). For reference, the constant model had bias = 0.50 and var = 0.25. A Monte Carlo sketch of this comparison follows below.
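A minimal Monte Carlo sketch of that comparison, assuming (as in the book's bias–variance example) that the target is sin(πx) on [−1, 1], each dataset has two points, the model is the line h(x) = w0 + w1·x, and weight decay uses an illustrative λ = 0.1; the slides do not state these settings explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n_datasets = 20_000                      # number of simulated two-point datasets
lam = 0.1                                # assumed weight-decay amount (illustrative)

x_test = np.linspace(-1, 1, 201)
Phi_test = np.column_stack([np.ones_like(x_test), x_test])   # features (1, x)
f_test = np.sin(np.pi * x_test)                              # assumed target

for lam_ in (0.0, lam):                  # no regularization vs. weight decay
    preds = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        x = rng.uniform(-1, 1, size=2)                       # a two-point dataset
        y = np.sin(np.pi * x)
        Phi = np.column_stack([np.ones_like(x), x])
        # weight-decay fit: w = (Phi^T Phi + lam I)^{-1} Phi^T y
        w = np.linalg.solve(Phi.T @ Phi + lam_ * np.eye(2), Phi.T @ y)
        preds[i] = Phi_test @ w
    g_bar = preds.mean(axis=0)                               # average hypothesis
    bias = np.mean((g_bar - f_test) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"lambda = {lam_:g}: bias = {bias:.2f}, var = {var:.2f}")
```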

3. Regularization in a nutshell; polynomials as a testbed (slides 9–12).

   VC analysis: E_out(g) ≤ E_in(g) + Ω(H). If you use a simpler H and get a good fit, then your E_out is better. Regularization takes this a step further: if you use a 'simpler' h and get a good fit, is your E_out better?

   Polynomials of order Q, H_Q, are a useful testbed, and we're using linear regression on a feature transform z(x):
     - Standard polynomial: z = (1, x, x^2, ..., x^Q)^T, so h(x) = w^T z(x) = w_0 + w_1 x + ... + w_Q x^Q.
     - Legendre polynomial: z = (1, L_1(x), L_2(x), ..., L_Q(x))^T, so h(x) = w_0 + w_1 L_1(x) + ... + w_Q L_Q(x). The Legendre basis allows us to treat the weights 'independently' (the L_q are orthogonal on [-1, 1]).
   The first few Legendre polynomials: L_1(x) = x, L_2(x) = (3x^2 - 1)/2, L_3(x) = (5x^3 - 3x)/2, L_4(x) = (35x^4 - 30x^2 + 3)/8, L_5(x) = (63x^5 - 70x^3 + 15x)/8.

   Constraining the model, H_10 vs. H_2:
     H_10 = { h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + ... + w_10 Φ_10(x) }
     H_2  = { h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + ... + w_10 Φ_10(x)  such that  w_3 = w_4 = ... = w_10 = 0 }
   This is a 'hard' order constraint that sets some weights to zero, and H_2 ⊂ H_10.

   recap: Linear regression. The data (x_1, y_1), ..., (x_N, y_N) are transformed to (z_1, y_1), ..., (z_N, y_N) and collected into the matrix Z and vector y. Minimize
     E_in(w) = (1/N) Σ_{n=1}^{N} (w^T z_n - y_n)^2 = (1/N) (Zw - y)^T (Zw - y),
   whose solution is the linear regression fit w_lin = (Z^T Z)^{-1} Z^T y. A small sketch of this fit in the Legendre basis follows below.
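A minimal sketch of the unconstrained linear regression fit in the Legendre basis, using NumPy's Legendre helpers; the data below are made up purely to exercise the formula w_lin = (Z^T Z)^{-1} Z^T y:

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_features(x, Q):
    """Z whose n-th row is z(x_n) = (L_0(x_n), L_1(x_n), ..., L_Q(x_n))."""
    return legendre.legvander(x, Q)

# made-up data: a noisy sinusoid on [-1, 1]
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 15)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(x.size)

Q = 10
Z = legendre_features(x, Q)
# linear regression fit: w_lin = (Z^T Z)^{-1} Z^T y
w_lin = np.linalg.solve(Z.T @ Z, Z.T @ y)
h = lambda x_new: legendre_features(np.atleast_1d(x_new), Q) @ w_lin   # the fitted hypothesis
print(w_lin)
```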

4. The soft order constrained model H_C (slides 13–16).

   Instead of setting weights explicitly to zero (e.g. w_3 = 0), give the weights a budget and let the learning choose:
     H_C = { h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + ... + w_10 Φ_10(x)  such that  Σ_{q=0} w_q^2 ≤ C }.
   This is a 'soft' budget constraint on the sum of squared weights. As C → ∞, H_C becomes H_10; the soft order constraint allows 'intermediate' models between H_2 and H_10. VC perspective: H_C is smaller than H_10, which means better generalization.

   Fitting the data means solving the constrained problem
     min E_in(w) = (1/N) (Zw - y)^T (Zw - y)   subject to   w^T w ≤ C.
   The regularized weights w_reg ∈ H_C should minimize the in-sample error while staying within the budget.

   Solving for w_reg — observations (assuming w_lin violates the budget):
     1. The optimal w tries to get as 'close' to w_lin as possible, so it uses the full budget and lies on the surface w^T w = C.
     2. At the optimal w, the surface w^T w = C should be perpendicular to ∇E_in; otherwise we could move along the surface and decrease E_in.
     3. The normal to the surface w^T w = C is the vector w itself.
     4. The surface is perpendicular to ∇E_in and to the normal, so ∇E_in is parallel to the normal, but in the opposite direction.
   Therefore w_reg is a solution of the constrained problem satisfying
     ∇E_in(w_reg) = -2 λ_C w_reg,
   where λ_C, the Lagrange multiplier, is positive (the factor 2 is for mathematical convenience). A numerical check of this condition follows below.
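A small numerical check of this geometry on made-up data, using SciPy's SLSQP solver; the budget C = 1 and the large data-generating weights are arbitrary choices made so that the constraint binds:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, d, C = 50, 6, 1.0                      # arbitrary sizes and budget
Z = rng.standard_normal((N, d))
w_true = rng.standard_normal(d) * 3.0     # large weights so w_lin lies outside the budget
y = Z @ w_true + 0.1 * rng.standard_normal(N)

E_in = lambda w: np.mean((Z @ w - y) ** 2)
grad_E_in = lambda w: (2.0 / N) * Z.T @ (Z @ w - y)

res = minimize(E_in, np.zeros(d), jac=grad_E_in, method="SLSQP",
               constraints=[{"type": "ineq",
                             "fun": lambda w: C - w @ w,      # w^T w <= C
                             "jac": lambda w: -2.0 * w}])
w_reg = res.x

print("budget used, w^T w =", w_reg @ w_reg)                  # ~ C: the full budget is spent
lam_C = -(grad_E_in(w_reg) @ w_reg) / (2.0 * (w_reg @ w_reg))
print("Lagrange multiplier lambda_C =", lam_C)                # positive
# gradient is anti-parallel to w_reg:  grad E_in(w_reg) + 2 lambda_C w_reg ~ 0
print("residual norm =", np.linalg.norm(grad_E_in(w_reg) + 2.0 * lam_C * w_reg))
```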

5. The augmented error and weight decay (slides 17–20).

   Picking a C and minimizing E_in(w) subject to w^T w ≤ C is equivalent to an unconstrained problem:
     E_in(w) is minimized subject to w^T w ≤ C
     ⇔ ∇E_in(w_reg) + 2 λ_C w_reg = 0
     ⇔ ∇( E_in(w) + λ_C w^T w ) evaluated at w = w_reg is 0
     ⇔ E_in(w) + λ_C w^T w is minimized unconditionally.
   So instead: pick a λ_C and minimize the augmented error
     E_aug(w) = E_in(w) + λ_C w^T w
   unconditionally. The term λ_C w^T w is a penalty for the 'complexity' of h, measured by the size of the weights. We could pick any budget C; translation: we are free to pick any multiplier λ_C. There is a correspondence C ↑ ⇔ λ_C ↓, and "what's the right C?" becomes "what's the right λ_C?".

   Linear regression with the soft order constraint:
     E_aug(w) = (1/N) (Zw - y)^T (Zw - y) + λ_C w^T w.
   It is convenient to set λ_C = λ/N, so that
     E_aug(w) = (1/N) [ (Zw - y)^T (Zw - y) + λ w^T w ].
   This is called 'weight decay' because the penalty encourages smaller weights; λ determines the amount of regularization.

   The solution for w_reg: unconditionally minimizing E_aug gives (dropping the common 1/N factor)
     ∇E_aug(w) = 2 Z^T (Zw - y) + 2 λ w = 2 (Z^T Z + λI) w - 2 Z^T y,
   and setting ∇E_aug(w) = 0 yields
     w_reg = (Z^T Z + λI)^{-1} Z^T y.
   Recall the unconstrained solution (λ = 0): w_lin = (Z^T Z)^{-1} Z^T y. A short sketch of this closed form, and of the C ↔ λ correspondence, follows below.
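A minimal sketch of the weight-decay closed form on made-up data; the λ values are arbitrary, and the printed w^T w shows the implied budget C shrinking as λ grows (λ ↑ ⇔ C ↓):

```python
import numpy as np

def weight_decay_fit(Z, y, lam):
    """Minimize E_aug(w) = (1/N)[(Zw - y)^T (Zw - y) + lam w^T w]."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)   # (Z^T Z + lam I)^{-1} Z^T y

# made-up regression problem
rng = np.random.default_rng(0)
Z = rng.standard_normal((30, 5))
y = Z @ rng.standard_normal(5) + 0.1 * rng.standard_normal(30)

for lam in (0.0, 0.01, 0.1, 1.0, 10.0):
    w_reg = weight_decay_fit(Z, y, lam)
    print(f"lambda = {lam:>5}: implied budget C = w^T w = {w_reg @ w_reg:.3f}")
```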

6. Choosing the amount of regularization (slides 21–24).

   A little regularization goes a long way. Minimizing E_in(w) + (λ/N) w^T w with different λ's: λ = 0 overfits badly, while λ = 0.0001 already gives a dramatically better fit ("Wow!").

   But don't overdose: as λ increases through 0, 0.0001, 0.01, 1, the fit goes from overfitting to underfitting.

   [Figure: expected E_out versus the regularization parameter λ — overfitting at small λ, underfitting at large λ, with the minimum expected E_out at a small intermediate λ.]
   A sketch of this λ sweep follows below.
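A minimal sketch of the λ sweep, reusing the Legendre setup from item 3 on made-up noisy data (target sin(πx), 15 points, order-10 fit — all illustrative assumptions) and estimating E_out on a dense grid:

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)
Q, N, sigma = 10, 15, 0.5                      # assumed: order-10 model, 15 noisy points
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + sigma * rng.standard_normal(N)

x_test = np.linspace(-1, 1, 500)               # dense grid for an E_out estimate
f_test = np.sin(np.pi * x_test)
Z = legendre.legvander(x, Q)
Z_test = legendre.legvander(x_test, Q)

for lam in (0.0, 1e-4, 1e-2, 1.0):
    w_reg = np.linalg.solve(Z.T @ Z + lam * np.eye(Q + 1), Z.T @ y)   # weight decay
    e_out = np.mean((Z_test @ w_reg - f_test) ** 2)
    print(f"lambda = {lam:g}: estimated E_out = {e_out:.3f}")
# typically lambda = 0 overfits badly, lambda = 1 underfits,
# and a small intermediate lambda gives the lowest E_out
```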
