Learning From Data, Lecture 12: Regularization


  1. Learning From Data, Lecture 12: Regularization. Constraining the Model, Weight Decay, Augmented Error. M. Magdon-Ismail, CSCI 4100/6100.

  2. recap: Overfitting. Fitting the data more than is warranted. [Figure: Data, Target, and Fit plotted as y versus x; the fit chases the noisy data away from the target.]

  3. recap: Noise is Part of y We Cannot Model. [Figure: two panels plotting y versus x, Stochastic Noise and Deterministic Noise; y = f(x) + stochastic noise, and y = h*(x) + deterministic noise, where h* is the best approximation to f within the model.] Stochastic and deterministic noise both hurt learning. Human: good at extracting the simple pattern, ignoring the noise and complications. Computer: pays equal attention to all pixels; needs help simplifying (features, regularization).

  4. Regularization. What is regularization? A cure for our tendency to fit (get distracted by) the noise, hence improving E_out. How does it work? By constraining the model so that we cannot fit the noise (putting on the brakes). Side effects? The medication will have side effects: if we cannot fit the noise, maybe we cannot fit f (the signal)?

  5. Constraining the Model: Does it Help? [Figure: a fit to noisy data, y versus x.] ... and the winner is:

  6. Constraining the Model: Does it Help? [Figure: two panels, y versus x; the unconstrained fit versus the fit with the weights constrained to be smaller.] ... and the winner is:

  7. Bias Goes Up A Little. [Figure: two panels plotting the average hypothesis ḡ(x) against sin(x); no regularization: bias = 0.21; regularization: bias = 0.23 (the side effect).] (The constant model had bias = 0.5 and var = 0.25.)

  8. Variance Drop is Dramatic! [Figure: the same two panels of ḡ(x) against sin(x); no regularization: bias = 0.21, var = 1.69; regularization: bias = 0.23 (the side effect), var = 0.33 (the treatment).] (The constant model had bias = 0.5 and var = 0.25.)
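
The bias and variance numbers above can be estimated with a short simulation: learn on many small datasets, average the resulting hypotheses, and compare against the target. A minimal sketch in Python/numpy, assuming a setup similar to the course's running example (target sin(πx), tiny datasets of N = 2 points, a linear fit with weight decay); the function name fit_line, the choice lam = 0.1, and the number of datasets are illustrative assumptions, so the printed numbers will only roughly track the slide's.

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(np.pi * x)            # assumed target function
    lam = 0.1                                  # illustrative weight-decay amount
    xs = np.linspace(-1, 1, 200)               # grid over which bias/var are averaged

    def fit_line(x, y, lam):
        Z = np.column_stack([np.ones_like(x), x])                  # h(x) = w0 + w1*x
        return np.linalg.solve(Z.T @ Z + lam * np.eye(2), Z.T @ y)

    gs = []
    for _ in range(10000):                     # many datasets of N = 2 points each
        x = rng.uniform(-1, 1, size=2)
        w = fit_line(x, f(x), lam)
        gs.append(w[0] + w[1] * xs)            # the learned g(x) evaluated on the grid

    gs = np.array(gs)
    gbar = gs.mean(axis=0)                     # average hypothesis g-bar(x)
    bias = np.mean((gbar - f(xs)) ** 2)        # average of (g-bar(x) - f(x))^2 over x
    var = np.mean(gs.var(axis=0))              # average over x of the variance across datasets
    print(f"bias = {bias:.2f}, var = {var:.2f}")

Setting lam = 0 in the same sketch gives the unregularized column of the comparison.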

  9. Regularization in a Nutshell. VC analysis: E_out(g) ≤ E_in(g) + Ω(H). If you use a simpler H and get a good fit, then your E_out is better. Regularization takes this a step further: if you use a 'simpler' h and get a good fit, is your E_out better?

  10. Polynomials of Order Q: A Useful Testbed. H_Q: polynomials of order Q (we're using linear regression). Standard polynomial: z = [1, x, x^2, ..., x^Q]^T and h(x) = w^T z(x) = w_0 + w_1 x + ... + w_Q x^Q. Legendre polynomial: z = [1, L_1(x), L_2(x), ..., L_Q(x)]^T and h(x) = w^T z(x) = w_0 + w_1 L_1(x) + ... + w_Q L_Q(x); the Legendre basis allows us to treat the weights 'independently'. The first few Legendre polynomials: L_1 = x, L_2 = (1/2)(3x^2 − 1), L_3 = (1/2)(5x^3 − 3x), L_4 = (1/8)(35x^4 − 30x^2 + 3), L_5 = (1/8)(63x^5 − ...).
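
To make the Legendre transform concrete, a minimal sketch in Python/numpy; the names Q, x, and Z are illustrative, and legvander is numpy's builder for the matrix whose q-th column is L_q(x).

    import numpy as np
    from numpy.polynomial import legendre

    Q = 5
    x = np.linspace(-1, 1, 11)        # sample inputs in [-1, 1]
    Z = legendre.legvander(x, Q)      # column q of Z holds L_q(x); column 0 is all ones
    # each row of Z is the feature vector z(x) = [1, L_1(x), ..., L_Q(x)]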

  11. recap: Linear Regression. The data (x_1, y_1), ..., (x_N, y_N) are transformed to (z_1, y_1), ..., (z_N, y_N), collected into the matrix Z and the vector y. Minimize E_in(w) = (1/N) Σ_{n=1}^{N} (w^T z_n − y_n)^2 = (1/N)(Zw − y)^T(Zw − y). The linear regression fit is w_lin = (Z^T Z)^{-1} Z^T y.
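
A minimal sketch of this fit in Python/numpy; the data x, y and the order-5 Legendre transform are illustrative assumptions.

    import numpy as np
    from numpy.polynomial import legendre

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=15)                          # illustrative inputs
    y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(15)    # illustrative noisy targets
    Z = legendre.legvander(x, 5)                             # Legendre features up to order 5

    # w_lin = (Z^T Z)^{-1} Z^T y, computed with a linear solve rather than an explicit inverse
    w_lin = np.linalg.solve(Z.T @ Z, Z.T @ y)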

  12. Constraining the Model: H_10 versus H_2. H_10 = {h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + w_3 Φ_3(x) + ... + w_10 Φ_10(x)}. H_2 = {h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + w_3 Φ_3(x) + ... + w_10 Φ_10(x), such that w_3 = w_4 = ... = w_10 = 0}: a 'hard' order constraint that sets some weights to zero. H_2 ⊂ H_10.

  13. Soft Order Constraint. Don't set weights explicitly to zero (e.g. w_3 = 0). Give a budget and let the learning choose: Σ_{q=0}^{Q} w_q^2 ≤ C, where C is the budget for the weights. The soft order constraint allows 'intermediate' models between H_2 and H_10; as C → ∞ we recover H_10.

  14. Soft Order Constrained Model H_C. H_10 = {h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + ... + w_10 Φ_10(x)}. H_2 = {h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + ... + w_10 Φ_10(x), such that w_3 = w_4 = ... = w_10 = 0}. H_C = {h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + ... + w_10 Φ_10(x), such that Σ_{q=0}^{10} w_q^2 ≤ C}: a 'soft' budget constraint on the sum of squared weights. VC perspective: H_C is smaller than H_10, so better generalization.

  15. Fitting the Data. The optimal weights w_reg ∈ H_C ('reg' for regularized) should minimize the in-sample error while staying within the budget: w_reg is a solution to min_w E_in(w) = (1/N)(Zw − y)^T(Zw − y) subject to w^T w ≤ C.

  16. Solving for w_reg. Minimize E_in(w) = (1/N)(Zw − y)^T(Zw − y) subject to w^T w ≤ C. [Figure: contours of E_in = const. centered at w_lin, the sphere w^T w = C, and the vectors w, the surface normal, and ∇E_in.] Observations: 1. The optimal w tries to get as 'close' to w_lin as possible, so it uses the full budget and lies on the surface w^T w = C. 2. At the optimal w, the surface w^T w = C should be perpendicular to ∇E_in; otherwise we could move along the surface and decrease E_in. 3. The normal to the surface w^T w = C is the vector w. 4. The surface is perpendicular to ∇E_in and perpendicular to its normal, so ∇E_in is parallel to the normal but points in the opposite direction: ∇E_in(w_reg) = −2λ_C w_reg, where the Lagrange multiplier λ_C is positive and the 2 is for mathematical convenience.

  17. Solving for w_reg. E_in(w) is minimized subject to w^T w ≤ C ⇔ ∇E_in(w_reg) + 2λ_C w_reg = 0 ⇔ ∇(E_in(w) + λ_C w^T w) evaluated at w = w_reg is 0 ⇔ E_in(w) + λ_C w^T w is minimized unconditionally. There is a correspondence: C ↑ goes with λ_C ↓.

  18. The Augmented Error. Pick a C and minimize E_in(w) subject to w^T w ≤ C, or equivalently pick a λ_C and minimize E_aug(w) = E_in(w) + λ_C w^T w unconditionally. The term λ_C w^T w is a penalty for the 'complexity' of h, measured by the size of the weights. We can pick any budget C; translation: we are free to pick any multiplier λ_C. What's the right C? What's the right λ_C?
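
As a sketch, the augmented error is just the in-sample squared error plus the weight penalty; the function and argument names below are illustrative.

    import numpy as np

    def augmented_error(w, Z, y, lam_C):
        # E_aug(w) = E_in(w) + lambda_C * w^T w, with squared-error E_in
        residual = Z @ w - y
        E_in = residual @ residual / len(y)
        return E_in + lam_C * (w @ w)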

  19. Linear Regression With Soft Order Constraint. E_aug(w) = (1/N)(Zw − y)^T(Zw − y) + λ_C w^T w. It is convenient to set λ_C = λ/N, giving E_aug(w) = (1/N)[(Zw − y)^T(Zw − y) + λ w^T w]. This is called 'weight decay', as the penalty encourages smaller weights. Unconditionally minimize E_aug(w).

  20. The Solution for w_reg. Up to the factor 1/N, ∇E_aug(w) = 2Z^T(Zw − y) + 2λw = 2(Z^T Z + λI)w − 2Z^T y. Setting ∇E_aug(w) = 0 gives w_reg = (Z^T Z + λI)^{-1} Z^T y, where λ determines the amount of regularization. Recall the unconstrained solution (λ = 0): w_lin = (Z^T Z)^{-1} Z^T y.
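
The regularized solution is the same one-line solve as w_lin, with λI added to Z^T Z; a minimal sketch (the function name weight_decay_fit is illustrative).

    import numpy as np

    def weight_decay_fit(Z, y, lam):
        # w_reg = (Z^T Z + lambda*I)^{-1} Z^T y; lam = 0 recovers w_lin
        d = Z.shape[1]
        return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)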

  21. A Little Regularization... Minimizing E_in(w) + (λ/N) w^T w with different λ's. [Figure: Data, Target, and Fit for λ = 0: overfitting.]

  22. ...Goes a Long Way. Minimizing E_in(w) + (λ/N) w^T w with different λ's. [Figure: two panels of Data, Target, and Fit; λ = 0: overfitting; λ = 0.0001: wow!]

  23. Don't Overdose. Minimizing E_in(w) + (λ/N) w^T w with different λ's. [Figure: four panels of Data, Target, and Fit for λ = 0, 0.0001, 0.01, 1; from overfitting at λ = 0 to underfitting at λ = 1.]
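
One way to see the over- to under-fitting transition numerically is to refit with each of the λ values above and measure error on held-out points; a sketch assuming weight_decay_fit and the Legendre features Z, y from the earlier sketches (the test-set check is an illustrative stand-in, not the slide's experiment).

    import numpy as np
    from numpy.polynomial import legendre

    x_test = np.linspace(-1, 1, 200)
    y_test = np.sin(np.pi * x_test)              # illustrative noiseless target for evaluation
    Z_test = legendre.legvander(x_test, 5)

    for lam in [0.0, 1e-4, 1e-2, 1.0]:           # the lambda values from the slide
        w = weight_decay_fit(Z, y, lam)          # Z, y from the earlier linear-regression sketch
        e_test = np.mean((Z_test @ w - y_test) ** 2)
        print(f"lambda = {lam:g}: test error = {e_test:.3f}")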

  24. Overfitting and Underfitting. [Figure: expected E_out versus the regularization parameter λ (from 0 to 2); too little regularization overfits, too much underfits, with a minimum in between.]

  25. More Noise Needs More Medicine. [Figure: expected E_out versus the regularization parameter λ for noise levels σ^2 = 0, 0.25, 0.5; the noisier the data, the more regularization is needed.]
