SLIDE 1

Deep Learning Basics Lecture 3: Regularization I

Princeton University COS 495
Instructor: Yingyu Liang

SLIDE 2

What is regularization?

  • In general: any method to prevent overfitting or help the optimization
  • Specifically: additional terms in the training optimization objective to

prevent overfitting or help the optimization

slide-3
SLIDE 3

Review: overfitting

SLIDE 4

𝑒 = sin 2πœŒπ‘¦ + πœ—

Figure from Pattern Recognition and Machine Learning, Bishop

Overfitting example: regression using polynomials

SLIDE 5

Overfitting example: regression using polynomials

Figure from Pattern Recognition and Machine Learning, Bishop

SLIDE 6

Overfitting

  • Empirical loss and expected loss are different
  • Smaller the data set, larger the difference between the two
  • Larger the hypothesis class, easier to find a hypothesis that fits the

difference between the two

  • Thus has small training error but large test error (overfitting)
slide-7
SLIDE 7

Prevent overfitting

  • Larger data set helps
  • Throwing away useless hypotheses also helps
  • Classical regularization: some principal ways to constrain hypotheses
  • Other types of regularization: data augmentation, early stopping, etc.
slide-8
SLIDE 8

Different views of regularization

SLIDE 9

Regularization as hard constraint

  • Training objective

min

𝑔

ΰ·  𝑀 𝑔 = 1 π‘œ ෍

𝑗=1 π‘œ

π‘š(𝑔, 𝑦𝑗, 𝑧𝑗) subject to: 𝑔 ∈ π“˜

  • When parametrized

min

πœ„

ΰ·  𝑀 πœ„ = 1 π‘œ ෍

𝑗=1 π‘œ

π‘š(πœ„, 𝑦𝑗, 𝑧𝑗) subject to: πœ„ ∈ 𝛻

slide-10
SLIDE 10

Regularization as hard constraint

  • When 𝛻 measured by some quantity 𝑆

min

πœ„

ΰ·  𝑀 πœ„ = 1 π‘œ ෍

𝑗=1 π‘œ

π‘š(πœ„, 𝑦𝑗, 𝑧𝑗) subject to: 𝑆 πœ„ ≀ 𝑠

  • Example: π‘š2 regularization

min

πœ„

ΰ·  𝑀 πœ„ = 1 π‘œ ෍

𝑗=1 π‘œ

π‘š(πœ„, 𝑦𝑗, 𝑧𝑗) subject to: | πœ„| 2

2 ≀ 𝑠2
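
A minimal sketch of how the hard-constraint view can be used in practice is projected gradient descent: take an unconstrained gradient step on the empirical loss, then project back onto the feasible set {ΞΈ : β€–ΞΈβ€–β ≀ r}. The loss, radius, and function names below are illustrative assumptions, not from the slides.

```python
import numpy as np

def project_l2_ball(theta, r):
    """Project theta onto the l2 ball {theta : ||theta||_2 <= r}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= r else theta * (r / norm)

def projected_gradient_descent(grad_loss, theta0, r, lr=0.1, steps=100):
    """Minimize the empirical loss subject to the hard constraint ||theta||_2 <= r."""
    theta = project_l2_ball(theta0, r)
    for _ in range(steps):
        theta = theta - lr * grad_loss(theta)   # unconstrained gradient step
        theta = project_l2_ball(theta, r)       # project back onto the feasible set
    return theta

# Toy example: quadratic loss L(theta) = ||theta - c||^2 whose optimum lies outside the ball.
c = np.array([3.0, 4.0])
theta_hat = projected_gradient_descent(lambda th: 2 * (th - c), np.zeros(2), r=1.0)
print(theta_hat, np.linalg.norm(theta_hat))  # ends up on the boundary, norm ~= 1
```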

SLIDE 11

Regularization as soft constraint

  • The hard-constraint optimization is equivalent to soft-constraint

min

πœ„

ΰ·  𝑀𝑆 πœ„ = 1 π‘œ ෍

𝑗=1 π‘œ

π‘š(πœ„, 𝑦𝑗, 𝑧𝑗) + πœ‡βˆ—π‘†(πœ„) for some regularization parameter πœ‡βˆ— > 0

  • Example: π‘š2 regularization

min

πœ„

ΰ·  𝑀𝑆 πœ„ = 1 π‘œ ෍

𝑗=1 π‘œ

π‘š(πœ„, 𝑦𝑗, 𝑧𝑗) + πœ‡βˆ—| πœ„| 2

2

slide-12
SLIDE 12

Regularization as soft constraint

  • Showed by Lagrangian multiplier method

β„’ πœ„, πœ‡ ≔ ΰ·  𝑀 πœ„ + πœ‡[𝑆 πœ„ βˆ’ 𝑠]

  • Suppose πœ„βˆ— is the optimal for hard-constraint optimization

πœ„βˆ— = argmin

πœ„

max

πœ‡β‰₯0 β„’ πœ„, πœ‡ ≔ ΰ· 

𝑀 πœ„ + πœ‡[𝑆 πœ„ βˆ’ 𝑠]

  • Suppose πœ‡βˆ— is the corresponding optimal for max

πœ„βˆ— = argmin

πœ„

β„’ πœ„, πœ‡βˆ— ≔ ΰ·  𝑀 πœ„ + πœ‡βˆ—[𝑆 πœ„ βˆ’ 𝑠]

SLIDE 13

Regularization as Bayesian prior

  • Bayesian view: everything is a distribution
  • Prior over the hypotheses: π‘ž πœ„
  • Posterior over the hypotheses: π‘ž πœ„ | {𝑦𝑗, 𝑧𝑗}
  • Likelihood: π‘ž

𝑦𝑗, 𝑧𝑗 πœ„)

  • Bayesian rule:

π‘ž πœ„ | {𝑦𝑗, 𝑧𝑗} = π‘ž πœ„ π‘ž 𝑦𝑗, 𝑧𝑗 πœ„) π‘ž({𝑦𝑗, 𝑧𝑗})

slide-14
SLIDE 14

Regularization as Bayesian prior

  • Bayesian rule:

π‘ž πœ„ | {𝑦𝑗, 𝑧𝑗} = π‘ž πœ„ π‘ž 𝑦𝑗, 𝑧𝑗 πœ„) π‘ž({𝑦𝑗, 𝑧𝑗})

  • Maximum A Posteriori (MAP):

max

πœ„

log π‘ž πœ„ | {𝑦𝑗, 𝑧𝑗} = max

πœ„

log π‘ž πœ„ + log π‘ž 𝑦𝑗, 𝑧𝑗 | πœ„ Regularization MLE loss

slide-15
SLIDE 15

Regularization as Bayesian prior

  • Example: π‘š2 loss with π‘š2 regularization

min

πœ„

ΰ·  𝑀𝑆 πœ„ = 1 π‘œ ෍

𝑗=1 π‘œ

𝑔

πœ„ 𝑦𝑗 βˆ’ 𝑧𝑗 2 + πœ‡βˆ—| πœ„| 2 2

  • Correspond to a normal likelihood π‘ž 𝑦, 𝑧 | πœ„ and a normal prior π‘ž(πœ„)
SLIDE 16

Three views

  • Typical choice for optimization: soft-constraint

min

πœ„

ΰ·  𝑀𝑆 πœ„ = ΰ·  𝑀 πœ„ + πœ‡π‘†(πœ„)

  • Hard constraint and Bayesian view: conceptual; or used for derivation
slide-17
SLIDE 17

Three views

  • Hard-constraint preferred if
  • Know the explicit bound 𝑆 πœ„ ≀ 𝑠
  • Soft-constraint causes trapped in a local minima with small πœ„
  • Projection back to feasible set leads to stability
  • Bayesian view preferred if
  • Know the prior distribution
slide-18
SLIDE 18

Some examples

SLIDE 19

Classical regularization

  • Norm penalty
  • π‘š2 regularization
  • π‘š1 regularization
  • Robustness to noise
slide-20
SLIDE 20

π‘š2 regularization

min

πœ„

ΰ·  𝑀𝑆 πœ„ = ΰ·  𝑀(πœ„) + 𝛽 2 | πœ„| 2

2

  • Effect on (stochastic) gradient descent
  • Effect on the optimal solution
SLIDE 21

Effect on gradient descent

  • Gradient of regularized objective

𝛼෠ 𝑀𝑆 πœ„ = 𝛼෠ 𝑀(πœ„) + π›½πœ„

  • Gradient descent update

πœ„ ← πœ„ βˆ’ πœƒπ›Όΰ·  𝑀𝑆 πœ„ = πœ„ βˆ’ πœƒ 𝛼෠ 𝑀 πœ„ βˆ’ πœƒπ›½πœ„ = 1 βˆ’ πœƒπ›½ πœ„ βˆ’ πœƒ 𝛼෠ 𝑀 πœ„

  • Terminology: weight decay
SLIDE 22

Effect on the optimal solution

  • Consider a quadratic approximation around πœ„βˆ—

ΰ·  𝑀 πœ„ β‰ˆ ΰ·  𝑀 πœ„βˆ— + πœ„ βˆ’ πœ„βˆ— π‘ˆπ›Όΰ·  𝑀 πœ„βˆ— + 1 2 πœ„ βˆ’ πœ„βˆ— π‘ˆπΌ πœ„ βˆ’ πœ„βˆ—

  • Since πœ„βˆ— is optimal, 𝛼෠

𝑀 πœ„βˆ— = 0 ΰ·  𝑀 πœ„ β‰ˆ ΰ·  𝑀 πœ„βˆ— + 1 2 πœ„ βˆ’ πœ„βˆ— π‘ˆπΌ πœ„ βˆ’ πœ„βˆ— 𝛼෠ 𝑀 πœ„ β‰ˆ 𝐼 πœ„ βˆ’ πœ„βˆ—

slide-23
SLIDE 23

Effect on the optimal solution

  • Gradient of regularized objective

𝛼෠ 𝑀𝑆 πœ„ β‰ˆ 𝐼 πœ„ βˆ’ πœ„βˆ— + π›½πœ„

  • On the optimal πœ„π‘†

βˆ—

0 = 𝛼෠ 𝑀𝑆 πœ„π‘†

βˆ— β‰ˆ 𝐼 πœ„π‘† βˆ— βˆ’ πœ„βˆ— + π›½πœ„π‘† βˆ—

πœ„π‘†

βˆ— β‰ˆ 𝐼 + 𝛽𝐽 βˆ’1πΌπœ„βˆ—

slide-24
SLIDE 24

Effect on the optimal solution

  • The optimal

πœ„π‘†

βˆ— β‰ˆ 𝐼 + 𝛽𝐽 βˆ’1πΌπœ„βˆ—

  • Suppose 𝐼 has eigen-decomposition 𝐼 = π‘…Ξ›π‘…π‘ˆ

πœ„π‘†

βˆ— β‰ˆ 𝐼 + 𝛽𝐽 βˆ’1πΌπœ„βˆ— = 𝑅 Ξ› + 𝛽𝐽 βˆ’1Ξ›π‘…π‘ˆπœ„βˆ—

  • Effect: rescale along eigenvectors of 𝐼
SLIDE 25

Effect on the optimal solution

Figure from Deep Learning, Goodfellow, Bengio and Courville

Notations: πœ„βˆ— = π‘₯βˆ—, πœ„π‘†

βˆ— = ΰ·₯

π‘₯

slide-26
SLIDE 26

π‘š1 regularization

min

πœ„

ΰ·  𝑀𝑆 πœ„ = ΰ·  𝑀(πœ„) + 𝛽| πœ„ |1

  • Effect on (stochastic) gradient descent
  • Effect on the optimal solution
SLIDE 27

Effect on gradient descent

  • Gradient of regularized objective

𝛼෠ 𝑀𝑆 πœ„ = 𝛼෠ 𝑀 πœ„ + 𝛽 sign(πœ„) where sign applies to each element in πœ„

  • Gradient descent update

πœ„ ← πœ„ βˆ’ πœƒπ›Όΰ·  𝑀𝑆 πœ„ = πœ„ βˆ’ πœƒ 𝛼෠ 𝑀 πœ„ βˆ’ πœƒπ›½ sign(πœ„)

SLIDE 28

Effect on the optimal solution

  • Consider a quadratic approximation around πœ„βˆ—

ΰ·  𝑀 πœ„ β‰ˆ ΰ·  𝑀 πœ„βˆ— + πœ„ βˆ’ πœ„βˆ— π‘ˆπ›Όΰ·  𝑀 πœ„βˆ— + 1 2 πœ„ βˆ’ πœ„βˆ— π‘ˆπΌ πœ„ βˆ’ πœ„βˆ—

  • Since πœ„βˆ— is optimal, 𝛼෠

𝑀 πœ„βˆ— = 0 ΰ·  𝑀 πœ„ β‰ˆ ΰ·  𝑀 πœ„βˆ— + 1 2 πœ„ βˆ’ πœ„βˆ— π‘ˆπΌ πœ„ βˆ’ πœ„βˆ—

slide-29
SLIDE 29

Effect on the optimal solution

  • Further assume that 𝐼 is diagonal and positive (𝐼𝑗𝑗> 0, βˆ€π‘—)
  • not true in general but assume for getting some intuition
  • The regularized objective is (ignoring constants)

ΰ·  𝑀𝑆 πœ„ β‰ˆ ෍

𝑗

1 2 𝐼𝑗𝑗 πœ„π‘— βˆ’ πœ„π‘—

βˆ— 2 + 𝛽 |πœ„π‘—|

  • The optimal πœ„π‘†

βˆ—

(πœ„π‘†

βˆ—)𝑗 β‰ˆ

max πœ„π‘—

βˆ— βˆ’ 𝛽

𝐼𝑗𝑗 , 0 if πœ„π‘—

βˆ— β‰₯ 0

min πœ„π‘—

βˆ— + 𝛽

𝐼𝑗𝑗 , 0 if πœ„π‘—

βˆ— < 0

slide-30
SLIDE 30

Effect on the optimal solution

  • Effect: induce sparsity

    [Figure: plot of (ΞΈ*_R)_i against (ΞΈ*)_i, which is zero on the interval (βˆ’Ξ±/H_ii, Ξ±/H_ii)]

SLIDE 31

Effect on the optimal solution

  • Further assume that 𝐼 is diagonal
  • Compact expression for the optimal πœ„π‘†

βˆ—

(πœ„π‘†

βˆ—)𝑗 β‰ˆ sign πœ„π‘— βˆ— max{ πœ„π‘— βˆ— βˆ’ 𝛽

𝐼𝑗𝑗 , 0}
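
This is the familiar soft-thresholding operator. A small numpy sketch under the same diagonal-Hessian assumption; the values of ΞΈ*, H_ii, and Ξ± are illustrative assumptions.

```python
import numpy as np

def soft_threshold(theta_star, alpha, h_diag):
    """Approximate l1-regularized optimum, assuming a diagonal Hessian with entries h_diag."""
    shrink = alpha / h_diag                        # per-coordinate threshold alpha / H_ii
    return np.sign(theta_star) * np.maximum(np.abs(theta_star) - shrink, 0.0)

theta_star = np.array([ 2.0, 0.3, -0.05, -1.5])    # unregularized optimum (illustrative)
h_diag     = np.array([ 1.0, 1.0,  1.0,   2.0])    # diagonal entries H_ii (illustrative)
print(soft_threshold(theta_star, alpha=0.5, h_diag=h_diag))
# [ 1.5  0.  -0.  -1.25]  -- small coordinates are set exactly to zero (sparsity)
```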

SLIDE 32

Bayesian view

  • π‘š1 regularization corresponds to Laplacian prior

π‘ž πœ„ ∝ exp(𝛽 ෍

𝑗

|πœ„π‘—|) log π‘ž πœ„ = 𝛽 ෍

𝑗

|πœ„π‘—| + constant = 𝛽| πœ„ |1 + constant