Lecture 4: Regularization II (Princeton University COS 495)
SLIDE 1

Deep Learning Basics Lecture 4: Regularization II

Princeton University COS 495 Instructor: Yingyu Liang

SLIDE 2

Review

SLIDE 3

Regularization as hard constraint

  • Constrained optimization

$$\min_{\theta} \; \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i) \quad \text{subject to: } R(\theta) \le r$$

SLIDE 4

Regularization as soft constraint

  • Unconstrained optimization

$$\min_{\theta} \; \hat{L}_R(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i) + \lambda R(\theta)$$

for some regularization parameter $\lambda > 0$

SLIDE 5

Regularization as Bayesian prior

  • Bayesian rule:

π‘ž πœ„ | {𝑦𝑗, 𝑧𝑗} = π‘ž πœ„ π‘ž 𝑦𝑗, 𝑧𝑗 πœ„) π‘ž({𝑦𝑗, 𝑧𝑗})

  • Maximum A Posteriori (MAP):

$$\max_{\theta} \; \log p(\theta \mid \{x_i, y_i\}) = \max_{\theta} \; \underbrace{\log p(\theta)}_{\text{regularization}} + \underbrace{\log p(\{x_i, y_i\} \mid \theta)}_{\text{MLE loss}}$$

SLIDE 6

Classical regularizations

  • Norm penalty
  • π‘š2 regularization
  • π‘š1 regularization
SLIDE 7

More examples

SLIDE 8

Other types of regularizations

  • Robustness to noise
  • Noise to the input
  • Noise to the weights
  • Noise to the output
  • Data augmentation
  • Early stopping
  • Dropout
SLIDE 9

Multiple optimal solutions?

[Figure: three separating hyperplanes w1, w2, w3 between Class +1 and Class -1; prefer w2 (higher confidence)]

SLIDE 10

Add noise to the input

[Figure: noisy copies of the data points; prefer w2 (higher confidence), which still separates Class +1 and Class -1]
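A minimal sketch of the idea (assuming Gaussian noise, matching the variance used in the weight-decay equivalence two slides later): every time an example is used, a fresh noise vector is added to its input.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_input_noise(X, noise_var=0.01):
    """Return the inputs with fresh Gaussian noise eps ~ N(0, noise_var * I) added.
    noise_var should stay small so points do not cross the decision boundary."""
    eps = rng.normal(scale=np.sqrt(noise_var), size=X.shape)
    return X + eps
```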

SLIDE 11

Caution: not too much noise

[Figure: the same setting with much larger input noise] Too much noise leads to data points crossing the boundary.

SLIDE 12

Equivalence to weight decay

  • Suppose the hypothesis is 𝑔 𝑦 = π‘₯π‘ˆπ‘¦, noise is πœ—~𝑂(0, πœ‡π½)
  • After adding noise, the loss is

$$L(f) = \mathbb{E}_{x,y,\epsilon}\left[f(x+\epsilon) - y\right]^2 = \mathbb{E}_{x,y,\epsilon}\left[f(x) + w^T \epsilon - y\right]^2$$

$$L(f) = \mathbb{E}_{x,y,\epsilon}\left[f(x) - y\right]^2 + 2\,\mathbb{E}_{x,y,\epsilon}\left[w^T \epsilon \left(f(x) - y\right)\right] + \mathbb{E}_{x,y,\epsilon}\left[w^T \epsilon\right]^2$$

$$L(f) = \mathbb{E}_{x,y,\epsilon}\left[f(x) - y\right]^2 + \lambda \|w\|^2$$

The cross term vanishes because $\epsilon$ is independent of $(x, y)$ and has zero mean, and $\mathbb{E}\left[w^T \epsilon\right]^2 = \lambda \|w\|^2$.
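A quick Monte Carlo check of this identity (a sketch under the slide's assumptions: linear hypothesis $f(x) = w^T x$, squared loss, Gaussian input noise): averaging the noisy loss over many noise draws approaches the clean loss plus $\lambda \|w\|^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = rng.normal(size=d)                              # an arbitrary hypothesis w

clean = np.mean((X @ w - y) ** 2)                   # E[(f(x) - y)^2] over the sample

noisy, trials = 0.0, 5000
for _ in range(trials):                             # estimate E[(f(x + eps) - y)^2]
    eps = rng.normal(scale=np.sqrt(lam), size=(n, d))
    noisy += np.mean(((X + eps) @ w - y) ** 2)
noisy /= trials

print(clean + lam * np.sum(w ** 2))                 # clean loss + weight decay term
print(noisy)                                        # should be approximately equal
```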

SLIDE 13

Add noise to the weights

  • For the loss on each data point, add a noise term to the weights

before computing the prediction πœ—~𝑂(0, πœƒπ½), π‘₯β€² = π‘₯ + πœ—

  • Prediction: 𝑔π‘₯β€² 𝑦 instead of 𝑔

π‘₯ 𝑦

  • Loss becomes

$$L(f) = \mathbb{E}_{x,y,\epsilon}\left[f_{w+\epsilon}(x) - y\right]^2$$

SLIDE 14

Add noise to the weights

  • Loss becomes

$$L(f) = \mathbb{E}_{x,y,\epsilon}\left[f_{w+\epsilon}(x) - y\right]^2$$

  • To simplify, use Taylor expansion
  • 𝑔

π‘₯+πœ— 𝑦 β‰ˆ 𝑔 π‘₯ 𝑦 + πœ—π‘ˆπ›Όπ‘” 𝑦 + πœ—π‘ˆπ›Ό2𝑔 𝑦 πœ— 2

  • Plug in
  • 𝑀 𝑔 β‰ˆ 𝔽 𝑔

π‘₯ 𝑦 βˆ’ 𝑧 2 + πœƒπ”½[ 𝑔 π‘₯ 𝑦 βˆ’ 𝑧 𝛼2𝑔 π‘₯ 𝑦 ] + πœƒπ”½||𝛼𝑔 π‘₯(𝑦)||2

Small so can be ignored Regularization term
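A minimal sketch of the weight-noise version for the same linear model (my illustration): the noise is added to the weights, not the input, before computing the prediction for each data point.

```python
import numpy as np

rng = np.random.default_rng(0)

def weight_noise_loss_grad(w, x, y, eta=0.01):
    """Loss (f_{w+eps}(x) - y)^2 and its gradient w.r.t. the clean weights w,
    with eps ~ N(0, eta*I) sampled fresh for this data point."""
    eps = rng.normal(scale=np.sqrt(eta), size=w.shape)
    residual = (w + eps) @ x - y          # the prediction uses the perturbed weights
    return residual ** 2, 2.0 * residual * x
```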

SLIDE 15

Data augmentation

Figure from Image Classification with Pyramid Representation and Rotated Data Augmentation on Torch 7, by Keven Wang

SLIDE 16

Data augmentation

  • Adding noise to the input: a special kind of augmentation
  • Be careful about the transformation applied:
  • Example: classifying β€˜b’ and β€˜d’
  • Example: classifying β€˜6’ and β€˜9’
SLIDE 17

Early stopping

  • Idea: don’t train the network to too small training error
  • Recall overfitting: Larger the hypothesis class, easier to find a

hypothesis that fits the difference between the two

  • Prevent overfitting: do not push the hypothesis too much; use

validation error to decide when to stop

SLIDE 18

Early stopping

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 19

Early stopping

  • When training, also output validation error
  • Every time validation error improved, store a copy of the weights
  • When validation error not improved for some time, stop
  • Return the copy of the weights stored
SLIDE 20

Early stopping

  • hyperparameter selection: training step is the hyperparameter
  • Advantage
  • Efficient: along with training; only store an extra copy of weights
  • Simple: no change to the model/algo
  • Disadvantage: need validation data
SLIDE 21

Early stopping

  • Strategy to get rid of the disadvantage
  • After early stopping of the first run, train a second run and reuse validation

data

  • How to reuse validation data
  • 1. Start fresh, train with both training data and validation data up to the

previous number of epochs

  • 2. Start from the weights in the first run, train with both training data and

validation data util the validation loss < the training loss at the early stopping point
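The two reuse strategies as a sketch (hypothetical make_model, train_epoch, and loss_on helpers; best_epoch and stop_loss come from the first, early-stopped run):

```python
def retrain_strategy_1(make_model, train_epoch, train_data, val_data, best_epoch):
    """Strategy 1: start fresh and train on training + validation data
    for the number of epochs chosen by early stopping in the first run."""
    model = make_model()
    for _ in range(best_epoch):
        train_epoch(model, train_data + val_data)
    return model

def retrain_strategy_2(model, train_epoch, loss_on, train_data, val_data, stop_loss):
    """Strategy 2: continue from the first run's weights and train on
    training + validation data until the validation loss falls below the
    training loss recorded at the early stopping point."""
    while loss_on(model, val_data) >= stop_loss:
        train_epoch(model, train_data + val_data)
    return model
```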

SLIDE 22

Early stopping as a regularizer

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 23

Dropout

  • Randomly select weights to update
  • More precisely, in each update step
  • Randomly sample a different binary mask to all the input and hidden units
  • Multiple the mask bits with the units and do the update as usual
  • Typical dropout probability: 0.2 for input and 0.5 for hidden units
SLIDE 24

Dropout

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 25

Dropout

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 26

Dropout

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 27

What regularizations are frequently used?

  • π‘š2 regularization
  • Early stopping
  • Dropout
  • Data augmentation if the transformations known/easy to implement