Neural Network Part 2: Regularization (PowerPoint PPT Presentation)


SLIDE 1

Neural Network Part 2: Regularization

Yingyu Liang, Computer Sciences 760, Fall 2017

http://pages.cs.wisc.edu/~yliang/cs760/

Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, and Pedro Domingos.

SLIDE 2

Goals for the lecture

You should understand the following concepts:

  • regularization
  • different views of regularization
  • norm constraint
  • data augmentation
  • early stopping
  • dropout
  • batch normalization


SLIDE 3

What is regularization?

  • In general: any method to prevent overfitting or help the optimization
  • Specifically: additional terms in the training optimization objective to prevent overfitting or help the optimization

SLIDE 4

Overfitting example: regression using polynomials

$$t = \sin(2\pi x) + \epsilon$$

Figure from Pattern Recognition and Machine Learning, Bishop

SLIDE 5

Overfitting example: regression using polynomials

Figure from Pattern Recognition and Machine Learning, Bishop

SLIDE 6

Overfitting

  • Key: empirical loss and expected loss are different
  • The smaller the data set, the larger the difference between the two
  • The larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the two
  • Such a hypothesis has small training error but large test error (overfitting)
  • A larger data set helps
  • Throwing away useless hypotheses also helps (regularization)
SLIDE 7

Different views of regularization

SLIDE 8

Regularization as hard constraint

  • Training objective

$$\min_f \; \hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f, x_i, y_i), \quad \text{subject to: } f \in \mathcal{H}$$

  • When parametrized

$$\min_\theta \; \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i, y_i), \quad \text{subject to: } \theta \in \Omega$$

SLIDE 9

Regularization as hard constraint

  • When $\Omega$ is measured by some quantity $R$

$$\min_\theta \; \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i, y_i), \quad \text{subject to: } R(\theta) \le r$$

  • Example: $\ell_2$ regularization

$$\min_\theta \; \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i, y_i), \quad \text{subject to: } \|\theta\|_2^2 \le r^2$$

SLIDE 10

Regularization as soft constraint

  • The hard-constraint optimization is equivalent to soft-constraint optimization

$$\min_\theta \; \hat{L}_R(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i, y_i) + \lambda^* R(\theta)$$

for some regularization parameter $\lambda^* > 0$

  • Example: $\ell_2$ regularization

$$\min_\theta \; \hat{L}_R(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i, y_i) + \lambda^* \|\theta\|_2^2$$

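As a concrete illustration of the soft-constraint objective, here is a minimal NumPy sketch for linear regression with squared loss and an $\ell_2$ penalty; the function and variable names are hypothetical, not from the slides:

```python
import numpy as np

def l2_regularized_loss(theta, X, y, lam):
    """Soft-constraint objective: (1/n) sum_i (x_i^T theta - y_i)^2 + lam * ||theta||_2^2."""
    residuals = X @ theta - y
    empirical_loss = np.mean(residuals ** 2)   # (1/n) sum of squared errors
    penalty = lam * np.sum(theta ** 2)         # lam * ||theta||_2^2
    return empirical_loss + penalty
```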
SLIDE 11

Regularization as soft constraint

  • Can be shown by the Lagrange multiplier method

$$\mathcal{L}(\theta, \lambda) := \hat{L}(\theta) + \lambda \left[ R(\theta) - r \right]$$

  • Suppose $\theta^*$ is the optimum of the hard-constraint optimization

$$\theta^* = \arg\min_\theta \max_{\lambda \ge 0} \mathcal{L}(\theta, \lambda) = \arg\min_\theta \max_{\lambda \ge 0} \; \hat{L}(\theta) + \lambda \left[ R(\theta) - r \right]$$

  • Suppose $\lambda^*$ is the corresponding optimal value for the max; then

$$\theta^* = \arg\min_\theta \mathcal{L}(\theta, \lambda^*) = \arg\min_\theta \; \hat{L}(\theta) + \lambda^* \left[ R(\theta) - r \right]$$

SLIDE 12

Regularization as Bayesian prior

  • Bayesian view: everything is a distribution
  • Prior over the hypotheses: $p(\theta)$
  • Posterior over the hypotheses: $p(\theta \mid \{x_i, y_i\})$
  • Likelihood: $p(\{x_i, y_i\} \mid \theta)$
  • Bayes' rule:

$$p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta) \, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$$

SLIDE 13

Regularization as Bayesian prior

  • Bayes' rule:

$$p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta) \, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$$

  • Maximum A Posteriori (MAP):

$$\max_\theta \; \log p(\theta \mid \{x_i, y_i\}) = \max_\theta \; \log p(\theta) + \log p(\{x_i, y_i\} \mid \theta)$$

Here $\log p(\theta)$ gives the regularization term and $\log p(\{x_i, y_i\} \mid \theta)$ gives the MLE loss.

SLIDE 14

Regularization as Bayesian prior

  • Example: π‘š2 loss with π‘š2 regularization

min

πœ„

ΰ·  𝑀𝑆 πœ„ = 1 π‘œ ෍

𝑗=1 π‘œ

𝑔

πœ„ 𝑦𝑗 βˆ’ 𝑧𝑗 2 + πœ‡βˆ—| πœ„| 2 2

  • Correspond to a normal likelihood π‘ž 𝑦, 𝑧 | πœ„ and a normal prior π‘ž(πœ„)
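To make the correspondence concrete, here is a short derivation sketch; the variance parameters $\sigma$ and $\tau$ are assumptions introduced for illustration, not from the slides:

```latex
% Assume likelihood y_i ~ N(f_theta(x_i), sigma^2) and prior theta ~ N(0, tau^2 I)
\begin{align*}
-\log p(\theta \mid \{x_i, y_i\})
  &= -\log p(\theta) - \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) + \text{const} \\
  &= \frac{\|\theta\|_2^2}{2\tau^2}
   + \sum_{i=1}^{n} \frac{\big(f_\theta(x_i) - y_i\big)^2}{2\sigma^2} + \text{const}
\end{align*}
% Multiplying through by 2*sigma^2/n recovers the regularized objective above
% with lambda* = sigma^2 / (n * tau^2).
```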
SLIDE 15

Three views

  • Typical choice for optimization: soft-constraint

$$\min_\theta \; \hat{L}_R(\theta) = \hat{L}(\theta) + \lambda R(\theta)$$

  • Hard constraint and Bayesian view: conceptual; or used for derivation
SLIDE 16

Three views

  • Hard-constraint preferred if:
  • the explicit bound $R(\theta) \le r$ is known
  • the soft-constraint version gets trapped in local minima, while projecting back to the feasible set leads to stability (see the sketch after this list)
  • Bayesian view preferred if:
  • domain knowledge is easy to represent as a prior
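A minimal sketch of the projection idea mentioned above, assuming an $\ell_2$-ball feasible set $\{\theta : \|\theta\|_2 \le r\}$; the projection formula for this ball is standard, everything else here is illustrative:

```python
import numpy as np

def project_l2_ball(theta, r):
    """Project theta onto the feasible set {theta : ||theta||_2 <= r}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= r else theta * (r / norm)

def projected_gradient_step(theta, grad, lr, r):
    """One hard-constraint update: take a gradient step, then project back."""
    return project_l2_ball(theta - lr * grad, r)
```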
SLIDE 17

Examples of Regularization

SLIDE 18

Classical regularization

  • Norm penalty
  • $\ell_2$ regularization
  • $\ell_1$ regularization
  • Robustness to noise
  • Noise added to the input
  • Noise added to the weights
SLIDE 19

π‘š2 regularization

min

πœ„

ΰ·  𝑀𝑆 πœ„ = ΰ·  𝑀(πœ„) + 𝛽 2 | πœ„| 2

2

  • Effect on (stochastic) gradient descent
  • Effect on the optimal solution
SLIDE 20

Effect on gradient descent

  • Gradient of the regularized objective

$$\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \alpha \theta$$

  • Gradient descent update (with learning rate $\eta$)

$$\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta \alpha \theta = (1 - \eta\alpha)\,\theta - \eta \nabla \hat{L}(\theta)$$

  • Terminology: weight decay (see the sketch below)
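A minimal NumPy-style sketch of the update above; `lr` and `alpha` stand for $\eta$ and $\alpha$, and `grad_loss` is a hypothetical function returning $\nabla \hat{L}(\theta)$:

```python
def weight_decay_step(theta, grad_loss, lr, alpha):
    """SGD step on L(theta) + (alpha/2)*||theta||^2:
    shrink the weights by (1 - lr*alpha), then take the usual gradient step."""
    return (1.0 - lr * alpha) * theta - lr * grad_loss(theta)
```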
SLIDE 21

Effect on the optimal solution

  • Consider a quadratic approximation around the unregularized optimum $\theta^*$, with Hessian $H$

$$\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^T \nabla \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$$

  • Since $\theta^*$ is optimal, $\nabla \hat{L}(\theta^*) = 0$, so

$$\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*), \qquad \nabla \hat{L}(\theta) \approx H (\theta - \theta^*)$$

SLIDE 22

Effect on the optimal solution

  • Gradient of the regularized objective

$$\nabla \hat{L}_R(\theta) \approx H(\theta - \theta^*) + \alpha \theta$$

  • At the regularized optimum $\theta_R^*$

$$0 = \nabla \hat{L}_R(\theta_R^*) \approx H(\theta_R^* - \theta^*) + \alpha \theta_R^*$$

$$\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$$

SLIDE 23

Effect on the optimal solution

  • The regularized optimum

$$\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$$

  • Suppose $H$ has eigendecomposition $H = Q \Lambda Q^T$

$$\theta_R^* \approx (H + \alpha I)^{-1} H \theta^* = Q (\Lambda + \alpha I)^{-1} \Lambda Q^T \theta^*$$

  • Effect: rescale along the eigenvectors of $H$; the component along eigenvector $i$ is scaled by $\lambda_i / (\lambda_i + \alpha)$
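A quick NumPy sanity check of this identity on a random symmetric positive definite $H$; the matrix and values are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A @ A.T + 4 * np.eye(4)       # random symmetric positive definite "Hessian"
theta_star = rng.normal(size=4)
alpha = 0.5

# Direct formula: (H + alpha*I)^{-1} H theta*
direct = np.linalg.solve(H + alpha * np.eye(4), H @ theta_star)

# Eigendecomposition formula: Q (Lambda + alpha*I)^{-1} Lambda Q^T theta*
lam, Q = np.linalg.eigh(H)
eig_form = Q @ np.diag(lam / (lam + alpha)) @ Q.T @ theta_star

assert np.allclose(direct, eig_form)   # the two expressions agree
```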
SLIDE 24

Effect on the optimal solution

Figure from Deep Learning, Goodfellow, Bengio and Courville

Notations: πœ„βˆ— = π‘₯βˆ—, πœ„π‘†

βˆ— = ΰ·₯

π‘₯

SLIDE 25

π‘š1 regularization

min

πœ„

ΰ·  𝑀𝑆 πœ„ = ΰ·  𝑀(πœ„) + 𝛽| πœ„ |1

  • Effect on (stochastic) gradient descent
  • Effect on the optimal solution
SLIDE 26

Effect on gradient descent

  • Gradient of the regularized objective

$$\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \alpha \,\mathrm{sign}(\theta)$$

where sign applies to each element of $\theta$

  • Gradient descent update

$$\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta \alpha \,\mathrm{sign}(\theta)$$

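A matching NumPy sketch of the $\ell_1$ (sub)gradient step; as before, `grad_loss` is a hypothetical gradient function:

```python
import numpy as np

def l1_subgradient_step(theta, grad_loss, lr, alpha):
    """SGD step on L(theta) + alpha*||theta||_1 using the elementwise sign subgradient."""
    return theta - lr * grad_loss(theta) - lr * alpha * np.sign(theta)
```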
SLIDE 27

Effect on the optimal solution

  • Consider a quadratic approximation around $\theta^*$

$$\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^T \nabla \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$$

  • Since $\theta^*$ is optimal, $\nabla \hat{L}(\theta^*) = 0$, so

$$\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$$

SLIDE 28

Effect on the optimal solution

  • Further assume that $H$ is diagonal and positive ($H_{ii} > 0, \forall i$)
  • not true in general, but assumed to build intuition
  • The regularized objective is (ignoring constants)

$$\hat{L}_R(\theta) \approx \sum_i \frac{1}{2} H_{ii} \left( \theta_i - \theta_i^* \right)^2 + \alpha \, |\theta_i|$$

  • The optimal $\theta_R^*$

$$(\theta_R^*)_i \approx \begin{cases} \max\left\{ \theta_i^* - \dfrac{\alpha}{H_{ii}},\ 0 \right\} & \text{if } \theta_i^* \ge 0 \\[2ex] \min\left\{ \theta_i^* + \dfrac{\alpha}{H_{ii}},\ 0 \right\} & \text{if } \theta_i^* < 0 \end{cases}$$

SLIDE 29

Effect on the optimal solution

  • Effect: induce sparsity

[Figure: the map from $(\theta^*)_i$ to $(\theta_R^*)_i$: zero on the interval $[-\alpha/H_{ii},\ \alpha/H_{ii}]$ and shifted toward zero outside it]

SLIDE 30

Effect on the optimal solution

  • Further assume that $H$ is diagonal
  • Compact expression for the optimal $\theta_R^*$ (soft thresholding; see the sketch below)

$$(\theta_R^*)_i \approx \mathrm{sign}(\theta_i^*) \max\left\{ |\theta_i^*| - \frac{\alpha}{H_{ii}},\ 0 \right\}$$

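A minimal NumPy sketch of this soft-thresholding expression; the array names are illustrative:

```python
import numpy as np

def soft_threshold(theta_star, alpha, H_diag):
    """Componentwise l1-regularized optimum under a diagonal Hessian:
    sign(theta*) * max(|theta*| - alpha / H_ii, 0)."""
    return np.sign(theta_star) * np.maximum(np.abs(theta_star) - alpha / H_diag, 0.0)
```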
SLIDE 31

Bayesian view

  • π‘š1 regularization corresponds to Laplacian prior

π‘ž πœ„ ∝ exp(𝛽 ෍

𝑗

|πœ„π‘—|) log π‘ž πœ„ = 𝛽 ෍

𝑗

|πœ„π‘—| + constant = 𝛽| πœ„ |1 + constant

SLIDE 32

Multiple optimal solutions?

[Figure: linearly separable data (Class +1 vs. Class -1) with three separating classifiers $w_1, w_2, w_3$; prefer $w_2$ (higher confidence)]

SLIDE 33

Add noise to the input

[Figure: the same data with noise added to the inputs; still prefer $w_2$ (higher confidence)]

SLIDE 34

Caution: not too much noise

[Figure: same setting, prefer $w_2$ (higher confidence); too much noise causes data points to cross the boundary]

SLIDE 35

Equivalence to weight decay

  • Suppose the hypothesis is $f(x) = w^T x$ and the noise is $\epsilon \sim N(0, \lambda I)$
  • After adding noise to the input, the loss is

$$L(f) = \mathbb{E}_{x,y,\epsilon} \left[ f(x + \epsilon) - y \right]^2 = \mathbb{E}_{x,y,\epsilon} \left[ f(x) + w^T \epsilon - y \right]^2$$

$$L(f) = \mathbb{E}_{x,y} \left[ f(x) - y \right]^2 + 2\, \mathbb{E}_{x,y,\epsilon} \left[ w^T \epsilon \left( f(x) - y \right) \right] + \mathbb{E}_{\epsilon} \left[ w^T \epsilon \right]^2$$

$$L(f) = \mathbb{E}_{x,y} \left[ f(x) - y \right]^2 + \lambda \|w\|^2$$

The cross term vanishes because $\epsilon$ has mean zero and is independent of $(x, y)$, and $\mathbb{E}[(w^T \epsilon)^2] = \lambda \|w\|^2$.

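A small Monte Carlo sketch that checks this equivalence numerically; all names and the toy data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)
X = rng.normal(size=(100_000, 5))            # toy inputs
y = X @ w + 0.1 * rng.normal(size=100_000)   # toy targets
lam = 0.01

# Loss with Gaussian input noise eps ~ N(0, lam * I)
eps = np.sqrt(lam) * rng.normal(size=X.shape)
noisy_loss = np.mean(((X + eps) @ w - y) ** 2)

# Clean loss plus the weight-decay term lam * ||w||^2
decayed_loss = np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)

print(noisy_loss, decayed_loss)   # approximately equal for large samples
```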
SLIDE 36

Add noise to the weights

  • For the loss on each data point, add a noise term to the weights before computing the prediction

$$\epsilon \sim N(0, \eta I), \qquad w' = w + \epsilon$$

  • Prediction: $f_{w'}(x)$ instead of $f_w(x)$
  • The loss becomes

$$L(f) = \mathbb{E}_{x,y,\epsilon} \left[ f_{w+\epsilon}(x) - y \right]^2$$

SLIDE 37

Add noise to the weights

  • The loss becomes

$$L(f) = \mathbb{E}_{x,y,\epsilon} \left[ f_{w+\epsilon}(x) - y \right]^2$$

  • To simplify, use a Taylor expansion (gradients taken with respect to the weights $w$)

$$f_{w+\epsilon}(x) \approx f_w(x) + \epsilon^T \nabla f(x) + \frac{1}{2} \epsilon^T \nabla^2 f(x)\, \epsilon$$

  • Plugging in:

$$L(f) \approx \mathbb{E} \left[ f_w(x) - y \right]^2 + \eta\, \mathbb{E}\!\left[ \left( f_w(x) - y \right) \nabla^2 f_w(x) \right] + \eta\, \mathbb{E} \|\nabla f_w(x)\|^2$$

The middle term is small, so it can be ignored; the last term acts as the regularization term.

SLIDE 38

Other types of regularizations

  • Data augmentation
  • Early stopping
  • Dropout
  • Batch Normalization
SLIDE 39

Data augmentation

Figure from Image Classification with Pyramid Representation and Rotated Data Augmentation on Torch 7, by Keven Wang

SLIDE 40

Data augmentation

  • Adding noise to the input: a special kind of augmentation
  • Be careful about the transformations applied (see the sketch after this list):
  • Example: classifying 'b' and 'd' (a horizontal flip changes the label)
  • Example: classifying '6' and '9' (a 180Β° rotation changes the label)
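A minimal sketch of label-safe augmentation for images stored as NumPy arrays; the chosen transformation set (small shifts plus pixel noise, deliberately excluding flips and rotations) is an assumption for illustration:

```python
import numpy as np

def augment(image, rng, noise_std=0.05, max_shift=2):
    """Apply a random small shift plus Gaussian pixel noise.
    Flips/rotations are excluded here: they could change the label
    ('b' vs 'd', '6' vs '9')."""
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
    return shifted + rng.normal(scale=noise_std, size=image.shape)
```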
SLIDE 41

Early stopping

  • Idea: don't train the network to too small a training error
  • Recall overfitting: the larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between empirical and expected loss
  • Prevent overfitting: do not push the hypothesis too far; use the validation error to decide when to stop

SLIDE 42

Early stopping

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 43

Early stopping

  • While training, also track the validation error
  • Every time the validation error improves, store a copy of the weights
  • When the validation error has not improved for some time, stop
  • Return the copy of the weights stored (see the sketch below)
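A minimal sketch of this loop; `train_one_epoch` and `validation_error` are hypothetical callables, and `patience` counts epochs without improvement:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              patience=10, max_epochs=1000):
    """Stop when validation error hasn't improved for `patience` epochs;
    return the weights from the best epoch seen."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)
        if err < best_error:
            best_error = err
            best_model = copy.deepcopy(model)   # store a copy of the weights
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_model
```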
SLIDE 44

Early stopping

  • Hyperparameter selection: the number of training steps is the hyperparameter
  • Advantages:
  • Efficient: runs along with training; only needs to store an extra copy of the weights
  • Simple: no change to the model or algorithm
  • Disadvantage: needs validation data
SLIDE 45

Early stopping as a regularizer

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 46

Dropout

  • Randomly select weights to update
  • More precisely, in each update step:
  • Randomly sample a different binary mask for all the input and hidden units
  • Multiply the mask bits with the units and do the update as usual (see the sketch below)
  • Typical dropout probability: 0.2 for input units and 0.5 for hidden units
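A minimal sketch of the mask-and-multiply step the slides describe; the layer size and example values are illustrative (common implementations also rescale the kept units, which the slides do not cover):

```python
import numpy as np

def dropout_mask(units, drop_prob, rng):
    """Sample a binary mask (1 = keep) and multiply it with the unit activations."""
    mask = rng.random(units.shape) >= drop_prob   # keep each unit w.p. 1 - drop_prob
    return units * mask

# Example: drop_prob 0.2 for input units, 0.5 for hidden units
rng = np.random.default_rng(0)
hidden = rng.normal(size=128)
hidden = dropout_mask(hidden, drop_prob=0.5, rng=rng)
```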
SLIDE 47

Dropout

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 48

Dropout

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 49

Dropout

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 50
Batch Normalization

  • If outputs of earlier layers are uniform or change greatly on one round for one mini-batch, then neurons at the next level can't keep up: they output all high (or all low) values
  • The next layer doesn't have the ability to change its outputs with learning-rate-sized changes to its input weights
  • We say the layer has "saturated"

SLIDE 51

Another View of Problem

  • In ML, we assume future data will be drawn from the same probability distribution as the training data
  • For a hidden unit, after a training update, the earlier layers have new weights and hence generate input data for this hidden unit from a new distribution
  • We want to reduce this internal covariate shift for the benefit of later layers

SLIDE 52

[Slide image: the batch normalization algorithm: compute the mini-batch mean and variance, normalize each activation, then scale and shift with learned parameters]

SLIDE 53

Comments on Batch Normalization

  • The first three steps are just like standardization of the input data, but with respect to only the data in the mini-batch. We can take derivatives and incorporate the learning of the last step's parameters into backpropagation (see the sketch below).
  • Note that the last step can completely undo the previous three steps
  • But if so, this undoing is driven by the later layers, not the earlier layers; later layers get to "choose" whether they want standard normal inputs or not
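A minimal NumPy sketch of the four steps for one mini-batch; the epsilon constant and the `gamma`/`beta` names follow the standard batch-normalization formulation, which is an assumption here since the algorithm slide itself is an image:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: mini-batch of activations, shape (batch_size, num_units).
    Steps 1-3: standardize using mini-batch statistics.
    Step 4: scale and shift with learned parameters (trained by backprop)."""
    mu = x.mean(axis=0)                     # 1. mini-batch mean
    var = x.var(axis=0)                     # 2. mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # 3. normalize
    return gamma * x_hat + beta             # 4. scale and shift
```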

SLIDE 54

What regularizations are frequently used?

  • π‘š2 regularization
  • Early stopping
  • Dropout/Batch Normalization
  • Data augmentation if the transformations known/easy to implement