
Deep learning: Optimization and Regularization in deep networks



  1. Deep learning: Optimization and Regularization in deep networks. Hamid Beigy, Sharif University of Technology, October 9, 2019.

  2. Table of contents: 1 Optimization, 2 Model selection, 3 Regularization, 4 Dataset augmentation, 5 Bagging, 6 Dropout, 7 Reading.

  3. Deep learning | Optimization (section divider; table of contents repeated).

  4. Deep learning | Optimization: Batch gradient descent
     1 Batch gradient descent computes the gradient of the cost function w.r.t. the parameters θ over the entire training set: θ = θ − η · ∇θ J(θ)
     2 We need to calculate the gradients for the whole dataset to perform just one update.
     3 Batch gradient descent can therefore be very slow, and it is intractable for datasets that don't fit in memory.
     4 It also doesn't allow us to update the model online, i.e. with new examples on-the-fly.
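
To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent. The gradient oracle grad(theta) (returning ∇θ J(θ) over the whole dataset) and all parameter names are illustrative assumptions, not from the slides.

```python
import numpy as np

def batch_gradient_descent(grad, theta, eta=0.1, n_epochs=100):
    """Full-batch gradient descent: one parameter update per pass over the data.

    grad(theta) is a hypothetical callable returning the gradient of the
    cost J w.r.t. theta, computed over the entire training set.
    """
    for _ in range(n_epochs):
        theta = theta - eta * grad(theta)  # theta = theta - eta * grad J(theta)
    return theta
```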

  5. Deep learning | Optimization: Stochastic gradient descent
     1 Stochastic gradient descent (SGD) performs a parameter update for each training example x(i) and label y(i): θ = θ − η · ∇θ J(θ; x(i); y(i))
     2 Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update.
     3 SGD does away with this redundancy by performing one update at a time.
     4 It is therefore usually much faster and can also be used to learn online.
     5 SGD performs frequent updates with a high variance that causes the objective function to fluctuate heavily.
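
A minimal sketch of SGD in the same style, assuming a hypothetical per-example gradient grad_i(theta, x, t); shuffling the examples each epoch is common practice, not something stated on this slide.

```python
import numpy as np

def sgd(grad_i, theta, X, T, eta=0.01, n_epochs=10):
    """SGD: one parameter update per training example (x_i, t_i).

    grad_i(theta, x, t) is a hypothetical callable returning
    grad J(theta; x, t) for a single example.
    """
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):  # visit examples in random order
            theta = theta - eta * grad_i(theta, X[i], T[i])
    return theta
```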

  6. Deep learning | Optimization: Mini-batch gradient descent
     1 Mini-batch gradient descent takes the best of both worlds and performs an update for every mini-batch of n training examples: θ = θ − η · ∇θ J(θ; x(i:i+n); y(i:i+n))
     2 This method reduces the variance of the parameter updates, which can lead to more stable convergence.
     3 It can make use of the highly optimized matrix operations common to state-of-the-art deep learning libraries, which make computing the gradient w.r.t. a mini-batch very efficient.
     4 Common mini-batch sizes range between 50 and 256, but can vary for different applications.
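
The mini-batch variant, again as a sketch with a hypothetical batched gradient grad_batch; the batch size of 64 is an arbitrary illustrative default.

```python
import numpy as np

def minibatch_gd(grad_batch, theta, X, T, eta=0.01, batch_size=64, n_epochs=10):
    """Mini-batch gradient descent: one update per mini-batch of examples.

    grad_batch(theta, Xb, Tb) is a hypothetical callable returning the
    gradient of J averaged over the mini-batch (Xb, Tb).
    """
    rng = np.random.default_rng(0)
    n = len(X)
    for _ in range(n_epochs):
        order = rng.permutation(n)
        for s in range(0, n, batch_size):
            b = order[s:s + batch_size]  # indices of one mini-batch
            theta = theta - eta * grad_batch(theta, X[b], T[b])
    return theta
```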

  7. Deep learning | Optimization: Mini-batch gradient descent (challenges)
     Mini-batch gradient descent does not guarantee good convergence and raises a few challenges that need to be addressed.
     1 Choosing a proper learning rate can be difficult.
     2 Choosing the parameters (schedules and thresholds) of learning rate schedules is difficult.
     3 Should we use the same learning rate for all parameters?
     4 How do we avoid getting trapped in suboptimal local minima?

  8. Deep learning | Optimization: Momentum
     1 Momentum [1] is a method that helps accelerate SGD in the relevant direction and dampens oscillations.
     2 It does this by adding a fraction γ of the update vector of the past time step to the current update vector:
       v_t = γ v_{t−1} + η ∇θ J(θ)
       θ = θ − v_t
     [1] Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1), 145–151.
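
The two momentum equations translate directly into a one-step update. This sketch assumes the same hypothetical gradient oracle as above, with v initialised to zeros before the first call.

```python
import numpy as np

def momentum_step(theta, v, grad, eta=0.01, gamma=0.9):
    """One momentum update:
        v_t   = gamma * v_{t-1} + eta * grad J(theta)
        theta = theta - v_t
    """
    v = gamma * v + eta * grad(theta)
    return theta - v, v
```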

  9. Deep learning | Optimization: Nesterov accelerated gradient
     1 A ball that rolls down a hill, blindly following the slope, is highly unsatisfactory.
     2 We would like a smarter ball, one that has a notion of where it is going, so that it knows to slow down before the hill slopes up again.
     3 Nesterov accelerated gradient (NAG) [2] is a way to give the momentum term this kind of prescience:
       v_t = γ v_{t−1} + η ∇θ J(θ − γ v_{t−1})
       θ = θ − v_t
     The value θ − γ v_{t−1} gives an approximation of the next position of the parameters.
     [2] Nesterov, Y. (1983). A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR (translated as Soviet Math. Dokl.), 269, 543–547.
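
The only change relative to plain momentum is where the gradient is evaluated: at the look-ahead point θ − γ v_{t−1} rather than at θ. A sketch under the same assumptions as before:

```python
import numpy as np

def nag_step(theta, v, grad, eta=0.01, gamma=0.9):
    """One Nesterov update: the gradient is taken at the approximate
    next position theta - gamma * v, giving the step its 'prescience'."""
    v = gamma * v + eta * grad(theta - gamma * v)
    return theta - v, v
```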

  10. Deep learning | Optimization: Adagrad
     1 Adagrad [3] is an algorithm for gradient-based optimization that adapts the learning rate to the parameters.
     2 Adagrad updates the parameters in the following manner:
       θ_{t+1,i} = θ_{t,i} − η / √(G_{t,ii} + ϵ) · ∇θ J(θ_{t,i})
       where G_t ∈ R^{d×d} is a diagonal matrix in which each diagonal element (i, i) is the sum of the squares of the gradients w.r.t. θ_i, and ϵ is a smoothing term that avoids division by zero.
     3 Adadelta [4] is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate.
     [3] Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159.
     [4] Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv.
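
In practice only the diagonal of G_t is stored, i.e. one accumulator per parameter. A minimal sketch with the same hypothetical grad oracle:

```python
import numpy as np

def adagrad_step(theta, G_diag, grad, eta=0.01, eps=1e-8):
    """One Adagrad update. G_diag holds the running sum of squared
    gradients per parameter (the diagonal of G_t), so each coordinate
    effectively gets its own, monotonically shrinking learning rate."""
    g = grad(theta)
    G_diag = G_diag + g ** 2
    theta = theta - eta / np.sqrt(G_diag + eps) * g
    return theta, G_diag
```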

  11. Deep learning | Optimization: Adam
     1 Adaptive Moment Estimation (Adam) [5] computes adaptive learning rates for each parameter.
     2 Adam behaves like a heavy ball with friction, and thus prefers flat minima in the error surface.
     3 Adam computes decaying averages of past gradients and past squared gradients, m_t and v_t:
       m_t = β1 m_{t−1} + (1 − β1) ∇θ J(θ_t)        (1)
       v_t = β2 v_{t−1} + (1 − β2) (∇θ J(θ_t))²
       where m_t and v_t are estimates of the first moment and the second moment of the gradients, respectively.
     [5] Kingma, D. P., and Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, 1–13.

  12. Deep learning | Optimization: Adam (cont.)
     1 Initially m_t and v_t are set to 0.
     2 Hence m_t and v_t are biased towards zero.
     3 The bias-corrected estimates are
       m̂_t = m_t / (1 − β1^t)        (2)
       v̂_t = v_t / (1 − β2^t)
     Then Adam updates the parameters as
       θ_{t+1} = θ_t − η / (√v̂_t + ϵ) · m̂_t
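
Putting slides 11 and 12 together, here is a compact sketch of the full Adam loop. The grad oracle is the same hypothetical interface as above; the defaults β1 = 0.9, β2 = 0.999, ϵ = 1e-8 are the values suggested in the Adam paper.

```python
import numpy as np

def adam(grad, theta, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    """Adam: decaying moment estimates, bias correction, then the update."""
    m = np.zeros_like(theta)  # first-moment estimate, initialised to 0
    v = np.zeros_like(theta)  # second-moment estimate, initialised to 0
    for t in range(1, n_steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g       # Eq. (1)
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)          # bias correction, Eq. (2)
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta
```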

  13. Deep learning | Optimization: Other optimization algorithms
     1 AdaMax modifies the v_t term of Adam (replacing it with an ℓ∞-norm-based estimate).
     2 Nadam (Nesterov-accelerated Adaptive Moment Estimation) [6] combines Adam and NAG.
     3 For more information, please read: Sebastian Ruder (2017), "An overview of gradient descent optimization algorithms", arXiv.
     4 Which optimizer should we use?
     [6] Dozat, T. (2016). Incorporating Nesterov Momentum into Adam. ICLR Workshop, (1), 2013–2016.

  14. Deep learning | Model selection (section divider; table of contents repeated).

  15. Deep learning | Model selection: Model selection
     1 Consider a regression problem in which the training set is S = {(x_1, t_1), (x_2, t_2), ..., (x_N, t_N)}, t_k ∈ R, where t_k = f(x_k) + ϵ for all k = 1, 2, ..., N; f(x_k) ∈ R is the unknown function and ϵ is random noise.
     2 The goal is to approximate f(x) by a function g(x).
     3 The empirical error on the training set S is measured using the cost function E(g(x) | S) = (1/2) Σ_{i=1}^N (t_i − g(x_i))²
     4 The aim is to find the g(·) that minimizes the empirical error.
     5 We assume a hypothesis class for g(·) with a small set of parameters.
     6 Assume that g(x) is linear: g(x) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_D x_D

  16. Deep learning | Model selection: Model selection
     1 Define the following vectors and matrices.
       Data matrix:
         X = ( 1  x_11  x_12  ...  x_1D
               1  x_21  x_22  ...  x_2D
               ...
               1  x_N1  x_N2  ...  x_ND )
       The k-th input vector: X_k = (1, x_k1, x_k2, ..., x_kD)^T
       The weight vector: W = (w_0, w_1, w_2, ..., w_D)^T
       The target vector: t = (t_1, t_2, t_3, ..., t_N)^T
     2 The empirical error equals E(g(x) | S) = (1/2) Σ_{k=1}^N (t_k − W^T X_k)²
     3 Setting the gradient to zero: ∇_W E(g(x) | S) = Σ_{k=1}^N t_k X_k − (Σ_{k=1}^N X_k X_k^T) W = 0
     4 Solving for W, we obtain W* = (X^T X)^{−1} X^T t
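
The closed-form solution can be checked numerically. This sketch builds the design matrix with a leading column of ones for w_0 and solves the normal equations; using np.linalg.solve rather than an explicit inverse is a standard numerical choice, not something specified on the slide.

```python
import numpy as np

def fit_linear(X_raw, t):
    """Least-squares fit: W* = (X^T X)^{-1} X^T t.

    X_raw is the N x D matrix of raw inputs; a column of ones is
    prepended so that w_0 acts as the bias term.
    """
    X = np.hstack([np.ones((len(X_raw), 1)), X_raw])  # N x (D+1) design matrix
    return np.linalg.solve(X.T @ X, X.T @ t)          # solves (X^T X) W = X^T t
```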

  17. Deep learning | Model selection: Model selection
     1 If the linear model is too simple, the model can be a polynomial (a more complex hypothesis class): g(x) = w_0 + w_1 x + w_2 x² + ... + w_M x^M
     2 M is the order of the polynomial.
     3 Choosing the right value of M is called model selection.
     4 For M = 1, the model is too general (it underfits the data).
     5 For M = 9, the model is too specific (it overfits the data).
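
Model selection over M can then be sketched by fitting each candidate order and comparing errors on held-out data. np.vander builds the polynomial design matrix; the train/validation split is an illustrative assumption, not from the slide.

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Fit g(x) = w_0 + w_1 x + ... + w_M x^M by least squares."""
    X = np.vander(x, M + 1, increasing=True)  # columns: x^0, x^1, ..., x^M
    return np.linalg.lstsq(X, t, rcond=None)[0]

def empirical_error(w, x, t):
    """Empirical error (1/2) sum (t_i - g(x_i))^2 on a held-out set."""
    g = np.vander(x, len(w), increasing=True) @ w
    return 0.5 * np.sum((t - g) ** 2)

# Model selection: pick the M with the lowest validation error, e.g.
# best_M = min(range(10), key=lambda M: empirical_error(
#     fit_polynomial(x_train, t_train, M), x_val, t_val))
```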
