
Deep learning: Optimization and Regularization in deep networks



  1. Deep learning: Optimization and Regularization in deep networks. Hamid Beigy, Sharif University of Technology, October 9, 2019.

  2. Table of contents: 1 Optimization, 2 Model selection, 3 Regularization, 4 Dataset augmentation, 5 Bagging, 6 Dropout, 7 Reading.

  3. Deep learning | Optimization (section divider; table of contents repeated).

  4. Deep learning | Optimization: Batch gradient descent
     1 Batch gradient descent computes the gradient of the cost function w.r.t. the parameters θ over the entire training set: θ = θ − η · ∇θ J(θ)
     2 We need to calculate the gradients for the whole dataset to perform just one update.
     3 Batch gradient descent can therefore be very slow, and it is intractable for datasets that don't fit in memory.
     4 It also doesn't allow us to update the model online, i.e. with new examples on-the-fly.
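
To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent. The gradient oracle grad(theta) (returning ∇θ J(θ) over the whole dataset) and all parameter names are illustrative assumptions, not from the slides.

```python
import numpy as np

def batch_gradient_descent(grad, theta, eta=0.1, n_epochs=100):
    """Full-batch gradient descent: one parameter update per pass over the data.

    grad(theta) is a hypothetical callable returning the gradient of the
    cost J w.r.t. theta, computed over the entire training set.
    """
    for _ in range(n_epochs):
        theta = theta - eta * grad(theta)  # theta = theta - eta * grad J(theta)
    return theta
```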

  5. Deep learning | Optimization: Stochastic gradient descent
     1 Stochastic gradient descent (SGD) performs a parameter update for each training example x(i) and label y(i): θ = θ − η · ∇θ J(θ; x(i); y(i))
     2 Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update.
     3 SGD does away with this redundancy by performing one update at a time.
     4 It is therefore usually much faster and can also be used to learn online.
     5 SGD performs frequent updates with a high variance that causes the objective function to fluctuate heavily.
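
A minimal sketch of SGD in the same style, assuming a hypothetical per-example gradient grad_i(theta, x, t); shuffling the examples each epoch is common practice, not something stated on this slide.

```python
import numpy as np

def sgd(grad_i, theta, X, T, eta=0.01, n_epochs=10):
    """SGD: one parameter update per training example (x_i, t_i).

    grad_i(theta, x, t) is a hypothetical callable returning
    grad J(theta; x, t) for a single example.
    """
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):  # visit examples in random order
            theta = theta - eta * grad_i(theta, X[i], T[i])
    return theta
```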

  6. Deep learning | Optimization: Mini-batch gradient descent
     1 Mini-batch gradient descent takes the best of both worlds and performs an update for every mini-batch of n training examples: θ = θ − η · ∇θ J(θ; x(i:i+n); y(i:i+n))
     2 This method reduces the variance of the parameter updates, which can lead to more stable convergence.
     3 It can make use of the highly optimized matrix operations common to state-of-the-art deep learning libraries, which make computing the gradient w.r.t. a mini-batch very efficient.
     4 Common mini-batch sizes range between 50 and 256, but can vary for different applications.
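
The mini-batch variant, again as a sketch with a hypothetical batched gradient grad_batch; the batch size of 64 is an arbitrary illustrative default.

```python
import numpy as np

def minibatch_gd(grad_batch, theta, X, T, eta=0.01, batch_size=64, n_epochs=10):
    """Mini-batch gradient descent: one update per mini-batch of examples.

    grad_batch(theta, Xb, Tb) is a hypothetical callable returning the
    gradient of J averaged over the mini-batch (Xb, Tb).
    """
    rng = np.random.default_rng(0)
    n = len(X)
    for _ in range(n_epochs):
        order = rng.permutation(n)
        for s in range(0, n, batch_size):
            b = order[s:s + batch_size]  # indices of one mini-batch
            theta = theta - eta * grad_batch(theta, X[b], T[b])
    return theta
```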

  7. Deep learning | Optimization: Mini-batch gradient descent (challenges)
     Mini-batch gradient descent does not guarantee good convergence and raises a few challenges that need to be addressed.
     1 Choosing a proper learning rate can be difficult.
     2 Choosing the parameters (schedules and thresholds) of learning rate schedules is difficult.
     3 Should we use the same learning rate for all parameters?
     4 How do we avoid getting trapped in suboptimal local minima?

  8. Deep learning | Optimization: Momentum
     1 Momentum [1] is a method that helps accelerate SGD in the relevant direction and dampens oscillations.
     2 It does this by adding a fraction γ of the update vector of the past time step to the current update vector:
       v_t = γ v_{t−1} + η ∇θ J(θ)
       θ = θ − v_t
     [1] Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1), 145–151.
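
The two momentum equations translate directly into a one-step update. This sketch assumes the same hypothetical gradient oracle as above, with v initialised to zeros before the first call.

```python
import numpy as np

def momentum_step(theta, v, grad, eta=0.01, gamma=0.9):
    """One momentum update:
        v_t   = gamma * v_{t-1} + eta * grad J(theta)
        theta = theta - v_t
    """
    v = gamma * v + eta * grad(theta)
    return theta - v, v
```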

  9. Deep learning | Optimization: Nesterov accelerated gradient
     1 A ball that rolls down a hill, blindly following the slope, is highly unsatisfactory.
     2 We would like a smarter ball, one that has a notion of where it is going, so that it knows to slow down before the hill slopes up again.
     3 Nesterov accelerated gradient (NAG) [2] is a way to give the momentum term this kind of prescience:
       v_t = γ v_{t−1} + η ∇θ J(θ − γ v_{t−1})
       θ = θ − v_t
     The value θ − γ v_{t−1} gives an approximation of the next position of the parameters.
     [2] Nesterov, Y. (1983). A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR (translated as Soviet Math. Dokl.), 269, 543–547.
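
The only change relative to plain momentum is where the gradient is evaluated: at the look-ahead point θ − γ v_{t−1} rather than at θ. A sketch under the same assumptions as before:

```python
import numpy as np

def nag_step(theta, v, grad, eta=0.01, gamma=0.9):
    """One Nesterov update: the gradient is taken at the approximate
    next position theta - gamma * v, giving the step its 'prescience'."""
    v = gamma * v + eta * grad(theta - gamma * v)
    return theta - v, v
```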

  10. Deep learning | Optimization: Adagrad
     1 Adagrad [3] is an algorithm for gradient-based optimization that adapts the learning rate to the parameters.
     2 Adagrad updates the parameters in the following manner:
       θ_{t+1,i} = θ_{t,i} − η / √(G_{t,ii} + ϵ) · ∇θ J(θ_{t,i})
       where G_t ∈ R^{d×d} is a diagonal matrix in which each diagonal element (i, i) is the sum of the squares of the gradients w.r.t. θ_i, and ϵ is a smoothing term that avoids division by zero.
     3 Adadelta [4] is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate.
     [3] Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159.
     [4] Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv.
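
In practice only the diagonal of G_t is stored, i.e. one accumulator per parameter. A minimal sketch with the same hypothetical grad oracle:

```python
import numpy as np

def adagrad_step(theta, G_diag, grad, eta=0.01, eps=1e-8):
    """One Adagrad update. G_diag holds the running sum of squared
    gradients per parameter (the diagonal of G_t), so each coordinate
    effectively gets its own, monotonically shrinking learning rate."""
    g = grad(theta)
    G_diag = G_diag + g ** 2
    theta = theta - eta / np.sqrt(G_diag + eps) * g
    return theta, G_diag
```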

  11. Deep learning | Optimization: Adam
     1 Adaptive Moment Estimation (Adam) [5] computes adaptive learning rates for each parameter.
     2 Adam behaves like a heavy ball with friction, and thus prefers flat minima in the error surface.
     3 Adam computes decaying averages of past gradients and past squared gradients, m_t and v_t:
       m_t = β1 m_{t−1} + (1 − β1) ∇θ J(θ_t)        (1)
       v_t = β2 v_{t−1} + (1 − β2) (∇θ J(θ_t))²
       where m_t and v_t are estimates of the first moment and the second moment of the gradients, respectively.
     [5] Kingma, D. P., and Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, 1–13.

  12. Deep learning | Optimization: Adam (cont.)
     1 Initially m_t and v_t are set to 0.
     2 Hence m_t and v_t are biased towards zero.
     3 The bias-corrected estimates are
       m̂_t = m_t / (1 − β1^t)        (2)
       v̂_t = v_t / (1 − β2^t)
     Then Adam updates the parameters as
       θ_{t+1} = θ_t − η / (√v̂_t + ϵ) · m̂_t
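
Putting slides 11 and 12 together, here is a compact sketch of the full Adam loop. The grad oracle is the same hypothetical interface as above; the defaults β1 = 0.9, β2 = 0.999, ϵ = 1e-8 are the values suggested in the Adam paper.

```python
import numpy as np

def adam(grad, theta, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    """Adam: decaying moment estimates, bias correction, then the update."""
    m = np.zeros_like(theta)  # first-moment estimate, initialised to 0
    v = np.zeros_like(theta)  # second-moment estimate, initialised to 0
    for t in range(1, n_steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g       # Eq. (1)
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)          # bias correction, Eq. (2)
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta
```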

  13. Deep learning | Optimization: Other optimization algorithms
     1 AdaMax modifies the v_t term of Adam (replacing it with an ℓ∞-norm-based estimate).
     2 Nadam (Nesterov-accelerated Adaptive Moment Estimation) [6] combines Adam and NAG.
     3 For more information, please read: Sebastian Ruder (2017), "An overview of gradient descent optimization algorithms", arXiv.
     4 Which optimizer should we use?
     [6] Dozat, T. (2016). Incorporating Nesterov Momentum into Adam. ICLR Workshop, (1), 2013–2016.

  14. Deep learning | Model selection (section divider; table of contents repeated).

  15. Deep learning | Model selection: Model selection
     1 Consider a regression problem in which the training set is S = {(x_1, t_1), (x_2, t_2), ..., (x_N, t_N)}, t_k ∈ R, where t_k = f(x_k) + ϵ for all k = 1, 2, ..., N; f(x_k) ∈ R is the unknown function and ϵ is random noise.
     2 The goal is to approximate f(x) by a function g(x).
     3 The empirical error on the training set S is measured using the cost function E(g(x) | S) = (1/2) Σ_{i=1}^N (t_i − g(x_i))²
     4 The aim is to find the g(·) that minimizes the empirical error.
     5 We assume a hypothesis class for g(·) with a small set of parameters.
     6 Assume that g(x) is linear: g(x) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_D x_D

  16. Deep learning | Model selection: Model selection
     1 Define the following vectors and matrices.
       Data matrix:
         X = ( 1  x_11  x_12  ...  x_1D
               1  x_21  x_22  ...  x_2D
               ...
               1  x_N1  x_N2  ...  x_ND )
       The k-th input vector: X_k = (1, x_k1, x_k2, ..., x_kD)^T
       The weight vector: W = (w_0, w_1, w_2, ..., w_D)^T
       The target vector: t = (t_1, t_2, t_3, ..., t_N)^T
     2 The empirical error equals E(g(x) | S) = (1/2) Σ_{k=1}^N (t_k − W^T X_k)²
     3 Setting the gradient to zero: ∇_W E(g(x) | S) = Σ_{k=1}^N t_k X_k − (Σ_{k=1}^N X_k X_k^T) W = 0
     4 Solving for W, we obtain W* = (X^T X)^{−1} X^T t
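
The closed-form solution can be checked numerically. This sketch builds the design matrix with a leading column of ones for w_0 and solves the normal equations; using np.linalg.solve rather than an explicit inverse is a standard numerical choice, not something specified on the slide.

```python
import numpy as np

def fit_linear(X_raw, t):
    """Least-squares fit: W* = (X^T X)^{-1} X^T t.

    X_raw is the N x D matrix of raw inputs; a column of ones is
    prepended so that w_0 acts as the bias term.
    """
    X = np.hstack([np.ones((len(X_raw), 1)), X_raw])  # N x (D+1) design matrix
    return np.linalg.solve(X.T @ X, X.T @ t)          # solves (X^T X) W = X^T t
```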

  17. Deep learning | Model selection: Model selection
     1 If the linear model is too simple, the model can be a polynomial (a more complex hypothesis class): g(x) = w_0 + w_1 x + w_2 x² + ... + w_M x^M
     2 M is the order of the polynomial.
     3 Choosing the right value of M is called model selection.
     4 For M = 1, the model is too general (it underfits the data).
     5 For M = 9, the model is too specific (it overfits the data).
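
Model selection over M can then be sketched by fitting each candidate order and comparing errors on held-out data. np.vander builds the polynomial design matrix; the train/validation split is an illustrative assumption, not from the slide.

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Fit g(x) = w_0 + w_1 x + ... + w_M x^M by least squares."""
    X = np.vander(x, M + 1, increasing=True)  # columns: x^0, x^1, ..., x^M
    return np.linalg.lstsq(X, t, rcond=None)[0]

def empirical_error(w, x, t):
    """Empirical error (1/2) sum (t_i - g(x_i))^2 on a held-out set."""
    g = np.vander(x, len(w), increasing=True) @ w
    return 0.5 * np.sum((t - g) ** 2)

# Model selection: pick the M with the lowest validation error, e.g.
# best_M = min(range(10), key=lambda M: empirical_error(
#     fit_polynomial(x_train, t_train, M), x_val, t_val))
```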
