Deep Learning: Gradient-based optimization
Caio Corro, Université Paris Sud, 23 October 2019

Table of contents
Recall: neural networks · The training loop · Backpropagation · Parameter initialization · Regularization · Better optimizers
Recall: neural networks
Neural network
◮ x: input features
◮ z(1), z(2), z(3): hidden representations
◮ z(4): output logits or class weights
◮ p: probability distribution over classes
◮ θ = {W(1), b(1), ...}: parameters
◮ σ: non-linear activation function

z(1) = σ(W(1)x + b(1))
z(2) = σ(W(2)z(1) + b(2))
z(3) = σ(W(3)z(2) + b(3))
z(4) = σ(W(4)z(3) + b(4))
p = Softmax(z(4)), i.e. p_i = exp(z(4)_i) / Σ_j exp(z(4)_j)

[Figure: feed-forward network with inputs x1, ..., x4, three hidden layers z(1), z(2), z(3) of five units each, and output p]
Representation learning: Computer Vision [Lee et al., 2009]
Representation learning: Natural Language Processing [Voita et al., 2019]
The training loop
The big picture
Data split and usage
◮ Training set: to learn the parameters of the network
◮ Development (or dev or validation) set: to monitor the network during training
◮ Test set: to evaluate the model at the end

Generally you don't have to split the data yourself: standard splits exist to allow benchmarking.

Training loop
1. Update the parameters to minimize the loss on the training set
2. Evaluate the prediction accuracy on the dev set
3. If not satisfied, go back to 1
4. Evaluate the prediction accuracy on the test set with the parameters that performed best on dev
Pseudo-code
function Train(f, θ, T, D)
    bestdev = −∞
    for epoch = 1 to E do
        Shuffle T
        for x, y ∈ T do
            loss = L(f(x; θ), y)
            θ = θ − ε∇loss
        devacc = Evaluate(f, D)
        if devacc > bestdev then
            θ̂ = θ
            bestdev = devacc
    return θ̂

function Evaluate(f, D)
    n = 0
    for x, y ∈ D do
        ŷ = argmax_y′ f(x; θ)_y′
        if ŷ = y then n = n + 1
    return n / |D|
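The pseudo-code above can be sketched in plain Python. This is a minimal sketch, not the lab's API: the model f, the gradient function grad_loss, and the list-of-floats parameter representation are all hypothetical placeholders.

```python
import random

def evaluate(f, theta, data):
    """Accuracy: fraction of examples whose argmax score matches the label."""
    n = sum(1 for x, y in data
            if max(range(len(f(x, theta))), key=lambda c: f(x, theta)[c]) == y)
    return n / len(data)

def train(f, theta, train_set, dev_set, grad_loss, epochs=10, lr=0.01):
    """SGD training loop; keeps the parameters that perform best on dev."""
    best_dev, best_theta = float("-inf"), list(theta)
    for epoch in range(epochs):
        random.shuffle(train_set)          # sampling without replacement
        for x, y in train_set:
            g = grad_loss(f, theta, x, y)  # gradient of the loss w.r.t. theta
            theta = [t - lr * gi for t, gi in zip(theta, g)]
        dev_acc = evaluate(f, theta, dev_set)
        if dev_acc > best_dev:
            best_dev, best_theta = dev_acc, list(theta)
    return best_theta
```

In practice one would also log the mean loss and timings each epoch, as the next slide suggests.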
Further details
Sampling without replacement
◮ shuffle the training set
◮ loop over the new order
Experimentally this works better than "true" sampling, and it also seems to have good theoretical properties [Nagaraj et al., 2019].

Verbosity
At each epoch, it is useful to display:
◮ mean loss
◮ accuracy on training data
◮ accuracy on dev data
◮ timing information
◮ (sometimes) evaluate on dev several times per epoch
Step-size
θ(t+1) = θ(t) − ε(t)∇loss   ⇒ How to choose the step size ε(t)?

Convex optimization
◮ Nonsummable diminishing step size: Σ_{t=1}^∞ ε(t) = ∞ and lim_{t→∞} ε(t) = 0
◮ Backtracking / exact line search

Simple neural network heuristic
1. Start with a small value, e.g. ε = 0.01
2. If dev accuracy did not improve during the last N epochs, decay the learning rate by a small factor α, e.g. ε = α · ε with α = 0.1

Step-size annealing
◮ Step decay: multiply ε by α ∈ [0, 1] every N epochs
◮ Exponential decay: ε(t) = ε(0) exp(−α · t)
◮ 1/t decay: ε(t) = ε(0) / (1 + α · t)
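The three annealing schemes above can be written as one-line schedules; a minimal sketch, with eps0 and alpha standing for the slide's ε(0) and α:

```python
import math

def step_decay(eps0, alpha, N, t):
    """Multiply the step size by alpha every N epochs."""
    return eps0 * alpha ** (t // N)

def exponential_decay(eps0, alpha, t):
    """eps(t) = eps(0) * exp(-alpha * t)."""
    return eps0 * math.exp(-alpha * t)

def inverse_time_decay(eps0, alpha, t):
    """The 1/t decay: eps(t) = eps(0) / (1 + alpha * t)."""
    return eps0 / (1 + alpha * t)
```

For example, step_decay with alpha = 0.1 and N = 10 divides the step size by 10 every 10 epochs.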
Backpropagation
Scalar input
Derivative
Let f: ℝ → ℝ be a function and x, y ∈ ℝ be variables such that y = f(x).
For a given x, how does an infinitesimal change of x impact y?
dy/dx = f′(x) = lim_{ε→0} (f(x + ε) − f(x)) / ε

Linear approximation
Let f̃: ℝ → ℝ be a function parameterized by a ∈ ℝ, defined as follows:
f̃(x; a) = f(a) + f′(a) · (x − a)
Then f̃(x; a) is an approximation of f at a.
Scalar input
Example
f(x) = x² + 2
f′(x) = 2x
f̃(x; a) = f(a) + f′(a) · (x − a)
        = a² + 2 + 2a(x − a)
        = 2ax + 2 − a²

Intuition: the sign of f′(a) gives the slope of the approximation; we can use this information to move closer to the minimum of f(x).

[Figure: f(x) in black and its linear approximation f̃(x; a = −6) in red]
Scalar input
Chain rule
Let f: ℝ → ℝ and g: ℝ → ℝ be two functions and x, y, z be variables such that z = f(x), y = g(z), i.e. y = g(f(x)) = (g ∘ f)(x).
For a given x, how does an infinitesimal change of x impact y?
dy/dx = dy/dz · dz/dx
Scalar input
Example: explicit differentiation
f(x) = (2x + 1)² = 4x² + 4x + 1
f′(x) = 8x + 4

Example: differentiation using the chain rule
z = 2x + 1       dz/dx = 2
y = z² = f(x)    dy/dz = 2z
dy/dx = dy/dz · dz/dx = 2z · 2 = 4(2x + 1) = 8x + 4 = f′(x)
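The chain-rule result above can be checked numerically with a central finite difference; a small sketch (the helper names are hypothetical):

```python
def f(x):
    return (2 * x + 1) ** 2

def f_prime(x):
    # derivative via the chain rule: dy/dz * dz/dx = 2z * 2
    z = 2 * x + 1
    return 2 * z * 2

def numerical_derivative(f, x, eps=1e-6):
    """Central finite difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# at x = 3: f'(3) = 8*3 + 4 = 28, and the numerical estimate agrees
assert abs(f_prime(3.0) - 28.0) < 1e-12
assert abs(numerical_derivative(f, 3.0) - 28.0) < 1e-3
```

This kind of finite-difference check is also the standard way to debug hand-written backward passes.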
Vector input
Let f: ℝ^m → ℝ be a function and x ∈ ℝ^m, y ∈ ℝ be variables such that y = f(x).

Partial derivative
For a given x, how does an infinitesimal change of x_i impact y?
∂y/∂x_i, i.e. each input x_j, j ≠ i, is considered as a constant.

Gradient
For a given x, how does an infinitesimal change of x impact y?
∇_x y = [∂y/∂x_1, ∂y/∂x_2, ...]⊤
Vector input
Chain rule
Let f: ℝ^m → ℝ^n and g: ℝ^n → ℝ be two functions and x ∈ ℝ^m, z ∈ ℝ^n, y ∈ ℝ be variables such that z = f(x), y = g(z).
For a given x_i, how does an infinitesimal change of x_i impact y?
∂y/∂x_i = Σ_j ∂y/∂z_j · ∂z_j/∂x_i
Vector example
z = Wx + b
z_j = Σ_i W_{j,i} x_i + b_j     ∂z_j/∂x_i = W_{j,i}
y = Σ_j z_j                     ∂y/∂z_j = 1

∂y/∂x_i = Σ_j ∂y/∂z_j · ∂z_j/∂x_i = Σ_j 1 · W_{j,i}
Vector example
z(1) = ...x...
z(2) = ...z(1)...
y = ...z(2)...

∂y/∂x_i = Σ_k ∂y/∂z(2)_k · ∂z(2)_k/∂x_i
        = Σ_k ∂y/∂z(2)_k · Σ_j ∂z(2)_k/∂z(1)_j · ∂z(1)_j/∂x_i

⇒ It is starting to get annoying!
Jacobian
Let f: ℝ^m → ℝ^n be a function and x ∈ ℝ^m, y ∈ ℝ^n be variables such that y = f(x).

Gradient
For a given x, how does an infinitesimal change of x impact y_j?
∇_x y_j = [∂y_j/∂x_1, ∂y_j/∂x_2, ...]⊤

Jacobian
For a given x, how does an infinitesimal change of x impact y?
J_x y = [ ∂y_1/∂x_1  ∂y_1/∂x_2  ... ]
        [ ∂y_2/∂x_1  ∂y_2/∂x_2  ... ]
        [ ...        ...        ... ]
Chain rule using the Jacobian notation
Let f: ℝ^m → ℝ^n and g: ℝ^n → ℝ be two functions and x ∈ ℝ^m, z ∈ ℝ^n, y ∈ ℝ be variables such that z = f(x), y = g(z).

Partial notation
∂y/∂x_i = Σ_j ∂y/∂z_j · ∂z_j/∂x_i

Gradient + Jacobian notation
Let ⟨·, ·⟩ be the dot product operation: ∇_x y = ⟨J_x z, ∇_z y⟩

∇_x y = [∂y/∂x_1, ∂y/∂x_2, ...]⊤ ∈ ℝ^m
J_x z = [ ∂z_1/∂x_1  ∂z_1/∂x_2  ... ]
        [ ∂z_2/∂x_1  ∂z_2/∂x_2  ... ]  ∈ ℝ^{n×m}
        [ ...        ...        ... ]
∇_z y = [∂y/∂z_1, ∂y/∂z_2, ...]⊤ ∈ ℝ^n
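In code, the product of the Jacobian with the upstream gradient is a matrix-vector product with the transposed Jacobian. A numpy sketch for the earlier running example z = Wx + b, y = Σ_j z_j (variable names are illustrative):

```python
import numpy as np

# z = W x + b, y = sum(z): the Jacobian of z w.r.t. x is W,
# so the chain rule gives grad_x y = W^T grad_z y
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
b = rng.normal(size=3)

z = W @ x + b
y = z.sum()

grad_z = np.ones(3)    # dy/dz_j = 1
grad_x = W.T @ grad_z  # the slide's <J_x z, grad_z y>

# check against the partial-derivative formula: dy/dx_i = sum_j W_{j,i}
assert np.allclose(grad_x, W.sum(axis=0))
```

This is exactly the vector-Jacobian product that backpropagation chains from the loss down to the inputs.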
Forward and backward passes
Forward pass                   Backward pass
z(1) = f(1)(x; θ(1))           ∇θ(1)y = ⟨Jθ(1)z(1), ∇z(1)y⟩
z(2) = f(2)(z(1); θ(2))        ∇z(1)y = ⟨Jz(1)z(2), ∇z(2)y⟩    ∇θ(2)y = ⟨Jθ(2)z(2), ∇z(2)y⟩
z(3) = f(3)(z(2); θ(3))        ∇z(2)y = ⟨Jz(2)z(3), ∇z(3)y⟩    ∇θ(3)y = ⟨Jθ(3)z(3), ∇z(3)y⟩
z(4) = f(4)(z(3); θ(4))        ∇z(3)y = ⟨Jz(3)z(4), ∇z(4)y⟩    ∇θ(4)y = ⟨Jθ(4)z(4), ∇z(4)y⟩
y = f(5)(z(4); θ(5))           ∇z(4)y                          ∇θ(5)y

The forward pass runs top to bottom; the backward pass runs bottom to top, reusing each ∇z(i)y to compute the gradients of the layer below and of its parameters.
Computation Graph (CG) 1/2
[Figure: low-level computation graph with nodes x, ×, +, σ, Softmax, log, pick, and the gradients ∇z(1)L, ∇b(1)L, ..., ∇z(2)L, ∇LL flowing backwards]

z(1) = σ(W(1)x + b(1))
z(2) = W(2)z(1) + b(2)
L = −log( exp(z(2)_y) / Σ_{y′} exp(z(2)_{y′}) )
Computation Graph (CG) 2/2
[Figure: higher-level computation graph with Linear and NLL nodes]

z(1) = σ(W(1)x + b(1))
z(2) = W(2)z(1) + b(2)
L = −log( exp(z(2)_y) / Σ_{y′} exp(z(2)_{y′}) )
Computation Graph (CG) implementation
CG construction / eager forward pass
The computation graph is built in topological order (≈ the execution order of operations):
◮ x, z(1), z(2), ..., L: expression nodes
◮ W(1), b(1), ...: parameter nodes

Expression node
◮ Values
◮ Gradient
◮ Backward operation
◮ Backpointer(s) to antecedents
The backward operation and backpointer(s) are null for input nodes.

Parameter node
◮ Persistent values
◮ Gradient
Eager forward pass example
Non-linear activation function z′ = relu(z):

function relu(z)
    z′ = ExpressionNode()                          ⊲ Create node
    z′.value = [max(0, z_1), max(0, z_2), ...]     ⊲ Compute forward value
    z′.d = d_relu                                  ⊲ Set backward operation
    z′.backptrs = [z]                              ⊲ Set backpointers
    return z′

Projection operation z = Wx + b, i.e. z = Linear(x, W, b):

function Linear(x, W, b)
    z = ExpressionNode()                           ⊲ Create node
    z.value = Wx + b                               ⊲ Compute forward value
    z.d = d_linear                                 ⊲ Set backward operation
    z.backptrs = [W, b]                            ⊲ Set backpointers
    return z
Backward pass
Execution of the backward pass
Nodes are visited in reverse topological order (reverse order of creation):
◮ The gradient of the loss (last created node) is set to 1
◮ For each node, we call its derivative function
◮ The derivative functions backpropagate the gradient to the antecedents
Gradients must be accumulated (an expression can be used several times).

function Backward(nodes, L)
    L.grad = 1
    for n ∈ reversed(nodes) do
        n.d(n.backptrs)        ⊲ Call the derivative function
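A toy version of this machinery fits in a few lines of Python; a sketch assuming scalar values and a single multiplication operation (names like d_mul are illustrative, not today's lab API):

```python
class ExpressionNode:
    """A graph node: value, accumulated gradient, backward op, backpointers."""
    def __init__(self, value, d=None, backptrs=()):
        self.value = value
        self.grad = 0.0
        self.d = d                # backward operation (None for inputs)
        self.backptrs = backptrs  # antecedent nodes

def mul(a, b, nodes):
    """Eager forward pass: compute the value and record the node."""
    out = ExpressionNode(a.value * b.value, d=d_mul, backptrs=(a, b))
    nodes.append(out)
    return out

def d_mul(out):
    a, b = out.backptrs
    a.grad += out.grad * b.value  # accumulate, never overwrite
    b.grad += out.grad * a.value

def backward(nodes, loss):
    loss.grad = 1.0               # gradient of the loss w.r.t. itself
    for n in reversed(nodes):     # reverse topological order
        if n.d is not None:
            n.d(n)

# y = x * x, so dy/dx = 2x; both uses of x accumulate into x.grad
nodes = []
x = ExpressionNode(3.0)
y = mul(x, x, nodes)
backward(nodes, y)
assert x.grad == 6.0
```

Because x appears twice in the product, its gradient receives two contributions (3.0 + 3.0), illustrating why accumulation matters.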
Backward pass example: relu 1/2
relu(x) = 0 if x ≤ 0, x otherwise
relu′(x) = 0 if x < 0; 1 if x > 0; undefined otherwise

∇_z L = ⟨J_z z′, ∇_{z′} L⟩

The Jacobian J_z z′ is diagonal (piecewise function!):
∂z′_j/∂z_i = 0 if i ≠ j; relu′(z_i) if i = j

∂L/∂z_i = Σ_j ∂L/∂z′_j · ∂z′_j/∂z_i
        = ∂L/∂z′_i · ∂z′_i/∂z_i   (piecewise function!)
        = ∂L/∂z′_i · 1[z_i > 0]
Backward pass example: relu 2/2
relu(x) = 0 if x ≤ 0, x otherwise
relu′(x) = 0 if x < 0; 1 if x > 0; undefined otherwise

function relu(z)
    z′ = ExpressionNode()
    z′.value = [max(0, z_1), max(0, z_2), ...]
    z′.d = d_relu
    z′.backptrs = [z]
    return z′

∂L/∂z_i = ∂L/∂z′_i · 1[z_i > 0]

function d_relu(z′, [z])
    for i ∈ {1...n} do
        ⊲ If the value is positive, we copy the gradient
        if z_i > 0 then
            z.grad_i = z.grad_i + z′.grad_i
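The same d_relu logic, vectorized with numpy (a sketch; function names are illustrative):

```python
import numpy as np

def relu_forward(z):
    return np.maximum(0.0, z)

def relu_backward(z, grad_out):
    # dL/dz_i = dL/dz'_i * 1[z_i > 0]: the Jacobian is diagonal,
    # so we simply copy the gradient where the input was positive
    return grad_out * (z > 0)

z = np.array([-1.0, 2.0, 0.5])
grad_out = np.array([10.0, 10.0, 10.0])
grad_in = relu_backward(z, grad_out)
assert np.allclose(grad_in, [0.0, 10.0, 10.0])
```

The boolean mask (z > 0) plays the role of the indicator 1[z_i > 0] in the derivation.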
Backward pass example: linear projection 1/2
z = Wx + b  ⇔  z_j = Σ_k W_{j,k} x_k + b_j

Bias:
∂z_j/∂b_i = ∂/∂b_i ( Σ_k W_{j,k} x_k + b_j ) = 1 if i = j; 0 otherwise
∂L/∂b_i = Σ_j ∂L/∂z_j · ∂z_j/∂b_i = Σ_j ∂L/∂z_j · 1[j = i] = ∂L/∂z_i   (copy the incoming gradient!)

Weights:
∂z_j/∂W_{i,l} = ∂/∂W_{i,l} ( Σ_k W_{j,k} x_k + b_j ) = x_l if i = j; 0 otherwise
∂L/∂W_{i,l} = Σ_j ∂L/∂z_j · ∂z_j/∂W_{i,l} = Σ_j ∂L/∂z_j · x_l · 1[i = j]
⇒ ∇_W L = (∇_z L) x⊤   (outer product)
Backward pass example: linear projection 2/2
function Linear(x, W, b)
    z = ExpressionNode()
    z.value = Wx + b
    z.d = d_linear
    z.backptrs = [W, b]
    return z

∂L/∂b_i = ∂L/∂z_i
∇_W L = (∇_z L) x⊤

function d_Linear(z, [x, W, b])
    b.grad = b.grad + z.grad
    W.grad = W.grad + z.grad @ x⊤
    x.grad = x.grad + ...

Missing gradient?
Why don't we backpropagate to x?! We do not need it for today's lab exercises; you will see how to do that next week.
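The three gradients of the linear layer, in numpy (a sketch with illustrative names; unlike the lab it also fills in the gradient w.r.t. x, which is W⊤ times the incoming gradient):

```python
import numpy as np

def linear_backward(x, W, b, grad_z):
    """Gradients of z = Wx + b, following the slide's formulas."""
    grad_b = grad_z               # dL/db_i = dL/dz_i (copy incoming gradient)
    grad_W = np.outer(grad_z, x)  # grad_W L = (grad_z L) x^T (outer product)
    grad_x = W.T @ grad_z         # gradient w.r.t. the input (next week)
    return grad_x, grad_W, grad_b

x = np.array([1.0, 2.0])
W = np.eye(2)
b = np.zeros(2)
grad_z = np.array([3.0, 4.0])
grad_x, grad_W, grad_b = linear_backward(x, W, b, grad_z)
assert np.allclose(grad_b, [3.0, 4.0])
assert np.allclose(grad_W, [[3.0, 6.0], [4.0, 8.0]])
```

In a real graph implementation these values would be accumulated into the nodes' .grad fields rather than returned.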
Summary
Computation graph
◮ Forward pass: compute values
◮ Backward pass: compute the gradient for each parameter
◮ Gradient initialization: be careful, because gradients are accumulated

Today's lab exercises
◮ Simple linear model: don't build a computation graph, explicitly apply the forward and backward operations
◮ d_Linear: return a tuple with the gradients of W and b instead of writing into a node
◮ No need to worry about gradient initialization / accumulation :-)

Pytorch
In Pytorch, expression nodes used to be of type Variable. Nowadays, autodiff is directly implemented in the Tensor class.
Parameter initialization
Experimental observations
The MNIST database: comparison of different depths for a feed-forward architecture.

[Figure: feed-forward network x(1) → W(1) → y(1) → W(2) → y(2) → ... → y(L−1) → W(L) → y(L): output]

◮ Hidden layers have a sigmoid activation function.
◮ The output layer is a softmax.
Experimental observations: http://neuralnetworksanddeeplearning.com/chap5.html
◮ Without hidden layer: ≈ 88% accuracy
◮ 1 hidden layer (30 units): ≈ 96.5% accuracy
◮ 2 hidden layers (30 units): ≈ 96.9% accuracy
◮ 3 hidden layers (30 units): ≈ 96.5% accuracy
◮ 4 hidden layers (30 units): ≈ 96.5% accuracy
⇒ Adding depth beyond one or two hidden layers does not improve accuracy here.
Intuitive explanation 1/2
Let us consider the simplest deep neural network, with just a single neuron in each layer. w_i and b_i are respectively the weight and bias of neuron i, and C is some loss function.

Compute the gradient of C w.r.t. the bias b_1:
∂C/∂b_1 = ∂C/∂y_4 × ∂y_4/∂a_4 × ∂a_4/∂y_3 × ∂y_3/∂a_3 × ∂a_3/∂y_2 × ∂y_2/∂a_2 × ∂a_2/∂y_1 × ∂y_1/∂a_1 × ∂a_1/∂b_1   (1)
        = ∂C/∂y_4 × σ′(a_4) × w_4 × σ′(a_3) × w_3 × σ′(a_2) × w_2 × σ′(a_1)   (2)
Intuitive explanation 2/2
The derivative of the activation function, σ′:

[Figure: plot of σ′(x), which peaks at 0.25 at x = 0]

σ(x) = 1 / (1 + exp(−x))
σ′(x) = σ(x)(1 − σ(x))

Vanishing gradient
◮ If the last layers are well trained (and output "strong values" close to 0 or 1),
◮ the early layers receive a really small incoming gradient.
In the "best case", we have successive multiplications by 0.25!
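The 0.25 bound is easy to verify numerically (a small sketch):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x) * (1.0 - sigmoid(x))

# sigma'(x) peaks at 0.25 (at x = 0), so a chain of L sigmoid layers
# scales the incoming gradient of the first layer by at most 0.25^L
# (ignoring the weights)
assert abs(sigmoid_prime(0.0) - 0.25) < 1e-12
assert 0.25 ** 4 < 0.004  # 4 layers already shrink the gradient by ~256x
```

Away from 0 the factor is even smaller, e.g. σ′(5) ≈ 0.0066, which is the saturation regime described above.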
Other activation functions
Hyperbolic tangent
[Figure: plot of tanh(x)]
tanh(x) = (1 − exp(−2x)) / (1 + exp(−2x))
tanh′(x) = 1 − tanh(x)²
◮ Better gradient than sigmoid around 0
◮ Popular in Natural Language Processing

Rectified Linear Unit
[Figure: plot of relu(x)]
relu(x) = 0 if x ≤ 0, x otherwise
relu′(x) = 0 if x < 0; 1 if x > 0; undefined otherwise
◮ No vanishing gradient issue
◮ "Dead units" problem (i.e. b_i ≪ 0)
◮ Popular in Computer Vision (very deep networks)
Parameter initialization

What do we want?
◮ Values close to 0 to prevent vanishing gradients (or exploding/disappearing gradients in the case of relu)
◮ Gradient magnitudes approximately similar for all layers (to prevent a subset of layers from doing all the work while others are useless)

Hyperbolic tangent
Let W ∈ ℝ^{m×n} and b ∈ ℝ^m:
◮ W ∼ U( −√6/√(m+n), +√6/√(m+n) )
◮ b = 0
Usually called Xavier or Glorot initialization [Glorot and Bengio, 2010].

Rectified Linear Unit
Let W ∈ ℝ^{m×n} and b ∈ ℝ^m:
◮ W ∼ U( −√6/√n, +√6/√n )
◮ b = 0 (or b = 0.01 to prevent dying units)
Usually called Kaiming or He initialization [He et al., 2015].
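Both initializations as short numpy functions (a sketch; here m is the output size and n the input size, matching W ∈ ℝ^{m×n}):

```python
import numpy as np

def glorot_uniform(m, n, rng):
    """Xavier/Glorot: W ~ U(-sqrt(6)/sqrt(m+n), +sqrt(6)/sqrt(m+n))."""
    limit = np.sqrt(6.0) / np.sqrt(m + n)
    return rng.uniform(-limit, limit, size=(m, n))

def he_uniform(m, n, rng):
    """Kaiming/He (for relu): W ~ U(-sqrt(6)/sqrt(n), +sqrt(6)/sqrt(n))."""
    limit = np.sqrt(6.0) / np.sqrt(n)
    return rng.uniform(-limit, limit, size=(m, n))

rng = np.random.default_rng(0)
W = glorot_uniform(100, 50, rng)
# every sampled weight stays inside the Glorot bound
assert np.all(np.abs(W) <= np.sqrt(6.0 / 150))
```

The He bound depends only on the fan-in n, which compensates for relu zeroing out half of the activations on average.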
Regularization
Generalization
Overparameterized neural networks
Networks where the number of parameters exceeds the training dataset size. They:
◮ can learn the dataset by heart, i.e. overfit the data → this does not generalize well to unseen data
◮ are easier to optimize in practice

Monitoring the training process
◮ Loss should go down ⇒ otherwise your step-size is probably too big!
◮ Training accuracy should go up
◮ Dev accuracy should go up ⇒ otherwise the network is overfitting!

Regularization
Techniques to control the parameters during learning and prevent overfitting.
Learning with random inputs and labels 1/2 [Zhang et al., 2017]
Learning with random inputs and labels 2/2 [Zhang et al., 2017]
L2 regularization, Gaussian prior, or weight decay 1/3

θ̂ = argmin_θ L(f(x; θ), y) + (λ/2)||θ||²
  = argmin_θ L(f(x; θ), y) + R(θ; λ)

Regularization term
The second term R(θ; λ) is an L2 regularization term, which can be equivalently interpreted as:
◮ a soft constraint on the magnitude of the parameters
◮ a Gaussian prior on the parameters: N(0, 1/λ)
◮ re-scaling the parameters after each update (weight decay)
L2 regularization, Gaussian prior, or weight decay 2/3

θ̂ = argmin_θ L(f(x; θ), y) + (λ/2)||θ||²
  = argmin_θ L(f(x; θ), y) + R(θ; λ)

Gradient update
θ = θ − ε∇θL − ε∇θR = θ − ε(∇θL + ∇θR)

What does the gradient of the regularizer look like?
Let b be a parameter of the network:
∂R/∂b = ∂/∂b (λ/2)||θ||² = (λ/2) · 2b = λb
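The λb result can be checked with a finite difference on one coordinate (a small sketch; variable names are illustrative):

```python
import numpy as np

lam = 0.1
theta = np.array([1.0, -2.0, 3.0])

def R(theta):
    """L2 regularizer: R(theta; lam) = lam/2 * ||theta||^2."""
    return 0.5 * lam * np.sum(theta ** 2)

# central finite difference on the first coordinate
eps = 1e-6
e0 = np.array([eps, 0.0, 0.0])
numerical = (R(theta + e0) - R(theta - e0)) / (2 * eps)

# matches the analytic gradient dR/dtheta_0 = lam * theta_0
assert abs(numerical - lam * theta[0]) < 1e-6
```

So each SGD step shrinks every parameter by ε·λ·θ, which is the "decay" in weight decay.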
L2 regularization, Gaussian prior, or weight decay 3/3

Implementation from Pytorch (slightly modified):

class SGD(Optimizer):
    def step(self, closure=None):
        """Performs a single optimization step."""
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                d_p = p.grad.data  # get gradient
                weight_decay = group['weight_decay']
                if weight_decay != 0:
                    d_p.add_(weight_decay, p.data)  # add weight decay to the gradient
                p.data.add_(-group['lr'], d_p)  # update parameters
Dropout 1/4 [Hinton et al., 2012, Srivastava et al., 2014]
How does dropout work?
◮ During training, we randomly "turn off" neurons, i.e. we randomly set elements of the hidden layers z to 0
◮ During test, we use the full network

[Figure: (a) standard neural net, (b) after applying dropout]

Intuition
◮ prevents co-adaptation between units
◮ equivalent to averaging different models
Dropout 2/4 [Hinton et al., 2012]
Dropout 3/4
Dropout layer
A dropout layer is parameterized by the probability p ∈ [0, 1] of "turning off" a neuron:
z′ = Dropout(z; p = 0.5)

Implementation
◮ z ∈ ℝ^n: output of a hidden layer
◮ p ∈ [0, 1]: dropout probability
◮ m ∈ {0, 1}^n: mask vector
◮ z′: hidden values after dropout application

The mask m is a vector of booleans stating whether neuron z_i is kept (m_i = 1) or "turned off" (m_i = 0).

Forward pass:
m ∼ Bernoulli(1 − p)
z′_i = z_i · m_i / (1 − p)

Backward pass:
∂z′_i/∂z_i = m_i / (1 − p)
⇒ no gradient for "turned off" neurons.
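The forward and backward passes of this (inverted) dropout in numpy (a sketch; function names are illustrative):

```python
import numpy as np

def dropout_forward(z, p, rng):
    """Inverted dropout: scale kept units by 1/(1-p) so test time needs no rescaling."""
    m = (rng.random(z.shape) >= p).astype(z.dtype)  # m_i = 1 with probability 1 - p
    return z * m / (1.0 - p), m

def dropout_backward(grad_out, m, p):
    # dz'_i/dz_i = m_i / (1 - p): no gradient for "turned off" neurons
    return grad_out * m / (1.0 - p)

rng = np.random.default_rng(0)
z = np.ones(8)
z_drop, m = dropout_forward(z, p=0.5, rng=rng)
# kept units are scaled to 2.0, dropped units are exactly 0
assert set(np.unique(z_drop)) <= {0.0, 2.0}
```

The mask must be stored during the forward pass so the backward pass can reuse it.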
Dropout 4/4
Where do you apply dropout?
◮ On the input of the neural network x
◮ After activation functions (σ(0) = 0)
◮ Do not apply dropout on the output logits

Which dropout probability should you use?
◮ Empirical question: you have to test!
◮ The dropout probability can differ between layers (especially input vs. hidden layers)
◮ Usually 0.1 ≤ p ≤ 0.5

Dropout variants
Dropout can be applied differently for special neural network architectures (e.g. convolutional or recurrent neural networks).
Better optimizers
Stochastic Gradient Descent (SGD)
θ(t+1) = θ(t) − ε(t)∇θL

Advantages
◮ Simple
◮ Single hyper-parameter: the step-size ε

Downsides
◮ Forgets information about previous updates
◮ Requires searching for the best step-size strategy
◮ Requires step-size annealing in practice: how? what scaling factor?
◮ Based on first-order information only (i.e. the curvature of the optimized function is ignored)
Momentum 1/3
[Figure: successive gradients ∇θL(t−2), ∇θL(t−1), ... oscillate around a shared "main direction"]
Momentum 2/3
[Polyak, 1964]

◮ γ: velocity of the parameters, i.e. cumulative information about past gradients
◮ µ ∈ [0, 1]: momentum, i.e. how much information must be preserved?

γ(t+1) = µγ(t) + ∇θL
θ(t+1) = θ(t) − εγ(t+1)

Variants
◮ Gradient dampening, i.e. diminish the contribution of the current gradient
◮ Nesterov's Accelerated Gradient [Sutskever et al., 2013]
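One Polyak momentum update as a pure function (a sketch with illustrative names and hyper-parameter values):

```python
def momentum_step(theta, grad, velocity, lr=0.1, mu=0.9):
    """One update: gamma <- mu * gamma + grad, theta <- theta - lr * gamma."""
    velocity = [mu * v + g for v, g in zip(velocity, grad)]
    theta = [t - lr * v for t, v in zip(theta, velocity)]
    return theta, velocity

theta, velocity = [1.0], [0.0]
# with a constant gradient the velocity builds up across steps
theta, velocity = momentum_step(theta, [2.0], velocity)  # gamma = 2.0, theta = 0.8
theta, velocity = momentum_step(theta, [2.0], velocity)  # gamma = 3.8, theta = 0.42
assert abs(theta[0] - 0.42) < 1e-9
```

Note that the second step moves almost twice as far as the first: past gradients keep pushing in the "main direction".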
Momentum 3/3
Implementation from Pytorch (slightly modified):

for group in self.param_groups:
    for p in group['params']:
        if p.grad is None:
            continue
        d_p = p.grad.data  # get the gradient
        if momentum != 0:
            param_state = self.state[p]
            if 'momentum_buffer' not in param_state:
                # initialize velocity vector
                buf = param_state['momentum_buffer'] = torch.clone(d_p).detach()
            else:
                buf = param_state['momentum_buffer']  # retrieve velocity vector
                buf.mul_(momentum).add_(d_p)  # update velocity vector
            d_p = buf
        p.data.add_(-group['lr'], d_p)  # update parameters
Adaptive learning rates 1/2
Adagrad [Duchi et al., 2011]
◮ Replaces the global step-size with a dynamic per-parameter step-size + a global learning rate
◮ The dynamic per-parameter step-size is computed w.r.t. the l2-norm of previous gradients
⇒ parameters with small (resp. large) gradients will have a large (resp. small) step-size

Adadelta [Zeiler, 2012]
◮ The dynamic per-parameter rate is computed with a fixed window of past gradients
◮ Approximates second-order information to incorporate curvature
⇒ less sensitive to the learning rate hyper-parameter!
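A minimal Adagrad update, showing how the per-parameter step-size shrinks with the accumulated squared gradients (a sketch; names are illustrative):

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
    """Adagrad: divide the step by the root of the accumulated squared gradients."""
    accum = accum + grad ** 2
    theta = theta - lr * grad / (np.sqrt(accum) + eps)
    return theta, accum

theta = np.array([1.0, 1.0])
accum = np.zeros(2)
# first step: one parameter sees a large gradient, the other a small one
theta, accum = adagrad_step(theta, np.array([10.0, 0.1]), accum)
# second step: identical current gradients, but the parameter with the
# large gradient history now moves much less
theta2, accum = adagrad_step(theta, np.array([1.0, 1.0]), accum)
assert (theta[0] - theta2[0]) < (theta[1] - theta2[1])
```

The eps term only prevents division by zero before any gradient has been accumulated.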
Adaptive learning rates 2/2

Adam [Kingma and Ba, 2015]
◮ Combines a dynamic per-parameter learning rate with momentum
◮ Initialization bias correction
Has convergence issues but works well in practice [Reddi et al., 2018].
Variants: AdaMax, Nadam [Dozat, 2016], Radam [Liu et al., 2019], AMSGrad

Rule of thumb
◮ Optimizers based on adaptive learning rates usually work out of the box, e.g. Adam is really popular in Natural Language Processing
◮ A fine-tuned SGD with step-size annealing can provide better results, at the cost of expensive hyper-parameter tuning

Regularization issue
Weight decay is not equivalent to the l2-norm penalty when using adaptive learning rates!
References I
Dozat, T. (2016). Incorporating Nesterov momentum into Adam. ICLR Workshop.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256. PMLR.
References II
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pages 1026–1034. IEEE Computer Society.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.

Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. ICLR.
References III
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. Pages 609–616.

Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. (2019). On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265.

Nagaraj, D., Jain, P., and Netrapalli, P. (2019). SGD without replacement: Sharper rates for general smooth convex functions. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4703–4711. PMLR.
References IV
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17.

Reddi, S. J., Kale, S., and Kumar, S. (2018). On the convergence of Adam and beyond. ICLR.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.
References V
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1139–1147. PMLR.

Voita, E., Talbot, D., Moiseev, F., Sennrich, R., and Titov, I. (2019). Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808. Association for Computational Linguistics.

Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.
References VI
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. ICLR.