
Training DNNs: Basic Methods
Ju Sun, Computer Science & Engineering, University of Minnesota, Twin Cities
March 3, 2020

Supervised learning as function approximation
Underlying true function: f_0. Training data: {(x_i, y_i)}.


Which activation at the output node?
Depends on the desired output:
– unbounded scalar/vector output (e.g., regression): identity activation
– binary classification with 0/1 output: e.g., sigmoid σ(x) = 1/(1 + e^{−x})
– multiclass classification: turn labels into vectors via one-hot encoding,
  class k ⇒ [0, …, 0, 1, 0, …, 0]^⊺ (k−1 zeros before the 1, n−k zeros after),
  then apply the softmax activation
  z ↦ [e^{z_1}/Σ_j e^{z_j}, …, e^{z_p}/Σ_j e^{z_j}]^⊺ (both sketched below)
– discrete probability distribution: softmax
– etc.
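As an illustration only (a minimal sketch, not part of the original slides; the tensors and class count are made up), one-hot encoding and softmax in PyTorch:

    import torch
    import torch.nn.functional as F

    # Hypothetical example: 3 classes, a batch of 2 labels.
    labels = torch.tensor([0, 2])                      # class indices
    one_hot = F.one_hot(labels, num_classes=3).float()
    # tensor([[1., 0., 0.],
    #         [0., 0., 1.]])

    logits = torch.tensor([[2.0, 0.5, -1.0],
                           [0.1, 0.2, 3.0]])           # raw network outputs z
    probs = F.softmax(logits, dim=1)                   # e^{z_i} / sum_j e^{z_j}
    print(probs.sum(dim=1))                            # each row sums to 1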

Which loss ℓ to choose? Make it differentiable, or almost so.
– regression: ‖·‖₂² (common, torch.nn.MSELoss), ‖·‖₁ (for robustness, torch.nn.L1Loss), etc.
– binary classification: encode the classes as {0, 1}; use ‖·‖₂² or the cross-entropy
  ℓ(y, ŷ) = −y log ŷ − (1 − y) log(1 − ŷ)
  (min at ŷ = y, torch.nn.BCELoss)
– multiclass classification: based on one-hot encoding and softmax activation, use ‖·‖₂² or the cross-entropy
  ℓ(y, ŷ) = −Σ_i y_i log ŷ_i
  (min at ŷ = y, torch.nn.CrossEntropyLoss)
– multiclass classification with label smoothing, assuming m classes: one-hot encoding makes m − 1 entries of y equal to 0. When y_i = 0, the derivative of y_i log ŷ_i is 0 ⇒ no update due to y_i. Remedy: relax
  [0, …, 0, 1, 0, …, 0]^⊺ (k−1 zeros, then m−k zeros)
  into
  [ε, …, ε, 1 − (m−1)ε, ε, …, ε]^⊺ (k−1 ε's, then m−k ε's)
  for a small ε (see the sketch below)
– difference between distributions: Kullback–Leibler divergence loss (torch.nn.KLDivLoss) or the Wasserstein metric
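A minimal sketch of these PyTorch losses (the logits and targets are made up; note that torch.nn.CrossEntropyLoss takes raw logits plus integer class labels and applies log-softmax internally, and that its label_smoothing option, available in recent PyTorch versions, spreads the ε mass slightly differently from the formula above):

    import torch
    import torch.nn as nn

    y_hat = torch.tensor([[1.2, -0.3, 0.8]])   # raw logits for 3 classes
    y     = torch.tensor([2])                   # true class index

    # Cross-entropy on logits + class indices (not one-hot vectors).
    ce = nn.CrossEntropyLoss()
    print(ce(y_hat, y))

    # Label smoothing as described above.
    ce_smooth = nn.CrossEntropyLoss(label_smoothing=0.1)
    print(ce_smooth(y_hat, y))

    # Regression losses from the slide.
    mse, l1 = nn.MSELoss(), nn.L1Loss()
    pred, target = torch.randn(4), torch.randn(4)
    print(mse(pred, target), l1(pred, target))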

Outline
– Three design choices
– Training algorithms
  – Which method
  – Where to start
  – When to stop
– Suggested reading

Framework of line-search methods
A generic line-search algorithm
Input: initialization x_0, stopping criterion (SC), k = 1
1: while SC not satisfied do
2:   choose a direction d_k
3:   decide a step size t_k
4:   make a step: x_k = x_{k−1} + t_k d_k
5:   update counter: k = k + 1
6: end while
Four questions:
– How to choose the direction d_k?
– How to choose the step size t_k?
– Where to initialize?
– When to stop?
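A minimal Python sketch of this template (an illustration, not from the slides): the direction is the negative gradient, the step size is fixed, and grad_f is an assumed callable.

    import numpy as np

    def line_search_descent(grad_f, x0, step=0.1, tol=1e-6, max_iter=1000):
        """Generic descent loop: direction d_k = -grad f(x), fixed step t_k."""
        x = x0.copy()
        for k in range(1, max_iter + 1):
            d = -grad_f(x)                 # choose a direction d_k
            t = step                       # decide a step size t_k
            x = x + t * d                  # make a step
            if np.linalg.norm(d) < tol:    # stopping criterion
                break
        return x

    # Example: minimize f(x) = ||x||^2, whose gradient is 2x.
    x_star = line_search_descent(lambda x: 2 * x, np.array([3.0, -4.0]))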

Outline
– Three design choices
– Training algorithms
  – Which method
  – Where to start
  – When to stop
– Suggested reading

From deterministic to stochastic optimization
Recall our optimization problem:
  min_W (1/m) Σ_{i=1}^m ℓ(y_i, DNN_W(x_i)) + Ω(W)
What happens when m is large, i.e., in the "big data" regime?
Blessing: assume the (x_i, y_i)'s are iid; then
  (1/m) Σ_{i=1}^m ℓ(y_i, DNN_W(x_i)) → E_{x,y} ℓ(y, DNN_W(x))
by the law of large numbers. Large m ≈ good generalization!
Curse: storage and computation
– storage: the dataset {(x_i, y_i)} is typically stored on GPU/TPU for parallel computing—loading a whole dataset into GPU memory is often infeasible
– computation: each iteration costs at least O(mn), where n is #(opt. variables)—both can be large when training DNNs!

From deterministic to stochastic optimization
How to get around the storage and computation bottlenecks when m is large? Stochastic optimization (stochastic = random).
Idea: use a small batch of data samples to approximate quantities of interest
– gradient: (1/m) Σ_{i=1}^m ∇_W ℓ(y_i, DNN_W(x_i)) → E_{x,y} ∇_W ℓ(y, DNN_W(x)),
  approximated by the stochastic gradient
  (1/|J|) Σ_{j∈J} ∇_W ℓ(y_j, DNN_W(x_j))
  for a random subset J ⊂ {1, …, m}, where |J| ≪ m
– Hessian: (1/m) Σ_{i=1}^m ∇²_W ℓ(y_i, DNN_W(x_i)) → E_{x,y} ∇²_W ℓ(y, DNN_W(x)),
  approximated by the stochastic Hessian
  (1/|J|) Σ_{j∈J} ∇²_W ℓ(y_j, DNN_W(x_j))
  for a random subset J ⊂ {1, …, m}, where |J| ≪ m
… justified by the law of large numbers
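To make the approximation concrete, a quick numerical check (an illustrative sketch on a made-up least-squares objective, not from the slides) that a mini-batch gradient tracks the full gradient:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 10000, 50
    X, y = rng.standard_normal((m, n)), rng.standard_normal(m)
    w = rng.standard_normal(n)

    def grad(idx):
        # Gradient of (1/|idx|) * sum_i (y_i - x_i^T w)^2 over the rows in idx.
        Xi, yi = X[idx], y[idx]
        return -2.0 / len(idx) * Xi.T @ (yi - Xi @ w)

    full = grad(np.arange(m))                              # full gradient, O(mn)
    batch = grad(rng.choice(m, size=128, replace=False))   # stochastic gradient
    print(np.linalg.norm(full - batch) / np.linalg.norm(full))  # small relative error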

Stochastic gradient descent (SGD)
In general (i.e., not only for DNNs), suppose we want to solve
  min_w F(w) := (1/m) Σ_{i=1}^m f(w; ξ_i),  where the ξ_i's are data samples.
Idea: replace the gradient with a stochastic gradient in each step of GD.
Stochastic gradient descent (SGD)
Input: initialization w_0, stopping criterion (SC), k = 1
1: while SC not satisfied do
2:   sample a random subset J_k ⊂ {1, …, m}
3:   calculate the stochastic gradient g̃_k := (1/|J_k|) Σ_{j∈J_k} ∇_w f(w_{k−1}; ξ_j)
4:   decide a step size t_k
5:   make a step: w_k = w_{k−1} − t_k g̃_k
6:   update counter: k = k + 1
7: end while
– J_k is redrawn in each iteration
– Traditional SGD: |J_k| = 1. The version presented here is also called mini-batch gradient descent.
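A direct translation of this loop into NumPy (a minimal sketch under made-up data and a fixed step size; grad_fi is an assumed per-sample gradient callable):

    import numpy as np

    def sgd(grad_fi, w0, m, batch_size=32, lr=0.01, n_iter=500, seed=0):
        # Mini-batch SGD: w_k = w_{k-1} - t_k * g_k, where g_k averages the
        # per-sample gradients over a freshly drawn batch J_k.
        rng = np.random.default_rng(seed)
        w = w0.copy()
        for _ in range(n_iter):
            J = rng.choice(m, size=batch_size, replace=False)  # redraw J_k
            g = np.mean([grad_fi(w, j) for j in J], axis=0)    # stochastic gradient
            w = w - lr * g                                     # fixed step size t_k
        return w

    # Usage on least squares, f(w; xi_i) = (y_i - x_i^T w)^2:
    rng = np.random.default_rng(1)
    X, y = rng.standard_normal((1000, 20)), rng.standard_normal(1000)
    w_hat = sgd(lambda w, i: -2.0 * (y[i] - X[i] @ w) * X[i], np.zeros(20), m=1000)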

What's an epoch?
– Canonical SGD: sample a random subset J_k ⊂ {1, …, m} each iteration—sampling with replacement
– Practical SGD: shuffle the training set and take a consecutive batch of size B (the batch size) each iteration—sampling without replacement
One pass over the shuffled training set is called one epoch.
Practical stochastic gradient descent (SGD)
Input: init. w_0, SC, batch size B, iteration counter k = 1, epoch counter ℓ = 1
1: while SC not satisfied do
2:   permute the index set {1, …, m} and divide it into batches of size B
3:   for i ∈ {1, …, #batches} do
4:     calculate the stochastic gradient g̃_k based on the i-th batch
5:     decide a step size t_k
6:     make a step: w_k = w_{k−1} − t_k g̃_k
7:     update iteration counter: k = k + 1
8:   end for
9:   update epoch counter: ℓ = ℓ + 1
10: end while
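In PyTorch this epoch structure is exactly what a shuffling DataLoader provides; a minimal sketch with made-up data and model:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    X, y = torch.randn(1000, 20), torch.randn(1000, 1)   # placeholder data
    model = torch.nn.Linear(20, 1)                       # placeholder model
    loss_fn = torch.nn.MSELoss()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    # shuffle=True re-permutes the index set at the start of every epoch;
    # each batch of size B is a consecutive chunk of the permutation.
    loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

    for epoch in range(10):             # epoch counter
        for xb, yb in loader:           # one pass over the data = one epoch
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()             # stochastic gradient for this batch
            opt.step()                  # w_k = w_{k-1} - t_k * g_k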

GD vs. SGD
Consider min_w ‖y − Xw‖₂², where X ∈ ℝ^{10000×500}, y ∈ ℝ^{10000}, w ∈ ℝ^{500}.
– By iteration count: GD is faster
– By iter(GD)/epoch(SGD): SGD is faster
– Remember: the cost of one epoch of SGD ≈ the cost of one iteration of GD!
Overall, SGD can reach a medium-accuracy solution more quickly and at lower cost, which suffices for most purposes in machine learning [Bottou and Bousquet, 2008].
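A sketch of this cost accounting on random data (illustrative only; the step sizes are made up): one GD iteration and one SGD epoch both touch all m rows once, but the epoch makes m/B parameter updates.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 10000, 500
    X, y = rng.standard_normal((m, n)), rng.standard_normal(m)

    def full_grad(w):
        return -2.0 / m * X.T @ (y - X @ w)   # touches all m rows: O(mn)

    # One GD iteration: one full gradient, one update.
    w_gd = np.zeros(n)
    w_gd -= 1e-3 * full_grad(w_gd)

    # One SGD epoch: the same O(mn) total work, but m/B updates.
    B, lr = 100, 1e-3
    w_sgd = np.zeros(n)
    perm = rng.permutation(m)                 # shuffle once per epoch
    for start in range(0, m, B):
        idx = perm[start:start + B]
        g = -2.0 / B * X[idx].T @ (y[idx] - X[idx] @ w_sgd)
        w_sgd -= lr * g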

Step size (learning rate) for SGD
Recall the recommended step-size rule for GD: backtracking line search.
Key idea: F(x − t∇F(x)) − F(x) ≈ −ct‖∇F(x)‖² for a certain c ∈ (0, 1).
Shall we do the same for SGD? No—but why?
– SGD tries to avoid the factor of m in computing the full gradient ∇_w F(w) = (1/m) Σ_{i=1}^m ∇_w f(w; ξ_i), i.e., it reduces m to B (the batch size)
– But computing F(w) = (1/m) Σ_{i=1}^m f(w; ξ_i) or F(w − t g̃) = (1/m) Σ_{i=1}^m f(w − t g̃; ξ_i) brings back the factor of m; similarly for ∇F
– What about approximating the objective values from small batches as well? The approximation errors in F and ∇F may ruin the stability of the Taylor criterion (see the sketch below for why full objective evaluations are the bottleneck)
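For contrast, a minimal backtracking (Armijo) step for full-batch GD, assuming callables F and grad_F; note that every test of the acceptance condition needs a full objective evaluation, which is exactly the O(m) cost SGD is trying to avoid:

    import numpy as np

    def backtracking_gd_step(F, grad_F, x, t0=1.0, c=1e-4, shrink=0.5):
        """One GD step with backtracking: accept t once
        F(x - t*g) - F(x) <= -c * t * ||g||^2 (Armijo condition)."""
        g = grad_F(x)
        t = t0
        while F(x - t * g) - F(x) > -c * t * np.dot(g, g):
            t *= shrink          # each test costs a FULL objective evaluation...
        return x - t * g         # ...i.e., O(m) work on a finite-sum objective

    # Example: F(x) = ||x||^2, grad F(x) = 2x.
    x_new = backtracking_gd_step(lambda x: x @ x, lambda x: 2 * x,
                                 np.array([3.0, -4.0]))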

Step size (learning rate, or LR) for SGD
Classical theory for SGD on convex problems requires
  Σ_k t_k = ∞  and  Σ_k t_k² < ∞.
Practical implementation: diminishing step size/LR, e.g.,
– 1/t decay: t_k = α/(1 + βk), with α, β tunable parameters and k the iteration index
– exponential decay: t_k = α e^{−βk}, with α, β tunable parameters and k the iteration index
– staircase decay: start from t_0 and divide it by a factor (e.g., 5 or 10) every L (say, 10) epochs—popular in practice
Some heuristic variants:
– watch the validation error and decrease the LR when it stagnates
– watch the objective and decrease the LR when it stagnates
Check out torch.optim.lr_scheduler in PyTorch!
https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
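A minimal sketch of these schedules with torch.optim.lr_scheduler (the model and the epoch loop are placeholders):

    import torch

    model = torch.nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    # Staircase decay: divide the LR by 10 every 10 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.1)

    # Alternatives matching the other rules above:
    #   torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.95)   # exponential decay
    #   torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode='min',
    #       factor=0.1, patience=10)   # heuristic: decay when a watched metric stagnates

    for epoch in range(30):
        # ... run one epoch of training here ...
        scheduler.step()                          # advance the schedule per epoch
        print(epoch, opt.param_groups[0]['lr'])   # current LR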

Beyond the vanilla SGD
– Momentum/acceleration methods
– SGD with adaptive learning rates
– Stochastic 2nd-order methods

Why momentum? (Figure credit: Princeton ELE522)
– GD is cheap (O(n) per step), but its overall convergence is sensitive to conditioning
– Newton's method is insensitive to conditioning but expensive (O(n³) per step)
A cheap way to achieve faster convergence? Answer: use historic information.

Heavy-ball method (due to Polyak)
In physics, a heavy object has a large inertia/momentum—resistance to changes in velocity.
  x_{k+1} = x_k − α_k ∇f(x_k) + β_k (x_k − x_{k−1})
where β_k (x_k − x_{k−1}) is the momentum term. (Figure credit: Princeton ELE522)
History helps to smooth out the zig-zag path!
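A toy sketch of the heavy-ball update on an ill-conditioned quadratic (the matrix and the constants α, β are made up for illustration):

    import numpy as np

    # f(x) = 0.5 * x^T A x with condition number 100; minimizer at the origin.
    A = np.diag([1.0, 100.0])
    grad = lambda x: A @ x

    x_prev = x = np.array([1.0, 1.0])
    alpha, beta = 0.01, 0.9
    for k in range(500):
        x_next = x - alpha * grad(x) + beta * (x - x_prev)  # momentum term
        x_prev, x = x, x_next
    print(x)   # approaches the minimizer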

Nesterov's accelerated gradient method (due to Y. Nesterov)
  x_{k+1} = x_k + β_k (x_k − x_{k−1}) − α_k ∇f(x_k + β_k (x_k − x_{k−1}))
i.e., the gradient is evaluated at the look-ahead point. (Figure credit: Stanford CS231N)
SGD with momentum/acceleration: replace the gradient term ∇f by the stochastic gradient g̃ computed on small batches.
Check out torch.optim.SGD (its convention differs slightly from the formulas here):
https://pytorch.org/docs/stable/optim.html#torch.optim.SGD
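In PyTorch both variants are options of the same optimizer; a minimal sketch (model and data are placeholders, and as noted PyTorch's momentum convention differs slightly from the formulas above):

    import torch

    model = torch.nn.Linear(10, 1)

    # SGD with Polyak-style momentum.
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    # Nesterov variant: same call with nesterov=True (shown for reference).
    opt_nag = torch.optim.SGD(model.parameters(), lr=0.01,
                              momentum=0.9, nesterov=True)

    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()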
