Hyper-parameters/Tweaking


Hyper-parameters/Tweaking
Yufeng Ma, Chris Dusold (Virginia Tech), November 17, 2015

Overview
  1 Batch Normalization: Internal Covariate Shift; Mini-Batch Normalization; Key Points in Batch Normalization; Experiments and Results
  2 Importance of Initialization and Momentum: Overview of first-order methods; Momentum & Nesterov's Accelerated Gradient (NAG); Deep Autoencoders & RNN - Echo-State Networks


  1. Key Points in Batch Normalization
     - The original parameters and the newly introduced γ and β are all trained.
     - At inference time, population statistics over the whole training data replace the mini-batch estimates:
         E[x] <- E_B[μ_B],   Var[x] <- (m / (m - 1)) * E_B[σ_B²]
     - In convolutional layers, different locations of a feature map are normalized in the same way: the effective mini-batch size is m' = |B| = m * pq, with one pair γ^(k), β^(k) per feature map.
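As a rough illustration of the inference-time statistics above, the following NumPy sketch combines per-mini-batch means and variances into population estimates and then applies the learned γ and β. The function names, and the assumption that all mini-batches share the same size m, are illustrative rather than taken from the paper.

```python
import numpy as np

def bn_inference_statistics(batches):
    """Combine per-mini-batch statistics into population statistics, as above:
    E[x] <- E_B[mu_B],  Var[x] <- m/(m-1) * E_B[sigma_B^2]."""
    m = batches[0].shape[0]                          # mini-batch size (assumed equal for all batches)
    mus = [b.mean(axis=0) for b in batches]          # mu_B for each mini-batch
    variances = [b.var(axis=0) for b in batches]     # sigma_B^2 (biased estimate, divides by m)
    e_x = np.mean(mus, axis=0)                       # E[x]
    var_x = (m / (m - 1)) * np.mean(variances, axis=0)  # unbiased population variance estimate
    return e_x, var_x

def bn_inference(x, e_x, var_x, gamma, beta, eps=1e-5):
    """Apply batch normalization at inference with the fixed population statistics."""
    x_hat = (x - e_x) / np.sqrt(var_x + eps)
    return gamma * x_hat + beta
```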

  2. Key Points in Batch Normalization (continued)
     - Higher learning rates are allowed: scaling the weights does not change the layer output, BN(Wu) = BN((aW)u), and the gradients satisfy
         ∂BN((aW)u)/∂u = ∂BN(Wu)/∂u,   ∂BN((aW)u)/∂(aW) = (1/a) * ∂BN(Wu)/∂W
       so larger weights lead to smaller weight gradients.
     - Batch Normalization also regularizes the model, reducing overfitting.
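The scale-invariance property BN(Wu) = BN((aW)u) is easy to check numerically. The sketch below uses a plain NumPy batch norm with γ = 1, β = 0 and made-up layer sizes; it is a sanity check under those assumptions, not the paper's implementation.

```python
import numpy as np

def batch_norm(z, eps=1e-5):
    """Normalize each column of z over the mini-batch dimension (gamma = 1, beta = 0)."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(64, 10))        # mini-batch of 64 inputs
W = rng.normal(size=(10, 5))         # layer weights
a = 7.3                              # arbitrary scale factor

out1 = batch_norm(u @ W)             # BN(Wu)
out2 = batch_norm(u @ (a * W))       # BN((aW)u)
print(np.allclose(out1, out2))       # True: the BN output is invariant to scaling W
```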


  3. Activations over time
     - Batch Normalization helps the network train faster and achieve higher accuracy.
     - (figure credit: reference paper)

  4. Activations over time
     - Batch Normalization makes the input distribution to each layer more stable.
     - (figure credit: reference paper)

  5. Accelerating Batch Normalization Networks: tricks to follow
     - Increase the learning rate
     - Remove or reduce Dropout
     - Reduce the ℓ2 weight regularization
     - Accelerate the learning-rate decay
     - Remove Local Response Normalization
     - Shuffle training examples more thoroughly
     - Reduce the photometric distortions

  6. Network Comparisons
     - Networks compared: Inception, BN-Baseline, BN-x5, BN-x30, BN-x5-Sigmoid
     - (figure credit: reference paper)

  7. Ensemble Classification
     - The ensemble reaches a top-5 validation error of 4.9% and a test error of 4.82%, exceeding the estimated accuracy of human raters.
     - (figure credit: reference paper)


  8. Challenges to be solved
     - Reference paper: "On the importance of initialization and momentum in deep learning".
     - It has been difficult for first-order methods to reach the performance previously achievable only by second-order methods such as Hessian-Free optimization.
     - With a well-designed random initialization and a slowly increasing schedule for the momentum parameter, first-order methods can reach that performance, with no need for sophisticated second-order methods.

  9. Overview of first-order methods
     - Vanilla Stochastic Gradient Descent (SGD)
     - SGD + Momentum
     - Nesterov's Accelerated Gradient (NAG)
     - AdaGrad
     - Adam
     - Rprop
     - RMSProp
     - AdaDelta
     (slide credit: Ishan Misra)


  10. Several First-order Methods
      Notation: θ - network parameters, f - objective function, ε - learning rate, ∇f - gradient of f, v - velocity vector, μ - momentum coefficient.
      Vanilla SGD:
          v_{t+1} = ε ∇f(θ_t)
          θ_{t+1} = θ_t - v_{t+1}
      (slide credit: Ishan Misra)
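A minimal NumPy sketch of the vanilla SGD update above; grad_f is an assumed user-supplied gradient function, and the toy quadratic objective is only for illustration.

```python
import numpy as np

def sgd_step(theta, grad_f, lr=0.01):
    """Vanilla SGD: v_{t+1} = eps * grad f(theta_t); theta_{t+1} = theta_t - v_{t+1}."""
    v = lr * grad_f(theta)
    return theta - v

# Example: minimize f(theta) = ||theta||^2 / 2, whose gradient is theta itself.
theta = np.array([3.0, -2.0])
for _ in range(100):
    theta = sgd_step(theta, grad_f=lambda th: th, lr=0.1)
print(theta)  # close to [0, 0]
```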

  11. Several First-order Methods: Rprop
      Rprop update (per parameter):
          if ∇f_t · ∇f_{t-1} > 0:      v_t = η⁺ v_{t-1}
          else if ∇f_t · ∇f_{t-1} < 0: v_t = η⁻ v_{t-1}
          else:                        v_t = v_{t-1}
          θ_{t+1} = θ_t - v_t,   where 0 < η⁻ < 1 < η⁺
      (slide credit: Ishan Misra)
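The sketch below implements the Rprop rule above per parameter. The step-size bounds and the sign(∇f) factor in the final update are standard Rprop details that the slide leaves implicit, and the η⁺, η⁻ defaults are common choices rather than values from the slide.

```python
import numpy as np

def rprop_step(theta, grad, prev_grad, v, eta_plus=1.2, eta_minus=0.5,
               v_min=1e-6, v_max=50.0):
    """One Rprop step. v holds the per-parameter step sizes: grow by eta_plus when
    the gradient keeps its sign, shrink by eta_minus when it flips, else keep it."""
    sign_change = grad * prev_grad
    v = np.where(sign_change > 0, v * eta_plus, v)
    v = np.where(sign_change < 0, v * eta_minus, v)
    v = np.clip(v, v_min, v_max)
    # Move against the sign of the current gradient by the adapted step size
    # (the sign() factor is standard Rprop, left implicit on the slide).
    theta = theta - np.sign(grad) * v
    return theta, v
```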

  12. Several First-order Methods: AdaGrad and RMSProp
      AdaGrad:
          r_t = ∇f(θ_t)² + r_{t-1}
          v_{t+1} = (α / √r_t) ∇f(θ_t)
          θ_{t+1} = θ_t - v_{t+1}
      RMSProp (≈ Rprop + SGD):
          r_t = (1 - γ) ∇f(θ_t)² + γ r_{t-1}
          v_{t+1} = (α / √r_t) ∇f(θ_t)
          θ_{t+1} = θ_t - v_{t+1}
      (slide credit: Ishan Misra)
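A compact NumPy rendering of the two updates above. The small eps added inside the square root and the default step sizes are common conventions for numerical stability, not part of the slide's formulas.

```python
import numpy as np

def adagrad_step(theta, grad, r, alpha=0.01, eps=1e-8):
    """AdaGrad: accumulate squared gradients, scale the step by 1/sqrt(r)."""
    r = r + grad ** 2
    theta = theta - alpha / (np.sqrt(r) + eps) * grad
    return theta, r

def rmsprop_step(theta, grad, r, alpha=0.001, gamma=0.9, eps=1e-8):
    """RMSProp: exponential moving average of squared gradients instead of a full sum."""
    r = gamma * r + (1 - gamma) * grad ** 2
    theta = theta - alpha / (np.sqrt(r) + eps) * grad
    return theta, r
```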

  13. Several First-order Methods: AdaDelta and Adam
      AdaDelta motivation (matching units): a second-order step
          v_{t+1} = H⁻¹ ∇f ∝ f' / f'' ∝ (1 / units of θ) / (1 / units of θ)² ∝ units of θ
      has the same units as θ, whereas a plain gradient step does not.
      Adam:
          r_t = (1 - γ₁) ∇f(θ_t) + γ₁ r_{t-1}
          p_t = (1 - γ₂) ∇f(θ_t)² + γ₂ p_{t-1}
          r̂_t = r_t / (1 - γ₁^t),   p̂_t = p_t / (1 - γ₂^t)
          v_t = α r̂_t / √p̂_t
          θ_{t+1} = θ_t - v_t
      (slide credit: Ishan Misra)
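A sketch of the Adam update above; the eps term and the default values of α, γ₁, γ₂ follow common practice rather than the slide.

```python
import numpy as np

def adam_step(theta, grad, r, p, t, alpha=0.001, gamma1=0.9, gamma2=0.999, eps=1e-8):
    """One Adam step following the update above.
    r: moving average of gradients, p: moving average of squared gradients, t: step count (>= 1)."""
    r = gamma1 * r + (1 - gamma1) * grad
    p = gamma2 * p + (1 - gamma2) * grad ** 2
    r_hat = r / (1 - gamma1 ** t)        # bias correction for the first moment
    p_hat = p / (1 - gamma2 ** t)        # bias correction for the second moment
    theta = theta - alpha * r_hat / (np.sqrt(p_hat) + eps)
    return theta, r, p
```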


  14. Momentum and NAG
      Classical Momentum (CM):
          v_{t+1} = μ v_t - ε ∇f(θ_t)
          θ_{t+1} = θ_t + v_{t+1}
      Nesterov's Accelerated Gradient (NAG):
          v_{t+1} = μ v_t - ε ∇f(θ_t + μ v_t)
          θ_{t+1} = θ_t + v_{t+1}
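The two updates above differ only in where the gradient is evaluated, which the following sketch makes explicit; grad_f is an assumed user-supplied gradient function.

```python
import numpy as np

def momentum_step(theta, v, grad_f, lr=0.01, mu=0.9):
    """Classical momentum: gradient evaluated at the current parameters."""
    v = mu * v - lr * grad_f(theta)
    return theta + v, v

def nag_step(theta, v, grad_f, lr=0.01, mu=0.9):
    """Nesterov's accelerated gradient: gradient evaluated after the momentum step."""
    v = mu * v - lr * grad_f(theta + mu * v)
    return theta + v, v
```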

  15. Relationship between CM and NAG
      - NAG evaluates the gradient at θ_t + μ v_t, i.e. at the next point except for the yet-unknown gradient correction. Thus, when adding μ v_t results in an immediate undesirable increase in the objective f, the gradient at θ_t + μ v_t pushes back toward θ_t more strongly than the gradient at θ_t, so NAG corrects the velocity more quickly than CM.
      - (figure credit: reference paper)

  16. Relationship between CM and NAG on a quadratic
      - Apply CM and NAG to a positive-definite quadratic objective q(x) = xᵀAx/2 + bᵀx.
      - They differ only in the effective momentum coefficient along each eigendirection of A:
          CM:  μ
          NAG: μ(1 - λ_i ε), where λ_i is the corresponding eigenvalue of A.
      - When ε is small, CM and NAG are equivalent; when ε is large, NAG's smaller effective momentum μ(1 - λ_i ε) on high-curvature directions stops oscillations.
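A small simulation of this analysis, under assumed illustrative settings: a diagonal quadratic with eigenvalues 1 and 100, ε chosen so that ελ = 1 on the stiff direction, and μ = 0.9. With these particular numbers CM keeps oscillating along the stiff direction while NAG's effective momentum μ(1 - ελ) = 0 suppresses the oscillation almost immediately.

```python
import numpy as np

A = np.diag([1.0, 100.0])            # badly conditioned quadratic q(x) = x^T A x / 2
grad = lambda x: A @ x
q = lambda x: 0.5 * x @ A @ x
eps, mu, steps = 0.01, 0.9, 50       # eps * lambda = 1 on the high-curvature direction

x_cm, v_cm = np.array([1.0, 1.0]), np.zeros(2)
x_nag, v_nag = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(steps):
    v_cm = mu * v_cm - eps * grad(x_cm)                   # classical momentum
    x_cm = x_cm + v_cm
    v_nag = mu * v_nag - eps * grad(x_nag + mu * v_nag)   # Nesterov's accelerated gradient
    x_nag = x_nag + v_nag

print("CM :", q(x_cm))   # stiff-direction oscillations decay slowly
print("NAG:", q(x_nag))  # effective momentum mu*(1 - eps*lambda) = 0 damps them quickly
```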


  17. Deep Autoencoders
      - Structure of a deep autoencoder (figure credit: http://deeplearning4j.org/deepautoencoder.html)

  18. Deep Autoencoders: initialization and momentum
      - Sparse initialization: each unit is connected to 15 randomly chosen units in the previous layer, with weights drawn from a unit Gaussian (the remaining weights are zero).
      - Schedule for the momentum coefficient:
          μ_t = min(1 - 2^(-1 - log₂(⌊t/250⌋ + 1)), μ_max)
      - Related schedules from the optimization literature: μ_t = 1 - 3/(t + 5) for objectives that are not strongly convex (Nesterov, 1983); a constant μ_t for strongly convex objectives (Nesterov, 2003).
      - (table credit: reference paper)
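A sketch of the two recipes above. The connection count of 15 follows the slide; the weight-matrix layout and the default μ_max = 0.995 are assumed placeholders for illustration.

```python
import numpy as np

def sparse_init(n_in, n_out, n_connections=15, rng=None):
    """Sparse initialization: each output unit gets 15 incoming weights drawn from a
    unit Gaussian; all other weights start at zero."""
    if rng is None:
        rng = np.random.default_rng(0)
    W = np.zeros((n_in, n_out))
    for j in range(n_out):
        idx = rng.choice(n_in, size=min(n_connections, n_in), replace=False)
        W[idx, j] = rng.standard_normal(len(idx))
    return W

def momentum_schedule(t, mu_max=0.995):
    """mu_t = min(1 - 2^(-1 - log2(floor(t/250) + 1)), mu_max)."""
    return min(1.0 - 2.0 ** (-1 - np.log2(np.floor(t / 250) + 1)), mu_max)

print([round(momentum_schedule(t), 3) for t in (0, 250, 1000, 10000)])
# starts at 0.5 and rises toward mu_max as t grows
```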

  19. RNN - Echo-State Networks
      - Echo-State Networks are a family of RNNs in which the hidden-to-output connections are learned from data, while the recurrent (hidden-to-hidden) connections are fixed to a random draw from a specific distribution.
      - (figure credit: Mantas Lukoševičius)

  20. RNN - Echo-State Networks: ESN-based initialization
      - Set the spectral radius of the hidden-to-hidden matrix to around 1.1.
      - The initial scale of the input-to-hidden connections plays an important role: a Gaussian draw with standard deviation 0.001 achieves a good balance, but the best value is task dependent.
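A minimal sketch of this initialization recipe: rescale a random recurrent matrix to the target spectral radius and draw small input-to-hidden weights. Using a dense Gaussian recurrent matrix here is a simplification for illustration; only the spectral-radius target of 1.1 and the 0.001 input scale come from the slide.

```python
import numpy as np

def esn_style_init(n_hidden, n_input, spectral_radius=1.1, input_scale=0.001, seed=0):
    """ESN-based RNN initialization sketch: rescale a random recurrent matrix so its
    spectral radius (largest |eigenvalue|) matches the target, and draw small input weights."""
    rng = np.random.default_rng(seed)
    W_hh = rng.standard_normal((n_hidden, n_hidden))
    radius = np.max(np.abs(np.linalg.eigvals(W_hh)))
    W_hh *= spectral_radius / radius                               # enforce the target spectral radius
    W_xh = input_scale * rng.standard_normal((n_hidden, n_input))  # small input-to-hidden scale
    return W_hh, W_xh
```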
