Dropout in RNNs Following a VI Interpretation



  1. Dropout in RNNs Following a VI Interpretation
     Yarin Gal, yg279@cam.ac.uk
     Unless specified otherwise, photos are either original work or taken from Wikimedia, under a Creative Commons license.

  2. Recurrent Neural Networks
     Recurrent neural networks (RNNs) are damn useful.
     Figure: RNN structure (image source: karpathy.github.io/2015/05/21/rnn-effectiveness)

  3. Recurrent Neural Networks
     But these also overfit very quickly...
     Figure: Overfitting
     This means:
     - We can't use large models
     - We have to use early stopping
     - We can't use small data
     - We have to waste data on validation sets

  8. Dropout in recurrent neural networks
     Let's use dropout then. But lots of research has claimed that this is a bad idea:
     - Pachitariu & Sahani, 2013: noise added in the recurrent connections of an RNN leads to model instabilities
     - Bayer et al., 2013: with dropout, the RNN's dynamics change dramatically
     - Pham et al., 2014: dropout in recurrent layers disrupts the RNN's ability to model sequences
     - Zaremba et al., 2014: applying dropout to the non-recurrent connections alone results in improved performance
     - Bluche et al., 2015: exploratory analysis of the performance of dropout before, inside, and after the RNN

  9. Dropout in recurrent neural networks
     Research has settled on using dropout for inputs and outputs alone.
     Figure: Naive application of dropout in RNNs, masking the inputs x_{t-1}, x_t, x_{t+1} and outputs of h_{t-1}, h_t, h_{t+1} (colours = different dropout masks)
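The naive scheme in the figure can be sketched in a few lines of numpy. This is an illustrative toy, not the speaker's code: the sizes, the tanh cell, and the dropout rate are arbitrary choices. A fresh mask is drawn for the input and output connections at every time step, while the recurrent connection h -> h is left noise-free:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p, rng):
    """Standard (inverted) dropout: zero units with prob. p, rescale the rest."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

# Toy RNN dimensions (hypothetical sizes for illustration).
d_in, d_h, T, p = 4, 8, 5, 0.5
W_x = rng.standard_normal((d_h, d_in)) * 0.1
W_h = rng.standard_normal((d_h, d_h)) * 0.1

xs = rng.standard_normal((T, d_in))
h = np.zeros(d_h)
outputs = []
for x_t in xs:
    # Naive scheme: a *fresh* mask on the input and output connections at
    # every time step; the recurrent connection h -> h is left untouched.
    x_t = dropout(x_t, p, rng)
    h = np.tanh(W_x @ x_t + W_h @ h)
    outputs.append(dropout(h, p, rng))

print(np.stack(outputs).shape)  # (5, 8)
```

Sampling a different mask at every step is exactly what the papers above found disruptive when applied to the recurrent connections as well, which is why the masks here only touch x_t and the emitted output.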

 10. Dropout in recurrent neural networks
     Why not use dropout with recurrent layers?
     - It doesn't work
     - Noise drowns the signal
     - Because it's not used correctly?

     First, some background on Bayesian modelling and VI in Bayesian neural networks.

 14. Bayesian modelling and inference
     - Observed inputs X = {x_i}_{i=1}^N and outputs Y = {y_i}_{i=1}^N
     - Capture the stochastic process believed to have generated the outputs
     - Define the model parameters ω as random variables
     - Prior distribution over ω: p(ω)
     - Likelihood: p(Y | ω, X)
     - Posterior (Bayes' theorem):
         p(ω | X, Y) = p(Y | ω, X) p(ω) / p(Y | X)
     - Predictive distribution given a new input x*:
         p(y* | x*, X, Y) = ∫ p(y* | x*, ω) p(ω | X, Y) dω
     - But... the posterior p(ω | X, Y) is often intractable

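The predictive-distribution integral above can be approximated by Monte Carlo once we can sample from the posterior. A toy, fully conjugate example (a single weight with Gaussian prior and likelihood; all numbers are illustrative, not from the talk) where the posterior happens to be known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (illustrative): y = omega * x + Gaussian noise, known noise var.
sigma2 = 0.25
X = rng.standard_normal(20)
omega_true = 2.0
Y = omega_true * X + rng.normal(0, np.sqrt(sigma2), 20)

# With a N(0, 1) prior on omega this conjugate posterior is Gaussian in
# closed form -- here we only need *samples* from it.
s_n = sigma2 / (sigma2 + X @ X)          # posterior variance
m_n = (X @ Y) / (sigma2 + X @ X)         # posterior mean
omegas = rng.normal(m_n, np.sqrt(s_n), 1000)

def gauss_pdf(y, mu, var):
    return np.exp(-(y - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Monte Carlo estimate of p(y* | x*, X, Y) = E_posterior[ p(y* | x*, omega) ].
x_star, y_star = 1.0, 2.0
pred = np.mean(gauss_pdf(y_star, omegas * x_star, sigma2))
print(round(float(pred), 3))
```

For a Bayesian neural network the posterior has no such closed form, which is exactly the intractability flagged in the last bullet; the rest of the talk replaces the posterior samples with samples from an approximating distribution.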
 17. Approximate inference
     - Approximate p(ω | X, Y) with a simple distribution q_θ(ω)
     - Minimise its divergence from the posterior w.r.t. θ:
         KL(q_θ(ω) || p(ω | X, Y))
     - Identical to minimising
         L_VI(θ) := − ∫ q_θ(ω) log p(Y | X, ω) dω + KL(q_θ(ω) || p(ω))
       (the first term involves the likelihood, the second the prior)
     - We can approximate the predictive distribution:
         q_θ(y* | x*) = ∫ p(y* | x*, ω) q_θ(ω) dω

 19. Bayesian neural networks
     - Place a prior p(W_i): W_i ~ N(0, I) for i ≤ L (and write ω := {W_i}_{i=1}^L)
     - The output is a random variable:
         f(x, ω) = W_L σ( ... W_2 σ(W_1 x + b_1) ... )
     - Softmax likelihood for classification:
         p(y | x, ω) = softmax(f(x, ω))
       or a Gaussian for regression:
         p(y | x, ω) = N(y; f(x, ω), τ⁻¹ I)
     - But it is difficult to evaluate the posterior p(ω | X, Y)

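With the weights treated as random variables, the network output f(x, ω) is itself random: two draws of ω give two different class distributions for the same input. A minimal numpy sketch (sizes and the tanh nonlinearity are illustrative choices), sampling ω from the N(0, I) prior:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Two-layer network f(x, omega) = W2 @ sigma(W1 @ x + b1); sizes illustrative.
d_in, d_h, d_out = 3, 16, 4
x = rng.standard_normal(d_in)

def sample_output(rng):
    # Draw the weights from the prior W_i ~ N(0, I): the output is a r.v.
    W1 = rng.standard_normal((d_h, d_in))
    b1 = rng.standard_normal(d_h)
    W2 = rng.standard_normal((d_out, d_h))
    return softmax(W2 @ np.tanh(W1 @ x + b1))

# Two prior draws give two different class distributions for the same x.
p1, p2 = sample_output(rng), sample_output(rng)
print(p1.round(2), p2.round(2))
```

Bayesian inference would average predictions like p1 and p2 under the *posterior* over ω rather than the prior; it is that posterior that the slides call intractable.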
 22. Approximate inference in Bayesian NNs
     - Define q_θ(ω) to approximate the posterior p(ω | X, Y)
     - KL divergence to minimise:
         KL(q_θ(ω) || p(ω | X, Y))
           ∝ − ∫ q_θ(ω) log p(Y | X, ω) dω + KL(q_θ(ω) || p(ω)) =: L(θ)
     - Approximate the integral with MC integration, ω̂ ~ q_θ(ω):
         L̂(θ) := − log p(Y | X, ω̂) + KL(q_θ(ω) || p(ω))

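As a concrete (hypothetical) instance of the MC estimate L̂(θ): for a single-weight model with a Gaussian q_θ(ω) = N(μ, s²) and an N(0, 1) prior, the KL term is available in closed form, and one sample ω̂ gives the estimate. All names and numbers below are illustrative, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting (illustrative): a single weight omega, model y = omega * x,
# variational dist. q_theta(omega) = N(mu, s^2), prior p(omega) = N(0, 1).
X = rng.standard_normal(50)
Y = 1.5 * X + rng.normal(0, 0.3, 50)
tau = 1.0 / 0.3 ** 2            # known Gaussian noise precision

def kl_q_p(mu, s):
    """Closed-form KL( N(mu, s^2) || N(0, 1) )."""
    return 0.5 * (s ** 2 + mu ** 2 - 1.0 - 2.0 * np.log(s))

def L_hat(mu, s, rng):
    """One-sample MC estimate of the VI objective."""
    omega_hat = mu + s * rng.standard_normal()           # omega_hat ~ q_theta
    nll = 0.5 * tau * np.sum((Y - omega_hat * X) ** 2)   # -log p(Y|X, omega_hat) + const
    return nll + kl_q_p(mu, s)

# A q_theta centred near the weight that generated the data scores better.
print(L_hat(1.5, 0.1, rng), L_hat(0.0, 0.1, rng))
```

The estimate is noisy (a fresh ω̂ gives a fresh value), which is precisely the point of the unbiasedness statement on the next slide.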
 25. Stochastic approximate inference in Bayesian NNs
     - Unbiased estimator:
         E_{ω̂ ~ q_θ(ω)} [ L̂(θ) ] = L(θ)
     - Converges to the same optima as L(θ)
     - For inference, repeat:
       - Sample ω̂ ~ q_θ(ω)
       - Minimise (one step)
           L̂(θ) = − log p(Y | X, ω̂) + KL(q_θ(ω) || p(ω))
         w.r.t. θ
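The sample-then-minimise loop can be sketched for the same kind of toy model: a single weight, q_θ(ω) = N(μ, s²) with θ = (μ, log s), an N(0, 1) prior, and manual reparameterised gradients. Everything here (hyper-parameters, gradient clipping, the 1-D model) is an illustrative assumption, not the speaker's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setting (illustrative): a single weight omega, model y = omega * x,
# q_theta(omega) = N(mu, s^2) with theta = (mu, log s), prior N(0, 1).
X = rng.standard_normal(100)
Y = 1.5 * X + rng.normal(0, 0.3, 100)
tau = 1.0 / 0.3 ** 2                      # known noise precision

mu, log_s = 0.0, np.log(0.5)              # variational parameters theta
lr = 1e-3
for _ in range(3000):
    s = np.exp(log_s)
    eps = rng.standard_normal()
    omega_hat = mu + s * eps              # sample omega_hat ~ q_theta (reparameterised)
    # Manual gradients of L_hat(theta) = -log p(Y|X, omega_hat) + KL(q||p):
    d_nll = -tau * np.sum((Y - omega_hat * X) * X)    # d(-log lik)/d omega
    g_mu = d_nll + mu                                 # dKL/dmu = mu
    g_log_s = d_nll * eps * s + (s ** 2 - 1.0)        # dKL/dlog s = s^2 - 1
    mu -= lr * np.clip(g_mu, -100, 100)               # clip for a stable toy loop
    log_s -= lr * np.clip(g_log_s, -100, 100)

print(round(mu, 2), round(np.exp(log_s), 3))  # mu ends up near the true weight 1.5
```

Because each step uses a single noisy but unbiased gradient estimate, this is ordinary stochastic optimisation, the same property the unbiasedness bullet above relies on.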
