Dropout in RNNs Following a VI Interpretation
Yarin Gal
yg279@cam.ac.uk
Unless specified otherwise, photos are either original work or taken from Wikimedia, under Creative Commons license
Recurrent Neural Networks
From the existing literature on dropout in RNNs:
◮ noise added in the recurrent connections of an RNN leads to model instabilities
◮ with dropout, the RNN's dynamics change dramatically
◮ dropout in recurrent layers disrupts the RNN's ability to model sequences
◮ applying dropout to the non-recurrent connections alone results in improved performance
◮ exploratory analysis of the performance of dropout before, inside, and after the RNN's unit
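The approach in this talk differs from these earlier schemes in where the noise is injected and whether the mask is resampled over time. Below is a rough numpy sketch of the contrast; the dimensions, weights, and keep probability are all illustrative, and the tied-mask variant is only a sketch of the recipe derived later in the talk, not its exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and weights: all illustrative.
D_x, D_h, T, p_keep = 5, 8, 10, 0.8
W_x = rng.normal(size=(D_x, D_h)) * 0.1
W_h = rng.normal(size=(D_h, D_h)) * 0.1
xs = rng.normal(size=(T, D_x))

def rnn(xs, input_masks, recurrent_masks):
    # Simple tanh RNN; masks multiply the input / previous state per step.
    h = np.zeros(D_h)
    for t, x in enumerate(xs):
        h = np.tanh((x * input_masks[t]) @ W_x + (h * recurrent_masks[t]) @ W_h)
    return h

# "Naive" dropout: a fresh mask on the inputs at every step,
# no noise on the recurrent connections.
naive_in = rng.binomial(1, p_keep, size=(T, D_x)) / p_keep
naive_rec = np.ones((T, D_h))
h_naive = rnn(xs, naive_in, naive_rec)

# Tied-mask ("variational") dropout: ONE mask sampled per sequence and
# repeated at every time step, on inputs AND recurrent connections.
var_in = np.tile(rng.binomial(1, p_keep, size=D_x) / p_keep, (T, 1))
var_rec = np.tile(rng.binomial(1, p_keep, size=D_h) / p_keep, (T, 1))
h_var = rnn(xs, var_in, var_rec)
```

The only difference between the two calls is the mask handling; the VI derivation that justifies the tied masks follows.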
Stochastic variational inference:
◮ Sample ω̂ ∼ q(ω)
◮ And minimise (one step)
    −log p(y | f^ω̂(x)) + KL(q(ω) ‖ p(ω))
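In code, one such step is straightforward. The following numpy sketch uses a toy linear model with a factorised-Gaussian q(ω); every name here (mu, log_sigma, kl_to_std_normal) and the data are illustrative, and in practice an autodiff framework would differentiate the objective and take the optimiser step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data; model y = x.w with a factorised-Gaussian q(w).
N, D = 100, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

mu, log_sigma = np.zeros(D), np.full(D, -2.0)  # variational parameters of q

def kl_to_std_normal(mu, log_sigma):
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a factorised Gaussian.
    return 0.5 * np.sum(np.exp(2 * log_sigma) + mu**2 - 1.0 - 2 * log_sigma)

# Step 1: sample w_hat ~ q(w), reparameterised so gradients could flow to (mu, sigma).
eps = rng.normal(size=D)
w_hat = mu + np.exp(log_sigma) * eps

# Step 2: evaluate the single-sample objective: NLL under w_hat plus the KL term.
nll = 0.5 * np.sum((y - X @ w_hat) ** 2)       # Gaussian likelihood, unit noise
loss = nll + kl_to_std_normal(mu, log_sigma)
print(f"one-sample stochastic objective: {loss:.2f}")
# Differentiating `loss` w.r.t. (mu, log_sigma) and stepping an optimiser,
# then repeating, is stochastic variational inference.
```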
For the Bernoulli approximating distribution used with dropout:
◮ Sample Ŵ_i ∼ q(W_i):
    ẑ_{i,j} ∼ Bernoulli(p_i) for j = 1, ..., K_i
    Ŵ_i = M_i · diag(ẑ_i)
  = randomly set columns of M_i to zero
  = randomly set units of the network to zero (i.e. dropout)
◮ Minimise (one step)
    −log p(y | f^ω̂(x)) + KL(q(ω) ‖ p(ω))
  with ω̂ = {Ŵ_i}_{i=1}^L (set of matrices).
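As a concrete check of the identity above, here is a small numpy sketch (sizes and keep probability invented for illustration) showing that multiplying M_i by a diagonal Bernoulli mask zeroes whole columns, which is the same as dropping the corresponding units.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; p_i is the probability of KEEPING a unit here.
K_in, K_out, p_i = 4, 3, 0.5
M_i = rng.normal(size=(K_in, K_out))
x = rng.normal(size=K_in)

z_hat = rng.binomial(1, p_i, size=K_out)   # z_hat_{i,j} ~ Bernoulli(p_i)
W_hat = M_i @ np.diag(z_hat)               # = randomly set columns of M_i to zero

# ... which is exactly dropout applied to the layer's output units:
assert np.allclose(x @ W_hat, (x @ M_i) * z_hat)
```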
¹ See Gal and Ghahramani (2015) and Kingma et al. (2015)
Define the model components:
◮ f^ω_h – single recurrent unit transition, e.g. tanh of an affine transformation
◮ f^ω_y – model output (e.g. affine transformation of the last state, or a function of all states)
◮ model likelihood, e.g. N(y; f^ω_y(h_T), σ²)
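Written out as a toy numpy forward pass (all dimensions, weights, and the names f_h, f_y are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and weights.
D_x, D_h, T = 5, 8, 10
W_h = rng.normal(size=(D_h, D_h)) * 0.1
W_x = rng.normal(size=(D_x, D_h)) * 0.1
b_h = np.zeros(D_h)
W_y, b_y = rng.normal(size=(D_h, 1)) * 0.1, np.zeros(1)

def f_h(h, x):
    # Single recurrent unit transition: tanh of an affine transformation.
    return np.tanh(h @ W_h + x @ W_x + b_h)

def f_y(h):
    # Model output: affine transformation of the last state.
    return h @ W_y + b_y

h = np.zeros(D_h)
for x in rng.normal(size=(T, D_x)):
    h = f_h(h, x)

y_mean = f_y(h)   # mean of the likelihood N(y; f_y(h_T), sigma^2)
print(y_mean)
```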
Applying dropout to word embeddings:
◮ Randomly set embedding matrix rows to zero – entire words are dropped
◮ The mask is repeated at each time step → the same words are dropped throughout the sequence
◮ i.e. drop word types at random rather than word tokens
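A small numpy sketch of this embedding dropout (vocabulary size, keep probability, and the token sequence are invented for illustration; the 1/p scaling is the usual inverted-dropout convention, not something from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented vocabulary / sequence, just for illustration.
V, D, p_keep = 10, 4, 0.75
E = rng.normal(size=(V, D))           # embedding matrix, one row per word type
tokens = np.array([3, 7, 3, 1, 3])    # word type 3 occurs three times

row_mask = rng.binomial(1, p_keep, size=V)     # one Bernoulli per word TYPE
E_dropped = E * row_mask[:, None] / p_keep     # zero entire rows (inverted-dropout scaling)

embedded = E_dropped[tokens]   # the mask is shared across the whole sequence
# Every occurrence of a dropped type is dropped, i.e. types not tokens:
assert (embedded[0] == embedded[2]).all()
```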
Future research directions:
◮ Capture language ambiguity? (Image source: cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf)
◮ Weight uncertainty for model debugging?
◮ New approximating distributions = new stochastic regularisation techniques?
◮ Model compression: W_i ∼ discrete distribution with continuous ...