AMMI – Introduction to Deep Learning, 11.2. LSTM and GRU


1. AMMI – Introduction to Deep Learning, 11.2. LSTM and GRU. François Fleuret, https://fleuret.org/ammi-2018/, November 12, 2018. ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

2. The Long Short-Term Memory unit (LSTM) by Hochreiter and Schmidhuber (1997) is a recurrent network with a gating of the form

    c_t = c_{t-1} + i_t ⊙ g_t

where c_t is a recurrent state, i_t is a gating function and g_t is a full update. This ensures that the derivatives of the loss with respect to c_t do not vanish.
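A minimal sketch (not part of the slides; the sequence length, dimension and random gates below are arbitrary assumptions) illustrating that with this additive update the gradient with respect to the initial state survives many time steps:

    import torch

    T, d = 100, 5
    c0 = torch.zeros(d, requires_grad = True)   # initial recurrent state c_0
    c = c0
    for t in range(T):
        i = torch.rand(d)                       # gating values in (0, 1)
        g = torch.randn(d)                      # full update
        c = c + i * g                           # c_t = c_{t-1} + i_t ⊙ g_t
    c.sum().backward()
    print(c0.grad)                              # tensor of ones: the gradient did not vanish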

3. It is noteworthy that this model, implemented 20 years before the resnets of He et al. (2015), uses the exact same strategy to deal with depth. This original architecture was improved with a forget gate (Gers et al., 2000), resulting in the standard LSTM in use today. In what follows we use the notation and variant from Jozefowicz et al. (2015).

4. The recurrent state is composed of a “cell state” c_t and an “output state” h_t. Gate f_t modulates whether the cell state should be forgotten, i_t whether the new update should be taken into account, and o_t whether the output state should be reset:

    f_t = sigm(W^(xf) x_t + W^(hf) h_{t-1} + b^(f))    (forget gate)
    i_t = sigm(W^(xi) x_t + W^(hi) h_{t-1} + b^(i))    (input gate)
    g_t = tanh(W^(xc) x_t + W^(hc) h_{t-1} + b^(c))    (full cell state update)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t                    (cell state)
    o_t = sigm(W^(xo) x_t + W^(ho) h_{t-1} + b^(o))    (output gate)
    h_t = o_t ⊙ tanh(c_t)                              (output state)

As pointed out by Gers et al. (2000), the forget bias b^(f) should be initialized with large values so that initially f_t ≃ 1 and the gating has no effect. This model was extended by Gers et al. (2003) with “peephole connections” that allow the gates to depend on c_{t-1}.
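As a sketch only (not part of the slides; the explicit parameter names W_xf, W_hf, b_f, etc. and their shapes are assumptions mirroring the notation above), one step of this cell can be written directly:

    import torch

    def lstm_step(x_t, h_prev, c_prev,
                  W_xf, W_hf, b_f, W_xi, W_hi, b_i,
                  W_xc, W_hc, b_c, W_xo, W_ho, b_o):
        # The W_x* are assumed of size (input dim, hidden dim), the W_h* of size
        # (hidden dim, hidden dim), and the b_* of size (hidden dim,).
        f_t = torch.sigmoid(x_t @ W_xf + h_prev @ W_hf + b_f)   # forget gate
        i_t = torch.sigmoid(x_t @ W_xi + h_prev @ W_hi + b_i)   # input gate
        g_t = torch.tanh(x_t @ W_xc + h_prev @ W_hc + b_c)      # full cell state update
        c_t = f_t * c_prev + i_t * g_t                          # cell state
        o_t = torch.sigmoid(x_t @ W_xo + h_prev @ W_ho + b_o)   # output gate
        h_t = o_t * torch.tanh(c_t)                             # output state
        return h_t, c_t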

5. [Figure: a single LSTM cell, taking the input x_t, updating the recurrent states c^1_t and h^1_t, with the prediction y_t computed by Ψ from h^1_t.]

Note: prediction is done from the h_t state, hence it is called the output state.

6. Several such “cells” can be combined to create a multi-layer LSTM.

[Figure: a two-layer LSTM. The output state h^1_t of the first cell is the input to the second cell, whose output state h^2_t feeds the prediction Ψ.]

7. PyTorch's torch.nn.LSTM implements this model. It processes several sequences and returns two tensors, with D the number of layers and T the sequence length:

    • the outputs of all the layers at the last time step: h^1_T, ..., h^D_T, and
    • the outputs of the last layer at each time step: h^D_1, ..., h^D_T.

The initial recurrent states h^1_0, ..., h^D_0 and c^1_0, ..., c^D_0 can also be specified.
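A minimal shape sketch (not from the slides; the dimensions are arbitrary assumptions), with T = 5, a batch of 3 sequences, input dimension 10, hidden dimension 20 and D = 2 layers:

    import torch
    from torch import nn

    lstm = nn.LSTM(input_size = 10, hidden_size = 20, num_layers = 2)
    x = torch.randn(5, 3, 10)        # (T, batch, input dimension)
    output, (h_n, c_n) = lstm(x)
    print(output.size())             # torch.Size([5, 3, 20]): h^D_1, ..., h^D_T
    print(h_n.size(), c_n.size())    # torch.Size([2, 3, 20]) each: states of all layers at time T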

8. PyTorch's RNNs can process batches of sequences of the same length, which can be encoded in a regular tensor, or batches of sequences of various lengths using the type nn.utils.rnn.PackedSequence.

Such an object can be created with nn.utils.rnn.pack_padded_sequence:

    >>> from torch.nn.utils.rnn import pack_padded_sequence
    >>> pack_padded_sequence(torch.tensor([[[ 1. ], [ 2. ]],
    ...                                    [[ 3. ], [ 4. ]],
    ...                                    [[ 5. ], [ 0. ]]]),
    ...                      [3, 2])
    PackedSequence(data=tensor([[ 1.], [ 2.], [ 3.], [ 4.], [ 5.]]),
                   batch_sizes=tensor([ 2,  2,  1]))

Note: the sequences must be sorted by decreasing lengths.

nn.utils.rnn.pad_packed_sequence converts back to a padded tensor.
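For instance (a sketch, not from the slides), packing and then un-packing recovers the padded tensor and the lengths:

    >>> from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
    >>> packed = pack_padded_sequence(torch.tensor([[[ 1. ], [ 2. ]],
    ...                                             [[ 3. ], [ 4. ]],
    ...                                             [[ 5. ], [ 0. ]]]),
    ...                               [3, 2])
    >>> padded, lengths = pad_packed_sequence(packed)
    >>> padded.size()
    torch.Size([3, 2, 1])
    >>> lengths
    tensor([3, 2])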

9. An LSTM-based model that processes a single sequence and makes a prediction from the output state at the last time step:

    import torch
    from torch import nn
    from torch.nn import functional as F

    class LSTMNet(nn.Module):
        def __init__(self, dim_input, dim_recurrent, num_layers, dim_output):
            super(LSTMNet, self).__init__()
            self.lstm = nn.LSTM(input_size = dim_input,
                                hidden_size = dim_recurrent,
                                num_layers = num_layers)
            self.fc_o2y = nn.Linear(dim_recurrent, dim_output)

        def forward(self, input):
            # Make this a batch of size 1
            input = input.unsqueeze(1)
            # Get the output of the last layer at every time step
            output, _ = self.lstm(input)
            # Drop the batch index
            output = output.squeeze(1)
            # Keep only the last time step
            output = output[output.size(0) - 1:output.size(0)]
            return self.fc_o2y(F.relu(output))
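A usage sketch of the LSTMNet module above (the dimensions below are assumptions, not from the slides):

    model = LSTMNet(dim_input = 1, dim_recurrent = 32, num_layers = 2, dim_output = 2)
    x = torch.randn(10, 1)    # a single sequence of length 10, input dimension 1
    y = model(x)              # tensor of size (1, 2)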

10. [Plot: error (0 to 0.5) vs. number of sequences seen (0 to 250,000), comparing the baseline, the baseline with gating, and the LSTM.]
