SLIDE 1

AMMI – Introduction to Deep Learning 11.2. LSTM and GRU

François Fleuret https://fleuret.org/ammi-2018/ November 12, 2018

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

SLIDE 2

The Long Short-Term Memory unit (LSTM), introduced by Hochreiter and Schmidhuber (1997), is a recurrent network with a gating of the form

c_t = c_{t−1} + i_t ⊙ g_t

where c_t is a recurrent state, i_t is a gating function and g_t is a full update. This ensures that the derivatives of the loss w.r.t. c_t do not vanish.
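To make this concrete (this example is not from the original slides), here is a minimal sketch comparing gradient flow through a plain tanh recurrence and through an additive gated update of the form above; the dimensions, weights and inputs are arbitrary.

import torch

T, d = 50, 10
x = [torch.randn(d) for _ in range(T)]

# Plain recurrence: repeated squashing and matrix products shrink the gradient
# of the loss w.r.t. the initial state.
W = torch.randn(d, d) * 0.1
h0 = torch.randn(d, requires_grad = True)
h = h0
for t in range(T):
    h = torch.tanh(W @ h + x[t])
h.sum().backward()
print('plain recurrence ||dL/dh0||:', h0.grad.norm().item())

# Additive gated update c_t = c_{t-1} + i_t * g_t: the identity path from c_0
# to c_T keeps the gradient from vanishing.
c0 = torch.randn(d, requires_grad = True)
c = c0
for t in range(T):
    i_t = torch.sigmoid(x[t])   # gating function
    g_t = torch.tanh(x[t])      # full update
    c = c + i_t * g_t
c.sum().backward()
print('additive gating  ||dL/dc0||:', c0.grad.norm().item())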

SLIDE 4

It is noteworthy that this model, implemented 20 years before the ResNets of He et al. (2015), uses the exact same strategy to deal with depth.

This original architecture was improved with a forget gate (Gers et al., 2000), resulting in the standard LSTM in use. In what follows we consider the notation and variant from Jozefowicz et al. (2015).

SLIDE 8

The recurrent state is composed of a “cell state” c_t and an “output state” h_t. Gate f_t modulates if the cell state should be forgotten, i_t if the new update should be taken into account, and o_t if the output state should be reset.

f_t = sigm(W^(x f) x_t + W^(h f) h_{t−1} + b^(f))   (forget gate)
i_t = sigm(W^(x i) x_t + W^(h i) h_{t−1} + b^(i))   (input gate)
g_t = tanh(W^(x c) x_t + W^(h c) h_{t−1} + b^(c))   (full cell state update)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t   (cell state)
o_t = sigm(W^(x o) x_t + W^(h o) h_{t−1} + b^(o))   (output gate)
h_t = o_t ⊙ tanh(c_t)   (output state)

As pointed out by Gers et al. (2000), the forget gate bias b^(f) should be initialized with large values so that initially f_t ≃ 1 and the gating has no effect.

This model was extended by Gers et al. (2003) with “peephole connections” that allow gates to depend on c_{t−1}.
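As an illustration (not part of the original slides), a single LSTM step can be written directly from these equations; the parameter names W_xf, W_hf, b_f, etc. are hypothetical and simply mirror the notation above.

import torch

def lstm_step(x_t, h_prev, c_prev, p):
    # p is a dict of parameter tensors; x_t: (dim_x,), h_prev and c_prev: (dim_h,)
    f_t = torch.sigmoid(p['W_xf'] @ x_t + p['W_hf'] @ h_prev + p['b_f'])   # forget gate
    i_t = torch.sigmoid(p['W_xi'] @ x_t + p['W_hi'] @ h_prev + p['b_i'])   # input gate
    g_t = torch.tanh(p['W_xc'] @ x_t + p['W_hc'] @ h_prev + p['b_c'])      # full cell state update
    c_t = f_t * c_prev + i_t * g_t                                         # cell state
    o_t = torch.sigmoid(p['W_xo'] @ x_t + p['W_ho'] @ h_prev + p['b_o'])   # output gate
    h_t = o_t * torch.tanh(c_t)                                            # output state
    return h_t, c_t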

SLIDE 10

[Diagram: a single LSTM cell, taking the recurrent states h^1_{t−1}, c^1_{t−1} and the input x_t, producing h^1_t, c^1_t; the predictions y_{t−1}, y_t are computed by a readout Ψ applied to the output state.]

  • Prediction is done from the h_t state, hence called the output state.

SLIDE 13

Several such “cells” can be combined to create a multi-layer LSTM.

Two-layer LSTM

[Diagram: two stacked LSTM cells; the first cell takes x_t and its recurrent states h^1_{t−1}, c^1_{t−1}, the second cell takes the output of the first together with its own states h^2_{t−1}, c^2_{t−1}, and the prediction y_t is computed by Ψ from the output state of the top cell.]

SLIDE 14

PyTorch’s torch.nn.LSTM implements this model. It processes several sequences, and returns two tensors, with D the number of layers and T the sequence length:

  • the outputs for all the layers at the last time step: h^1_T, . . . , h^D_T, and
  • the outputs of the last layer at each time step: h^D_1, . . . , h^D_T.

The initial recurrent states h^1_0, . . . , h^D_0 and c^1_0, . . . , c^D_0 can also be specified.
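As a minimal illustration (not in the original slides), the returned shapes can be checked on a dummy batch; the dimensions chosen here are arbitrary.

import torch
from torch import nn

T, B, D = 7, 3, 2                     # sequence length, batch size, number of layers
lstm = nn.LSTM(input_size = 5, hidden_size = 4, num_layers = D)
x = torch.randn(T, B, 5)
output, (h_n, c_n) = lstm(x)
print(output.size())   # torch.Size([7, 3, 4]), last layer, every time step
print(h_n.size())      # torch.Size([2, 3, 4]), every layer, last time step
print(c_n.size())      # torch.Size([2, 3, 4]), cell states, last time step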

SLIDE 18

PyTorch’s RNNs can process batches of sequences of the same length, which can be encoded in a regular tensor, or batches of sequences of varying lengths, using the type nn.utils.rnn.PackedSequence. Such an object can be created with nn.utils.rnn.pack_padded_sequence:

>>> from torch.nn.utils.rnn import pack_padded_sequence
>>> pack_padded_sequence(torch.tensor([[[ 1. ], [ 2. ]],
...                                    [[ 3. ], [ 4. ]],
...                                    [[ 5. ], [ 0. ]]]),
...                      [3, 2])
PackedSequence(data=tensor([[ 1.],
        [ 2.],
        [ 3.],
        [ 4.],
        [ 5.]]), batch_sizes=tensor([ 2, 2, 1]))

  • The sequences must be sorted by decreasing lengths.

nn.utils.rnn.pad_packed_sequence converts back to a padded tensor.
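As an illustration (this example is not from the original slides), a packed batch can be fed directly to an nn.LSTM, and the result converted back to a padded tensor:

import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size = 1, hidden_size = 2)
x = torch.tensor([[[ 1. ], [ 2. ]],
                  [[ 3. ], [ 4. ]],
                  [[ 5. ], [ 0. ]]])          # T = 3, B = 2, shorter sequence padded with 0
packed = pack_padded_sequence(x, [3, 2])
output, _ = lstm(packed)                      # output is also a PackedSequence
padded, lengths = pad_packed_sequence(output)
print(padded.size())                          # torch.Size([3, 2, 2])
print(lengths)                                # tensor([3, 2])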

SLIDE 19

import torch
from torch import nn
from torch.nn import functional as F

class LSTMNet(nn.Module):
    def __init__(self, dim_input, dim_recurrent, num_layers, dim_output):
        super(LSTMNet, self).__init__()
        self.lstm = nn.LSTM(input_size = dim_input,
                            hidden_size = dim_recurrent,
                            num_layers = num_layers)
        self.fc_o2y = nn.Linear(dim_recurrent, dim_output)

    def forward(self, input):
        # Makes this a batch of size 1
        input = input.unsqueeze(1)
        # Get the activations of the last layer at every time step
        output, _ = self.lstm(input)
        # Drop the batch index
        output = output.squeeze(1)
        # Keep only the last time step
        output = output[output.size(0) - 1:output.size(0)]
        return self.fc_o2y(F.relu(output))
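For instance (a hypothetical toy setting, not from the slides), such a model maps a single sequence of 4-dimensional vectors to a single output vector:

model = LSTMNet(dim_input = 4, dim_recurrent = 32, num_layers = 2, dim_output = 3)
x = torch.randn(11, 4)    # one sequence of length 11, no batch dimension
y = model(x)
print(y.size())           # torch.Size([1, 3])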

SLIDE 20

[Plot: error vs. number of sequences seen (up to 250,000), comparing the baseline, the baseline with gating, and the LSTM.]

SLIDE 21

[Plot: error vs. sequence length (2 to 20), for the baseline, the baseline with gating, and the LSTM.]

SLIDE 22

The LSTM was simplified into the Gated Recurrent Unit (GRU) by Cho et al. (2014), with a gating for the recurrent state, and a reset gate.

r_t = sigm(W^(x r) x_t + W^(h r) h_{t−1} + b^(r))   (reset gate)
z_t = sigm(W^(x z) x_t + W^(h z) h_{t−1} + b^(z))   (forget gate)
h̄_t = tanh(W^(x h) x_t + W^(h h)(r_t ⊙ h_{t−1}) + b^(h))   (full update)
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̄_t   (hidden update)
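As with the LSTM, a single GRU step can be written directly from these equations; this is only an illustrative sketch, with hypothetical parameter names mirroring the notation above.

import torch

def gru_step(x_t, h_prev, p):
    # p is a dict of parameter tensors; x_t: (dim_x,), h_prev: (dim_h,)
    r_t = torch.sigmoid(p['W_xr'] @ x_t + p['W_hr'] @ h_prev + p['b_r'])          # reset gate
    z_t = torch.sigmoid(p['W_xz'] @ x_t + p['W_hz'] @ h_prev + p['b_z'])          # forget gate
    h_bar = torch.tanh(p['W_xh'] @ x_t + p['W_hh'] @ (r_t * h_prev) + p['b_h'])   # full update
    return z_t * h_prev + (1 - z_t) * h_bar                                       # hidden update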

SLIDE 23

class GRUNet(nn.Module):
    def __init__(self, dim_input, dim_recurrent, num_layers, dim_output):
        super(GRUNet, self).__init__()
        self.gru = nn.GRU(input_size = dim_input,
                          hidden_size = dim_recurrent,
                          num_layers = num_layers)
        self.fc_y = nn.Linear(dim_recurrent, dim_output)

    def forward(self, input):
        # Make this a batch of size 1
        input = input.unsqueeze(1)
        # Get the activations of all layers at the last time step
        _, output = self.gru(input)
        # Drop the batch index
        output = output.squeeze(1)
        # Keep only the activations of the last layer
        output = output[output.size(0) - 1:output.size(0)]
        return self.fc_y(F.relu(output))

SLIDE 24

[Plot: error vs. number of sequences seen, comparing the baseline, the baseline with gating, the LSTM, and the GRU.]

SLIDE 25

[Plot: error vs. sequence length, for the baseline, the baseline with gating, the LSTM, and the GRU.]

SLIDE 26

The specific form of these units prevents the gradient from vanishing, but it may still be excessively large on certain mini-batches.

The standard strategy to solve this issue is gradient norm clipping (Pascanu et al., 2013), which consists of re-scaling the norm of the gradient to a fixed threshold δ when it is above:

∇f ← (∇f / ‖∇f‖) min(‖∇f‖, δ).
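A minimal sketch of this re-scaling applied by hand to a set of parameters (assuming backward() has already populated their .grad fields); the PyTorch utility shown next does the same thing.

import torch

def clip_gradient_norm(parameters, delta):
    grads = [p.grad for p in parameters if p.grad is not None]
    norm = torch.cat([g.view(-1) for g in grads]).norm()
    if norm > delta:
        for g in grads:
            g.mul_(delta / norm)   # re-scale in place so the total norm equals delta
    return norm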

SLIDE 27

The function torch.nn.utils.clip_grad_norm_ applies this operation to the gradient of a model, as defined by an iterator through its parameters:

>>> x = torch.empty(10)
>>> x.grad = x.new(x.size()).normal_()
>>> y = torch.empty(5)
>>> y.grad = y.new(y.size()).normal_()
>>> torch.cat((x.grad, y.grad)).norm()
tensor(4.0303)
>>> torch.nn.utils.clip_grad_norm_((x, y), 5.0)
tensor(4.0303)
>>> torch.cat((x.grad, y.grad)).norm()
tensor(4.0303)
>>> torch.nn.utils.clip_grad_norm_((x, y), 1.25)
tensor(4.0303)
>>> torch.cat((x.grad, y.grad)).norm()
tensor(1.2500)
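In a training loop, the clipping is typically applied between the backward pass and the optimizer step. A self-contained sketch with a toy model and dummy data (all sizes and the threshold are arbitrary):

import torch
from torch import nn

model = nn.LSTM(input_size = 4, hidden_size = 8)
optimizer = torch.optim.SGD(model.parameters(), lr = 1e-1)

x = torch.randn(10, 2, 4)             # dummy batch: T = 10, B = 2
output, _ = model(x)
loss = output.pow(2).mean()           # dummy loss

optimizer.zero_grad()
loss.backward()
# Re-scale the gradient norm to at most 1.0 before the parameter update.
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()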

SLIDE 28

Jozefowicz et al. (2015) conducted an extensive exploration of different recurrent architectures through meta-optimization, and even though some units simpler than the LSTM or GRU perform well, they wrote:

“We have evaluated a variety of recurrent neural network architectures in order to find an architecture that reliably out-performs the LSTM. Though there were architectures that outperformed the LSTM on some problems, we were unable to find an architecture that consistently beat the LSTM and the GRU in all experimental conditions.” (Jozefowicz et al., 2015)

SLIDE 29

The end

SLIDE 30

References

  • K. Cho, B. van Merriënboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.
  • F. A. Gers, J. A. Schmidhuber, and F. A. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.
  • F. A. Gers, N. N. Schraudolph, and J. Schmidhuber. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research (JMLR), 3:115–143, 2003.
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
  • S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • R. Jozefowicz, W. Zaremba, and I. Sutskever. An empirical exploration of recurrent network architectures. In International Conference on Machine Learning (ICML), pages 2342–2350, 2015.
  • R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning (ICML), 2013.