NPFL114, Lecture 12
Transformer, External Memory Networks
Milan Straka
May 20, 2019
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Exams
Five questions with written preparation; then we go through your answers together (or you can leave and let me grade them by myself). Each question is worth 20 points. Additionally, up to 40 points (the surplus above 80 points; there is no distinction between regular and competition points) are transferred from the practicals, and up to 10 points are given for GitHub pull requests. To pass the exam, you need to obtain at least 60, 75 and 90 out of the 100 points for the written exam (plus up to 40 points from the practicals) to get grades 3, 2 and 1, respectively. The SIS should give you an exact time of the exam (including a gap between students) so that you do not all come at once.
In the winter semester:
Reading group of deep learning papers (in all areas). Every participant presents a paper about deep learning, learning how to read a paper, how to present it in an understandable way, and gaining deep learning knowledge from the other presentations.
In a sense a continuation of Deep Learning, but instead of supervised learning, reinforcement learning is the main method. Similar format to the Deep Learning course.
A course intended as a prequel to Deep Learning: an introduction to machine learning (regression, classification, structured prediction, clustering, hyperparameter optimization; decision trees, SVMs, maximum entropy classifiers, gradient boosting, …), with practicals in Python.
Using REINFORCE with a baseline, we can automatically design neural network architectures. We fix the overall architecture and search only for the Normal and Reduction cells.
Figure 2 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.
Every block is designed by an RNN controller generating the individual operations.
Figure 3 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.
Figure 4 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.
Table 2 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.
Figure 5 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.
For some sequence processing tasks, sequential processing of the elements might be too slow; the Transformer architecture instead avoids recurrence entirely and relies on attention, processing all sequence elements in parallel.
Figure 1 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
The attention module for queries $Q$, keys $K$ and values $V$ is defined as

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V.$$

The queries, keys and values are computed from the current word representations $V$ using a linear transformation as

$$Q = V \cdot W^Q, \quad K = V \cdot W^K, \quad V = V \cdot W^V.$$
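As a sanity check, here is a minimal numpy sketch of the formula above (a single head, no masking); the dimensions and projection matrices are illustrative only.

```python
# Scaled dot-product attention, directly following the formula above.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # [num_queries, num_keys]
    return softmax(scores, axis=-1) @ V        # weighted combination of the values

# Word representations projected to queries, keys and values.
X = np.random.randn(10, 64)                    # 10 words of dimension 64
W_Q, W_K, W_V = (np.random.randn(64, 64) for _ in range(3))
output = attention(X @ W_Q, X @ W_K, X @ W_V)  # shape (10, 64)
```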
Multi-head attention is used in practice. Instead of using one huge attention, we split the queries, keys and values into several groups (similar to how ResNeXt works), compute the attention in each of the groups separately, and then concatenate the results.
(left) Scaled Dot-Product Attention, (right) Multi-Head Attention
Figure 2 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
We need to encode positional information, which was implicit in RNNs. Two possibilities are learned embeddings for every position, or sinusoids of different frequencies:

$$\mathrm{PE}_{(pos, 2i)} = \sin\big(pos / 10000^{2i/d}\big),$$
$$\mathrm{PE}_{(pos, 2i+1)} = \cos\big(pos / 10000^{2i/d}\big).$$

The latter choice of functions should allow the model to attend to relative positions, since for any fixed $k$, $\mathrm{PE}_{pos+k}$ is a linear function of $\mathrm{PE}_{pos}$.
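A short sketch computing the sinusoidal positional encodings defined above; the dimensions match the visualization that follows.

```python
# Sinusoidal positional encodings: sin on even dimensions, cos on odd ones.
import numpy as np

def positional_encoding(max_len, d):
    pos = np.arange(max_len)[:, np.newaxis]            # positions 0 .. max_len-1
    i = np.arange(0, d, 2)[np.newaxis, :]              # even dimensions 0, 2, ...
    angles = pos / np.power(10000.0, i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                       # PE_(pos, 2i)   = sin(...)
    pe[:, 1::2] = np.cos(angles)                       # PE_(pos, 2i+1) = cos(...)
    return pe

pe = positional_encoding(20, 512)   # 20 positions of dimension 512, as visualized below
```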
Positional embeddings for 20 words of dimension 512, lighter colors representing values closer to 1 and darker colors representing values closer to -1.
http://jalammar.github.io/illustrated-transformer/
The network is regularized by: dropout of the input embeddings, dropout of each sub-layer output just before it is added to the residual connection (and then normalized), and label smoothing. The default dropout rate and also the label smoothing weight is 0.1.
Training can be performed in parallel because of the masked attention: the softmax weights of the self-attention are zeroed out so that a word cannot attend to later words in the sequence. However, inference is still sequential (and no substantial improvement analogous to parallel WaveNet inference has been achieved).
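A sketch of the masking: the usual implementation sets the scores of future positions to minus infinity before the softmax, so their attention weights become exactly zero and all target positions can be trained in parallel.

```python
# Causal (masked) self-attention weights for one head.
import numpy as np

n = 5
scores = np.random.randn(n, n)                       # QK^T / sqrt(d_k) for one head
mask = np.triu(np.ones((n, n), dtype=bool), k=1)     # True above the main diagonal
scores[mask] = -np.inf                               # forbid attending to later words
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)       # rows sum to 1, future weights are 0
```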
Table 1 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
Table 2 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
Variations of the base Transformer (unlisted values equal the base model; results given as dev PPL / dev BLEU / params ×10^6):
base: N=6, d_model=512, d_ff=2048, h=8, d_k=d_v=64, P_drop=0.1, eps_ls=0.1, 100K steps: 4.92 / 25.8 / 65
(A) heads: h=1, d_k=d_v=512: 5.29 / 24.9; h=4, d_k=d_v=128: 5.00 / 25.5; h=16, d_k=d_v=32: 4.91 / 25.8; h=32, d_k=d_v=16: 5.01 / 25.4
(B) key size: d_k=16: 5.16 / 25.1 / 58; d_k=32: 5.01 / 25.4 / 60
(C) model size: N=2: 6.11 / 23.7 / 36; N=4: 5.19 / 25.3 / 50; N=8: 4.88 / 25.5 / 80; d_model=256, d_k=d_v=32: 5.75 / 24.5 / 28; d_model=1024, d_k=d_v=128: 4.66 / 26.0 / 168; d_ff=1024: 5.12 / 25.4 / 53; d_ff=4096: 4.75 / 26.2 / 90
(D) regularization: P_drop=0.0: 5.77 / 24.6; P_drop=0.2: 4.95 / 25.5; eps_ls=0.0: 4.67 / 25.3; eps_ls=0.2: 5.47 / 25.7
(E) learned positional embeddings instead of sinusoids: 4.92 / 25.7
big: N=6, d_model=1024, d_ff=4096, h=16, P_drop=0.3, 300K steps: 4.33 / 26.4 / 213
Table 4 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
So far, all input information was stored either directly in the network weights, or in the state of a recurrent network. However, mammalian brains seem to operate with a working memory: a capacity for short-term storage of information and its rule-based manipulation. We can therefore try to introduce an external memory into a neural network. The memory will be a matrix $M$, where rows correspond to memory cells.
The network will control the memory using a controller which reads from the memory and writes to it. Although the original paper also considered a feed-forward (non-recurrent) controller, usually the controller is a recurrent LSTM network.
Figure 1 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
To read the memory in a differentiable way, the controller at time $t$ emits a read distribution $w_t$ over the memory locations; the read vector $r_t$ is then

$$r_t = \sum_i w_t(i) \cdot M_t(i).$$

Writing is performed in two steps: an erase followed by an add. The controller at time $t$ emits a write distribution $w_t$ over the memory locations, together with an erase vector $e_t$ and an add vector $a_t$. The memory is then updated as

$$M_t(i) = M_{t-1}(i)\big[1 - w_t(i)\, e_t\big] + w_t(i)\, a_t.$$
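A sketch of the differentiable read and the erase-then-add write defined above; the memory size and vector dimensions are illustrative.

```python
# NTM memory read (convex combination of rows) and erase-then-add write.
import numpy as np

def read(M, w):
    return w @ M                       # r_t = sum_i w_t(i) * M_t(i)

def write(M, w, e, a):
    M = M * (1 - np.outer(w, e))       # erase: M_{t-1}(i) * (1 - w_t(i) e_t)
    return M + np.outer(w, a)          # add:   ... + w_t(i) a_t

memory = np.zeros((128, 20))                           # 128 cells of dimension 20
w = np.full(128, 1 / 128)                              # a (here uniform) read/write distribution
e, a = np.random.rand(20), np.random.randn(20)         # erase and add vectors
memory = write(memory, w, e, a)
r = read(memory, w)                                    # read vector of dimension 20
```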
The addressing mechanism is designed to allow both content addressing, and location addressing.
Figure 2 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
Content addressing starts by the controller emitting a key vector $k_t$, which is compared to all memory locations $M_t(i)$, generating a distribution using a softmax with temperature $\beta_t$:

$$w_t^c(i) = \frac{\exp\big(\beta_t \cdot \operatorname{distance}(k_t, M_t(i))\big)}{\sum_j \exp\big(\beta_t \cdot \operatorname{distance}(k_t, M_t(j))\big)}.$$

The distance measure is usually the cosine similarity

$$\operatorname{distance}(a, b) = \frac{a \cdot b}{\lVert a \rVert \cdot \lVert b \rVert}.$$
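A sketch of content addressing: cosine similarity of the key to every memory cell, turned into a distribution by a softmax with temperature $\beta_t$.

```python
# Content-based addressing with cosine similarity and a temperature softmax.
import numpy as np

def content_addressing(M, k, beta):
    similarity = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    scores = beta * similarity
    w = np.exp(scores - scores.max())
    return w / w.sum()

M = np.random.randn(128, 20)                  # memory with 128 cells of dimension 20
w_c = content_addressing(M, k=np.random.randn(20), beta=5.0)
```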
To allow iterative access to the memory, the controller might decide to reuse the memory location from the previous timestep. Specifically, the controller emits an interpolation gate $g_t$ and defines

$$w_t^g = g_t w_t^c + (1 - g_t) w_{t-1}.$$

Then, the current weighting may be shifted, i.e., the controller might decide to "rotate" the weights by a small integer. For a given range of shifts (the simplest case being only the shifts $\{-1, 0, 1\}$), the network emits a softmax distribution $s_t$ over the shifts, and the weights are then defined using a circular convolution

$$\tilde w_t(i) = \sum_j w_t^g(j)\, s_t(i - j).$$

Finally, in order not to lose precision over time, the controller emits a sharpening factor $\gamma_t$ and the final memory location weights are

$$w_t(i) = \frac{\tilde w_t(i)^{\gamma_t}}{\sum_j \tilde w_t(j)^{\gamma_t}}.$$
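A sketch of the three location-based addressing steps above, assuming shifts over $\{-1, 0, +1\}$ and illustrative gate values.

```python
# Interpolation gate, circular convolution with the shift distribution, sharpening.
import numpy as np

def location_addressing(w_c, w_prev, g, s, gamma):
    w_g = g * w_c + (1 - g) * w_prev                  # interpolation with previous weights
    w_shifted = sum(prob * np.roll(w_g, shift)        # circular convolution over shifts
                    for shift, prob in zip((-1, 0, 1), s))
    w = w_shifted ** gamma                            # sharpening
    return w / w.sum()

w_prev = np.eye(8)[3]                                 # previously focused on cell 3
w = location_addressing(np.full(8, 1 / 8), w_prev, g=0.5,
                        s=np.array([0.1, 0.8, 0.1]), gamma=2.0)
```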
Even if not specified in the original paper, following the DNC paper, the controller can be implemented as a (potentially deep) LSTM. Assuming $R$ read heads and one write head, the input consists of $x_t$ and the $R$ read vectors $r_{t-1}^1, \ldots, r_{t-1}^R$ from the previous time step, the outputs of the controller are the vectors $(\nu_t, \xi_t)$, and the final output is $y_t = \nu_t + W_r \big[r_t^1, \ldots, r_t^R\big]$. The $\xi_t$ is a concatenation of

$$k_t^1, \beta_t^1, g_t^1, s_t^1, \gamma_t^1,\; k_t^2, \beta_t^2, g_t^2, s_t^2, \gamma_t^2,\; \ldots,\; k_t^w, \beta_t^w, g_t^w, s_t^w, \gamma_t^w,\; e_t, a_t.$$
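An illustrative sketch of splitting $\xi_t$ into the per-head addressing parameters plus the write head's erase and add vectors; the exact layout and dimensions here are an assumption, not taken from the paper.

```python
# Splitting the flat controller output xi_t into head parameters (hypothetical layout).
import numpy as np

def split_xi(xi, R, key_dim, shifts=3):
    per_head = key_dim + 1 + 1 + shifts + 1                  # k, beta, g, s, gamma
    heads, offset = [], 0
    for _ in range(R + 1):                                   # R read heads + 1 write head
        c = xi[offset:offset + per_head]
        heads.append(dict(k=c[:key_dim], beta=c[key_dim], g=c[key_dim + 1],
                          s=c[key_dim + 2:key_dim + 2 + shifts], gamma=c[-1]))
        offset += per_head
    # In practice beta, g, s, gamma are also squashed (softplus, sigmoid, softmax, 1+softplus).
    e, a = xi[offset:offset + key_dim], xi[offset + key_dim:offset + 2 * key_dim]
    return heads, e, a

xi = np.random.randn(3 * (20 + 6) + 2 * 20)                  # R=2 read heads, key_dim=20
heads, e, a = split_xi(xi, R=2, key_dim=20)
```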
Repeat the same sequence as given on input. Trained with sequences of length up to 20.
Figure 3 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
Figure 4: NTM Generalisation on the Copy Task. The four pairs of plots in the top row depict network outputs and corresponding copy targets for test sequences of length 10, 20, 30, and 50, respectively. The plots in the bottom row are for a length 120 sequence. The network was only trained on sequences of up to length 20. The first four sequences are reproduced with high confidence and very few mistakes. The longest one has a few more local errors and one global error: at the point indicated by the red arrow at the bottom, a single vector is duplicated, pushing all subsequent vectors one step back. Despite being subjectively close to a correct copy, this leads to a high loss.
Figure 4 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
Figure 5: LSTM Generalisation on the Copy Task. The plots show inputs and outputs for the same sequence lengths as Figure 4. Like NTM, LSTM learns to reproduce sequences of up to length 20, but it fails to generalise to longer ones.
Also note that the length of the accurate prefix decreases as the sequence length increases, suggesting that the network has trouble retaining information for long periods.
Figure 5 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
Figure 6 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
In the associative recall task, a sequence consisting of items of length 3 is given on input. Then a randomly chosen item is presented again on input, and the goal is to produce the item that immediately followed it in the sequence.
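A sketch of generating one associative recall example; the number of items and the bit width are illustrative parameters, not the ones used in the paper.

```python
# Generate one associative recall example: items of 3 binary vectors each,
# a query item, and the target item that followed the query in the sequence.
import numpy as np

rng = np.random.default_rng()
num_items, item_len, bits = 4, 3, 6
items = rng.integers(0, 2, size=(num_items, item_len, bits))   # the input sequence of items
query_index = rng.integers(0, num_items - 1)                   # never the last item
query = items[query_index]                                     # item shown again as the query
target = items[query_index + 1]                                # the item the network should output
```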
Figure 11 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
Figure 12 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
NTM was later extended to a Differentiable Neural Computer.
Figure panels: (a) controller, (b) read and write heads (the write head emitting a write vector, an erase vector and a write key; two read heads emitting read keys and read modes and producing read vectors), (c) memory, (d) memory usage and temporal links.
Figure 1 of paper "Hybrid computing using a neural network with dynamic external memory", https://www.nature.com/articles/nature20101.
The DNC contains multiple read heads and one write head. The controller is a deep LSTM network, with the input at time $t$ being the current input $x_t$ and the $R$ read vectors $r_{t-1}^1, \ldots, r_{t-1}^R$ from the previous time step. The outputs of the controller are the vectors $(\nu_t, \xi_t)$, and the final output is $y_t = \nu_t + W_r \big[r_t^1, \ldots, r_t^R\big]$. The $\xi_t$ is a concatenation of the parameters for the read and write heads (keys, gates, sharpening parameters, …).

In the DNC, the usage of every memory location is tracked, which allows us to define an allocation weighting. Furthermore, for every memory location we track which locations were written to previously ($b_t$) and subsequently ($f_t$). The write weighting is then defined as a weighted combination of the allocation weighting and the write content weighting, and each read weighting is computed as a weighted combination of the read content weighting, the previous-write (backward) weighting, and the subsequent-write (forward) weighting.
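A sketch of how these weightings are combined; the gate and read-mode values below are illustrative stand-ins for quantities that the network produces as part of $\xi_t$.

```python
# Combining DNC weightings: allocation + content for writing,
# backward + content + forward modes for reading.
import numpy as np

def write_weighting(alloc_w, content_w, write_gate, alloc_gate):
    # Interpolate between writing to freshly allocated cells and to
    # content-addressed cells, scaled by the overall write gate.
    return write_gate * (alloc_gate * alloc_w + (1 - alloc_gate) * content_w)

def read_weighting(backward_w, content_w, forward_w, mode):
    # mode is a distribution over the three read modes: backward, content, forward.
    return mode[0] * backward_w + mode[1] * content_w + mode[2] * forward_w

N = 8
uniform = np.full(N, 1 / N)
w_write = write_weighting(uniform, uniform, write_gate=0.9, alloc_gate=0.3)
w_read = read_weighting(uniform, uniform, uniform, mode=np.array([0.2, 0.5, 0.3]))
```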
Figure panels: (a) a random graph; (b) the London Underground graph, given as 84 input edges such as (OxfordCircus, TottenhamCtRd, Central), together with a traversal question (BondSt, _, Central), (_, _, Circle), …, (_, _, Jubilee) and a shortest-path question (Moorgate, PiccadillyCircus, _), whose answers are sequences of edges such as (Moorgate, Bank, Northern), (Bank, Holborn, Central), (Holborn, LeicesterSq, Piccadilly), (LeicesterSq, PiccadillyCircus, Piccadilly); (c) a family tree, given as 54 input edges such as (Charlotte, Alan, Father), with an inference question (Freya, _, MaternalGreatUncle) answered by (Freya, Fergus, MaternalGreatUncle).
Figure 2 of paper "Hybrid computing using a neural network with dynamic external memory", https://www.nature.com/articles/nature20101.
Figure panels: (a) read and write weightings over time, (b) the read modes (backward, content, forward) of the write head and the two read heads, (c) the London Underground map used in the task, (d) the read key and the decoded memory locations (from station, to station, line) during the graph definition, query, and answer phases (time steps 5 to 30).
Figure 3 of paper "Hybrid computing using a neural network with dynamic external memory", https://www.nature.com/articles/nature20101.
Figure 1 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.
Page 3 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065. Page 4 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.
Figure 2 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.