SLIDE 1

NPFL114, Lecture 12

Transformer, External Memory Networks

Milan Straka

May 20, 2019

Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

slide-2
SLIDE 2

Exams

Five questions, written preparation, then we go through it together (or you can leave and let me grade it by myself). Each question is worth 20 points. Up to 40 points can be transferred from the practicals (the surplus above 80 points; there is no distinction between regular and competition points), and up to 10 points can be earned for GitHub pull requests. To pass the exam, you need to obtain at least 60, 75 and 90 out of 100 points (the written exam plus the up-to-40 points from the practicals) to obtain grades 3, 2 and 1, respectively. The SIS should give you an exact time of the exam (including a gap between students) so that you do not all come at once.

SLIDE 3

What Next

In the winter semester:

NPFL117 – Deep Learning Seminar [0/2 Ex]

A reading group of deep learning papers (in all areas). Every participant presents a paper about deep learning, learning how to read a paper, how to present it in an understandable way, and gaining deep learning knowledge from the other presentations.

NPFL122 – Deep Reinforcement Learning [2/2 C+Ex]

In a sense a continuation of Deep Learning, but with reinforcement learning instead of supervised learning as the main method. Similar format to the Deep Learning course.

NPFL129 – Machine Learning 101

A course intended as a prequel to Deep Learning: an introduction to machine learning (regression, classification, structured prediction, clustering, hyperparameter optimization; decision trees, SVMs, maximum entropy classifiers, gradient boosting, …), with practicals in Python.

SLIDE 4

Neural Architecture Search (NASNet) – 2017

Using REINFORCE with a baseline, we can design neural network architectures. We fix the overall architecture and design only the Normal and Reduction cells.

Figure 2 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.

SLIDE 5

Neural Architecture Search (NASNet) – 2017

Every block is designed by an RNN controller generating individual operations.

Figure 3 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.
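As a rough illustration of the search loop, the sketch below runs REINFORCE with a moving-average baseline over a toy search space. The `OPS` list, the factorized per-slot policy, and the `train_and_evaluate` reward are all illustrative stand-ins (the paper uses an RNN controller and trains real child networks), so treat this as a minimal sketch, not the paper's implementation.

```python
import numpy as np

# Hypothetical search space: for each of 5 block slots, choose one operation.
OPS = ["identity", "3x3 conv", "5x5 conv", "3x3 maxpool", "3x3 avgpool"]

def train_and_evaluate(architecture):
    """Stand-in for training a child network and returning validation
    accuracy; here a fixed 'good' architecture simply scores highest."""
    target = [1, 2, 1, 3, 0]
    return np.mean([op == t for op, t in zip(architecture, target)])

rng = np.random.default_rng(42)
logits = np.zeros((5, len(OPS)))    # factorized policy, one softmax per slot
baseline, lr, decay = 0.0, 0.1, 0.9

for step in range(500):
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    arch = [rng.choice(len(OPS), p=p) for p in probs]   # sample architecture
    reward = train_and_evaluate(arch)                   # validation accuracy
    baseline = decay * baseline + (1 - decay) * reward  # moving-average baseline
    advantage = reward - baseline
    for slot, op in enumerate(arch):
        grad = -probs[slot]                             # ∇ log softmax(logits)
        grad[op] += 1.0
        logits[slot] += lr * advantage * grad           # REINFORCE update

print([OPS[int(np.argmax(row))] for row in logits])
```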

SLIDE 6

Neural Architecture Search (NASNet) – 2017

Figure 4 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.

SLIDE 7

Neural Architecture Search (NASNet) – 2017

Table 2 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.

SLIDE 8

Neural Architecture Search (NASNet) – 2017

Figure 5 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.

SLIDE 9

Attention is All You Need

For some sequence processing tasks, sequential processing of the elements might be too restrictive. Instead, we may want to combine sequence elements independently of their distance.

SLIDE 10

Attention is All You Need

Figure 1 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.

SLIDE 11

Attention is All You Need

The attention module for queries $Q$, keys $K$ and values $V$ is defined as:

$$\textrm{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.$$

The queries, keys and values are computed from the current word representations $V$ using linear transformations:

$$Q = V \cdot W^Q, \quad K = V \cdot W^K, \quad V = V \cdot W^V.$$
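A minimal numpy sketch of these formulas (the dimensions and random inputs are illustrative, not from the lecture):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (queries, keys)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 64))    # 10 word representations of dimension 64
W_Q, W_K, W_V = (rng.normal(size=(64, 64)) for _ in range(3))
out = attention(X @ W_Q, X @ W_K, X @ W_V)   # Q, K, V via linear transformations
print(out.shape)                             # (10, 64)
```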

SLIDE 12

Attention is All You Need

Multi-head attention is used in practice. Instead of using one huge attention, we split the queries, keys and values into several groups (similar to how ResNeXt works), compute the attention in each of the groups separately, and then concatenate the results.

(Figure 2 panels: Scaled Dot-Product Attention, left; Multi-Head Attention, right.)

Figure 2 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
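A minimal numpy sketch of the head splitting (looping over the heads for clarity; real implementations batch the head dimension, and the final projection matrix is the standard output projection):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, heads=8):
    """Split queries/keys/values into `heads` groups along the feature
    axis, attend in each group separately, concatenate, and project."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_head = Q.shape[1] // heads
    outs = []
    for h in range(heads):
        q = Q[:, h * d_head:(h + 1) * d_head]
        k = K[:, h * d_head:(h + 1) * d_head]
        v = V[:, h * d_head:(h + 1) * d_head]
        outs.append(softmax(q @ k.T / np.sqrt(d_head)) @ v)
    return np.concatenate(outs, axis=1) @ W_O

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 512))
W = [rng.normal(size=(512, 512)) * 0.02 for _ in range(4)]
print(multi_head_attention(X, *W).shape)    # (10, 512)
```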

SLIDE 13

Attention is All You Need

Positional Embeddings

We need to encode positional information (which was implicit in RNNs). Two options:

  • Learned embeddings for every position.
  • Sinusoids of different frequencies:

$$\mathrm{PE}_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d}\right)$$
$$\mathrm{PE}_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d}\right)$$

This choice of functions should allow the model to attend to relative positions, since for any fixed $k$, $\mathrm{PE}_{pos+k}$ is a linear function of $\mathrm{PE}_{pos}$.
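A minimal numpy sketch generating such embeddings (matching the 20×512 plot referenced on the next slide):

```python
import numpy as np

def positional_embeddings(max_len, d):
    """Sinusoidal positional embeddings: sine for even dimensions,
    cosine for odd dimensions (assumes even d)."""
    pos = np.arange(max_len)[:, np.newaxis]     # (max_len, 1)
    i = np.arange(d // 2)[np.newaxis, :]        # (1, d // 2)
    angles = pos / 10000 ** (2 * i / d)
    PE = np.empty((max_len, d))
    PE[:, 0::2] = np.sin(angles)                # PE(pos, 2i)
    PE[:, 1::2] = np.cos(angles)                # PE(pos, 2i + 1)
    return PE

PE = positional_embeddings(20, 512)             # 20 positions, dimension 512
print(PE.shape)                                 # (20, 512)
```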

SLIDE 14

Attention is All You Need

Positional embeddings for 20 words of dimension 512, lighter colors representing values closer to 1 and darker colors representing values closer to -1.

http://jalammar.github.io/illustrated-transformer/

SLIDE 15

Attention is All You Need

Regularization

The network is regularized by: dropout of input embeddings; dropout of each sub-layer output, just before it is added to the residual connection (and then normalized); and label smoothing. The default dropout rate and also the label smoothing weight is 0.1.

Parallel Execution

Training can be performed in parallel because of the masked attention: the softmax weights of the self-attention are zeroed out so that words cannot attend to later words in the sequence. However, inference is still sequential (and no substantial improvements of parallel inference, similar to those for WaveNet, have been achieved).
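A minimal numpy sketch of the masking (the −1e9 fill value is a common implementation trick, assumed here rather than taken from the lecture):

```python
import numpy as np

def masked_self_attention(X, W_Q, W_K, W_V):
    """Decoder self-attention: the upper triangle of the score matrix is
    masked, so each position attends only to itself and earlier positions,
    allowing a whole sequence to be processed in one training pass."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[1])
    scores[np.triu(np.ones(scores.shape, dtype=bool), k=1)] = -1e9
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # rows sum to 1
    return weights @ V
```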

SLIDE 16

Why Attention

Table 1 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.

SLIDE 17

Transformers Results

Table 2 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.

SLIDE 18

Transformers Results

Ablations (PPL and BLEU on the development set; parameters in millions; empty cells repeat the base model values):

         N  d_model  d_ff   h   d_k  d_v  P_drop  ε_ls  steps  PPL   BLEU  params
base     6  512      2048   8   64   64   0.1     0.1   100K   4.92  25.8  65
(A)                         1   512  512                       5.29  24.9
                            4   128  128                       5.00  25.5
                            16  32   32                        4.91  25.8
                            32  16   16                        5.01  25.4
(B)                             16                             5.16  25.1  58
                                32                             5.01  25.4  60
(C)      2                                                     6.11  23.7  36
         4                                                     5.19  25.3  50
         8                                                     4.88  25.5  80
            256                 32   32                        5.75  24.5  28
            1024                128  128                       4.66  26.0  168
                     1024                                      5.12  25.4  53
                     4096                                      4.75  26.2  90
(D)                                       0.0                  5.77  24.6
                                          0.2                  4.95  25.5
                                                  0.0          4.67  25.3
                                                  0.2          5.47  25.7
(E)      positional embedding instead of sinusoids             4.92  25.7
big      6  1024     4096   16            0.3           300K   4.33  26.4  213

Table 4 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.

SLIDE 19

Neural Turing Machines

So far, all input information was stored either directly in the network weights, or in the state of a recurrent network. However, mammalian brains seem to operate with a working memory: a capacity for short-term storage of information and its rule-based manipulation. We can therefore try to introduce an external memory to a neural network. The memory $M$ will be a matrix, where rows correspond to memory cells.

SLIDE 20

Neural Turing Machines

The network will control the memory using a controller, which reads from the memory and writes to it. Although the original paper also considered a feed-forward (non-recurrent) controller, the controller is usually a recurrent LSTM network.

Figure 1 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.

SLIDE 21

Neural Turing Machine

Reading

To read the memory in a differentiable way, the controller at time $t$ emits a read distribution $w_t$ over memory locations, and the returned read vector $r_t$ is then

$$r_t = \sum_i w_t(i) \cdot M_t(i).$$

Writing

Writing is performed in two steps: an erase followed by an add. The controller at time $t$ emits a write distribution $w_t$ over memory locations, together with an erase vector $e_t$ and an add vector $a_t$. The memory is then updated as

$$M_t(i) = M_{t-1}(i)\big[1 - w_t(i) e_t\big] + w_t(i) a_t.$$
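A minimal numpy sketch of one read and one write step (the memory size and the emitted vectors are illustrative placeholders for what the controller would produce):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 128, 20                    # number of memory cells, cell size
M = rng.normal(size=(N, D))       # memory matrix M_{t-1}
w = np.full(N, 1.0 / N)           # read/write distribution over locations

# Reading: r_t = sum_i w_t(i) * M_t(i)
r = w @ M                         # read vector, shape (D,)

# Writing: an erase followed by an add
e = rng.uniform(size=D)           # erase vector, components in [0, 1]
a = rng.normal(size=D)            # add vector
M = M * (1 - np.outer(w, e)) + np.outer(w, a)
```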

SLIDE 22

Neural Turing Machine

The addressing mechanism is designed to allow both content addressing and location addressing.

Figure 2 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.

SLIDE 23

Neural Turing Machine

Content Addressing

Content addressing starts by the controller emitting a key vector $k_t$, which is compared to all memory locations $M_t(i)$, generating a distribution using a softmax with temperature $\beta_t$:

$$w_t^c(i) = \frac{\exp\big(\beta_t \cdot \operatorname{distance}(k_t, M_t(i))\big)}{\sum_j \exp\big(\beta_t \cdot \operatorname{distance}(k_t, M_t(j))\big)}.$$

The distance measure is usually the cosine similarity

$$\operatorname{distance}(a, b) = \frac{a \cdot b}{\|a\| \cdot \|b\|}.$$
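A minimal numpy sketch of content addressing (the 1e-8 stabilizer is an assumption, added only to avoid division by zero):

```python
import numpy as np

def content_addressing(M, k, beta):
    """Content weighting w_t^c: a softmax with temperature beta over the
    cosine similarities between the key k and all memory cells M(i)."""
    sim = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    e = np.exp(beta * (sim - sim.max()))    # numerically stable softmax
    return e / e.sum()
```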

SLIDE 24

Neural Turing Machine

Location-Based Addressing

To allow iterative access to memory, the controller might decide to reuse the memory location from the previous timestep. Specifically, the controller emits an interpolation gate $g_t$ and defines

$$w_t^g = g_t w_t^c + (1 - g_t) w_{t-1}.$$

Then, the current weighting may be shifted, i.e., the controller might decide to "rotate" the weights by a small integer. For a given range of shifts (the simplest case being only the shifts $\{-1, 0, 1\}$), the network emits a softmax distribution $s_t$ over the shifts, and the weights are then defined using a circular convolution

$$\tilde{w}_t(i) = \sum_j w_t^g(j)\, s_t(i - j).$$

Finally, in order not to lose precision over time, the controller emits a sharpening factor $\gamma_t$, and the final memory location weights are

$$w_t(i) = \tilde{w}_t(i)^{\gamma_t} \Big/ \sum_j \tilde{w}_t(j)^{\gamma_t}.$$
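A minimal numpy sketch chaining the three steps for the shift range {−1, 0, 1}:

```python
import numpy as np

def location_addressing(w_prev, w_content, g, s, gamma):
    """Interpolation gate, circular-convolution shift over {-1, 0, 1},
    and sharpening, producing the final location weights w_t."""
    w_g = g * w_content + (1 - g) * w_prev             # interpolation
    w_shift = sum(s_j * np.roll(w_g, j)                # circular convolution
                  for s_j, j in zip(s, (-1, 0, 1)))
    w_sharp = w_shift ** gamma                         # sharpening
    return w_sharp / w_sharp.sum()
```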

SLIDE 25

Neural Turing Machine

Overall Execution

Even though it is not specified in the original paper, following the DNC paper, the controller can be implemented as a (potentially deep) LSTM. Assuming $R$ read heads and one write head, the input at time $t$ is $x_t$ together with the $R$ read vectors $r_{t-1}^1, \dots, r_{t-1}^R$ from the previous time step, the output of the controller is $(\nu_t, \xi_t)$, and the final output is

$$y_t = \nu_t + W_r\,[r_t^1, \dots, r_t^R].$$

The $\xi_t$ is a concatenation of

$$k_t^1, \beta_t^1, g_t^1, s_t^1, \gamma_t^1,\ k_t^2, \beta_t^2, g_t^2, s_t^2, \gamma_t^2,\ \dots,\ k_t^w, \beta_t^w, g_t^w, s_t^w, \gamma_t^w,\ e_t, a_t.$$

SLIDE 26

Neural Turing Machines

Copy Task

Repeat the same sequence as given on input. Trained with sequences of length up to 20.

Figure 3 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.

SLIDE 27

Neural Turing Machines

Figure 4: NTM Generalisation on the Copy Task. The four pairs of plots in the top row depict network outputs and corresponding copy targets for test sequences of length 10, 20, 30, and 50, respectively. The plots in the bottom row are for a length 120 sequence. The network was only trained on sequences of up to length 20. The first four sequences are reproduced with high confidence and very few mistakes. The longest one has a few more local errors and one global error: at the point indicated by the red arrow at the bottom, a single vector is duplicated, pushing all subsequent vectors one step back. Despite being subjectively close to a correct copy, this leads to a high loss.

Figure 4 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.

SLIDE 28

Neural Turing Machines

Figure 5: LSTM Generalisation on the Copy Task. The plots show inputs and outputs for the same sequence lengths as Figure 4. Like NTM, LSTM learns to reproduce sequences of up to length 20 almost perfectly. However, it clearly fails to generalise to longer sequences. Also note that the length of the accurate prefix decreases as the sequence length increases, suggesting that the network has trouble retaining information for long periods.

Figure 5 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.

SLIDE 29

Neural Turing Machines

Figure 6 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.

SLIDE 30

Neural Turing Machines

Associative Recall

In associative recall, a sequence is given on input, consisting of subsequences of length 3. Then a randomly chosen subsequence is presented on input and the goal is to produce the following subsequence.

SLIDE 31

Neural Turing Machines

Figure 11 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.

SLIDE 32

Neural Turing Machines

Figure 12 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.

SLIDE 33

Differentiable Neural Computer

NTM was later extended to a Differentiable Neural Computer.

(Figure 1 panels: (a) controller; (b) read and write heads, emitting a write vector, an erase vector, a write key, and per-read-head read keys and read modes, returning read vectors; (c) memory; (d) memory usage and temporal links. The depicted architecture has one write head and two read heads.)

Figure 1 of paper "Hybrid computing using a neural network with dynamic external memory", https://www.nature.com/articles/nature20101.

SLIDE 34

Differentiable Neural Computer

The DNC contains multiple read heads and one write head. The controller is a deep LSTM network, with the input at time $t$ being the current input $x_t$ and the $R$ read vectors $r_{t-1}^1, \dots, r_{t-1}^R$ from the previous time step. The output of the controller is $(\nu_t, \xi_t)$, and the final output is $y_t = \nu_t + W_r\,[r_t^1, \dots, r_t^R]$. The $\xi_t$ is a concatenation of the parameters for the read and write heads (keys, gates, sharpening parameters, …).

In the DNC, the usage of every memory location is tracked, which allows us to define an allocation weighting. Furthermore, for every memory location we track which memory location was written to previously ($b_t$) and subsequently ($f_t$). The write weighting is then defined as a weighted combination of the allocation weighting and the write content weighting, and the read weighting is computed as a weighted combination of the read content weighting, the previous write weighting, and the subsequent write weighting.

SLIDE 35

Differentiable Neural Computer

(Figure 2 panels: (a) a random graph; (b) London Underground; (c) family tree. Example tasks: an Underground input of 84 edges such as (OxfordCircus, TottenhamCtRd, Central), with a traversal question (BondSt, _, Central), (_, _, Circle), … and a shortest-path question (Moorgate, PiccadillyCircus, _); a family tree input of 54 edges such as (Charlotte, Alan, Father), with an inference question (Freya, _, MaternalGreatUncle) answered by (Freya, Fergus, MaternalGreatUncle).)

Figure 2 of paper "Hybrid computing using a neural network with dynamic external memory", https://www.nature.com/articles/nature20101.

SLIDE 36

Differentiable Neural Computer

(Figure 3 panels: (a) read and write weightings over time, for the write head and two read heads, across the graph definition, query, and answer phases; (b) read modes, backward/content/forward; (c) the London Underground map; (d) read key; (e) location content, with decoded memory locations shown as from/to/line triples.)

Figure 3 of paper "Hybrid computing using a neural network with dynamic external memory", https://www.nature.com/articles/nature20101.

SLIDE 37

Memory-augmented Neural Networks

Figure 1 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.

SLIDE 38

Memory-augmented NNs

Page 3 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065. Page 4 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.

SLIDE 39

Memory-augmented NNs

Figure 2 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.
