Deep learning 13.1. Attention for Memory and Sequence Translation


SLIDE 1

Deep learning 13.1. Attention for Memory and Sequence Translation

François Fleuret https://fleuret.org/ee559/ Nov 1, 2020

SLIDE 2

In all the operations we have seen such as fully connected layers, convolutions, or poolings, the contribution of a value in the input tensor to a value in the output tensor is entirely driven by their [relative] locations [in the tensor].

François Fleuret Deep learning / 13.1. Attention for Memory and Sequence Translation 1 / 20

SLIDE 3

In all the operations we have seen such as fully connected layers, convolutions, or poolings, the contribution of a value in the input tensor to a value in the output tensor is entirely driven by their [relative] locations [in the tensor].

Attention mechanisms aggregate features with an importance score that

  • depends on the features themselves, not on their positions in the tensor,
  • relaxes locality constraints.

SLIDE 4

Attention mechanisms dynamically modulate the weighting of different parts of a signal, and allow the representation and allocation of information channels to depend on the activations themselves. While they were developed to equip deep-learning models with memory-like modules (Graves et al., 2014), their main use now is to provide long-term dependencies for sequence-to-sequence translation (Vaswani et al., 2017).
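As a minimal illustration of this idea (not from the lecture), the following numpy sketch pools a set of feature vectors with weights computed from the vectors themselves; the scoring vector u is a hypothetical stand-in for a learned parameter:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_pool(X, u):
    # Importance scores are a function of the feature vectors themselves,
    # not of their positions: permuting the rows leaves the result unchanged.
    alpha = softmax(X @ u)      # one weight per feature vector
    return alpha @ X            # weighted average of the features

X = np.array([[1.0, 0.0], [0.0, 1.0], [10.0, 0.0]])
u = np.array([1.0, 0.0])        # hypothetical scoring vector
y = attention_pool(X, u)
```

Because the weights depend only on the features, the pooled output is invariant to reordering the rows of X, unlike a convolution or pooling over fixed locations.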

SLIDE 5

Neural Turing Machine

SLIDE 6

Graves et al. (2014) proposed to equip a deep model with an explicit memory to allow for long-term storage and retrieval.

SLIDE 7

The said module has a hidden internal state that takes the form of a tensor M_t ∈ R^{N×M}, where t is the time step, N is the number of entries in the memory, and M is their dimension. A "controller" is implemented as a standard feed-forward or recurrent model, and at every iteration t it computes activations that modulate the reading / writing operations.


SLIDE 9

More formally, the memory module implements:

  • Reading, where given attention weights w_t ∈ R_+^N with Σ_n w_t(n) = 1, it gets

    r_t = Σ_{n=1}^{N} w_t(n) M_t(n).

  • Writing, where given attention weights w_t, an erase vector e_t ∈ [0, 1]^M, and an add vector a_t ∈ R^M, the memory is updated with

    ∀n, M_t(n) = M_{t−1}(n) (1 − w_t(n) e_t) + w_t(n) a_t.

The controller has multiple "heads": at each t it computes, for each writing head, w_t, e_t, a_t, and, for each reading head, w_t, and gets back a read value r_t.
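The read and write operations above can be sketched directly in numpy (an illustrative toy with hypothetical sizes and a hard attention vector, not the lecture's code):

```python
import numpy as np

def ntm_read(M, w):
    # r_t = sum_n w_t(n) M_t(n): attention-weighted sum of memory rows
    return w @ M

def ntm_write(M, w, e, a):
    # M_t(n) = M_{t-1}(n) * (1 - w_t(n) e_t) + w_t(n) a_t
    return M * (1.0 - np.outer(w, e)) + np.outer(w, a)

N, D = 4, 3                          # N entries, each of dimension M (here D)
M = np.ones((N, D))                  # memory M_{t-1}
w = np.array([0.0, 1.0, 0.0, 0.0])   # attention focused on entry 1
e = np.ones(D)                       # erase vector: wipe the attended entry
a = np.array([5.0, 6.0, 7.0])        # add vector
M = ntm_write(M, w, e, a)
r = ntm_read(M, w)                   # recovers the stored vector a
```

With a hard (one-hot) w, writing fully erases and overwrites one row while leaving the others untouched; with soft weights, reads and writes are blurred across entries, which is what makes the whole mechanism differentiable.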

SLIDE 10

The vectors w_t are themselves recurrent, and the controller can strengthen them on certain key values, and/or shift them.

Figure 2: Flow Diagram of the Addressing Mechanism. The key vector, k_t, and key strength, β_t, are used to perform content-based addressing of the memory matrix, M_t. The resulting content-based weighting is interpolated with the weighting from the previous time step based on the value of the interpolation gate, g_t. The shift weighting, s_t, determines whether and by how much the weighting is rotated. Finally, depending on γ_t, the weighting is sharpened and used for memory access.

(Graves et al., 2014)
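The content-based addressing stage of this mechanism can be sketched as follows (an illustrative toy, assuming the cosine-similarity form described by Graves et al., 2014, with a made-up memory and key):

```python
import numpy as np

def content_addressing(M, k, beta):
    # w(n) = softmax_n( beta * cos(k, M(n)) ): rows similar to the key k
    # get high weight; the key strength beta sharpens the distribution.
    cos = M @ k / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-9)
    e = np.exp(beta * (cos - cos.max()))
    return e / e.sum()

M = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # hypothetical memory
k = np.array([1.0, 0.0])                            # key closest to row 0
w = content_addressing(M, k, beta=5.0)
```

Increasing beta makes the weighting approach a hard lookup of the best-matching entry, while small beta spreads the read over the whole memory.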

SLIDE 11

Results on the copy task

[Figure: learning curves, cost per sequence (bits) vs. sequence number (thousands), for LSTM, NTM with LSTM controller, and NTM with feedforward controller.]

(Graves et al., 2014)

SLIDE 12

Results on the N-gram task

[Figure: learning curves, cost per sequence (bits) vs. sequence number (thousands), for LSTM, NTM with LSTM controller, NTM with feedforward controller, and the optimal estimator.]

(Graves et al., 2014)

SLIDE 13

Figure 15: NTM Memory Use During the Dynamic N-Gram Task. The red and green arrows indicate points where the same context is repeatedly observed during the test sequence ("00010" for the green arrows, "01111" for the red arrows). At each such point the same location is accessed by the read head, and then, on the next time-step, accessed by the write head. We postulate that the network uses the writes to keep count of the fraction of ones and zeros following each context in the sequence so far. This is supported by the add vectors, which are clearly anti-correlated at places where the input is one or zero, suggesting a distributed "counter." Note that the write weightings grow fainter as the same context is repeatedly seen; this may be because the memory records a ratio of ones to zeros, rather than absolute counts. The red box in the prediction sequence corresponds to the mistake at the first red arrow in Figure 14; the controller appears to have accessed the wrong memory location, as the previous context was "01101" and not "01111."

(Graves et al., 2014)

SLIDE 14

Attention for seq2seq

SLIDE 15

Given an input sequence x_1, …, x_T, the standard approach for sequence-to-sequence translation (Sutskever et al., 2014) uses a recurrent model h_t = f(x_t, h_{t−1}), and considers that the final hidden state v = h_T carries enough information to drive an auto-regressive generative model

y_t ∼ p(· | y_1, …, y_{t−1}, v),

itself implemented with another RNN.

SLIDE 16

The main weakness of such an approach is that all the information has to flow through a single state v, whose capacity has to accommodate any situation.

[Diagram: x_1, x_2, …, x_T feed into the single state v, which drives y_1, y_2, …, y_S.]

There are no direct "channels" to transport local information from the input sequence to the place where it is useful in the resulting sequence.

SLIDE 17

Attention mechanisms (Bahdanau et al., 2014) can transport information from parts of the signal to other parts specified dynamically.

[Diagram: direct connections from x_1, x_2, …, x_T to y_1, y_2, …, y_S.]

SLIDE 18

Bahdanau et al. (2014) proposed to extend a standard recurrent model with such a mechanism. They first run a bi-directional RNN to get a hidden state h_i = (h→_i, h←_i), i = 1, …, T. From this, they compute a new process s_i, i = 1, …, T, which looks at weighted averages of the h_j, where the weights are functions of the signal.

SLIDE 19

Given y_1, …, y_{i−1} and s_1, …, s_{i−1}, first compute an attention

∀j, α_{i,j} = softmax_j a(s_{i−1}, h_j),

where a is a one hidden layer tanh MLP (this is "additive attention", or "concatenation"). Then compute the context vector from the h's:

c_i = Σ_{j=1}^{T} α_{i,j} h_j.
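A possible numpy sketch of this additive attention step (the parameter names Ws, Wh, v and all sizes are hypothetical; the lecture does not give an implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, A = 5, 4, 8                 # sequence length, state dim, MLP width
H = rng.normal(size=(T, D))       # encoder states h_1, ..., h_T
s_prev = rng.normal(size=D)       # decoder state s_{i-1}
# Hypothetical parameters of the one-hidden-layer tanh MLP a(s, h)
Ws = rng.normal(size=(A, D))
Wh = rng.normal(size=(A, D))
v = rng.normal(size=A)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_step(s_prev, H):
    # a(s_{i-1}, h_j) = v^T tanh(Ws s_{i-1} + Wh h_j)  (additive attention)
    scores = np.tanh(s_prev @ Ws.T + H @ Wh.T) @ v
    alpha = softmax(scores)       # alpha_{i,j}, normalized over j
    c = alpha @ H                 # context c_i = sum_j alpha_{i,j} h_j
    return alpha, c

alpha, c = attention_step(s_prev, H)
```

Note that the hidden-layer term for s_{i−1} is computed once and broadcast against all T encoder states, which is how such implementations typically batch the MLP over j.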


SLIDE 21

The model can now make the prediction

s_i = f(s_{i−1}, y_{i−1}, c_i),
y_i ∼ g(y_{i−1}, s_i, c_i),

where f is a GRU (Cho et al., 2014). This is context attention, where s_{i−1} modulates what to look at in h_1, …, h_T to compute s_i and sample y_i.
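A toy numpy sketch of the state update s_i = f(s_{i−1}, y_{i−1}, c_i), with f a single GRU cell; the gate equations follow the standard GRU form, and all parameter names and sizes are hypothetical (biases omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4                                  # state / embedding dimension (made up)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical GRU parameters; the input is the concatenation [y_{i-1}, c_i]
Wz, Uz = rng.normal(size=(D, 2 * D)), rng.normal(size=(D, D))
Wr, Ur = rng.normal(size=(D, 2 * D)), rng.normal(size=(D, D))
Wn, Un = rng.normal(size=(D, 2 * D)), rng.normal(size=(D, D))

def gru_step(s_prev, y_prev, c):
    # s_i = f(s_{i-1}, y_{i-1}, c_i) with f a GRU (Cho et al., 2014)
    x = np.concatenate([y_prev, c])
    z = sigmoid(Wz @ x + Uz @ s_prev)        # update gate
    r = sigmoid(Wr @ x + Ur @ s_prev)        # reset gate
    n = np.tanh(Wn @ x + Un @ (r * s_prev))  # candidate state
    return z * s_prev + (1.0 - z) * n        # gated interpolation

s = gru_step(rng.normal(size=D), rng.normal(size=D), rng.normal(size=D))
```

Feeding c_i alongside y_{i−1} is what gives the decoder a direct, dynamically selected channel into the encoder states at every step.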

SLIDES 22–32

[Animated diagram, built up over slides 22–32: an RNN encodes x_1, …, x_T into hidden states h_1, …, h_T; at decoding step 3, given s_2 and y_2, attention scores a_{3,1}, …, a_{3,T} are computed against each h_j, normalized into weights α_{3,1}, …, α_{3,T}, and combined into the context vector c_3, from which s_3 and y_3 are produced.]

SLIDE 33

[Figure 2: The BLEU scores of the generated translations on the test set with respect to the lengths of the sentences (BLEU score vs. sentence length, for RNNsearch-50, RNNsearch-30, RNNenc-50, and RNNenc-30). The results are on the full test set, which includes sentences having words unknown to the models.]

(Bahdanau et al., 2014)

SLIDE 34

[Figure: attention alignment matrices, panels (a)-(d), for four English-to-French sentence pairs: (a) "The agreement on the European Economic Area was signed in August 1992." / "L'accord sur la zone économique européenne a été signé en août 1992."; (b) "It should be noted that the marine environment is the least known of environments." / "Il convient de noter que l'environnement marin est le moins connu de l'environnement."; (c) "Destruction of the equipment means that Syria can no longer produce new chemical weapons." / "La destruction de l'équipement signifie que la Syrie ne peut plus produire de nouvelles armes chimiques."; (d) ""This will change my future with my family," the man said." / ""Cela va changer mon avenir avec ma famille", a dit l'homme."]

(Bahdanau et al., 2014)

SLIDE 35

The end

SLIDE 36

References

  • D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
  • K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.
  • A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. CoRR, abs/1410.5401, 2014.
  • I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Neural Information Processing Systems (NIPS), pages 3104–3112, 2014.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.