Alternative Architectures
Philipp Koehn 15 October 2020
We introduced one translation model:
– attentional seq2seq model
– core organizing feature: recurrent neural networks
Now we consider architectures built around different core components:
– convolutional neural networks
– attention
Neural networks are loosely inspired by the brain:
– a neuron receives signals from other neurons
– if it is sufficiently activated, it sends out signals itself
– feed-forward layers are roughly based on this
But they are mathematical constructs:
– any function is possible, as long as it is partially differentiable
– not limited by appeals to biological validity
A feed-forward layer applies a linear transformation
Mx + b
followed by a non-linear activation function
y = activation(Mx + b)
Shorthand notation:
y = FF_activation(x) = a(Mx + b)
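A minimal numpy sketch of such a layer (names and dimensions are illustrative assumptions, not from the slides):

import numpy as np

def feed_forward(x, M, b, activation=np.tanh):
    # one feed-forward layer: linear transformation Mx + b, then a non-linearity
    return activation(M @ x + b)

# toy example: map a 3-dimensional input to a 2-dimensional output
M = np.random.randn(2, 3)
b = np.zeros(2)
x = np.array([1.0, -0.5, 0.3])
y = feed_forward(x, M, b)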
Layers:
– input layer
– hidden layers
– output layer
Why "linear" transformation?
– the name appeals to its geometric properties
– straight lines in the input remain straight lines in the output
Vocabulary-sized input and output vectors lead to a huge weight matrix
⇒ need to reduce the size of the matrix
[Diagram: mapping a high-dimensional input x to output y, either directly with one large matrix M, or through a low-dimensional bottleneck vector v with two smaller matrices A and B]
– given a high-dimensional vector x
– first map it into a lower-dimensional vector v (matrix A)
– then map that to the output vector y (matrix B)
v = Ax
y = Bv = BAx
Example:
– |x| = 20,000, |y| = 50,000 → M has 20,000 × 50,000 = 1,000,000,000 parameters
– |v| = 100 → A has 20,000 × 100 = 2,000,000 and B has 100 × 50,000 = 5,000,000 parameters
– reduction from 1,000,000,000 to 7,000,000 parameters
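A sketch of this factorization in numpy, using the dimensions of the example above (initialization and names are illustrative):

import numpy as np

dim_x, dim_y, dim_v = 20_000, 50_000, 100

# factored mapping: bottleneck v = Ax, then y = Bv
A = np.random.randn(dim_v, dim_x) * 0.01   # 100 × 20,000 = 2,000,000 parameters
B = np.random.randn(dim_y, dim_v) * 0.01   # 50,000 × 100 = 5,000,000 parameters

x = np.random.randn(dim_x)
v = A @ x   # low-dimensional bottleneck vector
y = B @ v   # equals (BA)x, without ever materializing the 10^9-entry matrix M

print(A.size + B.size)   # 7,000,000 instead of 1,000,000,000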
A recurrent unit receives two input vectors:
– input word x_1
– previous state x_2
They are combined with separate weight matrices:
y = activation(M_1 x_1 + M_2 x_2 + b)
Equivalently, concatenate the inputs and use a single matrix:
x = concat(x_1, x_2)
y = activation(Mx + b)
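The equivalence is easy to verify in numpy (a sketch with made-up dimensions):

import numpy as np

d_word, d_state = 4, 3
x1, x2 = np.random.randn(d_word), np.random.randn(d_state)
M1 = np.random.randn(d_state, d_word)
M2 = np.random.randn(d_state, d_state)
b = np.zeros(d_state)

# two inputs with separate weight matrices
y_separate = np.tanh(M1 @ x1 + M2 @ x2 + b)

# same computation: one matrix applied to the concatenated input
M = np.concatenate([M1, M2], axis=1)
x = np.concatenate([x1, x2])
y_concat = np.tanh(M @ x + b)

assert np.allclose(y_separate, y_concat)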
Combining n vectors w_i into a single vector by addition:
s = Σ_{i=1}^{n} w_i
– element-wise multiplication:
v ⊙ u = (v_1, v_2)^T ⊙ (u_1, u_2)^T = (v_1 × u_1, v_2 × u_2)^T
– dot product:
v · u = v^T u = v_1 u_1 + v_2 u_2
used for the simple version of the attention mechanism
– third possibility: the outer product v u^T, not commonly done
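The three products in numpy (a small illustrative sketch):

import numpy as np

v = np.array([1.0, 2.0])
u = np.array([3.0, 4.0])

elementwise = v * u       # v ⊙ u = [3.0, 8.0]
dot = v @ u               # v^T u = 11.0, a scalar (simple attention)
outer = np.outer(v, u)    # v u^T, a 2×2 matrix (not commonly used)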
Example: face detection in an image
– any region of the image may have a positive match
– represent the different regions as elements of a vector
– maximum value: does any region contain a face?
Max pooling:
– given: an n-dimensional vector
– goal: reduce it to an n/k-dimensional vector
– method: break up the vector into blocks of k elements, map each block to a single value (its maximum)
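A minimal sketch of this pooling step in numpy (the function name is made up):

import numpy as np

def max_pool(x, k):
    # split the n-dimensional vector into blocks of k elements,
    # keep only the maximum of each block: result has n/k dimensions
    assert len(x) % k == 0
    return x.reshape(-1, k).max(axis=1)

x = np.array([0.1, 0.9, 0.3, 0.2, 0.8, 0.4])
print(max_pool(x, k=3))   # [0.9 0.8]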
Maxout:
– first branch out into multiple linear transformations
W_1 x + b_1 and W_2 x + b_2
– then take the element-wise maximum
maxout(x) = max(W_1 x + b_1, W_2 x + b_2)
Compare the rectified linear unit, where one branch is fixed at zero:
ReLU(x) = max(Wx + b, 0)
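Both in numpy (an illustrative sketch; maxout generalizes to more than two branches):

import numpy as np

def maxout(x, W1, b1, W2, b2):
    # element-wise maximum over two linear branches
    return np.maximum(W1 @ x + b1, W2 @ x + b2)

def relu_layer(x, W, b):
    # ReLU as the special case where the second branch is fixed at zero
    return np.maximum(W @ x + b, 0)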
Recurrent neural networks:
– propagate a state s
– over time steps t
– receiving an input x_t at each step
s_t = f(s_{t−1}, x_t)
(the state computation may be a feed-forward layer)
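A minimal recurrent loop in numpy, assuming the state is computed with a tanh feed-forward layer over input and previous state (names are illustrative):

import numpy as np

def rnn(xs, W_in, W_state, b, s0):
    # s_t = tanh(W_in x_t + W_state s_{t-1} + b), applied over all time steps
    s, states = s0, []
    for x in xs:
        s = np.tanh(W_in @ x + W_state @ s + b)
        states.append(s)
    return states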
More complex recurrent units:
– gated recurrent units (GRU)
– long short-term memory cells (LSTM)
Intuition:
– humans also receive language word by word
– the most recent words are the most relevant → closest to the current state
Convolutions (known from image processing):
– a matrix spanning part of the image is reduced to a single value
– applied to overlapping regions
[Diagram: convolutional network over a sentence: input word embeddings combined by stacked feed-forward convolution layers]
Language is recursive:
– central: the verb
– its dependents: subject, objects, adjuncts
– their dependents: adjectives, determiners
– also nested: relative clauses
General principle:
– take a high-dimensional input representation
– map it to a lower-dimensional representation
Examples:
– map a 50×50 pixel area into a scalar value
– combine 3 or more neighboring words into a single vector (see the sketch below)
Machine translation:
– encode the input sentence into a single vector
– decode this vector into a sentence in the output language
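A sketch of such a convolution over words in numpy, assuming a window of 3 words and simply skipping the sentence boundaries (all names are made up):

import numpy as np

def convolve_words(E, W, b):
    # combine each window of 3 neighboring word vectors into a single vector
    n, d = E.shape   # n words, embedding size d
    out = []
    for j in range(1, n - 1):
        window = np.concatenate([E[j - 1], E[j], E[j + 1]])
        out.append(np.tanh(W @ window + b))
    return np.stack(out)

E = np.random.randn(5, 4)                 # 5 words, embedding size 4
W, b = np.random.randn(4, 12), np.zeros(4)
print(convolve_words(E, W, b).shape)      # (3, 4): one vector per window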
Challenge for translation:
– the output is not a single label
– the output structure needs to be built, word by word
– the relevant input words change while translating
⇒ Attention mechanism
Attention computed from:
– previous hidden state s_{i−1}
– input word embedding h_j
– trainable parameters b, W_a, U_a, v_a

a(s_{i−1}, h_j) = v_a^T tanh(W_a s_{i−1} + U_a h_j + b)

Alternative attention functions:
– Dot product: a(s_{i−1}, h_j) = s_{i−1}^T h_j
– Scaled dot product: a(s_{i−1}, h_j) = (1/√|h_j|) s_{i−1}^T h_j
– General: a(s_{i−1}, h_j) = s_{i−1}^T W_a h_j
– Local: a(s_{i−1}) = W_a s_{i−1}
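The scoring variants written out in numpy (a sketch; parameter shapes are assumptions):

import numpy as np

def additive(s_prev, h, v_a, W_a, U_a, b):
    # a(s_{i−1}, h_j) = v_a^T tanh(W_a s_{i−1} + U_a h_j + b)
    return v_a @ np.tanh(W_a @ s_prev + U_a @ h + b)

def dot(s_prev, h):
    return s_prev @ h                      # s_{i−1}^T h_j

def scaled_dot(s_prev, h):
    return (s_prev @ h) / np.sqrt(len(h))  # (1/√|h_j|) s_{i−1}^T h_j

def general(s_prev, h, W_a):
    return s_prev @ (W_a @ h)              # s_{i−1}^T W_a h_j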
a(s_{i−1}, h_j) = s_{i−1}^T h_j
[Diagram: decoder architectures of Luong et al. (2015) and Bahdanau et al. (2015): encoder states h_j, attention weights α_ij, input context c_i, decoder state s_i, output word prediction t_i, output word y_i, and output word embedding E y_{i−1}, with RNN, attention, weighted sum, softmax, and argmax steps arranged differently in the two models]
Luong et al. (2015):
– Attention: α_ij = softmax(FF(s_{i−1}, h_j))
– Input context: c_i = Σ_j α_ij h_j
– Output word: p(y_t | y_{<t}, x) = softmax(FF(s_i, c_i))
– Decoder state: s_i = FF_tanh(s_{i−1}, E y_{i−1})

Bahdanau et al. (2015):
– Attention: α_ij = softmax(FF(s_{i−1}, h_j))
– Input context: c_i = Σ_j α_ij h_j
– Output word: p(y_t | y_{<t}, x) = softmax(FF(s_i))
– Decoder state: s_i = FF_tanh(s_{i−1}, E y_{i−1}, c_i)
Multiple attention heads:
– say, 16 attention weights
– each based on its own parameters

Computed from
– decoder state s_{i−1} at time step i
– encoder state h_j for the jth input word
– using the softmax of some parameterized function a^k

α^k_ij = softmax(a^k(s_{i−1}, h_j))

Combined by averaging:

α_ij = (1/k) Σ_k α^k_ij
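A sketch of multiple heads in numpy, assuming the "general" scoring function for each head and averaging as the combination (all assumptions, not prescribed by the slide):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def multi_head_weights(s_prev, H, Ws):
    # one set of attention weights per head k: α^k_ij = softmax(s_{i−1}^T W_k h_j)
    heads = [softmax(np.array([s_prev @ (W @ h) for h in H])) for W in Ws]
    # combine by averaging: α_ij = 1/k Σ_k α^k_ij
    return np.mean(heads, axis=0)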
Fine-grained attention:
– learn weights for each element of the encoder state
– the computation of attention values returns a vector instead of a scalar

a(s_{i−1}, h_j) = FF(s_{i−1}, h_j)

normalized separately for each dimension d:

α^d_ij = exp(a^d(s_{i−1}, h_j)) / Σ_κ exp(a^d(s_{i−1}, h_κ))

context vector with element-wise weighting:

c_i = Σ_j α_ij ⊙ h_j
– the representation of an input word mostly depends on itself
– but it is also informed by the surrounding context
– previously: recurrent neural networks (consider left or right context)
– now: attention mechanism
Which of the surrounding words is most relevant to refine the representation?
self-attention(H) = softmax(H H^T) H
– associations between all words computed by dot product, giving raw association values H H^T
– normalized with the softmax: softmax(H H^T)
– used as weights for a weighted sum over the word representations H
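In numpy, with each row of H holding one word representation (a minimal sketch):

import numpy as np

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(H):
    # raw association values by dot product, softmax-normalized,
    # then a weighted sum over all word representations
    return softmax_rows(H @ H.T) @ H

H = np.random.randn(6, 4)          # 6 words, representation size 4
print(self_attention(H).shape)     # (6, 4): one refined vector per word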
Scaled by the size of the representation vectors √|h|:

a_jk = (1/√|h|) h_j h_k^T

α_jk = exp(a_jk) / Σ_κ exp(a_jκ)

self-attention(h_j) = Σ_k α_jk h_k
[Kalchbrenner and Blunsom, 2013]
[Diagram: convolutional sentence model with input words, input word embeddings, a K2 layer, a K3 layer, and an L3 layer]
– always two convolutional layers, of different sizes
– here: K2 and K3
[Diagram: full convolutional translation model: input words and input word embeddings feed K2 and K3 encoder layers; a transfer layer connects to K3 and K2 decoder layers, an RNN layer, and softmax output word prediction over output words and output word embeddings]
[Gehring et al. 2017]
Combines:
– convolutional neural networks
– attention
[Diagram: encoder with input words, input word embeddings, and stacked encoder convolutions 1, 2, and 3]
Word embeddings: h_{0,j} = E x_j
– sequence of layer encodings h_{d,j}
– at different depths d
– up to a maximum depth D
h_{d,j} = f(h_{d−1,j−k}, ..., h_{d−1,j+k})
– function f is a feed-forward layer with a shortcut connection
– the final representation h_{D,j} may only be informed by partial sentence context
– all words at one depth can be processed in parallel → fast
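A sketch of one such layer in numpy (zero-padding at the sentence boundaries and tanh as f are assumptions):

import numpy as np

def conv_encoder_layer(H_prev, W, b, k=1):
    # h_{d,j} = f(h_{d−1,j−k}, ..., h_{d−1,j+k}) with a shortcut connection
    # W has shape (dim, (2k+1) * dim)
    n, dim = H_prev.shape
    padded = np.vstack([np.zeros((k, dim)), H_prev, np.zeros((k, dim))])
    H = np.empty_like(H_prev)
    for j in range(n):
        window = padded[j : j + 2 * k + 1].ravel()
        H[j] = np.tanh(W @ window + b) + H_prev[j]   # shortcut connection
    return H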
[Diagram: convolutional decoder: output word embeddings and the input context feed stacked decoder convolutions 1, 2, and 3, followed by a feed-forward layer and softmax output word prediction]
Recall the recurrent decoder state:
s_i = f(s_{i−1}, E y_{i−1}, c_i)
– decoder state s_i
– embedding of the previous output word E y_{i−1}
– input context c_i
Convolutional decoder:
– state computation does not depend on the previous state s_{i−1} (not recurrent)
– conditioned on the sequence of the κ most recent previous words
s_i = f(E y_{i−κ}, ..., E y_{i−1}, c_i)
Stacked in multiple layers:
s_{1,i} = f(E y_{i−κ}, ..., E y_{i−1}, c_i)
s_{d,i} = f(s_{d−1,i−κ−1}, ..., s_{d−1,i}, c_i)   for 1 < d ≤ D̂
Attention computed as before from:
– encoder state h_j
– decoder state s_{i−1}
With multiple layers, use the final states:
– encoder state h_{D,j}
– decoder state s_{D̂,i−1}
Also: a shortcut connection between the encoder state h_{D,j} and the input word embedding x_j
Encoder:
– refine word representations based on relevant context words
– relevance determined by self-attention
Decoder:
– refine output word predictions based on relevant previous output words
– relevance determined by self-attention
(possibly with self-attention only in the encoder, but a regular recurrent decoder)
[Diagram: sequence of self-attention layers in the encoder, for the input "<s> the house is big . </s>": input words x_j and input word positions j are embedded (word and position embeddings E_w x_j and E_p j) and added into the positional input word embedding E_w x_j + E_p j; self-attention and a weighted sum compute the input context, followed by add & norm for the input context with shortcut ĥ_j, and a feed-forward layer with add & norm for the encoder state refinement h_j]
Self-attention over all encoder states H:
self-attention(H) = softmax(H H^T) H
Shortcut connection and layer normalization:
ĥ_j = layer-normalization(self-attention(h_j) + h_j)
Feed-forward step with ReLU activation, again with shortcut and layer normalization:
h_j = layer-normalization(relu(W ĥ_j + b) + ĥ_j)
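Putting these equations together as one encoder layer in numpy (a simplified sketch: single-head attention, no scaling):

import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax_rows(Z):
    E = np.exp(Z - Z.max(-1, keepdims=True))
    return E / E.sum(-1, keepdims=True)

def encoder_layer(H, W, b):
    A = softmax_rows(H @ H.T) @ H        # self-attention(H)
    H_hat = layer_norm(A + H)            # add & norm with shortcut: ĥ
    F = np.maximum(H_hat @ W.T + b, 0)   # relu(W ĥ + b)
    return layer_norm(F + H_hat)         # add & norm with shortcut: h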
Stacked layers, starting from the word embeddings:
h_{0,j} = E x_j
h_{d,j} = self-attention-layer(h_{d−1,j})
[Diagram: decoder, for the output "<s> the house is big . </s>": output words y_i and output word positions i are embedded (word and position embeddings) and added into the positional output word embedding s_i; self-attention and a weighted sum compute the output context, normalized with shortcut (add & norm) and refined by a feed-forward layer (output state refinement); attention over the encoder states h and a weighted sum compute the context, again normalized with shortcut (ŝ_i) and refined by a feed-forward layer into the decoder state s_i]
Decoder computes attention-based representations of the output in several layers, initialized with the embeddings of the previous output words
– the association of a word s_i is limited to the words s_k with k ≤ i
– resulting representation s̃_i
S̃ = self-attention(S) = softmax(S S^T) S, masked so that each position attends only to itself and earlier positions
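A sketch of the masking in numpy: raw association values to later positions are set to −∞ before the softmax, so their weights become zero:

import numpy as np

def masked_self_attention(S):
    # position i may only attend to positions k ≤ i
    n = S.shape[0]
    scores = S @ S.T
    scores[np.triu_indices(n, k=1)] = -np.inf   # block later positions
    E = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = E / E.sum(axis=-1, keepdims=True)
    return A @ S                                # rows of S̃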
Attention between the decoder representations S̃ and the final encoder states H, based on the raw association values S̃ H^T:
attention(S̃, H) = softmax(S̃ H^T) H
[Diagram: stacked model: input words pass through several encoder layers; output word embeddings pass through several decoder layers, followed by softmax and argmax for output word prediction and output words]
Each decoder layer contains two attention steps, each followed by the same refinements.
Masked self-attention over the previous decoder states:
S̃ = self-attention(S) = softmax(S S^T) S
– shortcut connections
– layer normalization
– feed-forward layer
Attention to the final encoder states H:
attention(S̃, H) = softmax(S̃ H^T) H
– shortcut connections
– layer normalization
– feed-forward layer
Summary: refinement of representations in both encoder and decoder by
– recurrent neural networks
– self-attention layers