Alternative Architectures

Philipp Koehn 15 October 2020


Alternative Architectures

  • We introduced one translation model

– attentional seq2seq model – core organizing feature: recurrent neural networks

  • Other core neural architectures

– convolutional neural networks – attention

  • But first: look at various components of neural architectures


components


Components of Neural Networks

  • Neural networks originally inspired by the brain

– a neuron receives signals from other neurons
– if sufficiently activated, it sends signals
– feed-forward layers are roughly based on this

  • Computation graph

– any function possible, as long as it is partially differentiable
– not limited by appeals to biological validity

  • Deep learning may be a better name


Feed-Forward Layer

  • Classic neural network component
  • Given an input vector x, multiply by a matrix M and add a bias vector b

Mx + b

  • Adding a non-linear activation function

y = activation(Mx + b)

  • Notation

y = FF_activation(x) = activation(Mx + b)
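A minimal NumPy sketch of such a layer (the tanh activation and the variable names are illustrative choices, not prescribed by the slides):

```
import numpy as np

def feed_forward(x, M, b, activation=np.tanh):
    """Feed-forward layer: y = activation(Mx + b)."""
    return activation(M @ x + b)

# toy example: map a 4-dimensional input to a 3-dimensional output
rng = np.random.default_rng(0)
M = rng.normal(size=(3, 4))   # weight matrix
b = np.zeros(3)               # bias vector
x = rng.normal(size=4)        # input vector
print(feed_forward(x, M, b))  # 3-dimensional output y
```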


Feed-Forward Layer

  • Historic neural network designs: several feed-forward layers

– input layer
– hidden layers
– output layer

  • Powerful tools for a wide range of machine learning problems
  • Matrix multiplication with a bias (Mx + b) is also called an affine transform

– appeals to its geometrical properties
– straight lines in the input remain straight lines in the output


Factored Decomposition

  • One challenge: very large input and output vectors
  • Number of parameters in matrix M = |x| × |y|

⇒ Need to reduce size of matrix

  • Solution: first reduce to smaller representation

(Diagram: direct mapping from x to y with matrix M vs. factored mapping from x through a bottleneck v to y with matrices A and B)


Factored Decomposition: Math


  • Intuition

– given a high-dimensional vector x
– first map it into a lower-dimensional vector v (matrix A)
– then map to the output vector y (matrix B)

v = Ax
y = Bv = BAx

  • Example

– |x| = 20,000, |y| = 50,000 → M = 1,000,000,000
– |v| = 100 → A = 20,000 × 100 = 2,000,000, B = 100 × 50,000 = 5,000,000
– reduction from 1,000,000,000 to 7,000,000
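A small sketch of the factored mapping and its parameter savings, using the sizes from the example above (random matrices stand in for trained parameters):

```
import numpy as np

nx, ny, nv = 20_000, 50_000, 100          # |x|, |y|, |v|

rng = np.random.default_rng(0)
A = rng.normal(size=(nv, nx))             # maps x to the bottleneck v
B = rng.normal(size=(ny, nv))             # maps v to the output y

x = rng.normal(size=nx)
v = A @ x                                 # v = Ax
y = B @ v                                 # y = Bv = BAx

print(A.size + B.size)                    # 7,000,000 parameters
print(nx * ny)                            # 1,000,000,000 for a direct matrix M
```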


Factored Decomposition: Interpretation

  • Vector v is a bottleneck feature
  • Forced to capture salient features
  • One example: word embeddings


basic mathematical operations


Concatenation

  • Often multiple input vectors to a processing step
  • For instance, in a recurrent neural network

– input word – previous state

  • Combined in feed-forward layer

y = activation(M1x1 + M2x2 + b)

  • Another view

x = concat(x1, x2)
y = activation(Mx + b)

  • Splitting hairs here, but concatenation is useful generally
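A quick NumPy check that the two views are the same computation, i.e. splitting M into M1 and M2 matches concatenating the inputs (sizes are arbitrary):

```
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=5), rng.normal(size=3)
M1, M2 = rng.normal(size=(4, 5)), rng.normal(size=(4, 3))
b = rng.normal(size=4)

# view 1: one matrix per input vector
y1 = np.tanh(M1 @ x1 + M2 @ x2 + b)

# view 2: concatenate the inputs, use one matrix M = [M1 M2]
y2 = np.tanh(np.concatenate([M1, M2], axis=1) @ np.concatenate([x1, x2]) + b)

print(np.allclose(y1, y2))  # True
```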


Addition

  • Adding vectors: very simplistic, but often done
  • Example: compute sentence embeddings s from word embeddings w1, ..., wn

s = Σi wi

  • Reduces a varying-length sentence representation into a fixed-size vector
  • Maybe weight the words, e.g., by attention
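A sketch of both variants, summing word embeddings into a sentence vector and weighting them first (the embeddings and weights here are random placeholders):

```
import numpy as np

rng = np.random.default_rng(0)
n, dim = 7, 16                        # sentence length, embedding size
W = rng.normal(size=(n, dim))         # word embeddings w1 ... wn

s = W.sum(axis=0)                     # s = Σi wi

alpha = rng.random(n)                 # e.g. attention weights (here arbitrary)
alpha /= alpha.sum()                  # normalized to sum to 1
s_weighted = alpha @ W                # s = Σi αi wi

print(s.shape, s_weighted.shape)      # (16,) (16,)
```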


Multiplication

  • Another elementary mathematical operation
  • Three ways to multiply vectors

– element-wise multiplication:
  v ⊙ u = (v1, v2)ᵀ ⊙ (u1, u2)ᵀ = (v1 × u1, v2 × u2)ᵀ
– dot product, used for a simple version of the attention mechanism:
  v · u = vᵀu = v1 × u1 + v2 × u2
– third possibility: the outer product vuᵀ, not commonly done
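The three products in NumPy for two small vectors:

```
import numpy as np

v = np.array([1.0, 2.0])
u = np.array([3.0, 4.0])

print(v * u)           # element-wise: [3. 8.]
print(v @ u)           # dot product: 11.0 (used in simple attention)
print(np.outer(v, u))  # outer product v uᵀ: a 2x2 matrix, rarely used here
```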


Maximum

  • Goal: reduce the dimensionality of representation
  • Example: detect if a face is in an image

– any region of the image may have a positive match
– represent different regions with elements in a vector
– maximum value indicates whether any region contains a face

  • Max pooling

– given: n-dimensional vector
– goal: reduce to n/k-dimensional vector
– method: break up the vector into blocks of k elements, map each block to a single value (its maximum)
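A minimal max-pooling sketch over a vector; the block size k is an illustrative parameter:

```
import numpy as np

def max_pool(x, k):
    """Reduce an n-dimensional vector to n/k values: maximum over blocks of k."""
    assert len(x) % k == 0, "vector length must be divisible by k"
    return x.reshape(-1, k).max(axis=1)

x = np.array([0.1, 0.9, 0.3, 0.2, 0.8, 0.4])
print(max_pool(x, k=2))   # [0.9 0.3 0.8]
```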


Max Out

  • Max out

– first branch out into multiple feed-forward layers: W1x + b1, W2x + b2
– element-wise maximum: maxout(x) = max(W1x + b1, W2x + b2)

  • ReLU activation is a maxout layer: maximum of a feed-forward layer and 0

ReLU(x) = max(Wx + b, 0)
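A sketch of a two-branch maxout layer, with ReLU as the special case of taking the maximum with zero (parameters are random placeholders):

```
import numpy as np

def maxout(x, W1, b1, W2, b2):
    """Element-wise maximum of two feed-forward branches."""
    return np.maximum(W1 @ x + b1, W2 @ x + b2)

def relu_layer(x, W, b):
    """ReLU over a feed-forward layer: max(Wx + b, 0)."""
    return np.maximum(W @ x + b, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
b1, b2 = np.zeros(3), np.zeros(3)
print(maxout(x, W1, b1, W2, b2))
print(relu_layer(x, W1, b1))
```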


processing sequences


Recurrent Neural Networks

  • Already described recurrent neural networks at length

– propagate a state s
– over time steps t
– receiving an input xt at each turn

st = f(st−1, xt)

(state may be computed as a feed-forward layer)

  • More successful

– gated recurrent units (GRU) – long short-term memory cells (LSTM)

  • Good fit for sequences, like words in a sentence

– humans also receive word by word – most recent words most relevant → closer to current state

  • But computationally problematic: very long computation chains
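A bare-bones recurrent loop over a sentence, with a plain feed-forward state update; a GRU or LSTM cell would replace the update inside the loop:

```
import numpy as np

def rnn(inputs, Ws, Wx, b):
    """Plain RNN: st = tanh(Ws st−1 + Wx xt + b) over all time steps."""
    s = np.zeros(Ws.shape[0])
    states = []
    for x_t in inputs:                     # one long sequential chain
        s = np.tanh(Ws @ s + Wx @ x_t + b)
        states.append(s)
    return states

rng = np.random.default_rng(0)
dim_s, dim_x, length = 8, 4, 6
inputs = rng.normal(size=(length, dim_x))  # word embeddings x1 ... x6
states = rnn(inputs, rng.normal(size=(dim_s, dim_s)),
             rng.normal(size=(dim_s, dim_x)), np.zeros(dim_s))
print(len(states), states[-1].shape)       # 6 (8,)
```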


Alternative Sequence Processing

  • Convolutional neural networks
  • Attention


convolutional neural networks


Convolutional Neural Networks (CNN)

  • Popular in image processing
  • Regions of an image are reduced into increasingly smaller representations

– a matrix spanning part of the image is reduced to a single value
– overlapping regions


CNNs for Language

(Diagram: word embeddings are combined by stacked feed-forward layers, pyramid-style, into a single representation)

  • Map words into fixed-sized sentence representation


Hierarchical Structure and Language

  • Syntactic and semantic theories of language

– language is recursive
– central: verb
– dependents: subject, objects, adjuncts
– their dependents: adjectives, determiners
– also nested: relative clauses

  • How to compute sentence embeddings is an active research topic


Convolutional Neural Networks

  • Key step

– take a high dimensional input representation – map to lower dimensional representation

  • Several repetitions of this step
  • Examples

– map 50×50 pixel area into scalar value – combine 3 or more neighboring words into a single vector

  • Machine translation

– encode input sentence into single vector – decode this vector into a sentence in the output language


attention


Attention

  • Machine translation is a structured prediction task

– output is not a single label – output structure needs to be built, word by word

  • Relevant information for each word prediction varies
  • Human translators pay attention to different parts of the input sentence when translating

⇒ Attention mechanism


Computing Attention

  • Attention mechanism in neural translation model (Bahdanau et al., 2015)

– previous hidden state si−1
– input word embedding hj
– trainable parameters b, Wa, Ua, va

a(si−1, hj) = vaᵀ tanh(Wa si−1 + Ua hj + b)

  • Other ways to compute attention

– Dot product: a(si−1, hj) = si−1ᵀ hj
– Scaled dot product: a(si−1, hj) = (1 / |hj|) si−1ᵀ hj
– General: a(si−1, hj) = si−1ᵀ Wa hj
– Local: a(si−1) = Wa si−1
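A sketch of these scoring functions side by side; all parameters are random placeholders and the scaled variant divides by the vector size |hj| as on the slide:

```
import numpy as np

rng = np.random.default_rng(0)
dim = 8
s_prev = rng.normal(size=dim)          # previous decoder state si−1
h_j = rng.normal(size=dim)             # input word representation hj

# additive attention (Bahdanau et al., 2015)
W_a, U_a = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
v_a, b = rng.normal(size=dim), np.zeros(dim)
additive = v_a @ np.tanh(W_a @ s_prev + U_a @ h_j + b)

dot = s_prev @ h_j                     # dot product
scaled_dot = dot / dim                 # scaled by |hj|
general = s_prev @ W_a @ h_j           # general (bilinear) form

print(additive, dot, scaled_dot, general)
```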


Attention of Luong et al. (2015)

  • Luong et al. (2015) demonstrate good results with the dot product

a(si−1, hj) = si−1ᵀ hj

  • No trainable parameters
  • Additional changes
  • Currently more popular


Attention of Luong et al. (2015)

(Diagram: decoder architectures of Luong et al. (2015) and Bahdanau et al. (2015), showing how the RNN decoder state si, attention weights αij over encoder states hj, the weighted sum as input context ci, softmax, and argmax connect the output word embedding Eyi−1 to the output word prediction ti and output word yi)


Attention of Luong et al. (2015)

Luong et al. (2015)
– Attention: αij = softmax(FF(si−1, hj))
– Input context: ci = Σj αij hj
– Output word: p(yt|y<t, x) = softmax(W FFtanh(si−1, ci))
– Decoder state: si = FFtanh(si−1, Eyi−1)

Bahdanau et al. (2015)
– Attention: αij = softmax(FF(si−1, hj))
– Input context: ci = Σj αij hj
– Output word: p(yt|y<t, x) = softmax(W FFtanh(si−1, Eyi−1, ci))
– Decoder state: si = FFtanh(si−1, Eyi−1, ci)


Multi-Head Attention

  • Add redundancy

– say, 16 attention weights – each based on its own parameters

  • Formally, for each head k compute an association between

– decoder state si−1 at time step i
– encoder state hj for the jth input word
– using the softmax of some parameterized function ak

αᵏij = softmax(aᵏ(si−1, hj))

  • Average the attention weights

αij = (1 / K) Σk αᵏij   (K = number of heads)

  • Multi-head attention is a form of ensembling
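A sketch of the head-averaged weights described above; a bilinear score with its own random matrix per head stands in for the parameterized functions ak:

```
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
dim, n_words, n_heads = 8, 5, 4
s_prev = rng.normal(size=dim)               # decoder state si−1
H = rng.normal(size=(n_words, dim))         # encoder states h1 ... h5

alphas = []
for _ in range(n_heads):                    # each head has its own parameters
    W_k = rng.normal(size=(dim, dim))
    scores = H @ (W_k @ s_prev)             # ak(si−1, hj) for all j
    alphas.append(softmax(scores))          # αᵏij

alpha = np.mean(alphas, axis=0)             # average over the heads
print(alpha, alpha.sum())                   # distribution over the 5 input words
```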


Fine-Grained Attention

  • Why just use a single scalar value to weight entire vectors?

– learn weights for each element – computation of attention values returns vector instead of scalar

  • Architecturally, still a feed-forward neural network (or any of its variants)

a(si−1, hj) = FFk(si−1, hj)

  • Softmax is now applied over each dimension d

αᵈij = exp(aᵈ(si−1, hj)) / Σk exp(aᵈ(si−1, hk))
  • Input context is now computed by an element-wise multiplication

ci = Σj αij × hj


Self Attention

  • Finally, a very different take on attention
  • Motivation so far: need for alignment between input words and output words
  • Now: refine representation of input words in the encoder

– representation of an input word mostly depends on itself
– but also informed by the surrounding context
– previously: recurrent neural networks (considers left or right context)
– now: attention mechanism

  • Self attention:

Which of the surrounding words is most relevant to refine representation?


Self Attention

  • Formal definition (based on a sequence of vectors hj, packed into a matrix H)

self-attention(H) = softmax(H Hᵀ / |h|) H
  • Association between every word representation hj and any other context word hk

– computed by dot product
– results in a vector of raw association values HHᵀ

  • Scaled by the size of the word representation vectors |h|, then normalized with softmax

softmax(H Hᵀ / |h|)
  • Resulting vector of normalized association values used to weigh context words


Self Attention

  • More familiar math, using word representation vectors hj
  • Raw association (HHᵀ / |h|)

ajk = (1 / |h|) hj hkᵀ

  • Normalized association (softmax)

αjk = exp(ajk) / Σκ exp(ajκ)

  • Weighted sum

self-attention(hj) = Σk αjk hk

  • More on this later (Transformer)
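A direct transcription of these formulas into NumPy, with one row of H per word and the division by |h| as on the slides:

```
import numpy as np

def self_attention(H):
    """self-attention(H) = softmax(H Hᵀ / |h|) H, softmax taken row-wise."""
    scores = H @ H.T / H.shape[1]               # raw associations ajk
    scores -= scores.max(axis=1, keepdims=True) # for numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)   # normalized associations αjk
    return alpha @ H                            # weighted sum over context words

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))                     # 6 words, |h| = 8
print(self_attention(H).shape)                  # (6, 8)
```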


convolutional machine translation


Convolutional Machine Translation

  • First end-to-end neural machine translation model of the modern era

[Kalchbrenner and Blunsom, 2013]

  • Encoder

(Diagram: input words → input word embeddings → K2 layer → K3 layer → L3 layer)

– always two convolutional layers, with different sizes
– here: K2 and K3

  • Decoder similar


Refinement

(Diagram: input words and their input word embeddings pass through K2 and K3 encoder layers, a transfer layer, then K3 and K2 decoder layers; an RNN and softmax over output word embeddings produce the output word predictions)

  • Convolutions do not result in a single sentence embedding but a sequence
  • Decoder is also informed by a recurrent neural network


CNNs With Attention

[Gehring et al. 2017]

  • Combination of

– convolutional neural networks – attention

  • Sequence-to-sequence attention, mainly as before
  • Recurrent neural networks replaced by convolutional layers


Encoder

(Diagram: input words → input word embeddings → encoder convolution 1 → encoder convolution 2 → encoder convolution 3)

  • Stacked encoder convolutions
  • Not shortening representations
  • But: faster processing due to more parallelism


Encoder: Math

  • Start with input word embeddings Exj

h0,j = E xj

  • Progress through

– sequence of layer encodings hd,j
– at different depth d
– until maximum depth D

hd,j = f(hd−1,j−k, ..., hd−1,j+k)

  • Details

– function f is feed-forward layer with shortcut connection
– final representation hD,j may only be informed by partial sentence context
– all words at one depth can be processed in parallel → fast
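A rough sketch of one such layer, assuming k = 1 (a window of 3 neighboring vectors), a tanh feed-forward combination, a shortcut connection, and zero padding at the sentence boundaries; these details are illustrative assumptions:

```
import numpy as np

def conv_layer(H, W, b, k=1):
    """hd,j = f(hd−1,j−k, ..., hd−1,j+k) with a shortcut connection."""
    n, dim = H.shape
    padded = np.vstack([np.zeros((k, dim)), H, np.zeros((k, dim))])
    out = np.empty_like(H)
    for j in range(n):                          # every position is independent
        window = padded[j:j + 2 * k + 1].reshape(-1)
        out[j] = np.tanh(W @ window + b) + H[j] # feed-forward + shortcut
    return out

rng = np.random.default_rng(0)
n, dim, k = 7, 16, 1
H0 = rng.normal(size=(n, dim))                  # h0,j = E xj
W = 0.1 * rng.normal(size=(dim, (2 * k + 1) * dim))
H1 = conv_layer(H0, W, np.zeros(dim), k)        # all positions computable in parallel
print(H1.shape)                                 # (7, 16)
```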


Decoder

(Diagram: output word embeddings and the input context pass through decoder convolutions 1, 2, and 3; a feed-forward layer and softmax produce the output word prediction)

  • Decoder state computed by convolutional layers over previous output words
  • Each convolutional state also informed by the input context (using attention)


Decoder: Math

  • Recall: recurrent neural network decoder

si = f(si−1, Eyi−1, ci)

– decoder state si
– embedding of previous output word Eyi−1
– input context ci

  • Now

– state computation not depending on previous state si−1 (not recurrent)
– conditioned on the sequence of the κ most recent previous words

si = f(Eyi−κ, ..., Eyi−1, ci)

  • Stacked convolutions

s1,i = f(Eyi−κ, ..., Eyi−1, ci)
sd,i = f(sd−1,i−κ−1, ..., sd−1,i, ci)   for d > 0, d ≤ D̂


Attention

  • Attention mechanism fundamentally unchanged
  • Input context ci computed based on association a(si−1, hj) between

– encoder state hj – decoder state si−1

  • Now

– encoder state hD,j
– decoder state sD̂,i−1

  • Refinement when computing the context vector ci:

shortcut connection between encoder state hD,j and input word embedding xj


transformer


Self Attention: Transformer

  • Self-attention in encoder

– refine word representation based on relevant context words – relevance determined by self attention

  • Self-attention in decoder

– refine output word predictions based on relevant previous output words – relevance determined by self attention

  • Also regular attention to encoder states in decoder
  • Currently most successful model

(maybe with self attention only in the encoder, but a regular recurrent decoder)


Encoder

(Diagram: input words xj, e.g. "<s> the house is big . </s>", and their positions j are embedded (Ewxj, Epj) and added to form positional input word embeddings Ewxj + Epj; self-attention with weighted sums, add & norm shortcut connections (ĥj), and feed-forward refinement produce the encoder states hj)

Sequence of self-attention layers


Self Attention Layer

  • Given: input word representations hj, packed into a matrix H = (h1, ..., hj)
  • Self attention

self-attention(H) = softmax(H Hᵀ / |h|) H
  • Shortcut connection

self-attention(hj) + hj

  • Layer normalization

ĥj = layer-normalization(self-attention(hj) + hj)

  • Feed-forward step with ReLU activation function

relu(W ĥj + b)

  • Again, shortcut connection and layer normalization

layer-normalization(relu(W ĥj + b) + ĥj)
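Putting these steps together as a sketch of one layer; the layer normalization here uses per-vector mean and standard deviation and omits the usual gain and bias parameters for brevity:

```
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def self_attention(H):
    scores = H @ H.T / H.shape[1]                 # softmax(H Hᵀ / |h|) H
    scores -= scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ H

def self_attention_layer(H, W, b):
    H_hat = layer_norm(self_attention(H) + H)     # shortcut + layer normalization
    ff = np.maximum(H_hat @ W.T + b, 0.0)         # feed-forward step with ReLU
    return layer_norm(ff + H_hat)                 # shortcut + layer normalization again

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))                       # 6 words, |h| = 8
W, b = 0.1 * rng.normal(size=(8, 8)), np.zeros(8)
print(self_attention_layer(H, W, b).shape)        # (6, 8)
```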


Stacked Self Attention Layers

  • Stack several such layers (say, D = 6)
  • Start with input word embedding

h0,j = Exj

  • Stacked layers

hd,j = self-attention-layer(hd−1,j)


Decoder

(Diagram: output words yi and their positions i are embedded and added to form positional output word embeddings; stacked decoder layers apply self-attention over previous output words with add & norm shortcut connections, attention over the encoder states h with add & norm, and feed-forward refinement to produce the decoder states si)

Decoder computes attention-based representations of the output in several layers, initialized with the embeddings of the previous output words


Self-Attention in the Decoder

  • Same idea as in the encoder
  • Output words are initially encoded by word embeddings si = Eyi.
  • Self attention is computed over previous output words

– association of a word si is limited to words sk (k ≤ i)
– resulting representation s̃i

self-attention(S) = softmax(S Sᵀ / |h|) S


Attention in the Decoder

  • Original intuition of attention mechanism: focus on relevant input words
  • Computed with the dot product S̃Hᵀ

  • Compute attention between the decoder states S̃ and the final encoder states H

attention(S̃, H) = softmax(S̃ Hᵀ / |h|) H
  • Note: attention mechanism formally mirrors self-attention


Full Decoder

(Diagram: input words pass through a stack of encoder layers; output word embeddings pass through a stack of decoder layers, followed by softmax and argmax to produce each output word prediction)


Full Decoder

  • Self-attention

self-attention(S) = softmax(S Sᵀ / |h|) S

– shortcut connections
– layer normalization
– feed-forward layer

  • Attention

attention(S̃, H) = softmax(S̃ Hᵀ / |h|) H

– shortcut connections
– layer normalization
– feed-forward layer

  • Multiple stacked layers


Mix and Match

  • Encoder may be multiple layers of either

– recurrent neural networks – self-attention layers

  • Decoder may be multiple layers of either

– recurrent neural networks – self-attention layers

  • Also possible: self-attention encoder, recurrent neural network decoder
  • Even better: both self-attention and recurrent neural network, merged at the end
