SLIDE 1

Transformer Models

CSE545 - Spring 2019

SLIDE 2

Review: Feed Forward Network (fully-connected)

(skymind, AI Wiki)


SLIDE 3

Review: Convolutional NN

(Barter, 2018)

SLIDE 4

Review: Recurrent Neural Network

(Jurafsky, 2019)

“Hidden layer” recurrence and output, with activation functions g and f:

h(t) = g(h(t-1) U + x(t) V)
y(t) = f(h(t) W)
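A minimal numpy sketch of this recurrence (the toy dimensions, the tanh/softmax choices for g and f, and the random weights are assumptions for illustration):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # Assumed toy sizes: input dim 4, hidden dim 3, output dim 2.
    d_in, d_h, d_out = 4, 3, 2
    rng = np.random.default_rng(0)
    U = rng.normal(size=(d_h, d_h))    # hidden-to-hidden weights
    V = rng.normal(size=(d_in, d_h))   # input-to-hidden weights
    W = rng.normal(size=(d_h, d_out))  # hidden-to-output weights

    def rnn_step(h_prev, x_t):
        h_t = np.tanh(h_prev @ U + x_t @ V)   # h(t) = g(h(t-1) U + x(t) V)
        y_t = softmax(h_t @ W)                # y(t) = f(h(t) W)
        return h_t, y_t

    h = np.zeros(d_h)
    for x_t in rng.normal(size=(5, d_in)):    # a 5-step input sequence
        h, y = rnn_step(h, x_t)               # each step needs the previous h

The loop makes the sequential dependence explicit: step t cannot start until h(t-1) is available, which is the parallelization issue raised on the next slides.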

SLIDE 5

FFN CNN RNN

Can model computation (e.g. matrix operations for a single input) be parallelized?

SLIDE 6

FFN CNN RNN

Can model computation (e.g. matrix operations for a single input) be parallelized?

SLIDE 7

FFN CNN RNN

Can model computation (e.g. matrix operations for a single input) be parallelized?

SLIDE 8

FFN CNN RNN

Can model computation (e.g. matrix operations for a single input) be parallelized?

Ultimately this limits how complex the model can be (i.e. its total number of parameters/weights) as compared to a CNN.

SLIDE 9

The Transformer: “Attention-only” models

Can handle sequences and long-distance dependencies, but….

  • Don’t want complexity of LSTM/GRU cells
  • Constant number of edges between input steps
  • Enables “interactions” (i.e. adaptations) between words
  • Easy to parallelize -- don’t need sequential processing.
SLIDE 10

The Transformer: “Attention-only” models

Challenge:

  • Long distance dependency when translating:

[Figure: encoder-decoder diagram; decoder inputs <go>, y(0), y(1), y(2), … and outputs y(0) … y(4)]

Kayla kicked the ball. The ball was kicked by Kayla.

SLIDE 11

The Transformer: “Attention-only” models

Challenge:

  • Long distance dependency when translating:

[Figure: encoder-decoder diagram; decoder inputs <go>, y(0), y(1), y(2), … and outputs y(0) … y(4)]

Kayla kicked the ball. The ball was kicked by Kayla.

SLIDE 12

Attention

[Figure: attention diagram; context c_hi is a weighted combination, with weights α_hi→s1 … α_hi→s4, of the values z1 … z4]

SLIDE 13

Attention

[Figure: attention diagram; a score function ω (weights W) applied to the query h_i produces weights α_hi→s1 … α_hi→s4 over the values z1 … z4, giving context c_hi]

SLIDE 14

Attention

[Figure: attention diagram; a score function ω (weights W) compares the query h_i to keys s1 … s4, producing weights α_hi→s1 … α_hi→s4 over the values z1 … z4, giving context c_hi]
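A minimal numpy sketch of this query/key/value view of attention (the bilinear score sᵀ W q and all dimensions are assumptions for illustration):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    d = 4
    rng = np.random.default_rng(1)
    query = rng.normal(size=d)        # the query state h_i
    keys = rng.normal(size=(4, d))    # one key s_j per position
    values = rng.normal(size=(4, d))  # the values z_1 .. z_4
    W = rng.normal(size=(d, d))       # weights of the (assumed bilinear) score function

    scores = keys @ W @ query         # score(h_i, s_j) = s_j^T W h_i
    alphas = softmax(scores)          # attention weights alpha_{hi -> s_j}
    context = alphas @ values         # c_hi = sum_j alpha_j * z_j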

SLIDE 15

The Transformer: “Attention-only” models

Challenge:

  • Long distance dependency when translating:

Attention came about for encoder-decoder models. Then self-attention was introduced:

SLIDE 16

Attention

[Figure: attention diagram; a score function ω (weights W) compares the query h_i to keys s1 … s4, producing weights α_hi→s1 … α_hi→s4 over the values z1 … z4, giving context c_hi]

SLIDE 17

Self-Attention

[Figure: self-attention diagram; the query for position i is scored (function ω, weights W) against keys s1, s2, …, s4 from the same sequence, producing weights α over the values z1, z2, …, zi, …, z4 and giving context c_i]
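In self-attention the queries, keys, and values all come from the same sequence of hidden states, so every position attends over every other position. A minimal sketch (here the raw hidden states serve directly as queries, keys, and values, and the plain dot-product score is an assumption; learned projections come a few slides later):

    import numpy as np

    def softmax(z, axis=-1):
        e = np.exp(z - z.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(2)
    n, d = 4, 8
    H = rng.normal(size=(n, d))        # hidden states h_1 .. h_n of one sequence

    scores = H @ H.T                   # score(i, j) between position i's query and j's key
    alphas = softmax(scores, axis=-1)  # one attention distribution per position i
    contexts = alphas @ H              # c_i = sum_j alpha_ij * h_j

Every c_i is computed at once with these matrix products, which is what makes self-attention easy to parallelize.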

SLIDE 18

The Transformer: “Attention-only” models

Attention as weighting a value based on a query and key (Eisenstein, 2018).

SLIDE 19

The Transformer: “Attention-only” models

(Eisenstein, 2018)

[Figure: attention output computed from weights α and score function ω over hidden states h_{i-1}, h_i, h_{i+1} of input x]

SLIDE 20

The Transformer: “Attention-only” models

(Eisenstein, 2018)

[Figure: self-attention; the output for h_i is computed from weights α and score function ω over the hidden states h_{i-1}, h_i, h_{i+1} themselves]

SLIDE 21

The Transformer: “Attention-only” models

[Figure: attention output over hidden states h_{i-1}, h_i, h_{i+1}, h_{i+2}]

SLIDE 22

The Transformer: “Attention-only” models

[Figure: hidden states h_{i-1} … h_{i+2} computed from inputs w_{i-1} … w_{i+2} by a feed-forward network (FFN); attention over them produces the output]

SLIDE 23

The Transformer: “Attention-only” models

[Figure: inputs w_{i-1} … w_{i+2} → hidden states h_{i-1} … h_{i+2} → attention → outputs y_{i-1} … y_{i+2}]

SLIDE 24

The Transformer: “Attention-only” models

[Figure: inputs w_{i-1} … w_{i+2} → hidden states h_{i-1} … h_{i+2} → attention → outputs y_{i-1} … y_{i+2} … ]

SLIDE 25

The Transformer: “Attention-only” models

[Figure: inputs w_{i-1} … w_{i+2} → hidden states h_{i-1} … h_{i+2} → attention → outputs y_{i-1} … y_{i+2}]

Attend to all hidden states in your “neighborhood”.

SLIDE 26

The Transformer: “Attention-only” models

[Figure: attention weights computed as dot products kᵀq between the query and each key; the values are multiplied (×) by the weights and summed (+)]

SLIDE 27

The Transformer: “Attention-only” models

[Figure: scaled dot-product attention; the dot products kᵀq are divided by a scaling parameter and passed through σ before weighting the values]

SLIDE 28

The Transformer: “Attention-only” models

[Figure: scaled dot-product attention with linear layers producing K, Q, and V]

Linear layer: WᵀX. One set of weights each for K, Q, and V.
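Putting the last few slides together, a minimal numpy sketch of single-head scaled dot-product attention with separate linear layers for Q, K, and V (the dimensions are toy assumptions; the √d_k scaling and σ = softmax follow the original Transformer paper):

    import numpy as np

    def softmax(z, axis=-1):
        e = np.exp(z - z.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(3)
    n, d_model, d_k = 4, 16, 8
    X = rng.normal(size=(n, d_model))         # input word representations

    # One set of linear-layer weights each for queries, keys, and values.
    W_Q = rng.normal(size=(d_model, d_k))
    W_K = rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_k))

    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d_k)           # dot products k·q, then the scaling parameter
    alphas = softmax(scores, axis=-1)         # σ over the scaled scores
    output = alphas @ V                       # weighted sum of the values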

SLIDE 29

The Transformer

Limitation (thus far): Can’t capture multiple types of dependencies between words.

SLIDE 30

The Transformer

Solution: Multi-head attention

SLIDE 31

Multi-head Attention
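Multi-head attention runs several scaled dot-product attentions in parallel, each with its own Q/K/V projections, so each head can capture a different type of dependency; the head outputs are concatenated and projected back. A minimal sketch (dimensions and random weights are assumptions for illustration):

    import numpy as np

    def softmax(z, axis=-1):
        e = np.exp(z - z.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention_head(X, W_Q, W_K, W_V):
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        alphas = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
        return alphas @ V

    rng = np.random.default_rng(4)
    n, d_model, n_heads = 4, 16, 4
    d_k = d_model // n_heads
    X = rng.normal(size=(n, d_model))

    # Each head has its own projections, so it can attend to different relations.
    heads = []
    for _ in range(n_heads):
        W_Q, W_K, W_V = [rng.normal(size=(d_model, d_k)) for _ in range(3)]
        heads.append(attention_head(X, W_Q, W_K, W_V))

    W_O = rng.normal(size=(n_heads * d_k, d_model))    # output projection
    multi_head_out = np.concatenate(heads, axis=-1) @ W_O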

SLIDE 32

Transformer for Encoder-Decoder

SLIDE 33

Transformer for Encoder-Decoder

sequence index (t)
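Attention itself has no notion of word order, so the Transformer adds an encoding of the sequence index (t) to each input embedding. A minimal sketch of the sinusoidal positional encoding from Vaswani et al. (2017), with toy dimensions assumed:

    import numpy as np

    def positional_encoding(n_positions, d_model):
        # PE(t, 2i)   = sin(t / 10000^(2i / d_model))
        # PE(t, 2i+1) = cos(t / 10000^(2i / d_model))
        t = np.arange(n_positions)[:, None]
        i = np.arange(0, d_model, 2)[None, :]
        angles = t / np.power(10000.0, i / d_model)
        pe = np.zeros((n_positions, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    # Added to the word embeddings before the first attention layer.
    embeddings = np.random.default_rng(5).normal(size=(10, 16))
    inputs = embeddings + positional_encoding(10, 16)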

SLIDE 34

Transformer for Encoder-Decoder

SLIDE 35

Transformer for Encoder-Decoder

Residualized Connections

SLIDE 36

Transformer for Encoder-Decoder

Residualized Connections

residuals enable positional information to be passed along
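A minimal sketch of a residual connection wrapped around a sublayer, which is how information such as the positional encoding can pass along even if the sublayer ignores it (the layer-norm details and the toy feed-forward sublayer are assumptions for illustration):

    import numpy as np

    def layer_norm(x, eps=1e-5):
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + eps)

    def residual_block(x, sublayer):
        # The sublayer's output is added back onto its input, so x (including
        # any positional information it carries) is passed along.
        return layer_norm(x + sublayer(x))

    rng = np.random.default_rng(6)
    x = rng.normal(size=(4, 16))
    W = rng.normal(size=(16, 16))
    out = residual_block(x, lambda h: np.maximum(0.0, h @ W))  # toy feed-forward sublayer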

SLIDE 37

Transformer for Encoder-Decoder

SLIDE 38

Transformer for Encoder-Decoder

essentially, a language model

SLIDE 39

Transformer for Encoder-Decoder

essentially, a language model; the decoder blocks out (masks) future inputs
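A minimal sketch of how the decoder blocks out future inputs: positions j > i get a score of -inf before the softmax, so position i can only attend to positions up to i (dimensions and random values are assumptions for illustration):

    import numpy as np

    def softmax(z, axis=-1):
        e = np.exp(z - z.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(7)
    n, d = 5, 8
    Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

    scores = Q @ K.T / np.sqrt(d)
    # Mask out the future: position i may only attend to positions j <= i,
    # so the decoder behaves like a left-to-right language model.
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf
    alphas = softmax(scores, axis=-1)         # masked positions get weight 0
    decoder_out = alphas @ V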

SLIDE 40

Transformer for Encoder-Decoder

essentially, a language model; add conditioning of the LM on the encoder output

SLIDE 41

Transformer for Encoder-Decoder

SLIDE 42

Transformer (as of 2017)

“WMT-2014” Data Set. BLEU scores:

SLIDE 43

Transformer

  • Utilize Self-Attention
  • Simple attention scoring function (scaled dot product)
  • Added linear layers for Q, K, and V
  • Multi-head attention
  • Added positional encoding
  • Added residual connection
  • Simulate decoding by masking

https://4.bp.blogspot.com/-OlrV-PAtEkQ/W3RkOJCBkaI/AAAAAAAADOg/gNZXo_eK3tMNOmIfsuvPzrRfNb3qFQwJwCLcBGAs/s640/image1.gif

SLIDE 44

Transformer

Why?

  • Don’t need complexity of LSTM/GRU cells
  • Constant number of edges between words (or input steps)
  • Enables “interactions” (i.e. adaptations) between words
  • Easy to parallelize -- don’t need sequential processing.

Drawbacks:

  • Only unidirectional by default
  • Only a “single-hop” relationship per layer (multiple layers needed to capture multiple)

SLIDE 45

Why?

  • Don’t need complexity of LSTM/GRU cells
  • Constant number of edges between words (or input steps)
  • Enables “interactions” (i.e. adaptations) between words
  • Easy to parallelize -- don’t need sequential processing.

Drawbacks of Vanilla Transformers:

  • Only unidirectional by default
  • Only a “single-hop” relationship per layer (multiple layers needed to capture multiple)

BERT

Bidirectional Encoder Representations from Transformers. Produces contextualized embeddings (or a pre-trained contextualized encoder).

SLIDE 46

Why?

  • Don’t need complexity of LSTM/GRU cells
  • Constant number of edges between words (or input steps)
  • Enables “interactions” (i.e. adaptations) between words
  • Easy to parallelize -- don’t need sequential processing.

Drawbacks of Vanilla Transformers:

  • Only unidirectional by default
  • Only a “single-hop” relationship per layer (multiple layers needed to capture multiple)

BERT

Bidirectional Encoder Representations from Transformers. Produces contextualized embeddings (or a pre-trained contextualized encoder).

  • Bidirectional context by “masking” in the middle (see the sketch below)
  • A lot of layers, hidden states, attention heads.
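A toy sketch of the masked-language-model idea behind “masking in the middle” (the 15% selection rate follows the BERT paper; everything else here, including replacing every selected token with [MASK], is a simplified assumption):

    import numpy as np

    rng = np.random.default_rng(8)
    tokens = ["the", "ball", "was", "kicked", "by", "kayla"]

    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < 0.15:      # each position is selected with prob. 0.15
            targets[i] = tok         # the model must predict the original token
            masked[i] = "[MASK]"

    # The encoder sees `masked` with context on BOTH sides of each [MASK],
    # which is what makes the learned representations bidirectional.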
SLIDE 47

BERT

Differences from previous state of the art:

  • Bidirectional transformer (through masking)
  • Directions jointly trained at once.
  • Capture sentence-level relations

(Devlin et al., 2019)

tokenize into “word pieces”

SLIDE 48

BERT: Attention by Layers

https://colab.research.google.com/drive/1vlOJ1lhdujVjfH857hvYKIdKPTD9Kid8

(Vig, 2019)

SLIDE 49

BERT Performance: e.g. Question Answering

https://rajpurkar.github.io/SQuAD-explorer/

SLIDE 50

BERT: Pre-training; Fine-tuning

12 or 24 layers

SLIDE 51

BERT: Pre-training; Fine-tuning

12 or 24 layers

SLIDE 52

BERT: Pre-training; Fine-tuning

12 or 24 layers, plus a novel classifier on top (e.g. a sentiment classifier, stance detector, etc.)
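A minimal sketch of the fine-tuning setup: a new, randomly initialized classifier layer sits on top of the pre-trained encoder's output, and everything is trained together on the downstream task. The `encode` function below is a hypothetical stand-in for the pre-trained 12- or 24-layer BERT encoder, and all dimensions are assumptions:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    d_model, n_classes = 16, 3
    rng = np.random.default_rng(9)

    def encode(tokens):
        # Hypothetical placeholder for the pre-trained BERT encoder's
        # sentence-level ([CLS]) representation.
        return rng.normal(size=d_model)

    W_cls = rng.normal(size=(d_model, n_classes))   # the novel classifier weights
    b_cls = np.zeros(n_classes)

    cls_vec = encode(["kayla", "kicked", "the", "ball"])
    probs = softmax(cls_vec @ W_cls + b_cls)
    # During fine-tuning, W_cls, b_cls, and the encoder weights are all updated.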

SLIDE 53

The Transformer: “Attention-only” models

Can handle sequences and long-distance dependencies, but….

  • Don’t want complexity of LSTM/GRU cells
  • Constant number of edges between input steps
  • Enables “interactions” (i.e. adaptations) between words
  • Easy to parallelize -- don’t need sequential processing.