
SLIDE 1

Transformer Sequence Models and Sequence Applications

(Machine Translation, Speech Recognition)

CSE392 - Spring 2019 Special Topic in CS

SLIDE 2

Most NLP tasks, e.g.:

  • Sequence Tasks
    ○ Language Modeling
    ○ Machine Translation
    ○ Speech Recognition
  • Transformer Networks
    ○ Transformers
    ○ BERT

SLIDE 3

Multi-level bidirectional RNN (LSTM or GRU)

(Eisenstein, 2018)

SLIDE 4

Multi-level bidirectional RNN (LSTM or GRU)

Each node has a forward (→) and a backward (←) hidden state; the two can be represented as a single concatenated vector. (Eisenstein, 2018)

SLIDE 5

Multi-level bidirectional RNN (LSTM or GRU)

The average of the top layer (the average of the concatenated vectors) serves as an embedding of the sequence. (Eisenstein, 2018)

SLIDE 6

Multi-level bidirectional RNN (LSTM or GRU)

Sometimes the left-most and right-most hidden states are used instead. (Eisenstein, 2018)
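The pooling choices on slides 4-6 can be made concrete with a short PyTorch sketch (dimensions and data are illustrative, not from the lecture):

```python
import torch
import torch.nn as nn

# Multi-level bidirectional RNN (2 layers of LSTM cells); sizes illustrative.
rnn = nn.LSTM(input_size=32, hidden_size=64, num_layers=2,
              bidirectional=True, batch_first=True)

x = torch.randn(1, 10, 32)       # (batch, seq_len, input_size)
out, _ = rnn(x)                  # (1, 10, 128): forward ++ backward per token

# Slide 4: each position concatenates forward and backward hidden states.
# Slide 5: average the top layer (the concatenated vectors) as an embedding.
embedding_avg = out.mean(dim=1)  # (1, 128)

# Slide 6: use the end-points instead -- the forward state at the last token
# and the backward state at the first token.
embedding_ends = torch.cat([out[:, -1, :64], out[:, 0, 64:]], dim=-1)
```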

SLIDE 7

Encoder

A representation of the input. (Eisenstein, 2018)

SLIDE 8

Encoder-Decoder

Representing the input and converting it to output. (Eisenstein, 2018)

SLIDE 9

Encoder-Decoder

(Eisenstein, 2018)

[Figure: the decoder emits y(0), y(1), y(2), y(3), … through a softmax layer]

SLIDE 10

Encoder-Decoder

[Figure: decoder inputs begin with <go> followed by the previous outputs y(0), y(1), y(2), …; the softmax at each step emits y(0), y(1), y(2), y(3), …]

SLIDE 11

Encoder-Decoder

A representation of the input.

[Figure: the encoder's final state initializes the decoder, which starts from <go> and emits y(0), y(1), y(2), y(3), … through a softmax]

SLIDE 12

Encoder-Decoder

A representation of the input.

[Figure: as in slide 11]

The decoder is essentially a language model conditioned on the final state from the encoder.
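A toy sketch of this conditioning (hypothetical sizes; a single GRU cell stands in for the decoder, with greedy decoding as on the next slide):

```python
import torch
import torch.nn as nn

vocab_size, hidden = 100, 64               # illustrative sizes
embed = nn.Embedding(vocab_size, hidden)
cell = nn.GRUCell(hidden, hidden)
out_proj = nn.Linear(hidden, vocab_size)   # the softmax layer's logits

GO = 0                                     # stand-in for the <go> token id
h = torch.randn(1, hidden)                 # stand-in for encoder final state
token = torch.tensor([GO])

generated = []
for _ in range(10):
    h = cell(embed(token), h)              # LM step, conditioned via h
    token = out_proj(h).argmax(dim=-1)     # greedy pick; fed back next step
    generated.append(token.item())
```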

SLIDE 13

Encoder-Decoder

When applied to new data, each predicted token is fed back in as the next decoder input (starting from <go>): essentially a language model conditioned on the final state from the encoder.

SLIDE 14

Encoder-Decoder

A representation of the input.

[Figure: as in slide 11]

SLIDE 15

Encoder-Decoder “seq2seq” model

[Figure: seq2seq: the encoder reads Language 1 (e.g. Chinese); the decoder, starting from <go>, emits Language 2 (e.g. English) as y(0), y(1), y(2), y(3), … through a softmax]

SLIDE 16

Encoder-Decoder

Challenge:

  • Long distance dependency when translating:

[Figure: unrolled encoder-decoder; decoder input <go>, y(0), y(1), y(2), …; outputs y(0)…y(4)]


SLIDE 18

Encoder-Decoder

Challenge:

  • Long distance dependency when translating:

[Figure: unrolled encoder-decoder; outputs y(0)…y(4)]

Kayla kicked the ball.
The ball was kicked by Kayla.

SLIDE 19

Encoder-Decoder

Challenge:

  • Long distance dependency when translating:

[Figure: unrolled encoder-decoder; outputs y(0)…y(4)]

A lot of responsibility is put on the fixed-size hidden state passed from the encoder to the decoder.

Kayla kicked the ball.
The ball was kicked by Kayla.

SLIDE 20

Long Distance / Out of order dependencies

[Figure: encoder-decoder with softmax outputs y(0), y(1), y(2), y(3), …]

A lot of responsibility is put on the fixed-size hidden state passed from the encoder to the decoder.

SLIDE 21

Long Distance / Out of order dependencies

[Figure: encoder-decoder with softmax outputs y(0), y(1), y(2), y(3), …]

SLIDE 22

Attention

[Figure: the decoder attends over encoder states s1…s4]

SLIDE 23

Attention

Analogy: random access memory.

[Figure: the decoder attends over encoder states s1…s4]

SLIDE 24

Attention

[Figure: an attention layer sits between the encoder states s1…s4 and the decoder]

SLIDE 25

Attention

attention layer -- i: current token of output; n: one of the N tokens of input

[Figure: decoder states h_{i-1}, h_i, h_{i+1} query the attention layer over encoder states s1…s4 (values z_{n-1}, z_n, z_{n+1}), producing the context vector c_{h_i}]

SLIDE 26

Attention

[Figure: attention weights α_{h_i→s_1}, α_{h_i→s_2}, α_{h_i→s_3}, α_{h_i→s_4} over states s1…s4 combine into the context vector c_{h_i}]

SLIDE 27

Attention

[Figure: the same weights α_{h_i→s_n} applied to value vectors z1…z4 to form c_{h_i}]

Z is the vector to be attended to (the value in memory). It is typically the hidden states of the input (i.e., s_n) but can be anything.

SLIDE 28

Attention

[Figure: attention weights α_{h_i→s_1}…α_{h_i→s_4} over states s1…s4 combine into c_{h_i}]

SLIDE 29

Attention

[Figure: a score function ω compares the query h_i with each s_n; the normalized weights α_{h_i→s_1}…α_{h_i→s_4} combine into c_{h_i}]

SLIDE 30

Attention

[Figure: as in slide 29]

Score function parameters: v, W_h, W_s

SLIDE 31

Attention

[Figure: as in slide 29, with value vectors z1…z4]

Score function parameters: v, W_h, W_s. A useful abstraction is to make the vector attended to (the “value vector”, Z) separate from the “key vector” (s).

SLIDE 32

Attention

[Figure: as in slide 31, labeling z1…z4 as values, s1…s4 as keys, and h_i as the query]

Score function parameters: v, W_h, W_s. A useful abstraction is to make the vector attended to (the “value vector”, Z) separate from the “key vector” (s).
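Putting slides 26-32 together, a minimal NumPy sketch of attention with separate keys, values, and a query (the additive score ω follows Bahdanau et al., 2015; all sizes illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 8
s = np.random.randn(4, d)   # keys s_1..s_4 (encoder states)
z = np.random.randn(4, d)   # values z_1..z_4 (often just z = s)
h_i = np.random.randn(d)    # query: current decoder state

# Additive score omega(h_i, s_n) = v . tanh(W_h h_i + W_s s_n)
v, W_h, W_s = np.random.randn(d), np.random.randn(d, d), np.random.randn(d, d)
scores = np.array([v @ np.tanh(W_h @ h_i + W_s @ s_n) for s_n in s])

alpha = softmax(scores)     # attention weights alpha_{h_i -> s_n}
c_hi = alpha @ z            # context vector: weighted sum of the values
```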

SLIDE 33

Attention

[Figure: as in slide 29]

Alternative Scoring Functions

SLIDE 34

Attention

Alternative Scoring Functions

[Figure: as in slide 29]

If the variables are standardized, a matrix multiply produces a similarity score.
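The slide's list is not preserved in this transcript; the standard alternatives (dot product, bilinear, additive/MLP, and the scaled dot product used later by Transformers) are:

```latex
\omega_{\text{dot}}(h, s) = h^\top s \qquad
\omega_{\text{bilinear}}(h, s) = h^\top W s \qquad
\omega_{\text{additive}}(h, s) = v^\top \tanh(W_h h + W_s s) \qquad
\omega_{\text{scaled}}(h, s) = h^\top s / \sqrt{d}
```

The dot product is the "matrix multiply" case above: with standardized vectors it behaves like a correlation, i.e. a similarity score.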

SLIDE 35

Attention

(“synced”, 2017)

[Figure: attention weights from decoder state h_i over encoder states s4, s3, s2, s1]


SLIDE 37

Attention

(“synced”, 2017)

[Figure: as in slide 35]

(Bahdanau et al., 2015)


SLIDE 39

Machine Translation

Why?

  • $40 billion/year industry
  • A centerpiece of many genres of science fiction
  • A fairly “universal” problem:
    ○ Language understanding
    ○ Language generation
  • Societal benefits of inter-cultural communication

SLIDE 40

Machine Translation

Why?

  • $40 billion/year industry
  • A centerpiece of many genres of science fiction
  • A fairly “universal” problem:
    ○ Language understanding
    ○ Language generation
  • Societal benefits of inter-cultural communication

(Douglas Adams)

SLIDE 41

Machine Translation

Why does the neural network approach work? (Manning, 2018)

  • Joint end-to-end training: learning all parameters at once
  • Exploiting distributed representations (embeddings)
  • Exploiting variable-length context
  • High-quality generation from deep decoders: stronger language models (even when wrong, outputs make sense)

SLIDE 42

Machine Translation

As an optimization problem (Eisenstein, 2018):
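The equation on the slide was an image; Eisenstein (2018) frames translation as a search for the highest-scoring target sentence, which (reconstructed here in his notation) is:

```latex
\hat{w}^{(t)} = \operatorname*{argmax}_{w^{(t)}} \; \Psi\!\left(w^{(s)}, w^{(t)}\right)
```

where w^{(s)} is the source sentence, w^{(t)} ranges over candidate translations, and Ψ is a scoring function (in neural MT, the conditional log-probability from the decoder).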

SLIDE 43

Attention

(“synced”, 2017)

[Figure: as in slide 35]

SLIDE 44

Attention

Analogy: random access memory.

[Figure: the decoder attends over encoder states s1…s4]

SLIDE 45

Attention

[Figure: the decoder attends over encoder states s1…s4]

Do we even need all these RNNs?

(Vaswani et al., 2017: “Attention Is All You Need”)

SLIDE 46

Attention

[Figure: as in slide 32 -- values z1…z4, keys s1…s4, query h_i]

Score function parameters: v, W_h, W_s. A useful abstraction is to make the vector attended to (the “value vector”, Z) separate from the “key vector” (s).

SLIDE 47

Attention

[Figure: as in slide 32, annotated with the notation value z_j, key s_j, query h_i]

Score function parameters: v, W_h, W_s. A useful abstraction is to make the vector attended to (the “value vector”, Z) separate from the “key vector” (s). (Eisenstein, 2018)

SLIDE 48

The Transformer: “Attention-only” models

Attention as weighting a value based on a query and a key (Eisenstein, 2018):

SLIDE 49

The Transformer: “Attention-only” models

(Eisenstein, 2018)

[Figure: self-attention: inputs x feed states h_{i-1}, h_i, h_{i+1}; scores ω become weights α that produce the output]

SLIDE 50

The Transformer: “Attention-only” models

(Eisenstein, 2018)

[Figure: self attention -- each h_i is computed by attending over the other states h_{i-1}, h_{i+1}, …]

SLIDE 51

The Transformer: “Attention-only” models

[Figure: self-attention over states h_{i-1}…h_{i+2}]

SLIDE 52

The Transformer: “Attention-only” models

[Figure: word inputs w_{i-1}…w_{i+2} feed self-attention states h_{i-1}…h_{i+2}, followed by a feed-forward network (FFN)]

SLIDE 53

The Transformer: “Attention-only” models

[Figure: words w_{i-1}…w_{i+2} → self-attention states h_{i-1}…h_{i+2} → outputs y_{i-1}…y_{i+2}]

SLIDE 54

The Transformer: “Attention-only” models

[Figure: as in slide 53, stacked over multiple layers]

SLIDE 55

The Transformer: “Attention-only” models

[Figure: as in slide 53]

Attend to all hidden states in your “neighborhood”.

SLIDE 56

The Transformer: “Attention-only” models

[Figure: dot-product attention: the query is dotted (kᵀq) with each key; the resulting weights multiply (X) the values, which are summed (+)]

SLIDE 57

The Transformer: “Attention-only” models

[Figure: as in slide 56; the scores kᵀq are divided by a scaling parameter and passed through a softmax σ(k, q)]

SLIDE 58

The Transformer: “Attention-only” models

[Figure: as in slide 57]

Linear layer: WᵀX, with one set of weights for each of K, Q, and V.
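A minimal NumPy sketch of slides 56-58 (a sketch, not the lecture's code; sizes illustrative): project the inputs X with a separate weight matrix for each of K, Q, and V, score with scaled dot products, then softmax:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d = 4, 8                       # 4 tokens, hidden size 8 (illustrative)
X = np.random.randn(n, d)         # input vectors (e.g. word embeddings)

# Linear layer W^T X, one set of weights for each of K, Q, and V (slide 58).
W_k, W_q, W_v = (np.random.randn(d, d) for _ in range(3))
K, Q, V = X @ W_k, X @ W_q, X @ W_v

scores = Q @ K.T / np.sqrt(d)     # scaled dot products k^T q (slides 56-57)
alpha = softmax(scores, axis=-1)  # softmax: each row sums to 1
H = alpha @ V                     # each h_i is a weighted sum of the values
```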

SLIDE 59

The Transformer: “Attention-only” models

Why?

  • Don’t need the complexity of LSTM/GRU cells
  • Constant number of edges between words (or input steps)
  • Enables “interactions” (i.e. adaptations) between words
  • Easy to parallelize -- don’t need sequential processing
SLIDE 60

The Transformer

Limitation (thus far): Can’t capture multiple types of dependencies between words.

SLIDE 61

The Transformer

Solution: Multi-head attention

SLIDE 62

Multi-head Attention
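The slide's diagram is not in the transcript; a minimal NumPy sketch of the idea (each head attends in its own subspace, so different heads can capture different dependency types; sizes illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d, heads = 4, 8, 2                        # d_head = d // heads
d_h = d // heads
X = np.random.randn(n, d)

W_k, W_q, W_v, W_o = (np.random.randn(d, d) for _ in range(4))
K, Q, V = X @ W_k, X @ W_q, X @ W_v

outs = []
for h in range(heads):                       # one attention per head
    sl = slice(h * d_h, (h + 1) * d_h)       # this head's subspace
    a = softmax(Q[:, sl] @ K[:, sl].T / np.sqrt(d_h), axis=-1)
    outs.append(a @ V[:, sl])

H = np.concatenate(outs, axis=-1) @ W_o      # concat heads; final linear map
```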

SLIDE 63

Transformer for Encoder-Decoder

SLIDE 64

Transformer for Encoder-Decoder

[Figure axis: sequence index (t)]
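The figure itself is lost; given the axis label it presumably showed positional encodings plotted against the sequence index. The sinusoidal form from Vaswani et al. (2017), added to each input embedding, is:

```latex
PE_{(t,\,2i)} = \sin\!\left(t / 10000^{2i/d}\right), \qquad
PE_{(t,\,2i+1)} = \cos\!\left(t / 10000^{2i/d}\right)
```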

SLIDE 65

Transformer for Encoder-Decoder

SLIDE 66

Transformer for Encoder-Decoder

Residualized Connections

SLIDE 67

Transformer for Encoder-Decoder

Residualized Connections

residuals enable positional information to be passed along
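A small sketch of the residual pattern (assuming the standard add-and-norm arrangement of Vaswani et al., 2017; the FFN and all sizes are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def residual_sublayer(x, sublayer):
    # The sublayer output is *added* to its input, so information already
    # in x (e.g. positional encodings) is passed along unchanged.
    return layer_norm(x + sublayer(x))

W1, W2 = np.random.randn(8, 32), np.random.randn(32, 8)
ffn = lambda x: np.maximum(0, x @ W1) @ W2         # position-wise FFN (ReLU)
h = residual_sublayer(np.random.randn(4, 8), ffn)  # attention wraps the same way
```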

SLIDE 68

Transformer for Encoder-Decoder

SLIDE 69

Transformer for Encoder-Decoder

Essentially, a language model.

SLIDE 70

Transformer for Encoder-Decoder

Essentially, a language model. The decoder blocks out (masks) future inputs.
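A minimal sketch of that masking (illustrative; it extends the earlier self-attention sketch): scores for future positions are set to -inf before the softmax, so their attention weights become exactly zero:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n = 4
scores = np.random.randn(n, n)    # query-key scores, as in earlier sketches

# Causal mask: position i may only attend to positions <= i.
future = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)

alpha = softmax(scores, axis=-1)  # upper triangle is exactly 0
```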

SLIDE 71

Transformer for Encoder-Decoder

Essentially, a language model. Add conditioning of the LM on the encoder.

SLIDE 72

Transformer for Encoder-Decoder

SLIDE 73

Transformer (as of 2017)

“WMT-2014” Data Set. BLEU scores:

SLIDE 74

Transformer

  • Utilize self-attention
  • Simple attention scoring function (dot product, scaled)
  • Added linear layers for Q, K, and V
  • Multi-head attention
  • Added positional encoding
  • Added residual connections
  • Simulate decoding by masking
SLIDE 75

Transformer

Why?

  • Don’t need the complexity of LSTM/GRU cells
  • Constant number of edges between words (or input steps)
  • Enables “interactions” (i.e. adaptations) between words
  • Easy to parallelize -- don’t need sequential processing

Drawbacks:

  • Only unidirectional by default
  • Only a “single-hop” relationship per layer (multiple layers to capture multiple)

SLIDE 76

Why?

  • Don’t need the complexity of LSTM/GRU cells
  • Constant number of edges between words (or input steps)
  • Enables “interactions” (i.e. adaptations) between words
  • Easy to parallelize -- don’t need sequential processing

Drawbacks of Vanilla Transformers:

  • Only unidirectional by default
  • Only a “single-hop” relationship per layer (multiple layers to capture multiple)

BERT

Bidirectional Encoder Representations from Transformers. Produces contextualized embeddings (or a pre-trained contextualized encoder).

SLIDE 77

Why?

  • Don’t need the complexity of LSTM/GRU cells
  • Constant number of edges between words (or input steps)
  • Enables “interactions” (i.e. adaptations) between words
  • Easy to parallelize -- don’t need sequential processing

Drawbacks of Vanilla Transformers:

  • Only unidirectional by default
  • Only a “single-hop” relationship per layer (multiple layers to capture multiple)

BERT

Bidirectional Encoder Representations from Transformers. Produces contextualized embeddings (or a pre-trained contextualized encoder).

  • Bidirectional context by “masking” in the middle
  • A lot of layers, hidden states, attention heads.
SLIDE 78

BERT

Bidirectional Encoder Representations from Transformers. Produces contextualized embeddings (or a pre-trained contextualized encoder).

  • Bidirectional context by “masking” in the middle
  • A lot of layers, hidden states, attention heads.

She saw the man on the hill with the telescope.
She [mask] the man on the hill [mask] the telescope.

SLIDE 79

BERT

Bidirectional Encoder Representations from Transformers. Produces contextualized embeddings (or a pre-trained contextualized encoder).

  • Bidirectional context by “masking” in the middle
  • A lot of layers, hidden states, attention heads.

She saw the man on the hill with the telescope.
She [mask] the man on the hill [mask] the telescope.

Mask 1 in 7 words:

  • Too few: expensive, less robust
  • Too many: not enough context
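A toy sketch of the masking step (illustrative code, not BERT's actual pipeline; BERT's published masking rate is 15%, roughly 1 in 7):

```python
import random

def mask_tokens(tokens, rate=0.15, mask="[mask]"):
    """Randomly mask tokens; the model must predict the originals
    from bidirectional context."""
    out, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < rate:
            targets[i] = tok        # what the model must reconstruct
            out.append(mask)
        else:
            out.append(tok)
    return out, targets

sent = "She saw the man on the hill with the telescope .".split()
masked, targets = mask_tokens(sent)
```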
SLIDE 80

BERT

Bidirectional Encoder Representations from Transformers. Produces contextualized embeddings (or a pre-trained contextualized encoder).

  • Bidirectional context by “masking” in the middle
  • A lot of layers, hidden states, attention heads.
  • BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters

SLIDE 81

BERT

Bidirectional Encoder Representations from Transformers. Produces contextualized embeddings (or a pre-trained contextualized encoder).

  • Bidirectional context by “masking” in the middle
  • A lot of layers, hidden states, attention heads.
  • BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Multilingual Cased: 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters

SLIDE 82

BERT

(Devlin et al., 2019)

SLIDE 83

BERT

Differences from previous state of the art:

  • Bidirectional transformer (through masking)
  • Directions jointly trained at once.

(Devlin et al., 2019)

SLIDE 84

BERT

Differences from previous state of the art:

  • Bidirectional transformer (through masking)
  • Directions jointly trained at once.
  • Capture sentence-level relations

(Devlin et al., 2019)


SLIDE 88

BERT

Differences from previous state of the art:

  • Bidirectional transformer (through masking)
  • Directions jointly trained at once.
  • Capture sentence-level relations

(Devlin et al., 2019)

tokenize into “word pieces”

SLIDE 89

BERT Performance: e.g. Question Answering

https://rajpurkar.github.io/SQuAD-explorer/

SLIDE 90

BERT: Attention by Layers

https://colab.research.google.com/drive/1vlOJ1lhdujVjfH857hvYKIdKPTD9Kid8

(Vig, 2019)

SLIDE 91

BERT: Pre-training; Fine-tuning

12 or 24 layers


SLIDE 93

BERT: Pre-training; Fine-tuning

12 or 24 layers
Novel classifier (e.g. sentiment classifier, stance detector, etc.)

SLIDE 94

BERT: Pre-training; Fine-tuning

The [CLS] vector at the start is supposed to capture the meaning of the whole sequence.
Novel classifier (e.g. sentiment classifier, stance detector, etc.)

SLIDE 95

BERT: Pre-training; Fine-tuning

The [CLS] vector at the start is supposed to capture the meaning of the whole sequence. The average of the top layer (or second-to-top) is also often used.
Novel classifier (e.g. sentiment classifier, stance detector, etc.)
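Both pooling options are easy to extract with the Hugging Face transformers library (a convenience API that postdates these slides; a hedged sketch using standard calls, not the lecture's code):

```python
import torch
from transformers import BertModel, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")

inputs = tok("Kayla kicked the ball.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

hidden = out.last_hidden_state    # (1, seq_len, 768): top-layer states
cls_vec = hidden[:, 0]            # [CLS] vector: whole-sequence summary
avg_vec = hidden.mean(dim=1)      # average of the top layer
# Either vector can feed a novel classifier (sentiment, stance, etc.).
```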

SLIDE 96

BERT for Machine Translation:

(Lample & Conneau, Facebook, 2019)


SLIDE 98

BERT for Machine Translation:

(Lample & Conneau, Facebook, 2019)

Use as a pre-trained model for feeding into a machine translation system.


SLIDE 100

Neural Machine Translation

Where does the neural approach fall short? (Manning, 2018)

  • The translation process is mostly a black box -- can’t answer “why” for reordering or word-choice decisions
  • No direct use of semantic or syntactic structures
  • Not modeling discourse structure -- only a rough sense of how sentences relate to each other; doesn’t model long-distance anaphora