SLIDE 1

CS 533: Natural Language Processing

Conditional Neural Language Models

Karl Stratos

Rutgers University

SLIDE 2

Language Models Considered So Far

$p_{Y|X}(y \mid x_{1:100})$

◮ Classical trigram models: $q_{Y|X}(y \mid x_{99}, x_{100})$

◮ Training: closed-form solution

◮ Log-linear models: $\mathrm{softmax}_y\big([w^\top \phi((x_{99}, x_{100}), y')]_{y'}\big)$

◮ Training: gradient descent on convex loss

◮ Neural models

◮ Feedforward: $\mathrm{softmax}_y(\mathrm{FF}([E_{x_{99}}, E_{x_{100}}]))$
◮ Recurrent: $\mathrm{softmax}_y(\mathrm{FF}(h(x_{1:99}), E_{x_{100}}))$
◮ Training: gradient descent on nonconvex loss
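As a concrete reference point, here is a minimal PyTorch sketch of the feedforward case above; the sizes and tensor names are illustrative, not from the slides.

```python
# Minimal sketch of the feedforward neural LM:
# p(y | x_99, x_100) = softmax_y(FF([E_{x_99}, E_{x_100}])).
import torch
import torch.nn as nn

V, d, d_hid = 10000, 64, 128               # vocab size, embedding dim, hidden dim
E = nn.Embedding(V, d)                     # one embedding vector per word type
FF = nn.Sequential(nn.Linear(2 * d, d_hid), nn.Tanh(), nn.Linear(d_hid, V))

x99, x100 = torch.tensor([7]), torch.tensor([42])    # ids of the two context words
logits = FF(torch.cat([E(x99), E(x100)], dim=-1))    # shape (1, V)
p_next = torch.softmax(logits, dim=-1)               # distribution over the next word
```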

SLIDE 3

Conditional Language Models

◮ Machine translation

And the programme has been implemented ⇒ Le programme a été mis en application

◮ Summarization

russian defense minister ivanov called sunday for the creation of a joint front for combating global terrorism ⇒ russia calls for joint front against terrorism

◮ Data-to-text generation

(Wiseman et al., 2017)

◮ Image captioning

the dog saw the cat

SLIDE 4

Encoder-Decoder Models

Much of machine learning is learning a mapping x → y where x and y are complicated structures. Encoder-decoder models are conditional models that handle this wide class of problems in two steps:

1. Encode the given input x using some architecture.
2. Decode the output y.

Training: again, minimize cross entropy:

$$\min_\theta \; \mathbb{E}_{(\text{input},\, \text{output}) \sim p_{Y|X}} \left[ -\ln q^\theta_{Y|X}(\text{output} \mid \text{input}) \right]$$

SLIDE 5

Agenda

1. MT
2. Attention in detail
3. Beam search

SLIDE 6

Machine Translation (MT)

◮ Goal: Translate text from one language to another.
◮ One of the oldest problems in artificial intelligence.

SLIDE 7

Some History

◮ Early ’90s: Rise of statistical MT (SMT)
◮ Exploit parallel text.

And the programme has been implemented
Le programme a été mis en application

◮ Infer word alignment (“IBM” models, Brown et al., 1993)

SLIDE 8

SMT: Huge Pipeline

1. Use IBM models to extract word alignment, phrase alignment (Koehn et al., 2003).
2. Use syntactic analyzers (e.g., parsers) to extract features and manipulate text (e.g., phrase re-ordering).
3. Use a separate language model to enforce fluency.
4. . . .

Multiple independently trained models patched together

◮ Really complicated, prone to error propagation

SLIDE 9

Rise of Neural MT

Started taking off around 2014

◮ Replaced the entire pipeline with a single model
◮ Called “end-to-end” training/prediction

Input: Le programme a été mis en application
Output: And the programme has been implemented

◮ Revolution in MT
◮ Better performance, way simpler system
◮ A hallmark of the recent neural domination in NLP
◮ Key: attention mechanism

SLIDE 10

Recap: Recurrent Neural Network (RNN)

◮ Always think of an RNN as a mapping $\phi : \mathbb{R}^d \times \mathbb{R}^{d'} \to \mathbb{R}^{d'}$

Input: an input vector $x \in \mathbb{R}^d$ and a state vector $h \in \mathbb{R}^{d'}$
Output: a new state vector $h' \in \mathbb{R}^{d'}$

◮ A left-to-right RNN processes an input sequence $x_1 \ldots x_m \in \mathbb{R}^d$ as

$$h_i = \phi(x_i, h_{i-1})$$

where $h_0$ is an initial state vector.

◮ Idea: $h_i$ is a representation of $x_i$ that has incorporated all inputs to the left.

$$h_i = \phi(x_i, \phi(x_{i-1}, \phi(x_{i-2}, \cdots \phi(x_1, h_0) \cdots)))$$

SLIDE 11

Variety 1: “Simple” RNN

◮ Parameters $U \in \mathbb{R}^{d' \times d}$ and $V \in \mathbb{R}^{d' \times d'}$

$$h_i = \tanh(U x_i + V h_{i-1})$$
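A direct transcription of this update in PyTorch, with illustrative dimensions and random parameters, might look as follows.

```python
# Simple RNN step: h_i = tanh(U x_i + V h_{i-1}).
import torch

d, d_prime = 64, 128
U = 0.1 * torch.randn(d_prime, d)
V = 0.1 * torch.randn(d_prime, d_prime)

def rnn_step(x, h):
    """Map (x, h) in R^d x R^d' to the new state h' in R^d'."""
    return torch.tanh(U @ x + V @ h)

# Left-to-right processing of x_1 ... x_m from the initial state h_0 = 0.
h = torch.zeros(d_prime)
for x in [torch.randn(d) for _ in range(5)]:
    h = rnn_step(x, h)    # h now summarizes all inputs seen so far
```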

SLIDE 12

Picture

SLIDE 13

Stacked Simple RNN

◮ Parameters $U^{(1)} \in \mathbb{R}^{d' \times d}$, $U^{(2)}, \ldots, U^{(L)} \in \mathbb{R}^{d' \times d'}$, and $V^{(1)}, \ldots, V^{(L)} \in \mathbb{R}^{d' \times d'}$

$$h^{(1)}_i = \tanh\big(U^{(1)} x_i + V^{(1)} h^{(1)}_{i-1}\big)$$
$$h^{(2)}_i = \tanh\big(U^{(2)} h^{(1)}_i + V^{(2)} h^{(2)}_{i-1}\big)$$
$$\vdots$$
$$h^{(L)}_i = \tanh\big(U^{(L)} h^{(L-1)}_i + V^{(L)} h^{(L)}_{i-1}\big)$$

◮ Think of it as a mapping $\phi : \mathbb{R}^d \times \mathbb{R}^{Ld'} \to \mathbb{R}^{Ld'}$:

$$\big(x_i, (h^{(1)}_{i-1}, \ldots, h^{(L)}_{i-1})\big) \mapsto \big(h^{(1)}_i, \ldots, h^{(L)}_i\big)$$
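A sketch of the stacked update, with illustrative dimensions: layer l consumes layer l−1's output at the same time step together with its own previous state.

```python
# Stacked simple RNN: h^(l)_i = tanh(U^(l) input + V^(l) h^(l)_{i-1}), where the
# input is x_i for l = 1 and h^(l-1)_i for l > 1.
import torch

d, d_prime, L = 64, 128, 3
U = [0.1 * torch.randn(d_prime, d if l == 0 else d_prime) for l in range(L)]
V = [0.1 * torch.randn(d_prime, d_prime) for l in range(L)]

def stacked_step(x, hs):
    """Map (x_i, (h^(1)_{i-1} ... h^(L)_{i-1})) to (h^(1)_i ... h^(L)_i)."""
    new_hs, inp = [], x
    for l in range(L):
        h = torch.tanh(U[l] @ inp + V[l] @ hs[l])
        new_hs.append(h)
        inp = h                       # this layer's output feeds the next layer
    return new_hs

hs = [torch.zeros(d_prime) for _ in range(L)]
for x in [torch.randn(d) for _ in range(5)]:
    hs = stacked_step(x, hs)
```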

SLIDE 14

Variety 2: Long Short-Term Memory (LSTM)

◮ Parameters $U^q, U^c, U^o \in \mathbb{R}^{d' \times d}$ and $V^q, V^c, V^o, W^q, W^o \in \mathbb{R}^{d' \times d'}$

$$q_i = \sigma(U^q x_i + V^q h_{i-1} + W^q c_{i-1})$$
$$c_i = (1 - q_i) \odot c_{i-1} + q_i \odot \tanh(U^c x_i + V^c h_{i-1})$$
$$o_i = \sigma(U^o x_i + V^o h_{i-1} + W^o c_i)$$
$$h_i = o_i \odot \tanh(c_i)$$

◮ Idea: “memory cells” $c_i$ can carry long-range information.

◮ What happens if $q_i$ is close to zero?

◮ Can be stacked as in the simple RNN.
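Transcribed directly, the update looks as follows. Note that this variant (a single gate $q_i$ coupling input and forget, with peephole terms) differs from torch.nn.LSTM, so the cell is written out by hand; dimensions are illustrative.

```python
# LSTM step as defined on this slide. If q_i is near zero, c_i ≈ c_{i-1}:
# the memory cell carries old information forward unchanged.
import torch

d, dp = 64, 128
Uq, Uc, Uo = (0.1 * torch.randn(dp, d) for _ in range(3))
Vq, Vc, Vo, Wq, Wo = (0.1 * torch.randn(dp, dp) for _ in range(5))

def lstm_step(x, h, c):
    q = torch.sigmoid(Uq @ x + Vq @ h + Wq @ c)             # coupled gate q_i
    c = (1 - q) * c + q * torch.tanh(Uc @ x + Vc @ h)       # memory cell c_i
    o = torch.sigmoid(Uo @ x + Vo @ h + Wo @ c)             # output gate o_i
    return o * torch.tanh(c), c                             # new (h_i, c_i)

h, c = torch.zeros(dp), torch.zeros(dp)
for x in [torch.randn(d) for _ in range(5)]:
    h, c = lstm_step(x, h, c)
```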

SLIDE 15

Translation Problem

◮ Vocabulary of the source language V^src:

V^src = {그, 개가, 보았다, 소식, 2017, 5월, . . .}

◮ Vocabulary of the target language V^trg:

V^trg = {the, dog, cat, 2021, May, . . .}

◮ Task. Given any sentence $x_1 \ldots x_m \in V^{\mathrm{src}}$, produce a corresponding translation $y_1 \ldots y_n \in V^{\mathrm{trg}}$.

개가 짖었다 ⇒ the dog barked

SLIDE 16

Evaluating Machine Translation

◮ $T$: human-translated sentences
◮ $\hat{T}$: machine-translated sentences
◮ $p_n$: precision of $n$-grams in $\hat{T}$ against $n$-grams in $T$ (sentence-wise)
◮ BLEU: controversial but popular scheme to automatically evaluate translation quality

$$\mathrm{BLEU} = \min\left(1, \frac{|\hat{T}|}{|T|}\right) \times \left(\prod_{n=1}^{4} p_n\right)^{1/4}$$
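A sketch of this sentence-level formula (real BLEU is usually computed at the corpus level with a smoother brevity penalty; this follows the slide's simplified version):

```python
# Slide's simplified BLEU: geometric mean of clipped n-gram precisions p_1..p_4,
# times the brevity factor min(1, |T_hat| / |T|).
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, ref):
    """hyp: machine-translated tokens (T_hat); ref: human-translated tokens (T)."""
    score = 1.0
    for n in range(1, 5):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        clipped = sum(min(count, r[g]) for g, count in h.items())
        p_n = clipped / max(sum(h.values()), 1)    # precision of hyp n-grams in ref
        score *= p_n ** 0.25                       # contributes to (prod p_n)^(1/4)
    return min(1.0, len(hyp) / len(ref)) * score

hyp = "russia calls for joint front against terrorism".split()
ref = "russia calls for joint front against global terrorism".split()
print(round(bleu(hyp, ref), 3))   # ~0.736
```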

SLIDE 17

Translation Model: Conditional Language Model

A translation model defines a probability distribution $p(y_1 \ldots y_n \mid x_1 \ldots x_m)$ over all sentences $y_1 \ldots y_n \in V^{\mathrm{trg}}$, conditioned on any sentence $x_1 \ldots x_m \in V^{\mathrm{src}}$.

Goal: Design a good translation model, e.g.,

p(the dog barked | 개가 짖었다)
> p(the cat barked | 개가 짖었다)
> p(dog the barked | 개가 짖었다)
> p(oqc shgwqw#w 1g0 | 개가 짖었다)

How can we use an RNN to build a translation model?

SLIDE 18

Basic Encoder-Decoder Framework

Model parameters

◮ Vector $e_x \in \mathbb{R}^d$ for every $x \in V^{\mathrm{src}}$
◮ Vector $e_y \in \mathbb{R}^d$ for every $y \in V^{\mathrm{trg}} \cup \{*\}$
◮ Encoder RNN $\psi : \mathbb{R}^d \times \mathbb{R}^{d'} \to \mathbb{R}^{d'}$ for $V^{\mathrm{src}}$
◮ Decoder RNN $\phi : \mathbb{R}^d \times \mathbb{R}^{d'} \to \mathbb{R}^{d'}$ for $V^{\mathrm{trg}}$
◮ Feedforward $f : \mathbb{R}^{d'} \to \mathbb{R}^{|V^{\mathrm{trg}}|+1}$

Basic idea

1. Transform $x_1 \ldots x_m \in V^{\mathrm{src}}$ with $\psi$ into some representation $\xi$.
2. Build a language model with $\phi$ over $V^{\mathrm{trg}}$ conditioning on $\xi$.

SLIDE 19

Encoder

For $i = 1 \ldots m$,

$$h^\psi_i = \psi\big(e_{x_i}, h^\psi_{i-1}\big)$$

so that

$$h^\psi_m = \psi\big(e_{x_m}, \psi\big(e_{x_{m-1}}, \psi\big(e_{x_{m-2}}, \cdots \psi\big(e_{x_1}, h^\psi_0\big) \cdots\big)\big)\big)$$
SLIDE 20

Decoder

Initialize $h^\phi_0 = h^\psi_m$ and $y_0 = *$.

For $i = 1, 2, \ldots$, the decoder defines a probability distribution over $V^{\mathrm{trg}} \cup \{\mathrm{STOP}\}$ as

$$h^\phi_i = \phi\big(e_{y_{i-1}}, h^\phi_{i-1}\big)$$
$$p_\Theta(y \mid x_1 \ldots x_m, y_0 \ldots y_{i-1}) = \mathrm{softmax}_y\big(f(h^\phi_i)\big)$$

Probability of translation $y_1 \ldots y_n$ given $x_1 \ldots x_m$:

$$p_\Theta(y_1 \ldots y_n \mid x_1 \ldots x_m) = \left[\prod_{i=1}^{n} p_\Theta(y_i \mid x_1 \ldots x_m, y_0 \ldots y_{i-1})\right] \times p_\Theta(\mathrm{STOP} \mid x_1 \ldots x_m, y_0 \ldots y_n)$$
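A sketch of the full model in PyTorch, with nn.GRUCell standing in for ψ and φ (the slides leave the RNN variety open) and illustrative vocabulary sizes. By assumption here, the start symbol * takes index TRG_V in the target embedding table, and f's last output coordinate is the STOP logit.

```python
import torch
import torch.nn as nn

d, dp = 32, 64
SRC_V, TRG_V = 1000, 1200
START = TRG_V                        # embedding index of the start symbol *
STOP = TRG_V                         # index of the STOP logit in f's output

e_src = nn.Embedding(SRC_V, d)       # e_x for x in V_src
e_trg = nn.Embedding(TRG_V + 1, d)   # e_y for y in V_trg plus *
psi = nn.GRUCell(d, dp)              # encoder RNN
phi = nn.GRUCell(d, dp)              # decoder RNN
f = nn.Linear(dp, TRG_V + 1)         # f : R^d' -> R^{|V_trg| + 1}

def log_p(x_ids, y_ids):
    """log p(y_1 ... y_n | x_1 ... x_m), including the final STOP term."""
    h = torch.zeros(1, dp)
    for x in x_ids:                                   # encode: h becomes h^psi_m
        h = psi(e_src(torch.tensor([x])), h)
    total, prev = torch.zeros(()), START              # h^phi_0 = h^psi_m, y_0 = *
    for y in y_ids + [STOP]:
        h = phi(e_trg(torch.tensor([prev])), h)       # h^phi_i
        total = total + torch.log_softmax(f(h), dim=-1)[0, y]
        prev = y if y < TRG_V else START              # STOP only ends the loop
    return total

print(log_p([3, 14, 15], [9, 26]))                    # a toy source/target pair
```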

SLIDE 21

Slide Credit: Danqi Chen & Karthik Narasimhan

Encoder

(figure: an RNN encoder unrolled over the sentence “This cat is cute”, mapping word embeddings $x_t$ to hidden states $h_t$)

SLIDE 22

Slide Credit: Danqi Chen & Karthik Narasimhan

Encoder

(figure: the same encoder after its first step, computing $h_1$ from $x_1$ and $h_0$)

SLIDE 23

Slide Credit: Danqi Chen & Karthik Narasimhan

Encoder

(figure: the encoder after processing all four words, producing states $h_1 \ldots h_4$)

SLIDE 24

Slide Credit: Danqi Chen & Karthik Narasimhan

Encoder

(figure: the complete encoder; the final state, labeled $h^{\mathrm{enc}}$, is the encoded representation of “This cat is cute”)

SLIDE 25

Slide Credit: Danqi Chen & Karthik Narasimhan

Decoder

(figure: an RNN decoder initialized with $h^{\mathrm{enc}}$; from the input “&lt;s&gt; ce chat est mignon” it produces states $z_1 \ldots z_5$ and emits “ce chat est mignon &lt;e&gt;”)

SLIDE 26

Slide Credit: Danqi Chen & Karthik Narasimhan

Decoder

(figure: the same decoder emitting its first output $y_1$)

SLIDE 27

Slide Credit: Danqi Chen & Karthik Narasimhan

Decoder

(figure: the decoder emitting $y_1$ and $y_2$)

SLIDE 28

Slide Credit: Danqi Chen & Karthik Narasimhan

Decoder

(figure: the fully unrolled decoder emitting $y_1 \ldots y_5$)

A conditioned language model

SLIDE 29

Training

Given parallel text of $N$ sentence-translation pairs $(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})$, find parameters $\Theta^*$ that maximize the log likelihood of the data:

$$\Theta^* \approx \arg\max_\Theta \sum_{i=1}^{N} \log p_\Theta\big(y^{(i)} \mid x^{(i)}\big)$$

In PyTorch: compute the negative log likelihood as `loss`, then call `loss.backward()` and `optim.step()`.

Training is not trivial due to exploding/vanishing gradients.
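A sketch of the corresponding loop, reusing the log_p function and module names from the encoder-decoder sketch above (`pairs` is a stand-in for the parallel text); gradient clipping is one standard guard against the exploding-gradient problem.

```python
import torch

modules = (e_src, e_trg, psi, phi, f)
params = [p for m in modules for p in m.parameters()]
optim = torch.optim.Adam(params, lr=1e-3)

for x_ids, y_ids in pairs:                       # pairs: iterable of (x^(i), y^(i))
    loss = -log_p(x_ids, y_ids)                  # cross entropy: -log p(y | x)
    optim.zero_grad()
    loss.backward()                              # the slide's loss.backward()
    torch.nn.utils.clip_grad_norm_(params, 5.0)  # guard against exploding gradients
    optim.step()                                 # the slide's optim.step()
```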

SLIDE 30

Sequence-to-Sequence (Seq2Seq) Learning (Sutskever et al., 2014)

Problems?

SLIDE 31

Decoder with Attention

◮ Instead of using one fixed vector to encode all of $x_1 \ldots x_m$, the decoder decides which words to pay attention to at every step.

◮ For $i = 0, 1, \ldots$,

$$p_\Theta(y \mid x_1 \ldots x_m, y_0 \ldots y_i) = \mathrm{softmax}_y\left(\mathrm{FF}\left(\sum_{j=1}^{m} \alpha_{i,j} h^\psi_j\right)\right)$$

SLIDE 32

Attention Weights

$$\sum_{j=1}^{m} \alpha_{i,j} h^\psi_j$$

◮ $\alpha_{i,j}$: importance of $x_j$ for predicting the $i$-th translation word
◮ Various options:

$$\beta_{i,j} = u^\top \tanh\big(W h^\phi_i + V h^\psi_j\big)$$
$$\beta_{i,j} = \big(h^\phi_i\big)^\top B h^\psi_j$$

Typically take a softmax to make them probabilities: $(\alpha_{i,1}, \ldots, \alpha_{i,m}) = \mathrm{softmax}(\beta_{i,1}, \ldots, \beta_{i,m})$
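Both scoring options, computed for one decoder state against all encoder states at once; dimensions and initializations are illustrative.

```python
import torch

dp, m = 64, 7
h_enc = torch.randn(m, dp)          # rows are h^psi_1 ... h^psi_m
h_dec = torch.randn(dp)             # h^phi_i
W, V, B = (0.1 * torch.randn(dp, dp) for _ in range(3))
u = torch.randn(dp)

# Option 1 (additive): beta_{i,j} = u^T tanh(W h^phi_i + V h^psi_j)
beta = torch.tanh(h_dec @ W.T + h_enc @ V.T) @ u        # shape (m,)

# Option 2 (bilinear): beta_{i,j} = (h^phi_i)^T B h^psi_j
beta_bilinear = h_enc @ (B.T @ h_dec)                   # shape (m,)

alpha = torch.softmax(beta, dim=0)  # attention weights, sum to 1
context = alpha @ h_enc             # sum_j alpha_{i,j} h^psi_j, fed to FF
```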

SLIDES 33-42

Encoder-Decoder with Attention: Slides by Sasha Rush (sequence of figures)

SLIDE 43

Input-Feeding Approach (Luong et al., 2015)

Feeds explicit alignment information back into the decoder. Very large computation graph; parallel computation across time steps during training is no longer possible.

SLIDE 44

Matrix Form of Input-Feeding Attention

Bank $X \in \mathbb{R}^{d \times T}$, hidden state $h_{t-1} \in \mathbb{R}^d$, current word $y_t \in V$ (worked out on the board)
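The slide works this out on the board, so the following is only one plausible rendering, assuming a bilinear score: with the bank in matrix form, all T attention scores come out of a single product.

```python
import torch

d, T = 64, 9
X = torch.randn(d, T)        # bank: one encoder state per column
h = torch.randn(d)           # decoder hidden state h_{t-1}
B = 0.1 * torch.randn(d, d)  # bilinear scoring parameters (an assumption here)

alpha = torch.softmax(X.T @ (B @ h), dim=0)   # all T scores in one matrix product
context = X @ alpha                           # weighted combination of the bank
```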

SLIDE 45

Greedy Decoding

Given sentence $x$: for $t = 1 \ldots T_{\max}$,

$$y_t \leftarrow \arg\max_{y \in V \cup \{\mathrm{STOP}\}} \log q(y \mid y_{<t}, x)$$

and stop if $y_t = \mathrm{STOP}$. Is this what we want? What we actually want is

$$y^* = \arg\max_{y \in V^+ :\, |y| \le T_{\max}} \log q(y \mid x)$$
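A sketch of the loop, assuming a hypothetical helper step_log_probs(prefix, x) that returns the model's log-distribution over V plus STOP for the next position:

```python
def greedy_decode(x, step_log_probs, T_max=50):
    """Pick the single best next word at each step; no lookahead."""
    prefix = []
    for _ in range(T_max):
        log_q = step_log_probs(prefix, x)   # dict: token -> log q(token | prefix, x)
        y = max(log_q, key=log_q.get)       # argmax over V union {STOP}
        if y == "STOP":
            break
        prefix.append(y)
    return prefix
```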

SLIDE 46

Beam Search: Idea

◮ Instead of enumerating $|V|^{T_{\max}}$ candidates, keep the $K$ (called beam size) highest scoring partial structures at every step. Only an approximation.

◮ Applicable to any decomposable score function
◮ Score function in seq2seq:

$$\mathrm{score}(y_1 \ldots y_t) = \log q(y_1 \ldots y_t \mid x) = \sum_{i=1}^{t} \log q(y_i \mid y_{<i}, x) = \mathrm{score}(y_1 \ldots y_{t-1}) + \log q(y_t \mid y_{<t}, x)$$

◮ Runtime: $O(|V| \, T_{\max} K^2 \log K)$
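A sketch with the same hypothetical step_log_probs helper as in the greedy sketch; finished hypotheses are set aside and, as the next slides discuss, compared under a length-normalized score.

```python
def beam_search(x, step_log_probs, K=5, T_max=50):
    beam = [(0.0, [])]                         # (log q(prefix | x), prefix)
    finished = []
    for _ in range(T_max):
        candidates = []
        for score, prefix in beam:
            log_q = step_log_probs(prefix, x)
            for y, lq in log_q.items():        # decomposable: just add log q(y | ...)
                if y == "STOP":
                    finished.append((score + lq, prefix))
                else:
                    candidates.append((score + lq, prefix + [y]))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:K]                  # keep the K best partial structures
    finished.extend(beam)                      # hypotheses that never emitted STOP
    # length-normalize so shorter hypotheses are not favored automatically
    best = max(finished, key=lambda c: c[0] / max(len(c[1]), 1))
    return best[1]
```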

SLIDES 47-50

Beam Search (sequence of figures; the last one shows backtracking to recover the best hypothesis)

SLIDE 51

Beam Search: More Details

◮ Different hypotheses may stop at different time steps (place them aside)
◮ Continue beam search until
  ◮ all $K$ hypotheses stop, or
  ◮ we hit the max length limit $T_{\max}$
◮ Select the top hypothesis using the normalized likelihood score

$$\frac{1}{M} \sum_{i=1}^{M} \log q(y_i \mid y_{<i}, x)$$

Otherwise hypotheses get higher scores just for being shorter.

SLIDE 52

Copy Mechanism

$$q(y_t \mid y_{<t}, x) = \sum_{z \in \{0,1\}} q(y_t, z \mid y_{<t}, x)$$

where the binary latent variable $z$ switches between generating $y_t$ from the target vocabulary and copying it from the source sentence.
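A sketch of the mixture this marginalization yields. Here p_z, p_gen, and p_copy are illustrative stand-ins; in See et al. (2017) they are all computed from the decoder state and attention weights.

```python
def copy_mixture(y, p_z, p_gen, p_copy):
    """q(y_t | y_<t, x) = p(z=1) p_gen(y_t) + p(z=0) p_copy(y_t)."""
    return p_z * p_gen.get(y, 0.0) + (1 - p_z) * p_copy.get(y, 0.0)

p_gen = {"the": 0.5, "dog": 0.3, "barked": 0.2}  # distribution over V_trg
p_copy = {"dog": 0.6, "barked": 0.4}             # attention mass over source words
print(copy_mixture("dog", p_z=0.7, p_gen=p_gen, p_copy=p_copy))  # 0.7*0.3 + 0.3*0.6
```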

SLIDE 53

Picture (Credit: See et al., 2017)
