CS 533: Natural Language Processing
Conditional Neural Language Models
Karl Stratos
Rutgers University
Karl Stratos CS 533: Natural Language Processing 1/53
Language Models Considered So Far

◮ Classical trigram models: q_{Y|X}(y | x_99, x_100)
  ◮ Training: closed-form solution
◮ Log-linear models: softmax_y([w^⊤ φ((x_99, x_100), y′)]_{y′})
  ◮ Training: gradient descent on a convex loss
◮ Neural models
  ◮ Feedforward: softmax_y(FF([E x_99, E x_100]))
  ◮ Recurrent: softmax_y(FF(h(x_{1:99}), E x_100))
  ◮ Training: gradient descent on a nonconvex loss
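The feedforward model above can be sketched in a few lines of numpy. The vocabulary size, dimensions, and random initialization below are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: vocabulary V, embedding dim d, hidden dim d_hid.
V, d, d_hid = 10, 4, 8

E = rng.normal(size=(V, d))           # word embedding matrix
W1 = rng.normal(size=(d_hid, 2 * d))  # feedforward layer
W2 = rng.normal(size=(V, d_hid))      # output layer

def softmax(z):
    z = z - z.max()                   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def ff_trigram_lm(x_prev2, x_prev1):
    """p(y | x_99, x_100): concatenate the two context embeddings,
    apply a feedforward layer, then softmax over the vocabulary."""
    ctx = np.concatenate([E[x_prev2], E[x_prev1]])
    h = np.tanh(W1 @ ctx)
    return softmax(W2 @ h)

p = ff_trigram_lm(3, 7)               # distribution over the next word
```

The result is a proper distribution over the vocabulary, trainable by gradient descent on the (nonconvex) cross-entropy loss.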
◮ Machine translation
And the programme has been implemented ⇒ Le programme a été mis en application
◮ Summarization
russian defense minister ivanov called sunday for the creation of a joint front for combating global terrorism ⇒ russia calls for joint front against terrorism
◮ Data-to-text generation
(Wiseman et al., 2017)
◮ Image captioning
(image) ⇒ the dog saw the cat
In all cases, we learn parameters θ by maximizing the conditional log-likelihood:

  max_θ  E_{(input, output) ∼ p_{XY}} [ log p^θ_{Y|X}(output | input) ]
◮ Goal: Translate text from one language to another.
◮ One of the oldest problems in artificial intelligence.
◮ Early ’90s: Rise of statistical MT (SMT)
◮ Exploit parallel text:
  And the programme has been implemented
  Le programme a été mis en application
◮ Infer word alignment (“IBM” models, Brown et al., 1993)
◮ Really complicated, prone to error propagation
◮ Replaced the entire pipeline with a single model
◮ Called “end-to-end” training/prediction

Input: Le programme a été mis en application
Output: And the programme has been implemented
◮ Revolution in MT
◮ Better performance, way simpler system
◮ A hallmark of the recent neural domination in NLP
◮ Key: attention mechanism
◮ Always think of an RNN as a mapping φ : R^d × R^{d′} → R^{d′}
  Input: an input vector x ∈ R^d and a state vector h ∈ R^{d′}
  Output: a new state vector h′ ∈ R^{d′}
◮ Left-to-right RNN processes an input sequence x_1 … x_m ∈ R^d as

  h_i = φ(x_i, h_{i−1})   for i = 1 … m, with h_0 = 0

◮ Idea: h_i is a representation of x_i that has incorporated all of the preceding inputs x_1 … x_{i−1}
◮ Parameters U ∈ R^{d′×d} and V ∈ R^{d′×d′}
◮ Simple RNN update: φ(x, h) = tanh(Ux + V h)
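A minimal numpy sketch of the simple RNN, using the tanh update with parameters U and V (the dimensions and initialization scale are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_prime = 4, 6                        # input dim d, state dim d'

U = rng.normal(size=(d_prime, d)) * 0.1
V = rng.normal(size=(d_prime, d_prime)) * 0.1

def rnn_step(x, h):
    """One application of phi: combine input x with previous state h."""
    return np.tanh(U @ x + V @ h)

def encode(xs):
    """Left-to-right pass over x_1 ... x_m, starting from h_0 = 0."""
    h = np.zeros(d_prime)
    states = []
    for x in xs:
        h = rnn_step(x, h)
        states.append(h)
    return states                        # states[i] = h_{i+1}

xs = [rng.normal(size=d) for _ in range(5)]
hs = encode(xs)
```

Each state h_i depends on all inputs seen so far, which is exactly what makes the final state usable as a summary of the whole sequence.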
◮ Parameters U^(1) … U^(L) ∈ R^{d′×d} and V^(1) … V^(L) ∈ R^{d′×d′}
◮ Layer 1: h^(1)_i = tanh(U^(1) x_i + V^(1) h^(1)_{i−1})
◮ Layers l = 2 … L: h^(l)_i = tanh(U^(l) h^(l−1)_i + V^(l) h^(l)_{i−1})
◮ The top-layer states h^(L)_1 … h^(L)_m are the output of the stacked RNN.
◮ Parameters U^q, U^c, U^o ∈ R^{d′×d} and V^q, V^c, V^o, W^q, W^o ∈ R^{d′×d′}
◮ Idea: “memory cells” c_i can carry long-range information.
◮ What happens if q_i is close to zero?
◮ Can be stacked as in the simple RNN.
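The slides do not spell out the gating equations, so the following numpy sketch implements a standard LSTM-style cell consistent with the parameter shapes above; the exact form of the gate q (and where W^q, W^o attach) is an assumption for illustration, not the slides' definition:

```python
import numpy as np

rng = np.random.default_rng(0)
d, dp = 4, 6
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

# Parameter shapes match the slide: U* in R^{d'xd}, V*, W* in R^{d'xd'}.
Uq, Uc, Uo = (rng.normal(size=(dp, d)) * 0.1 for _ in range(3))
Vq, Vc, Vo = (rng.normal(size=(dp, dp)) * 0.1 for _ in range(3))
Wq, Wo = (rng.normal(size=(dp, dp)) * 0.1 for _ in range(2))

def gated_step(x, h, c):
    # Gate q in (0,1)^{d'} decides how much of the old cell c survives
    # (assumed gating form, in the spirit of an LSTM forget gate).
    q = sig(Uq @ x + Vq @ h + Wq @ c)
    c_new = q * c + (1 - q) * np.tanh(Uc @ x + Vc @ h)
    h_new = sig(Uo @ x + Vo @ h + Wo @ c_new) * np.tanh(c_new)
    return h_new, c_new

# If q_i is close to zero, the old memory is overwritten by the new
# candidate; if q_i is close to one, c is carried through unchanged.
h, c = np.zeros(dp), np.zeros(dp)
for x in rng.normal(size=(5, d)):
    h, c = gated_step(x, h, c)
```

Because c_new mixes the old cell additively, gradients can flow through many steps without vanishing as quickly as in the simple tanh RNN.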
◮ Vocabulary of the source language V^src
◮ BLEU: Controversial but popular scheme to automatically evaluate translation quality
◮ T: human-translated reference sentences
◮ p_n: precision of n-grams in the system translation that also appear in T
◮ Final score: a brevity penalty times the geometric mean (p_1 p_2 p_3 p_4)^{1/4}
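A single-reference sketch of the standard BLEU formula: clipped n-gram precisions p_1..p_4, their geometric mean, and the brevity penalty (tokenization and multi-reference handling in real implementations are more involved):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(cand, ref, n):
    """Clipped n-gram precision p_n of candidate against one reference."""
    c, r = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
    overlap = sum(min(cnt, r[g]) for g, cnt in c.items())
    return overlap / max(sum(c.values()), 1)

def bleu(cand, ref, max_n=4):
    """Brevity penalty times the geometric mean of p_1 ... p_4."""
    ps = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(ps) == 0:
        return 0.0
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in ps) / max_n)

ref = "russia calls for joint front against terrorism".split()
```

On the summarization example from earlier in the deck, a perfect match scores 1.0, and any truncated or reordered candidate scores strictly less.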
p(the dog barked | 개가 짖었다)
  > p(the cat barked | 개가 짖었다)
  > p(dog the barked | 개가 짖었다)
  > p(oqc shgwqw#w 1g0 | 개가 짖었다)
◮ Vector e_x ∈ R^d for every x ∈ V^src
◮ Vector e_y ∈ R^d for every y ∈ V^trg ∪ {*}
◮ Encoder RNN ψ : R^d × R^{d′} → R^{d′} for V^src
◮ Decoder RNN φ : R^d × R^{d′} → R^{d′} for V^trg
◮ Feedforward f : R^{d′} → R^{|V^trg|+1}
Encoder: compute h^ψ_j = ψ(e_{x_j}, h^ψ_{j−1}) for j = 1 … m, with h^ψ_0 = 0.

Decoder: set h^φ_0 = h^ψ_m and y_0 = *. For i = 1 … n,

  h^φ_i = φ(e_{y_{i−1}}, h^φ_{i−1})
  p(y_i | x_{1:m}, y_{1:i−1}) = softmax_{y_i}(f(h^φ_i))

so that p(y_{1:n} | x_{1:m}) = ∏_{i=1}^n p(y_i | x_{1:m}, y_{1:i−1}).
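Greedy decoding with this encoder-decoder can be sketched in numpy. The vocabulary sizes and parameters below are toy, randomly initialized assumptions (the model is untrained, so the output is meaningless but mechanically follows the equations above):

```python
import numpy as np

rng = np.random.default_rng(0)
d, dp = 4, 6                             # embedding and state dimensions
V_src, V_trg = 8, 5                      # toy vocabulary sizes
STOP = V_trg                             # extra id reused for both * and STOP

E_src = rng.normal(size=(V_src, d)) * 0.5
E_trg = rng.normal(size=(V_trg + 1, d)) * 0.5

# Encoder psi and decoder phi as simple tanh RNNs; f as a linear output layer.
U_e, U_d = (rng.normal(size=(dp, d)) * 0.1 for _ in range(2))
V_e, V_d = (rng.normal(size=(dp, dp)) * 0.1 for _ in range(2))
W_out = rng.normal(size=(V_trg + 1, dp)) * 0.1

def greedy_translate(src_ids, max_len=10):
    h = np.zeros(dp)
    for i in src_ids:                    # h^psi_j = psi(e_{x_j}, h^psi_{j-1})
        h = np.tanh(U_e @ E_src[i] + V_e @ h)
    y, out = STOP, []                    # h^phi_0 = h^psi_m, y_0 = *
    for _ in range(max_len):
        h = np.tanh(U_d @ E_trg[y] + V_d @ h)
        y = int(np.argmax(W_out @ h))    # greedy argmax of softmax(f(h^phi_i))
        if y == STOP:
            break
        out.append(y)
    return out

trans = greedy_translate([1, 2, 3])
```

Note that the decoder feeds its own previous prediction back in as the next input, which is exactly what makes decoding a search problem (addressed later by beam search).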
[Figure walkthrough (slides 21–24): the encoder RNN reads the source sentence “This cat is cute” one word embedding at a time, producing states h_1 … h_4; the final state h^enc encodes the whole sentence.]

[Figure walkthrough (slides 25–28): initialized with h^enc, the decoder RNN emits the translation “ce chat est mignon” word by word, feeding each predicted word back in as the next input.]
Training: given N source-target pairs (x^(k), y^(k)), choose parameters Θ by maximizing the conditional log-likelihood

  max_Θ Σ_{k=1}^N log p_Θ(y^(k) | x^(k))
◮ Instead of using 1 fixed vector to encode all of x_1 … x_m, recompute a context vector at every decoding step.
◮ For i = 0, 1, …, compute weights α_{i,j} over source positions and the context c_i = Σ_{j=1}^m α_{i,j} h^ψ_j.
◮ α_{i,j}: Importance of x_j for predicting the i-th translation word
◮ Various options for the unnormalized score s_{i,j}:
  ◮ Additive: w^⊤ tanh(U h^φ_i + V h^ψ_j)
  ◮ Dot product: ⟨h^φ_i, h^ψ_j⟩
◮ α_{i,j} = softmax_j(s_{i,j})
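One attention step in numpy, using the dot-product score for simplicity (the encoder states and decoder state below are random stand-ins; a real model would produce them with the RNNs defined earlier):

```python
import numpy as np

rng = np.random.default_rng(0)
dp, m = 6, 4                           # state dimension, source length

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Encoder states h^psi_1 ... h^psi_m (rows) and one decoder state h^phi_i.
H_enc = rng.normal(size=(m, dp))
h_dec = rng.normal(size=dp)

# Dot-product scores s_{i,j} = <h^phi_i, h^psi_j>; the additive score
# w^T tanh(U h^phi_i + V h^psi_j) would plug in the same way.
scores = H_enc @ h_dec
alpha = softmax(scores)                # attention weights over source positions
context = alpha @ H_enc                # c_i = sum_j alpha_{i,j} h^psi_j
```

The context c_i is then typically concatenated with h^φ_i before the output layer, so each prediction can focus on different source words.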
◮ Greedy decoding: at each step, pick argmax_{y ∈ V ∪ {STOP}} of the next-word distribution
◮ Exact decoding: argmax_{y ∈ V^+ : |y| ≤ T_max} p(y | x), which requires searching over exponentially many candidates
◮ Instead of enumerating |V|^{T_max} candidates, keep K (called the beam size) best partial hypotheses at each step
◮ Applicable to any decomposable score function
◮ Score function in seq2seq: score(y_{1:t}) = Σ_{t′=1}^t log p(y_{t′} | x, y_{1:t′−1})
◮ Runtime: O(|V| T_max K^2 log K)
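A self-contained beam search sketch. The `step_logprobs` function is a hypothetical stand-in for the seq2seq decoder (a deterministic toy distribution); everything else follows the algorithm above, including the length-normalized final selection discussed on the next slide:

```python
import numpy as np

V, T_max, K = 5, 6, 3                  # vocab size, max length, beam size
STOP = V

def step_logprobs(prefix):
    """Stand-in for log p(y_t | y_{<t}, x) over V words plus STOP.
    (Toy distribution keyed on the prefix; a real model runs the decoder.)"""
    g = np.random.default_rng(hash(tuple(prefix)) % (2**32))
    return np.log(g.dirichlet(np.ones(V + 1)))

def beam_search():
    beams = [([], 0.0)]                # (hypothesis, cumulative log prob)
    done = []
    for _ in range(T_max):
        cands = []
        for hyp, s in beams:           # expand every hypothesis by every word
            lp = step_logprobs(hyp)
            for y in range(V + 1):
                cands.append((hyp + [y], s + lp[y]))
        cands.sort(key=lambda c: c[1], reverse=True)
        beams = []                     # keep K best; set finished ones aside
        for hyp, s in cands:
            if hyp[-1] == STOP:
                done.append((hyp[:-1], s))
            elif len(beams) < K:
                beams.append((hyp, s))
    done += beams                      # hypotheses still alive at T_max
    # Length-normalized score avoids unfairly favoring short hypotheses.
    return max(done, key=lambda c: c[1] / max(len(c[0]), 1))[0]

best = beam_search()
```

With K = 1 this reduces to greedy decoding; larger K trades runtime for a better approximation of exact decoding.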
◮ Different hypotheses may stop at different time steps (place finished hypotheses aside and keep expanding the rest)
◮ Continue beam search until
  ◮ All K hypotheses stop, or
  ◮ We hit the max length limit T
◮ Select the top hypothesis using the normalized likelihood score

  (1/M) Σ_{t=1}^M log p(y_t | x, y_{1:t−1}), where M is the hypothesis length