CS11-747 Neural Networks for NLP
Conditioned Generation
Graham Neubig
Site https://phontron.com/class/nn4nlp2017/
Language Models

Language models are generative models of text: s ~ P(x)
Text Credit: Max Deutsch (https://medium.com/deep-writing/)
“The Malfoys!” said Hermione. Harry was watching him. He looked like Madame Maxime. When she strode up the wrong staircase to visit himself. “I’m afraid I’ve definitely been suspended from power, no chance — indeed?” said Snape. He put his head back behind them and read groups as they crossed a corner and fluttered down onto their ink lamp, and picked up his spoon. The doorbell rang. It was a lot cleaner down in London.
Conditioned Language Models

Conditioned language models don't just generate text, they generate text according to some specification:

Input X            Output Y (Text)      Task
English            Japanese             Translation
Structured Data    NL Description       NL Generation
Document           Short Description    Summarization
Utterance          Response             Response Generation
Image              Text                 Image Captioning
Speech             Transcript           Speech Recognition
A language model calculates the probability of a sentence by predicting each next word given its context:

P(X) = \prod_{i=1}^{I} P(x_i \mid x_1, \ldots, x_{i-1})

A conditioned language model adds context X to each prediction (the added context!):

P(Y \mid X) = \prod_{j=1}^{J} P(y_j \mid X, y_1, \ldots, y_{j-1})
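As a minimal sketch (not from the slides), the chain-rule factorization can be scored directly in Python; `next_word_prob` below is a hypothetical stand-in for any model that returns P(next word | context):

import math

def sentence_log_prob(tokens, next_word_prob):
    """Score a sentence under P(X) = prod_i P(x_i | x_1 ... x_{i-1})."""
    context = ["<s>"]
    total = 0.0
    for tok in tokens + ["</s>"]:
        total += math.log(next_word_prob(context, tok))
        context.append(tok)
    return total

# Toy model: uniform over a 10,000-word vocabulary.
uniform = lambda context, tok: 1.0 / 10000
print(sentence_log_prob("i hate this movie".split(), uniform))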
(One Type of) Language Model
[Figure: an LSTM language model reads "<s> I hate this movie" one token at a time, predicting the next token ("I", "hate", "this", "movie", "</s>") at each step. At generation time, the argmax of each step's prediction is fed back in as the next input.]
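A minimal PyTorch sketch of this kind of LSTM language model (layer names and sizes are illustrative, not from the lecture):

import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, state=None):
        # token_ids: (batch, time); returns next-word logits at each step.
        h, state = self.lstm(self.embed(token_ids), state)
        return self.out(h), state

model = LSTMLanguageModel(vocab_size=10000)
tokens = torch.randint(0, 10000, (1, 5))  # stands in for "<s> I hate this movie"
logits, _ = model(tokens)                 # shape (1, 5, 10000)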
(One Type of) Conditional Language Model
(Sutskever et al. 2014)
[Figure: an encoder LSTM reads the source sentence "kono eiga ga kirai" and a decoder LSTM generates the target sentence "I hate this movie".]
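A minimal encoder-decoder sketch in PyTorch (illustrative code, not the authors'); the encoder's final state conditions the decoder, as in the figure:

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, trg_vocab, dim=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, dim)
        self.trg_embed = nn.Embedding(trg_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, trg_vocab)

    def forward(self, src_ids, trg_ids):
        # Encode the source; use its final state to initialize the decoder.
        _, state = self.encoder(self.src_embed(src_ids))
        h, _ = self.decoder(self.trg_embed(trg_ids), state)
        return self.out(h)  # next-word logits at each target position

model = Seq2Seq(src_vocab=8000, trg_vocab=10000)
logits = model(torch.randint(0, 8000, (1, 4)), torch.randint(0, 10000, (1, 5)))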
How do we pass the encoder's state to the decoder? Several options:
- Initialize the decoder with the encoder's final hidden state.
- Transform the encoder's state first, e.g. with a linear layer, so encoder and decoder can have different dimensions (sketched below).
- Input the encoder's state to the decoder at every time step.
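A sketch of the second option, transforming the encoder's final state before initializing a decoder of a different size (the `bridge` layer name is a hypothetical choice):

import torch
import torch.nn as nn

enc_dim, dec_dim = 128, 256
encoder = nn.LSTM(64, enc_dim, batch_first=True)
decoder = nn.LSTM(64, dec_dim, batch_first=True)
bridge = nn.Linear(enc_dim, dec_dim)  # maps encoder state into decoder space

src = torch.randn(1, 5, 64)               # embedded source sentence
_, (h, c) = encoder(src)                  # final encoder hidden/cell state
state = (torch.tanh(bridge(h)), torch.tanh(bridge(c)))
out, _ = decoder(torch.randn(1, 7, 64), state)  # embedded target prefix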
The Generation Problem

We have a model of P(Y | X); how do we use it to generate a sentence? Two methods:
- Sampling: try to generate a random sentence according to the probability distribution.
- Argmax: try to generate the sentence with the highest probability.
Ancestral sampling draws each word from the model's per-step distribution; it samples exactly from the model, with no further work needed. Greedy search instead picks the single highest-probability word at each step, which is fast but not guaranteed to find the highest-probability sentence. Beam search improves on this: instead of committing to one word per step, maintain several paths and keep only the best-scoring ones (a sketch follows below).
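A minimal beam-search sketch, assuming a hypothetical `step(prefix)` function that returns log-probabilities for the next token given the prefix:

import math

def beam_search(step, beam_size=5, max_len=20, bos=0, eos=1):
    """step(prefix) -> {token_id: log_prob} for the next token."""
    beams = [([bos], 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            # Paths ending in </s> are complete; keep the rest active.
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

With beam_size=1 this reduces to greedy search.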
Ensembling

[Figure: two models (LSTM1 and LSTM2) each read "<s>" and make separate predictions (predict1, predict2), which are combined into a single ensembled prediction.]
Linear interpolation: weight each model's probability by the probability of that model:

P(y_j \mid X, y_1, \ldots, y_{j-1}) = \sum_{m=1}^{M} P_m(y_j \mid X, y_1, \ldots, y_{j-1}) \, P(m \mid X, y_1, \ldots, y_{j-1})

where P_m(\cdot) is the probability according to model m and P(m \mid \cdot) is the probability of model m.

Log-linear interpolation: weight each model's log probability by an interpolation coefficient, then normalize:

P(y_j \mid X, y_1, \ldots, y_{j-1}) = \mathrm{softmax}\left( \sum_{m=1}^{M} \lambda_m(X, y_1, \ldots, y_{j-1}) \log P_m(y_j \mid X, y_1, \ldots, y_{j-1}) \right)

where \lambda_m is the interpolation coefficient for model m, \log P_m is model m's log probability, and the softmax normalizes the combined scores.
In logical terms, linear interpolation acts like an OR: the combined model likes any output to which even one model assigns high probability. Log-linear interpolation acts like an AND: the combined model only likes outputs to which all models assign reasonably high probability.
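A small numeric sketch of the two combination methods, with two toy next-word distributions over a five-word vocabulary:

import numpy as np

p1 = np.array([0.70, 0.10, 0.10, 0.05, 0.05])  # model 1
p2 = np.array([0.30, 0.40, 0.10, 0.10, 0.10])  # model 2

# Linear interpolation ("OR"): weight each model's probabilities.
weights = np.array([0.5, 0.5])  # P(m | context)
p_linear = weights[0] * p1 + weights[1] * p2

# Log-linear interpolation ("AND"): weight log-probabilities, then normalize.
lam = np.array([0.5, 0.5])  # interpolation coefficients
logits = lam[0] * np.log(p1) + lam[1] * np.log(p2)
p_loglinear = np.exp(logits) / np.exp(logits).sum()

print(p_linear)      # combined distribution (linear)
print(p_loglinear)   # combined distribution (log-linear)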
Parameter Averaging

Problem: ensembling means running M models at test time, increasing our time/memory complexity. Parameter averaging is a cheap way to get some of the good effects of ensembling: write out the model several times near the end of training, and take the average of the parameters. Note that this only works for checkpoints within the same run.
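A sketch of parameter averaging over checkpoints from the same run (PyTorch, with hypothetical file names):

import torch

# Checkpoints written out several times near the end of training.
paths = ["model_epoch8.pt", "model_epoch9.pt", "model_epoch10.pt"]
states = [torch.load(p) for p in paths]  # each is a state_dict

# Average every parameter tensor across the checkpoints.
avg = {k: sum(s[k] for s in states) / len(states) for k in states[0]}
torch.save(avg, "model_averaged.pt")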
Ensemble Distillation (Kim et al. 2016)

For models from different runs, we can instead train a single model to copy an ensemble: the student is trained to match the ensemble's distribution over predicted words, not just the reference. The goal is to get the accuracy benefits of an ensemble from a single model at test time.
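A sketch of the distillation objective: cross-entropy against the teacher ensemble's full per-step distribution rather than a one-hot reference (the tensors below are illustrative placeholders):

import torch
import torch.nn.functional as F

vocab = 10000
student_logits = torch.randn(1, 7, vocab, requires_grad=True)
teacher_probs = torch.softmax(torch.randn(1, 7, vocab), dim=-1)  # ensemble output

# Match the teacher's distribution at every target position.
loss = -(teacher_probs * F.log_softmax(student_logits, dim=-1)).sum(-1).mean()
loss.backward()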
Stacking

What if we have two very different models, where prediction of outputs is done in very different ways? For example, a word-based translation model and a character-by-character translation model. Stacking uses the output of one system in calculating features for another system.
Evaluation

How do we evaluate generated output?
- BLEU: n-gram overlap with a reference. Easy to use, but often doesn't match human evaluation and is bad for comparing very different systems.
- METEOR: like BLEU in overall principle, with many other tricks: it considers paraphrases, reordering, and the function word/content word difference. It generally matches human judgments better, but requires extra resources (although these can be made automatically) and is more complicated.
- Perplexity: calculate the perplexity of the words in the held-out set without doing generation (see the sketch below). This avoids the cost of decoding, but doesn't test the model's behavior when actually generating output.
A fundamental difficulty for all of these is ambiguity: there are usually many correct outputs for any given input.
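A sketch of the perplexity computation, given per-token log-probabilities of a held-out sentence under any of the models above:

import math

def perplexity(token_log_probs):
    """exp of the negative average per-token log-probability."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

print(perplexity([-2.3, -1.1, -0.7, -3.0]))  # roughly 5.9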