CS11-747 Neural Networks for NLP
Pre-trained Sentence and Contextualized Word Representations
Graham Neubig
Site https://phontron.com/class/nn4nlp2020/
(w/ slides by Antonis Anastasopoulos)
Remember: Neural Models
Example: "I hate this movie"
Word-level embedding/prediction: embed each word and make one prediction per word.
Sentence-level embedding/prediction: embed the whole sentence into a single representation and make one prediction.
Sentence representations feed a classifier, e.g. rating "I hate this movie" vs. "I love this movie" on a scale of very good / good / neutral / bad / very bad.
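A toy numpy sketch of the two regimes; the vocabulary, dimensions, and random (untrained) weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"i": 0, "hate": 1, "love": 2, "this": 3, "movie": 4}
emb = rng.normal(size=(len(vocab), 8))   # word embedding table (untrained)
W_word = rng.normal(size=(8, 3))         # per-word head, e.g. 3 tags
W_sent = rng.normal(size=(8, 5))         # sentence head: 5 sentiment classes

def word_level(tokens):
    """Word-level: embed each token, one prediction per token."""
    vecs = emb[[vocab[t] for t in tokens]]        # (T, 8)
    return (vecs @ W_word).argmax(axis=1)         # (T,) tag ids

def sentence_level(tokens):
    """Sentence-level: pool token embeddings, one prediction per sentence."""
    vecs = emb[[vocab[t] for t in tokens]]
    sent = vecs.mean(axis=0)                      # average pooling
    return int((sent @ W_sent).argmax())          # sentiment class id

tags = word_level(["i", "hate", "this", "movie"])       # four predictions
label = sentence_level(["i", "hate", "this", "movie"])  # one prediction
```

Average pooling is just the simplest choice of sentence encoder; the rest of the lecture is about better ways to produce that single vector.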
Paraphrase identification: do two sentences mean the same thing, using a loose sense of similarity?
  "Charles O. Prince, 53, was named as Mr. Weill's successor."
  "… named as his successor."
Textual entailment (Marelli et al. 2014): does the first sentence imply the second (or a contradiction, where the opposite is also true)? E.g. for "The woman bought a sandwich":
  → The woman bought lunch (entailment)
  → The woman did not buy a sandwich (contradiction)
  → The woman bought a sandwich for dinner (neutral)
Sentence pair classification: encode the two sentences ("this is an example", "this is another example") into representations, feed them to a classifier, and output yes/no.
How do we get such a representation?
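A minimal sketch of that pipeline, with an average-of-word-vectors stand-in for the sentence encoder and untrained random classifier weights (all names and dimensions are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = {w: i for i, w in enumerate("this is an example another".split())}
emb = rng.normal(size=(len(vocab), 16))

def encode(sentence):
    """Stand-in sentence encoder: average the word vectors."""
    return emb[[vocab[w] for w in sentence.split()]].mean(axis=0)

W = rng.normal(size=(32, 2))  # untrained yes/no classifier over [u; v]

def paraphrase_yes_no(s1, s2):
    u, v = encode(s1), encode(s2)
    logits = np.concatenate([u, v]) @ W   # two logits: no / yes
    return "yes" if logits[1] > logits[0] else "no"

ans = paraphrase_yes_no("this is an example", "this is another example")
```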
Remember, there are many varieties of learning with multiple tasks:
- Multi-task learning: train on multiple tasks, where we only really care about one of the tasks.
- Domain adaptation: where the output is the same, but we want to handle different topics or genres, etc.
NLP tasks require different varieties of data, and for some tasks (e.g. translation) we have many fewer data than raw text. Transfer can happen across:
- domains (e.g. web text → medical text)
- languages (e.g. English → Telugu)
- tasks (e.g. LM → parser)
Example of multi-task learning: sentence compression with eye-gaze prediction (Klerke et al. 2016).
Standard multi-task learning: a single shared encoder over the input ("this is an example") feeding both an LM head and a tagging head, trained jointly.
Pre-training: first train an encoder on one task (e.g. translation of "this is an example"), then use it to initialize the encoder of the downstream task (e.g. tagging) and continue training.
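The initialize-and-continue-training recipe can be sketched as parameter copying; the one-matrix "encoder" and the fake gradient updates are placeholders, not a real training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

class Encoder:
    """Toy shared encoder: a single projection matrix."""
    def __init__(self, dim=8):
        self.W = rng.normal(size=(dim, dim))
    def __call__(self, x):
        return np.tanh(x @ self.W)

# 1) Pre-train an encoder on a source task (e.g. translation);
#    a random perturbation stands in for the training updates.
src_encoder = Encoder()
src_encoder.W -= 0.01 * rng.normal(size=src_encoder.W.shape)  # fake update

# 2) Initialize the downstream (e.g. tagging) encoder from it...
tgt_encoder = Encoder()
tgt_encoder.W = src_encoder.W.copy()

# 3) ...then continue training on the downstream task.
tgt_encoder.W -= 0.01 * rng.normal(size=tgt_encoder.W.shape)  # fake update
```

The key design choice is step 2: copying (rather than sharing) the weights lets the downstream task move away from the pre-trained solution.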
Simple methods: pre-trained sentence or word representations (Dai et al. 2015, Melamud et al. 2016); more recently CoVe, ELMo, and BERT, distributed along with pre-trained models.
General framework: train a model on some pre-training objective over large data (the pre-training objective does not need to be the same as the downstream task!), then initialize the downstream model (e.g. a sentiment classifier over reviews) with the pre-trained weights and continue training.
GPT (Radford et al. 2018): pre-train a transformer LM. Downstream: some tasks need only fine-tuning, other tasks need additional multi-sentence training.
Sentence-level pre-training: train an encoder to construct the sentence representation, then for a downstream task (e.g. classifying reviews) initialize with the pre-trained weights and continue training.
Skip-thought vectors (Kiros et al. 2015): train an encoder-decoder to predict the surrounding sentences of each sentence in a large corpus.
Paraphrase-based pre-training: learn sentence embeddings from pairs labeled as paraphrases or not, e.g. from PPDB (paraphrase.org), created from bilingual data; useful for similarity, classification, etc.
Finding: complex encoders do better on in-domain data, but word averaging generalizes better.
ParaNMT (Wieting and Gimpel 2018): roughly 50M paraphrase pairs created by translating one side of bilingual data (about 30M are high quality).
InferSent (Conneau et al. 2017): supervised pre-training on natural language inference. Does training on entailment learn generalizable embeddings? Entailment requires understanding nuance → yes?, or the labeled data is much smaller → no?
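InferSent feeds its classifier a combination of the premise and hypothesis vectors; this sketch builds the commonly described feature vector [u; v; |u−v|; u*v], with random stand-ins for the encoder outputs:

```python
import numpy as np

def pair_features(u, v):
    """InferSent-style features for a premise/hypothesis pair:
    concatenation, absolute difference, and elementwise product."""
    return np.concatenate([u, v, np.abs(u - v), u * v])

rng = np.random.default_rng(0)
u, v = rng.normal(size=16), rng.normal(size=16)  # stand-in sentence vectors
feats = pair_features(u, v)                      # shape (64,)
```

The difference and product terms give the classifier direct access to similarity information it would otherwise have to learn from the raw concatenation.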
What if we want a representation for each word? Same pipeline as before ("this is an example", "this is another example" → classifier → yes/no), but built on contextualized word representations. How to train this representation?
context2vec (Melamud et al. 2016): run a bidirectional LSTM over the context and train it to predict the word given its context, on a large unlabeled corpus; the learned context representations are useful for sentence completion, word sense disambiguation, etc.
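A sketch of the context → word objective, with a simple average of context embeddings standing in for context2vec's bidirectional LSTM (vocabulary and weights are illustrative, untrained values):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["i", "hate", "love", "this", "movie"]
word_emb = rng.normal(size=(len(vocab), 8))  # target-word embeddings
ctx_emb = rng.normal(size=(len(vocab), 8))   # context-word embeddings

def predict_word(tokens, i):
    """Score every vocabulary word given the context around position i.
    (context2vec encodes the context with a biLSTM; we just average.)"""
    ctx_ids = [vocab.index(t) for j, t in enumerate(tokens) if j != i]
    c = ctx_emb[ctx_ids].mean(axis=0)        # context representation
    scores = word_emb @ c                    # dot-product scoring
    return vocab[int(scores.argmax())]

guess = predict_word(["i", "hate", "this", "movie"], 1)
```

Training would push the context vector c toward the embedding of the true word at position i; the context vector itself is the reusable representation.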
CoVe (McCann et al. 2017): pre-train a machine translation encoder and use its outputs as contextualized word vectors downstream, including over sentence pairs for classification.
ELMo (Peters et al. 2018): train two language models, predicting each word left→right and right→left independently; the downstream task learns a weighted combination of layers of the biLM.
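The layer combination can be sketched directly: softmax-normalized per-layer weights plus a scalar scale, which is how the ELMo mixture is usually described (the shapes below are toy values):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def elmo_mix(layer_states, s_logits, gamma):
    """Task-specific weighted sum of biLM layer representations.

    layer_states: (L, T, D) hidden states of L layers for T tokens
    s_logits:     (L,) learned per-layer logits, softmax-normalized
    gamma:        learned scalar scaling the whole mixture
    """
    s = softmax(s_logits)                          # (L,)
    return gamma * np.einsum("l,ltd->td", s, layer_states)

# toy usage: 3 layers, 4 tokens, 8 dims; zero logits = uniform weights
states = np.random.default_rng(0).normal(size=(3, 4, 8))
mixed = elmo_mix(states, np.zeros(3), gamma=1.0)   # mean over layers
```

With zero logits the mixture reduces to a plain average over layers; training the logits lets each task pick the layers it finds most useful.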
BERT (Devlin et al. 2018): a multi-layer self-attention (transformer) encoder trained with two objectives: masked word prediction, and next sentence prediction, i.e. classifying whether two sentences are "consecutive" or not.
RoBERTa (Liu et al. 2019): re-train BERT with better-tuned hyperparameters and more data, and drop the next sentence prediction objective.
ELECTRA (Clark et al. 2020): replace some input words with samples from a small generator model, and train the main model to discriminate which words are sampled.
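A sketch of constructing replaced-token-detection training examples; the uniform random substitution here is a stand-in for sampling from the small generator model:

```python
import random

def electra_example(tokens, vocab, replace_rate=0.15, seed=0):
    """Replaced-token detection: corrupt some positions with sampled
    words; the discriminator labels each token original (0) or replaced (1)."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() < replace_rate:
            # stand-in for a sample from the small generator LM
            corrupted[i] = rng.choice([w for w in vocab if w != tok])
            labels.append(1)
        else:
            labels.append(0)
    return corrupted, labels
```

Unlike masked LM, every position yields a training signal, which is one argument for the objective's sample efficiency.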
Which model/objective is better? Some observations from comparing models:
- Complex encoders do better in-domain, but word averaging is more robust out-of-domain.
- GPT uses a uni-directional transformer, but with no comparison to an LSTM like ELMo (for performance reasons?).
- In later work a bi-directional model similar to BERT is used and improvements are shown.
- Comparing objectives on similar data, a bi-directional LM seems better than an MT encoder.
- Discriminative objectives (e.g. ELECTRA) are probably better, but results preliminary.
- Gains also come from adding much more data from the Web, but not 100% consistent.