Machine Learning 2
DS 4420 - Spring 2020
Transformers
Byron C. Wallace
Material in this lecture derived from materials created by Jay Alammar (http://jalammar.github.io/illustrated-transformer/)
Some housekeeping
★ I gave back 5 points to everyone (mean is now 75)
★ We are releasing an optional bonus assignment that covers the same content as Q2; you can use this to make up as much as half (12.5 points) on that question
Northeastern will move to holding all classes remotely in the coming days/weeks
This course will also have to be remote or recorded (we will figure it out!)
Transformers: neural networks that have come to dominate in NLP
[Figure: a BiLSTM runs over the word embeddings for "This movie so … terrible", producing hidden states $h_1, h_2, \ldots, h_{T-1}, h_T$; an attention module assigns weights $\alpha_1, \alpha_2, \ldots, \alpha_{T-1}, \alpha_T$ to these states, and the weighted combination is used to predict $\hat{y}$.]

The context vector is the attention-weighted sum of the BiLSTM hidden states:

$$c = \sum_{i=1}^{T} \alpha_i h_i$$
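To make this concrete, here is a minimal NumPy sketch (not from the slides; the random hidden states and the dot-product scoring vector are illustrative assumptions) of computing attention weights and the context vector $c$:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a vector of scores.
    e = np.exp(x - np.max(x))
    return e / e.sum()

T, d = 5, 8                      # sequence length, hidden size (illustrative)
rng = np.random.default_rng(0)
h = rng.normal(size=(T, d))      # stand-in for BiLSTM hidden states h_1..h_T
w = rng.normal(size=d)           # learned scoring vector (one common choice)

scores = h @ w                   # one unnormalized score per time step
alpha = softmax(scores)          # attention weights alpha_1..alpha_T, sum to 1
c = alpha @ h                    # context vector: c = sum_i alpha_i * h_i

assert np.isclose(alpha.sum(), 1.0)
print(alpha.round(3), c.shape)   # weights over tokens, and a (d,) vector
```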
source: http://jalammar.github.io/illustrated-transformer/
This one weird trick
[Figure: the projection matrices that produce queries, keys, and values are learned during training]
source: http://jalammar.github.io/illustrated-transformer/
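The slides above (see the source link) step through Alammar's illustration of self-attention: embeddings are projected by learned matrices into queries, keys, and values. A minimal sketch, assuming scaled dot-product self-attention as in "Attention Is All You Need" (the shapes and random matrices here are illustrative, not the deck's code):

```python
import numpy as np

def softmax(x, axis=-1):
    # Row-wise, numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T, d_model, d_k = 4, 16, 8          # illustrative sizes
rng = np.random.default_rng(1)
X = rng.normal(size=(T, d_model))   # token embeddings

# Learned projection matrices (random here, just for shape-checking).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
A = softmax(Q @ K.T / np.sqrt(d_k))  # (T, T): each token attends to all tokens
Z = A @ V                            # new representation for each token
print(A.shape, Z.shape)              # (4, 4) (4, 8)
```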
Demo: https://talktotransformer.com/ (interactive GPT-2 text generation)
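A hedged sketch of similar text generation using the Hugging Face transformers library (the library and the parameters below are assumptions for illustration, not part of the deck):

```python
from transformers import pipeline

# Download GPT-2 and wrap it in a text-generation pipeline.
generator = pipeline("text-generation", model="gpt2")

out = generator("Machine learning is", max_length=30, num_return_sequences=1)
print(out[0]["generated_text"])  # sampled continuation of the prompt
```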
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. Google AI Language. {jacobdevlin,mingweichang,kentonl,kristout}@google.com
A Primer in BERTology: What we know about how BERT works
Anna Rogers, Olga Kovaleva, Anna Rumshisky. Department of Computer Science, University of Massachusetts Lowell, Lowell, MA 01854. {arogers, okovalev, arum}@cs.uml.edu
[Figure (from the BERT paper): overall pre-training and fine-tuning procedure. Pre-training (left): the input is a pair of masked sentences, [CLS] Tok 1 … Tok N [SEP] Tok 1 … Tok M, embedded as E_[CLS], E_1, …, E_[SEP], E_1′, …, E_M′ and producing outputs C, T_1, …, T_[SEP], T_1′, …, T_M′; the training objectives are masked LM (Mask LM) and next sentence prediction (NSP) over unlabeled sentence A/B pairs. Fine-tuning (right): the same architecture is initialized from the pre-trained weights and fine-tuned on downstream tasks such as SQuAD (question/paragraph pairs with start/end span prediction), NER, and MNLI.]
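As a rough illustration of the fine-tuning side of the figure, here is a minimal sketch using the Hugging Face transformers library (not shown in the deck; the model name, label count, and example sentence are illustrative):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load pre-trained BERT plus a freshly initialized classification head.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# A single labeled example; in practice you fine-tune on a full dataset.
inputs = tokenizer("This movie is terrible", return_tensors="pt")
labels = torch.tensor([0])  # e.g., 0 = negative sentiment

outputs = model(**inputs, labels=labels)
print(outputs.loss, outputs.logits)  # cross-entropy loss and class scores
```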
Masked language modeling: from the sentence "The cat is very cute" we create the training pair X = "The [MASK] is very cute", y = "cat".
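A hedged sketch of how such (X, y) pairs can be constructed. BERT masks roughly 15% of tokens (the paper also sometimes keeps or replaces the token instead of using [MASK], a detail omitted here); the tokenization and names below are illustrative:

```python
import random

MASK, MASK_PROB = "[MASK]", 0.15

def make_mlm_example(tokens, rng):
    """Randomly mask tokens; return (masked tokens, {position: original token})."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < MASK_PROB:
            targets[i] = tok        # the model must predict this token
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets

tokens = "The cat is very cute".split()
x, y = make_mlm_example(tokens, random.Random(42))
print(x, y)  # which positions get masked depends on the random draws
```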
Positional encodings give the model a sense of "location" in the sequence.
http://nlp.seas.harvard.edu/2018/04/03/attention.html
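The Annotated Transformer linked above implements the sinusoidal positional encodings from "Attention Is All You Need". A minimal NumPy sketch of that scheme (the array sizes are illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2) even dims
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)               # odd dimensions get cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16); added to word embeddings before the first layer
```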