Machine Learning 2, DS 4420 - Spring 2020: Transformers (Byron C. Wallace)



slide-1
SLIDE 1

Machine Learning 2

DS 4420 - Spring 2020

Transformers

Byron C. Wallace

Material in this lecture derived from materials created by Jay Alammar (http://jalammar.github.io/illustrated-transformer/)

slide-2
SLIDE 2

Some housekeeping

  • First, let’s talk midterm…
  • Mean: 70 (ranging from the 30s to the high 90s)
  • I miscalibrated Q2 (average: 56%)

★ I gave back 5 points to everyone (mean now 75)
★ We are releasing an optional bonus assignment that covers the same content as Q2 — you can use this to recover up to half (12.5) of the points on said question

slide-6
SLIDE 6

Some housekeeping

  • First, let’s talk midterm…
  • Mean: 70 (ranging from the 30s to the high 90s)
  • I miscalibrated Q2 (average: 56%)

★ I gave back 5 points to everyone (mean now 75)
★ We are releasing an optional bonus assignment that covers the same content as Q2 — you can use this to recover up to half (12.5) of the points on said question. This will be released tonight; the due date is flexible.
slide-7
SLIDE 7

HW 4

  • HW 4 will be released soon; due 3/24 (Tuesday)
slide-8
SLIDE 8

Projects!

  • THURSDAY 3/13 Project proposal is due!
  • TUESDAY 3/17 Project pitches in class!
slide-9
SLIDE 9

A remote possibility

  • There is an (increasingly) non-zero chance that Northeastern will move to holding all classes remotely in the coming days/weeks
  • In this case: remote/recorded lectures; on-demand office hours, held remotely; project presentations (+ pitches) will also have to be remote or recorded (will figure out!)

  • Keep an eye on Piazza for more updates
slide-10
SLIDE 10

Today

  • Will introduce transformer networks, a type of neural network that has come to dominate in NLP

  • To get there, will first review RNNs briefly
slide-12
SLIDE 12

RNNs

  • Review [on board]
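For reference, the recurrence reviewed on the board can be sketched in a few lines of numpy. This is a generic vanilla-RNN forward pass, not the course's notebook code; all weight names are illustrative:

```python
import numpy as np

def rnn_forward(X, Wx, Wh, b):
    """Vanilla RNN: h_t = tanh(W_x x_t + W_h h_{t-1} + b).

    X: (T, d_in) input sequence; returns all hidden states as (T, d_h).
    """
    T = X.shape[0]
    d_h = Wh.shape[0]
    h = np.zeros(d_h)            # h_0 initialized to zeros
    H = np.zeros((T, d_h))
    for t in range(T):           # the sequential loop transformers will remove
        h = np.tanh(X[t] @ Wx + h @ Wh + b)
        H[t] = h
    return H

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 3, 4
X = rng.standard_normal((T, d_in))
H = rnn_forward(X,
                rng.standard_normal((d_in, d_h)),
                rng.standard_normal((d_h, d_h)),
                np.zeros(d_h))
print(H.shape)  # (5, 4)
```

Note the explicit loop over time steps: each h_t depends on h_{t-1}, which is exactly the sequential dependency that motivates dropping recurrence in the transformer slides below.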
slide-13
SLIDE 13

Transformers

  • Hey, maybe we can get rid of recurrence!
slide-14
SLIDE 14

Attention mechanisms

slide-15
SLIDE 15

This movie so … terrible

[Figure (repeated across several build slides): a BiLSTM runs over the word embeddings for “This movie so … terrible”, producing hidden states h1, h2, …, hT-1, hT. An attention layer assigns weights α1, α2, …, αT-1, αT to these states; the weighted sum c = Σ_{i=1}^{T} α_i h_i is the context vector passed to the output layer to produce the prediction ŷ.]
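The attention-pooling step in the figure (c = Σ α_i h_i) can be sketched in numpy. The scoring vector v here is an illustrative stand-in for whatever learned scoring function produces the α's:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_pool(H, v):
    """Collapse T hidden states into one context vector c = sum_i alpha_i h_i.

    H: (T, d) matrix of BiLSTM hidden states h_1..h_T
    v: (d,)  scoring vector (stand-in for a learned scoring function)
    """
    scores = H @ v            # (T,) unnormalized attention scores
    alpha = softmax(scores)   # (T,) weights; nonnegative, sum to 1
    c = alpha @ H             # (d,) weighted sum of hidden states
    return c, alpha

rng = np.random.default_rng(0)
T, d = 5, 8
H = rng.standard_normal((T, d))
v = rng.standard_normal(d)
c, alpha = attention_pool(H, v)
print(alpha)  # larger weights mark the states the pooled vector emphasizes
```

The α's are a probability distribution over positions, which is why inspecting them gives the (rough) interpretability shown on the slide: high-weight positions contribute most to c.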

slide-23
SLIDE 23

Transformer block

Source: http://jalammar.github.io/illustrated-transformer/

slide-24
SLIDE 24


First, embed

slide-25
SLIDE 25


Then transform

slide-26
SLIDE 26


What is “self-attention”?


slide-29
SLIDE 29


This one weird trick

slide-31
SLIDE 31

In matrices

[Figure: the stacked input embeddings X are multiplied by learned weight matrices W^Q, W^K, W^V to produce the query, key, and value matrices Q, K, V]

slide-32
SLIDE 32

In matrices

[Figure: the full computation in matrix form: Z = softmax(Q Kᵀ / √d_k) V]

slide-33
SLIDE 33

Let’s implement… [notebook TODOs 1 & 2]
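As a rough guide to what the notebook TODOs involve (this is not the notebook's actual code), single-head scaled dot-product self-attention can be sketched in numpy; Wq, Wk, Wv stand in for the learned projection matrices:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one sequence.

    X:  (T, d_model) input embeddings
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices
    Returns (T, d_k): each row mixes the values of all positions.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (T, T) pairwise compatibilities
    A = softmax(scores)              # each row is a distribution over positions
    return A @ V                     # softmax(QK^T / sqrt(d_k)) V

rng = np.random.default_rng(0)
T, d_model, d_k = 4, 16, 8
X = rng.standard_normal((T, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
Z = self_attention(X, Wq, Wk, Wv)
print(Z.shape)  # (4, 8)
```

Unlike the RNN loop, every output row here is computed from all positions at once via matrix multiplies; the √d_k scaling keeps the dot products from saturating the softmax.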

slide-34
SLIDE 34

OK, but what is it used for?

slide-35
SLIDE 35

Translation



slide-37
SLIDE 37

Language modeling

https://talktotransformer.com/

slide-38
SLIDE 38

BERT

slide-39
SLIDE 39

BERT

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (Google AI Language) {jacobdevlin,mingweichang,kentonl,kristout}@google.com

slide-40
SLIDE 40

Pre-train (self-supervise) then fine-tune: A winning combo

slide-41
SLIDE 41

A Primer in BERTology: What we know about how BERT works

Anna Rogers, Olga Kovaleva, Anna Rumshisky Department of Computer Science, University of Massachusetts Lowell Lowell, MA 01854 {arogers, okovalev, arum}@cs.uml.edu

This is a thing now

slide-42
SLIDE 42

[Figure: BERT pre-training and fine-tuning share the same architecture. Pre-training: masked sentence pairs (Masked Sentence A, Masked Sentence B) are fed through BERT, trained with Masked LM and Next Sentence Prediction (NSP) objectives on unlabeled sentence pairs. Fine-tuning: the pre-trained weights initialize the same model, which is then trained on downstream tasks such as SQuAD question answering (predicting a start/end answer span over a question + paragraph pair), NER, and MNLI.]
slide-43
SLIDE 43


Self-Supervise an Encoder

slide-44
SLIDE 44

Self-Supervise an Encoder

The cat is very cute

slide-45
SLIDE 45

Self-Supervise an Encoder

Original: The cat is very cute
X: The [MASK] is very cute    y: cat

slide-46
SLIDE 46

Let’s implement … [notebook TODO 3]
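As a rough sketch of the data side of this TODO (not the notebook's code), masked-LM training pairs can be built by randomly replacing tokens with [MASK] and remembering the originals as targets. Real BERT masking is slightly fancier: it also sometimes keeps or randomly replaces the selected tokens.

```python
import random

def make_mlm_example(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Turn a token list into a (masked_input, targets) training pair.

    targets maps each masked position to the original token; the model
    is trained to recover those tokens from the masked input.
    """
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = mask_token
            targets[i] = tok
    return masked, targets

tokens = "The cat is very cute".split()
masked, targets = make_mlm_example(tokens, mask_prob=0.4, seed=3)
print(masked, targets)
```

Because the targets come from the text itself, no labels are needed: this is the "self-supervision" that makes pre-training on huge unlabeled corpora possible.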

slide-47
SLIDE 47

BERT details we did not consider

  • BERT actually uses word-pieces rather than entire words
  • Also uses “positional” embeddings in the inputs to give a sense of “location” in the sequence
  • Multiple self-attention “heads”
  • Deeper (12+ layers)
slide-50
SLIDE 50

BERT details we did not consider

  • BERT actually uses word-pieces rather than entire words
  • Also uses “positional” embeddings in the inputs to give a sense of “location” in the sequence
  • Multiple self-attention “heads”
  • Deeper (12+ layers)
  • Residual + layer norms (prevents explosions/NaNs)
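Putting a few of these details together: a minimal single-head encoder block in numpy, showing the residual + layer-norm wrapping, with positional embeddings added to the inputs up front. All names and shapes are illustrative (real BERT uses multiple heads, learned positional embeddings, and learned layer-norm scale/shift parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each position's feature vector to zero mean / unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_block(X, Wq, Wk, Wv, W1, W2):
    """One simplified (single-head) transformer encoder layer:
    self-attention and a feed-forward net, each wrapped in a
    residual connection plus layer norm."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    X = layer_norm(X + A @ V)          # residual + norm around attention
    ff = np.maximum(0, X @ W1) @ W2    # two-layer ReLU feed-forward net
    return layer_norm(X + ff)          # residual + norm around the FFN

rng = np.random.default_rng(0)
T, d = 6, 16
X = rng.standard_normal((T, d))
pos = rng.standard_normal((T, d)) * 0.1   # positional embeddings (stand-in)
params = [rng.standard_normal((d, d)) * 0.1 for _ in range(5)]
out = encoder_block(X + pos, *params)
print(out.shape)  # (6, 16)
```

The residual paths let gradients skip around each sub-layer, and the layer norms keep activations in a stable range, which is what prevents the explosions/NaNs mentioned on the slide when stacking 12+ of these blocks.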
slide-51
SLIDE 51

For a more detailed implementation …

  • See Sasha Rush’s excellent “annotated transformer”:

http://nlp.seas.harvard.edu/2018/04/03/attention.html