Machine Learning 2, DS 4420 - Spring 2020: Transformers (Byron C. Wallace)



slide-1
SLIDE 1

Machine Learning 2

DS 4420 - Spring 2020

Transformers

Byron C. Wallace

Material in this lecture derived from materials created by Jay Alammar (http://jalammar.github.io/illustrated-transformer/)

slide-2
SLIDE 2

Some housekeeping

  • First, let’s talk midterm…
  • Mean: 70 (ranging from the 30s to the high 90s)
  • I miscalibrated Q2 (average: 56%)

★ I gave back 5 points to everyone (mean now 75)
★ We are releasing an optional bonus assignment that covers the same content as Q2 — you can use this to recover up to half (12.5) of the points on said question

slide-6
SLIDE 6

Some housekeeping

  • First, let’s talk midterm…
  • Mean: 70 (ranging from the 30s to the high 90s)
  • I miscalibrated Q2 (average: 56%)

★ I gave back 5 points to everyone (mean now 75)
★ We are releasing an optional bonus assignment that covers the same content as Q2 — you can use this to recover up to half (12.5) of the points on said question. This will be released tonight; the due date is flexible.
slide-7
SLIDE 7

HW 4

  • HW 4 will be released soon; due 3/24 (Tuesday)
slide-8
SLIDE 8

Projects!

  • THURSDAY 3/13 Project proposal is due!
  • TUESDAY 3/17 Project pitches in class!
slide-9
SLIDE 9

A remote possibility

  • There is an (increasingly) non-zero chance that Northeastern will move to holding all classes remotely in the coming days/weeks
  • In this case: remote/recorded lectures; on-demand office hours, held remotely; project presentations (+ pitches) will also have to be remote or recorded (will figure out!)

  • Keep an eye on Piazza for more updates
slide-10
SLIDE 10

Today

  • Will introduce transformer networks, a type of neural network that has come to dominate in NLP

  • To get there, will first review RNNs briefly
slide-12
SLIDE 12

RNNs

  • Review [on board]
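For reference, the recurrence reviewed on the board can be sketched in a few lines of numpy. This is a generic vanilla-RNN forward pass, not the course's notebook code; all weight names are illustrative:

```python
import numpy as np

def rnn_forward(X, Wx, Wh, b):
    """Vanilla RNN: h_t = tanh(W_x x_t + W_h h_{t-1} + b).

    X: (T, d_in) input sequence; returns all hidden states as (T, d_h).
    """
    T = X.shape[0]
    d_h = Wh.shape[0]
    h = np.zeros(d_h)            # h_0 initialized to zeros
    H = np.zeros((T, d_h))
    for t in range(T):           # the sequential loop transformers will remove
        h = np.tanh(X[t] @ Wx + h @ Wh + b)
        H[t] = h
    return H

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 3, 4
X = rng.standard_normal((T, d_in))
H = rnn_forward(X,
                rng.standard_normal((d_in, d_h)),
                rng.standard_normal((d_h, d_h)),
                np.zeros(d_h))
print(H.shape)  # (5, 4)
```

Note the explicit loop over time steps: each h_t depends on h_{t-1}, which is exactly the sequential dependency that motivates dropping recurrence in the transformer slides below.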
slide-13
SLIDE 13

Transformers

  • Hey, maybe we can get rid of recurrence!
slide-14
SLIDE 14

Attention mechanisms

slide-15
SLIDE 15

This movie so … terrible

[Figure (repeated across several build slides): a BiLSTM runs over the word embeddings for “This movie so … terrible”, producing hidden states h1, h2, …, hT-1, hT. An attention layer assigns weights α1, α2, …, αT-1, αT to these states; the weighted sum c = Σ_{i=1}^{T} α_i h_i is the context vector passed to the output layer to produce the prediction ŷ.]
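The attention-pooling step in the figure (c = Σ α_i h_i) can be sketched in numpy. The scoring vector v here is an illustrative stand-in for whatever learned scoring function produces the α's:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_pool(H, v):
    """Collapse T hidden states into one context vector c = sum_i alpha_i h_i.

    H: (T, d) matrix of BiLSTM hidden states h_1..h_T
    v: (d,)  scoring vector (stand-in for a learned scoring function)
    """
    scores = H @ v            # (T,) unnormalized attention scores
    alpha = softmax(scores)   # (T,) weights; nonnegative, sum to 1
    c = alpha @ H             # (d,) weighted sum of hidden states
    return c, alpha

rng = np.random.default_rng(0)
T, d = 5, 8
H = rng.standard_normal((T, d))
v = rng.standard_normal(d)
c, alpha = attention_pool(H, v)
print(alpha)  # larger weights mark the states the pooled vector emphasizes
```

The α's are a probability distribution over positions, which is why inspecting them gives the (rough) interpretability shown on the slide: high-weight positions contribute most to c.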

slide-23
SLIDE 23

Transformer block

Source: http://jalammar.github.io/illustrated-transformer/

slide-24
SLIDE 24


First, embed

slide-25
SLIDE 25


Then transform

slide-26
SLIDE 26


What is “self-attention”?


slide-29
SLIDE 29


This one weird trick

slide-31
SLIDE 31

In matrices

[Figure: the stacked input embeddings X are multiplied by learned weight matrices W^Q, W^K, W^V to produce the query, key, and value matrices Q, K, V]

slide-32
SLIDE 32

In matrices

[Figure: the full computation in matrix form: Z = softmax(Q Kᵀ / √d_k) V]

slide-33
SLIDE 33

Let’s implement… [notebook TODOs 1 & 2]
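As a rough guide to what the notebook TODOs involve (this is not the notebook's actual code), single-head scaled dot-product self-attention can be sketched in numpy; Wq, Wk, Wv stand in for the learned projection matrices:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one sequence.

    X:  (T, d_model) input embeddings
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices
    Returns (T, d_k): each row mixes the values of all positions.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (T, T) pairwise compatibilities
    A = softmax(scores)              # each row is a distribution over positions
    return A @ V                     # softmax(QK^T / sqrt(d_k)) V

rng = np.random.default_rng(0)
T, d_model, d_k = 4, 16, 8
X = rng.standard_normal((T, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
Z = self_attention(X, Wq, Wk, Wv)
print(Z.shape)  # (4, 8)
```

Unlike the RNN loop, every output row here is computed from all positions at once via matrix multiplies; the √d_k scaling keeps the dot products from saturating the softmax.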

slide-34
SLIDE 34

OK, but what is it used for?

slide-35
SLIDE 35

Translation



slide-37
SLIDE 37

Language modeling

https://talktotransformer.com/

slide-38
SLIDE 38

BERT

slide-39
SLIDE 39

BERT

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (Google AI Language) {jacobdevlin,mingweichang,kentonl,kristout}@google.com

slide-40
SLIDE 40

Pre-train (self-supervise) then fine-tune: A winning combo

slide-41
SLIDE 41

A Primer in BERTology: What we know about how BERT works

Anna Rogers, Olga Kovaleva, Anna Rumshisky Department of Computer Science, University of Massachusetts Lowell Lowell, MA 01854 {arogers, okovalev, arum}@cs.uml.edu

This is a thing now

slide-42
SLIDE 42

[Figure: BERT pre-training and fine-tuning share the same architecture. Pre-training: masked sentence pairs (Masked Sentence A, Masked Sentence B) are fed through BERT, trained with Masked LM and Next Sentence Prediction (NSP) objectives on unlabeled sentence pairs. Fine-tuning: the pre-trained weights initialize the same model, which is then trained on downstream tasks such as SQuAD question answering (predicting a start/end answer span over a question + paragraph pair), NER, and MNLI.]
slide-43
SLIDE 43


Self-Supervise an Encoder

slide-44
SLIDE 44

Self-Supervise an Encoder

The cat is very cute

slide-45
SLIDE 45

Self-Supervise an Encoder

Original: The cat is very cute
X: The [MASK] is very cute    y: cat

slide-46
SLIDE 46

Let’s implement … [notebook TODO 3]
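As a rough sketch of the data side of this TODO (not the notebook's code), masked-LM training pairs can be built by randomly replacing tokens with [MASK] and remembering the originals as targets. Real BERT masking is slightly fancier: it also sometimes keeps or randomly replaces the selected tokens.

```python
import random

def make_mlm_example(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Turn a token list into a (masked_input, targets) training pair.

    targets maps each masked position to the original token; the model
    is trained to recover those tokens from the masked input.
    """
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = mask_token
            targets[i] = tok
    return masked, targets

tokens = "The cat is very cute".split()
masked, targets = make_mlm_example(tokens, mask_prob=0.4, seed=3)
print(masked, targets)
```

Because the targets come from the text itself, no labels are needed: this is the "self-supervision" that makes pre-training on huge unlabeled corpora possible.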

slide-47
SLIDE 47

BERT details we did not consider

  • BERT actually uses word-pieces rather than entire words
  • Also uses “positional” embeddings in the inputs to give a sense of “location” in the sequence
  • Multiple self-attention “heads”
  • Deeper (12+ layers)
slide-50
SLIDE 50

BERT details we did not consider

  • BERT actually uses word-pieces rather than entire words
  • Also uses “positional” embeddings in the inputs to give a sense of “location” in the sequence
  • Multiple self-attention “heads”
  • Deeper (12+ layers)
  • Residual + layer norms (prevents explosions/NaNs)
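Putting a few of these details together: a minimal single-head encoder block in numpy, showing the residual + layer-norm wrapping, with positional embeddings added to the inputs up front. All names and shapes are illustrative (real BERT uses multiple heads, learned positional embeddings, and learned layer-norm scale/shift parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each position's feature vector to zero mean / unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_block(X, Wq, Wk, Wv, W1, W2):
    """One simplified (single-head) transformer encoder layer:
    self-attention and a feed-forward net, each wrapped in a
    residual connection plus layer norm."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    X = layer_norm(X + A @ V)          # residual + norm around attention
    ff = np.maximum(0, X @ W1) @ W2    # two-layer ReLU feed-forward net
    return layer_norm(X + ff)          # residual + norm around the FFN

rng = np.random.default_rng(0)
T, d = 6, 16
X = rng.standard_normal((T, d))
pos = rng.standard_normal((T, d)) * 0.1   # positional embeddings (stand-in)
params = [rng.standard_normal((d, d)) * 0.1 for _ in range(5)]
out = encoder_block(X + pos, *params)
print(out.shape)  # (6, 16)
```

The residual paths let gradients skip around each sub-layer, and the layer norms keep activations in a stable range, which is what prevents the explosions/NaNs mentioned on the slide when stacking 12+ of these blocks.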
slide-51
SLIDE 51

For a more detailed implementation …

  • See Sasha Rush’s excellent “annotated transformer”:

http://nlp.seas.harvard.edu/2018/04/03/attention.html