SLIDE 1

CS109B Data Science 2

Pavlos Protopapas, Mark Glickman, and Chris Tanner

Lecture 16: Language Model

SLIDE 2

Outline

  • Language Modelling
  • RNNs/LSTMs + ELMo
  • Seq2Seq + Attention
  • Transformers + BERT
  • Conclusions

SLIDE 3

We could easily spend an entire semester on this material. The goal for today and Wednesday is to convey:

  • the ubiquity and importance of sequential data
  • high-level overview of the most useful, relevant models
  • foundation for diving deeper
  • when to use which models, based on your data
SLIDE 4

Outline

  • Language Modelling
  • RNNs/LSTMs + ELMo
  • Seq2Seq + Attention
  • Transformers + BERT
  • Conclusions

SLIDE 5

Background

Regardless of how we model sequential data, keep in mind that we can factor any time series as follows:

P(y_1, ..., y_T) = ∏_{t=1}^{T} P(y_t | y_{t-1}, ..., y_1)

The joint distribution of all measurements breaks into conditional probabilities: the probability of an event depends on all of the events that occurred before it, and this compounds for all subsequent events, too.
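To make this factorization concrete, here is a minimal Python sketch (not from the lecture); the cond_prob rule below is a made-up toy that stands in for whatever model supplies P(y_t | y_{t-1}, ..., y_1).

```python
# Minimal sketch of the chain-rule factorization (toy numbers, not from the lecture).

def cond_prob(event, history):
    """Hypothetical stand-in for P(y_t | y_{t-1}, ..., y_1)."""
    # Toy rule: a day's weather is likelier to repeat the previous day's.
    if not history:
        return 0.5
    return 0.7 if event == history[-1] else 0.3

def joint_prob(sequence):
    """P(y_1, ..., y_T) = prod over t of P(y_t | y_{t-1}, ..., y_1)."""
    p = 1.0
    for t, event in enumerate(sequence):
        p *= cond_prob(event, sequence[:t])
    return p

print(joint_prob(["rainy", "rainy", "sunny"]))  # 0.5 * 0.7 * 0.3 = 0.105
```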

SLIDE 6

Example

The probability of the following 3-day weather pattern in Seattle:

Day 1   Day 2   Day 3

SLIDE 7

Example

The probability of the following 3-day weather pattern in Seattle (Day 1, Day 2, Day 3):

P(y_1, y_2, y_3) =

SLIDE 8

Example

The probability of the following 3-day weather pattern in Seattle (Day 1, Day 2, Day 3):

P(y_1, y_2, y_3) = P(y_1) P(y_2 | y_1) P(y_3 | y_1, y_2)

SLIDE 9

Example

Why is it useful to accurately estimate the joint probability of any given sequence of length N?

SLIDE 10

Background

Having the ability to estimate the probability of any sequence of length N allows us to determine the most likely next event (i.e., a sequence of length N + 1):

P(y_1, y_2, y_3, ?) = P(y_1) P(y_2 | y_1) P(y_3 | y_1, y_2) P(? | y_1, y_2, y_3)

(Day 1, Day 2, Day 3, Day 4)

SLIDE 11

For the remainder of this lecture, we will use text (natural language) as examples because:

  • It’s easy to interpret success/failures
  • Real-world impact and commonplace usages
  • Availability of data to try things yourself

Yet, for any model, you can imagine using any other sequential data.

SLIDE 12

Language Modelling

A Language Model represents the language used by a given entity (e.g., a particular person, genre, or other well-defined class of text)

SLIDE 13

SLIDE 14

SLIDE 15

Language Modelling: Formal Definition

A Language Model estimates the probability of any sequence of words.

Let Y = "Eleni was late for class"   (words x_1 ... x_5)

P(Y) = P("Eleni was late for class")

SLIDE 16

Language Modelling

Generate Text

SLIDE 17

SLIDE 18

SLIDE 19

Language Modelling

"Drug kingpin El Chapo testified that he gave MILLIONS to Pelosi, Schiff & Killary. The Feds then closed the courtroom doors."
SLIDE 20

Language Modelling

A Language Model is useful for:

Generating Text
  • Auto-complete
  • Speech-to-text
  • Question-answering / chatbots
  • Machine translation

Classifying Text
  • Authorship attribution
  • Detecting spam vs not spam

And much more!

SLIDE 21

Language Modelling

Today, we heavily focus on Language Modelling (LM) because:

1. It's foundational for nearly all NLP tasks.
2. Since we're ultimately modelling a sequence, LM approaches are generalizable to any type of data, not just text.

SLIDE 22

Language Modelling: unigrams

How can we build a language model?

Naive Approach: unigram model. Assume each word is independent of all others; count how often each word occurs in the training data.

SLIDE 23

Language Modelling: unigrams

Let Y = "Eleni was late for class"   (words x_1 ... x_5)

SLIDE 24

Language Modelling: unigrams

Unigram model: assume each word is independent of all others.

Let Y = "Eleni was late for class"   (words x_1 ... x_5)

P(Y) = P(Eleni) P(was) P(late) P(for) P(class)
     = 0.00015 * 0.01 * 0.004 * 0.03 * 0.0035 ≈ 6.3x10^-13

You calculate each of these probabilities from the training corpus.
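As a rough illustration (this is not course code), a unigram model is just normalized word counts over a training corpus; the tiny corpus below is a placeholder.

```python
from collections import Counter

corpus = "eleni was late for class . anqi was late for the class".split()
counts = Counter(corpus)
total = sum(counts.values())

def unigram_prob(word):
    # P(word) = count(word) / total number of tokens; unseen words get probability 0.
    return counts[word] / total

def sentence_prob(sentence):
    # Independence assumption: just multiply the unigram probabilities.
    p = 1.0
    for word in sentence.split():
        p *= unigram_prob(word)
    return p

print(sentence_prob("eleni was late for class"))
```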

SLIDE 25

Language Modelling: unigrams

UNIGRAM ISSUES?

SLIDE 26

Language Modelling: unigrams

UNIGRAM ISSUES?

P("Eleni was late for class") = P("class for was late Eleni"): context doesn't play a role at all.

SLIDE 27

Language Modelling: unigrams

Sequence generation: what's the most likely next word?

Eleni was late for class _____

SLIDE 28

Language Modelling: unigrams

Sequence generation with a unigram model always picks the same (most frequent) word, regardless of context:

Eleni was late for class the
Anqi was late for class the the

SLIDE 29

Language Modelling: bigrams

How can we build a language model?

Alternative Approach: bigram model. Look at pairs of consecutive words.

Let Y = "Eleni was late for class"   (words x_1 ... x_5)

SLIDE 30

Language Modelling: bigrams

P(Y) = P(was | Eleni)

SLIDE 31

P(Y) = P(was | Eleni) P(late | was)

SLIDE 32

P(Y) = P(was | Eleni) P(late | was) P(for | late)

SLIDE 33

P(Y) = P(was | Eleni) P(late | was) P(for | late) P(class | for)

SLIDE 34

Language Modelling: bigrams

P(Y) = P(was | Eleni) P(late | was) P(for | late) P(class | for)

You calculate each of these probabilities by simply counting the occurrences:

P(class | for) = count(for class) / count(for)
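A similar sketch for bigrams, again with a placeholder corpus: the conditional probability is a ratio of two counts, exactly as in the formula above.

```python
from collections import Counter

tokens = "she went to class after class she went to the library".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))   # consecutive word pairs

def bigram_prob(word, prev):
    # P(word | prev) = count(prev word) / count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("class", "to"))   # count("to class") / count("to")
```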

SLIDE 35

Language Modelling: bigrams

BIGRAM ISSUES?

SLIDE 36

Language Modelling: bigrams

BIGRAM ISSUES?

  • Out-of-vocabulary items have probability 0, which kills the overall probability
  • Always need more context (e.g., trigram, 4-gram), but sparsity is an issue (rarely seen subsequences)
  • Storage becomes a problem as we increase window size
  • No semantic information conveyed by counts (e.g., vehicle vs car)
SLIDE 37

Language Modelling: neural networks

IDEA: Let's use a neural network!

First, each word is represented by a word embedding (e.g., a vector of length 200), e.g., man, woman, table.

SLIDE 38

Language Modelling: neural networks

  • Each circle is a specific floating-point scalar
  • Words that are more semantically similar to one another will have embeddings that are proportionally similar, too
  • We can use pre-existing word embeddings that have been trained on gigantic corpora

SLIDE 39

Language Modelling: neural networks

These word embeddings are so rich that you get nice properties:

king - man + woman ≈ queen

Word2vec: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
GloVe: https://www.aclweb.org/anthology/D14-1162.pdf
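As a hedged illustration of that analogy, the gensim library can query pre-trained vectors; the file name below is the commonly distributed word2vec file and is an assumption about what you have downloaded, not something provided by the course.

```python
# Sketch only: assumes gensim is installed and the pre-trained vectors are on disk.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman ~ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```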
SLIDE 40

Language Modelling: neural networks

How can we use these embeddings to build a LM? Remember, we only need a system that can estimate:

P(y_{t+1} | y_t, y_{t-1}, ..., y_1)
(next word | previous words)

Example input sentence: "She went to class"

SLIDE 41

Language Modelling: Feed-forward Neural Net

Neural Approach #1: Feed-forward Neural Net

General Idea: using windows of words, predict the next word.

Example input sentence: "She went to" → class?

[Diagram: the window of word embeddings feeds a hidden layer (weights W), which feeds an output layer (weights X).]

SLIDE 42

Language Modelling: Feed-forward Neural Net

Using windows of words, predict the next word: "She went to" → class?

y = [y_1, y_2, y_3]        (concatenated word embeddings)
h = g(W y + c_1)           (hidden layer)
ẑ = softmax(X h + c_2)     (output layer: a probability distribution over the vocabulary)

SLIDE 43

SLIDE 44

SLIDE 45

SLIDE 46

Language Modelling: Feed-forward Neural Net

[Diagram builds: the same fixed window slides along the example sentence "She went to class after visiting her grandma", and the same network (y = [y_1, y_2, y_3], h = g(W y + c_1), ẑ = softmax(X h + c_2)) predicts each next word in turn from the previous three words.]

SLIDE 47

Language Modelling: Feed-forward Neural Net

FFNN ISSUES? FFNN STRENGTHS?

SLIDE 48

Language Modelling: Feed-forward Neural Net

FFNN STRENGTHS
  • No sparsity issues (it's okay if we've never seen a segment of words)
  • No storage issues (we never store counts)

FFNN ISSUES
  • Fixed window size can never be big enough; need more context
  • Increasing window size adds many more weights
  • The weights awkwardly handle word position
  • No concept of time
  • Requires inputting the entire context just to predict one word
SLIDE 49

Language Modelling

We especially need a system that:

  • Has an β€œinfinite” concept of the past, not just a fixed window
  • For each new input, outputs the most likely next event (e.g., word)
SLIDE 50

Outline

  • Language Modelling
  • RNNs/LSTMs + ELMo
  • Seq2Seq + Attention
  • Transformers + BERT
  • Conclusions

SLIDE 51

SLIDE 52

Language Modelling

IDEA: for every individual input, output a prediction.

Example input word: "She" → went?

y = y_1                    (a single word embedding)
h = g(W y + c_1)           (hidden layer)
ẑ = softmax(X h + c_2)     (output layer)

Let's use the previous hidden state, too.

SLIDE 53

Language Modelling: RNNs

Neural Approach #2: Recurrent Neural Network (RNN)

[Diagram: at each time step, the input y_t feeds the hidden layer (weights W), which produces an output ẑ_t (weights X); the hidden layer is also fed to the next time step's hidden layer (recurrent weights V).]

SLIDE 54

Language Modelling: RNNs

We have seen this abstract view in Lecture 15.

The recurrent loop V conveys that the current hidden layer is influenced by the hidden layer from the previous time step.

SLIDE 55

RNN (review): Training Process

[Diagram: the words "She", "went", "to", "class" are fed in one per time step; each step produces a prediction ẑ_t, which is compared against the true next word z_t.]

The per-step error is the cross-entropy loss:

CE(z_t, ẑ_t) = -Σ_{w ∈ V} z_t^w log(ẑ_t^w)

SLIDE 56

RNN (review): Training Process

During training, regardless of our output predictions, we feed in the correct inputs at every time step (i.e., teacher forcing).

SLIDE 57

RNN (review): Training Process

At each step the model predicts the next word (went? over? class? after?).

Our total loss is simply the average loss across all T time steps.
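A small numpy sketch of that loss, with toy numbers: the per-step cross-entropy from the formula above, averaged over the time steps.

```python
import numpy as np

def cross_entropy(z_true, z_hat):
    # CE(z, z_hat) = -sum over words w of z_w * log(z_hat_w); z_true is one-hot.
    return -np.sum(z_true * np.log(z_hat + 1e-12))

# Toy example: vocabulary of 4 words, 3 time steps (numbers are made up).
z_true = np.eye(4)[[2, 0, 3]]                    # correct next word at each step, one-hot
z_hat = np.array([[0.10, 0.20, 0.60, 0.10],
                  [0.70, 0.10, 0.10, 0.10],
                  [0.25, 0.25, 0.25, 0.25]])     # the model's predicted distributions

total_loss = np.mean([cross_entropy(t, p) for t, p in zip(z_true, z_hat)])
print(total_loss)
```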

SLIDE 58

RNN (review): Training Process

To update our weights (e.g., V), we calculate the gradient of our loss w.r.t. the repeated weight matrix (e.g., ∂L/∂V). Using the chain rule, we trace the derivative all the way back to the beginning, while summing the results.

SLIDE 59

SLIDE 60

SLIDE 61

SLIDE 62

RNN (review): Training Process

[Diagram builds: computing ∂L/∂W likewise requires backpropagating through the recurrent connections (V) of every earlier time step.]

SLIDE 63

RNN (review)

  • This backpropagation through time (BPTT) process is expensive
  • Instead of updating after every timestep, we tend to do so every T steps (e.g., every sentence or paragraph)
  • This isn't equivalent to using only a window of size T (a la n-grams) because we still have 'infinite memory'

SLIDE 64

RNN: Generation

We can generate the most likely next event (e.g., word) by sampling from ẑ. Continue until we generate the <EOS> symbol.

SLIDE 65

RNN: Generation

<START> → "Sorry"

SLIDE 66

RNN: Generation

<START> → "Sorry" → "Harry" → "shouted," → "panicking"   (each sampled word is fed back in as the next input)

SLIDE 67

RNN: Generation

NOTE: the same input (e.g., "Harry") can easily yield different outputs, depending on the context (unlike FFNNs and n-grams).

SLIDE 68

RNN: Generation

When trained on Harry Potter text, it generates: [sample output shown on slide]

Source: https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6

SLIDE 69

RNN: Generation

When trained on recipes: [sample output shown on slide]

Source: https://gist.github.com/nylki/1efbaa36635956d35bcc

SLIDE 70

RNNs: Overview

RNN STRENGTHS
  • Can handle infinite-length sequences (not just a fixed window)
  • Has a "memory" of the context (thanks to the hidden layer's recurrent loop)
  • Same weights used for all inputs, so word order isn't handled awkwardly (unlike the FFNN)

RNN ISSUES
  • Slow to train (BPTT)
  • Due to the "infinite sequence", gradients can easily vanish or explode
  • Has trouble actually making use of long-range context
slide-71
SLIDE 71

CS109B, PROTOPAPAS, GLICKMAN, TANNER

RNNs: Overview

RN RNN IS ISSUES ES? RN RNN S STREN RENGTHS?

  • Can handle infinite-length sequences (not just a fixed-window)
  • Has a β€œmemory” of the context (thanks to the hidden layer’s recurrent loop)
  • Same weights used for all inputs, so word order isn’t wonky (like FFNN)
  • Slow to train (BPTT)
  • Due to ”infinite sequence”, gradients can easily vanish or explode
  • Has trouble actually making use of long-range context
slide-72
SLIDE 72

CS109B, PROTOPAPAS, GLICKMAN, TANNER

RNNs: Vanishing and Exploding Gradients (review)

𝑉 𝑉 𝑉 𝑉 π‘Š:

𝐷𝐹 𝑧;, 𝑧 F;

𝑧 F

ππ‘΄πŸ“ ππ‘ΎπŸ

π‘Š9 π‘Š#

In Input l layer Hi Hidde dden layer

𝑋 𝑋 𝑋 𝑋 She went to class

ππ‘΄πŸ“ ππ‘ΎπŸ = ?

slide-73
SLIDE 73

CS109B, PROTOPAPAS, GLICKMAN, TANNER

𝑉 𝑉 𝑉 𝑉 π‘Š:

𝐷𝐹 𝑧;, 𝑧 F;

𝑧 F

ππ‘΄πŸ“ ππ‘ΎπŸ

π‘Š9 π‘Š#

In Input l layer Hi Hidde dden layer

𝑋 𝑋 𝑋 𝑋 She went to class

ππ‘΄πŸ“ ππ‘ΎπŸ = ππ‘΄πŸ“ ππ‘ΎπŸ’

RNNs: Vanishing and Exploding Gradients (review)

slide-74
SLIDE 74

CS109B, PROTOPAPAS, GLICKMAN, TANNER

𝑉 𝑉 𝑉 𝑉 π‘Š:

𝐷𝐹 𝑧;, 𝑧 F;

𝑧 F

ππ‘΄πŸ“ ππ‘ΎπŸ

π‘Š9 π‘Š#

In Input l layer Hi Hidde dden layer

𝑋 𝑋 𝑋 𝑋 She went to class

ππ‘΄πŸ“ ππ‘ΎπŸ = ππ‘΄πŸ“ ππ‘ΎπŸ’ ππ‘ΎπŸ’ ππ‘ΎπŸ‘

RNNs: Vanishing and Exploding Gradients (review)

slide-75
SLIDE 75

CS109B, PROTOPAPAS, GLICKMAN, TANNER

𝑉 𝑉 𝑉 𝑉 π‘Š:

𝐷𝐹 𝑧;, 𝑧 F;

𝑧 F

ππ‘΄πŸ“ ππ‘ΎπŸ

π‘Š9 π‘Š#

In Input l layer Hi Hidde dden layer

𝑋 𝑋 𝑋 𝑋 She went to class

ππ‘΄πŸ“ ππ‘ΎπŸ = ππ‘΄πŸ“ ππ‘ΎπŸ’ ππ‘ΎπŸ’ ππ‘ΎπŸ‘ ππ‘ΎπŸ‘ ππ‘ΎπŸ

RNNs: Vanishing and Exploding Gradients (review)

SLIDE 76

RNNs: Vanishing and Exploding Gradients (review)

To address RNNs' finicky nature with long-range context, we turned to an RNN variant named the LSTM (long short-term memory). But first, let's recap what we've learned so far.

SLIDE 77

Sequential Modelling (so far)

n-grams, e.g., P(went | The) = count(The went) / count(The)
  • Basic counts; fast
  • Fixed window size
  • Sparsity & storage issues
  • Not robust

FFNN
  • Kind of robust... almost
  • Fixed window size
  • Weirdly handles context positions
  • No "memory" of the past

RNN
  • Handles infinite context (in theory)
  • Robust to rare words
  • Slow
  • Difficulty with long context

SLIDE 78

Outline

  • Language Modelling
  • RNNs/LSTMs + ELMo
  • Seq2Seq + Attention
  • Transformers + BERT
  • Conclusions

SLIDE 79

SLIDE 80

Long short-term memory (LSTM)

  • A type of RNN that is designed to better handle long-range dependencies
  • In "vanilla" RNNs, the hidden state is perpetually being rewritten
  • In addition to a traditional hidden state h, let's have a dedicated memory cell c for long-term events. More power to relay sequence info.

SLIDE 81

Inside an LSTM Hidden Layer

[Diagram: the cell state c and hidden state h flowing from time step t-1 to t to t+1.]

Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

SLIDE 82

SLIDE 83

SLIDE 84

SLIDE 85

Inside an LSTM Hidden Layer

  • some old memories are "forgotten"
  • some new memories are made
  • a nonlinear weighted version of the long-term memory becomes our short-term memory
  • memory is written, erased, and read by three gates, which are influenced by the current input and the previous hidden state

Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

SLIDE 86

Inside an LSTM Hidden Layer

It's still possible for LSTMs to suffer from vanishing/exploding gradients, but it's way less likely than with vanilla RNNs:

  • If an RNN wishes to preserve info over long contexts, it must delicately find a recurrent weight matrix that isn't too large or small
  • However, LSTMs have 3 separate mechanisms that adjust the flow of information (e.g., the forget gate, if turned off, will preserve all info)

SLIDE 87

Long short-term memory (LSTM)

LSTM STRENGTHS
  • Almost always outperforms vanilla RNNs
  • Captures long-range dependencies shockingly well

LSTM ISSUES
  • Has more weights to learn than vanilla RNNs; thus, requires a moderate amount of training data (otherwise, vanilla RNNs are better)
  • Can still suffer from vanishing/exploding gradients
slide-88
SLIDE 88

CS109B, PROTOPAPAS, GLICKMAN, TANNER

Sequential Modelling

n-grams

𝑄 went π‘‡β„Žπ‘“ = count(π‘‡β„Žπ‘“ π‘₯π‘“π‘œπ‘’) count(π‘‡β„Žπ‘“)

FFNN

She went to 𝑋 𝑉 clas s

𝑋 𝑉 𝑧 F# 𝑋 𝑉 𝑋 𝑉 𝑧 F9 𝑧 F: π‘Š π‘Š

RNN

She went to

𝑧 F# 𝑧 F9 𝑧 F:

She went to

LSTM

slide-89
SLIDE 89

CS109B, PROTOPAPAS, GLICKMAN, TANNER

Sequential Modelling

If your goal isn’t to predict the next item in a sequence, and you rather do some other classification or regression task using the sequence, then you can:

  • Train an aforementioned model (e.g., LSTM) as a language model
  • Use the hidden layers that correspond to each item in your

sequence IM IMPORTANT

SLIDE 90

Sequential Modelling

1. Train a LM to learn hidden-layer embeddings
2. Use the hidden-layer embeddings for other tasks (e.g., predicting a sentiment score)

SLIDE 91

Sequential Modelling

Or jointly learn hidden embeddings toward a particular task (end-to-end): input layer → hidden layer 1 → hidden layer 2 → output layer (sentiment score).
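A minimal Keras sketch of the end-to-end option; layer sizes, vocabulary size, and the dummy data are made-up placeholders, not course specifics.

```python
import numpy as np
from tensorflow.keras import layers, models

vocab_size, seq_len = 5000, 40   # placeholder sizes

model = models.Sequential([
    layers.Embedding(vocab_size, 128),       # word embeddings
    layers.LSTM(64),                          # final hidden state summarizes the sequence
    layers.Dense(1, activation="sigmoid"),    # sentiment score in [0, 1]
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy data just to show the expected shapes.
x = np.random.randint(0, vocab_size, size=(8, seq_len))
y = np.random.randint(0, 2, size=(8, 1))
model.fit(x, y, epochs=1, verbose=0)
```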

SLIDE 92

You now have the foundation for modelling sequential data. Most state-of-the-art advances are based on those core RNN/LSTM ideas. But, with tens of thousands of researchers and hackers exploring deep learning, there are many tweaks that have proven useful. (This is where things get crazy.)

SLIDE 93

Bi-directional (review)

[Diagram: inputs Y_{t-2}, Y_{t-1}, Y_t feeding states Z_{t-2}, Z_{t-1}, Z_t, along with the compact symbol used for a BRNN.]

SLIDE 94

Bi-directional (review)

RNNs/LSTMs use the left-to-right context and sequentially process data. If you have full access to the data at testing time, why not make use of the flow of information from right-to-left, also?

SLIDE 95

RNN Extensions: Bi-directional LSTMs (review)

For brevity, let's use the following schematic to represent an RNN: inputs y_1 ... y_4 feeding forward hidden states h_1 ... h_4.

SLIDE 96

A second RNN reads the same inputs right-to-left, producing a backward hidden state at each position.

SLIDE 97

Concatenate the forward and backward hidden layers at each position.

SLIDE 98

The concatenated hidden states feed the output layer, producing ẑ_1 ... ẑ_4.

SLIDE 99

RNN Extensions: Bi-directional LSTMs (review)

BI-LSTM STRENGTHS
  • Usually performs at least as well as uni-directional RNNs/LSTMs

BI-LSTM ISSUES
  • Slower to train
  • Only possible if access to the full data is allowed
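A minimal Keras sketch of the bi-directional idea, with made-up sizes; the Bidirectional wrapper's default behaviour is to concatenate the forward and backward hidden states at each position.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(40,)),                                      # placeholder sequence length
    layers.Embedding(5000, 128),                                    # placeholder vocab/size
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),   # 2 * 64 features per position
    layers.Dense(5000, activation="softmax"),                       # e.g., a prediction at each position
])
model.summary()
```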

SLIDE 100

Deep RNN (review)

LSTM units can be arranged in layers, so that the output of each unit is the input to the units in the next layer. This is called a deep RNN, where the adjective "deep" refers to these multiple layers.

  • Each layer feeds the LSTM on the next layer
  • The first time step of a feature is fed to the first LSTM, which processes that data and produces an output (and a new state for itself).
  • That output is fed to the next LSTM, which does the same thing, and the next, and so on.
  • Then the second time step arrives at the first LSTM, and the process repeats.
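A minimal Keras sketch of stacking, again with made-up sizes: each LSTM layer returns its full sequence of hidden states so the layer above can consume them.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(40,)),                   # placeholder sequence length
    layers.Embedding(5000, 128),                 # placeholder vocab/size
    layers.LSTM(64, return_sequences=True),      # hidden layer #1: one state per time step
    layers.LSTM(64, return_sequences=True),      # hidden layer #2: consumes layer #1's states
    layers.Dense(5000, activation="softmax"),    # per-step prediction
])
model.summary()
```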

SLIDE 101

Deep RNN (review)

[Diagram: inputs Y and states Z unrolled across time steps t-2 ... t+2.]

SLIDE 102

SLIDE 103

SLIDE 104

Deep RNN (review)

Hidden layers provide an abstraction (holds "meaning"). Stacking hidden layers provides increased abstractions.

[Diagram builds: inputs y_1 ... y_4 → hidden layer #1 → hidden layer #2 → outputs ẑ_1 ... ẑ_4.]

SLIDE 105

ELMo: Stacked Bi-directional LSTMs

General Idea:

  • Goal is to get highly rich embeddings for each word (unique type)
  • Use both directions of context (bi-directional), with increasing abstractions (stacked)
  • Linearly combine all abstract representations (hidden layers) and optimize w.r.t. a particular task (e.g., sentiment classification)

ELMo Slides: https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018
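A small numpy sketch of the "linearly combine" step, with made-up shapes and weights: the word's final representation is a weighted sum of its per-layer representations, scaled by a task-specific factor (as in the ELMo paper).

```python
import numpy as np

rng = np.random.default_rng(0)

num_layers, dim = 3, 8                              # e.g., embedding layer + 2 bi-LSTM layers (made up)
layer_reps = rng.normal(size=(num_layers, dim))     # representations of ONE word, one row per layer

s = np.array([0.2, 0.5, 0.3])   # learned, softmax-normalized layer weights (toy values)
gamma = 1.0                     # learned task-specific scale

elmo_rep = gamma * np.sum(s[:, None] * layer_reps, axis=0)
print(elmo_rep.shape)   # (8,): one task-tuned vector per word
```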

SLIDE 106

Illustration: http://jalammar.github.io/illustrated-bert/

ELMo: Stacked Bi-directional LSTMs

SLIDE 107

Illustration: http://jalammar.github.io/illustrated-bert/

SLIDE 108

ELMo: Stacked Bi-directional LSTMs

  • ELMo yielded incredibly good word embeddings, which yielded state-of-the-art results when applied to many NLP tasks.
  • Main ELMo takeaway: given enough training data, having tons of explicit connections between your vectors is useful (the system can determine how to best use context).

ELMo Slides: https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018

SLIDE 109

REFLECTION

So far, for all of our sequential modelling, we have been concerned with emitting 1 output per input datum. Sometimes, a sequence is the smallest granularity we care about, though (e.g., an English sentence).

SLIDE 110

Outline

  • Language Modelling
  • RNNs/LSTMs + ELMo
  • Seq2Seq + Attention
  • Transformers + BERT
  • Conclusions

SLIDE 111

SLIDE 112

Sequence-to-Sequence (seq2seq)

  • If our input is a sentence in Language A, and we wish to translate it to Language B, it is clearly sub-optimal to translate word by word (as our current models are suited to do).
  • Instead, let a sequence of tokens be the unit that we ultimately wish to work with (a sequence of length N may emit a sequence of length M).
  • Seq2seq models are comprised of 2 RNNs: 1 encoder and 1 decoder (see the sketch below).
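A minimal Keras sketch of that 2-RNN setup (vocabulary sizes and dimensions are placeholders): the encoder LSTM's final states initialize the decoder LSTM.

```python
from tensorflow.keras import layers, models

src_vocab, tgt_vocab, dim = 8000, 9000, 256   # made-up sizes

# Encoder
enc_in = layers.Input(shape=(None,))
enc_emb = layers.Embedding(src_vocab, dim)(enc_in)
_, state_h, state_c = layers.LSTM(dim, return_state=True)(enc_emb)

# Decoder, initialized with the encoder's final states
dec_in = layers.Input(shape=(None,))
dec_emb = layers.Embedding(tgt_vocab, dim)(dec_in)
dec_out = layers.LSTM(dim, return_sequences=True)(dec_emb, initial_state=[state_h, state_c])
dec_pred = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = models.Model([enc_in, dec_in], dec_pred)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```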
slide-113
SLIDE 113

CS109B, PROTOPAPAS, GLICKMAN, TANNER

Sequence-to-Sequence (seq2seq)

In Input l layer Hi Hidde dden layer

β„Ž#

x

β„Ž9

x

β„Ž:

x

β„Ž;

x

The brown dog ran

EN ENCODER ER RN RNN

SLIDE 114

Sequence-to-Sequence (seq2seq)

The final hidden state of the encoder RNN is the initial state of the decoder RNN.
slide-115
SLIDE 115

CS109B, PROTOPAPAS, GLICKMAN, TANNER

Sequence-to-Sequence (seq2seq)

In Input l layer Hi Hidde dden layer

β„Ž#

x

β„Ž9

x

β„Ž:

x

β„Ž;

x

The brown dog ran

Th The fi final hidden state of f the encoder RNN is is the in init itia ial l state of

  • f the decod
  • der RNN

EN ENCODER ER RN RNN

β„Ž#

y

β„Ž9

y

β„Ž:

y

Le chien brun a

DE DECODE DER R RNN

β„Ž;

y

β„Ž<

y

couru

SLIDE 116

Sequence-to-Sequence (seq2seq)

[Diagram: each decoder step also emits a prediction ẑ_1 ... ẑ_5 over the output vocabulary.]

SLIDE 117

Sequence-to-Sequence (seq2seq)

Training occurs like it typically does for RNNs; the loss (from the decoder outputs) is calculated, and we update weights all the way back to the beginning (the encoder).

SLIDE 118

Sequence-to-Sequence (seq2seq)

Testing generates decoder outputs one word at a time, until we generate an <EOS> token. Each decoder output ẑ_j becomes the decoder's next input.

SLIDE 119

Sequence-to-Sequence (seq2seq)

See any issues with this traditional seq2seq paradigm?

SLIDE 120

Sequence-to-Sequence (seq2seq)

It's crazy that the entire "meaning" of the 1st sequence is expected to be packed into this one embedding (the encoder's final hidden state), and that the encoder then never interacts with the decoder again. Hands free.

SLIDE 121

Sequence-to-Sequence (seq2seq)

Instead, what if the decoder, at each step, pays attention to a distribution of all of the encoder’s hidden states?

SLIDE 122

seq2seq + Attention

[Diagram: at the first decoder step, attention weights b_1 ... b_4 are computed over the encoder hidden states h_1 ... h_4; the resulting context is combined with the decoder state d_1 to predict ẑ_1 = "Le".]

NOTE: each attention weight is based on the decoder's current hidden state, too.

SLIDE 123

SLIDE 124

SLIDE 125

SLIDE 126

[Diagram builds: the same attention computation repeats at every decoder step, producing "Le chien brun a couru"; each step's prediction ẑ_j is fed back in along with the next decoder state.]

SLIDE 127

seq2seq + Attention

Attention:

  • greatly improves seq2seq results
  • allows us to visualize the contribution each word gave during each step of the decoder

Image source: Fig 3 in Bahdanau et al., 2015
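A small numpy sketch of one decoder step of attention, using simple dot-product scores (Bahdanau et al. actually learn a small scoring network, so this is an illustrative simplification with made-up sizes).

```python
import numpy as np

rng = np.random.default_rng(0)

enc_states = rng.normal(size=(4, 8))   # h_1 ... h_4 from the encoder ("The brown dog ran")
dec_state = rng.normal(size=(8,))      # the decoder's current hidden state

scores = enc_states @ dec_state                    # one score per encoder state
weights = np.exp(scores) / np.exp(scores).sum()    # softmax -> attention distribution
context = weights @ enc_states                     # weighted sum of the encoder states

print(weights.round(2), context.shape)   # the weights are what gets visualized per decoder step
```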

SLIDE 128

Outline

  • Language Modelling
  • RNNs/LSTMs + ELMo
  • Seq2Seq + Attention
  • Transformers + BERT
  • Conclusions