Latent Models: Sequence Models Beyond HMMs and Machine Translation Alignment


SLIDE 1

Latent Models: Sequence Models Beyond HMMs and Machine Translation Alignment

CMSC 473/673 UMBC

SLIDE 2

Outline

Review: EM for HMMs
Machine Translation Alignment
Limited Sequence Models
  • Maximum Entropy Markov Models
  • Conditional Random Fields
Recurrent Neural Networks
  • Basic Definitions
  • Example in PyTorch

SLIDE 3

Why Do We Need Both the Forward and Backward Algorithms?

Compute posteriors:

α(i, s) * p(s' | s) * p(obs at i+1 | s') * β(i+1, s') = total probability of paths through the s→s' arc (at time i)
α(i, s) * β(i, s) = total probability of paths through state s at step i

p(zj = t | w1, …, wN) = α(j, t) * β(j, t) / α(N+1, END)
p(zj = t, zj+1 = t' | w1, …, wN) = α(j, t) * p(t' | t) * p(obs at j+1 | t') * β(j+1, t') / α(N+1, END)

SLIDE 4

EM for HMMs

Two-step, iterative algorithm:

  • 0. Assume some value for your parameters
  • 1. E-step: count under uncertainty, assuming these parameters
  • 2. M-step: maximize log-likelihood, assuming these uncertain (estimated) counts

The parameters are pobs(w | s) and ptrans(s' | s); the estimated counts come from the posteriors:

p*(zj = t | w1, …, wN) = α(j, t) * β(j, t) / α(N+1, END)
p*(zj = t, zj+1 = t' | w1, …, wN) = α(j, t) * p(t' | t) * p(obs at j+1 | t') * β(j+1, t') / α(N+1, END)

SLIDE 5

EM For HMMs (Baum-Welch Algorithm)

α = computeForwards()
β = computeBackwards()
L = α[N+1][END]
for(i = N; i ≥ 0; --i) {
    for(next = 0; next < K; ++next) {
        cobs(obs[i+1] | next) += α[i+1][next] * β[i+1][next] / L
        for(state = 0; state < K; ++state) {
            u = pobs(obs[i+1] | next) * ptrans(next | state)
            ctrans(next | state) += α[i][state] * u * β[i+1][next] / L
        }
    }
}
update pobs, ptrans using cobs, ctrans
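The same E-step as a minimal runnable sketch in NumPy (the function and array names are ours, not the lecture's; alpha and beta are assumed to be precomputed (N+2) x K forward/backward tables, and obs is assumed padded so obs[i+1] is defined for i = 0 … N):

import numpy as np

def expected_counts(alpha, beta, p_obs, p_trans, obs, N, K, END):
    # total probability of the observed sequence
    L = alpha[N + 1, END]
    c_obs = np.zeros_like(p_obs)       # expected emission counts, (V, K)
    c_trans = np.zeros_like(p_trans)   # expected transition counts, (K, K)
    for i in range(N, -1, -1):
        for nxt in range(K):
            # posterior probability of being in state nxt at time i+1
            c_obs[obs[i + 1], nxt] += alpha[i + 1, nxt] * beta[i + 1, nxt] / L
            for state in range(K):
                u = p_obs[obs[i + 1], nxt] * p_trans[nxt, state]
                # posterior probability of the state -> nxt arc at time i
                c_trans[nxt, state] += alpha[i, state] * u * beta[i + 1, nxt] / L
    # the M-step renormalizes c_obs and c_trans into new p_obs and p_trans
    return c_obs, c_trans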

SLIDE 6

Semi-Supervised Learning

      ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?

labeled data: human annotated

  • relatively small/few
  • examples

unlabeled data:

  • raw; not annotated
  • plentiful

EM

SLIDE 7

Semi-Supervised Parameter Estimation for HMMs

Expected Transition Counts (from unlabeled data):
          N    V    end
  start  1.8  0.1  0.1
  N      1.5  0.8  0.1
  V      1.4  1.1  0.4

Expected Emission Counts:
       w1   w2   w3   w4
  N    0.4  0.3  0.2  0.2
  V    0.1  0.6  0.3  0.3

Transition Counts (from labeled data):
          N   V   end
  start   2   0   0
  N       1   2   2
  V       2   1   0

Emission Counts:
       w1  w2  w3  w4
  N    2   0   1   2
  V    0   2   1   0

Mixed Transition Counts (expected + labeled):
          N    V    end
  start  3.8  0.1  0.1
  N      2.5  2.8  2.1
  V      3.4  2.1  0.4

Mixed Emission Counts (expected + labeled):
       w1   w2   w3   w4
  N    2.4  0.3  1.2  2.2
  V    0.1  2.6  1.3  0.3
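The "mixed" tables are just the elementwise sum of the labeled and expected counts; a small NumPy check on the transition tables above (array layout is ours):

import numpy as np

# rows: start, N, V; columns: N, V, end (values from the tables above)
expected = np.array([[1.8, 0.1, 0.1],
                     [1.5, 0.8, 0.1],
                     [1.4, 1.1, 0.4]])
labeled = np.array([[2.0, 0.0, 0.0],
                    [1.0, 2.0, 2.0],
                    [2.0, 1.0, 0.0]])

mixed = expected + labeled                           # the Mixed Transition Counts
p_trans = mixed / mixed.sum(axis=1, keepdims=True)   # M-step: normalize each row
print(p_trans)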

SLIDE 8

Outline

Review: EM for HMMs
Machine Translation Alignment
Limited Sequence Models
  • Maximum Entropy Markov Models
  • Conditional Random Fields
Recurrent Neural Networks
  • Basic Definitions
  • Example in PyTorch

SLIDE 9

Warren Weaver’s Note

When I look at an article in Russian, I say “This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.” (Warren Weaver, 1947)

http://www.mt-archive.info/Weaver-1949.pdf

Slides courtesy Rebecca Knowles

SLIDE 10

Noisy Channel Model

(figure: the observed Russian word язы́к is decoded into candidate English words (language, speak, text, word), and the candidates are then reranked)

Decode: the observed Russian (noisy) text goes through the translation/decode model.
Rerank: the (clean) English language model rescores the candidates.
The underlying message is assumed to be written in (clean) English.

Slides courtesy Rebecca Knowles
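In equation form, the picture is the standard noisy-channel decomposition (a well-known identity, spelled out here for reference rather than copied from the slide): the decoder and reranker together search for

ê = argmax_e p(e | f) = argmax_e p(f | e) p(e)

where p(f | e) is the translation/decode model and p(e) is the (clean) English language model.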


SLIDE 13

Translation

Translate French (observed) into English:

The cat is on the chair. Le chat est sur la chaise.

Slides courtesy Rebecca Knowles


SLIDE 16

Alignment

The cat is on the chair.
Le chat est sur la chaise.

(figure: the sentence pair shown twice, first with the alignment unknown ("?"), then with word-alignment links drawn in)

Slides courtesy Rebecca Knowles

SLIDE 17

Parallel Texts

Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world, Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people, Whereas it is essential, if man is not to be compelled to have recourse, as a last resort, to rebellion against tyranny and oppression, that human rights should be protected by the rule of law, Whereas it is essential to promote the development of friendly relations between nations, …

http://www.un.org/en/universal-declaration-human-rights/

Yolki, pampa ni tlatepanitalotl, ni tlasenkauajkayotl iuan ni kuali nemilistli ipan ni tlalpan, yaya ni moneki moixmatis uan monemilis, ijkinoj nochi kuali tiitstosej ika touampoyouaj. Pampa tlaj amo tikixmatij tlatepanitalistli uan tlen kuali nemilistli ipan ni tlalpan, yeka onkatok kualantli, onkatok tlateuilistli, onkatok majmajtli uan sekinok tlamantli teixpanolistli; yeka moneki ma kuali timouikakaj ika nochi touampoyouaj, ma amo onkaj majmajyotl uan teixpanolistli; moneki ma onkaj yejyektlalistli, ma titlajtlajtokaj uan ma tijneltokakaj tlen tojuantij tijnekij tijneltokasej uan amo tlen ma topanti, kenke, pampa tijnekij ma onkaj tlatepanitalistli. Pampa ni tlatepanitalotl moneki ma tiyejyekokaj, ma tijchiuakaj uan ma tijmanauikaj; ma nojkia kiixmatikaj tekiuajtinij, uejueyij tekiuajtinij, ijkinoj amo onkas nopeka se akajya touampoj san tlen ueli kinekis techchiuilis, technauatis, kinekis technauatis ma tijchiuakaj se tlamantli tlen amo kuali; yeka ni tlatepanitalotl tlauel moneki ipan tonemilis ni tlalpan. Pampa nojkia tlauel moneki ma kuali timouikakaj, ma tielikaj keuak tiiknimej, nochi tlen tlakamej uan siuamej tlen tiitstokej ni tlalpan.

http://www.ohchr.org/EN/UDHR/Pages/Language.aspx?LangID=nhn

Slides courtesy Rebecca Knowles

SLIDE 18

Preprocessing

(the English/Nahuatl parallel texts from the previous slide, shown again)

  • Sentence align
  • Clean corpus
  • Tokenize
  • Handle case
  • Word segmentation (morphological, BPE, etc.)
  • Language-specific preprocessing (example: pre-reordering)
  • ...

Slides courtesy Rebecca Knowles
SLIDE 19

Alignments

If we had word-aligned text, we could easily estimate P(f|e). But we don’t usually have word alignments, and they are expensive to produce by hand… If we had P(f|e) we could produce alignments automatically.

Slides courtesy Rebecca Knowles

SLIDE 20

IBM Model 1 (1993)

  • Lexical Translation Model
  • Word Alignment Model
  • The simplest of the original IBM models
  • For all IBM models, see the original paper (Brown et al., 1993): http://www.aclweb.org/anthology/J93-2003

Slides courtesy Rebecca Knowles

SLIDE 21

Simplified IBM 1

We'll work through an example with a simplified version of IBM Model 1.

Figures and examples are drawn from A Statistical MT Tutorial Workbook, Section 27 (Knight, 1999).

Simplifying assumption: each source word must translate to exactly one target word, and vice versa.

Slides courtesy Rebecca Knowles

SLIDE 22

IBM Model 1 (1993)

f: vector of French words
e: vector of English words
a: vector of alignment indices

(visualization of alignment)
f: Le chat est sur la chaise verte
e: The cat is on the green chair
a: 0 1 2 3 4 6 5

Slides courtesy Rebecca Knowles

SLIDE 23

IBM Model 1 (1993)

f: vector of French words
e: vector of English words
a: vector of alignment indices
t(fj | ei): translation probability of the word fj given the word ei

(visualization of alignment)
f: Le chat est sur la chaise verte
e: The cat is on the green chair
a: 0 1 2 3 4 6 5

Slides courtesy Rebecca Knowles

SLIDE 24

Model and Parameters

Want: P(f | e). But we don't know how to train this directly…
Solution: use P(a, f | e), where a is an alignment.
Remember: P(f | e) = Σa P(a, f | e).

Slides courtesy Rebecca Knowles

SLIDE 25

Model and Parameters: Intuition

Translation probability: t(fj | ei). Interpretation: how probable is it that we see fj given ei?

Slides courtesy Rebecca Knowles

SLIDE 26

Model and Parameters: Intuition

Alignment/translation probability: P(a, f | e). Example (visual representation of a), for "le chat" / "the cat":

P(crossed alignment le-cat, chat-the | "the cat") < P(monotone alignment le-the, chat-cat | "the cat")

Interpretation: how probable are the alignment a and the translation f (given e)?

Slides courtesy Rebecca Knowles

SLIDE 27

Model and Parameters: Intuition

Alignment probability: P(a | e, f). Example, for "le chat" / "the cat":

P(crossed alignment le-cat, chat-the | "le chat", "the cat") < P(monotone alignment le-the, chat-cat | "le chat", "the cat")

Interpretation: how probable is alignment a (given e and f)?

Slides courtesy Rebecca Knowles

SLIDE 28

Model and Parameters

How to compute (in the simplified model):

P(a, f | e) = ∏j t(fj | e_aj)
P(f | e) = Σa P(a, f | e)
P(a | e, f) = P(a, f | e) / P(f | e)

Slides courtesy Rebecca Knowles

SLIDE 29

Parameters

For IBM Model 1, we can compute all parameters given the translation parameters t(f | e). How many of these are there?

Slides courtesy Rebecca Knowles

SLIDE 30

Parameters

For IBM Model 1, we can compute all parameters given the translation parameters t(f | e). How many of these are there? |French vocabulary| × |English vocabulary|

Slides courtesy Rebecca Knowles

SLIDE 31

Data

Two sentence pairs:

English    French
b c        x y
b          y

Slides courtesy Rebecca Knowles

SLIDE 32

All Possible Alignments

(French words: x, y; English words: b, c)

(figure: all possible alignments; for "b c" ↔ "x y", either b-x with c-y, or b-y with c-x; for "b" ↔ "y", just b-y)

Remember: simplifying assumption that each word must be aligned exactly once.

Slides courtesy Rebecca Knowles

SLIDE 33

Expectation Maximization (EM)

Two-step, iterative algorithm:

  • 0. Assume some value for the translation parameters t(f | e) and compute other parameter values
  • 1. E-step: count alignments and translations under uncertainty, assuming these parameters
  • 2. M-step: maximize log-likelihood (update parameters), using these uncertain (estimated) counts

(figure: the two alignment probabilities P(a | "le chat", "the cat"), one for each possible alignment)

Slides courtesy Rebecca Knowles
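A runnable sketch of this loop for the simplified model on the toy corpus above (our own code, following the flavor of Knight's Section 27 example; the one-to-one assumption lets us enumerate alignments as permutations):

from collections import defaultdict
from itertools import permutations

corpus = [(["b", "c"], ["x", "y"]),   # English "b c" / French "x y"
          (["b"], ["y"])]             # English "b"   / French "y"

t = defaultdict(lambda: 0.5)          # t(f | e), initialized uniformly

for iteration in range(10):
    count = defaultdict(float)        # fractional count of (f, e) pairs
    total = defaultdict(float)        # fractional count of e
    for english, french in corpus:
        alignments = list(permutations(range(len(english))))
        # P(a, f | e) = product over j of t(f_j | e_{a_j}) under current t
        probs = []
        for a in alignments:
            p = 1.0
            for j, i in enumerate(a):
                p *= t[(french[j], english[i])]
            probs.append(p)
        z = sum(probs)
        # E-step: collect counts weighted by P(a | e, f) = P(a, f | e) / z
        for a, p in zip(alignments, probs):
            for j, i in enumerate(a):
                count[(french[j], english[i])] += p / z
                total[english[i]] += p / z
    # M-step: renormalize to get new translation parameters
    for (f, e) in count:
        t[(f, e)] = count[(f, e)] / total[e]

print(t[("y", "b")])   # converges toward 1: "b" translates to "y"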

SLIDE 34

Review of IBM Model 1 & EM

Iteratively learned an alignment/translation model from sentence-aligned text (without "gold standard" alignments). The model can now be used for alignment and/or word-level translation. We explored a simplified version of this; IBM Model 1 allows more types of alignments.

Slides courtesy Rebecca Knowles

SLIDE 35

Why is Model 1 insufficient?

Why won’t this produce great translations?

  • Indifferent to order (a language model may help?)
  • Translates one word at a time
  • Translates each word in isolation
  • ...

Slides courtesy Rebecca Knowles

SLIDE 36

Uses for Alignments

  • Component of machine translation systems
  • Produce a translation lexicon automatically
  • Cross-lingual projection/extraction of information
  • Supervision for training other models (for example, neural MT systems)

Slides courtesy Rebecca Knowles

SLIDE 37

Evaluating Machine Translation

Human evaluations: test set (source, human reference translations, MT output). Humans judge the quality of MT output (in one of several possible ways).

Koehn (2017), http://mt-class.org/jhu/slides/lecture-evaluation.pdf

Slides courtesy Rebecca Knowles

SLIDE 38

Evaluating Machine Translation

Automatic evaluations: test set (source, human reference translations, MT output); aim to mimic (correlate with) human evaluations.

Many metrics:
  • TER (Translation Error/Edit Rate)
  • HTER (Human-Targeted Translation Edit Rate)
  • BLEU (Bilingual Evaluation Understudy)
  • METEOR (Metric for Evaluation of Translation with Explicit Ordering)

Slides courtesy Rebecca Knowles

SLIDE 39

Machine Translation Alignment Now

  • Explicitly, with fancier IBM models
  • Implicitly, learned jointly with attention in recurrent neural networks (RNNs)

SLIDE 40

Outline

Review: EM for HMMs
Machine Translation Alignment
Limited Sequence Models
  • Maximum Entropy Markov Models
  • Conditional Random Fields
Recurrent Neural Networks
  • Basic Definitions
  • Example in PyTorch

SLIDE 41

Recall: N-gram to Maxent to Neural Language Models

predict the next word given some context…

p(wi | wi-3, wi-2, wi-1) ∝ count(wi-3, wi-2, wi-1, wi)

compute beliefs about what is likely…

SLIDE 42

Recall: N-gram to Maxent to Neural Language Models

predict the next word given some context… compute beliefs about what is likely…

p(wi | wi-3, wi-2, wi-1) = softmax(θ ⋅ f(wi-3, wi-2, wi-1, wi))

SLIDE 43

Hidden Markov Model Representation

p(z1, w1, z2, w2, …, zN, wN) = p(z1 | z0) p(w1 | z1) ⋯ p(zN | zN-1) p(wN | zN) = ∏j p(wj | zj) p(zj | zj-1)

emission probabilities/parameters: p(wj | zj); transition probabilities/parameters: p(zj | zj-1)

(graph: states z1 → z2 → z3 → z4, each state zi generating its word wi)

represent the probabilities and independence assumptions in a graph

SLIDE 44

A Different Model’s Representation

(graph: the arrows reversed relative to the HMM; each state zi depends on the previous state zi-1 and on its observed word wi)

represent the probabilities and independence assumptions in a graph

SLIDE 45

A Different Model’s Representation

(graph as on the previous slide)

represent the probabilities and independence assumptions in a graph

p(z1, z2, …, zN | w1, w2, …, wN) = p(z1 | z0, w1) ⋯ p(zN | zN-1, wN) = ∏j p(zj | zj-1, wj)

SLIDE 46

A Different Model’s Representation

(graph as before)

represent the probabilities and independence assumptions in a graph

p(zj | zj-1, wj) ∝ exp(θᵀ f(wj, zj-1, zj))

p(z1, z2, …, zN | w1, w2, …, wN) = p(z1 | z0, w1) ⋯ p(zN | zN-1, wN) = ∏j p(zj | zj-1, wj)

SLIDE 47

A Different Model’s Representation

(graph as before)

represent the probabilities and independence assumptions in a graph

Maximum Entropy Markov Model (MEMM)

p(zj | zj-1, wj) ∝ exp(θᵀ f(wj, zj-1, zj))

p(z1, z2, …, zN | w1, w2, …, wN) = p(z1 | z0, w1) ⋯ p(zN | zN-1, wN) = ∏j p(zj | zj-1, wj)

SLIDE 48

MEMMs

Discriminative: don't care about generating the observed sequence at all
Maxent: use features
Problem: the label-bias problem

(graph as before)
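To make "locally normalized" concrete before the label-bias discussion, here is a toy sketch of the MEMM's per-step distribution p(zj | zj-1, wj); the features and weights are invented for illustration:

import numpy as np

TAGS = ["N", "V"]

def f(word, prev_tag, tag):
    # toy indicator features (our invention)
    return np.array([
        word.endswith("s") and tag == "V",
        prev_tag == "N" and tag == "V",
        word[0].isupper() and tag == "N",
    ], dtype=float)

def local_prob(theta, word, prev_tag):
    scores = np.array([theta @ f(word, prev_tag, tag) for tag in TAGS])
    exp = np.exp(scores - scores.max())   # softmax, numerically stabilized
    # normalized over the tags at THIS step only; this per-step
    # normalization is what creates the label-bias problem below
    return exp / exp.sum()

theta = np.array([1.2, 0.8, 0.5])
print(dict(zip(TAGS, local_prob(theta, "runs", "N"))))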

SLIDE 49

Label-Bias Problem

(figure: a single MEMM step, state zi with its observation wi)

SLIDE 50

Label-Bias Problem

(figure: the same step, with incoming arcs labeled 1)

incoming mass must sum to 1

SLIDE 51

Label-Bias Problem

(figure: the same step, with incoming and outgoing arcs labeled 1)

incoming mass must sum to 1
outgoing mass must sum to 1

SLIDE 52

Label-Bias Problem

(figure as before)

incoming mass must sum to 1
outgoing mass must sum to 1
observe, but do not generate (explain), the observation

Take-aways:
  • the model can learn to ignore observations
  • the model can get itself stuck on "bad" paths
SLIDE 53

Outline

Review: EM for HMMs
Machine Translation Alignment
Limited Sequence Models
  • Maximum Entropy Markov Models
  • Conditional Random Fields
Recurrent Neural Networks
  • Basic Definitions
  • Example in PyTorch

SLIDE 54

(Linear Chain) Conditional Random Fields

Discriminative: don't care about generating the observed sequence at all
Condition on the entire observed word sequence w1…wN
Maxent: use features
Solves the label-bias problem

(graph: each state zj linked to zj-1 and to the whole observation sequence w1 … wN)

SLIDE 55

(Linear Chain) Conditional Random Fields

(graph as before)

p(z1, …, zN | w1, …, wN) ∝ ∏j exp(θᵀ f(zj-1, zj, w1, …, wN))

SLIDE 56

(Linear Chain) Conditional Random Fields

(graph as before)

p(z1, …, zN | w1, …, wN) ∝ ∏j exp(θᵀ f(zj-1, zj, w1, …, wN))

condition on the entire sequence
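In contrast to the MEMM's per-step softmax, the CRF normalizes once, globally, over all tag sequences. A brute-force sketch (toy features and weights of our own; a real implementation computes Z with the forward algorithm):

import numpy as np
from itertools import product

TAGS = ["N", "V"]

def f(prev_tag, tag, words, j):
    # toy features; note they may look at the WHOLE sentence, not just words[j]
    return np.array([
        words[j].endswith("s") and tag == "V",
        prev_tag == "N" and tag == "V",
        j + 1 < len(words) and tag == "N",
    ], dtype=float)

def score(theta, tags, words):
    # sum of per-edge scores theta . f(z_{j-1}, z_j, w_1 ... w_N)
    total, prev = 0.0, "<s>"
    for j, tag in enumerate(tags):
        total += theta @ f(prev, tag, words, j)
        prev = tag
    return total

def sequence_prob(theta, tags, words):
    # the partition function Z sums over ALL K^N tag sequences (global
    # normalization); brute force here purely for illustration
    Z = sum(np.exp(score(theta, seq, words))
            for seq in product(TAGS, repeat=len(words)))
    return np.exp(score(theta, tags, words)) / Z

theta = np.array([1.0, 0.5, 0.3])
print(sequence_prob(theta, ["N", "V"], ["he", "runs"]))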

SLIDE 57

Conditional vs. Sequence

CRF Tutorial, Fig 1.2, Sutton & McCallum (2012)


SLIDE 61

Outline

Review: EM for HMMs
Machine Translation Alignment
Limited Sequence Models
  • Maximum Entropy Markov Models
  • Conditional Random Fields
Recurrent Neural Networks
  • Basic Definitions
  • Example in PyTorch

SLIDE 62

Recall: N-gram to Maxent to Neural Language Models

predict the next word given some context… compute beliefs about what is likely…

p(wi | wi-3, wi-2, wi-1) = softmax(θwi ⋅ f(wi-3, wi-2, wi-1))

create/use "distributed representations" ei-3, ei-2, ei-1 (word embeddings ew); combine these representations: C = f(…), a matrix-vector product; score with the output parameters θwi

SLIDE 63

A More Typical View of Recurrent Neural Language Modeling

(figure: an unrolled RNN; inputs wi-3 … wi feed hidden states hi-3 … hi, which predict wi-2 … wi+1)

SLIDE 64

A More Typical View of Recurrent Neural Language Modeling

(figure as before: the unrolled RNN)

observe these words one at a time
SLIDE 65

A More Typical View of Recurrent Neural Language Modeling

(figure as before)

observe these words one at a time
predict the next word

SLIDE 66

A More Typical View of Recurrent Neural Language Modeling

(figure as before)

observe these words one at a time
predict the next word from these hidden states

SLIDE 67

A More Typical View of Recurrent Neural Language Modeling

(figure as before, with one input/hidden/output unit boxed and labeled as a "cell")

observe these words one at a time
predict the next word from these hidden states

SLIDE 68

A Recurrent Neural Network Cell

(figure: two steps of one cell; inputs wi-1, wi; hidden states hi-1, hi; predictions wi, wi+1)

SLIDE 69

A Recurrent Neural Network Cell

(figure as before, with the hidden-to-hidden connections labeled W)

SLIDE 70

A Recurrent Neural Network Cell

(figure as before; the input-to-hidden connections, the encoding, are labeled U)

SLIDE 71

A Recurrent Neural Network Cell

(figure as before; the hidden-to-output connections, the decoding, are labeled S)

SLIDE 72

A Simple Recurrent Neural Network Cell

(figure as before, with matrices W, U, S)

hi = σ(W hi-1 + U wi)

SLIDE 73

A Simple Recurrent Neural Network Cell

(figure as before)

hi = σ(W hi-1 + U wi)
σ(x) = 1 / (1 + exp(−x))


SLIDE 76

A Simple Recurrent Neural Network Cell

(figure as before)

hi = σ(W hi-1 + U wi)
σ(x) = 1 / (1 + exp(−x))
ŵi+1 = softmax(S hi)
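The two equations transcribe directly into NumPy; in this sketch the sizes and the random initialization are placeholders of our own:

import numpy as np

rng = np.random.default_rng(0)
V, H = 10, 4                       # vocabulary size, hidden size (made up)
W = rng.normal(size=(H, H))        # hidden-to-hidden (recurrence)
U = rng.normal(size=(H, V))        # input-to-hidden (encoding)
S = rng.normal(size=(V, H))        # hidden-to-output (decoding)

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

h = np.zeros(H)
for w_index in [1, 5, 2]:          # a toy sequence of word indices
    w = np.eye(V)[w_index]         # one-hot vector for the current word
    h = sigma(W @ h + U @ w)       # h_i = sigma(W h_{i-1} + U w_i)
    w_hat = softmax(S @ h)         # predicted distribution over the next word
print(w_hat)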

SLIDE 77

A Simple Recurrent Neural Network Cell

(figure as before)

hi = σ(W hi-1 + U wi)
ŵi+1 = softmax(S hi)

must learn matrices U, S, W

SLIDE 78

A Simple Recurrent Neural Network Cell

(figure as before)

hi = σ(W hi-1 + U wi)
ŵi+1 = softmax(S hi)

must learn matrices U, S, W
suggested solution: gradient descent on prediction ability

SLIDE 79

A Simple Recurrent Neural Network Cell

(figure as before)

hi = σ(W hi-1 + U wi)
ŵi+1 = softmax(S hi)

must learn matrices U, S, W
suggested solution: gradient descent on prediction ability
problem: they're tied across inputs/timesteps

SLIDE 80

A Simple Recurrent Neural Network Cell

(figure as before)

hi = σ(W hi-1 + U wi)
ŵi+1 = softmax(S hi)

must learn matrices U, S, W
suggested solution: gradient descent on prediction ability
problem: they're tied across inputs/timesteps
good news for you: many toolkits do this automatically

SLIDE 81

Why Is Training RNNs Hard?

Conceptually, it can get strange. But really, getting the gradient just requires many applications of the chain rule for derivatives.

SLIDE 82

Why Is Training RNNs Hard?

Conceptually, it can get strange. But really, getting the gradient just requires many applications of the chain rule for derivatives.

Vanishing gradients: we multiply by the same matrices at each timestep → the gradients involve products of many matrices.

SLIDE 83

Why Is Training RNNs Hard?

Conceptually, it can get strange. But really, getting the gradient just requires many applications of the chain rule for derivatives.

Vanishing gradients: we multiply by the same matrices at each timestep → the gradients involve products of many matrices.

One solution: clip the gradients to a max value.
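In PyTorch, the clipping fix is one line between backward() and the update; a minimal sketch, where the model, data, loss, and threshold are all arbitrary stand-ins:

import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(20, 1, 8)              # (seq_len, batch, input_size)
output, h_n = model(x)
loss = output.pow(2).mean()            # stand-in loss for illustration

optimizer.zero_grad()
loss.backward()
# clip: rescale gradients so their total norm is at most 5.0
# (torch.nn.utils.clip_grad_value_ instead caps each entry at a max value)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()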

SLIDE 84

Outline

Review: EM for HMMs
Machine Translation Alignment
Limited Sequence Models
  • Maximum Entropy Markov Models
  • Conditional Random Fields
Recurrent Neural Networks
  • Basic Definitions
  • Example in PyTorch

SLIDE 85

Natural Language Processing

from torch import *
from keras import *

SLIDE 86

Pick Your Toolkit

PyTorch, Deeplearning4j, TensorFlow, DyNet, Caffe, Keras, MxNet, Gluon, CNTK, …

Comparisons:
  • https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software
  • https://deeplearning4j.org/compare-dl4j-tensorflow-pytorch
  • https://github.com/zer0n/deepframeworks (older, from 2015)

SLIDE 87

Defining A Simple RNN in Python (Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

(the tutorial's model code, stepped through across several slides, highlighting the encode and decode steps; the cell unrolls over inputs wi-2, wi-1, wi with hidden states hi-2, hi-1, hi and predictions wi-1, wi, wi+1)
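The slide's code screenshots are not preserved here; the module the tutorial defines looks roughly like this (a sketch, so the lecture's "modified very slightly" version may differ in detail):

import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        # encode: combine the input with the previous hidden state
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        # decode: map the combined vector to scores over outputs
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.softmax(self.i2o(combined))
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)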

slide-88
SLIDE 88

Defining A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

wi-2 wi-1 wi-1 wi wi wi+1 hi-2 hi-1 hi

slide-89
SLIDE 89

Defining A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

wi-2 wi-1 wi-1 wi wi wi+1 hi-2 hi-1 hi

slide-90
SLIDE 90

Defining A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

slide-91
SLIDE 91

Defining A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

encode

wi-2 wi-1 wi-1 wi wi wi+1 hi-2 hi-1 hi

slide-92
SLIDE 92

Defining A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

decode

wi-2 wi-1 wi-1 wi wi wi+1 hi-2 hi-1 hi

SLIDE 93

Training A Simple RNN in Python (Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

(the tutorial's training code, stepped through across several slides: the negative log-likelihood loss, getting predictions, evaluating predictions, computing the gradient, and performing SGD)
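Likewise, a sketch of the tutorial's training step, annotated with the labels the slides step through (it assumes the RNN module sketched above):

import torch.nn as nn

criterion = nn.NLLLoss()               # negative log-likelihood
learning_rate = 0.005

def train(target_tensor, line_tensor, rnn):
    hidden = rnn.initHidden()
    rnn.zero_grad()
    # get predictions: run the cell over the input, one step at a time
    for i in range(line_tensor.size(0)):
        output, hidden = rnn(line_tensor[i], hidden)
    # eval predictions: loss of the final prediction against the target
    loss = criterion(output, target_tensor)
    # compute gradient
    loss.backward()
    # perform SGD: move each parameter against its gradient
    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)
    return output, loss.item()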


SLIDE 99

Another Solution: LSTMs/GRUs

LSTM: Long Short-Term Memory (Hochreiter & Schmidhuber, 1997)
GRU: Gated Recurrent Unit (Cho et al., 2014)

Basic idea: learn to forget

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

(figure: an LSTM cell, with the forget line and the representation line labeled)

SLIDE 100

Outline

Review: EM for HMMs
Machine Translation Alignment
Limited Sequence Models
  • Maximum Entropy Markov Models
  • Conditional Random Fields
Recurrent Neural Networks
  • Basic Definitions
  • Example in PyTorch