Algorithms for NLP: Machine Translation II. Yulia Tsvetkov, CMU.



SLIDE 1

Machine Translation II

Yulia Tsvetkov – CMU. Slides: Philipp Koehn – JHU; Chris Dyer – DeepMind

Algorithms for NLP

SLIDE 2

MT is Hard

Ambiguities
▪ words
▪ morphology
▪ syntax
▪ semantics
▪ pragmatics

SLIDE 3

Levels of Transfer

SLIDE 4

Two Views of Statistical MT

▪ Direct modeling (aka pattern matching)

▪ I have really good learning algorithms and a bunch of example inputs (source language sentences) and outputs (target language translations)

▪ Code breaking (aka the noisy channel, Bayes rule)

▪ I know the target language
▪ I have example translated texts (example enciphered data)

SLIDE 5

MT as Direct Modeling

▪ one model does everything
▪ trained to reproduce a corpus of translations

SLIDE 6

Noisy Channel Model

SLIDE 7

Which is better?

▪ Noisy channel
  ▪ easy to use monolingual target language data
  ▪ search happens under a product of two models (individual models can be simple, the product can be powerful)
  ▪ obtaining probabilities requires renormalizing
▪ Direct model
  ▪ directly model the process you care about
  ▪ model must be very powerful

SLIDE 8

Centauri-Arcturan Parallel Text

SLIDE 9

Noisy Channel Model: Phrase-Based MT

[Pipeline diagram: a Translation Model (source phrase, target phrase, translation features) is estimated from a parallel corpus; a Language Model over e from a monolingual corpus; the Reranking Model's feature weights are tuned on a held-out parallel corpus.]

SLIDE 10

Phrase-Based MT

[Same pipeline diagram as the previous slide: Translation Model from a parallel corpus, Language Model from a monolingual corpus, Reranking Model feature weights from a held-out parallel corpus.]

SLIDE 11

Phrase-Based Translation

SLIDE 12

Phrase-Based System Overview

Sentence-aligned corpus → word alignments → phrase table (translation model):

cat ||| chat ||| 0.9
the cat ||| le chat ||| 0.8
dog ||| chien ||| 0.8
house ||| maison ||| 0.6
my house ||| ma maison ||| 0.9
language ||| langue ||| 0.9
…
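Entries in this `src ||| tgt ||| prob` format can be loaded with a few lines of Python; `parse_phrase_table` is our own hypothetical helper name, and the entries repeat the slide's examples.

```python
# Minimal sketch: read "src ||| tgt ||| prob" phrase-table entries into a dict
# mapping each source phrase to its (target phrase, probability) options.
def parse_phrase_table(lines):
    table = {}
    for line in lines:
        src, tgt, prob = (field.strip() for field in line.split("|||"))
        table.setdefault(src, []).append((tgt, float(prob)))
    return table

entries = [
    "cat ||| chat ||| 0.9",
    "the cat ||| le chat ||| 0.8",
    "my house ||| ma maison ||| 0.9",
]
table = parse_phrase_table(entries)
print(table["cat"])  # [('chat', 0.9)]
```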

SLIDE 13

Lexical Translation

▪ How do we translate a word? Look it up in the dictionary:
  Haus — house, building, home, household, shell
▪ Multiple translations
  ▪ some more frequent than others
  ▪ different word senses, different registers, different inflections (?)
  ▪ house, home are common
  ▪ shell is specialized (the Haus of a snail is a shell)

SLIDE 14

How common is each?

Look at a parallel corpus (German text along with English translation)

SLIDE 15

Estimate Translation Probabilities

Maximum likelihood estimation
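Concretely, the maximum likelihood estimate is the relative frequency t(e | f) = count(f, e) / count(f) over aligned word pairs. A minimal sketch with invented counts echoing the Haus example (not real corpus statistics):

```python
from collections import Counter

# MLE of lexical translation probabilities from word-aligned pairs:
# t(e|f) = count(f, e) / count(f). The pair list is a toy illustration.
aligned_pairs = [
    ("Haus", "house"), ("Haus", "house"), ("Haus", "building"),
    ("Haus", "home"), ("Haus", "house"), ("Haus", "shell"),
]
pair_counts = Counter(aligned_pairs)
src_counts = Counter(f for f, _ in aligned_pairs)

def t(e, f):
    return pair_counts[(f, e)] / src_counts[f]

print(t("house", "Haus"))  # 0.5  (3 of the 6 aligned occurrences)
```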

SLIDE 16

▪ Goal: a model p(e | f)
▪ where e and f are complete English and foreign sentences

Lexical Translation

SLIDE 17

Alignment Function

▪ In a parallel text (or when we translate), we align words in one language with the words in the other
▪ Alignments are represented as vectors of positions:

SLIDE 18

▪ Formalizing alignment with an alignment function
▪ Mapping an English target word at position i to a German source word at position j with a function a : i → j
▪ Example

Alignment Function
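For instance, a monotone alignment of a toy German–English pair (our own example, not from the slides) is just a vector of positions:

```python
# An alignment stored as a vector of positions: a[j] holds the 1-based source
# position aligned to target word j (0 would denote the NULL word).
src = ["das", "Haus", "ist", "klein"]
tgt = ["the", "house", "is", "small"]
a = [1, 2, 3, 4]  # monotone alignment: a(1)=1, a(2)=2, ...

links = [(tgt[j], src[a[j] - 1]) for j in range(len(tgt))]
print(links)  # [('the', 'das'), ('house', 'Haus'), ('is', 'ist'), ('small', 'klein')]
```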

SLIDE 19

Reordering

▪ Words may be reordered during translation.

SLIDE 20

One-to-many Translation

▪ A source word may translate into more than one target word

SLIDE 21

Word Dropping

▪ A source word may not be translated at all

SLIDE 22

Word Insertion

▪ Words may be inserted during translation

▪ English just does not have an equivalent
▪ But it must be explained: we typically assume every source sentence contains a NULL token

SLIDE 23

Many-to-one Translation

▪ More than one source word may translate together as a unit; lexical translation cannot capture this (no many-to-one mapping)

SLIDE 24

Mary did not slap the green witch

?

Generative Story

SLIDE 25

Generative Story

Mary did not slap the green witch
Mary not slap slap slap the green witch

SLIDE 26

Generative Story

Mary did not slap the green witch
Mary not slap slap slap the green witch    [n(3|slap): fertility]

SLIDE 27

Generative Story

Mary did not slap the green witch
Mary not slap slap slap the green witch    [n(3|slap): fertility]
Mary not slap slap slap NULL the green witch

SLIDE 28

Generative Story

Mary did not slap the green witch
Mary not slap slap slap the green witch    [n(3|slap): fertility]
Mary not slap slap slap NULL the green witch    [P(NULL): NULL insertion]

SLIDE 29

Generative Story

Mary did not slap the green witch
Mary not slap slap slap the green witch    [n(3|slap): fertility]
Mary not slap slap slap NULL the green witch    [P(NULL): NULL insertion]
Mary no daba una bofetada a la verde bruja

SLIDE 30

Generative Story

Mary did not slap the green witch
Mary not slap slap slap the green witch    [n(3|slap): fertility]
Mary not slap slap slap NULL the green witch    [P(NULL): NULL insertion]
Mary no daba una bofetada a la verde bruja    [t(la|the): lexical translation]

SLIDE 31

Generative Story

Mary did not slap the green witch
Mary not slap slap slap the green witch    [n(3|slap): fertility]
Mary not slap slap slap NULL the green witch    [P(NULL): NULL insertion]
Mary no daba una bofetada a la verde bruja
_ _ _ _ _ _ _ _ _    [t(la|the): lexical translation]

SLIDE 32

Generative Story

Mary did not slap the green witch
Mary not slap slap slap the green witch    [n(3|slap): fertility]
Mary not slap slap slap NULL the green witch    [P(NULL): NULL insertion]
Mary no daba una bofetada a la verde bruja
_ _ _ _ _ _ _ _ _    [t(la|the): lexical translation; d(j|i): distortion]

SLIDE 33

The IBM Models 1–5 (Brown et al., 1993)

Mary did not slap the green witch
Mary not slap slap slap the green witch    [n(3|slap): fertility]
Mary not slap slap slap NULL the green witch    [P(NULL): NULL insertion]
Mary no daba una bofetada a la verde bruja    [t(la|the): lexical translation]
Mary no daba una bofetada a la bruja verde    [d(j|i): distortion]

[from Al-Onaizan and Knight, 1998]

SLIDE 34

Alignment Models

▪ IBM Model 1: lexical translation
▪ IBM Model 2: alignment model, global monotonicity
▪ HMM model: local monotonicity
▪ fastalign: efficient reparametrization of Model 2
▪ IBM Model 3: fertility
▪ IBM Model 4: relative alignment model
▪ IBM Model 5: deficiency
▪ + many more

SLIDE 35

P(e,a|f)

P(e, a | f) = ∏ p_fertility · ∏ p_translation · ∏ p_distortion

Mary did not slap the green witch
Mary not slap slap slap the green witch    [n(3|slap): fertility]
Mary not slap slap slap NULL the green witch    [P(NULL): NULL insertion]
Mary no daba una bofetada a la verde bruja    [t(la|the): lexical translation]
Mary no daba una bofetada a la bruja verde    [d(j|i): distortion]

SLIDE 36

P(e|f)

P(e | f) = ∑_{all possible alignments a} ∏ p_fertility · ∏ p_translation · ∏ p_distortion

Mary did not slap the green witch
Mary not slap slap slap the green witch    [n(3|slap): fertility]
Mary not slap slap slap NULL the green witch    [P(NULL): NULL insertion]
Mary no daba una bofetada a la verde bruja    [t(la|the): lexical translation]
Mary no daba una bofetada a la bruja verde    [d(j|i): distortion]

SLIDE 37

IBM Model 1

▪ Generative model: break up the translation process into smaller steps
▪ Simplest possible lexical translation model
▪ Additional assumptions
  ▪ all alignment decisions are independent
  ▪ the alignment distribution for each a_i is uniform over all source words and NULL

SLIDE 38

IBM Model 1

▪ Translation probability

  p(e, a | f) = ϵ / (l_f + 1)^{l_e} · ∏_{j=1}^{l_e} t(e_j | f_{a(j)})

▪ for a foreign sentence f = (f_1, ..., f_{l_f}) of length l_f
▪ to an English sentence e = (e_1, ..., e_{l_e}) of length l_e
▪ with an alignment of each English word e_j to a foreign word f_i according to the alignment function a : j → i
▪ parameter ϵ is a normalization constant
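As a minimal sketch, this probability can be computed directly; the t-table values below are invented for illustration and ϵ is left at 1:

```python
# Model 1 translation probability:
#   p(e, a | f) = eps / (l_f + 1)**l_e * prod_j t(e_j | f_a(j))
# Alignment a uses 1-based source positions; 0 denotes the NULL word.
def model1_prob(e, f, a, t, eps=1.0):
    p = eps / (len(f) + 1) ** len(e)
    for j, ej in enumerate(e):
        fi = "NULL" if a[j] == 0 else f[a[j] - 1]
        p *= t.get((ej, fi), 0.0)
    return p

# made-up t-table values for a toy German-English pair
t = {("the", "das"): 0.7, ("house", "Haus"): 0.8}
print(model1_prob(["the", "house"], ["das", "Haus"], [1, 2], t))
# 1 / 3**2 * 0.7 * 0.8 ≈ 0.0622
```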

SLIDE 39

Example

SLIDE 40

Learning Lexical Translation Models

▪ We would like to estimate the lexical translation probabilities t(e|f) from a parallel corpus
▪ ... but we do not have the alignments
▪ Chicken-and-egg problem
  ▪ if we had the alignments, we could estimate the parameters of our generative model (MLE)
  ▪ if we had the parameters, we could estimate the alignments

SLIDE 41

EM Algorithm

▪ Incomplete data
  ▪ if we had complete data, we could estimate the model
  ▪ if we had the model, we could fill in the gaps in the data

▪ Expectation Maximization (EM) in a nutshell

  • 1. initialize model parameters (e.g. uniform, random)
  • 2. assign probabilities to the missing data
  • 3. estimate model parameters from completed data
  • 4. iterate steps 2–3 until convergence
SLIDE 42

EM Algorithm

▪ Initial step: all alignments equally likely
▪ Model learns that, e.g., la is often aligned with the

SLIDE 43

EM Algorithm

▪ After one iteration
▪ Alignments, e.g., between la and the are more likely

SLIDE 44

EM Algorithm

▪ After another iteration
▪ It becomes apparent that alignments, e.g., between fleur and flower are more likely (pigeonhole principle)

SLIDE 45

EM Algorithm

▪ Convergence
▪ Inherent hidden structure revealed by EM

SLIDE 46

EM Algorithm

▪ Parameter estimation from the aligned corpus

SLIDE 47

IBM Model 1 and EM

The EM algorithm consists of two steps
▪ Expectation step: apply the model to the data
  ▪ parts of the model are hidden (here: alignments)
  ▪ using the model, assign probabilities to possible values
▪ Maximization step: estimate the model from the data
  ▪ take assigned values as fact
  ▪ collect counts (weighted by lexical translation probabilities)
  ▪ estimate the model from counts
▪ Iterate these steps until convergence

SLIDE 48

IBM Model 1 and EM

▪ We need to be able to compute:

▪ Expectation step: probability of alignments
▪ Maximization step: count collection

SLIDE 49

IBM Model 1 and EM

t-table

SLIDE 52

IBM Model 1 and EM

Applying the chain rule:

t-table

SLIDE 53

IBM Model 1 and EM: Expectation Step

SLIDE 55

The Trick
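The trick being referred to is the factorization that makes Model 1's sum over alignments tractable: since each alignment decision a(j) is independent, ∑_a ∏_j t(e_j | f_{a(j)}) = ∏_j ∑_i t(e_j | f_i), turning an exponential sum into a product of small sums. A numeric check with a made-up t-table (NULL word omitted for brevity):

```python
from itertools import product

# Verify that the brute-force sum over all len(f)**len(e) alignments equals
# the factored product of per-word sums. The t-table values are invented.
t = {("x", "u"): 0.2, ("x", "v"): 0.3, ("y", "u"): 0.4, ("y", "v"): 0.1}
e, f = ["x", "y"], ["u", "v"]

brute = sum(
    t[(e[0], f[a0])] * t[(e[1], f[a1])]
    for a0, a1 in product(range(len(f)), repeat=2)
)
factored = 1.0
for ej in e:
    factored *= sum(t[(ej, fi)] for fi in f)

print(brute, factored)  # both equal 0.25 (up to float rounding)
```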

SLIDE 56

IBM Model 1 and EM: Expectation Step

SLIDE 57

IBM Model 1 and EM: Expectation Step

E-step t-table

SLIDE 58

IBM Model 1 and EM: Maximization Step

SLIDE 59

IBM Model 1 and EM: Maximization Step

E-step M-step t-table

SLIDE 60

IBM Model 1 and EM: Maximization Step

SLIDE 61

IBM Model 1 and EM: Maximization Step

Update the t-table: t(the|la) = c(the, la) / c(la)

E-step M-step t-table

SLIDE 62

IBM Model 1 and EM: Pseudocode
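The pseudocode image did not survive extraction; below is a minimal runnable sketch of the Model 1 EM loop in the spirit of Koehn's pseudocode. The three-sentence corpus is a toy illustration, and NULL alignment is omitted for brevity.

```python
from collections import defaultdict

corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]

f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}

# initialize t(e|f) uniformly
t = {(e, f): 1.0 / len(e_vocab) for e in e_vocab for f in f_vocab}

for _ in range(50):
    count = defaultdict(float)  # expected counts c(e, f)
    total = defaultdict(float)  # expected counts c(f)
    for fs, es in corpus:
        for e in es:
            # E-step: distribute e's probability mass over its possible sources
            z = sum(t[(e, f)] for f in fs)
            for f in fs:
                delta = t[(e, f)] / z
                count[(e, f)] += delta
                total[f] += delta
    # M-step: re-estimate t(e|f) from expected counts
    t = {(e, f): count[(e, f)] / total[f] for (e, f) in count}

print(t[("the", "das")])  # approaches 1.0 as EM iterates
```

Because each English word's alignment is chosen independently (the trick above), the E-step only needs the per-word normalizer z rather than an explicit sum over full alignments.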

SLIDE 63

Convergence

SLIDE 64

Problems with IBM Model 1

Mary did not slap the green witch
Mary not slap slap slap the green witch    [n(3|slap): fertility]
Mary not slap slap slap NULL the green witch    [P(NULL): NULL insertion]
Mary no daba una bofetada a la verde bruja    [t(la|the): lexical translation]
Mary no daba una bofetada a la bruja verde    [d(j|i): distortion]

SLIDE 65

IBM Model 2

Mary did not slap the green witch
Mary not slap slap slap the green witch    [n(3|slap): fertility]
Mary not slap slap slap NULL the green witch    [P(NULL): NULL insertion]
Mary no daba una bofetada a la verde bruja    [t(la|the): lexical translation]
Mary no daba una bofetada a la bruja verde    [monotonic alignment]

SLIDE 66

IBM Model 2

▪ compare with Model 1:

SLIDE 67

Higher IBM Models

SLIDE 68

The IBM Models 1–5 (Brown et al., 1993)

Mary did not slap the green witch
Mary not slap slap slap the green witch    [n(3|slap): fertility]
Mary not slap slap slap NULL the green witch    [P(NULL): NULL insertion]
Mary no daba una bofetada a la verde bruja    [t(la|the): lexical translation]
Mary no daba una bofetada a la bruja verde    [d(j|i): distortion]

[from Al-Onaizan and Knight, 1998]

SLIDE 69

Word Alignment

SLIDE 70

Word Alignment?

SLIDE 72

Word Alignment and IBM Models

▪ IBM Models create a many-to-one mapping
  ▪ words are aligned using an alignment function
  ▪ a function may return the same value for different inputs (one-to-many mapping)
  ▪ a function cannot return multiple values for one input (no many-to-one mapping)
▪ Real word alignments have many-to-many mappings

SLIDE 73

Symmetrization

SLIDE 74

Growing Heuristics

▪ Add alignment points from the union based on heuristics
▪ Popular method: grow-diag-final-and
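A sketch of the starting point for these heuristics: run the aligner in both directions, take the intersection for precision and the union for recall. The link sets below are toy data, and the full grow-diag-final-and loop is omitted for brevity.

```python
# Symmetrizing two directional word alignments represented as sets of
# (source_index, target_index) links. grow-diag-final-and starts from the
# intersection and adds neighboring links drawn from the union.
e2f = {(0, 0), (1, 1), (2, 1)}  # links from the e->f model (toy data)
f2e = {(0, 0), (1, 1), (1, 2)}  # links from the f->e model (toy data)

intersection = e2f & f2e        # high precision
union = e2f | f2e               # high recall
print(sorted(intersection))     # [(0, 0), (1, 1)]
print(sorted(union))            # [(0, 0), (1, 1), (1, 2), (2, 1)]
```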

SLIDE 75

Evaluating Alignment Models

▪ How do we measure the quality of a word-to-word model?
▪ Method 1: use it in an end-to-end translation system
  ▪ hard to measure translation quality
  ▪ option: human judges
  ▪ option: reference translations (NIST, BLEU)
  ▪ option: combinations (HTER)
  ▪ actually, no one uses word-to-word models alone as TMs
▪ Method 2: measure the quality of the alignments produced
  ▪ easy to measure
  ▪ hard to know what the gold alignments should be
  ▪ often does not correlate well with translation quality (like perplexity in LMs)

SLIDE 76

Alignment Error Rate
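The standard definition (Och and Ney), with sure links S contained in possible links P and predicted links A, is AER = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|). A toy computation with made-up link sets:

```python
# Alignment Error Rate over link sets; S (sure) is a subset of P (possible).
def aer(A, S, P):
    return 1 - (len(A & S) + len(A & P)) / (len(A) + len(S))

A = {(1, 1), (2, 2), (3, 4)}   # predicted links (toy data)
S = {(1, 1), (2, 2)}           # sure gold links
P = S | {(3, 4), (3, 3)}       # possible gold links
print(aer(A, S, P))  # 0.0: all sure links found, every prediction is possible
```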

SLIDE 81

Problems with Lexical Translation

▪ Complexity: exponential in sentence length
▪ Weak reordering: the output is not fluent
▪ Many local decisions: error propagation

SLIDE 82

Phrase-Based Translation

P(e, a | f) = p_segmentation · p_translation · p_reordering

SLIDE 83

Phrase-Based MT

[Pipeline diagram repeated: Translation Model (source phrase, target phrase, translation features) from a parallel corpus; Language Model from a monolingual corpus; Reranking Model feature weights from a held-out parallel corpus.]