Algorithms for NLP (CS 11711), Fall 2019
Lecture 21: Machine Translation I
Yulia Tsvetkov
Machine Translation
from Dream of the Red Chamber, Cao Xueqin (1792)
English: leg, foot, paw; French: jambe, pied, patte, étape
Challenges
Ambiguities
▪ words
▪ morphology
▪ syntax
▪ semantics
▪ pragmatics
Gaps in data
▪ availability of corpora
▪ commonsense knowledge
+ Understanding of context, connotation, social norms, etc.
Classical Approaches to MT
▪ Direct translation: word-by-word dictionary translation ▪ Transfer approaches: word dictionary + rules
source language text → morphological analysis → lexical transfer using bilingual dictionary → local reordering → morphological generation → target language text
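As a toy illustration of these two classical approaches (the dictionary and the reordering rule below are invented for the example, not from the lecture):

```python
# Toy word-by-word translator plus one transfer-style local reordering rule.
# The dictionary and the rule are hypothetical, for illustration only.
BILINGUAL_DICT = {"the": "la", "white": "blanca", "house": "casa"}

def translate_direct(sentence):
    """Direct translation: look each word up in a bilingual dictionary."""
    return [BILINGUAL_DICT.get(w, w) for w in sentence.lower().split()]

def reorder_adj_noun(words):
    """Transfer-style local reordering: swap an adjective-noun pair
    (e.g. English 'white house' -> Spanish 'casa blanca')."""
    out = list(words)
    for i in range(len(out) - 1):
        if out[i] == "blanca" and out[i + 1] == "casa":  # toy rule
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

print(" ".join(reorder_adj_noun(translate_direct("the white house"))))
# -> "la casa blanca"
```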
Levels of Transfer
▪ Syntactic transfer
▪ Semantic transfer
Levels of Transfer: the Vauquois Triangle (1968)
Statistical approaches
Statistical MT
Modeling correspondences between languages
Sentence-aligned parallel corpus:
Yo lo haré mañana → I will do it tomorrow
Hasta pronto → See you soon
Hasta pronto → See you around

Machine translation system: model of translation

Novel sentence: Yo lo haré pronto
Possible outputs: I will do it soon, I will do it around, See you tomorrow
Research Problems
▪ How can we formalize the process of learning to translate from examples?
▪ How can we formalize the process of finding translations for new inputs?
▪ If our model produces many outputs, how do we find the best one?
▪ If we have a gold-standard translation, how can we tell if our output is good or bad?
Two Views of MT
MT as Code Breaking
The Noisy-Channel Model

[diagram: source w → noisy channel → observed a → decoder → best guess of w]

▪ We want to predict a sentence given acoustics
▪ The noisy-channel approach: factor the problem into a channel model and a source model (see below)
  ▪ channel model (likelihood): acoustic model (HMMs) for speech, translation model for MT
  ▪ source model (prior): language model, a distribution over sequences of words (sentences)
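In symbols, using w for the word sequence and a for the observed acoustics (the standard noisy-channel derivation; the slides' own notation may differ):

w* = argmax_w p(w | a)
   = argmax_w p(a | w) · p(w) / p(a)
   = argmax_w p(a | w) · p(w)

where p(a | w) is the channel model (acoustic or translation model, the likelihood), p(w) is the source model (language model, the prior), and p(a) can be dropped because it does not depend on w.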
MT as Direct Modeling
▪ one model does everything ▪ trained to reproduce a corpus of translations
Two Views of MT
▪ Code breaking (aka the noisy channel, Bayes rule)
▪ I know the target language
▪ I have example translated texts (example enciphered data)
▪ Direct modeling (aka pattern matching)
▪ I have really good learning algorithms and a bunch of example inputs (source language sentences) and outputs (target language translations)
Which is better?
▪ Noisy channel
  ▪ easy to use monolingual target-language data
  ▪ search happens under a product of two models (individual models can be simple, the product can be powerful)
  ▪ obtaining probabilities requires renormalizing
▪ Direct model
  ▪ directly models the process you care about
  ▪ the model must be very powerful
Where are we in 2019?
▪ Direct modeling is where most of the action is
▪ Neural networks are very good at generalizing and conceptually very simple
▪ Inference in “product of two models” is hard
▪ Noisy channel ideas are incredibly important and still play a big role in how we think about translation
Two Views of MT
Noisy Channel: Phrase-Based MT
[diagram: the phrase-based pipeline]
▪ Translation model: a table of (source phrase f, target phrase e) pairs with translation features, learned from a parallel corpus
▪ Language model over e, learned from a monolingual corpus
▪ Reranking model: feature weights tuned on a held-out parallel corpus
Neural MT: Conditional Language Modeling
http://opennmt.net/
A common problem
Both models must assign probabilities to how a sentence in one language translates into a sentence in another language.
Learning from Data
Parallel corpora
Mining parallel data from microblogs (Ling et al., 2013)
http://opus.nlpl.eu
▪ There is a lot more monolingual data in the world than translated data ▪ Easy to get about 1 trillion words of English by crawling the web ▪ With some work, you can get 1 billion translated words of English-French
▪ What about Japanese-Turkish?
Centauri-Arcturan Parallel Text
Phrase-Based MT
Construction of t-table
Sentence-aligned corpus → Word alignments → Phrase table (translation model):
cat ||| chat ||| 0.9
the cat ||| le chat ||| 0.8
dog ||| chien ||| 0.8
house ||| maison ||| 0.6
my house ||| ma maison ||| 0.9
language ||| langue ||| 0.9
…
Many slides and examples from Philipp Koehn or John DeNero
Lexical Translation
Phrase-Based Translation
Word Alignment Models
Lexical Translation
▪ How do we translate a word? Look it up in the dictionary:
  Haus — house, building, home, household, shell
▪ Multiple translations
  ▪ some are more frequent than others
  ▪ different word senses, different registers, different inflections (?)
  ▪ house, home are common
▪ shell is specialized (the Haus of a snail is a shell)
How common is each?
Look at a parallel corpus (German text along with English translation)
Estimate Translation Probabilities
Maximum likelihood estimation
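In symbols, the usual relative-frequency (maximum likelihood) estimate from a word-aligned parallel corpus is (the exact notation on the slide may differ):

t_MLE(e | f) = count(f aligned to e) / count(f)

e.g. t(house | Haus) is the fraction of occurrences of Haus that are aligned to house.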
The Alignment Function
▪ Alignments can be visualized by drawing links between the two sentences, and they are represented as vectors of positions:
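For example (an illustrative sentence pair, not necessarily the one on the slide): for "das Haus ist klein" translated as "the house is small", each English position j maps to a German position i:

a : {1 → 1, 2 → 2, 3 → 3, 4 → 4}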
Reordering
▪ Words may be reordered during translation.
Word Dropping
▪ A source word may not be translated at all
Word Insertion
▪ Words may be inserted during translation
▪ English just does not have an equivalent ▪ But it must be explained - we typically assume every source sentence contains a NULL token
One-to-many Translation
▪ A source word may translate into more than one target word
Many-to-one Translation
▪ More than one source word may need to be translated as a unit; lexical translation cannot capture this
IBM Model 1
▪ Generative model: break up the translation process into smaller steps
▪ Simplest possible lexical translation model
▪ Additional assumptions:
  ▪ all alignment decisions are independent
  ▪ the alignment distribution for each a_i is uniform over all source words and NULL
IBM Model 1
▪ Translation probability (see below)
  ▪ for a foreign sentence f = (f_1, ..., f_{l_f}) of length l_f
  ▪ to an English sentence e = (e_1, ..., e_{l_e}) of length l_e
  ▪ with an alignment of each English word e_j to a foreign word f_i according to the alignment function a : j → i
  ▪ the parameter ε is a normalization constant
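The standard IBM Model 1 form of this probability, matching the description above, is:

p(e, a | f) = ε / (l_f + 1)^l_e · ∏_{j=1..l_e} t(e_j | f_a(j))

where the (l_f + 1) accounts for the NULL token and ε is the normalization constant.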
Example
▪ Goal: a model, where e and f are complete English and Foreign sentences

Lexical Translation
▪ Goal: a model, where e and f are complete English and Foreign sentences
▪ Lexical translation makes the following assumptions:
  ▪ each word e_i in e is generated from exactly one word in f
  ▪ thus, we have an alignment a_i that indicates which word e_i “came from”; specifically, it came from f_{a_i}
  ▪ given the alignments a, translation decisions are conditionally independent of each other and depend only on the aligned source word f_{a_i}
Lexical Translation
▪ Putting our assumptions together, we have:
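Under these assumptions the model factorizes over target positions (a standard reconstruction; the slide's own notation may differ):

p(e, a | f) = ∏_j p(a_j) · t(e_j | f_{a_j})

and IBM Model 1 additionally takes the alignment distribution to be uniform, p(a_j = i) = 1 / (l_f + 1).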
Estimate Translation Probabilities
Maximum likelihood estimation
▪ If we have translation probabilities, we can estimate the Viterbi alignment:
Estimate alignments given t-table
Finding the Viterbi Alignment
In model 1:
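In Model 1 the alignment prior is uniform, so the posterior factorizes over positions and the Viterbi (most probable) alignment can be found word by word (standard result):

â = argmax_a p(a | e, f)
â_j = argmax_{i in 0..l_f} t(e_j | f_i)   for each target position j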
Learning Lexical Translation Models
▪ We would like to estimate the lexical translation probabilities t(e|f) from a parallel corpus, but we do not have the alignments
▪ Chicken-and-egg problem:
  ▪ if we had the alignments → we could estimate the parameters of our generative model (MLE)
  ▪ if we had the parameters → we could estimate the alignments
EM Algorithm
▪ Incomplete data
▪ if we had complete data, we could estimate the model ▪ if we had the model, we could fill in the gaps in the data
▪ Expectation Maximization (EM) in a nutshell
1. initialize model parameters (e.g. uniform, random)
2. assign probabilities to the missing data
3. estimate model parameters from completed data
4. iterate steps 2–3 until convergence
EM Algorithm
▪ Initial step: all alignments equally likely ▪ Model learns that, e.g., la is often aligned with the
EM Algorithm
▪ After one iteration ▪ Alignments, e.g., between la and the are more likely
EM Algorithm
▪ After another iteration
▪ It becomes apparent that alignments, e.g., between fleur and flower, are more likely (pigeonhole principle)
EM Algorithm
▪ Convergence ▪ Inherent hidden structure revealed by EM
EM Algorithm
▪ Parameter estimation from the aligned corpus
IBM Model 1 and EM
▪ The EM Algorithm consists of two steps
▪ Expectation-Step: apply the model to the data
  ▪ parts of the model are hidden (here: alignments)
  ▪ using the model, assign probabilities to possible values
▪ Maximization-Step: estimate the model from the data
  ▪ take assigned values as fact
  ▪ collect counts (weighted by lexical translation probabilities)
  ▪ estimate the model from counts
▪ Iterate these steps until convergence
IBM Model 1 and EM
▪ We need to be able to compute:
▪ Expectation-Step: probability of alignments ▪ Maximization-Step: count collection
IBM Model 1 and EM
[worked example slides using the t-table]
▪ Applying the chain rule:
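Applying the chain rule relates the alignment posterior needed in the E-step to quantities the model defines:

p(a | e, f) = p(e, a | f) / p(e | f)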
IBM Model 1 and EM: Expectation Step
The Trick
[E-step worked example using the t-table]
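“The Trick” in the Model 1 E-step is the standard rearrangement that swaps the sum over alignments with the product over positions, which makes p(e | f) and the alignment posteriors cheap to compute:

p(e | f) = Σ_a p(e, a | f) = ε / (l_f + 1)^l_e · ∏_{j=1..l_e} Σ_{i=0..l_f} t(e_j | f_i)

so the posterior factorizes per position:

p(a | e, f) = ∏_{j=1..l_e} t(e_j | f_a(j)) / Σ_{i=0..l_f} t(e_j | f_i)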
IBM Model 1 and EM: Maximization Step
[E-step and M-step worked example using the t-table]
▪ Update the t-table: p(the|la) = c(the|la)/c(la)
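In the standard Model 1 M-step, expected counts are collected by weighting each word pair by its E-step posterior and then renormalized, generalizing the p(the|la) = c(the|la)/c(la) update above:

c(e | f) += p(a_j = i | e, f)   for every sentence pair and every j, i with e_j = e and f_i = f
t(e | f) = c(e | f) / Σ_{e'} c(e' | f)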
IBM Model 1 and EM: Pseudocode
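Since the pseudocode itself is not reproduced here, below is a minimal runnable sketch of EM for Model 1 in Python (the toy corpus and variable names are invented for illustration; NULL alignment is omitted for brevity):

```python
from collections import defaultdict

# Toy sentence-aligned corpus (hypothetical, for illustration only).
corpus = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
    ("maison bleue".split(), "blue house".split()),
]

# 1. Initialize t(e|f) uniformly.
f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}
t = {(e, f): 1.0 / len(e_vocab) for e in e_vocab for f in f_vocab}

for _ in range(10):                      # iterate E and M steps
    count = defaultdict(float)           # expected counts c(e|f)
    total = defaultdict(float)           # expected counts c(f)
    # 2. E-step: assign probabilities to alignments (per-position posteriors).
    for fs, es in corpus:
        for e in es:
            norm = sum(t[(e, f)] for f in fs)
            for f in fs:
                p = t[(e, f)] / norm     # p(a_j = i | e, f) under Model 1
                count[(e, f)] += p
                total[f] += p
    # 3. M-step: re-estimate t(e|f) from the expected counts.
    for (e, f) in t:
        t[(e, f)] = count[(e, f)] / total[f] if total[f] > 0 else 0.0

print(round(t[("house", "maison")], 3))  # grows toward 1.0 over iterations
```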
Convergence
Word Alignment
Word Alignment?
Higher IBM Models
The IBM Models 1-5 (Brown et al., 1993)
Mary did not slap the green witch
→ fertility, e.g. n(3|slap): Mary not slap slap slap the green witch
→ NULL insertion, P(NULL): Mary not slap slap slap NULL the green witch
→ lexical translation, e.g. t(la|the): Mary no daba una bofetada a la verde bruja
→ distortion, d(j|i): Mary no daba una bofetada a la bruja verde
[example from Al-Onaizan and Knight, 1998]
Word Alignment and IBM Models
▪ IBM Models create a many-to-one mapping (each target word is aligned to exactly one source word)
  ▪ words are aligned using an alignment function
  ▪ a function may return the same value for different inputs, so one source word may align to many target words (one-to-many)
  ▪ a function cannot return multiple values for one input, so a target word cannot align to several source words (no many-to-one)
▪ Real word alignments have many-to-many mappings
Symmetrization
Growing Heuristics
▪ Add alignment points from union based on heuristics ▪ Popular method: grow-diag-final-and
Evaluating Alignment Models
▪ How do we measure quality of a word-to-word model?
▪ Method 1: use in an end-to-end translation system
▪ Hard to measure translation quality
  ▪ Option: human judges
  ▪ Option: reference translations (NIST, BLEU)
  ▪ Option: combinations (HTER)
▪ Actually, no one uses word-to-word models alone as TMs
▪ Method 2: measure quality of the alignments produced
▪ Easy to measure
▪ Hard to know what the gold alignments should be
▪ Often does not correlate well with translation quality (like perplexity in LMs)