Algorithms for NLP (IITP, Fall 2019), Lecture 21: Machine Translation I


SLIDE 1

Yulia Tsvetkov

Algorithms for NLP

IITP, Fall 2019

Lecture 21: Machine Translation I

SLIDE 2

Machine Translation

SLIDE 3

SLIDE 4

from Dream of the Red Chamber Cao Xue Qin (1792)

SLIDE 5

SLIDE 6

English: leg, foot, paw
French: jambe, pied, patte, étape

SLIDE 7

Challenges

Ambiguities
▪ words
▪ morphology
▪ syntax
▪ semantics
▪ pragmatics
Gaps in data
▪ availability of corpora
▪ commonsense knowledge
+ Understanding of context, connotation, social norms, etc.

SLIDE 8

Classical Approaches to MT

▪ Direct translation: word-by-word dictionary translation
▪ Transfer approaches: word dictionary + rules

source language text → morphological analysis → lexical transfer using bilingual dictionary → local reordering → morphological generation → target language text

SLIDE 9

SLIDE 10

Levels of Transfer

SLIDE 11

Levels of Transfer

▪ Syntactic transfer

SLIDE 12

Levels of Transfer

▪ Syntactic transfer

SLIDE 13

Levels of Transfer

▪ Syntactic transfer

SLIDE 14

Levels of Transfer

▪ Semantic transfer

SLIDE 15

Levels of Transfer

SLIDE 16

Levels of Transfer

SLIDE 17

Levels of Transfer

SLIDE 18

The Vauquois Triangle (1968)

SLIDE 19

SLIDE 20

Levels of Transfer: The Vauquois triangle

SLIDE 21

SLIDE 22

Statistical approaches

SLIDE 23

Statistical MT

Modeling correspondences between languages

Sentence-aligned parallel corpus:
▪ Yo lo haré mañana → I will do it tomorrow
▪ Hasta pronto → See you soon
▪ Hasta pronto → See you around

Novel sentence: Yo lo haré pronto

Machine translation system (model of translation) → candidate translations:
▪ I will do it soon
▪ I will do it around
▪ See you tomorrow

SLIDE 24

Research Problems

▪ How can we formalize the process of learning to translate from examples?
▪ How can we formalize the process of finding translations for new inputs?
▪ If our model produces many outputs, how do we find the best one?
▪ If we have a gold standard translation, how can we tell if our output is good or bad?

SLIDE 25

Two Views of MT

SLIDE 26

MT as Code Breaking

SLIDE 27

The Noisy-Channel Model

[Diagram: source → W → noisy channel → A]

SLIDE 28

The Noisy-Channel Model

[Diagram: source → W → noisy channel → observed A → decoder → best W]

SLIDE 29

The Noisy-Channel Model

[Diagram: source → W → noisy channel → observed A → decoder → best W]

▪ We want to predict a sentence given acoustics: W* = argmax_W P(W | A)

SLIDE 30

The Noisy-Channel Model

▪ We want to predict a sentence given acoustics: W* = argmax_W P(W | A)
▪ The noisy-channel approach: W* = argmax_W P(A | W) P(W)
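
The equations on this slide were images and are missing from the extracted text. The standard noisy-channel derivation they refer to is, as a reconstruction rather than a copy of the slide:

\[
W^* = \arg\max_{W} P(W \mid A)
    = \arg\max_{W} \frac{P(A \mid W)\,P(W)}{P(A)}
    = \arg\max_{W} P(A \mid W)\,P(W)
\]

The denominator P(A) does not depend on W, so it can be dropped from the argmax.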

SLIDE 31

The Noisy-Channel Model

▪ The noisy-channel approach: W* = argmax_W P(A | W) P(W), where P(A | W) is the channel model and P(W) is the source model

[Diagram: source → W → noisy channel → observed A → decoder → best W]

SLIDE 32

The Noisy-Channel Model

▪ The noisy-channel approach: W* = argmax_W P(A | W) P(W)

[Diagram: source → W → noisy channel → observed A → decoder → best W]

▪ Likelihood P(A | W), the channel model: an acoustic model (HMMs) in speech recognition, a translation model in MT
▪ Prior P(W), the source model: a language model, i.e. a distribution over sequences of words (sentences)

SLIDE 33

Noisy Channel Model

SLIDE 34

MT as Direct Modeling

▪ one model does everything
▪ trained to reproduce a corpus of translations

SLIDE 35

Two Views of MT

▪ Code breaking (aka the noisy channel, Bayes rule)

▪ I know the target language
▪ I have example translated texts (example enciphered data)

▪ Direct modeling (aka pattern matching)

▪ I have really good learning algorithms and a bunch of example inputs (source language sentences) and outputs (target language translations)

SLIDE 36

Which is better?

▪ Noisy channel
  ▪ easy to use monolingual target language data
  ▪ search happens under a product of two models (individual models can be simple, product can be powerful)
  ▪ obtaining probabilities requires renormalizing

▪ Direct model
  ▪ directly model the process you care about
  ▪ model must be very powerful

SLIDE 37

Where are we in 2019?

▪ Direct modeling is where most of the action is
  ▪ Neural networks are very good at generalizing and conceptually very simple
  ▪ Inference in “product of two models” is hard

▪ Noisy channel ideas are incredibly important and still play a big role in how we think about translation

SLIDE 38

Two Views of MT

SLIDE 39

Noisy Channel: Phrase-Based MT

[Pipeline diagram:]
▪ Translation Model: source phrase f ↔ target phrase e with translation features, learned from a parallel corpus
▪ Language Model: a distribution over target-language sentences e, learned from a monolingual corpus
▪ Reranking Model: feature weights, tuned on a held-out parallel corpus

SLIDE 40

Neural MT: Conditional Language Modeling

http://opennmt.net/

SLIDE 41

A common problem

Both models must assign probabilities to how a sentence in one language translates into a sentence in another language.

SLIDE 42

Learning from Data

SLIDE 43

Parallel corpora

SLIDE 44

SLIDE 45

SLIDE 46

SLIDE 47

Mining parallel data from microblogs (Ling et al., 2013)

SLIDE 48

http://opus.nlpl.eu

SLIDE 49

SLIDE 50

▪ There is a lot more monolingual data in the world than translated data
▪ Easy to get about 1 trillion words of English by crawling the web
▪ With some work, you can get 1 billion translated words of English-French
▪ What about Japanese-Turkish?

SLIDE 51

Phrase-Based MT

[Pipeline diagram:]
▪ Translation Model: source phrase f ↔ target phrase e with translation features, learned from a parallel corpus
▪ Language Model: a distribution over target-language sentences e, learned from a monolingual corpus
▪ Reranking Model: feature weights, tuned on a held-out parallel corpus

SLIDE 52

Phrase-Based System Overview

[Pipeline: Sentence-aligned corpus → Word alignments → Phrase table (translation model)]

cat ||| chat ||| 0.9
the cat ||| le chat ||| 0.8
dog ||| chien ||| 0.8
house ||| maison ||| 0.6
my house ||| ma maison ||| 0.9
language ||| langue ||| 0.9
…

Many slides and examples from Philipp Koehn or John DeNero

SLIDE 53

Word Alignment Models

SLIDE 54

Lexical Translation

▪ How do we translate a word? Look it up in the dictionary:
  Haus — house, building, home, household, shell
▪ Multiple translations
  ▪ some more frequent than others
  ▪ different word senses, different registers, different inflections (?)
  ▪ house, home are common
  ▪ shell is specialized (the Haus of a snail is a shell)

SLIDE 55

How common is each?

Look at a parallel corpus (German text along with English translation)

SLIDE 56

Estimate Translation Probabilities

Maximum likelihood estimation
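
The equation on this slide did not survive extraction. The maximum likelihood (relative frequency) estimate it refers to should be, as a reconstruction:

\[
t_{\mathrm{MLE}}(e \mid f) = \frac{\mathrm{count}(e, f)}{\sum_{e'} \mathrm{count}(e', f)}
\]

i.e., count how often f is translated as e in the aligned parallel corpus and normalize over all observed translations of f.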

SLIDE 57

▪ Goal: a model p(e | f)
▪ where e and f are complete English and Foreign sentences

Lexical Translation

SLIDE 58

Lexical Translation

▪ Goal: a model p(e | f)
▪ where e and f are complete English and Foreign sentences
▪ Lexical translation makes the following assumptions:

▪ Each word e_i in e is generated from exactly one word in f
▪ Thus, we have an alignment a_i that indicates which word e_i “came from”; specifically, it came from f_{a_i}
▪ Given the alignments a, translation decisions are conditionally independent of each other and depend only on the aligned source word f_{a_i}

SLIDE 59

Lexical Translation

▪ Putting our assumptions together, we have:
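
The equation itself is missing from the extracted slide. Under the assumptions listed above, and omitting the sentence-length term, the standard factorization is (a reconstruction, not copied from the slide):

\[
p(e, a \mid f) = p(a \mid f) \prod_{i=1}^{|e|} t\!\left(e_i \mid f_{a_i}\right)
\]

where each target word depends only on the source word it is aligned to.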

SLIDE 60

The Alignment Function

▪ Alignments can be visualized by drawing links between the two sentences, and they are represented as vectors of positions:
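
As a concrete illustration (my example, not from the slide): for the German-English pair das Haus ist klein → the house is small, each English word links to the source word in the same position, so the alignment function is a(1)=1, a(2)=2, a(3)=3, a(4)=4, i.e. the alignment vector a = (1, 2, 3, 4).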

SLIDE 61

Reordering

▪ Words may be reordered during translation.

SLIDE 62

Word Dropping

▪ A source word may not be translated at all

SLIDE 63

Word Insertion

▪ Words may be inserted during translation
  ▪ English just does not have an equivalent
  ▪ But it must be explained - we typically assume every source sentence contains a NULL token

SLIDE 64

One-to-many Translation

▪ A source word may translate into more than one target word

SLIDE 65

Many-to-one Translation

▪ More than one source word may translate as a unit into a single target word - lexical (word-to-word) translation cannot capture this

SLIDE 66

IBM Model 1

▪ Generative model: break up translation process into smaller steps
▪ Simplest possible lexical translation model
▪ Additional assumptions
  ▪ All alignment decisions are independent
  ▪ The alignment distribution for each a_i is uniform over all source words and NULL

SLIDE 67

IBM Model 1

▪ Translation probability
  ▪ for a foreign sentence f = (f_1, ..., f_{l_f}) of length l_f
  ▪ to an English sentence e = (e_1, ..., e_{l_e}) of length l_e
  ▪ with an alignment of each English word e_j to a foreign word f_i according to the alignment function a : j → i

▪ parameter ϵ is a normalization constant
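
The formula itself is not in the extracted text. The standard IBM Model 1 translation probability it describes, in the notation above (a reconstruction following Koehn's formulation), is:

\[
p(e, a \mid f) = \frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} t\!\left(e_j \mid f_{a(j)}\right)
\]

where the factor (l_f + 1)^{l_e} comes from the uniform alignment distribution over the l_f source words plus NULL.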

SLIDE 68

Example

SLIDE 69

Learning Lexical Translation Models

▪ We would like to estimate the lexical translation probabilities t(e|f) from a parallel corpus
▪ ... but we do not have the alignments
▪ Chicken and egg problem
  ▪ if we had the alignments → we could estimate the parameters of our generative model (MLE)
  ▪ if we had the parameters → we could estimate the alignments

SLIDE 70

EM Algorithm

▪ Incomplete data
  ▪ if we had complete data, we could estimate the model
  ▪ if we had the model, we could fill in the gaps in the data

▪ Expectation Maximization (EM) in a nutshell

1. initialize model parameters (e.g. uniform, random)
2. assign probabilities to the missing data
3. estimate model parameters from completed data
4. iterate steps 2–3 until convergence
SLIDE 71

EM Algorithm

▪ Initial step: all alignments equally likely
▪ Model learns that, e.g., la is often aligned with the

SLIDE 72

EM Algorithm

▪ After one iteration
▪ Alignments, e.g., between la and the are more likely

SLIDE 73

EM Algorithm

▪ After another iteration
▪ It becomes apparent that alignments, e.g., between fleur and flower are more likely (pigeonhole principle)

SLIDE 74

EM Algorithm

▪ Convergence
▪ Inherent hidden structure revealed by EM

SLIDE 75

EM Algorithm

▪ Parameter estimation from the aligned corpus

SLIDE 76

IBM Model 1 and EM

EM Algorithm consists of two steps

▪ Expectation-Step: Apply model to the data
  ▪ parts of the model are hidden (here: alignments)
  ▪ using the model, assign probabilities to possible values

▪ Maximization-Step: Estimate model from data
  ▪ take assigned values as fact
  ▪ collect counts (weighted by lexical translation probabilities)
  ▪ estimate model from counts

▪ Iterate these steps until convergence

SLIDE 77

IBM Model 1 and EM

▪ We need to be able to compute:
  ▪ Expectation-Step: probability of alignments
  ▪ Maximization-Step: count collection

SLIDE 78

IBM Model 1 and EM

t-table

SLIDE 79

IBM Model 1 and EM

t-table

SLIDE 80

IBM Model 1 and EM

t-table

SLIDE 81

IBM Model 1 and EM

Applying the chain rule:

t-table
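
The equations on this slide are missing from the extracted text. Applying the chain rule as stated, the posterior over alignments needed for the E-step should be (a reconstruction):

\[
p(a \mid e, f) = \frac{p(e, a \mid f)}{p(e \mid f)},
\qquad
p(e \mid f) = \sum_{a} p(e, a \mid f)
\]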

SLIDE 82

IBM Model 1 and EM: Expectation Step

SLIDE 83

IBM Model 1 and EM: Expectation Step

SLIDE 84

The Trick
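
The derivation on this slide did not survive extraction. "The trick" usually refers to exchanging the sum over alignments with the product over target positions, which makes p(e | f) computable in O(l_e · l_f) time rather than by enumerating all alignments (a reconstruction):

\[
\sum_{a} \prod_{j=1}^{l_e} t\!\left(e_j \mid f_{a(j)}\right)
= \prod_{j=1}^{l_e} \sum_{i=0}^{l_f} t\!\left(e_j \mid f_i\right)
\]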

SLIDE 85

IBM Model 1 and EM: Expectation Step

SLIDE 86

IBM Model 1 and EM: Expectation Step

E-step t-table

SLIDE 87

IBM Model 1 and EM: Maximization Step

SLIDE 88

IBM Model 1 and EM: Maximization Step

E-step M-step t-table

SLIDE 89

IBM Model 1 and EM: Maximization Step

SLIDE 90

IBM Model 1 and EM: Maximization Step

Update t-table: p(the|la) = c(the|la)/c(la)

E-step M-step t-table

SLIDE 91

IBM Model 1 and EM: Pseudocode
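
The pseudocode image on this slide did not survive extraction. Below is a minimal Python sketch of EM for IBM Model 1 as described in the preceding slides; the corpus format, NULL handling, uniform initialization, and the toy example are my assumptions, not taken from the slide.

```python
from collections import defaultdict

def ibm_model1_em(corpus, iterations=10):
    """EM training for IBM Model 1.

    corpus: list of (e_words, f_words) sentence pairs (lists of tokens).
    A NULL token is prepended to every source sentence so that target
    words can align to "nothing".
    Returns t[(e, f)] = t(e | f), the lexical translation table.
    """
    corpus = [(e, ["NULL"] + f) for e, f in corpus]

    # Initialize t(e|f) uniformly over the target vocabulary.
    e_vocab = {e for e_sent, _ in corpus for e in e_sent}
    t = defaultdict(lambda: 1.0 / len(e_vocab))

    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(e, f)
        total = defaultdict(float)   # expected counts c(f)

        # E-step: collect expected counts under the current model.
        for e_sent, f_sent in corpus:
            for e in e_sent:
                # Model 1 "trick": the posterior of aligning e to f
                # is t(e|f) / sum_i t(e|f_i).
                norm = sum(t[(e, f)] for f in f_sent)
                for f in f_sent:
                    delta = t[(e, f)] / norm
                    count[(e, f)] += delta
                    total[f] += delta

        # M-step: re-estimate t(e|f) = c(e, f) / c(f) from expected counts.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]

    return t

# Toy usage example (illustrative only):
corpus = [(["the", "house"], ["la", "maison"]),
          (["the", "book"], ["le", "livre"]),
          (["a", "book"], ["un", "livre"])]
t = ibm_model1_em(corpus, iterations=20)
print(t[("book", "livre")], t[("house", "maison")])
```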

SLIDE 92

Convergence

SLIDE 93

IBM Model 1

▪ Generative model: break up translation process into smaller steps
▪ Simplest possible lexical translation model
▪ Additional assumptions
  ▪ All alignment decisions are independent
  ▪ The alignment distribution for each a_i is uniform over all source words and NULL

SLIDE 94

IBM Model 1

▪ Translation probability
  ▪ for a foreign sentence f = (f_1, ..., f_{l_f}) of length l_f
  ▪ to an English sentence e = (e_1, ..., e_{l_e}) of length l_e
  ▪ with an alignment of each English word e_j to a foreign word f_i according to the alignment function a : j → i

▪ parameter ϵ is a normalization constant

SLIDE 95

Example

SLIDE 96

Evaluating Alignment Models

▪ How do we measure quality of a word-to-word model?

▪ Method 1: use in an end-to-end translation system
  ▪ Hard to measure translation quality
  ▪ Option: human judges
  ▪ Option: reference translations (NIST, BLEU)
  ▪ Option: combinations (HTER)
  ▪ Actually, no one uses word-to-word models alone as TMs

▪ Method 2: measure quality of the alignments produced
  ▪ Easy to measure
  ▪ Hard to know what the gold alignments should be
  ▪ Often does not correlate well with translation quality (like perplexity in LMs)

SLIDE 97

Alignment Error Rate
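
The definitions and worked examples on this and the following slides were images and are missing here. For reference, the standard alignment error rate (Och and Ney, 2003), stated as a reconstruction in terms of sure links S, possible links P (with S ⊆ P), and predicted alignment links A, is:

\[
\mathrm{Precision} = \frac{|A \cap P|}{|A|}, \qquad
\mathrm{Recall} = \frac{|A \cap S|}{|S|}, \qquad
\mathrm{AER} = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}
\]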

SLIDE 98

Alignment Error Rate

SLIDE 99

Alignment Error Rate

SLIDE 100

Alignment Error Rate

SLIDE 101

Alignment Error Rate

SLIDE 102

Problems with Lexical Translation

▪ Complexity -- exponential in sentence length
▪ Weak reordering -- the output is not fluent
▪ Many local decisions -- error propagation