SLIDE 1

Machine Translation

Philipp Koehn 28 April 2020

Philipp Koehn Artificial Intelligence: Machine Translation 28 April 2020

SLIDE 2

Machine Translation: French (2012)

SLIDE 3

Machine Translation: French (2020)

SLIDE 4

No Single Right Answer

Israeli officials are responsible for airport security.
Israel is in charge of the security at this airport.
The security work for this airport is the responsibility of the Israel government.
Israeli side was in charge of the security of this airport.
Israel is responsible for the airport’s security.
Israel is responsible for safety work at this airport.
Israel presides over the security of the airport.
Israel took charge of the airport security.
The safety of this airport is taken charge of by Israel.
This airport’s security is the responsibility of the Israeli security officials.

SLIDE 5

A Clear Plan

Source Target

Lexical Transfer

Interlingua

SLIDE 6

A Clear Plan

Source Target

Lexical Transfer Syntactic Transfer

Interlingua

Analysis Generation

SLIDE 7

A Clear Plan

Source Target

Lexical Transfer Syntactic Transfer Semantic Transfer

Interlingua

Analysis Generation

SLIDE 8

A Clear Plan

Source Target

Lexical Transfer Syntactic Transfer Semantic Transfer

Interlingua

Analysis Generation

SLIDE 9

Learning from Data

[Diagram: training data (parallel corpora, monolingual corpora, dictionaries) and linguistic tools are used to train a statistical machine translation system, which then translates source text.]

SLIDE 10

why is that a good plan?

SLIDE 11

Word Translation Problems

  • Words are ambiguous

He deposited money in a bank account with a high interest rate. Sitting on the bank of the Mississippi, a passing ship piqued his interest.

  • How do we find the right meaning, and thus translation?
  • Context should be helpful

SLIDE 12

Syntactic Translation Problems

  • Languages have different sentence structure

das behaupten sie wenigstens

this / claim / they / at least   (word-by-word gloss; das can also mean the, sie can also mean she)

  • Convert from object-verb-subject (OVS) to subject-verb-object (SVO)
  • Ambiguities can be resolved through syntactic analysis

– the meaning the of das is not possible (not a noun phrase)
– the meaning she of sie is not possible (subject-verb agreement)

SLIDE 13

Semantic Translation Problems

  • Pronominal anaphora

I saw the movie and it is good.

  • How to translate it into German (or French)?

– it refers to movie
– movie translates to Film
– Film has masculine gender
– ergo: it must be translated into the masculine pronoun er

  • We are not handling this very well [Le Nagard and Koehn, 2010]

SLIDE 14

Semantic Translation Problems

  • Coreference

Whenever I visit my uncle and his daughters, I can’t decide who is my favorite cousin.

  • How to translate cousin into German? Male or female?
  • Complex inference required

SLIDE 15

Semantic Translation Problems

  • Discourse

Since you brought it up, I do not agree with you. Since you brought it up, we have been working on it.

  • How to translate since? Temporal or conditional?
  • Analysis of discourse structure — a hard problem

SLIDE 16

Learning from Data

  • What is the best translation?

Sicherheit → security   14,516
Sicherheit → safety     10,015
Sicherheit → certainty     334

SLIDE 17

Learning from Data

  • What is the best translation?

Sicherheit → security   14,516
Sicherheit → safety     10,015
Sicherheit → certainty     334

  • Counts in European Parliament corpus

SLIDE 18

Learning from Data

  • What is the best translation?

Sicherheit → security   14,516
Sicherheit → safety     10,015
Sicherheit → certainty     334

  • Phrasal rules

Sicherheitspolitik → security policy       1580
Sicherheitspolitik → safety policy           13
Sicherheitspolitik → certainty policy         0
Lebensmittelsicherheit → food security       51
Lebensmittelsicherheit → food safety       1084
Lebensmittelsicherheit → food certainty       0
Rechtssicherheit → legal security           156
Rechtssicherheit → legal safety               5
Rechtssicherheit → legal certainty          723

SLIDE 19

Learning from Data

  • What is most fluent?

a problem for translation   13,000
a problem of translation    61,600
a problem in translation    81,700

SLIDE 20

Learning from Data

  • What is most fluent?

a problem for translation   13,000
a problem of translation    61,600
a problem in translation    81,700

  • Hits on Google

SLIDE 21

Learning from Data

  • What is most fluent?

a problem for translation   13,000
a problem of translation    61,600
a problem in translation    81,700
a translation problem      235,000

SLIDE 22

Learning from Data

  • What is most fluent?

police disrupted the demonstration     2,140
police broke up the demonstration     66,600
police dispersed the demonstration    25,800
police ended the demonstration           762
police dissolved the demonstration     2,030
police stopped the demonstration     722,000
police suppressed the demonstration    1,400
police shut down the demonstration     2,040

SLIDE 23

Learning from Data

  • What is most fluent?

police disrupted the demonstration     2,140
police broke up the demonstration     66,600
police dispersed the demonstration    25,800
police ended the demonstration           762
police dissolved the demonstration     2,030
police stopped the demonstration     722,000
police suppressed the demonstration    1,400
police shut down the demonstration     2,040

SLIDE 24

word alignment

SLIDE 25

Lexical Translation

  • How to translate a word → look up in dictionary

Haus — house, building, home, household, shell.

  • Multiple translations

– some translations are more frequent than others
– for instance: house and building are most common
– special cases: the Haus of a snail is its shell

  • Note: In all lectures, we translate from a foreign language into English

SLIDE 26

Collect Statistics

Look at a parallel corpus (German text along with English translation).

Translation of Haus   Count
house                 8,000
building              1,600
home                    200
household               150
shell                    50

SLIDE 27

Estimate Translation Probabilities

Maximum likelihood estimation:

pf(e) = 0.8    if e = house,
        0.16   if e = building,
        0.02   if e = home,
        0.015  if e = household,
        0.005  if e = shell.
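These probabilities are just the relative frequencies of the counts on the previous slide. A minimal sketch of the estimation:

```python
# Maximum likelihood estimation of translation probabilities t(e|f)
# from the counts on the previous slide.

counts = {"house": 8000, "building": 1600, "home": 200,
          "household": 150, "shell": 50}

total = sum(counts.values())          # 10,000 occurrences of "Haus"
t = {e: c / total for e, c in counts.items()}

print(t["house"], t["building"])
```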

SLIDE 28

Alignment

  • In a parallel text (or when we translate), we align words in one language with the words in the other

das Haus ist klein
the house is small

  • Word positions are numbered 1–4

SLIDE 29

Alignment Function

  • Formalizing alignment with an alignment function
  • Mapping an English target word at position i to a German source word at position j with a function a ∶ i → j

  • Example

a ∶ {1 → 1,2 → 2,3 → 3,4 → 4}

SLIDE 30

Reordering

Words may be reordered during translation

das Haus ist klein
the house is small

a ∶ {1 → 3,2 → 4,3 → 2,4 → 1}

SLIDE 31

One-to-Many Translation

A source word may translate into multiple target words

das Haus ist klitzeklein
the house is very small

a ∶ {1 → 1,2 → 2,3 → 3,4 → 4,5 → 4}

SLIDE 32

Dropping Words

Words may be dropped when translated (German article das is dropped)

das Haus ist klein
house is small

a ∶ {1 → 2,2 → 3,3 → 4}

SLIDE 33

Inserting Words

  • Words may be added during translation

– The English just does not have an equivalent in German – We still need to map it to something: special NULL token

NULL das Haus ist klein   (NULL token at source position 0)
the house is just small

a ∶ {1 → 1,2 → 2,3 → 3,4 → 0,5 → 4}
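The alignment function, including the NULL token, can be sketched as a plain mapping (a toy illustration of the slide's example):

```python
# Alignment function a: English position j -> German position i,
# with position 0 standing in for the special NULL token.

german  = ["NULL", "das", "Haus", "ist", "klein"]   # position 0 = NULL
english = ["the", "house", "is", "just", "small"]

a = {1: 1, 2: 2, 3: 3, 4: 0, 5: 4}                  # "just" aligns to NULL

for j, e in enumerate(english, start=1):
    print(e, "<-", german[a[j]])
```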

SLIDE 34

IBM Model 1

  • Generative model: break up translation process into smaller steps

– IBM Model 1 only uses lexical translation

  • Translation probability

– for a foreign sentence f = (f1, ..., flf) of length lf
– to an English sentence e = (e1, ..., ele) of length le
– with an alignment of each English word ej to a foreign word fi according to the alignment function a ∶ j → i

p(e, a | f) = ε / (lf + 1)^le × ∏ j=1..le t(ej | fa(j))

– parameter ε is a normalization constant

SLIDE 35

Example

das Haus ist klein

e      t(e|f)    e          t(e|f)    e       t(e|f)    e       t(e|f)
the    0.7       house      0.8       is      0.8       small   0.4
that   0.15      building   0.16      ’s      0.16      little  0.4
which  0.075     home       0.02      exists  0.02      short   0.1
who    0.05      household  0.015     has     0.015     minor   0.06
this   0.025     shell      0.005     are     0.005     petty   0.04

p(e, a | f) = ε/4³ × t(the|das) × t(house|Haus) × t(is|ist) × t(small|klein)
            = ε/4³ × 0.7 × 0.8 × 0.8 × 0.4
            = 0.0028 ε
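The slide's computation can be checked mechanically (a sketch; the t(e|f) values and the 1/4³ factor are taken directly from the slide):

```python
# Recompute the IBM Model 1 example above, up to the constant ε.

t = {("the", "das"): 0.7, ("house", "Haus"): 0.8,
     ("is", "ist"): 0.8, ("small", "klein"): 0.4}

f = ["das", "Haus", "ist", "klein"]
e = ["the", "house", "is", "small"]
a = {1: 1, 2: 2, 3: 3, 4: 4}             # monotone alignment

prod = 1.0
for j, ej in enumerate(e, start=1):
    prod *= t[(ej, f[a[j] - 1])]          # t(e_j | f_a(j))

p_over_eps = prod / 4 ** 3                # the slide's 1/4^3 factor
print(round(p_over_eps, 4))
```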

SLIDE 36

em algorithm

SLIDE 37

Learning Lexical Translation Models

  • We would like to estimate the lexical translation probabilities t(e∣f) from a parallel corpus

  • ... but we do not have the alignments
  • Chicken and egg problem

– if we had the alignments → we could estimate the parameters of our generative model
– if we had the parameters → we could estimate the alignments

SLIDE 38

EM Algorithm

  • Incomplete data

– if we had complete data, we could estimate the model
– if we had the model, we could fill in the gaps in the data

  • Expectation Maximization (EM) in a nutshell
  1. initialize model parameters (e.g. uniform)
  2. assign probabilities to the missing data
  3. estimate model parameters from completed data
  4. iterate steps 2–3 until convergence

SLIDE 39

EM Algorithm

... la maison ... la maison bleu ... la fleur ... ... the house ... the blue house ... the flower ...

  • Initial step: all alignments equally likely
  • Model learns that, e.g., la is often aligned with the

SLIDE 40

EM Algorithm

... la maison ... la maison bleu ... la fleur ... ... the house ... the blue house ... the flower ...

  • After one iteration
  • Alignments, e.g., between la and the are more likely

SLIDE 41

EM Algorithm

... la maison ... la maison bleu ... la fleur ... ... the house ... the blue house ... the flower ...

  • After another iteration
  • It becomes apparent that alignments, e.g., between fleur and flower are more likely (pigeonhole principle)

SLIDE 42

EM Algorithm

... la maison ... la maison bleu ... la fleur ... ... the house ... the blue house ... the flower ...

  • Convergence
  • Inherent hidden structure revealed by EM

SLIDE 43

EM Algorithm

... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...

p(la|the) = 0.453
p(le|the) = 0.334
p(maison|house) = 0.876
p(bleu|blue) = 0.563
...

  • Parameter estimation from the aligned corpus

SLIDE 44

IBM Model 1 and EM

  • EM Algorithm consists of two steps
  • Expectation-Step: Apply model to the data

– parts of the model are hidden (here: alignments)
– using the model, assign probabilities to possible values

  • Maximization-Step: Estimate model from data

– take assigned values as fact
– collect counts (weighted by probabilities)
– estimate model from counts

  • Iterate these steps until convergence
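The E-step/M-step loop for IBM Model 1 fits in a few lines on the toy corpus from the earlier slides (a sketch that ignores the NULL token for brevity):

```python
from collections import defaultdict

# Toy EM for IBM Model 1 on the three sentence pairs from the slides.
corpus = [(["la", "maison"], ["the", "house"]),
          (["la", "maison", "bleu"], ["the", "blue", "house"]),
          (["la", "fleur"], ["the", "flower"])]

f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}
t = {(e, f): 1 / len(e_vocab) for f in f_vocab for e in e_vocab}  # uniform init

for _ in range(20):
    count = defaultdict(float)
    total = defaultdict(float)
    for fs, es in corpus:
        for e in es:                          # E-step: expected alignment counts
            z = sum(t[(e, f)] for f in fs)
            for f in fs:
                c = t[(e, f)] / z
                count[(e, f)] += c
                total[f] += c
    for (e, f), c in count.items():           # M-step: re-estimate t(e|f)
        t[(e, f)] = c / total[f]

print(t[("house", "maison")], t[("the", "maison")])
```

After a few iterations the hidden structure emerges: maison aligns to house rather than to the ubiquitous the.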

SLIDE 45

phrase-based models

SLIDE 46

Phrase-Based Model

  • Foreign input is segmented in phrases
  • Each phrase is translated into English
  • Phrases are reordered

SLIDE 47

Phrase Translation Table

  • Main knowledge source: table with phrase translations and their probabilities
  • Example: phrase translations for natuerlich

Translation      Probability φ(ē|f̄)
of course        0.5
naturally        0.3
of course ,      0.15
, of course ,    0.05

SLIDE 48

Real Example

  • Phrase translations for den Vorschlag learned from the Europarl corpus:

English           φ(ē|f̄)     English           φ(ē|f̄)
the proposal      0.6227      the suggestions   0.0114
’s proposal       0.1068      the proposed      0.0114
a proposal        0.0341      the motion        0.0091
the idea          0.0250      the idea of       0.0091
this proposal     0.0227      the proposal ,    0.0068
proposal          0.0205      its proposal      0.0068
of the proposal   0.0159      it                0.0068
the proposals     0.0159      ...               ...

– lexical variation (proposal vs suggestions)
– morphological variation (proposal vs proposals)
– included function words (the, a, ...)
– noise (it)

SLIDE 49

decoding

SLIDE 50

Decoding

  • We have a mathematical model for translation

p(e∣f)

  • Task of decoding: find the translation ebest with highest probability

ebest = argmaxe p(e∣f)

  • Two types of error

– the most probable translation is bad → fix the model
– search does not find the most probable translation → fix the search

  • Decoding is evaluated by search error, not quality of translations (although these are often correlated)

SLIDE 51

Translation Process

  • Task: translate this sentence from German into English

er geht ja nicht nach hause

SLIDE 52

Translation Process

  • Task: translate this sentence from German into English

er geht ja nicht nach hause
er → he

  • Pick phrase in input, translate

SLIDE 53

Translation Process

  • Task: translate this sentence from German into English

er geht ja nicht nach hause
er → he, ja nicht → does not

  • Pick phrase in input, translate

– it is allowed to pick words out of sequence (reordering)
– phrases may have multiple words: many-to-many translation

SLIDE 54

Translation Process

  • Task: translate this sentence from German into English

er geht ja nicht nach hause
er geht ja nicht → he does not go

  • Pick phrase in input, translate

SLIDE 55

Translation Process

  • Task: translate this sentence from German into English

er geht ja nicht nach hause
er geht ja nicht nach hause → he does not go home

  • Pick phrase in input, translate

SLIDE 56

decoding process

SLIDE 57

Translation Options

er geht ja nicht nach hause

[Table of translation options for every span of the input, e.g. er → he / it / , it; geht → is / are / goes / go; ja → yes / is / , of course; nicht → not / do not / does not / is not; nach → after / to / according to / in; hause → house / home / chamber / at home; plus many multi-word options such as ja nicht → is not / does not / do not and nach hause → home / under house / return home]

  • Many translation options to choose from

– in Europarl phrase table: 2727 matching phrase pairs for this sentence
– by pruning to the top 20 per phrase, 202 translation options remain

SLIDE 58

Translation Options

er geht ja nicht nach hause

[Same table of translation options for every span of the input as on the previous slide]

  • The machine translation decoder does not know the right answer

– picking the right translation options
– arranging them in the right order
→ search problem solved by heuristic beam search

SLIDE 59

Decoding: Precompute Translation Options

er geht ja nicht nach hause

consult phrase translation table for all input phrases

SLIDE 60

Decoding: Start with Initial Hypothesis

er geht ja nicht nach hause

initial hypothesis: no input words covered, no output produced

SLIDE 61

Decoding: Hypothesis Expansion

er geht ja nicht nach hause

[Figure: one new hypothesis, translating er as are]

pick any translation option, create new hypothesis

SLIDE 62

Decoding: Hypothesis Expansion

er geht ja nicht nach hause

[Figure: hypotheses are, it, he expanded from the initial hypothesis]

create hypotheses for all other translation options

SLIDE 63

Decoding: Hypothesis Expansion

er geht ja nicht nach hause

[Figure: search graph of partial hypotheses: are / it / he, then goes, does not, yes, go, to home, home]

also create hypotheses from created partial hypothesis

SLIDE 64

Decoding: Find Best Path

er geht ja nicht nach hause

[Figure: the same search graph, with the path he → does not → go → home completed]

backtrack from highest scoring complete hypothesis

SLIDE 65

Recombination

  • Two hypothesis paths lead to two matching hypotheses

– same number of foreign words translated
– same English words in the output
– different scores

[Figure: two different paths both ending in the hypothesis it is]

  • Worse hypothesis is dropped

[Figure: only one it is hypothesis is kept]

SLIDE 66

Stacks

[Figure: hypotheses are, it, he, goes, does not, yes, ... organized in stacks]

no word translated — one word translated — two words translated — three words translated

  • Hypothesis expansion in a stack decoder

– translation option is applied to hypothesis
– new hypothesis is dropped into a stack further down
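A heavily simplified sketch of this search: monotone segmentation only, no reordering, no language model, no pruning, and with made-up phrase probabilities:

```python
# Toy stack decoder: stacks[k] holds hypotheses covering the first k
# source words; hypotheses are (probability, output string).
phrase_table = {
    ("er",): [("he", 0.8), ("it", 0.2)],
    ("geht", "ja", "nicht"): [("does not go", 0.6)],
    ("ja", "nicht"): [("does not", 0.5)],
    ("geht",): [("goes", 0.7), ("go", 0.2)],
    ("nach", "hause"): [("home", 0.9)],
}

source = ["er", "geht", "ja", "nicht", "nach", "hause"]
n = len(source)

stacks = [[] for _ in range(n + 1)]
stacks[0].append((1.0, ""))                 # empty initial hypothesis

for k in range(n):
    for score, output in stacks[k]:
        for length in range(1, n - k + 1):  # expand with a next phrase
            phrase = tuple(source[k:k + length])
            for translation, p in phrase_table.get(phrase, []):
                stacks[k + length].append(
                    (score * p, (output + " " + translation).strip()))

best = max(stacks[n])                       # highest-scoring full hypothesis
print(best)
```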

SLIDE 67

syntax-based models

SLIDE 68

Phrase Structure Grammar

[Phrase structure tree for the sentence I/PRP shall/MD be/VB passing/VBG on/RP to/TO you/PRP some/DT comments/NNS, with constituents NP-A, PP, VP-A, S]

Phrase structure grammar tree for an English sentence (as produced by Collins’ parser)

SLIDE 69

Synchronous Phrase Structure Grammar

  • English rule

NP → DET JJ NN

  • French rule

NP → DET NN JJ

  • Synchronous rule (indices indicate alignment):

NP → DET1 NN2 JJ3 ∣ DET1 JJ3 NN2

SLIDE 70

Synchronous Grammar Rules

  • Nonterminal rules

NP → DET1 NN2 JJ3 ∣ DET1 JJ3 NN2

  • Terminal rules

N → maison ∣ house NP → la maison bleue ∣ the blue house

  • Mixed rules

NP → la maison JJ1 ∣ the JJ1 house
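Applying such rules is mechanical: expand each side, substituting co-indexed nonterminals consistently. A minimal sketch using the mixed rule above plus a terminal rule for JJ (the rule encoding and helper are made up for illustration):

```python
# Synchronous rules: nonterminal -> (source side, target side); a tuple
# (symbol, index) marks a co-indexed nonterminal shared by both sides.
rules = {
    "NP": [(["la", "maison", ("JJ", 1)], ["the", ("JJ", 1), "house"])],
    "JJ": [(["bleue"], ["blue"])],
}

def rewrite(symbol):
    """Expand one nonterminal; returns (source words, target words)."""
    src_side, tgt_side = rules[symbol][0]
    src, tgt = [], []
    sub = {}                                  # index -> child's (src, tgt)
    for item in src_side:
        if isinstance(item, tuple):
            sub[item[1]] = rewrite(item[0])
            src.extend(sub[item[1]][0])
        else:
            src.append(item)
    for item in tgt_side:
        if isinstance(item, tuple):
            tgt.extend(sub[item[1]][1])       # same child, target side
        else:
            tgt.append(item)
    return src, tgt

src, tgt = rewrite("NP")
print(" ".join(src), "->", " ".join(tgt))
```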

SLIDE 71

Syntax Decoding

[Figure: German input tree Sie/PPER will/VAFIN eine/ART Tasse/NN Kaffee/NN trinken/VVINF, with constituents NP, VP, S; chart entry ➏ drink/VB]

German input sentence with tree

SLIDE 72

Syntax Decoding

[Figure: as before, plus chart entry ➊ she/PRO covering Sie]

Purely lexical rule: filling a span with a translation (a constituent in the chart)

SLIDE 73

Syntax Decoding

[Figure: as before, plus chart entry ➋ coffee/NN covering Kaffee]

Purely lexical rule: filling a span with a translation (a constituent in the chart)

SLIDE 74

Syntax Decoding

[Figure: as before, plus chart entry ➌, another purely lexical constituent]

Purely lexical rule: filling a span with a translation (a constituent in the chart)

SLIDE 75

Syntax Decoding

[Figure: chart entry ➍ builds NP → a/DET cup/NN of/IN coffee/NN with a PP over previously built constituents]

Complex rule: matching underlying constituent spans, and covering words

SLIDE 76

Syntax Decoding

[Figure: chart entry ➎ builds VP → wants/VBZ to/TO drink/VB NP, reordering trinken with the NP a cup of coffee]

Complex rule with reordering

SLIDE 77

Syntax Decoding

[Figure: the final rule combines PRO and VP into S, completing the translation of the sentence]

SLIDE 78

neural language models

SLIDE 79

N-Gram Backoff Language Model

  • Previously, we approximated

p(W) = p(w1,w2,...,wn)

  • ... by applying the chain rule

p(W) = ∏i p(wi ∣ w1, ..., wi−1)

  • ... and limiting the history (Markov order)

p(wi∣w1,...,wi−1) ≃ p(wi∣wi−4,wi−3,wi−2,wi−1)

  • Each p(wi∣wi−4,wi−3,wi−2,wi−1) may not have enough statistics to estimate
→ we back off to p(wi∣wi−3,wi−2,wi−1), p(wi∣wi−2,wi−1), etc., all the way to p(wi)
– exact details of backing off get complicated — “interpolated Kneser-Ney”
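A bare-bones illustration of backing off: bigram relative frequencies, falling back to the unigram distribution with a fixed penalty ("stupid backoff", far simpler than interpolated Kneser-Ney; the tiny corpus and the factor 0.4 are made up):

```python
from collections import Counter

tokens = "the house is small the house is big the garden is small".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p(word, prev, alpha=0.4):
    # Use the bigram estimate when the bigram was seen ...
    if (prev, word) in bigrams:
        return bigrams[(prev, word)] / unigrams[prev]
    # ... otherwise back off to the penalized unigram estimate.
    return alpha * unigrams[word] / len(tokens)

print(p("house", "the"))   # seen bigram
print(p("garden", "is"))   # unseen bigram: backed-off unigram estimate
```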

SLIDE 80

First Sketch

[Figure: Words 1–4 feed into a hidden layer, which predicts Word 5]

SLIDE 81

Representing Words

  • Words are represented with a one-hot vector, e.g.,

– dog = (0,0,0,0,1,0,0,0,0,...)
– cat = (0,0,0,0,0,0,0,1,0,...)
– eat = (0,1,0,0,0,0,0,0,0,...)

  • That’s a large vector!

SLIDE 82

Second Sketch

[Figure: Words 1–4, now as one-hot vectors, feed into a hidden layer, which predicts Word 5]

SLIDE 83

Add a Hidden Layer

[Figure: Words 1–4 are each mapped by the shared matrix C into an embedding layer, then a hidden layer predicts Word 5]

  • Map each word first into a lower-dimensional real-valued space
  • Shared weight matrix C
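Multiplying a one-hot vector by C amounts to selecting one row of C. A sketch with a made-up vocabulary and embedding dimension:

```python
import numpy as np

vocab = ["dog", "cat", "eat", "house", "the"]
dim = 3
rng = np.random.default_rng(0)
C = rng.normal(size=(len(vocab), dim))    # shared across all word positions

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

embedding = one_hot("cat") @ C            # picks out the row of C for "cat"
print(embedding.shape)
```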

SLIDE 84

Details (Bengio et al., 2003)

  • Add direct connections from embedding layer to output layer
  • Activation functions

– input→embedding: none
– embedding→hidden: tanh
– hidden→output: softmax

  • Training

– loop through the entire corpus
– update weights based on the difference between predicted probabilities and the 1-hot vector for the output word

SLIDE 85

Word Embeddings

[Figure: the matrix C maps a word to its word embedding]

  • By-product: embedding of word into continuous space
  • Similar contexts → similar embedding
  • Recall: distributional semantics

SLIDE 86

Word Embeddings

SLIDE 87

Word Embeddings

SLIDE 88

Are Word Embeddings Magic?

  • Morphosyntactic regularities (Mikolov et al., 2013)

– adjectives: base form vs. comparative, e.g., good, better
– nouns: singular vs. plural, e.g., year, years
– verbs: present tense vs. past tense, e.g., see, saw

  • Semantic regularities

– clothing is to shirt as dish is to bowl
– evaluated on human judgment data of semantic similarities

SLIDE 89

recurrent neural networks

SLIDE 90

Recurrent Neural Networks

[Figure: Word 1 → embedding E → hidden layer H → predicts Word 2; the recurrent input to H is fixed at 1]

  • Start: predict second word from first
  • Mystery layer with nodes all with value 1

SLIDE 91

Recurrent Neural Networks

[Figure: hidden layer values are copied to the next time step: Word 2 → E → H → predicts Word 3]

SLIDE 92

Recurrent Neural Networks

[Figure: unrolled further: Word 3 → E → H → predicts Word 4, again copying hidden values forward]

SLIDE 93

Training

[Figure: first training example: Word 1 → E → H → predict Word 2]

  • Process first training example
  • Update weights with back-propagation

SLIDE 94

neural translation model

SLIDE 95

Feed Forward Neural Language Model

[Figure: feed-forward neural language model: Words 1–4 mapped by the shared matrix C, a hidden layer, predicting Word 5]

SLIDE 96

Recurrent Neural Language Model

[Figure: given word <s>, the network predicts the first word: the]

Given word → Embedding → Hidden state → Predicted word

Predict the first word of a sentence. Same as before, just drawn top-down.

SLIDE 97

Recurrent Neural Language Model

[Figure: given <s> the, the network predicts: the house]

Given word → Embedding → Hidden state → Predicted word

Predict the second word of a sentence. Re-use hidden state from first word prediction.

SLIDE 98

Recurrent Neural Language Model

[Figure: given <s> the house, the network predicts: the house is]

Given word → Embedding → Hidden state → Predicted word

Predict the third word of a sentence ... and so on.

SLIDE 99

Recurrent Neural Language Model

[Figure: the fully unrolled recurrent language model: given <s> the house is big . it predicts the house is big . </s>]

Given word → Embedding → Hidden state → Predicted word

SLIDE 100

Recurrent Neural Translation Model

  • We predicted the words of a sentence
  • Why not also predict their translations?

SLIDE 101

Encoder-Decoder Model

[Figure: one recurrent network first reads the English sentence the house is big . </s>, then continues by predicting its German translation das Haus ist groß . </s>]

Given word → Embedding → Hidden state → Predicted word

  • Obviously madness
  • Proposed by Google (Sutskever et al. 2014)

SLIDE 102

What is Missing?

  • Alignment of input words to output words

⇒ Solution: attention mechanism

SLIDE 103

neural translation model with attention

SLIDE 104

Input Encoding

[Figure: recurrent network over the input words: given word → embedding → hidden state → predicted word]

  • Inspiration: recurrent neural network language model on the input side

SLIDE 105

Hidden Language Model States

  • This gives us the hidden states

H1 H2 H3 H4 H5 H6

  • These encode left context for each word
  • Same process in reverse: right context for each word

Ĥ1 Ĥ2 Ĥ3 Ĥ4 Ĥ5 Ĥ6

SLIDE 106

Input Encoder

Input Word Embeddings Left-to-Right Recurrent NN Right-to-Left Recurrent NN

  • Input encoder: concatenate bidirectional RNN states
  • Each word representation includes full left and right sentence context

SLIDE 107

Decoder

  • We want to have a recurrent neural network predicting output words

Hidden State Output Words

SLIDE 108

Decoder

  • We want to have a recurrent neural network predicting output words

Hidden State Output Words

  • We feed decisions on output words back into the decoder state

SLIDE 109

Decoder

  • We want to have a recurrent neural network predicting output words

Input Context Hidden State Output Words

  • We feed decisions on output words back into the decoder state
  • Decoder state is also informed by the input context

SLIDE 110

Attention

Encoder States Attention Hidden State Output Words

  • Given what we have generated so far (decoder hidden state)
  • ... which words in the input should we pay attention to (encoder states)?

SLIDE 111

Attention

Encoder States Attention Input Context Hidden State Output Words

  • Normalize attention (softmax)

αij = exp(a(si−1, hj)) / ∑k exp(a(si−1, hk))

  • Relevant input context: weigh input words according to attention: ci = ∑j αijhj
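Both formulas can be put together in a few lines, with a dot product standing in for the scoring function a(·,·) and random states for illustration (a sketch; real models learn a parameterized scoring function):

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(6, 4))        # encoder states h_1 .. h_6
s_prev = rng.normal(size=4)        # previous decoder state s_{i-1}

scores = h @ s_prev                            # a(s_{i-1}, h_j) as dot product
alpha = np.exp(scores) / np.exp(scores).sum()  # normalized attention weights
c = alpha @ h                                  # context c_i = sum_j alpha_ij h_j

print(alpha.sum(), c.shape)
```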

SLIDE 112

Attention

Encoder States Attention Input Context Hidden State Output Words

  • Use context to predict next hidden state and output word

SLIDE 113

Encoder-Decoder with Attention

Input Word Embeddings Left-to-Right Recurrent NN Right-to-Left Recurrent NN Attention Input Context Hidden State Output Words

SLIDE 114

questions?
