

SLIDE 1

Machine Translation

12: (Non-neural) Statistical Machine Translation Rico Sennrich

University of Edinburgh

SLIDE 2

Today’s Lecture

So far, main focus of lecture was on:

neural machine translation research since ≈2013

Today, we look at (non-neural) Statistical Machine Translation, and research since ≈ 1990.

SLIDE 3

1. Statistical Machine Translation Basics
2. Phrase-based SMT
3. Hierarchical SMT
4. Syntax-based SMT

SLIDE 4

Refresher: A probabilistic model of translation

Suppose that we have:

• a source sentence S of length m: x1, ..., xm
• a target sentence T of length n: y1, ..., yn

We can express translation as a probabilistic model:

T* = argmax_T P(T|S) = argmax_T P(S|T)·P(T)    (Bayes' theorem)

We can model translation via two models:

• language model to estimate P(T)
• translation model to estimate P(S|T)

Without continuous space representations, how to estimate P(S|T)?

→ break it up into smaller units
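To make the decomposition concrete, here is a minimal sketch (not from the original slides) that ranks candidate translations by P(S|T)·P(T); both probability tables are invented toy values:

```python
# toy noisy-channel ranking: score candidate translations T of a source S
# by P(S|T) * P(T); both tables are invented for illustration
translation_model = {("la maison", "the house"): 0.8,
                     ("la maison", "the home"): 0.1}
language_model = {"the house": 0.02, "the home": 0.005}

def score(source, target):
    # P(S|T) * P(T), as in the Bayes decomposition above
    return (translation_model.get((source, target), 0.0)
            * language_model.get(target, 0.0))

candidates = ["the house", "the home"]
print(max(candidates, key=lambda t: score("la maison", t)))
# "the house": 0.8 * 0.02 = 0.016 beats 0.1 * 0.005 = 0.0005
```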

SLIDE 5

Word Alignment

chicken-and-egg problem

let's break up P(S|T) into small units (words):
• we can estimate an alignment given a translation model (expectation step)
• we can estimate a translation model given an alignment, using relative frequencies (maximization step)
• what can we do if we have neither?

solution: Expectation Maximization (EM) algorithm
• initialize the model
• iterate between estimating the alignment and the translation model
• the simplest model is based on lexical translation; more complex models consider position and fertility
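A minimal sketch of this EM loop for IBM Model 1 (lexical translation only; the NULL word, position, and fertility are omitted, and the toy corpus mirrors the example on the following slides):

```python
from collections import defaultdict

def ibm_model1(corpus, iterations=10):
    """EM for IBM Model 1: estimate t(e|f) from (French, English) sentence
    pairs. Teaching sketch: lexical translation only, no NULL word,
    no position or fertility modelling."""
    t = defaultdict(lambda: 1.0)  # initial step: all alignments equally likely
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(e, f)
        total = defaultdict(float)
        for fr, en in corpus:
            for e in en:
                # expectation step: distribute each English word's mass
                # over the French words it could align to
                z = sum(t[(e, f)] for f in fr)
                for f in fr:
                    count[(e, f)] += t[(e, f)] / z
                    total[f] += t[(e, f)] / z
        # maximization step: re-estimate t(e|f) by relative frequency
        t = defaultdict(float,
                        {(e, f): c / total[f] for (e, f), c in count.items()})
    return t

corpus = [("la maison".split(), "the house".split()),
          ("la maison bleu".split(), "the blue house".split()),
          ("la fleur".split(), "the flower".split())]
t = ibm_model1(corpus)
print(round(t[("the", "la")], 2))  # high: EM links "the" to "la"
```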

SLIDE 6

Word Alignment: IBM Models [Brown et al., 1993]

... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...

  • Initial step: all alignments equally likely
  • Model learns that, e.g., la is often aligned with the
SLIDE 7

Word Alignment: IBM Models [Brown et al., 1993]

... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...

  • After one iteration
  • Alignments, e.g., between la and the are more likely
SLIDE 8

Word Alignment: IBM Models [Brown et al., 1993]

... la maison ... la maison bleu ... la fleur ... ... the house ... the blue house ... the flower ...

  • After another iteration
  • It becomes apparent that alignments, e.g., between fleur and flower, are more likely (pigeonhole principle)

SLIDE 9

Word Alignment: IBM Models [Brown et al., 1993]

... la maison ... la maison bleu ... la fleur ... ... the house ... the blue house ... the flower ...

  • Convergence
  • Inherent hidden structure revealed by EM
SLIDE 10

Word Alignment: IBM Models [Brown et al., 1993]

  • Probabilities:

p(the|la) = 0.7        p(house|la) = 0.05
p(the|maison) = 0.1    p(house|maison) = 0.8

  • Alignments (all four ways of linking {the, house} to {la, maison}):

alignment                       p(e,a|f)    p(a|e,f)
the↔la, house↔maison            0.56        0.824
the↔la, house↔la                0.035       0.052
the↔maison, house↔maison        0.08        0.118
the↔maison, house↔la            0.005       0.007

  • Counts:

c(the|la) = 0.824 + 0.052        c(house|la) = 0.052 + 0.007
c(the|maison) = 0.118 + 0.007    c(house|maison) = 0.824 + 0.118
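These numbers can be checked by enumerating the four alignments directly; a small sketch (the lexical probability table is taken from this slide):

```python
from itertools import product

# toy check of the E-step numbers above: enumerate all alignments of
# "the house" to "la maison" under the given lexical probabilities
t = {("the", "la"): 0.7, ("house", "la"): 0.05,
     ("the", "maison"): 0.1, ("house", "maison"): 0.8}

fr = ["la", "maison"]
alignments = list(product(fr, repeat=2))  # each English word picks one French word
joint = [t[("the", a)] * t[("house", b)] for a, b in alignments]
z = sum(joint)                            # 0.68
posterior = [p / z for p in joint]        # 0.052, 0.824, 0.007, 0.118

# expected count c(the|la): posterior mass of alignments linking the-la
c_the_la = sum(p for (a, _), p in zip(alignments, posterior) if a == "la")
print(round(c_the_la, 3))  # 0.875 ≈ 0.824 + 0.052 (slide values are rounded)
```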

SLIDE 11

Linear Models

T* = argmax_T P(S|T)·P(T)    (Bayes' theorem)

T* ≈ argmax_T Σ_{m=1}^{M} λ_m·h_m(S, T)    [Och, 2003]

• linear combination of arbitrary features
• Minimum Error Rate Training (MERT) to optimize feature weights
• big trend in SMT research: engineering new/better features
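A sketch of what such a linear scoring function looks like; the feature names and weights below are illustrative stand-ins, not trained values (in practice, weights come from MERT):

```python
import math

def linear_score(features, weights):
    # weighted sum of feature functions h_m(S, T), as in the formula above
    return sum(weights[name] * value for name, value in features.items())

# hypothetical feature values (typically log-probabilities) for one hypothesis
features = {"tm_log_prob": math.log(0.02),
            "lm_log_prob": math.log(0.001),
            "phrase_penalty": -3.0,   # three phrases used
            "word_penalty": -6.0}     # six output words
weights = {"tm_log_prob": 1.0, "lm_log_prob": 0.5,
           "phrase_penalty": 0.2, "word_penalty": -0.1}
print(linear_score(features, weights))
```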

SLIDE 12

Word-based SMT

core idea

combine word-based translation model and n-gram language model to compute score of translation

consequences

+ models are easy to compute

− word translations are assumed to be independent of each other: only the LM takes context into account

− poor at modelling long-distance phenomena: n-gram context is limited

SLIDE 13

1. Statistical Machine Translation Basics
2. Phrase-based SMT
3. Hierarchical SMT
4. Syntax-based SMT

SLIDE 14

Phrase-based SMT

core idea

Basic translation unit in translation model is not word, but word sequence (phrase)

consequences

+ much better memorization of frequent phrase translations

− large (and noisy) phrase table
− large search space; requires sophisticated pruning
− still poor at modelling long-distance phenomena

[figure: phrase alignment of "leider ist Herr Steiger nach Köln gefahren" with "unfortunately, Mr Steiger has gone to Cologne"]

SLIDE 15

Phrase Extraction

• extraction rules based on word-aligned sentence pair
• phrase pair must be compatible with alignment...
• ...but unaligned words are ok
• phrases are contiguous sequences

[alignment matrix: "I shall be passing on to you some comments" ↔ "Ich werde Ihnen die entsprechenden Anmerkungen aushändigen"]

extracted: shall be = werde

SLIDE 16

Phrase Extraction

• extraction rules based on word-aligned sentence pair
• phrase pair must be compatible with alignment...
• ...but unaligned words are ok
• phrases are contiguous sequences

[alignment matrix: "I shall be passing on to you some comments" ↔ "Ich werde Ihnen die entsprechenden Anmerkungen aushändigen"]

extracted: some comments = die entsprechenden Anmerkungen

SLIDE 17

Phrase Extraction

• extraction rules based on word-aligned sentence pair
• phrase pair must be compatible with alignment...
• ...but unaligned words are ok
• phrases are contiguous sequences

[alignment matrix: "I shall be passing on to you some comments" ↔ "Ich werde Ihnen die entsprechenden Anmerkungen aushändigen"]

extracted: werde Ihnen die entsprechenden Anmerkungen aushändigen = shall be passing on to you some comments
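The compatibility criterion used on the last three slides can be written down compactly; a minimal sketch of the consistency check, with alignment indices invented for the "shall be = werde" example (source index 1 = werde, target indices 1-2 = shall be):

```python
def consistent(alignment, f_start, f_end, e_start, e_end):
    """Check whether a candidate phrase pair is compatible with the word
    alignment: no link may cross the phrase boundary, and at least one
    link must lie inside it (unaligned words are ok)."""
    inside = False
    for f, e in alignment:
        f_in = f_start <= f <= f_end
        e_in = e_start <= e <= e_end
        if f_in != e_in:         # a link leaves the box: not extractable
            return False
        inside |= f_in and e_in  # require at least one link inside
    return inside

# toy alignment: source 0 <-> target 0, source 1 (werde) <-> targets 1, 2
alignment = [(0, 0), (1, 1), (1, 2)]
print(consistent(alignment, 1, 1, 1, 2))  # True:  werde = shall be
print(consistent(alignment, 1, 1, 1, 1))  # False: link (1, 2) crosses boundary
```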

SLIDE 18

Common Features in Phrase-based SMT

• phrase translation probabilities (in both directions)
• word translation probabilities (in both directions)
• language model
• reordering model
• constant penalty for each phrase used
• sparse features with learned cost for some (classes of) phrase pairs
• multiple models of each type possible

SLIDE 19

Decoding

er geht ja nicht nach hause

[table of translation options per source span, e.g. er → he / it / it is / he will be; geht → goes / go / is; ja → yes / of course; nicht → not / do not / does not / is not; nach → after / to / according to; nach hause → home / at home / return home; hause → house / home / chamber]

  • The machine translation decoder does not know the right answer
    – picking the right translation options
    – arranging them in the right order

→ search problem, solved by heuristic beam search

SLIDE 20

Decoding

er geht ja nicht nach hause

[search graph: empty hypothesis expanded with one translation option, e.g. "are"]

pick any translation option, create new hypothesis

SLIDE 21

Decoding

er geht ja nicht nach hause

[search graph: initial hypotheses "are", "it", "he", ...]

create hypotheses for all other translation options

SLIDE 22

Decoding

er geht ja nicht nach hause

[search graph: partial hypotheses extended further, e.g. "he" → "he goes" → "he goes does not"]

also create hypotheses from the partial hypotheses already created

SLIDE 23

Decoding

er geht ja nicht nach hause

[search graph: complete hypotheses covering all source words; best path highlighted]

backtrack from the highest-scoring complete hypothesis

SLIDE 24

Decoding

• large search space (exponential number of hypotheses)
• reduction of search space:
  – recombination of identical hypotheses
  – pruning of hypotheses
• efficient decoding is a lot more complex in SMT than in neural MT
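A heavily simplified sketch of stack decoding with histogram pruning, using the running example; it decodes monotonically (no reordering, recombination, or LM scoring), and all options and scores are toy values:

```python
import heapq

def beam_decode(src, options, beam_size=10):
    """Monotone stack decoding sketch: hypotheses are grouped into stacks by
    the number of covered source words; histogram pruning keeps each stack
    small. Reordering, hypothesis recombination and LM scoring are omitted."""
    n = len(src)
    stacks = [[] for _ in range(n + 1)]
    stacks[0] = [(0.0, 0, "")]  # (score, source words covered, output so far)
    for covered in range(n):
        for score, cov, out in stacks[covered]:
            for (i, j), phrases in options.items():
                if i != cov:  # monotone: extend at the leftmost uncovered word
                    continue
                for phrase, logp in phrases:
                    stacks[j].append((score + logp, j,
                                      (out + " " + phrase).strip()))
        # pruning: keep only the beam_size best hypotheses in the next stack
        stacks[covered + 1] = heapq.nlargest(beam_size, stacks[covered + 1],
                                             key=lambda h: h[0])
    # backtracking is implicit here: the output string is carried along
    return max(stacks[n], key=lambda h: h[0])

src = "er geht ja nicht nach hause".split()
options = {(0, 1): [("he", -0.5), ("it", -1.0)],
           (1, 2): [("goes", -0.7)],
           (2, 4): [("does not", -0.9)],
           (4, 6): [("home", -0.4)]}
print(beam_decode(src, options))  # (-2.5, 6, 'he goes does not home')
```

Note that this monotone sketch can never produce "he does not go home"; a real phrase-based decoder allows reordering at a distortion cost, which is precisely what makes the search space so large.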

SLIDE 25

1. Statistical Machine Translation Basics
2. Phrase-based SMT
3. Hierarchical SMT
4. Syntax-based SMT

SLIDE 26

Hierarchical SMT

core idea

use context-free grammar (CFG) rules as basic translation units
→ allows gaps

consequences

+ better modeling of some reordering patterns

[derivation figure: "leider ist Herr Steiger nach Köln gefahren" ↔ "unfortunately, Mr Steiger has gone to Cologne"]

− overgeneralisation is still possible

[derivation figure: ungrammatical output "Herr Steiger does not unfortunately ... has gone to Cologne" for "... nicht leider ..."]

SLIDE 27

Hierarchical Phrase Extraction

[alignment matrix: "I shall be passing on to you some comments" ↔ "Ich werde Ihnen die entsprechenden Anmerkungen aushändigen"]

extracted: werde X aushändigen = shall be passing on X
(obtained by subtracting a subphrase)
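A sketch of the subphrase-subtraction step, using string replacement for brevity (a real extractor works on alignment span indices, and the inner pair must itself be an extractable phrase pair):

```python
def subtract(outer_src, outer_tgt, inner_src, inner_tgt):
    """Replace a smaller extracted phrase pair nested inside a larger one
    with a linked non-terminal X on both sides. All arguments are token
    lists; string-level replace is used here only for brevity."""
    def replace(seq, sub):
        return " ".join(seq).replace(" ".join(sub), "X", 1).split()
    return replace(outer_src, inner_src), replace(outer_tgt, inner_tgt)

outer_src = "werde Ihnen die entsprechenden Anmerkungen aushändigen".split()
outer_tgt = "shall be passing on to you some comments".split()
inner_src = "Ihnen die entsprechenden Anmerkungen".split()
inner_tgt = "to you some comments".split()
print(subtract(outer_src, outer_tgt, inner_src, inner_tgt))
# (['werde', 'X', 'aushändigen'], ['shall', 'be', 'passing', 'on', 'X'])
```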

SLIDE 28

Decoding

Decoding via (S)CFG derivation

s1 | s1

  • Derivation starts with a pair of linked s symbols.
SLIDE 29

Decoding

Decoding via (S)CFG derivation

s1 | s1
⇒ s2 x3 | s2 x3

  • s → s1 x2 | s1 x2 (glue rule)

SLIDE 30

Decoding

Decoding via (S)CFG derivation

s1 | s1
⇒ s2 x3 | s2 x3
⇒ s2 x4 und x5 | s2 x4 and x5

  • x → x1 und x2 | x1 and x2
SLIDE 31

Decoding

Decoding via (S)CFG derivation

s1 | s1
⇒ s2 x3 | s2 x3
⇒ s2 x4 und x5 | s2 x4 and x5
⇒ s2 unzutreffend und x5 | s2 unfounded and x5

  • x → unzutreffend | unfounded
SLIDE 32

Decoding

Decoding via (S)CFG derivation

s1 | s1
⇒ s2 x3 | s2 x3
⇒ s2 x4 und x5 | s2 x4 and x5
⇒ s2 unzutreffend und x5 | s2 unfounded and x5
⇒ s2 unzutreffend und irreführend | s2 unfounded and misleading

  • x → irreführend | misleading

SLIDE 33

Decoding

Decoding via (S)CFG derivation

s1 | s1
⇒ s2 x3 | s2 x3
⇒ s2 x4 und x5 | s2 x4 and x5
⇒ s2 unzutreffend und x5 | s2 unfounded and x5
⇒ s2 unzutreffend und irreführend | s2 unfounded and misleading
⇒ x6 unzutreffend und irreführend | x6 unfounded and misleading

  • s → x1 | x1 (glue rule)

SLIDE 34

Decoding

Decoding via (S)CFG derivation

s1 | s1
⇒ s2 x3 | s2 x3
⇒ s2 x4 und x5 | s2 x4 and x5
⇒ s2 unzutreffend und x5 | s2 unfounded and x5
⇒ s2 unzutreffend und irreführend | s2 unfounded and misleading
⇒ x6 unzutreffend und irreführend | x6 unfounded and misleading
⇒ deshalb x7 die x8 unzutreffend und irreführend | therefore the x8 x7 unfounded and misleading

  • x → deshalb x1 die x2 | therefore the x2 x1 (non-terminal reordering)

SLIDE 35

Decoding

Decoding via (S)CFG derivation

s1 | s1
⇒ s2 x3 | s2 x3
⇒ s2 x4 und x5 | s2 x4 and x5
⇒ s2 unzutreffend und x5 | s2 unfounded and x5
⇒ s2 unzutreffend und irreführend | s2 unfounded and misleading
⇒ x6 unzutreffend und irreführend | x6 unfounded and misleading
⇒ deshalb x7 die x8 unzutreffend und irreführend | therefore the x8 x7 unfounded and misleading
⇒ deshalb sei die x8 unzutreffend und irreführend | therefore the x8 was unfounded and misleading

  • x → sei | was
SLIDE 36

Decoding

Decoding via (S)CFG derivation

s1 | s1
⇒ s2 x3 | s2 x3
⇒ s2 x4 und x5 | s2 x4 and x5
⇒ s2 unzutreffend und x5 | s2 unfounded and x5
⇒ s2 unzutreffend und irreführend | s2 unfounded and misleading
⇒ x6 unzutreffend und irreführend | x6 unfounded and misleading
⇒ deshalb x7 die x8 unzutreffend und irreführend | therefore the x8 x7 unfounded and misleading
⇒ deshalb sei die x8 unzutreffend und irreführend | therefore the x8 was unfounded and misleading
⇒ deshalb sei die Werbung unzutreffend und irreführend | therefore the advertisement was unfounded and misleading

  • x → Werbung | advertisement
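The derivation can be replayed mechanically; a small sketch (replaying only the last three rule applications, starting from a single linked non-terminal) in which an SCFG rule rewrites one pair of linked non-terminals simultaneously on both sides:

```python
def apply_rule(pair, link, rule):
    """Rewrite the linked non-terminal `link` simultaneously on the source
    and target side. Non-terminals are (symbol, index) tuples, words are
    plain strings."""
    def expand(side, body):
        return [tok for t in side for tok in (body if t == link else [t])]
    (src, tgt), (rule_src, rule_tgt) = pair, rule
    return expand(src, rule_src), expand(tgt, rule_tgt)

pair = ([("X", 1)], [("X", 1)])
# x -> deshalb x1 die x2 | therefore the x2 x1  (non-terminal reordering)
pair = apply_rule(pair, ("X", 1),
                  (["deshalb", ("X", 2), "die", ("X", 3)],
                   ["therefore", "the", ("X", 3), ("X", 2)]))
pair = apply_rule(pair, ("X", 2), (["sei"], ["was"]))                # x -> sei | was
pair = apply_rule(pair, ("X", 3), (["Werbung"], ["advertisement"]))  # x -> Werbung | advertisement
print(" ".join(pair[0]), "|", " ".join(pair[1]))
# deshalb sei die Werbung | therefore the advertisement was
```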
SLIDE 37

1. Statistical Machine Translation Basics
2. Phrase-based SMT
3. Hierarchical SMT
4. Syntax-based SMT

SLIDE 38

Syntax-based SMT

core idea

• use syntax on source, target, or both
• rule extraction constrained by syntax
• potentially use syntactic structures for scoring (syntax-based LMs)

consequences

(depend on exact flavor of syntax used; here: string-to-tree SMT)

+ less overgeneralisation

− sparsity in grammar requires relaxation of extraction constraints
− label matching constraints increase search space during decoding

[figure: word-aligned parse trees of German "leider ist Herr Steiger nach Köln gefahren" (tags NN, VAFIN, ADV, APPR, NE, VVPP; constituents NP, PP, VP, S) and English "unfortunately, Mr Steiger has gone to Cologne" (tags NNP, ADV, VBZ, VBN, TO; constituents NP, PP, VP, S)]

SLIDE 39

Syntax-based Phrase Extraction

[word-aligned tree pair: English "I shall be passing on to you some comments" (tags PRP MD VB VBG RP TO PRP NNS; constituents NP, PP, VP, S) and German "Ich werde Ihnen die entsprechenden Anmerkungen aushändigen" (tags PPER VAFIN PPER ART ADJ NN VVFIN; constituents NP, VP, S)]

extracted: pro → Ihnen = pp → to you

SLIDE 40

Decoding

Input: jemand mußte Josef K. verleumdet haben
(gloss: someone must Josef K. slandered have)

Grammar

r1: np → Josef K. | Josef K. (0.90)
r2: vbn → verleumdet | slandered (0.40)
r3: vbn → verleumdet | defamed (0.20)
r4: vp → mußte x1 x2 haben | must have vbn2 np1 (0.10)
r5: s → jemand x1 | someone vp1 (0.60)
r6: s → jemand mußte x1 x2 haben | someone must have vbn2 np1 (0.80)
r7: s → jemand mußte x1 x2 haben | np1 must have been vbn1 by someone (0.05)

Derivation 1

[derivation tree using r1, r2, r4, r5: jemand mußte Josef K. verleumdet haben | someone must have slandered Josef K.]

SLIDE 41

Why Syntax-based SMT?

• many variants (syntax on source/target/both...)
• syntactic constraints for rule extraction and application prevent some over-generalizations

syntactic structure can be exploited by feature functions:

unification constraints [Williams, 2009]

"eine" → [cat: ART; infl: [case: nom, declension: mixed]; agr: [gender: f, num: sg]]

"Welt" → [cat: NN; infl: [case: nom]; agr: [gender: f, num: sg]]

syntax-based neural language model [Sennrich, 2015]

P_SYNTAX(T, D) ≈ ∏_{i=1}^{n} P_l(i) · P_w(i)

P_l(i) = P(l_i | w_a(i), l_a(i))
P_w(i) = P(w_i | l_i, w_a(i), l_a(i))

[dependency tree example: "Laura hat einen kleinen Garten" / "Laura has a small garden", with relations root, subj, obja, det, attr]
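A sketch of how this factorization scores a dependency tree: for each word, first predict its relation label from its head, then the word itself. The probability tables below are invented toy values standing in for the trained models:

```python
import math

# toy stand-ins for the trained models P(l_i | w_a(i), l_a(i)) and
# P(w_i | l_i, w_a(i), l_a(i)); keys and values are invented
p_label = {("has", "root", "subj"): 0.4, ("garden", "obja", "det"): 0.5}
p_word = {("subj", "has", "root", "Laura"): 0.05,
          ("det", "garden", "obja", "a"): 0.4}

def syntax_lm_logprob(tree, floor=1e-6):
    """tree: list of (word, label, head_word, head_label) tuples; returns
    the log of the product over P_l(i) * P_w(i), flooring unseen events."""
    logp = 0.0
    for w, l, hw, hl in tree:
        logp += math.log(p_label.get((hw, hl, l), floor))    # P_l(i)
        logp += math.log(p_word.get((l, hw, hl, w), floor))  # P_w(i)
    return logp

# fragment of the example tree: "Laura" is the subject of "has",
# "a" is the determiner of the object "garden"
tree = [("Laura", "subj", "has", "root"), ("a", "det", "garden", "obja")]
print(syntax_lm_logprob(tree))
```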

SLIDE 42

Edinburgh’s* WMT Results over the Years

BLEU (newstest2013 EN→DE):

year    phrase-based SMT    syntax-based SMT    neural MT
2013        20.3                19.4                –
2014        20.9                20.2                –
2015        20.8                22.0               18.9*
2016        21.5                22.1               24.7
2017         –                   –                 26.0

*NMT 2015 from U. Montréal: https://sites.google.com/site/acl16nmt/

SLIDE 43

What Phrase-based SMT (Still) Does Better than NMT

• better performance in low-data conditions [Koehn and Knowles, 2017]
• clear stopping criterion at decoding time: when all source words have been covered by a phrase pair
• good ecosystem of methods for specialized requirements (e.g. inclusion of terminology)
• ability to inspect translation decisions and models:
  – alignment between source and output
  – add/remove phrase table entries

SLIDE 44

Software

Moses SMT Toolkit

• developed in Edinburgh
• many features and extensive documentation:
  http://www.statmt.org/moses
• documentation of baseline phrase-based systems:
  http://www.statmt.org/moses/?n=moses.baseline
  http://lotus.kuee.kyoto-u.ac.jp/WAT/WAT2017/baseline/baselineSystemPhrase_kj.html
• config files for SOTA (in 2014/5) syntax-based systems:
  https://github.com/rsennrich/wmt2014-scripts

SLIDE 45

Further Reading

text books

• Philipp Koehn (2009). Statistical Machine Translation.
• Philip Williams, Rico Sennrich, Matt Post and Philipp Koehn (2016). Syntax-based Statistical Machine Translation.

online resources

• syntax-based tutorial by Philip Williams and Philipp Koehn (slide credit to them for some slides shown here):
  http://homepages.inf.ed.ac.uk/s0898777/syntax-tutorial.pdf
• slides on word- and phrase-based SMT by Philipp Koehn:
  http://www.statmt.org/book/slides/04-word-based-models.pdf
  http://www.statmt.org/book/slides/05-phrase-based-models.pdf
  http://www.statmt.org/book/slides/06-decoding.pdf

SLIDE 46

Bibliography I

Brown, P. F., Della Pietra, V. J., Della Pietra, S. A., and Mercer, R. L. (1993). The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263–311.

Koehn, P. and Knowles, R. (2017). Six Challenges for Neural Machine Translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver. Association for Computational Linguistics.

Och, F. J. (2003). Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL '03), pages 160–167, Sapporo, Japan. Association for Computational Linguistics.

Sennrich, R. (2015). Modelling and Optimizing on Syntactic N-Grams for Statistical Machine Translation. Transactions of the Association for Computational Linguistics, 3:169–182.

Williams, P. (2009). Towards Statistical Machine Translation with Unification Grammars. Master's thesis, University of Edinburgh, Edinburgh, UK.
