Machine Translation

– Classical and Statistical Approaches

Session 7: Statistical MT – Intro (2)

Jonas Kuhn Universität des Saarlandes, Saarbrücken The University of Texas at Austin jonask@coli.uni-sb.de DGfS/CL Fall School 2005, Ruhr-Universität Bochum, September 19-30, 2005

Jonas Kuhn: MT 2

Sessions 6/7: Statistical MT – Intro

Acknowledgements:

Some slides are borrowed from Kevin Knight, University of Southern California, from Colin Cherry, Alberta (see http://www.cs.ualberta.ca/~colinc) and from Leila Kosseim (http://www.cs.concordia.ca/~kosseim/)

“Translation without understanding”

  • Very brief introduction to probabilities
  • The noisy channel model for translation
  • Language modeling
  • Translation modeling
  • Decoding

Slides from Kevin Knight

Centauri/Arcturan [Knight, 1997]

  • 1a. ok-voon ororok sprok .
  • 1b. at-voon bichat dat .
  • 7a. lalok farok ororok lalok sprok izok enemok .
  • 7b. wat jjat bichat wat dat vat eneat .
  • 2a. ok-drubel ok-voon anok plok sprok .
  • 2b. at-drubel at-voon pippat rrat dat .
  • 8a. lalok brok anok plok nok .
  • 8b. iat lat pippat rrat nnat .
  • 3a. erok sprok izok hihok ghirok .
  • 3b. totat dat arrat vat hilat .
  • 9a. wiwok nok izok kantok ok-yurp .
  • 9b. totat nnat quat oloat at-yurp .
  • 4a. ok-voon anok drok brok jok .
  • 4b. at-voon krat pippat sat lat .
  • 10a. lalok mok nok yorok ghirok clok .
  • 10b. wat nnat gat mat bat hilat .
  • 5a. wiwok farok izok stok .
  • 5b. totat jjat quat cat .
  • 11a. lalok nok crrrok hihok yorok zanzanok .
  • 11b. wat nnat arrat mat zanzanat .
  • 6a. lalok sprok izok jok stok .
  • 6b. wat dat krat quat cat .
  • 12a. lalok rarok nok izok hihok mok .
  • 12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
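
One way to attack the assignment mechanically is to look at co-occurrence: for a Centauri word, intersect the Arcturan word sets of all sentence pairs in which it appears. The sketch below does only that. The function name `candidates` and the data layout are assumptions made for illustration, and this is a counting heuristic, not a trained translation model.

```python
# The twelve Centauri/Arcturan sentence pairs from the slide (final "." omitted).
PAIRS = [
    ("ok-voon ororok sprok", "at-voon bichat dat"),
    ("ok-drubel ok-voon anok plok sprok", "at-drubel at-voon pippat rrat dat"),
    ("erok sprok izok hihok ghirok", "totat dat arrat vat hilat"),
    ("ok-voon anok drok brok jok", "at-voon krat pippat sat lat"),
    ("wiwok farok izok stok", "totat jjat quat cat"),
    ("lalok sprok izok jok stok", "wat dat krat quat cat"),
    ("lalok farok ororok lalok sprok izok enemok", "wat jjat bichat wat dat vat eneat"),
    ("lalok brok anok plok nok", "iat lat pippat rrat nnat"),
    ("wiwok nok izok kantok ok-yurp", "totat nnat quat oloat at-yurp"),
    ("lalok mok nok yorok ghirok clok", "wat nnat gat mat bat hilat"),
    ("lalok nok crrrok hihok yorok zanzanok", "wat nnat arrat mat zanzanat"),
    ("lalok rarok nok izok hihok mok", "wat nnat forat arrat vat gat"),
]


def candidates(centauri_word):
    """Arcturan words present in every pair whose Centauri side contains the word."""
    sets = [set(arc.split()) for cen, arc in PAIRS if centauri_word in cen.split()]
    return set.intersection(*sets) if sets else set()


for word in "farok crrrok hihok yorok clok kantok ok-yurp".split():
    print(word, "->", sorted(candidates(word)))
```

Words seen in several pairs (such as farok and hihok) narrow down to a single candidate; words seen only once keep a whole sentence of candidates and need further reasoning.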

Strategy: process of elimination

Strategy: cognate?

Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }
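
The ordering task is what a language model is for. As a rough sketch, one can count adjacent word pairs (bigrams) in the Arcturan side of the corpus and prefer the ordering whose adjacent pairs were seen most often. The scoring by raw bigram counts is a simplifying assumption; a real language model would use smoothed probabilities, and several orderings can tie.

```python
from collections import Counter
from itertools import permutations

# Arcturan side of the twelve sentence pairs (final "." omitted).
ARCTURAN = [
    "at-voon bichat dat",
    "at-drubel at-voon pippat rrat dat",
    "totat dat arrat vat hilat",
    "at-voon krat pippat sat lat",
    "totat jjat quat cat",
    "wat dat krat quat cat",
    "wat jjat bichat wat dat vat eneat",
    "iat lat pippat rrat nnat",
    "totat nnat quat oloat at-yurp",
    "wat nnat gat mat bat hilat",
    "wat nnat arrat mat zanzanat",
    "wat nnat forat arrat vat gat",
]

# Count how often each ordered pair of adjacent words occurs.
BIGRAMS = Counter()
for sentence in ARCTURAN:
    words = sentence.split()
    BIGRAMS.update(zip(words, words[1:]))


def score(order):
    """Number of attested bigrams in a candidate ordering."""
    return sum(BIGRAMS[(a, b)] for a, b in zip(order, order[1:]))


def best_order(bag):
    """Exhaustively try all orderings; feasible only for very small bags."""
    return max(permutations(bag), key=score)


best = best_order(["jjat", "arrat", "mat", "bat", "oloat", "at-yurp"])
print(" ".join(best), score(best))
```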

zero fertility

Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa

It’s Really Spanish/English

  • 1a. Garcia and associates .
  • 1b. Garcia y asociados .
  • 7a. the clients and the associates are enemies .
  • 7b. los clientes y los asociados son enemigos .
  • 2a. Carlos Garcia has three associates .
  • 2b. Carlos Garcia tiene tres asociados .
  • 8a. the company has three groups .
  • 8b. la empresa tiene tres grupos .
  • 3a. his associates are not strong .
  • 3b. sus asociados no son fuertes .
  • 9a. its groups are in Europe .
  • 9b. sus grupos estan en Europa .
  • 4a. Garcia has a company also .
  • 4b. Garcia tambien tiene una empresa .
  • 10a. the modern groups sell strong pharmaceuticals .
  • 10b. los grupos modernos venden medicinas fuertes .

  • 5a. its clients are angry .
  • 5b. sus clientes estan enfadados .
  • 11a. the groups do not sell zanzanine .
  • 11b. los grupos no venden zanzanina .
  • 6a. the associates are also angry .
  • 6b. los asociados tambien estan enfadados .
  • 12a. the small groups are not modern .
  • 12b. los grupos pequenos no son modernos .

We need three things (for F→E)

1. A Language Model of English: P( E )

  • Measures fluency
  • Probability of an English sentence
  • ~ Provides a set of fluent sentences to test for potential translation

2. A Translation Model: P( F | E )

  • Measures faithfulness
  • Probability of a French sentence, given an English sentence
  • ~ Tests if a given fluent sentence is a translation

3. A Decoder: arg max

  • An effective and efficient search technique to find E*
  • The search space is infinite and rather unstructured, so heuristic search has to be applied
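
A minimal sketch of how the three components combine: the decoder picks E* = argmax over E of P(E) · P(F | E). The candidate list and every probability value below are hypothetical, chosen only for illustration; a real decoder cannot enumerate candidates like this and searches heuristically.

```python
import math

# Hypothetical scores for translating the French input "la maison bleue".
# Each candidate E maps to (P(E) from a language model, P(F|E) from a translation model).
CANDIDATES = {
    "the blue house": (0.02, 0.3),
    "the house blue": (0.0001, 0.3),
    "a blue home": (0.01, 0.05),
}


def decode(candidates):
    """Return the candidate E that maximizes log P(E) + log P(F|E)."""
    def log_score(e):
        p_e, p_f_given_e = candidates[e]
        return math.log(p_e) + math.log(p_f_given_e)

    return max(candidates, key=log_score)


print(decode(CANDIDATES))
```

Working in log space avoids underflow once sentences get longer and the probabilities get very small.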

Where will we get P(F|E)?

Books in English + same books, in French → Machine Learning Magic → P(F|E) model

We call collections stored in two languages parallel corpora or parallel texts

Want to update your system? Just add more text!

Problem: How are we going to generalize from examples of translations?

Strategy: Generative Story

When modeling P(X|Y):

  • Assume you start with Y
  • Decompose the creation of X from Y into some number of operations
  • Track statistics of individual operations
  • For a new example X,Y: P(X|Y) can be calculated based on the probability of the operations needed to get X from Y

Translation modeling

Word-based translation models

Original approach

“Phrase”-based translation models

Currently best-performing

Syntax-based translation models

Ongoing research

Translation modeling

Word-based translation models

  • “IBM models 1-5” from Candide project
  • Detailed description and motivation in Brown, Peter F., Vincent J. Della Pietra, Stephen A. Della Pietra, Robert L. Mercer (1993): The Mathematics of Statistical Machine Translation: Parameter Estimation. In Computational Linguistics 19(2), 263-311.
  • More accessible intro in Kevin Knight’s tutorial workbook (1999)

“Phrase”-based translation models

  • First proposed by Franz Josef Och
  • Overview in Franz Josef Och and Hermann Ney (2004): The Alignment Template Approach to Statistical Machine Translation. In Computational Linguistics 30(4), 417-449.
  • Compare Koehn et al. (2003) paper

Translation modeling

Syntax-based translation models

  • Alshawi, H., Srinivas, B. and Douglas, S. (2000): Learning dependency translation models as collections of finite state head transducers. In Computational Linguistics 26 (1), 45-60.
  • Dekai Wu (1997): Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora. In Computational Linguistics 23 (3), 377-403.
  • Kenji Yamada and Kevin Knight (2001): A Syntax-based Statistical Translation Model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, 523-530.

Sentence Alignment

  • A crucial preprocessing step for all sentence-internal alignment approaches
  • Standard algorithm by Gale and Church (1993)
  • Exclusively based on the length of candidate segments (sections, paragraphs, sentences) in characters
  • Also takes into account the overall length ratio between the two languages
  • Exploits the fact that typically no reordering takes place
  • Possible alignment patterns: 1-1, 1-2, 2-1, 1-0, or 0-1
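
The idea of length-based alignment can be sketched as a dynamic program over the two sentence sequences. The version below is a strong simplification and not the Gale and Church algorithm itself: the cost of a bead is just the absolute difference in character length plus a fixed penalty per pattern, whereas the original uses a probabilistic length model. The penalty values, the function name `align_by_length`, and the example lengths are assumptions made for illustration.

```python
# Allowed patterns (source sentences, target sentences) with illustrative penalties.
# These numbers are assumptions for this sketch, not values from Gale and Church.
BEAD_PENALTY = {(1, 1): 0, (2, 1): 5, (1, 2): 5, (1, 0): 10, (0, 1): 10}


def align_by_length(src_lens, tgt_lens):
    """Return beads ((src_start, src_end), (tgt_start, tgt_end)) with minimal total cost."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEAD_PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                # No reordering: a bead always covers the next sentences on each side.
                length_gap = abs(sum(src_lens[i:ni]) - sum(tgt_lens[j:nj]))
                new_cost = cost[i][j] + length_gap + penalty
                if new_cost < cost[ni][nj]:
                    cost[ni][nj] = new_cost
                    back[ni][nj] = (i, j)
    # Follow back-pointers from the end to recover the bead sequence.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append(((pi, i), (pj, j)))
        i, j = pi, pj
    return beads[::-1]


# Illustrative character lengths: four source sentences, three target sentences.
print(align_by_length([21, 25, 22, 17], [51, 22, 22]))
```

With these lengths the first two source sentences are merged into one 2-1 bead, followed by two 1-1 beads. A purely length-based cost can of course disagree with a human alignment when unrelated sentences happen to have similar lengths.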

Sentence Alignment

The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await.

El viejo está feliz porque ha pescado muchas veces. Su mujer habla con él. Los tiburones esperan.

Sentence Alignment

1. The old man is happy.
2. He has fished many times.
3. His wife talks to him.
4. The fish are jumping.
5. The sharks await.

1. El viejo está feliz porque ha pescado muchas veces.
2. Su mujer habla con él.
3. Los tiburones esperan.

Sentence Alignment

1. The old man is happy. He has fished many times.
2. His wife talks to him.
3. The sharks await.

1. El viejo está feliz porque ha pescado muchas veces.
2. Su mujer habla con él.
3. Los tiburones esperan.

Note that unaligned sentences are thrown out, and sentences are merged in n-to-m alignments (n, m > 0).

Word-based model (IBM-3)

Generative story: assume a random process:

1. Mary did not slap the green witch
2. Mary not slap slap slap the green witch (fertility: n(3|slap))
3. Mary not slap slap slap NULL the green witch (NULL insertion: P-Null)
4. Maria no dió una bofetada a la verde bruja (word translation: t(la|the))
5. Maria no dió una bofetada a la bruja verde (distortion: d(j|i))

Probabilities can be learned from raw bilingual text.

Word alignments as hidden variables

… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …

  • Initially: all word alignments equally likely, all P(french-word | english-word) equally likely
  • “la” and “the” observed to co-occur frequently, so P(la | the) is increased.
  • “house” co-occurs with both “la” and “maison”, but P(maison | house) can be raised without limit, to 1.0, while P(la | house) is limited because of “the” (pigeonhole principle)
  • Settling down after another iteration
  • Inherent structure is revealed by the Expectation Maximization (EM) algorithm

For details, see:

  • “A Statistical MT Tutorial Workbook” (Knight, 1999).
  • “The Mathematics of Statistical Machine Translation” (Brown et al, 1993)
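
The EM iterations on the toy corpus can be reproduced with a few lines. This is a minimal sketch in the style of IBM Model 1 (lexical probabilities only, uniform start, no NULL word, no fertility or distortion), so it is a simplification of the models discussed above. The function names and the number of iterations are assumptions.

```python
from collections import defaultdict

# Toy parallel corpus from the slides: (English words, French words).
CORPUS = [
    ("the house".split(), "la maison".split()),
    ("the blue house".split(), "la maison bleue".split()),
    ("the flower".split(), "la fleur".split()),
]


def train_ibm1(corpus, iterations=20):
    """Estimate t[(f, e)] = P(f | e) with EM, starting from a uniform table."""
    f_vocab = {f for _, fs in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of links leaving e
        for es, fs in corpus:
            for f in fs:
                # E-step: share one unit of count for f among the English words.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


def best_translation(t, e):
    """French word with the highest P(f | e) for a given English word."""
    options = {f: p for (f, e2), p in t.items() if e2 == e}
    return max(options, key=options.get)


t = train_ibm1(CORPUS)
for e in ["the", "house", "blue", "flower"]:
    print(e, "->", best_translation(t, e), round(t[(best_translation(t, e), e)], 3))
```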

Statistical Machine Translation

… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …

P(juste | fair) = 0.411
P(juste | correct) = 0.027
P(juste | right) = 0.020

new French sentence → Possible English translations, to be rescored by language model

Flaws of Word-Based MT

  • Multiple English words for one French word
      • IBM models can do one-to-many (fertility) but not many-to-one
  • Phrasal Translation
      • “real estate”, “note that”, “interest in”
  • Syntactic Transformations
      • Verb at the beginning in Arabic
      • Translation model penalizes any proposed re-ordering
      • Language model not strong enough to force the verb to move to the right place

“Phrase”-Based Statistical MT

Morgen fliege ich nach Kanada zur Konferenz
Tomorrow I will fly to the conference in Canada

  • Foreign input segmented into phrases
      • “phrase” is any sequence of words
  • Each phrase is probabilistically translated into English
      • P(to the conference | zur Konferenz)
      • P(into the meeting | zur Konferenz)
  • Phrases are probabilistically re-ordered

This is state-of-the-art!

Advantages of “Phrase”-Based Models

  • Many-to-many mappings can handle non-compositional phrases
  • Local context is very useful for disambiguating
      • “Interest rate” …
      • “Interest in” …
  • The more data, the longer the learned phrases
      • Sometimes whole sentences

Learning the phrase translation table

  • One method: “alignment templates” (Och et al, 1999)
  • Start with word alignment, build phrases from that.

Mary did not slap the green witch

Maria no dió una bofetada a la bruja verde

This word-to-word alignment is a by-product of training a translation model like IBM-Model-3. This is the best (or “Viterbi”) alignment.

IBM Models are 1-to-Many

Run IBM-style aligner both directions, then merge:

  • E→F best alignment
  • F→E best alignment
  • MERGE: Union or Intersection
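
The merge step amounts to set operations on alignment points. In the sketch below, an alignment is a set of (English index, foreign index) pairs; both example alignments are hypothetical, written by hand for “Mary did not slap” / “Maria no dió una bofetada”, and not the output of a real aligner.

```python
def symmetrize(e2f, f2e):
    """Return (intersection, union) of two directional word alignments."""
    e2f, f2e = set(e2f), set(f2e)
    return e2f & f2e, e2f | f2e


# Hypothetical E->F alignment: one English word may link to several foreign words.
E2F = {(0, 0), (2, 1), (3, 2), (3, 3), (3, 4)}
# Hypothetical F->E alignment: one foreign word may link to several English words.
F2E = {(0, 0), (1, 1), (2, 1), (3, 4)}

intersection, union = symmetrize(E2F, F2E)
print(sorted(intersection))  # high precision: points both directions agree on
print(sorted(union))         # high recall: points proposed by either direction
```

The intersection keeps only links both directions agree on, while the union recovers many-to-many links that neither direction can express alone.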

Learning the phrase translation table

Collect all phrase pairs that are consistent with the word alignment

Mary did not slap the green witch

Maria no dió una bofetada a la bruja verde

one example phrase pair

Consistent with Word Alignment

Phrase alignment must contain all alignment points for all the words in both phrases!

[Figure: three candidate phrase pairs over “Mary did not slap” / “Maria no dió”, labeled consistent, inconsistent (x), inconsistent (x)]

Mary did not slap the green witch

Maria no dió una bofetada a la bruja verde

Word Alignment Induced Phrases

  • (Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
  • (a la, the) (dió una bofetada a, slap)
  • (Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the) (bruja verde, green witch)
  • (Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch) …
  • (Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch)
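
Phrase pairs consistent with a word alignment can be collected by brute force over all span pairs. The sketch below applies the consistency criterion directly: a span pair is kept if it contains at least one alignment point and no alignment point touches only one of the two spans. The alignment points are an assumption, reconstructed by hand from the phrase pairs listed on the slides (with “a” left unaligned); practical extractors are more efficient and limit phrase length.

```python
E_WORDS = "Mary did not slap the green witch".split()
F_WORDS = "Maria no dió una bofetada a la bruja verde".split()

# Assumed alignment points (English index, foreign index); "a" (index 5) is unaligned.
ALIGNMENT = {(0, 0), (1, 1), (2, 1), (3, 2), (3, 3), (3, 4), (4, 6), (5, 8), (6, 7)}


def extract_phrases(e_words, f_words, alignment):
    """Return all (foreign phrase, English phrase) pairs consistent with the alignment."""
    pairs = set()
    for e1 in range(len(e_words)):
        for e2 in range(e1, len(e_words)):
            for f1 in range(len(f_words)):
                for f2 in range(f1, len(f_words)):
                    # Points with both ends inside the candidate spans.
                    inside = [p for p in alignment
                              if e1 <= p[0] <= e2 and f1 <= p[1] <= f2]
                    # Points with at least one end inside the candidate spans.
                    touching = [p for p in alignment
                                if e1 <= p[0] <= e2 or f1 <= p[1] <= f2]
                    if inside and len(inside) == len(touching):
                        pairs.add((" ".join(f_words[f1:f2 + 1]),
                                   " ".join(e_words[e1:e2 + 1])))
    return pairs


phrases = extract_phrases(E_WORDS, F_WORDS, ALIGNMENT)
print(len(phrases))
print(("a la", "the") in phrases, ("no", "not") in phrases)
```

Unaligned words such as “a” may attach to either neighbouring phrase, which is why both (la, the) and (a la, the) are extracted.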

Phrase Pair Probabilities

  • A certain phrase pair (f-f-f, e-e-e) may appear many times across the bilingual corpus.
      • We hope so!
  • So, now we have a vast list of phrase pairs and their frequencies – how to assign probabilities?

Phrase Pair Probabilities

Basic idea:

  • No EM training
  • Just relative frequency: P(f-f-f | e-e-e) = count(f-f-f, e-e-e) / count(e-e-e)

Important refinements:

  • Smooth using word probs P(f | e) for individual words connected in the word alignment
      • Some low count phrase pairs now have high probability, others have low probability
  • Discount for ambiguity
      • If phrase e-e-e can map to 5 different French phrases, due to the ambiguity of unaligned words, each pair gets a 1/5 count
  • Count BAD events too
      • If phrase e-e-e doesn’t map onto any contiguous French phrase, increment event count(BAD, e-e-e)
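
The basic relative-frequency estimate is a matter of counting. The sketch below covers only the unrefined formula (no smoothing, no ambiguity discount, no BAD events), and the list of extracted phrase pairs is hypothetical.

```python
from collections import Counter


def phrase_probs(pairs):
    """P(f | e) = count(f, e) / count(e) over a list of (foreign, English) phrase pairs."""
    pair_counts = Counter(pairs)
    e_counts = Counter(e for _, e in pairs)
    return {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}


# Hypothetical phrase pairs, as if collected from several sentence pairs.
EXTRACTED = [
    ("la", "the"),
    ("la", "the"),
    ("a la", "the"),
    ("bruja", "witch"),
]

for (f, e), p in sorted(phrase_probs(EXTRACTED).items()):
    print(f"P({f} | {e}) = {p:.3f}")
```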

Training: a bootstrapping process

Sophisticated bootstrapping routine

1. Split corpus into sentences
2. Align sentences (1-1, 1-2, 2-1, 1-0, or 0-1)
3. Estimate simple word translation probs (IBM-1)
4. Successively improve assumed word alignment (IBM-2 to IBM-4/5)
5. Extract “phrases” defined by word alignment

Training: a bootstrapping process

  • Widely used open-source tool for estimating a word translation model (and producing a word alignment): GIZA++ implementation of IBM models
  • Extension of the program GIZA (part of the SMT toolkit EGYPT), which was developed by the Statistical Machine Translation team during the summer workshop in 1999 at the Center for Language and Speech Processing at Johns Hopkins University
  • GIZA++ available from: http://www.fjoch.com/GIZA++.html
  • Training script for phrase-based translation by Philipp Koehn (building on GIZA++)
  • Works with Koehn’s “Pharaoh” Decoder