SLIDE 1

Domain Adaptation in Statistical Machine Translation

Logic, Language and Computation
Bart Mellebeek
October 28, 2013

Bart Mellebeek Domain Adaptation in Statistical Machine Translation

SLIDE 2

Outline

Machine Translation and AI
Statistical Machine Translation (SMT)
Domain Adaptation for SMT

SLIDE 3

Background

◮ Machine Translation
◮ Natural Language Processing in industry

SLIDE 4

Outline

Machine Translation and AI
Statistical Machine Translation (SMT)
Domain Adaptation for SMT

SLIDE 5

The Turing Test

SLIDE 6

1947

“When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’ ”
– Warren Weaver, March 1947

“… as to the problem of mechanical translation, I frankly am afraid that the [semantic] boundaries of words in different languages are too vague … to make any quasi-mechanical translation scheme very hopeful.”
– Norbert Wiener, April 1947

SLIDE 7

2013

SLIDE 8

2013

◮ ‘Poetic’ Statistical Machine Translation: Rhyme and Meter, Genzel et al., EMNLP 2010
◮ Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation, Greene et al., EMNLP 2010
◮ Modeling Hip Hop Challenge-Response Lyrics as Machine Translation, Wu et al., MT Summit 2013

SLIDE 9

The Translation Pyramid

SLIDE 10

Outline

Machine Translation and AI
Statistical Machine Translation (SMT)
Domain Adaptation for SMT

SLIDE 11

Centauri/Arcturan [Knight, 1997]

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

SLIDE 12

Centauri/Arcturan [Knight, 1997]

  • 1a. ok-voon ororok sprok .
  • 1b. at-voon bichat dat .
  • 2a. ok-drubel ok-voon anok plok sprok .
  • 2b. at-drubel at-voon pippat rrat dat .
  • 3a. erok sprok izok hihok ghirok .
  • 3b. totat dat arrat vat hilat .
  • 4a. ok-voon anok drok brok jok .
  • 4b. at-voon krat pippat sat lat .
  • 5a. wiwok farok izok stok .
  • 5b. totat jjat quat cat .
  • 6a. lalok sprok izok jok stok .
  • 6b. wat dat krat quat cat .
  • 7a. lalok farok ororok lalok sprok izok enemok .
  • 7b. wat jjat bichat wat dat vat eneat .
  • 8a. lalok brok anok plok nok .
  • 8b. iat lat pippat rrat nnat .
  • 9a. wiwok nok izok kantok ok-yurp .
  • 9b. totat nnat quat oloat at-yurp .
  • 10a. lalok mok nok yorok ghirok clok .
  • 10b. wat nnat gat mat bat hilat .
  • 11a. lalok nok crrrok hihok yorok zanzanok .
  • 11b. wat nnat arrat mat zanzanat .
  • 12a. lalok rarok nok izok hihok mok .
  • 12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

SLIDE 21

Centauri/Arcturan [Knight, 1997]

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

process of elimination

SLIDE 22

Centauri/Arcturan [Knight, 1997]

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

cognate?

SLIDE 23

Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }

Centauri/Arcturan [Knight, 1997]

zero fertility

SLIDE 24

Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa

It’s Really Spanish/English

  • 1a. Garcia and associates .
  • 1b. Garcia y asociados .
  • 2a. Carlos Garcia has three associates .
  • 2b. Carlos Garcia tiene tres asociados .
  • 3a. his associates are not strong .
  • 3b. sus asociados no son fuertes .
  • 4a. Garcia has a company also .
  • 4b. Garcia tambien tiene una empresa .
  • 5a. its clients are angry .
  • 5b. sus clientes estan enfadados .
  • 6a. the associates are also angry .
  • 6b. los asociados tambien estan enfadados .
  • 7a. the clients and the associates are enemies .
  • 7b. los clientes y los asociados son enemigos .
  • 8a. the company has three groups .
  • 8b. la empresa tiene tres grupos .
  • 9a. its groups are in Europe .
  • 9b. sus grupos estan en Europa .
  • 10a. the modern groups sell strong pharmaceuticals .
  • 10b. los grupos modernos venden medicinas fuertes .
  • 11a. the groups do not sell zanzanine .
  • 11b. los grupos no venden zanzanina .
  • 12a. the small groups are not modern .
  • 12b. los grupos pequenos no son modernos .
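
The "process of elimination" reasoning used on the puzzle can be automated. Below is a minimal sketch, assuming IBM Model 1 trained with EM; this is one standard word-alignment model, and the slides do not say which model the talk had in mind. It uses the twelve sentence pairs above, lowercased and without the final period. The names `train_ibm1` and `best_source` are illustrative only.

```python
from collections import defaultdict

# Toy parallel corpus: (English, Spanish) pairs from the slide, lowercased, periods dropped.
CORPUS = [
    ("garcia and associates", "garcia y asociados"),
    ("carlos garcia has three associates", "carlos garcia tiene tres asociados"),
    ("his associates are not strong", "sus asociados no son fuertes"),
    ("garcia has a company also", "garcia tambien tiene una empresa"),
    ("its clients are angry", "sus clientes estan enfadados"),
    ("the associates are also angry", "los asociados tambien estan enfadados"),
    ("the clients and the associates are enemies", "los clientes y los asociados son enemigos"),
    ("the company has three groups", "la empresa tiene tres grupos"),
    ("its groups are in europe", "sus grupos estan en europa"),
    ("the modern groups sell strong pharmaceuticals", "los grupos modernos venden medicinas fuertes"),
    ("the groups do not sell zanzanine", "los grupos no venden zanzanina"),
    ("the small groups are not modern", "los grupos pequenos no son modernos"),
]
PAIRS = [(s.split(), t.split()) for s, t in CORPUS]


def train_ibm1(pairs, iterations=20):
    """Estimate word translation probabilities P(source_word | target_word) with EM."""
    source_vocab = {w for src, _ in pairs for w in src}
    uniform = 1.0 / len(source_vocab)
    # Start from a uniform table over all co-occurring word pairs.
    table = {(sw, tw): uniform for src, tgt in pairs for sw in src for tw in tgt}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in pairs:
            for sw in src:
                # E-step: share each source word among the target words of its sentence.
                norm = sum(table[(sw, tw)] for tw in tgt)
                for tw in tgt:
                    frac = table[(sw, tw)] / norm
                    count[(sw, tw)] += frac
                    total[tw] += frac
        # M-step: renormalise the expected counts per target word.
        table = {(sw, tw): c / total[tw] for (sw, tw), c in count.items()}
    return table


def best_source(table, target_word):
    """Most probable source (English) word for a given target (Spanish) word."""
    candidates = {sw: p for (sw, tw), p in table.items() if tw == target_word}
    return max(candidates, key=candidates.get)


TABLE = train_ibm1(PAIRS)
print(best_source(TABLE, "asociados"))
```

Each EM iteration redistributes probability mass the way a human solver does: once "asociados" is firmly tied to "associates", the remaining words in those sentences have fewer plausible partners. Model 1 ignores word order and fertility, so it cannot by itself handle the "zero fertility" case.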

SLIDE 25

Noisy Channel Model

SLIDE 26

Noisy Channel Model

$$\hat{t} = \arg\max_{t} P(t \mid s) = \arg\max_{t} P(s \mid t)\,P(t)$$
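
The second equality is Bayes' rule with the denominator dropped: P(s) does not depend on the candidate translation t, so it cannot change which t wins. Written out step by step, with P(s|t) playing the role of the translation model and P(t) the language model:

```latex
\begin{aligned}
\hat{t} &= \arg\max_{t} P(t \mid s) \\
        &= \arg\max_{t} \frac{P(s \mid t)\,P(t)}{P(s)} \\
        &= \arg\max_{t} P(s \mid t)\,P(t)
\end{aligned}
```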

SLIDE 27

Decoding

Translation Options

[Figure: translation options for the source sentence "Maria no dio una bofetada a la bruja verde"; English phrase options shown include Mary, not, did not, did not give, no, give, a slap, slap, to the, by, witch, green, green witch, the witch]

  • Look up possible phrase translations

    – many different ways to segment words into phrases
    – many different ways to translate each phrase

SLIDE 28

Decoding

Hypothesis Expansion

[Figure: translation options table with the initial hypothesis
  e: (empty)  f: ---------  p: 1]

  • Start with empty hypothesis

    – e: no English words
    – f: no foreign words covered
    – p: probability 1

SLIDE 29

Decoding

Hypothesis Expansion

[Figure: translation options table with hypotheses
  e: (empty)  f: ---------  p: 1
  e: Mary     f: *--------  p: .534]

  • Pick translation option
  • Create hypothesis

    – e: add English phrase Mary
    – f: first foreign word covered
    – p: probability 0.534

SLIDE 30

Decoding

Hypothesis Expansion

[Figure: translation options table with hypotheses
  e: (empty)  f: ---------  p: 1
  e: Mary     f: *--------  p: .534
  e: witch    f: -------*-  p: .182]

  • Add another hypothesis

SLIDE 31

Decoding

Hypothesis Expansion

[Figure: translation options table with hypotheses
  e: (empty)   f: ---------  p: 1
  e: Mary      f: *--------  p: .534
  e: witch     f: -------*-  p: .182
  e: ... slap  f: *-***----  p: .043]

  • Further hypothesis expansion

SLIDE 32

Decoding

Hypothesis Expansion

[Figure: translation options table with hypotheses
  e: (empty)      f: ---------  p: 1
  e: Mary         f: *--------  p: .534
  e: witch        f: -------*-  p: .182
  e: slap         f: *-***----  p: .043
  e: did not      f: **-------  p: .154
  e: slap         f: *****----  p: .015
  e: the          f: *******--  p: .004283
  e: green witch  f: *********  p: .000271]

  • ... until all foreign words covered

    – find best hypothesis that covers all foreign words
    – backtrack to read off translation

SLIDE 33

Decoding

Hypothesis Expansion

[Figure: search graph over the source sentence "Maria no dio una bofetada a la bruja verde" with hypotheses
  e: (empty)      f: ---------  p: 1
  e: Mary         f: *--------  p: .534
  e: witch        f: -------*-  p: .182
  e: slap         f: *-***----  p: .043
  e: did not      f: **-------  p: .154
  e: slap         f: *****----  p: .015
  e: the          f: *******--  p: .004283
  e: green witch  f: *********  p: .000271]

  • Adding more hypotheses

⇒ Explosion of search space
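
The expansion steps above, and one common answer to the search-space explosion, can be sketched in a few lines. This is a simplified illustration, not the decoder used in the talk: it assumes hypotheses are grouped into stacks by number of covered words, that each stack is pruned to a fixed beam, and that reordering is discouraged by a distortion penalty. There is no language model here, and the option probabilities for the fragment "la bruja verde" as well as `DISTORTION_PENALTY` are hypothetical.

```python
from dataclasses import dataclass

DISTORTION_PENALTY = 0.5  # hypothetical cost factor per skipped source position


@dataclass(frozen=True)
class Hypothesis:
    english: tuple = ()               # English phrases produced so far
    covered: frozenset = frozenset()  # indices of covered foreign words
    prob: float = 1.0
    last_end: int = 0                 # end of the most recently translated span


def coverage(hyp, n):
    """Coverage vector as drawn on the slides, e.g. '*--------'."""
    return "".join("*" if i in hyp.covered else "-" for i in range(n))


def expand(hyp, option):
    """Apply one translation option (start, end, phrase, prob); None if it overlaps."""
    start, end, phrase, prob = option
    span = frozenset(range(start, end))
    if span & hyp.covered:
        return None
    jump = abs(start - hyp.last_end)
    return Hypothesis(
        hyp.english + (phrase,),
        hyp.covered | span,
        hyp.prob * prob * DISTORTION_PENALTY ** jump,
        end,
    )


def decode(n, options, beam_size=5):
    """Stack decoding: stacks[k] holds hypotheses covering k foreign words."""
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append(Hypothesis())
    for k in range(n):
        # Pruning: only the best few hypotheses per stack are expanded.
        survivors = sorted(stacks[k], key=lambda h: h.prob, reverse=True)[:beam_size]
        for hyp in survivors:
            for option in options:
                new = expand(hyp, option)
                if new is not None:
                    stacks[len(new.covered)].append(new)
    # Assumes at least one hypothesis covers the whole sentence.
    return max(stacks[n], key=lambda h: h.prob)


# First expansion from the slides: "Maria" -> "Mary" with p = .534.
first = expand(Hypothesis(), (0, 1, "Mary", 0.534))

# Hypothetical options for the fragment "la bruja verde".
OPTIONS = [
    (0, 1, "the", 0.6),
    (1, 2, "witch", 0.7),
    (2, 3, "green", 0.8),
    (1, 3, "green witch", 0.6),
]
best = decode(3, OPTIONS)
print(coverage(first, 9), best.english)
```

Because only `beam_size` hypotheses per stack are expanded, the work grows roughly linearly with sentence length instead of exponentially, at the price of possibly pruning the true best translation.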

SLIDE 34

$$\hat{t} = \arg\max_{t} P(t \mid s) = \arg\max_{t} P(s \mid t)\,P(t) \approx \arg\max_{t,a} P(s, a \mid t)\,P(t)$$
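
The approximation step can be spelled out. Assuming the alignment a is treated as a hidden variable, the translation model is a sum over all alignments, and the decoder replaces that sum by its largest term (often called the Viterbi approximation). This reading is an assumption; the slide only states the final form.

```latex
\begin{aligned}
P(s \mid t) &= \sum_{a} P(s, a \mid t) \\
\arg\max_{t} P(s \mid t)\,P(t)
  &= \arg\max_{t} \Big[ \sum_{a} P(s, a \mid t) \Big] P(t) \\
  &\approx \arg\max_{t} \Big[ \max_{a} P(s, a \mid t) \Big] P(t)
   = \arg\max_{t,a} P(s, a \mid t)\,P(t)
\end{aligned}
```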

SLIDE 35

Hidden Representations

[Figure: target words e_1 ... e_l above source words f_1 ... f_m; each source word f_j is linked to the target word e_{a_j}]

We need to devise:
  • Alignment a
  • a "translation dictionary"
  • Language Model on e

SLIDE 36

Hidden Representations

[Figure: alignment diagram with target words e_1 ... e_{a_j} ... e_l above source words f_1 ... f_j ... f_m, labelled "Language Model on e"]

SLIDE 37

In practice ...

$$\arg\max_{t,a} P(s, a \mid t) \propto \exp \sum_{i} \alpha_i f_i(s, t, a)$$
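
A small sketch of how such a log-linear score behaves. The feature names `tm` and `lm`, their values, and the weights are hypothetical; the point is only that the weights α_i decide how much each feature function f_i contributes, and that setting both weights to 1 on log-probability features recovers the noisy-channel product.

```python
import math


def loglinear_score(features, weights):
    """exp(sum_i alpha_i * f_i), the unnormalised score of one candidate."""
    return math.exp(sum(weights[name] * value for name, value in features.items()))


def best_candidate(candidates, weights):
    """Candidate translation with the highest log-linear score."""
    return max(candidates, key=lambda c: loglinear_score(candidates[c], weights))


# Hypothetical feature values: log translation-model and log language-model probabilities.
CANDIDATES = {
    "the witch green": {"tm": math.log(0.5), "lm": math.log(0.0001)},
    "the green witch": {"tm": math.log(0.4), "lm": math.log(0.01)},
}

noisy_channel = best_candidate(CANDIDATES, {"tm": 1.0, "lm": 1.0})
tm_only = best_candidate(CANDIDATES, {"tm": 1.0, "lm": 0.0})
print(noisy_channel, "|", tm_only)
```

With the language-model weight switched off, the literal word order wins; with both weights at 1, the fluent candidate does. In a real system the weights are tuned on a development set rather than set by hand.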

SLIDE 38

Outline

Machine Translation and AI
Statistical Machine Translation (SMT)
Domain Adaptation for SMT

SLIDE 39

Domain Adaptation: why?

Given: de→en MT system trained on Europarl

$\arg\max_{e} P(e \mid \text{Das Wort}) = \text{?}$

SLIDE 40

Domain Adaptation: why?

Given: de→en MT system trained on Europarl

$\arg\max_{e} P(e \mid \text{Das Wort}) = \text{the floor}$

SLIDE 41

Domain Adaptation: why?

Given: de→en MT system trained on Europarl

$\arg\max_{e} P(e \mid \text{Das Wort}) = \text{the floor}$

You have the floor – Sie haben das Wort

SLIDE 42

Domain Adaptation: why?

◮ Language varies across different genres, topics
◮ This affects empirical models
◮ Domain overfitting
◮ One solution: train smaller models and combine them in a smart way

SLIDE 43

Domain Adaptation for SMT

◮ Training material: heterogeneous, with some parts that are not too far from test domain
◮ Development set drawn from test domain is available

[Figure: Training data → SMT → Dev/Test data]

SLIDE 44

Research Questions

  • 1. Construction of data repository map.
  • 2. Data/Model weighting according to relevance to the input text.
  • 3. Data enrichment (e.g. Hierarchical Alignment Trees)

SLIDE 45

Adaptation Techniques

◮ Mixture Models
◮ Transductive Learning
◮ Instance Weighting
◮ Data Selection
◮ Phrase Sense Disambiguation

SLIDE 46

Mixture Models

◮ $\hat{t} = \arg\max_{t} P(t \mid s) \propto \exp \sum_{i} \alpha_i f_i(s, t, a)$
◮ Combine different $f_i$ in a smart way.
◮ Linear interpolation, log-linear interpolation, alternative decoding paths
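
A minimal sketch of the first two combination schemes, applied to the "Das Wort" example. The probabilities in `P_OUT` (a Europarl-like, out-of-domain model) and `P_IN` (an in-domain model), the mixture weight `lam`, and the `floor` value are all hypothetical; alternative decoding paths are not covered here.

```python
def linear_interpolation(p_in, p_out, lam):
    """lam * p_in + (1 - lam) * p_out for every candidate seen in either model."""
    keys = set(p_in) | set(p_out)
    return {k: lam * p_in.get(k, 0.0) + (1 - lam) * p_out.get(k, 0.0) for k in keys}


def loglinear_interpolation(p_in, p_out, lam, floor=1e-6):
    """p_in**lam * p_out**(1 - lam), renormalised; floor keeps unseen entries above zero."""
    keys = set(p_in) | set(p_out)
    raw = {
        k: max(p_in.get(k, 0.0), floor) ** lam * max(p_out.get(k, 0.0), floor) ** (1 - lam)
        for k in keys
    }
    z = sum(raw.values())
    return {k: v / z for k, v in raw.items()}


# Hypothetical P(e | "Das Wort") under each model.
P_OUT = {"the floor": 0.6, "the word": 0.3, "word": 0.1}
P_IN = {"the word": 0.8, "word": 0.15, "the floor": 0.05}

mixed = linear_interpolation(P_IN, P_OUT, lam=0.7)
best_linear = max(mixed, key=mixed.get)
log_mixed = loglinear_interpolation(P_IN, P_OUT, lam=0.7)
best_loglinear = max(log_mixed, key=log_mixed.get)
print(best_linear, "|", best_loglinear)
```

Linear interpolation behaves like a vote: a candidate needs support from at least one model. Log-linear interpolation behaves more like a veto, since a very low probability in either model pulls the product down, which is why unseen entries need a floor.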

SLIDE 47

Data Clustering

◮ experiments with monolingual + bilingual distributions
◮ metrics
◮ clustering

SLIDE 48

Baselines

  • 1. In-domain only
  • 2. Concatenation
  • 3. Linear interpolation
  • 4. Log-linear interpolation
  • 5. Log-linear interpolation with alternative decoding paths
  • 6. Instance Weighting
  • 7. Instance Weighting with devset perplexity minimization
  • 8. Fill-up
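
Several of these baselines need a weight that says how much to trust in-domain versus out-of-domain data. One common way to set it is to minimise perplexity on the development set. The sketch below illustrates that idea for two add-one-smoothed unigram language models and a grid search over the weight; it is an assumption about what "devset perplexity minimization" looks like in its simplest form, not the setup used in the talk, and all data is made up.

```python
import math
from collections import Counter


def unigram_model(tokens, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def perplexity(model_in, model_out, lam, dev_tokens):
    """Perplexity of the dev set under lam * model_in + (1 - lam) * model_out."""
    log_prob = sum(math.log(lam * model_in[w] + (1 - lam) * model_out[w]) for w in dev_tokens)
    return math.exp(-log_prob / len(dev_tokens))


def best_lambda(model_in, model_out, dev_tokens, steps=20):
    """Grid search over lam in [0, 1] for the lowest dev-set perplexity."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda lam: perplexity(model_in, model_out, lam, dev_tokens))


# Hypothetical toy data.
IN_DOMAIN = "the word the word order".split()
OUT_DOMAIN = "you have the floor the floor".split()
DEV = "you have the word".split()
VOCAB = set(IN_DOMAIN) | set(OUT_DOMAIN) | set(DEV)

M_IN = unigram_model(IN_DOMAIN, VOCAB)
M_OUT = unigram_model(OUT_DOMAIN, VOCAB)
LAM = best_lambda(M_IN, M_OUT, DEV)
print(LAM, perplexity(M_IN, M_OUT, LAM, DEV))
```

Here the dev set mixes words that each corpus covers well, so the selected weight lies strictly between 0 and 1: neither model alone explains the dev data as well as the mixture.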

SLIDE 49

Some Ideas

◮ Invitation-based Instance Weighting
◮ Dynamic Topic Adaptation
◮ Decoder-driven Adaptation
◮ Domain Adaptation via Triangulation

SLIDE 50

Invitation-based Instance Weighting

  • 1. Bilingual context vector for all phrases
  • 2. ‘invite’ and weigh out-of-domain phrases in an iterative process

SLIDE 51

Dynamic Topic Adaptation

  • 1. Derive topic distribution over devset and phrase pairs
  • 2. Incorporate similarity feature in translation model
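
The two steps can be illustrated with a toy similarity feature. The slide does not say how the topic distributions are derived or which similarity measure is used; the sketch below assumes the distributions are already available as vectors and uses cosine similarity. The topic labels, the vectors, and the function names are hypothetical.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two topic distributions of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def rank_phrase_pairs(devset_topics, phrase_topics):
    """Sort phrase pairs by topical similarity to the dev set, most similar first."""
    return sorted(
        phrase_topics,
        key=lambda pair: cosine_similarity(devset_topics, phrase_topics[pair]),
        reverse=True,
    )


# Hypothetical distributions over three topics (politics, everyday language, medicine).
DEVSET_TOPICS = [0.1, 0.8, 0.1]
PHRASE_TOPICS = {
    ("Das Wort", "the floor"): [0.9, 0.05, 0.05],
    ("Das Wort", "the word"): [0.2, 0.7, 0.1],
}
ranking = rank_phrase_pairs(DEVSET_TOPICS, PHRASE_TOPICS)
print(ranking[0])
```

In a log-linear system the similarity value would enter as one more feature function f_i with its own tuned weight, so that phrase pairs whose topics match the input are preferred without removing the others.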

SLIDE 52

Decoder-driven Adaptation

  • 1. Leave individual components alone
  • 2. Integrate genre detector in decoder
  • 3. Google approach

SLIDE 53

Domain Adaptation via Triangulation

  • 1. Presupposes existence of multilingual corpus
  • 2. Triangulated source-target data already exist
  • 3. Apply similar idea to IN vs OUT data
  • 4. $S_i|T_j \approx \{S_i^1|T_k^1, \dots, S_i^n|T_k^n\}$ → add $\{S_i^1|T_j^1, \dots, S_i^n|T_j^n\}$ to OUT’

SLIDE 54

Discussion

Thanks for listening!
