Machine Translation between Languages with Significant Word Reordering and Rich Target-side Morphology



SLIDE 1

Machine Translation between Languages with Significant Word Reordering and Rich Target-side Morphology

20th Week of Doctoral Students, June 3rd, 2011

ÚFAL, Charles University in Prague
Bushra Jawaid

  • RNDr. Ondřej Bojar (PhD. Advisor)

SLIDE 2

Language Pair & Properties

  • Language Pair → English-Urdu
  • English is an SVO language with strict word order.
  • Urdu is a restricted free-word-order language that mostly follows SOV structure by default.

English sentence: I understand English and Urdu.
Urdu translation: میں انگریزی اور اُردو سمجھتی ہوں
Transliteration: meñ angrezī aor Urdū samjhte hūñ
Gloss: I English and Urdu understand (auxiliary)
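The word order difference in the example above can be sketched in a few lines of Python. This is only a toy illustration: the function name `svo_to_sov` and the hand-supplied verb position are assumptions made here, and a real system would learn reordering from data or from parse trees.

```python
def svo_to_sov(tokens, verb_index):
    """Move the verb at verb_index to the end of the clause (toy rule)."""
    verb = tokens[verb_index]
    return tokens[:verb_index] + tokens[verb_index + 1:] + [verb]


english = ["I", "understand", "English", "and", "Urdu"]
# The reordered sequence follows the order of the gloss (without the auxiliary).
print(svo_to_sov(english, verb_index=1))
```

The result has the same order as the gloss line: subject, object, verb.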

SLIDE 3

Language Pair & Properties (Cont.)

  • Urdu has a concatenative inflectional morphological system.
  • For example, verbs in Urdu inflect for tense, mood, aspect, gender and number.
  • The table below shows three different masculine forms of the verb "be made".

Form                           Root         Infinitive        Oblique
Intransitive/(di)Transitive    bən بن       bənnɑ بننا        bənne بننے
Direct Causative               bənɑ بنا     bənɑnɑ بنانا      bənɑne بنانے
Indirect Causative             bənwɑ بنوا   bənwɑnɑ بنوانا    bənwɑne بنوانے
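The concatenative pattern in the table can be illustrated with a minimal Python sketch. The suffix inventory below is read off the nine forms in the table only; the dictionary names and the function `inflect` are hypothetical, and this is not a general Urdu morphological generator.

```python
# Suffixes inferred from the table above (assumption: they cover only this verb).
CAUSATIVE_SUFFIX = {
    "plain": "",               # intransitive/(di)transitive
    "direct_causative": "ɑ",
    "indirect_causative": "wɑ",
}
FORM_SUFFIX = {"root": "", "infinitive": "nɑ", "oblique": "ne"}


def inflect(root, causative, form):
    """Concatenate root + causative suffix + form suffix."""
    return root + CAUSATIVE_SUFFIX[causative] + FORM_SUFFIX[form]


for causative in CAUSATIVE_SUFFIX:
    print(causative, [inflect("bən", causative, form) for form in FORM_SUFFIX])
```

Each row of the table corresponds to one causative setting, and each column to one form suffix.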

SLIDE 4

Research Focus

  • Explore methods and techniques for translating into morphologically richer languages.
  • Reduce the word order differences between source and target languages.
  • Main motivation:
      • Model the problem of reordering.
      • Deal with word form choice separately.
      • Improve generalization.

SLIDE 5

Possible Solutions

Translate+Generate (T+T+G) Setup (Bojar et al., 2010):

English       Czech
Form          Form          +LM
Lemma         Lemma         +LM
Morphology    Morphology    +LM

Issues with this setup:

  • Factors in Moses are synchronous → all factors have to be fully constructed before the main search.
  • Many possible options of lemma, tag and final word form → pruning strikes hard.

SLIDE 6

Possible Solutions (Cont.)

  • Translation options for the German word “haus” (Koehn et al., 2007):
  • Translation: Mapping lemmas

{ ?|house|?|?, ?|home|?|?, ?|building|?|?, ?|shell|?|? }

  • Translation: Mapping morphology

{ ?|house|NN|plural, ?|home|NN|plural, ?|building|NN|plural, ?|shell|NN|plural, ?|house|NN|singular, ... }

  • Generation: Generating surface forms

{ houses|house|NN|plural, homes|home|NN|plural, buildings|building|NN|plural, shells|shell|NN|plural, house|house|NN|singular, ... }
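The three expansion steps above can be mimicked with a short Python sketch that shows why the number of translation options grows multiplicatively. The tables `LEMMA_MAP` and `MORPH_MAP` and the naive pluralizer are toy assumptions written for this illustration; they are not Moses data structures or Moses code.

```python
from itertools import product

# Toy mapping tables (assumptions for illustration only).
LEMMA_MAP = {"haus": ["house", "home", "building", "shell"]}
MORPH_MAP = {("NN", "plural"): [("NN", "plural"), ("NN", "singular")]}


def generate_surface(lemma, number):
    """Naive generation step: regular English plural only."""
    return lemma + "s" if number == "plural" else lemma


def expand(src_lemma, src_morph):
    """Combine every lemma option with every morphology option."""
    options = []
    for lemma, (pos, number) in product(LEMMA_MAP[src_lemma], MORPH_MAP[src_morph]):
        surface = generate_surface(lemma, number)
        options.append(f"{surface}|{lemma}|{pos}|{number}")
    return options


options = expand("haus", ("NN", "plural"))
print(len(options))   # 4 lemmas x 2 morphology options
print(options[:2])
```

With realistic table sizes the product becomes large, which is the situation in which pruning has to discard many options before the main search.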

SLIDE 7

Two-Step Architecture

[Figure: Plain text input → 1st step: reordering → middle language (middle layer) → 2nd step: morphology → plain text output.]

(Fraser, 2009) and (Bojar, 2010)

SLIDE 8

Possible Solutions (Cont.)

  • Two-Step Setup (to avoid explosion of translation options):
      • The first step translates from the source to augmented lemmatized target words.
      • Monolingual features are *not* represented, for example the gender of adjectives.

Src:   good book
Mid:   A1XX.اچھا NSNX.کتاب
Gloss: adj+1stdeg...good noun+sg+nom…book

SLIDE 9

Possible Solutions (Cont.)

  • The second step is a monotone translation from lemmatized target words to fully inflected target words.

Src:   good book
Mid:   A1XX.اچھا NSNX.کتاب
Gloss: adj+1stdeg...good noun+sg+nom…book
Out:   اچھی (achi) کتاب (kitab)

Idea behind the 2-step architecture → model target-side morphology separately if it does not depend on source morphology.
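The second step can be pictured with a minimal Python sketch. In the slides this step is a monotone statistical translation; the rule-based lookup below is a simplified stand-in, and the romanized lemmas, the lexicon `NOUN_GENDER`, the table `ADJ_FORMS` and the function `inflect_monotone` are all hypothetical names introduced here.

```python
# Toy resources (assumptions): noun gender is a target-side, monolingual feature.
NOUN_GENDER = {"kitab": "F", "ghar": "M"}
ADJ_FORMS = {("acha", "M"): "acha", ("acha", "F"): "achi"}


def inflect_monotone(mid_tokens):
    """Inflect 'TAG.lemma' tokens left to right without changing their order."""
    parsed = [tok.split(".", 1) for tok in mid_tokens]
    out = []
    for i, (tag, lemma) in enumerate(parsed):
        if tag.startswith("A"):
            # The adjective agrees with the next noun to its right, if any.
            gender = next(
                (NOUN_GENDER.get(l) for t, l in parsed[i + 1:] if t.startswith("N")),
                None,
            )
            out.append(ADJ_FORMS.get((lemma, gender), lemma))
        else:
            out.append(lemma)
    return out


print(inflect_monotone(["A1XX.acha", "NSNX.kitab"]))
```

The adjective form is chosen from target-side information alone, which is why this decision can be separated from the first step.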

SLIDE 10

Basic Two-Step Setup

[Figure: Plain text input → Moses (distance and lexicalized reordering) → strings → monotone Moses → plain text output.]

SLIDE 11

Two-Step Variants

1. Reordering options:

  • Use Moses-chart, Joshua or manual reordering in the 1st step for improved reordering.
  • Moses-chart and Joshua are hierarchical, i.e. they allow block movements.

[Figure: Plain text input → Moses-chart or Joshua → strings (1-best output) → monotone Moses → plain text output.]

SLIDE 12

Two-Step Variants (Cont.)

  • Pre-reorder input sentences using the Transformation System (Jawaid, 2010) and pass the 1-best reordered output to the 1st layer.

[Figure: Plain text input → Transformation System → strings (1-best output) → Moses-chart or Joshua → strings (1-best output) → monotone Moses → plain text output.]

SLIDE 13

Two-Step Variants (Cont.)

  • Generate an input lattice from multiple reorderings of each sentence.
  • Lattices have been used by (Niehues et al., 2009) and (Bisazza et al., 2010).

[Figure: Plain text input → Transformation System → lattices (N-best output) → Moses → strings (1-best output) → monotone Moses → plain text output.]

SLIDE 14

Two-Step Variants (Cont.)

2. Middle Layer options:

Pass lattices of possible hypotheses from the 1st step to the 2nd step instead of a single string hypothesis. Multiple reorderings are considered, and the 2nd step is free to choose the one that is easiest to inflect.

[Figure: Plain text input → Transformation System → lattices (N-best output) → Moses, Joshua or Moses-chart → strings/lattices (1 or N best output) → monotone Moses → plain text output.]
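The idea of letting the 2nd step choose among several 1st-step hypotheses can be sketched as a simple rescoring of an N-best list. This is a simplification: a lattice encodes many hypotheses compactly, whereas the sketch enumerates them explicitly. The scoring function `adjacency_score`, the `weight` parameter and the scores are toy assumptions, not part of the described system.

```python
def adjacency_score(tokens):
    """Toy 2nd-step score: count adjectives directly followed by a noun."""
    tags = [tok.split(".", 1)[0] for tok in tokens]
    return sum(
        1 for a, b in zip(tags, tags[1:]) if a.startswith("A") and b.startswith("N")
    )


def pick_hypothesis(nbest, second_step_score, weight=1.0):
    """nbest: list of (tokens, first_step_score); return the best combined entry."""
    return max(nbest, key=lambda hyp: hyp[1] + weight * second_step_score(hyp[0]))


nbest = [
    (["NSNX.kitab", "A1XX.acha"], -1.0),  # preferred by the 1st step alone
    (["A1XX.acha", "NSNX.kitab"], -1.2),  # easier to inflect under the toy score
]
best_tokens, _ = pick_hypothesis(nbest, adjacency_score)
print(best_tokens)
```

With `weight=0.0` the 1st-step ranking is kept unchanged, which corresponds to passing only the 1-best string.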

SLIDE 15

Two-Step Variants (Cont.)

3. 2nd Layer options:

Adding a classifier in the 2nd step to select the best hypothesis.

[Figure: Plain text input → Transformation System → lattices (N-best output) → Moses, Joshua or Moses-chart → strings/lattices (1 or N best output) → monotone classifier → plain text output.]

SLIDE 16

Main Issues

  • Urdu is an under-resourced language.
  • Current research work:
      • Finding and improving taggers
          – Collecting tools such as taggers and morphological analyzers for Urdu.
          – Trying to combine the taggers to improve precision.
          – Need to merge the different tagsets.
      • Collecting more data.
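One possible way to combine taggers with different tagsets is to map each tagset to a shared one and take a majority vote per token. The sketch below assumes this approach purely for illustration; the slides do not say how the taggers are combined, and the tagger names, tag labels and mapping tables are hypothetical.

```python
from collections import Counter

# Hypothetical mappings from each tagger's tagset to a shared tagset.
TAGSET_MAPS = {
    "tagger_a": {"NN": "NOUN", "ADJ": "ADJ", "VB": "VERB"},
    "tagger_b": {"N": "NOUN", "JJ": "ADJ", "V": "VERB"},
    "tagger_c": {"noun": "NOUN", "adj": "ADJ", "verb": "VERB"},
}


def combine(tags_by_tagger):
    """Map every tagger's output to the shared tagset, then vote per token."""
    mapped = [
        [TAGSET_MAPS[name].get(tag, "UNK") for tag in tags]
        for name, tags in tags_by_tagger.items()
    ]
    # zip(*mapped) groups the votes of all taggers for the same token.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*mapped)]


outputs = {
    "tagger_a": ["ADJ", "NN"],
    "tagger_b": ["JJ", "V"],    # disagrees on the second token
    "tagger_c": ["adj", "noun"],
}
print(combine(outputs))
```

Ties are resolved in favour of the tag seen first, so the order of taggers matters when they all disagree.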

SLIDE 17

Questions?

Feel free to ask questions.
