


Machine Translation – Classical and Statistical Approaches

Session 1: Overview

Jonas Kuhn
Universität des Saarlandes, Saarbrücken
The University of Texas at Austin
jonask@coli.uni-sb.de

DGfS/CL Fall School 2005, Ruhr-Universität Bochum, September 19-30, 2005

Jonas Kuhn: MT 2

Session 1: Overview

  • Course Overview
      • Week 1: “Classical” approaches
      • Week 2: Data-driven, statistical approaches
  • Machine Translation: Some History
  • Major architectures/paradigms in “classical” Machine Translation
  • Translation challenges – a classification


Course Overview (1)

Week 1: “Classical” approaches

  • History & Overview
  • Transfer-based translation
  • Syntax-based transfer [Trujillo 1999]
  • Transfer as LFG projection [Kaplan et al. 1999]
  • Interlingua-based translation [Dorr 1994]
  • Term-rewriting transfer [Emele/Dorna 1998]


Course Overview (2)

Week 2: Data-driven, statistical approaches

  • The noisy channel model [Brown et al. 1990, Knight 1999]
  • Language modeling
  • Translation modeling
  • Word alignment
  • Phrase alignment [Koehn et al. 2003]
  • Decoding [Koehn 1994]
  • Other uses of word alignments [Yarowsky et al. 2001]


MT: Some History

Translation

  • c. 2000 BC (Old Babylonian period): bilingual Sumerian-Akkadian text fragments
  • China, 9th century BC: references to translators and interpreters
  • c. mid-8th century BC: reference to interpreting in the Old Testament (Genesis 42:23)
  • 240 BC: Livius Andronicus translates the Odyssey from Greek into Latin
  • 196 BC: Rosetta Stone carved (three scripts: Egyptian hieroglyphs, Demotic, Greek; discovered in 1799 and deciphered by Jean-François Champollion in 1822)


MT: Some History

  • 1947: Memo by Warren Weaver (Rockefeller Foundation)

    “I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need to do is strip off the code in order to retrieve the information contained in the text.”

    (compare the use of computers for cryptography in World War II)

  • 1954: first prototype of a Russian-English MT system (GAT system, Peter Toma; Georgetown University, Washington D.C.)
  • 1961, UT Austin: Linguistic Research Center (led by Winfred Lehmann)
      • Fundamental research and development of METAL, a bidirectional English-German transfer system
      • Initially funded by the US Air Force Rome Air Development Center; since 1978 by Siemens
      • First commercial METAL system appeared in 1989


MT: Some History

1966: The ALPAC Report (Automatic Language Processing Advisory Committee, commissioned by the US National Academy of Sciences)

  • no shortage of human translators, no immediate prospect of MT producing useful translation of general scientific texts
  • funding for MT was virtually stopped (especially in the USA)

Groups continuing to work on MT in the 1970s:

  • TAUM group in Montreal: METEO system (used for translating weather forecasts since 1977)
  • groups in the USSR
  • GETA group in Grenoble, France
  • SUSY group in Saarbrücken, Germany
  • Peter Toma working on Systran (in various organizations)
      • Systran is now available for 36 language pairs; http://www.systransoft.com/
      • Underlying technology in Babel Fish Translation (by AltaVista)


MT: Some History

  • 1976: Commission of the European Communities
      • installs English-French version of Systran
      • commissions further language pairs of Systran
  • 1982-1993: Eurotra – large-scale MT project funded by the European Communities
  • 1993-2000: Verbmobil – large-scale speech-to-speech translation project funded by the German ministry for research
  • late 1980s-1990s: Candide project at IBM Watson Research Center – pioneering work in Statistical Machine Translation
      • Basis for all ongoing work in Statistical MT
      • Example: “Surprise Language Project” by DARPA – one month to develop an MT system for a given language (June 2003: Hindi; 11 research institutions participated)


Architectures and Paradigms in MT

Classification following Dorr/Jordan/Benoit (1999): A Survey of Current Paradigms in Machine Translation. In: Zelkowitz, Marvin (ed.) Advances in Computers 49, 1-68. Academic Press, London.

MT Architectures

  • Direct translation
  • Transfer-based translation
  • Interlingua-based translation

MT Paradigms

  • Linguistic-based paradigms: Constraint-based MT, Knowledge-based MT, Lexical-based MT, Rule-based MT, Principle-based MT, Shake-and-Bake MT
  • Non-linguistic-based paradigms: Statistical-based MT, Example-based MT, Dialogue-based MT
  • Hybrid paradigms


The Vauquois Triangle


Transfer vs. Interlingua

Some slides taken from Arturo Trujillo… (author of “Translation Engines”, Springer, 1999)


Transfer vs. Interlingua

Transfer:

Contrasts are fundamental to translation. Statements in one theory (source language) are mapped into statements in another theory (target language).

Interlingua:

Meanings are language independent and can be encoded. They are extracted from SL sentences and rendered as TL sentences.


Multilinguality – Transfer

[Figure: English, Catalan, Spanish, German, French, Japanese]


Multilinguality – Interlingua

[Figure: English, Catalan, Spanish, German, French, Japanese; Interlingua]
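
The difference between the two multilingual set-ups can be made concrete by counting components. The following is a minimal sketch, assuming one unidirectional transfer module per ordered language pair and, for the interlingua design, one analysis and one generation component per language. The function names are illustrative and do not come from the slides.

```python
def transfer_modules(n: int) -> int:
    """Transfer modules needed if every ordered language pair gets its own module.

    Assumption: modules are unidirectional. Analysis and generation
    components for each language are not counted here.
    """
    return n * (n - 1)


def interlingua_components(n: int) -> int:
    """Components needed if each language has one analyser into the
    interlingua and one generator out of it (assumption)."""
    return 2 * n


languages = ["English", "Catalan", "Spanish", "German", "French", "Japanese"]
n = len(languages)

print(transfer_modules(n))        # 30
print(interlingua_components(n))  # 12

# Adding a seventh language under the same assumptions:
print(transfer_modules(n + 1) - transfer_modules(n))              # 12 new transfer modules
print(interlingua_components(n + 1) - interlingua_components(n))  # 2 new components
```

Under these assumptions the transfer count grows quadratically with the number of languages, while the interlingua count grows linearly.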


Transfer vs. Interlingua

Transfer:

  + Easier to implement
  + Good for mono- or bi-directional systems
  + Humans work on 2 languages at a time
  - Modifications affect several transfer modules
  - Inefficient for multilinguality

Interlingua:

  + Eliminates redundancy
  + Highly modular
  + Simplifies addition of languages
  - Different linguists may disagree on representation of meaning
  - Difficult to ensure that TL generator can produce sentence from SL representation


Classifying translation challenges

Translation divergence:

  • Meaning is conveyed by translation, although syntactic structure and semantic distribution of meaning components are different in the two languages

Translation mismatch:

  • Difference in information content between source and target sentence
  • Example (from Dorr 1994): translation of fish into Spanish – pez (alive), pescado (food)


Types of divergence

  • Thematic divergence
  • Head-switching divergence
  • Structural divergence
  • Categorial divergence
  • Lexical gap (conflational divergence)
  • Divergence in lexicalization (lexical divergence)
  • Collocational divergence
  • Multi-lexeme and idiomatic divergence


Types of divergence

Thematic divergence

  • En: You like her
  • Sp: Ella te gusta (Lit: She you-ACC pleases)

Head-switching divergence

  • En: The baby just ate
  • Sp: El bebé acaba de comer (Lit: The baby finishes of to-eat)

Structural divergence

  • En: Luisa entered the house
  • Sp: Luisa entró a la casa (Lit: Luisa entered to the house)
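
To illustrate how a transfer component might handle the thematic divergence example (You like her / Ella te gusta), here is a minimal sketch. It assumes sentences are already analysed into a flat predicate-argument dictionary; the lexicon, the rule table, and the function names are all hypothetical and cover only this one sentence.

```python
# Hypothetical toy lexicon: (English word, target grammatical role) -> Spanish form.
LEXICON = {
    ("her", "subj"): "ella",
    ("you", "obj"): "te",
}

# Hypothetical transfer rules: English predicate -> (Spanish verb form, role map).
# The role map says which SOURCE role fills each TARGET role.
# For "like" the roles swap, which is the thematic divergence.
TRANSFER_RULES = {
    "like": ("gusta", {"subj": "obj", "obj": "subj"}),
}


def transfer(source: dict) -> dict:
    """Map an English predicate-argument structure to a Spanish one."""
    verb, role_map = TRANSFER_RULES[source["pred"]]
    target = {"pred": verb}
    for target_role, source_role in role_map.items():
        target[target_role] = LEXICON[(source[source_role], target_role)]
    return target


def generate(target: dict) -> str:
    """Linearise as subject, object clitic, verb (enough for this example only)."""
    return f'{target["subj"].capitalize()} {target["obj"]} {target["pred"]}'


english = {"pred": "like", "subj": "you", "obj": "her"}
print(generate(transfer(english)))  # Ella te gusta
```

The verb form is stored fully inflected to keep the sketch short; a real system would derive agreement from the new subject.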


Types of divergence

Categorial divergence

  • En: a little bread
  • Sp: un poco de pan (Lit: a bit of bread)

Lexical gap (conflational divergence)

  • En: Camillo got up early
  • Sp: Camillo madrugó
  • En: I stabbed Juan
  • Sp: Yo le di puñaladas a Juan (Lit: I gave knife-wounds to Juan)


Types of divergence

Divergence in lexicalization (lexical divergence)

  • En: Susan swam across the channel
  • Sp: Susan cruzó el canal nadando (Lit: Susan crossed the channel swimming)

Collocational divergence

  • En: Jan made a decision
  • Sp: Jan tomó/*hizo una decisión (Lit: Jan took/*made a decision)

Multi-lexeme and idiomatic divergence

  • En: Socrates kicked the bucket
  • Sp: Socrates estiró la pata (Lit: Socrates stretched the leg)
  • En: Frank is as tall as Orlaith
  • Sp: Frank es tan alto como Orlaith (Lit: Frank is so tall like Orlaith)


Other translation challenges

Ambiguity: Language understanding problem (compare Dorr et al. 1999)

Syntactic ambiguity

  • I saw the man on the hill with the telescope
  • Resolution may not be necessary, since ambiguity may transfer to target language

Lexical ambiguity

  • En: book
  • Sp: libro / reservar

Semantic ambiguity

  • Homography
      • En: ball
      • Sp: pelota (spherical object) / baile (formal dance)
  • Polysemy
      • En: kill
      • Sp: matar (kill a man) / acabar (kill a process)


Other translation challenges

Ambiguity (compare Dorr et al. 1999)

Complex semantic ambiguity

  • Homography
      • En: The box was in the pen
      • Sp: La caja estaba en el corral / *la pluma (corral: enclosure, pluma: writing pen)
  • Metonymy
      • En: While driving, John swerved and hit a tree
      • Sp: Mientras que John estaba manejando, se desvió y golpeó con un árbol (‘While John was driving, (itself) swerved and hit with a tree’)


Other translation challenges

Ambiguity (compare Dorr et al. 1999)

Contextual ambiguity

  • En: The computer outputs the data; it is fast
  • Sp: La computadora imprime los datos; es rápida (es: singular)
  • En: The computer outputs the data; it is stored in ascii
  • Sp: La computadora imprime los datos; están almacenados en ascii (están: plural)

Complex contextual ambiguity

  • En: John hit the dog with a stick
  • Sp: John golpeó el perro con el palo / que tenía el palo (hit … with the stick / (the dog) that had a stick)


Other translation challenges

Language generation problems (Dorr et al. 1999)

Lexical selection

  • Sp: esperar  En: wait, hope
  • Ge: können  En: know, understand
  • En: be  Sp: ser, estar
  • En: fish  Sp: pez, pescado
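
Lexical selection can be pictured as choosing among several target candidates using information the source word alone does not supply. The sketch below is a toy illustration for the fish example only; the context features ("alive", "food") are taken from the glosses given earlier for pez and pescado, while the table layout and the function name `select_translation` are assumptions made for this example.

```python
from typing import Optional, Set

# Hypothetical one-to-many lexicon: each English word maps to Spanish
# candidates, each paired with the context feature that licenses it.
CANDIDATES = {
    "fish": [("pez", "alive"), ("pescado", "food")],
}


def select_translation(word: str, context: Set[str]) -> Optional[str]:
    """Return the first candidate whose licensing feature is in the context.

    Returns None when the context does not decide between the candidates,
    i.e. the source side carries less information than the target needs.
    """
    for translation, feature in CANDIDATES[word]:
        if feature in context:
            return translation
    return None


print(select_translation("fish", {"food"}))   # pescado
print(select_translation("fish", {"alive"}))  # pez
print(select_translation("fish", set()))      # None
```

The `None` case corresponds to a translation mismatch: without extra context the system cannot choose between the two Spanish words.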


Other translation challenges

Language generation problems (Dorr et al. 1999)

Tense generation

  • Ch: Wŏ bèi Hángzhōu de fēngjĭng xīyĭnzhù le
      • En: I was captivated by the scenery of Hangchow
      • En: I am captivated by the scenery of Hangchow
  • En: Mary went to Mexico. During her stay she learned Spanish.
      • Sp: Mary iba a México. Durante su visita, aprendió español.
  • En: Mary went to Mexico. When she returned she started to speak Spanish.
      • Sp: Mary fue a México. Cuando regresó, comenzó a hablar español.


Next topic: Syntax-based transfer

  • Reading: Trujillo 1999, ch. 6
  • Lab 1: Develop a small syntax-based transfer system from scratch
      • Prolog (allows for easy declarative specification of components)
      • Grammar formalism: Prolog DCGs (Definite Clause Grammars)
      • Use simplistic parser and generator (code will be provided)