
Machine Translation: Classical and Statistical Approaches

Session 10: MT Evaluation & Wrap-Up

Jonas Kuhn
Universität des Saarlandes, Saarbrücken
The University of Texas at Austin
jonask@coli.uni-sb.de

DGfS/CL Fall School 2005, Ruhr-Universität Bochum, September 19-30, 2005

Jonas Kuhn: MT 2

Week 2: Overview

Data-driven, statistical approaches to MT

  • The noisy channel model [Brown et al. 1990, Knight 1999]
  • Language modeling
  • Translation modeling
  • Word alignment
  • Phrase alignment [Koehn et al. 2003]
  • Decoding [Koehn 2004]
  • Lab exercise: building a phrase-based statistical MT system from parallel texts taken from the Internet
  • Evaluation methods
  • Other uses of word alignments [Yarowsky et al. 2001]

Jonas Kuhn: MT 3

Today’s session

  • Lab exercise: running the phrase-based decoder Pharaoh
  • MT Evaluation
  • Other uses of word alignments
  • Wrap-Up
  • Final projects
  • Certificates of participation

Jonas Kuhn: MT 4

Running the decoder

Sample data taken from http://www.statmt.org/wpt05/mt-shared-task/

  • Large French-English phrase table (trained from Europarl)
  • Language model for English
  • Test sentences in French (along with model solution)

A script is provided for filtering out the relevant part of the translation table for a set of test sentences:

run-filtered-pharaoh.perl filtered100.fr

pharaoh pharaoh.fr.ini test100.fr.lowercase "-monotone" > test100.fr.out.monotone


Jonas Kuhn: MT 5

Translation results

Original:

Nous savons très bien que les Traités actuels ne suffisent pas et qu' il sera nécessaire à l' avenir de développer une structure plus efficace et différente pour l' Union, une structure plus constitutionnelle qui indique clairement quelles sont les compétences des États membres et quelles sont les compétences de l' Union.

Reference translation:

We know all too well that the present Treaties are inadequate and that the Union will need a better and different structure in future, a more constitutional structure which clearly distinguishes the powers of the Member States and those of the Union .

Jonas Kuhn: MT 6

Translation with Pharaoh decoder

Original:

Nous savons très bien que les Traités actuels ne suffisent pas et qu' il sera nécessaire à l' avenir de développer une structure plus efficace et différente pour l' Union, une structure plus constitutionnelle qui indique clairement quelles sont les compétences des États membres et quelles sont les compétences de l' Union.

Pharaoh translation:

we know very well that the current treaties are not enough and that it will be necessary in the future to develop a structure which is more effective and different for the union , a structure more constitutional which makes it clear what are the powers of the member states , and what are the powers of the union .

Jonas Kuhn: MT 7

Commercial system (online version)

Original:

Nous savons très bien que les Traités actuels ne suffisent pas et qu' il sera nécessaire à l' avenir de développer une structure plus efficace et différente pour l' Union, une structure plus constitutionnelle qui indique clairement quelles sont les compétences des États membres et quelles sont les compétences de l' Union.

Systran translation: (http://www.systransoft.com/)

We know very well that the current Treaties are not enough and that it will be necessary in the future to develop a more effective and different structure for the Union, a more constitutional structure which indicates clearly which are competences of the Member States and which are competences of the Union.

Jonas Kuhn: MT 8

MT Evaluation

Manual:

  • SSER (subjective sentence error rate)
  • Correct/Incorrect
  • Error categorization

Testing in an application that uses MT as one subcomponent:

  • Question answering from foreign language documents

Automatic:

  • WER (word error rate)
  • BLEU (Bilingual Evaluation Understudy)

Slides from Kevin Knight
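To make the WER entry concrete, here is a minimal sketch, assuming the common definition of WER as word-level edit distance (insertions, deletions, substitutions) divided by the number of reference words. The function name `word_error_rate` and whitespace tokenization are assumptions made for illustration; they are not taken from the course material.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are separated by whitespace, as in the
    pre-tokenized example sentences on these slides.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word deleted
                            curr[j - 1] + 1,      # hypothesis word inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the gunman was shot to death by the police ."
    hyp = "the gunman was shot dead by the police ."
    print(word_error_rate(ref, hyp))
```

Because the denominator is the reference length, a hypothesis that is much longer than the reference can score above 1.0 under this definition.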


Jonas Kuhn: MT 9

Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .

Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.

BLEU Evaluation Metric (Papineni et al., ACL-2002)

  • N-gram precision (score is between 0 & 1)
      – What percentage of machine n-grams can be found in the reference translation?
      – An n-gram is a sequence of n words
      – Not allowed to use same portion of reference translation twice (can’t cheat by typing out “the the the the the”)
  • Brevity penalty
      – Can’t just type out single word “the” (precision 1.0!)

*** Amazingly hard to “game” the system (i.e., find a way to change machine output so that BLEU goes up, but quality doesn’t)

Slides from Kevin Knight
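The "not allowed to use the same portion of the reference twice" rule can be sketched as clipped n-gram counting. This is an illustrative sketch for a single reference translation; the names `ngrams` and `clipped_precision` and the whitespace tokenization are assumptions, not part of the slides.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-word sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(machine: str, reference: str, n: int) -> float:
    """Fraction of machine n-grams that can be found in the reference.

    Each machine n-gram is credited at most as often as it occurs in
    the reference, so repeating a matching word does not help.
    """
    machine_counts = Counter(ngrams(machine.split(), n))
    reference_counts = Counter(ngrams(reference.split(), n))
    total = sum(machine_counts.values())
    if total == 0:
        return 0.0  # machine output is shorter than n words
    matched = sum(min(count, reference_counts[gram])
                  for gram, count in machine_counts.items())
    return matched / total


if __name__ == "__main__":
    ref = "the gunman was shot to death by the police ."
    print(clipped_precision("the the the the the", ref, 1))  # clipping at work
    print(clipped_precision("the", ref, 1))  # perfect precision, hence the brevity penalty
```

The second call illustrates why precision alone is not enough: the one-word output "the" has perfect unigram precision, which is what the brevity penalty is meant to counteract.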

Jonas Kuhn: MT 10

BLEU Evaluation Metric (Papineni et al., ACL-2002)

BLEU4 formula (counts n-grams up to length 4):

exp(1.0 * log p1 + 0.5 * log p2 + 0.25 * log p3 + 0.125 * log p4 - max(words-in-reference / words-in-machine - 1, 0))

p1 = 1-gram precision
p2 = 2-gram precision
p3 = 3-gram precision
p4 = 4-gram precision

Slides from Kevin Knight
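The formula on this slide can be transcribed directly. The sketch below follows the slide's weights (1.0, 0.5, 0.25, 0.125) and its length penalty term; the function name `slide_bleu4` and the choice to return 0.0 when any precision is zero (since log 0 is undefined) are assumptions. Note that the formulation in Papineni et al. weights the four log precisions uniformly, so this is the slide's variant rather than a reference implementation.

```python
import math

# Weights as written on the slide (assumption: applied to p1..p4 in order).
SLIDE_WEIGHTS = (1.0, 0.5, 0.25, 0.125)


def slide_bleu4(precisions, words_in_reference, words_in_machine):
    """Evaluate the slide's BLEU4 formula for given n-gram precisions.

    precisions: (p1, p2, p3, p4), each between 0 and 1.
    """
    if len(precisions) != 4:
        raise ValueError("expected exactly four precisions (p1..p4)")
    if words_in_machine <= 0:
        raise ValueError("machine output must contain at least one word")
    if min(precisions) <= 0.0:
        return 0.0  # assumption: treat log(0) as driving the score to zero

    weighted_logs = sum(w * math.log(p)
                        for w, p in zip(SLIDE_WEIGHTS, precisions))
    # Penalty applies only when the machine output is shorter than the reference.
    length_penalty = max(words_in_reference / words_in_machine - 1.0, 0.0)
    return math.exp(weighted_logs - length_penalty)


if __name__ == "__main__":
    # Perfect precisions, but the output is half as long as the reference.
    print(slide_bleu4((1.0, 1.0, 1.0, 1.0), 10, 5))
```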

Jonas Kuhn: MT 11

Multiple Reference Translations

Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .

Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places .

Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden , which threatens to launch a biochemical attack on such public places as airport . Guam authority has been on alert .

Reference translation 4: US Guam International Airport and its office received an email from Mr. Bin Laden and other rich businessman from Saudi Arabia . They said there would be biochemistry air raid to Guam Airport and other public places . Guam needs to be in high precaution about this matter .

Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.

Slides from Kevin Knight

Jonas Kuhn: MT 12

BLEU Tends to Predict Human Judgments

[Figure: two scatter plots of NIST score (a variant of BLEU) against human judgments, with one panel for adequacy and one for fluency; the reported fits are R² = 88.0% and R² = 90.2%.]

Slide from G. Doddington (NIST)



Jonas Kuhn: MT 14

BLEU in Action

Foreign original: 枪手被警方击毙。
Reference translation: the gunman was shot to death by the police .

#1 the gunman was police kill .
#2 wounded police jaya of
#3 the gunman was shot dead by the police .
#4 the gunman arrested by police kill .
#5 the gunmen were killed .
#6 the gunman was shot to death by the police .
#7 gunmen were killed by police ?SUB>0 ?SUB>0
#8 al by the police .
#9 the ringer is killed by the police .
#10 police killed the gunman .

green = 4-gram match (good!)
red = word not matched (bad!)

Slides from Kevin Knight

Jonas Kuhn: MT 15

Sample Learning Curves

[Figure: BLEU score (axis from 0.05 to 0.35) against the number of sentence pairs used in training (10k to 320k), with one curve each for Swedish/English, French/English, German/English, and Finnish/English.]

Experiments by Philipp Koehn

Slides from Kevin Knight

Jonas Kuhn: MT 16

Applying the BLEU metric yourself

A simple script can be used to compute the score of a translation relative to the reference translation (provided with the Pharaoh example data on http://www.statmt.org/wpt05/mt-shared-task/):

cat system-output.txt | multi-bleu.perl reference-translation.txt

(Of course, one has to make sure that the reference translation is lowercased if one used lowercased training data.)


Jonas Kuhn: MT 17

Let us compare three versions blindly

Three versions (X, Y, Z), presented in unknown order, come from three sources: the reference translation, Babelfish/Systran, and Pharaoh. Which is which?

Jonas Kuhn: MT 18

Your quality judgement?

VERSION X

  • to find an agreement on the processes is a good thing in oneself , but it should be taken care that this system cannot be used as political deterrent force .
  • they also have now a clear vision of the rights which they must respect .
  • i am in agreement with his warning against the return , which tries some , to the intergovernmental methods .
  • we are much to want a federation of states nations .
  • the rapporteurs underlined the quality of the discussion and also the need to go further . of course , i can only join them .
  • i also thank all those which pleasantly reproached me for not having made this speech earlier . these , i will answer that for making a speech , it is necessary to learn , to know , to evaluate the involved forces , because the political speech must always be realistic and close to reality and the objectives that we fix ourselves .
  • i would like that one starts from this co-operation reinforced to give some examples of the new european potentiality .

Jonas Kuhn: MT 19

Your quality judgement?

VERSION Y

  • an agreement on procedures in itself is a good thing , but we must make sure that the system cannot be used as a political weapon .
  • they too now have a clear idea of the rights which they have to respect .
  • i agree with him on the need to guard against a return to intergovernmental methods , which some find appealing .
  • there are many of us who want a federation of nation states , which means that each state must find the position that best suits it .
  • the rapporteurs have already stressed the quality of the debate and the need to progress further , and i can only agree with them .
  • in reply , i would say that , before making a speech , one must identify all the major factors at work , familiarise oneself with them and weigh them up , for political discourse must always be realistic , respond to the real weight of the forces at work and relate to the aims we all set ourselves .
  • i would like us to use closer cooperation to generate fresh potential for europe .

Jonas Kuhn: MT 20

Your quality judgement?

VERSION Z

  • reach an agreement on the way it is a good thing in itself , but we must ensure that this system should not be striking force policy .
  • they also have to present a clear vision of the rights they must be respected .
  • i am in agreement with its warning against the return , which is certain , the intergovernmental methods .
  • we are very much to want a federation of nation states .
  • the rapporteurs have stressed the quality of the debate and also the need to go further . of course , i cannot , of course , that the join .
  • i would also like to thank all those who , i have kindly accused of not making this speech earlier . to them , i would say that prior to make a speech , we must learn , to know , evaluate the warring forces , because the political rhetoric must always be realistic and close to the reality and the objectives we have set ourselves .
  • i would like to see parte of this closer cooperation in order to give some examples of the new potential in europe .


Jonas Kuhn: MT 21

For reference…

French original

  • Trouver un accord relatif aux procédés est une bonne chose en soi, mais il faut veiller à ce que ce système ne puisse pas servir de force de frappe politique.
  • Eux aussi ont à présent une vision claire des droits qu' ils doivent respecter.
  • Je suis en accord avec sa mise en garde contre le retour, qui tente certains, aux méthodes intergouvernementales.
  • Nous sommes beaucoup à vouloir une fédération d' États nations.
  • Les rapporteurs ont souligné la qualité de la discussion et aussi le besoin d' aller plus loin. Bien sûr, je ne peux que les rejoindre.
  • Je remercie également tous ceux qui m' ont aimablement reproché de ne pas avoir fait ce discours plus tôt. À ceux-là, je répondrai qu' avant de prononcer un discours, il faut apprendre, connaître, évaluer les forces en présence, parce que le discours politique doit toujours se montrer réaliste et proche de la réalité et des objectifs que nous nous fixons.
  • Je voudrais que l' on parte de cette coopération renforcée pour donner quelques exemples de la nouvelle potentialité européenne.

Jonas Kuhn: MT 22

Which system was which?

[Diagram: the three sources (reference translation, Babelfish/Systran, Pharaoh) matched to the three versions (X, Y, Z).]

Jonas Kuhn: MT 23

Applying the BLEU metric yourself

Results for first 20 sentences from test data

Systran translation:

cat test20.systran.lowercase | multi-bleu.perl test20.en.lowercase
BLEU = 20.07, 57.4/27.1/14.3/7.3 (BP=1.000, ration=1.044)

Pharaoh translation:

cat test20.pharaoh | multi-bleu.perl test20.en.lowercase
BLEU = 25.97, 59.3/31.8/19.1/12.6 (BP=1.000, ration=1.067)
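As a sanity check on how to read this output, the reported BLEU values can be recomputed from the four n-gram precisions and the brevity penalty. The sketch assumes the script combines them as in Papineni et al.: the brevity penalty times the uniformly weighted geometric mean of the four precisions. The function name `bleu_from_components` is hypothetical, and since the printed precisions are rounded to one decimal, the recomputed values can only be expected to agree with the reported ones up to rounding.

```python
import math


def bleu_from_components(precisions_percent, brevity_penalty):
    """Recombine printed n-gram precisions (in percent) into a BLEU score.

    Assumption: BLEU = BP * exp(mean of log precisions), scaled to percent.
    """
    logs = [math.log(p / 100.0) for p in precisions_percent]
    return 100.0 * brevity_penalty * math.exp(sum(logs) / len(logs))


if __name__ == "__main__":
    systran = bleu_from_components([57.4, 27.1, 14.3, 7.3], 1.000)
    pharaoh = bleu_from_components([59.3, 31.8, 19.1, 12.6], 1.000)
    print(f"Systran: {systran:.2f} (reported 20.07)")
    print(f"Pharaoh: {pharaoh:.2f} (reported 25.97)")
```

Read this way, the gap between the two systems comes mostly from the higher-order n-grams: the unigram precisions are close (57.4 vs. 59.3), while the 4-gram precisions differ substantially (7.3 vs. 12.6).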

Jonas Kuhn: MT 24

Other uses of word alignments

“Annotation Projection”

David Yarowsky, Grace Ngai, Richard Wicentowski (2001): Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora


Jonas Kuhn: MT 25

“Annotation Projection”

General idea

  • Use a parallel corpus for English and some other language
  • Compute statistical word alignment
  • Apply analysis tool (part-of-speech tagger, chunker, morphological analyzer) to English text
  • Use projected (noisy) information as training data for a robust learning approach
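The projection step implied by this list can be sketched as copying each English word's tag across its alignment link. This is a deliberately naive illustration, not the method of Yarowsky et al.: the function name `project_tags`, the `UNK` placeholder for unaligned words, and the hand-written alignment in the usage example are all assumptions.

```python
def project_tags(english_tags, alignment, target_length, default="UNK"):
    """Copy English tags onto target-language positions via word alignment.

    english_tags: one tag per English word.
    alignment: (english_index, target_index) pairs, 0-based.
    Unaligned target words keep the placeholder tag; if several English
    words align to one target word, the last link wins (a simplification).
    """
    projected = [default] * target_length
    for english_index, target_index in alignment:
        projected[target_index] = english_tags[english_index]
    return projected


if __name__ == "__main__":
    english = "a significant producer for crude oil".split()
    french = "un producteur important de petrole brut".split()
    tags = ["DT", "JJ", "NN", "IN", "JJ", "NN"]
    # Hand-written alignment for illustration; a real one would come
    # from a statistical word aligner.
    links = [(0, 0), (1, 2), (2, 1), (3, 3), (4, 5), (5, 4)]
    for word, tag in zip(french, project_tags(tags, links, len(french))):
        print(f"{word}/{tag}")
```

The output of such a projection is noisy wherever the alignment is wrong or the two languages tag a concept differently, which is why the last bullet calls for a robust learner trained on the projected data rather than using it directly.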

Jonas Kuhn: MT 26

“Annotation Projection”

Yarowsky/Ngai/Wicentowski 2001

  • Projected annotation is used as training data for a tagger/chunker in the target language
  • Robust learning techniques based on confidence in training data

Example (tags and chunk brackets, English and French):

English: [ a/DT significant/JJ producer/NN ] for/IN [ crude/JJ oil/NN ]
French: [ un/DT producteur/NN important/JJ ] de/IN [ petrole/NN brut/JJ ]

Jonas Kuhn: MT 27

Annotation Projection

PoS information, NE tags, chunking: E C

Jonas Kuhn: MT 28

Annotation Projection

Projection of morphological information


Jonas Kuhn: MT 29

PoS tag projection

6 scenarios E F

Jonas Kuhn: MT 30

PoS tag projection

  • Even at the relatively low tagset granularity of English, direct projection of core POS tags onto French achieves only 76% accuracy.
  • Part of this deficiency is due to word-alignment error; when word alignments were manually corrected, direct projection core-tag accuracy increased to 85%.
  • Also, standard bigram taggers trained on the automatically projected data achieve only modest success at generalization (86% when reapplied to the noisy training data).

Special smoothing techniques

Jonas Kuhn: MT 31

PoS tag projection

Special smoothing techniques

Jonas Kuhn: MT 32

PoS tag projection

Results of Yarowsky/Ngai/Wicentowski 2001


Jonas Kuhn: MT 33

Chunk projection

Jonas Kuhn: MT 34

Chunk projection

Jonas Kuhn: MT 35

Morphology projection

  • Bilingual corpora as a bridge for aligning complex inflected word forms in a new language with their root forms
  • Works even when their surface similarity is quite different or highly irregular

Jonas Kuhn: MT 36

Morphology projection

Single-step inference (ideal):


Jonas Kuhn: MT 37

Annotation Projection

Projection taking advantage of an analyzer for English

Jonas Kuhn: MT 38

Multiple bridges

Jonas Kuhn: MT 39

Multiple bridges

Multi-bridge inference:

Jonas Kuhn: MT 40

Exploiting multiple translations


Jonas Kuhn: MT 41

Lemmatization precision

Jonas Kuhn: MT 42

Learn “deeper” syntactic grammars?

New weakly supervised learning approach for (probabilistic) syntactic grammars:

  • Training data: parallel corpora, i.e. collections of original texts and their translations into one or more languages
  • Preparatory step: identification of word correspondences with known statistical techniques (word alignment from statistical machine translation)

[Figure: parallel corpus (en, de, fr) and a word alignment between “Heute stellt sich die Lage jedoch völlig anders dar” and “The situation now however is radically different”]

Jonas Kuhn: MT 43

Learn “deeper” syntactic grammars?

  • Beyond lexical information, patterns in the word correspondence relation contain rich implicit information about the grammars of the languages
  • One should be able to exploit this implicit information about structure and meaning for grammar learning
  • Little manual annotation effort should be required
  • Combination of insights from linguistics and machine learning

[Figure: word alignment between “Heute stellt sich die Lage jedoch völlig anders dar” and “The situation now however is radically different”]

Jonas Kuhn: MT 44

The PTOLEMAIOS Project

[Image: Rosetta Stone]

Parallel Corpus-Based Grammar Induction: PTOLEMAIOS

  • Parallel-Text-based Optimization for Language Learning – Exploiting Multilingual Alignment for the Induction Of Syntactic Grammars
  • Funded by DFG (German Research Foundation) as an Emmy Noether research group
  • Universität des Saarlandes (Saarbrücken), Department of Computational Linguistics
  • Starting date: 1 April 2005
  • Expected duration: 4 years (1-year extension possible)


Jonas Kuhn: MT 45

Project Goals

Development of formalisms and algorithms to support grammar induction for arbitrary languages from parallel corpora

To make goals tangible…

Intended prototype: the PTOLEMAIOS I system for building grammars for new (sub-)languages

Jonas Kuhn: MT 46

The PTOLEMAIOS I system

Resources required:

  • Parallel corpus of language L and one or (ideally) more other languages
  • No NLP tools for language L required

Preparatory work required:

  • Manual annotation of a set of seed sentence pairs (e.g., 50-100 pairs)
      – Phrasal correspondence across languages
      – “Lean” bracketing: mark only full argument/modifier phrases (PPs, NPs) and full clauses

Jonas Kuhn: MT 47

The PTOLEMAIOS I system

Training steps:

  • (Sentence alignment on parallel corpus)
  • Word alignment on parallel corpus
      – Using standard techniques from Statistical Machine Translation (GIZA++ tool)
  • Part-of-speech clustering for L
  • Bootstrapping learning of syntactic grammars for L and the other language(s)
      – Starting from annotated seed data
      – Exploit large amounts of unannotated data, finding systematic patterns in phrasal correspondences
      – Assuming implicit underlying representation (“pseudo meaning representation”)
      – Relying on consensus across the grammars

Jonas Kuhn: MT 48

Bootstrapping

Improving grammars by using sentences as training data for which a high-confidence consensus analysis exists

[Figure: bootstrapping diagram with nodes Grammar E-1, Grammar D-1, Grammar D-2, Grammar E-2, Grammar D-3]


Jonas Kuhn: MT 49

Motivating examples

Active/Passive

  • [ A whole 500 days before Atlanta ], [ committed women in Paris ] founded [ the Atlanta Plus Committee ].
  • [ Bereits 500 Tage vor Atlanta ] gründeten [ engagierte Frauen in Paris ] [ das Atlanta Plus-Komitee ].
  • That is why [ a group ] was founded [ after the Barcelona games in 1992 ], [ the Atlanta Plus Committee ].
  • Deshalb wurde [ 1992 in Barcelona nach den Spielen ] [ eine Gruppe ] gegründet – [ das Atlanta Plus-Komitee ].

Jonas Kuhn: MT 50

Motivating examples

Nominalized Verbs

  • Two years ago, when [ the WTO ] was founded in Geneva, …
  • Vor zwei Jahren – bei der Gründung [ der WTO ] in Genf – …
  • It is well known that since the creation [ of NAFO ] [ in 1978 ], …
  • Es ist bekannt, daß seit der Gründung [ der NAFO ] [ 1978 ] …

Jonas Kuhn: MT 51

The PTOLEMAIOS I system

Result:

  • Robust probabilistic grammar for L
  • Representation of predicate-argument and modifier relations
  • Models predict probabilities for cross-linguistic argument/modifier links (these will be particularly useful in lexicalized models)

Application:

  • Multilingual Information Extraction, Question Answering
  • Intermediate step for syntax-based MT

Jonas Kuhn: MT 52

Motivation for the Project

Practical

  • Explore alternative to standard treebank training of grammars
  • For “smaller” languages, it is unrealistic to do the necessary manual resource annotation

Theoretical

  • Establish parallel corpora as an empirical basis for (crosslinguistic or monolingual) linguistic studies
  • Frequency-related phenomena (like multi-word expressions/collocations) are otherwise hard to assess empirically at the level of syntax
  • Learnability properties as a criterion for assessing formal models for natural language


Jonas Kuhn: MT 53

A Pilot Study

Unsupervised learning of a probabilistic context-free grammar (PCFG) exploiting partial information from a parallel corpus [Kuhn 2004 (ACL)]

Underlying consideration:

  • The distribution of word correspondences in the translation of a string contains partial information about possible phrase boundaries

[Figure: word alignment between “Heute stellt sich die Lage jedoch völlig anders dar” and “The situation now however is radically different”]

Jonas Kuhn: MT 54

Unsupervised Grammar Learning

Grammar induction of an X-bar grammar

  • Using a variant of standard PCFG induction with the Inside-Outside algorithm (an Expectation Maximization algorithm)
  • All word spans are considered as phrase candidates, except the excluded distituents
  • Automatic generalization based on patterns in the learning data (after part-of-speech tagging)
  • Exclusion of distituents can reduce the effect of frequent non-phrasal word sequences

Jonas Kuhn: MT 55

Empirical Results

Comparative experiment [Kuhn 2004 (ACL)]

  • A: Grammar induction based on English corpus data only
  • B: Induction including partial information from parallel corpus (Europarl corpus) [Koehn 2002]
      – Statistical word alignment, trained with GIZA++ [Al-Onaizan et al., 1999; Och/Ney, 2003]
      – Exclusion of “distituents” based on alignment-induced word blocks

Evaluation:

  • Parsing sentences from the Penn Treebank (Wall Street Journal) with the induced grammars
  • Comparison of the automatic analyses with the gold standard treebank analyses created by linguists

Jonas Kuhn: MT 56

Empirical Results

[Figure: bar chart of precision, recall, and F-score (0% to 100%) for five settings: strictly left-branching structure; strictly right-branching structure; A: standard PCFG induction; B: induction using partial information from parallel corpus; upper bound (oracle binary grammar).]

  • Precision = # correctly identified phrases / # proposed phrases
  • Recall = # correctly identified phrases / # gold standard phrases
  • F-score = mean of precision and recall
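The three measures in the chart can be computed from sets of phrase spans. The sketch below is illustrative only: it represents each phrase as an unlabeled (start, end) word span, and it assumes that the "mean" of precision and recall is the usual harmonic mean (F1); neither choice is confirmed by the slide. The function name `bracket_scores` is hypothetical.

```python
def bracket_scores(proposed, gold):
    """Precision, recall, and F-score over unlabeled phrase spans.

    proposed, gold: iterables of (start, end) word-index spans.
    Assumption: F-score is the harmonic mean of precision and recall.
    """
    proposed, gold = set(proposed), set(gold)
    correct = len(proposed & gold)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Made-up spans for a five-word sentence, for illustration only.
    parser_output = [(0, 2), (2, 5), (3, 5), (0, 5)]
    treebank = [(0, 2), (2, 5), (0, 5)]
    print(bracket_scores(parser_output, treebank))
```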


Jonas Kuhn: MT 57

The PTOLEMAIOS research agenda

  • Grammar formalism: formal grammar model for parallel linguistic analysis
  • Linguistic specification: specific choice of linguistic representations/constraints
  • Algorithmic realization: efficient parallel parsing algorithms
  • Probabilistic modeling: probability models for parallel structural representations
  • Bootstrapping learning: weakly supervised learning techniques for bootstrapping the grammars

Jonas Kuhn: MT 58

Architecture

The planned PTOLEMAIOS architecture

Jonas Kuhn: MT 59

Course Wrap-Up (1)

Week 1: “Classical” approaches

  • History & Overview
  • Transfer-based translation
  • Syntax-based transfer [Trujillo 1999]
  • Transfer as LFG projection [Kaplan et al. 1999]
  • Interlingua-based translation [Dorr 1994]
  • Term-rewriting transfer [Emele/Dorna 1998]

Jonas Kuhn: MT 60

Course Wrap-Up (2)

Week 2: Data-driven, statistical approaches

  • The noisy channel model [Brown et al. 1990, Knight 1999]
  • Language modeling
  • Translation modeling
  • Word alignment
  • Phrase alignment [Koehn et al. 2003]
  • Decoding [Koehn 2004]
  • Other uses of word alignments [Yarowsky et al. 2001]


Jonas Kuhn: MT 61

Course resources

The Prolog examples from week 1 and the tools for week 2 are available at: http://uts.cc.utexas.edu/~jonask/mt-course-material/

Most papers pointed to as additional readings are accessible online from the “ACL Anthology”: http://acl.ldc.upenn.edu/ (includes the journal “Computational Linguistics” and conference proceedings from ACL, EACL, NAACL, COLING, HLT)

Jonas Kuhn: MT 62

Possible project topics

1. A rule-based transfer system

  • Expand the DCG-based system we used in week 1
  • New language pair and/or
  • More complicated divergence examples and/or
  • Higher level of abstraction for transfer

2. Phrase-based statistical MT system

  • Implementation of a phrase table extraction routine, based on word alignments output by GIZA++ and
  • Evaluation with Pharaoh decoder on test data kept separate during training
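As a possible starting point for topic 2, the core of a phrase extraction routine can be sketched as follows. It collects phrase pairs that are consistent with the word alignment, in the sense that no word inside the pair is aligned to a word outside it. This is a simplified sketch, not a complete routine: it assumes the alignment has already been parsed into 0-based index pairs (reading the aligner's output files is not shown), it does not extend phrases over unaligned target words, and the name `extract_phrase_pairs` and the `max_len` limit are assumptions.

```python
def extract_phrase_pairs(source_length, alignment, max_len=7):
    """Return ((src_start, src_end), (tgt_start, tgt_end)) span pairs.

    alignment: set of (source_index, target_index) links, 0-based.
    Spans are inclusive. Only the minimal target span for each source
    span is produced.
    """
    pairs = set()
    for src_start in range(source_length):
        for src_end in range(src_start, min(src_start + max_len, source_length)):
            targets = [t for s, t in alignment if src_start <= s <= src_end]
            if not targets:
                continue  # no aligned word inside this source span
            tgt_start, tgt_end = min(targets), max(targets)
            if tgt_end - tgt_start + 1 > max_len:
                continue
            # Consistency check: every link into the target span must
            # originate inside the source span.
            if any(tgt_start <= t <= tgt_end and not src_start <= s <= src_end
                   for s, t in alignment):
                continue
            pairs.add(((src_start, src_end), (tgt_start, tgt_end)))
    return pairs


if __name__ == "__main__":
    source = "un producteur important".split()
    target = "a significant producer".split()
    links = {(0, 0), (1, 2), (2, 1)}  # hand-written for illustration
    for (s0, s1), (t0, t1) in sorted(extract_phrase_pairs(len(source), links)):
        print(" ".join(source[s0:s1 + 1]), "|||", " ".join(target[t0:t1 + 1]))
```

In the usage example, "un producteur" yields no phrase pair: its minimal target span would cover "significant", which is aligned to a source word outside the span. A full phrase table would additionally need counts over a whole corpus to estimate translation probabilities.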

Jonas Kuhn: MT 63

Possible project topics

3. Statistical MT

  • GIZA++ training on a new parallel corpus (possibly involving several languages) and
  • Linguistic discussion of the resulting alignment patterns (and errors) and/or
  • Experiments with variants of GIZA++ training (e.g., suffix deletion trick)

4. Using word alignment information in other ways

  • Extracting simple transfer rules (e.g., for nouns and adjectives) from a GIZA++-aligned corpus and
  • Integration in a rule-based (toy) transfer system and
  • Discussion of (= speculation about!?) scalability

Jonas Kuhn: MT 64

Project format

  • Team project (4-5 participants)
  • Remote collaboration (through email exchanges) is part of the exercise

Submission:

  • Running system with instructions, sample inputs
  • Short report on project (3-4 pages)
  • Wherever possible, evaluation of system quality (include in report)

Deadlines:

  • Announce topic (and specifics) chosen, team members: October 7
  • Final project: October 31 (email submission to jonask@coli.uni-sb.de)