Discourse-Level Statistical Machine Translation


SLIDE 1

Discourse-Level SMT

(joint work with Christian Hardmeier and others)

Statistical Machine Translation

  • The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.

Translation model: P(e|c), the probability of a target sentence e given a source sentence c

Search problem (decoding): e* = argmax_e P(e|c)

Both e and c are exactly one sentence: the models translate sentence by sentence.
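A minimal Python sketch of this search problem, with a made-up n-best list standing in for the candidate space and for P(e|c):

```python
import math

def decode(candidates, model_score):
    """e* = argmax_e P(e|c): return the highest-scoring candidate
    translation for the current source sentence."""
    return max(candidates, key=model_score)

# Hypothetical toy n-best list with made-up log-probabilities.
nbest = {
    "the airport received a threatening e-mail": math.log(0.4),
    "the airport a threatening e-mail received": math.log(0.1),
}
print(decode(nbest, nbest.get))  # -> "the airport received a threatening e-mail"
```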

Statistical MT Models

Phrase-Based SMT

  • translation of fragments
  • left-to-right decoding

Syntax-Based SMT

  • synchronous grammars
  • translate = parsing

Neural MT

  • continuous vector representations
  • recurrent networks

[Diagram: encoder-decoder network translating "I am a student" into "Je suis étudiant"]

All of these models are induced from data.

Discourse-Level Phenomena

Translate the following into German:

  • The girl took my key from the door lock. → Das Mädchen nahm meinen Schlüssel aus dem Schloss.
  • But it wasn't her key! → Aber es war nicht ihr Schlüssel.
  • I said: "Put it back!" → Ich sagte: "Stell es zurück!" / "Leg ihn zurück!" / "Steck ihn wieder rein!"

Translating the final "it" requires knowing its antecedent: "es" if it refers to the lock (das Schloss, neuter), "ihn" if it refers to the key (der Schlüssel, masculine).
SLIDE 2

Textual Cohesion

The 10 Commandments (1956):
  • To some land flowing with milk and honey! → Till ett land fullt av mjölk och honung.
  • I've never tasted honey. → Jag har aldrig smakat honung.
  • ...
  • He showed you no milk and honey! → Han gav er ingen mjölk och honung.

Kerd ma lui (2004):
  • Mari honey ... → Mari, gumman ...
  • Sweetheart, where are you going? → Älskling, var ska du?
  • ...
  • Who was that, honey? → Vem var det, gumman?

The Incredibles (2004):
  • How you doing, honey? → Hur går det älskling?
  • Do I have to answer? → Måste jag svara på det?
  • Kids, strap yourselves down like I told you. → Barn ... Gör som jag sa åt er ..
  • Here we go, honey. → Nu gäller det älskling

“One sense per discourse” - “One translation per discourse”

Connectedness of Natural Language

Textual cohesion

  • discourse markers
  • referential devices (e.g. pronominal anaphora)
  • ellipses (word omissions) and substitutions
  • lexical cohesion (word repetition, collocation)

Textual coherence

  • semantically meaningful relations
  • logical tense structure
  • presuppositions and implications connected to general world knowledge

Discourse and Machine Translation

Long-distance relations are lost in local MT models

  • sentence-by-sentence translation
  • limited context window

Discourse-level devices do not easily map between languages

  • explicit vs. implicit discourse markers
  • grammatical agreement in anaphoric relations

Terminological consistency

  • domain-specific requirements

Locality in Phrase-Based SMT

Source (Swedish): Bakom huset hittade polisen en stor mängd narkotika .
Word-by-word:     Behind the house found police a large quantity of narcotics .
Translation:      Behind the house police found a large quantity of narcotics .

[Figure: the decoder's sliding n-gram windows: "Behind the house", "the house police", "house police found", "police found a", "found a large", ...]

Locality comes from:
  • the small context window of the n-gram language model
  • context-independent phrase translations
  • context-independent re-ordering models
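As a sketch of how small that window is, trigram scoring conditions each word on just two predecessors (logprob is a stand-in for a trained model):

```python
def trigram_score(sentence, logprob):
    """Score a sentence with a trigram language model: every word is
    conditioned on only the two preceding words, so dependencies that
    span more than the n-gram window -- let alone other sentences of
    the document -- are invisible to the model."""
    words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    return sum(logprob(tuple(words[i - 2:i]), words[i])
               for i in range(2, len(words)))
```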

SLIDE 3

Left-to-right decoding
Dynamic programming using hypothesis recombination

Decoding by Hypothesis Expansion

[Figure: translation options for "er geht ja nicht nach hause": are / it / he, goes / go, yes, does not / not, to home / home]

Hypothesis Recombination

Combine branches to greatly reduce the search space

  • only possible with strictly local models
  • lossy beam search is severely affected if recombination cannot be done

[Figure: hypothesis lattice for "er geht ja nicht nach hause" with partial-hypothesis scores (e.g. "it" p:-0.484, "he" p:-0.556, "goes" p:-1.648, "does not" p:-1.664, ..., "house" p:-5.912); recombined paths meet in shared states with scores -2.729, -3.569 and -4.672]
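A minimal sketch of the recombination idea (hypothesis triples and scores are illustrative, not Moses' actual data structures):

```python
def recombine(hypotheses, lm_order=3):
    """Merge hypotheses that agree on everything a strictly local model
    can still see: the set of covered source positions and the last
    n-1 target words.  Only the best-scoring hypothesis per such state
    survives; under strictly local models no other can overtake it."""
    best = {}
    for coverage, target, score in hypotheses:
        state = (coverage, target[-(lm_order - 1):])  # all the model sees
        if state not in best or score > best[state][2]:
            best[state] = (coverage, target, score)
    return list(best.values())

hyps = [
    (frozenset({0, 1, 2, 3}), ("he", "does", "not"), -1.664),
    (frozenset({0, 1, 2, 3}), ("it", "does", "not"), -2.729),
]
print(recombine(hyps))  # only the better-scoring "he does not" path survives
```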

Lexical Cohesion and Consistency

Lexical consistency and textual cohesion

  • encourage consistent lexical choice (two-pass decoding)
  • re-ranking MT output (n-best lists)
  • explicit cohesion model

Natural Repetitions and Probabilistic Models

  • one sense per discourse
  • the likelihood of seeing the same term again goes up

Cache-Based Models

Adaptive language models

  • cache information (even across sentence boundaries)
  • integrated topic model, topic-shift detection

Cached Language Models

  • standard n-gram language models: modified likelihoods
  • add term likelihoods from caching history
  • include decay function to slowly forget cached history

P(w_1..w_n) = (1 − λ) · P_ngram(w_1..w_n) + λ · P_cache(w_1..w_n)

P_ngram is fixed (estimated from training data); P_cache is dynamic (estimated from the cache).
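A minimal sketch of this interpolation with a unigram cache (ngram_prob stands in for the fixed, pre-trained model; a decay function is added below):

```python
from collections import Counter

class CacheLM:
    """Interpolate a fixed n-gram probability with a dynamic unigram
    cache of recently produced words."""

    def __init__(self, ngram_prob, lam=0.1):
        self.ngram_prob = ngram_prob  # stand-in for a trained model
        self.lam = lam
        self.cache = Counter()

    def prob(self, word, history):
        p_cache = (self.cache[word] / sum(self.cache.values())
                   if self.cache else 0.0)
        return ((1 - self.lam) * self.ngram_prob(word, history)
                + self.lam * p_cache)

    def update(self, word):
        """Cache a translated word, even across sentence boundaries."""
        self.cache[word] += 1
```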

SLIDE 4

Cache-Based Models

Model perplexity on out-of-domain data:

cache size   λ=0.005   λ=0.05    λ=0.1     λ=0.2     λ=0.3
(no cache)   376.124   376.124   376.124   376.124   376.124
50           317.695   270.700   259.212   256.376   264.905
100          314.195   261.115   246.618   239.237   243.276
500          313.591   252.155   233.098   219.118   216.989
1000         310.135   240.646   217.996   199.221   192.870
2000         309.362   234.570   209.578   187.857   179.056
5000         312.367   235.323   209.068   185.789   175.783
10000        315.435   237.633   210.745   186.647   176.061
20000        318.101   239.868   212.471   187.735   176.674

Large impact! → 53.3% perplexity reduction

Decaying Cache Models

P_decaycache(w_n | w_{n−K}..w_{n−1}) ≈ (1/Z) · Σ_{i=n−K}^{n−1} I(w_n = w_i) · exp(−α(n−i))

[Plot: perplexity as a function of the decay factor α for cache sizes 2000, 5000 and 10000]
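A sketch of the decayed cache probability (normalising Z over all cached tokens, which is one consistent reading of the formula):

```python
import math

def decay_cache_prob(word, cache, alpha=0.001):
    """Each cached occurrence of `word` contributes exp(-alpha * age),
    so recent occurrences count more than old ones.  `cache` holds the
    last K words, most recent last."""
    if not cache:
        return 0.0
    n = len(cache)
    weights = [math.exp(-alpha * (n - i)) for i in range(n)]
    z = sum(weights)  # normalise over the whole cache
    return sum(w for w, cached in zip(weights, cache) if cached == word) / z
```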

Cached Translation Models

Prefer consistent translation options

Selective caching:
  • content words only (approximated by a length threshold)
  • reliable hypotheses (hypothesis cost threshold)

φ_cache(e_n | f_n) = Σ_{i=1}^{K} I(⟨e_n, f_n⟩ = ⟨e_i, f_i⟩) · exp(−α·i) / Σ_{i=1}^{K} I(f_n = f_i)
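A sketch of this feature (assuming i counts back from the most recent cache entry, so older pairs are discounted more):

```python
import math

def cache_tm_feature(src, tgt, cache, alpha=0.001):
    """Reward a phrase pair <src, tgt> in proportion to how often, and
    how recently, the same pair was used before, normalised by how
    often `src` was translated at all.  `cache` holds (source, target)
    phrase pairs, most recent last."""
    k = len(cache)
    num = sum(math.exp(-alpha * (k - i))
              for i, (s, t) in enumerate(cache) if (s, t) == (src, tgt))
    den = sum(1 for s, _ in cache if s == src)
    return num / den if den else 0.0
```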

Cache Models in Out-of-domain SMT

[Bar chart: per-document BLEU score differences, cache models vs. standard models]

SLIDE 5

Cache Models in Out-of-domain SMT

[Bar chart: per-document BLEU score differences, cache LM vs. standard LM and cache TM vs. standard TM]

Mixed-Domain Models

WMT 2010 data (train = Europarl + News, test = News); BLEU and n-gram precision scores:

                 BLEU   1-gram  2-gram  3-gram  4-gram
de-en baseline   21.3   57.4    27.8    15.1     8.6
de-en cache      21.5   58.1    28.1    15.2     8.7
en-de baseline   15.6   52.5    21.7    10.6     5.5
en-de cache      14.4   52.6    21.0     9.9     4.9
es-en baseline   26.7   61.7    32.7    19.9    12.6
es-en cache      26.1   62.6    32.7    19.8    12.5
en-es baseline   26.9   61.5    33.3    20.5    12.9
en-es cache      23.0   60.6    30.4    17.6    10.4

→ no success!

One Problem: Error Propagation

input:     Naturschützer wird der Erpressung beschuldigt
baseline:  facing conservationists is accused of extortion
cache:     facing conservationists is accused of extortion
reference: Nature protection officers accused of blackmail

input:     Die Leitmeritz-Polizei beschuldigte den Vorsitzenden der Bürgervereinigung "Naturschutzgemeinschaft Leitmeritz" wegen Erpressung.
baseline:  the leitmeritz-polizei accused the chairman of the bürgervereinigung " naturschutzgemeinschaft leitmeritz " because of blackmail .
cache:     the leitmeritz-polizei accused the chairman of the bürgervereinigung " naturschutzgemeinschaft leitmeritz " because of extortion .
reference: The Litomerice police have accused the chairman of the Litomerice Nature Protection Society civil association of blackmail.

Discussion

Mixed Results with Cache-Based Models

  • lower perplexity especially on out-of-domain data
  • not always positive effects when used in SMT

What is the problem?

  • the cache model better explains human data, not generated data
  • it pushes the decoder in the wrong direction when used for generating machine translation output (error propagation!)
  • the cache component is too simple (usually a unigram model)

Possible application: Interactive Machine Translation

SLIDE 6

Document-Level Decoding

[Figure: the same beam-search hypothesis lattice for "er geht ja nicht nach hause" as shown above]

Beam Search Decoding

Advantages:

  • very good search results in a huge search space
  • manageable complexity
  • best efficiency / accuracy trade-off with current models

Disadvantages:

  • Markov assumption (independence outside of limited history)
  • sentence-internal long-distance dependencies dramatically increase the search space and cause more search errors

  • difficult to integrate cross-sentence dependencies

Document-Level Decoding in Docent

[Diagram: Docent initialises a document translation, either randomly or with Moses, then loops: apply a change operation, update the score, and accept or reject the change]

Local search with stochastic search operations

  • start with a complete (suboptimal) solution
  • apply small changes anywhere to improve it
  • hill-climbing or annealing

The final document translation is returned when the step limit (10^8) or the rejection limit (10^5) is reached.

Document-Level Decoding

Initial state

  • randomly selected segmentation and translation
  • or initialised by beam search with local features only

Search operations

  • simple operations that open up the entire search space
  • the next operation is selected at random

Search and Termination

  • accept useful changes
  • until the step limit is reached
  • or the rejection limit is reached

Search is non-deterministic

SLIDE 7

Example: Initial Step

  • that prevention this disease
  • unfortunately , there are is not a miracle cure contribute to preventing get cancer but
  • in spite to the made progress in ' scientific research there remains the a healthy ways of life the best solution , the if of the risks decrease , in with him disease this .

Example: Step 8,192

  • that prevention of the disease
  • unfortunately , there are no miracle cure for preventing cancer .
  • in spite of the progress in research remains the adoption of a healthy lifestyle is the best way , about the risk to reduce , in him on developing them .

Example: Step 134,217,728

  • prevention of the disease
  • unfortunately , there is no miracle cure for preventing cancer .
  • despite the progress in research , the adoption of a healthy lifestyle , the best way to minimise the risk of him on ill .

Search Operations

er geht ja nicht nach hause → it is yes not after house

SLIDE 8

Search Operations

Change translation option:
  • change the translation of one randomly selected phrase
    er geht ja nicht nach hause: "it is yes not after house" → "he is yes not after house"

Resegment:
  • find a new segmentation into phrases for a randomly selected sequence of adjacent source words
    "he is yes not after house" → "he is not after house"

Swap phrases:
  • exchange the positions of two phrases

Greedy Hill Climbing

Initialize; from the given state, generate a successor state
  • apply a randomly chosen operation at a random location

Compute score for the new state

  • may look at any place in the document
  • efficient score updates are necessary

Hill climbing

  • accept if the new state has a higher model score
  • continue until satisfied (or tired of waiting)
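A sketch of this loop with the limits from above (operations and score are placeholders for Docent's change operations and feature models):

```python
import random

def hill_climb(initial_state, operations, score,
               step_limit=10**8, rejection_limit=10**5):
    """Greedy hill climbing: apply a random change operation, keep the
    change only if the document score improves, and stop at the step
    limit or after too many consecutive rejections."""
    state, best = initial_state, score(initial_state)
    rejections = 0
    for _ in range(step_limit):
        candidate = random.choice(operations)(state)
        s = score(candidate)
        if s > best:                       # accept only improvements
            state, best, rejections = candidate, s, 0
        else:
            rejections += 1
            if rejections >= rejection_limit:
                break                      # probably a local optimum
    return state
```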
SLIDE 9

Learning Curves

English-French Newswire (WMT task)

[Plots: normalised model score and BLEU as a function of search steps (1e+02 to 1e+08) for standard models, comparing runs with and without beam-search initialisation]

Selected Discourse-Level Features

Lexical cohesion models

  • motivation: one sense per discourse / translation
  • document language model using distributional semantics

Lexical consistency

  • penalise inconsistent translation options

Stylistic features, target group adaptation

  • improved readability
  • simplified translation
  • format / formatting constraints

Pronominal Anaphora

Translating pronouns requires target-language dependencies

  • The funeral of the Queen Mother will take place on Friday. It will be broadcast live.
  • Les funérailles de la reine-mère auront lieu vendredi. Elles seront retransmises en direct.

Alternative translation:

  • L'enterrement de la reine-mère aura lieu vendredi. Il sera retransmis en direct.

The translation of "it" depends on the translation chosen for "funeral": "les funérailles" (feminine plural) requires "elles", while "l'enterrement" (masculine singular) requires "il".

A Model for Pronominal Anaphora

[Diagram: "The latest version released in March is equipped with ... It is sold at ..." → "La dernière version lancée en mars est dotée de ... ___ est vendue ..."; the missing pronoun must agree with the antecedent "version" (fem. sg.). Components: head finding, word alignment, anaphora resolution, morphological annotation, dependency modelling]

SLIDE 10

A Model for Pronominal Anaphora

[Same diagram, resolved: the gap is filled with the feminine singular pronoun, "Elle est vendue ..."]

Pronoun Translation as Prediction

Pronoun prediction as a separate task

  • create a classifier for ambiguous cases
  • train on training data with or without supervision

Integrate classifier predictions in SMT

  • prediction likelihood / score as feature
  • combine with all other models in document decoder

Any classifier will do, but we opt for a neural network

  • hidden layers for better abstraction over diverse input features

[Diagram: feed-forward network with embedding layer (E), hidden layer (H) and softmax output (S); inputs are the source context words around the pronoun (L3 R3 L2 R2 L1 R1 P) and antecedent candidates (A) weighted by resolver scores p1, p2, p3]
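A minimal numpy sketch of such a classifier; the layer sizes, feature encoding and weights are made-up stand-ins, not the exact model:

```python
import numpy as np

CLASSES = ("ce", "elle", "elles", "il", "ils", "NONE")

def predict_pronoun(context_vec, antecedent_vecs, antecedent_weights,
                    W_hidden, W_out):
    """Combine source-context embeddings with antecedent vectors
    weighted by the resolver scores (p1, p2, p3 in the figure), pass
    them through a hidden layer and a softmax over the French pronoun
    classes."""
    a = sum(w * v for w, v in zip(antecedent_weights, antecedent_vecs))
    x = np.concatenate([context_vec, a])
    h = np.tanh(W_hidden @ x)          # hidden layer (H)
    logits = W_out @ h                 # output layer (S)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return CLASSES[int(np.argmax(p))], p

# Toy demo with random parameters (all dimensions are arbitrary).
rng = np.random.default_rng(0)
ctx = rng.normal(size=20)
ants = [rng.normal(size=20) for _ in range(3)]
W_h, W_o = rng.normal(size=(16, 40)), rng.normal(size=(len(CLASSES), 16))
print(predict_pronoun(ctx, ants, [0.6, 0.3, 0.1], W_h, W_o)[0])
```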

Predicting Pronoun Translations

  • input: an English pronoun (it, they) and its coreferential context: potential antecedent candidates, weighted by the anaphora resolver
  • output: a French pronoun: ce, elle, elles, il, ils or NONE
  • the network is trained on large collections of translated documents
  • the source-language context is represented by embeddings

Integrated Anaphora Resolution

[Diagram: the same network extended with additional layers (T, U, V) that perform anaphora resolution inside the classifier instead of relying on an external resolver]

Training with word-aligned bitexts as the only type of supervision!

SLIDE 11

Classification Results

Per-class results for the neural network classifier without a separate co-reference resolution system (P/R/F), compared to F-scores for the same classifier using BART, an off-the-shelf co-reference resolver, and for a maximum-entropy baseline:

        P      R      F      F (with BART)  F (ME baseline)
ce      0.611  0.723  0.662  0.686          0.654
elle    0.749  0.596  0.664  0.679          0.632
elles   0.602  0.616  0.609  0.434          0.273
il      0.733  0.638  0.682  0.649          0.639
ils     0.710  0.884  0.788  0.778          0.759
other   0.760  0.704  0.731  0.709          0.708

SMT Integration

Pronoun prediction classifier

  • the classifier score is used as an additional feature model
  • quite straightforward in Docent

Results

  • difficult to evaluate
  • standard SMT metrics don't work well
  • mixed results so far

                 P      R      F
News  baseline   0.317  0.343  0.330
News  predicted  0.321  0.346  0.333
TED   baseline   0.451  0.444  0.447
TED   predicted  0.457  0.448  0.452
TED   gold       0.458  0.449  0.453

Pronoun evaluation scores for SMT with and without the anaphora model.

Summary

Document-Level Decoding

  • local search with stochastic change operations
  • flexible framework that allows long-distance dependencies
  • initialize with beam search

Discourse-Level SMT

  • respect connectedness of sentences
  • treat phenomena that differ between languages
  • difficult to achieve visible (positive) effects

Questions? Suggestions?

SLIDE 12

Links

Docent, the document-level decoder:

  • https://github.com/chardmeier/docent/wiki

Christian Hardmeier’s PhD Thesis:

  • Discourse in Statistical Machine Translation
  • http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-223798

Shared Task in Cross-Lingual Pronoun Prediction:

  • http://www.statmt.org/wmt16/pronoun-task.html