Adrià de Gispert Department of Engineering University of - - PowerPoint PPT Presentation



slide-1
SLIDE 1

HIERARCHICAL PHRASE-BASED TRANSLATION

AT UNIVERSITY OF CAMBRIDGE

Adrià de Gispert

Department of Engineering University of Cambridge 19 July 2011

slide-2
SLIDE 2

Intro Hierarchical Translation with WFSTs Hiero Grammar Definition Hiero with Push-Down Transducers FAUST project

Hierarchical Phrase-Based Translation

Translation with a Probabilistic Synchronous CFG, G

◮ Apply the CYK parsing process to source sentence s
◮ Create a (strongly regular) context-free target language, T = {s} ∘ G

Application of a Language Model, M

◮ Intersect N-gram model
◮ Create a language of translation candidates, L = T ∩ M

Search for highest-probability candidate

◮ Apply shortest path, k-best, beam-search algorithm: $\hat{L} = \operatorname{argmax}_{l \in L} L$

Department of Engineering University of Cambridge Hierarchical Phrase-Based Translation at University of Cambridge July 2011 1 / 51

slide-3
SLIDE 3


Two-fold Motivation

1. Powerful Decoding Tools

◮ Explore large search spaces
◮ Apply strong language models in intersection with grammar
◮ Output rich space of candidates for rescoring

2. Adequate Hierarchical Grammar

◮ Ruleset extraction from parallel corpora
◮ Allow sufficient movement/translation for each language pair
◮ Minimise overgeneration

slide-4
SLIDE 4

Outline

◮ Hierarchical Phrase-Based Translation with WFSTs (2009-2010)
  ◮ Lattice-Based Alternative to Cube Pruning
◮ Hiero Grammar Definition (2010)
  ◮ Shallow Grammars for Exact Search
  ◮ Rule Extraction from Alignment Posteriors
◮ Decoding with Push-Down Transducers (2011-)
◮ FAUST project (2010-2012)

Joint work with: Bill Byrne, Gonzalo Iglesias, Graeme Blackwood + PhD students Juan Pino, Rory Waite (University of Cambridge)

slide-5
SLIDE 5

Hierarchical Phrase-Based Translation

Source sentence (s1 ... s6): wqAl Alr}ys Alswry Ams Anh syEwd ( وقال الرئيس السوري امس انه سيعود )

Derivation: X over s1 via R3, then S via R1
Translation so far: said

R1: S→⟨X , X⟩
R2: S→⟨S X , S X⟩
R3: X→⟨s1 , said⟩
R4: X→⟨s1 s2 , the president said⟩
R5: X→⟨s1 s2 s3 , Syrian president says⟩
R6: X→⟨s2 , president⟩
R7: X→⟨s3 , the Syrian⟩
R8: X→⟨s4 , yesterday⟩
R9: X→⟨s5 , that⟩
R10: X→⟨s6 , would return⟩
R11: X→⟨s6 , he would return⟩

slide-6
SLIDE 6

Hierarchical Phrase-Based Translation

Source sentence (s1 ... s6): wqAl Alr}ys Alswry Ams Anh syEwd ( وقال الرئيس السوري امس انه سيعود )

Derivation: one X per source word via R3, R6, R7, R8, R9, R11; S via R1, then R2 (x5 times)
Translation: said president the Syrian yesterday that he would return

Rules R1 to R11 as listed on the previous slide.

slide-7
SLIDE 7

Hierarchical Phrase-Based Translation

Source sentence (s1 ... s6): wqAl Alr}ys Alswry Ams Anh syEwd ( وقال الرئيس السوري امس انه سيعود )

Derivation: X over s3 via R7, nested inside R13 and then R12; one X each for s4, s5, s6 via R8, R9, R11; S via R1, then R2 (x3 times)
Translation: the Syrian president said yesterday that he would return

Rules R1, R2, R3, ..., R6 to R11 as before, plus:
R12: X→⟨s1 X , X said⟩
R13: X→⟨s2 X , X president⟩

slide-8
SLIDE 8

Hierarchical Phrase-Based Translation

Source sentence (s1 ... s6): wqAl Alr}ys Alswry Ams Anh syEwd ( وقال الرئيس السوري امس انه سيعود )

Derivation: R14 with X1 via R3 and X2 via R7; X over s6 via R11; S via R1, then R2
Translation: yesterday the Syrian president said that he would return

Rules R1, R2, R3, ..., R6 to R11 as before, plus:
R14: X→⟨X1 s2 X2 s4 s5 , y’day X2 president X1 that⟩

slide-9
SLIDE 9

Keeping Track of All Derivations. CYK Grid

[Figure: CYK grid over s1 ... s6 (wqAl Alr}ys Alswry Ams Anh syEwd) with an S row and X rows; the X cells hold R3, R4, R5 and R6, R7, R8, R9, R10, R11]

Rules R1 to R11 as on the preceding slides.

slide-10
SLIDE 10

Keeping Track of All Derivations. CYK Grid

[Figure: the same CYK grid over s1 ... s6, now with the glue rules R1 and R2 entered in the S row]

Rules R1 to R11 as on the preceding slides.

slide-11
SLIDE 11

Keeping Track of All Derivations. CYK Grid (2)

[Figure: the same CYK grid over s1 ... s6, with hierarchical rule R14 added to the X cells and R1 in the S row]

Rules R1, R2, R3, ..., R6 to R11 as before, plus:
R14: X→⟨X1 s2 X2 s4 s5 , y’day X2 president X1 that⟩

slide-12
SLIDE 12

Cube Pruning Algorithm ¹

[Figure: CYK grid over s1 s2 s3 with S and X rows; cells annotated with derivation counts (x20, x420, x8420)]

◮ The number of derivations can be vast
◮ Each derivation produces a translation candidate
◮ Each candidate has a score
◮ Find best candidate: $\operatorname{argmax}_{t \in T} P(s|t)\,P(t)$

◮ Cube-Pruning Algorithm

◮ One-by-one processing of every derivation is not feasible
◮ Lists of k-best hypotheses are kept in each cell ($k = 10^4$)
◮ Local decisions based on Translation and Language Model

Translation Model fits well in this grid representation
× Language Model does not: $P(t) = \prod_{n=1}^{I} p(t_n | t_{n-1})$

would return ← p(return|would) × p(would|?)
he would return ← p(return|would) × p(would|he) × p(he|?)

¹ Chiang, D. 2005. A Hierarchical Phrase-Based Model for Statistical Machine Translation. Proc. ACL. This is a modified version of the CFG intersection with a finite-state machine described by Bar-Hillel et al. 1964.
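The "p(would|?)" problem can be made concrete with a small sketch. It scores a partial hypothesis with a bigram model and reports how many terms cannot be resolved until the left context is known. The probability table and the function name `bigram_logprob` are invented for illustration and do not come from the slides.

```python
import math

# Hypothetical bigram probabilities, invented for illustration only.
BIGRAM = {
    ("<s>", "he"): 0.2,
    ("he", "would"): 0.5,
    ("<s>", "would"): 0.01,
    ("would", "return"): 0.4,
}

def bigram_logprob(words, left_context=None):
    """Sum log p(t_n | t_{n-1}) over the words of a partial hypothesis.

    Returns (log-probability of the resolvable terms, number of terms
    that cannot be scored because the preceding word is unknown).
    """
    total, unresolved = 0.0, 0
    prev = left_context
    for word in words:
        if prev is None:
            unresolved += 1          # the p(word | ?) case
        else:
            total += math.log(BIGRAM[(prev, word)])
        prev = word
    return total, unresolved

# Inside a grid cell the left neighbour is unknown, so one term stays open.
partial_score, open_terms = bigram_logprob(["would", "return"])
# Once the cell is combined with "he" on its left, every term is resolved.
full_score, none_open = bigram_logprob(["would", "return"], left_context="he")
```

The score of a cell entry therefore depends on what is later placed to its left, which is why language model costs do not decompose over grid cells the way translation model costs do.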

slide-13
SLIDE 13

Weighted Finite-State Acceptors (WFSAs)

◮ WFSAs are devices that compactly model a formal language ²
◮ A Weighted Acceptor of strings ‘a b c d’ and ‘a b b d’:

[Figure: WFSA with arcs a/0.1, b/0.3, then c/0.7 or b/0.2 in parallel, then d]

is defined by a set of states Q and a set of arcs: $q \xrightarrow{x/w} q'$

◮ Weighted Acceptors can assign costs to strings:

  • strings are associated with paths, which are sequences of arcs
  • weights are accumulated over paths by means of a product operation

$w(p) = w(e_1) \otimes \cdots \otimes w(e_n)$

² Following the formulation of M. Mohri. 1997. Finite-state transducers in language and speech processing. Computational Linguistics.
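A minimal sketch of how path weights accumulate for the example acceptor. The slides do not say which semiring is used, so the tropical semiring (⊗ is addition, ⊕ is minimum) is assumed as the default here; the state numbering and the zero weight on the final 'd' arc are also assumptions.

```python
# Arcs of the example acceptor: (source state, label, weight, target state).
# State numbers and the weight of the 'd' arc are assumptions.
ARCS = [
    (0, "a", 0.1, 1),
    (1, "b", 0.3, 2),
    (2, "c", 0.7, 3),
    (2, "b", 0.2, 3),
    (3, "d", 0.0, 4),
]
START, FINAL = 0, 4

def string_weight(symbols, times=lambda x, y: x + y, plus=min, one=0.0):
    """Weight of a string: ⊗ along each accepting path, ⊕ across paths.

    The defaults give the tropical semiring. Returns None if the string
    is not accepted.
    """
    # frontier maps each reachable state to the ⊕-combined prefix weight
    frontier = {START: one}
    for sym in symbols:
        nxt = {}
        for src, label, weight, dst in ARCS:
            if label == sym and src in frontier:
                cand = times(frontier[src], weight)
                nxt[dst] = plus(nxt[dst], cand) if dst in nxt else cand
        frontier = nxt
    return frontier.get(FINAL)

w_abcd = string_weight("abcd")   # 0.1 ⊗ 0.3 ⊗ 0.7 ⊗ 0.0
w_abbd = string_weight("abbd")   # 0.1 ⊗ 0.3 ⊗ 0.2 ⊗ 0.0
w_abd = string_weight("abd")     # rejected: no such path
```

Passing `times=lambda x, y: x * y, plus=lambda x, y: x + y, one=1.0` switches the same function to the probability semiring, which is the sense in which the algorithms are generic over the semiring.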

slide-14
SLIDE 14

WFSA Operations - Union

A string x is accepted by C = A ∪ B if x is accepted by A or by B

$[\![C]\!](x) = [\![A]\!](x) \oplus [\![B]\!](x)$

[Figure: example acceptors A (arcs red/0.5, green/0.3, blue, yellow/0.6) and B (arcs green/0.4, blue/1.2)]

slide-15
SLIDE 15

WFSA Operations - Concatenation (or Product)

A string x is accepted by C = A ⊗ B if x can be split into x = x1 x2 such that x1 is accepted by A and x2 is accepted by B

$[\![C]\!](x) = \bigoplus_{x_1, x_2 :\, x = x_1 x_2} [\![A]\!](x_1) \otimes [\![B]\!](x_2)$

[Figure: example acceptors A (arcs red/0.5, green/0.3, blue, yellow/0.6) and B (arcs green/0.4, blue/1.2)]
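The two definitions can be tried out on explicit weighted string sets. This is only a sketch at the level of the languages, not of the automata: each acceptor is represented as a dictionary from strings (tuples of words) to weights, the tropical semiring is assumed, and the toy languages below are hypothetical rather than read off the figure.

```python
def union(a, b, plus=min):
    """[[C]](x) = [[A]](x) ⊕ [[B]](x) for every x accepted by A or by B."""
    c = dict(a)
    for x, w in b.items():
        c[x] = plus(c[x], w) if x in c else w
    return c

def concat(a, b, plus=min, times=lambda x, y: x + y):
    """[[C]](x) = ⊕ over all splits x = x1 x2 of [[A]](x1) ⊗ [[B]](x2)."""
    c = {}
    for x1, w1 in a.items():
        for x2, w2 in b.items():
            x, w = x1 + x2, times(w1, w2)
            c[x] = plus(c[x], w) if x in c else w
    return c

# Toy languages (hypothetical; not taken from the slide's figure).
A = {("red",): 0.5, ("green", "blue"): 0.3}
B = {("green", "blue"): 0.4, ("blue",): 1.2}

U = union(A, B)     # "green blue" is in both: its weight is 0.3 ⊕ 0.4
C = concat(A, B)    # every string of A followed by every string of B
```

A WFSA library performs the same operations on the automata directly, without enumerating strings, which is what keeps them cheap on large lattices.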

slide-16
SLIDE 16

WFSA Operations for Compactness

WFSAs can be made compact with operations that:
⊲ reduce their size in number of states/arcs
⊲ accept the same distinct strings
⊲ respect the cost of each string according to the semiring

[Figure: example WFSA with arcs a/0.4, b/0.5, a/0.5, b/0.4, c/1, c/2]

◮ Lattices can compactly represent more than $10^{60}$ paths in translation ³
◮ Processing a WFSA can be much faster than processing each of its paths individually

³ R. Tromble, S. Kumar, F. Och, and W. Macherey. 2008. Lattice Minimum Bayes-Risk Decoding for Statistical Machine Translation. EMNLP.

slide-17
SLIDE 17

HiFST. Hierarchical Translation with WFSTs ⁴

[Figure: CYK grid over s1 s2 s3 with S and X rows; cells annotated with derivation counts (x20, x420, x8420)]

◮ Keep all possible derivations in each cell

Efficiently explore largest T in $\operatorname{argmax}_{t \in T} P(s|t)\,P(t)$

◮ Build a WFSA in each cell

◮ They compactly store millions of paths with Translation Model costs
◮ We can operate on them easily and quickly
◮ Applying a Language Model to a WFSA is a well-established task

In each cell, do:
    For each rule in the cell:
        Build Rule WFSA by Concatenating target elements
    Build Cell WFSA by Unioning Rule WFSAs

⁴ Iglesias, G. et al. 2009. Hierarchical Phrase-Based Translation with Weighted Finite State Transducers. Proc. of NAACL-HLT.
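The per-cell loop can be sketched on explicit weighted string sets. This is not the HiFST implementation: lattices are plain dictionaries from word tuples to costs rather than automata, the tropical semiring is assumed, lower cells are expanded eagerly, and the rule costs and the second rule are hypothetical.

```python
def concat(a, b):
    """Concatenation: costs add; duplicate strings keep the lower cost."""
    out = {}
    for x1, c1 in a.items():
        for x2, c2 in b.items():
            key, cost = x1 + x2, c1 + c2
            out[key] = min(out.get(key, cost), cost)
    return out

def union(a, b):
    """Union: a string present in both keeps the lower cost."""
    out = dict(a)
    for x, c in b.items():
        out[x] = min(out.get(x, c), c)
    return out

def rule_lattice(target, rule_cost, gap_lattice):
    """Build one rule's lattice by concatenating its target elements.

    target is a tuple of words and 'X'; gap_lattice is the lattice of the
    lower cell that fills the X gap.
    """
    lattice = {(): rule_cost}
    for element in target:
        piece = gap_lattice if element == "X" else {(element,): 0.0}
        lattice = concat(lattice, piece)
    return lattice

def cell_lattice(rules, gap_lattice):
    """Build the cell lattice as the union of its rule lattices."""
    lattice = {}
    for target, rule_cost in rules:
        lattice = union(lattice, rule_lattice(target, rule_cost, gap_lattice))
    return lattice

# Lower cell for s3 (cost invented), then a cell spanning s2 s3.
lower = {("the", "Syrian"): 0.3}
rules = [
    (("X", "president"), 0.5),        # target side of R13, cost invented
    (("Syrian", "president"), 1.0),   # hypothetical phrase rule
]
cell = cell_lattice(rules, lower)
```

With real WFSAs the concatenation and union are automaton operations, so the cell stores all derivations compactly instead of listing them.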

slide-18
SLIDE 18

Building Rule WFSAs by Concatenation

[Figure: rule WFSAs for R5 and R12]

slide-19
SLIDE 19

Building Cell WFSA by Union

[Figure: cell WFSA formed from the rule WFSAs for R5 and R12]

◮ Can be made compact
◮ Target language model can be applied
◮ Search can be carried out efficiently

slide-20
SLIDE 20

Delayed Translation

Easy implementation with FST Replace operation
Usual FST operations can be applied to skeleton → lattice size reduction

slide-21
SLIDE 21

Language Model Intersection and Pruning

Output is top-most translation lattice L(S, 1, J). Steps required:

1. Fully expand lattice via FST replace
2. FST compose with target Language Model (delayed)
3. Perform likelihood-based pruning (admissible)

Pruning in Search for lower-level cells (inadmissible)

◮ If number of states, non-terminal category and source span meet certain conditions, then:

⊲ Expand Pointers in translation Lattice and Compose with Language Model
⊲ Perform likelihood-based pruning of the lattice
⊲ Remove Language Model
⊲ Remove epsilons, determinize, minimize for compactness

◮ Only required for certain language pairs and grammars (more on this later)

slide-22
SLIDE 22

Translation Results into English. Contrast CP vs HiFST

[Figure: bar chart (axis 25 to 55) of translation results into English from Chinese, Arabic and Spanish, comparing CP (k=10,000) with HiFST; the best scoring WMT08 system is marked]

slide-23
SLIDE 23

Reliability of N-gram Posterior Distributions – lattices vs k-best lists

Precision of 4-grams in the translation hypotheses as a function of their posterior

[Figure: two panels, AR→EN mt0205test (ML BLEU=53.78) and AR→EN mt0205tune (ML BLEU=54.23). x-axis: posterior probability threshold β; y-axis: average per-sentence 4-gram precision p4,β; curves: Lattice, n=10, n=100, n=1000, n=10000]

◮ High posterior n-grams are more likely to be found in the references
◮ Using the full evidence space of the lattice is much better than even very large k-best lists for computing posterior probabilities
◮ Should be useful for translation confidence measures

slide-24
SLIDE 24

Lattice-Based Rescoring Steps

◮ Zero-cutoff stupid-backoff ⁵ 5-gram LM estimated over 6.6B words of English
◮ Lattice MBR ⁶

◮ Fast computation of path posteriors using counting transducers ⁷

◮ Large gains in combination of lattices from alternative input representations ⁸
◮ Easy to change semiring: marginalization over derivations is possible, leading to gains prior to MBR

⁵ T. Brants, A. Popat, P. Xu, F. Och and J. Dean. 2007. Large Language Models in Machine Translation. EMNLP.
⁶ R. Tromble, S. Kumar, F. Och, and W. Macherey. 2008. Lattice Minimum Bayes-Risk Decoding for Statistical Machine Translation. EMNLP.
⁷ G. Blackwood, A. de Gispert and W. Byrne. 2010. Efficient Path Counting Transducers for Minimum Bayes-Risk Decoding of Statistical Machine Translation Lattices. ACL.
⁸ A. de Gispert, S. Virpioja, M. Kurimo and W. Byrne. 2009. Minimum Bayes Risk Combination of Translation Hypotheses from Alternative Morphological Decompositions. HLT-NAACL.
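The remark about changing the semiring can be illustrated with a toy example. Under the tropical semiring each translation is scored by its single best derivation; under the log semiring the derivations of the same translation are summed, i.e. marginalized. The derivation list and its costs are invented for illustration.

```python
import math

# Hypothetical derivations: (translation, cost = negative log probability).
DERIVATIONS = [
    ("he would return", 1.2),
    ("he would return", 1.3),
    ("would return", 1.0),
]

def log_plus(a, b):
    """⊕ of the log semiring on negative log probabilities."""
    return -math.log(math.exp(-a) + math.exp(-b))

def best_translation(derivations, plus):
    """Combine derivations of the same string with ⊕, then pick the
    translation with the lowest combined cost."""
    totals = {}
    for text, cost in derivations:
        totals[text] = plus(totals[text], cost) if text in totals else cost
    return min(totals, key=totals.get), totals

viterbi_best, viterbi_costs = best_translation(DERIVATIONS, min)
marginal_best, marginal_costs = best_translation(DERIVATIONS, log_plus)
```

Here the two semirings disagree: the best single derivation favours "would return", while summing over derivations favours "he would return".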

slide-25
SLIDE 25

Contrast CP vs HiFST

◮ HiFST generates a bigger, richer space of translation candidates
◮ Fewer Search Errors: 19% in Arabic→English, 48% in Chinese→English
◮ Improved parameter optimization
◮ Larger gains in subsequent rescoring steps
◮ Faster decoding times (lattice vs deep n-best)
◮ Simple implementation, Google OpenFst toolkit ⁹
◮ General, well-studied algorithms
◮ Capable of complex semiring operations
◮ Competitive performance ¹⁰
◮ Top-1 in Arabic→English NIST 2009 MT Evaluation (22 teams)

⁹ C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri. 2007. OpenFst: A General and Efficient Weighted Finite-State Transducer Library. CIAA.
¹⁰ de Gispert et al. 2010. Computational Linguistics 36(3).

slide-26
SLIDE 26

◮ Intro
◮ Hierarchical Translation with WFSTs
◮ Hiero Grammar Definition
  ◮ Shallow-N Grammars for Exact Search
  ◮ Rule Extraction from Alignment Posteriors
◮ Hiero with Push-Down Transducers
◮ FAUST project

slide-27
SLIDE 27

Viterbi-based Rule Extraction

Current practice:

1. Estimate IBM word alignment models over parallel corpus
2. Generate a set of links = 1-best Viterbi estimate of the alignments ¹¹
3. Extract hierarchical rules X→⟨γ,α,∼⟩ , where γ, α ∈ {X ∪ T}+
4. p(α|γ) is set by relative frequency

◮ Extraction from a set of links:

  • Alignment Constraint: consistency between links in the bilingual phrase
  • Each extracted rule is assigned a count of 1
  • Alignment models are used no further

¹¹ Exceptions: Venugopal et al. 2008 use n-best lists of alignments. Liu et al. 2009 use weighted alignment matrices created from n-best lists of alignments.
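Step 4 can be sketched in a few lines: counts of extracted rules are accumulated per source side and normalized. The function name and the example rule occurrences are illustrative assumptions; under Viterbi extraction every count is 1, but the same code accepts fractional counts.

```python
from collections import Counter

def relative_frequency(extracted):
    """Estimate p(target | source) = c(source, target) / c(source).

    extracted is an iterable of (source_side, target_side, count).
    """
    joint = Counter()
    marginal = Counter()
    for src, tgt, count in extracted:
        joint[(src, tgt)] += count
        marginal[src] += count
    return {(s, t): c / marginal[s] for (s, t), c in joint.items()}

# Hypothetical extraction events, each with a count of 1.
events = [
    ("s6", "would return", 1),
    ("s6", "he would return", 1),
    ("s6", "he would return", 1),
]
probs = relative_frequency(events)
```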

slide-28
SLIDE 28

Full Hierarchical Grammar

Formally it contains the following rules, where T is the set of terminals (words).

full hierarchical grammar
S→⟨X,X⟩                          glue rule
S→⟨S X,S X⟩                      glue rule
X→⟨γ,α,∼⟩ , γ, α ∈ {X ∪ T}+      hiero rules of any level

◮ Leaving aside concatenation rules, all rules have the same non-terminal X on their left-hand side
◮ This allows plenty of rules to ‘fit in each X gap’, which means that many reorderings are possible
◮ In theory, rule nesting is unlimited
◮ In practice, there are limits imposed by:
  ◮ which rules have been extracted from the parallel corpus used in training
  ◮ which words occur in the source sentence to be translated, as at least one terminal is consumed by each hierarchical rule

slide-29
SLIDE 29

Complications – Spurious Ambiguity

Spurious ambiguity: many distinct derivations have the same model feature vectors and give the same translation

R1: S→⟨X,X⟩
R2: X→⟨s2 s3,t2⟩
R3: X→⟨s1 X,X t3⟩
R4: X→⟨X s4,t1 X⟩

◮ the use of a single non-terminal X makes the grammar flexible, but redundant
◮ this can have a big impact on decoding time and memory requirements, even with efficient implementations

slide-30
SLIDE 30

Complications – Overgeneration

Overgeneration: different translations arising from the same set of rules

R1: X→⟨s1 X,A X⟩
R2: X→⟨X s3,C X⟩
R3: X→⟨s2,B⟩

A strong target language model is relied upon to discard unsuitable hypotheses.
... but do we need this flexible movement for all language pairs?
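Both complications can be reproduced by enumerating derivations for the two example rule sets (the one on the Spurious Ambiguity slide and the one on this slide). The sketch below is a tiny exhaustive parser for rules with at most one X gap, written only for these examples; the glue rule is left out, and it assumes every hierarchical rule consumes at least one terminal so the recursion terminates.

```python
def derivations(words, rules):
    """Return [(derivation, translation)] for an X spanning `words`.

    rules are (name, source, target) tuples; 'X' marks at most one gap.
    """
    results = []
    for name, src, tgt in rules:
        if "X" not in src:
            if words == src:
                results.append((name, " ".join(tgt)))
            continue
        k = src.index("X")
        prefix, suffix = src[:k], src[k + 1:]
        if len(words) <= len(prefix) + len(suffix):
            continue  # no room left for the X gap
        end = len(words) - len(suffix)
        if words[:len(prefix)] != prefix or words[end:] != suffix:
            continue
        for sub, sub_text in derivations(words[len(prefix):end], rules):
            text = " ".join(sub_text if t == "X" else t for t in tgt)
            results.append((f"{name}({sub})", text))
    return results

# Rule set from the Spurious Ambiguity slide (glue rule omitted).
SPURIOUS = [
    ("R2", ("s2", "s3"), ("t2",)),
    ("R3", ("s1", "X"), ("X", "t3")),
    ("R4", ("X", "s4"), ("t1", "X")),
]
# Rule set from the Overgeneration slide.
OVERGEN = [
    ("R1", ("s1", "X"), ("A", "X")),
    ("R2", ("X", "s3"), ("C", "X")),
    ("R3", ("s2",), ("B",)),
]

spurious = derivations(("s1", "s2", "s3", "s4"), SPURIOUS)
overgen = derivations(("s1", "s2", "s3"), OVERGEN)
```

For the first rule set the two nesting orders give the same string, t1 t2 t3; for the second, the same three rules give both A C B and C A B.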

slide-31
SLIDE 31

Shallow Hierarchical Grammars ¹²

Formally we can define the following grammar, where T is the set of terminals.

shallow-1 grammar
S→⟨X,X⟩                            glue rule
S→⟨S X,S X⟩                        glue rule
X→⟨γ,α,∼⟩ , γ, α ∈ {{V} ∪ T}+      hiero rules level 1
V→⟨γp,αp⟩ , γp, αp ∈ T+            regular phrases

◮ For Arabic-to-English and Spanish-to-English, shallow-1 grammar performs similarly to full hiero, but ∼20× faster
◮ Constrained search space, but can be built exactly and quickly: no pruning required
◮ If full hiero could be searched without errors, we would expect minor translation quality improvements

¹² A. de Gispert, G. Iglesias, G. Blackwood, E.R. Banga and W. Byrne. 2010. Hierarchical Phrase-Based Translation with Weighted Finite State Transducers and Shallow-N Grammars. Computational Linguistics 36(3).

slide-32
SLIDE 32

Shallow Hierarchical Grammars (2)

Example:

R1: S→⟨X,X⟩
R2: S→⟨S X,S X⟩
R3: X→⟨X s3,t5 X⟩
R4: X→⟨X s4,t3 X⟩
R5: X→⟨s1 s2,t1 t2⟩
R6: X→⟨s4,t7⟩

◮ Tree on the left uses rule nesting twice, so it is not possible under shallow-1 grammar

slide-33
SLIDE 33

Shallow Hierarchical Grammars (3)

Formally we can control the level of nesting we want with the following grammars, where T is the set of terminals.

shallow-N grammar
S→⟨X^N,X^N⟩                                glue rule
S→⟨S X^N,S X^N⟩                            glue rule
X^n→⟨γ,α,∼⟩ , γ, α ∈ {{X^(n−1)} ∪ T}+      hiero rules levels n = 1, . . . , N
    with the requirement that α and γ contain at least one X^(n−1)
X^0→⟨γ,α⟩ , γ, α ∈ T+                      regular phrases

◮ In Arabic-to-English, shallow-2 grammar does not provide improvements
◮ In Chinese-to-English, shallow-3 grammar is better than shallow-1 or shallow-2
◮ Note: for n=1, this is equivalent to previous slide where X^1 ≡ X and X^0 ≡ V
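The nesting limit can be illustrated by measuring how deeply hierarchical rules are stacked in a derivation tree. This is a simplified sketch: it only checks an upper bound on nesting depth and does not model the requirement that each level-n rule contains at least one X^(n−1). The tree encoding and the rule names Ra, Rb, Rc are hypothetical.

```python
def nesting_depth(derivation):
    """Depth of hierarchical nesting in a derivation tree.

    A derivation is (rule_name, [child derivations]). A regular phrase has
    no children and depth 0; each hierarchical rule adds one level.
    """
    _, children = derivation
    if not children:
        return 0
    return 1 + max(nesting_depth(child) for child in children)

def within_shallow_limit(derivation, n):
    """True if no hierarchical rule is nested more than n levels deep."""
    return nesting_depth(derivation) <= n

# Hypothetical derivation: hierarchical rule Ra applied over hierarchical
# rule Rb, which in turn covers the regular phrase Rc.
nested_twice = ("Ra", [("Rb", [("Rc", [])])])
nested_once = ("Rb", [("Rc", [])])
```

A tree that nests twice, like `nested_twice`, falls outside shallow-1 but inside shallow-2.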

slide-34
SLIDE 34

Can we extract more useful rules?

◮ Better exploitation of alignment models

  • Do not use a set of links to guide extraction → alignment models directly
  • Use posterior probabilities over parallel data
  • Assign counts to rules according to their quality in the context of each parallel sentence, as found by alignment models
  • Build larger rule sets with better translation probability estimates

slide-35
SLIDE 35

Rule Extraction from Alignment Posteriors ¹³

◮ Procedure for regular phrases $X \rightarrow \langle f_{j_1}^{j_2}, e_{i_1}^{i_2} \rangle$

for each $f_{j_1}^{j_2}$, do:
    for each $e_{i_1}^{i_2}$ that meets Alignment Constraints:
        score $(f_{j_1}^{j_2}, e_{i_1}^{i_2})$ with Ranking Function $f_R$
    apply Selection Criteria to ranked candidates
    for each surviving $e_{i_1}^{i_2}$, do:
        extract $(f_{j_1}^{j_2}, e_{i_1}^{i_2})$ with count = Counting Function $f_C$

¹³ A. de Gispert, J. Pino and W. Byrne. 2010. Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities. EMNLP.
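The procedure can be written as a loop with pluggable functions. This is a sketch of the control flow only: the alignment constraints, ranking function, selection criteria and counting function are passed in by the caller, and the toy versions used below (a score table, a threshold constraint, top-1 selection, unit counts) are invented for illustration.

```python
def extract_phrases(source_spans, target_spans, meets_constraints,
                    rank, select, count):
    """Posterior-based phrase extraction loop.

    rank plays the role of f_R and count the role of f_C; all four
    callables are supplied by the caller.
    """
    rules = []
    for f in source_spans:
        # Score every target span that satisfies the alignment constraints.
        candidates = [(rank(f, e), e) for e in target_spans
                      if meets_constraints(f, e)]
        candidates.sort(key=lambda c: c[0], reverse=True)
        # Keep the survivors of the selection criteria, each with a count.
        for _, e in select(candidates):
            rules.append((f, e, count(f, e)))
    return rules

# Toy setup: spans are (start, end) pairs, scores are made up.
SCORES = {
    ((0, 0), (0, 0)): 0.7,
    ((0, 0), (0, 1)): 0.2,
    ((0, 0), (1, 1)): 0.05,
}
extracted = extract_phrases(
    source_spans=[(0, 0)],
    target_spans=[(0, 0), (0, 1), (1, 1)],
    meets_constraints=lambda f, e: SCORES[(f, e)] > 0.1,
    rank=lambda f, e: SCORES[(f, e)],
    select=lambda ranked: ranked[:1],
    count=lambda f, e: 1,
)
```

Swapping `count` for a function that returns a posterior gives fractional counts without touching the loop.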

slide-36
SLIDE 36

Word-to-word Alignment Posterior Probabilities WP

  • Word-to-word alignment (link) posterior probabilities $p(l_{ji} | f_1^J, e_1^I)$
  • Efficiently computed for Model 1, Model 2 and HMM ¹⁴
  • To apply this to phrase pairs $(f_{j_1}^{j_2}, e_{i_1}^{i_2})$:

$$w(f_{j_1}^{j_2}, e_{i_1}^{i_2}) = \prod_{j=j_1}^{j_2} \sum_{i=i_1}^{i_2} \frac{p(l_{ji} | f_1^J, e_1^I)}{i_2 - i_1 + 1}$$

◮ Comments

  • Allows ranking target phrases ⇒ we use $f_R = w$
  • For a given source phrase, this is not a proper conditional probability distribution over all target phrases
  • Distribution too sharp, over-emphasises short phrases ⇒ we use $f_C = 1$

¹⁴ Brown et al. 1993; Venugopal et al., 2003; Deng and Byrne, 2008
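A sketch of the phrase-pair score, assuming it is read as a product over source positions of the sum of link posteriors over target positions, normalized by the target phrase length. The operator symbols are damaged in the extracted slide, so this reading is an assumption; the posterior matrix is invented and indices are 0-based.

```python
def phrase_pair_score(link_post, j1, j2, i1, i2):
    """Score a phrase pair from word-to-word link posteriors.

    link_post[j][i] is the posterior of a link between source word j and
    target word i; spans are inclusive.
    """
    width = i2 - i1 + 1
    score = 1.0
    for j in range(j1, j2 + 1):
        score *= sum(link_post[j][i] for i in range(i1, i2 + 1)) / width
    return score

# Hypothetical link posteriors for a 2-word source and 2-word target.
LINK_POST = [
    [0.9, 0.1],
    [0.2, 0.8],
]
single = phrase_pair_score(LINK_POST, 0, 0, 0, 0)   # s1 with t1
whole = phrase_pair_score(LINK_POST, 0, 1, 0, 1)    # s1 s2 with t1 t2
```

Under this reading the longer pair scores lower even though it covers all the posterior mass, which matches the comment that the distribution over-emphasises short phrases.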

slide-37
SLIDE 37


Phrase-to-phrase Alignment Posterior Probabilities PP

  • Alignment probability distributions over Phrase Alignments
  • Forward algorithm: Marginalise over a set of word alignments A
  • Posterior probability of A given a sentence pair:

$$p(A | e_1^I, f_1^J) = \frac{\sum_{a_1^J \in A} p(f_1^J, a_1^J | e_1^I)}{\sum_{a_1^J} p(f_1^J, a_1^J | e_1^I)}$$

  • Computed in terms of posterior link probabilities (Model 1 and 2) or using the forward algorithm (HMM) ¹⁵

◮ Comments

  • Probability distribution over the alignments of translation candidates
  • Useful for ranking and scoring extracted rules ⇒ we use $f_R = f_C = p$
  • Fractional count to each extracted rule
  • Allows finer estimation of phrase translation probabilities

¹⁵ Deng, 2005

slide-38
SLIDE 38


Hierarchical Translation Grammar Definition, G0

◮ Experimented with increasingly complex hierarchical grammars
◮ Each grammar includes more rule types
◮ G0 : monotonic phrase-based translation grammar

G0:
S→⟨X,X⟩
S→⟨S X,S X⟩
X→⟨w,w⟩

⊲ includes all regular phrases and two glue rules to allow concatenation

slide-39
SLIDE 39

Hierarchical Translation Grammar Definition, G1

◮ G1 : adds phrase swap rules

G1:
S→⟨X,X⟩
S→⟨S X,S X⟩
X→⟨w,w⟩
X→⟨w X,X w⟩
X→⟨X w,w X⟩

⊲ incorporates reordering capabilities
⊲ nested reordering is possible (nonterminal X)
⊲ must be consecutive ⇒ swap after a swap
⊲ otherwise concatenation with ‘glue’

slide-40
SLIDE 40

Hierarchical Translation Grammar Definition, G2

◮ G2 : adds monotonic concatenation rules

G2:
S→⟨X,X⟩
S→⟨S X,S X⟩
X→⟨w,w⟩
X→⟨w X,X w⟩
X→⟨X w,w X⟩
X→⟨w X,w X⟩

⊲ the ‘glue’ rule already allows rule concatenation but only for S category
  ⇒ rule concatenation after reordering
⊲ the new rule type allows
  ⇒ concatenation before reordering
  ⇒ nested reordering without consecutive swaps

slide-41
SLIDE 41

Hierarchical Translation Grammar Definition, G3

◮ G3 : adds disjoint phrases

G3:
S→⟨X,X⟩
S→⟨S X,S X⟩
X→⟨w,w⟩
X→⟨w X,X w⟩
X→⟨X w,w X⟩
X→⟨w X,w X⟩
X→⟨w X w,w X w⟩

⊲ these rules can encode a monotonic or reordered relationship between terminals
⊲ depends on alignments in the parallel corpus
⊲ terminal sequences w may be new

slide-42
SLIDE 42

Example

[Figure: three derivation trees over s1 s2 s3 s4, one per grammar. G1: rules R1, R2, R3, R4, output t1 t2 t3. G2: rules R1, R2, R4, R5, output t1 t7 t2. G3: rules R1, R2, R6, output t5 t2 t6]

Rule types shown:
X→⟨w X,X w⟩
X→⟨w X,w X⟩
X→⟨w X w,w X w⟩
X→⟨X w,w X⟩

slide-43
SLIDE 43

Measuring Expressive Power. Zh→En

◮ Measure expressive power of grammars G0, G1, G2 and G3
◮ Decode in alignment mode:

⊲ Replace the LM by the reference and check if it is found
⊲ 10k sentences of the parallel corpus
⊲ rules are extracted only from the sentence to be aligned
⊲ only reachability is important

◮ Contrasting extraction methods:

model               Viterbi    Word Posteriors
source-to-target    V-st       WP-st
target-to-source    V-ts       WP-ts
union               V-union    –
grow-diag-final     V-gdf      –
merge (st + ts)     V-merge    WP-merge


slide-44
SLIDE 44


Measuring Expressive Power. Zh→En

[Figure: percentage of aligned parallel sentences (y-axis 30–80) for each extraction method (V-st, V-ts, V-union, V-gdf, V-merge, WP-st, WP-ts, WP-merge), one series per grammar G0, G1, G2 and G3]

⊲ percentage increases when switching from G0 to G1, G2 and G3
⊲ posterior-based extraction outperforms Viterbi for nearly all grammars
⊲ highest percentages obtained with WP-merge, approx. 80% for G3


slide-45
SLIDE 45


Number of Extracted Rules

◮ Measure performance of G1, G2 and G3
◮ Baseline: standard hierarchical phrase-based grammar GH with rules of up to two nonterminals¹⁶
◮ Contrasting extraction methods:

V-union: Viterbi union
WP-st: Word posteriors source-to-target
PP-st: Phrase posteriors source-to-target

[Figure: number of extracted rules (in millions; y-axis 1–6) for grammars GH, G1, G2 and G3 under V-union, WP-st and PP-st]

¹⁶ No monotonic concatenation rules included (Iglesias et al. 2009)

slide-46
SLIDE 46


Zh→En Translation Performance

◮ Measure performance of G1, G2 and G3
◮ Baseline: standard hierarchical phrase-based grammar GH with rules of up to two nonterminals¹⁶
◮ Contrasting extraction methods:

V-union: Viterbi union
WP-st: Word posteriors source-to-target
PP-st: Phrase posteriors source-to-target

[Figure: BLEU score on test-nw1 (y-axis 34–37) for grammars GH, G1, G2 and G3 under V-union, WP-st and PP-st]

¹⁶ No monotonic concatenation rules included (Iglesias et al. 2009)

slide-47
SLIDE 47


Pruning in Search

◮ Likelihood-based search pruning if: # states in CYK cell lattice > 10,000
◮ Contrasting extraction methods:

V-union: Viterbi union
WP-st: Word posteriors source-to-target
PP-st: Phrase posteriors source-to-target

[Figure: search pruning (instances/word; y-axis 1–5) for grammars GH, G1, G2 and G3 under V-union, WP-st and PP-st]


slide-48
SLIDE 48


Reducing Grammar Redundancy

◮ Monotonic concatenation rules in Grammar G2

allow concatenation before reordering
× also allow concatenation after reordering, but 'glue' already does this
× single nonterminal X causes this redundancy

R0: S→X,X
R1: S→S X,S X
R2: X→s1,t1
R3: X→s2,t2
R4: X→s1 X,t1 X

→ Two derivations for s1,s2:
R2,R0,R3,R1
R3,R4,R0

◮ Introduce an additional nonterminal M for monotonic concatenation rules (+0.2 BLEU, -33% time)

  • R4 substituted by:

R4a: M→s1 X,t1 X
R4b: M→s1 M,t1 M

→ Single derivation for s1,s2: R2,R0,R3,R1
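The derivation counts on this slide can be checked mechanically on the source side of the rules. The sketch below counts derivations of a word sequence under a small grammar without epsilon rules. The function name `count_derivations` and the encoding of rules as (left-hand side, right-hand-side tuple) pairs are assumptions for illustration. It only uses the rules listed on the slide, so M is never reached from S here.

```python
from functools import lru_cache

def count_derivations(rules, start, words):
    """Count derivations of `words` from `start` (no epsilon rules, acyclic unary rules)."""
    words = tuple(words)
    nonterminals = {lhs for lhs, _ in rules}

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        # Sum over every rule that rewrites `symbol` and can cover words[i:j].
        return sum(match(rhs, i, j) for lhs, rhs in rules if lhs == symbol)

    @lru_cache(maxsize=None)
    def match(rhs, i, j):
        if not rhs:
            return 1 if i == j else 0
        head, rest = rhs[0], rhs[1:]
        if head not in nonterminals:
            # Terminal: must match the next word exactly.
            if i < j and words[i] == head:
                return match(rest, i + 1, j)
            return 0
        total = 0
        # Nonterminal: covers at least one word and leaves one word per remaining symbol.
        for k in range(i + 1, j - len(rest) + 1):
            left = count(head, i, k)
            if left:
                total += left * match(rest, k, j)
        return total

    return count(start, 0, len(words))

# Source sides of R0..R4 from the slide.
g2_rules = [
    ("S", ("X",)), ("S", ("S", "X")),
    ("X", ("s1",)), ("X", ("s2",)), ("X", ("s1", "X")),
]
# Same grammar with R4 replaced by R4a and R4b.
m_rules = [
    ("S", ("X",)), ("S", ("S", "X")),
    ("X", ("s1",)), ("X", ("s2",)),
    ("M", ("s1", "X")), ("M", ("s1", "M")),
]

with_x = count_derivations(g2_rules, "S", ["s1", "s2"])
with_m = count_derivations(m_rules, "S", ["s1", "s2"])
```

Under these assumptions `with_x` is 2 and `with_m` is 1, matching the two derivations and the single derivation listed on the slide.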


slide-49
SLIDE 49


Reducing Grammar Redundancy (2)

◮ Spurious ambiguity caused by phrase swap rules (+0.3 BLEU)

R1: S→{X, R, L},{X, R, L}
R2: X→s2 s3,t2
R3: R→s1 {X, R, L},{X, R, L} t3
R4: L→{X, L} s4,t1 {X, L}

◮ this is really tough when many different rule types are included!

⇒ grammar induction remains a hot topic


slide-50
SLIDE 50



slide-51
SLIDE 51


Hierarchical Phrase-Based Representations

The representation of the space of translation hypotheses T determines the form and complexity of the LM intersection and shortest-path search algorithms:

1. Hypergraphs
   Chiang 2007, Huang 2008
2. (Weighted) Finite-State Transducers
   Iglesias et al. 2009, de Gispert et al. 2010
3. (Weighted) Push-Down Transducers
   Joint work with Cyril Allauzen and Michael Riley (Google NYC). Iglesias et al. 2011 (EMNLP, to appear)


slide-52
SLIDE 52


Weighted Finite-State Transducers

◮ Translation Grammar expressed as RTN

S → a b X d g
S → a c X f g
X → b c

◮ Fully expanded into FST before LM intersection

[Figure: FST obtained by expanding the RTN, with arcs labelled a, b, c, d, f, g and eps]

◮ Complexity of Translation and Language Model intersection (excl. parsing)

$O(e^{|s|^2 |G|}\,|M|)$
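As a rough illustration of what full expansion means, the sketch below substitutes every nonterminal of the example RTN by its alternatives and enumerates the resulting strings. It works on plain symbol tuples rather than weighted automata, and the function name `expand` and the dictionary encoding are assumptions for illustration. It is not the replacement operation of any particular FST toolkit.

```python
from itertools import product

def expand(rtn, symbol):
    """Return all terminal strings derivable from `symbol` (the RTN must be non-recursive)."""
    results = []
    for rhs in rtn[symbol]:
        # Each right-hand-side symbol contributes a list of alternative token tuples.
        options = [expand(rtn, s) if s in rtn else [(s,)] for s in rhs]
        for combo in product(*options):
            results.append(tuple(tok for part in combo for tok in part))
    return results

# The example grammar from the slide.
rtn = {
    "S": [("a", "b", "X", "d", "g"), ("a", "c", "X", "f", "g")],
    "X": [("b", "c")],
}
paths = expand(rtn, "S")
```

Every use of X is copied into each path that references it. This copying is why the expanded form can grow quickly with larger grammars.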


slide-53
SLIDE 53


Weighted Push-Down Transducers

Push-Down Transducers augment finite automata with the use of a stack

◮ Accept a language of “balanced” strings over a finite number of parentheses (push/pop in stack)
◮ Can be generated from a Recursive Transition Network via a Replacement algorithm
◮ Can be expanded into an FST if stack is bounded

[Figure: example PDA that accepts a*b*, with arcs labelled a, b, ε and parentheses]
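The "balanced" condition can be sketched as a stack check over the labels along one path: an opening parenthesis pushes, a closing parenthesis must pop its partner, and the stack must be empty at the end. The function name `is_balanced_path` and the label encoding are assumptions for illustration, not an API of any automata library.

```python
def is_balanced_path(labels, pairs):
    """Check that the parentheses on a path are balanced; other labels are ordinary symbols."""
    closers = {close: open_ for open_, close in pairs.items()}
    stack = []
    for label in labels:
        if label in pairs:
            stack.append(label)      # push on an opening parenthesis
        elif label in closers:
            # pop on a closing parenthesis; it must match the most recent opener
            if not stack or stack.pop() != closers[label]:
                return False
    return not stack                 # accept only if every opener was closed

pairs = {"(": ")", "[": "]"}
ok = is_balanced_path(["a", "(", "b", "c", ")", "d"], pairs)
crossed = is_balanced_path(["a", "(", "b", "c", "]", "d"], pairs)
unclosed = is_balanced_path(["a", "(", "b"], pairs)
```

Only the first path would be accepted. The other two correspond to paths that exist in the underlying graph but do not respect the stack discipline.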


slide-54
SLIDE 54


Weighted Push-Down Transducers (2)

◮ Translation Grammar expressed as PDT

[Figure: PDT for the translation grammar, with arcs labelled a, b, c, d, f, g and the parenthesis pairs ( ) and [ ]]

◮ PDT Intersection with LM acceptor returns another PDT (prior to FST expansion)
◮ Complexity of Translation and Language Model intersection (excl. parsing)

$O(|s|^3 |G|\,|M|^2)$

◮ Quadratic in the size of the LM (Allauzen, Riley)

⇒ Appropriate for compact language models M and larger grammars G


slide-55
SLIDE 55


Decoding with Entropy-Pruned LMs

◮ Given translation grammar G and language model M1

1. Entropy-based pruning of M1 under relative perplexity threshold θ to create the model Mθ
2. Fast translation under Mθ
3. Likelihood-based pruning of output lattice, beam width β
4. Rescore lattice with original language model M1 and larger 5-gram model M2
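Steps 2 to 4 can be sketched on a flat list of hypotheses instead of a lattice. This is a deliberately simplified stand-in: all costs are toy negative log-probabilities, a single `full_lm_cost` function plays the role of rescoring with M1 and M2, and every name in the block is an assumption for illustration.

```python
def two_pass_decode(hypotheses, pruned_lm_cost, full_lm_cost, beam):
    """First pass under a pruned LM, beam pruning, then rescoring with the full LM."""
    # Pass 1: translation cost plus cost under the small (entropy-pruned) LM.
    first_pass = {h: tm + pruned_lm_cost(h) for h, tm in hypotheses.items()}
    best_cost = min(first_pass.values())
    # Likelihood-based pruning: keep hypotheses within `beam` of the best cost.
    kept = [h for h, cost in first_pass.items() if cost <= best_cost + beam]
    # Pass 2: rescore the survivors with the full LM.
    rescored = {h: hypotheses[h] + full_lm_cost(h) for h in kept}
    return min(rescored, key=rescored.get), kept

# Toy translation-model costs and LM costs (all values are made up).
hypotheses = {"the cat sat": 2.0, "cat the sat": 1.5, "sat cat the": 1.0}
pruned_costs = {"the cat sat": 1.0, "cat the sat": 1.2, "sat cat the": 4.0}
full_costs = {"the cat sat": 0.5, "cat the sat": 2.5, "sat cat the": 5.0}

best, survivors = two_pass_decode(hypotheses, pruned_costs.get, full_costs.get, beam=1.0)
```

In this toy setting the first pass prefers "cat the sat", but "the cat sat" survives the beam and wins after rescoring. A hypothesis pruned in step 3 cannot be recovered by the rescoring step.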


slide-56
SLIDE 56


Zh→En Translation with Compact Grammars

◮ Compact Grammar G1

◮ Only rules with translation probability > 0.01 are used
◮ Entire lattice can be expanded and intersected with M1
◮ FST and PDT representations equally good

Number of N-grams:   200M   20M    4M     1M
Time (sec/w):        0.68   0.38   0.28   0.20

[Figure: BLEU as a function of the entropy threshold θ]

◮ Full performance recovered after rescoring with LM
◮ Tractable beam width β required
◮ Decoding speed-up


slide-57
SLIDE 57


Zh→En Translation with Large Grammars

◮ Large Grammar G2

◮ All observed rules are considered (+alternatives per rule)
◮ FST representation cannot decode (pruning in search)
◮ PDT achieves exact decoding under smaller Mθ

         HiFST                                      HiPDT
θ        Success   Expand Fails   Intersect Fails   Success   Intersect Fails   Expand Fails
10^-9    12%       51%            37%               40%       8%                52%
10^-8    16%       53%            31%               76%       1%                23%
10^-7    18%       53%            29%               99.8%     0%                0.2%

◮ Improved results with PDT (+0.5 BLEU)


slide-58
SLIDE 58


Representation with Push-Down Transducers

◮ Compactness: larger grammars can be explored
◮ Expensive intersection with LM: entropy-pruned models
◮ BLEU improvements due to fewer search errors
◮ Faster decoding times, lower memory requirements

Future research lines

◮ Hybrid FST/PDT approach for improved robustness (exact decoding under M when feasible)

◮ Improved shortest path algorithms for PDTs


slide-59
SLIDE 59



slide-60
SLIDE 60


FAUST - Feedback Analysis for User adaptive Statistical Translation

EU-FP7 funded project (2010-2012), led by Cambridge
http://www.faust-fp7.eu/faust

User Feedback Analysis and Adaptation

◮ Enhance Reverso.net website with infrastructure to study user feedback
◮ Deploy novel feedback collection mechanisms to increase feedback quantity and quality
◮ User feedback modeling
◮ Develop mechanisms to incorporate the useful feedback

Fluent SMT output

◮ Integrate NLG and MT to improve translation fluency


slide-61
SLIDE 61


Integrating Fluency-driven Constraints¹⁶

◮ General framework: LMBR search over space of fluent hypotheses $\mathcal{H}$:

$$\hat{E}_{\mathrm{MBR}} = \operatorname*{argmin}_{E' \in \mathcal{H}} \sum_{E \in \mathcal{E}} L(E, E')\, P(E \mid F)$$

◮ We distinguish between the evidence space and hypothesis space:

1. MBR evidence space $\mathcal{E}$ produced by baseline SMT system
2. MBR search for translations in collection of fluent sentences $\mathcal{H}$

◮ Choose translation closest to top baseline SMT hypotheses
◮ As determined by the loss function $L(E, E')$ (e.g. 1 − BLEU)
◮ Experiment: Hypothesis space is subset of evidence space, containing hypotheses only made of observed high-order n-grams

→ same BLEU, but seen as more fluent

¹⁶ G. Blackwood, A. de Gispert and W. Byrne, 2010. Fluency Constraints for Minimum Bayes-Risk Decoding of Statistical Machine Translation Lattices. COLING.
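The MBR decision rule with separate evidence and hypothesis spaces can be sketched on small explicit lists. The loss used here is one minus word-set overlap, chosen only as a simple stand-in for 1 − BLEU. The sentences, posteriors and function names are assumptions for illustration.

```python
def mbr_decode(hypothesis_space, evidence, loss):
    """Pick the hypothesis with the lowest expected loss under the evidence posteriors."""
    def expected_loss(candidate):
        # Sum over the evidence space of L(E, E') * P(E|F).
        return sum(p * loss(e, candidate) for e, p in evidence.items())
    return min(hypothesis_space, key=expected_loss)

def overlap_loss(reference, candidate):
    """Toy loss: 1 minus the Jaccard overlap of word sets (a stand-in for 1 - BLEU)."""
    ref, cand = set(reference.split()), set(candidate.split())
    return 1.0 - len(ref & cand) / max(len(ref | cand), 1)

# Evidence space: baseline outputs with made-up posteriors P(E|F).
evidence = {"he go home": 0.5, "he goes home": 0.3, "he goes to home": 0.2}
# Hypothesis space: the subset assumed to be fluent.
fluent = ["he goes home", "he goes to home"]

best = mbr_decode(fluent, evidence, overlap_loss)
```

Here the most probable baseline output is excluded from the hypothesis space, and the rule selects the fluent candidate closest to the evidence, "he goes home".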


slide-62
SLIDE 62


User Interaction with Research Systems


slide-63
SLIDE 63

Thanks! Questions and comments welcome.

Department of Engineering University of Cambridge