SLIDE 1

Out of GIZA: Efficient Word Alignment Models for SMT

Yanjun Ma

National Centre for Language Technology, School of Computing, Dublin City University
NCLT Seminar Series, March 4, 2009

Y. Ma (DCU) · Out of Giza · 1 / 28

SLIDE 2

Outline

1. Contexts
2. HMM and IBM Model 4
3. Improved HMM Alignment Models
4. Simultaneous Word Alignment and Phrase Extraction


SLIDE 4

Word Alignment and SMT

All SMT systems rely on word alignment:
- Word-based SMT
- Phrase-based SMT
- Hiero (hierarchical SMT)
- Syntax-based SMT, e.g. tree-to-string, string-to-tree, tree-to-tree

The GIZA++ implementation of IBM Model 4 is dominant; the "Viterbi" alignment from IBM Model 4 is what is typically used.


SLIDE 8

Efficient Model: HMM [Vogel et al., 1996]

HMM emission (translation) model: p(t_j | s_{a_j})

SLIDE 9

HMM transition (alignment) model: p(a_j | a_{j-1}), which depends only on the jump width a_j - a_{j-1}

p(t, a | s) = ∏_{j=1}^{J} p(a_j | a_{j-1}) · p(t_j | s_{a_j})    (1)
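Equation (1) factorises the joint probability into one transition term and one emission term per target position. The factorisation can be sketched as below; the sentence pair, the alignment and all probability values are made-up toy numbers, not trained parameters:

```python
def hmm_alignment_likelihood(src, tgt, alignment, trans, emit):
    """p(t, a | s) = prod_j p(a_j | a_{j-1}) * p(t_j | s_{a_j}).
    trans[d] is the jump-width probability p(a_j - a_{j-1} = d);
    emit[(t, s)] is the translation probability p(t | s).
    The initial alignment a_0 is taken to be position 0."""
    p, prev = 1.0, 0
    for j, i in enumerate(alignment):
        p *= trans.get(i - prev, 0.0) * emit.get((tgt[j], src[i]), 0.0)
        prev = i
    return p

# Hypothetical toy example: two-word sentences, monotone alignment.
src = ["la", "maison"]
tgt = ["the", "house"]
trans = {0: 0.6, 1: 0.3, -1: 0.1}   # jump-width distribution
emit = {("the", "la"): 0.8, ("house", "maison"): 0.7}
print(hmm_alignment_likelihood(src, tgt, [0, 1], trans, emit))   # 0.6*0.8 * 0.3*0.7 ≈ 0.1008
```

Missing table entries default to probability 0, so any alignment using an unseen word pair scores zero in this toy setup.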

SLIDE 11

Deficient Models: IBM Models 3 and 4

Model 3: zero-order distortion model

SLIDE 12

Model 4: first-order distortion model

SLIDE 13

Derivation

P(t_1^J, a_1^J | s_1^I)
  = P(t_1^J, B_0^I | s_1^I)    (2)
  = P(B_0 | B_1^I) × ∏_{i=1}^{I} P(B_i | B_1^{i-1}, s_1^I) × P(t_1^J | B_0^I, s_1^I)    (3)
  = P(B_0 | B_1^I) × ∏_{i=1}^{I} p(B_i | B_{i-1}, s_i)    [fertility-distortion]    (4)
    × ∏_{i=0}^{I} ∏_{j∈B_i} p(t_j | s_i)    [translation]    (5)

SLIDE 17

Model 3 fertility and distortion:

p(B_i | B_{i-1}, s_i) = p(φ_i | s_i) [fertility] · φ_i! ∏_{j∈B_i} p(j | i, J) [distortion]    (6)

Model 4 fertility and distortion:

p(B_i | B_{i-1}, s_i) = p(φ_i | s_i) [fertility] · p_{=1}(B_{i1} − B_{ρ(i)} | ...) [first word] · ∏_{k=2}^{φ_i} p_{>1}(B_{ik} − B_{i,k−1} | ...) [remaining words]    (7)

SLIDE 19

Decoding

HMM:
- Viterbi decoding: â = argmax_a p(a | s, t)
- Posterior decoding: align (j, i) iff p(a_j = i | s, t) ≥ δ

IBM Models 3 and 4:
- No efficient algorithm available
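Posterior decoding amounts to thresholding a matrix of link posteriors. A minimal sketch, assuming the posteriors have already been computed (e.g. by forward-backward); the matrix values are hypothetical:

```python
def posterior_decode(posteriors, delta=0.5):
    """Posterior decoding: keep the link (j, i) iff p(a_j = i | s, t) >= delta.
    posteriors[j][i] is the posterior probability that target word j
    aligns to source word i, assumed precomputed."""
    return {(j, i)
            for j, row in enumerate(posteriors)
            for i, p in enumerate(row)
            if p >= delta}

# Hypothetical 2x2 posterior matrix for a two-word sentence pair.
post = [[0.9, 0.1],
        [0.3, 0.7]]
print(sorted(posterior_decode(post)))   # [(0, 0), (1, 1)]
```

Lowering δ adds lower-confidence links, trading precision for recall; unlike Viterbi decoding, a target word can end up with zero or several links.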

SLIDE 21

Advantages of HMM models

- Efficient parameter estimation: the forward-backward (Baum-Welch) algorithm
- The resulting posterior probabilities are useful

Figure: Eric B. Baum (son of Leonard E. Baum, who was the inventor of the algorithm) and Lloyd R. Welch
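The E-step of Baum-Welch reduces to the forward-backward recursions, which also yield the state posteriors mentioned above. A minimal, self-contained sketch for a generic discrete HMM (all tables are toy values, not trained alignment parameters):

```python
def forward_backward(init, trans, emit_seq):
    """State posteriors gamma[t][i] for a discrete HMM.
    init[i]: initial state probabilities; trans[i][j]: transition probabilities;
    emit_seq[t][i]: p(observation at time t | state i), precomputed."""
    T, N = len(emit_seq), len(init)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[1.0] * N for _ in range(T)]
    for i in range(N):                       # forward pass
        alpha[0][i] = init[i] * emit_seq[0][i]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t-1][i] * trans[i][j] for i in range(N)) * emit_seq[t][j]
    for t in range(T - 2, -1, -1):           # backward pass
        for i in range(N):
            beta[t][i] = sum(trans[i][j] * emit_seq[t+1][j] * beta[t+1][j] for j in range(N))
    Z = sum(alpha[T-1])                      # total observation likelihood
    return [[alpha[t][i] * beta[t][i] / Z for i in range(N)] for t in range(T)]

# Toy two-state HMM over a three-step observation sequence.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit_seq = [[0.8, 0.2], [0.8, 0.2], [0.2, 0.8]]
gamma = forward_backward(init, trans, emit_seq)
```

Both passes cost O(T·N²), which is what makes HMM training tractable where the fertility models of IBM Models 3 and 4 are not.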

SLIDE 24

Disadvantages of standard HMM models

- The objective is maximising the likelihood
- There is no guarantee that the optimised parameters correspond to more accurate alignments
- Making the model more complicated sometimes does help, e.g. IBM Model 4


SLIDE 28

Improved HMM models

Two more sophisticated HMM models:
- Segmental HMM: a word-to-phrase alignment model
- Constrained HMM: an agreement-guided alignment model

SLIDE 29

HMM Word-to-Phrase Alignment [Deng and Byrne, 2008]

Introducing a segmentation model: a segmental HMM

SLIDE 30

P(t, a | s) = P(v_1^K, K, a_1^K, h_1^K, φ_1^K | s)    (8)
  = P(K | J, s)    [segmentation]    (9)
  × P(a_1^K, φ_1^K, h_1^K | K, J, s)    [alignment-fertility]    (10)
  × P(v_1^K | a_1^K, φ_1^K, h_1^K, K, J, s)    [translation]    (11)

SLIDE 34

HMM Word-to-Phrase Alignment

P(a_1^K, φ_1^K, h_1^K | K, J, s)
  = ∏_{k=1}^{K} P(a_k, h_k, φ_k | a_{k−1}, φ_{k−1}, h_{k−1}, K, J, s)    (12)
  = ∏_{k=1}^{K} p(a_k | a_{k−1}, h_k; I) [alignment] · d(h_k) [null alignment] · n(φ_k; s_{a_k}) [fertility]    (13)
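Equation (13) is a product of three terms per target phrase: a jump (alignment) term, a NULL-alignment term d(h_k) and a fertility term n(φ_k; s_{a_k}). A toy sketch of that product; all tables are hypothetical, and the conditioning of the jump term on h_k is dropped for brevity:

```python
def alignment_fertility_prob(aligns, nulls, ferts, p_jump, d_null, n_fert, src):
    """Eq. (13) as a product over the K target phrases:
    p(a_k | a_{k-1}) * d(h_k) * n(phi_k; s_{a_k}), with a_0 taken as 0."""
    p, prev = 1.0, 0
    for a, h, phi in zip(aligns, nulls, ferts):
        p *= p_jump.get(a - prev, 0.0) * d_null.get(h, 0.0) * n_fert.get((phi, src[a]), 0.0)
        prev = a
    return p

src = ["s0", "s1", "s2"]
p_jump = {1: 0.5, 0: 0.3, -1: 0.2}         # jump-width distribution
d_null = {0: 0.2, 1: 0.8}                   # h_k = 1: real alignment, 0: NULL
n_fert = {(1, "s1"): 0.6, (2, "s2"): 0.4}   # n(phi; source word)
print(alignment_fertility_prob([1, 2], [1, 1], [1, 2], p_jump, d_null, n_fert, src))
```

The fertility term φ_k is what lets a single source word generate a multi-word target phrase, which the plain word-to-word HMM of Eq. (1) cannot express.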

SLIDE 36

Performance of HMM Word-to-Phrase Alignment

- MTTK implementation, used by the Cambridge University Engineering Department
- Arabic–English NIST 2008: 6th out of 16, the third-best university participant (behind LIUM and ISI)
- Consistent performance on Chinese–English across corpora of different sizes
- Parallelised to handle large amounts of data (e.g. 10M sentence pairs)

SLIDE 38

Agreement-Constrained HMM Alignment [Ganchev et al., 2008]

Objective:

argmin_{q(a) ∈ Q} KL(q(a) || p_θ(a | s, t))   s.t.   E_q[f(s, t, a)] ≤ b    (14)

Figure: the forward and backward models p_θ→(a | s, t) and p_θ←(a | s, t), and the corresponding projected distributions q→(a) and q←(a)

SLIDE 40

Agreement-Constrained HMM Alignment [Ganchev et al., 2008]

Constrained E-step (within EM)

SLIDE 41

Performance of Agreement-Constrained HMM

- PostCAT implementation
- Evaluation: six language pairs, from 100,000 to 1M sentence pairs
- Outperforms IBM Model 4 (16 out of 18 cases)
- However, results get slightly worse when the training data exceeds 1M sentence pairs

SLIDE 42

Algorithm 1: Agreement-Constrained HMM Alignment

1: λ_ij ← 0, ∀i, j
2: for T iterations do
3:   θ′_t→(t_j | s_i) ← θ_t→(t_j | s_i) · e^{λ_ij}, ∀i, j
4:   θ′_t←(s_i | t_j) ← θ_t←(s_i | t_j) · e^{−λ_ij}, ∀i, j
5:   q→ ← forwardBackward(θ′_t→, θ_a→)
6:   q← ← forwardBackward(θ′_t←, θ_a←)
7: end for
8: return (q→, q←)
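The inner loop of Algorithm 1 (lines 3-6) can be sketched as below. This is only a structural sketch: forwardBackward is a stub that renormalises the reweighted translation table (real inference would run the HMM recursions), the projected-gradient update of λ between iterations is not shown on the slide and is omitted here, and all names and values are hypothetical:

```python
import math

def forward_backward(theta_t, theta_a):
    """Stub standing in for HMM posterior inference: renormalises each row of
    the reweighted translation table into a posterior-like matrix (a toy)."""
    return [[x / sum(row) for x in row] for row in theta_t]

def agreement_constrained_inference(theta_fwd, theta_bwd, theta_a_fwd,
                                    theta_a_bwd, lam, T=3):
    """Each direction's translation table is reweighted by exp(+lambda_ij)
    resp. exp(-lambda_ij) before rerunning inference (lines 3-6).
    lam stays fixed here since the lambda update step is not shown.
    Both tables are i-by-j matrices; theta_bwd[i][j] plays theta_t<-(s_i|t_j)."""
    q_fwd = q_bwd = None
    for _ in range(T):
        fwd = [[theta_fwd[i][j] * math.exp(lam[i][j])
                for j in range(len(lam[0]))] for i in range(len(lam))]
        bwd = [[theta_bwd[i][j] * math.exp(-lam[i][j])
                for j in range(len(lam[0]))] for i in range(len(lam))]
        q_fwd = forward_backward(fwd, theta_a_fwd)
        q_bwd = forward_backward(bwd, theta_a_bwd)
    return q_fwd, q_bwd

theta = [[0.7, 0.3], [0.2, 0.8]]
lam = [[0.1, -0.1], [-0.1, 0.1]]
qf, qb = agreement_constrained_inference(theta, theta, None, None, lam)
```

The opposite signs on λ in the two directions are what pushes the forward and backward posteriors towards agreeing on the same links.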


SLIDE 44

Phrase Pair Extraction

- State of the art: using the Viterbi alignment only
- Alternative: using all possible alignments consistent with a phrase pair

A(i1, i2; j1, j2) = {a = a_1^J : a_j ∈ [i1, i2] iff j ∈ [j1, j2]}    (15)
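The consistency condition in Eq. (15) is a simple biconditional per target position, which can be checked directly (the example alignment below is hypothetical):

```python
def consistent(a, i1, i2, j1, j2):
    """Membership in A(i1,i2; j1,j2) from Eq. (15): a_j falls inside the
    source span [i1,i2] exactly when j falls inside the target span [j1,j2]."""
    return all((i1 <= aj <= i2) == (j1 <= j <= j2) for j, aj in enumerate(a))

# a = (0, 1, 1): target word 0 aligns to source 0, target words 1 and 2 to source 1.
print(consistent((0, 1, 1), 1, 1, 1, 2))   # True: spans [1,1] / [1,2] are consistent
print(consistent((0, 1, 0), 1, 1, 1, 2))   # False: target word 2 aligns outside [1,1]
```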

SLIDE 46

Phrase Pair Extraction via Model-Based Posteriors

Derivation:

P(t, A(i1, i2; j1, j2) | s; θ) = Σ_{a ∈ A(i1,i2;j1,j2)} P(t, a | s; θ)    (16)

P(A(i1, i2; j1, j2) | s, t; θ) = P(t, A(i1, i2; j1, j2) | s; θ) / P(t | s; θ)    (17)

where P(t | s; θ) = Σ_a P(t, a | s; θ).
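For tiny sentences, Eqs. (16)-(17) can be evaluated by brute-force enumeration: sum the joint over the alignments consistent with the phrase pair, then divide by the sum over all alignments. A sketch under a toy HMM-style joint; the sentence pair and all probability tables are made-up values:

```python
from itertools import product

def joint(a, trans, emit, src, tgt):
    """Toy HMM-style joint P(t, a | s): jump-width transition times emission."""
    p, prev = 1.0, 0
    for j, i in enumerate(a):
        p *= trans.get(i - prev, 0.0) * emit.get((tgt[j], src[i]), 0.0)
        prev = i
    return p

def phrase_posterior(i1, i2, j1, j2, trans, emit, src, tgt):
    """Eqs. (16)-(17): mass on alignments consistent with the phrase pair,
    normalised by P(t | s), the mass over all alignments."""
    consistent = lambda a: all((i1 <= ai <= i2) == (j1 <= j <= j2)
                               for j, ai in enumerate(a))
    alignments = list(product(range(len(src)), repeat=len(tgt)))
    num = sum(joint(a, trans, emit, src, tgt) for a in alignments if consistent(a))
    den = sum(joint(a, trans, emit, src, tgt) for a in alignments)
    return num / den

# Hypothetical toy tables; the phrase pair is (s_0, t_0), spans [0,0] and [0,0].
src, tgt = ["la", "maison"], ["the", "house"]
trans = {0: 0.6, 1: 0.3, -1: 0.1}
emit = {("the", "la"): 0.8, ("house", "maison"): 0.7,
        ("the", "maison"): 0.1, ("house", "la"): 0.05}
print(round(phrase_posterior(0, 0, 0, 0, trans, emit, src, tgt), 3))   # 0.788
```

Enumeration is exponential in the target length; the point of the model-based approach is that for HMM-family models these sums can instead be computed efficiently with constrained forward-backward recursions.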

SLIDE 48

Evaluation

Significant gains when used to augment the original phrase extraction strategy

SLIDE 49

References

Deng, Y. and Byrne, W. (2008). HMM word and phrase alignment for statistical machine translation. IEEE Transactions on Audio, Speech, and Language Processing, 16(3):494–507.

Ganchev, K., Graca, J., and Taskar, B. (2008). Better alignments = better translations? In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, Columbus, OH.

Vogel, S., Ney, H., and Tillmann, C. (1996). HMM-based word alignment in statistical translation. In Proceedings of the 16th International Conference on Computational Linguistics, pages 836–841, Copenhagen, Denmark.