Decoding in SMT
Nitin Madnani, February 8, 2006 (PowerPoint presentation)


SLIDE 1

Decoding in SMT

Nitin Madnani February 8, 2006

SLIDE 2

The Decoding Problem

  • Search
  • Inputs:
    • Input string
    • A set of statistical models
    • A function to assign a score to any translation
  • Output:
    • The best-scoring translation
SLIDE 3

Mathematically ...

e = argmax_ê S(ê, f)
SLIDE 8

Mathematically ...

e = argmax_ê S(ê, f)

where ê ranges over the search space (all possible translations), argmax is the search operation, S(ê, f) is the score assigned by the models to candidate ê for input string f, and e is the "best" translation.

Examples:

  • Models = P(e), P(a,f|e); Score = P(e) · P(a,f|e)
  • Models = P(e), P(f|e), P(e|f), P(a,f|e), etc.; Score = exp(∑_n w_n m_n)
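The second scoring scheme is a log-linear combination of weighted feature scores. A minimal sketch, where the feature names, weights, and log-probability values are all invented for illustration:

```python
import math

# Hypothetical feature scores m_n (log-probabilities) for one candidate
# translation; neither the names nor the values come from the slides.
features = {"lm": -12.4, "tm_fwd": -9.1, "tm_rev": -10.3, "penalty": -6.0}
weights  = {"lm": 0.5,  "tm_fwd": 0.3,  "tm_rev": 0.15, "penalty": 0.05}

# Score = exp(sum_n w_n * m_n). Since exp is monotone, comparing the
# weighted sums directly gives the same argmax and avoids underflow.
score = math.exp(sum(weights[k] * features[k] for k in features))
```

In practice decoders compare the weighted log scores directly rather than exponentiating.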
SLIDE 9

Decoding is hard

SLIDE 16

Decoding is hard

  • Very simple example
  • Models: LM, Model 1 (1/1, i.e. one-to-one word alignment)
  • Search space: all possible orderings of e1..m
  • The ordering is picked by the LM
  • Edge weights: w(e1 → e2) = p(e2 | e1)
  • Look familiar? This is the TSP - NP-Complete!

[Diagram: input words f1 f2 f3 f4 … fm, their candidate translations e1 e2 e3 e4 … em, and one chosen ordering e1 e2 e3 e4 e5 … em]
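The word-ordering view can be made concrete with a brute-force search that scores every permutation under a toy bigram LM; the m! orderings are exactly the TSP-style blowup. All probabilities below are invented for illustration:

```python
import itertools

# Invented bigram log-probabilities: w(e1 -> e2) = log p(e2 | e1).
bigram = {("<s>", "the"): -0.5, ("the", "cat"): -1.0, ("cat", "sleeps"): -1.2,
          ("<s>", "cat"): -2.0, ("cat", "the"): -2.5, ("the", "sleeps"): -2.2,
          ("<s>", "sleeps"): -3.0, ("sleeps", "the"): -2.8, ("sleeps", "cat"): -2.9}

def lm_score(order):
    # Sum of bigram log-probs along the path <s> -> e1 -> e2 -> ... -> em.
    prev, total = "<s>", 0.0
    for w in order:
        total += bigram[(prev, w)]
        prev = w
    return total

words = ["the", "cat", "sleeps"]
# Exhaustive search over all m! orderings -- infeasible for real sentences.
best = max(itertools.permutations(words), key=lm_score)
```

For m = 3 this visits 6 orderings; for a 20-word sentence it would visit about 2.4 × 10^18, which is why exact search is hopeless.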

SLIDE 17

Problem characteristics

  • Clear-cut optimization problem
  • There is always one right answer
  • Inherently complex
  • Number of ways to order words (LM)
  • Number of ways to cover input words (TM)
  • Harder than in speech recognition (SR):
  • No left-to-right input-output correspondence
SLIDE 18

Decoding Methods

  • Stack-based Decoding
  • Most common
  • Almost all contemporary decoders are stack-based
  • Greedy Decoding
  • Faster but more error-prone
  • Optimal Decoding
  • Finds the optimal translation
  • Really Really Slow !
SLIDE 19

Stack-based Decoding

  • Originally introduced by Jelinek for speech recognition
  • Stores partial translations (hypotheses) in a stack
  • Builds new translations by extending existing hypotheses
  • Optimal translation guaranteed given unlimited stack size and search time
  • Note: "stack" does not imply LIFO; it is actually a priority queue
SLIDE 24

Stack-based Decoding

Hypothesis stack (finite size, sorted by cost):
  (1) Pop the best hypothesis
  (2) Extend it by translating every possible word
  (3) Push the resulting hypotheses
Repeat (1)-(3) until a complete hypothesis is encountered
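The pop-extend-push loop can be sketched with a priority queue. This toy version translates source words strictly left to right and uses an invented cost table, so it shows only the control flow, not a real decoder:

```python
import heapq

# Invented word-for-word translation options with costs (lower is better).
table = {"la": [("the", 0.5)], "maison": [("house", 0.8), ("home", 1.1)]}
source = ["la", "maison"]

# A hypothesis is (cost so far, source words covered, partial translation).
stack = [(0.0, 0, ())]
best = None
while stack:
    cost, covered, words = heapq.heappop(stack)       # (1) pop the cheapest
    if covered == len(source):                        # complete hypothesis
        best = (cost, list(words))
        break
    for trans, c in table[source[covered]]:           # (2) extend every way
        heapq.heappush(stack, (cost + c, covered + 1, words + (trans,)))  # (3) push
```

Because the queue is popped in cost order and this toy cost never decreases, the first complete hypothesis popped is the cheapest one.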

SLIDE 25

  • Hypothesis cost = cost of the translation so far
  • Problem: shorter hypotheses will push longer ones out of the stack
  • Solution: use translation cost + future cost
  • Future cost: what it would cost to complete a hypothesis
  • A heuristic function provides an estimate of the future cost
  • No heuristic can be perfect (no monotonicity)
  • Need to find another solution
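One common and deliberately simple future-cost heuristic sums the cheapest available translation cost of each uncovered source word; the cost table below is invented:

```python
# Invented per-word translation costs (lower is better).
table = {"la": [("the", 0.5)], "maison": [("house", 0.8), ("home", 1.1)]}

def future_cost(uncovered):
    # Optimistic estimate: assume every remaining word gets its cheapest
    # translation and ignore LM interactions -- hence no heuristic is perfect.
    return sum(min(cost for _, cost in table[w]) for w in uncovered)

# A hypothesis that has translated "la" (cost 0.5) but not "maison":
total = 0.5 + future_cost(["maison"])   # translation cost so far + future cost
```

Ranking hypotheses by this combined total lets short and long hypotheses compete on comparable footing.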

SLIDE 26

Multi-stack Decoding

  • Use multiple stacks:
  • One for each subset of the input words (2^n stacks), or
  • One for each number of words covered (n stacks)
  • Extend the top hypothesis from each stack
  • Competition is among similar hypotheses
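The n-stack variant (one stack per number of covered source words) can be sketched as below; the model is again an invented, left-to-right cost table, so only the bookkeeping is realistic:

```python
from collections import defaultdict

# Invented translation options; decoding is left-to-right for simplicity.
table = {"la": [("the", 0.5)], "maison": [("house", 0.8), ("home", 1.1)]}
source = ["la", "maison"]

stacks = defaultdict(list)          # stacks[n] holds hypotheses covering n words
stacks[0].append((0.0, ()))
for n in range(len(source)):
    # All hypotheses in stacks[n] cover exactly n words, so they compete
    # fairly with each other rather than with shorter or longer ones.
    for cost, words in stacks[n]:
        for trans, c in table[source[n]]:
            stacks[n + 1].append((cost + c, words + (trans,)))

best_cost, best_words = min(stacks[len(source)])
```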
SLIDE 27

Other Optimizations

  • Beam-based pruning
  • Relative threshold: prune h if p(h) < α · p(h_best)
  • Histogram: keep only a fixed number of hypotheses, prune the rest
  • Can accidentally prune out a good hypothesis
  • Hypothesis recombination
  • If similar(h1, h2), keep only the cheaper one
  • Risk-free
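Both pruning schemes are easy to state in code. The hypotheses and probabilities below are invented; `alpha` and the histogram size are the tunable knobs:

```python
def prune(stack, alpha=0.5, histogram=3):
    # Each entry is (probability, hypothesis); higher probability is better.
    p_best = max(p for p, _ in stack)
    # Relative threshold: drop h whenever p(h) < alpha * p(h_best).
    kept = [(p, h) for p, h in stack if p >= alpha * p_best]
    # Histogram pruning: keep only the top `histogram` hypotheses.
    kept.sort(reverse=True)
    return kept[:histogram]

stack = [(0.9, "h1"), (0.5, "h2"), (0.4, "h3"), (0.1, "h4")]
survivors = prune(stack)
```

Here the threshold is 0.5 · 0.9 = 0.45, so h3 and h4 are pruned even though one of them might have extended into the best full translation, which is exactly the risk the slide mentions.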
SLIDE 28

Greedy Decoding

  • Start with the word-for-word English gloss
  • Iterate exhaustively over all alignments one simple operation away
  • Add, substitute, change order, etc.
  • Pick the one with the highest probability
  • Commit the change
  • Repeat until no improvement is possible
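Greedy decoding is hill-climbing. The sketch below uses only one operation (swapping two words) and a toy scorer that stands in for the real model probability; both are assumptions for illustration:

```python
import itertools

TARGET = ["the", "cat", "sleeps"]

def score(words):
    # Toy stand-in for the model score: how many positions match TARGET.
    return sum(w == t for w, t in zip(words, TARGET))

def greedy_decode(gloss):
    current = list(gloss)
    while True:
        # All candidates one swap away (one of the simple operations).
        neighbours = []
        for i, j in itertools.combinations(range(len(current)), 2):
            n = current[:]
            n[i], n[j] = n[j], n[i]
            neighbours.append(n)
        best = max(neighbours, key=score)
        if score(best) <= score(current):   # no improvement -> stop
            return current
        current = best                      # commit the change

result = greedy_decode(["cat", "the", "sleeps"])   # start from the gloss
```

Each iteration examines only the local neighbourhood, which is why the method is fast but can get stuck far from the global optimum.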
SLIDE 29

Greedy Decoding

  • Pros
  • Much, much faster
  • Complexity scales only polynomially with sentence length
  • Cons
  • Searches only a very small subspace
  • Cannot find the best translation if it is far from the gloss
SLIDE 30

Optimal Decoding

  • Transform decoding problem into a TSP instance
  • Foreign words ~ Cities
  • Translations ~ Hotels in cities
  • Cost ~ Distance
  • Solve TSP using Integer Programming (IP)
  • Cast tour selection as a constrained integer program
  • Can find tours of various lengths (n-best lists)
SLIDE 31

Optimal Decoding

  • Pros
  • Fast decoder development
  • Optimal n-best lists
  • Extremely customizable
  • Cons
  • Extremely slow!
  • Hard to integrate unrelated information sources

SLIDE 32

Decoding Errors

  • Search error
  • decode(f) = e, but ∃ e′ s.t. score(e′) > score(e)
  • The right answer is in the search space, but we couldn't find it
  • Hard to prove that a decoding is sub-optimal
  • Model error
  • correct(f) ∉ search space
  • The right answer is not in the space because of imperfect models

SLIDE 33

Observations*

  • |space_greedy| << |space_stack| (hence the speed)
  • space_stack ⊂ space_optimal
  • nSE_greedy >> nSE_stack >> nSE_optimal (= 0), where nSE = number of search errors
  • t_greedy < t_stack <<< t_optimal (50 for m = 6, 500 for m = 8!)
  • nME >> 0 for all decoders, since Model 4 is deficient (nME = number of model errors)

* All decoders use Model 4 and were tested on the same set

SLIDE 34

Take Home Messages

  • Optimal decoding is possible but highly impractical
  • Optimized stack-based decoding provides a good balance
  • All modern decoders are basically the same (stack-based)
  • Differences lie in the models, the score, and the extension operations. Examples: Pharaoh, Rewrite
  • Better translations will come from improving the models (e.g. Hiero)