Decoding in SMT
Nitin Madnani, February 8, 2006 (PowerPoint presentation)


SLIDE 1

Decoding in SMT

Nitin Madnani February 8, 2006

SLIDE 2

The Decoding Problem

  • Search
  • Inputs:
    • Input string
    • A set of statistical models
    • A function to assign a score to any translation
  • Output:
    • The best-scoring translation
SLIDE 3

Mathematically ...

e = argmax_ê S(ê, f)
SLIDE 8

Mathematically ...

e = argmax_ê S(ê, f)

where ê ranges over the search space (all possible translations), argmax is the search operation, S(ê, f) is the score assigned by the models to candidate ê for input string f, and e is the "best" translation.

Examples:

  • Models = P(e), P(a,f|e); Score = P(e) · P(a,f|e)
  • Models = P(e), P(f|e), P(e|f), P(a,f|e), etc.; Score = exp(∑_n w_n m_n)
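The second scoring scheme is a log-linear combination of weighted feature scores. A minimal sketch, where the feature names, weights, and log-probability values are all invented for illustration:

```python
import math

# Hypothetical feature scores m_n (log-probabilities) for one candidate
# translation; neither the names nor the values come from the slides.
features = {"lm": -12.4, "tm_fwd": -9.1, "tm_rev": -10.3, "penalty": -6.0}
weights  = {"lm": 0.5,  "tm_fwd": 0.3,  "tm_rev": 0.15, "penalty": 0.05}

# Score = exp(sum_n w_n * m_n). Since exp is monotone, comparing the
# weighted sums directly gives the same argmax and avoids underflow.
score = math.exp(sum(weights[k] * features[k] for k in features))
```

In practice decoders compare the weighted log scores directly rather than exponentiating.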
SLIDE 9

Decoding is hard

SLIDE 16

Decoding is hard

  • Very simple example
  • Models: LM, Model 1 (1/1, i.e. one-to-one word alignment)
  • Search space: all possible orderings of e1..m
  • The ordering is picked by the LM
  • Edge weights: w(e1 → e2) = p(e2 | e1)
  • Look familiar? This is the TSP - NP-Complete!

[Diagram: input words f1 f2 f3 f4 … fm, their candidate translations e1 e2 e3 e4 … em, and one chosen ordering e1 e2 e3 e4 e5 … em]
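The word-ordering view can be made concrete with a brute-force search that scores every permutation under a toy bigram LM; the m! orderings are exactly the TSP-style blowup. All probabilities below are invented for illustration:

```python
import itertools

# Invented bigram log-probabilities: w(e1 -> e2) = log p(e2 | e1).
bigram = {("<s>", "the"): -0.5, ("the", "cat"): -1.0, ("cat", "sleeps"): -1.2,
          ("<s>", "cat"): -2.0, ("cat", "the"): -2.5, ("the", "sleeps"): -2.2,
          ("<s>", "sleeps"): -3.0, ("sleeps", "the"): -2.8, ("sleeps", "cat"): -2.9}

def lm_score(order):
    # Sum of bigram log-probs along the path <s> -> e1 -> e2 -> ... -> em.
    prev, total = "<s>", 0.0
    for w in order:
        total += bigram[(prev, w)]
        prev = w
    return total

words = ["the", "cat", "sleeps"]
# Exhaustive search over all m! orderings -- infeasible for real sentences.
best = max(itertools.permutations(words), key=lm_score)
```

For m = 3 this visits 6 orderings; for a 20-word sentence it would visit about 2.4 × 10^18, which is why exact search is hopeless.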

SLIDE 17

Problem characteristics

  • Clear-cut optimization problem
  • There is always one right answer
  • Inherently complex
  • Number of ways to order words (LM)
  • Number of ways to cover input words (TM)
  • Harder than in speech recognition (SR):
  • No left-to-right input-output correspondence
SLIDE 18

Decoding Methods

  • Stack-based Decoding
  • Most common
  • Almost all contemporary decoders are stack-based
  • Greedy Decoding
  • Faster but more error-prone
  • Optimal Decoding
  • Finds the optimal translation
  • Really Really Slow !
SLIDE 19

Stack-based Decoding

  • Originally introduced by Jelinek for speech recognition
  • Stores partial translations (hypotheses) in a stack
  • Builds new translations by extending existing hypotheses
  • Optimal translation guaranteed given unlimited stack size and search time
  • Note: "stack" does not imply LIFO; it is actually a priority queue
SLIDE 24

Stack-based Decoding

Hypothesis stack (finite size, sorted by cost):
  (1) Pop the best hypothesis
  (2) Extend it by translating every possible word
  (3) Push the resulting hypotheses
Repeat (1)-(3) until a complete hypothesis is encountered
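The pop-extend-push loop can be sketched with a priority queue. This toy version translates source words strictly left to right and uses an invented cost table, so it shows only the control flow, not a real decoder:

```python
import heapq

# Invented word-for-word translation options with costs (lower is better).
table = {"la": [("the", 0.5)], "maison": [("house", 0.8), ("home", 1.1)]}
source = ["la", "maison"]

# A hypothesis is (cost so far, source words covered, partial translation).
stack = [(0.0, 0, ())]
best = None
while stack:
    cost, covered, words = heapq.heappop(stack)       # (1) pop the cheapest
    if covered == len(source):                        # complete hypothesis
        best = (cost, list(words))
        break
    for trans, c in table[source[covered]]:           # (2) extend every way
        heapq.heappush(stack, (cost + c, covered + 1, words + (trans,)))  # (3) push
```

Because the queue is popped in cost order and this toy cost never decreases, the first complete hypothesis popped is the cheapest one.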

SLIDE 25

  • Hypothesis cost = cost of the translation so far
  • Problem: shorter hypotheses will push longer ones out of the stack
  • Solution: use translation cost + future cost
  • Future cost: what it would cost to complete a hypothesis
  • A heuristic function provides an estimate of the future cost
  • No heuristic can be perfect (no monotonicity)
  • Need to find another solution
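One common and deliberately simple future-cost heuristic sums the cheapest available translation cost of each uncovered source word; the cost table below is invented:

```python
# Invented per-word translation costs (lower is better).
table = {"la": [("the", 0.5)], "maison": [("house", 0.8), ("home", 1.1)]}

def future_cost(uncovered):
    # Optimistic estimate: assume every remaining word gets its cheapest
    # translation and ignore LM interactions -- hence no heuristic is perfect.
    return sum(min(cost for _, cost in table[w]) for w in uncovered)

# A hypothesis that has translated "la" (cost 0.5) but not "maison":
total = 0.5 + future_cost(["maison"])   # translation cost so far + future cost
```

Ranking hypotheses by this combined total lets short and long hypotheses compete on comparable footing.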

SLIDE 26

Multi-stack Decoding

  • Use multiple stacks:
  • One for each subset of the input words (2^n stacks), or
  • One for each number of words covered (n stacks)
  • Extend the top hypothesis from each stack
  • Competition is among similar hypotheses
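The n-stack variant (one stack per number of covered source words) can be sketched as below; the model is again an invented, left-to-right cost table, so only the bookkeeping is realistic:

```python
from collections import defaultdict

# Invented translation options; decoding is left-to-right for simplicity.
table = {"la": [("the", 0.5)], "maison": [("house", 0.8), ("home", 1.1)]}
source = ["la", "maison"]

stacks = defaultdict(list)          # stacks[n] holds hypotheses covering n words
stacks[0].append((0.0, ()))
for n in range(len(source)):
    # All hypotheses in stacks[n] cover exactly n words, so they compete
    # fairly with each other rather than with shorter or longer ones.
    for cost, words in stacks[n]:
        for trans, c in table[source[n]]:
            stacks[n + 1].append((cost + c, words + (trans,)))

best_cost, best_words = min(stacks[len(source)])
```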
SLIDE 27

Other Optimizations

  • Beam-based pruning
  • Relative threshold: prune h if p(h) < α · p(h_best)
  • Histogram: keep only a fixed number of hypotheses, prune the rest
  • Can accidentally prune out a good hypothesis
  • Hypothesis recombination
  • If similar(h1, h2), keep only the cheaper one
  • Risk-free
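Both pruning schemes are easy to state in code. The hypotheses and probabilities below are invented; `alpha` and the histogram size are the tunable knobs:

```python
def prune(stack, alpha=0.5, histogram=3):
    # Each entry is (probability, hypothesis); higher probability is better.
    p_best = max(p for p, _ in stack)
    # Relative threshold: drop h whenever p(h) < alpha * p(h_best).
    kept = [(p, h) for p, h in stack if p >= alpha * p_best]
    # Histogram pruning: keep only the top `histogram` hypotheses.
    kept.sort(reverse=True)
    return kept[:histogram]

stack = [(0.9, "h1"), (0.5, "h2"), (0.4, "h3"), (0.1, "h4")]
survivors = prune(stack)
```

Here the threshold is 0.5 · 0.9 = 0.45, so h3 and h4 are pruned even though one of them might have extended into the best full translation, which is exactly the risk the slide mentions.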
SLIDE 28

Greedy Decoding

  • Start with the word-for-word English gloss
  • Iterate exhaustively over all alignments one simple operation away
  • Add, substitute, change order, etc.
  • Pick the one with the highest probability
  • Commit the change
  • Repeat until no improvement is possible
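Greedy decoding is hill-climbing. The sketch below uses only one operation (swapping two words) and a toy scorer that stands in for the real model probability; both are assumptions for illustration:

```python
import itertools

TARGET = ["the", "cat", "sleeps"]

def score(words):
    # Toy stand-in for the model score: how many positions match TARGET.
    return sum(w == t for w, t in zip(words, TARGET))

def greedy_decode(gloss):
    current = list(gloss)
    while True:
        # All candidates one swap away (one of the simple operations).
        neighbours = []
        for i, j in itertools.combinations(range(len(current)), 2):
            n = current[:]
            n[i], n[j] = n[j], n[i]
            neighbours.append(n)
        best = max(neighbours, key=score)
        if score(best) <= score(current):   # no improvement -> stop
            return current
        current = best                      # commit the change

result = greedy_decode(["cat", "the", "sleeps"])   # start from the gloss
```

Each iteration examines only the local neighbourhood, which is why the method is fast but can get stuck far from the global optimum.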
SLIDE 29

Greedy Decoding

  • Pros
  • Much, much faster
  • Complexity scales only polynomially with sentence length
  • Cons
  • Searches only a very small subspace
  • Cannot find the best translation if it is far from the gloss
SLIDE 30

Optimal Decoding

  • Transform decoding problem into a TSP instance
  • Foreign words ~ Cities
  • Translations ~ Hotels in cities
  • Cost ~ Distance
  • Solve TSP using Integer Programming (IP)
  • Cast tour selection as a constrained integer program
  • Can find tours of various lengths (n-best lists)
SLIDE 31

Optimal Decoding

  • Pros
  • Fast decoder development
  • Optimal n-best lists
  • Extremely customizable
  • Cons
  • Extremely slow!
  • Hard to integrate unrelated information sources

SLIDE 32

Decoding Errors

  • Search error
  • decode(f) = e, but ∃ e′ s.t. score(e′) > score(e)
  • The right answer is in the search space, but we couldn't find it
  • Hard to prove that a decoding is sub-optimal
  • Model error
  • correct(f) ∉ search space
  • The right answer is not in the space because of imperfect models

SLIDE 33

Observations*

  • |space_greedy| << |space_stack| (hence the speed)
  • space_stack ⊂ space_optimal
  • nSE_greedy >> nSE_stack >> nSE_optimal (= 0), where nSE = number of search errors
  • t_greedy < t_stack <<< t_optimal (50 for m = 6, 500 for m = 8!)
  • nME >> 0 for all decoders, since Model 4 is deficient (nME = number of model errors)

* All decoders use Model 4 and were tested on the same set

SLIDE 34

Take Home Messages

  • Optimal decoding is possible but highly impractical
  • Optimized stack-based decoding provides a good balance
  • All modern decoders are basically the same (stack-based)
  • Differences lie in the models, the score, and the extension operations. Examples: Pharaoh, Rewrite
  • Better translations will come from improving the models (e.g. Hiero)