SLIDE 1

Decoding, continued
SLIDE 2

Activity

Build a translation model that we’ll use later today.

Instructions:
  • Subject is “mt-class”
  • The body has six lines
  • There is one one-word translation per line
SLIDE 3

ADMINISTRATIVE

  • Schedule for “language in 10 minutes”
  • Leaderboard
SLIDE 4

THE STORY SO FAR...

[Diagram: training data (parallel text) → learner → model → decoder, with example sentences: “However, the sky remained clear under the strong north wind.” (English) and “联合国 安全 理事会 的 五个 常任 理事 国都” (“All five permanent members of the UN Security Council”, Chinese)]
SLIDE 5

SCHEDULE

  • TUESDAY
    • stack-based decoding in conception
  • TODAY
    • stack-based decoding in practice
    • scoring, dynamic programming, pruning
SLIDE 6

DECODING

  • the process of producing a translation of a sentence
  • Two main problems:
    • modeling – given a pair of sentences, how do we assign a probability to them?

C: 他们还缺乏国际比赛的经验。
E: They still lack experience in international competitions

P(C → E) = high
SLIDE 7

DECODING

  • the process of producing a translation of a sentence
  • Two main problems:
    • modeling – given a pair of sentences, how do we assign a probability to them?

C: 他们还缺乏国际比赛的经验。
E: This is not a good translation of the above sentence.

P(C → E) = low
SLIDE 8

MODEL

  • Noisy Channel model

P(e | f) ∝ P(f | e) P(e)

[Diagram: the noisy channel in speech recognition (English words → NOISE → English words) and in machine translation (English words → NOISE → French words)]
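The proportionality above is just Bayes’ rule with the constant denominator dropped: the source sentence f is fixed during decoding, so P(e | f) = P(f | e) P(e) / P(f) ∝ P(f | e) P(e).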

SLIDE 9

MODEL TRANSFORMS

  • Add weights

P(e | f) ∝ P(f | e) P(e) ∝ P(f | e)^λ₁ P(e)^λ₂
SLIDE 10

WEIGHTS

  • Why?
  • Just like in real life, where we trust people’s claims differently, we will want to learn how to trust different models

[Bar chart: credibility (25–100) of the claim “I can do a backflip off this pommel horse”, comparing your brother and Paul Hamm]
SLIDE 11

MODEL TRANSFORMS

  • Log space transform
  • Because:

0.0001 * 0.0001 * 0.0001 = 0.000000000001
log(0.0001) + log(0.0001) + log(0.0001) = -12

P(e | f) ∝ P(f | e) P(e) ∝ P(f | e)^λ₁ P(e)^λ₂ = λ₁ log P(f | e) + λ₂ log P(e)
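A minimal sketch in plain Python of why decoders work in log space (base-10 logs to match the slide; the numbers are the slide’s own):

```python
import math

probs = [0.0001, 0.0001, 0.0001]

# Multiplying many small probabilities drives the product toward the
# limits of floating point; long products eventually underflow to 0.0.
product = 1.0
for p in probs:
    product *= p
print(product)     # ~1e-12

# Summing log probabilities is numerically stable, however many terms.
log_score = sum(math.log10(p) for p in probs)
print(log_score)   # -12.0
```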

SLIDE 12

MODEL TRANSFORMS

  • Generalization

P(e | f) ∝ P(f | e) P(e)
         ∝ P(f | e)^λ₁ P(e)^λ₂
         = λ₁ log P(f | e) + λ₂ log P(e)
         = λ₁ φ₁(f, e) + λ₂ φ₂(f, e)
         = Σᵢ λᵢ φᵢ(f, e)
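The generalized form is just a dot product between weights and feature values. A hedged sketch; the feature names and numbers below are invented for illustration:

```python
# phi_i(f, e): log-domain feature values for one (f, e) pair (illustrative)
features = {"log_tm": -2.3, "log_lm": -4.1}
# lambda_i: how much we trust each feature (illustrative)
weights = {"log_tm": 0.8, "log_lm": 1.2}

# score = sum_i lambda_i * phi_i(f, e)
score = sum(weights[name] * value for name, value in features.items())
```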

SLIDE 13

MODEL

e*, a* = argmax_{e,a} Pr(e, a | c) = argmax_{e,a} Σᵢ λᵢ φᵢ(e, a, c)

  • search: how do we find it? (the argmax)
  • model: what is a good translation? (the weighted sum, with weights λᵢ and feature functions φᵢ)

A better “fundamental equation” for MT
SLIDE 14

DECODING

  • the process of producing a translation of a sentence
  • Two main problems:
    • search – given a model and a source sentence, how do we find the sentence that the model likes best?
      • impractical: enumerate all sentences, score them
      • stack decoding: assemble translations piece by piece
SLIDE 15

STACK DECODING

  • Start with a list of hypotheses, containing only the empty hypothesis
  • For each stack
    • For each hypothesis
      • For each applicable word
        • Extend the hypothesis with the word
        • Place the new hypothesis on the right stack
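A minimal Python sketch of this loop, assuming monotone word-for-word translation (one stack per number of translated words; the Hypothesis fields and helper names are my assumptions, not the course’s reference decoder):

```python
from collections import namedtuple

# A hypothesis records how much of the source is covered, the last
# target word (all a bigram LM needs), and the accumulated log score.
Hypothesis = namedtuple("Hypothesis", "covered last_word score")

def stack_decode(source, translations, extension_score):
    """translations: dict mapping each source word to candidate target
    words; extension_score: log-score increment for one extension."""
    stacks = [[] for _ in range(len(source) + 1)]
    stacks[0].append(Hypothesis(covered=0, last_word="<s>", score=0.0))

    for stack in stacks[:-1]:                    # for each stack
        for hyp in stack:                        # for each hypothesis
            src = source[hyp.covered]
            for tgt in translations[src]:        # for each applicable word
                new = Hypothesis(                # extend the hypothesis
                    covered=hyp.covered + 1,
                    last_word=tgt,
                    score=hyp.score + extension_score(src, tgt, hyp.last_word),
                )
                stacks[new.covered].append(new)  # place it on the right stack

    return max(stacks[-1], key=lambda h: h.score)
```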
SLIDE 16

FACTORING MODELS

  • Stack decoding works by extending hypotheses word by word
  • These can be arranged into a search graph representing the space we search

[Diagram: hypothesis + add word (tengo → am) = new hypothesis]
SLIDE 17

FACTORING MODELS

[Diagram: search graph for “Yo tengo hambre”: Yo → I; tengo → am, tengo → have; hambre → hungry, hambre → hunger]
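One way to picture the graph above in code: per-word translation options, where every path through the graph is one candidate translation (the word pairs come from the slide; the data structure is my own):

```python
from itertools import product

# Translation options for each source word, as drawn on the slide.
options = {
    "Yo":     ["I"],
    "tengo":  ["am", "have"],
    "hambre": ["hungry", "hunger"],
}

source = ["Yo", "tengo", "hambre"]
# Paths: "I am hungry", "I am hunger", "I have hungry", "I have hunger"
candidates = [" ".join(path) for path in product(*(options[w] for w in source))]
```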

SLIDE 18

FACTORING MODELS

  • Stack decoding works by extending hypotheses word by word
  • These can be arranged into a search graph representing the space we search
  • The component models we use need to factorize over this graph, and we accumulate the score as we go

[Diagram: hypothesis + add word (tengo → am) = new hypothesis]
SLIDE 19

FACTORING MODELS

  • Example hypothesis creation:
    • translation model: trivial case, since all the words are translated independently
      hypothesis.score += P_TM(am | tengo)
    • a function of just the word that is added

[Diagram: hypothesis + add word (tengo → am) = new hypothesis]
SLIDE 20

FACTORING MODELS

  • Example hypothesis creation:
    • language model: still easy, since (bigram) language models depend only on the previous word
      hypothesis.score += P_LM(am | I)
    • a function of the old hypothesis and the new word’s translation

[Diagram: hypothesis + add word (tengo → am) = new hypothesis]
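The two increments from slides 19 and 20 combine into a single extension score that depends only on the new edge and the previous target word. A sketch with made-up probability tables:

```python
import math

# Illustrative probabilities; real models are estimated from parallel
# text (TM) and monolingual text (LM).
p_tm = {("tengo", "am"): 0.4}   # P_TM(target | source)
p_lm = {("I", "am"): 0.3}       # P_LM(word | previous word)

def extension_score(src, tgt, prev_word):
    # The score factorizes over the edge: only (src, tgt) and the
    # previous target word matter, not the rest of the hypothesis.
    return math.log10(p_tm[(src, tgt)]) + math.log10(p_lm[(prev_word, tgt)])

# score += P_TM(am | tengo) + P_LM(am | I), in log space
increment = extension_score("tengo", "am", "I")
```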

SLIDE 21

DYNAMIC PROGRAMMING

  • We saw Tuesday how huge the search space could get
  • Notice anything here?
    • (1) <s> is never used in computing the scores AND (2) <s> is implicit in the graph structure
    • let’s get rid of the extra state!

[Diagram: hypothesis + add word (tengo → am) = new hypothesis; score += P_TM(am | tengo) + P_LM(am | I)]
SLIDE 22

DYNAMIC PROGRAMMING

  • Before
  • After

[Diagram: the search graph before and after recombination; the score of the new hypothesis is the maximum over all ways of computing it]
SLIDE 23

STACK DECODING (WITH DP)

  • Start with a list of hypotheses, containing only the empty hypothesis
  • For each stack
    • For each hypothesis
      • For each applicable word
        • Extend the hypothesis with the word
        • Place the new hypothesis on the right stack IF either (1) no equivalent hypothesis exists, or (2) this hypothesis has a higher score
SLIDE 24

MORE GENERALLY

  • What is an “equivalent hypothesis”?
  • Hypotheses that match on the minimum necessary state:
    • last word (for language model computation)
    • the score (of the best way to get here)
    • the coverage vector (so we know which words we haven’t translated)
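Under these assumptions, recombination is a dictionary lookup: the key is the minimum necessary state, and only the best-scoring hypothesis per key survives. A sketch (the coverage vector generalizes the integer covered count in the earlier sketch once reordering is allowed; field names are mine):

```python
def add_with_recombination(stack, hyp):
    """stack: dict mapping a recombination key to the best hypothesis
    seen so far. Two hypotheses are equivalent when they agree on the
    coverage vector and the last target word."""
    key = (hyp.coverage, hyp.last_word)
    incumbent = stack.get(key)
    # Keep the new hypothesis if no equivalent exists or it scores higher.
    if incumbent is None or hyp.score > incumbent.score:
        stack[key] = hyp
```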
SLIDE 25

OLD GRAPH (BEFORE DP)

SLIDE 26

PRUNING

  • Even with DP, there are still too many hypotheses
  • So we prune:
    • histogram pruning: keep only k items on each stack
    • threshold pruning: don’t keep items that have a score beyond some distance from the most probable item in the stack
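Both pruning rules fit in a few lines; the defaults for k and the threshold below are arbitrary illustrative values:

```python
def prune(stack, k=100, threshold=10.0):
    """Histogram pruning: keep at most k hypotheses on the stack.
    Threshold (beam) pruning: drop hypotheses whose log score falls
    more than `threshold` below the best item on the stack."""
    if not stack:
        return stack
    ranked = sorted(stack, key=lambda h: h.score, reverse=True)
    best = ranked[0].score
    return [h for h in ranked[:k] if best - h.score <= threshold]
```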
SLIDE 27

STACK DECODING (WITH PRUNING)

  • Start with a list of hypotheses, containing only the empty hypothesis
  • For each stack
    • For each hypothesis
      • For each applicable word
        • Extend the hypothesis with the word
        • If it’s the best, place the new hypothesis on the right stack (possibly replacing an old one)
    • Prune
SLIDE 28

PITFALLS

  • Search errors
    • def: not finding the model’s highest-scoring translation
    • this happens when the shortcuts we took excluded good hypotheses
  • Model errors
    • def: the model’s best hypothesis isn’t a good one
    • depends on some metric (e.g., human judgment)
SLIDE 29

Activity

http://cs.jhu.edu/~post/mt-class/stack-decoder/

Instructions (10 minutes): In groups or alone, find the highest-scoring translation under our model under different stack size and reordering settings. Are there any search or model errors?
SLIDE 30

IMPORTANT CONCEPTS

  • generalized weighted feature function formulation
  • decoding as graph search
  • factorized models for scoring edges
  • dynamic programming
  • pruning (histogram, beam/threshold)
SLIDE 31

NOT DISCUSSED (BUT IMPORTANT)

  • Outside (future) cost estimates and A* search
  • Computational complexity