SLIDE 1

CRF Word Alignment & Noisy Channel Translation

Machine Translation Lecture 6
Instructor: Chris Callison-Burch
TAs: Mitchell Stern, Justin Chiu
Website: mt-class.org/penn

SLIDE 2

Last Time ...

p(translation) = sum over alignments of p(alignment) × p(translation | alignment), or, written out:

$$p(e \mid f, m) = \sum_{a \in [0,n]^m} \underbrace{p(a \mid f, m)}_{\text{alignment}} \times \underbrace{\prod_{i=1}^{m} p(e_i \mid f_{a_i})}_{\text{translation}}$$
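A quick sanity check of this decomposition (a sketch, not from the lecture; the toy translation table and sentences are invented): compute the sum over all alignments explicitly, then via the per-position factorization that a uniform, position-independent alignment prior (the IBM Model 1 assumption) permits.

```python
import itertools

# Toy Model 1-style translation table t(e_i | f_j); the values are made up.
t = {
    ("the", "NULL"): 0.1, ("the", "das"): 0.7, ("the", "haus"): 0.1,
    ("house", "NULL"): 0.1, ("house", "das"): 0.1, ("house", "haus"): 0.8,
}
f = ["NULL", "das", "haus"]   # source sentence; position 0 is NULL
e = ["the", "house"]          # target sentence
n, m = len(f) - 1, len(e)

# Explicit sum over all alignments a in [0, n]^m with the uniform prior
# p(a | f, m) = (1 / (n + 1))^m.
p_sum = 0.0
for a in itertools.product(range(n + 1), repeat=m):
    p_align = (1.0 / (n + 1)) ** m
    p_trans = 1.0
    for i in range(m):
        p_trans *= t[(e[i], f[a[i]])]
    p_sum += p_align * p_trans

# Because the prior factorizes over positions, the sum collapses into a
# product of per-position sums -- this is what makes Model 1 tractable.
p_factored = 1.0
for i in range(m):
    p_factored *= sum(t[(e[i], f[j])] for j in range(n + 1)) / (n + 1)

print(p_sum, p_factored)  # identical up to floating-point rounding
```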

SLIDE 3

MAP alignment

[Figure: alignment grids comparing the IBM Model 4 alignment with our model's alignment]

SLIDE 4

A few tricks...

p(e|f) vs. p(f|e)


SLIDE 7

Another View

With this model:

$$p(e \mid f, m) = \sum_{a \in [0,n]^m} p(a \mid f, m) \times \prod_{i=1}^{m} p(e_i \mid f_{a_i})$$

the problem of word alignment is:

$$a^* = \arg\max_{a \in [0,n]^m} p(a \mid e, f, m)$$

Can we model this distribution directly?

SLIDE 8

Markov Random Fields (MRFs)

[Graph: a directed model over A, B, C, X, Y, Z]

$$p(A, B, C, X, Y, Z) = p(A) \times p(B \mid A) \times p(C \mid B) \times p(X \mid A)\, p(Y \mid B)\, p(Z \mid C)$$

[Graph: an undirected model over the same variables]

$$p(A, B, C, X, Y, Z) = \frac{1}{Z} \times \Psi_1(A, B) \times \Psi_2(B, C) \times \Psi_3(C, D) \times \Psi_4(X) \times \Psi_5(Y) \times \Psi_6(Z)$$

The $\Psi_i$ are called “factors.”

SLIDE 9

Computing Z

[Graph: two variables X and Y joined by a pairwise factor]

$$\mathcal{X} = \{a, b, c\}, \qquad X \in \mathcal{X}, \qquad Y \in \mathcal{X}$$

$$Z = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{X}} \Psi_1(x, y)\, \Psi_2(x)\, \Psi_3(y)$$

When the graph has certain structures (e.g., chains), you can factor the sum to get polynomial-time dynamic programming algorithms:

$$Z = \sum_{x \in \mathcal{X}} \Psi_2(x) \sum_{y \in \mathcal{X}} \Psi_1(x, y)\, \Psi_3(y)$$
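To make the factoring trick concrete, here is a small sketch (toy potentials with invented values) comparing the brute-force computation of Z with the factored form from the slide:

```python
import math

# Toy potentials over X = {a, b, c}; the numbers are arbitrary.
Xs = ["a", "b", "c"]
psi1 = {(x, y): 1.0 + (0.5 if x == y else 0.0) for x in Xs for y in Xs}  # pairwise
psi2 = {x: 2.0 if x == "a" else 1.0 for x in Xs}                         # unary on X
psi3 = {y: 3.0 if y == "c" else 1.0 for y in Xs}                         # unary on Y

# Brute force: enumerate all |X|^2 joint assignments.
Z_brute = sum(psi1[(x, y)] * psi2[x] * psi3[y] for x in Xs for y in Xs)

# Factored: pull psi2(x) out of the inner sum, exactly as on the slide.
# On a chain of length L this idea turns O(|X|^L) work into O(L * |X|^2).
Z_factored = sum(psi2[x] * sum(psi1[(x, y)] * psi3[y] for y in Xs) for x in Xs)

assert math.isclose(Z_brute, Z_factored)
print(Z_brute)
```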

SLIDE 10

Log-linear models

[Graph: the same undirected model over A, B, C, X, Y, Z]

$$p(A, B, C, X, Y, Z) = \frac{1}{Z} \times \Psi_1(A, B) \times \Psi_2(B, C) \times \Psi_3(C, D) \times \Psi_4(X) \times \Psi_5(Y) \times \Psi_6(Z)$$

$$\Psi_{1,2,3}(x, y) = \exp \sum_{k} \underbrace{w_k}_{\substack{\text{weights} \\ \text{(learned)}}} \underbrace{f_k(x, y)}_{\substack{\text{feature functions} \\ \text{(specified)}}}$$
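In code, a log-linear factor is just an exponentiated weighted feature sum; a minimal sketch (feature names and weight values invented for illustration):

```python
import math

def features(x, y):
    # Hypothetical feature functions f_k(x, y) for illustration.
    return {"same_value": float(x == y), "x_is_a": float(x == "a")}

weights = {"same_value": 1.2, "x_is_a": -0.3}  # learned in practice; made up here

def potential(x, y):
    # Psi(x, y) = exp(sum_k w_k * f_k(x, y))
    return math.exp(sum(weights[k] * v for k, v in features(x, y).items()))

print(potential("a", "a"))  # exp(1.2 - 0.3)
```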

SLIDE 11

Random Fields

  • Benefits
  • Potential functions can be defined with respect to arbitrary features (functions) of the variables
  • Great way to incorporate knowledge
  • Drawbacks
  • Likelihood involves computing Z
  • Maximizing likelihood usually requires computing Z (often over and over again!)

SLIDE 12

Conditional Random Fields

  • Use MRFs to parameterize a conditional distribution. Very easy: let feature functions look at anything they want in the “input” x

$$p(y \mid x) = \frac{1}{Z_w(x)} \exp \sum_{F \in G} \sum_{k} w_k f_k(F, x)$$

The outer sum runs over all factors F in the graph of y.

SLIDE 13

Parameter Learning

  • CRFs are trained to maximize conditional likelihood
  • Recall we want to directly model p(a | e, f)
  • The likelihood of what alignments? Gold reference alignments!

$$\hat{w}_{\text{MLE}} = \arg\max_{w} \prod_{(x_i, y_i) \in D} p(y_i \mid x_i; w)$$
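Taking logs turns this product into a sum, and the gradient of the conditional log-likelihood takes the standard CRF form of observed minus expected feature counts (a standard result, stated here for completeness rather than taken from the slides):

$$\frac{\partial}{\partial w_k} \sum_{(x_i, y_i) \in D} \log p(y_i \mid x_i; w) = \sum_{(x_i, y_i) \in D} \Big( f_k(x_i, y_i) - \mathbb{E}_{p(y \mid x_i; w)}\big[ f_k(x_i, y) \big] \Big)$$

Here f_k(x, y) abbreviates the feature summed over all factors; the expectation term is exactly the “computing Z over and over again” cost from the previous slide.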

SLIDE 14

CRF for Alignment

  • One of many possibilities, due to Blunsom & Cohn (2006)
  • a has the same form as in the lexical translation models (we still make a one-to-many assumption)
  • w_k are the model parameters
  • f_k are the feature functions

$$p(a \mid e, f) = \frac{1}{Z_w(e, f)} \exp \sum_{i=1}^{|e|} \sum_{k} w_k f_k(a_i, a_{i-1}, i, e, f)$$

Inference is $O(n^2 m) \approx O(n^3)$.
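A sketch of the kind of feature function this model uses (the feature names are invented for illustration, echoing the features on the following slides; this is not Blunsom & Cohn's exact feature set):

```python
def alignment_features(a_i, a_prev, i, e, f):
    """Features f_k(a_i, a_{i-1}, i, e, f) for aligning target word e[i]
    to source word f[a_i]; a hypothetical sketch."""
    e_word, f_word = e[i], f[a_i]
    feats = {
        "identical_word": float(e_word == f_word),
        "matching_prefix": float(e_word[:3] == f_word[:3]),
        "jump_%+d" % (a_i - a_prev): 1.0,  # first-order (Markov) distortion
        "rel_position_diff": abs(i / len(e) - a_i / len(f)),
    }
    return feats

print(alignment_features(1, 0, 1, ["pervez", "musharraf"], ["pervez", "musharrafs"]))
```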

SLIDE 15

Model

  • Labels (one per target word) index positions in the source sentence
  • Train models in both directions, (e,f) and (f,e) [inverting the reference alignments]
SLIDE 16

Alignment Experiments

  • French-English Canadian Hansards corpus
  • 484 manually word-aligned sentence pairs (100 training, 37 development, 347 testing)
  • 1.1 million sentence-aligned pairs
  • Baseline for comparison: Giza++ implementation of IBM Model 4
  • (Also experimented on Romanian-English)
SLIDE 17

Identical word

pervez musharrafs langer abschied
pervez musharraf ’s long goodbye

Identical word


SLIDE 18

Matching prefix

pervez musharrafs langer abschied
pervez musharraf ’s long goodbye

Identical word Matching prefix


SLIDE 19

Matching suffix

pervez musharrafs langer abschied
pervez musharraf ’s long goodbye

Identical word Matching prefix Matching suffix


SLIDE 20

Orthographic similarity

pervez musharrafs langer abschied
pervez musharraf ’s long goodbye

Identical word Matching prefix Matching suffix Orthographic similarity


SLIDE 21

In dictionary

pervez musharrafs langer abschied
pervez musharraf ’s long goodbye

Identical word Matching prefix Matching suffix Orthographic similarity In dictionary ...


SLIDE 22

Lexical Features

  • Word↔word indicator features
  • Various word↔word co-occurrence scores
  • IBM Model 1 probabilities (t→s, s→t)
  • Geometric mean of Model 1 probabilities
  • Dice’s coefficient [binned] (see the sketch below)
  • Products of the above
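Dice's coefficient, for instance, needs only co-occurrence counts from the sentence-aligned bitext; a minimal sketch (the two-sentence "bitext" is invented toy data):

```python
from collections import Counter

def dice_scores(bitext):
    """dice(e, f) = 2 * C(e, f) / (C(e) + C(f)), where C counts the
    sentence pairs in which the word(s) appear."""
    c_e, c_f, c_ef = Counter(), Counter(), Counter()
    for e_sent, f_sent in bitext:
        e_set, f_set = set(e_sent), set(f_sent)
        c_e.update(e_set)
        c_f.update(f_set)
        c_ef.update((e, f) for e in e_set for f in f_set)
    return {(e, f): 2.0 * n / (c_e[e] + c_f[f]) for (e, f), n in c_ef.items()}

bitext = [(["the", "house"], ["das", "haus"]),
          (["the", "book"], ["das", "buch"])]
print(dice_scores(bitext)[("the", "das")])  # 1.0: they always co-occur
```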
SLIDE 23

Lexical Features

  • Word class↔word class indicator features
  • NN translates as NN (NN_NN=1)
  • NN does not translate as MD (NN_MD=1)
  • Identical word feature
  • 2010 = 2010 (IdentWord=1 IdentNum=1)
  • Identical prefix feature
  • Obama ~ Obamu (IdentPrefix=1)
  • Orthographic similarity measure [binned] (see the sketch below)
  • Al-Qaeda ~ Al-Kaida (OrthoSim050_080=1)
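The orthographic similarity feature can be read as a binned, normalized edit distance; a sketch under that assumption (the exact measure and bin boundaries in the paper may differ):

```python
def edit_distance(u, v):
    # Standard Levenshtein dynamic program.
    prev = list(range(len(v) + 1))
    for i, cu in enumerate(u, 1):
        cur = [i]
        for j, cv in enumerate(v, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cu != cv)))   # substitution
        prev = cur
    return prev[-1]

def ortho_features(u, v):
    sim = 1.0 - edit_distance(u.lower(), v.lower()) / max(len(u), len(v))
    feats = {}
    if 0.5 <= sim < 0.8:  # the bin from the slide's example
        feats["OrthoSim050_080"] = 1.0
    return feats

print(ortho_features("Al-Qaeda", "Al-Kaida"))  # {'OrthoSim050_080': 1.0}
```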
SLIDE 24

Other Features

  • Compute features from large amounts of unlabeled text
  • Does the Model 4 alignment contain this alignment point?
  • What is the Model 1 posterior probability of this alignment point?
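For reference, under Model 1's uniform alignment prior the posterior of an alignment point has a simple closed form (standard for Model 1, not spelled out on the slide): the prior cancels, leaving

$$p(a_i = j \mid e, f) = \frac{t(e_i \mid f_j)}{\sum_{j'=0}^{n} t(e_i \mid f_{j'})}$$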

SLIDE 25

Results

SLIDE 26

Summary

  • CRFs outperform unsupervised / latent variable alignment models, even when only a small number of word-aligned sentences are available
  • A diverse range of features can be incorporated and are beneficial to word alignment quality
  • Features from unsupervised models can also be incorporated
  • Unfortunately, you need gold alignments!

SLIDE 27

Putting the pieces together

  • We have seen how to model the following: p(e), p(e | f, m), p(e, a | f, m), p(a | e, f)

SLIDE 28

Putting the pieces together

  • We have seen how to model the following: p(e), p(e | f, m), p(e, a | f, m), p(a | e, f)
  • Goal: a better model of p(e | f, m) that knows about p(e)

SLIDE 29

“One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’”

Warren Weaver to Norbert Wiener, March 1947

SLIDE 30

Claude Shannon. “A Mathematical Theory of Communication.” 1948.

Message M → Encoder → Sent transmission Y → “Noisy” channel → Received transmission X → Decoder → Recovered message M′

SLIDE 31

Claude Shannon. “A Mathematical Theory of Communication.” 1948.

Message M → Encoder → Sent transmission Y ~ p(y) → “Noisy” channel p(x | y) → Received transmission X ~ p(x) → Decoder → Recovered message M′


SLIDE 33

Claude Shannon. “A Mathematical Theory of Communication.” 1948.

Message M → Encoder → Sent transmission Y ~ p(y) → “Noisy” channel p(x | y) → Received transmission X → Decoder → Recovered message M′

Shannon’s theory tells us:
1) how much data you can send
2) the limits of compression
3) why your download is so slow
4) how to translate

SLIDE 34

Sent transmission Y ~ p(y) → “Noisy” channel p(x | y) → Received transmission X → Decoder → Recovered guess Y′


SLIDE 37

Sent transmission Y ~ p(y) → “Noisy” channel p(x | y) → Received transmission X → Decoder → Recovered guess Y′

$$y' = \arg\max_{y} p(y \mid x) = \arg\max_{y} \frac{p(x \mid y)\, p(y)}{p(x)} = \arg\max_{y} p(x \mid y)\, p(y)$$

SLIDE 38

Sent transmission Y ~ p(y) → “Noisy” channel p(x | y) → Received transmission X → Decoder → Recovered guess Y′

$$y' = \arg\max_{y} p(y \mid x) = \arg\max_{y} \frac{p(x \mid y)\, p(y)}{p(x)} = \arg\max_{y} p(x \mid y)\, p(y)$$

$$p(y \mid x) \neq p(x \mid y)$$

“I can help.”


SLIDE 40

Sent transmission Y → “Noisy” channel → Received transmission X → Decoder → Recovered guess Y′

$$y' = \arg\max_{y} p(y \mid x) = \arg\max_{y} \frac{p(x \mid y)\, p(y)}{p(x)} = \arg\max_{y} p(x \mid y)\, p(y)$$

The denominator doesn’t depend on y.


SLIDE 42

Sent transmission Y → “Noisy” channel → Received transmission X → Decoder → Recovered guess Y′

$$y' = \arg\max_{y} p(x \mid y)\, p(y)$$

SLIDE 43

The same channel, instantiated for translation: the sent transmission is English, the received transmission is “French”, and the recovered message is English′.

$$y' = \arg\max_{y} p(x \mid y)\, p(y) \quad\Longrightarrow\quad e' = \arg\max_{e} p(f \mid e)\, p(e)$$


SLIDE 45

The same channel, instantiated for translation: the sent transmission is English, the received transmission is “French”, and the recovered message is English′.

$$e' = \arg\max_{e} \underbrace{p(f \mid e)}_{\text{translation model}}\; \underbrace{p(e)}_{\text{language model}}$$

Other noisy channel applications: OCR, speech recognition, spelling correction...
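As a sketch, the noisy channel decision rule is a one-liner once you have the two models' scores (the scorers and candidate list here are invented stand-ins; a real decoder searches the space of translations rather than enumerating candidates):

```python
def noisy_channel_decode(f, candidates, log_tm, log_lm):
    """Pick e maximizing log p(f | e) + log p(e).
    log_tm(f, e) ~ log p(f | e); log_lm(e) ~ log p(e)."""
    return max(candidates, key=lambda e: log_tm(f, e) + log_lm(e))

best = noisy_channel_decode(
    "das haus", ["the house", "house the"],
    log_tm=lambda f, e: 0.0,                              # toy: channel indifferent
    log_lm=lambda e: -0.5 if e == "the house" else -2.0,  # toy LM prefers fluency
)
print(best)  # 'the house'
```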

SLIDE 46

Division of labor

  • Translation model
  • probability of translation back into the source
  • ensures adequacy of translation
  • Language model
  • is a translation hypothesis “good” English?
  • ensures fluency of translation
SLIDE 47

$$e^* = \arg\max_{e} p(e \mid f) = \arg\max_{e} \underbrace{p(f \mid e)}_{\text{translation model}} \times \underbrace{p(e)}_{\text{English language model}}$$

SLIDE 48

Announcements

  • HW1 leaderboard submissions are due tonight at 11:59pm
  • HW1 writeup and code are due 24 hours later
  • Next week: Phrase-based Machine Translation