SLIDE 1

CRF Word Alignment & Noisy Channel Translation

January 31, 2013

Tuesday, February 19, 13

SLIDE 2

Last Time ...

p(Translation) = Σ_Alignment p(Translation, Alignment)

SLIDE 3

Last Time ...

p(Translation) = Σ_Alignment p(Translation, Alignment)
              = Σ_Alignment p(Alignment) × p(Translation | Alignment)

SLIDE 4

Last Time ...

p(Translation) = Σ_Alignment p(Alignment) × p(Translation | Alignment)

Written out for a source sentence f, a target sentence e of length m, and alignments a:

p(e | f, m) = Σ_{a ∈ [0,n]^m} p(a | f, m) × ∏_{i=1}^{m} p(e_i | f_{a_i})

SLIDES 5-6

MAP alignment

[Figure: the MAP alignments produced by IBM Model 4 and by our model, compared side by side]

SLIDES 7-9

A few tricks ...

p(f | e)    p(e | f)

SLIDE 10

Another View

With this model:

p(e | f, m) = Σ_{a ∈ [0,n]^m} p(a | f, m) × ∏_{i=1}^{m} p(e_i | f_{a_i})

the problem of word alignment can be stated as:

a* = argmax_{a ∈ [0,n]^m} p(a | e, f, m)

SLIDE 11

Another View

With this model:

p(e | f, m) = Σ_{a ∈ [0,n]^m} p(a | f, m) × ∏_{i=1}^{m} p(e_i | f_{a_i})

the problem of word alignment can be stated as:

a* = argmax_{a ∈ [0,n]^m} p(a | e, f, m)

Can we model this distribution directly?
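The sum over alignments can be computed directly for toy inputs. A minimal sketch, assuming a uniform alignment model p(a | f, m) = (1/(n+1))^m (as in IBM Model 1) and an invented toy translation table:

```python
from itertools import product

# Toy lexical translation table p(e_i | f_ai); values invented for illustration.
# Position 0 of the source is the NULL word.
t = {
    ("the", "das"): 0.7, ("the", "NULL"): 0.1,
    ("house", "haus"): 0.8, ("house", "NULL"): 0.05,
}

def p_e_given_f(e, f, t):
    """Sum over all alignments a in [0, n]^m of p(a | f, m) * prod_i p(e_i | f_{a_i}).

    Assumes a uniform alignment model p(a | f, m) = (1 / (n + 1))^m.
    """
    f = ["NULL"] + f                             # position 0 is the NULL source word
    n, m = len(f) - 1, len(e)
    total = 0.0
    for a in product(range(n + 1), repeat=m):    # enumerate all of [0, n]^m
        p_a = (1.0 / (n + 1)) ** m               # uniform p(a | f, m)
        lex = 1.0
        for i, ai in enumerate(a):
            lex *= t.get((e[i], f[ai]), 0.0)     # p(e_i | f_{a_i})
        total += p_a * lex
    return total

prob = p_e_given_f(["the", "house"], ["das", "haus"], t)
```

Enumerating alignments is exponential in general; this brute force is only for checking the definition on tiny inputs.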

SLIDE 12

Markov Random Fields (MRFs)

[Graph: chain A - B - C, with leaves X, Y, Z attached to A, B, C respectively]

p(A, B, C, X, Y, Z) = p(A) × p(B | A) × p(C | B) × p(X | A) × p(Y | B) × p(Z | C)

SLIDE 13

Markov Random Fields (MRFs)

p(A, B, C, X, Y, Z) = p(A) × p(B | A) × p(C | B) × p(X | A) × p(Y | B) × p(Z | C)

The same graph, undirected, factorizes into potentials over its edges:

p(A, B, C, X, Y, Z) = (1/Z) × Ψ1(A, B) × Ψ2(B, C) × Ψ3(A, X) × Ψ4(B, Y) × Ψ5(C, Z)
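The undirected factorization can be checked numerically on a toy model. A minimal sketch, assuming binary variables and invented agreement potentials on the pairwise edges of the graph:

```python
from itertools import product

# Toy binary MRF over the chain A - B - C with leaves X, Y, Z.
# The potential values are arbitrary illustrations.
def psi_pair(u, v):            # shared pairwise potential: prefers agreement
    return 2.0 if u == v else 1.0

factors = [   # (variable indices into (A, B, C, X, Y, Z), potential)
    ((0, 1), psi_pair),  # Psi1(A, B)
    ((1, 2), psi_pair),  # Psi2(B, C)
    ((0, 3), psi_pair),  # Psi3(A, X)
    ((1, 4), psi_pair),  # Psi4(B, Y)
    ((2, 5), psi_pair),  # Psi5(C, Z)
]

def unnormalized(assignment):
    score = 1.0
    for idx, psi in factors:
        score *= psi(*(assignment[i] for i in idx))
    return score

# Z sums the product of factors over all 2^6 joint assignments.
Z = sum(unnormalized(v) for v in product([0, 1], repeat=6))
p = lambda v: unnormalized(v) / Z    # p(A,B,C,X,Y,Z) = (1/Z) * prod of factors
```

Dividing by Z is exactly what makes the product of unnormalized potentials a probability distribution.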

SLIDE 14

Markov Random Fields (MRFs)

p(A, B, C, X, Y, Z) = (1/Z) × Ψ1(A, B) × Ψ2(B, C) × Ψ3(A, X) × Ψ4(B, Y) × Ψ5(C, Z)

The Ψi are called "factors."

SLIDE 15

Computing Z

Let 𝒳 = {a, b, c} with X ∈ 𝒳 and Y ∈ 𝒳. Then

Z = Σ_{x ∈ 𝒳} Σ_{y ∈ 𝒳} Ψ1(x, y) Ψ2(x) Ψ3(y)

When the graph has certain structures (e.g., chains), you can factor the sum to get polynomial-time dynamic-programming algorithms:

Z = Σ_{x ∈ 𝒳} Ψ2(x) Σ_{y ∈ 𝒳} Ψ1(x, y) Ψ3(y)
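The two expressions for Z compute the same number; the factored form just does less work. A small numerical check with invented potentials:

```python
# Z for a two-variable model: naive double sum vs. the factored form.
# Toy potentials; the values are made up.
X_DOM = ["a", "b", "c"]

psi1 = lambda x, y: 1.0 if x == y else 0.5          # pairwise potential
psi2 = lambda x: {"a": 1.0, "b": 2.0, "c": 3.0}[x]  # unary on x
psi3 = lambda y: {"a": 3.0, "b": 2.0, "c": 1.0}[y]  # unary on y

# Naive: sum over all |X|^2 joint assignments.
Z_naive = sum(psi1(x, y) * psi2(x) * psi3(y) for x in X_DOM for y in X_DOM)

# Factored: pull psi2(x) out of the inner sum. On a chain of length L this
# trick turns an O(|X|^L) sum into an O(L * |X|^2) dynamic program.
Z_factored = sum(psi2(x) * sum(psi1(x, y) * psi3(y) for y in X_DOM)
                 for x in X_DOM)
```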

SLIDES 16-18

Log-linear models

p(A, B, C, X, Y, Z) = (1/Z) × Ψ1(A, B) × Ψ2(B, C) × Ψ3(A, X) × Ψ4(B, Y) × Ψ5(C, Z)

Each pairwise factor is parameterized log-linearly:

Ψj(x, y) = exp Σ_k w_k f_k(x, y)

where the w_k are weights (learned) and the f_k are feature functions (specified).
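A log-linear factor is just an exponentiated weighted feature sum. A minimal sketch with invented feature functions and weights:

```python
import math

# A factor Psi(x, y) = exp(sum_k w_k * f_k(x, y)).
# The feature functions and weights below are invented for illustration.
features = [
    lambda x, y: 1.0 if x == y else 0.0,           # f_0: identity indicator
    lambda x, y: 1.0 if x[0] == y[0] else 0.0,     # f_1: same first letter
]
weights = [1.5, 0.5]   # w_k, learned in practice; fixed here

def psi(x, y):
    return math.exp(sum(w * f(x, y) for w, f in zip(weights, features)))
```

Because exp is strictly positive, any weight vector yields valid (positive) potentials; that is what makes this parameterization so convenient.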

SLIDE 19

Random Fields

  • Benefits
    • Potential functions can be defined with respect to arbitrary features (functions) of the variables
    • Great way to incorporate knowledge
  • Drawbacks
    • Likelihood involves computing Z
    • Maximizing likelihood usually requires computing Z (often over and over again!)

SLIDES 20-22

Conditional Random Fields

  • Use MRFs to parameterize a conditional distribution. Very easy: let feature functions look at anything they want in the "input"

p(y | x) = (1/Z_w(x)) exp Σ_{F ∈ G} Σ_k w_k f_k(F, x)

where F ranges over all factors in the graph of y.

SLIDES 23-24

Parameter Learning

  • CRFs are trained to maximize conditional likelihood
  • Recall we want to directly model p(a | e, f)
  • The likelihood of what alignments? Gold reference alignments!

ŵ_MLE = argmax_w ∏_{(x_i, y_i) ∈ D} p(y_i | x_i; w)
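The conditional MLE objective can be written down directly for a toy model with a tractable Z_w(x). A sketch, assuming a single binary output variable and invented features; in practice one maximizes the log of this product with gradient methods:

```python
import math

# Conditional likelihood for a toy CRF with one output variable y in {0, 1}.
LABELS = [0, 1]

def feats(x, y):                       # f_k(x, y): invented toy features
    return [x * y, float(y)]

def p_y_given_x(y, x, w):
    score = lambda yy: math.exp(sum(wk * fk for wk, fk in zip(w, feats(x, yy))))
    return score(y) / sum(score(yy) for yy in LABELS)   # denominator is Z_w(x)

def conditional_log_likelihood(data, w):
    # log prod_i p(y_i | x_i; w) = sum_i log p(y_i | x_i; w)
    return sum(math.log(p_y_given_x(y, x, w)) for x, y in data)

data = [(1.0, 1), (2.0, 0)]            # (x_i, y_i) pairs; gold labels
ll = conditional_log_likelihood(data, [0.5, -0.2])
```

Note that Z_w(x) here sums over outputs only, for a fixed input x; that is what makes conditional training cheaper than joint MRF training.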

SLIDES 25-26

CRF for Alignment

  • One of many possibilities, due to Blunsom & Cohn (2006)
  • a has the same form as in the lexical translation models (we still make a one-to-many assumption)
  • the w_k are the model parameters
  • the f_k are the feature functions

p(a | e, f) = (1/Z_w(e, f)) exp Σ_{i=1}^{|e|} Σ_k w_k f_k(a_i, a_{i-1}, i, e, f)

Inference is O(n²m) ≈ O(n³).
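Because Z_w(e, f) does not depend on a, the argmax alignment can be found with Viterbi over the chain in O(n²m). A minimal sketch, with a single invented "jump distance" feature standing in for the real feature set:

```python
# Viterbi decoding a* = argmax_a p(a | e, f) for a chain-structured alignment
# CRF. Score(a) = sum_i sum_k w_k f_k(a_i, a_{i-1}, i, e, f); since Z_w(e, f)
# is constant in a, maximizing the unnormalized score suffices.

def local_score(ai, ai_prev, i, e, f, w):
    # Invented single feature: prefer small jumps between adjacent alignments.
    return w[0] * -abs(ai - ai_prev)

def viterbi_align(e, f, w):
    n, m = len(f), len(e)
    # best[j] = max score of a prefix ending with a_i = j.
    # Assume a virtual previous alignment at position 0 for the first word.
    best = [local_score(j, 0, 0, e, f, w) for j in range(n + 1)]
    back = []
    for i in range(1, m):
        prev_best = best
        best, ptr = [], []
        for j in range(n + 1):                     # O(n^2) per target position
            cands = [prev_best[k] + local_score(j, k, i, e, f, w)
                     for k in range(n + 1)]
            k_star = max(range(n + 1), key=lambda k: cands[k])
            best.append(cands[k_star])
            ptr.append(k_star)
        back.append(ptr)
    # Follow back-pointers from the best final state.
    j = max(range(n + 1), key=lambda jj: best[jj])
    a = [j]
    for ptr in reversed(back):
        j = ptr[j]
        a.append(j)
    return list(reversed(a))
```

Replacing max with logsumexp in the same recursion computes Z_w(e, f) itself, which is what training needs.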

SLIDE 27

Model

  • Labels (one per target word) index positions in the source sentence
  • Train models for both (e, f) and (f, e) [inverting the reference alignments]

SLIDE 28

Experiments

SLIDES 29-34

Feature types, illustrated on the pair "pervez musharrafs langer abschied" ↔ "pervez musharraf 's long goodbye":

  • Identical word
  • Matching prefix
  • Matching suffix
  • Orthographic similarity
  • In dictionary
  • ...

SLIDE 35

Lexical Features

  • Word ↔ word indicator features
  • Various word ↔ word co-occurrence scores
  • IBM Model 1 probabilities (t→s, s→t)
  • Geometric mean of Model 1 probabilities
  • Dice's coefficient [binned]
  • Products of the above
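Dice's coefficient for a word pair can be computed from sentence-level co-occurrence counts. A sketch, assuming dice(e, f) = 2·C(e, f) / (C(e) + C(f)) where C(e, f) counts sentence pairs containing both words, over an invented two-sentence corpus:

```python
# Tiny invented parallel corpus: (English sentence, German sentence) pairs.
corpus = [
    (["the", "house"], ["das", "haus"]),
    (["the", "book"], ["das", "buch"]),
]

def dice(e_word, f_word, corpus):
    c_e = sum(1 for e, f in corpus if e_word in e)                    # C(e)
    c_f = sum(1 for e, f in corpus if f_word in f)                    # C(f)
    c_ef = sum(1 for e, f in corpus if e_word in e and f_word in f)   # C(e, f)
    return 2.0 * c_ef / (c_e + c_f) if c_e + c_f else 0.0
```

For the CRF, the real-valued score would then be bucketed into binned indicator features rather than used raw.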

SLIDE 36

Lexical Features

  • Word class ↔ word class indicator
    • NN translates as NN (NN_NN=1)
    • NN does not translate as MD (NN_MD=1)
  • Identical word feature
    • 2010 = 2010 (IdentWord=1 IdentNum=1)
  • Identical prefix feature
    • Obama ~ Obamu (IdentPrefix=1)
  • Orthographic similarity measure [binned]
    • Al-Qaeda ~ Al-Kaida (OrthoSim050_080=1)
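One way to realize the binned orthographic similarity feature: normalized edit distance, bucketed into indicator features like OrthoSim050_080. The similarity measure and the bin boundaries here are assumptions, not the authors' exact recipe:

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def ortho_sim(a, b):
    # Similarity in [0, 1]: 1 - edit_distance / max length (an assumption).
    m = max(len(a), len(b))
    return 1.0 - edit_distance(a.lower(), b.lower()) / m if m else 1.0

def ortho_bin(a, b, bins=((0.5, 0.8), (0.8, 1.0 + 1e-9))):
    # Fire one indicator feature per bin the similarity falls into.
    s = ortho_sim(a, b)
    return {f"OrthoSim{int(lo * 100):03d}_{int(hi * 100):03d}": 1
            for lo, hi in bins if lo <= s < hi}
```

For example, "Al-Qaeda" vs. "Al-Kaida" differ in two of eight characters, giving similarity 0.75 and firing OrthoSim050_080, matching the slide's example.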

SLIDE 37

Other Features

  • Compute features from large amounts of unlabeled text
  • Does the Model 4 alignment contain this alignment point?
  • What is the Model 1 posterior probability of this alignment point?

SLIDE 38

Results

SLIDES 39-40

Summary

Unfortunately, you need gold alignments!

SLIDES 41-44

Putting the pieces together

  • We have seen how to model the following: p(e), p(e | f, m), p(e, a | f, m), p(a | e, f)
  • Goal: a better model of p(e | f, m) that knows about p(e)

SLIDE 45

"One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.'"

Warren Weaver to Norbert Wiener, March 1947

SLIDES 46-49

Claude Shannon. "A Mathematical Theory of Communication." 1948.

Message M → Encoder → sent transmission Y → "noisy" channel → received transmission X → Decoder → recovered message M′

The source is modeled by p(y) and the channel by p(x | y).

Shannon's theory tells us:
1) how much data you can send
2) the limits of compression
3) why your download is so slow
4) how to translate

SLIDES 50-55

The decoder's job is to recover Y′, the most likely sent transmission given the received X:

y′ = argmax_y p(y | x) = argmax_y p(x | y) p(y) / p(x) = argmax_y p(x | y) p(y)

Note that p(y | x) ≠ p(x | y); Bayes' rule ("I can help.") relates the two.

SLIDES 56-59

y′ = argmax_y p(y | x) = argmax_y p(x | y) p(y) / p(x) = argmax_y p(x | y) p(y)

The denominator p(x) doesn't depend on y, so it can be dropped from the argmax:

y′ = argmax_y p(x | y) p(y)

SLIDES 60-63

For translation, the message is English, the received transmission is "French," and the recovered message is English′:

y′ = argmax_y p(x | y) p(y)

e′ = argmax_e p(f | e) p(e)

Here p(f | e) is the translation model and p(e) is the language model. Other noisy channel applications: OCR, speech recognition, spelling correction, ...

SLIDE 64

Division of labor

  • Translation model
    • probability of translating back into the source
    • ensures adequacy of the translation
  • Language model
    • is a translation hypothesis "good" English?
    • ensures fluency of the translation

SLIDES 65-67

English: p(e) and p(f | e)

e* = argmax_e p(e | f) = argmax_e p(f | e) × p(e)

SLIDE 68

Announcements

  • Upcoming language-in-10
    • Tuesday: Jon/Austin - Russian (Русский)
  • Leaderboard is functional