SLIDE 1

Discriminative Training

February 19, 2013


SLIDE 2

Noisy Channels Again

[Diagram: an English source generates e with probability p(e).]

SLIDE 3

Noisy Channels Again

[Diagram: an English source generates e with probability p(e); the channel then produces German g with probability p(g | e).]

SLIDE 4

Noisy Channels Again

[Diagram: English source p(e), channel p(g | e), and a decoder that recovers e from the observed German g.]

e* = argmax_e p(e | g)
   = argmax_e p(g | e) · p(e) / p(g)
   = argmax_e p(g | e) · p(e)


SLIDE 6

Noisy Channels Again

e* = argmax_e p(e | g)
   = argmax_e p(g | e) · p(e) / p(g)
   = argmax_e p(g | e) · p(e)
   = argmax_e log p(g | e) + log p(e)
   = argmax_e wᵀ h(g, e),  where w = [1, 1]ᵀ and h(g, e) = [log p(g | e), log p(e)]ᵀ

SLIDE 8

Noisy Channels Again

Does this look familiar?

SLIDE 9

The Noisy Channel

[Figure: candidate translations plotted in a plane whose axes are the two features, log p(g | e) and log p(e).]

SLIDE 13

As a Linear Model

[Figure: the same candidates in the (log p(g | e), log p(e)) plane, now scored along a weight vector w.]

Improvement 1: change w to find better translations

SLIDE 17

As a Linear Model

[Figure: the same plane; no reorientation of w separates the good candidates from the bad ones.]

Improvement 2: add dimensions to make points separable

SLIDE 18

Linear Models

  • Improve the modeling capacity of the noisy channel in two ways
  • Reorient the weight vector
  • Add new dimensions (new features)
  • Questions
  • What features?
  • How do we set the weights?

e* = argmax_e wᵀ h(g, e)

SLIDE 20

Mann beißt Hund  (German: “man bites dog”)

x BITES y

SLIDE 21

Candidate translations of “Mann beißt Hund”:

  • man bites cat
  • man bite cat
  • man bites dog
  • man chase dog
  • man bite dog
  • man bites dog

x BITES y

SLIDE 28

Feature Classes

Lexical
bank = “River bank” vs. “Financial institution”
Are lexical choices appropriate?

Configurational
Are semantic/syntactic relations preserved?
“Dog bites man” vs. “Man bites dog”

Grammatical
Is the output fluent / well-formed?
“Man bites dog” vs. “Man bite dog”

SLIDE 31

What do lexical features look like?

Mann beißt Hund → man bites cat

First attempt:

score(g, e) = wᵀ h(g, e)

h₁₅₃₄₂(g, e) = 1 if ∃ i, j : gᵢ = Hund and eⱼ = cat; 0 otherwise

SLIDE 32

But what if a cat is being chased by a Hund?

SLIDE 33

What do lexical features look like?

Latent variables enable more precise features:

score(g, e, a) = wᵀ h(g, e, a)

h₁₅₃₄₂(g, e, a) = Σ_{(i,j) ∈ a} [1 if gᵢ = Hund and eⱼ = cat; 0 otherwise]
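A small sketch of the two definitions (the index 15,342 is just the slides' example slot in a long feature vector; the helper names here are mine):

```python
def h_cooccurrence(g, e):
    """First attempt: fires whenever 'Hund' appears anywhere in the source
    and 'cat' anywhere in the hypothesis, even if they are unrelated."""
    return 1 if "Hund" in g and "cat" in e else 0

def h_aligned(g, e, a):
    """Latent-variable version: counts alignment links (i, j) in a that
    actually pair 'Hund' with 'cat'."""
    return sum(1 for (i, j) in a if g[i] == "Hund" and e[j] == "cat")

g = "Mann beißt Hund".split()
e = "man bites cat".split()
a = {(0, 0), (1, 1), (2, 2)}      # word alignment: Hund <-> cat
print(h_cooccurrence(g, e), h_aligned(g, e, a))   # 1 1
```

The second version answers the objection on slide 32: if the cat is merely being chased by a Hund, no alignment link pairs the two words, so the feature stays 0.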

SLIDE 34

Standard Features

  • Target side features
  • log p(e) [ n-gram language model ]
  • Number of words in hypothesis
  • Non-English character count
  • Source + target features
  • log relative frequency e|f of each rule [ log #(e,f) - log #(f) ]
  • log relative frequency f|e of each rule [ log #(e,f) - log #(e) ]
  • “lexical translation” log probability e|f of each rule [ ≈ log pmodel1(e|f) ]
  • “lexical translation” log probability f|e of each rule [ ≈ log pmodel1(f|e) ]
  • Other features
  • Count of rules/phrases used
  • Reordering pattern probabilities

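The relative-frequency features are simple count ratios; a sketch with made-up rule counts (the numbers and names are illustrative only):

```python
import math
from collections import Counter

# Hypothetical co-occurrence counts over extracted rules/phrases.
pair_counts = Counter({("man", "Mann"): 80, ("husband", "Mann"): 20})
e_counts = Counter({"man": 90, "husband": 25})
f_counts = Counter({"Mann": 100})

def log_relfreq_e_given_f(e, f):
    # log p(e|f) = log #(e,f) - log #(f)
    return math.log(pair_counts[(e, f)]) - math.log(f_counts[f])

def log_relfreq_f_given_e(e, f):
    # log p(f|e) = log #(e,f) - log #(e)
    return math.log(pair_counts[(e, f)]) - math.log(e_counts[e])

print(log_relfreq_e_given_f("man", "Mann"))   # log(80/100)
```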

SLIDE 35

Parameter Learning

SLIDE 36

Hypothesis Space

[Figure: decoder hypotheses plotted as points in feature space (h1, h2).]

SLIDE 38

Hypothesis Space

[Figure: the same space, now also marking the reference translations.]

SLIDE 39

Preliminaries

We assume a decoder that computes:

⟨e*, a*⟩ = argmax_⟨e,a⟩ wᵀ h(g, e, a)

And K-best lists of the same, that is:

{⟨eᵢ*, aᵢ*⟩}, i = 1…K = arg ith-max_⟨e,a⟩ wᵀ h(g, e, a)

Standard, efficient algorithms exist for this.
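As a toy stand-in for the assumed decoder (real systems get the K-best by dynamic programming over a packed lattice or hypergraph), one can imagine scoring an explicit candidate list; `space` below is a hypothetical list of ((e, a), h) pairs:

```python
import heapq

def score(w, h):
    """w . h(g, e, a) for a feature vector h."""
    return sum(wi * hi for wi, hi in zip(w, h))

def kbest(space, w, K):
    """The K highest-scoring <e, a> pairs from an explicit candidate list."""
    return heapq.nlargest(K, space, key=lambda pair: score(w, pair[1]))
```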

SLIDE 40

Learning Weights

  • Try to match the reference translation exactly
  • Conditional random field
  • Maximize the conditional probability of the reference translations
  • “Average” over the different latent variables
  • Max-margin
  • Find the weight vector that separates the reference translation from others by the maximal margin
  • Maximal setting of the latent variables

SLIDE 42

Problems

  • These methods give “full credit” when the model exactly produces the reference, and no credit otherwise
  • What is the problem with this?
  • There are many ways to translate a sentence
  • What if we have multiple reference translations?
  • What about partial credit?

SLIDE 44

Cost-Sensitive Training

  • Assume we have a cost function that gives a score for how good/bad a translation ê is relative to the reference set E:

Δ(ê, E) ∈ [0, 1]

  • Optimize the weight vector by making reference to this function
  • We will talk about two ways to do this
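One plausible instantiation of Δ, assuming (this is not specified on the slide) that we take 1 minus a smoothed sentence-level BLEU against the reference set E:

```python
import math
from collections import Counter

def cost(hyp, refs, max_n=4):
    """Delta(hyp, E) in [0, 1]: 0 = perfect match, near 1 = no overlap."""
    hyp = hyp.split()
    refs = [r.split() for r in refs]
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        matches = 0
        for ng, c in h_ngrams.items():
            best = max(
                Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))[ng]
                for r in refs
            )
            matches += min(c, best)
        # add-one smoothing keeps the log defined for higher-order n-grams
        log_prec += math.log((matches + 1) / (sum(h_ngrams.values()) + 1)) / max_n
    ref_len = min(len(r) for r in refs)
    brevity = min(1.0, math.exp(1 - ref_len / max(len(hyp), 1)))
    return 1.0 - brevity * math.exp(log_prec)

print(cost("man bites dog", ["man bites dog"]))   # 0.0
```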

SLIDE 45

K-Best List Example

[Figure: the 10-best hypotheses #1–#10 in feature space (h1, h2), ranked along the weight vector w.]

SLIDE 47

K-Best List Example

[Figure: the same 10-best list, with each hypothesis shaded by its cost, binned into 0.0–0.2, 0.2–0.4, 0.4–0.6, 0.6–0.8, and 0.8–1.0.]

SLIDE 48

Training as Classification

  • Pairwise Ranking Optimization (PRO)
  • Reduce the training problem to binary classification with a linear model
  • Algorithm (a sketch follows below)
  • For i = 1 to N
  • Pick a random pair of hypotheses (A, B) from the K-best list
  • Use the cost function to determine whether A or B is better
  • Create the ith training instance
  • Train a binary linear classifier
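A sketch of the reduction, assuming each K-best entry is a (feature_vector, cost) pair; any off-the-shelf binary linear learner works for the last step, so a few perceptron epochs stand in for it here:

```python
import random

def pro_instances(kbest, n_samples):
    """Sample hypothesis pairs and turn each into one binary instance."""
    data = []
    for _ in range(n_samples):
        (hA, cA), (hB, cB) = random.sample(kbest, 2)
        if cA == cB:
            continue                          # no preference: skip the pair
        y = 1 if cA < cB else -1              # +1 iff A has lower cost (is better)
        x = [a - b for a, b in zip(hA, hB)]   # feature difference hA - hB
        data.append((x, y))
    return data

def train_binary(data, epochs=10):
    """Perceptron on the difference vectors: learn w such that
    w . (hA - hB) > 0 exactly when A is the better hypothesis."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w
```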

SLIDE 49

[Figures (slides 49–58): pairs of hypotheses sampled from the 10-best list; the cost function labels each sampled pair “Better!” or “Worse!”, and each comparison becomes one binary training instance in (h1, h2) space.]

SLIDE 60

Fit a linear model

[Figure: a linear separator w fit to the labeled pairwise instances in (h1, h2) space.]

SLIDE 61

K-Best List Example

[Figure: the 10-best hypotheses and their cost bins, now ranked along the updated weight vector w.]

SLIDE 62

MERT

  • Minimum Error Rate Training
  • Directly target an automatic evaluation metric
  • BLEU is defined at the corpus level
  • MERT optimizes at the corpus level
  • Downsides
  • Does not deal well with more than ~20 features

SLIDE 63

MERT

Given a weight vector w, any hypothesis ⟨e, a⟩ has a (scalar) score:

m = wᵀ h(g, e, a)

Now pick a search vector v, and consider how the score of this hypothesis changes as we move along it, w_new = w + γv:

m = (w + γv)ᵀ h(g, e, a)
  = wᵀ h(g, e, a) + γ · vᵀ h(g, e, a)
  = aγ + b,  where a = vᵀ h(g, e, a) and b = wᵀ h(g, e, a)

A linear function in 2D!
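A sketch of this reparameterization for a whole K-best list (helper names are mine): each hypothesis's feature vector h becomes one line m(γ) = aγ + b, and the 1-best hypothesis as a function of γ is the upper envelope of those lines, computable with a standard convex-hull sweep:

```python
def dot(u, x):
    return sum(ui * xi for ui, xi in zip(u, x))

def as_line(w, v, h):
    """(slope, intercept): m(gamma) = dot(v, h) * gamma + dot(w, h)."""
    return dot(v, h), dot(w, h)

def upper_envelope(lines):
    """Indices of the (slope, intercept) lines on the upper envelope,
    ordered left to right in gamma."""
    order = sorted(range(len(lines)), key=lambda i: lines[i])
    hull = []
    def dominated(i, j, k):
        # line j never reaches the top if lines i and k cross above it
        (a1, b1), (a2, b2), (a3, b3) = lines[i], lines[j], lines[k]
        return (b3 - b1) * (a2 - a1) >= (b2 - b1) * (a3 - a1)
    for i in order:
        while hull and lines[hull[-1]][0] == lines[i][0]:
            hull.pop()     # equal slopes: keep the larger intercept (i)
        while len(hull) >= 2 and dominated(hull[-2], hull[-1], i):
            hull.pop()
        hull.append(i)
    return hull
```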

SLIDE 70

MERT

Recall our k-best set {⟨eᵢ*, aᵢ*⟩}, i = 1…K.

[Figure: each k-best hypothesis is one line m = aγ + b in the (γ, m) plane.]

SLIDE 73

MERT

[Figure: the upper envelope of the score lines is piecewise linear; here three hypotheses, ⟨e*_162, a*_162⟩, ⟨e*_28, a*_28⟩, and ⟨e*_73, a*_73⟩, are each 1-best on some interval of γ.]

SLIDE 75

MERT

[Figure: beneath the envelope, the error count is plotted against γ; it is constant on each interval where the 1-best hypothesis does not change.]

SLIDE 78

MERT

[Figure: the resulting errors-vs-γ step function, with the minimizing value of γ marked.]

SLIDE 79

[Figure: errors as a function of γ.]

Let γ* be the value of γ that minimizes errors. Then update:

w_new = γ*v + w
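Putting the pieces together, a simplified single-sentence sweep (a sketch: real MERT accumulates per-interval sufficient statistics for corpus BLEU before choosing γ*; here each hypothesis just carries a scalar cost, and `upper_envelope` is the sketch from slide 63):

```python
def best_gamma(lines, costs):
    """Probe one gamma inside each envelope segment; return the gamma
    whose 1-best hypothesis has the lowest cost."""
    hull = upper_envelope(lines)
    if len(hull) == 1:
        return 0.0                        # one hypothesis wins everywhere
    # breakpoints where the 1-best hypothesis changes
    xs = [(lines[i][1] - lines[j][1]) / (lines[j][0] - lines[i][0])
          for i, j in zip(hull, hull[1:])]
    # one probe per segment: before the first breakpoint, between
    # consecutive breakpoints, and after the last one
    probes = ([xs[0] - 1.0]
              + [(u + v) / 2 for u, v in zip(xs, xs[1:])]
              + [xs[-1] + 1.0])
    gamma_star, _ = min(zip(probes, hull), key=lambda p: costs[p[1]])
    return gamma_star

def update(w, v, gamma_star):
    """w_new = gamma* v + w."""
    return [wi + gamma_star * vi for wi, vi in zip(w, v)]
```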

SLIDE 80

MERT

  • In practice, “errors” are sufficient statistics for evaluation metrics (e.g., BLEU)
  • Can maximize or minimize!
  • The envelope can also be computed using dynamic programming
  • Interesting complexity bounds
  • How do you pick the search direction?

SLIDE 81

Summary

  • Evaluation metrics
  • Figure out how well we’re doing
  • Figure out if a feature helps
  • But ALSO: train your system!
  • What’s a great way to improve translation?
  • Improve evaluation!

SLIDE 82

Thank You!