Guy Dar, Machine Translation Seminar, Tel Aviv University, 2014
} Problems:
- Poor grammar.
- The distortion model is local. (An instance of the former)
} Solution (?): Unsupervised syntax-based translation model.
} Which means: no predefined linguistic rules.
} The system learns from a bilingual corpus.
Mandarin (Chinese): Aozhou shi yu Bei Han you bangjiao
Australia is with North Korea have diplomatic relations
de shaoshu guojia zhiyi
that few countries one of
Correct translation: Australia is one of the few countries that have diplomatic relations with North Korea.
Note: Correct translation requires reversing 5 elements.
} Idea: Translating 'linguistic' structures: "templates" to templates, and not phrases to phrases.
} How? Rules! For example:
- [1] de [2] → the [2] that [1]
- [1] zhiyi → one of [1]
- yu [1] you [2] → have [2] with [1]
} We can apply rules recursively.
} This way we can derive the correct translation, as the sketch below illustrates.
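To make the recursion concrete, here is a toy Python sketch of the three example rules applied recursively (the string-rewriting representation, the mini-lexicon, and the word-by-word handling of the sentence prefix are our own simplifications, standing in for real parsing; none of this is the paper's code):

```python
import re

# The three example rules: a source-side pattern plus a function that reorders
# the translated sub-spans on the English side.
RULES = [
    (r'(.+) zhiyi',       lambda a:    f'one of {a}'),         # [1] zhiyi -> one of [1]
    (r'(.+) de (.+)',     lambda a, b: f'the {b} that {a}'),   # [1] de [2] -> the [2] that [1]
    (r'yu (.+) you (.+)', lambda a, b: f'have {b} with {a}'),  # yu [1] you [2] -> have [2] with [1]
]

# Toy word-level lexicon for the running example.
LEXICON = {'Aozhou': 'Australia', 'shi': 'is', 'Bei': 'North', 'Han': 'Korea',
           'bangjiao': 'diplomatic relations', 'shaoshu': 'few', 'guojia': 'countries'}

def translate(src: str) -> str:
    words = src.split()
    # Crude stand-in for glue behaviour: emit a known leading word, then
    # recurse on the remainder of the sentence.
    if words[0] in LEXICON:
        rest = ' '.join(words[1:])
        return LEXICON[words[0]] + (' ' + translate(rest) if rest else '')
    # Otherwise apply the first reordering rule covering the whole span,
    # translating each captured sub-span recursively.
    for pattern, reorder in RULES:
        m = re.fullmatch(pattern, src)
        if m:
            return reorder(*(translate(g) for g in m.groups()))
    return src  # untranslated residue is left as-is

print(translate('Aozhou shi yu Bei Han you bangjiao de shaoshu guojia zhiyi'))
# Australia is one of the few countries that have diplomatic relations with North Korea
```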
} Formal construction:
- Each rule will be of the following form:
X → <α, γ, ~>
where X is a non-terminal (variable), α is a string in the source language, and γ is a string in the target language. Both strings consist of non-terminals and terminals, and ~ is a one-to-one correspondence between the non-terminals on the source and target sides.
} In our model, we will use only two non-terminals: S and X.
} Our system will learn rules from the bilingual corpus only.
} The only rules we add manually are two glue rules:
- S → <S[1] X[2], S[1] X[2]>
- S → <X[1], X[1]>
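One possible encoding of such rules in Python (a minimal sketch; the tuple representation and names are assumptions, not the paper's code). Non-terminals carry an index, and the shared index between the two sides is the correspondence ~:

```python
from dataclasses import dataclass
from typing import Tuple, Union

Sym = Union[str, Tuple[str, int]]  # a terminal word, or ('S'/'X', index)

@dataclass(frozen=True)
class Rule:
    lhs: str              # 'S' or 'X' -- the only two non-terminals
    src: Tuple[Sym, ...]  # alpha, the source side
    tgt: Tuple[Sym, ...]  # gamma, the target side; shared indices encode ~

S1, X1, X2 = ('S', 1), ('X', 1), ('X', 2)

# The two hand-written glue rules:
GLUE = [
    Rule('S', (S1, X2), (S1, X2)),  # S -> <S[1] X[2], S[1] X[2]>
    Rule('S', (X1,), (X1,)),        # S -> <X[1], X[1]>
]

# A learned rule, e.g. X -> <X[1] de X[2], the X[2] that X[1]>:
de_rule = Rule('X', (X1, 'de', X2), ('the', X2, 'that', X1))
```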
An example derivation, starting from the initial pair:
<S[1], S[1]>
→ <S[2] X[3], S[2] X[3]>  by S → <S[1] X[2], S[1] X[2]>
→ <S[4] X[5] X[3], S[4] X[5] X[3]>  by S → <S[1] X[2], S[1] X[2]>
→ <X[6] X[5] X[3], X[6] X[5] X[3]>  by S → <X[1], X[1]>
→ <Aozhou X[5] X[3], Australia X[5] X[3]>  by X → <Aozhou, Australia>
→ <Aozhou shi X[3], Australia is X[3]>  by X → <shi, is>
→ <Aozhou shi X[7] zhiyi, Australia is one of X[7]>  by X → <X[1] zhiyi, one of X[1]>
→ <Aozhou shi X[8] de X[9] zhiyi, Australia is one of the X[9] that X[8]>  by X → <X[1] de X[2], the X[2] that X[1]>
→ <Aozhou shi yu X[1] you X[2] de X[9] zhiyi, Australia is one of the X[9] that have X[2] with X[1]>  by X → <yu X[1] you X[2], have X[2] with X[1]>
} Let us now return to our system.
} Every rule gets a weight (log-linear model):
w(X → <α, γ>) = ∏_i φ_i(X → <α, γ>)^λ_i
} The φ_i are called the features.
} The λ_i are the feature weights.
} In our design, we have the following features:
- P(γ|α) – the probability that α is translated as γ.
- P(α|γ) – the other way around.
- Pw(α|γ), Pw(γ|α) – lexical weights, which estimate how well the individual words are translated. (Based on the word alignment)
- Phrase penalty – a constant e = exp(1); we use it to penalize (or encourage?) long derivations.
} Two special rules:
- w(S → <S[1] X[2], S[1] X[2]>) = exp(-λg)
- w(S → <X[1], X[1]>) = 1
} We also give weights to derivations (a derivation is a sequence of rules). For every derivation D:
w(D) = ∏_{r ∈ D} w(r) · p_lm(e)^λ_lm · exp(−λ_wp·|e|)
where the product is over all rules used in D, p_lm is the language model, and exp(−λ_wp·|e|) is the word penalty, which discourages the use of too many words. (As opposed to the phrase penalty)
} Note: For things to work out, we must integrate the extra factors into the rule weights.
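As a rough sketch of how these scores compose (symbol names follow the slides; the feature values and λ's would come from training, not from here):

```python
import math

def rule_weight(phis, lambdas):
    """Log-linear rule weight: prod_i phi_i ^ lambda_i.
    (In practice one sums logs instead, to avoid numerical underflow.)"""
    return math.prod(phi ** lam for phi, lam in zip(phis, lambdas))

def derivation_weight(rule_phis, lambdas, p_lm_e, lam_lm, lam_wp, e_len):
    """w(D) = prod_{r in D} w(r) * p_lm(e)^lam_lm * exp(-lam_wp * |e|)."""
    rules_part = math.prod(rule_weight(phis, lambdas) for phis in rule_phis)
    return rules_part * (p_lm_e ** lam_lm) * math.exp(-lam_wp * e_len)
```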
} Input: A word-aligned bilingual corpus. (Many-to-many alignments)
} Objective: Learn hierarchical rules.
} We are given a pair of word-aligned sentences <f, e, ~> (f for French, e for English, ~ is the word alignment).
} Big picture: First we extract initial phrase pairs, then we refine them into more "sophisticated" rules.
} An initial phrase pair is a pair <f', e'> such that:
- f' is a substring of f, and e' is a substring of e (a substring must be of the form str[i:j]; no 'holes' are allowed)
- All words in f' are aligned to words in e'
- And vice versa: no words outside f' are mapped to e'
} Does this remind you of something?
Philipp Koehn, http://www.statmt.org/book/slides/05-phrase-based-models.pdf
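This is exactly the consistency criterion from phrase-based extraction. A compact sketch of it in Python (variable names are ours; the max_len default mirrors the length-10 limit discussed later):

```python
def initial_phrase_pairs(f, e, alignment, max_len=10):
    """f, e: lists of words; alignment: set of (i, j) with f[i] aligned to e[j]."""
    pairs = []
    aligned_src = {i for (i, _) in alignment}
    for i1 in range(len(f)):
        for i2 in range(i1, min(i1 + max_len, len(f))):
            span = range(i1, i2 + 1)
            # "All words in f' are aligned to words in e'":
            if not all(i in aligned_src for i in span):
                continue
            js = {j for (i, j) in alignment if i in span}
            j1, j2 = min(js), max(js)
            # "No words outside f' are mapped to e'":
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.append((tuple(f[i1:i2 + 1]), tuple(e[j1:j2 + 1])))
    return pairs
```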
} Every initial phrase pair gives us a rule:
X → <f', e'>
} Now, we construct new rules from existing ones:
- If X → <α, γ> is a rule,
- and there's an initial phrase pair <f', e'> such that α = α1 f' α2 and γ = γ1 e' γ2,
- then add the rule:
X → <α1 X[k] α2, γ1 X[k] γ2>
} Practically, we use additional heuristics to make this procedure more efficient and less ambiguous.
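A sketch of the generalization step itself, reusing the tuple encoding from before (the helper name `subtract` is ours):

```python
def subtract(alpha, gamma, sub_f, sub_e, k):
    """Replace the nested initial phrase pair <sub_f, sub_e> by X[k] on both
    sides of the rule <alpha, gamma>; returns None if either side lacks it."""
    def replace(seq, sub):
        n = len(sub)
        for start in range(len(seq) - n + 1):
            if seq[start:start + n] == sub:
                return seq[:start] + (('X', k),) + seq[start + n:]
        return None  # sub-phrase is not a contiguous span of this side
    new_alpha, new_gamma = replace(alpha, sub_f), replace(gamma, sub_e)
    if new_alpha is not None and new_gamma is not None:
        return new_alpha, new_gamma

# Example: generalizing <yu Bei Han you bangjiao, have diplomatic relations
# with North Korea> with the sub-pair <(Bei, Han), (North, Korea)> yields
# X -> <yu X[1] you bangjiao, have diplomatic relations with X[1]>.
```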
} Our estimate will distribute weights equally among all initial phrase pairs;
} then, every initial phrase pair distributes its weight equally among all rules extracted from it.
} Now, we use this estimate to determine P(α|γ) and P(γ|α).
} Notice that we have yet to assign values to our feature weights.
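In code, this heuristic estimate might look like the following sketch (the data layout is an assumption):

```python
from collections import Counter, defaultdict

def estimate(extractions):
    """extractions: one list of (alpha, gamma) rules per occurrence of an
    initial phrase pair; each occurrence carries one unit of weight."""
    counts = Counter()
    for rules in extractions:
        for alpha, gamma in rules:          # split the unit weight equally
            counts[(alpha, gamma)] += 1.0 / len(rules)
    totals = defaultdict(float)
    for (alpha, _), c in counts.items():
        totals[alpha] += c
    # Relative frequencies give P(gamma | alpha); P(alpha | gamma) is symmetric.
    return {(a, g): c / totals[a] for (a, g), c in counts.items()}
```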
} We are given a sentence f in the foreign language.
} We try to find the derivation with the best score whose French side is f:
argmax_D w(D)   s.t.   f(D) = f
- The English side of this derivation will be our translation of f.
} Our algorithm is basically a CKY parser.
- CKY is an algorithm for checking whether a string belongs to the language of a CFG.
- There is a CKY variant for weighted CFGs.
} Since we cannot try all options, we use pruning techniques. (Similar to what we saw in Koehn's chapter on decoding: http://www.statmt.org/book/slides/06-decoding.pdf)
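For intuition, here is a minimal Viterbi-CKY for a weighted CFG in Chomsky normal form (the real decoder parses with the synchronous grammar and threads language-model state through the chart, with pruning; this sketch only shows the chart's shape, and the names are ours):

```python
def cky_best(words, lex, binary, start='S'):
    """lex: {(A, word): weight}; binary: {(A, B, C): weight} for rules A -> B C."""
    n = len(words)
    best = [[{} for _ in range(n + 1)] for _ in range(n + 1)]  # best[i][j][A]
    for i, w in enumerate(words):                      # length-1 spans
        for (A, word), wt in lex.items():
            if word == w and wt > best[i][i + 1].get(A, 0.0):
                best[i][i + 1][A] = wt
    for span in range(2, n + 1):                       # longer spans, bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                  # split point
                for (A, B, C), wt in binary.items():
                    if B in best[i][k] and C in best[k][j]:
                        cand = wt * best[i][k][B] * best[k][j][C]
                        if cand > best[i][j].get(A, 0.0):
                            best[i][j][A] = cand
    return best[0][n].get(start, 0.0)                  # weight of the best parse
```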
} Constituent (linguistics) – a single unit within a hierarchical structure.
} We can factor a constituent feature into the weight of a derivation D:
} For every rule r, f[i:j] is the slice of the French side that r is 'responsible for' (the [leaves of the] subtree derived from r).
} c(i,j) was learned from the Penn Chinese Treebank (ver. 3):
c(i,j) = 1 if f[i:j] is a constituent, 0 otherwise
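A sketch of one way this factors into w(D) (our reading; λ_c would be tuned with the other feature weights, and the constituent spans would come from a parser trained on the Penn Chinese Treebank):

```python
import math

def constituent_factor(rule_spans, parse_spans, lam_c):
    """rule_spans: the (i, j) slice each rule of D is responsible for;
    parse_spans: the set of (i, j) spans that are constituents of f."""
    c = lambda i, j: 1.0 if (i, j) in parse_spans else 0.0
    # Reward (rather than forbid) rules whose spans align with constituents.
    return math.exp(lam_c * sum(c(i, j) for (i, j) in rule_spans))
```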
} Languages: Mandarin to English
} Models compared:
- Pharaoh (Baseline)
- Hierarchical model
- Hierarchical model + constituent feature
} Training set:
- Translation model – FBIS corpus (7.2M + 9.2M words)
- Language model – English newswire text (155M words)
} Development set:
- 2002 NIST MT evaluation test set
} Test set:
- 2003 NIST MT evaluation test set
} Evaluation:
- BLEU
} Feature weights were tuned by running Minimum Error-Rate Training (MERT) on the development set.
} Tuning results
} The difference between the baseline and the hierarchical model is statistically significant.
} The new system improves on state-of-the-art results. (As of 2005)
} The constituent feature improves results only slightly. (Statistically insignificant)
} Further study suggests that increasing the maximum initial phrase length from 10 to 15 improves accuracy.
} David Chiang, A Hierarchical Phrase-Based Model for Statistical Machine Translation, http://www.aclweb.org/anthology/P05-1033
} Philipp Koehn, Statistical Machine Translation, http://www.statmt.org/book/
} Wikipedia:
- CYK algorithm [Last modified Dec. 16, 2014], http://en.wikipedia.org/wiki/CYK_algorithm
- Constituent (linguistics) [Last modified Nov. 17, 2014], http://en.wikipedia.org/wiki/Constituent_(linguistics)