Guy Dar, Machine Translation Seminar, Tel Aviv University, 2014
} Problems:
- Poor grammar.
- The distortion model is local. (An instance of the former)
} Solution (?): Unsupervised syntax-based translation model.
} Which means: no predefined linguistic rules.
} The system learns from a bilingual corpus.
Mandarin (Chinese): Aozhou shi yu Bei Han you bangjiao
Australia is with North Korea have diplomatic relations
de shaoshu guojia zhiyi
that few countries one of
Correct translation: Australia is one of the few countries that have diplomatic relations with North Korea.
Note: Correct translation requires reversing 5 elements.
} Idea: Translating 'linguistic' structures: "templates" to templates, and not phrases to phrases.
} How? Rules! For example:
- [1] de [2] → the [2] that [1]
- [1] zhiyi → one of [1]
- yu [1] you [2] → have [2] with [1]
} We can apply rules recursively.
} This way we can derive the correct translation, as the sketch below illustrates.
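To make the recursion concrete, here is a toy Python sketch of the three example rules applied recursively (the string-rewriting representation, the mini-lexicon, and the word-by-word handling of the sentence prefix are our own simplifications, standing in for real parsing; none of this is the paper's code):

```python
import re

# The three example rules: a source-side pattern plus a function that reorders
# the translated sub-spans on the English side.
RULES = [
    (r'(.+) zhiyi',       lambda a:    f'one of {a}'),         # [1] zhiyi -> one of [1]
    (r'(.+) de (.+)',     lambda a, b: f'the {b} that {a}'),   # [1] de [2] -> the [2] that [1]
    (r'yu (.+) you (.+)', lambda a, b: f'have {b} with {a}'),  # yu [1] you [2] -> have [2] with [1]
]

# Toy word-level lexicon for the running example.
LEXICON = {'Aozhou': 'Australia', 'shi': 'is', 'Bei': 'North', 'Han': 'Korea',
           'bangjiao': 'diplomatic relations', 'shaoshu': 'few', 'guojia': 'countries'}

def translate(src: str) -> str:
    words = src.split()
    # Crude stand-in for glue behaviour: emit a known leading word, then
    # recurse on the remainder of the sentence.
    if words[0] in LEXICON:
        rest = ' '.join(words[1:])
        return LEXICON[words[0]] + (' ' + translate(rest) if rest else '')
    # Otherwise apply the first reordering rule covering the whole span,
    # translating each captured sub-span recursively.
    for pattern, reorder in RULES:
        m = re.fullmatch(pattern, src)
        if m:
            return reorder(*(translate(g) for g in m.groups()))
    return src  # untranslated residue is left as-is

print(translate('Aozhou shi yu Bei Han you bangjiao de shaoshu guojia zhiyi'))
# Australia is one of the few countries that have diplomatic relations with North Korea
```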
} Formal construction:
- Each rule will be of the following form:
X → <α, γ, ~>
where X is a non-terminal (variable), α is a string in the source language, and γ is a string in the target language. Both strings consist of non-terminals and terminals, and ~ is a one-to-one correspondence between the non-terminals on the source and target sides.
} In our model, we will use only two non-terminals: S and X.
} Our system will learn rules from the bilingual corpus only.
} The only rules we add manually are two glue rules:
- S → <S[1] X[2], S[1] X[2]>
- S → <X[1], X[1]>
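One possible encoding of such rules in Python (a minimal sketch; the tuple representation and names are assumptions, not the paper's code). Non-terminals carry an index, and the shared index between the two sides is the correspondence ~:

```python
from dataclasses import dataclass
from typing import Tuple, Union

Sym = Union[str, Tuple[str, int]]  # a terminal word, or ('S'/'X', index)

@dataclass(frozen=True)
class Rule:
    lhs: str              # 'S' or 'X' -- the only two non-terminals
    src: Tuple[Sym, ...]  # alpha, the source side
    tgt: Tuple[Sym, ...]  # gamma, the target side; shared indices encode ~

S1, X1, X2 = ('S', 1), ('X', 1), ('X', 2)

# The two hand-written glue rules:
GLUE = [
    Rule('S', (S1, X2), (S1, X2)),  # S -> <S[1] X[2], S[1] X[2]>
    Rule('S', (X1,), (X1,)),        # S -> <X[1], X[1]>
]

# A learned rule, e.g. X -> <X[1] de X[2], the X[2] that X[1]>:
de_rule = Rule('X', (X1, 'de', X2), ('the', X2, 'that', X1))
```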
An example derivation, starting from the initial pair:
<S[1], S[1]>
→ <S[2] X[3], S[2] X[3]>  by S → <S[1] X[2], S[1] X[2]>
→ <S[4] X[5] X[3], S[4] X[5] X[3]>  by S → <S[1] X[2], S[1] X[2]>
→ <X[6] X[5] X[3], X[6] X[5] X[3]>  by S → <X[1], X[1]>
→ <Aozhou X[5] X[3], Australia X[5] X[3]>  by X → <Aozhou, Australia>
→ <Aozhou shi X[3], Australia is X[3]>  by X → <shi, is>
→ <Aozhou shi X[7] zhiyi, Australia is one of X[7]>  by X → <X[1] zhiyi, one of X[1]>
→ <Aozhou shi X[8] de X[9] zhiyi, Australia is one of the X[9] that X[8]>  by X → <X[1] de X[2], the X[2] that X[1]>
→ <Aozhou shi yu X[1] you X[2] de X[9] zhiyi, Australia is one of the X[9] that have X[2] with X[1]>  by X → <yu X[1] you X[2], have X[2] with X[1]>
} Let us now return to our system.
} Every rule gets a weight (log-linear model):
w(X → <α, γ>) = ∏_i φ_i(X → <α, γ>)^λ_i
} The φ_i are called the features.
} The λ_i are the feature weights.
} In our design, we have the following features:
- P(γ|α) – the probability that α is translated as γ.
- P(α|γ) – the other way around.
- Pw(α|γ), Pw(γ|α) – lexical weights, which estimate how well the individual words are translated. (Based on the word alignment)
- Phrase penalty – a constant e = exp(1); we use it to penalize (or encourage?) long derivations.
} Two special rules:
- w(S → <S[1] X[2], S[1] X[2]>) = exp(-λg)
- w(S → <X[1], X[1]>) = 1
} We also give weights to derivations (a derivation is a sequence of rules). For every derivation D:
w(D) = ∏_{r ∈ D} w(r) · p_lm(e)^λ_lm · exp(−λ_wp·|e|)
where the product is over all rules used in D, p_lm is the language model, and exp(−λ_wp·|e|) is the word penalty, which discourages the use of too many words. (As opposed to the phrase penalty)
} Note: For things to work out, we must integrate the extra factors into the rule weights.
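As a rough sketch of how these scores compose (symbol names follow the slides; the feature values and λ's would come from training, not from here):

```python
import math

def rule_weight(phis, lambdas):
    """Log-linear rule weight: prod_i phi_i ^ lambda_i.
    (In practice one sums logs instead, to avoid numerical underflow.)"""
    return math.prod(phi ** lam for phi, lam in zip(phis, lambdas))

def derivation_weight(rule_phis, lambdas, p_lm_e, lam_lm, lam_wp, e_len):
    """w(D) = prod_{r in D} w(r) * p_lm(e)^lam_lm * exp(-lam_wp * |e|)."""
    rules_part = math.prod(rule_weight(phis, lambdas) for phis in rule_phis)
    return rules_part * (p_lm_e ** lam_lm) * math.exp(-lam_wp * e_len)
```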
} Input: A word-aligned bilingual corpus. (Many-to-many alignments)
} Objective: Learn hierarchical rules.
} We are given a pair of word-aligned sentences <f, e, ~> (f for French, e for English, ~ is the word alignment).
} Big picture: First we extract initial phrase pairs, then we refine them into more "sophisticated" rules.
} An initial phrase pair is a pair <f', e'> such that:
- f' is a substring of f, and e' is a substring of e (a substring must be of the form str[i:j]; no 'holes' are allowed)
- All words in f' are aligned to words in e'
- And vice versa: no words outside f' are mapped to e'
} Does this remind you of something?
Philipp Koehn, http://www.statmt.org/book/slides/05-phrase-based-models.pdf
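This is exactly the consistency criterion from phrase-based extraction. A compact sketch of it in Python (variable names are ours; the max_len default mirrors the length-10 limit discussed later):

```python
def initial_phrase_pairs(f, e, alignment, max_len=10):
    """f, e: lists of words; alignment: set of (i, j) with f[i] aligned to e[j]."""
    pairs = []
    aligned_src = {i for (i, _) in alignment}
    for i1 in range(len(f)):
        for i2 in range(i1, min(i1 + max_len, len(f))):
            span = range(i1, i2 + 1)
            # "All words in f' are aligned to words in e'":
            if not all(i in aligned_src for i in span):
                continue
            js = {j for (i, j) in alignment if i in span}
            j1, j2 = min(js), max(js)
            # "No words outside f' are mapped to e'":
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.append((tuple(f[i1:i2 + 1]), tuple(e[j1:j2 + 1])))
    return pairs
```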
} Every initial phrase pair gives us a rule:
X → <f', e'>
} Now, we construct new rules from existing ones:
- If X → <α, γ> is a rule,
- and there's an initial phrase pair <f', e'> such that α = α1 f' α2 and γ = γ1 e' γ2,
- then add the rule:
X → <α1 X[k] α2, γ1 X[k] γ2>
} Practically, we use additional heuristics to make this procedure more efficient and less ambiguous.
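A sketch of the generalization step itself, reusing the tuple encoding from before (the helper name `subtract` is ours):

```python
def subtract(alpha, gamma, sub_f, sub_e, k):
    """Replace the nested initial phrase pair <sub_f, sub_e> by X[k] on both
    sides of the rule <alpha, gamma>; returns None if either side lacks it."""
    def replace(seq, sub):
        n = len(sub)
        for start in range(len(seq) - n + 1):
            if seq[start:start + n] == sub:
                return seq[:start] + (('X', k),) + seq[start + n:]
        return None  # sub-phrase is not a contiguous span of this side
    new_alpha, new_gamma = replace(alpha, sub_f), replace(gamma, sub_e)
    if new_alpha is not None and new_gamma is not None:
        return new_alpha, new_gamma

# Example: generalizing <yu Bei Han you bangjiao, have diplomatic relations
# with North Korea> with the sub-pair <(Bei, Han), (North, Korea)> yields
# X -> <yu X[1] you bangjiao, have diplomatic relations with X[1]>.
```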
} Our estimate will distribute weights equally among all initial phrase pairs;
} then, every initial phrase pair distributes its weight equally among all rules extracted from it.
} Now, we use this estimate to determine P(α|γ) and P(γ|α).
} Notice that we have yet to assign values to our feature weights.
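In code, this heuristic estimate might look like the following sketch (the data layout is an assumption):

```python
from collections import Counter, defaultdict

def estimate(extractions):
    """extractions: one list of (alpha, gamma) rules per occurrence of an
    initial phrase pair; each occurrence carries one unit of weight."""
    counts = Counter()
    for rules in extractions:
        for alpha, gamma in rules:          # split the unit weight equally
            counts[(alpha, gamma)] += 1.0 / len(rules)
    totals = defaultdict(float)
    for (alpha, _), c in counts.items():
        totals[alpha] += c
    # Relative frequencies give P(gamma | alpha); P(alpha | gamma) is symmetric.
    return {(a, g): c / totals[a] for (a, g), c in counts.items()}
```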
} We are given a sentence f in the foreign language.
} We try to find the derivation with the best score whose French side is f:
argmax_D w(D)   s.t.   f(D) = f
- The English side of this derivation will be our translation of f.
} Our algorithm is basically a CKY parser.
- CKY is an algorithm for checking whether a string belongs to the language of a CFG.
- There is a CKY variant for weighted CFGs.
} Since we cannot try all options, we use pruning techniques. (Similar to what we saw in Koehn's chapter on decoding: http://www.statmt.org/book/slides/06-decoding.pdf)
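For intuition, here is a minimal Viterbi-CKY for a weighted CFG in Chomsky normal form (the real decoder parses with the synchronous grammar and threads language-model state through the chart, with pruning; this sketch only shows the chart's shape, and the names are ours):

```python
def cky_best(words, lex, binary, start='S'):
    """lex: {(A, word): weight}; binary: {(A, B, C): weight} for rules A -> B C."""
    n = len(words)
    best = [[{} for _ in range(n + 1)] for _ in range(n + 1)]  # best[i][j][A]
    for i, w in enumerate(words):                      # length-1 spans
        for (A, word), wt in lex.items():
            if word == w and wt > best[i][i + 1].get(A, 0.0):
                best[i][i + 1][A] = wt
    for span in range(2, n + 1):                       # longer spans, bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                  # split point
                for (A, B, C), wt in binary.items():
                    if B in best[i][k] and C in best[k][j]:
                        cand = wt * best[i][k][B] * best[k][j][C]
                        if cand > best[i][j].get(A, 0.0):
                            best[i][j][A] = cand
    return best[0][n].get(start, 0.0)                  # weight of the best parse
```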
} Constituent (linguistics) – a single unit within a hierarchical structure.
} We can factor a constituent feature into the weight of a derivation D:
} For every rule r, f[i:j] is the slice of the French side that r is 'responsible for' (the [leaves of the] subtree derived from r).
} c(i,j) was learned from the Penn Chinese Treebank (ver. 3):
c(i,j) = 1 if f[i:j] is a constituent, 0 otherwise
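A sketch of one way this factors into w(D) (our reading; λ_c would be tuned with the other feature weights, and the constituent spans would come from a parser trained on the Penn Chinese Treebank):

```python
import math

def constituent_factor(rule_spans, parse_spans, lam_c):
    """rule_spans: the (i, j) slice each rule of D is responsible for;
    parse_spans: the set of (i, j) spans that are constituents of f."""
    c = lambda i, j: 1.0 if (i, j) in parse_spans else 0.0
    # Reward (rather than forbid) rules whose spans align with constituents.
    return math.exp(lam_c * sum(c(i, j) for (i, j) in rule_spans))
```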
} Languages: Mandarin to English
} Models compared:
- Pharaoh (Baseline)
- Hierarchical model
- Hierarchical model + constituent feature
} Training set:
- Translation model – FBIS corpus (7.2M + 9.2M words)
- Language model – English newswire text (155M words)
} Development set:
- 2002 NIST MT evaluation test set
} Test set:
- 2003 NIST MT evaluation test set
} Evaluation:
- BLEU
} Feature weights were tuned by running Minimum Error-Rate Training (MERT) on the development set.
} Tuning results
} The difference between the baseline and the hierarchical model is statistically significant.
} The new system improves on state-of-the-art results. (As of 2005)
} The constituent feature improves results only slightly. (Statistically insignificant)
} Further study suggests that increasing the maximum initial phrase length from 10 to 15 improves accuracy.
} David Chiang, A Hierarchical Phrase-Based Model for Statistical Machine Translation, http://www.aclweb.org/anthology/P05-1033
} Philipp Koehn, Statistical Machine Translation, http://www.statmt.org/book/
} Wikipedia:
- CYK algorithm [Last modified Dec. 16, 2014], http://en.wikipedia.org/wiki/CYK_algorithm
- Constituent (linguistics) [Last modified Nov. 17, 2014], http://en.wikipedia.org/wiki/Constituent_(linguistics)