SLIDE 1

Statistical machine translation in a few slides

Mikel L. Forcada¹,²

¹Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, E-03071 Alacant (Spain)

²Prompsit Language Engineering, S.L., E-03690 St. Vicent del Raspeig (Spain)

April 14-16, 2009: Free/open-source MT tutorial at the CNGL

SLIDE 2

Translation as probability/1

◮ Instead of saying that
  ◮ a source-language (SL) sentence s in an SL text
  ◮ and a target-language (TL) sentence t
  as found in an SL–TL bitext are or are not a translation of each other,

◮ in SMT one says that they are a translation of each other with a probability p(s, t) = p(t, s) (a joint probability).

◮ We'll assume we have such a probability model available, or at least a reasonable estimate of it.

SLIDE 3

Translation as probability/2

◮ According to basic probability laws, we can write

  p(s, t) = p(t, s) = p(s|t) p(t) = p(t|s) p(s)   (1)

  where p(x|y) is the conditional probability of x given y.

◮ We are interested in translating from SL to TL. That is, we want to find the most likely translation given the SL sentence s:

  t⋆ = arg max_t p(t|s)   (2)
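To make eqs. (1) and (2) concrete, here is a minimal Python sanity check, assuming a hypothetical joint distribution p(s, t) over two SL and two TL sentences; all the probability values are invented for illustration.

```python
# Toy joint distribution p(s, t); all values are invented and sum to 1.
p_joint = {
    ("la casa", "the house"): 0.40,
    ("la casa", "home"):      0.10,
    ("el gat", "the cat"):    0.45,
    ("el gat", "home"):       0.05,
}

def p_s(s):  # marginal p(s) = sum over t of p(s, t)
    return sum(p for (s2, _), p in p_joint.items() if s2 == s)

def p_t(t):  # marginal p(t) = sum over s of p(s, t)
    return sum(p for (_, t2), p in p_joint.items() if t2 == t)

s, t = "la casa", "the house"
# Conditional probabilities obtained from the joint and the marginals:
p_s_given_t = p_joint[(s, t)] / p_t(t)
p_t_given_s = p_joint[(s, t)] / p_s(s)
# Eq. (1): p(s|t) p(t) = p(t|s) p(s) = p(s, t)
assert abs(p_s_given_t * p_t(t) - p_t_given_s * p_s(s)) < 1e-12
print(p_s_given_t * p_t(t), p_joint[(s, t)])  # both print 0.4
```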

SLIDE 4

The “canonical” model

◮ We can rewrite eq. (1) as

  p(t|s) = p(s|t) p(t) / p(s)   (3)

◮ and then combine it with (2); since p(s) does not depend on t, it can be dropped from the arg max, giving

  t⋆ = arg max_t p(s|t) p(t)   (4)

SLIDE 5

“Decoding”/1

t⋆ = arg max_t p(s|t) p(t)

◮ We have a product of two probability models:
  ◮ a reverse translation model p(s|t), which tells us how likely it is that the SL sentence s is a translation of the candidate TL sentence t, and
  ◮ a target-language model p(t), which tells us how likely the sentence t is in the TL side of bitexts.

◮ These may be related (respectively) to the usual notions of
  ◮ [reverse] adequacy: how much of the meaning of t is conveyed by s;
  ◮ fluency: how fluent the candidate TL sentence t is.

◮ The arg max strikes a balance between the two, as the sketch below illustrates.
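To see how the two models interact in eq. (4), here is a minimal brute-force "decoder" in Python: it scores each member of a small candidate list by log p(s|t) + log p(t) and keeps the best. Both models are stand-in dictionaries with invented numbers; a real decoder never enumerates candidates exhaustively.

```python
import math

# Hypothetical models for one fixed SL sentence s; all numbers invented.
# Reverse translation model p(s|t): adequacy of each candidate t.
p_s_given_t = {"the house": 0.7, "the home": 0.6, "house the": 0.7}
# Target-language model p(t): fluency of each candidate t.
p_t = {"the house": 0.01, "the home": 0.002, "house the": 0.00001}

def score(t):
    # Log space avoids numerical underflow when sentences get long.
    return math.log(p_s_given_t[t]) + math.log(p_t[t])

t_best = max(p_s_given_t, key=score)
print(t_best)  # "the house": "house the" is as adequate but not fluent
```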

SLIDE 6

“Decoding”/2

◮ In SMT parlance, the process of finding t⋆ is called decoding.¹

◮ Obviously, it does not explore all possible translations t in the search space: there are infinitely many.

◮ The search space is pruned.

◮ Therefore, one just gets a reasonable approximation to t⋆ instead of the ideal t⋆.

◮ Pruning and search strategies are a very active research topic (free/open-source software: Moses). A minimal pruning sketch follows.

¹Reading SMT articles usually entails deciphering jargon which may be very obscure to outsiders or newcomers.
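The fragment below sketches, in Python, the kind of beam pruning a decoder applies: partial translations are extended one word at a time, and at each step only the beam_size best-scoring hypotheses survive. The scoring function and the toy bigram table are invented for illustration; Moses's actual search is considerably more elaborate.

```python
def beam_search(score_fn, vocab, max_len, beam_size):
    """Keep only the `beam_size` best partial hypotheses at each
    step. `score_fn(words)` must return the log-score increment
    contributed by the newest word of the partial translation."""
    beam = [([], 0.0)]  # list of (partial translation, log-score)
    for _ in range(max_len):
        expanded = [(words + [w], logp + score_fn(words + [w]))
                    for words, logp in beam for w in vocab]
        # Pruning step: sort by score, discard all but the top few.
        expanded.sort(key=lambda h: h[1], reverse=True)
        beam = expanded[:beam_size]
    return beam[0]

# Invented bigram log-probabilities standing in for the real models.
bigram_logp = {("<s>", "the"): -0.1, ("the", "house"): -0.2,
               ("the", "the"): -5.0, ("house", "the"): -3.0}
def toy_score(words):
    prev = words[-2] if len(words) > 1 else "<s>"
    return bigram_logp.get((prev, words[-1]), -10.0)

print(beam_search(toy_score, ["the", "house"], max_len=2, beam_size=2))
# (['the', 'house'], -0.30000000000000004)
```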

SLIDE 7

Training/1

◮ So where do these probabilities come from?

◮ p(t) may easily be estimated from a large monolingual TL corpus (free/open-source software: irstlm); a minimal sketch follows below.

◮ The estimation of p(s|t) is more complex. It is usually made of:
  ◮ a lexical model describing the probability that the translation of a certain TL word or sequence of words ("phrase"²) is a certain SL word or sequence of words;
  ◮ an alignment model describing the reordering of words or "phrases".

²A very unfortunate choice in SMT jargon.
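As an illustration of how p(t) can be estimated from a monolingual corpus, here is a minimal bigram language model in Python with add-one smoothing; the three-sentence corpus is invented, and irstlm naturally uses far more sophisticated smoothing over higher-order n-grams.

```python
from collections import Counter
from itertools import pairwise  # Python 3.10+

# Invented toy monolingual TL corpus.
corpus = ["the house is small", "the house is big", "the cat is small"]

unigrams, bigrams, vocab = Counter(), Counter(), set()
for sent in corpus:
    words = ["<s>"] + sent.split() + ["</s>"]
    vocab.update(words)
    unigrams.update(words[:-1])      # count each word as a context
    bigrams.update(pairwise(words))  # count adjacent word pairs

def p_next(w, prev):
    # Add-one (Laplace) smoothing: unseen bigrams still get p > 0.
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + len(vocab))

def p_sentence(sent):
    p = 1.0
    for prev, w in pairwise(["<s>"] + sent.split() + ["</s>"]):
        p *= p_next(w, prev)
    return p

print(p_sentence("the house is small"))  # fluent: relatively high
print(p_sentence("small the is house"))  # "word salad": much lower
```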

SLIDE 8

Training/2

◮ The lexical model and the alignment model are estimated from a large sentence-aligned bilingual corpus through a complex iterative process.

◮ An initial set of lexical probabilities is obtained by assuming, for instance, that any word in the TL sentence aligns with any word in its SL counterpart. And then:
  ◮ alignment probabilities in accordance with the lexical probabilities are computed;
  ◮ lexical probabilities are obtained in accordance with the alignment probabilities.

◮ This process ("expectation maximization") is repeated a fixed number of times or until some convergence is observed (free/open-source software: Giza++); see the sketch below.
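A minimal Python sketch of this iteration in the spirit of IBM Model 1 (one of the models Giza++ implements): lexical probabilities start out uniform, expected alignment counts are collected from every sentence pair (E step), and the counts are renormalized into new lexical probabilities (M step). The two-pair bitext is invented.

```python
from collections import defaultdict

# Invented sentence-aligned bitext: (TL sentence, SL sentence) pairs.
bitext = [("the house".split(), "la casa".split()),
          ("the".split(), "la".split())]

sl_vocab = {w for _, s_sent in bitext for w in s_sent}
# Initial lexical model t(s_word | t_word): uniform over the SL vocabulary.
lex = defaultdict(lambda: 1.0 / len(sl_vocab))

for _ in range(10):  # a fixed number of EM iterations
    counts = defaultdict(float)  # expected count(s_word aligned to t_word)
    totals = defaultdict(float)  # expected count(t_word aligned to anything)
    for t_sent, s_sent in bitext:
        for s_w in s_sent:
            # E step: spread s_w's alignment mass over all TL words,
            # proportionally to the current lexical probabilities.
            norm = sum(lex[(s_w, t_w)] for t_w in t_sent)
            for t_w in t_sent:
                c = lex[(s_w, t_w)] / norm
                counts[(s_w, t_w)] += c
                totals[t_w] += c
    # M step: renormalize the expected counts into probabilities.
    for (s_w, t_w), c in counts.items():
        lex[(s_w, t_w)] = c / totals[t_w]

print(lex[("casa", "house")])  # approaches 1 as the iterations proceed
```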
SLIDE 9

Training/3

◮ In "phrase-based" SMT, alignments may be used to extract
  ◮ (SL-phrase, TL-phrase) pairs of phrases
  ◮ and their corresponding probabilities
  for easier decoding and to avoid "word salad" (see the extraction sketch below).
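Below is a minimal Python sketch of phrase-pair extraction, assuming the standard consistency criterion (every alignment link touching the phrase pair must fall inside it); the sentence pair and alignment links are invented.

```python
# Invented example: aligned sentence pair with links (i, j) meaning
# "SL word i is aligned to TL word j".
s = "la casa blanca".split()
t = "the white house".split()
links = {(0, 0), (1, 2), (2, 1)}

def consistent(i1, i2, j1, j2):
    # Consistent iff some link lies inside and no link crosses the border.
    inside = any(i1 <= i <= i2 and j1 <= j <= j2 for i, j in links)
    no_cross = all((i1 <= i <= i2) == (j1 <= j <= j2) for i, j in links)
    return inside and no_cross

pairs = [(" ".join(s[i1:i2 + 1]), " ".join(t[j1:j2 + 1]))
         for i1 in range(len(s)) for i2 in range(i1, len(s))
         for j1 in range(len(t)) for j2 in range(j1, len(t))
         if consistent(i1, i2, j1, j2)]
print(pairs)  # includes ('casa blanca', 'white house'), etc.
```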

SLIDE 10

“Log-linear”/1

◮ More SMT jargon!

◮ It's short for a linear combination of logarithms of probabilities.

◮ And, sometimes, even of features that aren't logarithms or probabilities of any kind.

◮ OK, let’s take a look at the maths.

SLIDE 11

“Log-linear”/2

◮ One can write a more general formula:

  p(t|s) = exp( Σ_{k=1}^{n_F} λ_k f_k(t, s) ) / Z   (5)

  with n_F feature functions f_k(t, s) which can depend on s, t, or both.

◮ Setting n_F = 2, f_1(s, t) = log p(s|t), f_2(s, t) = log p(t), λ_1 = λ_2 = 1, and Z = p(s), one recovers the canonical formula (3).

◮ The best translation is then

  t⋆ = arg max_t Σ_{k=1}^{n_F} λ_k f_k(t, s)   (6)

Most of the f_k(t, s) are logarithms, hence "log-linear".
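Eq. (6) in Python: a minimal sketch in which the best candidate maximizes the weighted sum of feature values. The candidates, the features (two log-probabilities plus one raw word-count feature) and the λ values are all invented; note that Z can be ignored in the arg max because it does not depend on t.

```python
import math

# Invented candidates with precomputed model values for one SL sentence.
candidates = {
    "the house": {"p_s_given_t": 0.7, "p_t": 0.010, "n_words": 2},
    "the home":  {"p_s_given_t": 0.6, "p_t": 0.002, "n_words": 2},
    "house":     {"p_s_given_t": 0.3, "p_t": 0.050, "n_words": 1},
}

# Feature functions f_k(t, s): two logs and one non-logarithmic feature.
features = [lambda c: math.log(c["p_s_given_t"]),
            lambda c: math.log(c["p_t"]),
            lambda c: c["n_words"]]          # a simple length feature
lambdas = [1.0, 1.0, 0.5]  # invented weights; tuning sets these (next slide)

def loglinear(c):
    return sum(lam * f(c) for lam, f in zip(lambdas, features))

print(max(candidates, key=lambda t: loglinear(candidates[t])))
```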

SLIDE 12

“Log-linear”/3

◮ "Feature selection is a very open problem in SMT" (Lopez 2008).

◮ Other possible feature functions include length penalties (discouraging unreasonably short or long translations), "inverted" versions of p(s|t), etc.

◮ Where do we get the λ_k's from?

◮ They are usually tuned so as to optimize the results on a tuning set, according to a certain objective function that
  ◮ is taken to be an indicator that correlates with translation quality, and
  ◮ may be automatically computed from the output of the SMT system and the reference translations in the corpus.

◮ This is sometimes called MERT (minimum error rate training) (free/open-source software: the Moses suite); a toy stand-in is sketched below.
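The toy Python loop below is a stand-in for such tuning: it samples random weight vectors and keeps the one whose decoded output scores best on a tuning set under some quality metric. Real MERT performs a much smarter line search in λ-space; the decoder and metric here are placeholders invented for illustration.

```python
import random

def tune(decode, metric, tuning_set, n_features, trials=200):
    """Return the sampled weight vector that maximizes `metric`
    over the tuning set. decode(s, lams) -> hypothesis translation;
    metric(hyp, ref) -> quality score, higher is better."""
    best_lams, best_score = None, float("-inf")
    for _ in range(trials):
        lams = [random.random() for _ in range(n_features)]
        score = sum(metric(decode(s, lams), ref) for s, ref in tuning_set)
        if score > best_score:
            best_lams, best_score = lams, score
    return best_lams

# Placeholder decoder and metric, invented purely for illustration.
decode = lambda s, lams: s if lams[0] > lams[1] else s.upper()
metric = lambda hyp, ref: float(hyp == ref)
print(tune(decode, metric, [("hello", "hello")], n_features=2))
```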

SLIDE 13

Ain’t got nothin’ but the BLEUs?

◮ The most famous "quality indicator" is called BLEU, but there are many others.

◮ BLEU counts the fraction of the 1-word, 2-word, . . . , n-word sequences in the output that match the reference translation.

◮ Correlation with subjective assessments of quality is still an open question.

◮ A lot of SMT research is currently BLEU-driven and makes little contact with real applications of MT.
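A minimal sketch of BLEU's core computation for a single sentence pair (real BLEU combines clipped precisions for n = 1 to 4 over a whole corpus and multiplies in a brevity penalty): count the fraction of n-word sequences of the output that also occur in the reference. The hypothesis and reference sentences are invented.

```python
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def clipped_precision(hyp, ref, n):
    """Fraction of hypothesis n-grams found in the reference,
    clipping repeated n-grams at their reference count."""
    h, r = ngrams(hyp.split(), n), ngrams(ref.split(), n)
    total = sum(h.values())
    return sum(min(c, r[g]) for g, c in h.items()) / total if total else 0.0

hyp = "the house is small"        # invented system output
ref = "the house is quite small"  # invented reference translation
for n in (1, 2):
    print(n, clipped_precision(hyp, ref, n))  # 1.0, then 2/3
```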

SLIDE 14

The SMT lifecycle

Development:
  ◮ Training: monolingual and sentence-aligned bilingual corpora are used to estimate probability models (features).
  ◮ Tuning: a held-out portion of the sentence-aligned bilingual corpus is used to tune the coefficients λ_k.
Decoding: sentences s are fed into the SMT system and "decoded" into their translations t.
Evaluation: the system is evaluated against a reference corpus.

SLIDE 15

License

This work may be distributed under the terms of

◮ the Creative Commons Attribution–Share Alike license: http://creativecommons.org/licenses/by-sa/3.0/

◮ the GNU GPL v. 3.0 license: http://www.gnu.org/licenses/gpl.html

Dual license! E-mail me to get the sources: mlf@ua.es