SLIDE 1

Statistical machine translation in a few slides

Mikel L. Forcada¹,²

¹Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, E-03071 Alacant (Spain)

²Prompsit Language Engineering, S.L., E-03690 St. Vicent del Raspeig (Spain)

April 14-16, 2009: Free/open-source MT tutorial at the CNGL

SLIDE 2

Translation as probability/1

◮ Instead of saying that
  ◮ a source-language (SL) sentence s in an SL text
  ◮ and a target-language (TL) sentence t
  as found in an SL–TL bitext are or are not a translation of each other,

◮ in SMT one says that they are a translation of each other with a probability p(s, t) = p(t, s) (a joint probability).

◮ We'll assume we have such a probability model available, or at least a reasonable estimate of it.

SLIDE 3

Translation as probability/2

◮ According to basic probability laws, we can write

  p(s, t) = p(t, s) = p(s|t) p(t) = p(t|s) p(s)   (1)

  where p(x|y) is the conditional probability of x given y.

◮ We are interested in translating from SL to TL. That is, we want to find the most likely translation given the SL sentence s:

  t⋆ = arg max_t p(t|s)   (2)
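To make eqs. (1) and (2) concrete, here is a minimal Python sanity check, assuming a hypothetical joint distribution p(s, t) over two SL and two TL sentences; all the probability values are invented for illustration.

```python
# Toy joint distribution p(s, t); all values are invented and sum to 1.
p_joint = {
    ("la casa", "the house"): 0.40,
    ("la casa", "home"):      0.10,
    ("el gat", "the cat"):    0.45,
    ("el gat", "home"):       0.05,
}

def p_s(s):  # marginal p(s) = sum over t of p(s, t)
    return sum(p for (s2, _), p in p_joint.items() if s2 == s)

def p_t(t):  # marginal p(t) = sum over s of p(s, t)
    return sum(p for (_, t2), p in p_joint.items() if t2 == t)

s, t = "la casa", "the house"
# Conditional probabilities obtained from the joint and the marginals:
p_s_given_t = p_joint[(s, t)] / p_t(t)
p_t_given_s = p_joint[(s, t)] / p_s(s)
# Eq. (1): p(s|t) p(t) = p(t|s) p(s) = p(s, t)
assert abs(p_s_given_t * p_t(t) - p_t_given_s * p_s(s)) < 1e-12
print(p_s_given_t * p_t(t), p_joint[(s, t)])  # both print 0.4
```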

SLIDE 4

The “canonical” model

◮ We can rewrite eq. (1) as

  p(t|s) = p(s|t) p(t) / p(s)   (3)

◮ and then combine it with (2); since p(s) does not depend on t, it can be dropped from the arg max, giving

  t⋆ = arg max_t p(s|t) p(t)   (4)

SLIDE 5

“Decoding”/1

t⋆ = arg max_t p(s|t) p(t)

◮ We have a product of two probability models:
  ◮ a reverse translation model p(s|t), which tells us how likely it is that the SL sentence s is a translation of the candidate TL sentence t, and
  ◮ a target-language model p(t), which tells us how likely the sentence t is in the TL side of bitexts.

◮ These may be related (respectively) to the usual notions of
  ◮ [reverse] adequacy: how much of the meaning of t is conveyed by s;
  ◮ fluency: how fluent the candidate TL sentence t is.

◮ The arg max strikes a balance between the two, as the sketch below illustrates.
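To see how the two models interact in eq. (4), here is a minimal brute-force "decoder" in Python: it scores each member of a small candidate list by log p(s|t) + log p(t) and keeps the best. Both models are stand-in dictionaries with invented numbers; a real decoder never enumerates candidates exhaustively.

```python
import math

# Hypothetical models for one fixed SL sentence s; all numbers invented.
# Reverse translation model p(s|t): adequacy of each candidate t.
p_s_given_t = {"the house": 0.7, "the home": 0.6, "house the": 0.7}
# Target-language model p(t): fluency of each candidate t.
p_t = {"the house": 0.01, "the home": 0.002, "house the": 0.00001}

def score(t):
    # Log space avoids numerical underflow when sentences get long.
    return math.log(p_s_given_t[t]) + math.log(p_t[t])

t_best = max(p_s_given_t, key=score)
print(t_best)  # "the house": "house the" is as adequate but not fluent
```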

SLIDE 6

“Decoding”/2

◮ In SMT parlance, the process of finding t⋆ is called decoding.¹

◮ Obviously, it does not explore all possible translations t in the search space: there are infinitely many.

◮ The search space is pruned.

◮ Therefore, one just gets a reasonable approximation to t⋆ instead of the ideal t⋆.

◮ Pruning and search strategies are a very active research topic (free/open-source software: Moses). A minimal pruning sketch follows.

¹Reading SMT articles usually entails deciphering jargon which may be very obscure to outsiders or newcomers.
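The fragment below sketches, in Python, the kind of beam pruning a decoder applies: partial translations are extended one word at a time, and at each step only the beam_size best-scoring hypotheses survive. The scoring function and the toy bigram table are invented for illustration; Moses's actual search is considerably more elaborate.

```python
def beam_search(score_fn, vocab, max_len, beam_size):
    """Keep only the `beam_size` best partial hypotheses at each
    step. `score_fn(words)` must return the log-score increment
    contributed by the newest word of the partial translation."""
    beam = [([], 0.0)]  # list of (partial translation, log-score)
    for _ in range(max_len):
        expanded = [(words + [w], logp + score_fn(words + [w]))
                    for words, logp in beam for w in vocab]
        # Pruning step: sort by score, discard all but the top few.
        expanded.sort(key=lambda h: h[1], reverse=True)
        beam = expanded[:beam_size]
    return beam[0]

# Invented bigram log-probabilities standing in for the real models.
bigram_logp = {("<s>", "the"): -0.1, ("the", "house"): -0.2,
               ("the", "the"): -5.0, ("house", "the"): -3.0}
def toy_score(words):
    prev = words[-2] if len(words) > 1 else "<s>"
    return bigram_logp.get((prev, words[-1]), -10.0)

print(beam_search(toy_score, ["the", "house"], max_len=2, beam_size=2))
# (['the', 'house'], -0.30000000000000004)
```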

SLIDE 7

Training/1

◮ So where do these probabilities come from?

◮ p(t) may easily be estimated from a large monolingual TL corpus (free/open-source software: irstlm); a minimal sketch follows below.

◮ The estimation of p(s|t) is more complex. It is usually made of:
  ◮ a lexical model describing the probability that the translation of a certain TL word or sequence of words ("phrase"²) is a certain SL word or sequence of words;
  ◮ an alignment model describing the reordering of words or "phrases".

²A very unfortunate choice in SMT jargon.
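As an illustration of how p(t) can be estimated from a monolingual corpus, here is a minimal bigram language model in Python with add-one smoothing; the three-sentence corpus is invented, and irstlm naturally uses far more sophisticated smoothing over higher-order n-grams.

```python
from collections import Counter
from itertools import pairwise  # Python 3.10+

# Invented toy monolingual TL corpus.
corpus = ["the house is small", "the house is big", "the cat is small"]

unigrams, bigrams, vocab = Counter(), Counter(), set()
for sent in corpus:
    words = ["<s>"] + sent.split() + ["</s>"]
    vocab.update(words)
    unigrams.update(words[:-1])      # count each word as a context
    bigrams.update(pairwise(words))  # count adjacent word pairs

def p_next(w, prev):
    # Add-one (Laplace) smoothing: unseen bigrams still get p > 0.
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + len(vocab))

def p_sentence(sent):
    p = 1.0
    for prev, w in pairwise(["<s>"] + sent.split() + ["</s>"]):
        p *= p_next(w, prev)
    return p

print(p_sentence("the house is small"))  # fluent: relatively high
print(p_sentence("small the is house"))  # "word salad": much lower
```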

SLIDE 8

Training/2

◮ The lexical model and the alignment model are estimated from a large sentence-aligned bilingual corpus through a complex iterative process.

◮ An initial set of lexical probabilities is obtained by assuming, for instance, that any word in the TL sentence aligns with any word in its SL counterpart. And then:
  ◮ alignment probabilities in accordance with the lexical probabilities are computed;
  ◮ lexical probabilities are obtained in accordance with the alignment probabilities.

◮ This process ("expectation maximization") is repeated a fixed number of times or until some convergence is observed (free/open-source software: Giza++); see the sketch below.
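A minimal Python sketch of this iteration in the spirit of IBM Model 1 (one of the models Giza++ implements): lexical probabilities start out uniform, expected alignment counts are collected from every sentence pair (E step), and the counts are renormalized into new lexical probabilities (M step). The two-pair bitext is invented.

```python
from collections import defaultdict

# Invented sentence-aligned bitext: (TL sentence, SL sentence) pairs.
bitext = [("the house".split(), "la casa".split()),
          ("the".split(), "la".split())]

sl_vocab = {w for _, s_sent in bitext for w in s_sent}
# Initial lexical model t(s_word | t_word): uniform over the SL vocabulary.
lex = defaultdict(lambda: 1.0 / len(sl_vocab))

for _ in range(10):  # a fixed number of EM iterations
    counts = defaultdict(float)  # expected count(s_word aligned to t_word)
    totals = defaultdict(float)  # expected count(t_word aligned to anything)
    for t_sent, s_sent in bitext:
        for s_w in s_sent:
            # E step: spread s_w's alignment mass over all TL words,
            # proportionally to the current lexical probabilities.
            norm = sum(lex[(s_w, t_w)] for t_w in t_sent)
            for t_w in t_sent:
                c = lex[(s_w, t_w)] / norm
                counts[(s_w, t_w)] += c
                totals[t_w] += c
    # M step: renormalize the expected counts into probabilities.
    for (s_w, t_w), c in counts.items():
        lex[(s_w, t_w)] = c / totals[t_w]

print(lex[("casa", "house")])  # approaches 1 as the iterations proceed
```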
SLIDE 9

Training/3

◮ In "phrase-based" SMT, alignments may be used to extract
  ◮ (SL-phrase, TL-phrase) pairs of phrases
  ◮ and their corresponding probabilities
  for easier decoding and to avoid "word salad" (see the extraction sketch below).
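Below is a minimal Python sketch of phrase-pair extraction, assuming the standard consistency criterion (every alignment link touching the phrase pair must fall inside it); the sentence pair and alignment links are invented.

```python
# Invented example: aligned sentence pair with links (i, j) meaning
# "SL word i is aligned to TL word j".
s = "la casa blanca".split()
t = "the white house".split()
links = {(0, 0), (1, 2), (2, 1)}

def consistent(i1, i2, j1, j2):
    # Consistent iff some link lies inside and no link crosses the border.
    inside = any(i1 <= i <= i2 and j1 <= j <= j2 for i, j in links)
    no_cross = all((i1 <= i <= i2) == (j1 <= j <= j2) for i, j in links)
    return inside and no_cross

pairs = [(" ".join(s[i1:i2 + 1]), " ".join(t[j1:j2 + 1]))
         for i1 in range(len(s)) for i2 in range(i1, len(s))
         for j1 in range(len(t)) for j2 in range(j1, len(t))
         if consistent(i1, i2, j1, j2)]
print(pairs)  # includes ('casa blanca', 'white house'), etc.
```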

SLIDE 10

“Log-linear”/1

◮ More SMT jargon!

◮ It's short for a linear combination of logarithms of probabilities.

◮ And, sometimes, even of features that aren't logarithms or probabilities of any kind.

◮ OK, let’s take a look at the maths.

SLIDE 11

“Log-linear”/2

◮ One can write a more general formula:

  p(t|s) = exp( Σ_{k=1}^{n_F} λ_k f_k(t, s) ) / Z   (5)

  with n_F feature functions f_k(t, s) which can depend on s, t, or both.

◮ Setting n_F = 2, f_1(s, t) = log p(s|t), f_2(s, t) = log p(t), λ_1 = λ_2 = 1, and Z = p(s), one recovers the canonical formula (3).

◮ The best translation is then

  t⋆ = arg max_t Σ_{k=1}^{n_F} λ_k f_k(t, s)   (6)

Most of the f_k(t, s) are logarithms, hence "log-linear".
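Eq. (6) in Python: a minimal sketch in which the best candidate maximizes the weighted sum of feature values. The candidates, the features (two log-probabilities plus one raw word-count feature) and the λ values are all invented; note that Z can be ignored in the arg max because it does not depend on t.

```python
import math

# Invented candidates with precomputed model values for one SL sentence.
candidates = {
    "the house": {"p_s_given_t": 0.7, "p_t": 0.010, "n_words": 2},
    "the home":  {"p_s_given_t": 0.6, "p_t": 0.002, "n_words": 2},
    "house":     {"p_s_given_t": 0.3, "p_t": 0.050, "n_words": 1},
}

# Feature functions f_k(t, s): two logs and one non-logarithmic feature.
features = [lambda c: math.log(c["p_s_given_t"]),
            lambda c: math.log(c["p_t"]),
            lambda c: c["n_words"]]          # a simple length feature
lambdas = [1.0, 1.0, 0.5]  # invented weights; tuning sets these (next slide)

def loglinear(c):
    return sum(lam * f(c) for lam, f in zip(lambdas, features))

print(max(candidates, key=lambda t: loglinear(candidates[t])))
```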

SLIDE 12

“Log-linear”/3

◮ "Feature selection is a very open problem in SMT" (Lopez 2008).

◮ Other possible feature functions include length penalties (discouraging unreasonably short or long translations), "inverted" versions of p(s|t), etc.

◮ Where do we get the λ_k's from?

◮ They are usually tuned so as to optimize the results on a tuning set, according to a certain objective function that
  ◮ is taken to be an indicator that correlates with translation quality, and
  ◮ may be automatically computed from the output of the SMT system and the reference translations in the corpus.

◮ This is sometimes called MERT (minimum error rate training) (free/open-source software: the Moses suite); a toy stand-in is sketched below.
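The toy Python loop below is a stand-in for such tuning: it samples random weight vectors and keeps the one whose decoded output scores best on a tuning set under some quality metric. Real MERT performs a much smarter line search in λ-space; the decoder and metric here are placeholders invented for illustration.

```python
import random

def tune(decode, metric, tuning_set, n_features, trials=200):
    """Return the sampled weight vector that maximizes `metric`
    over the tuning set. decode(s, lams) -> hypothesis translation;
    metric(hyp, ref) -> quality score, higher is better."""
    best_lams, best_score = None, float("-inf")
    for _ in range(trials):
        lams = [random.random() for _ in range(n_features)]
        score = sum(metric(decode(s, lams), ref) for s, ref in tuning_set)
        if score > best_score:
            best_lams, best_score = lams, score
    return best_lams

# Placeholder decoder and metric, invented purely for illustration.
decode = lambda s, lams: s if lams[0] > lams[1] else s.upper()
metric = lambda hyp, ref: float(hyp == ref)
print(tune(decode, metric, [("hello", "hello")], n_features=2))
```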

SLIDE 13

Ain’t got nothin’ but the BLEUs?

◮ The most famous "quality indicator" is called BLEU, but there are many others.

◮ BLEU counts the fraction of the 1-word, 2-word, . . . , n-word sequences in the output that match the reference translation.

◮ Correlation with subjective assessments of quality is still an open question.

◮ A lot of SMT research is currently BLEU-driven and makes little contact with real applications of MT.
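A minimal sketch of BLEU's core computation for a single sentence pair (real BLEU combines clipped precisions for n = 1 to 4 over a whole corpus and multiplies in a brevity penalty): count the fraction of n-word sequences of the output that also occur in the reference. The hypothesis and reference sentences are invented.

```python
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def clipped_precision(hyp, ref, n):
    """Fraction of hypothesis n-grams found in the reference,
    clipping repeated n-grams at their reference count."""
    h, r = ngrams(hyp.split(), n), ngrams(ref.split(), n)
    total = sum(h.values())
    return sum(min(c, r[g]) for g, c in h.items()) / total if total else 0.0

hyp = "the house is small"        # invented system output
ref = "the house is quite small"  # invented reference translation
for n in (1, 2):
    print(n, clipped_precision(hyp, ref, n))  # 1.0, then 2/3
```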

SLIDE 14

The SMT lifecycle

Development:
  ◮ Training: monolingual and sentence-aligned bilingual corpora are used to estimate probability models (features).
  ◮ Tuning: a held-out portion of the sentence-aligned bilingual corpus is used to tune the coefficients λ_k.
Decoding: sentences s are fed into the SMT system and "decoded" into their translations t.
Evaluation: the system is evaluated against a reference corpus.

SLIDE 15

License

This work may be distributed under the terms of

◮ the Creative Commons Attribution–Share Alike license: http://creativecommons.org/licenses/by-sa/3.0/

◮ the GNU GPL v. 3.0 license: http://www.gnu.org/licenses/gpl.html

Dual license! E-mail me to get the sources: mlf@ua.es