Introduction to Machine Translation
Joost Bastings, ILLC, University of Amsterdam

MT as Crime Scene Investigation
Sentence f is a “crime scene”. Our generative model might be something like: some person e decided to do the crime, and then that person actually did the crime. So we start reasoning about:
1. who did it? P(e): motive, personality, ...
2. how did they do it? P(f | e): transportation, weapons, ...
These two things may conflict. Someone with a good motive, but without the means. Someone who could easily have done the crime, but has no motive.

Word reordering
If we model P(e | f) directly, there is not much margin for error.
We can use P(f | e) to make sure that words in f are generally translations of words in e.
P(e) then ensures that the translation e is also grammatical.
Would this work? Let’s try it:
• programming
• better
• language
• I
• never
• seen
• a
• have

Word choice
The P(e) model can also be useful for selecting English translations of French words. We need this especially when the French word is ambiguous.
Example
A French word translates as either “in” or “on”. Now there may be two English strings with equally good P(f | e) scores:
1. she is in the end zone
2. she is on the end zone
P(e) selects the right one.

IBM Model 3 [Brown et al., 1990, Brown et al., 1993]
Translate word by word, then scramble the words around into the right word order.
First observations:
• English words may produce multiple French words
• English words may disappear
We need to account for this.
The story of IBM Model 3 (TL;DR)
• For each English word e_i
  • choose a fertility φ_i
  • generate φ_i French words
• generate spurious word
• Permute French words
  • assign an absolute position to each French word
  • ... based on the absolute position of the English word that generates it

IBM3: Example
Mary did not slap the green witch
Mary not slap slap slap the green witch (fertility)
Mary not slap slap slap NULL the green witch (spurious word)
Mary no daba una bofetada a la verde bruja (translation)
Mary no daba una bofetada a la bruja verde (reordering)
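The four rewriting steps of this example can be replayed in a few lines. This is only a sketch: the function name `model3_story` and its arguments are hypothetical, and all choices (fertilities, where NULL goes, which French word each token becomes, final positions) are fixed by hand to reproduce the example rather than sampled from the Model 3 distributions.

```python
def model3_story(english, fertility, null_after, lexicon, positions):
    """Replay the Model 3 generative story with hand-picked choices."""
    # 1. Fertility: copy each English word fertility[word] times (0 deletes it).
    expanded = [w for w in english for _ in range(fertility[w])]
    # 2. Spurious word: insert one NULL token after index `null_after`.
    with_null = expanded[:null_after + 1] + ["NULL"] + expanded[null_after + 1:]
    # 3. Translation: each token takes the next French word listed for it.
    choices = {w: iter(fs) for w, fs in lexicon.items()}
    french = [next(choices[w]) for w in with_null]
    # 4. Reordering: the j-th generated word goes to absolute position positions[j].
    output = [None] * len(french)
    for j, pos in enumerate(positions):
        output[pos] = french[j]
    return expanded, with_null, french, output


english = "Mary did not slap the green witch".split()
fertility = {"Mary": 1, "did": 0, "not": 1, "slap": 3, "the": 1, "green": 1, "witch": 1}
lexicon = {"Mary": ["Mary"], "not": ["no"], "slap": ["daba", "una", "bofetada"],
           "NULL": ["a"], "the": ["la"], "green": ["verde"], "witch": ["bruja"]}
positions = [0, 1, 2, 3, 4, 5, 6, 8, 7]  # swap the last two words

steps = model3_story(english, fertility, 4, lexicon, positions)
for step in steps:
    print(" ".join(step))
```

Each printed line corresponds to one of the rewritten sentences in the example.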

IBM3: Parameters
1. Translation t(huis | house)
2. Fertility n(1 | house)
3. Spurious p
4. Position d(1 | 2, |e|, |f|)

How do we learn these parameters?
If we had rewriting examples, then we could estimate n(0 | ‘did’) by finding every ‘did’ and checking what happened to it.
Example
If ‘did’ appeared 15,000 times and was deleted during the first rewriting step 13,000 times, then n(0 | ‘did’) = 13/15.
Chicken-and-egg problem
• If we had word alignments instead of rewriting examples, we could also obtain the parameters. (But.. we don’t!)
• If we had the parameters we could get the word alignments. (But.. we don’t!)
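The estimate above is a plain relative frequency. A minimal sketch, assuming rewriting examples are available as (word, fertility) pairs; the helper `estimate_fertility` is hypothetical, and the 2,000 non-deleted cases are assumed to have fertility 1 purely for illustration.

```python
from collections import Counter
from fractions import Fraction


def estimate_fertility(observations):
    """Relative-frequency estimate of n(phi | word) from (word, phi) pairs."""
    pair_counts = Counter(observations)
    word_counts = Counter(word for word, _ in observations)
    return {(phi, word): Fraction(count, word_counts[word])
            for (word, phi), count in pair_counts.items()}


# Hypothetical data: 'did' is deleted 13,000 times out of 15,000 occurrences.
observations = [("did", 0)] * 13000 + [("did", 1)] * 2000
n = estimate_fertility(observations)
print(n[(0, "did")])  # 13/15
```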

EM intuition
• Let’s say we do have alignments, but for each sentence we have multiple ones
• Let’s say we have 2 alignments for each sentence
• We don’t know which one is best
• We could simply multiply the counts from both possible alignments by 1/2
• We call these fractional counts
• We need to consider all possible alignments, not just 2
• No problem! We use fractional counts, and we just multiply with a smaller number.

EM Example
Let’s say we have 40000 French words in our vocabulary.
We start by assigning uniform parameter values to our t(f | e). Then each t(f | e) = 1/40000.
We can do the same for the other parameters, but for now let’s focus on obtaining better t(f | e) parameters.

EM: Example
Let’s say we have a small corpus with only 2 sentences:
English   French
b c       x y
b         y
The first sentence has two possibilities, the second one has only one:
• b c / x y: either (b-x, c-y) or (b-y, c-x)
• b / y: only (b-y)

Before we start
We have now simplified our model to be IBM Model 1:
$P(a, f \mid e) = \prod_{j=1}^{M} t(f_j \mid e_{a_j})$
i.e. multiply the probabilities of aligned words.

EM: Initialization
Remember our corpus: English “b c” with French “x y”, and English “b” with French “y”.
Start with uniform parameters:
t(x | b) = 1/2
t(y | b) = 1/2
t(x | c) = 1/2
t(y | c) = 1/2

EM: Step 1
Step 1: Compute P(a, f | e) for each possible alignment
• (b-x, c-y): P(a, f | e) = 1/2 * 1/2 = 1/4
• (b-y, c-x): P(a, f | e) = 1/2 * 1/2 = 1/4
• (b-y): P(a, f | e) = 1/2

EM: Step 3 and 4
Normalizing P(a, f | e) to P(a | e, f) (Step 2) gives 1/2 for each of the two alignments of the first sentence and 1 for the single alignment of the second.
Step 3: Collect fractional counts
tc(x | b) = 1/2
tc(y | b) = 1/2 + 1 = 3/2
tc(x | c) = 1/2
tc(y | c) = 1/2
Step 4: Normalize fractional counts
t(x | b) = (1/2) / (1/2 + 3/2) = 1/4
t(y | b) = (3/2) / (1/2 + 3/2) = 3/4
t(x | c) = (1/2) / (1/2 + 1/2) = 1/2
t(y | c) = (1/2) / (1/2 + 1/2) = 1/2
These are the revised parameters!

EM: Repeat step 1
Step 1 (again, now using the new parameters): Compute P(a, f | e) for each possible alignment
• (b-x, c-y): P(a, f | e) = 1/4 * 1/2 = 1/8
• (b-y, c-x): P(a, f | e) = 3/4 * 1/2 = 3/8
• (b-y): P(a, f | e) = 3/4

EM: Repeat step 2
Step 2 (again): Normalize P(a, f | e) to yield P(a | e, f)
• (b-x, c-y): P(a | e, f) = (1/8) / (1/8 + 3/8) = 1/4
• (b-y, c-x): P(a | e, f) = (3/8) / (1/8 + 3/8) = 3/4
• (b-y): P(a | e, f) = (3/4) / (3/4) = 1

EM: Repeat steps 3 and 4
Step 3 (again): Collect fractional counts
tc(x | b) = 1/4
tc(y | b) = 3/4 + 1 = 7/4
tc(x | c) = 3/4
tc(y | c) = 1/4
Step 4 (again): Normalize fractional counts
t(x | b) = (1/4) / (1/4 + 7/4) = 1/8
t(y | b) = (7/4) / (1/4 + 7/4) = 7/8
t(x | c) = (3/4) / (3/4 + 1/4) = 3/4
t(y | c) = (1/4) / (3/4 + 1/4) = 1/4
Even better parameters!

If we do this many many times..
t(x | b) = 0.0001
t(y | b) = 0.9999
t(x | c) = 0.9999
t(y | c) = 0.0001
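The whole worked example can be reproduced with a short script. This is a sketch under the same simplification as the slides: only one-to-one alignments are enumerated (two for the first sentence pair, one for the second), which is narrower than the full alignment space of IBM Model 1. The function names `em_iteration` and `run_em` are made up for illustration, and exact fractions are used so the numbers can be compared with the hand calculation.

```python
from fractions import Fraction
from itertools import permutations

# Toy corpus from the example: (English words, French words).
CORPUS = [(("b", "c"), ("x", "y")), (("b",), ("y",))]


def em_iteration(t):
    """One EM pass; t maps (french, english) to t(french | english)."""
    counts = {}
    for e_words, f_words in CORPUS:
        # Step 1: P(a, f | e) for every one-to-one alignment.
        scored = []
        for perm in permutations(e_words):
            pairs = list(zip(f_words, perm))
            p = Fraction(1)
            for f, e in pairs:
                p *= t[(f, e)]
            scored.append((pairs, p))
        # Step 2: normalize to P(a | e, f).
        total = sum(p for _, p in scored)
        # Step 3: collect fractional counts.
        for pairs, p in scored:
            for f, e in pairs:
                counts[(f, e)] = counts.get((f, e), 0) + p / total
    # Step 4: normalize the counts per English word.
    totals = {}
    for (f, e), c in counts.items():
        totals[e] = totals.get(e, 0) + c
    return {(f, e): c / totals[e] for (f, e), c in counts.items()}


def run_em(iterations):
    """Start from uniform parameters and apply em_iteration repeatedly."""
    t = {(f, e): Fraction(1, 2) for f in "xy" for e in "bc"}
    for _ in range(iterations):
        t = em_iteration(t)
    return t


for n in (1, 2):
    print(n, {f"t({f}|{e})": str(p) for (f, e), p in sorted(run_em(n).items())})
```

Running more iterations pushes t(y | b) and t(x | c) towards 1, in line with the limit values shown above.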

Notes on EM
• Each iteration of the EM algorithm is guaranteed to improve (or at least never decrease) P(f | e)
• EM is not guaranteed to find a global optimum, but rather only a local optimum
• Where EM ends up is therefore a function of where it starts

Notes on IBM Model 3
EM for Model 3 is just like this! Except for:
• we use Model 3’s formula for P(a | f, e)
• we also collect fractional counts for:
  • n (fertility)
  • p (spurious word insertion)
  • d (reordering)
A few critical notes:
• The distortion parameters in Model 3 are a very weak description of word-order change in translation
• The reordering step in the generative story allows words to pile up on top of each other!
• This model is deficient

Decoding
With a language model p(e) and a translation model p(f | e), we want to find ê, the best translation:
$\hat{e} = \arg\max_e P(f \mid e) \, P(e)$
• This process of finding ê is called decoding
• It is impossible to search through all possible sentences
• .. but we can inspect a highly relevant subset of such sentences
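Over a small candidate subset, the argmax is a one-liner. The sketch below reuses the “in”/“on” word choice example; the function `decode`, the placeholder source string, and every probability value are made up for illustration and are not estimates from any real model.

```python
def decode(f, candidates, translation_model, language_model):
    """Return the candidate e that maximizes P(f | e) * P(e)."""
    return max(candidates, key=lambda e: translation_model(f, e) * language_model(e))


f = "<French sentence>"  # placeholder for the source sentence
candidates = ["she is in the end zone", "she is on the end zone"]

# Hypothetical scores: both candidates explain f equally well ...
p_f_given_e = {(f, e): 0.01 for e in candidates}
# ... but the language model prefers the first string.
p_e = {"she is in the end zone": 1e-6, "she is on the end zone": 1e-8}

best = decode(f, candidates,
              translation_model=lambda f, e: p_f_given_e[(f, e)],
              language_model=lambda e: p_e[e])
print(best)  # she is in the end zone
```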

Phrase-based Statistical Machine Translation

Phrase-based SMT
Atomic units
• In the IBM models, the atomic units of translation are words
• In phrase-based models, the atomic units are phrases, i.e. a few consecutive words
Advantages
• Handle many-to-many translation
• Capture local context
• More data gives us more phrases
• No more fertility, insertion, deletion
For a long time this was the main approach for Google Translate.

Phrase alignment
natürlich hat John Spaß am Spiel
of course john has fun with the game
segment the input, translate, reorder
(Adapted from: Philipp Koehn. Statistical Machine Translation.)

Phrase table for ‘natürlich’
Translation      Probability φ(ē | f̄)
of course        0.5
naturally        0.3
of course ,      0.15
, of course ,    0.05
‘natürlich’ translates into two words, so we want a mapping to a phrase!

The Noisy Channel – same as before
$\arg\max_e P(e \mid f) = \arg\max_e \underbrace{P(f \mid e)}_{\text{channel}} \, \underbrace{P(e)}_{\text{source}}$
• the channel is the translation model (now using phrases!)
• the source is the language model

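Decomposition of P(f | e)
$P(f \mid e) = P(f_{1 \ldots M} \mid e_{1 \ldots N}) = \prod_i \phi(\bar{f}_i \mid \bar{e}_i) \, d(\mathrm{start}_i - \mathrm{end}_{i-1} - 1)$
• $\prod_i \phi(\bar{f}_i \mid \bar{e}_i)$: phrases, the product of translating each English phrase into its foreign phrase
• $d(\mathrm{start}_i - \mathrm{end}_{i-1} - 1)$: reordering, distance based reordering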

Distance based reordering
$P(f_{1 \ldots M} \mid e_{1 \ldots N}) = \prod_i \phi(\bar{f}_i \mid \bar{e}_i) \, d(\mathrm{start}_i - \mathrm{end}_{i-1} - 1)$, where d is the distance based reordering.
[Figure: English phrases, in order, aligned to foreign positions 1 2 3 | 6 | 4 5 | 7]
Distance is measured on the foreign side!
Q: What is the distance for the second English phrase?
Answer: start_2 - end_1 - 1 = 6 - 3 - 1 = 2
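The distance computation can be checked mechanically. In this sketch each English phrase is represented by the (start, end) span of foreign positions it translates; the spans for the third and fourth phrase are an assumed reading of the figure, and end_0 is taken to be 0 so that a phrase starting at position 1 has distance 0. The function name is hypothetical.

```python
def reordering_distances(foreign_spans):
    """Compute start_i - end_(i-1) - 1 for each phrase, in English order."""
    distances = []
    previous_end = 0  # assumption: end_0 = 0
    for start, end in foreign_spans:
        distances.append(start - previous_end - 1)
        previous_end = end
    return distances


# Foreign spans (1-based, inclusive) covered by English phrases 1..4.
spans = [(1, 3), (6, 6), (4, 5), (7, 7)]
print(reordering_distances(spans))  # [0, 2, -3, 1]
```

The second entry matches the answer above; negative values mean the decoder jumped backwards on the foreign side.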

Phrase extraction
How do we get phrases? We extract all phrases that are consistent with a word alignment A.
Definition: Consistent phrase pair
A phrase pair (f̄, ē) is consistent with A if all words f_1, ..., f_N in f̄ that have alignment points in A have these with words e_1, ..., e_M in ē, and vice versa.
[Figure: three example phrase pairs, labelled Consistent, Inconsistent, Consistent]
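The definition translates directly into a check over alignment points. A minimal sketch, assuming 0-based inclusive word spans and an alignment given as (foreign index, English index) pairs; the alignment below is a hypothetical one for the first words of the earlier example sentence. Requiring at least one alignment point inside the pair is a commonly used extra condition that is not part of the definition above.

```python
def is_consistent(alignment, f_span, e_span):
    """Check whether the phrase pair (f_span, e_span) is consistent with alignment."""
    f_lo, f_hi = f_span
    e_lo, e_hi = e_span
    has_point = False
    for f, e in alignment:
        f_inside = f_lo <= f <= f_hi
        e_inside = e_lo <= e <= e_hi
        if f_inside != e_inside:
            return False  # an alignment point leaves the phrase pair
        has_point = has_point or f_inside
    return has_point  # assumption: at least one point must be inside


# Hypothetical alignment: natürlich(0) hat(1) John(2) / of(0) course(1) john(2) has(3)
alignment = {(0, 0), (0, 1), (1, 3), (2, 2)}

print(is_consistent(alignment, (0, 0), (0, 1)))  # natürlich / of course
print(is_consistent(alignment, (1, 1), (2, 3)))  # hat / john has
print(is_consistent(alignment, (1, 2), (2, 3)))  # hat John / john has
```

The second pair fails because “john” is aligned to “John”, which lies outside the foreign span.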

Phrase probabilities
• We can choose to use many short phrases, or a few long ones, or anything in between
• In the IBM models, there was a generative story about how all the English words turn into French words
• Here we do not choose among different phrase alignments
• We estimate the phrase translation probability φ(f̄ | ē) by the relative frequency:
$\phi(\bar{f} \mid \bar{e}) = \frac{\mathrm{count}(\bar{e}, \bar{f})}{\sum_i \mathrm{count}(\bar{e}, \bar{f}_i)}$
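The relative-frequency estimate is a count divided by a per-English-phrase total. A small sketch, assuming the extracted phrase pairs are available as a list of (foreign phrase, English phrase) tuples; the function name and the list below are invented for illustration only.

```python
from collections import Counter
from fractions import Fraction


def phrase_translation_probs(extracted_pairs):
    """phi(f | e) = count(e, f) / sum over f_i of count(e, f_i)."""
    pair_counts = Counter(extracted_pairs)
    e_totals = Counter()
    for (f_phrase, e_phrase), count in pair_counts.items():
        e_totals[e_phrase] += count
    return {(f_phrase, e_phrase): Fraction(count, e_totals[e_phrase])
            for (f_phrase, e_phrase), count in pair_counts.items()}


# Hypothetical extracted phrase pairs (foreign, English).
pairs = ([("natürlich", "of course")] * 3
         + [("selbstverständlich", "of course")]
         + [("natürlich", "naturally")] * 2)
phi = phrase_translation_probs(pairs)
print(phi[("natürlich", "of course")])  # 3/4
```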

Log-linear models
The phrase-based model so far already works well. So far we have:
• phrase translation probabilities
• reordering model d
• language model
Probabilities from each component are multiplied so that we can find best translation ê with an argmax.
We can put all of this in a general log-linear model:
$p(x) = \exp \sum_{i=1}^{n} \lambda_i h_i(x)$
which allows us to weight the components:
• λ_φ for the translation model
• λ_d for the reordering model
• λ_LM for the language model
$\hat{e} = \arg\max_e \; p_{LM}(e)^{\lambda_{LM}} \cdot \prod_i \phi(\bar{f}_i \mid \bar{e}_i)^{\lambda_\phi} \cdot d(\ldots)^{\lambda_d}$
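With h_i(x) taken to be the log of each component probability, the log-linear score is a weighted product. The sketch below uses made-up component values and weights, and the function name is hypothetical; it only illustrates that setting every weight to 1 recovers the plain product of the components.

```python
import math


def loglinear_score(components, weights):
    """exp(sum_i lambda_i * h_i(x)) with h_i(x) = log of component i."""
    return math.exp(sum(weights[name] * math.log(value)
                        for name, value in components.items()))


# Hypothetical component probabilities for one translation hypothesis.
components = {"translation": 0.2, "reordering": 0.5, "lm": 0.01}

uniform = {"translation": 1.0, "reordering": 1.0, "lm": 1.0}
tuned = {"translation": 1.0, "reordering": 0.5, "lm": 2.0}  # made-up weights

print(loglinear_score(components, uniform))  # close to 0.2 * 0.5 * 0.01
print(loglinear_score(components, tuned))
```

Raising λ_LM above 1, as in the second call, makes the language model count more heavily in the ranking of hypotheses.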

Log-linear models (2)
Since we have a log-linear model now, we can add all kinds of feature functions h_i(x) together with a weight λ_i.
Examples:
• Bi-directional translation probabilities
• Lexical weighting
• Word penalty (control output length)
• Phrase penalty
Reordering:
• So far reordering is modelled just based on distance
• Another improvement we can make is to obtain lexicalized reordering probabilities
• A popular way to do this is MSD-reordering: between 2 phrases, we want to predict:
  • (M) monotone order
  • (S) swap with previous phrase
  • (D) discontinuous

Decoding
• To find the best translation using our model, we need to perform decoding
• The search space is huge, so many heuristics are used in practice
• We can expand a translation hypothesis from left-to-right, one phrase at a time
• At every expansion we consult the translation model, reordering model, and language model to check whether it is a good idea
• We cannot keep all hypotheses in memory, so we put them in hypothesis stacks based on how many foreign words they cover
• When a stack gets too large, we prune it

Evaluation

Evaluation – How good are our translations?
Candidate: the the the the the the the
Ref 1: the cat is on the mat
Ref 2: there is a cat on the mat
Idea 1: Precision
P = (# words in candidate that are in ref) / (# words in candidate) = 7/7
Idea 2: Modified Precision
Clip the number of matching words (e.g. 7 for ‘the’) to their max. count in a ref. (e.g. only 2)
P = 2/7
What is the modified precision for this?
Candidate: the cat
Ref 1: the cat is on the mat
Ref 2: there is a cat on the mat
P = 2/2 = 1
Can we use recall? No, because there are multiple references.
Solution: Brevity penalty
We multiply the score with $e^{1 - r/c}$ if the total length of the candidates (c) is shorter than the reference length (r).
This is the basis for BLEU [Papineni et al., 2002]
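Both ideas fit in a few lines. This is a sketch of unigram modified precision and the brevity penalty only, not a full BLEU implementation (which combines several n-gram orders over a whole test set); the function names are illustrative, and the reference length r is passed in explicitly rather than chosen from the references.

```python
import math
from collections import Counter
from fractions import Fraction


def modified_precision(candidate, references):
    """Unigram precision with counts clipped to the max count in any reference."""
    candidate_counts = Counter(candidate)
    clipped = 0
    for word, count in candidate_counts.items():
        max_in_ref = max(Counter(ref)[word] for ref in references)
        clipped += min(count, max_in_ref)
    return Fraction(clipped, len(candidate))


def brevity_penalty(c, r):
    """e^(1 - r/c) if the candidate length c is shorter than r, else 1."""
    return 1.0 if c >= r else math.exp(1 - r / c)


refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]

print(modified_precision("the the the the the the the".split(), refs))  # 2/7
print(modified_precision("the cat".split(), refs))                      # 1
print(brevity_penalty(c=2, r=6))  # e^(1 - 3), penalizes the short candidate
```

The short candidate “the cat” gets a perfect modified precision, and the brevity penalty is what pulls its score down.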

Neural Machine Translation

Encoder-Decoder [Cho et al., 2014, Sutskever et al., 2014]
[Figure: encoder-decoder diagram with input tokens x1 x2 x3 EOS and output tokens y1 y2 y3 y4 EOS]

The Annotated Encoder-Decoder
A blog post on how to implement an Encoder-Decoder from scratch in PyTorch:
https://bastings.github.io/annotated_encoder_decoder/

Google Translate Experiment
Try the following input:
iä iä iä iä iä iä iä iä iä iä iä iä iä iä iä iä iä iä iä iä iä
etc..
What is going on here?

References
Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D., Mercer, R. L., and Roossin, P. S. (1990). A statistical approach to machine translation. Computational Linguistics, 16(2):79–85.
Brown, P. F., Pietra, V. J. D., Pietra, S. A. D., and Mercer, R. L. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation.

Introduction to Machine Translation Joost Bastings ILLC, University of Amsterdam bastings.github.io Table of contents 1. A Brief History of MT 2. Statistical Machine Translation 3. Phrase-based Statistical Machine Translation 4. Evaluation
