Deciphering Foreign Language
NLP
Sujith Ravi and Kevin Knight
sravi@usc.edu, knight@isi.edu
Information Sciences Institute, University of Southern California
2
Statistical Machine Translation (MT)
Current: Bilingual text -> TRAIN -> Translation tables
Bilingual text:
(Spanish) Garcia y asociados
(English) Garcia and associates
(Spanish) sus grupos están en Europa
(English) its groups are in Europe
...
Translation tables:
associates / asociados : 0.8
groups / grupos : 0.9
...
Bilingual text is needed for every language pair (Spanish/English, Japanese/German, Malayalam/English, Swahili/German, ...): BOTTLENECK
Monolingual corpora: Spanish text (PLENTY), English text (PLENTY) -> TRAIN -> Translation tables
associates / asociados : 0.8
...
3
Monolingual corpora (Spanish text, English text) -> TRAIN -> Translation tables (associates / asociados : 0.8, ...)
Machine Translation without parallel data
➡ useful for rare language-pairs (limited/no parallel data)
➡ monolingual resources available in plenty
➡ exploit word context frequencies (Fung, 1995; Rapp,
➡ Canonical Correlation Analysis (CCA) method (Haghighi
➡ need dictionary, some initial parallel data
4
Machine Translation without parallel data
5
➡ novel decipherment approach for translation
➡ novel methods for training translation models from non-parallel text
➡ Bayesian training for IBM Model 3 translation model
6
➡ Step 1: Word Substitution
➡ Step 2: Foreign Language as a Cipher
7
“When I look at an article in Spanish, I say to myself, this is really English, but it has been encoded in some strange symbols. Now I will proceed to decode...”
Warren Weaver (1947)
Spanish corpus (f):
El portal web permite la búsqueda por todo tipo de métodos. Por un lado, Wikileaks ha
categorías atendiendo a los hechos más
criminal, fuego amigo, ...

English corpus (used for Language Model Training):
(CNN) WikiLeaks website publishes classified military documents from Iraq. The whistle-blower website WikiLeaks published nearly 400,000 classified military documents from the Iraq war on Friday, calling it the largest classified military leak in history,....

For each f: English Language Model P(e) -> e (English) -> English-to-Spanish Translation Model P(f | e) -> f

TRAINING
Train parameters θ to maximize the probability of the observed Spanish text f:
$$\arg\max_\theta P_\theta(f) \approx \arg\max_\theta \sum_e P_\theta(e, f) \approx \arg\max_\theta \sum_e P(e) \cdot P_\theta(f \mid e)$$
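As a rough illustration of the training objective, the sketch below evaluates the sum over e of P(e) · Pθ(f | e) for one short Spanish string by brute-force enumeration. The two-word vocabularies, the unigram language model `lm`, and the channel table `theta` are made-up values for illustration, not numbers from the talk, and a word-for-word channel with no re-ordering is assumed.

```python
from itertools import product

# Hypothetical toy English unigram language model P(e).
lm = {"the": 0.6, "house": 0.4}

# Hypothetical channel parameters theta: P(f | e) for each English word.
theta = {
    "the": {"la": 0.9, "casa": 0.1},
    "house": {"la": 0.2, "casa": 0.8},
}

def marginal_likelihood(f_words, lm, theta):
    """Return P_theta(f) = sum over e of P(e) * P_theta(f | e)."""
    total = 0.0
    # Enumerate every English sequence with the same length as f.
    for e_words in product(lm, repeat=len(f_words)):
        p_e = 1.0
        p_f_given_e = 1.0
        for e, f in zip(e_words, f_words):
            p_e *= lm[e]
            p_f_given_e *= theta[e].get(f, 0.0)
        total += p_e * p_f_given_e
    return total

likelihood = marginal_likelihood(["la", "casa"], lm, theta)
```

Training would adjust `theta` to raise this quantity over the whole Spanish corpus. Enumeration is only feasible for toy sizes, which is why the talk turns to sampling-based inference later.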
11
Problem dimensions: Determinism in Key? | Linguistic unit of substitution | Transposition (re-ordering) | Insertion | Deletion | Scale (vocabulary & data sizes)

Determinism in Key? | Linguistic unit of substitution | Scale (vocabulary & data sizes)
many-to-many | Word / Phrase | Large (100 - 1M word types)    Hard problem!
1-to-1 | Word | Large (100 - 1M word types)    Tackle a simpler problem first!
12
13
Word Substitution
Like unsupervised POS tagging, except there are hundreds of thousands of “tags” per cipher token (tag = English word)
English words masked by cipher symbols:
95 90 76 31 95 20 43 11 80 60 16 95 65 31 50 42 16 65 31 50 58 42 16 19 95 16 92 73 16 65 67 57 31 26 65 38 70 52 57 30 33 26 16 47 24 56 21 16 ... ...
14
Word Substitution
Learn substitution mappings between English words and cipher symbols
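To make the task concrete, here is a minimal sketch of how such a word substitution cipher can be produced. The helper names `make_cipher` and `encipher`, the sample sentence, and the choice of two-digit numeric symbols are assumptions for illustration. The decipherment problem is the reverse: recover the key when only the number sequence and separate English text are available.

```python
import random

def make_cipher(words, seed=0):
    """Build a 1-to-1 key from English word types to numeric cipher symbols."""
    rng = random.Random(seed)
    types = sorted(set(words))
    symbols = list(range(10, 10 + len(types)))
    rng.shuffle(symbols)
    return dict(zip(types, symbols))

def encipher(words, key):
    """Replace every word token with its cipher symbol."""
    return [key[w] for w in words]

plaintext = "the boys are living here and the girls are living there".split()
key = make_cipher(plaintext)
ciphertext = encipher(plaintext, key)

# With the key in hand, decoding is trivial; without it, the key must be learned.
inverse = {symbol: word for word, symbol in key.items()}
recovered = [inverse[c] for c in ciphertext]
```

Because the key is deterministic, repeated words always map to the same symbol, so word frequencies and contexts survive encryption. That is the signal a language-model-based decipherment can exploit.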
15
Word Substitution
[Figure: English corpus -> LM Training -> English Language Model -> English text (unknown, "?") -> Substitution Table -> cipher text]
95 90 76 31 95 20 43 11 80 60 16 95 65 31 50 42 16 65 31 50 58 42 16 19 95 16 92 73 16 65 67 57 31 26 65 38 70 52 57 30 33 26 16 47 24 56 21 16 ... ...
16
Word Substitution
key contains millions of parameters!!!
[...] procedure
[...] parallelized sampling)
17
New
Word Substitution
18
➡ incremental scoring of derivations during sampling
New
Word Substitution
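One way to picture incremental scoring: when a sampler changes a single word of the current English hypothesis, only the n-grams touching that position change, so the score can be updated rather than recomputed. The sketch below assumes a bigram language model stored as a dictionary of log-probabilities with a hypothetical floor for unseen bigrams. It covers only the language-model part of a derivation score and is not the authors' implementation.

```python
import math

def bigram_logprob(prev, word, logp, floor=-10.0):
    """Look up log P(word | prev); unseen bigrams get a hypothetical floor score."""
    return logp.get((prev, word), floor)

def full_score(words, logp):
    """Score a whole hypothesis from scratch."""
    padded = ["<s>"] + list(words) + ["</s>"]
    return sum(bigram_logprob(a, b, logp) for a, b in zip(padded, padded[1:]))

def incremental_score(words, old_score, i, new_word, logp):
    """Update the score after replacing words[i], touching only two bigrams."""
    padded = ["<s>"] + list(words) + ["</s>"]
    left, old, right = padded[i], padded[i + 1], padded[i + 2]
    removed = bigram_logprob(left, old, logp) + bigram_logprob(old, right, logp)
    added = bigram_logprob(left, new_word, logp) + bigram_logprob(new_word, right, logp)
    return old_score - removed + added

# Toy log-probabilities, invented for the example.
logp = {
    ("<s>", "three"): math.log(0.2),
    ("three", "boys"): math.log(0.1),
    ("boys", "living"): math.log(0.3),
    ("living", "here"): math.log(0.4),
    ("here", "</s>"): math.log(0.5),
}
current = ["three", "eight", "living", "here"]
old_score = full_score(current, logp)
new_score = incremental_score(current, old_score, 1, "boys", logp)
rescored = full_score(["three", "boys", "living", "here"], logp)
```

The update costs a constant number of lookups per change, whereas rescoring from scratch grows with sentence length.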
19
Word Substitution
➡ Chinese Restaurant Process (CRP) formulations
➡ Base distributions (P0): source = English LM probabilities, channel = uniform
➡ Priors: source (α) = 104, channel (β) = 0.01
➡ Smart sample-choice selection
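A Chinese Restaurant Process cache model is commonly written as P(f | e) = (β · P0(f | e) + count(e, f)) / (β + count(e)), where the counts come from the rest of the current sample. The sketch below applies that standard form to the channel, with the uniform base distribution and β = 0.01 listed above. The class name `CRPChannel` and the four-symbol cipher vocabulary are assumptions for illustration.

```python
from collections import Counter

class CRPChannel:
    """Cache model for P(cipher symbol | English word)."""

    def __init__(self, num_cipher_types, beta=0.01):
        self.beta = beta                    # channel prior
        self.p0 = 1.0 / num_cipher_types    # uniform base distribution P0
        self.pair_counts = Counter()        # count(e, f) in the current sample
        self.word_counts = Counter()        # count(e) in the current sample

    def prob(self, e, f):
        """Return (beta * P0 + count(e, f)) / (beta + count(e))."""
        numerator = self.beta * self.p0 + self.pair_counts[(e, f)]
        return numerator / (self.beta + self.word_counts[e])

    def add(self, e, f):
        """Record that English word e was paired with cipher symbol f."""
        self.pair_counts[(e, f)] += 1
        self.word_counts[e] += 1

    def remove(self, e, f):
        """Take a pairing out of the cache before resampling it."""
        self.pair_counts[(e, f)] -= 1
        self.word_counts[e] -= 1

channel = CRPChannel(num_cipher_types=4)
prior_prob = channel.prob("the", 95)      # empty cache: falls back to P0
for _ in range(3):
    channel.add("the", 95)
cached_prob = channel.prob("the", 95)     # cache now strongly prefers 95
total = sum(channel.prob("the", f) for f in (95, 16, 31, 42))
```

A small β makes the cache dominate after only a few observations, which pushes each English word toward a single cipher symbol. That matches the 1-to-1 nature of the substitution key.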
20
cipher: 22 94 43 04 98
current: three eight living here ?
resample: three the living here ?
resample: three a living here ?!
resample: three and living here ?!
resample: three boys living here ?
resample: three gun living here ?
resample: three brick living here ?
resample: three ran living here ?
...
[...] for re-sampling this cipher token
Thousands or millions of cipher tokens per epoch.
Current solution: sample [...] choices
Offline computation: Given X and Y, what are top-100 Z ranked by P(X Z Y)?
If X and Y never co-occurred, then pick 100 random words.
Word Substitution
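As far as the slide text can be read, smart sample-choice selection limits the candidates for a resampled position to the words that best fit between its two neighbours, with a random fallback when the neighbours were never seen together. The sketch below is one possible reading of that idea. It ranks by raw trigram counts from a toy corpus, and the function names, the corpus, and k = 2 in the usage are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def build_middle_index(corpus_sentences):
    """Offline step: for each context (x, y), count the words z seen between them."""
    index = defaultdict(Counter)
    for sentence in corpus_sentences:
        for x, z, y in zip(sentence, sentence[1:], sentence[2:]):
            index[(x, y)][z] += 1
    return index

def candidate_choices(x, y, index, vocabulary, k=100, rng=None):
    """Return up to k candidate words for the slot between x and y."""
    rng = rng or random.Random(0)
    if (x, y) in index:
        # Context seen in the corpus: keep the k most frequent middle words.
        return [z for z, _ in index[(x, y)].most_common(k)]
    # Context never seen: fall back to k random words.
    return rng.sample(sorted(vocabulary), min(k, len(vocabulary)))

corpus = [
    ["three", "boys", "living", "here"],
    ["three", "boys", "living", "there"],
    ["three", "girls", "living", "here"],
]
vocabulary = {w for sentence in corpus for w in sentence}
index = build_middle_index(corpus)
top = candidate_choices("three", "living", index, vocabulary, k=2)
fallback = candidate_choices("here", "three", index, vocabulary, k=2)
```

Restricting each resampling step to a short candidate list avoids scoring the whole vocabulary at every cipher token.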
21
Word Substitution
➡ Parallelized sampling using MapReduce (3 to 5-fold faster)
(details in paper)
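The slides defer the details of the parallel setup to the paper, so the following is only a schematic of the map/reduce pattern, not the authors' configuration. It assumes each mapper handles one shard of sampled derivations and emits (English word, cipher symbol) counts, and a reducer merges the partial tables. The shard contents and function names are hypothetical.

```python
from collections import Counter
from functools import reduce

def map_shard(shard):
    """Mapper: count (English word, cipher symbol) pairs in one shard of samples."""
    counts = Counter()
    for english_words, cipher_symbols in shard:
        counts.update(zip(english_words, cipher_symbols))
    return counts

def reduce_counts(left, right):
    """Reducer: merge two partial count tables."""
    return left + right

# Two hypothetical shards of (sampled English words, cipher symbols) pairs.
shards = [
    [(["three", "boys"], [22, 94])],
    [(["three", "girls"], [22, 17]), (["the", "boys"], [5, 94])],
]
total = reduce(reduce_counts, map(map_shard, shards), Counter())
```

In a real cluster the `map` call would be distributed across machines. Here it runs sequentially to keep the example self-contained.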
22
Word Substitution
23
Word Substitution
Cipher Original English Deciphered
24
25
Machine Translation without parallel data
[Figure: English corpus -> LM Training -> English Language Model -> English text (unknown, "?") -> English-to-Foreign Translation Model -> Spanish corpus]
Machine Translation without parallel data = Word Substitution + transposition + insertion + deletion + ...
26
Machine Translation without parallel data
➡ but without parallel data, training is intractable (e.g., IBM Model 3)
[...] using EM
[...] parallelized sampling)
27
Machine Translation without parallel data
➡ morphology, linguistic constraints (e.g., “8” in English maps to “8” in Spanish)
e.g., ARE YOU TALKING ABOUT ME ? THANK YOU TALKING ABOUT ?
28
Machine Translation without parallel data
➡ replace IBM Model 3 components with CRP processes (CRP cache model):
translation (word substitution)
fertility
distortion (transposition)
29
Machine Translation without parallel data
➡ efficient, scalable inference using strategies described earlier
➡ point-wise Gibbs sampling: for each foreign string f, jointly sample alignments, e translations
➡ sampling operators = translate 1 word, swap alignments, ... (similar to Germann et al., 2001)
➡ blocked sampling: sample single derivation for repeating sentences
(continued)
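Blocked sampling for repeated sentences can be pictured as grouping identical foreign strings and drawing one derivation per group. The sketch below shows only that grouping logic. `toy_sampler` is a hypothetical stand-in for a real Gibbs step over alignments and translations, and its English vocabulary is invented for the example.

```python
import random
from collections import defaultdict

def blocked_sample(sentences, sample_derivation, rng):
    """Sample one derivation per distinct sentence and share it across repeats."""
    positions = defaultdict(list)
    for i, sentence in enumerate(sentences):
        positions[tuple(sentence)].append(i)
    derivations = [None] * len(sentences)
    for sentence, indices in positions.items():
        derivation = sample_derivation(list(sentence), rng)
        for i in indices:
            derivations[i] = derivation
    return derivations

def toy_sampler(sentence, rng):
    """Hypothetical stand-in for a Gibbs step: pick a random English word per token."""
    vocabulary = ["well", "done", "one", "second", "."]
    return [rng.choice(vocabulary) for _ in sentence]

corpus = [["bien", "hecho", "."], ["un", "segundo", "."], ["bien", "hecho", "."]]
result = blocked_sample(corpus, toy_sampler, random.Random(0))
```

Short, formulaic text such as movie subtitles repeats many sentences, so sampling each distinct sentence once saves work and keeps its copies consistent.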
30
Machine Translation without parallel data
Spanish corpus:
... 10 días consecutivos de cotización 10 semanas consecutivas 100 años después ... 17 de abril 1986 ... años cuarto puesto enero alrededor. 28 ... mil años antes mil años ...
... una de tres horas uno de tres años un jueves por la noche Un día hace poco ...
English corpus:
... 10 MONTHS LATER 10 MORE YEARS 24 MINUTES 28 CONSECUTIVE QUARTERS ... A WEEK EARLIER ABOUT A DECADE AGO ABOUT A MONTH AFTER ... AUGUST 6 , 1789 ... CENTURIES AGO DEC . 11 , 1989 ... TWO DAYS LATER TWO DECADES LATER TWO FULL DAYS ... YEARS ...
BLEU score (higher is better):
Baseline without parallel data: 18.2
Decipherment without parallel data: 48.7
MOSES with parallel data: 88
31
Machine Translation without parallel data
OPUS Spanish/English corpus [ Tiedemann, 2009 ]
Spanish corpus:
... abran la puerta . bien hecho . ... ¡ por aquí ! ¿ a qué te refieres ? ¿ cómo podré verlos a través de mis lágrimas ?
... un segundo . vamonos . ...
English corpus:
... ALL RIGHT , LET' S GO . ARE YOU ALL RIGHT ? ARE YOU CRAZY ? ... HEY , DO YOU WANT TO COME OUT AND PLAY THE GAME ? ... WHAT ARE YOU DOING HERE ? ... YEAH ! YOU KNOW WHAT I MEAN ? ...
BLEU score (higher is better):
Decipherment without parallel data: 19.3
MOSES with parallel data: 63.6
32
Machine Translation without parallel data
[Figure: systems compared by MT quality: Phrase-based MT, parallel data; IBM Model 3 - distortion, parallel data; EM Decipherment, no parallel data]
33
➡ very challenging task, but shown to be possible! (using decipherment approach)
➡ initial results promising
➡ can easily extend to new language pairs, domains

➡ Scalable decipherment methods for full-scale MT
➡ Better unsupervised algorithms for decipherment
➡ Leverage existing bilingual resources (e.g., dictionaries) during decipherment
➡ Applications for domain adaptation
34
Language Translation: This talk
Monolingual corpora (Spanish text, English text) -> TRAIN -> Translation tables (associates / asociados : 0.8, ...)
Cryptanalysis: Afternoon talk (2pm, Machine Learning Session 2-B)
35
NLP