Translation without bilingual parallel corpora Chris Callison-Burch - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Translation without bilingual parallel corpora

Chris Callison-Burch Lecture 20

with Ann Irvine, Alex Klementiev, and David Yarowsky

slide-2
SLIDE 2

How to Improve Machine Translation

[Figure: translation quality (y-axis) plotted against the amount of bilingual training data (x-axis)]

❶ Better models ❷ More bilingual training data ❸ Eliminate the need for bitexts

slide-3
SLIDE 3
slide-4
SLIDE 4

Bilingual data varies by language

  • Urdu: 1.5M
  • Arabic and Chinese (DARPA GALE): 200M
  • European Parliament: 50M
  • French-English 10^9 word webcrawl: 1000M

slide-5
SLIDE 5

Monolingual data is more common

  • Typically we have orders of magnitude more monolingual data
  • Can we use monolingual data to learn translations?
  • Is that a crazy idea?
slide-6
SLIDE 6

הקישנ

slide-7
SLIDE 7

Scoring Translations: Time

[Figure: occurrences over time for two word pairs. terrorist (en) and terrorista (es): similar. terrorist (en) and riqueza (es): dissimilar.]

slide-8
SLIDE 8

Scoring Translations: Time

eólica        estambul      terrorista  vacuno
wind          istanbul      terrorist   beef
renewable     erdogan       terrorism   cattle
solar         turkish       terrorists  bse
sources       turkey        attacks     compulsory
renewables    turks         fight       meat
energy        ankara        attack      cows
energies      membership    terror      veal
electricity   negotiations  acts        cow
photovoltaic  undcp         threat      labelling
grid          talks         september   papayannakis
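The temporal signal can be sketched as a comparison of time-binned occurrence counts. The snippet below is a minimal illustration, not the lecture's implementation: the counts per time bin are hypothetical, and cosine similarity over normalized counts is assumed as the comparison function.

```python
import math

def normalize(counts):
    """Turn raw counts per time bin into relative frequencies."""
    total = sum(counts)
    return [c / total for c in counts] if total else list(counts)

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical occurrence counts in six consecutive time bins.
terrorist_en = [2, 3, 40, 12, 5, 4]
terrorista_es = [1, 2, 35, 10, 6, 3]
riqueza_es = [7, 8, 6, 7, 9, 8]

sim_good = cosine(normalize(terrorist_en), normalize(terrorista_es))
sim_bad = cosine(normalize(terrorist_en), normalize(riqueza_es))
print(f"terrorist vs terrorista: {sim_good:.3f}")
print(f"terrorist vs riqueza:    {sim_bad:.3f}")
```

Words that spike in the same time bins score close to 1, while a word with a flat profile scores lower against a bursty one.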

slide-9
SLIDE 9

If we consider oculist and eye-doctor we find that, as our corpus of utterances grows, these two occur in almost the same environments. In contrast, there are many sentence environments in which oculist occurs but lawyer does not... It is a question of the relative frequency of such environments, and of what we will obtain if we ask an informant to substitute any word he wishes for oculist (not asking what words have the same meaning). These and similar tests all measure the probability of particular environments occurring with particular elements... If A and B have almost identical environments we say that they are synonyms. –Zellig Harris (1954)

Distributional Hypothesis

slide-10
SLIDE 10

Vector Space Models of Word Similarity

Represent a word through the contexts that it has been observed in

He found five fish swimming in an old bathtub. He slipped down in the bathtub.

Context counts for bathtub: a 1, down 1, find 1, fish 1, five 1, he 2, in 2, slip 1, swim 1, the 1

Words compared: bathtub, water, money

slide-11
SLIDE 11

Vector Space Models of Word Similarity

Represent a word through the contexts that it has been observed in

He found five fish swimming in an old bathtub. He slipped down in the bathtub.

Context counts for bathtub: a 1, down 1, find 1, fish 1, five 1, he 2, in 2, slip 1, swim 1, the 1

Words compared: bathtub, water, money, using cos(bathtub, water)
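A context vector of this kind can be sketched in a few lines. This is a simplified illustration under stated assumptions: the context is taken to be every other token in the same sentence, tokens are lowercased but not lemmatized (the slide uses lemmas such as find and swim), and the sentences for water and money are hypothetical.

```python
import math
from collections import Counter

def context_vector(target, sentences):
    """Count the words that co-occur with `target` in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().replace(".", "").split()
        if target in tokens:
            counts.update(t for t in tokens if t != target)
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

sentences = [
    "He found five fish swimming in an old bathtub.",
    "He slipped down in the bathtub.",
    "Five fish were swimming in the water.",  # hypothetical
    "He found the money in an old wallet.",   # hypothetical
]

bathtub = context_vector("bathtub", sentences)
water = context_vector("water", sentences)
money = context_vector("money", sentences)

print(bathtub)
print("cos(bathtub, water) =", round(cosine(bathtub, water), 3))
print("cos(bathtub, money) =", round(cosine(bathtub, money), 3))
```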

slide-12
SLIDE 12

Scoring Translations: Context

  • ... este número podría crecer muy rápidamente si no se modifica ...
  • ... nuestras economías a crecer y desarrollarse de forma saludable ...
  • ... que nos permitirá crecer rápidamente cuando el contexto ...

[Figure: context vector for crecer over the dimensions rápidamente, economías, planeta, empleo, extranjero, with counts incremented as each sentence is read]

slide-13
SLIDE 13

Scoring Translations: Context

[Figure: the context vector for crecer over Spanish dimensions (rápidamente, economías, planeta, empleo, extranjero) is projected through a seed dictionary (dict.) into the English vector space (dimensions include quickly, policy, economic, growth, employment). The projected vector, crecer (projected), is then compared with the vectors of English candidates such as expand, activity, and policy.]
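The projection step can be sketched as follows. All counts, the seed dictionary entries, and the candidate vectors are hypothetical values chosen for illustration; the sketch assumes a one-to-one seed dictionary and simply drops source dimensions that the dictionary does not cover.

```python
import math

def project(vector, seed_dictionary):
    """Map source-language context dimensions into the target language."""
    projected = {}
    for source_word, count in vector.items():
        target_word = seed_dictionary.get(source_word)
        if target_word is not None:  # uncovered dimensions are dropped
            projected[target_word] = projected.get(target_word, 0) + count
    return projected

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical Spanish context counts for "crecer".
crecer = {"rápidamente": 2, "economías": 1, "empleo": 1, "planeta": 1}

# Hypothetical seed dictionary (it does not cover "planeta").
seed_dictionary = {
    "rápidamente": "quickly",
    "economías": "economies",
    "empleo": "employment",
}

# Hypothetical English context vectors for candidate translations.
candidates = {
    "expand": {"quickly": 3, "economies": 2, "employment": 1, "growth": 4},
    "activity": {"economies": 1, "policy": 5, "employment": 2},
    "policy": {"government": 6, "employment": 1},
}

crecer_projected = project(crecer, seed_dictionary)
ranked = sorted(
    ((word, cosine(crecer_projected, vec)) for word, vec in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for word, score in ranked:
    print(f"{word}: {score:.3f}")
```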

slide-14
SLIDE 14

Scoring Translations: Context

eólica         estambul      admirable    choque
wind           istanbul      remarkable   shock
nuclear        virginia      wonderful    shocks
hydroelectric  zagreb        admirable    clash
geothermal     london        splendid     disagreement
photovoltaic   oreja         magnificent  disparity
purchasing     rosales       excellent    link
saving         moscow        outstanding  contradiction
efficiency     attending     fantastic    divisions
atomic         washington    producing    confrontation
wielded        johannesburg  commendable  synergies

slide-15
SLIDE 15

Scoring Translations: Orthography

democracia (Spanish), democracy (English)

  • Etymologically related words often retain similar spelling across languages with the same writing system
  • Words with lower edit distances are sometimes good translations of each other

slide-16
SLIDE 16

Scoring Translations: Spelling

sanitario   desarrollos  volcánica  montana
sanitary    ferroalloy   volcanic   montana
sanitation  barrosos     volcanism  fontana
unitario    destroyers   voltaic    montane
sanitarium  mccarroll    vacancy    mentana
sanitation  disallows    konica     montagna
sagittario  disallow     dominica   montanha
sanitarias  scrolls      veronica   montan
kantaro     payrolls     monica     montano
sanitorium  carroll      volcano    montani
santoro     steamrolls   vratnica   montand

slide-17
SLIDE 17

Scoring Translations: Orthography

democracia (Spanish), democracy (English)

  • We measure edit distance for languages which share the same writing system
  • We transliterate for languages with different writing systems

демократия (Russian), demokratiya (transliterated), democracy (English)

Assign a similarity score with edit distance or with a discriminative transliteration model
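An edit-distance score of this kind can be sketched with the standard Levenshtein dynamic program. Normalizing by the longer string's length to get a similarity in [0, 1] is an assumption made for illustration, not something the slides specify.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

def orthographic_similarity(a, b):
    """Assumed normalization: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0

spanish_score = orthographic_similarity("democracia", "democracy")
russian_score = orthographic_similarity("demokratiya", "democracy")
print("democracia / democracy:", spanish_score)
print("demokratiya / democracy:", russian_score)
```

For a language with a different script, the word is transliterated first (demokratiya above) and the same function is applied to the result.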

slide-18
SLIDE 18

Transliteration using SMT

ا ل ی گ ز ی ن ڈ ر ی ا
a l e x a n d r i a

slide-19
SLIDE 19

Character-based translation

  • Instead of aligning words across sentence pairs, we align characters across name pairs
  • Learn translation rules for sequences of letters
  • Language model is n-graph letter sequence built from English names
  • Requires:
    – Many pairs of foreign-English names
    – Many names written in English for LM

slide-20
SLIDE 20

Transliteration training data

  • Extracted name pairs from automatically word-aligned parallel corpus
  • Gathered training data from Wikipedia
    – 890 articles about people w/inter-language links
  • Hired Urdu speakers on Mechanical Turk to transliterate names
    – gathered 5,470 English->Urdu names and 5,470 Urdu->English names
    – 2/3 of the data was high quality
    – 12,384 additional names for <$300

slide-21
SLIDE 21

Learning Curve

[Figure: average edit distance (y-axis) plotted against training data size (x-axis)]

slide-22
SLIDE 22

Example transliterations

slide-23
SLIDE 23

Scoring Translations: Orthography

democracia (Spanish), democracy (English)

  • Etymologically related words often retain similar spelling across languages with the same writing system
  • We transliterate for languages with different writing systems

демократия (Russian), demokratiya (transliterated), democracy (English)

Assign a similarity score with edit distance or with a discriminative transliteration model

slide-24
SLIDE 24

Scoring Translations: Topics

[Figure: phrases in L1 and L2 compared by their occurrence across Topic 1, Topic 2, Topic 3, ..., Topic N]

Phrases and their translations are used to describe the same topics. The more similar the sets of topics two phrases appear in, the more likely they are translations. We treat Wikipedia article pairs with interlingual links as topics.

slide-25
SLIDE 25

Scoring Translations: Context

[Figure: the context vector for crecer over Spanish dimensions (rápidamente, economías, planeta, empleo, extranjero) is projected through a seed dictionary (dict.) into the English vector space and compared with English candidates such as expand, activity, and policy]

slide-26
SLIDE 26

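Scoring Translations: Topics

Wikipedia article pairs serving as topics:

  • Barack_Obama / Обама,_Барак
  • Virginia / Виргиния
  • Iraq_War / Иракская_война
  • Ückeritz / Иккериц
  • Otto_von_Bismarck / Бисмарк,_Отто_фон
  • Music / Музыка

[Figure: per-article occurrence counts for troops compared with those for the Russian words войска, завтра, and цветок]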

slide-27
SLIDE 27

Scoring Translations: Context

sanitario     desarrollos   volcánica     montana
health        developments  volcanic      montana
transcultural developed     eruptions     miley
medical       development   volcanism     hannah
sanitation    used          lava          beartooth
patient       using         plumes        cyrus
deliverables  modern        eruption      crazier
pharmaceutica based         volcano       bozeman
sewerage      important     volcanoes     chelsom
healthcare    history       breakouts     absaroka
care          different     volcanically  baucus

slide-28
SLIDE 28

How good is each approach?

We have a wide variety of ways of using monolingual texts to measure translation equivalence. Which is the best?

We measured the accuracy on 24 languages: Albanian, Azeri, Bengali, Bosnian, Bulgarian, Cebuano, Gujarati, Hindi, Hungarian, Indonesian, Latvian, Nepali, Romanian, Serbian, Slovak, Somali, Swedish, Tamil, Telugu, Turkish, Ukrainian, Uzbek, Vietnamese and Welsh.

For each foreign word we computed a ranked list of English words using each signal of translation equivalence. The number of candidate English words varied by language, from 34,000 to 287,000.

slide-29
SLIDE 29

How good is each approach?

We compared the predictions against a bilingual dictionary for each language, and calculated whether a good translation occurred anywhere in its top-k predictions.

\[ \mathrm{acc}_k = \frac{\sum_{l \in L} I_{lk}}{|L|} \]

  • acc_k: accuracy at rank k
  • |L|: number of words in the test set for a language
  • I_{lk}: 1 iff a correct item is in the top-k list of translations for word l
  • The sum runs over all test words l in L
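Accuracy at rank k can be computed directly from ranked candidate lists and a reference dictionary. The data below is hypothetical and the function name `accuracy_at_k` is introduced only for this sketch.

```python
def accuracy_at_k(ranked_predictions, reference, k):
    """Fraction of test words with a correct translation in their top k."""
    if not reference:
        return 0.0
    hits = 0
    for word, gold_translations in reference.items():
        top_k = ranked_predictions.get(word, [])[:k]
        if any(candidate in gold_translations for candidate in top_k):
            hits += 1  # indicator I_lk is 1 for this word
    return hits / len(reference)

# Hypothetical ranked predictions (best first) and reference dictionary.
predictions = {
    "crecer": ["expand", "grow", "activity"],
    "empleo": ["policy", "job", "work"],
    "planeta": ["planet", "earth", "world"],
}
reference = {
    "crecer": {"grow"},
    "empleo": {"employment"},
    "planeta": {"planet"},
}

acc_at_1 = accuracy_at_k(predictions, reference, 1)
acc_at_2 = accuracy_at_k(predictions, reference, 2)
print(acc_at_1, acc_at_2)
```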

slide-30
SLIDE 30

Wide range of signals

We measured the top-10 accuracy for 18 signals of translation equivalence, and averaged across the 24 languages.

  1. Web Crawls Contextual Similarity
  2. Web Crawls Temporal Similarity
  3. Orthographic Similarity
  4. Wikipedia Contextual Similarity
  5. Wikipedia Topic Similarity
  6. Wikipedia Frequency Similarity
  7. Wikipedia IDF Similarity
  8. Wikipedia Burstiness Similarity
  9. Web Crawls Prefix Contextual Similarity
  10. Web Crawls Prefix Temporal Similarity
  11. Web Crawls Suffix Contextual Similarity
  12. Web Crawls Suffix Temporal Similarity
  13. Wikipedia Prefix Contextual Similarity
  14. Wikipedia Prefix Topical Similarity
  15. Wikipedia Suffix Contextual Similarity
  16. Wikipedia Suffix Topical Similarity
  17. String Identity
  18. Inverse Log of Target Wikipedia Frequency
slide-31
SLIDE 31

Per-Signal Results

[Figure: bar chart of accuracy in top-10 (0.0 to 1.0) for each signal: Crawls Context, Edit Distance, Crawls Time, Wiki Context, Wiki Topic, Prefix Wiki Context, Prefix Wiki Topic, Prefix Crawls Context, Prefix Crawls Time, Suffix Wiki Context, Suffix Wiki Topic, Suffix Crawls Context, Suffix Crawls Time, Is-Identical, Diff Log Freqs, Inverse Log Trg Freq, Burstiness, IDF, and MRR]

Wiki features do best, but their max performance is <40% accuracy and their average accuracy is <20%

slide-32
SLIDE 32

Combining signals

On its own, each of these measures of translation equivalence is a weak signal. Can we combine the weak signals into something stronger? If so, how?

\[ \mathrm{MRR}_e = \frac{1}{|H|} \sum_{h \in H} \frac{1}{r_{he}} \]

  • MRR_e: Mean Reciprocal Rank of word e
  • H: set of signals
  • r_{he}: rank of word e by signal h
  • 1 / r_{he}: 1 over the rank
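Mean reciprocal rank over signals can be sketched as below. The rankings are hypothetical, and one detail is an assumption of this sketch: a candidate that a signal does not rank at all contributes 0 for that signal.

```python
def mrr_scores(rankings):
    """Average, over signals, of 1 / rank for every candidate word.

    rankings maps a signal name to its ranked candidate list (best first).
    """
    candidates = set().union(*rankings.values())
    scores = {}
    for word in candidates:
        total = 0.0
        for ranked in rankings.values():
            if word in ranked:
                total += 1.0 / (ranked.index(word) + 1)
            # Assumption: an unranked candidate adds 0 for this signal.
        scores[word] = total / len(rankings)
    return scores

# Hypothetical rankings of English candidates for one foreign word.
rankings = {
    "context": ["expand", "grow", "activity"],
    "time": ["grow", "expand", "policy"],
    "edit_distance": ["grow", "crater", "expand"],
}

scores = mrr_scores(rankings)
best = max(scores, key=scores.get)
for word in sorted(scores, key=scores.get, reverse=True):
    print(f"{word}: {scores[word]:.3f}")
```

Because it needs only the ranks, this combination requires no training data.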

slide-33
SLIDE 33

Combining signals

MRR is an unsupervised approach to combining signals. We also introduce a novel discriminative approach that exploits the fact that we use a small bilingual dictionary to project across vector spaces.

We train a binary classifier to predict whether a word is a translation or not. Translations from our dictionary serve as positive training examples. Each one is paired with 3 randomly selected non-translations as negative training examples.

We rank translations based on the strength of the classifier’s prediction that a word is a translation.
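The construction of the training examples can be sketched as follows. The dictionary, the vocabulary, and the function name `build_training_pairs` are hypothetical; the sketch covers only how positives and negatives are assembled, and leaves the feature extraction and the choice of classifier open.

```python
import random

def build_training_pairs(dictionary, target_vocabulary,
                         negatives_per_positive=3, seed=0):
    """Return (source, target, label) triples for a binary classifier."""
    rng = random.Random(seed)
    examples = []
    for source, translations in dictionary.items():
        non_translations = [w for w in target_vocabulary
                            if w not in translations]
        for translation in sorted(translations):
            examples.append((source, translation, 1))  # positive example
            # Pair each positive with randomly drawn non-translations.
            for negative in rng.sample(non_translations,
                                       negatives_per_positive):
                examples.append((source, negative, 0))
    return examples

# Hypothetical seed dictionary and English candidate vocabulary.
dictionary = {
    "crecer": {"grow", "expand"},
    "empleo": {"employment"},
}
vocabulary = ["activity", "employment", "expand", "grow",
              "planet", "policy", "quickly", "wealth"]

examples = build_training_pairs(dictionary, vocabulary)
positives = [e for e in examples if e[2] == 1]
negatives = [e for e in examples if e[2] == 0]
print(len(positives), "positives,", len(negatives), "negatives")
```

Each triple would then be turned into a feature vector of the individual signal scores for that word pair before training.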

slide-34
SLIDE 34

Per-Signal Results

[Figure: bar chart of accuracy in top-10 (0.0 to 1.0) for each signal: Crawls Context, Edit Distance, Crawls Time, Wiki Context, Wiki Topic, Prefix Wiki Context, Prefix Wiki Topic, Prefix Crawls Context, Prefix Crawls Time, Suffix Wiki Context, Suffix Wiki Topic, Suffix Crawls Context, Suffix Crawls Time, Is-Identical, Diff Log Freqs, Inverse Log Trg Freq, Burstiness, IDF, MRR, and Discriminative Model]

Our discriminative model substantially outperforms MRR, and does better than all individual features.

slide-35
SLIDE 35

Per-Language Results

[Figure: top-10 accuracy (0.0 to 0.5) per language, baseline versus supervised model. Languages shown: Vietnamese, Uzbek, Somali, Turkish, Hungarian, Nepali, Azeri, Cebuano, Indonesian, Swedish, Slovak, Bengali, Ukrainian, Tamil, Latvian, Albanian, Telugu, Bosnian, Hindi, Welsh, Gujarati, Serbian, Romanian, Bulgarian]

slide-36
SLIDE 36

Example translation

one forest one उ&च density its साथ one field is wood ( tree ) एसज5गल its lots

definitions , is which of various मानद5डo on based J.यह पोदाM total ९.४ % the earth

of surface को surround R is ( either 30 % ) those of आवासo ( habitat ) STUोलोiगक

Vवाह ( hydrologic flow ) मोWलातोसX ( modulator ) , and soil ( soil ) safeguard , one the earth its बीओि]फअ का rules important sides of गठन.का foreign do is history telling is , of " forest " one बीहड़ field whose means कानaनी for on बाजa of for nidhirit ि◌शकार ( hunting ) its iारा साम5ती ( feudal ) कuलीनता ( nobility ) is , and these ि◌शकार in jungles compulsory more if me all ( see wild no was royal forest ( royal forest ) ) .हालsiक , ि◌शकार its in jungles usual वuडलuड its importance areas को िशामल did while , शvद forest at the end wild land more generally means do of for was था.एक वuडलuड ( woodland ) which of one ज5गल from different is .

Dictionary gloss for Hindi Wikipedia article

slide-37
SLIDE 37

Example translation

one forest one systolic density of which one field is tree ( tree ) canopy of many

definitions , is which of various crm on based han.yh nearly headless . % of the earth surface ko surround te is ( or 30 % ) which of keyhole ( organisms ) canopy irr ( telecom low ) modulators ( coniferous ) , and soil ( erosion ) safeguard , one the earth of app ka more important sides of gthn.ka foreign to do is history telling is , of " forest " one maestra field whose means responsibility for on pulleys of for nidhirit mane ( africana ) of dhara necker ( electors ) émigrés ( forest ) is , and these lions forests more necessary if among all ( see no wild the royal forest ( royal society ) ) .hallanki , mane of forests often evergreen of important areas ko they did while , quirk forest at the end wild land more generally means do its for was tho.aq evergreen ( forests ) which of one forest from different is .

Translation for same article with Dictionary + Transliterations + Induced Translations

slide-38
SLIDE 38

End-to-End MT

Could we do full end-to-end machine translation without using any bilingual parallel corpora? Aside from learning the translations of words, and estimating their probabilities, what else would we need? Discuss with your neighbor.

slide-39
SLIDE 39

Re-ordering model

[Figure: word-aligned sentence pair "How much you should charge for your Facebook profile" / "Wieviel man aufgrund seines Profils in Facebook verdienen sollte", with each phrase pair labeled m, s, or d]

Reordering features are probability estimates of s, d, and m

  • m: monotone (keep order)
  • s: swap order
  • d: become discontinuous
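The three orientations can be sketched by comparing the source span of each phrase with the source span of the phrase translated just before it. The span indices below are a hypothetical alignment of the German example, and estimating the probabilities by relative frequency over a single sentence is only for illustration; in practice the counts would be accumulated over many sentences.

```python
from collections import Counter

def orientation(prev_span, cur_span):
    """Classify a phrase relative to the previously translated phrase.

    Spans are (start, end) source word positions, inclusive.
    """
    prev_start, prev_end = prev_span
    cur_start, cur_end = cur_span
    if cur_start == prev_end + 1:
        return "m"  # monotone: directly follows the previous phrase
    if cur_end == prev_start - 1:
        return "s"  # swap: directly precedes the previous phrase
    return "d"      # discontinuous: anything else

def orientation_probabilities(source_spans):
    """Label each phrase and return relative frequencies of m, s, d."""
    labels = []
    prev = (-1, -1)  # virtual span before the start of the sentence
    for span in source_spans:
        labels.append(orientation(prev, span))
        prev = span
    counts = Counter(labels)
    total = len(labels) or 1
    return labels, {label: counts[label] / total for label in "msd"}

# Hypothetical source spans, listed in English (target) phrase order, for:
# Wieviel(0) man(1) aufgrund(2) seines(3) Profils(4) in(5) Facebook(6)
# verdienen(7) sollte(8)
spans = [
    (0, 0),  # How much -> Wieviel
    (1, 1),  # you      -> man
    (8, 8),  # should   -> sollte
    (7, 7),  # charge   -> verdienen
    (2, 2),  # for      -> aufgrund
    (3, 3),  # your     -> seines
    (5, 6),  # Facebook -> in Facebook
    (4, 4),  # profile  -> Profils
]

labels, probabilities = orientation_probabilities(spans)
print(labels)
print(probabilities)
```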

slide-40
SLIDE 40

Re-ordering model (monolingual)

Mono German: Das Anlegen eines Profils in Facebook ist einfach.
Mono English: What does your Facebook profile reveal?

[Figure: a phrase table with German and English columns, including the entries Profils / profile and in Facebook / Facebook, links phrases in the two sentences; the linked phrases are labeled s]

Estimate same probabilities, but from pairs of (unaligned) sentences taken from monolingual data

Repeat over many sentences

slide-41
SLIDE 41

Re-ordering model

[Figure: two examples of the orientation s. Monolingual pair: "What does your Facebook profile reveal" / "Das Anlegen eines Profils in Facebook ist einfach". Bilingual pair: "How much you should charge for your Facebook profile" / "Wieviel man aufgrund seines Profils in Facebook verdienen sollte".]

slide-42
SLIDE 42

Experimental Setup

  • How well can we estimate the parameters of a phrase-based SMT system with monolingual data?
    – Performed ablation study to remove each part of the standard bilingually estimated system
    – Restored each component with monolingual equivalent
  • Cleanroom experiment
    – Phrase-table is same across two conditions
  • Data
    – Europarl parallel corpus (50M words)
    – Spanish and English Gigaword corpora (1B words)
    – Spanish and English paired Wikipedia articles (40-60M words)

Idealization
slide-43
SLIDE 43

Spanish-English MT w/o bitexts

[Figure: bar chart (axis 6 to 25) comparing the conditions listed below. Bar values as extracted, not matched to the labels: 23.3, 18.8, 17.9, 10.2, 4.0, 12.9, 21.5, 21.9]

  • Standard phrase-based MT
  • Removed the bilingual reordering model
  • Removed bilingual translation probabilities
  • Removed all bilingual features
  • Added monolingual reordering model
  • Added monolingual phrase features
  • All and only monolingual features
  • Standard phrase-based MT + monolingual phrase features

83% of performance recovered

Full details in Klementiev, Irvine, Callison-Burch and Yarowsky (EACL 2012)

slide-44
SLIDE 44

Translation comparison

Bilingually estimated:

The US administration can inject 700 billion dollars in banking

The highest representatives of the congress and the government, the president George W. Bush, reached agreement in a pact in broad terms of financial aid to the system of American finance. The vote will take place at the beginning of next week. The American legislators caused a gap in the talks on the approval of the rescue plan in the form of aid to the US financial system with the amount of 700 billion dollars. However, is not yet won. The US congressmen must fine-tune certain details of the contract before they can make public the final shape of the law and that is adopted. the plan of aid to the financial system

Monolingually estimated:

The US government can inject 700 billion dollars of the bank

The highest representatives of congress and the government, the president George W. Bush, agreed to a pact many terms of financial aid to the system of finance American. The vote will take place as early as next week. The legislature American caused a breach in talks on the approval of rescue plan in the form of the financial system American with the amount of 700 million dollars. However, is not yet livestock . Congress further to some details of the contract before it can make public the final form of the law, with an voted. The plan of the financial system will

slide-45
SLIDE 45

Announcements

Will Lewis from Microsoft Research will be giving the lecture on Thursday. He has a lot of job openings. If you’d like to meet with him, email me tonight.

Deadlines:

  • Tonight: complete term project is due. No extensions.
  • April 16: read over other students’ projects and vote on the ones you want to do as your final HW assignment.
  • Tuesday April 28th (last day of class): (1) Turn in your solution to one of the other team’s projects as your final HW. (2) Language research assignment is due.