Multilingual Information Retrieval
Doug Oard
College of Information Studies and UMIACS University of Maryland, College Park USA
January 14, 2019 AFIRM
[Figure: Global Trade. Exports and Imports (Trillions of USD, axes from 0.0 to 2.5) for the USA, Japan, South Korea, Hong Kong, China, and the EU. Source: Wikipedia (mostly 2017 estimates)]
[Figure: L1 speakers and L2 speakers by language. Axis label: "Billions of Speakers", ticks 200 to 1,200. Languages shown: Southern Min, Persian, Thai, Hausa, Italian, Vietnamese, Yue Chinese, Tamil, Marathi, Korean, Turkish, Telugu, Wu Chinese, Javanese, Western Punjabi, Swahili, Japanese, German, Urdu, Indonesian, Portuguese, Bengali, Russian, Modern Std Arabic, French, Spanish, Hindi, Mandarin Chinese, English. Source: Ethnologue (SIL), 2018]
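[Figure: two sets of percentage shares by language (English, Chinese, Spanish, Japanese, Portuguese, German, Arabic, French, Russian, Korean). First set: 64%, 5%, 4%, 6%, 2%, 8%, 2%, 4%, 5%, 0%. Second set: 33%, 28%, 9%, 6%, 5%, 5%, 4%, 4%, 4%, 2%]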
– Document containing more than one language
– Collection of documents in different languages
– Can retrieve from a mixed-language collection
– Query in one language finds document in another
– Queries can find documents in any language
– Focusing on document representation
– To the extent time allows
[Diagram: Documents and Query each pass through a Representation Function, giving a Document Representation (stored in the Index) and a Query Representation; a Comparison Function matches the two to produce Hits]
| 0 NUL | 32 SPACE | 64 @ | 96 ` |
| 1 SOH | 33 ! | 65 A | 97 a |
| 2 STX | 34 " | 66 B | 98 b |
| 3 ETX | 35 # | 67 C | 99 c |
| 4 EOT | 36 $ | 68 D | 100 d |
| 5 ENQ | 37 % | 69 E | 101 e |
| 6 ACK | 38 & | 70 F | 102 f |
| 7 BEL | 39 ' | 71 G | 103 g |
| 8 BS | 40 ( | 72 H | 104 h |
| 9 HT | 41 ) | 73 I | 105 i |
| 10 LF | 42 * | 74 J | 106 j |
| 11 VT | 43 + | 75 K | 107 k |
| 12 FF | 44 , | 76 L | 108 l |
| 13 CR | 45 - | 77 M | 109 m |
| 14 SO | 46 . | 78 N | 110 n |
| 15 SI | 47 / | 79 O | 111 o |
| 16 DLE | 48 0 | 80 P | 112 p |
| 17 DC1 | 49 1 | 81 Q | 113 q |
| 18 DC2 | 50 2 | 82 R | 114 r |
| 19 DC3 | 51 3 | 83 S | 115 s |
| 20 DC4 | 52 4 | 84 T | 116 t |
| 21 NAK | 53 5 | 85 U | 117 u |
| 22 SYN | 54 6 | 86 V | 118 v |
| 23 ETB | 55 7 | 87 W | 119 w |
| 24 CAN | 56 8 | 88 X | 120 x |
| 25 EM | 57 9 | 89 Y | 121 y |
| 26 SUB | 58 : | 90 Z | 122 z |
| 27 ESC | 59 ; | 91 [ | 123 { |
| 28 FS | 60 < | 92 \ | 124 | |
| 29 GS | 61 = | 93 ] | 125 } |
| 30 RS | 62 > | 94 ^ | 126 ~ |
| 31 US | 63 ? | 95 _ | 127 DEL |
– French, Spanish, Catalan, Galician, Basque, Portuguese, Italian, Albanian, Afrikaans, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English
Printable Characters, 7-bit ASCII Additional Defined Characters, ISO 8859-1
– Two-byte encoding schemes (e.g., EUC) are used
– GB in the People's Republic of China, BIG5 in Taiwan, JIS in Japan, KS in Korea, TCVN in Vietnam
– Research Libraries Group developed EACC
– ISO Standard 10646
– Code space extends Latin-1
– UTF-7 encoding will pass through email
– UTF-8 encoding is designed for disk file systems
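The encoding points above can be checked directly with Python's built-in codecs. This is a minimal sketch, not part of the original slides: the helper name `encoded_sizes` and the sample character are illustrative assumptions.

```python
def encoded_sizes(text, encodings=("latin-1", "utf-8", "utf-16-le", "gb2312", "big5")):
    """Return the encoded length in bytes per encoding, or None if not representable."""
    sizes = {}
    for name in encodings:
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            # The character has no code point in this character set.
            sizes[name] = None
    return sizes


# Unicode's code space extends Latin-1: the code point equals the Latin-1 byte value.
latin1_byte = "é".encode("latin-1")[0]
code_point = ord("é")

# One Chinese character: absent from Latin-1, two bytes in GB and in BIG5
# (with different byte values), three bytes in UTF-8.
sizes = encoded_sizes("中")

# UTF-7 output uses only 7-bit values, which is why it survives 7-bit mail transport.
utf7_bytes = "中".encode("utf-7")
```

The same character therefore has different byte sequences under GB and BIG5, so a retrieval system needs to know (or guess) the encoding before it can index the text.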
– e.g., accents can be part of a character or separate
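The accent example can be reproduced with Python's standard `unicodedata` module. The sketch below is illustrative only; the helper `strip_accents` is a hypothetical name, and whether to fold accents at all is a per-language design decision.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single precomposed code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# The two strings render the same but compare unequal until normalized.
same_before = composed == decomposed
same_after = unicodedata.normalize("NFC", decomposed) == composed


def strip_accents(text):
    """Decompose, then drop combining marks (hypothetical helper for accent folding)."""
    decomposed_text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed_text if not unicodedata.combining(ch))
```

Applying the same normalization form to both documents and queries keeps visually identical strings from failing to match.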
– But they come from unrelated languages
– But what we actually search are character strings
– In English, words are often a good choice
– In German, compounds may need to be split
– In Chinese, word boundaries are not marked
– Morphemes are the units of meaning – Combined to make words
– Doug ’s running late !
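A split like the one in the example above can be approximated with a single regular expression. This is a rough sketch under the assumption that clitics start with an apostrophe; real tokenizers handle many more cases.

```python
import re

# Words, apostrophe-led clitics such as 's, and single punctuation marks.
TOKEN = re.compile(r"\w+|['’]\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN.findall(text)
```

For instance, `tokenize("Doug's running late!")` separates the possessive/contraction marker and the exclamation mark from the neighboring words.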
Morphological Segmentation: Swahili Example
a + li + ni + andik + ish + a
he + past-tense + me + write + causer-effect + declarative-mode
Credit: Ramy Eskander
Morphological Segmentation: Somali Example
cun + t + aa
eat + she + present-tense
Credit: Ramy Eskander
– Rule-based suffix-stripping helps for English
– Prefix-stripping is needed in some languages
– Overstemming
– Understemming
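Both failure modes show up even in a toy suffix stripper. The rule list below is a deliberately tiny assumption made for illustration; it is not the Porter stemmer or any stemmer named in the slides.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # toy rule list, checked in this order


def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word
```

Here "walking" and "walked" both reduce to "walk", which is the intended conflation. "news" reduces to "new" (overstemming: unrelated words merge), while "running" becomes "runn" and "runs" becomes "run" (understemming: related words stay apart).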
– Remove the longest single substring in the list – Repeat until no substrings are found in the list
– washington
– ach, hin, hing, sei, ton, was, wasch
– was-hing-ton – Roughly translates as “What tone is attached?”
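The greedy procedure above can be sketched in a few lines. The function names are illustrative, and breaking ties by list order and first occurrence is an assumption the slide does not specify.

```python
def find_longest(pieces, entries):
    """Return (piece index, entry) for the longest entry inside any unmatched piece."""
    for entry in entries:  # entries are sorted longest first
        for index, (text, matched) in enumerate(pieces):
            if not matched and entry in text:
                return index, entry
    return None


def split_compound(word, lexicon):
    """Repeatedly carve out the longest lexicon substring until none remains."""
    entries = sorted((e for e in lexicon if e), key=len, reverse=True)
    pieces = [(word, False)]  # (text, already matched against the lexicon?)
    while (found := find_longest(pieces, entries)) is not None:
        index, entry = found
        text, _ = pieces[index]
        start = text.find(entry)
        before, after = text[:start], text[start + len(entry):]
        replacement = []
        if before:
            replacement.append((before, False))
        replacement.append((entry, True))
        if after:
            replacement.append((after, False))
        pieces[index:index + 1] = replacement
    return [text for text, _ in pieces]


LEXICON = ["ach", "hin", "hing", "sei", "ton", "was", "wasch"]
parts = split_compound("washington", LEXICON)
```

With the word list from the slide, "hing" is removed first as the longest match, leaving "was" and "ton", which reproduces the was-hing-ton split and shows how the method can oversplit a proper name.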
petroleum probe survey take samples restrain
petroleum probe survey take samples cymbidium goeringii
– c1 c2 c3 … cn – c1 c2 c3 c3 … cn – c1 c2 c3 … cn – etc.
– Compute Pr(w1 w2 w3 ) using a language model
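Scoring candidate segmentations with a language model can be done without enumerating them all. The sketch below assumes a unigram model, so Pr(w1 w2 w3) is the product of individual word probabilities, and uses dynamic programming to find the best split. The vocabulary and probabilities are made up for illustration, and English text without spaces stands in for unsegmented text.

```python
import math

# Hypothetical unigram probabilities, chosen only to illustrate the method.
WORD_PROB = {
    "the": 0.05, "they": 0.01, "you": 0.02, "youth": 0.002,
    "out": 0.01, "he": 0.01, "event": 0.002, "vent": 0.0005,
}


def segment(text, word_prob, max_len=6):
    """Return the most probable segmentation under a unigram model, or None."""
    # best[i] = (log probability of the best split of text[:i], start of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_prob and best[start][0] > -math.inf:
                score = best[start][0] + math.log(word_prob[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[-1][0] == -math.inf:
        return None  # no split covers the whole string
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]
```

Under these toy probabilities, "theyouthevent" splits as "the youth event" rather than "they out he vent", because the product of the word probabilities is higher.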
c1 c2 , c2 c3 , c3 c4 , … , cn-1 cn
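Overlapping character bigrams like those listed above need no lexicon or segmenter. A minimal sketch, with the function name chosen for illustration:

```python
def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

For a four-character string this yields three bigrams, each sharing one character with its neighbor, so any two-character word in the text is guaranteed to appear as an index term.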
– So the key is to index the right kind of terms!
– We have focused on character coded text – Same ideas apply to handwriting, OCR, and speech
– Words where possible, character n-grams otherwise
– Stemming, phrases, …
– Focusing on document representation
– To the extent time allows
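[Diagram: a Translation System converts the Somali Document Collection into an English Document Collection; the Retrieval Engine matches English queries against it and returns Results to select and examine]
[Diagram: a Translation System converts English queries into Somali queries; the Retrieval Engine matches them against Somali documents from the Somali Document Collection and returns Results to select and examine]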
– Efficient for short queries (not relevance feedback) – Limited context for ambiguous query terms
– Rapid support for interactive selection – Need only be done once (if query language is same)
[Figure: Indexing time (sec) versus Thousands of documents (100 to 500), comparing monolingual and cross-language indexing]
[Diagram: “Interlingual” Retrieval. Somali Query Terms go through Query “Translation” and English Document Terms go through Document “Translation”; matching the two yields a ranked list (1: 0.91, 2: 0.57, 3: 0.36)]
– Phrase books, bilingual dictionaries, …
– Translations (“parallel”) – Similar topics (“comparable”)
– Similar writing (if the character set is the same) – Similar pronunciation
– May be able to guess topic from lousy translations
– Organization of knowledge
– Ontology specialized to support search
– Rich word list, designed for use by people
– Rich word list, designed for use by a machine
– Pairs of translation-equivalent terms
[Figure legend: Named entities removed; Named entities from term list; Named entities added; Full Query]
[Diagram: matching Document terms against the Translation Lexicon in four stages: surface form to surface form, stem to surface form, surface form to stem, and stem to stem. Example: the French forms mangez, mange, and mangent all lead to the English translation eat]
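The four matching stages (surface form or stem on the document side, against surface form or stem on the lexicon side) can be sketched as a backoff lookup. The stemmer here is a toy stand-in and the one-entry lexicons are hypothetical; only the order of the stages follows the slide.

```python
def toy_stem(word):
    """Toy French stemmer for illustration only: strips a final 'nt' or 'z'."""
    for suffix in ("nt", "z"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def backoff_translate(term, lexicon, stem=toy_stem):
    """Return (stage, translations); stage 0 means no match at any stage."""
    # Stage 1: document surface form against lexicon surface forms.
    if term in lexicon:
        return 1, lexicon[term]
    # Stage 2: document stem against lexicon surface forms.
    if stem(term) in lexicon:
        return 2, lexicon[stem(term)]
    # Stages 3 and 4 need the lexicon keyed by stemmed entries.
    stemmed = {}
    for source, targets in lexicon.items():
        stemmed.setdefault(stem(source), []).extend(targets)
    # Stage 3: document surface form against lexicon stems.
    if term in stemmed:
        return 3, stemmed[term]
    # Stage 4: document stem against lexicon stems.
    if stem(term) in stemmed:
        return 4, stemmed[stem(term)]
    return 0, []
```

Trying the exact forms first keeps precise matches when the lexicon has them, and stemming is used only to recover coverage when it does not.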
Hieroglyphic Egyptian Demotic Greek
– Document pairs – Sentence pairs – Term pairs
– Collection pairs – Document pairs
– DE-News (German-English) – Hong-Kong News, Xinhua News (Chinese-English)
– Canadian Hansards (French-English) – Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish) – UN Treaties (Russian, English, Arabic, …)
– Bible, Koran, Book of Mormon
English: Diverging opinions about planned tax reform
German: Unterschiedliche Meinungen zur geplanten Steuerreform
English: Madam President, I had asked the administration …
Spanish: Señora Presidenta, había pedido a la administración del Parlamento …
$p(f \mid e)$, where $\sum_{f_i} p(f_i \mid e) = 1$
p(探测|survey) = 0.4
p(试探|survey) = 0.3
p(测量|survey) = 0.25
p(样品|survey) = 0.05
– Takes advantage of multiple translations and translation probabilities
$$TF(e, D_k) = \sum_{f_i} p(f_i \mid e) \times TF(f_i, D_k)$$
$$DF(e) = \sum_{f_i} p(f_i \mid e) \times DF(f_i)$$
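The two sums above translate directly into code. In this sketch the translation probabilities are the ones given for "survey" earlier in the deck, while the document counts and document frequencies are made-up values for illustration.

```python
def translated_tf(translations, doc_tf):
    """TF(e, D_k): probability-weighted sum of the term frequencies of e's translations."""
    return sum(p * doc_tf.get(f, 0) for f, p in translations.items())


def translated_df(translations, df):
    """DF(e): probability-weighted sum of the document frequencies of e's translations."""
    return sum(p * df.get(f, 0) for f, p in translations.items())


# p(f_i | survey) as listed on the slide.
SURVEY = {"探测": 0.4, "试探": 0.3, "测量": 0.25, "样品": 0.05}

# Hypothetical counts for one document and for the whole collection.
doc_tf = {"探测": 2, "测量": 4}
collection_df = {"探测": 100, "试探": 40, "测量": 200, "样品": 20}

tf_survey = translated_tf(SURVEY, doc_tf)        # 0.4*2 + 0.25*4
df_survey = translated_df(SURVEY, collection_df)  # 0.4*100 + 0.3*40 + 0.25*200 + 0.05*20
```

The resulting TF and DF are generally fractional, which is fine for ranking functions that only need them as numeric weights.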
$$\sum_{e \in Q} \left[\log \frac{N - df(e) + 0.5}{df(e) + 0.5}\right] \left[\frac{2.2 \times tf(e, d_k)}{0.3 + 0.9 \times \frac{dl(d_k)}{avdl} + tf(e, d_k)}\right] \left[\frac{8 \times qtf(e)}{7 + qtf(e)}\right]$$
Highlighted components: document frequency $df(e)$, term frequency $tf(e, d_k)$, document length $dl(d_k)$
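The ranking formula above can be written out term by term. This is a sketch with the constants taken from the slide (2.2, 0.3, 0.9, 8, 7); the function and argument names are illustrative, and the inputs in the example call are made up.

```python
import math


def term_weight(tf, df, qtf, dl, avdl, n_docs):
    """Score contribution of one query term e for one document d_k."""
    # Document frequency factor: rarer terms get a larger weight.
    df_part = math.log((n_docs - df + 0.5) / (df + 0.5))
    # Term frequency factor, dampened and normalized by relative document length.
    tf_part = (2.2 * tf) / (0.3 + 0.9 * dl / avdl + tf)
    # Query term frequency factor.
    qtf_part = (8 * qtf) / (7 + qtf)
    return df_part * tf_part * qtf_part


def score(query_terms, n_docs, dl, avdl):
    """Sum term weights over the query; each item is a (tf, df, qtf) triple."""
    return sum(term_weight(tf, df, qtf, dl, avdl, n_docs) for tf, df, qtf in query_terms)


# Hypothetical example: a document of average length, one query term seen once.
example = term_weight(tf=1, df=10, qtf=1, dl=100, avdl=100, n_docs=1000)
```

Nothing in the function requires integer counts, so fractional term and document frequencies estimated from translation probabilities can be passed in unchanged.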
[Figure: MAP: CLIR/Monolingual (40% to 110%) versus Cumulative Probability Threshold (0.0 to 1.0) for DAMM, IMM, and PSQ on CLEF French]
[Diagram: pre-translation expansion and post-translation expansion. The source language query is expanded by Source Language IR over a source language collection; the expanded source language query goes through Query Translation; the expanded target language terms come from Target Language IR over the target language collection, which produces the results]
[Figure: Mean Average Precision (0.00 to 0.35) versus Unique Dutch Terms (5,000 to 15,000) for the Both, Post, Pre, and None conditions. Source: Paul McNamee and James Mayfield, SIGIR-2002]
– Translation of proper names – Translation of newly coined terms – Translation of unfamiliar technical terms
– Orthography-based – Pronunciation-based
– Often works well between European languages
– Even off-the-shelf spelling correction can help!
– Trained using a set of representative cognates
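As a rough stand-in for off-the-shelf spelling correction, Python's standard `difflib` module can match an untranslated query term to its closest spelling in the target-language vocabulary. The vocabulary and the 0.8 cutoff below are assumptions for illustration; a trained cognate matcher would learn its own similarity function.

```python
import difflib

# Hypothetical Spanish vocabulary drawn from an indexed collection.
VOCABULARY = ["presidenta", "administración", "parlamento"]


def closest_cognate(term, vocabulary, cutoff=0.8):
    """Return the most similar vocabulary entry above the cutoff, or None."""
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With this vocabulary, "president" maps to "presidenta" and "parliament" to "parlamento", while a term with no similar spelling, such as "survey", returns nothing and would need another translation resource.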
– Generate all potential transliterations
– Guess source string(s) that produced a transliteration
[Diagram: Query, Query Translation, Translated Query, Search, Ranked List]
[Diagram: interactive search process. Stages shown: Query Formulation, Query, Query Translation, Translated Query, Search, Ranked List, Selection, Document, Examination, Document, Use, with Query Reformulation feeding back into the Query. Translation support shown: Indicative Translation, Snippet Translation, Term Translation, Term Matching, Informative Translation]
[Figure: iCLEF 2004. Users with Correct Answers (1 to 8) by Question Number, with questions in the order 8, 11, 13, 4, 16, 6, 14, 7, 2, 10, 15, 12, 1, 3, 9, 5]
8 Who is the managing director of the International Monetary Fund?
11 Who is the president of Burundi?
13 Of what team is Bobby Robson coach?
4 Who committed the terrorist attack in the Tokyo underground?
16 Who won the Nobel Prize for Literature in 1994?
6 When did Latvia gain independence?
14 When did the attack at the Saint-Michel underground station in Paris occur?
7 How many people were declared missing in the Philippines after the typhoon “Angela”?
2 How many human genes are there?
10 How many people died of asphyxia in the Baku underground?
15 How many people live in Bombay?
12 What is Charles Millon's political party?
1 What year was Thomas Mann awarded the Nobel Prize?
3 Who is the German Minister for Economic Affairs?
9 When did Lenin die?
5 How much did the Channel Tunnel cost?
– Paul McNamee et al, Addressing Morphological Variation in Alphabetic Languages, SIGIR, 2009
– Open CLIR Challenge (Swahili), IARPA, 2018 – Nkosana Malumba et al, AfriWeb: A Search Engine for a Marginalized Language, ICADL, 2015
– Jian-Yun Nie, Cross-Language Information Retrieval, Synthesis Lectures in HLT, Morgan & Claypool, 2010 – Jianqiang Wang and Douglas W. Oard, Matching Meaning for Cross-Language Information Retrieval, Information Processing and Management, 2012