Multilingual Information Retrieval (Doug Oard)



slide-1
SLIDE 1

Multilingual Information Retrieval

Doug Oard

College of Information Studies and UMIACS University of Maryland, College Park USA

January 14, 2019 AFIRM

slide-2
SLIDE 2

Global Trade

Source: Wikipedia (mostly 2017 estimates)

[Figure: bar charts of exports and imports, each in trillions of USD on a 0.0 to 2.5 scale, for the USA, Japan, South Korea, Hong Kong, China, and the EU]

slide-3
SLIDE 3

Most Widely-Spoken Languages

Source: Ethnologue (SIL), 2018

[Figure: bar chart of L1 and L2 speakers per language, in millions of speakers (axis 200 to 1,200). Languages shown: Southern Min, Persian, Thai, Hausa, Italian, Vietnamese, Yue Chinese, Tamil, Marathi, Korean, Turkish, Telugu, Wu Chinese, Javanese, Western Punjabi, Swahili, Japanese, German, Urdu, Indonesian, Portuguese, Bengali, Russian, Modern Standard Arabic, French, Spanish, Hindi, Mandarin Chinese, English]

slide-4
SLIDE 4

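Global Internet Users

[Figure: two series of percentage shares by language (English, Chinese, Spanish, Japanese, Portuguese, German, Arabic, French, Russian, Korean). Values as extracted, first series: 64%, 5%, 4%, 6%, 2%, 8%, 2%, 4%, 5%, 0%; second series: 33%, 28%, 9%, 6%, 5%, 5%, 4%, 4%, 4%, 2%]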

slide-5
SLIDE 5

What Does “Multilingual” Mean?

  • Mixed-language document

– Document containing more than one language

  • Mixed-language collection

– Collection of documents in different languages

  • Multi-monolingual systems

– Can retrieve from a mixed-language collection

  • Cross-language system

– Query in one language finds document in another

  • (Truly) multilingual system

– Queries can find documents in any language

slide-6
SLIDE 6

A Story in Two Parts

  • IR from the ground up in any language

– Focusing on document representation

  • Cross-Language IR

– To the extent time allows

slide-7
SLIDE 7

[Diagram: documents and the query each pass through a representation function; the document representation is stored in an index; a comparison function matches the query representation against the document representation to produce hits]

slide-8
SLIDE 8

ASCII

  • American Standard Code for Information Interchange

  • ANSI X3.4-1968

| 0–31 | 32–63 | 64–95 | 96–127 |
| --- | --- | --- | --- |
| 0 NUL | 32 SPACE | 64 @ | 96 ` |
| 1 SOH | 33 ! | 65 A | 97 a |
| 2 STX | 34 " | 66 B | 98 b |
| 3 ETX | 35 # | 67 C | 99 c |
| 4 EOT | 36 $ | 68 D | 100 d |
| 5 ENQ | 37 % | 69 E | 101 e |
| 6 ACK | 38 & | 70 F | 102 f |
| 7 BEL | 39 ' | 71 G | 103 g |
| 8 BS | 40 ( | 72 H | 104 h |
| 9 HT | 41 ) | 73 I | 105 i |
| 10 LF | 42 * | 74 J | 106 j |
| 11 VT | 43 + | 75 K | 107 k |
| 12 FF | 44 , | 76 L | 108 l |
| 13 CR | 45 - | 77 M | 109 m |
| 14 SO | 46 . | 78 N | 110 n |
| 15 SI | 47 / | 79 O | 111 o |
| 16 DLE | 48 0 | 80 P | 112 p |
| 17 DC1 | 49 1 | 81 Q | 113 q |
| 18 DC2 | 50 2 | 82 R | 114 r |
| 19 DC3 | 51 3 | 83 S | 115 s |
| 20 DC4 | 52 4 | 84 T | 116 t |
| 21 NAK | 53 5 | 85 U | 117 u |
| 22 SYN | 54 6 | 86 V | 118 v |
| 23 ETB | 55 7 | 87 W | 119 w |
| 24 CAN | 56 8 | 88 X | 120 x |
| 25 EM | 57 9 | 89 Y | 121 y |
| 26 SUB | 58 : | 90 Z | 122 z |
| 27 ESC | 59 ; | 91 [ | 123 { |
| 28 FS | 60 < | 92 \ | 124 \| |
| 29 GS | 61 = | 93 ] | 125 } |
| 30 RS | 62 > | 94 ^ | 126 ~ |
| 31 US | 63 ? | 95 _ | 127 DEL |

slide-9
SLIDE 9

The Latin-1 Character Set

  • ISO 8859-1: 8-bit characters for Western Europe

– French, Spanish, Catalan, Galician, Basque, Portuguese, Italian, Albanian, Afrikaans, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English

[Figure: code chart distinguishing the printable 7-bit ASCII characters from the additional characters defined by ISO 8859-1]

slide-10
SLIDE 10

Other ISO-8859 Character Sets

[Figure: code charts for ISO 8859 parts 2 through 9]

slide-11
SLIDE 11

East Asian Character Sets

  • More than 256 characters are needed

– Two-byte encoding schemes (e.g., EUC) are used

  • Several countries have unique character sets

– GB in the People's Republic of China, BIG5 in Taiwan, JIS in Japan, KS in Korea, TCVN in Vietnam

  • Many characters appear in several languages

– Research Libraries Group developed EACC

  • Unified “CJK” character set for USMARC records
slide-12
SLIDE 12

Unicode

  • Single code for all the world’s characters

– ISO Standard 10646

  • Separates “code space” from “encoding”

– Code space extends Latin-1

  • The first 256 positions are identical

– UTF-7 encoding will pass through email

  • Uses only the 64 printable ASCII characters

– UTF-8 encoding is designed for disk file systems
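
The separation of code space from encoding can be seen with a small sketch using only Python's built-in codecs. The sample string "café" is an arbitrary illustration, not something from the slides.

```python
# Illustrative only: one string, one sequence of code points, several encodings.
text = "café"

latin1 = text.encode("latin-1")  # one byte per character
utf8 = text.encode("utf-8")      # "é" takes two bytes
utf7 = text.encode("utf-7")      # non-ASCII characters become ASCII-only runs

# The first 256 Unicode code points coincide with Latin-1.
first_256 = "".join(chr(i) for i in range(256))
same_as_latin1 = bytes(range(256)).decode("latin-1") == first_256

print(len(latin1), len(utf8))        # byte counts differ for the same text
print(utf7)                          # every byte is 7-bit ASCII
print(same_as_latin1)
```

The byte counts also illustrate why Unicode encodings can produce larger files than Latin-1 for the same text.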

slide-13
SLIDE 13

Limitations of Unicode

  • Produces larger files than Latin-1
  • Fonts may be hard to obtain for some characters
  • Some characters have multiple representations

– e.g., accents can be part of a character or separate

  • Some characters look identical when printed

– But they come from unrelated languages

  • Encoding does not define the “sort order”
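
Two of these limitations matter directly for indexing, and can be sketched with Python's standard `unicodedata` module. The characters chosen here are illustrative assumptions; whether to normalize to NFC or another form is a design choice for the indexer.

```python
import unicodedata

# Multiple representations: "é" as one code point, or as "e" plus a combining accent.
composed = "\u00e9"
decomposed = "e\u0301"
equal_before = composed == decomposed

# Normalizing both query and document text to one form makes them match.
equal_after = unicodedata.normalize("NFC", decomposed) == composed

# Look-alikes: three capital letters that print alike but are distinct characters.
lookalikes = ["\u0041", "\u0391", "\u0410"]
names = [unicodedata.name(ch) for ch in lookalikes]

print(equal_before, equal_after)
print(names)
```
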
slide-14
SLIDE 14

Strings and Segments

  • Retrieval is (often) a search for concepts

– But what we actually search are character strings

  • What strings best represent concepts?

– In English, words are often a good choice

  • Well-chosen phrases might also be helpful

– In German, compounds may need to be split

  • Otherwise queries using constituent words would fail

– In Chinese, word boundaries are not marked

  • Thissegmentationproblemissimilartothatofspeech
slide-15
SLIDE 15

Tokenization

  • Words (from linguistics):

– Morphemes are the units of meaning
– Combined to make words

  • Anti (disestablishmentarian) ism
  • Tokens (from computer science)

– Doug ’s running late !
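
A minimal sketch of a tokenizer that yields tokens like the ones above. The regular expression is an assumption made for illustration; real tokenizers handle many more cases (numbers, URLs, hyphenation).

```python
import re

# Clitics such as ’s, then words, then any single punctuation mark.
TOKEN_PATTERN = re.compile(r"[’']\w+|\w+|[^\w\s]")

def tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("Doug’s running late!")
print(tokens)
```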

slide-16
SLIDE 16

Morphological Segmentation Swahili Example

a + li + ni + andik + ish + a

he + past-tense + me + write + causer-effect + declarative-mode

Credit: Ramy Eskander

slide-17
SLIDE 17

Morphological Segmentation Somali Example

cun + t + aa

eat + she + present-tense

Credit: Ramy Eskander

slide-18
SLIDE 18

Stemming

  • Conflates words, usually preserving meaning

– Rule-based suffix-stripping helps for English

  • {destroy, destroyed, destruction}: destr

– Prefix-stripping is needed in some languages

  • Arabic: {alselam}: selam [Root: SLM (peace)]
  • Imperfect: goal is to usually be helpful

– Overstemming:

  • {centennial,century,center}: cent

– Understemming:

  • {acquire,acquiring,acquired}: acquir
  • {acquisition}: acquis
  • Snowball: rule-based system for making stemmers
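
A toy sketch of rule-based affix stripping. The suffix list, minimum stem length, and function names are assumptions for illustration; this is far simpler than the stemmers Snowball produces, and its stems need not match the ones shown above.

```python
# Hypothetical, deliberately tiny rule set; earlier suffixes are tried first.
SUFFIXES = ("ing", "ion", "ed", "e", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def strip_prefix(word, prefixes):
    """Prefix stripping, as needed for the Arabic example (alselam -> selam)."""
    for prefix in prefixes:
        if word.startswith(prefix) and len(word) > len(prefix):
            return word[len(prefix):]
    return word

for w in ("acquire", "acquiring", "acquired", "acquisition"):
    print(w, "->", toy_stem(w))
print(strip_prefix("alselam", ("al",)))
```

Even this toy version understems: the three verb forms conflate, but "acquisition" ends up with a different stem.
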
slide-19
SLIDE 19

Longest Substring Segmentation

  • Greedy algorithm based on a lexicon
  • Start with a list of every possible term
  • For each unsegmented string

– Remove the longest single substring in the list
– Repeat until no substrings are found in the list

slide-20
SLIDE 20

Longest Substring Example

  • Possible German compound term (!):

– washington

  • List of German words:

– ach, hin, hing, sei, ton, was, wasch

  • Longest substring segmentation

– was-hing-ton
– Roughly translates as “What tone is attached?”
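
The greedy procedure can be sketched as a short recursive function. The function name and the tie-breaking rule (alphabetical among equally long terms) are assumptions; the word list is the one from the example above.

```python
def longest_substring_segment(text, lexicon):
    """Greedily remove the longest lexicon term found, then recurse on what is left."""
    if not text:
        return []
    # Longest terms first; ties broken alphabetically so the result is deterministic.
    for term in sorted((t for t in lexicon if t), key=lambda t: (-len(t), t)):
        pos = text.find(term)
        if pos != -1:
            left = longest_substring_segment(text[:pos], lexicon)
            right = longest_substring_segment(text[pos + len(term):], lexicon)
            return left + [term] + right
    # No lexicon term occurs in this piece: leave it unsegmented.
    return [text]

german_words = ["ach", "hin", "hing", "sei", "ton", "was", "wasch"]
print("-".join(longest_substring_segment("washington", german_words)))
```

The output shows how a greedy, lexicon-only method can confidently segment a string that should never have been split.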

slide-21
SLIDE 21
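[Figure: two candidate segmentations of the same Chinese string; the Chinese characters did not survive extraction. Surviving glosses, first segmentation: il, petroleum, probe, survey, take samples, restrain. Second segmentation: il, petroleum, probe, survey, take samples, cymbidium goeringii]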

slide-22
SLIDE 22

Probabilistic Segmentation

  • For an input string c1 c2 c3 … cn
  • Try all possible partitions into w1 w2 w3 …

– c1 | c2 c3 … cn
– c1 c2 | c3 … cn
– c1 | c2 | c3 … cn
– etc.

  • Choose the highest probability partition

– Compute Pr(w1 w2 w3 …) using a language model

  • Challenges: search, probability estimation
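
One common way to address the search challenge is dynamic programming, which finds the best partition without enumerating all of them. The sketch below assumes a unigram language model; the probabilities, the unknown-character penalty, and the unspaced English test string are all made up for illustration.

```python
import math

def segment(text, unigram_prob, unknown_prob=1e-6, max_len=10):
    """Return the partition of text with the highest unigram log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in that partition
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in unigram_prob:
                p = unigram_prob[word]
            elif len(word) == 1:
                p = unknown_prob  # fallback so every string has some partition
            else:
                continue
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end], back[end] = score, start
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]

# Hypothetical unigram probabilities.
toy_model = {"the": 0.05, "table": 0.01, "down": 0.01, "there": 0.01,
             "tab": 0.001, "led": 0.001, "own": 0.002, "he": 0.01}
print(segment("thetabledownthere", toy_model))
```
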
slide-23
SLIDE 23

Non-Segmentation: N-gram Indexing

  • Consider a Chinese document c1 c2 c3 … cn
  • Don’t segment (you could be wrong!)
  • Instead, treat every character bigram as a term

c1 c2 , c2 c3 , c3 c4 , … , cn-1 cn

  • Break up queries the same way
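
A minimal sketch of overlapping character n-gram extraction, applied identically to documents and queries. The function name and the sample string are illustrative assumptions.

```python
def char_ngrams(text, n=2):
    """Return every overlapping character n-gram of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

document_terms = char_ngrams("中文信息检索")
query_terms = char_ngrams("信息检索")
print(document_terms)
print(query_terms)
# Every bigram of the query also occurs among the document's bigrams.
print(all(term in document_terms for term in query_terms))
```
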
slide-24
SLIDE 24

A “Term” is Whatever You Index

  • Word sense
  • Token
  • Word
  • Stem
  • Character n-gram
  • Phrase
slide-25
SLIDE 25

Summary

  • A term is whatever you index

– So the key is to index the right kind of terms!

  • Start by finding fundamental features

– We have focused on character-coded text
– Same ideas apply to handwriting, OCR, and speech

  • Combine characters into easily recognized units

– Words where possible, character n-grams otherwise

  • Apply further processing to optimize results

– Stemming, phrases, …

slide-26
SLIDE 26

A Story in Two Parts

  • IR from the ground up in any language

– Focusing on document representation

  • Cross-Language IR

– To the extent time allows

slide-27
SLIDE 27

Query-Language CLIR

[Diagram: the Somali document collection passes through a translation system to produce an English document collection; English queries are run against it by the retrieval engine; the user selects from and examines the results]

slide-28
SLIDE 28

Document-Language CLIR

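[Diagram: English queries pass through a translation system to produce Somali queries; the retrieval engine runs them against the Somali document collection; the results are Somali documents, which the user selects from and examines]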

slide-29
SLIDE 29

Query vs. Document Translation

  • Query translation

– Efficient for short queries (not relevance feedback)
– Limited context for ambiguous query terms

  • Document translation

– Rapid support for interactive selection
– Need only be done once (if query language is same)

slide-30
SLIDE 30

Indexing Time: Statistical Document Translation

[Figure: indexing time (sec) versus collection size (thousands of documents), comparing monolingual and cross-language indexing. Extracted axis ticks: 100 to 500 and 10 to 45]

slide-31
SLIDE 31

Language-Neutral Retrieval

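[Diagram: Somali query terms pass through query “translation” and English document terms pass through document “translation” into a shared representation for “interlingual” retrieval, which returns a ranked list with scores 1: 0.91, 2: 0.57, 3: 0.36]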

slide-32
SLIDE 32

Translation Evidence

  • Lexical Resources

– Phrase books, bilingual dictionaries, …

  • Large text collections

– Translations (“parallel”)
– Similar topics (“comparable”)

  • Similarity

– Similar writing (if the character set is the same)
– Similar pronunciation

  • People

– May be able to guess topic from lousy translations

slide-33
SLIDE 33

Types of Lexical Resources

  • Ontology

– Organization of knowledge

  • Thesaurus

– Ontology specialized to support search

  • Dictionary

– Rich word list, designed for use by people

  • Lexicon

– Rich word list, designed for use by a machine

  • Bilingual term list

– Pairs of translation-equivalent terms

slide-34
SLIDE 34

[Figure: comparison of four conditions: named entities removed, named entities from term list, named entities added, full query]

slide-35
SLIDE 35

Backoff Translation

  • Lexicon might contain stems, surface forms, or some combination of the two.

[Diagram: French document terms (mangez, mangent) are matched against translation lexicon entries (mangez, mange; translations eat, eats) in four backoff stages]

| Stage | Document | Translation Lexicon |
| --- | --- | --- |
| 1 | surface form | surface form |
| 2 | stem | surface form |
| 3 | surface form | stem |
| 4 | stem | stem |
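
The four matching stages can be sketched as follows. The stage order (surface to surface, stem to surface, surface to stem, stem to stem), the toy French stemmer, and the tiny lexicon are assumptions made for illustration, not the implementation behind the slide.

```python
def toy_french_stem(word):
    """Hypothetical stemmer: strip a few verb endings, keeping at least 3 letters."""
    for suffix in ("ent", "ez", "es", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def backoff_translate(term, lexicon, stem):
    """Return (translations, stage) from the first stage that finds a match."""
    if term in lexicon:                      # 1: surface form vs. surface form
        return lexicon[term], 1
    stemmed_term = stem(term)
    if stemmed_term in lexicon:              # 2: stem vs. surface form
        return lexicon[stemmed_term], 2
    stemmed_lexicon = {}
    for entry, translations in lexicon.items():
        stemmed_lexicon.setdefault(stem(entry), []).extend(translations)
    if term in stemmed_lexicon:              # 3: surface form vs. stem
        return stemmed_lexicon[term], 3
    if stemmed_term in stemmed_lexicon:      # 4: stem vs. stem
        return stemmed_lexicon[stemmed_term], 4
    return [], 0                             # no translation found

toy_lexicon = {"mange": ["eat", "eats"]}
print(backoff_translate("mange", toy_lexicon, toy_french_stem))
print(backoff_translate("mangent", toy_lexicon, toy_french_stem))
```

Trying the strictest match first keeps precise translations when they exist and falls back to looser stem matches only when needed.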

slide-36
SLIDE 36

[Image: the Rosetta Stone, with its three scripts labeled: Hieroglyphic Egyptian, Demotic, Greek]

slide-37
SLIDE 37

Types of Bilingual Corpora

  • Parallel corpora: translation-equivalent pairs

– Document pairs
– Sentence pairs
– Term pairs

  • Comparable corpora: topically related

– Collection pairs
– Document pairs

slide-38
SLIDE 38

Some Modern Rosetta Stones

  • News:

– DE-News (German-English)
– Hong-Kong News, Xinhua News (Chinese-English)

  • Government:

– Canadian Hansards (French-English)
– Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish)
– UN Treaties (Russian, English, Arabic, …)

  • Religion:

– Bible, Koran, Book of Mormon

slide-39
SLIDE 39

Word-Level Alignment

English: Diverging opinions about planned tax reform
German: Unterschiedliche Meinungen zur geplanten Steuerreform

English: Madam President, I had asked the administration …
Spanish: Señora Presidenta, había pedido a la administración del Parlamento …

slide-40
SLIDE 40

A Translation Model

  • From word-aligned bilingual text, we induce a translation model
  • Example:

$p(f_i \mid e)$, where $\sum_{f_i} p(f_i \mid e) = 1$

p(探测|survey) = 0.4
p(试探|survey) = 0.3
p(测量|survey) = 0.25
p(样品|survey) = 0.05

slide-41
SLIDE 41

Using Multiple Translations

  • Weighted Structured Query Translation

– Takes advantage of multiple translations and translation probabilities

  • TF and DF of query term e are computed using TF and DF of its translations:

$$TF(e, D_k) = \sum_{f_i} p(f_i \mid e) \times TF(f_i, D_k)$$

$$DF(e) = \sum_{f_i} p(f_i \mid e) \times DF(f_i)$$
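
The two weighted sums can be sketched directly. The translation probabilities are the ones given for "survey" on the translation model slide; the term counts and document frequencies are invented for illustration, and the function names are hypothetical.

```python
def translated_tf(p_f_given_e, doc_tf):
    """TF(e, D_k): probability-weighted sum of the translations' term frequencies."""
    return sum(p * doc_tf.get(f, 0) for f, p in p_f_given_e.items())

def translated_df(p_f_given_e, df):
    """DF(e): probability-weighted sum of the translations' document frequencies."""
    return sum(p * df.get(f, 0) for f, p in p_f_given_e.items())

p_survey = {"探测": 0.4, "试探": 0.3, "测量": 0.25, "样品": 0.05}

# Hypothetical counts for one Chinese document and for the whole collection.
doc_tf = {"探测": 2, "测量": 1}
collection_df = {"探测": 100, "试探": 40, "测量": 60, "样品": 200}

tf_survey = translated_tf(p_survey, doc_tf)          # 0.4*2 + 0.25*1
df_survey = translated_df(p_survey, collection_df)   # 40 + 12 + 15 + 10
print(tf_survey, df_survey)
```

The resulting fractional TF and DF values can then be used by a ranking function as if "survey" itself had occurred in the Chinese documents.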

slide-42
SLIDE 42

BM-25

$$\sum_{e \in Q} \left[\log \frac{N - df(e) + 0.5}{df(e) + 0.5}\right] \left[\frac{2.2 \times tf(e, d_k)}{0.3 + 0.9 \times \frac{dl(d_k)}{avdl} + tf(e, d_k)}\right] \left[\frac{8 \times qtf(e)}{7 + qtf(e)}\right]$$

Highlighted in turn: document frequency df(e), term frequency tf(e, d_k), document length dl(d_k)
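
A minimal sketch of this scoring function with the constants as they appear on the slide (2.2, 0.3, 0.9, 8, 7). The function and parameter names are assumptions, and the example numbers are invented. Term and document frequencies may be fractional, which is what lets translated estimates be plugged in.

```python
import math

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, df, num_docs):
    """Sum the three-factor BM25 weight over the query terms."""
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0.0)
        n = df.get(term, 0.0)
        idf_part = math.log((num_docs - n + 0.5) / (n + 0.5))
        tf_part = (2.2 * tf) / (0.3 + 0.9 * doc_len / avg_doc_len + tf)
        qtf_part = (8.0 * qtf) / (7.0 + qtf)
        score += idf_part * tf_part * qtf_part
    return score

# Hypothetical collection statistics and one hypothetical document.
score = bm25_score(
    query_tf={"survey": 1},
    doc_tf={"survey": 1.05},
    doc_len=100, avg_doc_len=100,
    df={"survey": 77.0}, num_docs=1000,
)
print(score)
```
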

slide-43
SLIDE 43

[Figure: Retrieval Effectiveness, CLEF French. MAP as a fraction of monolingual MAP (CLIR/Monolingual, 40% to 110%) versus cumulative probability threshold (0.0 to 1.0), for DAMM, IMM, and PSQ]

slide-44
SLIDE 44

Bilingual Query Expansion

[Diagram: the source language query is run through source language IR over a source language collection (pre-translation expansion), giving an expanded source language query; query translation then feeds target language IR over the target language collection (post-translation expansion), giving expanded target language terms and the results]

slide-45
SLIDE 45

[Figure: Query Expansion Effect. Mean average precision (0.00 to 0.35) versus unique Dutch terms (5,000 to 15,000) for four conditions: Both, Post, Pre, None]

Paul McNamee and James Mayfield, SIGIR-2002

slide-46
SLIDE 46

Cognate Matching

  • Dictionary coverage is inherently limited

– Translation of proper names
– Translation of newly coined terms
– Translation of unfamiliar technical terms

  • Strategy: model derivational translation

– Orthography-based
– Pronunciation-based

slide-47
SLIDE 47

Matching Orthographic Cognates

  • Retain untranslatable words unchanged

– Often works well between European languages

  • Rule-based systems

– Even off-the-shelf spelling correction can help!

  • Subword (e.g., character-level) MT

– Trained using a set of representative cognates
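
As a rough illustration of the spelling-correction idea, Python's standard `difflib.get_close_matches` can stand in for an off-the-shelf spelling corrector. The vocabulary, the query term, and the 0.8 cutoff are assumptions chosen for the example.

```python
import difflib

def match_cognate(untranslated_term, target_vocabulary, cutoff=0.8):
    """Return the closest target-language spelling, or None if nothing is close enough."""
    matches = difflib.get_close_matches(
        untranslated_term, target_vocabulary, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None

# Hypothetical English indexing vocabulary and a Spanish term missing from the dictionary.
english_vocabulary = ["parliament", "president", "administration", "government"]
print(match_cognate("parlamento", english_vocabulary))
```

A high cutoff keeps accidental look-alikes out, at the cost of missing cognates whose spelling has drifted further apart.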

slide-48
SLIDE 48

Matching Phonetic Cognates

  • Forward transliteration

– Generate all potential transliterations

  • Reverse transliteration

– Guess source string(s) that produced a transliteration

  • Match in phonetic space
slide-49
SLIDE 49

Cross-Language “Retrieval”

[Diagram: Query → Query Translation → Translated Query → Search → Ranked List]

slide-50
SLIDE 50

Uses of “MT” in CLIR

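[Diagram: the search process runs from Query Formulation (Query) through Query Translation (Translated Query), Search (Ranked List), Selection (Document), and Examination (Document) to Use, with a Query Reformulation loop back to the query. Translation roles marked along the way: Term Translation, Term Matching, Snippet Translation, Indicative Translation, Informative Translation]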

slide-51
SLIDE 51

Interactive Cross-Language Question Answering

[Figure: iCLEF 2004. Number of users with correct answers (1 to 8) for each question number, with questions shown in the order 8, 11, 13, 4, 16, 6, 14, 7, 2, 10, 15, 12, 1, 3, 9, 5]

slide-52
SLIDE 52

Questions, Grouped by Difficulty

8: Who is the managing director of the International Monetary Fund?
11: Who is the president of Burundi?
13: Of what team is Bobby Robson coach?
4: Who committed the terrorist attack in the Tokyo underground?
16: Who won the Nobel Prize for Literature in 1994?
6: When did Latvia gain independence?
14: When did the attack at the Saint-Michel underground station in Paris occur?
7: How many people were declared missing in the Philippines after the typhoon “Angela”?
2: How many human genes are there?
10: How many people died of asphyxia in the Baku underground?
15: How many people live in Bombay?
12: What is Charles Millon's political party?
1: What year was Thomas Mann awarded the Nobel Prize?
3: Who is the German Minister for Economic Affairs?
9: When did Lenin die?
5: How much did the Channel Tunnel cost?

slide-53
SLIDE 53

For Further Reading

  • Multilingual IR

– Paul McNamee et al., Addressing Morphological Variation in Alphabetic Languages, SIGIR, 2009

  • African-Language IR

– Open CLIR Challenge (Swahili), IARPA, 2018
– Nkosana Malumba et al., AfriWeb: A Search Engine for a Marginalized Language, ICADL, 2015

  • Cross-Language IR

– Jian-Yun Nie, Cross-Language Information Retrieval, Synthesis Lectures in HLT, Morgan & Claypool, 2010
– Jianqiang Wang and Douglas W. Oard, Matching Meaning for Cross-Language Information Retrieval, Information Processing and Management, 2012