Multilingual Information Retrieval


  1. Multilingual Information Retrieval Doug Oard College of Information Studies and UMIACS University of Maryland, College Park USA January 14, 2019 AFIRM

  2. Global Trade
     [Chart: exports vs. imports, in trillions of USD (both axes 0.0 to 2.5), for the USA, EU, China, Japan, Hong Kong, and South Korea. Source: Wikipedia (mostly 2017 estimates)]

  3. Most Widely-Spoken Languages
     [Chart: L1 and L2 speakers, in millions (axis 0 to 1,200), for English, Mandarin Chinese, Hindi, Spanish, French, Modern Standard Arabic, Russian, Bengali, Portuguese, Indonesian, Urdu, German, Japanese, Swahili, Western Punjabi, Javanese, Wu Chinese, Telugu, Turkish, Korean, Marathi, Tamil, Yue Chinese, Vietnamese, Italian, Hausa, Thai, Persian, and Southern Min. Source: Ethnologue (SIL), 2018]

  4. Global Internet Users
     [Chart: distribution of global internet users by language, covering English, Chinese, Spanish, Japanese, Portuguese, German, Arabic, French, Russian, Korean, and others]

  5. What Does “Multilingual” Mean?
     • Mixed-language document – Document containing more than one language
     • Mixed-language collection – Collection of documents in different languages
     • Multi-monolingual systems – Can retrieve from a mixed-language collection
     • Cross-language system – Query in one language finds document in another
     • (Truly) multilingual system – Queries can find documents in any language

  6. A Story in Two Parts
     • IR from the ground up in any language
       – Focusing on document representation
     • Cross-Language IR
       – To the extent time allows

  7. [Diagram: the standard IR architecture. The query and the documents each pass through a representation function; the resulting query representation is matched against the document representations (the index) by a comparison function, which produces the hits.]

  8. ASCII
     • American Standard Code for Information Interchange
     • ANSI X3.4-1968
     [Table: the 7-bit ASCII code chart, codes 0–127: control characters 0–31 and 127 (DEL), and printable characters 32–126, including space, punctuation, digits, and the upper- and lower-case Latin letters.]

  9. The Latin-1 Character Set
     • ISO 8859-1: 8-bit characters for Western Europe
       – French, Spanish, Catalan, Galician, Basque, Portuguese, Italian, Albanian, Afrikaans, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English
     [Charts: the printable characters of 7-bit ASCII, plus the additional characters defined by ISO 8859-1.]

  10. Other ISO-8859 Character Sets
     [Charts: the code pages ISO 8859-2 through ISO 8859-9.]

  11. East Asian Character Sets
     • More than 256 characters are needed
       – Two-byte encoding schemes (e.g., EUC) are used
     • Several countries have unique character sets
       – GB in the People’s Republic of China, BIG5 in Taiwan, JIS in Japan, KS in Korea, TCVN in Vietnam
     • Many characters appear in several languages
       – Research Libraries Group developed EACC
         • Unified “CJK” character set for USMARC records
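
The incompatibility of these national encodings is easy to demonstrate. A minimal Python sketch (our own illustration, not from the slides; the test character is our choice) shows that the same Hanzi maps to different bytes under GB2312, Big5, and EUC-JP, and that bytes produced by one encoding misread under another come out garbled:

```python
# The same character maps to different bytes under each national encoding,
# and bytes from one encoding misread under another turn into mojibake.

char = "中"  # a common Hanzi present in the GB, Big5, and JIS repertoires

for codec in ("gb2312", "big5", "euc_jp", "utf-8"):
    print(codec, char.encode(codec).hex())

# GB2312 bytes decoded as if they were Big5 do not give back the character.
gb_bytes = char.encode("gb2312")
print(gb_bytes.decode("big5", errors="replace"))
```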

  12. Unicode
     • Single code for all the world’s characters
       – ISO Standard 10646
     • Separates “code space” from “encoding”
       – Code space extends Latin-1
         • The first 256 positions are identical
       – UTF-7 encoding will pass through email
         • Uses only the 64 printable ASCII characters
       – UTF-8 encoding is designed for disk file systems
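
Two of these claims can be checked directly in code. A small Python sketch (our own, not from the slides) verifies that Unicode’s first 256 code points coincide with Latin-1 and that UTF-8 leaves 7-bit ASCII bytes unchanged while using multiple bytes for other characters:

```python
# Unicode's first 256 code points match Latin-1, and UTF-8 passes
# 7-bit ASCII through unchanged.

# Every Latin-1 byte value decodes to the Unicode code point of equal value.
assert all(ord(bytes([b]).decode("latin-1")) == b for b in range(256))

# ASCII text is byte-for-byte identical in ASCII and UTF-8 ...
assert "retrieval".encode("ascii") == "retrieval".encode("utf-8")

# ... but non-ASCII characters become multi-byte sequences in UTF-8.
print("é".encode("latin-1").hex())  # 'e9'   (one byte)
print("é".encode("utf-8").hex())    # 'c3a9' (two bytes)
```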

  13. Limitations of Unicode
     • Produces larger files than Latin-1
     • Fonts may be hard to obtain for some characters
     • Some characters have multiple representations
       – e.g., accents can be part of a character or separate
     • Some characters look identical when printed
       – But they come from unrelated languages
     • Encoding does not define the “sort order”
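
A short Python sketch (our own illustration) of two of these pitfalls: the same accented letter has a precomposed and a combining-sequence form, and visually identical letters from different scripts are distinct characters:

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
combining   = "e\u0301"  # e followed by a combining acute accent
print(precomposed == combining)                                # False
print(unicodedata.normalize("NFC", combining) == precomposed)  # True

# Latin A, Greek capital Alpha, and Cyrillic A print alike but differ:
for ch in ("\u0041", "\u0391", "\u0410"):
    print(ch, unicodedata.name(ch))
```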

  14. Strings and Segments
     • Retrieval is (often) a search for concepts
       – But what we actually search are character strings
     • What strings best represent concepts?
       – In English, words are often a good choice
         • Well-chosen phrases might also be helpful
       – In German, compounds may need to be split
         • Otherwise queries using constituent words would fail
       – In Chinese, word boundaries are not marked
         • Thissegmentationproblemissimilartothatofspeech

  15. Tokenization
     • Words (from linguistics)
       – Morphemes are the units of meaning
       – Combined to make words
         • Anti (disestablishmentarian) ism
     • Tokens (from computer science)
       – Doug ’s running late !
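
As a rough illustration of the computer-science notion of tokens, here is a small regex tokenizer sketch (not the slide’s tokenizer; the clitic rule is a simplification) that reproduces the “Doug ’s running late !” example:

```python
import re

def tokenize(text):
    # word characters, the clitic 's, or any single non-space symbol
    return re.findall(r"'s|\w+|[^\w\s]", text)

print(tokenize("Doug's running late!"))
# ['Doug', "'s", 'running', 'late', '!']
```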

  16. Morphological Segmentation: Swahili Example
     • a + li + ni + andik + ish + a
     • he + past-tense + me + write + causer-effect + declarative-mode
     Credit: Ramy Eskander

  17. Morphological Segmentation: Somali Example
     • cun + t + aa
     • eat + she + present-tense
     Credit: Ramy Eskander

  18. Stemming
     • Conflates words, usually preserving meaning
       – Rule-based suffix-stripping helps for English
         • {destroy, destroyed, destruction}: destr
       – Prefix-stripping is needed in some languages
         • Arabic: {alselam}: selam [Root: SLM (peace)]
     • Imperfect: goal is to usually be helpful
       – Overstemming
         • {centennial, century, center}: cent
       – Understemming
         • {acquire, acquiring, acquired}: acquir
         • {acquisition}: acquis
     • Snowball: rule-based system for making stemmers
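
To make over- and understemming concrete, here is a toy suffix-stripper sketch. The suffix list and minimum stem length are invented so that it reproduces the slide’s examples; it is far cruder than Porter or Snowball, which use ordered rules with context conditions:

```python
# A deliberately rigged toy stemmer, for illustration only.
SUFFIXES = ["ennial", "uction", "ition", "ury", "ing", "ed", "er", "e", "s"]

def stem(word, min_stem=4):
    # strip the longest matching suffix, but never below min_stem characters
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ("centennial", "century", "center")])
# ['cent', 'cent', 'cent']                 <- overstemming
print([stem(w) for w in ("acquire", "acquiring", "acquired", "acquisition")])
# ['acquir', 'acquir', 'acquir', 'acquis'] <- understemming
```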

  19. Longest Substring Segmentation
     • Greedy algorithm based on a lexicon
     • Start with a list of every possible term
     • For each unsegmented string
       – Remove the longest single substring in the list
       – Repeat until no substrings are found in the list

  20. Longest Substring Example
     • Possible German compound term (!): washington
     • List of German words: ach, hin, hing, sei, ton, was, wasch
     • Longest substring segmentation: was-hing-ton
       – Roughly translates as “What tone is attached?”
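
A Python sketch of the greedy algorithm from slide 19, run on this example. How ties are broken and how unrecognized leftovers are handled are our own implementation choices, not specified on the slide:

```python
def segment(text, lexicon):
    """Greedy longest-substring segmentation."""
    if not text:
        return []
    # longest lexicon entry that occurs anywhere in the text, if any
    best = max((w for w in lexicon if w in text), key=len, default=None)
    if best is None:
        return [text]   # leave unrecognized material as one chunk
    left, _, right = text.partition(best)
    return segment(left, lexicon) + [best] + segment(right, lexicon)

lexicon = {"ach", "hin", "hing", "sei", "ton", "was", "wasch"}
print("-".join(segment("washington", lexicon)))   # was-hing-ton
```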

  21. [Diagram: a segmentation ambiguity example in which overlapping substrings of the same character string map to very different terms, such as “oil probe”, “petroleum survey”, “take samples”, “restrain”, and “cymbidium goeringii” (an orchid).]

  22. Probabilistic Segmentation
     • For an input string c1 c2 c3 … cn
     • Try all possible partitions into words w1 w2 w3 …
       – [c1] [c2] [c3] … [cn]
       – [c1 c2] [c3] … [cn]
       – [c1 c2 c3] … [cn]
       – etc.
     • Choose the highest-probability partition
       – Compute Pr(w1 w2 w3 …) using a language model
     • Challenges: search, probability estimation
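
A sketch of the idea using a unigram language model and dynamic programming over split points (rather than literally enumerating every partition). The probability table is invented for illustration; a real system estimates it from a corpus and smooths unseen strings:

```python
import math

UNIGRAM = {"wa": 0.05, "shing": 0.01, "ton": 0.04,
           "washing": 0.03, "washington": 0.20}
OOV_LOGPROB = math.log(1e-8)   # penalty for out-of-vocabulary chunks

def best_segmentation(text, max_word_len=12):
    # best[i] = (log-probability, words) for the best split of text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            logp = math.log(UNIGRAM[word]) if word in UNIGRAM else OOV_LOGPROB
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[-1][1]

print(best_segmentation("washington"))   # ['washington']
```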

  23. Non-Segmentation: N-gram Indexing
     • Consider a Chinese document c1 c2 c3 … cn
     • Don’t segment (you could be wrong!)
     • Instead, treat every character bigram as a term
       – c1c2, c2c3, c3c4, … , cn-1cn
     • Break up queries the same way
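
A minimal sketch of overlapping character-bigram term extraction; the example string (roughly “Peking University Library”) is our own, not from the slides:

```python
def char_bigrams(text):
    # every adjacent character pair becomes an indexing term
    return [text[i:i + 2] for i in range(len(text) - 1)]

doc_terms   = char_bigrams("北京大学图书馆")
query_terms = char_bigrams("图书馆")
print(doc_terms)                          # ['北京', '京大', '大学', ...]
print(set(query_terms) & set(doc_terms))  # shared bigram terms drive matching
```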

  24. A “Term” is Whatever You Index
     • Word sense
     • Token
     • Word
     • Stem
     • Character n-gram
     • Phrase

  25. Summary
     • A term is whatever you index
       – So the key is to index the right kind of terms!
     • Start by finding fundamental features
       – We have focused on character coded text
       – Same ideas apply to handwriting, OCR, and speech
     • Combine characters into easily recognized units
       – Words where possible, character n-grams otherwise
     • Apply further processing to optimize results
       – Stemming, phrases, …

  26. A Story in Two Parts
     • IR from the ground up in any language
       – Focusing on document representation
     ➔ Cross-Language IR
       – To the extent time allows

  27. Query-Language CLIR
     [Diagram: a Somali document collection is converted by a translation system into an English document collection; English queries go to a retrieval engine over the translated collection, and the user examines and selects from the results.]

  28. Document-Language CLIR
     [Diagram: English queries are converted by a translation system into Somali queries; a retrieval engine searches the Somali document collection, and the user examines and selects from the Somali documents in the results.]

  29. Query vs. Document Translation
     • Query translation
       – Efficient for short queries (but not for relevance feedback)
       – Limited context for ambiguous query terms
     • Document translation
       – Rapid support for interactive selection
       – Need only be done once (if the query language stays the same)

  30. Indexing Time: Statistical Document Translation
     [Chart: indexing time in seconds (0 to 500) versus collection size in thousands of documents (0 to 45), comparing monolingual indexing with cross-language indexing.]

  31. Language-Neutral Retrieval
     [Diagram: Somali query terms and English document terms are each mapped (“translated”) into an interlingual representation; retrieval then matches them in that shared space, producing a ranked list (e.g., 1: 0.91, 2: 0.57, 3: 0.36).]

  32. Translation Evidence
     • Lexical Resources
       – Phrase books, bilingual dictionaries, …
     • Large text collections
       – Translations (“parallel”)
       – Similar topics (“comparable”)
     • Similarity
       – Similar writing (if the character set is the same)
       – Similar pronunciation
     • People
       – May be able to guess topic from lousy translations
