5/24/09

Cross-Language IR
CISC489/689-010, Lecture #23
Monday, May 11th
Ben Carterette

Cross-Language IR
• User submits a query in one language, gets results in a different language
• Documents are semi-structured and heterogeneous (as almost all data in IR), and also in multiple languages
• Information may only be available in documents written in one of the languages
• Highly useful to the intelligence community
Approaches to CLIR
• Translate the documents into the users' language, and let the users submit queries in their own language
• Translate the users' queries into the target language(s) and use the translated query for retrieval
• Translate both queries and documents into an "intermediate" language

Automatic Translation
• What are some approaches to automatic translation?
– Language-to-language dictionaries
• Languages do not translate precisely
– One word with several meanings in one language might translate to several different words in the other
– Many words with the same meaning might all translate to a single word
– A word in one language might only be expressible as a phrase in another (or vice versa)
– etc.
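The ambiguity problem with dictionary lookup can be made concrete with a short sketch. The bilingual dictionary below is hypothetical and deliberately tiny; real language-to-language dictionaries are far larger and messier.

```python
# A minimal sketch of dictionary-based query translation.  The bilingual
# dictionary here is hypothetical, invented purely for illustration.
BILINGUAL_DICT = {
    "bank": ["orilla", "terraplen", "banco", "bateria", "banca"],
    "fraud": ["fraude", "impostor"],
}

def translate_query(query):
    """Replace each query word with every dictionary translation.

    With no way to choose the intended sense, the translated query keeps
    all candidates -- exactly the ambiguity problem described above.
    """
    translated = []
    for word in query.lower().split():
        # Words missing from the dictionary pass through untranslated.
        translated.extend(BILINGUAL_DICT.get(word, [word]))
    return translated

print(translate_query("bank fraud"))
```

Every sense of "bank" ends up in the translated query, so the retrieval engine has no signal about which one the user meant.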
Example
• English queries to retrieve Spanish documents
• System works by translating the query to Spanish
• Query: "bank fraud"
• Translations of "bank":
– orilla (river bank)
– terraplen (bank of earth)
– banco (bank of clouds)
– bateria (bank of lights)
– banco (financial institution)
– banca (casino bank)
• Translations of "fraud":
– impostor (fraudulent person)
– fraude (deception)
• How would a dictionary-based system know which pair of translations to use?
• Possibly correct translation: "fraude bancario"

Statistical Approach
• Instead of trying to translate directly, apply statistical methods
• Learn "translation probabilities" P(f | e): the probability of translating string e in language E to string f in language F
• E.g.: P(orilla fraude | bank fraud), P(orilla impostor | bank fraud), P(banco fraude | bank fraud), …
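Under the statistical view, the translation candidates above become a probability distribution rather than an unordered dictionary list. A tiny sketch, with probability values that are invented purely for illustration:

```python
# Hypothetical learned translation probabilities P(f | "bank fraud").
# The numbers are made up for illustration; a real system would estimate
# them from data.
trans_probs = {
    "orilla fraude": 0.05,
    "orilla impostor": 0.01,
    "banco fraude": 0.60,
    "banca fraude": 0.20,
    "terraplen impostor": 0.01,
}

# Instead of guessing one dictionary entry, take the highest-probability
# translation (or weight all candidates by probability during retrieval).
best = max(trans_probs, key=trans_probs.get)
print(best)  # banco fraude
```

The point is that the statistics, not the dictionary order, decide which translation pair wins.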
Cross-Language Language Model
• Recall the query-likelihood language model:

    P(Q \mid D) = \prod_{q \in Q} P(q \mid D) = \prod_{q \in Q} \left[ (1 - \alpha_D) \frac{tf_{q,D}}{|D|} + \alpha_D \frac{ctf_q}{|C|} \right]

• Let's adapt this to cross-language retrieval using statistical translation:

    P(Q_f \mid D_e) = \prod_{q_f \in Q_f} P(q_f \mid D_e)
                    = \prod_{q_f \in Q_f} \sum_{t_e \in E} P(q_f \mid t_e) \, P(t_e \mid D_e)
                    = \prod_{q_f \in Q_f} \sum_{t_e \in E} P(q_f \mid t_e) \left[ (1 - \alpha_{D_e}) \frac{tf_{t_e,D_e}}{|D_e|} + \alpha_{D_e} \frac{ctf_{t_e}}{|C_e|} \right]

Translation Model
• What is P(q_f | t_e)?
• The translation model: the probability of translating word t_e in language E to word q_f in language F
• Where does it come from?
– Maybe a dictionary approach: every possible translation of t_e has equal probability
– e.g. P(orilla | bank) = P(banco | bank) = P(banca | bank) = …
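The cross-language query likelihood above can be sketched directly in code: each foreign query term q_f is generated by drawing an English term t_e from the smoothed document model and then translating it. This is a toy implementation under simplifying assumptions (documents and the collection are plain word lists, and the smoothing parameter alpha is a fixed constant).

```python
import math

def cross_language_score(query_f, doc_e, collection_e, trans_probs, alpha=0.5):
    """Log of P(Q_f | D_e) under the translation-augmented
    query-likelihood model.

    trans_probs maps (q_f, t_e) -> P(q_f | t_e); pairs not present are
    treated as having probability zero.
    """
    vocab = set(doc_e) | set(collection_e)
    log_p = 0.0
    for q_f in query_f:
        # P(q_f | D_e) = sum over English terms of P(q_f | t_e) P(t_e | D_e)
        p_qf = 0.0
        for t_e in vocab:
            # Smoothed P(t_e | D_e): mix document term frequency with
            # collection term frequency.
            p_te_d = ((1 - alpha) * doc_e.count(t_e) / len(doc_e)
                      + alpha * collection_e.count(t_e) / len(collection_e))
            p_qf += trans_probs.get((q_f, t_e), 0.0) * p_te_d
        # Floor at a tiny constant so untranslatable terms don't zero out
        # the whole query.
        log_p += math.log(p_qf) if p_qf > 0 else math.log(1e-12)
    return log_p
```

With a toy translation table such as {("banco", "bank"): 0.5, ("fraude", "fraud"): 1.0}, a document containing "bank" and "fraud" scores higher for the Spanish query ["banco", "fraude"] than a document that contains neither.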
Statistical Translation Model
• An alternative approach: parallel corpora

Statistical Translation with Parallel Corpora
• Parallel corpora consist of documents in two or more languages that are known to be translations of one another
• The parallel corpora are aligned: string e and string f are marked as translations of each other
• We can use these alignments to estimate a translation model
Translation Model
• To estimate P(q_f | t_e), count the number of aligned string pairs (e, f) such that t_e is a word in e and q_f is a word in f
• Divide by the total number of strings in language E that contain t_e

    P(q_f \mid t_e) = \frac{|\{(e, f) : t_e \in e \text{ and } q_f \in f\}|}{|\{e : t_e \in e\}|}

Simple Alignment Example
• English sentence: "The objective was clear: arrest and extradite to Mexico the woman against whom they had charged for fraud to a recognized banking institution."
• Spanish sentence: "El objetivo era claro: detener a la mujer y enviarla de regreso a México pues habían cargos en su contra por fraude a una reconocida institución bancaria."
• Every pair of words in these two sentences will have some translation probability
• Over many sentences, the highest probabilities will be the pairs of words that are most closely related
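The counting rule above is simple enough to implement in a few lines. A sketch, assuming aligned pairs are represented as tuples of word lists:

```python
def translation_prob(aligned_pairs, t_e, q_f):
    """Estimate P(q_f | t_e) by the counting rule: the fraction of aligned
    pairs whose English side contains t_e and whose foreign side also
    contains q_f.

    aligned_pairs is a list of (english_words, foreign_words) tuples.
    """
    # Strings in language E that contain t_e (the denominator).
    containing = [pair for pair in aligned_pairs if t_e in pair[0]]
    if not containing:
        return 0.0
    # Of those, pairs whose foreign side contains q_f (the numerator).
    co_occurring = [pair for pair in containing if q_f in pair[1]]
    return len(co_occurring) / len(containing)
```

For example, with three toy aligned pairs (invented for illustration) in which "bank" appears on the English side of all three but "fraude" appears on the Spanish side of only one, the estimate P(fraude | bank) comes out to 1/3.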
Alignments
• Alignments can be much more detailed
• Images from Brown et al., "The Mathematics of Statistical Machine Translation"

Parallel Corpora
• Where do we get parallel corpora?
– Find documents that we know to be translations
– Canadian Hansard: transcripts of Canadian parliamentary debates in both English and French
– European Union law in 22 languages
• Anything that's not law-related?
– Wikipedia articles in different languages, though not necessarily translations
CLIR Experiments
• A CLIR track ran at TREC from 1998 through 2002
• Languages used include English, German, French, Italian, Chinese, and Arabic
• Other issues in CLIR:
– Segmentation, stemming, stopping, and phrases require different approaches in different languages
– I am going to focus on the high-level problem

CLIR Experiments
• In 2001 and 2002, the main CLIR task was English queries to retrieve Arabic documents
• Documents: 383,872 news articles from Agence France Presse, 1994–2000
• Information needs: 25 queries, descriptions, and narratives written in English by native Arabic speakers
– Translated into Arabic and French as well
• Participating sites could do CLIR (English to Arabic or French to Arabic) or normal IR (Arabic to Arabic)
Example Topic
(English version shown below; the slide presents the Arabic version of the same topic side by side)

<num> Number: AR26
<title> Kurdistan Independence
<desc> Description:
How does the National Council of Resistance relate to the potential independence of Kurdistan?
<narr> Narrative:
Articles reporting activities of the National Council of Resistance are considered on topic. Articles discussing Ocalan's leadership within the context of the Kurdish efforts toward independence are also considered on topic.

Example Document
[Slide shows an example Arabic news document]
Results
[Plots from Oard & Gey, "The TREC-2002 Arabic/English CLIR Track": cross-lingual results (English/French to Arabic) and monolingual results (Arabic to Arabic)]
• BBN, UMass, and IBM used statistical models
• UMass's performance on cross-language is roughly equal to its performance on monolingual!

Analysis
• The translation model is imperfect
– It assigns probabilities to almost every pair of words
– There are many errors in translation
• So how could cross-lingual be almost as good as monolingual?
• Hypotheses:
– The translation process disambiguates some terms
– The translation process smooths query models