Combining Concept Based and Text Based Indexes for CLIR
CLEF09: Ad-hoc (TEL) Session, Corfu, Greece
Institute AIFB – University of Karlsruhe
Philipp Sorg, Institute AIFB, Universität Karlsruhe, sorg@kit.edu
Philipp Cimiano, Web Information Systems Group, Delft University of Technology, p.cimiano@tudelft.nl
KIT – The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe (TH)

Research Questions
Can multi-lingual information be used to improve retrieval on the TEL dataset?
  Queries in different languages
  Documents in different languages
  Fields of documents in different languages
Can text based (= Machine Translation based) retrieval be combined with concept based retrieval?
  Representation of documents in concept space
  Explicit Semantic Analysis (ESA)
  Score aggregation problem
01.10.2009 Philipp Sorg - Institute AIFB

Agenda
Research Questions
Preprocessing: Language Detection, NLP
Concept based CLIR: Motivation, Explicit Semantic Analysis, Cross-lingual ESA
Retrieval: Architecture, Matching Models, Score Aggregation
Evaluation: Results
Conclusion

Preprocessing of Dataset
Selection of content fields
  Title, subject, alternative, abstract
Language Detection
  Character n-gram model for language detection
  LingPipe Identification Tool
  Each field is classified
    Based on language tag and language detection
    Results in documents with multi-lingual fields
NLP
  Stemming in all languages supported by the Snowball stemmer
  Language-specific stopword removal
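The field-level language detection step can be illustrated with a minimal sketch. It is not the LingPipe tool named on the slide: it assumes a simple cosine comparison of character trigram counts, and the training sentences, `SAMPLES`, and `detect_language` are hypothetical names introduced only for illustration.

```python
from collections import Counter
from math import sqrt

# Toy training text per language (assumption: a real model is trained on large corpora).
SAMPLES = {
    "en": "the history of the city and the library of the university",
    "de": "die geschichte der stadt und die bibliothek der universität",
    "fr": "l'histoire de la ville et la bibliothèque de l'université",
}

def ngrams(text, n=3):
    """Count the character n-grams of a text, padded with one space on each side."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

PROFILES = {lang: ngrams(text) for lang, text in SAMPLES.items()}

def detect_language(field_text):
    """Return the language whose profile is most similar to the field text."""
    profile = ngrams(field_text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))
```

Each content field (title, subject, and so on) would be passed through `detect_language` separately, which is how a single record can end up with fields in several languages.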

Motivation of Concept Based CLIR
Traditional approach to multi-lingual IR
  Translation of queries or documents
Problems
  MT is not available for many language pairs
  Propagation of errors; inherits all problems of mono-lingual retrieval
Alternative approach: language-independent representation
  Queries and documents are mapped to a common concept space

Explicit Concept Model
Idea: Use Web 2.0 resources to define concepts
  Pragmatic definition of concepts
    Wikipedia articles, tagged web sites, products, …
  Cover a broad range of topics and languages
  Freely available
Example: Wikipedia articles as concepts
We use Explicit Semantic Analysis (Cross-lingual ESA)
  Gabrilovich and Markovitch, IJCAI 2007
  Potthast et al., ECIR 2008
  Sorg and Cimiano, CLEF 2008

Idea of ESA
Input text: “The transport of bicycles on trains”
Example concept (Wikipedia article) Bicycle: “A bicycle, bike, or cycle is a pedal-driven, human-powered vehicle with two wheels attached to a frame, one behind the other. A person who rides a bicycle is called a cyclist or a bicyclist.”
A TF.IDF function maps the input text to a weighted vector of concepts:
  1.52  <Road_bicycle>
  1.18  <Bicycle>
  1.12  <Velorama>
  0.92  <Cycling>
  0.92  <Biker>
  0.92  <Bianchi_(bicycle_manufacturer)>
  0.79  <Train_(disambiguation)>
  0.77  <Transport>
  …     …
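A minimal sketch of the ESA mapping, assuming a plain TF.IDF weighting without length normalization or stemming. The three toy articles, `ARTICLES`, and `esa_vector` are hypothetical stand-ins for a full Wikipedia index and are not taken from the presentation.

```python
from collections import Counter
from math import log

# Toy stand-ins for Wikipedia articles (assumption: real ESA indexes full article text).
ARTICLES = {
    "Bicycle": "a bicycle is a pedal driven human powered vehicle with two wheels",
    "Train": "a train is a series of connected vehicles that run along a railway track",
    "Transport": "transport is the movement of people and goods by vehicle train or bicycle",
}

def tokenize(text):
    return text.lower().split()

# Term frequencies of every article, keyed by concept name.
ARTICLE_TERMS = {concept: Counter(tokenize(text)) for concept, text in ARTICLES.items()}

def idf(term):
    """Inverse document frequency of a term over the article collection."""
    df = sum(1 for terms in ARTICLE_TERMS.values() if term in terms)
    return log(len(ARTICLE_TERMS) / df) if df else 0.0

def esa_vector(text):
    """Map a text to concept weights: sum of tf(term, article) * idf(term) over its terms."""
    vector = {}
    for concept, terms in ARTICLE_TERMS.items():
        weight = sum(terms[term] * idf(term) for term in tokenize(text))
        if weight > 0:
            vector[concept] = weight
    return vector
```

For the input `"bicycle railway"`, the rarer term `railway` occurs only in the `Train` article, so `Train` receives a higher weight than `Bicycle` or `Transport`.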

Example Cross-lingual ESA Concept Vector
Input text: “The transport of bicycles on trains”
German interpretation      Weight  Concept  English interpretation
<Radrennen>                1.52    A1       <Road_bicycle>
<Fahrrad>                  1.18    A2       <Bicycle>
<Velorama>                 1.12    A3       <Velorama>
<Fahrradfahren>            0.92    A4       <Cycling>
<Biker>                    0.92    A5       <Biker>
<Bianchi_(Unternehmen)>    0.92    A6       <Bianchi_(bicycle_manufacturer)>
<Train>                    0.79    A7       <Train_(disambiguation)>
<Verkehr>                  0.77    A8       <Transport>
…                          …       …        …
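The mapping between the German and English interpretations of a concept vector can be sketched as a lookup through cross-language links. The link table below is copied from the example above; `map_concepts` is a hypothetical helper, and dropping concepts without a link is an assumption rather than something the slides specify.

```python
# Cross-language links from the example above: German article -> English article.
LANGUAGE_LINKS_DE_EN = {
    "Radrennen": "Road_bicycle",
    "Fahrrad": "Bicycle",
    "Velorama": "Velorama",
    "Fahrradfahren": "Cycling",
    "Biker": "Biker",
    "Bianchi_(Unternehmen)": "Bianchi_(bicycle_manufacturer)",
    "Train": "Train_(disambiguation)",
    "Verkehr": "Transport",
}

def map_concepts(vector, links):
    """Re-express a concept vector in the target language's concept space."""
    mapped = {}
    for concept, weight in vector.items():
        target = links.get(concept)
        if target is not None:  # assumption: concepts without a link are dropped
            mapped[target] = mapped.get(target, 0.0) + weight
    return mapped

german_vector = {"Radrennen": 1.52, "Fahrrad": 1.18, "Verkehr": 0.77}
english_vector = map_concepts(german_vector, LANGUAGE_LINKS_DE_EN)
```

After the mapping, a German query vector and an English document vector share the same concept names and can be compared directly.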

Retrieval Architecture
[Figure: retrieval architecture]
Indexing: TEL records pass through language classification (en, de, fr, …). For each language, a baseline text index and an ESA index are built.
Search: the topic is machine-translated (de, en, fr, …) for the baseline indexes and mapped by ESA for the ESA indexes. Results are combined in Matching and Aggregation (Step 1) and Matching and Aggregation (Step 2).

Matching and Aggregation (Step 1)
Optimization of matching model
  Using CLEF2008 topics and relevance assessments
  Models provided by the Terrier framework
  BL: DLH13, ONB: LemurTF_IDF, BNF: BB2
Linear aggregation of scores
  Each document has a score for each index (= language)
  Different normalization functions
    Based on maximal score in each ranking
    Based on the number of retrieved documents of each ranking
    Based on a priori weights
      Language distribution of text in corpus
  \mathrm{score}(t,d) := \sum_{r \in R} \delta(r) \, \mathrm{score}_r(t,d)
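A minimal sketch of the linear aggregation, assuming that R is the set of per-language rankings and that the per-ranking weight is a plain multiplicative factor. The function names and the exact form of each normalization are assumptions; the slides only name the normalization strategies.

```python
def aggregate(rankings, delta):
    """Sum weighted scores over all rankings.

    rankings: {language: {doc_id: score}}, one ranking per index.
    delta: function (language, ranking) -> weight of that ranking.
    """
    combined = {}
    for lang, ranking in rankings.items():
        weight = delta(lang, ranking)
        for doc, score in ranking.items():
            combined[doc] = combined.get(doc, 0.0) + weight * score
    return combined

def max_score_weight(lang, ranking):
    """Normalize by the maximal score in the ranking (assumed form: 1 / max)."""
    top = max(ranking.values(), default=0.0)
    return 1.0 / top if top > 0 else 0.0

def prior_weight(priors):
    """A priori weights, e.g. the language distribution of text in the corpus."""
    return lambda lang, ranking: priors.get(lang, 0.0)

rankings = {
    "en": {"d1": 4.0, "d2": 2.0},
    "de": {"d2": 8.0, "d3": 4.0},
}
by_max = aggregate(rankings, max_score_weight)
by_prior = aggregate(rankings, prior_weight({"en": 0.6, "de": 0.4}))
```

With max-score normalization, each ranking contributes at most 1.0 per document, so `d2`, which appears in both rankings, ends up ahead of `d1`.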

Matching and Aggregation (Step 2)
ESA retrieval using cosine similarity
  Implementation based on inverted concept index
Linear aggregation of concept based scores and text based scores
  Using the aggregated score from text based retrieval (Step 1)
  Weight factor to modify influence of concept based retrieval
    Optimized on CLEF2008 topics
Evaluation measures
  MAP: Mean Average Precision
  P@10: Precision at cutoff level of 10 documents
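The second step and the evaluation measures can be sketched as follows. The combination is assumed to be the text based score plus a weighted concept based score; the slides do not give the exact formula, and all function names here are hypothetical. MAP is the mean of `average_precision` over all topics.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two sparse concept vectors ({concept: weight})."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def combine(text_scores, concept_scores, weight):
    """Assumed combination: text score + weight * concept score, per document."""
    docs = set(text_scores) | set(concept_scores)
    return {d: text_scores.get(d, 0.0) + weight * concept_scores.get(d, 0.0) for d in docs}

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k ranked documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks of the relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked, relevant) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs) if runs else 0.0
```

A small `weight` keeps the text based ranking dominant and lets the concept based score act mainly as a tie-breaker, while a larger value gives the ESA ranking more influence.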

Evaluation
Topic  Retrieval Method          BL          ONB         BNF
lang.                            MAP  P@10   MAP  P@10   MAP  P@10
En     Baseline (single index)   35   51     16   26     25   39
En     Multiple Indexes          33   50     15   24     22   34
En     Concept + Baseline        35   52     17   27     25   39
De     Baseline (single index)   33   49     23   35     24   35
De     Multiple Indexes          31   48     23   34     22   32
De     Concept + Baseline        33   49     24   35     24   36
Fr     Baseline (single index)   31   48     15   22     27   38
Fr     Multiple Indexes          29   45     14   20     25   35
Fr     Concept + Baseline        32   51     15   22     27   37

Conclusion
Baseline is very strong
Can multi-lingual information be used to improve retrieval on the TEL dataset?
  Use of multi-lingual indexes based on language detection did not improve retrieval
  Problem of score aggregation
    Linear aggregation model with (simple) normalization is not working
Can text based (= Machine Translation based) retrieval be combined with concept based retrieval?
  Combination of concept and text based indexes yields only small improvements
  We could not reconstruct the large improvements reported on monolingual collections
  Not enough context in short TEL records for concept mapping?

Thank you! Questions?
Joint work with
  Philipp Cimiano (Universität Bielefeld)
  Marlon Braun, David Nicolay (Universität Karlsruhe)
Acknowledgments
  Multipla Project, DFG grant 38457858
