 
The Importance of Evaluation for Multilingual Information Retrieval
Carol Peters
ISTI-CNR, Pisa, Italy
FIRE 2011 IIT Bombay, 2-4 December 2011

From FIRE 2008 to FIRE 2010
FIRE 2008
• CLEF: Objectives and First Results
FIRE 2010
• 10 Years of CLEF: An Assessment
  - What we’ve done
  - What we’ve learned
  - What the next steps should be
FIRE 2011
• Exploiting the Results for MLIR System Building

MLIR/CLIR System Evaluation
In IR the role of an evaluation campaign is to:
• Identify priority areas for research:
  - evaluation permits hypotheses to be validated and progress assessed
• Support system development and testing
  - evaluation saves developers time and money
• 1997 – First MLIR/CLIR system evaluation campaigns in US and Japan: TREC and NTCIR
• 2000 – MLIR/CLIR evaluation in Europe: CLEF (extension of CLIR track at TREC)
• 2008 – FIRE: MLIR/CLIR evaluation for Indian languages

Results
These evaluation initiatives:
• Promote research
• Encourage creation of multi-disciplinary communities
• Produce vast amounts of valuable scientific data
• Favour understanding of issues involved in successful system development

Outline
• The Need for MLIR/CLIR?
• What are the Challenges?
• What is the Contribution of Evaluation?
• The Example of CLEF

MLIR in the Information Society
• Web is an important platform for knowledge dissemination and acquisition
• User information needs are increasingly varied
  - From primarily academic use to widespread commercial, leisure, educational, entertainment etc. uses
• Content is available in many languages and non-English content is growing rapidly
• Information providers and seekers should have equal opportunities
• Preservation of national languages

The Need for Multilingual Search
http://www.internetworldstats.com/stats.htm

Countries with most Internet Users

Country         Population      Internet Users 2000   Internet Users 2011   Penetration (% of Pop.)
China           1,336,718,015   22,500,000            485,000,000           36.3%
United States   313,232,044     95,354,000            245,000,000           78.2%
India           1,189,172,906   5,000,000             100,000,000           8.4%
Japan           126,475,664     47,080,000            99,182,000            78.4%
Brazil          203,429,773     5,000,000             75,982,000            37.4%
Germany         81,471,834      24,000,000            65,125,000            79.9%
Russia          138,739,892     3,100,000             59,700,000            43.0%
UK              62,698,362      15,400,000            51,442,100            82.0%
France          65,102,719      8,500,000             45,262,000            69.5%
Nigeria         155,215,573     20,000                43,982,200            28.3%

Source: http://www.internetworldstats.com/top20.htm

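The penetration column appears to be the 2011 user count divided by the population. The short Python check below recomputes it for three rows copied from the table; the function name `penetration` is a hypothetical helper introduced only for this illustration.

```python
# Rows copied from the table: (country, population, Internet users 2011).
rows = [
    ("China", 1_336_718_015, 485_000_000),
    ("India", 1_189_172_906, 100_000_000),
    ("Germany", 81_471_834, 65_125_000),
]


def penetration(population, users):
    """Internet users as a percentage of population, rounded to one decimal."""
    return round(100.0 * users / population, 1)


for country, population, users in rows:
    print(f"{country}: {penetration(population, users)}%")
```

The recomputed values agree with the percentages listed in the table for these three countries.
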
MLIR-related research
• Concerns the storage, access, retrieval and presentation of digital information in any of the world's languages.
• Main areas of interest:
  - enabling technology (character encoding, scripts, internationalisation, localisation)
  - multiple language access, browsing, retrieval, display
  - crossing the language boundary (filtering, merging, ranking, selecting, presenting results)

The Terminology
• Multilingual Information Access (MLIA)
  - Accessing, querying and retrieving information from collections in any language (covering basic enabling techniques and including MLIR and CLIR)
• Multilingual Information Retrieval (MLIR)
  - Information retrieval in multiple languages (includes CLIR)
• Cross-Language Information Retrieval (CLIR)
  - Querying multilingual collections in one language in order to retrieve documents in other languages

The Grand Challenge
Fully multilingual, multimodal IR systems
• capable of processing a query in any medium and any language
• finding relevant information from a multilingual multimedia collection containing documents in any language and form
• and presenting it in the style most likely to be useful to the user
Oard & Hull, AAAI Spring Symposium, Stanford 1997

MLIR/CLIR System Development is Complex
• There are 6,800 known languages spoken in 200 countries
• ca. 2,250 have writing systems (the others are only spoken)
• Just 300 have some kind of language processing tools
MLIR/CLIR system development involves integrating IR techniques with Language Processing tools and Language Transfer mechanisms

MLIR/CLIR System Development is Complex
• Multilingual Portals (Localization)
  - How many languages / how many levels should be multilingual / how to handle updates / linguistically and culturally dependent issues
• Monolingual Search for Multiple Languages
  - encoding and representation issues / language identification / indexing issues (stop words, stemmers, morphological analysers, named entity recognition, ...)
• Cross-Language Search
  - translation resources (lexicons, corpora, MT systems)
• Presentation of Results
  - in a form interpretable and exploitable by the user

Main Challenges
• Understanding Search in the Multilingual Context (language & culture)
• Globalisation (internationalisation & localisation)
• MLIR/CLIR System Development
  - Language processing tools
  - Best retrieval mechanisms (indexing, matching, merging)
  - Best translation resources
• From text to multimodal retrieval
• Providing effective user support
• Going from Research to Practice

Building a CLIR System
• Pre-process & index both documents and queries, generally using language-dependent techniques (tokenisation, stopwords, stemming, morphological analysis, decompounding, etc.)
• Translate: queries or documents (or both)
• Translation resources
  - Machine Translation (MT)
  - Parallel/comparable corpora
  - Bilingual Dictionaries
  - Multilingual Thesauri
  - Conceptual Interlingua
• Find relevant documents in target collection(s) & present results

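The steps above can be sketched as a toy query-translation pipeline in Python. This is a minimal illustration under loud assumptions, not the method of any system discussed in the talk: the stopword list, the Italian-to-English dictionary, the three documents and all function names are invented for the example, translation uses a bilingual dictionary only, and matching is a plain term-frequency count with no stemming.

```python
import re
from collections import Counter

# Hypothetical toy resources, invented for illustration only.
STOPWORDS_IT = {"il", "la", "di", "della", "e", "in"}
BILINGUAL_DICT = {  # Italian -> English; ambiguous terms keep all translations
    "ricerca": ["search", "research"],
    "multilingue": ["multilingual"],
    "valutazione": ["evaluation"],
}
DOCUMENTS_EN = {
    "d1": "Evaluation campaigns support multilingual search research.",
    "d2": "Speech recognition is used to index spoken documents.",
    "d3": "Multilingual evaluation needs shared test collections.",
}


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def translate_query(query):
    """Drop source-language stopwords and look each term up in the dictionary.

    Out-of-vocabulary terms are passed through untranslated.
    """
    terms = []
    for token in tokenize(query):
        if token in STOPWORDS_IT:
            continue
        terms.extend(BILINGUAL_DICT.get(token, [token]))
    return terms


def search(query):
    """Rank target-language documents by summed frequency of translated terms."""
    terms = translate_query(query)
    scores = {}
    for doc_id, text in DOCUMENTS_EN.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in terms)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


print(search("valutazione della ricerca multilingue"))
```

Keeping every dictionary translation of an ambiguous term (here "ricerca") is the simplest choice; a real system would typically weight or disambiguate the candidates using one of the other translation resources listed above.
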
Main CLIR Difficulties (I)
• Language identification
• Morphology: inflection, derivation, compounding, …
• OOV terms, e.g. proper names, terminology
• Multi-word concepts, e.g. phrases and idioms
• Ambiguity, e.g. polysemy
• Handling many languages: L1 -> Ln
• Merging results from different sources / media
• Presenting the results in a useful fashion

Main CLIR Difficulties (II)
• CLIR systems need clever pre-processing of target collections (e.g. semantic analysis, classification, information extraction)
• CLIR systems need intelligent post-processing of results: merging / summarization / translation
• CLIR systems need well-developed resources
  - Language Processing Tools
  - Language Resources
• Resources are expensive to acquire, maintain, update

CLIR for Multimedia
• Retrieval from a mixed media collection is a non-trivial problem
• Different media are processed in different ways and suffer from different kinds of indexing errors:
  - spoken documents indexed using speech recognition
  - handwritten documents indexed using OCR
  - images indexed using significant features
• Need for complex integration of multiple technologies
• Need for merging of results from different sources

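Merging results from different sources is hard partly because each retrieval component scores documents on its own scale. The Python sketch below shows one simple approach, min-max normalisation of each ranked list before interleaving; it is an assumption chosen for illustration, not a technique named in the slides, and the run data and function names are hypothetical.

```python
def min_max_normalise(run):
    """Rescale the scores of one ranked list to the range [0, 1]."""
    if not run:
        return []
    scores = [score for _, score in run]
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # All scores are equal, so no ordering information is lost.
        return [(doc_id, 1.0) for doc_id, _ in run]
    return [(doc_id, (score - lowest) / (highest - lowest)) for doc_id, score in run]


def merge_runs(runs):
    """Normalise each run separately, then sort all documents by normalised score."""
    merged = []
    for run in runs:
        merged.extend(min_max_normalise(run))
    return sorted(merged, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical runs: a text index and a speech index with incomparable raw scores.
text_run = [("text1", 12.0), ("text2", 6.0), ("text3", 3.0)]
speech_run = [("speech1", 0.9), ("speech2", 0.5)]

for doc_id, score in merge_runs([text_run, speech_run]):
    print(doc_id, round(score, 3))
```

Min-max normalisation treats the top document of every source as equally good, which is rarely true when sources differ in quality; that limitation is one reason merging remains a research problem.
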
Supporting the User
(Clough, October 2011)

MLIR/CLIR System Evaluation is Complex
• Need to evaluate single components
• Need to evaluate overall system performance
• Need to distinguish CL aspects from IR issues

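One way to separate cross-language effects from general IR effects is to run the same topic both monolingually and cross-lingually against the same collection and compare the two scores. The Python sketch below does this with average precision. The choice of metric, the rankings and the relevance judgements are all assumptions made for illustration; the slides do not prescribe a measure.

```python
def average_precision(ranking, relevant):
    """Average precision of a ranked list against a set of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# Hypothetical judgements and runs for a single topic.
relevant = {"d1", "d3", "d5"}
monolingual_run = ["d1", "d3", "d2", "d5"]     # query in the document language
cross_language_run = ["d2", "d1", "d4", "d3"]  # translated query, same system

mono_ap = average_precision(monolingual_run, relevant)
cross_ap = average_precision(cross_language_run, relevant)

print(f"monolingual AP:    {mono_ap:.3f}")
print(f"cross-language AP: {cross_ap:.3f}")
print(f"cross-language as share of monolingual: {cross_ap / mono_ap:.1%}")
```

Because both runs share the same indexing and matching components, the gap between the two scores can be attributed mainly to the translation step rather than to the underlying retrieval engine.
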