The Importance of Evaluation for Multilingual Information Retrieval


  1. The Importance of Evaluation for Multilingual Information Retrieval
     Carol Peters, ISTI-CNR, Pisa, Italy
     FIRE 2011, IIT Bombay, 2-4 December 2011

  2. From FIRE 2008 to FIRE 2010
     - FIRE 2008: CLEF: Objectives and First Results
     - FIRE 2010: 10 Years of CLEF: An Assessment
       - What we’ve done
       - What we’ve learned
       - What the next steps should be
     - FIRE 2011: Exploiting the Results for MLIR System Building

  3. MLIR/CLIR System Evaluation
     In IR the role of an evaluation campaign is to:
     - Identify priority areas for research: evaluation permits hypotheses to be validated and progress assessed
     - Support system development and testing: evaluation saves developers time and money
     Evaluation campaigns:
     - 1997 – First MLIR/CLIR system evaluation campaigns in US and Japan: TREC and NTCIR
     - 2000 – MLIR/CLIR evaluation in Europe: CLEF (extension of CLIR track at TREC)
     - 2008 – FIRE: MLIR/CLIR evaluation for Indian languages

  4. Results
     These evaluation initiatives:
     - Promote research
     - Encourage creation of multi-disciplinary communities
     - Produce vast amounts of valuable scientific data
     - Favour understanding of issues involved in successful system development

  5. Outline
     - The Need for MLIR/CLIR?
     - What are the Challenges?
     - What is the Contribution of Evaluation?
     - The Example of CLEF

  6. MLIR in the Information Society
     - Web is an important platform for knowledge dissemination and acquisition
     - User information needs are increasingly varied
       - From primarily academic use to widespread commercial, leisure, educational, entertainment etc. uses
     - Content is available in many languages and non-English content is growing rapidly
     - Information providers and seekers should have equal opportunities
     - Preservation of national languages

  7. The Need for Multilingual Search
     http://www.internetworldstats.com/stats.htm

  8. Countries with most Internet Users

     Country          Population       Internet Users 2000   Internet Users 2011   Penetration (% of Pop.)
     China            1,336,718,015    22,500,000            485,000,000           36.3%
     United States    313,232,044      95,354,000            245,000,000           78.2%
     India            1,189,172,906    5,000,000             100,000,000           8.4%
     Japan            126,475,664      47,080,000            99,182,000            78.4%
     Brazil           203,429,773      5,000,000             75,982,000            37.4%
     Germany          81,471,834       24,000,000            65,125,000            79.9%
     Russia           138,739,892      3,100,000             59,700,000            43.0%
     UK               62,698,362       15,400,000            51,442,100            82.0%
     France           65,102,719       8,500,000             45,262,000            69.5%
     Nigeria          155,215,573      20,000                43,982,200            28.3%

     http://www.internetworldstats.com/top20.htm
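
The penetration column is simply 2011 users divided by population. As a quick check of three rows, here is a minimal sketch; the function names `penetration` and `growth_factor` are illustrative, not from the slides, and the figures are copied from the table.

```python
# Figures copied from the slide: (population, users in 2000, users in 2011).
ROWS = {
    "China": (1_336_718_015, 22_500_000, 485_000_000),
    "India": (1_189_172_906, 5_000_000, 100_000_000),
    "UK": (62_698_362, 15_400_000, 51_442_100),
}


def penetration(population: int, users: int) -> float:
    """Internet users as a percentage of the population, to one decimal."""
    return round(100 * users / population, 1)


def growth_factor(users_2000: int, users_2011: int) -> float:
    """How many times larger the 2011 user base is than the 2000 one."""
    return users_2011 / users_2000


for country, (population, users_2000, users_2011) in ROWS.items():
    print(country, penetration(population, users_2011),
          growth_factor(users_2000, users_2011))
```

India, for example, shows a twentyfold increase in users over the period while still having the lowest penetration in the table.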

  10. MLIR related research
      Concerns the storage, access, retrieval and presentation of digital information in any of the world's languages.
      Main areas of interest:
      - enabling technology (character encoding, scripts, internationalisation, localisation)
      - multiple language access, browsing, retrieval, display
      - crossing the language boundary (filtering, merging, ranking, selecting, presenting results)

  11. The Terminology
      - Multilingual Information Access (MLIA): accessing, querying and retrieving information from collections in any language (covering basic enabling techniques and including MLIR and CLIR)
      - Multilingual Information Retrieval (MLIR): information retrieval in multiple languages (includes CLIR)
      - Cross-Language Information Retrieval (CLIR): querying multilingual collections in one language in order to retrieve documents in other languages

  12. The Grand Challenge
      Fully multilingual, multimodal IR systems
      - capable of processing a query in any medium and any language
      - finding relevant information from a multilingual multimedia collection containing documents in any language and form
      - and presenting it in the style most likely to be useful to the user
      Oard & Hull, AAAI Spring Symposium, Stanford 1997

  13. MLIR/CLIR System Development is Complex
      - There are 6,800 known languages spoken in 200 countries
      - ca. 2,250 have writing systems (the others are only spoken)
      - Just 300 have some kind of language processing tools
      MLIR/CLIR system development involves integrating IR techniques with Language Processing tools and Language Transfer mechanisms

  14. MLIR/CLIR System Development is Complex
      - Multilingual Portals (Localization): how many languages / how many levels should be multilingual / how to handle updates / linguistic and culturally dependent issues
      - Monolingual Search for Multiple Languages: encoding and representation issues / language identification / indexing issues (stop words, stemmers, morphological analysers, named entity recognition, ...)
      - Cross-Language Search: translation resources (lexicons, corpora, MT systems)
      - Presentation of Results: in a form interpretable and exploitable by the user

  15. Main Challenges
      - Understanding Search in the Multilingual Context (language & culture)
      - Globalisation (internationalisation & localisation)
      - MLIR/CLIR System Development
        - Language processing tools
        - Best retrieval mechanisms (indexing, matching, merging)
        - Best translation resources
      - From text to multimodal retrieval
      - Providing effective user support
      - Going from Research to Practice

  18. Building a CLIR System
      - Pre-process & index both documents and queries – generally using language dependent techniques (tokenisation, stopwords, stemming, morphological analysis, decompounding, etc.)
      - Translate: queries or documents (or both)
      - Translation resources:
        - Machine Translation (MT)
        - Parallel/comparable corpora
        - Bilingual Dictionaries
        - Multilingual Thesauri
        - Conceptual Interlingua
      - Find relevant documents in target collection(s) & present results
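
The three steps on this slide (pre-process, translate, retrieve) can be sketched in a few lines. This is a toy illustration under loud assumptions, not the method of any CLEF or FIRE system: it translates the query with a bilingual dictionary (one of the resources listed above), and the stopword lists, the `EN_IT` dictionary, the documents and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy resources. A real system would use full stopword lists,
# stemmers or morphological analysers, and a real dictionary or MT system.
STOPWORDS = {
    "en": {"the", "of", "in"},
    "it": {"il", "la", "di", "in", "del"},
}
EN_IT = {
    "evaluation": ["valutazione"],
    "retrieval": ["reperimento", "recupero"],
    "system": ["sistema"],
}


def preprocess(text, lang):
    """Step 1: lowercase, tokenise on whitespace, drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS[lang]]


def translate_query(terms, dictionary):
    """Step 2: replace each term by all its translations; keep OOV terms as-is."""
    translated = []
    for term in terms:
        translated.extend(dictionary.get(term, [term]))
    return translated


def search(query_terms, docs):
    """Step 3: rank documents by how often the query terms occur in them."""
    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        score = sum(counts[t] for t in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


docs_it = {
    "d1": preprocess("la valutazione del sistema di recupero", "it"),
    "d2": preprocess("il sistema di traduzione", "it"),
    "d3": preprocess("storia di Pisa", "it"),
}
query = translate_query(
    preprocess("evaluation of the retrieval system", "en"), EN_IT
)
ranking = search(query, docs_it)
print(query)
print(ranking)
```

Keeping every dictionary translation of "retrieval" is the simplest policy; it also shows where ambiguity and out-of-vocabulary terms enter the pipeline.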

  19. Main CLIR Difficulties (I)
      - Language identification
      - Morphology: inflection, derivation, compounding, …
      - OOV terms, e.g. proper names, terminology
      - Multi-word concepts, e.g. phrases and idioms
      - Ambiguity, e.g. polysemy
      - Handling many languages: L1 -> Ln
      - Merging results from different sources / media
      - Presenting the results in a useful fashion

  20. Main CLIR Difficulties (II)
      - CLIR systems need clever pre-processing of target collections (e.g. semantic analysis, classification, information extraction)
      - CLIR systems need intelligent post-processing of results: merging / summarization / translation
      - CLIR systems need well-developed resources
        - Language Processing Tools
        - Language Resources
      - Resources are expensive to acquire, maintain, update

  21. CLIR for Multimedia
      - Retrieval from a mixed media collection is a non-trivial problem
      - Different media are processed in different ways and suffer from different kinds of indexing errors:
        - spoken documents indexed using speech recognition
        - handwritten documents indexed using OCR
        - images indexed using significant features
      - Need for complex integration of multiple technologies
      - Need for merging of results from different sources
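
Merging is hard partly because each source scores documents on its own scale. One simple approach, shown here as an assumption rather than anything the slides prescribe, is to min-max normalise each result list before interleaving. The function names, document ids and scores below are made up for illustration.

```python
def min_max_normalise(results):
    """Rescale the scores of one non-empty result list to the range [0, 1]."""
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every document as equally good.
        return [(doc_id, 1.0) for doc_id, _ in results]
    return [(doc_id, (score - lo) / (hi - lo)) for doc_id, score in results]


def merge(*result_lists):
    """Normalise each list separately, then sort everything by score."""
    merged = []
    for results in result_lists:
        if results:
            merged.extend(min_max_normalise(results))
    # Highest normalised score first; ties broken by document id.
    return sorted(merged, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical runs: a speech-recognition index and an OCR index,
# each with its own scoring scale.
speech_run = [("audio7", 12.0), ("audio2", 9.0), ("audio5", 6.0)]
ocr_run = [("scan3", 0.8), ("scan1", 0.4)]

merged = merge(speech_run, ocr_run)
print(merged)
```

Min-max normalisation ignores how reliable each source is; a system aware that one index suffers more recognition errors might weight its list lower.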

  22. Supporting the User
      (Clough, October 2011)

  23. MLIR/CLIR System Evaluation is Complex
      - Need to evaluate single components
      - Need to evaluate overall system performance
      - Need to distinguish CL aspects from IR issues
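
One widely used way to separate cross-language effects from plain IR effects is to run the same retrieval engine on a monolingual query and on its cross-language counterpart, then report the cross-language score as a fraction of the monolingual baseline. The slide does not name this method, so the sketch below is an assumption; the runs and relevance judgements are invented.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant)


# Hypothetical relevance judgements and two runs from the same engine:
# one with a query in the document language, one with a translated query.
relevant = {"d1", "d4"}
mono_run = ["d1", "d4", "d2", "d3"]
cross_run = ["d1", "d2", "d3", "d4"]

ap_mono = average_precision(mono_run, relevant)
ap_cross = average_precision(cross_run, relevant)
relative = ap_cross / ap_mono  # share of monolingual performance retained

print(ap_mono, ap_cross, relative)
```

Because both runs share the indexing and ranking components, the gap between them can be attributed mainly to the translation step. Averaging over many topics would give mean average precision.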
