

SLIDE 1

FIRE Forum for Information Retrieval Evaluation (for Indian Languages)

Mandar Mitra Prasenjit Majumder

Indian Statistical Institute Kolkata

FIRE Forum for Information Retrieval Evaluation (for Indian Languages) – p. 1/14

SLIDE 2

Overview

  • The CLIA project
  • Evaluation
      • corpora
      • topics
      • relevance judgments
  • Timeline

SLIDE 3

The CLIA project

  • Sponsored by the Dept. of Information Tech., Govt. of India
  • Sanctioned in August 2006, work started in early 2007
  • Consortium mode project
      • Anna University - College of Engg., Guindy
      • Anna University - KBC centre
      • CDAC - Noida
      • CDAC - Pune
      • IIIT Hyderabad
      • IIT Bombay (coordinating instt.)
      • IIT Kharagpur (co-coordinating instt.)
      • Indian Statistical Institute
      • Jadavpur University
      • Utkal University

SLIDE 4

The CLIA project

Assigned task: Create a portal where

  1. a user will be able to give a query in one Indian language;
  2. s/he will be able to access documents available in the language of the query, Hindi (if the query language is not Hindi), and English,
  3. all presented to the user in the language of the query.

Languages: Bangla, Hindi, Marathi, Punjabi, Tamil, Telugu
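The language-selection rule in step 2 can be sketched in a few lines. This is only an illustration of the rule as stated on the slide, not the portal's actual code; the function name `document_languages` and the constant `CLIA_LANGUAGES` are hypothetical.

```python
# Hypothetical names; the six languages are the ones listed on the slide.
CLIA_LANGUAGES = ["Bangla", "Hindi", "Marathi", "Punjabi", "Tamil", "Telugu"]


def document_languages(query_language):
    """Return the languages whose documents are searched for a query.

    Rule from the slide: the query language itself, Hindi (if the query
    language is not Hindi), and English.
    """
    if query_language not in CLIA_LANGUAGES:
        raise ValueError(f"unsupported query language: {query_language}")
    targets = [query_language]
    if query_language != "Hindi":
        targets.append("Hindi")
    targets.append("English")
    return targets


bangla_targets = document_languages("Bangla")
hindi_targets = document_languages("Hindi")
```

A Bangla query is matched against three collections, while a Hindi query is matched against two; in both cases the results are then shown in the query language (step 3).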

SLIDE 5

Language issues

  • Inflectionality: Hindi (low) → Bangla (medium) → Tamil / Telugu (high)
  • Spelling variations:
      • case markers may / may not be attached to the word
      • long vowels / short vowels, three sibilants, two N’s
  • Words in a compound may be written together or separately, e.g. state government vs. StateGovernment
  • Names are often abstract nouns (qualities) / adjectives, e.g. Mamata, Atal

SLIDE 6

Evaluation

Tasks

  • Ad-hoc monolingual retrieval for each of the 6 languages
  • Ad-hoc cross-lingual retrieval (6 × 6)
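To make the task counts concrete, the sketch below enumerates the language pairs. It assumes that "6 × 6" means every ordered (query language, document language) pair over the six CLIA languages; the slide does not say whether same-language pairs are counted among the cross-lingual tasks, so both readings are shown. All variable names are hypothetical.

```python
from itertools import product

# The six CLIA languages named earlier in the deck.
LANGUAGES = ["Bangla", "Hindi", "Marathi", "Punjabi", "Tamil", "Telugu"]

# Monolingual tasks: query and documents in the same language.
monolingual_tasks = [(lang, lang) for lang in LANGUAGES]

# Every ordered (query language, document language) pair: the 6 x 6 grid.
all_pairs = list(product(LANGUAGES, repeat=2))

# Strictly cross-lingual pairs: the grid without its diagonal.
cross_lingual_pairs = [(q, d) for q, d in all_pairs if q != d]

print(len(monolingual_tasks), len(all_pairs), len(cross_lingual_pairs))
```

Under this reading the grid has 36 cells, of which 6 are the monolingual tasks and 30 are strictly cross-lingual.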

SLIDE 7

Evaluation

Corpora

  • News corpora from 2004-2007 for each language (in UTF-8)
  • Plus all available documents on health and tourism
  • Bangla corpus:
      • ABP, Sep 2004 - July 2007 (186,513 docs., 3.9 GB)
      • CRI, Sep 2004 - July 2007 (23,862 files, 124 MB)
  • Need to work out distribution issues

SLIDE 8

Evaluation

Topics

  • Single set of 80 topics (30 training + 50 testing)
  • Formulated in English by language representatives
  • Translated into all languages
  • Deal with national / international issues

SLIDE 9

Evaluation

Topics

Example:

<title> Political turmoil in South Asian countries
<desc> Struggle for democratic governance in various South Asian countries.
<narr> The document should contain information regarding the power struggle between the monarchy / military government and popular political leaders in Nepal, Thailand, Pakistan and Bangladesh.

SLIDE 10

Evaluation

Topics

Example:

<title> Nobel theft
<desc> Rabindranath Tagore’s Nobel Prize medal was stolen from Santiniketan. The document should contain information about this theft.
<narr> A relevant document should contain information regarding the missing Nobel Prize Medal that was stolen along with some other artefacts and paintings on 25th March, 2004. Documents containing reports related to investigations by government agencies like CBI / CID are also relevant, as are articles that describe public reaction and expressions of outrage by various political parties.
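Topics laid out this way are easy to split into fields. The sketch below is a minimal parser that assumes the layout shown on the slides: `<title>`, `<desc>` and `<narr>` opening tags with no closing tags. The format of the distributed topic files may differ, and the function name `parse_topic` is hypothetical.

```python
import re

# Assumed field markers, taken from the slide examples (opening tags only).
FIELD_TAG = re.compile(r"<(title|desc|narr)>")


def parse_topic(text):
    """Split one topic into a dict keyed by field name.

    re.split with a capturing group returns
    [prefix, tag1, body1, tag2, body2, ...], so tags sit at the odd
    indices and their bodies at the following even indices.
    """
    parts = FIELD_TAG.split(text)
    fields = {}
    for tag, body in zip(parts[1::2], parts[2::2]):
        # Collapse line breaks and repeated spaces inside a field.
        fields[tag] = " ".join(body.split())
    return fields


example = """<title> Nobel theft
<desc> Rabindranath Tagore's Nobel Prize medal was stolen from
Santiniketan. The document should contain information about this theft.
<narr> A relevant document should contain information regarding the
missing Nobel Prize Medal."""

topic = parse_topic(example)
print(topic["title"])
```

The three fields correspond to increasingly detailed statements of the information need, so a retrieval run can build its query from the title alone or from any combination of fields.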

SLIDE 11

Topics

  • Initial queries formulated by browsing the corpus
  • To be refined based on initial retrieval results
  • Aim: balance of easy, medium and hard queries

SLIDE 12

Relevance judgments

Pooling

  • Term-weighting + inner product similarity (Smart)
  • Language modeling (Lemur)
  • Boolean queries
  • Divergence from randomness (Terrier)
  • Logistic regression (Cheshire)
  • Interactive retrieval
  • Cover detection (?)
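The slide lists the systems that contribute to the pool but not how the pool is built. The sketch below assumes conventional depth-k pooling, in which the top k documents from each run are merged per topic and only the merged set is sent to assessors. The pool depth, run names and document identifiers are all made up for illustration.

```python
def build_pool(runs, depth):
    """Merge the top `depth` documents of every run, per topic.

    `runs` maps a run name to {topic_id: ranked list of doc ids}.
    Returns {topic_id: set of doc ids to be judged}.
    """
    pool = {}
    for rankings_by_topic in runs.values():
        for topic_id, ranked_docs in rankings_by_topic.items():
            pool.setdefault(topic_id, set()).update(ranked_docs[:depth])
    return pool


# Toy rankings; run names echo the systems on the slide, ids are invented.
toy_runs = {
    "smart": {"T1": ["d1", "d2", "d3"]},
    "lemur": {"T1": ["d2", "d4", "d5"]},
    "terrier": {"T1": ["d6", "d1", "d2"]},
}

toy_pool = build_pool(toy_runs, depth=2)
print(sorted(toy_pool["T1"]))
```

Combining runs from systems built on different retrieval models is what makes the pool useful: documents that one model ranks low may still reach the assessors through another.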

SLIDE 13

Timeline

  29.10.2007   Topic set frozen
  31.01.2008   Training data pools complete
  16.05.2008   Relevance judgments complete (training data)
  01.07.2008   Test topics released
  15.09.2008   Runs due (earlier date possible?)
  01.12.2008   Results out
  15.12.2008   NTCIR
  19.12.2008   FIRE?

SLIDE 14

Acknowledgments

  • Noriko Kando - EVIA 2007
  • Djoerd Hiemstra, Doug Oard, Mark Sanderson - SIGIR 2007
  • Carol Peters - CLEF 2007
  • Donna Harman, Ellen Voorhees - TREC 2007
  • Google Research Award - support for travel to CLEF 2007
