
TREC 2003 Tracks: A Tale of Two Evaluations

Donna Harman

Sponsored by: NIST, ARDA, DARPA

TREC and RIA

[Timeline: TREC ran annually from 1992 through 2003; the RIA workshop took place in 2003]

TREC 2003 Tracks

• Retrieval in a domain: Genome
• Answers, not docs: Novelty, Q&A
• Web searching: Web, VLC
• Beyond text: Video, Speech, OCR
• Beyond just English: X→{X,Y,Z}, Chinese, Spanish
• Human-in-the-loop: Interactive, HARD
• Streamed text: Filtering, Routing
• Static text: Ad Hoc, Robust

Genomics Track

• New track for 2003
  – first year of a 5-year plan
• Motivation: explore retrieval in a domain
• Two tasks
  – primary: ad hoc task of finding MEDLINE records that focus on the basic biology of 50 specific gene names; GeneRIF data used as surrogate answers
  – secondary: extract GeneRIF data from 139 articles

QA 2003 Main Task

• Three question types
  – 413 factoids: same as the passages task, except the answer must be exact, not a document extract
  – 37 lists: assemble a set of instances, where each instance is a factoid question answer
  – 50 definitions: return text strings that together define the target of the question
• Final score: weighted average of the components

FinalScore = ½ FactoidScore + ¼ ListScore + ¼ DefScore
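
The weighting is easy to sanity-check in code; a minimal sketch (the component scores below are hypothetical, for illustration only):

    def final_score(factoid_score, list_score, def_score):
        # TREC 2003 QA main task weighting from the formula above
        return 0.5 * factoid_score + 0.25 * list_score + 0.25 * def_score

    # hypothetical component scores for one run
    print(final_score(0.70, 0.39, 0.44))   # -> 0.5575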

QA Definition Component

• 50 questions asking for a definition of a term or biographical data about a person
  – Who is Vlad the Impaler? What is pH in chemistry?
• questions drawn from the same logs as the factoids
• assessor created a definition by searching the docs
• System response is an unordered set of strings
  – each string represents a different facet of the definition
  – no limit on the length of strings or the number of strings
• Assessor matched his facets to the system strings
  – could be 0, 1, or multiple matches per string
• F score with recall weighted 5 times “precision”
  – “precision” is a function of length
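
Putting the slide's scoring rules together, a minimal sketch of the per-question definition F score, assuming the TREC 2003 conventions (facet recall; a length allowance of 100 non-whitespace characters per matched facet; β = 5, matching "recall weighted 5 times precision"):

    def definition_f(num_facets, num_matched, response_length, beta=5.0):
        # num_facets: facets in the assessor's definition
        # num_matched: facets matched to the system's strings
        # response_length: non-whitespace characters across all strings
        if num_facets == 0:
            return 0.0
        recall = num_matched / num_facets
        allowance = 100 * num_matched          # assumed allowance rule
        if response_length <= allowance:
            precision = 1.0
        else:
            precision = 1.0 - (response_length - allowance) / response_length
        if precision == 0.0 and recall == 0.0:
            return 0.0
        b2 = beta * beta
        return (b2 + 1) * precision * recall / (b2 * precision + recall)

    # e.g., 4 of 6 facets covered using 550 non-whitespace characters
    print(round(definition_f(6, 4, 550), 3))   # -> 0.669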

QA Main Task Results

[Figure: final combined scores, broken into factoid, list, and definition components, for the best main-task run per group for the top 10 groups: LCCmainS03, nusmm103r2, lexiclone92, isi03a, BBN2003C, MITCSAIL03a, irstqa2003w, IBM2003c, Albany03I2, FDUT12QA3; scores roughly 0.1-0.6]


HARD track

• Goal: improve ad hoc retrieval by customizing the search to the user, using:
  1) Metadata from topic statements
     1) the purpose of the search
     2) the genre or granularity of the desired response
     3) the user's familiarity with the subject matter
     4) biographical data about the user (age, sex, etc.)
  2) Clarifying forms
     1) assessor (surrogate user) spends at most 3 minutes/topic responding to a topic-specific form
     2) example uses: sense resolution, relevance judgments

Robust Retrieval Track

• New track in 2003
• Motivations:
  – focus on poorly performing topics, since average effectiveness usually masks huge variance
  – bring the traditional ad hoc task back to TREC
• Task
  – 100 topics
    – 50 old topics from TRECs 6-8
    – 50 new topics created by 2003 assessors
  – TREC 6-8 document collection: disks 4 & 5 (no CR)
  – standard trec_eval evaluation plus new measures (a sketch of such measures follows)
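
The slide does not spell out the new measures; as a hedged illustration of the robustness-oriented statistics the track motivates (the official TREC definitions differ in detail), one might compute:

    import numpy as np

    def robustness_measures(ap_per_topic, rel_in_top10, worst_frac=0.25):
        # ap_per_topic: average precision for each topic
        # rel_in_top10: count of relevant docs in the top 10, per topic
        ap = np.sort(np.asarray(ap_per_topic, dtype=float))
        k = max(1, int(len(ap) * worst_frac))
        return {
            "MAP": float(ap.mean()),
            # mean AP over the worst-performing topics
            "MAP_worst": float(ap[:k].mean()),
            # fraction of topics with no relevant document in the top 10
            "frac_no_rel_top10": float(np.mean(np.asarray(rel_in_top10) == 0)),
        }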

2003 Robust Retrieval Track

[Figure: recall-precision curves for the full 100-topic set, the 50 old topics, and the 50 new topics]

Retrieval Methods

• CUNY and Waterloo expanded using the web (and possibly other collections)
  – effective, even for poor performers
• QE based on the target collection generally improved mean scores, but did not help poor performers
• Approaches for poor performers
  – predict when to expand
  – fuse results from multiple runs (see the sketch below)
  – reorder the top-ranked documents based on clustering of the retrieved set
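
Of these approaches, fusing results from multiple runs is the most mechanical; a minimal CombSUM-style sketch (not any particular group's implementation):

    from collections import defaultdict

    def combsum(runs):
        # runs: list of {doc_id: score} dicts, one per run;
        # sum min-max-normalized scores and rank by the fused score
        fused = defaultdict(float)
        for run in runs:
            lo, hi = min(run.values()), max(run.values())
            span = (hi - lo) or 1.0
            for doc, score in run.items():
                fused[doc] += (score - lo) / span
        return sorted(fused, key=fused.get, reverse=True)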

[Figure: best average precision per topic, 0.1-1.0 scale, with runs k8alx and CL99XT highlighted]

The RIA Workshop

• In the summer of 2003, NIST organized a 6-week workshop called Reliable Information Access (RIA)
• RIA was part of the Northeast Regional Research Center summer workshop series, sponsored by the Advanced Research and Development Activity of the US Department of Defense


Workshop Goals

• To learn how to customize IR systems for optimal performance on any given query
• Initial strong focus on relevance feedback and pseudo-relevance (blind) feedback
• If time, expand to other tools
• Apply the results to question answering in multiple ways

Participants (28)

Donna Harman and Chris Buckley (coordinators)

• City University, London: Andy MacFarlane
• Clairvoyance: David Evans, David Hull, Jesse Montgomery
• Carnegie Mellon U.: Jamie Callan, Paul Ogilvie, Yi Zhang, Luo Si, Kevyn Collins-Thompson
• MITRE: Warren Greiff
• NIST: Ian Soboroff and Ellen Voorhees
• U. of Massachusetts at Amherst: Andres Corrada-Emmanuel
• U. of New York at Albany: Tomek Strzalkowski, Paul Kantor, Sharon Small, Ting Liu, Sean Ryan
• U. Waterloo: Charlie Clarke, Gordon Cormack, Tom Lyman, Egidio Terra
• Other students: Zhenmei Gu, Luo Ming, Robert Warren, Jeff Terrace

Overall approach

• Massive failure analysis done manually for a single run by each system
• Statistical analysis using many “identical” feedback runs from all systems
• Use the results of the above to group queries needing similar treatment

Failure analysis

1) Chose 44 out of 150 topics that were "failures":
   a) Mean Average Precision <= average
   b) have the most variance across systems
2) Used results from 6 systems' standard runs
3) 6 people per topic (one per system) spent 45-60 minutes looking at those results
4) Short 6-person group discussion to come to a consensus about the topic
5) Individual + overall report (from templates)

Grouping of queries by failure

• Need outside expansion of a “general” term: 8 topics
  – e.g., 438: What countries are experiencing an increase in tourism?
• General IR technical failure: 8 topics
• Missing a difficult aspect (semantics in the query): 7 topics
  – e.g., 401: What language and cultural differences impede the integration of foreign minorities in Germany?
  – e.g., 362: Identify incidents of human smuggling
• All systems emphasize one aspect; miss another: 21 topics

Preliminary conclusions from failure analysis

• Systems agreed on the causes of failure much more than had been expected
• Systems retrieve different documents, but don't retrieve different classes of documents
• The majority of failures could be fixed with better feedback, better term weighting, and query analysis that gives guidance on the relative importance of the terms


(Blind) Relevance Feedback

What are new methods of producing steel?

Top-ranked list (* = relevant document):
* FBIS4-53871 title1 …
  FT923-9006 title2 …
* FBIS4-27797
* FT944-1455
  FBIS3-24678
  FT923-9281
* FT923-10837
  FT922-11827
  FT941-11316
  …
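
The mechanism being illustrated: treat the top-ranked documents as if they were relevant and add their most distinctive terms to the query. A minimal sketch using raw term frequency for term selection (the workshop systems used more sophisticated weighting formulas):

    from collections import Counter

    def blind_feedback(query_terms, ranked_docs, num_docs=10, num_terms=20):
        # ranked_docs: token lists for retrieved docs, best-ranked first
        counts = Counter()
        for doc in ranked_docs[:num_docs]:
            counts.update(doc)
        new_terms = [t for t, _ in counts.most_common()
                     if t not in query_terms][:num_terms]
        return list(query_terms) + new_terms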

List of experiments run

• bf_base: base runs for all systems, both using blind feedback (bf) and no feedback
• bf_numdocs: vary the # of docs used for bf from 0-100
• bf_numdocs_relonly: same, but use only relevant docs
• bf_numterms: vary the # of terms added from 0-100
• bf_pass_numterms: same, but use passages as the source instead of documents
• bf_swap_doc: use documents from other systems
• bf_swap_doc_term: expand using docs and terms
• bf_swap_doc_cluster: use CLARIT clusters
• bf_swap_doc_fuse: use fusion of other systems
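
To make the experiment grid concrete, a hypothetical sweep harness for the bf_numdocs / bf_numterms conditions; run_system and evaluate are stand-ins for a group's engine and the trec_eval scoring step, not workshop code:

    def sweep(topics, run_system, evaluate,
              doc_grid=(0, 5, 10, 25, 50, 100),
              term_grid=(0, 5, 10, 25, 50, 100)):
        # one run per (num_docs, num_terms) setting, scored by evaluate
        results = {}
        for nd in doc_grid:
            for nt in term_grid:
                run = run_system(topics, num_docs=nd, num_terms=nt)
                results[(nd, nt)] = evaluate(run)
        return results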

[Figures: results for the bf_numdocs (relevant docs only) and bf_pass_numterms experiments]

Preliminary Lessons Learned

1) Failure analysis
   a) systems tend to fail for the same reason
   b) getting the right concepts into the system query is critical
2) Surprises that require more analysis
   a) bf_swap_doc: some systems are better at providing docs
   b) some systems are more robust during expansion
   c) bf_numdocs relevant only: some relevant docs are bad feedback docs
   d) no topic in which there were “golden” terms in the top 1-4 feedback terms

Additional experiments

• topic_analysis: producing & comparing groups of topics using assorted measures
• qa_standard: effect of IR algorithms on QA using docs/passages
• topic_coverage: HITIQA experiment using all systems


Impact

• 1620 final runs made on the TREC 678 collection
• This information will be publicly distributed to open the way for important further analysis within the IR community
• Analysis within the workshop shows several promising measures for predicting blind relevance feedback failure
• Additionally, much has been learned (and will be published) about the interaction of search engines, topics, and data collections, leading to more research in this critical area

Workshop lessons learned

• Learning to “categorize” questions of a varied nature like TREC topics is much harder than anyone expected
• Doing massive and careful failure analysis across multiple systems is a big win
• Performing parallel experiments using multiple systems may be the only way of learning some general principles

Future

• TREC will continue (trec.nist.gov)
  – This year's tracks likely to continue
    • QA: requests for required info + other info
  – One new track
    • investigate ad hoc evaluation methodologies for terabyte-scale collections
• SIGIR 2004 workshop on RIA results
  – Many more details on what was done
  – Lots of time for discussion
  – Breakout sessions on where to go next