Thomas Mandl: Robust CLEF 2007 - Overview

Thomas Mandl

Information Science, Universität Hildesheim
mandl@uni-hildesheim.de

Robust Task - Result Overview and Lessons Learned from Robustness Evaluation

Robustness?

Robust … means … capable of functioning correctly, (or at the very minimum, not failing catastrophically) under a great many conditions. (http://www.reference.com/)

Robust IR means the capability of an IR system to work well (and reach at least a minimal performance) under a variety of conditions (topics, difficulty, collections, users, languages …)

Variety of conditions …

Variance between topics

[Figure: variance between topics for Mono FR, Mono EN, Mono PT and Bi -> FR. Axis ticks: 0.1 to 1.]

System Variance

[Figure: system variance for Mono FR, Mono EN, Mono PT and Bi -> FR. Axis ticks: 0.1 to 1.]

History of Robust IR Evaluation

TREC

– Mono-lingual Retrieval
– 2003 - 2005

CLEF

– Mono-, bi- and Multilingual Retrieval
– 2006 six languages
– 2007 three languages

Robust Task 2007

Again …

– Use topics and relevance assessments from previous CLEF campaigns
– Take a different perspective and use a robust evaluation measure (GMAP)
– Emphasize the difficult (= low performing) topics

Training and Test

CLEF 2001, 2002 and 2003 for training
CLEF 2004, 2005 and 2006 for testing

Which system is better?

Topic   System  Result    Topic   System  Result
1       A       0.1       1       B       0.2
2       A       0.1       2       B       0.2
3       A       0.9       3       B       0.6
GeoAve  A       0.21      GeoAve  B       0.29
MAP     A       0.37      MAP     B       0.33

[Figure: Result A and Result B for topics I, II and III. Axis ticks: 0.1 to 1. Axis label: Topics.]

$\mathrm{geoAve} = \sqrt[n]{\prod_{i=1}^{n} x_i}$
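The comparison on this slide can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative and not taken from any evaluation toolkit, the per-topic scores are the three values from the table, and the `floor` parameter is an assumption added only to keep the logarithm defined if a topic scores zero.

```python
import math

def arithmetic_mean(scores):
    """Arithmetic mean of per-topic scores (the MAP row of the table)."""
    return sum(scores) / len(scores)

def geometric_mean(scores, floor=1e-5):
    """Geometric mean of per-topic scores (the GeoAve row of the table).

    `floor` is an assumed lower bound so that log() is defined for zero scores.
    """
    logs = [math.log(max(s, floor)) for s in scores]
    return math.exp(sum(logs) / len(logs))

# Per-topic results of systems A and B from the slide.
system_a = [0.1, 0.1, 0.9]
system_b = [0.2, 0.2, 0.6]

for name, scores in (("A", system_a), ("B", system_b)):
    print(name, round(arithmetic_mean(scores), 2), round(geometric_mean(scores), 2))
```

System A wins on the arithmetic mean because of its single strong topic, while system B wins on the geometric mean because it does better on the two weak topics.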

Collections

Language    Target Collection                    Training Topics  Test Topics
English     Los Angeles Times 1994               41-200           251-350
French      Le Monde 1994, Swiss News Agency 94  41-140           251-350
Portuguese  Público 1995                                          201-350

Robust Task 2007

Participation

63 runs submitted by 7 groups
2006: 133 runs by 8 groups

Results

Mono English

Rank  Participant  Experiment                                                      MAP     GMAP
1st   reina        10.2415/AH-ROBUST-MONO-EN-TEST-CLEF2007.REINA.REINAENTDNT       38.97%  18.50%
2nd   daedalus     10.2415/AH-ROBUST-MONO-EN-TEST-CLEF2007.DAEDALUS.ENFSEN22S      37.78%  17.72%
3rd   hildesheim   10.2415/AH-ROBUST-MONO-EN-TEST-CLEF2007.HILDESHEIM.HIMOENBRFNE  5.88%   0.32%

Mono Portuguese

Rank  Participant  Experiment                                                      MAP     GMAP
1st   reina        10.2415/AH-ROBUST-MONO-PT-TEST-CLEF2007.REINA.REINAPTTDNT       41.40%  12.87%
2nd   jaen         10.2415/AH-ROBUST-MONO-PT-TEST-CLEF2007.JAEN.UJARTPT1           24.74%  0.58%
3rd   daedalus     10.2415/AH-ROBUST-MONO-PT-TEST-CLEF2007.DAEDALUS.PTFSPT2S       23.75%  0.50%
4th   xldb         10.2415/AH-ROBUST-MONO-PT-TEST-CLEF2007.XLDB.XLDBROB16          1.21%   0.071%

Results Mono English

[Figure: Ad-Hoc Robust Monolingual English Test Task, Top 5 Participants - Standard Recall Levels vs Mean Interpolated Precision. Axes: Recall and Precision, 0% to 100%.]

reina [Experiment REINAENTDNT; MAP 38.97%; Not Pooled]
daedalus [Experiment ENFSEN22S; MAP 37.78%; Not Pooled]
hildesheim [Experiment HIMOENBRFNE; MAP 5.88%; Not Pooled]

Results Mono Portuguese

[Figure: Ad-Hoc Robust Monolingual Portuguese Test Task, Top 5 Participants - Standard Recall Levels vs Mean Interpolated Precision. Axes: Recall and Precision, 0% to 100%.]

reina [Experiment REINAPTTDNT; MAP 41.40%; Not Pooled]
jaen [Experiment UJARTPT1; MAP 24.74%; Not Pooled]
daedalus [Experiment PTFSPT2S; MAP 23.75%; Not Pooled]
xldb [Experiment XLDBROB16_10; MAP 1.21%; Not Pooled]

Results

Mono French

Rank  Participant  Experiment                                                      MAP     GMAP
1st   unine        10.2415/AH-ROBUST-MONO-FR-TEST-CLEF2007.UNINE.UNINEFR1          42.13%  14.24%
2nd   reina        10.2415/AH-ROBUST-MONO-FR-TEST-CLEF2007.REINA.REINAFRTDET       38.04%  12.17%
3rd   jaen         10.2415/AH-ROBUST-MONO-FR-TEST-CLEF2007.JAEN.UJARTFR1           34.76%  10.69%
4th   daedalus     10.2415/AH-ROBUST-MONO-FR-TEST-CLEF2007.DAEDALUS.FRFSFR22S      29.91%  7.43%
5th   hildesheim   10.2415/AH-ROBUST-MONO-FR-TEST-CLEF2007.HILDESHEIM.HIMOFRBRF2   27.31%  5.47%

Bi -> French

Rank  Participant  Experiment                                                                MAP     GMAP
1st   reina        10.2415/AH-ROBUST-BILI-X2FR-TEST-CLEF2007.REINA.REINAE2FTDNT              35.83%  12.28%
2nd   unine        10.2415/AH-ROBUST-BILI-X2FR-TEST-CLEF2007.UNINE.UNINEBILFR1               33.50%  5.01%
3rd   colesun      10.2415/AH-ROBUST-BILI-X2FR-TEST-CLEF2007.COLESUN.EN2FRTST4GRINTLOGLU001  22.87%  3.57%

Results Mono French

[Figure: Ad-Hoc Robust Monolingual French Test Task, Top 5 Participants - Standard Recall Levels vs Mean Interpolated Precision. Axes: Recall and Precision, 0% to 100%.]

unine [Experiment UNINEFR1; MAP 42.13%; Not Pooled]
reina [Experiment REINAFRTDET; MAP 38.04%; Not Pooled]
jaen [Experiment UJARTFR1; MAP 34.76%; Not Pooled]
daedalus [Experiment FRFSFR22S; MAP 29.91%; Not Pooled]
hildesheim [Experiment HIMOFRBRF2; MAP 27.31%; Not Pooled]

Results Bi-lingual X -> French

[Figure: Ad-Hoc Robust Bilingual Test Task, French target collection(s), Top 5 Participants - Standard Recall Levels vs Mean Interpolated Precision. Axes: Recall and Precision, 0% to 100%.]

reina [Experiment REINAE2FTDNT; MAP 35.83%; Not Pooled]
unine [Experiment UNINEBILFR1; MAP 33.50%; Not Pooled]
colesun [Experiment EN2FRTST4GRINTLOGLU001; MAP 22.87%; Not Pooled]

Approaches

Adoption of traditional and “advanced” CLIR methods

– BM 25 (Miracle)
– N-gram translation (CoLesIR)
– Weighting, stemming (Uni NE)

Adoption of “robust” heuristics

– Expansion with an external resource (SINAI)

Percentage of Bad Topics

             Mono PT  Mono EN  Mono FR  Bi -> FR
Best System  26       17       18       23
Average      32       27       20       25

Percentage of Topics which received an MAP below 0.1
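The statistic in this table is a simple threshold count per run. The sketch below shows one way to compute it; the function name and the example scores are hypothetical and are not taken from the task results.

```python
def percent_bad_topics(score_by_topic, threshold=0.1):
    """Percentage of topics whose score falls below `threshold`."""
    bad = sum(1 for score in score_by_topic.values() if score < threshold)
    return 100.0 * bad / len(score_by_topic)

# Hypothetical per-topic scores for one run (illustration only).
example_run = {"T1": 0.05, "T2": 0.42, "T3": 0.09, "T4": 0.31}
print(percent_bad_topics(example_run))  # 50.0
```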

Topics

Large improvements are still possible
Difficult topics can be solved better

Task      Topic  Average  Best System  System Nr. 1
Mono PT   222    0.0108   0.0478       0.0183
Mono EN   266    0.0217   0.1120       0.0357
Mono FR   192    0.0157   0.0247       0.0160
Bi -> FR  282    0.0342   0.1588       0.1588

Correlation between Measures?

IR measures often correlate highly
For a larger topic set – as used in the robust task – the correlation might be even higher

– More topics make a test more reliable

If correlation is high, it makes no sense to use alternative measures
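One way to quantify how closely two measures agree is a rank correlation between the system orderings they produce. The slides do not say which coefficient was used, so the sketch below assumes Kendall's tau (tau-a, no tie correction). It uses the MAP and GMAP values of the five Mono French runs from the results table. The function name is illustrative.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a between two equally long score lists (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / pairs

# MAP and GMAP (in %) of the five Mono French runs, in table order.
map_scores = [42.13, 38.04, 34.76, 29.91, 27.31]
gmap_scores = [14.24, 12.17, 10.69, 7.43, 5.47]
print(kendall_tau(map_scores, gmap_scores))  # 1.0
```

Both measures rank these five runs identically, so the coefficient is 1.0. A value this high for every task would support the point that an alternative measure adds little.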

Analysis with Reduced Topic Sets

[Figure: Robust task 2007 Mono-lingual English. Series: GMAP to MAP, MAP to full MAP, Min of GMAP to MAP, Min of MAP to full MAP. Axis ticks: 0.00 to 1.00 and 20 to 100.]

Analysis with Reduced Topic Sets

[Figure: Robust task 2007 Bi-lingual -> FR. Series: GMAP to MAP, MAP to full MAP, Min of GMAP to MAP, Min of MAP to full MAP. Axis ticks: 0.00 to 1.00 and 20 to 100.]

Analysis with Reduced Topic Sets

[Figure: Robust task 2007 Mono-lingual Portuguese. Series: GMAP to MAP, MAP to full MAP, Min of GMAP to MAP, Min of MAP to full MAP. Axis ticks: 0.00 to 1.00 and 20 to 100.]

Analysis with Reduced Topic Sets

[Figure: Robust task 2007 Mono-lingual French. Series: GMAP to MAP, MAP to full MAP, Min of GMAP to MAP, Min of MAP to full MAP. Axis ticks: 0.00 to 1.00 and 20 to 100.]

Analysis with Reduced Topic Sets

[Figure: Robust task 2006 Multi-lingual. Series: GMAP to MAP, MAP to full MAP, Min of GMAP to MAP, Min of MAP to full MAP. Axis ticks: 0.00 to 1.00 and 20 to 100.]
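The "MAP to full MAP" and "Min of MAP to full MAP" series suggest an experiment along these lines: draw smaller topic sets, rank the runs on each, and compare that ranking with the one from all topics. The slides do not describe the exact procedure, so this is only a sketch under stated assumptions: random sampling without replacement, Kendall's tau-a as the correlation, and hypothetical run names and scores.

```python
import random
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a between two equally long score lists (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (len(xs) * (len(xs) - 1) / 2)

def mean(values):
    return sum(values) / len(values)

def reduced_set_correlation(scores, subset_size, trials=100, seed=0):
    """Mean and minimum tau between MAP on random topic subsets and MAP on all topics.

    `scores` maps a run name to its per-topic scores (same topic order for every run).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    runs = sorted(scores)
    n_topics = len(scores[runs[0]])
    full = [mean(scores[run]) for run in runs]
    taus = []
    for _ in range(trials):
        topics = rng.sample(range(n_topics), subset_size)
        reduced = [mean([scores[run][t] for t in topics]) for run in runs]
        taus.append(kendall_tau(reduced, full))
    return mean(taus), min(taus)

# Hypothetical per-topic scores for three runs over four topics.
example_scores = {
    "run_a": [0.1, 0.1, 0.9, 0.5],
    "run_b": [0.2, 0.2, 0.6, 0.4],
    "run_c": [0.05, 0.3, 0.2, 0.1],
}
print(reduced_set_correlation(example_scores, subset_size=2))
```

Replacing `mean` with a geometric mean for the reduced set would give the analogous "GMAP to MAP" comparison.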

Changes in Rankings

[Figure: Robust task 2006 Multi-lingual. Rank positions 1 to 10 under MAP and GMAP. Run labels: ujamlrsv2, ujamllr, ujamlblr, ujamlblr, ml5XRSFSen4S, ml4XRSFSen4S, mlRSFSen2S, CoLesIRmultTst, reinaES2mtdtest, reinaES2mttest.]