SLIDE 1
Neuchâtel at NTCIR-4: From CLEF to NTCIR
Jacques Savoy, University of Neuchâtel, Switzerland, www.unine.ch/info/clef/
SLIDE 2 From CLEF to NTCIR
European languages, Asian languages: different languages, but the same IR problems?
- limited character set
- spaces between words
- different writing systems
But the same indexing? The same search and translation scheme?
SLIDE 3 Indexing methods
English (E): words
- stopword list
- stemming
- SMART system

Chinese, Japanese, Korean (CJK): bigrams
- stoplist
- no stemming
- in Korean, 80% of nouns are composed of two characters (Lee et al., IP&M, 1999)
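The bigram indexing used for CJK can be sketched as follows (a minimal illustration; the real pipeline also applies the stoplist). The Korean example string is hypothetical, not taken from the slides:

```python
def bigrams(text):
    """Index a CJK string as overlapping character bigrams (no stemming)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

# Hypothetical Korean example: since some 80% of Korean nouns are two
# characters long, overlapping bigrams recover many nouns directly.
print(bigrams("정보검색"))  # → ['정보', '보검', '검색']
```

Overlapping (rather than disjoint) bigrams cost more index space but avoid committing to any particular word segmentation.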
SLIDE 4
Example in Chinese
SLIDE 5 IR models
Probabilistic models:
- Okapi
- Prosit (deviation from randomness)

Vector-space models:
- Lnu-ltc
- tf-idf (ntc-ntc)
- binary (bnn-bnn)
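As a reminder of what the probabilistic family computes, here is a sketch of the Okapi (BM25) term weight; the parameter values k1 and b are common defaults, not values taken from these experiments:

```python
import math

def okapi_weight(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 term weight: an idf factor times a saturated,
    document-length-normalized term frequency.
    k1 and b are assumed defaults, not the talk's settings."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    tf_part = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * tf_part
```

A document's score for a query is the sum of these weights over the query terms; the vector-space schemes (Lnu-ltc, ntc-ntc, bnn-bnn) differ mainly in how tf and idf are combined and normalized.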
SLIDE 6
Monolingual evaluation
Model     English (T)   English (D)   Korean (T)   Korean (D)
Okapi     0.3132        0.2992        0.4033       0.3475
Prosit    0.2997        0.2871        0.3882       0.3010
Lnu-ltc   0.3069        0.3139        0.4193       0.4001
tf-idf    0.1975        0.2171        0.3245       0.3406
binary    0.1562        0.1262        0.1944       0.0725

(mean average precision; T = title queries, D = description queries)
SLIDE 7
Monolingual evaluation
Model     English (T)     English (D)     Korean (T)      Korean (D)
Okapi     0.3132          0.2992          0.4033          0.3475
  +PRF    0.3594 (+15%)   0.3181 (+6%)    0.4960 (+23%)   0.4441 (+28%)
Prosit    0.2997          0.2871          0.3882          0.3010
  +PRF    0.3731 (+25%)   0.3513 (+22%)   0.4875 (+26%)   0.4257 (+41%)
SLIDE 8
Data Fusion
Three result lists for the same Korean query, produced by SE1, SE2 and SE3, are combined by data fusion into a single list.
SLIDE 9
Data fusion
Input lists (one per search engine):

1 KR120 1.2      1 KR043 0.8      1 KR050 1.6
2 KR200 1.0      2 KR120 0.75     2 KR005 1.3
3 KR050 0.7      3 KR055 0.65     3 KR120 0.9
4 KR705 0.6      4 …              4 …
…

Fused output list:
1 KR…
2 KR…
3 KR…
4 …
SLIDE 10
Data fusion
- Round-robin (baseline)
- Sum RSV (Fox et al., TREC-2)
- Normalize (divide by the max)
- Z-score
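The score-based schemes can be sketched as follows; a minimal illustration of max-normalization followed by Sum RSV (CombSUM), with hypothetical runs as input:

```python
def normalize_max(run):
    """Divide each score by the run's maximum (the 'normalize by max' scheme)."""
    top = max(run.values())
    return {doc: score / top for doc, score in run.items()}

def sum_rsv(runs):
    """Sum RSV (CombSUM): add up each document's scores across the runs."""
    fused = {}
    for run in runs:                      # each run maps doc id -> score
        for doc, score in run.items():
            fused[doc] = fused.get(doc, 0.0) + score
    return sorted(fused.items(), key=lambda item: -item[1])

# Two hypothetical runs for the same Korean query
se1 = {"KR120": 1.2, "KR200": 1.0, "KR050": 0.7}
se2 = {"KR050": 1.6, "KR005": 1.3, "KR120": 0.9}
print(sum_rsv([normalize_max(se1), normalize_max(se2)]))
```

Normalizing before summing matters because the raw retrieval status values of different engines are not on a comparable scale.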
SLIDE 11
Z-score normalization
Original scores:
1 KR120 1.2
2 KR200 1.0
3 KR050 0.7
4 KR765 0.6
…

Compute the mean µ and the standard deviation σ of the scores, then:
new score = ((old score - µ) / σ) + δ

After normalization:
1 KR120 7.0
2 KR200 5.0
3 KR050 2.0
4 KR765 1.0
…
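A minimal sketch of this normalization, assuming Python and an arbitrary shift δ (the slides do not state the value used):

```python
import statistics

def z_score_normalize(scores, delta=3.0):
    """new = (old - mean) / stdev + delta.
    delta (δ) is an assumed constant that shifts the scores upward;
    3.0 is illustrative, not the experiments' setting."""
    mu = statistics.mean(scores)
    sigma = statistics.stdev(scores)
    return [(s - mu) / sigma + delta for s in scores]

print(z_score_normalize([1.2, 1.0, 0.7, 0.6]))
```

Applied to only the four scores shown above, this will not reproduce the slide's 7.0/5.0/2.0/1.0, which were presumably computed with µ, σ and δ taken over the full result list.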
SLIDE 12
Monolingual (data fusion)
                TDNC (2 SE)   T (4 SE)
best single     0.5141        0.4868
Round-robin     0.5047        0.4737
Sum RSV         0.5030        0.5044
Norm max        0.5045        0.5084
Z-score         0.5023        0.5074
Z-score wt      0.5078        0.5058

(Korean monolingual; SE = search engine)
SLIDE 13
Monolingual evaluation (C)
Model     Unigram (T)   Unigram (D)   Bigram (T)   Bigram (D)
Okapi     0.1667        0.1198        0.1755       0.1576
Prosit    0.1452        0.0850        0.1658       0.1467
Lnu-ltc   0.1834        0.1484        0.1794       0.1609
tf-idf    0.1186        0.1136        0.1542       0.1507
binary    0.0431        0.0112        0.0796       0.0686

(Chinese: unigram vs. bigram indexing)
SLIDE 14
Monolingual evaluation (C)
Model     Unigram (T)     Unigram (D)     Bigram (T)      Bigram (D)
Okapi     0.1667          0.1198          0.1755          0.1576
  +PRF    0.1884 (+13%)   0.1407 (+17%)   0.2004 (+14%)   0.1805 (+15%)
Prosit    0.1452          0.0850          0.1658          0.1467
  +PRF    0.1659 (+14%)   0.1132 (+33%)   0.2140 (+29%)   0.1987 (+35%)
SLIDE 15
Monolingual evaluation (J)
Model     Kanji+kata (T)   Kanji+kata (D)   Kanji (T)   Kanji (D)
Okapi     0.2873           0.2821           0.2972      0.2762
Prosit    0.2637           0.2573           0.2734      0.2517
Lnu-ltc   0.2701           0.2740           0.2806      0.2718
tf-idf    0.2104           0.2087           0.2166      0.2101
binary    0.1743           0.1741           0.1703      0.1105

(Japanese: bigrams over kanji & katakana vs. kanji only)
SLIDE 16
Monolingual evaluation (J)
Model     Kanji+kata (T)   Kanji+kata (D)   Kanji (T)       Kanji (D)
Okapi     0.2873           0.2821           0.2972          0.2762
  +PRF    0.3259 (+13%)    0.3331 (+18%)    0.3514 (+18%)   0.3200 (+16%)
Prosit    0.2637           0.2573           0.2734          0.2517
  +PRF    0.3396 (+29%)    0.3394 (+32%)    0.3495 (+28%)   0.3218 (+28%)
SLIDE 17 Translation resources
Machine-readable dictionaries:
- Babylon
- Evdict

Machine translation services:
- WorldLingo
- BabelFish

Parallel and/or comparable corpora (not used in this evaluation campaign)
SLIDE 18
Bilingual evaluation E->C/J/K
Translation    Chinese bigram   Japanese bigram (k&k)   Korean bigram
Manual         0.1755           0.2873                  0.4033
Babylon 1      0.0458           0.0946                  0.1015
WorldLingo     0.0794           0.1951                  0.1847
BabelFish      0.0360           0.1952                  0.1855
Combined       0.0854           0.2174                  0.1848

(T queries, Okapi model; "Manual" corresponds to the monolingual run)
SLIDE 19
Bilingual evaluation E->C/J/K
Model      Chinese bigram   Japanese bigram (k&k)   Korean bigram
Manual     0.1755           0.2873                  0.4033
Prosit     0.0817           0.1973                  0.1721
  +PRF     0.1213           0.2556                  0.2326
Okapi      0.0854           0.2174                  0.1848
  +PRF     0.1039           0.2733                  0.2397

(T queries, combined translation)
SLIDE 20
Multilingual IR E->CJKE
- Document translation (DT): create a common index
- Query translation (QT): search each language separately and merge the result lists
- Mix QT and DT
- No translation
SLIDE 21
Merging problem
The four result lists (E, C, J, K) must be merged into a single ranked list.
SLIDE 22
Multilingual IR (merging)
- Round-robin (baseline)
- Raw-score merging
- Normalize (by the max)
- Z-score
- Logistic regression
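The round-robin baseline can be sketched as follows; the document ids are hypothetical:

```python
from itertools import zip_longest

def round_robin_merge(ranked_lists):
    """Interleave per-language result lists rank by rank,
    keeping only the first occurrence of each document."""
    merged, seen = [], set()
    for rank_tier in zip_longest(*ranked_lists):
        for doc in rank_tier:
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Hypothetical top-3 lists from the C, J and K searches
print(round_robin_merge([["C1", "C2", "C3"],
                         ["J1", "J2"],
                         ["K1", "K2", "K3"]]))
# → ['C1', 'J1', 'K1', 'C2', 'J2', 'K2', 'C3', 'K3']
```

Round-robin ignores the scores entirely, which is why score-normalizing schemes such as the Z-score can outperform it when the per-collection scores are made comparable.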
SLIDE 23
Test-collection NTCIR-4
                   E         C         J         K
Size               619 MB    490 MB    733 MB    370 MB
# documents        347,550   381,681   596,058   254,438
Mean length        96.6      363.4     114.5     236.2
# topics           58        59        55        57
Rel. per topic     35.5      19        88        43
SLIDE 24
Multilingual evaluation
Merging (CJE)   T (manual)   T (auto)
Round-robin     0.2204       0.1564
Raw-score       0.2035       0.1307
Norm max        0.2222       0.1654
Biased RR       0.2290       0.1413
Z-score wt      0.2370       0.1719
SLIDE 25
Multilingual evaluation
Merging (CJKE)   T (manual)   T (auto)
Round-robin      0.2371       0.1419
Raw-score        0.1564       0.1033
Norm max         0.2269       0.1411
Biased RR        0.2431       0.1320
Z-score          0.2483       0.1446
SLIDE 26
Conclusions (monolingual) From CLEF to NTCIR
The best IR model seems to be language-dependent (Okapi in CLEF).
Pseudo-relevance feedback improves the initial search.
Data fusion helps (yes here, with short queries; its benefit was limited in CLEF).
SLIDE 27 Conclusions (bilingual) From CLEF to NTCIR
Freely available translation resources produce poor IR performance (unlike at CLEF).
Improvement by:
- combining translations (not here, but yes in CLEF)
- pseudo-relevance feedback (as in CLEF)
- data fusion (not clear)
SLIDE 28
Conclusions (multilingual) From CLEF to NTCIR
Selection and merging are still hard problems (as in CLEF).
The Z-score scheme seems to produce good IR performance under different conditions (as in CLEF).