NTCIR 2014 Slides - TUW-IMP at the NTCIR-11 Math-2 Presentation - PDF document

Source: https://www.researchgate.net/publication/272683082 (Presentation, February 2015). Uploaded by Aldo Lipani on 24 February 2015.

TUW-IMP at the NTCIR-11 Math-2
Aldo Lipani, Linda Andersson, Florina Piroi, Mihai Lupu and Allan Hanbury
Institute of Software Technology and Interactive Systems (ISIS), Vienna University of Technology, Austria
{surname}@ifs.tuwien.ac.at
11/12/2014 - Tokyo

System data

Main contributions:
1. Novel way to tokenize formula structures;
2. Application of a domain-specific query expansion.

Developed using the following programming languages:
1. Scala 2.10
2. Java 1.7

Dependencies:
1. Apache Lucene 4.6
2. Apache Commons Compress 1.9

Distributed under the Apache License 2.0 on GitHub.

Information Retrieval System

[Figure: architecture diagram. Components: Topics, Query Preprocessing, Collection, Collection Preprocessing, Document Preprocessing, Indexer, Scorer, Reranker (two instances), Merger, Run.]

The Developed Math Retrieval System

[Figure: system diagram. Recoverable labels:]
- Inputs: NTCIR-Math2 Topics and NTCIR-Math2 Collection.
- Query Extension: for the text IRSC.
- Normalization step, if enabled.
- Formula IRSCs: 3 versions, one for each formula tokenizer.
- Text IRSC: English minimal stemmer.
- Retrieval: Lucene 4.6 with BM25.
- Merging step: 3 cores for formulae and 1 for text.
- Output: TUW-IMP Runs.

Information Retrieval System Core for Formulae: Formula Tokenizer

Example using the formula of the topic with id NTCIR-Math2-5.

LaTeX representation: [formula image not preserved]

Tokens extracted with the three formula tokenizers: Literal, L1 and L2.

Information Retrieval System Core for Formulae: Formula Tokenizer

[Figure: expression tree of the topic formula. Node labels include apply, eq, csymbol, ci, cn, qvar, times, divide, minus, root, subscript and superscript; leaf values include S, x, E, H, G, g, 1 and 3.]

Information Retrieval System Core for Text: Hyponymy and Hypernymy Extraction Analysis

Linguistic patterns (LSP), with example sentences:

1. Example: "… work such author as Herrick, Goldsmith, and Shakespeare"
   LSP: such NP as {NP, }* {(or|and)} NP
2. Example: "Even then, we would trail behind other European Community member, such as Germany, France and Italy"
3. Example: "Bruises, wounds, broken bones or other injuries"
   LSP: NP{, NP}*{,} or other NP
4. Example: "Temples, treasuries, and other important civic buildings"
   LSP: NP{, NP}*{,} and other NP
5. Example: "All common-law countries, including Canada and England"
   LSP: NP{,} including {NP,}* {or|and} NP
6. Example: "… most European countries, especially France, England, and Spain"
   LSP: NP{,} especially {NP,}* {or|and} NP

Improvement using NLP_mod:

                   DomainRules             NoRules
                   hyper ok    hypo ok     hyper ok    hypo ok
A kind of          85%         78%         64%         64%
MathIR Relations   86%         77%         68%         82%
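The "or other" and "and other" patterns listed above can be illustrated with a regular expression. This is a simplified sketch, not the system's implementation: it works on raw strings rather than on parsed noun phrases, it approximates each NP as a comma-free run of words, and the name `extract_other_pattern` is hypothetical.

```python
import re

# Pattern: NP{, NP}*{,} (and|or) other NP
# Assumption: noun phrases are approximated as comma-free word runs,
# since no syntactic parser is used in this sketch.
_OTHER = re.compile(
    r"^(?P<hyponyms>.+?),?\s+(?:and|or)\s+other\s+(?P<hypernym>.+?)[.]?$",
    re.IGNORECASE,
)


def extract_other_pattern(sentence):
    """Return (hypernym, [hyponyms]), or None if the pattern does not match."""
    match = _OTHER.match(sentence.strip())
    if match is None:
        return None
    # The hyponym list is the comma-separated part before "and/or other".
    hyponyms = [part.strip() for part in match.group("hyponyms").split(",")]
    hyponyms = [h for h in hyponyms if h]
    return match.group("hypernym").strip(), hyponyms


if __name__ == "__main__":
    print(extract_other_pattern("Bruises, wounds, broken bones or other injuries"))
    print(extract_other_pattern("Temples, treasuries, and other important civic buildings"))
```

A production extractor would match over part-of-speech tagged or chunked text so that NP boundaries are reliable; the regular expression above only shows how a pattern maps to a hypernym and its hyponyms.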

Reranking and Merging steps

The following normalization step is applied to each system core:

[Figure: normalization curves for k = 10, 3, 1, 0.5 and 0.1; both axes range from -0.4 to 2.]

The merging step linearly sums the outputs of the system cores: all result sets (formulae and text) are added together. A second reranking step then groups all formulae occurring in the same document; in the final, submitted run, each document is assigned a score equal to the sum of its formula scores.
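The merging and second reranking steps can be sketched as follows. The data layout (one list of `(doc_id, score)` pairs per system core) and the name `merge_cores` are assumptions made for illustration. The normalization function is left out because the slides show only its plot, not its definition.

```python
from collections import defaultdict


def merge_cores(core_results):
    """Linearly sum scores across system cores, grouped by document.

    core_results: iterable of result lists, one per core. Each result is a
    (doc_id, score) pair. Several formulae from the same document appear as
    several pairs sharing a doc_id, so their scores add up.
    """
    totals = defaultdict(float)
    for results in core_results:
        for doc_id, score in results:
            totals[doc_id] += score
    # Rank documents by descending merged score; ties are broken by doc_id.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    formula_core = [("d1", 1.0), ("d1", 0.5), ("d2", 1.25)]  # hypothetical scores
    text_core = [("d2", 0.5), ("d3", 0.25)]
    for doc_id, score in merge_cores([formula_core, text_core]):
        print(doc_id, score)
```

Because the sum is unweighted, a core that produces larger raw scores dominates the merged ranking, which is why a per-core normalization step matters before merging.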

Runs' Settings

Constraint sets for the submitted runs:
- SMWE (Strict Multi Words Entities): the retrieved paragraphs must contain at least one query term extension.
- SL (Strict Literals): the retrieved formulae must contain all the literal tokens that are in the query.
- SNL (Strict Non Literals): the retrieved formulae must contain all the non-literal tokens that are in the query.
- N (Normalized): the score is normalized as explained in the previous slide.

Runs    FLA    FLASM    FLASL    FLAN
SMWE    NO     NO       YES      NO
SL      YES    YES      YES      YES
SNL     NO     YES      NO       NO
N       NO     NO       NO       YES

Additional labels on the slide: Wiki, Official.
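The SL and SNL constraints amount to containment checks on token sets. A minimal sketch, assuming formulae have already been tokenized into lists of strings; the names `satisfies_strict` and `filter_strict` and the sample tokens are hypothetical.

```python
def satisfies_strict(query_tokens, formula_tokens):
    """True if every required query token occurs in the candidate formula."""
    return set(query_tokens) <= set(formula_tokens)


def filter_strict(query_tokens, candidates):
    """Keep the ids of candidates (id, tokens) that contain all query tokens."""
    return [
        cid for cid, tokens in candidates
        if satisfies_strict(query_tokens, tokens)
    ]


if __name__ == "__main__":
    query_literals = ["E", "G"]  # hypothetical literal tokens of a query formula
    candidates = [
        ("f1", ["E", "G", "x"]),
        ("f2", ["E", "x"]),
    ]
    print(filter_strict(query_literals, candidates))
```

The same check serves both constraints: SL would pass only the literal tokens of the query, SNL only the non-literal ones.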

Results

Results taking into account only the most relevant documents (relevancy ≥ 3):

Run      MAP       P@10      P@5
FLASL    0.0439    0.0840    0.1280
FLAN     0.0388    0.0740    0.1200
FLA      0.0366    0.0800    0.1160
FLASM    0.0351    0.0780    0.1200

[Figure: MAP per topic for the runs FLA, FLAN, FLASL and FLASM; x-axis: Topic (1 to 46), y-axis: MAP (0.000 to 0.600).]

Results

Results taking into account all documents, at every degree of relevancy (relevancy ≥ 1):

Run      MAP       P@10      P@5
FLASL    0.0670    0.2380    0.3760
FLAN     0.0611    0.2160    0.3320
FLA      0.0577    0.2200    0.3360
FLASM    0.0580    0.2160    0.3480

[Figure: MAP per topic for the runs FLA, FLAN, FLASL and FLASM; x-axis: Topic (1 to 46), y-axis: MAP (0.000 to 0.250).]
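For reference, the measures reported above can be computed as in the following sketch. It uses the standard definitions of precision at k and average precision, with a relevance threshold so that one ranking can be scored under both settings (relevancy ≥ 3 and relevancy ≥ 1). The function names and sample judgments are illustrative assumptions, not the official evaluation tooling.

```python
def precision_at_k(ranking, judgments, k, min_rel=1):
    """Fraction of the top-k ranked ids judged with relevance >= min_rel."""
    hits = sum(1 for doc in ranking[:k] if judgments.get(doc, 0) >= min_rel)
    return hits / k


def average_precision(ranking, judgments, min_rel=1):
    """Sum of the precision values at each rank holding a relevant id,
    divided by the total number of relevant ids in the judgments."""
    relevant = {doc for doc, rel in judgments.items() if rel >= min_rel}
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs, min_rel=1):
    """runs: list of (ranking, judgments) pairs, one per topic."""
    return sum(average_precision(r, j, min_rel) for r, j in runs) / len(runs)


if __name__ == "__main__":
    ranking = ["a", "b", "c", "d"]                # hypothetical system output
    judgments = {"a": 4, "b": 0, "c": 2, "e": 3}  # hypothetical graded judgments
    print(precision_at_k(ranking, judgments, 2, min_rel=3))
    print(average_precision(ranking, judgments, min_rel=1))
```

Raising `min_rel` shrinks the set of documents counted as relevant, which is consistent with the lower precision values in the relevancy ≥ 3 table compared with the relevancy ≥ 1 table.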

Results

At the Wikipedia subtask:

1. At page granularity

P@1     P@2     P@3     P@4     P@5     P@10    P@20    P@30    P@100   P@1000
0.740   0.400   0.266   0.207   0.172   0.090   0.047   0.031   0.010   0.001

2. At formula granularity

P@1     P@2     P@3     P@4     P@5     P@10    P@20    P@30    P@100   P@1000
0.770   0.420   0.290   0.223   0.180   0.091   0.046   0.030   0.009   0.001

Future Work / Conclusion

The IR system is in its infancy:
1. The method of formula token extraction should be extended.
2. The impact on the retrieval results of extending the query terms with hyponymy and hypernymy needs closer examination.

After examining the relevance judgments, it is clear that, in the assessors' information need, the topic query words carry a weight at least as important as the topic formulae. In our IR model, keywords were implicitly given a lower weight than the tokens extracted from formulae. Furthermore, we believe that some form of unification between the formulae retrieved from the collection and the topic formulae is necessary.

Thank you for your attention!
