CICM 2016, Doctoral program: Augmenting Mathematical Formulae for More Effective Querying & Presentation

Moritz Schubotz, 31/07/2016, www.formulasearchengine.com


  1. CICM 2016, Doctoral program: Augmenting Mathematical Formulae for More Effective Querying & Presentation. Moritz Schubotz, 31/07/2016, www.formulasearchengine.com

  2. Motivation: 26th of February 2011; 18.03.-26.03.2011; 28th of March

  3. Example 1: $\frac{1}{\langle x \rangle} \le \left\langle \frac{1}{x} \right\rangle$, queried as $\frac 1 \mean ?x \le \mean \frac 1 ?x$ (NTCIR-11 Math-2, topic WMC-D1)
     1. Different forms, e.g. $\langle x \rangle \left\langle \frac{1}{x} \right\rangle \ge 1$
     2. Different notations, e.g. $\int x f_X(x)\,dx = \langle x \rangle$
     3. Exact matches are seldom
     4. Ambiguity in syntax, e.g. $E\Psi = \hat{H}\Psi$
     5. No TeX function \mean
     Content MathML of the query: <apply><leq/><apply><divide/><cn type="integer">1</cn><apply><mean/><qvar>x</qvar></apply></apply><apply><mean/><apply><divide/><cn type="integer">1</cn><qvar>x</qvar></apply></apply>…

  5. Result 1: $\varphi(\mathbb{E}[X]) \le \mathbb{E}[\varphi(X)]$
     Original query: $\frac 1 \mean ?x \le \mean \frac 1 ?x$
     Generalized query: $?f[type=function] \mean ?x \le \mean ?f[type=function] ?x$
     [Figure: expression trees of both queries, each rooted at \le; the \frac 1 nodes of the original query become ?f[type=function] nodes in the generalized query.]

  6. Result 1: $\varphi(\mathbb{E}[X]) \le \mathbb{E}[\varphi(X)]$
     Original query: $\frac 1 \mean ?x \le \mean \frac 1 ?x$
     Generalized query: $?f[type=function] \mean ?x \le \mean ?f[type=function] ?x$
     The step $\frac 1$ -> $?f[type=function]$ is not trivial.
     Content MathML of the original query: <apply><leq/><apply><divide/><cn type="integer">1</cn><apply><mean/><qvar>x</qvar></apply></apply><apply><mean/><apply><divide/><cn type="integer">1</cn><qvar>x</qvar></apply></apply>…
     Content MathML of the generalized query: <apply><leq/><apply><qvar type="function">f</qvar><apply><mean/><qvar>x</qvar></apply></apply><apply><mean/><apply><qvar type="function">f</qvar><qvar>x</qvar></apply></apply>…
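
The role of the query variables on this slide can be illustrated with a small tree matcher. This is only a sketch under stated assumptions, not the search engine's implementation: expressions are encoded as nested tuples `(operator, child, ...)`, leaves starting with `?` stand in for `<qvar>`, and the names `match` and `Expr` are hypothetical.

```python
from typing import Dict, Optional, Tuple, Union

# Assumed encoding: a leaf is a str, an application is (operator, child, ...).
Expr = Union[str, Tuple]


def match(query: Expr, formula: Expr,
          bindings: Optional[Dict[str, Expr]] = None) -> Optional[Dict[str, Expr]]:
    """Return the variable bindings if `formula` matches `query`, else None.

    Leaves starting with '?' are query variables; a variable that occurs
    several times must bind to the same subexpression each time.
    """
    if bindings is None:
        bindings = {}
    if isinstance(query, str) and query.startswith("?"):
        if query in bindings:
            return bindings if bindings[query] == formula else None
        return {**bindings, query: formula}
    if isinstance(query, str) or isinstance(formula, str):
        return bindings if query == formula else None
    if len(query) != len(formula):
        return None
    for q_child, f_child in zip(query, formula):
        bindings = match(q_child, f_child, bindings)
        if bindings is None:
            return None
    return bindings


original = ("leq", ("divide", "1", ("mean", "?x")), ("mean", ("divide", "1", "?x")))
generalized = ("leq", ("?f", ("mean", "?x")), ("mean", ("?f", "?x")))
jensen = ("leq", ("phi", ("mean", "X")), ("mean", ("phi", "X")))
reciprocal = ("leq", ("divide", "1", ("mean", "X")), ("mean", ("divide", "1", "X")))

print(match(generalized, jensen))      # binds ?f to phi and ?x to X
print(match(original, jensen))         # None: divide does not match phi
print(match(generalized, reciprocal))  # None: unary ?f vs. binary divide
```

In this encoding the generalized query finds Jensen's inequality but no longer finds the reciprocal form it was derived from, because `divide` takes two arguments while `?f` takes one. That is one way to read the "not trivial" remark: replacing `\frac 1` by a function variable needs more than swapping a node label.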

  7. Solution 1: inexact matches
     • Refined query: $\superconceptOf[ orderby = editdistance ]{ \frac 1 \mean ?x \le \mean \frac 1 ?x }$
     • Computational complexity
     • Restriction of the search space
     • Check the most likely solutions first
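
The `orderby = editdistance` option can be pictured as ranking candidate formulae by their distance to the query. The sketch below assumes a token-level Levenshtein distance over whitespace-separated TeX tokens; the slide does not say which edit distance is meant (a tree edit distance is equally plausible), and the function names are hypothetical.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def rank_by_edit_distance(query: str, candidates: List[str]) -> List[str]:
    """Order candidates so that the closest matches come first."""
    query_tokens = query.split()
    return sorted(candidates, key=lambda c: edit_distance(query_tokens, c.split()))


query = r"\frac 1 \mean x \le \mean \frac 1 x"
candidates = [
    r"a + b",
    r"\phi \mean x \le \mean \phi x",
    r"\frac 1 \mean x \le \mean \frac 1 x",
]
for formula in rank_by_edit_distance(query, candidates):
    print(formula)
```

Computing the distance to every formula in a collection is what makes the computational complexity bullet relevant: restricting the search space and checking likely candidates first both reduce the number of distance computations.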

  8. But there are diverse information needs
     1. Definition look-up
     2. Explanation look-up
     3. Proof look-up
     4. Application look-up
     5. Computation assistance
     6. General formula search

  9. ... and the data looks like this

  10. Levels of Abstraction: Semantic, Content, Presentation

  11. Overview (diagram)
     • Image processing: image, presentation, structure detection, content, entity linkage, semantic
     • Integrated queries: metadata; math (presentation, content, semantic); text (keywords, relations)

  12. Completed Research, grouped on the slide under Querying, Processing and Scalability
     - Making Math Searchable in Wikipedia (CICM 2012)
     - Mathematical Language Processing (CICM 2014)
     - Applying Stratosphere for Big Data Analytics (coauthor, BTW 2013)
     - Querying large Collections of Mathematical Publications (NTCIR 2013, with Marcus Leich)
     - Digital Repository of Mathematical Formulae (CICM 2014, coauthor)
     - Evaluation of Similarity-Measure Factors for Formulae (NTCIR 2015)
     - Growing the DRMF with generic LaTeX sources
     - Wikipedia Subtask at NTCIR 11 (SIGIR 2015)
     - Mathoid: Accessible Math Rendering for Wikipedia (CICM 2014)
     - Exploring the single-brain barrier (NTCIR 2016)
     - Semantification of Identifiers in Mathematics for Better Math Information Retrieval (SIGIR 2016)

  13. Mathoid: Robust, Scalable, Fast and Accessible Math Rendering for Wikipedia
     Let $(\Omega, A, \mu)$ be a probability space, $X$ an integrable real-valued random variable and $\varphi$ a convex function. Then: $\varphi(\mathbb{E}[X]) \le \mathbb{E}[\varphi(X)]$.
     • convex function (Q319913, NDL ID 00573442)
     • subclass of function
     • ja: 凸関数

  14. Exploring the single-brain barrier
     • “one-brain barrier” [1]
       – Metaphor: the knowledge relevant to conducting math research needs to be co-located in one brain
     • Goals of our contribution to NTCIR-12:
       – Create a point of reference with respect to this barrier for a trained mathematician
       – Compare the performance of a human to MIR systems and analyse characteristic strengths and weaknesses
       – Derive insights to improve MIR systems
       – Combine the relevant results of the human and the MIR systems to create a gold standard

  15. Exploring the single-brain barrier

  16. Exploring the single-brain barrier

  17. Exploring the single-brain barrier
     • Strengths of MIR systems:
       – Definition lookup queries
       – Application lookup
     • Weaknesses of MIR systems:
       – Low precision
       – No unified query language to specify the query type
     • The gold standard dataset can help to develop a math-aware search engine for Wikipedia

  18. Semantification of Identifiers in Mathematics for Better Math Information Retrieval
     • First step towards enabling computers to understand mathematicians' notation
     • Focus on identifiers
     • Extract identifier semantics by combining math and …
     • Use computers to find relevant mathematics
     • Computers must understand semantics in math to provide the needed information

  19. Math Augmentation Approach

  20. (1) Extract formulae

  21. (2) Extract identifiers

  22. (3) Find identifiers

  23. (4) Find definiens candidates

  24. (5) Score all identifier-definiens pairs

  25. (6) Generate feature vectors

  26. (7) Cluster feature vectors

  27. (8) Map clusters to subject hierarchy
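
Steps (2) to (5) above can be sketched in a few lines. Everything specific here is an assumption made for illustration: identifiers are taken to be the `<mi>` elements of Presentation MathML, definiens candidates are passed in by hand (a real system would need something like part-of-speech tagging to find them), and the score `1 / (1 + token distance)` is a placeholder rather than the scoring function used in the presented work. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Set, Tuple

MATHML_NS = "{http://www.w3.org/1998/Math/MathML}"


def extract_identifiers(mathml: str) -> Set[str]:
    """Step (2): collect the text of every <mi> element in a MathML string."""
    root = ET.fromstring(mathml)
    return {el.text.strip() for el in root.iter(MATHML_NS + "mi") if el.text}


def score_pairs(tokens: List[str], identifiers: Set[str],
                candidates: Set[str]) -> List[Tuple[str, str, float]]:
    """Steps (3)-(5): find identifiers in the text and score each
    identifier-definiens pair by token distance (placeholder score)."""
    scored = []
    for i, tok in enumerate(tokens):
        if tok not in identifiers:
            continue
        for j, other in enumerate(tokens):
            if other in candidates:
                scored.append((tok, other, 1.0 / (1 + abs(i - j))))
    return sorted(scored, key=lambda pair: pair[2], reverse=True)


def best_definiens(scored: List[Tuple[str, str, float]]) -> Dict[str, str]:
    """Keep the highest-scoring definiens for each identifier."""
    best: Dict[str, str] = {}
    for identifier, definiens, _ in scored:  # sorted, best first
        best.setdefault(identifier, definiens)
    return best


mathml = ('<math xmlns="http://www.w3.org/1998/Math/MathML">'
          '<mi>E</mi><mo>=</mo><mi>m</mi><msup><mi>c</mi><mn>2</mn></msup></math>')
tokens = "the energy E equals the mass m times c squared".split()

identifiers = extract_identifiers(mathml)
pairs = score_pairs(tokens, identifiers, candidates={"energy", "mass"})
print(best_definiens(pairs))
```

In this toy sentence the distance score pairs `E` with "energy" and `m` with "mass", but it also assigns "mass" to `c`, which has no definiens in the text. A pure distance score cannot express "not in document", which is one reason a real scoring step needs more evidence than proximity.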

  28. Wikipedia Subtask at NTCIR 11
     • NTCIR 11 Wikipedia dataset*
     • 30k Wikipedia articles
     • 280k formulae
     • 100 queries
     *) Moritz Schubotz, Abdou Youssef, Volker Markl, and Howard S. Cohl. 2015. Challenges of Mathematical Information Retrieval in the NTCIR-11 Math Wikipedia Task. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15). ACM, New York, NY, USA, 951-954.

  29. Wikipedia Subtask at NTCIR 11
     • CICM 2012 (2 participants)
     • NTCIR 2013 (pilot) (6 participants)
     • NTCIR 2014: arXiv (8 participants), Wikipedia (7 participants)

  30. Wikipedia Task results (plot; annotations: higher "precision", higher recall)

  31. Gold standard available from http://mlp.formulasearchengine.com

  32. Gold standard details: 310 identifiers
     Chart values: 8, 97, 174, 27, 4
     Chart legend: Wikidata item, two Wikidata items, Wikidata item + NP, individual NP, multiple NP

  33. Results: Identifiers
     • 294/310 correctly extracted (94.8%)
     • 57 false positives (fp)
     • Problems
       – Incorrect markup (8 fn, 33 fp), e.g. $\eta = \frac{Q_1 - Q_2}{Q_1}$
       – Symbols (9 fp), e.g. $\frac{d}{dx}$
       – Sub-superscript (3 fp, 2 fn), e.g. $\sigma_y^2$
       – Special notation (10 fp, 2 fn), e.g. $\boldsymbol{u} \times \boldsymbol{v} = \epsilon_{ijk} u_j v_k \boldsymbol{e}_i$
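
The 94.8% on this slide is a recall figure (294 of the 310 gold-standard identifiers). The slide does not state a precision; the short calculation below derives one under the assumption that the 57 false positives come from the same extraction run as the 294 correct extractions. The function name is hypothetical.

```python
from typing import Tuple


def precision_recall(true_positives: int, false_positives: int,
                     gold_total: int) -> Tuple[float, float]:
    """Return (precision, recall) for an extraction task."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / gold_total
    return precision, recall


precision, recall = precision_recall(true_positives=294,
                                     false_positives=57,
                                     gold_total=310)
print(f"recall = {recall:.1%}, precision = {precision:.1%}")
# prints: recall = 94.8%, precision = 83.8%
```

Under that assumption, roughly one in six extracted identifiers is spurious (57 of 351).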

  34. Results: (4) Find definiens candidates

  35. Results: Definitions
     Chart values: 83, 88, 19, 120
     Chart legend: exact match, partial match, not found, not in document

  36. Distribution of identifier counts

  37. Results: Definitions with namespace support
     Chart values: 60, 103, 147
     Chart legend: exact match, partial match, not found
