CICM 2016, Doctoral Program: Augmenting Mathematical Formulae for More Effective Querying & Presentation


slide-1
SLIDE 1

Moritz Schubotz

31/07/2016 www.formulasearchengine.com 1

CICM 2016, Doctoral program

Augmenting Mathematical Formulae for More Effective Querying & Presentation

slide-2
SLIDE 2

Motivation

26th of February 2011

28th of March

18.03.-26.03.2011

slide-3
SLIDE 3

Example 1: $\frac{1}{\langle x \rangle} \le \left\langle \frac{1}{x} \right\rangle$

Query (NTCIR-11 Math-2 WMC-D1): $\frac 1 \mean ?x \le \mean \frac 1 ?x$

<apply>
  <leq/>
  <apply>
    <divide/>
    <cn type="integer">1</cn>
    <apply> <mean/> <qvar>x</qvar> </apply>
  </apply>
  <apply>
    <mean/>
    <apply>
      <divide/>
      <cn type="integer">1</cn>
      <qvar>x</qvar>
    </apply>
  </apply>…

  • 1. Different forms, e.g. $\langle x \rangle \left\langle \frac{1}{x} \right\rangle \ge 1$
  • 2. Different notations, e.g. $\int_X f(x)\, x\, dx = \langle x \rangle$
  • 3. Exact matches are rare
  • 4. Ambiguity in syntax, e.g. $E\Psi = \hat{H}\Psi$
  • 5. No TeX function \mean

slide-5
SLIDE 5

Result 1: $\varphi(\mathbb{E}[X]) \le \mathbb{E}[\varphi(X)]$

Generalized query: $?f[type=function] \mean ?x \le \mean ?f[type=function] ?x $

Original query: $\frac 1 \mean ?x \le \mean \frac 1 ?x$

[Figure: expression trees of both queries, each rooted at \le with \mean subtrees]

slide-6
SLIDE 6

Result 1: $\varphi(\mathbb{E}[X]) \le \mathbb{E}[\varphi(X)]$

Original query: $\frac 1 \mean ?x \le \mean \frac 1 ?x$

<apply>
  <leq/>
  <apply>
    <divide/>
    <cn type="integer">1</cn>
    <apply> <mean/> <qvar>x</qvar> </apply>
  </apply>
  <apply>
    <mean/>
    <apply>
      <divide/>
      <cn type="integer">1</cn>
      <qvar>x</qvar>
    </apply>
  </apply>…

Generalized query: $?f[type=function] \mean ?x \le \mean ?f[type=function] ?x $

<apply>
  <leq/>
  <apply>
    <qvar type="function">f</qvar>
    <apply> <mean/> <qvar>x</qvar> </apply>
  </apply>
  <apply>
    <mean/>
    <apply>
      <qvar type="function">f</qvar>
      <qvar>x</qvar>
    </apply>
  </apply>…

$\frac 1$ -> $?f[type=function]$

Not trivial
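
One way to see why the step from `\frac 1` to `?f[type=function]` is not trivial is a small matcher over expression trees. The Python sketch below is not the search engine's implementation. It assumes a hypothetical tuple encoding in which `\frac 1` is treated as a unary function and applied through an `("app", f, arg)` node; `QVar` is an illustrative stand-in for `?x` and `?f[type=function]`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QVar:
    """Query variable, e.g. ?x or ?f[type=function] (illustrative class)."""
    name: str
    type: str = "any"


def match(pattern, term, bindings=None):
    """Return variable bindings if `term` is an instance of `pattern`, else None."""
    bindings = {} if bindings is None else bindings
    if isinstance(pattern, QVar):
        # Assumption: a function-typed variable must not bind to a number.
        if pattern.type == "function" and isinstance(term, (int, float)):
            return None
        if pattern.name in bindings:
            # A repeated variable must bind to the same subtree everywhere.
            return bindings if bindings[pattern.name] == term else None
        return {**bindings, pattern.name: term}
    if isinstance(pattern, tuple) and isinstance(term, tuple) and len(pattern) == len(term):
        for p, t in zip(pattern, term):
            bindings = match(p, t, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == term else None


F, X = QVar("f", "function"), QVar("x")
# ?f \mean ?x \le \mean ?f ?x
GENERAL = ("leq", ("app", F, ("mean", X)), ("mean", ("app", F, X)))
# \frac 1 \mean x \le \mean \frac 1 x, with \frac 1 encoded as a unary function
RECIPROCAL = ("leq", ("app", ("frac", 1), ("mean", "x")), ("mean", ("app", ("frac", 1), "x")))
# phi(E[X]) <= E[phi(X)]
JENSEN = ("leq", ("app", "phi", ("mean", "X")), ("mean", ("app", "phi", "X")))
```

With this encoding, `match(GENERAL, JENSEN)` binds `f` to `"phi"` and `x` to `"X"`, and `match(GENERAL, RECIPROCAL)` binds `f` to `("frac", 1)`. The hard part in practice is hidden in the encoding: the content MathML above stores `1/x` as a binary `<divide/>`, so a system first has to recognize that `\frac 1` can act as a function at all.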

slide-7
SLIDE 7

Solution 1: inexact matches

  • Refined query:

$\superconceptOf[orderby = editdistance]{\frac 1 \mean ?x \le \mean \frac 1 ?x }$

  • Computational complexity
  • Restriction of the search space
  • Check the most likely solutions first
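
The `orderby = editdistance` part of the refined query can be sketched as follows. This is a minimal Python illustration, assuming whitespace-separated TeX tokens and a plain token-level Levenshtein distance; it does not model `\superconceptOf` itself, and the function names are hypothetical.

```python
def tokenize(tex):
    """Split a TeX string into tokens; braces become separate tokens."""
    for brace in "{}":
        tex = tex.replace(brace, f" {brace} ")
    return tex.split()


def edit_distance(a, b):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or keep
        prev = cur
    return prev[-1]


def rank_by_edit_distance(query, candidates):
    """Order candidate formulae so that the closest ones are checked first."""
    q = tokenize(query)
    return sorted(candidates, key=lambda c: edit_distance(q, tokenize(c)))


QUERY = r"\frac 1 \mean ?x \le \mean \frac 1 ?x"
CANDIDATES = [
    r"a + b = c",
    r"\phi \mean X \le \mean \phi X",
    r"\frac 1 \mean y \le \mean \frac 1 y",
]
RANKED = rank_by_edit_distance(QUERY, CANDIDATES)
```

Computing the distance to every indexed formula is quadratic in the token count per pair, which is one reason the slide lists computational complexity and restriction of the search space next to this idea.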

slide-8
SLIDE 8

But there are diverse information needs

  • 1. Definition look-up
  • 2. Explanation look-up
  • 3. Proof look-up
  • 4. Application look-up
  • 5. Computation assistance
  • 6. General formula search

slide-9
SLIDE 9

... and the data looks like this

slide-10
SLIDE 10

Levels of Abstraction

Presentation, Content, Semantic

slide-11
SLIDE 11

Overview

[Figure: processing overview. Diagram labels:]

  • Image processing: Image
  • Structure detection: Presentation
  • Entity linkage: Content, Semantic
  • Integrated queries: Math (Presentation, Content, Semantic), Text (Keywords, Relations), Metadata

slide-12
SLIDE 12

Completed Research

Querying, Processing, Scalability

  • Making Math Searchable in Wikipedia (CICM 2012)
  • Evaluation of Similarity-Measure Factors for Formulae (NTCIR 2015)
  • Wikipedia Subtask at NTCIR 11 (SIGIR 2015)
  • Exploring the single-brain barrier (NTCIR 2016)
  • Mathematical Language Processing (CICM 2014)
  • Digital Repository of Mathematical Formulae (CICM 2014, coauthor)
  • Growing the DRMF with generic LaTeX sources
  • Mathoid: Accessible Math Rendering for Wikipedia (CICM 2014)
  • Semantification of Identifiers in Mathematics for Better Math Information Retrieval (SIGIR 2016)
  • Applying Stratosphere for Big Data Analytics (coauthor, BTW 2013)
  • Querying large Collections of Mathematical Publications (NTCIR 2013, with Marcus Leich)

slide-13
SLIDE 13

Mathoid: Robust, Scalable, Fast and Accessible Math Rendering for Wikipedia

Let $(\Omega, \mathfrak{F}, \mathrm{P})$ be a probability space, $X$ an integrable real-valued random variable and $\varphi$ a convex function. Then:

$\varphi(\mathbb{E}[X]) \le \mathbb{E}[\varphi(X)]$.

  • convex function (Q319913, NDL ID 00573442)
  • subclass of function
  • ja:凸関数
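
The inequality rendered on this slide can be checked numerically on a finite sample. The sketch below is only an illustration in Python: it uses the sample mean in place of the expectation, and the helper name `jensen_gap` is hypothetical. With the convex function 1/x on positive values it reproduces the pattern of Example 1, where the reciprocal of the mean is at most the mean of the reciprocals.

```python
from statistics import mean


def jensen_gap(phi, xs):
    """Return mean(phi(x)) - phi(mean(x)); non-negative for convex phi."""
    xs = list(xs)
    return mean([phi(x) for x in xs]) - phi(mean(xs))


def reciprocal(x):
    # Convex on the positive reals, which is the case used in Example 1.
    return 1.0 / x


SAMPLE = [1.0, 2.0, 4.0]
GAP = jensen_gap(reciprocal, SAMPLE)  # mean(1/x) - 1/mean(x)
```

For this sample the mean is 7/3, so the left side 1/mean(x) is 3/7 while mean(1/x) is 7/12, and the gap is positive. For a constant sample the gap vanishes.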

slide-14
SLIDE 14

Exploring the single-brain barrier

  • "one-brain barrier" [1]

    – Metaphor: the knowledge relevant to conducting math research needs to be co-located in one brain

  • Goals of our contribution to NTCIR-12:

    – Create a point of reference with respect to this barrier for a trained mathematician
    – Compare the performance of a human to MIR systems and analyse characteristic strengths and weaknesses
    – Derive insights to improve MIR systems
    – Combine the relevant results of the human and the MIR systems to create a gold standard

slide-15
SLIDE 15

Exploring the single-brain barrier

slide-16
SLIDE 16

Exploring the single-brain barrier

slide-17
SLIDE 17

Exploring the single-brain barrier

  • Strengths of MIR systems:

    – Definition look-up queries
    – Application look-up

  • Weaknesses of MIR systems:

    – Low precision
    – No unified query language to specify the query type

  • The gold standard dataset can help to develop a math-aware search engine for Wikipedia

slide-18
SLIDE 18

Semantification of Identifiers in Mathematics for Better Math Information Retrieval

  • First step toward enabling computers to understand mathematicians' notation
  • Focus on identifiers
  • Extract identifier semantics by combining math and
  • Use computers to find relevant mathematics
  • Computers must understand semantics in math to provide the needed information

slide-19
SLIDE 19

Math Augmentation Approach

slide-20
SLIDE 20

(1) Extract formulae

slide-21
SLIDE 21

(2) Extract identifiers

slide-22
SLIDE 22

(3) Find identifiers

slide-23
SLIDE 23

(4) Find definiens candidates

slide-24
SLIDE 24

(5) Score all identifier-definiens pairs

slide-25
SLIDE 25

(6) Generate feature vectors

slide-26
SLIDE 26

(7) Cluster feature vectors

slide-27
SLIDE 27

(8) Map clusters to subject hierarchy
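
Steps (1) to (5) of the approach can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the method used in the presented work: it assumes formulae are wrapped in `<math>` tags, treats single Latin letters and a tiny hypothetical set of Greek commands as identifiers, considers only words following an identifier as definiens candidates, and scores them with an assumed Gaussian decay over token distance. Steps (6) to (8), the clustering and the mapping to a subject hierarchy, are not covered.

```python
import re
from math import exp

MATH_TAG = re.compile(r"<math>(.*?)</math>", re.DOTALL)
TEX_TOKEN = re.compile(r"\\[A-Za-z]+|[A-Za-z]")
GREEK = {r"\eta", r"\phi", r"\sigma", r"\epsilon"}  # illustrative subset only
STOPWORDS = {"the", "is", "a", "an", "of", "and", "where", "holds"}


def extract_formulae(wikitext):
    """Step 1: collect the TeX source of every <math> element."""
    return [m.strip() for m in MATH_TAG.findall(wikitext)]


def extract_identifiers(tex):
    """Step 2: single letters and known Greek commands count as identifiers."""
    return {t for t in TEX_TOKEN.findall(tex) if not t.startswith("\\") or t in GREEK}


def tokenize(wikitext):
    """Step 3: mark stand-alone identifiers in the text as ID:<name> tokens."""
    def placeholder(m):
        body = m.group(1).strip()
        return f" ID:{body} " if extract_identifiers(body) == {body} else " FORMULA "
    return re.findall(r"ID:\S+|[A-Za-z]+", MATH_TAG.sub(placeholder, wikitext))


def score_pairs(tokens, window=5, sigma=3.0):
    """Steps 4 and 5: score the words that follow each identifier by distance."""
    scores = {}
    for i, tok in enumerate(tokens):
        if not tok.startswith("ID:"):
            continue
        ident = tok[3:]
        for d, word in enumerate(tokens[i + 1:i + 1 + window], start=1):
            if word.startswith("ID:") or word == "FORMULA" or word.lower() in STOPWORDS:
                continue
            key = (ident, word.lower())
            # Assumed scoring: closer candidates get a higher score.
            scores[key] = max(scores.get(key, 0.0), exp(-d * d / (2 * sigma * sigma)))
    return scores


def best_definiens(scores):
    """Keep the highest-scoring definiens candidate per identifier."""
    best = {}
    for (ident, word), s in scores.items():
        if ident not in best or s > best[ident][1]:
            best[ident] = (word, s)
    return {ident: word for ident, (word, _) in best.items()}


SAMPLE = ("The relation <math>E = m c^2</math> holds, where <math>E</math> is the "
          "energy, <math>m</math> is the mass and <math>c</math> is the speed of light.")
RESULT = best_definiens(score_pairs(tokenize(SAMPLE)))
```

On the sample sentence this picks "energy" for `E`, "mass" for `m` and "speed" for `c`. The last case shows a limit of single-word candidates: the intended definiens is the noun phrase "speed of light".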

slide-28
SLIDE 28

Wikipedia Subtask at NTCIR 11

  • NTCIR 11 Wikipedia dataset*
  • 30k Wikipedia Articles
  • 280k Formulae
  • 100 queries

*) Moritz Schubotz, Abdou Youssef, Volker Markl, and Howard S. Cohl. 2015. Challenges of Mathematical Information Retrieval in the NTCIR-11 Math Wikipedia Task. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15). ACM, New York, NY, USA, 951-954.

slide-29
SLIDE 29

Wikipedia Subtask at NTCIR 11

  • CICM 2012 (2 participants)
  • NTCIR 2013, pilot (6 participants)
  • NTCIR 2014: arXiv (8 participants), Wikipedia (7 participants)

Part III / IV Completed Research

slide-30
SLIDE 30

Wikipedia Task results

[Figure: result plot; labels: higher recall, higher "precision"]

slide-31
SLIDE 31

Gold standard

available from http://mlp.formulasearchengine.com

slide-32
SLIDE 32

Gold standard details

310 identifiers:

  • Wikidata item: 174
  • Two Wikidata items: 4
  • Wikidata item + NP: 27
  • Individual NP: 97
  • Multiple NP: 8

slide-33
SLIDE 33

Results: Identifiers

  • 294/310 correctly extracted (94.8%)
  • 57 false positives (fp)
  • Problems

    – Incorrect markup (8 fn, 33 fp), e.g. $\eta = \frac{Q_1 - Q_2}{Q_1}$
    – Symbols (9 fp), e.g. $\frac{\mathrm{d}}{\mathrm{d}x}$
    – Sub- and superscripts (3 fp, 2 fn), e.g. $\sigma_y^2$
    – Special notation (10 fp, 2 fn), e.g. $\mathbf{u} \times \mathbf{v} = \epsilon_{ijk} u_j v_k \mathbf{e}_i$
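
The counts on this slide are enough to derive precision and F1 alongside the stated recall. The Python sketch below assumes that the 294 correctly extracted identifiers are the true positives, the 57 false positives are counted against the same gold standard, and the 16 identifiers that were not extracted are the false negatives. Only the 94.8% figure appears on the slide; the other values are derived here.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard retrieval metrics from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


GOLD_IDENTIFIERS = 310   # size of the gold standard (from the slide)
TRUE_POSITIVES = 294     # correctly extracted (from the slide)
FALSE_POSITIVES = 57     # from the slide
FALSE_NEGATIVES = GOLD_IDENTIFIERS - TRUE_POSITIVES  # assumption: all missed identifiers

PRECISION, RECALL, F1 = precision_recall_f1(TRUE_POSITIVES, FALSE_POSITIVES, FALSE_NEGATIVES)
```

Under these assumptions recall is 294/310, about 94.8%, which matches the slide, and precision is 294/351, about 83.8%.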

slide-34
SLIDE 34

Results: (4) Find definiens candidates

slide-35
SLIDE 35

Results: Definitions

  • Exact match: 88
  • Partial match: 120
  • Not found: 19
  • Not in document: 83

slide-36
SLIDE 36

Distribution of identifier counts

slide-37
SLIDE 37

Results: Definitions with Namespace support

  • Exact match: 103
  • Partial match: 147
  • Not found: 60
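
To compare the definition results with and without namespace support, the chart counts can be turned into shares of the 310 gold-standard identifiers. The Python sketch below assumes that the numbers on the two result slides pair with the category labels in the order shown; the dictionary and function names are illustrative.

```python
WITHOUT_NAMESPACES = {
    "exact match": 88,
    "partial match": 120,
    "not found": 19,
    "not in document": 83,
}
WITH_NAMESPACES = {
    "exact match": 103,
    "partial match": 147,
    "not found": 60,
}


def shares(counts):
    """Percentage of each category, rounded to one decimal place."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


SHARES_WITHOUT = shares(WITHOUT_NAMESPACES)
SHARES_WITH = shares(WITH_NAMESPACES)
```

Both charts add up to 310 identifiers. Under the assumed pairing, the share of exact matches rises from about 28.4% to about 33.2% once namespaces are used.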

slide-38
SLIDE 38

Discovered namespaces

  • 250 clusters -> 167 mapped to classification schemata
  • 5618 definitions with $c > 1$ (2124 Wikidata concepts)
  • Evaluate 6 randomly sampled namespaces
slide-39
SLIDE 39

Impression from namespace samples

[Figure: chart "Definitions in Namespaces"; categories: correct, wrong, unspecific, cannot say; values as extracted: 129, 7, 8, 144]

slide-40
SLIDE 40

Discovered namespaces

  • Physics: identifiers are significant for formulae
  • Mathematics: identifiers might be less significant for formulae

slide-41
SLIDE 41

Conclusions

  • For 10% of the identifiers used in Wikipedia (en), we could assign the associated Wikidata item
  • → 90% ahead

    – More specific Wikidata items needed
    – Combine data from different language versions
    – Improve recognition rate within a document

  • Namespaces for mathematical identifiers could be identified

slide-42
SLIDE 42

Next steps

  • The identifier information is available from the Wikipedia API today: http://en.wikipedia.org/api
  • Develop tools to augment the user experience for math in Wikipedia and beyond

    – Tooltips (ongoing)
    – Physical Dimensions (ongoing)
    – Translation to Computer Algebra Systems (started)
    – Math Question and Answering (ongoing)
    – Author assistance (ongoing with WMF)
    – Related formulae search (started)

  • → Justification for semantification effort

slide-43
SLIDE 43

Contact

Moritz Schubotz (now at Universität Konstanz)
moritz@schubotz.de
+49 7531 88 4438
Mobile: +49 1578 047 1397
www.isg.uni-konstanz.de
www.formulasearchengine.com