Natural Language Processing Techniques for Mathematical Formula Search
Akiko Aizawa aizawa@nii.ac.jp National Institute of Informatics
About myself
Affiliation: National Institute of Informatics (Digital Content Media Science Division)
Content Science
Department)
(Department of Informatics)
Keywords
CICM 2018 (3)
Pythagorean theorem π = π2 + π2 πΆπ·2 + π΅π·2 = π΅πΆ2
Shahab Kamali and Frank Wm. Tompa. 2010. A new mathematics retrieval system. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM 2010).
Is there a standard representation for math formulae?
Are there many documents with math formulae?
Are there many users who need math formulae?
to your research?
Web-browsable XML
<math xmlns='http://www.w3.org/1998/Math/MathML' mathematica:form='TraditionalForm' xmlns:mathematica='http://www.wolfram.com/XML/'> <semantics> <mrow> <mrow> <mrow> <mrow> <mi> log </mi> <mo> &#x2061; </mo> <mo> ( </mo> <msub> <mi> z </mi> <mn> 1 </mn> </msub>
XML for math semantics
<annotation-xml encoding='MathML-Content'> <apply> <ci> Condition </ci> <apply> <eq /> <apply> <plus /> <apply> <ln /> <apply> <ci> Subscript </ci> <ci> z </ci> <cn type='integer'> 1 </cn> </apply> </apply> <apply> <ln /> <apply> <ci> Subscript </ci>
MathML (W3C recommendation)
77% of researchers across a diverse range of disciplines answered "YES" (Somewhat related, Related, or Strongly related).
NISTEP Policy Study
NIST Digital Library of Mathematical Functions
Wolfram Functions Site: 307,409 formulas
Wikipedia: 16,962 math articles assessed by WikiProject Mathematics
Many, many scientific articles
Still, math formula search is a tough problem
Three levels of representation: Presentation level, Content level, Computation level
Presentation level: for printing/displaying
LaTeX, MathML Presentation Markup, PDF, …
X squared
Content level: for "semantics"
Snuggletex https://www2.ph.ed.ac.uk/snuggletex/UpConversionDemo
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"> <apply> <power/> <ci>X</ci> <cn>2</cn> </apply> </math>
X squared
Computation level: for "computing"
power(X, 2)
X squared
X ** 2, math.pow(X, 2), X * X, …
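To make the step from the content level to the computation level concrete, here is a minimal sketch that evaluates the Content MathML for "X squared" and compares it with the computational forms listed above. The function name `evaluate` and the small `OPERATORS` table are assumptions made for this illustration; only `power`, `plus` and `times` are handled.

```python
import math
import xml.etree.ElementTree as ET

MATHML_NS = "{http://www.w3.org/1998/Math/MathML}"

CONTENT_MATHML = """
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <apply> <power/> <ci>X</ci> <cn>2</cn> </apply>
</math>
"""

# Hypothetical operator table: maps a Content MathML operator to a Python function.
OPERATORS = {
    "power": lambda args: args[0] ** args[1],
    "plus": sum,
    "times": math.prod,
}

def evaluate(node, env):
    """Recursively evaluate a small subset of Content MathML."""
    tag = node.tag.replace(MATHML_NS, "")
    if tag == "math":
        return evaluate(node[0], env)
    if tag == "ci":  # identifier: look up its value
        return env[node.text.strip()]
    if tag == "cn":  # numeric constant
        return float(node.text.strip())
    if tag == "apply":  # first child is the operator, the rest are arguments
        operator = node[0].tag.replace(MATHML_NS, "")
        args = [evaluate(child, env) for child in node[1:]]
        return OPERATORS[operator](args)
    raise ValueError(f"unsupported element: {tag}")

X = 3.0
value = evaluate(ET.fromstring(CONTENT_MATHML), {"X": X})
print(value, X ** 2, math.pow(X, 2), X * X)
```

All four expressions agree numerically, which is the point of the computation level: many surface forms, one value.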
"True" or "False"?
Description of a real world (which is assumed to be true)
A hypothesis that needs to be proven
Pythagorean theorem: $a^2 + b^2 = c^2$
What is $c$ when $a = 3$ and $b = 4$?
Natural language text is a complement for a mathematical proof
Description of a real world (which is assumed to be true)
Hypothesis that needs to be proven
Math formulae are a part of natural language semantics
indispensable component of a sentence
exists certain ambiguity with math formulae in a document that should be resolved by their context information
into natural language sentences, and sometimes, vice versa.
[Figure: document page segmented into regions labeled figure, figure caption, section title, page number, and textbox]
Puyang Xu, Asela Gunawardana, Sanjeev Khudanpur: Efficient Subsampling for Training Complex Language Models, EMNLP 2011
Denoting the concatenated d-dimensional word representations MATH-w-6-5-0-32, we have the following probability definition: MATH-p-6-6-0 where MATH-w-6-7-0-1 denotes the hidden layer size, MATH-w-6-7-0-8 and MATH-w-6-7-0-10 are the bias vectors for the output nodes and hidden nodes respectively.
f : variable or function?
$f(x)$: Intuitively, if z is a function g of y and y is a function f of x, then z is a function of x.
$1/f$: ... where f is a frequency.
$2\pi f$: Can anyone explain to me how angular frequency (w) = 2pi(f)?
$f(a+b)$: For which of the following functions is f(a+b) = f(a) + f(b) for all positive numbers a and b?
Context information becomes crucial for semantic disambiguation
Corinne's Shibboleth
Suppose the temperature on a rectangular slab of metal is given by $T(x, y) = k(x^2 + y^2)$ where $k$ is a constant. What is $T(r, \theta)$?
Mathematicians Physicists
Dray, T. & Manogue, C. (2002). Vector calculus bridge project website, http://www.math.oregonstate.edu/bridge/ideas/functions
Redish, E. F., & Kuo, E. (2015). Language of physics, language of math: Disciplinary culture and dynamic epistemology. Science & Education, 24(5-6), 561-590.
Sometimes, "commonsense" is also important for disambiguation
A: $T(r, \theta) = kr^2$
B: $T(r, \theta) = k(r^2 + \theta^2)$
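The two answers to Corinne's Shibboleth can be reproduced in a few lines. This is only an illustrative sketch: the constant `k = 2.0` and the function names `physical_reading` and `formal_reading` are assumptions, not part of the cited studies. As discussed in the cited works, answer A is the one physicists tend to give and answer B the one mathematicians tend to give.

```python
import math

k = 2.0  # arbitrary constant chosen for the illustration

def T(x, y):
    """Temperature as given in Cartesian coordinates: T(x, y) = k(x^2 + y^2)."""
    return k * (x ** 2 + y ** 2)

def physical_reading(r, theta):
    """T names the temperature at a point, so convert polar to Cartesian first (answer A)."""
    return T(r * math.cos(theta), r * math.sin(theta))

def formal_reading(r, theta):
    """T names a rule on two argument slots, so substitute the symbols directly (answer B)."""
    return T(r, theta)

r, theta = 3.0, 1.0
print(physical_reading(r, theta))  # equals k * r**2 up to rounding
print(formal_reading(r, theta))    # equals k * (r**2 + theta**2)
```

The same string of symbols yields two different functions, and only knowledge of the discipline tells a reader which one is meant.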
Braille Mathematics Notation
RNIB, Braille Mathematics Notation
National Institute for the Blind.
ChattyInfty
Nonprofit Organization Science Accessibility Net Reading software for mathematical document
http://www.sciaccess.net/en/ChattyInfty/
Prof. Masakazu Suzuki
Semantic analysis of math formulae can be considered as one variation of NLP semantic parsing
For the same document collection and the same query, retrieval systems return different document ranks
[Figure: ranked result lists from different systems, each result labeled Relevant or Not relevant]
Perform manual evaluation, then compare the systems
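Once each returned document has a manual relevance judgment, two rankings can be compared with standard measures. The sketch below uses precision at k and a simplified average precision; the two judged rankings are hypothetical, and the average precision here is normalized by the number of relevant documents retrieved, which is an assumption (the usual definition divides by all relevant documents in the collection).

```python
def precision_at_k(judgments, k):
    """Fraction of the top-k results judged relevant."""
    return sum(judgments[:k]) / k

def average_precision(judgments):
    """Mean of the precision values at each rank where a relevant result appears."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(judgments, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Hypothetical judged rankings for one query (True = Relevant).
system_a = [False, False, True, True, True, True]
system_b = [True, False, True, False, False, False]

for name, run in (("A", system_a), ("B", system_b)):
    print(name, precision_at_k(run, 3), average_precision(run))
```

System A returns more relevant documents overall, yet system B scores higher on both measures because it places relevant documents earlier.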
formula search
Moritz Schubotz* (NTCIR11, PhD student)
arXiv Wikipedia
Evaluation workflow: Topics, Documents, System Run Results, Pooling, Human Assessment, Evaluation
papers from arXiv.org xhtml/xml math representation (MathML)
KWARC Project
th, physics:nlin
format by the KWARC project (http://kwarc.info/).
units
formulae including monomial expressions.
(A collection of large-scale XML data)
expressed in the MathML format with "query variables"
Math formula + keywords
Distribute the documents and queries and ask the participants to submit their retrieval results
NTCIR12-MathIR-1 1 1307.6316_1_108 1 5.06740975 MCAT
NTCIR12-MathIR-1 1 math0503510_1_11 2 4.70783107 MCAT
NTCIR12-MathIR-1 1 math-ph0105008_1_4 3 4.62195235 MCAT
NTCIR12-MathIR-1 1 1002.4757_1_52 4 4.55444812 MCAT
NTCIR12-MathIR-1 1 math-ph0108005_1_1 5 4.49042815 MCAT
NTCIR12-MathIR-1 1 1111.3507_1_8 6 4.47622183 MCAT
NTCIR12-MathIR-1 1 1011.6313_1_1 7 4.47066163 MCAT
NTCIR12-MathIR-1 1 quant-ph9904101_1_41 8 4.36044515 MCAT
NTCIR12-MathIR-1 1 1201.5232_1_35 9 4.33834295 MCAT
NTCIR12-MathIR-1 1 math0701262_1_21 10 4.33552455 MCAT
NTCIR12-MathIR-1 1 math0701262_1_20 11 4.33552455 MCAT
NTCIR12-MathIR-1 1 1201.0836_1_67 12 4.32558394 MCAT
NTCIR12-MathIR-1 1 1112.0483_1_50 13 4.30918135 MCAT
NTCIR12-MathIR-1 1 math0305174_1_44 14 4.30729695 MCAT
NTCIR12-MathIR-1 1 math0603478_1_45 15 4.27833455 MCAT
Fields labeled on the slide: QueryID, DocID, Rank, SystemRunID
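A submitted run in this whitespace-separated layout is easy to load. The following sketch assumes six fields per line; the slide labels only QueryID, DocID, Rank and SystemRunID, so treating the fifth field as a retrieval score and leaving the second field uninterpreted (`field2`) are assumptions, as are the names `RunEntry` and `parse_run_line`.

```python
from collections import namedtuple

# field2 is kept as-is because the slide does not label it.
RunEntry = namedtuple("RunEntry", "query_id field2 doc_id rank score run_id")

def parse_run_line(line):
    """Split one result line into typed fields."""
    query_id, field2, doc_id, rank, score, run_id = line.split()
    return RunEntry(query_id, field2, doc_id, int(rank), float(score), run_id)

lines = [
    "NTCIR12-MathIR-1 1 1307.6316_1_108 1 5.06740975 MCAT",
    "NTCIR12-MathIR-1 1 math0503510_1_11 2 4.70783107 MCAT",
    "NTCIR12-MathIR-1 1 math-ph0105008_1_4 3 4.62195235 MCAT",
]
entries = [parse_run_line(line) for line in lines]
for entry in entries:
    print(entry.rank, entry.doc_id, entry.score)
```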
47 submitted runs
GroupID  Organization  Country  Runs (columns: arXiv main, arXiv simto, Wiki main, Wiki formula)
ICST  Peking Univ.  CN  1 1
FSE  TU Berlin & Univ. of Konstanz  DE  1
MCAT  National Institute of Informatics  JP  4 4 4 3
MIRMU  Masaryk Univ.  CZ  4 4 4
RITUW  Rochester Inst. Tech & Univ. Waterloo  US,CA  4 4 4
SMSG5  Samsung R&D India-Bangalore  IN  1 4
Due to limited human resources, not all the submitted results are evaluated!
[Figure: the ranked run results, with each row marked "Human Judgment" or "No Judgment"]
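The usual way to decide which results receive a human judgment is depth-k pooling: the top-k documents of every submitted run are merged into one pool, and only pooled documents are assessed. The sketch below illustrates the idea with hypothetical run names, document IDs and pool depth; the actual pooling settings of the task are not stated here.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` documents of every run."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool

# Hypothetical runs for one query, each a ranked list of document IDs.
runs = {
    "run_A": ["d1", "d2", "d3", "d4"],
    "run_B": ["d3", "d5", "d1", "d6"],
}

pool = build_pool(runs, depth=2)
all_docs = {doc for docs in runs.values() for doc in docs}
unjudged = all_docs - pool

print(sorted(pool))      # documents sent to the assessors
print(sorted(unjudged))  # documents left without a judgment
```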
Given a pair of <query, retrieved document>, decide:
Agreement score for the two judges
                arXiv-main  arXiv-simto  Wiki-main  Wiki-formula
docs            4,234       612          4,107      2,687
Fleiss' kappa   0.5615      0.5380       0.3546     0.2619
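For reference, Fleiss' kappa can be computed from a table that counts, for each judged item, how many judges chose each category. The sketch below follows the standard definition; the small judgment table (two judges, two categories) is hypothetical and is not the task's data. The function assumes every item has the same number of judges and that not all judgments fall into a single category.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of judges who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Share of all assignments that went to each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_categories)]
    # Per-item agreement among the judges.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items          # observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical judgments: columns are (relevant, not relevant), two judges per item.
judgments = [[2, 0], [0, 2], [1, 1], [2, 0]]
kappa = fleiss_kappa(judgments)
print(round(kappa, 4))
```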
http://research.nii.ac.jp/ntcir/data/data-en.html
research-level mathematical information needs: A Test Collection and Technical Terminology Experiment (ACL 2015)
Categorization with Structure of Mathematical
information needs
document categorization (MDC), applied to MathOverflow and ArXiv collections
About 60 million math formulae
Number of nodes per formula tree: much heterogeneity in the target collection
NTCIR-Math ArXiv Dataset
Mathematical symbols are translated into keywords and indexed using a general search engine
Indexing in a general search engine
Text: Digital and computational solutions are becoming the prevalent means for the generation, communication, processing, storage and curation
Bag-of-Words: become, mathematical, means, prevalent, solution
Sub-tree based indexing of math formulae
[Figure: formula tree with nodes apply, eq, divide, ci and leaves x, hypotenuse]
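Sub-tree based indexing can be sketched as follows: every subtree of a formula tree is serialized into a string token, and the tokens are stored in an ordinary inverted index, just like words. The two formulas, the identifier `s`, and the helper names `serialize` and `subtree_tokens` are assumptions for illustration; real systems use more elaborate tokenization.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def serialize(node):
    """Canonical one-line string for the subtree rooted at node."""
    text = (node.text or "").strip()
    label = f"{node.tag}:{text}" if text else node.tag
    if len(node) == 0:
        return label
    return f"{label}({','.join(serialize(child) for child in node)})"

def subtree_tokens(root):
    """One token per node, each describing the whole subtree below it."""
    return [serialize(node) for node in root.iter()]

# Hypothetical Content MathML fragments (namespace omitted for brevity).
formulas = {
    "f1": "<apply><eq/><ci>s</ci><apply><divide/><ci>x</ci><ci>hypotenuse</ci></apply></apply>",
    "f2": "<apply><divide/><ci>x</ci><ci>hypotenuse</ci></apply>",
}

index = defaultdict(set)  # token -> formulas that contain it
for formula_id, xml_text in formulas.items():
    for token in subtree_tokens(ET.fromstring(xml_text)):
        index[token].add(formula_id)

print(sorted(index["apply(divide,ci:x,ci:hypotenuse)"]))
```

A query formula is tokenized the same way, so a shared sub-expression is found through a plain keyword lookup.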
$f(x) = x$, searched with a query in which one query variable fills both positions
unification
f(x)=y is not a "fixed point"!
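The role of query variables can be illustrated with a small matcher in which a variable that occurs twice must bind to the same sub-expression both times. This is a simplified one-way pattern match rather than full unification, and the tuple encoding of formulas and the `?a` variable syntax are assumptions made for the sketch.

```python
def match(pattern, expr, bindings=None):
    """Return variable bindings if expr matches pattern, else None.

    Strings starting with '?' are query variables; tuples are (operator, args...).
    """
    bindings = dict(bindings or {})
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in bindings:
            # A repeated variable must match the same sub-expression again.
            return bindings if bindings[pattern] == expr else None
        bindings[pattern] = expr
        return bindings
    if isinstance(pattern, tuple) and isinstance(expr, tuple):
        if len(pattern) != len(expr):
            return None
        for sub_pattern, sub_expr in zip(pattern, expr):
            bindings = match(sub_pattern, sub_expr, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == expr else None

query = ("eq", ("apply", "f", "?a"), "?a")          # f(?a) = ?a
fixed_point = ("eq", ("apply", "f", "x"), "x")      # f(x) = x
not_fixed_point = ("eq", ("apply", "f", "x"), "y")  # f(x) = y

print(match(query, fixed_point))      # binds ?a to x
print(match(query, not_fixed_point))  # no match
```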
MathWebSearch (KWARC Project)
Query: π2 +
π π
Math Information Retrieval team of Masaryk University (MIRMU)
Subtree Hash with Query Variables: MathCat@NII
Similarity between two mathematical expressions is calculated based on the tree-structure similarity score
Tree-structure similarity = Math semantic similarity?
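The question above can be made tangible with a toy score. The sketch below uses the Jaccard overlap of subtree sets as a stand-in for a tree-structure similarity; this particular measure and the tuple encoding are assumptions for illustration and are not the scoring used by any of the systems mentioned.

```python
def subtrees(tree):
    """Set of all subtrees; a tree is a leaf string or a tuple (operator, children...)."""
    found = {tree}
    if isinstance(tree, tuple):
        for child in tree[1:]:
            found |= subtrees(child)
    return found

def tree_similarity(a, b):
    """Jaccard overlap between the subtree sets of two formula trees."""
    sa, sb = subtrees(a), subtrees(b)
    return len(sa & sb) / len(sa | sb)

# The same statement written with different variable names.
pythagoras_abc = ("eq", ("plus", ("power", "a", "2"), ("power", "b", "2")), ("power", "c", "2"))
pythagoras_xyz = ("eq", ("plus", ("power", "x", "2"), ("power", "y", "2")), ("power", "z", "2"))

print(tree_similarity(pythagoras_abc, pythagoras_abc))
print(tree_similarity(pythagoras_abc, pythagoras_xyz))
```

The two formulas state the same theorem, yet the only subtree they share is the constant 2, so a purely structural score rates them as nearly unrelated.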
[Figure: two histograms with axis ticks from 0.1 to 1, one panel titled Not Relevant (counts up to 700) and one titled Strongly Relevant (counts up to 180)]
have similar meanings" (Harris, 1954).
(Firth, 1957)
https://aclweb.org/aclwiki/Distributional_Hypothesis
A math formula is characterized by the natural language context in which it occurs.
symbols)
Extract: math formula expressed in MathML; natural language context surrounding the math formula
Math database (XML)
Search: substructures of MathML trees; keywords extracted from math descriptions
Extracted descriptions:
concatenated d-dimensional word representations
probability distribution
hidden layer size
$$P(w_i = k \mid w_{i-1}, \ldots, w_{i-n+1}) = \frac{e^{a_k}}{\sum_m e^{a_m}}, \qquad a_k = b_k + \sum_{l=1}^{h} W_{kl} \tanh\Big(c_l + \sum_{j=1}^{(n-1)d} V_{lj} f_j\Big)$$
MATH-p-6-6-0: probability distribution
MATH-w-6-5-0-32  $f$  concatenated d-dimensional word representations
MATH-w-6-7-0-1   h    hidden layer size
MATH-w-6-7-0-8   b    bias vectors for the output nodes
MATH-w-6-7-0-10  c    bias vectors for the hidden nodes
, as shown in (1), where $w_i$ is the i-th word, and $h_i$ is the word history preceding $w_i$
Dependency Graph
[Figure: dependency graph linking a summation formula to related expressions, with nodes labeled "ratio" and "parameter 7"]
Giovanni Yoko Kristianto, Goran Topic, Akiko Aizawa: βUtilizing Dependency Relationships between Math Expressions in Math IR.β Information Retrieval Journal
Query: $x^2 - x - 1 = 0$ (golden ratio; Fibonacci)
NLP Issues in dependency graph extraction
$x, x_1, x_2, \ldots$: unit/class identification
x, b, ...: symbols expressed as non-math symbols
"where ... = ...": math exp as a text element (but not as a simple noun)
"Let $\mu$ and $\sigma$ be the mean and standard deviation of the normal distribution, respectively.": coordination
Description: probability distribution
Components:
$w_i$ : the i-th word
$f$ : concatenated d-dimensional word representations
h : hidden layer size
b : bias vectors for the output nodes
c : bias vectors for the hidden nodes
Andrea Kohlhase, Michael Kohlhase and Taweechai Ouypornkochagorn. Discourse Phenomena in Math Documents (Next!)
word-level ambiguity: Malus domestica
sentence-level ambiguity: (PropBank) break. Roles: Arg0: breaker, Arg1: thing broken, Arg2: instrument, Arg3: pieces
Where would McCain be without Sarah?
John McCain Sarah Palin
Text in a document Entities in a knowledgebase (ex. Wikipedia articles)
Mention
There can be multiple
https://free-pal.appspot.com/
Two entities: Entity X, Entity Y
Syntactic pattern, ex. [X] married [Y]
Relation, ex. Marry(X, Y)
Math entity recognition
symbols
(constant, variable,
sin(x): sin is an operator, x is a variable
Math parsing
structure
MathML conversion: y=f(x)
[Figure: expression tree for y=f(x) with nodes eq, apply, y, f, x]
POM tagger Abdou Youssef: Part-of-math tagging and applications (CICM-2017)
Variable Typing Yiannos A. Stathopoulos, Simon Baker, Marek Rei and Simone Teufel : Variable Typing: Assigning Meaning to Variables in Mathematical Text (NAACL 2018)
search engine queries, Robot control, Math problem solver
Lambda calculus, First order predicate logic, SQL (database queries)
Mark Steedman (2000) The Syntactic Process
Anna married Manny
$(\lambda x.\lambda y.\,\mathit{marry}'xy)\,\mathit{manny}'\,\mathit{anna}' = (\lambda y.\,\mathit{marry}'\mathit{manny}'y)\,\mathit{anna}' = \mathit{marry}'\mathit{manny}'\mathit{anna}'$
Categories: NP, NP, (S\NP)/NP, S\NP, S
Phrase structure: NP, NP, V, VP, S
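The reduction for "Anna married Manny" can be replayed with curried functions. In this sketch the verb takes its object first and its subject second, mirroring the category (S\NP)/NP; representing the final logical form as a tuple is an assumption made for illustration.

```python
# married := (S\NP)/NP with semantics  lambda x. lambda y. marry' x y
marry = lambda x: lambda y: ("marry", x, y)

manny, anna = "manny", "anna"

verb_phrase = marry(manny)    # S\NP : lambda y. marry' manny' y
sentence = verb_phrase(anna)  # S    : marry' manny' anna'

print(sentence)
```

Each function application corresponds to one combination step in the derivation: forward application with the object, then backward application with the subject.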
https://nlp.stanford.edu/software/sempre/
Jonathan Berant, Andrew Chou, Roy Frostig, Percy Liang. Semantic Parsing on Freebase from Question-Answer Pairs. Empirical Methods in Natural Language Processing (EMNLP), 2013.
Which college did Obama go to ?
Type.University (alignment), BarackObama (alignment), Education (bridging)
Type.University ∩ Education.BarackObama
Occidental College, Columbia University Execute on Database
$\exists y\,\exists z\,(x + y + z = 3 \wedge xy + yz + zx = -9)$
Quantifier Elimination Method
cited from a book "Algorithms of Quantifier Elimination and Their Applications" by Anai and Yokoyama
$-3 \le x \le 5$
Natural language description + math formula
First order predicate logic
Find the range of $x$ when the real numbers $x$, $y$, and $z$ satisfy the following formula.
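For the system $x + y + z = 3$, $xy + yz + zx = -9$, the stated range of $x$ can be checked by hand. The derivation below is an elementary discriminant argument added for illustration; it is not the quantifier elimination algorithm from the cited book, only a way to confirm the result for this one example.

```latex
\begin{align*}
y + z &= 3 - x, \qquad yz = -9 - x(y + z) = x^2 - 3x - 9 \\
&\text{so } y \text{ and } z \text{ are the roots of } t^2 - (3 - x)\,t + (x^2 - 3x - 9) = 0 \\
\text{real } y, z \text{ exist} &\iff (3 - x)^2 - 4(x^2 - 3x - 9) \ge 0 \\
&\iff -3x^2 + 6x + 45 \ge 0 \\
&\iff (x - 5)(x + 3) \le 0 \\
&\iff -3 \le x \le 5
\end{align*}
```

Eliminating the quantified variables $y$ and $z$ leaves a condition on $x$ alone, which is exactly what a quantifier elimination procedure produces.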
Low dimensional continuous vector representation
Translation
Preliminary Exploration of Formula Embedding for Mathematical Information Retrieval: can mathematical formulae be embedded like a natural language? CIKM 2017 Workshop on Interpretable Data Mining (IDM): Bridging the Gap between Shallow and Deep Models
ArXiv:1083.09123v1 (2018)
Liangcai Gao, Zhuoren Jiang, Yue Yin, Ke Yuan, Zuoyu Yan, Zhi Tang: Preliminary Exploration of Formula Embedding for Mathematical Information Retrieval: can mathematical formulae be embedded like a natural language? CIKM 2017 Workshop on Interpretable Data Mining (IDM): Bridging the Gap between Shallow and Deep Models
more than the second number. Find the two numbers. Equations: $x + y = 18$, $x = y + 4$
times the smallest and the middle number is 5. What are the numbers? Equations: $x + y + z = 15$, $x = 4 \cdot z$, $y = 5$
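Once a word problem has been translated into equations, solving it is mechanical. The sketch below solves the two equation sets above with a small Gauss-Jordan elimination over exact fractions; the helper name `solve_linear` and the coefficient-matrix encoding are assumptions for illustration and are unrelated to the method of the cited paper, whose contribution is the translation step.

```python
from fractions import Fraction

def solve_linear(a, b):
    """Gauss-Jordan elimination for a square system a * v = b with a unique solution."""
    n = len(a)
    m = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it into place.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [row[-1] for row in m]

# x + y = 18, x = y + 4  rewritten as  x - y = 4
two_numbers = solve_linear([[1, 1], [1, -1]], [18, 4])

# x + y + z = 15, x = 4 * z  rewritten as  x - 4z = 0, y = 5
three_numbers = solve_linear([[1, 1, 1], [1, 0, -4], [0, 1, 0]], [15, 0, 5])

print([int(v) for v in two_numbers], [int(v) for v in three_numbers])
```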
Danqing Huang, Jin-Ge Yao, Chin-Yew Lin, Qingyu Zhou, Jian Yin: Using Intermediate Representations to Solve Math Word Problems, ACL-2018
translate translate
Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018. A Survey of Machine Learning for Big Code and Naturalness. ACM Comput. Survey. 51, 4, Article 81 (July 2018), 37 pages. DOI: https://doi.org/10.1145/3212695 Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit the abundance of patterns of code. In this article, we survey this work. We contrast programming languages against natural languages and discuss how these similarities and differences drive the design
taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review
First experiments with Neural Translation of Informal to Formal Mathematics.
Math Knowledge Processing
Towards Formula Translation using Recursive Neural Networks
CICM 2018 (93)
CICM 2018 (94)
search engine queries Robot control Math problem solver Lambda calculus First order predicate logic SQL(database queries)
CICM 2018 (95)
CICM 2018 (96)