
SLIDE 1

Natural Language Processing Techniques for Mathematical Formula Search

Akiko Aizawa aizawa@nii.ac.jp National Institute of Informatics


SLIDE 2

About myself

Affiliation

National Institute of Informatics (Digital Content Media Science Division)
Research Center for Knowledge Media and Content Science
The University of Tokyo (Computer Science Department)
Graduate School for Advanced Studies (Department of Informatics)

Keywords

Text Processing
Natural Language Processing
Information Retrieval
Knowledge Engineering

Akiko Aizawa

SLIDE 3

Today’s Talk

Introduction: Math Formula in Scientific Literature
Mathematical Formula in Natural Language Text
NTCIR: Math Information Retrieval Task
Technical Challenges in NTCIR Math IR
Math Understanding as AI Challenges

CICM 2018 (3)

SLIDE 4

Introduction: Math Formula in Scientific Literature


SLIDE 5

Searching Math Formulae on the Web

Pythagorean theorem: $c^2 = a^2 + b^2$, $BC^2 + AC^2 = AB^2$

SLIDE 6

Searching Math Formulae on the Web

Pythagorean theorem: $c^2 = a^2 + b^2$, $BC^2 + AC^2 = AB^2$

Shahab Kamali and Frank Wm. Tompa. 2010. A new mathematics retrieval system. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM 2010).

“Given a mathematical expression, the problem is to find expressions that match it.”

SLIDE 7

3R’s for Math Formula Search

Representations: Is there a standard representation for math formulae?
Resources: Are there many documents with math formulae?
Requirement: Are there many users who need math formulae?

SLIDE 8

3R’s for Math Formula Search

Q. Is mathematics related to your research?

Representations Resources

Web-browsable XML

XML for math semantics

<annotation-xml encoding='MathML-Content'> <apply> <ci> Condition </ci> <apply> <eq /> <apply> <plus /> <apply> <ln /> <apply> <ci> Subscript </ci> <ci> z </ci> <cn type='integer'> 1 </cn> </apply> </apply> <apply> <ln /> <apply> <ci> Subscript </ci>

MathML (W3C recommendation)

Requirement

77% of researchers across a diverse range of disciplines answered ‘YES’ (Somewhat related, Related, or Strongly related).

NISTEP Policy Study

NIST Digital Library of Mathematical Functions
Wolfram Functions Site: 307,409 formulas
Wikipedia: 16,962 math articles assessed by WikiProject Mathematics
Many, many scientific articles

SLIDE 9

3R’s for Math Formula Search

Representations: Is there a standard representation for math formulae?
Resources: Are there many documents with math formulae?
Requirement: Are there many users who need math formulae?

Still, math formula search is a tough problem

SLIDE 10

Why is math formula search difficult?

Math Formula Search Math Formula Understanding

SLIDE 11


Presentation level Content level Computation level

Three Representation Levels of Math Formulae

SLIDE 12


Presentation level Content level Computation level For printing/displaying

$X^2$

LaTeX, MathML Presentation Markup, PDF

X squared

…

SLIDE 13


Presentation level Content level Computation level For “semantics”

Snuggletex https://www2.ph.ed.ac.uk/snuggletex/UpConversionDemo

X squared
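As a rough illustration of the content level, the sketch below walks a small Content MathML fragment for "X squared" and prints it in operator-first form. The fragment and the helper name `to_prefix` are assumptions made for this example, and namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical Content MathML fragment for "X squared" (namespace omitted).
CONTENT_MATHML = "<apply><power/><ci>X</ci><cn>2</cn></apply>"

def to_prefix(node: ET.Element) -> str:
    """Render a Content MathML element as an operator-first string."""
    if node.tag in ("ci", "cn"):
        # Identifiers and numbers carry their value as text.
        return (node.text or "").strip()
    if node.tag == "apply":
        # The first child is the operator, the rest are its arguments.
        operator, *operands = list(node)
        args = ", ".join(to_prefix(child) for child in operands)
        return f"{to_prefix(operator)}({args})"
    # Empty operator elements such as <power/> are named by their tag.
    return node.tag

print(to_prefix(ET.fromstring(CONTENT_MATHML)))  # power(X, 2)
```

The same walk handles nested applications, so a fragment for y = f(x) would come out as `eq(y, f(x))`.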

SLIDE 14


Presentation level Content level Computation level For “computing”

power(X, 2)

X squared

X ** 2 math.pow(X, 2) X * X …
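To make the computation level concrete, here is a minimal sketch showing that several Python spellings of "X squared" evaluate to the same number. The function name `square_variants` is introduced only for this illustration.

```python
import math

def square_variants(x: float) -> dict[str, float]:
    """Evaluate several computation-level spellings of 'X squared'."""
    return {
        "x ** 2": x ** 2,
        "math.pow(x, 2)": math.pow(x, 2),
        "x * x": x * x,
    }

# All three spellings denote the same value, although their surface forms differ.
print(square_variants(3.0))
```

A formula search engine that works only on surface forms would treat these three expressions as unrelated strings, which is one reason the representation level matters.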

SLIDE 15


Presentation level Content level Computation level

Computer understanding of math formulae

SLIDE 16

Mathematical Formulae in Natural Language Text


SLIDE 17

Math in Scientific Research: The Cultural Difference

Math for mathematics
Math for many other research fields

SLIDE 18

Math in Scientific Research: The Cultural Difference

Pythagorean theorem: $a^2 + b^2 = c^2$
A hypothesis that needs to be proven: “True” or “False”?
Description of a real world (which is assumed to be true): What is $c$ when $a = 3$ and $b = 4$?

SLIDE 19

Math in Scientific Research: The Cultural Difference

Math for mathematics
Math for many other research fields

Natural language text is a complement for a mathematical proof
Description of a real world (which is assumed to be true)
Hypothesis that needs to be proven
Math formulae are a part of natural language semantics

SLIDE 20

Math and NLP: Observations

Inseparability: A math formula often appears as an indispensable component of a sentence

Ambiguity: Just like natural language text, math formulae in a document carry a certain ambiguity that should be resolved by their context information

Translatability: Math formulae can be translated into natural language sentences, and sometimes, vice versa.

SLIDE 21

Typical page layout of a scientific paper

[Figure: page layout with regions labeled section title, textbox, figure, figure caption, and page number]

SLIDE 22

Typical Scientific “Text”


Puyang Xu, Asela Gunawardana, Sanjeev Khudanpur: Efficient Subsampling for Training Complex Language Models, EMNLP 2011

SLIDE 23

Math formulae in a sentence

Denoting the concatenated d-dimensional word representations MATH-w-6-5-0-32, we have the following probability definition: MATH-p-6-6-0 where MATH-w-6-7-0-1 denotes the hidden layer size, MATH-w-6-7-0-8 and MATH-w-6-7-0-10 are the bias vectors for the output nodes and hidden nodes respectively.

Puyang Xu, Asela Gunawardana, Sanjeev Khudanpur: Efficient Subsampling for Training Complex Language Models, EMNLP 2011

SLIDE 24

Math and NLP: Observations

Inseparability: A math formula often appears as an indispensable component of a sentence

Ambiguity: Just like natural language text, math formulae in a document carry a certain ambiguity that should be resolved by their context information

Translatability: Math formulae can be translated into natural language sentences, and sometimes, vice versa.

SLIDE 25

f : variable f : function ?

Ambiguity of mathematical formulae

Examples: $f(x)$, $1/f$, $2\pi f$, $f(a+b)$

Intuitively, if z is a function g of y and y is a function f of x, then z is a function of x.
... where f is a frequency.
Can anyone explain to me how angular frequency (w) = 2pi(f)?
For which of the following functions is f(a+b) = f(a) + f(b) for all positive numbers a and b?

Context information becomes crucial for semantic disambiguation

SLIDE 26

Ambiguity of mathematical formulae

Corinne’s Shibboleth: Suppose the temperature on a rectangular slab of metal is given by $T(x, y) = k(x^2 + y^2)$, where $k$ is a constant. What is $T(r, \theta)$?

A: $T(r, \theta) = kr^2$
B: $T(r, \theta) = k(r^2 + \theta^2)$

Mathematicians Physicists

Sometimes, “commonsense” is also important for disambiguation

Dray, T. & Manogue, C. (2002). Vector calculus bridge project website, http://www.math.oregonstate.edu/bridge/ideas/functions
Redish, E. F., & Kuo, E. (2015). Language of physics, language of math: Disciplinary culture and dynamic epistemology. Science & Education, 24(5-6), 561-590.

SLIDE 27

Math and NLP: Observations

Inseparability: A math formula often appears as an indispensable component of a sentence

Ambiguity: Just like natural language text, math formulae in a document carry a certain ambiguity that should be resolved by their context information

Translatability: Math formulae can be translated into natural language sentences, and sometimes, vice versa.

SLIDE 28

Math to NLP Translation

Braille Mathematics Notation

RNIB, Braille Mathematics Notation 1987. 1987, Peterborough: Royal National Institute for the Blind.

ChattyInfty

Nonprofit Organization Science Accessibility Net Reading software for mathematical document

http://www.sciaccess.net/en/ChattyInfty/

Prof. Masakazu Suzuki

SLIDE 29

Math formula as a sentence

Y equals X squared.

$Y = X^2$.

SLIDE 30

Math and NLP: Observations

Inseparability: A math formula often appears as an indispensable component of a sentence

Ambiguity: Just like natural language text, math formulae in a document carry a certain ambiguity that should be resolved by their context information

Translatability: Math formulae can be translated into natural language sentences, and sometimes, vice versa.

Semantic analysis of math formulae can be considered as one variation of NLP semantic parsing

SLIDE 31

NTCIR: Math Information Retrieval Tasks


SLIDE 32

Evaluation of retrieval systems

[Figure: ranked result lists from different retrieval systems, each result labeled Relevant or Not relevant]

For the same document collection and the same query, retrieval systems return different document ranks
Perform manual evaluation
Compare

SLIDE 33

Why is an IR task necessary?

Enable comparison between IR systems
Collect documents
Provide example queries
Reproduce the IR experiments
Define the target problem
Identify the challenges


SLIDE 34

NTCIR Math

Information retrieval tasks dedicated to math formula search

NTCIR-10 Math Pilot Task (2012.2-2013.6)
NTCIR-11 Math-2 (-2014.12)
NTCIR-12 MathIR (-2016.6)
Organizers
Akiko Aizawa
Michael Kohlhase
Iadh Ounis
Richard Zanibbi *NTCIR12
Moritz Schubotz *NTCIR11, PhD student

arXiv Wikipedia

SLIDE 35

General IR Task Framework

Topics Documents System Run Results Pooling Human Assessment Evaluation

NTCIR-Math Policy Keep it as simple as possible!

SLIDE 36

NTCIR Math-IR Task Framework

Topics Documents System Run Results Pooling Human Assessment Evaluation

Collect documents with Math formulae in MathML format

SLIDE 37

Math Retrieval Task : Dataset Example

papers from arXiv.org xhtml/xml math representation (MathML)


KWARC Project

SLIDE 38

Document Collection (arXiv Task)

About 105,120 scientific articles from arXiv.org
arXiv categories: math, cs, physics:math-ph, stat, physics:hep-th, physics:nlin
Converted from LaTeX to an HTML+MathML-based format by the KWARC project (http://kwarc.info/).
Each paper is divided into paragraphs, which are used as retrieval units
A total of 8,301,578 search units with about 60 million math formulae, including monomial expressions.

(A collection of large-scale XML data)

SLIDE 39

NTCIR Math-IR Task Framework

Topics Documents System Run Results Pooling Human Assessment Evaluation

Prepare “queries” used for evaluation

SLIDE 40

Topic example

Math formula + keywords, expressed in the MathML format with “query variables”

SLIDE 41

NTCIR Math-IR Task Framework

Topics Documents System Run Results Pooling Human Assessment Evaluation

Distribute the documents and queries and ask the participants to submit their retrieval results

SLIDE 42

Retrieval Result Submission Format

NTCIR12-MathIR-1 1 1307.6316_1_108 1 5.06740975 MCAT
NTCIR12-MathIR-1 1 math0503510_1_11 2 4.70783107 MCAT
NTCIR12-MathIR-1 1 math-ph0105008_1_4 3 4.62195235 MCAT
NTCIR12-MathIR-1 1 1002.4757_1_52 4 4.55444812 MCAT
NTCIR12-MathIR-1 1 math-ph0108005_1_1 5 4.49042815 MCAT
NTCIR12-MathIR-1 1 1111.3507_1_8 6 4.47622183 MCAT
NTCIR12-MathIR-1 1 1011.6313_1_1 7 4.47066163 MCAT
NTCIR12-MathIR-1 1 quant-ph9904101_1_41 8 4.36044515 MCAT
NTCIR12-MathIR-1 1 1201.5232_1_35 9 4.33834295 MCAT
NTCIR12-MathIR-1 1 math0701262_1_21 10 4.33552455 MCAT
NTCIR12-MathIR-1 1 math0701262_1_20 11 4.33552455 MCAT
NTCIR12-MathIR-1 1 1201.0836_1_67 12 4.32558394 MCAT
NTCIR12-MathIR-1 1 1112.0483_1_50 13 4.30918135 MCAT
NTCIR12-MathIR-1 1 math0305174_1_44 14 4.30729695 MCAT
NTCIR12-MathIR-1 1 math0603478_1_45 15 4.27833455 MCAT

Labeled fields: QueryID, DocID, Rank, SystemRunID
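A submission line like the ones above can be read with a few lines of Python. This is only a sketch: the slide labels four fields (QueryID, DocID, Rank, SystemRunID), so treating the second column as an unused placeholder and the fifth as a retrieval score is an assumption based on the sample rows, and `RunEntry` and `parse_run_line` are hypothetical names.

```python
from typing import NamedTuple

class RunEntry(NamedTuple):
    query_id: str
    doc_id: str
    rank: int
    score: float   # assumed meaning of the fifth column
    run_id: str

def parse_run_line(line: str) -> RunEntry:
    """Split one whitespace-separated result line into typed fields."""
    # The second column is constant in the sample and is ignored here (assumption).
    query_id, _unused, doc_id, rank, score, run_id = line.split()
    return RunEntry(query_id, doc_id, int(rank), float(score), run_id)

entry = parse_run_line("NTCIR12-MathIR-1 1 1307.6316_1_108 1 5.06740975 MCAT")
print(entry.doc_id, entry.rank, entry.run_id)
```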

SLIDE 43

Submissions (NTCIR-12 Math-IR)

47 submitted runs

GroupID, organization, country, and runs submitted (subtasks: arXiv main, arXiv simto, Wiki main, Wiki formula)

ICST: Peking Univ. (CN): 1, 1
FSE: TU Berlin & Univ. of Konstanz (DE): 1
MCAT: National Institute of Informatics (JP): 4, 4, 4, 3
MIRMU: Masaryk Univ. (CZ): 4, 4, 4
RITUW: Rochester Inst. Tech & Univ. Waterloo (US, CA): 4, 4, 4
SMSG5: Samsung R&D India-Bangalore (IN): 1, 4

SLIDE 44

NTCIR Math-IR Task Framework

Topics Documents System Run Results Pooling Human Assessment Evaluation

Decide document-query pairs for manual evaluation

SLIDE 45

Retrieval Result Submission Format

NTCIR12-MathIR-1 1 1307.6316_1_108 1 5.06740975 MCAT
NTCIR12-MathIR-1 1 math0503510_1_11 2 4.70783107 MCAT
NTCIR12-MathIR-1 1 math-ph0105008_1_4 3 4.62195235 MCAT
NTCIR12-MathIR-1 1 1002.4757_1_52 4 4.55444812 MCAT
NTCIR12-MathIR-1 1 math-ph0108005_1_1 5 4.49042815 MCAT
NTCIR12-MathIR-1 1 1111.3507_1_8 6 4.47622183 MCAT
NTCIR12-MathIR-1 1 1011.6313_1_1 7 4.47066163 MCAT
NTCIR12-MathIR-1 1 quant-ph9904101_1_41 8 4.36044515 MCAT
NTCIR12-MathIR-1 1 1201.5232_1_35 9 4.33834295 MCAT
NTCIR12-MathIR-1 1 math0701262_1_21 10 4.33552455 MCAT
NTCIR12-MathIR-1 1 math0701262_1_20 11 4.33552455 MCAT
NTCIR12-MathIR-1 1 1201.0836_1_67 12 4.32558394 MCAT
NTCIR12-MathIR-1 1 1112.0483_1_50 13 4.30918135 MCAT
NTCIR12-MathIR-1 1 math0305174_1_44 14 4.30729695 MCAT

Labeled fields: QueryID, DocID, Rank, SystemRunID

Due to limited human resources, not all of the submitted results are evaluated!

Each row is marked either “Human Judgment” or “No Judgment”.
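One common way to decide which results receive a human judgment is depth-k pooling: take the top-ranked documents of every submitted run and judge the union. The deck does not state the exact pooling rule used in NTCIR Math, so the sketch below is a generic illustration; `build_pool`, the run names, and the pool depth are assumptions.

```python
def build_pool(runs: dict[str, list[str]], depth: int) -> set[str]:
    """Union of the top-`depth` documents from every run, for one query."""
    pool: set[str] = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool

# Hypothetical ranked lists from two runs for the same query.
runs = {
    "run_a": ["1307.6316_1_108", "math0503510_1_11", "1002.4757_1_52"],
    "run_b": ["math0503510_1_11", "1111.3507_1_8", "1011.6313_1_1"],
}
print(sorted(build_pool(runs, depth=2)))
```

Documents outside the pool simply have no judgment, which is why evaluation measures have to be read with the pool depth in mind.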

SLIDE 46

NTCIR Math-IR Task Framework


Ask evaluators to judge the relevance of the document to the target query

Topics Documents System Run Results Pooling Human Assessment Evaluation

SLIDE 47

SEPIA Snapshot


Given a pair of <query, retrieved document>, decide:

Relevant
Partially relevant
Not relevant

SLIDE 48

Human Assessment

2 assessments per retrieval unit
Relevant, Partially relevant, Not relevant
Use SEPIA with Math Extension

Agreement score for the two judges

                 arXiv-main   arXiv-simto   Wiki-main   Wiki-formula
docs             4,234        612           4,107       2,687
Fleiss’ kappa    0.5615       0.5380        0.3546      0.2619
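For readers unfamiliar with the agreement score, the following sketch computes Fleiss' kappa from per-item label lists using the standard textbook definition. The function name and the toy labels are made up for illustration and are not the NTCIR assessment data.

```python
from collections import Counter

def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa for items that were each labeled by the same number of raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    category_totals: Counter = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))
    observed = agreement_sum / n_items
    # Chance agreement from the overall category proportions.
    expected = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    # Note: undefined (division by zero) if every label is the same category.
    return (observed - expected) / (1 - expected)

# Toy example: two judges, three units, labels R / P / N.
print(fleiss_kappa([["R", "R"], ["N", "N"], ["P", "N"]]))
```

Values near 1 indicate strong agreement; the lower Wikipedia figures in the table suggest that relevance was harder to judge consistently there.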

SLIDE 49

NTCIR Math-IR Task Framework


Analyze the performance of the participating systems

Topics Documents System Run Results Pooling Human Assessment Evaluation

SLIDE 50

http://research.nii.ac.jp/ntcir/data/data-en.html


SLIDE 51

https://sigmathling.kwarc.info/


SLIDE 52

Importance of information need analysis

Yiannos A. Stathopoulos and Simone Teufel: Retrieval of research-level mathematical information needs: A Test Collection and Technical Terminology Experiment (ACL 2015)
Construct a test collection to analyze mathematical information needs
Collected data from MathOverflow

Tokinori Suzuki, Atsushi Fujii: Mathematical Document Categorization with Structure of Mathematical Expressions. (JCDL 2017)
Automatic mathematical document categorization (MDC), applied to MathOverflow and arXiv collections

SLIDE 53

Technical Challenges in NTCIR Math IR


SLIDE 54

Math IR Challenges

Approximate search over large-scale tree-structured data

Semantic similarity of mathematical formulae

SLIDE 55

Math IR Challenges

Approximate search over large-scale tree-structured data

Semantic similarity of mathematical formulae

SLIDE 56

Formula Tree: Statistics

About 60 million math formulae
Number of nodes per formula tree: much heterogeneity in the target collection

NTCIR-Math ArXiv Dataset

SLIDE 57

Token-based methods


Mathematical symbols are translated into keywords and indexed using a general search engine

Indexing of general search engine

Digital and computational solutions are becoming the prevalent means for the generation, communication, processing, storage and curation of mathematical information.

Bag-of-Words: become, mathematical, means, prevalent, solution

Sub-tree based indexing of math formulae

[Figure: formula tree with internal nodes apply, eq, divide, ci and leaves x, opposite, hypotenuse]
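Sub-tree based indexing can be sketched as follows: every subtree of a formula tree is serialized into one string token, and those tokens are then indexed like words. The serialization format, the function names, and the sample fragment are assumptions for this example rather than the scheme of any particular NTCIR system.

```python
import xml.etree.ElementTree as ET

def serialize(node: ET.Element) -> str:
    """Turn a subtree into a single string token."""
    children = list(node)
    if not children:
        text = (node.text or "").strip()
        return f"{node.tag}:{text}" if text else node.tag
    return node.tag + "(" + ",".join(serialize(child) for child in children) + ")"

def subtree_tokens(root: ET.Element) -> list[str]:
    """One token per node, covering the whole subtree rooted at that node."""
    return [serialize(node) for node in root.iter()]

# Hypothetical Content MathML fragment for opposite / hypotenuse.
formula = ET.fromstring(
    "<apply><divide/><ci>opposite</ci><ci>hypotenuse</ci></apply>"
)
for token in subtree_tokens(formula):
    print(token)
```

Because the tokens are plain strings, a general-purpose search engine can index them in the same inverted index it uses for the bag-of-words text.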

SLIDE 58

Query variable problem

Query variables

$f(x) = x$ becomes $f(\ast) = \ast$

unification

f(x) = y is not a “fixed point”!
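The point of the slide is that a query variable must bind to the same sub-expression everywhere it occurs. A minimal sketch of such one-way matching is shown below, with expressions written as nested tuples and query variables as strings starting with `?`; this encoding and the function name `match` are assumptions for illustration, not the algorithm of MathWebSearch or any other system.

```python
def match(query, target, bindings=None):
    """Return variable bindings if `query` matches `target`, else None."""
    bindings = {} if bindings is None else bindings
    if isinstance(query, str) and query.startswith("?"):
        if query in bindings:
            # A repeated query variable must match the same sub-expression.
            return bindings if bindings[query] == target else None
        return {**bindings, query: target}
    if isinstance(query, tuple) and isinstance(target, tuple):
        if len(query) != len(target):
            return None
        for q_part, t_part in zip(query, target):
            bindings = match(q_part, t_part, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if query == target else None

fixed_point_query = ("eq", ("f", "?v"), "?v")
print(match(fixed_point_query, ("eq", ("f", "x"), "x")))  # {'?v': 'x'}
print(match(fixed_point_query, ("eq", ("f", "x"), "y")))  # None
```

Replacing both occurrences by an unconstrained wildcard would wrongly accept f(x) = y, which is exactly the failure the slide warns about.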

SLIDE 59

Unification-based Methods

MathWebSearch (KWARC Project)

Query: $a^2 + \frac{b}{c}$

Math Information Retrieval team of Masaryk University (MIRMU)
Subtree Hash with Query Variables: MathCat@NII

SLIDE 60

Math IR Challenges

Approximate search over large-scale tree-structured data

Semantic similarity of mathematical formulae

SLIDE 61

Structural and semantic similarities


Similarity between two mathematical expressions is calculated based on the tree-structure similarity score

Tree-structure similarity = Math semantic similarity?

SLIDE 62

Correlation between Tree Edit Distance and Relevance Judgment

NTCIR-11 Math-2 Dataset
Human relevance judgment (1-5)
Correlation coefficient was -0.07692 !
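A correlation coefficient of this kind can be computed as in the sketch below, which implements the usual Pearson formula. The deck does not say which coefficient was used, and the distance and judgment values here are invented purely to make the example runnable.

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical tree edit distances and relevance judgments for five pairs.
distances = [0.2, 0.9, 0.4, 0.7, 0.5]
judgments = [3, 4, 1, 2, 5]
print(round(pearson(distances, judgments), 3))
```

A value close to zero, as reported on the slide, means tree edit distance alone says little about how assessors judged relevance.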

[Figure: two histograms over a 0.1 to 1.0 axis, one panel for “Not Relevant” pairs and one for “Strongly Relevant” pairs]

SLIDE 63

Distributional Hypothesis in NLP

“Words that occur in the same contexts tend to have similar meanings” (Harris, 1954).

“A word is characterized by the company it keeps” (Firth, 1957)

https://aclweb.org/aclwiki/Distributional_Hypothesis

A math formula is characterized by the natural language context in which it occurs.

SLIDE 64

Exploiting Context of Formulae

Step 1: Extract keywords in surrounding windows
Step 2: Extract descriptions of mathematical formulae (including symbols)
Step 3: Extract a dependency graph of math formulae
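Step 1 can be pictured with a small sketch that collects the words within a fixed window around each formula placeholder. The `MATH-...` placeholder pattern follows the sample sentences used elsewhere in this deck; the window size, the stopword list, and the function name are assumptions.

```python
import re

MATH_TOKEN = re.compile(r"MATH-[\w-]+")
STOPWORDS = {"a", "an", "and", "are", "for", "of", "the", "we", "where"}

def window_keywords(text: str, window: int = 4) -> dict[str, list[str]]:
    """Map each formula placeholder to nearby non-stopword tokens."""
    tokens = text.split()
    keywords: dict[str, list[str]] = {}
    for index, token in enumerate(tokens):
        found = MATH_TOKEN.match(token)
        if not found:
            continue
        neighbours = (
            tokens[max(0, index - window):index]
            + tokens[index + 1:index + 1 + window]
        )
        # Drop other placeholders, then strip punctuation and stopwords.
        cleaned = [w.strip(".,:;") for w in neighbours if not MATH_TOKEN.match(w)]
        keywords[found.group()] = [
            w.lower() for w in cleaned if w and w.lower() not in STOPWORDS
        ]
    return keywords

print(window_keywords("where MATH-w-6-7-0-1 denotes the hidden layer size"))
```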


SLIDE 65

Exploiting Context of Formulae

Step 1: Extract keywords in surrounding windows
Step 2: Extract descriptions of mathematical formulae (including symbols)
Step 3: Extract a dependency graph of math formulae

SLIDE 66

Search Extract Math formula expressed in MathML Natural language context surrounding the math formula Math database XML

Substructures of MathML trees Keywords extracted from math descriptions


SLIDE 67

Exploiting Context of Formulae

Step 1: Extract keywords in surrounding windows
Step 2: Extract descriptions of mathematical formulae (including symbols)
Step 3: Extract a dependency graph of math formulae

SLIDE 68

Extracting Math Descriptions

Denoting the concatenated d-dimensional word representations MATH-w-6-5-0-32, we have the following probability definition: MATH-p-6-6-0 where MATH-w-6-7-0-1 denotes the hidden layer size, MATH-w-6-7-0-8 and MATH-w-6-7-0-10 are the bias vectors for the output nodes and hidden nodes respectively.

Puyang Xu, Asela Gunawardana, Sanjeev Khudanpur: Efficient Subsampling for Training Complex Language Models, EMNLP 2011

SLIDE 69

Extracting Math Descriptions

MATH-w-6-5-0-32 ($r$): concatenated d-dimensional word representations
MATH-p-6-6-0: probability distribution
MATH-w-6-7-0-1 ($h$): hidden layer size

Denoting the concatenated d-dimensional word representations MATH-w-6-5-0-32, we have the following probability definition: MATH-p-6-6-0 where MATH-w-6-7-0-1 denotes the hidden layer size, MATH-w-6-7-0-8 and MATH-w-6-7-0-10 are the bias vectors for the output nodes and hidden nodes respectively.

$$P(w_i = k \mid w_{i-1}, \ldots, w_{i-n+1}) = \frac{e^{a_k}}{\sum_m e^{a_m}}, \qquad a_k = b_k + \sum_{l=1}^{h} W_{kl} \tanh\Big(c_l + \sum_{j=1}^{(n-1)d} U_{lj} r_j\Big)$$

SLIDE 70

Exploiting Context of Formulae

Step 1: Extract keywords in surrounding windows
Step 2: Extract descriptions of mathematical formulae (including symbols)
Step 3: Extract a dependency graph of math formulae

SLIDE 71

Extracting Math Descriptions

Denoting the concatenated d-dimensional word representations MATH-w-6-5-0-32, we have the following probability definition: MATH-p-6-6-0 where MATH-w-6-7-0-1 denotes the hidden layer size, MATH-w-6-7-0-8 and MATH-w-6-7-0-10 are the bias vectors for the output nodes and hidden nodes respectively.

$$P(w_i = k \mid w_{i-1}, \ldots, w_{i-n+1}) = \frac{e^{a_k}}{\sum_m e^{a_m}}, \qquad a_k = b_k + \sum_{l=1}^{h} W_{kl} \tanh\Big(c_l + \sum_{j=1}^{(n-1)d} U_{lj} r_j\Big)$$

MATH-p-6-6-0: probability distribution
MATH-w-6-5-0-32 ($r$): concatenated d-dimensional word representations
MATH-w-6-7-0-1 ($h$): hidden layer size
MATH-w-6-7-0-8 ($b$): bias vectors for the output nodes
MATH-w-6-7-0-10 ($c$): bias vectors for the hidden nodes

SLIDE 72

Extracting Math Descriptions

Denoting the concatenated d-dimensional word representations MATH-w-6-5-0-32, we have the following probability definition: MATH-p-6-6-0 where MATH-w-6-7-0-1 denotes the hidden layer size, MATH-w-6-7-0-8 and MATH-w-6-7-0-10 are the bias vectors for the output nodes and hidden nodes respectively.

$$P(w_i = k \mid w_{i-1}, \ldots, w_{i-n+1}) = \frac{e^{a_k}}{\sum_m e^{a_m}}, \qquad a_k = b_k + \sum_{l=1}^{h} W_{kl} \tanh\Big(c_l + \sum_{j=1}^{(n-1)d} U_{lj} r_j\Big)$$

MATH-p-6-6-0: probability distribution
MATH-w-6-5-0-32 ($r$): concatenated d-dimensional word representations
MATH-w-6-7-0-1 ($h$): hidden layer size
MATH-w-6-7-0-8 ($b$): bias vectors for the output nodes
MATH-w-6-7-0-10 ($c$): bias vectors for the hidden nodes

..., as shown in (1), where $w_i$ is the i-th word, and $h_i$ is the word history preceding $w_i$
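A very small pattern-based extractor gives a feel for Step 2. It only handles the single pattern "X denotes the Y" found in the sample sentence; real description extraction needs many more patterns or a trained model, and the regular expression and function name here are assumptions made for illustration.

```python
import re

# Matches "<placeholder> denotes the <description>" up to the next comma or period.
DENOTES = re.compile(r"(MATH-[\w-]+) denotes the ([a-z -]+?)(?=[,.]|$)")

def extract_descriptions(sentence: str) -> dict[str, str]:
    """Map formula placeholders to the noun phrase that describes them."""
    return {m.group(1): m.group(2) for m in DENOTES.finditer(sentence)}

sentence = (
    "where MATH-w-6-7-0-1 denotes the hidden layer size, MATH-w-6-7-0-8 and "
    "MATH-w-6-7-0-10 are the bias vectors for the output nodes and hidden "
    "nodes respectively."
)
print(extract_descriptions(sentence))
```

The second half of the sentence ("... are the bias vectors for the output nodes and hidden nodes respectively") is not captured, which shows why coordination is listed among the open NLP issues later in the deck.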

SLIDE 73

Dependency Graph

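$$u_{ij} = \sum_{k=1}^{j} (-1)^{i-k} \binom{i-1}{k-1} \binom{n-i}{j-k} a^{2k-i-1}$$

[Figure: dependency graph linking the formula to its symbols $a$, $j$, $k$, $n$ and to extracted descriptions: the golden ratio, a root, rows, the upper parameter]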

Giovanni Yoko Kristianto, Goran Topic, Akiko Aizawa: “Utilizing Dependency Relationships between Math Expressions in Math IR.” Information Retrieval Journal

Query: $x^2 - x - 1 = 0$ (golden ratio; Fibonacci)
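One simple reading of the dependency idea is that a formula inherits the textual descriptions of the symbols it contains, so a keyword query such as "golden ratio" can reach a formula whose own context never mentions that phrase. The sketch below illustrates only this reading; the data layout, the function names, and the assignment of descriptions to the symbol `a` are assumptions, not the method of the cited paper.

```python
def inherit_descriptions(
    formula_symbols: dict[str, set[str]],
    symbol_descriptions: dict[str, list[str]],
) -> dict[str, list[str]]:
    """Attach to each formula the descriptions of the symbols it depends on."""
    inherited: dict[str, list[str]] = {}
    for formula_id, symbols in formula_symbols.items():
        texts: list[str] = []
        for symbol in sorted(symbols):
            texts.extend(symbol_descriptions.get(symbol, []))
        inherited[formula_id] = texts
    return inherited

def search(keyword: str, inherited: dict[str, list[str]]) -> list[str]:
    """Return formulas with at least one inherited description containing the keyword."""
    return [fid for fid, texts in inherited.items() if any(keyword in t for t in texts)]

# Hypothetical inputs: one formula and the symbols it uses.
formula_symbols = {"u_ij": {"a", "j", "k", "n"}}
symbol_descriptions = {"a": ["the golden ratio", "a root"]}
index = inherit_descriptions(formula_symbols, symbol_descriptions)
print(search("golden ratio", index))  # ['u_ij']
```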

SLIDE 74

NLP Issues in dependency graph extraction

Query: $x^2 - x - 1 = 0$ (golden ratio; Fibonacci)

$$u_{ij} = \sum_{k=1}^{j} (-1)^{i-k} \binom{i-1}{k-1} \binom{n-i}{j-k} a^{2k-i-1}$$

[Figure: dependency graph linking the formula to its symbols $a$, $j$, $k$, $n$ and to extracted descriptions: the golden ratio, a root, rows, the upper parameter]

Unit/class identification: $x, x_1, x_2, \ldots$
Symbols expressed as non-math symbols: x, b, …
Math expression as a text element (but not as a simple noun): where $a = b$.
Coordination: Let $\mu$ and $\sigma$ be the mean and standard deviation of the normal distribution, respectively.

SLIDE 75

A Navigator for Math browsing

$$P(w_i = k \mid w_{i-1}, \ldots, w_{i-n+1}) = \frac{e^{a_k}}{\sum_m e^{a_m}}, \qquad a_k = b_k + \sum_{l=1}^{h} W_{kl} \tanh\Big(c_l + \sum_{j=1}^{(n-1)d} U_{lj} r_j\Big)$$

Description: probability distribution

Components
$w_i$: the i-th word
$r$: concatenated d-dimensional word representations
h: hidden layer size
b: bias vectors for the output nodes
c: bias vectors for the hidden nodes

SLIDE 76

Understanding mathematical expressions: the eye movement


Andrea Kohlhase, Michael Kohlhase and Taweechai Ouypornkochagorn. Discourse Phenomena in Math Documents (Next!)

SLIDE 77

Automatic Query Generation

$$\frac{e^{a_k}}{\sum_m e^{a_m}}$$

Puyang Xu, Asela Gunawardana, Sanjeev Khudanpur: Efficient Subsampling for Training Complex Language Models, EMNLP 2011

SLIDE 78

Math Understanding as AI Challenges


SLIDE 79

Surface and Semantic levels in NLP

Recognizing entities in text (word-level ambiguity)
Apple: Malus domestica

Recognizing relations between entities (sentence-level ambiguity)
break (PropBank) Roles: Arg0: breaker, Arg1: thing broken, Arg2: instrument, Arg3: pieces

SLIDE 80

Recognizing entities in text


“the task of determining the identity of entities mentioned in text”

Where would McCain be without Sarah?

John McCain Sarah Palin

Text in a document Entities in a knowledgebase (ex. Wikipedia articles)

Mention

There can be multiple or no candidates

SLIDE 81

Recognizing relations between entities

“the task of identifying relations between entities in text”

https://free-pal.appspot.com/

Two entities: Entity X, Entity Y
Syntactic pattern, e.g. [X] married [Y]
Relation, e.g. Marry(X, Y)

SLIDE 82

Two Component Techniques for Computer Math Understanding


Math entity recognition

Tokenize mathematical symbols
Identifying their types (constant, variable, operator) …
Example: sin(x), where sin: operator and x: variable

Math parsing

Identifying the semantic structure
Presentation to Content MathML conversion
Example: y=f(x), with content tree apply(eq, y, apply(f, x))

POM tagger
Abdou Youssef: Part-of-math tagging and applications (CICM-2017)

Variable Typing
Yiannos A. Stathopoulos, Simon Baker, Marek Rei and Simone Teufel: Variable Typing: Assigning Meaning to Variables in Mathematical Text (NAACL 2018)
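The tokenize-and-type step can be illustrated with a toy tagger. It is far simpler than the part-of-math tagger cited above: the token pattern, the list of known function names, and the tag set are assumptions chosen only to reproduce the sin(x) example.

```python
import re

# A number, a run of letters, or any single non-space symbol.
TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|([A-Za-z]+)|(\S))")
KNOWN_FUNCTIONS = {"sin", "cos", "tan", "log", "exp"}  # illustrative list

def tag_tokens(expression: str) -> list[tuple[str, str]]:
    """Split an expression into tokens and guess a coarse type for each."""
    tags: list[tuple[str, str]] = []
    for number, name, symbol in TOKEN.findall(expression):
        if number:
            tags.append((number, "constant"))
        elif name:
            kind = "operator" if name in KNOWN_FUNCTIONS else "variable"
            tags.append((name, kind))
        else:
            kind = "operator" if symbol in "+-*/^=" else "punctuation"
            tags.append((symbol, kind))
    return tags

print(tag_tokens("sin(x)"))
```

A lookup table like this cannot decide whether a bare `f` is a function or a frequency, which is where context-based variable typing comes in.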

SLIDE 83

Semantic Parsing of Natural Language Sentences

What is “semantic parsing”?

Translate natural language into a formal representation on which a machine can act.

Applications: search engine queries, robot control, math problem solver
Formal representations: lambda calculus, first order predicate logic, SQL (database queries)

SLIDE 84

NLP Semantic Parsing

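CCG (Combinatory Categorial Grammar)

Mark Steedman (2000) The Syntactic Process

Anna married Manny

$$(\lambda x. \lambda y.\, \mathit{marry}'\, x\, y)\ \mathit{manny}'\ \mathit{anna}' = (\lambda y.\, \mathit{marry}'\, \mathit{manny}'\, y)\ \mathit{anna}' = \mathit{marry}'\, \mathit{manny}'\, \mathit{anna}'$$

CCG categories: Anna: NP; married: (S\NP)/NP; Manny: NP; married Manny: S\NP; whole sentence: S
Phrase structure labels: NP, NP, V, VP, S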
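The lambda-calculus reduction for "Anna married Manny" can be mimicked with curried Python functions. This is only an analogy for how the verb consumes its object and then its subject; the tuple used as the logical form and the function names are assumptions.

```python
def married(obj: str):
    """Curried meaning of 'married': takes the object first, then the subject."""
    def with_subject(subj: str) -> tuple[str, str, str]:
        return ("marry", obj, subj)
    return with_subject

# (S\NP)/NP applied to the object NP gives S\NP ...
married_manny = married("manny")
# ... which applied to the subject NP gives the sentence meaning.
print(married_manny("anna"))  # ('marry', 'manny', 'anna')
```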

SLIDE 85

NLP Semantic Parsing

Semantic Parsing with Execution


https://nlp.stanford.edu/software/sempre/

Jonathan Berant, Andrew Chou, Roy Frostig, Percy Liang. Semantic Parsing on Freebase from Question-Answer Pairs. Empirical Methods in Natural Language Processing (EMNLP), 2013.

Which college did Obama go to ?

Type.University alignment alignment BarackObama Education bridging Type.University ∩ Education.BarackObama

Occidental College, Columbia University Execute on Database
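Executing the logical form `Type.University ∩ Education.BarackObama` amounts to a set intersection over the knowledge base. The sketch below uses a tiny hand-made knowledge base, not Freebase; its extra entries are hypothetical and exist only so that the intersection does some filtering.

```python
# Toy knowledge base: each predicate maps to the set of entities it holds for.
# Entries beyond the two answers shown on the slide are hypothetical.
TOY_KB: dict[str, set[str]] = {
    "Type.University": {"Occidental College", "Columbia University", "Example University"},
    "Education.BarackObama": {"Occidental College", "Columbia University", "Example High School"},
}

def execute_intersection(left: str, right: str, kb: dict[str, set[str]]) -> set[str]:
    """Evaluate `left ∩ right` against the knowledge base."""
    return kb[left] & kb[right]

answer = execute_intersection("Type.University", "Education.BarackObama", TOY_KB)
print(sorted(answer))  # ['Columbia University', 'Occidental College']
```

In a question-answer training setup, only the final answer set is compared with the gold answer, so the parser can be trained without annotated logical forms.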

SLIDE 86

University Entrance Exam

Find the range of $x$ when the real numbers $x$, $y$, and $z$ satisfy the following formula.

Natural language description + math formula: $x + y + z = 3,\ xy + yz + zx = -9$

First order predicate logic: $\exists y\, \exists z\, (x + y + z = 3 \land xy + yz + zx = -9)$

Quantifier Elimination Method (cited from the book "Algorithms of Quantifier Elimination and Their Applications" by Anai and Yokoyama)

Result: $-3 \le x \le 5$
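For this particular problem the quantifiers can also be eliminated by hand, which makes the result easy to check. The derivation below is an elementary worked version using the discriminant condition for real $y$ and $z$; it is not the general quantifier elimination algorithm from the cited book.

```latex
\begin{align*}
y + z &= 3 - x, \\
yz &= -9 - x(y + z) = x^2 - 3x - 9, \\
y, z \in \mathbb{R}
  &\iff (y + z)^2 - 4yz \ge 0 \\
  &\iff (3 - x)^2 - 4(x^2 - 3x - 9) \ge 0 \\
  &\iff -3x^2 + 6x + 45 \ge 0 \\
  &\iff (x + 3)(x - 5) \le 0 \\
  &\iff -3 \le x \le 5.
\end{align*}
```

The second line uses $xy + yz + zx = x(y+z) + yz$, and the third uses the fact that $y$ and $z$ are the roots of $t^2 - (y+z)t + yz = 0$.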

SLIDE 87

“Deep” Understanding


Math

Low dimensional continuous vector representation

Executable code

Translation

SLIDE 88

Formula Embeddings

Liangcai Gao, Zhuoren Jiang, Yue Yin, Ke Yuan, Zuoyu Yan, Zhi Tang: Preliminary Exploration of Formula Embedding for Mathematical Information Retrieval: can mathematical formulae be embedded like a natural language? CIKM 2017 Workshop on Interpretable Data Mining (IDM): Bridging the Gap between Shallow and Deep Models

Kriste Krstovski and David M. Blei: Equation Embeddings, arXiv:1803.09123v1 (2018)

SLIDE 89


Liangcai Gao, Zhuoren Jiang, Yue Yin, Ke Yuan, Zuoyu Yan, Zhi Tang: Preliminary Exploration of Formula Embedding for Mathematical Information Retrieval: can mathematical formulae be embedded like a natural language? CIKM 2017 Workshop on Interpretable Data Mining (IDM): Bridging the Gap between Shallow and Deep Models

SLIDE 90

Neural Translation for Math Word Problems

The sum of 2 numbers is 18. The first number is 4 more than the second number. Find the two numbers.
Equations: $x + y = 18,\ x = y + 4$

The sum of 3 numbers is 15. The larger number is 4 times the smallest and the middle number is 5. What are the numbers?
Equations: $x + y + z = 15,\ x = 4z,\ y = 5$

Danqing Huang, Jin-Ge Yao, Chin-Yew Lin, Qingyu Zhou, Jian Yin: Using Intermediate Representations to Solve Math Word Problems, ACL-2018

translate translate
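Once a word problem has been translated into equations, solving them is routine. The sketch below solves the first example (sum 18, difference 4) with Cramer's rule over exact fractions; the solver is a generic illustration and is not part of the cited neural translation system.

```python
from fractions import Fraction

def solve_2x2(a11, a12, b1, a21, a22, b2) -> tuple[Fraction, Fraction]:
    """Solve a11*p + a12*q = b1 and a21*p + a22*q = b2 by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("the system has no unique solution")
    p = Fraction(b1 * a22 - a12 * b2, det)
    q = Fraction(a11 * b2 - b1 * a21, det)
    return p, q

# first + second = 18 and first - second = 4
first, second = solve_2x2(1, 1, 18, 1, -1, 4)
print(first, second)  # 11 7
```

The hard part of the task is therefore the translation step, not the arithmetic that follows it.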

SLIDE 91

NLP and Source Code

Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018. A Survey of Machine Learning for Big Code and Naturalness. ACM Comput. Surv. 51, 4, Article 81 (July 2018), 37 pages. DOI: https://doi.org/10.1145/3212695

Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit the abundance of patterns of code. In this article, we survey this work. We contrast programming languages against natural languages and discuss how these similarities and differences drive the design of probabilistic models. We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review

SLIDE 92

Deep Learning @CICM 2018 Today!

Qingxiang Wang, Cezary Kaliszyk and Josef Urban: First experiments with Neural Translation of Informal to Formal Mathematics.

Abdou Youssef and Bruce Miller: Deep Learning for Math Knowledge Processing

Felix Petersen, Moritz Schubotz and Bela Gipp: Towards Formula Translation using Recursive Neural Networks

SLIDE 93

Semantic Parsing of Natural Language Sentences

What is “semantic parsing”?

Translate natural language into a formal representation on which a machine can act.

Applications: search engine queries, robot control, math problem solver
Formal representations: lambda calculus, first order predicate logic, SQL (database queries)

SLIDE 94

Final Remarks

“Reading for Computing”

➢ (Long term research goal) Recognize relationships between real problems, formal models, and computation methods
➢ (Current target) Retrieval and understanding of mathematical formulae

SLIDE 95

Thank you!
