Formula Search Akiko Aizawa aizawa@nii.ac.jp National Institute of - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Natural Language Processing Techniques for Mathematical Formula Search

Akiko Aizawa aizawa@nii.ac.jp National Institute of Informatics


slide-2
SLIDE 2

About myself

Affiliation

  • National Institute of Informatics (Digital Content Media Science Division)
  • Research Center for Knowledge Media and Content Science
  • The University of Tokyo (Computer Science Department)
  • Graduate School for Advanced Studies (Department of Informatics)

Keywords

  • Text Processing
  • Natural Language Processing
  • Information Retrieval
  • Knowledge Engineering

Akiko Aizawa

slide-3
SLIDE 3

Today’s Talk

  • Introduction: Math Formula in Scientific Literature
  • Mathematical Formula in Natural Language Text
  • NTCIR: Math Information Retrieval Task
  • Technical Challenges in NTCIR Math IR
  • Math Understanding as AI Challenges

CICM 2018 (3)

slide-4
SLIDE 4

Introduction: Math Formula in Scientific Literature


slide-5
SLIDE 5

Searching Math Formulae on the Web

Pythagorean theorem

c = √(a² + b²)

BC² + AC² = AB²

slide-6
SLIDE 6

Searching Math Formulae on the Web

Pythagorean theorem

c = √(a² + b²)

BC² + AC² = AB²

Shahab Kamali and Frank Wm. Tompa. 2010. A new mathematics retrieval system. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM 2010).

“Given a mathematical expression, the problem is to find expressions that match it.”

slide-7
SLIDE 7

3R’s for Math Formula Search

  • Representations: Is there a standard representation for math formulae?
  • Resources: Are there many documents with math formulae?
  • Requirement: Are there many users who need math formulae?

slide-8
SLIDE 8

3R’s for Math Formula Search

Representations

MathML (W3C recommendation)

Web-browsable XML

<math xmlns='http://www.w3.org/1998/Math/MathML' mathematica:form='TraditionalForm' xmlns:mathematica='http://www.wolfram.com/XML/'> <semantics> <mrow> <mrow> <mrow> <mrow> <mi> log </mi> <mo> &#8289; </mo> <mo> ( </mo> <msub> <mi> z </mi> <mn> 1 </mn> </msub> …

XML for math semantics

<annotation-xml encoding='MathML-Content'> <apply> <ci> Condition </ci> <apply> <eq /> <apply> <plus /> <apply> <ln /> <apply> <ci> Subscript </ci> <ci> z </ci> <cn type='integer'> 1 </cn> </apply> </apply> <apply> <ln /> <apply> <ci> Subscript </ci> …

Resources

  • NIST Digital Library of Mathematical Functions
  • Wolfram Functions Site: 307,409 formulas
  • Wikipedia: 16,962 math articles assessed by WikiProject Mathematics
  • Many, many scientific articles

Requirement

  • Q. Is mathematics related to your research?
  • 77% of researchers across a diversity of disciplines answered ‘YES’ (Somewhat related, Related, or Strongly related).
  • Source: NISTEP Policy Study

slide-9
SLIDE 9

3R’s for Math Formula Search

  • Representations: Is there a standard representation for math formulae?
  • Resources: Are there many documents with math formulae?
  • Requirement: Are there many users who need math formulae?

Still, math formula search is a tough problem

slide-10
SLIDE 10

Why is math formula search difficult?

Math Formula Search Math Formula Understanding

slide-11
SLIDE 11


Presentation level Content level Computation level

Three Representation Levels of Math Formulae

slide-12
SLIDE 12

Presentation level / Content level / Computation level

Presentation level: for printing/displaying

X squared: X²

LaTeX, MathML Presentation Markup, PDF, …

slide-13
SLIDE 13

Presentation level / Content level / Computation level

Content level: for “semantics”

SnuggleTeX: https://www2.ph.ed.ac.uk/snuggletex/UpConversionDemo

<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"> <apply> <power/> <ci>X</ci> <cn>2</cn> </apply> </math>

X squared

slide-14
SLIDE 14

Presentation level / Content level / Computation level

Computation level: for “computing”

X squared

power(X, 2), X ** 2, math.pow(X, 2), X * X, …
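A minimal sketch of the step from the content level to the computation level, assuming the Content MathML fragment for "X squared" shown on the previous slide. The operator table and the `evaluate` helper are illustrative names, not part of any MathML tool, and only a few operators are covered.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "{http://www.w3.org/1998/Math/MathML}"

# Content MathML for "X squared", as on the content-level slide.
CONTENT_MATHML = """
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <apply> <power/> <ci>X</ci> <cn>2</cn> </apply>
</math>
"""

# Hypothetical operator table: Content MathML element name -> Python function.
OPERATORS = {
    "power": lambda base, exponent: base ** exponent,
    "plus": lambda *args: sum(args),
    "times": lambda a, b: a * b,
}


def local_name(element):
    """Strip the MathML namespace from an element tag."""
    return element.tag.replace(MATHML_NS, "")


def evaluate(element, variables):
    """Recursively evaluate a small subset of Content MathML."""
    name = local_name(element)
    if name == "math":
        return evaluate(element[0], variables)
    if name == "cn":
        return float(element.text)
    if name == "ci":
        return variables[element.text.strip()]
    if name == "apply":
        operator, *operands = list(element)
        function = OPERATORS[local_name(operator)]
        return function(*(evaluate(child, variables) for child in operands))
    raise ValueError(f"unsupported element: {name}")


root = ET.fromstring(CONTENT_MATHML)
print(evaluate(root, {"X": 3.0}))  # X squared with X = 3
```

The same tree could equally be printed as `power(X, 2)` or `X ** 2`; the point is that the content level fixes the operator and its arguments, while the computation level picks one executable form.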

slide-15
SLIDE 15


Presentation level Content level Computation level

Computer understanding of math formulae

slide-16
SLIDE 16

Mathematical Formulae in Natural Language Text


slide-17
SLIDE 17

Math in Scientific Research: The Cultural Difference

  • Math for mathematics
  • Math for many other research fields


slide-18
SLIDE 18

Math in Scientific Research: The Cultural Difference

a² + b² = c² (Pythagorean theorem)

  • A hypothesis that needs to be proven
  • Description of a real world (which is assumed to be true)

“True” or “False”?

What is c when a = 3 and b = 4?

slide-19
SLIDE 19

Math in Scientific Research: The Cultural Difference

  • Math for mathematics: a hypothesis that needs to be proven. Natural language text is a complement to a mathematical proof.
  • Math for many other research fields: a description of a real world (which is assumed to be true). Math formulae are a part of natural language semantics.

slide-20
SLIDE 20

Math and NLP: Observations

  • Inseparability: A math formula often appears as an indispensable component of a sentence.
  • Ambiguity: Just like natural language text, math formulae in a document carry a certain ambiguity that should be resolved using their context.
  • Translatability: Math formulae can be translated into natural language sentences, and sometimes vice versa.

slide-21
SLIDE 21

Typical page layout of a scientific paper

[Figure: page layout regions of a scientific paper: section title, text boxes, figures, figure captions, page number]

slide-22
SLIDE 22

Typical Scientific “Text”

Puyang Xu, Asela Gunawardana, Sanjeev Khudanpur: Efficient Subsampling for Training Complex Language Models, EMNLP 2011

slide-23
SLIDE 23

Math formulae in a sentence

Denoting the concatenated d-dimensional word representations MATH-w-6-5-0-32, we have the following probability definition: MATH-p-6-6-0 where MATH-w-6-7-0-1 denotes the hidden layer size, MATH-w-6-7-0-8 and MATH-w-6-7-0-10 are the bias vectors for the output nodes and hidden nodes respectively.

Puyang Xu, Asela Gunawardana, Sanjeev Khudanpur: Efficient Subsampling for Training Complex Language Models, EMNLP 2011

slide-24
SLIDE 24

Math and NLP: Observations

  • Inseparability: A math formula often appears as an indispensable component of a sentence.
  • Ambiguity: Just like natural language text, math formulae in a document carry a certain ambiguity that should be resolved using their context.
  • Translatability: Math formulae can be translated into natural language sentences, and sometimes vice versa.

slide-25
SLIDE 25

Ambiguity of mathematical formulae

f: variable, or f: function?

f as a function:

  • f(x): “Intuitively, if z is a function g of y and y is a function f of x, then z is a function of x.”
  • f(a + b): “For which of the following functions is f(a+b) = f(a) + f(b) for all positive numbers a and b?”

f as a variable:

  • 1/f: “... where f is a frequency.”
  • 2πf: “Can anyone explain to me how angular frequency (w) = 2pi(f)?”

Context information becomes crucial for semantic disambiguation

slide-26
SLIDE 26

Ambiguity of mathematical formulae

Corinne’s Shibboleth: Suppose the temperature on a rectangular slab of metal is given by T(x, y) = k(x² + y²), where k is a constant. What is T(r, θ)?

A: T(r, θ) = kr²
B: T(r, θ) = k(r² + θ²)

Mathematicians / Physicists

Sometimes, “commonsense” is also important for disambiguation

Dray, T. & Manogue, C. (2002). Vector calculus bridge project website, http://www.math.oregonstate.edu/bridge/ideas/functions
Redish, E. F., & Kuo, E. (2015). Language of physics, language of math: Disciplinary culture and dynamic epistemology. Science & Education, 24(5-6), 561-590.
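The two answers differ only in how the symbol T is read. A short worked version, assuming the usual polar substitution x = r cos θ, y = r sin θ (the substitution is not spelled out on the slide):

```latex
\begin{align*}
T(x, y) &= k(x^2 + y^2), \qquad x = r\cos\theta,\quad y = r\sin\theta \\
\text{A: } T \text{ names the temperature at a point:}\quad
  T(r, \theta) &= k\,(r^2\cos^2\theta + r^2\sin^2\theta) = k r^2 \\
\text{B: } T \text{ names a function of two arguments:}\quad
  T(r, \theta) &= k\,(r^2 + \theta^2)
\end{align*}
```

In the cited discussions, answer A is the one typically given by physicists and answer B the one typically given by mathematicians; neither is a calculation error, which is why context alone may not settle the reading.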

slide-27
SLIDE 27

Math and NLP: Observations

  • Inseparability: A math formula often appears as an indispensable component of a sentence.
  • Ambiguity: Just like natural language text, math formulae in a document carry a certain ambiguity that should be resolved using their context.
  • Translatability: Math formulae can be translated into natural language sentences, and sometimes vice versa.

slide-28
SLIDE 28

Math to NLP Translation

Braille Mathematics Notation

RNIB, Braille Mathematics Notation 1987. 1987, Peterborough: Royal National Institute for the Blind.

ChattyInfty

Reading software for mathematical documents, from the nonprofit organization Science Accessibility Net

http://www.sciaccess.net/en/ChattyInfty/

Prof. Masakazu Suzuki

slide-29
SLIDE 29

Math formula as a sentence

Y equals X squared.

Y = X².

slide-30
SLIDE 30

Math and NLP: Observations

  • Inseparability: A math formula often appears as an indispensable component of a sentence.
  • Ambiguity: Just like natural language text, math formulae in a document carry a certain ambiguity that should be resolved using their context.
  • Translatability: Math formulae can be translated into natural language sentences, and sometimes vice versa.

Semantic analysis of math formulae can be considered as one variation of NLP semantic parsing

slide-31
SLIDE 31

NTCIR: Math Information Retrieval Tasks


slide-32
SLIDE 32

Evaluation of retrieval systems

[Figure: ranked result lists returned by different systems, with each document judged Relevant or Not relevant]

For the same document collection and the same query, retrieval systems return different document rankings. The returned documents are evaluated manually, and the systems are then compared on those judgments.
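A minimal sketch of how two ranked lists can be compared once manual judgments exist. The judgment lists below are hypothetical and are not read off the figure. Precision at k and a simplified average precision (normalized by the relevant documents actually retrieved) stand in here for whatever measures a given task uses.

```python
def precision_at_k(judgments, k):
    """Fraction of the top-k returned documents that were judged relevant."""
    top = judgments[:k]
    return sum(top) / len(top)


def average_precision(judgments):
    """Mean of precision values at each rank where a relevant document appears.

    Simplification: normalizes by the number of relevant documents retrieved,
    not by the number of relevant documents in the whole collection.
    """
    hits, total = 0, 0.0
    for rank, relevant in enumerate(judgments, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical manual judgments for the same query, in rank order.
system_a = [False, False, True, True, True, True]
system_b = [True, False, True, False, False, False]

for name, run in (("A", system_a), ("B", system_b)):
    print(name, precision_at_k(run, 3), round(average_precision(run), 4))
```

System A returns more relevant documents overall, while system B places its relevant documents earlier; which system counts as better depends on the measure chosen.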

slide-33
SLIDE 33

Why is an IR task necessary?

  • Enable comparison between IR systems
  • Collect documents
  • Provide example queries
  • Reproduce the IR experiments
  • Define the target problem
  • Identify the challenges


slide-34
SLIDE 34

NTCIR Math

  • Information retrieval tasks dedicated to math formula search
  • NTCIR-10 Math Pilot Task (2012.2-2013.6)
  • NTCIR-11 Math-2 (-2014.12)
  • NTCIR-12 MathIR (-2016.6)
  • Organizers
  • Akiko Aizawa
  • Michael Kohlhase
  • Iadh Ounis
  • Richard Zanibbi *NTCIR12
  • Moritz Schubotz *NTCIR11, PhD student

arXiv, Wikipedia

slide-35
SLIDE 35

General IR Task Framework

Topics + Documents → System Run → Results → Pooling → Human Assessment → Evaluation

NTCIR-Math Policy: Keep it as simple as possible!

slide-36
SLIDE 36

NTCIR Math-IR Task Framework

Topics + Documents → System Run → Results → Pooling → Human Assessment → Evaluation

Collect documents with Math formulae in MathML format

slide-37
SLIDE 37

Math Retrieval Task : Dataset Example

papers from arXiv.org xhtml/xml math representation (MathML)


KWARC Project

slide-38
SLIDE 38
Document Collection (arXiv Task)

(A collection of large-scale XML data)

  • About 105,120 scientific articles from arXiv.org
  • arXiv categories: math, cs, physics:math-ph, stat, physics:hep-th, physics:nlin
  • Converted from LaTeX to an HTML+MathML-based format by the KWARC project (http://kwarc.info/).
  • Each paper is divided into paragraphs, which are used as retrieval units
  • A total of 8,301,578 search units with about 60 million math formulae, including monomial expressions.

slide-39
SLIDE 39

NTCIR Math-IR Task Framework

Topics + Documents → System Run → Results → Pooling → Human Assessment → Evaluation

Prepare “queries” used for evaluation

slide-40
SLIDE 40

Topic example

Math formula + keywords, expressed in the MathML format with “query variables”

slide-41
SLIDE 41

NTCIR Math-IR Task Framework

Topics + Documents → System Run → Results → Pooling → Human Assessment → Evaluation

Distribute the documents and queries and ask the participants to submit their retrieval results

slide-42
SLIDE 42

Retrieval Result Submission Format

NTCIR12-MathIR-1 1 1307.6316_1_108 1 5.06740975 MCAT
NTCIR12-MathIR-1 1 math0503510_1_11 2 4.70783107 MCAT
NTCIR12-MathIR-1 1 math-ph0105008_1_4 3 4.62195235 MCAT
NTCIR12-MathIR-1 1 1002.4757_1_52 4 4.55444812 MCAT
NTCIR12-MathIR-1 1 math-ph0108005_1_1 5 4.49042815 MCAT
NTCIR12-MathIR-1 1 1111.3507_1_8 6 4.47622183 MCAT
NTCIR12-MathIR-1 1 1011.6313_1_1 7 4.47066163 MCAT
NTCIR12-MathIR-1 1 quant-ph9904101_1_41 8 4.36044515 MCAT
NTCIR12-MathIR-1 1 1201.5232_1_35 9 4.33834295 MCAT
NTCIR12-MathIR-1 1 math0701262_1_21 10 4.33552455 MCAT
NTCIR12-MathIR-1 1 math0701262_1_20 11 4.33552455 MCAT
NTCIR12-MathIR-1 1 1201.0836_1_67 12 4.32558394 MCAT
NTCIR12-MathIR-1 1 1112.0483_1_50 13 4.30918135 MCAT
NTCIR12-MathIR-1 1 math0305174_1_44 14 4.30729695 MCAT
NTCIR12-MathIR-1 1 math0603478_1_45 15 4.27833455 MCAT

Labeled fields: QueryID, DocID, Rank, SystemRunID
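A minimal sketch of reading one line of this submission format. The slide labels only four of the six whitespace-separated fields, so treating the second field as an unused constant and the fifth as a retrieval score is an assumption, and `RunEntry` and `parse_run_line` are illustrative names.

```python
from typing import NamedTuple


class RunEntry(NamedTuple):
    query_id: str
    doc_id: str
    rank: int
    score: float  # assumed meaning of the fifth field
    run_id: str


def parse_run_line(line):
    """Split one submission line into its six whitespace-separated fields."""
    query_id, _unused, doc_id, rank, score, run_id = line.split()
    return RunEntry(query_id, doc_id, int(rank), float(score), run_id)


# Two rows copied from the slide.
SAMPLE = """\
NTCIR12-MathIR-1 1 1307.6316_1_108 1 5.06740975 MCAT
NTCIR12-MathIR-1 1 math0503510_1_11 2 4.70783107 MCAT
"""

entries = [parse_run_line(line) for line in SAMPLE.splitlines() if line.strip()]
for entry in entries:
    print(entry.rank, entry.doc_id, entry.score)
```

The document identifiers end in paragraph-level suffixes such as `_1_108`, which is consistent with paragraphs being the retrieval unit of the arXiv task.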

slide-43
SLIDE 43

Submissions (NTCIR-12 Math-IR)

47 submitted runs

GroupID  Organization                           Country  Runs
ICST     Peking Univ.                           CN       1, 1
FSE      TU Berlin & Univ. of Konstanz          DE       1
MCAT     National Institute of Informatics      JP       4, 4, 4, 3
MIRMU    Masaryk Univ.                          CZ       4, 4, 4
RITUW    Rochester Inst. Tech & Univ. Waterloo  US, CA   4, 4, 4
SMSG5    Samsung R&D India-Bangalore            IN       1, 4

Run counts are listed in subtask order (arXiv main, arXiv simto, Wiki main, Wiki formula); subtasks without a submission are omitted.

slide-44
SLIDE 44

NTCIR Math-IR Task Framework

Topics + Documents → System Run → Results → Pooling → Human Assessment → Evaluation

Decide document-query pairs for manual evaluation

slide-45
SLIDE 45

Retrieval Result Submission Format

NTCIR12-MathIR-1 1 1307.6316_1_108 1 5.06740975 MCAT
NTCIR12-MathIR-1 1 math0503510_1_11 2 4.70783107 MCAT
NTCIR12-MathIR-1 1 math-ph0105008_1_4 3 4.62195235 MCAT
NTCIR12-MathIR-1 1 1002.4757_1_52 4 4.55444812 MCAT
NTCIR12-MathIR-1 1 math-ph0108005_1_1 5 4.49042815 MCAT
NTCIR12-MathIR-1 1 1111.3507_1_8 6 4.47622183 MCAT
NTCIR12-MathIR-1 1 1011.6313_1_1 7 4.47066163 MCAT
NTCIR12-MathIR-1 1 quant-ph9904101_1_41 8 4.36044515 MCAT
NTCIR12-MathIR-1 1 1201.5232_1_35 9 4.33834295 MCAT
NTCIR12-MathIR-1 1 math0701262_1_21 10 4.33552455 MCAT
NTCIR12-MathIR-1 1 math0701262_1_20 11 4.33552455 MCAT
NTCIR12-MathIR-1 1 1201.0836_1_67 12 4.32558394 MCAT
NTCIR12-MathIR-1 1 1112.0483_1_50 13 4.30918135 MCAT
NTCIR12-MathIR-1 1 math0305174_1_44 14 4.30729695 MCAT

Labeled fields: QueryID, DocID, Rank, SystemRunID

Because human resources are limited, not all of the submitted results are evaluated!

[Figure annotation: some rows are marked “Human Judgment”, the others “No Judgment”]
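A minimal sketch of one common way to choose which results get judged: depth-k pooling, where the top k documents of every submitted run are merged into a single set for the assessors. The slides do not state the exact selection rule used in NTCIR Math, so the rule, the run names, and the depth are all assumptions.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` documents from every run for one query."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


# Hypothetical runs for one query; document ids borrowed from the slide.
runs = {
    "run_A": ["1307.6316_1_108", "math0503510_1_11", "math-ph0105008_1_4"],
    "run_B": ["math0503510_1_11", "1002.4757_1_52", "1111.3507_1_8"],
}

pool = build_pool(runs, depth=2)
print(sorted(pool))
```

Documents outside the pool receive no judgment, which is why results beyond the pooled ranks show up as "No Judgment".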

slide-46
SLIDE 46

NTCIR Math-IR Task Framework

Ask evaluators to judge the relevance of the document to the target query

Topics + Documents → System Run → Results → Pooling → Human Assessment → Evaluation

slide-47
SLIDE 47

SEPIA Snapshot


Given a pair of <query, retrieved document>, decide:

  • Relevant
  • Partially relevant
  • Not relevant
slide-48
SLIDE 48

Human Assessment

  • 2 assessments per retrieval unit
  • Relevant, Partially relevant, Not relevant
  • Use SEPIA with Math Extension

Agreement score for the two judges

               arXiv-main  arXiv-simto  Wiki-main  Wiki-formula
docs           4,234       612          4,107      2,687
Fleiss’ kappa  0.5615      0.5380       0.3546     0.2619
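A minimal sketch of how a Fleiss' kappa value like those above is computed from per-item category counts. The ratings below are hypothetical (four retrieval units, two judges, three relevance grades) and are not the NTCIR assessment data.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i][j] = number of judges assigning item i to category j.

    Every item must be rated by the same number of judges (at least two).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Share of all assignments that fell into each category.
    p_category = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of judge pairs that agree.
    p_item = [
        (sum(count * count for count in row) - n_raters)
        / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    observed = sum(p_item) / n_items
    expected = sum(p * p for p in p_category)
    return (observed - expected) / (1 - expected)


# Hypothetical counts per unit: [relevant, partially relevant, not relevant].
ratings = [
    [2, 0, 0],  # both judges: relevant
    [0, 2, 0],  # both judges: partially relevant
    [1, 1, 0],  # judges disagree
    [0, 0, 2],  # both judges: not relevant
]
print(round(fleiss_kappa(ratings), 4))
```

Kappa is 1 for perfect agreement and near 0 when agreement is no better than chance, so the lower values on the Wikipedia subtasks indicate that judging relevance was harder there.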

slide-49
SLIDE 49

NTCIR Math-IR Task Framework

Analyze the performance of the participating systems

Topics + Documents → System Run → Results → Pooling → Human Assessment → Evaluation

slide-50
SLIDE 50

http://research.nii.ac.jp/ntcir/data/data-en.html


slide-51
SLIDE 51

https://sigmathling.kwarc.info/


slide-52
SLIDE 52

Importance of information need analysis

  • Yiannos A. Stathopoulos and Simone Teufel: Retrieval of Research-level Mathematical Information Needs: A Test Collection and Technical Terminology Experiment (ACL 2015)
  • Construct a test collection to analyze mathematical information needs
  • Collected data from MathOverflow
  • Tokinori Suzuki, Atsushi Fujii: Mathematical Document Categorization with Structure of Mathematical Expressions (JCDL 2017)
  • Mathematical document categorization (MDC): automatic categorization, applied to MathOverflow and arXiv collections

slide-53
SLIDE 53

Technical Challenges in NTCIR Math IR


slide-54
SLIDE 54

Math IR Challenges

  • Approximate search over large-scale tree-structured data
  • Semantic similarity of mathematical formulae

slide-55
SLIDE 55

Math IR Challenges

  • Approximate search over large-scale tree-structured data
  • Semantic similarity of mathematical formulae

slide-56
SLIDE 56

Formula Tree: Statistics

About 60 million math formulae

Number of nodes per formula tree: the target collection is highly heterogeneous

NTCIR-Math ArXiv Dataset

slide-57
SLIDE 57

Token-based methods

Mathematical symbols are translated into keywords and indexed using a general search engine

Indexing in a general search engine

Digital and computational solutions are becoming the prevalent means for the generation, communication, processing, storage and curation of mathematical information.

Bag-of-Words: become, mathematical, means, prevalent, solution

Sub-tree based indexing of math formulae

Formula symbols: x, opposite, hypotenuse

Tree tokens: apply, apply, apply, ci, divide, ci, eq, ci, divide
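A minimal sketch of sub-tree based indexing: every subtree of a formula tree is serialized to a string, and those strings are indexed like ordinary keywords. The nested-tuple tree and the example formula x = opposite / hypotenuse are assumptions made for illustration; the slide shows only the symbols and tree tokens.

```python
from collections import Counter


def subtree_tokens(tree):
    """Return one string token per subtree of a nested-tuple formula tree."""
    tokens = []

    def visit(node):
        if isinstance(node, str):
            text = node
        else:
            head, *children = node
            text = head + "(" + ",".join(visit(child) for child in children) + ")"
        tokens.append(text)
        return text

    visit(tree)
    return tokens


# Assumed example formula: x = opposite / hypotenuse.
formula = ("eq", "x", ("divide", "opposite", "hypotenuse"))
tokens = subtree_tokens(formula)
index_entry = Counter(tokens)  # bag of subtree tokens, like a bag of words
print(tokens)
```

A query such as `opposite / hypotenuse` then shares the token `divide(opposite,hypotenuse)` with this formula, so a keyword engine can rank it as a partial structural match.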

slide-58
SLIDE 58

Query variable problem

Query variables

f(x) = x

f(∗) = ∗

unification

f(x) = y is not a “fixed point”!
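A minimal sketch of why query variables need consistent bindings rather than independent wildcards. It implements one-sided matching of a query pattern against a formula tree, which is simpler than full unification. The `?v` naming convention for query variables and the tuple encoding are assumptions made for this example.

```python
def match(pattern, formula, bindings=None):
    """Match a pattern with query variables ("?name") against a formula tree.

    Returns the variable bindings on success, or None on failure.
    The same variable must bind to the same subtree everywhere.
    """
    bindings = {} if bindings is None else bindings
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in bindings:
            return bindings if bindings[pattern] == formula else None
        return {**bindings, pattern: formula}
    if isinstance(pattern, str) or isinstance(formula, str):
        return bindings if pattern == formula else None
    if len(pattern) != len(formula):
        return None
    for sub_pattern, sub_formula in zip(pattern, formula):
        bindings = match(sub_pattern, sub_formula, bindings)
        if bindings is None:
            return None
    return bindings


# Query "f(?v) = ?v": both occurrences of ?v must be the same expression.
query = ("eq", ("apply", "f", "?v"), "?v")
fixed_point = ("eq", ("apply", "f", "x"), "x")      # f(x) = x
not_fixed_point = ("eq", ("apply", "f", "x"), "y")  # f(x) = y

print(match(query, fixed_point))
print(match(query, not_fixed_point))
```

With two independent wildcards both formulae would match; with a shared query variable only f(x) = x does.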

slide-59
SLIDE 59

Unification-based Methods

MathWebSearch (KWARC Project)

Query: a² + b/c

Math Information Retrieval team of Masaryk University (MIRMU) Subtree Hash with Query Variables: MathCat@NII

slide-60
SLIDE 60

Math IR Challenges

  • Approximate search over large-scale tree-structured data
  • Semantic similarity of mathematical formulae

slide-61
SLIDE 61

Structural and semantic similarities


Similarity between two mathematical expressions is calculated based on the tree-structure similarity score

Tree-structure similarity = Math semantic similarity?

slide-62
SLIDE 62

Correlation between Tree Edit Distance and Relevance Judgment

  • NTCIR-11 Math-2 Dataset
  • Human relevance judgment (1-5)
  • The correlation coefficient was -0.07692!

[Figure: two histograms over a horizontal axis from 0.1 to 1, one panel for “Not Relevant” pairs and one for “Strongly Relevant” pairs]

slide-63
SLIDE 63

Distributional Hypothesis in NLP

  • “Words that occur in the same contexts tend to have similar meanings” (Harris, 1954).
  • “A word is characterized by the company it keeps” (Firth, 1957)

https://aclweb.org/aclwiki/Distributional_Hypothesis

A math formula is characterized by the natural language context in which it occurs.

slide-64
SLIDE 64

Exploiting Context of Formulae

  • Step 1: Extract keywords in surrounding windows
  • Step 2: Extract descriptions of mathematical formulae (including symbols)
  • Step 3: Extract a dependency graph of math formulae
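A minimal sketch of Step 1, assuming formulae have already been replaced by placeholder ids of the form `MATH-...` as in the example sentence used elsewhere in this talk. The window size, the stopword list, and the function name are illustrative choices, not the settings of any NTCIR system.

```python
import re

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "we", "have", "following", "where", "and", "are", "for", "a", "of"}
TOKEN = re.compile(r"MATH-[\w-]+|[A-Za-z][A-Za-z-]*")


def window_keywords(sentence, window=3):
    """Map each formula placeholder to the content words within `window` tokens."""
    tokens = TOKEN.findall(sentence)
    result = {}
    for i, token in enumerate(tokens):
        if not token.startswith("MATH-"):
            continue
        nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        result[token] = [
            word.lower()
            for word in nearby
            if not word.startswith("MATH-") and word.lower() not in STOPWORDS
        ]
    return result


sentence = (
    "Denoting the concatenated d-dimensional word representations "
    "MATH-w-6-5-0-32, we have the following probability definition: "
    "MATH-p-6-6-0 where MATH-w-6-7-0-1 denotes the hidden layer size"
)
for math_id, words in window_keywords(sentence).items():
    print(math_id, words)
```

A fixed window picks up noise such as "denotes"; Steps 2 and 3 replace it with extracted descriptions and dependencies between formulae.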


slide-65
SLIDE 65

Exploiting Context of Formulae

  • Step 1: Extract keywords in surrounding windows
  • Step 2: Extract descriptions of mathematical formulae (including symbols)
  • Step 3: Extract a dependency graph of math formulae

slide-66
SLIDE 66

[Figure: system overview. Extract: the math formula expressed in MathML and the natural language context surrounding the math formula are taken from the XML documents. Search: the math database holds substructures of MathML trees and keywords extracted from math descriptions.]

slide-67
SLIDE 67

Exploiting Context of Formulae

  • Step 1: Extract keywords in surrounding windows
  • Step 2: Extract descriptions of mathematical formulae (including symbols)
  • Step 3: Extract a dependency graph of math formulae

slide-68
SLIDE 68

Extracting Math Descriptions

Denoting the concatenated d-dimensional word representations MATH-w-6-5-0-32, we have the following probability definition: MATH-p-6-6-0 where MATH-w-6-7-0-1 denotes the hidden layer size, MATH-w-6-7-0-8 and MATH-w-6-7-0-10 are the bias vectors for the output nodes and hidden nodes respectively.

Puyang Xu, Asela Gunawardana, Sanjeev Khudanpur: Efficient Subsampling for Training Complex Language Models, EMNLP 2011

slide-69
SLIDE 69

Extracting Math Descriptions

  • MATH-w-6-5-0-32 (r): concatenated d-dimensional word representations
  • MATH-p-6-6-0: probability distribution
  • MATH-w-6-7-0-1 (h): hidden layer size

Denoting the concatenated d-dimensional word representations MATH-w-6-5-0-32, we have the following probability definition: MATH-p-6-6-0 where MATH-w-6-7-0-1 denotes the hidden layer size, MATH-w-6-7-0-8 and MATH-w-6-7-0-10 are the bias vectors for the output nodes and hidden nodes respectively.

$$P(w_i = k \mid w_{i-1}, \ldots, w_{i-n+1}) = \frac{e^{a_k}}{\sum_m e^{a_m}}, \qquad a_k = b_k + \sum_{l=1}^{h} W_{kl} \tanh\Bigl(c_l + \sum_{j=1}^{(n-1)d} U_{lj} r_j\Bigr)$$

slide-70
SLIDE 70

Exploiting Context of Formulae

  • Step 1: Extract keywords in surrounding windows
  • Step 2: Extract descriptions of mathematical formulae (including symbols)
  • Step 3: Extract a dependency graph of math formulae

slide-71
SLIDE 71

Extracting Math Descriptions

Denoting the concatenated d-dimensional word representations MATH-w-6-5-0-32, we have the following probability definition: MATH-p-6-6-0 where MATH-w-6-7-0-1 denotes the hidden layer size, MATH-w-6-7-0-8 and MATH-w-6-7-0-10 are the bias vectors for the output nodes and hidden nodes respectively.

$$P(w_i = k \mid w_{i-1}, \ldots, w_{i-n+1}) = \frac{e^{a_k}}{\sum_m e^{a_m}}, \qquad a_k = b_k + \sum_{l=1}^{h} W_{kl} \tanh\Bigl(c_l + \sum_{j=1}^{(n-1)d} U_{lj} r_j\Bigr)$$

  • MATH-p-6-6-0: probability distribution
  • MATH-w-6-5-0-32 (r): concatenated d-dimensional word representations
  • MATH-w-6-7-0-1 (h): hidden layer size
  • MATH-w-6-7-0-8 (b): bias vectors for the output nodes
  • MATH-w-6-7-0-10 (c): bias vectors for the hidden nodes

slide-72
SLIDE 72

Extracting Math Descriptions

Denoting the concatenated d-dimensional word representations MATH-w-6-5-0-32, we have the following probability definition: MATH-p-6-6-0 where MATH-w-6-7-0-1 denotes the hidden layer size, MATH-w-6-7-0-8 and MATH-w-6-7-0-10 are the bias vectors for the output nodes and hidden nodes respectively.

$$P(w_i = k \mid w_{i-1}, \ldots, w_{i-n+1}) = \frac{e^{a_k}}{\sum_m e^{a_m}}, \qquad a_k = b_k + \sum_{l=1}^{h} W_{kl} \tanh\Bigl(c_l + \sum_{j=1}^{(n-1)d} U_{lj} r_j\Bigr)$$

  • MATH-p-6-6-0: probability distribution
  • MATH-w-6-5-0-32 (r): concatenated d-dimensional word representations
  • MATH-w-6-7-0-1 (h): hidden layer size
  • MATH-w-6-7-0-8 (b): bias vectors for the output nodes
  • MATH-w-6-7-0-10 (c): bias vectors for the hidden nodes

..., as shown in (1), where w_i is the i-th word, and h_i is the word history preceding w_i

slide-73
SLIDE 73

Dependency Graph

$$u_{ij} = \sum_{k=1}^{j} (-1)^{i-k} \binom{i-1}{k-1} \binom{n-i}{j-k} a^{2k-i-1}$$

a
  • the golden ratio
  • a root

j, k, n
  • rows
  • the upper parameter

Giovanni Yoko Kristianto, Goran Topic, Akiko Aizawa: “Utilizing Dependency Relationships between Math Expressions in Math IR.” Information Retrieval Journal

Query: x² − x − 1 = 0 (golden ratio; Fibonacci)

slide-74
SLIDE 74

NLP Issues in dependency graph extraction

Query: x² − x − 1 = 0 (golden ratio; Fibonacci)

$$u_{ij} = \sum_{k=1}^{j} (-1)^{i-k} \binom{i-1}{k-1} \binom{n-i}{j-k} a^{2k-i-1}$$

a
  • the golden ratio
  • a root

j, k, n
  • rows
  • the upper parameter

Issues:

  • Unit/class identification: x, x₁, x₂, …
  • Symbols expressed as non-math symbols: x, b, …
  • Math expression as a text element (but not as a simple noun): “where a = b.”
  • Coordination: “Let μ and σ be the mean and standard deviation of the normal distribution, respectively.”

slide-75
SLIDE 75

A Navigator for Math browsing

$$P(w_i = k \mid w_{i-1}, \ldots, w_{i-n+1}) = \frac{e^{a_k}}{\sum_m e^{a_m}}, \qquad a_k = b_k + \sum_{l=1}^{h} W_{kl} \tanh\Bigl(c_l + \sum_{j=1}^{(n-1)d} U_{lj} r_j\Bigr)$$

Description: probability distribution

Components:

  • w_i: the i-th word
  • r: concatenated d-dimensional word representations
  • h: hidden layer size
  • b: bias vectors for the output nodes
  • c: bias vectors for the hidden nodes

slide-76
SLIDE 76

Understanding mathematical expressions: the eye movement


Andrea Kohlhase, Michael Kohlhase and Taweechai Ouypornkochagorn. Discourse Phenomena in Math Documents (Next!)

slide-77
SLIDE 77

Automatic Query Generation

$$\frac{e^{a_k}}{\sum_m e^{a_m}}$$

Puyang Xu, Asela Gunawardana, Sanjeev Khudanpur: Efficient Subsampling for Training Complex Language Models, EMNLP 2011

slide-78
SLIDE 78

Math Understanding as AI Challenges


slide-79
SLIDE 79

Surface and Semantic levels in NLP

Word-level ambiguity: recognizing entities in text

  • Apple → Malus domestica

Sentence-level ambiguity: recognizing relations between entities

  • break (PropBank) roles: Arg0: breaker; Arg1: thing broken; Arg2: instrument; Arg3: pieces

slide-80
SLIDE 80

Recognizing entities in text

“the task of determining the identity of entities mentioned in text”

Text in a document (mentions): Where would McCain be without Sarah?

Entities in a knowledge base (e.g., Wikipedia articles): John McCain, Sarah Palin

There can be multiple candidates, or none.
slide-81
SLIDE 81

Recognizing relations between entities

“the task of identifying relations between entities in text”

  • Two entities: Entity X, Entity Y
  • Syntactic pattern, e.g., [X] married [Y]
  • Relation, e.g., Marry(X, Y)

https://free-pal.appspot.com/

slide-82
SLIDE 82

Two Component Techniques for Computer Math Understanding

Math entity recognition

  • Tokenize mathematical symbols
  • Identify their types (constant, variable, operator)
  • …

Example: sin(x) → sin: operator; x: variable

Math parsing

  • Identify the semantic structure
  • Presentation to Content MathML conversion

Example: y=f(x) → apply(eq, y, apply(f, x))

POM tagger: Abdou Youssef: Part-of-math tagging and applications (CICM-2017)

Variable Typing: Yiannos A. Stathopoulos, Simon Baker, Marek Rei and Simone Teufel: Variable Typing: Assigning Meaning to Variables in Mathematical Text (NAACL 2018)

slide-83
SLIDE 83

Semantic Parsing of Natural Language Sentences

  • What is “semantic parsing”?

Translate natural language into a formal representation on which a machine can act.

Applications: search engine queries, robot control, math problem solver

Formal representations: lambda calculus, first-order predicate logic, SQL (database queries)

slide-84
SLIDE 84

NLP Semantic Parsing

  • CCG (Combinatory Categorial Grammar)

Mark Steedman (2000) The Syntactic Process

$$(\lambda x.\lambda y.\,\mathit{marry}'\,x\,y)\ \mathit{manny}'\ \mathit{anna}' = (\lambda y.\,\mathit{marry}'\,\mathit{manny}'\,y)\ \mathit{anna}' = \mathit{marry}'\,\mathit{manny}'\,\mathit{anna}'$$

CCG categories: Anna := NP, married := (S\NP)/NP, Manny := NP; married Manny ⇒ S\NP; Anna married Manny ⇒ S

Phrase structure: [S [NP Anna] [VP [V married] [NP Manny]]]

Anna married Manny
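A minimal sketch of the semantic side of this derivation, using curried Python functions in place of lambda terms. The tuple `("marry", x, y)` is an illustrative stand-in for the logical form marry′ x y, and the two application helpers are hypothetical names for the CCG forward and backward application rules.

```python
# Lexical semantics: "married" has category (S\NP)/NP and meaning λx.λy.marry' x y.
married = lambda x: lambda y: ("marry", x, y)
anna = "anna"
manny = "manny"


def forward_apply(function, argument):
    """X/Y  Y  =>  X : the functor takes the argument on its right."""
    return function(argument)


def backward_apply(argument, function):
    """Y  X\\Y  =>  X : the functor takes the argument on its left."""
    return function(argument)


verb_phrase = forward_apply(married, manny)   # S\NP: λy.marry' manny' y
sentence = backward_apply(anna, verb_phrase)  # S: marry' manny' anna'
print(sentence)
```

Each syntactic combination step is paired with one function application, which is what lets a CCG parse yield a logical form directly.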

slide-85
SLIDE 85

NLP Semantic Parsing

  • Semantic Parsing with Execution


https://nlp.stanford.edu/software/sempre/

Jonathan Berant, Andrew Chou, Roy Frostig, Percy Liang. Semantic Parsing on Freebase from Question-Answer Pairs. Empirical Methods in Natural Language Processing (EMNLP), 2013.

Which college did Obama go to?

  • alignment: “college” → Type.University
  • alignment: “Obama” → BarackObama
  • bridging: Education
  • Logical form: Type.University ∩ Education.BarackObama

Execute on database → Occidental College, Columbia University

slide-86
SLIDE 86

University Entrance Exam

Find the range of x when the real numbers x, y, and z satisfy the following formula.

x + y + z = 3, xy + yz + zx = −9

Natural language description + math formula → first-order predicate logic:

∃y ∃z (x + y + z = 3 ∧ xy + yz + zx = −9)

Quantifier Elimination Method:

−3 ≤ x ≤ 5

Cited from the book "Algorithms of Quantifier Elimination and Their Applications" by Anai and Yokoyama
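The stated range can be checked by hand. The derivation below is an elementary argument via the discriminant, offered as a sanity check on the result; it is not the quantifier elimination algorithm from the cited book.

```latex
\begin{align*}
y + z &= 3 - x, \\
yz &= -9 - x(y + z) = -9 - x(3 - x) = x^2 - 3x - 9, \\
\intertext{so $y$ and $z$ are the roots of $t^2 - (3 - x)\,t + (x^2 - 3x - 9) = 0$, which are real iff}
(3 - x)^2 - 4(x^2 - 3x - 9) &\ge 0 \\
-3x^2 + 6x + 45 &\ge 0 \\
x^2 - 2x - 15 = (x - 5)(x + 3) &\le 0 \\
-3 \le x &\le 5 .
\end{align*}
```

Eliminating the quantified variables y and z leaves a condition on x alone, which is exactly what the quantifier elimination step produces automatically.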

slide-87
SLIDE 87

“Deep” Understanding

Math

Low dimensional continuous vector representation

Executable code

Translation

slide-88
SLIDE 88

Formula Embeddings

  • Liangcai Gao, Zhuoren Jiang, Yue Yin, Ke Yuan, Zuoyu Yan, Zhi Tang: Preliminary Exploration of Formula Embedding for Mathematical Information Retrieval: can mathematical formulae be embedded like a natural language? CIKM 2017 Workshop on Interpretable Data Mining (IDM): Bridging the Gap between Shallow and Deep Models
  • Kriste Krstovski and David M. Blei: Equation Embeddings, arXiv:1803.09123v1 (2018)

slide-89
SLIDE 89


Liangcai Gao, Zhuoren Jiang, Yue Yin, Ke Yuan, Zuoyu Yan, Zhi Tang: Preliminary Exploration of Formula Embedding for Mathematical Information Retrieval: can mathematical formulae be embedded like a natural language? CIKM 2017 Workshop on Interpretable Data Mining (IDM): Bridging the Gap between Shallow and Deep Models

slide-90
SLIDE 90

Neural Translation for Math Word Problems

  • The sum of 2 numbers is 18. The first number is 4 more than the second number. Find the two numbers.
    Equations: x + y = 18, x = y + 4
  • The sum of 3 numbers is 15. The larger number is 4 times the smallest and the middle number is 5. What are the numbers?
    Equations: x + y + z = 15, x = 4 ∗ z, y = 5
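A minimal sketch of what happens after translation: once a word problem has become a system of equations, solving it is routine. The neural translation step itself is not shown. The solver below is plain Gauss-Jordan elimination over exact fractions, written for this example, and it assumes the system has a unique solution.

```python
from fractions import Fraction


def solve_linear(augmented):
    """Solve a square linear system given as an augmented matrix [A | b]."""
    rows = [[Fraction(value) for value in row] for row in augmented]
    size = len(rows)
    for col in range(size):
        # Find a row with a non-zero entry in this column and move it up.
        pivot = next(r for r in range(col, size) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Scale the pivot row so the pivot entry becomes 1.
        rows[col] = [value / rows[col][col] for value in rows[col]]
        # Eliminate this column from every other row.
        for r in range(size):
            if r != col:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]


# x + y = 18, x = y + 4  (rewritten as x - y = 4)
first = solve_linear([[1, 1, 18], [1, -1, 4]])
# x + y + z = 15, x = 4 * z  (x - 4z = 0), y = 5
second = solve_linear([[1, 1, 1, 15], [1, 0, -4, 0], [0, 1, 0, 5]])
print(first, second)
```

The hard part of the task is therefore the translation from text to equations, which is where the intermediate representations of the cited paper come in.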


Danqing Huang, Jin-Ge Yao, Chin-Yew Lin, Qingyu Zhou, Jian Yin: Using Intermediate Representations to Solve Math Word Problems, ACL-2018

translate translate

slide-91
SLIDE 91

NLP and Source Code

Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018. A Survey of Machine Learning for Big Code and Naturalness. ACM Comput. Surv. 51, 4, Article 81 (July 2018), 37 pages. DOI: https://doi.org/10.1145/3212695

Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit the abundance of patterns of code. In this article, we survey this work. We contrast programming languages against natural languages and discuss how these similarities and differences drive the design of probabilistic models. We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review …

slide-92
SLIDE 92

Deep Learning @CICM 2018 Today!

  • Qingxiang Wang, Cezary Kaliszyk and Josef Urban: First experiments with Neural Translation of Informal to Formal Mathematics.
  • Abdou Youssef and Bruce Miller: Deep Learning for Math Knowledge Processing
  • Felix Petersen, Moritz Schubotz and Bela Gipp: Towards Formula Translation using Recursive Neural Networks

slide-93
SLIDE 93

Semantic Parsing of Natural Language Sentences

  • What is “semantic parsing”?

Translate natural language into a formal representation on which a machine can act.

Applications: search engine queries, robot control, math problem solver

Formal representations: lambda calculus, first-order predicate logic, SQL (database queries)

slide-94
SLIDE 94

Final Remarks

“Reading for Computing”

  • (Long term research goal) Recognize relationships between real problems, formal models, and computation methods
  • (Current target) Retrieval and understanding of mathematical formulae

slide-95
SLIDE 95

Thank you!
