Adaptive method for the digitization of mathematical journals
IMU-WDML Workshop, June 2, 2012, Washington DC
Masakazu Suzuki
Kyushu University, Professor emeritus
Kyushu Institute of Systems, Informatics and Nanotechnologies (ISIT)


SLIDE 1

http://www.inftyproject.org/

Adaptive method for the digitization of mathematical journals

Masakazu Suzuki
Kyushu University, Professor emeritus
Kyushu Institute of Systems, Informatics and Nanotechnologies (ISIT)
InftyProject (http://www.inftyproject.org)
Science Accessibility Net (http://www.sciaccess.net)

IMU-WDML Workshop June 2, 2012, Washington DC

SLIDE 2

Plan of the talk

  • About InftyProject
  • Making Rich Digital Mathematical Libraries
      • Process Flow and Technical Components
  • Current State of the Art with Demonstration
  • Adaptive Method
      • Character and Symbol Recognition
      • Logical Structure Analysis
  • Future Problems

SLIDE 3

Section 1

About InftyProject

SLIDE 4

InftyProject

R&D on Math Information Systems

Main system development:

  • InftyReader: Math OCR software
  • InftyEditor: Editor of math documents
  • Data conversion (XML, LaTeX, MathML, PDF, etc.)
  • ChattyInfty: InftyEditor + speech output, authoring of DAISY

URL:

  • Project site: http://www.inftyproject.org/en/
  • Release & user support of Infty products: Science Accessibility Net, http://www.sciaccess.net/

SLIDE 6

“InftyReader” OCR software for math documents

Demonstration.

Recognition result samples (YMJ, AJM).

SLIDE 7

Section 2

Toward Rich DML

SLIDE 8

Different levels in digitization

  • Level 1: Bitmap images of printed materials (e.g. GIF, TIFF)
  • Level 2: Searchable digitized document (e.g. PDF with hidden text, Bib Link)
  • Level 3: Structured accessible document (e.g. XML, HTML(+MathML), LaTeX, …)
  • Level 4: (Partially) executable document (e.g. Mathematica, Maple)
  • Level 5: Formally presented document (e.g. Mizar, OMDoc)

SLIDE 9

Different levels in digitization

  • Level 1: Bitmap images of printed materials (e.g. GIF, TIFF)
  • Level 2: Searchable digitized document (e.g. PDF with hidden text, Bib Link)
  • Level 3: Structured accessible document (e.g. XML, HTML(+MathML), LaTeX, …)
  • Level 4: (Partially) executable document (e.g. Mathematica, Maple)
  • Level 5: Formally presented document (e.g. Mizar, OMDoc)

WDML achieved this level.

SLIDE 10

Different levels in digitization

  • Level 1: Bitmap images of printed materials (e.g. GIF, TIFF)
  • Level 2: Searchable digitized document (e.g. PDF with hidden text)
  • Level 3: Structured accessible document (e.g. XML, HTML(+MathML), LaTeX, …)
  • Level 4: (Partially) executable document (e.g. Mathematica, Maple)
  • Level 5: Formally presented document (e.g. Mizar, OMDoc)

Infty : Level 1 → Level 3

SLIDE 11

Process Flow of Digitization

  • Layout Analysis: Segmentation of Areas (Text, Table, Figure)
  • Recognition per line (Character recognition, Math/Text segmentation, Math. Structure analysis)
  • Document Structure analysis (Chapter, Section, Itemize, Theorem description, References, etc.)
  • XML
  • Outputs: LaTeX, XHTML+MathML, PDF, Braille codes, etc.

PDF

Image File (TIF)

Texts & Math symbols
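
As an illustration of the last two stages of this flow (an intermediate XML, then conversion to an output format), here is a minimal Python sketch. The element names "document", "text" and "math" and the helper names are hypothetical and chosen only for this example; they are not the actual Infty XML format.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Line:
    text: str      # characters recognized on one line
    is_math: bool  # outcome of the math/text segmentation


def to_xml(lines):
    """Wrap recognized lines in a hypothetical intermediate XML tree."""
    root = ET.Element("document")
    for line in lines:
        tag = "math" if line.is_math else "text"
        ET.SubElement(root, tag).text = line.text
    return root


def to_latex(root):
    """Render the intermediate XML as LaTeX, one output line per element."""
    out = []
    for el in root:
        # Math elements become display math; text elements pass through.
        out.append(f"\\[{el.text}\\]" if el.tag == "math" else el.text)
    return "\n".join(out)


lines = [Line("We have", False), Line("a^2+b^2=c^2", True)]
print(to_latex(to_xml(lines)))
```

Keeping one intermediate tree and deriving each output from it is what lets the same recognition result feed LaTeX, XHTML+MathML, Braille and other outputs.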

SLIDE 12

Layout Analysis

  • Segmentation of Areas (Text, Table, Figure)
  • Recognition per line (Character recognition, Math. Structure analysis)
  • Document Structure analysis (Title, Chapter, Section, Itemize, Theorem, Bib, etc.)
  • XML
  • Outputs: LaTeX, HTML, Human readable TeX, Braille codes, Speak data, etc.

PDF

Image File (TIF) (Pre processing)

SLIDE 13

Layout Analysis

  • Segmentation of Areas
  • Table Analysis
  • Recognition per line (Character recognition, Math. Structure analysis)
  • Document Structure analysis (Title, Chapter, Section, Itemize, Theorem, Bib, etc.)
  • XML
  • Outputs: LaTeX, HTML, Human readable TeX, Braille codes, Speak data, etc.

PDF

Image File (TIF)

SLIDE 17

Document Structure Analysis

Detection of: Title, Author, Section, Subsection, Itemization, BibItem, Theorem, Lemma, etc.

  • Currently, naïve methods are used: line classification using a combination of features such as character size, font information (bold, italic, small capitals), keywords, indentation, and lines starting with numbers or a special pattern (e.g. “[Num]”).
  • A stronger method is required in actual digitization.
  • Hyperlinks inside the document.
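
To make the kind of naïve line classification described on this slide concrete, here is a minimal rule-based sketch in Python. The thresholds, regular expressions, keyword list and all names are assumptions for illustration, not the rules InftyReader actually uses.

```python
import re
from dataclasses import dataclass


@dataclass
class LineFeatures:
    text: str
    char_size: float     # average character height, in points
    bold: bool = False
    indent: float = 0.0  # left indentation relative to the text block


# Hypothetical keyword list and patterns, chosen for this example only.
KEYWORDS = ("Theorem", "Lemma", "Proposition", "Corollary")
BIB_PATTERN = re.compile(r"^\[\d+\]")                  # "[12] ..."
SECTION_PATTERN = re.compile(r"^\d+(\.\d+)*\.?\s+\S")  # "2. Title", "2.1 Title"
ITEM_PATTERN = re.compile(r"^(\(\w+\)|[•\-])\s")       # "(ii) ...", "• ..."


def classify_line(f: LineFeatures, body_size: float = 10.0) -> str:
    """Assign a logical label to one line using simple hand-written rules."""
    text = f.text.strip()
    if BIB_PATTERN.match(text):
        return "BibItem"
    for keyword in KEYWORDS:
        if text.startswith(keyword):
            return keyword
    if f.char_size >= 1.4 * body_size:
        return "Title"
    if f.bold and SECTION_PATTERN.match(text):
        # A dot inside the leading number ("2.1") marks a subsection.
        number = text.split()[0].rstrip(".")
        return "Subsection" if "." in number else "Section"
    if f.indent > 0 and ITEM_PATTERN.match(text):
        return "Itemization"
    return "Body"


print(classify_line(LineFeatures("2.1 Notation", 10.0, bold=True)))
```

Rules of this kind are easy to write but depend on each journal's style, which is why the slide asks for a stronger method.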

SLIDE 18

Section 3

Current state of the art with demonstration

SLIDE 19

“InftyReader” OCR software for math documents

Demonstration…

  • Math recognition (already shown)
  • Multilingual recognition ← FineReader OCR plug-in (Czech paper result sample)
  • Matrices
  • Layout analysis, Table recognition
  • Logical structure analysis

SLIDE 25

Section 4

Large Volume Recognition

SLIDE 26

Large Volume Digitization

The adaptive method is efficient:

Get information from the target document:

  • Character features,
  • Math formula parameters,
  • Layout parameters, etc.

Recognition then uses this information, either directly or after manual checking (semi-automatic).

SLIDE 27

Large Volume Digitization

Process Flow using BatchInfty & InftyReader pro

  1. Noise reduction, centering, etc.
  2. Trial recognition
  3. Feature extraction:
      • Document style → Logical structure analysis
      • Character cluster images → OCR engine
  4. Recognition & verification
  5. PDF output
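
The slides do not say how step 1 is implemented. As a minimal sketch of what "noise reduction, centering" could mean, the Python below assumes a binarized page stored as nested lists (1 = ink, 0 = background). Both function names and both techniques (isolated-pixel removal, bounding-box centering) are illustrative assumptions, not BatchInfty's actual preprocessing.

```python
def remove_isolated_pixels(img):
    """Noise reduction: clear ink pixels that have no ink 8-neighbour."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1:
                continue
            # Sum the 3x3 window (clipped at the border), minus the pixel itself.
            neighbours = sum(
                img[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            ) - 1
            if neighbours == 0:
                out[y][x] = 0
    return out


def center_content(img):
    """Centering: shift the ink so its bounding box is centred on the page."""
    h, w = len(img), len(img[0])
    ink = [(y, x) for y in range(h) for x in range(w) if img[y][x]]
    if not ink:
        return [row[:] for row in img]
    top, bottom = min(y for y, _ in ink), max(y for y, _ in ink)
    left, right = min(x for _, x in ink), max(x for _, x in ink)
    # Choose the shift that equalizes the margins on opposite sides.
    dy = (h - 1 - bottom - top) // 2
    dx = (w - 1 - right - left) // 2
    out = [[0] * w for _ in range(h)]
    for y, x in ink:
        out[y + dy][x + dx] = 1
    return out


page = [
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
]
cleaned = center_content(remove_isolated_pixels(page))
```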

SLIDE 28

Large Volume Digitization

Generation of a UserDictionary adapting the OCR engine to the target documents.

  • Trial recognition
  • Clustering of the character images
      • CharDataA: Centroids of the clusters of text characters with reliable score
      • CharDataB: Centroids of the clusters of math symbols and text characters with low score
  • User Dictionary of Character Features (automatic, with manual correction)

Show CharImageManager
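
The following Python sketch shows one way the clustering and the CharDataA/CharDataB split could work. It uses a simple leader clustering as a stand-in, since the slides do not name the clustering algorithm. The feature vectors, the radius, the reliability threshold and all names are assumptions for illustration.

```python
from dataclasses import dataclass, field
from math import dist


@dataclass
class Sample:
    features: tuple  # feature vector of one character image
    label: str       # label assigned by the trial recognition
    score: float     # recognition score, assumed to lie in [0, 1]
    is_math: bool    # True if the character lies in a math area


@dataclass
class Cluster:
    members: list = field(default_factory=list)

    @property
    def centroid(self):
        n = len(self.members)
        return tuple(sum(c) / n for c in zip(*(m.features for m in self.members)))


def leader_clustering(samples, radius):
    """Put each sample in the first cluster whose centroid is within radius."""
    clusters = []
    for s in samples:
        for c in clusters:
            if dist(c.centroid, s.features) <= radius:
                c.members.append(s)
                break
        else:
            clusters.append(Cluster([s]))
    return clusters


def split_char_data(clusters, reliable=0.9):
    """Return (CharDataA, CharDataB) as lists of (label, centroid) pairs."""
    char_data_a, char_data_b = [], []
    for c in clusters:
        # CharDataA: text-only clusters whose members all scored reliably.
        ok = all(not m.is_math and m.score >= reliable for m in c.members)
        # The cluster label is taken from its first member (a simplification).
        (char_data_a if ok else char_data_b).append((c.members[0].label, c.centroid))
    return char_data_a, char_data_b


samples = [
    Sample((0.0, 0.0), "a", 0.95, False),
    Sample((0.2, 0.0), "a", 0.97, False),
    Sample((5.0, 5.0), "α", 0.60, True),
]
clusters = leader_clustering(samples, radius=1.0)
char_data_a, char_data_b = split_char_data(clusters)
```

In this sketch, CharDataA entries could go into the user dictionary without review, while CharDataB entries would be queued for manual correction.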

SLIDE 30

Section 5

Open Problems

SLIDE 31

Problems

Further improvement of character/symbol recognition and structure analysis of math expressions.

  • Touching characters and broken characters in math areas
  • Low-resolution images
  • Different typefaces (old books, typewriter prints, etc.)
  • Bold character detection in math areas

SLIDE 32

Problems

Logical Structure Analysis (automatic detection and manual correction) is still difficult!

  • Title, Author, Section, Subsection, Itemization, BibItem, Theorem, Lemma, etc.
  • Hyperlinks inside the document.

SLIDE 33

Problems

Detection/Analysis of Figures and Tables

  • Detection of characters in figures
  • Table structure analysis (Sample)
  • Diagram recognition
      • Chemical diagrams ← Recently being developed worldwide
      • (Commutative diagrams) ← Future work

SLIDE 36

Conclusion

  • InftyProject
      • Research group of math information processing.
  • Demo (InftyReader) to show the current state of the art.
  • Adaptive method to improve character and symbol recognition (CharImageManager).
  • Proposed some problems to be attacked.

SLIDE 37

“INFTY” an integrated OCR for mathematical documents

Thank you!

Masakazu Suzuki
suzuki@isit.or.jp (current address)
msuzuki@kyudai.jp (permanent address)

InftyProject: http://www.inftyproject.org/en/
Science Accessibility Net: http://www.sciaccess.net/en/