
Adaptive method for the digitization of mathematical journals

IMU-WDML Workshop, June 2, 2012, Washington DC. Masakazu Suzuki, Professor emeritus, Kyushu University; Kyushu Institute of Systems, Informatics and Nanotechnologies (ISIT).


  1. Adaptive method for the digitization of mathematical journals
     IMU-WDML Workshop, June 2, 2012, Washington DC
     Masakazu Suzuki
     Professor emeritus, Kyushu University
     Kyushu Institute of Systems, Informatics and Nanotechnologies (ISIT)
     InftyProject (http://www.inftyproject.org)
     Science Accessibility Net (http://www.sciaccess.net)

  2. Plan of the talk
     - About InftyProject
     - Making Rich Digital Mathematical Libraries
     - Process Flow and Technical Components
     - Current State of the Art with Demonstration
     - Adaptive Method
     - Character and Symbol Recognition
     - Logical Structure Analysis
     - Future Problems

  3. Section 1: About InftyProject

  4. InftyProject
     - R&D on Math Information Systems
     - Main system development
       - InftyReader: Math OCR software
       - InftyEditor: Editor of math documents
       - Data conversion (XML, LaTeX, MathML, PDF, etc.)
       - ChattyInfty: InftyEditor + speech output, authoring of DAISY
     - URL
       - Project site: http://www.inftyproject.org/en//
       - Release & user support of Infty products: Science Accessibility Net, http://www.sciaccess.net/

  6. “InftyReader”: OCR software for math documents
     - Demonstration. Recognition result samples (YMJ, AJM).

  7. Section 2: Toward Rich DML

  8. Different levels in digitization
     - Level 1: Bitmap images of printed materials, e.g. GIF, TIFF
     - Level 2: Searchable digitized document, e.g. PDF with hidden text, Bib Link
     - Level 3: Structured accessible document, e.g. XML, HTML(+MathML), LaTeX, …
     - Level 4: (Partially) executable document, e.g. Mathematica, Maple
     - Level 5: Formally presented document, e.g. Mizar, OMDoc

  9. Different levels in digitization (same five levels as on slide 8)
     - Level 1 (bitmap images of printed materials): WDML achieved this level.

  10. Different levels in digitization (same five levels as on slide 8)
      - Infty: Level 1 → Level 3, i.e. from bitmap images to structured accessible documents.

  11. Process Flow of Digitization
      - Input: PDF, Image File (TIF)
      - Texts & Math symbols
      - Layout Analysis: Segmentation of Areas (Text, Table, Figure)
      - Recognition per line (Character recognition, Math/Text segmentation, Math. Structure analysis)
      - Document Structure analysis (Chapter, Section, Itemize, Theorem description, References, etc.)
      - Outputs: LaTeX, XHTML+MathML, XML, PDF, Braille codes, etc.

  12. Layout Analysis
      - Input: PDF, Image File (TIF)
      - (Pre-processing)
      - Segmentation of Areas (Text, Table, Figure)
      - Recognition per line (Character recognition, Math. Structure analysis)
      - Document Structure analysis (Title, Chapter, Section, Itemize, Theorem, Bib, etc.)
      - Outputs: LaTeX, HTML, XML, human-readable TeX, Braille codes, speech data, etc.

  13. Layout Analysis
      - Input: PDF, Image File (TIF)
      - Segmentation of Areas
      - Table Analysis
      - Recognition per line (Character recognition, Math. Structure analysis)
      - Document Structure analysis (Title, Chapter, Section, Itemize, Theorem, Bib, etc.)
      - Outputs: LaTeX, HTML, XML, human-readable TeX, Braille codes, speech data, etc.

  17. Document Structure Analysis
      - Detection of: Title, Author, Section, Subsection, Itemization, BibItem, Theorem, Lemma, etc.
        - Currently, naïve methods are used: line classification using a combination of features such as character size, font information (bold, italic, small capitals), keywords, indentation, and lines starting with numbers or special patterns (e.g. “[Num]”).
        - A stronger method is required in actual digitization.
      - Hyperlinks inside the document.
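
The feature combination listed on this slide could be sketched roughly as follows. This is a minimal illustration, not InftyReader's actual implementation: the `Line` fields, the keyword list, the regular expressions, and the 1.4 size ratio are all assumptions chosen for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Line:
    """One recognized text line (hypothetical structure for this sketch)."""
    text: str
    char_height: float        # average character height, e.g. in points
    bold: bool = False
    italic: bool = False
    small_caps: bool = False
    indent: float = 0.0       # left offset relative to the text block


# Assumed keyword list and patterns; a real journal style would need its own.
THEOREM_KEYWORDS = ("Theorem", "Lemma", "Proposition", "Corollary", "Definition")
BIB_PATTERN = re.compile(r"^\[\d+\]")                 # "[12] ..."
SUBSECTION_PATTERN = re.compile(r"^\d+\.\d+\.?\s+\S")  # "1.2 Notation"
SECTION_PATTERN = re.compile(r"^\d+\.\s+\S")           # "3. Main results"
ITEM_PATTERN = re.compile(r"^\(([ivx]+|[a-z]|\d+)\)\s")  # "(ii) ..."


def classify_line(line: Line, body_height: float) -> str:
    """Assign a coarse logical label to a line using simple hand-written rules."""
    text = line.text.strip()
    if BIB_PATTERN.match(text):
        return "bibitem"
    # Character size: lines much larger than body text are treated as the title.
    if line.char_height >= 1.4 * body_height:
        return "title"
    # Keyword plus font information.
    first_word = text.split(" ", 1)[0].rstrip(".")
    if first_word in THEOREM_KEYWORDS and (line.bold or line.small_caps):
        return first_word.lower()
    # Lines starting with numbers, combined with font information.
    if SUBSECTION_PATTERN.match(text) and (line.bold or line.italic):
        return "subsection"
    if SECTION_PATTERN.match(text) and (line.bold or line.small_caps):
        return "section"
    # Special pattern plus indentation.
    if ITEM_PATTERN.match(text) and line.indent > 0:
        return "item"
    return "body"
```

Rules of this kind are easy to write but brittle across journal styles, which is consistent with the slide's remark that a stronger method is needed in practice.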

  18. Section 3: Current state of the art with demonstration

  19. “InftyReader”: OCR software for math documents
      - Demonstration…
        - Math recognition (already shown)
        - Multilingual recognition ← FineReader OCR plug-in (Czech paper result sample)
        - Matrices
        - Layout analysis, Table recognition
        - Logical structure analysis

  25. Section 4: Large Volume Recognition

  26. Large Volume Digitization
      - Adaptive method is efficient: get information from the target document:
        - Character features,
        - Math formula parameters,
        - Layout parameters, etc.
      - This information is passed to recognition either directly or, semi-automatically, after manual checking.

  27. Large Volume Digitization
      - Process flow using BatchInfty & InftyReader pro:
        1. Noise reduction, centering, etc.
        2. Trial recognition
        3. Feature extraction:
           - Document style → Logical structure analysis
           - Character cluster images → OCR engine
        4. Recognition & verification
        5. PDF output
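
As one possible reading of step 3 ("Document style → Logical structure analysis"), the sketch below estimates two style parameters from the reliable part of a trial recognition result. It is only an illustration under stated assumptions: the `TrialLine` structure, the 0.8 confidence threshold, and the 10% size tolerance are hypothetical and are not taken from BatchInfty or InftyReader.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TrialLine:
    """One line from a trial recognition pass (hypothetical structure)."""
    text: str
    char_height: float  # mean character height on the line, in pixels
    indent: float       # left offset from the page's text block, in pixels
    score: float        # mean recognition confidence in [0, 1]


def extract_document_style(lines, min_score=0.8):
    """Estimate body text size and paragraph indent from reliable trial lines."""
    reliable = [ln for ln in lines if ln.score >= min_score]
    if not reliable:
        raise ValueError("no reliable lines in the trial recognition result")

    # Body text dominates a journal page, so the median height approximates it.
    body_height = median(ln.char_height for ln in reliable)

    # Paragraph indent: median positive indent among body-sized lines only.
    indents = [
        ln.indent
        for ln in reliable
        if ln.indent > 0 and abs(ln.char_height - body_height) <= 0.1 * body_height
    ]
    paragraph_indent = median(indents) if indents else 0.0

    return {"body_height": body_height, "paragraph_indent": paragraph_indent}
```

Parameters estimated this way could then be used in the final recognition pass, for example as the body-size reference when deciding whether a line is a heading.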

  28. Large Volume Digitization
      - Generation of a UserDictionary adapting the OCR engine to the target documents:
        - Trial recognition
        - Clustering of the character images
        - CharDataA: centroids of the clusters of text characters with reliable score (automatic)
        - CharDataB: centroids of the clusters of math symbols and text characters with low score (manual correction)
      - Demonstration: User Dictionary of Character Features (CharImageManager)
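
The split into CharDataA and CharDataB could be sketched as follows. This is a hedged illustration rather than the method used by CharImageManager: the `CharSample` structure, the simple leader-style clustering, the `radius` parameter, and the 0.9 score threshold are assumptions introduced for the example, and real character features would be far higher-dimensional.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class CharSample:
    """One character image from trial recognition (hypothetical structure)."""
    features: tuple   # feature vector extracted from the character image
    label: str        # label assigned by the trial recognition
    is_math: bool     # True if the character was found inside a math area
    score: float      # recognition confidence in [0, 1]


def centroid(members):
    """Component-wise mean of the feature vectors in one cluster."""
    n = len(members)
    return tuple(sum(col) / n for col in zip(*(m.features for m in members)))


def cluster_samples(samples, radius):
    """Leader clustering: join the first same-label cluster within `radius`."""
    clusters = []
    for sample in samples:
        for members in clusters:
            same_label = members[0].label == sample.label
            if same_label and dist(centroid(members), sample.features) <= radius:
                members.append(sample)
                break
        else:
            clusters.append([sample])
    return clusters


def build_user_dictionary(samples, radius=1.0, min_score=0.9):
    """Return (char_data_a, char_data_b) as lists of (label, centroid) pairs.

    char_data_a: clusters of text characters whose scores are all reliable;
                 these can be accepted automatically.
    char_data_b: clusters containing math symbols or low-score characters;
                 these are left for manual correction.
    """
    char_data_a, char_data_b = [], []
    for members in cluster_samples(samples, radius):
        entry = (members[0].label, centroid(members))
        reliable = all(not m.is_math and m.score >= min_score for m in members)
        (char_data_a if reliable else char_data_b).append(entry)
    return char_data_a, char_data_b
```

Only the entries in `char_data_b` need a human check, which is what keeps the manual effort small relative to the volume being digitized.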
