
Document Page Layout Analysis - PowerPoint PPT Presentation

Document Page Layout Analysis. Bhabatosh Chanda, Electronics and Communication Sciences Unit, Indian Statistical Institute, Kolkata 700108, India. Acknowledgement: Amit Das


  1. Document Page Layout Analysis. Bhabatosh Chanda, Electronics and Communication Sciences Unit, Indian Statistical Institute, Kolkata 700108, India

  2. Acknowledgement • Amit Das, IIEST, Sibpur • Sekhar Mandal, IIEST, Sibpur • Sanjoy Kumar Saha, Jadavpur University • Ranjan Mandal, Indian Statistical Institute (January 30, 2017, Indian Statistical Institute)

  3. Outline • Introduction • Projection method – Zone content classification • Morphological operators – Skew correction • Morphology based method • Deep learning based method • Performance evaluation • Database: examples • Conclusion

  4. Introduction • Problem description • Motivation • Improve performance of OCR • Data compression • Graphics recognition • Browsing and navigation • Physical and logical structure

  5. Problem Description

  6. Objective

  7. Major Sources of Document Pages: 1. Books 2. Journals 3. Magazines 4. Newspapers 5. Forms and leaflets 6. Reports

  8. Types of document pages. Consider books and journals: • Title page • Publisher's page • Table of Contents • Text page • Index page

  9. Different types of pages • Title page • Publisher's page

  10. Different types of pages • Table of Contents page (two examples)

  11. Different types of pages • Text page - 1 • Text page - 2

  12. Different types of pages • Text page - 3 • Index page

  13. Issues in document page scanning • Resolution • Back page impression • Granular noise • Blotted text (especially in old documents) • Bending of pages at the binding • Skew (due to placement of the page in the scanner)

  14. Entities of Document Page • Text – Body text (Line, Word, Character) – Heading • Non-text – Half-tone – Table – Graphics or line drawing

  15. Entities of Document Page • Each detected zone or block must be homogeneous in terms of content or entity. • Each zone will be input to one of the suitable modules based on its entity: – OCR system – Image compressor – Vectorization system • Output of these modules may be compiled and archived using a suitable structure.

  16. Geometrical / Physical structure (tree diagram with nodes: Document, Page, Block, Non-text, Line, Word, Characters)

  17. Logical structure (tree diagram, by level) • Document • Text, Non-Text • Normal, High-lighted, Half-tone (image), Line drawing • Body, Abstract, Heading, Sub-heading, Graphics, Table

  18. Logical structure • Different entities: – Text (red box) – Halftone (green box) – Table (magenta box) – Line drawing (blue box) • Reading direction (dark blue arrow) • Link between entities (brown arrow)

  19. Zone / block detection • One of the simplest ways is the projection method. • Algorithm – Take the horizontal (or vertical) projection of foreground pixels (may be implemented as a pixel count). – If there is a characteristic change in the projection profile, put a horizontal (resp. vertical) separator. – Alternate between the horizontal and vertical directions. – Continue as long as the above condition is satisfied, i.e. until no further separator can be placed. • Works well for structured documents, usually the pages of technical journals, books, etc.
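The algorithm on this slide can be sketched as a recursive cut over a binary page image. This is only an illustration under stated assumptions, not the author's implementation: the page is a list of rows with 1 for foreground, the "characteristic change" is taken to be a run of at least `min_gap` empty rows or columns, and the names `projection`, `split_on_gaps` and `xy_cut` are hypothetical.

```python
def projection(img, axis):
    """Foreground pixel count per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in img]
    return [sum(col) for col in zip(*img)]


def split_on_gaps(profile, min_gap):
    """Return end-exclusive (start, end) ranges of non-empty runs that are
    separated by at least `min_gap` empty positions (assumed separator rule)."""
    runs, start, gap = [], None, 0
    for i, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                runs.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        runs.append((start, len(profile) - gap))
    return runs


def xy_cut(img, top=0, left=0, min_gap=2, axis=0, tried_other=False):
    """Alternate horizontal and vertical cuts until no separator is found.
    Returns (top, left, bottom, right) boxes, end-exclusive."""
    runs = split_on_gaps(projection(img, axis), min_gap)
    if not runs:
        return []  # no foreground pixels at all
    whole = len(img) if axis == 0 else len(img[0])
    if runs == [(0, whole)]:
        # No separator in this direction: try the other one once, then stop.
        if tried_other:
            return [(top, left, top + len(img), left + len(img[0]))]
        return xy_cut(img, top, left, min_gap, 1 - axis, True)
    boxes = []
    for s, e in runs:
        if axis == 0:
            boxes += xy_cut(img[s:e], top + s, left, min_gap, 1)
        else:
            boxes += xy_cut([row[s:e] for row in img], top, left + s, min_gap, 0)
    return boxes


# Toy page: two small blocks side by side above one full-width block.
page = [
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
blocks = xy_cut(page)
```

On a real scan `min_gap` would have to be tuned to the line and column spacing; with a value that is too small, individual text lines or words are split into separate zones.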

  20. Projection Method: An Example

  21. Example (contd.)

  22. Example (contd.)

  23. Example (contd.)

  24. Problems of Projection method • Cannot say what each block contains until further analysis. – Extract features from a zone – Recognize the zone content using a classifier • Results are highly sensitive even to a small skew in the scanned page.

  25. Zone content recognition. Features: • Black pixel ratio (number of black pixels / zone area) • Horizontal transition (black to white) count • Vertical transition (black to white) count • Normalized mean length of horizontal black pixel run • Normalized mean length of vertical black pixel run • Connected component ratio. Classifier: • Two-class (text and non-text) SVM with RBF kernel (accuracy 94.89%). Duong, Emptoz, Côté: Features for Printed Document Image Analysis, ICPR 2002.
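Most of the features listed on this slide can be computed directly from a binary zone. The sketch below is an assumption-laden illustration: the slide does not say how run lengths are normalized, so they are divided here by the zone width or height, the connected component ratio is left out because its definition is not given, and the function name `zone_features` is hypothetical. The resulting feature vector would then be passed to the text / non-text classifier, which is not shown.

```python
def zone_features(zone):
    """zone: list of rows, 1 = black (foreground), 0 = white."""
    h, w = len(zone), len(zone[0])
    cols = [list(c) for c in zip(*zone)]

    def transitions(lines):
        # Number of black-to-white steps along each line, summed.
        return sum(1 for ln in lines
                   for a, b in zip(ln, ln[1:]) if a == 1 and b == 0)

    def mean_run(lines):
        # Mean length of maximal black runs along the given lines.
        runs = []
        for ln in lines:
            n = 0
            for v in ln:
                if v:
                    n += 1
                elif n:
                    runs.append(n)
                    n = 0
            if n:
                runs.append(n)
        return sum(runs) / len(runs) if runs else 0.0

    return {
        "black_ratio": sum(map(sum, zone)) / (h * w),
        "h_transitions": transitions(zone),
        "v_transitions": transitions(cols),
        # Assumed normalization: divide by zone width / height.
        "h_run_norm": mean_run(zone) / w,
        "v_run_norm": mean_run(cols) / h,
    }


features = zone_features([[1, 1, 0, 1],
                          [0, 0, 0, 0],
                          [1, 1, 1, 1]])
```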

  26. Zone content recognition • Functional classification of text blocks – Title / Heading, Sub-heading, Body text … • Features: – complexity (measured by entropy) – visibility values (or relative boldness) – directional compactness (horizontal and vertical) – geometric characteristics (block height, width, etc.) • Classifier: – K-means clustering followed by minimum distance classifier. Bres, Eglin, and Gagneux, Unsupervised Clustering of Text Entities in Heterogeneous Grey Level Documents, ICPR, 2002.
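The classifier named on this slide, K-means clustering followed by a minimum distance classifier, can be sketched generically as follows. This is not the cited paper's procedure: feature vectors are plain tuples of numbers, the initial centroids are simply the first `k` points (an assumption made to keep the example deterministic), and the names `kmeans`, `nearest` and `dist2` are hypothetical.

```python
def dist2(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centroids):
    """Minimum distance classifier: index of the closest centroid."""
    return min(range(len(centroids)), key=lambda k: dist2(p, centroids[k]))


def kmeans(points, k, iters=20):
    """Plain K-means; returns the list of cluster centroids."""
    centroids = [list(p) for p in points[:k]]  # assumed initialization
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        new = [
            [sum(c) / len(g) for c in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if new == centroids:  # assignments stable, stop early
            break
        centroids = new
    return centroids


# Toy 2-D features for text blocks, e.g. (relative boldness, block height).
blocks = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
          (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
centroids = kmeans(blocks, k=2)
label = nearest((9.0, 12.0), centroids)  # cluster index of a new block
```

The clusters are unlabeled; mapping a cluster index to "heading" or "body text" needs a separate step that the slide does not describe.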

  27. Problems of Projection method • Cannot say what each block contains until further analysis. – Extract features from a zone – Recognize the zone content using a classifier • Results are highly sensitive even to a small skew in the scanned page. – Detect the base line of each text line of the document – Determine the orientation (slope) angle of each base line – Estimate the overall skew of the document page
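The last two skew steps on this slide can be illustrated once base line points are available. The sketch assumes base line detection has already produced, for every text line, a list of (x, y) points lying on its base line; it then fits a least-squares line per text line and takes the median angle as the page skew. The least-squares fit, the use of the median, and the names `baseline_angle` and `page_skew` are choices made for this example, not something stated in the slides.

```python
import math
from statistics import median


def baseline_angle(points):
    """Angle (degrees) of the least-squares line through base line points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return math.degrees(math.atan2(sxy, sxx))


def page_skew(lines):
    """Overall skew: median of the per-line base line angles."""
    return median(baseline_angle(pts) for pts in lines if len(pts) >= 2)


# Three synthetic text lines: two skewed by 2 degrees, one outlier at 5.
slope2 = math.tan(math.radians(2.0))
slope5 = math.tan(math.radians(5.0))
lines = [
    [(x, 10 + slope2 * x) for x in range(0, 100, 10)],
    [(x, 40 + slope2 * x) for x in range(0, 100, 10)],
    [(x, 70 + slope5 * x) for x in range(0, 100, 10)],
]
skew = page_skew(lines)
```

The sign of the angle depends on the coordinate convention; in image coordinates, where y grows downward, a positive value corresponds to a clockwise tilt.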

  28. Processing Tool • Spatial domain operator that can handle shape information directly • Mathematically well defined • Neighborhood operator such that hardware implementation should be simple

  29. Mathematical Morphology • Mathematical morphological operators are a good choice. • Objects: all characters, figures, drawings, i.e., black components against a white background • Structuring element: regular geometric figures – mostly line segment, square, circle, etc.

  30. Morphological Operations. Set theoretic operations (including union, intersection, etc.): 1. Dilation 2. Erosion 3. Opening 4. Closing

  31. Morphological operator: Dilation • Expands the objects. $A \oplus B = \{\, a + b \mid a \in A,\ b \in B \,\}$, where A is an object and B is the structuring element (SE). • Properties: commutative, associative, distributive (over union), increasing. (Figure: original image and results for SE Circ-5, Circ-9 and Line-19)
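The set definition of dilation translates almost literally into code. In this minimal sketch the object and the structuring element are represented as Python sets of (row, column) tuples, which is an assumption made for clarity rather than efficiency, and `dilate` is a hypothetical name.

```python
def dilate(A, B):
    """Dilation as a Minkowski sum: {a + b | a in A, b in B}."""
    return {(a[0] + b[0], a[1] + b[1]) for a in A for b in B}


A = {(0, 0), (0, 1)}          # two horizontally adjacent pixels
B = {(0, 0), (1, 0)}          # vertical two-pixel structuring element
C = {(5, 5)}

grown = dilate(A, B)          # the object expands into a 2x2 square

# Two of the properties listed on the slide, checked on this example:
is_commutative = dilate(A, B) == dilate(B, A)
is_distributive = dilate(A | C, B) == dilate(A, B) | dilate(C, B)
```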

  32. Morphological operator: Erosion • Shrinks the objects. $A \ominus B = \{\, p \mid B_p \subseteq A \,\}$, where A is an object, B is the structuring element (SE) and $B_p$ is B translated by p. • Properties: distributive (over intersection), increasing. • Dilation and erosion are dual. (Figure: original image and results for SE Circ-5, Circ-9 and Line-19)
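Erosion can be sketched in the same set-based style, together with a small check of the duality with dilation. The assumptions here are that pixels are (row, column) tuples, that B is non-empty, and that the complement is taken inside a finite window `U` chosen large enough for this example; `erode`, `dilate` and `reflect` are hypothetical names.

```python
def dilate(A, B):
    """{a + b | a in A, b in B}."""
    return {(a[0] + b[0], a[1] + b[1]) for a in A for b in B}


def erode(A, B):
    """{p | B translated by p is contained in A}; B must be non-empty."""
    B = list(B)
    r0, c0 = B[0]
    # Any valid p must place B[0] on a pixel of A, so only test those p.
    candidates = {(r - r0, c - c0) for r, c in A}
    return {p for p in candidates
            if all((p[0] + br, p[1] + bc) in A for br, bc in B)}


def reflect(B):
    """Structuring element reflected through the origin."""
    return {(-r, -c) for r, c in B}


A = {(r, c) for r in range(3) for c in range(3)}   # 3x3 square
B = {(0, 0), (0, 1)}                               # horizontal 2-pixel line
eroded = erode(A, B)                               # shrinks to 3 rows x 2 columns

# Duality on a finite window: eroding A equals complementing the dilation
# of A's complement by the reflected structuring element.
U = {(r, c) for r in range(-2, 5) for c in range(-2, 5)}
inner = {(r, c) for r in range(-1, 4) for c in range(-1, 4)}
dual = inner - dilate(U - A, reflect(B))
```

The window `inner` is one pixel smaller than `U` on each side so that every translate of B stays inside `U`; without that margin the truncated complement would give wrong results near the border.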
