Document layout analysis in SCRIBO

SLIDE 1

Document layout analysis in SCRIBO

Outline Introduction & Goals Proposed approach Preprocessing Textlines extraction

Statistical model Extraction

Paragraphs extraction

Linking lines Cutting lines

Results Conclusion Bibliography

Document layout analysis in SCRIBO

CSI Seminar - July 2011 Julien MARQUEGNIES

EPITA Research and Development Laboratory

July 2011

1 / 31 Julien MARQUEGNIES

SLIDE 2

Outline

◮ Introduction & Goals
◮ Proposed approach
◮ Preprocessing
◮ Textlines extraction
◮ Statistical model
◮ Extraction
◮ Paragraphs extraction
◮ Linking lines
◮ Cutting lines
◮ Results
◮ Conclusion
◮ Bibliography

SLIDE 3

Introduction & Goals

Extracting the different structures of a digitized document relies on a processing chain composed of several crucial steps, shown in the figure:

Figure: Simplified processing chain in SCRIBO.

SLIDE 4

Contribution

◮ Textlines construction.
◮ Paragraphs extraction.
◮ Paragraphs polygon boundary.

SLIDE 5

Document layout analysis (1 / 2)

Document layout analysis studies the physical and logical layout of a document image.

◮ Physical: segmentation into blocks of maximum size and classification into a set of predefined types such as lines, paragraphs, pictures...
◮ Logical: retrieval of information about text regions (text reading order, titles, subtitles...).

SLIDE 6

Document layout analysis (2 / 2)

Document layout analysis algorithms fall into two categories, depending on how they process the document.

◮ Top-down: XY-Cut [7, 8, 9] and the whitespace analysis methods [10, 11, 12].
◮ Bottom-up: smearing algorithms [13, 14, 15], Voronoi diagram-based algorithms [16, 17, 18] and the Docstrum algorithm [19].

SLIDE 7

Font features

Figure: Lines features [2].

SLIDE 8

Proposed approach

◮ A bottom-up approach strengthened by information extracted from a top-down view of the document.
◮ Flexible enough to adapt to arbitrarily shaped regions.
◮ Clustering based on extending the bounding boxes of connected components to form higher-level entities.
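
The clustering idea in the last point can be sketched as follows. This is only an illustration, not SCRIBO's implementation: the box layout `(x0, y0, x1, y1)`, the horizontal margin `dx` and every function name are assumptions. Each connected-component box is widened horizontally, and components whose widened boxes overlap are placed in the same group with a union-find structure.

```python
def extend(box, dx):
    """Widen a box (x0, y0, x1, y1) by dx pixels on the left and right."""
    x0, y0, x1, y1 = box
    return (x0 - dx, y0, x1 + dx, y1)


def overlaps(a, b):
    """True when two boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def group_components(boxes, dx):
    """Group boxes whose horizontally extended versions overlap."""
    parent = list(range(len(boxes)))

    def find(i):
        # Follow parent links to the group representative (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    extended = [extend(b, dx) for b in boxes]
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlaps(extended[i], extended[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

The pairwise loop is quadratic in the number of components; it keeps the sketch short, and a spatial index would be the natural replacement on full pages.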

SLIDE 9

Processing chain

Figure: Our document layout analysis processing chain.

SLIDE 10

Preprocessing (1 / 3)

◮ No a priori knowledge about the type of connected components.
◮ Clustering initialization.

Figure: Groups after the linking step.

SLIDE 11

Preprocessing (2 / 3)

◮ Delimiters extraction: physical delimiters and tab-stops.
◮ Tab-stops: vertical alignments of textline edges, inferred from margins and column edges, which are all placed at a fixed x-position.

Figure: Document with thin column spaces.
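
One way to picture tab-stop detection is to look for x-positions shared by many textline edges. The sketch below is an assumption-laden illustration, not the method used in SCRIBO: the `tolerance` and `min_lines` parameters and the function name are hypothetical, and the filtering step that later discards some tab-stops is not covered.

```python
def tab_stop_candidates(edge_xs, tolerance=3, min_lines=4):
    """Return x-positions shared by at least `min_lines` textline edges.

    Sorted edge positions are grouped greedily: an edge joins the current
    group while it lies within `tolerance` pixels of the group's first edge.
    Each large enough group yields one candidate at its mean x-position.
    """
    groups = []
    for x in sorted(edge_xs):
        if groups and x - groups[-1][0] <= tolerance:
            groups[-1].append(x)
        else:
            groups.append([x])
    return [sum(g) / len(g) for g in groups if len(g) >= min_lines]
```

Calling it once with the left edges and once with the right edges of the textlines would give candidate positions for both kinds of alignment.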

SLIDE 12

Preprocessing (3 / 3)

Figure: Green lines are tab-stops. Orange lines are the tab-stops removed after filtering.

SLIDE 13

Statistical model (1 / 3)

◮ Bottom-up approaches are sensitive to the measures used to form higher-level entities, hence the need for reliable statistics.
◮ Our model relies heavily on baseline and mean line information to link groups of words.
◮ How should the mean line and baseline be computed?

Figure: Mean line and baseline estimation using the mean value.

SLIDE 14

Statistical model (2 / 3)

Figure: Mean line and baseline estimation using the median value.

SLIDE 15

Statistical model (3 / 3)

Proposed approach:

◮ Clustering over the 1D values.
◮ Maximize ascent and descent on textlines.
◮ Better estimate than the mean and median over the 250 test documents.

Figure: Mean line and baseline estimation using clustering.
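
The three estimators compared on these slides can be contrasted on a toy example. The sketch below makes several assumptions: the gap-based clustering rule, the `gap` parameter and the choice of the most populated cluster are illustrative only, and the "maximize ascent and descent" criterion mentioned above is not reproduced. The sample data is invented.

```python
from statistics import mean, median


def cluster_1d(values, gap=2):
    """Split sorted values into clusters wherever neighbors differ by more than gap."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def dominant_level(values, gap=2):
    """Estimate a line level as the mean of the most populated cluster."""
    return mean(max(cluster_1d(values, gap), key=len))


# Bottom y-coordinates of the character boxes of one textline (y grows
# downward). Values near 128 stand for descenders such as "g" or "p".
bottoms = [120, 120, 121, 128, 120, 119, 128, 120, 127, 128]

baseline_by_mean = mean(bottoms)               # pulled toward the descenders
baseline_by_median = median(bottoms)           # more robust, still shifted
baseline_by_cluster = dominant_level(bottoms)  # ignores the descender cluster
```

On this sample the descenders drag the mean several pixels below the level where most characters sit, while the cluster-based estimate stays on it.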

SLIDE 16

Processing chain

Figure: Textlines extraction processing chain.

SLIDE 17

Tagging

◮ Determine which groups are likely to be textlines.
◮ Simple conditions for groups with 3 characters or more.
◮ Special case for groups composed of only 2 characters.
◮ Remainder considered as non-text.

SLIDE 18

Merging (1 / 3)

◮ Merging is done using the extended bounding box of each line.
◮ 7 anchors checked.
◮ Looking for intersections with the extended bounding boxes of other lines.
◮ Use of baseline alignments and x-height similarities for lines merging.
◮ Specific conditions for non-text and text merging (especially for punctuation).

Figure: Extended bounding box anchors.
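
A merge test combining these criteria might look like the sketch below. It is a simplified illustration under stated assumptions: the `TextLine` fields, the horizontal extension `dx` and both thresholds are hypothetical, the 7 anchors are replaced by a plain box intersection, and the special cases for punctuation and non-text are left out.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    x0: int
    y0: int
    x1: int
    y1: int
    baseline: float  # y-coordinate of the baseline
    x_height: float  # distance between the mean line and the baseline


def extended_boxes_intersect(a, b, dx):
    """True when the boxes of a and b, each widened by dx, overlap."""
    return (a.x0 - dx <= b.x1 + dx and b.x0 - dx <= a.x1 + dx
            and a.y0 <= b.y1 and b.y0 <= a.y1)


def can_merge(a, b, dx=10, baseline_tol=0.3, x_height_ratio=0.7):
    """Decide whether two line fragments look like parts of one textline."""
    if not extended_boxes_intersect(a, b, dx):
        return False
    smaller = min(a.x_height, b.x_height)
    larger = max(a.x_height, b.x_height)
    # Baselines must be aligned relative to the smaller x-height.
    aligned = abs(a.baseline - b.baseline) <= baseline_tol * smaller
    # X-heights must be of comparable size.
    similar = smaller >= x_height_ratio * larger
    return aligned and similar
```

Expressing both tolerances relative to the x-height keeps the test independent of the scanning resolution, which is one reason to prefer ratios over pixel counts here.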

SLIDE 19

Merging (2 / 3)

Figure: Textlines merging example.

SLIDE 20

Merging (3 / 3)

Result:

Figure: Groups after the linking step.

Figure: Textlines.

SLIDE 21

Linking lines

◮ Looking for a top and a bottom neighbor for the current textline.
◮ Maximum search range: 1.5 × textline_x_height.
◮ Merging of the top and bottom links.
◮ At the end of the process, textlines are grouped into text regions (roughly, columns).

(a) Bottom-search process. (b) Top-search process.

Figure: Search process.
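
The neighbor search can be sketched as follows. Only the 1.5 × x-height range comes from the slides; the `TextLine` fields, the horizontal-overlap requirement and the way the gap is measured (box edge to box edge) are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    x0: int
    y0: int  # top of the line
    x1: int
    y1: int  # bottom of the line (y grows downward)
    x_height: float


def overlaps_horizontally(a, b):
    """True when the x-ranges of the two lines share some width."""
    return min(a.x1, b.x1) > max(a.x0, b.x0)


def nearest_neighbor(line, lines, direction, factor=1.5):
    """Return the closest line below (direction=+1) or above (-1), or None.

    A candidate must overlap `line` horizontally and lie within
    factor * line.x_height of it vertically.
    """
    max_gap = factor * line.x_height
    best, best_gap = None, None
    for other in lines:
        if other is line or not overlaps_horizontally(line, other):
            continue
        gap = other.y0 - line.y1 if direction > 0 else line.y0 - other.y1
        if 0 <= gap <= max_gap and (best_gap is None or gap < best_gap):
            best, best_gap = other, gap
    return best
```

Running the search in both directions for every textline gives the top and bottom links that are then merged into text regions.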

SLIDE 22

Cutting lines (1 / 3)

The goal of the cutting lines step is to establish paragraphs by studying:

◮ Text indentations.
◮ Spaces and baseline variations between adjacent lines.
◮ Lines features (x-height and mean char width).
◮ Font changes within a text region.

SLIDE 23

Cutting lines (2 / 3)

Figure: Case with a bottom and a top neighbor.

◮ Distances check between top_distance and bottom_distance.
◮ X-height and mean char width check between the three lines.
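
As a rough illustration of these two checks, the sketch below decides where to cut around a line that has both neighbors. The thresholds `spacing_tol` and `feature_tol`, the order of the checks and the function names are assumptions; the slides name the quantities compared but not the exact conditions.

```python
def similar(a, b, tol):
    """True when a and b differ by at most tol, relative to the larger one."""
    return abs(a - b) <= tol * max(a, b)


def cut_position(top_distance, bottom_distance, top, current, bottom,
                 spacing_tol=0.25, feature_tol=0.2):
    """Return "above", "below" or None for a line with two neighbors.

    top, current and bottom are (x_height, mean_char_width) pairs for the
    three lines; the distances are the vertical gaps to each neighbor.
    """
    def same_font(a, b):
        return similar(a[0], b[0], feature_tol) and similar(a[1], b[1], feature_tol)

    # A change of line features suggests a different font, hence a cut.
    if not same_font(top, current):
        return "above"
    if not same_font(current, bottom):
        return "below"
    # Same features: cut only where the spacing is clearly larger.
    if similar(top_distance, bottom_distance, spacing_tol):
        return None
    return "above" if top_distance > bottom_distance else "below"
```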

SLIDE 24

Cutting lines (3 / 3)

We have established three common patterns for the indentation check.

(a) Left indentation. (b) End of paragraph indentation. (c) End of column.

Figure: Common indentation patterns.
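
Patterns (a) and (b) lend themselves to a short sketch. Everything here is an assumption made for illustration: lines are reduced to their `(x0, x1)` horizontal extents, the region edges and `min_indent` are hypothetical parameters, and pattern (c) is left out because the slides do not describe how it is detected.

```python
def indentation_cuts(lines, region_x0, region_x1, min_indent):
    """Return the indices i such that a paragraph break falls before lines[i].

    `lines` is a top-to-bottom list of (x0, x1) extents inside a text region
    spanning region_x0..region_x1.
    """
    cuts = set()
    for i, (x0, x1) in enumerate(lines):
        # (a) Left indentation: an indented line starts a new paragraph.
        if i > 0 and x0 - region_x0 >= min_indent:
            cuts.add(i)
        # (b) End of paragraph: a short line ends the current paragraph.
        if i + 1 < len(lines) and region_x1 - x1 >= min_indent:
            cuts.add(i + 1)
    return sorted(cuts)
```

When a short last line is followed by an indented first line, both patterns point at the same break, so the result contains it only once.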

SLIDE 25

Results (1 / 4)

◮ Tested over 250 300-dpi documents.
◮ 61 documents from the ICDAR 2009 page segmentation contest.
◮ A complete issue of “Le Nouvel Observateur”.
◮ Dataset composed of technical articles, magazines, ads and newspaper articles.

Computation time on a Phenom II X4 955 (3.2 GHz):

Figure      Size (pixels)   Time (s)
Figure 1    2337×3249       0.47
Figure 2    2323×3246       0.5
Figure 3    2309×3138       0.46
Figure 4    2516×3272       0.55
Figure 5    2222×3080       0.51
Figure 6    2516×3272       0.78

SLIDE 26

Results (2 / 4)

Figure: Result 1.

SLIDE 27

Results (3 / 4)

Figure: Result 2.

SLIDE 28

Results (4 / 4)

Figure: Result 3.

SLIDE 29

Conclusion

◮ Relatively fast and accurate method for document layout analysis.
◮ Integrated in the new release of SCRIBO.
◮ Submitted to ICDAR 2011 Historical Document Layout Analysis contest.
◮ Submitted to ICDAR 2011 Specific Document Analysis Algorithm Contributions in End-to-End Applications.
◮ Waiting for results.

SLIDE 30

Future work

◮ Italic detection.
◮ Table handling.
◮ Dedicated step for text in pictures.
◮ Text reading order.
◮ Logical layout analysis.

SLIDE 31

Questions

Any questions?

SLIDE 32

Bibliography

[1] G. Nagy, Twenty years of document image analysis in PAMI, Pattern Analysis and Machine Intelligence, January 2000.
[2] Wikipedia, X-height, http://en.wikipedia.org/wiki/X-height, May 2011.
[3] A. Namboodiri, A. Jain, Document Structure and Layout Analysis, Digital Document Processing, 2007.
[4] A. Antonacopoulos, D. Karatzas, D. Bridson, Ground Truth for Layout Analysis Performance Evaluation, Document Analysis Systems VII, 2006.
[5] A. Antonacopoulos, S. Pletschacher, D. Bridson, C. Papadopoulos, ICDAR2009 Page Segmentation Competition, 10th International Conference on Document Analysis and Recognition, 2009.
[6] F. Shafait, D. Keysers, T. M. Breuel, Performance Evaluation and Benchmarking of Six-Page Segmentation Algorithms, IEEE Transactions on Pattern Analysis and Machine Intelligence, June 2008.
[7] G. Nagy, S. Seth, M. Viswanathan, A Prototype Document Image Analysis System for Technical Journals, Computer, vol. 25, July 1992.
[8] B. Kruatrachue, N. Moongfangklang, K. Siriboon, Fast Document Segmentation Using Contour and X-Y Cut Technique, The Third World Enformatika Conference, April 2005.
[9] J-L. Meunier, Optimized XY-Cut for Determining a Page Reading Order, Document Analysis and Recognition, September 2005.
[10] H. S. Baird, Background Structure in Document Images, Document Image Analysis, 1994.
[11] T. M. Breuel, Two Geometric Algorithms for Layout Analysis, Document Analysis Systems, August 2002.
[12] T. M. Breuel, An Algorithm for Finding Maximal Whitespace Rectangles at Arbitrary Orientations for Document Layout Analysis, Document Analysis and Recognition, 2003.
[13] K. Y. Wong, R. G. Casey, F. M. Wahl, Document Analysis System, IBM J. Research and Development, 1982.
[14] H-M. Sun, Enhanced Constrained Run-Length Algorithm for Complex Layout Document Processing, International Journal of Applied Science and Engineering, December 2006.
[15] S. Ferilli, M. Biba, F. Esposito, A Distance-based Technique for non-Manhattan Layout Analysis, 10th International Conference on Document Analysis and Recognition, 2009.
[16] K. Kise, A. Sato, M. Iwata, Segmentation of Page Images Using the Area Voronoi Diagram, Computer Vision and Image Understanding, June 1998.
[17] K. Kise, M. Iwata, K. Matsumoto, On the Application of Voronoi Diagrams to Page Segmentation, Proc. of Workshop on Document Layout Analysis and Its Applications, 1999.
[18] Z. Wang, Y. Lu, C. L. Tan, Word Extraction Using Area Voronoi Diagram, Computer Vision and Pattern Recognition Workshop, 2003.
[19] L. O’Gorman, The Document Spectrum for Page Layout Analysis, IEEE Trans. Pattern Analysis and Machine Intelligence, 1993.
[20] R. Smith, Hybrid Page Layout Analysis via Tab-Stop Detection, 10th International Conference on Document Analysis and Recognition, 2009.
[21] T. M. Breuel, High Performance Document Layout Analysis, Symposium on Document Image Understanding Technology, 2003.
[22] F. Chang, A hierarchical segmentation method for Document analysis, IPPR Conference on Computer Vision, Graphics, and Image Processing, 2003.
[23] C-A. Boiangiu, D-C. Cananau, B. Raducanu, I. Bucur, A Hierarchical Clustering Method Aimed at Document Layout Understanding and Analysis, Mathematical Models and Methods in Applied Sciences, 2008.
[24] A. K. Jain, B. Yu, Document Representation and Its Application to Page Decomposition, IEEE Trans. Pattern Analysis and Machine Intelligence, 1998.
[25] S-W. Lee, D-S. Ryu, Parameter-Free Geometric Document Layout Analysis, IEEE Trans. Pattern Analysis and Machine Intelligence, 2001.
[26] F. Shafait, D. Keysers, T. M. Breuel, Pixel-Accurate Representation and Evaluation of Page Segmentation in Document Images, 18th International Conference on Pattern Recognition, 2006.
[27] F. Shafait, J. van Beusekom, D. Keysers, T. M. Breuel, Structural Mixtures for Statistical Layout Analysis, Document Analysis Systems, 2008.
[28] R. Cattoni, T. Coianiz, S. Messelodi, C. M. Modena, Geometric layout analysis techniques for document image understanding: a review, ITC-irst Technical Report, 1998.