Màster de Visió per Computador Curs 2006 - 2007

OPTICAL CHARACTER RECOGNITION

Outline

  • Introduction
  • Pre-processing (document level)
    – Binarization
    – Skew correction
  • Segmentation
    – Layout analysis
    – Character segmentation
  • Pre-processing (character level)
  • Feature extraction
    – Image-based features
    – Statistical features
    – Transform-based features
    – Structural features
  • Classification
  • Post-processing
    – Classifier combination
    – Exploitation of context information
  • Examples of OCR systems
  • Bibliography

Optical Character Recognition

[Diagram: OCR lies at the intersection of statistical pattern recognition, structural pattern recognition, image processing and document analysis, with its own methods and applications.]

Some examples

  • Books, journals, reports
  • Postal addresses
  • Drawings, maps
  • Identity cards
  • License plates
  • Quality control
  • PDAs
  • Cheques, bills
  • Old documents


Document Image Analysis

What is a document? Objects created expressly to convey information encoded as iconic symbols

  – Scanned images from paper documents
  – Electronic documents
  – Multimedia documents (video with text)
  – …

Document image analysis is the subfield of digital image processing that aims at converting document images to symbolic form for modification, storage, retrieval, reuse and transmission. Document image analysis is the theory and practice of recovering the symbol structure of digital images scanned from paper or produced by computer.


  • G. Nagy: Twenty years of document image analysis in PAMI. IEEE Trans. on PAMI, vol. 22, nº 1, pp. 38-62, January 2000.

Applications of DIA

  • Document understanding:
    – Recognition
    – Interpretation
    – Indexing
    – Retrieval
  • Document imaging:
    – Digitization
    – Storage
    – Compression
    – Re-printing


DIA tasks

  • Mostly graphics documents:
    – Document imaging: acquisition, binarization, filtering, vectorization
    – Document understanding: text-graphics separation, symbol recognition, interpretation
  • Mostly text documents:
    – Document imaging: acquisition, binarization, filtering, skew correction
    – Document understanding: segmentation, layout analysis, OCR

Outline of the course

1. Acquisition
2. Pre-processing
   − Binarization
   − Skew correction
3. Layout analysis
4. Character segmentation
5. OCR
   − Feature extraction
   − Classification
   − Post-processing

Focus: document understanding of mostly text documents


Categorization of Character Recognition

  • According to the type of writing:
    – Machine-printed character recognition
    – Hand-written character recognition
  • According to the type of acquisition:
    – On-line character recognition
    – Off-line character recognition

Machine-printed character recognition

  • Characters are totally defined by the font type:

– Dimensions (segmentation)

  • Character width
  • Inter-character separation
  • Character height

– Shape (recognition)

  • Typographic effects (boldface, italics, underline).
  • Challenges:

  – Similar shapes among characters
  – Multiple fonts
  – Joined characters
  – Digitization noise: broken lines, random noise, heavy characters, etc.
  – Document degradation: old documents, photocopies, etc.


Machine-printed character recognition

  • Classification of machine-printed OCR systems:
    – Monofont: one single type of font
    – Multifont: recognition of a fixed and known set of fonts; it is necessary to identify and learn the differences between characters of all the font types
    – Omnifont: recognition of any arbitrary font, even if it has not been previously learned

Off-line hand-written character recognition

  • Hand-written
  • Off-line: acquisition by a scanner or a camera
  • Challenges:

  – Shape variability among images of the same character
  – Character segmentation

  • Subproblems:

  – Hand-written numeral recognition: digit recognition
  – Hand-printed character recognition: well-separated characters
  – Cursive character recognition: non-separated characters


On-line hand-written character recognition

  • On-line acquisition

  – Digitizer tablets
  – Digital pens
  – Tablet PCs

  • Advantages with respect to off-line acquisition:

  – The image is acquired while the text is written
  – We can take advantage of dynamic information:

  • Temporal information: writing order, stroke segmentation, etc
  • Writing speed
  • Pen pressure
  • Subproblems:

  – Cursive script recognition
  – Signature verification/recognition

Levels of difficulty in character recognition

Level 0
  • Little shape variability
  • Small number of characters
  • Little noise
    0.0. Printed characters. Specific font. Constant size. Roman alphabets.
    0.1. Constrained hand-printed characters. Arabic numerals.

Level 1
  • Medium variation in shape
  • Medium noise
    1.0. Printed characters. Multiple fonts. Nº characters < 100
    1.1. Loosely constrained hand-printed characters. Nº characters < 100
    1.2. Chinese characters of few fonts
    1.3. Loosely constrained hand-printed characters. Nº characters ≈ 1000

Level 2
  • Much variation in shape
  • Heavy noise
    2.0. Printed characters of multiple fonts
    2.1. Unconstrained hand-printed characters
    2.2. Affine transformed characters

Level 3
  • Non-segmented strings of characters
    3.0. Touching/broken characters
    3.1. Cursive handwriting characters
    3.2. Characters on a textured background

  • S. Mori, H. Nishida, H. Yamada: Optical Character Recognition. John Wiley and Sons, 1999.
  • S.V. Rice, G. Nagy, T.A. Nartker: Optical Character Recognition: An Illustrated Guide to the Frontier. Kluwer Academic Publishers, 1999.


Levels of difficulty in character recognition

Level 0

0.0. Printed character of a specific font with a constant size

  • Constant size
  • Connectivity of characters
  • Variation in the stroke thickness
  • Little noise

0.1. Constrained hand-printed characters

  • Characters are written according to some instructions or box guidelines

Solved problem


Levels of difficulty in character recognition

Level 1

1.0. Printed characters of multiple fonts
1.1. Loosely constrained hand-printed characters
1.2. Chinese characters of few fonts
1.3. Loosely constrained hand-printed characters. Nº characters ≈ 1000

Solved problem



Levels of difficulty in character recognition

Level 2

2.0. Printed characters of multiple fonts
2.1. Unconstrained hand-printed characters
2.2. Affine transformed characters

Levels of difficulty in character recognition

Level 3

3.0. Touching or broken characters
3.1. Cursive handwriting characters
3.2. Characters on a textured background


Databases for OCR

  • Off-line hand-written characters:
    – CEDAR: 50,000 segmented numerals from zip codes; 5,000 zip codes; 5,000 city names; 9,000 state names
    – CENPARMI: 17,000 manually segmented numerals from zip codes
    – NIST: more than 1,000,000 characters from forms; several learning and test sets (more variability)
  • On-line characters:
    – UNIPEN: definition of a format to represent on-line data; 4,500,000 characters; 91,500 sentences with a dictionary; segmented characters, words and sentences
  • Machine-printed documents:
    – Univ. Washington: more than 1,500 pages of articles in English; more than 500 pages of articles in Japanese; originals, photocopies and pages with artificially generated noise; page segmentation into labeled zones

Performance evaluation of OCR systems

  • Hand-printed character recognition:
    – Institute for Posts and Telecommunications Policy (IPTP), Japan, 1996
    – 5,000 hand-written numerals from Japanese zip codes
    – Performance of the best system: 97.94% (human performance: 99.84%)
  • Machine-printed character recognition:
    – "The Fifth Annual Test of OCR Accuracy". Information Science Research Institute, TR-96-01, April 1996. http://www.isri.unlv.edu
    – 5,000,000 characters from 2,000 pages of journals, newspapers, letters and technical reports
    – Performance on good quality documents: 99.77% - 99.13%
    – Performance on medium quality documents: 99.27% - 98.21%
    – Performance on low quality documents: 97.01% - 89.34%
  • Note: a performance of 99% still means about 30 errors per page (at 3,000 characters/page)


Performance evaluation of OCR systems

Database     Classifier   N test samples   Recognition (%)   Error (%)
MNIST        GPR          10,000           99.06             0.94
MNIST        VSVMb        10,000           99.62             0.38
MNIST        VSV2         10,000           99.44             0.56
MNIST        LeNet-5      10,000           99.18             0.82
MNIST        POE          10,000           98.32             1.68
CENPARMI     VSVM          2,000           98.7              1.3
USPS         VSVM          2,007           97.66             2.34
NIST SD19    MLP          30,000           99.16             0.84

  • C.Y. Suen, J. Tan: Analysis of errors of handwritten digits made by a multitude of classifiers. Pattern Recognition Letters 26, pp. 369-379, 2005.

Performance evaluation of OCR systems

Number of errors per classifier, grouped by error category:

Database     Classifier   Category 1   Category 2   Category 3   Sum
MNIST        GPR          24           11           59           94
MNIST        VSVMb        15           6            17           38
MNIST        VSV2         15           9            32           56
MNIST        LeNet-5      17           14           51           82
MNIST        POE          41           9            118          168
CENPARMI     VSVM         6            4            16           26
USPS         VSVM         13           13           21           47
NIST SD19    MLP          30           8            81           119
Sum                       161          74           395          630
Percentage (%)            25.56        11.75        62.70        100


Components of an OCR system

ACQUISITION → DOCUMENT PRE-PROCESSING → SEGMENTATION → CHARACTER PRE-PROCESSING → FEATURE EXTRACTION → CLASSIFICATION → POST-PROCESSING

  • Document pre-processing: filtering, binarization, skew correction
  • Segmentation: layout analysis, text/graphics separation, character segmentation
  • Character pre-processing: filtering, normalization
  • Feature extraction: image-based, statistical, transform-based and structural features
  • Classification: based on models obtained in a learning stage
  • Post-processing: exploitation of context information

Acquisition

  • Acquisition: scanners
  • A scanner is a linear camera with a lighting system and a displacement mechanism

[Diagram: scanner optical path: document, lights, scan line, opening, lens, CCD, video circuit, digital image]


Important features in a scanner:

  • Optical resolution / interpolated resolution
  • Bits/pixel (depth)
  • Speed (acquisition and calibration)
  • Connection (parallel, USB, SCSI)
  • Programming tools (TWAIN protocol, specific programming languages such as HP's SCL, etc.)
  • Automatic feeding

Types of scanners:

  • Flatbed scanners: the CCD line is displaced along the paper
  • Traction scanners: the paper is displaced across the CCD line
  • Others: specific scanners for negative films, cards, passports, etc.

  • Acquisition: scanners vs. cameras
  • Resolution determines:
    – Quality of the image
    – Size of the image
    – Speed of acquisition
  • Minimal resolution for OCR: 200 dpi
    – A 12-point character (approx. 2x3 mm) at 200 dpi generates a 16x24 image
  • A4 page (297x210 mm):
    – Scanner at 200 dpi: 2376x1680 image
    – Camera of 1024x1024 pixels: resolution of about 90 dpi
  • DNI identity card (85x55 mm):
    – Scanner at 200 dpi: 670x435 image
    – Camera of 1024x1024 pixels: resolution of about 300 dpi

[Figure: a character scanned at 200 dpi, showing digitization noise]
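The resolution figures above follow from simple arithmetic; a minimal sketch (the slide's 90/300 dpi values are rounded, and the helper name is illustrative):

```python
# Effective resolution of a fixed-size sensor imaging a document:
# dpi = sensor pixels / document size in inches (1 inch = 25.4 mm).
def effective_dpi(sensor_pixels, size_mm):
    return sensor_pixels / (size_mm / 25.4)

print(round(effective_dpi(1024, 297)))  # A4 height with a 1024-pixel camera: ~88 dpi
print(round(effective_dpi(1024, 85)))   # DNI width with a 1024-pixel camera: ~306 dpi
```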


  • Acquisition: scanners vs. cameras
  • Advantages of scanners:
    – Cost
    – Resolution
    – Lighting is under control
    – Control of optical distortions
  • Advantages of cameras:
    – Acquisition speed
    – More flexibility to adapt to the environment and to the material to read

On-line acquisition

  • Input device: simulates pen and paper
    – The image is acquired while it is generated
    – A special input device provides x and y coordinates over time
  • Components of the device:
    – Pen, paper and support; at least one of these components must be special
  • Technical specifications:
    – Resolution
    – Sample frequency
    – Precision

[Images: digitizer tablet without display, digitizer tablet with display, Tablet PC, digital pen and paper]


Components of an OCR system (current stage: document pre-processing)

Binarization

  • Global methods:
    – Apply the same global threshold to all the pixels of the image
  • Local adaptive methods:
    – Apply a different threshold to every pixel depending on the local distribution of gray values
  • Special case: binarization of textured backgrounds

  • O.D. Trier, T. Taxt: Evaluation of binarization methods for document images. IEEE Trans. on PAMI, vol. 17, nº 3, pp. 312-315, 1995.


Binarization: Otsu

  • Two classes of gray-scale levels: black pixels (foreground) and white pixels (background)
  • Probabilistic criterion to select the threshold:
    – Maximize inter-class variability
    – Minimize intra-class variability

  • N. Otsu: A threshold selection method from gray-level histograms. IEEE Trans. on Systems, Man and Cybernetics, vol. 9, nº 1, pp. 62-66, 1979.

Otsu

  • The probability distribution of gray levels is defined as $p_i = n_i / N$, where $n_i$ is the number of pixels with gray level $i$ and $N$ is the total number of pixels.
  • We want to define two classes of gray levels, where $k$ is the threshold:
    – C1: $[1..k]$, with probability $\omega_1 = \sum_{i=1}^{k} p_i$ and $P(i|C_1) = p_i/\omega_1$
    – C2: $[k+1..L]$, with probability $\omega_2 = \sum_{i=k+1}^{L} p_i$ and $P(i|C_2) = p_i/\omega_2$
  • The mean and variance of each class are defined as:

$$\mu_1 = \sum_{i=1}^{k} i\,P(i|C_1) \qquad \sigma_1^2 = \sum_{i=1}^{k} (i-\mu_1)^2\,P(i|C_1)$$
$$\mu_2 = \sum_{i=k+1}^{L} i\,P(i|C_2) \qquad \sigma_2^2 = \sum_{i=k+1}^{L} (i-\mu_2)^2\,P(i|C_2)$$

  • The total mean and variance are $\mu_T = \sum_{i=1}^{L} i\,p_i$ and $\sigma_T^2 = \sum_{i=1}^{L} (i-\mu_T)^2\,p_i$.


Otsu

  • We define the within-class and between-class variances:

$$\sigma_W^2 = \omega_1 \sigma_1^2 + \omega_2 \sigma_2^2 \qquad \sigma_B^2 = \omega_1 (\mu_1 - \mu_T)^2 + \omega_2 (\mu_2 - \mu_T)^2$$

  • The following criteria of class separability are equivalent:

$$\lambda = \sigma_B^2 / \sigma_W^2 \qquad \kappa = \sigma_T^2 / \sigma_W^2 \qquad \eta = \sigma_B^2 / \sigma_T^2 \qquad \text{with } \kappa = \lambda + 1,\ \eta = \lambda/(\lambda+1)$$

  • Then, the optimal threshold is:

$$k^* = \arg\max_{1 \le k \le L} \sigma_B^2(k)$$
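A minimal NumPy sketch of Otsu's threshold selection, maximizing the between-class variance over a 256-bin histogram (function and variable names are illustrative):

```python
import numpy as np

def otsu_threshold(gray):
    """Select the threshold k that maximizes sigma_B^2(k)."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()                      # gray-level probabilities p_i
    omega1 = np.cumsum(p)                      # class probability of C1 = [0..k]
    mu = np.cumsum(p * np.arange(256))         # cumulative first moment
    mu_T = mu[-1]                              # total mean
    denom = omega1 * (1.0 - omega1)            # guard against empty classes
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_B2 = (mu_T * omega1 - mu) ** 2 / denom
    sigma_B2[denom == 0] = 0.0
    return int(np.argmax(sigma_B2))

# binary = gray > otsu_threshold(gray)   # foreground/background split
```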

Binarization: local adaptive binarization

  • The threshold at every pixel depends on the local distribution of gray levels in the neighborhood of the pixel.
  • Niblack's method:

$$T(x,y) = m(x,y) + k \cdot s(x,y)$$

    – m(x,y) and s(x,y) are the mean and standard deviation in a local neighborhood of the pixel
    – Window size: 15 x 15
    – k = -0.2
  • Eikvil et al.'s method:
    – For every pixel, we define a small window S (3 x 3) and a large window L (15 x 15)
    – The threshold value is selected by applying Otsu to the large window L
    – The pixels in the small window S are thresholded using this value

  • W. Niblack: An Introduction to Digital Image Processing, pp. 115-116, Prentice Hall, 1986.
  • L. Eikvil, T. Taxt, K. Moen: A fast adaptive method for binarization of document images. International Conference on Document Analysis and Recognition, pp. 435-443, 1991.
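A sketch of Niblack's rule T(x,y) = m(x,y) + k·s(x,y), using box filters for the local statistics (scipy.ndimage.uniform_filter is assumed available):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def niblack_binarize(gray, window=15, k=-0.2):
    """Per-pixel threshold T = m + k*s over a window x window neighborhood."""
    g = gray.astype(np.float64)
    m = uniform_filter(g, size=window)          # local mean m(x,y)
    m2 = uniform_filter(g * g, size=window)     # local mean of squares
    s = np.sqrt(np.maximum(m2 - m * m, 0.0))    # local std deviation s(x,y)
    return g > m + k * s                        # True = background (white)
```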


Binarization: evaluation

[Figure: the same document binarized with Otsu (global), Eikvil and Niblack]

  • Visual criteria of evaluation:
    – Broken line structures: gaps in lines
    – Broken symbols and text: symbols and text with gaps
    – Blurring of lines, symbols and text
    – Loss of complete objects
    – Noise in homogeneous areas: noisy spots and false objects in both background and print
  • Scale of 1-5 for each criterion


Binarization: textured backgrounds

  1. Selection of candidate thresholds:
     – Iterative application of Otsu
     – At each iteration, Otsu is applied to the part of the histogram with the lowest mean
     – The number of iterations depends on the number of peaks in the histogram
     – We get a set of possible threshold values
  2. Computation of a set of texture features for each possible binarization:
     – Based on the run-length histogram of the binarized image, R(i), i ∈ {1,..,L}
  3. Selection of the optimal binarization

  • Y. Liu, S. Srihari: Document image binarization based on texture features. IEEE Trans. on PAMI, vol. 19, nº 5, pp. 540-544, 1997.


Binarization: textured backgrounds

  • Texture features, computed on the run-length histogram R(i) of each candidate binarization:
    – Stroke width: run-length with the highest frequency

$$SW = \arg\max_i R(i), \quad i \neq 1$$

    – Stroke-like pattern noise: relevance of stroke-like patterns in the background between consecutive threshold selections j and j+1. If it is high, it denotes stroke-like patterns due to texture.

$$SPN = \frac{\max_i R_{j+1}(i)}{\max_i R_j(i)}, \quad i \neq 1$$

    – Unit run noise: relevance of unit run-lengths. Ideally, it should be low.

$$URN = \frac{R(1)}{\max_i R(i)}, \quad i \neq 1$$

    – Long run noise: relevance of long run-lengths. Ideally, it should be low.

$$LRN = \frac{\sum_{i > L} R(i)}{\max_i R(i)}, \quad i \neq 1$$

    – Broken character: it will be high if characters are broken.

$$BC = \frac{\min_{i \in I'} R(i)}{\max_i R(i)}, \quad I' = \{1, \ldots, \arg\max_i R(i)\}$$

Binarization: textured backgrounds

  • Experimental results show that 2 iterations of Otsu are enough
  • Decision tree to select the optimal threshold (between T1 and T2, with T1 > T2):
    1. Select the value with the larger stroke width feature.
    2. If T1 is the selected value:
       – If background noise features are low, T1 is the selected threshold.
       – Otherwise, if T2 does not result in many broken characters, select T2.
    3. If T2 is the selected value:
       – If the broken character feature is low, select T2.
       – Otherwise, if noise features are low with T1, select T1.
    4. If neither of the two thresholds is good enough, select the average between them.

Binarization of textured backgrounds

[Figure: binarization results on textured backgrounds]

Skew correction: projection profiles

  1. Compute the horizontal projection at several angles (depending on the desired resolution)
  2. For every projection, compute a directional criterion that estimates the difference between maxima and minima in the projection:
     • Sum of squared differences in adjacent rows
     • Variance of the number of black pixels in a scan line
  3. Select the angle that maximizes the directional criterion

  • W. Postl: Detection of linear oblique structures and skew scan in digitized documents. 8th International Conference on Pattern Recognition, pp. 687-689, 1986.
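A brute-force sketch of this criterion: rotate the binary page over a range of candidate angles and keep the one with the highest projection variance (the angle range and step are illustrative choices):

```python
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary, angles=np.arange(-10, 10.25, 0.25)):
    """Projection-profile skew estimation: the correct angle gives the
    sharpest line/gap alternation, i.e. the highest profile variance."""
    best_angle, best_score = 0.0, -np.inf
    for a in angles:
        rot = rotate(binary.astype(float), a, reshape=False, order=1)
        profile = rot.sum(axis=1)              # black pixels per scan line
        score = profile.var()                  # directional criterion
        if score > best_score:
            best_angle, best_score = a, score
    return best_angle
```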


Skew correction: projection profiles

  • Modification of the previous algorithm:
    – Use only the bottom-centers of connected components to compute the projection profiles
    – Reduces computation cost

  • H.S. Baird: The skew angle of printed documents. Proc. of the Society of Photographic Scientists and Engineers, pp. 14-21, 1987.

Analysis of neighboring connected components

  1. Compute connected components.
  2. For every connected component, search for the k nearest neighbours (k = 5).
  3. For every pair of connected components, compute the angle between the centroids.
  4. Compute the histogram of angles.
  5. Estimation of the skew angle: maximum of the histogram.

  • L. O'Gorman: The document spectrum for page layout analysis. IEEE Trans. on PAMI, vol. 15, nº 11, 1993.
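A sketch of this rough estimate (O'Gorman's docstrum idea); connected components and centroids come from scipy.ndimage, and the histogram bin width is an assumption:

```python
import numpy as np
from scipy.ndimage import label, center_of_mass

def skew_from_components(binary, k=5):
    """Histogram of centroid-to-centroid angles between each connected
    component and its k nearest neighbours; the mode is the skew angle."""
    lbl, n = label(binary)
    cents = np.array(center_of_mass(binary, lbl, range(1, n + 1)))  # (y, x)
    angles = []
    for i, c in enumerate(cents):
        d = np.hypot(*(cents - c).T)           # distances to all centroids
        d[i] = np.inf                          # exclude the component itself
        for j in np.argsort(d)[:k]:
            dy, dx = cents[j] - c
            angles.append(np.degrees(np.arctan2(dy, dx)))
    hist, edges = np.histogram(angles, bins=np.arange(-90, 90.5, 0.5))
    return edges[np.argmax(hist)]              # maximum of the histogram
```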


Analysis of neighboring connected components

  • Accurate angle estimation:
    1. Find connected components in the same line (clusters of pairs of connected components with an angle near the rough estimation).
    2. Fit a straight line (using regression) to the centroids of the components in each line.
    3. Make the final estimation from these text lines.

Components of an OCR system (current stage: segmentation)


Page Layout Analysis

  • Layout analysis: segmentation of the image into several blocks with the same type of information: text, graphics, table, image, etc.
  • Methods:
    – Run-length smearing
    – Analysis of connected components

[Figure: example of a page segmented into layout blocks]

Page Layout Analysis: run-length smearing

  1. Horizontal run-length smearing
     • Threshold: inter-character separation (maximum of the histogram of the width of white runs)
  2. Vertical smearing
     • Threshold: inter-line separation (estimated during skew correction)
  3. Logical AND between both images
  4. Additional horizontal smearing
  5. Connected components are the blocks
  6. Computation of features for each connected component: aspect ratio, black pixel density, Euler number, perimeter length, perimeter-to-width ratio, perimeter-squared-to-area ratio

[Example: a binary scan line before and after smearing]

  • J.L. Fisher, S.C. Hinds, D.P. D'Amato: A rule-based system for document image segmentation. 10th International Conference on Pattern Recognition, pp. 567-572, 1990.
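A minimal sketch of the smearing steps above; the gap thresholds are illustrative and would normally come from the inter-character and inter-line estimates:

```python
import numpy as np

def smear_rows(binary, max_gap):
    """Horizontal run-length smearing: fill white runs shorter than max_gap."""
    out = binary.copy()
    for row in out:
        ink = np.flatnonzero(row)
        if ink.size < 2:
            continue
        gaps = np.diff(ink)                      # distance between ink pixels
        for start, gap in zip(ink[:-1], gaps):
            if 1 < gap <= max_gap:               # short white run -> fill it
                row[start:start + gap] = 1
    return out

def rlsa(binary, h_gap=30, v_gap=50):
    """RLSA block mask: horizontal AND vertical smearing of a 0/1 image."""
    h = smear_rows(binary, h_gap)
    v = smear_rows(binary.T, v_gap).T            # vertical = smear the transpose
    return h & v                                 # logical AND of both images
```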

Page Layout Analysis: run-length smearing

  • A set of rules classifies each block into text or non-text according to the features of each connected component

[Figure: run-length smearing applied to a document page]

Page Layout Analysis: analysis of connected components

  1. Detection of connected components
  2. Definition of distance and overlap among components, from the bounding-box limits l and u and the width W and height H of each component:

$$D_x(i,j) = \max[l_x(i), l_x(j)] - \min[u_x(i), u_x(j)] \qquad D_y(i,j) = \max[l_y(i), l_y(j)] - \min[u_y(i), u_y(j)]$$
$$V_x(i,j) = \frac{-D_x(i,j)}{\min[W(i), W(j)]} \qquad V_y(i,j) = \frac{-D_y(i,j)}{\min[H(i), H(j)]}$$

  3. Grouping of connected components in the same line:
     • $D_x$ below a distance threshold
     • $V_y$ above an overlap threshold

  • A.K. Jain, B. Yu: Document representation and its application to page decomposition. IEEE Trans. on PAMI, vol. 20, nº 3, pp. 294-308, 1998.


Page Layout Analysis: Analysis of connected components

  4. Classification of lines into text and non-text. Text lines have:
     • Height below a threshold and horizontal alignment (standard deviation of the bottom edges of the CCs below a threshold)
     • Width over a threshold and all CCs of similar height (ratio between mean and standard deviation less than a threshold)
  5. Grouping of text lines into text regions:
     • Vertically close and horizontally overlapped
  6. Grouping of non-text lines into non-text regions:
     • Vertically close and horizontally overlapped
     • Horizontally close and vertically overlapped
  7. Identification of image regions among non-text regions:
     • Large regions
     • Large ratio of black pixels
  8. Identification of table regions:
     • Detection of horizontal and vertical lines with similar orientations
     • Similar height of CCs
  9. Remaining non-text regions: drawing regions

[Figure: page decomposition into text, image, table and drawing regions]


Page Layout Analysis: multiscale analysis

  1. Application of wavelets to obtain a multiscale representation
  2. Computation of local features (local moments) from the wavelet representation over each window $W_i$:

$$f_n(W_i) = \frac{1}{|W_i|} \sum_{x \in W_i} \left( f(x) - \mu_{W_i} \right)^n$$

  3. Training of a neural network to classify each block as text, image or graphics according to these local features
  4. Propagation of the classification through adjacent blocks and between different scales of the wavelet representation

Page Layout Analysis: performance evaluation

  • Goals:
    – Evaluate the performance of several commercial OCR engines on a set of journal pages
    – Use the results of the evaluation to define methods for the combination of these engines in order to improve the overall performance
  • Therefore, the evaluation must make it possible to determine the strengths and weaknesses of each method


Page Layout Analysis: performance evaluation

[Figure: ground-truth vs. output of segmentation, with error zones marked as correct, misrecognized or unrecognized]

Page Layout Analysis: performance evaluation

  • 5 types of zones:
    – Text
    – Graphics
    – Table
    – Background
    – Text over image
  • Comparison between the ground truth and the output of the engines: overlapping between output zones and ground-truth zones

Page Layout Analysis: performance evaluation

  • Evaluation measures: more than 100 measures grouped in six categories
    – Good recognition measures: percentage of ground-truth area (grouped by zone type) recognized as zones of the same type in the output, i.e. text zones in the ground truth recognized as text zones, graph zones recognized as graph zones, etc.
    – Unrecognition measures: relative area of zones of type text, graph, text-over-image or table in the ground truth recognized as background
    – Misrecognition measures: zones in the ground truth recognized as a different type, for example a text zone recognized as graph, text-over-image or table
    – Overlap measures: relative area (grouped by type) recognized twice by a zoning engine
    – Split and merge measures: how many zones are recognized and assigned in terms of splitting and merging errors


Page Layout Analysis: performance evaluation

  • Experiments:
    – Creation of the ground truth for 100 journal pages
    – Evaluation of six OCR engines
    – Tests with two image formats (TIFF and JPEG) and 4 image resolutions (100 dpi, 200 dpi, 300 dpi and 400 dpi)
    – Evaluation of a simple combination scheme

Page Layout Analysis: performance evaluation

[Charts: text correct recognition and text misrecognition scores for ABBYY 6, ABBYY 7, PODCORE, PODCORE2, SCANSOFT, XVISION and their combination (COMB), on TIFF and JPG images at 100-400 dpi]


Page Layout Analysis: performance evaluation

[Charts: graph and text-over-image (ToI) correct recognition scores for the same engines, on TIFF and JPG images at 100-400 dpi]

Page Layout Analysis: performance evaluation

  • Conclusions:
    – Image format makes no difference in the final results
    – No ideal resolution:
      • Sometimes better at 300 dpi, sometimes better at 400 dpi
      • Results are a little lower at 200 dpi or 100 dpi
    – Combination can improve the results, but more advanced combination schemes should be defined


Text/Graphics Separation

  • Analysis of connected components:
    – Size: characters are smaller than graphic components
    – x-y aspect ratio: characters are more "squared" than graphic components
    – Pixel density: characters are denser than graphic components
  • Grouping of characters: Hough transform and component proximity
  • Difficulties:
    – Joined characters
    – Characters touching lines

  • L.A. Fletcher, R. Kasturi: A robust algorithm for text string separation from mixed text/graphics images. IEEE Trans. on PAMI, vol. 10, nº 5, pp. 910-918, 1988.

Text/Graphics Separation

  • Specific methods: detection and removal of lines at given orientations (horizontal, vertical, ±22.5º, ±45º, ±67.5º):
    – Detection of long consecutive runs of black pixels after rotating the image at the given orientations
    – Analysis of connected components to separate text and graphics

  • Z. Lu: Detection of text regions from digital engineering drawings. IEEE Trans. on PAMI, vol. 20, nº 4, pp. 431-439, April 1998.


Text/Graphics Separation

  • Vertical and horizontal run-length smearing to join components
  • Classification of the final components as text or graphics based on their density and size
  • Recovery of the original image from the enclosing rectangles of the text components

Character segmentation

  • Segmentation of characters in blocks of text
  • Levels of difficulty:
    – Characters with uniform separation and fixed width
    – Well separated characters with proportional width
    – Broken characters
    – Touching characters
    – Broken and touching characters
    – Cursive script
    – Hand-printed words
    – Handwritten cursive words

  • R.G. Casey, E. Lecolinet: A survey of methods and strategies in character segmentation. IEEE Trans. on PAMI, vol. 18, nº 7, pp. 690-706, 1996.
  • Y. Lu: Machine printed character segmentation - An overview. Pattern Recognition, vol. 28, nº 1, pp. 67-80, 1995.
  • Y. Lu, M. Shridhar: Character segmentation in handwritten words - An overview. Pattern Recognition, vol. 29, nº 1, pp. 77-96, 1996.


Character segmentation

  • Relevant features for segmentation:
    – Character width and height
    – Distance between characters
    – Inter-character interval (distance between character centres)
    – Aspect ratio
    – Baseline and top baseline
    – Ascenders and descenders

Character segmentation: classification of methods

  • External segmentation:
    – Segmentation before recognition; independent processes
    – The goal is to find the exact location of character separations
    – Low performance with cursive script, touching characters or handwriting
  • Internal segmentation:
    – Based on the Sayre paradox: a letter cannot be segmented without being recognized and cannot be recognized without being segmented
    – Segmentation and recognition are done at the same time
    – Recognition generates or validates segmentation hypotheses
  • Holistic methods:
    – No character segmentation
    – Recognition tries to recognize words without recognizing individual characters


Character segmentation: classification of methods

[Taxonomy: analytical vs. holistic methods. Analytical methods split into external (based on the image: dissection, post-processing into graphemes) and internal (based on recognition: windowing with dynamic programming, or feature-based with Markov, e.g. Hidden Markov Models, and non-Markov variants); hybrid methods combine both.]

External segmentation

  • Image decomposition into sub-images using general features
  • Each sub-image corresponds to a possible character
  • Combination of several methods:
    – Connected component labelling
    – Run-length smearing
    – Projections and X-Y tree decomposition
    – Analysis of contours
    – Analysis of profiles


External segmentation: connected component labelling

[Figure: original image → labelling according to neighbours → end of labelling → unification of equivalent classes]
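A sketch of the classical two-pass labelling with union-find for the unification of equivalent classes (4-connectivity; in practice scipy.ndimage.label does the same job much faster):

```python
import numpy as np

def label_components(binary):
    """Two-pass connected component labelling (4-connectivity)."""
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=int)
    parent = [0]                                 # union-find forest

    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a

    nxt = 1
    for y in range(h):                           # pass 1: provisional labels
        for x in range(w):
            if not binary[y, x]:
                continue
            up = labels[y - 1, x] if y > 0 else 0
            left = labels[y, x - 1] if x > 0 else 0
            if up == 0 and left == 0:            # new provisional label
                parent.append(nxt)
                labels[y, x] = nxt
                nxt += 1
            else:
                lab = min(l for l in (up, left) if l)
                labels[y, x] = lab
                for other in (up, left):         # record equivalences
                    if other:
                        parent[find(other)] = find(lab)
    for y in range(h):                           # pass 2: unify classes
        for x in range(w):
            if labels[y, x]:
                labels[y, x] = find(labels[y, x])
    return labels
```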

External segmentation: projections and X-Y trees

  • If text lines are perfectly separated, only one vertical projection is required. Otherwise, it is necessary to apply several alternating horizontal and vertical projections.

[Figure: X-Y tree decomposition of a page: horizontal projection, then vertical projection, then horizontal projection, yielding nodes 1, 1.1, ..., 2.4.2]


External segmentation: projections

  • It is more robust to use the second derivative of the projection instead of the value at each point of the histogram (it enhances the projection minima):

$$F(x) = V(x-1) - 2V(x) + V(x+1)$$
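A small sketch computing the vertical projection V(x) and its second difference F(x); columns with low V and high F are candidate cut points:

```python
import numpy as np

def cut_candidates(binary):
    """Vertical ink projection and its second difference F(x)."""
    V = binary.sum(axis=0).astype(float)        # vertical projection V(x)
    F = np.zeros_like(V)
    F[1:-1] = V[:-2] - 2 * V[1:-1] + V[2:]      # F(x) = V(x-1) - 2V(x) + V(x+1)
    return V, F
```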

External segmentation: run-length smearing

  • Run-lengths: sequences of consecutive pixels with the same colour in a row or column
  • Smearing: inversion of white runs with a length below a certain threshold

[Figure: horizontal smearing and vertical smearing combined with a logical AND]


External segmentation: analysis of profiles

  • Determination of the point of separation between characters:
    – Follow the profile beginning at a minimum up to the following maximum
    – Distance between upper and lower profiles

[Figure: upper and lower profiles of a word]

External segmentation: post-processing

  • Problem: broken and touching characters
    – Analysis of the bounding box: definition of several rules that permit joining or breaking them properly, based on:
      • Estimated character size (width and height)
      • Number of estimated characters
      • Component aspect ratio
      • Proximity/overlapping of bounding boxes


External segmentation: post-processing

  • "Hit and deflect" strategy:
    – Starting point: maximum of the lower profile / minimum of the upper profile
    – Contour following:
      • Vertical scan to find a contour point
      • Move the scan point according to the value of the neighbouring pixels:
        – To the right/left, if one pixel corresponds to the character and the other does not
        – Up, if both neighbouring pixels belong to the character

  • M. Shridhar, A. Badreldin: Recognition of isolated and simply connected handwritten numerals. Pattern Recognition, vol. 19, nº 1, pp. 1-12, 1986.

External segmentation: oversegmentation

  • Segmentation of the image into sub-images that do not necessarily correspond to individual characters
  • Analysis of the contour minima and maxima:
    – Detection of significant contour minima:
      • Cut points
      • Lower extreme of a character
    – Generation of possible cut points:
      • For each contour minimum, search to the left and to the right for points that correspond to a single vertical run with low density
    – Compaction of nearby cut points

  • R. Bozinovic, S. Srihari: Off-line cursive script word recognition. IEEE Trans. on PAMI, vol. 11, nº 1, 1989.

Character segmentation: classification of methods (taxonomy diagram repeated)

Internal segmentation

  • Segmentation and recognition at the same time
  • Two approaches:
    – Windowing:
      • Sequential scan of the image from left to right
      • Generation of segmentation hypotheses
      • Detection of the cut points with the best recognition performance (verification step)
    – Based on image features:
      • Feature detection
      • Generation of possible correspondences between features and letters
      • Search for the best possible combination among all correspondences
      • Two types of methods: Markov-based and non-Markov-based


Internal segmentation: windowing

  • A mobile window is used to generate possible segmentation sequences
  • Each possible segmentation is validated by the recognition process
  • Search for a segmentation sequence that yields a valid final result

  • R.G. Casey, G. Nagy: Recursive segmentation and classification of composite patterns. 6th International Conference on Pattern Recognition, pp. 349-451, 1982.

Internal segmentation: windowing

  • Shortest path segmentation:
    – Representation of all segmentation possibilities with a graph:
      • Nodes: all possible combinations of pre-segmented zones
      • Edges: neighbour compatibility between pre-segmented zones
    – Using a neural network, each node is assigned a recognized character, with a measure of confidence (distance)
    – Finding the shortest path in the graph is equivalent to finding the best possible segmentation

  • C.J.C. Burges, J.I. Be, C.R. Nohl: Recognition of handwritten cursive postal words using neural networks. Proc. USPS 5th Advanced Technology Conference, p. A-117, 1992.

Internal segmentation: feature-based

  • Graph-based:
    – Representation of the image skeleton with a graph
    – Subgraph matching to find possible correspondences with the character prototypes
    – Creation of a network:
      • Nodes: recognized prototypes labelled with the matching cost
      • Edges: adjacency relationships among nodes
    – Recognition: searching for the optimal path in the network

[Figure: network of candidate character prototypes for a word image]

  • J. Rocha, T. Pavlidis: Character recognition without segmentation. IEEE Trans. on PAMI, vol. 17, nº 9, pp. 903-909, 1995.

Components of an OCR system (current stage: character pre-processing)


Character pre-processing

Some usual pre-processing operations in OCR:

  • Filtering: noise reduction
  • Thinning
  • Binarization
  • Normalization:
    – Reduce character variability
    – Convert the character to a normal shape:
      • Orientation
      • Slant
      • Size
      • Stroke thickness

Normalization

Inverse transforms to reduce intra-class variance. The most usual normalization transforms are:

  • Rotation: rotated scans, text in graphic documents
  • Slant: cursive fonts or handwriting
  • Stroke thickness: bold fonts or very thin strokes, handwriting with different pen thickness
  • Size: titles, footnotes, handwriting


Normalization

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}$$

Type         Parameters                                              Meaning
Translation  a11 = a22 = 1, a12 = a21 = 0                            tx, ty: translation factors
Scaling      a11 = sx, a22 = sy, a12 = a21 = 0                       sx, sy: scaling factors
Rotation     a11 = cos α, a12 = -sin α, a21 = sin α, a22 = cos α     α: rotation angle
Slant        a11 = 1, a12 = tan β, a21 = 0, a22 = 1                  β: slant angle
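The table rows translate directly into 2x2 matrices; a minimal sketch (angles in radians, names illustrative):

```python
import numpy as np

def make_transform(kind, **p):
    """Linear part A of the normalization transform (x', y') = A (x, y) + t."""
    if kind == "scale":
        return np.array([[p["sx"], 0.0], [0.0, p["sy"]]])
    if kind == "rotation":
        a = p["alpha"]
        return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    if kind == "slant":
        return np.array([[1.0, np.tan(p["beta"])], [0.0, 1.0]])
    return np.eye(2)  # translation: identity linear part, offset goes in t

# Example: undo an estimated slant of 15 degrees with the inverse shear
A_inv = np.linalg.inv(make_transform("slant", beta=np.radians(15)))
```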

Normalization

  • Rotation:
    – To determine the baseline of a set of aligned characters: projection analysis, Hough transform
    – To determine the orientation of a single character: inertia axes (second-order moments)
  • Slant:
    – Approximation of the slant angle from the regression line of the character pixels
    – Approximation from the orientation of "vertical" segments in a word


Normalization

  • Size:
    – Normalization to a standard size
    – Normalization of the relation between x size and y size
    – Usually done from the bounding box of the character
    – Some pixel resampling is required: interpolation to avoid aliasing
  • Stroke thickness:
    – Not straightforward to determine because of the variability in the character itself
    – Approximation of the stroke thickness:
      • It is assumed that the character stroke has length l and width w
      • The width is estimated from the area and perimeter of the character
    – Then, morphological operations (dilation and erosion) are used to normalize the character to a standard width

Non-linear normalization

  • Optical (radial) distortion:
    – Parameter estimation with a least-squares criterion using a calibration image:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = C_m \begin{pmatrix} x \\ y \end{pmatrix} + C_d \begin{pmatrix} x(x^2+y^2) \\ y(x^2+y^2) \end{pmatrix}$$

    – $C_m$: magnification coefficient; $C_d$: distortion coefficient


Components of an OCR system (current stage: feature extraction)

Feature-based recognition

  • Probably, the choice of the feature extraction method is the most important factor in achieving good recognition performance.
  • Which are the best features to discriminate between the characters?


Feature Extraction

Goal: to extract from the image the most relevant information for classification, i.e., to minimize intra-class variability while maximizing inter-class variability.

  • Selection of appropriate features:
    – It is a critical decision
    – It depends on the specific application
    – Features must be invariant to the expected character variations (depending on the application): rotation, degradation, noise, shape distortion
    – Low dimensionality, to avoid large learning sets
    – Features determine the type of information to work with: gray-level image, binary image, character contour, vectorization of the skeleton, etc.
    – Features also determine the type of classifier

Feature Extraction

  • Image-based features:
    – Projections
    – Profiles
    – Crossings
  • Statistical features:
    – Moments
    – Zoning
  • Global transforms and series expansions:
    – Karhunen-Loeve
    – Fourier descriptors
  • Topological and geometric features; structural analysis:
    – Contour analysis
    – Skeleton analysis
    – Topological and geometric features

  • O.D. Trier, A.K. Jain, T. Taxt: Feature extraction methods for character recognition - A survey. Pattern Recognition, vol. 29, nº 4, pp. 641-662, 1996.


Image-based features

  • The whole image as feature vector:
    – Classification by correlation
    – Very sensitive to noise, character distortion and similarity between classes
  • x and/or y projections:
    – The accumulated projection can be used too
    – Sensitive to rotation, distortion and a large number of characters
  • Peephole:
    – Coding some pre-selected pixels of the image as a binary number
    – The pre-selected pixels can vary depending on the character to be recognized

[Example: peephole coding of an "A" as the binary string 010111011]

Image-based features

  • Contour profiles:
    – Left (right) profile: contour minimum (maximum) x at every y value
    – Lower (upper) profile: contour minimum (maximum) y at every x value
    – Features:
      • Profile values
      • Difference between consecutive profile values
      • Maxima and minima of the profile
      • Maxima of the difference between profile values

  • F. Kimura, M. Shridhar: Handwritten numeral recognition based on multiple algorithms. Pattern Recognition, 24(10), pp. 969-983, 1991.


Image-based features

  • Crossing method:
    – Features are computed from the number of times the character is crossed by vectors along some orientations, for example 0º, 45º, 90º, 135º
    – Used in commercial systems because of its speed and low complexity
    – Robust to some distortions and noise
    – Sensitive to size variations
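A minimal sketch of the crossing method for horizontal (0º) scan vectors; the rows to scan are an illustrative choice:

```python
import numpy as np

def horizontal_crossings(binary, rows):
    """Number of white-to-ink transitions along the selected scan rows."""
    feats = []
    for r in rows:
        line = binary[r].astype(int)
        feats.append(int(np.count_nonzero(np.diff(line) == 1)))  # 0 -> 1 steps
    return feats

# Example: crossings at 25%, 50% and 75% of the character height h
# feats = horizontal_crossings(char_img, [h // 4, h // 2, 3 * h // 4])
```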

Statistical features

  • Methods based on the statistical distribution of pixels in the image:
    – Geometric moments
    – Zoning
  • Features are robust to distortion and, up to a certain extent, to some style variations
  • Low computation time and easy to implement
  • A learning step is needed to infer the character models


Statistical features: zoning

  • The image is divided into n x m cells
  • For each cell, the mean of gray levels is computed, and all these values are joined in a feature vector of length n x m
  • We can also use information from the contour or any other feature computed in every zone (e.g. a histogram of contour orientations 0º, 45º, 90º, 135º per zone)

  • F. Kimura, M. Shridhar: Handwritten numeral recognition based on multiple algorithms. Pattern Recognition, 24(10), pp. 969-983, 1991.
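A minimal zoning sketch: the mean gray level of each cell of an n x m grid, concatenated into a feature vector:

```python
import numpy as np

def zoning_features(gray, n=4, m=4):
    """Mean gray level per cell of an n x m grid (feature vector of n*m)."""
    h, w = gray.shape
    ys = np.linspace(0, h, n + 1, dtype=int)    # cell row boundaries
    xs = np.linspace(0, w, m + 1, dtype=int)    # cell column boundaries
    return np.array([gray[ys[i]:ys[i+1], xs[j]:xs[j+1]].mean()
                     for i in range(n) for j in range(m)])
```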

Statistical features: geometric moments

  • Moments of order (p+q) of image f:

$$m_{pq} = \sum_{x=1}^{N} \sum_{y=1}^{N} f(x,y)\, x^p y^q$$

    – $m_{00}$: character area (in binary images)
    – Center of gravity of the character: $\bar{x} = m_{10}/m_{00}$, $\bar{y} = m_{01}/m_{00}$
  • Central moments (centering the character at the center of gravity):

$$\mu_{pq} = \sum_{x=1}^{N} \sum_{y=1}^{N} f(x,y)\, (x-\bar{x})^p (y-\bar{y})^q$$

    – The central moments of order 2 ($\mu_{20}$, $\mu_{02}$, $\mu_{11}$) permit computing:
      • Main inertia axes
      • Character length
      • Character orientation:

$$\theta = \frac{1}{2} \arctan\left( \frac{2\mu_{11}}{\mu_{20} - \mu_{02}} \right)$$

  • M. Hu: Visual pattern recognition by moment invariants. IRE Trans. on Information Theory, vol. 8, pp. 179-187, 1962.
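A sketch of the moment features above for a binary character image (arctan2 keeps the orientation formula stable when µ20 ≈ µ02):

```python
import numpy as np

def moment_features(binary):
    """Raw and central moments up to order 2, plus the orientation angle."""
    y, x = np.nonzero(binary)                   # coordinates of ink pixels
    m00 = len(x)                                # area (binary image)
    xb, yb = x.mean(), y.mean()                 # center of gravity
    mu20 = ((x - xb) ** 2).sum()                # central moments of order 2
    mu02 = ((y - yb) ** 2).sum()
    mu11 = ((x - xb) * (y - yb)).sum()
    theta = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)  # inertia-axis orientation
    return m00, xb, yb, mu20, mu02, mu11, theta
```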

Statistical features: invariant moments

  • Central moments $\mu_{pq}$ are translation-invariant
  • Scale invariants:

$$\nu_{pq} = \frac{\mu_{pq}}{\mu_{00}^{1+(p+q)/2}}, \qquad p+q \ge 2$$

  • Rotation invariants (order 2):

$$\phi_1 = \nu_{20} + \nu_{02} \qquad \phi_2 = (\nu_{20} - \nu_{02})^2 + 4\nu_{11}^2$$

  • Invariants to general linear transforms:

$$I_1 = \mu_{20}\mu_{02} - \mu_{11}^2 \qquad I_2 = (\mu_{30}\mu_{03} - \mu_{21}\mu_{12})^2 - 4(\mu_{30}\mu_{12} - \mu_{21}^2)(\mu_{03}\mu_{21} - \mu_{12}^2) \qquad \psi_1 = \frac{I_1}{\mu_{00}^4}$$

  – A set of moment invariants of different orders can be defined in a similar way

  • T.H. Reiss: The revised fundamental theorem of moment invariants. IEEE Trans. on PAMI, vol. 13, nº 8, pp. 830-834, 1991.

Statistical features: Zernike moments

  • Geometric moments:
    – Projection of the function f(x,y) over the monomials $x^p y^q$ (no orthogonality ⇒ information redundancy)
  • Zernike moments:
    – Change to polar coordinates to achieve orthogonality and rotation invariance
    – Projection of the image over the Zernike polynomials $V_{nm}$, which are orthogonal inside the unit circle $x^2 + y^2 = 1$:

$$V_{nm}(x,y) = V_{nm}(\rho,\theta) = R_{nm}(\rho)\, e^{jm\theta}$$

where $\rho = \sqrt{x^2+y^2} \le 1$, $\theta = \tan^{-1}(y/x)$, $n \ge 0$, $|m| \le n$, $n-|m|$ even, and

$$R_{nm}(\rho) = \sum_{s=0}^{(n-|m|)/2} (-1)^s\, \frac{(n-s)!}{s!\left(\frac{n+|m|}{2}-s\right)!\left(\frac{n-|m|}{2}-s\right)!}\, \rho^{n-2s}$$

  • A. Khotanzad, Y.H. Hong: Invariant image recognition by Zernike moments. IEEE Trans. on PAMI, vol. 12, nº 5, pp. 489-497, 1990.

Statistical features: Zernike moments

  • The image is projected over the Zernike polynomials:

$$f(x,y) = \sum_n \sum_m A_{nm}\, V_{nm}(\rho, \theta)$$

  • The coefficients $A_{nm}$ are the Zernike moments of order n and repetition m:

$$A_{nm} = \frac{n+1}{\pi} \sum_x \sum_y f(x,y)\, V_{nm}^*(\rho, \theta)$$

where $x^2 + y^2 \le 1$ (the image must be re-scaled to the unit circle) and * denotes the complex conjugate.

  • $|A_{nm}|$ is rotation-invariant
  • Relation between Zernike moments and geometric moments:

$$A_{00} = \frac{\mu_{00}}{\pi} \qquad A_{11} = A_{1,-1} = 0 \ \text{(centered image)} \qquad A_{22} = \frac{3}{\pi}\left(\mu_{20} - \mu_{02} - 2j\mu_{11}\right) \qquad A_{20} = \frac{3}{\pi}\left[2(\mu_{20}+\mu_{02}) - \mu_{00}\right]$$

Statistical features: Zernike moments

  • To reconstruct the image from Zernike moments:

$$f(x,y) = \lim_{N \to \infty} \sum_{n=0}^{N} \ \sum_{\substack{m:\ |m| \le n \\ n-|m|\ \text{even}}} A_{nm}\, V_{nm}(x,y)$$

[Figure, rows 1-2: Zernike moments of order 1-13 displayed via $I(x,y) = \sum_m A_{nm} V_{nm}(x,y)$; rows 3-4: image reconstruction from Zernike moments of order 1-13. Orders 1 and 2, for example, represent orientation, height and width.]


Statistical features: Zernike moments

[Figure: image reconstruction using moments up to order 10 (66 moments)]

Statistical features: Zernike pseudo-moments

  • Less noise-sensitive than Zernike moments
  • Better recognition results
  • Obtained by removing the constraint that $n - |m|$ is even:

$$R_{nm}(\rho) = \sum_{s=0}^{n-|m|} (-1)^s\, \frac{(2n+1-s)!}{s!\,(n+|m|+1-s)!\,(n-|m|-s)!}\, \rho^{n-s}$$


Statistical features: invariant moments

  • Experiments show that a robust OCR system needs at least 10-15 features, i.e., we need to define between 10 and 15 geometric invariants
  • Handwritten digit recognition:
    – Moments up to order 6:
      • Regular moments (24 moments): 94%
      • Zernike moments (23 moments): 95%
      • Zernike pseudo-moments (44 moments): 91.5%
    – Moments of higher orders: decrease in recognition performance

  • S.O. Belkasim, M. Shridhar, M. Ahmadi: Pattern recognition with moment invariants: a comparative study and new results. Pattern Recognition, vol. 24, nº 12, pp. 1117-1138, 1991.

Transform-based features

  • Instead of using the character image directly as the feature vector, a linear transform $g = T \cdot f$ is applied to compute the features, where T is a matrix of constant values
  • These transforms help to reduce the dimensionality of the feature vector, preserving the most relevant information about the shape of the character
  • The original image can be reconstructed from the feature vector
  • Features are invariant to some global deformations, such as translation and rotation
  • High computational cost
  • Some examples:
    – Karhunen-Loeve expansion
    – Fourier series


Transform-based features: Karhunen-Loeve expansion

  • The KL transform is defined as:

$$g = T^t (f - \bar{f}), \qquad \bar{f} = \frac{1}{N} \sum_{i=1}^{N} f_i$$

where f is the feature vector of the image and $\bar{f}$ is the mean of all the samples representing the character.

  • Each column of T is an eigenvector of the covariance matrix:

$$C = \frac{1}{N} \sum_{i=1}^{N} (f_i - \bar{f})(f_i - \bar{f})^t$$

  • Usually, only M (M < d) eigenvectors are used, corresponding to the largest eigenvalues. In this way, dimensionality is reduced.

Transform-based features: Karhunen-Loeve expansion

  • For each image, we get a feature vector x of dimension d
  • The learning set is composed of n samples per class
  • The set of samples is represented by the matrix X, of dimension n x d
  • For each class, we can compute the covariance matrix R:

$$R = \frac{1}{n-1} X^T X = \frac{1}{n-1} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T$$

  • Then, the transform matrix T is built from the eigenvectors $v_i$ of the covariance matrix, with eigenvalues $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_d$:

$$T = [v_1 \cdots v_d]$$

  • The transformation of an image x is $y = T^T x$


Transform-based features: Karhunen-Loeve expansion

  • Usually, the dimensionality is reduced and we only take the m eigenvectors of R with the greatest weight
  • Then, the transform matrix becomes $P = [v_1 \cdots v_m]$
  • For each image $x_i$, the feature vector is $y_i = P^T x_i$
  • Usually m is selected in such a way that the eigenvectors explain some pre-specified percentage of the total variance (usually 0.9 or 0.95)
  • The fraction of variance explained by the m eigenvectors is given by:

$$\frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{d} \lambda_i}$$
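A sketch of the KL feature computation: eigen-decomposition of the covariance matrix and selection of m by explained variance:

```python
import numpy as np

def kl_basis(X, variance=0.95):
    """Karhunen-Loeve basis from samples X (n x d): keep the leading
    eigenvectors that explain the requested fraction of total variance."""
    mean = X.mean(axis=0)
    Xc = X - mean                                  # center the samples
    R = Xc.T @ Xc / (len(X) - 1)                   # covariance matrix (d x d)
    lam, V = np.linalg.eigh(R)                     # ascending eigenvalues
    lam, V = lam[::-1], V[:, ::-1]                 # sort descending
    m = int(np.searchsorted(np.cumsum(lam) / lam.sum(), variance)) + 1
    return V[:, :m], mean

def kl_features(x, P, mean):
    return P.T @ (x - mean)                        # y = P^T (x - mean)
```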

Transform-based features: Karhunen-Loeve expansion

  • Application of the KL transform to the NIST database:
    – Digit recognition: 96% - 97%
    – Uppercase recognition: 89% - 90%
    – Lowercase recognition: 77% - 82%

  • M.D. Garris, J.L. Blue, G.T. Candela, P.J. Grother, S.A. Janet, C.L. Wilson: NIST form-based handprint recognition system (release 2.0). Technical report NISTIR 5959, National Institute of Standards and Technology, USA, 1994.


Transform-based features: Fourier descriptors

  • Decomposition as a Fourier series of a periodic function of period T:

$$f(t) = \sum_{n=-\infty}^{\infty} C_n\, e^{j 2\pi n t / T}, \qquad C_n = \frac{1}{T}\int_T f(t)\, e^{-j 2\pi n t / T}\, dt$$

or, in real form:

$$f(t) = A + \sum_{n=1}^{\infty} \left( a_n \cos\frac{2\pi n t}{T} + b_n \sin\frac{2\pi n t}{T} \right)$$
$$A = \frac{1}{T}\int_T f(t)\, dt \qquad a_n = \frac{2}{T}\int_T f(t) \cos\frac{2\pi n t}{T}\, dt \qquad b_n = \frac{2}{T}\int_T f(t) \sin\frac{2\pi n t}{T}\, dt$$

Transform-based features: Fourier descriptors

  • The shape contour can be described in complex form as $z(s) = x(s) + j\,y(s)$
  • The contour can be described as a function of the tangent angle $\theta(s)$:

$$x(s) = x(0) + \int_0^s \cos\theta(\alpha)\, d\alpha \qquad y(s) = y(0) + \int_0^s \sin\theta(\alpha)\, d\alpha$$

  • Defining the function of accumulation of the tangent angle, $\Phi(l) = \theta(l) - \theta(0)$, this function is normalized to the range [0, 2π]:

$$\Phi^*(t) = \Phi\!\left(\frac{L t}{2\pi}\right) + t$$

  • Finally, this function can be decomposed into Fourier descriptors:

$$\Phi^*(t) = a_0 + \sum_{k=1}^{\infty} (a_k \cos kt + b_k \sin kt)$$

  • C.T. Zahn, R.Z. Roskies: Fourier descriptors for plane closed curves. IEEE Trans. on Computers, vol. C-21, nº 3, pp. 269-281, 1972.


Transform-based features: Fourier descriptors

  • In the discrete case, the contour is a polygon with vertices $z(s_j) = x(s_j) + j\,y(s_j)$ and tangent-angle increments

$$\Delta\Phi_j = \tan^{-1}\!\left[\frac{y(s_j) - y(s_{j-1})}{x(s_j) - x(s_{j-1})}\right] - \tan^{-1}\!\left[\frac{y(s_{j-1}) - y(s_{j-2})}{x(s_{j-1}) - x(s_{j-2})}\right]$$

  • Then:

$$a_0 = -\pi - \frac{1}{L}\sum_{k=1}^{m} l_k \Delta\Phi_k \qquad a_n = -\frac{1}{n\pi}\sum_{k=1}^{m} \Delta\Phi_k \sin\frac{2\pi n l_k}{L} \qquad b_n = \frac{1}{n\pi}\sum_{k=1}^{m} \Delta\Phi_k \cos\frac{2\pi n l_k}{L}$$

  • Fourier descriptors depend on the starting point
  • For 64x64 images, it has been shown that 5 coefficients are enough to discriminate between "2" and "Z"

Transform-based features: elliptic Fourier descriptors

  • The contour is expanded as two Fourier series (T is the contour length):

$$\hat{x}(t) = A_0 + \sum_{n=1}^{N}\left[a_n \cos\frac{2\pi n t}{T} + b_n \sin\frac{2\pi n t}{T}\right] \qquad \hat{y}(t) = C_0 + \sum_{n=1}^{N}\left[c_n \cos\frac{2\pi n t}{T} + d_n \sin\frac{2\pi n t}{T}\right]$$

with $\hat{x}(t) \to x(t)$ and $\hat{y}(t) \to y(t)$ as $N \to \infty$.

  • In the discrete case, with m contour pixels, $\Delta t_i = \sqrt{\Delta x_i^2 + \Delta y_i^2}$, $t_i = \sum_{j \le i} \Delta t_j$ and $\phi_i = 2\pi t_i / T$:

$$a_n = \frac{T}{2n^2\pi^2}\sum_{i=1}^{m}\frac{\Delta x_i}{\Delta t_i}\left[\cos\phi_i - \cos\phi_{i-1}\right] \qquad b_n = \frac{T}{2n^2\pi^2}\sum_{i=1}^{m}\frac{\Delta x_i}{\Delta t_i}\left[\sin\phi_i - \sin\phi_{i-1}\right]$$
$$c_n = \frac{T}{2n^2\pi^2}\sum_{i=1}^{m}\frac{\Delta y_i}{\Delta t_i}\left[\cos\phi_i - \cos\phi_{i-1}\right] \qquad d_n = \frac{T}{2n^2\pi^2}\sum_{i=1}^{m}\frac{\Delta y_i}{\Delta t_i}\left[\sin\phi_i - \sin\phi_{i-1}\right]$$

  • Invariance to starting point: the phase shift with respect to the main axis is computed and the coefficients are rotated according to this angle:

$$\theta_1 = \frac{1}{2}\tan^{-1}\!\left[\frac{2(a_1 b_1 + c_1 d_1)}{a_1^2 - b_1^2 + c_1^2 - d_1^2}\right] \qquad \begin{pmatrix} a_n^* & b_n^* \\ c_n^* & d_n^* \end{pmatrix} = \begin{pmatrix} a_n & b_n \\ c_n & d_n \end{pmatrix}\begin{pmatrix} \cos n\theta_1 & -\sin n\theta_1 \\ \sin n\theta_1 & \cos n\theta_1 \end{pmatrix}$$

  • F.P. Kuhl, C.R. Giardina: Elliptic Fourier features of a closed contour. Computer Vision, Graphics and Image Processing, vol. 18, pp. 236-258, 1982.
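A sketch of the elliptic Fourier coefficients following the discrete formulas above, with the contour given as an (m, 2) array of (x, y) points (duplicate points are assumed removed):

```python
import numpy as np

def elliptic_fourier(contour, order=12):
    """Elliptic Fourier coefficients (a_n, b_n, c_n, d_n) of a closed contour."""
    d = np.diff(np.vstack([contour, contour[:1]]), axis=0)  # close the contour
    dt = np.hypot(d[:, 0], d[:, 1])               # segment lengths dt_i
    t = np.concatenate([[0.0], np.cumsum(dt)])
    T = t[-1]                                      # total contour length
    phi = 2 * np.pi * t / T                        # phi_i = 2*pi*t_i / T
    coeffs = []
    for n in range(1, order + 1):
        c, s = np.cos(n * phi), np.sin(n * phi)
        k = T / (2 * n**2 * np.pi**2)
        a = k * np.sum(d[:, 0] / dt * (c[1:] - c[:-1]))
        b = k * np.sum(d[:, 0] / dt * (s[1:] - s[:-1]))
        cc = k * np.sum(d[:, 1] / dt * (c[1:] - c[:-1]))
        dd = k * np.sum(d[:, 1] / dt * (s[1:] - s[:-1]))
        coeffs.append((a, b, cc, dd))
    return np.array(coeffs)
```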


Transform-based features: Fourier descriptors

[Figure: character "5" reconstructed using elliptic Fourier descriptors of order 1, 2, ..., 10; 15, 20, 30, 40, 50 and 100 respectively]

  • Rotation invariance: the orientation $\psi_1$ of the main semi-axis is computed and the coefficients are rotated:

$$\psi_1 = \tan^{-1}\frac{c_1^*}{a_1^*} \qquad \begin{pmatrix} a_n^{**} & b_n^{**} \\ c_n^{**} & d_n^{**} \end{pmatrix} = \begin{pmatrix} \cos\psi_1 & \sin\psi_1 \\ -\sin\psi_1 & \cos\psi_1 \end{pmatrix}\begin{pmatrix} a_n^* & b_n^* \\ c_n^* & d_n^* \end{pmatrix}$$

  • Scale invariance: the coefficients are divided by the magnitude of the main semi-axis, $E^* = \sqrt{a_1^{*2} + c_1^{*2}}$

Transform-based features: Fourier Descriptors

  • Experiments with handwritten digits (100 images per digit):

– 12 elliptic descriptors: 99.7%
– 12 non-elliptic descriptors: 99.5%

  • Experiments with digits + lowercase letters:

– 12 elliptic descriptors: 98.6%
– 12 non-elliptic descriptors: 90.1%

Feature Extraction

Evaluation

  • T. Taxt, J.B. Olafsdottir, M. Daehlen: Recognition of Handwritten Symbols. Pattern Recognition, vol. 23, nº 11, pp. 1155-1166, 1990

slide-64
SLIDE 64

64

127

Structural Analysis

  • Contour analysis
  • Skeleton analysis
  • Analysis of topological and geometric features

Methods based on the analysis of the character structure, starting from the detection of some features and the relationships between them (the basic idea is to divide the character into its basic parts)

Feature Extraction

128


Structural Analysis: Run-length encoding

  • It is the simplest structural representation
  • Run-length encoding represents each image row as a sequence of pairs (l, g), where each pair represents a run of l consecutive pixels with gray level g
  • For binary images, only the sequence of run lengths is required (e.g., 2, 4, 5, 1, 4, 3, 1)

Example row: 0 0 4 4 4 2 2 2 2 2 4 4 4 1 1 1 1 1 1 1 1 0 0 0 0 → (2,0), (3,4), (5,2), (3,4), (8,1), (4,0)
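A minimal Python sketch of the encoder, reproducing the example above (the function name is ours):

```python
def run_length_encode(row):
    """Encode one image row as (length, gray level) pairs."""
    runs = []
    current, length = row[0], 1
    for value in row[1:]:
        if value == current:
            length += 1
        else:
            runs.append((length, current))
            current, length = value, 1
    runs.append((length, current))
    return runs

row = [0, 0, 4, 4, 4, 2, 2, 2, 2, 2, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(run_length_encode(row))   # [(2, 0), (3, 4), (5, 2), (3, 4), (8, 1), (4, 0)]
```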

Feature Extraction

slide-65
SLIDE 65

65

129

Structural Analysis: Run-length encoding

A graph is built on the run-length encoding, where nodes are the run-lengths and edges represent the overlapping between runs in consecutive rows.

[Figure: run-length graph of a character image; overlapping runs such as (4,7), (3,3), (8,8), ... are linked and grouped into Region 1 and Region 2]

Feature Extraction

130

Structural Analysis: Chain-code

Chain-codes or Freeman codes are the simplest angular approximation. They encode each vector di between two consecutive points with a code between 0 and 7.

[Figure: the eight chain-code directions, numbered 0-7]

The codification of a string S is composed of 3 fields:

  • Starting point of the segment p0(S) = (x0, y0).
  • Segment length l(S).
  • A table of directions:

Dir(S) = [d0, d1, ..., dl(S)-1], where di ∈ [0,7], ∀i ∈ [0, l(S)-1], according to the following codification:

[Figure: the eight relative positions of pixel di with respect to pixel di-1, one for each code 0-7]
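A small Python sketch of the three-field codification (the direction numbering below assumes code 0 = east, increasing counter-clockwise with the y axis pointing up, which may differ from a particular implementation):

```python
import numpy as np

# Freeman code for each neighbor displacement (dx, dy)
FREEMAN = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
           (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}

def chain_code(points):
    """Return p0(S), l(S) and Dir(S) for an ordered 8-connected contour."""
    pts = np.asarray(points, dtype=int)
    dirs = [FREEMAN[(int(dx), int(dy))] for dx, dy in np.diff(pts, axis=0)]
    return tuple(pts[0]), len(dirs), dirs
```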

Feature Extraction

slide-66
SLIDE 66

66

131

Structural Analysis: Contour Analysis

  • Contour pixels are coded according to their orientations, using the chain-code representation
  • Classification can be done with methods of structural recognition based on the string-edit distance algorithm by Wagner and Fischer (a sketch follows below)

Example chain codes for two contours: 2222122020, 1000070070
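The Wagner-Fischer string-edit distance can be sketched in a few lines of Python (unit costs here; a refined version for chain codes would make the substitution cost depend on the angular difference between the two directions):

```python
def edit_distance(s, t):
    """Wagner-Fischer dynamic programming: minimum number of
    substitutions, insertions and deletions turning s into t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i
    for j in range(1, n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

# Compare the two chain-coded contours of the example:
print(edit_distance("2222122020", "1000070070"))
```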

Feature Extraction

132

Structural Analysis: Skeleton Analysis

  • Image thinning and representation of the skeleton with some encoding method allowing to compare two shapes (sometimes it is necessary to vectorize the skeleton):

– Chain codes
– Graphs
– Grammars
– Zoning
– Discrete features

  • Skeleton problems:

– Noise sensitivity
– Variability of the representation

[Figure: character “B” and its skeleton]

Feature Extraction

slide-67
SLIDE 67

67

133

Structural Analysis: Skeleton Analysis

  • Representation with graphs or grammars

– Based on the detection of characteristic skeleton points and a polygonal approximation of the skeleton
– Two possibilities to represent the skeleton with a graph:

  • Nodes are the characteristic points, while edges are the segments joining the points
  • Nodes are the segments of the polygonal approximation, while edges represent the adjacency relations between the segments

[Figure: the two graph representations of a skeleton, with vertices v1-v7 and segments a1-a7]

Feature Extraction

134

Structural Analysis: Skeleton Analysis

Representation with graphs:

Feature Extraction

  • J. Rocha, T. Pavlidis: Character recognition without segmentation. IEEE Trans. on PAMI, vol. 17, nº 9, pp. 903-909, 1995
slide-68
SLIDE 68

68

135

Structural Analysis: Skeleton Analysis

  • Zoning
  • Discrete features:

– Number of loops
– Number of T joints and X joints
– Number of terminal points, corner points and isolated points
– Crossing points with horizontal and vertical axes

[Figure: character divided into nine zones labeled A-I]

Option 1: stroke length within each zone. Option 2: coding from the arcs: ArC, ArD, CcF, DrF, DrG, FcI, GrI, where r = straight line and c = curve.

Feature Extraction

136

Structural Analysis : Topological and geometric features

  • Aspect ratio x-y.
  • Perimeter, area, center of gravity
  • Minimal and maximal distance of the contour to the center of gravity
  • Number of holes
  • Euler number = (n. of connected components) - (n. of holes)
  • Compactness = (perimeter)² / (4π·area)
  • Information about contour curvature
  • Ascenders and descenders
  • Concavities and holes
  • Loops
  • Unions, terminal points, crossings with horizontal and vertical axes
  • Angular information: histogram of segment angles
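Several of these features can be computed directly from a binary character image. A minimal numpy sketch (the contour approximation is our own: ink pixels with at least one 4-connected background neighbor):

```python
import numpy as np

def geometric_features(img):
    """A few of the features listed above for a binary image (True = ink)."""
    img = np.asarray(img, dtype=bool)
    ys, xs = np.nonzero(img)
    area = xs.size
    aspect = (np.ptp(xs) + 1) / (np.ptp(ys) + 1)       # x-y aspect ratio
    cx, cy = xs.mean(), ys.mean()                      # center of gravity
    p = np.pad(img, 1)                                 # pad with background
    interior = p[1:-1, :-2] & p[1:-1, 2:] & p[:-2, 1:-1] & p[2:, 1:-1]
    contour = img & ~interior                          # approximate contour
    perimeter = np.count_nonzero(contour)
    compactness = perimeter ** 2 / (4 * np.pi * area)
    by, bx = np.nonzero(contour)
    r = np.hypot(bx - cx, by - cy)                     # contour-to-centroid distances
    return {"area": area, "aspect": aspect, "perimeter": perimeter,
            "compactness": compactness, "r_min": r.min(), "r_max": r.max()}
```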

Feature Extraction

slide-69
SLIDE 69

69

137

Components of an OCR system

[Diagram: components of an OCR system]

ACQUISITION → DOCUMENT PRE-PROCESSING → SEGMENTATION → CHARACTER PRE-PROCESSING → FEATURE EXTRACTION → CLASSIFICATION → POST-PROCESSING

  • Document pre-processing: filtering, binarization, skew correction
  • Segmentation: layout analysis, text/graphics separation, character segmentation
  • Character pre-processing: filtering, normalization
  • Feature extraction: image-based, statistical, transform-based and structural features
  • Post-processing: context information
  • A learning stage provides the models used for classification

Classification

138

Classification

  • Different methods, depending on the model of feature

representation

  • Classification using feature vectors

– Correlation
– Euclidean distance
– Mahalanobis distance
– k nearest neighbours
– Bayes’ classifier
– Neural networks

  • Classification with structural features

– Dichotomic search
– String edit
– Graph matching
– Grammars

Classification

slide-70
SLIDE 70

70

139

Classification using feature vectors

  • Correlation

– Classification with the class with the largest correlation value

  • Minimal euclidean distance

– Distance to the mean mi of the class
– It does not take into account differences in variance for each class

$$D_i(x) = (x - m_i)^T (x - m_i)$$

$$S_i(f) = \frac{\iint_R f(x,y)\,g_i(x,y)\,dx\,dy}{\sqrt{\iint_R f^2(x,y)\,dx\,dy\;\iint_R g_i^2(x,y)\,dx\,dy}}$$
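A compact Python sketch of both criteria (names are ours; the class means and templates would come from a learning set):

```python
import numpy as np

def nearest_mean_classify(x, means):
    """Minimal Euclidean distance: D_i(x) = (x - m_i)^T (x - m_i)."""
    d = [float((x - m) @ (x - m)) for m in means]
    return int(np.argmin(d))

def correlation_classify(f, templates):
    """Normalized correlation of image f against one template g_i per class."""
    f = f.ravel().astype(float)
    s = [float(f @ g.ravel()) / np.sqrt((f @ f) * (g.ravel() @ g.ravel()))
         for g in templates]
    return int(np.argmax(s))
```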

Classification

140

Classification using feature vectors

  • Minimal quadratic distance (Mahalanobis).

– For each class i, the mean mi and covariance matrix Si are computed from the set of samples
– The covariance matrix is taken into account when computing the distance from an image to the class i
– The feature vector of the image x is projected over the eigenvectors of the class

$$D_i(x) = (x - m_i)^T S_i^{-1} (x - m_i) = z^T z \qquad z = \Lambda^{-1/2}\,\Psi^T (x - m_i)$$

where Λ are the eigenvalues of Si and Ψ its eigenvectors.
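A sketch of the learning step and of the distance itself (our own names; one array of samples per class is assumed):

```python
import numpy as np

def fit_class(samples):
    """Learning step: per-class mean and covariance matrix."""
    X = np.asarray(samples, dtype=float)
    return X.mean(axis=0), np.cov(X, rowvar=False)

def mahalanobis_classify(x, means, covs):
    """Pick the class with minimal (x - m_i)^T S_i^{-1} (x - m_i)."""
    d = [float((x - m) @ np.linalg.solve(S, x - m))
         for m, S in zip(means, covs)]
    return int(np.argmin(d))
```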

Classification

slide-71
SLIDE 71

71

141

Classification using feature vectors

  • k-Nearest Neighbors

– Several sample models for each class
– Given an image, we take the k models nearest to the image
– The image is classified into the class with the most elements in the set of k nearest neighbors

  • Weighted several nearest neighbors

– k depends on the image. For each image x, the set Vx contains all the models with a distance lower than α times the distance to the nearest model

$$D_i(x) = \frac{k(V_x^i)}{k(V_x)} \qquad \text{or, weighted,} \qquad D_i(x) = \sum_{x_j \in V_x^i} d(x, x_j)^{-2}$$

where Vx is the set of models nearest to x and Vx^i are the models inside Vx that belong to class i (a sketch of both rules follows).
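A Python sketch of the two rules (names are ours; `models` is an array of feature vectors and `labels` holds the integer class of each model):

```python
import numpy as np

def knn_classify(x, models, labels, k=5):
    """Majority vote among the k models nearest to x."""
    d = np.linalg.norm(models - x, axis=1)
    return int(np.bincount(labels[np.argsort(d)[:k]]).argmax())

def weighted_nn_classify(x, models, labels, alpha=1.5):
    """Variable neighborhood V_x: every model closer than alpha times the
    distance to the nearest one; vote weighted by inverse squared distance."""
    d = np.maximum(np.linalg.norm(models - x, axis=1), 1e-12)
    in_vx = d <= alpha * d.min()
    return int(np.bincount(labels[in_vx], weights=1.0 / d[in_vx] ** 2).argmax())
```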

Classification

142

Classification using feature vectors

  • Bayes’ classifier

– An image x is classified into the class i that maximizes the posterior probability p(wi|x)
– Applying Bayes’ theorem:
– p(x) is constant and independent of the class i; therefore, it has no influence on classification
– If all classes have the same prior probability p(wi), we can discard it too. Then:

$$p(w_i|x) = \frac{p(x|w_i)\,p(w_i)}{p(x)}$$

where p(wi|x) is the posterior probability that the image vector x belongs to class wi, p(x|wi) is the likelihood of observing x given that it belongs to class wi, and p(x) is the probability of observing the image vector x.

$$\arg\max_i\,p(w_i|x) = \arg\max_i\,p(x|w_i)$$

Classification

slide-72
SLIDE 72

72

143

Classification using feature vectors

  • If we assume a normal distribution for each class with mean mi and

covariance Si, estimated from the set of samples for each class:

  • Discarding constants, taking squares and applying the logarithm, we can

infer the following discriminant function:

$$p(x|w_i) = \frac{1}{(2\pi)^{n/2}\,|S_i|^{1/2}}\;e^{-\frac{1}{2}(x-m_i)^{T} S_i^{-1}(x-m_i)}$$

$$D_i(x) = -\log|S_i| - (x - m_i)^{T} S_i^{-1}(x - m_i)$$
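Putting the two previous slides together, a minimal sketch of a Bayes classifier with one Gaussian per class (names are ours):

```python
import numpy as np

def gaussian_discriminant(x, m, S):
    """D_i(x) = -log|S_i| - (x - m_i)^T S_i^{-1} (x - m_i)."""
    diff = x - m
    _, logdet = np.linalg.slogdet(S)         # log-determinant of S_i
    return -logdet - float(diff @ np.linalg.solve(S, diff))

def bayes_classify(x, means, covs):
    """Classify x into the class with the largest discriminant value."""
    return int(np.argmax([gaussian_discriminant(x, m, S)
                          for m, S in zip(means, covs)]))
```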

Classification

144

Classification using feature vectors

  • Neural networks

– They can be applied directly to the image or to a feature vector previously computed from the image
– The most used neural networks are multi-layer feed-forward networks

Y. LeCun et al.: Backpropagation applied to handwritten zip code recognition. Neural Computation, vol. 1, pp. 541-551, 1989.

Each layer represents a feature subvector of a higher level

Classification

slide-73
SLIDE 73

73

145

Classification using feature vectors

  • A neural network is organized into several layers. Each layer has a fixed

number of nodes

  • In the first layer, the nodes correspond to the values in the feature vector
  • In the last layer, the nodes correspond to each of the classes
  • Intermediate (hidden) layers represent feature subvectors of a higher level
  • The value at each node in a given layer is computed through the application of a propagation function to the values of the nodes in the previous layer, weighted by a vector of weights:

$$x_j^k = \sigma\!\left(b_j^k + \sum_{i=1}^{N(k-1)} w_{ji}^k\,x_i^{k-1}\right)$$

where xik is the value of node i at layer k, N(k) is the number of nodes of layer k, bjk is the bias of node j at layer k, and wjik is the weight of the connection between node i of layer k-1 and node j of layer k.

Classification

146

Classification using feature vectors

  • To design a neural network, we have to decide:

– The number of layers
– The number of nodes at each layer

  • Learning step, using a set of samples

– The weights of each connection are automatically determined in such a way that they minimize the classification error

  • Example of neural network

– Three-layer perceptron (the input accounts for one layer). The final discriminant function for each class is:

$$D_i(x) = f\!\left(b_i^{(2)} + \sum_{j} w_{ij}^{(2)}\,f\!\left(b_j^{(1)} + \sum_{k=1}^{N} w_{jk}^{(1)} x_k\right)\right) \qquad f(x) = \frac{1}{1+e^{-x}}\ \text{(sigmoid)}$$
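The forward pass of this three-layer perceptron amounts to two matrix products (a sketch with our own names; in practice W1, b1, W2, b2 would be learned by back-propagation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))         # f(x) = 1 / (1 + e^{-x})

def perceptron_discriminants(x, W1, b1, W2, b2):
    """D_i(x) for every class i: output of the three-layer perceptron."""
    hidden = sigmoid(W1 @ x + b1)           # hidden-layer node values
    return sigmoid(W2 @ hidden + b2)        # one discriminant value per class
```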

Classification

slide-74
SLIDE 74

74

147

Classification using structural features

  • Dichotomic search. The presence or absence of certain primitives is tested in several steps. Character models can be organized with decision trees
  • String or graph matching. Application of algorithms for string edit or for matching attributed graphs
  • Grammars. We test if the character belongs to the language generated by the grammar that represents the model

Classification

148

Classification using Deformable models

  • Deformable models

– We start with an ideal representation of the shape of the object
– This ideal shape is deformed using a set of rules or pre-defined operations, in such a way that all possible valid object distortions can be generated
– Given an image, we look for the object deformation that yields the best result for an energy function defined as the combination of two measures:

  • Internal energy: it measures the degree of deformation from the model of the object
  • External energy: it measures the degree of similarity of the deformation to the image

– The image is classified into the model with the lowest global energy

Classification

slide-75
SLIDE 75

75

149

Classification using Deformable models

  • Based on a character prototype:

– We define a prototype of the character that can be deformed by applying a series of trigonometric transforms over the image space
– Basis of the transform:

$$e_{mn}^{x} = \left(2\sin(\pi n x)\cos(\pi m y),\ 0\right) \qquad e_{mn}^{y} = \left(0,\ 2\cos(\pi m x)\sin(\pi n y)\right)$$

– External energy based on the distance of the deformation to the image contour
– Bayesian combination of internal and external energy

Classification

150

Classification using Deformable models

  • Based on a set of point generators located on the surface of a

spline:

– Internal energy:

  • The character is represented by a spline. We can modify the control points of the

spline

  • A probability is generated based on the modification of these control points

– External energy:

  • Image points are generated from generators located along the spline
  • A probability is defined based on the distance from image pixels to point

generators

– Minimization:

  • Probabilistic combination
  • EM algorithm

Classification

slide-76
SLIDE 76

76

151

Classification using Deformable models

  • Point distribution model:

– The model is represented as the mean of a set of points obtained from the skeleton of the learning samples
– PCA is applied to obtain the set of valid character deformations from the learning set
– For each image, we get the nearest deformation according to the space defined by PCA:

$$x = \bar{x} + P b \qquad b \approx P^{T}(x - \bar{x})$$

where x̄ is the mean set of points, P the matrix of principal deformation modes and b the deformation parameters.

– Internal energy: based on the deformation parameters b
– External energy: distance between the image and the image obtained using the nearest deformation

Classification

152

Holistic methods

  • Used for recognition of handwritten script
  • Each word is a recognition unit. We try to recognize each word

using global features of it. Each word is a different class

  • Based on psychological evidence
  • Applications:

– Constrained domains with only a few words (bank applications, personal agendas, etc.)
– Filtering of large domains to reduce the set of possible words

  • Usually, they are based on the application of HMM or dynamic

programming (string edit) to words (not to letters)

Classification

slide-77
SLIDE 77

77

153

Holistic strategies

  • Global word features:

– Distribution of segment orientations (horizontal, vertical, diagonal)
– Terminal points
– Concavities
– Holes
– Loops
– Word length
– Ascenders and descenders
– Crossing points with the central line
– Fourier coefficients

  • Feature representation:

– Feature vectors or matrices: the word is divided into zones; each zone corresponds to a component of the feature vector, where the presence or absence of the features is tested
– Graphs: adjacency or neighboring relations between detected features

Classification

154

Hidden Markov Models

  • Hidden Markov Model

– It represents a double stochastic process where a Markov chain is not directly observable; it can only be inferred from the observation of another stochastic process
– In an HMM, a sequence of observations O = (o1, ..., oT) is produced by a sequence of states Q = (q1, ..., qT)
– An HMM is modelled by λ = {V, S, Π, A, B, Γ}:

  • V = {vk; 1 ≤ k ≤ M}: set of observable symbols
  • S = {si; 1 ≤ i ≤ N}: set of states
  • Π = {πi}, πi = P(q1 = si): initial probability of each state
  • A = {aij}, aij = P(qt+1 = sj | qt = si): transition probabilities between states
  • Γ = {γj}, γj = P(qT = sj): probability of the final state
  • B = {bj(k)}, bj(k) = P(ot = vk | qt = sj): probability of observing a symbol in a given state

Hidden Markov Models

slide-78
SLIDE 78

78

155

Hidden Markov Models

Example: Markov Model to model the weather

States: S = {rainy, sunny}
Set of observations (possible values of humidity): V = {0%, 25%, 50%, 75%, 100%}
Sequence of observations: O = {0%, 25%, 25%, 50%, 25%, 75%, 50%}

[Figure: two states, sunny (s) and rainy (p), with transition probabilities P(s|s), P(s|p), P(p|s), P(p|p) and observation probabilities P(0%|s), ..., P(100%|s) and P(0%|p), ..., P(100%|p)]

With the model we can:

  • Compute the probability that the sequence of observations can be generated with the model of the weather
  • Given a set of observations of humidity, estimate the model of the weather
  • Decide if the weather has been rainy or sunny for each of the days under observation

Hidden Markov Models

156

Hidden Markov Models

  • Three problems related to an HMM:

– Evaluation problem: given a sequence of observations O and a model λ = {V, S, Π, A, B, Γ}, find P(O|λ), the probability that the sequence of observations can be generated from the model

  • Computed by summing the probabilities of the sequence of observations over all possible sequences of states
  • Forward/backward propagation methods

– Learning problem: given a learning set O, find the parameters of the model λ that maximize P(O|λ)

  • Baum-Welch algorithm

– Recognition problem: given a sequence of observations O and a model λ = {V, S, Π, A, B, Γ}, find the optimal sequence of states Q

  • Viterbi algorithm (a sketch follows below)
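As an illustration of the recognition problem, a compact Python sketch of the Viterbi algorithm (our own array conventions: pi has one entry per state, A is the state-transition matrix and B holds the per-state observation probabilities):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most probable state sequence for observations obs (indices into V)."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))                # best path probability so far
    psi = np.zeros((T, N), dtype=int)       # back-pointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A   # trans[i, j] = delta_i * a_ij
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B[:, obs[t]]
    q = [int(delta[-1].argmax())]           # backtrack the optimal path
    for t in range(T - 1, 0, -1):
        q.append(int(psi[t][q[-1]]))
    return q[::-1]
```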

Hidden Markov Models

slide-79
SLIDE 79

79

157

Hidden Markov Models. OCR applications

  • 1. External segmentation with context post-process: the Markov model represents grapheme variations, based on bigram or trigram frequencies, i.e. the probabilities of finding sequences of consecutive letters in the dictionary of the language
  • 2. Internal segmentation: the Markov chain represents character features extracted from left to right along the text. These features are compared with the model and, along with the recognition, it can be decided where a character ends and the next one begins
  • 3. Holistic methods: the Markov chain represents variations within a given word, belonging to a lexicon with the valid words

Hidden Markov Models

158

Hidden Markov Models. OCR applications

  • One state per letter
  • Transition probabilities are the probabilities of finding two consecutive letters in the language
  • The observations are the pre-segmented zones of the image
  • The probabilities of the observations given a state are the probabilities that each pre-segmented zone corresponds to the letter associated to the state

[Figure: states a, m, n, ..., with transition probabilities such as pam, pmm, pma, pnm, pmn, paa, pnn]

Hidden Markov Models

External segmentation with context post-process

slide-80
SLIDE 80

80

159

Hidden Markov Models. OCR applications

  • Model-discriminant HMM:

– An HMM for each letter
– Recognition is done letter by letter
– States within an HMM represent zones within the letter with different feature values
– Word recognition is done by finding the best combination of individual HMMs according to all segmentation possibilities

Hidden Markov Models

Internal segmentation

160

Hidden Markov Models. OCR applications

  • Path-discriminant HMM

– One single HMM
– Each state corresponds to a letter
– Transitions between states correspond to the probability of changing from one letter to another within a word
– Over-segmentation of the image: each observation corresponds to a set of features extracted from each segment
– The probability of each observation for a given state is the probability of generating the set of features from a given letter
– Recognition looks for the sequence of states (letters) that best generates the set of features extracted from the image

[Figure: fully connected HMM with one state per letter: a, b, c, ...]

Hidden Markov Models

Internal segmentation

slide-81
SLIDE 81

81

161

Hidden Markov Models. OCR applications

  • Example: HMM for word recognition

– One HMM for each word
– Each letter is represented with four states. Each state is a possible result of a previous over-segmentation
– Recognition looks for the HMM (word) that maximizes the probability of generating the sequence of observations (features extracted from the image)

[Figure: left-to-right HMM with four states per letter: c, a, ...]

Hidden Markov Models

Holistic methods

162

Hidden Markov Models. OCR applications

  • Feature extraction:

– Computation of 9 features inside a sliding window:

  • Number of pixels
  • Center of gravity
  • Moments of order 2
  • Location of the upper and lower contour
  • Orientation of the upper and lower contour
  • Number of black-white transitions in the vertical direction
  • Number of black pixels between the upper and lower contours

  • One HMM for each character
  • HMMs are concatenated to compose words. Find the combination of HMMs with the highest probability
  • Results:

– Vocabulary: 2296 words
– Test set: 3850 words, 80 people
– Recognition: 82.05%

  • S. Gunter, H. Bunke: HMM-based handwritten word recognition: on the optimization of the number of states, training iterations and Gaussian components. Pattern Recognition, vol. 37, pp. 2069-2079, 2004.

slide-82
SLIDE 82

82

163

Components of an OCR system

[Diagram: components of an OCR system]

ACQUISITION → DOCUMENT PRE-PROCESSING → SEGMENTATION → CHARACTER PRE-PROCESSING → FEATURE EXTRACTION → CLASSIFICATION → POST-PROCESSING

  • Document pre-processing: filtering, binarization, skew correction
  • Segmentation: layout analysis, text/graphics separation, character segmentation
  • Character pre-processing: filtering, normalization
  • Feature extraction: image-based, statistical, transform-based and structural features
  • Post-processing: context information
  • A learning stage provides the models used for classification

Post-process

164

Post-process

  • Post-processing to improve OCR results:

– Voting: combination of several classifiers
– Utilization of context information: analysis of the classification of individual characters in the context of the adjacent characters

Post-process

slide-83
SLIDE 83

83

165

Post-process. Classifier Combination

Combination of specific classifiers with good performance for some characters, fonts, etc.

[Diagram: the text is processed in parallel by Classifier 1, Classifier 2, ..., Classifier k; their individual outputs are combined by a voting stage, possibly using additional knowledge, to produce the final output]

  • Integrity: how the voting algorithm controls the activation and configuration of the individual classifiers. High integrity => the voting algorithm decides the best classifiers in each situation
  • Representation of the classification results:

– Abstract: each classifier simply gives the label of the class
– Ranked: each classifier gives several ranked labels
– Ranked with a degree of confidence: for each label, the classifier gives a level of confidence

  • Combination of the classifier results: how to combine the results

Post-process

166

Post-process. Classifier Combination

[Figure: voting example for an unknown character x; individual classifiers output c17 = ‘s’, c21 = ‘S’, c4 = ‘g’, c9 = ‘8’]

Post-process

slide-84
SLIDE 84

84

167

Post-process. Context information

  • Context analysis tries to correct errors produced by decisions taken as a function of local features
  • In the presence of uncertainty, the hypotheses generated by the local classifier are complemented with the hypotheses of neighboring characters
  • Two points of view:

– Geometric context (typographic)
– Linguistic context

[Figure: an ambiguous image that can be read as “13” or “B”]

Post-process

168

Post-process. Context information

  • Methods based on n-grams (combinations of n letters):

– Probability that an n-gram appears in the words of the dictionary
– When there are characters with uncertainty, we take the final decision according to the n-gram with the highest probability
– Bottom-up techniques
– Viterbi algorithm and Markov methods

  • Methods based on grammars:

– A grammar is used to validate the results of the OCR
– Similar to n-grams, but they permit to consider variable-length strings and recursion

[Example: an OCR output ambiguous between “vint-cents” and “vuit-cents” is validated as “vuit-cents” by the grammar Xifra → Desena ‘-’ Unitat | Unitat ‘-’ Centena]

Post-process

slide-85
SLIDE 85

85

169

Post-process. Context information

  • Methods based on a dictionary:

– Creation of a dictionary with the set of correct words
– It permits orthographic correction of the text
– Utilization of string-edit algorithms
– Problem: words not included in the dictionary
– Requires structures that represent the dictionary while providing quick access

Automaton for the dictionary: {LLIBRE, LLIURE, CAURE, COURE, COST}

[Figure: a letter automaton sharing the common prefixes of the dictionary words, and a hash function h mapping a word (“paraula”) to a position h(k) in the dictionary]
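Ignoring the efficient dictionary structures, the basic correction step can be sketched with the edit distance alone (a hypothetical example; the dictionary words are those of the automaton above):

```python
from functools import lru_cache

def correct_word(word, dictionary):
    """Replace an OCR output by the dictionary entry at minimal edit distance."""
    def dist(s, t):
        @lru_cache(maxsize=None)
        def d(i, j):
            if i == 0 or j == 0:
                return i + j
            return min(d(i - 1, j) + 1,                          # deletion
                       d(i, j - 1) + 1,                          # insertion
                       d(i - 1, j - 1) + (s[i - 1] != t[j - 1])) # substitution
        return d(len(s), len(t))
    return min(dictionary, key=lambda w: dist(word, w))

print(correct_word("LLIBPE", ["LLIBRE", "LLIURE", "CAURE", "COURE", "COST"]))
# -> LLIBRE (one substitution away)
```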

Post-process

170

Examples of OCR systems

  • Printed characters:

– Readstar (Innovatic)
– Cuneiform (Cognitive Technology)
– Word Scan (Calera)
– OmniPage (Caere)
– Text Bridge (Xerox)
– Neuro Talker OCR (Int. Neural Machines)
– OCR Master
– Recognita Plus
– TypeReader Professional
– Etc.

  • Hand-written characters:

– Mitek.

Examples

slide-86
SLIDE 86

86

171

Examples of OCR systems

  • Inspection of surgical sachets

– Digit recognition: reference and date
– System requirements:

  • Irregular surface: shadows and reflections
  • Resolution: 175 dpi
  • Acquisition with a B/W camera
  • Diffuse lighting
  • Detection of 16 defects in 400 ms
  • Verification system

Examples

172

Examples of OCR systems

  • Pre-processing

– Skew correction: combination of the angle of the upper contour and the segments of the box surrounding the word LOT
– Binarization: Otsu
– Thinning

Examples

slide-87
SLIDE 87

87

173

Examples of OCR systems

  • Segmentation

– Connected components from the skeleton
– Application of domain knowledge (character size and separation) to segment touching or broken characters:

  • Divide wide components
  • Join thin and nearby components

Examples

174

Examples of OCR systems

  • Feature extraction: zoning

– The size of each zone is not constant: it is adapted to the image size
– Two versions:

  • Version 1, value at each zone: a measure of the importance of the zone with respect to the whole character:

– 1 if the number of pixels is greater than a percentage of the total number of pixels in the image; 0 otherwise
– The central region is more important: its value is multiplied by 2
– Three values are added to combine the values of the more discriminant zones

  • Version 2, value at each zone: percentage of white pixels in the zone

[Figure: example zone values (0.0, 0.5, 2.0) and the zone weight map, with the central zones weighted 2 and the rest weighted 1]

Examples

slide-88
SLIDE 88

88

175

Examples of OCR systems

  • Classification

– Version 1:

  • Model with the minimal distance between the image and model feature vectors:

$$d = \sum_{j=1}^{n} |i_j - m_j|$$

  • If several models have the same minimal distance and the digit to verify is among the candidates, it is verified with an ambiguous signal

– Version 2:

  • Mahalanobis distance
  • Learning step to compute the mean and covariance for each digit

– Verification of the digit string:

  • The string is rejected if there is more than one mis-recognized digit or more than two ambiguous digits

Examples

176

Bibliography

  • S. Mori, H. Nishida, H. Yamada. Optical Character Recognition. John Wiley and Sons, 1999.
  • H. Bunke, P.S.P. Wang. Handbook of Character Recognition and Document Image Analysis. World Scientific Publishing Company, 1997.
  • S.V. Rice, G. Nagy, T.A. Nartker. Optical Character Recognition: An Illustrated Guide to the Frontier. Kluwer Academic Publishers, 1999.
  • S. Impedovo. Fundamentals in Handwriting Recognition. Springer-Verlag, 1994.
  • A. Belaïd, Y. Belaïd. Reconnaissance des formes. Méthodes et Applications. Inter Editions, Paris, 1992.
  • A.C. Downton, S. Impedovo. Progress in Handwriting Recognition. World Scientific Publishing Company, 1997.
  • P.S.P. Wang. Character and Handwriting Recognition: Expanding Frontiers. Special Issue of IJPRAI, Vol. 5, nums. 1-2, 1991.
  • T. Pavlidis, S. Mori. Optical Character Recognition. Special Issue of Proceedings of the IEEE, Vol. 80, no. 7, 1992.
  • V.K. Govindan, A.P. Shivaprasad. Character Recognition - A Review. Pattern Recognition, Vol. 23, no. 7, pp. 671-683, 1990.

Bibliography

slide-89
SLIDE 89

89

177

Bibliography

  • O.D. Trier, A.K. Jain, T. Taxt. Feature Extraction Methods for Character Recognition - A Survey. Pattern Recognition, Vol. 29, no. 4, pp. 641-662, 1996.
  • R.G. Casey, E. Lecolinet. A Survey of Methods and Strategies in Character Segmentation. IEEE Transactions on PAMI, Vol. 18, no. 7, pp. 690-706, 1996.
  • Y. Lu. Machine Printed Character Segmentation - An Overview. Pattern Recognition, Vol. 28, no. 1, pp. 67-80, 1995.
  • Y. Lu, M. Shridar. Character Segmentation in Handwritten Words - An Overview. Pattern Recognition, Vol. 29, no. 1, pp. 77-96, 1996.
  • S.W. Lee. Advances in Handwriting Recognition. World Scientific, 1999.
  • C.H. Chen, L.F. Pau, P.S.P. Wang. Handbook of Pattern Recognition and Computer Vision. World Scientific, 1993.
  • J.L. Blue et al. Evaluation of Pattern Classifiers for Fingerprint and OCR Applications. Pattern Recognition, Vol. 27, no. 4, pp. 485-501, 1994.
  • R. Plamondon, S. Srihari. On-line and Off-line Handwriting Recognition: A Comprehensive Survey. IEEE Transactions on PAMI, Vol. 22, no. 1, pp. 63-84, 2000.
  • G. Nagy. Twenty Years of Document Image Analysis in PAMI. IEEE Transactions on PAMI, Vol. 22, no. 1, pp. 38-62, 2000.
  • H. Bunke, T. Caelli. Hidden Markov Models: Applications in Computer Vision. World Scientific, 2001.
  • R. Duda, P. Hart, D. Stork. Pattern Classification. 2nd ed., Wiley Interscience, 2000.

Bibliography

178

Practical work

Binarization (local adaptive): document image → binary image
Layout analysis: binary image → N binary text images
Character segmentation: text images → N character images
Feature extraction: character image → feature vector
Classification (multiple classifiers): feature vector → character label

Groups of 2 people for each task