Historical Document Analysis, Marcus Liwicki, University of Fribourg

Historical Document Analysis Marcus Liwicki University of Fribourg University of Kaiserslautern Insiders Technologies GmbH marcus.liwicki@unifr.ch Marcus Liwicki, Historical Document Analysis

Typical Tasks of Scholars in the Humanities  Cataloging  Transcribing  Searching  Comparing texts  …  It’s hard to find interesting and relevant documents

State of the Art Tool in the Humanities: Catalogs  But automatic methods can help!

Vision: DIVA Desk  A scientific workbench for scholars

Outline  Challenge: Why historical Documents?  State-of-the-Art  Recent Trends  DIVA Services: Approach Towards Interoperability

What is the main Challenge?  Data variation?

Data Variation  Different languages and alphabets  Writing style differs  Quality of the images/data  Changing writing instruments  Abbreviations and misspellings  Graphics & handwriting  Language and writing evolves  Annotations  Change of support

What is the main Challenge?  Data variation?  Degradation?

What is the main Challenge?  Data variation?  Degradation?  Communication between humanist scholars and computer science experts!

Communication between Humanist Scholars and DIA Experts  Different expectations ^ Clearly defined challenging datasets vs. useful systems  Bridging the gap is the biggest challenge

Success in Computer Science ?!?
 HIP 2011 (27 papers accepted) ^ Information retrieval (text / graphic) ^ Projects ^ Text/Character recognition + Calligraphy ^ Visualization ^ Digitization
 HIP 2013 (18 papers accepted) ^ Information Extraction and Retrieval ^ Reconstruction and Degradation ^ Text and Image Recognition ^ Segmentation, Layout Analysis and Databases
 HIP 2015 (18 papers accepted) ^ Text Transcription ^ Segmentation and Layout Analysis ^ Templates, Date Estimation, and Script Specific Approaches
But: ask a random scholar attending the Digital Humanities conference: Do you know about HIP?
Thanks to Mickael Coustaty, IDAKS 2016

Overview of Projects on Hist-OCR  EU IMPACT Project (2008-2012)  EU TRANSKRIBUS (2012-2016)  EU READ (2016-now)  CIS, LMU München, Post-OCR Correction  OCR-D, DFG project (since 2015, 1.5 million books)  Early Modern OCR Project, Texas A&M (2012-2015)  Kallimachos (Uni Würzburg, 2014-2017)  Ocular, University of California, Berkeley (2013-now)  … * If you ask scholars who want to use the systems

Communication Problems and Approaches for Solution
For Computer Science Experts: • Not a unique representation of knowledge • Same content has a lot of interpretation • A description is not shared by all scientists • Focus on different aspects
For Scholars in the Humanities: • Methods are not understandable • Not clear what 95% means • Systems not accessible • Too specific solutions
 We need more interdisciplinary discussions  Reduce black box effects (describe methods, give examples)  Interfaces needed  Approximate results are not enough; alternatives to be reported

Outline  Challenge: Why historical Documents?  State-of-the-Art  Recent Trends  DIVA Services: Approach Towards Interoperability

Processing Steps of Automatic DIA  Steps: Preprocessing, Binarization, Layout Analysis, Classification, OCR, Information Extraction  Binarization: threshold (local, global), e.g. Sauvola; classification  Layout analysis: top-down vs bottom-up
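
The binarization step names local thresholding with Sauvola's method. As a minimal sketch (not taken from the slides), the commonly cited Sauvola rule derives a per-pixel threshold T = m * (1 + k * (s / R - 1)) from the local mean m and the local standard deviation s. The function name, the window size, and the defaults k = 0.5 and R = 128 are assumptions chosen for illustration, and the loop is deliberately naive rather than fast.

```python
import math


def sauvola_binarize(gray, window=3, k=0.5, R=128.0):
    """Binarize a grayscale image (list of rows, values 0..255).

    Returns an image of the same size with 1 for ink (dark) pixels and
    0 for background. window, k and R are illustrative defaults.
    """
    h, w = len(gray), len(gray[0])
    half = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Collect the local neighbourhood, clipped at the image border.
            vals = [
                gray[j][i]
                for j in range(max(0, y - half), min(h, y + half + 1))
                for i in range(max(0, x - half), min(w, x + half + 1))
            ]
            m = sum(vals) / len(vals)
            s = math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals))
            # Sauvola threshold: lower in flat regions, closer to the mean
            # where the local contrast is high.
            t = m * (1.0 + k * (s / R - 1.0))
            out[y][x] = 1 if gray[y][x] < t else 0
    return out
```

In a uniform region s is 0, so the threshold falls to m * (1 - k) and no pixel is marked as ink; this is what makes the rule tolerant of uneven backgrounds.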

Layout Analysis Methods  Based on connected components  XY-cut  Other histogram-based approaches
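
To make the XY-cut idea concrete, here is a minimal sketch of a recursive XY-cut on a binary image (1 = ink). It splits a region at empty gaps in the horizontal projection profile, then in the vertical one, and stops when no gap of at least `min_gap` pixels remains. The function names, the `min_gap` parameter and the block format are assumptions for illustration, not the method of any particular system.

```python
def _runs(profile, min_gap):
    """Return (start, end) ranges of ink in a projection profile.

    Ranges are separated by gaps of at least min_gap empty entries;
    leading and trailing empty entries are trimmed. end is exclusive.
    """
    runs, start, gap = [], None, 0
    for i, v in enumerate(profile):
        if v > 0:
            if start is None:
                start = i
            elif gap >= min_gap:
                runs.append((start, i - gap))
                start = i
            gap = 0
        elif start is not None:
            gap += 1
    if start is not None:
        runs.append((start, len(profile) - gap))
    return runs


def _cut(img, top, left, bottom, right, min_gap, blocks):
    # Horizontal projection profile: ink count per row of the region.
    rows = [sum(img[r][left:right]) for r in range(top, bottom)]
    row_runs = _runs(rows, min_gap)
    if not row_runs:
        return  # empty region
    if len(row_runs) > 1:
        for a, b in row_runs:
            _cut(img, top + a, left, top + b, right, min_gap, blocks)
        return
    a, b = row_runs[0]
    top, bottom = top + a, top + b  # trim empty rows
    # Vertical projection profile: ink count per column of the region.
    cols = [sum(img[r][c] for r in range(top, bottom)) for c in range(left, right)]
    col_runs = _runs(cols, min_gap)
    if len(col_runs) > 1:
        for a, b in col_runs:
            _cut(img, top, left + a, bottom, left + b, min_gap, blocks)
        return
    a, b = col_runs[0]
    blocks.append((top, left + a, bottom, left + b))


def xy_cut(img, min_gap=2):
    """Return blocks as (top, left, bottom, right), bottom/right exclusive."""
    blocks = []
    _cut(img, 0, 0, len(img), len(img[0]), min_gap, blocks)
    return blocks
```

This is a top-down method: it starts from the whole page and subdivides, in contrast to the bottom-up grouping of connected components.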

Processing Steps of Automatic DIA  Steps: Preprocessing, Binarization, Layout Analysis, Classification, OCR, Information Extraction  Binarization: threshold (local, global), e.g. Sauvola; classification  Layout analysis: top-down (XY-cut, histograms); bottom-up (connected components)

Feature Extraction  Marti, Bunke (2001) ^ Use a sliding window (similar to ASR)
1. Average grey value
2. Center of gravity
3. 2nd order moment (vertical)
4. Uppermost pixel
5. Lowermost pixel
6. Gradient uppermost
7. Gradient lowermost
8. Number of b/w-transitions
9. #pix/d(upper,lower)
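
The nine features above can be sketched for a one-pixel-wide window sliding over a binarized text line (1 = ink). This is an illustrative reading of the list, not the exact definitions from Marti and Bunke: the normalizations, the handling of empty columns, and the function names are assumptions.

```python
def contour(img, x):
    """Return (uppermost, lowermost) ink row of column x.

    An empty column falls back to the bottom row; this is an
    assumption made only to keep the sketch well defined.
    """
    rows = [y for y in range(len(img)) if img[y][x]]
    if not rows:
        return len(img) - 1, len(img) - 1
    return rows[0], rows[-1]


def column_features(img, x):
    """Nine features for the window at column x, in the order of the list."""
    h, w = len(img), len(img[0])
    col = [img[y][x] for y in range(h)]
    ink = [y for y, v in enumerate(col) if v]
    upper, lower = contour(img, x)
    # Gradients compare the contour with the next column (0 at the last one).
    next_upper, next_lower = contour(img, min(x + 1, w - 1))
    transitions = sum(1 for y in range(1, h) if col[y] != col[y - 1])
    between = sum(col[upper:lower + 1])
    return [
        len(ink) / h,                        # 1. average grey value (ink fraction)
        sum(ink) / h,                        # 2. center of gravity
        sum(y * y for y in ink) / (h * h),   # 3. 2nd order vertical moment
        upper,                               # 4. uppermost pixel
        lower,                               # 5. lowermost pixel
        next_upper - upper,                  # 6. gradient of the upper contour
        next_lower - lower,                  # 7. gradient of the lower contour
        transitions,                         # 8. number of b/w transitions
        between / (lower - upper + 1),       # 9. ink pixels / contour distance
    ]


def sliding_window_features(img):
    """One feature vector per column, left to right."""
    return [column_features(img, x) for x in range(len(img[0]))]
```

The result is a sequence of feature vectors, one per horizontal position, which is the kind of input a sequence classifier such as an HMM or a recurrent network consumes.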

Classification  Machine learning methods for sequences ^ HMMs ^ Recurrent NNs

Bidirectional Long Short-Term Memory Network  Importance of context  Built up in steps: multilayer perceptron network, recurrent connections, bidirectional, memory instead of perceptron [Figure: Features → Input Layer → Hidden Layers → Output Layer → Transcription] (slide dated November 1, 2007)

Limits of MLP  Limit: static input/output operation $(x_1, \dots, x_n) \mapsto y$  Human brain is capable of memorizing  Needed for solving many problems ^ Sequence recognition ^ Navigation through a labyrinth ^ Video analysis  Sequence mapping: $((x_1^1, \dots, x_n^1), \dots, (x_1^T, \dots, x_n^T)) \mapsto (y^1, \dots, y^U) \mid U \le T$  Idea: add backward-connections to maintain state

Recurrent Neural Networks (RNNs)  Recurrent connections are added in order to keep information from previous time steps in the network  New equation for the activation: $a^t = \sum_i w_i x_i^t + \sum_h w_h b_h^{t-1}$  Context information is used  How to train those networks …? [Figure: Features → Input Layer → Hidden Layer → Output Layer → Output]
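
As a minimal sketch of the recurrent activation on this slide, the forward pass below adds the weighted inputs at time t to the weighted hidden outputs from time t-1. The tanh squashing function, the zero initial state, and the names `W_in` and `W_rec` are assumptions; the slide only gives the weighted sums.

```python
import math


def rnn_forward(xs, W_in, W_rec):
    """Run a simple recurrent layer over a sequence.

    xs:    list of input vectors, one per time step
    W_in:  W_in[h][i] weights input i into hidden unit h
    W_rec: W_rec[h][k] weights previous hidden output k into hidden unit h
    Returns the list of hidden output vectors b^t.
    """
    n_hidden = len(W_in)
    b_prev = [0.0] * n_hidden  # assumed initial state
    outputs = []
    for x in xs:
        a = [
            sum(W_in[h][i] * x[i] for i in range(len(x)))
            + sum(W_rec[h][k] * b_prev[k] for k in range(n_hidden))
            for h in range(n_hidden)
        ]
        b_prev = [math.tanh(v) for v in a]
        outputs.append(b_prev)
    return outputs
```

With one hidden unit and the input sequence 1, 0, the second output is still non-zero although the second input is zero: the recurrent term carries context forward.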

Training of RNNs: Backpropagation Through Time  Unfold the network in time ^ k time steps (parameter) ^ Perform backpropagation for output at t  Repeat this for each $0 \le t \le T-1$ [Figure: network unfolded over time steps t-k, …, t-1, t, each with Features, Input Layer and Hidden Layer; Output Layer and Output at t]

Recurrent Neural Networks (RNN)  Recurrent connections are added in order to keep information from previous time steps in the network  New equation for the activation: $a^t = \sum_i w_i x_i^t + \sum_h w_h b_h^{t-1}$  Can be written in matrix form: $A^t = W_i X^t + W_h B^{t-1}$  Context information is used; however, it is impossible to store precise information over long durations [Figure: Features → Input Layer → Hidden Layer → Output Layer → Output]

Vanishing Gradient  A standard RNN forgets information after a short period of time  Example: one neuron over 7 time steps; the information vanishes
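
The effect can be illustrated numerically with a single recurrent neuron. The sketch below assumes a tanh unit with scalar weights (`w_i`, `w_h` are hypothetical names) and multiplies the per-step derivatives with the chain rule, which gives the sensitivity of the last hidden output to the initial state.

```python
import math


def gradient_through_time(w_i, w_h, inputs, b0=0.0):
    """Return d b^T / d b^0 for a scalar tanh RNN run over `inputs`."""
    b = b0
    grad = 1.0
    for x in inputs:
        b = math.tanh(w_i * x + w_h * b)
        # Chain rule: d b^t / d b^(t-1) = w_h * tanh'(a^t) = w_h * (1 - (b^t)^2)
        grad *= w_h * (1.0 - b * b)
    return grad
```

Because tanh' is at most 1, every step scales the gradient by at most |w_h|. With w_h = 0.5 and 7 time steps the result is bounded by 0.5^7, which is below 0.01, so the training signal from 7 steps back is almost gone.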

Core Idea: New Memory Cell Instead of Perceptron

No Vanishing Gradient  Gate activation: $a^t = W_i X^t + W_h B^{t-1} + W_c S^t$ [Figure: memory cell with output gate and output; gate legend: O = open ($\sigma = 1$), | = closed ($\sigma = 0$)]
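
To show how gates let a memory cell keep its state, here is a minimal sketch of one scalar LSTM step in a commonly used formulation with peephole terms (the cell state feeding the gates). It is not necessarily the exact variant on the slide; all weight names in the dictionary `p` are hypothetical, and missing weights default to 0.

```python
import math
from collections import defaultdict


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, s_prev, p):
    """One step of a scalar LSTM cell; returns (output h, cell state s)."""
    # Input and forget gates see the previous cell state (peephole terms).
    i = sigmoid(p["w_xi"] * x + p["w_hi"] * h_prev + p["w_ci"] * s_prev + p["b_i"])
    f = sigmoid(p["w_xf"] * x + p["w_hf"] * h_prev + p["w_cf"] * s_prev + p["b_f"])
    g = math.tanh(p["w_xg"] * x + p["w_hg"] * h_prev + p["b_g"])
    # The state is carried over additively, scaled only by the forget gate.
    s = f * s_prev + i * g
    # The output gate sees the updated cell state.
    o = sigmoid(p["w_xo"] * x + p["w_ho"] * h_prev + p["w_co"] * s + p["b_o"])
    h = o * math.tanh(s)
    return h, s


def run(xs, p, s0=0.0):
    """Apply lstm_step over a sequence, starting from cell state s0."""
    h, s = 0.0, s0
    for x in xs:
        h, s = lstm_step(x, h, s, p)
    return h, s


# Gates forced by their biases: input closed, forget gate keeping the state,
# output open. The stored value survives 100 time steps almost unchanged.
keep = defaultdict(float, b_i=-20.0, b_f=20.0, b_o=20.0)
h_end, s_end = run([5.0] * 100, keep, s0=1.0)
```

The additive state update is the design choice that matters: as long as the forget gate stays near 1, the state (and the gradient flowing through it) is not repeatedly squashed.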

Bidirectional RNN  Trained with backpropagation through time (forward pass through all time steps for each hidden layer sequentially) [Figure: network unfolded over t-1, t, t+1; each step has Features, Input Layer, Forward Layer and Backward Layer (hidden layers), Output Layer and Output]
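
A minimal sketch of the bidirectional idea: one recurrent layer reads the sequence left to right, a second one reads it right to left, and both outputs are available at every time step. Scalar tanh units and the names `scan` and `bidirectional` are assumptions made to keep the example short.

```python
import math


def scan(xs, w_i, w_h):
    """Outputs of a scalar tanh RNN over xs, starting from state 0."""
    b, out = 0.0, []
    for x in xs:
        b = math.tanh(w_i * x + w_h * b)
        out.append(b)
    return out


def bidirectional(xs, fwd, bwd):
    """Return (forward, backward) hidden outputs for every time step.

    fwd and bwd are (w_i, w_h) weight pairs of the two hidden layers.
    """
    forward = scan(xs, *fwd)
    # The backward layer processes the reversed sequence; its outputs are
    # reversed again so that index t refers to the same time step.
    backward = scan(xs[::-1], *bwd)[::-1]
    return list(zip(forward, backward))
```

For the input 0, 1, 0 the forward output at the first step is 0 (nothing seen yet), while the backward output there is already non-zero because it has seen the later peak; at the last step the roles are swapped.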

Connectionist Temporal Classification  Additional blank label (b, shown in green)  Allows application to whole sequences  Output with normalized likelihood for each word  Training: objective function is smoothed and recalculated after each iteration (details in references)  Testing: similar to HMM Viterbi-algorithm
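
The role of the blank label is easiest to see in decoding. The sketch below uses best-path (greedy) decoding, which is simpler than the dictionary-based, Viterbi-like decoding the slide refers to: take the most likely label per frame, merge repeated labels, then drop blanks. The function name and the convention that index 0 is the blank are assumptions.

```python
def ctc_best_path(frame_probs, labels, blank=0):
    """Greedy CTC decoding.

    frame_probs: per-frame lists of label probabilities
    labels:      label strings; labels[blank] is the blank label
    """
    # Most likely label index for every frame.
    path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    out, prev = [], None
    for k in path:
        # Emit a label only when it changes and is not the blank.
        if k != prev and k != blank:
            out.append(labels[k])
        prev = k
    return "".join(out)
```

The frame path a, a, blank, a, b, b decodes to "aab": the blank between the two runs of "a" is what allows a doubled character to be output.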

Processing Steps of Automatic DIA  Steps: Preprocessing, Binarization, Layout Analysis, Classification, OCR, Information Extraction  Binarization: threshold (local, global), e.g. Sauvola; classification  Layout analysis: top-down (XY-cut, histograms); bottom-up (connected components)  OCR: HMM on features; LSTM with CTC; new: MDLSTM on pixels

Outline  Challenge: Why historical Documents?  State-of-the-Art  Recent Trends  DIVA Services: Approach Towards Interoperability
