

  1. Economical Bimodal Classification of a Massive Heterogeneous Document Collection. Patrick Schone (patrickjohn.schone@familysearch.org). 24 February 2020, Standards Technical Conference.

  2. Overview • Timelines (Lead-up) • Description of the Collections • Classification Goals for Automation • Speed-focused System Architectures • Performance and Outcomes

  3. Timelines (Lead-up). 2015: FamilySearch was able to auto-index 21M born-digital newspapers. Can auto-indexing work with born-paper material? How about handwriting? 2016-2017: FamilySearch & BYU collaborate on technologies to auto-transcribe handwriting. 2017-2018: FamilySearch auto-transcribed about 33M newspaper stories and over 110M mostly-English handwritten & mixed documents, with the goal of auto-indexing them. 2019: Newspapers going forward. But the massively heterogeneous collection makes auto-indexing complex: we need to group & categorize documents, identify 'gotchas', and subdivide images.

  4. Collections of Interest. Two different, but related, kinds of corpora:
     ENGLISH_DEPTH: 163K rolls of film, every image (about 110M images). Represents EVERY instance of particular types of US legal documents.
     ENGLISH_BREADTH: ~1M rolls of film, several images per roll (about 3-4M images). Represents EVERY 'English' roll.

  5. Can We Classify After-the-Fact? If we could describe each image of the Breadth/Depth corpora, we could target sub-collections for auto-indexing based on current capabilities & develop the capability for the others. Also, if we could identify any anomalies, that might help us do a better job handling them. But we want to do this quickly: we want to finish in a week or so. Yet if we took only 1 sec/document (the typical load time of a full image), it would take [1.1 x 10^8 images] x [1 sec/image] = 3.5 CPU years!
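That estimate checks out arithmetically; here is a minimal sketch of the calculation (the 1 sec/image figure is the slide's own assumption):

```python
# Back-of-the-envelope: serial cost of loading every full-resolution image once.
images = 1.1e8           # ~110M images in the ENGLISH_DEPTH corpus
sec_per_image = 1.0      # typical load time of a full image (slide's assumption)

cpu_seconds = images * sec_per_image
cpu_years = cpu_seconds / (365 * 24 * 3600)
print(f"{cpu_years:.1f} CPU years")   # -> 3.5 CPU years
```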

  6. Classify: Semantic Categories. 130+ semantic categories: what is the PURPOSE of the document? Example labels: Vital/Death, Legal Probate/Will, Registration/Civil, Family/Pedigree, Land/Deed, General/Newspaper.

  7. Classify: Layout Categories. ~12 layout categories: what is the STRUCTURE of the document? Example labels: Table/1 Line Per Row, Freeform (Complex), Form, Graphical, Multicolumn, Fill in the Blank.

  8. Classify: Story Count. ~12 story classes: how many unique 'stories' are in the document? Example labels: Story=1n, Story=E&S, Story=1, Story=0p, Story=many, Story=2.

  9. Classify: Language Info. Linguistics: what are the Unicode scripts, languages, countries, and writing styles? Example labels: Latin/Italian/MX, Latin/English/HW, Latin/English/MX, Chinese/Japanese/HP, Latin/Spanish/PR, Latin/English/MX.

  10. Anomalies: Binary Properties. Example flags: SINGLE, FOTO, ROTATED, REV_VIDEO, CRUFT, TWO-D, OLD, MARGIN, LOBE, DRAW, META.

  11. Speedy Classification? One option: use thumbnail images and do image-level classification.
     Definite wins:
     • FamilySearch automatically stores a 200x200 thumbnail of each image (producing one is sketched below).
     • Thumbnails for an entire roll of film (1000 images) occupy about the same storage space as 3 full images [over 99% compression].
     • Since thumbnails are small, load time and subsequent processing time are short.
     • Can see color, periphery, two-up-ness, photos, & line patterns.
     [Thumbnail examples: Paired Forms, Freeform, Multicolumn, RV, Table, Photo, Vertical]
     Drawbacks:
     • The amount of detail is limited, so it is hard to assess the true semantics. One has to guess the semantics: 'this is a paired form, and that is what deeds look like, so I'll guess it's a deed.'
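To make the thumbnail path concrete, here is a minimal sketch of producing the kind of 200x200 thumbnail described above, assuming Pillow is installed; the file names are hypothetical:

```python
# Minimal sketch: downscale a full document scan to a 200x200 thumbnail.
# "full_scan.jpg" and "thumb.jpg" are hypothetical file names.
from PIL import Image

with Image.open("full_scan.jpg") as img:
    img.thumbnail((200, 200))   # in-place downscale, preserving aspect ratio
    img.save("thumb.jpg")       # the small file the classifier loads instead of the scan
```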

  12. Speedy Classification? Another option: use transcripts with bounding boxes and do text-level classification.
     Definite wins:
     • Processing a transcript is orders of magnitude faster than processing thumbnails.
     • Semantic information is often very clear at the textual level.
     • Language, script, country, and writing style should all be straightforward to note (see the sketch below).
     [Transcript examples: 'Know all men by these presents' (Deeds/English); '..my last will and testament' (Will/English); 'Certificate of Death' (Death/English); '…by his attorneys' (Crime/English); 'Separation from U.S. Naval…' (Military/English); 'Indice Decennale' (Census/Italian); 'Diario de Avisos' (News/Spanish); '天文十三' (Pedigree/ZH-JA)]
     Serious drawbacks:
     • Color is gone; borders are likely gone; photos are gone. How can one tell an image was reverse video if all one has is the transcript? How can one tell whether it was a complicated form or nicely laid out?
     • One needs to have the transcripts already.
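As a small illustration of how script (and by extension language) falls out of the transcript alone, here is a minimal standard-library sketch; the snippets are the slide's own examples and the function name is made up for illustration:

```python
# Count characters by the script prefix of their Unicode names (LATIN, CJK, ...).
# Standard library only; a crude but fast cue for the script/language heads.
import unicodedata
from collections import Counter

def script_counts(text: str) -> Counter:
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return counts

print(script_counts("Know all men by these presents"))  # dominated by LATIN
print(script_counts("天文十三"))                          # dominated by CJK
```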

  13. Speedy Classification? BEST option: use BOTH snapshots (thumbnails) AND transcripts + bounding boxes.
     Definite wins:
     • Get the best of both worlds: semantics from the text, visuals from the thumbnail.
     • Not much more expensive than thumbnails alone.
     • Can toggle to a text-based or image-based model if that is all one has.
     [Examples with both kinds of labels: Deeds/English/PairedForm; Will/English/Freeform; Death/English/Form; Crime/English/Newsclip; Military/English/RV with photo; Census/Italian/Table; News/Spanish/Multicolumn; Pedigree/ZH-JA/Vertical]
     Drawbacks:
     • Model management is slightly more complex.

  14. System Architecture: Text Input
     [Architecture diagram, bottom to top: inputs are the transcript words, bounding boxes, and character properties; GLOVE plus random word embeddings are concatenated (⊕) with a 16-D property vector; Dropout = 10%; Conv1D (64 filters, width 5); MaxPool1D (width 4); CudnnLSTM (100); 8 fully-connected layers; eight output heads (Semantics with 131 categories, Structure, Form, Handwriting/Print, Country, Language, Script, Binary anomalies), each with its own loss function (cross-entropy per head, binary cross-entropy for the anomaly flags) and loss weights of 1, 0.7, 0.7, 0.1, 0.2, 0.1, 0.3, and 1.]
     Trained on 14.4K images with a 1.6K-image dev set; 82.4% accuracy.
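A minimal sketch of a multi-head text model in this spirit, assuming TensorFlow/Keras; the vocabulary size, sequence length, hidden width, and head names/sizes are placeholders, and the mapping of loss weights to heads is an assumption rather than something the slide states:

```python
# Minimal sketch of the text branch: embeddings ⊕ property vector,
# Conv1D -> MaxPool1D -> LSTM -> 8 dense layers -> eight task heads.
from tensorflow.keras import layers, Model

VOCAB, SEQ_LEN, PROP_DIM = 50_000, 512, 16   # placeholders

words = layers.Input(shape=(SEQ_LEN,), dtype="int32", name="transcript_words")
props = layers.Input(shape=(SEQ_LEN, PROP_DIM), name="bbox_char_props")  # 16-D property vector

emb = layers.Embedding(VOCAB, 100)(words)    # GLOVE-initialized + random in the real system
x = layers.Concatenate()([emb, props])       # ⊕ word embedding with property vector
x = layers.Dropout(0.10)(x)
x = layers.Conv1D(64, 5, activation="relu")(x)
x = layers.MaxPooling1D(4)(x)
x = layers.LSTM(100)(x)                      # CudnnLSTM(100) in the original
for _ in range(8):                           # 8 fully-connected layers
    x = layers.Dense(256, activation="relu")(x)

heads = [                                    # head sizes other than 131 are guesses
    layers.Dense(131, activation="softmax", name="semantic")(x),
    layers.Dense(12, activation="softmax", name="layout")(x),
    layers.Dense(12, activation="softmax", name="stories")(x),
    layers.Dense(3, activation="softmax", name="hw_print")(x),
    layers.Dense(50, activation="softmax", name="country")(x),
    layers.Dense(40, activation="softmax", name="language")(x),
    layers.Dense(20, activation="softmax", name="script")(x),
    layers.Dense(11, activation="sigmoid", name="anomalies")(x),
]

text_model = Model([words, props], heads)
text_model.compile(
    optimizer="adam",
    loss={"semantic": "sparse_categorical_crossentropy",
          "layout": "sparse_categorical_crossentropy",
          "stories": "sparse_categorical_crossentropy",
          "hw_print": "sparse_categorical_crossentropy",
          "country": "sparse_categorical_crossentropy",
          "language": "sparse_categorical_crossentropy",
          "script": "sparse_categorical_crossentropy",
          "anomalies": "binary_crossentropy"},
    # The slide's per-head loss weights (1, 0.7, 0.7, 0.1, 0.2, 0.1, 0.3, 1);
    # which weight pairs with which head is assumed here.
    loss_weights={"semantic": 1.0, "layout": 0.7, "stories": 0.7, "hw_print": 1.0,
                  "country": 0.2, "language": 0.1, "script": 0.1, "anomalies": 0.3},
)
```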

  15. System Design: Image Input
     [Architecture diagram: the input is the 200x200 thumbnail (fed to the network at 224x224), passed through a top-removed EfficientNet-B1 [M. Tan, Q. Le, 2019], then a 7x7 2-D max pool, Dropout (20%), Flatten, 8 fully-connected layers, and the same eight output heads with loss weights 1, 0.7, 0.7, 0.1, 0.2, 0.1, 0.3, and 1 as the text model. 82.1% accuracy.]
     EfficientNet variants (results reported by Tan & Le):
     Net  #Params  #FLOPs  vs. comparison net
     B0   5.3M     0.39B   9%   (ResNet-50)
     B1   7.8M     0.70B   12%  (Inception-V3)
     B2   9.2M     1.0B    7.6% (Inception-V4)
     B3   12M      1.8B    5.6% (ResNeXt-50)
     B4   19M      4.2B    18%  (AmoebaNet-A)
     B5   30M      9.9B    24%  (AmoebaNet-C)
     B6   43M      19B
     B7   66M      37B
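A minimal sketch of that image branch, assuming TensorFlow/Keras and its built-in EfficientNetB1; head names and sizes are the same placeholders as in the text sketch above:

```python
# Minimal sketch of the image branch: top-removed EfficientNet-B1,
# 7x7 max pool, 20% dropout, flatten, 8 dense layers, eight task heads.
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import EfficientNetB1

inp = layers.Input(shape=(224, 224, 3), name="thumbnail")  # 200x200 thumbnails resized to 224x224
backbone = EfficientNetB1(include_top=False, weights="imagenet", input_tensor=inp)

x = layers.MaxPooling2D(pool_size=7)(backbone.output)      # 7x7 2-D max pool
x = layers.Dropout(0.20)(x)
x = layers.Flatten()(x)
for _ in range(8):                                         # 8 fully-connected layers
    x = layers.Dense(256, activation="relu")(x)

heads = [
    layers.Dense(131, activation="softmax", name="semantic")(x),
    layers.Dense(12, activation="softmax", name="layout")(x),
    layers.Dense(12, activation="softmax", name="stories")(x),
    layers.Dense(3, activation="softmax", name="hw_print")(x),
    layers.Dense(50, activation="softmax", name="country")(x),
    layers.Dense(40, activation="softmax", name="language")(x),
    layers.Dense(20, activation="softmax", name="script")(x),
    layers.Dense(11, activation="sigmoid", name="anomalies")(x),
]

image_model = Model(inp, heads)
```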

  16. System Design: Fused Input
     [Architecture diagram: for each of the eight heads (Semantics, Structure, Form, Handwriting/Print, Country, Language, Script, Binary anomalies), the text-model output and the image-model output are concatenated (⊕) and passed through a fully-connected layer to the final prediction for that head. 86.7% accuracy.]
     For the fully-connected weights at the start, assume near-50% weights for class C from the text (or image) branch going to class C in the final output, and near-zero weights for all other connections.
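A minimal sketch of that initialization for a single head, assuming TensorFlow/Keras and the 131-way semantic head; the layer and variable names are illustrative, not from the slides:

```python
# Fuse the text and image predictions for one head (131 semantic classes here),
# initializing the dense kernel so class C from each branch maps ~50% onto final
# class C and every other connection starts near zero.
import numpy as np
from tensorflow.keras import layers, Model

N = 131  # classes in the semantic head

text_probs = layers.Input(shape=(N,), name="text_semantic")
image_probs = layers.Input(shape=(N,), name="image_semantic")
both = layers.Concatenate()([text_probs, image_probs])      # shape (2N,)

fused_layer = layers.Dense(N, activation="softmax", name="fused_semantic")
fused = fused_layer(both)
fusion_head = Model([text_probs, image_probs], fused)

# Kernel of shape (2N, N): two stacked near-identity blocks scaled by ~0.5,
# plus tiny noise so the remaining connections are near-zero rather than exactly zero.
kernel = np.vstack([0.5 * np.eye(N), 0.5 * np.eye(N)]).astype("float32")
kernel += np.random.normal(scale=1e-3, size=kernel.shape).astype("float32")
fused_layer.set_weights([kernel, np.zeros(N, dtype="float32")])
```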

  17. Outcomes: Timings. 115,973,482 images. Ran TWO trials: the first was TEXT ONLY, the second was FULL.
     TextOnly: ran on one box (a dual-GPU system) with three jobs per GPU (but a lock around the GPU process); took 3.5 days.
     FullSystem: re-ran on 3 different machines with variable numbers of GPUs, but it would have taken ~20 days on the 'TextOnly' system (with the bulk of the additional cost going to thumbnail processing).
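A rough throughput check of the text-only run (a minimal sketch; the per-job figure assumes the dual-GPU, three-jobs-per-GPU setup described above):

```python
# Implied throughput of the text-only run: ~116M images in 3.5 days.
images = 115_973_482
days = 3.5
jobs = 2 * 3                      # dual GPU, three jobs per GPU

images_per_sec = images / (days * 86_400)
print(f"{images_per_sec:.0f} images/sec overall")          # ~383 images/sec
print(f"{images_per_sec / jobs:.0f} images/sec per job")   # ~64 images/sec per job
```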

  18. Outcomes: Results. 115,973,482 images.
     Recording (%): Handwritten 59.1, Mixed 22.0, PrintOnly 18.3, Blank 0.7
     Layouts (%): Freeform 68.1, Fill-in 18.2, Table/1line 10.4, Form 1.7
     Semantics (%): Deeds 52.6, Land Index 11.6, Gen. Legal 8.3, Gen. Probate 5.6, Will 4.0, Inventory 3.4, Recpt/Check 1.1
     #Stories (%): Exactly 1 35.0, EndOrStrt 19.3, >1 but <2 9.3, End&Start 8.4, 1-∞ Index 7.7, Exactly 2 7.2, Many 7.0
     Anomalies (%): One-ups 52.4, Old (<1800) 3.7, HasMeta 2.0, HasLobes 1.5, ReverseVid 0.6, BleedThru 0.5

  19. Summary • Identified deep neural networks to mine text and image content, with a sparse network combiner • 86.7% accuracy on the 131-category determination, while simultaneously generating multiple other kinds of classifications • Demonstrated the result on a large collection of >110M images. QUESTIONS?
