Book Scanning – Technologies and Techniques
Mike Mansfield
Director of Content Engineering
Ancestry.com / Genealogy.com
Outline
- Project Analysis
- Scanning Parameters
- Book Scanners
Project Analysis Overview
- Scope
- Goals
- Project and Customer Requirements
- Content Evaluation
Project Analysis
- Assessment
- Selection
- Information Representation – Goals and Metrics
- Funding
- Planning and resource assignment
- Prepare the originals for digitization
- Scanning
- Quality Assurance
- Post-Processing – OCR, Compression, Format Conversion
- Return originals to the collection
- Host the data
- Archiving and preservation
Book Scanning Parameters Overview
- Resolution
- Bit Depth
- Dynamic Range
- Tonal Sensitivity
- Geometrical Corrections
  - De-Skew
  - Curve Correction
  - Text Crushing
- Masking and Cropping
Resolution
- Samples Per Inch (SPI), Dots Per Inch (DPI), Pixels Per Inch (PPI)
- Archival Quality
- Access Quality
- "Faithful" representation of the page
Resolution and OCR
- Most OCR engines are optimized for 300 DPI images with typefaces in point sizes between 10 and 14.
- Where the characters on an image are very small (point size of 6 or less), scanning at 400 DPI can improve character recognition.
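The relationship between point size and resolution can be made concrete with a little arithmetic. The sketch below is an illustration only, not something from the slides: it assumes the standard definition of one typographic point as 1/72 inch and treats the point size as the nominal height of the type body (actual letterforms are somewhat smaller). The function name `glyph_height_px` is hypothetical.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def glyph_height_px(point_size: float, dpi: int) -> float:
    """Nominal height of the type body, in pixels, at a given scan resolution."""
    return point_size / POINTS_PER_INCH * dpi


# Compare the OCR "sweet spot" (10-14 pt at 300 DPI) with small 6 pt type.
for point_size, dpi in [(10, 300), (14, 300), (6, 300), (6, 400)]:
    height = glyph_height_px(point_size, dpi)
    print(f"{point_size:>2} pt at {dpi} DPI: about {height:.1f} px tall")
```

Under these assumptions, 6 pt type scanned at 400 DPI is still shorter in pixels than 10 pt type at 300 DPI, which is consistent with the slide's advice to raise the resolution for small type.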
Bit Depth
- Number of colors or "tones" a scanner can differentiate
- Bitonal
- Grayscale
- Color
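Bit depth drives both the number of tones available and the raw size of each image. The following sketch assumes the usual encodings (1 bit per pixel for bitonal, 8 for grayscale, 24 for color) and a hypothetical 8.5 x 11 inch page at 300 DPI; neither the page size nor the function names come from the slides, and the byte counts ignore file headers, row padding, and compression.

```python
def tone_count(bits_per_pixel: int) -> int:
    """Number of distinct values a pixel can take at this bit depth."""
    return 2 ** bits_per_pixel


def uncompressed_bytes(width_in: float, height_in: float, dpi: int,
                       bits_per_pixel: int) -> int:
    """Raw image size in bytes, rounded up to a whole byte."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return (pixels * bits_per_pixel + 7) // 8


# Hypothetical letter-size page scanned at 300 DPI.
for label, bits in [("Bitonal", 1), ("Grayscale", 8), ("Color", 24)]:
    size_mb = uncompressed_bytes(8.5, 11, 300, bits) / 1_000_000
    print(f"{label:<9} {tone_count(bits):>10,} tones, {size_mb:6.2f} MB raw")
```

The jump from bitonal to color multiplies the raw data volume by 24, which is why bit depth is usually decided together with the compression and hosting plans.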
Dynamic Range
- A scanner's dynamic range is a measure of how well the device can record changes in the brightness of the image it's scanning
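The slides do not give a formula for dynamic range. One common convention, assumed here, expresses it in optical-density units as the base-10 logarithm of the ratio between the brightest and darkest signals the device can distinguish. Under that convention a linear n-bit converter has a theoretical ceiling of log10(2^n); real scanners fall short of it because of noise. The function names below are hypothetical.

```python
import math


def density_range(brightest: float, darkest: float) -> float:
    """Dynamic range in density units for a given brightest/darkest signal ratio."""
    return math.log10(brightest / darkest)


def max_density_for_bits(bits: int) -> float:
    """Theoretical ceiling for a linear converter with the given bit depth."""
    return math.log10(2 ** bits)


for bits in (8, 10, 12):
    print(f"{bits}-bit converter: at most {max_density_for_bits(bits):.2f} D")
```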
Tonal Sensitivity
- The ability of a scanner to accurately represent similar, adjacent tonal values as distinct from each other
Geometrical Corrections
- Deskew
- Bookfold Corrections
  - Curve Correction
  - Text Crushing
Deskew
- Skew detection and correction
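The slides do not say which skew-detection method the scanners use. As a sketch of one widely used approach, projection-profile analysis, the code below tries a range of candidate angles and keeps the one that makes the row-by-row ink profile sharpest. Everything here is an assumption for illustration: the page is a synthetic bitonal image built by `make_skewed_page`, rotation is approximated by a vertical shear (reasonable only for small angles), and all function names and parameters are hypothetical.

```python
import math


def make_skewed_page(width, height, skew_deg, line_gap=12, line_thickness=3):
    """Synthetic bitonal page (1 = ink): horizontal text bands sheared by skew_deg."""
    slope = math.tan(math.radians(skew_deg))
    page = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Row this pixel would occupy on an unskewed page.
            source_row = math.floor(y - x * slope)
            if source_row % line_gap < line_thickness:
                page[y][x] = 1
    return page


def profile_score(page, angle_deg):
    """Sum of squared row counts after undoing a shear of angle_deg.

    When the candidate angle matches the skew, ink piles up in a few rows
    and the score peaks; otherwise the ink smears across many rows.
    """
    slope = math.tan(math.radians(angle_deg))
    counts = {}
    for y, row in enumerate(page):
        for x, ink in enumerate(row):
            if ink:
                key = math.floor(y - x * slope)
                counts[key] = counts.get(key, 0) + 1
    return sum(c * c for c in counts.values())


def estimate_skew(page, max_deg=5.0, step_deg=0.25):
    """Return the candidate angle (degrees) with the sharpest projection profile."""
    steps = int(round(2 * max_deg / step_deg))
    candidates = [-max_deg + i * step_deg for i in range(steps + 1)]
    return max(candidates, key=lambda angle: profile_score(page, angle))


page = make_skewed_page(width=200, height=120, skew_deg=2.0)
print("Estimated skew in degrees:", estimate_skew(page))
```

Correction would then resample the image by the negative of the estimated angle. Production software typically refines the search with a coarse-to-fine sweep rather than a fixed step.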
Bookfold Corrections
Curve Correction and Text Crushing
- Pages of bound books are three-dimensional surfaces
Curve Correction and Text Crushing Compensation
- Straighten curves and preserve uniform distances in the drape and gutters of scanned book pages
Finger Masking
- Methods to remove the images of the operator's fingers holding down the pages during scanning
Cropping and Page Splitting
- Detecting and cropping edges to remove portions of the image containing the book cover, end-papers, spine edges, and page fan-outs.
- Splitting double page images.
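Page splitting can be sketched in a few lines. This is a minimal illustration, not the method any particular scanner uses: it assumes a grayscale spread (0 = black, 255 = white) in which the gutter shows up as the darkest column near the middle of the image, and it searches only the central third. The names `find_gutter` and `split_spread` are hypothetical.

```python
def find_gutter(spread, search_fraction=1 / 3):
    """Return the x position of the darkest column in the central band."""
    width = len(spread[0])
    lo = int(width * (0.5 - search_fraction / 2))
    hi = int(width * (0.5 + search_fraction / 2))
    column_sums = [sum(row[x] for row in spread) for x in range(lo, hi)]
    return lo + column_sums.index(min(column_sums))


def split_spread(spread):
    """Split a double-page image into left and right pages at the gutter."""
    gutter = find_gutter(spread)
    left = [row[:gutter] for row in spread]
    right = [row[gutter:] for row in spread]
    return left, right


# Synthetic spread: light paper with a dark gutter shadow around x = 53.
spread = [[230] * 100 for _ in range(20)]
for row in spread:
    row[52], row[53], row[54] = 120, 40, 120

left, right = split_spread(spread)
print("Gutter at x =", find_gutter(spread))
print("Left page width:", len(left[0]), "Right page width:", len(right[0]))
```

On clean bitonal images the gutter may instead be the emptiest column, in which case the same search would look for the minimum ink count rather than the minimum brightness.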
Not What We Want
What We Do Want
Book Scanners Overview
- Document Scanners
- Planetary Book Scanners
- Flying Linear Arrays
- Digital Photography
- Robotic Page Turners
Document Scanners
- Cut the spine off the book and scan the loose pages in a document scanner
- The book is rendered almost useless for additional use
- Rebinding is expensive and slow
- Makes most sense when a sacrificial copy of the book exists.
Document Scanners
- Extremely Fast
- Feature Rich
- Relatively Inexpensive
- Large range of options and price points
- Some limited applications in the Family History and Genealogy domain
Document Scanners
- Major office equipment manufacturers
- Canon
- Fujitsu
- Kodak
- Panasonic
- Ricoh
Document Scanners
- Resolution: 100-600 DPI
- Bit Depths: Bitonal, Grayscale, Color
- Simplex / Duplex
- 2 x 3 inch to 12 x 30 inch documents
- Rate: Few hundred pages per day to tens of thousands of pages per day
- Deskewing, cropping, dithering, dynamic thresholding, binarization, etc.
Planetary Book Scanners
- Specialized devices designed to do primarily one thing – scan bound books
- CCD Array, integrated lighting, specialized scan beds/book cradles, and book-specific image processing options
Dissection of a Minolta PS 7000
- 7,500 Pixel Reduction type line CCD
- Halogen Lamp Lighting
- Up to A2 Size
- 200/300/400/600 DPI
- Bitonal or 8-bit Grayscale
- 4.5 Seconds per scan on an A4 page at 400 DPI
Dissection of a Minolta PS 7000
Image Processing
- Curvature Correction
- Text Crushing Correction
- Centering
- Finger Masking
- Spread/Single/Book Split
- Linearization
Dissection of a Minolta PS 7000
- Articulating Book Cradle
Dissection of a Minolta PS 7000
- Scan buttons on the scan bed
Minolta
Bookeye
Zeutschel
Planetary Book Scanners
- Resolutions from 300 DPI to 600 DPI
- Bit-Depths: Bitonal, Grayscale, Full Color
- Rich feature set well suited to large production projects
- Book cradles, glass plates to reduce page curvature, specialized image processing, human-factors, etc.
- Support for most book sizes from small books to large quarto volumes and smaller atlases
- Proven technology, few moving parts, highly reliable
- 1 page scan in 5-10 seconds
- 1,500 – 3,000 pages per 8-hour shift
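The two throughput figures above can be checked against each other and used for rough planning. The sketch below compares the raw capacity implied by 5-10 seconds per scan with the quoted 1,500 to 3,000 pages per shift, then estimates operator shifts for a project. The 300,000-page project is a made-up example, the function names are hypothetical, and attributing the gap to page turning and book handling is an assumption rather than something the slides state.

```python
import math

SHIFT_SECONDS = 8 * 60 * 60  # one 8-hour shift


def raw_scans_per_shift(seconds_per_scan: float) -> float:
    """Scans per shift if the scanner were exposing continuously."""
    return SHIFT_SECONDS / seconds_per_scan


def shifts_needed(total_pages: int, pages_per_shift: int) -> int:
    """Whole operator shifts required for a project of total_pages."""
    return math.ceil(total_pages / pages_per_shift)


# Quoted scan times paired with the quoted pages-per-shift figures.
for seconds, quoted in [(10, 1500), (5, 3000)]:
    raw = raw_scans_per_shift(seconds)
    print(f"{seconds:>2} s/scan: raw capacity {raw:,.0f} scans, "
          f"quoted {quoted:,} pages ({quoted / raw:.0%} of raw)")

# Hypothetical project: 1,000 books of 300 pages each.
total_pages = 1_000 * 300
for rate in (1500, 3000):
    print(f"At {rate:,} pages/shift: {shifts_needed(total_pages, rate)} shifts")
```

On these numbers the quoted rates are roughly half of the raw scan capacity, so a production estimate based on scan time alone would be about twice as optimistic as the slide's figures.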