LOC-DB Reference Extraction

SLIDE 1

LOC-DB Reference Extraction

DR. DR.-ING SHERAZ AHMED, SYED TAHSEEN RAZA RIZVI

SLIDE 2

LOC-DB Architecture

SLIDE 3

LOC-DB OCR Component

Types of Input files:

  • Digital Born PDF
  • Scanned Documents
  • XML/HTML

Figure panels: XML File, Textual PDF, Scanned Document

SLIDE 4

Reference Extraction from: Scanned Documents

SLIDE 5

Scanned Documents: Reference Extraction

  • Step 1: Binarization
  • Greyscale (0-255) or color image to binary (0-1)

Figure panels: RGB Image, Binary Image
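
The slide does not say which thresholding method is used. As a minimal sketch, assuming a global threshold chosen with Otsu's method (a swapped-in, standard technique, not necessarily the one used in LOC-DB), the conversion from color or greyscale to binary could look like this. The function names and the convention that 1 means bright background and 0 means dark ink are assumptions made for illustration.

```python
def to_grey(rgb):
    """Convert one (r, g, b) pixel to a 0-255 grey level (ITU-R BT.601 weights)."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def otsu_threshold(pixels):
    """Return the grey level that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for row in pixels:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # pixels at or below the candidate threshold
    sum_bg = 0  # sum of their grey levels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0:
            continue
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold=None):
    """Map a greyscale image (list of rows, values 0-255) to 0/1 values.

    Assumed convention: 1 = brighter than the threshold, 0 = ink.
    """
    if threshold is None:
        threshold = otsu_threshold(pixels)
    return [[1 if v > threshold else 0 for v in row] for row in pixels]


grey = [[10, 12, 200],
        [11, 210, 205]]
binary = binarize(grey)
```

A color scan would first be mapped pixel by pixel with `to_grey` before calling `binarize`.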

SLIDE 6

Scanned Documents: Reference Extraction

  • Step 2: Image Classification
  • Single/Double Column Documents

Figure panels: Single Column Document, Double Column Document
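
The slides do not describe how the single/double column decision is made. One simple way to illustrate the idea, offered only as a heuristic sketch and not as the LOC-DB classifier, is a vertical projection profile: count ink pixels per pixel column and look for an empty gutter near the middle of the page. The function name, the `min_gutter_frac` parameter, and the convention that 0 means ink are assumptions.

```python
def classify_columns(binary, min_gutter_frac=0.02):
    """Classify a binarized page as "single" or "double" column.

    binary: list of rows with 0 = ink and 1 = background (assumed convention).
    A page counts as double column if the middle third of the page contains
    a run of ink-free pixel columns at least min_gutter_frac of the width.
    """
    height = len(binary)
    width = len(binary[0])

    # Vertical projection profile: number of ink pixels in each pixel column.
    ink = [sum(1 for y in range(height) if binary[y][x] == 0)
           for x in range(width)]

    # Longest ink-free run inside the middle third of the page.
    longest = run = 0
    for x in range(width // 3, 2 * width // 3):
        run = run + 1 if ink[x] == 0 else 0
        longest = max(longest, run)

    min_gutter = max(1, int(width * min_gutter_frac))
    return "double" if longest >= min_gutter else "single"


# Toy pages, 30 pixels wide: margins of 2, text blocks drawn as solid ink.
single_page = [[1] * 2 + [0] * 26 + [1] * 2 for _ in range(4)]
double_page = [[1] * 2 + [0] * 11 + [1] * 4 + [0] * 11 + [1] * 2 for _ in range(4)]
```

Real scans contain noise, so a practical version would tolerate a small amount of ink in the gutter instead of requiring none.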

SLIDE 7

Scanned Documents: Reference Extraction

  • Step 3: OCR (Optical Character Recognition)

Figure: OCR Result

SLIDE 8

Scanned Documents: Reference Extraction

  • Step 4: Reference Segmentation
  • Using ParsCit
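
ParsCit relies on a trained sequence model, which is not reproduced here. To show what reference segmentation produces, the following is a much simpler rule-based sketch that splits OCR text into one string per reference, assuming each entry starts with a marker such as `[1]` or `1.`. The function name and the placeholder entries are hypothetical.

```python
import re

# A reference is assumed to start with "[12]" or "12." followed by whitespace.
MARKER = re.compile(r"^(?:\[\d+\]|\d+\.)\s+")


def segment_references(text):
    """Split the text of a reference list into individual reference strings."""
    refs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if MARKER.match(line) or not refs:
            # New entry: drop the marker and start a new reference string.
            refs.append(MARKER.sub("", line))
        else:
            # Continuation line: append to the current reference.
            refs[-1] += " " + line
    return refs


ocr_text = """\
[1] A. Author. First placeholder title.
    Journal A, 2001.
[2] B. Author. Second placeholder title. 2002.
"""
references = segment_references(ocr_text)
```

A continuation line that happens to begin with a number and a period (for example a year) would be mistaken for a new entry, which is one reason learned models are preferred for this step.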
SLIDE 9

Reference Extraction from: Textual / Digital Born PDFs

SLIDE 10

Digital Born PDFs: Reference Extraction

  • Step 1: Text Extraction

Figure panels: Textual PDF, Extracted Text

SLIDE 11

Digital Born PDFs: Reference Extraction

  • Step 2: Reference Extraction
  • Using ParsCit
SLIDE 12

Reference Extraction from: Structured XML

SLIDE 13

Structured XML: Reference Extraction

  • Step 1: Preprocessing
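
The slide gives no detail on what the preprocessing involves. A plausible reading, stated here as an assumption, is that the markup is flattened into plain reference strings before they are passed on to reference extraction. The sketch below does this with Python's standard `xml.etree.ElementTree`; the `ref` tag name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_reference_strings(xml_text, ref_tag="ref"):
    """Return the plain text of every <ref_tag> element, whitespace-normalized."""
    root = ET.fromstring(xml_text)
    refs = []
    for node in root.iter(ref_tag):
        # itertext() yields the text of the element and all nested children.
        text = " ".join("".join(node.itertext()).split())
        if text:
            refs.append(text)
    return refs


sample = """<article>
  <body><p>Body text.</p></body>
  <back>
    <ref-list>
      <ref id="r1"><name>A. Author</name>. <title>Placeholder title</title>. 2001.</ref>
      <ref id="r2">B. Author.
        Another placeholder. 2002.</ref>
    </ref-list>
  </back>
</article>"""
plain_refs = extract_reference_strings(sample)
```

HTML input that is not well-formed XML would need a more forgiving parser than `ElementTree`.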
SLIDE 14

Structured XML: Reference Extraction

  • Step 2: Reference Extraction
  • Using ParsCit
SLIDE 15

Reference Extraction Pipeline - Overview

  • Scanned Documents: Binarization → Image Classification → OCR → Reference Segmentation
  • Textual PDFs: Text Extraction → Reference Segmentation
  • Structured XML: Pre-Processing → Reference Segmentation
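
The routing of the three input types through their steps can be sketched as a small lookup table. The stage names follow the steps listed on the earlier slides; the file-extension rules, the `has_text_layer` flag, and the function names are assumptions for illustration, not the LOC-DB implementation.

```python
import os

# Stage names per input type, following the steps listed in the slides.
PIPELINES = {
    "scanned": ["binarization", "image_classification", "ocr",
                "reference_segmentation"],
    "textual_pdf": ["text_extraction", "reference_segmentation"],
    "structured_xml": ["preprocessing", "reference_segmentation"],
}

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}
MARKUP_EXTENSIONS = {".xml", ".html", ".htm"}


def detect_input_type(filename, has_text_layer=False):
    """Guess the input type from the extension (hypothetical rules).

    A PDF is treated as textual only if the caller reports a text layer;
    otherwise it is handled like a scanned document.
    """
    ext = os.path.splitext(filename)[1].lower()
    if ext in MARKUP_EXTENSIONS:
        return "structured_xml"
    if ext == ".pdf":
        return "textual_pdf" if has_text_layer else "scanned"
    if ext in IMAGE_EXTENSIONS:
        return "scanned"
    raise ValueError(f"unsupported input file: {filename}")


def plan(filename, has_text_layer=False):
    """Return the ordered list of stages for one input file."""
    return PIPELINES[detect_input_type(filename, has_text_layer)]
```

For example, `plan("paper.pdf", has_text_layer=True)` returns the two-stage textual PDF route, while `plan("scan.tif")` returns the four-stage scanned route.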

SLIDE 16

DeepBibX: A Neural Network based approach

SLIDE 17

DeepBibX: Intuition

SLIDE 18

Neural Network Based Approach

SLIDE 19

Comparison with ParsCit

Figure panels: ParsCit Output, DeepBibX Output

SLIDE 20

Comparison with ParsCit

  • On a test set of 286 bibliographic document images:

  • Total: 5090 references
  • ParsCit extracted: 3645 references
  • Proposed approach: 4323 references
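
The counts above can be turned into detection rates with a short calculation. The slide reports only how many references each method extracted, so these figures say nothing about how many of the extracted references are correct; the function name is an illustrative choice.

```python
def detection_rate(extracted, total):
    """Fraction of the references in the test set that a method extracted."""
    return extracted / total


TOTAL_REFERENCES = 5090
parscit_rate = detection_rate(3645, TOTAL_REFERENCES)
proposed_rate = detection_rate(4323, TOTAL_REFERENCES)

print(f"ParsCit:           {parscit_rate:.1%}")   # 71.6%
print(f"Proposed approach: {proposed_rate:.1%}")  # 84.9%
print(f"Additional references extracted: {4323 - 3645}")  # 678
```

On these numbers the proposed approach extracts 678 more references than ParsCit, a difference of about 13.3 percentage points.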

Figure: Extraction Comparison, a bar chart of Total References and Total Detections for ParsCit and the FCN based approach (axis ticks 1000 to 6000)

SLIDE 21

Thank you