SLIDE 1

Semantic PDF Segmentation for Legacy Documents in Technical Documentation

Jan Oevermann

jan.oevermann@dfki.de

SEMANTiCS 2018, Vienna, 13.09.18

SLIDE 2

Technical Documentation

Most common: PDF documents

  • “Digital Paper”, archival & distribution
  • ISO Standard, guaranteed reproduction, ubiquitous support

Best practice: XML content components

  • Self-contained building blocks, e.g. chapter-sized, ~150-500 words

  • Reuse, translation, aggregation, delivery

13.09.18 Jan Oevermann (DFKI), SEMANTiCS 2018, Vienna 2

SLIDE 3

[Figure: XML content components (Task, Description), PDF documents, and an online portal with search]

Motivation

SLIDE 4

Faceted search

  • Only safety information of the document
  • I need maintenance information about the fuel injection
  • Everything about the hydraulic pump in technical overview or technical data

Information requests with semantic concepts that can be used as facets
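
A request like the ones above can be read as a filter over segment metadata. The following is a minimal sketch, assuming each detected segment carries a page range plus an information type and a component label. The `Segment` class, its field names, and the sample values are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A page range of a PDF annotated with semantic concepts (hypothetical schema)."""
    first_page: int
    last_page: int
    info_type: str   # e.g. "safety", "maintenance", "technical data"
    component: str   # e.g. "fuel injection", "hydraulic pump"


def facet_filter(segments, info_types=None, components=None):
    """Return segments matching every facet that is set; None means 'any'."""
    return [
        s for s in segments
        if (info_types is None or s.info_type in info_types)
        and (components is None or s.component in components)
    ]


# Invented sample data for illustration only.
segments = [
    Segment(3, 5, "safety", "general"),
    Segment(12, 14, "technical data", "hydraulic pump"),
    Segment(40, 47, "maintenance", "fuel injection"),
    Segment(60, 61, "technical overview", "hydraulic pump"),
]

# "Everything about the hydraulic pump in technical overview or technical data"
hits = facet_filter(
    segments,
    info_types={"technical overview", "technical data"},
    components={"hydraulic pump"},
)
```

Each facet is a set of allowed values, so a request can combine several information types with one component, as in the third example request.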

Motivation

SLIDE 5

Limitations of PDF

  • Semantic structure gets lost
  • No metadata for (overlapping) segments
  • Large documents (>200 pages) only accessible via full-text search

Idea

  • Use knowledge from structured XML content components
  • Manually annotated semantic concepts / metadata
  • Apply trained model on text extracted from PDF
  • Find segments which are semantically relevant

Motivation

SLIDE 6

Procedure model

SLIDE 7

Learning phase

  • Training data
  • Feature extraction (bag of n-grams)
  • Weighting (TF-ICF-CF)
  • Model (VSM)

Classification

  • New data (unclassified)
  • Classifier (cosine similarity / k-nearest neighbour)
  • Prediction

Training / Classification
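
The pipeline on this slide can be sketched in a few lines. This is a simplified illustration, not the implementation behind the talk: it uses word unigrams and bigrams as features, builds one vector per class, weights terms by term frequency times inverse class frequency (TF-ICF), and picks the nearest class vector by cosine similarity. The additional CF factor named on the slide is not defined there and is left out. All function names and the toy training texts are assumptions.

```python
import math
import re
from collections import Counter


def ngrams(text, n_max=2):
    """Bag of word n-grams (n = 1..n_max) of a lower-cased text."""
    words = re.findall(r"\w+", text.lower())
    bag = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            bag[" ".join(words[i:i + n])] += 1
    return bag


def train(labelled_texts):
    """Learning phase: build one TF-ICF weighted vector per class."""
    tf = {}
    for label, text in labelled_texts:
        tf.setdefault(label, Counter()).update(ngrams(text))
    n_classes = len(tf)
    # Class frequency: in how many classes does each term occur?
    class_freq = Counter(term for bag in tf.values() for term in bag)
    icf = {t: math.log(1 + n_classes / class_freq[t]) for t in class_freq}
    vectors = {
        label: {t: count * icf[t] for t, count in bag.items()}
        for label, bag in tf.items()
    }
    return vectors, icf


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, model):
    """Classification: return the most similar class and all scores."""
    vectors, icf = model
    # Terms unseen during training carry no weight and are dropped.
    query = {t: c * icf[t] for t, c in ngrams(text).items() if t in icf}
    scores = {label: cosine(query, vec) for label, vec in vectors.items()}
    return max(scores, key=scores.get), scores


# Toy training data standing in for annotated XML content components.
model = train([
    ("safety", "Warning: risk of injury. Wear protective gloves."),
    ("maintenance", "Replace the oil filter every 500 operating hours."),
    ("safety", "Danger of electric shock. Disconnect the power supply."),
])
label, scores = classify("Wear gloves to avoid injury", model)
```

Because the features come from the text alone, the same model can be applied to text extracted from a PDF without relying on its layout.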

SLIDE 8

Chunking
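
The figure for this step did not survive in the text. As one plausible reading, the text extracted from the PDF is cut into chunks so that each chunk can be classified on its own. The following is a minimal sketch, assuming per-page text is already available as a list of strings. The function name, the window size, and the step are hypothetical choices, not parameters from the talk.

```python
def chunk_pages(pages, size=2, step=1):
    """Slide a window of `size` pages over the document, `step` pages at a time.

    Returns (first_page, last_page, text) tuples with 1-based, inclusive
    page numbers. The last window may be shorter than `size`.
    """
    if size < 1 or step < 1:
        raise ValueError("size and step must be positive")
    chunks = []
    start = 0
    while start < len(pages):
        window = pages[start:start + size]
        chunks.append((start + 1, start + len(window), "\n".join(window)))
        if start + size >= len(pages):
            break  # this window already reaches the end of the document
        start += step
    return chunks


chunks = chunk_pages(["page one", "page two", "page three", "page four"])
```

With `step` smaller than `size` the windows overlap, which gives every page more than one prediction; with `step == size` the chunks are disjoint.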

SLIDE 9

Chunking / Classification

SLIDE 10

Range finding
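
The figure for this step did not survive in the text. One simple way to turn per-chunk predictions into ranges is to merge consecutive chunks that received the same label. The sketch below assumes non-overlapping chunks given as (first_page, last_page, label) tuples in document order. The function name and the sample labels are hypothetical, and the method shown in the talk may differ.

```python
from itertools import groupby


def find_ranges(classified_chunks):
    """Merge runs of adjacent chunks with the same label into page ranges."""
    ranges = []
    for label, run in groupby(classified_chunks, key=lambda chunk: chunk[2]):
        run = list(run)
        # A range spans from the first page of the run to its last page.
        ranges.append((run[0][0], run[-1][1], label))
    return ranges


# Invented predictions for illustration only.
classified = [
    (1, 2, "safety"),
    (3, 4, "safety"),
    (5, 6, "maintenance"),
    (7, 8, "maintenance"),
    (9, 10, "safety"),
]
ranges = find_ranges(classified)
```

The PDF itself stays untouched; the resulting ranges are only metadata that point into it.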

SLIDE 11

https://iirds.org/

Metadata generation

SLIDE 12

Metadata generation

SLIDE 13

Live demo

Application

SLIDE 14

Results

SLIDE 15

Outlook

  • Other text types (e.g. patents) or document formats (e.g. Word)
  • Combination with other techniques (formatting / heuristics)

Conclusion

  • Method relies on text and is formatting-independent
  • No splitting of PDF, just additional metadata
  • Good results in detecting semantic segments
  • Identified ranges can be provided in a standardized format

Outlook & Conclusion

SLIDE 16

Contact

Jan Oevermann

jan.oevermann@dfki.de

www.janoevermann.de

Code & Demo

github.com/j-oe/segments

segments.fastclass.de