SLIDE 1

Semantic PDF Segmentation for Legacy Documents in Technical Documentation

Jan Oevermann

jan.oevermann@dfki.de

SEMANTiCS 2018, Vienna, 13.09.18

SLIDE 2

Technical Documentation

Most common: PDF documents

“Digital Paper”, archival & distribution
ISO Standard, guaranteed reproduction, ubiquitous support

Best practice: XML content components

Self-contained building blocks, e.g. chapter-sized, ~150-500 words

Reuse, translation, aggregation, delivery

13.09.18 Jan Oevermann (DFKI), SEMANTiCS 2018, Vienna 2

SLIDE 3

[Figure: XML content components (Task, Description), PDF documents, and an online portal]

Motivation

SLIDE 4

Faceted search

"Only safety information of the document"
"I need maintenance information about the fuel injection"
"Everything about the hydraulic pump in technical overview or technical data"

Information request with semantic concepts which can be used as facets

Motivation

SLIDE 5

Limitations of PDF

Semantic structure gets lost
No metadata for (overlapping) segments
Large documents (>200p) only accessible via full text search

Idea

Use knowledge from structured XML content components
Manually annotated semantic concepts / metadata
Apply trained model on text extracted from PDF
Find segments which are semantically relevant

Motivation

SLIDE 6

Procedure model

SLIDE 7

[Figure: training and classification pipeline]
Learning phase: training data, feature extraction (bag of n-grams), weighting (TF-ICF-CF), model (VSM)
Classification: new data (unclassified), classifier (cosine similarity / k-nearest neighbour), prediction

Training / Classification
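The pipeline named on this slide can be illustrated with a minimal sketch. It is not the author's implementation: the weighting below is a simple stand-in (term frequency times a log-scaled inverse class frequency), not the exact TF-ICF-CF formula, and the classifier picks the nearest class vector by cosine similarity, i.e. a 1-nearest-neighbour lookup over class prototypes. All function names and the sample texts are hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, n_max=2):
    """Bag of word n-grams (unigrams up to n_max-grams) as a Counter."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)


def train(samples):
    """samples: list of (text, label). Returns one weighted vector per class."""
    class_tf = defaultdict(Counter)
    for text, label in samples:
        class_tf[label].update(ngrams(text))
    n_classes = len(class_tf)
    # Class frequency: in how many classes does each n-gram occur?
    cf = Counter(g for tf in class_tf.values() for g in tf)
    model = {}
    for label, tf in class_tf.items():
        # Stand-in weighting (assumption): tf * log(1 + n_classes / cf).
        model[label] = {g: f * math.log(1 + n_classes / cf[g]) for g, f in tf.items()}
    return model


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def classify(model, text):
    """Return (label, score) of the class vector closest to the text."""
    query = ngrams(text)
    scores = {label: cosine(query, vec) for label, vec in model.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


samples = [
    ("wear safety gloves and goggles", "safety"),
    ("danger of injury wear protective gloves", "safety"),
    ("replace the oil filter every 500 hours", "maintenance"),
    ("check and replace the fuel filter", "maintenance"),
]
model = train(samples)
label, score = classify(model, "replace the filter")
```

In this setup the training samples would come from the annotated XML content components, and `classify` would later be applied to text extracted from the PDF.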

SLIDE 8

Chunking

SLIDE 9

Chunking / Classification
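The slides do not spell out the chunking scheme, so the following is only a sketch under assumptions: text extracted from the PDF is split into fixed-size, overlapping word windows, each window remembers the pages it spans, and every window is then passed to a classifier. The window size, the step, and the names `chunk_pages` and `classify_chunks` are hypothetical.

```python
def chunk_pages(pages, size=50, step=25):
    """Split extracted page texts into overlapping word windows.

    pages: list of strings, one per PDF page (index 0 is page 1).
    Returns a list of dicts with the chunk text and its first/last page.
    """
    # Flatten to (word, page number) pairs so each chunk knows its pages.
    words = [(w, p) for p, text in enumerate(pages, start=1) for w in text.split()]
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        chunks.append({
            "text": " ".join(w for w, _ in window),
            "first_page": window[0][1],
            "last_page": window[-1][1],
        })
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def classify_chunks(chunks, classify):
    """Attach a label to every chunk; classify is any callable text -> label."""
    return [dict(c, label=classify(c["text"])) for c in chunks]


pages = ["a b c d", "e f g h"]
chunks = chunk_pages(pages, size=4, step=2)
labelled = classify_chunks(chunks, lambda t: "x" if "e" in t.split() else "y")
```

Overlapping windows are one way to avoid missing a segment boundary that falls in the middle of a chunk; a smaller step gives finer boundaries at the cost of more classifications.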

SLIDE 10

Range finding
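One plausible way to turn per-chunk predictions into ranges is sketched below. It assumes (this is not taken from the slides) that consecutive chunks with the same label and a score above a threshold are merged into one page range, and that a low-confidence chunk ends the current range. The dictionary keys, the threshold `min_score`, and the function name `find_ranges` are hypothetical.

```python
def find_ranges(chunks, min_score=0.5):
    """Merge runs of consecutive, confidently labelled chunks into page ranges.

    chunks: dicts with keys label, score, first_page, last_page, in reading order.
    Returns a list of dicts with keys label, first_page, last_page.
    """
    ranges, current = [], None
    for c in chunks:
        confident = c["score"] >= min_score
        if confident and current and current["label"] == c["label"]:
            # Same concept continues: extend the open range.
            current["last_page"] = max(current["last_page"], c["last_page"])
            continue
        if current:
            ranges.append(current)  # close the open range
        # Start a new range, or none if this chunk is not confident enough.
        current = (
            {"label": c["label"], "first_page": c["first_page"], "last_page": c["last_page"]}
            if confident
            else None
        )
    if current:
        ranges.append(current)
    return ranges


chunks = [
    {"label": "safety", "score": 0.9, "first_page": 1, "last_page": 1},
    {"label": "safety", "score": 0.8, "first_page": 1, "last_page": 2},
    {"label": "maintenance", "score": 0.2, "first_page": 2, "last_page": 2},
    {"label": "maintenance", "score": 0.7, "first_page": 3, "last_page": 3},
    {"label": "maintenance", "score": 0.9, "first_page": 3, "last_page": 4},
]
ranges = find_ranges(chunks)
```

The resulting ranges only point into the PDF; the document itself is left unsplit, and the ranges can serve as the basis for generated metadata.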

SLIDE 11

https://iirds.org/

Metadata generation

SLIDE 12

Metadata generation

SLIDE 13

Live demo

Application

SLIDE 14

Results

SLIDE 15

Outlook

Other text sorts (e.g. patents) or document types (e.g. Word)
Combination with other techniques (formatting / heuristics)

Conclusion

Method relies on text and is formatting-independent
No splitting of PDF, just additional metadata
Good results in detecting semantic segments
Identified ranges can be provided in a standardized format

Outlook & Conclusion

SLIDE 16

Contact

Jan Oevermann

jan.oevermann@dfki.de www.janoevermann.de

Code & Demo

github.com/j-oe/segments segments.fastclass.de