Dissecting PDF Documents. Mark S. Rasmussen, iPaper, mark@improve.dk



slide-1
SLIDE 1

Dissecting PDF Documents

Mark S. Rasmussen – iPaper mark@improve.dk

slide-2
SLIDE 2

What Is This Session NOT About?

  • Creating PDFs
  • How to use Acrobat
  • Transparency flattening options in InDesign

So what is it about?

  • PDF documents
  • Tooling
  • Extracting data

slide-3
SLIDE 3

The PDF Format

  • 1.0 released in 1993
  • Open standard as of July 1st 2008
  • Reference publicly available

– http://www.adobe.com/devnet/pdf/pdf_reference_archive.html

[Figure: bar chart comparing PDF 1.3, PDF 1.4, PDF 1.5, PDF 1.6, PDF 1.7 and OOXML 1.0; axis ticks at 500, 1000 and 1500]

slide-4
SLIDE 4

PDF Structure

  • Header
    – %PDF-1.4
    – %âãÏÓ (optional but common)
  • Body
    – Objects
  • Xref table
    – Index table containing pointers to objects
  • Trailer
    – Pointers to Xref table, key objects
    – %%EOF
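The four parts above can be sketched in a few lines of Python. This is only an illustration, not a full parser: `build_minimal_pdf` and `read_structure` are hypothetical helper names, the file contains a single catalog object (too little for a viewer to open), and the reader assumes a classic uncompressed xref table with one subsection.

```python
import re


def build_minimal_pdf() -> bytes:
    """Assemble header, body, xref table and trailer with real byte offsets."""
    header = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"          # version line + binary marker comment
    obj1 = b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"  # body: one indirect object
    obj_offset = len(header)
    xref_offset = obj_offset + len(obj1)
    # Each xref entry is exactly 20 bytes: offset, generation, f/n flag, EOL.
    xref = b"xref\n0 2\n0000000000 65535 f \n%010d 00000 n \n" % obj_offset
    trailer = (
        b"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % xref_offset
    )
    return header + obj1 + xref + trailer


def read_structure(data: bytes) -> dict:
    """Read the file the way a viewer does: header first, then jump to the end."""
    version = re.match(rb"%PDF-(\d\.\d)", data).group(1).decode()
    # The number after the last "startxref" keyword is the xref table's offset.
    xref_offset = int(data[data.rindex(b"startxref"):].split()[1])
    lines = data[xref_offset:].split(b"\n")
    if lines[0] != b"xref":
        raise ValueError("expected a classic xref table")
    first, count = map(int, lines[1].split())
    objects = {}
    for i in range(count):
        offset, _generation, kind = lines[2 + i].split()
        if kind == b"n":                                # "n" = in use, "f" = free
            objects[first + i] = int(offset)
    return {"version": version, "xref_offset": xref_offset, "objects": objects}
```

Usage: `read_structure(build_minimal_pdf())["objects"][1]` is the byte offset at which `1 0 obj` starts, which is what makes random access possible.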

slide-5
SLIDE 5

PDF Objects

  • Boolean, Number, String, Name, Array, Dictionary, Stream, Null

  • Indirect & direct objects
  • Random access

“A PDF file should be thought of as a flattened representation of a data structure consisting of a collection of objects that can refer to each other in any arbitrary way.”
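To make the object types concrete, here is a minimal sketch that maps Python values onto PDF syntax. `Name`, `Ref` and `to_pdf` are hypothetical names chosen for this example. Streams are left out (they are a dictionary followed by raw bytes), and string escaping covers only backslashes and parentheses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    """A PDF name object such as /Type."""
    value: str


@dataclass(frozen=True)
class Ref:
    """An indirect reference such as 2 0 R."""
    number: int
    generation: int = 0


def to_pdf(obj) -> str:
    """Serialize a Python value as the matching PDF object."""
    if obj is None:
        return "null"
    if isinstance(obj, bool):              # must come before int: bool subclasses int
        return "true" if obj else "false"
    if isinstance(obj, int):
        return str(obj)
    if isinstance(obj, float):             # PDF numbers have no exponent notation
        return f"{obj:.6f}".rstrip("0").rstrip(".")
    if isinstance(obj, str):
        escaped = obj.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return f"({escaped})"
    if isinstance(obj, Name):
        return "/" + obj.value
    if isinstance(obj, Ref):
        return f"{obj.number} {obj.generation} R"
    if isinstance(obj, list):
        return "[" + " ".join(to_pdf(item) for item in obj) + "]"
    if isinstance(obj, dict):
        inner = " ".join(f"{to_pdf(k)} {to_pdf(v)}" for k, v in obj.items())
        return f"<< {inner} >>"
    raise TypeError(f"no PDF representation for {type(obj).__name__}")
```

A direct object is written inline, as the array is here; an indirect object is stored once under an object number and pointed to with a `Ref`, which is how objects "refer to each other in any arbitrary way".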

slide-6
SLIDE 6

Reading A PDF – The Ninja Way!

slide-7
SLIDE 7

Incremental Changes

  • Fast saves, but not for free
  • Undo & history
  • Save vs Save As
  • Single-pass writing
  • Linearization
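A rough sketch of what an incremental save does, assuming a classic xref table: the new object version, a one-entry xref section and a trailer with a /Prev pointer are appended, and the original bytes stay untouched. All function names are hypothetical, and /Root is hard-coded to object 1 for brevity.

```python
import re


def base_pdf() -> bytes:
    """A tiny two-object file to update (not complete enough for a viewer)."""
    header = b"%PDF-1.4\n"
    obj1 = b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    obj2 = b"2 0 obj\n(first draft)\nendobj\n"
    off1 = len(header)
    off2 = off1 + len(obj1)
    xref_at = off2 + len(obj2)
    xref = b"xref\n0 3\n0000000000 65535 f \n%010d 00000 n \n%010d 00000 n \n" % (off1, off2)
    trailer = b"trailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % xref_at
    return header + obj1 + obj2 + xref + trailer


def startxref_offset(data: bytes) -> int:
    """Offset of the newest xref section, taken from the end of the file."""
    return int(data[data.rindex(b"startxref"):].split()[1])


def append_update(data: bytes, obj_num: int, body: bytes, size: int) -> bytes:
    """Append a new version of one object; nothing already written is modified."""
    prev = startxref_offset(data)
    obj_offset = len(data)
    obj = b"%d 0 obj\n%s\nendobj\n" % (obj_num, body)
    xref_offset = obj_offset + len(obj)
    xref = b"xref\n%d 1\n%010d 00000 n \n" % (obj_num, obj_offset)
    trailer = (
        b"trailer\n<< /Size %d /Root 1 0 R /Prev %d >>\n"
        b"startxref\n%d\n%%%%EOF\n" % (size, prev, xref_offset)
    )
    return data + obj + xref + trailer


def xref_chain(data: bytes) -> list:
    """Follow /Prev from the newest xref section back to the oldest."""
    chain = []
    offset = startxref_offset(data)
    while offset is not None:
        chain.append(offset)
        start = data.index(b"trailer", offset)
        end = data.index(b">>", start)
        match = re.search(rb"/Prev\s+(\d+)", data[start:end])
        offset = int(match.group(1)) if match else None
    return chain
```

After `append_update(base_pdf(), 2, b"(second draft)", size=3)` both drafts are still in the file. That is why the save is fast, why earlier revisions can be recovered, and why the file keeps growing until it is rewritten from scratch.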
slide-8
SLIDE 8

Linearization & Xref Chaining

slide-9
SLIDE 9

PDF Objects: Image

  • Stream object with dictionary header
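As an illustration of "stream object with dictionary header", the sketch below writes and reads back a tiny grayscale image XObject. The helper names and the object number 5 are made up for the example. It assumes a direct /Length value and a single /FlateDecode filter; real files may use an indirect /Length or other filters such as /DCTDecode.

```python
import re
import zlib


def build_image_object(width: int, height: int, pixels: bytes) -> bytes:
    """Wrap raw 8-bit gray pixels in an image XObject with a Flate-compressed stream."""
    compressed = zlib.compress(pixels)
    dictionary = (
        b"<< /Type /XObject /Subtype /Image /Width %d /Height %d "
        b"/ColorSpace /DeviceGray /BitsPerComponent 8 "
        b"/Filter /FlateDecode /Length %d >>" % (width, height, len(compressed))
    )
    return b"5 0 obj\n" + dictionary + b"\nstream\n" + compressed + b"\nendstream\nendobj\n"


def read_image_object(obj: bytes) -> dict:
    """Split the dictionary header from the stream data and decode the pixels."""
    head, _, rest = obj.partition(b"stream\n")

    def number(key: bytes) -> int:
        return int(re.search(rb"/%s\s+(\d+)" % key, head).group(1))

    data = rest[:number(b"Length")]        # /Length says how many bytes to read
    if b"/FlateDecode" in head:
        data = zlib.decompress(data)
    return {"width": number(b"Width"), "height": number(b"Height"), "data": data}
```

The dictionary carries everything needed to interpret the bytes (dimensions, color space, bit depth, filter), so a reader never has to guess from the data itself.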
slide-10
SLIDE 10

ABCpdf

  • Commercial
  • Excellent .NET API
  • ObjectSoup is a valuable friend

  • Good image rendering
  • Useless SWF rendering
  • Unstable rendering
  • Decent support
  • http://www.websupergoo.com/secret.htm
slide-11
SLIDE 11

Acrobat

  • Commercial (tricky license)
  • No COM libraries after 7.x
  • Surprisingly stable and fast
  • Ugly API
slide-12
SLIDE 12

Rendering Using Acrobat

slide-13
SLIDE 13

Xpdf

  • Open source (GPL)
  • pdffonts, pdfimages, pdfinfo, pdftops, pdftotext
  • Basis for many other libraries & tools
  • Commercial license & COM library available at www.glyphandcog.com

  • http://www.foolabs.com/xpdf/
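The command-line tools are easy to drive from a script. The sketch below wraps pdftotext and assumes the Xpdf binaries are on the PATH. The wrapper functions and mode names are hypothetical; `-layout` (keep the physical page layout) and `-raw` (content-stream order) are pdftotext options, and the default is reading order.

```python
import subprocess


def pdftotext_args(pdf_path: str, txt_path: str, mode: str = "reading") -> list:
    """Build the pdftotext command line for one of three extraction modes."""
    flags = {"reading": [], "layout": ["-layout"], "raw": ["-raw"]}
    if mode not in flags:
        raise ValueError(f"unknown mode: {mode}")
    return ["pdftotext", *flags[mode], pdf_path, txt_path]


def extract_text(pdf_path: str, txt_path: str, mode: str = "reading") -> None:
    """Run pdftotext; raises CalledProcessError if the tool reports a failure."""
    subprocess.run(pdftotext_args(pdf_path, txt_path, mode), check=True)
```

For example, `extract_text("in.pdf", "out.txt", mode="layout")` runs `pdftotext -layout in.pdf out.txt`.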
slide-14
SLIDE 14

PDF Font Management

  • Client must have fonts used in PDF document
  • However…

    – Complete font can be embedded
    – Or a subset
    – 14 standard fonts (Courier, Helvetica and Times in four styles each, plus Symbol and ITC Zapf Dingbats)
    – Font replacement
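These cases can usually be told apart from a font's /BaseFont name and from whether its font descriptor carries a font file stream (/FontFile, /FontFile2 or /FontFile3). A minimal sketch follows; `classify_font` and its return labels are made up for this example, and the caller is assumed to have already looked up the two inputs.

```python
import re

# Subset fonts are named with a six-letter uppercase tag and a plus sign,
# for example "ABCDEF+Helvetica".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")

STANDARD_14 = {
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Symbol", "ZapfDingbats",
}


def classify_font(base_font: str, has_font_file: bool) -> str:
    """Classify a font by how a viewer will obtain its glyphs."""
    if has_font_file:
        return "embedded subset" if SUBSET_TAG.match(base_font) else "embedded"
    if base_font in STANDARD_14:
        return "standard 14"
    return "not embedded"                  # viewer must find or substitute the font
```

Only the last case depends on what is installed on the client, which is where font replacement comes in.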

slide-15
SLIDE 15

Text In PDF

  • No concept of text, just characters
  • Flow order not guaranteed
  • Requires guesstimation to extract text
  • Extraction may require embedded fonts
  • Lots of tools, some better than others
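The guesswork can be illustrated with a deliberately naive sketch: group positioned characters into lines by their baseline, sort each line left to right, and insert a space wherever the horizontal gap is large. `Glyph`, `guess_text` and both thresholds are assumptions made for this example. It handles a single column only; multi-column pages need block detection first.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float       # left edge
    y: float       # baseline; the PDF y axis grows upward
    width: float


def guess_text(glyphs, line_tolerance: float = 2.0, space_gap: float = 1.5) -> str:
    """Rebuild lines of text from characters given in arbitrary order."""
    lines = []
    for glyph in sorted(glyphs, key=lambda g: -g.y):        # top of page first
        if lines and abs(lines[-1][0].y - glyph.y) <= line_tolerance:
            lines[-1].append(glyph)                          # same baseline, same line
        else:
            lines.append([glyph])
    result = []
    for line in lines:
        line.sort(key=lambda g: g.x)                         # left to right
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            if cur.x - (prev.x + prev.width) > space_gap:    # wide gap: word break
                text += " "
            text += cur.char
        result.append(text)
    return "\n".join(result)
```

Different tools use different heuristics and thresholds here, which is one reason they can return different text for the same page.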
slide-16
SLIDE 16

Text According To ABCpdf

[Figure: sample page with numbered markers showing the order in which ABCpdf returns the text]

slide-17
SLIDE 17

Text According To Xpdf

[Figure: sample page with numbered markers showing the order in which Xpdf returns the text]

slide-18
SLIDE 18

Physical Text According To Xpdf

[Figure: sample page with numbered markers showing the order in which Xpdf returns the text in physical layout mode]

slide-19
SLIDE 19

SWFTools

  • Open source (GPL)
  • PDF2SWF converts PDF files to SWF format

    – Based on Xpdf
    – Active mailing list
    – Author actively working on project
    – Use dev snapshots / git repo
    – Stable, but some kinks

  • http://www.swftools.org
slide-20
SLIDE 20

iTextSharp

  • Open source (5.0 – AGPL(!), 4.1 - LGPL)
  • Commercial license available
  • .NET port of iText
  • Very stable
  • Excellent for creating & modifying PDFs
  • No rendering capabilities
  • http://itextsharp.sourceforge.net/
  • http://itextpdf.com/
slide-21
SLIDE 21

Extracting Bookmarks

slide-22
SLIDE 22

Extracting Links

slide-23
SLIDE 23

Thank you!

For attending this session

  • Email: mark@improve.dk
  • Twitter: @improvedk
  • Blog: improve.dk