slide-1
SLIDE 1

Caradoc: a Pragmatic Approach to PDF Parsing and Validation

IEEE Security & Privacy LangSec Workshop 2016

Guillaume Endignoux, Olivier Levillain, Jean-Yves Migeon

École Polytechnique, France; EPFL, Switzerland; ANSSI, France

Thursday 26th May, 2016

1 / 29

slide-3
SLIDE 3

Portable Document Format?

A commonly used format, but with many security issues:
- 500+ reported vulnerabilities in Adobe Reader [1] (since 1999).
- Discrepancies between implementations.
- Syntax that facilitates polymorphism [2] (PDF+ZIP, PDF+JPEG, etc.).

In our work, we aim to verify PDF files starting at the syntactic level.

Two approaches to validating files:
- Blacklist: does not detect new malware.
- Whitelist: higher rejection rate, but accepted files are clean.

[1] http://www.cvedetails.com
[2] See for example PoC||GTFO

2 / 29

slide-4
SLIDE 4

Table of contents

1. Syntactic and structural problems: a quick tour
2. Caradoc: a pragmatic solution
3. Application to real-world files

3 / 29

slide-6
SLIDE 6

PDF syntax 101

A PDF document is made of objects:
- null
- booleans: true, false
- numbers: 123, -4.56
- strings: (foo)
- names: /bar
- arrays: [1 2 3], [(foo) /bar]
- dictionaries: << /key (value) /foo 123 >>
- references: 1 0 obj ... endobj and 1 0 R
- streams: << ... >> stream ... endstream
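The object kinds above map naturally onto a small in-memory data model. The following is a minimal Python sketch, not Caradoc's actual representation (Caradoc is written in OCaml); the names `Name`, `Ref`, `Stream`, and `references` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    """A PDF name such as /bar, stored without the leading slash."""
    value: str


@dataclass(frozen=True)
class Ref:
    """An indirect reference such as 1 0 R (object number, generation)."""
    num: int
    gen: int


@dataclass(frozen=True)
class Stream:
    """A stream: a dictionary followed by raw bytes."""
    dictionary: dict
    data: bytes


# In this sketch, null is None, booleans and numbers are Python values,
# strings are bytes, arrays are lists and dictionaries map Name to objects.


def references(obj):
    """Yield every Ref contained in obj, without following it."""
    if isinstance(obj, Ref):
        yield obj
    elif isinstance(obj, list):
        for item in obj:
            yield from references(item)
    elif isinstance(obj, dict):
        for value in obj.values():
            yield from references(value)
    elif isinstance(obj, Stream):
        yield from references(obj.dictionary)


# << /Type /Catalog /Pages 2 0 R >>
catalog = {Name("Type"): Name("Catalog"), Name("Pages"): Ref(2, 0)}
```

Collecting the references of each object in this way is what turns a flat list of objects into a graph.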

5 / 29

slide-7
SLIDE 7

Structure of a PDF file

Organization of a simple PDF file: Header, Object, Object, ..., Reference table, Trailer, End-of-file.

Example:

    %PDF-1.7
    1 0 obj
    << /Type /Catalog /Pages 2 0 R >>
    endobj
    2 0 obj
    << /Type /Pages /Count 1 /Kids [3 0 R] >>
    endobj
    xref
    0 6
    0000000000 65535 f
    0000000009 00000 n
    0000000060 00000 n
    ...
    trailer
    << /Size 6 /Root 1 0 R >>
    startxref
    428
    %%EOF
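To make the reference table concrete, here is a rough Python sketch that reads a classic `xref` section into a dictionary. It assumes whitespace-separated entries with a 10-digit offset, a 5-digit generation number, and an `n` (in use) or `f` (free) flag; the function name `parse_xref` is hypothetical and this is not Caradoc's parser.

```python
def parse_xref(section: bytes) -> dict:
    """Map object number -> (offset, generation, in_use) for one xref section."""
    tokens = section.split()
    if not tokens or tokens[0] != b"xref":
        raise ValueError("missing xref keyword")
    entries = {}
    i = 1
    # Each subsection starts with "first count", followed by count entries.
    while i < len(tokens) and tokens[i] != b"trailer":
        first, count = int(tokens[i]), int(tokens[i + 1])
        i += 2
        for num in range(first, first + count):
            offset, gen, kind = tokens[i:i + 3]
            if len(offset) != 10 or len(gen) != 5 or kind not in (b"n", b"f"):
                raise ValueError(f"malformed entry for object {num}")
            entries[num] = (int(offset), int(gen), kind == b"n")
            i += 3
    return entries


example = b"""xref
0 3
0000000000 65535 f
0000000009 00000 n
0000000060 00000 n
trailer
"""
table = parse_xref(example)
```

In this sketch, a truncated or malformed entry raises `ValueError` instead of being repaired.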

6 / 29

slide-8
SLIDE 8

Structure of a PDF file

More complex structures: incremental updates, object streams, linearization.

Incremental update. File layout: Header, Objects, ..., Table + trailer #1, End-of-file #1, Objects, ..., Table + trailer #2, End-of-file #2.

Original file:

    %PDF-1.7
    xref
    0 6
    0000000000 65535 f
    0000000009 00000 n
    0000000060 00000 n
    ...
    trailer
    << /Size 6 /Root 1 0 R >>
    startxref
    428
    %%EOF

Incremental update:

    xref
    0 3
    0000000002 65535 f
    0000000567 00001 n
    0000000000 00001 f
    6 1
    0000001234 00000 n
    trailer
    << /Size 7 /Root 1 1 R /Prev 428 >>
    startxref
    1347
    %%EOF
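A reader resolves an incrementally updated file by starting from the last `startxref` and following the `/Prev` entries of successive trailers. The sketch below illustrates that walk in Python under simplifying assumptions: classic (non-stream) cross-reference tables, and a crude regular-expression search instead of a real parser. The function name `xref_chain` and the tiny synthetic file are made up for this example.

```python
import re


def xref_chain(data: bytes) -> list:
    """Return xref offsets from the newest revision to the oldest."""
    found = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data)
    if not found:
        raise ValueError("no startxref keyword found")
    chain = []
    offset = int(found[-1])  # a reader starts from the last startxref
    while True:
        if offset in chain:
            raise ValueError("loop in the /Prev chain")
        chain.append(offset)
        start = data.find(b"trailer", offset)
        if start < 0:
            raise ValueError("no trailer after the xref offset")
        end = data.find(b"startxref", start)
        if end < 0:
            raise ValueError("no startxref after the trailer")
        prev = re.search(rb"/Prev\s+(\d+)", data[start:end])
        if prev is None:
            return chain  # oldest revision reached
        offset = int(prev.group(1))


# Synthetic two-revision file; "%PDF-1.7\n" is 9 bytes, so the first xref is at 9.
original = (b"%PDF-1.7\n"
            b"xref\n0 1\n0000000000 65535 f\n"
            b"trailer\n<< /Size 1 >>\nstartxref\n9\n%%EOF\n")
update = (b"xref\n0 1\n0000000000 65535 f\n"
          b"trailer\n<< /Size 1 /Prev 9 >>\nstartxref\n%d\n%%%%EOF\n" % len(original))
revisions = xref_chain(original + update)
```

A chain longer than one entry indicates at least one update (or a similar multi-table layout), which a strict validator can choose to reject outright.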

7 / 29

slide-9
SLIDE 9

Logical structure of a PDF file

Document of 17 pages (about 1000 objects).

8 / 29

slide-10
SLIDE 10

Graph organization

The graph of objects is organized into sub-structures, especially trees. Page tree.

[Figure: page tree. Node labels: Catalog, Root of the page tree, Node, Page 1, Page 2, Page 3, Page 4.]
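Checking that a page tree really is a tree amounts to a traversal that refuses to visit any object twice and verifies each parent link. The Python sketch below works on a simplified model where each node is a dictionary with optional `Parent` and `Kids` entries holding object numbers; the function name, the object numbers, and the sample tree are hypothetical and do not reproduce the figure.

```python
def check_page_tree(nodes: dict, root: int) -> list:
    """Return the leaf pages in document order, or raise ValueError."""
    seen = set()
    pages = []
    stack = [(root, None)]
    while stack:
        num, parent = stack.pop()
        if num in seen:
            raise ValueError(f"object {num} is reachable twice")
        seen.add(num)
        node = nodes[num]
        if node.get("Parent") != parent:
            raise ValueError(f"object {num} has an inconsistent /Parent")
        kids = node.get("Kids")
        if kids is None:
            pages.append(num)  # a leaf: an actual page
        else:
            # Push in reverse so that the first kid is visited first.
            stack.extend((kid, num) for kid in reversed(kids))
    return pages


# Root 2 has an intermediate node 3 (pages 4 and 5) and two direct pages 6 and 7.
tree = {
    2: {"Kids": [3, 6, 7]},
    3: {"Parent": 2, "Kids": [4, 5]},
    4: {"Parent": 3},
    5: {"Parent": 3},
    6: {"Parent": 2},
    7: {"Parent": 2},
}
page_order = check_page_tree(tree, 2)
```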

9 / 29

slide-11
SLIDE 11

Graph organization

The table of contents uses doubly-linked lists. Table of contents.

[Figure: table of contents. Node labels: Catalog, Outline root, three Chapter items, three Section items.]

10 / 29

slide-12
SLIDE 12

Problematic structure

An attacker may write an invalid structure. Invalid table of contents.

[Figure: invalid table of contents. Node labels: Catalog, Outline root, three Chapter items, three Section items.]
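One way to catch such a structure is to walk the sibling list of the outline and check both directions of the doubly-linked list. This is a simplified Python sketch, assuming each outline item is a dictionary with optional `Prev` and `Next` object numbers; `walk_siblings` and the sample items are illustrative names, not Caradoc's code.

```python
def walk_siblings(items: dict, first: int) -> list:
    """Follow /Next from the first item; raise ValueError on a cycle or broken /Prev."""
    order = []
    seen = set()
    prev, cur = None, first
    while cur is not None:
        if cur in seen:
            raise ValueError(f"cycle detected at object {cur}")
        if items[cur].get("Prev") != prev:
            raise ValueError(f"object {cur} has an inconsistent /Prev")
        seen.add(cur)
        order.append(cur)
        prev, cur = cur, items[cur].get("Next")
    return order


chapters = {
    10: {"Next": 11},
    11: {"Prev": 10, "Next": 12},
    12: {"Prev": 11},
}
chapter_order = walk_siblings(chapters, 10)
```

Without the `seen` set, a list whose last item points back to the first would make a naive reader loop forever, which is the denial-of-service risk this kind of check is meant to rule out.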

11 / 29

slide-13
SLIDE 13

Demonstration: two examples

Loop in the outline structure:
https://github.com/ANSSI-FR/caradoc/blob/master/test_files/negative/outlines/cycle.pdf

Polymorphic file:
https://github.com/ANSSI-FR/caradoc/blob/master/test_files/negative/polymorph/polymorph.pdf

These files were reported to software vendors.
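A simple whitelist-style defence against polymorphic files is to insist that the file starts with the PDF header, since lenient readers may accept a header that appears later in the file. The Python sketch below shows that idea only; the magic-number table and function names are assumptions for illustration, and the sketch makes no claim about how the test file above is built or how Caradoc handles it.

```python
# Leading byte signatures of a few formats (well-known magic numbers).
MAGICS = {
    b"%PDF-": "PDF",
    b"PK\x03\x04": "ZIP",
    b"\xff\xd8\xff": "JPEG",
}


def leading_format(data: bytes):
    """Return the format suggested by the first bytes, or None if unknown."""
    for magic, name in MAGICS.items():
        if data.startswith(magic):
            return name
    return None


def header_is_strict(data: bytes) -> bool:
    """Accept only files whose very first bytes are the PDF header."""
    return leading_format(data) == "PDF"
```

For example, `header_is_strict(b"%PDF-1.7\n")` holds, while a file that begins with JPEG bytes and contains `%PDF-1.7` further on is refused.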

12 / 29

slide-14
SLIDE 14

Demonstration

These problems may lead to several attacks:
- Attacks on the structure (denial of service).
- Evasion techniques (attacks taking advantage of implementation discrepancies).

13 / 29

slide-15
SLIDE 15

Table of contents

1. Syntactic and structural problems: a quick tour
2. Caradoc: a pragmatic solution
3. Application to real-world files

14 / 29

slide-16
SLIDE 16

Solution proposals

Caradoc verifies a document at three levels:
- File syntax.
- Object consistency (type checking).
- Higher-level verifications (graph, etc.).

15 / 29

slide-18
SLIDE 18

Syntax restriction

At the syntax level, guarantee that objects can be extracted without ambiguity:
- Grammar formalization [3] (BNF).
- Structure restrictions (no updates, no linearization, etc.).
- Systematic rejection of “corrupted” files.

By contrast, the specification allows readers to repair such files:

“When a conforming reader reads a PDF file with a damaged or missing cross-reference table, it may attempt to rebuild the table by scanning all the objects in the file.” (ISO 32000-1:2008, annex C.2)

[3] https://github.com/ANSSI-FR/caradoc/tree/master/doc/grammar

16 / 29
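Systematic rejection can be illustrated with a check that every in-use cross-reference entry points exactly at the matching `N G obj` keyword, raising an error rather than rebuilding the table. This is a Python sketch under assumptions: the table format `{number: (offset, generation, in_use)}` and the name `check_xref_offsets` are hypothetical, and Caradoc's strict parser is not claimed to work this way.

```python
def check_xref_offsets(data: bytes, table: dict) -> None:
    """Raise ValueError if an in-use entry does not point at its object."""
    for num, (offset, gen, in_use) in sorted(table.items()):
        if not in_use:
            continue  # free entries carry no file offset to verify
        expected = b"%d %d obj" % (num, gen)
        if not data.startswith(expected, offset):
            raise ValueError(f"object {num} {gen} not found at offset {offset}")


# "%PDF-1.7\n" is 9 bytes long, so object 1 starts at offset 9.
sample = b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\n"
check_xref_offsets(sample, {0: (0, 65535, False), 1: (9, 0, True)})
```

The design choice is to fail on the first mismatch: a file that needs repair is simply not accepted.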

slide-19
SLIDE 19

Type checking

At object level: guarantee semantic consistency. For this purpose: type checking algorithm.

17 / 29

slide-20
SLIDE 20

Type checking

    trailer
    << /Size 7 /Root 1 0 R /Info 6 0 R >>

    1 0 obj
    << /Type /Catalog /Pages 2 0 R >>
    endobj

    2 0 obj
    << /Type /Pages /Count 1 /Kids [3 0 R] >>
    endobj

    3 0 obj
    << /Type /Page /MediaBox [0 0 700 200] /Parent 2 0 R /Contents 4 0 R
       /Resources << /Font << /F1 5 0 R >> >> >>
    endobj

    4 0 obj
    << /Length 35 >>
    stream
    BT /F1 100 Tf (Hello world !) Tj ET
    endstream
    endobj

    5 0 obj
    << /Name /F1 /BaseFont /Helvetica /Type /Font /Subtype /Type1 >>
    endobj

    6 0 obj
    << /Author (G. E.) >>
    endobj

Example on a Hello World file.

18 / 29

slide-21
SLIDE 21

Type checking


Constraint propagation.
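Constraint propagation can be pictured as a worklist algorithm: the root object gets a known type, and each typed object imposes an expected type on the objects it references. The Python sketch below is a heavily simplified illustration, not Caradoc's algorithm: the `SCHEMA` table is a made-up fragment, references are plain object numbers, nested dictionaries such as /Resources are ignored, and every kid of a page tree node is assumed to be a leaf page (in real files a kid may also be another page tree node).

```python
from collections import deque

# Hypothetical schema fragment: type -> {key: expected type of the referenced object}.
SCHEMA = {
    "catalog": {"Pages": "pages"},
    "pages": {"Kids": "page"},
    "page": {"Parent": "pages", "Contents": "content stream"},
    "content stream": {},
}


def infer_types(objects: dict, root: int, root_type: str = "catalog") -> dict:
    """Propagate expected types from the root; raise TypeError on a conflict."""
    types = {root: root_type}
    queue = deque([root])
    while queue:
        num = queue.popleft()
        for key, expected in SCHEMA[types[num]].items():
            target = objects[num].get(key)
            if target is None:
                continue
            targets = target if isinstance(target, list) else [target]
            for t in targets:
                known = types.get(t)
                if known is None:
                    types[t] = expected
                    queue.append(t)
                elif known != expected:
                    raise TypeError(f"object {t}: {known} conflicts with {expected}")
    return types


# References of the Hello World file, reduced to the keys covered by SCHEMA.
hello = {1: {"Pages": 2}, 2: {"Kids": [3]}, 3: {"Parent": 2, "Contents": 4}, 4: {}}
hello_types = infer_types(hello, 1)
```

Here object 2 is reached twice, once from the catalog and once through /Parent of the page; both paths agree on the type `pages`, so no error is raised.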

19 / 29

slide-26
SLIDE 26

Type checking

Types of a 17-page document.

[Figure legend: action, page, destination, annotation, resource, outline, content stream, font, name tree, other.]

20 / 29

slide-27
SLIDE 27

More complex verifications

At a higher level:
- Verification of tree structures (page tree, outlines, etc.).
- Other verifications can easily be integrated in the future (fonts, images, existing analyses, etc.).

21 / 29

slide-28
SLIDE 28

Table of contents

1. Syntactic and structural problems: a quick tour
2. Caradoc: a pragmatic solution
3. Application to real-world files

22 / 29

slide-29
SLIDE 29

Implementation

Implementation in OCaml from the PDF specification [4].

Validation workflow.

[Figure: validation workflow. Box labels: PDF, strict parser, relaxed parser, objects, graph of references, extraction of specific objects, type checking, list of types, graph checking, other checks, to develop, no error detected, normalization.]

[4] https://www.adobe.com/devnet/pdf/pdf_reference.html

23 / 29

slide-30
SLIDE 30

Real-world files

10K files collected from random queries on a web search engine. Some files are directly accepted. Direct validation.

10000 PDF files -> strict parser -> 1465 files parsed -> type checking -> 536 files type checked -> graph checking -> 536 files with no error found
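As a quick worked calculation, the counts above can be turned into acceptance rates relative to the whole corpus. The short Python sketch below only restates the figures from the slide; the helper name `rate` is arbitrary.

```python
def rate(passed: int, total: int) -> float:
    """Percentage of the corpus that passed a given stage."""
    return 100 * passed / total


TOTAL = 10_000
STAGES = [("strict parser", 1465), ("type checking", 536), ("graph checking", 536)]

for name, passed in STAGES:
    print(f"{name}: {passed} files ({rate(passed, TOTAL):.2f}% of the corpus)")
```

So about 14.65% of the files get through the strict parser, and 5.36% pass all three stages without any normalization.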

24 / 29

slide-31
SLIDE 31

Normalization

Many files do not pass the first stage... But they can be normalized beforehand. The relaxed parser supports common structures: incremental updates, object streams, etc. Normalization.

10000 PDF files -> relaxed parser -> 8993 files parsed -> cleaning objects -> 8993 files normalized

Some files were not normalized: encryption, unrecoverable syntax errors, etc.

25 / 29

slide-32
SLIDE 32

Normalization

Validation after normalization.

8993 normalized files -> type checking -> 1429 files type checked, 1391 files with a type error -> graph checking -> 1427 files with no error found

Our type checker detected typos:
- /Blackls1 instead of /BlackIs1,
- /XObjcect instead of /XObject.

We identified incorrect tree structures in the wild.
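Typos of this kind surface as soon as dictionary keys are compared against the set of keys allowed for a given type. The Python sketch below illustrates the idea with the standard `difflib` module, which also suggests the closest known key; the key list is the CCITTFaxDecode parameter set from the PDF specification, while the function name and report format are assumptions made for this example and not Caradoc's output.

```python
import difflib

# Decode parameters defined for the CCITTFaxDecode filter.
CCITT_KEYS = sorted([
    "K", "EndOfLine", "EncodedByteAlign", "Columns", "Rows",
    "EndOfBlock", "BlackIs1", "DamagedRowsBeforeError",
])


def unknown_keys(dictionary: dict, known: list) -> dict:
    """Map each unexpected key to a list holding its closest known key, if any."""
    report = {}
    for key in dictionary:
        if key not in known:
            report[key] = difflib.get_close_matches(key, known, n=1)
    return report


report = unknown_keys({"Blackls1": True, "Columns": 1728}, CCITT_KEYS)
```

A strict checker would stop at the unknown key; the suggestion is only a convenience for whoever has to fix the producing software.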

26 / 29

slide-33
SLIDE 33

Future work

What remains to be done:
- Complete the set of types.
- Check compression filters.
- Check graphic content.
- Check fonts, images, etc.

27 / 29

slide-34
SLIDE 34

Conclusion

Summary of our contributions:
- We identified novel issues in PDF parsers.
- We proposed and formalized a simplified syntax for PDF.
- We implemented Caradoc to parse and validate PDF files.

Project page: https://github.com/ANSSI-FR/caradoc

28 / 29

slide-35
SLIDE 35

Questions?

https://xkcd.com/1301/

29 / 29