Toward Automated Grammar Extraction via Semantic Labeling of Parser Implementations. Carson Harmon, Bradford Larsen, Evan Sultanik. LangSec Workshop at IEEE Security & Privacy, May 21, 2020.


slide-1
SLIDE 1

Toward Automated Grammar Extraction via Semantic Labeling of Parser Implementations

Evan Sultanik Carson Harmon Bradford Larsen

LangSec Workshop at IEEE Security & Privacy, May 21, 2020

slide-2
SLIDE 2

The Problem


slide-9
SLIDE 9

High Level Goals

Create a semantic map of the functions in a parser, which will improve grammar extraction.

parser_function1 ↳ byte 0, 10, 50    Object Stream
parser_function2 ↳ byte 10, 74       Xref
parser_function3 ↳ byte 20           JFIF

Ultimate Goal: Automatically extract a minimal grammar specifying the files accepted by a parser.

Hypothesis: The majority of the potential for maliciousness and schizophrenia (files that different parsers interpret differently) will exist in the symmetric difference of the grammars accepted by a format’s parser implementations.
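
The hypothesis can be illustrated on a toy scale. The sketch below assumes two hypothetical parsers, `strict_accepts` and `lenient_accepts`, modeled as simple predicates over byte strings; neither function nor the sample corpus comes from the talk. The inputs accepted by exactly one of them form the symmetric difference, which is where the two implementations disagree.

```python
def strict_accepts(data: bytes) -> bool:
    """Hypothetical strict parser: the magic must be at offset 0."""
    return data.startswith(b"%PDF-")


def lenient_accepts(data: bytes) -> bool:
    """Hypothetical lenient parser: the magic may appear in the first 1024 bytes."""
    return b"%PDF-" in data[:1024]


def symmetric_difference(corpus, accepts_a, accepts_b):
    """Return the inputs accepted by exactly one of the two parsers."""
    accepted_a = {d for d in corpus if accepts_a(d)}
    accepted_b = {d for d in corpus if accepts_b(d)}
    return accepted_a ^ accepted_b


# Invented sample inputs, for illustration only.
corpus = [b"%PDF-1.7 ...", b"junk%PDF-1.7 ...", b"not a pdf"]
disagreements = symmetric_difference(corpus, strict_accepts, lenient_accepts)
```

Here only the input with leading junk lands in `disagreements`: the lenient parser accepts it and the strict one rejects it.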

slide-12
SLIDE 12

Approach

Semantic Ground Truth
  • Label the Type Composition Hierarchy of the input files

Instrumentation
  • Use universal taint analysis to track all input bytes through the execution of a parser

Associative Labeling
  • Merge the results of the first two steps to produce a labeling of which functions operate on which types
  • Detect backtracking
  • Detect error handling
  • Differential analysis

Grammar Extraction
  • (future work)

slide-13
SLIDE 13

Prior Work: Semantic Labeling

Polyglot-Aware File Identification
  • iNES ROM
  • PDF
  • ZIP

Resilient Parsing
  • Modify parsers for best effort
  • Instrument to track input byte offsets

Syntax Tree
  • Label regions of the input
  • Produce ground truth

iNES [0x0→0x12220]
↳ Magic [0x0→0x3]
  Header [0x4→0xF]
  ⋮
  PRG [0xC210→0x1020F]
  CHR [0x10210→0x12220]
PDF [0x10→0x2EF72F]
↳ Magic [0x10→0x1E]
  Object 1.0 [0x1F→0x12221]
  ↳ Dictionary [0x2A→0x3E]
    Stream [0x3F→0x12219]
    ↳ JFIF Image [0x46→0x1220F]
      ↳ JPEG Segment […]
        ↳ Magic […]
          Marker […]
          ⋮
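
A labeled hierarchy like the one above can be held as a tree of byte ranges. The following is a minimal sketch, not PolyFile's actual data model: the `Region` class and its `innermost` method are hypothetical names, and ranges are assumed to be inclusive on both ends (which is consistent with a four-byte magic shown as 0x0→0x3). Entries elided on the slide are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Region:
    """A labeled byte range (inclusive on both ends) with nested children."""
    name: str
    start: int
    end: int
    children: List["Region"] = field(default_factory=list)

    def innermost(self, offset: int) -> Optional[List[str]]:
        """Return the label path down to the deepest region containing offset."""
        if not (self.start <= offset <= self.end):
            return None
        for child in self.children:
            path = child.innermost(offset)
            if path is not None:
                return [self.name] + path
        return [self.name]


# Offsets copied from the PDF example on the slide; elided entries are omitted.
pdf = Region("PDF", 0x10, 0x2EF72F, [
    Region("Magic", 0x10, 0x1E),
    Region("Object 1.0", 0x1F, 0x12221, [
        Region("Dictionary", 0x2A, 0x3E),
        Region("Stream", 0x3F, 0x12219, [
            Region("JFIF Image", 0x46, 0x1220F),
        ]),
    ]),
])

path = pdf.innermost(0x50)
```

For offset 0x50, `path` walks from the PDF down through the object and its stream to the embedded JFIF image, which is the kind of type context the later labeling steps need for each input byte.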

slide-14
SLIDE 14

PolyFile Ground Truth

slide-16
SLIDE 16

Prior Work: Parser Instrumentation

LLVM Instrumentation
  • Operate on LLVM/IR
  • Can work with all open source parsers
  • Eventually support closed-source binaries by lifting to LLVM (e.g., with McSema or Remill)

Taint Tracking
  • Shadow memory inspired by the Data Flow Sanitizer (dfsan)
  • Negligible CPU overhead
  • O(n) memory overhead, where n is the number of instructions executed by the parser
  • Novel data structure for efficiently storing taint labels
      • dfsan status quo: Θ(1) lookups, Θ(n²) storage
      • PolyTracker: O(log n) lookups, O(n) storage
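
The slides do not describe the label data structure itself. As a rough intuition for how storage can stay linear in the number of taint operations, the sketch below stores each combined label as a node that records only its two parent labels, and recovers the contributing input offsets by walking back to the source labels. The `TaintForest` class is hypothetical and illustrative only; it is not PolyTracker's implementation and makes no attempt to reproduce the O(log n) lookup bound.

```python
class TaintForest:
    """Toy taint-label store: each union label records only its two parents."""

    def __init__(self):
        # parents[label] is None for a source byte, else a (left, right) pair.
        self.parents = []
        self.source_offset = {}
        self._unions = {}

    def new_source(self, offset: int) -> int:
        """Create a label for one input byte."""
        label = len(self.parents)
        self.parents.append(None)
        self.source_offset[label] = offset
        return label

    def union(self, a: int, b: int) -> int:
        """Create (or reuse) a label meaning 'tainted by both a and b'."""
        if a == b:
            return a
        key = (min(a, b), max(a, b))
        if key not in self._unions:
            self._unions[key] = len(self.parents)
            self.parents.append(key)
        return self._unions[key]

    def offsets(self, label: int) -> set:
        """Resolve a label to the set of input byte offsets it depends on."""
        seen, stack, result = set(), [label], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if self.parents[current] is None:
                result.add(self.source_offset[current])
            else:
                stack.extend(self.parents[current])
        return result


forest = TaintForest()
b0, b1, b2 = (forest.new_source(i) for i in range(3))
combined = forest.union(forest.union(b0, b1), b2)
```

Each call to `union` adds at most one node, so the store grows with the number of distinct unions rather than with the square of the number of labels.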

slide-19
SLIDE 19

PolyTracker Instrumentation

{
  "ensure_solid_xref": [2276587, 2276588],
  "fmt_obj": [
    2465223, 2465224, 2465225, 2465226, 2465227, 2465228,
    2465240, 2465241, 2465242, 2465243, 2465244, 2465245, 2465246,
    2465258, 2465259, 2465260, 2465261, 2465262
  ]
}

iNES [0x0→0x12220]
↳ Magic [0x0→0x3]
  Header [0x4→0xF]
  ⋮
  PRG [0xC210→0x1020F]
  CHR [0x10210→0x12220]
PDF [0x10→0x2EF72F]
↳ Magic [0x10→0x1E]
  Object 1.0 [0x1F→0x12221]
  ↳ Dictionary [0x2A→0x3E]
    Stream [0x3F→0x12219]
    ↳ JFIF Image [0x46→0x1220F]
      ↳ JPEG Segment […]
        ↳ Magic […]
          Marker […]
          ⋮

fmt_obj: Object, Dictionary
ensure_solid_xref: Trailer, XRef
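
The merge of the two outputs can be sketched as a join between the per-function byte offsets and the labeled byte ranges. In the example below, the JSON has the same shape as the PolyTracker output shown on the slide (with the offset lists shortened), while the `regions` table and the `label_functions` helper are invented for illustration. This is a brute-force version and does not attempt the running time reported later in the talk.

```python
import json
from collections import defaultdict

# Same shape as the function-to-offsets output on the slide, shortened.
polytracker_json = (
    '{"ensure_solid_xref": [2276587, 2276588], "fmt_obj": [2465223, 2465224]}'
)

# Hypothetical labeled regions: (type name, first byte, last byte), inclusive.
# These ranges are made up for the example; they are not from the talk.
regions = [
    ("Trailer", 2276500, 2276600),
    ("XRef", 2276580, 2276590),
    ("Object", 2465000, 2466000),
    ("Dictionary", 2465200, 2465300),
]


def label_functions(function_offsets, regions):
    """Map each function name to the set of types whose byte range it touched."""
    labels = defaultdict(set)
    for function, offsets in function_offsets.items():
        for offset in offsets:
            for name, first, last in regions:
                if first <= offset <= last:
                    labels[function].add(name)
    return dict(labels)


function_types = label_functions(json.loads(polytracker_json), regions)
```

With these made-up ranges, `fmt_obj` is labeled with Object and Dictionary and `ensure_solid_xref` with Trailer and XRef, mirroring the association drawn on the slide.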

slide-20
SLIDE 20

The Challenge of Associative Labeling

How can we associate types in the file format with the set of functions most specialized in operating on each type?

Observations

  • Raw mapping is not necessarily injective: there will rarely be a perfect bijection between the types and functions
  • A parser’s functional implementation will rarely be isomorphic to the type hierarchy or syntax tree of the input file

[Figure: Parser 1 has a single monolithic function; Parser 2 has several specialized functions]

slide-21
SLIDE 21

Information Entropy

  • For each type, collect the functions that operate on that type
  • Calculate P(t, f) = the probability that a specific type occurs within a function
  • Calculate the “genericism” of a function G : F → ℝ
  • Use G to sort the functions associated with a type, discarding all but the smallest (most specialized) standard deviation

Idea: Use information entropy to measure function specialization
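
The slide does not give a formula for G, so the sketch below makes two explicit assumptions: G(f) is taken to be the Shannon entropy of the distribution of types among the bytes a function touches, and "the smallest standard deviation" is read as keeping the functions whose G lies within one standard deviation of the minimum for that type. The function names and byte counts are placeholders.

```python
import math
from statistics import pstdev


def genericism(type_counts):
    """Shannon entropy (in bits) of the type distribution seen by one function."""
    total = sum(type_counts.values())
    probabilities = [c / total for c in type_counts.values() if c > 0]
    return sum(-p * math.log2(p) for p in probabilities)


def most_specialized(counts, type_name):
    """Functions touching type_name with G within one std. dev. of the minimum."""
    scores = {
        f: genericism(tc) for f, tc in counts.items() if tc.get(type_name, 0) > 0
    }
    if not scores:
        return []
    values = list(scores.values())
    cutoff = min(values) + pstdev(values)
    return sorted((f for f, g in scores.items() if g <= cutoff), key=scores.get)


# Hypothetical byte counts per (function, type); all names are placeholders.
counts = {
    "parse_xref": {"XRef": 40},
    "read_token": {"XRef": 10, "Dictionary": 10, "Comment": 10, "Trailer": 10},
    "skip_space": {"XRef": 5, "Dictionary": 5},
}

best = most_specialized(counts, "XRef")
```

A function that only ever touches one type has entropy 0 and sorts first, while a general-purpose tokenizer that touches many types equally has high entropy and is discarded.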

slide-26
SLIDE 26

Problem: Code is Too Monolithic

The parser has a single function responsible for parsing multiple types

  • Calculate the dominator tree of the syntax tree
  • Remove a function from the matching for a type if there exists an ancestor of the type in the dominator tree that maps to the same function

parse_pdf_dictionary
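
The pruning rule for monolithic code can be sketched as follows. When the syntax tree really is a tree, every node's dominators are exactly its ancestors, so this sketch walks parent links directly instead of running a general dominator algorithm. The type tree, the raw matching, and the placement of `lex_name` are hypothetical; only the name `parse_pdf_dictionary` is taken from the slide.

```python
def ancestors(type_name, parent):
    """Yield the proper ancestors of a type, nearest first."""
    current = parent.get(type_name)
    while current is not None:
        yield current
        current = parent.get(current)


def prune_monolithic(matching, parent):
    """Drop a function from a type if an ancestor type maps to the same function."""
    pruned = {}
    for type_name, functions in matching.items():
        inherited = set()
        for ancestor in ancestors(type_name, parent):
            inherited |= matching.get(ancestor, set())
        pruned[type_name] = functions - inherited
    return pruned


# Hypothetical syntax tree (child -> parent) and raw matching.
parent = {
    "Key Value Pair": "PDF Dictionary",
    "Key": "Key Value Pair",
    "Value": "Key Value Pair",
}
matching = {
    "PDF Dictionary": {"parse_pdf_dictionary"},
    "Key Value Pair": {"parse_pdf_dictionary"},
    "Key": {"parse_pdf_dictionary", "lex_name"},
    "Value": {"parse_pdf_dictionary"},
}

pruned = prune_monolithic(matching, parent)
```

After pruning, the monolithic function stays attached only to the outermost type it handles, and the more specific function remains on the child type.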

slide-28
SLIDE 28

Problem: Code is Too Cohesive

The parser has many, tightly coupled functions collectively responsible for parsing a single type

  • If those functions are always called sequentially, then we ideally only want the single function that initiates the sequence
  • Calculate the dominator tree of the runtime control flow graph
  • For each type, remove any functions in the matching that have an ancestor in the dominator tree that is also in the matching

[Figure: Parser 2 has several specialized functions]
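
The pruning rule for cohesive code can be sketched with a textbook iterative dominator computation. The sketch assumes the runtime control flow is summarized as a function-level call graph; the graph edges and the `main` entry node below are invented for illustration, although the three xref function names do appear in the MuPDF example in the talk.

```python
def dominators(graph, root):
    """Return {node: set of nodes that dominate it} for nodes reachable from root."""
    # Collect the reachable nodes with a depth-first search.
    reachable, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in reachable:
            reachable.add(node)
            stack.extend(graph.get(node, ()))
    predecessors = {n: set() for n in reachable}
    for node in reachable:
        for successor in graph.get(node, ()):
            predecessors[successor].add(node)
    # Classic iterative data-flow formulation of dominance.
    dom = {n: set(reachable) for n in reachable}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for node in reachable - {root}:
            new = set.intersection(*(dom[p] for p in predecessors[node])) | {node}
            if new != dom[node]:
                dom[node] = new
                changed = True
    return dom


def prune_cohesive(functions, dom):
    """Keep only the functions that no other matched function dominates."""
    return {f for f in functions if not (dom[f] - {f}) & functions}


# Hypothetical call graph; the edges are assumptions made for this example.
call_graph = {
    "main": ["pdf_load_xref", "fz_read"],
    "pdf_load_xref": ["pdf_read_start_xref", "pdf_prime_xref_index"],
    "pdf_read_start_xref": ["fz_read"],
}
xref_functions = {"pdf_load_xref", "pdf_read_start_xref", "pdf_prime_xref_index"}
kept = prune_cohesive(xref_functions, dominators(call_graph, "main"))
```

Under these assumed edges, the two helper functions are dominated by `pdf_load_xref`, so only the function that initiates the sequence is kept for the type.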

slide-31
SLIDE 31

PDF XRef

  • pdf_load_xref
  • pdf_read_start_xref
  • pdf_prime_xref_index

slide-33
SLIDE 33

Results

  • Runs in O(|F| n log |T|) time
      • F = # functions in the parser
      • T = # types (or production rules) in the grammar
      • n = # bytes in the input file
  • Mappings for various parsers and file formats
  • Implementation in the polymerge application distributed with PolyFile:
      • pip3 install polyfile
slide-34
SLIDE 34

Results: MuPDF

(Generated from a single parse of a single PDF)

[Figure: graph matching PDF types to MuPDF functions]

Types: PDF Object, PDF Dictionary, PDF Object Version, PDF Object Content, PDF Object ID, PDF, PDF Trailer, PDF Comment, Key Value Pair, Key, Value, PDF Start XRef, PDF XRef

Functions: fz_isprint, lex_string, next_flated, pdf_process_stream, twoway_memmem, fz_clamp, fz_strtof, pdf_load_version, pdf_process_keyword, lex_name, lex_white, pdf_read_start_xref, pdf_read_xref_sections, FT_Stream_ReadAt, pdf_show_char, pdf_token_from_keyword, FT_Stream_Seek, cff_index_access_element, cff_slot_load, fz_bound_glyph, fz_bound_text, fz_read, pdf_read_old_xref
slide-35
SLIDE 35

Results: MuPDF

(Generated from a single parse of a single PDF)

[Figure: detail of the MuPDF graph]

Types: PDF, PDF Trailer, PDF Comment, PDF Start XRef, PDF XRef

Functions: twoway_memmem, fz_clamp, fz_strtof, pdf_load_version, pdf_read_start_xref, pdf_read_xref_sections, fz_read, pdf_read_old_xref

slide-36
SLIDE 36

Results: QPDF

(Generated from a single parse of a single PDF)

[Figure: graph matching PDF types to QPDF functions. In the names below, std::string abbreviates std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >.]

Types: PDF Object, PDF Dictionary, PDF Object Version, PDF Object Content, PDF Object ID, PDF, PDF Trailer, PDF Comment, Key Value Pair, Key, Value, PDF Start XRef, PDF XRef

Functions:
  • QPDF::pipeStreamData(PointerHolder<QPDF::EncryptionParameters>, PointerHolder<InputSource>, QPDF&, int, int, long long, unsigned long, QPDFObjectHandle, bool, Pipeline*, bool, bool)
  • QPDFObjectHandle::getUIntValue()
  • QPDF::findHeader()
  • QUtil::is_digit(char)
  • Pl_Count::write(unsigned char*, unsigned long)
  • QIntC::IntConverter<unsigned long, long long, false, true>::convert(unsigned long const&)
  • QPDFObjectHandle::getIntValue()
  • QPDFWriter::unparseObject(QPDFObjectHandle, int, int, unsigned long, bool)
  • QPDF_String::unparse[abi:cxx11](bool)
  • QUtil::int_to_string_base[abi:cxx11](long long, int, int)
  • QUtil::uint_to_string_base[abi:cxx11](unsigned long long, int, int)
  • QPDF_Name::normalizeName(std::string const&)
  • QPDF::parse(char const*)
  • QIntC::IntConverter<long long, int, true, true>::convert(long long const&)
  • QPDFXRefEntry::getType() const
  • QPDF_Stream::setStreamDescription()
  • Buffer::Buffer(unsigned long)
  • ContentNormalizer::handleToken(QPDFTokenizer::Token const&)
  • Pl_Buffer::getBuffer()
  • Pl_Buffer::write(unsigned char*, unsigned long)
  • Pl_Flate::handleData(unsigned char*, unsigned long, int)
  • QPDF::readObject(PointerHolder<InputSource>, std::string const&, int, int, bool)
  • QPDF_Dictionary::getKeys[abi:cxx11]()
  • QPDF_Integer::~QPDF_Integer()
  • std::_Rb_tree<int, std::pair<int const, long long>, std::_Select1st<std::pair<int const, long long> >, std::less<int>, std::allocator<std::pair<int const, long long> > >::_M_erase(std::_Rb_tree_node<std::pair<int const, long long> >*)
  • std::_Rb_tree<std::string, std::string, std::_Identity<std::string>, std::less<std::string>, std::allocator<std::string> >::_M_erase(std::_Rb_tree_node<std::string>*)
  • std::_Rb_tree<std::string, std::string, std::_Identity<std::string>, std::less<std::string>, std::allocator<std::string> >::_M_get_insert_unique_pos(std::string const&)
  • std::_Rb_tree<std::string, std::pair<std::string const, QPDFObjectHandle>, std::_Select1st<std::pair<std::string const, QPDFObjectHandle> >, std::less<std::string>, std::allocator<std::pair<std::string const, QPDFObjectHandle> > >::_M_erase(std::_Rb_tree_node<std::pair<std::string const, QPDFObjectHandle> >*)
  • std::_Rb_tree<std::string, std::pair<std::string const, QPDFObjectHandle>, std::_Select1st<std::pair<std::string const, QPDFObjectHandle> >, std::less<std::string>, std::allocator<std::pair<std::string const, QPDFObjectHandle> > >::equal_range(std::string const&)
  • std::_Rb_tree_node<std::pair<std::string const, QPDFObjectHandle> >* std::_Rb_tree<std::string, std::pair<std::string const, QPDFObjectHandle>, std::_Select1st<std::pair<std::string const, QPDFObjectHandle> >, std::less<std::string>, std::allocator<std::pair<std::string const, QPDFObjectHandle> > >::_M_copy<std::_Rb_tree<std::string, std::pair<std::string const, QPDFObjectHandle>, std::_Select1st<std::pair<std::string const, QPDFObjectHandle> >, std::less<std::string>, std::allocator<std::pair<std::string const, QPDFObjectHandle> > >::_Alloc_node>(std::_Rb_tree_node<std::pair<std::string const, QPDFObjectHandle> > const*, std::_Rb_tree_node_base*, std::_Rb_tree<std::string, std::pair<std::string const, QPDFObjectHandle>, std::_Select1st<std::pair<std::string const, QPDFObjectHandle> >, std::less<std::string>, std::allocator<std::pair<std::string const, QPDFObjectHandle> > >::_Alloc_node&)
  • std::map<std::string, QPDFObjectHandle, std::less<std::string>, std::allocator<std::pair<std::string const, QPDFObjectHandle> > >::operator[](std::string const&)
  • void __gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<std::string const, QPDFObjectHandle> > >::destroy<std::pair<std::string const, QPDFObjectHandle> >(std::pair<std::string const, QPDFObjectHandle>*)
  • void std::_Rb_tree<std::string, std::pair<std::string const, QPDFObjectHandle>, std::_Select1st<std::pair<std::string const, QPDFObjectHandle> >, std::less<std::string>, std::allocator<std::pair<std::string const, QPDFObjectHandle> > >::_M_construct_node<std::pair<std::string const, QPDFObjectHandle> const&>(std::_Rb_tree_node<std::pair<std::string const, QPDFObjectHandle> >*, std::pair<std::string const, QPDFObjectHandle> const&)
  • QPDF::read_xrefTable(long long)
slide-39
SLIDE 39

Results: QPDF

(Generated from a single parse of a single PDF)

[Figure: detail of the QPDF graph]

Types: PDF Object Version, PDF Object ID, PDF, PDF Trailer, PDF Comment, PDF Start XRef, PDF XRef

Functions:
  • QPDF::findHeader()
  • QUtil::is_digit(char)
  • QIntC::IntConverter<unsigned long, long long, false, true>::convert(unsigned long const&)
  • QUtil::uint_to_string_base[abi:cxx11](unsigned long long, int, int)
  • QPDF_Name::normalizeName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  • QPDF::parse(char const*)
  • QIntC::IntConverter<long long, int, true, true>::convert(long long const&)
  • QPDFXRefEntry::getType() const
  • QPDF_Stream::setStreamDescription()
  • QPDF::read_xrefTable(long long)

slide-40
SLIDE 40

Results: libjpeg

(Generated from a single parse of a single JPEG)

[Figure: graph matching JPEG types to libjpeg functions]

Types: JPEG File, JPEG, segments, segment, magic, marker, length, image_data, data, segment_app0, segment_sof0, segment_sos, num_components, components, version_major, version_minor, density_units, density_x, density_y, thumbnail, thumbnail_x, thumbnail_y, image_width, image_height, bits_per_sample, start_spectral_selection, appr_bit_pos, end_spectral, component, id, quantization_table_id, sampling_factors, huffman_table

Functions: read_markers, consume_markers, jinit_d_coef_controller, jpeg_make_d_derived_tbl, start_iMCU_row, start_pass, start_pass_main, first_marker, next_marker, get_interesting_appn, examine_app0, jpeg_core_output_dimensions, initial_setup, master_selection, start_pass_huff_decoder, default_decompress_parms, latch_quant_tables
slide-41
SLIDE 41

Results: libjpeg

(Generated from a single parse of a single JPEG)

[Figure: detail of the libjpeg graph]

Types: segment_sof0, image_width, image_height, bits_per_sample

Functions: start_pass, jpeg_core_output_dimensions, initial_setup, master_selection

Partially visible: …_selection

slide-42
SLIDE 42

Next Step: Grammar Extraction

  • AUTOGRAM: (Zeller, et al., 2016) Uses data-flow analysis
      • No type information other than what can be inferred from native types in the code
      • Can be improved with our type mapping from the associative labeling
  • Mimid: (Zeller, et al., 2019) Uses static control-flow analysis
      • Can also be improved by our type mapping
      • Needs to infer indirect control-flow that we can definitively observe with our runtime instrumentation
  • We can observe control-flow events like backtracking and infer types at the same time

slide-43
SLIDE 43

Future Directions

  • Differential Analysis of Parsers
      • Use graph matching to map the functions of one parser to another
      • Automatically identify feature differences
  • Differential Analysis over a Corpus of Files
      • Not all files will exercise all functionality of a parser
      • Combine the output of multiple files (including intentionally malformed files) to maximize coverage
  • Type Hierarchy Learning
      • If there is no ground truth, learn the type hierarchy from the data structures of the parser
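
For the corpus direction, one simple way to combine per-file results is to take the union of the type-to-function mappings produced from each file. This is only an assumed combination strategy, since the talk leaves the method open. The `merge_mappings` helper and the two per-file mappings below are hypothetical, although the function names appear in the MuPDF results.

```python
from collections import defaultdict


def merge_mappings(per_file_mappings):
    """Union the type -> functions mappings produced from several input files."""
    merged = defaultdict(set)
    for mapping in per_file_mappings:
        for type_name, functions in mapping.items():
            merged[type_name] |= set(functions)
    return dict(merged)


# Hypothetical outputs from two different input files.
file_a = {"PDF XRef": {"pdf_read_xref_sections"}}
file_b = {"PDF XRef": {"pdf_read_old_xref"}, "PDF Comment": {"lex_white"}}

merged = merge_mappings([file_a, file_b])
```

A file that exercises a code path the others miss contributes new functions for a type, so coverage of the merged mapping can only grow as files are added.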

slide-44
SLIDE 44

Conclusions

  • Introduced a new technique for semantically labeling types operated on by parsers
  • Works with a single run of a parser on a single file
  • Next step: integrate with grammar extraction
  • Tools are currently available:
      • https://github.com/trailofbits/polyfile
      • https://github.com/trailofbits/polytracker
slide-45
SLIDE 45

Contact Info

Et pl. al.
Carson Harmon @reyeetengineer
Brad Larsen @BradLarsen

Thanks!

Et pl. al.
Evan Sultanik @ESultanik

https://github.com/trailofbits/polyfile
https://github.com/trailofbits/polytracker