Toward Automated Grammar Extraction
via Semantic Labeling
of Parser Implementations
Evan Sultanik Carson Harmon Bradford Larsen
LangSec Workshop at IEEE Security & Privacy, May 21, 2020
The Problem
Create a semantic map of the functions in a parser, which will improve grammar extraction.
parser_function1 ↳ byte 0, 10, 50
Object Stream
parser_function2 ↳ byte 10, 74
Xref
parser_function3 ↳ byte 20
JFIF

Ultimate Goal: Automatically extract a minimal grammar specifying the files accepted by a parser

Hypothesis: The majority of the potential for maliciousness and schizophrenia will exist in the symmetric difference of the grammars accepted by a format’s parser implementations
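The hypothesis can be illustrated on a finite corpus. If each parser is treated as an accept/reject predicate, the inputs in the symmetric difference are those that one implementation accepts and the other rejects. The following is a minimal sketch under that assumption; `strict_parser` and `lenient_parser` are hypothetical toy predicates, not taken from any real library, and a finite corpus only approximates the difference between two grammars.

```python
from typing import Callable, Iterable, Set

Parser = Callable[[bytes], bool]  # returns True if the input is accepted


def accepted(parser: Parser, corpus: Iterable[bytes]) -> Set[bytes]:
    """Return the subset of the corpus that the parser accepts."""
    return {sample for sample in corpus if parser(sample)}


def disagreement(p1: Parser, p2: Parser, corpus: Iterable[bytes]) -> Set[bytes]:
    """Inputs accepted by exactly one of the two parsers."""
    samples = list(corpus)
    return accepted(p1, samples) ^ accepted(p2, samples)


# Hypothetical toy parsers, for illustration only.
def strict_parser(data: bytes) -> bool:
    # Requires the magic bytes at offset 0.
    return data.startswith(b"%PDF-")


def lenient_parser(data: bytes) -> bool:
    # Tolerates leading junk before the magic within the first 1024 bytes.
    return b"%PDF-" in data[:1024]


corpus = [b"%PDF-1.7 ...", b"junk%PDF-1.7 ...", b"not a pdf"]
suspicious = disagreement(strict_parser, lenient_parser, corpus)
print(suspicious)
```

Only the file with leading junk lands in the disagreement set, which is the kind of input the hypothesis singles out for closer inspection.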
Semantic Ground Truth
  Label the Type Composition Hierarchy of the input files
Instrumentation
  Use universal taint analysis to track all input bytes through the execution of a parser
Associative Labeling
  Merge the results of the first two steps to produce a labeling of which functions
  Detect backtracking
  Detect error handling
  Differential analysis
Grammar Extraction
(future work)
Polyglot-Aware File Identification
  iNES ROM
  PDF
  ZIP
Resilient Parsing
  Modify parsers for best effort
  Instrument to track input byte offsets
Syntax Tree
  Label regions of the input
  Produce ground truth

iNES [0x0→0x12220]
↳ Magic [0x0→0x3]
  Header [0x4→0xF]
  ⋮
  PRG [0xC210→0x1020F]
  CHR [0x10210→0x12220]
PDF [0x10→0x2EF72F]
↳ Magic [0x10→0x1E]
  Object 1.0 [0x1F→0x12221]
  ↳ Dictionary [0x2A→0x3E]
    Stream [0x3F→0x12219]
    ↳ JFIF Image [0x46→0x1220F]
      ↳ JPEG Segment […]
        ↳ Magic […]
          Marker […]
        ⋮
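The ground truth above can be read as a tree of labeled byte ranges. The sketch below models that idea with a hypothetical `Region` class (an assumption for illustration, not PolyFile's actual API) and resolves a byte offset to the chain of types that contain it. Ranges are treated as inclusive, and the regions elided in the listing are simply left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Region:
    """A labeled, inclusive byte range with nested child regions (hypothetical type)."""
    label: str
    start: int
    end: int
    children: List["Region"] = field(default_factory=list)

    def path_to(self, offset: int) -> List[str]:
        """Labels from this region down to the deepest region containing offset."""
        if not (self.start <= offset <= self.end):
            return []
        for child in self.children:
            sub_path = child.path_to(offset)
            if sub_path:
                return [self.label] + sub_path
        return [self.label]


# Offsets copied from the PDF part of the listing; elided regions are omitted.
pdf = Region("PDF", 0x10, 0x2EF72F, [
    Region("Magic", 0x10, 0x1E),
    Region("Object 1.0", 0x1F, 0x12221, [
        Region("Dictionary", 0x2A, 0x3E),
        Region("Stream", 0x3F, 0x12219, [
            Region("JFIF Image", 0x46, 0x1220F),
        ]),
    ]),
])

print(pdf.path_to(0x30))
print(pdf.path_to(0x100))
```

A lookup of this kind is what later allows a byte offset reported by the instrumentation to be translated into a type label.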
LLVM Instrumentation
  Operate on LLVM/IR
  Can work with all open source parsers
  Eventually support closed-source binaries by lifting to LLVM (e.g., with McSema)

Taint Tracking
  Shadow memory inspired by the Data Flow Sanitizer (dfsan)
  Negligible CPU overhead
  O(n) memory overhead, where n is the number of instructions executed by the parser
  Novel data structure for efficiently storing taint labels
    dfsan status quo: Θ(1) lookups, Θ(n²) storage
    PolyTracker: O(log n) lookups, O(n) storage
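The slides do not describe the label data structure itself. As an illustration of how label storage can stay linear, the sketch below keeps one node per label in an append-only forest, where a union label stores only its two parent labels. The `TaintForest` class and its methods are hypothetical; this is not PolyTracker's implementation and it does not reproduce the lookup bound quoted above.

```python
from typing import Dict, List, Optional, Set, Tuple


class TaintForest:
    """Hypothetical append-only forest of taint labels.

    Label 0 means "untainted". A source label records one input byte offset;
    a union label records only its two parent labels, so storage grows by one
    node per new label.
    """

    def __init__(self) -> None:
        # nodes[label] = (parent1, parent2, source_offset)
        self.nodes: List[Tuple[int, int, Optional[int]]] = [(0, 0, None)]
        self._unions: Dict[Tuple[int, int], int] = {}

    def source(self, offset: int) -> int:
        """Create a label for a single input byte."""
        self.nodes.append((0, 0, offset))
        return len(self.nodes) - 1

    def union(self, a: int, b: int) -> int:
        """Label for a value derived from two labeled values."""
        if a == b or b == 0:
            return a
        if a == 0:
            return b
        key = (min(a, b), max(a, b))
        if key not in self._unions:
            self.nodes.append((key[0], key[1], None))
            self._unions[key] = len(self.nodes) - 1
        return self._unions[key]

    def offsets(self, label: int) -> Set[int]:
        """Resolve a label to the set of input byte offsets it depends on."""
        result: Set[int] = set()
        stack, seen = [label], set()
        while stack:
            current = stack.pop()
            if current == 0 or current in seen:
                continue
            seen.add(current)
            parent1, parent2, offset = self.nodes[current]
            if offset is not None:
                result.add(offset)
            else:
                stack.extend((parent1, parent2))
        return result


forest = TaintForest()
b0, b1, b2 = forest.source(0), forest.source(1), forest.source(2)
combined = forest.union(forest.union(b0, b1), b2)
print(forest.offsets(combined))
```

Each union adds a single node regardless of how many input bytes the operands depend on, which is the property that keeps storage proportional to the number of labels created.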
{
  "ensure_solid_xref": [2276587, 2276588],
  "fmt_obj": [
    2465223, 2465224, 2465225, 2465226, 2465227, 2465228,
    2465240, 2465241, 2465242, 2465243, 2465244, 2465245, 2465246,
    2465258, 2465259, 2465260, 2465261, 2465262
  ]
}
fmt_obj → Object, Dictionary
ensure_solid_xref → Trailer, XRef
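The association step can be sketched as a join between the two artifacts: the per-function byte offsets from taint tracking and the labeled byte ranges from the ground truth. The region offsets below are hypothetical and chosen only to keep the example small; the function names come from the slides, and `label_functions` is an illustrative helper rather than part of either tool.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (label, first byte, last byte), inclusive. Hypothetical offsets for illustration.
Region = Tuple[str, int, int]


def types_at(offset: int, regions: List[Region]) -> List[str]:
    """All type labels whose byte range covers the offset."""
    return [label for label, start, end in regions if start <= offset <= end]


def label_functions(
    touched: Dict[str, List[int]], regions: List[Region]
) -> Dict[str, Counter]:
    """For each function, count how many of its tainted bytes fall in each type."""
    result: Dict[str, Counter] = {}
    for function, offsets in touched.items():
        counts: Counter = Counter()
        for offset in offsets:
            counts.update(types_at(offset, regions))
        result[function] = counts
    return result


regions = [
    ("Object", 100, 199),
    ("Dictionary", 110, 150),
    ("XRef", 300, 399),
    ("Trailer", 400, 450),
]
# Same shape as the taint output: function name -> input byte offsets.
touched = {
    "fmt_obj": [105, 120, 121, 122],
    "ensure_solid_xref": [310, 405],
}
labels = label_functions(touched, regions)
for function, counts in labels.items():
    print(function, dict(counts))
```

Keeping counts rather than plain sets preserves how strongly each function is tied to each type, which a later specialization measure can use.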
How can we associate types in the file format to the set of functions most specialized in operating on that type?
Observations
Raw mapping is not necessarily injective: There will rarely be a perfect bijection between the types and functions
Parser 1: Monolithic Function
Parser 2: Specialized Function, Specialized Function, Specialized Function
A parser’s functional implementation will rarely be isomorphic to the type hierarchy or syntax tree of the input file
within a function
all but the smallest (most specialized) standard deviation
Idea: Use information entropy to measure function specialization
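One way to make the entropy idea concrete is to compute the Shannon entropy, H = −Σ p·log₂(p), over the distribution of types that a function's tainted bytes fall into. The slides do not state how the distribution is formed, so the per-type byte counts used here are an assumption, and `specialization_entropy` is a hypothetical helper.

```python
import math
from collections import Counter
from typing import Mapping


def specialization_entropy(type_counts: Mapping[str, int]) -> float:
    """Shannon entropy (in bits) of the types a function's tainted bytes fall in."""
    total = sum(type_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in type_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical counts: bytes touched per type by two different functions.
specialized = Counter({"XRef": 40})
monolithic = Counter({"Object": 10, "Dictionary": 10, "XRef": 10, "Trailer": 10})

print(specialization_entropy(specialized))
print(specialization_entropy(monolithic))
```

Under this measure a function that touches only one type scores zero bits, while a function that spreads evenly over four types scores two bits, so lower values indicate more specialized functions.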
The parser has a single function responsible for parsing multiple types
Monolithic Function
Parser 1
an ancestor of the type in the dominator tree that maps to the same function
parse_pdf_dictionary
an ancestor in the dominator tree that is also in the matching
The parser has many, tightly coupled functions collectively responsible for parsing a single type
Specialized Function Specialized Function Specialized Function
Parser 2
pdf_load_xref pdf_read_start_xref pdf_prime_xref_index
PDF XRef
PolyFile:
(Generated from a single parse of a single PDF)
[Figure: graph relating file format types to parser functions]
Types: PDF, PDF Object, PDF Dictionary, PDF Object Version, PDF Object Content, PDF Object ID, PDF Trailer, PDF Comment, Key Value Pair, Key, Value, PDF Start XRef, PDF XRef
Functions: fz_isprint, lex_string, next_flated, pdf_process_stream, twoway_memmem, fz_clamp, fz_strtof, pdf_load_version, pdf_process_keyword, lex_name, lex_white, pdf_read_start_xref, pdf_read_xref_sections, FT_Stream_ReadAt, pdf_show_char, pdf_token_from_keyword, FT_Stream_Seek, cff_index_access_element, cff_slot_load, fz_bound_glyph, fz_bound_text, fz_read, pdf_read_old_xref
[Figure, second build: subset of the same graph]
Types: PDF, PDF Trailer, PDF Comment, PDF Start XRef, PDF XRef
Functions: twoway_memmem, fz_clamp, fz_strtof, pdf_load_version, pdf_read_start_xref, pdf_read_xref_sections, fz_read, pdf_read_old_xref
(Generated from a single parse of a single PDF)
[Figure: graph relating file format types to QPDF functions, shown in successively smaller builds]
Types: PDF, PDF Object, PDF Dictionary, PDF Object Version, PDF Object Content, PDF Object ID, PDF Trailer, PDF Comment, Key Value Pair, Key, Value, PDF Start XRef, PDF XRef
Functions in the first build: QPDF::pipeStreamData, QPDFObjectHandle::getUIntValue, QPDF::findHeader, QUtil::is_digit, Pl_Count::write, QIntC::IntConverter<unsigned long, long long, false, true>::convert, QPDFObjectHandle::getIntValue, QPDFWriter::unparseObject, QPDF_String::unparse, QUtil::int_to_string_base, QUtil::uint_to_string_base, QPDF_Name::normalizeName, QPDF::parse, QIntC::IntConverter<long long, int, true, true>::convert, QPDFXRefEntry::getType, QPDF_Stream::setStreamDescription, Buffer::Buffer, ContentNormalizer::handleToken, Pl_Buffer::getBuffer, Pl_Buffer::write, Pl_Flate::handleData, QPDF::readObject, QPDF_Dictionary::getKeys, QPDF_Integer::~QPDF_Integer, QPDF::read_xrefTable, plus C++ standard library template instantiations (std::_Rb_tree, std::map, __gnu_cxx::new_allocator)
[Figure, final build: subset of the same graph]
Types: PDF, PDF Object Version, PDF Object ID, PDF Trailer, PDF Comment, PDF Start XRef, PDF XRef
Functions: QPDF::findHeader, QUtil::is_digit, QIntC::IntConverter<unsigned long, long long, false, true>::convert, QUtil::uint_to_string_base, QPDF_Name::normalizeName, QPDF::parse, QIntC::IntConverter<long long, int, true, true>::convert, QPDFXRefEntry::getType, QPDF_Stream::setStreamDescription, QPDF::read_xrefTable
(Generated from a single parse of a single JPEG)
[Figure: graph relating JPEG format fields to parser functions]
Node labels: JPEG File, JPEG, read_markers, segments, consume_markers, jinit_d_coef_controller, jpeg_make_d_derived_tbl, start_iMCU_row, start_pass, start_pass_main, segment, magic, marker, length, image_data, data, first_marker, next_marker, segment_app0, segment_sof0, segment_sos, num_components, components, version_major, version_minor, density_units, density_x, density_y, thumbnail, thumbnail_x, thumbnail_y, get_interesting_appn, examine_app0, image_width, image_height, bits_per_sample, jpeg_core_output_dimensions, initial_setup, master_selection, start_spectral_selection, appr_bit_pos, end_spectral, start_pass_huff_decoder, component, id, quantization_table_id, sampling_factors, huffman_table, default_decompress_parms, latch_quant_tables
[Figure, second build: subset of the same graph]
Node labels: start_pass, segment_sof0, image_width, image_height, bits_per_sample, jpeg_core_output_dimensions, initial_setup, master_selection, _selection
native types in the code
labeling
types at the same time
another
malformed files) to maximize coverage
data structures of the parser
Et pl. al.
Carson Harmon @reyeetengineer
Brad Larsen @BradLarsen
Evan Sultanik @ESultanik
https://github.com/trailofbits/polyfile
https://github.com/trailofbits/polytracker