Toward Automated Grammar Extraction via Semantic Labeling of Parser Implementations

Carson Harmon, Bradford Larsen, Evan Sultanik
LangSec Workshop at IEEE Security & Privacy, May 21, 2020


  1. Toward Automated Grammar Extraction via Semantic Labeling of Parser Implementations
     Carson Harmon, Bradford Larsen, Evan Sultanik
     LangSec Workshop at IEEE Security & Privacy, May 21, 2020

  2. The Problem

  3. The Problem

  4. The Problem

  5. The Problem

  6. The Problem

  7. High Level Goals
     Create a semantic map of the functions in a parser, which will improve grammar extraction.

  8. High Level Goals
     Create a semantic map of the functions in a parser, which will improve grammar extraction.
     • parser_function1 ↳ bytes 0, 10, 50
     • parser_function2 ↳ bytes 10, 74
     • parser_function3 ↳ byte 20
     • Types in the input file: Object Stream, Xref, JFIF

  9. High Level Goals
     Create a semantic map of the functions in a parser, which will improve grammar extraction.
     • parser_function1 ↳ bytes 0, 10, 50
     • parser_function2 ↳ bytes 10, 74
     • parser_function3 ↳ byte 20
     • Types in the input file: Object Stream, Xref, JFIF
     Ultimate Goal: Automatically extract a minimal grammar specifying the files accepted by a parser.
     Hypothesis: The majority of the potential for maliciousness and schizophrenia will exist in the symmetric difference of the grammars accepted by a format’s parser implementations.
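The hypothesis on slide 9 can be illustrated on a finite corpus. The sketch below is only an approximation: real grammars describe infinite languages, and `parser_a` and `parser_b` are hypothetical toy acceptors, not real PDF parsers. It collects the inputs each parser accepts and takes the symmetric difference, which is the set of inputs the two implementations disagree on.

```python
def parser_a(data: bytes) -> bool:
    """Hypothetical strict parser: the magic must be at offset 0."""
    return data.startswith(b"%PDF")


def parser_b(data: bytes) -> bool:
    """Hypothetical lenient parser: the magic may appear anywhere in the first 1024 bytes."""
    return b"%PDF" in data[:1024]


# A tiny stand-in corpus; a real study would use many files.
corpus = [b"%PDF-1.7 ...", b"junk%PDF-1.7 ...", b"not a pdf"]

accepted_a = {d for d in corpus if parser_a(d)}
accepted_b = {d for d in corpus if parser_b(d)}

# Inputs accepted by exactly one of the two parsers.
disagreement = accepted_a ^ accepted_b
print(disagreement)
```

Files in `disagreement` are the ones that one implementation treats as valid and the other rejects, which is where the slide expects most of the risk to concentrate.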

  10. Approach
      • Semantic Ground Truth: Label the Type Composition Hierarchy of the input files
      • Instrumentation: Use universal taint analysis to track all input bytes through the execution of a parser
      • Associative Labeling: Merge the results of the first two steps to produce a labeling of which functions operate on which types
      • Detect backtracking
      • Detect error handling
      • Differential analysis

  11. Approach
      • Semantic Ground Truth: Label the Type Composition Hierarchy of the input files
      • Instrumentation: Use universal taint analysis to track all input bytes through the execution of a parser
      • Associative Labeling: Merge the results of the first two steps to produce a labeling of which functions operate on which types
      • Detect backtracking ✓
      • Detect error handling
      • Differential analysis

  12. Approach
      • Semantic Ground Truth: Label the Type Composition Hierarchy of the input files
      • Instrumentation: Use universal taint analysis to track all input bytes through the execution of a parser
      • Associative Labeling: Merge the results of the first two steps to produce a labeling of which functions operate on which types
      • Grammar Extraction (future work)
      • Detect backtracking ✓
      • Detect error handling
      • Differential analysis

  13. Prior Work: Semantic Labeling
      • Polyglot-Aware File Identification: iNES ROM, PDF, ZIP
      • Resilient Parsing
        ◦ Modify parsers for best effort
        ◦ Instrument to track input byte offsets
        ◦ Label regions of the input
        ◦ Produce ground truth
      • Syntax Tree
        iNES [0x0 → 0x12220]
        ↳ Magic [0x0 → 0x3]
          Header [0x4 → 0xF]
          ⋮
          PRG [0xC210 → 0x1020F]
          CHR [0x10210 → 0x12220]
        PDF [0x10 → 0x2EF72F]
        ↳ Magic [0x10 → 0x1E]
          Object 1.0 [0x1F → 0x12221]
          ↳ Dictionary [0x2A → 0x3E]
            Stream [0x3F → 0x12219]
            ↳ JFIF Image [0x46 → 0x1220F]
              ↳ JPEG Segment […]
                ↳ Magic […]
                  Marker […]
                  ⋮

  14. PolyFile Ground Truth

  15. PolyFile Ground Truth

  16. Prior Work: Parser Instrumentation
      • LLVM Instrumentation
        ◦ Operate on LLVM/IR
        ◦ Can work with all open source parsers
        ◦ Eventually support closed-source binaries by lifting to LLVM (e.g., with McSema or Remill)
      • Taint Tracking
        ◦ Shadow memory inspired by the Data Flow Sanitizer (dfsan)
        ◦ Negligible CPU overhead
        ◦ O(n) memory overhead, where n is the number of instructions executed by the parser
        ◦ Novel data structure for efficiently storing taint labels
        ◦ dfsan status quo: Θ(1) lookups, Θ(n²) storage
        ◦ PolyTracker: O(log n) lookups, O(n) storage

  17. PolyTracker Instrumentation
      {
        "ensure_solid_xref": [2276587, 2276588],
        "fmt_obj": [2465223, 2465224, 2465225, 2465226, 2465227, 2465228,
                    2465240, 2465241, 2465242, 2465243, 2465244, 2465245, 2465246,
                    2465258, 2465259, 2465260, 2465261, 2465262]
      }
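The output on slide 17 maps function names to the input byte offsets each function touched. PolyTracker collects this at the LLVM IR level with shadow memory; the following is only a toy Python analogy of the idea, in which every name (`TaintedByte`, `track`, `parse_magic`, `parse_version`) is hypothetical. Each input byte carries its offset, and a decorator records which offsets reach which function.

```python
import functools
import json
from collections import defaultdict


class TaintedByte:
    """A byte value that remembers the input offset it came from."""

    def __init__(self, value: int, offset: int):
        self.value = value
        self.offset = offset


# function name -> set of input offsets the function received
touched = defaultdict(set)


def track(func):
    """Record the offsets of all tainted bytes passed to func."""
    @functools.wraps(func)
    def wrapper(chunk):
        touched[func.__name__].update(b.offset for b in chunk)
        return func(chunk)
    return wrapper


@track
def parse_magic(chunk):
    return bytes(b.value for b in chunk) == b"%PDF"


@track
def parse_version(chunk):
    return bytes(b.value for b in chunk).decode()


data = b"%PDF-1.7"
tainted = [TaintedByte(v, i) for i, v in enumerate(data)]
parse_magic(tainted[:4])
parse_version(tainted[5:8])

# Same shape as the JSON on the slide: function -> sorted byte offsets.
report = {name: sorted(offsets) for name, offsets in touched.items()}
print(json.dumps(report))
```

A real taint tracker also follows bytes through arithmetic, copies, and control flow, which this sketch does not attempt.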

  18. PolyTracker Instrumentation
      The function-to-byte-offset JSON from slide 17, shown beside the PolyFile syntax tree from slide 13.

  19. PolyTracker Instrumentation
      The same JSON and syntax tree, with candidate type labels next to each function:
      • ensure_solid_xref: Trailer, XRef
      • fmt_obj: Object, Dictionary
      ❓
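The question slide 19 leaves open (which types does each function touch?) can be approached by intersecting the two data sets. This is a minimal sketch, assuming flat, non-overlapping type ranges rather than the nested hierarchy on the slides. The `XRef` and `Object` offsets are hypothetical because the slides do not give them, and `type_at` and `label_functions` are illustrative names, not the polymerge API.

```python
from bisect import bisect_right

# function -> input byte offsets (a subset of the values on slide 17)
function_bytes = {
    "ensure_solid_xref": [2276587, 2276588],
    "fmt_obj": [2465223, 2465224, 2465240],
}

# Hypothetical (type, start, end) ranges, end inclusive, sorted by start.
type_ranges = [
    ("XRef", 2276500, 2276700),
    ("Object", 2465200, 2465300),
]
starts = [start for _, start, _ in type_ranges]


def type_at(offset):
    """Binary-search the range containing offset; None if no type covers it."""
    i = bisect_right(starts, offset) - 1
    if i >= 0:
        name, start, end = type_ranges[i]
        if start <= offset <= end:
            return name
    return None


def label_functions(function_bytes):
    """Count, per function, how many touched bytes fall inside each type."""
    labels = {}
    for func, offsets in function_bytes.items():
        counts = {}
        for off in offsets:
            t = type_at(off)
            if t is not None:
                counts[t] = counts.get(t, 0) + 1
        labels[func] = counts
    return labels


labels = label_functions(function_bytes)
print(labels)
```

The per-function type counts produced here are the raw association; the later slides describe how that raw mapping is refined.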

  20. The Challenge of Associative Labeling
      How can we associate types in the file format to the set of functions most specialized in operating on that type?
      Observations:
      • Raw mapping is not necessarily injective: there will rarely be a perfect bijection between the types and functions
      • A parser’s functional implementation will rarely be isomorphic to the type hierarchy or syntax tree of the input file
      Figure: Parser 1 with a single Monolithic Function; Parser 2 with three Specialized Functions

  21. Information Entropy
      Idea: Use information entropy to measure function specialization
      • For each type, collect the functions that operate on that type
      • Calculate P(t, f) = the probability that a specific type occurs within a function
      • Calculate the “genericism” of a function G : F → ℝ
      • Use G to sort the functions associated with a type, discarding all but the smallest (most specialized) standard deviation
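The slide does not give a formula for G, so the sketch below makes two labeled assumptions: G(f) is taken to be the Shannon entropy of the distribution of types over the bytes a function touches, and "discarding all but the smallest standard deviation" is read as keeping functions whose G lies within one standard deviation of the minimum. The function names and byte counts are made up for illustration.

```python
import math
from statistics import pstdev

# counts[f][t] = number of bytes of type t touched by function f (made-up data)
counts = {
    "parse_xref": {"XRef": 40},
    "parse_object": {"Object": 30, "Dictionary": 10},
    "read_token": {"XRef": 10, "Object": 10, "Dictionary": 10, "Stream": 10},
}


def genericism(type_counts):
    """Assumed definition: Shannon entropy (bits) of P(t | f)."""
    total = sum(type_counts.values())
    return sum((c / total) * math.log2(total / c)
               for c in type_counts.values() if c)


G = {f: genericism(tc) for f, tc in counts.items()}


def specialized_for(type_name):
    """Functions touching type_name, most specialized first, cut at min + 1 std dev."""
    candidates = sorted(
        (f for f, tc in counts.items() if tc.get(type_name)), key=G.get
    )
    if not candidates:
        return []
    scores = [G[f] for f in candidates]
    cutoff = scores[0] + pstdev(scores)
    return [f for f in candidates if G[f] <= cutoff]


print(specialized_for("XRef"))
print(specialized_for("Object"))
```

A function that only ever touches one type scores 0 bits, while a general-purpose tokenizer that touches four types equally scores 2 bits and is filtered out.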

  22. Problem: Code is Too Monolithic
      The parser has a single function responsible for parsing multiple types
      Figure: Parser 1 with a single Monolithic Function

  23. Problem: Code is Too Monolithic
      The parser has a single function responsible for parsing multiple types
      • Calculate the dominator tree of the syntax tree

  24. Problem: Code is Too Monolithic
      The parser has a single function responsible for parsing multiple types
      • Calculate the dominator tree of the syntax tree
      • Remove a function from the matching for a type if there exists an ancestor of the type in the dominator tree that maps to the same function

  25. Problem: Code is Too Monolithic
      The parser has a single function responsible for parsing multiple types
      • Calculate the dominator tree of the syntax tree
      • Remove a function from the matching for a type if there exists an ancestor of the type in the dominator tree that maps to the same function

  26. Problem: Code is Too Monolithic
      The parser has a single function responsible for parsing multiple types
      • Calculate the dominator tree of the syntax tree
      • Remove a function from the matching for a type if there exists an ancestor of the type in the dominator tree that maps to the same function
      Figure label: parse_pdf_dictionary

  27. Problem: Code is Too Monolithic
      The parser has a single function responsible for parsing multiple types
      • Calculate the dominator tree of the syntax tree
      • Remove a function from the matching for a type if there exists an ancestor of the type in the dominator tree that maps to the same function
      Figure label: parse_pdf_dictionary
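The pruning rule for monolithic code can be sketched as follows. In a rooted tree, every node's dominators are exactly its ancestors, so the sketch walks parent links directly instead of running a general dominator algorithm. The type hierarchy and the matching are invented for illustration; only `parse_pdf_dictionary` appears on the slides, and `parse_pdf` and `parse_pdf_object` are hypothetical.

```python
# Hypothetical syntax tree: type -> parent type (None for the root).
parent = {
    "PDF": None,
    "Object": "PDF",
    "Dictionary": "Object",
    "Stream": "Object",
}

# Raw matching before pruning: type -> functions observed operating on it.
matching = {
    "PDF": {"parse_pdf"},
    "Object": {"parse_pdf_object"},
    "Dictionary": {"parse_pdf_object", "parse_pdf_dictionary"},
    "Stream": {"parse_pdf_object"},
}


def ancestors(t):
    """Yield the ancestors of type t, nearest first."""
    t = parent[t]
    while t is not None:
        yield t
        t = parent[t]


def prune_monolithic(matching):
    """Drop a function from a type if any ancestor type maps to the same function."""
    pruned = {}
    for t, funcs in matching.items():
        inherited = set()
        for a in ancestors(t):
            inherited |= matching.get(a, set())
        pruned[t] = funcs - inherited
    return pruned


pruned = prune_monolithic(matching)
print(pruned)
```

Here `parse_pdf_object` stays attached only to `Object`, the highest type it maps to, and `Dictionary` keeps the more specific `parse_pdf_dictionary`. The ancestor check reads the unpruned matching, so the result does not depend on iteration order.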

  28. Problem: Code is Too Cohesive
      The parser has many, tightly coupled functions collectively responsible for parsing a single type
      • If those functions are always called sequentially, then we ideally only want the single function that initiates the sequence
      • Calculate the dominator tree of the runtime control flow graph
      • For each type, remove any functions in the matching that have an ancestor in the dominator tree that is also in the matching
      Figure: Parser 2 with three Specialized Functions

  29. Problem: Code is Too Cohesive
      The parser has many, tightly coupled functions collectively responsible for parsing a single type
      • If those functions are always called sequentially, then we ideally only want the single function that initiates the sequence
      • Calculate the dominator tree of the runtime control flow graph
      • For each type, remove any functions in the matching that have an ancestor in the dominator tree that is also in the matching

  30. Figure: pdf_load_xref, pdf_read_start_xref, pdf_prime_xref_index

  31. Figure: the type PDF XRef shown with pdf_load_xref, pdf_read_start_xref, pdf_prime_xref_index

  32. Figure: the type PDF XRef shown with pdf_load_xref, pdf_read_start_xref, pdf_prime_xref_index
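The pruning rule for cohesive code can be sketched with the three function names from the slides. The call edges below are an assumption (the slides show only the names), `main` is a hypothetical entry point, and the dominator computation is the textbook iterative set-intersection method, not necessarily what polymerge uses.

```python
# Assumed runtime call graph: caller -> callees.
graph = {
    "main": ["pdf_load_xref"],
    "pdf_load_xref": ["pdf_read_start_xref", "pdf_prime_xref_index"],
    "pdf_read_start_xref": [],
    "pdf_prime_xref_index": [],
}


def dominators(graph, entry):
    """Return node -> set of its dominators (assumes all nodes reachable from entry)."""
    nodes = set(graph) | {s for succs in graph.values() for s in succs}
    preds = {n: set() for n in nodes}
    for n, succs in graph.items():
        for s in succs:
            preds[s].add(n)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def prune_cohesive(funcs, dom):
    """Keep only functions with no strict dominator that is also in the matching."""
    return {f for f in funcs if not ((dom[f] - {f}) & funcs)}


dom = dominators(graph, "main")
xref_matching = {"pdf_load_xref", "pdf_read_start_xref", "pdf_prime_xref_index"}
print(prune_cohesive(xref_matching, dom))
```

A node's ancestors in the dominator tree are its strict dominators, so testing `dom[f] - {f}` against the matching is equivalent to the ancestor check described on slides 28 and 29. Under the assumed edges, only `pdf_load_xref` survives for the XRef type.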

  33. Results
      • Runs in O(|F| n log |T|) time
        ◦ F = # functions in the parser
        ◦ T = # types (or production rules) in the grammar
        ◦ n = # bytes in the input file
      • Mappings for various parsers and file formats
      • Implementation in the polymerge application distributed with PolyFile:
        ◦ pip3 install polyfile
