Mining and Understanding Software Enclaves (MUSE)



SLIDE 1

Mining and Understanding Software Enclaves (MUSE)

Suresh Jagannathan Information Innovation Office DARPA

http://www.darpa.mil/Our_Work/I2O/Programs/Mining_and_Understanding_Software_Enclaves_(MUSE).aspx

Distribution Statement A - Approved for Public Release, Distribution Unlimited

SLIDE 2

What is it?

Next for DARPA: 'Autocomplete' for programmers (Source: Phys.org)

Do We Really Need to Learn to Code? (Source: The New Yorker)

Computer Programming Is a Dying Art (Source: Newsweek)

Pentagon seeks 'big code' for 'big data' (Source: USA Today)

SLIDE 3

Trends

  • > 10M LoC (open source)
  • > 21M repositories
  • > 4M code snippets

Navy’s newest warship (USS Zumwalt) runs on Linux

24M

The US government is the largest consumer of OSS

SLIDE 4

Why should the government care?


Navy’s newest warship (USS Zumwalt) runs on Linux

24M

The US government is the largest consumer of OSS in the world

SLIDE 5

Topic Modeling Open-Source Software

Generic Program Properties

Specialized Domain Properties

Source: ohloh.net

SLIDE 6

System Architecture

[Figure: MUSE system architecture. Inputs: source or binary code. Components: Artifact Generation; Graph Database and Mining Engine; Analytics. Techniques: Program Analysis, Theorem Proving, Testing; Learning and Synthesis; Property Checking and Repair. Uses: Inspection; Discovery.]

Query: “Synthesize a program that does X”

α1 α2 α3 β1 β2 β3 λ1 λ2 λ3

Program that satisfies X: f(α1) ◦ g(β2) ◦ h(λ3)
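The query on this slide asks for a program assembled from mined pieces. As a minimal sketch only, the idea can be illustrated with a brute-force search over small component families: the specification X is given as input/output examples, and the search returns the first composition that matches all of them. The families ALPHA, BETA, and LAMBDA, their contents, and the example-based form of X are all assumptions made for illustration; the slide does not say how MUSE represents components or specifications.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Fn = Callable[[int], int]


def compose(*fns: Fn) -> Fn:
    """Right-to-left composition: compose(f, g, h)(x) == f(g(h(x)))."""
    def composed(x: int) -> int:
        for fn in reversed(fns):
            x = fn(x)
        return x
    return composed


# Hypothetical component families standing in for the alpha, beta, and
# lambda groups on the slide; none of these come from the presentation.
ALPHA: Dict[str, Fn] = {"a1": lambda x: x + 1, "a2": lambda x: x - 1, "a3": lambda x: -x}
BETA: Dict[str, Fn] = {"b1": lambda x: x * 2, "b2": lambda x: x * 3, "b3": lambda x: x * x}
LAMBDA: Dict[str, Fn] = {"l1": lambda x: x, "l2": lambda x: x + 10, "l3": lambda x: abs(x)}


def synthesize(examples: List[Tuple[int, int]]) -> Optional[Tuple[str, str, str]]:
    """Return the names of the first (alpha, beta, lambda) triple whose
    composition alpha . beta . lambda agrees with every example, else None."""
    for (an, a), (bn, b), (ln, l) in product(ALPHA.items(), BETA.items(), LAMBDA.items()):
        program = compose(a, b, l)
        if all(program(x) == y for x, y in examples):
            return (an, bn, ln)
    return None


# Specification X, given by examples of the function 3 * |x| + 1.
spec_x = [(-2, 7), (0, 1), (4, 13)]
print(synthesize(spec_x))
```

Exhaustive enumeration is used here only because it is the simplest way to show the query-to-composition step; it is not a claim about the search strategy used in the program.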

SLIDE 7

Enclaves

Redundancies in the corpus exposed as dense components (enclaves) in the mined network

  • Nodes represent properties: facts, claims, and evidence

  • Edges connect related properties

Anomalous properties have a small number of connections.

Likely invariants have a large number of connections.
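The connection-count heuristic above can be sketched as a degree computation over an undirected property graph. This is an illustrative sketch only: the thresholds `low` and `high`, the three labels, and the node names are assumptions, and the slide does not say how MUSE measures density or connectivity.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def degrees(edges: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count connections per property node in an undirected edge list."""
    deg: Dict[str, int] = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def classify(edges: Iterable[Tuple[str, str]], low: int = 1, high: int = 3) -> Dict[str, str]:
    """Label each node by degree. The thresholds are hypothetical."""
    labels: Dict[str, str] = {}
    for node, d in degrees(edges).items():
        if d <= low:
            labels[node] = "anomalous"
        elif d >= high:
            labels[node] = "likely invariant"
        else:
            labels[node] = "unclassified"
    return labels


# Hypothetical mined network: A, B, C, D form a dense component (an
# "enclave"); E hangs off it by a single edge.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
print(classify(edges))
```

Nodes with no edges at all never appear in an edge list, so a fuller version would also take the node set as input.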

SLIDE 8

Big Code Front-End

Collector, Classifier

  • Diverse, representative, high-fidelity corpus
  • Ontological structure
  • Types and proofs
  • Binary decompilation
  • Static and dynamic analyses
  • Theorem proving
  • Tests and runtime verification
  • Executable specifications
  • Model checking
  • Abstract interpretation
  • Contracts and assertions
  • Documentation extraction
  • Canonical and persistent representation of analysis outputs

Program Analyses, Database construction

  • Environment and platform dependencies, models (memory, execution, …)

SLIDE 9

Big Code Back-End

Query Language DSLs

Navigation and Search Queries

Inference Engine

  • Mining
  • Property Checking
  • Learning and Model Generation
  • Protocol Discovery

Specification Language

Synthesis Framework Queries

Distributed Graph Database

SLIDE 10

Dependencies

Ontology / Classification Collection Challenge Problems Demo Workshops

Evaluator

Datalog Analyses

(static, dynamic, concolic)

Type Systems Invariant Detection Repair Specification Trace Analysis Static Analysis LLVM Binary Specification Extraction Abstract Interpretation Dependently Typed IR Deep Learning Ontic Types & Clichés Synthesis from Specifications Sketch Based Synthesis Design Pattern Flaw Detection & Repair Widget Synthesis & Repair Draft-based Synthesis Probabilistic Inference Abductive Inference & Hypothesis Generation Graph Visualization Cloud Infrastructure

Artifact Generators Mining Engine Infrastructure Analytics

Convex Optimization Artifact Store Fault Localization & Repair Protocol Repair & Patch Synthesis Datalog Bayesian Queries Multi-Layered Database

SLIDE 11

Corpus

Currently, ~6 TB of Java, C, and C++ code

SLIDE 12

Draper Labs: The DeepCode Architecture

Source: Draper Labs

SLIDE 13

Artifact Generation

Clang and Draper’s open-source Fracture decompiler together support both compiling source code down to LLVM Intermediate Representation (IR) and lifting binaries up to it.

Source: Draper Labs

SLIDE 14

Deep Learning Analytics

Source: Draper Labs

SLIDE 15

Finding Heartbleed using Big Code (Draper)

170K C/C++ Projects, ~400GB

Artifact Generator

~20M artifacts (call graphs, CFGs, etc.)

Graph Layer

LLVM ANTLR4 Metadata Extractor Fracture

Math Layer

Blue: good, Red: bad

Identify and classify design patterns (flaws and repairs)

Deep Learning

Buggy Program: Heartbleed bug

Repaired Program: Added bounds checks

    if (1+2+16 > s->s3->rrec.length) return 0;
    if (1+2+payload+16 > s->s3->rrec.length) return 0;
    if (write_length > SSL3_RT_MAX_PLAIN_LENGTH) return 0;
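To make the effect of the three added bounds checks concrete, here is a small Python model of a heartbeat handler. It is a sketch under stated assumptions, not OpenSSL code: the function name, the flat `record` layout (1 type byte, 2 length bytes, payload, 16 padding bytes), and the zero-filled padding in the reply are simplifications introduced for illustration.

```python
import struct
from typing import Optional

PADDING_LEN = 16                  # the constant 16 in the checks above
SSL3_RT_MAX_PLAIN_LENGTH = 16384  # maximum TLS plaintext fragment size
HEARTBEAT_RESPONSE = 2            # heartbeat message type for a reply


def build_heartbeat_response(record: bytes) -> Optional[bytes]:
    """Echo the payload of a heartbeat request, or return None if any
    of the three bounds checks from the repaired program fails."""
    # Check 1: type byte, length field, and minimum padding must fit.
    if 1 + 2 + PADDING_LEN > len(record):
        return None
    _msg_type, payload_len = struct.unpack("!BH", record[:3])
    # Check 2: the claimed payload length must fit inside the record.
    # Without this check the handler would read past the received data.
    if 1 + 2 + payload_len + PADDING_LEN > len(record):
        return None
    # Check 3: the reply must not exceed the maximum plaintext size.
    write_length = 1 + 2 + payload_len + PADDING_LEN
    if write_length > SSL3_RT_MAX_PLAIN_LENGTH:
        return None
    payload = record[3:3 + payload_len]
    return struct.pack("!BH", HEARTBEAT_RESPONSE, payload_len) + payload + bytes(PADDING_LEN)


honest = b"\x01" + struct.pack("!H", 5) + b"hello" + bytes(PADDING_LEN)
malicious = b"\x01" + struct.pack("!H", 0x4000) + b"hi" + bytes(PADDING_LEN)
print(build_heartbeat_response(honest) is not None)  # request accepted
print(build_heartbeat_response(malicious))           # request rejected
```

The malicious request claims a 16384-byte payload while carrying only two bytes, which is the mismatch the second check rejects.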

SLIDE 16

Kestrel Institute: Synthesis using Big Code

Source: Kestrel Institute

SLIDE 17

Kestrel Institute: Proof-Directed Synthesis Using Big-Code

Source: Kestrel Institute

SLIDE 18

Artifact Generation Process

Source: Kestrel Institute

SLIDE 19

Features

Source: Kestrel Institute

SLIDE 20

AES Synthesis using Big Code (Kestrel)

130K Java Projects, ~2.3B methods

Analysis & Specification Extraction: ~200B facts

Machine Learning

180 out of 130K projects relevant to AES

422 Features

Types, Control Flow Graphs, API sequences, Proofs

Program Specification

Synthesis + Proof Refinement

Implementation + Proof of Correctness

    (defthm bytep-of-xtime
      (implies (bytep b)
               (bytep (xtime b)))
      :hints (("Goal" :in-theory (enable acl2::shl))))

    public static int lookup (int[][] arr, int hex) {
      int row = hex >> 4;
      int column = hex & 0xF;
      return arr[row][column];
    }
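The theorem and the lookup method above can be mirrored in a few lines of Python. This sketch assumes that `xtime` is the standard AES operation of multiplying by x in GF(2^8), as defined in FIPS 197; the slide itself does not give its definition. The exhaustive loop checks the `bytep-of-xtime` property by testing every byte value, whereas the ACL2 theorem is a proof.

```python
from typing import List


def is_byte(b: int) -> bool:
    """Python stand-in for the bytep predicate: an integer in 0..255."""
    return isinstance(b, int) and 0 <= b <= 0xFF


def xtime(b: int) -> int:
    """Multiply a byte by x in GF(2^8) with the AES reduction polynomial
    (assumed to match the xtime named in the theorem)."""
    shifted = (b << 1) & 0xFF
    return shifted ^ 0x1B if b & 0x80 else shifted


def lookup(arr: List[List[int]], hex_value: int) -> int:
    """Same indexing as the Java method: high nibble selects the row,
    low nibble selects the column."""
    row = hex_value >> 4
    column = hex_value & 0xF
    return arr[row][column]


# Exhaustive check of the property: every byte maps to a byte.
assert all(is_byte(xtime(b)) for b in range(256))

# A 16 x 16 table whose entry at (row, column) is row * 16 + column,
# so lookup returns its own index; used only to exercise the indexing.
identity_table = [[row * 16 + column for column in range(16)] for row in range(16)]
print(hex(xtime(0x57)), hex(lookup(identity_table, 0x3C)))
```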

SLIDE 21

Challenge Problems – Phase 1

Problem | Approach
Synthesis from demonstrations in Swing/Eclipse | Dynamic tracing analysis
Synthesis of AES | Specification-driven (synthesis-by-construction)
Automated repair of incorrect API usage in Android | Code transfer
Repair of incorrect invariants (off-by-one errors) in C/C++ code | Deep learning
Synthesize a communication module for a drone | User-directed cliché discovery
Complete a partial implementation of binary search tree | Sketch-based synthesis
Repair incorrect graph implementations from specifications | Graph classification and repair
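As a toy illustration of the binary search tree problem, the sketch below leaves one "hole" in an insertion routine (the comparison that decides whether to descend left) and fills it by trying candidate operators against test inputs. Everything here is an assumption made for illustration: the tuple tree encoding, the candidate set, and the correctness test (in-order traversal yields the sorted distinct keys). It does not represent any performer's tool.

```python
import operator
from typing import Callable, Dict, List, Optional, Tuple

Tree = Optional[Tuple["Tree", int, "Tree"]]  # (left, key, right) or None


def make_insert(go_left: Callable[[int, int], bool]) -> Callable[[Tree, int], Tree]:
    """Partial BST insertion; go_left(key, node_key) is the hole."""
    def insert(tree: Tree, key: int) -> Tree:
        if tree is None:
            return (None, key, None)
        left, node_key, right = tree
        if key == node_key:
            return tree  # ignore duplicates
        if go_left(key, node_key):
            return (insert(left, key), node_key, right)
        return (left, node_key, insert(right, key))
    return insert


def inorder(tree: Tree) -> List[int]:
    if tree is None:
        return []
    left, key, right = tree
    return inorder(left) + [key] + inorder(right)


# Hypothetical candidate fillings for the hole, tried in this order.
CANDIDATES: Dict[str, Callable[[int, int], bool]] = {
    ">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le,
}


def synthesize(test_inputs: List[List[int]]) -> Optional[str]:
    """Return the first candidate whose completed insert keeps every
    test tree's in-order traversal sorted, else None."""
    for name, op in CANDIDATES.items():
        insert = make_insert(op)
        passed = True
        for keys in test_inputs:
            tree: Tree = None
            for key in keys:
                tree = insert(tree, key)
            if inorder(tree) != sorted(set(keys)):
                passed = False
                break
        if passed:
            return name
    return None


print(synthesize([[2, 1, 3], [5, 5, 4]]))
```

Because equal keys are handled before the hole is consulted, "<" and "<=" behave identically here; the search simply reports the first one it finds.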

SLIDE 22

www.darpa.mil
