Machine-Learning-Guided Selectively Unsound Static Analysis - Kihong Heo



SLIDE 1

Machine-Learning-Guided Selectively Unsound Static Analysis

26 May 2017, ICSE'17 @ Buenos Aires

Kihong Heo (Seoul National University)
Hakjoo Oh (Korea University)
Kwangkeun Yi (Seoul National University)

SLIDE 2

Goal

[Figure: false positives versus false negatives, with the Uniformly Unsound and Uniformly Sound analyses marked]

SLIDE 3

Goal

[Figure: false positives versus false negatives, with the Selectively Unsound, Uniformly Unsound, and Uniformly Sound analyses marked]

SLIDE 4

Selectively Unsound Analysis

  • Selectively apply unsound strategies
  • e.g., unrolling loops, skipping library calls

while(e){ C }  becomes  if(e){ C }
A; lib(); B;   becomes  A; B;

[Figure: three panels (Uniformly Sound, Uniformly Unsound, Selectively Unsound), each relating program states to error states, with the false positive and false negative regions marked]
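The two rewrites above can be sketched as a transformation over a toy program. This is only an illustration: the node classes and the labels `loop1` and `lib1` are hypothetical, and a real analyzer would change how it interprets these components rather than rewrite the source.

```python
from dataclasses import dataclass
from typing import List, Set, Union


@dataclass(frozen=True)
class Stmt:
    text: str


@dataclass(frozen=True)
class While:
    label: str   # hypothetical identifier for the loop
    cond: str
    body: str


@dataclass(frozen=True)
class If:
    cond: str
    body: str


@dataclass(frozen=True)
class LibCall:
    label: str   # hypothetical identifier for the call site
    name: str


Node = Union[Stmt, While, If, LibCall]


def apply_selectively(program: List[Node], targets: Set[str]) -> List[Node]:
    """Unroll loops and skip library calls whose labels are in `targets`."""
    out: List[Node] = []
    for node in program:
        if isinstance(node, While) and node.label in targets:
            out.append(If(node.cond, node.body))   # while(e){C} becomes if(e){C}
        elif isinstance(node, LibCall) and node.label in targets:
            continue                               # A; lib(); B; becomes A; B;
        else:
            out.append(node)
    return out


program = [Stmt("A"), While("loop1", "e", "C"), LibCall("lib1", "lib"), Stmt("B")]
uniformly_sound = apply_selectively(program, set())
uniformly_unsound = apply_selectively(program, {"loop1", "lib1"})
selectively_unsound = apply_selectively(program, {"loop1"})
```

With an empty target set the program is left untouched (uniformly sound); with every label it is fully rewritten (uniformly unsound); any subset in between gives a selectively unsound variant.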

SLIDE 5

Example

str = "hello world";
for(i=0; !str[i]; i++)      // buffer access 1
  skip;
size = positive_input();
for(i=0; i<size; i++)
  skip;
... = str[i];               // buffer access 2

  • Sound buffer-overrun analyzer with interval domain
  • soundly analyze all the loops
SLIDE 6

Example

  • Sound buffer-overrun analyzer with interval domain
  • soundly analyze all the loops

str = "hello world";        // str.size: [12, 12]
for(i=0; !str[i]; i++)      // buffer access 1, i: [0, +oo]
  skip;
size = positive_input();    // size: [0, +oo]
for(i=0; i<size; i++)
  skip;
... = str[i];               // buffer access 2, i: [0, +oo]
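A minimal sketch of why a sound interval analysis reports i: [0, +oo] for a counting loop. It assumes the standard interval widening operator and, as a simplification, treats the loop condition as unknown (no pruning); the function names are illustrative and not taken from the analyzer in the talk.

```python
import math

INF = math.inf


def join(a, b):
    """Least upper bound of two intervals (lo, hi)."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def widen(old, new):
    """Standard interval widening: any unstable bound jumps to infinity."""
    lo = old[0] if new[0] >= old[0] else -INF
    hi = old[1] if new[1] <= old[1] else INF
    return (lo, hi)


def analyze_counting_loop(init):
    """Abstractly run `for (i = init; <unknown condition>; i++)` to a fixpoint."""
    head = init
    while True:
        after_body = (head[0] + 1, head[1] + 1)        # effect of i++
        new_head = widen(head, join(init, after_body))
        if new_head == head:                            # fixpoint reached
            return head
        head = new_head


i_interval = analyze_counting_loop((0, 0))   # (0, inf)
```

The upper bound of i grows on the first iteration, so widening pushes it to +oo and the analysis stabilizes at [0, +oo], which then exceeds the buffer size [12, 12].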

SLIDE 7

Example

  • Uniformly unsound buffer-overrun analyzer
  • unsoundly unroll all the loops

str = "hello world";
i = 0;
if (!str[i])                // buffer access 1
  skip;
size = positive_input();
i = 0;
if (i < size)
  skip;
... = str[i];               // buffer access 2

SLIDE 8

Example

  • Uniformly unsound buffer-overrun analyzer
  • unsoundly unroll all the loops

str = "hello world";
i = 0;
if (!str[i])                // buffer access 1, i: [0, 0]
  skip;
size = positive_input();
i = 0;
if (i < size)
  skip;
... = str[i];               // buffer access 2, i: [0, 0]

SLIDE 9

Example

str = "hello world";
i = 0;
if(!str[i])                 // buffer access 1
  skip;
size = positive_input();
for(i = 0; i < size; i++)
  skip;
... = str[i];               // buffer access 2

  • Selectively unsound buffer-overrun analyzer
  • unsoundly unroll only harmless loops
SLIDE 10

Example

str = "hello world";
i = 0;
if(!str[i])                 // buffer access 1, i: [0, 0]
  skip;
size = positive_input();
for(i = 0; i < size; i++)
  skip;
... = str[i];               // buffer access 2, i: [0, +oo]

  • Selectively unsound buffer-overrun analyzer
  • unsoundly unroll only harmless loops
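The three variants of the example can be compared with a small alarm check. This sketch takes the intervals of i annotated on the slides as given and assumes a simple overrun test (an index may fall outside the buffer); the dictionary layout is illustrative only.

```python
import math

INF = math.inf
STR_SIZE = (12, 12)   # "hello world" plus the terminating NUL


def may_overrun(index, size):
    """Alarm if the index may be negative or may reach the smallest possible size."""
    return index[0] < 0 or index[1] >= size[0]


# Interval of i at each buffer access, as annotated on the slides.
index_at_access = {
    "uniformly sound":     {"access 1": (0, INF), "access 2": (0, INF)},
    "uniformly unsound":   {"access 1": (0, 0),   "access 2": (0, 0)},
    "selectively unsound": {"access 1": (0, 0),   "access 2": (0, INF)},
}

alarms = {
    strategy: sorted(a for a, idx in accesses.items() if may_overrun(idx, STR_SIZE))
    for strategy, accesses in index_at_access.items()
}
```

Access 1 is safe in the concrete program, so its alarm under the sound analysis is a false positive, while access 2 can really overrun because size is an unbounded input. The sound analysis reports both, the uniformly unsound one reports neither, and the selective one reports only access 2.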
SLIDE 11

Performance

  • Experiments with 2 analyzers on open-source software
  • Taint: 106 format string bugs / 13 programs
  • Interval: 138 buffer overrun bugs / 23 programs

[Figure: bar charts of FPR and FNR for the Baseline, Selective, and Uniform analyzers, in two panels: Interval Analysis and Taint Analysis]

SLIDE 12

Setting

F ∈ Pgm × Π → A

  • Find a set of targets for unsound strategies: π ∈ Π
  • loops to analyze unsoundly (Π = 2^Loop)
  • library calls to analyze unsoundly (Π = 2^Lib)
  • Selectively apply unsound strategies to p ∈ π
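Since Π is a powerset, the number of candidate parameters π doubles with every loop or library call. A small sketch of that parameter space, using hypothetical loop labels:

```python
from itertools import chain, combinations


def powerset(components):
    """All subsets of `components`, i.e. the parameter space 2^components."""
    items = sorted(components)
    subsets = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]


loops = {"loop1", "loop2", "loop3"}   # hypothetical loop labels
space = powerset(loops)               # 2**3 == 8 candidate values of π
```

The empty set corresponds to the uniformly sound analysis and the full set to the uniformly unsound one; every other element is a selectively unsound choice.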

SLIDE 13

System Overview

[Figure: two-phase pipeline. Training Harmless Unsoundness: Codebase, Training Data Generation, Training Data, Machine Learning, Classifier. Inferring Harmless Unsoundness: Test Program, Classifier, π, F.]

SLIDE 14

Training Data Generation

  • Given a codebase w/ known bugs + a sound static analyzer
  • Collect precision-decreasing yet harmless pgm components
  • e.g., unrolling a loop reduces only FP but retains all TP

training pgm                          # true alarms   # false alarms
loop 1  loop 2  loop 3  ...  loop n   5               10
if 1    loop 2  loop 3  ...  loop n   5               8
loop 1  if 2    loop 3  ...  loop n   4               10
loop 1  loop 2  if 3    ...  loop n   5               5
…
loop 1  loop 2  loop 3  ...  if n     3               3
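The labeling step can be sketched with the alarm counts shown on the slide. The criterion below (all true alarms retained, strictly fewer false alarms) is one reading of "reduces only FP but retains all TP"; the function name is illustrative.

```python
# (true alarms, false alarms) reported by the sound analyzer on the original program.
baseline = (5, 10)

# Alarm counts after unrolling exactly one loop, taken from the rows shown on the slide.
after_unrolling = {
    "loop 1": (5, 8),
    "loop 2": (4, 10),
    "loop 3": (5, 5),
    "loop n": (3, 3),
}


def is_harmless(baseline, result):
    """Harmless: every true alarm is retained and at least one false alarm disappears."""
    base_true, base_false = baseline
    true_alarms, false_alarms = result
    return true_alarms == base_true and false_alarms < base_false


labels = {loop: is_harmless(baseline, counts) for loop, counts in after_unrolling.items()}
```

Under this reading, loop 1 and loop 3 become positive training examples, while loop 2 and loop n are negative because unrolling them loses true alarms.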

SLIDE 15

Features & Learning

  • Encode each program component as a feature vector

f(x) = <f1(x), f2(x), …, fn(x)>

f(loop1) = <1, 0, …, 1>
f(loop2) = <0, 1, …, 1>
f(lib1) = <0, 1, …, 0>
f(lib2) = <1, 1, …, 1>

  • Derive a classifier using an off-the-shelf algorithm
  • e.g., SVM
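A minimal sketch of the encoding f(x) = <f1(x), …, fn(x)>. It assumes each component is summarized as a dictionary of facts and uses a hypothetical four-feature subset; dividing by a maximum is only one possible normalization, since the slides do not say how numeric features are normalized.

```python
from typing import Any, Callable, Dict, List

# A component is described by a plain dict of facts (an assumption of this sketch);
# a real analyzer would compute them from the syntax tree and the abstract state.
Component = Dict[str, Any]

# Hypothetical subset of the loop features, in a fixed order.
FEATURES: List[Callable[[Component], float]] = [
    lambda c: float("NULL" in c["cond"]),     # Null
    lambda c: float("&&" in c["cond"]),       # Conjunction
    lambda c: float(c["finite_string"]),      # FinString
    lambda c: c["exits"] / c["max_exits"],    # Exit, normalized by an assumed maximum
]


def encode(component: Component) -> List[float]:
    """Apply every feature function in order to build the feature vector."""
    return [feature(component) for feature in FEATURES]


loop1 = {"cond": "*p", "finite_string": True, "exits": 1, "max_exits": 4}
loop2 = {"cond": "i < size && p != NULL", "finite_string": False, "exits": 2, "max_exits": 4}

vec1 = encode(loop1)   # [0.0, 0.0, 1.0, 0.25]
vec2 = encode(loop2)   # [1.0, 1.0, 0.0, 0.5]
```

Vectors like these, paired with harmless/harmful labels, are what an off-the-shelf learner would take as training input.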
SLIDE 16

Features

Feature      Property   Type     Description
Null         Syntactic  Binary   Whether the loop condition contains nulls or not
Const        Syntactic  Binary   Whether the loop condition contains constants or not
Array        Syntactic  Binary   Whether the loop condition contains array accesses or not
Conjunction  Syntactic  Binary   Whether the loop condition contains && or not
IdxSingle    Syntactic  Binary   Whether the loop condition contains an index for a single array in the loop
IdxMulti     Syntactic  Binary   Whether the loop condition contains an index for multiple arrays in the loop
IdxOutside   Syntactic  Binary   Whether the loop condition contains an index for an array outside of the loop
InitIdx      Syntactic  Binary   Whether an index is initialized before the loop
Exit         Syntactic  Numeric  The (normalized) number of exits in the loop
Size         Syntactic  Numeric  The (normalized) size of the loop
ArrayAccess  Syntactic  Numeric  The (normalized) number of array accesses in the loop
ArithInc     Syntactic  Numeric  The (normalized) number of arithmetic increments in the loop
PointerInc   Syntactic  Numeric  The (normalized) number of pointer increments in the loop
Prune        Semantic   Binary   Whether the loop condition prunes the abstract state or not
Input        Semantic   Binary   Whether the loop condition is determined by external inputs
GVar         Semantic   Binary   Whether global variables are accessed in the loop condition
FinInterval  Semantic   Binary   Whether a variable has a finite interval value in the loop condition
FinArray     Semantic   Binary   Whether a variable has a finite size of array in the loop condition
FinString    Semantic   Binary   Whether a variable has a finite string in the loop condition
LCSize       Semantic   Binary   Whether a variable has an array of which the size is a left-closed interval
LCOffset     Semantic   Binary   Whether a variable has an array of which the offset is a left-closed interval
#AbsLoc      Semantic   Numeric  The (normalized) number of abstract locations accessed in the loop

  • 22 features for loops
SLIDE 17

Features

Feature      Property   Type     Description
Const        Syntactic  Binary   Whether the parameters contain constants or not
Void         Syntactic  Binary   Whether the return type is void or not
Int          Syntactic  Binary   Whether the return type is int or not
CString      Syntactic  Binary   Whether the function is declared in string.h or not
InsideLoop   Syntactic  Binary   Whether the function is called in a loop or not
#Args        Syntactic  Numeric  The (normalized) number of arguments
DefParam     Semantic   Binary   Whether a parameter is defined in a loop or not
UseRet       Semantic   Binary   Whether the return value is used in a loop or not
UptParam     Semantic   Binary   Whether a parameter is updated via the library call
Escape       Semantic   Binary   Whether the return value escapes the caller
GVar         Semantic   Binary   Whether a parameter points to a global variable
Input        Semantic   Binary   Whether a parameter is determined by external inputs
FinInterval  Semantic   Binary   Whether a parameter has a finite interval value
#AbsLoc      Semantic   Numeric  The (normalized) number of abstract locations accessed in the arguments
#ArgString   Semantic   Numeric  The (normalized) number of string arguments

  • 15 features for library calls
SLIDE 18

Winning Features

  • Interval analysis
  • loops iterating on finite strings
  • library calls that return integers or manipulate strings

str = "hello world";
for (p = str; *p; p++) ...

int r = lib1();
lib2(str1, str2);

SLIDE 19

Winning Features

  • Interval analysis
  • loops iterating on finite strings
  • library calls that return integers or manipulate strings

str = "hello world";          // finite string
for (p = str; *p; p++) ...    // array access, ptr increment

int r = lib1();               // return integer
lib2(str1, str2);             // str manipulation

SLIDE 20

Winning Features

  • Taint analysis
  • library calls not propagating user inputs

r1 = random();
r2 = strlen(s)
r3 = fread(fd,buf,len)
r4 = recv(s,len,flags)

SLIDE 21

Winning Features

  • Taint analysis
  • library calls not propagating user inputs

r1 = random();
r2 = strlen(s)
r3 = fread(fd,buf,len)
r4 = recv(s,len,flags)

(# arguments, # abs. locations) for random, strlen  <  (# arguments, # abs. locations) for fread, recv
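A toy model of skipping selected library calls in a taint analysis. It assumes, purely for illustration, that a conservative analyzer treats the result of every unknown library call as possibly user-controlled; this is not the taint semantics of the analyzer in the talk, and the `skipped` set is chosen by hand here rather than by a classifier.

```python
def tainted_results(calls, skipped):
    """Return the results considered tainted.

    Conservative toy rule: every library call result may carry user input,
    except for calls listed in `skipped`, which are analyzed unsoundly (ignored).
    """
    return {result for result, func in calls if func not in skipped}


# (result variable, library function) pairs from the slide.
calls = [("r1", "random"), ("r2", "strlen"), ("r3", "fread"), ("r4", "recv")]

sound = tainted_results(calls, skipped=set())
selective = tainted_results(calls, skipped={"random", "strlen"})
```

Skipping random and strlen removes r1 and r2 from the tainted set while r3 and r4, which come from calls that read external data, stay tainted.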

SLIDE 22

Summary

  • First selectively unsound static analysis
  • more effective than uniformly sound / unsound ones
  • systematic way to tune unsoundness by ML

[Figure: three panels over program states, labeled Sound, Uniformly Unsound, and Selectively Unsound]

Thank you