 
Vulnerability Extrapolation
USENIX WOOT 2011
Fabian "fabs" Yamaguchi
Recurity Labs GmbH, Germany
Agenda
• Patterns you find when auditing code
• Exploiting these patterns: Vulnerability Extrapolation
• Using machine learning to get there
• A method to assist in manual code audits based on this idea
• The method in practice
• A showcase
Exploring a new code base
• Like an area of mathematics you don't yet know.
• It's not completely different from the mathematics you already know.
• But there are secrets specific to this area:
    • Vocabulary
    • Recurring patterns in argumentation
    • Weird tricks used in proofs
• Understanding the specifics of the area makes it a lot easier to reason about it.
Another Example: libTIFF
CVE-2006-3459 | CVE-2010-2067

    static int
    TIFFFetchShortPair(TIFF* tif, TIFFDirEntry* dir)
    {
        switch (dir->tdir_type) {
        case TIFF_BYTE:
        case TIFF_SBYTE:
            {
                uint8 v[4];
                return TIFFFetchByteArray(tif, dir, v)
                    && TIFFSetField(tif, dir->tdir_tag, v[0], v[1]);
            }
        case TIFF_SHORT:
        case TIFF_SSHORT:
            {
                uint16 v[2];
                return TIFFFetchShortArray(tif, dir, v)
                    && TIFFSetField(tif, dir->tdir_tag, v[0], v[1]);
            }
        default:
            return 0;
        }
    }
Another Example: libTIFF
CVE-2006-3459 | CVE-2010-2067
(shown side by side with TIFFFetchShortPair from the previous slide)

    static int
    TIFFFetchSubjectDistance(TIFF* tif, TIFFDirEntry* dir)
    {
        uint32 l[2];
        float v;
        int ok = 0;

        if (TIFFFetchData(tif, dir, (char *)l)
            && cvtRational(tif, dir, l[0], l[1], &v)) {
            /*
             * XXX: Numerator 0xFFFFFFFF means that we have infinite
             * distance. Indicate that with a negative floating point
             * SubjectDistance value.
             */
            ok = TIFFSetField(tif, dir->tdir_tag,
                              (l[0] != 0xFFFFFFFF) ? v : -v);
        }

        return ok;
    }
LibTIFF: Bug Analysis
• TIFFFetchShortArray is actually a wrapper around TIFFFetchData.
• The two are pretty much synonyms.
• These functions are part of an API local to libTIFF.
• Badly designed API: the amount of data to be copied into the buffer is passed in one of the fields of the dir structure and not explicitly!
• Developers missed this in both cases, and it's hard to blame them.
The times of "grep 'memcpy' ./*.c" may be over.
But that does not mean that patterns of API use that lead to vulnerabilities no longer exist!
Vulnerability Extrapolation
• Given a function known to be vulnerable, determine functions similar to this one in terms of application-specific API usage patterns.
• Vulnerability Extrapolation exploits the information leak you get every time a vulnerability is disclosed!
What needs to be done
• We need to be able to determine how “similar” functions are in terms of dominant programming patterns.
• We need to find a way to extract these programming patterns from a code base in the first place.
• How do we do that?
Similarity – A decomposition
• Decomposition into shape and rotation: if rotation is just a detail, these are pretty similar.
• In face recognition, faces are decomposed into weighted sums of commonly found patterns plus a noise term.
• Signal processing: decomposition into components of different frequencies. Noise is suspected to be of high frequency, while the signal is of lower frequency.
Think of it as 'zooming out'
[Figure: TIFFFetchSubjectDistance written as a sum of three usage patterns. Going from the first pattern to the last, the level of detail/frequency increases and the dominance of the pattern decreases.]
Linear approximation of each function by the most dominant API usage patterns of the code base it is contained in!
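The linear approximation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patterns are taken to be orthonormal vectors over the API symbols, and the names `approximate`, `patterns`, and `function_vector` are hypothetical, not taken from the talk.

```python
def dot(a, b):
    """Inner product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def approximate(function_vector, patterns):
    """Approximate a function by a weighted sum of usage patterns.

    Assumption: `patterns` is a list of orthonormal vectors, ordered
    from most to least dominant. Returns the weight of each pattern and
    the resulting approximation of the function vector.
    """
    weights = [dot(p, function_vector) for p in patterns]
    approximation = [
        sum(w * p[i] for w, p in zip(weights, patterns))
        for i in range(len(function_vector))
    ]
    return weights, approximation


# Hypothetical example: three API symbols, two dominant patterns.
patterns = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
weights, approximation = approximate([2.0, 3.0, 0.5], patterns)
```

Whatever the chosen patterns do not capture (here the third coordinate) is dropped as detail, which is the 'zooming out' effect.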
Extracting dominant patterns
How do we identify the most dominant API usage patterns of a code base?
In face recognition, a standard technique is Principal Component Analysis.
Mapping code to the vector space
• Describe functions by the API symbols they contain.
• API symbols are extracted using a fuzzy parser.
• Each API symbol is associated with a dimension.

    func1() {
        int *ptr = malloc(64);
        fetchArray(pb, ptr);
    }
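As a rough illustration of this mapping, the Python sketch below stands in for the fuzzy parser with a crude tokenizer: an identifier directly followed by "(" is treated as a callee, and an identifier followed by another identifier (optionally with "*" in between) is treated as a type name. The heuristics, the binary weighting, and the names `api_symbols` and `vectorize` are assumptions made for this example; the actual parser is not described on the slide, and comments and string literals are not handled here.

```python
import re

# C keywords that should never be reported as API symbols.
KEYWORDS = {
    "if", "else", "for", "while", "do", "switch", "case", "default",
    "return", "break", "continue", "goto", "sizeof", "struct", "union",
    "enum", "static", "const", "unsigned", "signed",
}
TOKEN = re.compile(r"[A-Za-z_]\w*|\d\w*|\S")
IDENT = re.compile(r"[A-Za-z_]\w*$")


def api_symbols(body):
    """Collect callee and type names from the body of a C function."""
    tokens = TOKEN.findall(body)
    symbols = set()
    for i, tok in enumerate(tokens):
        if not IDENT.match(tok) or tok in KEYWORDS:
            continue
        j = i + 1
        while j < len(tokens) and tokens[j] == "*":
            j += 1  # skip pointer stars between a type and a name
        nxt = tokens[j] if j < len(tokens) else ""
        if j == i + 1 and nxt == "(":
            symbols.add(tok)  # callee: identifier directly before "("
        elif IDENT.match(nxt) and nxt not in KEYWORDS:
            symbols.add(tok)  # type: identifier before a declared name
    return symbols


def vectorize(bodies):
    """Map {function name: body} to binary vectors, one dimension per symbol."""
    symbol_sets = {name: api_symbols(body) for name, body in bodies.items()}
    vocabulary = sorted(set().union(*symbol_sets.values()))
    vectors = {
        name: [1.0 if s in syms else 0.0 for s in vocabulary]
        for name, syms in symbol_sets.items()
    }
    return vocabulary, vectors


vocabulary, vectors = vectorize(
    {"func1": "int *ptr = malloc(64); fetchArray(pb, ptr);"}
)
```

For the slide's `func1`, this yields the dimensions `fetchArray`, `int`, and `malloc`. A multiplication such as `a * b` would be misread as a declaration, which is the kind of imprecision a real fuzzy parser has to handle better.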
Principal Component Analysis
[Figure: decomposition of the data matrix, annotated as follows]
• Data matrix: contains all function vectors.
• Strength of pattern.
• Each column of U is a dominant pattern. Each row is a representation of an API symbol in terms of the most dominant patterns.
• Representation of functions in terms of the most dominant patterns.
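To make the decomposition concrete, here is a minimal, dependency-free Python sketch that recovers the most dominant patterns by power iteration with deflation. It is an illustration, not the implementation used in the talk. Assumptions: the data matrix has one row per API symbol and one column per function, the data is not mean-centered (the slides do not say whether it is), and the function name `dominant_patterns` and the example matrix are hypothetical.

```python
import math


def _matvec(a, x):
    return [sum(r * c for r, c in zip(row, x)) for row in a]


def _normalize(x):
    norm = math.sqrt(sum(c * c for c in x))
    return [c / norm for c in x] if norm > 1e-12 else None


def dominant_patterns(m, k, iterations=200):
    """Return the k most dominant patterns as (strength, pattern) pairs.

    m is a non-empty data matrix: rows are API symbols, columns are
    functions. Each pattern is a unit vector over the API symbols (a
    column of U); its strength is the matching singular value.
    """
    n = len(m)
    # Symbol-by-symbol Gram matrix m * m^T; its eigenvectors are the patterns.
    gram = [[sum(x * y for x, y in zip(m[i], m[j])) for j in range(n)]
            for i in range(n)]
    patterns = []
    for _ in range(k):
        v = _normalize([1.0 / (i + 1) for i in range(n)])  # fixed start vector
        for _step in range(iterations):
            w = _normalize(_matvec(gram, v))
            if w is None:  # nothing left to extract
                break
            v = w
        eigenvalue = sum(x * y for x, y in zip(v, _matvec(gram, v)))
        patterns.append((math.sqrt(max(eigenvalue, 0.0)), v))
        # Deflation: remove the pattern just found before the next round.
        gram = [[gram[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
                for i in range(n)]
    return patterns


# Hypothetical data: symbols 0 and 1 occur together in three functions,
# symbols 2 and 3 occur together in two other functions.
m = [
    [1.0, 1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
]
patterns = dominant_patterns(m, k=2)
```

On this data the strongest pattern puts equal weight on symbols 0 and 1, the second on symbols 2 and 3. A function's representation in pattern space is then the inner product of its column with each pattern. In practice a library routine for singular value decomposition would replace the hand-written iteration.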
In summary
A toy problem to gain an intuition
Group 1

    void guiFunc1(GtkWidget *widget)
    {
        int j;
        gui_make_window(widget);
        GtkButton *button;
        button = gui_new_button();
        gui_show_window();
    }

    void guiFunc2(GtkWidget *widget)
    {
        gui_make_window(widget);
        GtkButton *myButton;
        button1 = gui_new_button();
        button2 = gui_new_button();
        button3 = gui_new_button();
        for(int i = 10; i != i; i++)
            do_gui_stuff();
    }
Group 2

    void netFunc1()
    {
        int fd;
        int i = 0;
        struct sockaddr_in in;
        fd = socket(arguments);
        recv(fd, moreArguments);
        if(condition){
            i++;
            send(fd, i, arg);
        }
        send(fd, i, arg);
        close(fd);
    }

    void netFunc2()
    {
        int fd;
        struct sockaddr_in in;
        hostent host;
        fd = socket(arguments);
        recv(fd, moreArguments);
        gethostbyname(host);
        if(condition){
            int i = 0;
            i++;
            send(fd, i, arg);
        }
        close(fd);
    }
Group 3

    void listFunc1(int elem)
    {
        GList myList;
        if(!list_check(myList)){
            do_list_error_stuff();
            return;
        }
        list_add(myList, elem);
    }

    void listFunc2(int elem)
    {
        GList myList;
        if(!list_check(myList)){
            do_list_error_stuff();
            return;
        }
        list_remove(myList, elem);
        list_delete(myList);
    }
Projection onto the first two principal components
[Figure: the toy functions projected onto the first two principal components. Annotations: "Core API functions"; "Occurs in this context but does not constitute the pattern".]
Vulnerability Extrapolation
• Take a function that used to be vulnerable as an input.
• Measure distances to other functions to determine the functions that are most similar.
• Let's try that for FFmpeg.
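The ranking step might look like the following Python sketch. The slide does not name the distance measure; cosine distance is assumed here, and `extrapolate`, `representations`, and the example vectors are hypothetical names and data chosen for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty function as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def extrapolate(known_vulnerable, representations, top=5):
    """Rank all other functions by distance to the known-vulnerable one.

    `representations` maps function names to their coordinates in the
    space of dominant patterns. Returns (name, distance) pairs, closest
    first.
    """
    query = representations[known_vulnerable]
    ranked = sorted(
        (cosine_distance(query, vec), name)
        for name, vec in representations.items()
        if name != known_vulnerable
    )
    return [(name, dist) for dist, name in ranked[:top]]


# Hypothetical two-pattern representations of three functions.
representations = {
    "decode_a": [1.0, 0.0],
    "decode_b": [0.9, 0.1],
    "list_helper": [0.0, 1.0],
}
candidates = extrapolate("decode_a", representations)
```

The auditor then reviews the candidates from the top of the list down; the ranking only prioritizes manual inspection and does not itself decide whether a function is vulnerable.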
Original bug: CVE-2010-3429

    static int flic_decode_frame_8BPP(AVCodecContext *avctx,
                                      void *data, int *data_size,
                                      const uint8_t *buf, int buf_size)
    {
        [..]
        pixels = s->frame.data[0];
        [..]
        case FLI_DELTA:
            y_ptr = 0;
            compressed_lines = AV_RL16(&buf[stream_ptr]);
            stream_ptr += 2;
            while (compressed_lines > 0) {
                line_packets = AV_RL16(&buf[stream_ptr]);
                stream_ptr += 2;
                if ((line_packets & 0xC000) == 0xC000) {
                    // line skip opcode
                    line_packets = -line_packets;
                    y_ptr += line_packets * s->frame.linesize[0];
                } else if ((line_packets & 0xC000) == 0x4000) {
                    [..]
                } else if ((line_packets & 0xC000) == 0x8000) {
                    // "last byte" opcode
                    pixels[y_ptr + s->frame.linesize[0] - 1] = line_packets & 0xff;
                } else {
                    [..]
                    y_ptr += s->frame.linesize[0];
                }
            }
            break;
        [..]
    }

Decoder pattern:
• Usually a variable of type AVCodecContext.
• AV_RL* functions used as sources.
• Lots of primitive types with specified width used.
• Use of memcpy, memset, etc.

The bug: unchecked index, write to arbitrary location in memory.
Extrapolation
• The closest match contained the same vulnerability, but it was fixed when the initial function was fixed.
0-Day