

SLIDE 1

Recovering System Specific Rules from Software Repositories

Chadd Williams, Jeff Hollingsworth

SLIDE 2

University of Maryland

Problem

How much do you know about your 10-year-old code base?

– didn’t someone rewrite the matrix objects?

  • how do you transform an image now?

Implicit rules build up over time

– little or no documentation
– failure to understand implicit rules causes bugs

  • 32% of bugs detected during maintenance [1]

We can discover implicit rules by looking at code changes

[1] Matsumura, T., Monden, A., Matsumoto, K., The Detection of Faulty Code Violating Implicit Coding Rules, IWPSE ’02

SLIDE 3

Implicit Rule

Function Usage Pattern

– how functions are invoked with respect to each other in the source code
– describe relationships between functions
– static analysis - intraprocedural

Called After

    HDC hdc = BeginPaint( hwnd, &ps );
    if( hdc ) DrawIcon( hdc, x, y, hIcon );
    EndPaint( hwnd, &ps );

Conditionally Called After

    mdi = HeapAlloc(GetProcessHeap());
    if (!mdi) HeapFree(GetProcessHeap(), 0, cs);

SLIDE 4

Function Usage Pattern Miner

Find new instances of relationships

– where that instance was not found in the revision immediately prior

Preliminary filtering heuristic

– function calls within 10 source lines of code

  • many APIs contain functions that are called in quick succession
  • error handling is near the error-producing function

    int foo(){
        open();
    }

Change

    int foo(){
        open();
        read();
    }

new instance of read() called after open()
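
The mining step above can be sketched roughly as follows. This is a minimal illustration, not the authors' tool: the `Call` and `Pair` types, the function names, and the `MAX_PAIRS` limit are all hypothetical, and it assumes the call sites of one function body have already been extracted and sorted by line for both revisions. It also compares pairs by name only, which is a simplification of "instance".

```c
#include <string.h>

#define WINDOW 10      /* the slide's 10-source-line filtering heuristic */
#define MAX_PAIRS 64   /* hypothetical fixed capacity for this sketch */

typedef struct { const char *name; int line; } Call;   /* one call site */
typedef struct { const char *first, *second; } Pair;   /* second is called after first */

/* Collect every (first, second) pair where second follows first within
   WINDOW source lines. `calls` must be sorted by line. Returns the pair count. */
static int pairs_in_window(const Call *calls, int n, Pair *out, int max_out) {
    int count = 0;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n && calls[j].line - calls[i].line <= WINDOW; j++)
            if (count < max_out)
                out[count++] = (Pair){ calls[i].name, calls[j].name };
    return count;
}

/* Return 1 if pair p (compared by function names) is already in pairs[0..n). */
static int contains(const Pair *pairs, int n, Pair p) {
    for (int i = 0; i < n; i++)
        if (strcmp(pairs[i].first, p.first) == 0 &&
            strcmp(pairs[i].second, p.second) == 0)
            return 1;
    return 0;
}

/* New instances: pairs found in the current revision that were not found
   in the revision immediately prior. Each distinct pair is reported once. */
static int new_instances(const Call *prev, int np, const Call *cur, int nc,
                         Pair *out, int max_out) {
    Pair before[MAX_PAIRS], after[MAX_PAIRS];
    int nb = pairs_in_window(prev, np, before, MAX_PAIRS);
    int na = pairs_in_window(cur, nc, after, MAX_PAIRS);
    int count = 0;
    for (int i = 0; i < na; i++)
        if (!contains(before, nb, after[i]) &&
            !contains(out, count, after[i]) && count < max_out)
            out[count++] = after[i];
    return count;
}
```

For the foo() example on this slide, the prior revision holds only open() and the current revision holds open() followed by read(), so the sketch would report the single new pair open() -> read().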

SLIDE 5

Classification of Mined Data

Each mined instance is classified by how it entered the source code:

– both of the function calls were added

  • instance added in full

– one function call was added

  • the added function completed the pairing
  • bug fix? refactoring?

– neither of the function calls were added

  • deleted code? control flow change?

    int foo(){
    }

Change

    int foo(){
        open();
        read();
    }

Change

    int foo(){
        open();
        read();
        close();
    }
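
The three classes above can be expressed as a small sketch. It rests on a loud assumption that is not in the slides: a call counts as "added" when its name does not appear among the calls of the same function in the prior revision. The `Origin` enum and both function names are hypothetical.

```c
#include <string.h>

/* How a mined instance entered the source code (names are hypothetical). */
typedef enum { ADDED_BOTH, ADDED_ONE, ADDED_NEITHER } Origin;

/* Return 1 if `name` was already called in the prior revision. */
static int was_present(const char *const *prev_calls, int n, const char *name) {
    for (int i = 0; i < n; i++)
        if (strcmp(prev_calls[i], name) == 0)
            return 1;
    return 0;
}

/* Classify the instance first -> second against the prior revision's calls.
   Assumption: "added" means absent from the prior revision by name. */
static Origin classify(const char *const *prev_calls, int n,
                       const char *first, const char *second) {
    int first_added = !was_present(prev_calls, n, first);
    int second_added = !was_present(prev_calls, n, second);
    if (first_added && second_added) return ADDED_BOTH;
    if (first_added || second_added) return ADDED_ONE;
    return ADDED_NEITHER;   /* e.g. deleted code or a control flow change */
}
```

In the foo() example, the first change starts from an empty body, so open() -> read() is classified as added in full; the second change adds only close(), so read() -> close() is classified as created by adding one function call.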

SLIDE 6

Rating Mined Relationships

Determine support and confidence for each mined relationship

– confidence of foo() -> bar()

  • in what percent of instances that start with foo() is foo() followed by bar()?

– support of foo() -> bar()

  • what percent, of all instances found, are foo() -> bar()?

– present a sorted list to the user

  • sort on support then confidence
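
The two ratings defined above can be computed as in the following sketch. The `Instance` and `Rating` types and the function names are hypothetical, and sorting in descending order (highest support first) is an assumption, since the slide gives only the sort keys.

```c
#include <stdlib.h>   /* qsort, for callers that sort the ratings */
#include <string.h>

typedef struct { const char *first, *second; } Instance;   /* one mined instance */
typedef struct { double support, confidence; } Rating;     /* both in percent */

/* Rate the relationship first -> second over all mined instances.
   support    = instances equal to first -> second / all instances
   confidence = instances equal to first -> second / instances starting with first */
static Rating rate(const Instance *all, int n, const char *first, const char *second) {
    int both = 0, starts = 0;
    for (int i = 0; i < n; i++) {
        if (strcmp(all[i].first, first) == 0) {
            starts++;
            if (strcmp(all[i].second, second) == 0)
                both++;
        }
    }
    Rating r = { 0.0, 0.0 };
    if (n > 0) r.support = 100.0 * both / n;
    if (starts > 0) r.confidence = 100.0 * both / starts;
    return r;
}

/* qsort comparator: sort on support, then confidence, highest first (assumed order). */
static int by_support_then_confidence(const void *a, const void *b) {
    const Rating *x = a, *y = b;
    if (x->support != y->support) return x->support < y->support ? 1 : -1;
    if (x->confidence != y->confidence) return x->confidence < y->confidence ? 1 : -1;
    return 0;
}
```

With four mined instances (open -> read twice, open -> close, lock -> unlock), open -> read has support 50% (2 of 4) and confidence of about 66.7% (2 of the 3 instances that start with open).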

SLIDE 7

Preliminary Case Study

Mined Wine CVS repository

– 15,666 unique relationships added > 9 times
– 862 unique relationships added > 99 times

What relationships are found in CVS?

– how was it added to the source code?
– compare to relationships in the latest version of the source code

How can this help us find bugs?

Can we mine data for a specific API?

SLIDE 8

How do the Top 25 of the lists differ?

COUNT   Called After Relationship
 1742   GetDlgItem GetDlgItem
 1747   VariantChangeTypeEx GetProcAddress
 1985   GetProcessHeap GetProcessHeap
 2294   memcmp memcmp
 2851   GetProcessHeap HeapAlloc
 3098   printf printf
 3577   GetProcessHeap HeapFree
 3605   GetProcAddress GetProcAddress
 6700   VariantChangeTypeEx VariantChangeTypeEx
12671   fprintf fprintf

COUNT   Called After Relationship
  233   GetProcessHeap GetProcessHeap
  342   memcmp memcmp
  480   HeapFree GetProcessHeap
  768   RtlFreeHeap GetProcessHeap
  816   GetProcessHeap HeapAlloc
 1100   GetProcAddress GetProcAddress
 1200   GetProcessHeap HeapFree
 1251   GetProcessHeap RtlAllocateHeap
 1782   GetProcessHeap RtlFreeHeap
 2606   fprintf fprintf

Most similar to latest version

– added both function calls

  • sum of differences in ranking: 91
  • items unique to one list: 8

Least similar to latest version

– added one function call

  • sum of differences in ranking: 41
  • items unique to one list: 28

Table captions:

Relationships Created By Adding One Function Call
Relationships found in the Latest Version of the Source Code

SLIDE 9

What relationships were found?

EnterCriticalSection -> LeaveCriticalSection

– in latest version: 939 times

How were the instances created?

– add both function calls: 1,277 times
– add one function call: 5 times

    EnterCriticalSection( &(This->lock) );
    uRef = ++(This->ref);
    if (This->driver)
        IDsCaptureDriver_AddRef(This->driver);
    LeaveCriticalSection( &(This->lock) );

– added one function but did not complete the pairing: 82 times

  • 78 of these uncompleted pairings were because of the 10 line heuristic

SLIDE 10

How can this help us find bugs?

Profile of a bug plagued relationship

– created often by adding one function call
– rarely created by adding two function calls

Possible bug

– TREEVIEW_UpdateScrollBars -> TREEVIEW_Invalidate

– update the scroll bars after adding items
– invalidate the Treeview so it gets redrawn

    for ( Each Item In the List ) {
        TREEVIEW_DrawItem(infoPtr, hdc, wineItem);
    }
    TREEVIEW_UpdateScrollBars(infoPtr);
    . . .
    return;

SLIDE 11

Mining Relationships for an API

What relationships are found between functions declared in an API?

msiquery.c - database access API

– two sets of functions:

  • MsiFoo( , LPCSTR, ) and MSI_Foo( , LPCWSTR, )

– MsiDatabaseOpenViewA -> MsiViewExecute
– MSI_DatabaseOpenViewW -> MSI_ViewExecute

Heap access functions

– HeapAlloc(GetProcessHeap(), . . . )
– HeapAlloc() -> HeapFree()

SLIDE 12

Future Work

Apply our tool to more projects

– projects that use a common external library

Track removed usage patterns

Better filtering heuristic

– control flow based
– data flow based

How do we use the patterns we find?

– documentation
– feed patterns to static source code checkers to find violations

    hdc = BeginPaint( hwnd, &ps );
    if( hdc ) DrawIcon( hdc, x, y, hIcon );
    EndPaint( hwnd, &ps );


SLIDE 14

Backup Slides

SLIDE 15

How do the Top 25 of the lists differ?

Difference metric

– distance between rankings of common items
– number of items unique to each list

Most similar to latest version

– Added both function calls

  • sum of differences in ranking: 50
  • items unique to one list: 18

Least similar to latest version

– Added one function call

  • sum of differences in ranking: 12
  • items unique to one list: 48
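
The difference metric on this slide could be computed along the following lines. This is a sketch under stated assumptions: each relationship is represented as one string, a rank is an item's position in its list, no list contains duplicates, and "unique" counts items that appear in exactly one of the two lists. The `ListDiff` type and the function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    int rank_distance;   /* sum of rank differences over items common to both lists */
    int unique_items;    /* items that appear in only one of the two lists */
} ListDiff;

/* Return the 0-based rank of `item` in `list`, or -1 if it is absent. */
static int rank_of(const char *const *list, int n, const char *item) {
    for (int i = 0; i < n; i++)
        if (strcmp(list[i], item) == 0)
            return i;
    return -1;
}

/* Compare two ranked lists (best first). Assumes no duplicates within a list. */
static ListDiff compare_rankings(const char *const *a, int na,
                                 const char *const *b, int nb) {
    ListDiff d = { 0, 0 };
    for (int i = 0; i < na; i++) {
        int j = rank_of(b, nb, a[i]);
        if (j < 0)
            d.unique_items++;              /* only in list a */
        else
            d.rank_distance += abs(i - j); /* common item: add rank difference */
    }
    for (int j = 0; j < nb; j++)
        if (rank_of(a, na, b[j]) < 0)
            d.unique_items++;              /* only in list b */
    return d;
}
```

Under this reading, two Top 25 lists can differ by at most 50 unique items, which is consistent with the counts reported above.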

SLIDE 16

Source Code Change History

We can discover implicit rules by looking at code changes

– every change is committed
– changes highlight misunderstood code
– changes highlight new code

Studying each commit gives fine-grain knowledge

– how quickly does a rule emerge?
– how fast is a rule adopted?
– how often is it used later?

SLIDE 17

Debug functions in Wine

Many of the relationships involve a debug statement

– overwhelmed the rest of the results
– filtered from the data
– future work:

  • what can we determine about the proper use of debug statements?

    if (RegOpenKeyA(HKEY, name, &key)) {
        RegCloseKey(key);
        TRACE(message);
    }

SLIDE 18

Relations highlighted by CVS mining

Data Flow Functionality

– GetDlgItem -> EnableWindow

    case WM_USER:
        EnableWindow( GetDlgItem( … ), FALSE );
        EnableWindow( GetDlgItem( … ), FALSE );
        EnableWindow( GetDlgItem( … ), FALSE );
        SetFocus( GetDlgItem( hwnd, IDC_TOOLBARBTN_LBOX ) );
        return TRUE;

SLIDE 19

Conditionally Called After

3,872 unique patterns added 10 or more times

    if (!(hModule = LoadLibraryExA(fileName, 0, LLDF)))
        WINE_ERR("LoadLibraryExA (%s) failed, %ld\n", fileName, GetLastError());

Error handling code

– conditionally report error
– which functions need errors handled

Debug code

– conditionally call a debug function

SLIDE 20

Transitive Patterns

called after may be a transitive pattern

– only a binary pattern
– allow larger patterns to be built

Patterns Identified

– may need to add more context information

DeleteObject called after EndPaint
TextOutA called after DeleteObject
SetTextColor called after TextOutA
SelectObject called after SetTextColor
BeginPaint called after SelectObject

SLIDE 21

Chains of relationships

Search through the relationships

– relationships created by adding two functions
– find relationships of high confidence and support such that:

    case WM_USER:
        EnableWindow( GetDlgItem( … ), FALSE );
        EnableWindow( GetDlgItem( … ), FALSE );

GetDlgItem() -> GetDlgItem()
GetDlgItem() -> EnableWindow()
EnableWindow() -> GetDlgItem()
GetDlgItem() -> EnableWindow() -> GetDlgItem()

SLIDE 22

Data flow functionality

– LoadCursorA -> RegisterClassA

  • in latest version: 42 times
  • add both function calls: 43 times

    wClass.hCursor = LoadCursorA (…);
    RegisterClassA (&wClass);

SLIDE 23

RtlHeapFree Called After RtlHeapAlloc
Value: 8
dlls/kernel/heap.c
dlls/ntdll/loader.c