Scalable Detection of Semantic Clones: presentation by Mark Gabel, Lingxiao Jiang, and Zhendong Su

SLIDE 1

Scalable Detection of Semantic Clones

Mark Gabel, Lingxiao Jiang, Zhendong Su

SLIDE 2

Motivation

  • Maintenance problem
      ◦ Refactoring
      ◦ Automated procedure extraction
  • Aspect mining
  • Program understanding
  • Copy/paste bugs

SLIDE 3

Clone Detection

  • Definition
      ◦ The enumeration of similar fragments of a program or set of programs
  • Input:
      ◦ A program or set of programs
  • Output:
      ◦ “Clone Groups,” sets of equivalent fragments
      ◦ In terms of a similarity function

SLIDE 4

Similarity of Program Fragments

Semantic Awareness of Clone Detection: Strings

  • 1992: Baker, parameterized string algorithm
  • Current open source tools: Checkstyle, PMD

SLIDE 5

Similarity of Program Fragments

Semantic Awareness of Clone Detection: Strings → Tokens

  • 2002: Kamiya et al., CCFinder
  • 2004: Li et al., CP-Miner
  • 2007: Basit et al., Repeated Tokens Finder

SLIDE 6

Similarity of Program Fragments

Semantic Awareness of Clone Detection: Strings → Tokens → Syntax Trees

  • 1998: Baxter et al., CloneDR
  • 2004: Wahler et al., XML-based
  • 2007: Jiang et al., DECKARD

SLIDE 7

Interleaved Clones

int func(int i, int j) {
    int k = 10;
    while (i < k) {
        i++;
    }
    j = 2 * k;
    printf("i=%d, j=%d\n", i, j);
    return k;
}

int func_timed(int i, int j) {
    int k = 10;
    long start = get_time_millis();
    long finish;
    while (i < k) {
        i++;
    }
    finish = get_time_millis();
    /* %ld matches the long operands */
    printf("loop took %ldms\n", finish - start);
    j = 2 * k;
    printf("i=%d, j=%d\n", i, j);
    return k;
}

Clones: Separate Computations

SLIDE 8

Program Dependence Graphs

void bar() {
    int j = 1;
    int i = 0;
    while (j < 10)
        j++;
    printf("%d", i);
    printf("%d", j);
}

[Figure: PDG for bar() with nodes i=0, j=1, j<10, j++, and two printf call nodes, each with a format-string argument and an argument (i or j)]

SLIDE 9

Similarity of Program Fragments

Semantic Awareness of Clone Detection: Strings → Tokens → Syntax Trees → Program Dependence Graphs

  • 2000, 2001: Komondoor and Horwitz
  • 2006: Liu et al., GPLAG
  • This work – first scalable technique

SLIDE 10

Approach

[Figure: pipeline. Program → AST, PDG → Separate Distinct Computations → PDG Subgraphs → Map to Structured Syntax → AST Forests → Clone Detection Algorithm → Semantic Clones]

  1. Separate distinct computations as PDG subgraphs.
  2. Map subgraphs to structured syntax forests.
  3. Find clones within the forests.

SLIDE 11

Separating Computations

void bar() {
    int j = 1;
    int i = 0;
    while (j < 10)
        j++;
    printf("%d", i);
    printf("%d", j);
}

  • Connected vertices have a semantic relationship
  • Break implicit control dependences and partition the PDG into weakly connected components.

[Figure: PDG for bar() with nodes i=0, j=1, j<10, j++, and two printf call nodes, each with a format-string argument and an argument (i or j)]
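
The partitioning step can be sketched roughly as follows. This is only an illustration, not the tool's implementation: it assumes the control dependences have already been dropped, treats the remaining data-dependence edges of bar() as undirected, and groups vertices with a union-find structure. The vertex ids, the edge list, and the names `uf_find`, `uf_union`, and `weak_components` are all made up for this example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VERTICES 64

/* Hypothetical vertex ids for the PDG of bar(). */
enum { V_I_INIT, V_J_INIT, V_J_TEST, V_J_INC, V_PRINT_I, V_PRINT_J, V_COUNT };

struct edge { int from, to; };

/* Data-dependence edges assumed to remain once control dependences are
 * removed (an illustrative reading of the slide's figure). */
const struct edge bar_edges[] = {
    { V_I_INIT, V_PRINT_I },  /* i = 0 -> printf("%d", i) */
    { V_J_INIT, V_J_TEST },   /* j = 1 -> j < 10          */
    { V_J_INIT, V_J_INC },    /* j = 1 -> j++             */
    { V_J_INC,  V_J_TEST },   /* j++   -> j < 10          */
    { V_J_INIT, V_PRINT_J },  /* j = 1 -> printf("%d", j) */
    { V_J_INC,  V_PRINT_J },  /* j++   -> printf("%d", j) */
};

/* Returns the representative of v's set, shortening paths as it goes. */
static int uf_find(int *parent, int v) {
    while (parent[v] != v) {
        parent[v] = parent[parent[v]];
        v = parent[v];
    }
    return v;
}

/* Merges the sets containing a and b. */
static void uf_union(int *parent, int a, int b) {
    int ra = uf_find(parent, a);
    int rb = uf_find(parent, b);
    if (ra != rb)
        parent[ra] = rb;
}

/* Writes one representative per vertex into comp[] and returns the
 * number of weakly connected components. Edge direction is ignored. */
int weak_components(int n, const struct edge *edges, size_t n_edges, int *comp) {
    int parent[MAX_VERTICES];
    int count = 0;
    assert(n > 0 && n <= MAX_VERTICES);
    for (int v = 0; v < n; v++)
        parent[v] = v;
    for (size_t e = 0; e < n_edges; e++)
        uf_union(parent, edges[e].from, edges[e].to);
    for (int v = 0; v < n; v++) {
        comp[v] = uf_find(parent, v);
        if (comp[v] == v)
            count++;
    }
    return count;
}
```

Under these assumptions, calling `weak_components(V_COUNT, bar_edges, 6, comp)` separates the vertices that touch `i` from the vertices that touch `j`, which matches the two computations the slide highlights.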

SLIDE 12

Semantic Threads

struct file_stat *compute_statistics() {
    struct file_stat *result = malloc(sizeof(struct file_stat));
    int avg_temp_file_size = 0;
    int avg_data_file_size = 0;

    /* iterate the temp files */
    ...
    /* iterate the data files */
    ...
    /* avg results and store in avg_temp_file_size */
    ...
    /* avg results and store in avg_data_file_size */
    ...

    result->temp_size = avg_temp_file_size;
    result->data_size = avg_data_file_size;
    return result;
}

SLIDE 13

Semantic Threads

int count_list_nodes(struct list_node *head) {
    int i = 0;
    struct list_node *tail = head->prev;
    while (head != tail && i < MAX) {
        i++;
        head = head->next;
    }
    return i;
}

SLIDE 14

Enumerating Semantic Threads

  • Semantic thread:
      ◦ Forward slice or union of forward slices
  • Interesting semantic threads:
      ◦ Overlap by at most g nodes
      ◦ Set of maximal size
      ◦ No fully subsumed threads

SLIDE 15

Semantic Threads in Practice

Project    Procedures   Procs w/ interleaved g=0 STs   Procs w/ interleaved g=3 STs
GIMP           13,337                            903                          3,008
GTK            13,284                            697                          2,380
MySQL          14,408                          1,618                          2,441
Postgres        9,276                          1,221                          2,267
Linux         136,480                         10,609                         22,514

SLIDE 16

Mapping and Solving

  • Syntactic Image: m : G → { AST }
      ◦ Interesting Semantic Threads → Interesting AST Forests
  • Clone Detection: DECKARD
      ◦ Numerical vector approximation of trees
      ◦ Clustering as a near-neighbor problem
      ◦ Scalable solution
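
The idea of approximating trees by numerical vectors can be sketched as below. This is a deliberate simplification and not DECKARD's actual vector construction or clustering: it just counts node kinds in each tree, sums the counts over a forest, and compares two vectors by squared Euclidean distance. The node kinds, the `ast_node` layout, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node kinds; a real AST has many more. */
enum node_kind { K_ASSIGN, K_WHILE, K_CALL, K_BINOP, K_IDENT, K_CONST, K_KINDS };

struct ast_node {
    enum node_kind kind;
    const struct ast_node *const *children;  /* may be NULL when n_children is 0 */
    size_t n_children;
};

/* Adds the node-kind counts of the subtree rooted at node into vec. */
void add_counts(const struct ast_node *node, int vec[K_KINDS]) {
    assert(node != NULL);
    vec[node->kind]++;
    for (size_t i = 0; i < node->n_children; i++)
        add_counts(node->children[i], vec);
}

/* A forest's vector is the sum of the vectors of its trees. */
void forest_vector(const struct ast_node *const *trees, size_t n_trees,
                   int vec[K_KINDS]) {
    for (int k = 0; k < K_KINDS; k++)
        vec[k] = 0;
    for (size_t t = 0; t < n_trees; t++)
        add_counts(trees[t], vec);
}

/* Squared Euclidean distance; small values suggest similar forests. */
long squared_distance(const int a[K_KINDS], const int b[K_KINDS]) {
    long sum = 0;
    for (int k = 0; k < K_KINDS; k++) {
        long d = (long)a[k] - b[k];
        sum += d * d;
    }
    return sum;
}
```

Comparing every pair of vectors directly would not scale, which is why the slide frames clustering as a near-neighbor problem; the sketch only shows what is being compared.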

SLIDE 17

Implementation

  • PDGs, ASTs
      ◦ GrammaTech CodeSurfer: C/C++
  • Semantic Threads, Clone Detection
      ◦ Parallel Java
  • Clustering
      ◦ MIT Locality Sensitive Hashing (native)

SLIDE 18

Analysis Times

SLIDE 19

Quantitative Results

SLIDE 20

Example

SLIDE 21

Example

SLIDE 22

Another Example

SLIDE 23

Fragment 1

SLIDE 24

Fragment 2

SLIDE 25

Fragment 3

SLIDE 26

Summary

  • First scalable clone detection algorithm based on PDGs
      ◦ Reduction to a simpler tree-based problem
      ◦ Scalable, effective
  • New classes of clones
      ◦ Demonstrated to exist
      ◦ Enabling technology: new applications

SLIDE 27

Complete PDG

[Figure: complete PDG for func(). Node labels: entry func(); exit; formal-in int i; formal-in int j; formal-out func(); body func(); decl int k; expr k = 10; ctrl-pt i < k; expr i++; expr j = 2 * k; call-site printf(); actual-in "i=%d, j=%d"; actual-in i; actual-in j; return return k; expr return k]

Key: statement node, control point node, data dependency, control dependency