UAM@SOCO 2014: Detection of Source Code Re-use by Means of Combining Different Types of Representations



slide-1
SLIDE 1

UAM@SOCO 2014: Detection of Source Code Re-use by Means of Combining Different Types of Representations

  • A. Ramírez-de-la-Cruz, G. Ramírez-de-la-Rosa, C. Sánchez-Sánchez,
  • W. A. Luna-Ramírez, H. Jiménez-Salazar and C. Rodríguez-Lucatero

Presenter: Esaú Villatoro-Tello
December 5th, Bangalore, India


slide-2
SLIDE 2

SOCO Task Description

  • SOCO (Detection of SOurce COde Re-use) is a shared task that focuses on monolingual source code re-use detection.
  • Participant systems were provided with sets of source codes (training and test) in the C and Java programming languages.
  • The task consists of retrieving the source code pairs that have been re-used at the document level.

slide-3
SLIDE 3

Our general idea

  • Different and diverse views of a source code allow a richer description of it.
  • Each view should highlight different aspects of a source code.

slide-4
SLIDE 4

Proposed Source Code Representations

From three views we proposed four representations:

  • Lexical View:
      • Character 3-grams
  • Structural View:
      • Data types from the functions' signatures
      • Names from the functions' signatures
  • Stylistic View:
      • 11 stylistic features to represent each source code

slide-5
SLIDE 5

Code Examples

[Figure: two example C programs. Code 1: Calculator.c (Cα). Code 2: Calc.c (Cβ).]

slide-6
SLIDE 6

Proposed Source Code Representations

´ Lexical View: ´ Character 3-grams

´ Structural View from function’s signatures

´ Data types ´ Names of function and arguments.

´ Stylistic View:

´ 11 stylistic features to represent each source code

6

slide-7
SLIDE 7

Lexical view

  • Idea: as with text documents, we want to find similar patterns within the source code by means of 3-grams of characters.
  • We use the method proposed by Enrique Flores*, and in addition we eliminate the reserved words of the programming language.

* E. Flores. Reutilización de código fuente entre lenguajes de programación. Master’s thesis, Universidad Politécnica de Valencia, Valencia, España, February

slide-8
SLIDE 8

Lexical View

  • Example for code C2, after preprocessing:

stdiohaddnumxnumyresnumxnumyressubnumxnumyresnumxnumyre argcargvnumx10numy15resadd0resaddaddnumxnumy0

  • List of character 3-grams (bag of 3-grams) for C2:

{"std", "tdi", "dio", "ioh", "oha", "had", "add", "ddn", "dnu", "num", "umx", … ,"
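The preprocessing and 3-gram extraction steps can be sketched as follows. This is a minimal illustration and not the exact procedure from Flores' thesis: the reserved-word list is a deliberately small, hypothetical one, and the function names `preprocess` and `char_ngrams` are assumptions made for this example.

```python
import re

# Hypothetical, deliberately small reserved-word list (an assumption);
# "include" is added so that "#include <stdio.h>" reduces to "stdioh".
RESERVED = {"include", "int", "char", "void", "return", "if", "else", "for", "while"}


def preprocess(source):
    """Keep alphanumeric tokens, drop reserved words, lowercase, concatenate."""
    tokens = re.findall(r"[A-Za-z0-9]+", source)
    kept = [t.lower() for t in tokens if t.lower() not in RESERVED]
    return "".join(kept)


def char_ngrams(text, n=3):
    """Return every overlapping character n-gram, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


snippet = "#include <stdio.h>\nint add(int numX, int numY)"
flat = preprocess(snippet)      # "stdiohaddnumxnumy"
grams = char_ngrams(flat)       # starts with "std", "tdi", "dio", "ioh", ...
```

On this two-line snippet the flattened string begins the same way as the C2 example above ("stdiohaddnumxnumy…").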

slide-9
SLIDE 9

Lexical View: source code comparison

  • Then each bag of 3-grams is represented as a vector, Cα and Cβ.

Bβ = {"std", "tdi", "dio", "ioh", "oha", "had", "add", "ddn", "dnu", "num", "umx", … ,"my0"}
Bα = {"std", "tdi", "dio", "ioh", "ohs", "hsu", "sum", "umn", "mnu", "num", "umo", "mon", "one", … ,"wo0

[Figure: vector representations of Cα and Cβ, giving the frequency of each 3-gram over the shared vocabulary: add, ddn, dio, dnu, had, hsu, ioh, mnu, mon, my0, num, oha, ohs, one, std, sum, tdi, umn, umo, umx, …]

slide-10
SLIDE 10

Lexical View: source code comparison

  • Finally, the similarity between a pair of source codes is computed using the cosine similarity, which is defined as follows:

sim(Cα, Cβ) = (Cα · Cβ) / (‖Cα‖ ‖Cβ‖)
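A minimal sketch of the lexical comparison, assuming each source code has already been reduced to a preprocessed string. The function names `ngram_vector` and `cosine_similarity` are illustrative, and the two input strings are toy data rather than the slides' examples.

```python
import math
from collections import Counter


def ngram_vector(text, n=3):
    """Map a preprocessed string to a frequency vector of character n-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)


# Toy inputs: each string yields four distinct 3-grams and they share only "num".
score = cosine_similarity(ngram_vector("addnum"), ngram_vector("subnum"))
```

Here the dot product is 1 and both norms are 2, so `score` is 0.25.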

slide-11
SLIDE 11

Proposed Source Code Representations

  • Lexical View:
      • Character 3-grams
  • Structural View, from the functions' signatures:
      • Data types
      • Names of functions and arguments
  • Stylistic View:
      • 11 stylistic features to represent each source code

slide-12
SLIDE 12

Structural view

  • Idea: some structure can be found in the function signatures of a source code.
  • We used the functions' signatures in two ways:
      • Data types
      • Names of functions and arguments

slide-13
SLIDE 13

Structural View: Data types

  • Our intuition: plagiarists are often willing to change the names of functions and arguments, but not the data types of those elements.

Only the functions' signatures are used, excluding the main method. For Cβ:

int add(int numX, int numY)
int sub(int numX, int numY)

slide-14
SLIDE 14

Structural View: Data types

  • A real example (part 1)

[Figure: a function from source code 077.c and a function from source code 078.c]

Only data types without return type:

077.C = [char, int, int, CrackFuncPtr, int, int, int]
078.C = [ListPtr, CrackFuncPtr]

Use only the intersection

DatatypeSet = [int, char, CrackFuncPtr, ListPtr]

slide-15
SLIDE 15

Structural View: Data types

  • For each method of the two source codes under analysis, we count the frequency of each data type and then compute the similarity of their argument types (Sim_a):

077.C = [char, int, int, CrackFuncPtr, int, int, int]
078.C = [ListPtr, CrackFuncPtr]
Sim_a(method1_077.c, method2_078.c) = 1/8
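The similarity formula itself did not survive on this slide. One formula that reproduces the 1/8 shown for this example is a frequency-weighted Jaccard coefficient (sum of per-type minimum counts over sum of per-type maximum counts). The sketch below assumes that formula; it is not confirmed as the authors' definition, and the function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def argument_type_similarity(types_a, types_b):
    """Weighted Jaccard over data-type frequencies (assumed formula)."""
    count_a, count_b = Counter(types_a), Counter(types_b)
    all_types = count_a.keys() | count_b.keys()
    shared = sum(min(count_a[t], count_b[t]) for t in all_types)
    total = sum(max(count_a[t], count_b[t]) for t in all_types)
    return Fraction(shared, total) if total else Fraction(0)


# Argument data types taken from the slide's 077.c / 078.c example.
types_077 = ["char", "int", "int", "CrackFuncPtr", "int", "int", "int"]
types_078 = ["ListPtr", "CrackFuncPtr"]
sim_a = argument_type_similarity(types_077, types_078)
```

Only `CrackFuncPtr` is shared (minimum count 1), while the maximum counts add up to 1 + 5 + 1 + 1 = 8, so `sim_a` equals 1/8.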

slide-16
SLIDE 16

Structural View: Data types

  • A real example (part 2)

[Figure: a function from source code 077.c and a function from source code 078.c]

We compare only the return data type:

Sim_r(method1_077.c, method2_078.c) = 0

slide-17
SLIDE 17

Structural View: Data types

  • A real example (combining part 1 and part 2)

The combined similarity gives us the structural similarity of data types:

Sim_r(method1_077.c, method2_078.c) = 0
Sim_a(method1_077.c, method2_078.c) = 1/8
Sim(method1_077.c, method2_078.c) = (0.5 * 0) + (0.5 * 0.125) = 0.0625

In this work σ = 0.5
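The combination step can be sketched as a weighted sum. Two points are assumptions here: the return-type similarity is taken to be binary (1 when the return types match, 0 otherwise), and σ is applied to the return-type term with 1 − σ on the argument term. With σ = 0.5 the second choice does not affect the result.

```python
def return_type_similarity(type_a, type_b):
    """Assumed binary comparison of two return data types."""
    return 1.0 if type_a == type_b else 0.0


def combined_similarity(sim_return, sim_args, sigma=0.5):
    """Linear combination of return-type and argument-type similarities."""
    return sigma * sim_return + (1 - sigma) * sim_args


# Values from the slide: Sim_r = 0 and Sim_a = 1/8.
sim = combined_similarity(0.0, 0.125)
```

This reproduces the slide's value of 0.0625 for the 077.c / 078.c pair.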

slide-18
SLIDE 18

Structural View: Data types

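  • Finally, given two codes, Cα and Cβ, we compute the data-type similarity between all the functions in both codes:

\[
\begin{pmatrix}
\mathrm{Sim}(m^{\alpha}_{1}, m^{\beta}_{1}) & \cdots & \mathrm{Sim}(m^{\alpha}_{1}, m^{\beta}_{j}) \\
\mathrm{Sim}(m^{\alpha}_{2}, m^{\beta}_{1}) & \cdots & \mathrm{Sim}(m^{\alpha}_{2}, m^{\beta}_{j}) \\
\vdots & \ddots & \vdots \\
\mathrm{Sim}(m^{\alpha}_{i}, m^{\beta}_{1}) & \cdots & \mathrm{Sim}(m^{\alpha}_{i}, m^{\beta}_{j})
\end{pmatrix}
\]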
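Building the pairwise matrix can be sketched as below. The helper name `similarity_matrix` and the toy `exact_match` similarity are assumptions for illustration. How the matrix is reduced to a single score for the pair of codes is not stated on the slide, so that step is left out.

```python
def similarity_matrix(functions_a, functions_b, sim):
    """Rows follow functions_a (index i); columns follow functions_b (index j)."""
    return [[sim(fa, fb) for fb in functions_b] for fa in functions_a]


def exact_match(a, b):
    """Toy similarity used only for illustration: 1.0 if equal, else 0.0."""
    return 1.0 if a == b else 0.0


matrix = similarity_matrix(["add", "sub"], ["add", "mul", "sub"], exact_match)
```

Any function-level similarity (data types or names) can be passed as `sim`; here `matrix` has 2 rows and 3 columns.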

slide-19
SLIDE 19

Proposed Source Code Representations

  • Lexical View:
      • Character 3-grams
  • Structural View, from the functions' signatures:
      • Data types
      • Names of functions and arguments
  • Stylistic View:
      • 11 stylistic features to represent each source code

slide-20
SLIDE 20

Structural View: Names of functions and arguments

  • Our intuition: some plagiarists might try to obfuscate the copied elements by changing the data types, but not the names of the variables.

Only the functions' signatures are used, excluding the main method. For Cβ:

int add(int numX, int numY)
int sub(int numX, int numY)

slide-21
SLIDE 21

Structural View: Names of functions and arguments

  • A real example

[Figure: a function from source code 078.c]

All names are concatenated to form a single string:

078.C = rundictcracklfunc

A set of character 3-grams is extracted:

3gramsSet_078 = ['run', 'und', 'ndi', 'dic', 'ict', 'ctc', 'tcr', 'cra', 'rac', 'ack', 'ckl', 'klf', 'lfu', 'fun', 'unc']

[Figure: a function from source code 077.c]

3gramsSet_077 = ['set', 'num', 'cec', 'chi', 'chl', 'chs', 'efo', 'hse', 'ncs', 'fch', 'mof', 'enf', 'ute', 'fun', 'etn', 'sch', 'nbr', 'bru', 'hle', 'che', 'for', 'nfu', 'csc', 'orc', 'rce', 'umo', 'run', 'len', 'ech', 'hid', 'rut', 'tnu', 'ofc', 'hec', 'unb', 'unc', 'tef']

The same process is applied to the other methods.
slide-22
SLIDE 22

Structural View: Names of functions and arguments

  • Once we have computed the bags of n-grams, we can compute how similar two functions are, using the Jaccard coefficient as follows:

Sim2(A, B) = |A ∩ B| / |A ∪ B|

Sim2(3gramsSet_077, 3gramsSet_078) = 3/49
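The name-based comparison can be checked with a short sketch that uses the two 3-gram sets from the real example. The function names are illustrative. The two sets share 'run', 'fun' and 'unc', and their union has 15 + 37 − 3 = 49 elements.

```python
from fractions import Fraction


def char_3gram_set(names):
    """Set of character 3-grams of a concatenated name string."""
    return {names[i:i + 3] for i in range(len(names) - 2)}


def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    union = a | b
    return Fraction(len(a & b), len(union)) if union else Fraction(0)


grams_078 = char_3gram_set("rundictcracklfunc")
# 3-gram set listed on the slide for the function in 077.c.
grams_077 = {
    "set", "num", "cec", "chi", "chl", "chs", "efo", "hse", "ncs", "fch",
    "mof", "enf", "ute", "fun", "etn", "sch", "nbr", "bru", "hle", "che",
    "for", "nfu", "csc", "orc", "rce", "umo", "run", "len", "ech", "hid",
    "rut", "tnu", "ofc", "hec", "unb", "unc", "tef",
}
sim_names = jaccard(grams_077, grams_078)
```

With these inputs `sim_names` is 3/49, matching the value on the slide.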

slide-23
SLIDE 23

Structural View: Names of functions and arguments

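  • Finally, given two codes, Cα and Cβ, we compute the name similarity between all the functions in both codes:

\[
\begin{pmatrix}
\mathrm{Sim}(m^{\alpha}_{1}, m^{\beta}_{1}) & \cdots & \mathrm{Sim}(m^{\alpha}_{1}, m^{\beta}_{j}) \\
\mathrm{Sim}(m^{\alpha}_{2}, m^{\beta}_{1}) & \cdots & \mathrm{Sim}(m^{\alpha}_{2}, m^{\beta}_{j}) \\
\vdots & \ddots & \vdots \\
\mathrm{Sim}(m^{\alpha}_{i}, m^{\beta}_{1}) & \cdots & \mathrm{Sim}(m^{\alpha}_{i}, m^{\beta}_{j})
\end{pmatrix}
\]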

slide-24
SLIDE 24

Proposed Source Code Representations

  • Lexical View:
      • Character 3-grams
  • Structural View, from the functions' signatures:
      • Data types
      • Names of functions and arguments
  • Stylistic View:
      • 11 stylistic features to represent each source code

slide-25
SLIDE 25

Stylistic View

  • This representation aims at finding unique properties of the original author, such as his/her programming style.
  • We compute 11 stylistic features to represent each source code.
  • Then we use a vector representation and, by means of the cosine similarity, we find the similarity between two source codes.

slide-36
SLIDE 36

Stylistic View: 11 stylistic features

  • The features are:
      • #Code Lines
      • #White spaces
      • #Tabulations
      • #Empty Lines
      • #Functions
      • Average Word Length
      • #Uppercase
      • #Lowercase
      • #Underscores
      • Total of Words
      • Lexical Richness
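A sketch of how the 11 features might be extracted. The slides only name the features, so every definition below is an assumption: code lines are counted as non-blank lines, functions are detected with a crude regular expression, words are identifier-like tokens, and lexical richness is taken as distinct words over total words.

```python
import re

# Crude heuristic (an assumption): "<type> <name>(<args>) {" marks a definition.
FUNCTION_RE = re.compile(r"\b[A-Za-z_]\w*\s+[A-Za-z_]\w*\s*\([^;{)]*\)\s*\{")


def stylistic_features(source):
    """Return the 11 stylistic features as a dict (assumed definitions)."""
    lines = source.splitlines()
    words = re.findall(r"[A-Za-z_]\w*", source)
    total_words = len(words)
    return {
        "code_lines": sum(1 for ln in lines if ln.strip()),
        "white_spaces": source.count(" "),
        "tabs": source.count("\t"),
        "empty_lines": sum(1 for ln in lines if not ln.strip()),
        "functions": len(FUNCTION_RE.findall(source)),
        "avg_word_length": sum(map(len, words)) / total_words if words else 0.0,
        "uppercase": sum(1 for c in source if c.isupper()),
        "lowercase": sum(1 for c in source if c.islower()),
        "underscores": source.count("_"),
        "total_words": total_words,
        "lexical_richness": len(set(words)) / total_words if words else 0.0,
    }


# Hypothetical sample program used only to exercise the function.
sample = (
    "int add(int a, int b) {\n\treturn a + b;\n}\n\n"
    "int main() {\n\treturn add(1, 2);\n}\n"
)
features = stylistic_features(sample)
```

The dict values, taken in a fixed key order, give the 11-dimensional vector that is then compared with the cosine similarity.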

slide-37
SLIDE 37

Experimental Evaluation

  • The evaluation was performed with the training set provided by the shared task.
  • The performance of each proposed representation was measured by manually setting a threshold above which two codes are considered re-used.
  • That threshold was varied from 10 to 90 percent of similarity. For each threshold we evaluated the precision, recall and F-measure.
  • That information helped us to design the three submitted runs.
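The threshold sweep can be sketched as follows. The function names, the 10-point step between thresholds and the toy scores are assumptions for illustration. The pairs and similarity values are made up and are not results from the task.

```python
def evaluate(scores, gold, threshold):
    """Precision, recall and F-measure when pairs scoring >= threshold are flagged."""
    predicted = {pair for pair, s in scores.items() if s >= threshold}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def sweep(scores, gold):
    """Evaluate thresholds from 0.1 to 0.9 (10 to 90 percent) in steps of 0.1."""
    return {t / 10: evaluate(scores, gold, t / 10) for t in range(1, 10)}


# Hypothetical similarity scores for three code pairs; one pair is truly re-used.
toy_scores = {("a", "b"): 0.95, ("a", "c"): 0.40, ("b", "c"): 0.15}
toy_gold = {("a", "b")}
results = sweep(toy_scores, toy_gold)
```

In this toy case the 0.1 threshold flags all three pairs (precision 1/3, recall 1), while 0.5 flags only the re-used pair.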

slide-38
SLIDE 38

Submitted Runs

We submitted three runs for the task, based on three combinations of the proposed representations.

  • Run 1. Lexical View only. The results for C and Java are shown in Table 1.

slide-39
SLIDE 39

Submitted Runs

  • Run 2. Combination of the Lexical and Structural Views. The linear combination is shown in the next equation:

The results of the experiment are shown below:

slide-40
SLIDE 40

Submitted Runs

  • Run 3. Supervised approach. For this experiment all the similarities, from all the views, were combined using a J48 decision tree. The obtained results are in the next table:

slide-41
SLIDE 41

Submitted Runs

  • As we can see, our recall values for detecting source code re-use in C are competitive with the recall of the best system (1.00 and 0.997).
  • The opposite happened with the performance for Java. Here our system performs very well, in recall as well as in precision, which puts our system in first place in the performance ranking.

slide-42
SLIDE 42

Conclusions and Future Work

  • From the results obtained during the training phase:
      • Each type of representation provides some information that can be used to detect particular cases of source code re-use.
      • A deeper analysis is needed in order to determine the main characteristics that improve code re-use detection.
  • We believe that the low precision values (when processing C codes) are due to the fact that several source codes are not pure C but rather C/C++-like programs.
      • We also need to do a deeper analysis to validate this hypothesis.
  • Finally, the results obtained during the test phase motivate us to keep working in the same direction.

slide-44
SLIDE 44

Our Group

Language and Reasoning Group
Information Technologies Department, UAM-C
Follow us on Twitter: @LyR_UAMC
Corresponding author of this work: Gabriela Ramírez (gramirez@correo.cua.uam.mx)

slide-45
SLIDE 45

Evaluation on the training set

[Figures: evaluation results on the training set for the lexical view and the stylistic view]

slide-46
SLIDE 46

Evaluation on the training set

[Figure: evaluation results on the training set for the structural view: data types]

slide-47
SLIDE 47

Evaluation on the training set

[Figure: evaluation results on the training set for the structural view: names]