UAM@SOCO 2014: Detection of Source Code Re-use by means of Combining Different Types of Representations

A. Ramírez-de-la-Cruz, G. Ramírez-de-la-Rosa, C. Sánchez-Sánchez, W. A. Luna-Ramírez, H. Jiménez-Salazar and C. Rodríguez-Lucatero
Presenter: Esaú Villatoro-Tello
December 5th, Bangalore, India

SOCO Task Description
- SOCO, Detection of SOurce COde Re-use, is a shared task that focuses on monolingual source code re-use detection.
- Participant systems were provided with sets of source codes (training and test) in the C and Java programming languages.
- The task consists of retrieving the pairs of source codes that have been re-used at the document level.

Our general idea
- Different and diverse views of a source code allow a richer description of it.
- Each view should highlight different aspects of a source code.

Proposed Source Code Representations
From three views we proposed four representations:
- Lexical View:
  - Character 3-grams
- Structural View:
  - Data types from the function signatures
  - Names from the function signatures
- Stylistic View:
  - 11 stylistic features to represent each source code

Code Examples
- Code 1: Calculator.c (C α)
- Code 2: Calc.c (C β)

Proposed Source Code Representations
- Lexical View:
  - Character 3-grams
- Structural View, from the function signatures:
  - Data types
  - Names of functions and arguments
- Stylistic View:
  - 11 stylistic features to represent each source code

Lexical view
- Idea: as with text documents, we want to find similar patterns within the source code by means of character 3-grams.
- We use the method proposed by Enrique Flores*, and additionally we eliminate the reserved words of the programming language.
* E. Flores. Reutilización de código fuente entre lenguajes de programación. Master’s thesis, Universidad Politécnica de Valencia, Valencia, España, February

Lexical View
- Example for code C 2, after preprocessing:
  stdiohaddnumxnumyresnumxnumyressubnumxnumyresnumxnumyre
  argcargvnumx10numy15resadd0resaddaddnumxnumy0
- List of character 3-grams of the preprocessed text (bag of 3-grams of C 2):
  {"std", "tdi", "dio", "ioh", "oha", "had", "add", "ddn", "dnu", "num", "umx", …}
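The preprocessing and 3-gram extraction described above can be sketched roughly as follows. This is only an illustration: the `RESERVED` set, the tokenizing regular expression, and the function names `preprocess` and `char_ngrams` are assumptions, since the slides do not list the exact reserved words or the preprocessing steps of the original method.

```python
import re

# Hypothetical, deliberately small stop list. The slides do not give the
# exact set of reserved words that was removed.
RESERVED = {"include", "int", "return", "void", "char", "if", "else", "for", "while"}


def preprocess(source: str) -> str:
    """Lowercase the code, keep alphanumeric tokens, drop reserved words, concatenate."""
    tokens = re.findall(r"[A-Za-z0-9]+", source.lower())
    return "".join(token for token in tokens if token not in RESERVED)


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return all overlapping character n-grams in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


snippet = "#include <stdio.h>\nint add(int numX, int numY)"
flat = preprocess(snippet)
grams = char_ngrams(flat)
print(flat)       # stdiohaddnumxnumy
print(grams[:7])  # ['std', 'tdi', 'dio', 'ioh', 'oha', 'had', 'add']
```

Under these assumptions the first 3-grams agree with the beginning of the bag shown for C 2.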

Lexical View: source code comparison
- Then each bag of 3-grams is represented as a vector, C α and C β.

B α = {"std", "tdi", "dio", "ioh", "ohs", "hsu", "sum", "umn", "mnu", "num", "umo", "mon", "one", …, "wo0"}
B β = {"std", "tdi", "dio", "ioh", "oha", "had", "add", "ddn", "dnu", "num", "umx", …, "my0"}

Vector representation (frequency of each 3-gram in each code):

| 3-gram | C α | C β |
|--------|-----|-----|
| add    | 0   | 4   |
| ddn    | 0   | 2   |
| dio    | 1   | 1   |
| dnu    | 0   | 2   |
| had    | 0   | 1   |
| hsu    | 1   | 0   |
| ioh    | 1   | 1   |
| mnu    | 2   | 0   |
| mon    | 8   | 0   |
| my0    | 0   | 1   |
| num    | 16  | 12  |
| oha    | 0   | 1   |
| ohs    | 1   | 0   |
| one    | 8   | 0   |
| std    | 1   | 1   |
| sum    | 3   | 0   |
| tdi    | 1   | 1   |
| umn    | 2   | 0   |
| umo    | 8   | 0   |
| umx    | 0   | 6   |
| …      | …   | …   |

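Lexical View: source code comparison
- Finally, the similarity between a pair of source codes is computed using the cosine similarity of their vectors:

$$\mathrm{sim}(C_\alpha, C_\beta) = \frac{C_\alpha \cdot C_\beta}{\lVert C_\alpha \rVert \, \lVert C_\beta \rVert}$$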
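A minimal sketch of this comparison, assuming each bag of 3-grams is stored as a `collections.Counter` of frequencies. The function name `cosine_similarity` and the two toy bags are illustrative only and are not the bags of the example codes.

```python
import math
from collections import Counter


def cosine_similarity(bag_a: Counter, bag_b: Counter) -> float:
    """Cosine of the angle between two 3-gram frequency vectors."""
    # Only 3-grams present in both bags contribute to the dot product.
    dot = sum(count * bag_b[gram] for gram, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty bags
    return dot / (norm_a * norm_b)


# Toy bags, invented for illustration.
bag_alpha = Counter(["std", "tdi", "dio", "sum", "sum"])
bag_beta = Counter(["std", "tdi", "dio", "add"])
print(cosine_similarity(bag_alpha, bag_beta))
```

Using `Counter` avoids building an explicit shared vocabulary: a 3-gram missing from one bag simply counts as zero.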

Structural view
- Idea: some structure can be found in the function signatures of a source code.
- We used the function signatures in two ways:
  - Data types
  - Names of functions and arguments

Structural View: Data types
- Our intuition: plagiarists are often willing to change the names of functions and arguments, but not the data types of those elements.
- Only the function signatures are used, without the main method. For C β:
  int add(int numX, int numY)
  int sub(int numX, int numY)

Structural View: Data types
- A real example (part 1): a function from source code 077.c and a function from source code 078.c.
- Only data types, without the return type:
  077.C = [char, int, int, CrackFuncPtr, int, int, int]
  078.C = [ListPtr, CrackFuncPtr]
- Use only the intersection:
  DatatypeSet = [int, char, CrackFuncPtr, ListPtr]

Structural View: Data types
- For each method of the two source codes under analysis, we count the frequency of each data type and then compute the argument-type similarity Sim a. For the example:
  077.C = [char, int, int, CrackFuncPtr, int, int, int]
  078.C = [ListPtr, CrackFuncPtr]
  Sim a (metodo1 077.c , metodo2 078.c ) = 1/8
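The formula for Sim a did not survive in the slide text. One measure that reproduces the reported value of 1/8 from the data-type frequencies is a Jaccard coefficient over multisets (sum of minimum counts divided by sum of maximum counts). The sketch below uses that measure as an assumption; it is not confirmed to be the authors' formula, and the name `argument_type_similarity` is hypothetical.

```python
from collections import Counter


def argument_type_similarity(types_a: list[str], types_b: list[str]) -> float:
    """Multiset Jaccard over data-type frequencies (assumed measure)."""
    freq_a, freq_b = Counter(types_a), Counter(types_b)
    shared = sum((freq_a & freq_b).values())  # per-type minimum counts
    total = sum((freq_a | freq_b).values())   # per-type maximum counts
    return shared / total if total else 0.0


method_077 = ["char", "int", "int", "CrackFuncPtr", "int", "int", "int"]
method_078 = ["ListPtr", "CrackFuncPtr"]
print(argument_type_similarity(method_077, method_078))  # 0.125
```

Here only `CrackFuncPtr` is shared (count 1), while the maximum counts add up to 1 + 5 + 1 + 1 = 8, which gives 1/8.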

Structural View: Data types
- A real example (part 2): a function from source code 077.c and a function from source code 078.c.
- We compare only the return data type:
  Sim r (metodo1 077.c , metodo2 078.c ) = 0

Structural View: Data types
- A real example (combining part 1 and part 2):
  Sim r (metodo1 077.c , metodo2 078.c ) = 0
  Sim a (metodo1 077.c , metodo2 078.c ) = 1/8
- The combined similarity gives us the structural similarity of data types. In this work σ = 0.5:
  Sim(metodo1 077.c , metodo2 078.c ) = (0.5 * 0) + (0.5 * 0.125) = 0.0625
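The arithmetic above is consistent with a weighted sum of the two partial similarities. The sketch below assumes the form σ · Sim r + (1 − σ) · Sim a and a binary return-type comparison. Both are assumptions: with σ = 0.5 the slide does not show which term σ weights, and the function names are hypothetical.

```python
def return_type_similarity(ret_a: str, ret_b: str) -> float:
    """Assumed binary comparison: 1 if the return types match, else 0."""
    return 1.0 if ret_a == ret_b else 0.0


def combined_similarity(sim_r: float, sim_a: float, sigma: float = 0.5) -> float:
    """Weighted combination of return-type and argument-type similarity."""
    return sigma * sim_r + (1.0 - sigma) * sim_a


# Values from the 077.c / 078.c example: Sim r = 0, Sim a = 1/8.
print(combined_similarity(0.0, 0.125))  # 0.0625
```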

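Structural View: Data types
- Finally, given two codes, C α and C β, we compute the data-type similarity between all the functions in both codes:

$$
\begin{pmatrix}
\mathrm{Sim}(m^{\alpha}_{1}, m^{\beta}_{1}) & \cdots & \mathrm{Sim}(m^{\alpha}_{1}, m^{\beta}_{j}) \\
\mathrm{Sim}(m^{\alpha}_{2}, m^{\beta}_{1}) & \cdots & \mathrm{Sim}(m^{\alpha}_{2}, m^{\beta}_{j}) \\
\vdots & \ddots & \vdots \\
\mathrm{Sim}(m^{\alpha}_{i}, m^{\beta}_{1}) & \cdots & \mathrm{Sim}(m^{\alpha}_{i}, m^{\beta}_{j})
\end{pmatrix}
$$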
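A sketch of how such a function-by-function matrix could be built, assuming the functions of each code are available as lists and a pairwise similarity function is supplied. The names `similarity_matrix` and `exact_match` are hypothetical, and the toy similarity stands in for whichever per-function measure is used. How the matrix is reduced to a single score for the pair of codes is not stated on this slide, so that step is not shown.

```python
from typing import Callable, Sequence, TypeVar

F = TypeVar("F")


def similarity_matrix(
    funcs_alpha: Sequence[F],
    funcs_beta: Sequence[F],
    sim: Callable[[F, F], float],
) -> list[list[float]]:
    """Entry [i][j] holds sim(function i of C alpha, function j of C beta)."""
    return [[sim(a, b) for b in funcs_beta] for a in funcs_alpha]


def exact_match(a: tuple, b: tuple) -> float:
    """Toy placeholder similarity: 1 if the type tuples are identical."""
    return 1.0 if a == b else 0.0


# Invented argument-type tuples for two small codes.
alpha = [("int", "int"), ("char",)]
beta = [("int", "int"), ("float",), ("char",)]
for row in similarity_matrix(alpha, beta, exact_match):
    print(row)
```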

Structural View: Names of functions and arguments
- Our intuition: some plagiarists might try to obfuscate the copied elements by changing the data types, but not the names of functions and variables.
- Only the function signatures are used, without the main method. For C β:
  int add(int numX, int numY)
  int sub(int numX, int numY)

Structural View: Names of functions and arguments
- A real example: a function from source code 077.c and a function from source code 078.c.
- Concatenate all names to form a single string:
  078.C = rundictcracklfunc
- A set of character 3-grams is extracted:
  3gramsSet_077 = [’set’,’num’,’cec’,’chi’,’chl’,’chs’,’efo’,’hse’,’ncs’,’fch’,’mof’,’enf’,’ute’,’fun’,’etn’,’sch’,’nbr’,’bru’,’hle’,’che’,’for’,’nfu’,’csc’,’orc’,’rce’,’umo’,’run’,’len’,’ech’,’hid’,’rut’,’tnu’,’ofc’,’hec’,’unb’,’unc’,’tef’]
  3gramsSet_078 = [’run’,’und’,’ndi’,’dic’,’ict’,’ctc’,’tcr’,’cra’,’rac’,’ack’,’ckl’,’klf’,’lfu’,’fun’,’unc’]
- The same process is applied to the other methods.

Structural View: Names of functions and arguments
- Once we have computed the sets of 3-grams, we can compute how similar two functions are, using the Jaccard coefficient, Sim 2 (A, B) = |A ∩ B| / |A ∪ B|:
  Sim 2 ( 3gramsSet_077 , 3gramsSet_078 ) = 3/49
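As a check of this example, the sketch below applies the standard Jaccard coefficient to the two 3-gram sets listed for 077.c and 078.c. The function name `jaccard` and the variable names are illustrative.

```python
def jaccard(set_a: set[str], set_b: set[str]) -> float:
    """Jaccard coefficient: size of the intersection over size of the union."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


grams_077 = {
    "set", "num", "cec", "chi", "chl", "chs", "efo", "hse", "ncs", "fch",
    "mof", "enf", "ute", "fun", "etn", "sch", "nbr", "bru", "hle", "che",
    "for", "nfu", "csc", "orc", "rce", "umo", "run", "len", "ech", "hid",
    "rut", "tnu", "ofc", "hec", "unb", "unc", "tef",
}
grams_078 = {
    "run", "und", "ndi", "dic", "ict", "ctc", "tcr", "cra", "rac", "ack",
    "ckl", "klf", "lfu", "fun", "unc",
}

shared = grams_077 & grams_078
print(sorted(shared))                 # ['fun', 'run', 'unc']
print(jaccard(grams_077, grams_078))  # 3/49, about 0.0612
```

The two sets share three 3-grams and their union has 37 + 15 − 3 = 49 elements, which matches the value on the slide.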

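Structural View: Names of functions and arguments
- Finally, given two codes, C α and C β, we compute the name similarity between all the functions in both codes:

$$
\begin{pmatrix}
\mathrm{Sim}(m^{\alpha}_{1}, m^{\beta}_{1}) & \cdots & \mathrm{Sim}(m^{\alpha}_{1}, m^{\beta}_{j}) \\
\mathrm{Sim}(m^{\alpha}_{2}, m^{\beta}_{1}) & \cdots & \mathrm{Sim}(m^{\alpha}_{2}, m^{\beta}_{j}) \\
\vdots & \ddots & \vdots \\
\mathrm{Sim}(m^{\alpha}_{i}, m^{\beta}_{1}) & \cdots & \mathrm{Sim}(m^{\alpha}_{i}, m^{\beta}_{j})
\end{pmatrix}
$$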

Stylistic View
- This representation aims at finding unique properties of the original author, such as his/her programming style.
- We compute 11 stylistic features to represent each source code.
- Then we use a vector representation and compute the similarity between two source codes with the cosine similarity.

Stylistic View: 11 stylistic features
The features shown on the slides (illustrated on code C β) are:
- #Code Lines
- #White spaces
- #Tabulations
- #Empty Lines
- #Functions
- Average Word Length
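A rough sketch of how the features named on these slides could be computed for a C file. The exact definitions are not given in the slides, so each one below is an assumption: code lines are taken as non-empty lines, words as alphanumeric tokens, and functions are counted with a crude regular-expression heuristic. The names `stylistic_features`, `FUNC_PATTERN` and `CONTROL_KEYWORDS` are hypothetical.

```python
import re

# Crude heuristic (assumption): a line that looks like "type name(args) {".
FUNC_PATTERN = re.compile(
    r"^[ \t]*(?:[A-Za-z_]\w*[ \t*]+)+([A-Za-z_]\w*)[ \t]*\([^;{}()]*\)\s*\{",
    re.MULTILINE,
)
CONTROL_KEYWORDS = {"if", "for", "while", "switch"}


def stylistic_features(source: str) -> dict[str, float]:
    """Compute six of the stylistic features under the assumptions above."""
    lines = source.splitlines()
    empty_lines = sum(1 for line in lines if not line.strip())
    words = re.findall(r"[A-Za-z0-9_]+", source)
    functions = [
        m.group(1) for m in FUNC_PATTERN.finditer(source)
        if m.group(1) not in CONTROL_KEYWORDS
    ]
    return {
        "code_lines": len(lines) - empty_lines,
        "white_spaces": source.count(" "),
        "tabulations": source.count("\t"),
        "empty_lines": empty_lines,
        "functions": len(functions),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
    }


sample = "\n".join([
    "int add(int a, int b) {",
    "\treturn a + b;",
    "}",
    "",
    "int main() {",
    "\treturn add(1, 2);",
    "}",
])
print(stylistic_features(sample))
```

The resulting dictionary can be turned into a fixed-order vector and compared with another file's vector using the cosine similarity.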
