UAM@SOCO 2014: Detection of Source Code Re-use by Means of Combining Different Types of Representations

  1. UAM@SOCO 2014: Detection of Source Code Re-use by Means of Combining Different Types of Representations
     - Authors: A. Ramírez-de-la-Cruz, G. Ramírez-de-la-Rosa, C. Sánchez-Sánchez, W. A. Luna-Ramírez, H. Jiménez-Salazar and C. Rodríguez-Lucatero
     - Presenter: Esaú Villatoro-Tello
     - December 5th, Bangalore, India

  2. SOCO Task Description
     - SOCO, Detection of SOurce COde Re-use, is a shared task that focuses on monolingual source code re-use detection.
     - Participant systems were provided with sets of source codes (training and test) in the C and Java programming languages.
     - The task consists of retrieving the source code pairs that have been re-used at the document level.

  3. Our general idea
     - Different and diverse views of a source code allow a richer description of it.
     - Each view should highlight different aspects of a source code.

  4. Proposed Source Code Representations
     From three views we proposed four representations:
     - Lexical View:
       - Character 3-grams
     - Structural View:
       - Data types from the function's signature
       - Names from the function's signature
     - Stylistic View:
       - 11 stylistic features to represent each source code

  5. Code Examples
     - Code 1: Calculator.c (Cα)
     - Code 2: Calc.c (Cβ)
     [Figure: the two example source codes]

  6. Proposed Source Code Representations
     - Lexical View:
       - Character 3-grams
     - Structural View from function's signatures:
       - Data types
       - Names of functions and arguments
     - Stylistic View:
       - 11 stylistic features to represent each source code

  7. Lexical View
     - Idea: as with text documents, we want to find similar patterns within the source code by means of character 3-grams.
     - We use the method proposed by Enrique Flores*, and in addition we remove the reserved words of the programming language.
     * E. Flores. Reutilización de código fuente entre lenguajes de programación. Master's thesis, Universidad Politécnica de Valencia, Valencia, España, February

  8. Lexical View
     - Example for code C2, after preprocessing:
       stdiohaddnumxnumyresnumxnumyressubnumxnumyresnumxnumyre argcargvnumx10numy15resadd0resaddaddnumxnumy0
     - List of character 3-grams of the preprocessed code (bag of 3-grams of C2):
       {"std", "tdi", "dio", "ioh", "oha", "had", "add", "ddn", "dnu", "num", "umx", … ,"
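The preprocessing and 3-gram extraction shown on this slide can be sketched as follows. This is a minimal sketch, not the method of the cited thesis: the normalization (lowercasing, keeping only letters and digits) and the tiny `RESERVED` set are assumptions chosen because they reproduce the beginning of the preprocessed string on the slide, and the function names are hypothetical.

```python
import re
from collections import Counter

# Illustrative subset only; a real system would list every reserved word.
RESERVED = {"include", "int", "return"}


def preprocess(source: str) -> str:
    """Lowercase, split into alphanumeric tokens, drop reserved words, join."""
    tokens = re.findall(r"[a-z0-9]+", source.lower())
    return "".join(t for t in tokens if t not in RESERVED)


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return every overlapping character n-gram, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Hypothetical fragment resembling the start of the example code.
snippet = (
    "#include <stdio.h>\n"
    "int add(int numX, int numY) { int res = numX + numY; return res; }"
)
clean = preprocess(snippet)
bag = Counter(char_ngrams(clean))  # bag of 3-grams with frequencies
```

Here `clean` begins with `stdiohaddnumx`, so the first 3-grams are "std", "tdi", "dio", and so on, as in the list on the slide.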

  9. Lexical View: source code comparison
     - Then each bag of 3-grams is represented as a vector, Cα and Cβ.
     - Bα = {"std", "tdi", "dio", "ioh", "oha", "had", "add", "ddn", "dnu", "num", "umx", … ,"my0"}
     - Bβ = {"std", "tdi", "dio", "ioh", "ohs", "hsu", "sum", "umn", "mnu", "num", "umo", "mon", "one", … ,"wo0
     - Vector representation (3-gram frequencies):

       | 3-gram | Cα | Cβ |
       |--------|----|----|
       | add    | 0  | 4  |
       | ddn    | 0  | 2  |
       | dio    | 1  | 1  |
       | dnu    | 0  | 2  |
       | had    | 0  | 1  |
       | hsu    | 1  | 0  |
       | ioh    | 1  | 1  |
       | mnu    | 2  | 0  |
       | mon    | 8  | 0  |
       | my0    | 0  | 1  |
       | num    | 16 | 12 |
       | oha    | 0  | 1  |
       | ohs    | 1  | 0  |
       | one    | 8  | 0  |
       | std    | 1  | 1  |
       | sum    | 3  | 0  |
       | tdi    | 1  | 1  |
       | umn    | 2  | 0  |
       | umo    | 8  | 0  |
       | umx    | 0  | 6  |
       | …      | …  | …  |

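  10. Lexical View: source code comparison
     - Finally, the similarity between a pair of source codes is computed using the cosine similarity. The formula image is missing from this transcript; the standard definition is:
       $$\mathrm{sim}(C^{\alpha}, C^{\beta}) = \frac{\vec{C}^{\alpha} \cdot \vec{C}^{\beta}}{\lVert \vec{C}^{\alpha} \rVert \, \lVert \vec{C}^{\beta} \rVert}$$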
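A minimal sketch of the cosine comparison between two bags of 3-grams, using the standard definition of cosine similarity. The helper names and the two input fragments are hypothetical; returning 0.0 for an empty bag is a convention chosen here, not something stated in the slides.

```python
import math
from collections import Counter


def bag_of_ngrams(text: str, n: int = 3) -> Counter:
    """Count every character n-gram in an already preprocessed string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(bag_a: Counter, bag_b: Counter) -> float:
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(bag_a[g] * bag_b[g] for g in bag_a.keys() & bag_b.keys())
    norm_a = math.sqrt(sum(v * v for v in bag_a.values()))
    norm_b = math.sqrt(sum(v * v for v in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty inputs
    return dot / (norm_a * norm_b)


# Hypothetical preprocessed fragments, not the full codes from the slides.
fragment_a = bag_of_ngrams("stdiohaddnumxnumyres")
fragment_b = bag_of_ngrams("stdiohsumnumonenumtwo")
score = cosine_similarity(fragment_a, fragment_b)
```

Only the 3-grams present in both bags contribute to the dot product, so two codes with no shared 3-grams score 0 and identical codes score 1.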

  12. Structural View
     - Idea: some structure can be present in the function signatures of a source code.
     - We used the function signatures in two ways:
       - Data types
       - Names of functions and arguments

  13. Structural View: Data types
     - Our intuition: plagiarists are often willing to change the names of functions and arguments, but not the data types of those elements.
     - Example (Cβ), using only the function signatures and excluding the main method:
       int add(int numX, int numY)
       int sub(int numX, int numY)

  14. Structural View: Data types
     - A real example (part 1): a function in source code 077.c and a function in source code 078.c.
     - Only data types, without the return type:
       077.C = [char, int, int, CrackFuncPtr, int, int, int]
       078.C = [ListPtr, CrackFuncPtr]
     - Use only the intersection
     - DatatypeSet = [int, char, CrackFuncPtr, ListPtr]

  15. Structural View: Data types
     - For each method of the two source codes under analysis, we count the frequency of each data type and then compute the similarity (the formula image is missing from this transcript):
       077.C = [char, int, int, CrackFuncPtr, int, int, int]
       078.C = [ListPtr, CrackFuncPtr]
       $\mathrm{Sim}_a(\mathrm{method1}_{077.c}, \mathrm{method2}_{078.c}) = 1/8$
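The slide does not preserve the formula behind the 1/8 value. One formula that reproduces it is the frequency-weighted (generalized) Jaccard coefficient: the sum of the per-type minimum frequencies divided by the sum of the per-type maximum frequencies. The sketch below assumes that formula; it is a reconstruction consistent with the example, not a confirmed statement of the authors' definition, and the function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def datatype_similarity(types_a: list[str], types_b: list[str]) -> Fraction:
    """Weighted Jaccard over argument data-type frequencies (assumed formula)."""
    freq_a, freq_b = Counter(types_a), Counter(types_b)
    vocabulary = freq_a.keys() | freq_b.keys()
    shared = sum(min(freq_a[t], freq_b[t]) for t in vocabulary)
    total = sum(max(freq_a[t], freq_b[t]) for t in vocabulary)
    return Fraction(shared, total) if total else Fraction(0)


# Argument data types of the two functions shown on the slide.
args_077 = ["char", "int", "int", "CrackFuncPtr", "int", "int", "int"]
args_078 = ["ListPtr", "CrackFuncPtr"]
sim_a = datatype_similarity(args_077, args_078)
```

With these inputs only `CrackFuncPtr` is shared (minimum 1), while the maxima add up to 5 + 1 + 1 + 1 = 8, which gives the 1/8 reported on the slide.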

  16. Structural View: Data types
     - A real example (part 2): a function in source code 077.c and a function in source code 078.c.
     - We compare only the return data type:
       $\mathrm{Sim}_r(\mathrm{method1}_{077.c}, \mathrm{method2}_{078.c}) = 0$

  17. Structural View: Data types
     - A real example (combining part 1 and part 2):
       $\mathrm{Sim}_r(\mathrm{method1}_{077.c}, \mathrm{method2}_{078.c}) = 0$
       $\mathrm{Sim}_a(\mathrm{method1}_{077.c}, \mathrm{method2}_{078.c}) = 1/8$
     - The combined similarity gives us the structural similarity of data types. In this work σ = 0.5:
       $\mathrm{Sim}(\mathrm{method1}_{077.c}, \mathrm{method2}_{078.c}) = (0.5 \times 0) + (0.5 \times 0.125) = 0.0625$
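The combination step can be sketched as a weighted sum. Which of the two terms σ multiplies is an assumption here (σ for the return-type similarity, 1 − σ for the argument similarity); with σ = 0.5, as used in this work, both readings give the same result. The function name is hypothetical.

```python
def combined_similarity(sim_return: float, sim_args: float, sigma: float = 0.5) -> float:
    """Weighted sum of return-type and argument-type similarity.

    Assumption: sigma weights the return-type term, (1 - sigma) the arguments.
    """
    if not 0.0 <= sigma <= 1.0:
        raise ValueError("sigma must lie in [0, 1]")
    return sigma * sim_return + (1.0 - sigma) * sim_args


# Values from the slide: Sim_r = 0 and Sim_a = 1/8.
sim = combined_similarity(0.0, 0.125)
```

For the values on the slide this yields 0.0625.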

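  18. Structural View: Data types
     - Finally, given two codes, Cα and Cβ, we compute the similarity of data types of all the functions in both codes, which gives a matrix with one row per method of Cα and one column per method of Cβ:
       $$\begin{pmatrix} \mathrm{Sim}(m^{\alpha}_{1}, m^{\beta}_{1}) & \dots & \mathrm{Sim}(m^{\alpha}_{1}, m^{\beta}_{j}) \\ \mathrm{Sim}(m^{\alpha}_{2}, m^{\beta}_{1}) & \dots & \mathrm{Sim}(m^{\alpha}_{2}, m^{\beta}_{j}) \\ \vdots & \ddots & \vdots \\ \mathrm{Sim}(m^{\alpha}_{i}, m^{\beta}_{1}) & \dots & \mathrm{Sim}(m^{\alpha}_{i}, m^{\beta}_{j}) \end{pmatrix}$$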
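A minimal sketch of building the pairwise matrix of method similarities for two codes. The pair-level similarity is passed in as a function; the `exact_match` stand-in below is a toy placeholder, not the similarity used in this work, and all names are hypothetical. How the matrix is reduced to a single score for the pair of codes is not shown in this transcript, so the sketch stops at the matrix.

```python
from typing import Callable, Sequence

Method = Sequence[str]  # e.g. the list of data types in one signature


def similarity_matrix(
    methods_a: Sequence[Method],
    methods_b: Sequence[Method],
    sim: Callable[[Method, Method], float],
) -> list[list[float]]:
    """Row i, column j holds sim(methods_a[i], methods_b[j])."""
    return [[sim(ma, mb) for mb in methods_b] for ma in methods_a]


def exact_match(ma: Method, mb: Method) -> float:
    """Toy similarity: 1.0 when both type lists are identical, else 0.0."""
    return 1.0 if list(ma) == list(mb) else 0.0


# Hypothetical signatures (data types only) for two small codes.
code_alpha = [["int", "int"], ["char"]]
code_beta = [["int", "int"], ["float"], ["char"]]
matrix = similarity_matrix(code_alpha, code_beta, exact_match)
```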

  20. Structural View: Names of functions and arguments
     - Our intuition: some plagiarists might try to obfuscate the copied elements by changing the data types, but not the names of functions and variables.
     - Example (Cβ), using only the function signatures and excluding the main method:
       int add(int numX, int numY)
       int sub(int numX, int numY)

  21. Structural View: Names of functions and arguments
     - A real example: a function in source code 078.c and a function in source code 077.c.
     - Concatenate all names to form a single string:
       078.C = rundictcracklfunc
     - A set of character 3-grams is extracted:
       3gramsSet_078 = ['run','und','ndi','dic','ict','ctc','tcr','cra','rac','ack','ckl','klf','lfu','fun','unc']
     - The same process is applied to the other methods:
       3gramsSet_077 = ['set','num','cec','chi','chl','chs','efo','hse','ncs','fch','mof','enf','ute','fun','etn','sch','nbr','bru','hle','che','for','nfu','csc','orc','rce','umo','run','len','ech','hid','rut','tnu','ofc','hec','unb','unc','tef']

  22. Structural View: Names of functions and arguments
     - Once we have computed the sets of 3-grams, we can compute how similar two functions are using the Jaccard coefficient, that is, the size of the intersection of the two sets divided by the size of their union:
       $\mathrm{Sim}_2(\mathrm{3gramsSet\_077}, \mathrm{3gramsSet\_078}) = 3/49$
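The 3/49 value can be checked with a short sketch of the Jaccard coefficient over the two 3-gram sets from the example. The set for 078.c is rebuilt from the concatenated name string, and the set for 077.c is copied from the slide; the helper names are hypothetical.

```python
from fractions import Fraction


def char_ngram_set(text: str, n: int = 3) -> set[str]:
    """Set of distinct character n-grams of a concatenated name string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> Fraction:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return Fraction(len(a & b), len(union)) if union else Fraction(0)


grams_078 = char_ngram_set("rundictcracklfunc")
grams_077 = {
    "set", "num", "cec", "chi", "chl", "chs", "efo", "hse", "ncs", "fch",
    "mof", "enf", "ute", "fun", "etn", "sch", "nbr", "bru", "hle", "che",
    "for", "nfu", "csc", "orc", "rce", "umo", "run", "len", "ech", "hid",
    "rut", "tnu", "ofc", "hec", "unb", "unc", "tef",
}
sim_names = jaccard(grams_077, grams_078)
```

The two sets share "fun", "run" and "unc", and their union has 37 + 15 − 3 = 49 elements, which matches the 3/49 on the slide.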

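  23. Structural View: Names of functions and arguments
     - Finally, given two codes, Cα and Cβ, we compute the similarity of names of all the functions in both codes, which gives a matrix with one row per method of Cα and one column per method of Cβ:
       $$\begin{pmatrix} \mathrm{Sim}(m^{\alpha}_{1}, m^{\beta}_{1}) & \dots & \mathrm{Sim}(m^{\alpha}_{1}, m^{\beta}_{j}) \\ \mathrm{Sim}(m^{\alpha}_{2}, m^{\beta}_{1}) & \dots & \mathrm{Sim}(m^{\alpha}_{2}, m^{\beta}_{j}) \\ \vdots & \ddots & \vdots \\ \mathrm{Sim}(m^{\alpha}_{i}, m^{\beta}_{1}) & \dots & \mathrm{Sim}(m^{\alpha}_{i}, m^{\beta}_{j}) \end{pmatrix}$$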

  25. Stylistic View
     - This representation aims at finding unique properties of the original author, such as his or her programming style.
     - We compute 11 stylistic features to represent each source code.
     - Then we use a vector representation and compute the cosine similarity between the vectors of two source codes.

  26. Stylistic View: 11 stylistic features (slides 26 to 31 reveal the list one feature at a time, annotated on code Cβ)
     The features shown up to slide 31 are:
     - #Code Lines
     - #White spaces
     - #Tabulations
     - #Empty Lines
     - #Functions
     - Average Word Length
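A minimal sketch of how some of the stylistic features named on these slides could be counted. The exact definitions are not given in the transcript, so every definition below is an assumption: a code line is a non-blank line, white spaces are space characters, and a word is a run of letters, digits or underscores. The number of functions is left out because counting it reliably needs a parser. The function name is hypothetical.

```python
import re


def stylistic_features(source: str) -> dict[str, float]:
    """Count a few stylistic features under the assumptions stated above."""
    lines = source.splitlines()
    words = re.findall(r"\w+", source)
    return {
        "code_lines": sum(1 for line in lines if line.strip()),
        "empty_lines": sum(1 for line in lines if not line.strip()),
        "white_spaces": source.count(" "),
        "tabulations": source.count("\t"),
        "avg_word_length": (
            sum(len(w) for w in words) / len(words) if words else 0.0
        ),
    }


# Hypothetical C fragment used only to exercise the counters.
sample = "int add(int a, int b)\n{\n\treturn a + b;\n}\n\n"
features = stylistic_features(sample)
```

The resulting values can be placed in a fixed order to form the feature vector that is then compared with cosine similarity.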
