Code Similarity via Natural Language Descriptions - Meital Ben Sinai

Code Similarity via Natural Language Descriptions. Meital Ben Sinai & Eran Yahav, Technion – Israel Institute of Technology. Off the Beaten Track (OBT'15), Jan 2015. www.like2drops.com


  1. Code Similarity via Natural Language Descriptions. Meital Ben Sinai & Eran Yahav, Technion – Israel Institute of Technology. Off the Beaten Track, Jan 2015. www.like2drops.com

  2. Lots of snippets out there (Dec '14)
      - >7M users, >17M repositories
      - 3M registered users, >8M questions, >14M answers
      - Google code, programming blogs, documentation sites …

  3. Similarity: Images VS. Programs
      - The code is not organized
      - Cannot accomplish even simple tasks (which are increasingly improving in other domains)

  4. Similarity: Images VS. Programs
      - Images already have some solutions
      - Find somewhere on the web: The Grand Canal, Venice, Italy

  5. Similarity: Images VS. Programs
      - Images already have some solutions
      - Find somewhere on the web
      - Google image search → The Grand Canal, Venice, Italy

  7. Similarity: Images VS. Programs
      - With code we still don't know what to do
      - Program P →

  8. Why are Programs Hard?
      - A program is a data transformer
      - "infinite data" ≫ "big data"
      - Potentially infinite number of runtime behaviors
      - Depends on inputs
      Infinite code:
          # Python 2: the behavior depends entirely on the command typed at runtime
          from subprocess import call
          cmd_to_run = raw_input()
          call(cmd_to_run.split())

  9. Why are Programs Hard?
      - Print the exact same value
      - Both written in Java
      - Syntactic difference
      Snippet 1:
          int scale = 100000;
          double x = (double)Math.round(8.912384 * scale) / scale;
          System.out.println(x);
      Snippet 2:
          DecimalFormat df = new DecimalFormat("#0.00000");
          System.out.println(df.format(8.912384));

  10. Syntactic Similarity is not Sufficient
      - Textual diff
      "There's more than one way to do it" - Perl slogan

  11. Syntactic Similarity is not Sufficient
      - Textual diff
      Snippet 1:
          try:
              fh = open(f)
              print "exist"
          except:
              print "no such file"
      Snippet 2:
          import os
          if os.path.exists(filename):
              print("exist")
          else:
              print("no such file")
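To make the "textual diff" point concrete, here is a minimal sketch using Python's standard `difflib`. The two snippets are Python 3 renderings of the slide's file-existence checks; the helper name `textual_similarity` and the line-level comparison are assumptions for illustration, not the authors' tooling.

```python
import difflib

# Two snippets with the same intent (report whether a file exists).
SNIPPET_A = '''try:
    fh = open(f)
    print("exist")
except:
    print("no such file")
'''

SNIPPET_B = '''import os
if os.path.exists(filename):
    print("exist")
else:
    print("no such file")
'''


def textual_similarity(a, b):
    """Line-level similarity ratio in [0, 1]; 1.0 means identical text."""
    matcher = difflib.SequenceMatcher(None, a.splitlines(), b.splitlines())
    return matcher.ratio()


def textual_diff(a, b):
    """Unified diff of the two snippets as a list of lines."""
    return list(difflib.unified_diff(a.splitlines(), b.splitlines(),
                                     "snippet_a.py", "snippet_b.py", lineterm=""))


if __name__ == "__main__":
    print(textual_similarity(SNIPPET_A, SNIPPET_B))
    print("\n".join(textual_diff(SNIPPET_A, SNIPPET_B)))
```

Only the two `print` lines match, so a purely textual comparison rates the snippets as mostly different even though they serve the same purpose.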

  12. Syntactic Similarity is not Sufficient
      - Textual diff
      - Abstract Syntax Tree diff
      Snippet 1:
          from itertools import permutations
          permutations(["a", "b"])
      Snippet 2:
          from subprocess import call
          call(["ls", "-l"])
      AST shown for both: Module [Import, Expr [Call [Name, args: List [Str, Str]]]]
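The AST comparison can be sketched with Python's standard `ast` module. The helper `node_types` is a hypothetical name; it keeps only node type names and discards identifiers and literal values. Note that current Python versions label the slide's `Import` node as `ImportFrom` and string literals as `Constant` rather than `Str`.

```python
import ast

# The two snippets from the slide: unrelated functionality, same syntactic shape.
SNIPPET_A = 'from itertools import permutations\npermutations(["a", "b"])\n'
SNIPPET_B = 'from subprocess import call\ncall(["ls", "-l"])\n'


def node_types(source):
    """Return the AST node type names of `source`, ignoring names and literal values."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


if __name__ == "__main__":
    print(node_types(SNIPPET_A))
    print(node_types(SNIPPET_A) == node_types(SNIPPET_B))
```

Both snippets yield the same sequence of node types, so a tree-shape comparison calls them identical although one generates permutations and the other runs a shell command.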

  13. The Cross Language Challenge
      Generation of all possible permutations of a string
      - Different algorithms
      - Similar functionality
      C:
          void permute(const char *s, char *out, int *used, int len, int lev){
              if (len == lev) {
                  out[lev] = '\0';
                  puts(out);
                  return;
              }
              int i;
              for (i = 0; i < len; ++i) {
                  if (used[i]) continue;
                  used[i] = 1;
                  out[lev] = s[i];
                  permute(s,out,used,len,lev+1);
                  used[i] = 0;
              }
              return;
          }
      PYTHON:
          def p (head, tail=''):
              if len(head) == 0:
                  print tail
              else:
                  for i in range(len(head)):
                      p(head[0:i] + head[i+1:], tail + head[i])
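To check the "different algorithms, similar functionality" claim, the sketch below ports both slide snippets to Python 3. The function names `permute_with_flags` and `permute_by_slicing` are hypothetical, and both variants return a list instead of printing so the results can be compared.

```python
def permute_with_flags(s):
    """Port of the C snippet: backtracking over positions with a 'used' flag array."""
    out = [""] * len(s)
    used = [False] * len(s)
    results = []

    def recurse(lev):
        if lev == len(s):
            results.append("".join(out))
            return
        for i in range(len(s)):
            if used[i]:
                continue
            used[i] = True
            out[lev] = s[i]
            recurse(lev + 1)
            used[i] = False

    recurse(0)
    return results


def permute_by_slicing(head, tail=""):
    """Python 3 variant of the slide's Python snippet: move one character from head to tail."""
    if len(head) == 0:
        return [tail]
    results = []
    for i in range(len(head)):
        results.extend(permute_by_slicing(head[:i] + head[i + 1:], tail + head[i]))
    return results


if __name__ == "__main__":
    print(permute_with_flags("abc"))
    print(permute_by_slicing("abc"))
```

The two functions share almost no text or structure, yet they produce the same permutations for the same input.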

  14. Our approach
      [Diagram: Code Snippet P1 → Natural Language Description; Code Snippet P2 → Natural Language Description; the two descriptions are compared by Text Similarity, while the direct comparison between the code snippets is marked "???"]

  15. Overview

  16. Equivalence, Similarity, Relatedness..
      Python:
          import random
          print random.randint(min, max)
      Java:
          public static int getRandom(int min, int max){
              Random rn = new Random();
              int range = max - min + 1;
              return rn.nextInt(range) + min;
          }
      Equivalent? NO!
      - Semantics
      - Functionality
      - Quantitative similarity
      - Semantic relatedness
      - Inclusion, Reversal, Closeness

  17. Similarity Applications
      Code similarity is a central challenge in many programming-related applications, such as:
      - Semantic Code Search
      - Automatic Translation
      - Education
      "I know how to get tomorrow's date in JAVA, it's easy! PHP though.."
      Java:
          Date d1 = new Date();
          Date d2 = new Date();
          d2.setTime(d1.getTime() + 1*24*60*60*1000);

  18. Similarity Applications
      Code similarity is a central challenge in many programming-related applications, such as:
      - Semantic Code Search
      - Automatic Translation
      - Education
      "I know how to get tomorrow's date in JAVA, it's easy! PHP though.."
      Java:
          Date d1 = new Date();
          Date d2 = new Date();
          d2.setTime(d1.getTime() + 1*24*60*60*1000);
      PHP:
          define(DATETIME_FORMAT, 'y-m-d H:i');
          $time = date(DATETIME_FORMAT, strtotime("+1 day", $time));

  19. Related work
      - PEPM'15: Source Code Examples from Unstructured Knowledge Sources [Vinayakarao, Purandare, Nori]
      - Onward'14: Approach based on mapping language structure [Karaivanov, Raychev, Vechev]

  20. Go Back to our Example
      Big Code & Text
      "How to generate all permutations of a list in Python"
          def p (head, tail=''):
              if len(head) == 0:
                  print tail
              else:
                  for i in range(len(head)):
                      p(head[0:i] + head[i+1:], tail + head[i])
      "Generating list of all possible permutations of a string in c?"
          void permute(const char *s, char *out, int *used, int len, int lev){
              if (len == lev) {
                  out[lev] = '\0';
                  puts(out);
                  return;
              }
              int i;
              for (i = 0; i < len; ++i) {
                  if (used[i]) continue;
                  used[i] = 1;
                  out[lev] = s[i];
                  permute(s,out,used,len,lev+1);
                  used[i] = 0;
              }
              return;
          }

  21. The Text Similarity
      - Python code partial description: "How to generate all permutations of a list in Python"
      - C code partial description: "Generating list of all possible permutations of a string in c?"
      - Similarity score = 0.72
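As a rough illustration of scoring two descriptions, the sketch below computes cosine similarity over raw word counts. This is an assumption: the slide does not name the similarity measure at this point, and this naive version is not expected to reproduce the reported 0.72, which comes from the authors' trained models. The names `bag_of_words` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter

PY_DESC = "How to generate all permutations of a list in Python"
C_DESC = "Generating list of all possible permutations of a string in c?"


def bag_of_words(text):
    """Lowercase, drop the question mark, split on whitespace, and count words."""
    return Counter(text.lower().replace("?", " ").split())


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


SCORE = cosine_similarity(bag_of_words(PY_DESC), bag_of_words(C_DESC))

if __name__ == "__main__":
    print(round(SCORE, 2))
```

Even this crude measure rates the two descriptions as fairly close, whereas the underlying C and Python snippets share almost no text.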

  22. Text Processing
      generating list of all possible permutations of a string in c ?
      → Removing stop-words & punctuation →
      generating list possible permutations string
      → Lemmatization →
      generate list possible permutation string
      → Vector Space Model (trained model, 1M docs) →
      w(1) w(2) w(3) ... w(n-1) w(n)
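The first two pipeline stages can be sketched as follows. The stop-word set and the lemma table are tiny hand-written stand-ins chosen to cover the slide's example, so both are assumptions; a real pipeline would use a full stop-word list and a proper lemmatizer. Treating the language name "c" as a stop word is likewise inferred from the slide's output.

```python
import string

# Hypothetical, minimal resources that cover only the slide's example sentence.
STOP_WORDS = {"a", "all", "c", "how", "in", "of", "to"}
LEMMAS = {"generating": "generate", "permutations": "permutation"}


def preprocess(description):
    """Lowercase, strip punctuation, drop stop words, then map each token to its lemma."""
    no_punct = description.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in no_punct.split() if tok not in STOP_WORDS]
    return [LEMMAS.get(tok, tok) for tok in tokens]


if __name__ == "__main__":
    print(preprocess("Generating list of all possible permutations of a string in c?"))
```

The resulting tokens are what a vector space model would then turn into the weights w(1) … w(n).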

  23. Models – tf.idf
      $\mathrm{tf.idf}_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_{t}$
      - Term Frequency Inverse Document Frequency
      - Each cell term is:
        - Higher when the term occurs many times
        - Lower when the term occurs in many documents
      Train set:
          Doc 1 (term: count)   list: 1, permutation: 1, generate: 2, string: 1
          Doc 2 (term: count)   sort: 3, list: 1, string: 1
      Resulting idf (term: idf), with smoothing:
          list: 0, string: 0, permutation: ~0.3, generate: ~0.3, sort: ~0.3

  24. Models – tf.idf
      $\mathrm{tf.idf}_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_{t}$
      - Term Frequency Inverse Document Frequency
      - Each cell term is:
        - Higher when the term occurs many times
        - Lower when the term occurs in many documents
      idf (term: idf):
          list: 0, string: 0, permutation: ~0.3, generate: ~0.3, sort: ~0.3
      Wanted document (term: count):
          list: 2, string: 1, generate: 1, set: 1, permutation: 3
      Count × idf, per term:
          list: 0, string: 0, generate: 0.3, permutation: 0.9, sort: 0
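The numbers on the two tf.idf slides can be reproduced with a short sketch. It assumes raw term counts for tf and a base-10 logarithm of (number of documents / document frequency) for idf, which matches the ~0.3 values shown; the "smoothing" step mentioned on the slide is not modeled, and all names here are hypothetical.

```python
import math
from collections import Counter

# Train set from the slide, expanded from (term, count) pairs into token lists.
TRAIN_SET = [
    ["list", "permutation", "generate", "generate", "string"],  # Doc 1
    ["sort", "sort", "sort", "list", "string"],                 # Doc 2
]
VOCABULARY = ["list", "string", "generate", "permutation", "sort"]
# Wanted document: list x2, string x1, generate x1, set x1, permutation x3.
WANTED = ["list", "list", "string", "generate", "set"] + ["permutation"] * 3


def idf_table(docs):
    """idf_t = log10(N / df_t), where df_t is the number of documents containing t."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log10(len(docs) / df) for term, df in doc_freq.items()}


def tfidf_vector(doc, idf, vocabulary):
    """tf.idf_{t,d} = tf_{t,d} * idf_t for each vocabulary term; unseen terms are ignored."""
    counts = Counter(doc)
    return [counts[term] * idf[term] for term in vocabulary]


IDF = idf_table(TRAIN_SET)
VECTOR = tfidf_vector(WANTED, IDF, VOCABULARY)

if __name__ == "__main__":
    print({term: round(value, 1) for term, value in IDF.items()})
    print([round(value, 1) for value in VECTOR])
```

Terms that appear in every training document ("list", "string") get weight 0 however often they occur, while "set" is dropped because it is not in the trained vocabulary.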

  25. Models – Latent Semantic Analysis
      "There is some underlying latent semantic structure in the data that is obscured by the randomness of word choice." [Deerwester et al.]
      Example: "Create string" / "Generate text"
      - Words that are used in the same contexts tend to have similar meanings
      - Mapping words and documents into a "concept" space
      - Finding the underlying meaning
      - Synonyms
