Leveraging a Corpus of Natural Language Descriptions for Program Similarity



  1. Leveraging a Corpus of Natural Language Descriptions for Program Similarity
     Meital Zilberstein & Eran Yahav
     Technion – Israel Institute of Technology
     Onward! 2016
     LEVERAGING A CORPUS OF NATURAL LANGUAGE DESCRIPTIONS FOR PROGRAM SIMILARITY - MEITAL ZILBERSTEIN & ERAN YAHAV

  2. Lots of snippets out there
     - >19M users
     - >5.9M registered users
     - >38M repositories
     - >12M questions
     - >19M answers
     (figures as of Sep '16)
     And also: Google Code, programming blogs, documentation sites, requirements documents, comments, identifiers, commits, etc.

  3. Similarity: Images vs. Programs
     - Code is not organized
     - Cannot accomplish even simple tasks (which are increasingly improving in other domains)

  4. Similarity: Images vs. Programs
     - Images already have some solutions
     - Find somewhere on the web
     Google image search: Lago di Canzolino, Italy

  5. Similarity: Images vs. Programs
     - With code we still don't know what to do
     Program P

  6. Why are Programs Hard?
     - A program is a data transformer
     - "infinite data" ≫ "big data"
     - Potentially infinite number of runtime behaviors
     - Depends on inputs
     Infinite code:
         from subprocess import call
         cmd_to_run = raw_input()
         call(cmd_to_run.split())

  7. Why are Programs Hard?
     - Print the exact same value
     - Both written in Java
     - Syntactic difference
     First snippet:
         int scale = 100000;
         double x = (double)Math.round(8.912384 * scale) / scale;
         System.out.println(x);
     Second snippet:
         DecimalFormat df = new DecimalFormat("#0.00000");
         System.out.println(df.format(8.912384));

  8. Syntactic Similarity is not Sufficient
     - Two approaches for similarity
       - Textual diff
     "There's more than one way to do it" (Perl slogan)

  9. Syntactic Similarity is not Sufficient
     First snippet:
         import os
         if os.path.exists(filename):
             print("exist")
         else:
             print("no such file")
     Second snippet:
         try:
             fh = open(filename)
             print("exist")
         except:
             print("no such file")
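
The point of this slide can be checked mechanically. The following is a minimal sketch in Python 3 (the slide's code is Python 2 style), assuming a hypothetical pair of helper functions, `check_with_os_path` and `check_with_open`, that wrap the two snippets so their results can be compared. `difflib.SequenceMatcher` stands in for "textual diff" here; it is an illustrative choice, not the measure used by the authors.

```python
import difflib
import os
import tempfile

# The two snippets from the slide, kept as text so they can be diffed.
SNIPPET_A = '''import os
if os.path.exists(filename):
    print("exist")
else:
    print("no such file")
'''

SNIPPET_B = '''try:
    fh = open(filename)
    print("exist")
except:
    print("no such file")
'''


def check_with_os_path(filename):
    """Hypothetical wrapper around the first snippet; returns the message."""
    return "exist" if os.path.exists(filename) else "no such file"


def check_with_open(filename):
    """Hypothetical wrapper around the second snippet; returns the message."""
    try:
        with open(filename):
            return "exist"
    except OSError:
        return "no such file"


# Character-level textual similarity of the two snippets, in [0, 1].
textual_ratio = difflib.SequenceMatcher(None, SNIPPET_A, SNIPPET_B).ratio()

# Behavioural comparison on one existing and one missing regular file.
with tempfile.TemporaryDirectory() as tmpdir:
    present = os.path.join(tmpdir, "present.txt")
    absent = os.path.join(tmpdir, "absent.txt")
    with open(present, "w") as fh:
        fh.write("data")
    agree_on_existing = (
        check_with_os_path(present) == check_with_open(present) == "exist"
    )
    agree_on_missing = (
        check_with_os_path(absent) == check_with_open(absent) == "no such file"
    )

print("textual ratio:", round(textual_ratio, 2))
print("same answer for existing file:", agree_on_existing)
print("same answer for missing file:", agree_on_missing)
```

The two wrappers agree on regular files even though the snippet texts differ. They are not equivalent everywhere: for a directory, `os.path.exists` reports that it exists while `open` fails.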

  10. Syntactic Similarity is not Sufficient
      - Textual diff
      - Abstract Syntax Tree diff
      First snippet:
          from itertools import permutations
          permutations(["a", "b"])
      Second snippet:
          from subprocess import call
          call(["ls", "-l"])
      [Figure: abstract syntax tree with nodes Module, Import, Expr, Call, args, Name, List, Str, Str]
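
One way to see why an AST diff can mislead on these two snippets is to compare their trees with names and literal values ignored. The sketch below uses Python's standard `ast` module; the helper `ast_shape` is a hypothetical name introduced for illustration, and comparing node-type sequences is a simplification of a real tree diff.

```python
import ast


def ast_shape(source):
    """Return the node-type names of the parsed source in ast.walk order.

    Identifiers and literal values are dropped, so only the shape of the
    tree is kept. On Python 3.8+ string literals appear as Constant nodes
    rather than the Str nodes shown in the slide's figure.
    """
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


snippet_a = 'from itertools import permutations\npermutations(["a", "b"])\n'
snippet_b = 'from subprocess import call\ncall(["ls", "-l"])\n'

# Both snippets are "import one name, then call it with a two-string list".
same_shape = ast_shape(snippet_a) == ast_shape(snippet_b)
print(same_shape)
```

The shapes match although one snippet builds permutations and the other runs a shell command, so structural equality alone says little about functionality.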

  11. Cross Language Similarity
      Generation of all possible permutations of a string
      - Different algorithms
      - Similar functionality
      C:
          void permute(const char *s, char *out,
                       int *used, int len, int lev){
              if (len == lev) {
                  out[lev] = '\0';
                  puts(out);
                  return;
              }
              int i;
              for (i = 0; i < len; ++i) {
                  if (used[i]) continue;
                  used[i] = 1;
                  out[lev] = s[i];
                  permute(s,out,used,len,lev+1);
                  used[i] = 0;
              }
              return;
          }
      Python:
          def p (head, tail=''):
              if len(head) == 0:
                  print tail
              else:
                  for i in range(len(head)):
                      p(head[0:i] + head[i+1:],
                        tail + head[i])

  12. Our approach (simplified)

  13. Semantic Relatedness
      - First appeared in the NLP domain
        - Finer case of semantic similarity (is-a)
        - Can be established across different parts of speech
      - Based on functionality
      - Quantitative similarity
      - Semantic relatedness
        - Inclusion, reversal
      Equivalent? NO!
      Python:
          import random
          print random.randint(min, max)
      Java:
          public static int getRandom(int min, int max){
              Random rn = new Random();
              int range = max - min + 1;
              return rn.nextInt(range) + min;
          }

  14. Code Similarity Applications
      - Code similarity is a central challenge in many programming-related applications, such as:
        - Semantic code search
        - Automatic translation
        - Education
      "I know how to get tomorrow's date in Java, it's easy! PHP though.."
      Java:
          Date d1 = new Date();
          Date d2 = new Date();
          d2.setTime(d1.getTime() + 1*24*60*60*1000);
      PHP:
          define(DATETIME_FORMAT, 'y-m-d H:i');
          $time = date(DATETIME_FORMAT, strtotime("+1 day", $time));

  15. Automatic Tagging of Snippets
      - Predict a set of textual labels
      - Semantics of the code fragment
      - Long-term goal: produce natural-language summaries for code snippets
      Example:
          int foo = Integer.parseInt("1234");
      Predicted tags: string, int, converting

  16. Overview

  17. Leveraging Collective Knowledge
      - Stack Overflow
        - Community question-answering site
        - Programming-related questions
        - Each question is associated with a title, content and tags
      - Implicit mapping between code fragments and their descriptions

  18. [Figure: annotated question page with the labels title, question, tags, votes, answers, code]

  19. Know your limits!
      - This work presents a radical departure from common approaches
      - Challenge: find representatives in the pre-computed database
      - The results are biased by the quality of the database
      - We show that this approach is feasible for snippets that serve a common purpose

  20. The Importance of Data
      [Plot: % Matches (y-axis, 0 to 12) against log2(DB Size) (x-axis, 9 to 17)]

  21. Data Coverage
      "Although the number of legal statements in the language is theoretically infinite, the number of practically useful statements is much smaller, and potentially finite." (A Study of the Uniqueness of Source Code, Gabel et al.)
      - Software is usually an aggregation of much smaller parts
      - Code is repetitive and predictable
      - Syntactic similarity

  22. Going Back to our Example

  23. Text Similarity
      - Python code partial description:
        - "How to generate all permutations of a list in Python?"
      - C code partial description:
        - "Generating list of all possible permutations of a string"
      - Similarity score ≈ 0.8

  24. Text Processing
      - Input: "generating list of all possible permutations of a string in c ?"
      - Removing stop-words & punctuation: "generating list possible permutations string"
      - Lemmatization: "generate list possible permutation string"
      - Vector Space Model, using a model trained on 1M docs: w(1) w(2) w(3) ... w(n-1) w(n)
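
The preprocessing steps on this slide can be sketched in a few lines of Python 3. This is an illustration only: the stop-word set `STOP_WORDS` and the lemma table `LEMMAS` are tiny hypothetical stand-ins chosen to cover the slide's example, not the resources used by the authors, who would presumably rely on a full stop-word list and a real lemmatizer.

```python
import string

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"of", "all", "a", "in", "c", "the", "to", "how"}
LEMMAS = {"generating": "generate", "permutations": "permutation"}


def preprocess(title):
    """Lowercase, strip punctuation, drop stop-words, then lemmatize."""
    lowered = title.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in no_punct.split() if tok not in STOP_WORDS]
    return [LEMMAS.get(tok, tok) for tok in tokens]


tokens = preprocess(
    "Generating list of all possible permutations of a string in c ?"
)
print(tokens)
# ['generate', 'list', 'possible', 'permutation', 'string']
```

The resulting token list is what would be handed to the vector space model in the final step.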

  25. Models – tf.idf
      $\mathrm{tf.idf}_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_{t}$
      - Term Frequency, Inverse Document Frequency
      - Each cell is:
        - Higher when the term occurs many times
        - Lower when the term occurs in many documents
      [Figure: worked example. Term counts of a wanted document are multiplied by idf values computed from a train set (Doc 1, Doc 2), with smoothing. Recoverable idf values: list 0, string 0, permutation ~0.3, generate ~0.3, sort ~0.3.]
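
As a rough illustration of the weighting $\mathrm{tf}_{t,d} \cdot \mathrm{idf}_{t}$, here is a minimal Python 3 sketch. Several things in it are assumptions rather than details from the slides: idf is taken as the unsmoothed $\log_{10}(N / \mathrm{df}_t)$ (the slide mentions smoothing but its form is not recoverable), cosine similarity is used to compare vectors, and the four-document train set is invented for the example.

```python
import math
from collections import Counter


def idf_table(train_docs):
    """Map each term to log10(N / df), where df counts documents with the term."""
    n_docs = len(train_docs)
    doc_freq = Counter(term for doc in train_docs for term in set(doc))
    return {term: math.log10(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(doc, idf):
    """Sparse tf.idf vector: raw term count times idf, for known terms only."""
    counts = Counter(doc)
    return {term: counts[term] * idf[term] for term in counts if term in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical train set of already preprocessed descriptions.
train = [
    ["generate", "list", "possible", "permutation", "string"],
    ["generate", "permutation", "list", "python"],
    ["sort", "list", "string"],
    ["read", "file", "string"],
]

idf = idf_table(train)
vectors = [tfidf_vector(doc, idf) for doc in train]

sim_related = cosine(vectors[0], vectors[1])    # two permutation descriptions
sim_unrelated = cosine(vectors[0], vectors[3])  # permutation vs. file reading
print(sim_related > sim_unrelated)
```

Terms that occur in most documents, such as "list" and "string" here, receive a low idf and contribute little, so the two permutation descriptions score higher with each other than with the unrelated one.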
