
Bachelor Thesis: Automatic Token Classification for Unknown Languages

Bachelor thesis by Jan Kurš and Joel Guggisberg. Given code written in an unknown programming language, the goal is to automatically recognize which tokens are the language's keywords.


  1. Bachelor Thesis: Automatic Token Classification for Unknown Languages Jan Kurš, Joel Guggisberg

  2. 1 Introduction
     • Given code written in an unknown programming language, attempt to automatically recognize which tokens are the language's keywords.
     • To find these keywords, assume that many programming languages share common constructs.

  3. 2 Architecture

  4. 3 Database

  5. 4 Analysis methods
     • Global: The tokens that appear most commonly across all source code are keywords.
     • Coverage: The tokens that appear in the most distinct files are keywords.
     • Newline: The tokens that appear most commonly at the first position of a new line are keywords.
     • Indent: The tokens that appear most commonly at the beginning of a line before an indent are keywords.
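The four hypotheses can be read as four ways of counting tokens. The following is a minimal sketch under stated assumptions, not the thesis implementation: the tokenizer (identifier-like words only, punctuation ignored), the reading of "before an indent" as "the next non-blank line is indented more deeply", and the names `rank_tokens`, `indent_width`, and `WORD` are all hypothetical.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z_]\w*")  # assumed token shape: identifier-like words only


def indent_width(line):
    """Width of the leading whitespace, with tabs expanded to 4 columns."""
    expanded = line.expandtabs(4)
    return len(expanded) - len(expanded.lstrip())


def rank_tokens(files):
    """Count tokens under the four hypotheses.

    files: mapping of file name -> source text.
    Returns a dict of Counters keyed by hypothesis name.
    """
    counts = {name: Counter() for name in ("global", "coverage", "newline", "indent")}
    for text in files.values():
        lines = [ln for ln in text.splitlines() if ln.strip()]  # skip blank lines
        seen_in_file = set()
        for i, line in enumerate(lines):
            tokens = WORD.findall(line)
            if not tokens:
                continue  # e.g. a line holding only a closing brace
            counts["global"].update(tokens)      # Global: every occurrence
            seen_in_file.update(tokens)
            first = tokens[0]
            counts["newline"][first] += 1        # Newline: first word on the line
            # Assumed reading of "before an indent": the next line is indented deeper.
            if i + 1 < len(lines) and indent_width(lines[i + 1]) > indent_width(line):
                counts["indent"][first] += 1
        counts["coverage"].update(seen_in_file)  # Coverage: one count per file
    return counts
```

With such counters, `rank_tokens(files)["coverage"].most_common(50)` would give the 50 strongest keyword candidates under the Coverage hypothesis.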

  6. 5 Java results of the hypotheses
     $\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$
     Keywords in Java: 50; Projects: 179; Files: 100’764; Distinct tokens: 414’334; Occurrences of tokens: 92’036’362
     • Global: The keywords appear most commonly over all source code.
     • Coverage: The tokens that appear in the most files are keywords.
     • Newline: The tokens that appear most commonly at the first position of a new line are keywords.
     • Indent: The tokens at the beginning of a line before an indent are keywords.
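To make the precision measure concrete, here is a small sketch that scores the top `k` candidates of a ranking against a known keyword set. The function name `precision_at_k` and the sample tokens are illustrative assumptions, not data from the thesis.

```python
def precision_at_k(ranked_tokens, keywords, k):
    """Precision = TP / (TP + FP) over the first k ranked candidates."""
    top = ranked_tokens[:k]
    if not top:
        return 0.0
    true_positives = sum(1 for token in top if token in keywords)
    false_positives = len(top) - true_positives
    return true_positives / (true_positives + false_positives)


# Hypothetical ranking: "String" and "i" are frequent in Java code but are not keywords.
ranked = ["public", "String", "return", "i", "if"]
java_keywords = {"public", "return", "if", "class"}
print(precision_at_k(ranked, java_keywords, 5))  # 3 true positives out of 5 candidates
```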

  7. 6 Filters
     How can we improve those results?
     • Scan mode filter: Removes all tokens marked by the scan mode.
     • Intersection filter: Counts in how many projects a token occurs and removes the tokens that don’t occur in enough projects. Used to remove project-specific pollution.
     • Upper case filter: Removes all tokens containing capital letters, since in Java and many other languages keywords are written in lower-case letters.
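Two of these filters are simple enough to sketch directly. The code below is an illustration under assumptions: the input shape (a mapping from token to the set of projects containing it), the threshold parameter `min_projects`, and both function names are hypothetical. The scan mode filter is left out because the slides do not say how the scan mode marks tokens.

```python
def intersection_filter(token_projects, min_projects):
    """Keep tokens that occur in at least min_projects distinct projects.

    token_projects: mapping of token -> set of project names containing it.
    """
    return {
        token
        for token, projects in token_projects.items()
        if len(projects) >= min_projects
    }


def upper_case_filter(tokens):
    """Drop every token that contains at least one capital letter."""
    return {token for token in tokens if not any(ch.isupper() for ch in token)}


# Hypothetical input: "fooBar" is project-specific, "String" is common but capitalized.
token_projects = {
    "class": {"p1", "p2", "p3"},
    "String": {"p1", "p2", "p3"},
    "fooBar": {"p1"},
    "return": {"p2", "p3"},
}
candidates = upper_case_filter(intersection_filter(token_projects, min_projects=2))
```

Chaining the filters this way removes project-specific identifiers first and capitalized type names second, leaving only `class` and `return` in this toy input.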

  8. 7 Java results filtered
     $\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$
     Keywords in Java: 50; Projects: 179; Files: 100’764; Distinct tokens: 414’334; Occurrences of tokens: 92’036’362
     Panels: Global, Coverage, Newline, Indent (hypotheses as defined on slide 5).

  9. 7 More data, better results?
     $\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$
     [Chart: precision (y-axis, 0 to 1.4) against the number of keywords found (true positives; x-axis, 0 to 50) for 1, 5, and 170 projects, each with an exponential trend line.]
     • Coverage: The tokens that appear in the most files are keywords.
     • Intersection filter: Counts in how many projects a token occurs and removes the tokens that don’t occur in enough projects. Used to remove project-specific pollution.
     Keywords in Java: 50; Projects: 179; Files: 100’764; Distinct tokens: 414’334; Occurrences of tokens: 92’036’362
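One plausible way to produce such a curve, offered as an assumption about how the chart was built rather than as the authors' procedure, is to walk down a ranked candidate list and record the precision each time another keyword is found. The name `precision_curve` and the sample data are hypothetical.

```python
def precision_curve(ranked_tokens, keywords):
    """Return (true_positives, precision) pairs, one per keyword found.

    Precision at each point is TP / rank, where rank = TP + FP is the
    number of candidates inspected so far.
    """
    points = []
    true_positives = 0
    for rank, token in enumerate(ranked_tokens, start=1):
        if token in keywords:
            true_positives += 1
            points.append((true_positives, true_positives / rank))
    return points


# Hypothetical ranking; repeating this per corpus size (1, 5, 170 projects)
# would give one curve per series.
ranked = ["public", "String", "return", "i", "if"]
curve = precision_curve(ranked, {"public", "return", "if", "class"})
```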

  10. 8 Summary
