Bachelor Thesis: Automatic Token Classification for Unknown - PowerPoint PPT Presentation

Bachelor Thesis: Automatic Token Classification for Unknown Languages Jan Kurš, Joel Guggisberg

1 Introduction
• Given code of an unknown programming language, attempt to automatically recognize which tokens are the keywords of the language.
• To find these keywords, assume that many programming languages share common constructs.

2 Architecture

3 Database

4 Analysis methods
• Global: the tokens that appear most often over all source code are keywords.
• Coverage: the tokens that appear in the most files are keywords.
• Newline: the tokens that appear most often at the first position of a new line are keywords.
• Indent: the tokens that appear most often at the beginning of a line before an indent are keywords.
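The four counting methods above can be sketched in a few lines of Python. This is only an illustration, not the thesis implementation: the input format (a dict from file name to source text), the whitespace tokenizer, and the reading of "before an indent" as "the next non-blank line is indented deeper" are all assumptions made for this example.

```python
from collections import Counter


def indent_width(line):
    # Number of leading whitespace characters on the line.
    return len(line) - len(line.lstrip())


def rank_tokens(files):
    """Count tokens under the four hypotheses.

    files: dict mapping a file name to its source text (hypothetical input format).
    """
    counts = {name: Counter() for name in ("global", "coverage", "newline", "indent")}
    for text in files.values():
        lines = [ln for ln in text.splitlines() if ln.strip()]  # skip blank lines
        seen_in_file = set()
        for i, line in enumerate(lines):
            tokens = line.split()  # naive whitespace tokenizer (assumption)
            counts["global"].update(tokens)   # Global: every occurrence
            seen_in_file.update(tokens)
            first = tokens[0]
            counts["newline"][first] += 1     # Newline: first token of a line
            # Indent (assumed reading): the next line is indented deeper.
            if i + 1 < len(lines) and indent_width(lines[i + 1]) > indent_width(line):
                counts["indent"][first] += 1
        counts["coverage"].update(seen_in_file)  # Coverage: once per file
    return counts
```

Calling `rank_tokens(files)["coverage"].most_common(50)` would then give the 50 strongest keyword candidates under the Coverage hypothesis.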

5 Java results of the hypotheses
Precision = True Positives / (True Positives + False Positives)
Keywords in Java: 50
Projects: 179
Files: 100’764
Distinct tokens: 414’334
Occurrences of tokens: 92’036’362
[Figure: result charts for the four methods: Global, Coverage, Newline, Indent]
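The precision measure on this slide (true positives divided by the sum of true and false positives) can be computed for the top-ranked candidates of any method. The sketch below assumes a ranked token list and a known keyword set; the function name and the cut-off parameter `k` are hypothetical and not taken from the thesis.

```python
def precision_at_k(ranked_tokens, keywords, k):
    """Precision of the k highest-ranked tokens against a known keyword set."""
    top = ranked_tokens[:k]
    if not top:
        return 0.0  # no candidates, so no true positives
    true_positives = sum(1 for token in top if token in keywords)
    false_positives = len(top) - true_positives
    return true_positives / (true_positives + false_positives)
```

For example, if two of the four highest-ranked tokens are real keywords, the precision at k = 4 is 0.5.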

6 Filters
How can we improve those results?
• Scan mode filter: removes all tokens marked by the scan mode.
• Intersection filter: counts in how many projects a token occurs and removes the tokens that do not occur in enough projects. Used to remove project-specific pollution.
• Upper case filter: removes all tokens containing capital letters, since in Java and many other languages keywords are written in lower-case letters.
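Two of these filters are simple enough to sketch directly. The code below is an illustration under assumptions: the input format (a dict from project name to its set of tokens), the function names, and the `min_projects` threshold are hypothetical, and the scan mode filter is left out because the slides do not say how tokens are marked.

```python
from collections import Counter


def intersection_filter(project_tokens, min_projects):
    """Keep only tokens that occur in at least min_projects projects.

    project_tokens: dict mapping a project name to the set of tokens it contains.
    """
    project_count = Counter()
    for tokens in project_tokens.values():
        project_count.update(set(tokens))  # count each token once per project
    return {token for token, n in project_count.items() if n >= min_projects}


def upper_case_filter(tokens):
    """Drop every token that contains at least one capital letter."""
    return {token for token in tokens if not any(ch.isupper() for ch in token)}
```

Applied in sequence, the intersection filter removes project-specific identifiers first, and the upper case filter then removes widely used class names such as `String`.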

7 Java results filtered
Precision = True Positives / (True Positives + False Positives)
Keywords in Java: 50
Projects: 179
Files: 100’764
Distinct tokens: 414’334
Occurrences of tokens: 92’036’362
[Figure: filtered result charts for the four methods: Global, Coverage, Newline, Indent]

7 More data, better results?
Precision = True Positives / (True Positives + False Positives)
[Figure: precision (y-axis) against number of keywords found, i.e. true positives (x-axis, 0 to 50), with series for 1 project, 5 projects and 170 projects, each with an exponential trend line]
Coverage: the tokens that appear in the most files are keywords.
Intersection filter: counts in how many projects a token occurs and removes the tokens that do not occur in enough projects. Used to remove project-specific pollution.
Keywords in Java: 50
Projects: 179
Files: 100’764
Distinct tokens: 414’334
Occurrences of tokens: 92’036’362

8 Summary
