Bachelor Thesis: Automatic Token Classification for Unknown Languages




SLIDE 1

Bachelor Thesis: Automatic Token Classification for Unknown Languages

Jan Kurš, Joel Guggisberg

SLIDE 2
1 Introduction

  • Given code of an unknown programming language, attempt to automatically recognize which tokens are the keywords of the language.
  • To find these keywords, assume that many programming languages have common constructs.

SLIDE 3

2 Architecture

SLIDE 4

3 Database

SLIDE 5

4 Analysis methods

Global

The tokens that appear most commonly across all source code are keywords

Coverage

The tokens that appear in the most files are keywords

Indent

The tokens that appear most commonly at the beginning of a line before an indent are keywords

Newline

The tokens that appear most commonly at the first position of a new line are keywords
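The four hypotheses above can each be read as a counting rule. The following is a minimal sketch, not the thesis implementation. It assumes a simple regex tokenizer (identifiers or single symbols), and it assumes that "before an indent" means the next non-blank line is indented more deeply than the current one. The names `tokenize`, `indent_of`, and `rank_tokens` are hypothetical.

```python
import re
from collections import Counter

# Assumption: a token is an identifier-like word or a single non-space symbol.
TOKEN = re.compile(r"[A-Za-z_]\w*|\S")


def tokenize(line):
    return TOKEN.findall(line)


def indent_of(line):
    # Width of the leading whitespace (spaces and tabs counted alike).
    return len(line) - len(line.lstrip(" \t"))


def rank_tokens(files):
    """files: mapping of file name -> source text. Returns one Counter per method."""
    global_c, coverage_c = Counter(), Counter()
    indent_c, newline_c = Counter(), Counter()
    for text in files.values():
        lines = [ln for ln in text.splitlines() if ln.strip()]  # skip blank lines
        seen = set()
        for i, line in enumerate(lines):
            toks = tokenize(line)
            global_c.update(toks)          # Global: every occurrence counts
            seen.update(toks)
            newline_c[toks[0]] += 1        # Newline: first token of the line
            # Indent (assumed reading): first token of a line followed by a deeper line
            if i + 1 < len(lines) and indent_of(lines[i + 1]) > indent_of(line):
                indent_c[toks[0]] += 1
        coverage_c.update(seen)            # Coverage: one count per file
    return {"global": global_c, "coverage": coverage_c,
            "indent": indent_c, "newline": newline_c}
```

Calling `rank_tokens(files)["coverage"].most_common(50)` would then give the 50 candidate keywords under the Coverage hypothesis.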

SLIDE 6

5 Java results for the hypotheses

Keywords in Java: 50
Projects: 179
Files: 100’764
Distinct tokens: 414’334
Occurrences of tokens: 92’036’362

Global: The tokens that appear most commonly across all source code are keywords
Coverage: The tokens that appear in the most files are keywords
Indent: The tokens at the beginning of a line before an indent are keywords
Newline: The tokens that appear most commonly at the first position of a new line are keywords

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$
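As an illustration of the precision measure above, the sketch below scores the top k tokens of a ranking against a known keyword set. The function name `precision_at_k`, the example ranking, and the small subset of Java keywords are assumptions made for this example; they are not data from the thesis.

```python
def precision_at_k(ranked_tokens, keywords, k):
    """Precision of the k highest-ranked tokens: TP / (TP + FP)."""
    top = ranked_tokens[:k]
    if not top:
        return 0.0
    true_positives = sum(1 for tok in top if tok in keywords)
    return true_positives / len(top)  # len(top) = true positives + false positives


# Hypothetical ranking and a small subset of the Java keywords.
java_keywords = {"if", "for", "while", "return", "public", "class"}
ranking = ["if", "i", "return", "x", "for"]
print(precision_at_k(ranking, java_keywords, 4))  # 2 keywords among 4 tokens
```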

SLIDE 7

6 Filters

Upper case filter: Removes all tokens containing capital letters, since keywords in Java and many other languages are written in lower-case letters.

Scan mode filter: Removes all tokens marked by the scan mode.

Intersection filter: Counts in how many projects a token occurs and removes the tokens that don’t occur in enough projects. Used to remove project-specific pollution.

How can we improve those results?
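The upper case and intersection filters can be sketched as follows. This is an illustrative sketch only: the function names and the `min_projects` threshold are assumptions, since the slides say only "enough projects". The scan mode filter is left out because the slides do not define how tokens are marked.

```python
from collections import Counter


def upper_case_filter(tokens):
    """Drop every token that contains at least one capital letter."""
    return [t for t in tokens if not any(c.isupper() for c in t)]


def intersection_filter(tokens, project_tokens, min_projects):
    """Keep tokens that occur in at least min_projects projects.

    project_tokens: mapping of project name -> set of tokens seen in that project.
    min_projects: hypothetical threshold standing in for "enough projects".
    """
    counts = Counter()
    for toks in project_tokens.values():
        counts.update(set(toks))  # each project contributes at most 1 per token
    return [t for t in tokens if counts[t] >= min_projects]
```

Both filters preserve the order of the input ranking, so they can be chained before precision is computed.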

SLIDE 8

7 Java results filtered

Keywords in Java: 50
Projects: 179
Files: 100’764
Distinct tokens: 414’334
Occurrences of tokens: 92’036’362

Global: The tokens that appear most commonly across all source code are keywords
Coverage: The tokens that appear in the most files are keywords
Indent: The tokens at the beginning of a line before an indent are keywords
Newline: The tokens that appear most commonly at the first position of a new line are keywords

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

SLIDE 9

7 More data, better results?

Keywords in Java: 50
Projects: 179
Files: 100’764
Distinct tokens: 414’334
Occurrences of tokens: 92’036’362

Coverage: The tokens that appear in the most files are keywords

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

Intersection filter: Counts in how many projects a token occurs and removes the tokens that don’t occur in enough projects. Used to remove project-specific pollution.

[Figure: Precision (y-axis) against number of keywords found, i.e. true positives (x-axis, 5 to 50), for 1 project, 5 projects, and 170 projects, each with an exponential trend line.]
SLIDE 10

8 Summary