SLIDE 1

Discovering Information Explaining API Types Using Text Classification

Presented by: Sunyam Bagga
Course Instructor: Dr. Jin Guo
SLIDE 2

TEXT CLASSIFICATION

Source: https://www.python-course.eu/text_classification_introduction.php

Classification task: given an [API type, Section fragment] pair, label it Relevant or Irrelevant.
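A minimal sketch of this kind of binary text classification, using scikit-learn. The toy examples, labels, and bag-of-words features below are invented for illustration and are not the paper's actual setup:

```python
# Each example pairs an API type name with a tutorial section fragment;
# the label says whether that section explains the type (1) or not (0).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

pairs = [
    "DateTime a DateTime object has methods such as year and monthOfYear",
    "DateTime installing the library requires adding the jar to the classpath",
    "Interval an Interval represents a span between two instants",
    "Interval see the download page for release notes",
]
labels = [1, 0, 1, 0]  # 1 = relevant, 0 = irrelevant

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(pairs)
clf = LogisticRegression().fit(X, labels)

query = ["DateTime year returns the year field of this DateTime"]
pred = clf.predict(vectorizer.transform(query))[0]
print("relevant" if pred == 1 else "irrelevant")
```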

SLIDE 3

Technical Concepts

  1. RecoDoc tool
  2. LOOCV
  3. Maximum Entropy
  4. Cosine similarity with tf-idf weighting
  5. Kappa
SLIDE 4

RecoDoc

“Recovering Traceability Links between an API and Its Learning Resources”

1

SLIDE 5

Aim:

  • Find API types referenced in a tutorial:
  • Identifies CLTs (code-like terms)
  • Links these CLTs to the exact API type

“DateTime….such as year() or monthOfYear().”

  • Precisely link code-like terms (e.g., year()) to specific code elements (e.g., DateTime.year())

SLIDE 6

Ambiguity

▪ Declaration Ambiguity: CLTs are rarely fully qualified.
▪ Overload Ambiguity: CLTs do not indicate the number/type of parameters (method is overloaded).
▪ External Reference Ambiguity: May refer to code elements in external libraries.
▪ Language Ambiguity: Human errors: typographical (HtttpClient), case errors, forgetting parameters, etc.

SLIDE 7

Parsing Artifacts and Recovering Traceability Links

  • Linking Types: Given a CLT, they find all types in the codebase whose name matches the term.
  • Disambiguate and filter.
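The matching step can be sketched as a lookup over an index of the codebase: collect every fully qualified element whose simple name matches the CLT, leaving disambiguation and filtering to later heuristics. The tiny "codebase" dictionary below is invented for illustration:

```python
# Hypothetical index: fully qualified type names -> their members.
codebase = {
    "org.joda.time.DateTime": ["year()", "monthOfYear()", "plusDays(int)"],
    "org.joda.time.Interval": ["getStart()", "getEnd()"],
}

def candidate_links(clt):
    """Return fully qualified candidates for a CLT such as 'year()'."""
    matches = []
    for type_name, members in codebase.items():
        if clt == type_name.rsplit(".", 1)[-1]:
            matches.append(type_name)  # CLT names the type itself
        for member in members:
            if clt.rstrip("()") == member.split("(")[0]:
                matches.append(f"{type_name}.{member}")  # CLT names a member
    return matches

print(candidate_links("year()"))   # -> ['org.joda.time.DateTime.year()']
print(candidate_links("DateTime")) # -> ['org.joda.time.DateTime']
```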
SLIDE 8

LOOCV

“Evaluating a classifier’s performance”

2

SLIDE 9

Leave-one-out Cross Validation

Source: https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6

SLIDE 10

MaxEnt Classifier

“Using Maximum Entropy for Text Classification” by Nigam et al.

3

SLIDE 11

Maximum Entropy:

  • Technique for estimating probability distributions from data
  • Principle: Without external knowledge, pick the distribution that has the maximum entropy (most-uniform).
  • Labeled training data helps put constraints on the distribution
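The principle can be checked on a toy distribution: among all distributions satisfying a constraint, the maximum-entropy one spreads probability as uniformly as the constraint allows. The four-outcome setup and the constraint p(a) + p(b) = 0.6 below are invented for illustration:

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Most-uniform distribution over {a, b, c, d} with p(a) + p(b) = 0.6:
maxent = [0.3, 0.3, 0.2, 0.2]

# Any other distribution satisfying the same constraint is less uniform:
other = [0.5, 0.1, 0.25, 0.15]

assert abs(maxent[0] + maxent[1] - 0.6) < 1e-9
assert abs(other[0] + other[1] - 0.6) < 1e-9
print(entropy(maxent) > entropy(other))  # -> True
```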

SLIDE 12

Example

Source: NLP by Dan Jurafsky and Chris Manning

SLIDE 13

Add Noun feature: f1 = {NN, NNS, NNP, NNPS}
Add Proper Noun feature: f2 = {NNP, NNPS}

Source: NLP by Dan Jurafsky and Chris Manning

SLIDE 14

Constraints and Features

  • Restricts the model distribution to have the same expected value for a feature as seen in training data, D:
  • Features for text classification:
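A standard way to write this constraint, following Nigam et al.'s formulation (the slide's own formula is not reproduced here):

```latex
% For each feature f_i, the model's expected value must match the
% empirical expectation over the training data D:
\sum_{d \in D} \tilde{p}(d) \sum_{c} p(c \mid d)\, f_i(d, c)
  \;=\; \sum_{(d, c) \in D} \tilde{p}(d, c)\, f_i(d, c)

% The maximum-entropy distribution satisfying such constraints has the
% exponential (log-linear) form, with one weight \lambda_i per feature:
p(c \mid d) \;=\; \frac{1}{Z(d)} \exp\!\Big(\sum_i \lambda_i f_i(d, c)\Big)
```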
SLIDE 15

Cosine Similarity with tf-idf

“Comparison with Information Retrieval”

4

SLIDE 16

Tf-Idf

  • Technique to vectorise text data
  • Term Frequency is a simple frequency count of a term in a document
  • Inverse Document Frequency gives more weight to rare words.
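A short sketch of tf-idf in practice, using scikit-learn's `TfidfVectorizer` on invented toy documents. A word that appears in every document (here "the") gets the lowest idf, so it contributes least to the vectors:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the DateTime class has a year method",
    "the Interval class has a start and an end",
    "download the release archive",
]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # rows: documents, columns: vocabulary terms

vocab = vec.vocabulary_  # term -> column index
idf = vec.idf_           # idf weight per term

# "the" occurs in all 3 documents; "datetime" in only 1.
print(idf[vocab["the"]] < idf[vocab["datetime"]])  # -> True
```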

SLIDE 17

Cosine Similarity

  • Measures the cosine of the angle between the vectors:
  • They consider a section relevant if the similarity value is higher than a certain threshold.
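The similarity and thresholding rule can be sketched directly; the two vectors and the threshold value below are invented (the paper tunes its own threshold):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical tf-idf vectors for a tutorial section and an API type:
section_vec = [0.0, 0.7, 0.3, 0.5]
api_vec = [0.1, 0.6, 0.2, 0.4]

sim = cosine_similarity(section_vec, api_vec)
THRESHOLD = 0.5  # illustrative value only
print(f"similarity = {sim:.3f}, relevant = {sim > THRESHOLD}")
```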

SLIDE 18

KAPPA score

“Annotating the Experimental Corpus”

5

SLIDE 19

Kappa formula

  • Measures inter-annotator agreement: Kappa = (Po − Pe) / (1 − Pe)

▪ Po: observed agreement among annotators
▪ Pe: hypothetical probability of chance agreement
▪ More robust than a simple percent-agreement calculation

SLIDE 20

Kappa Example:

▪ Po = (20+15) / 50 = 0.7
▪ P(Yes) = 0.5*0.6 = 0.3
▪ P(No) = 0.5*0.4 = 0.2
▪ Pe = P(Yes) + P(No) = 0.5
▪ Kappa = (0.7 - 0.5) / (1 - 0.5) = 0.4

Source: https://en.wikipedia.org/wiki/Cohen%27s_kappa
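The worked example above, reproduced in code (50 items, agreement on 20 Yes and 15 No; one annotator says Yes 50% of the time, the other 60%):

```python
po = (20 + 15) / 50  # observed agreement = 0.7
p_yes = 0.5 * 0.6    # chance agreement on "Yes" = 0.3
p_no = 0.5 * 0.4     # chance agreement on "No"  = 0.2
pe = p_yes + p_no    # total chance agreement    = 0.5

kappa = (po - pe) / (1 - pe)
print(round(kappa, 2))  # -> 0.4
```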

SLIDE 21

Thanks!

Any questions?