Discovering Information Explaining API Types Using Text


  1. Discovering Information Explaining API Types Using Text. Course Instructor: Dr. Jin Guo. Presented by: Sunyam Bagga

  2. TEXT CLASSIFICATION: classify [API type, section fragment] pairs as Relevant/Irrelevant. Source: https://www.python-course.eu/text_classification_introduction.php

  3. Technical Concepts: 1. RecoDoc tool 2. LOOCV 3. Maximum Entropy 4. Cosine similarity with tf-idf weighting 5. Kappa

  4. RecoDoc: “Recovering Traceability Links between an API and Its Learning Resources” [1]

  5. Aim: find the API types referenced in a tutorial. RecoDoc (a) identifies code-like terms (CLTs) in sentences such as “DateTime… such as year() or monthOfYear()”, and (b) precisely links each CLT (e.g., year()) to a specific code element (e.g., DateTime.year()). A minimal sketch of step (a) follows.
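RecoDoc's actual CLT detection uses island parsing of the tutorial text; the snippet below is only a rough illustrative heuristic (a regex for call-like tokens and dotted identifiers), not the paper's parser.

    import re

    # Rough heuristic for spotting code-like terms (CLTs) in tutorial prose.
    # Illustrative sketch only, NOT RecoDoc's island parser: it catches
    # call-like tokens such as "year()" and dotted type references.
    CLT_PATTERN = re.compile(r"\b[A-Za-z_][\w.]*\(\)|\b[A-Z]\w*(?:\.\w+)+\b")

    def find_clts(text):
        return CLT_PATTERN.findall(text)

    print(find_clts("DateTime offers accessors such as year() or monthOfYear()."))
    # ['year()', 'monthOfYear()']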

  6. Ambiguity ▪ Declaration Ambiguity: CLTs are rarely fully qualified. ▪ Overload Ambiguity: CLTs do not indicate the number/type of parameters when a method is overloaded. ▪ External Reference Ambiguity: a CLT may refer to code elements in external libraries. ▪ Language Ambiguity: human errors such as typos (HtttpClient), case errors, forgotten parameters, etc.

  7. Parsing Artifacts and Recovering Traceability Links: given a CLT, find all types in the codebase whose name matches the term, then disambiguate and filter (sketched below).
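A toy version of that matching step, assuming a hypothetical index mapping simple names to fully qualified types; RecoDoc's context-based disambiguation and filtering are omitted.

    # Toy matching step over a hypothetical codebase index.
    # RecoDoc's context-based disambiguation/filtering is omitted.
    CODEBASE_INDEX = {
        "DateTime": ["org.joda.time.DateTime"],
        "year": ["org.joda.time.DateTime.year", "org.joda.time.LocalDate.year"],
    }

    def candidate_links(clt):
        simple_name = clt.rstrip("()").split(".")[-1]
        return CODEBASE_INDEX.get(simple_name, [])

    print(candidate_links("year()"))
    # ['org.joda.time.DateTime.year', 'org.joda.time.LocalDate.year']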

  8. LOOCV: “Evaluating a classifier’s performance” [2]

  9. Leave-one-out Cross Validation: for each of the n samples, train on the other n-1 and test on the held-out one; average the n results. Source: https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6
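For concreteness, a minimal LOOCV run; scikit-learn is an assumed tool choice here, the slide names no library.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression(max_iter=1000)

    # One fold per sample: train on n-1 points, test on the one held out.
    scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
    print(f"LOOCV accuracy: {scores.mean():.3f} over {len(scores)} folds")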

  10. MaxEnt Classifier: “Using Maximum Entropy for Text Classification” by Nigam et al. [3]

  11. Maximum Entropy: - A technique for estimating probability distributions from data. - Principle: without external knowledge, pick the distribution that has the maximum entropy (the most uniform one). - Labeled training data puts constraints on the distribution.

  12. Example Source: NLP by Dan Jurafsky and Chris Manning

  13. Add Noun feature: f1 = {NN, NNS, NNP, NNPS}; add Proper Noun feature: f2 = {NNP, NNPS}. Source: NLP by Dan Jurafsky and Chris Manning. (Worked numerically below.)
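To make the principle concrete: if the only constraint is that the f1 noun tags together carry some fixed probability mass (0.8 here, a number assumed purely for illustration, as is the tag set), the maximum-entropy solution spreads the mass uniformly within each group.

    # Maximum entropy under a single group-sum constraint, worked by hand.
    # Assumption (illustration only): the four noun tags of f1 together
    # have probability mass 0.8; two assumed non-noun tags share the rest.
    noun_tags = ["NN", "NNS", "NNP", "NNPS"]   # feature f1
    other_tags = ["VBZ", "VBD"]                # assumed non-noun tags
    noun_mass = 0.8

    # Max-entropy solution: uniform within each constrained group.
    dist = {t: noun_mass / len(noun_tags) for t in noun_tags}
    dist.update({t: (1 - noun_mass) / len(other_tags) for t in other_tags})
    print(dist)
    # {'NN': 0.2, 'NNS': 0.2, 'NNP': 0.2, 'NNPS': 0.2, 'VBZ': 0.1, 'VBD': 0.1}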

  14. Constraints and Features - The model distribution is restricted to have the same expected value for each feature as observed in the training data D: E_p[f_i] = E_D[f_i], i.e., sum_x p(x) f_i(x) = (1/|D|) sum_{d in D} f_i(d). - Features for text classification are word occurrences per class (sketched below).
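Nigam et al.'s MaxEnt classifier is equivalent to multinomial logistic regression over such features; below is a sketch using scikit-learn as an assumed stand-in (not the paper's implementation), with invented toy fragments for the relevant/irrelevant task.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy section fragments and labels, invented purely for illustration.
    texts = [
        "DateTime exposes year() and monthOfYear() accessors",
        "click the download button to install the tutorial files",
        "call DateTime.year() to read the year field",
        "thanks for reading, see you next week",
    ]
    labels = ["relevant", "irrelevant", "relevant", "irrelevant"]

    # Multinomial logistic regression == maximum-entropy classification,
    # with word-count features playing the role of the f_i.
    model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    print(model.predict(["use monthOfYear() on a DateTime instance"]))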

  15. Cosine Similarity with tf-idf: “Comparison with Information Retrieval” [4]

  16. Tf-Idf - A technique to vectorise text data. - Term Frequency is a simple count of a term's occurrences in a document. - Inverse Document Frequency gives more weight to rare words.
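A minimal tf-idf vectorisation, again with scikit-learn as an assumed tool choice and invented toy documents.

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "DateTime exposes year and monthOfYear accessors",
        "the tutorial explains the DateTime class",
        "download the tutorial files",
    ]

    # Each row is one document; terms appearing in few documents get a
    # higher idf weight than terms appearing in many.
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)
    print(tfidf.shape)                          # (3, number_of_terms)
    print(vectorizer.get_feature_names_out()[:5])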

  17. Cosine Similarity - Measures the cosine of the angle between two vectors: cos(a, b) = (a · b) / (||a|| ||b||). - A section is considered relevant to an API type if the similarity value exceeds a certain threshold.
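Continuing the tf-idf sketch above, the thresholding step might look like this; the 0.1 cutoff is an arbitrary placeholder, not the paper's tuned value.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    api_type_doc = "DateTime year monthOfYear"   # text describing the API type
    sections = [
        "DateTime exposes year and monthOfYear accessors",
        "download the tutorial files",
    ]

    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform([api_type_doc] + sections)

    # Cosine similarity between the API-type vector and each section.
    sims = cosine_similarity(vectors[0], vectors[1:])[0]
    THRESHOLD = 0.1  # placeholder cutoff
    for section, sim in zip(sections, sims):
        verdict = "relevant" if sim > THRESHOLD else "irrelevant"
        print(f"{sim:.2f} {verdict}: {section}")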

  18. Kappa Score: “Annotating the Experimental Corpus” [5]

  19. Kappa formula - Measures inter-annotator agreement: kappa = (Po - Pe) / (1 - Pe). ▪ Po: observed agreement among annotators ▪ Pe: hypothetical probability of chance agreement ▪ More robust than a simple percent-agreement calculation

  20. Kappa Example (two annotators, 50 items; they agree on 20 Yes and 15 No; annotator A says Yes 50% of the time, B 60%): ▪ Po = (20+15) / 50 = 0.7 ▪ P(Yes) = 0.5*0.6 = 0.3 ▪ P(No) = 0.5*0.4 = 0.2 ▪ Pe = P(Yes) + P(No) = 0.5 ▪ Kappa = (0.7 - 0.5) / (1 - 0.5) = 0.4 Source: https://en.wikipedia.org/wiki/Cohen%27s_kappa
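The same example checked in code; the full 2x2 agreement table (20/5/10/15) is reconstructed from the marginals above, and sklearn's cohen_kappa_score serves as a sanity check.

    from sklearn.metrics import cohen_kappa_score

    # Reconstructed 2x2 agreement table from the slide's marginals:
    # A says Yes 25/50, B says Yes 30/50, both-Yes 20, both-No 15.
    pairs = [(1, 1)] * 20 + [(1, 0)] * 5 + [(0, 1)] * 10 + [(0, 0)] * 15
    a, b = zip(*pairs)

    po = sum(x == y for x, y in pairs) / len(pairs)   # observed agreement: 0.7
    pe = 0.5 * 0.6 + 0.5 * 0.4                        # chance agreement: 0.5
    print(f"{(po - pe) / (1 - pe):.2f}")              # 0.40
    print(f"{cohen_kappa_score(a, b):.2f}")           # 0.40 (sanity check)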

  21. Thanks! Any questions?
