

  1. Technical Aspects of the Paper: Improving Code Readability Models with Textual Features
     Deeksha Arya, COMP762

  2. Key Concepts
     - Previous work, as mentioned in the paper:
       - QALP tool (to compute similarity between comments and code)
       - Entropy
       - Halstead's volume metric
       - Area Under the Curve (AUC)
     - Concepts used in the paper's experiments:
       - Center selection (used to get 200 representative code snippets)
       - Cronbach's alpha (to evaluate agreement between participants regarding readability values)
       - Logistic regression with a wrapper strategy (binary classification algorithm)
       - Wilcoxon test
       - Cliff's delta

  3. QALP Score (Quality Assessment using Language Processing)
     - Measures the correlation between the natural language used in program code (mainly identifiers) and its documentation (in this case, its comments), and hence identifies well-documented code.
     - Pre-processing involves:
       - Removing stop-words (custom defined for code: includes keywords, library functions, and predefined variable names)
       - Stemming (elimination of word suffixes)
       - Atomic splitting of identifiers from code (splits compound identifiers into multiple atomic terms using a lex-based scanner built on an island grammar)
     - Weights the words using tf-idf (high weight to terms which occur more often than average in a document but are rarer in the entire collection).
     - Considers each word as a separate dimension in an n-dimensional vector space; vectorizes comments and code separately.
     - Calculates the cosine similarity between the comment and code vectors.
     - A greater QALP score indicates that both document models describe concepts using the same vocabulary.
     Ref: Increasing diversity: Natural language measures for software fault prediction
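
A minimal sketch of the scoring step in Python, assuming the pre-processing (stop-word removal, stemming, identifier splitting) has already produced token lists. The sample tokens and the add-one idf smoothing are illustrative choices, not details of the actual QALP tool:

```python
# tf-idf vectors for the comment and code token streams, compared with
# cosine similarity. Smoothed idf (1 + log) is an assumption made here so
# that terms shared by both "documents" keep a non-zero weight.
import math
from collections import Counter

def tfidf_vector(tokens, collection):
    """Map each term to tf * idf, with idf computed over the collection."""
    tf = Counter(tokens)
    n_docs = len(collection)
    vec = {}
    for term, count in tf.items():
        df = sum(1 for doc in collection if term in doc)
        vec[term] = count * (1.0 + math.log(n_docs / df))  # rarer -> heavier
    return vec

def cosine_similarity(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Illustrative token lists, as they might look after pre-processing.
comment_tokens = ["compute", "average", "score", "list"]
code_tokens = ["compute", "avg", "score", "list", "total"]
collection = [comment_tokens, code_tokens]
qalp = cosine_similarity(tfidf_vector(comment_tokens, collection),
                         tfidf_vector(code_tokens, collection))
print(f"QALP-style score: {qalp:.3f}")  # higher -> shared vocabulary
```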

  4. Entropy
     - Measures the complexity, the degree of disorder, or the amount of information in a data set.
     - Let x_i be a term in document X, and let p(x_i) be the ratio of the count of occurrences of x_i to the total number of words in the document. Then the entropy H(X) is given by:
       H(X) = - Σ_i p(x_i) * log2(p(x_i))
     - Higher entropy indicates a uniform distribution; lower entropy indicates a highly skewed distribution.
     Ref: A Simpler Model of Software Readability
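
A minimal sketch of this entropy computation over a token stream; the sample documents are illustrative:

```python
# Shannon entropy following the slide's definition: p(x_i) is the term's
# relative frequency in the document.
import math
from collections import Counter

def entropy(tokens):
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(entropy(["a", "b", "a", "a"]))  # skewed distribution -> lower entropy
print(entropy(["a", "b", "c", "d"]))  # uniform distribution -> 2.0 bits (maximal)
```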

  5. Halstead's Volume
     - Similar to the idea of entropy.
     - Represents the minimum number of bits needed to naively represent the program, or the number of mental comparisons needed to write the program.
     - Program length N = total number of operators + total number of operands
     - Program vocabulary n = number of distinct operators + number of distinct operands
     - Halstead volume: V = N * log2(n)
     - Greater volume indicates greater complexity.
     Ref: A Simpler Model of Software Readability
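
A minimal sketch of the volume formula; the operator/operand counts are supplied by hand here, whereas a real tool would extract them by tokenizing the program:

```python
import math

def halstead_volume(total_operators, total_operands,
                    distinct_operators, distinct_operands):
    N = total_operators + total_operands          # program length
    n = distinct_operators + distinct_operands    # program vocabulary
    return N * math.log2(n)

# Illustrative counts: V = 18 * log2(9) ≈ 57.06
print(halstead_volume(10, 8, 4, 5))
```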

  6. AUC – Area Under the (ROC) Curve
     - Receiver Operating Characteristic (ROC) curve
     - True positive rate (sensitivity): TP / (TP + FN)
     - False positive rate (1 - specificity): FP / (FP + TN)
     - The ROC curve is plotted by varying the discrimination threshold.
     - All such curves pass through (0, 0) and (1, 1).
     - The point (0, 1) represents perfect classification, and points on the ROC curve close to (0, 1) represent good classifiers.
     Ref: A Simpler Model of Software Readability
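
A minimal sketch of building the ROC curve by sweeping the threshold and integrating it with the trapezoidal rule; the labels and scores are illustrative:

```python
def roc_points(labels, scores):
    """One (FPR, TPR) point per distinct threshold, swept high to low."""
    pts = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        fn = sum(1 for y, s in zip(labels, scores) if s < thr and y == 1)
        tn = sum(1 for y, s in zip(labels, scores) if s < thr and y == 0)
        pts.append((fp / (fp + tn), tp / (tp + fn)))
    return [(0.0, 0.0)] + pts + [(1.0, 1.0)]  # every ROC curve hits these

def auc(points):
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2        # trapezoidal rule
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(labels, scores)))  # ≈ 0.889; near 1.0 means a good classifier
```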

  7. Binary Classification with Logistic Regression
     - Supervised learning algorithm.
     - Binary classification: "Not Readable" (0), "Readable" (1).
     - Takes real-valued inputs of some dimension n and predicts the probability of the input belonging to the default class (1). If the probability > 0.5, the predicted class is 1, else 0.
     - Probability = sigmoid(output) = sigmoid(θ0 + θ1*x1 + θ2*x2 + … + θn*xn)
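
A minimal sketch of the prediction step, with illustrative coefficients and features:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    # θ0 is the intercept; the rest pair up with the feature values.
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    prob = sigmoid(z)
    return (1 if prob > 0.5 else 0), prob

theta = [0.5, 1.2, -0.7]                  # [θ0, θ1, θ2], made up for the example
label, prob = predict(theta, [0.3, 0.9])
print(label, round(prob, 3))              # 1 ("Readable"), probability ≈ 0.557
```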

  8. Training with Logistic Regression
     - Training involves finding the gradient of the error and updating the coefficient vector θ to better represent the model and improve accuracy over a number of iterations.
     - Gradient descent step: θ_j := θ_j - α * (1/m) * Σ_{i=1..m} (h_θ(x^(i)) - y^(i)) * x_j^(i)
       Here, m = total number of training examples, h_θ(x) = predicted output, y = actual labelled output, α = learning rate.
     - When an optimal set of coefficients is found, the model is then used to predict the class of previously unseen data points.
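
A minimal sketch of the batch update above, with a leading 1 in each feature vector standing in for the bias term θ0; the training data are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(theta, X, y, alpha):
    """One batch update: θ_j := θ_j - α * (1/m) * Σ_i (h_θ(x_i) - y_i) * x_ij."""
    m = len(X)
    preds = [sigmoid(sum(t * xj for t, xj in zip(theta, x))) for x in X]
    return [t - alpha * sum((preds[i] - y[i]) * X[i][j] for i in range(m)) / m
            for j, t in enumerate(theta)]

X = [[1.0, 0.2, 0.7], [1.0, 0.9, 0.1], [1.0, 0.5, 0.5]]  # leading 1 = bias input
y = [1, 0, 1]
theta = [0.0, 0.0, 0.0]
for _ in range(100):                      # iterate until the coefficients settle
    theta = gradient_step(theta, X, y, alpha=0.1)
print([round(t, 3) for t in theta])
```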

  9. Overfitting
     - To reduce overfitting: reduce the number of features used to model the data.

  10. Feature Selection using the Wrapper Method
     - Create all possible subsets of size k from the feature vector.
     - k is determined via cross-validation.
     - Perform classification on each subset of features.
     - The feature subset on which classification achieves the highest accuracy is chosen as the best feature representation (see the sketch below).
     Ref: Large Scale Attribute Selection Using Wrappers
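
A minimal sketch of the wrapper loop; the `evaluate` callback stands in for the real cross-validated train/test step, and the placeholder accuracy function and feature names are purely illustrative:

```python
from itertools import combinations

def wrapper_select(features, k, evaluate):
    """Try every size-k subset and keep the one the classifier scores best."""
    best_subset, best_score = None, float("-inf")
    for subset in combinations(features, k):
        score = evaluate(subset)          # e.g. cross-validated accuracy
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

features = ["entropy", "halstead_volume", "qalp", "line_length"]
fake_accuracy = lambda s: len(set(s) & {"entropy", "qalp"}) / 2  # placeholder
print(wrapper_select(features, 2, fake_accuracy))
```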

  11. Center Selection
     - Used to select the 200 most representative methods for evaluation.
     - Continuously draw an edge between the closest pair of points based on distance; in this case Euclidean distance, the square root of the sum of the squared differences of the vector components.
     - Do not create edges between two components which are already in the same cluster -> hence single-linked clusters.
     - Once there are k connected components, stop the procedure (a sketch follows below).
     Ref: Algorithm Design by J. Kleinberg and E. Tardos
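
A minimal sketch of the procedure as the slide describes it, interpreted as single-linkage clustering stopped at k connected components; the 2-D points are illustrative:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link_clusters(points, k):
    clusters = [{i} for i in range(len(points))]   # start: every point alone
    pairs = sorted((euclidean(points[i], points[j]), i, j)
                   for i in range(len(points)) for j in range(i + 1, len(points)))
    for _, i, j in pairs:                          # closest pairs first
        if len(clusters) == k:
            break
        ci = next(c for c in clusters if i in c)
        cj = next(c for c in clusters if j in c)
        if ci is not cj:          # never link points already in the same cluster
            clusters.remove(cj)
            ci |= cj
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(single_link_clusters(points, k=3))  # -> [{0, 1}, {2, 3}, {4}]
```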

  12. Cronbach's Alpha
     - Measures reliability: how well a test measures what it should.
     - A measure of how closely items within a group are related.
     - Used to measure the level of agreement of annotators on what readable code is.
     - Can be written as a function of the number of items and the average inter-correlation among the items:
       alpha = (N * c_bar) / (v_bar + (N - 1) * c_bar)
       where N is the number of items, c_bar is the average inter-item covariance, and v_bar is the average variance.
     Ref: https://stats.idre.ucla.edu/spss/faq/what-does-cronbachs-alpha-mean
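
A minimal sketch of the formula, treating rows as annotators and columns as items; the covariance helper and the ratings are illustrative:

```python
def sample_cov(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def cronbach_alpha(ratings):
    items = list(zip(*ratings))                  # one column of scores per item
    n = len(items)
    v_bar = sum(sample_cov(it, it) for it in items) / n          # average variance
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    c_bar = sum(sample_cov(items[i], items[j]) for i, j in pairs) / len(pairs)
    return (n * c_bar) / (v_bar + (n - 1) * c_bar)

ratings = [[4, 5, 4], [3, 4, 3], [5, 5, 4], [2, 3, 2]]  # 4 annotators x 3 items
print(round(cronbach_alpha(ratings), 3))                # near 1 -> high agreement
```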

  13. Wilcoxon Test
     - Used when comparing two related values, matched values, or repeated measurements on a single value, to assess whether their mean ranks differ.
     - Used to determine whether the classification accuracy of the proposed model is significantly different from that of other models.
     - Algorithm:
       - Find the difference between each pair of values.
       - Rank the absolute values of these differences, ignoring any "0" differences. Give the lowest rank to the smallest absolute difference. If two or more differences are the same, this is a "tie": tied scores get the average of the ranks those scores would have obtained had they been different from each other.
       - Apply a negative sign to the ranks of negative differences and add together all the rank scores; this sum is the test statistic W.
       - N = number of non-zero differences.
       - Look up the critical value W_c in a Wilcoxon table for alpha = 0.05 and N.
       - If |W| <= W_c, the data are similar; if |W| > W_c, the difference is significant.
     Ref: https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
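
A minimal sketch of computing the signed-rank statistic W by the steps above; the critical value W_c still has to be looked up in a Wilcoxon table, and the two sets of accuracy scores are illustrative:

```python
def wilcoxon_w(xs, ys):
    diffs = [x - y for x, y in zip(xs, ys) if x != y]   # drop zero differences
    by_abs = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(by_abs):
        tie_end = pos                                    # find the run of ties
        while (tie_end + 1 < len(by_abs)
               and abs(diffs[by_abs[tie_end + 1]]) == abs(diffs[by_abs[pos]])):
            tie_end += 1
        avg_rank = (pos + tie_end) / 2 + 1    # tied scores share the average rank
        for idx in by_abs[pos:tie_end + 1]:
            ranks[idx] = avg_rank
        pos = tie_end + 1
    w = sum(r if d > 0 else -r for d, r in zip(diffs, ranks))
    return w, len(diffs)                      # N = number of non-zero differences

model_a = [81, 76, 78, 90, 65, 77]            # e.g. accuracies (%) per fold
model_b = [74, 76, 71, 82, 69, 70]
print(wilcoxon_w(model_a, model_b))           # compare |W| against tabled W_c
```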

  14. Worked example:
     Test statistic: |W| = |1.5 + 1.5 - 3 - 4 - 5 - 6 + 7 + 8 + 9| = 9
     Critical value: W_c (alpha = 0.05, N = 9) = 6
     Since |W| > W_c, the two datasets are dissimilar.
     Ref: https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test

  15. Cliff's Delta
     - A measure of how often the values in one distribution are larger than the values in a second distribution.
     - Used to perform pairwise comparisons between the all-features model and the other models.
     - δ = (#(x1 > x2) - #(x1 < x2)) / (n1 * n2), where x1 and x2 are scores within group 1 and group 2, and n1 and n2 are the sizes of the sample groups respectively.
     - Ranges from 1, when all values from one group are higher than the values from the other group, to -1, when the reverse is true. Completely overlapping distributions have a Cliff's delta of 0.
     Ref: http://www.scielo.org.co/scielo.php?script=sci_arttext&pid=S1657-92672011000200018
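
A minimal sketch of the pairwise count; the sample groups are illustrative:

```python
def cliffs_delta(group1, group2):
    """(# pairs where group1 > group2  -  # pairs where group1 < group2) / (n1*n2)."""
    greater = sum(1 for x1 in group1 for x2 in group2 if x1 > x2)
    lesser = sum(1 for x1 in group1 for x2 in group2 if x1 < x2)
    return (greater - lesser) / (len(group1) * len(group2))

print(cliffs_delta([5, 6, 7], [1, 2, 3]))   #  1.0: group 1 entirely higher
print(cliffs_delta([1, 2, 3], [1, 2, 3]))   #  0.0: fully overlapping
```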

  16. P-value
     - The p-value is defined as the probability of obtaining, under the null hypothesis (H0), a result equal to or more extreme than what was actually observed.
     - The null hypothesis is a prediction of no difference; for example, "adding a particular feature to the input set makes no significant difference to the readability determination."
     - The smaller the p-value, the higher the significance, because it tells the investigator that the null hypothesis under consideration may not adequately explain the observation.
     - The hypothesis is rejected if any of these probabilities is less than or equal to a pre-defined threshold value α, referred to as the level of significance.
     Ref: https://www.statsdirect.com/help/basics/p_values.htm
