Technical Aspects of the Paper: Improving Code Readability Models - PowerPoint PPT Presentation



SLIDE 1

Technical Aspects of the Paper: Improving Code Readability Models with Textual Features

Deeksha Arya COMP762

SLIDE 2

Key Concepts

- Previous work as mentioned in the paper:
  - QALP tool (to compute similarity between comment and code)
  - Entropy
  - Halstead's volume metric
  - Area Under the Curve (AUC)
- Concepts used in the paper's experiments:
  - Center selection (used to get 200 representative code snippets)
  - Cronbach's alpha (to evaluate agreement between participants regarding readability value)
  - Logistic regression with a wrapper strategy (binary classification algorithm)
  - Wilcoxon test
  - Cliff's delta

SLIDE 3

QALP Score

(Quality Assessment using Language Processing)

- Measures the correlation between the natural language used in a program's code (mainly identifiers) and its documentation (in this case, its comments), and hence identifies well-documented code
- Pre-processing involves:
  - Removing stop-words (custom-defined for code: keywords, library functions, and predefined variable names)
  - Stemming (elimination of word suffixes)
  - Atomic splitting of identifiers (splits compound identifiers into multiple atomic terms using a lex-based scanner built on an island grammar)
  - Weighting the words with tf-idf (high weight to terms that occur more often than average in a document but are rarer in the entire collection)
- Considers each word as a separate dimension in an n-dimensional vector space, vectorizing comments and code separately
- Calculates the cosine similarity between the comment and code vectors
- A greater QALP score indicates that both document models describe concepts using the same vocabulary
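The vectorize-then-compare step can be sketched as follows. This is a minimal illustration of tf-idf weighting plus cosine similarity, not the QALP tool's actual pipeline (which also performs the stop-word removal, stemming, and identifier splitting described above); all names here are illustrative.

```python
import math
from collections import Counter

def tfidf_cosine(doc_a, doc_b, corpus):
    """Cosine similarity between two token lists under tf-idf weighting.

    `corpus` is the collection of all documents, used only to compute idf.
    """
    n_docs = len(corpus)

    def idf(term):
        df = sum(1 for d in corpus if term in d)  # document frequency
        return math.log(n_docs / df) if df else 0.0

    def vec(doc):
        tf = Counter(doc)
        # tf-idf: frequent-in-document but rare-in-collection terms weigh most
        return {t: (c / len(doc)) * idf(t) for t, c in tf.items()}

    va, vb = vec(doc_a), vec(doc_b)
    dot = sum(va[t] * vb.get(t, 0.0) for t in va)
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Applied to QALP, `doc_a` would hold the split identifier terms of a method and `doc_b` its comment terms; a higher score means code and comments share vocabulary.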

Ref: Increasing diversity: Natural language measures for software fault prediction

SLIDE 4

Entropy

- Measures the complexity, the degree of disorder, or the amount of information in a data set
- Let x_i be a term in document X, and let p(x_i) be the ratio of the count of occurrences of x_i to the total number of words in the document. The entropy H(X) is then given by:

  H(X) = -Σ_i p(x_i) log2 p(x_i)

- Higher entropy indicates a uniform distribution; lower entropy indicates a highly skewed distribution
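The definition above translates directly into a few lines of Python (a generic Shannon-entropy sketch over document tokens, not code from the paper):

```python
import math
from collections import Counter

def entropy(tokens):
    """Shannon entropy H(X) = -sum over terms of p(x_i) * log2 p(x_i)."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A document repeating one term has entropy 0; a document whose terms are uniformly distributed reaches the maximum log2(number of distinct terms).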

Ref: A Simpler Model of Software Readability

SLIDE 5

Halstead’s Volume

- Similar to the idea of entropy
- Represents the minimum number of bits needed to naively represent the program, or the number of mental comparisons needed to write the program
- Program length N = total number of operators + total number of operands
- Program vocabulary n = number of distinct operators + number of distinct operands
- Halstead volume: V = N log2(n)
- Greater volume indicates greater complexity
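As a sketch, given operator and operand token lists already extracted from a program (the extraction itself is language-specific and not shown):

```python
import math

def halstead_volume(operators, operands):
    """Halstead volume V = N * log2(n):
    N = total operators + operands, n = distinct operators + operands."""
    N = len(operators) + len(operands)
    n = len(set(operators)) + len(set(operands))
    return N * math.log2(n)

# e.g. for the statement `x = x + 1`:
# operators ["=", "+"], operands ["x", "x", "1"] -> N = 5, n = 4, V = 10.0
```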

Ref: A Simpler Model of Software Readability

SLIDE 6

AUC – Area Under the (ROC) Curve

- Receiver Operating Characteristic (ROC) curve
- True Positive Rate (sensitivity): TP/(TP+FN)
- False Positive Rate (1 - specificity): FP/(FP+TN)
- The ROC curve is plotted by varying the discrimination threshold
- All such curves pass through (0,0) and (1,1)
- The point (0,1) represents perfect classification, and points on the ROC curve close to (0,1) represent good classifiers
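Rather than tracing the threshold sweep explicitly, AUC can be computed by its well-known rank interpretation: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one (ties count one half). A minimal sketch:

```python
def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative.

    Equivalent to the area under the ROC curve obtained by varying
    the discrimination threshold. `labels` are 0/1, `scores` are the
    classifier's real-valued outputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 1.0 corresponds to the perfect-classification point (0,1); 0.5 corresponds to the chance diagonal from (0,0) to (1,1).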

Ref: A Simpler Model of Software Readability

SLIDE 7

Binary Classification with Logistic Regression

- Supervised learning algorithm
- Binary classification: "Not Readable" (0), "Readable" (1)
- Takes real-valued inputs of some dimension n and predicts the probability of the input belonging to the default class (1). If the probability > 0.5, the predicted class is 1; otherwise 0.
- Probability = sigmoid(θ0 + θ1*x1 + θ2*x2 + ... + θn*xn), where sigmoid(z) = 1/(1 + e^(-z))
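The prediction rule above, as a minimal sketch (generic logistic regression, not the paper's trained model):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """P(class = 1 | x) = sigmoid(theta0 + theta1*x1 + ... + thetan*xn).

    `theta[0]` is the intercept; the predicted class is 1 when the
    probability exceeds 0.5, else 0.
    """
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    p = sigmoid(z)
    return p, (1 if p > 0.5 else 0)
```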

SLIDE 8

Training with Logistic Regression

- Training involves computing the gradient of the error and updating the coefficient vector θ over a number of iterations to better fit the model and improve accuracy
- Once an optimal set of coefficients is found, the model is used to predict the class of previously unseen data points

Gradient descent step: θj := θj - α * (1/m) * Σi (hθ(x(i)) - y(i)) * xj(i)
Here, m = total number of training examples, hθ(x) = predicted output, y = actual labelled output, α = learning rate
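One such batch update can be sketched as follows (a generic implementation of the step above; each row of `X` is assumed to carry a leading 1 for the intercept θ0):

```python
import math

def gradient_step(theta, X, y, alpha):
    """One batch gradient-descent update for logistic regression:
    theta_j := theta_j - alpha * (1/m) * sum_i (h_theta(x_i) - y_i) * x_ij
    """
    m = len(X)
    # h_theta(x) = sigmoid(theta . x) for every training example
    h = [1.0 / (1.0 + math.exp(-sum(t * xj for t, xj in zip(theta, x))))
         for x in X]
    return [t - alpha * sum((h[i] - y[i]) * X[i][j] for i in range(m)) / m
            for j, t in enumerate(theta)]
```

Repeating this step until the coefficients stop changing (or a fixed iteration budget runs out) yields the trained model.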

SLIDE 9

Overfitting

- Overfitting occurs when a model fits noise in the training data rather than the underlying pattern
- To reduce overfitting: reduce the number of features used to model the data

SLIDE 10

Feature Selection Using the Wrapper Method

- Create all possible subsets of size k from the feature vector
- k is determined via cross-validation
- Perform classification on each subset of features
- The feature subset yielding the highest classification accuracy is chosen as the best feature representation
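The exhaustive search over subsets can be sketched as below. `evaluate` is a hypothetical callback assumed to train and cross-validate a classifier restricted to the given feature indices and return its accuracy; the actual wrapper in the referenced work wraps a concrete learner.

```python
from itertools import combinations

def wrapper_select(n_features, k, evaluate):
    """Exhaustive wrapper: try every size-k subset of feature indices
    and keep the one on which the wrapped classifier scores best."""
    return max(combinations(range(n_features), k), key=evaluate)
```

In practice exhaustive enumeration is exponential in k, which is why wrapper implementations usually use greedy forward/backward search instead; the exhaustive form is shown only because it matches the slide's description.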

Ref: Large Scale Attribute Selection Using Wrappers

SLIDE 11

Center Selection

- Used to select the 200 most representative methods for evaluation
- Repeatedly draw an edge between the closest pair of points based on distance
  - In this case Euclidean distance: the square root of the sum of the squared differences between the vector components
- Do not create edges between two points that are already in the same cluster, hence single-linkage clusters
- Once there are k connected components, stop the procedure
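The procedure above is single-linkage clustering stopped at k components, which can be sketched with a union-find structure (an illustrative implementation, not the paper's code):

```python
import math
from itertools import combinations

def k_clusters(points, k):
    """Single-linkage clustering (Kleinberg & Tardos): repeatedly join
    the closest pair of points lying in different clusters until exactly
    k connected components remain. Returns clusters as index lists."""
    parent = list(range(len(points)))

    def find(i):  # union-find root with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Candidate edges in order of increasing Euclidean distance
    edges = sorted(combinations(range(len(points)), 2),
                   key=lambda e: math.dist(points[e[0]], points[e[1]]))
    components = len(points)
    for i, j in edges:
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:          # skip edges inside the same cluster
            parent[ri] = rj
            components -= 1

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

For the paper's selection, one representative snippet per cluster (e.g. the point closest to the cluster centroid) would then give the k = 200 methods.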

Ref: Algorithm design by J. Kleinberg and E. Tardos

SLIDE 12

Cronbach-alpha

- Measures reliability: how well a test measures what it should
- A measure of how closely items within a group are related
- Used to measure the level of agreement of annotators on what readable code is
- Can be written as a function of the number of items and the average inter-correlation among the items:

  α = (N * c̄) / (v̄ + (N - 1) * c̄)

  N: number of items; c̄: average inter-item covariance; v̄: average item variance
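A direct sketch of that formula (items as rows of scores, e.g. one row per annotator-rated snippet; illustrative only):

```python
def cronbach_alpha(items):
    """Cronbach's alpha = (N * c_bar) / (v_bar + (N - 1) * c_bar),
    where N is the number of items, v_bar the average item variance,
    and c_bar the average inter-item covariance.
    `items` is a list of N equal-length score lists."""
    N = len(items)
    m = len(items[0])
    means = [sum(it) / m for it in items]

    def cov(a, b, ma, mb):  # sample covariance
        return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (m - 1)

    v_bar = sum(cov(it, it, mu, mu) for it, mu in zip(items, means)) / N
    pairs = [(i, j) for i in range(N) for j in range(N) if i != j]
    c_bar = sum(cov(items[i], items[j], means[i], means[j])
                for i, j in pairs) / len(pairs)
    return (N * c_bar) / (v_bar + (N - 1) * c_bar)
```

Perfectly correlated items give α = 1; values near 0 indicate the annotators do not agree on a shared notion of readability.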

Ref: https://stats.idre.ucla.edu/spss/faq/what-does-cronbachs-alpha-mean

SLIDE 13

Wilcoxon Test

- Used when comparing two related samples, matched samples, or repeated measurements on a single sample, to assess whether their mean ranks differ
- Used to determine whether the classification accuracy of the proposed model is significantly different from that of the other models
- Algorithm:
  - Find the difference between each pair of values
  - Rank the absolute values of these differences, ignoring any zero differences. Give the lowest rank to the smallest absolute difference. If two or more differences are equal, this is a "tie": tied scores get the average of the ranks those scores would have received had they been different from each other.
  - Reattach the negative sign to the ranks of negative differences and add all the signed ranks together; this sum is the test statistic W
  - N = number of non-zero differences
  - Look up the critical value Wc in a Wilcoxon table for alpha = 0.05 and the given N
  - If |W| exceeds Wc, the two sets of measurements differ significantly; otherwise the difference is not significant

Ref: https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test

SLIDE 14

Test statistic: |W| = |1.5 + 1.5 - 3 - 4 - 5 - 6 + 7 + 8 + 9| = 9
Wc (alpha = 0.05, N = 9) = 6
Since |W| > Wc, the two datasets differ significantly
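The signed-rank sum in the worked example can be reproduced with a short sketch (ranking with average ranks for ties, as the algorithm on the previous slide requires; the table lookup for Wc is not included):

```python
def signed_rank_sum(diffs):
    """Wilcoxon signed-rank statistic W: rank the non-zero |differences|
    (ties get the average of the ranks they span), reattach signs, sum."""
    d = [x for x in diffs if x != 0]          # drop zero differences
    ordered = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(d):
        j = i
        # extend over a run of tied absolute differences
        while j + 1 < len(d) and abs(d[ordered[j + 1]]) == abs(d[ordered[i]]):
            j += 1
        avg = (i + j) / 2 + 1                 # average of ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[ordered[t]] = avg
        i = j + 1
    return sum(r if x > 0 else -r for r, x in zip(ranks, d))
```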

Ref: https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test

SLIDE 15

Cliff’s Delta

- A measure of how often the values in one distribution are larger than the values in a second distribution
- Used to perform pairwise comparisons between the all-features model and the other models
- δ = (#(x1 > x2) - #(x1 < x2)) / (n1 * n2), where x1 and x2 are scores within group 1 and group 2, and n1 and n2 are the sizes of the sample groups respectively
- Ranges from 1, when all values from one group are higher than the values from the other group, to -1 when the reverse is true; completely overlapping distributions have a Cliff's delta of 0
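The definition translates directly into a pairwise count (a generic sketch of the standard formula, not the paper's implementation):

```python
def cliffs_delta(group1, group2):
    """Cliff's delta = (#(x1 > x2) - #(x1 < x2)) / (n1 * n2),
    comparing every score x1 in group1 against every x2 in group2."""
    gt = sum(1 for x1 in group1 for x2 in group2 if x1 > x2)
    lt = sum(1 for x1 in group1 for x2 in group2 if x1 < x2)
    return (gt - lt) / (len(group1) * len(group2))
```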

Ref: http://www.scielo.org.co/scielo.php?script=sci_arttext&pid=S1657-92672011000200018

SLIDE 16

P-value

- The p-value is defined as the probability of obtaining, under the null hypothesis (H0), a result equal to or more extreme than what was actually observed
- The null hypothesis is a prediction of no difference; here, for example, "adding a particular feature to the input set makes no difference to the readability prediction"
- The smaller the p-value, the higher the significance, because it tells the investigator that the null hypothesis under consideration may not adequately explain the observation
- The hypothesis is rejected if this probability is less than or equal to a pre-defined threshold α, referred to as the level of significance

Ref: https://www.statsdirect.com/help/basics/p_values.htm