Inferring semantically related words from software context - Jinqiu Yang, Lin Tan - PowerPoint PPT Presentation


SLIDE 1

Inferring semantically related words from software context

Jinqiu Yang, Lin Tan
University of Waterloo

SLIDE 2

Motivation

I need to find all functions that disable interrupts in the Linux kernel. Hmmm, so I search for “disable*interrupt”.

MISSING: disable_irq(...), mask_irq(...). BUT how am I supposed to know???

New search queries: “disable*irq”, “mask*irq”

SLIDE 3

How to Find Synonyms or Related Words?

  • Guess on my own
  • Ask developers

Can’t find that disable & mask are synonyms!

SLIDE 4

Our Approach: Leveraging Context

Real comments and identifiers from the Linux kernel:

  • Comments: “Disable all interrupt sources” / “Disable all irq sources”
  • Identifiers: void mask_all_interrupts() / void disable_all_interrupts()

We call a pair of such semantically related words an rPair.
SLIDE 5

Contributions

  • A general context-based approach to automatically infer semantically related words from software context
  • Has a reasonable accuracy in 7 large code bases written in C and Java
  • Is more helpful to code search than the state of the art

SLIDE 6

Outline

  • Motivation, Intuition and Contributions
  • Our Approach
  • A Running Example: Parsing, Clustering, Extracting, Refining
  • Evaluation Methods & Results
  • Related Work
  • Conclusion

SLIDE 7

A Running Example: Parsing

Real comments from the Apache HTTPD Server:

  maybe add a higher-level description
  min of spare daemons
  data in the appropriate order
  the compiled max daemons
  an iovec to store the trailer sent after the file
  data in the wrong order
  an iovec to store the headers sent before the file
  return err
  maybe add a higher-level desc
  if a user manually creates a data file

SLIDE 8

Extracting rPairs

SimilarityMeasure = Number of Common Words in the Two Sequences / Total Number of Words in the Shorter Sequence

Threshold = 0.7. Two examples:

  “the compiled max threads” vs. “min of spare threads”: SimilarityMeasure = 1/4
  “an iovec to store the trailer sent after the file” vs. “an iovec to store the headers sent before the file”: SimilarityMeasure = 8/10

You can find how different thresholds affect our results in our paper.
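As a sketch, the similarity measure can be computed with a multiset intersection of words. This is an illustrative reimplementation, not the authors' code:

```python
from collections import Counter

def similarity(seq_a: str, seq_b: str) -> float:
    """Common words in the two sequences / words in the shorter sequence."""
    words_a, words_b = seq_a.split(), seq_b.split()
    # Multiset intersection counts repeated words (e.g. "the") correctly.
    common = sum((Counter(words_a) & Counter(words_b)).values())
    return common / min(len(words_a), len(words_b))

# The two examples from the slide:
print(similarity("the compiled max threads", "min of spare threads"))    # 1/4
print(similarity("an iovec to store the trailer sent after the file",
                 "an iovec to store the headers sent before the file"))  # 8/10
```

With the 0.7 threshold, only the second pair would be kept for rPair extraction.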

SLIDE 9

Running Out of Time

  • Pairwise comparisons of a large number of sequences are expensive.
  • 519,168 unique comments in the Linux kernel ➔ over 100 billion comparisons

SLIDE 10

Clustering

Comments:

  maybe add a higher-level description
  min of spare daemons
  data in the appropriate order
  the compiled max daemons
  an iovec to store the trailer sent after the file
  data in the wrong order
  an iovec to store the headers sent before the file
  return err
  maybe add a higher-level desc
  if a user manually creates a data file

Cluster keywords: add, iovec, data, daemons


SLIDE 12

Clustering

Clusters:

  add:
    maybe add a higher-level description
    return err
    maybe add a higher-level desc
  iovec:
    an iovec to store the headers sent before the file
    an iovec to store the trailer sent after the file
  data:
    data in the appropriate order
    data in the wrong order
  daemons:
    min of spare daemons
    the compiled max daemons
    if a user manually creates a data file

SLIDE 13

The Speedup After Clustering

  • Pairwise comparisons of a large number of sequences are expensive.
  • 519,168 unique comments in the Linux kernel ➔ over 100 billion comparisons.
  • Clustering speeds up the process for the Linux kernel by almost 100 times.
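The clustering step can be sketched as grouping comments under shared keywords, so that the expensive pairwise comparison only runs within each (much smaller) cluster. This is a simplified illustration; how the paper actually selects cluster keywords may differ:

```python
def cluster_by_keyword(comments, keywords):
    """Group each comment under every keyword it contains."""
    clusters = {k: [] for k in keywords}
    for comment in comments:
        words = set(comment.split())
        for k in keywords:
            if k in words:
                clusters[k].append(comment)
    return clusters

# The running example's comments from the Apache HTTPD Server:
comments = [
    "maybe add a higher-level description",
    "min of spare daemons",
    "data in the appropriate order",
    "the compiled max daemons",
    "an iovec to store the trailer sent after the file",
    "data in the wrong order",
    "an iovec to store the headers sent before the file",
    "return err",
    "maybe add a higher-level desc",
    "if a user manually creates a data file",
]
clusters = cluster_by_keyword(comments, ["add", "iovec", "data", "daemons"])
```

Instead of comparing all 10 comments pairwise (45 comparisons), comparisons now happen only inside each cluster of 2 or 3, which is where the near-100x speedup on the Linux kernel comes from.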

SLIDE 14

Refining rPairs

  • Filtering: use stemming to remove rPairs that consist of words with the same root, e.g., (called, call).
  • Normalization: (threads, daemons) ➔ (thread, daemon); (called, invoked) ➔ (call, invoke)
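The refinement step above can be sketched as follows. The crude suffix stripper here stands in for a real stemmer (e.g. Porter); it is illustrative only, and the paper's actual stemming may differ:

```python
def naive_stem(word: str) -> str:
    # Crude suffix stripping as a stand-in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def refine(rpairs):
    refined = set()
    for a, b in rpairs:
        stem_a, stem_b = naive_stem(a), naive_stem(b)
        if stem_a == stem_b:
            continue  # filtering: drop pairs sharing a root, e.g. (called, call)
        refined.add((stem_a, stem_b))  # normalization: keep stemmed forms
    return refined

# (threads, daemons) is normalized; (called, call) is filtered out entirely.
print(refine([("threads", "daemons"), ("called", "call")]))
```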

SLIDE 15

Outline

  • Motivation, Intuition and Contributions
  • Our Approach
  • A Running Example: Parsing, Clustering, Extracting, Refining
  • Evaluation Methods & Results
  • Related Work
  • Conclusion

SLIDE 16

Evaluation Methods

  • Extraction Accuracy: 7 large code bases, in Java & C, from Comment-Comment, Code-Code, and Comment-Code
  • Search-Related Evaluation: comparison with SWUM [Hill PhD Thesis] in Code-Code

SLIDE 17

Comment-Comment Accuracy Results

  Software       rPairs    Accuracy   Not in Webster or WordNet
  Linux          108,571   47%        76.6%
  HTTPD          1,428     47%        93.6%
  Collections    469       74%        97.3%
  iReport        878       84%        95.2%
  jBidWatcher    111       64%        98.4%
  javaHMO        144       56%        91.1%
  jajuk          203       69%        94.2%
  Total/Average  111,804   63%        91.7%

  • The majority (91.7%) of correct rPairs discovered are not in Webster or WordNet.

We randomly sample 100 rPairs per project for manual verification (all 111 for jBidWatcher).

SLIDE 18

Evaluation Methods

  • Extraction Accuracy: 7 large code bases, in Java & C, from Comment-Comment, Code-Code, and Comment-Code
  • Search-Related Evaluation: comparison with SWUM [Hill PhD Thesis] in Code-Code

SLIDE 19

Search-Related Evaluation

In jBidWatcher, query: “Add auction”

Query expansion: “XXX auction”
  Our approach: new, register, ...
  SWUM: register, ...

SLIDE 20

Search-Related Evaluation

In jBidWatcher, “Add auction” ➔ add: register, do, new

SWUM gold set / our gold set:
  JBidMouse.DoAuction(...)
  AuctionServer.registerAuction(...)
  AuctionManager.newAuctionEntry(...)
  FilterManager.addAuction(...)
  ...

SLIDE 21

Search-Related Evaluation

In jBidWatcher, “Add auction” ➔ add: register, do, new

Our approach (55 words): new, register, do, load, ...
  Precision = 3/55 = 5.5%   Recall = 3/3 = 100%
SWUM (84 words): register, do, ...
  Precision = 2/84 = 2.3%   Recall = 2/3 = 66.7%

SLIDE 22

Search-Related Evaluation

Our approach achieves higher precision and higher or equal recall for 5 out of 6 rPair groups in the gold set.
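The precision and recall figures on the slide follow directly from the set definitions. In this sketch, the filler words are placeholders for the real expanded word lists (which are in the paper); only the counts and the gold set {register, do, new} come from the slides:

```python
def precision_recall(expanded, gold):
    """Precision/recall of an expanded word set against a gold set."""
    hits = expanded & gold
    return len(hits) / len(expanded), len(hits) / len(gold)

gold = {"register", "do", "new"}  # rPair group for "add" from the gold set

# Our approach: 55 expanded words containing all 3 gold words.
ours = gold | {f"word{i}" for i in range(52)}
p_ours, r_ours = precision_recall(ours, gold)    # 3/55 = 5.5%, 3/3 = 100%

# SWUM: 84 expanded words containing 2 of the 3 gold words.
swum = {"register", "do"} | {f"w{i}" for i in range(82)}
p_swum, r_swum = precision_recall(swum, gold)    # 2/84 = 2.3%, 2/3 = 66.7%
```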

SLIDE 23

Related Work

  • Verb-DO (Direct Object) [Shepherd et al. AOSD] & SWUM, an improved version of Verb-DO [Hill PhD Thesis]
  • Requires Natural Language Processing (NLP) techniques
  • Requires manually generated heuristics

SLIDE 24

Conclusions

  • A simple, general technique to automatically infer semantically related words from software context
  • No Natural Language Processing (NLP) required
  • Reasonable accuracy in 7 large C & Java code bases
  • The majority of rPairs discovered are not in the dictionaries or WordNet
  • Higher precision & recall than the state of the art