Amalgamated Models for Detecting Duplicate Bug Reports Sukhjit - PDF document

Amalgamated Models for Detecting Duplicate Bug Reports Sukhjit Singh Sehra Tamer Abdou Ay¸ se Ba¸ sar Sumeet Kaur Sehra May 2, 2020

Highlights • The aim of this paper is to propose and compare amalgamated models for detecting duplicate bug reports using textual and non-textual information of bug reports. • The algorithmic models viz. LDA, TF-IDF, GloVe, Word2Vec, and their amalgamation are used to rank bug reports according to their similarity with each other. • The empirical evaluation has been performed on the open datasets from large open source software projects. 2

Highlights (contd.) • The metrics used for evaluation are mean average precision (MAP), mean reciprocal rank (MRR) and recall rate. • The experimental results show that amalgamated model (TF- IDF + Word2Vec + LDA) outperforms other amalgamated models for duplicate bug recommendations. 3

Introduction • Software bug reports can be represented as defects or errors’ descriptions identified by software testers or users. • It is crucial to detect duplicate bug reports as it helps in reduced triaging e ff orts. • These are generated due to the reporting of the same defect by many users. 4

Introduction (contd.) • These duplicates cost futile e ff ort in identification and handling. Developers, QA personnel and triagers consider duplicate bug reports as a concern. • The e ff ort needed for identifying duplicate reports can be de- termined by the textual similarity between previous issues and new report [8]. 5

Introduction (contd.) • Figure 1 shows the hierarchy of most widely used sparse and dense vector semantics [5]. Vector Representation Sparse Vector Dense Vector Representation Representation Neural Matrix TF-IDF PPMI Embedding Factorization GloVe Word2Vec SVD LDA Figure 1: Vector Representation in NLP 6

Introduction (contd.) • The proposed models takes into consideration textual information (description); and non-textual information (product and component) of the bug reports. • TF-IDF signifies documents’ relationships [11]; the distributional semantic models, • Word2Vec and GloVe, use vectors that keep track of the contexts, e.g., co-occurring words. 7

Introduction (contd.) • This study investigates and contributes into the following items: • An empirical analysis of amalgamated models to rank duplicate bug reports. • E ff ectiveness of amalgamation of models. • Statistical significance and e ff ect size of proposed models. 8

Related Work • A TF-IDF model has been proposed by modeling a bug report as a vector to compute textual features similarity [7]. • An approach based on n-grams has been applied for duplicate detection [14]. • In addition to using textual information from the bug reports, the researchers have witnessed that additional features also sup- port in the classification or identification of duplicates bug report. 9

Related Work (contd.) • The first study that combined the textual features and non- textual features derived from duplicate reports was presented by Jalbert and Weimer [4]. • A combination of LDA and n -gram algorithm outperforms the state-of-the-art methods has been suggested Zou et al. [16]. • Although in prior research many models have been developed and a recent trend has been witnessed to ensemble the various models. There exists no research which amalgamated the statistical, contextual, and semantic models to identify duplicate bug reports. 10

Dataset and Pre-processing • A collection of bug reports that are publicly available for research purposes has been proposed by Sedat et al. [12]. • The repository 1 [12], presented three defect rediscovery datasets extracted from Bugzilla in ”.csv” format. • It contains the datasets for open source software projects: Apache, Eclipse, and KDE. 11

Dataset and Pre-processing (contd.) • The datasets contain information about approximately 914 thou- sands of defect reports over a period of 18 years (1999-2017) to capture the inter-relationships among duplicate defects. • The dataset contains two categories of feature viz. textual and non-textual. The textual information is description given by the users about the bug i.e. ”Short desc”. 12

Dataset and Pre-processing (contd.) Descriptive statistics are illustrated in Table 1. Table 1: Dataset description Project Apache Eclipse KDE # of reports 44,049 503,935 365,893 Distinct id 2,416 31,811 26,114 Min report opendate 2000-08-26 2001-02-07 1999-01-21 Max report opendate 2017-02-10 2017-02-07 2017-02-13 # of products 35 232 584 # of components 350 1486 2054 13

Dataset and Pre-processing (contd.) • Pre-processing and term-filtering were used to prepare the corpus from the textual features. • In further processing steps, the sentences, words and characters identified in pre-processing were converted into tokens and corpus was prepared. • The corpus preparation included conversion into lower case, word normalisation, elimination of punctuation characters, and lemmatization. 1 https://zenodo.org/record/400614#.XaNPt-ZKh8x , last accessed: March 2020 14

Methodology The flowchart shown in Figure 2 depicts the approach followed in this paper. Textual Non-textual Features Features Statistical Contextual Syntactic Semantic Model Model Model Model (TF-IDF) (Word2Vec) (GloVe) (LDA) Passing the Query Bug Reports Passing the Query Bug Reports and Computing a Score and Computing Similarity Scores for the Non-textual Features Alternatively Combining Models Validation and Producing a Cumulative Metrics Amalgamated Score Ranking and Recommending Top-K Bug Reports Figure 2: Overall Methodology 15

Methodology (contd.) • Our study has combined sparse and dense vector representation approaches to generate amalgamated models for duplicate bug reports’ detection. • The one or more models from LDA, TF-IDF, GloVe and Word2Vec are combined to create amalgamated similarity scores. • The similarity score presents the duplicate (most similar) bug reports to bug triaging team. 16

Methodology (contd.) • The proposed models takes into consideration textual information (description); and non-textual information (product and component) of the bug reports. • TF-IDF signifies documents’ relationships [11]; the distributional semantic models, Word2Vec and GloVe, use vectors that keep track of the contexts, e.g., co-occurring words. 17

Proposed amalgamated model • It has been identified that even the established similarity recommendation models such as NextBug [10] does not produce optimal and accurate results. • The similarity scores vector ( S 1 , S 2 , S 3 , S 4 ) for k most similar bug reports is captured from individual approaches as shown in Figure 2. • Since the weights obtained for individual method have their own significance; therefore a heuristic ranking method is used to combine and create a universal rank all the results. 18

Proposed amalgamated model (contd.) • The ranking approach assigns new weights to each element of the resultant similarity scores vector from the individual approach and assign it equal to the inverse of its position in the vector as in Equation 1. 1 R i = (1) Position i . 19

Proposed amalgamated model (contd.) • Once all ranks are obtained for each bug report and for each model selected, the amalgamated score is generated by sum- mation of the ranks generated as given in Equation 2. • It creates a vector of elements less than or equals to nk , where k is number of duplicate bug reports returned from each model and n is number of models being combined. S = ( R 1 + R 2 + R 3 + R 4 ) ⇤ PC (2) Where S is amalgamated score (rank) of each returned bug report. • Here PC is the product & component score and works as a filter. 20

Evaluation Metrics • Recall-rate@k For a query bug q , it is defined as given in Equa- tion 3 as suggested by previous researchers [13, 3, 15]. 8 1 , if ifS ( q ) \ R ( q ) 6 = 0 < RR ( q ) = (3) 0 , otherwise : Given a query bug q, S ( q ) is ground truth and R ( q ) represents the set of top- k recommendations from a recommendation sys- tem. 21

Evaluation Metrics (contd.) • Mean Average Precision (MAP) is defined as the mean of the Average Precision ( AvgP ) values obtained for all the evaluation queries given in | Q | AvgP ( q ) X MAP = (4) | Q | q =1 In this equation, Q is number of queries in the test set. 22

Evaluation Metrics (contd.) • Mean Reciprocal Rank (MRR)is calculated from the reciprocal rank values of queries. | Q | X MRR ( q ) = RR ( i ) (5) i =1 Reciprocal Rank(i) calculates the mean reciprocal rank and RR is calculated as in 1 ReciprocalRank ( q ) = (6) index q 23

Results and Discussion • For evaluation of results, we used a Google Colab 2 machine with specifications as RAM: 24GB Available; and Disk: 320 GB. • The current research implements the algorithms in Python 3.5 and used ”nltk”, ”sklearn”, ”gensim” [9] packages for model implementation. • The default values of the parameters of the algorithms were used. The values of k has been taken as 1, 5, 10, 20, 30, and 50 to investigate the e ff ectiveness of proposed approach. • For the empirical validation of the results, the developed models have been applied to open bug report data consisting of three datasets of bug reports. 24

Amalgamated Models for Detecting Duplicate Bug Reports Sukhjit - PDF document

Amalgamated Models for Detecting Duplicate Bug Reports Sukhjit Singh Sehra Tamer Abdou Ay se Ba sar Sumeet Kaur Sehra May 2, 2020 Highlights The aim of this paper is to propose and compare amalgamated models for detecting duplicate

Amalgamated free products of n-slender groups RIMS Set Theory Workshop 2010 Waseda University

Rank of intersection of free subgroups in free amalgamated products of groups Alexander Zakharov

Detecting Spammers and Content Detecting Spammers and Content Detecting Spammers and Content

12/6/2013 Detecting Fakes Image Forensics: Detecting Forged Photos 1.Detecting photorealistic

Industrial Bug Mining Industrial Bug Mining Extracting, Grading and Enriching the Ore of Exploits

Fedora Bug Triage John "poelcat" Poelstra Jon "jds2001" Stanley June 21,

Audit Reports Guide Table of Contents Audit Reports Available Reports Accessing

Duplicate Encounter Avoidance Guidelines MCO Encounter Improvement Initiative Meridian Health

Web Information Retrieval Lecture 10 Crawling and Near-Duplicate Document Detection Todays

1 Near-Duplicate News Articles Near-Duplicate Detection More challenging task Are web

Bug Driven Bug Finding Chadd C. Williams Jeffrey K. Hollingsworth University of Maryland

3/3/15 Announcement: Bug of the week (extra credit) Architectural Patterns Each group can

Bugzilla, Bug-squad and GNOME3 Presented By Akhil Laddha 1 Agenda About me Bugzilla Bug

Open Source Bug Fixes: Characterization and Dataset Prediction Data Collection Bug

th NATIONAL REPORTS 6 th th th 6 6 6 NATIONAL REPORTS NATIONAL REPORTS NATIONAL REPORTS

Duplicate bug report detection through machine learning techniques Irving Muller Rodrigues

Splittability and 1-Amalgamability of Permutation Classes Michal Opler Computer Science

Higher-order amalgamation of algebraic structures David Milovich http://dkmj.org Welkin

Amalgamating many overlapping Boolean algebras David Milovich Texas A&M International

Joint work with Nicholas Ramsey (UC Berkeley). Shelahs classification Tree properties Let T be

Amalgamated algebras along an ideal: a class of ring extensions related to Nagatas

Projective measure without projective Baire D. Schrittesser Universitt Bonn YST 2011 D.

On Operator-Valued Bi-Free Distributions Paul Skoufranis TAMU March 22, 2016 Paul Skoufranis

History Birthday Celebration for Amalgamateds Middle Generation Buildings Vladeck Hall April

Amalgamated Models for Detecting Duplicate Bug Reports Sukhjit - PDF document

Amalgamated Models for Detecting Duplicate Bug Reports Sukhjit Singh Sehra Tamer Abdou Ay se Ba sar Sumeet Kaur Sehra May 2, 2020 Highlights The aim of this paper is to propose and compare amalgamated models for detecting duplicate

Amalgamated free products of n-slender groups RIMS Set Theory Workshop 2010 Waseda University

Rank of intersection of free subgroups in free amalgamated products of groups Alexander Zakharov

Detecting Spammers and Content Detecting Spammers and Content Detecting Spammers and Content

12/6/2013 Detecting Fakes Image Forensics: Detecting Forged Photos 1.Detecting photorealistic

Industrial Bug Mining Industrial Bug Mining Extracting, Grading and Enriching the Ore of Exploits

Fedora Bug Triage John &quot;poelcat&quot; Poelstra Jon &quot;jds2001&quot; Stanley June 21,

Audit Reports Guide Table of Contents Audit Reports Available Reports Accessing

Duplicate Encounter Avoidance Guidelines MCO Encounter Improvement Initiative Meridian Health

Web Information Retrieval Lecture 10 Crawling and Near-Duplicate Document Detection Todays

1 Near-Duplicate News Articles Near-Duplicate Detection More challenging task Are web

Bug Driven Bug Finding Chadd C. Williams Jeffrey K. Hollingsworth University of Maryland

3/3/15 Announcement: Bug of the week (extra credit) Architectural Patterns Each group can

Bugzilla, Bug-squad and GNOME3 Presented By Akhil Laddha 1 Agenda About me Bugzilla Bug

Open Source Bug Fixes: Characterization and Dataset Prediction Data Collection Bug

th NATIONAL REPORTS 6 th th th 6 6 6 NATIONAL REPORTS NATIONAL REPORTS NATIONAL REPORTS

Duplicate bug report detection through machine learning techniques Irving Muller Rodrigues

Splittability and 1-Amalgamability of Permutation Classes Michal Opler Computer Science

Higher-order amalgamation of algebraic structures David Milovich http://dkmj.org Welkin

Amalgamating many overlapping Boolean algebras David Milovich Texas A&amp;M International

Joint work with Nicholas Ramsey (UC Berkeley). Shelahs classification Tree properties Let T be

Amalgamated algebras along an ideal: a class of ring extensions related to Nagatas

Projective measure without projective Baire D. Schrittesser Universitt Bonn YST 2011 D.

On Operator-Valued Bi-Free Distributions Paul Skoufranis TAMU March 22, 2016 Paul Skoufranis

History Birthday Celebration for Amalgamateds Middle Generation Buildings Vladeck Hall April

Fedora Bug Triage John "poelcat" Poelstra Jon "jds2001" Stanley June 21,

Amalgamating many overlapping Boolean algebras David Milovich Texas A&M International