A Novel Automated Approach for Software Effort Estimation Based on Data Augmentation
Liyan Song (1, 2), Leandro L. Minku (1), Xin Yao (1, 2)
(1) University of Birmingham, UK
(2) Southern University of Science and Technology, China

Software Effort Estimation (SEE)
• Estimation of the effort required to develop a software project (e.g., in person-hours).
• Based on project features such as:
  • estimated size,
  • required reliability,
  • programming language,
  • development type,
  • etc.
• Both over- and underestimations can be problematic.
Leandro Minku http://www.cs.bham.ac.uk/~minkull A Novel SEE Approach Based on Data Augmentation

SEE as a Machine Learning Problem
Previous projects are used as training data for a machine learning algorithm:

Project id   x1 = size   x2 = reliability   x3 = language   …   y = effort
1            1000        medium             Java            …   850
2            1000        low                Matlab          …   500
3            900         large              C#              …   1000
…            …           …                  …               …   …

For a new project x, the learned model estimates the required effort y.

A Key Challenge
• High cost of collecting effort required to develop projects.
• Scarcity of training data.
• Small training sets can lead to poor predictive performance.
• Most existing work investigates different machine learning algorithms to try to tackle this issue.

Data Augmentation
We generate additional synthetic projects based on existing ones.
[Figure: plot with axes x and y]
Synthetic projects can enrich the representativeness of the area where they are generated, potentially leading to better SEE models.

How to Create Synthetic Projects?
1. Randomly select an existing training project.
2. Create a clone of this training project.
3. For each of the clone’s input features:
   1. Displace this input feature with a certain probability.
4. Displace the clone’s effort.
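The four steps above can be sketched as a small Python function. This is only an illustration under assumptions: projects are represented as (feature dictionary, effort) pairs, and `make_synthetic_project`, `displace_feature`, `displace_effort` and `prob` are hypothetical names that do not come from the slides.

```python
import copy
import random


def make_synthetic_project(projects, displace_feature, displace_effort, prob, rng):
    """Create one synthetic project from a list of (features, effort) pairs.

    displace_feature(name, value, rng) and displace_effort(effort, changed, rng)
    are hypothetical callbacks supplied by the caller.
    """
    features, effort = rng.choice(projects)        # 1. pick a training project
    clone = copy.deepcopy(features)                # 2. clone it
    changed = {}                                   # feature name -> (old, new)
    for name, value in list(clone.items()):        # 3. visit each input feature
        if rng.random() < prob:                    # 3.1 displace with probability `prob`
            clone[name] = displace_feature(name, value, rng)
            changed[name] = (value, clone[name])
    # 4. displace the clone's effort, given what was changed
    return clone, displace_effort(effort, changed, rng)
```

Passing the displacement rules in as callbacks keeps the skeleton independent of the feature types, which each get their own rule.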

Displacing Categorical Input Features
With probability $\tau$, uniformly sample a new value from:
$\{v_1, v_2, \ldots, v_k\} \setminus \{x_{ic}\}$
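A minimal sketch of this rule in Python, assuming the set of admissible values is passed in as a list. The function name `displace_categorical` and the fallback for a single-category feature are assumptions, not part of the slides.

```python
import random


def displace_categorical(value, categories, tau, rng):
    """With probability tau, replace `value` by a different category chosen uniformly."""
    if rng.random() >= tau:
        return value                      # no displacement this time
    others = [v for v in categories if v != value]
    if not others:                        # assumption: nothing to switch to, keep the value
        return value
    return rng.choice(others)
```

For example, `displace_categorical("Java", ["Java", "Matlab", "C#"], 0.3, random.Random(0))` returns either "Java" (left unchanged) or one of the other two languages.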

Displacing Ordinal Input Features
Sample a new value from $B(n = 2x_{ic},\ p = 1/2)$.
[Figure: probability of each new value, from $x_{ic} - 3$ to $x_{ic} + 3$]
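Since a binomial distribution with $n = 2x_{ic}$ and $p = 1/2$ has mean $x_{ic}$, the new value stays centred on the original one. The sketch below assumes ordinal levels are encoded as non-negative integers; `displace_ordinal` is a hypothetical name, and the slides do not say how values beyond the top of the ordinal scale are handled, so no clipping is applied here.

```python
import random


def displace_ordinal(level, rng):
    """Draw a new ordinal level from Binomial(n = 2 * level, p = 1/2)."""
    if level < 0:
        raise ValueError("ordinal levels are assumed to be non-negative integers")
    # A binomial draw is the number of successes in n fair coin flips.
    return sum(1 for _ in range(2 * level) if rng.random() < 0.5)
```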

Displacing Numeric Input Features
Sample a new value from $x_{ic} + N(0, \sigma^2)$, where $\sigma$ is a pre-defined parameter that should assume small values.
[Figure: distribution of new values around $x_{ic}$]
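A minimal Python sketch of this rule. The name `displace_numeric` is hypothetical, and returning the noise term alongside the new value is a design choice made here so that it can be reused when displacing the effort.

```python
import random


def displace_numeric(value, sigma, rng):
    """Add zero-mean Normal noise with standard deviation sigma to `value`.

    Returns (new_value, noise) so the caller can keep track of the noise.
    """
    noise = rng.gauss(0.0, sigma)
    return value + noise, noise
```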

Displacing the Effort
Sample a new value from $y + \operatorname{sign}(e) \cdot |N(0, \sigma^2)|$, where $e$ is the sum of all Normal values used to displace the numeric size-related features.
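The rule moves the effort in the same direction as the net change in size. A hedged Python sketch follows; `displace_effort` and `size_noises` are hypothetical names, sign(0) is taken to be 0, and whether this σ equals the one used for the input features is not stated on the slide.

```python
import random


def displace_effort(effort, size_noises, sigma, rng):
    """Shift the effort by |N(0, sigma^2)| in the direction of the net size change.

    size_noises: the Normal noise values added to the numeric size-related features.
    """
    e = sum(size_noises)
    sign = (e > 0) - (e < 0)              # +1, -1, or 0 when no size feature moved
    return effort + sign * abs(rng.gauss(0.0, sigma))
```

If the size features grew overall, the synthetic effort can only grow; if they shrank, it can only shrink.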

Experiments
• Aims:
  • Evaluate the effect of synthetic data on predictive performance.
  • Understand when and why the synthetic projects can help improve the baseline predictive performance.
• Machine learning algorithms:
  • LR, ATLM, k-NN, RVM, RT, SVR.
• Proposed data augmentation.
• SMOTE for SEE.
• MAE log = Mean Absolute Error of the estimations in the log scale.
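The performance measure can be sketched as follows, assuming that "log scale" means the natural logarithm is applied to both actual and estimated efforts before taking the absolute error. The function name `mae_log` is hypothetical.

```python
import math


def mae_log(actual, predicted):
    """Mean absolute error between log(actual) and log(predicted) efforts."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(math.log(a) - math.log(p)) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)
```

Working in the log scale means that overestimating by a factor of 10 and underestimating by a factor of 10 are penalised equally.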

Datasets
• ISBSG (International Software Benchmarking Standards Group)
• SEACRAFT (Software Engineering Artifacts Can Really Assist Future Tasks)
[Table: training set size (#Data/#Fea) for each data set]

RQ1
• Given a learning algorithm, can our data augmentation approach help improve prediction performance over its baseline? When? Could it be detrimental?
• For most baselines and training set sizes, the proposed approach significantly improved MAE log, according to Wilcoxon rank-sum tests with Holm-Bonferroni corrections across data sets.
• The proposed approach was never significantly worse across data sets.
• Effect size (A12) of improvement depends on the baseline and training set size.
  • Small (A12 ≥ 0.56), medium (A12 ≥ 0.64) and large (A12 ≥ 0.71).
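Assuming A12 here denotes the Vargha-Delaney statistic, it estimates the probability that a value drawn from one sample exceeds a value drawn from the other, counting ties as one half. The sketch below computes it and applies the thresholds listed above. The names `a12` and `magnitude`, and the "negligible" label for values below 0.56, are assumptions.

```python
def a12(xs, ys):
    """Estimate P(x > y) + 0.5 * P(x == y) for x drawn from xs and y from ys."""
    greater = sum(1 for x in xs for y in ys if x > y)
    equal = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * equal) / (len(xs) * len(ys))


def magnitude(a):
    """Map an A12 value of at least 0.5 to the effect-size labels on the slide."""
    if a >= 0.71:
        return "large"
    if a >= 0.64:
        return "medium"
    if a >= 0.56:
        return "small"
    return "negligible"
```

Because lower MAE log is better, calling `a12(baseline_errors, augmented_errors)` gives values above 0.5 when the augmented model tends to have smaller errors.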

RQ1 - LR and ATLM
[Figure: MAE log for small training set size]
Improvements were frequently large when training sets were small or medium, especially for the small training sets.

RQ1 - RVM and RT
[Figure: MAE log for small training set size]
Improvements were frequently medium or large when training sets were small or medium.

RQ1 - k-NN and SVR
[Figure: MAE log for small training set size]
Improvements had small or insignificant effect size for all training set sizes, but there was no significant detrimental effect.

RQ2
• Why are our synthetic projects helpful? Why does the magnitude of improvement vary depending on the baseline model?

• Increasing the training set size helps to cope with lack of data and large noise.
[Figure: plot with axes x and y]

• LR/ATLM: global approaches.
  • Effect of synthetic data will impact predictions in the entire space.
[Figure: plot with axes x and y]

• k-NN: local approach.
  • Synthetic data will only influence estimations if they are neighbours, reducing the effect of synthetic data.
[Figure: plot with axes x and y]
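The locality of k-NN can be illustrated with a toy one-dimensional example. This is a sketch only: `knn_predict` is a hypothetical helper that averages the efforts of the k nearest training points, and the data values are made up for illustration.

```python
def knn_predict(train, x, k):
    """Average the efforts of the k training points whose feature is closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(effort for _, effort in nearest) / len(nearest)


# Made-up (feature, effort) pairs, not taken from the experiments.
train = [(1.0, 10.0), (2.0, 20.0), (10.0, 100.0)]
far_synthetic = (10.5, 105.0)    # synthetic project far from the query
near_synthetic = (1.4, 14.0)     # synthetic project close to the query

base = knn_predict(train, 1.2, k=2)
with_far = knn_predict(train + [far_synthetic], 1.2, k=2)
with_near = knn_predict(train + [near_synthetic], 1.2, k=2)
```

Here `with_far` equals `base`, because the distant synthetic project never enters the neighbourhood of the query, while `with_near` differs because the nearby one does.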

RQ3
• How well does our data augmentation approach perform against the existing data augmentation approach from the SEE literature?

[Figure: MAE log for small training set size, compared with SMOTE]
The proposed approach always performs similarly or better across data sets, with larger effect sizes for small or medium training sets when using LR, ATLM, RVM or RT.

Conclusions
• Proposed a novel data augmentation approach for SEE.
• RQ1: the proposed approach leads to similar or better MAE log than its baselines. Effect size of improvements is larger for small/medium training sets when using LR/ATLM and RT/RVM.
• RQ2: improvements are obtained because larger datasets are more robust to large noise. Their effect depends on intrinsic aspects of the base learner such as globality and locality.
• RQ3: the proposed approach leads to similar or better MAE log than an existing data augmentation approach for SEE. Effect size is larger especially for small/medium training sets when using LR/ATLM and RT/RVM.
The proposed approach can help to improve predictive performance when there is a lack of training data.

Future Work
• Proposal of new strategies to displace the effort.
• Analysis with more performance metrics.
• Investigation of the proposed approach for other problems.
