

SLIDE 1

A Novel Automated Approach for Software Effort Estimation Based on Data Augmentation

Liyan Song (1,2), Leandro L. Minku (1), Xin Yao (1,2)

(1) University of Birmingham, UK
(2) Southern University of Science and Technology, China

SLIDE 2

A Novel SEE Approach Based on Data Augmentation Leandro Minku http://www.cs.bham.ac.uk/~minkull

Software Effort Estimation (SEE)

  • Estimation of the effort required to develop a software project (e.g., in person-hours).
  • Based on project features such as:
      • estimated size,
      • required reliability,
      • programming language,
      • development type,
      • etc.
  • Both overestimation and underestimation can be problematic.

SLIDE 3

SEE as a Machine Learning Problem

Previous projects are used as training data for a machine learning algorithm:

Project id | x1 = size | x2 = reliability | x3 = language | … | y = effort
1          | 1000      | medium           | Java          | … | 850
2          | 1000      | low              | Matlab        | … | 500
3          | 900       | large            | C#            | … | 1000
…          | …         | …                | …             | … | …

New project x → Machine Learning Algorithm → required effort y = ?
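To make the framing concrete, here is a minimal sketch that treats the three example projects from the table as training data and estimates the effort of a new project with a nearest-neighbour rule. The distance function, the `size_scale` constant and the new project's feature values are illustrative assumptions, not part of the presentation.

```python
# Toy training set copied from the slide's table:
# (size, reliability, language) -> effort.
TRAIN = [
    ((1000, "medium", "Java"), 850),
    ((1000, "low", "Matlab"), 500),
    ((900, "large", "C#"), 1000),
]


def distance(a, b, size_scale=1000.0):
    """Assumed mixed distance: scaled size gap plus one per categorical mismatch."""
    d = abs(a[0] - b[0]) / size_scale
    d += sum(1.0 for u, v in zip(a[1:], b[1:]) if u != v)
    return d


def estimate_effort(new_project, train=TRAIN, k=1):
    """Average the effort of the k most similar previous projects."""
    nearest = sorted(train, key=lambda row: distance(row[0], new_project))[:k]
    return sum(effort for _, effort in nearest) / k


# Hypothetical new project: close to project 1, so its effort is reused.
print(estimate_effort((950, "medium", "Java")))  # 850.0
```

With only three training projects the estimate simply copies the closest project, which illustrates why small training sets limit what any learner can do.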

SLIDE 4

A Key Challenge

  • High cost of collecting effort required to develop projects.
  • Scarcity of training data.
  • Small training sets can lead to poor predictive performance.
  • Most existing work investigates different machine learning algorithms to try to tackle this issue.

SLIDE 5

Data Augmentation

We generate additional synthetic projects based on existing ones.

Synthetic projects can enrich the representativeness of the area where they are generated, potentially leading to better SEE models.

[Figure: plot of y against x0.]

SLIDE 6

How to Create Synthetic Projects?

  1. Randomly select an existing training project.
  2. Create a clone of this training project.
  3. For each of the clone’s input features:
      1. Displace this input feature with a certain probability.
  4. Displace the clone’s effort.
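
The four steps above can be sketched as a small generator function. This is an illustrative skeleton only: the displacement rules are passed in as callables, and the stand-in rules `jitter_feature` and `jitter_effort`, the name `tau` for the displacement probability, and the toy data are all assumptions made to keep the sketch runnable.

```python
import random


def make_synthetic_project(train, displace_feature, displace_effort, tau, rng):
    """Return one synthetic (features, effort) pair following steps 1-4."""
    features, effort = rng.choice(train)          # 1. pick an existing project
    clone = list(features)                        # 2. clone it
    for i, value in enumerate(clone):             # 3. visit each input feature
        if rng.random() < tau:                    # 3.1 displace with probability tau
            clone[i] = displace_feature(i, value, rng)
    new_effort = displace_effort(effort, features, clone, rng)  # 4. displace effort
    return tuple(clone), new_effort


# Hypothetical stand-in rules; the type-specific rules are not modelled here.
def jitter_feature(index, value, rng):
    return value + rng.gauss(0.0, 0.1)


def jitter_effort(effort, original, clone, rng):
    return effort + rng.gauss(0.0, 0.1)


rng = random.Random(0)
train = [((1000.0, 3.0), 850.0), ((900.0, 2.0), 1000.0)]
synthetic = [
    make_synthetic_project(train, jitter_feature, jitter_effort, 0.5, rng)
    for _ in range(5)
]
```

Passing the displacement rules as arguments keeps the loop independent of feature types, so categorical, ordinal and numeric rules can be plugged in per feature.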
SLIDE 7

Displacing Categorical Input Features

With probability τ, uniformly sample a new value from {v_1, v_2, …, v_k} \ {x_ic}, i.e. from the feature's k possible values excluding the clone's current value x_ic.
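
A minimal sketch of this rule follows. The function name and signature are hypothetical, and the fallback for a feature with a single category is an assumption, since the slide does not cover that case.

```python
import random


def displace_categorical(value, categories, tau, rng):
    """With probability tau, swap value for a different category chosen uniformly."""
    if rng.random() >= tau:
        return value                      # feature left untouched
    alternatives = [v for v in categories if v != value]
    # Assumed fallback: with no alternative category, keep the original value.
    return rng.choice(alternatives) if alternatives else value


rng = random.Random(0)
languages = ["Java", "C#", "Matlab"]
print(displace_categorical("Java", languages, 0.3, rng))
```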

SLIDE 8

Displacing Ordinal Input Features

Sample a new value from the binomial distribution B(n = 2·x_ic, p = 1/2).

[Figure: probability of each new value, from x_ic − 3 to x_ic + 3, peaking at x_ic.]
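
The binomial rule can be sketched with the standard library alone by counting successes over 2·x_ic fair coin flips. The sketch assumes ordinal levels are encoded as non-negative integers; the slide does not say how values outside a feature's valid range are handled, so no clamping is done here.

```python
import random


def displace_ordinal(value, rng):
    """Draw from B(n = 2 * value, p = 1/2); the mean equals the original value."""
    n = 2 * value
    return sum(1 for _ in range(n) if rng.random() < 0.5)


rng = random.Random(0)
draws = [displace_ordinal(3, rng) for _ in range(10)]
print(draws)  # values between 0 and 6, concentrated around 3
```

Because B(2·x, 1/2) is symmetric around x, small displacements are the most likely and the original value is the single most probable outcome.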

SLIDE 9

Displacing Numeric Input Features

Sample a new value from x_ic + N(0, σ²), where σ is a pre-defined parameter that should assume small values.

[Figure: distribution of the new value, centred at x_ic.]
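
A minimal sketch of the numeric rule is shown below. Returning the Normal draw alongside the new value is a design assumption, made so that a caller can later sum the draws; the value 0.05 for `sigma` is only an example of a "small" setting.

```python
import random


def displace_numeric(value, sigma, rng):
    """Add zero-mean Gaussian noise; return the new value and the noise drawn."""
    delta = rng.gauss(0.0, sigma)   # sample from N(0, sigma^2)
    return value + delta, delta


rng = random.Random(0)
new_size, delta = displace_numeric(1000.0, 0.05, rng)
print(new_size, delta)
```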

SLIDE 10

Displacing the Effort

Sample a new value from y + sign(e) · |N(0, σ²)|, where e is the sum of all Normal values used to displace the numeric size-related features.
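
The effort rule moves the effort in the same direction as the net change in size. A minimal sketch, assuming the caller has collected the Normal draws applied to the size-related features in a list (the name `size_deltas` is hypothetical):

```python
import random


def displace_effort(effort, size_deltas, sigma, rng):
    """Shift effort by |N(0, sigma^2)| in the direction of the net size change."""
    e = sum(size_deltas)              # sum of Normal draws for size features
    sign = (e > 0) - (e < 0)          # +1, -1, or 0 when nothing changed
    return effort + sign * abs(rng.gauss(0.0, sigma))


rng = random.Random(0)
print(displace_effort(850.0, [0.02, 0.01], 0.05, rng))   # not below 850
print(displace_effort(850.0, [-0.04], 0.05, rng))        # not above 850
```

When no size-related feature was displaced, e is zero and the clone keeps the original effort.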

SLIDE 11

Experiments

  • Aims:
      • Evaluate the effect of synthetic data on predictive performance.
      • Understand when and why the synthetic projects can help improve the baseline predictive performance.
  • Machine learning algorithms:
      • LR, ATLM, k-NN, RVM, RT, SVR.
  • Proposed data augmentation.
  • SMOTE for SEE.
  • MAElog = Mean Absolute Error of the estimations in the log scale.
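
As a rough illustration of the metric, MAElog can be computed as the mean absolute difference between the logarithms of actual and estimated efforts. The use of the natural logarithm and the example values are assumptions; the slide only states that errors are measured in the log scale.

```python
import math


def mae_log(actual, predicted):
    """Mean absolute error between log(actual) and log(predicted) efforts."""
    errors = [abs(math.log(a) - math.log(p)) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


# Hypothetical efforts in person-hours.
print(mae_log([850.0, 500.0, 1000.0], [800.0, 650.0, 1000.0]))
```

Working in the log scale means that an estimate off by a given ratio contributes the same error for a small project as for a large one.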

SLIDE 12

Datasets

  • ISBSG (International Software Benchmarking Standards Group)
  • SEACRAFT (Software Engineering Artifacts Can Really Assist Future Tasks)

[Table: #Data/#Fea and training set size for each data set.]

SLIDE 13

RQ1

  • Given a learning algorithm, can our data augmentation approach help improve prediction performance over its baseline? When? Could it be detrimental?
  • For most baselines and training set sizes, the proposed approach significantly improved MAElog, according to Wilcoxon Rank Sum tests with Holm-Bonferroni corrections across data sets.
  • The proposed approach was never significantly worse across data sets.
  • Effect size (A12) of improvement depends on the baseline and training set size.
  • Small (A12 ≥ 0.56), medium (A12 ≥ 0.64) and large (A12 ≥ 0.71).
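
For reference, the A12 statistic (Vargha-Delaney) estimates the probability that a value drawn from one sample exceeds a value drawn from another, counting ties as half. The sketch below pairs it with the thresholds listed above; comparing baseline errors against augmented errors in this order, the label "insignificant" for values below 0.56, and the sample numbers are assumptions for illustration.

```python
def a12(xs, ys):
    """Probability that a value from xs is larger than one from ys (ties count 0.5)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))


def magnitude(a):
    """Map an A12 value to the effect-size labels used on the slide."""
    if a >= 0.71:
        return "large"
    if a >= 0.64:
        return "medium"
    if a >= 0.56:
        return "small"
    return "insignificant"


# Hypothetical MAElog values: baseline errors versus errors with augmentation.
baseline = [0.90, 0.85, 0.80, 0.95]
augmented = [0.70, 0.75, 0.85, 0.65]
score = a12(baseline, augmented)
print(score, magnitude(score))
```

Since lower MAElog is better, a score above 0.5 in this orientation means the baseline tends to have the larger error.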

SLIDE 14

RQ1 - LR and ATLM

Improvements were frequently large when training sets were small or medium, especially for the small training sets.

[Figure: MAElog for Small Training Set Size]

SLIDE 15

RQ1 - RVM and RT

Improvements were frequently medium or large when training sets were small or medium.

[Figure: MAElog for Small Training Set Size]

SLIDE 16

RQ1 - k-NN and SVR

Improvements had small or insignificant effect size for all training set sizes, but there was no significant detrimental effect.

[Figure: MAElog for Small Training Set Size]

SLIDE 17

RQ2

  • Why are our synthetic projects helpful? Why does the magnitude of improvement vary depending on the baseline model?

SLIDE 18

RQ2

  • Why are our synthetic projects helpful? Why does the magnitude of improvement vary depending on the baseline model?
  • Increasing the training set size helps to cope with lack of data and large noise.

[Figure: plot of y against x0.]

SLIDE 19

RQ2

  • Why are our synthetic projects helpful? Why does the magnitude of improvement vary depending on the baseline model?
  • LR/ATLM: global approaches.
  • Synthetic data will affect predictions in the entire space.

[Figure: plot of y against x0.]

SLIDE 20

RQ2

  • Why are our synthetic projects helpful? Why does the magnitude of improvement vary depending on the baseline model?
  • k-NN: local approach.
  • Synthetic data will only influence an estimation if they are among the neighbours of the project being estimated, reducing the effect of synthetic data.

[Figure: plot of y against x0.]
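
The contrast between global and local learners can be illustrated with a toy one-dimensional example: adding a single extra point far from a query changes a fitted line's prediction at that query, while a 1-nearest-neighbour prediction stays the same. The data, the added point and the helper names are invented for illustration and are not taken from the experiments.

```python
def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx


def predict_line(points, x):
    slope, intercept = fit_line(points)
    return slope * x + intercept


def predict_1nn(points, x):
    """Return the target of the training point closest to x."""
    return min(points, key=lambda p: abs(p[0] - x))[1]


original = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
augmented = original + [(10.0, 5.0)]     # one extra point far from the query
query = 1.0

lr_shift = abs(predict_line(augmented, query) - predict_line(original, query))
nn_shift = abs(predict_1nn(augmented, query) - predict_1nn(original, query))
print(lr_shift, nn_shift)  # the line's prediction moves; the 1-NN one does not
```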

SLIDE 21

RQ3

  • How well does our data augmentation approach perform against the existing data augmentation approach from the SEE literature?

SLIDE 22

RQ3

The proposed approach always performs similarly or better across data sets, with larger effect sizes for small or medium training sets when using LR, ATLM, RVM or RT.

[Figure: MAElog for Small Training Set Size, with SMOTE as the comparison]

SLIDE 23

Conclusions

  • Proposed a novel data augmentation approach for SEE.
  • RQ1: the proposed approach leads to similar or better MAElog than its baselines. The effect size of improvements is larger for small/medium training sets when using LR/ATLM and RT/RVM.
  • RQ2: improvements are obtained because larger data sets are more robust to large noise. Their effect depends on intrinsic aspects of the base learner, such as globality and locality.
  • RQ3: the proposed approach leads to similar or better MAElog than an existing data augmentation approach for SEE. The effect size is larger especially for small/medium training sets when using LR/ATLM and RT/RVM.

The proposed approach can help to improve predictive performance when there is a lack of training data.

SLIDE 24

Future Work

  • Proposal of new strategies to displace the effort.
  • Analysis with more performance metrics.
  • Investigation of the proposed approach for other problems.
