
T-76.613 Software testing 1

HELSINKI UNIVERSITY OF TECHNOLOGY

Analyzing Quantitative Data

Mika Mäntylä <mika.mantyla@hut.fi>


Qualitative vs. Quantitative (Coolican 1999)

                       Qualitative         Quantitative
  Information          Subjective, Rich    Objective, Narrow
  Internal Validity    Low                 High
  Setting              Realistic           Artificial
  Design               Unstructured        Structured
  Realism              High                Low
  Construct Validity   High                Low
  Reliability          Low                 High


Measurement Scales

  • Why
      • Determine the applicability of statistical tests
  • Nominal
      • E.g.: Gender
      • Arithmetic operations: Counting
      • No numeric meaning
  • Ordinal
      • E.g.: Movie ratings, Likert scale data, Rankings
      • Arithmetic operations: Size comparisons
      • Intervals between scale values are undefined
  • Interval
      • E.g.: Temperature in °C, IQ scores
      • Arithmetic operations: Addition & subtraction
      • Intervals between scale values are equal
  • Ratio
      • E.g.: Temperature in K, Length in centimeters
      • Arithmetic operations: Multiplication & division
      • Rational zero and equivalent ratios
  • Ordinal or Interval?
      • Importance of listed activities on a standard Likert scale (i.e. completely agree, somewhat agree, neither agree nor disagree, somewhat disagree, completely disagree)
          • Ordinal?
      • Distribute 100 points based on the importance of activities
          • Interval?
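
A small sketch of why the scale matters for arithmetic, using the slide's temperature example. The two temperatures (20 and 10 degrees) are illustrative values, not from the slides. Differences agree on the interval scale (Celsius) and the ratio scale (Kelvin), but a ratio is only meaningful on the scale with a true zero.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert an interval-scale Celsius value to the ratio-scale Kelvin value."""
    return celsius + 273.15


# Illustrative values (assumption): a warm and a cool measurement.
warm_c, cool_c = 20.0, 10.0

# Addition and subtraction are valid on both scales and give the same difference.
diff_c = warm_c - cool_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Division is only meaningful on the ratio scale: 20 C is not "twice as hot" as 10 C.
ratio_c = warm_c / cool_c
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(f"difference: {diff_c:.2f} (Celsius) vs {diff_k:.2f} (Kelvin)")
print(f"ratio: {ratio_c:.3f} (Celsius, misleading) vs {ratio_k:.3f} (Kelvin)")
```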


Distributions

  • Why
      • Determine the applicability of statistical tests
  • Normal Distribution
      • Assumed in most parametric tests
      • Most biological measures have a normal or lognormal distribution
      • Many psychological and educational tests are fitted to the normal distribution for research purposes
      • Likert scale (1-5) data is not normally distributed
          • Too few options, i.e. one cannot answer 1.25 etc.
          • Often the answers are skewed towards one end, i.e. not mound-shaped
  • Power Laws and Distributions (Pareto, Zipf)
      • Many phenomena follow a power-law distribution
          • City sizes, wealth, word frequencies, earthquake magnitudes, sources of software failures, visits to websites
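
To give a feel for how uneven a power-law distribution is, the sketch below builds idealized Zipf sizes (size proportional to 1 / rank) and measures the share held by the top-ranked items. The exponent of 1, the 1000 items, and the 20% cut-off are illustrative assumptions, not values from the slides.

```python
def zipf_top_share(n_items: int, top_fraction: float, exponent: float = 1.0) -> float:
    """Share of the total held by the top-ranked items when size ~ 1 / rank**exponent."""
    sizes = [1.0 / rank ** exponent for rank in range(1, n_items + 1)]
    top_count = round(n_items * top_fraction)
    return sum(sizes[:top_count]) / sum(sizes)


# Hypothetical setting: 1000 items (e.g. words or modules), top 20% of ranks.
share = zipf_top_share(n_items=1000, top_fraction=0.2)
print(f"Top 20% of items hold {share:.1%} of the total")
```

In such data the mean says little about a typical item, which is one reason tests that assume a normal distribution are a poor fit.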


Zipf’s Law – JDK


Probability density functions (Wikipedia)

Normal, LogNormal,


Differences between two groups

  • Is there a difference:
      • In development time between teams using pair programming and teams using solo programming
      • In credit card spending between people receiving promotions and people not receiving promotions
      • In company sizes between software product companies and software companies
  • The t-test is a popular test for a difference in a single variable between two groups

Example: Credit card spending (SPSS)

Independent Samples Test: $ spent during promotional period (equal variances assumed)

  Levene's Test for Equality of Variances
    F                           1,190
    Sig.                        ,276
  t-test for Equality of Means
    t                           -2,260
    df                          498
    Sig. (2-tailed)             ,024
    Mean Difference             -71,11095
    Std. Error Difference       31,45914
    95% Confidence Interval of the Difference
      Lower                     -132,920
      Upper                     -9,30196


Differences between two groups - cont’d

Example: Software company sizes

Independent Samples Test: Employees

                                Equal variances   Equal variances
                                assumed           not assumed
  Levene's Test for Equality of Variances
    F                           12,875
    Sig.                        ,000
  t-test for Equality of Means
    t                           -2,090            -2,025
    df                          998               708,999
    Sig. (2-tailed)             ,037              ,043
    Mean Difference             -33,57995         -33,57995
    Std. Error Difference       16,06980          16,58117
    95% Confidence Interval of the Difference
      Lower                     -65,11442         -66,13403
      Upper                     -2,04549          -1,02588

Group Statistics: Employees

  Type              N     Mean      Std. Deviation   Std. Error Mean
  Software          529   57,7958   174,91927        7,60519
  SoftwareProduct   471   91,3758   319,76937        14,73419
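
The t and df values in the SPSS output can be recomputed from the Group Statistics alone. The sketch below is a minimal illustration using the textbook formulas for Student's t (pooled variance) and Welch's t (variances not assumed equal). The function name is hypothetical and the decimal commas of the slides are written as decimal points. It does not compute the p-values, which need the t distribution.

```python
import math


def t_from_summary(n1, mean1, sd1, n2, mean2, sd2):
    """Return (pooled t, pooled df, Welch t, Welch df) from group summaries."""
    diff = mean1 - mean2
    # Student's t: pool the two variances; df = n1 + n2 - 2.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    se_pooled = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    # Welch's t: keep the variances separate; Welch-Satterthwaite df.
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    se_welch = math.sqrt(v1 + v2)
    df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return diff / se_pooled, n1 + n2 - 2, diff / se_welch, df_welch


# Group Statistics from the slide: Software vs. SoftwareProduct (Employees).
t_pooled, df_pooled, t_welch, df_welch = t_from_summary(
    529, 57.7958, 174.91927, 471, 91.3758, 319.76937
)
print(f"equal variances assumed:     t = {t_pooled:.3f}, df = {df_pooled}")
print(f"equal variances not assumed: t = {t_welch:.3f}, df = {df_welch:.1f}")
```

The results should land close to the two rows of the SPSS table. Small differences in the last digit can come from the rounded means and standard deviations.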


Differences between two groups - cont’d

  • Assumptions of t-test
      • Interval scale
      • Normal (or near normal) distribution
  • Both assumptions hold for the credit card example data
  • What about software company sizes?


Differences between two groups - cont’d

  • The Mann-Whitney (or Wilcoxon-Mann-Whitney) test is a non-parametric alternative to the t-test
  • Software company sizes with the non-parametric test

Ranks: Employees

  Type              N      Mean Rank   Sum of Ranks
  Software          529    496,63      262715,50
  SoftwareProduct   471    504,85      237784,50
  Total             1000

Test Statistics (grouping variable: Type): Employees

  Mann-Whitney U            122530,500
  Wilcoxon W                262715,500
  Z                         -,450
  Asymp. Sig. (2-tailed)    ,653
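
The test statistics follow from the rank sums. The sketch below recomputes U, Z and the asymptotic two-sided significance from the Ranks table using the standard normal approximation. It applies no correction for tied ranks (an assumption of this sketch; SPSS does correct for ties), so the last digit may differ slightly. The function name is hypothetical.

```python
import math


def mann_whitney_from_ranks(n1, rank_sum1, n2):
    """U for group 1, plus normal-approximation z and two-sided p (no tie correction)."""
    u1 = rank_sum1 - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    # Two-sided tail probability of the standard normal distribution.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u1, z, p_two_sided


# Ranks table from the slide: Software (N = 529, rank sum 262715.5) vs. SoftwareProduct (N = 471).
u, z, p = mann_whitney_from_ranks(529, 262715.5, 471)
print(f"U = {u:.1f}, Z = {z:.3f}, p = {p:.3f}")
```

Unlike the t-test on the same data, this test finds no significant difference, which is the point of asking whether the t-test assumptions hold for company sizes.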


Correlation between two variables

  • The degree to which the variables are related (i.e. co-vary)
      • http://noppa5.pc.helsinki.fi/koe/corr/cor7.html
  • Parametric (Pearson) and non-parametric (Spearman, Kendall’s Tau)
      • Pearson: Variables are interval scale and (approximately) normally distributed
      • Spearman is more widely used and easier to compute than Tau
      • Kendall’s Tau has better statistical properties than Spearman
      • “Results suggest that Kendalls tau(b) has many advantages over Pearsons and Spearmans r; when applied to psychiatric data.” (Arndt et al. 1999)
  • The term “high correlation” has no strict boundaries
      • 0.00 – 0.20: no correlation
      • 0.20 – 0.40: low correlation
      • 0.40 – 0.70: moderate correlation
      • 0.70 – 0.90: high correlation
      • 0.90 – 1.00: very high correlation
  • Significance of a correlation (the chance of obtaining the result by chance)
      • Computed from the correlation coefficient and the sample size
  • Correlation does not indicate causal relationships
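
The three coefficients can be compared on a small example. The sketch below uses plain-Python implementations and made-up data (y = x squared), which is monotonic but not linear. The rank-based coefficients reach 1 while Pearson does not. Note that the Kendall function here is the simple tau-a, which ignores ties, not the tau-b variant mentioned in the quote.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs; ignores ties."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (xs[i] - xs[j]) * (ys[i] - ys[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)


# Hypothetical data: a monotonic but non-linear relation.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(f"Pearson {pearson(x, y):.3f}, Spearman {spearman(x, y):.3f}, Tau {kendall_tau_a(x, y):.3f}")
```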


More than two variables - Partial Correlation

  • Correlation can only tell the relation of two variables
  • What if there are several related variables?
      • Partial Correlation (Parametric method)
      • Regression (Next Slide)
  • Partial Correlation Example:
      • Variables
          • T = study time; S = exam score; F = fear of the professor
      • Correlations
          • r(TS) = 0,2; r(TF) = 0,8; r(SF) = -0,4
      • Partial correlation from the equation:

$$ r_{TS \cdot F} = \frac{r_{TS} - r_{TF} \, r_{SF}}{\sqrt{\left(1 - r_{TF}^{2}\right)\left(1 - r_{SF}^{2}\right)}} $$

      • r(TS·F) = 0,95
      • Interpretation: Study time is highly correlated with exam score when fear is removed / held constant
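
The arithmetic behind the example can be checked with a few lines. This sketch plugs the three correlations from the slide into the first-order partial correlation formula. The function and parameter names are illustrative.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Slide values: r(TS) = 0.2, r(TF) = 0.8, r(SF) = -0.4
r_ts_f = partial_correlation(r_xy=0.2, r_xz=0.8, r_yz=-0.4)
print(round(r_ts_f, 2))  # 0.95
```

Step by step: the numerator is 0,2 - (0,8)(-0,4) = 0,52 and the denominator is the square root of (0,36)(0,84), about 0,55.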


Predicting Single Variable - Regression

  • Correlation indicates how well a line fits the data
  • Regression line is the “best line”
      • Minimizes the sum of squared distances from the line y=b+mx
  • Regression example from SPSS


Predicting Single Variable - Regression

  • Correlation indicates how well a line fits the data
  • Regression line is the “best line”
      • Minimizes the sum of squared vertical distances from the line y=b+mx
  • In this case the equation becomes
      • time = -1,955 + diameter * 3,457
  • Regression is often used for prediction
      • Time estimate for polishing a 15 inch object
      • -1,955 + 15 * 3,457 = 49,9 minutes
  • Strength of the regression line
      • R (0,700): correlation coefficient
      • R^2 (0,490): proportion of variation explained (co-variation)
      • Adjusted R^2 (0,482): adjusted based on model complexity, more conservative than R^2
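
A minimal sketch of simple linear regression and of the prediction above. `fit_line` uses the standard least-squares formulas, and `predict_time` uses the coefficients from the slide. The raw polishing data is not in the slides, so the fit is demonstrated on a made-up, perfectly linear data set. The adjusted R^2 formula is the usual one. The slide does not state the sample size, so the n used below is an arbitrary illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b + m*x; returns (b, m, r_squared)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return b, m, sxy ** 2 / (sxx * syy)


def adjusted_r_squared(r_squared, n, predictors=1):
    """Penalize R^2 for the number of predictors relative to the sample size."""
    return 1 - (1 - r_squared) * (n - 1) / (n - predictors - 1)


def predict_time(diameter, intercept=-1.955, slope=3.457):
    """Polishing time in minutes from the slide's fitted equation."""
    return intercept + slope * diameter


# Hypothetical data lying exactly on y = 3 + 0.5x, to show the fit recovers the line.
b, m, r2 = fit_line([1, 2, 3], [3.5, 4.0, 4.5])
print(f"b = {b:.2f}, m = {m:.2f}, R^2 = {r2:.2f}")
print(f"15 inch object: {predict_time(15):.1f} minutes")
print(f"adjusted R^2 for R^2 = 0.49 and an assumed n = 50: {adjusted_r_squared(0.49, 50):.3f}")
```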


Predicting Single Variable - Regression

  • There is no reason to have only a single predictor
      • Multiple Regression: y=b+mx1+nx2+ox3+px4
      • Line that minimizes the distances from y in multidimensional space
  • Common uses of multiple regression (other than prediction)
      • Controlling a variable
          • Confounding effect of size (El Emam et al. 2001)
          • “After controlling for size, none of the metrics we studied were associated with fault-proneness anymore.”
      • Best combined predictors
          • Best combination of techniques to reduce defect rate (MacCormack et al. 2003)
          • Early prototype, design review, regression test
          • Others were also correlated, but they were not significant in the regression model
  • Several variations of regression exist
      • Linear Regression, Logistic Regression, Categorical Regression, etc.
      • Find a regression that is suitable for your data
  • Other more advanced techniques: Structural Equation Modeling
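
The idea of controlling a variable can be illustrated with a toy data set. In the sketch below, all numbers are invented for illustration and are not from El Emam et al. Fault counts depend only on module size, and a code metric merely grows with size. The metric correlates strongly with faults on its own, but its coefficient drops to zero once size is in the regression. The solver is a plain-Python least-squares fit via the normal equations, chosen to keep the example self-contained.

```python
import math


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def multiple_regression(predictor_rows, y):
    """Least-squares coefficients [b, m, n, ...] for y = b + m*x1 + n*x2 + ..."""
    rows = [[1.0] + list(p) for p in predictor_rows]
    cols = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(cols)] for i in range(cols)]
    xty = [sum(r[i] * target for r, target in zip(rows, y)) for i in range(cols)]
    return solve_linear_system(xtx, xty)


def pearson(xs, ys):
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical modules: faults are driven by size only; the metric just grows with size.
size = [10, 20, 30, 40, 50, 60]
metric = [3, 5, 4, 8, 9, 11]
faults = [s / 10 for s in size]

metric_alone = pearson(metric, faults)
intercept, size_coef, metric_coef = multiple_regression(list(zip(size, metric)), faults)
print(f"correlation of metric with faults: {metric_alone:.2f}")
print(f"with size controlled: size coefficient {size_coef:.3f}, metric coefficient {metric_coef:.3f}")
```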

Weaknesses in linear models

Look beyond the numbers!

  • Four data sets with equal values (KQR 2005)
      • N=11; Avg X = 9,0; Avg Y = 7,5
      • Regression line y = 3 + 0,5x; r^2 = 0,67

[Figure: scatter plots of data sets I, II, III and IV]
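
The slide does not list the data points. The summary values match the well-known Anscombe's quartet, so the sketch below assumes that is the data shown. It recomputes the mean, regression line and r^2 for each set. All four give practically the same numbers although their scatter plots look entirely different.

```python
# Assumption: the four data sets are Anscombe's quartet.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return (mean_x, mean_y, intercept, slope, r_squared) for one data set."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_x, mean_y, mean_y - slope * mean_x, slope, sxy ** 2 / (sxx * syy)


summaries = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, (mean_x, mean_y, intercept, slope, r2) in summaries.items():
    print(f"{name:>3}: avg x {mean_x:.1f}, avg y {mean_y:.2f}, "
          f"y = {intercept:.2f} + {slope:.3f}x, r^2 = {r2:.2f}")
```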


Machine Learning and Data Mining

  • The ability of a program to learn from experience
  • ML is a new field compared to classical statistical methods
      • Much on-going research
      • Only a few practitioner books
      • Most ML researchers are promoting their own technique
      • Thus, practitioners have difficulty in selecting suitable methods
  • ML can address problems of classical statistical methods, i.e. weaknesses of linear models
  • ML techniques require / can handle large data sets


ML Example: Semantic Distance

  • Support software development by semantic distance between code documents
      • Code documents with short semantic distance have similar purpose!
  • Semantic information, i.e. documentation of code
      • Names of variables, routines, and classes
      • Code comments
  • 1st: produce a (sparse) matrix of code documents * terms
      • Stem the terms:
          • running -> run
          • runner -> run
      • Perform term weighting
          • A term that is in all documents is useless
          • Entropy weighting was used
  • 2nd: reduce matrix dimensions to 200-400 semantic concepts with SVD
      • Original matrix with 13k documents and 70k terms is too big
      • Original matrix will produce poorer results
  • 3rd: calculate distance between documents with the reduced matrix
  • 4th: integrate this information so it may be used
      • i.e. in a source code browser
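
A partial sketch of steps 1 and 3 on three tiny, made-up code documents. It builds term counts with a toy stemmer, applies log-entropy weighting, and computes cosine distances. The slides do not give the exact weighting formula, so the common log-entropy form is assumed here. The SVD reduction of step 2 is omitted because it needs a linear-algebra library, so distances are computed on the unreduced vectors. File names and contents are hypothetical.

```python
import math
from collections import Counter


def crude_stem(term):
    """Toy stemmer covering only the slide's examples (running -> run, runner -> run)."""
    for suffix in ("ning", "ner"):
        if term.endswith(suffix):
            return term[: -len(suffix)]
    return term


def entropy_weights(doc_term_counts):
    """Global weight per term: 1 + sum_j p_ij * log(p_ij) / log(n_docs); needs n_docs > 1."""
    n_docs = len(doc_term_counts)
    totals = Counter()
    for counts in doc_term_counts:
        totals.update(counts)
    weights = {}
    for term, total in totals.items():
        entropy = sum(
            (c[term] / total) * math.log(c[term] / total)
            for c in doc_term_counts if c[term] > 0
        )
        weights[term] = 1 + entropy / math.log(n_docs)
    return weights


def weighted_vector(counts, weights):
    """Local log weight times global entropy weight, stored sparsely as a dict."""
    return {term: math.log(1 + tf) * weights[term] for term, tf in counts.items()}


def cosine_distance(u, v):
    dot = sum(value * v.get(term, 0.0) for term, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1 - dot / (norm_u * norm_v)


# Hypothetical code documents reduced to their identifier and comment terms.
docs = {
    "Parser.java": "parse token stream reader running",
    "Lexer.java": "token stream reader runner scan",
    "Invoice.java": "reader invoice customer total",
}
counts = {name: Counter(crude_stem(t) for t in text.split()) for name, text in docs.items()}
weights = entropy_weights(list(counts.values()))
vectors = {name: weighted_vector(c, weights) for name, c in counts.items()}

near = cosine_distance(vectors["Parser.java"], vectors["Lexer.java"])
far = cosine_distance(vectors["Parser.java"], vectors["Invoice.java"])
print(f"weight of 'reader' (in every document): {weights['reader']:.3f}")
print(f"Parser-Lexer distance {near:.3f}, Parser-Invoice distance {far:.3f}")
```

The term that occurs in every document gets a weight of zero, which matches the remark that such a term is useless.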

ML Example: Case Based Reasoning

  • Predict whether a module contains a fault or not based on n code metrics
  • Code metrics create an n-dimensional space
  • Measure metrics from all modules and indicate whether they have contained a fault or not
  • When a new module comes in, choose its k-Nearest Neighbors (k is typically 5)
      • 3 faulty neighbours and 2 non-faulty neighbours indicate that the module is fault-prone
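
A minimal k-nearest-neighbour sketch of this idea, using Euclidean distance and a majority vote. The two metrics (lines of code, cyclomatic complexity) and all module values are invented for illustration. In practice the metrics would usually be normalized first so that one large-valued metric does not dominate the distance.

```python
import math


def predict_fault_prone(new_metrics, history, k=5):
    """Majority vote among the k nearest modules.

    history: list of (metrics_tuple, was_faulty) pairs from measured modules.
    Returns (is_fault_prone, number_of_faulty_neighbours).
    """
    nearest = sorted(history, key=lambda item: math.dist(item[0], new_metrics))[:k]
    faulty_votes = sum(1 for _, was_faulty in nearest if was_faulty)
    return faulty_votes > k / 2, faulty_votes


# Hypothetical history: (lines of code, cyclomatic complexity) -> contained a fault?
history = [
    ((120, 4), False),
    ((150, 5), False),
    ((200, 6), False),
    ((780, 22), False),
    ((800, 25), True),
    ((850, 28), True),
    ((900, 30), False),
    ((950, 35), True),
]

fault_prone, faulty_neighbours = predict_fault_prone((820, 26), history)
print(f"{faulty_neighbours} of 5 neighbours were faulty -> fault-prone: {fault_prone}")
```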


ML Example: SOM

  • Each item is an n-dimensional feature vector
      • Features can be: weight, height, gender
  • Steps
      • 1. Initialize map with random feature vectors
      • 2. For each item’s feature vector
          • Find the most similar feature vector of the map
          • Update the map’s feature vector and its neighbours based on the feature vector of the item
      • 3. Goto 2
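
The steps above can be sketched as a very small self-organizing map (SOM). This is a simplified illustration, not a full implementation. It uses a one-dimensional map, a fixed learning rate and a fixed neighbourhood radius, where real SOMs usually shrink both over time. The items are hypothetical, with features scaled to the range 0..1.

```python
import math
import random


def best_matching_unit(som, item):
    """Index of the map vector most similar (closest) to the item."""
    return min(range(len(som)), key=lambda u: math.dist(som[u], item))


def train_som(items, map_size=5, epochs=20, learning_rate=0.5, radius=1, seed=0):
    """Train a 1-D self-organizing map; returns the list of map feature vectors."""
    rng = random.Random(seed)
    dims = len(items[0])
    # Step 1: initialize the map with random feature vectors.
    som = [[rng.random() for _ in range(dims)] for _ in range(map_size)]
    for _ in range(epochs):  # Step 3: go to step 2 again
        for item in items:  # Step 2: for each item's feature vector
            best = best_matching_unit(som, item)
            low, high = max(0, best - radius), min(map_size - 1, best + radius)
            for unit in range(low, high + 1):
                # The best-matching unit moves most; its neighbours move half as much.
                rate = learning_rate if unit == best else learning_rate / 2
                som[unit] = [w + rate * (x - w) for w, x in zip(som[unit], item)]
    return som


# Hypothetical items: (weight, height, gender), each scaled to 0..1.
items = [
    (0.20, 0.30, 0.0), (0.25, 0.35, 0.0), (0.30, 0.25, 0.0),
    (0.80, 0.85, 1.0), (0.75, 0.90, 1.0), (0.85, 0.80, 1.0),
]
som = train_som(items)
for item in items:
    print(item, "-> map unit", best_matching_unit(som, item))
```

After training, similar items tend to land on the same or neighbouring map units, which is what makes the map useful for visual clustering.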

Summary

  • Qualitative vs. Quantitative
  • Scales & Distributions
  • Methods
      • Comparison between two groups
          • Parametric: t-test
          • Non-parametric: Wilcoxon-Mann-Whitney test
      • Correlation between two variables
          • Parametric: Pearson
          • Non-parametric: Spearman’s rho, Kendall’s Tau
          • Partial correlation
      • Regression - one dependent variable and several independent variables
      • Machine learning
  • How to learn statistical testing in practice
      • SPSS contains tutorials with example data sets
      • Find a good book
      • Try to understand articles that have utilized statistical methods


References

  • S. Arndt, C. Turvey and N.C. Andreasen, "Correlating and predicting psychiatric symptom ratings: Spearmans r versus Kendalls tau correlation," Journal of Psychiatric Research, vol. 33, no. 2, 1999, pp. 97-104.
  • H. Coolican, Research methods and statistics in psychology, London, United Kingdom: Hodder & Stoughton, 1999.
  • K. El Emam, S. Benlarbi, N. Goel and S.N. Rai, "The confounding effect of class size on the validity of object-oriented metrics," Software Engineering, IEEE Transactions on, vol. 27, no. 7, 2001, pp. 630-650.
  • A. MacCormack, C.F. Kemerer, M. Cusumano and B. Crandall, "Trade-offs between productivity and quality in selecting software development practices," Software, IEEE, vol. 20, no. 5, 2003, pp. 78-85.
  • S. Siegel, Nonparametric statistics for the behavioral sciences, New York: McGraw-Hill, 1956.


References – URLs

  • http://web.uccs.edu/lbecker/SPSS/scalemeas.htm
  • http://www.uni.edu/its/us/document/stats/spss2.html
  • http://en.wikipedia.org/wiki/Normal_distribution
  • http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html
  • http://www.texasoft.com/winkpear.html
  • http://www.blackwellpublishing.com/specialarticles/jcn_10_715.pdf
  • http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=10221741&dopt=Abstract
  • http://www.csidata.com/custserv/onlinehelp/OnlineManual/index.html?mmaviewcorrelationcoefficient.htm
  • http://faculty.vassar.edu/lowry/ch3a.html
  • http://www2.chass.ncsu.edu/garson/pa765/partialr.htm
  • http://www.disastercenter.com/cdc/1motorac.html


Topics for future

  • Statistical analysis on qualitative data
  • Experimental Design
  • Validity
      • Internal, External, Construct
  • Reliability & Validity
  • Populations & Samples
  • Clustering, Factor analysis
      • PCA (Hotelling, Karhunen-Loève transformation) etc.
  • Cross tabulation
      • Chi-square
  • More on machine learning
  • Outlier removal
  • Likert scale - 5-point or 7-point