Correlation analysis, Fernando Brito e Abreu (fba@di.fct.unl.pt)

Experimental Software Engineering

– Correlation analysis –

Fernando Brito e Abreu (fba@di.fct.unl.pt) Universidade Nova de Lisboa (http://www.unl.pt)

QUASAR Research Group (http://ctp.di.fct.unl.pt/QUASAR)

Experimental Software Engineering / Fernando Brito e Abreu 12-May-08

Abstract

  • Correlation analysis vs. experimentation
  • Relations between variables
  • Sample size problem
  • Correlation
  • Parametric coefficients
  • Non-parametric coefficients


Relations between variables

The ultimate goal of every research or scientific analysis is finding relations between variables

The philosophy of science teaches us that there is no other way of representing "meaning" except in terms of relations between some quantities or qualities

Either way involves relations between variables

Thus, the advancement of Science must always involve finding and evaluating new relations between variables

Isn’t that what correlation is about? Why care about experimentation, then?


Correlation analysis vs. experimentation

Correlation analysis

We do not influence any variables but only measure them and look for relations (correlations) between some set of variables

Those correlations are quantified as coefficients in [-1, 1], whose magnitude is often quoted as a percentage in [0%, 100%]

Example: practitioners’ expertise and defects found

Experimentation

We manipulate some variables and then measure the effects of this manipulation on other variables

Example: a researcher increases design complexity and then records defects found, keeping all other variables constant

Beware of the learning effect when subjects are humans


Correlation analysis vs. experimentation

Experimentation can conclusively demonstrate causal relations between variables

If we find that whenever we change variable A, then variable B changes, then we can conclude that "A influences B"

Correlation analysis cannot conclusively prove causality

We can find "high" correlation values between variables such as average literacy and expected lifetime, but there is no proven causality between them

Question: why, then, can we observe that correlation when analyzing data from countries worldwide?


Correlation analysis vs. experimentation

  • If experimental data may potentially provide qualitatively better information than correlational data, why care about correlation analysis at all?

  • Correlation analysis only allows us to measure the association between variables, not their interdependence. Formally speaking:

1. (interdependence ⇒ association)

2. ¬ (association ⇒ interdependence)


Why care about correlation then?

Correlation analysis can be useful for:

Reducing the size of a set of explanatory variables

Highly correlated ones may be measuring the same attribute

Performing a preliminary assessment of the feasibility of a hypothesis

A very low correlation (association) between a dependent and an independent variable may lead us to discard the hypothesis of a causality

Most statistical tools allow us to produce cross-correlation tables (symmetric matrices holding the pairwise correlation values among the considered variables)

The main diagonal is obviously filled with 1’s (100% correlation)
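
A cross-correlation table of this kind can be sketched in a few lines of plain Python. This is only an illustration: the function names `pearson` and `correlation_table`, the variable names and the data values are all hypothetical and do not come from the slides, and the Pearson coefficient is used as the pairwise measure.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlation_table(columns):
    """Return {name_a: {name_b: r}} for every pair of named columns."""
    return {a: {b: pearson(xa, xb) for b, xb in columns.items()}
            for a, xa in columns.items()}

# Hypothetical measurements for five projects (illustrative values only).
data = {
    "size":    [10, 20, 30, 40, 50],
    "effort":  [12, 25, 29, 48, 51],
    "defects": [9, 4, 7, 2, 5],
}

table = correlation_table(data)
for name, row in table.items():
    # One row of the matrix per variable, rounded for display.
    print(name, [round(row[other], 2) for other in data])
```

The table is symmetric and its diagonal holds each variable's correlation with itself.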


Association between variables: properties

Magnitude or size

This property pertains to the strength of the association

Several correlation coefficients (e.g. Pearson, Spearman) allow us to quantify this magnitude

Sign of the association

Positive – when a variable increases, the other increases as well

Negative – when a variable increases, the other decreases

Significance, reliability or truthfulness

This property pertains to how representative the result found in our specific sample is of the entire population

It says how probable it is that a similar relation would be found if the experiment were replicated with other samples from the same population


Magnitude vs. reliability of relations

Usually, the larger the magnitude of the relation between variables, the more reliable the relation

Magnitude and reliability are distinct properties, but they are not totally independent!

Assuming that there is no relation between the respective variables in the population (null magnitude), the most likely outcome would be also finding no relation between those variables in the research sample

Thus, the stronger the relation found in the sample (greater magnitude), the less likely it is that there is no corresponding relation in the population

Depending on sample size, a relation of a given strength can be either highly significant or not significant at all
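
The dependence of significance on sample size can be illustrated with a rough calculation. The sketch below assumes the Fisher z approximation (atanh(r) · sqrt(n − 3) is treated as standard normal when there is no correlation in the population); this is a swapped-in textbook approximation, not the exact test a statistical package would report, and the function name `approx_p_value` is hypothetical.

```python
from math import atanh, sqrt
from statistics import NormalDist

def approx_p_value(r, n):
    """Approximate two-sided p-value for H0 'no correlation' (Fisher z)."""
    z = atanh(r) * sqrt(n - 3)  # roughly standard normal under H0
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same sample correlation, r = 0.3, evaluated at several sample sizes.
for n in (10, 30, 100, 1000):
    print(n, round(approx_p_value(0.3, n), 4))
```

With the same magnitude, the approximate p-value shrinks as the sample grows, so a relation that is not significant in a small sample can be highly significant in a large one.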


Sample size problem

The smaller the sample size, the more likely it is that we will obtain erroneous results compared to the population parameters

The error would be to assume the existence of a relation between two variables obtained from a population in which such a relation does not exist

Technically speaking, the probability of a random deviation of a particular size (from the population mean) decreases as the sample size increases

Conclusion: a smaller sample size implies a smaller reliability of associations
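
A small simulation can make this concrete. The sketch below draws two independent (hence truly unrelated) normally distributed variables and measures how large the sample correlation comes out on average, for a small and a large sample size. The function names, the number of trials and the seed are arbitrary choices made for illustration.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def mean_abs_r(n, trials=500, seed=1):
    """Average |r| over samples of size n drawn from two independent variables."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        total += abs(pearson(xs, ys))
    return total / trials

small = mean_abs_r(5)    # tiny samples: spurious correlations are common
large = mean_abs_r(200)  # large samples: r stays close to the true value, 0
print(round(small, 3), round(large, 3))
```

Although the population correlation is zero in both cases, small samples routinely show sizeable correlations purely by chance.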

Wrap-up

If the true association (in the population) between variables is:

very small, then there is no way to identify such an association in a study, unless the research sample is correspondingly large

very large, then it can be found to be highly significant even in a study based on a very small sample

Conclusion: the smaller the association between variables, the larger the sample size required to prove it significant
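
To get a feel for the orders of magnitude involved, the following sketch estimates the smallest sample size at which an observed correlation of a given magnitude would reach two-sided significance. It relies on the Fisher z approximation (an assumption of this example, not a method stated in the slides), and `min_sample_size` is a hypothetical helper name.

```python
from math import atanh, ceil
from statistics import NormalDist

def min_sample_size(r, alpha=0.01):
    """Smallest n at which a sample correlation r gives p <= alpha (two-sided).

    Assumes atanh(r) * sqrt(n - 3) is approximately N(0, 1) under H0.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((z_crit / atanh(abs(r))) ** 2 + 3)

# Weaker associations need much larger samples to be declared significant.
for r in (0.05, 0.1, 0.3, 0.5, 0.8):
    print(r, min_sample_size(r))
```

The required sample size grows quickly as the magnitude of the association shrinks.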


Correlation

Correlation is the extent to which values of two variables are "proportional" to each other

Proportional means linearly related

Correlation is high if the relation can be approximated by a straight line

The line is sloped upwards or downwards, depending on the sign of the association

That line is called the regression line or least squares line because it is determined such that the sum of the squared distances of all the data points from the line is the minimum possible
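
The least squares line can be computed directly from the data. The sketch below assumes the usual ordinary least squares setting, where "distance" means the vertical distance between each point and the line; the function names and the data values are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) minimising the sum of squared vertical distances."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the points and a given line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical, roughly linear data

slope, intercept = least_squares_line(xs, ys)
best = sum_squared_residuals(xs, ys, slope, intercept)
# Nudging the fitted line in any direction increases the sum of squares.
worse = sum_squared_residuals(xs, ys, slope + 0.1, intercept)
print(slope > 0, best < worse)
```

A positive slope corresponds to a positive association, a negative slope to a negative one.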


Correlation coefficients

The magnitude of the correlation can be expressed by a correlation coefficient

Several coefficients are proposed in the literature

Some are parametric and others non-parametric

The correlation coefficient does not depend on the specific measurement units used

For example, the correlation between Size and Effort will be identical regardless of whether Function Points and Man.Years, or KLOC and Man.Months are used as measurement units
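
Unit independence is easy to check numerically for a change of unit by a constant positive factor, such as years to months. The sketch below only covers that case; the data values and variable names are hypothetical.

```python
from math import sqrt, isclose

def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

size = [120, 250, 310, 480, 600]               # hypothetical project sizes
effort_years = [0.8, 1.9, 2.1, 3.6, 4.0]       # hypothetical effort in years
effort_months = [12 * e for e in effort_years]  # same effort, other unit

r_years = pearson(size, effort_years)
r_months = pearson(size, effort_months)
print(isclose(r_years, r_months))
```

The scale factor multiplies the numerator and the denominator of the coefficient by the same amount, so it cancels out.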


Parametric correlation coefficients

The most widely-used type of correlation coefficient is Pearson r (Pearson, 1896)

It is also called linear or product-moment correlation

Assumptions

Each pair of variables is bivariate normal

The two variables are measured on at least interval scales

SPSS: Analyse / Correlate / Bivariate
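
For reference, the usual textbook definition of Pearson r for n paired observations can be written as below. The symbols x_i, y_i and the sample means x̄ and ȳ are notation introduced here for illustration; they do not appear in the slides.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator carries the sign of the association, while the denominator normalises the result to the interval [-1, 1].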


Nonparametric correlation coefficients

These statistics do not require that variables are normally distributed

Chi-square

Assumptions: nominal scales

Spearman R, Kendall Tau, Gamma

Assumptions: at least ordinal scales (ranks)

For ordinal scales, if ranks are represented by literal enumerations you have to recode them into integers
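
Recoding a literal ordinal scale into integers can be as simple as a lookup table. In the sketch below the scale labels, the `recode` helper and the sample data are all hypothetical; the only requirement is that the integers preserve the order of the categories.

```python
# Hypothetical ordinal scale, listed from lowest to highest category.
LEVELS = ["low", "medium", "high", "very high"]

# Map each literal to its rank, starting at 1.
CODE = {label: rank for rank, label in enumerate(LEVELS, start=1)}

def recode(labels):
    """Replace each ordinal literal by its integer rank."""
    return [CODE[label] for label in labels]

expertise = ["medium", "low", "very high", "high", "medium"]
print(recode(expertise))
```

An unknown label raises a `KeyError`, which helps catch typos in the raw data before any coefficient is computed.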


Spearman R correlation coefficient

Spearman R is similar to the Pearson coefficient, except that it is computed from ranks

SPSS: Analyse / Correlate / Bivariate
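
One common way to obtain Spearman R is to replace each value by its rank and then apply the Pearson formula to the ranks. The sketch below follows that approach and assigns tied values the average of their positions; the function names and data are hypothetical, and this is not necessarily how SPSS computes the statistic internally.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value equals the current one (a tie).
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

xs = [1, 2, 3, 4, 5]
ys = [1, 8, 27, 64, 125]  # monotone but not linear in xs
print(round(spearman(xs, ys), 3), round(pearson(xs, ys), 3))
```

Because only the ordering matters, a monotone but non-linear relation scores higher on Spearman R than on Pearson r.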


Kendall Tau correlation coefficient

Kendall tau represents a probability

It is the difference between the probability that the observed data are in the same order for the two variables and the probability that the observed data are in different orders for the two variables

Kendall tau is equivalent to the Spearman R statistic with regard to the underlying assumptions

It is also comparable in terms of its statistical power

However, it is usually not identical in magnitude because its underlying logic, as well as its formula, is very different

SPSS: Analyse / Correlate / Bivariate
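
The "difference of probabilities" reading can be turned into code by counting pairs of observations. The sketch below implements the simplest variant, often called tau-a, which divides by the total number of pairs; packages such as SPSS typically report tau-b, which additionally corrects for ties, so values may differ when ties are present. The function name and data are hypothetical.

```python
from itertools import combinations

def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:      # both variables order the pair the same way
            concordant += 1
        elif product < 0:    # the two variables order the pair differently
            discordant += 1
        # product == 0 means a tie in at least one variable: counted in neither
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]  # two adjacent swaps relative to xs
print(kendall_tau_a(xs, ys))
```

In this example 8 of the 10 pairs are concordant and 2 are discordant.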


Gamma coefficient

Gamma is preferable to Spearman R or Kendall tau when the data contain many tied observations (evident in a scatter plot)

In terms of the underlying assumptions, Gamma is equivalent to Spearman R or Kendall tau

In terms of its interpretation and computation, it is more similar to Kendall tau than Spearman R

Gamma is also a probability:

It is computed as the difference between the probability that the rank orderings of the two variables agree and the probability that they disagree, divided by 1 minus the probability of ties
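
Expressed with pair counts, the definition above reduces to (C − D) / (C + D), where C and D are the numbers of concordant and discordant pairs: dividing by "1 minus the probability of ties" is the same as ignoring tied pairs. The sketch below assumes this pair-counting formulation (usually attributed to Goodman and Kruskal); the function name and data are hypothetical.

```python
from itertools import combinations

def gamma_coefficient(xs, ys):
    """Gamma: (concordant - discordant) / (concordant + discordant)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # Tied pairs (product == 0) are left out of both counts.
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (concordant - discordant) / (concordant + discordant)

xs = [1, 2, 2, 3]
ys = [1, 3, 2, 2]  # 3 concordant pairs, 1 discordant pair, 2 tied pairs
print(gamma_coefficient(xs, ys))
```

Here the six pairs split into 3 concordant, 1 discordant and 2 tied, so only four pairs enter the denominator.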


Correlation significance

The significance of the correlation between v1 and v2 is based on the following hypotheses

H0: v1 and v2 are not associated

H1: v1 and v2 may be associated


Example: Parametric coefficient

Even for a test significance α = 0.01 (99% confidence), since p = 0.000 (as rounded by SPSS, i.e. p < 0.0005) ≤ α:

Reject H0 and accept H1 (I cannot reject the hypothesis that Size and Effort are correlated)

Correlations

                                                Functional Size   Normalised Work Effort
Functional Size          Pearson Correlation    1                 .459**
                         Sig. (2-tailed)                          .000
                         N                      3310              3287
Normalised Work Effort   Pearson Correlation    .459**            1
                         Sig. (2-tailed)        .000
                         N                      3287              4180

**. Correlation is significant at the 0.01 level (2-tailed).


Example: Non-Parametric coefficients

Even for a test significance α = 0.01 (99% confidence), since p = 0.000 (as rounded by SPSS, i.e. p < 0.0005) ≤ α:

Reject H0 and accept H1 (I cannot reject the hypothesis that Size and Effort are correlated)

Correlations

                                                                      Functional Size   Normalised Work Effort
Kendall's tau_b   Functional Size          Correlation Coefficient    1.000             .471**
                                           Sig. (2-tailed)            .                 .000
                                           N                          3310              3287
                  Normalised Work Effort   Correlation Coefficient    .471**            1.000
                                           Sig. (2-tailed)            .000              .
                                           N                          3287              4180
Spearman's rho    Functional Size          Correlation Coefficient    1.000             .647**
                                           Sig. (2-tailed)            .                 .000
                                           N                          3310              3287
                  Normalised Work Effort   Correlation Coefficient    .647**            1.000
                                           Sig. (2-tailed)            .000              .
                                           N                          3287              4180

**. Correlation is significant at the 0.01 level (2-tailed).