Descriptive Methods 707.031: Evaluation Methodology Winter 2015/16



slide-1
SLIDE 1

Descriptive Methods

707.031: Evaluation Methodology Winter 2015/16

Eduardo Veas

slide-2
SLIDE 2

what we do with the data depends on the scales

2

slide-3
SLIDE 3

Measurement Scales

3

slide-4
SLIDE 4

The complexity of measurements

  • Nominal
  • Ordinal
  • Interval
  • Ratio

4

Scale runs from crude (nominal) to sophisticated (ratio)

slide-5
SLIDE 5

Nominal data

  • arbitrarily assigning a code to a category or attribute: postal codes, job classifications, military ranks, gender
  • mathematical manipulations are meaningless
  • mutually exclusive categories
  • each category is a level
  • use: frequencies, counts

5

slide-6
SLIDE 6

Ordinal data

  • ranking of an attribute
  • interval between points in scale not intrinsically equal
  • comparisons < or > are possible

6

slide-7
SLIDE 7

Interval data

  • equal distances between adjacent values, but no absolute zero
  • temperature in C or F
  • mean can be computed
  • Likert scale data ?

7

slide-8
SLIDE 8

Ratio

  • absolute zero
  • supports all arithmetic operations, including ratios
  • examples: time to complete, distance or velocity of cursor
  • count, normalized count (count per something)

8

slide-9
SLIDE 9

Frequencies

9


slide-10
SLIDE 10

Frequency tables

  • tab.courses<-as.data.frame(freq(ordered(courses), plot=FALSE)) # freq() is from the descr package
  • CumFreq=cumsum(tab.courses[-dim(tab.courses)[1],]$Frequency) # cumulative counts, excluding the Total row
  • tab.courses$CumFreq=c(CumFreq,NA)
  • tab.courses

10

slide-11
SLIDE 11

Interpreting frequency tables

      Frequency  Percent  CumPercent  CumFreq
1             2       20          20        2
2             3       30          50        5
3             4       40          90        9
4             1       10         100       10
Total        10      100          NA       NA

11
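
The same table can be rebuilt in base R, which makes it easier to see where each column comes from. This is a minimal sketch that assumes the `courses` vector used elsewhere in this deck; the name `freq_table` is only illustrative and it does not reproduce the `Total` row.

```r
# Number of courses taken by ten students (same values as the courses vector in this deck)
courses <- c(2, 1, 1, 2, 3, 3, 3, 2, 4, 3)

counts <- table(courses)               # absolute frequency of each value
percent <- 100 * prop.table(counts)    # relative frequency in percent

# freq_table is an illustrative name, not part of the original slides
freq_table <- data.frame(
  Courses    = names(counts),
  Frequency  = as.vector(counts),
  Percent    = as.vector(percent),
  CumPercent = as.vector(cumsum(percent)),
  CumFreq    = as.vector(cumsum(counts))
)
print(freq_table)
```

Reading the last row: all 10 students (100 percent) took 4 courses or fewer, while 9 of them took 3 or fewer.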

slide-12
SLIDE 12

Contingency Tables

12

          Right-handed  Left-handed  Total
Males               43            9     52
Females             44            4     48
Totals              87           13    100
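
As a rough sketch, the handedness counts can be entered as a matrix in R and the totals derived rather than typed in. The object names `handedness` and `with_totals` are assumptions made for this example.

```r
# Cell counts from the handedness contingency table
handedness <- matrix(
  c(43, 9,
    44, 4),
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex  = c("Males", "Females"),
                  Hand = c("Right-handed", "Left-handed"))
)

# addmargins() appends a "Sum" row and a "Sum" column
with_totals <- addmargins(handedness)
print(with_totals)

# Row percentages: share of right- and left-handers within each sex
row_percent <- round(100 * prop.table(handedness, margin = 1), 1)
print(row_percent)
```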

slide-13
SLIDE 13

Modelling

13

slide-14
SLIDE 14

Statistical models

  • A model has to accurately represent the real world phenomenon.
  • A model can be used to predict things about the real world.
  • The degree to which a statistical model represents the data collected is called the fit of the model.

14

slide-15
SLIDE 15

Frequency distributions

  • plot observations on the x-axis and a bar showing the count per observation
  • ideally observations fall symmetrically around the center
  • skew and kurtosis describe deviations from a normal distribution

15

slide-16
SLIDE 16

Histogram / Frequency distributions

16

slide-17
SLIDE 17

Center of a distribution

  • Mode: score that occurs most frequently in the dataset
      • it may take several values
      • it may change dramatically with a single added score
  • Median: the middle score (after ranking all scores)
      • for an even number of scores, add the two middle values and divide by 2
      • good for ordinal, interval and ratio data
  • Mean: average score
      • can be influenced by extreme scores

17
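
The three measures of center can be compared on a small sample. The sketch below uses the `courses` vector from this deck and finds the mode with base R's `table()` instead of an add-on package; the added score of 40 is a made-up extreme value, included only to show how the mean reacts.

```r
courses <- c(2, 1, 1, 2, 3, 3, 3, 2, 4, 3)

# Mode: the value(s) with the highest count; there may be more than one
counts <- table(courses)
modes <- as.numeric(names(counts)[counts == max(counts)])

med <- median(courses)   # even number of scores: average of the two middle values
avg <- mean(courses)

# Hypothetical extreme score: the mean shifts a lot, the median barely moves
with_outlier <- c(courses, 40)
mean(with_outlier)
median(with_outlier)
```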

slide-18
SLIDE 18

Dispersion of a distribution

  • range: difference between lowest and highest score
  • interquartile range: difference between the upper and lower quartiles (the median is the second quartile)

18

Range: 252 - 22 = 230; without the extreme score (252): 121 - 22 = 99
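
The range figures can be checked in R with the `facebook` vector from this deck. This is a sketch using base R only; `quantile()` and `IQR()` use R's default quantile definition (type 7), and other definitions can give slightly different quartiles.

```r
facebook <- c(22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252)

full_range <- diff(range(facebook))                      # highest minus lowest
trimmed_range <- diff(range(facebook[facebook < 252]))   # same, without the extreme score

quartiles <- quantile(facebook, c(0.25, 0.5, 0.75))      # lower quartile, median, upper quartile
iqr <- IQR(facebook)                                     # upper quartile minus lower quartile
```

The interquartile range ignores the top and bottom quarter of the scores, so the extreme value 252 does not inflate it the way it inflates the range.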

slide-19
SLIDE 19

Fit of the mean

  • deviance: x - mean
  • sum of squared errors (SS): sum of the squared deviances
  • variance = SS / (N - 1)
  • stddev = sqrt(variance)

19
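
These steps can be traced by hand in R and compared with the built-in functions. A minimal sketch on the `facebook` vector from this deck; the intermediate names (`deviances`, `ss`) are chosen for illustration.

```r
facebook <- c(22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252)

deviances <- facebook - mean(facebook)   # distance of each score from the mean
ss <- sum(deviances^2)                   # sum of squared errors
variance <- ss / (length(facebook) - 1)  # divide by N - 1
stddev <- sqrt(variance)                 # back in the original units

# The hand-computed values should agree with var() and sd()
c(by_hand = variance, built_in = var(facebook))
c(by_hand = stddev, built_in = sd(facebook))
```

The raw deviances sum to zero, which is why they are squared before being added up.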

slide-20
SLIDE 20

Assumptions

20

slide-21
SLIDE 21

Assumptions of parametric data

  • normally distributed: sample or error in the model
  • homogeneity of variance:
      • correlational: variance of one variable should be stable at all levels of the other variable
      • groups: each sample comes from a population with the same variance
  • interval data: at least interval data
  • independence: the behaviour of one participant does not influence that of another

21

slide-22
SLIDE 22

Distributions for DLF

22

[Figure: density histograms of the hygiene score on day 1, day 2 and day 3 (y-axis: Density), each paired with a Q-Q plot of theoretical vs. sample quantiles]

slide-23
SLIDE 23

Quantify normality

23

slide-24
SLIDE 24

Different groups

24

slide-25
SLIDE 25

Exam histogram

25

[Figure: density histogram of exam (x-axis: exam, y-axis: density)]

slide-26
SLIDE 26

Exam histogram

26

[Figure: three density histograms of exam (x-axis: exam, y-axis: density), with x-axis ranges of about 25-100, 10-70 and 60-100]

slide-27
SLIDE 27

Shapiro-Wilk test

  • # Shapiro-Wilk
  • shapiro.test(rexam$exam)
  • # if we are comparing groups, what is important is the normality within each group
  • by(rexam$exam, rexam$uni, shapiro.test)

27
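
Since the `rexam` file is not reproduced here, the sketch below runs the same kind of per-group check on simulated data. Everything in it is an assumption made for illustration: the data frame `scores`, the group labels `A` and `B`, and the choice of a normal and an exponential sample. It uses `tapply()` to collect only the p-values, whereas `by()` prints the full test for each group.

```r
set.seed(1)

# Simulated scores: group A is drawn from a normal distribution,
# group B from a strongly skewed (exponential) distribution
scores <- data.frame(
  group = rep(c("A", "B"), each = 200),
  value = c(rnorm(200, mean = 50, sd = 10), rexp(200, rate = 1 / 50))
)

# Test on the pooled data
overall <- shapiro.test(scores$value)

# Test within each group, keeping only the p-value
p_values <- tapply(scores$value, scores$group,
                   function(v) shapiro.test(v)$p.value)
print(p_values)
```

A small p-value for a group suggests that its scores deviate from normality, which is the case for the skewed group here.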

slide-28
SLIDE 28

Reporting Shapiro-Wilk

  • A Shapiro-Wilk test on the R exam, W=0.96, indicated a significant deviation from normality (p<0.05).

28

slide-29
SLIDE 29

Homogeneity of variance

  • Levene’s test:
  • leveneTest(rexam$exam, rexam$uni, center=mean) # leveneTest() is from the car package
  • Reporting: for the percentage on the R exam, the variances were similar for KFU and TUG students, F(1,98)=2.09

29

slide-30
SLIDE 30

Homogeneity of variance

  • Levene’s test in large datasets may be significant even for small differences in variance
  • Double-check with the variance ratio (Hartley’s Fmax)

30
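
The variance ratio is simply the largest group variance divided by the smallest. A minimal sketch with made-up scores for two groups (the values and the labels `A` and `B` are illustrative, not taken from the rexam data):

```r
# Hypothetical scores for two groups of five participants each
scores <- c(52, 60, 48, 55, 65,
            70, 85, 62, 90, 58)
group <- factor(rep(c("A", "B"), each = 5))

group_var <- tapply(scores, group, var)    # variance within each group
f_max <- max(group_var) / min(group_var)   # variance ratio (Hartley's Fmax)

print(group_var)
print(f_max)
```

A ratio near 1 points to similar variances. How large a ratio is acceptable depends on the number of groups and the group size, so the value is usually compared against tabulated critical values.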

slide-31
SLIDE 31

Correlations

31


slide-32
SLIDE 32

Everything is hard to begin with, but the more you practise the easier it gets

32

slide-33
SLIDE 33

Relationships

  • Everything is hard to begin with, but the more you practise the easier it gets
  • increase in practice, increase in skill (positive relationship)
  • increase in practice, but skill remains unchanged (no relationship)
  • increase in practice, decrease in skill (negative relationship)

33

slide-34
SLIDE 34

Correlations

  • Bivariate: correlation between two variables
  • Partial: correlation between two variables while controlling the effect of one or more additional variables

34

slide-35
SLIDE 35

Covariance

  • are changes in one variable met with similar changes in the other variable?
  • cross-product deviations (CPD): for each observation, multiply the deviations of the two variables from their means
  • covariance = sum of CPD / (N-1)

35

slide-36
SLIDE 36

Covariance II

  • Positive: both variables vary in the same direction
  • Negative: variables vary in opposite directions
  • Covariance is scale dependent, so values cannot be compared across different units of measurement

36

slide-37
SLIDE 37

Pearson correlation coefficient

  • r = cov(x, y) / (s_x * s_y)
  • Data must be at least interval
  • Value between -1 and 1
  • 1 -> variables perfectly positively correlated
  • 0 -> no linear relationship
  • -1 -> variables perfectly negatively correlated

37
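
To make the link between covariance and Pearson's r concrete, both can be computed by hand and checked against `cov()` and `cor()`. The paired values below are invented for illustration (hours of practice and a skill score); they are not from any dataset in this deck.

```r
# Hypothetical paired observations
practice <- c(1, 2, 3, 4, 5, 6)
skill    <- c(2, 3, 5, 4, 6, 8)

# Cross-product deviations: one product per observation
cpd <- (practice - mean(practice)) * (skill - mean(skill))
covariance <- sum(cpd) / (length(practice) - 1)

# Standardise by the two standard deviations to get Pearson's r
r <- covariance / (sd(practice) * sd(skill))

# Changing the unit of one variable (hours to minutes) rescales the
# covariance but leaves r unchanged
cov_scaled <- cov(practice * 60, skill)
r_scaled <- cor(practice * 60, skill)
```

This is why r, and not the covariance, is reported: it does not depend on the units in which the variables were measured.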

slide-38
SLIDE 38

Dataset Exams and Anxiety

  • effects of exam stress and revision on exam performance
  • questionnaire to assess anxiety relating to exams (EAQ)

38

slide-39
SLIDE 39

Enter data

  • examData<-read.delim("ExamAnxiety.dat", header=TRUE)
  • examData2<-examData[,c("Exam","Anxiety","Revise")]
  • cor(examData2)

39

slide-40
SLIDE 40

Pearson correlation

              Exam     Anxiety      Revise
Exam     1.0000000  -0.4409934   0.3967207
Anxiety -0.4409934   1.0000000  -0.7092493
Revise   0.3967207  -0.7092493   1.0000000

40
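
The partial correlation described earlier in this deck can be worked out directly from these three coefficients. The sketch below applies the standard first-order partial correlation formula to the correlation between Exam and Anxiety while controlling for Revise; the variable names are illustrative.

```r
# Bivariate correlations taken from the matrix above
r_exam_anxiety   <- -0.4409934
r_exam_revise    <-  0.3967207
r_anxiety_revise <- -0.7092493

# First-order partial correlation of Exam and Anxiety, controlling for Revise
partial_exam_anxiety <-
  (r_exam_anxiety - r_exam_revise * r_anxiety_revise) /
  sqrt((1 - r_exam_revise^2) * (1 - r_anxiety_revise^2))

print(partial_exam_anxiety)
```

The partial coefficient is closer to zero than the bivariate one, which suggests that part of the link between anxiety and exam performance is shared with revision time.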

slide-41
SLIDE 41

Confidence values

  • rcorr(as.matrix(examData[,c("Exam","Anxiety","Revise")])) # rcorr() is from the Hmisc package

p-values as printed:

         Exam  Anxiety  Revise
Exam                 0       0
Anxiety     0                0
Revise      0        0

41

slide-42
SLIDE 42

Reporting Pearson’s CC

A Pearson correlation coefficient indicated a significant correlation between exam anxiety and exam performance, r=-.44, p<0.01

42

slide-43
SLIDE 43

Spearman’s correlation coefficient

  • non parametric test
  • first rank the data and then apply Pearson cc

43
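
The "rank first, then Pearson" recipe can be verified in a few lines. The values below are hypothetical stand-ins for a contest position and a creativity score; they are not the liar dataset.

```r
# Hypothetical data: contest position (1 = winner) and creativity score
position   <- c(1, 2, 3, 4, 5, 6, 7, 8)
creativity <- c(60, 55, 58, 40, 45, 30, 35, 20)

# Step 1: rank both variables; step 2: apply Pearson's r to the ranks
r_ranked <- cor(rank(position), rank(creativity))

# Built-in Spearman coefficient for comparison
r_spearman <- cor(position, creativity, method = "spearman")

c(ranked_pearson = r_ranked, spearman = r_spearman)
```

Both values agree, because `method = "spearman"` performs the ranking step internally.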

slide-44
SLIDE 44

Liar Dataset

  • contest for storytelling the biggest lie
  • 68 participants, ranking, and creativity questionnaire

44

slide-45
SLIDE 45

Spearman test

  • liarData=read.delim("biggestLiar.dat", header=TRUE)
  • rcorr(as.matrix(liarData[,c("Position","Creativity")]), type="spearman") # without type="spearman", rcorr() computes Pearson's r

            Position  Creativity
Position        1.00       -0.37
Creativity     -0.37        1.00

45

slide-46
SLIDE 46

Reporting spearman

A Spearman non-parametric correlation test indicated a significant correlation between creativity and ranking in the world’s biggest liar contest, r=-.37, p<0.001

46

slide-47
SLIDE 47

Kendall’s tau non-parametric

  • preferred for small datasets with many tied ranks
  • cor.test(liarData$Position, liarData$Creativity, alternative="less", method="kendall")
  • z = -3.2252, p-value = 0.0006294
  • alternative hypothesis: true tau is less than 0
  • sample estimates: tau = -0.3002413

47

slide-48
SLIDE 48

Reporting Kendall’s test

A Kendall tau correlation coefficient indicated a significant correlation between creativity and performance in the World’s biggest liar contest, τ=-.30, p<0.001

48

slide-49
SLIDE 49

Biserial and point-biserial correlations

  • one variable is dichotomous (categorical with 2 categories)
  • point-biserial: for a discrete dichotomy (e.g., dead or alive)
  • biserial: for a continuous dichotomy (e.g., passing or failing an exam)

49
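
A point-biserial correlation is a Pearson correlation in which the dichotomous variable is coded 0 and 1. The sketch below uses invented data (a 0/1 group code and a continuous score) and also shows that the significance test matches an equal-variance t-test between the two groups.

```r
# Hypothetical data: group membership coded 0/1 and a continuous score
group <- c(0, 0, 0, 0, 1, 1, 1, 1)
score <- c(10, 12, 9, 11, 15, 14, 16, 13)

# Point-biserial correlation: Pearson's r with the 0/1 coding
r_pb <- cor(group, score)

# Significance test for the correlation
r_test <- cor.test(group, score)

# Equal-variance t-test comparing the two groups gives the same t statistic
t_test <- t.test(score[group == 1], score[group == 0], var.equal = TRUE)

c(r = r_pb, t_from_cor = unname(r_test$statistic), t_from_ttest = unname(t_test$statistic))
```

The sign of the coefficient depends only on which category is coded 1, so it should be interpreted with the coding in mind.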

slide-50
SLIDE 50

Readings

  • Discovering Statistics Using R (Andy Field, Jeremy Miles, Zoe Field)

50

slide-51
SLIDE 51

R

51


slide-52
SLIDE 52

set work directory

  • setwd("/new/work/directory")
  • getwd()
  • ls() # list the objects in the current workspace

52

slide-53
SLIDE 53

packages

  • install.packages("package.name") # installing packages

  • library(package.name) # loading a package
  • package::function() # disambiguating functions

53

slide-54
SLIDE 54

Nominal and Ordinal data

  • mydata$v1 <- factor(mydata$v1,
        levels = c(1,2,3),
        labels = c("red", "blue", "green"))
  • mydata$v1 <- ordered(mydata$y,
        levels = c(1,3,5),
        labels = c("Low", "Medium", "High"))

54

slide-55
SLIDE 55

Missing data

  • is.na(var) # tests for missing values; also works on rows
  • mydata$v1[mydata$v1==99] <- NA # select rows where v1 is 99 and recode column v1
  • x <- c(1,2,NA,3)
    mean(x) # returns NA
    mean(x, na.rm=TRUE) # returns 2
  • newdata <- na.omit(mydata) # create a dataset without missing data

55

slide-56
SLIDE 56

Install and load packages

  • install.packages("car"); install.packages("ggplot2"); install.packages("pastecs"); install.packages("psych"); install.packages("descr")
  • library(car);library(ggplot2);library(pastecs);library(psych);library(Rcmdr);library(descr)

56

slide-57
SLIDE 57

Enter data

  • id<-c(1,2,3,4,5,6,7,8,9,10)
  • sex<-c(1,1,1,1,1,2,2,2,2,2)
  • courses<-c(2.0,1.0,1.0,2.0,3.0,3.0,3.0,2.0,4.0,3.0)
  • sex<-factor(sex, levels=c(1:2), labels=c("M", "F"))
  • example<-data.frame(ID=id,Gender=sex,Courses=courses)

57

slide-58
SLIDE 58

Frequency Distributions

  • facebook<-c(22,40,53,57,93,98,103,108,116,121,252)

  • library(modeest)
  • mfv(facebook)
  • mean(facebook)
  • median(facebook)

58

slide-59
SLIDE 59

Enter data

  • quantile (facebook)
  • IQR (facebook)
  • var(facebook)
  • sd(facebook)

59

slide-60
SLIDE 60

describing your data

  • #load meaningful data
  • lecturerData<-read.csv("lecturerData.csv", header=TRUE)
  • #get statistics
  • stat.desc(lecturerData[,c("friends", "income")], basic=FALSE, norm=TRUE)

60

slide-61
SLIDE 61

describing your data II

  • # print frequency table
  • tab.friends<-as.data.frame(freq(ordered(lecturerData$friends), plot=FALSE))
  • tab.friends.cumsum<-cumsum(tab.friends[-dim(tab.friends)[1],]$Frequency)
  • tab.friends$CumFreq=c(tab.friends.cumsum,NA)
  • tab.friends

61

slide-62
SLIDE 62

Testing Normally Distributed

  • Load DLF data
  • dlf<-read.delim("DownloadFestival.dat", header=TRUE)
  • Data about hygiene collected during a festival (3 days)

62

slide-63
SLIDE 63

Enter data

  • hist.day1 <- ggplot(dlf, aes(day1)) + theme(legend.position = "none") + geom_histogram(aes(y = ..density..), colour="black", fill="white") + labs(x="Hygiene score on day1", y="Density")
  • hist.day1 + stat_function(fun = dnorm, args=list(mean=mean(dlf$day1, na.rm=TRUE), sd=sd(dlf$day1, na.rm=TRUE)), colour="blue", size=1)
  • qqplot.day1 <- qplot(sample=dlf$day1, stat="qq")

63

slide-64
SLIDE 64

Plot day 1

64

[Figure: density histogram of the hygiene score on day 1 (x-axis: Hygiene score on day1, up to 20; y-axis: Density)]

slide-65
SLIDE 65

Offending score

  • # print bad score
  • dlf$day1[dlf$day1>5]
  • #correct bad score
  • dlf$day1[dlf$day1>5]<-2.02

65

[Figure: Q-Q plot of the day 1 hygiene scores (theoretical vs. sample quantiles)]

slide-66
SLIDE 66

Quantify normality

  • describe(cbind(dlf$day1, dlf$day2, dlf$day3))
  • stat.desc(dlf[,c("day1","day2","day3")], basic = FALSE, norm = TRUE)

66

slide-67
SLIDE 67

Groups

  • rexam<-read.delim("rexam.dat", header=TRUE)
  • # obtain statistics for exam, computer, lectures and numeracy
  • round(stat.desc(rexam[,c("exam","computer","lectures","numeracy")], basic=FALSE, norm=TRUE), digits=3)
  • hist.exam <- ggplot(rexam, aes(exam)) + theme(legend.position = "none") + geom_histogram(aes(y = ..density..), colour="black", fill="white") + labs(x="exam", y="density") + stat_function(fun=dnorm, args=list(mean=mean(rexam$exam, na.rm=TRUE), sd=sd(rexam$exam, na.rm=TRUE)), colour="blue", size=1)

67

slide-68
SLIDE 68

Add factors

  • # set uni to be a factor
  • rexam$uni <- factor(rexam$uni, levels = c(0:1), labels = c("KFU", "TUG"))
  • by(rexam[, c("exam", "computer", "lectures", "numeracy")], rexam$uni, stat.desc, basic=FALSE, norm=TRUE)

68

slide-69
SLIDE 69

Get subsets and individual histograms

  • # now we create subsets of the example dataset for each factor level
  • kfu <- subset(rexam, rexam$uni=="KFU")
  • tug <- subset(rexam, rexam$uni=="TUG")
  • # now we can create histograms for each subset
  • hist.exam.kfu <- ggplot(kfu, aes(exam)) + theme(legend.position = "none") + geom_histogram(aes(y = ..density..), colour="black", fill="white") + labs(x="exam", y="density") + stat_function(fun=dnorm, args=list(mean=mean(kfu$exam, na.rm=TRUE), sd=sd(kfu$exam, na.rm=TRUE)), colour="blue", size=1)

69