SLIDE 1

Exploratory Data Analysis

Paul Cohen ISTA 370 Spring, 2012

Paul Cohen ISTA 370 () Exploratory Data Analysis Spring, 2012 1 / 46

SLIDE 2

Outline

  • Data, revisited
  • The purpose of exploratory data analysis
  • Learning to see

SLIDE 3

Data: A Review

Things and Data

SLIDE 4

Data: A Review

Things and Data

SLIDE 5

Data: A Review

Where Data Come From

Data are measurements of individuals (people, trees, countries, ecosystems...).

[Figure: An Individual, the Data measured from that individual, and A Data Table. Example measurements: 56 years old, 70" tall, 180 lbs, brown eyes, moderately presbyopic, good health, married, one child, ...]

SLIDE 6

Exploratory Data Analysis

What is Exploratory Data Analysis (EDA)?

In terms of the Fundamental Model of Data, y = f(x, ε):
  • EDA infers which factors strongly and weakly influence y and the functions that combine these factors
  • EDA examines ε to see whether it contains evidence of other important but unmeasured (“hidden”) factors
  • Confirmatory studies test whether x really is a causal factor that influences y
  • Exploratory studies are to confirmatory studies as test kitchens are to cookbook recipes.
  • EDA generally doesn’t test hypotheses but, rather, “helps the data tell its story.”
  • EDA helps you understand phenomena, and suggests hypotheses to test in confirmatory studies.

SLIDE 7

Exploratory Data Analysis

What is Exploratory Data Analysis? Learning to See

  • Data have structure that is evidence of causal influences. EDA uncovers, exposes, clarifies this structure.
  • EDA is like hunting for fossils – it’s a skill, and you must “learn to see” not only what’s in front of you, but what lies within data.
  • EDA asks, “what do I see, and what does it mean?”
  • Like any other skill, EDA takes a lot of practice.

SLIDE 8

Exploratory Data Analysis

Load Some Data

> read.table.ISTA370<-function(filename){
    dataURL<-"http://www.sista.arizona.edu/~cohen/ISTA%20370/D
    # Reads a data frame from a URL path rooted at ISTA370 dat
    read.table(paste(dataURL,filename,sep=""))
  }
>
> # taheri<-read.table.ISTA370("taheri1.txt")
> # iris<-read.table.ISTA370("iris.txt")
> # heightC<-read.table.ISTA370("heightC.txt")
> # treering<-read.table.ISTA370("treering1.txt")
> # blast<-read.table.ISTA370("blastSummary.txt")
> # kinect<-read.table.ISTA370("onemovie.txt")
> # readability<-read.table.ISTA370("readability.txt")

SLIDE 9

Learning to See

What Do You See? What Does it Mean?

> hist(iris$Petal.Length,col="grey",main=NULL)

[Figure: histogram of iris$Petal.Length (x-axis: iris$Petal.Length; y-axis: Frequency)]

SLIDE 10

Learning to See

What Do You See? What Does it Mean?

> ipl<-iris$Petal.Length
> hist(ipl,prob=TRUE,ylim=c(0,1),main=NULL)
> lines(density(ipl[iris$Species=="setosa"]),col="red")
> lines(density(ipl[iris$Species=="versicolor"]),col="green")
> lines(density(ipl[iris$Species=="virginica"]),col="blue")

Looking at density curves for each species, we see that the histogram did indeed indicate two or more separate populations (species).

[Figure: histogram of ipl with a density curve for each species (x-axis: ipl; y-axis: Density)]

SLIDE 11

Learning to See

What Do You See? What Does it Mean?

> boxplot(iris$Petal.Length~iris$Species, ylab="Petal.Length",xlab="Species")

A boxplot by species confirms this, and summarizes the petal length statistics for each species.

[Figure: boxplots of Petal.Length for setosa, versicolor, and virginica (x-axis: Species; y-axis: Petal.Length)]

SLIDE 12

Learning to See

Boxplots

[Figure: anatomy of a boxplot, labeling the median, the upper quartile (75% quantile), the lower quartile (25% quantile), the interquartile range, the whiskers (various interpretations), and outliers]

SLIDE 13

Learning to See

Median, Quartiles, Interquartile Range

  • If you sort the values in a sample from lowest to highest, the median is the middle value, or the average of the two middle values when the sample contains an even number of points.
  • The median is the 50th percentile (the 0.5 quantile): the value for which 50% of the values are greater.
  • The lower quartile is the 25th percentile, above which 75% of the values are found.
  • The upper quartile is the 75th percentile, above which 25% of the values are found.
  • The interquartile range is a measure of variability and is the difference between the upper and lower quartiles.
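These definitions can be checked by hand on a small sample. The sketch below uses a made-up vector of eight values (not course data) and R's built-in median(), quantile(), and IQR(). Note that quantile() interpolates between sorted values by default, so other software may report slightly different quartiles for the same sample.

```r
# Hypothetical sample, chosen only for illustration
x <- c(7, 1, 3, 9, 5, 11, 13, 15)

sorted <- sort(x)                     # 1 3 5 7 9 11 13 15
med    <- median(x)                   # even n: average of the two middle values, 7 and 9
q      <- quantile(x, c(0.25, 0.75))  # lower and upper quartiles (R's default interpolation)
iqr    <- IQR(x)                      # upper quartile minus lower quartile

# Fraction of the sample lying above the lower quartile
frac_above_lower <- mean(x > q[1])
```

With this sample, three quarters of the values lie above the lower quartile, as the definition requires.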

SLIDE 14

Learning to See

Median, Quartiles, Interquartile Range

  • The median is robust against outliers; the mean is not.
  • Suppose 100 families in a neighborhood each make $40,000/year. When a millionaire moves in, the mean jumps from $40,000 to $49,504/year. What happens to the median?
  • Before the millionaire arrived, the variance in income was zero. Afterwards the variance is over nine billion! What happens to the interquartile range?
  • Suppose you have a sorted sample of 9 elements; the median is the fifth element. If you add another element, what will the median be? By how many locations in the sorted distribution can the median shift?
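The neighborhood example is easy to reproduce. The sketch below assumes the millionaire's income is $1,000,000/year, which is the value that yields the mean quoted above; var() is R's sample variance (dividing by n - 1).

```r
incomes_before <- rep(40000, 100)          # 100 families at $40,000/year
incomes_after  <- c(incomes_before, 1e6)   # assumption: the newcomer earns $1,000,000/year

mean_after   <- mean(incomes_after)    # pulled up by a single value
median_after <- median(incomes_after)  # robust statistic
var_before   <- var(incomes_before)
var_after    <- var(incomes_after)     # sample variance
iqr_after    <- IQR(incomes_after)     # robust measure of spread

round(c(mean = mean_after, median = median_after,
        var_before = var_before, var_after = var_after, IQR = iqr_after))
```

Comparing the printed values shows which summaries move when one extreme value is added and which do not.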

SLIDE 15

Learning to See

Symmetry and Skew

> with(blast, hist(Test0,breaks=20,col="grey",main=NULL))
> with(treering, hist(width,breaks=20,col="grey",main=NULL))

Test0 is skewed to the right, meaning it has a long tail of higher values, while treering width is nearly symmetric.

[Figure: histograms of Test0 and of width (y-axis: Frequency)]

SLIDE 16

Learning to See

Transformations

> attach(blast)
> hist(Train0,breaks=20,col="grey",main=NULL)
> Train0Squared<-Train0^2 #square the Train0 data
> hist(Train0Squared,breaks=20,col="grey",main=NULL)

A simple transformation amplifies an otherwise hidden feature.

[Figure: histograms of Train0 and of Train0Squared (y-axis: Frequency)]

SLIDE 17

Learning to See

Transformations

> Train0Squared<-with(blast,Train0^2)
> with(blast,plot(density(Train0Squared)))
> with(blast,lines(density(Train0Squared[gender=="female"]),c
> with(blast,lines(density(Train0Squared[gender=="male"]),col

Does gender explain the bump?

[Figure: density.default(x = Train0Squared); N = 187, Bandwidth = 0.04378; y-axis: Density]

SLIDE 18

Learning to See

What Explains the Bump New Topics

> plot(Train0Squared,NewSkills0,col=gender)
> plot(NewSkills0,Train0Squared,col=gender)

The number of topics to which students were exposed (NewSkills0) seems to explain the bump, but gender doesn’t.

[Figure: scatterplots of NewSkills0 against Train0Squared and of Train0Squared against NewSkills0]

SLIDE 19

Learning to See

What Explains the Bump New Topics

> precocious<-NewSkills0>8
> plot(density(Train0Squared))
> lines(density(Train0Squared[precocious=="TRUE"]),col="red")
> lines(density(Train0Squared[precocious=="FALSE"]),col="blue

So the students who saw too many subjects account for the bump.

[Figure: density.default(x = Train0Squared); N = 187, Bandwidth = 0.04378; y-axis: Density]

SLIDE 20

Learning to See

Boxplots instead of density plots

> boxplot(Train0Squared~precocious, xlab="precocious", ylab="proportion training items correct" )

[Figure: boxplots of Train0Squared for precocious = FALSE and TRUE (x-axis: precocious; y-axis: proportion training items correct)]

SLIDE 21

Learning to See

What Explains the Bump Why did some students see hard problems?

Exploratory data analysis helped us find and amplify an odd feature of data: Some students saw too many hard problems while training for the first test. How did this happen?

> table(precocious, policy)
          policy
precocious DBN_11 EXPERT RANDOM
     FALSE     39    119      4
     TRUE       0      0     25

All these “precocious” students were in one condition of the experiment. In the RANDOM condition, training problems were selected at random, so we shouldn’t be surprised that students in this condition bombed on the first test!

SLIDE 22

Learning to See

Outline

  • Data, revisited
  • The purpose of exploratory data analysis
  • Learning to see: Histograms, boxplots, median and other robust statistics, transforming data... missing values, tips.

SLIDE 23

Learning to See

What Do You See? What Does it Mean?

> ts.plot(kinect$lhand.y,ylim=c(-500,1000),col="red")

[Figure: time series of kinect$lhand.y (x-axis: Time; y-axis: kinect$lhand.y)]

SLIDE 24

Learning to See

What Do You See? What Does it Mean?

> ts.plot(kinect$lhand.y,ylim=c(-500,1000),col="red")
> lines(rep(0,150))

Constants and lack of change are rare in biometric data. Zero is a special number. Perhaps the Kinect codes missing data as “0”. What is really happening around time 115?

[Figure: time series of kinect$lhand.y with a horizontal line at zero (x-axis: Time; y-axis: kinect$lhand.y)]

SLIDE 25

Learning to See

Missing Data

  • Data can be missing for many reasons (e.g., subjects in the BLAST experiment were allowed to leave once they hit a maximum score, so they didn’t take all tests).
  • Missing data is usually given a code, such as 999 or NA. The Kinect code of zero is unhelpful. Why?
  • R sometimes refuses to do math on random variables with missing values. Is this unhelpful?
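One way to see the difference between a numeric missing-data code and NA is to compute a mean both ways. The sketch below uses made-up hand-height readings (not the Kinect file) and assumes that an exact 0 is never a genuine reading, which is the guess made on the previous slide.

```r
# Hypothetical readings; the two 0s stand in for frames the sensor dropped
raw <- c(412, 398, 0, 0, 405, 420)

mean_with_zeros <- mean(raw)        # the zeros are silently averaged in

recoded <- raw
recoded[recoded == 0] <- NA         # assumption: an exact 0 means "missing"

mean_na   <- mean(recoded)                  # NA: R makes the gap visible
mean_real <- mean(recoded, na.rm = TRUE)    # mean of the four real readings
```

A zero code produces a plausible-looking but wrong number with no warning, whereas NA forces an explicit decision (here, na.rm = TRUE) about how to treat the gaps.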

SLIDE 26

Learning to See

Missing Data in R

> # won't work, missing values:
> with(blast,mean(Test3))
[1] NA
> # this time, exclude missing values:
> with(blast,mean(Test3,na.rm=TRUE))
[1] 0.3843575
> # get their indices:
> with(blast,which(is.na(Test3)))
[1]  17  39  41  55  73  78 142 179

SLIDE 27

Learning to See

Missing Data: It Matters Why!

  • Most experiment results are based on the assumption of “random sampling” or “random assignment to conditions.”
  • When data are Missing Completely At Random (MCAR), missing data don’t violate these assumptions. MCAR means missingness is independent of measured and unmeasured factors.
  • Data are Missing at Random (MAR) when the reason they are missing has nothing to do with the data themselves. If food poisoning is proportional to fast food consumption (FFC), but FFC is unrelated to enrollment in ISTA 370, then if you’re absent due to food poisoning, you are MAR.
  • NMAR data are not missing at random. If students ordinarily take four tests, but are excused from future tests if they get a perfect score on a test, are they MAR or NMAR? How might NMAR introduce errors in analysis?
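A small simulation can make the last question concrete. Everything below is invented for illustration: the latent skill variable, the noise level, the 0.8 excusal threshold, and the 20% coin-flip dropout are assumptions, not properties of the BLAST data. The sketch compares the mean of a second test when scores are dropped by chance with the mean when strong students are excused, as in the scenario above.

```r
set.seed(370)                        # arbitrary seed, for reproducibility
n     <- 1000
skill <- runif(n)                    # hypothetical latent ability in [0, 1]
test1 <- pmin(1, pmax(0, skill + rnorm(n, sd = 0.1)))
test2 <- pmin(1, pmax(0, skill + rnorm(n, sd = 0.1)))

# Dropped completely at random: 200 scores removed by lottery
mcar <- test2
mcar[sample(n, size = 200)] <- NA

# Dropped by rule: students who did very well on test1 skip test2
excused <- test2
excused[test1 >= 0.8] <- NA

m <- c(full    = mean(test2),
       mcar    = mean(mcar, na.rm = TRUE),
       excused = mean(excused, na.rm = TRUE))
round(m, 3)
```

In this setup the chance-dropped mean stays close to the full mean, while the mean over non-excused students is noticeably lower, because the students most likely to score high are exactly the ones missing.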

SLIDE 28

Learning to See

Not Missing At Random (NMAR) Data

Participants in the BLAST experiment took four tests, but those who scored 18 or more on any test were excused from later tests.

> which(is.na(Test3)) # who didn't take Test3
[1]  17  39  41  55  73  78 142 179
> Test2[which(is.na(Test3))] # what were their Test2 scores
[1] 1.00   NA 0.95   NA 1.00 0.90 1.00 1.00
> T3<-Test3
> mean(T3,na.rm=TRUE) # Mean Test3 scores
[1] 0.3843575
> T3[is.na(T3)]<-1 # Replace NAs with perfect scores
> mean(T3) # Mean T3 scores
[1] 0.4106952

SLIDE 29

Learning to See

Not Missing At Random (NMAR) Data Does it Matter?

If NMAR data are “evenly” distributed over experimental conditions, then they might not matter so much. So let’s check:

> notTest3<-which(is.na(Test3)) # who didn't take Test3
> gender[notTest3]
[1] female male   male   male   female male   male   male
Levels: female male
> cond.plus.policy[notTest3]
[1] DBN_11_NO_CHOICE  DBN_11_NO_CHOICE  DBN_11_NO_CHOICE
[4] EXPERT_NO_CHOICE  EXPERT_NO_CHOICE  EXPERT_NO_CHOICE
[7] EXPERT_CHOICE     EXPERT_CHOICE_ZPD
6 Levels: DBN_11_NO_CHOICE ... RANDOM_NO_CHOICE

SLIDE 30

Learning to See

Not Missing At Random (NMAR) Data Does it Matter?

> aggregate.data.frame(Test3,by=list(condition,gender),
    FUN=mean, na.rm=TRUE)
    Group.1 Group.2         x
1    CHOICE  female 0.3939024
2 NO_CHOICE  female 0.3139535
3    CHOICE    male 0.3901961
4 NO_CHOICE    male 0.4375000
> aggregate.data.frame(T3,by=list(condition,gender),
    FUN=mean, na.rm=TRUE)
    Group.1 Group.2         x
1    CHOICE  female 0.3939024
2 NO_CHOICE  female 0.3444444
3    CHOICE    male 0.4132075
4 NO_CHOICE    male 0.4843750

T3 sets all the NAs to 1.0, so it’s what students would get if they maxed out the tests they were allowed to skip. Note the useful aggregate.data.frame command, which applies FUN to subsets of a variable defined by by=...
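The same grouping pattern can be tried without the course files. The sketch below applies aggregate() to the iris data that ships with R; using Species as the grouping variable is just a convenient stand-in for condition and gender.

```r
# iris is built into R; Species plays the role of the grouping variable
by_species <- aggregate(iris$Petal.Length,
                        by = list(Species = iris$Species),
                        FUN = mean, na.rm = TRUE)
by_species   # one row per species; the mean appears in column x
```

Naming the list element (Species = ...) replaces the default Group.1 column header, and extra arguments such as na.rm = TRUE are passed through to FUN.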

SLIDE 31

Learning to See

Missing and Censored Data

  • When a small fraction of your data is missing, you can ignore it or impute its values.
  • Common imputation methods involve matching records that have missing data to records that don’t, and using one or more of the complete records to infer the missing value.
  • Censored data is a more challenging problem. Examples:
      • Inferring average runtime of an algorithm for a batch of jobs that’s allowed to run for a fixed time.
      • Inferring age at death of a treatment sample when some people are still alive when the study ends.
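The runtime example can be simulated to show why censoring is awkward. The sketch below is an illustration under invented assumptions: exponentially distributed runtimes with mean 10 seconds and a 15-second cutoff. It only demonstrates the bias of two naive estimates; it does not implement a proper survival-analysis estimator.

```r
set.seed(31)                                  # arbitrary seed
true_runtime <- rexp(500, rate = 1 / 10)      # hypothetical job runtimes, mean 10 s
time_limit   <- 15                            # jobs still running at 15 s are stopped

observed <- pmin(true_runtime, time_limit)    # what the log actually records
censored <- true_runtime > time_limit         # jobs that never finished

est_true      <- mean(true_runtime)           # what we want but cannot observe
est_capped    <- mean(observed)               # treats stopped jobs as if they took 15 s
est_finished  <- mean(observed[!censored])    # ignores stopped jobs entirely

round(c(true = est_true, capped = est_capped, finished_only = est_finished), 2)
```

Both naive estimates fall below the true mean: capping undercounts the long jobs, and dropping them removes the slowest part of the distribution altogether.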

SLIDE 32

Learning to See

What Do You See? What Does it Mean?

> hist(heightC$weight,main=NULL,xlab="Weight of 33 College St

[Figure: histogram of student weights (x-axis: Weight of 33 College Students; y-axis: Frequency)]

SLIDE 33

Learning to See

What Do You See? What Does it Mean?

Common sense tells us that a weight of 50 lbs or less is unlikely and is probably an erroneous datum.

[Figure: histogram of student weights (x-axis: Weight of 33 College Students; y-axis: Frequency)]

SLIDE 34

Learning to See

What Do You See? What Does it Mean?

SLIDE 35

Learning to See

What Do You See? What Does it Mean?

The vertical axes are different, so it’s hard to tell what’s happening.

SLIDE 36

Learning to See

What Do You See? What Does it Mean?

SLIDE 37

Learning to See

What Do You See? What Does it Mean?

Not all apparent differences are real differences. Mentally group your data to see if something might explain apparent differences.

SLIDE 38

Learning to See

What Do You See? What Does it Mean?

> m<-c(0.93,0.95,0.94,0.86,NA,0.86, 0.89, 0.85, 0.90, 0.85, 0
> s<-c(0.26, 0.23, 0.25, 0.34,NA,0.34, 0.31, 0.35, 0.30, 0.35
> barx <- barplot(m,ylim=c(0,1.5),names.arg=1:11,axis.lty=1,x
> error.bar(barx,m,s)

Not all differences are real differences.

[Figure: bar chart with error bars for conditions 1 to 11 (x-axis: Condition)]
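error.bar is not a base R function, and its definition is not shown on the slide. A minimal sketch of such a helper, built on the base graphics function arrows(), might look like the following. The function body, its argument names, and the four made-up means and spreads are assumptions for illustration, not the course's code or data.

```r
# Hypothetical helper: draw y +/- s as a vertical bar with flat caps.
# x: bar midpoints returned by barplot(); y: bar heights; s: half-length of each bar
error.bar <- function(x, y, s, length = 0.05, ...) {
  ok <- !is.na(y) & !is.na(s)            # skip conditions with missing values
  arrows(x[ok], y[ok] + s[ok], x[ok], y[ok] - s[ok],
         angle = 90, code = 3, length = length, ...)
  invisible(sum(ok))                     # number of error bars drawn
}

# Made-up means and spreads, chosen only to exercise the helper
m <- c(0.93, 0.95, NA, 0.86)
s <- c(0.26, 0.23, NA, 0.34)

pdf(NULL)                                # null device: nothing is written to disk
barx    <- barplot(m, ylim = c(0, 1.5), names.arg = 1:4)
n_drawn <- error.bar(barx, m, s)
invisible(dev.off())
```

Setting angle = 90 and code = 3 turns each arrow into a segment with a flat cap at both ends, which is the usual trick for error bars in base graphics. When the bars overlap as much as they do here, the apparent differences between means deserve skepticism.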

SLIDE 39

Learning to See

What Do You See? What Does it Mean?

> plot(taheri$A,col="red",type="l",ylim=c(0,1))
> lines(taheri$B,col="blue")

[Figure: line plot of taheri$A (red) and taheri$B (blue) (x-axis: Index; y-axis: taheri$A)]

SLIDE 40

Learning to See

What Do You See? What Does it Mean?

> plot(taheri$A,col="red",type="l",ylim=c(0,1))
> lines(taheri$B,col="blue")
> cor(taheri$A,taheri$B)
[1] -0.93024

Although features A and B were supposed to be independent, they were normalized to sum to a constant, rendering them dependent.

[Figure: line plot of taheri$A (red) and taheri$B (blue) (x-axis: Index; y-axis: taheri$A)]
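The effect of normalizing to a constant sum can be reproduced with synthetic data. The sketch below generates two independent features (made-up uniform values, not the taheri data) and normalizes each pair to sum to 1; this is the simplest two-feature case, so the induced correlation is as strong as it can be.

```r
set.seed(40)                      # arbitrary seed
a <- runif(200, 1, 2)             # two hypothetical features, generated independently
b <- runif(200, 1, 2)
r_raw <- cor(a, b)                # near 0: no built-in relationship

total  <- a + b
a_norm <- a / total               # normalize each pair to sum to 1
b_norm <- b / total
r_norm <- cor(a_norm, b_norm)     # essentially -1, because b_norm = 1 - a_norm

round(c(raw = r_raw, normalized = r_norm), 3)
```

When more than two features share the constant sum, the induced correlation between any pair is typically weaker than -1 but still negative, so a strong negative correlation between normalized features is a reason to check how they were constructed.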

SLIDE 41

Learning to See

What Do You See? What Does it Mean?

> with(mtcars,plot(disp,mpg))
> with(mtcars,cor(disp,mpg))
[1] -0.8475514

[Figure: scatterplot of mpg against disp (x-axis: disp; y-axis: mpg)]

SLIDE 42

Learning to See

What Do You See? What Does it Mean?

> palette(c("blue","red","forestgreen"))
> with(mtcars,plot(disp,mpg,col=cyl))
> with(mtcars,cor(disp,mpg))
[1] -0.8475514

Coloring by a third variable shows that the correlation between disp and mpg depends on cyl.

[Figure: scatterplot of mpg against disp, colored by cyl (x-axis: disp; y-axis: mpg)]
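What the coloring suggests visually can also be checked numerically by computing the correlation separately within each cylinder group. The sketch below uses split() and sapply() on the built-in mtcars data; it is one of several ways to do a grouped computation in base R.

```r
overall <- with(mtcars, cor(disp, mpg))             # the single number reported above

# One correlation per cylinder count (4, 6, 8)
by_cyl <- sapply(split(mtcars, mtcars$cyl),
                 function(d) cor(d$disp, d$mpg))

round(c(overall = overall, by_cyl), 2)
```

Comparing the per-group values with the overall value shows how much of the overall relationship comes from differences between the cylinder groups rather than from variation within them.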

SLIDE 43

Tips for looking at data

Tips: Not All Data are Real Data

Data can be noisy, missing, contaminated, or even intentionally wrong (e.g., perverse subjects). You rarely know which data are suspicious. You have to look carefully for strange values and decide what to do with them.

[Figures: histogram of Weight of 33 College Students (y-axis: Frequency); time series of kinect$lhand.y (x-axis: Time)]

SLIDE 44

Tips for looking at data

Tips: Not All Differences are Real Differences

  • Get the right vertical axis scale
  • Augment your picture with measures of variation

[Figure: bar chart with error bars for conditions 1 to 11 (x-axis: Condition)]

SLIDE 45

Tips for looking at data

Tips: Look for“holes”in data

Holes are regions that have fewer data. You ask, “why are so few data there?”

[Figures: scatterplot of NewSkills0 against Train0Squared; histogram of iris$Petal.Length (y-axis: Frequency)]

SLIDE 46

Tips for looking at data

Tips: Don’t rely on summaries. Look at the data!

Means and other summaries are useful but the raw data show patterns obscured by summaries.

[Figure: unemployment by year, 1990 to 2010 (x-axis: year; y-axis: unemployment)]