Styles of data analysis: DAAG Chapter 2 (PowerPoint presentation)

SLIDE 1

Styles of data analysis

DAAG Chapter 2

SLIDE 2

Objectives

  • Learn the common tools of Exploratory Data Analysis
    – Histograms, density plots, boxplots
    – Scatterplots and scatterplot matrices
    – Data summaries
  • Learn about what to look for and what can go wrong
    – Outliers, skewness, clustering
    – Non-linearity, heteroscedasticity
  • Be mindful of good statistical practice, overreaching, overfitting, …
SLIDE 3

What is the first rule of data analysis?

Plot your data!

SLIDE 4

Exploratory data analysis

  • Formalized by John Tukey
    – Guiding principle: let the data speak for themselves
  • Why do EDA?
    – Suggest new ideas or understandings
    – Reveal problematic assumptions made before data collection
    – Check on assumptions to be made in subsequent analysis
    – Suggest future research questions or directions

SLIDE 5

Plots for a single variable

[Figure: two histograms of total length (cm), with frequency on the vertical axis. Panel A: breaks at 72.5, 77.5, ...; panel B: breaks at 75, 80, ...]
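A minimal R sketch of the comparison in panels A and B: the same values binned with two sets of 5 cm breaks that differ only in where they start. The lengths are simulated stand-ins (the slide's data are not reproduced here), and the names `len`, `breaks_a`, and `breaks_b` are illustrative.

```r
# Simulated stand-in for the total-length data (cm); NOT the slide's data.
set.seed(1)
len <- rnorm(43, mean = 87, sd = 4)

# Two sets of 5 cm bins that differ only in their starting point.
lo <- floor(min(len) / 5) * 5
hi <- ceiling(max(len) / 5) * 5
breaks_a <- seq(lo - 2.5, hi + 2.5, by = 5)  # ..., 72.5, 77.5, ...
breaks_b <- seq(lo, hi, by = 5)              # ..., 75, 80, ...

par(mfrow = c(1, 2))
h_a <- hist(len, breaks = breaks_a, main = "A: breaks at ..., 77.5, 82.5, ...",
            xlab = "Total length (cm)")
h_b <- hist(len, breaks = breaks_b, main = "B: breaks at ..., 80, 85, ...",
            xlab = "Total length (cm)")
```

Comparing `h_a$counts` with `h_b$counts` shows how far the apparent shape can move when only the break points change.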

SLIDE 6

Plots for a single variable

[Figure: the same two histograms of total length (cm), with density on the vertical axis. Panel A: breaks at 72.5, 77.5, ...; panel B: breaks at 75, 80, ...]
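A sketch of a density-scale histogram with a kernel density estimate drawn on top. The overlay is an assumption about what the panels show, and the data are again simulated stand-ins. Unlike the bars, the curve does not depend on the histogram's break points, although it does depend on the kernel bandwidth.

```r
# Simulated stand-in for the total-length data (cm); NOT the slide's data.
set.seed(1)
len <- rnorm(43, mean = 87, sd = 4)

d <- density(len)            # Gaussian kernel, default bandwidth
h <- hist(len, plot = FALSE) # compute the bins without drawing them

# Draw bars on the density scale (total area 1), then add the smooth curve.
plot(h, freq = FALSE, main = "", xlab = "Total length (cm)",
     ylim = c(0, max(h$density, d$y)))
lines(d)
```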

SLIDE 7

Plots for a single variable

[Figure: single-axis plot of total length (cm), axis from 75 to 95]

SLIDE 8

Plots for bivariate data

  • Experiment with 17 tasters
    – Milk sample with 1 unit of sweetener
    – Milk sample with 4 units of sweetener
  • Each person rated the sweetness of the two samples

SLIDE 9

Plots for bivariate data

  • 1:1 plot ratio
  • Rug shows where points lie on the axis
  • Most people think “four” is sweeter
  • There is a positive relationship between ratings

[Figure: scatterplot of the sweetness ratings for samples “one” and “four”, both axes running from 2 to 7]
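A sketch of such a plot in base R. The ratings are simulated stand-ins for the 17 tasters, reading "1:1 plot ratio" as equal axis scales (`asp = 1`) is an assumption, and the reference line `abline(0, 1)` is an added aid rather than something the slide text confirms.

```r
# Simulated ratings for 17 tasters; stand-in values, NOT the slide's data.
set.seed(2)
one  <- round(runif(17, min = 2, max = 6), 1)
four <- round(pmin(one + runif(17, min = 0, max = 2), 7), 1)

lims <- range(c(one, four))
plot(four ~ one, xlim = lims, ylim = lims, asp = 1,
     xlab = "one", ylab = "four")
rug(one, side = 1)    # tick marks showing where points fall on the x-axis
rug(four, side = 2)   # tick marks showing where points fall on the y-axis
abline(0, 1)          # equal ratings; points above the line rate "four" sweeter
```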

SLIDE 10

Plots for bivariate data

[Figure: resistance (ohms) against apparent juice content (%)]

SLIDE 11

Plots for bivariate data

[Figure: resistance (ohms) against apparent juice content (%)]

SLIDE 12

Plots for bivariate data

[Figure: resistance (ohms) against apparent juice content (%)]

SLIDE 13

Plots for bivariate data

[Figure: two panels. Left: brain against body on the original scales; right: log(brain) against log(body)]
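A sketch of the two panels, assuming a body/brain weight dataset such as `Animals` from the MASS package (body in kg, brain in g); whether this is the dataset behind the slide is an assumption. On the raw scales a few very large animals squeeze everything else into a corner, and taking logs of both variables spreads the points out.

```r
library(MASS)  # assumption: MASS::Animals stands in for the slide's data

par(mfrow = c(1, 2))
# Raw scales: the largest animals dominate the plot.
plot(brain ~ body, data = Animals)
# Log scales on both axes: the relationship becomes visible.
plot(log(brain) ~ log(body), data = Animals,
     xlab = "log(body)", ylab = "log(brain)")
```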

SLIDE 14

Clustering

  • Heights in inches of the singers in the New York Choral Society in 1979

[Figure: density of height (inches), all singers together, horizontal axis from 60 to 80]

SLIDE 15

Clustering

[Figure: density of height (inches) in eight panels, one per voice part: Bass 2, Bass 1, Tenor 2, Tenor 1, Alto 2, Alto 1, Soprano 2, Soprano 1]
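A sketch of both views using the lattice package, assuming its `singer` data frame (columns `height` and `voice.part`) is the dataset described here. The pooled density hides the groups; conditioning on voice part gives one panel per group.

```r
library(lattice)  # assumption: lattice's `singer` data match the slide

# All singers pooled into a single density estimate.
print(densityplot(~ height, data = singer, xlab = "Height (inches)"))

# One panel per voice part, which separates the clusters.
print(densityplot(~ height | voice.part, data = singer,
                  xlab = "Height (inches)"))
```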

SLIDE 16

Outliers

  • Require special treatment
  • Could be highly influential in subsequent modeling

  • Could suggest new understanding

xkcd.com

SLIDE 17

Conditioning plots

  • Earthquake data from a location near Fiji
  • Depth in km
  • Data since 1964

[Figure: conditioning plot of lat against long, given depth (100 to 600 km); longitude from 165 to 185, latitude from -35 to -15]
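A sketch of a conditioning plot in base R, assuming the slide uses the `quakes` data frame that ships with R (seismic events near Fiji since 1964). `coplot()` splits the conditioning variable into overlapping intervals and draws one scatterplot per interval; `co.intervals()` shows which intervals it would use.

```r
# `quakes` has columns lat, long, depth, mag, stations.
# One lat-vs-long panel per depth interval.
coplot(lat ~ long | depth, data = quakes)

# The six overlapping depth intervals behind the default panels.
iv <- co.intervals(quakes$depth, number = 6, overlap = 0.5)
print(iv)
```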

SLIDE 18

Scatterplot matrix
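The slide's figure is not recoverable, so this is only an illustration of the idea: `pairs()` draws every pairwise scatterplot of the chosen columns in one grid. The choice of the built-in `quakes` data and of these four columns is arbitrary.

```r
# Arbitrary example data: four numeric columns from the built-in `quakes`.
vars <- quakes[, c("lat", "long", "depth", "mag")]

# Each off-diagonal cell plots one variable against another.
pairs(vars)
```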

SLIDE 19

Sparklines

[Figure: sparklines of EU daily closing price indices, 1998, each labelled with its range: DAX 4133-6186, SMI 6045-8412, CAC 2858-4388, FTSE 5014-6179]

SLIDE 20

(Sparklines R code)

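# Keep only the observations from 1998 onward
EU <- window( EuStockMarkets, start = 1998 )

# One row per index; wide left and right margins leave room for the labels
par( mfcol = c(4,1), mar = c(1,5,1,8)+0.1, oma = c(2,0,0,0) )

for( i in 1:4 ){
  plot( EU[,i], axes = FALSE, xlab = "", ylab = "" )
  rr <- range( EU[,i] )
  mtext( paste( round(rr), collapse="-" ), 4, las = 1 )  # range, right margin
  mtext( colnames(EU)[i], 2, las = 1 )                   # index name, left margin
}

# Shared caption in the bottom outer margin
mtext("EU daily closing price indices: 1998",1,outer=TRUE, line=0)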

SLIDE 21

Summary statistics

  • Central tendency: mean, median, mode, …
  • Dispersion: standard deviation, IQR, range, …
  • Counts by group or category

[Figure: "Survival on the Titanic", counts by Sex (Male, Female), Age (Child, Adult), and Survived (No, Yes)]
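A sketch of these summaries in base R. The numeric example uses `quakes$depth` purely for convenience, and the counts use the built-in `Titanic` table. Drawing the counts as a mosaic plot is an assumption about the figure, not something the slide text confirms.

```r
depth <- quakes$depth  # any numeric vector would do; `quakes` ships with R

center <- c(mean = mean(depth), median = median(depth))
spread <- c(sd = sd(depth), IQR = IQR(depth))
rng    <- range(depth)  # minimum and maximum

# Counts by group: collapse the 4-way Titanic table to Sex x Survived.
by_group <- apply(Titanic, c("Sex", "Survived"), sum)
print(by_group)

# One way to display counts by several categories at once.
mosaicplot(~ Sex + Age + Survived, data = Titanic,
           main = "Survival on the Titanic", color = TRUE)
```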

SLIDE 22

The data analysis process

  • Moving from EDA into more directed data analysis, we begin to ask questions of the data
    – Questions motivated by scientific understanding
        • Testing hypotheses
        • Mechanism is important
    – Questions motivated by a goal to predict
        • Prediction performance is important
        • Mechanism is not necessarily important
SLIDE 23

Observational vs Experimental Data

  • Experimental data are the gold standard
    – Randomization allows isolation of effects
    – Caution about generalizing results
  • Observational data are abundant
    – Experiments are not always possible
    – Features and relationships are difficult or impossible to isolate

SLIDE 24

Data from surveys

  • Are we measuring what we think we are measuring?
    – Large field of research
    – Are we measuring the population of interest?
    – Non-response issues
    – Does the question measure what we are interested in?
  • e.g. Would like to know whether people support handgun ownership.
    – Poll people leaving a sporting goods store.
    – Ask: “Have you considered handgun ownership for self defense?”
SLIDE 25

Planning ahead

  • The best time to plan data analysis is before the data are collected
    – Preliminary data or data from another study can be used to design the analysis and experiment/survey
  • The reality is that we are often asked to do data analysis after the fact
    – Although EDA can be useful, it is important to ask directed questions of the data to avoid fishing expeditions
    – Sometimes, it is not possible to answer a given question using a given dataset without resorting to unreasonable assumptions

SLIDE 26

Stat 862 students

  • Reminder to see me this week about project alternative

  • “Proposal” due date is Monday October 6