Styles of data analysis
DAAG Chapter 2
Objectives
Learn the common tools of Exploratory Data Analysis
– Histograms, density plots, boxplots
– Scatterplots and scatterplot matrices
– Data summaries
Learn about what to look for and … wrong
– Outliers, skewness, clustering
– Non-linearity, heteroscedasticity
– Guiding principle: let the data speak for themselves
– Suggest new ideas or understandings
– Reveal problematic assumptions made before data collection
– Check on assumptions to be made in subsequent analysis
– Suggest future research questions or directions
[Figure: histograms of total length (cm) on frequency and density scales. Panel A: breaks at 72.5, 77.5, ...; panel B: breaks at 75, 80, ...]
[Figure: plot of total length (cm), axis from 75 to 95]
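The effect of the break points on a histogram can be explored with a short R sketch. The data below are simulated and purely hypothetical (they are not the measurements behind the panels); only the two families of break points come from the panel titles.

```r
# Hypothetical data: simulated lengths in cm, NOT the data used in the slides.
# Values are clamped to [73, 97] so that both sets of breaks cover them.
set.seed(42)
total_length <- pmin(pmax(rnorm(43, mean = 87, sd = 4), 73), 97)

# Two sets of break points, offset by 2.5 cm as in panels A and B
breaks_a <- seq(72.5, 97.5, by = 5)
breaks_b <- seq(70, 100, by = 5)

par(mfrow = c(2, 2))

# Frequency scale
h_a <- hist(total_length, breaks = breaks_a,
            main = "A: Breaks at 72.5, 77.5, ...", xlab = "Total length (cm)")
h_b <- hist(total_length, breaks = breaks_b,
            main = "B: Breaks at 75, 80, ...", xlab = "Total length (cm)")

# Density scale, with a kernel density estimate overlaid
hist(total_length, breaks = breaks_a, freq = FALSE,
     main = "A: Breaks at 72.5, 77.5, ...", xlab = "Total length (cm)")
lines(density(total_length))
hist(total_length, breaks = breaks_b, freq = FALSE,
     main = "B: Breaks at 75, 80, ...", xlab = "Total length (cm)")
lines(density(total_length))
```

The same observations can look more or less symmetric depending only on where the bins start, which is one reason to check a density plot alongside the histogram.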
– Milk sample with 1 unit of sweetener
– Milk sample with 4 units of sweetener
samples
points lie on the axis
“four” is sweeter
relationship between ratings
[Figure: sweetness ratings, axis label "four", both axes from 2 to 7]
[Figure: three plots of resistance (ohms) against apparent juice content (%)]
[Figure: brain against body, and log(brain) against log(body)]
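A log transformation of both axes can be sketched in R as follows. The use of the Animals data from the MASS package is an assumption: the slides do not name the dataset, although its body and brain columns match the axis labels.

```r
library(MASS)  # assumption: Animals (body in kg, brain in g) is a suitable stand-in

par(mfrow = c(1, 2))

# Untransformed scale: a few very large animals dominate the plot
plot(brain ~ body, data = Animals, xlab = "body", ylab = "brain")

# Log scale on both axes: the points spread out and a trend becomes visible
plot(log(brain) ~ log(body), data = Animals,
     xlab = "log(body)", ylab = "log(brain)")

# Correlation before and after the transformation
cor_raw <- with(Animals, cor(body, brain))
cor_log <- with(Animals, cor(log(body), log(brain)))
```

Comparing cor_raw with cor_log gives a rough numerical counterpart to what the two panels show.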
the singers in the New York Choral Society in 1979
[Figure: density of height (inches), overall and in separate panels by voice part: Bass 2, Bass 1, Tenor 2, Tenor 1, Alto 2, Alto 1, Soprano 2, Soprano 1]
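Density plots of this kind, one overall and one per group, can be sketched with the lattice package, whose singer data records heights of New York Choral Society singers. Whether the slides were produced with exactly these calls is an assumption.

```r
library(lattice)  # provides the singer data and densityplot()

# Overall density of height, ignoring voice part
p_all <- densityplot(~ height, data = singer, xlab = "Height (inches)")
print(p_all)

# One panel per voice part, sharing the same axes for easy comparison
p_by_part <- densityplot(~ height | voice.part, data = singer,
                         layout = c(2, 4), xlab = "Height (inches)")
print(p_by_part)
```

Splitting by voice part is a way to see structure that a single pooled density can hide.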
modeling
xkcd.com
from a location near Fiji
[Figure: conditioning plot of lat against long, given depth]
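A conditioning plot like this one can be sketched with coplot() in base R. The built-in quakes data (earthquake locations off Fiji) is assumed to be the dataset shown; the slides do not name it.

```r
# quakes ships with R: lat, long, depth, mag, stations
# Each panel shows lat against long for one (overlapping) range of depth
coplot(lat ~ long | depth, data = quakes)

# The depth ranges used for the panels can be inspected directly
depth_ranges <- co.intervals(quakes$depth, number = 6, overlap = 0.5)
```

Here number = 6 and overlap = 0.5 are the defaults that coplot() uses for a numeric conditioning variable.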
[Figure: EU daily closing price indices: 1998. One panel per index, labelled with its range: DAX 4133-6186, SMI 6045-8412, CAC 2858-4388, FTSE 5014-6179]
# Keep only the 1998 portion of the built-in EuStockMarkets series
EU <- window( EuStockMarkets, start = 1998 )
# Four stacked panels, with wide margins for the labels
par( mfcol = c(4,1), mar = c(1,5,1,8)+0.1, oma = c(2,0,0,0) )
for( i in 1:4 ){
  plot( EU[,i], axes = FALSE, xlab = "", ylab = "" )
  rr <- range( EU[,i] )
  # Range on the right, index name on the left
  mtext( paste( round(rr), collapse="-" ), 4, las = 1 )
  mtext( colnames(EU)[i], 2, las = 1 )
}
mtext( "EU daily closing price indices: 1998", 1, outer = TRUE, line = 0 )
mean, median, mode, …
standard deviation, IQR, range, …
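These summaries can be computed in R as sketched below. The vector x is hypothetical example data, and stat_mode is a helper written for this illustration (base R has no function for the statistical mode).

```r
# Hypothetical example data
x <- c(2, 4, 4, 5, 7, 9, 11)

# Illustrative helper: most frequent value (first one if there are ties)
stat_mode <- function(v) {
  tab <- table(v)
  as.numeric(names(tab)[which.max(tab)])
}

# Measures of centre
centre <- c(mean = mean(x), median = median(x), mode = stat_mode(x))

# Measures of spread
spread <- c(sd = sd(x), IQR = IQR(x), range = diff(range(x)))

print(centre)
print(spread)
```

The median and IQR are less affected by outliers than the mean and standard deviation, which is worth keeping in mind when the data are skewed.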
category
Survival on the Titanic
[Figure: plot of the variables Age (Child, Adult), Sex (Female, Male), and Survived (Yes, No)]
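A categorical summary of this kind can be sketched with the Titanic table that ships with R. Using a mosaic plot is an assumption about how the slide's figure was drawn.

```r
# Titanic is a 4-way table: Class, Sex, Age, Survived.
# Collapse over Class to keep Sex, Age, and Survived.
tab <- margin.table(Titanic, c(2, 3, 4))

# Tile areas are proportional to the counts in each cell
mosaicplot(tab, main = "Survival on the Titanic", color = TRUE)

# Proportion surviving within each sex
surv_by_sex <- prop.table(margin.table(Titanic, c(2, 4)), margin = 1)
print(round(surv_by_sex, 2))
```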
analysis, we begin to ask questions of the data
– Questions motivated by scientific understanding
– Questions motivated by a goal to predict
– Randomization allows isolation of effects
– Caution about generalizing results
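Random assignment of treatments can be sketched in R as follows. The 20 units and the two treatment labels are hypothetical and serve only to illustrate the mechanics.

```r
# Hypothetical experiment: 20 units, two treatments, 10 units each
set.seed(1)  # fixed seed so the assignment can be reproduced
units <- paste0("unit", 1:20)

# Shuffle a balanced vector of labels so that chance alone decides
# which unit receives which treatment
treatment <- sample(rep(c("control", "treated"), each = 10))

assignment <- data.frame(unit = units, treatment = treatment)
print(table(assignment$treatment))
```

Because the labels are shuffled at random, any systematic difference between the groups at the outset is due to chance rather than to the experimenter's choices.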
– Experiments are not always possible
– Features and relationships are difficult or impossible to isolate
think we are measuring?
– Large field of research
– Are we measuring the population
– Non-response issues
– Does the question measure what we are interested in?
people support handgun ownership.
– Poll people leaving a sporting goods store.
– Ask: “Have you considered handgun
data are collected
– Preliminary data or data from another study can be used to design the analysis and experiment/survey
analysis after the fact
– Although EDA can be useful, it is important to ask directed questions of the data to avoid fishing expeditions
– Sometimes, it is not possible to answer a given question using a given dataset without resorting to unreasonable assumptions
alternative