Styles of data analysis DAAG Chapter 2

Objectives
• Learn the common tools of Exploratory Data Analysis
  – Histograms, density plots, boxplots
  – Scatterplots and scatterplot matrices
  – Data summaries
• Learn about what to look for and what can go wrong
  – Outliers, skewness, clustering
  – Non-linearity, heteroscedasticity
• Be mindful of good statistical practice, overreaching, overfitting, …

What is the first rule of data analysis? Plot your data!

Exploratory data analysis
• Formalized by John Tukey
  – Guiding principle: let the data speak for themselves
• Why do EDA?
  – Suggest new ideas or understandings
  – Reveal problematic assumptions made before data collection
  – Check on assumptions to be made in subsequent analysis
  – Suggest future research questions or directions

Plots for a single variable
[Figure: two histograms of total length (cm), frequency scale. Panel A: breaks at 72.5, 77.5, ... Panel B: breaks at 75, 80, ...]
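
The effect of break placement can be reproduced with a minimal sketch. The measurements behind the slide are not included in this transcript, so `len` below is simulated stand-in data (an assumption); only the break positions follow the panel titles.

```r
# Simulated stand-in for the total-length measurements (cm); NOT the slide's data
set.seed(1)
len <- pmin(pmax(rnorm(43, mean = 87, sd = 4), 73), 97)

# Two sets of 5 cm bins that differ only in where the breaks fall
breaks_a <- seq(72.5, 97.5, by = 5)  # A: breaks at 72.5, 77.5, ...
breaks_b <- seq(70, 100, by = 5)     # B: breaks at 75, 80, ...

h_a <- hist(len, breaks = breaks_a, plot = FALSE)
h_b <- hist(len, breaks = breaks_b, plot = FALSE)

# Same data, same bin width, possibly a different apparent shape
op <- par(mfrow = c(1, 2))
plot(h_a, main = "A: Breaks at 72.5, 77.5, ...", xlab = "Total length (cm)")
plot(h_b, main = "B: Breaks at 75, 80, ...", xlab = "Total length (cm)")
par(op)
```

Comparing `h_a$counts` with `h_b$counts` shows how the same observations are allocated to bins under each choice.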

Plots for a single variable
[Figure: the same two histograms of total length (cm) on the density scale. Panel A: breaks at 72.5, 77.5, ... Panel B: breaks at 75, 80, ...]
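
A density-scale histogram can be sketched as follows, assuming the panels are histograms drawn with `freq = FALSE`. The data are again simulated stand-ins, and the overlaid kernel density estimate is an illustrative addition rather than something confirmed by the slide.

```r
# Simulated stand-in for the total-length measurements (cm); NOT the slide's data
set.seed(1)
len <- pmin(pmax(rnorm(43, mean = 87, sd = 4), 73), 97)

# freq = FALSE puts the histogram on the density scale (bar areas sum to 1)
h <- hist(len, breaks = seq(72.5, 97.5, by = 5), freq = FALSE,
          main = "A: Breaks at 72.5, 77.5, ...", xlab = "Total length (cm)")

# Overlay a kernel density estimate, which does not depend on break placement
lines(density(len))

# Total bar area on the density scale
bar_area <- sum(h$density * diff(h$breaks))
```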

Plots for a single variable
[Figure: total length (cm), axis from 75 to 95]

Plots for bivariate data
• Experiment with 17 tasters
  – Milk sample with 1 unit of sweetener
  – Milk sample with 4 units of sweetener
• Each person rated the sweetness of the two samples

Plots for bivariate data
• 1:1 plot ratio
• Rug shows where points lie on the axis
• Most people think “four” is sweeter
• There is a positive relationship between ratings
[Figure: scatterplot of the ratings, four against one, both axes running from 2 to 7]
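
A plot with these features might be drawn as below. The ratings are hypothetical values generated for illustration (the tasters' actual ratings are not in this transcript); `asp = 1` and `rug()` are the base R tools for the 1:1 ratio and the axis marks.

```r
# Hypothetical ratings for 17 tasters; NOT the experiment's actual data
set.seed(2)
one  <- round(runif(17, min = 2, max = 6), 1)          # 1 unit of sweetener
four <- pmin(one + round(runif(17, -0.5, 1.5), 1), 7)  # 4 units of sweetener

# asp = 1 gives the 1:1 plot ratio; shared limits make the axes comparable
lims <- range(c(one, four))
plot(one, four, asp = 1, xlim = lims, ylim = lims)
abline(0, 1, lty = 2)  # points above this line rated "four" as sweeter
rug(one, side = 1)     # marks where the points lie on the x-axis
rug(four, side = 2)    # and on the y-axis

# Share of tasters who rated "four" above "one"
prop_four_sweeter <- mean(four > one)
```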

Plots for bivariate data
[Figure: scatterplot of resistance (ohms) against apparent juice content (%)]

Plots for bivariate data
[Figure: two scatterplots. Left: brain against body. Right: log(brain) against log(body)]
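
The effect of the log transformation can be sketched with the `Animals` data from the MASS package (body weight in kg, brain weight in g). That this is the same dataset as on the slide is an assumption; it is used here only as a readily available stand-in.

```r
# Assumption: MASS::Animals stands in for the slide's brain/body data
data(Animals, package = "MASS")

op <- par(mfrow = c(1, 2))
# Raw scale: a few very large animals squash the rest into one corner
plot(brain ~ body, data = Animals)
# Log scale: the points spread out across the plotting region
plot(log(brain) ~ log(body), data = Animals)
par(op)

# Correlation before and after the transformation
cor_raw <- cor(Animals$body, Animals$brain)
cor_log <- cor(log(Animals$body), log(Animals$brain))
```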

Clustering
• Heights in inches of the singers in the New York Choral Society in 1979
[Figure: density plot of height for all singers]

Clustering
[Figure: density of height (inches) by voice part, one panel each for Soprano 1, Soprano 2, Alto 1, Alto 2, Tenor 1, Tenor 2, Bass 1, Bass 2]
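
A sketch of both views, pooled and by voice part, assuming the `singer` data frame in the lattice package (columns `height` and `voice.part`) corresponds to the data on these slides:

```r
library(lattice)  # a recommended package shipped with standard R installations

# Assumption: lattice's `singer` data match the slides
# All singers pooled into a single density estimate
plot(density(singer$height), main = "All singers", xlab = "height")

# One density panel per voice part
p <- densityplot(~ height | voice.part, data = singer,
                 layout = c(2, 4), xlab = "Height (inches)")
print(p)  # lattice objects must be printed explicitly inside scripts

# Median height by voice part
med_by_part <- tapply(singer$height, singer$voice.part, median)
```

Splitting by a grouping variable is one way to check whether apparent clusters in a pooled plot line up with a known factor.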

Outliers
• Require special treatment
• Could be highly influential in subsequent modeling
• Could suggest new understanding
[Image credit: xkcd.com]
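
To illustrate how a single outlier can influence a model, here is a minimal sketch on made-up data: ten points close to the line y = 2x plus one discordant point. The data, the use of `lm()`, and Cook's distance as the influence measure are all illustrative choices, not taken from the slides.

```r
# Hypothetical data: ten points close to y = 2x, plus one outlier at (20, 5)
x <- 1:10
y <- 2 * x + c(0.1, -0.2, 0.15, -0.05, 0, 0.1, -0.1, 0.05, -0.15, 0.1)
x_all <- c(x, 20)
y_all <- c(y, 5)

fit_clean <- lm(y ~ x)          # without the outlier
fit_all   <- lm(y_all ~ x_all)  # with the outlier

slope_clean <- unname(coef(fit_clean)[2])
slope_all   <- unname(coef(fit_all)[2])

# Cook's distance flags the observation with the most influence on the fit
most_influential <- unname(which.max(cooks.distance(fit_all)))

plot(x_all, y_all)
abline(fit_clean, lty = 1)  # fit without the outlier
abline(fit_all, lty = 2)    # fit with the outlier
```

Comparing `slope_clean` with `slope_all` shows how far one point can pull the fitted line.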

Conditioning plots
• Earthquake data from a location near Fiji
• Depth in km
• Data since 1964
[Figure: conditioning plot of lat against long, given depth]
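
A conditioning plot of this kind can be produced with base R's `coplot()`. The sketch assumes the built-in `quakes` dataset (earthquakes near Fiji since 1964, with `lat`, `long`, and `depth` in km) is the data the slide refers to.

```r
# `quakes` ships with R; assumed here to be the slide's earthquake data
coplot(lat ~ long | depth, data = quakes)

# The overlapping depth intervals used for conditioning can be inspected
# directly; number = 6 and overlap = 0.5 are coplot()'s defaults
depth_intervals <- co.intervals(quakes$depth, number = 6, overlap = 0.5)
```

Each panel shows the latitude/longitude scatter for one depth interval, so changes in the spatial pattern with depth become visible.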

Scatterplot matrix
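
The slide does not say which data were plotted, so the following sketch uses the built-in `quakes` data as an assumed example. `pairs()` is the base R function for a scatterplot matrix.

```r
# Assumption: built-in `quakes` data used as the example; the slide's data are not stated
vars <- c("lat", "long", "depth", "mag")
pairs(quakes[, vars])

# A matching numerical summary: pairwise correlations
cor_mat <- cor(quakes[, vars])
```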

Sparklines
[Figure: sparklines of EU daily closing price indices, 1998, each labeled with its range: DAX 4133-6186, SMI 6045-8412, CAC 2858-4388, FTSE 5014-6179]

(Sparklines R code)
# Keep observations from 1998 onward
EU <- window(EuStockMarkets, start = 1998)
# Four stacked panels; wide left and right margins leave room for the labels
par(mfcol = c(4, 1), mar = c(1, 5, 1, 8) + 0.1, oma = c(2, 0, 0, 0))
for (i in 1:4) {
  plot(EU[, i], axes = FALSE, xlab = "", ylab = "")
  rr <- range(EU[, i])
  # Range on the right, index name on the left
  mtext(paste(round(rr), collapse = "-"), 4, las = 1)
  mtext(colnames(EU)[i], 2, las = 1)
}
mtext("EU daily closing price indices: 1998", 1, outer = TRUE, line = 0)

Summary statistics
• Central tendency: mean, median, mode, …
• Dispersion: standard deviation, IQR, range, …
• Counts by group or category
[Figure: Survival on the Titanic, by Age (Child, Adult), Sex (Male, Female), and Survived (No, Yes)]
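
These summaries can be sketched with built-in datasets. Using `quakes$depth` for the numeric summaries is an arbitrary illustrative choice, and drawing the Titanic counts with `mosaicplot()` is an assumption about how the slide's figure was made.

```r
# Central tendency and dispersion for one numeric variable
depth  <- quakes$depth
centre <- c(mean = mean(depth), median = median(depth))
spread <- c(sd = sd(depth), IQR = IQR(depth), range = diff(range(depth)))

# Counts by group: collapse the built-in 4-way Titanic table to Sex x Survived
by_sex <- margin.table(Titanic, c(2, 4))

# Survival proportions within each sex (each row sums to 1)
prop_by_sex <- prop.table(by_sex, margin = 1)

# Age x Sex x Survived, the three factors named in the figure
mosaicplot(margin.table(Titanic, c(3, 2, 4)),
           main = "Survival on the Titanic")
```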

The data analysis process
• Moving from EDA into more directed data analysis, we begin to ask questions of the data
  – Questions motivated by scientific understanding
    • Testing hypotheses
    • Mechanism is important
  – Questions motivated by a goal to predict
    • Prediction performance is important
    • Mechanism is not necessarily important

Observational vs Experimental Data
• Experimental data are the gold standard
  – Randomization allows isolation of effects
  – Caution about generalizing results
• Observational data are abundant
  – Experiments are not always possible
  – Features and relationships are difficult or impossible to isolate
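
As a small illustration of randomization, the sketch below assigns hypothetical experimental units to two arms at random. The unit names, the group sizes, and the balanced two-arm design are assumptions made for the example.

```r
# Hypothetical experiment: 20 units, half assigned to each arm at random
set.seed(3)
units <- paste0("unit", 1:20)
arm   <- sample(rep(c("control", "treatment"), each = 10))
assignment <- data.frame(unit = units, arm = arm)

# Because assignment is random, unit characteristics cannot
# systematically determine who receives the treatment
arm_counts <- table(assignment$arm)
```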

Data from surveys
• Are we measuring what we think we are measuring?
  – Large field of research
  – Are we measuring the population of interest?
  – Non-response issues
  – Does the question measure what we are interested in?
    • e.g. Would like to know whether people support handgun ownership.
      – Poll people leaving a sporting goods store.
      – Ask: “Have you considered handgun ownership for self defense?”

Planning ahead
• The best time to plan data analysis is before the data are collected
  – Preliminary data or data from another study can be used to design the analysis and experiment/survey
• The reality is that we are often asked to do data analysis after the fact
  – Although EDA can be useful, it is important to ask directed questions of the data to avoid fishing expeditions
  – Sometimes, it is not possible to answer a given question using a given dataset without resorting to unreasonable assumptions

Stat 862 students
• Reminder to see me this week about project alternative
• “Proposal” due date is Monday October 6
