Exploratory Data Analysis
Paul Cohen, ISTA 370, Spring 2012


Outline
Data, revisited
The purpose of exploratory data analysis
Learning to see

Data: A Review / Things and Data


Data: A Review / Where Data Come From
Data are measurements of individuals (people, trees, countries, ecosystems...).
[Figure: an individual and that individual's data: 56 years old, 70" tall, 180 lbs, brown eyes, moderately presbyopic, good health, married, one child. These values feed into a data table.]

Exploratory Data Analysis / What is Exploratory Data Analysis (EDA)?
In terms of the Fundamental Model of Data, y = f(x, ε):
EDA infers which factors strongly and weakly influence y, and the functions that combine these factors.
EDA examines ε to see whether it contains evidence of other important but unmeasured ("hidden") factors.
Confirmatory studies test whether x really is a causal factor that influences y.
Exploratory studies are to confirmatory studies as test kitchens are to cookbook recipes.
EDA generally doesn't test hypotheses but, rather, "helps the data tell its story."
EDA helps you understand phenomena and suggests hypotheses to test in confirmatory studies.

Exploratory Data Analysis / What is Exploratory Data Analysis? Learning to See
Data have structure that is evidence of causal influences.
EDA uncovers, exposes, and clarifies this structure.
EDA is like hunting for fossils: it's a skill, and you must "learn to see" not only what's in front of you, but what lies within data.
EDA asks, "What do I see, and what does it mean?"
Like any other skill, EDA takes a lot of practice.

Exploratory Data Analysis / Load Some Data
> read.table.ISTA370<-function(filename){
    dataURL<-"http://www.sista.arizona.edu/~cohen/ISTA%20370/D   [URL truncated in source]
    # Reads a data frame from a URL path rooted at ISTA370 dat   [comment truncated in source]
    read.table(paste(dataURL,filename,sep=""))
  }
>
> # taheri<-read.table.ISTA370("taheri1.txt")
> # iris<-read.table.ISTA370("iris.txt")
> # heightC<-read.table.ISTA370("heightC.txt")
> # treering<-read.table.ISTA370("treering1.txt")
> # blast<-read.table.ISTA370("blastSummary.txt")
> # kinect<-read.table.ISTA370("onemovie.txt")
> # readability<-read.table.ISTA370("readability.txt")
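
The loader above joins a base location and a file name and hands the result to read.table. Because the course URL is truncated in this transcript, the following is only a minimal sketch of the same idea with the base location passed in as an argument. The names read_table_from, base, and demo.txt are hypothetical and are not part of the course code.

```r
# Sketch of a loader with the base location passed in explicitly.
# `read_table_from` and `base` are illustrative names (assumptions).
read_table_from <- function(base, filename, ...) {
  # paste0() joins the pieces with no separator, like paste(..., sep = "")
  read.table(paste0(base, filename), ...)
}

# Self-contained demonstration: write a small table to a temporary
# directory, then read it back through the same function.
demo_dir <- paste0(tempdir(), "/")
demo <- data.frame(height = c(70, 64, 68), weight = c(180, 130, 155))
write.table(demo, file = paste0(demo_dir, "demo.txt"))

loaded <- read_table_from(demo_dir, "demo.txt")
str(loaded)
```

read.table accepts both local paths and URLs, so the same function should work with a web address as base, provided the address ends with a slash.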

Learning to See / What Do You See? What Does it Mean?
> hist(iris$Petal.Length,col="grey",main=NULL)
[Figure: histogram of iris$Petal.Length; x-axis iris$Petal.Length (1 to 7), y-axis Frequency (0 to 30)]

Learning to See / What Do You See? What Does it Mean?
> ipl<-iris$Petal.Length
> hist(ipl,prob=TRUE,ylim=c(0,1),main=NULL)
> lines(density(ipl[iris$Species=="setosa"]),col="red")
> lines(density(ipl[iris$Species=="versicolor"]),col="green")
> lines(density(ipl[iris$Species=="virginica"]),col="blue")
[Figure: density-scaled histogram of ipl with one density curve per species; x-axis ipl (1 to 7), y-axis Density (0.0 to 1.0)]
Looking at density curves for each species, we see that the histogram did indeed indicate two or more separate populations (species).
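
The per-species overlay can be sketched as a loop rather than three separate calls. This version assumes R's built-in iris data frame (from the datasets package), which has the same Petal.Length and Species columns as the slide; the course file iris.txt may differ. The variables species, cols, setosa_max, and others_min are illustrative names.

```r
# Assumes the built-in `iris` data frame, not the course's iris.txt.
ipl <- iris$Petal.Length
hist(ipl, prob = TRUE, ylim = c(0, 1), main = NULL)

# One density curve per species, drawn over the histogram
species <- levels(iris$Species)
cols <- c("red", "green", "blue")
for (i in seq_along(species)) {
  lines(density(ipl[iris$Species == species[i]]), col = cols[i])
}
legend("topright", legend = species, col = cols, lty = 1)

# Numeric counterpart of the gap in the histogram: every setosa petal
# is shorter than every petal from the other two species.
setosa_max <- max(ipl[iris$Species == "setosa"])
others_min <- min(ipl[iris$Species != "setosa"])
print(c(setosa_max = setosa_max, others_min = others_min))
```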

Learning to See / What Do You See? What Does it Mean?
> boxplot(iris$Petal.Length~iris$Species, ylab="Petal.Length",xlab="Species")
[Figure: boxplots of Petal.Length (1 to 7) by Species: setosa, versicolor, virginica]
A boxplot by species confirms this, and summarizes the petal length statistics for each species.

Learning to See / Boxplots
[Figure: anatomy of a boxplot, labeled from top to bottom]
outliers
whisker (various interpretations)
upper quartile (75% quantile)
interquartile range
median
lower quartile (25% quantile)
whisker
outliers
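
As one concrete whisker interpretation, R's boxplot by default extends each whisker to the most extreme data point that lies within 1.5 times the box height of the box, and draws anything beyond that as an outlier. The sketch below uses boxplot.stats, which returns the same numbers the plot is drawn from; the vector x is made-up data for illustration only.

```r
# Made-up sample with one value far from the rest
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

bs <- boxplot.stats(x)  # default coef = 1.5, as in boxplot()

# bs$stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

# bs$out holds the points drawn individually as outliers
print(bs$out)
```

Here the upper whisker stops at 9, the largest value within 1.5 box heights of the upper hinge, and 30 is reported as an outlier. Note that boxplot.stats uses Tukey's hinges, which can differ slightly from the quartiles returned by quantile().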

Learning to See / Median, Quartiles, Interquartile Range
If you sort the values in a sample from lowest to highest, the median is the middle value, or the average of the two middle values when the sample contains an even number of points.
The median is the 50% quantile (the 50th percentile): the value that 50% of the values exceed.
The lower quartile is the 25% quantile, above which 75% of the values are found.
The upper quartile is the 75% quantile, above which 25% of the values are found.
The interquartile range is a measure of variability: the difference between the upper and lower quartiles.
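
These definitions can be checked on a small made-up sample. The sketch below uses R's median, quantile, and IQR functions. One caveat: quantile() supports several interpolation rules (its type argument), so quartiles of small samples can differ slightly between rules and between software packages; the default rule is used here.

```r
# Made-up sample with an even number of points
x <- c(12, 4, 9, 2, 7, 10, 4, 5)
sort(x)                     # 2 4 4 5 7 9 10 12

# Even count: the median is the average of the two middle values, 5 and 7
med <- median(x)

# Lower quartile, median, upper quartile (default interpolation rule)
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
print(q)

# The interquartile range is the upper quartile minus the lower quartile
iqr <- IQR(x)
print(iqr)
```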

Learning to See / Median, Quartiles, Interquartile Range
The median is robust against outliers; the mean is not.
Suppose 100 families in a neighborhood each make $40,000/year. When a millionaire making $1,000,000/year moves in, the mean jumps from $40,000 to about $49,505/year. What happens to the median?
Before the millionaire arrived, the variance in income was zero. Afterwards, the variance is over nine billion! What happens to the interquartile range?
Suppose you have a sorted sample of 9 elements; the median is the fifth element. If you add another element, what will the median be? By how many locations in the sorted distribution can the median shift?
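
The neighborhood example can be reproduced directly. This sketch assumes the millionaire's income is exactly $1,000,000/year, which is the value that yields the mean quoted on the slide; the variable names are illustrative.

```r
before <- rep(40000, 100)      # 100 families at $40,000/year
after  <- c(before, 1e6)       # assumption: newcomer makes $1,000,000/year

# The mean and variance react strongly to the single extreme value
print(mean(after))
print(c(var_before = var(before), var_after = var(after)))

# The median and interquartile range do not move at all
print(c(median_before = median(before), median_after = median(after)))
print(c(IQR_before = IQR(before), IQR_after = IQR(after)))

# Nine sorted elements: the median is the fifth. Adding a tenth element
# makes the median the average of the fifth and sixth sorted values.
s9  <- 1:9
s10 <- c(s9, 100)
print(c(median(s9), median(s10)))
```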

Learning to See / Symmetry and Skew
> with(blast, hist(Test0,breaks=20,col="grey",main=NULL))
> with(treering, hist(width,breaks=20,col="grey",main=NULL))
[Figure: two histograms. Left: Test0, x-axis 0.1 to 0.7, y-axis Frequency. Right: width, x-axis 40 to 120, y-axis Frequency.]
Test0 is skewed to the right, meaning it has a long tail of higher values, while the tree-ring width is nearly symmetric.
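
Since the blast and treering files are not reproduced here, the contrast between skewed and symmetric samples can be sketched with simulated data instead. The two samples below are synthetic stand-ins, not the course data, and the seed and distribution parameters are arbitrary choices. A long right tail pulls the mean above the median, while in a symmetric sample the two nearly coincide.

```r
set.seed(370)  # arbitrary seed so the sketch is repeatable

# Synthetic stand-ins (assumptions), not the blast or treering data
right_skewed <- rexp(1000, rate = 1)             # long tail of higher values
symmetric    <- rnorm(1000, mean = 80, sd = 15)  # roughly bell-shaped

par(mfrow = c(1, 2))
hist(right_skewed, breaks = 20, col = "grey", main = NULL)
hist(symmetric, breaks = 20, col = "grey", main = NULL)

# Gap between mean and median: clearly positive under right skew,
# close to zero for the symmetric sample
gap_skewed    <- mean(right_skewed) - median(right_skewed)
gap_symmetric <- mean(symmetric) - median(symmetric)
print(c(gap_skewed = gap_skewed, gap_symmetric = gap_symmetric))
```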

Learning to See / Transformations
> attach(blast)
> hist(Train0,breaks=20,col="grey",main=NULL)
> Train0Squared<-Train0^2 # square the Train0 data
> hist(Train0Squared,breaks=20,col="grey",main=NULL)
[Figure: two histograms. Left: Train0, x-axis 0.4 to 1.0, y-axis Frequency. Right: Train0Squared, x-axis 0.2 to 1.0, y-axis Frequency.]
A simple transformation amplifies an otherwise hidden feature.
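
One possible way to see why squaring can expose structure, offered as a sketch rather than as the slide's own explanation: for values between 0 and 1, squaring keeps the order of the values but stretches the spacing near 1 relative to the spacing near 0, so values crowded at the high end are pulled apart. The four values below are made up for illustration.

```r
# Made-up values in (0, 1]; the first and last gaps are both 0.1
x <- c(0.4, 0.5, 0.9, 1.0)

gaps_raw     <- diff(x)    # 0.10 0.40 0.10
gaps_squared <- diff(x^2)  # 0.09 0.56 0.19

# Squaring is increasing on non-negative values, so the order is unchanged
same_order <- identical(order(x), order(x^2))

print(rbind(gaps_raw, gaps_squared))
print(same_order)
```

After squaring, the gap between 0.9 and 1.0 is roughly twice the gap between 0.4 and 0.5, even though the two gaps were equal in the raw values.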

Learning to See / Transformations
> Train0Squared<-with(blast,Train0^2)
> with(blast,plot(density(Train0Squared)))
> with(blast,lines(density(Train0Squared[gender=="female"]),c   [line truncated in source]
> with(blast,lines(density(Train0Squared[gender=="male"]),col   [line truncated in source]
[Figure: density.default(x = Train0Squared); x-axis 0.0 to 1.0, y-axis Density (0.0 to 2.5); N = 187, Bandwidth = 0.04378]
Does gender explain the bump?
