CME/STATS 195 Lecture 5: Exploratory Data Analysis

Evan Rosenman
April 16, 2019

Contents

- Missing values
- Exploratory Data Analysis
- Variation
- Covariation
- Merging datasets
- Data Export

Handling missing values

Why does it matter?

- Many real datasets will be missing values for at least some variables for some observations.
- A single NA in a column can break your code!
- R isn’t always verbose about what is happening.

    x <- c(1, 2, 3, NA)
    mean(x)
    ## [1] NA

    x <- c(1, 2, 3, NA)
    hist(x)
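One common way to deal with the NA above, sketched with base R only: count the missing entries with is.na(), then ask mean() to skip them with na.rm = TRUE. The variable names n_missing and mean_without_na are illustrative, and whether dropping missing values is appropriate depends on why they are missing.

```r
x <- c(1, 2, 3, NA)

# Count the missing entries before deciding how to handle them
n_missing <- sum(is.na(x))

# na.rm = TRUE removes NA values before the mean is computed
mean_without_na <- mean(x, na.rm = TRUE)

print(n_missing)        # 1
print(mean_without_na)  # 2
```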

Missing values

Two types of missingness:

    library(tidyverse)

    stocks <- tibble(
      year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
      qtr    = c(   1,    2,    3,    4,    2,    3,    4),
      return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
    )

- The return for the fourth quarter of 2015 is explicitly missing.
- The return for the first quarter of 2016 is implicitly missing.

How we represent the data can make implicit values explicit.

    stocks %>% spread(year, return)
    ## # A tibble: 4 x 3
    ##     qtr `2015` `2016`
    ##   <dbl>  <dbl>  <dbl>
    ## 1     1   1.88  NA
    ## 2     2   0.59   0.92
    ## 3     3   0.35   0.17
    ## 4     4  NA      2.66

Gathering missing data

Recall the functions we learned from the tidyr package. You can use spread() and gather() to retain only non-missing records, i.e. to turn all explicit missing values into implicit ones.

    stocks %>%
      spread(year, return) %>%
      gather(year, return, `2015`:`2016`, na.rm = TRUE)
    ## # A tibble: 6 x 3
    ##     qtr year  return
    ## * <dbl> <chr>  <dbl>
    ## 1     1 2015    1.88
    ## 2     2 2015    0.59
    ## 3     3 2015    0.35
    ## 4     2 2016    0.92
    ## 5     3 2016    0.17
    ## 6     4 2016    2.66
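In tidyr 1.0.0 and later, spread() and gather() are superseded by pivot_wider() and pivot_longer(). The following is a sketch of the same round trip, assuming a recent tidyr is installed. The names wide and long are illustrative, and the row order of the result may differ from the gather() output.

```r
library(tidyr)
library(tibble)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Wide form: the implicit gap for 2016 Q1 becomes an explicit NA
wide <- pivot_wider(stocks, names_from = year, values_from = return)

# Back to long form, dropping every NA (explicit -> implicit)
long <- pivot_longer(
  wide,
  cols = c(`2015`, `2016`),
  names_to = "year",
  values_to = "return",
  values_drop_na = TRUE
)

print(wide)
print(long)
```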

Completing missing data

complete() takes a set of columns and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit NAs where necessary.

    stocks %>% complete(year, qtr)
    ## # A tibble: 8 x 3
    ##    year   qtr return
    ##   <dbl> <dbl>  <dbl>
    ## 1  2015     1   1.88
    ## 2  2015     2   0.59
    ## 3  2015     3   0.35
    ## 4  2015     4  NA
    ## 5  2016     1  NA
    ## 6  2016     2   0.92
    ## 7  2016     3   0.17
    ## 8  2016     4   2.66
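Once every combination is explicit, the gaps can be listed directly. A minimal sketch, assuming tidyr and dplyr are installed; the name gaps is illustrative.

```r
library(tidyr)
library(dplyr)
library(tibble)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Make every year/quarter combination explicit, then keep only the gaps
gaps <- stocks %>%
  complete(year, qtr) %>%
  filter(is.na(return))

print(gaps)
```

This returns both the value that was explicitly missing (2015 Q4) and the one that was implicitly missing (2016 Q1).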

Different interpretations of NA

Sometimes, when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:

    # tribble() constructs a tibble by filling by rows
    treatment <- tribble(
      ~person,            ~treatment, ~response,
      "Derrick Whitmore", 1,          7,
      NA,                 2,          10,
      NA,                 3,          9,
      "Katherine Burke",  1,          4
    )

You can fill in these missing values with fill():

    treatment %>% fill(person)
    ## # A tibble: 4 x 3
    ##   person           treatment response
    ##   <chr>                <dbl>    <dbl>
    ## 1 Derrick Whitmore         1        7
    ## 2 Derrick Whitmore         2       10
    ## 3 Derrick Whitmore         3        9
    ## 4 Katherine Burke          1        4

Exploratory data analysis

What is exploratory data analysis?

"There are no routine statistical questions, only questionable statistical routines." (Sir David Cox)

EDA is an iterative process:

1. Generate questions about your data.
2. Search for answers by visualising, transforming, and modelling data.
3. Use what you learn to refine your questions or generate new ones.

Ask many questions

Your goal during EDA is to develop an understanding of your data. EDA is fundamentally a creative process. And, like most creative processes, the key to asking quality questions is to generate a large quantity of questions.

Two types of questions will always be useful for making discoveries within your data:

1. What type of variation occurs within my variables?
2. What type of covariation occurs between my variables?

Some useful definitions

- Variable: a quantity, quality, or property that you can measure (often a column).
- Observation: a set of variable measurements made for a single unit (often a row).
- Value: the state of a variable when you measure it.
- Tabular data: a set of values, each associated with a variable and an observation.

Tabular data is “tidy” if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.

Example datasets: diamonds, nycflights13::flights.

    library(nycflights13)

EDA is not hypothesis testing!

- EDA involves asking many questions, generating new hypotheses, and finding interesting patterns in the data.
- This is very different from hypothesis testing (confirmatory data analysis), in which hypotheses are generated before seeing the data.
- Key idea: you should not use the same dataset to generate a hypothesis and to confirm the hypothesis!

Variation

Variation is the spread of values of a variable across measurements. A variable’s pattern of variation can reveal interesting information.

Recall the diamonds dataset. Use a bar chart to examine the distribution of a categorical variable, and a histogram to examine that of a continuous one.

    ggplot(data = diamonds) +
      geom_bar(mapping = aes(x = cut))

    ggplot(data = diamonds) +
      geom_histogram(mapping = aes(x = carat), binw
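The bar and histogram heights can also be computed by hand. A sketch, assuming dplyr and ggplot2 are installed (ggplot2 provides diamonds and cut_width()). The bin width of 0.5 is an assumption for illustration, not taken from the slide, and the names cut_counts and carat_counts are illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset and cut_width()

# Bar chart heights: one row per level of the categorical variable
cut_counts <- diamonds %>% count(cut)

# Histogram heights: bin carat into intervals of width 0.5 (assumed), then count
carat_counts <- diamonds %>% count(cut_width(carat, 0.5))

print(cut_counts)
print(carat_counts)
```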

Variation isn’t just about variance

    data <- tibble(x = rpois(5000, 1))
    var(data$x)
    ## [1] 1.025961
    ggplot(data = data) +
      geom_bar(aes(x = x))

    data <- tibble(x = rnorm(5000, sd = 1))
    var(data$x)
    ## [1] 0.9890639
    ggplot(data = data) +
      geom_histogram(aes(x = x))
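Both samples have variance close to 1 because a Poisson distribution with rate 1 and a normal distribution with sd = 1 both have variance 1, yet their shapes differ. A base-R sketch of the comparison; the seed is arbitrary and only there for reproducibility, and pois_x and norm_x are illustrative names.

```r
set.seed(195)  # arbitrary seed, for reproducibility only

pois_x <- rpois(5000, lambda = 1)
norm_x <- rnorm(5000, mean = 0, sd = 1)

# Nearly identical variances ...
print(var(pois_x))
print(var(norm_x))

# ... but very different distributions:
# the Poisson sample takes only non-negative integer values
print(table(pois_x))

# the normal sample is continuous and symmetric around 0
print(range(norm_x))
```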

Identifying typical values

- Which values are the most common? Why?
- Which values are rare? Why? Does that match your expectations?
- Do you see unusual patterns? What might explain them?

    diamonds %>%
      filter(carat < 3) %>%
      ggplot(aes(x = carat)) +
      geom_histogram(binwidth = 0.01)

Boxplots

A boxplot is a visual shorthand for the distribution of a continuous variable, broken down by category. It marks the distribution’s quartiles.

    ggplot(diamonds, aes(x = cut, y = carat)) +
      geom_boxplot()
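The quantities a boxplot marks can be computed directly. A base-R sketch on a small hypothetical vector, using the common 1.5 × IQR rule for flagging points beyond the whiskers (this is also geom_boxplot()'s documented default). All variable names here are illustrative.

```r
# Hypothetical sample with one unusually large value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

# Quartiles: the lower edge, middle line, and upper edge of the box
q <- quantile(x, probs = c(0.25, 0.5, 0.75))

# Interquartile range and the 1.5 * IQR fences
iqr <- unname(q[3] - q[1])
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[3]) + 1.5 * iqr

# Points outside the fences would be drawn individually
flagged <- x[x < lower_fence | x > upper_fence]

print(q)
print(c(lower_fence, upper_fence))
print(flagged)  # 50
```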

Identify outliers

Outliers are observations that are unusual: data points that don’t seem to fit the general pattern. Sometimes outliers are data entry errors; other times outliers suggest something important.

    ggplot(diamonds) +
      geom_histogram(mapping = aes(x = y), binwidth = 0.5)

    ggplot(diamonds) +
      geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
      coord_cartesian(ylim = c(0, 50))

Identifying outliers

    diamonds %>%
      filter(y < 3 | y > 20) %>%
      select(price, carat, x, y, z) %>%
      arrange(y)
    ## # A tibble: 9 x 5
    ##   price carat     x     y     z
    ##   <int> <dbl> <dbl> <dbl> <dbl>
    ## 1  5139  1     0      0    0
    ## 2  6381  1.14  0      0    0
    ## 3 12800  1.56  0      0    0
    ## 4 15686  1.2   0      0    0
    ## 5 18034  2.25  0      0    0
    ## 6  2130  0.71  0      0    0
    ## 7  2130  0.71  0      0    0
    ## 8  2075  0.51  5.15  31.8  5.12
    ## 9 12210  2     8.09  58.9  8.06

The y variable measures the length (in mm) of one of the three dimensions of a diamond. A diamond cannot have a dimension of 0 mm, and measurements of 31.8 mm and 58.9 mm are implausibly large for diamonds at these prices, so these are almost certainly data entry errors.

Addressing outlying values

When you encounter unusual values, you have two options:

1. Drop the entire row with the strange values:

    diamonds2 <- diamonds %>%
      filter(between(y, 3, 20))

2. Replace the unusual values with missing values:

    diamonds2 <- diamonds %>%
      mutate(y = ifelse(y < 3 | y > 20, NA, y))

ggplot2 will issue a warning when you plot with missing values.

Note the use of the function ifelse():

    ifelse(test, value.if.yes, value.if.no)
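Both options can be tried on a small vector first. A base-R sketch; the values in y are hypothetical, chosen to resemble the suspicious measurements, and y_clean and kept are illustrative names.

```r
# Hypothetical measurements in mm, including implausible values
y <- c(0, 4.2, 5.1, 31.8, 58.9)

# Option 2: replace unusual values with NA, keeping the vector length
y_clean <- ifelse(y < 3 | y > 20, NA, y)

# Option 1: drop unusual values; this is the condition between(y, 3, 20) tests
kept <- y[y >= 3 & y <= 20]

print(y_clean)                      # NA 4.2 5.1 NA NA
print(kept)                         # 4.2 5.1
print(mean(y_clean, na.rm = TRUE))  # 4.65
```

Replacing with NA keeps the other measurements in the same row available for later analysis, whereas filtering discards the whole observation.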

Covariation

Covariation is the tendency for the values of two or more variables to vary in a related way.

    ggplot(data = diamonds) +
      geom_point(aes(x = carat, y = price))

A neat trick for two continuous variables

    # install.packages("hexbin")
    ggplot(data = diamonds) +
      geom_hex(mapping = aes(x = carat, y = price)) +
      scale_y_log10() +
      scale_x_log10()

A categorical and a continuous variable

Use a boxplot or a violin plot to display the covariation between a categorical and a continuous variable. Violin plots give more information, as they show the entire estimated distribution.

    ggplot(mpg, aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy)) +
      geom_boxplot() +
      coord_flip()

    ggplot(mpg, aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy)) +
      geom_violin() +
      coord_flip()
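The ordering that reorder(class, hwy, FUN = median) applies to the plots can be inspected as a table. A sketch, assuming dplyr and ggplot2 are installed (ggplot2 provides mpg); class_medians and median_hwy are illustrative names.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Median highway mileage per vehicle class, sorted from lowest to highest
class_medians <- mpg %>%
  group_by(class) %>%
  summarise(median_hwy = median(hwy), n = n()) %>%
  arrange(median_hwy)

print(class_medians)
```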

Two categorical variables

To visualise the covariation between categorical variables, you need to count the number of observations for each combination, e.g. using geom_count():

    ggplot(data = diamonds) +
      geom_count(mapping = aes(x = cut, y = color))
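The counts behind geom_count() can also be computed explicitly and drawn as a heatmap. A sketch, assuming dplyr and ggplot2 are installed; using geom_tile() here is an alternative presentation, not the one on the slide, and combo_counts and p are illustrative names.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# One row per observed color/cut combination, with its count in column n
combo_counts <- diamonds %>% count(color, cut)

# Heatmap: the fill colour encodes the count for each combination
p <- ggplot(combo_counts, aes(x = color, y = cut)) +
  geom_tile(aes(fill = n))

print(combo_counts)
# print(p) would draw the plot
```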
