Exploratory data analysis - R for SAS Users - Melinda Higgins, PhD



SLIDE 1

Exploratory data analysis

R FOR SAS USERS

Melinda Higgins, PhD

Research Professor/Senior Biostatistician Emory University

SLIDE 2

Summary statistics

SLIDE 6

Summary statistics

    # Summary statistics of weight, height, bmi of daviskeep
    daviskeep %>%
      select(weight, height, bmi) %>%
      summary()

         weight          height           bmi
     Min.   : 39.0   Min.   :148.0   Min.   :15.82
     1st Qu.: 55.0   1st Qu.:164.0   1st Qu.:20.22
     Median : 63.0   Median :170.0   Median :21.80
     Mean   : 65.3   Mean   :170.6   Mean   :22.26
     3rd Qu.: 73.5   3rd Qu.:177.5   3rd Qu.:23.94
     Max.   :119.0   Max.   :197.0   Max.   :36.73

SLIDE 7

Descriptive statistics with Hmisc

    # Load Hmisc, run describe() for sex and bmi
    library(Hmisc)
    daviskeep %>%
      select(sex, bmi) %>%
      Hmisc::describe()

     2  Variables      199  Observations

    sex
           n  missing distinct
         199        0        2

    Value          F     M
    Frequency    111    88
    Proportion 0.558 0.442

    bmi
           n  missing distinct     Info     Mean      Gmd
         199        0      176        1    22.26    3.303

         .05      .10      .25      .50      .75      .90      .95
       18.05    18.84    20.22    21.80    23.94    26.30    27.25

    lowest : 15.82214 16.93703 17.09928 17.43285 17.50639
    highest: 29.73704 29.80278 30.09496 30.15916 36.72840

SLIDE 8

Descriptive statistics with psych

    # Load psych package, run psych::describe() for weight, height, bmi
    library(psych)
    daviskeep %>%
      select(weight, height, bmi) %>%
      psych::describe()

Result

           vars   n   mean    sd median trimmed   mad    min    max range skew kurtosis   se
    weight    1 199  65.30 13.34   63.0   64.12 11.86  39.00 119.00 80.00 0.91     0.84 0.95
    height    2 199 170.59  8.95  170.0  170.40 10.38 148.00 197.00 49.00 0.21    -0.38 0.63
    bmi       3 199  22.26  3.01   21.8   22.08  2.55  15.82  36.73 20.91 0.91     1.91 0.21

SLIDE 9

Specific statistic summaries

SLIDE 11

Specific statistic summaries - one variable

    # For height, get n, median, 5th and 95th percentiles, min, max
    daviskeep %>%
      summarise(nht = n(),
                medianht = median(height),
                pt05 = quantile(height, probs = 0.05),
                pt95 = quantile(height, probs = 0.95),
                minht = min(height),
                maxht = max(height))

Result

      nht medianht pt05 pt95 minht maxht
    1 199      170  157  185   148   197

SLIDE 13

Specific statistic summaries - multiple variables

    # For weight, height and bmi, get mean, standard deviation
    daviskeep %>%
      select(weight, height, bmi) %>%
      summarise_all(funs(mean, sd))

Result

      weight_mean height_mean bmi_mean weight_sd height_sd   bmi_sd
    1    65.29648    170.5879 22.25761  13.34346  8.948848 3.009239
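A note for readers on newer dplyr versions: funs() has been deprecated since dplyr 0.8.0, so the call above may emit a warning. The sketch below shows one possible equivalent using across(). The toy data frame is a made-up stand-in for daviskeep (an assumption for illustration, not the course data), and across() orders the result columns by variable rather than by function.

```r
library(dplyr)

# Hypothetical stand-in for daviskeep; the values are made up for illustration
toy <- tibble(
  weight = c(60, 72, 55, 80),
  height = c(165, 178, 160, 182),
  bmi    = c(22.0, 22.7, 21.5, 24.2)
)

# across() applies every function in the named list to every selected column;
# result columns are named <column>_<function>
toy_stats <- toy %>%
  summarise(across(c(weight, height, bmi),
                   list(mean = mean, sd = sd)))

toy_stats
```

Passing the same named list to summarise_all(), as in summarise_all(list(mean = mean, sd = sd)), is another option that keeps the column order shown in the slide.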

SLIDE 15

Summary statistics - by group

    # Get mean and sd for weight, height and bmi by sex group
    daviskeep %>%
      group_by(sex) %>%
      select(sex, weight, height, bmi) %>%
      summarise_all(funs(mean, sd))

    # A tibble: 2 x 7
      sex   weight_mean height_mean bmi_mean weight_sd height_sd bmi_sd
      <fct>       <dbl>       <dbl>    <dbl>     <dbl>     <dbl>  <dbl>
    1 F            56.9        165.     21.0      6.89      5.68   2.18
    2 M            75.9        178.     23.9     11.9       6.44   3.12

SLIDE 16

Let's summarise abalones!

SLIDE 17

Correlations and t-tests

SLIDE 18

Correlations: compare SAS and R

SLIDE 20

Correlations with psych package

    # Correlations with psych::corr.test()
    daviskeep %>%
      select(bmi, weight, height) %>%
      psych::corr.test()

    Call:psych::corr.test(x = .)
    Correlation matrix
            bmi weight height
    bmi    1.00   0.88   0.38
    weight 0.88   1.00   0.77
    height 0.38   0.77   1.00
    Sample Size
    [1] 199
    Probability values (Entries above the diagonal are adjusted for multiple tests.)
           bmi weight height
    bmi      0      0      0
    weight   0      0      0
    height   0      0      0

SLIDE 21

Scatterplot matrix: SAS and R

SLIDE 23

Scatterplot matrix - GGally::ggpairs() function

    # Matrix plot with GGally::ggpairs()
    daviskeep %>%
      select(bmi, weight, height) %>%
      GGally::ggpairs()

SLIDE 24

Scatterplot matrix - ggpairs by group

    # Color points by sex group
    daviskeep %>%
      select(bmi, weight, height, sex) %>%
      GGally::ggpairs(aes(color = sex))

SLIDE 25

Descriptive stats by group

No group counts

    # Get mean and sd for bmi by sex
    daviskeep %>%
      select(bmi, sex) %>%
      group_by(sex) %>%
      summarise_all(funs(mean, sd))

    # A tibble: 2 x 3
      sex    mean    sd
      <fct> <dbl> <dbl>
    1 F      21.0  2.18
    2 M      23.9  3.12

With group counts

    # Add n, get mean, sd for bmi by sex
    daviskeep %>%
      select(bmi, sex) %>%
      group_by(sex) %>%
      group_by(N = n(), add = TRUE) %>%
      summarise_all(funs(mean, sd))

    # A tibble: 2 x 4
    # Groups:   sex [?]
      sex       N  mean    sd
      <fct> <int> <dbl> <dbl>
    1 F       111  21.0  2.18
    2 M        88  23.9  3.12
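The add argument of group_by() was deprecated in dplyr 1.0.0 in favour of .add, so the second call above may warn on current installations. One alternative, sketched here under the assumption that only the count, mean and standard deviation are needed, is to compute n() directly inside summarise(). The toy data frame is hypothetical and only stands in for daviskeep.

```r
library(dplyr)

# Hypothetical stand-in for daviskeep; the values are made up for illustration
toy <- tibble(
  sex = factor(c("F", "F", "F", "M", "M")),
  bmi = c(20.1, 21.4, 22.0, 23.5, 25.1)
)

# n() counts the rows in each group, so no second group_by() is needed
toy_by_sex <- toy %>%
  group_by(sex) %>%
  summarise(N = n(), mean = mean(bmi), sd = sd(bmi))

toy_by_sex
```

This yields one row per sex group with the same N, mean and sd columns as the slide.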

SLIDE 26

T-tests: SAS and R

SLIDE 30

T-tests - check for equal variances

    # Perform equal variance test
    var.test(bmi ~ sex, data = daviskeep)

        F test to compare two variances

    data:  bmi by sex
    F = 0.48637, num df = 110, denom df = 87, p-value = 0.0003668
    alternative hypothesis: true ratio of variances is not equal to 1
    95 percent confidence interval:
     0.3244691 0.7221946
    sample estimates:
    ratio of variances
             0.4863699

SLIDE 31

T-tests - pooled and unpooled

    # UNPOOLED t-test bmi by sex
    t.test(bmi ~ sex, data = daviskeep)

        Welch Two Sample t-test

    data:  bmi by sex
    t = -7.5158, df = 149.45, p-value = 4.819e-12
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -3.716353 -2.169035
    sample estimates:
    mean in group F mean in group M
           20.95632        23.89901

    # POOLED t-test bmi by sex
    t.test(bmi ~ sex, data = daviskeep, var.equal = TRUE)

        Two Sample t-test

    data:  bmi by sex
    t = -7.8239, df = 197, p-value = 3.055e-13
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -3.684428 -2.200960
    sample estimates:
    mean in group F mean in group M
           20.95632        23.89901
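To make the difference between the two tests concrete, the sketch below computes both t statistics by hand and compares them with t.test(). It uses a small made-up data frame (a hypothetical stand-in, not daviskeep), so the numbers will not match the slide. The unpooled (Welch) version gives each group its own variance, while the pooled version combines them into a single estimate.

```r
# Hypothetical toy data standing in for daviskeep; the values are made up
toy <- data.frame(
  sex = factor(rep(c("F", "M"), times = c(6, 5))),
  bmi = c(19.8, 20.5, 21.1, 21.9, 20.2, 22.4,
          22.0, 24.8, 23.1, 27.5, 25.9)
)

f <- toy$bmi[toy$sex == "F"]
m <- toy$bmi[toy$sex == "M"]

# Unpooled (Welch): standard error built from each group's own variance
se_welch <- sqrt(var(f) / length(f) + var(m) / length(m))
t_welch  <- (mean(f) - mean(m)) / se_welch

# Pooled: a single variance estimate weighted by each group's degrees of freedom
sp2      <- ((length(f) - 1) * var(f) + (length(m) - 1) * var(m)) /
            (length(f) + length(m) - 2)
t_pooled <- (mean(f) - mean(m)) / sqrt(sp2 * (1 / length(f) + 1 / length(m)))

# Both hand-computed statistics should agree with t.test()
all.equal(t_welch,  unname(t.test(bmi ~ sex, data = toy)$statistic))
all.equal(t_pooled, unname(t.test(bmi ~ sex, data = toy, var.equal = TRUE)$statistic))
```

When the variance test rejects equal variances, as it does for bmi by sex on the previous slide, the unpooled result is usually the one to report.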

SLIDE 32

Let's explore bivariate relationships in abalones!

SLIDE 33

Categorical data: analyze and visualize

SLIDE 34

Collapse categories

    # Use table() inside with() for bmicat
    daviskeep %>%
      with(table(bmicat))

    bmicat
    1. underwt/norm       2. overwt        3. obese
                161              35               3

Add recoded variable bmigt25

    # Add one more categorical variable bmigt25
    daviskeep <- daviskeep %>%
      mutate(bmigt25 = ifelse(bmi > 25,
                              "2. overwt/obese",
                              "1. underwt/norm"))

    # View frequencies for bmigt25 categories
    daviskeep %>%
      with(table(bmigt25))

    bmigt25
    1. underwt/norm 2. overwt/obese
                161              38

SLIDE 35

Contingency tables: SAS and R

SLIDE 36

Chi-square tests: SAS and R

SLIDE 37

Contingency table and chi-square test

    # Save table output of bmigt25 by sex
    tablebmisex <- daviskeep %>%
      with(table(bmigt25, sex))
    tablebmisex

    # Use table object to run chisq.test
    chisq.test(tablebmisex)

                     sex
    bmigt25             F   M
      1. underwt/norm 107  54
      2. overwt/obese   4  34

        Pearson's Chi-squared test with Yates' continuity correction

    data:  tablebmisex
    X-squared = 36.759, df = 1, p-value = 1.336e-09

SLIDE 38

Chi-square tests with gmodels package

    # Load gmodels package
    library(gmodels)

    # Run gmodels::CrossTable(), show column %s and expected values
    daviskeep %>%
      with(gmodels::CrossTable(bmigt25, sex,
                               chisq = TRUE,
                               prop.r = FALSE,
                               prop.t = FALSE,
                               prop.chisq = FALSE,
                               expected = TRUE))

SLIDE 39

CrossTable output - part 1

       Cell Contents
    |-------------------------|
    |                       N |
    |              Expected N |
    |           N / Col Total |
    |-------------------------|

    Total Observations in Table:  199

                    | sex
            bmigt25 |         F |         M | Row Total |
    ----------------|-----------|-----------|-----------|
    1. underwt/norm |       107 |        54 |       161 |
                    |    89.804 |    71.196 |           |
                    |     0.964 |     0.614 |           |
    ----------------|-----------|-----------|-----------|
    2. overwt/obese |         4 |        34 |        38 |
                    |    21.196 |    16.804 |           |
                    |     0.036 |     0.386 |           |
    ----------------|-----------|-----------|-----------|
       Column Total |       111 |        88 |       199 |
                    |     0.558 |     0.442 |           |
    ----------------|-----------|-----------|-----------|

SLIDE 40

CrossTable output - part 2

gmodels::CrossTable() output - continued...

    Statistics for All Table Factors

    Pearson's Chi-squared test
    Chi^2 =  38.99402     d.f. =  1     p =  4.251066e-10

    Pearson's Chi-squared test with Yates' continuity correction
    Chi^2 =  36.75936     d.f. =  1     p =  1.336475e-09

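The expected counts and both chi-square statistics in the CrossTable output can be reproduced from the four observed counts alone. The following base R sketch re-enters the bmigt25 by sex counts from the slides by hand (the object names are chosen here for illustration) and applies the usual formulas: expected count = row total * column total / grand total, with 0.5 subtracted from each absolute difference for the Yates correction.

```r
# Observed counts copied from the bmigt25-by-sex table in the slides
observed <- matrix(c(107, 4, 54, 34), nrow = 2,
                   dimnames = list(bmigt25 = c("1. underwt/norm",
                                               "2. overwt/obese"),
                                   sex = c("F", "M")))

# Expected count for each cell: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square without and with the Yates continuity correction
chi2_pearson <- sum((observed - expected)^2 / expected)
chi2_yates   <- sum((abs(observed - expected) - 0.5)^2 / expected)

round(expected, 3)   # 89.804 and 71.196 in row 1, 21.196 and 16.804 in row 2
chi2_pearson         # about 38.994, the uncorrected statistic
chi2_yates           # about 36.759, the value chisq.test() reports by default
```

By default chisq.test() applies the continuity correction to 2 x 2 tables; passing correct = FALSE returns the uncorrected Pearson statistic instead.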

SLIDE 41

Mosaic plots: SAS and R

SLIDE 42

Mosaicplot of two-way categorical proportions

    # Make mosaicplot of bmigt25 by sex
    mosaicplot(bmigt25 ~ sex, data = daviskeep,
               color = c("light blue", "dark grey"),
               main = "BMI Categories by Sex")

SLIDE 43

Let's explore categorical associations for the abalones!