exploratory data analysis
Exploratory data analysis - R FOR SAS USERS - Melinda Higgins, PhD - PowerPoint PPT Presentation

Exploratory data analysis. R FOR SAS USERS. Melinda Higgins, PhD, Research Professor/Senior Biostatistician, Emory University.


  1. Exploratory data analysis. R FOR SAS USERS. Melinda Higgins, PhD, Research Professor/Senior Biostatistician, Emory University

  2. Summary statistics

  6. Summary statistics

         # Summary statistics of weight, height, bmi of daviskeep
         daviskeep %>%
           select(weight, height, bmi) %>%
           summary()

             weight          height           bmi
          Min.   : 39.0   Min.   :148.0   Min.   :15.82
          1st Qu.: 55.0   1st Qu.:164.0   1st Qu.:20.22
          Median : 63.0   Median :170.0   Median :21.80
          Mean   : 65.3   Mean   :170.6   Mean   :22.26
          3rd Qu.: 73.5   3rd Qu.:177.5   3rd Qu.:23.94
          Max.   :119.0   Max.   :197.0   Max.   :36.73

  7. Descriptive statistics with Hmisc

         # Load Hmisc, run describe() for sex and bmi
         library(Hmisc)
         daviskeep %>%
           select(sex, bmi) %>%
           Hmisc::describe()

          2  Variables      199  Observations
         -----------------------------------------------------
         sex
                n  missing distinct
              199        0        2

         Value          F     M
         Frequency    111    88
         Proportion 0.558 0.442
         -----------------------------------------------------
         bmi
                n  missing distinct     Info     Mean      Gmd
              199        0      176        1    22.26    3.303
              .05      .10      .25      .50      .75      .90      .95
            18.05    18.84    20.22    21.80    23.94    26.30    27.25

         lowest : 15.82214 16.93703 17.09928 17.43285 17.50639
         highest: 29.73704 29.80278 30.09496 30.15916 36.72840

  8. Descriptive statistics with psych

         # Load psych package, run psych::describe() for weight, height, bmi
         library(psych)
         daviskeep %>%
           select(weight, height, bmi) %>%
           psych::describe()

     Result

                vars   n   mean    sd median trimmed   mad    min    max range skew kurtosis   se
         weight    1 199  65.30 13.34   63.0   64.12 11.86  39.00 119.00 80.00 0.91     0.84 0.95
         height    2 199 170.59  8.95  170.0  170.40 10.38 148.00 197.00 49.00 0.21    -0.38 0.63
         bmi       3 199  22.26  3.01   21.8   22.08  2.55  15.82  36.73 20.91 0.91     1.91 0.21
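Several columns of the psych::describe() output follow directly from others: se is sd / sqrt(n), and range is max - min. The base R sketch below checks this using only the rounded values printed on the slide, so agreement is to two decimals at best.

```r
# Values copied from the psych::describe() output above (rounded as printed)
n  <- 199
sd_vals <- c(weight = 13.34, height = 8.95, bmi = 3.01)
min_vals <- c(weight = 39.00, height = 148.00, bmi = 15.82)
max_vals <- c(weight = 119.00, height = 197.00, bmi = 36.73)

# Standard error of the mean: sd / sqrt(n)
se <- round(sd_vals / sqrt(n), 2)

# Range: max - min
rng <- round(max_vals - min_vals, 2)

print(se)
print(rng)
```

Both results match the se and range columns shown on the slide.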

  9. Specific statistic summaries

  11. Specific statistic summaries - one variable

          # For height, get n, median, 5th and 95th percentiles, min, max
          daviskeep %>%
            summarise(nht = n(),
                      medianht = median(height),
                      pt05 = quantile(height, probs = 0.05),
                      pt95 = quantile(height, probs = 0.95),
                      minht = min(height),
                      maxht = max(height))

      Result

            nht medianht pt05 pt95 minht maxht
          1 199      170  157  185   148   197
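quantile() supports several percentile definitions through its `type` argument, and R's default is type 7. Statistical packages do not all share the same default, so a percentile computed elsewhere may differ slightly from R's for the same data. The sketch below uses made-up toy values (not the daviskeep heights) to show how much the definition can matter in a small sample.

```r
# Hypothetical toy data, not the daviskeep heights
x <- 1:10

# R's default definition (type = 7) interpolates between order statistics
q7 <- quantile(x, probs = 0.05)

# type = 2 uses the empirical distribution function, averaging at discontinuities
q2 <- quantile(x, probs = 0.05, type = 2)

print(q7)
print(q2)
```

With larger samples such as the 199 rows used on the slides, the definitions usually agree much more closely.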

  13. Specific statistic summaries - multiple variables

          # For weight, height and bmi, get mean, standard deviation
          daviskeep %>%
            select(weight, height, bmi) %>%
            summarise_all(funs(mean, sd))

      Result

            weight_mean height_mean bmi_mean weight_sd height_sd   bmi_sd
          1    65.29648    170.5879 22.25761  13.34346  8.948848 3.009239
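funs() has been deprecated in more recent dplyr releases. A minimal sketch of one equivalent, assuming dplyr 1.0 or later: across() with a named list of functions. The data frame `toy` is a made-up stand-in for daviskeep.

```r
library(dplyr)

# Hypothetical toy data standing in for daviskeep
toy <- data.frame(weight = c(60, 70, 80),
                  height = c(160, 170, 180))

# across() applies each named function to every selected column;
# result columns are named <column>_<function>
res <- toy %>%
  summarise(across(everything(), list(mean = mean, sd = sd)))

print(res)
```

Note that across() groups the result columns by variable (weight_mean, weight_sd, ...) rather than by statistic, so the column order differs from the slide's output.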

  15. Summary statistics - by group

          # Get mean and sd for weight, height and bmi by sex group
          daviskeep %>%
            group_by(sex) %>%
            select(sex, weight, height, bmi) %>%
            summarise_all(funs(mean, sd))

          # A tibble: 2 x 7
            sex   weight_mean height_mean bmi_mean weight_sd height_sd bmi_sd
            <fct>       <dbl>       <dbl>    <dbl>     <dbl>     <dbl>  <dbl>
          1 F            56.9        165.     21.0      6.89      5.68   2.18
          2 M            75.9        178.     23.9     11.9       6.44   3.12

  16. Let's summarise abalones!

  17. Correlations and t-tests. R FOR SAS USERS. Melinda Higgins, PhD, Research Professor/Senior Biostatistician, Emory University

  18. Correlations: compare SAS and R

  20. Correlations with psych package

          # Correlations with psych::corr.test()
          daviskeep %>%
            select(bmi, weight, height) %>%
            psych::corr.test()

          Call:psych::corr.test(x = .)
          Correlation matrix
                  bmi weight height
          bmi    1.00   0.88   0.38
          weight 0.88   1.00   0.77
          height 0.38   0.77   1.00
          Sample Size
          [1] 199
          Probability values (Entries above the diagonal are adjusted for multiple tests.)
                 bmi weight height
          bmi      0      0      0
          weight   0      0      0
          height   0      0      0
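The probability values print as 0 because they are rounded for display, not because they are exactly zero. The sketch below uses base R's cor.test() on simulated data (hypothetical, not daviskeep) to show how an unrounded p-value can be retrieved for a single pair of variables.

```r
set.seed(1)

# Simulated, strongly correlated variables (hypothetical data)
x <- rnorm(199)
y <- x + rnorm(199, sd = 0.5)

ct <- cor.test(x, y)        # Pearson correlation test for one pair
r  <- unname(ct$estimate)   # correlation coefficient
p  <- ct$p.value            # unrounded p-value

print(round(r, 2))
print(p)            # very small, but stored at full precision
print(round(p, 2))  # what a two-decimal display would show
```

When reporting results like those on the slide, such p-values are usually written as "< 0.001" rather than 0.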

  21. Scatterplot matrix: SAS and R

  23. Scatterplot matrix - GGally::ggpairs() function

          # Matrix plot with GGally::ggpairs()
          daviskeep %>%
            select(bmi, weight, height) %>%
            GGally::ggpairs()

  24. Scatterplot matrix - ggpairs by group

          # Color points by sex group
          daviskeep %>%
            select(bmi, weight, height, sex) %>%
            GGally::ggpairs(aes(color = sex))

  25. Descriptive stats by group

      No group counts

          # Get mean and sd for bmi by sex
          daviskeep %>%
            select(bmi, sex) %>%
            group_by(sex) %>%
            summarise_all(funs(mean, sd))

          # A tibble: 2 x 3
            sex    mean    sd
            <fct> <dbl> <dbl>
          1 F      21.0  2.18
          2 M      23.9  3.12

      With group counts

          # Add n, get mean, sd for bmi by sex
          daviskeep %>%
            select(bmi, sex) %>%
            group_by(sex) %>%
            group_by(N = n(), add = TRUE) %>%
            summarise_all(funs(mean, sd))

          # A tibble: 2 x 4
          # Groups:   sex [?]
            sex       N  mean    sd
            <fct> <int> <dbl> <dbl>
          1 F       111  21.0  2.18
          2 M        88  23.9  3.12
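As an alternative sketch (not the slide's approach), the group count can be computed in the same summarise() call as the mean and sd, which avoids the second group_by(). The data frame `toy` is hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for daviskeep
toy <- data.frame(sex = c("F", "F", "F", "M", "M"),
                  bmi = c(20, 21, 22, 24, 26))

# n() counts the rows in each group; mean() and sd() summarise bmi
res <- toy %>%
  group_by(sex) %>%
  summarise(N = n(), mean = mean(bmi), sd = sd(bmi))

print(res)
```

Naming each statistic explicitly also keeps the output columns predictable when more variables are added later.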

  26. T-tests: SAS and R

  30. T-tests - check for equal variances

          # Perform equal variance test
          var.test(bmi ~ sex, data = daviskeep)

                  F test to compare two variances

          data:  bmi by sex
          F = 0.48637, num df = 110, denom df = 87, p-value = 0.0003668
          alternative hypothesis: true ratio of variances is not equal to 1
          95 percent confidence interval:
           0.3244691 0.7221946
          sample estimates:
          ratio of variances
                   0.4863699
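The F statistic is the ratio of the two group variances (F over M), and the two-sided p-value is twice the smaller tail area of the F distribution. The base R sketch below reproduces both from numbers printed on the slides. The standard deviations 2.18 and 3.12 come from the by-group summary and are rounded, so the recomputed ratio is only approximate.

```r
# Approximate ratio of variances from the rounded group sds (F: 2.18, M: 3.12)
approx_ratio <- (2.18 / 3.12)^2

# Values copied from the var.test() output above
f_ratio <- 0.4863699   # ratio of variances (F / M)
df1 <- 110             # n_F - 1 = 111 - 1
df2 <- 87              # n_M - 1 = 88 - 1

# Two-sided p-value: twice the smaller tail area of the F distribution
p_lower <- pf(f_ratio, df1, df2)
p_two_sided <- 2 * min(p_lower, 1 - p_lower)

print(round(approx_ratio, 3))
print(signif(p_two_sided, 4))
```

A small p-value here is the reason to prefer the unpooled (Welch) t-test over the pooled one.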

  31. T-tests - pooled and unpooled

          # UNPOOLED t-test bmi by sex
          t.test(bmi ~ sex,
                 data = daviskeep)

                  Welch Two Sample t-test

          data:  bmi by sex
          t = -7.5158, df = 149.45, p-value = 4.819e-12
          alternative hypothesis: true difference in means is not equal to 0
          95 percent confidence interval:
           -3.716353 -2.169035
          sample estimates:
          mean in group F mean in group M
                 20.95632        23.89901

          # POOLED t-test bmi by sex
          t.test(bmi ~ sex,
                 data = daviskeep,
                 var.equal = TRUE)

                  Two Sample t-test

          data:  bmi by sex
          t = -7.8239, df = 197, p-value = 3.055e-13
          alternative hypothesis: true difference in means is not equal to 0
          95 percent confidence interval:
           -3.684428 -2.200960
          sample estimates:
          mean in group F mean in group M
                 20.95632        23.89901
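The two tests differ only in how the standard error and degrees of freedom are computed. The base R sketch below recomputes both from the group summaries on the slides. Because the standard deviations are the rounded values 2.18 and 3.12, the results match the t.test() output only approximately.

```r
# Group summaries copied from the slides (standard deviations are rounded)
n_f <- 111; mean_f <- 20.95632; sd_f <- 2.18
n_m <- 88;  mean_m <- 23.89901; sd_m <- 3.12
mean_diff <- mean_f - mean_m

# Pooled: one common variance estimate, df = n_f + n_m - 2
sp2 <- ((n_f - 1) * sd_f^2 + (n_m - 1) * sd_m^2) / (n_f + n_m - 2)
t_pooled <- mean_diff / sqrt(sp2 * (1 / n_f + 1 / n_m))
df_pooled <- n_f + n_m - 2

# Welch (unpooled): separate variances, Welch-Satterthwaite df
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m
t_welch <- mean_diff / sqrt(v_f + v_m)
df_welch <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

print(c(t_pooled = t_pooled, df_pooled = df_pooled))
print(c(t_welch = t_welch, df_welch = df_welch))
```

The Welch degrees of freedom are non-integer and smaller than 197, which is why the unpooled confidence interval is slightly wider.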

  32. Let's explore bivariate relationships in abalones!

  33. Categorical data: analyze and visualize. R FOR SAS USERS. Melinda Higgins, PhD, Research Professor/Senior Biostatistician, Emory University

  34. Collapse categories

          # Use table() inside with() for bmicat
          daviskeep %>%
            with(table(bmicat))

          bmicat
          1. underwt/norm       2. overwt        3. obese
                      161              35               3

      Add recoded variable bmigt25

          # Add one more categorical variable bmigt25
          daviskeep <- daviskeep %>%
            mutate(bmigt25 = ifelse(bmi > 25,
                                    "2. overwt/obese",
                                    "1. underwt/norm"))

          # View frequencies for bmigt25 categories
          daviskeep %>%
            with(table(bmigt25))

          bmigt25
          1. underwt/norm 2. overwt/obese
                      161              38
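Two details of the ifelse() recode are easy to miss: a BMI of exactly 25 falls in the lower category because the comparison is strict, and a missing BMI stays missing. The base R sketch below illustrates both with made-up values (not daviskeep).

```r
# Hypothetical BMI values, including a boundary case and a missing value
bmi <- c(18.2, 25.0, 25.1, 31.4, NA)

# Same recode rule as on the slide: strictly greater than 25 is "overwt/obese"
bmigt25 <- ifelse(bmi > 25, "2. overwt/obese", "1. underwt/norm")

print(bmigt25)

# table() drops NA by default; useNA = "ifany" makes missing values visible
print(table(bmigt25, useNA = "ifany"))
```

On the slide, the recoded counts are consistent with the original categories: 35 overweight plus 3 obese gives the 38 in "2. overwt/obese".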

  35. Contingency tables: SAS and R

  36. Chi-square tests: SAS and R

  37. Contingency table and chi-square test

          # Save table output of bmigt25 by sex
          tablebmisex <- daviskeep %>%
            with(table(bmigt25, sex))
          tablebmisex

                           sex
          bmigt25             F   M
            1. underwt/norm 107  54
            2. overwt/obese   4  34

          # Use table object to run chisq.test
          chisq.test(tablebmisex)

                  Pearson's Chi-squared test with Yates' continuity correction

          data:  tablebmisex
          X-squared = 36.759, df = 1, p-value = 1.336e-09
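The test can be reproduced without the original data by entering the four cell counts from the slide as a matrix. This base R sketch also shows the uncorrected Pearson statistic for comparison, since chisq.test() applies Yates' continuity correction to 2 x 2 tables by default. The object names are illustrative.

```r
# Counts copied from the tablebmisex output above (filled column by column)
tab <- matrix(c(107, 4, 54, 34), nrow = 2,
              dimnames = list(bmigt25 = c("1. underwt/norm", "2. overwt/obese"),
                              sex = c("F", "M")))

# Default for a 2 x 2 table: Yates' continuity correction is applied
res_yates <- chisq.test(tab)

# correct = FALSE gives the uncorrected Pearson chi-square statistic
res_plain <- chisq.test(tab, correct = FALSE)

print(res_yates$statistic)
print(res_plain$statistic)
```

The corrected statistic agrees with the X-squared value on the slide, and the uncorrected one is somewhat larger.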

  38. Chi-square tests with gmodels package

          # Load gmodels package
          library(gmodels)

          # Run gmodels::CrossTable, show column %s and expected values
          daviskeep %>%
            with(gmodels::CrossTable(bmigt25, sex,
                                     chisq = TRUE,
                                     prop.r = FALSE,
                                     prop.t = FALSE,
                                     prop.chisq = FALSE,
                                     expected = TRUE))

  39. CrossTable output - part 1

             Cell Contents
          |-------------------------|
          |                       N |
          |              Expected N |
          |           N / Col Total |
          |-------------------------|

          Total Observations in Table:  199

                          | sex
                  bmigt25 |         F |         M | Row Total |
          ----------------|-----------|-----------|-----------|
          1. underwt/norm |       107 |        54 |       161 |
                          |    89.804 |    71.196 |           |
                          |     0.964 |     0.614 |           |
          ----------------|-----------|-----------|-----------|
          2. overwt/obese |         4 |        34 |        38 |
                          |    21.196 |    16.804 |           |
                          |     0.036 |     0.386 |           |
          ----------------|-----------|-----------|-----------|
             Column Total |       111 |        88 |       199 |
                          |     0.558 |     0.442 |           |
          ----------------|-----------|-----------|-----------|
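Each cell of the CrossTable output stacks three numbers: the observed count, the expected count under independence (row total times column total, divided by the grand total), and the column proportion. The base R sketch below recomputes the last two from the counts on the slide. The object names are illustrative.

```r
# Observed counts copied from the CrossTable output above
tab <- matrix(c(107, 4, 54, 34), nrow = 2,
              dimnames = list(bmigt25 = c("1. underwt/norm", "2. overwt/obese"),
                              sex = c("F", "M")))

# Expected N under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# N / Col Total: proportions within each column (margin = 2)
col_prop <- prop.table(tab, margin = 2)

print(round(expected, 3))
print(round(col_prop, 3))
```

Reading down the columns, about 3.6% of women and 38.6% of men fall in the overweight/obese group, which is the difference the chi-square test detects.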
