data frame manipulation group by summarize
play

Data frame manipulation: group_by , summarize somgen223.stanford.edu - PowerPoint PPT Presentation

group_by , summarize , factors Steve Bagley somgen223.stanford.edu 1 Data frame manipulation: group_by , summarize somgen223.stanford.edu 2 3.4 1 3 2 5 3.3 2 2 2 4 1.1 2 1 2 3 data_dir <-


  1. group_by , summarize , factors Steve Bagley somgen223.stanford.edu 1

  2. Data frame manipulation: group_by , summarize somgen223.stanford.edu 2

  3. 3.4 1 3 2 5 3.3 2 2 2 4 1.1 2 1 2 3 data_dir <- "https://somgen223.stanford.edu/data/" 2 6.6 1 2 1.6 1 1 1 1 < dbl > < dbl > < dbl > < dbl > diet weight time chick # A tibble: 5 x 4 (cw1 <- read_csv ( str_c (data_dir, "cw1.csv"))) 2 Set up cw1 somgen223.stanford.edu 3

  4. cw1 %>% distinct (diet) # A tibble: 2 x 1 diet < dbl > 1 1 2 2 Computing over groups • There are two different diets. • What is the mean weight of all the chicks on each diet? somgen223.stanford.edu 4

  5. cw1 %>% group_by (diet) %>% summarize (mean_weight = mean (weight)) # A tibble: 2 x 2 diet mean_weight < dbl > < dbl > 1 1 2.5 2 2 3.67 Computing the mean weight of each diet somgen223.stanford.edu 5

  6. group_by 2 3 2 1 2 1.1 4 2 1 2 3.3 5 2 3 2 6.6 cw1 %>% 3.4 2 diet weight group_by (diet) # A tibble: 5 x 4 # Groups: diet [2] chick 1 time < dbl > < dbl > < dbl > < dbl > 1 1 1 1 1.6 2 • This looks like the original data frame, except for the additional comment line: # Groups: ... , which is a record of the variables used to form groups. No analysis has happened yet. somgen223.stanford.edu 6

  7. summarize cw1 %>% group_by (diet) %>% summarize (mean_weight = mean (weight)) • summarize takes a grouped data frame and performs the specified operation separately for all the values in each group. • In this case, mean will get called 2 times, once on each subset of rows corresponding to each value of diet . • The results for each group are then combined into a single data frame with the final result. • Note that the result has one row for each group value. somgen223.stanford.edu 7

  8. cw1 %>% summarize (mean_weight = mean (weight)) # A tibble: 1 x 1 mean_weight < dbl > 1 3.20 summarize on an ungrouped data frame • Note also that you can use summarize on an ungrouped data frame: you’ll get one row of results. In this case, it will contain the overall mean weight (of all chicks). somgen223.stanford.edu 8

  9. 1 6.6 group_by (diet) %>% summarize (mean_weight = mean (weight), max_weight = max (weight)) # A tibble: 2 x 3 diet mean_weight max_weight < dbl > < dbl > < dbl > 1 cw1 %>% 2.5 3.4 2 2 3.67 Computing more than one summary at the same time • max(weight) will return the maximum value of the weight column. • Do not use max on the entire data frame: max(cw1) ! somgen223.stanford.edu 9

  10. Exercise: the range of weights • For each diet, compute the range of weights (max - min), and sort the result by the range. somgen223.stanford.edu 10

  11. < dbl > 5.5 group_by (diet) %>% summarize (weight_range = max (weight) - min (weight)) %>% arrange (weight_range) # A tibble: 2 x 2 diet weight_range < dbl > cw1 %>% 1 1 1.8 2 2 Answer: the range of weights somgen223.stanford.edu 11

  12. Exercise: max weight of each chick • For each chick, compute its maximum weight somgen223.stanford.edu 12

  13. cw1 %>% group_by (chick) %>% summarize (max_weight = max (weight)) # A tibble: 2 x 2 chick max_weight < dbl > < dbl > 1 1 3.4 2 2 6.6 Answer: max weight of each chick somgen223.stanford.edu 13

  14. 1 3 group_by (diet) %>% summarize (n_diet = n ()) # A tibble: 2 x 2 diet n_diet < dbl > < int > 1 cw1 %>% 2 2 2 How many chicks are on each diet? • The function n() returns the number of rows in a group. • group_by/summarize computes the number of rows in each group. somgen223.stanford.edu 14

  15. Exercise: How many measurements for each chick? • Compute the number of measurements (rows) for each chick. somgen223.stanford.edu 15

  16. cw1 %>% group_by (chick) %>% summarize (n_measurements = n ()) # A tibble: 2 x 2 chick n_measurements < dbl > < int > 1 1 2 2 2 3 Answer: How many measurements for each chick? somgen223.stanford.edu 16

  17. Factors somgen223.stanford.edu 17

  18. Defining factors • Factors are a powerful, but sometimes perplexing, way to work with discrete-valued data. • The possible values of a factor are drawn from a finite set of alternatives or categories. Factors are often used in graphics and analysis for grouping. • Example: encoding the sex of a human subject as either M or F and grouping by sex. • Example: encoding the names of the fifty US states and grouping by state. • Note that many measured values are better represented not as factors but as either integers (such as for counting) or floating-point (real-valued) numbers. Example: number of subjects, weight. • We will return to factors later in the course. somgen223.stanford.edu 18

  19. Reading • Read: 5 Data transformation | R for Data Science (sections 5.6 to 5.7) • Watch at least part of this video: Tidy Tuesday screencast: analyzing malaria incidence in R - YouTube (or another video from the same channel). somgen223.stanford.edu 19

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend