Data frame manipulation: group_by, summarize



SLIDE 1

group_by, summarize, factors

Steve Bagley

somgen223.stanford.edu 1

SLIDE 2

Data frame manipulation: group_by, summarize

SLIDE 3

Set up cw1

library(tidyverse)  # provides read_csv, str_c, and the dplyr verbs used below

data_dir <- "https://somgen223.stanford.edu/data/"
(cw1 <- read_csv(str_c(data_dir, "cw1.csv")))
# A tibble: 5 x 4
  chick  time  diet weight
  <dbl> <dbl> <dbl>  <dbl>
1     1     1     1    1.6
2     1     2     1    3.4
3     2     1     2    1.1
4     2     2     2    3.3
5     2     3     2    6.6

SLIDE 4

Computing over groups

cw1 %>%
  distinct(diet)
# A tibble: 2 x 1
   diet
  <dbl>
1     1
2     2

  • There are two different diets.
  • What is the mean weight of all the chicks on each diet?

SLIDE 5

Computing the mean weight of each diet

cw1 %>%
  group_by(diet) %>%
  summarize(mean_weight = mean(weight))
# A tibble: 2 x 2
   diet mean_weight
  <dbl>       <dbl>
1     1        2.5
2     2        3.67

SLIDE 6

group_by

cw1 %>%
  group_by(diet)
# A tibble: 5 x 4
# Groups:   diet [2]
  chick  time  diet weight
  <dbl> <dbl> <dbl>  <dbl>
1     1     1     1    1.6
2     1     2     1    3.4
3     2     1     2    1.1
4     2     2     2    3.3
5     2     3     2    6.6

  • This looks like the original data frame, except for the additional comment line "# Groups: ...", which is a record of the variables used to form groups. No analysis has happened yet.
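
One way to confirm that group_by only records the grouping is to inspect the grouped object directly. The sketch below is not from the slides: it rebuilds cw1 in memory (so it runs without the course URL) and uses the dplyr helpers is_grouped_df, group_vars, and n_groups; the name grouped is chosen here for illustration.

```r
library(dplyr)

# In-memory copy of the cw1 data shown on the "Set up cw1" slide
cw1 <- tibble(
  chick  = c(1, 1, 2, 2, 2),
  time   = c(1, 2, 1, 2, 3),
  diet   = c(1, 1, 2, 2, 2),
  weight = c(1.6, 3.4, 1.1, 3.3, 6.6)
)

grouped <- cw1 %>% group_by(diet)

is_grouped_df(grouped)  # is grouping metadata attached?
group_vars(grouped)     # which variables define the groups
n_groups(grouped)       # how many distinct groups there are

# The rows and columns themselves are unchanged; ungroup() drops the metadata
ungroup(grouped)
```

The values in grouped are the same as in cw1; only the attached grouping information differs.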

SLIDE 7

summarize

cw1 %>%
  group_by(diet) %>%
  summarize(mean_weight = mean(weight))

  • summarize takes a grouped data frame and performs the specified operation separately for all the values in each group.
  • In this case, mean will get called 2 times, once on each subset of rows corresponding to each value of diet.
  • The results for each group are then combined into a single data frame with the final result.
  • Note that the result has one row for each group value.
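
The split, apply, and combine steps described above can be sketched by hand with base R and compared against the group_by/summarize result. This is only an illustration of the idea, not how dplyr is implemented; cw1 is rebuilt in memory, and the names by_diet, manual, and via_dplyr are chosen here.

```r
library(dplyr)

# In-memory copy of the cw1 data shown on the "Set up cw1" slide
cw1 <- tibble(
  chick  = c(1, 1, 2, 2, 2),
  time   = c(1, 2, 1, 2, 3),
  diet   = c(1, 1, 2, 2, 2),
  weight = c(1.6, 3.4, 1.1, 3.3, 6.6)
)

# Split: one vector of weights per diet value
by_diet <- split(cw1$weight, cw1$diet)

# Apply: call mean once on each subset
manual <- sapply(by_diet, mean)

# The same computation with group_by/summarize, which also combines
# the per-group results into one data frame
via_dplyr <- cw1 %>%
  group_by(diet) %>%
  summarize(mean_weight = mean(weight))

# Both approaches should agree, one value per diet
all.equal(unname(manual), via_dplyr$mean_weight)
```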

SLIDE 8

summarize on an ungrouped data frame

cw1 %>%
  summarize(mean_weight = mean(weight))
# A tibble: 1 x 1
  mean_weight
        <dbl>
1        3.20

  • Note also that you can use summarize on an ungrouped data frame: you'll get one row of results. In this case, it will contain the overall mean weight (of all chicks).

SLIDE 9

Computing more than one summary at the same time

cw1 %>%
  group_by(diet) %>%
  summarize(mean_weight = mean(weight),
            max_weight = max(weight))
# A tibble: 2 x 3
   diet mean_weight max_weight
  <dbl>       <dbl>      <dbl>
1     1        2.5         3.4
2     2        3.67        6.6

  • max(weight) will return the maximum value of the weight column.
  • Do not call max on the entire data frame, as in max(cw1): that takes the maximum over every value in every column, not just weight.
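
A small sketch of the difference between max on a column and max on a whole data frame. It assumes an in-memory copy of cw1 and a hypothetical variant, cw_scaled (not from the slides), in which time is multiplied by 10 so that the largest value in the table no longer sits in the weight column.

```r
library(dplyr)

# In-memory copy of the cw1 data shown on the "Set up cw1" slide
cw1 <- tibble(
  chick  = c(1, 1, 2, 2, 2),
  time   = c(1, 2, 1, 2, 3),
  diet   = c(1, 1, 2, 2, 2),
  weight = c(1.6, 3.4, 1.1, 3.3, 6.6)
)

# Hypothetical variant: time scaled up so it exceeds every weight
cw_scaled <- cw1 %>% mutate(time = time * 10)

# Maximum of the weight column only
max(cw_scaled$weight)

# Maximum over all values in all (numeric) columns, here a time value
max(cw_scaled)
```

With the original cw1 the two calls happen to agree, because the largest value in the table is a weight; that coincidence is what makes the mistake easy to miss.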

SLIDE 10

Exercise: the range of weights

  • For each diet, compute the range of weights (max - min), and sort the result by the range.

SLIDE 11

Answer: the range of weights

cw1 %>%
  group_by(diet) %>%
  summarize(weight_range = max(weight) - min(weight)) %>%
  arrange(weight_range)
# A tibble: 2 x 2
   diet weight_range
  <dbl>        <dbl>
1     1          1.8
2     2          5.5

SLIDE 12

Exercise: max weight of each chick

  • For each chick, compute its maximum weight

SLIDE 13

Answer: max weight of each chick

cw1 %>%
  group_by(chick) %>%
  summarize(max_weight = max(weight))
# A tibble: 2 x 2
  chick max_weight
  <dbl>      <dbl>
1     1        3.4
2     2        6.6

SLIDE 14

How many chicks are on each diet?

cw1 %>%
  group_by(diet) %>%
  summarize(n_diet = n())
# A tibble: 2 x 2
   diet n_diet
  <dbl>  <int>
1     1      2
2     2      3

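  • The function n() returns the number of rows in a group.
  • group_by/summarize computes the number of rows in each group. Each row of cw1 is one measurement, so n_diet counts measurements on each diet rather than distinct chicks.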
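
Because n() counts rows, a chick that was weighed several times is counted several times. A hedged sketch, using an in-memory copy of cw1, that puts the row count next to a count of distinct chicks via dplyr's n_distinct; the column names n_rows and n_chicks are chosen here for illustration.

```r
library(dplyr)

# In-memory copy of the cw1 data shown on the "Set up cw1" slide
cw1 <- tibble(
  chick  = c(1, 1, 2, 2, 2),
  time   = c(1, 2, 1, 2, 3),
  diet   = c(1, 1, 2, 2, 2),
  weight = c(1.6, 3.4, 1.1, 3.3, 6.6)
)

counts <- cw1 %>%
  group_by(diet) %>%
  summarize(
    n_rows   = n(),               # measurements (rows) per diet
    n_chicks = n_distinct(chick)  # distinct chick ids per diet
  )
counts

# count() is a shorthand for group_by() followed by summarize(n = n())
cw1 %>% count(diet)
```

In this tiny data set each diet has a single chick, so n_chicks is 1 for both diets while n_rows is 2 and 3.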

SLIDE 15

Exercise: How many measurements for each chick?

  • Compute the number of measurements (rows) for each chick.

SLIDE 16

Answer: How many measurements for each chick?

cw1 %>%
  group_by(chick) %>%
  summarize(n_measurements = n())
# A tibble: 2 x 2
  chick n_measurements
  <dbl>          <int>
1     1              2
2     2              3

SLIDE 17

Factors

SLIDE 18

Defining factors

  • Factors are a powerful, but sometimes perplexing, way to work with discrete-valued data.
  • The possible values of a factor are drawn from a finite set of alternatives or categories. Factors are often used in graphics and analysis for grouping.
  • Example: encoding the sex of a human subject as either M or F and grouping by sex.
  • Example: encoding the names of the fifty US states and grouping by state.
  • Note that many measured values are better represented not as factors but as either integers (such as for counting) or floating-point (real-valued) numbers. Example: number of subjects, weight.

  • We will return to factors later in the course.
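
As a brief preview, here is a minimal base R sketch of the M/F example above. The vector sex and its values are made up for illustration and do not come from the course data.

```r
# Hypothetical data: sex recorded for four subjects
sex <- factor(c("M", "F", "F", "M"), levels = c("F", "M"))

levels(sex)      # the finite set of allowed categories
as.integer(sex)  # internally, each value is stored as an index into the levels
table(sex)       # counts per category, a simple form of grouping
```

Listing the levels explicitly fixes their order; without the levels argument, factor() sorts the distinct values it finds.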

SLIDE 19

Reading

  • Read: R for Data Science, chapter 5, "Data transformation" (sections 5.6 to 5.7).
  • Watch at least part of this video: "Tidy Tuesday screencast: analyzing malaria incidence in R" on YouTube (or another video from the same channel).
