ifelse, summarize/mutate, cumulative functions, lead/lag



SLIDE 1

ifelse, summarize/mutate, cumulative functions, lead/lag

Steve Bagley

somgen223.stanford.edu

SLIDE 2

How to create a new tibble from scratch

(new_df <- tibble(x = 1:3, label = c("a", "b", "c")))
# A tibble: 3 x 2
      x label
  <int> <chr>
1     1 a
2     2 b
3     3 c

  • Although you will most often create new data frames using read_csv, you can create one from scratch by providing arguments to the tibble function in the format name = vector.

  • Use this function to create small test examples.

SLIDE 3

ifelse

SLIDE 4

Replace some indicator value with NA

z <- c(1, 2, -999, 4)

  • Suppose that the value -999 has been used to represent a missing value. (Most computer languages do not have the equivalent of R’s NA, so out-of-bounds values are used instead.)

SLIDE 5

Replace -999 with NA

z
[1]    1    2 -999    4
ifelse(z == -999, NA, z)
[1]  1  2 NA  4

SLIDE 6

How ifelse works

z
[1]    1    2 -999    4
(flag <- z == -999)
[1] FALSE FALSE  TRUE FALSE
ifelse(flag, NA, z)
[1]  1  2 NA  4

  • ifelse takes three arguments: a test vector, here, flag, and two other vectors, here, NA and z.
  • It returns a vector with elements from either NA or z, depending on whether the corresponding element of flag is TRUE or FALSE.

flag   NA  z
FALSE  NA  1
FALSE  NA  2
TRUE   NA  -999
FALSE  NA  4
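
As a minimal sketch of the element-wise selection described above, the following base R snippet repeats the -999 replacement and then uses a second ifelse call to produce sign labels. The names cleaned and sign_label are illustrative and do not appear in the original slides.

```r
# Example vector from the slides; -999 stands in for a missing value.
z <- c(1, 2, -999, 4)

# Test vector: TRUE wherever the sentinel value appears.
flag <- z == -999

# Take NA where flag is TRUE and the matching element of z otherwise.
# (cleaned is an illustrative name, not from the slides.)
cleaned <- ifelse(flag, NA, z)
print(cleaned)

# The same pattern can produce labels: "+" for non-negative values, "-" otherwise.
# (sign_label is also an illustrative name.)
sign_label <- ifelse(z >= 0, "+", "-")
print(sign_label)
```

The single NA and the single-character labels are recycled to the length of the test vector, which is why one value can be supplied for either branch.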

SLIDE 7

Using ifelse to set the color of a point

[Figure: dot plot with one panel per site (Grand Rapids, Duluth, University Farm, Morris, Crookston, Waseca). x-axis: Difference in yield (1932 vs 1931). y-axis: Variety of barley (Svansota, No. 462, Manchuria, No. 475, Velvet, Peatland, Glabron, No. 457, Wisconsin No. 38, Trebi). Points are colored by the sign of the difference.]

SLIDE 8

Code for the barley plot

barley2 <- barley %>%
  spread(year, yield) %>%
  mutate(yield_diff = `1932` - `1931`)

ggplot(barley2,
       aes(x = yield_diff, y = variety,
           color = factor(ifelse(yield_diff >= 0, "+", "-"),
                          levels = c("-", "+")))) +
  geom_point() +
  scale_color_manual(values = c("red", "blue")) +
  xlab("Difference in yield (1932 vs 1931)") +
  ylab("Variety of barley") +
  facet_grid(rows = vars(site)) +
  theme(legend.position = "none") +
  theme(text = element_text(size = 9))

SLIDE 9

More about recoding/replacing values

SLIDE 10

How to replace value with NA

v <- c(1, 2, -999, 4)
na_if(v, -999)
[1]  1  2 NA  4

SLIDE 11

How to replace NA with another value

v2 <- c(1, 2, 3, 4, NA)
replace_na(v2, -999)
[1]    1    2    3    4 -999
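
The two recoding functions can also be used on a column inside mutate. The sketch below assumes the dplyr and tidyr packages are installed; the tibble readings and the names cleaned and restored are made up for illustration.

```r
library(dplyr)  # na_if, mutate, tibble
library(tidyr)  # replace_na

# Hypothetical data: -999 marks a missing measurement.
readings <- tibble(id = 1:4, value = c(1, 2, -999, 4))

# Turn the sentinel into a proper NA.
cleaned <- readings %>%
  mutate(value = na_if(value, -999))
print(cleaned)

# Go the other way: write NA back out as the sentinel.
restored <- cleaned %>%
  mutate(value = replace_na(value, -999))
print(restored)
```

Converting to NA first is usually the safer direction, since functions such as mean would otherwise treat -999 as a real measurement.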

SLIDE 12

Remove all rows containing NA anywhere

(missing_df <- read_csv(str_c(data_dir, "missing_df.csv")))
# A tibble: 10 x 3
      id weight group
   <dbl>  <dbl> <chr>
 1     1  0.114 a
 2     2  0.622 b
 3     3  0.609 a
 4     4 NA     b
 5     5  0.861 <NA>
 6     6  0.640 b
 7     7 NA     a
 8     8  0.233 b
 9     9  0.666 a
10    10  0.514 b

SLIDE 13

Remove all rows containing NA anywhere

missing_df %>% na.omit()
# A tibble: 7 x 3
     id weight group
  <dbl>  <dbl> <chr>
1     1  0.114 a
2     2  0.622 b
3     3  0.609 a
4     6  0.640 b
5     8  0.233 b
6     9  0.666 a
7    10  0.514 b

SLIDE 14

Remove all rows containing NA in specified columns

missing_df %>% filter(complete.cases(weight))
# A tibble: 8 x 3
     id weight group
  <dbl>  <dbl> <chr>
1     1  0.114 a
2     2  0.622 b
3     3  0.609 a
4     5  0.861 <NA>
5     6  0.640 b
6     8  0.233 b
7     9  0.666 a
8    10  0.514 b

SLIDE 15

Another way

missing_df %>% filter(complete.cases(weight, group))
# A tibble: 7 x 3
     id weight group
  <dbl>  <dbl> <chr>
1     1  0.114 a
2     2  0.622 b
3     3  0.609 a
4     6  0.640 b
5     8  0.233 b
6     9  0.666 a
7    10  0.514 b
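
A further option, sketched here on the assumption that the tidyr package is available, is drop_na, which removes rows with NA either anywhere or only in the named columns. The tibble small_df is a made-up stand-in for missing_df, and the result names are illustrative.

```r
library(dplyr)  # tibble, filter, %>%
library(tidyr)  # drop_na

# Hypothetical stand-in for missing_df: one NA in weight, one in group.
small_df <- tibble(
  id     = 1:4,
  weight = c(0.1, NA, 0.3, 0.4),
  group  = c("a", "b", NA, "b")
)

# Drop rows with NA in any column (comparable to na.omit).
no_na_anywhere <- small_df %>% drop_na()
print(no_na_anywhere)

# Drop rows with NA in weight only (comparable to filter(complete.cases(weight))).
no_na_weight <- small_df %>% drop_na(weight)
print(no_na_weight)

# The same single-column filter written with is.na.
no_na_weight2 <- small_df %>% filter(!is.na(weight))
print(no_na_weight2)
```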

SLIDE 16

summarize vs mutate on grouped data frames

SLIDE 17

summarize

cw <- read_csv(str_c(data_dir, "cw.csv"))
cw %>%
  group_by(diet) %>%
  summarize(mean_weight = mean(weight))
# A tibble: 4 x 2
   diet mean_weight
  <dbl>       <dbl>
1     1        123.
2     2        103.
3     3        143.
4     4        135.

SLIDE 18

mutate on a grouped data frame

cw %>%
  group_by(diet) %>%
  mutate(mean_weight = mean(weight))
# A tibble: 578 x 5
# Groups:   diet [4]
   weight  time chick  diet mean_weight
    <dbl> <dbl> <dbl> <dbl>       <dbl>
 1     42     0     1     2        103.
 2     51     2     1     2        103.
 3     59     4     1     2        103.
 4     64     6     1     2        103.
 5     76     8     1     2        103.
 6     93    10     1     2        103.
 7    106    12     1     2        103.
 8    125    14     1     2        103.
 9    149    16     1     2        103.
10    171    18     1     2        103.
# ... with 568 more rows

SLIDE 19

mutate on a grouped data frame

  • mutate on a grouped data frame adds a new column (or columns), with the values computed over the groups.
  • The result has the same number of rows as the original data frame.
  • This idiom is very useful for finding members of each group that meet some condition. summarize computes properties of the group, but collapses the data of the individuals in the group.
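
To illustrate the idiom of finding group members that meet a condition, here is a small sketch, assuming dplyr is installed. The tibble toy and its column names are invented for the example.

```r
library(dplyr)

# Hypothetical data: two groups with a few values each.
toy <- tibble(
  group = c("a", "a", "a", "b", "b"),
  value = c(1, 2, 6, 10, 20)
)

# mutate keeps every row and attaches the group mean to each one,
# so individual rows can be compared with their own group's mean.
above_mean <- toy %>%
  group_by(group) %>%
  mutate(group_mean = mean(value)) %>%
  filter(value > group_mean) %>%
  ungroup()
print(above_mean)
```

With summarize in place of mutate, only one row per group would remain and the individual values could no longer be compared with the mean.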

SLIDE 20

Exercise: largest difference from the mean weight

  • Find the chick with the largest difference from its mean weight.

SLIDE 21

Answer: largest difference from the mean weight

cw %>%
  group_by(chick) %>%
  mutate(mean_weight = mean(weight),
         weight_diff = abs(mean_weight - weight)) %>%
  ungroup() %>%
  filter(weight_diff == max(weight_diff))
# A tibble: 1 x 6
  weight  time chick  diet mean_weight weight_diff
   <dbl> <dbl> <dbl> <dbl>       <dbl>       <dbl>
1    373    21    35     3        193.        180.

  • ungroup is the opposite of group_by, and removes the grouping from a data frame. We need to do this so that filter does not operate on groups. (Try leaving out the ungroup.)
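
The effect of leaving out ungroup can be seen on a small made-up example. This is a sketch assuming dplyr is installed; toy, with_diff, grouped_result, and overall_result are illustrative names.

```r
library(dplyr)

# Hypothetical weights for two chicks.
toy <- tibble(
  chick  = c(1, 1, 1, 2, 2, 2),
  weight = c(40, 50, 120, 40, 50, 90)
)

# Distance of each weight from that chick's own mean.
with_diff <- toy %>%
  group_by(chick) %>%
  mutate(weight_diff = abs(weight - mean(weight)))

# Still grouped: max() is taken within each chick, so one row per chick survives.
grouped_result <- with_diff %>%
  filter(weight_diff == max(weight_diff))
print(grouped_result)

# Ungrouped: max() is taken over the whole table, so only the overall largest survives.
overall_result <- with_diff %>%
  ungroup() %>%
  filter(weight_diff == max(weight_diff))
print(overall_result)
```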

SLIDE 22

Exercise: best diet

  • Which diet produced the largest growth for some chick?

SLIDE 23

First try at answer:

cw %>%
  group_by(chick) %>%
  ## summarize does not keep the diet
  summarize(weight_gain = max(weight) - min(weight))
# A tibble: 50 x 2
   chick weight_gain
   <dbl>       <dbl>
 1     1         163
 2     2         175
 3     3         163
 4     4         118
 5     5         182
 6     6         119
 7     7         264
 8     8          92
 9     9          58
10    10          83
# ... with 40 more rows

SLIDE 24

Partial answer:

cw %>%
  group_by(chick) %>%
  summarize(weight_gain = max(weight) - min(weight),
            ## Remember the first value of diet for each chick.
            ## Note that each chick only gets one diet.
            diet = first(diet))
# A tibble: 50 x 3
   chick weight_gain  diet
   <dbl>       <dbl> <dbl>
 1     1         163     2
 2     2         175     2
 3     3         163     2
 4     4         118     2
 5     5         182     2
 6     6         119     2
 7     7         264     2
 8     8          92     2
 9     9          58     2
10    10          83     2
# ... with 40 more rows

SLIDE 25

Complete answer:

cw %>%
  group_by(chick) %>%
  summarize(weight_gain = max(weight) - min(weight),
            diet = first(diet)) %>%
  filter(weight_gain == max(weight_gain))
# A tibble: 1 x 3
  chick weight_gain  diet
  <dbl>       <dbl> <dbl>
1    35         332     3

SLIDE 26

Another way, using multiple groups

cw %>%
  group_by(diet, chick) %>%
  ## This produces one row per chick within each diet. summarize then
  ## drops the last variable in the group_by (chick), so the result
  ## is still grouped by diet until ungroup is called.
  summarize(weight_gain = max(weight) - min(weight)) %>%
  ungroup() %>%
  filter(weight_gain == max(weight_gain))
# A tibble: 1 x 3
   diet chick weight_gain
  <dbl> <dbl>       <dbl>
1     3    35         332

  • This will be explained in greater detail later.
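
As a preview of that behavior, the sketch below checks which grouping remains after summarize on two grouping variables. It assumes dplyr 1.0.0 or later, where the .groups argument makes the default behavior explicit; toy, gains, and best are invented names.

```r
library(dplyr)

# Hypothetical data: three chicks on two diets, two weighings each.
toy <- tibble(
  diet   = c(1, 1, 1, 1, 2, 2),
  chick  = c(1, 1, 2, 2, 3, 3),
  weight = c(40, 100, 42, 90, 41, 150)
)

# One row per (diet, chick); "drop_last" removes only the chick grouping.
gains <- toy %>%
  group_by(diet, chick) %>%
  summarize(weight_gain = max(weight) - min(weight), .groups = "drop_last")

# The result is still grouped by diet.
print(group_vars(gains))

# After ungroup, filter compares every chick against the overall maximum.
best <- gains %>%
  ungroup() %>%
  filter(weight_gain == max(weight_gain))
print(best)
```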

SLIDE 27

Cumulative functions

SLIDE 28

cumsum and similar

delta_x <- c(1, 0, -1, 1, 1, -1, -1, -1, 1, 1)
cumsum(delta_x)
 [1]  1  1  0  1  2  1  0 -1  0  1

  • cumsum returns a vector with the cumulative sums: the sum of all the numbers up to and including that position.

  • In this example, we compute the location given the change in x at each step.
  • This is sometimes called the running sum.

SLIDE 29

Other cumulative functions

cumsum(delta_x)
 [1]  1  1  0  1  2  1  0 -1  0  1
cumall(cumsum(delta_x) >= 0)
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

  • This marks with TRUE all the positions where we have not yet moved to the left of the origin.
  • Other functions in this family: cumprod, cummin, cummax, cumany, cummean.
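
A few of the other cumulative functions can be tried on the same delta_x vector. This is a sketch: cummax and cummin are base R, while cumany comes from dplyr, which is assumed to be installed. The names position, furthest_right, furthest_left, and ever_left are illustrative.

```r
library(dplyr)  # cumany

# Step sizes from the slides and the resulting position after each step.
delta_x <- c(1, 0, -1, 1, 1, -1, -1, -1, 1, 1)
position <- cumsum(delta_x)

# Running maximum: the furthest right we have been so far.
furthest_right <- cummax(position)
print(furthest_right)

# Running minimum: the furthest left we have been so far.
furthest_left <- cummin(position)
print(furthest_left)

# TRUE from the first time we are left of the origin onward.
ever_left <- cumany(position < 0)
print(ever_left)
```

Here ever_left is the element-wise negation of the cumall result shown above, since cumall stays TRUE only until the first FALSE and cumany becomes TRUE at the first TRUE.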

SLIDE 30

lead and lag: using data from the previous or next line

SLIDE 31

Set up example

(dist <- tibble(time = 1:3, x = (1:3)^2))
# A tibble: 3 x 2
   time     x
  <int> <dbl>
1     1     1
2     2     4
3     3     9

SLIDE 32

Create a new column with the x value from the previous row

dist %>% mutate(prior_x = lag(x))
# A tibble: 3 x 3
   time     x prior_x
  <int> <dbl>   <dbl>
1     1     1      NA
2     2     4       1
3     3     9       4

## better:
dist %>% mutate(prior_x = lag(x, default = 0))
# A tibble: 3 x 3
   time     x prior_x
  <int> <dbl>   <dbl>
1     1     1       0
2     2     4       1
3     3     9       4

SLIDE 33

Compute velocity as change in x over time

dist %>%
  mutate(prior_x = lag(x, default = 0),
         delta_x = x - prior_x)
# A tibble: 3 x 4
   time     x prior_x delta_x
  <int> <dbl>   <dbl>   <dbl>
1     1     1       0       1
2     2     4       1       3
3     3     9       4       5
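
lead works the same way in the other direction, taking the value from the next row. The sketch below assumes dplyr is installed and reuses the dist example; next_x, forward_delta, and with_lead are illustrative names.

```r
library(dplyr)

# Same example as in the slides: position x at times 1 to 3.
dist <- tibble(time = 1:3, x = (1:3)^2)

# lead(x) shifts x up by one row, so each row sees the next row's value.
# The last row has no successor and gets NA.
with_lead <- dist %>%
  mutate(next_x = lead(x),
         forward_delta = next_x - x)
print(with_lead)
```

Unlike the lag example, no default is supplied here, so the final forward_delta is NA rather than a made-up value.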

SLIDE 34

Reading

  • Read: the dplyr vignette "Window functions"
