Advanced column-oriented methods: _all, _at, _if (Steve Bagley)

SLIDE 1

Advanced column-oriented methods: _all, _at, _if

Steve Bagley

somgen223.stanford.edu 1

SLIDE 2

Different ways to select columns

  • It is easy to use filter to select rows: the filter expressions can use the values in the columns that are specified by writing the column names.
  • To use select, we provide the column names.
  • What if we want to select columns based on some aspect of the column names?
  • What if we want to select columns based on the values in those columns, such as all columns that contain at least one NA value?
  • Somehow, we need to compute the identity of the desired columns.
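As a minimal sketch of what "computing the identity of the columns" can look like, the example below finds the names of all columns that contain at least one NA and then passes those names to select_at. The tibble `df` and its columns are hypothetical and are not part of the course data.

```r
library(dplyr)

# Hypothetical data, invented for this sketch (not from the course files).
df <- tibble(
  id     = 1:3,
  weight = c(0.5, NA, 0.7),
  group  = c("a", "b", NA)
)

# Compute the names of the columns that contain at least one NA value.
na_cols <- names(df)[sapply(df, function(x) any(is.na(x)))]
na_cols

# select_at accepts a character vector of column names.
with_na <- df %>%
  select_at(na_cols)
with_na
```

The same selection could probably be written more directly with select_if and a predicate on the column contents; the two-step version is shown only to make the computed names visible.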

SLIDE 3

select_at: when you can compute the names of the columns

gene_exp2 <- read_csv(str_c(data_dir, "gene_exp2.csv"))

## select by using the exact names
gene_exp2 %>%
  select(d1_g1, d1_g2)
# A tibble: 4 x 2
  d1_g1 d1_g2
  <dbl> <dbl>
1     1     3
2     3     4
3     6     6
4     2     5

## select by computing which columns match a pattern
gene_exp2 %>%
  select_at(vars(starts_with("d1")))
# A tibble: 4 x 2
  d1_g1 d1_g2
  <dbl> <dbl>
1     1     3
2     3     4
3     6     6
4     2     5

SLIDE 4

Functions you can use with select_at

From the documentation:

Function      Notes
starts_with   Starts with a prefix
ends_with     Ends with a suffix
contains      Contains a literal string
matches       Matches a regular expression
num_range     Matches a numerical range like x01, x02, x03
one_of        Matches variable names in a character vector
everything    Matches all variables
last_col      Select last variable, possibly with an offset
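The helpers in the table that the later slides do not demonstrate can be sketched as follows. The tibble `df` and its column names are hypothetical, chosen only so that each helper has something to match.

```r
library(dplyr)

# Hypothetical column names, invented to exercise the helpers.
df <- tibble(id = 1:2, x01 = 3:4, x02 = 5:6, y_mean = 7:8, y_sd = 9:10)

# matches() takes a regular expression: names ending in "_mean" or "_sd".
by_regex <- df %>%
  select_at(vars(matches("_(mean|sd)$"))) %>%
  names()
by_regex   # "y_mean" "y_sd"

# num_range() builds names from a prefix and numbers; width = 2 pads with zeros.
by_range <- df %>%
  select_at(vars(num_range("x", 1:2, width = 2))) %>%
  names()
by_range   # "x01" "x02"

# one_of() takes a character vector of column names.
wanted <- c("id", "y_sd")
by_vector <- df %>%
  select_at(vars(one_of(wanted))) %>%
  names()
by_vector  # "id" "y_sd"
```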

SLIDE 5

Example of select_at

gene_exp2 %>%
  select_at(vars(contains("_")))
# A tibble: 4 x 4
  d1_g1 d1_g2 d2_g1 d2_g2
  <dbl> <dbl> <dbl> <dbl>
1     1     3    10     1
2     3     4    20     1
3     6     6    30     1
4     2     5    40 12121

SLIDE 6

Example of select_at

gene_exp2 %>%
  select_at(vars(-contains("_"), last_col()))
# A tibble: 4 x 2
  gene   d2_g2
  <chr>  <dbl>
1 ABC123     1
2 ABC123     1
3 DEF234     1
4 DEF234 12121

  • vars accepts multiple specifications.

SLIDE 7

Example of select_at

gene_exp2 %>%
  select_at(vars("gene", starts_with("d1")))
# A tibble: 4 x 3
  gene   d1_g1 d1_g2
  <chr>  <dbl> <dbl>
1 ABC123     1     3
2 ABC123     3     4
3 DEF234     6     6
4 DEF234     2     5

  • Can use exact name of column, as a string.

SLIDE 8

Example of select_at

gene_exp2 %>%
  select_at(vars(1, ends_with("g2")))
# A tibble: 4 x 3
  gene   d1_g2 d2_g2
  <chr>  <dbl> <dbl>
1 ABC123     3     1
2 ABC123     4     1
3 DEF234     6     1
4 DEF234     5 12121

  • Can use the number of a column.

SLIDE 9

Example of select_at

gene_exp2 %>%
  select_at(vars(seq(from = 1, to = 5, by = 2)))
# A tibble: 4 x 3
  gene   d1_g2 d2_g2
  <chr>  <dbl> <dbl>
1 ABC123     3     1
2 ABC123     4     1
3 DEF234     6     1
4 DEF234     5 12121

  • This is useful if there is a regular pattern to the columns you want to keep.

SLIDE 10

select_if: when you use the contents of the columns

gene_exp2 %>%
  select_if(function(x) any(x > 10))
# A tibble: 4 x 3
  gene   d2_g1 d2_g2
  <chr>  <dbl> <dbl>
1 ABC123    10     1
2 ABC123    20     1
3 DEF234    30     1
4 DEF234    40 12121

  • The function is applied to the vector containing the contents of the column and returns TRUE to select that column.

SLIDE 11

The anonymous function

gene_exp2$d2_g2
[1]     1     1     1 12121
gene_exp2$d2_g2 > 10
[1] FALSE FALSE FALSE  TRUE
any(gene_exp2$d2_g2 > 10)
[1] TRUE

SLIDE 12

select_if (repeated)

gene_exp2 %>%
  select_if(function(x) any(x > 10))
# A tibble: 4 x 3
  gene   d2_g1 d2_g2
  <chr>  <dbl> <dbl>
1 ABC123    10     1
2 ABC123    20     1
3 DEF234    30     1
4 DEF234    40 12121

  • Why is the gene column selected? Hint: Is "ABC123" > "10"?
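Following the hint: when a character vector is compared with a number, R converts the number to a string and compares the strings, so the gene column can pass the test. One possible way to avoid this, shown as a sketch rather than as the course's own solution, is to check the column type inside the predicate. The tibble below is rebuilt by hand from the output printed on the slides, since gene_exp2.csv itself is not included here.

```r
library(dplyr)

# gene_exp2 rebuilt by hand from the printed slide output (assumption:
# the file contains exactly these five columns and four rows).
gene_exp2 <- tibble(
  gene  = c("ABC123", "ABC123", "DEF234", "DEF234"),
  d1_g1 = c(1, 3, 6, 2),
  d1_g2 = c(3, 4, 6, 5),
  d2_g1 = c(10, 20, 30, 40),
  d2_g2 = c(1, 1, 1, 12121)
)

# The number 10 is coerced to the string "10" and the comparison is
# done on strings, which is why the gene column was selected.
"ABC123" > 10

# Test the type first; && stops at is.numeric() for character columns,
# so the string comparison is never made.
numeric_over_10 <- gene_exp2 %>%
  select_if(function(x) is.numeric(x) && any(x > 10))
names(numeric_over_10)
```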

SLIDE 13

select_if: alternative function syntax

gene_exp2 %>%
  select_if(~any(. > 10))
# A tibble: 4 x 3
  gene   d2_g1 d2_g2
  <chr>  <dbl> <dbl>
1 ABC123    10     1
2 ABC123    20     1
3 DEF234    30     1
4 DEF234    40 12121

  • Inside a ~ function, . refers to the argument passed in, in this case, each column in succession.
  • This syntax is often a bit shorter than using function (...) ...

SLIDE 14

Same idea for mutate

SLIDE 15

mutate_at

mutate_at(gene_exp2, vars(ends_with("g2")), function(x) -x)
# A tibble: 4 x 5
  gene   d1_g1 d1_g2 d2_g1  d2_g2
  <chr>  <dbl> <dbl> <dbl>  <dbl>
1 ABC123     1    -3    10     -1
2 ABC123     3    -4    20     -1
3 DEF234     6    -6    30     -1
4 DEF234     2    -5    40 -12121

  • This will negate the values in all columns whose name ends with the string "g2".

SLIDE 16

mutate_if

mutate_if(gene_exp2, is.numeric, function(x) -x)
# A tibble: 4 x 5
  gene   d1_g1 d1_g2 d2_g1  d2_g2
  <chr>  <dbl> <dbl> <dbl>  <dbl>
1 ABC123    -1    -3   -10     -1
2 ABC123    -3    -4   -20     -1
3 DEF234    -6    -6   -30     -1
4 DEF234    -2    -5   -40 -12121

  • This will negate the values in all columns whose contents are numeric.

SLIDE 17

Same idea for rename

SLIDE 18

Replace spaces in column names

tibble(`this is a col name` = 3:4) %>%
  rename_all(~str_replace_all(., " ", "_"))
# A tibble: 2 x 1
  this_is_a_col_name
               <int>
1                  3
2                  4

SLIDE 19

Use all lower case in column names

tibble(Col1 = 1:2, Col2 = 3:4) %>%
  rename_all(~str_to_lower(.))
# A tibble: 2 x 2
   col1  col2
  <int> <int>
1     1     3
2     2     4
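The slides show only rename_all. By analogy with select_at and mutate_at, rename_at can restrict the renaming to a computed subset of columns. The sketch below uses a hypothetical one-row tibble and base R's toupper as the renaming function.

```r
library(dplyr)

# Hypothetical one-row tibble, invented for this sketch.
df <- tibble(gene = "ABC123", d1_g1 = 1, d1_g2 = 3, d2_g1 = 10)

# Upper-case only the names of the columns that start with "d1".
renamed <- df %>%
  rename_at(vars(starts_with("d1")), toupper)
names(renamed)   # "gene" "D1_G1" "D1_G2" "d2_g1"
```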

SLIDE 20

Same idea for summarize

SLIDE 21

Count number of NA values in each column

(m <- read_csv(str_c(data_dir, "missing_df.csv")))
# A tibble: 10 x 3
      id weight group
   <dbl>  <dbl> <chr>
 1     1  0.114 a
 2     2  0.622 b
 3     3  0.609 a
 4     4 NA     b
 5     5  0.861 <NA>
 6     6  0.640 b
 7     7 NA     a
 8     8  0.233 b
 9     9  0.666 a
10    10  0.514 b

m %>%
  summarize_all(~sum(is.na(.)))
# A tibble: 1 x 3
     id weight group
  <int>  <int> <int>
1     0      2     1

SLIDE 22

Summarize with mean

m %>%
  summarize_if(is.numeric, ~mean(., na.rm = TRUE))
# A tibble: 1 x 2
     id weight
  <dbl>  <dbl>
1   5.5  0.532

  • Summarize by computing the mean of all numeric columns, ignoring NAs.
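As an aside for readers on dplyr 1.0.0 or later (an assumption about the installed version, not something the slides cover): the _all, _at and _if variants still work but are marked as superseded in favor of across(). The sketch below compares the two spellings on a small hypothetical tibble.

```r
library(dplyr)

# Hypothetical data with missing values, invented for this sketch.
m <- tibble(
  id     = 1:4,
  weight = c(0.2, NA, 0.4, 0.6),
  group  = c("a", "b", NA, "b")
)

# Scoped verb, as on the slide.
old_style <- m %>%
  summarize_if(is.numeric, ~mean(., na.rm = TRUE))

# across() with where(), assuming dplyr >= 1.0.0.
new_style <- m %>%
  summarize(across(where(is.numeric), ~mean(.x, na.rm = TRUE)))

# Both give a one-row result with the columns id and weight.
all.equal(as.data.frame(old_style), as.data.frame(new_style))
```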

SLIDE 23

Setting column names

(d1 <- tibble(a = 1:2, b = 3:4))
# A tibble: 2 x 2
      a     b
  <int> <int>
1     1     3
2     2     4

(d2 <- set_names(d1, c("new_a", "new_b")))
# A tibble: 2 x 2
  new_a new_b
  <int> <int>
1     1     3
2     2     4

  • set_names can assign all the columns new names.
  • Remember to save the new frame.

SLIDE 24

Grouping over multiple columns

SLIDE 25

Get example dataset

(group_by_example <- read_csv(str_c(data_dir, "group_by_example.csv")))
# A tibble: 4 x 4
  Genotype Treatment gene     expression_value
  <chr>    <chr>     <chr>               <dbl>
1 Control  Memantine DYRK1A_N            0.504
2 Control  Memantine DYRK1A_N            0.515
3 Control  Saline    DYRK1A_N            0.592
4 Control  Saline    DYRK1A_N            0.590

  • This is part of the intermediate result from data_challenge_mouse_protein_expression.

SLIDE 26

Summarize by Treatment

group_by_example %>%
  group_by(Treatment) %>%
  summarize(mean_expression = mean(expression_value))
# A tibble: 2 x 2
  Treatment mean_expression
  <chr>               <dbl>
1 Memantine           0.509
2 Saline              0.591

SLIDE 27

Summarize by gene

group_by_example %>%
  group_by(gene) %>%
  summarize(mean_expression = mean(expression_value))
# A tibble: 1 x 2
  gene     mean_expression
  <chr>              <dbl>
1 DYRK1A_N           0.550

SLIDE 28

Summarize Treatment, gene

group_by_example %>%
  group_by(Treatment, gene) %>%
  summarize(mean_expression = mean(expression_value))
# A tibble: 2 x 3
# Groups:   Treatment [2]
  Treatment gene     mean_expression
  <chr>     <chr>              <dbl>
1 Memantine DYRK1A_N           0.509
2 Saline    DYRK1A_N           0.591

  • Note the result is grouped by Treatment.
  • If you summarize a grouped data frame, the last grouping variable (here gene) is dropped; the result stays grouped by the remaining ones.
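The remaining grouping can be inspected with group_vars and removed with ungroup. The sketch below rebuilds group_by_example by hand from the output printed on the slides, since the CSV file itself is not included here.

```r
library(dplyr)

# group_by_example rebuilt by hand from the printed slide output
# (assumption: the file contains exactly these four rows).
group_by_example <- tibble(
  Genotype         = "Control",
  Treatment        = c("Memantine", "Memantine", "Saline", "Saline"),
  gene             = "DYRK1A_N",
  expression_value = c(0.504, 0.515, 0.592, 0.590)
)

by_both <- group_by_example %>%
  group_by(Treatment, gene) %>%
  summarize(mean_expression = mean(expression_value))

# summarize dropped the last grouping variable (gene).
group_vars(by_both)            # "Treatment"

# ungroup removes whatever grouping is left.
flat <- ungroup(by_both)
group_vars(flat)               # character(0)
```

Ungrouping explicitly is a reasonable habit when later steps should treat the summary as a plain table, because a leftover grouping silently changes how subsequent mutate or summarize calls behave.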

SLIDE 29

Summarize gene, Treatment

group_by_example %>%
  group_by(gene, Treatment) %>%
  summarize(mean_expression = mean(expression_value))
# A tibble: 2 x 3
# Groups:   gene [1]
  gene     Treatment mean_expression
  <chr>    <chr>               <dbl>
1 DYRK1A_N Memantine           0.509
2 DYRK1A_N Saline              0.591

  • Note the result is grouped by gene.
