
Advanced column-oriented methods: _all, _at, _if (Steve Bagley)



  1. Advanced column-oriented methods: _all, _at, _if. Steve Bagley, somgen223.stanford.edu

  2. Different ways to select columns
     • It is easy to use filter to select rows: the filter expressions can use the values in the columns, which are specified by writing the column names.
     • To use select, we provide the column names.
     • What if we want to select columns based on some aspect of the column names?
     • What if we want to select columns based on the values in those columns, for example all columns that contain at least one NA value?
     • Somehow, we need to compute the identity of the desired columns.
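One way to make "computing the identity of the columns" concrete is to build a character vector of column names first and then pass it to select. The sketch below is only illustrative: the tibble `df` and its column names are made up for this example and are not part of the course data.

```r
library(dplyr)

# Hypothetical example data, not one of the course files.
df <- tibble(
  id   = 1:3,
  d1_a = c(1, NA, 3),
  d1_b = c(4, 5, 6),
  d2_a = c(NA, 8, 9)
)

# Names chosen by a property of the name itself (prefix "d1").
d1_cols <- names(df)[startsWith(names(df), "d1")]

# Names chosen by a property of the contents (at least one NA).
na_cols <- names(df)[vapply(df, anyNA, logical(1))]

# one_of() selects the columns named in a character vector.
df %>% select(one_of(d1_cols))
df %>% select(one_of(na_cols))
```

The scoped verbs covered in the rest of the deck (select_at, select_if) wrap this kind of computation so the intermediate name vector is not needed.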

  3. select_at: when you can compute the names of the columns

     gene_exp2 <- read_csv(str_c(data_dir, "gene_exp2.csv"))

     ## select by using the exact names
     gene_exp2 %>% select(d1_g1, d1_g2)
     # A tibble: 4 x 2
       d1_g1 d1_g2
       <dbl> <dbl>
     1     1     3
     2     3     4
     3     6     6
     4     2     5

     ## select by computing which columns match a pattern
     gene_exp2 %>% select_at(vars(starts_with("d1")))
     # A tibble: 4 x 2
       d1_g1 d1_g2
       <dbl> <dbl>
     1     1     3
     2     3     4
     3     6     6
     4     2     5

  4. Functions you can use with select_at
     From the documentation:

     Function      Notes
     starts_with   Starts with a prefix
     ends_with     Ends with a suffix
     contains      Contains a literal string
     matches       Matches a regular expression
     num_range     Matches a numerical range like x01, x02, x03
     one_of        Matches variable names in a character vector
     everything    Matches all variables
     last_col      Select last variable, possibly with an offset
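A few of the helpers in this table that the later slides do not exercise can be tried as follows. This is a sketch under one assumption: `gene_exp2` is rebuilt by hand from the output printed in these slides, whereas the course reads it from gene_exp2.csv.

```r
library(dplyr)

# gene_exp2 rebuilt from the printed slide output (an assumption;
# the course loads it with read_csv).
gene_exp2 <- tibble(
  gene  = c("ABC123", "ABC123", "DEF234", "DEF234"),
  d1_g1 = c(1, 3, 6, 2),
  d1_g2 = c(3, 4, 6, 5),
  d2_g1 = c(10, 20, 30, 40),
  d2_g2 = c(1, 1, 1, 12121)
)

# matches(): a regular expression applied to the column names.
by_regex <- gene_exp2 %>% select_at(vars(matches("^d[12]_g1$")))

# one_of(): names supplied as a character vector.
wanted <- c("gene", "d2_g2")
by_vector <- gene_exp2 %>% select_at(vars(one_of(wanted)))

# everything(): all columns, here used to move d2_g2 to the front.
reordered <- gene_exp2 %>% select_at(vars(d2_g2, everything()))

names(by_regex)
names(by_vector)
names(reordered)
```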

  5. Example of select_at

     gene_exp2 %>% select_at(vars(contains("_")))
     # A tibble: 4 x 4
       d1_g1 d1_g2 d2_g1 d2_g2
       <dbl> <dbl> <dbl> <dbl>
     1     1     3    10     1
     2     3     4    20     1
     3     6     6    30     1
     4     2     5    40 12121

  6. Example of select_at

     gene_exp2 %>% select_at(vars(-contains("_"), last_col()))
     # A tibble: 4 x 2
       gene   d2_g2
       <chr>  <dbl>
     1 ABC123     1
     2 ABC123     1
     3 DEF234     1
     4 DEF234 12121

     • vars accepts multiple specifications.

  7. Example of select_at

     gene_exp2 %>% select_at(vars("gene", starts_with("d1")))
     # A tibble: 4 x 3
       gene   d1_g1 d1_g2
       <chr>  <dbl> <dbl>
     1 ABC123     1     3
     2 ABC123     3     4
     3 DEF234     6     6
     4 DEF234     2     5

     • Can use the exact name of a column, as a string.

  8. Example of select_at

     gene_exp2 %>% select_at(vars(1, ends_with("g2")))
     # A tibble: 4 x 3
       gene   d1_g2 d2_g2
       <chr>  <dbl> <dbl>
     1 ABC123     3     1
     2 ABC123     4     1
     3 DEF234     6     1
     4 DEF234     5 12121

     • Can use the number of a column.

  9. Example of select_at

     gene_exp2 %>% select_at(vars(seq(from = 1, to = 5, by = 2)))
     # A tibble: 4 x 3
       gene   d1_g2 d2_g2
       <chr>  <dbl> <dbl>
     1 ABC123     3     1
     2 ABC123     4     1
     3 DEF234     6     1
     4 DEF234     5 12121

     • This is useful if there is a regular pattern to the columns you want to keep.

  10. select_if: when you use the contents of the columns

      gene_exp2 %>% select_if(function(x) any(x > 10))
      # A tibble: 4 x 3
        gene   d2_g1 d2_g2
        <chr>  <dbl> <dbl>
      1 ABC123    10     1
      2 ABC123    20     1
      3 DEF234    30     1
      4 DEF234    40 12121

      • The function is applied to the vector containing the contents of the column and returns TRUE to select that column.

  11. The anonymous function

      gene_exp2$d2_g2
      [1]     1     1     1 12121
      gene_exp2$d2_g2 > 10
      [1] FALSE FALSE FALSE  TRUE
      any(gene_exp2$d2_g2 > 10)
      [1] TRUE

  12. select_if (repeated)

      gene_exp2 %>% select_if(function(x) any(x > 10))
      # A tibble: 4 x 3
        gene   d2_g1 d2_g2
        <chr>  <dbl> <dbl>
      1 ABC123    10     1
      2 ABC123    20     1
      3 DEF234    30     1
      4 DEF234    40 12121

      • Why is the gene column selected? Hint: Is "ABC123" > "10"?
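The hint can be checked directly, and one possible way to keep character columns out is to test the column type before comparing. This is a sketch, not necessarily the fix intended by the course; `gene_exp2` is again rebuilt by hand from the printed slide output.

```r
library(dplyr)

# gene_exp2 rebuilt from the printed slide output (an assumption).
gene_exp2 <- tibble(
  gene  = c("ABC123", "ABC123", "DEF234", "DEF234"),
  d1_g1 = c(1, 3, 6, 2),
  d1_g2 = c(3, 4, 6, 5),
  d2_g1 = c(10, 20, 30, 40),
  d2_g2 = c(1, 1, 1, 12121)
)

# Comparing a character value with a number converts the number to "10",
# so the comparison is done on strings, and "ABC123" sorts after "10".
"ABC123" > 10

# Checking is.numeric() first means character columns return FALSE
# before the comparison is ever evaluated.
numeric_only <- gene_exp2 %>%
  select_if(function(x) is.numeric(x) && any(x > 10))

names(numeric_only)
```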

  13. select_if: alternative function syntax

      gene_exp2 %>% select_if(~any(. > 10))
      # A tibble: 4 x 3
        gene   d2_g1 d2_g2
        <chr>  <dbl> <dbl>
      1 ABC123    10     1
      2 ABC123    20     1
      3 DEF234    30     1
      4 DEF234    40 12121

      • Inside a ~ function, . refers to the argument passed in, in this case, each column in succession.
      • This syntax is often a bit shorter than using function(...) ...

  14. Same idea for mutate

  15. mutate_at

      mutate_at(gene_exp2, vars(ends_with("g2")), function(x) -x)
      # A tibble: 4 x 5
        gene   d1_g1 d1_g2 d2_g1  d2_g2
        <chr>  <dbl> <dbl> <dbl>  <dbl>
      1 ABC123     1    -3    10     -1
      2 ABC123     3    -4    20     -1
      3 DEF234     6    -6    30     -1
      4 DEF234     2    -5    40 -12121

      • This will negate the values in all columns whose names end with the string “g2”.

  16. mutate_if

      mutate_if(gene_exp2, is.numeric, function(x) -x)
      # A tibble: 4 x 5
        gene   d1_g1 d1_g2 d2_g1  d2_g2
        <chr>  <dbl> <dbl> <dbl>  <dbl>
      1 ABC123    -1    -3   -10     -1
      2 ABC123    -3    -4   -20     -1
      3 DEF234    -6    -6   -30     -1
      4 DEF234    -2    -5   -40 -12121

      • This will negate the values in all columns whose contents are numeric.
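For readers on dplyr 1.0.0 or later, the same two operations can also be written with across(), which the dplyr documentation describes as superseding the scoped _at and _if verbs. The comparison below is a sketch that assumes such a dplyr version is installed and uses a hand-rebuilt `gene_exp2`; it is not part of the original slides.

```r
library(dplyr)

# gene_exp2 rebuilt from the printed slide output (an assumption).
gene_exp2 <- tibble(
  gene  = c("ABC123", "ABC123", "DEF234", "DEF234"),
  d1_g1 = c(1, 3, 6, 2),
  d1_g2 = c(3, 4, 6, 5),
  d2_g1 = c(10, 20, 30, 40),
  d2_g2 = c(1, 1, 1, 12121)
)

# Scoped verbs, as on the slides.
via_at <- mutate_at(gene_exp2, vars(ends_with("g2")), function(x) -x)
via_if <- mutate_if(gene_exp2, is.numeric, function(x) -x)

# across() equivalents (requires dplyr >= 1.0.0).
via_across_names <- mutate(gene_exp2, across(ends_with("g2"), ~ -.x))
via_across_types <- mutate(gene_exp2, across(where(is.numeric), ~ -.x))

all.equal(via_at, via_across_names)
all.equal(via_if, via_across_types)
```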

  17. Same idea for rename

  18. Replace spaces in column names

      tibble(`this is a col name` = 3:4) %>%
        rename_all(~str_replace_all(., " ", "_"))
      # A tibble: 2 x 1
        this_is_a_col_name
                     <int>
      1                  3
      2                  4

  19. Use all lower case in column names

      tibble(Col1 = 1:2, Col2 = 3:4) %>%
        rename_all(~str_to_lower(.))
      # A tibble: 2 x 2
         col1  col2
        <int> <int>
      1     1     3
      2     2     4
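The _at variant works for rename as well, which is useful when only some of the names should change. A minimal sketch, assuming the hand-rebuilt `gene_exp2` below; the new prefix "day1" is made up for illustration.

```r
library(dplyr)
library(stringr)

# gene_exp2 rebuilt from the printed slide output (an assumption).
gene_exp2 <- tibble(
  gene  = c("ABC123", "ABC123", "DEF234", "DEF234"),
  d1_g1 = c(1, 3, 6, 2),
  d1_g2 = c(3, 4, 6, 5),
  d2_g1 = c(10, 20, 30, 40),
  d2_g2 = c(1, 1, 1, 12121)
)

# Rename only the columns whose names start with "d1";
# the other columns keep their names.
renamed <- gene_exp2 %>%
  rename_at(vars(starts_with("d1")), ~ str_replace(., "^d1", "day1"))

names(renamed)
```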

  20. Same idea for summarize

  21. Count number of NA values in each column

      (m <- read_csv(str_c(data_dir, "missing_df.csv")))
      # A tibble: 10 x 3
            id weight group
         <dbl>  <dbl> <chr>
       1     1  0.114 a
       2     2  0.622 b
       3     3  0.609 a
       4     4 NA     b
       5     5  0.861 <NA>
       6     6  0.640 b
       7     7 NA     a
       8     8  0.233 b
       9     9  0.666 a
      10    10  0.514 b

      m %>% summarize_all(~sum(is.na(.)))
      # A tibble: 1 x 3
           id weight group
        <int>  <int> <int>
      1     0      2     1

  22. Summarize with mean

      m %>% summarize_if(is.numeric, ~mean(., na.rm = TRUE))
      # A tibble: 1 x 2
           id weight
        <dbl>  <dbl>
      1   5.5  0.532

      • Summarize by computing the mean of all numeric columns, ignoring NAs.
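summarize_at also accepts a named list of functions, so the NA count and the mean can be computed in one call. In this sketch `m` is rebuilt by hand from the printed slide output; the assignment of 0.640 and 0.514 to rows 6 and 10 is a guess from the garbled output and does not affect the summaries.

```r
library(dplyr)

# missing_df rebuilt from the printed slide output (an assumption;
# the course reads it from missing_df.csv).
m <- tibble(
  id     = 1:10,
  weight = c(0.114, 0.622, 0.609, NA, 0.861, 0.640, NA, 0.233, 0.666, 0.514),
  group  = c("a", "b", "a", "b", NA, "b", "a", "b", "a", "b")
)

# Apply two summaries to the weight column in a single call.
weight_summary <- m %>%
  summarize_at(
    vars(weight),
    list(mean = ~mean(., na.rm = TRUE), n_missing = ~sum(is.na(.)))
  )

weight_summary
```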

  23. Setting column names

      (d1 <- tibble(a = 1:2, b = 3:4))
      # A tibble: 2 x 2
            a     b
        <int> <int>
      1     1     3
      2     2     4

      (d2 <- set_names(d1, c("new_a", "new_b")))
      # A tibble: 2 x 2
        new_a new_b
        <int> <int>
      1     1     3
      2     2     4

      • set_names can assign all the columns new names.
      • Remember to save the new frame.

  24. Grouping over multiple columns

  25. Get example dataset

      (group_by_example <- read_csv(str_c(data_dir, "group_by_example.csv")))
      # A tibble: 4 x 4
        Genotype Treatment gene     expression_value
        <chr>    <chr>     <chr>               <dbl>
      1 Control  Memantine DYRK1A_N            0.504
      2 Control  Memantine DYRK1A_N            0.515
      3 Control  Saline    DYRK1A_N            0.590
      4 Control  Saline    DYRK1A_N            0.592

      • This is part of the intermediate result from data_challenge_mouse_protein_expression.

  26. Summarize by Treatment

      group_by_example %>%
        group_by(Treatment) %>%
        summarize(mean_expression = mean(expression_value))
      # A tibble: 2 x 2
        Treatment mean_expression
        <chr>               <dbl>
      1 Memantine           0.509
      2 Saline              0.591

  27. Summarize by gene

      group_by_example %>%
        group_by(gene) %>%
        summarize(mean_expression = mean(expression_value))
      # A tibble: 1 x 2
        gene     mean_expression
        <chr>              <dbl>
      1 DYRK1A_N           0.550

  28. Summarize Treatment, gene

      group_by_example %>%
        group_by(Treatment, gene) %>%
        summarize(mean_expression = mean(expression_value))
      # A tibble: 2 x 3
      # Groups:   Treatment [2]
        Treatment gene     mean_expression
        <chr>     <chr>              <dbl>
      1 Memantine DYRK1A_N           0.509
      2 Saline    DYRK1A_N           0.591

      • Note the result is grouped by Treatment.
      • If you summarize a grouped data frame, the last grouping variable is removed from the grouping.
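The remaining grouping can be inspected with group_vars() and cleared with ungroup(). A small sketch, assuming `group_by_example` is rebuilt by hand from the printed slide output (the displayed values are rounded, so the means differ slightly from the slides in the last digit).

```r
library(dplyr)

# group_by_example rebuilt from the printed slide output (an assumption;
# the row order within each Treatment is a guess).
group_by_example <- tibble(
  Genotype         = rep("Control", 4),
  Treatment        = c("Memantine", "Memantine", "Saline", "Saline"),
  gene             = rep("DYRK1A_N", 4),
  expression_value = c(0.504, 0.515, 0.590, 0.592)
)

by_both <- group_by_example %>%
  group_by(Treatment, gene) %>%
  summarize(mean_expression = mean(expression_value))

# Only the last grouping variable (gene) was dropped by summarize.
group_vars(by_both)

# ungroup() removes whatever grouping is left.
flat <- ungroup(by_both)
group_vars(flat)
```

Calling ungroup() after summarize is a common precaution when later steps should operate on the whole table rather than per Treatment.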

  29. Summarize gene, Treatment

      group_by_example %>%
        group_by(gene, Treatment) %>%
        summarize(mean_expression = mean(expression_value))
      # A tibble: 2 x 3
      # Groups:   gene [1]
        gene     Treatment mean_expression
        <chr>    <chr>               <dbl>
      1 DYRK1A_N Memantine           0.509
      2 DYRK1A_N Saline              0.591

      • Note the result is grouped by gene.
