Reshaping a data frame
Steve Bagley
somgen223.stanford.edu

Reshaping data
• Sometimes data are organized in a way that makes it difficult to compute in a vector-oriented way.
• Sometimes data elements are included in the column names.
• The tidyr package (part of tidyverse) allows you to change the organization of the data, keeping the content the same.

Reshaping example

    (gene_exp1 <- read_csv(str_c(data_dir, "gene_exp1.csv")))

    # A tibble: 3 x 3
      gene   control treatment
      <chr>    <dbl>     <dbl>
    1 ABC123       0         1
    2 DEF234      10         3
    3 GKK7        12        13

• This is in wide format: the column names contain data (the condition).
• Adding another condition, such as treatment2, would create a new column.
• In R, it is sometimes useful to set up the data frame so that new data are added as rows (tall or long format).
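
To follow along without the CSV file, the same table can be typed in directly. This is a minimal sketch: the `treatment2` values and the name `gene_exp1_more` are made up for illustration and are not part of the course data.

```r
library(tibble)

# Same values as the gene_exp1 table on this slide, entered by hand
# (the course reads them from gene_exp1.csv instead).
gene_exp1 <- tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",        0,          1,
  "DEF234",       10,          3,
  "GKK7",         12,         13
)

# Hypothetical extra condition: in wide format it shows up as a new column,
# so the number of columns grows while the number of rows stays the same.
gene_exp1_more <- add_column(gene_exp1, treatment2 = c(2, 5, 11))

dim(gene_exp1)       # 3 rows, 3 columns
dim(gene_exp1_more)  # 3 rows, 4 columns
```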

Wide vs tall

    gene_exp1

    # A tibble: 3 x 3
      gene   control treatment
      <chr>    <dbl>     <dbl>
    1 ABC123       0         1
    2 DEF234      10         3
    3 GKK7        12        13

    gene_tall

    # A tibble: 6 x 3
      gene   condition expression_level
      <chr>  <chr>                <dbl>
    1 ABC123 control                  0
    2 DEF234 control                 10
    3 GKK7   control                 12
    4 ABC123 treatment                1
    5 DEF234 treatment                3
    6 GKK7   treatment               13

• Convince yourself that the same information appears in these two data frames.
• In gene_tall, the column names describe the data; they don't contain any data.

Using gather

    (gene_tall <- gather(gene_exp1, condition, expression_level,
                         control:treatment))

    # A tibble: 6 x 3
      gene   condition expression_level
      <chr>  <chr>                <dbl>
    1 ABC123 control                  0
    2 DEF234 control                 10
    3 GKK7   control                 12
    4 ABC123 treatment                1
    5 DEF234 treatment                3
    6 GKK7   treatment               13

• The arguments to gather:
  1. The data frame, here gene_exp1.
  2. The key, which is the name of the new column for the values from the old column names, here condition.
  3. The value, which is the name of the new column for the data values, here expression_level.
  4. The columns from which to gather the data. Here we use the : operator to name a range of columns (control:treatment).
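
In current tidyr releases, gather still works but is documented as superseded by pivot_longer. The sketch below shows the same reshaping both ways, assuming tidyr 1.0.0 or later and the gene_exp1 values from the slides typed in by hand. The two results hold the same rows in a different order, so the comparison sorts the pivot_longer output first.

```r
library(tibble)
library(tidyr)
library(dplyr)

gene_exp1 <- tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",        0,          1,
  "DEF234",       10,          3,
  "GKK7",         12,         13
)

# gather: key = condition, value = expression_level, columns = control:treatment
tall_gather <- gather(gene_exp1, condition, expression_level, control:treatment)

# pivot_longer: the same three pieces of information, passed by name
tall_pivot <- pivot_longer(
  gene_exp1,
  cols      = control:treatment,
  names_to  = "condition",
  values_to = "expression_level"
)

# gather stacks column by column; pivot_longer goes row by row.
# Sorting by condition, then gene, puts the rows in gather's order.
tall_pivot_sorted <- arrange(tall_pivot, condition, gene)
```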

Exercise: using tidy data
• Filter the rows of gene_tall for gene ABC123 only.
• Separately, filter to get only the control condition.

Answer: using tidy data

    filter(gene_tall, gene == "ABC123")

    # A tibble: 2 x 3
      gene   condition expression_level
      <chr>  <chr>                <dbl>
    1 ABC123 control                  0
    2 ABC123 treatment                1

    filter(gene_tall, condition == "control")

    # A tibble: 3 x 3
      gene   condition expression_level
      <chr>  <chr>                <dbl>
    1 ABC123 control                  0
    2 DEF234 control                 10
    3 GKK7   control                 12

Exercise: compute change in gene expression
• Compute the change (treatment - control) for each gene.
• Hint: think about what shape the data should have to enable this computation.

Answer: compute change in gene expression

    mutate(gene_exp1, change = treatment - control)

    # A tibble: 3 x 4
      gene   control treatment change
      <chr>    <dbl>     <dbl>  <dbl>
    1 ABC123       0         1      1
    2 DEF234      10         3     -7
    3 GKK7        12        13      1

• We use the data in the original (wide) format because then we can subtract one column from the other.

Exercise: filtering the gene expression data
• Produce a data frame that includes all data where the control or treatment expression value is above 5.

Answer: filtering the gene expression data

    gene_tall %>%
      filter(expression_level > 5)

    # A tibble: 3 x 3
      gene   condition expression_level
      <chr>  <chr>                <dbl>
    1 DEF234 control                 10
    2 GKK7   control                 12
    3 GKK7   treatment               13

• Note that this computation is very easy to do using the tall data format. In this format, it is a single comparison that works for both treatment and control groups.
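
For contrast, here is a sketch of the same question asked of the wide table, using the slide values typed in by hand. The names `tall_hits` and `wide_hits` are only for this illustration. The wide version needs one comparison per condition column, and it returns whole rows, so a value that fails the test can come along with one that passes.

```r
library(tibble)
library(tidyr)
library(dplyr)

gene_exp1 <- tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",        0,          1,
  "DEF234",       10,          3,
  "GKK7",         12,         13
)
gene_tall <- gather(gene_exp1, condition, expression_level, control:treatment)

# Tall format: a single comparison covers every condition.
tall_hits <- filter(gene_tall, expression_level > 5)

# Wide format: one comparison per condition column, joined with |.
# The DEF234 row is kept because control is 10, even though treatment is 3.
wide_hits <- filter(gene_exp1, control > 5 | treatment > 5)

nrow(tall_hits)  # 3 individual measurements above 5
nrow(wide_hits)  # 2 genes with at least one measurement above 5
```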

Example: what is the minimum expression level for each gene?

    gene_tall %>%
      group_by(gene) %>%
      summarize(min_level = min(expression_level))

    # A tibble: 3 x 2
      gene   min_level
      <chr>      <dbl>
    1 ABC123         0
    2 DEF234         3
    3 GKK7          12

• This drops the condition column, which we might want to retain.

Keep the entire row with the minimum

    gene_tall %>%
      group_by(gene) %>%
      arrange(expression_level) %>%
      slice(1) %>%
      ungroup()

    # A tibble: 3 x 3
      gene   condition expression_level
      <chr>  <chr>                <dbl>
    1 ABC123 control                  0
    2 DEF234 treatment                3
    3 GKK7   control                 12

• slice(1) returns the first row in each group.
• ungroup removes the grouping attribute that was added by group_by.

Even better: which.min

    gene_tall %>%
      group_by(gene) %>%
      slice(which.min(expression_level))

    # A tibble: 3 x 3
    # Groups:   gene [3]
      gene   condition expression_level
      <chr>  <chr>                <dbl>
    1 ABC123 control                  0
    2 DEF234 treatment                3
    3 GKK7   control                 12

• which.min returns the index of the vector element with the first occurrence of the minimum value.
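
A small sketch of how which.min treats ties, followed by one possible alternative for the grouped case. The alternative uses slice_min, which assumes dplyr 1.0.0 or later; it is offered as an option, not as the approach used in the slides. The slide data are typed in by hand.

```r
library(tibble)
library(tidyr)
library(dplyr)

# The minimum 0 occurs at positions 2 and 4; which.min reports the first.
first_min <- which.min(c(3, 0, 7, 0))
first_min  # 2

gene_exp1 <- tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",        0,          1,
  "DEF234",       10,          3,
  "GKK7",         12,         13
)
gene_tall <- gather(gene_exp1, condition, expression_level, control:treatment)

# slice_min keeps the row(s) with the smallest value in each group.
# with_ties = FALSE returns exactly one row per group, like which.min.
min_rows <- gene_tall %>%
  group_by(gene) %>%
  slice_min(expression_level, n = 1, with_ties = FALSE) %>%
  ungroup()
```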

The opposite of gather

    gene_tall

    # A tibble: 6 x 3
      gene   condition expression_level
      <chr>  <chr>                <dbl>
    1 ABC123 control                  0
    2 DEF234 control                 10
    3 GKK7   control                 12
    4 ABC123 treatment                1
    5 DEF234 treatment                3
    6 GKK7   treatment               13

• Suppose you started with the data in tall format.
• How would you convert it to the wide format?

spread is the opposite of gather

    spread(gene_tall, condition, expression_level)

    # A tibble: 3 x 3
      gene   control treatment
      <chr>    <dbl>     <dbl>
    1 ABC123       0         1
    2 DEF234      10         3
    3 GKK7        12        13

• spread constructs wide data frames.
• The second argument defines the column in the tall format to be used to make new column names.
• The third argument defines the column in the tall format to be used as the source of data for those new columns.
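
Like gather, spread is documented as superseded in current tidyr; pivot_wider is its replacement. The sketch below, assuming tidyr 1.0.0 or later and the slide values typed in by hand, runs both and checks that the round trip from wide to tall and back recovers the original columns.

```r
library(tibble)
library(tidyr)

gene_exp1 <- tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",        0,          1,
  "DEF234",       10,          3,
  "GKK7",         12,         13
)
gene_tall <- gather(gene_exp1, condition, expression_level, control:treatment)

# spread: second argument supplies the new column names, third the values
wide_spread <- spread(gene_tall, condition, expression_level)

# pivot_wider: the same two roles, passed by name
wide_pivot <- pivot_wider(
  gene_tall,
  names_from  = condition,
  values_from = expression_level
)

names(wide_spread)  # "gene" "control" "treatment"
names(wide_pivot)   # "gene" "control" "treatment"
```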

How to get information out of the column names

Example of getting information out of column names

    (gene_exp2 <- read_csv(str_c(data_dir, "gene_exp2.csv")))

[Output: a tibble with 4 rows and 5 columns: gene <chr>, d1_g1 <dbl>, d1_g2 <dbl>, d2_g1 <dbl>, d2_g2 <dbl>]

• This data frame has useful information encoded in the column names, representing the number of the day and the group.
• We want to move those data down into the contents of the data frame.

Convert from wide to tall format

    gene_exp2 %>%
      gather(condition, expression_level, d1_g1:d2_g2) %>%
      head()

    # A tibble: 6 x 3
      gene   condition expression_level
      <chr>  <chr>                <dbl>
    1 ABC123 d1_g1                    1
    2 ABC123 d1_g1                    4
    3 DEF234 d1_g1                    6
    4 DEF234 d1_g1                    2
    5 ABC123 d1_g2                    3
    6 ABC123 d1_g2                    3

Getting data out of the condition column

    gene_exp2 %>%
      gather(condition, expression_level, d1_g1:d2_g2) %>%
      separate(condition, into = c("day", "group"), sep = "_") %>%
      head()

    # A tibble: 6 x 4
      gene   day   group expression_level
      <chr>  <chr> <chr>            <dbl>
    1 ABC123 d1    g1                   1
    2 ABC123 d1    g1                   4
    3 DEF234 d1    g1                   6
    4 DEF234 d1    g1                   2
    5 ABC123 d1    g2                   3
    6 ABC123 d1    g2                   3

• The condition column has the compressed format for the values: d1_g1 means "day 1, group 1". We need to split the string apart at the "_" character using separate.

Clean up the data: strings

    gene_exp2 %>%
      gather(condition, expression_level, d1_g1:d2_g2) %>%
      separate(condition, into = c("day", "group"), sep = "_") %>%
      mutate(day = str_remove(day, "d"),
             group = str_remove(group, "g")) %>%
      head()

    # A tibble: 6 x 4
      gene   day   group expression_level
      <chr>  <chr> <chr>            <dbl>
    1 ABC123 1     1                    1
    2 ABC123 1     1                    4
    3 DEF234 1     1                    6
    4 DEF234 1     1                    2
    5 ABC123 1     2                    3
    6 ABC123 1     2                    3

• If we want to get rid of the "d" and "g" prefixes, we need to do some string manipulation.

str_remove: remove the first occurrence of a pattern in a string

    str_remove(c("d1", "d2", "ddddd3", "dxy"), "d")

    [1] "1"     "2"     "dddd3" "xy"
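
Because only the first match is removed, "ddddd3" keeps four of its five d's. The sketch below sets str_remove next to two stringr alternatives: str_remove_all, which removes every match, and an anchored pattern that removes only a leading run of d's. The variable names are just for this illustration.

```r
library(stringr)

x <- c("d1", "d2", "ddddd3", "dxy")

# First match only: one "d" is removed from each string.
first_only <- str_remove(x, "d")
first_only  # "1" "2" "dddd3" "xy"

# Every match: all "d" characters are removed, wherever they occur.
all_matches <- str_remove_all(x, "d")
all_matches  # "1" "2" "3" "xy"

# Anchored regular expression: one or more "d" at the start of the string.
leading_run <- str_remove(x, "^d+")
leading_run  # "1" "2" "3" "xy"
```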

Clean up the data: numbers

    gene_exp2 %>%
      gather(condition, expression_level, d1_g1:d2_g2) %>%
      separate(condition, into = c("day", "group"), sep = "_") %>%
      mutate(day = str_remove(day, "d"),
             group = str_remove(group, "g")) %>%
      mutate(day = as.integer(day),
             group = as.integer(group)) %>%
      head()

    # A tibble: 6 x 4
      gene     day group expression_level
      <chr>  <int> <int>            <dbl>
    1 ABC123     1     1                1
    2 ABC123     1     1                4
    3 DEF234     1     1                6
    4 DEF234     1     1                2
    5 ABC123     1     2                3
    6 ABC123     1     2                3

• The values in the day and group columns are characters, not numbers, so coerce to the desired type.
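
The gather, separate and str_remove steps can also be folded into a single pivot_longer call. This is a sketch of that alternative, assuming tidyr 1.0.0 or later. The table below is a made-up stand-in with the same column names as gene_exp2, not the contents of gene_exp2.csv, and the name `gene_exp2_demo` is hypothetical.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical stand-in data: same column layout as gene_exp2, invented values.
gene_exp2_demo <- tribble(
  ~gene,    ~d1_g1, ~d1_g2, ~d2_g1, ~d2_g2,
  "ABC123",      1,      3,     10,      5,
  "DEF234",      6,      2,     30,      4
)

tall <- gene_exp2_demo %>%
  pivot_longer(
    cols          = d1_g1:d2_g2,
    names_to      = c("day", "group"),
    # Each pair of parentheses captures one piece of the old column name,
    # so the "d" and "g" prefixes are dropped during the split.
    names_pattern = "d(\\d+)_g(\\d+)",
    values_to     = "expression_level"
  ) %>%
  # The captured pieces are still character strings; coerce them to integers.
  mutate(day = as.integer(day), group = as.integer(group))
```

Here names_pattern does the work of both separate and str_remove; the final mutate is the same coercion step as on the slide.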

Reading
• Read: 12 Tidy data | R for Data Science (sections 12.1 to 12.4)
• Read: Tidy data
• tidyr
