SLIDE 1

Reshaping a data frame

Steve Bagley

somgen223.stanford.edu

SLIDE 2

Reshaping data

  • Sometimes data are organized in a way that makes it difficult to compute in a vector-oriented way.
  • Sometimes data elements are included in the column names.
  • The tidyr package (part of tidyverse) allows you to change the organization of the data, keeping the content the same.

SLIDE 3

Reshaping example

(gene_exp1 <- read_csv(str_c(data_dir, "gene_exp1.csv")))
# A tibble: 3 x 3
  gene   control treatment
  <chr>    <dbl>     <dbl>
1 ABC123       0         1
2 DEF234      10         3
3 GKK7        12        13

  • This is in wide format: the column names contain data (the condition).
  • Adding another condition, such as treatment2, would create a new column.
  • In R, it is sometimes useful to set up the data frame so that new data are added as rows (tall or long format).
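
The file gene_exp1.csv is not reproduced in these slides. To follow along without it, one option is to build the same tibble by hand. This is only a sketch: the values are copied from the printed output, and the ABC123 control value of 0 is an assumption inferred from the change of 1 computed on a later slide.

```r
library(tibble)

# Stand-in for read_csv(str_c(data_dir, "gene_exp1.csv")).
# Values copied from the printed tibble; the 0 for ABC123/control is inferred.
gene_exp1 <- tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",        0,          1,
  "DEF234",       10,          3,
  "GKK7",         12,         13
)

gene_exp1
```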

SLIDE 4

Wide vs tall

gene_exp1
# A tibble: 3 x 3
  gene   control treatment
  <chr>    <dbl>     <dbl>
1 ABC123       0         1
2 DEF234      10         3
3 GKK7        12        13

gene_tall
# A tibble: 6 x 3
  gene   condition expression_level
  <chr>  <chr>                <dbl>
1 ABC123 control                  0
2 DEF234 control                 10
3 GKK7   control                 12
4 ABC123 treatment                1
5 DEF234 treatment                3
6 GKK7   treatment               13

  • Convince yourself that the same information appears in these two data frames.
  • In gene_tall, the column names describe the data; they don’t contain any data.

SLIDE 5

Using gather

(gene_tall <- gather(gene_exp1, condition, expression_level, control:treatment))
# A tibble: 6 x 3
  gene   condition expression_level
  <chr>  <chr>                <dbl>
1 ABC123 control                  0
2 DEF234 control                 10
3 GKK7   control                 12
4 ABC123 treatment                1
5 DEF234 treatment                3
6 GKK7   treatment               13

  • The arguments to gather:
    1. The data frame, here gene_exp1.
    2. The key, which is the name of the new column for the values from the old column names, here condition.
    3. The value, which is the name of the new column for the data values, here expression_level.
    4. The columns from which to gather the data. Here we use the : operator to name a range of columns.
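
In tidyr 1.0.0 and later, gather is superseded by pivot_longer, which names its arguments more explicitly. The sketch below shows what the same reshaping might look like with the newer function. The tibble is built by hand from the printed values (the 0 for ABC123/control is inferred), so it stands in for the CSV file rather than reproducing it.

```r
library(tidyr)
library(tibble)

# Hand-built stand-in for gene_exp1 (values copied from the printed output)
gene_exp1 <- tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",        0,          1,
  "DEF234",       10,          3,
  "GKK7",         12,         13
)

gene_tall <- pivot_longer(
  gene_exp1,
  cols = control:treatment,        # columns to gather (same range as before)
  names_to = "condition",          # plays the role of the key
  values_to = "expression_level"   # plays the role of the value
)

gene_tall
```

The content matches the gather result, but the row order differs: pivot_longer keeps the rows for each gene together, whereas gather stacks all control rows before all treatment rows.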

SLIDE 6

Exercise: using tidy data

  • Filter the rows of gene_tall for gene ABC123 only.
  • Separately, filter to get only the control condition.

SLIDE 7

Answer: using tidy data

filter(gene_tall, gene == "ABC123")
# A tibble: 2 x 3
  gene   condition expression_level
  <chr>  <chr>                <dbl>
1 ABC123 control                  0
2 ABC123 treatment                1

filter(gene_tall, condition == "control")
# A tibble: 3 x 3
  gene   condition expression_level
  <chr>  <chr>                <dbl>
1 ABC123 control                  0
2 DEF234 control                 10
3 GKK7   control                 12

SLIDE 8

Exercise: compute change in gene expression

  • Compute the change (treatment - control) for each gene.
  • Hint: think about what shape the data should have to enable this computation.

SLIDE 9

Answer: compute change in gene expression

  • We use the data in the original format because then we can subtract the columns.

mutate(gene_exp1, change = treatment - control)
# A tibble: 3 x 4
  gene   control treatment change
  <chr>    <dbl>     <dbl>  <dbl>
1 ABC123       0         1      1
2 DEF234      10         3     -7
3 GKK7        12        13      1

SLIDE 10

Exercise: filtering the gene expression data

  • Produce a data frame that includes all data where the control or treatment expression value is above 5.

SLIDE 11

Answer: filtering the gene expression data

gene_tall %>%
  filter(expression_level > 5)
# A tibble: 3 x 3
  gene   condition expression_level
  <chr>  <chr>                <dbl>
1 DEF234 control                 10
2 GKK7   control                 12
3 GKK7   treatment               13

  • Note that this computation is very easy to do using the tall data format. In this format, it is a single comparison that works for both treatment and control groups.
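
For contrast, here is a sketch of what the same question might look like against the wide format. The tibble is a hand-built stand-in for gene_exp1 (values copied from the printed output, with the 0 inferred), and wide_result is a name introduced only for this illustration.

```r
library(dplyr)
library(tibble)

# Hand-built stand-in for gene_exp1
gene_exp1 <- tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",        0,          1,
  "DEF234",       10,          3,
  "GKK7",         12,         13
)

# Wide format needs one comparison per condition column,
# and it keeps whole rows rather than individual measurements.
wide_result <- filter(gene_exp1, control > 5 | treatment > 5)

wide_result
```

The wide version returns the full DEF234 row, including its treatment value of 3, which is not above 5. Each additional condition column would also need another term in the filter, while the tall version stays a single comparison.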

SLIDE 12

Example: what is the minimum expression level for each gene?

gene_tall %>%
  group_by(gene) %>%
  summarize(min_level = min(expression_level))
# A tibble: 3 x 2
  gene   min_level
  <chr>      <dbl>
1 ABC123         0
2 DEF234         3
3 GKK7          12

  • This drops the condition column, which we might want to retain.

SLIDE 13

Keep the entire row with the minimum

gene_tall %>%
  group_by(gene) %>%
  arrange(expression_level) %>%
  slice(1) %>%
  ungroup()
# A tibble: 3 x 3
  gene   condition expression_level
  <chr>  <chr>                <dbl>
1 ABC123 control                  0
2 DEF234 treatment                3
3 GKK7   control                 12

  • slice(1) returns the first row in the group
  • ungroup removes the grouping attribute that was added by group_by.

SLIDE 14

Even better: which.min

gene_tall %>%
  group_by(gene) %>%
  slice(which.min(expression_level))
# A tibble: 3 x 3
# Groups:   gene [3]
  gene   condition expression_level
  <chr>  <chr>                <dbl>
1 ABC123 control                  0
2 DEF234 treatment                3
3 GKK7   control                 12

  • which.min returns the index of the vector element with the first occurrence of the minimum value.
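
Recent versions of dplyr (1.0.0 and later) also provide slice_min, which expresses the same idea directly. The sketch below assumes such a version is installed. gene_tall is rebuilt by hand from the printed values (the 0 is inferred), and min_rows is a name introduced only for this example.

```r
library(dplyr)
library(tibble)

# Hand-built stand-in for gene_tall
gene_tall <- tribble(
  ~gene,    ~condition,  ~expression_level,
  "ABC123", "control",    0,
  "DEF234", "control",   10,
  "GKK7",   "control",   12,
  "ABC123", "treatment",  1,
  "DEF234", "treatment",  3,
  "GKK7",   "treatment", 13
)

min_rows <- gene_tall %>%
  group_by(gene) %>%
  # with_ties = FALSE keeps a single row per group, like which.min
  slice_min(expression_level, n = 1, with_ties = FALSE) %>%
  ungroup()

min_rows
```

With the default with_ties = TRUE, slice_min would return every row tied for the minimum, which may or may not be what you want.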

SLIDE 15

The opposite of gather

  • Suppose you started with the data in tall format.

gene_tall
# A tibble: 6 x 3
  gene   condition expression_level
  <chr>  <chr>                <dbl>
1 ABC123 control                  0
2 DEF234 control                 10
3 GKK7   control                 12
4 ABC123 treatment                1
5 DEF234 treatment                3
6 GKK7   treatment               13

  • How would you convert it to the wide format?

SLIDE 16

spread is the opposite of gather

spread(gene_tall, condition, expression_level)
# A tibble: 3 x 3
  gene   control treatment
  <chr>    <dbl>     <dbl>
1 ABC123       0         1
2 DEF234      10         3
3 GKK7        12        13

  • spread constructs wide data frames.
  • The second argument defines the column in the tall format to be used to make new column names.
  • The third argument defines the column in the tall format to be used as the source of data for those new columns.
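
As with gather, spread is superseded in tidyr 1.0.0 and later, where pivot_wider is the counterpart. A minimal sketch of the same conversion, using a hand-built stand-in for gene_tall (values copied from the printed output, with the 0 inferred) and a result name, gene_wide, chosen only for this example:

```r
library(tidyr)
library(tibble)

# Hand-built stand-in for gene_tall
gene_tall <- tribble(
  ~gene,    ~condition,  ~expression_level,
  "ABC123", "control",    0,
  "DEF234", "control",   10,
  "GKK7",   "control",   12,
  "ABC123", "treatment",  1,
  "DEF234", "treatment",  3,
  "GKK7",   "treatment", 13
)

gene_wide <- pivot_wider(
  gene_tall,
  names_from = condition,          # column supplying the new column names
  values_from = expression_level   # column supplying the cell values
)

gene_wide
```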

SLIDE 17

How to get information out of the column names

SLIDE 18

Example of getting information out of column names

(gene_exp2 <- read_csv(str_c(data_dir, "gene_exp2.csv")))
# A tibble: 4 x 5
  gene   d1_g1 d1_g2 d2_g1 d2_g2
  <chr>  <dbl> <dbl> <dbl> <dbl>
1 ABC123     1     3    10     1
2 ABC123     3     4    20     1
3 DEF234     6     6    30     1
4 DEF234     2     5    40 12121

  • This data frame has useful information encoded in the column names, representing the number of the day and the group.
  • We want to move those data down into the contents of the data frame.

SLIDE 19

Convert from wide to tall format

gene_exp2 %>%
  gather(condition, expression_level, d1_g1:d2_g2) %>%
  head()
# A tibble: 6 x 3
  gene   condition expression_level
  <chr>  <chr>                <dbl>
1 ABC123 d1_g1                    1
2 ABC123 d1_g1                    3
3 DEF234 d1_g1                    6
4 DEF234 d1_g1                    2
5 ABC123 d1_g2                    3
6 ABC123 d1_g2                    4

SLIDE 20

Getting data out of the condition column

gene_exp2 %>%
  gather(condition, expression_level, d1_g1:d2_g2) %>%
  separate(condition, into = c("day", "group"), sep = "_") %>%
  head()
# A tibble: 6 x 4
  gene   day   group expression_level
  <chr>  <chr> <chr>            <dbl>
1 ABC123 d1    g1                   1
2 ABC123 d1    g1                   3
3 DEF234 d1    g1                   6
4 DEF234 d1    g1                   2
5 ABC123 d1    g2                   3
6 ABC123 d1    g2                   4

  • The condition column holds two values in a compressed format: d1_g1 means “day 1, group 1”. We need to split the string apart at the "_" character using separate.

SLIDE 21

Clean up the data: strings

gene_exp2 %>%
  gather(condition, expression_level, d1_g1:d2_g2) %>%
  separate(condition, into = c("day", "group"), sep = "_") %>%
  mutate(day = str_remove(day, "d"),
         group = str_remove(group, "g")) %>%
  head()
# A tibble: 6 x 4
  gene   day   group expression_level
  <chr>  <chr> <chr>            <dbl>
1 ABC123 1     1                    1
2 ABC123 1     1                    3
3 DEF234 1     1                    6
4 DEF234 1     1                    2
5 ABC123 1     2                    3
6 ABC123 1     2                    4

  • If we want to get rid of the "d" and "g" prefixes, we need to do some string manipulation.

SLIDE 22

str_remove: remove the first occurrence of a pattern in each string

str_remove(c("d1", "d2", "ddddd3", "dxy"), "d")
[1] "1"     "2"     "dddd3" "xy"
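
Because str_remove drops only the first match, it can help to compare it with two related options from stringr: str_remove_all, which drops every match, and an anchored pattern, which drops the prefix only at the start of the string. The variable names below are introduced just for this illustration.

```r
library(stringr)

x <- c("d1", "d2", "ddddd3", "dxy")

first_only <- str_remove(x, "d")      # removes the first "d" in each string
every_d    <- str_remove_all(x, "d")  # removes every "d" in each string
prefix     <- str_remove(x, "^d")     # removes a "d" only at the start

first_only
every_d
prefix
```

For values such as "d1" and "g2" all three give the same answer. The anchored form is arguably the safest choice when only a leading prefix should be stripped.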

SLIDE 23

Clean up the data: numbers

gene_exp2 %>%
  gather(condition, expression_level, d1_g1:d2_g2) %>%
  separate(condition, into = c("day", "group"), sep = "_") %>%
  mutate(day = str_remove(day, "d"),
         group = str_remove(group, "g")) %>%
  mutate(day = as.integer(day),
         group = as.integer(group)) %>%
  head()
# A tibble: 6 x 4
  gene     day group expression_level
  <chr>  <int> <int>            <dbl>
1 ABC123     1     1                1
2 ABC123     1     1                3
3 DEF234     1     1                6
4 DEF234     1     1                2
5 ABC123     1     2                3
6 ABC123     1     2                4

  • After the prefixes are removed, the values in the day and group columns are still characters, not numbers, so coerce them to the desired type with as.integer.
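
The gather, separate, str_remove, and as.integer steps can also be collapsed into a single call with pivot_longer, assuming tidyr 1.1.0 or later (the names_transform argument was added in that release). This is a sketch of an alternative, not the approach used in these slides. gene_exp2 is rebuilt by hand with the values copied as printed, and gene_exp2_tall is a name chosen for this example.

```r
library(tidyr)
library(tibble)

# Hand-built stand-in for gene_exp2 (values copied as printed)
gene_exp2 <- tribble(
  ~gene,    ~d1_g1, ~d1_g2, ~d2_g1, ~d2_g2,
  "ABC123",      1,      3,     10,      1,
  "ABC123",      3,      4,     20,      1,
  "DEF234",      6,      6,     30,      1,
  "DEF234",      2,      5,     40,  12121
)

gene_exp2_tall <- pivot_longer(
  gene_exp2,
  cols = d1_g1:d2_g2,
  names_to = c("day", "group"),
  # Each capture group becomes one new column; the "d" and "g"
  # prefixes sit outside the groups, so they are discarded.
  names_pattern = "d(\\d+)_g(\\d+)",
  names_transform = list(day = as.integer, group = as.integer),
  values_to = "expression_level"
)

head(gene_exp2_tall)
```

The row order differs from the gather version, because pivot_longer keeps the values from each original row together.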

SLIDE 24

Reading

  • Read: 12 Tidy data | R for Data Science (sections 12.1 to 12.4)
  • Read: Tidy data • tidyr
