Joins, and dates/times Steve Bagley somgen223.stanford.edu 1 - - PowerPoint PPT Presentation

joins and dates times
SMART_READER_LITE
LIVE PREVIEW

Joins, and dates/times Steve Bagley somgen223.stanford.edu 1 - - PowerPoint PPT Presentation

Joins, and dates/times Steve Bagley somgen223.stanford.edu 1 Joining data frames It is common to have related data in two or more data frames. It may be more convenient to have all the data in a single data frame for analysis and for


slide-1
SLIDE 1

Joins, and dates/times

Steve Bagley

somgen223.stanford.edu 1

slide-2
SLIDE 2

Joining data frames

  • It is common to have related data in two or more data frames.
  • It may be more convenient to have all the data in a single data frame for analysis

and for plotting.

  • Merging data this way is called “joining.”

somgen223.stanford.edu 2

slide-3
SLIDE 3

Getting some data to join

(gene_exp1 <- read_csv(str_c(data_dir, "gene_exp1.csv"))) # A tibble: 3 x 3 gene control treatment <chr> <dbl> <dbl> 1 ABC123 1 2 DEF234 10 3 3 GKK7 12 13 (gene_length <- read_csv(str_c(data_dir, "gene_length.csv"))) # A tibble: 3 x 2 gene length <chr> <dbl> 1 XYZ3 13 2 ABC123 100 3 GKK7 2001

somgen223.stanford.edu 3

slide-4
SLIDE 4

inner_join

inner_join(gene_exp1, gene_length, by = "gene") # A tibble: 2 x 4 gene control treatment length <chr> <dbl> <dbl> <dbl> 1 ABC123 1 100 2 GKK7 12 13 2001

  • by specifies the “key”: which columns to use to control the join.
  • The rows in both data frames will be aligned using the by column.
  • A row is included in the inner join if its key appears in both data frames. Note:

this might throw away a lot of rows.

  • The join result includes any column that appears in either data frame.

somgen223.stanford.edu 4

slide-5
SLIDE 5

Exercise: explain this result

gene_tall <- gene_exp1 %>% gather(condition, expression_level, control:treatment) inner_join(gene_tall, gene_length, by = "gene") # A tibble: 4 x 4 gene condition expression_level length <chr> <chr> <dbl> <dbl> 1 ABC123 control 100 2 GKK7 control 12 2001 3 ABC123 treatment 1 100 4 GKK7 treatment 13 2001

somgen223.stanford.edu 5

slide-6
SLIDE 6

Answer: explain this result

  • Each gene appears twice in gene_tall.
  • The join operation aligns each copy with the row in gene_length, duplicating

the information in gene_length.

somgen223.stanford.edu 6

slide-7
SLIDE 7

full_join example

full_join(gene_tall, gene_length, by = "gene") # A tibble: 7 x 4 gene condition expression_level length <chr> <chr> <dbl> <dbl> 1 ABC123 control 100 2 DEF234 control 10 NA 3 GKK7 control 12 2001 4 ABC123 treatment 1 100 5 DEF234 treatment 3 NA 6 GKK7 treatment 13 2001 7 XYZ3 <NA> NA 13

somgen223.stanford.edu 7

slide-8
SLIDE 8

full_join explained

  • A key appears in the result if it appears in either data frame.
  • All the data from both data frames are included.
  • If data are missing from one data frame, then NA’s are inserted.
  • Make sure you understand why the result on the previous slide has 7 rows.

somgen223.stanford.edu 8

slide-9
SLIDE 9

semi_join

semi_join(gene_tall, gene_length, by = "gene") # A tibble: 4 x 3 gene condition expression_level <chr> <chr> <dbl> 1 ABC123 control 2 GKK7 control 12 3 ABC123 treatment 1 4 GKK7 treatment 13

  • Result includes all rows of gene_tall that have a key in gene_length.

somgen223.stanford.edu 9

slide-10
SLIDE 10

anti_join

anti_join(gene_tall, gene_length, by = "gene") # A tibble: 2 x 3 gene condition expression_level <chr> <chr> <dbl> 1 DEF234 control 10 2 DEF234 treatment 3

  • Results includes all rows of gene_tall that do not have a key in gene_length.

somgen223.stanford.edu 10

slide-11
SLIDE 11

filtering joins

  • semi_join and anti_join are filtering joins: they filter rows (of the first

argument).

  • They do not include any new columns.

somgen223.stanford.edu 11

slide-12
SLIDE 12

Dates and times

somgen223.stanford.edu 12

slide-13
SLIDE 13

Dates and times

  • Dates and times are complicated: leap years, month/day/year vs

day/month/year vs …, time zones, daylight saving time, leap seconds, 12-hour vs 24-hour format, ….

somgen223.stanford.edu 13

slide-14
SLIDE 14

Examples

parse_date("2015-11-10") [1] "2015-11-10" parse_date("10/11/2015", format = "%m/%d/%Y") [1] "2015-10-11"

  • Dates and times come in many different formats.
  • parse_date takes a format argument that uses a special pattern code for

identifying what is expected, and in what order.

  • See ?parse_date for details.

somgen223.stanford.edu 14

slide-15
SLIDE 15

Examples

parse_datetime("10/11/2015", format = "%m/%d/%Y") [1] "2015-10-11 UTC" parse_datetime("2015-11-10") [1] "2015-11-10 UTC" parse_datetime("10/11/2015 13:45:09", format = "%m/%d/%Y %H:%M:%S") [1] "2015-10-11 13:45:09 UTC" parse_datetime("10/11/2015 13:45:09 America/Los_Angeles", format = "%m/%d/%Y %H:%M:%S %Z") [1] "2015-10-11 20:45:09 UTC"

  • parse_datetime works on dates with times (and time zones).
  • See ?parse_datetime for details.

somgen223.stanford.edu 15

slide-16
SLIDE 16

Reading

  • Read: 13 Relational data | R for Data Science
  • Read: 16 Dates and times | R for Data Science

somgen223.stanford.edu 16