
Joins, and dates/times
Steve Bagley
somgen223.stanford.edu



  1. Joins, and dates/times (Steve Bagley)

  2. Joining data frames
     • It is common to have related data in two or more data frames.
     • It may be more convenient to have all the data in a single data frame for analysis and for plotting.
     • Merging data this way is called “joining.”

  3. Getting some data to join

         (gene_exp1 <- read_csv(str_c(data_dir, "gene_exp1.csv")))
         # A tibble: 3 x 3
           gene   control treatment
           <chr>    <dbl>     <dbl>
         1 ABC123     100         1
         2 DEF234      10         3
         3 GKK7      2001         0

         (gene_length <- read_csv(str_c(data_dir, "gene_length.csv")))
         # A tibble: 3 x 2
           gene   length
           <chr>   <dbl>
         1 XYZ3       13
         2 ABC123     13
         3 GKK7       12
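
For readers without the course's CSV files, the two tables above can be rebuilt inline. This is a minimal sketch: the values are transcribed from the printed tibbles on this slide, and `tribble()` stands in for the `read_csv()` calls.

```r
library(tibble)

# Inline copies of the two tables printed on the slide.
# Assumption: values are transcribed from the slide output, not read from
# the original gene_exp1.csv / gene_length.csv files.
gene_exp1 <- tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",      100,          1,
  "DEF234",       10,          3,
  "GKK7",       2001,          0
)

gene_length <- tribble(
  ~gene,    ~length,
  "XYZ3",        13,
  "ABC123",      13,
  "GKK7",        12
)

# Keys shared by both tables; these are the rows an inner join can keep.
shared_genes <- intersect(gene_exp1$gene, gene_length$gene)
```

Only `ABC123` and `GKK7` appear in both tables, which is what makes this pair useful for comparing the different kinds of join.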

  4. inner_join

         inner_join(gene_exp1, gene_length, by = "gene")
         # A tibble: 2 x 4
           gene   control treatment length
           <chr>    <dbl>     <dbl>  <dbl>
         1 ABC123     100         1     13
         2 GKK7      2001         0     12

     • by specifies the “key”: which columns to use to control the join.
     • The rows in both data frames will be aligned using the by column.
     • A row is included in the inner join if its key appears in both data frames. Note: this might throw away a lot of rows.
     • The join result includes any column that appears in either data frame.
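
Because an inner join can silently drop rows, it may help to check which keys were lost. The sketch below rebuilds the two tables inline (values transcribed from the slides, an assumption in place of the CSV files) and lists the keys that appear in only one of them.

```r
library(dplyr)
library(tibble)

# Assumption: inline copies of the slide data instead of the course CSV files.
gene_exp1 <- tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",      100,          1,
  "DEF234",       10,          3,
  "GKK7",       2001,          0
)
gene_length <- tribble(
  ~gene,    ~length,
  "XYZ3",        13,
  "ABC123",      13,
  "GKK7",        12
)

joined <- inner_join(gene_exp1, gene_length, by = "gene")

# Keys present in only one table do not survive the inner join.
dropped_from_exp <- setdiff(gene_exp1$gene, gene_length$gene)   # "DEF234"
dropped_from_len <- setdiff(gene_length$gene, gene_exp1$gene)   # "XYZ3"
```

Comparing `nrow(joined)` with `nrow(gene_exp1)` before continuing an analysis is a cheap way to notice unexpected losses.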

  5. Exercise: explain this result

         gene_tall <- gene_exp1 %>%
           gather(condition, expression_level, control:treatment)
         inner_join(gene_tall, gene_length, by = "gene")
         # A tibble: 4 x 4
           gene   condition expression_level length
           <chr>  <chr>                <dbl>  <dbl>
         1 ABC123 control                100     13
         2 GKK7   control               2001     12
         3 ABC123 treatment                1     13
         4 GKK7   treatment                0     12

  6. Answer: explain this result
     • Each gene appears twice in gene_tall.
     • The join operation aligns each copy with the row in gene_length, duplicating the information in gene_length.
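
The duplication described in the answer can be checked by counting rows per gene after the join. This is a sketch using inline copies of the slide data (an assumption in place of the CSV files) and the same `gather()` call as the exercise.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Assumption: inline copies of the slide data instead of the course CSV files.
gene_exp1 <- tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",      100,          1,
  "DEF234",       10,          3,
  "GKK7",       2001,          0
)
gene_length <- tribble(
  ~gene,    ~length,
  "XYZ3",        13,
  "ABC123",      13,
  "GKK7",        12
)

# One row per gene and condition, as in the exercise.
gene_tall <- gene_exp1 %>%
  gather(condition, expression_level, control:treatment)

joined_tall <- inner_join(gene_tall, gene_length, by = "gene")

# Each surviving gene shows up once per condition, so its single length
# value from gene_length is repeated on both rows.
per_gene <- count(joined_tall, gene, length)
```

Each row of `per_gene` has `n` equal to 2: one `length` value per gene, carried onto both the control and the treatment row.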

  7. full_join example

         full_join(gene_tall, gene_length, by = "gene")
         # A tibble: 7 x 4
           gene   condition expression_level length
           <chr>  <chr>                <dbl>  <dbl>
         1 ABC123 control                100     13
         2 DEF234 control                 10     NA
         3 GKK7   control               2001     12
         4 ABC123 treatment                1     13
         5 DEF234 treatment                3     NA
         6 GKK7   treatment                0     12
         7 XYZ3   <NA>                    NA     13

  8. full_join explained
     • A key appears in the result if it appears in either data frame.
     • All the data from both data frames are included.
     • If data are missing from one data frame, then NAs are inserted.
     • Make sure you understand why the result on the previous slide has 7 rows.
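
One way to see where the 7 rows come from is to count the two contributions separately: every row of gene_tall, plus the rows of gene_length whose key never appears in gene_tall. The sketch below uses inline copies of the slide data (an assumption); the simple sum holds here because each gene appears at most once in gene_length.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Assumption: inline copies of the slide data instead of the course CSV files.
gene_exp1 <- tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",      100,          1,
  "DEF234",       10,          3,
  "GKK7",       2001,          0
)
gene_length <- tribble(
  ~gene,    ~length,
  "XYZ3",        13,
  "ABC123",      13,
  "GKK7",        12
)
gene_tall <- gene_exp1 %>%
  gather(condition, expression_level, control:treatment)

full <- full_join(gene_tall, gene_length, by = "gene")

# Rows of gene_length with no partner in gene_tall (here only XYZ3).
only_in_length <- anti_join(gene_length, gene_tall, by = "gene")

# 6 rows from gene_tall + 1 unmatched row from gene_length.
expected_rows <- nrow(gene_tall) + nrow(only_in_length)
```

The two DEF234 rows get `NA` for `length`, and the XYZ3 row gets `NA` for both `condition` and `expression_level`.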

  9. semi_join

         semi_join(gene_tall, gene_length, by = "gene")
         # A tibble: 4 x 3
           gene   condition expression_level
           <chr>  <chr>                <dbl>
         1 ABC123 control                100
         2 GKK7   control               2001
         3 ABC123 treatment                1
         4 GKK7   treatment                0

     • Result includes all rows of gene_tall that have a key in gene_length.

  10. anti_join

          anti_join(gene_tall, gene_length, by = "gene")
          # A tibble: 2 x 3
            gene   condition expression_level
            <chr>  <chr>                <dbl>
          1 DEF234 control                 10
          2 DEF234 treatment                3

      • The result includes all rows of gene_tall that do not have a key in gene_length.

  11. Filtering joins
      • semi_join and anti_join are filtering joins: they filter rows (of the first argument).
      • They do not include any new columns.
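
Since filtering joins only keep or drop rows of the first argument, they can be compared with an ordinary `filter()` on the key column. The sketch below uses inline copies of the slide data (an assumption) and a single-column key; with multi-column keys the `%in%` form would not carry over directly.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Assumption: inline copies of the slide data instead of the course CSV files.
gene_exp1 <- tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",      100,          1,
  "DEF234",       10,          3,
  "GKK7",       2001,          0
)
gene_length <- tribble(
  ~gene,    ~length,
  "XYZ3",        13,
  "ABC123",      13,
  "GKK7",        12
)
gene_tall <- gene_exp1 %>%
  gather(condition, expression_level, control:treatment)

semi <- semi_join(gene_tall, gene_length, by = "gene")
anti <- anti_join(gene_tall, gene_length, by = "gene")

# The same row selections written as plain filters on the key.
semi_by_filter <- filter(gene_tall, gene %in% gene_length$gene)
anti_by_filter <- filter(gene_tall, !(gene %in% gene_length$gene))
```

Together `semi` and `anti` partition `gene_tall`: every row lands in exactly one of them, and neither gains the `length` column.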

  12. Dates and times

  13. Dates and times
      • Dates and times are complicated: leap years, month/day/year vs day/month/year vs …, time zones, daylight saving time, leap seconds, 12-hour vs 24-hour format, …

  14. Examples

          parse_date("2015-11-10")
          [1] "2015-11-10"
          parse_date("10/11/2015", format = "%m/%d/%Y")
          [1] "2015-10-11"

      • Dates and times come in many different formats.
      • parse_date takes a format argument that uses a special pattern code for identifying what is expected, and in what order.
      • See ?parse_date for details.
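
The format argument matters most when the input is ambiguous. As a small sketch, the same string is parsed below under two different patterns, and a third call uses `%B` for a full month name; the input strings are illustrative and not taken from the course data.

```r
library(readr)

# The same text read as month/day/year and as day/month/year.
us_style <- parse_date("10/11/2015", format = "%m/%d/%Y")  # October 11, 2015
eu_style <- parse_date("10/11/2015", format = "%d/%m/%Y")  # November 10, 2015

# %B matches a full month name (English names in the default locale).
named <- parse_date("November 10, 2015", format = "%B %d, %Y")
```

Nothing in the string itself says which reading is right, so the format has to come from knowledge of how the data were recorded.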

  15. Examples

          parse_datetime("2015-11-10")
          [1] "2015-11-10 UTC"
          parse_datetime("10/11/2015", format = "%m/%d/%Y")
          [1] "2015-10-11 UTC"
          parse_datetime("10/11/2015 13:45:09", format = "%m/%d/%Y %H:%M:%S")
          [1] "2015-10-11 13:45:09 UTC"
          parse_datetime("10/11/2015 13:45:09 America/Los_Angeles", format = "%m/%d/%Y %H:%M:%S %Z")
          [1] "2015-10-11 20:45:09 UTC"

      • parse_datetime works on dates with times (and time zones).
      • See ?parse_datetime for details.
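
When the time zone is not written in the string itself, one option is to supply it through readr's `locale()` argument. The sketch below parses the same clock time as the slide's example as Los Angeles local time and then prints it in UTC; it assumes every value in the input shares one known zone.

```r
library(readr)

# No zone in the text, so tell readr which zone the clock times refer to.
local_time <- parse_datetime(
  "10/11/2015 13:45:09",
  format = "%m/%d/%Y %H:%M:%S",
  locale = locale(tz = "America/Los_Angeles")
)

# The same instant expressed in UTC (Los Angeles is 7 hours behind UTC
# on this date because daylight saving time is in effect).
as_utc <- format(local_time, tz = "UTC", usetz = TRUE)
```

The stored instant is identical to the one produced by the `%Z` example above; only the zone used for display differs.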

  16. Reading
      • Read: R for Data Science, chapter 13, “Relational data”
      • Read: R for Data Science, chapter 16, “Dates and times”
