Welcome to the course! Joining Data in R with dplyr

SLIDE 1

JOINING DATA IN R WITH DPLYR

Welcome to the course!

SLIDE 2

Var_1  Var_2  Var_3  Var_4
obs_1     33      3     54
obs_2     20     90     22
obs_3     58     12     15
obs_4     83     81      5

> mean(df$Var_2)
[1] 48.5

SLIDE 3

Var_1  Var_2  Var_3  Var_4
obs_1     33      3     54
obs_2     20     90     22
obs_3     58     12     15
obs_4     83     81      5

> df$Var_5 <- df$Var_2 + df$Var_4

Var_1  Var_2  Var_3  Var_4  Var_5
obs_1     33      3     54     87
obs_2     20     90     22     42
obs_3     58     12     15     73
obs_4     83     81      5     88
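
The slides print df but never show how it is built. A minimal sketch, assuming Var_1 holds the observation labels and the other columns hold the numbers in the tables above (the data.frame() call is an illustration, not part of the original deck):

```r
# Rebuild the example data frame. The construction is an assumption;
# the slides only show the printed table.
df <- data.frame(
  Var_1 = c("obs_1", "obs_2", "obs_3", "obs_4"),
  Var_2 = c(33, 20, 58, 83),
  Var_3 = c(3, 90, 12, 81),
  Var_4 = c(54, 22, 15, 5),
  stringsAsFactors = FALSE
)

# Summarise a single column.
mean(df$Var_2)
#> [1] 48.5

# Combine two columns of the same data frame into a new column.
df$Var_5 <- df$Var_2 + df$Var_4
df$Var_5
#> [1] 87 42 73 88
```

Both operations work because every variable lives in one data frame, which is the situation joins are meant to create.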

SLIDE 5

Course outline

  • Chapter 1 - Mutating joins
  • Chapter 2 - Filtering joins and set operations
  • Chapter 3 - Assembling data
  • Chapter 4 - Advanced joining
  • Chapter 5 - Case study

SLIDE 6

  • arrange()
  • filter()
  • select()
  • mutate()
  • summarise()
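
Each of these five verbs works on a single table. As a rough illustration, assuming the df values from the opening slides (the data frame construction and the column name mean_Var_2 are hypothetical, not from the original):

```r
library(dplyr)

# Assumed reconstruction of the example data frame from the opening slides.
df <- data.frame(
  Var_1 = c("obs_1", "obs_2", "obs_3", "obs_4"),
  Var_2 = c(33, 20, 58, 83),
  Var_3 = c(3, 90, 12, 81),
  Var_4 = c(54, 22, 15, 5),
  stringsAsFactors = FALSE
)

arrange(df, Var_2)                       # reorder rows by Var_2, ascending
filter(df, Var_3 > 10)                   # keep rows where Var_3 exceeds 10
select(df, Var_1, Var_4)                 # keep only the named columns
mutate(df, Var_5 = Var_2 + Var_4)        # add a column computed from others
summarise(df, mean_Var_2 = mean(Var_2))  # collapse to a one-row summary
```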

SLIDE 7

merge()

SLIDE 8

Benefits of dplyr join functions

  • Always preserve row order
  • Intuitive syntax
  • Can be applied to databases, Spark, etc.
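
To make the row-order point concrete, the sketch below compares base R merge() with left_join() on the names and plays tables used later in the deck. The data.frame() calls are assumed reconstructions of those tables, and the comparison covers only this one left join:

```r
library(dplyr)

# Assumed reconstructions of the names and plays tables from the slides.
names <- data.frame(
  name = c("Mick", "John", "Paul"),
  band = c("Stones", "Beatles", "Beatles"),
  stringsAsFactors = FALSE
)
plays <- data.frame(
  name  = c("John", "Paul", "Keith"),
  plays = c("Guitar", "Bass", "Guitar"),
  stringsAsFactors = FALSE
)

# Base R: merge() sorts the result on the key column by default (sort = TRUE).
merged <- merge(names, plays, by = "name", all.x = TRUE)
merged$name
#> [1] "John" "Mick" "Paul"

# dplyr: left_join() returns the rows in the order of the first table.
joined <- left_join(names, plays, by = "name")
joined$name
#> [1] "Mick" "John" "Paul"
```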

SLIDE 10

Let’s practice!

SLIDE 11

Keys

SLIDE 13

> names
  name    band
1 Mick  Stones
2 John Beatles
3 Paul Beatles

> plays
   name  plays
1  John Guitar
2  Paul   Bass
3 Keith Guitar

# Example join output
   name    band  plays
1  Mick  Stones   <NA>
2  John Beatles Guitar
3  Paul Beatles   Bass
4 Keith    <NA> Guitar

SLIDE 14

Keys

primary key    foreign key

SLIDE 15

Keys

> names2
  name   surname    band
1 John  Coltrane      NA
2 John    Lennon Beatles
3 Paul McCartney Beatles

> plays2
   name   surname  plays
1  John    Lennon Guitar
2  Paul McCartney   Bass
3 Keith  Richards Guitar

# Example join output
   name   surname    band  plays
1  John  Coltrane    <NA>   <NA>
2  John    Lennon Beatles Guitar
3  Paul McCartney Beatles   Bass
4 Keith  Richards    <NA> Guitar

primary key    foreign key
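
A primary key has to identify each row of its table uniquely. One way to check this in base R, sketched here with assumed reconstructions of names and names2 (the uniqueness test via anyDuplicated() is an illustration, not something the slides show):

```r
# Assumed reconstructions of two of the tables shown in the slides.
names <- data.frame(
  name = c("Mick", "John", "Paul"),
  band = c("Stones", "Beatles", "Beatles"),
  stringsAsFactors = FALSE
)
names2 <- data.frame(
  name    = c("John", "John", "Paul"),
  surname = c("Coltrane", "Lennon", "McCartney"),
  band    = c(NA, "Beatles", "Beatles"),
  stringsAsFactors = FALSE
)

# anyDuplicated() returns 0 when every value (or every row) is unique.
anyDuplicated(names$name) == 0
#> [1] TRUE

# In names2 the name column alone repeats "John", so it cannot serve as a key.
anyDuplicated(names2$name) == 0
#> [1] FALSE

# The combination of name and surname is unique, so the two columns
# together can act as the key.
anyDuplicated(names2[c("name", "surname")]) == 0
#> [1] TRUE
```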

SLIDE 16

Let’s practice!

SLIDE 17

Joins

SLIDE 18

left_join()

> left_join(names, plays, by = "name")

names        table to augment
plays        table to augment with
by = "name"  key column name(s) as a character string

> names
  name    band
1 Mick  Stones
2 John Beatles
3 Paul Beatles

> plays
   name  plays
1  John Guitar
2  Paul   Bass
3 Keith Guitar

  name    band  plays
1 Mick  Stones   <NA>
2 John Beatles Guitar
3 Paul Beatles   Bass

Result: rows from first table, values from second table

SLIDE 19

Multi-column keys

> left_join(names2, plays2, by = c("name", "surname"))
  name   surname    band  plays
1 John  Coltrane    <NA>   <NA>
2 John    Lennon Beatles Guitar
3 Paul McCartney Beatles   Bass

by = c("name", "surname")  vector of key column name(s)
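
The sketch below shows why the second key column matters, using assumed reconstructions of names2 and plays2 (the variable names by_name and by_both are hypothetical). Joining on name alone lets the two Johns collide:

```r
library(dplyr)

# Assumed reconstructions of the names2 and plays2 tables from the slides.
names2 <- data.frame(
  name    = c("John", "John", "Paul"),
  surname = c("Coltrane", "Lennon", "McCartney"),
  band    = c(NA, "Beatles", "Beatles"),
  stringsAsFactors = FALSE
)
plays2 <- data.frame(
  name    = c("John", "Paul", "Keith"),
  surname = c("Lennon", "McCartney", "Richards"),
  plays   = c("Guitar", "Bass", "Guitar"),
  stringsAsFactors = FALSE
)

# Joining on name alone treats both Johns as the same person, so
# John Coltrane wrongly picks up John Lennon's instrument.
by_name <- left_join(names2, plays2, by = "name")
by_name$plays
#> [1] "Guitar" "Guitar" "Bass"

# Joining on both columns matches each person exactly.
by_both <- left_join(names2, plays2, by = c("name", "surname"))
by_both$plays
#> [1] NA       "Guitar" "Bass"
```

With by = "name" the result also carries two surname columns (surname.x and surname.y), another sign that the key was underspecified.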

SLIDE 20

right_join()

> right_join(names, plays, by = "name")
   name    band  plays
1  John Beatles Guitar
2  Paul Beatles   Bass
3 Keith    <NA> Guitar

Result: values from first table, rows from second table

SLIDE 21

"tables"

  • data frames
  • tibbles (tbl_df)
  • tbl references

SLIDE 22

tibble vs. data frame

> # A data frame
> mtcars
> # entire data frame prints, leaving only last values of last columns (which have been wrapped around to appear below the first columns) visible in the window, as below
Camaro Z28          3    4
Pontiac Firebird    3    2
Fiat X1-9           4    1
Porsche 914-2       5    2
Lotus Europa        5    2
Ford Pantera L      5    4
Ferrari Dino        5    6
Maserati Bora       5    8
Volvo 142E          4    2

> library(tibble)
> as.tibble(mtcars)
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear
*  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4
 2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4
 3  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4
 4  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3
 5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3
 6  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3
 7  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3
 8  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4
 9  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4
# ... with 22 more rows, and 1 more variables: carb <dbl>
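
A short sketch of the same conversion with the current tibble API. Note that as.tibble(), as used on the slide, has been deprecated since tibble 2.0.0 in favour of as_tibble(); the variable name cars_tbl is hypothetical:

```r
library(tibble)

# as_tibble() is the current spelling of the as.tibble() call on the slide.
cars_tbl <- as_tibble(mtcars)

# A tibble is still a data frame, with extra classes that change how it prints.
class(mtcars)
#> [1] "data.frame"
class(cars_tbl)
#> [1] "tbl_df"     "tbl"        "data.frame"

# Printing shows only the first 10 rows and the columns that fit the console.
cars_tbl
```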

SLIDE 24

Let’s practice!

SLIDE 25

Mutating joins

SLIDE 26

mutate()

> pressure[1:4, ]
  temperature pressure
1           0   0.0002
2          20   0.0012
3          40   0.0060
4          60   0.0300

> mutate(pressure[1:4, ], fahrenheit = temperature * 1.8 + 32)
  temperature pressure fahrenheit
1           0   0.0002         32
2          20   0.0012         68
3          40   0.0060        104
4          60   0.0300        140

SLIDE 27

left_join()

> left_join(names, plays, by = "name")
  name    band  plays
1 Mick  Stones   <NA>
2 John Beatles Guitar
3 Paul Beatles   Bass

SLIDE 28

right_join()

> right_join(names, plays, by = "name")
   name    band  plays
1  John Beatles Guitar
2  Paul Beatles   Bass
3 Keith    <NA> Guitar

SLIDE 29

inner_join()

> inner_join(names, plays, by = "name")
  name    band  plays
1 John Beatles Guitar
2 Paul Beatles   Bass

SLIDE 30

full_join()

> full_join(names, plays, by = "name")
   name    band  plays
1  Mick  Stones   <NA>
2  John Beatles Guitar
3  Paul Beatles   Bass
4 Keith    <NA> Guitar
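
The four mutating joins differ only in which rows they keep. A compact way to see this, assuming the same reconstructed names and plays tables (the data.frame() calls are not from the slides):

```r
library(dplyr)

# Assumed reconstructions of the names and plays tables from the slides.
names <- data.frame(
  name = c("Mick", "John", "Paul"),
  band = c("Stones", "Beatles", "Beatles"),
  stringsAsFactors = FALSE
)
plays <- data.frame(
  name  = c("John", "Paul", "Keith"),
  plays = c("Guitar", "Bass", "Guitar"),
  stringsAsFactors = FALSE
)

nrow(left_join(names, plays, by = "name"))   # 3: every row of names
nrow(right_join(names, plays, by = "name"))  # 3: every row of plays
nrow(inner_join(names, plays, by = "name"))  # 2: only names found in both tables
nrow(full_join(names, plays, by = "name"))   # 4: every name from either table
```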

SLIDE 31

Syntax

> left_join( names, plays, by = "name")
> right_join(names, plays, by = "name")
> inner_join(names, plays, by = "name")
> full_join( names, plays, by = "name")

Arguments: x, y, by

%>%

SLIDE 32

Pipe operator

> x <- 1:10
> sum(x)
[1] 55
> x %>% sum()
[1] 55

> abs(diff(range(x)))
[1] 9
> x %>%
+   range() %>%
+   diff() %>%
+   abs()
[1] 9
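
Because the pipe passes its left-hand side as the first argument of the next call, a piped join is the same call as the nested form. A small check, assuming the reconstructed names and plays tables (the variable names piped and nested are hypothetical):

```r
library(dplyr)

# Assumed reconstructions of the names and plays tables from the slides.
names <- data.frame(
  name = c("Mick", "John", "Paul"),
  band = c("Stones", "Beatles", "Beatles"),
  stringsAsFactors = FALSE
)
plays <- data.frame(
  name  = c("John", "Paul", "Keith"),
  plays = c("Guitar", "Bass", "Guitar"),
  stringsAsFactors = FALSE
)

# names becomes the x argument; plays is supplied explicitly as y.
piped  <- names %>% left_join(plays, by = "name")
nested <- left_join(names, plays, by = "name")

identical(piped, nested)
#> [1] TRUE
```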

SLIDE 33

dplyr and pipes

> names %>%
+   full_join(plays, by = "name") %>%
+   mutate(missing_info = is.na(band) | is.na(plays)) %>%
+   filter(missing_info == TRUE) %>%
+   select(name, band, plays)
   name   band  plays
1  Mick Stones   <NA>
2 Keith   <NA> Guitar

SLIDE 34

Summary

  • left_join()
  • right_join()
  • inner_join()
  • full_join()

SLIDE 35

Let’s practice!