The tidyverse, September 2016. Hadley Wickham (@hadleywickham), Chief Scientist, RStudio

slide-1
SLIDE 1

Hadley Wickham 
 @hadleywickham
 Chief Scientist, RStudio

The tidyverse

September 2016
slide-2
SLIDE 2

Import
Tidy: Consistent way of storing data
Transform: Create new variables & new summaries
Visualise: Surprises, but doesn't scale
Model: Scales, but doesn't (fundamentally) surprise
Communicate
Program

slide-3
SLIDE 3

No matter how complex and polished the individual operations are, it is often the quality of the glue that most directly determines the power of the system. — Hal Abelson

slide-4
SLIDE 4

Tidy Import Visualise Transform Model Communicate Program

slide-5
SLIDE 5

Tidy Import Visualise Transform Model Communicate Program

slide-6
SLIDE 6

The tidy tools manifesto

slide-7
SLIDE 7

Import: readr, readxl, haven, httr, jsonlite, DBI, rvest, xml2
Tidy: tibble, tidyr
Transform: dplyr, forcats, hms, lubridate, stringr
Visualise: ggplot2
Model: broom, modelr
Program: purrr, magrittr

http://r4ds.had.co.nz

tidyverse

slide-8
SLIDE 8
1. Share data structures.
2. Compose simple pieces.
3. Embrace FP.
4. Write for humans.

slide-9
SLIDE 9

1

Share data structures

slide-10
SLIDE 10
1. Put each dataset in a data frame.
2. Put each variable in a column.

Tidy data

slide-11
SLIDE 11

# A tibble: 5,769 × 22
    iso2  year   m04  m514  m014 m1524 m2534 m3544 m4554 m5564   m65    mu   f04  f514  f014 f1524
   <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1    AD  1989    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 2    AD  1990    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 3    AD  1991    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 4    AD  1992    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 5    AD  1993    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 6    AD  1994    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 7    AD  1996    NA    NA     0     0     0     4     1     0     0    NA    NA    NA     0     1
 8    AD  1997    NA    NA     0     0     1     2     2     1     6    NA    NA    NA     0     1
 9    AD  1998    NA    NA     0     0     0     1     0     0     0    NA    NA    NA    NA    NA
10    AD  1999    NA    NA     0     0     0     1     1     0     0    NA    NA    NA     0     0
11    AD  2000    NA    NA     0     0     1     0     0     0     0    NA    NA    NA    NA    NA
12    AD  2001    NA    NA     0    NA    NA     2     1    NA    NA    NA    NA    NA    NA    NA
13    AD  2002    NA    NA     0     0     0     1     0     0     0    NA    NA    NA     0     1
14    AD  2003    NA    NA     0     0     0     1     2     0     0    NA    NA    NA     0     1
15    AD  2004    NA    NA     0     0     0     1     1     0     0    NA    NA    NA     0     0
16    AD  2005     0     0     0     0     1     1     0     0     0     0     0     0     0     1
17    AD  2006     0     0     0     1     1     2     0     1     1     0     0     0     0     0
# ... with 5,752 more rows, and 6 more variables: f2534 <int>, f3544 <int>, f4554 <int>,
#   f5564 <int>, f65 <int>, fu <int>

Messy data has a varied “shape”

What are the variables in this dataset? (Hint: f = female, u = unknown, 1524 = 15-24)
slide-12
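SLIDE 12

library(readr)  # read_csv()
library(tidyr)  # gather(), separate()
library(dplyr)  # arrange(), rename(), %>%

read_csv("tb.csv") %>%
  # Collapse the count columns m04..fu into key/value pairs, dropping NAs
  gather(m04:fu, key = demo, value = n, na.rm = TRUE) %>%
  # Split e.g. "m1524" after the first character into sex and age
  separate(demo, c("sex", "age"), sep = 1) %>%
  arrange(iso2, year, sex, age) %>%
  rename(country = iso2)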

tidyr helps you tidy your messy data

slide-13
SLIDE 13

# A tibble: 35,750 × 5
   country  year   sex   age     n
     <chr> <int> <chr> <chr> <int>
 1      AD  1996     f   014     0
 2      AD  1996     f  1524     1
 3      AD  1996     f  2534     1
 4      AD  1996     f  3544     0
 5      AD  1996     f  4554     0
 6      AD  1996     f  5564     1
 7      AD  1996     f    65     0
 8      AD  1996     m   014     0
 9      AD  1996     m  1524     0
10      AD  1996     m  2534     0
# ... with 35,740 more rows

Tidy data has a uniform “shape”
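
A minimal sketch of the same reshaping on a tiny, made-up excerpt. The `tb_small` table and its values are illustrative, not the data behind the slides, and the sketch assumes a current tidyr, where `pivot_longer()` is the recommended successor to `gather()`:

```r
library(tibble)
library(tidyr)

# Hypothetical four-column excerpt in the "messy" wide layout
tb_small <- tribble(
  ~iso2, ~year, ~m014, ~m1524, ~f014,       ~f1524,
  "AD",  1996L, 0L,    0L,     0L,          1L,
  "AD",  1997L, 0L,    0L,     NA_integer_, 1L
)

# One row per country/year/sex/age combination; the column name is split
# into its first character (sex) and the remainder (age)
tb_tidy <- pivot_longer(
  tb_small,
  cols = m014:f1524,
  names_to = c("sex", "age"),
  names_pattern = "(.)(.+)",
  values_to = "n",
  values_drop_na = TRUE
)

tb_tidy
```

Every row now has the same meaning, so adding another age band adds rows rather than columns.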

slide-14
SLIDE 14

Sometimes you don’t have variables & cases

strings, dates, factors, matrices, vectors, xml, HTTP requests, HTTP responses

http://simplystatistics.org/2016/02/17/non-tidy-data/

slide-15
SLIDE 15

What if you have a mix of object types?

Training data: data frame
Test data: data frame
Model: lm
Predictions: vector
RMSE: scalar
Cross-validation
slide-16
SLIDE 16

Use a tibble with list-columns!

# A tibble: 100 x 5
            train           test   .id      mod      rmse
           <list>         <list> <chr>   <list>     <dbl>
 1 <S3: resample> <S3: resample>   001 <S3: lm> 0.5661605
 2 <S3: resample> <S3: resample>   002 <S3: lm> 0.2399357
 3 <S3: resample> <S3: resample>   003 <S3: lm> 3.5482986
 4 <S3: resample> <S3: resample>   004 <S3: lm> 0.2396810
 5 <S3: resample> <S3: resample>   005 <S3: lm> 0.1591336
 6 <S3: resample> <S3: resample>   006 <S3: lm> 0.1934869
 7 <S3: resample> <S3: resample>   007 <S3: lm> 0.2697834
 8 <S3: resample> <S3: resample>   008 <S3: lm> 0.4910886
 9 <S3: resample> <S3: resample>   009 <S3: lm> 1.7002645
10 <S3: resample> <S3: resample>   010 <S3: lm> 0.2047787
# ... with 90 more rows
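
One way such a table could be built, as a sketch only: the slide does not show its code, so the fold assignment, the `mpg ~ wt` model and the manual RMSE below are assumptions chosen to keep the example self-contained.

```r
library(tibble)
library(purrr)

set.seed(1)
# Assign each row of mtcars to one of five folds (illustrative choice of k)
folds <- sample(rep(1:5, length.out = nrow(mtcars)))

# Data frames live in list-columns next to the fold id
cv <- tibble(
  .id   = 1:5,
  train = map(.id, ~ mtcars[folds != .x, ]),
  test  = map(.id, ~ mtcars[folds == .x, ])
)

# Models go in another list-column; the RMSE is an ordinary double column
cv$mod  <- map(cv$train, ~ lm(mpg ~ wt, data = .x))
cv$rmse <- map2_dbl(
  cv$mod, cv$test,
  ~ sqrt(mean((.y$mpg - predict(.x, newdata = .y))^2))
)

cv
```

Keeping the training data, test data, model and score in one row means they cannot drift out of sync when the table is filtered or sorted.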
slide-17
SLIDE 17

df <- data.frame(xyz = "a")
# What does this return?
df$x

Your turn!

slide-18
SLIDE 18

df <- data.frame(xyz = "a")
# What does this return?
df$x
#> [1] a
#> Levels: a

Your turn!

Two surprises: partial name matching & stringsAsFactors
slide-19
SLIDE 19

Two important tensions for understanding base R

Interactive exploration vs. programming
Conservative vs. utopian

slide-20
SLIDE 20

df <- tibble(xyz = "a")
df$xyz
#> [1] "a"
is.data.frame(df[, "xyz"])
#> [1] TRUE
df$x
#> Warning: Unknown column 'x'
#> NULL

Tibbles are data frames that are lazy & surly
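
A side-by-side sketch of the two behaviours. `stringsAsFactors = FALSE` is set explicitly here so the comparison isolates partial matching and subsetting; since R 4.0.0 that is also the default, so the factor surprise shown earlier only appears on older versions.

```r
library(tibble)

df <- data.frame(xyz = "a", stringsAsFactors = FALSE)
tb <- tibble(xyz = "a")

# data.frame: $ partially matches "x" to "xyz"
df_partial <- df$x

# data.frame: selecting a single column with [ drops to a vector
df_keeps_frame <- is.data.frame(df[, "xyz"])

# tibble: [ always returns a tibble, and $ never partially matches
tb_keeps_frame <- is.data.frame(tb[, "xyz"])
tb_partial <- suppressWarnings(tb$x)
```

Here "lazy" refers to tibbles doing less (no partial matching, no type conversion) and "surly" to their complaining more (a warning for the unknown column).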

slide-23
SLIDE 23

data.frame(x = list(1:2, 3:5))
#> Error: arguments imply differing number
#> of rows: 2, 3

data.frame(x = I(list(1:2, 3:5)))
#>         x
#> 1    1, 2
#> 2 3, 4, 5

tibble(x = list(1:2, 3:5))
#> # A tibble: 2 x 1
#>           x
#>      <list>
#> 1 <int [2]>
#> 2 <int [3]>

And work better with list-columns

slide-24
SLIDE 24

2

Compose simple pieces

slide-25
SLIDE 25

Goal: Solve complex problems by combining uniform pieces.

slide-26
SLIDE 26 https://www.flickr.com/photos/brunurb/13129057003
slide-27
SLIDE 27 http://brickartist.com/gallery/pc-magazine-computer/. CC-BY-NC
slide-28
SLIDE 28

%>%

magrittr::

slide-29
SLIDE 29

foo_foo <- little_bunny()

bop_on(
  scoop_up(
    hop_through(foo_foo, forest),
    field_mouse
  ),
  head
)

# vs

foo_foo %>%
  hop_through(forest) %>%
  scoop_up(field_mouse) %>%
  bop_on(head)
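
The bunny functions are illustrative and not defined anywhere, so here is the same contrast as a runnable sketch with base R functions (the input vector is an arbitrary example):

```r
library(magrittr)

x <- c(1, 4, 9)

# Nested calls read inside-out
nested <- round(mean(sqrt(x)), 1)

# The pipe passes the left-hand side as the first argument of the next call,
# so the steps read in the order they happen
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```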
slide-30
SLIDE 30

library(nycflights13)
library(dplyr)
library(ggplot2)

flights %>%
  group_by(date) %>%
  summarise(n = n()) %>%
  ggplot(aes(date, n)) +
  geom_line()

Consistency across packages is important

😨

slide-31
SLIDE 31

ggplot(mtcars, aes(mpg, wt)) +
  geom_point() +
  geom_line() +
  ggsave("mtcars.pdf")

And ggplot2 is not even internally consistent

x

slide-32
SLIDE 32

ggsave(
  "mtcars.pdf",
  ggplot(mtcars, aes(mpg, wt)) +
    geom_point() +
    geom_line()
)

And ggplot2 is not even internally consistent

😲
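
A common way to sidestep the mismatch, sketched under the assumption that the plot is first stored in a variable: build with `+`, then pass the result to `ggsave()` through its `plot` argument. The temporary output path is an illustrative choice.

```r
library(ggplot2)

# Build the plot with + and keep the object
p <- ggplot(mtcars, aes(mpg, wt)) +
  geom_point() +
  geom_line()

# Save it explicitly; ggsave() takes the file name first and the plot second
out_file <- file.path(tempdir(), "mtcars.pdf")
ggsave(out_file, plot = p, width = 8, height = 6)
```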

slide-33
SLIDE 33

# devtools::install_github("hadley/ggplot1")
library(ggplot1)

ggsave(
  ggpoint(
    ggplot(
      mtcars,
      list(x = mpg, y = wt)
    )
  ),
  "mtcars.pdf",
  width = 8,
  height = 6
)

ggplot1 had a tidier API than ggplot2!

slide-34
SLIDE 34

library(ggplot1)

mtcars %>%
  ggplot(list(x = mpg, y = wt)) %>%
  ggpoint() %>%
  ggsave("mtcars.pdf", width = 8, height = 6)

So you can use the pipe with ggplot1

slide-35
SLIDE 35

[Plot: scatterplot of mtcars with mpg on the x-axis (10 to 35) and wt on the y-axis (2 to 5)]
slide-36
SLIDE 36

library(rvest)
library(purrr)
library(readr)
library(dplyr)
library(lubridate)

read_html("https://www.massshootingtracker.org/data") %>%
  html_nodes("a[href^='https://docs.goo']") %>%
  html_attr("href") %>%
  map_df(read_csv) %>%
  mutate(date = mdy(date)) -> shootings

One small example from Bob Rudis

https://rud.is/b/2016/07/26
slide-37
SLIDE 37

3

Embrace FP

Why are for loops “bad”? Answered with cupcakes
slide-38
SLIDE 38

1 cup flour
a scant ¾ cup sugar
1 ½ t baking powder
3 T unsalted butter
½ cup whole milk
1 egg
¼ t pure vanilla extract

Preheat oven to 350°F. Put the flour, sugar, baking powder, salt, and butter in a freestanding electric mixer with a paddle attachment and beat on slow speed until you get a sandy consistency and everything is combined. Whisk the milk, egg, and vanilla together in a pitcher, then slowly pour about half into the flour mixture, beat to combine, and turn the mixer up to high speed to get rid of any lumps. Turn the mixer down to a slower speed and slowly pour in the remaining milk mixture. Continue mixing for a couple of more minutes until the batter is smooth but do not overmix. Spoon the batter into paper cases until 2/3 full and bake in the preheated oven for 20-25 minutes, or until the cake bounces back when touched.

Vanilla cupcakes

The hummingbird bakery cookbook
slide-39
SLIDE 39

¾ cup + 2T flour
2 ½ T cocoa powder
a scant ¾ cup sugar
1 ½ t baking powder
3 T unsalted butter
½ cup whole milk
1 egg
¼ t pure vanilla extract

Preheat oven to 350°F. Put the flour, cocoa, sugar, baking powder, salt, and butter in a freestanding electric mixer with a paddle attachment and beat on slow speed until you get a sandy consistency and everything is combined. Whisk the milk, egg, and vanilla together in a pitcher, then slowly pour about half into the flour mixture, beat to combine, and turn the mixer up to high speed to get rid of any lumps. Turn the mixer down to a slower speed and slowly pour in the remaining milk mixture. Continue mixing for a couple of more minutes until the batter is smooth but do not overmix. Spoon the batter into paper cases until 2/3 full and bake in the preheated oven for 20-25 minutes, or until the cake bounces back when touched.

Chocolate cupcakes

The hummingbird bakery cookbook
slide-42
SLIDE 42

120g flour
140g sugar
1.5 t baking powder
40g unsalted butter
120ml milk
1 egg
0.25 t pure vanilla extract

Preheat oven to 170°C. Put the flour, sugar, baking powder, salt, and butter in a freestanding electric mixer with a paddle attachment and beat on slow speed until you get a sandy consistency and everything is combined. Whisk the milk, egg, and vanilla together in a pitcher, then slowly pour about half into the flour mixture, beat to combine, and turn the mixer up to high speed to get rid of any lumps. Turn the mixer down to a slower speed and slowly pour in the remaining milk mixture. Continue mixing for a couple of more minutes until the batter is smooth but do not overmix. Spoon the batter into paper cases until 2/3 full and bake in the preheated oven for 20-25 minutes, or until the cake bounces back when touched.

Vanilla cupcakes

1. Convert units

The hummingbird bakery cookbook
slide-43
SLIDE 43

120g flour
140g sugar
1.5 t baking powder
40g butter
120ml milk
1 egg
0.25 t vanilla

Beat flour, sugar, baking powder, salt, and butter until sandy. Whisk milk, egg, and vanilla. Mix half into flour mixture until smooth (use high speed). Beat in remaining half. Mix until smooth. Bake 20-25 min at 170°C.

Vanilla cupcakes

2. Rely on domain knowledge

The hummingbird bakery cookbook
slide-44
SLIDE 44

120g flour
140g sugar
1.5 t baking powder
40g butter
120ml milk
1 egg
0.25 t vanilla

Beat dry ingredients + butter until sandy. Whisk together wet ingredients. Mix half into dry until smooth (use high speed). Beat in remaining half. Mix until smooth. Bake 20-25 min at 170°C.

Vanilla cupcakes

3. Use variables

The hummingbird bakery cookbook
slide-45
SLIDE 45

Vanilla:
120g flour
140g sugar
1.5t baking powder
40g butter
120ml milk
1 egg
0.25 t vanilla

Chocolate:
100g flour
20g cocoa
140g sugar
1.5t baking powder
40g butter
120ml milk
1 egg
0.25 t vanilla

Beat dry ingredients + butter until sandy. Whisk together wet ingredients. Mix half into dry until smooth (use high speed). Beat in remaining half. Mix until smooth. Bake 20-25 min at 170°C.

Cupcakes

4. Extract out common code
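
Carrying the analogy into code, as a playful sketch only: the function name, argument and unit suffixes below are invented for illustration. The shared ingredients are written once and only the part that differs between the two recipes is parameterised.

```r
# Ingredients common to both recipes (quantities as on the slide)
base_ingredients <- c(
  sugar_g = 140, baking_powder_t = 1.5, butter_g = 40,
  milk_ml = 120, egg = 1, vanilla_t = 0.25
)

# Hypothetical helper: only the dry base differs by flavour
cupcake_ingredients <- function(flavour = c("vanilla", "chocolate")) {
  flavour <- match.arg(flavour)
  dry <- switch(
    flavour,
    vanilla   = c(flour_g = 120),
    chocolate = c(flour_g = 100, cocoa_g = 20)
  )
  c(dry, base_ingredients)
}

cupcake_ingredients("chocolate")
```

With the duplication gone, the difference between the recipes is visible at a glance, which is the point the four steps build up to.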
slide-46
SLIDE 46
out1 <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  out1[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
}
out2 <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  out2[[i]] <- median(mtcars[[i]], na.rm = TRUE)
}

What do these for loops do?

slide-47
SLIDE 47
out1 <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  out1[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
}
out2 <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  out2[[i]] <- median(mtcars[[i]], na.rm = TRUE)
}

For loops emphasise the objects

slide-48
SLIDE 48
out1 <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  out1[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
}
out2 <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  out2[[i]] <- median(mtcars[[i]], na.rm = TRUE)
}

Not the actions

slide-49
SLIDE 49

library(purrr)

means <- map_dbl(mtcars, mean)
medians <- map_dbl(mtcars, median)

Functional programming emphasises the actions
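
A short check that the two styles compute the same thing, as a sketch. Extra arguments such as `na.rm = TRUE` are passed through `map_dbl()` to the function, and the result keeps the column names that the loop version loses.

```r
library(purrr)

# Loop version: allocation and indexing surround the one call that matters
out1 <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  out1[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
}

# Functional version: the action (mean) is the only thing that varies
means <- map_dbl(mtcars, mean, na.rm = TRUE)

all.equal(out1, unname(means))
```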

slide-50
SLIDE 50

sim <- tribble(
  ~f,      ~params,
  "runif", list(min = -1, max = 1),
  "rnorm", list(sd = 5),
  "rpois", list(lambda = 10)
)

sim %>%
  mutate(sim = invoke_map(f, params, n = 10))

Teaser: simulation
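
`invoke_map()` has since been deprecated in purrr, so here is a sketch of the same idea using `map2()` with base R's `do.call()`. This is a substitute technique rather than the code from the talk, and the seed is an arbitrary choice.

```r
library(tibble)
library(purrr)

sim <- tribble(
  ~f,      ~params,
  "runif", list(min = -1, max = 1),
  "rnorm", list(sd = 5),
  "rpois", list(lambda = 10)
)

set.seed(42)
# For each row, call the named function with n = 10 plus that row's parameters
sim$sim <- map2(sim$f, sim$params, ~ do.call(.x, c(list(n = 10), .y)))

sim
```

Each row holds a function name, its parameters and the resulting draws, so adding another distribution is a one-line change to the table.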

slide-51
SLIDE 51

reports <- tibble(
  class = unique(mpg$class),
  filename = paste0("fuel-economy-", class, ".html"),
  params = map(class, ~ list(my_class = .))
)

reports %>%
  select(output_file = filename, params) %>%
  pwalk(rmarkdown::render, input = "fuel-economy.Rmd")

Teaser: saving parameterised reports
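
The mechanism here is that `pwalk()` calls a function once per row, matching column names to argument names. A sketch that shows this without needing an R Markdown file: it swaps `rmarkdown::render()` for base R's `writeLines()`, whose arguments are `text` and `con`, and writes to a temporary directory. The class names and file names are illustrative.

```r
library(tibble)
library(purrr)

classes <- c("compact", "suv")

# Column names deliberately match the argument names of writeLines()
jobs <- tibble(
  text = paste("Report for class:", classes),
  con  = file.path(tempdir(), paste0("fuel-economy-", classes, ".txt"))
)

# One call per row: writeLines(text = ..., con = ...)
pwalk(jobs, writeLines)

readLines(jobs$con[[1]])
```

This is why the slide renames `filename` to `output_file` before the `pwalk()` call: the column names have to line up with the arguments of `rmarkdown::render()`.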

slide-52
SLIDE 52

4

Write for humans

slide-53
SLIDE 53

Programs must be written for people to read, and only incidentally for machines to execute. — Hal Abelson

slide-54
SLIDE 54

tibble lubridate forcats

filter mutate summarise arrange select

magrittr
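
A sketch of how these verb names read when chained. The dataset, threshold and derived column are arbitrary choices for illustration.

```r
library(dplyr)

by_cyl <- mtcars %>%
  filter(hp > 100) %>%                      # keep the more powerful cars
  mutate(kpl = mpg * 0.425) %>%             # approximate km per litre
  group_by(cyl) %>%
  summarise(mean_kpl = mean(kpl), n = n()) %>%
  arrange(desc(mean_kpl))                   # most economical group first

by_cyl
```

Each step is a verb that says what happens to the data, so the pipeline can be read aloud from top to bottom.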

slide-55
SLIDE 55

Conclusion

slide-56
SLIDE 56
1. Share data structures.
2. Compose simple pieces.
3. Embrace FP.
4. Write for humans.

slide-57
SLIDE 57

My goal is to make a pit of success

http://blog.codinghorror.com/falling-into-the-pit-of-success/
slide-58
SLIDE 58

install.packages("tidyverse")
library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages
#> filter(): dplyr, stats
#> lag(): dplyr, stats

Gotta install them all
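
The conflict lines mean that once dplyr is attached, a bare `filter()` or `lag()` refers to the dplyr version rather than the one in stats. A small sketch of calling each explicitly with `::` (the inputs are arbitrary examples):

```r
library(dplyr)

# dplyr's filter(): keep rows of a data frame that match a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)

# stats' filter(): linear filtering of a series, here a 3-point moving average
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
```

Writing the namespace out is only needed for the masked function; the dplyr versions are what a bare call resolves to after `library(tidyverse)`.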

slide-59
SLIDE 59

Import: readr, readxl, haven, httr, jsonlite, DBI, rvest, xml2
Tidy: tibble, tidyr
Transform: dplyr, forcats, hms, lubridate, stringr
Visualise: ggplot2
Model: broom, modelr, ???
Program: purrr, magrittr

http://r4ds.had.co.nz

tidyverse

slide-60
SLIDE 60

This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/us/