Managing many models February 2016 Hadley Wickham @hadleywickham



slide-1
SLIDE 1

Hadley Wickham 
 @hadleywickham
 Chief Scientist, RStudio

Managing 
 many models

February 2016

slide-2
SLIDE 2

There are 7 key components of data science

Import Transform Tidy Visualise Model Communicate

Understand

Automate

slide-3
SLIDE 3

Today I want to focus on understanding

Import Transform Tidy Visualise Model Communicate

Exploratory data analysis

Automate

slide-4
SLIDE 4

Gapminder data

slide-5
SLIDE 5

[Plot: lifeExp (about 40 to 80) against year (1950 to 2000) for 142 countries]

slide-6
SLIDE 6

One way to handle this is to fit a model to each country

New Zealand

year  lifeExp
1952  69.4
1957  70.3
1962  71.2
1967  71.5
...   ...

lm(lifeExp ~ year, data = nz)

glance: R² = 0.95

tidy: Intercept = -307.7, Slope = 0.19

augment:
year  resid
1952   0.70
1957   0.61
1962   0.63
1967  -0.05
...   ...

Broom, by David Robinson, makes this easy!
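The three broom verbs on this slide can be sketched as follows. This is a minimal example that assumes the data comes from the gapminder package on CRAN (the slides do not say how nz was built); nz_mod is a name introduced here for illustration.

```r
library(gapminder)  # assumed data source, not named on the slide
library(broom)

# One country's rows, then the same model as on the slide
nz <- subset(gapminder, country == "New Zealand")
nz_mod <- lm(lifeExp ~ year, data = nz)

glance(nz_mod)   # one-row model summary, including r.squared
tidy(nz_mod)     # one row per coefficient: (Intercept) and year
augment(nz_mod)  # one row per observation, with .fitted and .resid columns
```

Each verb returns a data frame, which is what makes the results easy to combine across many models later.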

slide-7
SLIDE 7

To do that for many countries, we need a list of data frames

Country      Year  LifeExp
Afghanistan  1952  28.9
Afghanistan  1957  30.3
Afghanistan  ...   ...
Albania      1952  55.2
Albania      1957  59.3
Albania      ...   ...
Algeria      ...   ...
...          ...   ...

slide-8
SLIDE 8

A nested data frame has one row per group

Country      Data
Afghanistan  <data>
Albania      <data>
Algeria      <data>
...          <data>

Each <data> cell holds that country's rows, for example:

Afghanistan          Albania
Year  LifeExp        Year  LifeExp
1952  28.9           1952  55.2
1957  30.3           1957  59.3
...   ...            ...   ...
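One way to build such a nested data frame is sketched below. It assumes the gapminder package as the data source and uses tidyr's nest(); the name by_country matches the one used later in the slides, but how the talk actually constructed it is not shown.

```r
library(gapminder)  # assumed data source
library(dplyr)
library(tidyr)

# One row per country; the remaining columns are packed into a list-column
by_country <- gapminder %>%
  group_by(country) %>%
  nest() %>%
  ungroup()

by_country            # columns: country, data
by_country$data[[1]]  # the data frame stored for the first country
```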

slide-9
SLIDE 9

We can use purrr::map() to fit each model

Country      Data
Afghanistan  <data>
Albania      <data>
Algeria      <data>
...          <data>

lm(lifeExp ~ year1950, data = afghanistan)
lm(lifeExp ~ year1950, data = albania)

map(by_country$data, ~ lm(lifeExp ~ year1950, data = .))

slide-10
SLIDE 10

Why for loops 
 are bad

A digression with cupcakes

slide-11
SLIDE 11

Why for loops
 are suboptimal

A digression with cupcakes

slide-12
SLIDE 12

1 cup flour, a scant ¾ cup sugar, 1 ½ t baking powder, 3 T unsalted butter, ½ cup whole milk, 1 egg, ¼ t pure vanilla extract

Preheat oven to 350°F. Put the flour, sugar, baking powder, salt, and butter in a freestanding electric mixer with a paddle attachment and beat on slow speed until you get a sandy consistency and everything is combined. Whisk the milk, egg, and vanilla together in a pitcher, then slowly pour about half into the flour mixture, beat to combine, and turn the mixer up to high speed to get rid of any lumps. Turn the mixer down to a slower speed and slowly pour in the remaining milk mixture. Continue mixing for a couple of more minutes until the batter is smooth but do not overmix. Spoon the batter into paper cases until 2/3 full and bake in the preheated oven for 20-25 minutes, or until the cake bounces back when touched.

Vanilla cupcakes

The hummingbird bakery cookbook

slide-13
SLIDE 13

¾ cup + 2T flour, 2 ½ T cocoa powder, a scant ¾ cup sugar, 1 ½ t baking powder, 3 T unsalted butter, ½ cup whole milk, 1 egg, ¼ t pure vanilla extract

Preheat oven to 350°F. Put the flour, cocoa, sugar, baking powder, salt, and butter in a freestanding electric mixer with a paddle attachment and beat on slow speed until you get a sandy consistency and everything is combined. Whisk the milk, egg, and vanilla together in a pitcher, then slowly pour about half into the flour mixture, beat to combine, and turn the mixer up to high speed to get rid of any lumps. Turn the mixer down to a slower speed and slowly pour in the remaining milk mixture. Continue mixing for a couple of more minutes until the batter is smooth but do not overmix. Spoon the batter into paper cases until 2/3 full and bake in the preheated oven for 20-25 minutes, or until the cake bounces back when touched.

Chocolate cupcakes

The hummingbird bakery cookbook

slide-15
SLIDE 15

df <- data.frame(...)

means <- double(ncol(df))
for (i in seq_along(df)) {
  means[[i]] <- mean(df[[i]], na.rm = TRUE)
}

medians <- double(ncol(df))
for (i in seq_along(df)) {
  medians[[i]] <- median(df[[i]], na.rm = TRUE)
}

For loops bury the lede

slide-17
SLIDE 17

1 cup flour, a scant ¾ cup sugar, 1 ½ t baking powder, 3 T unsalted butter, ½ cup whole milk, 1 egg, ¼ t pure vanilla extract

Preheat oven to 350°F. Put the flour, sugar, baking powder, salt, and butter in a freestanding electric mixer with a paddle attachment and beat on slow speed until you get a sandy consistency and everything is combined. Whisk the milk, egg, and vanilla together in a pitcher, then slowly pour about half into the flour mixture, beat to combine, and turn the mixer up to high speed to get rid of any lumps. Turn the mixer down to a slower speed and slowly pour in the remaining milk mixture. Continue mixing for a couple of more minutes until the batter is smooth but do not overmix. Spoon the batter into paper cases until 2/3 full and bake in the preheated oven for 20-25 minutes, or until the cake bounces back when touched.

Vanilla cupcakes

The hummingbird bakery cookbook

slide-18
SLIDE 18

120g flour, 140g sugar, 1.5 t baking powder, 40g unsalted butter, 120ml milk, 1 egg, 0.25 t pure vanilla extract

Preheat oven to 170°C. Put the flour, sugar, baking powder, salt, and butter in a freestanding electric mixer with a paddle attachment and beat on slow speed until you get a sandy consistency and everything is combined. Whisk the milk, egg, and vanilla together in a pitcher, then slowly pour about half into the flour mixture, beat to combine, and turn the mixer up to high speed to get rid of any lumps. Turn the mixer down to a slower speed and slowly pour in the remaining milk mixture. Continue mixing for a couple of more minutes until the batter is smooth but do not overmix. Spoon the batter into paper cases until 2/3 full and bake in the preheated oven for 20-25 minutes, or until the cake bounces back when touched.

Vanilla cupcakes

  • 1. Convert units

The hummingbird bakery cookbook

slide-19
SLIDE 19

120g flour 140g sugar 1.5 t baking powder 40g butter 120ml milk 1 egg 0.25 t vanilla

Beat flour, sugar, baking powder, salt, and butter until sandy. Whisk milk, egg, and vanilla. Mix half into flour mixture until smooth (use high speed). Beat in remaining half. Mix until smooth. Bake 20-25 min at 170°C.

Vanilla cupcakes

  • 2. Rely on domain knowledge

The hummingbird bakery cookbook

slide-20
SLIDE 20

df <- data.frame(...)

means <- double(ncol(df))
for (i in seq_along(df)) {
  means[[i]] <- mean(df[[i]], na.rm = TRUE)
}

medians <- double(ncol(df))
for (i in seq_along(df)) {
  medians[[i]] <- median(df[[i]], na.rm = TRUE)
}

For loops emphasise the data

slide-21
SLIDE 21

library(purrr)

means <- map_dbl(df, mean)
medians <- map_dbl(df, median)

Purrr emphasises the action
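A minimal side-by-side sketch of the two styles. The built-in mtcars data stands in for df here (an assumption, since the slides leave df unspecified), and means_loop and means_map are names introduced for illustration.

```r
library(purrr)

df <- mtcars  # stand-in data: every column is numeric

# For-loop version: allocate the output, then fill it one column at a time
means_loop <- double(ncol(df))
for (i in seq_along(df)) {
  means_loop[[i]] <- mean(df[[i]], na.rm = TRUE)
}

# purrr version: the function being applied is the first thing you read;
# extra arguments such as na.rm are passed through to mean()
means_map <- map_dbl(df, mean, na.rm = TRUE)

# Same numbers; map_dbl() additionally keeps the column names
all.equal(means_loop, unname(means_map))
```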

slide-22
SLIDE 22

Beat dry ingredients + butter until sandy. Whisk together wet ingredients. Mix half into dry until smooth (use high speed). Beat in remaining half. Mix until smooth. Bake 20-25 min at 170°C.

Vanilla cupcakes

  • 3. Use variables

120g flour 140g sugar 1.5 t baking powder 40g butter 120ml milk 1 egg 0.25 t vanilla

The hummingbird bakery cookbook

slide-23
SLIDE 23

Cupcakes

  • 4. Extract out common code

Vanilla: 120g flour, 140g sugar, 1.5t baking powder, 40g butter, 120ml milk, 1 egg, 0.25 t vanilla

Chocolate: 100g flour, 20g cocoa, 140g sugar, 1.5t baking powder, 40g butter, 120ml milk, 1 egg, 0.25 t vanilla

Beat dry ingredients + butter until sandy. Whisk together wet ingredients. Mix half into dry until smooth (use high speed). Beat in remaining half. Mix until smooth. Bake 20-25 min at 170°C.

slide-24
SLIDE 24

library(purrr)

df <- data.frame(...)

# Keep the numeric columns (base R's is.numeric), then summarise each with f
col_sum <- function(df, f) {
  df %>%
    keep(is.numeric) %>%
    map_dbl(f)
}

means <- col_sum(df, mean)
medians <- col_sum(df, median)

Similarly, purrr lets you create more complex recipes

slide-25
SLIDE 25

library(purrr)

df <- data.frame(...)

# The same function written without the pipe
col_sum <- function(df, f) {
  map_dbl(keep(df, is.numeric), f)
}

means <- col_sum(df, mean)
medians <- col_sum(df, median)

Similarly, purrr lets you create more complex recipes
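To see why the keep() step matters, here is a small runnable sketch on the built-in iris data, which mixes numeric columns with a factor. The choice of iris is an assumption for illustration, and base R's is.numeric() is used as the predicate.

```r
library(purrr)

# Keep the numeric columns, then summarise each one with f
col_sum <- function(df, f) {
  map_dbl(keep(df, is.numeric), f)
}

iris_means <- col_sum(iris, mean)      # Species (a factor) is dropped first
iris_medians <- col_sum(iris, median)

iris_means
iris_medians
```

Without keep(), map_dbl(iris, mean) would hit the Species column, where mean() has no numeric answer to give.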

slide-26
SLIDE 26

Cupcakes

  • 5. Store as data

            Flour  Baking powder  Sugar  Butter  Egg  Extra
Vanilla     120    1.5            140    40      1    0.25t vanilla
Chocolate   100    1.5            140    40      1    20g cocoa, 0.25t vanilla
Lemon       120    1.5            140    40      1    2T lemon zest
Red velvet  150    -              150    60      1    10g cocoa, 20ml red colouring, 1.5t vinegar, 0.5 t baking soda

slide-27
SLIDE 27

funs <- list(
  mean = mean,
  median = median,
  sd = sd
)

map(funs, col_sum, df = df)

In R, we can store functions in lists
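A runnable sketch of the same idea, again with iris standing in for df (an assumption). Because df is passed by name, each function in the list lands in col_sum()'s f argument; summaries is a name introduced here.

```r
library(purrr)

col_sum <- function(df, f) map_dbl(keep(df, is.numeric), f)

# A named list of functions: the "recipes" stored as data
funs <- list(mean = mean, median = median, sd = sd)

# Apply every summary function to every numeric column
summaries <- map(funs, col_sum, df = iris)

summaries$sd               # one named vector per function
do.call(rbind, summaries)  # stacked: one row per function, one column per variable
```

Adding another summary is now a matter of adding one element to funs rather than writing another loop.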

slide-28
SLIDE 28

Back to gapminder

slide-29
SLIDE 29

We can use purrr::map() to fit each model

Country      Data
Afghanistan  <data>
Albania      <data>
Algeria      <data>
...          <data>

lm(lifeExp ~ year1950, data = afghanistan)
lm(lifeExp ~ year1950, data = albania)

map(by_country$data, ~ lm(lifeExp ~ year1950, data = .))

slide-30
SLIDE 30

map(by_country$data, ~ lm(lifeExp ~ year1950, data = .))

# same as
out <- vector("list", length(by_country$data))
for (i in seq_along(by_country$data)) {
  df <- by_country$data[[i]]
  out[[i]] <- lm(lifeExp ~ year1950, data = df)
}

slide-31
SLIDE 31

Multiple lists make it easy to lose context. So use a data frame!
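A sketch of what keeping the models in the data frame might look like. It assumes the gapminder package as the data source and assumes year1950 means year minus 1950; neither is spelled out in the slides.

```r
library(gapminder)  # assumed data source
library(dplyr)
library(tidyr)
library(purrr)

by_country <- gapminder %>%
  mutate(year1950 = year - 1950) %>%  # assumed definition of year1950
  group_by(country) %>%
  nest() %>%
  ungroup() %>%
  # The fitted models live in a list-column, next to the data they came from
  mutate(model = map(data, ~ lm(lifeExp ~ year1950, data = .x)))

# Filtering or sorting rows keeps each model with its country
nz_row <- filter(by_country, country == "New Zealand")
coef(nz_row$model[[1]])
```

With separate lists, reordering one and forgetting the other silently mismatches countries and models; here that cannot happen.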

slide-32
SLIDE 32

Unnesting is the reverse of nesting

Nested:

Country      Data
Afghanistan  <data>
Albania      <data>
Algeria      <data>
...          <data>

Unnested:

Country      Year  LifeExp
Afghanistan  1952  28.9
Afghanistan  1957  30.3
Afghanistan  ...   ...
Albania      1952  55.2
Albania      1957  59.3
Albania      ...   ...
Algeria      ...   ...
...          ...   ...

nest() goes from the unnested form to the nested one; unnest() goes back.
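The round trip can be sketched as below, assuming the gapminder package as the data source; nested and flat are names introduced for illustration.

```r
library(gapminder)  # assumed data source
library(dplyr)
library(tidyr)

# nest(): one row per country, with the other columns packed into `data`
nested <- gapminder %>%
  group_by(country) %>%
  nest() %>%
  ungroup()

# unnest(): expand the list-column back to one row per observation
flat <- unnest(nested, cols = data)

nrow(nested)  # one row per country
nrow(flat)    # back to one row per country-year
```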

slide-33
SLIDE 33

Cross-validation

slide-34
SLIDE 34

Original Test Training

slide-35
SLIDE 35

Original Test Training Model

slide-36
SLIDE 36

Original Test Training Model Predict

slide-37
SLIDE 37

Original Test Training Model Predict Score

slide-38
SLIDE 38

     Test  Training  Model  Prediction  Score
1    df    df        lm     vector      number
2    df    df        lm     vector      number
3    df    df        lm     vector      number
4    df    df        lm     vector      number
...  ...   ...       ...    ...         ...

slide-39
SLIDE 39

crossv <- partition(mtcars, 100, c(test = 0.2, training = 0.8))

crossv <- crossv %>%
  mutate(
    # Fit the models
    model = map(training, ~ lm(mpg ~ wt, data = .)),
    # Make predictions on test data
    pred = map2(model, test, predict),
    # Evaluate difference between predicted and observed mpg
    diff = map2_dbl(pred, test %>% map("mpg"), msd)
  )
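The slide does not define partition() or msd(), so the sketch below swaps in stand-ins: the random splits are built with base R's sample(), and msd() is a hypothetical helper assumed to mean "mean squared difference". It follows the same test / training / model / prediction / score layout, not necessarily the author's implementation.

```r
library(purrr)
library(dplyr)
library(tibble)

set.seed(2016)  # arbitrary seed so the random splits are reproducible

# Hypothetical helper: mean squared difference between two numeric vectors
msd <- function(x, y) mean((x - y)^2)

# 100 random splits; each holds out about 20% of the rows as test data
n_test <- round(0.2 * nrow(mtcars))
test_rows <- map(1:100, function(i) sample(nrow(mtcars), n_test))

crossv <- tibble(
  test     = map(test_rows, function(rows) mtcars[rows, ]),
  training = map(test_rows, function(rows) mtcars[-rows, ])
)

crossv <- crossv %>%
  mutate(
    # Fit one model per training set
    model = map(training, ~ lm(mpg ~ wt, data = .x)),
    # predict(model, newdata) on the matching test set
    pred = map2(model, test, predict),
    # Score: mean squared difference between predicted and observed mpg
    diff = map2_dbl(pred, map(test, "mpg"), msd)
  )

summary(crossv$diff)
```

Every row carries its own test set, training set, model, predictions, and score, so nothing has to be matched up by position across separate lists.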

slide-40
SLIDE 40

Conclusion

slide-41
SLIDE 41

  • 1. Store related objects in list-columns.

  • 2. Learn FP so you can focus on verbs, not objects.

  • 3. Use broom to convert models to tidy data.

slide-42
SLIDE 42

Data frames Lists

dplyr purrr tidyr

Models

broom

slide-43
SLIDE 43

This work is licensed under the 
 Creative Commons Attribution-Noncommercial 3.0 
 United States License. To view a copy of this license, visit 
 http://creativecommons.org/licenses/by-nc/3.0/us/