Pipelines for data analysis in R

Hadley Wickham (@hadleywickham), Chief Scientist, RStudio, October 2015

Data analysis is the process by which data becomes understanding, knowledge and insight

Import -> Tidy -> Transform / Visualise / Model
• Tidy: consistent way of storing data
• Transform: create new variables & new summaries
• Visualise: surprises, but doesn't scale
• Model: scales, but doesn't (fundamentally) surprise

• Import: readr, readxl, haven, DBI, httr
• Tidy: tidyr
• Transform: dplyr
• Visualise: ggplot2, ggvis
• Model: broom

Pipelines

Cognitive: Think it
Describe it (precisely)
Computational: Do it

Cognition time ≫ Computation time
(Photo: http://www.flickr.com/photos/mutsmuts/4695658106)

magrittr::%>%
Inspirations: Unix, F#, Haskell, Clojure, method chaining

foo_foo <- little_bunny()

bop_on(
  scoop_up(
    hop_through(foo_foo, forest),
    field_mouse
  ),
  head
)

# vs

foo_foo %>%
  hop_through(forest) %>%
  scoop_up(field_mouse) %>%
  bop_on(head)

x %>% f(y)           # f(x, y)
x %>% f(z, .)        # f(z, x)
x %>% f(y) %>% g(z)  # g(f(x, y), z)

# Turns function composition (hard to read)
# into sequence (easy to read)
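The three rewrites above can be checked directly. This is a minimal sketch, assuming magrittr is installed; f and g are hypothetical stand-ins (not from the slides) that simply record how they were called.

```r
library(magrittr)

# Hypothetical stand-ins: each returns a string describing its own call
f <- function(a, b) paste0("f(", a, ", ", b, ")")
g <- function(a, b) paste0("g(", a, ", ", b, ")")

r1 <- "x" %>% f("y")             # left-hand side becomes the first argument
r2 <- "x" %>% f("z", .)          # the dot places it elsewhere
r3 <- "x" %>% f("y") %>% g("z")  # a sequence of steps instead of nesting

print(c(r1, r2, r3))
```

Reading r3 left to right gives the order of operations, which is the point the slide makes about sequence versus composition.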

# Any function can use it. Only needs a simple
# property: the type of the first argument
# needs to be the same as the type of the result.

# tidyr: pipelines for messy -> tidy data
# dplyr: pipelines for data manipulation
# ggvis: pipelines for visualisations
# rvest: pipelines for html
# purrr: pipelines for lists
# xml2: pipelines for xml
# stringr: pipelines for strings
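To illustrate the "same type in, same type out" property, here is a small sketch with two helpers that each take a numeric vector first and return a numeric vector. The names drop_missing and rescale01 are made up for this example and do not come from any of the packages listed.

```r
library(magrittr)

# Hypothetical helpers: numeric vector in, numeric vector out
drop_missing <- function(x) x[!is.na(x)]
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Because input and output types match, the steps chain freely
scaled <- c(2, NA, 4, 6) %>%
  drop_missing() %>%
  rescale01()

print(scaled)
```

Either helper can be dropped, reordered, or followed by another vector function without changing the rest of the pipeline.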

Tidy data = data that makes data analysis easy

Storage        Meaning
Table / File   Data set
Rows           Observations
Columns        Variables

Source: local data frame [5,769 x 22]

    iso2  year   m04  m514  m014 m1524 m2534 m3544 m4554 m5564   m65    mu   f04  f514
   (chr) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int)
1     AD  1989    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
2     AD  1990    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
3     AD  1991    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
4     AD  1992    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
5     AD  1993    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
6     AD  1994    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
7     AD  1996    NA    NA     0     0     0     4     1     0     0    NA    NA    NA
8     AD  1997    NA    NA     0     0     1     2     2     1     6    NA    NA    NA
9     AD  1998    NA    NA     0     0     0     1     0     0     0    NA    NA    NA
10    AD  1999    NA    NA     0     0     0     1     1     0     0    NA    NA    NA
11    AD  2000    NA    NA     0     0     1     0     0     0     0    NA    NA    NA
12    AD  2001    NA    NA     0    NA    NA     2     1    NA    NA    NA    NA    NA
13    AD  2002    NA    NA     0     0     0     1     0     0     0    NA    NA    NA
14    AD  2003    NA    NA     0     0     0     1     2     0     0    NA    NA    NA
15    AD  2004    NA    NA     0     0     0     1     1     0     0    NA    NA    NA
16    AD  2005     0     0     0     0     1     1     0     0     0     0     0     0
..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
Variables not shown: f014 (int), f1524 (int), f2534 (int), f3544 (int), f4554 (int), f5564 (int), f65 (int), fu (int)

What are the variables in this dataset? (Hint: f = female, u = unknown, 1524 = 15-24)

# To convert this messy data into tidy data
# we need two verbs. First we need to gather
# together all the columns that aren't variables
tb2 <- tb %>%
  gather(demo, n, -iso2, -year, na.rm = TRUE)
tb2

# Then separate the demographic variable into
# sex and age
tb3 <- tb2 %>%
  separate(demo, c("sex", "age"), 1)
tb3

# Many tidyr verbs come in pairs:
# spread vs. gather
# extract/separate vs. unite
# nest vs. unnest
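The two steps above can be tried end to end on a toy table. This sketch assumes tidyr and magrittr are installed; the two-row tb below is a made-up stand-in with only two of the demographic columns, not the real 5,769-row dataset.

```r
library(tidyr)
library(magrittr)

# Made-up stand-in for the TB table: two demographic columns only
tb <- data.frame(
  iso2  = c("AD", "AD"),
  year  = c(1996L, 1997L),
  m3544 = c(4L, 2L),
  f3544 = c(NA, 1L),
  stringsAsFactors = FALSE
)

# Step 1: gather every column that is not a variable into key/value pairs
tb2 <- tb %>%
  gather(demo, n, -iso2, -year, na.rm = TRUE)

# Step 2: split the key after its first character into sex and age
tb3 <- tb2 %>%
  separate(demo, c("sex", "age"), 1)

print(tb3)
```

After both steps each row is one observation (country, year, sex, age group) with its count in n, and the row with a missing count has been dropped by na.rm = TRUE.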

Google for “tidyr” & “tidy data”

Transform

Cognitive: Think it
Describe it (precisely)
Computational: Do it

One table verbs (+ group by)
• select: subset variables by name
• filter: subset observations by value
• mutate: add new variables
• summarise: reduce to a single obs
• arrange: re-order the observations
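The five verbs and group_by can be chained in a single pipeline. A minimal sketch, assuming dplyr is installed; the sales table and its column names are invented for illustration.

```r
library(dplyr)

# Invented example data
sales <- data.frame(
  city  = c("A", "A", "B", "B"),
  year  = c(2014L, 2015L, 2014L, 2015L),
  units = c(10, 20, 5, 15),
  price = c(2, 2, 3, 3),
  stringsAsFactors = FALSE
)

result <- sales %>%
  select(city, units, price) %>%       # subset variables by name
  filter(units > 5) %>%                # subset observations by value
  mutate(revenue = units * price) %>%  # add a new variable
  group_by(city) %>%                   # summarise per city from here on
  summarise(total = sum(revenue)) %>%  # reduce each group to one row
  arrange(desc(total))                 # re-order the observations

print(result)
```

Each verb takes a data frame and returns a data frame, which is what lets them compose with the pipe.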

Mutating       Filtering     Set
inner_join()   semi_join()   intersect()
left_join()    anti_join()   union()
right_join()                 setdiff()
full_join()
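A short sketch of how the three families of two-table verbs differ, assuming dplyr is installed. The tables x and y and their columns are invented for illustration.

```r
library(dplyr)

# Invented example tables sharing a key column
x <- data.frame(key = c(1, 2, 3), x_val = c("a", "b", "c"),
                stringsAsFactors = FALSE)
y <- data.frame(key = c(2, 3, 4), y_val = c("B", "C", "D"),
                stringsAsFactors = FALSE)

# Mutating joins add columns from y to x
mutating_inner <- inner_join(x, y, by = "key")
mutating_left  <- left_join(x, y, by = "key")

# Filtering joins keep or drop rows of x, without adding columns
filtering_semi <- semi_join(x, y, by = "key")
filtering_anti <- anti_join(x, y, by = "key")

# Set operations expect both tables to have the same columns
kx <- select(x, key)
ky <- select(y, key)
set_intersect <- intersect(kx, ky)
set_union     <- union(kx, ky)
set_diff      <- setdiff(kx, ky)

print(mutating_left)
print(filtering_anti)
```

The left join keeps key 1 with a missing y_val, while the anti join returns only that unmatched row.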

dplyr sources
• Local data frame (C++)
• Local data table
• Local data cube (experimental)
• RDBMS: Postgres, MySQL, SQLite, Oracle, MS SQL, JDBC, Impala
• MonetDB, BigQuery

Google for “dplyr”

Visualise

What is ggvis?
• A grammar of graphics (like ggplot2)
• Reactive (interactive & dynamic) (like shiny)
• A pipeline (a la dplyr)
• Of the web (drawn with vega)

Demo 4-ggvis.R 4-ggvis.Rmd

Google for “ggvis”

Model with broom, by David Robinson

46 TX cities, ~25 years of data
[Figure: log(sales) against date, 1990 to 2015]
What makes it hard to see the long term trend?

# Models are useful as a tool for removing
# known patterns
tx <- tx %>%
  group_by(city) %>%
  mutate(
    resid = lm(
      log(sales) ~ factor(month),
      na.action = na.exclude
    ) %>% resid()
  )
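The same pattern can be run on synthetic data to see what the residuals remove. This sketch assumes dplyr is installed; tx_demo is simulated here (two made-up cities, a sinusoidal monthly pattern plus noise) and is not the Texas housing data from the talk.

```r
library(dplyr)

set.seed(1)

# Simulated stand-in: 2 cities x 5 years x 12 months with a seasonal pattern
tx_demo <- expand.grid(
  city  = c("A", "B"),
  year  = 2000:2004,
  month = 1:12,
  stringsAsFactors = FALSE
)
tx_demo$sales <- exp(
  5 + sin(2 * pi * tx_demo$month / 12) + rnorm(nrow(tx_demo), sd = 0.1)
)

# Fit one seasonal model per city and keep only what it cannot explain
tx_demo <- tx_demo %>%
  group_by(city) %>%
  mutate(
    resid = resid(lm(log(sales) ~ factor(month), na.action = na.exclude))
  ) %>%
  ungroup()

# Average residual per city and month: the monthly pattern is gone
monthly_means <- tapply(tx_demo$resid, list(tx_demo$city, tx_demo$month), mean)
print(round(monthly_means, 10))
```

Because the model has one level per month, the residuals average to zero within every city-month cell, so whatever structure remains is not seasonal.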

[Figure: resid against date, 1990 to 2015]

# Models are also useful in their own right
models <- tx %>%
  group_by(city) %>%
  do(mod = lm(
    log(sales) ~ factor(month),
    data = .,
    na.action = na.exclude)
  )

Model summaries
• Model level: one row per model
• Coefficient level: one row per coefficient (per model)
• Observation level: one row per observation (per model)
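In broom these three levels correspond to glance(), tidy() and augment(). A minimal sketch on a single model, assuming broom is installed; the mtcars regression is only a convenient built-in example, not a model from the talk.

```r
library(broom)

# A single illustrative model on a built-in dataset
mod <- lm(mpg ~ wt, data = mtcars)

model_level <- glance(mod)   # one row per model
coef_level  <- tidy(mod)     # one row per coefficient
obs_level   <- augment(mod)  # one row per observation

print(dim(model_level))
print(coef_level$term)
print(nrow(obs_level))
```

All three return data frames, so the results feed straight back into dplyr and plotting pipelines.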

Demo 5-broom.R

Google for “broom r”

Big data and R

• Big: can’t fit in memory on one computer (>5 TB)
• Medium: fits in memory on a server (10 GB - 5 TB)
• Small: fits in memory on a laptop (<10 GB). R is great at this!

R
• R provides an excellent environment for rapid interactive exploration of small data.
• There is no technical reason why it can’t also work well with medium size data. (But the work mostly hasn’t been done)
• What about big data?

1. Can be reduced to a small data problem with subsetting/sampling/summarising (90%)
2. Can be reduced to a very large number of small data problems (9%)
3. Is irreducibly big (1%)
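The first case can be sketched with dplyr verbs. Here a 1,000-row local data frame named big stands in for a much larger source (an assumption for illustration only); the slides do not show this code.

```r
library(dplyr)

set.seed(42)

# Small stand-in for a source that would not normally fit in memory
big <- data.frame(
  group = rep(c("a", "b"), each = 500),
  value = c(rep(1, 500), rep(3, 500)),
  stringsAsFactors = FALSE
)

subset_small  <- big %>% filter(group == "a")  # subsetting
sample_small  <- big %>% sample_n(100)         # sampling
summary_small <- big %>%                       # summarising
  group_by(group) %>%
  summarise(mean_value = mean(value), n = n())

print(summary_small)
```

Each result is small enough to explore interactively, whatever the size of the original source.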

The right small data
• Rapid iteration essential
• dplyr supports this activity by avoiding cognitive costs of switching between languages.

Lots of small problems
• Embarrassingly parallel (e.g. Hadoop)
• R wrappers like foreach, rhipe, rhadoop
• The challenge is matching the architecture of computing to data storage
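The shape of an embarrassingly parallel problem can be shown in base R: split the data into independent pieces, then apply the same function to each. This sketch runs sequentially with lapply on the built-in mtcars data (chosen only for illustration); because the pieces do not depend on each other, the lapply call is the part a parallel backend would take over.

```r
# Split a built-in dataset into independent pieces, one per cylinder count
pieces <- split(mtcars, mtcars$cyl)

# The same small problem is solved on every piece; no piece needs another
fit_piece <- function(d) coef(lm(mpg ~ wt, data = d))
fits <- lapply(pieces, fit_piece)

print(names(fits))
print(fits[["4"]])
```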

Irreducibly big
• Computation must be performed by specialised system.
• Typically C/C++, Fortran, Scala.
• R needs to be able to talk to those systems.

Future work

End game
Provide a fluent interface where you spend your mental energy on the specific data problem, not the general data analysis process.
The best tools become invisible with time!
Still a lot of work to do, especially on the connection between modelling and visualisation.
