
tidyfun: Tidy Functional Data
A new framework for working with functional data in R
Fabian Scheipl (1), Jeff Goldsmith (2)
(1) Dept. of Statistics, LMU Munich
(2) Columbia University Mailman School of Public Health

Functional Data
Painful to work with:
◮ huge amounts of data
◮ regular grids? irregular grids?
◮ work with:
    ◮ raw data?
    ◮ smooth/interpolated?
    ◮ basis representations?

Functional Data
Painful: Two (2.5, actually..) bad options to keep it in the same data.frame as the rest of your data:
1. wide format:
◮ way too many weird columns
◮ need to keep track of argument values t separately somehow:
    ## Observations: 382
    ## Variables: 96
    ## $ id        <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009...
    ## $ sex       <fct> female, female, male, male, male, male, male, male, ...
    ## $ case      <fct> control, control, control, control, control, control...
    ## $ cca_0     <dbl> 0.4909345, 0.4721627, 0.5023738, 0.4021894, 0.401874...
    ## $ cca_0.011 <dbl> 0.5168018, 0.4868219, 0.5136516, 0.4225127, 0.405580...
    ## $ cca_0.022 <dbl> 0.5356539, 0.5022577, 0.5392542, 0.4398983, 0.398548...
    ## $ cca_0.033 <dbl> 0.5553587, 0.5233635, 0.5742101, 0.4600235, 0.386000...
    ## $ cca_0.043 <dbl> 0.5927610, 0.5524401, 0.6031339, 0.4751297, 0.408824...
    ## $ cca_0.054 <dbl> 0.6326935, 0.5872003, 0.6335913, 0.4990257, 0.425183...
    ## $ cca_0.065 <dbl> 0.6500317, 0.5968158, 0.6357108, 0.5165528, 0.429537...
    ## $ cca_0.076 <dbl> 0.6556130, 0.6026607, 0.6350799, 0.5552692, 0.444545...
    ## $ cca_0.087 <dbl> 0.6493701, 0.5922767, 0.6201638, 0.5826485, 0.486943...
    ## $ cca_0.098 <dbl> 0.6378739, 0.5791859, 0.6086281, 0.6005767, 0.510038...
    ## $ cca_0.109 <dbl> 0.6286463, 0.5714253, 0.5910287, 0.6135842, 0.537097...

Functional Data
Painful: Two (2.5, actually..) bad options to keep it in the same data.frame as the rest of your data:
2. long format:
◮ unwieldy amounts of rows, lots of duplication for non-functional data
◮ need to keep track of grouping structure (which rows belong to the same curve?) throughout
◮ infeasible if we have more than one function per observational unit
    ## Observations: 35,526
    ## Variables: 6
    ## $ id        <dbl> 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001...
    ## $ sex       <fct> female, female, female, female, female, female, fema...
    ## $ case      <fct> control, control, control, control, control, control...
    ## $ cca_id    <chr> "1001_1", "1001_1", "1001_1", "1001_1", "1001_1", "1...
    ## $ cca_arg   <dbl> 0.000, 0.011, 0.022, 0.033, 0.043, 0.054, 0.065, 0.0...
    ## $ cca_value <dbl> 0.4909345, 0.5168018, 0.5356539, 0.5553587, 0.592761...
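
To make the two layouts concrete, here is a minimal base-R sketch on made-up toy data (three curves, four argument values; the names `wide`, `long` and `arg_from_names` are illustrative assumptions, not part of tidyfun). It shows that in the wide layout the argument values survive only inside column names, while the long layout repeats every scalar covariate once per evaluation.

```r
# Toy data (hypothetical): 3 curves evaluated on a common grid of 4 arguments.
arg <- c(0, 0.25, 0.5, 0.75)
values <- matrix(
  c(0.49, 0.47, 0.50,   # evaluations at arg = 0
    0.52, 0.49, 0.51,   # evaluations at arg = 0.25
    0.54, 0.50, 0.54,   # evaluations at arg = 0.5
    0.56, 0.52, 0.57),  # evaluations at arg = 0.75
  nrow = 3,
  dimnames = list(NULL, paste0("cca_", arg))
)

# Wide format: one column per argument value.
wide <- data.frame(id = 1:3,
                   sex = factor(c("female", "male", "male")),
                   values)

# The argument values have to be parsed back out of the column names.
value_cols <- grep("^cca_", names(wide), value = TRUE)
arg_from_names <- as.numeric(sub("^cca_", "", value_cols))

# Long format: one row per (curve, argument) pair; id and sex are duplicated.
long <- data.frame(
  id        = rep(wide$id, times = length(value_cols)),
  sex       = rep(wide$sex, times = length(value_cols)),
  cca_arg   = rep(arg_from_names, each = nrow(wide)),
  cca_value = unlist(wide[value_cols], use.names = FALSE)
)

dim(wide)  # 3 rows, 2 + 4 columns
dim(long)  # 12 rows, 4 columns
```

With 382 curves and 93 evaluations each, as in the `cca` data shown on these slides, the same conversion gives the 35,526 rows reported above.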

Functional Data
Painful to work with: Third bad option: matrix columns in a data.frame. Sucks, too:
◮ not really well supported (breaks lots of tidyverse stuff, much unexpected behavior in base)
◮ more trouble than it’s worth: doesn’t solve how to keep track of argument values

Functional Data
Despite all that, people keep measuring ever more of the damn things.
Let’s make dealing with functional data in R less painful.

Start at the end...

This is what we’re aiming for:
    # group-wise functional medians:
    medians <- dti %>%
      group_by(case, sex) %>%
      summarize(median_rcst = median(rcst))
    ggplot(medians) +
      geom_spaghetti(aes(y = median_rcst, col = sex, linetype = case))
[Figure: median_rcst curves over the argument range 0 to 1; color by sex (male, female), line type by case (control, MS).]
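
As a rough illustration of what a group-wise functional median could mean, the following base-R sketch computes a pointwise median per group for curves stored as matrix rows on a common grid. This is an assumption made for illustration only: the slides do not say how `median()` is defined for `tfd` vectors, and the package may use a different definition. The toy numbers and the helper `pointwise_median` are hypothetical.

```r
# Hypothetical toy data: 6 curves (rows) on a common grid of 3 argument values.
grid <- c(0, 0.5, 1)
curves <- rbind(
  c(0.40, 0.50, 0.60),
  c(0.42, 0.55, 0.58),
  c(0.44, 0.51, 0.65),
  c(0.30, 0.45, 0.50),
  c(0.35, 0.40, 0.52),
  c(0.31, 0.47, 0.49)
)
case <- factor(rep(c("control", "MS"), each = 3))

# Pointwise median: the median across curves, separately at each grid point.
pointwise_median <- function(m) apply(m, 2, median)

# One median curve per group; the result has one row per level of `case`.
medians <- t(sapply(
  split(seq_len(nrow(curves)), case),
  function(rows) pointwise_median(curves[rows, , drop = FALSE])
))
colnames(medians) <- paste0("t=", grid)
medians
```

Note that this only works because all toy curves share one grid; irregular data such as `rcst` would first need to be evaluated on common argument values.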

This is what we’re aiming for:
    dti[, -1]
    ## # A tibble: 382 x 4
    ##    sex    case    cca                            rcst
    ##    <fct>  <fct>   <tfd>                          <tfd>
    ##  1 female control (0.000,0.49);(0.011,0.52);(... (0.000,0.26);(0.019,0.45);(...
    ##  2 female control (0.000,0.47);(0.011,0.49);(... ( 0.22,0.44);( 0.24,0.48);(...
    ##  3 male   control (0.000,0.50);(0.011,0.51);(... ( 0.22,0.42);( 0.24,0.41);(...
    ##  4 male   control (0.000,0.40);(0.011,0.42);(... (0.000,0.51);(0.019,0.50);(...
    ##  5 male   control (0.000,0.40);(0.011,0.41);(... ( 0.22,0.40);( 0.24,0.41);(...
    ##  6 male   control (0.000,0.45);(0.011,0.45);(... (0.056,0.47);(0.074,0.49);(...
    ##  7 male   control (0.000,0.55);(0.011,0.56);(... (0.000,0.52);(0.019,0.53);(...
    ##  8 male   control (0.000,0.45);(0.011,0.48);(... (0.000,0.33);(0.019,0.42);(...
    ##  9 male   control (0.000,0.50);(0.011,0.51);(... (0.000,0.57);(0.019,0.55);(...
    ## 10 male   control (0.000,0.46);(0.011,0.47);(... ( 0.22,0.44);( 0.24,0.45);(...
    ## # ... with 372 more rows

tidyfun
The goal of tidyfun is to provide accessible and well-documented software that makes functional data analysis in R easy, specifically: data wrangling and exploratory analysis.
tidyfun provides:
◮ new data types for representing functional data: tfd & tfb
◮ arithmetic operators, descriptive statistics and graphics functions for such data
◮ tidyverse verbs for handling functional data inside data frames.

Plan for today
◮ tidyfun’s data types
◮ tidyfun’s methods & functions
◮ Discussion & Feedback:
    ◮ What’s stupid?
    ◮ What is too complicated?
    ◮ What am I missing?

tf-Class: Definition

tf-class
tf is a new data type for (vectors of) functional data:
◮ an abstract superclass for functional data in 2 forms:
    ◮ as (argument, value)-tuples: subclass tfd, also irregular or sparse
    ◮ or in basis representation: subclass tfb
◮ basically, a glorified list of numeric vectors (... since lists work well as columns of data frames ...)
◮ with additional attributes that define function-like behavior:
    ◮ how to evaluate the given “functions” for new arguments
    ◮ their domain
    ◮ the resolution of the argument values
◮ S3 based
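
The "glorified list of numeric vectors with attributes" idea can be sketched in a few lines of base R. The class `toy_tfd` and the helpers `new_toy_tfd`, `toy_arg` and `toy_domain` below are hypothetical stand-ins written for this illustration; they are not tidyfun's implementation, which carries more attributes (evaluator, resolution) and separate subclasses.

```r
# Hypothetical constructor: a list of evaluation vectors on a common grid,
# plus attributes that store the argument values and the domain.
new_toy_tfd <- function(evaluations, arg) {
  stopifnot(is.list(evaluations), is.numeric(arg),
            all(lengths(evaluations) == length(arg)))
  structure(evaluations,
            arg = arg,
            domain = range(arg),
            class = "toy_tfd")
}

# Hypothetical accessors, loosely mirroring the roles of tf_arg()/tf_domain().
toy_arg <- function(f) attr(f, "arg")
toy_domain <- function(f) attr(f, "domain")

# An S3 print method, since the class is S3 based.
print.toy_tfd <- function(x, ...) {
  dom <- toy_domain(x)
  cat(sprintf("toy_tfd[%d] on (%g,%g) based on %d evaluations each\n",
              length(x), dom[1], dom[2], length(toy_arg(x))))
  invisible(x)
}

ex_toy <- new_toy_tfd(
  list(A = c(0.49, 0.52, 0.54), B = c(0.47, 0.49, 0.50)),
  arg = c(0, 0.011, 0.022)
)
print(ex_toy)
```

Because the object is still a list underneath, `length(ex_toy)` is the number of curves, which is what lets such a vector line up with the rows of a data frame.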

Example Data
[Figure: the five example curves A to E in ex, plotted over the argument range 0 to 1.]
    ex
    ## tfd[5] on (0,1) based on 93 evaluations each
    ## interpolation by tf_approx_linear
    ## A: (0.000,0.49);(0.011,0.52);(0.022,0.54); ...
    ## B: (0.000,0.47);(0.011,0.49);(0.022,0.50); ...
    ## C: (0.000,0.50);(0.011,0.51);(0.022,0.54); ...
    ## D: (0.000,0.40);(0.011,0.42);(0.022,0.44); ...
    ## E: (0.000,0.40);(0.011,0.41);(0.022,0.40); ...

Example Data
    glimpse(dti)
    ## Observations: 382
    ## Variables: 5
    ## $ id   <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 101...
    ## $ sex  <fct> female, female, male, male, male, male, male, male, male,...
    ## $ case <fct> control, control, control, control, control, control, con...
    ## $ cca  <tfd> 1001_1: (0.000,0.49);(0.011,0.52);(0.022,0.54); ..., 1002...
    ## $ rcst <tfd> 1001_1: (0.000,0.26);(0.019,0.45);(0.037,0.40); ..., 1002...

tf subclass: tfd
tfd objects contain “raw” functional data:
◮ represented as a list of evaluations $f_i(t)|_{t = t'}$ and corresponding argument vector(s) $t'$
◮ has a domain: the range of valid args.
    ex %>% tf_evaluations() %>% str
    ## List of 5
    ##  $ : num [1:93] 0.491 0.517 0.536 0.555 0.593 ...
    ##  $ : num [1:93] 0.472 0.487 0.502 0.523 0.552 ...
    ##  $ : num [1:93] 0.502 0.514 0.539 0.574 0.603 ...
    ##  $ : num [1:93] 0.402 0.423 0.44 0.46 0.475 ...
    ##  $ : num [1:93] 0.402 0.406 0.399 0.386 0.409 ...
    ex %>% tf_arg() %>% str
    ##  num [1:93] 0 0.0109 0.0217 0.0326 0.0435 ...
    ex %>% tf_domain()
    ## [1] 0 1

tf subclass: tfd
◮ contains an evaluator function that defines how to inter-/extrapolate evaluations between args (and remembers results of previous calls)
    tf_evaluator(ex) %>% str
    ## function (x, arg, evaluations)
    ##  - attr(*, "memoised")= logi TRUE
    ##  - attr(*, "class")= chr [1:2] "memoised" "function"
    tf_evaluator(ex) <- tf_approx_spline
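
To show what an evaluator with the signature `function(x, arg, evaluations)` might do, here is a sketch built on base R's `approx()` and `spline()`. The functions `toy_approx_linear` and `toy_approx_spline` are hypothetical stand-ins, not the bodies of `tf_approx_linear` or `tf_approx_spline`, and the caching of previous calls mentioned on the slide is left out.

```r
# Hypothetical evaluators sharing the signature shown in the slide:
#   x            new argument values at which the function is needed
#   arg          argument values at which the function was observed
#   evaluations  observed function values at `arg`
toy_approx_linear <- function(x, arg, evaluations) {
  # Linear interpolation; NA outside the range of `arg`.
  approx(x = arg, y = evaluations, xout = x)$y
}

toy_approx_spline <- function(x, arg, evaluations) {
  # Natural cubic spline through the observed points.
  spline(x = arg, y = evaluations, xout = x, method = "natural")$y
}

arg <- c(0, 0.5, 1)
evaluations <- c(0, 1, 0)

# Swapping the evaluator changes the values between observed arguments,
# but both reproduce the observed evaluations at `arg` itself.
evaluator <- toy_approx_linear
evaluator(c(0.25, 0.5), arg, evaluations)

evaluator <- toy_approx_spline
evaluator(c(0.25, 0.5), arg, evaluations)
```

Keeping the evaluator as a replaceable function is what makes the last line of the slide possible: the stored evaluations stay untouched, only the rule for filling in between them changes.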

tf subclass: tfd
◮ tfd has subclasses for regular data with a common grid and irregular or sparse data.
    dti$rcst[1:2]
    ## tfd[2] on (0,1) based on 43 to 55 (mean: 49) evaluations each
    ## inter-/extrapolation by tf_approx_linear
    ## 1001_1: (0.000,0.26);(0.019,0.45);(0.037,0.40); ...
    ## 1002_1: ( 0.22,0.44);( 0.24,0.48);( 0.26,0.48); ...
    dti$rcst[1:2] %>% tf_arg() %>% str
    ## List of 2
    ##  $ 1001_1: num [1:55] 0 0.019 0.037 0.056 0.074 0.093 0.111 0.13 0.148 0.167 ...
    ##  $ 1002_1: num [1:43] 0.222 0.241 0.259 0.278 0.296 0.315 0.333 0.352 0.37 0.389 ...
    dti$rcst[1:2] %>% plot(pch = "x", col = viridis(2))
[Figure: scatter plot of the two irregularly sampled rcst curves, points marked "x", over the argument range 0 to 1.]
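
The difference between regular and irregular data can be sketched in base R by keeping one argument vector per curve and interpolating onto a shared grid only when needed. The toy values and the names `args`, `vals` and `on_grid` are assumptions made for this illustration; this is not how tidyfun stores or evaluates irregular `tfd` vectors internally.

```r
# Hypothetical irregular data: each curve has its own argument values.
args <- list(a = c(0, 0.5, 1),
             b = c(0.25, 0.75))
vals <- list(a = c(0.2, 0.4, 0.6),
             b = c(1, 2))

# Evaluate every curve on one common grid by linear interpolation.
# approx() returns NA outside a curve's observed range, so curve "b",
# which is only observed on [0.25, 0.75], is missing at 0 and 1.
grid <- c(0, 0.5, 1)
on_grid <- t(mapply(
  function(arg, value) approx(x = arg, y = value, xout = grid)$y,
  args, vals
))
colnames(on_grid) <- paste0("t=", grid)
on_grid
```

This mirrors the `rcst` example above, where curve 1002_1 starts at 0.222 rather than 0: a single shared argument vector cannot describe such data, so the arguments are kept as a list with one entry per curve.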

tf subclass: tfd
    dti$cca[1:3] %>% str(1)
    ## List of 3
    ##  $ 1001_1: num [1:93] 0.491 0.517 0.536 0.555 0.593 ...
    ##  $ 1002_1: num [1:93] 0.472 0.487 0.502 0.523 0.552 ...
    ##  $ 1003_1: num [1:93] 0.502 0.514 0.539 0.574 0.603 ...
    ##  - attr(*, "arg")=List of 1
    ##  - attr(*, "domain")= num [1:2] 0 1
    ##  - attr(*, "evaluator")=function (x, arg, evaluations)
    ##   ..- attr(*, "memoised")= logi TRUE
    ##   ..- attr(*, "class")= chr [1:2] "memoised" "function"
    ##  - attr(*, "evaluator_name")= chr "tf_approx_linear"
    ##  - attr(*, "resolution")= num 0.01
    ##  - attr(*, "class")= chr [1:3] "tfd_reg" "tfd" "tf"
