tidyfun: Tidy Functional Data. A new framework for working with functional data in R


  1. tidyfun: Tidy Functional Data
     A new framework for working with functional data in R
     Fabian Scheipl (1), Jeff Goldsmith (2)
     (1) Dept. of Statistics, LMU Munich
     (2) Columbia University Mailman School of Public Health

  2. Functional Data
     Painful to work with:
     ◮ huge amounts of data
     ◮ regular grids? irregular grids?
     ◮ work with:
       ◮ raw data?
       ◮ smooth/interpolated?
       ◮ basis representations?

  3. Functional Data
     Painful: Two (2.5, actually..) bad options to keep it in the same data.frame as the rest of your data:
     1. wide format:
        ◮ way too many weird columns
        ◮ need to keep track of argument values t separately somehow:

     ## Observations: 382
     ## Variables: 96
     ## $ id <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009...
     ## $ sex <fct> female, female, male, male, male, male, male, male, ...
     ## $ case <fct> control, control, control, control, control, control...
     ## $ cca_0 <dbl> 0.4909345, 0.4721627, 0.5023738, 0.4021894, 0.401874...
     ## $ cca_0.011 <dbl> 0.5168018, 0.4868219, 0.5136516, 0.4225127, 0.405580...
     ## $ cca_0.022 <dbl> 0.5356539, 0.5022577, 0.5392542, 0.4398983, 0.398548...
     ## $ cca_0.033 <dbl> 0.5553587, 0.5233635, 0.5742101, 0.4600235, 0.386000...
     ## $ cca_0.043 <dbl> 0.5927610, 0.5524401, 0.6031339, 0.4751297, 0.408824...
     ## $ cca_0.054 <dbl> 0.6326935, 0.5872003, 0.6335913, 0.4990257, 0.425183...
     ## $ cca_0.065 <dbl> 0.6500317, 0.5968158, 0.6357108, 0.5165528, 0.429537...
     ## $ cca_0.076 <dbl> 0.6556130, 0.6026607, 0.6350799, 0.5552692, 0.444545...
     ## $ cca_0.087 <dbl> 0.6493701, 0.5922767, 0.6201638, 0.5826485, 0.486943...
     ## $ cca_0.098 <dbl> 0.6378739, 0.5791859, 0.6086281, 0.6005767, 0.510038...
     ## $ cca_0.109 <dbl> 0.6286463, 0.5714253, 0.5910287, 0.6135842, 0.537097...

  4. Functional Data
     Painful: Two (2.5, actually..) bad options to keep it in the same data.frame as the rest of your data:
     2. long format:
        ◮ unwieldy amounts of rows, lots of duplication for non-functional data
        ◮ need to keep track of grouping structure (which rows belong to the same curve?) throughout
        ◮ infeasible if we have more than one function per observational unit

     ## Observations: 35,526
     ## Variables: 6
     ## $ id <dbl> 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001...
     ## $ sex <fct> female, female, female, female, female, female, fema...
     ## $ case <fct> control, control, control, control, control, control...
     ## $ cca_id <chr> "1001_1", "1001_1", "1001_1", "1001_1", "1001_1", "1...
     ## $ cca_arg <dbl> 0.000, 0.011, 0.022, 0.033, 0.043, 0.054, 0.065, 0.0...
     ## $ cca_value <dbl> 0.4909345, 0.5168018, 0.5356539, 0.5553587, 0.592761...
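
The trade-off between the two layouts can be sketched in base R on made-up toy data. Everything here (the names `wide`, `long`, `args`, `curves` and all values) is a hypothetical illustration, not the DTI data shown on the slides:

```r
# Toy data: 3 subjects, each with one curve observed on a common grid of 4 points.
# All names and values are illustrative assumptions, not the DTI data.
args <- c(0, 0.25, 0.5, 0.75)
curves <- matrix(seq(0.40, 0.51, length.out = 12), nrow = 3)  # 3 curves x 4 grid points
colnames(curves) <- paste0("cca_", args)

# Wide format: one column per argument value.
# The grid itself only survives inside the column names.
wide <- cbind(
  data.frame(id = 1:3, sex = factor(c("female", "male", "male"))),
  curves
)
dim(wide)

# Long format: one row per (curve, argument value) pair.
# id and sex are repeated for every grid point.
long <- data.frame(
  id        = rep(wide$id, each = length(args)),
  sex       = rep(wide$sex, each = length(args)),
  cca_arg   = rep(args, times = nrow(wide)),
  cca_value = as.vector(t(curves))
)
dim(long)
```

Even in this tiny case the wide table needs one column per grid point, while the long table repeats `id` and `sex` once per evaluation.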

  5. Functional Data
     Painful to work with: Third bad option: matrix columns in a data.frame. Sucks, too:
     ◮ not really well supported (breaks lots of tidyverse-stuff, much unexpected behavior in base)
     ◮ more trouble than it’s worth: doesn’t solve how to keep track of argument values

  6. Functional Data
     Despite all that, people keep measuring ever more of the damn things.
     Let’s make dealing with functional data in R less painful.

  7. Start at the end...

  8. This is what we’re aiming for:

     # group-wise functional medians:
     medians <- dti %>%
       group_by(case, sex) %>%
       summarize(median_rcst = median(rcst))
     ggplot(medians) +
       geom_spaghetti(aes(y = median_rcst, col = sex, linetype = case))

     [Figure: median_rcst curves over (0, 1); color by sex (male, female), line type by case (control, MS)]

  9. This is what we’re aiming for:

     dti[, -1]
     ## # A tibble: 382 x 4
     ## sex case cca rcst
     ## <fct> <fct> <tfd> <tfd>
     ## 1 female control (0.000,0.49);(0.011,0.52);(... (0.000,0.26);(0.019,0.45);(...
     ## 2 female control (0.000,0.47);(0.011,0.49);(... ( 0.22,0.44);( 0.24,0.48);(...
     ## 3 male control (0.000,0.50);(0.011,0.51);(... ( 0.22,0.42);( 0.24,0.41);(...
     ## 4 male control (0.000,0.40);(0.011,0.42);(... (0.000,0.51);(0.019,0.50);(...
     ## 5 male control (0.000,0.40);(0.011,0.41);(... ( 0.22,0.40);( 0.24,0.41);(...
     ## 6 male control (0.000,0.45);(0.011,0.45);(... (0.056,0.47);(0.074,0.49);(...
     ## 7 male control (0.000,0.55);(0.011,0.56);(... (0.000,0.52);(0.019,0.53);(...
     ## 8 male control (0.000,0.45);(0.011,0.48);(... (0.000,0.33);(0.019,0.42);(...
     ## 9 male control (0.000,0.50);(0.011,0.51);(... (0.000,0.57);(0.019,0.55);(...
     ## 10 male control (0.000,0.46);(0.011,0.47);(... ( 0.22,0.44);( 0.24,0.45);(...
     ## # ... with 372 more rows

  10. tidyfun
      The goal of tidyfun is to provide accessible and well-documented software that makes functional data analysis in R easy, specifically: data wrangling and exploratory analysis.
      tidyfun provides:
      ◮ new data types for representing functional data: tfd & tfb
      ◮ arithmetic operators, descriptive statistics and graphics functions for such data
      ◮ tidyverse-verbs for handling functional data inside data frames.

  11. Plan for today
      ◮ tidyfun’s data types
      ◮ tidyfun’s methods & functions
      ◮ Discussion & Feedback:
        ◮ What’s stupid?
        ◮ What is too complicated?
        ◮ What am I missing?

  12. tf-Class: Definition

  13. tf-class
      tf is a new data type for (vectors of) functional data:
      ◮ an abstract superclass for functional data in 2 forms:
        ◮ as (argument, value)-tuples: subclass tfd, also irregular or sparse
        ◮ or in basis representation: subclass tfb
      ◮ basically, a glorified list of numeric vectors (... since lists work well as columns of data frames ...)
      ◮ with additional attributes that define function-like behavior:
        ◮ how to evaluate the given “functions” for new arguments
        ◮ their domain
        ◮ the resolution of the argument values
      ◮ S3 based
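
The "glorified list of numeric vectors with attributes" idea can be illustrated with a small S3 sketch in base R. The class `toy_tf` and the helpers `new_toy_tf()` and `toy_evaluate()` are hypothetical names invented for this example; this is not tidyfun's implementation, only a minimal version of the same design under the assumption of a common grid:

```r
# Hypothetical toy class "toy_tf": a list of numeric vectors plus attributes.
# This is NOT tidyfun's code, only an illustration of the design idea.
new_toy_tf <- function(evaluations, arg, domain = range(arg)) {
  stopifnot(is.list(evaluations), all(lengths(evaluations) == length(arg)))
  structure(evaluations, arg = arg, domain = domain, class = "toy_tf")
}

# Function-like behavior: evaluate every curve at new argument values,
# here by linear interpolation with base R's approx().
toy_evaluate <- function(x, new_arg) {
  arg <- attr(x, "arg")
  lapply(unclass(x), function(values) approx(arg, values, xout = new_arg)$y)
}

# S3 print method summarising the vector of curves.
print.toy_tf <- function(x, ...) {
  dom <- attr(x, "domain")
  cat("toy_tf[", length(x), "] on (", dom[1], ",", dom[2], ")\n", sep = "")
  invisible(x)
}

# Two made-up curves on the common grid 0, 1, 2.
f <- new_toy_tf(list(A = c(0, 1, 4), B = c(1, 1, 1)), arg = c(0, 1, 2))
print(f)
toy_evaluate(f, new_arg = c(0.5, 1.5))
```

Because the object is still a list underneath, it has one element per curve, which is what lets such an object sit in a data frame column with one curve per row.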

  14. Example Data

      [Figure: the five example curves A, B, C, D, E (object ex) plotted over (0, 1)]

      ex
      ## tfd[5] on (0,1) based on 93 evaluations each
      ## interpolation by tf_approx_linear
      ## A: (0.000,0.49);(0.011,0.52);(0.022,0.54); ...
      ## B: (0.000,0.47);(0.011,0.49);(0.022,0.50); ...
      ## C: (0.000,0.50);(0.011,0.51);(0.022,0.54); ...
      ## D: (0.000,0.40);(0.011,0.42);(0.022,0.44); ...
      ## E: (0.000,0.40);(0.011,0.41);(0.022,0.40); ...

  15. Example Data

      glimpse(dti)
      ## Observations: 382
      ## Variables: 5
      ## $ id <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 101...
      ## $ sex <fct> female, female, male, male, male, male, male, male, male,...
      ## $ case <fct> control, control, control, control, control, control, con...
      ## $ cca <tfd> 1001_1: (0.000,0.49);(0.011,0.52);(0.022,0.54); ..., 1002...
      ## $ rcst <tfd> 1001_1: (0.000,0.26);(0.019,0.45);(0.037,0.40); ..., 1002...

  16. tf subclass: tfd
      tfd objects contain “raw” functional data:
      ◮ represented as a list of evaluations $f_i(t)|_{t = t'}$ and corresponding argument vector(s) $t'$
      ◮ has a domain: the range of valid args.

      ex %>% tf_evaluations() %>% str
      ## List of 5
      ## $ : num [1:93] 0.491 0.517 0.536 0.555 0.593 ...
      ## $ : num [1:93] 0.472 0.487 0.502 0.523 0.552 ...
      ## $ : num [1:93] 0.502 0.514 0.539 0.574 0.603 ...
      ## $ : num [1:93] 0.402 0.423 0.44 0.46 0.475 ...
      ## $ : num [1:93] 0.402 0.406 0.399 0.386 0.409 ...

      ex %>% tf_arg() %>% str
      ## num [1:93] 0 0.0109 0.0217 0.0326 0.0435 ...

      ex %>% tf_domain()
      ## [1] 0 1

  17. tf subclass: tfd
      ◮ contains an evaluator function that defines how to inter-/extrapolate evaluations between args (and remembers results of previous calls)

      tf_evaluator(ex) %>% str
      ## function (x, arg, evaluations)
      ## - attr(*, "memoised")= logi TRUE
      ## - attr(*, "class")= chr [1:2] "memoised" "function"

      tf_evaluator(ex) <- tf_approx_spline
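
What swapping the evaluator changes can be sketched with base R stand-ins. The helpers `eval_linear()` and `eval_spline()` below are hypothetical; they only borrow the `(x, arg, evaluations)` argument order from the `str()` output on the slide and use `approx()` and `spline()`, which may differ from what `tf_approx_linear` and `tf_approx_spline` do internally. Memoisation is left out:

```r
# Observed evaluations of one curve on a coarse grid (made-up values: f(t) = t^2).
arg         <- c(0, 0.25, 0.5, 0.75, 1)
evaluations <- arg^2

# Two stand-in "evaluators" with the (x, arg, evaluations) argument order.
# Both are hypothetical helpers built on base R, not tidyfun functions.
eval_linear <- function(x, arg, evaluations) {
  approx(arg, evaluations, xout = x)$y
}
eval_spline <- function(x, arg, evaluations) {
  spline(arg, evaluations, xout = x, method = "natural")$y
}

# Evaluate the same curve at argument values between the observed grid points.
new_arg <- c(0.125, 0.6)
eval_linear(new_arg, arg, evaluations)
eval_spline(new_arg, arg, evaluations)
```

Both evaluators reproduce the observed values at the grid points and differ only in how they fill in the gaps, which is the behavior the evaluator attribute controls.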

  18. tf subclass: tfd
      ◮ tfd has subclasses for regular data with a common grid and irregular or sparse data.

      dti$rcst[1:2]
      ## tfd[2] on (0,1) based on 43 to 55 (mean: 49) evaluations each
      ## inter-/extrapolation by tf_approx_linear
      ## 1001_1: (0.000,0.26);(0.019,0.45);(0.037,0.40); ...
      ## 1002_1: ( 0.22,0.44);( 0.24,0.48);( 0.26,0.48); ...

      dti$rcst[1:2] %>% tf_arg() %>% str
      ## List of 2
      ## $ 1001_1: num [1:55] 0 0.019 0.037 0.056 0.074 0.093 0.111 0.13 0.148 0.167 ...
      ## $ 1002_1: num [1:43] 0.222 0.241 0.259 0.278 0.296 0.315 0.333 0.352 0.37 0.389 ...

      dti$rcst[1:2] %>% plot(pch = "x", col = viridis(2))

      [Figure: the two irregularly sampled rcst curves over (0, 1), each evaluation drawn as an "x"]
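
The difference between regular and irregular data can be illustrated in base R: with irregular data every curve carries its own argument vector, so a common grid has to be imposed explicitly. The object `irregular`, the grid `common_grid` and all values are made up for this sketch and are not taken from `dti$rcst`:

```r
# Two curves observed on different grids (made-up values).
# Curve "b" is only observed on the middle part of the domain.
irregular <- list(
  a = list(arg = c(0, 0.5, 1),       value = c(0.2, 0.4, 0.3)),
  b = list(arg = c(0.25, 0.5, 0.75), value = c(0.5, 0.6, 0.5))
)

# Number of evaluations per curve.
lengths(lapply(irregular, `[[`, "arg"))

# Put both curves on one common grid by linear interpolation.
# approx() returns NA outside each curve's observed argument range.
common_grid <- c(0, 0.25, 0.5, 0.75, 1)
on_grid <- t(sapply(irregular, function(f) {
  approx(f$arg, f$value, xout = common_grid)$y
}))
on_grid
```

The `NA` entries for curve `b` show why irregular or sparse curves do not fit a plain matrix layout without extra bookkeeping.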

  19. tf subclass: tfd

      dti$cca[1:3] %>% str(1)
      ## List of 3
      ## $ 1001_1: num [1:93] 0.491 0.517 0.536 0.555 0.593 ...
      ## $ 1002_1: num [1:93] 0.472 0.487 0.502 0.523 0.552 ...
      ## $ 1003_1: num [1:93] 0.502 0.514 0.539 0.574 0.603 ...
      ## - attr(*, "arg")=List of 1
      ## - attr(*, "domain")= num [1:2] 0 1
      ## - attr(*, "evaluator")=function (x, arg, evaluations)
      ## ..- attr(*, "memoised")= logi TRUE
      ## ..- attr(*, "class")= chr [1:2] "memoised" "function"
      ## - attr(*, "evaluator_name")= chr "tf_approx_linear"
      ## - attr(*, "resolution")= num 0.01
      ## - attr(*, "class")= chr [1:3] "tfd_reg" "tfd" "tf"
