tidyfun : Tidy Functional Data A new framework for working with - - PowerPoint PPT Presentation

tidyfun tidy functional data
SMART_READER_LITE
LIVE PREVIEW

tidyfun : Tidy Functional Data A new framework for working with - - PowerPoint PPT Presentation

tidyfun : Tidy Functional Data A new framework for working with functional data in R Fabian Scheipl 1 Jeff Goldsmith 2 1 : Dept. of Statistics, LMU Munich 2 : Columbia University Mailman School of Public Health Functional Data Painful to work with:


slide-1
SLIDE 1

tidyfun: Tidy Functional Data

A new framework for working with functional data in R Fabian Scheipl1 Jeff Goldsmith2

1: Dept. of Statistics, LMU Munich 2: Columbia University Mailman School of Public Health

slide-2
SLIDE 2

Functional Data

Painful to work with: ◮ huge amounts of data ◮ regular grids? irregular grids? ◮ work with:

◮ raw data? ◮ smooth/interpolated? ◮ basis representations?

2 / 61

slide-3
SLIDE 3

Functional Data

Painful: Two (2.5, actually..) bad options to keep it in the same data.frame as the rest of your data:

  • 1. wide format:

◮ way too many weird columns ◮ need to keep track of argument values t separately somehow:

## Observations: 382 ## Variables: 96 ## $ id <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009... ## $ sex <fct> female, female, male, male, male, male, male, male, ... ## $ case <fct> control, control, control, control, control, control... ## $ cca_0 <dbl> 0.4909345, 0.4721627, 0.5023738, 0.4021894, 0.401874... ## $ cca_0.011 <dbl> 0.5168018, 0.4868219, 0.5136516, 0.4225127, 0.405580... ## $ cca_0.022 <dbl> 0.5356539, 0.5022577, 0.5392542, 0.4398983, 0.398548... ## $ cca_0.033 <dbl> 0.5553587, 0.5233635, 0.5742101, 0.4600235, 0.386000... ## $ cca_0.043 <dbl> 0.5927610, 0.5524401, 0.6031339, 0.4751297, 0.408824... ## $ cca_0.054 <dbl> 0.6326935, 0.5872003, 0.6335913, 0.4990257, 0.425183... ## $ cca_0.065 <dbl> 0.6500317, 0.5968158, 0.6357108, 0.5165528, 0.429537... ## $ cca_0.076 <dbl> 0.6556130, 0.6026607, 0.6350799, 0.5552692, 0.444545... ## $ cca_0.087 <dbl> 0.6493701, 0.5922767, 0.6201638, 0.5826485, 0.486943... ## $ cca_0.098 <dbl> 0.6378739, 0.5791859, 0.6086281, 0.6005767, 0.510038... ## $ cca_0.109 <dbl> 0.6286463, 0.5714253, 0.5910287, 0.6135842, 0.537097...

3 / 61

slide-4
SLIDE 4

Functional Data

Painful: Two (2.5, actually..) bad options to keep it in the same data.frame as the rest of your data:

  • 2. long format:

◮ unwieldy amounts of rows, lots of duplication for non-functional data ◮ need to keep track of grouping structure (which rows belong to the same curve?) throughout ◮ infeasible if we have more than one function per observational unit

## Observations: 35,526 ## Variables: 6 ## $ id <dbl> 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001... ## $ sex <fct> female, female, female, female, female, female, fema... ## $ case <fct> control, control, control, control, control, control... ## $ cca_id <chr> "1001_1", "1001_1", "1001_1", "1001_1", "1001_1", "1... ## $ cca_arg <dbl> 0.000, 0.011, 0.022, 0.033, 0.043, 0.054, 0.065, 0.0... ## $ cca_value <dbl> 0.4909345, 0.5168018, 0.5356539, 0.5553587, 0.592761...

4 / 61

slide-5
SLIDE 5

Functional Data

Painful to work with: Third bad option: matrix columns in a data.frame. Sucks, too: ◮ not really well supported (breaks lots of tidyverse-stuff, much unexpected behavior in base) ◮ more trouble than it’s worth: doesn’t solve how to keep track of argument values

5 / 61

slide-6
SLIDE 6

Functional Data

Despite all that, people keep measuring ever more of the damn things. Let’s make dealing with functional data in R less painful.

6 / 61

slide-7
SLIDE 7

Start at the end...

7 / 61

slide-8
SLIDE 8

This is what we’re aiming for:

# group-wise functional medians: medians <- dti %>% group_by(case, sex) %>% summarize(median_rcst = median(rcst)) ggplot(medians) + geom_spaghetti(aes(y = median_rcst, col = sex, linetype = case))

0.4 0.5 0.6 0.7 0.00 0.25 0.50 0.75 1.00

median_rcst sex

male female

case

control MS

8 / 61

slide-9
SLIDE 9

This is what we’re aiming for:

dti[, -1] ## # A tibble: 382 x 4 ## sex case cca rcst ## <fct> <fct> <tfd> <tfd> ## 1 female control (0.000,0.49);(0.011,0.52);(... (0.000,0.26);(0.019,0.45);(... ## 2 female control (0.000,0.47);(0.011,0.49);(... ( 0.22,0.44);( 0.24,0.48);(... ## 3 male control (0.000,0.50);(0.011,0.51);(... ( 0.22,0.42);( 0.24,0.41);(... ## 4 male control (0.000,0.40);(0.011,0.42);(... (0.000,0.51);(0.019,0.50);(... ## 5 male control (0.000,0.40);(0.011,0.41);(... ( 0.22,0.40);( 0.24,0.41);(... ## 6 male control (0.000,0.45);(0.011,0.45);(... (0.056,0.47);(0.074,0.49);(... ## 7 male control (0.000,0.55);(0.011,0.56);(... (0.000,0.52);(0.019,0.53);(... ## 8 male control (0.000,0.45);(0.011,0.48);(... (0.000,0.33);(0.019,0.42);(... ## 9 male control (0.000,0.50);(0.011,0.51);(... (0.000,0.57);(0.019,0.55);(... ## 10 male control (0.000,0.46);(0.011,0.47);(... ( 0.22,0.44);( 0.24,0.45);(... ## # ... with 372 more rows

9 / 61

slide-10
SLIDE 10

tidyfun

The goal of tidyfun is to provide accessible and well-documented software that makes functional data analysis in R easy, specifically: data wrangling and exploratory analysis. tidyfun provides: ◮ new data types for representing functional data: tfd & tfb ◮ arithmetic operators, descriptive statistics and graphics functions for such data ◮ tidyverse-verbs for handling functional data inside data frames.

10 / 61

slide-11
SLIDE 11

Plan for today

◮ tidyfun’s data types ◮ tidyfun’s methods & functions ◮ Discussion & Feedback:

◮ What’s stupid? ◮ What is too complicated? ◮ What am I missing?

11 / 61

slide-12
SLIDE 12

tf-Class: Definition

12 / 61

slide-13
SLIDE 13

tf-class

tf is a new data type for (vectors of) functional data: ◮ an abstract superclass for functional data in 2 forms:

◮ as (argument, value)-tuples: subclass tfd, also irregular or sparse ◮ or in basis representation: subclass tfb

◮ basically, a glorified list of numeric vectors (... since lists work well as columns of data frames ...) ◮ with additional attributes that define function-like behavior:

◮ how to evaluate the given “functions” for new arguments ◮ their domain ◮ the resolution of the argument values

◮ S3 based

13 / 61

slide-14
SLIDE 14

Example Data

0.0 0.2 0.4 0.6 0.8 1.0 0.40 0.50 0.60 ex A B C D E ex ## tfd[5] on (0,1) based on 93 evaluations each ## interpolation by tf_approx_linear ## A: (0.000,0.49);(0.011,0.52);(0.022,0.54); ... ## B: (0.000,0.47);(0.011,0.49);(0.022,0.50); ... ## C: (0.000,0.50);(0.011,0.51);(0.022,0.54); ... ## D: (0.000,0.40);(0.011,0.42);(0.022,0.44); ... ## E: (0.000,0.40);(0.011,0.41);(0.022,0.40); ...

14 / 61

slide-15
SLIDE 15

Example Data

glimpse(dti) ## Observations: 382 ## Variables: 5 ## $ id <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 101... ## $ sex <fct> female, female, male, male, male, male, male, male, male,... ## $ case <fct> control, control, control, control, control, control, con... ## $ cca <tfd> 1001_1: (0.000,0.49);(0.011,0.52);(0.022,0.54); ..., 1002... ## $ rcst <tfd> 1001_1: (0.000,0.26);(0.019,0.45);(0.037,0.40); ..., 1002...

15 / 61

slide-16
SLIDE 16

tf subclass: tfd

tfd objects contain “raw” functional data: ◮ represented as a list of evaluations fi(t)|t=t′ and corresponding argument vector(s) t′ ◮ has a domain: the range of valid args.

ex %>% tf_evaluations() %>% str ## List of 5 ## $ : num [1:93] 0.491 0.517 0.536 0.555 0.593 ... ## $ : num [1:93] 0.472 0.487 0.502 0.523 0.552 ... ## $ : num [1:93] 0.502 0.514 0.539 0.574 0.603 ... ## $ : num [1:93] 0.402 0.423 0.44 0.46 0.475 ... ## $ : num [1:93] 0.402 0.406 0.399 0.386 0.409 ... ex %>% tf_arg() %>% str ## num [1:93] 0 0.0109 0.0217 0.0326 0.0435 ... ex %>% tf_domain() ## [1] 0 1

16 / 61

slide-17
SLIDE 17

tf subclass: tfd

◮ contains an evaluator function that defines how to inter-/extrapolate evaluations between args (and remembers results of previous calls)

tf_evaluator(ex) %>% str ## function (x, arg, evaluations) ##

  • attr(*, "memoised")= logi TRUE

##

  • attr(*, "class")= chr [1:2] "memoised" "function"

tf_evaluator(ex) <- tf_approx_spline

17 / 61

slide-18
SLIDE 18

tf subclass: tfd

◮ tfd has subclasses for regular data with a common grid and irregular or sparse data.

dti$rcst[1:2] ## tfd[2] on (0,1) based on 43 to 55 (mean: 49) evaluations each ## inter-/extrapolation by tf_approx_linear ## 1001_1: (0.000,0.26);(0.019,0.45);(0.037,0.40); ... ## 1002_1: ( 0.22,0.44);( 0.24,0.48);( 0.26,0.48); ... dti$rcst[1:2] %>% tf_arg() %>% str ## List of 2 ## $ 1001_1: num [1:55] 0 0.019 0.037 0.056 0.074 0.093 0.111 0.13 0.148 0.167 ... ## $ 1002_1: num [1:43] 0.222 0.241 0.259 0.278 0.296 0.315 0.333 0.352 0.37 0.389 ... dti$rcst[1:2] %>% plot(pch = "x", col = viridis(2))

0.0 0.2 0.4 0.6 0.8 1.0 0.3 0.4 0.5 0.6 0.7 . x x xxxxxx x x xx xxxxx x xxx x xx xxxxxxxxxx x x xxxx x x x xxx x x xxxxx xx x xx x x xx x xx xx x xx xx xxx x xxxxxx x x x xx x x x xxx xxx xx

18 / 61

slide-19
SLIDE 19

tf subclass: tfd

dti$cca[1:3] %>% str(1) ## List of 3 ## $ 1001_1: num [1:93] 0.491 0.517 0.536 0.555 0.593 ... ## $ 1002_1: num [1:93] 0.472 0.487 0.502 0.523 0.552 ... ## $ 1003_1: num [1:93] 0.502 0.514 0.539 0.574 0.603 ... ##

  • attr(*, "arg")=List of 1

##

  • attr(*, "domain")= num [1:2] 0 1

##

  • attr(*, "evaluator")=function (x, arg, evaluations)

## ..- attr(*, "memoised")= logi TRUE ## ..- attr(*, "class")= chr [1:2] "memoised" "function" ##

  • attr(*, "evaluator_name")= chr "tf_approx_linear"

##

  • attr(*, "resolution")= num 0.01

##

  • attr(*, "class")= chr [1:3] "tfd_reg" "tfd" "tf"

19 / 61

slide-20
SLIDE 20

tf subclass: tfb

Functional data in basis representation: ◮ represented as a list of coefficients and a common basis_matrix of basis function evaluations on a vector of arg-values. ◮ contains a basis function that defines how to compute the basis for new args and how to differentiate/integrate. ◮ (internal) flavors:

◮ tfb_spline: mgcv-spline bases ◮ tfb_fpc: FPCs (wavelets to be added)

◮ significant memory (and time) savings:

refund::DTI$cca %>% object.size() %>% print(units = "Kb") ## 307.7 Kb dti$cca %>% object.size() %>% print(units = "Kb") ## 339.7 Kb dti$cca %>% tfb(verbose = FALSE) %>% object.size() %>% print(units = "Kb") ## 189.8 Kb

20 / 61

slide-21
SLIDE 21

tf subclass: tfb

dti$cca[1:3] %>% tfb(verbose = FALSE) %>% str(1) ## List of 3 ## $ 1001_1: num [1:25] 0.485 0.592 0.651 0.622 0.534 ... ## $ 1002_1: num [1:25] 0.465 0.551 0.598 0.543 0.486 ... ## $ 1003_1: num [1:25] 0.491 0.597 0.633 0.549 0.488 ... ##

  • attr(*, "domain")= num [1:2] 0 1

##

  • attr(*, "basis")=function (arg)

## ..- attr(*, "memoised")= logi TRUE ## ..- attr(*, "class")= chr [1:2] "memoised" "function" ##

  • attr(*, "basis_label")= chr "s(arg, bs = \"cr\", k = 25)"

##

  • attr(*, "basis_args")=List of 2

##

  • attr(*, "basis_matrix")= num [1:93, 1:25] 1 0.6681 0.3663 0.1246 -0.0176 ...

##

  • attr(*, "arg")= num [1:93] 0 0.011 0.022 0.033 0.043 0.054 0.065 0.076 0.087 0.098

##

  • attr(*, "resolution")= num 0.01

##

  • attr(*, "family")=List of 11

## ..- attr(*, "class")= chr "family" ##

  • attr(*, "class")= chr [1:3] "tfb_spline" "tfb" "tf"

21 / 61

slide-22
SLIDE 22

tf subclass: tfb_spline

◮ default for tfb() ◮ accepts all arguments of mgcv’s s()-syntax: control basis type bs, basis dimension k, penalty order m ◮ also does non-Gaussian fits: family argument

◮ all exponential families ◮ but also: t-distribution (robust smoothing?), ZI-Poisson (accelerometry?), Beta, ...

22 / 61

slide-23
SLIDE 23

tf subclass: tfb_spline

ex_b = ex %>% tfb(); ex_b[1:2] ## Percentage of input data variability preserved in basis representation ## (per functional observation, approximate): ##

  • Min. 1st Qu.

Median Mean 3rd Qu. Max. ## 95.60 96.40 97.00 97.14 98.00 98.70 ## tfb[2] on (0,1) in basis representation: ## using basis s(arg, bs = "cr", k = 25) ## A: (0.000,0.49);(0.011,0.52);(0.022,0.54); ... ## B: (0.000,0.47);(0.011,0.49);(0.022,0.51); ... ex[1:2] %>% tfb(bs = "tp", k = 55) ## Percentage of input data variability preserved in basis representation ## (per functional observation, approximate): ##

  • Min. 1st Qu.

Median Mean 3rd Qu. Max. ## 99.10 99.22 99.35 99.35 99.47 99.60 ## tfb[2] on (0,1) in basis representation: ## using basis s(arg, bs = "tp", k = 55) ## A: (0.000,0.49);(0.011,0.52);(0.022,0.54); ... ## B: (0.000,0.47);(0.011,0.49);(0.022,0.50); ... ex[1:2] %>% tfb(bs = "ps", m = c(2,1), family = betar(link = "cloglog")) ## Percentage of input data variability preserved in basis representation ## (on inverse link-scale, per functional observation, approximate): ##

  • Min. 1st Qu.

Median Mean 3rd Qu. Max. ## 99.40 99.47 99.55 99.55 99.62 99.70 ## tfb[2] on (0,1) in basis representation: ## using basis s(arg, bs = "ps", m = c(2, 1), k = 25) ## A: (0.000,0.49);(0.011,0.51);(0.022,0.54); ... ## B: (0.000,0.47);(0.011,0.49);(0.022,0.51); ...

23 / 61

slide-24
SLIDE 24

tf subclass: tfb spline basis

◮ penalization: function-specific (default), none, prespecified (sp),

  • r global

ex %>% plot() ex %>% tfb() %>% plot(col = "red") ex %>% tfb(k = 35, penalized = FALSE) %>% lines(col = "blue")

0.0 0.2 0.4 0.6 0.8 1.0 0.40 0.50 0.60 ex 0.0 0.2 0.4 0.6 0.8 1.0 0.40 0.50 0.60 ex %>% tfb(verbose = FALSE)

24 / 61

slide-25
SLIDE 25

tf subclass: tfb spline basis

◮ penalization: function-specific (default), none, prespecified (sp),

  • r global

ex %>% plot() ex %>% tfb() %>% plot(col = "red") ex %>% tfb(sp = .001) %>% lines(col = "orange")

0.0 0.2 0.4 0.6 0.8 1.0 0.40 0.50 0.60 ex 0.0 0.2 0.4 0.6 0.8 1.0 0.40 0.50 0.60 ex %>% tfb(verbose = FALSE)

25 / 61

slide-26
SLIDE 26

tf subclass: tfb spline basis

“Global” smoothing:

  • 1. estimate smoothing parameters for subsample (~10%) of curves
  • 2. apply geometric mean of estimated smoothing parameters to

smooth all curves Good: ◮ (much) faster than optimizing penalization for each curve ◮ should scale well for larg-ish datasets Not good: ◮ no real borrowing of information across curves (very sparse or functional fragment data, e.g.) ◮ needs more observations than basis functions per curve. ◮ subsample could miss small subgroups with different roughness Should global smoothing be the default?

26 / 61

slide-27
SLIDE 27

tf subclass: tfb spline basis

Global smoothing:

raw %>% plot() tfb(raw, k = 55) %>% plot tfb(raw, k = 55, global = TRUE) %>% plot

0.0 0.2 0.4 0.6 0.8 1.0 −5 5

raw

raw 0.0 0.2 0.4 0.6 0.8 1.0 −5 5

separate

tfb(raw, k = 55) 0.0 0.2 0.4 0.6 0.8 1.0 −5 5

global

tfb(raw, k = 55, global = TRUE)

27 / 61

slide-28
SLIDE 28

tf subclass: tfb FPC-based

◮ uses either

◮ simple unregularized SVD of the data matrix (“smooth = FALSE”) ◮ or smoothed covariance estimate from refund::fpca.sc

◮ corresponding FPC basis and mean function saved as tfd-object & observed functions are simply linear combinations of those.

(ex %>% tfb_fpc(smooth = FALSE, pve = .999)) ## tfb[5] on (0,1) in basis representation: ## using basis FPC: 4 components. ## A: (0.000,0.49);(0.011,0.52);(0.022,0.54); ... ## B: (0.000,0.47);(0.011,0.49);(0.022,0.50); ... ## C: (0.000,0.50);(0.011,0.51);(0.022,0.54); ... ## D: (0.000,0.40);(0.011,0.42);(0.022,0.44); ... ## E: (0.000,0.40);(0.011,0.41);(0.022,0.40); ... (ex %>% tfb_fpc(pve = .9)) ## tfb[5] on (0,1) in basis representation: ## using basis FPC: 10 components. ## A: (0.000,0.49);(0.011,0.52);(0.022,0.55); ... ## B: (0.000,0.46);(0.011,0.49);(0.022,0.52); ... ## C: (0.000,0.50);(0.011,0.53);(0.022,0.56); ... ## D: (0.000,0.41);(0.011,0.43);(0.022,0.45); ... ## E: (0.000,0.41);(0.011,0.41);(0.022,0.41); ...

28 / 61

slide-29
SLIDE 29

tf subclass: tfb FPC-based

ex %>% plot() ex %>% tfb_fpc(smooth = FALSE, pve = .999) %>% plot(col = "red") ex %>% tfb_fpc(pve = .9) %>% lines(col = "blue")

0.0 0.2 0.4 0.6 0.8 1.0 0.40 0.50 0.60 ex 0.0 0.2 0.4 0.6 0.8 1.0 0.40 0.50 0.60 tfb_fpc(ex)

29 / 61

slide-30
SLIDE 30

tf-Class: Methods

30 / 61

slide-31
SLIDE 31

Subset & subassign

Special [-methods:

ex[1:2] ## tfd[2] on (0,1) based on 93 evaluations each ## interpolation by tf_approx_spline ## A: (0.000,0.49);(0.011,0.52);(0.022,0.54); ... ## B: (0.000,0.47);(0.011,0.49);(0.022,0.50); ... ex[1:2] = ex[2:1] ex ## tfd[5] on (0,1) based on 93 evaluations each ## interpolation by tf_approx_spline ## B: (0.000,0.47);(0.011,0.49);(0.022,0.50); ... ## A: (0.000,0.49);(0.011,0.52);(0.022,0.54); ... ## C: (0.000,0.50);(0.011,0.51);(0.022,0.54); ... ## D: (0.000,0.40);(0.011,0.42);(0.022,0.44); ... ## E: (0.000,0.40);(0.011,0.41);(0.022,0.40); ...

31 / 61

slide-32
SLIDE 32

Evaluate

Re-defined second argument for [:

ex[1:2, seq(0, 1, l = 5)] ## 0.25 0.5 0.75 1 ## B 0.4721627 0.5153710 0.4984125 0.4887047 0.5802742 ## A 0.4909345 0.4911993 0.5307563 0.4184430 0.5904773 ## attr(,"arg") ## [1] 0.00 0.25 0.50 0.75 1.00 ex["B", seq(0, .15, l = 3), interpolate = FALSE] ## 0 0.075 0.15 ## B 0.4721627 NA NA ## attr(,"arg") ## [1] 0.000 0.075 0.150 ex[1:2, c(0, 1), matrix = FALSE] %>% str ## List of 2 ## $ B:Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 2 variables: ## ..$ arg : num [1:2] 0 1 ## ..$ value: num [1:2] 0.472 0.58 ## $ A:Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 2 variables: ## ..$ arg : num [1:2] 0 1 ## ..$ value: num [1:2] 0.491 0.59

32 / 61

slide-33
SLIDE 33

Compare & compute

ex[1] + ex[1] == 2 * ex[1] ## [1] TRUE log(exp(ex[2])) == ex[2] ## [1] TRUE ex - (2:-2) != ex ## [1] TRUE TRUE FALSE TRUE TRUE

33 / 61

slide-34
SLIDE 34

Summarize

c(mean = mean(ex), sd = sd(ex)) ## tfd[2] on (0,1) based on 93 evaluations each ## interpolation by tf_approx_spline ## mean: (0.000, 0.45);(0.011, 0.47);(0.022, 0.48); ... ## sd: (0.000,0.049);(0.011,0.052);(0.022,0.062); ... # Modified Band-2 Depth (Sun/Genton/Nychka, 2012), others to come -- which? tf_depth(ex) ## B A C D E ## 0.6108696 0.6467391 0.6597826 0.5728261 0.5097826 median(ex) == ex[which.max(tf_depth(ex))] ## C ## TRUE

34 / 61

slide-35
SLIDE 35

(Simple, local) smoothing

ex %>% tf_smooth("lowess") %>% plot ex %>% tf_smooth("rollmedian", k = 5) %>% plot 0.0 0.2 0.4 0.6 0.8 1.0 0.40 0.50 0.60 lowess 0.0 0.2 0.4 0.6 0.8 1.0 0.40 0.50 0.60 rolling median (k=5)

35 / 61

slide-36
SLIDE 36

Differentiate & integrate

ex %>% plot ex %>% tf_smooth() %>% tf_derive() %>% plot ex %>% tf_integrate(definite = FALSE) %>% plot

0.0 0.2 0.4 0.6 0.8 1.0 0.40 0.45 0.50 0.55 0.60 0.65 ex 0.0 0.2 0.4 0.6 0.8 1.0 −1 1 2 tf_smooth(ex) %>% tf_derive 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.1 0.2 0.3 0.4 0.5 tf_integrate(ex, definite = FALSE)

ex %>% tf_integrate() ## B A C D E ## 0.5202226 0.5266721 0.5090597 0.5308668 0.4661275

36 / 61

slide-37
SLIDE 37

Query

Find arguments t satisfying a condition on value f(t) (and argument t, optionally):

ex %>% tf_anywhere(value > .65) ## B A C D E ## FALSE TRUE TRUE FALSE FALSE ex[1:2] %>% tf_where(value > .6, "all") ## $B ## [1] 0.07608696 0.89130435 0.90217391 0.91304348 0.92391304 0.96739130 ## [7] 0.97826087 ## ## $A ## [1] 0.05434783 0.06521739 0.07608696 0.08695652 0.09782609 0.10869565 ## [7] 0.11956522 0.13043478 0.14130435 0.95652174 0.96739130 0.97826087 ex["A"] %>% tf_where(value > .6, "range") ## begin end ## A 0.05434783 0.9782609 ex %>% tf_where(value > .6 & arg > .5, "first") ## B A C D E ## 0.8913043 0.9565217 0.9565217 0.9347826 0.9347826

37 / 61

slide-38
SLIDE 38

Zoom & query

0.0 0.2 0.4 0.6 0.8 1.0 0.40 0.50 0.60 ex

B A C D E

ex %>% tf_where(value == max(value), "first") ## B A C D E ## 0.90217391 0.07608696 1.00000000 0.10869565 0.97826087 # locations of maxima on [.5, 1]: ex[c("A", "D")] %>% tf_zoom(.5, 1) %>% tf_where(value == max(value), "first") ## A D ## 0.9673913 0.9565217 # which functions dip below the median curve anywhere in [0.2, 0.6]: ex %>% tf_zoom(0.2, 0.6) %>% tf_anywhere(value < median(ex)[, arg]) ## B A C D E ## TRUE FALSE FALSE FALSE TRUE

38 / 61

slide-39
SLIDE 39

Convert & construct

To & from list, matrix or data frame with "id","arg","value"-columns:

ex_matrix = ex %>% as.matrix(); str(ex_matrix) ## num [1:5, 1:93] 0.472 0.491 0.502 0.402 0.402 ... ##

  • attr(*, "dimnames")=List of 2

## ..$ : chr [1:5] "B" "A" "C" "D" ... ## ..$ : chr [1:93] "0" "0.0108695652173913" "0.0217391304347826" "0.0326086956521739" ##

  • attr(*, "arg")= num [1:93] 0 0.0109 0.0217 0.0326 0.0435 ...

ex_df = ex %>% as.data.frame(); str(ex_df) ## Classes 'tbl_df', 'tbl' and 'data.frame': 465 obs. of 3 variables: ## $ id : Ord.factor w/ 5 levels "B"<"A"<"C"<"D"<..: 1 1 1 1 1 1 1 1 1 1 ... ## $ arg : num 0 0.0109 0.0217 0.0326 0.0435 ... ## $ value: num 0.472 0.487 0.502 0.523 0.552 ... ex_matrix[1:2, ] %>% tfd() ## tfd[2] on (0,1) based on 93 evaluations each ## interpolation by tf_approx_linear ## B: (0.000,0.47);(0.011,0.49);(0.022,0.50); ... ## A: (0.000,0.49);(0.011,0.52);(0.022,0.54); ... tfd(ex_df) == tfd(ex_matrix) ## B A C D E ## TRUE TRUE TRUE TRUE TRUE

Same for tfb.

39 / 61

slide-40
SLIDE 40

Visualize: base

layout(t(1:2)) plot(ex, type = "spaghetti"); lines(c(median(ex), mean(ex)), col = c(2, 4)) plot(ex, type = "lasagna", col = viridis(50))

0.0 0.2 0.4 0.6 0.8 1.0 0.40 0.50 0.60 ex 0.0 0.2 0.4 0.6 0.8 1.0 id E D C A B 40 / 61

slide-41
SLIDE 41

Visualize: ggplot2

Pasta-themed geoms for functional data: ◮ geom_spaghetti for lines ◮ geom_meatballs for (lines &) points ◮ geom_capellini for little sparklines / glyphs on maps etc. ◮ gglasagna with order-aesthetic to sort the lasagna layers

41 / 61

slide-42
SLIDE 42

Visualize: ggplot2

ggplot(dti, aes(y = cca, colour = case)) + geom_spaghetti() + facet_wrap(~ sex)

male female 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.4 0.6 0.8

cca case

control MS

42 / 61

slide-43
SLIDE 43

Visualize: ggplot2

gglasagna(dti, y = cca,

  • rder = tf_integrate(cca, definite = TRUE)) +

facet_wrap(~ case)

control MS 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00

2009_2 2009_5 2009_6 2009_3 2001_3 2009_4 2002_2 2001_4 2001_2 2010_3 2007_4 2006_5 2008_1 2001_1 2006_4 2009_7 2006_2 2009_1 2004_3 2004_2 2007_5 2008_4 2004_4 2008_3 2006_3 2004_1 2001_5 2011_4 2008_2 2002_3 2006_1 2008_5 2007_3 2006_6 2006_7 2010_2 2008_6 2011_3 2002_1 2007_1 2007_6 2005_1 2005_4 2011_5 2011_1 2010_1 2005_2 2004_5 2005_3 2002_5 2007_2 2003_1 2010_4 2011_2 2003_2 2005_5 2002_4 2005_6 1005_1 1040_1 1013_1 1018_1 1003_1 1009_1 1020_1 1002_1 1001_1 1016_1 1004_1 1026_1 1010_1 1041_1 1006_1 1015_1 1011_1 1021_1 1024_1 1042_1 1031_1 1029_1 1023_1 1019_1 1028_1 1007_1 1014_1 1017_1 1012_1 1039_1 1035_1 1022_1 1037_1 1008_1 1030_1 1027_1 1025_1 1033_1 1036_1 1034_1 1032_1 1038_1

0.3 0.4 0.5 0.6 0.7 0.8

cca

  • rdered by: tf_integrate(cca, definite = TRUE)

43 / 61

slide-44
SLIDE 44

Visualize: ggplot2

Plots for spatial functional data:

weather <- fda::CanadianWeather canada <- data.frame( place = weather$place, region = weather$region, lat = weather$coordinates[,1], lon = -weather$coordinates[,2], region = weather$region) canada$temp <- tfd(t(weather$dailyAv[,,1]), arg = 1:365) canada$precipl10 <- tfd(t(weather$dailyAv[,,3]), arg = 1:365) %>% tf_smooth glimpse(canada) ## Observations: 35 ## Variables: 7 ## $ place <fct> St. Johns, Halifax, Sydney, Yarmouth, Charlottvl, Fr... ## $ region <fct> Atlantic, Atlantic, Atlantic, Atlantic, Atlantic, At... ## $ lat <dbl> 47.34, 44.39, 46.09, 43.50, 42.48, 45.58, 54.47, 48.... ## $ lon <dbl> -52.43, -63.36, -60.11, -66.07, -80.25, -66.39, -64.... ## $ region.1 <fct> Atlantic, Atlantic, Atlantic, Atlantic, Atlantic, At... ## $ temp <tfd> St. Johns: (1, -4);(2, -3);(3, -3); ..., Halifax: (1... ## $ precipl10 <tfd> [1]: (1, 0.6);(2, 0.6);(3, 0.6); ..., [2]: (1, 0...

44 / 61

slide-45
SLIDE 45

Visualize: ggplot2

canada_map <- data.frame(maps::map("world", "Canada", plot = FALSE)[c("x", "y")]) # maps of Canada with annual temperature averages in red, precipitation in blue: map_plot <- ggplot(canada, aes(x = lon, y = lat)) + geom_path(data = canada_map, aes(x = x, y = y), alpha = .3) + coord_quickmap() map_plot + geom_capellini(aes(tf = precipl10), width = 3, height = 5, colour = "blue") + map_plot + geom_capellini(aes(tf = temp), width = 3, height = 5, colour = "red")

40 50 60 70 80

lat

−125 −100 −75 −50

lon

40 50 60 70 80

lat

−125 −100 −75 −50

lon

45 / 61

slide-46
SLIDE 46

Wrangling tfs inside data frames

46 / 61

slide-47
SLIDE 47

Wrangling tfs inside data frames: dplyr

dplyr verbs filter, select, mutate, summarize work on tf-columns - e.g.:

# group-wise functional means: dti %>% group_by(case, sex) %>% summarize(mean_rcst = mean(rcst, na.rm = TRUE)) ## # A tibble: 4 x 3 ## case sex mean_rcst ## <fct> <fct> <tfd> ## 1 control male (0.000,0.51);(0.019,0.50);(... ## 2 control female (0.000,0.52);(0.019,0.53);(... ## 3 MS male (0.000,0.53);(0.019,0.52);(... ## 4 MS female (0.000,0.52);(0.019,0.52);(... # which subjects go below cca = .26: dti %>% filter(tf_anywhere(cca, value < .26)) ## # A tibble: 5 x 5 ## id sex case cca ## <dbl> <fct> <fct> <tfd> ## 1 2017 male MS (0.000,0.38);(0.011,0.38);(... ## 2 2017 male MS (0.000,0.34);(0.011,0.35);(... ## 3 2017 male MS (0.000,0.35);(0.011,0.37);(... ## 4 2083 male MS (0.000,0.36);(0.011,0.37);(... ## 5 2083 male MS (0.000,0.39);(0.011,0.43);(... ## # ... with 1 more variable: rcst <tfd>

47 / 61

slide-48
SLIDE 48

Wrangling tfs inside data frames: dplyr

# mutate and create derived functional data dti %>% mutate(rcst_smooth = tfb(rcst, k = 15, verbose = FALSE), rcst_deriv = tf_derive(rcst_smooth)) %>% glimpse ## Observations: 382 ## Variables: 7 ## $ id <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 10... ## $ sex <fct> female, female, male, male, male, male, male, male... ## $ case <fct> control, control, control, control, control, contr... ## $ cca <tfd> 1001_1: (0.000,0.49);(0.011,0.52);(0.022,0.54); ..... ## $ rcst <tfd> 1001_1: (0.000,0.26);(0.019,0.45);(0.037,0.40); ..... ## $ rcst_smooth <tfb> 1001_1: (0.00,0.32);(0.02,0.36);(0.04,0.39); ..., ... ## $ rcst_deriv <tfb> 1001_1: (0.00, 1.9);(0.02, 1.8);(0.04, 1.5); .....

48 / 61

slide-49
SLIDE 49

Wrangling tfs inside data frames: tidyr

tidyfun provides tf_ variants of tidyr-verbs to reshape and reformat functional data while keeping it in sync with other covariates: ◮ tf_spread: tf → columns for each arg ◮ tf_gather: columns for each arg → tf ◮ tf_nest : data in long format (id, arg, value) → tf ◮ tf_unnest: tf → data in long format (id, arg, value)

49 / 61

slide-50
SLIDE 50

Wrangling tfs inside data frames: tidyr

# *spread* tf out into columns for each arg dti_wide = dti %>% tf_spread(cca); dti_wide[, 1:10] %>% glimpse() ## Observations: 382 ## Variables: 10 ## $ id <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009... ## $ sex <fct> female, female, male, male, male, male, male, male, ... ## $ case <fct> control, control, control, control, control, control... ## $ rcst <tfd> 1001_1: (0.000,0.26);(0.019,0.45);(0.037,0.40); ...,... ## $ cca_0 <dbl> 0.4909345, 0.4721627, 0.5023738, 0.4021894, 0.401874... ## $ cca_0.011 <dbl> 0.5168018, 0.4868219, 0.5136516, 0.4225127, 0.405580... ## $ cca_0.022 <dbl> 0.5356539, 0.5022577, 0.5392542, 0.4398983, 0.398548... ## $ cca_0.033 <dbl> 0.5553587, 0.5233635, 0.5742101, 0.4600235, 0.386000... ## $ cca_0.043 <dbl> 0.5927610, 0.5524401, 0.6031339, 0.4751297, 0.408824... ## $ cca_0.054 <dbl> 0.6326935, 0.5872003, 0.6335913, 0.4990257, 0.425183... # *gather* columns representing f(<arg>) into a single tf-column dti_wide %>% tf_gather(matches("cca_")) %>% glimpse() ## Observations: 382 ## Variables: 5 ## $ id <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 101... ## $ sex <fct> female, female, male, male, male, male, male, male, male,... ## $ case <fct> control, control, control, control, control, control, con... ## $ rcst <tfd> 1001_1: (0.000,0.26);(0.019,0.45);(0.037,0.40); ..., 1002... ## $ cca <tfd> [1]: (0.000,0.49);(0.011,0.52);(0.022,0.54); ..., [2]: (0...

50 / 61

slide-51
SLIDE 51

Wrangling tfs inside data frames: tidyr

# unnest tf by writing 3 loong columns id, arg, value: # (will try to avoid unnecessary duplication of columns) dti_long = dti %>% tf_unnest(cca); dti_long %>% glimpse() ## Observations: 35,526 ## Variables: 7 ## $ id <dbl> 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001... ## $ sex <fct> female, female, female, female, female, female, fema... ## $ case <fct> control, control, control, control, control, control... ## $ rcst <tfd> 1001_1: (0.000,0.26);(0.019,0.45);(0.037,0.40); ...,... ## $ cca_id <chr> "1001_1", "1001_1", "1001_1", "1001_1", "1001_1", "1... ## $ cca_arg <dbl> 0.000, 0.011, 0.022, 0.033, 0.043, 0.054, 0.065, 0.0... ## $ cca_value <dbl> 0.4909345, 0.5168018, 0.5356539, 0.5553587, 0.592761... # create tf ("nested data") by re-combining (id, arg, value): dti_long %>% tf_nest(cca_value, .id = cca_id, .arg = cca_arg) %>% glimpse() ## Observations: 382 ## Variables: 6 ## $ cca_id <chr> "1001_1", "1002_1", "1003_1", "1004_1", "1005_1", "1... ## $ id <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009... ## $ sex <fct> female, female, male, male, male, male, male, male, ... ## $ case <fct> control, control, control, control, control, control... ## $ rcst <tfd> 1001_1: (0.000,0.26);(0.019,0.45);(0.037,0.40); ...,... ## $ cca_value <tfd> 1001_1: (0.000,0.49);(0.011,0.52);(0.022,0.54); ...,...

51 / 61

slide-52
SLIDE 52

Wrangling tfs inside data frames: tidyr

Careful: ◮ Grouped mutate-operations on irregular data can go wrong ◮ Internals likely to change completely if tidyverse-people enforce switch to vctrs.

52 / 61

slide-53
SLIDE 53

Wrangling tfs inside data.table:

... under development. Should be workable, data.table generally does list columns well.

53 / 61

slide-54
SLIDE 54

Wrap-Up & Discussion

54 / 61

slide-55
SLIDE 55

What seems too difficult to you...? What seems too sloppy/dangerous/simplistic to you..? What additional features would you need...?

55 / 61

slide-56
SLIDE 56

What is missing:

◮ Wavelet basis representation: How?

56 / 61

slide-57
SLIDE 57

What is missing:

◮ Warping/Registration: Which methods?

57 / 61

slide-58
SLIDE 58

What is missing:

◮ Bridges to fda, fda.usc, roahd, fdasrvf, ... ◮ Integration with refund, FDboost, registr, ...

58 / 61

slide-59
SLIDE 59

... I like it. You might, too, give it a spin!1 https://fabian-s.github.io/tidyfun/ Get it: devtools::install_github("fabian-s/tidyfun")

1Caveat emptor. Currently a moving target, still a beta-version. 59 / 61

slide-60
SLIDE 60

Outlook

Next up: ◮ integrate fda bases & penalties, wavelet bases ◮ functions for registering & warping ◮ validate & test Version 1.0: ◮ extensions for multivariable functions ◮ much more extensive documentation & tests ◮ integration with refund for modeling and inference

60 / 61

slide-61
SLIDE 61

Thanks. https://github.com/fabian-s/tidyfun

(I don’t even have references.) 61 / 61