Structured Additive Regression Models for Functional Data. Fabian Scheipl & Sonja Greven, Institut für Statistik, Ludwig-Maximilians-Universität München. III International Workshop on Advances in FDA, Castro Urdiales, June 2019.
Joint work with:
Alexander Bauer, Andreas Bender, Sarah Brockhaus, Jona Cederbaum, Ciprian Crainiceanu, Karen Fuchs, Antonio Gasparrini, Jan Gertheiss, Jonathan Gellar, Jeff Goldsmith, Wolfgang Hartl, Torsten Hothorn, Andrada Ivanescu, Helmut Küchenhoff, Matthew McLean, Friedrich Leisch, Viola Obermeier, David Rügamer, Ana-Maria Staicu, Simon N. Wood
Outline
- Introduction: Functional data analysis; Research program
- Structured additive functional regression: Model; Specification of Model Terms
- Estimation: Penalized Likelihood Estimation; Gradient Boosting
- Example: Feeding Behavior of Pigs
- Summary & Outlook: Wrap-Up
Functional Data
◮ unit of observation is a curve over some interval (or a 2D/3D domain)
◮ theoretical view: data are realizations of stochastic processes; curves live in a function space
◮ in practice: (densely/sparsely) sampled on a finite grid
◮ examples: spectroscopy, longitudinal blood marker profiles, medical imaging, accelerometers, ...
[Figures: NIR spectra over wavelength [nm]; total CD4 cell counts over months since seroconversion]
[Figure: trough-visit profile of pig 57 over the day, across days]
Functional Data Analysis (FDA)
FDA is statistics for functional data. There is increasing interest in relating functional variables to other variables of interest: functional regression.
◮ function-on-scalar regression
◮ scalar-on-function (also: "time-to-event"-on-function) regression
◮ function-on-function regression
(Cuevas, 2014; Morris, 2015)
Aim
A general framework for functional regression with a choice of
1. scalar or functional response (observed on equal or unequal grids)
2. modeled feature of the conditional response distribution, e.g. expectation, quantile, or higher moments
3. linear, smooth, and interaction effects of scalar and functional covariates, and (functional) random effects
4. bases, e.g. splines or functional principal components
5. estimation method: gradient boosting or penalized likelihood
Key idea: model the observations within curves and shift the functional structure into the additive predictor ⇒ penalized scalar regression using varying-coefficient models. Implemented in the R packages FDboost and refund.
Structured Additive Functional Models
Observations (Y_i, X_i), i = 1, ..., N, with
◮ Y_i a functional (or scalar) response over an interval T = [a, b]
◮ X_i a set of scalar and/or functional covariates

Structured Additive Regression Model:
    ξ(Y_i | X_i = x_i)(t) = h(x_i)(t) = Σ_{j=1}^J h_j(x_i, t)

◮ ξ: the modeled feature of the conditional response distribution, e.g. expectation (with link function), median, a quantile, ...
◮ partial effects h_j(x_i, t) are real-valued functions over T depending on one or more covariates.

Scheipl et al. (2015, 2016); Brockhaus et al. (2015); Greven & Scheipl (2017)
Partial Effects hj(x, t)
Model: ξ(Y | X = x)(t) = h(x)(t) = Σ_j h_j(x, t)

covariate(s)                     type of effect                   h_j(x, t)
(none)                           smooth intercept                 α(t)
scalar covariate z               linear effect                    z β(t)
                                 smooth effect                    γ(z, t)
two scalars z1, z2               linear interaction               z1 z2 β(t)
                                 functional varying coefficient   z1 f(z2, t)
                                 smooth interaction               f(z1, z2, t)
functional covariate x(s)        linear functional effect         ∫_S x(s) β(s, t) ds
                                 historical effect                ∫_{ℓ(t)}^{u(t)} x(s) β(s, t) ds
                                 smooth functional effect         ∫_S F(x(s), s, t) ds
scalar z, functional x(s)        linear interaction               z ∫_S x(s) β(s, t) ds
                                 smooth interaction               ∫_S x(s) β(z, s, t) ds
grouping variable g              functional random intercept      b_g(t)
grouping variable g, scalar z    functional random slope          z b_g(t)
curve indicator i                curve-specific smooth residual   e_i(t)
Representation of Partial Effects hj(x, t)
Let Y collect all Y_i(t_id) ∀ i, d in one long vector. Model: ξ(Y | X = x)(t) = h(x, t) = Σ_j h_j(x, t)
Represent each h_j(x, t) using bases tailored to the effect type:

Row Tensor Product Basis
    h_j(x, t) = (b_j(x, t)^T ⊙ b_Y(t)^T) θ_j

◮ row tensor product: A (n×p) ⊙ B (n×q) = (A ⊗ 1_q^T) ∘ (1_p^T ⊗ B), with ∘ the element-wise product
◮ b_j: K_j-vector of basis functions for the covariate(s)
◮ b_Y: K_Y-vector of basis functions over T
◮ θ_j: coefficient vector
⇒ linearization of the estimation problem.
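The row tensor product identity above can be checked numerically; a minimal numpy sketch, where the two matrices stand in for evaluated marginal bases:

```python
import numpy as np

def row_tensor(A, B):
    """Row tensor ("face-splitting") product: row i of the result is
    kron(A[i, :], B[i, :]).  Uses the slide's identity
    A (n x p) odot B (n x q) = (A kron 1_q^T) * (1_p^T kron B),
    where * is the element-wise (Hadamard) product."""
    n, p = A.shape
    _, q = B.shape
    return np.kron(A, np.ones((1, q))) * np.kron(np.ones((1, p)), B)

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))   # e.g. covariate basis evaluations b_j
B = rng.normal(size=(5, 4))   # e.g. basis over T, b_Y
M = row_tensor(A, B)          # shape (5, 12)
# agrees with the row-wise Kronecker definition
assert np.allclose(M[2], np.kron(A[2], B[2]))
```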
Regularization of Partial Effects
Anisotropic penalty for regularization:
    pen(h_j(x, t)) = θ_j^T P_jY θ_j
with penalty matrix
    P_jY = λ_j (P_j ⊗ I_{K_Y}) + λ_Y (I_{K_j} ⊗ P_Y)
◮ P_j, P_Y: penalty matrices for the marginal bases
◮ λ_j, λ_Y ∈ R+: smoothing parameters

Equivalent representation as a (low-rank) Gaussian process prior:
    h_j(x, t) = (b_j(x)^T ⊗ b_Y(t)^T) θ_j = B_jY θ_j,   θ_j ∼ N(0, P_jY^−)

Latent Low-Rank Gauss Processes
    h_j(x, t) ∼ GP(0, B_jY P_jY^− B_jY^T)
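The Kronecker-sum structure of the anisotropic penalty can be sketched in numpy; second-order difference penalties are assumed here as illustrative marginal penalties:

```python
import numpy as np

def diff_penalty(k, order=2):
    """Marginal P-spline-type penalty D^T D with d-th order differences."""
    D = np.diff(np.eye(k), n=order, axis=0)
    return D.T @ D

Kj, KY = 6, 8
Pj, PY = diff_penalty(Kj), diff_penalty(KY)
lam_j, lam_Y = 2.0, 0.5
# anisotropic penalty P_jY = lam_j (Pj kron I) + lam_Y (I kron PY)
PjY = lam_j * np.kron(Pj, np.eye(KY)) + lam_Y * np.kron(np.eye(Kj), PY)
# symmetric and positive semi-definite, as a valid quadratic penalty must be
assert np.allclose(PjY, PjY.T)
assert np.all(np.linalg.eigvalsh(PjY) > -1e-9)
```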
Specification of Partial Effects hj(x, t)
Very flexible: combine
◮ any basis & penalty over x with
◮ any basis & penalty over T.
Computational efficiency:
◮ compute only on distinct/discretized values (Li & Wood, 2019)
◮ use fast & lean generalized linear array models (Currie et al., 2006).
For data on a regular grid t = (t_1, ..., t_d) and most h_j, the nd × K_j K_Y matrix
    B_jY = [ b_j(x_i, t)^T ⊙ b_Y(t)^T ]_{i=1,...,n; t=t_1,...,t_d}
reduces to the Kronecker product
    B_jY = [ b_j(x_i)^T ]_{i=1,...,n} ⊗ [ b_Y(t)^T ]_{t=t_1,...,t_d}
of an n × K_j and a d × K_Y matrix.
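The array-model savings rest on the identity (A ⊗ B) vec(X) = vec(B X Aᵀ), which lets one compute B_jY θ without ever materializing the nd × K_j K_Y Kronecker product; a numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, Kj, KY = 50, 40, 7, 9
Bj = rng.normal(size=(n, Kj))    # covariate basis, one row per curve
BY = rng.normal(size=(d, KY))    # basis over the grid t_1, ..., t_d
theta = rng.normal(size=Kj * KY)

# naive: materialize the nd x (Kj*KY) design matrix
full = np.kron(Bj, BY) @ theta

# array-model shortcut: (Bj kron BY) vec(X) = vec(BY @ X @ Bj.T),
# with column-major ("Fortran-order") vec
X = theta.reshape(KY, Kj, order="F")
fast = (BY @ X @ Bj.T).ravel(order="F")
assert np.allclose(full, fast)
```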
Tensor Product Basis & Penalties
P_j ⊗ I and I ⊗ P_Y act as repeated penalties: each applies a marginal penalty to every subvector of θ associated with one specific marginal basis function. In the tensor product basis b_j(x, t) ⊗ b_Y(t), each of the K_j covariate basis functions is crossed with each of the K_Y basis functions over T. Accordingly, I_{K_j} ⊗ P_Y places one copy of P_Y in a diagonal block per covariate basis function, while P_j ⊗ I_{K_t} spreads the entries of P_j across the coefficients that share the same basis function in t.
[Schematic: block structure of the Kronecker products for the tensor product basis and the two penalty terms]
Example: Historical Functional Effect
The historical effect is approximated as
    ∫_{ℓ(t)}^{u(t)} x(s) β(s, t) ds ≈ Σ_{r=1}^R x̃(s_r, t) β(s_r, t) = Σ_{r=1}^R x̃(s_r, t) Σ_{k=1}^{K_j} Σ_{l=1}^{K_Y} b_k^j(s_r) b_l^Y(t) θ_{j,kl}
with
◮ x̃(s_r, t) = 1(ℓ(t) ≤ s_r ≤ u(t)) Δ(s_r) x(s_r)
◮ Δ(s): numerical integration weights
◮ {b_k^j(s), 1 ≤ k ≤ K_j}, {b_l^Y(t), 1 ≤ l ≤ K_Y}: spline bases
◮ b_j(x, t) = [ Σ_{r=1}^R x̃_i(s_r, t) b_k^j(s_r) ]_{i=1,...,n; k=1,...,K_j}
◮ P_j, P_Y: marginal penalty matrices for b_j(s), b_Y(t)
Example: FPC-Based Functional Effect
For x_i(s) ≈ Σ_{k=1}^{K_j} ψ_k(s) ξ_ik,
    ∫_S x_i(s) β(s, t) ds = Σ_k ξ_ik ∫_S ψ_k(s) β(s, t) ds = Σ_k ξ_ik β̃_k(t)
⇒ a sum of varying-coefficient terms for the FPC scores
◮ b_j(x, t) = [ ξ̂_ik ]_{n×K_j} ⊗ 1_d
◮ P_j = 0
◮ extends to nonlinear FPC effects
    h_j(x_i(s), t) = Σ_{k=1}^{K_j} f_k(ξ̂_ik, t),   h_j(x_i(s), t) = Σ_{k; k′>k}^{K_j} f_{k,k′}(ξ̂_ik, ξ̂_ik′, t),
i.e. (sums of) nonlinear effects of the synthetic covariates ξ̂_k.
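How the scores ξ̂_ik arise can be sketched via the eigendecomposition of a discretized sample covariance; this is a self-contained illustration on simulated curves, not refund's FPCA implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 50
s = np.linspace(0, 1, d)
# simulate curves from two known orthonormal components plus noise
true_scores = rng.normal(size=(n, 2)) * np.array([2.0, 1.0])
phi = np.stack([np.sqrt(2) * np.sin(np.pi * s),
                np.sqrt(2) * np.sin(2 * np.pi * s)])
Xc = true_scores @ phi + 0.05 * rng.normal(size=(n, d))
Xc = Xc - Xc.mean(axis=0)

delta = s[1] - s[0]
C = Xc.T @ Xc / n * delta                      # discretized covariance operator
evals, evecs = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]
psi = evecs[:, order[:2]].T / np.sqrt(delta)   # L2-normalized eigenfunctions
xi = Xc @ psi.T * delta                        # scores: xi_ik = int x_i psi_k ds
# truncated Karhunen-Loeve reconstruction x_i(s) ~ sum_k psi_k(s) xi_ik
recon = xi @ psi
assert np.mean((recon - Xc) ** 2) < 0.01
```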
FPC-Based Functional Effects
+ always identifiable (not true for spline-based effects)
+ β̃_k(t) easier to interpret than β(s, t), at least for interpretable ψ̂_k(s)
+ applicable to sparse/irregular functional covariates
− inference becomes conditional on the estimated FPCs
− how to choose a suitable truncation lag K_x?
Example: Functional Random Intercept
h_j(x, t) = b_g(t) for a grouping variable g with G levels.
◮ b_Y(t) = (b_1^Y(t), ..., b_{K_Y}^Y(t))^T: spline basis evaluations with penalty matrix P_Y
◮ b_j(x): matrix of dummy variables for the levels of g
◮ P_j defines the dependence structure between the levels of g, e.g. the precision matrix of a Gaussian (Markov) random field. Simplest case: P_j = I for i.i.d. b_g(t).
For large G, a parsimonious functional principal component basis is computationally advantageous and quantifies the main modes of variation.
Example: Principal Component Bases
Karhunen-Loève expansion for b_g(t) iid∼ GP(0, K_b(t, t′)):
    b_g(t) = Σ_{l=1}^∞ θ_{j,gl} φ_l(t) ≈ Σ_{l=1}^{K_Y} θ_{j,gl} φ_l(t),
with eigenfunctions φ_l and eigenvalues ν_l of K_b, and θ_{j,gl} i.i.d. ∼ N(0, ν_l).
◮ b_Y(t) = (φ_1(t), ..., φ_{K_Y}(t))^T with P_Y = diag(ν_1, ..., ν_{K_Y})^{−1}
◮ b_j(x) as before, typically P_j = I
◮ λ_Y, λ_j fixed to 1.
Small K_Y with λ_j, λ_Y fixed → reduced computational complexity for penalized likelihood. Greven et al. (2010); Shou et al. (2014); Zipunnikov et al. (2014); Cederbaum et al. (2015) discuss estimation of eigenbases for different random effects models and for grid/sparse functional/image data.
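A small numpy sketch of the truncated Karhunen-Loève expansion; the eigenfunctions and eigenvalues here are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
d, G, KY = 100, 8, 3
t = np.linspace(0, 1, d)
# assumed eigenfunctions phi_l and decaying eigenvalues nu_l
phi = np.stack([np.sqrt(2) * np.cos(np.pi * l * t) for l in range(1, KY + 1)])
nu = np.array([1.0, 0.25, 0.05])
# theta_gl ~ N(0, nu_l): one coefficient vector per group level
theta = rng.normal(size=(G, KY)) * np.sqrt(nu)
b = theta @ phi                  # G random-intercept curves b_g(t)
assert b.shape == (G, d)
# implied covariance K_b(t, t') = sum_l nu_l phi_l(t) phi_l(t')
Kb = (phi.T * nu) @ phi
assert np.allclose(Kb, Kb.T)
```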
Estimation approaches: Penalized likelihood
Penalized likelihood-based estimation (Scheipl et al., 2015, 2016; based on Wood et al., 2016):
◮ ξ = g ∘ E with link function g; functional responses from exponential families, Beta, scaled non-central t, ordinal, ...
◮ moderate numbers of covariates
◮ likelihood-based inference: the functional-data log-likelihood is equivalent to a penalized log-likelihood for scalar data ⇒ adapt 30+ years of method development & know-how
◮ implemented in the pffr function of the R package refund, internally using mgcv (Wood, 2015) for scalar GAMMs: formula-based interface, construction of all bases and penalties, identifiability checks and constraints, plotting, etc.
Estimation as an Additive Mixed Model
If ξ = g ∘ E for a link function g, write the model as g(E(Y)) = Bθ, where
◮ Y contains the Y_i(t_id) ∀ i, d
◮ B contains the evaluated basis functions ∀ i, d, j
◮ θ contains the blocks θ_j of coefficients for each j.
The penalized log-likelihood for this model is
    l(θ) − (1/2) Σ_{j=1}^J θ_j^T P_jY θ_j,
with l(θ) the log-likelihood based on the assumed density of Y | X.
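For a Gaussian response this penalized log-likelihood is quadratic in θ, so its maximizer has a closed form; a minimal sketch with a single penalized coefficient block, with σ² and the smoothing parameter assumed known:

```python
import numpy as np

rng = np.random.default_rng(4)
n, K = 120, 10
B = rng.normal(size=(n, K))               # evaluated basis functions
theta_true = rng.normal(size=K)
y = B @ theta_true + 0.1 * rng.normal(size=n)
D = np.diff(np.eye(K), n=2, axis=0)
P = 3.0 * (D.T @ D)                       # lambda * difference penalty

def pen_loglik(theta, sigma2=0.01):
    """l(theta) - 1/2 theta' P theta, dropping additive constants."""
    r = y - B @ theta
    return -0.5 * (r @ r) / sigma2 - 0.5 * theta @ P @ theta

# maximizing it solves (B'B / sigma2 + P) theta = B'y / sigma2
theta_hat = np.linalg.solve(B.T @ B / 0.01 + P, B.T @ y / 0.01)
# no perturbed theta does better than the closed-form maximizer
perturbed = [theta_hat + 0.1 * rng.normal(size=K) for _ in range(20)]
assert all(pen_loglik(p) <= pen_loglik(theta_hat) + 1e-9 for p in perturbed)
```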
Estimation as an Additive Mixed Model
◮ l(θ) − (1/2) Σ_{j=1}^J θ_j^T P_jY θ_j is equivalent to a penalized log-likelihood for scalar data. Use Wood (2006, 2011); Wood et al. (2016) for optimization and for estimation of the smoothing parameters (REML/marginal likelihood).
◮ recent extensions of mgcv:
  ◮ multivariate models: useful in FDA for trajectory data and phase/amplitude components (Volkmann, 2018)
  ◮ location-scale models: very useful for an additional additive predictor modeling the (Gaussian) conditional variance σ²(t|x) (Greven & Scheipl, 2017)
  ◮ computational shortcuts for tensor product bases using distinct values only (Li & Wood, 2019)
Publications
Framework:
◮ Ivanescu, Staicu, Scheipl & Greven (2015), Penalized function-on-function regression. CompStat, 30(2), 539–568.
◮ Scheipl, Staicu & Greven (2015), Functional additive mixed models. JCGS, 24(2), 477–501.
◮ Scheipl, Gertheiss & Greven (2016), Generalized functional additive mixed models. EJS, 10(1), 1455–1492.
◮ Scheipl & Greven (2016), Identifiability in penalized function-on-function regression models. EJS, 10(1), 495–526.
◮ Greven & Scheipl (2017), A general framework for functional regression modelling. Statistical Modelling, 17(1–2), 1–35 (with discussion and rejoinder).
◮ Bauer, Scheipl, Küchenhoff & Gabriel (2018), An introduction to semiparametric function-on-scalar regression. Statistical Modelling, 18(3–4), 346–364.
Extensions & refinements for scalar responses:
◮ Goldsmith & Scheipl (2013), Estimator selection and combination in scalar-on-function regression. CSDA, 70, 362–372.
◮ McLean, Scheipl, Hooker, Greven & Ruppert (2014), Functional generalized additive models. JCGS, 23(1), 249–269.
◮ Fuchs, Scheipl & Greven (2015), Penalized scalar-on-functions regression with interaction term. CSDA, 81, 38–51.
Implementation:
◮ Goldsmith, Scheipl, Huang, Wrobel, Gellar, Harezlak, McLean, Swihart, Xiao, Crainiceanu & Reiss (2018), refund: Regression with functional data. R package version 0.1-16.
Estimation approaches: Boosting
Empirical risk minimization via component-wise gradient boosting (Brockhaus et al., 2015; based on Hothorn et al., 2013):
◮ variety of loss functions, e.g. for quantile or robust regression
◮ the empirical risk for functional responses simply integrates point-wise loss functions over T
◮ many covariates with automatic variable selection; p ≫ N possible
◮ efficient implementation using array models & data indexing, based on mboost (Hothorn et al., 2015); extension to GAMLSS in gamboostLSS (Hofner et al., 2016)
◮ resampling-based uncertainty quantification only
Transformation Functions and Loss Functions
Model: ξ(Y_i | X_i = x_i) = h(x_i) = Σ_{j=1}^J h_j(x_i, t); for GAMLSS, ξ(Y_i | X_i = x_i) = (h_1(x_i), ..., h_C(x_i)) with h_c(x_i) = Σ_{j=1}^J h_cj(x_i).
Represent the estimator of the transformation function ξ as the minimizer of the empirical risk for the corresponding loss function ρ:

ξ                                  ρ(Y, h(x))
mean regression: E                 L2 loss
median regression: Q_0.5           L1 loss
quantile regression: Q_τ           check function
generalized regression: g ∘ E      negative log-likelihood
GAMLSS: C-vector of parameters     negative log-likelihood

Loss functions for functional responses integrate ρ(Y, h(x))(t) over T. Goal: minimize the expected loss (the risk) with respect to h.
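The check function from the table, with a numerical confirmation that its empirical risk is minimized near the τ-quantile (a standard-normal sample is assumed for illustration):

```python
import numpy as np

def check_loss(u, tau):
    """Check function rho_tau(u): tau*u for u >= 0, (tau-1)*u otherwise."""
    return np.where(u >= 0, tau * u, (tau - 1) * u)

rng = np.random.default_rng(5)
y = rng.normal(size=100_000)
tau = 0.8
candidates = np.linspace(-1, 2, 301)
risks = [check_loss(y - c, tau).mean() for c in candidates]
best = candidates[int(np.argmin(risks))]
# the empirical risk minimizer sits near the 0.8-quantile of the sample
assert abs(best - np.quantile(y, tau)) < 0.05
```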
Component-wise gradient boosting
◮ Boosting is an ensemble method that aims at minimizing an expected loss criterion.
◮ The predictor is iteratively updated along the steepest gradient with respect to the predictor (gradient descent).
◮ The model is represented as a sum of simple (penalized) regression models ("base-learners"), fitted to the negative gradients in each step.
◮ In each boosting iteration only the best-fitting base-learner is updated (component-wise), with step length ν.
◮ For functional response regression, response and predictors are functions over T.
(Bühlmann & Hothorn, 2007; Hastie et al., 2009; Hothorn et al., 2013; Brockhaus et al., 2015)
Algorithm: component-wise gradient boosting
◮ Step 1: initialize all parameters, set m = 0.
◮ Step 2 (within each iteration m):
  ◮ compute the negative partial gradients u_i, i = 1, ..., N, of the empirical risk with respect to the predictor h, using the current estimates of all distribution parameters
  ◮ fit each base-learner h_j to the u_i, j = 1, ..., J
  ◮ select the best-fitting base-learner h_{j⋆} and update it with a small step length ν: h^[m] = h^[m−1] + ν h_{j⋆}^[m].
◮ Step 3: unless m > m_stop, set m = m + 1 and go to Step 2.
→ The final ĥ is a linear combination of base-learner fits. For GAMLSS, another loop over the C model equations is added.
(Mayr et al., 2012; Brockhaus et al., 2017; non-cyclical: Thomas et al., 2017)
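The loop above can be sketched for the L2 loss with simple linear base-learners, one per covariate; this is a minimal self-contained illustration, not FDboost's implementation:

```python
import numpy as np

rng = np.random.default_rng(6)
n, J = 300, 5
X = rng.normal(size=(n, J))
beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.0])   # sparse truth
y = X @ beta_true + 0.1 * rng.normal(size=n)

nu, m_stop = 0.1, 500
beta = np.zeros(J)
fit = np.zeros(n)
for m in range(m_stop):
    u = y - fit                                # negative gradient of L2 loss
    # fit each base-learner: simple least squares on a single covariate
    coefs = X.T @ u / (X ** 2).sum(axis=0)
    rss = ((u[:, None] - X * coefs) ** 2).sum(axis=0)
    j_star = int(np.argmin(rss))               # best-fitting component
    beta[j_star] += nu * coefs[j_star]         # update only that component
    fit = X @ beta
# component-wise updates recover the sparse signal
assert np.allclose(beta[[1, 3, 4]], 0.0, atol=0.15)
assert abs(beta[0] - 2.0) < 0.2 and abs(beta[2] + 1.0) < 0.2
```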
Component-wise gradient boosting
◮ For fixed ν and smoothing parameters (chosen for unbiased selection of base-learners; Hofner et al., 2012), the number of boosting iterations determines the model complexity.
◮ Choose the optimal stopping iteration m_stop by resampling methods (on the level of curves).
◮ Model selection by early stopping and stability selection (Shah & Samworth, 2013).
◮ Implementation in the R package FDboost (Brockhaus, Rügamer & Hothorn, 2017), based on mboost (Hothorn et al., 2017) and gamboostLSS (Hofner et al., 2017).
Publications
◮ Brockhaus, Scheipl, Hothorn & Greven (2015), The functional linear array model. Statistical Modelling, 15(3), 279–300.
◮ Brockhaus, Fuest, Mayr & Greven (2015), Functional regression models for location, scale and shape applied to stock returns. In: Friedl & Wagner (eds), Proceedings of the 30th IWSM, 117–122.
◮ Brockhaus, Rügamer & Greven (2017), Boosting functional regression models with FDboost. arXiv:1705.10662.
◮ Brockhaus, Melcher, Leisch & Greven (2017), Boosting flexible functional regression models with a high number of functional historical effects. Statistics and Computing, 27, 913–926.
◮ Rügamer, Brockhaus, Gentsch, Scherer & Greven (2018), Boosting factor-specific functional historical models for the detection of synchronisation in bioelectrical signals. JRSS-C, 67, 621–642.
Implementation:
◮ Brockhaus, Rügamer & Hothorn (2017), FDboost: Boosting functional regression models. R package version 0.3-0.
Feeding Behavior of Pigs
◮ "PIGWISE" project (Maselyne et al., 2014; Gertheiss et al., 2015)
◮ proximity of the pigs to the trough (yes/no) recorded every 10 seconds for 127 pigs over 102 days using RFID tags
◮ yields binary functional observations over [0h, 24h] at 0.1 Hz
◮ model to investigate (un)common feeding behavior → ethology, monitoring of pigs' health
[Figure: trough-visit profile of pig 57 over the day, across days]
Feeding Behavior of Pigs
◮ a generalized functional response model is needed
◮ autoregressive terms and day-specific (functional random) effects are of interest
◮ temperature and humidity information is available
Feeding Behavior of Pigs
◮ for illustration, focus on one pig
◮ to reduce noise & data size, aggregate the binary feeding indicators observed every 10 seconds over 10-minute intervals
◮ model Y_i(t) | x_i ∼ Binomial(60, p_i(t)), i = 1, ..., 102, t = 00:00h, 00:10h, ..., 23:50h, with
    logit(p_i(t)) = β_0(t) + b_i(t) + ∫_{t−3h}^{t−10min} y_i(s) β(t, s) ds
◮ b_i(t) iid∼ GP(0, K_b(t, t′)): functional random day effects
◮ assume p_i(t) is smooth over t; use periodic P-splines
◮ fit the generalized functional additive mixed model using pffr in refund on 2/3 training data (ca. 1 min; 5 smoothing parameters, dim(B): 14544 × 850).
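The aggregation step can be sketched in a few lines, on simulated indicators rather than the PIGWISE data:

```python
import numpy as np

# Aggregate 10-second binary trough-visit indicators into 10-minute
# binomial counts Y_i(t) with 60 trials each, as in the model above.
rng = np.random.default_rng(7)
per_day = 24 * 60 * 6                         # 10-second slots per day
binary = rng.random(per_day) < 0.1            # one simulated day of indicators
counts = binary.reshape(-1, 60).sum(axis=1)   # 60 slots per 10-minute interval
assert counts.shape == (144,)                 # 144 ten-minute intervals per day
assert counts.min() >= 0 and counts.max() <= 60
```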
Model Fit
[Figure: fitted μ̂(t) and observed y(t) for selected days (training data)]
Model Predictions
[Figure: predicted μ̂(t) and true y(t) for selected days (test data)]
Effect Estimates
[Figures: estimated smooth intercept β̂_0(t); functional random day effects b̂_i(t) with upper/lower pointwise CIs; estimated coefficient surface β̂(t, s), maximum lag 3 h]
Outline
Introduction
  Functional data analysis
  Research program
Structured additive functional regression
  Model
  Specification of Model Terms
Estimation
  Penalized Likelihood Estimation
  Gradient Boosting
Example
  Feeding Behavior of Pigs
Summary & Outlook
  Wrap-Up
Summary
Unified framework for a broad class of regression models for functional covariates and/or functional responses. Basic ideas:
◮ define a generally applicable, very flexible representation of model terms via penalized tensor product bases
◮ don't reinvent the wheel for functional data: rephrase as scalar data problems and adapt methodology & implementations
◮ different estimation frameworks for different goals and data settings:
  ◮ boosting: variable & model selection; flexible choice of the feature(s) of the conditional response distribution to model
  ◮ likelihood-based (additive) mixed models: valid inference, uncertainty quantification without resampling
◮ everything implemented in R: FDboost, refund
32 / 35
Outlook
◮ models for multivariate functional responses (trajectories, shapes, accelerometry, . . . ) and probability densities
◮ computationally efficient methods for dealing with heterogeneous, auto-correlated functional residuals & random effects: "penalized GEEs"? "sandwich" estimators? . . . cf. Morris et al.'s WFMM (Morris & Carroll, 2006, e.g.)
◮ refactoring pffr() now that we know what we're doing (somewhat...)
◮ better visualization & model diagnostics (cf. mgcViz, Fasiolo et al. (2018))
33 / 35
Additional References
◮ Bühlmann & Hothorn (2007), Boosting algorithms: Regularization, prediction and model fitting (with discussion). Statistical Science, 22, 477–505.
◮ Cuevas (2014), A partial overview of the theory of statistics with functional data. Journal of Statistical Planning and Inference, 147, 1–23.
◮ Currie, Durban & Eilers (2006), Generalized linear array models with applications to multidimensional smoothing. JRSS-B, 68, 259–280.
◮ Eilers & Marx (1996), Flexible smoothing with B-splines and penalties (with comments and rejoinder). Statistical Science, 11, 89–121.
◮ Friedman (1982), Piecewise exponential models for survival data with covariates. The Annals of Statistics, 10(1), 101–113.
◮ Gertheiss, Maier, Hessel & Staicu (2015), Marginal functional regression models for analyzing the feeding behavior of pigs. Journal of Agricultural, Biological, and Environmental Statistics, 20(3), 353–370.
◮ Hofner, Mayr, Fenske & Schmid (2016), gamboostLSS: Boosting methods for GAMLSS. R package version 1.3-0.
◮ Hothorn, Bühlmann, Kneib, Schmid & Hofner (2015), mboost: Model-based boosting. R package version 2.4-2.
◮ Hothorn, Kneib & Bühlmann (2013), Conditional transformation models. JRSS-B, 76, 3–27.
34 / 35
Additional References
◮ Li & Wood (2019), Faster model matrix crossproducts for large generalized linear models with discretized covariates. Statistics and Computing, to appear.
◮ Maselyne, Saeys, Ketelaere, Mertens, Vangeyte, Hessel, Millet & Van Nuffel (2014), Validation of a high frequency radio frequency identification (HF RFID) system for registering feeding patterns of growing-finishing pigs. Computers and Electronics in Agriculture, 102, 10–18.
◮ Marra & Wood (2011), Practical variable selection for generalized additive models. CSDA, 55, 2372–2387.
◮ Meinshausen & Bühlmann (2010), Stability selection. JRSS-B, 72(4), 417–473.
◮ Morris & Carroll (2006), Wavelet-based functional mixed models. JRSS-B, 68(2), 179–199.
◮ Morris (2015), Functional regression. Annual Review of Statistics and Its Application, 2, 321–359.
◮ Ramsay & Silverman (2005), Functional Data Analysis. Springer, New York.
◮ Shah & Samworth (2013), Variable selection with error control: another look at stability selection. JRSS-B, 75, 55–80.
◮ Wood (2006), Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC, Boca Raton, Florida.
◮ Wood (2011), Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. JRSS-B, 73, 3–36.
◮ Wood (2015), mgcv: Mixed GAM computation vehicle with GCV/AIC/REML smoothness estimation. R package version 1.8-6.
◮ Wood, Pya & Säfken (2016), Smoothing parameter and model selection for general smooth models. JASA, 111(516), 1548–1563.
35 / 35
Outline
Example: Nutrition & ICU survival
Exposure-Lag-Response Association (ELRA): Nutrition & ICU Survival
◮ multi-center study of critical care patients from 457 ICUs (≈ 10k patients)
◮ investigate acute mortality (first 30 d)
◮ confounders z: age, gender, Apache II score, year of admission, ICU random effect, ...
◮ 12-day nutrition record xi(s):
  ◮ prescribed calories (determined at baseline)
  ◮ daily caloric intake
  ◮ daily caloric adequacy (CA) = caloric intake / prescribed calories
1 / 8
Exposure-Lag-Response Association (ELRA): Nutrition & ICU Survival
[Figure: example 12-day nutrition profiles xi(s)]
1 / 8
Clinical Questions & Parameterization
Importance of timing & amount of nutrition during the critical, acute & recovery phases is unclear.
Model assumptions: delayed, time-limited, cumulative & time-varying effect of the time-varying exposure xi(s).
Idea: (partial) effect of xi(s) on the log-hazard at time t:
∫_{W(t)} xi(s) β(t, s) ds
◮ delay & time limit defined by the window W(t)
◮ time-varying partial effects β(t, s)
2 / 8
Piece-wise Exponential Model
Time-to-event model reparameterized as a Poisson GAMM:
log λ(t | zi, xi) = α0(t) + Σ_j hj(zi, t) + ∫_{W(t)} xi(s) β(t, s) ds
◮ estimated with mgcv (could use FDboost as well)
◮ flexible, semi-parametric modeling of the baseline hazard rate α0(t)
◮ use the GFAMM framework for functional covariates to perform inference, e.g. about nutrition effects
◮ flexible confounder adjustment: non-linear & time-varying effects; time-varying covariates, random effects / frailties, . . .
Details 3 / 8
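Fitting the reparameterized Poisson GAMM first requires expanding each subject's follow-up into interval-wise pseudo-observations. A minimal hand-rolled sketch of this data augmentation in Python (pammtools automates this in R; the function name split_to_ped and its tuple layout are made up for illustration):

```python
import math

def split_to_ped(time, status, cuts):
    """One subject -> list of (interval_end, event, log_time_at_risk) rows."""
    rows = []
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        if time <= lo:                      # subject no longer at risk
            break
        at_risk = min(time, hi) - lo        # time spent at risk in [lo, hi)
        event = int(status == 1 and lo < time <= hi)
        rows.append((hi, event, math.log(at_risk)))
    return rows

# subject with an event at t = 2.5, cut points 0 < 1 < 2 < 3:
ped = split_to_ped(time=2.5, status=1, cuts=[0.0, 1.0, 2.0, 3.0])
# three pseudo-observations; only the last carries the event, with
# offset log(0.5) for the half interval spent at risk
```

Stacked across subjects, these rows are fed to a Poisson GAM with the log time-at-risk as offset, which recovers the piece-wise constant hazard.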
ELRA: Example
Compare the hazard of a patient with a given nutrition record to constant undernutrition (ceteris paribus):
[Figure: caloric adequacy [%] by protocol day for the two nutrition profiles compared]
4 / 8
[Figure: caloric adequacy profiles (as above) and the resulting hazard ratio estimate over t]
I'm leaving out a LOT of complications here . . .
4 / 8
ELRA Publications
Time-to-event models:
◮ Bender, Groll & Scheipl (2017), A generalized additive model approach to time-to-event analysis. Statistical Modelling, 18(3-4), 299–341.
◮ Bender, Scheipl, Hartl, Day & Küchenhoff (2018), Penalized estimation of complex, non-linear exposure-lag-response associations. Biostatistics, kxy003.
◮ Hartl, Bender, Scheipl, Kuppinger, Day & Küchenhoff (2018), Calorie intake and short-term survival of critically ill patients. Clinical Nutrition, in press.
Distributed lag models:
◮ Obermeier, Scheipl, Heumann, Wassermann & Küchenhoff (2015), Flexible distributed lags for modelling earthquake data. JRSS-C, 64(2), 395–412.
◮ Gasparrini, Scheipl, Armstrong & Kenward (2017), A penalized framework for distributed lag non-linear models. Biometrics, 73, 938–948.
Implementation:
◮ Gasparrini, Armstrong & Scheipl (2016), dlnm: Distributed lag non-linear models. R package version 2.2-5.
◮ Bender & Scheipl (2017), pammtools: Piece-wise exponential additive mixed modeling tools. https://github.com/adibender/pammtools
5 / 8
Piece-wise Exponential Model
◮ define cut points on the time line: 0 = κ0 < . . . < κJ = tmax
◮ assume constant hazard rates in each interval [κj−1, κj)
⇒ the likelihood for the event indicators
yij = 1 if the event occurs in [κj−1, κj), 0 else
is proportional to a Poisson likelihood (Friedman, 1982)
◮ for t ∈ [κj−1, κj), the piecewise constant hazard rate λ(t; zi, xi) of the event process is the rate of the pseudo-Poisson responses yij!
back 6 / 8
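The proportionality argument can be written out in one line; here t_{ij} denotes the time subject i spends at risk in interval j (notation t_{ij} and the per-subject product are added here to spell out the slide's claim):

```latex
L_i(\lambda)
  \;=\; \prod_{j} \lambda_{ij}^{\,y_{ij}} \, e^{-\lambda_{ij} t_{ij}}
  \;\propto\; \prod_{j} \frac{(\lambda_{ij} t_{ij})^{y_{ij}}}{y_{ij}!}\, e^{-\lambda_{ij} t_{ij}},
```

since the omitted factor t_{ij}^{y_{ij}} / y_{ij}! does not depend on λ. The right-hand side is a product of Poisson(μ_{ij} = λ_{ij} t_{ij}) densities, so fitting a Poisson model with offset log t_{ij} yields the same estimates as the piece-wise exponential likelihood.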
Identifiability
1. Functional response models: use suitable identifiability constraints.
2. Functional covariates x(s) are often well approximated by truncated Karhunen-Loève expansions,
xi(s) = Σ_{l=1}^{∞} ξi,l φ^X_l(s) ≈ Σ_{l=1}^{M} ξi,l φ^X_l(s).
For a linear functional effect hj(x, t) = ∫_S x(s) β(s, t) ds, β is only identified up to functions f with f(·, t) ∈ span({φ^X_l : l > M}) ∀t.
For finite grid data, the design matrix is rank-deficient if M < Kj, or if the span of the basis for β in s-direction contains functions orthogonal to {φ^X_l : 1 ≤ l ≤ M} under numerical integration.
back 7 / 8
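The rank deficiency for M < Kj can be seen directly in a small numerical experiment. A hedged sketch with generic numpy stand-ins (the "eigenfunctions", the polynomial basis for β, and all dimensions are made up for illustration; this is not the refund diagnostic code):

```python
import numpy as np

rng = np.random.default_rng(1)
G, M, K, n = 100, 3, 8, 200          # grid size, KL truncation, basis dim, sample size
s = np.linspace(0, 1, G)

# covariate curves spanned by only M = 3 "eigenfunctions" -> X has rank M
phi = np.stack([np.sin((l + 1) * np.pi * s) for l in range(M)], axis=1)
X = rng.normal(size=(n, M)) @ phi.T                  # n x G

# quadrature weights + a K-dimensional basis for beta(s) (here: polynomials)
B = np.vander(s, K, increasing=True)                 # G x K, full column rank
W = np.full(G, 1 / G)                                # Riemann weights
Z = (X * W) @ B                                      # design for the functional term

print(np.linalg.matrix_rank(Z))                      # -> 3 (= M < K): rank-deficient
```

With M < K, the K coefficients of β are not identified from Z alone; only the penalty (or constraints) pins down the remaining directions.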
Identifiability
◮ Iff the kernels of the penalty and the design matrix do not overlap, there is a unique minimum of the penalty on each hyperplane defined by coefficient vectors θ representing identical hj functions.
◮ The penalized optimization criterion then finds the smoothest solution (i.e., argmin pen(θ)) among the solutions with optimal fit to the data.
◮ Practical recommendations (Scheipl and Greven, 2015):
  1. avoid rank-reducing pre-processing of functional covariates
  2. check diagnostic measures (implemented in refund/FDboost)
  3. use penalties with small kernels: 1st-order differences/derivatives, full-rank penalties (Marra and Wood, 2011), or suitable constraints
8 / 8
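Recommendation 3 is easy to verify numerically: a 1st-order difference penalty has a one-dimensional kernel (the constant vector), so its kernel can only overlap the design kernel if constant functions themselves are unidentified. A generic numpy sketch with a hypothetical small basis dimension:

```python
import numpy as np

K = 8
D = np.diff(np.eye(K), axis=0)        # 1st-order difference matrix, (K-1) x K
P = D.T @ D                           # penalty matrix pen(theta) = theta' P theta

# kernel of P = eigenvectors with (numerically) zero eigenvalue
eigval, eigvec = np.linalg.eigh(P)
kernel = eigvec[:, eigval < 1e-10]
# one-dimensional kernel, spanned by the constant vector
```

Higher-order difference penalties add polynomial directions to this kernel, which is why lower-order penalties (or full-rank penalties) are the safer default here.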