Structured Additive Regression Models for Functional Data. Fabian Scheipl & Sonja Greven, Institut für Statistik, Ludwig-Maximilians-Universität München. III International Workshop on Advances in FDA, Castro Urdiales, June 2019.
Joint work with:
Alexander Bauer, Andreas Bender, Sarah Brockhaus, Jona Cederbaum, Ciprian Crainiceanu, Karen Fuchs, Antonio Gasparrini, Jan Gertheiss, Jonathan Gellar, Jeff Goldsmith, Wolfgang Hartl, Torsten Hothorn, Andrada Ivanescu, Helmut Küchenhoff, Matthew McLean, Friedrich Leisch, Viola Obermeier, David Rügamer, Ana-Maria Staicu, Simon N. Wood
Outline
- Introduction: Functional data analysis; Research program
- Structured additive functional regression: Model; Specification of Model Terms
- Estimation: Penalized Likelihood Estimation; Gradient Boosting
- Example: Feeding Behavior of Pigs
- Summary & Outlook: Wrap-Up
Functional Data
◮ unit of observation is a curve over some interval (or a 2D/3D domain)
◮ theoretical view: data are realizations of stochastic processes; curves live in a function space
◮ in practice: (densely/sparsely) sampled on a finite grid
◮ examples: spectroscopy, longitudinal blood marker profiles, medical imaging, accelerometers, ...
[Figures: NIR spectra over wavelength [nm]; total CD4 cell counts over months since seroconversion]
[Figure: trough-visit profile of pig 57 over the day, across days]
Functional Data Analysis (FDA)
FDA is statistics for functional data. There is increasing interest in relating functional variables to other variables of interest: functional regression.
◮ function-on-scalar regression
◮ scalar-on-function (also: "time-to-event"-on-function) regression
◮ function-on-function regression
(Cuevas, 2014; Morris, 2015)
Aim
A general framework for functional regression with a choice of
1. scalar or functional response (observed on equal or unequal grids)
2. modeled feature of the conditional response distribution, e.g. expectation, quantile, or higher moments
3. linear, smooth, and interaction effects of scalar and functional covariates, and (functional) random effects
4. bases, e.g. splines or functional principal components
5. estimation method: gradient boosting or penalized likelihood
Key idea: model the observations within curves and shift the functional structure into the additive predictor ⇒ penalized scalar regression using varying-coefficient models. Implemented in the R packages FDboost and refund.
Structured Additive Functional Models
Observations (Y_i, X_i), i = 1, ..., N, with
◮ Y_i a functional (or scalar) response over an interval T = [a, b]
◮ X_i a set of scalar and/or functional covariates

Structured Additive Regression Model:
    ξ(Y_i | X_i = x_i)(t) = h(x_i)(t) = Σ_{j=1}^J h_j(x_i, t)

◮ ξ: the modeled feature of the conditional response distribution, e.g. expectation (with link function), median, a quantile, ...
◮ partial effects h_j(x_i, t) are real-valued functions over T depending on one or more covariates.

Scheipl et al. (2015, 2016); Brockhaus et al. (2015); Greven & Scheipl (2017)
Partial Effects hj(x, t)
Model: ξ(Y | X = x)(t) = h(x)(t) = Σ_j h_j(x, t)

covariate(s)                     type of effect                   h_j(x, t)
(none)                           smooth intercept                 α(t)
scalar covariate z               linear effect                    z β(t)
                                 smooth effect                    γ(z, t)
two scalars z1, z2               linear interaction               z1 z2 β(t)
                                 functional varying coefficient   z1 f(z2, t)
                                 smooth interaction               f(z1, z2, t)
functional covariate x(s)        linear functional effect         ∫_S x(s) β(s, t) ds
                                 historical effect                ∫_{ℓ(t)}^{u(t)} x(s) β(s, t) ds
                                 smooth functional effect         ∫_S F(x(s), s, t) ds
scalar z, functional x(s)        linear interaction               z ∫_S x(s) β(s, t) ds
                                 smooth interaction               ∫_S x(s) β(z, s, t) ds
grouping variable g              functional random intercept      b_g(t)
grouping variable g, scalar z    functional random slope          z b_g(t)
curve indicator i                curve-specific smooth residual   e_i(t)
Representation of Partial Effects hj(x, t)
Let Y collect all Y_i(t_id) ∀ i, d in one long vector. Model: ξ(Y | X = x)(t) = h(x, t) = Σ_j h_j(x, t)
Represent each h_j(x, t) using bases tailored to the effect type:

Row Tensor Product Basis
    h_j(x, t) = (b_j(x, t)^T ⊙ b_Y(t)^T) θ_j

◮ row tensor product: A (n×p) ⊙ B (n×q) = (A ⊗ 1_q^T) ∘ (1_p^T ⊗ B), with ∘ the element-wise product
◮ b_j: K_j-vector of basis functions for the covariate(s)
◮ b_Y: K_Y-vector of basis functions over T
◮ θ_j: coefficient vector
⇒ linearization of the estimation problem.
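The row tensor product identity above can be checked numerically; a minimal numpy sketch, where the two matrices stand in for evaluated marginal bases:

```python
import numpy as np

def row_tensor(A, B):
    """Row tensor ("face-splitting") product: row i of the result is
    kron(A[i, :], B[i, :]).  Uses the slide's identity
    A (n x p) odot B (n x q) = (A kron 1_q^T) * (1_p^T kron B),
    where * is the element-wise (Hadamard) product."""
    n, p = A.shape
    _, q = B.shape
    return np.kron(A, np.ones((1, q))) * np.kron(np.ones((1, p)), B)

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))   # e.g. covariate basis evaluations b_j
B = rng.normal(size=(5, 4))   # e.g. basis over T, b_Y
M = row_tensor(A, B)          # shape (5, 12)
# agrees with the row-wise Kronecker definition
assert np.allclose(M[2], np.kron(A[2], B[2]))
```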
Regularization of Partial Effects
Anisotropic penalty for regularization:
    pen(h_j(x, t)) = θ_j^T P_jY θ_j
with penalty matrix
    P_jY = λ_j (P_j ⊗ I_{K_Y}) + λ_Y (I_{K_j} ⊗ P_Y)
◮ P_j, P_Y: penalty matrices for the marginal bases
◮ λ_j, λ_Y ∈ R+: smoothing parameters

Equivalent representation as a (low-rank) Gaussian process prior:
    h_j(x, t) = (b_j(x)^T ⊗ b_Y(t)^T) θ_j = B_jY θ_j,   θ_j ∼ N(0, P_jY^−)

Latent Low-Rank Gauss Processes
    h_j(x, t) ∼ GP(0, B_jY P_jY^− B_jY^T)
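The Kronecker-sum structure of the anisotropic penalty can be sketched in numpy; second-order difference penalties are assumed here as illustrative marginal penalties:

```python
import numpy as np

def diff_penalty(k, order=2):
    """Marginal P-spline-type penalty D^T D with d-th order differences."""
    D = np.diff(np.eye(k), n=order, axis=0)
    return D.T @ D

Kj, KY = 6, 8
Pj, PY = diff_penalty(Kj), diff_penalty(KY)
lam_j, lam_Y = 2.0, 0.5
# anisotropic penalty P_jY = lam_j (Pj kron I) + lam_Y (I kron PY)
PjY = lam_j * np.kron(Pj, np.eye(KY)) + lam_Y * np.kron(np.eye(Kj), PY)
# symmetric and positive semi-definite, as a valid quadratic penalty must be
assert np.allclose(PjY, PjY.T)
assert np.all(np.linalg.eigvalsh(PjY) > -1e-9)
```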
Specification of Partial Effects hj(x, t)
Very flexible: combine
◮ any basis & penalty over x with
◮ any basis & penalty over T.
Computational efficiency:
◮ compute only on distinct/discretized values (Li & Wood, 2019)
◮ use fast & lean generalized linear array models (Currie et al., 2006).
For data on a regular grid t = (t_1, ..., t_d) and most h_j, the nd × K_j K_Y matrix
    B_jY = [ b_j(x_i, t)^T ⊙ b_Y(t)^T ]_{i=1,...,n; t=t_1,...,t_d}
reduces to the Kronecker product
    B_jY = [ b_j(x_i)^T ]_{i=1,...,n} ⊗ [ b_Y(t)^T ]_{t=t_1,...,t_d}
of an n × K_j and a d × K_Y matrix.
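The array-model savings rest on the identity (A ⊗ B) vec(X) = vec(B X Aᵀ), which lets one compute B_jY θ without ever materializing the nd × K_j K_Y Kronecker product; a numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, Kj, KY = 50, 40, 7, 9
Bj = rng.normal(size=(n, Kj))    # covariate basis, one row per curve
BY = rng.normal(size=(d, KY))    # basis over the grid t_1, ..., t_d
theta = rng.normal(size=Kj * KY)

# naive: materialize the nd x (Kj*KY) design matrix
full = np.kron(Bj, BY) @ theta

# array-model shortcut: (Bj kron BY) vec(X) = vec(BY @ X @ Bj.T),
# with column-major ("Fortran-order") vec
X = theta.reshape(KY, Kj, order="F")
fast = (BY @ X @ Bj.T).ravel(order="F")
assert np.allclose(full, fast)
```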
Tensor Product Basis & Penalties
P_j ⊗ I and I ⊗ P_Y act as repeated penalties: each applies a marginal penalty to every subvector of θ associated with one specific marginal basis function. In the tensor product basis b_j(x, t) ⊗ b_Y(t), each of the K_j covariate basis functions is crossed with each of the K_Y basis functions over T. Accordingly, I_{K_j} ⊗ P_Y places one copy of P_Y in a diagonal block per covariate basis function, while P_j ⊗ I_{K_t} spreads the entries of P_j across the coefficients that share the same basis function in t.
[Schematic: block structure of the Kronecker products for the tensor product basis and the two penalty terms]
Example: Historical Functional Effect
The historical effect is approximated as
    ∫_{ℓ(t)}^{u(t)} x(s) β(s, t) ds ≈ Σ_{r=1}^R x̃(s_r, t) β(s_r, t) = Σ_{r=1}^R x̃(s_r, t) Σ_{k=1}^{K_j} Σ_{l=1}^{K_Y} b_k^j(s_r) b_l^Y(t) θ_{j,kl}
with
◮ x̃(s_r, t) = 1(ℓ(t) ≤ s_r ≤ u(t)) Δ(s_r) x(s_r)
◮ Δ(s): numerical integration weights
◮ {b_k^j(s), 1 ≤ k ≤ K_j}, {b_l^Y(t), 1 ≤ l ≤ K_Y}: spline bases
◮ b_j(x, t) = [ Σ_{r=1}^R x̃_i(s_r, t) b_k^j(s_r) ]_{i=1,...,n; k=1,...,K_j}
◮ P_j, P_Y: marginal penalty matrices for b_j(s), b_Y(t)
Example: FPC-Based Functional Effect
For x_i(s) ≈ Σ_{k=1}^{K_j} ψ_k(s) ξ_ik,
    ∫_S x_i(s) β(s, t) ds = Σ_k ξ_ik ∫_S ψ_k(s) β(s, t) ds = Σ_k ξ_ik β̃_k(t)
⇒ a sum of varying-coefficient terms for the FPC scores
◮ b_j(x, t) = [ ξ̂_ik ]_{n×K_j} ⊗ 1_d
◮ P_j = 0
◮ extends to nonlinear FPC effects
    h_j(x_i(s), t) = Σ_{k=1}^{K_j} f_k(ξ̂_ik, t),   h_j(x_i(s), t) = Σ_{k; k′>k}^{K_j} f_{k,k′}(ξ̂_ik, ξ̂_ik′, t),
i.e. (sums of) nonlinear effects of the synthetic covariates ξ̂_k.
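How the scores ξ̂_ik arise can be sketched via the eigendecomposition of a discretized sample covariance; this is a self-contained illustration on simulated curves, not refund's FPCA implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 50
s = np.linspace(0, 1, d)
# simulate curves from two known orthonormal components plus noise
true_scores = rng.normal(size=(n, 2)) * np.array([2.0, 1.0])
phi = np.stack([np.sqrt(2) * np.sin(np.pi * s),
                np.sqrt(2) * np.sin(2 * np.pi * s)])
Xc = true_scores @ phi + 0.05 * rng.normal(size=(n, d))
Xc = Xc - Xc.mean(axis=0)

delta = s[1] - s[0]
C = Xc.T @ Xc / n * delta                      # discretized covariance operator
evals, evecs = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]
psi = evecs[:, order[:2]].T / np.sqrt(delta)   # L2-normalized eigenfunctions
xi = Xc @ psi.T * delta                        # scores: xi_ik = int x_i psi_k ds
# truncated Karhunen-Loeve reconstruction x_i(s) ~ sum_k psi_k(s) xi_ik
recon = xi @ psi
assert np.mean((recon - Xc) ** 2) < 0.01
```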
FPC-Based Functional Effects
+ always identifiable (not true for spline-based effects)
+ β̃_k(t) easier to interpret than β(s, t), at least for interpretable ψ̂_k(s)
+ applicable to sparse/irregular functional covariates
− inference becomes conditional on the estimated FPCs
− how to choose a suitable truncation lag K_x?
Example: Functional Random Intercept
h_j(x, t) = b_g(t) for a grouping variable g with G levels.
◮ b_Y(t) = (b_1^Y(t), ..., b_{K_Y}^Y(t))^T: spline basis evaluations with penalty matrix P_Y
◮ b_j(x): matrix of dummy variables for the levels of g
◮ P_j defines the dependence structure between the levels of g, e.g. the precision matrix of a Gaussian (Markov) random field. Simplest case: P_j = I for i.i.d. b_g(t).
For large G, a parsimonious functional principal component basis is computationally advantageous and quantifies the main modes of variation.
Example: Principal Component Bases
Karhunen-Loève expansion for b_g(t) iid∼ GP(0, K_b(t, t′)):
    b_g(t) = Σ_{l=1}^∞ θ_{j,gl} φ_l(t) ≈ Σ_{l=1}^{K_Y} θ_{j,gl} φ_l(t),
with eigenfunctions φ_l and eigenvalues ν_l of K_b, and θ_{j,gl} i.i.d. ∼ N(0, ν_l).
◮ b_Y(t) = (φ_1(t), ..., φ_{K_Y}(t))^T with P_Y = diag(ν_1, ..., ν_{K_Y})^{−1}
◮ b_j(x) as before, typically P_j = I
◮ λ_Y, λ_j fixed to 1.
Small K_Y with λ_j, λ_Y fixed → reduced computational complexity for penalized likelihood. Greven et al. (2010); Shou et al. (2014); Zipunnikov et al. (2014); Cederbaum et al. (2015) discuss estimation of eigenbases for different random effects models and for grid/sparse functional/image data.
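A small numpy sketch of the truncated Karhunen-Loève expansion; the eigenfunctions and eigenvalues here are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
d, G, KY = 100, 8, 3
t = np.linspace(0, 1, d)
# assumed eigenfunctions phi_l and decaying eigenvalues nu_l
phi = np.stack([np.sqrt(2) * np.cos(np.pi * l * t) for l in range(1, KY + 1)])
nu = np.array([1.0, 0.25, 0.05])
# theta_gl ~ N(0, nu_l): one coefficient vector per group level
theta = rng.normal(size=(G, KY)) * np.sqrt(nu)
b = theta @ phi                  # G random-intercept curves b_g(t)
assert b.shape == (G, d)
# implied covariance K_b(t, t') = sum_l nu_l phi_l(t) phi_l(t')
Kb = (phi.T * nu) @ phi
assert np.allclose(Kb, Kb.T)
```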
Estimation approaches: Penalized likelihood
Penalized likelihood-based estimation (Scheipl et al., 2015, 2016; based on Wood et al., 2016):
◮ ξ = g ∘ E with link function g; functional responses from exponential families, Beta, scaled non-central t, ordinal, ...
◮ moderate numbers of covariates
◮ likelihood-based inference: the functional-data log-likelihood is equivalent to a penalized log-likelihood for scalar data ⇒ adapt 30+ years of method development & know-how
◮ implemented in the pffr function of the R package refund, internally using mgcv (Wood, 2015) for scalar GAMMs: formula-based interface, construction of all bases and penalties, identifiability checks and constraints, plotting, etc.
Estimation as an Additive Mixed Model
If ξ = g ∘ E for a link function g, write the model as g(E(Y)) = Bθ, where
◮ Y contains the Y_i(t_id) ∀ i, d
◮ B contains the evaluated basis functions ∀ i, d, j
◮ θ contains the blocks θ_j of coefficients for each j.
The penalized log-likelihood for this model is
    l(θ) − (1/2) Σ_{j=1}^J θ_j^T P_jY θ_j,
with l(θ) the log-likelihood based on the assumed density of Y | X.
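For a Gaussian response this penalized log-likelihood is quadratic in θ, so its maximizer has a closed form; a minimal sketch with a single penalized coefficient block, with σ² and the smoothing parameter assumed known:

```python
import numpy as np

rng = np.random.default_rng(4)
n, K = 120, 10
B = rng.normal(size=(n, K))               # evaluated basis functions
theta_true = rng.normal(size=K)
y = B @ theta_true + 0.1 * rng.normal(size=n)
D = np.diff(np.eye(K), n=2, axis=0)
P = 3.0 * (D.T @ D)                       # lambda * difference penalty

def pen_loglik(theta, sigma2=0.01):
    """l(theta) - 1/2 theta' P theta, dropping additive constants."""
    r = y - B @ theta
    return -0.5 * (r @ r) / sigma2 - 0.5 * theta @ P @ theta

# maximizing it solves (B'B / sigma2 + P) theta = B'y / sigma2
theta_hat = np.linalg.solve(B.T @ B / 0.01 + P, B.T @ y / 0.01)
# no perturbed theta does better than the closed-form maximizer
perturbed = [theta_hat + 0.1 * rng.normal(size=K) for _ in range(20)]
assert all(pen_loglik(p) <= pen_loglik(theta_hat) + 1e-9 for p in perturbed)
```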
Estimation as an Additive Mixed Model
◮ l(θ) − (1/2) Σ_{j=1}^J θ_j^T P_jY θ_j is equivalent to a penalized log-likelihood for scalar data. Use Wood (2006, 2011); Wood et al. (2016) for optimization and for estimation of the smoothing parameters (REML/marginal likelihood).
◮ recent extensions of mgcv:
  ◮ multivariate models: useful in FDA for trajectory data and phase/amplitude components (Volkmann, 2018)
  ◮ location-scale models: very useful for an additional additive predictor modeling the (Gaussian) conditional variance σ²(t|x) (Greven & Scheipl, 2017)
  ◮ computational shortcuts for tensor product bases using distinct values only (Li & Wood, 2019)
Publications
Framework:
◮ Ivanescu, Staicu, Scheipl & Greven (2015), Penalized function-on-function regression. CompStat, 30(2), 539–568.
◮ Scheipl, Staicu & Greven (2015), Functional additive mixed models. JCGS, 24(2), 477–501.
◮ Scheipl, Gertheiss & Greven (2016), Generalized functional additive mixed models. EJS, 10(1), 1455–1492.
◮ Scheipl & Greven (2016), Identifiability in penalized function-on-function regression models. EJS, 10(1), 495–526.
◮ Greven & Scheipl (2017), A general framework for functional regression modelling. Statistical Modelling, 17(1–2), 1–35 (with discussion and rejoinder).
◮ Bauer, Scheipl, Küchenhoff & Gabriel (2018), An introduction to semiparametric function-on-scalar regression. Statistical Modelling, 18(3–4), 346–364.
Extensions & refinements for scalar responses:
◮ Goldsmith & Scheipl (2013), Estimator selection and combination in scalar-on-function regression. CSDA, 70, 362–372.
◮ McLean, Scheipl, Hooker, Greven & Ruppert (2014), Functional generalized additive models. JCGS, 23(1), 249–269.
◮ Fuchs, Scheipl & Greven (2015), Penalized scalar-on-functions regression with interaction term. CSDA, 81, 38–51.
Implementation:
◮ Goldsmith, Scheipl, Huang, Wrobel, Gellar, Harezlak, McLean, Swihart, Xiao, Crainiceanu & Reiss (2018), refund: Regression with functional data. R package version 0.1-16.
Estimation approaches: Boosting
Empirical risk minimization via component-wise gradient boosting (Brockhaus et al., 2015; based on Hothorn et al., 2013):
◮ variety of loss functions, e.g. for quantile or robust regression
◮ the empirical risk for functional responses simply integrates point-wise loss functions over T
◮ many covariates with automatic variable selection; p ≫ N possible
◮ efficient implementation using array models & data indexing, based on mboost (Hothorn et al., 2015); extension to GAMLSS in gamboostLSS (Hofner et al., 2016)
◮ resampling-based uncertainty quantification only
Transformation Functions and Loss Functions
Model: ξ(Y_i | X_i = x_i) = h(x_i) = Σ_{j=1}^J h_j(x_i, t); for GAMLSS, ξ(Y_i | X_i = x_i) = (h_1(x_i), ..., h_C(x_i)) with h_c(x_i) = Σ_{j=1}^J h_cj(x_i).
Represent the estimator of the transformation function ξ as the minimizer of the empirical risk for the corresponding loss function ρ:

ξ                                  ρ(Y, h(x))
mean regression: E                 L2 loss
median regression: Q_0.5           L1 loss
quantile regression: Q_τ           check function
generalized regression: g ∘ E      negative log-likelihood
GAMLSS: C-vector of parameters     negative log-likelihood

Loss functions for functional responses integrate ρ(Y, h(x))(t) over T. Goal: minimize the expected loss (the risk) with respect to h.
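The check function from the table, with a numerical confirmation that its empirical risk is minimized near the τ-quantile (a standard-normal sample is assumed for illustration):

```python
import numpy as np

def check_loss(u, tau):
    """Check function rho_tau(u): tau*u for u >= 0, (tau-1)*u otherwise."""
    return np.where(u >= 0, tau * u, (tau - 1) * u)

rng = np.random.default_rng(5)
y = rng.normal(size=100_000)
tau = 0.8
candidates = np.linspace(-1, 2, 301)
risks = [check_loss(y - c, tau).mean() for c in candidates]
best = candidates[int(np.argmin(risks))]
# the empirical risk minimizer sits near the 0.8-quantile of the sample
assert abs(best - np.quantile(y, tau)) < 0.05
```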
Component-wise gradient boosting
◮ Boosting is an ensemble method that aims at minimizing an expected loss criterion.
◮ The predictor is iteratively updated along the steepest gradient with respect to the predictor (gradient descent).
◮ The model is represented as a sum of simple (penalized) regression models ("base-learners"), fitted to the negative gradients in each step.
◮ In each boosting iteration only the best-fitting base-learner is updated (component-wise), with step length ν.
◮ For functional response regression, response and predictors are functions over T.
(Bühlmann & Hothorn, 2007; Hastie et al., 2009; Hothorn et al., 2013; Brockhaus et al., 2015)
Algorithm: component-wise gradient boosting
◮ Step 1: initialize all parameters, set m = 0.
◮ Step 2 (within each iteration m):
  ◮ compute the negative partial gradients u_i, i = 1, ..., N, of the empirical risk with respect to the predictor h, using the current estimates of all distribution parameters
  ◮ fit each base-learner h_j to the u_i, j = 1, ..., J
  ◮ select the best-fitting base-learner h_{j⋆} and update it with a small step length ν: h^[m] = h^[m−1] + ν h_{j⋆}^[m].
◮ Step 3: unless m > m_stop, set m = m + 1 and go to Step 2.
→ The final ĥ is a linear combination of base-learner fits. For GAMLSS, another loop over the C model equations is added.
(Mayr et al., 2012; Brockhaus et al., 2017; non-cyclical: Thomas et al., 2017)
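The loop above can be sketched for the L2 loss with simple linear base-learners, one per covariate; this is a minimal self-contained illustration, not FDboost's implementation:

```python
import numpy as np

rng = np.random.default_rng(6)
n, J = 300, 5
X = rng.normal(size=(n, J))
beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.0])   # sparse truth
y = X @ beta_true + 0.1 * rng.normal(size=n)

nu, m_stop = 0.1, 500
beta = np.zeros(J)
fit = np.zeros(n)
for m in range(m_stop):
    u = y - fit                                # negative gradient of L2 loss
    # fit each base-learner: simple least squares on a single covariate
    coefs = X.T @ u / (X ** 2).sum(axis=0)
    rss = ((u[:, None] - X * coefs) ** 2).sum(axis=0)
    j_star = int(np.argmin(rss))               # best-fitting component
    beta[j_star] += nu * coefs[j_star]         # update only that component
    fit = X @ beta
# component-wise updates recover the sparse signal
assert np.allclose(beta[[1, 3, 4]], 0.0, atol=0.15)
assert abs(beta[0] - 2.0) < 0.2 and abs(beta[2] + 1.0) < 0.2
```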
Component-wise gradient boosting
◮ For fixed ν and smoothing parameters (chosen for unbiased selection of base-learners; Hofner et al., 2012), the number of boosting iterations determines the model complexity.
◮ Choose the optimal stopping iteration m_stop by resampling methods (on the level of curves).
◮ Model selection by early stopping and stability selection (Shah & Samworth, 2013).
◮ Implementation in the R package FDboost (Brockhaus, Rügamer & Hothorn, 2017), based on mboost (Hothorn et al., 2017) and gamboostLSS (Hofner et al., 2017).
Publications
◮ Brockhaus, Scheipl, Hothorn & Greven (2015), The functional linear array model. Statistical Modelling, 15(3), 279–300.
◮ Brockhaus, Fuest, Mayr & Greven (2015), Functional regression models for location, scale and shape applied to stock returns. In: Friedl & Wagner (eds), Proceedings of the 30th IWSM, 117–122.
◮ Brockhaus, Rügamer & Greven (2017), Boosting functional regression models with FDboost. arXiv:1705.10662.
◮ Brockhaus, Melcher, Leisch & Greven (2017), Boosting flexible functional regression models with a high number of functional historical effects. Statistics and Computing, 27, 913–926.
◮ Rügamer, Brockhaus, Gentsch, Scherer & Greven (2018), Boosting factor-specific functional historical models for the detection of synchronisation in bioelectrical signals. JRSS-C, 67, 621–642.
Implementation:
◮ Brockhaus, Rügamer & Hothorn (2017), FDboost: Boosting functional regression models. R package version 0.3-0.
Feeding Behavior of Pigs
◮ "PIGWISE" project (Maselyne et al., 2014; Gertheiss et al., 2015)
◮ proximity of the pigs to the trough (yes/no) recorded every 10 seconds for 127 pigs over 102 days using RFID tags
◮ yields binary functional observations over [0h, 24h] at 0.1 Hz
◮ model to investigate (un)common feeding behavior → ethology, monitoring of pigs' health
[Figure: trough-visit profile of pig 57 over the day, across days]
Feeding Behavior of Pigs
◮ a generalized functional response model is needed
◮ autoregressive terms and day-specific (functional random) effects are of interest
◮ temperature and humidity information is available
Feeding Behavior of Pigs
◮ for illustration, focus on one pig
◮ to reduce noise & data size, aggregate the binary feeding indicators observed every 10 seconds over 10-minute intervals
◮ model Y_i(t) | x_i ∼ Binomial(60, p_i(t)), i = 1, ..., 102, t = 00:00h, 00:10h, ..., 23:50h, with
    logit(p_i(t)) = β_0(t) + b_i(t) + ∫_{t−3h}^{t−10min} y_i(s) β(t, s) ds
◮ b_i(t) iid∼ GP(0, K_b(t, t′)): functional random day effects
◮ assume p_i(t) is smooth over t; use periodic P-splines
◮ fit the generalized functional additive mixed model using pffr in refund on 2/3 training data (ca. 1 min; 5 smoothing parameters, dim(B): 14544 × 850).
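The aggregation step can be sketched in a few lines, on simulated indicators rather than the PIGWISE data:

```python
import numpy as np

# Aggregate 10-second binary trough-visit indicators into 10-minute
# binomial counts Y_i(t) with 60 trials each, as in the model above.
rng = np.random.default_rng(7)
per_day = 24 * 60 * 6                         # 10-second slots per day
binary = rng.random(per_day) < 0.1            # one simulated day of indicators
counts = binary.reshape(-1, 60).sum(axis=1)   # 60 slots per 10-minute interval
assert counts.shape == (144,)                 # 144 ten-minute intervals per day
assert counts.min() >= 0 and counts.max() <= 60
```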
Model Fit
[Figure: fitted μ̂(t) and observed y(t) for selected days (training data)]
Model Predictions
[Figure: predicted μ̂(t) and true y(t) for selected days (test data)]
Effect Estimates
[Figures: estimated smooth intercept β̂_0(t); functional random day effects b̂_i(t) with upper/lower pointwise CIs; estimated coefficient surface β̂(t, s), maximum lag 3 h]
Outline
Introduction
  Functional data analysis
  Research program
Structured additive functional regression
  Model
  Specification of Model Terms
Estimation
  Penalized Likelihood Estimation
  Gradient Boosting
Example
  Feeding Behavior of Pigs
Summary & Outlook
  Wrap-Up
Summary
Unified framework for a broad class of regression models for functional covariates and/or functional responses. Basic ideas:
◮ define a generally applicable, very flexible representation of model terms via penalized tensor product bases
◮ don't reinvent the wheel for functional data: rephrase as scalar data problems and adapt methodology & implementations
◮ different estimation frameworks for different goals and data settings:
  ◮ boosting: variable & model selection; flexible choice of the feature(s) of the conditional response distribution to model
  ◮ likelihood-based (additive) mixed models: valid inference, uncertainty quantification without resampling
◮ everything implemented in R: FDboost, refund
32 / 35
Outlook
◮ models for multivariate functional responses (trajectories, shapes, accelerometry, . . . ) and probability densities
◮ computationally efficient methods for dealing with heterogeneous, auto-correlated functional residuals & random effects: "penalized GEEs"? "sandwich" estimators? . . . cf. Morris et al.'s WFMM (Morris & Carroll, 2006, e.g.)
◮ refactoring pffr() now that we know what we're doing (somewhat...)
◮ better visualization & model diagnostics (cf. mgcViz, Fasiolo et al. (2018))
33 / 35
Additional References
◮ Bühlmann & Hothorn (2007), Boosting algorithms: Regularization, prediction and model fitting (with discussion). Statistical Science, 22, 477–505.
◮ Cuevas (2014), A partial overview of the theory of statistics with functional data. Journal of Statistical Planning and Inference, 147, 1–23.
◮ Currie, Durban & Eilers (2006), Generalized linear array models with applications to multidimensional smoothing. JRSS-B, 68, 259–280.
◮ Eilers & Marx (1996), Flexible smoothing with B-splines and penalties (with comments and rejoinder). Statistical Science, 11, 89–121.
◮ Friedman (1982), Piecewise exponential models for survival data with covariates. The Annals of Statistics, 10(1), 101–113.
◮ Gertheiss, Maier, Hessel & Staicu (2015), Marginal functional regression models for analyzing the feeding behavior of pigs. Journal of Agricultural, Biological, and Environmental Statistics, 20(3), 353–370.
◮ Hofner, Mayr, Fenske & Schmid (2016), gamboostLSS: Boosting methods for GAMLSS. R package version 1.3-0.
◮ Hothorn, Bühlmann, Kneib, Schmid & Hofner (2015), mboost: Model-based boosting. R package version 2.4-2.
◮ Hothorn, Kneib & Bühlmann (2013), Conditional transformation models. JRSS-B, 76, 3–27.
34 / 35
Additional References
◮ Li & Wood (2019), Faster model matrix crossproducts for large generalized linear models with discretized covariates. Statistics and Computing, to appear.
◮ Maselyne, Saeys, Ketelaere, Mertens, Vangeyte, Hessel, Millet & Van Nuffel (2014), Validation of a high frequency radio frequency identification (HF RFID) system for registering feeding patterns of growing-finishing pigs. Computers and Electronics in Agriculture, 102, 10–18.
◮ Marra & Wood (2011), Practical variable selection for generalized additive models. CSDA, 55, 2372–2387.
◮ Meinshausen & Bühlmann (2010), Stability selection. JRSS-B, 72(4), 417–473.
◮ Morris & Carroll (2006), Wavelet-based functional mixed models. JRSS-B, 68(2), 179–199.
◮ Morris (2015), Functional regression. Annual Review of Statistics and Its Application, 2, 321–359.
◮ Ramsay & Silverman (2005), Functional Data Analysis. Springer, New York.
◮ Shah & Samworth (2013), Variable selection with error control: another look at stability selection. JRSS-B, 75, 55–80.
◮ Wood (2006), Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC, Boca Raton, Florida.
◮ Wood (2011), Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. JRSS-B, 73, 3–36.
◮ Wood (2015), mgcv: Mixed GAM computation vehicle with GCV/AIC/REML smoothness estimation. R package version 1.8-6.
◮ Wood, Pya & Säfken (2016), Smoothing parameter and model selection for general smooth models. JASA, 111(516), 1548–1563.
35 / 35
Outline
Example: Nutrition & ICU survival
Exposure-Lag-Response Association (ELRA): Nutrition & ICU Survival
◮ multi-center study of critical care patients from 457 ICUs (≈ 10k patients)
◮ investigate acute mortality (first 30 d)
◮ confounders z: age, gender, Apache II score, year of admission, ICU random effect, ...
◮ 12-day nutrition record xi(s):
  ◮ prescribed calories (determined at baseline)
  ◮ daily caloric intake
  ◮ daily caloric adequacy (CA) = caloric intake / prescribed calories
1 / 8
Exposure-Lag-Response Association (ELRA): Nutrition & ICU Survival
[Figure: example 12-day nutrition profiles xi(s)]
1 / 8
Clinical Questions & Parameterization
Importance of timing & amount of nutrition during the critical, acute & recovery phases is unclear.
Model assumptions: delayed, time-limited, cumulative & time-varying effect of the time-varying exposure xi(s).
Idea: (partial) effect of xi(s) on the log-hazard at time t:
∫_{W(t)} xi(s) β(t, s) ds
◮ delay & time limit defined by the window W(t)
◮ time-varying partial effects β(t, s)
2 / 8
Piece-wise Exponential Model
Time-to-event model reparameterized as a Poisson GAMM:
log λ(t | zi, xi) = α0(t) + Σ_j hj(zi, t) + ∫_{W(t)} xi(s) β(t, s) ds
◮ estimated with mgcv (could use FDboost as well)
◮ flexible, semi-parametric modeling of the baseline hazard rate α0(t)
◮ use the GFAMM framework for functional covariates to perform inference, e.g. about nutrition effects
◮ flexible confounder adjustment: non-linear & time-varying effects; time-varying covariates, random effects / frailties, . . .
Details 3 / 8
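Fitting the reparameterized Poisson GAMM first requires expanding each subject's follow-up into interval-wise pseudo-observations. A minimal hand-rolled sketch of this data augmentation in Python (pammtools automates this in R; the function name split_to_ped and its tuple layout are made up for illustration):

```python
import math

def split_to_ped(time, status, cuts):
    """One subject -> list of (interval_end, event, log_time_at_risk) rows."""
    rows = []
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        if time <= lo:                      # subject no longer at risk
            break
        at_risk = min(time, hi) - lo        # time spent at risk in [lo, hi)
        event = int(status == 1 and lo < time <= hi)
        rows.append((hi, event, math.log(at_risk)))
    return rows

# subject with an event at t = 2.5, cut points 0 < 1 < 2 < 3:
ped = split_to_ped(time=2.5, status=1, cuts=[0.0, 1.0, 2.0, 3.0])
# three pseudo-observations; only the last carries the event, with
# offset log(0.5) for the half interval spent at risk
```

Stacked across subjects, these rows are fed to a Poisson GAM with the log time-at-risk as offset, which recovers the piece-wise constant hazard.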
ELRA: Example
Compare the hazard of a patient with a given nutrition record to constant undernutrition (ceteris paribus):
[Figure: caloric adequacy [%] by protocol day for the two nutrition profiles compared]
4 / 8
[Figure: caloric adequacy profiles (as above) and the resulting hazard ratio estimate over t]
I'm leaving out a LOT of complications here . . .
4 / 8
ELRA Publications
Time-to-event models:
◮ Bender, Groll & Scheipl (2017), A generalized additive model approach to time-to-event analysis. Statistical Modelling, 18(3-4), 299–341.
◮ Bender, Scheipl, Hartl, Day & Küchenhoff (2018), Penalized estimation of complex, non-linear exposure-lag-response associations. Biostatistics, kxy003.
◮ Hartl, Bender, Scheipl, Kuppinger, Day & Küchenhoff (2018), Calorie intake and short-term survival of critically ill patients. Clinical Nutrition, in press.
Distributed lag models:
◮ Obermeier, Scheipl, Heumann, Wassermann & Küchenhoff (2015), Flexible distributed lags for modelling earthquake data. JRSS-C, 64(2), 395–412.
◮ Gasparrini, Scheipl, Armstrong & Kenward (2017), A penalized framework for distributed lag non-linear models. Biometrics, 73, 938–948.
Implementation:
◮ Gasparrini, Armstrong & Scheipl (2016), dlnm: Distributed lag non-linear models. R package version 2.2-5.
◮ Bender & Scheipl (2017), pammtools: Piece-wise exponential additive mixed modeling tools. https://github.com/adibender/pammtools
5 / 8
Piece-wise Exponential Model
◮ define cut points on the time line: 0 = κ0 < . . . < κJ = tmax
◮ assume constant hazard rates in each interval [κj−1, κj)
⇒ the likelihood for the event indicators
yij = 1 if the event occurs in [κj−1, κj), 0 else
is proportional to a Poisson likelihood (Friedman, 1982)
◮ for t ∈ [κj−1, κj), the piecewise constant hazard rate λ(t; zi, xi) of the event process is the rate of the pseudo-Poisson responses yij!
back 6 / 8
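The proportionality argument can be written out in one line; here t_{ij} denotes the time subject i spends at risk in interval j (notation t_{ij} and the per-subject product are added here to spell out the slide's claim):

```latex
L_i(\lambda)
  \;=\; \prod_{j} \lambda_{ij}^{\,y_{ij}} \, e^{-\lambda_{ij} t_{ij}}
  \;\propto\; \prod_{j} \frac{(\lambda_{ij} t_{ij})^{y_{ij}}}{y_{ij}!}\, e^{-\lambda_{ij} t_{ij}},
```

since the omitted factor t_{ij}^{y_{ij}} / y_{ij}! does not depend on λ. The right-hand side is a product of Poisson(μ_{ij} = λ_{ij} t_{ij}) densities, so fitting a Poisson model with offset log t_{ij} yields the same estimates as the piece-wise exponential likelihood.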
Identifiability
1. Functional response models: use suitable identifiability constraints.
2. Functional covariates x(s) are often well approximated by truncated Karhunen-Loève expansions,
xi(s) = Σ_{l=1}^{∞} ξi,l φ^X_l(s) ≈ Σ_{l=1}^{M} ξi,l φ^X_l(s).
For a linear functional effect hj(x, t) = ∫_S x(s) β(s, t) ds, β is only identified up to functions f with f(·, t) ∈ span({φ^X_l : l > M}) ∀t.
For finite grid data, the design matrix is rank-deficient if M < Kj, or if the span of the basis for β in s-direction contains functions orthogonal to {φ^X_l : 1 ≤ l ≤ M} under numerical integration.
back 7 / 8
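The rank deficiency for M < Kj can be seen directly in a small numerical experiment. A hedged sketch with generic numpy stand-ins (the "eigenfunctions", the polynomial basis for β, and all dimensions are made up for illustration; this is not the refund diagnostic code):

```python
import numpy as np

rng = np.random.default_rng(1)
G, M, K, n = 100, 3, 8, 200          # grid size, KL truncation, basis dim, sample size
s = np.linspace(0, 1, G)

# covariate curves spanned by only M = 3 "eigenfunctions" -> X has rank M
phi = np.stack([np.sin((l + 1) * np.pi * s) for l in range(M)], axis=1)
X = rng.normal(size=(n, M)) @ phi.T                  # n x G

# quadrature weights + a K-dimensional basis for beta(s) (here: polynomials)
B = np.vander(s, K, increasing=True)                 # G x K, full column rank
W = np.full(G, 1 / G)                                # Riemann weights
Z = (X * W) @ B                                      # design for the functional term

print(np.linalg.matrix_rank(Z))                      # -> 3 (= M < K): rank-deficient
```

With M < K, the K coefficients of β are not identified from Z alone; only the penalty (or constraints) pins down the remaining directions.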
Identifiability
◮ Iff the kernels of the penalty and the design matrix do not overlap, there is a unique minimum of the penalty on each hyperplane defined by coefficient vectors θ representing identical hj functions.
◮ The penalized optimization criterion then finds the smoothest solution (i.e., argmin pen(θ)) among the solutions with optimal fit to the data.
◮ Practical recommendations (Scheipl and Greven, 2015):
  1. avoid rank-reducing pre-processing of functional covariates
  2. check diagnostic measures (implemented in refund/FDboost)
  3. use penalties with small kernels: 1st-order differences/derivatives, full-rank penalties (Marra and Wood, 2011), or suitable constraints
8 / 8
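Recommendation 3 is easy to verify numerically: a 1st-order difference penalty has a one-dimensional kernel (the constant vector), so its kernel can only overlap the design kernel if constant functions themselves are unidentified. A generic numpy sketch with a hypothetical small basis dimension:

```python
import numpy as np

K = 8
D = np.diff(np.eye(K), axis=0)        # 1st-order difference matrix, (K-1) x K
P = D.T @ D                           # penalty matrix pen(theta) = theta' P theta

# kernel of P = eigenvectors with (numerically) zero eigenvalue
eigval, eigvec = np.linalg.eigh(P)
kernel = eigvec[:, eigval < 1e-10]
# one-dimensional kernel, spanned by the constant vector
```

Higher-order difference penalties add polynomial directions to this kernel, which is why lower-order penalties (or full-rank penalties) are the safer default here.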