SLIDE 1

Goodness-of-fit tests for the functional linear model with scalar response with responses missing at random

Manuel Febrero-Bande¹, Pedro Galeano², Eduardo García-Portugués² and Wenceslao González-Manteiga¹

¹ Department of Statistics, Mathematical Analysis and Optimization, Universidade de Santiago de Compostela

² Department of Statistics and UC3M-BS Institute of Financial Big Data, Universidad Carlos III de Madrid

Pedro Galeano Goodness-of-fit for the FLM with MAR responses III IWAFDA 1 / 22

SLIDE 2

Motivation

Regression model with a functional covariate and a scalar response:

◮ General model: $Y = m(X) + \varepsilon$, where:
  ⋆ Real response: $Y$, centered.
  ⋆ Functional covariate: $X \in H$, centered and with covariance operator $\Gamma$.
  ⋆ Hilbert space: $H$ of square-integrable functions, with inner product $\langle \cdot, \cdot \rangle$ and associated norm $\|\cdot\|$.
  ⋆ Regression operator: $m(\mathcal{X}) = E[Y \mid X = \mathcal{X}]$.
  ⋆ Error random variable: $\varepsilon \sim (0, \sigma^2)$, with $\varepsilon$ uncorrelated with $X$.
◮ Interest: Given a random sample $\{(X_i, Y_i)\}_{i=1}^{n}$ from $(X, Y)$, check whether the regression operator $m$ is linear.
◮ Goodness-of-fit tests for linearity:
  ⋆ García-Portugués, González-Manteiga and Febrero-Bande (2014, JCGS).
  ⋆ Cuesta-Albertos, García-Portugués, González-Manteiga and Febrero-Bande (2019, AoS).

SLIDE 3

Motivation

Febrero-Bande, Galeano and González-Manteiga (2019, CSDA):

◮ Data set: Data from 73 Spanish weather stations in the period 1980-2009.
◮ Functional covariate: Mean curve of the annual average daily temperature.
◮ Real response: Average of the total number of hours of sunshine per year.
◮ Missing responses: The responses are not observed in 26 out of the 73 weather stations (35.62% of missing responses).
◮ Functional linear model with scalar response (FLMSR): $m(X) = \langle X, \beta \rangle$, where $\beta \in H$ is a functional slope and $\langle \cdot, \cdot \rangle$ is the inner product of $H$.
◮ Two methods for estimating $\beta$ with FPCs:
  1. Simplified method: Delete the pairs with missing responses.
  2. Imputed method: Impute the missing responses before estimation.
◮ Results suggest: The imputed method outperforms the simplified method.

SLIDE 4

Motivation

Work in progress:

◮ Goal: Analyze goodness-of-fit tests for functional regression models when some of the responses are missing at random.
◮ Two possibilities:
  1. Use goodness-of-fit tests after deleting pairs with missing responses.
  2. Impute missing responses, then use goodness-of-fit tests.
◮ Question: Which option is better?
◮ Today, initial results on:
  ⋆ Model: Functional linear model with scalar response (FLMSR).
  ⋆ Goodness-of-fit test: García-Portugués et al. (2014, JCGS).

SLIDE 5

The testing problem

Elements of the testing problem:

◮ Problem: Test the linear hypothesis
  $H_0 : m \in \{\langle \cdot, \beta \rangle : \beta \in H\}$
  versus the alternative hypothesis
  $H_1 : m \notin \{\langle \cdot, \beta \rangle : \beta \in H\}$.
◮ Random sample: $\{(X_i, Y_i, R_i)\}_{i=1}^{n}$ generated from $(X, Y, R)$, where $R$ is Bernoulli with $R_i = 1$ if $Y_i$ is observed, and $R_i = 0$ if $Y_i$ is missing.
◮ Missing at Random (MAR) mechanism:
  $P(R = 1 \mid Y, X) = P(R = 1 \mid X) = p(X)$,
  where $p : H \to [0, 1]$ is an unspecified operator of $X$.
◮ Consequence: This mechanism allows missing responses to be predicted with the available information.
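The MAR mechanism can be illustrated with a small simulation. The sketch below assumes a logistic form for $p(X)$ applied to a scalar summary of each curve; the function name `draw_mar_indicators`, the coefficients `a` and `b`, and the Gaussian scores are hypothetical choices made only for illustration.

```python
import math
import random

def logistic(t):
    """Logistic function, mapping the real line into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def draw_mar_indicators(scores, a=1.0, b=1.0, seed=0):
    """Draw R_i ~ Bernoulli(p(X_i)) with p(X_i) = logistic(a + b * score_i).

    `scores` stands in for a scalar summary of each curve X_i (for example,
    an inner product with a fixed function). The probability depends on X
    only, never on Y, which is what makes the mechanism MAR.
    """
    rng = random.Random(seed)
    return [1 if rng.random() < logistic(a + b * s) else 0 for s in scores]

# Hypothetical scalar summaries of 1000 curves.
score_rng = random.Random(1)
scores = [score_rng.gauss(0.0, 1.0) for _ in range(1000)]

R = draw_mar_indicators(scores)
missing_rate = 1.0 - sum(R) / len(R)
print(f"proportion of missing responses: {missing_rate:.3f}")
```

Because the missingness probability is a function of the covariate alone, the observed pairs still carry the information needed to predict the missing responses.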

SLIDE 6

Estimation of the FLMSR with MAR responses

Estimation of β with Functional Principal Components (FPCs):

◮ The FLMSR: $Y = \langle X, \beta \rangle + \varepsilon$.
◮ Functional slope: $\beta = \sum_{k=1}^{\infty} b_k \psi_k$, where:
  ⋆ $\psi_1, \psi_2, \ldots$ are the eigenfunctions of $\Gamma$ linked to the eigenvalues $\lambda_1 > \lambda_2 > \cdots > 0$.
  ⋆ $b_k = \mathrm{Cov}[Y, S_k] / \lambda_k$, for $k \in \mathbb{N}$.
  ⋆ $S_k = \langle X, \psi_k \rangle$, for $k \in \mathbb{N}$, are the FPC scores of $X$.
◮ Problem: Estimate $\beta$ with a random sample $\{(X_i, Y_i, R_i)\}_{i=1}^{n}$.
◮ Need:
  ⋆ Estimates of $\psi_1, \psi_2, \ldots$ and $\lambda_1, \lambda_2, \ldots$
  ⋆ Sample versions of the scores $S_1, S_2, \ldots$
  ⋆ A cutoff to truncate the infinite sum that defines $\beta$.
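The expression for $b_k$ follows from the orthogonality of the FPC scores. A short derivation, assuming the FLMSR holds and using only the definitions above:

```latex
\begin{align*}
Y &= \langle X, \beta \rangle + \varepsilon
   = \sum_{j=1}^{\infty} b_j \langle X, \psi_j \rangle + \varepsilon
   = \sum_{j=1}^{\infty} b_j S_j + \varepsilon, \\
\operatorname{Cov}[Y, S_k]
  &= \sum_{j=1}^{\infty} b_j \operatorname{Cov}[S_j, S_k]
     + \operatorname{Cov}[\varepsilon, S_k]
   = b_k \lambda_k .
\end{align*}
```

The last equality uses $\operatorname{Cov}[S_j, S_k] = \langle \Gamma \psi_j, \psi_k \rangle = \lambda_k \delta_{jk}$ and the fact that $\varepsilon$ is uncorrelated with $X$, so that $b_k = \operatorname{Cov}[Y, S_k] / \lambda_k$.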

SLIDE 7

Estimation of the FLMSR with MAR responses

Simplified estimation (Febrero-Bande et al., 2019, CSDA):

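◮ Complete-case analysis: Delete pairs with missing responses.
◮ Covariates of complete pairs: $X_S = \{X_i : i \in I_S\}$, where $I_S = \{i : R_i = 1\}$.
◮ Estimates of $\psi_1, \psi_2, \ldots$ and $\lambda_1, \lambda_2, \ldots$: Eigenfunctions $\hat\psi_{1,S}, \hat\psi_{2,S}, \ldots$ and eigenvalues $\hat\lambda_{1,S} \ge \hat\lambda_{2,S} \ge \cdots$ of the sample covariance operator $\hat\Gamma_{X_S}$.
◮ Sample FPC scores: $\hat S_{i,k,S} = \langle X_i, \hat\psi_{k,S} \rangle$, for $i \in I_S$ and $k \in \mathbb{N}$.
◮ Estimate of $b_k$:
  $\hat b_{k,S} = \frac{1}{\hat\lambda_{k,S}} \left( \frac{1}{n_S} \sum_{i \in I_S} Y_i \hat S_{i,k,S} \right)$, where $n_S = \# I_S$, for $k \in \mathbb{N}$.
◮ Estimate of $\beta$:
  $\hat\beta_{k_S} = \sum_{k=1}^{k_S} \hat b_{k,S} \hat\psi_{k,S}$, where $k_S$ is a cutoff.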
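A minimal numerical sketch of the simplified estimator follows. It assumes curves observed on a common equispaced grid, approximates inner products by Riemann sums, and centers with the complete-case means; the names `fpc_slope` and `simplified_estimate` and the toy data are hypothetical, not taken from the cited paper.

```python
import numpy as np

def fpc_slope(X, Y, k, dt):
    """FPC estimate of beta from centered curves X (n x p grid values) and
    centered responses Y (length n), truncated at k components.

    Inner products in H are approximated by Riemann sums with grid step dt.
    """
    n = X.shape[0]
    cov = (X.T @ X) / n * dt                # discretized sample covariance operator
    eigval, eigvec = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:k]    # indices of the k largest eigenvalues
    lam = eigval[order]
    psi = eigvec[:, order] / np.sqrt(dt)    # eigenfunctions with unit L2 norm
    scores = X @ psi * dt                   # S_{i,k} = <X_i, psi_k>
    b = (scores.T @ Y) / n / lam            # b_k = (1 / lambda_k) * mean(Y_i * S_{i,k})
    return psi @ b, psi, lam

def simplified_estimate(X, Y, R, k, dt):
    """Complete-case (simplified) estimate: use only the pairs with R_i = 1."""
    obs = R == 1
    x_mean, y_mean = X[obs].mean(axis=0), Y[obs].mean()
    beta, _, _ = fpc_slope(X[obs] - x_mean, Y[obs] - y_mean, k, dt)
    return beta, x_mean, y_mean

# Toy data (assumption): three sine components and a linear model.
rng = np.random.default_rng(0)
n, p = 200, 101
t = np.linspace(0.0, 1.0, p)
dt = t[1] - t[0]
phi = np.sqrt(2.0) * np.sin(np.outer(np.arange(1, 4), np.pi * t))  # 3 x p basis
xi = rng.normal(size=(n, 3)) * np.sqrt([4.0, 1.0, 0.25])
X = xi @ phi
beta_true = phi[0] + 0.5 * phi[1]
Y = X @ beta_true * dt + rng.normal(scale=0.1, size=n)

# MAR indicators: the probability of observing Y depends on X only.
R = (rng.random(n) < 1.0 / (1.0 + np.exp(-(1.0 + 0.5 * xi[:, 0])))).astype(int)
Y_obs = np.where(R == 1, Y, np.nan)

beta_S, x_mean, y_mean = simplified_estimate(X, Y_obs, R, k=3, dt=dt)
l2_error = np.sqrt(np.sum((beta_S - beta_true) ** 2) * dt)
print(f"L2 error of the simplified estimate: {l2_error:.3f}")
```

The sign ambiguity of the eigenfunctions does not matter here, since each term $\hat b_{k,S} \hat\psi_{k,S}$ is invariant to flipping the sign of $\hat\psi_{k,S}$.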

SLIDE 8

Estimation of the FLMSR with MAR responses

Imputed estimator (Febrero-Bande et al., 2019, CSDA):

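◮ Impute missing responses: $\hat Y_{i,k_S} = \langle X_i, \hat\beta_{k_S} \rangle$, for $i \notin I_S$.
◮ New set of responses: $\tilde Y_{i,k_S} = R_i Y_i + (1 - R_i) \hat Y_{i,k_S}$, for $i = 1, \ldots, n$.
◮ Covariates of all pairs: $X_C = \{X_i : i = 1, \ldots, n\}$.
◮ Estimates of $\psi_1, \psi_2, \ldots$ and $\lambda_1, \lambda_2, \ldots$: Eigenfunctions $\hat\psi_{1,C}, \hat\psi_{2,C}, \ldots$ and eigenvalues $\hat\lambda_{1,C} \ge \hat\lambda_{2,C} \ge \cdots$ of the sample covariance operator $\hat\Gamma_{X_C}$.
◮ Sample FPC scores: $\hat S_{i,k,C} = \langle X_i, \hat\psi_{k,C} \rangle$, for $i = 1, \ldots, n$ and $k \in \mathbb{N}$.
◮ Estimate of $b_k$:
  $\hat b_{k,k_S,C} = \frac{1}{\hat\lambda_{k,C}} \left( \frac{1}{n} \sum_{i=1}^{n} \tilde Y_{i,k_S} \hat S_{i,k,C} \right)$, for $k \in \mathbb{N}$.
◮ Estimate of $\beta$:
  $\hat\beta_{k_S,k_C} = \sum_{k=1}^{k_C} \hat b_{k,k_S,C} \hat\psi_{k,C}$, where $k_C$ is a cutoff.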
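The two-stage structure of the imputed estimator can be sketched as follows. This is an illustration under assumptions: curves sit on an equispaced grid, inner products are Riemann sums, sample means are added back when imputing (the slides work with centered variables), and the names `fpc_slope`, `imputed_estimate` and the toy data are hypothetical.

```python
import numpy as np

def fpc_slope(X, Y, k, dt):
    """FPC slope estimate from centered curves X (n x p) and centered Y."""
    n = X.shape[0]
    eigval, eigvec = np.linalg.eigh((X.T @ X) / n * dt)
    order = np.argsort(eigval)[::-1][:k]      # k largest eigenvalues
    psi = eigvec[:, order] / np.sqrt(dt)      # unit-L2-norm eigenfunctions
    scores = X @ psi * dt                     # <X_i, psi_k>
    b = (scores.T @ Y) / n / eigval[order]
    return psi @ b

def imputed_estimate(X, Y, R, k_s, k_c, dt):
    """Impute missing responses with the simplified fit, then re-estimate."""
    obs = R == 1
    # Stage 1: simplified estimate from the complete pairs (cutoff k_s).
    x_s, y_s = X[obs].mean(axis=0), Y[obs].mean()
    beta_s = fpc_slope(X[obs] - x_s, Y[obs] - y_s, k_s, dt)
    # Stage 2: keep observed responses, replace missing ones by predictions.
    predicted = y_s + (X - x_s) @ beta_s * dt
    Y_new = np.where(obs, Y, predicted)
    # Stage 3: re-estimate with all n curves (cutoff k_c).
    x_c, y_c = X.mean(axis=0), Y_new.mean()
    beta_c = fpc_slope(X - x_c, Y_new - y_c, k_c, dt)
    return beta_c, Y_new

# Toy data (assumption): three sine components and a linear model.
rng = np.random.default_rng(0)
n, p = 200, 101
t = np.linspace(0.0, 1.0, p)
dt = t[1] - t[0]
phi = np.sqrt(2.0) * np.sin(np.outer(np.arange(1, 4), np.pi * t))
xi = rng.normal(size=(n, 3)) * np.sqrt([4.0, 1.0, 0.25])
X = xi @ phi
beta_true = phi[0] + 0.5 * phi[1]
Y = X @ beta_true * dt + rng.normal(scale=0.1, size=n)
R = (rng.random(n) < 1.0 / (1.0 + np.exp(-(1.0 + 0.5 * xi[:, 0])))).astype(int)
Y_obs = np.where(R == 1, Y, np.nan)           # missing responses stored as NaN

beta_C, Y_new = imputed_estimate(X, Y_obs, R, k_s=3, k_c=3, dt=dt)
l2_error = np.sqrt(np.sum((beta_C - beta_true) ** 2) * dt)
print(f"L2 error of the imputed estimate: {l2_error:.3f}")
```

The second stage uses the eigenfunctions computed from all $n$ curves, not only from those with observed responses, which is the structural difference from the simplified estimator.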

SLIDE 9

Estimation of the FLMSR with MAR responses

Important notes:

◮ Selection of cutoffs: Use leave-one-out cross-validation or standard model selection criteria (GCV, AIC, AICc, SIC, SICc, ...).
◮ Consequence: $k_S$ in $\hat\beta_{k_S}$ may differ from $k_S$ and/or $k_C$ in $\hat\beta_{k_S,k_C}$; e.g., it is possible that $\hat\beta_2$ and $\hat\beta_{1,3}$ are the chosen estimators, respectively.
◮ Two sources of potential improvement:
  1. Principal component estimation: $\hat\beta_{k_S}$ depends on $\hat\psi_{k,S}$ (constructed with $X_S$), while $\hat\beta_{k_S,k_C}$ depends on $\hat\psi_{k,C}$ (constructed with $X_C$).
  2. Cutoff selection: $\hat\beta_{k_S,k_C}$ may have smaller MSEE than $\hat\beta_{k_S}$ if the cutoffs are selected appropriately (see Febrero-Bande et al., 2019, CSDA).

SLIDE 10

Testing linearity with MAR responses

A Cramér-von Mises testing procedure (I):

◮ García-Portugués et al. (2014, JCGS): The following statements are equivalent:
  1. $m(\mathcal{X}) = \langle \mathcal{X}, \beta \rangle$, $\forall \mathcal{X} \in H$.
  2. $E\left[ (Y - \langle X, \beta \rangle) \mathbb{1}_{\{\langle X, \gamma \rangle \le u\}} \right] = 0$, for a.e. $u \in \mathbb{R}$ and $\forall \gamma \in S_H$, where $S_H = \{\gamma \in H : \|\gamma\| = 1\}$.
◮ Estimate of $\beta$: $\hat\beta$ may be $\hat\beta_{k_S}$, $\hat\beta_{k_S,k_C}$ or some other estimator.
◮ Residuals: $\hat\varepsilon_i = Y_i - \langle X_i, \hat\beta \rangle$, for $i \in I_S = \{i : R_i = 1\}$.
◮ Therefore: Residuals are available only for the observed responses.

SLIDE 11

Testing linearity with MAR responses

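A Cramér-von Mises testing procedure (II):

◮ Residual marked empirical process based on projections:
  $R_n(\hat\beta, u, \gamma) = n^{-1/2} \sum_{i=1}^{n} R_i \hat\varepsilon_i \mathbb{1}_{\{\langle X_i, \gamma \rangle \le u\}}$, where $u \in \mathbb{R}$ and $\gamma \in S_H$.
◮ CvM statistic: Measure the deviation of $\{(X_i, Y_i, R_i)\}_{i=1}^{n}$ from $H_0$ with:
  $\mathrm{PCvM}(\hat\beta) = \int_{\mathbb{R} \times S_H} R_n(\hat\beta, u, \gamma)^2 \, F_{n,\gamma}(du) \, \omega(d\gamma)$,
  where $F_{n,\gamma}$ is the ECDF of $\{\langle X_i, \gamma \rangle : i = 1, \ldots, n\}$, and $\omega$ is a measure on $S_H$.
◮ Unfortunately: Computation of the statistic $\mathrm{PCvM}(\hat\beta)$ is not feasible because $S_H$ is infinite-dimensional.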

SLIDE 12

Testing linearity with MAR responses

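A Cramér-von Mises testing procedure (III):

◮ Idea: Replace $\gamma \in S_H$ in PCvM with:
  ⋆ Simplified estimator: $\hat\gamma_{k_S} = \sum_{k=1}^{k_S} \langle \gamma, \hat\psi_{k,S} \rangle \hat\psi_{k,S}$, where $\gamma \in S_H$.
  ⋆ Imputed estimator: $\hat\gamma_{k_S,k_C} = \sum_{k=1}^{k_C} \langle \gamma, \hat\psi_{k,C} \rangle \hat\psi_{k,C}$, where $\gamma \in S_H$.
◮ Modified CvM statistic:
  $\mathrm{MPCvM}(\hat\beta) = \int_{\mathbb{R} \times S_H^{k}} R_n(\hat\beta, u, \hat\gamma_k)^2 \, F_{n,\hat\gamma_k}(du) \, \omega(d\hat\gamma_k)$,
  where $k$ is either $k_S$ or $k_C$, and $F_{n,\hat\gamma_k}$ is the ECDF of $\{\langle X_i, \hat\gamma_k \rangle : i = 1, \ldots, n\}$.
◮ Simpler expression: After some algebra, it is possible to show that:
  $\mathrm{MPCvM}(\hat\beta) = n^{-2} \hat\varepsilon_S' A \hat\varepsilon_S$,
  where $\hat\varepsilon_S$ is the vector of residuals and $A$ is a certain square symmetric matrix.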
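To make the structure of the modified statistic concrete, the sketch below approximates the integral over directions by Monte Carlo instead of using the closed-form matrix $A$. It is not the authors' computation: it assumes $\omega$ is the uniform distribution on the unit sphere of the $k$-dimensional FPC space, works directly with FPC scores, and uses the hypothetical name `pcvm_monte_carlo`.

```python
import numpy as np

def pcvm_monte_carlo(scores, resid, n_dir=200, seed=0):
    """Monte Carlo approximation of the projected CvM statistic.

    scores : (n, k) FPC scores of all curves, so <X_i, gamma_k> = scores[i] @ g
             for a direction with coefficient vector g in the FPC basis.
    resid  : length-n residuals, set to 0 where the response is missing,
             which plays the role of R_i * eps_i in the marked process.
    """
    rng = np.random.default_rng(seed)
    n, k = scores.shape
    g = rng.normal(size=(n_dir, k))
    g /= np.linalg.norm(g, axis=1, keepdims=True)   # uniform directions on the sphere
    total = 0.0
    for direction in g:
        proj = scores @ direction                   # projected covariates
        # indicator[i, j] is True when <X_i, gamma> <= <X_j, gamma>
        indicator = proj[:, None] <= proj[None, :]
        process = resid @ indicator / np.sqrt(n)    # R_n evaluated at u = proj[j]
        total += np.mean(process ** 2)              # integral against the ECDF
    return total / n_dir

# Toy comparison (assumption): pure-noise residuals versus residuals that
# still contain a quadratic effect of the first score.
rng = np.random.default_rng(1)
scores = rng.normal(size=(150, 2))
noise = rng.normal(size=150)
curved = scores[:, 0] ** 2 - np.mean(scores[:, 0] ** 2)

stat_noise = pcvm_monte_carlo(scores, noise)
stat_curved = pcvm_monte_carlo(scores, curved)
print(f"statistic, noise residuals:  {stat_noise:.3f}")
print(f"statistic, curved residuals: {stat_curved:.3f}")
```

For a fixed set of directions the approximation is a quadratic form in the residual vector scaled by $n^{-2}$, which mirrors the shape of the simpler expression $n^{-2} \hat\varepsilon_S' A \hat\varepsilon_S$.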

SLIDE 13

Testing linearity with MAR responses

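Procedure to calibrate the p-value of $\mathrm{MPCvM}(\hat\beta)$ under $H_0$:

1. Obtain $\hat\beta$ and the associated residuals $\hat\varepsilon_i = Y_i - \langle X_i, \hat\beta \rangle$, for $i \in I_S$.
2. Compute the statistic $\mathrm{MPCvM}(\hat\beta)$.
3. For $b = 1, \ldots, B$, do:
   1. Draw i.i.d. random variables $V_i$ satisfying $E[V_i] = 0$ and $\mathrm{Var}[V_i] = 1$.
   2. Construct bootstrap residuals $\varepsilon_i^{b} = V_i \hat\varepsilon_i$, for $i \in I_S$.
   3. Define a bootstrap sample $Y_i^{b} = \langle X_i, \hat\beta \rangle + \varepsilon_i^{b}$, for $i \in I_S$, and estimate $\beta$ with $\{(X_i, Y_i^{b}, R_i)\}_{i=1}^{n}$, leading to $\hat\beta^{b}$.
   4. Obtain the estimated bootstrap residuals $\hat\varepsilon_i^{b} = Y_i^{b} - \langle X_i, \hat\beta^{b} \rangle$, for $i \in I_S$.
   5. Compute $\mathrm{MPCvM}(\hat\beta^{b})$ with the estimated bootstrap residuals $\hat\varepsilon_i^{b}$, for $i \in I_S$.
4. Estimate the p-value of the test with $\# \{ \mathrm{MPCvM}(\hat\beta) \le \mathrm{MPCvM}(\hat\beta^{b}) \} / B$.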
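The calibration steps above can be sketched in a simplified setting. The example below is an illustration only: it replaces the functional covariate by a single scalar score, uses a straight-line fit as a stand-in for the FLMSR estimator, uses only the complete pairs, and draws Rademacher multipliers (one choice satisfying $E[V_i] = 0$ and $\mathrm{Var}[V_i] = 1$). All function names and the toy data are hypothetical.

```python
import numpy as np

def cvm_statistic(z, resid):
    """One-dimensional CvM statistic of the residual marked process."""
    n = z.size
    indicator = z[:, None] <= z[None, :]         # 1{z_i <= z_j}
    process = resid @ indicator / np.sqrt(n)     # process evaluated at u = z_j
    return np.mean(process ** 2)

def ols_fit(z, y):
    """Fitted values of a straight-line fit (stand-in for the FLMSR fit)."""
    design = np.column_stack([np.ones_like(z), z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coef

def wild_bootstrap_pvalue(z, y, r, B=200, seed=0):
    """Wild-bootstrap p-value using only the pairs with r_i = 1."""
    rng = np.random.default_rng(seed)
    obs = r == 1
    z_obs, y_obs = z[obs], y[obs]
    fitted = ols_fit(z_obs, y_obs)                     # step 1: fit and residuals
    resid = y_obs - fitted
    t_obs = cvm_statistic(z_obs, resid)                # step 2: observed statistic
    exceed = 0
    for _ in range(B):                                 # step 3: bootstrap loop
        v = rng.choice([-1.0, 1.0], size=resid.size)   # 3.1: multipliers
        y_boot = fitted + v * resid                    # 3.2 and 3.3: bootstrap sample
        resid_boot = y_boot - ols_fit(z_obs, y_boot)   # 3.3 and 3.4: refit, residuals
        exceed += cvm_statistic(z_obs, resid_boot) >= t_obs  # 3.5: compare
    return exceed / B                                  # step 4: p-value

# Toy data (assumption): a linear and a quadratic regression, MAR responses.
rng = np.random.default_rng(2)
n = 500
z = rng.normal(size=n)
r = (rng.random(n) < 1.0 / (1.0 + np.exp(-(1.0 + 0.5 * z)))).astype(int)
eps = rng.normal(scale=0.5, size=n)
y_linear = 1.0 + 2.0 * z + eps
y_quadratic = 1.0 + 2.0 * z + 2.0 * z ** 2 + eps

p_linear = wild_bootstrap_pvalue(z, y_linear, r)
p_quadratic = wild_bootstrap_pvalue(z, y_quadratic, r)
print(f"p-value under a linear model:    {p_linear:.3f}")
print(f"p-value under a quadratic model: {p_quadratic:.3f}")
```

Refitting inside the loop matters: the bootstrap statistic is computed from re-estimated residuals, so the estimation effect present in the observed statistic is reproduced in the bootstrap distribution.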

SLIDE 14

Simulation study

Characteristics of the simulation:

◮ Scenarios: 3 (H0 and H1).
◮ Sample sizes: 50, 100 and 250.
◮ Bootstrap samples: 500.
◮ Number of replicas: 1000.
◮ MAR operator: Logistic function.
◮ Estimators of $\beta$: Simplified and imputed estimators with $k_S = k_C$ because the computational cost is high.
◮ Selection of cutoffs: SICc.
◮ Additionally: Imputed estimator with non-parametric imputations as in Ling et al. (2015, JSPI).

SLIDE 15

Simulation study

Table: Size of test (α = 0.05)

Sample size  Scenario  Simple  Imputed  NP     Full
50           1         0.019   0.023    0.171  0.026
50           2         0.049   0.058    0.133  0.045
50           3         0.025   0.026    0.141  0.030
100          1         0.030   0.038    0.215  0.039
100          2         0.053   0.063    0.178  0.070
100          3         0.053   0.051    0.217  0.059
250          1         0.044   0.053    0.295  0.057
250          2         0.049   0.046    0.223  0.046
250          3         0.039   0.038    0.238  0.051

Table: Power of test

Sample size  Scenario  Simple  Imputed  NP     Full
50           1         0.135   0.137    0.255  0.309
50           2         0.971   0.970    0.981  0.992
50           3         0.341   0.337    0.494  0.489
100          1         0.399   0.402    0.445  0.680
100          2         0.998   0.998    0.999  1.000
100          3         0.598   0.595    0.733  0.712
250          1         0.889   0.886    0.800  0.987
250          2         1.000   1.000    1.000  1.000
250          3         0.819   0.819    0.903  0.868

Table: Proportion of missing responses

Sample size  Scenario  H0      H1
50           1         0.3518  0.3518
50           2         0.2552  0.2553
50           3         0.2720  0.2720
100          1         0.3483  0.3487
100          2         0.2555  0.2552
100          3         0.2724  0.2724
250          1         0.3481  0.3481
250          2         0.2565  0.2565
250          3         0.2729  0.2729

SLIDE 16

Real data example

The data set revisited:

◮ Data set: Data from 73 Spanish weather stations in the period 1980-2009.
◮ Functional predictor: Mean curve of the annual average daily temperature (one observation per day).
◮ Smoothing: Discrete functions are converted to functional observations using a B-spline basis of order 4 with 15 basis functions.
◮ Outliers: Remove stations in the Canary Islands and in Port of Navacerrada, leading to 63 stations.
◮ Real response: Total number of hours of sunshine per year.
◮ Missing responses: 22 out of the 63 weather stations (34.92% of missing responses).
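The smoothing step can be sketched with a least-squares cubic B-spline fit. This is an assumption-laden illustration, not the authors' code: it uses SciPy's `make_lsq_spline`, equispaced interior knots, and a synthetic temperature curve in place of the weather-station data. Order 4 corresponds to polynomial degree 3, and 15 basis functions require 15 + 3 + 1 = 19 knots.

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

degree = 3                      # B-splines of order 4 are cubic
n_basis = 15
days = np.arange(1, 366, dtype=float)

# Synthetic stand-in for one station's daily temperatures (assumption).
rng = np.random.default_rng(0)
signal = 15.0 - 8.0 * np.cos(2.0 * np.pi * (days - 20.0) / 365.0)
temps = signal + rng.normal(scale=1.0, size=days.size)

# Knot vector: boundary knots repeated (degree + 1) times plus interior knots.
n_interior = n_basis - degree - 1                  # 11 interior knots
interior = np.linspace(days[0], days[-1], n_interior + 2)[1:-1]
knots = np.concatenate([[days[0]] * (degree + 1),
                        interior,
                        [days[-1]] * (degree + 1)])

spline = make_lsq_spline(days, temps, knots, k=degree)
smooth = spline(days)                              # smoothed curve on the daily grid
rmse = np.sqrt(np.mean((smooth - signal) ** 2))
print(f"number of basis coefficients: {spline.c.size}")
print(f"RMSE against the noiseless curve: {rmse:.3f}")
```

With 365 daily values and only 15 coefficients per curve, the basis expansion acts as a smoother and gives each station a compact functional representation.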

SLIDE 17

Real data example

[Figure: two panels, "Average daily temperatures" and "Average daily temperatures without outliers"; temperatures plotted against day number.]

SLIDE 18

Real data example

[Figure: four panels of observed responses plotted against PC scores: First FPC (82.23%), Second FPC (16.91%), Third FPC (0.53%), Fourth FPC (0.16%).]

SLIDE 19

Real data example

[Figure: three density-function panels with the p-value of the test. (a) Simplified estimator: p-value 0.0017. (b) Imputed estimator: p-value 0.0015. (c) Imputed estimator with NP: p-value 6e-04.]

SLIDE 20

Real data example

[Figure: four panels of projected processes $R_n(u, \hat\Psi_j, \hat\Phi_1)$ plotted against $\langle \mathcal{X}, \hat\Psi_j \rangle$, each titled "X FPC j, Y FPC 1". (a) Projected processes 1st FPC. (b) Projected processes 2nd FPC. (c) Projected processes 3rd FPC. (d) Projected processes 4th FPC.]

SLIDE 21

Conclusions

Conclusions:

◮ Importantly: The testing procedure remains valid for any functional regression model with scalar response, as it is based on the residuals of the model.
◮ To impute or not to impute?: It is not clear whether imputing missing responses is necessary before testing.
◮ Currently:
  ⋆ Allowing cutoff selection for the imputed estimator.
  ⋆ Considering alternative imputation methods.
  ⋆ Extending the analysis to the procedure in Cuesta-Albertos et al. (2019, AoS).

SLIDE 22

References

Febrero-Bande, M., Galeano, P. and González-Manteiga, W. (2019). Estimation and prediction for the functional linear model with scalar response with responses missing at random. Computational Statistics and Data Analysis, 131, 91–103.

Cuesta-Albertos, J. A., García-Portugués, E., González-Manteiga, W. and Febrero-Bande, M. (2019). Goodness-of-fit tests for the functional linear model based on randomly projected empirical processes. Annals of Statistics, 47, 439–467.

García-Portugués, E., González-Manteiga, W. and Febrero-Bande, M. (2014). A goodness-of-fit test for the functional linear model with scalar response. Journal of Computational and Graphical Statistics, 23, 761–778.

Ling, N., Liang, L. and Vieu, P. (2015). Nonparametric regression estimation for functional stationary ergodic data with missing at random. Journal of Statistical Planning and Inference, 162, 75–87.
