Semiparametric models with functional responses in a model assisted survey sampling setting



SLIDE 1

Semiparametric models with functional responses in a model assisted survey sampling setting

Hervé Cardot (1), Alain Dessertaine (2), and Etienne Josserand (1)

(1) Institut de Mathématiques de Bourgogne, UMR 5584 CNRS
herve.cardot@u-bourgogne.fr, etienne.josserand@u-bourgogne.fr

(2) EDF, R&D, ICAME - SOAD
alain.dessertaine@edf.fr

Computational Statistics - Paris - August 23rd 2010

SLIDE 2

Outline

  • Introduction
  • Survey sampling and curve estimation
  • Estimation with auxiliary information
  • Application to electricity consumption curves

SLIDE 3

Sampling survey on curves

A new topic in statistics, at the boundary between functional data analysis and survey sampling theory. The EDF problem:

◮ EDF does not know what its clients consume at each point in time!
◮ EDF plans to install electricity meters that will be able to send individual electricity consumption at very fine time scales.
◮ Collecting, storing and analysing all this information would be very expensive (≈ 30 million electricity meters).
◮ How can the mean consumption curve be estimated as precisely as possible, for France or for a part of it (a particular region, a type of client, ...)?

SLIDE 4

Consumption curves

A sample of individual electricity consumption curves measured every half hour during one week.

[Figure: individual electricity consumption curves. x-axis: Hours; y-axis: Electricity consumption.]

SLIDE 5

Survey sampling in large databases of functional data

Chiky 2009 (thesis, ENST): survey sampling procedures on the sensors, which allow a trade-off between limited storage capacity and accuracy of the data, can be relevant alternatives to signal compression for obtaining accurate approximations of simple estimates such as mean or total trajectories.
SLIDE 6

Sampling design and mean curve estimation

Consider a population $U = \{1, \ldots, k, \ldots, N\}$ of finite size $N$. To each individual (statistical unit) $k$ of the population $U$, we associate a deterministic curve $Y_k = (Y_k(t))_{t \in [0,T]} \in C[0,T]$.

Let $\mu \in C[0,T]$ be the mean of the $Y_k$ in the population:

$$\mu(t) = \frac{1}{N} \sum_{k \in U} Y_k(t), \quad t \in [0,T].$$

A sample $s$, i.e. a subset $s \subset U$ of known size $n$, is drawn according to $p$, a probability law on the set of subsets of $U$, with

◮ $\pi_k = \Pr(k \in s) > 0$ for all $k \in U$,
◮ $\pi_{kl} = \Pr(k \,\&\, l \in s) > 0$ for all $k, l \in U$, $k \neq l$.

The Horvitz-Thompson estimator of the mean curve is

$$\hat{\mu}(t) = \frac{1}{N} \sum_{k \in s} \frac{Y_k(t)}{\pi_k} = \frac{1}{N} \sum_{k \in U} \frac{Y_k(t)}{\pi_k} \mathbb{1}_{k \in s}, \quad t \in [0,T].$$
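The Horvitz-Thompson formula above can be sketched numerically once the curves are discretised on a time grid. The following is a minimal illustration, not the authors' implementation: the function name `horvitz_thompson_mean`, the synthetic population `Y` and the grid size `p` are all assumptions made for the example.

```python
import numpy as np

def horvitz_thompson_mean(Y_sample, pi_sample, N):
    """Horvitz-Thompson estimate of the mean curve on a time grid.

    Y_sample  : (n, p) array, sampled curves at p time points
    pi_sample : (n,) array, inclusion probabilities pi_k of the sampled units
    N         : population size
    """
    Y_sample = np.asarray(Y_sample, dtype=float)
    pi_sample = np.asarray(pi_sample, dtype=float)
    # (1/N) * sum over the sample of Y_k(t) / pi_k
    return (Y_sample / pi_sample[:, None]).sum(axis=0) / N

# Synthetic population (assumption): N curves observed at p time points.
rng = np.random.default_rng(0)
N, n, p = 1000, 100, 48
t = np.linspace(0.0, 1.0, p)
Y = 5.0 + rng.gamma(2.0, 1.0, size=(N, 1)) * (1.0 + np.sin(2 * np.pi * t))

# Simple random sampling without replacement: pi_k = n / N for every unit.
s = rng.choice(N, size=n, replace=False)
pi = np.full(n, n / N)
mu_hat = horvitz_thompson_mean(Y[s], pi, N)
```

With equal inclusion probabilities $n/N$ the weights cancel and `mu_hat` is simply the sample mean curve; with a census ($\pi_k = 1$ for all $k$) the function returns the population mean $\mu$.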

SLIDE 7

Two classical sampling designs

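  • Simple random sampling without replacement with size $n$

◮ $\pi_k = \frac{n}{N}$ for all $k \in U$
◮ $\pi_{kl} = \frac{n(n-1)}{N(N-1)}$ for all $k, l \in U$, $k \neq l$

We recover the usual mean estimator

$$\hat{\mu}(t) = \frac{1}{n} \sum_{k \in s} Y_k(t).$$

  • Stratified sampling with size $n$

The population $U$ is partitioned into $H$ strata, $\bigcup_{h=1}^{H} U_h = U$, of sizes $N_h$, and a sample $s_h$ of size $n_h$ is drawn in each stratum $U_h$.

◮ $\pi_k = \frac{n_h}{N_h}$ for all $k \in U_h$
◮ $\pi_{kl} = \frac{n_h(n_h-1)}{N_h(N_h-1)}$ for all $k, l \in U_h$, $k \neq l$
◮ $\pi_{kl} = \frac{n_h n_\ell}{N_h N_\ell}$ for all $k \in U_h$, $l \in U_\ell$, $h \neq \ell$

So

$$\hat{\mu}(t) = \frac{1}{N} \sum_{h=1}^{H} N_h \frac{1}{n_h} \sum_{k \in s_h} Y_k(t) = \frac{1}{N} \sum_{h=1}^{H} N_h \hat{\mu}_h(t).$$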
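The stratified estimator is a weighted average of within-stratum sample means, which the sketch below makes explicit. It assumes curves discretised on a grid and simple random sampling without replacement inside each stratum; the helper `stratified_mean`, the stratum labels and the allocation `{0: 30, 1: 20, 2: 10}` are hypothetical choices for the example.

```python
import numpy as np

def stratified_mean(Y, strata, n_h, rng):
    """Stratified estimate (1/N) * sum_h N_h * mu_hat_h(t).

    Y      : (N, p) array of population curves on a time grid
    strata : (N,) array of stratum labels
    n_h    : dict mapping stratum label -> sample size in that stratum
    """
    N = Y.shape[0]
    mu_hat = np.zeros(Y.shape[1])
    for h, size in n_h.items():
        units = np.flatnonzero(strata == h)            # indices of U_h
        s_h = rng.choice(units, size=size, replace=False)
        mu_hat += units.size * Y[s_h].mean(axis=0)     # N_h * mu_hat_h(t)
    return mu_hat / N

# Synthetic population (assumption): three strata with different levels.
rng = np.random.default_rng(1)
N, p = 600, 24
strata = np.repeat([0, 1, 2], [300, 200, 100])
Y = (1.0 + strata)[:, None] * np.ones(p) + rng.normal(size=(N, p))

mu_hat = stratified_mean(Y, strata, {0: 30, 1: 20, 2: 10}, rng)
```

Sampling every unit of every stratum ($n_h = N_h$) returns the population mean exactly, which is a convenient sanity check on the weights $N_h / N$.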

SLIDE 8

Utilization of auxiliary information

Taking into account the information given by $m$ auxiliary variables

◮ meteorological: temperature, cloud cover, ...
◮ geographical: altitude, longitude, latitude, ...
◮ behavioral: past mean consumption, ...

could improve the accuracy of the estimator of the mean curve. This requires modeling the behavior of the individual electricity meters that are not in the sample:

$$Y_k(t) = \mu(t) + f(x_{k1}, \ldots, x_{km}, t) + \text{error}$$

⊲ There is little hope of directly obtaining an accurate and flexible estimator of the function $f$, which depends on time $t$ and on the covariates $X_1, \ldots, X_m$.

  • Reducing the dimension of the data seems to be a promising approach.
SLIDE 9

Dimension reduction in finite population

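⊲ The best linear approximation, in the sense of quadratic error, of the functions $Y_k$ in a functional space of fixed dimension $q$, $q < N$, spanned by $q$ orthonormal functions $\phi_1, \ldots, \phi_q$:

$$Y_k(t) = \mu(t) + \sum_{j=1}^{q} \langle Y_k - \mu, \phi_j \rangle \phi_j(t) + R_{qk}(t)$$

The mean remainder, measured in the $L^2[0,T]$ norm, satisfies

$$\frac{1}{N} \sum_{k \in U} \| R_{qk} \|^2 = \frac{1}{N} \sum_{k \in U} \| Y_k - \mu \|^2 - \sum_{j=1}^{q} \langle \Gamma \phi_j, \phi_j \rangle$$

where the covariance operator $\Gamma$ is associated with the covariance function

$$\gamma(s,t) = \frac{1}{N} \sum_{k \in U} (Y_k(t) - \mu(t)) (Y_k(s) - \mu(s)),$$

and where, for all $f \in L^2[0,T]$,

$$\Gamma f(s) = \int_0^T \gamma(s,t) f(t) \, dt, \quad s \in [0,T].$$

Minimizing the mean remainder $\frac{1}{N} \sum_{k \in U} \| R_{qk} \|^2$ with respect to $\phi_1, \ldots, \phi_q$ amounts to finding the eigenfunctions of $\Gamma$.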
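The remainder identity above can be checked numerically on discretised curves. The sketch below is an illustration under stated assumptions: the synthetic population, the midpoint grid with step `dt`, and the approximation of the $L^2[0,T]$ inner product by a Riemann sum are choices made for the example, not part of the slides.

```python
import numpy as np

# Synthetic population of N curves on a grid of p points over [0, T] (assumption).
rng = np.random.default_rng(2)
N, p, q, T = 500, 60, 2, 1.0
dt = T / p
t = (np.arange(p) + 0.5) * dt
Y = (rng.normal(size=(N, 1)) * 3.0 * np.sin(2 * np.pi * t)
     + rng.normal(size=(N, 1)) * 1.5 * np.cos(2 * np.pi * t)
     + rng.normal(size=(N, p)) * 0.3)

mu = Y.mean(axis=0)
Yc = Y - mu                                   # centred curves Y_k - mu
G = Yc.T @ Yc / N                             # gamma(s, t) on the grid

# Discretised operator Gamma: f -> sum_t gamma(s, t) f(t) dt
lam, U = np.linalg.eigh(G * dt)
order = np.argsort(lam)[::-1]                 # sort eigenvalues in decreasing order
lam = lam[order]
V = U[:, order] / np.sqrt(dt)                 # rescale so that sum(v_j**2) * dt = 1

scores = Yc @ V[:, :q] * dt                   # <Y_k - mu, v_j>
R = Yc - scores @ V[:, :q].T                  # remainders R_qk
lhs = (R ** 2).sum(axis=1).mean() * dt        # (1/N) sum_k ||R_qk||^2
rhs = (Yc ** 2).sum(axis=1).mean() * dt - lam[:q].sum()
```

Here `lhs` and `rhs` agree up to floating-point error, because $\langle \Gamma v_j, v_j \rangle = \lambda_j$ for a normalised eigenfunction.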

SLIDE 10

Model on principal components

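Property. The remainder is minimal for $\phi_1 = v_1, \ldots, \phi_q = v_q$, where

$$\Gamma v_j(t) = \lambda_j v_j(t), \quad t \in [0,T],$$

the functions $v_j$ form an orthonormal system in $L^2[0,T]$, and the eigenvalues are sorted, $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_N \geq 0$.

  • Obtaining estimates of the individual (real-valued) variations on the principal components,

$$\langle Y_k - \mu, v_j \rangle \approx g_j(x_{k1}, \ldots, x_{km}),$$

allows the application of model-assisted techniques to build an estimator of $\mu$:

$$\hat{\mu}_x(t) = \hat{\mu}(t) - \frac{1}{N} \left( \sum_{k \in s} \frac{\hat{Y}_k(t)}{\pi_k} - \sum_{k \in U} \hat{Y}_k(t) \right)$$

where

$$\hat{Y}_k(t) = \hat{\mu}(t) + \sum_{j=1}^{q} \hat{g}_j(x_{k1}, \ldots, x_{km}) \, \hat{v}_j(t).$$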

SLIDE 11

An illustration of EDF consumption curves

[Figure, four panels: (a) Mean consumption (x-axis: Time; y-axis: Consumption). (b) Explained variance by principal component. (c) First eigenfunction (x-axis: Time; y-axis: Consumption). (d) First principal components against weekly consumption.]

SLIDE 12

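Estimation error of $\mu$: $\| \hat{\mu} - \mu \|$

[Figure: boxplots of the estimation error for the SRSWR, OPTIM and MA1 estimators.]

The model (MA1) considered is very simple:

$$\hat{Y}_k(t) = \hat{\mu}(t) + (\hat{\beta}_0 + \hat{\beta}_1 X_k) \, \hat{v}_1(t)$$

where $X_k$ is the mean consumption over the previous week.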
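A model-assisted estimator built on a one-component linear model of this kind can be sketched as follows. This is an illustration under explicit assumptions, not the authors' code: the population is synthetic, the sample is drawn by simple random sampling without replacement, the first eigenfunction is estimated from the weighted sample covariance, and the regression of the scores on $X_k$ is fitted by weighted least squares with weights $1/\pi_k$. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, p = 2000, 200, 48
dt = 1.0 / p
t = (np.arange(p) + 0.5) * dt

# Synthetic population (assumption): X_k stands for last week's mean consumption.
X = rng.gamma(4.0, 50.0, size=N)
shape_fn = 1.0 + 0.5 * np.sin(2 * np.pi * t)
Y = (100.0 + (X + rng.normal(scale=20.0, size=N))[:, None] * shape_fn
     + rng.normal(scale=10.0, size=(N, p)))

# SRSWOR sample and Horvitz-Thompson estimate of the mean curve.
s = rng.choice(N, size=n, replace=False)
pi = np.full(n, n / N)
mu_hat = (Y[s] / pi[:, None]).sum(axis=0) / N

# Estimated covariance function and its first eigenfunction v1_hat.
Yc = Y[s] - mu_hat
G_hat = (Yc / pi[:, None]).T @ Yc / N
lam, U = np.linalg.eigh(G_hat * dt)           # eigenvalues in ascending order
v1 = U[:, -1] / np.sqrt(dt)                   # leading eigenfunction, L2-normalised

# Scores <Y_k - mu_hat, v1_hat> on the sample, regressed on X_k.
scores = Yc @ v1 * dt
A = np.column_stack([np.ones(n), X[s]])
w = np.sqrt(1.0 / pi)                         # weighted least squares, weights 1/pi_k
(b0, b1), *_ = np.linalg.lstsq(A * w[:, None], scores * w, rcond=None)

# Predicted curves for every unit of the population.
Y_pred = mu_hat + np.outer(b0 + b1 * X, v1)

# mu_x = mu_hat - (1/N) * (sum_s Y_pred_k / pi_k - sum_U Y_pred_k)
mu_x = mu_hat - ((Y_pred[s] / pi[:, None]).sum(axis=0) - Y_pred.sum(axis=0)) / N
```

Under equal inclusion probabilities the correction reduces to `b1 * (X.mean() - X[s].mean()) * v1`: the Horvitz-Thompson curve is shifted along the first eigenfunction in proportion to the gap between the population and sample means of the auxiliary variable.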

SLIDE 13

Comparison of the variances $\gamma(t,t)$ of the estimators $\hat{\mu}$

[Figure: empirical variance over time for the SRSWR, OPTIM and MA1 estimators.]

Problem: lack of an explicit formula for variance estimation

  • A candidate asymptotic formula (when $n, N \to \infty$)
  • Is a corrected variance needed, one that depends on the variances of the eigenvectors (perturbations)?

  • ...
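When no explicit variance formula is available, a pointwise empirical variance can be obtained by repeating the sampling many times. The slides do not state how their empirical variances were computed, so the sketch below is only an assumption about one common approach. It uses the plain sample mean under simple random sampling without replacement, where the design variance $(1 - n/N) S^2(t) / n$ is known and can serve as a reference; the population and the number of replications `B` are synthetic choices.

```python
import numpy as np

rng = np.random.default_rng(4)
N, n, p, B = 500, 50, 24, 2000
t = np.linspace(0.0, 1.0, p)

# Synthetic population of curves on a time grid (assumption).
Y = (rng.gamma(2.0, 100.0, size=(N, 1)) * (1.0 + 0.5 * np.sin(2 * np.pi * t))
     + rng.normal(scale=20.0, size=(N, p)))

# B independent SRSWOR samples, one mean-curve estimate per replication.
estimates = np.empty((B, p))
for b in range(B):
    s = rng.choice(N, size=n, replace=False)
    estimates[b] = Y[s].mean(axis=0)
emp_var = estimates.var(axis=0, ddof=1)       # empirical variance at each time point

# Known design variance of the sample mean under SRSWOR.
S2 = Y.var(axis=0, ddof=1)                    # population variance S^2(t)
theo_var = (1.0 - n / N) * S2 / n
ratio = emp_var / theo_var                    # should be close to 1 at every t
```

The same replication loop applies to any estimator, including a model-assisted one, by replacing the line that computes `estimates[b]`; only the reference formula `theo_var` is specific to the simple design.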