SLIDE 1

Fast and efficient estimation of the geometric median in high dimension: a stochastic gradient approach

Hervé Cardot

Institut de Mathématiques de Bourgogne, Université de Bourgogne, with Peggy Cénac (Univ. Bourgogne) and Mohamed Chaouch (EDF). Herve.Cardot@u-bourgogne.fr

Compstat - August 2010 - Paris

SLIDE 2

The median in R

A "central" notion in statistics since Laplace. For a real-valued random variable X, "the" median (it may not be unique) is a value m such that P(X ≤ m) = 0.5.

Another characterization of the median m:

E(sign(X − m)) = ∫ sign(X(ω) − m) dP(ω) = 0.

Since sign(X − m) = (X − m)/|X − m|, we also have the following characterization:

m = arg min_{z ∈ R} E|X − z|.

  • The quantile of order α, for α ∈ ]0, 1[, is defined by P(X ≤ q_α) = α. Equivalently,

q_α = arg min_{z ∈ R} E[|X − z| + (2α − 1)(X − z)].
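As a quick numerical sanity check (not part of the original slides), the empirical version of this last criterion can be minimized over a grid and compared with the empirical quantile; the Gaussian sample, sample size and grid are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
alpha = 0.25

# Empirical version of the criterion E[|X - z| + (2*alpha - 1)(X - z)].
zs = np.linspace(-2.0, 2.0, 2001)
losses = np.array([np.mean(np.abs(x - z) + (2 * alpha - 1) * (x - z)) for z in zs])
z_star = zs[losses.argmin()]

# The grid minimizer agrees with the empirical alpha-quantile.
print(z_star, np.quantile(x, alpha))
```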

SLIDE 3

The geometric median in R^d (or in a separable Hilbert space H)

In the Euclidean space R^d equipped with its usual norm ‖·‖, a natural generalization of the median is

m := arg min_{z ∈ H} E‖X − z‖,

called the geometric or spatial median (Haldane, 1948).

Property (Kemperman, 1987). If the space H is strictly convex, the geometric median m is unique, unless the support of X is contained in a one-dimensional subspace.

  • Examples of strictly convex spaces:
  • Euclidean spaces R^d, when d > 1,
  • separable Hilbert spaces H,
  • some Banach spaces (L^p, 1 < p < ∞).
  • Support condition:

∃(u1, u2) ∈ H × H, ⟨u1, u2⟩ = 0, Var(⟨u1, X⟩) > 0 and Var(⟨u2, X⟩) > 0.
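To make the definition concrete, here is a brute-force sketch (purely illustrative, not from the slides): the criterion z ↦ E‖X − z‖ is evaluated on a grid for a skewed two-dimensional sample with independent exponential margins, a case where the geometric median differs from the mean:

```python
import numpy as np

rng = np.random.default_rng(8)
# Skewed 2-d sample: independent exp(1) margins (mean (1, 1)).
X = rng.exponential(size=(20_000, 2))

# Brute-force the empirical criterion z -> mean ||X_i - z|| over a grid.
grid = np.linspace(0.0, 2.0, 81)
best, best_val = None, np.inf
for a in grid:
    for b in grid:
        val = np.mean(np.linalg.norm(X - np.array([a, b]), axis=1))
        if val < best_val:
            best, best_val = np.array([a, b]), val

print(best, best_val)   # the approximate geometric median and its criterion value
```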

SLIDE 4

Characterization of the geometric median

We suppose there are no atoms (∀x ∈ H, P(X = x) = 0). Then G : H → R defined by G(x) = E‖X − x‖ is strictly convex and Fréchet differentiable, with

Φ(x) := ∇G_x = −E[ (X − x)/‖X − x‖ ].

The median m is characterized by ∇G_m = 0.

G has a second-order Fréchet derivative at the point m, Γ_m : H → H,

Γ_m := E[ (1/‖X − m‖) ( I_H − (X − m) ⊗ (X − m)/‖X − m‖² ) ],

where I_H is the identity operator and u ⊗ v = ⟨u, ·⟩ v, for (u, v) ∈ H². If E‖X − m‖⁻¹ < ∞, then Γ_m is a bounded and strictly positive operator. There are constants ∞ > E‖X − m‖⁻¹ = λ_M > λ_m > 0 such that

λ_M ‖u‖² ≥ ⟨Γ_m u, u⟩ ≥ λ_m ‖u‖², ∀u ∈ H.
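As an illustration (not from the slides), the gradient formula Φ(x) = −E[(X − x)/‖X − x‖] can be checked against central finite differences of the empirical G; the sample and the evaluation point are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50_000, 3))        # sample from an atomless distribution
x = np.array([0.5, -0.3, 0.2])          # arbitrary evaluation point

def G(z):
    """Empirical version of G(z) = E ||X - z||."""
    return np.mean(np.linalg.norm(X - z, axis=1))

def Phi(z):
    """Empirical gradient: -E[(X - z) / ||X - z||]."""
    diff = X - z
    return -np.mean(diff / np.linalg.norm(diff, axis=1, keepdims=True), axis=0)

# Central finite differences of G should match Phi coordinate by coordinate.
h = 1e-5
fd = np.array([(G(x + h * e) - G(x - h * e)) / (2 * h) for e in np.eye(3)])
print(fd, Phi(x))
```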

SLIDE 5

Robustness: the influence function

Consider a distribution P0 contaminated by a point mass at z ∈ H, P_{ε,z} = (1 − ε) P0 + ε δ_z. The influence function

IF_m(z) = lim_{ε→0} ( m(P_{ε,z}) − m(P0) ) / ε

is a measure of the sensitivity of the median m to a small perturbation of the target distribution.

Property.

IF_m(z) = Γ_m⁻¹ (z − m)/‖z − m‖,

and the gross error sensitivity is bounded as follows:

sup{ ‖IF_m(z)‖, z ∈ H } ≤ 1/λ_m.

  • The gross error sensitivity is not bounded for the mean.
SLIDE 6

Estimation in R^d

Suppose we have a sample of n independent realizations X1, . . . , Xn. The usual estimator m̂ of m (Gower, 1974; Vardi & Zhang, 2000; Gervini, 2008) is characterized by

Σ_{i=1}^n (Xi − m̂)/‖Xi − m̂‖ = 0.

Iterative and rather long procedures (Newton-Raphson or Weiszfeld) are needed to get a numerical solution:

Σ_{i=1}^n (Xi − m̂)/‖Xi − m̂‖ = 0  ⇒  m̂_{e+1} = Σ_{i=1}^n p_i(m̂_e) Xi.

Property (Haberman, 1989; Niemiro, 1992). If H = R^d, when n → +∞,

√n (m̂_n − m) → N(0, Γ_m⁻¹ Var(S(X − m)) Γ_m⁻¹) in distribution,

with S(u) = u/‖u‖, u ∈ R^d.
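A minimal sketch of the Weiszfeld fixed-point iteration (the normalized weights p_i ∝ 1/‖Xi − m‖, the zero-distance guard, the starting point and the iteration count are implementation choices, not prescribed by the slides):

```python
import numpy as np

def weiszfeld(X, n_iter=100, eps=1e-12):
    """Weiszfeld iteration: m_{e+1} = sum_i p_i(m_e) X_i,
    with weights p_i(m) proportional to 1 / ||X_i - m||."""
    m = X.mean(axis=0)                      # start from the sample mean
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(X - m, axis=1), eps)
        p = (1.0 / d) / (1.0 / d).sum()     # weights summing to one
        m = p @ X
    return m

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 2))
m_hat = weiszfeld(X)

# At the solution, the estimating equation sum_i (X_i - m)/||X_i - m|| = 0 holds.
score = ((X - m_hat) / np.linalg.norm(X - m_hat, axis=1, keepdims=True)).sum(axis=0)
print(m_hat, np.linalg.norm(score))

# A single gross outlier barely moves the estimate (unlike the mean).
X_out = np.vstack([X, [[1e6, 1e6]]])
print(np.linalg.norm(weiszfeld(X_out) - m_hat))
```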

SLIDE 7

A very simple and very fast algorithm

We consider the algorithm

m_{n+1} = m_n + γ_n (X_{n+1} − m_n)/‖X_{n+1} − m_n‖

and suppose the steps γ_n satisfy

∀n, γ_n > 0,   Σ_{n≥1} γ_n = ∞   and   Σ_{n≥1} γ_n² < ∞.

Advantages

  • For a sample of n realizations in R^d: O(nd) operations.
  • No need to store all the data (data streams).
  • Automatic update (online estimation).
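A sketch of the resulting online estimator (the step choice γ_n = g/n^{3/4}, the starting point and the simulated Gaussian stream are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, g = 2, 100_000, 1.0

m = np.full(d, 3.0)                    # arbitrary (poor) starting point
for i in range(1, n + 1):
    x = rng.normal(size=d)             # next observation from the stream
    gamma = g / i ** 0.75              # sum(gamma) = inf, sum(gamma^2) < inf
    diff = x - m
    m = m + gamma * diff / np.linalg.norm(diff)

print(m)   # drifts toward the true geometric median, here the origin
```

Each update costs O(d) and no past observation is stored, which is the point of the advantages listed above.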
SLIDE 8

A Robbins-Monro (1951) algorithm

This algorithm (stochastic gradient) can also be written

m_{n+1} = m_n − γ_n ( Φ(m_n) + ζ_{n+1} ),   with   ζ_{n+1} = −(X_{n+1} − m_n)/‖X_{n+1} − m_n‖ − Φ(m_n).

  • If the Xn are i.i.d., the sequence (ζ_{n+1}) is a martingale difference:

E(ζ_{n+1} | F_n) = 0 with F_n = σ(X_0, . . . , X_n). Moreover,

E( ‖ζ_{n+1}‖² | F_n ) ≤ 4.
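These two properties can be illustrated empirically (a sketch, not from the slides): replacing Φ(m_n) by its empirical counterpart over a large sample, the simulated ζ terms average to zero by construction, and ‖ζ‖ ≤ 2 since ζ is a difference of two vectors of norm at most one, whence the bound 4 on the second moment:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100_000, 3))
m = np.array([0.3, -0.1, 0.2])         # the current iterate m_n (arbitrary)

# Empirical counterpart of Phi(m) = -E[(X - m)/||X - m||].
diff = X - m
u = diff / np.linalg.norm(diff, axis=1, keepdims=True)
phi = -u.mean(axis=0)

# zeta = -(X - m)/||X - m|| - Phi(m): mean zero, norm at most 2.
zeta = -u - phi
print(np.abs(zeta.mean(axis=0)).max(), (np.linalg.norm(zeta, axis=1) ** 2).mean())
```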
SLIDE 9

A remark on geometric quantile estimation

This approach can be extended directly to get stochastic approximations to geometric quantiles (Chaudhuri, 1996). Consider a vector u ∈ H such that ‖u‖ < 1. The geometric quantile of X corresponding to the direction u, say m^u, is defined, uniquely under the previous assumptions, by

m^u = arg min_{Q ∈ H} E( ‖X − Q‖ + ⟨X − Q, u⟩ ).

It is characterized by Φ_u(m^u) = Φ(m^u) − u = 0. This leads to the following stochastic approximation:

m^u_{n+1} = m^u_n + γ_n ( (X_{n+1} − m^u_n)/‖X_{n+1} − m^u_n‖ + u ).
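A sketch of this recursion on a bivariate Gaussian stream (the direction u, the step sequence and the averaging of the iterates are illustrative choices, not prescribed by the slide):

```python
import numpy as np

rng = np.random.default_rng(6)
u = np.array([0.5, 0.0])               # direction u, with ||u|| < 1
n, g = 200_000, 1.0

m_u = np.zeros(2)
m_bar = np.zeros(2)
for i in range(1, n + 1):
    x = rng.normal(size=2)
    gamma = g / i ** 0.75
    diff = x - m_u
    m_u = m_u + gamma * (diff / np.linalg.norm(diff) + u)
    m_bar = m_bar + (m_u - m_bar) / i  # running average of the iterates

print(m_bar)   # shifted away from the median (the origin) along u
```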
SLIDE 10

A convergence result in Hilbert spaces

Result (Cardot, Cénac, Zitt, 2010). The sequence m_n converges almost surely as n tends to infinity: ‖m_n − m‖ → 0 a.s.

  • Sketch of the proof (classical)

Show that for all ε ∈ ]0, 1[, P(Ω_ε) = 0, with

Ω_ε = { ω ∈ Ω : ∃ n_ε(ω) ≥ 1, ∀n ≥ n_ε(ω), ε < V_n(ω) < ε⁻¹ },

considering that

lim_{n→∞} E V_{n+1} = E V_0 + lim_{n→∞} ( Σ_{j=1}^n γ_j² + 2 Σ_{j=1}^n γ_j E⟨Φ(m_j), m − m_j⟩ ) ≤ C,

where ⟨Φ(m_j), m − m_j⟩ < −λ_ε ε on Ω_ε.

Property. For all r > 0, the function G restricted to B(m, r) is strongly convex: ∃ λ_r > 0 such that, ∀x ∈ B(m, r),

⟨Φ(x), x − m⟩ ≥ G(x) − G(m) ≥ λ_r ‖x − m‖².

SLIDE 11

Does it really work?

A sample with Gaussian distribution, with mean (0, 0) and covariance matrix

( 10  3
   3  2 ).

[Figure: scatter plot of the sample, X1 against X2.]

SLIDE 12

Not really, even for simple examples!

m_{n+1} = m_n + (g/n^{3/4}) (X_{n+1} − m_n)/‖X_{n+1} − m_n‖

[Figure: MSE over the first 10000 iterations for the Robbins-Monro (RM) and averaged (AV) estimators, with g = 1 and g = 10.]

SLIDE 13

Averaging: a magic formula

Consider now the mean of all past iterations,

m̄_n = (1/n) Σ_{j=1}^n m_j,

computed online together with the stochastic gradient iterate:

m_{n+1} = m_n + γ_n (X_{n+1} − m_n)/‖X_{n+1} − m_n‖,
m̄_{n+1} = m̄_n + (m_{n+1} − m̄_n)/(n + 1).

Property (in R^d). If γ_n = g/n^α, 0.5 < α < 1, then

√n (m̄_n − m) → N(0, ∆) in distribution,

where ∆ is the same variance matrix as for the usual estimator m̂_n: ∆ = Γ_m⁻¹ Var(S(X − m)) Γ_m⁻¹, with S(u) = u/‖u‖, u ∈ R^d.

Proof: as in Polyak & Juditsky (1992).
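A sketch of the coupled recursions on the Gaussian example of the previous slides (the starting point and the values of g and α are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)

# Gaussian stream with mean (0, 0) and covariance [[10, 3], [3, 2]]:
# by symmetry, the true geometric median is the origin.
L = np.linalg.cholesky(np.array([[10.0, 3.0], [3.0, 2.0]]))
n, g, alpha = 10_000, 10.0, 0.75

m = np.array([5.0, 5.0])               # deliberately poor starting point
m_bar = m.copy()
for i in range(1, n + 1):
    x = L @ rng.normal(size=2)
    diff = x - m
    m = m + (g / i ** alpha) * diff / np.linalg.norm(diff)
    m_bar = m_bar + (m - m_bar) / (i + 1)   # online averaging

print(np.linalg.norm(m), np.linalg.norm(m_bar))
```

The averaged iterate m̄_n is typically much less sensitive to the tuning constant g than the raw Robbins-Monro iterate.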

SLIDE 14

200 samples with size n = 2000.

[Figure: results for g = 0.1, 0.5, 1, 2, 5, 10, on a scale from 0.00 to 0.25; panels labelled "Mean" and "Averaging".]

SLIDE 15

Example: electricity consumption curves

  • EDF (Électricité de France) has planned to install communicating meters (30 million). Survey sampling techniques will be used to select 300 000 meters which will provide individual electricity consumption at fine time scales.

  • A first test on a population of N = 18900 meters giving electricity consumption every 30 minutes.

[Figure: individual electricity consumption curves.]

SLIDE 16

Example: median trajectory

[Figure: electricity consumption over seven days; mean curve, pointwise median, and median curve.]

SLIDE 17

Perspectives

  • Averaging in Hilbert spaces H: still no results on nonlinear algorithms in the literature.

  • Discretized noisy trajectories:

Z_n = ( X_n(t^n_1) + ε_{1n}, . . . , X_n(t^n_{p_n}) + ε_{p_n n} ).

  • Covariates: conditional geometric median,

m(X | Z = z) = β_0 + β_1 z,

where Z is for example the mean consumption of the week before. We look for

min_{(β_0, β_1) ∈ H × H} E‖X − (β_0 + β_1 Z)‖.

  • (Robust) clustering with medians based on ‖·‖ (k-median).