Fast and efficient estimation of the geometric median in high dimension: a stochastic gradient approach. Hervé Cardot, Institut de Mathématiques de Bourgogne, Université de Bourgogne, with Peggy Cénac (Univ. Bourgogne) and Mohamed Chaouch (EDF). Herve.Cardot@u-bourgogne.fr. Compstat, August 2010, Paris.

The median in R

A "central" notion in statistics since Laplace. For a random variable taking values in R, "the" median (it may not be unique) is a value m such that P(X ≤ m) = 0.5. Another characterization of the median m:

E(sign(X − m)) = ∫ sign(X(ω) − m) dP(ω) = 0.

Since sign(X − m) = (X − m)/|X − m|, we also have the following characterization:

m = arg min_{z ∈ R} E|X − z|.

• The quantile of order α, for α ∈ ]0, 1[, is defined by P(X ≤ q_α) = α. Equivalently,

q_α = arg min_{z ∈ R} E[ |X − z| + (2α − 1)(X − z) ].

The geometric median in R^d (or in a separable Hilbert space H)

In the Euclidean space R^d equipped with its usual norm ‖·‖, a natural generalization of the median is

m := arg min_{z ∈ H} E‖X − z‖,

called the geometric or spatial median (Haldane, 1948).

Property (Kemperman, 1987). If the space H is strictly convex, the geometric median m is unique, unless the support of X is contained in a one-dimensional subspace.

• Examples of strictly convex spaces:
- Euclidean spaces R^d, when d > 1,
- separable Hilbert spaces H,
- some Banach spaces (L^p, 1 < p < ∞).

• Support condition: ∃ (u_1, u_2) ∈ H × H with ⟨u_1, u_2⟩ = 0, Var(⟨u_1, X⟩) > 0 and Var(⟨u_2, X⟩) > 0.

Characterization of the geometric median

We suppose there are no atoms (∀x ∈ H, P(X = x) = 0). Then G : H → R defined by G(x) = E‖X − x‖ is strictly convex and Fréchet differentiable, with

Φ(x) := ∇G_x = −E[ (X − x)/‖X − x‖ ].

The median m is characterized by ∇G_m = 0. G has a second-order Fréchet derivative at the point m, Γ_m : H → H,

Γ_m := E[ (1/‖X − m‖) ( I_H − (X − m) ⊗ (X − m)/‖X − m‖² ) ],

where I_H is the identity operator and u ⊗ v = ⟨u, ·⟩ v, for (u, v) ∈ H². If E‖X − m‖^{−1} < ∞, then Γ_m is a bounded and strictly positive operator: there are constants ∞ > λ_M > λ_m > 0, with λ_M = E‖X − m‖^{−1}, such that

λ_M ‖u‖² ≥ ⟨Γ_m u, u⟩ ≥ λ_m ‖u‖², ∀u ∈ H.

Robustness: the influence function

Consider a distribution P_0 contaminated by a point-mass distribution at z ∈ H,

P_{ε,z} = (1 − ε) P_0 + ε δ_z.

The influence function

IF_m(z) = lim_{ε → 0} ( m(P_{ε,z}) − m(P_0) ) / ε

is a measure of the sensitivity of the median m to a small perturbation of the target distribution.

Property.

IF_m(z) = Γ_m^{−1} (z − m)/‖z − m‖,

and the gross error sensitivity is bounded as follows:

sup { ‖IF_m(z)‖ : z ∈ H } ≤ 1/λ_m.

• The gross error sensitivity is not bounded for the mean.

Estimation in R^d

Suppose we have a sample of n independent realizations, X_1, ..., X_n. The usual estimator m̂ of m (Gower, 1974; Vardi & Zhang, 2000; Gervini, 2008) is characterized by

Σ_{i=1}^n (X_i − m̂)/‖X_i − m̂‖ = 0.

Iterative and rather lengthy procedures (Newton-Raphson or Weiszfeld) are needed to get a numerical solution:

Σ_{i=1}^n (X_i − m̂)/‖X_i − m̂‖ = 0 ⇒ m̂_{e+1} = Σ_{i=1}^n p_i(m̂_e) X_i.

Property (Haberman, 1989; Niemiro, 1992). If H = R^d, when n → +∞,

√n (m̂_n − m) ⇝ N(0, Γ_m^{−1} Var(S(X − m)) Γ_m^{−1}),

with S(u) = u/‖u‖, u ∈ R^d.
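The Weiszfeld step above computes each new iterate as a weighted average of the observations, with weights p_i(m̂_e) proportional to 1/‖X_i − m̂_e‖. A minimal sketch in Python; the function name, starting point, tolerance, and division guard are illustrative choices, not part of the original references:

```python
import numpy as np

def weiszfeld(X, n_iter=100, tol=1e-8):
    """Weiszfeld fixed-point iteration for the geometric median.

    X: (n, d) array of observations. Defaults are illustrative.
    """
    m = X.mean(axis=0)  # start from the sample mean
    for _ in range(n_iter):
        dist = np.linalg.norm(X - m, axis=1)
        dist = np.maximum(dist, 1e-12)  # guard against division by zero
        w = 1.0 / dist
        # weighted average: weights proportional to 1 / ||X_i - m||
        m_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < tol:
            return m_new
        m = m_new
    return m
```

Note that each pass costs O(nd), so the full batch must be revisited at every iteration, in contrast with the online algorithm presented next.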

A very simple and very fast algorithm

We consider the algorithm

m_{n+1} = m_n + γ_n (X_{n+1} − m_n)/‖X_{n+1} − m_n‖

and suppose the steps γ_n are such that ∀n, γ_n > 0,

Σ_{n ≥ 1} γ_n = ∞ and Σ_{n ≥ 1} γ_n² < ∞.

Advantages
• For a sample of n realizations in R^d: O(nd) operations.
• No need to store all the data (data streams).
• Automatic update (online estimation).
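A minimal Python sketch of this recursive estimator, assuming the data arrive as a stream seen once; the step choice γ_n = n^{−3/4} (which satisfies both step conditions) and the skip on exact ties X_{n+1} = m_n are illustrative:

```python
import numpy as np

def median_sgd(stream, d, gamma=lambda n: 1.0 / n ** 0.75):
    """Online stochastic gradient estimate of the geometric median.

    stream: iterable of d-dimensional observations, each seen once.
    gamma:  step sequence; here gamma_n = 1 / n^(3/4), an illustrative
            choice with sum gamma_n = inf and sum gamma_n^2 < inf.
    """
    m = np.zeros(d)
    for n, x in enumerate(stream, start=1):
        diff = x - m
        norm = np.linalg.norm(diff)
        if norm > 0:  # skip the (probability-zero, no-atom case) tie X_{n+1} = m_n
            m = m + gamma(n) * diff / norm
    return m
```

Only the current iterate m is kept in memory, so the estimate can be updated as new observations arrive.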

A Robbins-Monro (1951) algorithm

This algorithm (stochastic gradient) can also be written

m_{n+1} = m_n − γ_n ( Φ(m_n) + ζ_{n+1} ),

where Φ(m_n) is the gradient and

ζ_{n+1} = −(X_{n+1} − m_n)/‖X_{n+1} − m_n‖ − Φ(m_n).

• If the X_n are i.i.d., the sequence (ζ_{n+1}) is a martingale difference:

E(ζ_{n+1} | F_n) = 0 with F_n = σ(X_0, ..., X_n).

Moreover,

E( ‖ζ_{n+1}‖² | F_n ) ≤ 4.

A remark on geometric quantile estimation

This approach can be extended directly to get stochastic approximations of geometric quantiles (Chaudhuri, 1996). Consider a vector u ∈ H such that ‖u‖ < 1. The geometric quantile of X, say m^u, corresponding to the direction u, is defined (uniquely under the previous assumptions) by

m^u = arg min_{Q ∈ H} E( ‖X − Q‖ + ⟨X − Q, u⟩ ).

It is characterized by Φ_u(m^u) = Φ(m^u) − u = 0. This leads to the stochastic approximation

m̂^u_{n+1} = m̂^u_n + γ_n ( (X_{n+1} − m̂^u_n)/‖X_{n+1} − m̂^u_n‖ + u ).
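The quantile recursion differs from the median one only by the added drift u inside the update. A hedged Python sketch under the same streaming setup (step choice and tie guard illustrative, as before):

```python
import numpy as np

def quantile_sgd(stream, u, gamma=lambda n: 1.0 / n ** 0.75):
    """Stochastic approximation of the geometric quantile in direction u, ||u|| < 1.

    Implements m_{n+1} = m_n + gamma_n ( (X_{n+1} - m_n)/||X_{n+1} - m_n|| + u ).
    The step sequence gamma_n = 1 / n^(3/4) is an illustrative choice.
    """
    u = np.asarray(u, dtype=float)
    m = np.zeros_like(u)
    for n, x in enumerate(stream, start=1):
        diff = x - m
        norm = np.linalg.norm(diff)
        if norm > 0:
            m = m + gamma(n) * (diff / norm + u)
    return m
```

Taking u = 0 recovers the geometric median recursion; a nonzero u shifts the target point away from the center in the direction of u.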

A convergence result in Hilbert spaces

Result (Cardot, Cénac, Zitt, 2010). The sequence m_n converges almost surely when n tends to infinity:

‖m_n − m‖ → 0 a.s.

• Sketch of the proof (classical). Show that for all ε ∈ ]0, 1[, P(Ω_ε) = 0, with

Ω_ε = { ω ∈ Ω : ∃ n_ε(ω) ≥ 1, ∀n ≥ n_ε(ω), ε < V_n(ω) < ε^{−1} },

considering that

lim_{n→∞} E V_{n+1} = E V_0 + lim_{n→∞} ( C Σ_{j=1}^n γ_j² + 2 Σ_{j=1}^n γ_j E⟨Φ(m_j), m − m_j⟩ ),

where ⟨Φ(m_n), m − m_n⟩ < −λ_ε ε on Ω_ε.

Property. For all r > 0, the function G restricted to x ∈ B(m, r) is strongly convex: ∃ λ_r > 0 such that ∀x ∈ B(m, r),

⟨Φ(x), x − m⟩ ≥ G(x) − G(m) ≥ λ_r ‖x − m‖².

Does it really work?

A sample with Gaussian distribution, with mean (0, 0) and variance

( 10  3
   3  2 ).

[Scatterplot of the sample, X1 against X2.]

Not really, even for simple examples!

m_{n+1} = m_n + (g / n^{3/4}) (X_{n+1} − m_n)/‖X_{n+1} − m_n‖

[MSE against the number of iterations for the Robbins-Monro (RM) and averaged (AV) estimators, with g = 1 and g = 10.]

Averaging: a magic formula

Consider now the mean of all past iterations, m̄_n = (1/n) Σ_{j=1}^n m_j:

m_{n+1} = m_n + γ_n (X_{n+1} − m_n)/‖X_{n+1} − m_n‖,
m̄_{n+1} = m̄_n + (m_{n+1} − m̄_n)/(n + 1).

Property (in R^d). If γ_n = g / n^α, 0.5 < α < 1,

√n (m̄_n − m) ⇝ N(0, Δ) in distribution,

where Δ is the same variance matrix as for m̂_n:

Δ = Γ_m^{−1} Var(S(X − m)) Γ_m^{−1} with S(u) = u/‖u‖, u ∈ R^d.

Proof: as in Polyak & Juditsky (1992).
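The averaged (Polyak-Ruppert) recursion adds a single running-average line to the online algorithm. A Python sketch assuming γ_n = g/n^α; the values g = 2 and α = 3/4 are illustrative choices inside the admissible range 0.5 < α < 1:

```python
import numpy as np

def median_asgd(stream, d, g=2.0, alpha=0.75):
    """Averaged stochastic gradient estimate of the geometric median.

    Runs the Robbins-Monro recursion with gamma_n = g / n^alpha and
    returns the running average of all iterates (Polyak-Ruppert averaging).
    g and alpha are illustrative, with 0.5 < alpha < 1.
    """
    m = np.zeros(d)
    m_bar = np.zeros(d)
    for n, x in enumerate(stream, start=1):
        diff = x - m
        norm = np.linalg.norm(diff)
        if norm > 0:
            m = m + (g / n ** alpha) * diff / norm
        m_bar = m_bar + (m - m_bar) / n  # online average of the iterates
    return m_bar
```

The extra line costs O(d) per observation, so the averaged estimator keeps the O(nd) complexity and single-pass, online character of the plain recursion.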

200 samples of size n = 2000.

[Boxplots of the estimation error, without averaging (left) and with averaging (right), for g = 0.1, 0.5, 1, 2, 5, 10, compared with the mean.]

Example: electricity consumption curves

• EDF (Électricité de France) has planned to install 30 million communicating meters. Survey sampling techniques will be used to select 300 000 meters, which will provide individual electricity consumption at fine time scales.
• A first test on a population of N = 18900 meters giving electricity consumption every 30 minutes.

[Sample of individual electricity consumption curves.]

Example: median trajectory

[Electricity consumption over 7 days: mean curve, pointwise median, and median curve.]

Perspectives

• Averaging in Hilbert spaces H: still no results on nonlinear algorithms in the literature.
• Discretized noisy trajectories:

Z_n = ( X_n(t_1^n) + ε_1^n, ..., X_n(t_{p_n}^n) + ε_{p_n}^n ).

• Covariates: conditional geometric median, m(X | Z = z) = β_0 + β_1 z, where Z is for example the mean consumption of the week before. We look for

min_{(β_0, β_1) ∈ H × H} E‖X − (β_0 + β_1 Z)‖.

• (Robust) clustering with medians based on ‖·‖ (k-median).
