Mean estimation: median-of-means tournaments G abor Lugosi ICREA, - PowerPoint PPT Presentation

Mean estimation: median-of-means tournaments G´ abor Lugosi ICREA, Pompeu Fabra University, BGSE based on joint work with Luc Devroye (McGill, Montreal) Matthieu Lerasle (CNRS, Nice) Roberto Imbuzeiro Oliveira (IMPA, Rio) Shahar Mendelson (Technion and ANU)

estimating the mean Given X 1 , . . . , X n , a real i.i.d. sequence, estimate µ = E X 1 .

estimating the mean Given X 1 , . . . , X n , a real i.i.d. sequence, estimate µ = E X 1 . “Obvious” choice: empirical mean n � µ n = 1 X i n i =1

estimating the mean Given X 1 , . . . , X n , a real i.i.d. sequence, estimate µ = E X 1 . “Obvious” choice: empirical mean n � µ n = 1 X i n i =1 By the central limit theorem, if X has a finite variance σ 2 , � √ n | µ n − µ | > σ � � n →∞ P lim 2 log(2 /δ ) ≤ δ . We would like non-asymptotic inequalities of a similar form.

estimating the mean Given X 1 , . . . , X n , a real i.i.d. sequence, estimate µ = E X 1 . “Obvious” choice: empirical mean n � µ n = 1 X i n i =1 By the central limit theorem, if X has a finite variance σ 2 , � √ n | µ n − µ | > σ � � n →∞ P lim 2 log(2 /δ ) ≤ δ . We would like non-asymptotic inequalities of a similar form. If the distribution is sub-Gaussian, E exp( λ ( X − µ )) ≤ exp( σ 2 λ 2 / 2) , then with probability at least 1 − δ , � 2 log(2 /δ ) | µ n − µ | ≤ σ . n

empirical mean–heavy tails The empirical mean is computationally attractive. Requires no a priori knowledge and automatically scales with σ . If the distribution is not sub-Gaussian, we still have Chebyshev’s inequality: w.p. ≥ 1 − δ , � 1 | µ n − µ | ≤ σ n δ . Exponentially weaker bound. Especially hurts when many means are estimated simultaneously. This is the best one can say. Catoni (2012) shows that for each δ there exists a distribution with variance σ such that � c � � | µ n − µ | ≥ σ ≥ δ . P n δ

median of means A simple estimator is median-of-means. Goes back to Nemirovsky, Yudin (1983), Jerrum, Valiant, and Vazirani (1986), Alon, Matias, and Szegedy (2002).   m km � �  1 X t , . . . , 1 def  µ MM � = median X t m m t =1 t =( k − 1) m +1

median of means A simple estimator is median-of-means. Goes back to Nemirovsky, Yudin (1983), Jerrum, Valiant, and Vazirani (1986), Alon, Matias, and Szegedy (2002).   m km � �  1 X t , . . . , 1 def  µ MM � = median X t m m t =1 t =( k − 1) m +1 Lemma Let δ ∈ (0 , 1) , k = 8 log δ − 1 and m = n 8 log δ − 1 . Then with probability at least 1 − δ , � 32 log(1 /δ ) | � µ MM − µ | ≤ σ n

proof � By Chebyshev, each mean is within distance σ 4 / m of µ with probability 3 / 4 . � The probability that the median is not within distance σ 4 / m of µ is at most P { Bin ( k , 1 / 4) > k / 2 } which is exponentially small in k .

median of means • Sub-Gaussian deviations. • Scales automatically with σ . • Parameters depend on required confidence level δ . • See Lerasle and Oliveira (2012), Hsu and Sabato (2013), Minsker (2014) for generalizations. • Also works when the variance is infinite. If � | X − E X | 1+ α � = M for some α ≤ 1 , then, with E probability at least 1 − δ , � � α/ (1+ α ) 8(12 M ) 1 /α ln(1 /δ ) | � µ MM − µ | ≤ n

why sub-Gaussian? Sub-Gaussian bounds are the best one can hope for when the variance is finite. In fact, for any M > 0 , α ∈ (0 , 1] , δ > 2 e − n / 4 , and mean � | X − E X | 1+ α � estimator � µ n , there exists a distribution E = M such that � � α/ (1+ α ) M 1 /α ln(1 /δ ) | � µ n − µ | ≥ . n Proof: The distributions P + (0) = 1 − p , P + ( c ) = p and P − (0) = 1 − p , P − ( − c ) = p are indistinguishable if all n samples are equal to 0 .

why sub-Gaussian? This shows optimality of the median-of-means estimator for all α . It also shows that finite variance is necessary even for rate n − 1 / 2 . One cannot hope to get anything better than sub-Gaussian tails. Catoni proved that sample mean is optimal for the class of Gaussian distributions.

multiple- δ estimators Do there exist estimators that are sub-Gaussian simultaneously for all confidence levels? An estimator is multiple- δ -sub-Gaussian for a class of distributions P and δ min if for all δ ∈ [ δ min , 1) , and all distributions in P , � log(2 /δ ) | � µ n − µ | ≤ L σ . n

multiple- δ estimators Do there exist estimators that are sub-Gaussian simultaneously for all confidence levels? An estimator is multiple- δ -sub-Gaussian for a class of distributions P and δ min if for all δ ∈ [ δ min , 1) , and all distributions in P , � log(2 /δ ) | � µ n − µ | ≤ L σ . n The picture is more complex than before.

known variance Given 0 < σ 1 ≤ σ 2 < ∞ , define the class [ σ 2 1 ,σ 2 2 ] = { P : σ 2 1 ≤ σ 2 P ≤ σ 2 P 2 . } 2 Let R = σ 2 /σ 1 .

known variance Given 0 < σ 1 ≤ σ 2 < ∞ , define the class [ σ 2 1 ,σ 2 2 ] = { P : σ 2 1 ≤ σ 2 P ≤ σ 2 P 2 . } 2 Let R = σ 2 /σ 1 . • If R is bounded then there exists a multiple- δ -sub-Gaussian estimator with δ min = 4 e 1 − n / 2 ; • If R is unbounded then there is no multiple- δ -sub-Gaussian estimate for any L and δ min → 0 . A sharp distinction. The exponentially small value of δ min is best possible.

construction of multiple- δ estimator Reminiscent to Lepski’s method of adaptive estimation. For k = 1 , . . . , K = log 2 (1 /δ min ) , use the median-of-means estimator to construct confidence intervals I k such that ∈ I k } ≤ 2 − k . P { µ / (This is where knowledge of σ 2 and boundedness of R is used.) Define     K � � I j � = ∅ k = min  k :  . j = k Finally, let K � µ n = mid point of � I j j = � k

proof For any k = 1 , . . . , K , P {| � µ n − µ | > | I k |} ≤ P {∃ j ≥ k : µ / ∈ I j } because if µ ∈ ∩ K j = k I j , then ∩ K j = k I j is non-empty and therefore µ n ∈ ∩ K � j = k I j . But K � ∈ I j } ≤ 2 1 − k P {∃ j ≥ k : µ / ∈ I j } ≤ P { µ / j = k

higher moments For η ≥ 1 and α ∈ (2 , 3] , define P α,η = { P : E | X − µ | α ≤ ( η σ ) α } . Then for some C = C ( α, η ) there exists a multiple- δ estimator with a constant L and δ min = e − n / C for all sufficiently large n .

k -regular distributions This follows from a more general result: Define         j j � � p − ( j ) = P X i ≤ j µ and p + ( j ) = P X i ≥ j µ  .    i =1 i =1 A distribution is k -regular if ∀ j ≥ k , min( p + ( j ) , p − ( j )) ≥ 1 / 3 . For this class there exists a multiple- δ estimator with a constant L and δ min = e − n / k for all n .

multivariate distributions Let X be a random vector taking values in R d with mean µ = E X and covariance matrix Σ = E ( X − µ )( X − µ ) T . Given an i.i.d. sample X 1 , . . . , X n , we want to estimate µ that has sub-Gaussian performance.

multivariate distributions Let X be a random vector taking values in R d with mean µ = E X and covariance matrix Σ = E ( X − µ )( X − µ ) T . Given an i.i.d. sample X 1 , . . . , X n , we want to estimate µ that has sub-Gaussian performance. What is sub-Gaussian? If X has a multivariate Gaussian distribution, the sample mean µ n = (1 / n ) � n i =1 X 1 satisfies, with probability at least 1 − δ , � � Tr (Σ) 2 λ max log(1 /δ ) � µ n − µ � ≤ + , n n Can one construct mean estimators with similar performance for a large class of distributions?

coordinate-wise median of means Coordinate-wise median of means yields the bound: � Tr (Σ) log( d /δ ) � � µ MM − µ � ≤ K . n We can do better.

multivariate median of means Hsu and Sabato (2013), Minsker (2015) extended the median-of-means estimate. Minsker proposes an analogous estimate that uses the multivariate median N � Med ( x 1 , . . . , x N ) = argmin � y − x i � . y ∈ R d i =1 For this estimate, with probability at least 1 − δ , � Tr (Σ) log(1 /δ ) � � µ MM − µ � ≤ K . n No further assumption or knowledge of the distribution is required. Computationally feasible. Almost sub-Gaussian but not quite. Dimension free.

median-of-means tournament We propose a new estimator with a purely sub-Gaussian performance, without further conditions. The mean µ is the minimizer of f ( x ) = E � X − µ � 2 . For any pair a , b ∈ R d , we try to guess whether f ( a ) < f ( b ) and set up a “tournament”. Partition the data points into k blocks of size m = n / k . We say that a defeats b if � � 1 � X i − a � 2 < 1 � X i − b � 2 m m i ∈ B j i ∈ B j on more than k / 2 blocks B j .

median-of-means tournament Within each block compute � Y j = 1 X i . m i ∈ B j Then a defeats b if � Y j − a � < � Y j − b � on more than k / 2 blocks B j . Lemma. Let k = ⌈ 200 log(2 /δ ) ⌉ . With probability at least 1 − δ , µ defeats all b ∈ R d such that � b − µ � ≥ r , where     � � Tr (Σ) λ max log(2 /δ )  800    . r = max , 240 n n

Mean estimation: median-of-means tournaments G abor Lugosi ICREA, - PowerPoint PPT Presentation

Mean estimation: median-of-means tournaments G abor Lugosi ICREA, Pompeu Fabra University, BGSE based on joint work with Luc Devroye (McGill, Montreal) Matthieu Lerasle (CNRS, Nice) Roberto Imbuzeiro Oliveira (IMPA, Rio) Shahar Mendelson

the nerves sensory radial median ulnar median median sensory median median ulnar radial

I - -75 Median Cable Barrier 75 Median Cable Barrier 75 Median Cable Barrier I 75 Median Cable

Unavoidable trees in tournaments Richard Mycroft Tssio Naia 20 April 2016 1 Tournaments

Domination in tournaments Nicolas Bousquet Birmingham, June 2017 1/16 Tournaments A tournament

Linear-time Median Def: Median of elements A=a 1 , a 2 , , a n is the (n/2)-th smallest element

Spartanburg Nation Median Value of a $115,900 $184,700 Home Median Gross Rent $705 $950

Median Finding Test Cases What's Next 1. Median finding, part 2 2. Why we write test cases 3.

Minimal Retentive Sets in Tournaments From Anywhere to TEQ Felix Brandt Markus Brill

3 Tournaments Other uses 3 Tournaments Other uses rank adjustment (or challege)

Z 2 -embeddings and Tournaments Radoslav Fulek , Jan Kyn cl Z 2 -embeddings and Tournaments

African American Strategy Equitable Access to Homeownership Presentation April 16, 2018

When does the Tukey Median work? Banghua Zhu with Jiantao Jiao and Jacob Steinhardt Department

Business Statistics CONTENTS Hypotheses on the median The sign test The Wilcoxon signed ranks

Estimation of Median Incomes of Small Areas: A Bayesian Semiparametric Approach Malay Ghosh

ITF Grand Slam tournaments (2008-2011) Katarina Pijetlovic, LL.M, LL.Lic, LL.D cand. Lecturer in

Extremal problems concerning tournaments Timothy Chan (Monash) Andrzej Grzesik (Krak ow) Dan

Crying Wolf: An Empirical Study of SSL Warning Effectiveness Joshua Sunshine Serge Egelman

Tighter risk certificates for (probabilistic) neural networks Omar Rivasplata

Using Strengths Based Measures to Assess and Manage Risk of Future Negative outcomes Simone

Decision Trees COMPSCI 371D Machine Learning COMPSCI 371D Machine Learning Decision

PAC Learnability and Bayes Classifier Matthieu R. Bloch 1 PAC learnability Tie last question to

Model selection theory: a tutorial with applications to learning Pascal Massart Universit

MATH Opportunities in Berlin Shameless plug Postdoc and PhD positions in optimization/ML. At Zuse

If You Give a Judge a Risk Score Evidence from Kentucky Bail Decisions Alex Albright, Harvard