

1. Sub-Gaussian Estimators of the Mean of a Random Matrix with Entries Possessing Only Two Moments. Stas Minsker, University of Southern California. July 21, 2016, ICERM Workshop.

2–4. Simple question: how to estimate the mean? Assume that $X_1, \dots, X_n$ are i.i.d. $N(\mu, \sigma_0^2)$. Problem: construct $\mathrm{CI}_{\mathrm{norm}}(\alpha)$ for $\mu$ with coverage probability $\ge 1 - 2\alpha$. Solution: compute $\hat\mu_n := \frac{1}{n}\sum_{j=1}^n X_j$ and take
$$\mathrm{CI}_{\mathrm{norm}}(\alpha) = \left[\hat\mu_n - \sigma_0\sqrt{\frac{2\log(1/\alpha)}{n}},\; \hat\mu_n + \sigma_0\sqrt{\frac{2\log(1/\alpha)}{n}}\right].$$
Coverage is guaranteed since
$$\Pr\left(|\hat\mu_n - \mu| \ge \sigma_0\sqrt{\frac{2\log(1/\alpha)}{n}}\right) \le 2\alpha.$$
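A minimal numerical sketch of this construction (Python/NumPy; the function name and the example data are my choices, not from the slides): it computes the sample mean and the Gaussian confidence interval above for a known noise level $\sigma_0$.

```python
import numpy as np

def gaussian_ci(x, sigma0, alpha):
    """Normal-theory confidence interval for the mean, assuming a known sigma0."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu_hat = x.mean()
    half_width = sigma0 * np.sqrt(2.0 * np.log(1.0 / alpha) / n)
    return mu_hat - half_width, mu_hat + half_width

# Example under the ideal Gaussian model.
rng = np.random.default_rng(0)
lo, hi = gaussian_ci(rng.normal(loc=1.0, scale=2.0, size=500), sigma0=2.0, alpha=0.05)
```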

5–7. Example: how to estimate the mean? P. J. Huber (1964): "...This raises a question which could have been asked already by Gauss, but which was, as far as I know, only raised a few years ago (notably by Tukey): what happens if the true distribution deviates slightly from the assumed normal one?" Going back to our question: what if $X_1, \dots, X_n$ are i.i.d. copies of $X \sim \Pi$ such that $\mathbb{E}X = \mu$, $\mathrm{Var}(X) \le \sigma_0^2$? Problem: construct $\mathrm{CI}(\alpha)$ for $\mu$ with coverage probability $\ge 1 - \alpha$ such that for any $\alpha$,
$$\mathrm{length}(\mathrm{CI}(\alpha)) \le (\text{absolute constant}) \cdot \mathrm{length}(\mathrm{CI}_{\mathrm{norm}}(\alpha)).$$
No additional assumptions on $\Pi$ are imposed. Remark: the guarantee for the sample mean $\hat\mu_n = \frac{1}{n}\sum_{j=1}^n X_j$ is unsatisfactory; Chebyshev's inequality only gives
$$\Pr\left(|\hat\mu_n - \mu| \ge \sigma_0\sqrt{\frac{1/\alpha}{n}}\right) \le \alpha,$$
so the interval width grows like $1/\sqrt{\alpha}$ instead of $\sqrt{\log(1/\alpha)}$. Does the solution exist?

8–11. Example: how to estimate the mean? Answer (somewhat unexpected?): Yes! Construction (the "median of means") [A. Nemirovski, D. Yudin '83; N. Alon, Y. Matias, M. Szegedy '96; R. Oliveira, M. Lerasle '11]: split the sample into $k = \lfloor\log(1/\alpha)\rfloor + 1$ groups $G_1, \dots, G_k$ of size $\simeq n/k$ each, compute the group means
$$\hat\mu_j := \frac{1}{|G_j|}\sum_{X_i \in G_j} X_i, \qquad j = 1, \dots, k,$$
and take their median,
$$\hat\mu_* = \hat\mu_*(\alpha) := \mathrm{median}(\hat\mu_1, \dots, \hat\mu_k).$$
Claim:
$$\Pr\left(|\hat\mu_* - \mu| \ge 7.7\,\sigma_0\sqrt{\frac{\log(e/\alpha)}{n}}\right) \le \alpha.$$
Then take
$$\mathrm{CI}(\alpha) = \left[\hat\mu_* - 7.7\,\sigma_0\sqrt{\frac{\log(e/\alpha)}{n}},\; \hat\mu_* + 7.7\,\sigma_0\sqrt{\frac{\log(e/\alpha)}{n}}\right].$$
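A minimal sketch of the median-of-means construction above (Python/NumPy; the function name and the random shuffling of indices are my choices, not prescribed by the slides):

```python
import numpy as np

def median_of_means(x, alpha, rng=None):
    """Median-of-means: k = floor(log(1/alpha)) + 1 blocks of roughly equal size."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    k = int(np.floor(np.log(1.0 / alpha))) + 1
    idx = rng.permutation(len(x))            # random split into k groups
    blocks = np.array_split(idx, k)
    block_means = [x[b].mean() for b in blocks]
    return np.median(block_means)
```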

12. Idea of the proof: [picture of the block means $\hat\mu_1, \dots, \hat\mu_k$ scattered around $\mu$ on the real line] if $|\hat\mu_* - \mu| \ge s$, then at least half of the events $\{|\hat\mu_j - \mu| \ge s\}$ occur.
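To make this quantitative, here is a hedged sketch of the standard argument (the constants below are not tuned to reproduce the 7.7 in the claim, and block-size rounding is ignored). Each block mean has variance at most $\sigma_0^2 k/n$, so by Chebyshev's inequality, with $s = 2e\,\sigma_0\sqrt{k/n}$,
$$p := \Pr\left(|\hat\mu_j - \mu| \ge s\right) \le \frac{\sigma_0^2 k}{n s^2} \le \frac{1}{4e^2}.$$
If the median deviates by $s$, at least $k/2$ of these independent events occur, so by a binomial Chernoff bound,
$$\Pr\left(|\hat\mu_* - \mu| \ge s\right) \le \Pr\left(\mathrm{Bin}(k, p) \ge k/2\right) \le (4p)^{k/2} \le e^{-k} \le \alpha,$$
since $k = \lfloor\log(1/\alpha)\rfloor + 1 \ge \log(1/\alpha)$; and $k \le \log(e/\alpha)$ gives $s \le 2e\,\sigma_0\sqrt{\log(e/\alpha)/n}$, which has the same form as the bound in the claim.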

13–14. Improve the constant? O. Catoni's estimator (2012), "generalized truncation": let $\alpha > 0$, let $\psi$ be a non-decreasing function satisfying
$$-\log(1 - x + x^2/2) \le \psi(x) \le \log(1 + x + x^2/2),$$
and, for $\theta > 0$, define $\hat\mu$ as the solution of
$$\sum_{j=1}^n \psi\left(\theta(X_j - \hat\mu)\right) = 0.$$
The hard truncation $\tau(x) = (|x| \wedge 1)\,\mathrm{sign}(x)$ only satisfies the weaker inequality
$$-\log(1 - x + x^2) \le \tau(x) \le \log(1 + x + x^2).$$
[Plot: $\tau$ and its logarithmic envelopes on $[-1, 1]$.]

15. Improve the constant?
$$\sum_{j=1}^n \psi\left(\theta(X_j - \hat\mu)\right) = 0.$$
Intuition: for small $\theta > 0$,
$$\sum_{j=1}^n \psi\left(\theta(X_j - \hat\mu)\right) \simeq \theta\sum_{j=1}^n (X_j - \hat\mu) = 0 \;\Longrightarrow\; \hat\mu \simeq \frac{1}{n}\sum_{j=1}^n X_j.$$

16. Improve the constant?
$$\sum_{j=1}^n \psi\left(\theta(X_j - \hat\mu)\right) = 0.$$
The following holds: set $\theta^* = \frac{1}{\sigma_0}\sqrt{\frac{2\log(1/\alpha)}{n}}$. Then
$$|\hat\mu - \mu| \le \left(\sqrt{2} + o(1)\right)\sigma_0\sqrt{\frac{\log(1/\alpha)}{n}}$$
with probability $\ge 1 - 2\alpha$.
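A minimal sketch of one way to compute such an estimator (Python/NumPy; the particular choice $\psi(x) = \mathrm{sign}(x)\log(1 + |x| + x^2/2)$, which satisfies the two-sided inequality above, and the bisection root-finder are my choices, not prescribed by the slides):

```python
import numpy as np

def psi(x):
    """One admissible influence function: sign(x) * log(1 + |x| + x^2/2)."""
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x**2)

def catoni_mean(x, sigma0, alpha, tol=1e-10):
    """Solve sum_j psi(theta * (x_j - mu)) = 0 by bisection, with theta as on slide 16."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta = np.sqrt(2.0 * np.log(1.0 / alpha) / n) / sigma0
    f = lambda mu: np.sum(psi(theta * (x - mu)))            # decreasing in mu
    lo, hi = x.min() - 1.0 / theta, x.max() + 1.0 / theta   # f(lo) > 0 > f(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)
```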

17–18. Extensions to higher dimensions. A natural question: is it possible to extend the presented techniques to the multivariate mean? Motivation: PCA. [Figure: two surface plots illustrating PCA as noise/dimension reduction.]

19. Extensions to higher dimensions. Motivation: PCA. "Genes mirror geography within Europe," J. Novembre et al., Nature, 2008. [Figure: panels a–c from the paper, with axes PC1 and PC2.] A good explanation for non-experts: https://faculty.washington.edu/tathornt/SISG2015/lectures/assoc2015session05.pdf

20–21. Extensions to higher dimensions. Motivation: PCA ("Genes mirror geography within Europe," J. Novembre et al., Nature, 2008). Mathematical framework: $Y_1, \dots, Y_n \in \mathbb{R}^d$ i.i.d., $\mathbb{E}Y_j = 0$, $\mathbb{E}Y_j Y_j^T = \Sigma$. Goal: construct $\hat\Sigma$, an estimator of $\Sigma$, such that $\|\hat\Sigma - \Sigma\|_{\mathrm{op}}$ is small. The sample covariance
$$\tilde\Sigma_n = \frac{1}{n}\sum_{j=1}^n Y_j Y_j^T$$
is very sensitive to outliers.
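A minimal sketch of the quantities in this framework (Python/NumPy; the heavy-tailed Student-t data used to probe outlier sensitivity is my choice of example, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
Y = rng.standard_t(df=2.5, size=(n, d))      # heavy-tailed, mean-zero sample
Sigma = np.eye(d) * 2.5 / (2.5 - 2.0)        # true covariance of t_{2.5}: (df/(df-2)) * I

Sigma_tilde = (Y.T @ Y) / n                  # sample covariance from slide 21
op_error = np.linalg.norm(Sigma_tilde - Sigma, ord=2)   # operator-norm error
```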

22–25. Extensions to higher dimensions. Naive approach: apply the "median trick" (or Catoni's estimator) coordinatewise; this makes the bound dimension-dependent. Better approach: replace the univariate median by the geometric median (a computational sketch is given below),
$$x_* = \mathrm{med}(x_1, \dots, x_k) := \operatorname*{argmin}_{y \in \mathbb{R}^d} \sum_{j=1}^k \|y - x_j\|.$$
Still some issues:
1. it does not work well for small sample sizes;
2. it yields bounds in the wrong norm.
Alternatives: Tyler's M-estimator and Maronna's M-estimator; their guarantees are limited to special classes of distributions.
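A minimal sketch of the geometric median via Weiszfeld's iteration (Python/NumPy; the stopping rule and the small epsilon guarding against division by zero are my choices, and this is not necessarily the algorithm used in the talk):

```python
import numpy as np

def geometric_median(points, n_iter=200, eps=1e-10):
    """Weiszfeld iteration for argmin_y sum_j ||y - x_j||, points given as rows."""
    x = np.asarray(points, dtype=float)
    y = x.mean(axis=0)                          # start from the coordinatewise mean
    for _ in range(n_iter):
        dist = np.linalg.norm(x - y, axis=1)
        w = 1.0 / np.maximum(dist, eps)         # inverse-distance weights
        y_new = (w[:, None] * x).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            break
        y = y_new
    return y
```

In a multivariate median-of-means scheme, one would apply this to the $k$ block means (or, for covariance estimation, to the block sample covariances with an appropriate matrix norm).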
