Sub-Gaussian Estimators of the Mean of a Random Matrix with Entries Possessing Only Two Moments
Stas Minsker, University of Southern California
July 21, 2016, ICERM Workshop
Simple question: how to estimate the mean?
Assume that X_1, …, X_n are i.i.d. N(µ, σ_0²).
Problem: construct CI_norm(α) for µ with coverage probability ≥ 1 − 2α.
Solution: compute µ̂_n := (1/n) ∑_{j=1}^n X_j and take
CI_norm(α) = [ µ̂_n − σ_0 √(2 log(1/α)/n),  µ̂_n + σ_0 √(2 log(1/α)/n) ].
Coverage is guaranteed since
Pr( |µ̂_n − µ| ≥ σ_0 √(2 log(1/α)/n) ) ≤ 2α.
Example: how to estimate the mean?
P. J. Huber (1964): "...This raises a question which could have been asked already by Gauss, but which was, as far as I know, only raised a few years ago (notably by Tukey): what happens if the true distribution deviates slightly from the assumed normal one?"
Going back to our question: what if X_1, …, X_n are i.i.d. copies of X ∼ Π such that EX = µ, Var(X) ≤ σ_0²?
Problem: construct CI for µ with coverage probability ≥ 1 − α such that, for any α,
length(CI(α)) ≤ (absolute constant) · length(CI_norm(α)).
No additional assumptions on Π are imposed.
Remark: the guarantee for the sample mean µ̂_n = (1/n) ∑_{j=1}^n X_j is unsatisfactory; Chebyshev's inequality only gives
Pr( |µ̂_n − µ| ≥ σ_0 √(1/(αn)) ) ≤ α.
Does the solution exist?
Example: how to estimate the mean?
Answer (somewhat unexpected?): Yes!
Construction [A. Nemirovski, D. Yudin '83; N. Alon, Y. Matias, M. Szegedy '96; R. Oliveira, M. Lerasle '11]: split the sample into k = ⌊log(1/α)⌋ + 1 groups G_1, …, G_k of size ≃ n/k each:
G_1 = {X_1, …, X_|G_1|},  µ̂_1 := (1/|G_1|) ∑_{X_i ∈ G_1} X_i,
…
G_k = {X_{n−|G_k|+1}, …, X_n},  µ̂_k := (1/|G_k|) ∑_{X_i ∈ G_k} X_i,
µ̂∗ = µ̂∗(α) := median(µ̂_1, …, µ̂_k).
Claim: Pr( |µ̂∗ − µ| ≥ 7.7 σ_0 √(log(e/α)/n) ) ≤ α.
Then take
CI(α) = [ µ̂∗ − 7.7 σ_0 √(log(e/α)/n),  µ̂∗ + 7.7 σ_0 √(log(e/α)/n) ].
Idea of the proof:
[Figure: the group means µ̂_1, …, µ̂_k scattered around µ on the real line.]
|µ̂∗ − µ| ≥ s ⟹ at least half of the events {|µ̂_j − µ| ≥ s} occur.
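A minimal Python sketch of the median-of-means construction above (the function name and the use of np.array_split to form the k groups are my own choices):

```python
import numpy as np

def median_of_means(x, alpha=0.05):
    """Median-of-means: split the sample into k = floor(log(1/alpha)) + 1
    groups of (roughly) equal size, average within each group, and
    return the median of the group means."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    k = int(np.floor(np.log(1.0 / alpha))) + 1
    k = min(k, n)                      # cannot use more groups than points
    group_means = [g.mean() for g in np.array_split(x, k)]
    return float(np.median(group_means))
```

Unlike the sample mean, the output is stable under a few arbitrarily large corrupted observations, since they can spoil at most a few of the k group means.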
Improve the constant?
O. Catoni's estimator (2012), "generalized truncation": let ψ be a nondecreasing function with
−log(1 − x + x²/2) ≤ ψ(x) ≤ log(1 + x + x²/2),
and, given θ > 0, define µ̂ as the solution of
∑_{j=1}^n ψ(θ(X_j − µ̂)) = 0.
By contrast, the hard truncation τ(x) = (|x| ∧ 1) sign(x) satisfies only the weaker inequality
−log(1 − x + x²) ≤ τ(x) ≤ log(1 + x + x²).
[Figure: graphs of ψ and τ on [−1, 1].]
Intuition: for small θ > 0,
∑_{j=1}^n ψ(θ(X_j − µ̂)) ≃ ∑_{j=1}^n θ(X_j − µ̂) = 0 ⟹ µ̂ ≃ (1/n) ∑_{j=1}^n X_j.
The following holds: set θ∗ = √(2 log(1/α)/n) · (1/σ_0). Then
|µ̂ − µ| ≤ (√2 + o(1)) σ_0 √(log(1/α)/n)
with probability ≥ 1 − 2α.
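A sketch of a Catoni-type estimator in Python, under the particular (admissible, but my own) choice ψ(x) = sign(x)·log(1 + |x| + x²/2), which satisfies the two-sided inequality above; since each ψ(θ(X_j − µ)) is nonincreasing in µ, the root can be found by bisection:

```python
import numpy as np

def psi(x):
    # Influence function with -log(1 - x + x^2/2) <= psi(x) <= log(1 + x + x^2/2).
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x**2)

def catoni_mean(x, alpha=0.05, sigma=1.0, tol=1e-10):
    """Solve sum_j psi(theta*(x_j - mu)) = 0 for mu by bisection,
    with theta = sqrt(2 log(1/alpha) / n) / sigma as on the slide."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta = np.sqrt(2.0 * np.log(1.0 / alpha) / n) / sigma
    g = lambda m: psi(theta * (x - m)).sum()  # nonincreasing in m
    lo, hi = x.min(), x.max()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The bracketing interval [min X, max X] always contains the root, since the sum is nonnegative at the left endpoint and nonpositive at the right one.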
Extensions to higher dimensions
A natural question: is it possible to extend the presented techniques to the multivariate mean?
Motivation: PCA.
[Figure: a 10 × 10 matrix heatmap and its low-rank reconstruction.]
Motivation: PCA. "Genes mirror geography within Europe," J. Novembre et al., Nature, 2008.
[Figure: projection of European genetic data onto the first two principal components (PC1, PC2), reproducing the map of Europe.]
A good explanation for non-experts: https://faculty.washington.edu/tathornt/SISG2015/lectures/assoc2015session05.pdf
Mathematical framework: Y_1, …, Y_n ∈ R^d are i.i.d. with EY_j = 0, E[Y_j Y_j^T] = Σ.
Goal: construct Σ̂, an estimator of Σ, such that ‖Σ̂ − Σ‖_op is small.
The sample covariance Σ̃_n = (1/n) ∑_{j=1}^n Y_j Y_j^T is very sensitive to outliers.
Extensions to higher dimensions
Naive approach: apply the "median trick" (or Catoni's estimator) coordinatewise. This makes the bound dimension-dependent.
Better approach: replace the usual median by the geometric median,
x∗ = med(x_1, …, x_k) := argmin_{y ∈ R^d} ∑_{j=1}^k ‖y − x_j‖.
Still some issues:
1. it does not work well for small sample sizes;
2. it yields bounds in the wrong norm.
Alternatives: Tyler's M-estimator, Maronna's M-estimator; their guarantees are limited to special classes of distributions.
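The geometric median has no closed form, but Weiszfeld's classical iteratively reweighted averaging scheme computes it; a minimal sketch (the stopping rule and starting point are my own choices):

```python
import numpy as np

def geometric_median(xs, n_iter=200, eps=1e-8):
    """Weiszfeld's algorithm for argmin_y sum_j ||y - x_j||_2,
    where xs is a (k x d) array of points."""
    xs = np.asarray(xs, dtype=float)
    y = xs.mean(axis=0)                   # start from the coordinatewise mean
    for _ in range(n_iter):
        d = np.linalg.norm(xs - y, axis=1)
        d = np.maximum(d, eps)            # avoid dividing by zero at a data point
        w = 1.0 / d
        y_new = (w[:, None] * xs).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            return y_new
        y = y_new
    return y
```

Each step replaces y by a weighted average of the points, with weights inversely proportional to their current distances from y.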
Matrix functions
Given f : R → R and a symmetric matrix A = A^T with eigendecomposition A = UΛU^T, Λ = diag(λ_1, …, λ_d), define
f(A) = U f(Λ) U^T, where f(Λ) = diag(f(λ_1), …, f(λ_d)).
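Numerically, this definition translates directly into an eigendecomposition; a small Python sketch (the helper name is mine):

```python
import numpy as np

def matrix_function(f, A):
    """Apply a scalar function f to a symmetric matrix A spectrally:
    f(A) = U f(Lambda) U^T."""
    A = np.asarray(A, dtype=float)
    lam, U = np.linalg.eigh(A)       # A = U diag(lam) U^T for symmetric A
    return (U * f(lam)) @ U.T        # scales columns of U, i.e. U f(Lambda) U^T
```

For polynomials this agrees with the usual matrix powers, e.g. matrix_function(np.square, A) equals A @ A.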
Construction of the estimator
Let X ∈ R^{d×d} be a symmetric random matrix, and let X_1, …, X_n ∈ R^{d×d} be i.i.d. copies of X with E‖X‖_F² < ∞. No additional assumptions.
Let ψ satisfy −log(1 − x + x²/2) ≤ ψ(x) ≤ log(1 + x + x²/2) and, for θ > 0, define
Σ̂_n = (1/(nθ)) ∑_{j=1}^n ψ(θX_j).
For example, if X_j = Y_j Y_j^T, we get
Σ̂_n = (1/(nθ)) ∑_{j=1}^n ψ(θ Y_j Y_j^T).
Note that ψ(θ Y_j Y_j^T) = ψ(θ‖Y_j‖_2²) · (Y_j/‖Y_j‖_2)(Y_j/‖Y_j‖_2)^T is easy to compute.
Intuition: for small θ, ψ(θx) ≃ θx, hence Σ̂_n ≃ sample mean + o(θ).
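A sketch of the estimator Σ̂_n in Python, using the concrete choice ψ(x) = sign(x)·log(1 + |x| + x²/2) (one admissible ψ; the helper names are mine) and applying ψ spectrally as in the matrix-function definition:

```python
import numpy as np

def psi(x):
    # -log(1 - x + x^2/2) <= psi(x) <= log(1 + x + x^2/2)
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x**2)

def psi_matrix(A):
    """psi applied spectrally: A = U diag(lam) U^T -> U diag(psi(lam)) U^T."""
    lam, U = np.linalg.eigh(A)
    return (U * psi(lam)) @ U.T

def robust_matrix_mean(Xs, theta):
    """Sigma_hat_n = (1/(n*theta)) sum_j psi(theta * X_j)
    for an (n x d x d) stack of symmetric matrices Xs."""
    Xs = np.asarray(Xs, dtype=float)
    n = Xs.shape[0]
    return sum(psi_matrix(theta * X) for X in Xs) / (n * theta)
```

For rank-one X_j = Y_j Y_j^T the eigendecomposition is trivial, matching the shortcut noted on the slide; and as θ → 0 the estimator approaches the sample mean.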
Recall Σ̂_n = (1/(nθ)) ∑_{j=1}^n ψ(θX_j).
Theorem (M., 2016)
Let X_1, …, X_n be i.i.d. and assume that σ² ≥ ‖EX²‖. Let θ = √(2 log(d/α)/n) · (1/σ). Then
‖Σ̂_n − EX‖ ≤ σ √(2 log(d/α)/n)
with probability ≥ 1 − 2α. For example, in covariance estimation σ² = ‖E[‖Y‖_2² YY^T]‖.
Compare to:
Theorem (Matrix Bernstein inequality, Tropp '11)
Let X, X_1, …, X_n ∈ R^{d×d} be i.i.d., σ_0² = ‖E(X − EX)²‖, ‖X‖ ≤ M. Then for all 0 < α < 1,
‖(1/n) ∑_{j=1}^n X_j − EX‖ ≤ max( 2σ_0 √(log(d/α)/n), (4/3) M log(d/α)/n )
with probability ≥ 1 − 2α.
Further improvements: shift the sample, X_j → X_j − S, and set
Σ̂(S) = S + (1/(nθ)) ∑_{j=1}^n ψ(θ(X_j − S)),
where the second term estimates EX − S.
The "ideal choice" S = EX is unavailable ⟹ use the initial estimator Σ̂_n in place of S. Iterating leads to a fixed point S_∞ satisfying
S_∞ = S_∞ + (1/(nθ)) ∑_{j=1}^n ψ(θ(X_j − S_∞)), i.e. ∑_{j=1}^n ψ(θ(X_j − S_∞)) = 0.
Theorem (M., 2016)
Assume that σ_0² ≥ ‖E(X − EX)²‖. Let θ = √(2 log(d/α)/n) · (1/σ_0), and let S_∞ satisfy
(1/(nθ)) ∑_{j=1}^n ψ(θ(X_j − S_∞)) = 0.
Assume that n is large enough (n ≳ d³). Then S_∞ exists and
‖S_∞ − EX‖ ≤ C σ_0 √(log(d/α)/n)
with probability ≥ 1 − α.
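The fixed point S_∞ can be approximated by simply iterating the map S ↦ S + (1/(nθ)) ∑_j ψ(θ(X_j − S)); a sketch (the initialization at the sample mean and the iteration cap are my own choices, not from the slides):

```python
import numpy as np

def psi(x):
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x**2)

def psi_matrix(A):
    lam, U = np.linalg.eigh(A)
    return (U * psi(lam)) @ U.T

def iterated_estimator(Xs, theta, n_iter=50, tol=1e-10):
    """Iterate S -> S + (1/(n*theta)) sum_j psi(theta*(X_j - S)) until the
    update is tiny; at the fixed point, sum_j psi(theta*(X_j - S)) = 0."""
    Xs = np.asarray(Xs, dtype=float)
    n = Xs.shape[0]
    S = Xs.mean(axis=0)                   # initialize at the sample mean
    for _ in range(n_iter):
        step = sum(psi_matrix(theta * (X - S)) for X in Xs) / (n * theta)
        S = S + step
        if np.linalg.norm(step) < tol:
            break
    return S
```

For small θ the update is approximately (sample mean − S), so the iteration contracts quickly toward the fixed point.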
Numerical results
Setup: Y_1, …, Y_n ∈ R^100 with Σ = diag(10, 5, 1, …, 1) ∈ R^{100×100}, and entries Y_{i,j} drawn from a symmetric Pareto-type distribution with 4 moments. Histograms over 500 replications, for n = 100 and n = 1000.
[Figures: for each n, histograms of the relative covariance errors ‖S̃_n − Σ‖/‖Σ‖ (sample covariance) vs. ‖Σ̂_n − Σ‖/‖Σ‖ (robust estimator), and of the spectral projector errors ‖u_1(Σ̂_n)u_1(Σ̂_n)^T − u_1(Σ)u_1(Σ)^T‖ vs. ‖u_1(S̃_n)u_1(S̃_n)^T − u_1(Σ)u_1(Σ)^T‖.]
Matrix Completion
We observe some entries of the ratings matrix A_0 (rows = users, columns = movies). Question: can we predict the unobserved entries?
Matrix Completion
Let X = { e_j(d) e_k(d)^T : 1 ≤ j ≤ d, 1 ≤ k ≤ d }.
Let X_1, …, X_n be an independent sample from Π := Unif(X), and let the observations Y_j, j = 1, …, n, have the form
Y_j = tr(X_j^T A_0) + ξ_j ("noisy matrix entry"),
where ξ_j, j = 1, …, n, is additive noise.
Since E(YX) = (1/d²) A_0, a natural estimator of A_0 is
Â = (d²/n) ∑_{j=1}^n Y_j X_j.
Incorporate the low-rank assumption:
Â_τ = argmin_{A ∈ R^{d×d}} [ ‖A − Â‖_F²/d² + τ‖A‖_1 ],
where ‖·‖_1 denotes the nuclear norm.
Matrix completion
What if the noise ξ_j is heavy-tailed (only Var(ξ_j) < ∞)? Replace Â with a "robust" estimator
R̂ = (d²/(nθ)) ∑_{j=1}^n ψ(θ Y_j H(X_j)),
and set
R̂_τ = argmin_{A ∈ R^{d×d}} [ ‖A − R̂‖_F²/d² + τ‖A‖_1 ].
Here, H(X) = [ 0  X ; X^T  0 ] is the so-called self-adjoint dilation.
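The dilation H(X) embeds a possibly non-symmetric matrix into a symmetric one of twice the size while preserving the operator norm, so the spectral definition of ψ applies; a minimal sketch:

```python
import numpy as np

def dilation(X):
    """Self-adjoint dilation H(X) = [[0, X], [X^T, 0]].
    H(X) is symmetric and has the same operator norm as X."""
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    return np.block([[np.zeros((m, m)), X],
                     [np.zeros((n, m)) + X.T, np.zeros((n, n))]])
```

The nonzero eigenvalues of H(X) are ±(singular values of X), which is why the operator norm is preserved.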
Theorem (M., 2016)
Take τ = Const · √((t + log 2d)/(nd)). Then
(1/d²) ‖R̂_τ − H(A_0)‖_F² ≤ (1 + √2/2)² · d · (2 rank(A_0)/n) · (t + log 2d).