Sub-Gaussian Estimators of the Mean of a Random Matrix with Entries Possessing Only Two Moments



slide-1
SLIDE 1

Sub-Gaussian Estimators of the Mean of a Random Matrix with Entries Possessing Only Two Moments

Stas Minsker University of Southern California July 21, 2016 ICERM Workshop

slide-2
SLIDE 2

Simple question: how to estimate the mean?

Assume that X1, . . . , Xn are i.i.d. N(µ, σ0²).

Problem: construct CInorm(α) for µ with coverage probability ≥ 1 − 2α.

slide-3
SLIDE 3

Simple question: how to estimate the mean?

Assume that X1, . . . , Xn are i.i.d. N(µ, σ0²).

Problem: construct CInorm(α) for µ with coverage probability ≥ 1 − 2α.

Solution: compute µ̂n := (1/n) Σ_{j=1}^n Xj and take

CInorm(α) = [ µ̂n − σ0 √(2 log(1/α)/n), µ̂n + σ0 √(2 log(1/α)/n) ].

slide-4
SLIDE 4

Simple question: how to estimate the mean?

Assume that X1, . . . , Xn are i.i.d. N(µ, σ0²).

Problem: construct CInorm(α) for µ with coverage probability ≥ 1 − 2α.

Solution: compute µ̂n := (1/n) Σ_{j=1}^n Xj and take

CInorm(α) = [ µ̂n − σ0 √(2 log(1/α)/n), µ̂n + σ0 √(2 log(1/α)/n) ].

Coverage is guaranteed since

Pr( |µ̂n − µ| ≥ σ0 √(2 log(1/α)/n) ) ≤ 2α.
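Numerically, the guarantee above is easy to sanity-check. A minimal Monte Carlo sketch (the values of µ, σ0, n, α and the seed are illustrative, not from the talk):

```python
import numpy as np

def ci_norm(x, sigma0, alpha):
    """CI_norm(alpha): two-sided interval with coverage >= 1 - 2*alpha."""
    mu_hat = x.mean()
    half_width = sigma0 * np.sqrt(2 * np.log(1 / alpha) / len(x))
    return mu_hat - half_width, mu_hat + half_width

# Monte Carlo check of the coverage guarantee Pr(mu in CI) >= 1 - 2*alpha.
rng = np.random.default_rng(0)
mu, sigma0, n, alpha, trials = 1.0, 2.0, 50, 0.05, 2000
hits = 0
for _ in range(trials):
    lo, hi = ci_norm(rng.normal(mu, sigma0, n), sigma0, alpha)
    hits += (lo <= mu <= hi)
coverage = hits / trials   # empirically well above 1 - 2*alpha = 0.90
```

For Gaussian data the bound is in fact conservative: the exact exceedance probability is 2(1 − Φ(√(2 log(1/α)))), which is smaller than 2α.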
slide-5
SLIDE 5

Example: how to estimate the mean?

P. J. Huber (1964): “...This raises a question which could have been asked already by Gauss, but which was, as far as I know, only raised a few years ago (notably by Tukey): what happens if the true distribution deviates slightly from the assumed normal one?”

Going back to our question: what if X1, . . . , Xn are i.i.d. copies of X ∼ Π such that EX = µ and Var(X) ≤ σ0²?

slide-6
SLIDE 6

Example: how to estimate the mean?

P. J. Huber (1964): “...This raises a question which could have been asked already by Gauss, but which was, as far as I know, only raised a few years ago (notably by Tukey): what happens if the true distribution deviates slightly from the assumed normal one?”

Going back to our question: what if X1, . . . , Xn are i.i.d. copies of X ∼ Π such that EX = µ and Var(X) ≤ σ0²?

Problem: construct CI(α) for µ with coverage probability ≥ 1 − α such that, for any α,

length(CI(α)) ≤ (absolute constant) · length(CInorm(α)).

No additional assumptions on Π are imposed.

slide-7
SLIDE 7

Example: how to estimate the mean?

P. J. Huber (1964): “...This raises a question which could have been asked already by Gauss, but which was, as far as I know, only raised a few years ago (notably by Tukey): what happens if the true distribution deviates slightly from the assumed normal one?”

Going back to our question: what if X1, . . . , Xn are i.i.d. copies of X ∼ Π such that EX = µ and Var(X) ≤ σ0²?

Problem: construct CI(α) for µ with coverage probability ≥ 1 − α such that, for any α,

length(CI(α)) ≤ (absolute constant) · length(CInorm(α)).

No additional assumptions on Π are imposed.

Remark: the guarantee for the sample mean µ̂n = (1/n) Σ_{j=1}^n Xj is unsatisfactory; Chebyshev's inequality only gives

Pr( |µ̂n − µ| ≥ σ0 √(1/(αn)) ) ≤ α.

Does the solution exist?

slide-8
SLIDE 8

Example: how to estimate the mean?

Answer (somewhat unexpected?): Yes!

slide-9
SLIDE 9

Example: how to estimate the mean?

Answer (somewhat unexpected?): Yes!

Construction [A. Nemirovski, D. Yudin ’83; N. Alon, Y. Matias, M. Szegedy ’96; R. Oliveira, M. Lerasle ’11]: split the sample into k = ⌊log(1/α)⌋ + 1 groups G1, . . . , Gk of size ≃ n/k each:

G1: X1, . . . , X_|G1|, with µ̂1 := (1/|G1|) Σ_{Xi∈G1} Xi
. . .
Gk: X_{n−|Gk|+1}, . . . , Xn, with µ̂k := (1/|Gk|) Σ_{Xi∈Gk} Xi

µ̂∗ = µ̂∗(α) := median(µ̂1, . . . , µ̂k).

slide-10
SLIDE 10

Example: how to estimate the mean?

Answer (somewhat unexpected?): Yes!

Construction [A. Nemirovski, D. Yudin ’83; N. Alon, Y. Matias, M. Szegedy ’96; R. Oliveira, M. Lerasle ’11]: split the sample into k = ⌊log(1/α)⌋ + 1 groups G1, . . . , Gk of size ≃ n/k each:

G1: X1, . . . , X_|G1|, with µ̂1 := (1/|G1|) Σ_{Xi∈G1} Xi
. . .
Gk: X_{n−|Gk|+1}, . . . , Xn, with µ̂k := (1/|Gk|) Σ_{Xi∈Gk} Xi

µ̂∗ = µ̂∗(α) := median(µ̂1, . . . , µ̂k).

Claim: Pr( |µ̂∗ − µ| ≥ 7.7 σ0 √(log(e/α)/n) ) ≤ α.
slide-11
SLIDE 11

Example: how to estimate the mean?

Answer (somewhat unexpected?): Yes!

Construction [A. Nemirovski, D. Yudin ’83; N. Alon, Y. Matias, M. Szegedy ’96; R. Oliveira, M. Lerasle ’11]: split the sample into k = ⌊log(1/α)⌋ + 1 groups G1, . . . , Gk of size ≃ n/k each:

G1: X1, . . . , X_|G1|, with µ̂1 := (1/|G1|) Σ_{Xi∈G1} Xi
. . .
Gk: X_{n−|Gk|+1}, . . . , Xn, with µ̂k := (1/|Gk|) Σ_{Xi∈Gk} Xi

µ̂∗ = µ̂∗(α) := median(µ̂1, . . . , µ̂k).

Claim: Pr( |µ̂∗ − µ| ≥ 7.7 σ0 √(log(e/α)/n) ) ≤ α.

Then take CI(α) = [ µ̂∗ − 7.7 σ0 √(log(e/α)/n), µ̂∗ + 7.7 σ0 √(log(e/α)/n) ].
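The median-of-means construction above takes only a few lines of code. A sketch in Python (the Student-t test distribution, α, and the seed are illustrative choices):

```python
import numpy as np

def median_of_means(x, alpha):
    """Split into k = floor(log(1/alpha)) + 1 groups of size ~ n/k,
    average within each group, and take the median of the group means."""
    x = np.asarray(x, dtype=float)
    k = int(np.floor(np.log(1 / alpha))) + 1
    groups = np.array_split(x, k)
    return float(np.median([g.mean() for g in groups]))

# Heavy-tailed example: Student's t with 2.5 degrees of freedom has mean 0
# and finite variance, but the sample mean is far from sub-Gaussian.
rng = np.random.default_rng(1)
x = rng.standard_t(df=2.5, size=10_000)
mu_star = median_of_means(x, alpha=0.01)
```

Note that the estimator depends on the confidence level α through the number of groups k, so a different α yields a different estimator.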

slide-12
SLIDE 12

Idea of the proof:

(Figure: the group means µ̂1, . . . , µ̂k plotted on the real line around µ, together with their median µ̂∗.)

|µ̂∗ − µ| ≥ s ⟹ at least half of the events {|µ̂j − µ| ≥ s} occur. Each of these independent events has small probability (Chebyshev's inequality applied within a group), so the probability that at least half of them occur simultaneously is exponentially small in k.

slide-13
SLIDE 13

Improve the constant?

  • O. Catoni's estimator (2012), “generalized truncation”: let α > 0 and let ψ satisfy

−log(1 − x + x²/2) ≤ ψ(x) ≤ log(1 + x + x²/2),

and, for θ > 0, define µ̂ via

Σ_{j=1}^n ψ(θ(Xj − µ̂)) = 0.
slide-14
SLIDE 14

Improve the constant?

  • O. Catoni's estimator (2012), “generalized truncation”: let α > 0 and let ψ satisfy

−log(1 − x + x²/2) ≤ ψ(x) ≤ log(1 + x + x²/2),

and, for θ > 0, define µ̂ via

Σ_{j=1}^n ψ(θ(Xj − µ̂)) = 0.

Truncation τ(x) = (|x| ∧ 1)·sign(x) satisfies only the weaker inequality

−log(1 − x + x²) ≤ τ(x) ≤ log(1 + x + x²).

(Figure: graphs of ψ and τ.)

slide-15
SLIDE 15

Improve the constant?

Σ_{j=1}^n ψ(θ(Xj − µ̂)) = 0.

Intuition: for small θ > 0, ψ(θ(Xj − µ̂)) ≃ θ(Xj − µ̂), so

Σ_{j=1}^n ψ(θ(Xj − µ̂)) ≃ Σ_{j=1}^n θ(Xj − µ̂) = 0 ⟹ µ̂ ≃ (1/n) Σ_{j=1}^n Xj.

slide-16
SLIDE 16

Improve the constant?

Σ_{j=1}^n ψ(θ(Xj − µ̂)) = 0.

The following holds: set θ∗ = √(2 log(1/α)/n) · (1/σ0). Then

|µ̂ − µ| ≤ (√2 + o(1)) σ0 √(log(1/α)/n)

with probability ≥ 1 − 2α.
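Catoni's estimator can be computed by root-finding, since the left-hand side of the estimating equation is decreasing in µ̂. A sketch using the narrowest admissible ψ (the test data, σ0, α, and the seed are illustrative):

```python
import numpy as np

def psi(x):
    """Catoni's influence function, satisfying
    -log(1 - x + x^2/2) <= psi(x) <= log(1 + x + x^2/2)."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, np.log1p(x + x * x / 2), -np.log1p(-x + x * x / 2))

def catoni_mean(x, sigma0, alpha):
    """Root of sum_j psi(theta*(X_j - mu)) = 0, found by bisection;
    the sum is decreasing in mu, so the root is unique."""
    x = np.asarray(x, dtype=float)
    theta = np.sqrt(2 * np.log(1 / alpha) / len(x)) / sigma0
    lo, hi = x.min() - 1.0, x.max() + 1.0     # f(lo) > 0 > f(hi)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if psi(theta * (x - mid)).sum() > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(2)
x = rng.standard_t(df=3, size=5_000)          # mean 0, variance 3
mu_hat = catoni_mean(x, sigma0=np.sqrt(3.0), alpha=0.05)
```

Both branches of ψ are well defined for all real x because 1 ± x + x²/2 > 0 everywhere.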

slide-17
SLIDE 17

Extensions to higher dimensions

A natural question: is it possible to extend the presented techniques to the multivariate mean?

slide-18
SLIDE 18

Extensions to higher dimensions

A natural question: is it possible to extend the presented techniques to the multivariate mean? Motivation: PCA.

slide-19
SLIDE 19

Extensions to higher dimensions

Motivation: PCA. “Genes mirror geography within Europe,” J. Novembre et al., Nature 2008.

(Figure: individuals plotted by their first two principal components, PC1 and PC2, recover the map of Europe.)

A good explanation for non-experts: https://faculty.washington.edu/tathornt/SISG2015/lectures/assoc2015session05.pdf

slide-20
SLIDE 20

Extensions to higher dimensions

Motivation: PCA. “Genes mirror geography within Europe,” J. Novembre et al., Nature 2008.

Mathematical framework: Y1, . . . , Yn ∈ R^d, i.i.d., with EYj = 0 and EYjYjᵀ = Σ.

Goal: construct Σ̂, an estimator of Σ, such that ‖Σ̂ − Σ‖_op is small.

slide-21
SLIDE 21

Extensions to higher dimensions

Motivation: PCA. “Genes mirror geography within Europe,” J. Novembre et al., Nature 2008.

Mathematical framework: Y1, . . . , Yn ∈ R^d, i.i.d., with EYj = 0 and EYjYjᵀ = Σ.

Goal: construct Σ̂, an estimator of Σ, such that ‖Σ̂ − Σ‖_op is small.

The sample covariance Σ̃n = (1/n) Σ_{j=1}^n YjYjᵀ is very sensitive to outliers.

slide-22
SLIDE 22

Extensions to higher dimensions

Naive approach: apply the “median trick” (or Catoni's estimator) coordinatewise. This makes the bound dimension-dependent.

slide-23
SLIDE 23

Extensions to higher dimensions

Naive approach: apply the “median trick” (or Catoni's estimator) coordinatewise. This makes the bound dimension-dependent.

Better approach: replace the usual median by the geometric median,

x∗ = med(x1, . . . , xk) := argmin_{y∈R^d} Σ_{j=1}^k ‖y − xj‖.

slide-24
SLIDE 24

Extensions to higher dimensions

Naive approach: apply the “median trick” (or Catoni's estimator) coordinatewise. This makes the bound dimension-dependent.

Better approach: replace the usual median by the geometric median,

x∗ = med(x1, . . . , xk) := argmin_{y∈R^d} Σ_{j=1}^k ‖y − xj‖.

Still some issues:

1. does not work well for small sample sizes;
2. yields bounds in the wrong norm.

slide-25
SLIDE 25

Extensions to higher dimensions

Naive approach: apply the “median trick” (or Catoni's estimator) coordinatewise. This makes the bound dimension-dependent.

Better approach: replace the usual median by the geometric median,

x∗ = med(x1, . . . , xk) := argmin_{y∈R^d} Σ_{j=1}^k ‖y − xj‖.

Still some issues:

1. does not work well for small sample sizes;
2. yields bounds in the wrong norm.

Alternatives: Tyler's M-estimator, Maronna's M-estimator; their guarantees are limited to special classes of distributions.
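The geometric median has no closed form, but Weiszfeld's classical algorithm computes it by iteratively re-weighted averaging. A sketch (the example points are illustrative; note how one wild point barely moves the geometric median while it drags the mean far away):

```python
import numpy as np

def geometric_median(points, n_iter=200, eps=1e-9):
    """Weiszfeld's algorithm for argmin_y sum_j ||y - x_j||_2."""
    x = np.asarray(points, dtype=float)
    y = x.mean(axis=0)                     # start at the coordinate-wise mean
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(x - y, axis=1), eps)  # guard /0 at data points
        w = 1.0 / d
        y_next = (w[:, None] * x).sum(axis=0) / w.sum()
        if np.linalg.norm(y_next - y) < eps:
            break
        y = y_next
    return y

# Four points near the origin plus one gross outlier.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [50.0, 50.0]])
gm = geometric_median(pts)                 # stays near the cluster
```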

slide-26
SLIDE 26

Matrix functions

Let f : R → R and let A = Aᵀ = UΛUᵀ be symmetric. Then f(A) := U f(Λ) Uᵀ, where f(Λ) = f(diag(λ1, . . . , λd)) = diag(f(λ1), . . . , f(λd)).
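The spectral definition above translates directly into code via an eigendecomposition:

```python
import numpy as np

def matrix_function(f, A):
    """f(A) = U f(Lambda) U^T for symmetric A = U Lambda U^T."""
    lam, U = np.linalg.eigh(A)
    return (U * f(lam)) @ U.T     # scales column j of U by f(lambda_j)

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])        # eigenvalues 1 and 3
B = matrix_function(np.sqrt, A)   # the matrix square root: B @ B == A
```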

slide-27
SLIDE 27

Construction of the estimator

X ∈ R^{d×d} is a symmetric random matrix; X1, . . . , Xn ∈ R^{d×d} are i.i.d. copies of X with E‖X‖²_F < ∞.

No additional assumptions.

slide-28
SLIDE 28

Construction of the estimator

X ∈ R^{d×d} is a symmetric random matrix; X1, . . . , Xn ∈ R^{d×d} are i.i.d. copies of X with E‖X‖²_F < ∞.

No additional assumptions. Let ψ satisfy −log(1 − x + x²/2) ≤ ψ(x) ≤ log(1 + x + x²/2) and, for θ > 0, define

Σ̂n = (1/(nθ)) Σ_{j=1}^n ψ(θXj).

slide-29
SLIDE 29

Construction of the estimator

X ∈ R^{d×d} is a symmetric random matrix; X1, . . . , Xn ∈ R^{d×d} are i.i.d. copies of X with E‖X‖²_F < ∞.

No additional assumptions. Let ψ satisfy −log(1 − x + x²/2) ≤ ψ(x) ≤ log(1 + x + x²/2) and, for θ > 0, define

Σ̂n = (1/(nθ)) Σ_{j=1}^n ψ(θXj).

For example, if Xj = YjYjᵀ, we get Σ̂n = (1/(nθ)) Σ_{j=1}^n ψ(θYjYjᵀ). Note that

ψ(θYjYjᵀ) = ψ(θ‖Yj‖₂²) · (YjYjᵀ)/‖Yj‖₂²

is easy to compute.

slide-30
SLIDE 30

Construction of the estimator

X ∈ R^{d×d} is a symmetric random matrix; X1, . . . , Xn ∈ R^{d×d} are i.i.d. copies of X with E‖X‖²_F < ∞.

No additional assumptions. Let ψ satisfy −log(1 − x + x²/2) ≤ ψ(x) ≤ log(1 + x + x²/2) and, for θ > 0, define

Σ̂n = (1/(nθ)) Σ_{j=1}^n ψ(θXj).

For example, if Xj = YjYjᵀ, we get Σ̂n = (1/(nθ)) Σ_{j=1}^n ψ(θYjYjᵀ).

Intuition: for small θ, ψ(θx) ≃ θx, hence Σ̂n ≃ sample mean + o(θ).
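The estimator Σ̂n costs one eigendecomposition per sample. A sketch (the Gaussian test data, the crude variance bound σ = 5.0, and α = 0.05 are illustrative choices, not values from the talk):

```python
import numpy as np

def psi(x):
    return np.where(x >= 0, np.log1p(x + x * x / 2), -np.log1p(-x + x * x / 2))

def psi_spectral(A):
    """psi applied to a symmetric matrix through its eigendecomposition."""
    lam, U = np.linalg.eigh(A)
    return (U * psi(lam)) @ U.T

def robust_matrix_mean(Xs, theta):
    """hat Sigma_n = (1/(n*theta)) * sum_j psi(theta * X_j)."""
    n = len(Xs)
    return sum(psi_spectral(theta * X) for X in Xs) / (n * theta)

# Covariance estimation: X_j = Y_j Y_j^T for standard Gaussian Y_j in R^3,
# so the target EX is the identity matrix.
rng = np.random.default_rng(5)
d, n = 3, 2000
Ys = rng.normal(size=(n, d))
Xs = [np.outer(y, y) for y in Ys]
theta = np.sqrt(2 * np.log(d / 0.05) / n) / 5.0   # sigma = 5.0, a crude upper bound
Sigma_hat = robust_matrix_mean(Xs, theta)
```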

slide-31
SLIDE 31

Σ̂n = (1/(nθ)) Σ_{j=1}^n ψ(θXj).

Theorem (M., 2016). Let X1, . . . , Xn be i.i.d. and assume that σ² ≥ ‖EX²‖. Let θ = √(2 log(d/α)/n) · (1/σ). Then

‖Σ̂n − EX‖ ≤ σ √(2 log(d/α)/n)

with probability ≥ 1 − 2α. For example, in covariance estimation σ² = ‖E(‖Y‖₂² YYᵀ)‖.
slide-32
SLIDE 32

Theorem (M., 2016). Let X1, . . . , Xn be i.i.d. and assume that σ² ≥ ‖EX²‖. Let θ = √(2 log(d/α)/n) · (1/σ). Then ‖Σ̂n − EX‖ ≤ σ √(2 log(d/α)/n) with probability ≥ 1 − 2α.

Compare to:

Theorem (matrix Bernstein inequality, Tropp ’11). Let X, X1, . . . , Xn ∈ R^{d×d} be i.i.d. with σ0² = ‖E(X − EX)²‖ and ‖X‖ ≤ M. Then for all 0 < α < 1,

‖(1/n) Σ_{j=1}^n Xj − EX‖ ≤ max( 2σ0 √(log(d/α)/n), (4/3) M log(d/α)/n )

with probability ≥ 1 − 2α.
slide-33
SLIDE 33

Further improvements: shift Xj → Xj + S and set

Σ̂(S) = S + (1/(nθ)) Σ_{j=1}^n ψ(θ(Xj − S)),

where the correction term (1/(nθ)) Σ_j ψ(θ(Xj − S)) ≃ EX − S.

slide-34
SLIDE 34

Further improvements: shift Xj → Xj + S and set

Σ̂(S) = S + (1/(nθ)) Σ_{j=1}^n ψ(θ(Xj − S)),

where the correction term ≃ EX − S. The “ideal choice” S = EX is unavailable ⟹ use the initial estimator Σ̂n in place of S.

slide-35
SLIDE 35

Further improvements: shift Xj → Xj + S and set

Σ̂(S) = S + (1/(nθ)) Σ_{j=1}^n ψ(θ(Xj − S)),

where the correction term ≃ EX − S. The “ideal choice” S = EX is unavailable ⟹ use the initial estimator Σ̂n in place of S. Iterate to the fixed point:

S∞ = S∞ + (1/(nθ)) Σ_{j=1}^n ψ(θ(Xj − S∞)),

i.e., the correction term at S∞ equals 0.
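The fixed point S∞ can be found by simply iterating the update, which contracts rapidly for small θ. A sketch (the target matrix M, θ = 0.1, and the Gaussian noise are illustrative):

```python
import numpy as np

def psi(x):
    return np.where(x >= 0, np.log1p(x + x * x / 2), -np.log1p(-x + x * x / 2))

def psi_spectral(A):
    lam, U = np.linalg.eigh(A)
    return (U * psi(lam)) @ U.T

def s_infinity(Xs, theta, n_iter=100, tol=1e-12):
    """Iterate S <- S + (1/(n*theta)) sum_j psi(theta*(X_j - S)) to its fixed point."""
    n = len(Xs)
    S = sum(Xs) / n                       # warm start at the sample mean
    for _ in range(n_iter):
        step = sum(psi_spectral(theta * (X - S)) for X in Xs) / (n * theta)
        S = S + step
        if np.linalg.norm(step) < tol:
            break
    return S

rng = np.random.default_rng(6)
d, n = 2, 500
M = np.array([[1.0, 0.2], [0.2, 0.5]])    # target mean EX
Xs = []
for _ in range(n):
    G = rng.normal(size=(d, d))
    Xs.append(M + (G + G.T) / 2)          # symmetric noise around M
theta = 0.1
S_inf = s_infinity(Xs, theta)
```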
slide-36
SLIDE 36

Theorem (M., 2016). Assume that σ0² ≥ ‖E(X − EX)²‖. Let θ = √(2 log(d/α)/n) · (1/σ0), and define S∞ by

(1/(nθ)) Σ_{j=1}^n ψ(θ(Xj − S∞)) = 0.

Assume that n is large enough (n ≳ d³). Then S∞ exists and

‖S∞ − EX‖ ≤ C σ0 √(log(d/α)/n)

with probability ≥ 1 − α.

slide-37
SLIDE 37

Numerical results

Y1, . . . , Yn ∈ R^100, with Σ = diag(10, 5, 1, . . . , 1) ∈ R^{100×100}, and the entries Yi,j following a symmetric Pareto-type distribution with 4 moments.
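A “symmetric Pareto-type distribution with 4 moments” can be simulated as a random sign times a Pareto variable whose tail index exceeds 4; the exact distribution used in the experiments is not specified here, so the tail index 4.5 below is an illustrative stand-in:

```python
import numpy as np

rng = np.random.default_rng(7)

def symmetric_pareto(size, tail_index=4.5):
    """Random sign times Pareto(tail_index) on [1, inf):
    E|X|^p < infinity exactly when p < tail_index, so 4 moments need tail_index > 4."""
    mag = rng.uniform(size=size) ** (-1.0 / tail_index)
    sign = rng.choice([-1.0, 1.0], size=size)
    return sign * mag

Y = symmetric_pareto((1000, 100))    # n = 1000 samples in R^100, i.i.d. entries
```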

slide-38
SLIDE 38

Numerical results

Histograms over 500 replications: n = 100.

(Figure: histograms of the relative errors, ‖Σ̃n − Σ‖/‖Σ‖ for the sample covariance estimator and ‖Σ̂n − Σ‖/‖Σ‖ for the robust covariance estimator.)
slide-39
SLIDE 39

Numerical results

Histograms over 500 replications: n = 100.

(Figure: histograms of the spectral-projector errors, ‖u1(Σ̂n)u1(Σ̂n)ᵀ − u1(Σ)u1(Σ)ᵀ‖ for the robust estimator and ‖u1(Σ̃n)u1(Σ̃n)ᵀ − u1(Σ)u1(Σ)ᵀ‖ for the sample covariance, where u1(·) denotes the leading eigenvector.)

slide-40
SLIDE 40

Numerical results

Histograms over 500 replications: n = 1000.

(Figure: histograms of the relative errors, ‖Σ̃n − Σ‖/‖Σ‖ for the sample covariance estimator and ‖Σ̂n − Σ‖/‖Σ‖ for the robust covariance estimator.)

slide-41
SLIDE 41

Numerical results

Histograms over 500 replications: n = 1000.

(Figure: histograms of the spectral-projector errors, ‖u1(Σ̂n)u1(Σ̂n)ᵀ − u1(Σ)u1(Σ)ᵀ‖ for the robust estimator and ‖u1(Σ̃n)u1(Σ̃n)ᵀ − u1(Σ)u1(Σ)ᵀ‖ for the sample covariance.)

slide-42
SLIDE 42

Matrix Completion

Observe some entries of the k × n ratings matrix A0 (rows indexed by users 1, . . . , k; columns by movies 1, . . . , n; ∗ marks a rating). Question: can we predict the unobserved entries?

slide-43
SLIDE 43

Matrix Completion

X = { ej(d) ek(d)ᵀ : 1 ≤ j ≤ d, 1 ≤ k ≤ d }.
slide-44
SLIDE 44

Matrix Completion

X = { ej(d) ek(d)ᵀ : 1 ≤ j ≤ d, 1 ≤ k ≤ d }.

X1, . . . , Xn is an independent sample from Π := Unif(X), and the observations Yj, j = 1, . . . , n, have the form

Yj = tr(Xjᵀ A0) + ξj (a “noisy matrix entry”),

where ξj, j = 1, . . . , n, is additive noise.

slide-45
SLIDE 45

Matrix Completion

X = { ej(d) ek(d)ᵀ : 1 ≤ j ≤ d, 1 ≤ k ≤ d }.

X1, . . . , Xn is an independent sample from Π := Unif(X), and the observations Yj, j = 1, . . . , n, have the form

Yj = tr(Xjᵀ A0) + ξj (a “noisy matrix entry”),

where ξj, j = 1, . . . , n, is additive noise.

Since E(YX) = (1/d²) A0, a natural estimator of A0 is

Â = (d²/n) Σ_{j=1}^n Yj Xj.
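Since E(YX) = A0/d², the estimator Â amounts to an entrywise sum of the observations scaled by d²/n. A small simulation (the rank-one A0, the Gaussian noise level, and the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
d, n = 20, 50_000

u, v = rng.normal(size=d), rng.normal(size=d)
A0 = np.outer(u, v)                       # rank-one target

# X_j = e_{r_j} e_{c_j}^T drawn uniformly; Y_j = tr(X_j^T A0) + xi_j = A0[r_j, c_j] + xi_j
rows = rng.integers(0, d, size=n)
cols = rng.integers(0, d, size=n)
Y = A0[rows, cols] + rng.normal(scale=0.1, size=n)

A_hat = np.zeros((d, d))
np.add.at(A_hat, (rows, cols), Y)         # accumulate sum_j Y_j X_j entrywise
A_hat *= d * d / n
rel_err = np.linalg.norm(A_hat - A0) / np.linalg.norm(A0)
```

`np.add.at` is used instead of fancy-indexed `+=` so that repeated observations of the same entry all accumulate.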

slide-46
SLIDE 46

Matrix Completion

X = { ej(d) ek(d)ᵀ : 1 ≤ j ≤ d, 1 ≤ k ≤ d }.

X1, . . . , Xn is an independent sample from Π := Unif(X), and the observations Yj, j = 1, . . . , n, have the form

Yj = tr(Xjᵀ A0) + ξj (a “noisy matrix entry”),

where ξj, j = 1, . . . , n, is additive noise.

Since E(YX) = (1/d²) A0, a natural estimator of A0 is

Â = (d²/n) Σ_{j=1}^n Yj Xj.

Incorporate the low-rank assumption:

Âτ = argmin_{A ∈ R^{d×d}} [ ‖A − Â‖²_F / d² + τ‖A‖₁ ],

where ‖·‖₁ denotes the nuclear norm.
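With ‖A‖₁ the nuclear norm, the penalized problem has a closed-form solution: soft-threshold the singular values of Â at level τd²/2. This is the standard proximal-operator computation, sketched here rather than taken from the talk:

```python
import numpy as np

def nuclear_prox(A_hat, tau, d):
    """argmin_A ||A - A_hat||_F^2 / d^2 + tau * ||A||_1 (nuclear norm).
    Rescaling shows this is the prox of (tau*d^2/2)*||.||_1 at A_hat,
    i.e. soft-thresholding of the singular values."""
    U, s, Vt = np.linalg.svd(A_hat, full_matrices=False)
    s = np.maximum(s - tau * d * d / 2.0, 0.0)
    return (U * s) @ Vt
```

Large τ zeroes out small singular values, so the solution is low-rank; τ = 0 returns Â unchanged.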

slide-47
SLIDE 47

Matrix completion

What if the noise ξj is heavy-tailed (only Var(ξj) < ∞)?

slide-48
SLIDE 48

Matrix completion

What if the noise ξj is heavy-tailed (only Var(ξj) < ∞)? Replace Â with a “robust” estimator

R̂ = (d²/(nθ)) Σ_{j=1}^n ψ(θ Yj H(Xj)) and R̂τ = argmin_{A ∈ R^{d×d}} [ ‖A − R̂‖²_F / d² + τ‖A‖₁ ].

Here, H(X) = [[0, X], [Xᵀ, 0]] is the so-called self-adjoint dilation.
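The self-adjoint dilation is a standard device for reducing rectangular matrices to symmetric ones: H(X) is symmetric, its eigenvalues are ± the singular values of X, and ‖H(X)‖ = ‖X‖. A sketch:

```python
import numpy as np

def dilation(X):
    """Self-adjoint dilation H(X) = [[0, X], [X^T, 0]]."""
    m, k = X.shape
    return np.block([[np.zeros((m, m)), X],
                     [X.T, np.zeros((k, k))]])

X = np.array([[3.0, 0.0],
              [0.0, 1.0]])    # singular values 3 and 1
H = dilation(X)               # symmetric, eigenvalues -3, -1, 1, 3
```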
slide-49
SLIDE 49

Matrix completion

What if the noise ξj is heavy-tailed (only Var(ξj) < ∞)? Replace Â with a “robust” estimator

R̂ = (d²/(nθ)) Σ_{j=1}^n ψ(θ Yj H(Xj)) and R̂τ = argmin_{A ∈ R^{d×d}} [ ‖A − R̂‖²_F / d² + τ‖A‖₁ ].

Here, H(X) = [[0, X], [Xᵀ, 0]] is the so-called self-adjoint dilation.

Theorem (M., 2016). Take τ = Const · √((t + log 2d)/(nd)). Then

(1/d²) ‖R̂τ − H(A0)‖²_F ≤ (1 + √2/2)² · 2 rank(A0) · d (t + log 2d)/n

with probability ≥ 1 − e^(−t).

slide-50
SLIDE 50

Thank you for your attention!