SLIDE 1 Matrix-valued Chernoff Bounds and Applications
Anastasios Zouzias
University of Toronto
China Theory Week, September 2010
SLIDE 2 Introduction
Probability theory is the backbone of the analysis of randomized algorithms, and random sampling is its most fundamental technique. Several inequalities are available for analyzing the quality of approximation: Markov, Chebyshev, Chernoff, Azuma, etc. In this talk we discuss recent matrix-valued probabilistic inequalities and their applications.
Agenda:
1. Review real-valued probabilistic inequalities
2. Present recent matrix-valued variants
3. A low-rank matrix-valued inequality
4. Two applications: matrix sparsification and approximate matrix multiplication
SLIDE 8 Law of Large Numbers
Fundamental principle of random sampling: the Law of Large Numbers (LLN). It states that the empirical average converges to the true average. Classical form: for reals rather than matrices. Let X1, ..., Xt be independent copies of a random variable X. Goal: estimate the mean E[X] using the samples X1, ..., Xt. Approximate it by the empirical mean
$$\frac{1}{t}\sum_{i=1}^{t} X_i \approx \mathbb{E}[X].$$
How good is the approximation (non-asymptotically)? Question: Is there a matrix-valued LLN?
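To make the non-asymptotic question concrete, here is a minimal NumPy sketch (my illustration, not part of the original deck) that measures how fast empirical means of uniform samples concentrate around the true mean:

```python
# Empirical means of i.i.d. Uniform(0,1) samples vs. the true mean 0.5.
import numpy as np

rng = np.random.default_rng(0)
true_mean = 0.5
for t in [10, 100, 1000, 10000]:
    # 200 independent repetitions of the averaging experiment
    empirical_means = rng.uniform(0, 1, size=(200, t)).mean(axis=1)
    max_dev = np.abs(empirical_means - true_mean).max()
    print(f"t = {t:6d}   worst deviation over 200 trials = {max_dev:.4f}")
```

The worst-case deviation shrinks roughly like $1/\sqrt{t}$, which is exactly what the quantitative inequalities below make precise.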
SLIDE 10 Matrix-valued Random Variables
Let (Ω, F, P) be a probability space. A matrix-valued random variable is a measurable function M : Ω → R^{d×d}. Its expectation is a d×d matrix, denoted E[M] ∈ R^{d×d}. A self-adjoint matrix-valued random variable is a function M : Ω → S^{d×d}. Caveat: the entries may or may not be correlated with each other. In short, a matrix-valued random variable is a random matrix with (possibly) correlated entries.
SLIDE 11 Real-valued Probabilistic Inequalities
Lemma (Markov)
Let X ≥ 0 be a real-valued random variable (r.v.) and α > 0. Then
$$\Pr(X \geq \alpha) \leq \frac{\mathbb{E}[X]}{\alpha}.$$
Lemma (Chernoff-Hoeffding)
Let X1, X2, ..., Xt be i.i.d. copies of a real-valued r.v. X and ε > 0. If |X| ≤ γ, then
$$\Pr\left(\left|\frac{1}{t}\sum_{i=1}^{t} X_i - \mathbb{E}[X]\right| > \varepsilon\right) \leq 2\exp\left(-\frac{\varepsilon^2 t}{2\gamma^2}\right).$$
Lemma (Bernstein)
Let X1, X2, ..., Xt be i.i.d. copies of a real-valued r.v. X and ε > 0. If |X| ≤ γ and Var(X) ≤ ρ², then
$$\Pr\left(\left|\frac{1}{t}\sum_{i=1}^{t} X_i - \mathbb{E}[X]\right| > \varepsilon\right) \leq 2\exp\left(-\frac{\varepsilon^2 t}{2\left(\rho^2 + \gamma\varepsilon/3\right)}\right).$$
...and many more...
Question: What would the matrix-valued generalizations look like? (A worked example of using such a bound follows.)
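As a small worked example (a sketch assuming the reconstructed Chernoff-Hoeffding constant above; the original slide's exact constant may differ), one can solve the tail bound for the number of samples t needed for a target accuracy and failure probability:

```python
# Solve 2*exp(-eps^2 * t / (2*gamma^2)) <= delta for t.
import math

def hoeffding_samples(eps: float, gamma: float, delta: float) -> int:
    """Smallest t making the two-sided Hoeffding tail at most delta."""
    return math.ceil(2 * gamma**2 * math.log(2 / delta) / eps**2)

# e.g. accuracy 0.1, bound |X| <= 1, failure probability 0.1%:
print(hoeffding_samples(eps=0.1, gamma=1.0, delta=1e-3))  # -> 1521
```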
SLIDE 17 Real-valued to Matrix-valued
Is there a meaningful way to generalize the real-valued inequalities to matrix-valued ones? Would these inequalities be useful to us?
α, β ∈ R       A, B ∈ S^{d×d}      Comments
α > β          A ≽ B               A − B is p.s.d.
|α|            ‖A‖                 spectral norm
e^α            e^A                 matrix exponential
SLIDE 19 Matrix-valued Probabilistic Inequalities
Lemma (Markov)
Let X ≥ 0 be a real-valued r.v. and α > 0. Then
$$\Pr(X \geq \alpha) \leq \frac{\mathbb{E}[X]}{\alpha}.$$
Lemma (Matrix-valued Markov [AW02])
Let M ≽ 0 be a self-adjoint matrix-valued r.v. and α > 0. Then
$$\Pr(M \not\preceq \alpha \cdot I) \leq \frac{\mathrm{tr}(\mathbb{E}[M])}{\alpha}.$$
Remark: $\Pr(M \not\preceq \alpha \cdot I) = \Pr(\lambda_{\max}(M) > \alpha)$.
SLIDE 20 Matrix-valued Probabilistic Inequalities
Theorem (Chernoff)
Let X1, X2, ..., Xt be i.i.d. copies of a real-valued r.v. X and ε > 0. If |X| ≤ γ, then
$$\Pr\left(\left|\frac{1}{t}\sum_{i=1}^{t} X_i - \mathbb{E}[X]\right| > \varepsilon\right) \leq 2\exp\left(-\frac{\varepsilon^2 t}{2\gamma^2}\right).$$
Theorem (Matrix-valued Chernoff [AW02, WX08])
Let M1, M2, ..., Mt be i.i.d. copies of a self-adjoint matrix-valued r.v. M of size d, and let ε > 0. If ‖M‖ ≤ γ, then
$$\Pr\left(\left\|\frac{1}{t}\sum_{i=1}^{t} M_i - \mathbb{E}[M]\right\| > \varepsilon\right) \leq d\exp\left(-\frac{\varepsilon^2 t}{2\gamma^2}\right).$$
Remark: The proof is similar to the real-valued case (using the matrix exponential!).
Question: Can we remove the dependency on the dimension d?
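A quick Monte Carlo illustration (mine, not the talk's) of the matrix-valued statement: the spectral-norm deviation of the empirical average of i.i.d. random symmetric sign matrices decays as t grows:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50  # dimension of the matrices

def mean_deviation(t: int) -> float:
    # t i.i.d. symmetric sign matrices M = (G + G^T)/2, so E[M] = 0.
    G = rng.choice([-1.0, 1.0], size=(t, d, d))
    M = (G + G.transpose(0, 2, 1)) / 2
    return float(np.linalg.norm(M.mean(axis=0), ord=2))  # spectral norm

for t in [10, 100, 1000]:
    print(f"t = {t:5d}   ||(1/t) sum M_i - E[M]|| = {mean_deviation(t):.3f}")
```

For fixed d the deviation decays like $1/\sqrt{t}$, consistent with the $d\exp(-\varepsilon^2 t/2\gamma^2)$ tail.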
SLIDE 24 In general, no!
Set M = diag(g1, g2, ..., gd) with gi ∼ N(0,1) i.i.d. Then E[M] = 0_{d×d}, and
$$\left\|\frac{1}{t}\sum_{i=1}^{t} M_i - \mathbb{E}[M]\right\| \;\overset{d}{=}\; \frac{1}{\sqrt{t}}\,\left\|(g_1, g_2, \ldots, g_d)\right\|_\infty,$$
i.e., the maximum deviation of d independent Gaussian r.v.'s.
Question: Are there any natural assumptions that avoid the dependency on d? What if M has rank one [RV07, Rud99]? Low rank [MZ10]?
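The obstruction is easy to see numerically; a short sketch (my illustration) of how the maximum of d independent Gaussians grows with the dimension:

```python
# max_i |g_i| over d i.i.d. N(0,1) variables grows like sqrt(2 ln d),
# so no bound independent of d is possible for the diagonal example.
import numpy as np

rng = np.random.default_rng(2)
for d in [10, 100, 10000, 1000000]:
    max_abs = float(np.abs(rng.standard_normal(d)).max())
    print(f"d = {d:8d}   max|g_i| = {max_abs:.2f}   sqrt(2 ln d) = {np.sqrt(2*np.log(d)):.2f}")
```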
SLIDE 28 Low Rank Matrix-valued Chernoff
Let M1, M2, ..., Mt be i.i.d. copies of a self-adjoint matrix-valued r.v. M of size d.
Theorem ("Restated" Matrix-valued Chernoff)
If ‖M‖ ≤ γ a.s. and t = Ω(γ²/ε² · log d), then
$$\Pr\left(\left\|\frac{1}{t}\sum_{i=1}^{t} M_i - \mathbb{E}[M]\right\| > \varepsilon\right) \leq \frac{1}{\mathrm{poly}(d)}.$$
Theorem (Low Rank Matrix-valued Chernoff [MZ10])
If ‖M‖ ≤ γ, rank(M) = O(1) a.s., ‖E[M]‖ ≤ 1, and t = Ω(γ/ε² · log(γ/ε²)), then
$$\Pr\left(\left\|\frac{1}{t}\sum_{i=1}^{t} M_i - \mathbb{E}[M]\right\| > \varepsilon\right) \leq \frac{1}{\mathrm{poly}(t)}.$$
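A back-of-the-envelope comparison of the two sample bounds (my sketch; the hidden Ω() constants are ignored, so only the orders of magnitude are meaningful):

```python
import math

def t_restated(gamma: float, eps: float, d: int) -> float:
    return gamma**2 / eps**2 * math.log(d)            # gamma^2/eps^2 * log d

def t_low_rank(gamma: float, eps: float) -> float:
    return gamma / eps**2 * math.log(gamma / eps**2)  # gamma/eps^2 * log(gamma/eps^2)

gamma, eps, d = 100.0, 0.1, 10**6
print(f"restated: {t_restated(gamma, eps, d):.2e}   low-rank: {t_low_rank(gamma, eps):.2e}")
# restated: 1.38e+07   low-rank: 9.21e+04
```

When γ is large and the rank is constant, the low-rank bound replaces a γ² with γ and removes d entirely.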
SLIDE 30 Warm-up (Real-valued case)
Let's start by proving the real-valued case. Let X1, X2, ..., Xt be i.i.d. copies of a real-valued r.v. X and ε > 0. If |X| ≤ γ, then
$$\Pr\left(\left|\frac{1}{t}\sum_{i=1}^{t} X_i - \mathbb{E}[X]\right| > \varepsilon\right) \leq 2\exp\left(-\frac{\varepsilon^2 t}{2\gamma^2}\right).$$
Work with p-th moments: $E_p := \left(\mathbb{E}\left|\frac{1}{t}\sum_{i=1}^{t} X_i - \mathbb{E}[X]\right|^p\right)^{1/p}$.
Approach: give tight bounds on Ep, then mimic the real-valued argument in the matrix-valued setting.
Fact: If g ∼ N(0, σ²), then $(\mathbb{E}|g|^p)^{1/p} = O(\sigma\sqrt{p})$.
SLIDE 31 Proof (Warm-up)
Reduce the general r.v.'s Xi to Bernoulli signs ϵi ∼ ±1 (symmetrisation argument):
$$E_p := \left(\mathbb{E}_{X_i}\left|\frac{1}{t}\sum_{i=1}^{t}\left(X_i - \mathbb{E}[X]\right)\right|^p\right)^{1/p} \leq \frac{2}{t}\left(\mathbb{E}_{X_i}\mathbb{E}_{\epsilon_i}\left|\sum_{i=1}^{t}\epsilon_i X_i\right|^p\right)^{1/p}.$$
Next, bound $\mathbb{E}_{\epsilon_i}\left|\sum_{i=1}^{t}\epsilon_i X_i\right|^p$. By Khintchine's inequality,
$$\mathbb{E}_{\epsilon_i}\left|\sum_{i=1}^{t}\epsilon_i X_i\right|^p \leq (C\,p)^{p/2}\left(\sum_{i=1}^{t}X_i^2\right)^{p/2}.$$
SLIDE 34 Proof (Warm-up) - Continued
$$E_p \leq \frac{2}{t}\left(\mathbb{E}_{X_i}\mathbb{E}_{\epsilon_i}\left|\sum_{i=1}^{t}\epsilon_i X_i\right|^p\right)^{1/p} \qquad \text{(symmetrisation)}$$
$$\leq \frac{2C\sqrt{p}}{t}\left(\mathbb{E}_{X_i}\left(\sum_{i=1}^{t}X_i^2\right)^{p/2}\right)^{1/p} \qquad \text{(Khintchine)}$$
$$\leq \frac{2C\sqrt{p}}{t}\,\sqrt{t\gamma^2} \qquad \left(\textstyle\sum_{i=1}^{t}X_i^2 \leq t\gamma^2\right)$$
$$= \frac{2C\gamma\sqrt{p}}{\sqrt{t}}.$$
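The deck stops at the moment bound; the standard step that converts it into a tail bound (not shown on the slides, so this is my reconstruction) is Markov's inequality applied to |S|^p:

```latex
% S := (1/t) \sum_i X_i - E[X], with E_p = (E|S|^p)^{1/p} \le 2C\gamma\sqrt{p}/\sqrt{t}:
\Pr\left(|S| \ge e \cdot E_p\right)
  = \Pr\left(|S|^p \ge e^p E_p^p\right)
  \le \frac{\mathbb{E}|S|^p}{e^p E_p^p} = e^{-p}.
% Choosing p = \varepsilon^2 t / (2eC\gamma)^2 gives e \cdot E_p \le \varepsilon,
% hence \Pr(|S| \ge \varepsilon) \le \exp(-\varepsilon^2 t/(2eC\gamma)^2),
% a sub-Gaussian tail of Chernoff-Hoeffding type.
```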
SLIDE 38 Theorem (Low Rank Matrix-valued Chernoff [MZ10])
Let M1, M2, ..., Mt be i.i.d. copies of a self-adjoint matrix-valued r.v. M of size d. If ‖M‖ ≤ γ, rank(M) = O(1) a.s., ‖E[M]‖ ≤ 1, and t = Ω(γ/ε² · log(γ/ε²)), then
$$\Pr\left(\left\|\frac{1}{t}\sum_{i=1}^{t} M_i - \mathbb{E}[M]\right\| > \varepsilon\right) \leq \frac{1}{\mathrm{poly}(t)}.$$
Let $Z = \left\|\frac{1}{t}\sum_{i=1}^{t} M_i - \mathbb{E}[M]\right\|$. Goal: prove a bound on $(\mathbb{E}\,Z^p)^{1/p}$ like before (real case).
Main problem: there is no Khintchine inequality for the operator norm ‖·‖ as there is for the reals...
...however, there is a Khintchine inequality for the Schatten space.
SLIDE 41 Schatten Space
Let A ∈ R^{d×d}. Denote by $C_p^d$ the p-th Schatten space: R^{d×d} equipped with the norm
$$\|A\|_{C_p^d} := \left(\sum_{i=1}^{d}\sigma_i(A)^p\right)^{1/p},$$
where σi(A) are the singular values of A.
p = ∞: operator norm; p = 2: Frobenius (Hilbert-Schmidt) norm; p = 1: nuclear norm.
$\|A\| \leq \|A\|_{C_p^d} \leq (\mathrm{rank}(A))^{1/p}\,\|A\|$ for any p ≥ 1.
The $C_p^d$ space has a Khintchine inequality [LPP91, LP86]! For random signs ϵi ∼ ±1,
$$\left(\mathbb{E}_{\epsilon_i}\left\|\sum_{i=1}^{t}\epsilon_i M_i\right\|_{C_p^d}^p\right)^{1/p} \leq O(\sqrt{p})\,\left\|\left(\sum_{i=1}^{t}M_i^2\right)^{1/2}\right\|_{C_p^d}.$$
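A minimal sketch (mine) of the Schatten norm and the sandwich inequality above, via singular values:

```python
import numpy as np

def schatten_norm(A: np.ndarray, p: float) -> float:
    """p-th Schatten norm: the l_p norm of the singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.linalg.norm(s, ord=p))

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))  # rank 3
op = schatten_norm(A, np.inf)  # p = infinity recovers the operator norm
for p in [1, 2, 4, 8]:
    ok = op <= schatten_norm(A, p) <= np.linalg.matrix_rank(A)**(1/p) * op
    print(f"p = {p}:  ||A|| <= ||A||_Cp <= rank^(1/p) ||A|| holds: {ok}")
```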
SLIDE 45 What we proved before... and what we get now
Real-valued:
$$\left(\mathbb{E}_{X_i}\left|\frac{1}{t}\sum_{i=1}^{t}\left(X_i - \mathbb{E}[X]\right)\right|^p\right)^{1/p} \leq \frac{C\sqrt{p}}{t}\left(\mathbb{E}_{X_i}\left(\sum_{i=1}^{t}X_i^2\right)^{p/2}\right)^{1/p}$$
Lemma (Main Lemma [MZ10])
Let M1, ..., Mt be i.i.d. copies of a self-adjoint matrix-valued r.v. M with rank at most r almost surely. Then for every p ≥ 2,
$$\left(\mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^{t}M_i - \mathbb{E}[M]\right\|^p\right)^{1/p} \leq C\,(rt)^{1/p}\,\frac{\sqrt{p}}{t}\left(\mathbb{E}_{M_j}\left\|\sum_{j=1}^{t}M_j^2\right\|^{p/2}\right)^{1/p}.$$
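Roughly, the lemma yields the theorem by taking p logarithmic in t, a step the slides leave implicit; this is my sketch of the idea, not the paper's argument verbatim:

```latex
% With p = \Theta(\log(rt)), the rank-dependent factor becomes a constant:
(rt)^{1/p} = e^{\Theta(1)} = O(1),
% so the moment bound matches the real-valued one up to logarithmic factors,
% and Markov's inequality on Z^p (as in the warm-up) turns the moment bound
% into the 1/poly(t) failure probability of the theorem.
```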
SLIDE 47 Proof Sketch
Let
$$E_p := \left(\mathbb{E}\left\|\frac{1}{t}\sum_{i=1}^{t}M_i - \mathbb{E}[M]\right\|^p\right)^{1/p}$$
$$\leq \frac{2}{t}\left(\mathbb{E}_{M_i}\mathbb{E}_{\epsilon_i}\left\|\sum_{i=1}^{t}\epsilon_i M_i\right\|^p\right)^{1/p} \qquad\text{(symmetrisation)}$$
$$\leq \frac{2}{t}\left(\mathbb{E}_{M_i}\mathbb{E}_{\epsilon_i}\left\|\sum_{i=1}^{t}\epsilon_i M_i\right\|_{C_p^d}^p\right)^{1/p} \qquad\left(\|A\| \leq \|A\|_{C_p^d}\right)$$
$$\leq \frac{2C\sqrt{p}}{t}\left(\mathbb{E}_{M_i}\left\|\left(\sum_{i=1}^{t}M_i^2\right)^{1/2}\right\|_{C_p^d}^p\right)^{1/p} \qquad\text{(Khintchine)}$$
$$\leq \frac{2C\,(rt)^{1/p}\sqrt{p}}{t}\left(\mathbb{E}_{M_i}\left\|\sum_{i=1}^{t}M_i^2\right\|^{p/2}\right)^{1/p} \qquad\left(\|A\|_{C_p^d} \leq \mathrm{rank}(A)^{1/p}\,\|A\|\right)$$
The last step uses that $\sum_{i} M_i^2$ has rank at most rt, since each Mi has rank at most r.
SLIDE 55 SECOND PART: APPLICATIONS
SLIDE 56 Matrix Sparsification
A := [a dense integer matrix with every entry non-zero; running example]
SLIDE 57 Matrix Sparsification
Ã := [the same matrix with many entries zeroed out: a sparse approximation]
Goal: Given A ∈ R^{n×n} and ε > 0, find a sparse Ã s.t. ‖A − Ã‖ ≤ ε‖A‖.
SLIDE 59 Matrix Sparsification
Problem
Given A ∈ R^{n×n} and ε > 0, find a sparse Ã s.t. ‖A − Ã‖ ≤ ε‖A‖.
Achlioptas, McSherry [AM07]: sparsify each entry (i,j) independently w.p. ≈ |Aij|. Analysis: A − Ã is a random matrix with independent entries; Arora et al. [AHK06] simplified the analysis using real-valued Chernoff bounds.
Drineas, Z. [DZ10]: sample each entry (i,j) independently w.p. ≈ $A_{ij}^2/\|A\|_F^2$, improving the above results using matrix-valued Chernoff bounds (matrix-valued Bernstein).
SLIDE 60 Analysis via matrix-valued Chernoff
Define a matrix-valued r.v. M with E[M] = A. Each sample of M is a d×d matrix with only one non-zero entry. Let $p_{ij} = A_{ij}^2/\|A\|_F^2$ (the probability of selecting entry (i,j)) and set
$$\Pr\left(M = \frac{A_{ij}}{p_{ij}}\, e_i e_j^\top\right) = p_{ij}.$$
SLIDE 61-66 Analysis via matrix-valued Chernoff
[Illustration: the dense example matrix A, followed by the i.i.d. samples M1, M2, M3, ..., Mt, each a matrix with a single non-zero (rescaled) entry of A (shown: entries 14, 16, 15, ..., 13).]
SLIDE 67 Analysis via matrix-valued Chernoff
[Illustration: the resulting sparse matrix.]
Set $\tilde{A} := \frac{1}{t}\sum_{i=1}^{t} M_i$.
SLIDE 68 Analysis via matrix-valued Chernoff
Define a matrix-valued r.v. M with E[M] = A: each sample of M is a d×d matrix with only one non-zero entry, entry (i,j) selected with probability $p_{ij} = A_{ij}^2/\|A\|_F^2$ and rescaled,
$$\Pr\left(M = \frac{A_{ij}}{p_{ij}}\, e_i e_j^\top\right) = p_{ij}.$$
Bounding the number of samples t bounds the number of non-zero entries of Ã, and the matrix-valued Chernoff bound guarantees ‖Ã − A‖ ≤ ε‖A‖ once t is large enough. A runnable sketch of this scheme follows.
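Here is the runnable sketch promised above (my implementation of the sampling scheme; variable names are mine):

```python
import numpy as np

def sparsify(A: np.ndarray, t: int, seed: int = 4) -> np.ndarray:
    """Average of t i.i.d. single-entry matrices (A_ij/p_ij) e_i e_j^T."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    p = (A**2 / (A**2).sum()).ravel()      # p_ij = A_ij^2 / ||A||_F^2
    idx = rng.choice(n * m, size=t, p=p)   # t i.i.d. entry indices
    A_tilde = np.zeros_like(A, dtype=float)
    for k in idx:
        i, j = divmod(k, m)
        A_tilde[i, j] += A[i, j] / (p[k] * t)
    return A_tilde

A = np.random.default_rng(5).integers(1, 21, size=(20, 20)).astype(float)
A_tilde = sparsify(A, t=150)
rel_err = np.linalg.norm(A - A_tilde, 2) / np.linalg.norm(A, 2)
print(f"non-zeros: {np.count_nonzero(A_tilde)} of {A.size}, spectral error: {rel_err:.3f}")
```

By construction E[Ã] = A, and the matrix-valued bounds control how fast ‖Ã − A‖ shrinks as t grows.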
SLIDE 69 Approximate Matrix Multiplication
Problem
Given A ∈ R^{n×m}, B ∈ R^{n×p} and ε > 0, approximate the matrix product A⊤B: compute Ã ∈ R^{t×m} and B̃ ∈ R^{t×p} (t ≪ m, p, n) such that
$$\left\|\tilde{A}^\top\tilde{B} - A^\top B\right\| \leq \varepsilon\|A\|\|B\|.$$
Approaches: randomly project the columns, or sample rows non-uniformly.
Related work: many results w.r.t. the Frobenius norm [DKM06, Sar06, CW09]; "weak" bounds w.r.t. the spectral norm [DK01, DKM06, Sar06]; similar strong bounds for the special case A = B in [RV07].
SLIDE 72 Non-uniform Row Sampling
Recall that $A^\top B = \sum_{i=1}^{n} A_i^\top B_i \;(= \sum_{i=1}^{n} A_i \otimes B_i)$, where $A_i$, $B_i$ denote the i-th rows of A and B.
[Illustration: A⊤ (m×n) times B (n×p) written as a sum of n rank-one outer products of rows, each an m×p matrix.]
SLIDE 74 Non-uniform Row Sampling
Theorem
There exists a probability distribution $p_i$ over the row indices s.t. if we form a t×m matrix Ã and a t×p matrix B̃ by taking t i.i.d. samples (row indices) from $p_i$ with $t = \Omega(\tilde{r}/\varepsilon^2 \cdot \log(\tilde{r}/\varepsilon^2))$, then
$$\Pr\left(\left\|\tilde{A}^\top\tilde{B} - A^\top B\right\| \leq \varepsilon\|A\|\|B\|\right) \geq 1 - \frac{1}{\mathrm{poly}(\tilde{r})},$$
where $\tilde{r}$ is st.rank(A) + st.rank(B), and $\mathrm{st.rank}(A) := \|A\|_F^2/\|A\|^2 \leq \mathrm{rank}(A)$.
SLIDE 75 Proof Sketch
Define a distribution over R^{(m+p)×(m+p)} by
$$\Pr\left(X = \frac{1}{p_i}\begin{pmatrix} 0 & A_i^\top B_i \\ B_i^\top A_i & 0 \end{pmatrix}\right) = p_i, \qquad \mathbb{E}[X] = \begin{pmatrix} 0 & A^\top B \\ B^\top A & 0 \end{pmatrix}.$$
Every (matrix) sample has rank at most two, and $\|X\| \leq \tilde{r}_A + \tilde{r}_B\ (\leq \tilde{r})$ a.s.
Applying the Theorem with $t = \Omega(\tilde{r}/\varepsilon^2 \cdot \log(\tilde{r}/\varepsilon^2))$, we get indices $i_1, i_2, \ldots, i_t$ from [n] such that, with high probability,
$$\left\|\frac{1}{t}\sum_{j=1}^{t}\frac{1}{p_{i_j}}\begin{pmatrix} 0 & A_{i_j}^\top B_{i_j} \\ B_{i_j}^\top A_{i_j} & 0 \end{pmatrix} - \begin{pmatrix} 0 & A^\top B \\ B^\top A & 0 \end{pmatrix}\right\| \leq \varepsilon\|A\|\|B\|.$$
SLIDE 76 Conclusion and Open Problems
Matrix-valued probabilistic inequalities are powerful tools. We presented two applications: matrix sparsification and approximate matrix multiplication. More applications: graph sparsifiers [SS08], matrix completion [Rec09], bounding integrality gaps [Nem07], Cayley graph expansion, etc. Many connections remain unexplored. Matrix martingales - adaptive sampling? See [Tro10].
SLIDE 77
Thank You
SLIDE 78 References I
[AHK06] S. Arora, E. Hazan, and S. Kale. A Fast Random Sampling Algorithm for Sparsifying Matrices. In Proceedings of the International Workshop on Randomization and Approximation Techniques (RANDOM), pages 272-279, 2006.
[AM07] D. Achlioptas and F. McSherry. Fast Computation of Low-rank Matrix Approximations. J. ACM, 54(2):9, 2007.
[AW02] R. Ahlswede and A. Winter. Strong Converse for Identification via Quantum Channels. IEEE Transactions on Information Theory, 48(3):569-579, 2002.
[CW09] K. L. Clarkson and D. P. Woodruff. Numerical Linear Algebra in the Streaming Model. In Proceedings of the Symposium on Theory of Computing (STOC), pages 205-214, 2009.
[DK01] P. Drineas and R. Kannan. Fast Monte-Carlo Algorithms for Approximate Matrix Multiplication. In Proceedings of the Symposium on Foundations of Computer Science (FOCS), pages 452-459, 2001.
[DKM06] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo Algorithms for Matrices I: Approximating Matrix Multiplication. SIAM J. Comput., 36(1):132-157, 2006.
[DZ10] P. Drineas and A. Zouzias. A Note on Element-wise Matrix Sparsification via Matrix-valued Chernoff Bounds. Available at arXiv:1006.0407, June 2010.
[LP86] F. Lust-Piquard. Inégalités de Khintchine dans C_p (1 < p < ∞). C. R. Acad. Sci. Paris Sér. I Math., 303(7):289-292, 1986.
SLIDE 79 References II
[LPP91] F. Lust-Piquard and G. Pisier. Non Commutative Khintchine and Paley Inequalities. Arkiv för Matematik, 29(1-2):241-260, December 1991.
[MZ10] A. Magen and A. Zouzias. Low Rank Matrix-valued Chernoff Bounds and Approximate Matrix Multiplication, 2010.
[Nem07] A. Nemirovski. Sums of Random Symmetric Matrices and Quadratic Optimization under Orthogonality Constraints. Mathematical Programming, 109(2):283-317, 2007.
[Rec09] B. Recht. A Simpler Approach to Matrix Completion. Available at arXiv:0910.0651, October 2009.
[Rud99] M. Rudelson. Random Vectors in the Isotropic Position. J. Funct. Anal., 164(1):60-72, 1999.
[RV07] M. Rudelson and R. Vershynin. Sampling from Large Matrices: An Approach through Geometric Functional Analysis. J. ACM, 54(4):21, 2007.
[Sar06] T. Sarlós. Improved Approximation Algorithms for Large Matrices via Random Projections. In Proceedings of the Symposium on Foundations of Computer Science (FOCS), pages 143-152, 2006.
[SS08] D. A. Spielman and N. Srivastava. Graph Sparsification by Effective Resistances. In Proceedings of the Symposium on Theory of Computing (STOC), pages 563-568, 2008.
SLIDE 80 References III
[Tro10] J. A. Tropp. User-Friendly Tail Bounds for Sums of Random Matrices. Available at arXiv:1004.4389, April 2010.
[WX08] A. Wigderson and D. Xiao. Derandomizing the Ahlswede-Winter Matrix-valued Chernoff Bound using Pessimistic Estimators, and Applications. Theory of Computing, 4(1):53-76, 2008.