Subspace Embeddings and ℓp-Regression Using Exponential Random Variables
David P. Woodruff and Qin Zhang, IBM Research Almaden
COLT’13, June 12, 2013
Subspace embeddings: a distribution over linear maps Π : R^n → R^m, s.t. for any fixed d-dimensional subspace of R^n (denoted by M), w. pr. 0.99,

  ‖Mx‖_p ≤ ‖ΠMx‖_q ≤ κ·‖Mx‖_p   simultaneously for all vectors x ∈ R^d

(the norm q on the sketch side may differ from p; e.g., q = ∞ later).

Goal: to minimize the dimension m and the distortion κ.

Applications: ℓp-regression (next slide), low-rank approximation, quantile regression, …
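As a concrete (if classical) instance of this definition, a dense Gaussian map is an ℓ2 subspace embedding; here is a quick numerical sanity check, a minimal sketch assuming numpy (illustration only, not this paper's construction):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10000, 10, 400

# A fixed d-dimensional subspace of R^n, given by a basis matrix M.
M = rng.standard_normal((n, d))

# Dense Gaussian sketch: a classical l2 subspace embedding.
Pi = rng.standard_normal((m, n)) / np.sqrt(m)
PiM = Pi @ M

# Distortion over many random directions x in R^d: all ratios close to 1.
X = rng.standard_normal((d, 1000))
ratios = np.linalg.norm(PiM @ X, axis=0) / np.linalg.norm(M @ X, axis=0)
print(ratios.min(), ratios.max())
```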
Using an ℓp subspace embedding (SE) to solve ℓp regression: min_x ‖M̄x − b‖_p.

For convenience, let M̄ ∈ R^{n×(d−1)} and let M = [M̄, −b] ∈ R^{n×d}; n ≫ d. Let Π be an SE with dimension m, distortion κ, and embedding time t.

– Compute ΠM and find R s.t. MR has some good properties. (cost ↑ if m ↑)
– Sample m′ rows (guided by MR) to form Π_1, and output the solution x̃ of the sub-sampled problem min_x ‖Π_1·M̄x − Π_1·b‖_p. (cost ↑ if m′ ↑, or κ ↑)

Total running time ↑ if m ↑ or κ ↑ or t ↑. (A toy sketch-and-solve version appears below.)
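For intuition, here is a minimal sketch-and-solve variant in Python (assuming numpy/scipy; `l1_regression` and `sketch_and_solve` are hypothetical helper names). It skips the conditioning and sampling steps and solves the sketched ℓ1 problem directly as an LP, which by itself yields only a κ-approximation; the sampling step above is what sharpens this to (1 + ε):

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression(A, b):
    """Solve min_x ||Ax - b||_1 exactly as an LP over variables [x, t]."""
    n, d = A.shape
    c = np.concatenate([np.zeros(d), np.ones(n)])         # minimize sum_i t_i
    A_ub = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])  # -t <= Ax - b <= t
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * d + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d]

def sketch_and_solve(Mbar, b, Pi):
    """kappa-approximate l1 regression: solve the sketched problem."""
    return l1_regression(Pi @ Mbar, Pi @ b)
```

Here `Pi` can be any ℓ1 SE, e.g. the Π = S·D construction introduced a few slides below.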
Previous work on ℓ1 regression: min_x ‖M̄x − b‖_1 (M̄ ∈ R^{n×(d−1)}). Allowing a (1 + ε)-approximation:

– Sohler & Woodruff '11: O(nd^{ω−1}) + poly(d/ε).
– Clarkson et al. '12: O(nd log n) + poly(d/ε).
– Meng & Mahoney '12: O(nnz(M) log n) + poly(d/ε), where nnz(M) is the number of non-zero entries of M.

(ω < 3 is the exponent of matrix multiplication.)

This paper: further improves the ℓ1 SE, and thus also ℓ1 regression.
ℓp subspace embeddings: improved all previous results for every p ∈ [1, ∞) \ {2}.
(p = 2 has already been made optimal by Clarkson and Woodruff '12.)

In particular, for p = 1:

              Time                   Distortion               Dimension
  SW          nd^{ω−1}               Õ(d)                     Õ(d)
  C+          nd log d               Õ(d^{2+γ})               Õ(d^5)
  MM          nnz(M)                 Õ(d^3)                   Õ(d^5)
  This paper  nnz(M) + Õ(d^{2+γ})    Õ(d^2)                   Õ(d)
  This paper  nnz(M) + Õ(d^{2+γ})    Õ(d^{3/2} log^{1/2} n)   Õ(d)

  SW: Sohler & Woodruff '11; C+: Clarkson et al. '12; MM: Meng & Mahoney '12.
  ω < 3 is the exponent of matrix multiplication; γ = 0.0000001.

ℓp regression: improved all previous results for every p ∈ [1, ∞) \ {2}, with efficient distributed implementations.
(m, s)-ℓ2-SE (oblivious subspace embedding for the ℓ2 norm): a distribution over linear maps S : R^n → R^m, s.t. for any fixed d-dimensional subspace of R^n, w. pr. 0.99,

  1/2 · ‖Mx‖_2 ≤ ‖SMx‖_2 ≤ 3/2 · ‖Mx‖_2, ∀x ∈ R^d.

Here s = O(1) is the maximum number of non-zero entries in each column of S.

Our ℓp subspace embedding matrix: Π = S·D ∈ R^{m×n}, where S ∈ R^{m×n} is an ℓ2-SE and D = diag(1/u_1^{1/p}, …, 1/u_n^{1/p}) ∈ R^{n×n}, with the u_i i.i.d. exponentials. We use different ℓ2-SEs (from CW12, MM12, Nelson & Nguyen '12) for 1 ≤ p < 2 and for p > 2. ΠM can be computed in O(nnz(M)) time.
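A minimal sketch of this construction in Python (assuming numpy; `lp_embed` is a hypothetical name), using the simplest s = 1 CountSketch-style S. Applying Π touches each row of M exactly once, which is the O(nnz(M)) claim for dense rows:

```python
import numpy as np

def lp_embed(M, m, p, rng):
    """Compute Pi @ M for Pi = S . D: an s = 1 CountSketch-style S times
    D = diag(1/u_1^{1/p}, ..., 1/u_n^{1/p}) with u_i i.i.d. exponentials."""
    n, d = M.shape
    bucket = rng.integers(0, m, size=n)         # hash each row to one bucket
    sign = rng.choice([-1.0, 1.0], size=n)      # random signs of S
    u = rng.standard_exponential(n)             # exponentials for D
    scale = sign / u ** (1.0 / p)               # S's sign times D's diagonal
    PiM = np.zeros((m, d))
    np.add.at(PiM, bucket, scale[:, None] * M)  # one pass over M's rows
    return PiM

rng = np.random.default_rng(1)
M = rng.standard_normal((100000, 8))
print(lp_embed(M, m=2000, p=1.0, rng=rng).shape)   # (2000, 8)
```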
Exponential distribution: PDF f(x) = e^{−x}, CDF F(x) = 1 − e^{−x}.

(Max stability) If u_1, …, u_n are i.i.d. exponentials and α = (α_1, …, α_n) ∈ R_+^n, then

  max{α_1/u_1, …, α_n/u_n} ≃ ‖α‖_1/u, where u is exponential.

(Recently used by Andoni (2012) for approximating frequency moments.)

p-stable distributions: the previous tool of choice for subspace embeddings.
D_p is p-stable if, for any vector α = (α_1, …, α_n) ∈ R^n and v_1, …, v_n ~ D_p i.i.d., we have

  Σ_{i∈[n]} α_i·v_i ≃ ‖α‖_p·v, where v ~ D_p.

E.g., for p = 2 it is the Gaussian distribution; for p = 1 it is the Cauchy distribution.

Similar embedding matrix: Π = S·D′, where D′ = diag(v_1, …, v_n) ∈ R^{n×n} with the v_i i.i.d. p-stables.
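Max stability is an exact distributional identity (both sides have CDF e^{−‖α‖_1/t}), so it is easy to check empirically; a quick Monte Carlo comparison of quantiles, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = np.array([0.5, 2.0, 3.5, 1.0])     # arbitrary positive weights
trials = 200000

# Left side: max_i alpha_i / u_i, with u_1, ..., u_n i.i.d. exponential.
U = rng.standard_exponential((trials, alpha.size))
lhs = (alpha / U).max(axis=1)

# Right side: ||alpha||_1 / u, with a single exponential u.
rhs = alpha.sum() / rng.standard_exponential(trials)

qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print(np.quantile(lhs, qs))
print(np.quantile(rhs, qs))   # agrees up to sampling noise
```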
Why is the exponential distribution better?

– p-stable distributions exist only for p ∈ (0, 2], while exponentials give an ℓp-SE for all p ≥ 1.
– The lower tail of the reciprocal of an exponential is much thinner than that of a p-stable, while its upper tail is similar to p-stables.

[Figure: densities of the reciprocal of an exponential vs. a Cauchy (1-stable), comparing lower tails and upper tails.]
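Both tail claims follow directly from the CDFs: Pr[1/u > t] = 1 − e^{−1/t} ≈ 1/t, matching the Cauchy's ≈ 2/(πt) upper tail up to a constant, while Pr[1/u < s] = e^{−1/s} vanishes far faster than the Cauchy's ≈ 2s/π lower tail. A small check of these closed forms in plain Python:

```python
import math

def recip_exp_upper(t):   # Pr[1/u > t] = Pr[u < 1/t]
    return 1 - math.exp(-1 / t)

def cauchy_upper(t):      # Pr[|v| > t] for standard Cauchy v
    return 1 - 2 / math.pi * math.atan(t)

def recip_exp_lower(s):   # Pr[1/u < s] = Pr[u > 1/s]
    return math.exp(-1 / s)

def cauchy_lower(s):      # Pr[|v| < s]
    return 2 / math.pi * math.atan(s)

for t in [10, 100, 1000]:    # upper tails: both behave like c/t
    print(t, recip_exp_upper(t), cauchy_upper(t))
for s in [0.1, 0.05, 0.01]:  # lower tails: e^{-1/s} << 2s/pi
    print(s, recip_exp_lower(s), cauchy_lower(s))
```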
No underestimation. For each x ∈ R^d, let y = Mx ∈ R^n. Recall Π = S·D, where S ∈ R^{m×n} is an ℓ2-SE and D = diag(1/u_1, …, 1/u_n) ∈ R^{n×n} with the u_i exponential:

  ‖Πy‖_1 = ‖SDy‖_1 ≥ ‖SDy‖_2
         ≥ 1/2 · ‖Dy‖_2                    (property of the ℓ2-SE)
         ≥ 1/2 · ‖Dy‖_∞ ∼ 1/2 · ‖y‖_1/u   (u exponential, by max stability)
         ≥ Ω(1/(d log d)) · ‖y‖_1.        (w. pr. 1 − e^{−d log d}: lower tail of the reciprocal of an exponential)

This proves the bound "for each y in the subspace" w.h.p. To show it for all y simultaneously, we employ a standard net argument + a union bound. Similar arguments work for general 1 ≤ p < 2.

For d ≥ log n, the distortion can be improved to Õ(√(d log n)).
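The probability in the last step is just the exponential's tail Pr[u > t] = e^{−t} evaluated at t = d log d (constants suppressed, as on the slide):

```latex
\Pr[u > t] = e^{-t}
\quad\Longrightarrow\quad
\Pr\!\left[\frac{\|y\|_1}{u} \ge \frac{\|y\|_1}{d\log d}\right]
= \Pr[u \le d\log d] = 1 - e^{-d\log d} = 1 - d^{-d}.
```

A per-vector failure probability of d^{−d} is what leaves room for the union bound over the net.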
No overestimation. For each x ∈ R^d, let y = Mx ∈ R^n. Recall Π = S·D; compare D = diag(1/u_1, …, 1/u_n) (u_i exponential) against D′ = diag(v_1, …, v_n) ∈ R^{n×n} with the v_i i.i.d. Cauchy:

  ‖Πy‖_1 = ‖SDy‖_1 ≤ O(1) · ‖Dy‖_1   (an ℓ2-SE can only contract the ℓ1-norm)
         ≼ γ · ‖D′y‖_1               (stochastically, for a constant γ: the upper tails of the reciprocal of an exponential and of a Cauchy are similar)
         ≤ O(d log d) · ‖y‖_1.       (holds for all y = Mx w. pr. 0.99; previously known for Cauchy embeddings)

Similar arguments work for general 1 ≤ p < 2.
We actually can embed the subspace into ℓ∞ (this is how we handle p > 2). Recall Π = S·D ∈ R^{m×n}, with D = diag(1/u_1^{1/p}, …, 1/u_n^{1/p}), the u_i exponential, and dimension m = O(n^{1−2/p}·d^{1+2/p}) + poly(d):

  Ω(1/(d log d)^{1/p}) · ‖Mx‖_p ≤ ‖ΠMx‖_∞ ≤ O((d log d)^{1/p}) · ‖Mx‖_p.

Good news: ℓ∞-regression can be solved efficiently by LP (see the sketch below).

Main technical ingredient: bound the coordinates of the vectors in a subspace (an idea previously used in CW12).
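The LP behind the "good news" is the standard one: minimize t subject to −t ≤ (Ax − b)_i ≤ t. A minimal version assuming scipy (`linf_regression` is a hypothetical name); on the sketched problem it runs over the m embedded rows instead of the n original ones:

```python
import numpy as np
from scipy.optimize import linprog

def linf_regression(A, b):
    """Solve min_x ||Ax - b||_inf as an LP over variables [x, t]."""
    n, d = A.shape
    c = np.zeros(d + 1)
    c[-1] = 1.0                                  # minimize t
    ones = np.ones((n, 1))
    A_ub = np.block([[A, -ones], [-A, -ones]])   # Ax - b <= t, -(Ax - b) <= t
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * d + [(0, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d], res.x[-1]                  # minimizer and its l-inf error
```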
The distributed model: we have k machines and one central server.
– Each machine has a 2-way communication channel with the server.
– Each machine holds a subset of the rows of M̄ ∈ R^{n×(d−1)} and of b ∈ R^n.
– The goal is to solve the ℓp-regression problem min_x ‖M̄x − b‖_p.
Recall the ℓp regression framework: each machine applies the embedding to its submatrix, the small sketches are combined by the server, and the sub-sampled problem min_x ‖Π_1·M̄x − Π_1·b‖_p is solved by the server.

Running time on the server + total communication is sublinear in n: poly(d) for 1 ≤ p < 2, and n^{1−2/p}·poly(d) for p > 2.
– Previous results either have n/poly(d) communication, or a larger total running time of the system.
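Why the sketching step distributes cheaply: Π acts row-wise, so machine j's contribution to ΠM is Π restricted to j's rows times its block, and the server just sums k small m×d matrices. A toy simulation assuming numpy (`machine_sketch` is a hypothetical name), with each machine drawing the Π entries for its own rows independently:

```python
import numpy as np

def machine_sketch(M_block, m, p, rng):
    """One machine's contribution to Pi @ M for its own block of rows."""
    n_j, d = M_block.shape
    bucket = rng.integers(0, m, size=n_j)      # CountSketch bucket per row
    sign = rng.choice([-1.0, 1.0], size=n_j)   # random sign per row
    u = rng.standard_exponential(n_j)          # exponential per row
    out = np.zeros((m, d))
    np.add.at(out, bucket, (sign / u ** (1.0 / p))[:, None] * M_block)
    return out                                 # the m x d message it sends

k, m, p, d = 4, 2000, 1.0, 8
data_rng = np.random.default_rng(3)
blocks = [data_rng.standard_normal((25000, d)) for _ in range(k)]

# Server sums the k sketches: communication is k*m*d numbers, independent of n.
PiM = sum(machine_sketch(B, m, p, np.random.default_rng(10 + j))
          for j, B in enumerate(blocks))
print(PiM.shape)   # (2000, 8)
```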
We give ℓp subspace embeddings using exponential random variables, which improve all previous work on embedding distortions and dimensions in the O(nnz(M))-embedding-time setting.

Open question: what is the best distortion achievable with O(nnz(M)) + poly(d) embedding time and Õ(d) embedding dimension for ℓ1 subspace embeddings? Currently it is min{Õ(d^2), Õ(d^{3/2} log^{1/2} n)}. Is it possible to achieve Õ(d^{3/2}), or even Õ(d)? Can we prove any tradeoff lower bounds?
For the ℓ∞ embedding (p > 2), the high-level idea for showing no overestimation is similar to before: prove no overestimation of ‖ΠMx‖ for each x ∈ R^d w. pr. 1 − e^{−d log d}, then apply a standard net argument to extend this to all x ∈ R^d.

But the per-vector bound cannot be shown for arbitrary vectors! We should use the properties of a subspace.

Use the ℓp leverage scores of M: ℓ_i = ‖M_i‖_p, where M_i is the i-th row of M. (An idea in Clarkson & Woodruff '13.)

We can assume M is an Auerbach basis (since we prove the statement for all x ∈ R^d), which has the property Σ_{i∈[n]} ℓ_i^p ≤ d. Thus there are not many big ℓ_i's.

Also, for every x ∈ R^d and y = Mx (normalized so that ‖y‖_p = 1), we have |y_i| ≤ d^{1−1/p}·ℓ_i for all i ∈ [n]. Thus there are only a few big y_i's.

Using these properties, together with max stability, we can show that the embedding matrix Π works for all vectors in the subspace.
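A quick numerical illustration of the Σ_{i∈[n]} ℓ_i^p ≤ d property, assuming numpy: normalizing each column of M to unit ℓp norm already forces Σ_i ‖M_i‖_p^p = d exactly, since that sum is just Σ_{i,j} |M_ij|^p (a true Auerbach basis satisfies further dual-norm conditions not checked here):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, p = 5000, 6, 3.0

# Normalize each column of M to unit l_p norm.
M = rng.standard_normal((n, d))
M /= np.linalg.norm(M, ord=p, axis=0)

# Row scores l_i = ||M_i||_p; their p-th powers sum to exactly d.
l = np.linalg.norm(M, ord=p, axis=1)
print(np.sum(l ** p))                  # == d up to floating-point error

# Markov-style consequence: at most d/tau rows can have l_i^p > tau.
tau = 0.01
print(np.sum(l ** p > tau), "<=", d / tau)
```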