
Subspace Embeddings and ℓp-Regression Using Exponential Random Variables

David P. Woodruff and Qin Zhang, IBM Research Almaden

COLT’13, June 12, 2013


Subspace embeddings

Subspace embedding: a distribution over linear maps Π : R^n → R^m such that, for any fixed d-dimensional subspace of R^n (denoted by M), with probability 0.99,

‖Mx‖_p ≤ ‖ΠMx‖_q ≤ κ · ‖Mx‖_p simultaneously for all vectors x ∈ R^d

(possibly with q ≠ p, e.g., embedding ℓp into ℓ∞ for p > 2, as later slides show). Goal: minimize

  1. m: the dimension of the embedding,
  2. κ: the distortion of the embedding,
  3. t: the time to compute ΠM.


Applications:

ℓp-regression (next slide), low-rank approximation, quantile regression, . . .
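To make the definition concrete, here is a minimal numpy sketch (our illustration, not from the talk) that empirically estimates the distortion of a candidate Π over random vectors from a fixed subspace. The dense Gaussian Π is a hedged stand-in, not one of the paper's constructions, and p = q = 1 is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, p = 10_000, 10, 400, 1.0

M = rng.standard_normal((n, d))        # basis of a fixed d-dimensional subspace of R^n
Pi = rng.standard_normal((m, n)) / m   # stand-in dense embedding (not the paper's Pi)
PiM = Pi @ M

ratios = []
for _ in range(1_000):
    x = rng.standard_normal(d)
    ratios.append(np.linalg.norm(PiM @ x, p) / np.linalg.norm(M @ x, p))

# For a valid embedding (after rescaling), all ratios would lie in [1, kappa];
# the spread below is an empirical proxy for the distortion kappa.
print(f"empirical distortion: {max(ratios) / min(ratios):.2f}")
```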



All three matter: embedding time, dimension, and distortion

Using an ℓp subspace embedding (SE) to solve ℓp regression: min_{x ∈ R^d} ‖M̄x − b‖_p.

For convenience, let M̄ ∈ R^{n×(d−1)} and M = [M̄, −b] ∈ R^{n×d}, with n ≫ d. Let Π be an SE with dimension m, distortion κ, and embedding time t.

  1. Compute ΠM. (cost t)
  2. Use ΠM to compute a change-of-basis matrix R ∈ R^{d×d} s.t. MR has some good properties. (cost ↑ if m ↑)
  3. Given R, find a sampling matrix Π1 ∈ R^{m′×n}. (m′ ↑ if κ ↑)
  4. Compute the solution x̂ of the sub-sampled problem min_{x ∈ R^d} ‖Π1M̄x − Π1b‖_p. (cost ↑ if m′ ↑ or κ ↑)

Total running time ↑ if m ↑, κ ↑, or t ↑.
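A hedged end-to-end toy run of this pipeline for p = 1 (our illustration): a CountSketch-style map stands in for Π (the paper's construction additionally scales by exponentials, shown later), QR of ΠM supplies the change of basis, rows are sampled by the ℓ1 norms of MR, and IRLS stands in for an LP solver on the sub-sampled problem. All sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m, mp = 50_000, 8, 400, 2_000    # mp plays the role of m'

Mbar = rng.standard_normal((n, d - 1))
b = Mbar @ rng.standard_normal(d - 1) + rng.laplace(size=n)
M = np.hstack([Mbar, -b[:, None]])     # M = [Mbar, -b]

# Step 1: compute Pi M via a CountSketch-style l2-SE (one +/-1 per column of Pi).
rows = rng.integers(0, m, size=n)
signs = rng.choice([-1.0, 1.0], size=n)
PiM = np.zeros((m, d))
np.add.at(PiM, rows, signs[:, None] * M)

# Step 2: change of basis R from a QR factorization of Pi M, so MR is well-conditioned.
R = np.linalg.inv(np.linalg.qr(PiM)[1])

# Step 3: sample rows with probability proportional to the l1 norms of the rows of MR.
q = np.abs(M @ R).sum(axis=1)
probs = q / q.sum()
idx = rng.choice(n, size=mp, p=probs)
w = 1.0 / (mp * probs[idx])            # importance weights for the sampled rows

# Step 4: solve the weighted sub-sampled l1 problem; IRLS is a simple stand-in solver.
A, y = Mbar[idx] * w[:, None], b[idx] * w
x = np.linalg.lstsq(A, y, rcond=None)[0]
for _ in range(50):
    r = np.sqrt(np.maximum(np.abs(A @ x - y), 1e-8))
    x = np.linalg.lstsq(A / r[:, None], y / r, rcond=None)[0]

print("l1 cost of sketched solution:", np.abs(Mbar @ x - b).sum())
```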


ℓ1 regression

ℓ1 regression: min_{x ∈ R^d} ‖M̄x − b‖_1 (M̄ ∈ R^{n×(d−1)}).

  • Can be solved by linear programming, in time superlinear in n.
  • Clarkson 2005 gave an n · poly(d) solution.
  • . . .

Allow a (1 + ε)-approximation:

  • Sohler & Woodruff 2011 used an ℓ1 subspace embedding (SE), gave O(nd^{ω−1}) + poly(d/ε).
  • Clarkson et al. 2012 used a more structured ℓ1 SE, gave O(nd log n) + poly(d/ε).
  • Clarkson & Woodruff / Meng & Mahoney 2012 used other ℓ1 SEs, gave O(nnz(M) log n) + poly(d/ε), where nnz(M) is the number of non-zero entries of M.

This paper: further improves the ℓ1 SE, and thus also ℓ1 regression.

(ω < 3 is the exponent of matrix multiplication.)


Our results

ℓp subspace embeddings: improved all previous results for all p ∈ [1, ∞) \ {2}. (p = 2 has already been made optimal by Clarkson and Woodruff '12.)

ℓp regression: improved all previous results for all p ∈ [1, ∞) \ {2}, with efficient distributed implementations.

In particular, for p = 1:

              Time                   Distortion               Dimension
  SW          nd^{ω−1}               Õ(d)                     Õ(d)
  C+          nd log d               Õ(d^{2+γ})               Õ(d^5)
  MM          nnz(M)                 Õ(d^3)                   Õ(d^5)
  This paper  nnz(M) + Õ(d^{2+γ})    Õ(d^2)                   Õ(d)
  This paper  nnz(M) + Õ(d^{2+γ})    Õ(d^{3/2} log^{1/2} n)   Õ(d)

SW: Sohler & Woodruff '11; C+: Clarkson et al. '12; MM: Meng & Mahoney '12. ω < 3 is the exponent of matrix multiplication; γ = 0.0000001.


Our subspace embedding matrices

(m, s)-ℓ2-SE (oblivious subspace embedding for the ℓ2 norm): a distribution over linear maps S : R^n → R^m such that, for any fixed d-dimensional subspace of R^n, with probability 0.99, 1/2 · ‖Mx‖_2 ≤ ‖SMx‖_2 ≤ 3/2 · ‖Mx‖_2 for all x ∈ R^d. Here s = O(1) is the maximum number of non-zero entries in any column of S.

Our ℓp subspace embedding matrix: Π = S · D ∈ R^{m×n}, where S ∈ R^{m×n} is an ℓ2-SE (from CW12, MM12, or Nelson & Nguyen '12; different ℓ2-SEs for 1 ≤ p < 2 and p > 2), and D ∈ R^{n×n} = diag(1/u_1^{1/p}, …, 1/u_n^{1/p}) with u_1, …, u_n i.i.d. exponentials. ΠM can be computed in O(nnz(M)) time.
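A minimal numpy sketch of this construction (our illustration), assuming a CountSketch-style S, which is one of the cited ℓ2-SEs; the function name lp_embed is ours:

```python
import numpy as np

def lp_embed(M, m, p, rng):
    """Apply Pi = S * D to M in O(nnz(M)) time, where D scales row i of M
    by 1/u_i^{1/p} (u_i i.i.d. exponential) and S is a CountSketch-style
    l2-SE: one random +/-1 per column, placed in a random row."""
    n, d = M.shape
    u = rng.exponential(size=n)
    DM = M / u[:, None] ** (1.0 / p)           # D * M
    rows = rng.integers(0, m, size=n)          # target row for each coordinate
    signs = rng.choice([-1.0, 1.0], size=n)
    PiM = np.zeros((m, d))
    np.add.at(PiM, rows, signs[:, None] * DM)  # S * (D * M)
    return PiM

rng = np.random.default_rng(2)
M = rng.standard_normal((5_000, 6))
PiM = lp_embed(M, m=200, p=1.0, rng=rng)
print(PiM.shape)   # (200, 6)
```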


Two distributions

Exponential distribution: PDF f(x) = e^{−x}, CDF F(x) = 1 − e^{−x} (for x ≥ 0).

(Max stability) If u_1, …, u_n are independent exponentials and α = (α_1, …, α_n) ∈ R_+^n, then max{α_1/u_1, …, α_n/u_n} ≃ ‖α‖_1/u, where u is exponential. (Recently used by Andoni (2012) for approximating frequency moments.)
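A quick Monte Carlo sanity check of max stability (our illustration): the two samples below should agree in distribution.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = np.abs(rng.standard_normal(50))            # fixed positive vector
trials = 200_000

u = rng.exponential(size=(trials, alpha.size))
lhs = (alpha / u).max(axis=1)                      # max_i alpha_i / u_i
rhs = alpha.sum() / rng.exponential(size=trials)   # ||alpha||_1 / u

# Compare a few quantiles of the two empirical distributions.
for q in (0.25, 0.5, 0.75, 0.9):
    print(q, np.quantile(lhs, q), np.quantile(rhs, q))
```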

p-stable distribution (previously the tool of choice for subspace embeddings): D_p is p-stable if, for any vector α = (α_1, …, α_n) ∈ R^n and v_1, …, v_n i.i.d. ∼ D_p, we have Σ_{i∈[n]} α_i v_i ≃ ‖α‖_p · v, where v ∼ D_p. E.g., for p = 2 it is the Gaussian distribution; for p = 1 it is the Cauchy distribution.

A similar embedding matrix can be built from p-stables: Π = S · D′ ∈ R^{m×n}, where S ∈ R^{m×n} is an ℓ2-SE and D′ ∈ R^{n×n} = diag(v_1, …, v_n) with v_i i.i.d. p-stable.


The exponential distribution is superior to p-stables

Why is the exponential distribution better?

  1. p-stables only exist for p ∈ [1, 2], while the exponential can be used for ℓp-SEs for all p ≥ 1.
  2. The lower tail of the reciprocal of an exponential decreases faster than that of a p-stable, while its upper tail is similar to a p-stable's.

[Figure: lower and upper tails of the reciprocal of an exponential vs. the Cauchy (1-stable).]


Analysis of distortions


Analysis for ℓ1 subspace embedding

Recall Π = SD, where S ∈ R^{m×n} is an (O(d^{1.001}), O(1))-ℓ2-SE and D ∈ R^{n×n} = diag(1/u_1, …, 1/u_n) with u_i i.i.d. exponentials.

No underestimation. For each x ∈ R^d, let y = Mx ∈ R^n. Then

  ‖Πy‖_1 = ‖SDy‖_1 ≥ ‖SDy‖_2
         ≥ 1/2 · ‖Dy‖_2                    (property of the ℓ2-SE)
         ≥ 1/2 · ‖Dy‖_∞ ∼ 1/2 · ‖y‖_1/u    (u exponential, by max stability)
         ≥ Ω(1/(d log d)) · ‖y‖_1          (w. pr. 1 − e^{−d log d}, lower tail of the reciprocal of an exponential)

This proves the bound for each fixed y in the subspace w.h.p. To show it for all y simultaneously, we employ a standard net argument plus a union bound. Similar arguments work for general 1 ≤ p < 2.

For d ≥ log n, the distortion can be improved to Õ(d^{3/2} log^{1/2} n).

slide-27
SLIDE 27

11-1

Analysis for ℓ1 subspace embedding (cont.)

No overestimation. For each x ∈ Rd, let y = Mx ∈ Rn.

Πy1 = SDy1 ≤ O(1) · Dy1 (ℓ2-SE only contracts ℓ1-norm)

  • O(1) · γ
  • D′y
  • 1

(for a constant γ, upper tails of reciprocal of exponential and Cauchy are similar) ≤ O(d log d · y1) (holds for all y = Mx w.pr. 0.99, previously known)

Recall Π = SD:

= ×

Π ∈ Rm×n

  • O(d1.001), O(1)
  • − ℓ2-SE

S ∈ Rm×n D ∈ Rn×n

1/u1 1/un

ui: exponential

slide-28
SLIDE 28

11-2

Analysis for ℓ1 subspace embedding (cont.)

No overestimation. For each x ∈ Rd, let y = Mx ∈ Rn.

Πy1 = SDy1 ≤ O(1) · Dy1 (ℓ2-SE only contracts ℓ1-norm)

  • O(1) · γ
  • D′y
  • 1

(for a constant γ, upper tails of reciprocal of exponential and Cauchy are similar) ≤ O(d log d · y1) (holds for all y = Mx w.pr. 0.99, previously known)

Recall Π = SD:

= ×

Π ∈ Rm×n

  • O(d1.001), O(1)
  • − ℓ2-SE

S ∈ Rm×n D ∈ Rn×n

1/u1 1/un

ui: exponential γ

v1 vn

D′ ∈ Rn×n vi: Cauchy

slide-29
SLIDE 29

11-3

Analysis for ℓ1 subspace embedding (cont.)

Recall Π = SD, with S an (O(d^{1.001}), O(1))-ℓ2-SE and D = diag(1/u_1, …, 1/u_n), u_i i.i.d. exponentials; let D′ = diag(v_1, …, v_n) with v_i i.i.d. Cauchy.

No overestimation. For each x ∈ R^d, let y = Mx ∈ R^n. Then

  ‖Πy‖_1 = ‖SDy‖_1 ≤ O(1) · ‖Dy‖_1   (the ℓ2-SE expands the ℓ1-norm by at most a constant factor)
         ≤ O(1) · γ · ‖D′y‖_1        (for a constant γ: the upper tails of the reciprocal of an exponential and of a Cauchy are similar)
         ≤ O(d log d) · ‖y‖_1        (holds for all y = Mx w. pr. 0.99; previously known for Cauchys)

Similar arguments work for general 1 ≤ p < 2.


High level ideas for ℓp (p > 2)

Recall Π = SD, where now S is an (m, 1)-ℓ2-SE with m = Õ(n^{1−2/p} d^{1+2/p}) + poly(d), and D = diag(1/u_1^{1/p}, …, 1/u_n^{1/p}) with u_i i.i.d. exponentials.

We actually embed the subspace into ℓ∞:

  Ω(1/(d log d)^{1/p}) · ‖Mx‖_p ≤ ‖ΠMx‖_∞ ≤ O((d log d)^{1/p}) · ‖Mx‖_p.

Good news: ℓ∞-regression can be solved efficiently by LP (see the sketch below).

Main technical ingredients:

  1. No underestimation: max stability of exponentials, as before.
  2. No overestimation: more complicated; use leverage scores to upper-bound the coordinates of the vectors in the subspace (an idea previously used in CW12).
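Since ℓ∞-regression is a linear program, here is a minimal scipy sketch (our illustration) solving min_x ‖Ax − b‖_∞ directly; in the talk's setting one would apply this to the sketched problem ΠM̄, Πb.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
A = rng.standard_normal((200, 5))
b = A @ rng.standard_normal(5) + 0.1 * rng.standard_normal(200)

# min_{x,t} t  subject to  Ax - b <= t*1  and  -(Ax - b) <= t*1
m, d = A.shape
c = np.r_[np.zeros(d), 1.0]                      # objective: minimize t
A_ub = np.block([[A, -np.ones((m, 1))],
                 [-A, -np.ones((m, 1))]])
b_ub = np.r_[b, -b]
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * d + [(0, None)])  # x free, t >= 0
x, t = res.x[:d], res.x[-1]
print("l_inf error:", t)
```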


Distributed implementation of ℓp-regression

The distributed model: we have k machines and one central server.

  – Each machine has a two-way communication channel with the server.
  – Each machine holds a subset of the rows of M̄ ∈ R^{n×(d−1)} and of b ∈ R^n.
  – Goal: solve the ℓp-regression min_{x ∈ R^d} ‖M̄x − b‖_p.

[Figure: machines S1, S2, S3, …, Sk, each linked to the central server C.]


Distributed implementation of ℓp-regression (cont.)

Recall the ℓp-regression framework:

  1. Compute ΠM. (by the machines: each computes ΠMi, where Mi is its local submatrix; see the sketch below)
  2. Use ΠM to compute a matrix R ∈ R^{d×d} s.t. MR has good properties. (by the server)
  3. Given R, find a sampling matrix Π1 ∈ R^{m′×n}. (by the machines: machine i computes its piece Π1,i)
  4. Solve the sub-sampled problem min_{x ∈ R^d} ‖Π1M̄x − Π1b‖_p. (by the server)

Total running time of the system ≈ running time of the centralized version + communication cost; most of the work is distributed across the k machines.

Running time on the server + total communication is sublinear in n: poly(d) for 1 ≤ p < 2, and n^{1−2/p} · poly(d) for p > 2. Previous results either have n/poly(d) communication or only work for 1 ≤ p ≤ 2.
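A toy single-process simulation of step 1 (our illustration), assuming shared randomness so that every machine can evaluate its columns of Π locally, with a CountSketch-style S and p = 1:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, m, k = 12_000, 6, 300, 4
M = rng.standard_normal((n, d))

# Shared randomness: every machine can generate Pi's column for any global row
# from a common seed (precomputed arrays stand in for that here).
rows = rng.integers(0, m, size=n)       # CountSketch target row per coordinate
signs = rng.choice([-1.0, 1.0], size=n)
u = rng.exponential(size=n)             # exponentials for D (p = 1)

blocks = np.array_split(np.arange(n), k)   # machine i holds rows blocks[i] of M

def local_sketch(idx):
    """Machine-local computation of Pi * M_i (an m x d matrix)."""
    out = np.zeros((m, d))
    contrib = (signs[idx] / u[idx])[:, None] * M[idx]
    np.add.at(out, rows[idx], contrib)
    return out

# Each machine sends its m x d sketch; the server simply adds them up.
PiM = sum(local_sketch(idx) for idx in blocks)
print(PiM.shape)   # (300, 6) -- communication is k*m*d numbers, independent of n
```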


Conclusions and open problems

  1. We have proposed algorithms for ℓp (p ∈ [1, ∞) \ {2}) subspace embeddings using exponential random variables, which improve all previous work on embedding distortion and dimension, given the optimal running time.
  2. Improved subspace embeddings also lead to improved ℓp regression.
  3. Our algorithms can be efficiently implemented in the distributed setting.

Open problem: what is the best distortion given O(nnz(M) + poly(d)) embedding time and Õ(d) embedding dimension, for ℓ1 subspace embedding? Currently it is min{Õ(d^2), Õ(d^{3/2} log^{1/2} n)}. Is it possible to achieve Õ(d^{3/2}), or even Õ(d)? Can we prove any trade-off lower bounds?

slide-38
SLIDE 38

16-1

Thank you! Questions?


High level ideas for ℓp (p > 2) (cont.)

The high-level idea for no overestimation is similar to before: show no overestimation of ΠMx for each fixed x ∈ R^d w. pr. 1 − e^{−d log d}, then apply a standard net argument to extend this to all x ∈ R^d.

But this cannot be shown for arbitrary vectors: we must use the structure of the subspace.

Use the ℓp leverage scores of M: ℓ_i^p = ‖M_i‖_p^p, where M_i is the i-th row of M. (An idea from Clarkson & Woodruff '13.)

We can assume M is an Auerbach basis (since we prove the bound for all x ∈ R^d), which has the property Σ_{i∈[n]} ℓ_i^p ≤ d. Thus there are not many big ℓ_i.

Also, for all x ∈ R^d and y = Mx (normalized so that ‖y‖_p = 1), we have |y_i| ≤ d^{1−1/p} · ℓ_i for all i ∈ [n]. Thus there are only a few big y_i's.

Using these properties together with max stability, one can design an embedding matrix Π that works for all vectors in the subspace.
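A small numpy sketch of the leverage-score computation (our illustration); an orthonormal basis serves as a hedged stand-in for the Auerbach basis, which we do not compute:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, p = 5_000, 8, 3.0

A = rng.standard_normal((n, d))
M, _ = np.linalg.qr(A)   # orthonormal basis: a stand-in for an Auerbach basis

# l_i^p = ||M_i||_p^p, one score per row M_i
scores = (np.abs(M) ** p).sum(axis=1)

# For an Auerbach basis, sum_i l_i^p <= d, so only a few scores can be large
# (by Markov: at most d/t rows have l_i^p > t). The stand-in behaves similarly.
print("sum of l_i^p:", scores.sum())
print("rows with l_i^p > 10x mean:", int((scores > 10 * scores.mean()).sum()))
```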