SLIDE 1
Sparser Johnson-Lindenstrauss Transforms
Jelani Nelson
Princeton
February 16, 2012
joint work with Daniel Kane (Stanford)
SLIDES 2-5 Random Projections
- x ∈ ℝ^d, d huge
- store y = Sx, where S is a k × d matrix (compression)
- compressed sensing (recover x from y when x is (near-)sparse)
- group-testing (as above, but Sx is Boolean multiplication)
- recover properties of x (entropy, heavy hitters, ...)
- approximate norm preservation (want ‖y‖₂ ≈ ‖x‖₂)
- motif discovery (slightly different; randomly project discrete x onto a subset of its coordinates) [Buhler-Tompa]
- In many of these applications, random S is either required or obtains better parameters than deterministic constructions.
SLIDES 6-7 Metric Johnson-Lindenstrauss lemma
Metric JL (MJL) Lemma, 1984
Every set of n points in Euclidean space can be embedded into O(ε^{-2} log n)-dimensional Euclidean space so that all pairwise distances are preserved up to a 1 ± ε factor.
Uses:
- Speed up geometric algorithms by first reducing the dimension of the input [Indyk-Motwani, 1998], [Indyk, 2001]
- Low-memory streaming algorithms for linear algebra problems [Sarlós, 2006], [LWMRT, 2007], [Clarkson-Woodruff, 2009]
- Essentially equivalent to RIP matrices from compressed sensing [Baraniuk et al., 2008], [Krahmer-Ward, 2011] (used for recovery of sparse signals)
SLIDES 8-10 How to prove the JL lemma
Distributional JL (DJL) Lemma
For any 0 < ε, δ < 1/2 there exists a distribution D_{ε,δ} on ℝ^{k×d} for k = O(ε^{-2} log(1/δ)) so that for any x of unit norm,
  Pr_{S∼D_{ε,δ}} [ | ‖Sx‖₂² − 1 | > ε ] < δ.
Proof of MJL: Set δ = 1/n² in DJL and let x be the difference vector of some pair of points. Union bound over the (n choose 2) pairs.
Theorem (Alon, 2003)
For every n, there exists a set of n points requiring target dimension k = Ω((ε^{-2}/log(1/ε)) log n).
Theorem (Jayram-Woodruff, 2011; Kane-Meka-N., 2011)
For DJL, k = Θ(ε^{-2} log(1/δ)) is optimal.
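Since the slide does not fix a particular D_{ε,δ}, the following is a minimal numeric sketch of the DJL ⇒ MJL argument using the Gaussian instantiation (i.i.d. N(0, 1/k) entries); the constant 8 in k is an illustrative assumption, not from the talk.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 30, 1000, 0.5
delta = 1.0 / n**2                        # per-pair failure probability for DJL
k = int(8 / eps**2 * np.log(1 / delta))   # k = O(eps^-2 log(1/delta)); constant illustrative

X = rng.normal(size=(n, d))               # n arbitrary points in R^d
S = rng.normal(scale=1 / np.sqrt(k), size=(k, d))  # Gaussian instantiation of D_{eps,delta}

# Union bound over the (n choose 2) difference vectors:
ok = all(
    abs(np.linalg.norm(S @ (X[a] - X[b]))**2 / np.linalg.norm(X[a] - X[b])**2 - 1) <= eps
    for a, b in itertools.combinations(range(n), 2)
)
print(ok)  # True except with probability <= (n choose 2) * delta < 1/2
```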
SLIDES 11-12 Proving the JL lemma
Older proofs
- [Johnson-Lindenstrauss, 1984], [Frankl-Maehara, 1988]: Random rotation, then projection onto first k coordinates.
- [Indyk-Motwani, 1998], [Dasgupta-Gupta, 2003]: Random matrix with independent Gaussian entries.
- [Achlioptas, 2001]: Independent ±1 entries.
- [Clarkson-Woodruff, 2009]: O(log(1/δ))-wise independent ±1 entries.
- [Arriaga-Vempala, 1999], [Matousek, 2008]: Independent entries having mean 0, variance 1/k, and subgaussian tails.
Downside: Performing the embedding is dense matrix-vector multiplication, O(k · ‖x‖₀) time.
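To make the downside concrete, here is a small sketch with the dense ±1 instantiation (sizes illustrative): even when x is sparse, computing Sx costs k multiplications per nonzero of x.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20_000, 300
S = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)  # dense +-1 construction

x = np.zeros(d)
nz = rng.choice(d, size=20, replace=False)   # ||x||_0 = 20
x[nz] = rng.normal(size=nz.size)

y = S[:, nz] @ x[nz]          # best case: k * ||x||_0 multiplications
print(np.allclose(y, S @ x))  # True: same embedding, but still k ops per nonzero
```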
SLIDES 13-14 Fast JL Transforms
- [Ailon-Chazelle, 2006]: x → PHDx, O(d log d + k³) time. P is a random sparse matrix, H is Hadamard, D has random ±1 on the diagonal.
- [Ailon-Liberty, 2008]: O(d log k + k²) time, also based on the fast Hadamard transform.
- [Ailon-Liberty, 2011] and [Krahmer-Ward, 2011]: O(d log d) for MJL, but with suboptimal k = O(ε^{-2} log n log⁴ d).
Downside: Slow to embed sparse vectors: running time is Ω(min{k · ‖x‖₀, d log d}).
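A minimal sketch of the x → PHDx recipe, with the simplifying assumption that P is plain row sampling rather than the sparse P of [Ailon-Chazelle, 2006]; it illustrates the O(d log d) Walsh-Hadamard step, not the exact distributions from these papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def fwht(a):
    """Unnormalized fast Walsh-Hadamard transform, O(d log d); len(a) must be a power of 2."""
    a = a.copy()
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            u, v = a[i:i + h].copy(), a[i + h:i + 2 * h].copy()
            a[i:i + h], a[i + h:i + 2 * h] = u + v, u - v
        h *= 2
    return a

d, k = 2**12, 256
x = rng.normal(size=d)
x /= np.linalg.norm(x)
D = rng.choice([-1.0, 1.0], size=d)           # random signs on the diagonal
rows = rng.choice(d, size=k, replace=False)   # simplified "P": sample k rows of H
y = np.sqrt(d / k) * (fwht(D * x) / np.sqrt(d))[rows]
print(np.linalg.norm(y)**2)                   # close to 1 = ||x||_2^2
```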
SLIDE 15 Where Do Sparse Vectors Show Up?
- Document as bag of words: x_i = number of occurrences of word i. Compare documents using cosine similarity. d = lexicon size; most documents aren't dictionaries.
- Network traffic: x_{i,j} = #bytes sent from i to j. d = 2^64 (2^256 in IPv6); most servers don't talk to each other.
- User ratings: x_i is a user's score for movie i on Netflix. d = #movies; most people haven't rated all movies.
- Streaming: x receives a stream of updates of the form "add v to x_i". Maintaining Sx requires calculating v · Se_i.
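A minimal sketch of the streaming point, using a sparse S of the block form defined later in the talk (sizes illustrative; in practice h and σ would be hash functions rather than stored arrays): each update "add v to x_i" touches only the s nonzeros of S e_i.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, q = 10**5, 8, 128                        # q = k/s, so k = 1024 rows
h = rng.integers(0, q, size=(s, d))            # hash location per (block, coordinate)
sigma = rng.choice([-1.0, 1.0], size=(s, d))   # sign per (block, coordinate)
y = np.zeros(s * q)                            # maintained sketch y = Sx

def update(i, v):
    """Process "add v to x_i" by adding v * S e_i to y: O(s) work."""
    for r in range(s):
        y[r * q + h[r, i]] += v * sigma[r, i] / np.sqrt(s)

for i, v in [(3, 1.0), (17, -2.5), (3, 0.5)]:  # a toy update stream
    update(i, v)
print(np.linalg.norm(y)**2)                    # approximates ||x||_2^2 = 1.5^2 + 2.5^2
```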
SLIDE 17
Sparse JL transforms
One way to embed sparse vectors faster: use sparse matrices. s = #non-zero entries per column in embedding matrix (so embedding time is s · x0) reference value of s type [JL84], [FM88], [IM98], . . . k ≈ 4ε−2 log(1/δ) dense [Achlioptas01] k/3 sparse Bernoulli [WDALS09] no proof hashing [DKS10] ˜ O(ε−1 log3(1/δ)) hashing [KN10a], [BOR10] ˜ O(ε−1 log2(1/δ)) ” [KN12] O(ε−1 log(1/δ)) hashing (random codes)
SLIDES 18-19 Other related work
- CountSketch of [Charikar-Chen-FarachColton] gives s = O(log(1/δ)) (see [Thorup-Zhang])
- Can recover (1 ± ε)‖x‖₂² from Sx, but not as ‖Sx‖₂² (not an embedding into ℓ₂)
- Not applicable in certain situations, e.g. in some nearest neighbor data structures, and when learning classifiers over projected vectors via stochastic gradient descent
SLIDES 20-22 Sparse JL Constructions
[DKS, 2010]: s = Θ̃(ε^{-1} log²(1/δ))
[this work, graph construction]: s = Θ(ε^{-1} log(1/δ))
[this work, block construction with blocks of size k/s]: s = Θ(ε^{-1} log(1/δ))
SLIDE 23 Sparse JL Constructions (in matrix form)
[Figure: the two constructions drawn as k × d matrices; in the second, the k rows are partitioned into blocks of height k/s.]
Each black cell is ±1/√s at random.
SLIDE 24 Sparse JL Constructions (nicknames)
"Graph" construction: s nonzero cells per column, anywhere among the k rows.
"Block" construction: one nonzero cell per column in each of the s blocks of height k/s.
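A minimal NumPy sketch of the two constructions as k × d matrices (sizes illustrative): the "graph" version places s signed cells per column anywhere among the k rows; the "block" version places one signed cell per column in each of the s blocks of height k/s.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, s = 50, 24, 4                            # k/s = 6

def graph_construction():
    S = np.zeros((k, d))
    for j in range(d):
        rows = rng.choice(k, size=s, replace=False)            # s distinct rows
        S[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return S

def block_construction():
    S = np.zeros((k, d))
    for j in range(d):
        for r in range(s):                                     # one cell per block
            row = r * (k // s) + rng.integers(k // s)
            S[row, j] = rng.choice([-1.0, 1.0]) / np.sqrt(s)
    return S

for S in (graph_construction(), block_construction()):
    print((S != 0).sum(axis=0))   # every column has exactly s nonzeros
```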
SLIDES 25-27 Sparse JL notation (block construction)
- Let h(j, r), σ(j, r) be the random hash location and random sign for the copy of x_j in the rth block.
- The ith entry of the rth block of Sx is (1/√s) · Σ_{j : h(j,r)=i} x_j · σ(j, r).
- ‖Sx‖₂² = ‖x‖₂² + (1/s) · Σ_{r=1}^{s} Σ_{i≠j} x_i x_j σ(i, r) σ(j, r) · 1_{h(i,r)=h(j,r)}
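A small numeric check of the expansion above, materializing S from h and σ and comparing ‖Sx‖₂² against ‖x‖₂² plus the cross terms (sizes illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, q = 40, 5, 8                     # q = k/s, k = 40
h = rng.integers(0, q, size=(s, d))
sigma = rng.choice([-1.0, 1.0], size=(s, d))
x = rng.normal(size=d)

S = np.zeros((s * q, d))               # materialize S from h and sigma
for r in range(s):
    S[r * q + h[r], np.arange(d)] = sigma[r] / np.sqrt(s)
lhs = np.linalg.norm(S @ x)**2

cross = sum(x[i] * x[j] * sigma[r, i] * sigma[r, j]
            for r in range(s) for i in range(d) for j in range(d)
            if i != j and h[r, i] == h[r, j])
print(np.isclose(lhs, x @ x + cross / s))  # True
```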
SLIDES 28-29 Sparse JL via Codes
[Figure: the two constructions as k × d matrices, as before.]
- Graph construction: constant-weight binary code of weight s (each column is a codeword indicating its s nonzero rows).
- Block construction: code over a q-ary alphabet, q = k/s (each column lists its hash location in each of the s blocks).
- Thm: Just need distance s − O(s²/k) (block construction); 2s − O(s²/k) for the graph construction.
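A sketch of the code view for the block construction: each column is a length-s codeword over the alphabet [k/s], and "good code" means every pair of codewords agrees on only O(s²/k) coordinates. Drawing the code at random, as below, is exactly what the Chernoff + union bound on a later slide analyzes (parameters illustrative).

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
d, s, q = 200, 16, 64                     # q = k/s; expected agreements s/q = s^2/k
code = rng.integers(0, q, size=(d, s))    # one random codeword (column) per coordinate

worst = max((code[a] == code[b]).sum() for a, b in combinations(range(d), 2))
print(worst, "worst pairwise agreements; s^2/k =", s**2 / (s * q))
```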
SLIDES 30-31 Analysis (block construction)
- η_{i,j,r} indicates whether i and j collide in the rth block.
- ‖Sx‖₂² = ‖x‖₂² + Z, where
  Z = (1/s) Σ_r Z_r,   Z_r = Σ_{i≠j} x_i x_j σ(i, r) σ(j, r) η_{i,j,r}
- Z is a quadratic form in σ, so apply known moment bounds for quadratic forms.
SLIDE 32 Analysis
Theorem (Hanson-Wright, 1971)
σ₁, ..., σ_n independent ±1, B ∈ ℝ^{n×n} symmetric. For λ > 0,
  Pr[ | σᵀBσ − trace(B) | > λ ] < e^{−C·min{λ²/‖B‖²_F, λ/‖B‖₂}}
Reminder: ‖B‖²_F = Σ_{i,j} B²_{i,j}; ‖B‖₂ is the largest magnitude of an eigenvalue of B.
SLIDES 33-34 Analysis
  Z = (1/s) · Σ_{r=1}^{s} Σ_{i≠j} x_i x_j σ(i, r) σ(j, r) η_{i,j,r} = σᵀBσ,
where B is block diagonal:
  B = (1/s) · diag(B₁, B₂, ..., B_s)
SLIDES 35-37 Frobenius norm bound
  B = (1/s) · diag(B₁, ..., B_s)
  ‖B‖²_F = (1/s²) · Σ_{i≠j} x_i² x_j² · (#times i, j collide) ≤ O(1/k) · ‖x‖₂⁴ = O(1/k)   (good code!)
SLIDES 38-42 Operator norm bound

  B_r = (1/s) · [ x₁²              x₁x₂η_{1,2,r}   ⋯   x₁x_dη_{1,d,r}
                  x₂x₁η_{2,1,r}    x₂²             ⋯   x₂x_dη_{2,d,r}
                  ⋮                 ⋮               ⋱   ⋮
                  x_dx₁η_{d,1,r}   ⋯   x_dx_{d−1}η_{d,d−1,r}   x_d² ]
        − (1/s) · diag(x₁², x₂², ..., x_d²)
      = (1/s) · (S_r − D)

- ‖D‖₂ = ‖x‖²_∞ ≤ 1
- S_r = Σ_{i=1}^{k/s} u_{r,i} u_{r,i}ᵀ, where u_{r,i} is the projection of x onto the coordinates hashing to i in the rth block; the u_{r,i} are the eigenvectors of S_r, so ‖S_r‖₂ = max_i ‖u_{r,i}‖₂² ≤ ‖x‖₂² = 1
- ‖B_r‖₂ ≤ (1/s) · max{‖S_r‖₂, ‖D‖₂} ≤ 1/s
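A numeric check of both norm bounds on the block-diagonal B, with a random h rather than a code, so the Frobenius bound holds on average (sizes illustrative; note the zeroed diagonal also gives trace(B) = 0, as needed for Hanson-Wright).

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, q = 30, 6, 10                        # k = s*q = 60
h = rng.integers(0, q, size=(s, d))
sigma = rng.choice([-1.0, 1.0], size=(s, d))
x = rng.normal(size=d)
x /= np.linalg.norm(x)

B = np.zeros((s * d, s * d))               # B = (1/s) diag(B_1, ..., B_s)
for r in range(s):
    eta = (h[r][:, None] == h[r][None, :]).astype(float)   # collision indicators
    Br = np.outer(x, x) * eta
    np.fill_diagonal(Br, 0.0)              # only i != j terms appear in Z
    B[r * d:(r + 1) * d, r * d:(r + 1) * d] = Br / s

print((B**2).sum(), "~ O(1/k) =", 1 / (s * q))            # ||B||_F^2
print(np.abs(np.linalg.eigvalsh(B)).max(), "<=", 1 / s)   # ||B||_2 <= 1/s
```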
SLIDES 43-44 Wrapping up analysis
- ‖B‖₂ ≤ 1/s, ‖B‖²_F = O(1/k)
- Apply Hanson-Wright: Pr[|Z| > ε] < e^{−C′·min{ε²k, εs}}
- Take k = Ω(log(1/δ)/ε²) and s = Ω(log(1/δ)/ε). QED
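A Monte Carlo sanity check of the final guarantee, with illustrative constants in k and s: the empirical failure probability Pr[|‖Sx‖₂² − 1| > ε] should be small, on the order of δ (only a sanity check, since the constant C′ is unspecified).

```python
import numpy as np

rng = np.random.default_rng(0)
d, eps, delta = 500, 0.25, 0.01
s = int(np.ceil(2 * np.log(1 / delta) / eps))           # s = Theta(eps^-1 log(1/delta))
q = int(np.ceil(2 * np.log(1 / delta) / (eps**2 * s)))  # k = s*q = Theta(eps^-2 log(1/delta))
x = rng.normal(size=d)
x /= np.linalg.norm(x)

trials, fails = 2000, 0
for _ in range(trials):
    h = rng.integers(0, q, size=(s, d))
    sigma = rng.choice([-1.0, 1.0], size=(s, d))
    norm2 = 0.0
    for r in range(s):                    # ||Sx||^2, block by block, without forming S
        z = np.bincount(h[r], weights=sigma[r] * x, minlength=q)
        norm2 += z @ z
    fails += abs(norm2 / s - 1) > eps
print(fails / trials, "vs delta =", delta)
```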
SLIDES 45-48 Code-based Construction: Caveat
Need a sufficiently good code.
- Each pair of codewords should agree on O(s²/k) coordinates.
- Can get this with a random code by Chernoff + union bound over pairs, but then we need s²/k ≥ log(d/δ), i.e. s ≥ ε^{-1}·√(log(d/δ) · log(1/δ)).
- Slightly better: can assume d = O(ε^{-2}/δ) by first embedding into this dimension with s = 1 (analysis: Chebyshev's inequality) ⇒ can get away with s = O(ε^{-1}·√(log(1/(εδ)) · log(1/δ))).
Can we avoid the loss incurred by this union bound?
SLIDES 49-52 Improving the Construction
- Pick h at random.
- Analysis: directly bound the ℓ = log(1/δ)th moment of the error term Z, then apply Markov's inequality: Pr[|Z| > ε] < ε^{−ℓ} · E[Z^ℓ]. (Conditioning on the event "h gives a code" is too demanding.)
- Z_r = Σ_{i≠j} x_i x_j σ(i, r) σ(j, r) η_{i,j,r}   (Z = (1/s) Σ_{r=1}^{s} Z_r)
- E_{h,σ}[Z^ℓ] = (1/s^ℓ) · Σ_{r₁<⋯<r_n} Σ_{t₁+⋯+t_n=ℓ, each tᵢ>1} (ℓ choose t₁, ..., t_n) · Π_{i=1}^{n} E_{h,σ}[Z_{rᵢ}^{tᵢ}]
- Bound the tth moment of any Z_r, then get the ℓth moment bound for Z by plugging into the above.
SLIDES 53-55 Bounding E[Z_r^t]
  Z_r = Σ_{i≠j} x_i x_j σ(i, r) σ(j, r) η_{i,j,r}
- Monomials appearing in the expansion of Z_r^t are in correspondence with directed multigraphs, e.g.
  (x₁x₂) · (x₃x₄) · (x₃x₈) · (x₄x₈) · (x₂x₁₀) → [figure: the corresponding directed multigraph on 5 vertices]
- A monomial contributes to the expectation iff all degrees are even.
- Analysis: group the monomials appearing in Z_r^t according to graph isomorphism class, then do some combinatorics.
- Our analysis is tight up to a constant factor.
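A tiny brute-force check of the "all degrees even" rule: averaging over all 2^d sign patterns, a monomial of σ's survives iff every index appears an even number of times (d = 4 here for speed).

```python
import itertools

d = 4
def expected_sign_product(indices):
    """E_sigma[prod_u sigma_{i_u}] over uniform sigma in {-1,+1}^d."""
    total = 0
    for signs in itertools.product([-1, 1], repeat=d):
        p = 1
        for i in indices:
            p *= signs[i]
        total += p
    return total / 2**d

print(expected_sign_product([0, 1, 0, 1]))  # 1.0: every index has even degree
print(expected_sign_product([0, 1, 2, 1]))  # 0.0: indices 0 and 2 have odd degree
```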
SLIDES 56-60 Bounding E[Z_r^t]
m = #connected components, v = #vertices, d_u = degree of u.

  E_{h,σ}[Z_r^t] = Σ_G  Σ_{f((i_u,j_u)_{u=1}^t) = G}  E[ Π_{u=1}^{t} η_{i_u,j_u,r} ] · Π_{u=1}^{t} x_{i_u} x_{j_u}
               = Σ_G (s/k)^{v−m} · Σ_{f((i_u,j_u)_{u=1}^t) = G} Π_{u=1}^{t} x_{i_u} x_{j_u}
               ≤ Σ_G (s/k)^{v−m} · v! · (t choose d₁/2, ..., d_v/2)^{−1}
               ≤ 2^{O(t)} · Σ_{v,m} t^{−t} v^v (s/k)^{v−m} · Σ_G Π_u d_u^{d_u/2}
SLIDE 61 Bounding E[Z_r^t]
  E[Z_r^t] ≤ 2^{O(t)} · { 1                     if t < log(k/s)
                          (t / log(k/s))^t      otherwise }
- Plug this into the expression for E[Z^ℓ]. QED
SLIDES 62-63 Open Problems
- OPEN: Devise a distribution which can be sampled using few random bits.
  Current record: O(log d + log(1/ε) log(1/δ) + log(1/δ) log log(1/δ)) [Kane-Meka-N., 2011]
  Existential: O(log d + log(1/δ))
- OPEN: Prove a tight lower bound on the achievable sparsity in a JL distribution.
- OPEN: Can we have a JL matrix such that we can multiply by any k × k submatrix in k · polylog(d) time? (ultimate goal)
- OPEN: Embed any vector in Õ(d) time into the optimal k.