Understanding Sparse JL for Feature Hashing (Meena Jagadeesan) - PowerPoint presentation

SLIDE 1

Understanding Sparse JL for Feature Hashing

Meena Jagadeesan

Harvard University (Class of 2020)

NeurIPS 2019 (Poster #59)

SLIDE 2

Dimensionality reduction (ℓ2-to-ℓ2)

A randomized map Rn → Rm (where m ≪ n) that preserves distances.
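To make the definition concrete, here is a minimal illustration (not from the slides) using a dense Gaussian random projection; the dimensions, seed, and vectors are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 10_000, 256                  # original and projected dimensions, m << n
x, y = rng.normal(size=n), rng.normal(size=n)

# Dense Gaussian JL map: i.i.d. N(0, 1/m) entries, so E[||A(x - y)||^2] = ||x - y||^2.
A = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))

orig = np.linalg.norm(x - y)
proj = np.linalg.norm(A @ (x - y))
print(f"relative distortion: {abs(proj - orig) / orig:.3f}")   # typically a few percent
```

Sparse schemes such as feature hashing and sparse JL (below) aim for the same kind of guarantee with a much cheaper matrix-vector product.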

SLIDE 3

Dimensionality reduction (ℓ2-to-ℓ2)

A randomized map Rn → Rm (where m ≪ n) that preserves distances. A pre-processing step in many applications: clustering, nearest neighbors.

SLIDE 4

Dimensionality reduction (ℓ2-to-ℓ2)

A randomized map Rn → Rm (where m ≪ n) that preserves distances. A pre-processing step in many applications: clustering, nearest neighbors. Key question: What is the tradeoff between the dimension m, the performance in distance preservation, and the projection time?

SLIDE 5

Dimensionality reduction (ℓ2-to-ℓ2)

A randomized map Rn → Rm (where m ≪ n) that preserves distances. A pre-processing step in many applications: clustering, nearest neighbors. Key question: What is the tradeoff between the dimension m, the performance in distance preservation, and the projection time? This paper: A theoretical analysis of this tradeoff for a state-of-the-art dimensionality reduction scheme on feature vectors.

SLIDE 6

Feature hashing (Weinberger et al. ’09)

One standard dimensionality reduction scheme is feature hashing.

SLIDE 7

Feature hashing (Weinberger et al. ’09)

One standard dimensionality reduction scheme is feature hashing. Use a hash function h : {1, . . . , n} → {1, . . . , m} on coordinates. Use random signs to handle collisions:

f(x)_i = ∑_{j ∈ h^{-1}(i)} σ_j x_j.
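A minimal NumPy sketch of this map (an illustration under simplifying assumptions, not the authors' code): the hash values h and the signs σ are drawn at random per call rather than computed by an actual hash function, and feature_hash is a name made up for this example.

```python
import numpy as np

def feature_hash(x: np.ndarray, m: int, seed: int = 0) -> np.ndarray:
    """Feature hashing: bucket each coordinate with h, flip its sign with sigma."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    h = rng.integers(0, m, size=n)             # h : {1, ..., n} -> {1, ..., m}
    sigma = rng.choice([-1.0, 1.0], size=n)    # random signs to handle collisions
    fx = np.zeros(m)
    np.add.at(fx, h, sigma * x)                # f(x)_i = sum_{j in h^-1(i)} sigma_j x_j
    return fx

x = np.random.default_rng(1).normal(size=1_000)
print(np.linalg.norm(x), np.linalg.norm(feature_hash(x, m=128)))   # norms are comparable
```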

SLIDE 10

Sparse Johnson-Lindenstrauss transform (KN ’12)

Sparse JL is a state-of-the-art sparse dimensionality reduction scheme.

SLIDE 11

Sparse Johnson-Lindenstrauss transform (KN ’12)

Sparse JL is a state-of-the-art sparse dimensionality reduction scheme. Use many (anti-correlated) hash fns h1, . . . , hs : {1, . . . , n} → {1, . . . , m}. ⇒ Each input coordinate is mapped to s output coordinates.

SLIDE 12

Sparse Johnson-Lindenstrauss transform (KN ’12)

Sparse JL is a state-of-the-art sparse dimensionality reduction scheme. Use many (anti-correlated) hash fns h1, . . . , hs : {1, . . . , n} → {1, . . . , m}. ⇒ Each input coordinate is mapped to s output coordinates. Use random signs to deal with collisions. That is:

f(x)_i = (1/√s) ∑_{k=1}^{s} ∑_{j ∈ h_k^{-1}(i)} σ_{k,j} x_j.
SLIDE 13

Sparse Johnson-Lindenstrauss transform (KN ’12)

Sparse JL is a state-of-the-art sparse dimensionality reduction scheme. Use many (anti-correlated) hash fns h1, . . . , hs : {1, . . . , n} → {1, . . . , m}. ⇒ Each input coordinate is mapped to s output coordinates. Use random signs to deal with collisions. That is:

f(x)_i = (1/√s) ∑_{k=1}^{s} ∑_{j ∈ h_k^{-1}(i)} σ_{k,j} x_j.

(Alternate view: a random sparse matrix w/ s nonzero entries per column.)
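A minimal sketch of this matrix view (illustrative only, not the paper's implementation): each column gets s nonzero entries of value ±1/√s in s distinct rows. For simplicity the rows are drawn uniformly and independently per column instead of via the anti-correlated hash functions of the actual construction, and sparse_jl_matrix is a made-up name.

```python
import numpy as np

def sparse_jl_matrix(n: int, m: int, s: int, seed: int = 0) -> np.ndarray:
    """Random sparse JL-style matrix with s nonzeros of value +/- 1/sqrt(s) per column."""
    rng = np.random.default_rng(seed)
    A = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=s, replace=False)   # s distinct output coordinates
        signs = rng.choice([-1.0, 1.0], size=s)       # random signs for collisions
        A[rows, j] = signs / np.sqrt(s)
    return A

x = np.random.default_rng(1).normal(size=2_000)
A = sparse_jl_matrix(n=2_000, m=256, s=4)
print(np.linalg.norm(x), np.linalg.norm(A @ x))       # should be close
```

Applying such a map costs roughly s nonzero multiplications per input coordinate, which is why larger s costs more projection time.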

SLIDE 14

Sparse Johnson-Lindenstrauss transform (KN ’12)

Sparse JL is a state-of-the-art sparse dimensionality reduction scheme. Use many (anti-correlated) hash fns h1, . . . , hs : {1, . . . , n} → {1, . . . , m}. ⇒ Each input coordinate is mapped to s output coordinates. Use random signs to deal with collisions. That is:

f(x)_i = (1/√s) ∑_{k=1}^{s} ∑_{j ∈ h_k^{-1}(i)} σ_{k,j} x_j.

(Alternate view: a random sparse matrix w/ s nonzero entries per column.) The tradeoff: higher s preserves distances better, but takes longer.

SLIDE 15

Sparse Johnson-Lindenstrauss transform (KN ’12)

Sparse JL is a state-of-the-art sparse dimensionality reduction scheme. Use many (anti-correlated) hash fns h1, . . . , hs : {1, . . . , n} → {1, . . . , m}. ⇒ Each input coordinate is mapped to s output coordinates. Use random signs to deal with collisions. That is:

f(x)_i = (1/√s) ∑_{k=1}^{s} ∑_{j ∈ h_k^{-1}(i)} σ_{k,j} x_j.

(Alternate view: a random sparse matrix w/ s nonzero entries per column.) The tradeoff: higher s preserves distances better, but takes longer.

This work

Analysis of the tradeoff for sparse JL between the number of hash functions s, the dimension m, and the performance in ℓ2-distance preservation.

SLIDE 16

Intuition for this paper

Analysis of sparse JL with respect to a performance measure.

SLIDE 21

Traditional mathematical framework

Consider a probability distribution F over linear maps f : Rn → Rm.

SLIDE 22

Traditional mathematical framework

Consider a probability distribution F over linear maps f : Rn → Rm. Geometry-preserving condition. For each x ∈ Rn: P_{f ∈ F}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ, where ε is the target error and δ the target failure probability.

SLIDE 23

Traditional mathematical framework

Consider a probability distribution F over linear maps f : Rn → Rm. Geometry-preserving condition. For each x ∈ Rn: P_{f ∈ F}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ, where ε is the target error and δ the target failure probability. (Can apply to differences x = x1 − x2 since f is linear.)
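As an illustrative way to read this condition (not from the paper), one can fix x, sample many maps f, and count how often ‖f(x)‖2 falls outside (1 ± ε)‖x‖2; the fraction estimates the failure probability δ. The sketch below uses a dense Gaussian map for simplicity, with arbitrary parameter values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, eps, trials = 2_000, 200, 0.1, 500

x = rng.normal(size=n)
x /= np.linalg.norm(x)                       # normalize so ||x||_2 = 1

failures = 0
for _ in range(trials):
    A = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))   # fresh draw f ~ F
    if abs(np.linalg.norm(A @ x) - 1.0) > eps:             # ||f(x)||_2 outside (1 +/- eps)
        failures += 1

print("empirical failure probability (estimate of delta):", failures / trials)
```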

SLIDE 24

Traditional mathematical framework

Consider a probability distribution F over linear maps f : Rn → Rm. Geometry-preserving condition. For each x ∈ Rn: P_{f ∈ F}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ, where ε is the target error and δ the target failure probability. (Can apply to differences x = x1 − x2 since f is linear.) Sparse JL can sometimes perform much better in practice on feature vectors than traditional theory suggests...

SLIDE 25

Performance on feature vectors (Weinberger et al. ’09)

Consider vectors w/ small ℓ∞-to-ℓ2 norm ratio: Sv = {x ∈ Rn | ‖x‖∞ ≤ v ‖x‖2}.

SLIDE 26

Performance on feature vectors (Weinberger et al. ’09)

Consider vectors w/ small ℓ∞-to-ℓ2 norm ratio: Sv = {x ∈ Rn | ‖x‖∞ ≤ v ‖x‖2}. Let Fs,m be the distribution given by sparse JL with parameters s and m.

SLIDE 27

Performance on feature vectors (Weinberger et al. ’09)

Consider vectors w/ small ℓ∞-to-ℓ2 norm ratio: Sv = {x ∈ Rn | ‖x‖∞ ≤ v ‖x‖2}. Let Fs,m be the distribution given by sparse JL with parameters s and m.

Definition

v(m, ε, δ, s) is the supremum over v ∈ [0, 1] such that P_{f ∈ Fs,m}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ holds for each x ∈ Sv.
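For intuition about Sv (an aside, not from the slides): x ∈ Sv exactly when ‖x‖∞ / ‖x‖2 ≤ v, i.e. when no single coordinate carries most of the mass. A quick check with two made-up vectors:

```python
import numpy as np

def linf_l2_ratio(x: np.ndarray) -> float:
    """||x||_inf / ||x||_2: small when the mass is spread over many coordinates."""
    return float(np.max(np.abs(x)) / np.linalg.norm(x))

n = 10_000
spread = np.ones(n)                    # mass spread evenly: ratio = 1/sqrt(n) = 0.01
spiky = np.zeros(n); spiky[0] = 1.0    # all mass on one coordinate: ratio = 1
print(linf_l2_ratio(spread), linf_l2_ratio(spiky))
```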

SLIDE 28

Performance on feature vectors (Weinberger et al. ’09)

Consider vectors w/ small ℓ∞-to-ℓ2 norm ratio: Sv = {x ∈ Rn | ‖x‖∞ ≤ v ‖x‖2}. Let Fs,m be the distribution given by sparse JL with parameters s and m.

Definition

v(m, ε, δ, s) is the supremum over v ∈ [0, 1] such that P_{f ∈ Fs,m}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ holds for each x ∈ Sv.

◮ v(m, ε, δ, s) = 0 ⇒ poor performance
◮ v(m, ε, δ, s) = 1 ⇒ full performance
◮ v(m, ε, δ, s) ∈ (0, 1) ⇒ good performance on x ∈ Sv(m,ε,δ,s)

SLIDE 29

Performance on feature vectors (Weinberger et al. ’09)

Consider vectors w/ small ℓ∞-to-ℓ2 norm ratio: Sv = {x ∈ Rn | ‖x‖∞ ≤ v ‖x‖2}. Let Fs,m be the distribution given by sparse JL with parameters s and m.

Definition

v(m, ε, δ, s) is the supremum over v ∈ [0, 1] such that P_{f ∈ Fs,m}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ holds for each x ∈ Sv.

◮ v(m, ε, δ, s) = 0 ⇒ poor performance
◮ v(m, ε, δ, s) = 1 ⇒ full performance
◮ v(m, ε, δ, s) ∈ (0, 1) ⇒ good performance on x ∈ Sv(m,ε,δ,s)

We give a tight theoretical analysis of the function v(m, ε, δ, s).

SLIDE 30

Informal statement of main result

Goal: P_{f ∈ F}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ. v(m, ε, δ, s) := sup over v ∈ [0, 1] s.t. sparse JL meets the ℓ2 goal on x ∈ Sv.

SLIDE 31

Informal statement of main result

Goal: P_{f ∈ F}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ. v(m, ε, δ, s) := sup over v ∈ [0, 1] s.t. sparse JL meets the ℓ2 goal on x ∈ Sv.

Theorem (Informal)

For error ε and failure probability δ, sparse JL with projected dimension m and s hash functions has four regimes in its performance; that is,

v(m, ε, δ, s) =
  1                  (full performance)      high m
  √s · B1            (partial performance)   middle m
  √s · min(B1, B2)   (partial performance)   middle m
  0                  (poor performance)      small m,

where p = ln(1/δ), B1 = √(ln(mε²/p)) / √p, and B2 = ln(mε/p) / p.
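Purely to make these quantities concrete (a toy calculation, not from the paper: it ignores constant factors and the regime boundaries, and reads B1 with a square root as reconstructed above; regime_quantities is a made-up helper):

```python
import math

def regime_quantities(m: int, eps: float, delta: float, s: int):
    """Toy evaluation of the quantities in the informal theorem (constants ignored)."""
    p = math.log(1.0 / delta)
    b1 = math.sqrt(max(math.log(m * eps**2 / p), 0.0)) / math.sqrt(p)
    b2 = max(math.log(m * eps / p), 0.0) / p
    return math.sqrt(s) * b1, math.sqrt(s) * min(b1, b2)

print(regime_quantities(m=4096, eps=0.1, delta=1e-4, s=4))   # roughly (0.8, 0.8)
```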

SLIDE 33

v(m, ε, δ, s) on more synthetic data

SLIDE 37

Sparse JL on News20 dataset

Sparse JL with 4 hash fns can significantly outperform feature hashing!

SLIDE 39

Comparison to previous work

Goal: P_{f ∈ F}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ. v(m, ε, δ, s) := sup over v ∈ [0, 1] s.t. sparse JL meets the ℓ2 goal on x ∈ Sv.

SLIDE 40

Comparison to previous work

Goal: P_{f ∈ F}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ. v(m, ε, δ, s) := sup over v ∈ [0, 1] s.t. sparse JL meets the ℓ2 goal on x ∈ Sv. Bounds on v (Weinberger et al. ’09, ..., Freksen et al. ’18):

◮ v(m, ε, δ, 1) understood
◮ v(m, ε, δ, s) bound for multiple hashing (a suboptimal construction)

Bounds for sparse JL on full space Rn:

◮ Can set m ≈ ε^{-2} log(1/δ), s ≈ ε^{-1} log(1/δ) (Kane and Nelson ’12; see the sketch below)
◮ Can set m ≈ min(2ε^{-2}/δ, ε^{-2} log(1/δ) e^{Θ(ε^{-1} log(1/δ)/s)}) (Cohen ’16)
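As a rough illustration of the Kane-Nelson-style setting referenced above (a sketch that ignores the unspecified constant factors; kane_nelson_params is a made-up helper name):

```python
import math

def kane_nelson_params(eps: float, delta: float) -> tuple[int, int]:
    """Rough (m, s) from the Kane-Nelson '12 bounds, constant factors ignored."""
    p = math.log(1.0 / delta)
    m = math.ceil(p / eps**2)      # m ~ eps^-2 * log(1/delta)
    s = math.ceil(p / eps)         # s ~ eps^-1 * log(1/delta)
    return m, s

print(kane_nelson_params(eps=0.1, delta=1e-3))   # (691, 70)
```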

SLIDE 41

Comparison to previous work

Goal: P_{f ∈ F}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ. v(m, ε, δ, s) := sup over v ∈ [0, 1] s.t. sparse JL meets the ℓ2 goal on x ∈ Sv. Bounds on v (Weinberger et al. ’09, ..., Freksen et al. ’18):

◮ v(m, ε, δ, 1) understood
◮ v(m, ε, δ, s) bound for multiple hashing (a suboptimal construction)

Bounds for sparse JL on full space Rn:

◮ Can set m ≈ ε^{-2} log(1/δ), s ≈ ε^{-1} log(1/δ) (Kane and Nelson ’12)
◮ Can set m ≈ min(2ε^{-2}/δ, ε^{-2} log(1/δ) e^{Θ(ε^{-1} log(1/δ)/s)}) (Cohen ’16)

This work

Tight bounds on v(m, ε, δ, s) for a general s > 1 for sparse JL.

SLIDE 42

Comparison to previous work

Goal: P_{f ∈ F}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ. v(m, ε, δ, s) := sup over v ∈ [0, 1] s.t. sparse JL meets the ℓ2 goal on x ∈ Sv. Bounds on v (Weinberger et al. ’09, ..., Freksen et al. ’18):

◮ v(m, ε, δ, 1) understood
◮ v(m, ε, δ, s) bound for multiple hashing (a suboptimal construction)

Bounds for sparse JL on full space Rn:

◮ Can set m ≈ ε^{-2} log(1/δ), s ≈ ε^{-1} log(1/δ) (Kane and Nelson ’12)
◮ Can set m ≈ min(2ε^{-2}/δ, ε^{-2} log(1/δ) e^{Θ(ε^{-1} log(1/δ)/s)}) (Cohen ’16)

This work

Tight bounds on v(m, ε, δ, s) for a general s > 1 for sparse JL. ⇒ Characterization of sparse JL performance in terms of ε, δ, and the ℓ∞-to-ℓ2 norm ratio for a general # of hash functions s.

SLIDE 43

Conclusion

◮ Tight analysis of v(m, ε, δ, s) for uniform sparse JL for a general s. Could inform how to optimally set s and m in practice.
◮ Characterization of sparse JL performance in terms of ε, δ, and the ℓ∞-to-ℓ2 norm ratio for a general # of hash functions s.
◮ Evaluation on real-world and synthetic data (sparse JL can perform much better than feature hashing).
◮ Proof technique involves a new perspective on analyzing JL distributions.

Thank you!
