Understanding Sparse JL for Feature Hashing (Meena Jagadeesan) - PowerPoint presentation

SLIDE 1

Understanding Sparse JL for Feature Hashing

Meena Jagadeesan

Harvard University (Class of 2020)

NeurIPS 2019 (Poster #59)

SLIDE 2

Dimensionality reduction (ℓ2-to-ℓ2)

A randomized map Rn → Rm (where m ≪ n) that preserves distances.
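To make the definition concrete, here is a minimal illustration (not from the slides) using a dense Gaussian random projection; the dimensions, seed, and vectors are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 10_000, 256                  # original and projected dimensions, m << n
x, y = rng.normal(size=n), rng.normal(size=n)

# Dense Gaussian JL map: i.i.d. N(0, 1/m) entries, so E[||A(x - y)||^2] = ||x - y||^2.
A = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))

orig = np.linalg.norm(x - y)
proj = np.linalg.norm(A @ (x - y))
print(f"relative distortion: {abs(proj - orig) / orig:.3f}")   # typically a few percent
```

Sparse schemes such as feature hashing and sparse JL (below) aim for the same kind of guarantee with a much cheaper matrix-vector product.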

SLIDE 3

Dimensionality reduction (ℓ2-to-ℓ2)

A randomized map Rn → Rm (where m ≪ n) that preserves distances. A pre-processing step in many applications: clustering, nearest neighbors.

SLIDE 4

Dimensionality reduction (ℓ2-to-ℓ2)

A randomized map Rn → Rm (where m ≪ n) that preserves distances. A pre-processing step in many applications: clustering, nearest neighbors. Key question: What is the tradeoff between the dimension m, the performance in distance preservation, and the projection time?

SLIDE 5

Dimensionality reduction (ℓ2-to-ℓ2)

A randomized map Rn → Rm (where m ≪ n) that preserves distances. A pre-processing step in many applications: clustering, nearest neighbors. Key question: What is the tradeoff between the dimension m, the performance in distance preservation, and the projection time? This paper: A theoretical analysis of this tradeoff for a state-of-the-art dimensionality reduction scheme on feature vectors.

SLIDE 6

Feature hashing (Weinberger et al. ’09)

One standard dimensionality reduction scheme is feature hashing.

SLIDE 7

Feature hashing (Weinberger et al. ’09)

One standard dimensionality reduction scheme is feature hashing. Use a hash function h : {1, . . . , n} → {1, . . . , m} on coordinates. Use random signs to handle collisions:

f(x)_i = ∑_{j ∈ h^{-1}(i)} σ_j x_j.
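A minimal NumPy sketch of this map (an illustration under simplifying assumptions, not the authors' code): the hash values h and the signs σ are drawn at random per call rather than computed by an actual hash function, and feature_hash is a name made up for this example.

```python
import numpy as np

def feature_hash(x: np.ndarray, m: int, seed: int = 0) -> np.ndarray:
    """Feature hashing: bucket each coordinate with h, flip its sign with sigma."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    h = rng.integers(0, m, size=n)             # h : {1, ..., n} -> {1, ..., m}
    sigma = rng.choice([-1.0, 1.0], size=n)    # random signs to handle collisions
    fx = np.zeros(m)
    np.add.at(fx, h, sigma * x)                # f(x)_i = sum_{j in h^-1(i)} sigma_j x_j
    return fx

x = np.random.default_rng(1).normal(size=1_000)
print(np.linalg.norm(x), np.linalg.norm(feature_hash(x, m=128)))   # norms are comparable
```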

SLIDE 10

Sparse Johnson-Lindenstrauss transform (KN ’12)

Sparse JL is a state-of-the-art sparse dimensionality reduction scheme.

SLIDE 11

Sparse Johnson-Lindenstrauss transform (KN ’12)

Sparse JL is a state-of-the-art sparse dimensionality reduction scheme. Use many (anti-correlated) hash fns h1, . . . , hs : {1, . . . , n} → {1, . . . , m}. ⇒ Each input coordinate is mapped to s output coordinates.

SLIDE 12

Sparse Johnson-Lindenstrauss transform (KN ’12)

Sparse JL is a state-of-the-art sparse dimensionality reduction scheme. Use many (anti-correlated) hash fns h1, . . . , hs : {1, . . . , n} → {1, . . . , m}. ⇒ Each input coordinate is mapped to s output coordinates. Use random signs to deal with collisions. That is:

f(x)_i = (1/√s) ∑_{k=1}^{s} ∑_{j ∈ h_k^{-1}(i)} σ_{k,j} x_j.
SLIDE 13

Sparse Johnson-Lindenstrauss transform (KN ’12)

Sparse JL is a state-of-the-art sparse dimensionality reduction scheme. Use many (anti-correlated) hash fns h1, . . . , hs : {1, . . . , n} → {1, . . . , m}. ⇒ Each input coordinate is mapped to s output coordinates. Use random signs to deal with collisions. That is:

f(x)_i = (1/√s) ∑_{k=1}^{s} ∑_{j ∈ h_k^{-1}(i)} σ_{k,j} x_j.

(Alternate view: a random sparse matrix w/ s nonzero entries per column.)
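A minimal sketch of this matrix view (illustrative only, not the paper's implementation): each column gets s nonzero entries of value ±1/√s in s distinct rows. For simplicity the rows are drawn uniformly and independently per column instead of via the anti-correlated hash functions of the actual construction, and sparse_jl_matrix is a made-up name.

```python
import numpy as np

def sparse_jl_matrix(n: int, m: int, s: int, seed: int = 0) -> np.ndarray:
    """Random sparse JL-style matrix with s nonzeros of value +/- 1/sqrt(s) per column."""
    rng = np.random.default_rng(seed)
    A = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=s, replace=False)   # s distinct output coordinates
        signs = rng.choice([-1.0, 1.0], size=s)       # random signs for collisions
        A[rows, j] = signs / np.sqrt(s)
    return A

x = np.random.default_rng(1).normal(size=2_000)
A = sparse_jl_matrix(n=2_000, m=256, s=4)
print(np.linalg.norm(x), np.linalg.norm(A @ x))       # should be close
```

Applying such a map costs roughly s nonzero multiplications per input coordinate, which is why larger s costs more projection time.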

SLIDE 14

Sparse Johnson-Lindenstrauss transform (KN ’12)

Sparse JL is a state-of-the-art sparse dimensionality reduction scheme. Use many (anti-correlated) hash fns h1, . . . , hs : {1, . . . , n} → {1, . . . , m}. ⇒ Each input coordinate is mapped to s output coordinates. Use random signs to deal with collisions. That is:

f(x)_i = (1/√s) ∑_{k=1}^{s} ∑_{j ∈ h_k^{-1}(i)} σ_{k,j} x_j.

(Alternate view: a random sparse matrix w/ s nonzero entries per column.) The tradeoff: higher s preserves distances better, but takes longer.

SLIDE 15

Sparse Johnson-Lindenstrauss transform (KN ’12)

Sparse JL is a state-of-the-art sparse dimensionality reduction scheme. Use many (anti-correlated) hash fns h1, . . . , hs : {1, . . . , n} → {1, . . . , m}. ⇒ Each input coordinate is mapped to s output coordinates. Use random signs to deal with collisions. That is:

f(x)_i = (1/√s) ∑_{k=1}^{s} ∑_{j ∈ h_k^{-1}(i)} σ_{k,j} x_j.

(Alternate view: a random sparse matrix w/ s nonzero entries per column.) The tradeoff: higher s preserves distances better, but takes longer.

This work

Analysis of the tradeoff for sparse JL between the number of hash functions s, the dimension m, and the performance in ℓ2-distance preservation.

SLIDE 16

Intuition for this paper

Analysis of sparse JL with respect to a performance measure.

SLIDE 21

Traditional mathematical framework

Consider a probability distribution F over linear maps f : Rn → Rm.

SLIDE 22

Traditional mathematical framework

Consider a probability distribution F over linear maps f : Rn → Rm. Geometry-preserving condition. For each x ∈ Rn: P_{f ∈ F}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ, where ε is the target error and δ the target failure probability.

SLIDE 23

Traditional mathematical framework

Consider a probability distribution F over linear maps f : Rn → Rm. Geometry-preserving condition. For each x ∈ Rn: P_{f ∈ F}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ, where ε is the target error and δ the target failure probability. (Can apply to differences x = x1 − x2 since f is linear.)
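As an illustrative way to read this condition (not from the paper), one can fix x, sample many maps f, and count how often ‖f(x)‖2 falls outside (1 ± ε)‖x‖2; the fraction estimates the failure probability δ. The sketch below uses a dense Gaussian map for simplicity, with arbitrary parameter values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, eps, trials = 2_000, 200, 0.1, 500

x = rng.normal(size=n)
x /= np.linalg.norm(x)                       # normalize so ||x||_2 = 1

failures = 0
for _ in range(trials):
    A = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))   # fresh draw f ~ F
    if abs(np.linalg.norm(A @ x) - 1.0) > eps:             # ||f(x)||_2 outside (1 +/- eps)
        failures += 1

print("empirical failure probability (estimate of delta):", failures / trials)
```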

SLIDE 24

Traditional mathematical framework

Consider a probability distribution F over linear maps f : Rn → Rm. Geometry-preserving condition. For each x ∈ Rn: P_{f ∈ F}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ, where ε is the target error and δ the target failure probability. (Can apply to differences x = x1 − x2 since f is linear.) Sparse JL can sometimes perform much better in practice on feature vectors than traditional theory suggests...

SLIDE 25

Performance on feature vectors (Weinberger et al. ’09)

Consider vectors w/ small ℓ∞-to-ℓ2 norm ratio: Sv = {x ∈ Rn | ‖x‖∞ ≤ v ‖x‖2}.

SLIDE 26

Performance on feature vectors (Weinberger et al. ’09)

Consider vectors w/ small ℓ∞-to-ℓ2 norm ratio: Sv = {x ∈ Rn | ‖x‖∞ ≤ v ‖x‖2}. Let Fs,m be the distribution given by sparse JL with parameters s and m.

SLIDE 27

Performance on feature vectors (Weinberger et al. ’09)

Consider vectors w/ small ℓ∞-to-ℓ2 norm ratio: Sv = {x ∈ Rn | ‖x‖∞ ≤ v ‖x‖2}. Let Fs,m be the distribution given by sparse JL with parameters s and m.

Definition

v(m, ε, δ, s) is the supremum over v ∈ [0, 1] such that P_{f ∈ Fs,m}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ holds for each x ∈ Sv.
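For intuition about Sv (an aside, not from the slides): x ∈ Sv exactly when ‖x‖∞ / ‖x‖2 ≤ v, i.e. when no single coordinate carries most of the mass. A quick check with two made-up vectors:

```python
import numpy as np

def linf_l2_ratio(x: np.ndarray) -> float:
    """||x||_inf / ||x||_2: small when the mass is spread over many coordinates."""
    return float(np.max(np.abs(x)) / np.linalg.norm(x))

n = 10_000
spread = np.ones(n)                    # mass spread evenly: ratio = 1/sqrt(n) = 0.01
spiky = np.zeros(n); spiky[0] = 1.0    # all mass on one coordinate: ratio = 1
print(linf_l2_ratio(spread), linf_l2_ratio(spiky))
```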

SLIDE 28

Performance on feature vectors (Weinberger et al. ’09)

Consider vectors w/ small ℓ∞-to-ℓ2 norm ratio: Sv = {x ∈ Rn | ‖x‖∞ ≤ v ‖x‖2}. Let Fs,m be the distribution given by sparse JL with parameters s and m.

Definition

v(m, ε, δ, s) is the supremum over v ∈ [0, 1] such that P_{f ∈ Fs,m}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ holds for each x ∈ Sv.

◮ v(m, ε, δ, s) = 0 ⇒ poor performance
◮ v(m, ε, δ, s) = 1 ⇒ full performance
◮ v(m, ε, δ, s) ∈ (0, 1) ⇒ good performance on x ∈ Sv(m,ε,δ,s)

SLIDE 29

Performance on feature vectors (Weinberger et al. ’09)

Consider vectors w/ small ℓ∞-to-ℓ2 norm ratio: Sv = {x ∈ Rn | ‖x‖∞ ≤ v ‖x‖2}. Let Fs,m be the distribution given by sparse JL with parameters s and m.

Definition

v(m, ε, δ, s) is the supremum over v ∈ [0, 1] such that P_{f ∈ Fs,m}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ holds for each x ∈ Sv.

◮ v(m, ε, δ, s) = 0 ⇒ poor performance
◮ v(m, ε, δ, s) = 1 ⇒ full performance
◮ v(m, ε, δ, s) ∈ (0, 1) ⇒ good performance on x ∈ Sv(m,ε,δ,s)

We give a tight theoretical analysis of the function v(m, ε, δ, s).

SLIDE 30

Informal statement of main result

Goal: P_{f ∈ F}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ. v(m, ε, δ, s) := sup over v ∈ [0, 1] s.t. sparse JL meets the ℓ2 goal on x ∈ Sv.

SLIDE 31

Informal statement of main result

Goal: P_{f ∈ F}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ. v(m, ε, δ, s) := sup over v ∈ [0, 1] s.t. sparse JL meets the ℓ2 goal on x ∈ Sv.

Theorem (Informal)

For error ε and failure probability δ, sparse JL with projected dimension m and s hash functions has four regimes in its performance; that is,

v(m, ε, δ, s) =
  1                  (full performance)      high m
  √s · B1            (partial performance)   middle m
  √s · min(B1, B2)   (partial performance)   middle m
  0                  (poor performance)      small m,

where p = ln(1/δ), B1 = √(ln(mε²/p)) / √p, and B2 = ln(mε/p) / p.
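Purely to make these quantities concrete (a toy calculation, not from the paper: it ignores constant factors and the regime boundaries, and reads B1 with a square root as reconstructed above; regime_quantities is a made-up helper):

```python
import math

def regime_quantities(m: int, eps: float, delta: float, s: int):
    """Toy evaluation of the quantities in the informal theorem (constants ignored)."""
    p = math.log(1.0 / delta)
    b1 = math.sqrt(max(math.log(m * eps**2 / p), 0.0)) / math.sqrt(p)
    b2 = max(math.log(m * eps / p), 0.0) / p
    return math.sqrt(s) * b1, math.sqrt(s) * min(b1, b2)

print(regime_quantities(m=4096, eps=0.1, delta=1e-4, s=4))   # roughly (0.8, 0.8)
```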

SLIDE 33

v(m, ε, δ, s) on more synthetic data

SLIDE 37

Sparse JL on News20 dataset

Sparse JL with 4 hash fns can significantly outperform feature hashing!

SLIDE 39

Comparison to previous work

Goal: P_{f ∈ F}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ. v(m, ε, δ, s) := sup over v ∈ [0, 1] s.t. sparse JL meets the ℓ2 goal on x ∈ Sv.

SLIDE 40

Comparison to previous work

Goal: P_{f ∈ F}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ. v(m, ε, δ, s) := sup over v ∈ [0, 1] s.t. sparse JL meets the ℓ2 goal on x ∈ Sv. Bounds on v (Weinberger et al. ’09, ..., Freksen et al. ’18):

◮ v(m, ε, δ, 1) understood
◮ v(m, ε, δ, s) bound for multiple hashing (a suboptimal construction)

Bounds for sparse JL on full space Rn:

◮ Can set m ≈ ε^{-2} log(1/δ), s ≈ ε^{-1} log(1/δ) (Kane and Nelson ’12; see the sketch below)
◮ Can set m ≈ min(2ε^{-2}/δ, ε^{-2} log(1/δ) e^{Θ(ε^{-1} log(1/δ)/s)}) (Cohen ’16)
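As a rough illustration of the Kane-Nelson-style setting referenced above (a sketch that ignores the unspecified constant factors; kane_nelson_params is a made-up helper name):

```python
import math

def kane_nelson_params(eps: float, delta: float) -> tuple[int, int]:
    """Rough (m, s) from the Kane-Nelson '12 bounds, constant factors ignored."""
    p = math.log(1.0 / delta)
    m = math.ceil(p / eps**2)      # m ~ eps^-2 * log(1/delta)
    s = math.ceil(p / eps)         # s ~ eps^-1 * log(1/delta)
    return m, s

print(kane_nelson_params(eps=0.1, delta=1e-3))   # (691, 70)
```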

SLIDE 41

Comparison to previous work

Goal: P_{f ∈ F}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ. v(m, ε, δ, s) := sup over v ∈ [0, 1] s.t. sparse JL meets the ℓ2 goal on x ∈ Sv. Bounds on v (Weinberger et al. ’09, ..., Freksen et al. ’18):

◮ v(m, ε, δ, 1) understood
◮ v(m, ε, δ, s) bound for multiple hashing (a suboptimal construction)

Bounds for sparse JL on full space Rn:

◮ Can set m ≈ ε^{-2} log(1/δ), s ≈ ε^{-1} log(1/δ) (Kane and Nelson ’12)
◮ Can set m ≈ min(2ε^{-2}/δ, ε^{-2} log(1/δ) e^{Θ(ε^{-1} log(1/δ)/s)}) (Cohen ’16)

This work

Tight bounds on v(m, ε, δ, s) for a general s > 1 for sparse JL.

SLIDE 42

Comparison to previous work

Goal: P_{f ∈ F}[ ‖f(x)‖2 ∈ (1 ± ε) ‖x‖2 ] > 1 − δ. v(m, ε, δ, s) := sup over v ∈ [0, 1] s.t. sparse JL meets the ℓ2 goal on x ∈ Sv. Bounds on v (Weinberger et al. ’09, ..., Freksen et al. ’18):

◮ v(m, ε, δ, 1) understood
◮ v(m, ε, δ, s) bound for multiple hashing (a suboptimal construction)

Bounds for sparse JL on full space Rn:

◮ Can set m ≈ ε^{-2} log(1/δ), s ≈ ε^{-1} log(1/δ) (Kane and Nelson ’12)
◮ Can set m ≈ min(2ε^{-2}/δ, ε^{-2} log(1/δ) e^{Θ(ε^{-1} log(1/δ)/s)}) (Cohen ’16)

This work

Tight bounds on v(m, ε, δ, s) for a general s > 1 for sparse JL. ⇒ Characterization of sparse JL performance in terms of ε, δ, and the ℓ∞-to-ℓ2 norm ratio for a general # of hash functions s.

SLIDE 43

Conclusion

◮ Tight analysis of v(m, ε, δ, s) for uniform sparse JL for a general s. Could inform how to optimally set s and m in practice.
◮ Characterization of sparse JL performance in terms of ε, δ, and the ℓ∞-to-ℓ2 norm ratio for a general # of hash functions s.
◮ Evaluation on real-world and synthetic data (sparse JL can perform much better than feature hashing).
◮ Proof technique involves a new perspective on analyzing JL distributions.

Thank you!
