Dimension Reduction with Certainty
Rasmus Pagh, IT University of Copenhagen. ECML-PKDD, September 21, 2016. Slides: goo.gl/hZoRWo
SCALABLE SIMILARITY SEARCH
What can we say about high-dimensional objects from a low-dimensional representation, with certainty?
Part I: Tools for randomized dimension reduction (greatest hits)
Part II: Transparency and interpretability
Part III: Dimension reduction with certainty?
Randomized dimension reduction: a technique for mapping objects from a large space into a small space, while preserving essential relations, that is data-independent and does not need to be trained.
[Background: screenshot of Kasper Green Larsen and Jelani Nelson, "Optimality of the Johnson-Lindenstrauss Lemma", arXiv:1609.02094 [cs.IT], September 2016.]
• Generally applicable; as good as data-dependent methods in many cases.
• Data does not need to be available in advance: works for "on-line" data.
• Easier to parallelize.
Greatest hits: random projection, random feature mapping, 1-bit minwise hashing.
[Images: photo by Giovanni Dall'Orto; figure courtesy of Suresh Venkatasubramanian.]
Random projection [Johnson & Lindenstrauss '84]: x̂ = A x, where A is a random matrix with m = O(log(n)/ε²) rows.
Guarantee: ‖x̂ᵢ − x̂ⱼ‖₂ = (1 ± ε)‖xᵢ − xⱼ‖₂ for all pairs of input vectors.
Dot products? Yes, but the error depends on the vector lengths.
Still too many dimensions!
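As a minimal illustration of this step (a sketch assuming a Gaussian projection matrix scaled by 1/√m; the slides do not fix a particular distribution or constant):

# A minimal NumPy sketch of a JL-style random projection (illustrative; the Gaussian
# construction and the constant in m are assumptions, not the talk's specific choice).
import numpy as np

def random_projection(X, m, seed=0):
    """Map the rows of X from d to m dimensions using a Gaussian matrix A / sqrt(m)."""
    d = X.shape[1]
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, d)) / np.sqrt(m)   # data-independent: drawn before seeing X
    return X @ A.T

n, d, eps = 100, 10_000, 0.2
m = int(np.ceil(8 * np.log(n) / eps**2))           # m = O(log(n)/eps^2)
X = np.random.default_rng(1).standard_normal((n, d))
Xhat = random_projection(X, m)
orig = np.linalg.norm(X[0] - X[1])
red = np.linalg.norm(Xhat[0] - Xhat[1])
print(m, red / orig)                               # ratio is typically within 1 +/- eps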
What if the data is constrained to a d-dimensional subspace, for example? Projecting onto the subspace itself uses only d dimensions, but that mapping is data-dependent.
A random projection to m = O(d/ε²) dimensions preserves all distances within the subspace, data-independently! [Sarlós '06]
Key tool in randomized numerical linear algebra (RandNLA) methods.
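A small sketch of why this matters (same Gaussian projection as above, treated as an illustrative assumption): the target dimension depends on d, not on the number of vectors, because the guarantee holds simultaneously for every vector in the subspace.

# Illustrative sketch of a Sarlos-style (oblivious) subspace embedding with a Gaussian
# matrix; the Gaussian choice and constants are assumptions for demonstration only.
import numpy as np

rng = np.random.default_rng(0)
D, d, eps = 2_000, 20, 0.5
m = int(np.ceil(8 * d / eps**2))                   # m = O(d/eps^2), independent of n
B = np.linalg.qr(rng.standard_normal((D, d)))[0]   # orthonormal basis of a d-dim subspace
A = rng.standard_normal((m, D)) / np.sqrt(m)       # oblivious: drawn without seeing the subspace

C = rng.standard_normal((d, 1000))                 # 1000 arbitrary vectors from the subspace
X = (B @ C).T
ratios = np.linalg.norm(X @ A.T, axis=1) / np.linalg.norm(X, axis=1)
print(m, ratios.min(), ratios.max())               # norms preserved up to roughly 1 +/- eps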
[Screenshot: Petros Drineas and Michael W. Mahoney, "RandNLA: Randomized Numerical Linear Algebra", Communications of the ACM. Key insight highlighted: randomization isn't just used to model noise…]
Random feature mapping [Rahimi & Recht '07]
data vectors x → kernel expansion 𝜒(x) (high-dimensional feature space) → x̂ (dimension-reduced representation)
x̂・ŷ ≈ 𝜒(x)・𝜒(y)
Can use linear kernel methods (e.g. linear SVM) on the reduced vectors.
Efficient mappings: (a sketch for the Gaussian kernel follows below).
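A minimal random Fourier features sketch in the spirit of Rahimi & Recht, assuming the Gaussian (RBF) kernel; the kernel choice, bandwidth, and target dimension are illustrative assumptions, not the talk's specific mapping:

# Random Fourier features for the Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2).
import numpy as np

def rff(X, m, gamma, seed=0):
    """Map x to x_hat in R^m with x_hat . y_hat ~ k(x, y)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(m, d))  # frequencies from the kernel's Fourier transform
    b = rng.uniform(0, 2 * np.pi, size=m)                  # random phases
    return np.sqrt(2.0 / m) * np.cos(X @ W.T + b)

gamma = 0.01
X = np.random.default_rng(1).standard_normal((2, 50))
Z = rff(X, m=2000, gamma=gamma)
print(Z[0] @ Z[1], np.exp(-gamma * np.linalg.norm(X[0] - X[1]) ** 2))  # approx vs. exact kernel

The reduced vectors x̂ = rff(x) can then be fed to a linear method such as a linear SVM, as the slide suggests.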
Example input: term vector (or TF/IDF), > 10⁵ dimensions, with sparse, non-negative entries → dimension-reduced representation.
1-bit minwise hashing [Li & König '10]
Now: binary vectors / Jaccard similarity, k(x, y) = Σᵢ min(xᵢ, yᵢ) / Σᵢ max(xᵢ, yᵢ).
Minwise hash: zᵢ(x) = min{ hᵢ(j) : xⱼ ≠ 0 }. Keep only one bit of each value; the sketch is b₁(z₁(x)) … b_m(z_m(x)).
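A toy sketch of the idea (the hash functions below are simple stand-ins, not the family analyzed by Li & König; the estimator J ≈ 2·Pr[bits agree] − 1 is the standard one for unbiased 1-bit sketches):

# 1-bit minwise hashing: m minwise hashes, each truncated to a single bit.
import random

def one_bit_minhash(items, m, seed=0):
    """items: set of indices j with x_j != 0. Returns m bits b_i(z_i(x))."""
    bits = []
    for i in range(m):
        salt = random.Random(1_000_003 * seed + i).getrandbits(64)
        z = min(items, key=lambda j: hash((salt, j)))  # minwise hash z_i(x)
        bits.append(hash((salt, "bit", z)) & 1)        # keep one bit: b_i(z_i(x))
    return bits

A = set(range(0, 80))                                  # Jaccard similarity |A ∩ B| / |A ∪ B| = 0.6
B = set(range(20, 100))
m = 2000
sa, sb = one_bit_minhash(A, m), one_bit_minhash(B, m)
agree = sum(x == y for x, y in zip(sa, sb)) / m        # Pr[agree] = J + (1 - J)/2
print(agree, 2 * agree - 1)                            # estimate of J, close to 0.6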
[Screenshot: Ping Li and Arnd Christian König, "Theory and Applications of b-Bit Minwise Hashing", Communications of the ACM 54(8), August 2011.]

Part II: Transparency and interpretability
[Screenshot: Bryce Goodman and Seth Flaxman, "EU regulations on algorithmic decision-making and a 'right to explanation'", arXiv:1606.08813 [stat.ML], 2016.]
From the abstract: "We summarize the potential impact that the European Union's new General Data Protection Regulation will have on the routine use of machine learning algorithms. Slated to take effect as law across the EU in 2018, it will restrict automated individual decision-making (that is, algorithms that make decisions based on user-level predictors) which 'significantly affect' users. The law will also create a 'right to explanation,' whereby a user can ask for an explanation of an algorithmic decision that was made about them. We argue that while this law will pose large challenges for industry, it highlights opportunities for machine learning researchers to take the lead in designing algorithms and evaluation frameworks which avoid discrimination."
The highlighted excerpt, "Article 11. Automated individual decision making", concerns decisions based "solely on automated processing, including profiling, which produces an adverse legal effect concerning the data subject or significantly affects him or her".
[Image: Biff Tannen in Back to the Future]
…easy to describe in terms of original data.
"…similar to Donald Trump's, or was this arranged?"
"…choices had been different?"
Part III: Dimension reduction with certainty?
[Figure: "Explanation"; a missed near neighbor]
…dimension reduction cannot work without assumptions on the data ("incompressibility").
Instead: algorithms whose output is guaranteed, but which may fail to produce a result within a given time/space usage ("Las Vegas", à la quicksort).
[SODA ‘16] [CIKM ‘16]
Setting: summarize a set S of vectors by a small sketch h(S), e.g. to be stored on a server, so that one can decide whether a query is close to some vector x in S while revealing very little about S.
Allow an ε fraction of "false positives".
Distance-sensitive filters [Kirsch & Mitzenmacher '06]
Given a query y, distinguish between the two cases:
1. there exists x ∈ S within distance r from y, and
2. all vectors in S have distance at least cr from y.
[Figure: balls of radius r and radius cr around the query]
In case 2 we allow probability ε of a "false positive".
But: existing solutions also have false negatives…
Distance-sensitive filters without false negatives [Goswami et al. '16]
Pick a random vector a (dimension d); store a・x.
Query y: if |a・x − a・y| ≤ r answer '1', otherwise '2'.
[Plot: distribution of a・x − a・y for pairs at distance r vs. distance r²]
Distinguishes distance r and r² with constant probability and no false negatives.
In general, distinguishes between distance r and cr using space Õ(r/(c−1)).
Space usage essentially optimal.
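A toy demonstration of the one-projection test above, under the assumption of binary vectors with Hamming distance and a uniformly random sign vector a ∈ {−1,+1}ᵈ (the construction in the paper is more involved; this only shows why near pairs are never misclassified):

# If Hamming distance(x, y) <= r, then a.(x - y) is a sum of at most r terms of +/-1,
# so |a.x - a.y| <= r always: the test never gives a false negative on near pairs.
# At distance r^2 the sum has standard deviation r, so it exceeds r with constant probability.
import numpy as np

rng = np.random.default_rng(0)
d, r = 10_000, 10

def answer(x, y, a, r):
    return '1' if abs(a @ x - a @ y) <= r else '2'

x = rng.integers(0, 2, size=d)
near = x.copy(); near[:r] ^= 1          # Hamming distance exactly r
far = x.copy(); far[:r * r] ^= 1        # Hamming distance r^2

a = rng.choice([-1, 1], size=d)
print(answer(x, near, a, r))            # always '1': no false negative
trials = [answer(x, far, rng.choice([-1, 1], size=d), r) for _ in range(100)]
print(trials.count('2'))                # correctly answers '2' in a constant fraction of trials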
Feature mapping for the polynomial kernel [Karppa et al. '16]
data vectors x → kernel expansion 𝜒(x) (feature space) → x̂ (dimension-reduced representation)
x̂・ŷ ≈ (x・y)^k, and the mapping is deterministic!
How is this possible!?
Construction idea [Karppa et al. '16]: the exact kernel expansion has 1 feature per edge in the biclique on two copies of the coordinates x₁, x₂, …, x_d; the reduced mapping keeps 1 feature per edge in a constant-degree expander graph.
Proofs only for vectors in {−1,+1}ᵈ.
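For intuition, here is the exact degree-2 expansion that the expander construction sparsifies (this sketch shows only the exact biclique map, not the expander-based reduction from the paper):

# One feature per biclique edge (i, j): phi(x)[i, j] = x_i * x_j.
# Then phi(x) . phi(y) = (x . y)^2 exactly; the reduced map keeps only expander edges.
import numpy as np

def phi(x):
    return np.outer(x, x).ravel()          # d^2 features, one per edge of the complete bipartite graph

rng = np.random.default_rng(0)
x = rng.choice([-1.0, 1.0], size=64)       # proofs in the paper assume vectors in {-1,+1}^d
y = rng.choice([-1.0, 1.0], size=64)
print(phi(x) @ phi(y), (x @ y) ** 2)       # identical values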
Open questions:
• What machine learning / knowledge discovery aided decisions can be made "explainable"?
• What ML/KDD algorithms can be sped up by RandNLA methods?
• What distance/kernel approximations are possible with one-sided error?
• What kernel expansions have efficient deterministic approximations?
Acknowledgement of financial support: SCALABLE SIMILARITY SEARCH.