Dimensionality Reduction Techniques for Proximity Problems
Piotr Indyk, SODA 2000
CS 468 | Geometric Algorithms, presented by Bart Adams
Talk Summary
Core algorithm: dimensionality reduction using hashing.
Applied to:
- the c-nearest neighbor search algorithm (c-NNS)
- the c-furthest neighbor search algorithm (c-FNS)
Talk Overview
- Introduction
  - Problem Statement
  - Hamming Metric
  - Dimensionality Reduction
- c-Nearest Neighbor Search
- c-Furthest Neighbor Search
- Conclusion
Problem Statement
We are dealing with proximity problems on a point set P (n points, dimension d):
- nearest neighbor search (NNS): given a query q, find the point p ∈ P closest to q
- furthest neighbor search (FNS): given a query q, find the point p ∈ P furthest from q
[Figures: a point set P with query q and answer p, for NNS and for FNS]
Problem Statement
High dimensions suffer from the curse of dimensionality:
- time and/or space exponential in d
The remedy is to use approximate algorithms:
- c-NNS: return a point within distance c·r of q, where r is the distance from q to its exact nearest neighbor
- c-FNS: return a point p′ with d(q, p′) ≥ r/c, where r is the distance from q to its exact furthest neighbor p
[Figures: balls of radius r and c·r around the query q, with the exact answer p and acceptable approximate answers p′]
Problem Statement
Problems with (most) existing work in high d:
- randomized Monte Carlo: incorrect answers possible
Randomized algorithms in low d:
- Las Vegas: always the correct answer
→ can't we have Las Vegas algorithms for high d?
Hamming Metric
Hamming space of dimension d: {0, 1}^d
- points are bit-vectors
- Hamming distance d(x, y) = # positions where x and y differ
Example, d = 3: 000, 001, 010, 011, 100, 101, 110, 111
Remarks:
- simplest high-dimensional setting
- generalizes to larger alphabets Σ = {α, β, γ, δ, . . .}
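As a quick illustration (not on the original slides), a minimal Python sketch of the Hamming distance on bit-vectors:

```python
def hamming(x, y):
    """Hamming distance: the number of positions where x and y differ."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

# Two points of the d = 3 cube listed on the slide:
print(hamming("011", "110"))  # 2 (they differ in the first and last position)
```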
Dimensionality Reduction
Main idea:
- map from high to low dimension
- preserve distances
- solve the problem in the low-dimensional space
[Figure: 8-bit points such as 00110101, 00100101, 00111101, 11100111 mapped to 3-bit points such as 011, 001, 110, 101]
→ improved performance at the cost of approximation error
Talk Overview
- Introduction (done)
- c-Nearest Neighbor Search (next)
- c-Furthest Neighbor Search
- Conclusion
Las Vegas (1+ε)-NNS
A probabilistic NNS algorithm:
- for the Hamming metric
- approximation error 1+ε
- always returns the correct answer
Recall: c-NNS can be reduced to (r, R)-PLEB, so we will solve this problem.
Las Vegas (1+ε)-NNS
Main outline:
1. hash {0,1}^d into {α,β,γ,δ,…}^O(R) → dimension O(R)
2. encode the symbols α,β,γ,δ,… as binary codes of length O(log n) → dimension O(R log n)
3. divide and conquer: divide into sets of size O(log n), solve each subproblem, take the best found solution
[Figure: a d-bit string such as 11001001101010001 hashed to R symbols (αγγ), encoded into R·log n bits (000111111), and split into log n-bit blocks (011, 001, 111)]
Hashing
Find a mapping
f is non-expansive f is (ε,R)-contractive (almost non-contractive) f : {0, 1}d → ΣD d(f(x), f(y)) ≤ Sd(x, y) d(x, y) ≥ R ⇒ d(f(x), f(y)) ≥ SR(1 − ²)
Hashing
f(x) is defined as the concatenation f(x) = f_{h1}(x) f_{h2}(x) … f_{h|H|}(x),
where each f_h(x) is defined using a hash function h(x) = a·x mod P, with P = R/ε and a ∈ [P].
In total there are P such hash functions, i.e., |H| = P.
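A minimal sketch of this hash family in Python, assuming P is simply rounded from R/ε (the paper presumably takes P prime; `make_hash_family` is a hypothetical name):

```python
def make_hash_family(R, eps):
    """Family H of hash functions h_a(i) = a*i mod P, one for each a in [P].

    Assumption: P is just round(R/eps); a real implementation would round
    up to a prime.  As on the slide, |H| = P.
    """
    P = max(2, round(R / eps))
    H = [(lambda i, a=a: (a * i) % P) for a in range(P)]
    return H, P

H, P = make_hash_family(R=10, eps=0.5)  # P = 20 hash functions into 20 buckets
```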
Hashing
Mapping f_h(x):
- map each bit x_i into bucket h(i), scanning the indices i in ascending order
- concatenate all bits within each bucket into one symbol
Example: x = 00101011 with buckets h(0)h(5) | h(2)h(4) | h(1)h(3)h(6)h(7) gives bucket strings 11, 00, 0011; each bucket string becomes one symbol of the large alphabet (γ, δ, ζ, α in the figure), yielding f_h(x) = γαδζ.
Hashing
[Figure: x = 00101011 bucketed as h(0)h(5) | h(2)h(4) | h(1)h(3)h(6)h(7), bucket strings 11, 00, 0011]
Concatenating the blocks f_h(x) over all P hash functions takes x from the d-dimensional small alphabet, via an R-dimensional large alphabet per hash function (e.g., γαδζ), to the PR-dimensional large alphabet f(x) = ααηγ … γαδζ … δξαδ.
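A sketch of one coordinate block f_h(x) in Python; the bucket assignment `h` below is hard-coded to mimic the slide's picture (the bit-indexing convention and the output order are assumptions):

```python
def f_h(x, h, num_buckets):
    """Distribute the bits of x over buckets according to h (scanning the
    indices i in ascending order), then read each bucket's bit-string as
    one symbol of the large alphabet."""
    buckets = [[] for _ in range(num_buckets)]
    for i, bit in enumerate(x):
        buckets[h(i)].append(bit)
    return ["".join(b) for b in buckets]

# Hypothetical h matching the slide: {0,5}, {2,4} and {1,3,6,7} share buckets.
h = {0: 0, 5: 0, 2: 1, 4: 1, 1: 2, 3: 2, 6: 2, 7: 2}.__getitem__
print(f_h("00101011", h, 3))  # ['00', '11', '0011'], one symbol per bucket
```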
Hashing
With |H| = S, one can prove that f is non-expansive: d(f(x), f(y)) ≤ S·d(x, y).
→ proof: for each bit where x and y differ, f can generate at most |H| = S differing symbols (at most one per hash function).
Hashing
With S = |H|, Piotr Indyk states that one can prove that f is (ε,R)-contractive:
d(x, y) ≥ R ⇒ d(f(x), f(y)) ≥ S·R·(1 − ε)
→ however, recall that h(x) = a·x mod P with P = R/ε
→ it is known that Pr[h(x) = h(y)] ≤ 1/(R/ε) = ε/R
→ (ε,R)-contractiveness seems to hold only with a certain (large) probability (?)
(Main outline recap: step 1, hashing, is done; next, step 2, coding.)
Coding
Each symbol α from Σ is mapped to a binary word C(α) of length l = O(log|Σ| / ε²), so that for α ≠ β:
d(C(α), C(β)) ∈ [(1 − ε)·l/2, l/2]
Example (l = 8): α → C(α) = 01000101, β → C(β) = 11011111
Coding
It can be shown (and also seen intuitively) that this mapping is
- non-expansive
- almost non-contractive
Also, the resulting composed mapping g = C ∘ f (hashing followed by coding) is
- non-expansive
- almost non-contractive
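The slides do not spell out the code construction; one standard candidate (an assumption here, not necessarily the paper's construction) is a random code, whose pairwise distances concentrate around l/2 for large l. A small Python check:

```python
import random

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

rng = random.Random(0)
l = 200                       # code length; the slide takes l = O(log|Σ| / ε²)
sigma = ["α", "β", "γ", "δ"]
# Random code: each symbol gets an independent uniform bit-string of length l.
C = {s: [rng.randint(0, 1) for _ in range(l)] for s in sigma}
for i, a in enumerate(sigma):
    for b in sigma[i + 1:]:
        print(a, b, hamming(C[a], C[b]), "(expected near", l // 2, ")")
```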
(Main outline recap: steps 1 and 2 are done; next, step 3, divide and conquer.)
Divide and Conquer
Partition the set of coordinates into random sets S_1, …, S_k of size s = O(log n).
Project g on the coordinate sets: g(x)|S_1, g(x)|S_2, …, g(x)|S_k.
One of the projections should be
- non-expansive
- almost non-contractive
[Figure: g(x) = 000111111 split into projections 011, 001, 111]
Divide and Conquer
Solve the NNS problem on each subproblem g(x)|S_i:
- dimension log n, an easy problem
- can precompute all O(2^{log n}) = O(n) solutions with O(n) space
Take the best solution as the answer.
The resulting algorithm is (1+ε)-approximate (lots of algebra to prove).
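A minimal sketch of the partition-and-project step in Python (the toy sizes are assumptions; a real instance would take s = O(log n)):

```python
import random

def random_partition(D, s, seed=0):
    """Randomly partition the coordinate set {0, ..., D-1} into blocks of size s."""
    rng = random.Random(seed)
    idx = list(range(D))
    rng.shuffle(idx)
    return [idx[i:i + s] for i in range(0, D, s)]

def project(x, S):
    """The projection g(x)|S: keep only the coordinates in S."""
    return "".join(x[i] for i in S)

gx = "000111111"                       # a 9-bit image g(x), as in the figure
blocks = random_partition(len(gx), 3)  # three random coordinate sets of size 3
print([project(gx, S) for S in blocks])
```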
(Main outline recap: all three steps are now in place.)
Extensions
The basic algorithm can be adapted:
- a (3+ε)-approximate deterministic algorithm: make step 3 (divide and conquer) deterministic
- other metrics:
  - embed l_1^d into an O(Δd/ε)-dimensional Hamming metric (Δ is the diameter/closest-pair ratio)
  - embed l_2^d into l_1^{O(d²)}
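For intuition, the classical way to embed l_1 with bounded integer coordinates into the Hamming cube is the unary trick; a sketch under the assumption of non-negative integer coordinates at most M (bounding the coordinates is where the Δ and ε factors in the dimension come from):

```python
def unary(v, M):
    """Unary code of an integer 0 <= v <= M: v ones followed by M - v zeros."""
    return "1" * v + "0" * (M - v)

def embed_l1_to_hamming(point, M):
    """Coordinatewise unary embedding: Hamming distance between the images
    equals the l_1 distance between the original points."""
    return "".join(unary(v, M) for v in point)

p, q = (3, 1), (0, 4)
ep, eq = embed_l1_to_hamming(p, 5), embed_l1_to_hamming(q, 5)
print(sum(a != b for a, b in zip(ep, eq)))  # 6 = |3-0| + |1-4|
```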
Talk Overview
- Introduction (done)
- c-Nearest Neighbor Search (done)
- c-Furthest Neighbor Search (next)
- Conclusion
FNS to NNS Reduction
Reduce (1+ε)-FNS to (1+ε/6)-NNS, for ε ∈ [0, 2], in Hamming spaces.
[Figure: a c-FNS instance with query q, radius r, exact answer p and approximate answer p′]
Basic Idea
For p, q ∈ {0, 1}^d: d(p, q) = d − d(p, q̄), where q̄ is the bitwise complement of q.
Example: p = 110011, q = 101011: d(p, q) = 2 = 6 − 4; with q̄ = 010100: d(p, q̄) = 4 = 6 − 2.
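A two-line check of the complement identity in Python, using the slide's example:

```python
def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def complement(q):
    """Bitwise complement q̄ of a bit-string."""
    return "".join("1" if b == "0" else "0" for b in q)

p, q = "110011", "101011"
d = len(p)
assert complement(q) == "010100"
assert hamming(p, q) == d - hamming(p, complement(q))  # 2 == 6 - 4
```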
Exact FNS to NNS
Given a set of points P in {0,1}^d:
p is the furthest neighbor of q in P ⇒ p is the nearest neighbor of q̄ in P
→ the exact versions of NNS and FNS are equivalent
[Figure: point set P with query q, its complement q̄, and answer p]
Approximate FNS to NNS
The reduction does not preserve the approximation factor:
- let p be the FN of q, with d(q, p) = R; then p is the (exact) NN of q̄, with d(q̄, p) = d − R
- let p′ be a c-NN of q̄; then d(q̄, p′) ≤ c·d(q̄, p) = c·(d − R), hence d(q, p′) = d − d(q̄, p′) ≥ d − c·(d − R)
- so, if we want p′ to be a c′-FN of q, we need c′ ≥ d(q, p)/d(q, p′) = R / (d − c·(d − R))
Approximate FNS to NNS
Equivalently, 1/c′ ≤ d/R + (1 − d/R)·c.
So the smaller d/R, the better the reduction.
→ apply dimensionality reduction to decrease d/R
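A small numeric check (the sample values are mine) of how the achieved FNS factor c′ = R / (d − c·(d − R)) improves as d/R shrinks:

```python
def fns_factor(d, R, c):
    """Worst-case FNS approximation factor obtained from a c-approximate
    NNS answer for the complemented query: c' = R / (d - c*(d - R))."""
    return R / (d - c * (d - R))

c = 1.1                          # NNS approximation factor on the complement q̄
for ratio in (8, 2, 1.25):       # d/R
    print(f"d/R = {ratio}: c' = {fns_factor(ratio * 100, 100, c):.3f}")
# d/R = 8 gives c' ≈ 3.33; d/R = 1.25 gives c' ≈ 1.03: smaller d/R is better.
```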
Approximate FNS to NNS
With a similar hashing and coding technique, one can reduce d/R and prove the claimed reduction from (1+ε)-FNS to (1+ε/6)-NNS.