Curse of Dimensionality in Pivot-based Indexes
Ilya Volnyansky, Vladimir Pestov (presentation slides, SISAP 2009)


  1. Curse of Dimensionality in Pivot-based Indexes
     Ilya Volnyansky, Vladimir Pestov
     Department of Mathematics and Statistics, University of Ottawa, Ottawa, Ontario, Canada
     SISAP 2009, Prague, 29/09/2009

  2. Outline
     1. Overview: The Setting for Similarity Search; Previous Work
     2. Our Work: Framework; Concentration of Measure; Statistical Learning Theory; Asymptotic Bounds

  3. Similarity Workloads
     - Universe Ω: a metric space with metric ρ.
     - Dataset X ⊂ Ω, always finite, with metric ρ.
     - A range query: given q ∈ Ω and r > 0, find {x ∈ X | ρ(x, q) < r}.
     For analysis purposes, we add:
     - A measure μ on Ω.
     - Treat X as an i.i.d. sample ∼ μ of size n.

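As a point of reference, answering a range query with no index at all is a linear scan over X. A minimal sketch in Python, with our own names (`rho` stands for the metric ρ):

```python
def range_query(X, rho, q, r):
    """Brute-force range query: all x in X with rho(x, q) < r.

    Costs exactly |X| distance computations, i.e. the baseline
    that any indexing scheme must beat.
    """
    return [x for x in X if rho(x, q) < r]
```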

  4. Curse of dimensionality conjecture
     All indexing schemes suffer from the curse of dimensionality:
     Conjecture. If d = ω(log n) and d = n^{o(1)}, any sequence of indexes built on a sequence of datasets X_d ⊂ Σ^d allowing similarity search in time polynomial in d must use n^{ω(1)} space. [Handbook of Discrete and Computational Geometry]
     The Hamming cube Σ^d of dimension d: the set of all binary sequences of length d.

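For concreteness, the metric on the Hamming cube counts the coordinates at which two binary sequences differ. A one-function illustration (our naming, not the slides'):

```python
def hamming(x, y):
    """Hamming distance on {0,1}^d: the number of differing coordinates."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

# e.g. hamming((0, 1, 1, 0), (1, 1, 0, 0)) == 2
```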

  5. Fixed dimension
     Examples of previous work: let n, the size of X, vary, but keep the space (Ω, ρ, μ) fixed.
     - The usual "asymptotic" analysis in the CS sense.
     - Does not investigate the curse of dimensionality.

  6. Fixed n
     Let the dimension, and hence (Ω, ρ, μ), vary, but keep the size n of X the same, e.g. [Weber 98], [Chávez 01].
     - Too small a sample size n makes it easier to index spaces of high dimension d.
     - When both d and n vary, the math is more challenging.

  7. Points to keep in mind
     - Distinction between X and Ω.
     - Both d and n grow.
     - Need to make assumptions about the sequence of Ω's (?)
     - Need to make assumptions about the indexes.

  8. Gameplan
     1. Pick an index type to analyze.
     2. Pick a cost model.
     3. The sequence of Ω's exhibits concentration of measure; the "intrinsic dimension" grows.
     4. Statistical Learning Theory: linking properties of the Ω's and properties of the X's.
     5. Conclusion: if all conditions are met, the Curse of Dimensionality will take place.

  9. Main Result
     From a sequence of metric spaces with measure (Ω_d, ρ_d, μ_d), d = 1, 2, 3, ..., take i.i.d. samples (datasets) X_d ∼ μ_d. Assume:
     - (Ω_d, ρ_d, μ_d) display the concentration of measure.
     - The VC dimension of closed balls in (Ω_d, ρ_d) is O(d).
     - We build a pivot index using k pivots, where k = o(n_d / d).
     - The sample size n_d satisfies d = ω(log n_d) and d = n_d^{o(1)}.
     - We perform queries of radius = NN (the nearest-neighbour distance).
     Then: for arbitrarily small fixed ε, η > 0, there exists D such that for all d ⩾ D, the probability that at least half the queries on dataset X_d take less than (1 − ε)n_d time is less than η.

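Restated in plain LaTeX, as we read it off the slide (assuming an amsthm-style theorem environment):

```latex
% Main result, restated from the slide.
\begin{theorem}
Let $(\Omega_d, \rho_d, \mu_d)$, $d = 1, 2, 3, \dots$, be a sequence of
metric spaces with measure exhibiting concentration of measure, in which
the VC dimension of the class of closed balls is $O(d)$. Let
$X_d \sim \mu_d$ be i.i.d. samples of size $n_d$ with
$d = \omega(\log n_d)$ and $d = n_d^{\,o(1)}$, indexed by $k$ pivots
with $k = o(n_d / d)$ and queried at the nearest-neighbour radius.
Then for every $\varepsilon, \eta > 0$ there is a $D$ such that for all
$d \geq D$,
\[
  \Pr\bigl[\text{at least half the queries on } X_d \text{ cost }
  < (1 - \varepsilon)\, n_d \bigr] < \eta .
\]
\end{theorem}
```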

  10. Pivot indexing scheme
      Build an index:
      1. Pick {p_1, ..., p_k} from X.
      2. Calculate the n × k array of distances ρ(x, p_i), 1 ⩽ i ⩽ k, x ∈ X.
      Perform a query, given q and r:
      1. Compute ρ_k(q, x) := sup_{1⩽i⩽k} |ρ(q, p_i) − ρ(x, p_i)|.
      2. Since ρ(q, x) ⩾ ρ_k(q, x), there is no need to compute ρ(q, x) if ρ_k(q, x) > r.
      3. Compute ρ(q, x) otherwise.

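A minimal runnable sketch of this scheme in Python, under our own choices (names are ours; pivots are drawn at random from X, one of several reasonable selection strategies):

```python
import random

class PivotIndex:
    """Pivot table: precomputed distances rho(x, p_i) for k pivots."""

    def __init__(self, X, rho, k):
        self.X, self.rho = list(X), rho
        self.pivots = random.sample(self.X, k)      # pick {p_1, ..., p_k} from X
        # n x k array of distances rho(x, p_i)
        self.table = [[rho(x, p) for p in self.pivots] for x in self.X]

    def query(self, q, r):
        """Range query: discard x whenever rho_k(q, x) > r (triangle inequality)."""
        dq = [self.rho(q, p) for p in self.pivots]  # k distance computations
        result = []
        for x, row in zip(self.X, self.table):
            # rho_k(q, x) = max_i |rho(q, p_i) - rho(x, p_i)|, a lower bound on rho(q, x)
            rho_k = max(abs(a - b) for a, b in zip(dq, row))
            if rho_k > r:
                continue                            # discarded: rho(q, x) never computed
            if self.rho(q, x) < r:                  # verify the surviving candidates
                result.append(x)
        return result
```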

  11. The cost model
      - Only one operation has a cost: computing ρ(q, x).
      - Computing ρ_k(q, x) costs k.
      - Let C_{q,r,p_1,...,p_k} denote all the discarded points in X: {x ∈ X | ρ_k(q, x) > r}.
      - Let n = |X|. Total cost: k + n − |C_{q,r,p_1,...,p_k}|.

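Under this cost model the per-query cost can be computed directly from the pivot table. A sketch building on the hypothetical PivotIndex class above:

```python
def query_cost(index, q, r):
    """Cost of one query on a PivotIndex: k plus one rho(q, x) per survivor.

    Equals k + n - |C_{q,r,p_1,...,p_k}| in the slide's notation, where the
    discarded set C collects every x with rho_k(q, x) > r.
    """
    dq = [index.rho(q, p) for p in index.pivots]
    discarded = sum(
        max(abs(a - b) for a, b in zip(dq, row)) > r
        for row in index.table
    )
    return len(index.pivots) + len(index.X) - discarded
```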

  12. Concentration of Measure
      A function f : Ω → ℝ is 1-Lipschitz if |f(ω_1) − f(ω_2)| ⩽ ρ(ω_1, ω_2) for all ω_1, ω_2 ∈ Ω.
      Examples: f(x) = x, f(x) = (1/2)x, f(x) = √(x² + 1).
      Its median is a number M such that μ{ω | f(ω) ⩽ M} ⩾ 1/2 and μ{ω | f(ω) ⩾ M} ⩾ 1/2.

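Concentration of measure says that on a high-dimensional space, a 1-Lipschitz function is nearly constant: almost all of the measure sits close to the median M. A toy experiment under our own choices (normalized Hamming metric on the cube, f = distance to a fixed point), not part of the slides:

```python
import random
import statistics

def concentration_demo(d, n=10_000):
    """Empirical spread of a 1-Lipschitz function on ({0,1}^d, normalized Hamming).

    f(x) = normalized Hamming distance from x to the origin is 1-Lipschitz;
    as d grows, f concentrates sharply around its median (here 1/2).
    """
    f = lambda x: sum(x) / d          # fraction of 1s = distance to (0, ..., 0)
    sample = [f([random.randint(0, 1) for _ in range(d)]) for _ in range(n)]
    M = statistics.median(sample)
    spread = sum(abs(v - M) > 0.05 for v in sample) / n
    return M, spread                  # spread -> 0 as d grows

# e.g. compare concentration_demo(10) with concentration_demo(1000):
# the fraction of points more than 0.05 from the median collapses.
```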
