

  1. Diverse Near Neighbor Problem. Sofiane Abbar (QCRI), Sihem Amer-Yahia (CNRS), Piotr Indyk (MIT), Sepideh Mahabadi (MIT), Kasturi R. Varadarajan (UIowa)

  2. Near Neighbor Problem
     Definition
     • Set of n points P in d-dimensional space
     • Query point q
     • Report one neighbor of q, if there is any
     • Neighbor: a point within distance r of the query
     Applications
     • Major importance in databases (document, image, video), information retrieval, pattern recognition
     • The object of interest is represented as a point; similarity is measured as distance

  3. Motivation
     Search: how many answers?
     • Small output size, e.g. 10
       – reporting the k nearest neighbors may not be informative (they could be identical texts)
     • Large output size
       – the time to retrieve the answers is high
     • Goal: a small output size that is relevant and diverse
       – it is good to have a result from each cluster, i.e., the output should be diverse

  4. Diverse Near Neighbor Problem
     Definition
     • Set of n points P in d-dimensional space
     • Query point q
     • Report the k most diverse neighbors of q
     • Neighbor: a point within distance r of the query (we use the Hamming distance)
     • Diversity: div(S) = min_{p, p' ∈ S} ||p - p'|| (see the sketch below)
     • Goal: report a set S (the green points in the slide figure) such that
       – S ⊆ P ∩ B(q, r)
       – |S| = k
       – div(S) is maximized
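A minimal sketch of the diversity objective under the Hamming distance (the function names are illustrative, not from the slides; points are assumed to be tuples of bits):

    from itertools import combinations

    def hamming(p, q):
        # Number of coordinates in which two binary vectors differ.
        return sum(pi != qi for pi, qi in zip(p, q))

    def diversity(S):
        # div(S) = minimum pairwise distance over the reported set (needs |S| >= 2).
        return min(hamming(p, q) for p, q in combinations(S, 2))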

  5. Approximation
     We want sublinear query time, so we need to approximate.
     • Approximate NN:
       – relax B(q, r) to B(q, cr) for some value of c > 1
       – known result: query time of O(d·n^{1/c})
     • Approximate Diverse NN: bi-criterion approximation, in distance and in diversity
     • (c, β)-approximate k-diverse near neighbor:
       – let S* (the green points) be the optimum solution within B(q, r)
       – report approximate neighbors S (the purple points) with S ⊆ B(q, cr)
       – the diversity approximates the optimum diversity: div(S) ≥ (1/β)·div(S*), for some β ≥ 1

  6. Results
                                Algorithm A                                                Algorithm B
     Distance apx. factor c     c > 2                                                      c > 1
     Diversity apx. factor α    6                                                          6
     Space                      O((n log k)^{1+1/(c-1)} + nd)                              O(log k · n^{1+1/c} + nd)
     Query time                 O((k² + (log n)/r) · d · (log k)^{c/(c-1)} · n^{1/(c-1)})  O((k² + (log n)/r) · d · log k · n^{1/c})

     • Algorithm A was earlier introduced in [Abbar, Amer-Yahia, Indyk, Mahabadi, WWW'13]

  7. Techniques

  8. Computing k-diversity: GMM
     • Given n points, compute the k-subset with maximum diversity
     • Exact version: NP-hard to approximate within a factor better than 2 [Ravi et al.]
     • GMM algorithm [Ravi et al.] [Gonzalez] (sketched below):
       – choose an arbitrary point
       – repeat k - 1 times: add the point whose minimum distance to the currently chosen points is maximized
     • Achieves approximation factor 2
     • Running time of the algorithm is O(kn)
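A minimal sketch of the GMM greedy selection under the Hamming distance (illustrative code, not the authors' implementation):

    def hamming(p, q):
        return sum(pi != qi for pi, qi in zip(p, q))

    def gmm(points, k):
        # Greedy max-min selection (Gonzalez): 2-approximation for diversity, O(kn) distance evaluations.
        chosen = [points[0]]                                    # arbitrary starting point
        dist = [hamming(p, chosen[0]) for p in points]          # distance of each point to the chosen set
        for _ in range(min(k, len(points)) - 1):
            i = max(range(len(points)), key=lambda j: dist[j])  # farthest point from the chosen set
            chosen.append(points[i])
            dist = [min(dist[j], hamming(points[j], points[i])) for j in range(len(points))]
        return chosen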

  9. Locality Sensitive Hashing (LSH)
     • LSH: close points have a higher probability of collision than far points
     • Hash functions g_1, ..., g_L, where g_i = <h_{i,1}, ..., h_{i,t}> and each h_{i,j} ∈ H is chosen randomly
     • H is a family of hash functions that is (P1, P2, r, cr)-sensitive:
       – if ||p - p'|| ≤ r then Pr[h(p) = h(p')] ≥ P1
       – if ||p - p'|| ≥ cr then Pr[h(p) = h(p')] ≤ P2
     • Example for the Hamming distance:
       – h(p) = p_i, i.e., the i-th bit of p
       – this family is (1 - r/d, 1 - cr/d, r, cr)-sensitive
     • L and t are the parameters of LSH (see the sketch below)
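A minimal sketch of bit-sampling LSH for the Hamming distance (the parameter names L and t follow the slides; the data layout is an assumption):

    import random
    from collections import defaultdict

    def build_lsh(points, L, t, seed=0):
        # Each hash function g_i concatenates t randomly sampled bit positions.
        rng = random.Random(seed)
        d = len(points[0])
        funcs = [tuple(rng.randrange(d) for _ in range(t)) for _ in range(L)]
        tables = [defaultdict(list) for _ in range(L)]
        for p in points:                          # points are tuples of bits
            for g, table in zip(funcs, tables):
                table[tuple(p[i] for i in g)].append(p)
        return funcs, tables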

  10. LSH-based Naïve Algorithm [Indyk, Motwani]
     • Parameters L and t can be set s.t., with constant probability,
       – any neighbor of q falls into the same bucket as q in at least one hash function, and
       – the total number of outliers is at most 3L (outlier: a point farther than cr from the query point)
     • Algorithm: keep an array of buckets A_1, ..., A_L, one per hash function; for a query q (see the sketch below):
       – retrieve the possible neighbors S = ⋃_{i=1}^{L} A_i[g_i(q)]
       – remove the outliers: S = S ∩ B(q, cr)
       – report the approximate k most diverse points of S, i.e., GMM(S)
     • Achieves a (c, 2)-approximation
     • But the running time may be linear in n, so we should prune the buckets before collecting them
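A minimal sketch of the naïve query on top of the LSH tables above (reusing the illustrative build_lsh, hamming, and gmm helpers; not the authors' exact code):

    def naive_diverse_nn(funcs, tables, q, k, c, r):
        # Collect every point colliding with q in some table, drop the outliers, run GMM.
        candidates = set()
        for g, table in zip(funcs, tables):
            candidates.update(table.get(tuple(q[i] for i in g), []))
        S = [p for p in candidates if hamming(p, q) <= c * r]   # S = S ∩ B(q, cr)
        return gmm(S, k) if len(S) > k else S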

  11. Core-sets
     • Core-sets [Agarwal, Har-Peled, Varadarajan]: a subset of a point set S that represents it
       – it approximately determines the solution to an optimization problem
       – it composes: a union of core-sets is a core-set of the union
       – β-core-set: approximates the cost up to a factor of β
     • Our optimization problem: finding the k-diversity of S
     • Instead we consider the k-center cost of S (see the sketch below):
       – KC(S, S') = max_{p ∈ S} min_{p' ∈ S'} ||p - p'||
       – KC_k(S) = min_{S' ⊆ S, |S'| = k} KC(S, S')
     • The KC cost 2-approximates diversity: KC_{k-1}(S) ≤ div_k(S) ≤ 2·KC_{k-1}(S)
     • GMM computes a 1/3-core-set for the KC cost
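A minimal sketch of the k-center cost that the core-sets approximate (illustrative helper, reusing the hamming function from above):

    def kc_cost(S, centers):
        # KC(S, S') = max over p in S of the distance from p to its nearest point in S'.
        return max(min(hamming(p, s) for s in centers) for p in S)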

  12. Algorithms

  13. Algorithm A
     • Parameters L and t can be set s.t., with constant probability,
       – any neighbor of q falls into the same bucket as q in at least one hash function, and
       – there is no outlier
     • No need to keep all the points in each bucket, just keep a core-set!
       – A'_i[j] = GMM(A_i[j])
       – i.e., keep a 1/3-core-set of each bucket A_i[j]
     • Given a query q (see the sketch below):
       – retrieve the core-sets from the buckets: S = ⋃_{i=1}^{L} A'_i[g_i(q)]
       – run GMM(S)
       – report the result
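A minimal sketch of Algorithm A's preprocessing and query, built on the illustrative helpers above (build_lsh, gmm); the per-bucket layout is an assumption, not the authors' exact data structure:

    def preprocess_A(points, L, t, k):
        # Index the points with LSH, then keep only a GMM core-set of size k per bucket.
        funcs, tables = build_lsh(points, L, t)
        pruned = [{key: gmm(bucket, k) for key, bucket in table.items()} for table in tables]
        return funcs, pruned

    def query_A(funcs, pruned, q, k):
        # Union of the per-bucket core-sets, followed by one final GMM call.
        S = []
        for g, table in zip(funcs, pruned):
            S.extend(table.get(tuple(q[i] for i in g), []))
        return gmm(S, k) if len(S) > k else S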

  14. Analysis
     • Achieves a (c, 6)-approximation
       – a union of 1/3-core-sets is a 1/3-core-set for the union
       – the last GMM call adds another approximation factor of 2
     • Only works if we set L and t s.t. there is no outlier in S with constant probability
       – Space: O(nL) = O((n log k)^{1+1/(c-1)} + nd)
       – Time: O(Lk²) = O((k² + (log n)/r) · d · (log k)^{c/(c-1)} · n^{1/(c-1)})
       – Only makes sense for c > 2
     • Not optimal:
       – the ANN query time is O(d·n^{1/c})
       – so if we want to improve over these bounds we should be able to deal with outliers

  15. Robust Core-sets
     • S' is an ℓ-robust β-core-set for S if
       – for any set O of outliers of size at most ℓ,
       – (S' \ O) is a β-core-set for S \ O
     • Peeling algorithm [Agarwal, Har-Peled, Yu '06] [Varadarajan, Xiao '12] (sketched below): repeat (ℓ + 1) times
       – compute a β-core-set of S
       – add its points to the core-set S'
       – remove them from the set S
     • Note: if we order the points of S' as we find them, then the first (ℓ' + 1)·k points also form an ℓ'-robust β-core-set
     [Slide figure: an example with k = 2, showing a 2-robust core-set S' = {3, 5; 2, 9; 1, 6} whose first two batches form a 1-robust core-set]
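A minimal sketch of the peeling construction, using the illustrative gmm helper above as the β-core-set routine (consistent with the slides' use of GMM as a 1/3-core-set):

    def robust_coreset(points, k, ell):
        # Peel off a GMM core-set (ell + 1) times; kept in the order found, the result is an
        # ell-robust core-set of size at most (ell + 1) * k, and each prefix of (l + 1) * k
        # points is an l-robust core-set for l <= ell.
        remaining = list(points)
        ordered = []
        for _ in range(ell + 1):
            if not remaining:
                break
            batch = gmm(remaining, k)
            ordered.extend(batch)
            remaining = [p for p in remaining if p not in batch]
        return ordered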

  16. Algorithm B
     • Parameters L and t can be set s.t., with constant probability,
       – any neighbor of q falls into the same bucket as q in at least one hash function, and
       – the total number of outliers is at most 3L
     • For each bucket A_i[j], keep a 3L-robust 1/3-core-set A'_i[j], which has size (3L + 1)·k
     • For a query q (see the sketch below):
       – for each bucket A'_i[g_i(q)], find the smallest ℓ s.t. the first ℓ·k points contain fewer than ℓ outliers, and add those ℓ·k points to S
       – remove the outliers from S
       – return GMM(S)
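A minimal sketch of Algorithm B's query over per-bucket robust core-sets (illustrative; assumes each bucket stores the ordered output of the robust_coreset sketch above):

    def query_B(funcs, robust_buckets, q, k, c, r):
        S = []
        for g, table in zip(funcs, robust_buckets):
            ordered = table.get(tuple(q[i] for i in g), [])   # ordered robust core-set of the bucket
            ell = 1
            while ordered:
                prefix = ordered[:ell * k]
                outliers = sum(1 for p in prefix if hamming(p, q) > c * r)
                # Stop at the smallest ell whose first ell*k points contain fewer than ell outliers.
                if outliers < ell or len(prefix) == len(ordered):
                    S.extend(prefix)
                    break
                ell += 1
        S = [p for p in S if hamming(p, q) <= c * r]          # remove the remaining outliers
        return gmm(S, k) if len(S) > k else S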

  17. Example and Analysis
     • Total number of outliers ≤ 3L, so |S| = O(Lk)
     • Time: O(Lk²) = O((k² + (log n)/r) · d · log k · n^{1/c})
     • Space: O(nL) = O(log k · n^{1+1/c} + nd)
     • Achieves a (c, 6)-approximation, for the same reason as Algorithm A

  18. Conclusion
                                 Algorithm A          Algorithm B       ANN
     Distance apx. factor        c > 2                c > 1             c > 1
     Diversity apx. factor α     6                    6                 -
     Space                       ~ n^{1+1/(c-1)}      ~ n^{1+1/c}       n^{1+1/c}
     Query time                  ~ d·n^{1/(c-1)}      ~ d·n^{1/c}       d·n^{1/c}
     (~ hides lower-order and logarithmic factors)

     Further work
     • Improve the diversity factor α
     • Consider other definitions of diversity, e.g., the sum of distances

  19. Thank You!
