SLIDE 1

Diverse Near Neighbor Problem

Sofiane Abbar (QCRI) Sihem Amer-Yahia (CNRS) Piotr Indyk (MIT) Sepideh Mahabadi (MIT) Kasturi R. Varadarajan (UIowa)

SLIDE 2

Near Neighbor Problem

  • Definition
    – Set of n points P in d-dimensional space
    – Query point q
    – Report one neighbor of q, if there is any
  • Neighbor: a point within distance r of the query
  • Applications
    – Of major importance in databases (document, image, video), information retrieval, and pattern recognition
    – Each object of interest is modeled as a point
    – Similarity is measured as distance
SLIDE 3

Motivation

Search: how many answers?

  • Small output size, e.g., 10
    – Reporting the k nearest neighbors may not be informative (they could be near-identical texts)
  • Large output size
    – Time to retrieve them is high
  • Want a small output that is relevant and diverse
    – Good to have a result from each cluster, i.e., the output should be diverse

SLIDE 4

Diverse Near Neighbor Problem

  • Definition
    – Set of n points P in d-dimensional space
    – Query point q
    – Report the k most diverse neighbors of q
  • Neighbor:
    – Points within distance r of the query
    – We use the Hamming distance
  • Diversity:
    – div(S) = min_{p, p′ ∈ S} |p − p′|
  • Goal: report R (the green points) s.t.
    – R ⊆ P ∩ B(q, r)
    – |R| = k
    – div(R) is maximized

SLIDE 5

Approximation

  • Want sublinear query time, so we need to approximate
  • Approximate NN:
    – B(q, r) → B(q, cr) for some value of c > 1
    – Result: query time of O(d · n^{1/c})
  • Approximate Diverse NN:
    – Bi-criterion approximation: distance and diversity
    – (c, β)-approximate k-diverse near neighbor
    – Let R* (the green points) be the optimum solution for B(q, r)
  • Report approximate neighbors R (the purple points)
    – R ⊆ B(q, cr)
  • Diversity approximates the optimum diversity
    – div(R) ≥ (1/β) · div(R*), for some β ≥ 1

SLIDE 6

Results

  • Algorithm A: distance apx. factor c > 2; diversity apx. factor α = 6;
    space (n log k)^{1+1/(c−1)} + nd;
    query time (k² + (log n)/r) · d · (log k)^{c/(c−1)} · n^{1/(c−1)}
  • Algorithm B: distance apx. factor c > 1; diversity apx. factor α = 6;
    space log k · n^{1+1/c} + nd;
    query time (k² + (log n)/r) · d · log k · n^{1/c}
  • Algorithm A was earlier introduced in [Abbar, Amer-Yahia, Indyk, Mahabadi, WWW'13]

SLIDE 7

Techniques

SLIDE 8

Compute k-diversity: GMM

  • Given n points, compute the size-k subset with maximum diversity
  • Exact: NP-hard to approximate better than a factor 2 [Ravi et al.]
  • GMM algorithm [Ravi et al.] [Gonzalez]
    – Choose an arbitrary point
    – Repeat k − 1 times:
      • Add the point whose minimum distance to the currently chosen points is maximized
  • Achieves approximation factor 2
  • Running time of the algorithm is O(kn)
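The greedy loop above fits in a few lines of Python. A minimal sketch (the name `gmm`, the list-of-points representation, and the `dist` callback are illustrative choices, not from the slides):

```python
import random

def gmm(points, k, dist):
    """Greedy furthest-point heuristic (Gonzalez / GMM).

    Start from an arbitrary point; then k-1 times add the point whose
    minimum distance to the already-chosen set is largest.
    2-approximates the max-min diversity using O(kn) distance calls.
    """
    chosen = [random.choice(points)]
    # Minimum distance of every point to the chosen set, updated incrementally.
    min_dist = [dist(p, chosen[0]) for p in points]
    for _ in range(k - 1):
        far = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(points[far])
        min_dist = [min(m, dist(p, points[far])) for p, m in zip(points, min_dist)]
    return chosen
```

For example, on the 1-D points {0, 1, 10} with k = 2, the optimum diversity is 10, and whatever start point is drawn, the pair returned is at distance at least 9, within the factor-2 guarantee.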
SLIDE 9

Locality Sensitive Hashing (LSH)

  • LSH
    – Close points have a higher probability of collision than far points
    – Hash functions: g_1, …, g_L
      • g_i = <h_{i,1}, …, h_{i,u}>
      • Each h_{i,j} ∈ H is chosen randomly
      • H is a family of hash functions which is (P_1, P_2, r, cr)-sensitive:
        – If |p − p′| ≤ r then Pr[h(p) = h(p′)] ≥ P_1
        – If |p − p′| ≥ cr then Pr[h(p) = h(p′)] ≤ P_2
  • Example: Hamming distance:
    – h(p) = p_i, i.e., the i-th bit of p
    – Is (1 − r/d, 1 − cr/d, r, cr)-sensitive
    – L and u are parameters of LSH
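For Hamming space, the sensitive family is just bit sampling, and each g concatenates u sampled bits. A minimal sketch (`make_g` and the tuple-of-bits bucket key are illustrative choices, not from the slides):

```python
import random

def make_g(d, u, rng):
    """One LSH function g = <h_1, ..., h_u> for d-bit Hamming space.

    Each h_i samples a random coordinate, so two points at Hamming
    distance t agree on a single h with probability 1 - t/d, and
    collide under the whole g with probability (1 - t/d)^u.
    """
    coords = [rng.randrange(d) for _ in range(u)]
    return lambda p: tuple(p[i] for i in coords)
```

In the algorithms that follow, L such functions are drawn independently and every point is hashed into L tables keyed by g_j(p).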

SLIDE 10

LSH-based Naïve Algorithm

  • [Indyk, Motwani] Parameters L and u can be set s.t. with constant probability
    – Any neighbor of q falls into the same bucket as q in at least one hash function
    – The total number of outliers is at most 3L
    – Outlier: a point farther than cr from the query point

Algorithm

  • Keep an array (hash table) for each hash function: B_1, …, B_L
  • For a query q:
    – Retrieve the possible neighbors S = ⋃_{j=1}^{L} B_j[g_j(q)]
    – Remove the outliers: S = S ∩ B(q, cr)
    – Report the approximate k most diverse points of S, i.e., GMM(S)
  • Achieves a (c, 2)-approximation
  • Running time may be linear in n
    – Should prune the buckets before collecting them
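The naïve query is: union the L buckets of q, filter the outliers, and run GMM on the survivors. A sketch under assumed names (`build_tables`, `naive_diverse_nn`, and the pluggable `gmm` subroutine are illustrative, not from the slides):

```python
from collections import defaultdict

def build_tables(points, gs):
    """Preprocessing: one hash table per LSH function g_1, ..., g_L."""
    tables = []
    for g in gs:
        table = defaultdict(list)
        for p in points:
            table[g(p)].append(p)
        tables.append(table)
    return tables

def naive_diverse_nn(q, k, cr, tables, gs, dist, gmm):
    """Query: collect candidates S from all L buckets of q, drop the
    outliers (points farther than cr from q), and report the k most
    diverse survivors via the GMM subroutine.  |S| can be linear in n
    in the worst case, hence the need to prune buckets."""
    seen = {tuple(p) for g, table in zip(gs, tables) for p in table[g(q)]}
    S = [list(p) for p in seen if dist(list(p), q) <= cr]
    return gmm(S, k, dist)
```

The pruning the slide calls for is exactly what the coreset-based Algorithms A and B add on top of this skeleton.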

SLIDE 11

Core-sets

  • Core-sets [Agarwal, Har-Peled, Varadarajan]: a subset of a point set S that represents it
    – Approximately determines the solution to an optimization problem
    – Composes: a union of coresets is a coreset of the union
    – β-coreset: approximates the cost up to a factor of β
  • Our optimization problem:
    – Finding the k-diversity of S
    – Instead, we consider the k-center (KC) cost of S
      • KC(S, S′) = max_{p ∈ S} min_{p′ ∈ S′} |p − p′|
      • KC_k(S) = min_{S′ ⊆ S, |S′| = k} KC(S, S′)
    – The KC cost 2-approximates diversity:
      • KC_{k−1}(S) ≤ div_k(S) ≤ 2 · KC_{k−1}(S)
  • GMM computes a 1/3-coreset for the KC cost
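The two costs and the sandwich KC_{k−1}(S) ≤ div_k(S) ≤ 2 · KC_{k−1}(S) can be checked by brute force on tiny inputs (function names are illustrative; the exponential enumeration here is only for verification, since the point of the slides is that GMM avoids it):

```python
from itertools import combinations

def kc(S, Sp, dist):
    """KC(S, S'): max over p in S of its distance to the nearest point of S'."""
    return max(min(dist(p, pp) for pp in Sp) for p in S)

def kc_cost(S, k, dist):
    """KC_k(S): cheapest k-point center set drawn from S (brute force)."""
    return min(kc(S, c, dist) for c in combinations(S, k))

def diversity(S, k, dist):
    """div_k(S): largest min pairwise distance over all k-subsets (brute force)."""
    return max(min(dist(p, pp) for p, pp in combinations(c, 2))
               for c in combinations(S, k))
```

For example, on the line points {0, 1, 4, 9}: KC_2 = 3 (centers {1, 9}), div_3 = 4 (subset {0, 4, 9}), and indeed 3 ≤ 4 ≤ 2 · 3.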
SLIDE 12

Algorithms

SLIDE 13

Algorithm A

  • Parameters L and u can be set s.t. with constant probability
    – Any neighbor of q falls into the same bucket as q in at least one hash function
    – There is no outlier
  • No need to keep all the points in each bucket: just keep a coreset!
    – B′_j[b] = GMM(B_j[b])
    – i.e., keep a 1/3-coreset of each bucket B_j[b]
  • Given query q
    – Retrieve the coresets from the buckets: S = ⋃_{j=1}^{L} B′_j[g_j(q)]
    – Run GMM(S)
    – Report the result

SLIDE 14

Analysis

  • Achieves a (c, 6)-approximation
    – A union of 1/3-coresets is a 1/3-coreset for the union
    – The last GMM call adds another factor-2 approximation
  • Only works if we set L and u s.t. there is no outlier in S with constant probability
    – Space: O(nL) = O((n log k)^{1+1/(c−1)} + nd)
    – Time: O(Lk²) = O((k² + (log n)/r) · d · (log k)^{c/(c−1)} · n^{1/(c−1)})
    – Only makes sense for c > 2
  • Not optimal:
    – The ANN query time is O(d · n^{1/c})
    – So if we want to improve over these bounds, we should be able to deal with outliers

SLIDE 15

Robust Core-sets

  • S′ is an m-robust β-coreset for S if
    – for any set O of outliers of size at most m,
    – (S′ \ O) is a β-coreset for (S \ O)
  • Peeling algorithm [Agarwal, Har-Peled, Yu '06] [Varadarajan, Xiao '12]:
    – Repeat (m + 1) times:
      • Compute a β-coreset of S
      • Add its points to the coreset S′
      • Remove them from the set S

Note: if we order the points in S′ as we find them, then the first (m′ + 1)k points also form an m′-robust β-coreset.

[Figure: a 2-robust coreset S′ = {3, 5; 2, 9; 1, 6}, listed in peeling order; its first two rounds {3, 5; 2, 9} form a 1-robust coreset.]
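The peeling construction is a direct loop. A minimal sketch (`peel_robust_coreset` and its pluggable base routine `coreset_fn`, e.g., a GMM call, are illustrative names):

```python
def peel_robust_coreset(points, m, coreset_fn):
    """Peeling [Agarwal, Har-Peled, Yu '06; Varadarajan, Xiao '12]:
    run the base beta-coreset routine (m + 1) times, removing each
    round's points before the next, and record the rounds in order.
    Prefixes of the output then give robust coresets for smaller m'."""
    remaining = list(points)
    ordered = []
    for _ in range(m + 1):
        layer = coreset_fn(remaining)
        ordered.extend(layer)
        remaining = [p for p in remaining if p not in layer]
    return ordered
```

As a toy stand-in for GMM, a base routine returning the min and max of a 1-D set peels {1, …, 6} with m = 2 into the ordered coreset [1, 6, 2, 5, 3, 4].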

SLIDE 16

Algorithm B

  • Parameters L and u can be set s.t. with constant probability
    – Any neighbor of q falls into the same bucket as q in at least one hash function
    – The total number of outliers is at most 3L
  • For each bucket B_j[b], keep a 3L-robust 1/3-coreset B′_j[b], which has size (3L + 1)k
  • For query q
    – For each bucket B′_j[g_j(q)]:
      • Find the smallest m s.t. the first mk points contain fewer than m outliers
      • Add those mk points to S
    – Remove the outliers from S
    – Return GMM(S)
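The per-bucket prefix scan can be sketched as follows (`collect_from_robust_bucket` and the `is_outlier` predicate are illustrative names; the bucket is assumed to store its robust coreset in peeling order):

```python
def collect_from_robust_bucket(bucket, k, is_outlier):
    """Scan prefixes of size m*k of a bucket's robust coreset until a
    prefix contains fewer than m outliers; that prefix is safe to add
    to the candidate set S.  The loop terminates because m eventually
    exceeds the bucket's total outlier count."""
    m = 1
    while True:
        prefix = bucket[:m * k]
        if sum(1 for p in prefix if is_outlier(p)) < m:
            return prefix
        m += 1
```

The outliers in the returned prefix are discarded afterwards, and GMM runs once on the union over all L buckets.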

SLIDE 17

Example and Analysis

  • Total # of outliers ≤ 3L, so |S| ≤ O(Lk)
  • Time: O(Lk²) = O((k² + (log n)/r) · d · log k · n^{1/c})
  • Space: O(nL) = O(log k · n^{1+1/c} + nd)
  • Achieves a (c, 6)-approximation, for the same reason as Algorithm A
SLIDE 18

Conclusion

Further Work

  • Improve the diversity factor α
  • Consider other definitions of diversity, e.g., sum of distances

Summary of bounds:

  • Algorithm A: distance apx. factor c > 2; diversity apx. factor α = 6; space ~n^{1+1/(c−1)}; query time ~d · n^{1/(c−1)}
  • Algorithm B: distance apx. factor c > 1; diversity apx. factor α = 6; space ~n^{1+1/c}; query time ~d · n^{1/c}
  • ANN: distance apx. factor c > 1; space n^{1+1/c}; query time d · n^{1/c}

SLIDE 19

Thank You!