
Fast Evaluation of Union-Intersection Expressions

Philip Bille, Anna Pagh, Rasmus Pagh, IT University of Copenhagen


  1. Fast Evaluation of Union-Intersection Expressions. Philip Bille, Anna Pagh, Rasmus Pagh (IT University of Copenhagen)

  2. Data Structures for Intersection Queries
     • Preprocess a collection of sets $S_1, \ldots, S_m$ independently into a representation that supports intersection queries of the form $S_i \cap S_j$, $1 \le i, j \le m$.
     • Application: Boolean AND-queries in search engines.
       • For each word, store the set of documents containing the word.
       • To search for documents that contain words x and y, compute the intersection of the corresponding document sets.
     • Generalizes to arbitrary expressions over the set collection involving intersection, union, and difference.
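
The search-engine application can be illustrated with a small sketch. The names `build_index` and `and_query`, and the toy documents, are hypothetical and not from the slides; Python's built-in set intersection stands in for whatever intersection data structure is actually used.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def and_query(index, x, y):
    """Ids of documents that contain both word x and word y."""
    return sorted(set(index.get(x, [])) & set(index.get(y, [])))

# Toy collection (illustrative data only).
docs = {
    1: "fast set intersection",
    2: "fast hashing",
    3: "set union and intersection",
}
index = build_index(docs)
print(and_query(index, "set", "intersection"))
```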

  3. Previous Comparison-Based Results
     • Query: $S_1 \cap S_2$
     • Classical solution:
       • Represent sets as sorted lists.
       • Query by merging and reporting duplicates: $O(|S_1| + |S_2|)$ time.
     • Special cases with faster solutions:
       • When $|S_1| \ll |S_2|$: $O\left(|S_1| \log\left(1 + \frac{|S_2|}{|S_1|}\right)\right)$ time [HL1972].
       • When $S_1 \cap S_2$ consists of few sublists from $S_1$ and $S_2$ (adaptive algorithms) [DLM2000, BK2002].
     • Generalizations to more complicated expressions involving intersections and unions [CFM2005].
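
A minimal sketch of the classical merge-based solution, assuming each input is a sorted list of distinct integers. The function name `intersect_sorted` is illustrative.

```python
def intersect_sorted(a, b):
    """Intersect two sorted lists of distinct values in O(len(a) + len(b)) time."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Value occurs in both lists: report it and advance both cursors.
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect_sorted([1, 3, 5, 7, 9], [3, 4, 5, 6, 7]))
```

Every step advances at least one cursor, so the loop runs at most len(a) + len(b) times, even when the intersection is empty.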

  4. Previous Non-Comparison Based Results
     • Fast solution when $|S_1| \ll |S_2|$:
       • Build a hashing-based dictionary for each set.
       • Look up the elements of $S_1$ in the dictionary for $S_2$: $O(|S_1|)$ time.
     • For very small universes:
       • Represent sets as bitstrings.
       • Compute intersections as a bitwise AND.
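
Both non-comparison approaches can be sketched briefly. This is an illustration only: a Python `set` stands in for the hashing-based dictionary, an arbitrary-precision `int` stands in for the bitstring, and all function names are hypothetical.

```python
def intersect_by_lookup(small, large_dict):
    """Look up each element of the smaller set in a hash-based dictionary
    built for the larger set: expected O(len(small)) time."""
    return [x for x in small if x in large_dict]

def to_bitstring(s):
    """Represent a set of small non-negative integers as a bitstring (Python int)."""
    bits = 0
    for x in s:
        bits |= 1 << x
    return bits

def from_bitstring(bits):
    """Recover the sorted list of elements from a bitstring."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

# Hash-dictionary lookup: the dictionary for the large set is built once.
large = set(range(1, 6))
print(intersect_by_lookup([2, 9], large))

# Small universe: the intersection is a single bitwise AND.
print(from_bitstring(to_bitstring({1, 3, 5}) & to_bitstring({3, 4, 5})))
```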

  5. Our Results
     • Theorem: There is a non-comparison based linear space representation supporting intersection queries $S_1 \cap S_2$ in expected time $O\left(\frac{(|S_1| + |S_2|) \log^2 w}{w} + occ\right)$, where $occ$ is the number of elements in the output.
     • Output-sensitive algorithm.
     • For $occ < (|S_1| + |S_2|)/w$ the algorithm runs in sublinear time.
     • All previously known solutions use worst-case linear time even if the intersection is empty.
     • We show how to generalize the result to arbitrary union-intersection expressions.
     • We give a communication complexity lower bound proving that the result is near optimal.
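
To see why the bound is sublinear in this regime, write $n = |S_1| + |S_2|$. The short derivation below assumes that the word size $w$ grows with $n$ (for example $w \ge \log n$, a common word-RAM assumption that the slides do not state explicitly).

```latex
\[
occ < \frac{n}{w}
\;\Longrightarrow\;
O\!\left(\frac{n \log^2 w}{w} + occ\right)
= O\!\left(\frac{n \log^2 w}{w} + \frac{n}{w}\right)
= O\!\left(\frac{n \log^2 w}{w}\right)
= o(n),
\qquad \text{since } \frac{\log^2 w}{w} \to 0 \text{ as } w \to \infty .
\]
```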

  6. Approximate Set Representation
     [Figure: elements $x_1, x_2, x_3$ of $S$ mapped to their hash values $h(x_1), h(x_2), h(x_3)$ in $h(S)$]
     • Represent the set $S \subseteq \{0,1\}^w$ as a set of hash function values $h(S)$.
     • $h(S)$ is an approximate set representation:
       • If $x \in S$ then $h(x) \in h(S)$.
       • If $x \notin S$ then $h(x) \notin h(S)$ with probability close to 1.

  7. Computing Intersections
     1. Compute the intersection of the approximate representations, $H = h(S_1) \cap h(S_2)$.
        • We do this in $o(|S_1| + |S_2|)$ time.
     2. Compute $S_1' = \{x \in S_1 \mid h(x) \in H\}$ and $S_2' = \{x \in S_2 \mid h(x) \in H\}$.
        • With a hash table that allows us to look up a value $h(x)$ and retrieve all elements with this value, this takes $O(|S_1'| + |S_2'|)$ time.
     3. Compute and return $S_1' \cap S_2'$.
     • Idea: If the hash function is suitably chosen, the number of elements to be checked in step 2 is small.
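
The three steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `make_hash` is a hypothetical multiplicative hash on 64-bit keys, and step 1 uses an ordinary set intersection instead of the sublinear-time method the slides refer to.

```python
from collections import defaultdict

MASK64 = (1 << 64) - 1

def make_hash(r, multiplier=0x9E3779B97F4A7C15):
    """Hypothetical r-bit multiplicative hash (1 <= r <= 64) for 64-bit integer keys."""
    def h(x):
        return ((x * multiplier) & MASK64) >> (64 - r)
    return h

def preprocess(s, h):
    """Hash table mapping each hash value to all elements of s with that value."""
    table = defaultdict(list)
    for x in s:
        table[h(x)].append(x)
    return dict(table)

def intersect(t1, t2):
    """Intersect two preprocessed sets that were built with the same hash function."""
    # Step 1: intersect the approximate representations h(S1) and h(S2).
    H = t1.keys() & t2.keys()
    # Step 2: retrieve only the elements whose hash value survived step 1.
    s1_prime = [x for v in H for x in t1[v]]
    s2_prime = [x for v in H for x in t2[v]]
    # Step 3: intersect the (hopefully small) candidate sets exactly.
    return sorted(set(s1_prime) & set(s2_prime))

h = make_hash(8)
s1 = set(range(0, 1000, 3))
s2 = set(range(0, 1000, 5))
result = intersect(preprocess(s1, h), preprocess(s2, h))
```

The result is exact for any choice of hash function, because an element common to both sets always has its hash value in $H$; the hash function only affects how many candidates reach step 3.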

  8. Choosing Hash Functions
     • The number of bits used for the hash values should be:
       • Small enough so that $H = h(S_1) \cap h(S_2)$ can be computed quickly.
       • Large enough to get a significant reduction in the number of remaining elements in $S_1' = \{x \in S_1 \mid h(x) \in H\}$ and $S_2' = \{x \in S_2 \mid h(x) \in H\}$, so that $S_1' \cap S_2'$ can be computed quickly.
     • The optimal range of the hash function depends on the size of the input sets.
     • We store $S$ at “multiple resolutions” using hash functions with different ranges.
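
The trade-off can be explored with a small experiment that counts the surviving elements $|S_1'| + |S_2'|$ for different numbers of hash bits $r$. The multiplicative hash and the test sets are assumptions made for illustration; the slides do not specify a hash family.

```python
MASK64 = (1 << 64) - 1
MULTIPLIER = 0x9E3779B97F4A7C15  # hypothetical odd 64-bit multiplier

def survivors(s1, s2, r):
    """Return |S1'| + |S2'|: the number of elements whose r-bit hash value
    lies in H = h(S1) & h(S2)."""
    def h(x):
        return ((x * MULTIPLIER) & MASK64) >> (64 - r)
    H = {h(x) for x in s1} & {h(x) for x in s2}
    return sum(h(x) in H for x in s1) + sum(h(x) in H for x in s2)

s1 = set(range(0, 3000, 3))
s2 = set(range(0, 3000, 7))
counts = [survivors(s1, s2, r) for r in range(1, 17)]
for r, c in zip(range(1, 17), counts):
    print(r, c)
```

With this particular hash the $r$-bit value is a prefix of the $(r+1)$-bit value, so the count can only shrink as $r$ grows, and it never drops below twice the true intersection size.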

  9. [Figure: $2^b$ buckets; packed arrays of $(r - b)$-bit values stored in $w$-bit words]
     • We store a set of $r$-bit hash values $h_r(S)$ as a bucketed set for parameter $b$:
       • Elements with the same $b$ most significant bits are stored in the same bucket.
       • Elements in the same bucket are represented by their $r - b$ least significant bits as a sorted packed array.
     • We choose $b$ to minimize total space.
     • We can store a sufficient set of resolutions of $S$ in total linear space.
     • $r - b = O(\log w)$.
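
A minimal sketch of the bucketed representation, assuming the hash values are given as Python integers of at most $r$ bits. The helper names are hypothetical, and a single arbitrary-precision `int` stands in for a sequence of $w$-bit words.

```python
def bucketize(hashes, r, b):
    """Split r-bit values into 2**b buckets by their b most significant bits.
    Each bucket stores the sorted (r - b)-bit low parts of its values."""
    low_bits = r - b
    low_mask = (1 << low_bits) - 1
    buckets = [[] for _ in range(1 << b)]
    for v in sorted(set(hashes)):  # ascending order keeps every bucket sorted
        buckets[v >> low_bits].append(v & low_mask)
    return buckets

def pack(values, field_bits):
    """Pack small integers side by side, field_bits bits per value."""
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * field_bits)
    return word

def unpack(word, field_bits, count):
    """Inverse of pack: extract count fields of field_bits bits each."""
    mask = (1 << field_bits) - 1
    return [(word >> (i * field_bits)) & mask for i in range(count)]

# Four 5-bit hash values, split into 2**2 buckets of 3-bit low parts.
buckets = bucketize([0b10110, 0b10001, 0b00111, 0b11000], r=5, b=2)
packed = [pack(bucket, 3) for bucket in buckets]
```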

  10. [Figure: $2^b$ buckets; packed arrays of $(r - b)$-bit values stored in $w$-bit words]
     • Intersection algorithm for bucketed sets:
       • Convert buckets to have a common (suitably chosen) parameter $b$.
         • Create a new array of size $2^b$.
         • Repartition packed arrays among the new buckets.
         • Modify the number of bits in the packed array representation.
       • Compute the intersection among each of the sorted packed arrays.
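
The conversion to a common parameter and the bucket-by-bucket intersection can be sketched as follows. Buckets are plain Python lists of low-order parts here rather than packed arrays, so the sketch shows the logic but not the word-level speedup; `rebucket` and `intersect_bucketed` are hypothetical names.

```python
def rebucket(buckets, r, b_old, b_new):
    """Repartition a bucketed set of r-bit values from 2**b_old buckets
    into 2**b_new buckets, changing the low part to r - b_new bits."""
    low_old, low_new = r - b_old, r - b_new
    low_mask = (1 << low_new) - 1
    new_buckets = [[] for _ in range(1 << b_new)]
    for high, bucket in enumerate(buckets):
        for low in bucket:
            v = (high << low_old) | low  # reconstruct the full r-bit value
            new_buckets[v >> low_new].append(v & low_mask)
    return new_buckets

def intersect_bucketed(b1, b2):
    """Intersect two bucketed sets that share the same parameters r and b."""
    return [sorted(set(x) & set(y)) for x, y in zip(b1, b2)]

old = [[7], [], [1, 6], [0]]  # r = 5, b = 2
coarse = rebucket(old, r=5, b_old=2, b_new=1)
fine = rebucket(old, r=5, b_old=2, b_new=3)
common = intersect_bucketed(coarse, [[2, 7], [6]])
```

Values are visited in increasing order, so each new bucket stays sorted without an extra sorting pass.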

  11. [Figure: worked example. Sorted inputs 1 2 3 4 5 7 8 10 12 13 and 1 4 6 8 11 12; merge: 1 1 2 3 4 4 5 6 7 8 8 10 11 12 12 13; keep duplicate values: 1 4 8 12; compact: 1 4 8 12]
     • Lemma [AH1992, ATNR1995]: All of the above operations can be computed in $O(\log w)$ time per word in the packed arrays.
     • Total time: $O\left(\frac{(|S_1| + |S_2|) \log w}{w} \cdot \log w\right)$
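
The merge, keep-duplicates, and compact pipeline can be sketched element by element. The input values are those recoverable from the slide's figure. The sketch assumes each input is sorted with distinct values, and it does not reproduce the word-parallel packed-array operations that the lemma's $O(\log w)$ bound relies on.

```python
import heapq

def intersect_merge_compact(a, b):
    """Intersect two sorted lists of distinct values via merge, mark, compact."""
    # Merge: combine both sorted inputs into one sorted sequence.
    merged = list(heapq.merge(a, b))
    # Keep duplicate values: a value present in both inputs appears twice in a
    # row, so keep the first copy of each adjacent pair and blank out the rest.
    kept = [x if i + 1 < len(merged) and merged[i + 1] == x else None
            for i, x in enumerate(merged)]
    # Compact: squeeze out the blanked positions.
    return [x for x in kept if x is not None]

s1 = [1, 2, 3, 4, 5, 7, 8, 10, 12, 13]
s2 = [1, 4, 6, 8, 11, 12]
print(intersect_merge_compact(s1, s2))
```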

  12. Our Results
     • Theorem: There is a non-comparison based linear space representation supporting intersection queries $S_1 \cap S_2$ in expected time $O\left(\frac{(|S_1| + |S_2|) \log^2 w}{w} + occ\right)$.
     • In the paper:
       • Generalization to arbitrary union-intersection expressions
       • Lower bound
     • Open Problem:
       • Can we extend this to set difference?
