Fast Evaluation of Union-Intersection Expressions


SLIDE 1

Fast Evaluation of Union-Intersection Expressions

Philip Bille Anna Pagh Rasmus Pagh IT University of Copenhagen

SLIDE 2

Data Structures for Intersection Queries

  • Preprocess a collection of sets $S_1, \dots, S_m$ independently into a representation that supports intersection queries of the form $S_i \cap S_j$, $1 \le i, j \le m$.
  • Application: Boolean AND-queries in search engines.
      • For each word, store the set of documents containing the word.
      • To search for documents that contain words x and y, compute the intersection of the corresponding document sets.
  • Generalizes to arbitrary expressions over the set collection involving intersection, union, and difference.
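The search-engine application can be sketched in a few lines of Python. This is only an illustration of the AND-query idea, using Python's built-in sets rather than the representation developed in the talk; the names `build_index` and `and_query` and the sample documents are made up for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def and_query(index, x, y):
    """Documents containing both x and y: intersect the two document sets."""
    return index.get(x, set()) & index.get(y, set())

# Hypothetical toy collection.
docs = {
    1: "fast set intersection",
    2: "fast hashing",
    3: "set union and intersection",
}
index = build_index(docs)
print(sorted(and_query(index, "set", "intersection")))
```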

SLIDE 3

Previous Comparison-Based Results

  • Query: $S_1 \cap S_2$
  • Classical solution:
      • Represent sets as sorted lists.
      • Query by merging and reporting duplicates: $O(|S_1| + |S_2|)$ time.
  • Special cases with faster solutions:
      • When $|S_1| \ll |S_2|$: $O(|S_1| \log(1 + |S_2|/|S_1|))$ time [HL1972].
      • When $S_1 \cap S_2$ consists of few sublists from $S_1$ and $S_2$ [DLM2000, BK2002] (adaptive algorithms).
  • Generalizations to more complicated expressions involving intersections and unions [CFM2005].
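As a rough illustration of the comparison-based approaches, the sketch below shows the classical merge and a simple variant for the case where one list is much shorter. The second function uses plain binary search, which costs $O(|S_1| \log |S_2|)$ comparisons; it is a simpler stand-in and not the algorithm of [HL1972]. Both functions assume sorted lists without repeated elements.

```python
from bisect import bisect_left

def intersect_merge(a, b):
    """Classical solution: walk both sorted lists and report common values."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_small_large(small, large):
    """Binary-search each element of the short list in the long list."""
    out = []
    lo = 0
    for x in small:
        lo = bisect_left(large, x, lo)  # never search left of the last position
        if lo == len(large):
            break
        if large[lo] == x:
            out.append(x)
    return out

a = [1, 2, 4, 5, 7, 8, 10, 12]
b = [1, 3, 4, 6, 8, 11, 12, 13]
print(intersect_merge(a, b))
print(intersect_small_large([4, 12], b))
```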

SLIDE 4

Previous Non-Comparison Based Results

  • Fast solution when $|S_1| \ll |S_2|$:
      • Build a hashing-based dictionary for each set.
      • Look up the elements of $S_1$ in the dictionary for $S_2$: $O(|S_1|)$ time.
  • For very small universes:
      • Represent sets as bitstrings.
      • Compute intersections as a bitwise-AND.
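Both non-comparison-based ideas are easy to try out. In the sketch below a Python `set` plays the role of the hashing-based dictionary, and an arbitrary-precision integer plays the role of the bitstring; the function names are chosen for this example only.

```python
def intersect_by_lookup(s1, s2_dict):
    """Look up each element of the small set in the dictionary of the large set."""
    return [x for x in s1 if x in s2_dict]

def to_bitstring(s):
    """Encode a set of small non-negative integers as a bitstring (bit x set iff x in s)."""
    bits = 0
    for x in s:
        bits |= 1 << x
    return bits

def from_bitstring(bits):
    """Decode a bitstring back into a sorted list of elements."""
    out = []
    x = 0
    while bits:
        if bits & 1:
            out.append(x)
        bits >>= 1
        x += 1
    return out

small = [4, 9, 12]
large = set(range(0, 1000, 4))            # dictionary for the large set
print(intersect_by_lookup(small, large))

# Small universe: the intersection is a single bitwise AND.
both = to_bitstring({1, 4, 8}) & to_bitstring({4, 8, 9})
print(from_bitstring(both))
```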

SLIDE 5

Our Results

  • Theorem: There is a non-comparison based linear space representation supporting intersection queries $S_1 \cap S_2$ in expected time $O\left(\frac{(|S_1| + |S_2|) \log^2 w}{w} + occ\right)$, where $w$ is the word size in bits and $occ$ is the number of elements in the output.
  • Output-sensitive algorithm.
  • For $occ < (|S_1| + |S_2|)/w$ the algorithm runs in sublinear time.
  • All previously known solutions use worst-case linear time even if the intersection is empty.
  • We show how to generalize the result to arbitrary union-intersection expressions.
  • We give a communication complexity lower bound proving that the result is near optimal.
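To see why the stated condition gives sublinear time, one can substitute it into the bound from the theorem. The short derivation below writes $n = |S_1| + |S_2|$, which is shorthand introduced here and not notation from the slides.

```latex
\begin{align*}
T &= O\!\left(\frac{n \log^2 w}{w} + occ\right), \qquad n = |S_1| + |S_2|, \\
occ < \frac{n}{w} \;\Longrightarrow\;
T &= O\!\left(\frac{n \log^2 w}{w} + \frac{n}{w}\right)
   = O\!\left(\frac{n \log^2 w}{w}\right).
\end{align*}
```

Since $\log^2 w / w$ tends to 0 as $w$ grows, this is asymptotically smaller than the $O(n)$ cost of merging.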
SLIDE 6

Approximate Set Representation

  • Represent set $S \subseteq \{0, 1\}^w$ as a set of hash function values $h(S)$.
  • $h(S)$ is an approximate set representation:
      • If $x \in S$ then $h(x) \in h(S)$.
      • If $x \notin S$ then $h(x) \notin h(S)$ with probability close to 1.

[Figure: elements $x_1, x_2, x_3$ of $S$ mapped to their hash values $h(x_1), h(x_2), h(x_3)$ in $h(S)$.]
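The one-sided guarantee can be checked with a small experiment. The sketch below assumes a multiplicative hash with a random odd multiplier, keeping the top $r$ bits; the hash family used in the talk may differ, and `make_hash`, `W`, and the seed are illustrative choices.

```python
import random

W = 64  # assumed word size in bits

def make_hash(r, seed=0):
    """Multiplicative hash from W-bit keys to r-bit values (an assumed family)."""
    rng = random.Random(seed)
    a = rng.getrandbits(W) | 1           # random odd multiplier
    mask = (1 << W) - 1
    return lambda x: ((a * x) & mask) >> (W - r)

h = make_hash(r=16)
S = {3, 17, 42, 1000}
hS = {h(x) for x in S}                   # approximate representation h(S)

def maybe_member(x):
    """True for every x in S; may rarely be True for x outside S (collision)."""
    return h(x) in hS

print(all(maybe_member(x) for x in S))
```

Members are never rejected, while a non-member is accepted only if its hash value collides with that of some member.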

SLIDE 7

Computing Intersections

1. Compute the intersection of the approximate representations, $H = h(S_1) \cap h(S_2)$.
      • We do this in $o(|S_1| + |S_2|)$ time.
2. Compute $S'_1 = \{x \in S_1 \mid h(x) \in H\}$ and $S'_2 = \{x \in S_2 \mid h(x) \in H\}$.
      • With a hash table that allows us to look up a value $h(x)$ and retrieve all elements with this value, this takes $O(|S'_1| + |S'_2|)$ time.
3. Compute and return $S'_1 \cap S'_2$.

  • Idea: If the hash function is suitably chosen, the number of elements to be checked in step 2 is small.
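The three steps can be sketched as follows. This is a simplified model: step 1 uses Python's ordinary set intersection on the hash values, so it does not achieve the $o(|S_1| + |S_2|)$ bound, which relies on the packed representation described on later slides. The hash family and the names `make_hash`, `preprocess`, and `intersect` are assumptions made for the example.

```python
import random
from collections import defaultdict

W = 64  # assumed word size in bits

def make_hash(r, seed=1):
    """Multiplicative hash to r bits (an assumed family, for illustration)."""
    rng = random.Random(seed)
    a = rng.getrandbits(W) | 1
    mask = (1 << W) - 1
    return lambda x: ((a * x) & mask) >> (W - r)

def preprocess(s, h):
    """Hash table from hash value to all elements of s with that value."""
    table = defaultdict(list)
    for x in s:
        table[h(x)].append(x)
    return table

def intersect(t1, t2):
    # Step 1: intersect the approximate representations h(S1) and h(S2).
    H = t1.keys() & t2.keys()
    # Step 2: retrieve the elements whose hash value survived.
    s1_prime = {x for v in H for x in t1[v]}
    s2_prime = {x for v in H for x in t2[v]}
    # Step 3: intersect the (hopefully small) candidate sets exactly.
    return s1_prime & s2_prime

h = make_hash(r=20)
S1 = set(range(0, 3000, 3))
S2 = set(range(0, 3000, 5))
result = intersect(preprocess(S1, h), preprocess(S2, h))
print(len(result))
```

The result is exact whatever hash function is used: every common element has the same hash value in both tables, so it survives steps 1 and 2, and step 3 removes any false candidates.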

SLIDE 8

Choosing Hash Functions

  • The number of bits used for the hash values should be:
      • Small enough so that $H = h(S_1) \cap h(S_2)$ can be computed quickly.
      • Large enough to get a significant reduction in the number of remaining elements in $S'_1 = \{x \in S_1 \mid h(x) \in H\}$ and $S'_2 = \{x \in S_2 \mid h(x) \in H\}$, so that $S'_1 \cap S'_2$ can be computed quickly.
  • Optimal range of hash function depends on the size of input sets.
  • We store $S$ at “multiple resolutions” using hash functions with different ranges.
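The trade-off in the number of hash bits can be observed empirically. The sketch below counts, for several values of $r$, how many distinct hash values are shared and how many elements remain in $S'_1$ and $S'_2$. It only illustrates the trade-off; it does not reproduce the rule the authors use to pick the resolution, and the hash family and test sets are assumptions.

```python
import random

W = 64  # assumed word size in bits

def make_hash(r, seed=0):
    """Multiplicative hash to r bits (an assumed family, for illustration)."""
    rng = random.Random(seed)
    a = rng.getrandbits(W) | 1
    mask = (1 << W) - 1
    return lambda x: ((a * x) & mask) >> (W - r)

def survivors(s1, s2, r):
    """Return (|H|, |S1'| + |S2'|) when r-bit hash values are used."""
    h = make_hash(r)
    H = {h(x) for x in s1} & {h(x) for x in s2}
    kept = sum(1 for x in s1 if h(x) in H) + sum(1 for x in s2 if h(x) in H)
    return len(H), kept

s1 = set(range(0, 20000, 2))
s2 = set(range(0, 20000, 7))
for r in (4, 8, 16, 32, 64):
    print(r, survivors(s1, s2, r))
```

With few bits almost every element survives the filter; with more bits the number of survivors approaches twice the size of the true intersection, at the price of larger hash values to intersect.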

SLIDE 9

  • We store a set $h_r(S)$ of $r$-bit hash values as a bucketed set for parameter $b$:
      • Elements with the same $b$ most significant bits are stored in the same bucket.
      • Elements in the same bucket are represented by their $r - b$ least significant bits as a sorted packed array.
  • We choose $b$ to minimize total space.
  • We can store a sufficient set of resolutions of $S$ in total linear space.
  • $r - b = O(\log w)$.

[Figure: bucketed set; labels: $2^b$ buckets, $w$ bits per word, $r - b$ bits per element.]
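The bucketed layout can be modelled with ordinary Python lists. In this sketch each bucket is a sorted list of $(r - b)$-bit integers standing in for a sorted packed array; real packing of several values into one $w$-bit word is not modelled, and the function names are chosen for the example.

```python
def to_bucketed(hash_values, r, b):
    """Split r-bit values into 2**b buckets by their b most significant bits.

    Each bucket keeps only the r - b least significant bits, in sorted order.
    """
    buckets = [[] for _ in range(1 << b)]
    low_mask = (1 << (r - b)) - 1
    for v in sorted(set(hash_values)):
        buckets[v >> (r - b)].append(v & low_mask)
    return buckets

def from_bucketed(buckets, r, b):
    """Recover the original r-bit values from a bucketed set."""
    return [(top << (r - b)) | low
            for top, bucket in enumerate(buckets)
            for low in bucket]

values = [0b10110, 0b10001, 0b00111]      # 5-bit hash values
buckets = to_bucketed(values, r=5, b=2)
print(buckets)
print(from_bucketed(buckets, r=5, b=2))
```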

SLIDE 10

  • Intersection algorithm for bucketed sets:
      • Convert buckets to have a common (suitably chosen) parameter $b$.
          • Create a new array of size $2^b$.
          • Repartition packed arrays among the new buckets.
          • Modify number of bits in packed array representation.
      • Compute intersection among each of the sorted packed arrays.

[Figure: bucketed set; labels: $2^b$ buckets, $w$ bits per word, $r - b$ bits per element.]
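The algorithm above can be sketched on the same list-based model of bucketed sets. Plain Python lists stand in for packed arrays and everything is done element by element, so the sketch shows the logic of rebucketing and per-bucket intersection but not the word-level running time. All function names are illustrative.

```python
def to_bucketed(values, r, b):
    """Bucket r-bit values by their b most significant bits (sorted low bits)."""
    buckets = [[] for _ in range(1 << b)]
    mask = (1 << (r - b)) - 1
    for v in sorted(set(values)):
        buckets[v >> (r - b)].append(v & mask)
    return buckets

def rebucket(buckets, r, b_old, b_new):
    """Convert a bucketed set from parameter b_old to parameter b_new."""
    new = [[] for _ in range(1 << b_new)]          # new array of size 2**b_new
    mask = (1 << (r - b_new)) - 1
    for top, bucket in enumerate(buckets):
        for low in bucket:
            v = (top << (r - b_old)) | low         # full r-bit value
            new[v >> (r - b_new)].append(v & mask)  # now r - b_new low bits
    return new

def merge_intersect(a, b):
    """Intersect two sorted lists by merging."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_bucketed(x, bx, y, by, r, b):
    """Intersect two bucketed sets after converting both to parameter b."""
    x, y = rebucket(x, r, bx, b), rebucket(y, r, by, b)
    out = []
    for top in range(1 << b):
        for low in merge_intersect(x[top], y[top]):
            out.append((top << (r - b)) | low)
    return out

A = to_bucketed([1, 5, 9, 200, 255], r=8, b=2)
B = to_bucketed([5, 9, 10, 255], r=8, b=4)
print(intersect_bucketed(A, 2, B, 4, r=8, b=3))
```

Values are visited in increasing order during rebucketing, so each new bucket stays sorted without an extra sort.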

SLIDE 11

  • Lemma: [AH1992, ATNR1995] All of the above operations can be computed in $O(\log w)$ time per word in the packed arrays.
  • Total time: $O\left(\frac{(|S_1| + |S_2|) \log w}{w} \cdot \log w\right)$

[Figure: intersecting two sorted packed arrays in three steps: merge, keep duplicate values, compact.]
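The three steps named in the figure can be mimicked on ordinary lists. The sketch below works element by element, whereas the cited lemma concerns doing these operations on whole words of packed values; the input arrays are illustrative and each is assumed to contain no repeated values.

```python
from heapq import merge

def intersect_by_merge(a, b):
    """Intersect two sorted, duplicate-free lists: merge, keep duplicates, compact."""
    # Merge: one sorted sequence containing all values of both inputs.
    merged = list(merge(a, b))
    # Keep duplicate values: a value occurring twice in a row is in both inputs.
    marked = [
        merged[i] if i + 1 < len(merged) and merged[i] == merged[i + 1] else None
        for i in range(len(merged))
    ]
    # Compact: drop the empty slots.
    return [v for v in marked if v is not None]

a = [1, 2, 4, 5, 7, 8, 10, 12]
b = [1, 3, 4, 6, 8, 11, 12, 13]
print(intersect_by_merge(a, b))
```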

SLIDE 12

Our Results

  • Theorem: There is a non-comparison based linear space representation supporting intersection queries $S_1 \cap S_2$ in expected time $O\left(\frac{(|S_1| + |S_2|) \log^2 w}{w} + occ\right)$.
  • In the paper:
      • Generalization to arbitrary union-intersection expressions
      • Lower bound
  • Open Problem:
      • Can we extend this to set difference?