[PPT] - Consistent subset sampling, Konstantin Kutzkov and Rasmus Pagh, PowerPoint Presentation

SLIDE 1

Consistent subset sampling

Konstantin Kutzkov and Rasmus Pagh
Work supported by:


SLIDE 2

Agenda

Consistent sampling
Applications of consistent sampling of subsets
New sampling algorithm


SLIDE 3

Sampling

Fundamental tool in (streaming) algorithms, e.g., to estimate number of distinct items or the Jaccard similarity of sets.

SLIDE 5

Consistent sampling

Setting: Data stream of items x1,x2,… ∈ U.
Problem: Create substream with only items in a

random sample U2 ⊆ U.

Idea: Use a hash function h to

determine whether to sample:

Include xi in sample iff h(xi)=0.
U2 = h^(-1)(0).

Property: If hash function is 2-independent, fraction sampled is concentrated around expectation.
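
The hashing idea above can be sketched as follows. This is a minimal illustration, not code from the talk: it assumes integer items and uses the classic ((a*x + b) mod p) mod r family, which is only approximately 2-independent after the final mod r. The names make_hash and consistent_sample are hypothetical.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed larger than any item in U


def make_hash(r, seed):
    """Draw h(x) = ((a*x + b) mod P) mod r from a seeded random source.

    Assumption of this sketch: items are non-negative integers below P.
    """
    rng = random.Random(seed)
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % r


def consistent_sample(stream, h):
    """Keep exactly the items whose hash value is 0."""
    return [x for x in stream if h(x) == 0]


h = make_hash(r=4, seed=42)
stream_a = [5, 17, 23, 5, 42, 8]
stream_b = [42, 5, 99, 17]
sample_a = consistent_sample(stream_a, h)
sample_b = consistent_sample(stream_b, h)
```

Because the decision depends only on h(x), an item is either kept in every stream (and at every occurrence) or dropped everywhere, which is what makes the sample consistent.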

SLIDE 7

Consistent subset sampling

Setting: Data stream of sets S1,S2,… ⊆ U.
Problem: For each set Si, generate every K ⊆ Si

with |K|=k and h(K)=0.

Hash function h should be 2-independent with

Pr[h(S)=0] = 1/r.

Goal: Time and space efficiency.

In talk: k = 4

SLIDE 9

Motivating example

Set of shopping baskets (sets of products).

Generate sample of 4-sets of products bought

together to, for example, estimate:

The number of frequent 4-sets.
Similarity in buying habits of two customers.

Use to fine-tune parameters of frequent itemset mining algorithms (Apriori, FP-growth) in order to avoid the generation of a huge number of itemsets that are considered frequent.

SLIDE 11

Naïve solutions

1. Generate explicitly all 4-subsets and apply a 2-independent hash function to each of them.
   Slow: Time O(b^4) for a set of size b.
2. Sample items S ⊆ Si independently w.p. r^(-1/4).
   Not 2-independent: Positive correlation among overlapping 4-subsets.
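
A small worked calculation of the positive correlation in the second naive solution. The choice r = 16 is an assumption made only so that r^(-1/4) = 1/2 is exact; p_both is a hypothetical helper name.

```python
from fractions import Fraction

r = 16
q = Fraction(1, 2)   # per-item sampling probability r^(-1/4) for r = 16
p_single = q ** 4    # probability that one fixed 4-subset is fully sampled


def p_both(shared):
    """Probability that two 4-subsets sharing `shared` items are both sampled.

    Together they cover 8 - shared distinct items, each kept independently.
    """
    return q ** (8 - shared)


for shared in range(4):
    print(shared, p_both(shared), p_both(shared) > p_single ** 2)
```

With no shared items the joint probability equals the product p_single^2, as 2-independence would require. As soon as the two subsets overlap, the joint probability is strictly larger.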

SLIDE 12

Decomposable hashing

We choose h to be “decomposable”: 

h({i1, i2, i3, i4}) = (g(i1) + g(i2) + g(i3) + g(i4)) mod r

  where g: U → [r] is an 8-independent hash function.

With this choice, h is 2-independent.

Our main reduction: Finding 4-subsets with hash value zero is equivalent to solving 4SUM on {g(x) | x ∈ Si}.
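
A minimal sketch of the decomposable hash, with a brute-force search that makes the 4SUM connection concrete. Assumptions of this sketch: integer items, and g built as a random degree-7 polynomial over a prime field reduced mod r (a standard 8-independent construction, only approximately uniform after the final mod r). The names make_g, h and zero_hash_4_subsets are hypothetical.

```python
import random
from itertools import combinations

P = (1 << 61) - 1  # a Mersenne prime, assumed larger than any item in U


def make_g(r, seed, independence=8):
    """Random polynomial of degree independence - 1 over Z_P, reduced mod r."""
    rng = random.Random(seed)
    coeffs = [rng.randrange(P) for _ in range(independence)]

    def g(x):
        acc = 0
        for c in coeffs:  # Horner evaluation
            acc = (acc * x + c) % P
        return acc % r

    return g


def h(K, g, r):
    """Decomposable hash: sum of the per-item hashes, mod r."""
    return sum(g(x) for x in K) % r


def zero_hash_4_subsets(S, g, r):
    """Brute force, O(b^4): every 4-subset K of S with h(K) = 0.

    Equivalently, every 4 values g(x) whose sum is a multiple of r.
    """
    return [K for K in combinations(sorted(S), 4) if h(K, g, r) == 0]


r = 5
g = make_g(r, seed=7)
basket = [3, 11, 20, 42, 57, 64, 90]
sample = zero_hash_4_subsets(basket, g, r)
```

The brute force is only a reference point. The purpose of the reduction is to replace it with a faster 4SUM procedure on the values g(x).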

SLIDE 15

Schroeppel/Shamir technique

[Figure: four lists of sorted values, with a maxheap of pair sums and a minheap of pair sums.]

In paper: How to handle modular arithmetic.
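
A rough sketch of the heap idea, not the algorithm from the paper. One heap emits pair sums in ascending order, the other in descending order, and a two-pointer scan matches them against a target. Assumptions of this sketch: modular arithmetic is handled simply by trying the integer targets 0, r, 2r, 3r, and ties are buffered in lists, so space can exceed O(b) when many pair sums coincide. All function names are hypothetical.

```python
import heapq
import random
from itertools import combinations


def pair_sums_ascending(vals):
    """Yield (sum, i, j), i < j, in nondecreasing order; heap holds one entry per i."""
    n = len(vals)
    heap = [(vals[i] + vals[i + 1], i, i + 1) for i in range(n - 1)]
    heapq.heapify(heap)
    while heap:
        s, i, j = heapq.heappop(heap)
        yield s, i, j
        if j + 1 < n:
            heapq.heappush(heap, (vals[i] + vals[j + 1], i, j + 1))


def pair_sums_descending(vals):
    """Yield (sum, i, j), i < j, in nonincreasing order (max-heap via negation)."""
    n = len(vals)
    heap = [(-(vals[i] + vals[n - 1]), i, n - 1) for i in range(n - 1)]
    heapq.heapify(heap)
    while heap:
        neg, i, j = heapq.heappop(heap)
        yield -neg, i, j
        if j - 1 > i:
            heapq.heappush(heap, (-(vals[i] + vals[j - 1]), i, j - 1))


def four_sum_indices(vals, target):
    """Index tuples i < j < k < l of the sorted list vals whose values sum to target."""
    lo, hi = pair_sums_ascending(vals), pair_sums_descending(vals)
    a, c = next(lo, None), next(hi, None)
    found = set()
    while a is not None and c is not None:
        total = a[0] + c[0]
        if total < target:
            a = next(lo, None)
        elif total > target:
            c = next(hi, None)
        else:
            s_lo, s_hi, left, right = a[0], c[0], [], []
            while a is not None and a[0] == s_lo:  # buffer ties on the low side
                left.append(a[1:])
                a = next(lo, None)
            while c is not None and c[0] == s_hi:  # buffer ties on the high side
                right.append(c[1:])
                c = next(hi, None)
            found.update((i, j, k, l) for i, j in left for k, l in right if j < k)
    return found


def zero_hash_4_subsets(items, g, r):
    """All 4-subsets K of items with sum of g over K divisible by r."""
    order = sorted(items, key=g)
    vals = [g(x) for x in order]
    out = set()
    for target in range(0, 4 * r, r):  # the sum of four values in [0, r) is below 4r
        for idx in four_sum_indices(vals, target):
            out.add(frozenset(order[i] for i in idx))
    return out


rng = random.Random(1)
r = 7
table = {x: rng.randrange(r) for x in range(12)}  # stand-in for g, illustration only
fast = zero_hash_4_subsets(list(table), table.__getitem__, r)
brute = {frozenset(K) for K in combinations(table, 4) if sum(table[x] for x in K) % r == 0}
assert fast == brute
```

Each 4-subset is reported once, by splitting it into its two lowest-ranked and two highest-ranked elements in the sorted order; the j < k filter discards pairings that overlap or interleave.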

SLIDE 28

Main result

Theorem (vanilla version)

We can compute a consistent 2-independent sample of 4-subsets of a set of size b in expected time O(b^2 log log b + p b^4) and space O(b) for sampling probability p.

In paper:
Generalized to k-subsets
Time/space trade-off

SLIDE 30

Thank you!
