Consistent subset sampling: Konstantin Kutzkov and Rasmus Pagh



slide-1
SLIDE 1

Consistent subset sampling

Konstantin Kutzkov and Rasmus Pagh

Work supported by: [sponsor logos]

slide-2
SLIDE 2

Agenda

  • Consistent sampling
  • Applications of consistent sampling of subsets
  • New sampling algorithm


slide-3
SLIDE 3

Sampling

Fundamental tool in (streaming) algorithms, e.g., to estimate the number of distinct items or the Jaccard similarity of sets.

slide-5
SLIDE 5

Consistent sampling

  • Setting: Data stream of items x1, x2, … ∈ U.
  • Problem: Create substream with only items in a random sample U2 ⊆ U.
  • Idea: Use a hash function h to determine whether to sample:
  • Include xi in sample iff h(xi) = 0.
  • U2 = h^(-1)(0).
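The idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: items are non-negative integers, and the hash `h(x) = ((a*x + b) mod P) mod r` is a textbook universal-hashing construction chosen here for convenience (it is only approximately uniform after the final `mod r`). The names `make_hash`, `consistent_sample`, and the example streams are hypothetical and do not come from the talk.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; assumed to exceed every item key


def make_hash(r, seed=0):
    """Return h(x) = ((a*x + b) mod P) mod r with random a, b.

    Assumption: this stands in for the 2-independent hash function of
    the slides; the final 'mod r' makes it only approximately uniform.
    """
    rng = random.Random(seed)
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % r


def consistent_sample(stream, h):
    """Keep exactly the stream items x with h(x) == 0."""
    return [x for x in stream if h(x) == 0]


h = make_hash(r=4, seed=1)
stream_a = [3, 8, 15, 8, 42, 7]
stream_b = [42, 8, 99, 3]
sample_a = consistent_sample(stream_a, h)
sample_b = consistent_sample(stream_b, h)
```

Because the decision depends only on `h(x)`, an item that occurs in both streams is either kept in both samples or dropped from both, which is what makes the sample consistent.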

slide-6
SLIDE 6

Consistent sampling

  • Property: If the hash function is 2-independent, the fraction sampled is concentrated around its expectation.

slide-7
SLIDE 7

Consistent subset sampling

  • Setting: Data stream of sets S1, S2, … ⊆ U.
  • Problem: For each set Si, generate every K ⊆ Si with |K| = k and h(K) = 0.
  • Hash function h should be 2-independent with Pr[h(S) = 0] = 1/r for every k-subset S.
  • Goal: Time and space efficiency.

slide-8
SLIDE 8

Consistent subset sampling

  • In talk: k = 4

slide-9
SLIDE 9

Motivating example

  • Set of shopping baskets (sets of products).
  • Generate sample of 4-sets of products bought together to, for example, estimate:
  • The number of frequent 4-sets.
  • Similarity in buying habits of two customers.

slide-10
SLIDE 10

Motivating example

  • Use to fine-tune parameters of frequent itemset mining algorithms (Apriori, FP-growth) in order to avoid the generation of a huge number of itemsets that are considered frequent.

slide-11
SLIDE 11

Naïve solutions

1. Generate explicitly all 4-subsets and apply a 2-independent hash function to each of them.

  • Slow: Time O(b^4) for a set of size b.

2. Sample items S ⊆ Si independently w.p. r^(-1/4).

  • Not 2-independent: Positive correlation among overlapping 4-subsets.
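The correlation in the second naïve solution can be checked exactly on a small instance. The sketch below is an illustration, not part of the talk: it assumes r = 16, so that the per-item probability r^(-1/4) is exactly 1/2, and it enumerates every keep/drop pattern of the items involved. The function name `joint_inclusion_probability` and the two example subsets are hypothetical.

```python
from fractions import Fraction
from itertools import product


def joint_inclusion_probability(q, subsets):
    """Exact probability that every subset in `subsets` is fully kept
    when each item is kept independently with probability q."""
    items = sorted(set().union(*subsets))
    total = Fraction(0)
    for kept_flags in product([False, True], repeat=len(items)):
        kept = {x for x, flag in zip(items, kept_flags) if flag}
        prob = Fraction(1)
        for flag in kept_flags:
            prob *= q if flag else 1 - q
        if all(s <= kept for s in subsets):
            total += prob
    return total


r = 16
q = Fraction(1, 2)            # r^(-1/4) for the assumed r = 16
K1 = frozenset({1, 2, 3, 4})
K2 = frozenset({2, 3, 4, 5})  # overlaps K1 in three items

p_single = joint_inclusion_probability(q, [K1])    # equals q^4 = 1/r
p_both = joint_inclusion_probability(q, [K1, K2])  # equals q^5
```

Each 4-subset is kept with probability q^4 = 1/r as desired, but the two overlapping subsets are both kept with probability q^5 = 1/32, well above the 1/r^2 = 1/256 that pairwise independence would require.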

slide-12
SLIDE 12

Decomposable hashing

We choose h to be “decomposable”:

  • h({i1, i2, i3, i4}) = (g(i1) + g(i2) + g(i3) + g(i4)) mod r

where g: U → [r] is an 8-independent hash function.
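A minimal sketch of such a decomposable hash, assuming integer items. The function g is built here as a random degree-7 polynomial over a prime field, a standard way to obtain 8-wise independence over that field; reducing the result mod r afterwards makes it only approximately uniform on [r]. The names `make_g` and `decomposable_hash` are hypothetical, not from the talk.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; assumed to exceed every item key


def make_g(r, independence=8, seed=0):
    """Random polynomial of degree independence-1 over Z_P, reduced mod r.

    Assumption: this stands in for the 8-independent g: U -> [r] of the
    slides; the final 'mod r' introduces a small bias.
    """
    rng = random.Random(seed)
    coeffs = [rng.randrange(P) for _ in range(independence)]

    def g(x):
        acc = 0
        for c in coeffs:  # Horner evaluation of the polynomial at x
            acc = (acc * x + c) % P
        return acc % r

    return g


def decomposable_hash(subset, g, r):
    """h(K) = (sum of g(i) for i in K) mod r."""
    return sum(g(i) for i in subset) % r


r = 10
g = make_g(r, seed=3)
value = decomposable_hash({5, 17, 23, 42}, g, r)
```

Since h is a sum of per-item values, it does not depend on the order of the items, and it can be evaluated from precomputed g-values without touching the subset as a whole.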
slide-13
SLIDE 13

Decomposable hashing

  • h is then 2-independent.

slide-14
SLIDE 14

Decomposable hashing

  • Our main reduction: Finding 4-subsets with hash value zero is equivalent to solving 4SUM on {g(x) | x ∈ Si}.
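The reduction can be illustrated by brute force on a tiny set. Since every g-value lies in {0, …, r-1}, a sum of four of them is 0 mod r exactly when it equals one of 0, r, 2r, 3r, so the modular condition turns into a handful of ordinary 4SUM targets. The sketch below checks this; the toy function `toy_g` is an assumption for the example only (it is not 8-independent), and this is not necessarily how the paper treats modular arithmetic.

```python
from itertools import combinations

r = 5


def toy_g(x):
    """Toy stand-in for g: U -> [r]; NOT an 8-independent hash."""
    return (7 * x + 3) % r


def zero_hash_subsets(S, g, r):
    """All 4-subsets K of S with h(K) = (sum of g-values) mod r == 0."""
    return {K for K in combinations(sorted(S), 4)
            if sum(g(x) for x in K) % r == 0}


def four_sum_subsets(S, g, r):
    """The same subsets, phrased as 4SUM over integers:
    the four g-values must add up to 0, r, 2r or 3r."""
    targets = {0, r, 2 * r, 3 * r}
    values = {x: g(x) for x in S}
    return {K for K in combinations(sorted(S), 4)
            if sum(values[x] for x in K) in targets}


S = range(1, 13)
by_hash = zero_hash_subsets(S, toy_g, r)
by_4sum = four_sum_subsets(S, toy_g, r)
```

Both functions enumerate all 4-subsets and therefore take O(b^4) time; they only serve to make the equivalence concrete.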

slide-15
SLIDE 15

Schroeppel/Shamir technique

[Figure: four lists of sorted values.]

slide-16
SLIDE 16

Schroeppel/Shamir technique

[Figure: four lists of sorted values, with a max-heap of pair sums.]

slide-17
SLIDE 17

Schroeppel/Shamir technique

[Figure: four lists of sorted values, with a max-heap of pair sums and a min-heap of pair sums.]
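A minimal sketch of the heap-based idea behind the Schroeppel/Shamir technique, under simplifying assumptions: sums are over plain integers (no reduction mod r), the input is four separate lists, and all function names are hypothetical. One heap yields the pair sums of the first two lists in increasing order, the other yields the pair sums of the last two lists in decreasing order, and the two sequences are scanned against each other. Python's `heapq` is a min-heap, so the max-heap is simulated by negating values.

```python
import heapq


def pair_sums_ascending(A, B):
    """Yield (a + b, a, b) in nondecreasing order of a + b.

    The heap holds one entry per element of A, so the space is
    O(len(A) + len(B)) rather than one slot per pair.
    """
    B = sorted(B)
    if not B:
        return
    heap = [(a + B[0], i, 0) for i, a in enumerate(A)]
    heapq.heapify(heap)
    while heap:
        s, i, j = heapq.heappop(heap)
        yield s, A[i], B[j]
        if j + 1 < len(B):
            heapq.heappush(heap, (A[i] + B[j + 1], i, j + 1))


def pair_sums_descending(C, D):
    """Yield (c + d, c, d) in nonincreasing order, via negated values."""
    for s, c, d in pair_sums_ascending([-c for c in C], [-d for d in D]):
        yield -s, -c, -d


def four_list_sum(A, B, C, D, target):
    """All (a, b, c, d) with a in A, ..., d in D and a + b + c + d == target."""
    low = pair_sums_ascending(A, B)
    high = pair_sums_descending(C, D)
    x, y = next(low, None), next(high, None)
    found = []
    while x is not None and y is not None:
        total = x[0] + y[0]
        if total < target:
            x = next(low, None)    # need a larger sum from A + B
        elif total > target:
            y = next(high, None)   # need a smaller sum from C + D
        else:
            sx, sy = x[0], y[0]
            left, right = [], []
            while x is not None and x[0] == sx:   # all pairs tied at sx
                left.append(x[1:])
                x = next(low, None)
            while y is not None and y[0] == sy:   # all pairs tied at sy
                right.append(y[1:])
                y = next(high, None)
            found.extend((a, b, c, d) for a, b in left for c, d in right)
    return found
```

To use this for 4-subsets of a single set, one would additionally have to discard tuples that reuse an item; that step, the handling of sums modulo r, and a bound on the number of tied pairs buffered at once are all left out of this sketch.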

slide-27
SLIDE 27

Schroeppel/Shamir technique

  • In paper: How to handle modular arithmetic.

slide-28
SLIDE 28

Main result

  • Theorem (vanilla version)

We can compute a consistent 2-independent sample of 4-subsets of a set of size b in expected time O(b^2 log log b + p b^4) and space O(b) for sampling probability p.
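To read the p b^4 term in the bound: a set of size b has C(b, 4) subsets of size 4, and each is sampled with probability p, so by linearity of expectation the sample has p · C(b, 4) = Θ(p b^4) elements on average. The short computation below makes that concrete; the function name and the example values of b and p are hypothetical.

```python
from fractions import Fraction
from math import comb


def expected_sample_size(b, p):
    """Expected number of sampled 4-subsets of a b-element set
    when each 4-subset is sampled with probability p."""
    return p * comb(b, 4)


example = expected_sample_size(100, Fraction(1, 1000))
```

For b = 100 there are 3,921,225 four-element subsets, so with p = 1/1000 roughly 3,900 of them are expected in the sample, whereas explicit enumeration would touch all of them.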

slide-29
SLIDE 29

Main result

In paper:

  • Generalized to k-subsets
  • Time/space trade-off

slide-30
SLIDE 30

Thank you!
