Consistent subset sampling
Konstantin Kutzkov and Rasmus Pagh Work supported by:
Agenda: Consistent sampling; Applications of consistent sampling of subsets; New sampling algorithm
Sampling: Fundamental tool in (streaming) algorithms, e.g., to estimate the number of distinct items or the Jaccard similarity of sets.
random sample U2 ⊆ U.
determine whether to sample:
Property: If the hash function is 2-independent, the fraction sampled is concentrated around its expectation.
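A minimal Python sketch of hash-based consistent sampling of single items may help here. The family h(x) = ((a·x + b) mod P) mod r, the prime P, and the names `make_hash` and `sample` are illustrative assumptions, not taken from the slides; only the rule "sample when the hash value is zero" follows the talk.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; assumed larger than any item id


def make_hash(r, seed):
    """Return h(x) = ((a*x + b) mod P) mod r, an (approximately) 2-independent family."""
    rng = random.Random(seed)
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % r


def sample(items, h):
    """Keep exactly the items whose hash value is zero."""
    return {x for x in items if h(x) == 0}


h = make_hash(r=10, seed=1)
stream_a = range(0, 6000)
stream_b = range(4000, 10000)
sample_a = sample(stream_a, h)
sample_b = sample(stream_b, h)

# Consistency: an item shared by both streams is sampled in both or in neither.
shared = set(stream_a) & set(stream_b)
assert sample_a & shared == sample_b & shared
```

Because the decision depends only on the item and the shared hash function, samples taken independently from different streams can later be compared or combined.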
with |K|=k and h(K)=0.
Pr[h(S)=0] = 1/r.
In talk: k = 4
together to, for example, estimate:
Use to fine-tune parameters of frequent itemset mining algorithms (Apriori, FP-growth) in order to avoid the generation of a huge number of itemsets that are considered frequent.
1. Generate explicitly all 4-subsets and apply a 2-independent hash function to each of them.
among overlapping 4-subsets.
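The explicit approach in item 1 can be sketched as follows. This is only an illustration: the encoding of a 4-subset as one integer, the universe bound `U`, the prime `P`, and the function names are assumptions made for the example, and the hash family is the usual linear one rather than anything specified in the talk.

```python
from itertools import combinations
from math import comb
import random

P = (1 << 61) - 1   # prime modulus (assumption)
U = 1 << 15         # assumed universe size, so an encoded 4-subset is < 2**60 < P


def make_subset_hash(r, seed):
    """Hash a 4-subset by encoding it as one integer and applying a linear hash."""
    rng = random.Random(seed)
    a, b = rng.randrange(1, P), rng.randrange(P)

    def h(subset):
        code = 0
        for x in sorted(subset):     # canonical order gives one code per subset
            code = code * U + x
        return ((a * code + b) % P) % r
    return h


def sample_4_subsets_naive(items, h):
    """Enumerate every 4-subset and keep those with hash value zero."""
    return [s for s in combinations(sorted(items), 4) if h(s) == 0]


items = list(range(100, 120))
h = make_subset_hash(r=50, seed=7)
sampled = sample_4_subsets_naive(items, h)
print(len(sampled), "of", comb(len(items), 4), "4-subsets sampled")
```

The loop touches all C(b, 4) subsets of a set of size b, even though only about a 1/r fraction of them is kept.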
We choose h to be “decomposable”:
where g: U → [r] is an 8-independent hash function.
h is 2-independent.
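The defining formula for h did not survive in this transcript. As a hedged illustration only, the sketch below assumes the decomposable form h(S) = (sum of g(x) over x in S) mod r, which is suggested by the 4SUM reduction mentioned in the talk, and builds g as a random degree-7 polynomial over a prime field (a standard way to obtain 8-independence). The modulus `P` and the names `make_g` and `h` are assumptions.

```python
import random

P = (1 << 61) - 1  # prime modulus (assumption)


def make_g(r, seed, independence=8):
    """Random polynomial of degree independence-1 over Z_P, reduced mod r."""
    rng = random.Random(seed)
    coeffs = [rng.randrange(P) for _ in range(independence)]

    def g(x):
        acc = 0
        for c in coeffs:          # Horner's rule
            acc = (acc * x + c) % P
        return acc % r
    return g


def h(subset, g, r):
    """Assumed decomposable form: sum of the per-item values, modulo r."""
    return sum(g(x) for x in subset) % r


r = 97
g = make_g(r, seed=3)
S = (5, 11, 23, 42)
# The hash of a subset can be assembled from partial sums of its parts.
left = (g(5) + g(11)) % r
right = (g(23) + g(42)) % r
assert h(S, g, r) == (left + right) % r
```

Under this assumption, the hash of a 4-subset never has to be computed from scratch: it can be combined from values already computed for its elements or for pairs of elements.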
Our main reduction: Finding 4-subsets with hash value zero is equivalent to solving 4SUM on {g(x) | x ∈ S_i}
[Figure: four lists of sorted values, a maxheap of pair sums, and a minheap of pair sums.]
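The figure appears to show the classic heap-based way of solving 4SUM in small space: pair sums from two lists are produced in increasing order, pair sums from the other two in decreasing order, and the two streams are scanned against each other. The sketch below is one reading of that picture, not the authors' code. It works over plain integers (no modular arithmetic), takes four separate lists, and does not enforce that the four chosen elements are distinct; the function names are hypothetical.

```python
import heapq
from itertools import groupby, product
from operator import itemgetter


def pair_sums(xs, ys, descending=False):
    """Yield (x + y, x, y) over all pairs in sorted order, using a heap of size len(xs)."""
    sign = -1 if descending else 1
    xs = sorted(xs, reverse=descending)
    ys = sorted(ys, reverse=descending)
    if not xs or not ys:
        return
    # One heap entry per x: its best not-yet-reported partner in ys.
    heap = [(sign * (x + ys[0]), i, 0) for i, x in enumerate(xs)]
    heapq.heapify(heap)
    while heap:
        key, i, j = heapq.heappop(heap)
        yield sign * key, xs[i], ys[j]
        if j + 1 < len(ys):
            heapq.heappush(heap, (sign * (xs[i] + ys[j + 1]), i, j + 1))


def four_sum(A, B, C, D, target):
    """Return all (a, b, c, d) with a + b + c + d == target."""
    low = groupby(pair_sums(A, B), key=itemgetter(0))                    # min-heap side
    high = groupby(pair_sums(C, D, descending=True), key=itemgetter(0))  # max-heap side
    lo, hi = next(low, None), next(high, None)
    out = []
    while lo is not None and hi is not None:
        total = lo[0] + hi[0]
        if total < target:
            lo = next(low, None)      # need a larger sum from A, B
        elif total > target:
            hi = next(high, None)     # need a smaller sum from C, D
        else:
            lo_pairs = [(a, b) for _, a, b in lo[1]]
            hi_pairs = [(c, d) for _, c, d in hi[1]]
            out.extend(l + h for l, h in product(lo_pairs, hi_pairs))
            lo, hi = next(low, None), next(high, None)
    return out


A, B, C, D = [1, 4, 9], [2, 7], [-3, 0, 5], [-5, -6, -12]
found = sorted(four_sum(A, B, C, D, target=0))
brute = sorted(q for q in product(A, B, C, D) if sum(q) == 0)
assert found == brute
```

Each heap holds one entry per element of a single list, so the pair sums are never all materialized at once; only groups of equal sums are buffered when a match is found.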
In paper: How to handle modular arithmetic
We can compute a consistent 2-independent sample of 4-subsets of a set of size b in expected time O(b² log log b + pb⁴) and space O(b) for sampling probability p.
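To get a rough feel for the stated bound, the following sketch compares it with the number of 4-subsets that explicit enumeration would touch. It ignores the constants hidden in the O-notation and picks base-2 logarithms arbitrarily, so it only indicates orders of magnitude; the function names and the sample values of b and p are assumptions.

```python
from math import comb, log2


def naive_cost(b):
    """Number of 4-subsets that explicit enumeration has to generate."""
    return comb(b, 4)


def stated_bound(b, p):
    """The expression b^2 log log b + p b^4 from the result, without constants."""
    return b**2 * log2(log2(b)) + p * b**4


b, p = 1000, 1e-6
print(f"explicit enumeration: {naive_cost(b):.2e} subsets")
print(f"stated bound:         {stated_bound(b, p):.2e} operations (up to constants)")
```

For small sampling probability p, the second term is roughly the expected size of the sample itself, so the running time is dominated by the output plus a near-quadratic overhead.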
In paper: Generalized to k-subsets; time/space trade-off