Streaming Data Mining
Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata
November 20, 2014
Examples of Streaming Data
– Ocean behavior at a point: temperature (once every half an
– Example: filter all words starting with "ab"
– Simplified example: a stream of emails, each element a pair <email address, email>
– Task: filter emails based on email addresses
– Have S = a set of 1 billion email addresses that are not spam
– Keep emails from addresses in S, discard the others
– S is too large to keep in main memory
– Option 1: make a disk access for each stream element and check
– Option 2: Bloom filter, using 1 GB of main memory
[Figure: Bloom filter bit array with some bits set to 1; one stream element is discarded, another is accepted]
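The filtering idea can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the lecture: the class name `BloomFilter`, the method names, and the way the k hash functions are derived (one BLAKE2b hash salted with the index i) are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Bit array of n bits with k hash functions (illustrative sketch)."""

    def __init__(self, n_bits: int, k: int) -> None:
        self.n = n_bits
        self.k = k
        self.bits = bytearray((n_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item: str):
        # Assumption: simulate k hash functions by salting one hash with i.
        for i in range(self.k):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=i.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.n

    def add(self, item: str) -> None:
        # Set the bit at each of the k hashed positions to 1.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True if all k bits are 1: the item is in S or is a false positive.
        # False means the item is definitely not in S.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Usage: accept a stream element only if its address passes the filter.
allowed = BloomFilter(n_bits=8000, k=5)
for address in ("alice@example.org", "bob@example.org"):
    allowed.add(address)

stream = [("alice@example.org", "hello"), ("mallory@example.net", "buy now")]
accepted = [(addr, body) for addr, body in stream if allowed.might_contain(addr)]
```

An address that was added is always accepted, so there are no false negatives. An address outside S is usually discarded but can occasionally pass.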
Let |S| = m, let the bit array have n bits, and let there be k hash functions $h_1, h_2, \ldots, h_k$
§ Assumption: the hash functions are independent, and each maps an element to every bit position with equal probability
§ P[a particular $h_i$ maps a particular element to a particular bit] = $1/n$
§ P[a particular $h_i$ does not map a particular element to a particular bit] = $1 - 1/n$
§ P[no $h_i$ maps a particular element to a particular bit] = $(1 - 1/n)^k$
§ P[after hashing the m elements of S, one particular bit is still 0] = $(1 - 1/n)^{km}$
§ P[a particular bit is 1 after hashing all of S] = $1 - (1 - 1/n)^{km}$
False positive analysis
§ Now let a new element x not be in S; it should be discarded
§ For each i, the bit at position $h_i(x)$ is 1 with probability $1 - (1 - 1/n)^{km}$
§ P[the bit at position $h_i(x)$ is 1 for all i] = $\left(1 - (1 - 1/n)^{km}\right)^k$
§ Since $(1 - \varepsilon)^{1/\varepsilon} \approx 1/e$ for small $\varepsilon$, taking $\varepsilon = 1/n$ gives $(1 - 1/n)^{km} \approx e^{-km/n}$, so this probability is $\approx \left(1 - e^{-km/n}\right)^k$
§ Optimal number of hash functions: $k = (n/m) \ln 2$
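To get a feel for the numbers, the formulas above can be evaluated for the email example. The sketch below assumes that 1 GB of memory corresponds to 8 × 10^9 bits and that m = 10^9; the function names `false_positive_rate` and `optimal_k` are made up for this illustration.

```python
import math


def false_positive_rate(n_bits: int, m_items: int, k: int):
    """Return (exact, approximate) false positive probability."""
    # Exact form (1 - (1 - 1/n)^(km))^k, using log1p for numerical accuracy.
    still_zero = math.exp(k * m_items * math.log1p(-1.0 / n_bits))
    exact = (1.0 - still_zero) ** k
    # Approximation (1 - e^(-km/n))^k.
    approx = (1.0 - math.exp(-k * m_items / n_bits)) ** k
    return exact, approx


def optimal_k(n_bits: int, m_items: int) -> float:
    """Real-valued optimum k = (n/m) ln 2; round to an integer in practice."""
    return (n_bits / m_items) * math.log(2)


n = 8 * 10**9  # assumption: 1 GB = 8 * 10^9 bits
m = 10**9      # one billion addresses in S

for k in (1, 2, 5, 6):
    exact, approx = false_positive_rate(n, m, k)
    print(f"k={k}: exact={exact:.6f} approx={approx:.6f}")
print(f"optimal k = {optimal_k(n, m):.3f}")
```

With these assumed values the real-valued optimum falls between 5 and 6, so in practice one would compare the two neighboring integers.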
– Use the login ID if the website requires an account
– What to use for an Internet search engine?
– What if the number of distinct elements is too large?
– Hash each element to a sufficiently long bit string
– Must have more possible hash values than the number of distinct elements
– Example: 64 bits → $2^{64}$ possible values, sufficient for IP addresses
§ Stream elements, hash functions
§ Let a be an element, h a hash function
§ Tail length of h and a = number of 0s at the end of h(a)
§ Let R = maximum tail length seen so far (for h, over the elements seen)
§ How large can R be?
§ The more distinct elements we see, the more likely it is that R is large
§ P[for a given a, h(a) has tail length ≥ r] = $2^{-r}$
§ P[among m distinct elements, none has tail length ≥ r] = $(1 - 2^{-r})^m$
§ Rewrite this as:
$$(1 - 2^{-r})^m = \left((1 - 2^{-r})^{2^r}\right)^{m 2^{-r}} \approx e^{-m 2^{-r}}$$
using $(1 - \varepsilon)^{1/\varepsilon} \approx 1/e$ for small $\varepsilon$, with $\varepsilon = 2^{-r}$
§ So: if $m \ll 2^r$, the probability → 1; if $m \gg 2^r$, the probability → 0
§ Use $2^R$ as an estimate of the number of distinct elements
§ Use many hash functions: combine the estimates using average and median
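The estimator described above can be sketched as follows. Several details are assumptions made for this illustration and are not stated on the slides: the hash functions are simulated by salting BLAKE2b with an index, the estimates are averaged in groups of 8 before taking the median of the group averages, and the names `tail_length`, `hash64`, and `estimate_distinct` are hypothetical.

```python
import hashlib
import statistics


def tail_length(value: int, width: int = 64) -> int:
    """Number of 0 bits at the end of value (width if value is 0)."""
    if value == 0:
        return width
    # value & -value isolates the lowest set bit.
    return (value & -value).bit_length() - 1


def hash64(item: str, seed: int) -> int:
    """Assumed family of 64-bit hash functions, indexed by seed."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def estimate_distinct(stream, num_hashes: int = 64, group_size: int = 8) -> float:
    """Estimate the number of distinct elements as 2^R, combined over hashes."""
    max_tail = [0] * num_hashes  # R for each hash function
    for item in stream:
        for j in range(num_hashes):
            r = tail_length(hash64(item, j))
            if r > max_tail[j]:
                max_tail[j] = r
    estimates = [2 ** r for r in max_tail]
    # Assumed combination rule: average within groups, median across groups.
    groups = [
        estimates[i:i + group_size]
        for i in range(0, num_hashes, group_size)
    ]
    return statistics.median(sum(g) / len(g) for g in groups)


# Usage: 1000 distinct elements, each appearing three times in the stream.
stream = [f"user{i}" for i in range(1000)] * 3
print(estimate_distinct(stream))
```

Only one small integer per hash function is kept in memory, regardless of stream length. The result should be read as an order-of-magnitude estimate, since each individual estimate is a power of 2 and the averages are pulled upward by occasional long tails.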