  1. Streaming Data Mining
  Debapriyo Majumdar, Data Mining – Fall 2014, Indian Statistical Institute Kolkata, November 20, 2014

  2. Examples of Streaming Data
  § Ocean behavior at a point
    – Temperature (once every half an hour)
    – Surface height (once or more per second)
    – Several places in the ocean: one sensor per 100 km²
    – Overall 1.5 million sensors
    – A few terabytes of data every day
  § Satellite image data
    – Terabytes of images sent to the earth every day
    – Converted to low resolution, but with many satellites this is still a lot of data
  § Web stream data
    – More than a hundred million search queries per day
    – Clicks

  3. Mining Streaming Data
  § Standard (non-stream) setting: the data is available whenever we need it
  § Streaming data: the data arrives in one or more streams
  § Process the stream as it arrives, if you can, and store the results
    – The results are much smaller than the stream itself
  § After that, the raw data is lost forever
  § Queries
    – Alert if the temperature exceeds some threshold (a standing query)
    – Maximum temperature in this month
    – Number of distinct users in the last month
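A minimal sketch (not from the slides) of how such queries can be answered over a stream that is never stored: a standing temperature alert plus a running maximum. The stream format and the threshold value are illustrative assumptions.

```python
# Minimal sketch: a standing query (temperature alert) and a running
# aggregate (maximum temperature), computed while the stream is discarded.

ALERT_THRESHOLD = 40.0  # illustrative threshold, in degrees Celsius

def process_temperature_stream(readings):
    """Consume an iterable of (timestamp, temperature) pairs one element at a time."""
    max_temperature = float("-inf")          # small running result kept in memory
    for timestamp, temperature in readings:
        if temperature > ALERT_THRESHOLD:    # standing query: fires whenever it holds
            print(f"ALERT at {timestamp}: temperature {temperature:.1f}")
        max_temperature = max(max_temperature, temperature)
        # The reading itself is not stored anywhere; it is gone after this iteration.
    return max_temperature

# Example usage with a tiny synthetic stream.
stream = [("2014-11-20 10:00", 28.5), ("2014-11-20 10:30", 41.2), ("2014-11-20 11:00", 33.0)]
print("Maximum temperature:", process_temperature_stream(stream))
```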

  4. Filtering Streaming Data
  § Filter part of the stream based on a criterion
  § If the criterion can be computed directly, this is easy
    – Example: filter all words starting with "ab"
  § Challenge: the criterion involves a membership lookup
    – Simplified example: a stream of emails as <email address, email> pairs
    – Task: filter the emails based on the email address
    – We have S = a set of 1 billion email addresses which are not spam
    – Keep emails from addresses in S, discard the others
  § Each email address takes ~20 bytes or more, so S totals more than 20 GB
    – Too large to keep in main memory
    – Option 1: make a disk access for each stream element and check
    – Option 2: a Bloom filter, using 1 GB of main memory

  5. Filtering with One Hash Function
  § Available memory: n bits (e.g. 1 GB ≈ 8 billion bits)
  § Use a bit array of n bits (in main memory), initialized to all 0s
  § A hash function h maps an email address to one of the n bits
  § Pre-compute the hash values of all addresses in S
  § Set the hashed bits to 1, leave the rest at 0

  6. Filtering with One Hash Function (continued)
  Same setup as slide 5, with a figure of the bit array after hashing S: the positions that some address in S hashes to are set to 1, the rest remain 0.

  7. Filtering with One Hash Function (continued)
  Setup as on slide 5: a bit array of n bits, with the bits hashed to by S set to 1.
  Online process, as the streaming data comes in:
  § Hash each stream element (email address) with h
  § Check whether the hashed bit is 1
  § If yes, accept the email; otherwise discard it
  § Note: x = y implies h(x) = h(y), but not vice versa
  § So there will be false positives (but no false negatives)
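A minimal sketch of this one-hash-function filter (my illustration, not code from the lecture); the bit-array size, the use of SHA-1 as h, and the sample addresses are assumptions.

```python
import hashlib

N_BITS = 1_000_003  # size n of the bit array; tiny compared to 8 billion, for illustration

def h(address: str) -> int:
    """Hash an email address to one of the n bit positions."""
    digest = hashlib.sha1(address.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % N_BITS

def build_bit_array(non_spam_addresses):
    """Pre-compute the hashes of S and set the corresponding bits to 1."""
    bits = bytearray(N_BITS)          # one byte per bit, for simplicity
    for address in non_spam_addresses:
        bits[h(address)] = 1
    return bits

def accept(bits, address: str) -> bool:
    """Online check: accept if the hashed bit is 1 (false positives are possible)."""
    return bits[h(address)] == 1

# Example usage with made-up addresses.
S = {"alice@example.com", "bob@example.com"}
bits = build_bit_array(S)
print(accept(bits, "alice@example.com"))    # True
print(accept(bits, "spammer@example.com"))  # usually False, occasionally a false positive
```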

  8. The Bloom Filter
  § Available memory: n bits
  § Use a bit array of n bits (in main memory), initialized to all 0s
  § We want to minimize the probability of false positives
  § Use k hash functions h1, h2, …, hk
  § Each hi maps an element to one of the n bits
  § Pre-compute the hash values of S for all hi
  § Set a bit to 1 if any element is hashed to that bit by any hi
  § Leave the rest of the bits at 0
  Online process, as the streaming data comes in:
  § Hash each element with all k hash functions
  § Check whether the hashed bit is 1 for every hash function
  § If yes, accept the element; otherwise discard it
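A minimal Bloom filter sketch (an illustration under my own assumptions, not the lecture's code); deriving the k hash functions by salting a single SHA-1 hash is one common convention, assumed here.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array of n bits and k salted hash functions."""

    def __init__(self, n_bits: int, k: int):
        self.n = n_bits
        self.k = k
        self.bits = bytearray(n_bits)   # one byte per bit, for simplicity

    def _positions(self, item: str):
        # The k bit positions for an item, one per salted hash function h_i.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.n

    def add(self, item: str):
        # Pre-computation step: set every bit the item hashes to under any h_i.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # Online step: accept only if all k hashed bits are 1.
        return all(self.bits[pos] == 1 for pos in self._positions(item))

# Example usage with made-up addresses.
bf = BloomFilter(n_bits=1_000_003, k=6)
bf.add("alice@example.com")
print(bf.might_contain("alice@example.com"))    # True
print(bf.might_contain("spammer@example.com"))  # usually False; false positives possible
```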

  9. The Bloom Filter: Analysis
  Let |S| = m, let the bit array have n bits, and let there be k hash functions h1, h2, …, hk.
  § Assumption: the hash functions are independent and map each element to each bit with equal probability
  § P[a particular hi maps a particular element to a particular bit] = 1/n
  § P[a particular hi does not map a particular element to a particular bit] = 1 − 1/n
  § P[no hi maps a particular element to a particular bit] = (1 − 1/n)^k
  § P[after hashing the m elements of S, a particular bit is still 0] = (1 − 1/n)^(km)
  § P[a particular bit is 1 after hashing all of S] = 1 − (1 − 1/n)^(km)
  False positive analysis
  § Now let a new element x not be in S; it should be discarded
  § Each bit hi(x) is 1 with probability 1 − (1 − 1/n)^(km)
  § P[the bit hi(x) is 1 for all i] = (1 − (1 − 1/n)^(km))^k
  § Using (1 − ε)^(1/ε) ≈ 1/e for small ε, this probability is ≈ (1 − e^(−km/n))^k
  § Optimal number of hash functions: k = ln 2 × n/m
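A quick numeric check of these formulas (my own illustration, plugging in the figures from slide 4: n ≈ 8 billion bits and m = 1 billion addresses):

```python
from math import exp, log

# Numeric check of the false-positive formula with the figures from slide 4:
# n = 8 billion bits (1 GB of memory), m = 1 billion email addresses.
n = 8_000_000_000
m = 1_000_000_000

def false_positive_rate(k: int) -> float:
    """Approximate false-positive probability (1 - e^(-km/n))^k."""
    return (1 - exp(-k * m / n)) ** k

k_opt = log(2) * n / m            # optimal k = ln 2 * n/m, about 5.5 here
print(f"optimal k ≈ {k_opt:.2f}")
for k in (1, 5, 6):
    print(f"k = {k}: false positive rate ≈ {false_positive_rate(k):.4f}")
# k = 1 gives about 0.118; k = 5 or 6 brings it down to roughly 0.02.
```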

  10. Counting Distinct Elements in a Stream
  § Example: on a website, count the number of distinct users in a month
    – Use the login id if the website requires an account
    – But what can an internet search engine use?
  § Standard solution: store the elements in a hash table, adding each new element as it arrives
    – What if the number of distinct elements is too large?
  § Approach: intelligent hashing, using much less memory
    – Hash each element to a sufficiently long bit string
    – There must be more possible hash values than distinct elements
    – Example: 64 bits give 2^64 possible values, sufficient for IP addresses
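For contrast, a tiny sketch of the standard exact solution mentioned above (my illustration); its memory grows with the number of distinct elements, which is exactly what the Flajolet-Martin estimate on the next slide avoids.

```python
# Standard exact solution: keep every distinct element in a hash-based set.
# Memory is proportional to the number of distinct elements.

def count_distinct_exact(stream):
    seen = set()
    for element in stream:
        seen.add(element)   # one stored entry per distinct element
    return len(seen)

print(count_distinct_exact(["u1", "u2", "u1", "u3", "u2"]))  # prints 3
```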

  11. The Flajolet–Martin Algorithm (1985)
  § We have stream elements and hash functions
  § Let a be an element and h a hash function
  § Tail length of h and a = the number of 0s at the end of h(a)
  § Let R = the maximum tail length seen so far (for one h over many elements)
  § How large can R be?
  § The more (distinct) elements we see, the more likely it is that R is large
  § P[for a given a, h(a) has tail length ≥ r] = 2^(−r)
  § P[among m distinct elements, none has tail length ≥ r] = (1 − 2^(−r))^m
  § Rewrite this as ((1 − 2^(−r))^(2^r))^(m·2^(−r)) ≈ (e^(−1))^(m·2^(−r)) = e^(−m·2^(−r)), using (1 − ε)^(1/ε) ≈ 1/e for small ε
  § So if m << 2^r this probability → 1, and if m >> 2^r it → 0
  § Use 2^R as an estimate of the number of distinct elements
  § Use many hash functions and combine their estimates using averages and medians
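A minimal sketch of the Flajolet-Martin estimate (my illustration, not the original implementation); salting SHA-1 to get many hash functions, and combining the per-hash estimates as the median of group averages, are assumptions about one common way to do it.

```python
import hashlib
from statistics import median

def tail_length(value: int) -> int:
    """Number of trailing 0 bits in a 64-bit hash value."""
    if value == 0:
        return 64                       # treat an all-zero hash as maximal tail
    count = 0
    while value & 1 == 0:
        value >>= 1
        count += 1
    return count

def h(item: str, salt: int) -> int:
    """Salted 64-bit hash of an item: one hash function per salt value."""
    digest = hashlib.sha1(f"{salt}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def fm_estimate(stream, num_hashes: int = 30, group_size: int = 5) -> float:
    """Per hash function, estimate 2^R (R = max tail length seen);
    then return the median of group averages of these estimates."""
    max_tails = [0] * num_hashes
    for item in stream:
        for i in range(num_hashes):
            max_tails[i] = max(max_tails[i], tail_length(h(item, i)))
    estimates = [2 ** r for r in max_tails]
    groups = [estimates[i:i + group_size] for i in range(0, num_hashes, group_size)]
    return median(sum(g) / len(g) for g in groups)

# Example usage: 1000 distinct synthetic user ids, each seen several times.
stream = [f"user{i % 1000}" for i in range(5000)]
print(fm_estimate(stream))   # roughly on the order of 1000
```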

  12. Reference
  § Mining of Massive Datasets, by Leskovec, Rajaraman and Ullman
