Examples of Streaming Data Ocean behavior at a point - PowerPoint PPT Presentation

Streaming Data Mining
Debapriyo Majumdar
Data Mining – Fall 2014, Indian Statistical Institute Kolkata
November 20, 2014

Examples of Streaming Data
§ Ocean behavior at a point
  – Temperature (once every half hour)
  – Surface height (once or more per second)
  – Several places in the ocean: one sensor per 100 km²
  – Overall 1.5 million sensors
  – A few terabytes of data every day
§ Satellite image data
  – Terabytes of images sent to Earth every day
  – Converted to low resolution, but with many satellites this is still a lot of data
§ Web stream data
  – More than a hundred million search queries per day
  – Clicks

Mining Streaming Data
§ Standard (non-stream) setting: data is available whenever we need it
§ Streaming data: data arrives in one or more streams
§ If you can, process each element as it arrives and store the results
  – The size of the results is much smaller than the stream size
§ After that, the data is lost forever
§ Queries
  – Temperature alert if > some degree (standing query)
  – Maximum temperature in this month
  – Number of distinct users in the last month
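The "process, store results, then the data is lost" pattern can be sketched in a few lines. This is only an illustrative sketch: the function name `process_stream`, the threshold value, and the sample readings are assumptions made for the example, not part of the slides.

```python
from typing import Iterable, List, Tuple


def process_stream(readings: Iterable[float],
                   alert_threshold: float) -> Tuple[float, List[int]]:
    """Make a single pass over a temperature stream, keeping only summaries."""
    max_seen = float("-inf")   # summary for the "maximum temperature" query
    alerts: List[int] = []     # positions where the standing query fired
    for position, temperature in enumerate(readings):
        if temperature > alert_threshold:   # standing query: alert if > threshold
            alerts.append(position)
        if temperature > max_seen:
            max_seen = temperature
        # The reading itself is not stored; after this step it is gone.
    return max_seen, alerts


# Hypothetical readings; iter() makes the point that the data is seen only once.
max_temp, alerts = process_stream(iter([21.5, 22.0, 30.2, 19.8, 31.0]),
                                  alert_threshold=30.0)
print(max_temp, alerts)
```

The stored state (one number plus the alert positions) is much smaller than the stream, which is what makes both queries answerable after the raw readings are gone.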

Filtering Streaming Data
§ Filter part of the stream based on a criterion
§ If the criterion can be computed from the element itself, filtering is easy
  – Example: filter all words starting with "ab"
§ Challenge: the criterion involves a membership lookup
  – Simplified example: a stream of <email address, email> pairs
  – Task: filter emails based on email addresses
  – Have S = set of 1 billion email addresses which are not spam
  – Keep emails from addresses in S, discard others
§ Each email address ~ 20 bytes or more, so S takes > 20 GB in total
  – Too large to keep in main memory
  – Option 1: make a disk access for each stream element and check
  – Option 2: Bloom filter, using 1 GB of main memory

Filtering with One Hash Function
§ Available memory: n bits (e.g. 1 GB ~ 8 billion bits)
§ Use a bit array of n bits (in main memory), initialized to all 0s
§ A hash function h maps an email address → one of the n bits
§ Pre-compute the hash values of all addresses in S
§ Set the hashed bits to 1, leave the rest at 0

Filtering with One Hash Function: Online Process
[Figure: bit array with the hashed bits set to 1; one stream element is accepted, another is discarded]
§ Streaming data arrives one element at a time
§ Hash each element (email address)
§ Check whether the hashed bit is 1
§ If yes, accept the email; otherwise discard it
§ Note: x = y implies h(x) = h(y), but not vice versa
§ So there will be false positives
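A minimal sketch of the single-hash filter, under stated assumptions: the slides do not name a hash function, so SHA-256 reduced modulo n is used here purely for illustration, and the helper names (`bit_index`, `build_bit_array`, `accept`) and the tiny example set are hypothetical.

```python
import hashlib


def bit_index(address: str, n_bits: int) -> int:
    """Map an address to one of n_bits positions (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256(address.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits


def build_bit_array(allowed, n_bits: int) -> bytearray:
    """Offline step: set the bit h(a) to 1 for every address a in S."""
    bits = bytearray((n_bits + 7) // 8)   # all bits start at 0
    for address in allowed:
        i = bit_index(address, n_bits)
        bits[i // 8] |= 1 << (i % 8)
    return bits


def accept(address: str, bits: bytearray, n_bits: int) -> bool:
    """Online step: accept the element iff its hashed bit is 1."""
    i = bit_index(address, n_bits)
    return bool(bits[i // 8] & (1 << (i % 8)))


# Tiny hypothetical stand-in for the set S of one billion good addresses.
allowed = {"alice@example.com", "bob@example.com"}
n_bits = 1024
bits = build_bit_array(allowed, n_bits)
print([accept(a, bits, n_bits) for a in sorted(allowed)])
```

Every address in S is always accepted. An address outside S is accepted only if it happens to hash to a bit that some member of S already set, which is exactly the false-positive case noted above.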

The Bloom Filter
§ Available memory: n bits
§ Use a bit array of n bits (in main memory), initialized to all 0s
§ Want to minimize the probability of false positives
§ Use k hash functions $h_1, h_2, \ldots, h_k$
§ Each $h_i$ maps an element → one of the n bits
§ Pre-compute the hash values of S for all $h_i$
§ Set a bit to 1 if any element is hashed to that bit by any $h_i$
§ Leave the rest of the bits at 0
Online process: streaming data arrives
§ Hash each element with all k hash functions
§ Check whether the hashed bit is 1 for every hash function
§ If yes, accept the element; otherwise discard it
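The construction above could be sketched as follows. This is one possible illustration, not a reference implementation: the k hash functions are simulated by prefixing the element with the index i before applying SHA-256, which is an assumption of this sketch, and the class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Bit array of n_bits positions with k (simulated) hash functions."""

    def __init__(self, n_bits: int, k: int):
        self.n_bits = n_bits
        self.k = k
        self.bits = bytearray((n_bits + 7) // 8)   # all bits start at 0

    def _positions(self, item: str):
        # Assumed construction: h_i(item) = SHA-256("i:item") mod n_bits.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item: str) -> None:
        """Offline step: set the bits h_1(item), ..., h_k(item) to 1."""
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        """Online step: accept only if all k hashed bits are 1."""
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


bloom = BloomFilter(n_bits=1024, k=3)
members = [f"user{i}@example.com" for i in range(50)]   # hypothetical set S
for address in members:
    bloom.add(address)
print(all(bloom.might_contain(a) for a in members))
```

A `False` answer is always correct (the element is definitely not in S), while a `True` answer may be a false positive. That asymmetry is why the method is named `might_contain` here.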

The Bloom Filter: Analysis
Let |S| = m, let the bit array have n bits, and let there be k hash functions $h_1, h_2, \ldots, h_k$
§ Assumption: the hash functions are independent, and each maps an element to each of the n bits with equal probability
§ P[a particular $h_i$ maps a particular element to a particular bit] = $1/n$
§ P[a particular $h_i$ does not map a particular element to a particular bit] = $1 - 1/n$
§ P[no $h_i$ maps a particular element to a particular bit] = $(1 - 1/n)^k$
§ P[after hashing the m elements of S, one particular bit is still 0] = $(1 - 1/n)^{km}$
§ P[a particular bit is 1 after hashing all of S] = $1 - (1 - 1/n)^{km}$
False positive analysis
§ Now let a new element x not be in S; it should be discarded
§ For each i, the bit at position $h_i(x)$ is 1 with probability $1 - (1 - 1/n)^{km}$
§ P[the bit at position $h_i(x)$ is 1 for all i] = $(1 - (1 - 1/n)^{km})^k$
§ Since $(1 - \varepsilon)^{1/\varepsilon} \approx 1/e$ for small $\varepsilon$, taking $\varepsilon = 1/n$ gives $(1 - 1/n)^{km} \approx e^{-km/n}$, so this probability is $\approx (1 - e^{-km/n})^k$
§ Optimal number of hash functions: $k = (n/m) \ln 2$
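To make the formulas concrete, the approximation can be evaluated for the email example (n = 8 billion bits, m = 1 billion addresses). The function names below are chosen for this sketch only.

```python
import math


def false_positive_rate(n_bits: int, m_items: int, k: int) -> float:
    """Approximate false positive probability (1 - e^{-km/n})^k."""
    return (1.0 - math.exp(-k * m_items / n_bits)) ** k


def optimal_k(n_bits: int, m_items: int) -> float:
    """Real-valued minimizer k = (n/m) ln 2; in practice round to an integer."""
    return (n_bits / m_items) * math.log(2)


n, m = 8_000_000_000, 1_000_000_000   # 1 GB of bits, 1 billion addresses
for k in (1, 2, 4, 6, 8):
    print(k, false_positive_rate(n, m, k))
print("optimal k:", optimal_k(n, m))
```

With these numbers a single hash function gives a false positive rate of about 0.12, while k = 6 (close to the optimum of 8 ln 2 ≈ 5.5) brings it down to about 0.022, using the same 1 GB of memory.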

Counting Distinct Elements in a Stream
§ Example: on a website, count the number of distinct users in a month
  – Use the login id if the website requires an account
  – What about an Internet search engine?
§ Standard solution: store the elements in a hash table, adding each new element as it arrives
  – What if the number of distinct elements is too large?
§ Approach: intelligent hashing, using much less memory
  – Hash each element to a sufficiently long bit string
  – There must be more possible hash values than distinct elements
  – Example: 64 bits → $2^{64}$ possible values, sufficient for IP addresses

The Flajolet–Martin Algorithm (1985)
§ Stream elements, hash functions
§ Let a be an element, h a hash function
§ Tail length of h and a = number of 0s at the end of the bit string h(a)
§ Let R = maximum tail length seen so far (for h over the elements seen)
§ How large can R be?
§ The more distinct elements we see, the more likely it is that R is large
§ P[for a given a, h(a) has tail length ≥ r] = $2^{-r}$
§ P[among m distinct elements, none has tail length ≥ r] = $(1 - 2^{-r})^m$
§ Rewrite this as $\left((1 - 2^{-r})^{2^r}\right)^{m 2^{-r}} \approx (e^{-1})^{m 2^{-r}} = e^{-m 2^{-r}}$, using $(1 - \varepsilon)^{1/\varepsilon} \approx 1/e$ for small $\varepsilon = 2^{-r}$
§ So if $m \ll 2^r$, this probability → 1; if $m \gg 2^r$, this probability → 0
§ Use $2^R$ as an estimate of the number of distinct elements
§ Use many hash functions: combine the estimates using averages and medians
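A rough sketch of the estimator described above. Several details are assumptions of this sketch rather than something the slides specify: the hash functions are simulated by seeding SHA-256, 64-bit hash values are used, and the estimates are combined by averaging within small groups and then taking the median of the group averages, which is one common way to read "average and median".

```python
import hashlib
import statistics


def tail_length(value: int, width: int = 64) -> int:
    """Number of trailing 0 bits in value (taken to be width for value 0)."""
    if value == 0:
        return width
    return (value & -value).bit_length() - 1   # index of the lowest set bit


def hash64(item: str, seed: int) -> int:
    """Assumed hash family: the first 64 bits of SHA-256("seed:item")."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fm_estimate(stream, num_hashes: int = 64, group_size: int = 8) -> float:
    """Estimate the number of distinct elements in stream."""
    max_tail = [0] * num_hashes   # R for each hash function
    for item in stream:
        for seed in range(num_hashes):
            max_tail[seed] = max(max_tail[seed],
                                 tail_length(hash64(item, seed)))
    estimates = [2 ** r for r in max_tail]   # 2^R per hash function
    group_means = [statistics.mean(estimates[i:i + group_size])
                   for i in range(0, num_hashes, group_size)]
    return statistics.median(group_means)


# Hypothetical stream: 500 distinct users, each appearing three times.
stream = [f"user{i % 500}" for i in range(1500)]
print(fm_estimate(stream))
```

Only `num_hashes` small integers are kept in memory, regardless of how long the stream is. The result is an estimate, and repeated elements do not change it, because a repeated element always produces the same tail lengths.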

Reference
§ Mining of Massive Datasets, by Leskovec, Rajaraman and Ullman
