Streaming Algorithms Stony Brook University CSE545, Fall 2016

Big Data Analytics -- The Class
We will learn:
● to analyze different types of data:
  ○ high dimensional
  ○ graphs
  ○ infinite/never-ending
  ○ labeled
● to use different models of computation:
  ○ MapReduce
  ○ streams and online algorithms
  ○ single machine in-memory
  ○ Spark
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, www.mmds.org

Motivation
One often does not know when a set of data will end. Such data:
● cannot be stored
● is not practical to access repeatedly
● arrives rapidly
● does not make sense to ever “insert” into a database
The data cannot fit on disk, but we would still like to generalize / summarize it.
Examples:
● Google search queries
● Satellite imagery data
● Text messages, status updates
● Click streams

Stream Queries
1. Standing Queries: stored and permanently executing.
2. Ad-Hoc: one-time questions -- must store expected parts / summaries of streams.
E.g. each would handle the following differently: What is the mean of values seen so far?
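As an illustration of the standing-query case, here is a minimal Python sketch that keeps the mean of all values seen so far using only a count and a running sum. The class name `RunningMean` and its `update` method are hypothetical names chosen for this example, not something taken from the slides.

```python
class RunningMean:
    """Standing query: mean of all values seen so far, in constant memory."""

    def __init__(self):
        self.count = 0    # number of stream elements seen
        self.total = 0.0  # sum of stream elements seen

    def update(self, value):
        """Consume one stream element and return the current mean."""
        self.count += 1
        self.total += value
        return self.total / self.count


rm = RunningMean()
for v in [4, 8, 6]:
    current = rm.update(v)
print(current)  # 6.0
```

An ad-hoc version of the same question, asked only once and possibly about a past window, could not be answered this way unless the needed summary (here, count and sum) had been stored in advance.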

Streaming Algorithms
● Sampling
● Filtering Data
● Count Distinct Elements
● Counting Moments
● Incremental Processing*

General Stream Processing Model

Sampling and Filtering Data
Sampling: Create a random sample for statistical analysis.
● Basic version: generate a random number for each tuple; if < sample%, keep it
  ○ Problem: tuples usually are not the units of analysis for statistical analyses
● Assume some key is provided as the unit of analysis to sample over
  ○ E.g. ip_address, user_id, document_id, etc.
● Want 1/20th of all “keys” (e.g. users)
  ○ Hash to 20 buckets; bucket 1 is “in”; others are “out”
  ○ Note: do not need to store anything (except hash functions); may be part of a standing query
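The key-based sampling above can be sketched in Python as follows. This is a sketch under assumptions: the function name `in_sample`, the use of MD5 from `hashlib` as the hash, the zero-based bucket numbering (bucket 0 plays the role of the “in” bucket), and the toy `stream` of (user_id, value) tuples are all illustrative choices, not part of the slides.

```python
import hashlib


def in_sample(key, num_buckets=20, chosen_bucket=0):
    """Return True if `key` hashes to the single "in" bucket.

    The decision depends only on the key, so every tuple belonging to a
    sampled key is kept and nothing needs to be stored per key.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % num_buckets
    return bucket == chosen_bucket


# Toy stream: 1000 hypothetical users, 5 tuples each.
stream = [("user%d" % (i % 1000), i) for i in range(5000)]
sample = [(key, value) for key, value in stream if in_sample(key)]
sampled_keys = {key for key, _ in sample}
print(len(sampled_keys), "of 1000 keys sampled")
```

Because the hash is deterministic, a user is either always in the sample or never in it, which is what makes the key (rather than the tuple) the unit of analysis.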

Sampling and Filtering Data
Filtering: Select elements with property x
Example: 40B email addresses to bypass spam filter
● The Bloom Filter (approximates; allows FPs)
  ○ Given:
    ■ |S| keys to filter; will be mapped to |B| bits
    ■ hashes = h_1, h_2, …, h_k independent hash functions
  ○ Algorithm:
    ■ Set all B to 0
    ■ For each i in hashes, for each s in S: set B[h_i(s)] = 1
    ■ … #usually embedded in other code
    ■ While key x arrives next in stream:
      ● if B[h_i(x)] == 1 for all i in hashes: do as if x is in S
      ● else: do as if x is not in S
What is the probability of a false positive?
● What fraction of |B| are 1s? Like throwing d = |S| * k darts at n = |B| targets.
  ○ 1 dart: a given target is hit with probability 1/n
  ○ d darts: a given target is missed with probability $(1 - 1/n)^d \approx e^{-d/n}$, so a fraction $1 - e^{-d/n}$ of the bits are 1s
● Probability of all k hashes being 1? $(1 - e^{-|S| k / n})^k$
Note: Can expand S as the stream continues (e.g. adding verified email addresses).
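A minimal Python sketch of a Bloom filter may make the algorithm and the false-positive estimate concrete. Several things here are assumptions made for illustration: the class name `BloomFilter`, deriving the k hash functions by salting SHA-256 with the hash index, storing one byte per bit for readability, and the helper `expected_fp_rate`, which evaluates $(1 - e^{-|S| k / n})^k$ with n = |B|.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.n = num_bits
        self.k = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity only

    def _positions(self, key):
        """Yield the k bit positions h_1(key), ..., h_k(key)."""
        for i in range(self.k):
            data = ("%d:%s" % (i, key)).encode("utf-8")
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, key):
        """Insert a key of S: set B[h_i(key)] = 1 for every hash."""
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        """True if all k bits are 1 (the key may be a false positive)."""
        return all(self.bits[pos] for pos in self._positions(key))


def expected_fp_rate(num_keys, num_bits, num_hashes):
    """Approximate false-positive probability (1 - e^(-|S| k / n))^k."""
    return (1.0 - math.exp(-num_keys * num_hashes / num_bits)) ** num_hashes


allowed = ["a@example.com", "b@example.com", "c@example.com"]
bf = BloomFilter(num_bits=1000, num_hashes=3)
for s in allowed:
    bf.add(s)

for x in ["a@example.com", "spam@example.org"]:
    print(x, "-> treat as in S" if x in bf else "-> treat as not in S")
print(round(expected_fp_rate(10**9, 8 * 10**9, 1), 4))  # 0.1175
```

A key that was added is always reported as present; only the "not in S" answer is exact, which is why the filter is suitable for letting known-good addresses through while tolerating a small fraction of false positives.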

Counting Moments
Moments:
● Suppose m_i is the count of distinct element i in the data
● The kth moment of the stream is $\sum_i (m_i)^k$
Examples:
● 0th moment: count of distinct elements
● 1st moment: length of stream
● 2nd moment: sum of squares (measures unevenness; related to variance)

Counting Moments: 0th moment
One Solution: Just keep a set (hashmap, dictionary, heap)
Problem: Can’t maintain that many in memory; disk storage is too slow
Streaming Solution: Flajolet-Martin Algorithm
  Pick a hash, h, to map each of n elements to at least log_2 n bits
  R = 0 #potential max number of zeros at tail
  for each stream element, e:
    r(e) = num of trailing 0s from h(e)
    R = r(e) if r(e) > R
  estimated_distinct_elements = 2^R
Problem: Unstable in practice.
Solution (using estimates from multiple hash functions):
1. Partition into groups
2. Take mean in group
3. Take median of means

Counting Moments: 1st moment
Streaming Solution: Simply keep a counter
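The Flajolet-Martin estimate, together with the group-mean / median-of-means stabilization, can be sketched in Python as below. This is a sketch under assumptions: the function names `trailing_zeros` and `fm_estimate`, the 32-bit hash width, the choice of 5 groups of 4 hash functions, and the use of index-salted SHA-256 as a stand-in for independent hash functions are all illustrative choices, not part of the slides.

```python
import hashlib
import statistics


def trailing_zeros(x, width=32):
    """Number of trailing 0 bits in x (defined as `width` when x == 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1  # index of the lowest set bit


def fm_estimate(stream, num_groups=5, group_size=4, width=32):
    """Estimate the number of distinct elements (the 0th moment)."""
    num_hashes = num_groups * group_size
    max_zeros = [0] * num_hashes  # R for each hash function
    for e in stream:
        for j in range(num_hashes):
            data = ("%d:%s" % (j, e)).encode("utf-8")
            h = int.from_bytes(hashlib.sha256(data).digest()[:4], "big")
            max_zeros[j] = max(max_zeros[j], trailing_zeros(h, width))
    estimates = [2 ** r for r in max_zeros]  # one 2^R per hash function
    group_means = [
        statistics.mean(estimates[g * group_size:(g + 1) * group_size])
        for g in range(num_groups)
    ]
    return statistics.median(group_means)


# 1000 distinct values, each repeated three times in the stream.
stream = [i % 1000 for i in range(3000)]
print(fm_estimate(stream))
```

Memory use is one small integer per hash function regardless of stream length, and repeated elements never change the result because each element always hashes to the same value. The estimate itself is coarse, so the printed number should be read as an order-of-magnitude figure.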
