Crash Course on Data Stream Algorithms
Part I: Basic Definitions and Numerical Streams
Andrew McGregor
University of Massachusetts Amherst

Goals of the Crash Course
◮ Goal: Give a flavor for the theoretical results and techniques from the 100s of papers on the design and analysis of stream algorithms.
  “When we abstract away the application-specific details, what are the basic algorithmic ideas and challenges in stream processing? What is and isn’t possible?”
◮ Disclaimer: Talks will be theoretical/mathematical but shouldn’t require much in the way of prerequisites.
◮ Request:
  ◮ If you get bored, ask questions...
  ◮ If you get lost, ask questions...
  ◮ If you’d like to ask questions, ask questions...

Outline
◮ Basic Definitions
◮ Sampling
◮ Sketching
◮ Counting Distinct Items
◮ Summary of Some Other Results

Data Stream Model
◮ Stream: m elements from a universe of size n, e.g., ⟨x_1, x_2, ..., x_m⟩ = 3, 5, 3, 7, 5, 4, ...
◮ Goal: Compute a function of the stream, e.g., median, number of distinct elements, longest increasing sequence.
◮ Catch:
  1. Limited working memory, sublinear in n and m
  2. Access data sequentially
  3. Process each element quickly
◮ Origins in the 1970s, but the model has become popular in the last ten years because of a growing body of theory and its wide applicability.

Why’s it become popular?
◮ Practical Appeal:
  ◮ Faster networks, cheaper data storage, and ubiquitous data-logging result in massive amounts of data to be processed.
  ◮ Applications to network monitoring, query planning, I/O efficiency for massive data, sensor network aggregation...
◮ Theoretical Appeal:
  ◮ Problems are easy to state but hard to solve.
  ◮ Links to communication complexity, compressed sensing, embeddings, pseudo-random generators, approximation...

Sampling and Statistics
◮ Sampling is a general technique for tackling massive amounts of data
◮ Example: To compute the median packet size of some IP packets, we could just sample some and use the median of the sample as an estimate for the true median. Statistical arguments relate the size of the sample to the accuracy of the estimate.
◮ Challenge: But how do you take a sample from a stream of unknown length or from a “sliding window”?

Reservoir Sampling
◮ Problem: Find a uniform sample s from a stream of unknown length
◮ Algorithm:
  ◮ Initially s = x_1
  ◮ On seeing the t-th element, s ← x_t with probability 1/t
◮ Analysis:
  ◮ What’s the probability that s = x_i at some time t ≥ i? The element x_i must be selected at step i and then not be replaced at steps i+1, ..., t; the product telescopes:
    $P[s = x_i] = \frac{1}{i} \times \left(1 - \frac{1}{i+1}\right) \times \cdots \times \left(1 - \frac{1}{t}\right) = \frac{1}{t}$
◮ To get k samples we use O(k log n) bits of space.
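The reservoir sampling rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `reservoir_sample` and the optional `rng` parameter are assumptions introduced here so the example can be seeded and tested.

```python
import random
from typing import Iterable, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T],
                     rng: Optional[random.Random] = None) -> Optional[T]:
    """Return one uniformly random element of `stream`, read in a single pass.

    Hypothetical helper for illustration; returns None for an empty stream.
    """
    rng = rng or random.Random()
    sample: Optional[T] = None
    for t, x in enumerate(stream, start=1):
        # Replace the current sample with probability 1/t.
        # At t = 1 the test always succeeds, so s starts out as x_1.
        if rng.random() < 1.0 / t:
            sample = x
    return sample
```

Only the current sample and the counter t are kept, so the stream length never needs to be known in advance. One simple way to obtain k samples (with replacement) is to run k independent copies of this procedure side by side.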

Priority Sampling for Sliding Windows
◮ Problem: Maintain a uniform sample from the last w items
◮ Algorithm:
  1. For each x_i we pick a random value v_i ∈ (0, 1)
  2. In a window ⟨x_{j−w+1}, ..., x_j⟩ return the value x_i with smallest v_i
  3. To do this, maintain the set of all elements in the sliding window whose v value is minimal among subsequent values
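The three steps above can be sketched in Python as follows. This is one possible realization under stated assumptions, not the talk's own code: the class name `SlidingWindowSampler`, its methods, and the use of a double-ended queue to hold the candidate set from step 3 are all choices made here for illustration.

```python
import random
from collections import deque
from typing import Deque, Generic, Optional, Tuple, TypeVar

T = TypeVar("T")


class SlidingWindowSampler(Generic[T]):
    """Hypothetical sketch: uniform sample from the last w stream items."""

    def __init__(self, w: int, rng: Optional[random.Random] = None) -> None:
        if w <= 0:
            raise ValueError("window size w must be positive")
        self.w = w
        self.rng = rng or random.Random()
        self.t = 0  # number of items seen so far
        # Entries are (index, priority, item); priorities strictly increase
        # from front to back, so the front holds the window minimum.
        self.candidates: Deque[Tuple[int, float, T]] = deque()

    def add(self, x: T) -> None:
        self.t += 1
        v = self.rng.random()  # step 1: random priority for the new item
        # Step 3: an older item whose priority is not smaller than v can
        # never again be the minimum, because the new item outlives it.
        while self.candidates and self.candidates[-1][1] >= v:
            self.candidates.pop()
        self.candidates.append((self.t, v, x))
        # Drop candidates that have slid out of the last w positions.
        while self.candidates[0][0] <= self.t - self.w:
            self.candidates.popleft()

    def sample(self) -> Optional[T]:
        # Step 2: the item with the smallest priority in the current window.
        return self.candidates[0][2] if self.candidates else None
```

Each stored candidate has a smaller priority than every later item in the window, which matches the set described in step 3. For independent uniform priorities the expected number of such candidates is the harmonic number H_w, so the expected storage is O(log w) items.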
