Stream Processing. Marco Serafini. COMPSCI 532, Lecture 5.



SLIDE 1

Stream Processing

Marco Serafini

COMPSCI 532 Lecture 5

SLIDE 2

Stream vs. Batch Processing

  • Batch processing
    • Bounded input
    • Bounded one-shot computation
    • Bounded output
  • Stream processing
    • Unbounded input: data stream
    • Unbounded computation: always on
    • Unbounded output
SLIDE 3

Advantages and Disadvantages

  • Advantages of stream processing
    • Many real-time datasets are data streams (e.g. IoT)
    • Near real-time results
    • No need to accumulate data before processing
    • Streaming operators typically require less memory
  • Disadvantages of stream processing
    • Need to deal with timing semantics
    • Some operators are harder to implement in a streaming fashion
      • Especially if we want operator state to stay constant in size
      • E.g. finding the median
    • Stream algorithms are often approximations
SLIDE 4

Streaming Computation

  • Dataflow graph of (possibly stateful) operators
    • Data streams connecting them
    • Tuples in the stream: <key, [timestamp,] value>
    • FIFO channels
  • Partitioning
    • Operators are parallelized into subtasks based on key
    • Streams are split into partitions
SLIDE 5

Example: Streaming Inverted Index

  • What could the architecture look like?
  • What kind of streams would we have?
  • What kind of operators?
SLIDE 6

Windowing

  • Windows create batches so that bounded operations can be applied
  • Example: aggregation
  • A tuple may be replicated across multiple windows
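As a sketch of how one tuple lands in several windows: with sliding windows that advance by less than their size, a tuple belongs to every window whose interval covers its timestamp. The function below is a hypothetical illustration, not taken from any particular system:

```python
def window_ids(ts, size, slide):
    """Return the start times of all sliding windows [start, start + size)
    that contain the timestamp ts."""
    first = (ts // slide) * slide      # last window boundary at or before ts
    starts = []
    start = first
    while start > ts - size:           # walk back while the window still covers ts
        starts.append(start)
        start -= slide
    return sorted(starts)

# A tuple at t=7 with 10 s windows sliding every 5 s lands in two windows:
print(window_ids(7, size=10, slide=5))   # [0, 5] -> windows [0,10) and [5,15)
```

With tumbling windows (slide equal to size) the loop runs once and each tuple is assigned to exactly one window; the replication on the slide arises only when windows overlap.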
SLIDE 7

Event and Processing Time

  • Event time
    • Time when the event occurs
    • Associated with the event itself
    • Immutable
  • Processing time
    • Time when the event is processed
    • Depends on the system implementation
    • Mutable
  • Q: Which one is easier to program with?
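A minimal illustration of the distinction, using a hypothetical tuple layout: the event time travels inside the tuple and never changes, while the processing time is whatever clock value the system observes when it happens to handle the tuple:

```python
import time

# Hypothetical tuple: the event time is part of the data itself.
event = {"key": "sensor-1", "event_time": 1_700_000_000.0, "value": 21.5}

def process(tup):
    processing_time = time.time()   # depends on when/where this runs: mutable
    return tup["event_time"], processing_time

et1, pt1 = process(event)
time.sleep(0.01)
et2, pt2 = process(event)   # replaying the same event later...
assert et1 == et2           # ...its event time is unchanged (immutable)
assert pt2 > pt1            # ...but its processing time differs
```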

SLIDE 8

Watermarks

  • Event-time processing
    • Requires reordering, since event time ≠ processing time
    • Example: process all events generated in [from, to)
  • Low watermarks
    • How do we know that we have received all events up to event time T?
    • A watermark is a special message telling us exactly that
    • Forwarded throughout the dataflow graph
    • If an operator has multiple input channels, it forwards the minimum (earliest) low watermark across its inputs
  • Punctuations
    • Similar to watermarks
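The minimum-across-inputs rule can be sketched as a toy operator (not any real system's API): the operator only advances its own low watermark to the earliest watermark seen so far across all of its input channels:

```python
class Operator:
    """Toy operator tracking the low watermark across multiple input channels."""
    def __init__(self, num_inputs):
        self.input_wm = [float("-inf")] * num_inputs

    def on_watermark(self, channel, wm):
        # Watermarks never move backwards on a channel.
        self.input_wm[channel] = max(self.input_wm[channel], wm)
        # Forward the minimum (earliest) watermark across all inputs.
        return min(self.input_wm)

op = Operator(2)
op.on_watermark(0, 10)            # channel 1 still at -inf: nothing is safe yet
forwarded = op.on_watermark(1, 7)
print(forwarded)                  # 7: all events with event time < 7 have arrived
```

Taking the minimum is what makes the guarantee safe: an event with time 8 could still be in flight on channel 1, so the operator cannot claim completeness past 7 even though channel 0 is already at 10.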
SLIDE 9

Triggers

  • Watermarks can be too fast (when?) or too slow (when?)
  • Triggers are used to process a window based on a processing-time signal
  • Problem: what to do with the window's contents once it triggers?
    • Discard? Accumulate? Retract?
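One way to picture the three options is a toy sum over a window that fires more than once; the mode names follow the slide, and the code is only an illustrative sketch of the semantics:

```python
window = [3, 4]        # tuples accumulated in the window so far
emitted = []           # everything sent downstream

def fire(mode, window, emitted):
    result = sum(window)
    if mode == "discard":
        emitted.append(result)   # emit the partial result, then forget the state
        window.clear()
    elif mode == "accumulate":
        emitted.append(result)   # emit a running result, keep state for refinement
    elif mode == "retract":
        if emitted:
            emitted.append(-emitted[-1])  # first retract the previous result...
        emitted.append(result)            # ...then emit the refined one

fire("accumulate", window, emitted)   # early firing: emitted == [7]
window.append(5)                      # a late tuple arrives after the firing
fire("retract", window, emitted)      # refined firing: emitted == [7, -7, 12]
print(emitted)
```

Retraction matters when a downstream consumer has already acted on the early result: it needs the compensating record, not just a newer value.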
SLIDE 10

Spark Structured Streaming

  • Define a relational query on a stream
  • The system incrementally updates the output tables
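The incremental-update model can be sketched in plain Python (this is a toy grouped count standing in for a relational query, not Spark's actual API): each micro-batch of new rows updates the output table in place instead of recomputing over the whole stream:

```python
from collections import Counter

# The "output table" of a hypothetical streaming query: SELECT word, COUNT(*)
output_table = Counter()

def process_batch(rows):
    # Incremental maintenance: only the new rows are folded into the table.
    output_table.update(row["word"] for row in rows)

process_batch([{"word": "a"}, {"word": "b"}, {"word": "a"}])
process_batch([{"word": "b"}])
print(dict(output_table))   # {'a': 2, 'b': 2}
```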
SLIDE 11

System Implementation

SLIDE 12

Stream Processing Systems

  • “Pure” streaming systems
    • Tuple-at-a-time semantics
    • Examples: Apache Storm, Apache Flink
  • Micro-batching
    • Create small batches of inputs and then execute a batched computation
    • Example: Spark Streaming
  • System implementation ≠ programming semantics
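The two execution styles can be contrasted in a few lines (hypothetical driver loops, not the real schedulers of these systems):

```python
# Tuple-at-a-time: each input is processed as soon as it arrives.
def tuple_at_a_time(source, op):
    for tup in source:
        yield op(tup)

# Micro-batching: buffer inputs into small batches, then run a batched computation.
def micro_batching(source, op, batch_size):
    batch = []
    for tup in source:
        batch.append(tup)
        if len(batch) == batch_size:
            yield [op(t) for t in batch]
            batch = []
    if batch:                      # flush the final partial batch
        yield [op(t) for t in batch]

double = lambda x: 2 * x
print(list(tuple_at_a_time([1, 2, 3], double)))    # [2, 4, 6]
print(list(micro_batching([1, 2, 3], double, 2)))  # [[2, 4], [6]]
```

Both loops compute the same results, which is the slide's point: the implementation strategy (per-tuple vs. batched) is a separate choice from the semantics the programmer sees.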
SLIDE 13

Control vs. Data Messages

  • Control messages are injected into the event stream
  • Checkpoint markers
    • Inserting them in the stream helps take a consistent snapshot
  • Watermarks for windowing
    • Inserting them in the stream allows triggering windows
  • Coordination barriers
    • Inserting them in the stream allows marking events as before or after the barrier

SLIDE 14

Implementing Batch on Streaming

  • DataSet abstraction in Flink
  • Example
    • Q: How to implement map-reduce on Flink?
    • A: Control messages
      • Mappers send an “eof” marker to each reducer when done
      • Reducers do not start processing until they have received markers from all mappers
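A sketch of the reducer-side logic, with a toy sum standing in for the reduce function (the marker handling is the point, not the API):

```python
class Reducer:
    """Toy reducer that buffers its input until it has seen an "eof" control
    marker from every upstream mapper, then runs the batch reduction."""
    def __init__(self, num_mappers):
        self.num_mappers = num_mappers
        self.eofs = set()
        self.buffer = []

    def on_message(self, mapper_id, msg):
        if msg == "eof":
            self.eofs.add(mapper_id)
            if len(self.eofs) == self.num_mappers:
                return sum(self.buffer)   # all inputs are in: reduce now
        else:
            self.buffer.append(msg)
        return None                       # not done yet

r = Reducer(num_mappers=2)
assert r.on_message(0, 5) is None
assert r.on_message(0, "eof") is None     # mapper 1 has not finished yet
assert r.on_message(1, 7) is None
assert r.on_message(1, "eof") == 12       # last marker arrives: reduce fires
```

This shows why FIFO channels matter here too: the "eof" marker is only a valid end-of-input signal if no data message from that mapper can overtake it.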
SLIDE 15

Fault Tolerance

  • Streaming: stateful operators
    • Cannot rerun the whole stream from the beginning
  • Spark Streaming: lineage (Spark) + checkpoints
  • Flink: periodic checkpointing of stateful operators
    • Exports an API for the application to define the state to be checkpointed
  • How to checkpoint?
SLIDE 16

Distributed Checkpoints

  • Uncoordinated checkpoints? Domino effect
  • Coordinated checkpoint
    • Consistent cut: no message received but not sent
    • Distributed checkpointing protocol (Chandy-Lamport)
SLIDE 17

Chandy-Lamport Protocol

  • Assumptions
    • An originator process starts the protocol
    • FIFO channels
    • One checkpoint at a time
  • Goal: checkpoint state + all in-flight messages
  • Algorithm
    • The originator checkpoints its state and sends a checkpoint marker on each outgoing channel
    • Upon receiving a checkpoint marker for the first time: checkpoint local state and send a checkpoint marker on each outgoing channel
    • Record subsequent messages on each incoming channel until a checkpoint marker is received on it
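A simplified single-process view of these rules (marker transport between processes is omitted, and a toy integer accumulator stands in for real operator state):

```python
MARKER = "CHECKPOINT_MARKER"

class Process:
    """One process's side of a simplified Chandy-Lamport snapshot.
    Assumes FIFO channels and a single checkpoint in flight; sending
    MARKER on outgoing channels is left abstract."""
    def __init__(self, in_channels):
        self.in_channels = list(in_channels)
        self.state = 0          # toy local state: a running sum
        self.snapshot = None    # checkpointed local state
        self.recorded = {}      # in-flight messages recorded per input channel
        self.open = set()       # channels still being recorded

    def _checkpoint(self, exclude=()):
        self.snapshot = self.state
        # Record every input channel except those whose marker already arrived.
        self.open = set(self.in_channels) - set(exclude)
        self.recorded = {ch: [] for ch in self.open}
        # ...and send MARKER on every outgoing channel (transport omitted)

    def receive(self, channel, msg):
        if msg == MARKER:
            if self.snapshot is None:              # first marker seen
                self._checkpoint(exclude=[channel])
            else:
                self.open.discard(channel)         # stop recording this channel
        else:
            self.state += msg                      # normal processing
            if channel in self.open:               # in flight w.r.t. the snapshot
                self.recorded[channel].append(msg)

p = Process(in_channels=["a", "b"])
p.receive("a", 5)          # processed before the checkpoint: lands in p.state
p.receive("a", MARKER)     # first marker: checkpoint state, record channel "b"
p.receive("b", 3)          # in flight on "b": processed AND recorded
p.receive("b", MARKER)     # marker on "b": its recording is complete
print(p.snapshot, p.recorded)   # 5 {'b': [3]}
```

The snapshot pairs the checkpointed state (5) with the messages that crossed the cut (3 on channel "b"), which together are exactly what recovery needs to replay.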
SLIDE 18

Load Balancing

  • How to balance load in a tuple-at-a-time system?
  • Redistribute keys
    • Move a key from one server to another
    • Requires migrating operator state
  • Replicate and aggregate
    • Multiple copies of the same operator compute partial results
    • A downstream operator aggregates them
  • Power of both choices
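The replicate-and-aggregate idea in miniature (a toy counting operator; the routing hash is an arbitrary choice for this sketch): a hot key is spread over two operator copies, and a downstream step merges the partial counts:

```python
from collections import Counter

copy_a, copy_b = Counter(), Counter()   # two copies of the counting operator

def route(tup):
    # Hash on the whole tuple (key AND payload), not the key alone,
    # so tuples of one hot key spread across both copies.
    target = copy_a if hash(tup) % 2 == 0 else copy_b
    target[tup[0]] += 1

for i in range(1000):
    route(("hot_key", i))

aggregated = copy_a + copy_b            # downstream aggregation of partials
print(aggregated["hot_key"])            # 1000: the total survives the split
```

Plain key-based partitioning would pin all 1000 tuples on one copy; paying one extra aggregation operator buys the ability to split a skewed key's load.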
SLIDE 19

Lambda Architecture

  • Compromise between accuracy and freshness
  • Requires maintaining 2 platforms and 2 implementations

[Diagram: the incoming data stream is kept in a persistent store (e.g. Kafka) and consumed twice. A stream-processing path updates approximate results in real time, while a batch-processing path periodically recreates accurate results of the analysis (e.g. an index).]

SLIDE 20

Regulating the Data Flow

  • Backpressure
    • If a receiver operator cannot process inputs fast enough…
    • …block or slow down the senders (recursively if needed)
  • Intermediate buffer pools (queues)
    • Decouple communication from consuming messages
  • Conflicting requirements
    • Throughput: batch output messages, don’t send them one by one
    • Latency: send messages as soon as possible
  • Tradeoff: send when either
    • the maximum batch size is reached (e.g. 1 kB), or
    • a timeout expires (e.g. 5 milliseconds)
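The size-or-timeout tradeoff can be sketched as a small sender wrapper (hypothetical; a real implementation would also need a background timer so the timeout can fire without waiting for the next submit):

```python
import time

class BatchingSender:
    """Buffers outgoing messages and flushes when the batch reaches max_bytes,
    or when max_delay seconds have passed since the first buffered message."""
    def __init__(self, send, max_bytes=1024, max_delay=0.005):
        self.send, self.max_bytes, self.max_delay = send, max_bytes, max_delay
        self.buf, self.size, self.first_at = [], 0, None

    def submit(self, msg: bytes, now=None):
        now = time.monotonic() if now is None else now  # injectable for testing
        if not self.buf:
            self.first_at = now
        self.buf.append(msg)
        self.size += len(msg)
        if self.size >= self.max_bytes or now - self.first_at >= self.max_delay:
            self.flush()

    def flush(self):
        if self.buf:
            self.send(self.buf)
            self.buf, self.size, self.first_at = [], 0, None

sent = []
s = BatchingSender(sent.append, max_bytes=10, max_delay=1.0)
s.submit(b"aaaa", now=0.0)   # 4 bytes buffered
s.submit(b"bbbb", now=0.1)   # 8 bytes: still under both limits
s.submit(b"cc", now=0.2)     # 10 bytes: size limit reached, flush
print(sent)                  # [[b'aaaa', b'bbbb', b'cc']]
```

The two knobs bound the two costs: `max_bytes` caps the per-message overhead (throughput), and `max_delay` caps how long any message can sit in the buffer (latency).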
SLIDE 21

Exercise

SLIDE 22

Exercise: Online Store

  • Two input streams
    • One has ad clicks: <user_ID, time, ad_ID>
    • The other has ad impressions: <ad_ID, time, user_ID>
  • Design a dataflow application that
    • Correlates ad impressions with user clicks (for billing)
    • Correlated = the click happens within 10 seconds after the ad impression
  • Questions
    • Which operators? How are they partitioned?
    • Watermarks?
    • Triggers?
SLIDE 23

Possible Implementation

  • Streaming operator
    • Receives both streams
    • Partitioned by ad_ID
    • Joins by ad_ID and returns <ad_ID, user_ID, ad_time>
  • Windowing: session
    • When an ad appears, gather all clicks for 5 minutes
  • Watermarks
    • Both input streams emit a low watermark every second
    • The earliest low watermark triggers the window
  • Triggers: aggregate and retract
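A hedged sketch of the join operator's core logic: it keeps per-ad impression state and matches clicks that arrive within the 10-second window (watermark-driven garbage collection of old impressions is omitted). The tuple layouts follow the exercise; all function names are made up for the sketch:

```python
# State of one partition of the join operator: ad_ID -> [(time, user_ID), ...]
impressions = {}

def on_impression(ad_id, t, user_id):
    """Handle an impression tuple <ad_ID, time, user_ID>."""
    impressions.setdefault(ad_id, []).append((t, user_id))

def on_click(user_id, t, ad_id):
    """Handle a click tuple <user_ID, time, ad_ID>.
    Return billable (ad_ID, user_ID, ad_time) matches for this click."""
    matches = []
    for ad_time, imp_user in impressions.get(ad_id, []):
        if imp_user == user_id and 0 <= t - ad_time <= 10:
            matches.append((ad_id, user_id, ad_time))
    return matches

on_impression("ad1", t=100, user_id="u1")
print(on_click("u1", t=105, ad_id="ad1"))   # [('ad1', 'u1', 100)]
print(on_click("u1", t=120, ad_id="ad1"))   # []  (outside the 10 s window)
```

Partitioning by ad_ID means both tuples of any potential match land on the same subtask, so this state stays local; the watermarks from the slide would bound how long entries must be retained before they can be dropped.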