
Stream Processing: Marco Serafini, COMPSCI 532, Lecture 5



  1. Stream Processing Marco Serafini COMPSCI 532 Lecture 5

  2. Stream vs. Batch Processing
     • Batch processing
       • Bounded input
       • Bounded one-shot computation
       • Bounded output
     • Stream processing
       • Unbounded input: data stream
       • Unbounded computation: always on
       • Unbounded output

  3. Advantages and Disadvantages
     • Advantages of stream processing
       • Many real-time datasets are data streams (e.g. IoT)
       • Near real-time results
       • No need to accumulate data before processing
       • Streaming operators typically require less memory
     • Disadvantages of stream processing
       • Need to deal with timing semantics
       • Some operators are harder to implement with streaming
         • Especially if we want operator state to be constant
         • E.g. finding the median
       • Stream algorithms are often approximations

  4. Streaming Computation
     • Dataflow graph of (possibly stateful) operators
     • Data streams connecting them
       • Tuples in the stream: <key, [timestamp,] value>
       • FIFO channels
     • Partitioning
       • Operators are parallelized into subtasks based on key
       • Streams are split into partitions
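
The partitioning idea on this slide can be sketched as follows. This is a minimal illustration, not how any particular system does it: the function names `partition_for` and `split_stream` are hypothetical, and CRC32 is just one possible stable hash.

```python
import zlib

def partition_for(key: str, num_subtasks: int) -> int:
    """Map a key to a subtask index using a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_subtasks

def split_stream(tuples, num_subtasks):
    """Split a stream of (key, timestamp, value) tuples into per-subtask partitions.

    Appending in arrival order keeps each partition FIFO.
    """
    partitions = [[] for _ in range(num_subtasks)]
    for key, timestamp, value in tuples:
        partitions[partition_for(key, num_subtasks)].append((key, timestamp, value))
    return partitions

stream = [("a", 1, 10), ("b", 2, 20), ("a", 3, 30), ("c", 4, 40)]
parts = split_stream(stream, 2)
```

All tuples with the same key land in the same partition, so a stateful subtask sees every update for the keys it owns.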

  5. Example: Streaming Inverted Index
     • What could the architecture look like?
     • What kind of streams would we have?
     • What kind of operators?

  6. Windowing
     • Windows create batches for bounded operations
       • Example: aggregation
     • Tuples can be replicated across multiple windows (when windows overlap)
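
To make the replication point concrete, here is a small sketch of sliding-window assignment with a per-window sum. The helper names and the integer timestamps are assumptions made for illustration; windows are half-open intervals [start, end).

```python
from collections import defaultdict

def sliding_windows(timestamp: int, size: int, slide: int):
    """Return every [start, end) window of length `size`, starting at multiples
    of `slide`, that contains `timestamp`."""
    start = timestamp - (timestamp % slide)  # latest window start <= timestamp
    windows = []
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

def windowed_sum(tuples, size: int, slide: int):
    """Aggregate (key, timestamp, value) tuples per key and per window."""
    sums = defaultdict(int)
    for key, timestamp, value in tuples:
        # With slide < size the tuple is replicated into several windows.
        for window in sliding_windows(timestamp, size, slide):
            sums[(key, window)] += value
    return dict(sums)

result = windowed_sum([("k", 7, 1), ("k", 12, 2)], size=10, slide=5)
```

With `slide == size` the windows are tumbling and each tuple belongs to exactly one window.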

  7. Event and Processing Time
     • Event time
       • Time when the event occurs
       • Associated with the event itself
       • Immutable
     • Processing time
       • Time when the event is processed
       • Depends on system implementation
       • Mutable
     • Q: Which one is easier to program with?

  8. Watermarks
     • Event-time processing
       • Requires reordering since event time ≠ processing time
       • Example: process all events generated in [from, to)
     • Low watermarks
       • How do I know that I have received all events up to event time T?
       • A watermark is a special message telling us that
       • Forwarded throughout the dataflow graph
       • If an operator has multiple input channels, it forwards the minimum (earliest) low watermark across its inputs
     • Punctuations
       • Similar to watermarks
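
The forwarding rule for operators with multiple inputs can be sketched like this. The class name `WatermarkTracker` and its interface are hypothetical; the only rule taken from the slide is "forward the minimum low watermark across inputs".

```python
class WatermarkTracker:
    """Tracks the latest low watermark per input channel of one operator."""

    def __init__(self, num_inputs: int):
        self.per_input = [float("-inf")] * num_inputs
        self.emitted = float("-inf")

    def on_watermark(self, input_index: int, watermark: float):
        """Record a watermark from one input.

        Returns the new watermark to forward downstream, or None if the
        minimum across all inputs has not advanced.
        """
        # Assumption: a watermark never moves backwards on a channel.
        self.per_input[input_index] = max(self.per_input[input_index], watermark)
        candidate = min(self.per_input)
        if candidate > self.emitted:
            self.emitted = candidate
            return candidate
        return None

tracker = WatermarkTracker(num_inputs=2)
outputs = [
    tracker.on_watermark(0, 10),  # input 1 has not reported yet
    tracker.on_watermark(1, 5),   # minimum is now 5
    tracker.on_watermark(1, 20),  # minimum is now 10
]
```

The slowest input determines the operator's progress, which is why a single stalled channel holds back every downstream window.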

  9. Triggers
     • Watermarks can be too fast (when?) or too slow (when?)
     • Triggers are used to process a window based on a processing-time signal
     • Problem: what to do with the window contents once the trigger fires?
       • Discard? Accumulate? Retract?
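
One way to read the three options is sketched below for a window that computes a sum and may be triggered more than once. The interpretation is an assumption in the spirit of the Dataflow model: "discard" emits only what arrived since the last firing, "accumulate" emits the running total, and "retract" additionally withdraws the previously emitted total. The class `WindowPane` is hypothetical.

```python
class WindowPane:
    """Sum aggregation for one window that can fire several times."""

    def __init__(self, mode: str):
        assert mode in ("discard", "accumulate", "retract")
        self.mode = mode
        self.buffer = 0            # sum held in window state
        self.last_emitted = None   # previous result, needed for retractions

    def add(self, value: int):
        self.buffer += value

    def trigger(self):
        """Fire the window and return the records sent downstream."""
        out = []
        if self.mode == "retract" and self.last_emitted is not None:
            out.append(("retract", self.last_emitted))
        out.append(("emit", self.buffer))
        self.last_emitted = self.buffer
        if self.mode == "discard":
            self.buffer = 0  # forget contents after each firing
        return out

def run(mode):
    pane = WindowPane(mode)
    pane.add(3)
    pane.add(4)
    first = pane.trigger()
    pane.add(5)  # late data arriving after the first firing
    second = pane.trigger()
    return first, second
```

Discarding keeps state small but leaves the consumer to combine partial results; retraction lets a downstream aggregate correct itself without double counting.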

  10. Spark Structured Streaming
      • Define a relational query on a stream
      • System incrementally updates output tables
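
The idea of an incrementally updated output table can be illustrated without Spark. The sketch below is plain Python and does not use or imitate Spark's API; it maintains the result of a hypothetical query `SELECT key, COUNT(*) ... GROUP BY key` as new rows arrive.

```python
from collections import Counter

class IncrementalCount:
    """Keeps the result table of a grouped count over an unbounded input."""

    def __init__(self):
        self.result = Counter()  # the output table: key -> count

    def on_batch(self, keys):
        """Apply newly arrived rows and return only the rows that changed."""
        changed = {}
        for key in keys:
            self.result[key] += 1
            changed[key] = self.result[key]
        return changed

query = IncrementalCount()
first_update = query.on_batch(["a", "b", "a"])
second_update = query.on_batch(["b"])
```

The query is written once over the whole (conceptually infinite) table; the system only has to apply the delta of each new batch.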

  11. System Implementation

  12. Stream Processing Systems
      • “Pure” streaming systems
        • Tuple-at-a-time semantics
        • Example: Apache Storm, Apache Flink
      • Micro-batching
        • Create small batches of inputs and then execute a batched computation
        • Example: Spark Streaming
      • System implementation ≠ programming semantics

  13. Control vs. Data Messages
      • Control messages are injected into the event stream
      • Checkpoint markers
        • Inserting them in the stream helps take a consistent snapshot
      • Watermarks for windowing
        • Inserting them in the stream allows triggering windows
      • Coordination barriers
        • Inserting them in the stream marks events as before or after the barrier

  14. Implementing Batch on Streaming
      • DataSet abstraction in Flink
      • Example
        • Q: How to implement map-reduce on Flink?
        • A: Control messages
          • Mappers send an “eof” marker to each reducer when done
          • Reducers do not process until they receive markers from all mappers
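
The end-of-input marker scheme can be sketched as a reducer that buffers until every mapper has signalled completion. This is an illustration of the idea only, not Flink code; `EOF`, `Reducer`, and the sum aggregation are hypothetical.

```python
EOF = object()  # hypothetical control message meaning "this mapper is done"

class Reducer:
    """Buffers (key, value) pairs and reduces only after all mappers sent EOF."""

    def __init__(self, num_mappers: int):
        self.pending_mappers = num_mappers
        self.groups = {}
        self.output = None  # stays None until the reduce step has run

    def on_message(self, message):
        if message is EOF:
            self.pending_mappers -= 1
            if self.pending_mappers == 0:
                # All input has arrived: run the bounded reduce step.
                self.output = {k: sum(v) for k, v in self.groups.items()}
            return
        key, value = message
        self.groups.setdefault(key, []).append(value)

reducer = Reducer(num_mappers=2)
for message in [("a", 1), EOF]:          # messages from mapper 0
    reducer.on_message(message)
output_after_first_eof = reducer.output
for message in [("a", 1), ("b", 1), EOF]:  # messages from mapper 1
    reducer.on_message(message)
```

The marker turns an unbounded channel into a bounded one, which is what a batch operator needs before it can produce its output.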

  15. Fault Tolerance
      • Streaming: stateful operators
        • Cannot rerun the whole stream from the beginning
      • Spark Streaming: lineage (Spark) + checkpoints
      • Flink: periodic checkpointing of stateful operators
        • Exports an API for the application to define the state to be checkpointed
      • How to checkpoint?

  16. Distributed Checkpoints
      • Uncoordinated checkpoints? Domino effect
      • Coordinated checkpoint
        • Consistent cut: no message is recorded as received but not sent
        • Distributed checkpointing protocol (Chandy-Lamport)

  17. Chandy-Lamport Protocol
      • Assumptions
        • Originator process starts it
        • FIFO channels
        • One checkpoint at a time
      • Goal: checkpoint state + all in-flight messages
      • Algorithm
        • Originator checkpoints its state and sends a checkpoint marker
        • Upon receiving a checkpoint marker for the first time
          • Checkpoint and send a checkpoint marker on each channel
          • Record subsequent messages on each channel until a checkpoint marker is received back on it
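
The algorithm can be traced on a toy example with two processes that transfer money over FIFO channels. This is a single-threaded sketch with hand-driven message delivery; the `Process` class, the balances, and the delivery order are all assumptions made for illustration.

```python
from collections import deque

MARKER = "MARKER"  # checkpoint marker; application messages are integers here

class Process:
    def __init__(self, name, balance, peers, channels):
        self.name = name
        self.balance = balance
        self.peers = peers          # processes linked by a channel in each direction
        self.channels = channels    # shared dict: (src, dst) -> FIFO deque
        self.checkpoint = None      # local state recorded for the snapshot
        self.in_flight = {}         # source -> messages recorded on that channel
        self.marker_seen = set()    # sources whose marker has arrived

    def send(self, dst, amount):
        self.balance -= amount
        self.channels[(self.name, dst)].append(amount)

    def take_checkpoint(self):
        """Record local state, start recording inputs, send markers."""
        self.checkpoint = self.balance
        self.in_flight = {src: [] for src in self.peers}
        for dst in self.peers:
            self.channels[(self.name, dst)].append(MARKER)

    def deliver_from(self, src):
        """Receive the next message on the channel from `src`."""
        message = self.channels[(src, self.name)].popleft()
        if message == MARKER:
            if self.checkpoint is None:
                self.take_checkpoint()
            self.marker_seen.add(src)  # stop recording this channel
            return
        if self.checkpoint is not None and src not in self.marker_seen:
            self.in_flight[src].append(message)  # sent before the sender's checkpoint
        self.balance += message

channels = {("P", "Q"): deque(), ("Q", "P"): deque()}
p = Process("P", 100, ["Q"], channels)
q = Process("Q", 50, ["P"], channels)

q.send("P", 10)       # Q now holds 40; 10 is in flight on Q -> P
p.take_checkpoint()   # originator P records 100 and sends a marker to Q
p.deliver_from("Q")   # the 10 arrives after P's checkpoint: recorded as in flight
q.deliver_from("P")   # Q sees the marker: records 40, sends its marker to P
p.deliver_from("Q")   # marker from Q closes the recording on channel Q -> P

snapshot_total = (p.checkpoint + q.checkpoint
                  + sum(p.in_flight["Q"]) + sum(q.in_flight["P"]))
```

The snapshot accounts for the full 150 units even though no single instant was frozen: 100 at P, 40 at Q, and 10 captured as channel state.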

  18. Load Balancing
      • How to balance load in a tuple-at-a-time system?
      • Redistribute keys
        • Move a key from one server to another
        • Requires migrating operator state
      • Replicate and aggregate
        • Multiple copies of the same operator compute partial results
        • A downstream operator aggregates them
        • Power of both choices
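
The "replicate and aggregate" approach with two candidate workers per key can be sketched as follows. The hash functions (CRC32 and Adler-32), the local load counters, and the function names are assumptions chosen for illustration; a real system would track load and route tuples in a distributed way.

```python
import zlib
from collections import Counter

def two_choices(key: str, num_workers: int):
    """Two candidate workers per key, from two different hash functions."""
    data = key.encode("utf-8")
    return zlib.crc32(data) % num_workers, zlib.adler32(data) % num_workers

def route(keys, num_workers: int):
    """Send each tuple to the less loaded of its two candidate workers."""
    load = [0] * num_workers
    partial = [Counter() for _ in range(num_workers)]  # partial counts per worker
    for key in keys:
        a, b = two_choices(key, num_workers)
        worker = a if load[a] <= load[b] else b
        load[worker] += 1
        partial[worker][key] += 1
    return load, partial

def aggregate(partial):
    """Downstream operator: merge the partial results into final counts."""
    total = Counter()
    for counts in partial:
        total.update(counts)
    return total

keys = ["hot"] * 100 + ["cold"] * 5
load, partial = route(keys, num_workers=4)
totals = aggregate(partial)
```

A hot key is split over at most two workers, so no operator state has to migrate, at the cost of one extra aggregation step downstream.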

  19. Lambda Architecture
      [Diagram: the incoming data stream is written to a persistent store (e.g. Kafka). Batch processing periodically recreates accurate results, while stream processing applies real-time updates with approximate results; both feed the result of the analysis (e.g. an index).]
      • Compromise between accuracy and freshness
      • Requires maintaining 2 platforms and 2 implementations

  20. Regulating the Data Flow
      • Backpressure
        • If the receiver operator cannot process inputs fast enough...
        • ...block or slow down senders (recursively if needed)
      • Intermediate buffer pools (queues)
        • Decouple communication from consuming messages
      • Conflicting requirements
        • Throughput: batch output messages, don’t send one by one
        • Latency: send messages asap
        • Tradeoff: send when either
          • Max batch size is reached (e.g. 1 kB), or
          • Timeout expires (e.g. 5 milliseconds)
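
The size-or-timeout tradeoff can be sketched as an output buffer that flushes on whichever condition is met first. The class `OutputBuffer` is hypothetical; the defaults mirror the slide's example values, and the current time is passed in explicitly to keep the sketch deterministic.

```python
class OutputBuffer:
    """Batches outgoing messages; flushes on max size or timeout."""

    def __init__(self, max_bytes: int = 1024, timeout_s: float = 0.005):
        self.max_bytes = max_bytes
        self.timeout_s = timeout_s
        self.pending = []
        self.pending_bytes = 0
        self.oldest = None  # arrival time of the oldest buffered message

    def add(self, message: bytes, now: float):
        """Buffer a message; returns a batch to send, or None."""
        if not self.pending:
            self.oldest = now
        self.pending.append(message)
        self.pending_bytes += len(message)
        return self.flush_if_due(now)

    def flush_if_due(self, now: float):
        """Called on every add and periodically by a timer (timer not shown)."""
        if not self.pending:
            return None
        too_big = self.pending_bytes >= self.max_bytes
        too_old = now - self.oldest >= self.timeout_s
        if too_big or too_old:
            batch, self.pending, self.pending_bytes = self.pending, [], 0
            return batch
        return None

buf = OutputBuffer()
r1 = buf.add(b"x" * 100, now=0.0)    # small and fresh: keep buffering
r2 = buf.flush_if_due(now=0.001)     # still within the timeout
r3 = buf.flush_if_due(now=0.006)     # timeout expired: flush for latency
r4 = buf.add(b"y" * 600, now=1.0)
r5 = buf.add(b"y" * 600, now=1.001)  # 1200 bytes >= 1 kB: flush for throughput
```

The timeout bounds the latency added by batching, while the size limit bounds memory use and keeps throughput high under load.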

  21. Exercise

  22. Exercise: Online Store
      • Two input streams
        • One has ad clicks: <user_ID, time, ad_ID>
        • The other has ad impressions: <ad_ID, time, user_ID>
      • Design a dataflow application that
        • Correlates ad impressions with user clicks (for billing)
        • Correlated = the click happens within 10 seconds after the ad impression
      • Questions
        • Which operators? How are they partitioned?
        • Watermarks?
        • Triggers?

  23. Possible Implementation
      • Streaming operator
        • Receives both streams
        • Partitioned by ad_ID
        • Join by ad_ID and return <ad_ID, user_ID, ad_time>
      • Windowing: session
        • When an ad appears, gather all clicks for 5 minutes
      • Watermarks
        • Both input streams emit a low watermark every second
        • Earliest low watermark triggers the window
      • Triggers: aggregate and retract
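
The join at the core of this operator can be sketched for the tuples gathered in one window. The function `correlate` is hypothetical, and matching on user_ID in addition to ad_ID is an assumption about what "correlated" means for billing; the 10-second bound comes from the exercise statement.

```python
def correlate(impressions, clicks, max_delay: int = 10):
    """Join impressions <ad_ID, time, user_ID> with clicks <user_ID, time, ad_ID>.

    Returns <ad_ID, user_ID, ad_time> for each impression followed by a click
    from the same user on the same ad within `max_delay` seconds.
    """
    impression_times = {}
    for ad_id, ad_time, user_id in impressions:
        impression_times.setdefault((ad_id, user_id), []).append(ad_time)

    matches = []
    for user_id, click_time, ad_id in clicks:
        for ad_time in impression_times.get((ad_id, user_id), []):
            if 0 <= click_time - ad_time <= max_delay:
                matches.append((ad_id, user_id, ad_time))
    return matches

impressions = [("ad1", 100, "u1"), ("ad2", 100, "u1")]
clicks = [
    ("u1", 105, "ad1"),  # 5 s after the impression: correlated
    ("u1", 120, "ad2"),  # 20 s after the impression: too late
    ("u2", 101, "ad1"),  # no impression for this user
]
billable = correlate(impressions, clicks)
```

In the streaming version this function would run per ad_ID partition when the watermark passes the end of the window, with later firings retracting and replacing earlier results.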
