Stream Processing. Marco Serafini. COMPSCI 532, Lecture 5.



SLIDE 1

Stream Processing

Marco Serafini

COMPSCI 532 Lecture 5

SLIDE 2

Stream vs. Batch Processing

  • Batch processing
    • Bounded input
    • Bounded one-shot computation
    • Bounded output
  • Stream processing
    • Unbounded input: data stream
    • Unbounded computation: always on
    • Unbounded output
SLIDE 3

Advantages and Disadvantages

  • Advantages of stream processing
    • Many real-time datasets are data streams (e.g. IoT)
    • Near real-time results
    • No need to accumulate data before processing
    • Streaming operators typically require less memory
  • Disadvantages of stream processing
    • Need to deal with timing semantics
    • Some operators are harder to implement in a streaming fashion
      • Especially if we want operator state to stay constant in size
      • E.g. finding the median
    • Stream algorithms are often approximations
SLIDE 4

Streaming Computation

  • Dataflow graph of (possibly stateful) operators
    • Data streams connecting them
    • Tuples in the stream: <key, [timestamp,] value>
    • FIFO channels
  • Partitioning
    • Operators are parallelized into subtasks based on key
    • Streams are split into partitions
SLIDE 5

Example: Streaming Inverted Index

  • What could the architecture look like?
  • What kind of streams would we have?
  • What kind of operators?
SLIDE 6

Windowing

  • Windows create batches so that bounded operations can be applied
  • Example: aggregation
  • A tuple may be replicated across multiple windows
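As a sketch of how one tuple lands in several windows: with sliding windows that advance by less than their size, a tuple belongs to every window whose interval covers its timestamp. The function below is a hypothetical illustration, not taken from any particular system:

```python
def window_ids(ts, size, slide):
    """Return the start times of all sliding windows [start, start + size)
    that contain the timestamp ts."""
    first = (ts // slide) * slide      # last window boundary at or before ts
    starts = []
    start = first
    while start > ts - size:           # walk back while the window still covers ts
        starts.append(start)
        start -= slide
    return sorted(starts)

# A tuple at t=7 with 10 s windows sliding every 5 s lands in two windows:
print(window_ids(7, size=10, slide=5))   # [0, 5] -> windows [0,10) and [5,15)
```

With tumbling windows (slide equal to size) the loop runs once and each tuple is assigned to exactly one window; the replication on the slide arises only when windows overlap.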
SLIDE 7

Event and Processing Time

  • Event time
    • Time when the event occurs
    • Associated with the event itself
    • Immutable
  • Processing time
    • Time when the event is processed
    • Depends on the system implementation
    • Mutable
  • Q: Which one is easier to program with?
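A minimal illustration of the distinction, using a hypothetical tuple layout: the event time travels inside the tuple and never changes, while the processing time is whatever clock value the system observes when it happens to handle the tuple:

```python
import time

# Hypothetical tuple: the event time is part of the data itself.
event = {"key": "sensor-1", "event_time": 1_700_000_000.0, "value": 21.5}

def process(tup):
    processing_time = time.time()   # depends on when/where this runs: mutable
    return tup["event_time"], processing_time

et1, pt1 = process(event)
time.sleep(0.01)
et2, pt2 = process(event)   # replaying the same event later...
assert et1 == et2           # ...its event time is unchanged (immutable)
assert pt2 > pt1            # ...but its processing time differs
```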

SLIDE 8

Watermarks

  • Event-time processing
    • Requires reordering, since event time ≠ processing time
    • Example: process all events generated in [from, to)
  • Low watermarks
    • How do we know that we have received all events up to event time T?
    • A watermark is a special message telling us exactly that
    • Forwarded throughout the dataflow graph
    • If an operator has multiple input channels, it forwards the minimum (earliest) low watermark across its inputs
  • Punctuations
    • Similar to watermarks
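The minimum-across-inputs rule can be sketched as a toy operator (not any real system's API): the operator only advances its own low watermark to the earliest watermark seen so far across all of its input channels:

```python
class Operator:
    """Toy operator tracking the low watermark across multiple input channels."""
    def __init__(self, num_inputs):
        self.input_wm = [float("-inf")] * num_inputs

    def on_watermark(self, channel, wm):
        # Watermarks never move backwards on a channel.
        self.input_wm[channel] = max(self.input_wm[channel], wm)
        # Forward the minimum (earliest) watermark across all inputs.
        return min(self.input_wm)

op = Operator(2)
op.on_watermark(0, 10)            # channel 1 still at -inf: nothing is safe yet
forwarded = op.on_watermark(1, 7)
print(forwarded)                  # 7: all events with event time < 7 have arrived
```

Taking the minimum is what makes the guarantee safe: an event with time 8 could still be in flight on channel 1, so the operator cannot claim completeness past 7 even though channel 0 is already at 10.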
SLIDE 9

Triggers

  • Watermarks can be too fast (when?) or too slow (when?)
  • Triggers are used to process a window based on a processing-time signal
  • Problem: what to do with the window's contents once it triggers?
    • Discard? Accumulate? Retract?
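One way to picture the three options is a toy sum over a window that fires more than once; the mode names follow the slide, and the code is only an illustrative sketch of the semantics:

```python
window = [3, 4]        # tuples accumulated in the window so far
emitted = []           # everything sent downstream

def fire(mode, window, emitted):
    result = sum(window)
    if mode == "discard":
        emitted.append(result)   # emit the partial result, then forget the state
        window.clear()
    elif mode == "accumulate":
        emitted.append(result)   # emit a running result, keep state for refinement
    elif mode == "retract":
        if emitted:
            emitted.append(-emitted[-1])  # first retract the previous result...
        emitted.append(result)            # ...then emit the refined one

fire("accumulate", window, emitted)   # early firing: emitted == [7]
window.append(5)                      # a late tuple arrives after the firing
fire("retract", window, emitted)      # refined firing: emitted == [7, -7, 12]
print(emitted)
```

Retraction matters when a downstream consumer has already acted on the early result: it needs the compensating record, not just a newer value.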
SLIDE 10

Spark Structured Streaming

  • Define a relational query on a stream
  • The system incrementally updates the output tables
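The incremental-update model can be sketched in plain Python (this is a toy grouped count standing in for a relational query, not Spark's actual API): each micro-batch of new rows updates the output table in place instead of recomputing over the whole stream:

```python
from collections import Counter

# The "output table" of a hypothetical streaming query: SELECT word, COUNT(*)
output_table = Counter()

def process_batch(rows):
    # Incremental maintenance: only the new rows are folded into the table.
    output_table.update(row["word"] for row in rows)

process_batch([{"word": "a"}, {"word": "b"}, {"word": "a"}])
process_batch([{"word": "b"}])
print(dict(output_table))   # {'a': 2, 'b': 2}
```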
SLIDE 11

System Implementation

SLIDE 12

Stream Processing Systems

  • “Pure” streaming systems
    • Tuple-at-a-time semantics
    • Examples: Apache Storm, Apache Flink
  • Micro-batching
    • Create small batches of inputs and then execute a batched computation
    • Example: Spark Streaming
  • System implementation ≠ programming semantics
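The two execution styles can be contrasted in a few lines (hypothetical driver loops, not the real schedulers of these systems):

```python
# Tuple-at-a-time: each input is processed as soon as it arrives.
def tuple_at_a_time(source, op):
    for tup in source:
        yield op(tup)

# Micro-batching: buffer inputs into small batches, then run a batched computation.
def micro_batching(source, op, batch_size):
    batch = []
    for tup in source:
        batch.append(tup)
        if len(batch) == batch_size:
            yield [op(t) for t in batch]
            batch = []
    if batch:                      # flush the final partial batch
        yield [op(t) for t in batch]

double = lambda x: 2 * x
print(list(tuple_at_a_time([1, 2, 3], double)))    # [2, 4, 6]
print(list(micro_batching([1, 2, 3], double, 2)))  # [[2, 4], [6]]
```

Both loops compute the same results, which is the slide's point: the implementation strategy (per-tuple vs. batched) is a separate choice from the semantics the programmer sees.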
SLIDE 13

Control vs. Data Messages

  • Control messages are injected into the event stream
  • Checkpoint markers
    • Inserting them in the stream helps take a consistent snapshot
  • Watermarks for windowing
    • Inserting them in the stream allows triggering windows
  • Coordination barriers
    • Inserting them in the stream allows marking events as before or after the barrier

SLIDE 14

Implementing Batch on Streaming

  • DataSet abstraction in Flink
  • Example
    • Q: How to implement map-reduce on Flink?
    • A: Control messages
      • Mappers send an “eof” marker to each reducer when done
      • Reducers do not start processing until they have received markers from all mappers
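A sketch of the reducer-side logic, with a toy sum standing in for the reduce function (the marker handling is the point, not the API):

```python
class Reducer:
    """Toy reducer that buffers its input until it has seen an "eof" control
    marker from every upstream mapper, then runs the batch reduction."""
    def __init__(self, num_mappers):
        self.num_mappers = num_mappers
        self.eofs = set()
        self.buffer = []

    def on_message(self, mapper_id, msg):
        if msg == "eof":
            self.eofs.add(mapper_id)
            if len(self.eofs) == self.num_mappers:
                return sum(self.buffer)   # all inputs are in: reduce now
        else:
            self.buffer.append(msg)
        return None                       # not done yet

r = Reducer(num_mappers=2)
assert r.on_message(0, 5) is None
assert r.on_message(0, "eof") is None     # mapper 1 has not finished yet
assert r.on_message(1, 7) is None
assert r.on_message(1, "eof") == 12       # last marker arrives: reduce fires
```

This shows why FIFO channels matter here too: the "eof" marker is only a valid end-of-input signal if no data message from that mapper can overtake it.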
SLIDE 15

Fault Tolerance

  • Streaming: stateful operators
    • Cannot rerun the whole stream from the beginning
  • Spark Streaming: lineage (Spark) + checkpoints
  • Flink: periodic checkpointing of stateful operators
    • Exports an API for the application to define the state to be checkpointed
  • How to checkpoint?
SLIDE 16

Distributed Checkpoints

  • Uncoordinated checkpoints? Domino effect
  • Coordinated checkpoint
    • Consistent cut: no message received but not sent
    • Distributed checkpointing protocol (Chandy-Lamport)
SLIDE 17

Chandy-Lamport Protocol

  • Assumptions
    • An originator process starts the protocol
    • FIFO channels
    • One checkpoint at a time
  • Goal: checkpoint state + all in-flight messages
  • Algorithm
    • The originator checkpoints its state and sends a checkpoint marker on each outgoing channel
    • Upon receiving a checkpoint marker for the first time: checkpoint local state and send a checkpoint marker on each outgoing channel
    • Record subsequent messages on each incoming channel until a checkpoint marker is received on it
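A simplified single-process view of these rules (marker transport between processes is omitted, and a toy integer accumulator stands in for real operator state):

```python
MARKER = "CHECKPOINT_MARKER"

class Process:
    """One process's side of a simplified Chandy-Lamport snapshot.
    Assumes FIFO channels and a single checkpoint in flight; sending
    MARKER on outgoing channels is left abstract."""
    def __init__(self, in_channels):
        self.in_channels = list(in_channels)
        self.state = 0          # toy local state: a running sum
        self.snapshot = None    # checkpointed local state
        self.recorded = {}      # in-flight messages recorded per input channel
        self.open = set()       # channels still being recorded

    def _checkpoint(self, exclude=()):
        self.snapshot = self.state
        # Record every input channel except those whose marker already arrived.
        self.open = set(self.in_channels) - set(exclude)
        self.recorded = {ch: [] for ch in self.open}
        # ...and send MARKER on every outgoing channel (transport omitted)

    def receive(self, channel, msg):
        if msg == MARKER:
            if self.snapshot is None:              # first marker seen
                self._checkpoint(exclude=[channel])
            else:
                self.open.discard(channel)         # stop recording this channel
        else:
            self.state += msg                      # normal processing
            if channel in self.open:               # in flight w.r.t. the snapshot
                self.recorded[channel].append(msg)

p = Process(in_channels=["a", "b"])
p.receive("a", 5)          # processed before the checkpoint: lands in p.state
p.receive("a", MARKER)     # first marker: checkpoint state, record channel "b"
p.receive("b", 3)          # in flight on "b": processed AND recorded
p.receive("b", MARKER)     # marker on "b": its recording is complete
print(p.snapshot, p.recorded)   # 5 {'b': [3]}
```

The snapshot pairs the checkpointed state (5) with the messages that crossed the cut (3 on channel "b"), which together are exactly what recovery needs to replay.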
SLIDE 18

Load Balancing

  • How to balance load in a tuple-at-a-time system?
  • Redistribute keys
    • Move a key from one server to another
    • Requires migrating operator state
  • Replicate and aggregate
    • Multiple copies of the same operator compute partial results
    • A downstream operator aggregates them
  • Power of both choices
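The replicate-and-aggregate idea in miniature (a toy counting operator; the routing hash is an arbitrary choice for this sketch): a hot key is spread over two operator copies, and a downstream step merges the partial counts:

```python
from collections import Counter

copy_a, copy_b = Counter(), Counter()   # two copies of the counting operator

def route(tup):
    # Hash on the whole tuple (key AND payload), not the key alone,
    # so tuples of one hot key spread across both copies.
    target = copy_a if hash(tup) % 2 == 0 else copy_b
    target[tup[0]] += 1

for i in range(1000):
    route(("hot_key", i))

aggregated = copy_a + copy_b            # downstream aggregation of partials
print(aggregated["hot_key"])            # 1000: the total survives the split
```

Plain key-based partitioning would pin all 1000 tuples on one copy; paying one extra aggregation operator buys the ability to split a skewed key's load.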
SLIDE 19

Lambda Architecture

  • Compromise between accuracy and freshness
  • Requires maintaining 2 platforms and 2 implementations

[Diagram: the incoming data stream is kept in a persistent store (e.g. Kafka) and consumed twice. A stream-processing path updates approximate results in real time, while a batch-processing path periodically recreates accurate results of the analysis (e.g. an index).]

SLIDE 20

Regulating the Data Flow

  • Backpressure
    • If a receiver operator cannot process inputs fast enough…
    • …block or slow down the senders (recursively if needed)
  • Intermediate buffer pools (queues)
    • Decouple communication from consuming messages
  • Conflicting requirements
    • Throughput: batch output messages, don’t send them one by one
    • Latency: send messages as soon as possible
  • Tradeoff: send when either
    • the maximum batch size is reached (e.g. 1 kB), or
    • a timeout expires (e.g. 5 milliseconds)
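The size-or-timeout tradeoff can be sketched as a small sender wrapper (hypothetical; a real implementation would also need a background timer so the timeout can fire without waiting for the next submit):

```python
import time

class BatchingSender:
    """Buffers outgoing messages and flushes when the batch reaches max_bytes,
    or when max_delay seconds have passed since the first buffered message."""
    def __init__(self, send, max_bytes=1024, max_delay=0.005):
        self.send, self.max_bytes, self.max_delay = send, max_bytes, max_delay
        self.buf, self.size, self.first_at = [], 0, None

    def submit(self, msg: bytes, now=None):
        now = time.monotonic() if now is None else now  # injectable for testing
        if not self.buf:
            self.first_at = now
        self.buf.append(msg)
        self.size += len(msg)
        if self.size >= self.max_bytes or now - self.first_at >= self.max_delay:
            self.flush()

    def flush(self):
        if self.buf:
            self.send(self.buf)
            self.buf, self.size, self.first_at = [], 0, None

sent = []
s = BatchingSender(sent.append, max_bytes=10, max_delay=1.0)
s.submit(b"aaaa", now=0.0)   # 4 bytes buffered
s.submit(b"bbbb", now=0.1)   # 8 bytes: still under both limits
s.submit(b"cc", now=0.2)     # 10 bytes: size limit reached, flush
print(sent)                  # [[b'aaaa', b'bbbb', b'cc']]
```

The two knobs bound the two costs: `max_bytes` caps the per-message overhead (throughput), and `max_delay` caps how long any message can sit in the buffer (latency).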
SLIDE 21

Exercise

SLIDE 22

Exercise: Online Store

  • Two input streams
    • One has ad clicks: <user_ID, time, ad_ID>
    • The other has ad impressions: <ad_ID, time, user_ID>
  • Design a dataflow application that
    • Correlates ad impressions with user clicks (for billing)
    • Correlated = the click happens within 10 seconds after the ad impression
  • Questions
    • Which operators? How are they partitioned?
    • Watermarks?
    • Triggers?
SLIDE 23

Possible Implementation

  • Streaming operator
    • Receives both streams
    • Partitioned by ad_ID
    • Joins by ad_ID and returns <ad_ID, user_ID, ad_time>
  • Windowing: session
    • When an ad appears, gather all clicks for 5 minutes
  • Watermarks
    • Both input streams emit a low watermark every second
    • The earliest low watermark triggers the window
  • Triggers: aggregate and retract
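A hedged sketch of the join operator's core logic: it keeps per-ad impression state and matches clicks that arrive within the 10-second window (watermark-driven garbage collection of old impressions is omitted). The tuple layouts follow the exercise; all function names are made up for the sketch:

```python
# State of one partition of the join operator: ad_ID -> [(time, user_ID), ...]
impressions = {}

def on_impression(ad_id, t, user_id):
    """Handle an impression tuple <ad_ID, time, user_ID>."""
    impressions.setdefault(ad_id, []).append((t, user_id))

def on_click(user_id, t, ad_id):
    """Handle a click tuple <user_ID, time, ad_ID>.
    Return billable (ad_ID, user_ID, ad_time) matches for this click."""
    matches = []
    for ad_time, imp_user in impressions.get(ad_id, []):
        if imp_user == user_id and 0 <= t - ad_time <= 10:
            matches.append((ad_id, user_id, ad_time))
    return matches

on_impression("ad1", t=100, user_id="u1")
print(on_click("u1", t=105, ad_id="ad1"))   # [('ad1', 'u1', 100)]
print(on_click("u1", t=120, ad_id="ad1"))   # []  (outside the 10 s window)
```

Partitioning by ad_ID means both tuples of any potential match land on the same subtask, so this state stays local; the watermarks from the slide would bound how long entries must be retained before they can be dropped.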