Discretized Streams
An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters
Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica
UC BERKELEY
[Figure: record-at-a-time processing model. Input records are pushed through nodes 1 to 3, each of which holds mutable state.]
[Figure: fault recovery in record-at-a-time systems, two schemes. Replication: nodes 1 to 3 each have a replica (nodes 1', 2', 3'), kept consistent through synchronization. Upstream backup: nodes 1 to 3, with input shown at each node and a single standby node.]
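[Figure: discretized stream processing. At each time step (t = 1, t = 2, …), input for stream 1 and stream 2 is pulled into an immutable dataset (stored reliably). Batch operations then produce further immutable datasets (output or state), stored in memory without replication.]
[Figure: a map operation applied to an input dataset.]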
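The discretized model above can be illustrated with ordinary Scala collections. This is only a minimal sketch under stated assumptions, not the system's implementation: `Record`, `discretize`, and `wordCounts` are hypothetical names introduced here, and records are assumed to carry a millisecond timestamp. It shows input being cut into one immutable batch per interval, with a deterministic batch operation applied to each batch independently.

```scala
// Sketch only: plain Scala collections stand in for distributed datasets.
// Record, discretize, and wordCounts are hypothetical names for illustration.
object DiscretizeSketch {
  final case class Record(timestampMs: Long, text: String)

  // Cut the input into consecutive intervals of `intervalMs`.
  // Each interval index maps to one immutable batch of records.
  def discretize(records: Seq[Record], intervalMs: Long): Map[Long, Vector[Record]] =
    records
      .groupBy(_.timestampMs / intervalMs)
      .map { case (interval, rs) => interval -> rs.toVector }

  // A deterministic batch operation: word counts for a single interval.
  def wordCounts(batch: Seq[Record]): Map[String, Int] =
    batch
      .flatMap(_.text.split("\\s+").toSeq)
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val input = Seq(Record(100, "a b"), Record(900, "a"), Record(1200, "b b"))
    val batches = discretize(input, intervalMs = 1000L)
    // Process the intervals in time order, one batch job per interval.
    batches.toSeq.sortBy(_._1).foreach { case (t, batch) =>
      println(s"interval $t: ${wordCounts(batch)}")
    }
  }
}
```

Because each batch is immutable and the operation is deterministic, rerunning `wordCounts` on the same batch always yields the same result, which is the property that makes recomputation a viable recovery strategy.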
[Figure: Cluster Throughput (GB/s) vs. # of Nodes in Cluster, with one series each for the 1 sec and 2 sec latency bounds. Panels: WordCount, Grep.]
Max throughput within a given latency bound (1 or 2s)
[Figure: Interval Processing Time (s) vs. Time (s), with the point where the failure happens marked.]
Sliding WordCount on 10 nodes with 30s checkpoint interval
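To make the role of the checkpoint interval concrete, here is a hedged sketch of recomputing lost state from the most recent checkpoint plus the batches that followed it. It is a simplified, sequential model with hypothetical names (`update`, `checkpoints`, `recover`) and a running sum standing in for any deterministic state update. It does not model the parallelism across nodes used in the actual recovery.

```scala
// Sketch only: sequential recomputation from a checkpoint.
// All names are hypothetical; a running sum stands in for real state.
object CheckpointRecoverySketch {
  // Deterministic update: state(t) = state(t - 1) combined with batch(t).
  def update(state: Long, batch: Seq[Long]): Long = state + batch.sum

  // Save the state after every `every` intervals: interval index -> state.
  def checkpoints(batches: IndexedSeq[Seq[Long]], every: Int): Map[Int, Long] = {
    val states = batches.scanLeft(0L)(update).tail
    states.zipWithIndex.collect {
      case (state, t) if (t + 1) % every == 0 => t -> state
    }.toMap
  }

  // Rebuild the state at interval `t`: start from the latest checkpoint
  // at or before `t`, then replay only the batches after that checkpoint.
  def recover(batches: IndexedSeq[Seq[Long]], saved: Map[Int, Long], t: Int): Long = {
    val latest = saved.keys.filter(_ <= t).reduceOption(_ max _)
    val start = latest.map(saved).getOrElse(0L)
    val firstReplay = latest.map(_ + 1).getOrElse(0)
    batches.slice(firstReplay, t + 1).foldLeft(start)(update)
  }

  def main(args: Array[String]): Unit = {
    val batches = Vector(Seq(1L), Seq(2L), Seq(3L), Seq(4L), Seq(5L))
    val saved = checkpoints(batches, every = 2)
    println(recover(batches, saved, t = 4))
  }
}
```

In this model a shorter checkpoint interval means fewer batches to replay after a failure, at the cost of saving state more often.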
pageViews = readStream("...", "1s")
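counts = ones.runningReduce(_ + _)
[Figure: the pageViews and counts D-Streams at t = 1, t = 2, …, with the dataset for each interval derived through map and reduce steps. Legend: RDD, partition. Callout: Scala function literal.]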
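The semantics of `runningReduce` can be sketched with plain Scala collections: the state for interval t is the state for interval t - 1 merged with that interval's batch. This is an illustration under assumptions, not the library's API. `step` and the local `runningReduce` helper are hypothetical, and each batch is assumed to be a sequence of (key, 1) pairs like the `ones` stream.

```scala
// Sketch only: runningReduce semantics on in-memory collections.
// step and runningReduce here are hypothetical helpers, not a real API.
object RunningReduceSketch {
  type Counts = Map[String, Int]

  // Merge one interval's (key, value) pairs into the previous state using `f`.
  def step(state: Counts, batch: Seq[(String, Int)], f: (Int, Int) => Int): Counts =
    batch.foldLeft(state) { case (acc, (key, value)) =>
      acc.updated(key, acc.get(key).map(f(_, value)).getOrElse(value))
    }

  // One state per interval: state(t) = step(state(t - 1), batch(t)).
  def runningReduce(batches: Seq[Seq[(String, Int)]], f: (Int, Int) => Int): Seq[Counts] =
    batches.scanLeft(Map.empty[String, Int])((state, batch) => step(state, batch, f)).tail

  def main(args: Array[String]): Unit = {
    val ones = Seq(
      Seq("/a" -> 1, "/b" -> 1, "/a" -> 1), // page views in interval 1
      Seq("/b" -> 1)                        // page views in interval 2
    )
    runningReduce(ones, _ + _).foreach(println)
  }
}
```

Each element of the result is a new immutable map, mirroring how each interval in the figure yields a new dataset rather than mutating earlier state.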
sliding = ones.reduceByWindow("5s", _ + _, _ - _)
Incremental version with “add” and “subtract” functions
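Why the "subtract" function helps can be seen in a small sketch that compares a naive window reduction with the incremental one. The names `naive` and `incremental` are hypothetical, each input element is assumed to be the already-reduced value for one interval, and `w` is the window length in intervals (a "5s" window over 1s intervals would correspond to `w = 5`).

```scala
// Sketch only: sliding-window reduction over per-interval values.
// naive and incremental are hypothetical helpers for illustration.
object SlidingWindowSketch {
  // Naive: re-reduce every value inside the window at each interval.
  def naive(xs: IndexedSeq[Int], w: Int, add: (Int, Int) => Int): Vector[Int] =
    xs.indices.map { t =>
      xs.slice(math.max(0, t - w + 1), t + 1).reduce(add)
    }.toVector

  // Incremental: take the previous window's result, add the newest
  // interval, and subtract the interval that just left the window.
  def incremental(
      xs: IndexedSeq[Int],
      w: Int,
      add: (Int, Int) => Int,
      sub: (Int, Int) => Int
  ): Vector[Int] =
    xs.indices.foldLeft(Vector.empty[Int]) { (out, t) =>
      val withNewest = out.lastOption.map(add(_, xs(t))).getOrElse(xs(t))
      out :+ (if (t >= w) sub(withNewest, xs(t - w)) else withNewest)
    }

  def main(args: Array[String]): Unit = {
    val perInterval = Vector(1, 2, 3, 4, 5)
    println(naive(perInterval, 3, _ + _))
    println(incremental(perInterval, 3, _ + _, _ - _))
  }
}
```

The incremental form does a constant amount of work per interval regardless of the window length, but it only applies when the reduce function has an inverse, as addition does.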
Concern                 Discretized Streams                             Record-at-a-time Systems
Latency                 0.5–2 s                                         1–100 ms
Consistency             Yes, batch-level                                Not in message-passing systems; some DBs use waiting
Failures                Parallel recovery                               Replication or upstream backup
Stragglers              Speculation                                     Typically not handled
Unification with batch  Ad-hoc queries from Spark shell, join with RDD  Not in message-passing systems; in some DBs