CS 744: SPARK STREAMING Shivaram Venkataraman Fall 2019 - - PowerPoint PPT Presentation



SLIDE 1

CS 744: SPARK STREAMING

Shivaram Venkataraman Fall 2019

SLIDE 2

ADMINISTRIVIA

  • Midterm grades this week
  • Course Projects: sign up for meetings
SLIDE 3

[Course stack diagram, top to bottom: Applications; Machine Learning, SQL, Streaming, Graph; Computational Engines; Resource Management; Scalable Storage Systems; Datacenter Architecture]

SLIDE 4

CONTINUOUS OPERATOR MODEL

  • Long-lived operators (e.g., Naiad)
  • Distributed checkpoints for fault recovery
  • Mutable state
  • Stragglers?

[Diagram: driver coordinating long-lived operators via tasks, control messages, and network transfers]

SLIDE 5

CONTINUOUS OPERATORS

SLIDE 6

SPARK STREAMING: GOALS

  1. Scalability to hundreds of nodes
  2. Minimal cost beyond base processing (no replication)
  3. Second-scale latency
  4. Second-scale recovery from faults and stragglers
SLIDE 7

DISCRETIZED STREAMS (DSTREAMS)

SLIDE 8

EXAMPLE

pageViews = readStream("http://...", "1s")
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b)
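The example above can be simulated outside Spark: a DStream is a sequence of immutable micro-batch datasets, and runningReduce carries an aggregate forward from one batch to the next. A minimal Python sketch (the batch contents and URLs below are invented for illustration):

```python
from collections import Counter

def running_reduce(batches):
    """Simulate runningReduce over micro-batches of page-view URLs.

    The running count after batch t covers all events in batches 0..t;
    each snapshot plays the role of that timestep's counts RDD.
    """
    counts = Counter()                  # state carried across timesteps
    snapshots = []
    for batch in batches:               # one iteration = one "1s" micro-batch
        counts.update(batch)            # reduce this batch into the running state
        snapshots.append(dict(counts))  # immutable snapshot for this timestep
    return snapshots

# Three simulated 1-second micro-batches of page-view URLs
snaps = running_reduce([["/home", "/docs"], ["/home"], ["/docs", "/home"]])
assert snaps[-1] == {"/home": 3, "/docs": 2}
```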

SLIDE 9

ARCHITECTURE

SLIDE 10

DSTREAM API

Transformations
  • Stateless: map, reduce, groupBy, join
  • Stateful: window("5s") → RDDs with data in [0,5), [1,6), [2,7)
    reduceByWindow("5s", (a, b) => a + b)

SLIDE 11

SLIDING WINDOW

Naive aggregation: re-add the previous 5 intervals each time (an incremental reduce can instead add the new interval and subtract the expired one)
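The two windowing strategies can be sketched as follows: the naive version re-reduces the last 5 intervals at every step, while the incremental version adds the newest interval and subtracts the one that left the window (this requires an invertible reduce function such as +; the per-interval counts are invented):

```python
def window_sums_naive(values, w):
    """Re-aggregate the last w intervals at every timestep."""
    return [sum(values[max(0, t - w + 1): t + 1]) for t in range(len(values))]

def window_sums_incremental(values, w):
    """Add the newest interval, subtract the expired one."""
    out, acc = [], 0
    for t, v in enumerate(values):
        acc += v
        if t >= w:
            acc -= values[t - w]   # interval that just left the window
        out.append(acc)
    return out

per_second_counts = [3, 1, 4, 1, 5, 9, 2]
assert window_sums_naive(per_second_counts, 5) == window_sums_incremental(per_second_counts, 5)
```

Both produce the same windowed sums; the incremental version does O(1) work per timestep instead of O(w).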

SLIDE 12

STATE MANAGEMENT

Tracking state: streams of (Key, Event) → (Key, State)

events.track(
  (key, ev) => 1,                          // initialize
  (key, st, ev) => ev == Exit ? null : 1,  // update
  "30s")                                   // timeout
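A sketch of the track semantics: an initialize function runs for keys seen for the first time, an update function runs per event (returning null drops the key), and idle keys expire after the timeout. The event names and the simplified timeout handling below are assumptions for illustration:

```python
def track(events, initialize, update, now, timeout_s=30):
    """events: list of (timestep, key, event). Returns key -> state."""
    state, last_seen = {}, {}
    for t, key, ev in events:
        if key not in state:
            st = initialize(key, ev)
        else:
            st = update(key, state[key], ev)
        if st is None:                       # update returned null: drop the key
            state.pop(key, None); last_seen.pop(key, None)
        else:
            state[key], last_seen[key] = st, t
    # Expire keys idle for longer than the timeout
    for key in [k for k, t in last_seen.items() if now - t > timeout_s]:
        del state[key]; del last_seen[key]
    return state

# u1's session ends on Exit; u2's session is still active
sessions = track(
    [(0, "u1", "Click"), (1, "u2", "Click"), (2, "u1", "Exit")],
    initialize=lambda k, ev: 1,
    update=lambda k, st, ev: None if ev == "Exit" else 1,
    now=5)
assert sessions == {"u2": 1}
```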

SLIDE 13

SYSTEM IMPLEMENTATION

SLIDE 14

OPTIMIZATIONS

Timestep Pipelining
  • No barrier across timesteps unless needed
  • Tasks from the next timestep scheduled before the current one finishes

Checkpointing
  • Async I/O, as RDDs are immutable
  • Forget lineage after checkpoint
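The "forget lineage after checkpoint" point can be illustrated with a toy lineage chain: recovery replays transformations only back to the most recent checkpointed step, not all the way to the source (the function and step names below are invented):

```python
def recompute(lineage, checkpoints):
    """lineage: list of (name, fn) applied in order to the source data.
    checkpoints: dict name -> saved result for that step.
    Replays only the suffix of the lineage after the latest checkpoint."""
    start, data = 0, [1, 2, 3]          # toy source micro-batch
    for i, (name, _) in enumerate(lineage):
        if name in checkpoints:         # lineage before this step is forgotten
            start, data = i + 1, checkpoints[name]
    replayed = []
    for name, fn in lineage[start:]:
        data = [fn(x) for x in data]
        replayed.append(name)
    return data, replayed

lineage = [("double", lambda x: 2 * x), ("inc", lambda x: x + 1)]
# With "double" checkpointed, recovery replays only "inc".
data, replayed = recompute(lineage, {"double": [2, 4, 6]})
assert (data, replayed) == ([3, 5, 7], ["inc"])
```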

SLIDE 15

FAULT TOLERANCE: PARALLEL RECOVERY

Worker failure

  • Need to recompute state RDDs stored on worker
  • Re-execute tasks running on the worker

Strategy

  • Run all independent recovery tasks in parallel
  • Parallelism from partitions in timestep and across timesteps
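The strategy above can be sketched with a thread pool: each lost partition, within and across timesteps, is an independent recompute task, so all of them can run in parallel over the surviving nodes (the partition layout below is invented):

```python
from concurrent.futures import ThreadPoolExecutor

def recover(lost_partitions, recompute):
    """lost_partitions: list of (timestep, partition_id) pairs.

    Every task is independent, so recovery parallelism comes both from
    partitions within a timestep and from partitions across timesteps.
    """
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda p: recompute(*p), lost_partitions))

# Partitions lost on a failed worker, spanning two timesteps
lost = [(0, 1), (0, 3), (1, 1), (1, 3)]
results = recover(lost, lambda t, p: f"t{t}-p{p} rebuilt")
assert results == ["t0-p1 rebuilt", "t0-p3 rebuilt", "t1-p1 rebuilt", "t1-p3 rebuilt"]
```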
SLIDE 16

EXAMPLE

pageViews = readStream("http://...", "1s")
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b)

SLIDE 17

FAULT TOLERANCE

Straggler Mitigation
  • Use speculative execution
  • A task running more than 1.4x longer than the median task → straggler

Master Recovery
  • At each timestep, save the graph of DStreams and Scala function objects
  • Workers connect to a new master and report their RDD partitions
  • Note: no problem if a given RDD is computed twice (determinism)
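The 1.4x-median rule can be sketched directly (the threshold comes from the slide; the task runtimes are invented):

```python
from statistics import median

def stragglers(task_runtimes, factor=1.4):
    """Mark a task as a straggler (candidate for speculative
    re-execution) if it runs more than factor x the median runtime
    of the tasks in its stage."""
    cutoff = factor * median(task_runtimes.values())
    return sorted(t for t, rt in task_runtimes.items() if rt > cutoff)

runtimes = {"task-0": 1.0, "task-1": 1.1, "task-2": 0.9, "task-3": 2.5}
assert stragglers(runtimes) == ["task-3"]   # 2.5 > 1.4 * median(1.05)
```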
SLIDE 18

DISCUSSION

https://forms.gle/xUvzC1bdV7H48mTM8

SLIDE 19

SLIDE 20

If the latency bound were set to 100ms, how do you think the above figure would change? What could be the reasons for it?

SLIDE 21

Consider the pros and cons of approaches in Naiad vs Spark Streaming. What application properties would you use to decide which system to choose?

SLIDE 22

NEXT STEPS

Next class: Graph processing

Sign up for project check-ins!

SLIDE 23

SHORTCOMINGS?

Expressiveness

  • Current API requires users to “think” in micro-batches

Setting batch interval

  • Manual tuning. Higher batch interval → better throughput but worse latency

Memory usage

  • LRU cache stores state RDDs in memory
SLIDE 24

COMPUTATION MODEL: MICRO-BATCHES

[Diagram: driver launches a set of short tasks per micro-batch; control messages, shuffle, and network transfer between stages]

SLIDE 25

SUMMARY

  • Micro-batches: a new approach to stream processing
  • Trades higher latency for fault tolerance and straggler mitigation
  • Unifies batch and streaming analytics