From a Calculus to an Execution Environment for Stream Processing

SLIDE 1

From a Calculus to an Execution Environment for Stream Processing

DEBS 2012

Robert Soulé, Cornell University
Martin Hirzel, IBM Research
Buğra Gedik, Bilkent University
Robert Grimm, New York University

SLIDE 2

… to an Execution Environment

Source languages:

  • CQL (StreamSQL)
  • StreamIt (SDF)
  • Sawzall (MapReduce)

River (execution environment), running on System S (platform)

Optimizations:

  • Fusion (merge ops)
  • Fission (replicate ops)
  • Placement (assign hosts)

Benefits of execution environment:

  • Language portability
  • Optimization reuse

SLIDE 3

From a Calculus …

  • Calculus = formal language + semantics

– Stream calculus, Soulé et al. [ESOP’10]

  • Graph language:

– Stream operators with functions (F)
– Queues (Q)
– Variables (V)

[Figure: an operator with function f, queues q and q', and variable v]

  • Semantics:

– Small-step
– Operational
– Sequence of “operator firings”: F ⊢ <Q1,V1> → <Q2,V2> →* …
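The idea of a sequence of operator firings can be illustrated with a minimal Python sketch. This is not the Brooklet formalization itself: the names `Operator`, `step`, and `run`, and the two-operator pipeline, are assumptions made for illustration. Each step picks one operator whose input queue holds data and fires it as a pure function from (item, variable value) to (outputs, new variable value).

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Operator:
    func: Callable        # pure: (item, var_value) -> (list_of_outputs, new_var_value)
    in_q: str             # name of the input queue
    out_q: str            # name of the output queue
    var: Optional[str]    # name of the variable holding state, or None

def step(ops, queues, variables, rng=random):
    """Perform one operator firing; return False if no operator can fire."""
    ready = [op for op in ops if queues[op.in_q]]
    if not ready:
        return False
    op = rng.choice(ready)                  # non-deterministic choice among ready operators
    item = queues[op.in_q].popleft()
    state = variables.get(op.var)
    outputs, new_state = op.func(item, state)
    if op.var is not None:
        variables[op.var] = new_state       # state is threaded through the firing
    queues[op.out_q].extend(outputs)
    return True

def run(ops, queues, variables):
    """Fire operators until none is ready (a finite execution)."""
    steps = 0
    while step(ops, queues, variables):
        steps += 1
    return steps

# Hypothetical pipeline: double each item, then tag it with a running count kept in v.
double = Operator(lambda x, _: ([2 * x], None), "q_in", "q_mid", None)
count = Operator(lambda x, n: ([(x, n + 1)], n + 1), "q_mid", "q_out", "v")

queues = {"q_in": deque([1, 2, 3]), "q_mid": deque(), "q_out": deque()}
variables = {"v": 0}
steps = run([double, count], queues, variables)
```

Although the order of firings is chosen at random, the FIFO queues make the final contents of `q_out` the same for every interleaving of this particular pipeline.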

SLIDE 4

Benefits of Calculus: Translation Correctness Proofs

[Figure: commuting diagram relating Input and Output of the source program and of the translated program via Translate and Execute]

SLIDE 5

From Abstractions to the Real World

Brooklet calculus                                  | River execution environment
---------------------------------------------------+-------------------------------------------------------
Sequence of atomic steps                           | Operators execute concurrently
Pure functions, state threaded through invocations | Stateful functions, protected with automatic locking
Non-deterministic execution                        | Restricted execution: bounded queues and back-pressure
Opaque functions                                   | Function implementations
No physical platform, independent from runtime     | Abstract representation of platform, e.g. placement
Finite execution                                   | Indefinite execution

SLIDE 6

Concurrent Execution Case 1: No Shared State

  • Brooklet operators fire one at a time
  • River operators fire concurrently
  • For both, data must be available

[Figure: operators o1, o2, o3 with variables v, w, x; labels: single-threaded operators, atomic queue operations]
SLIDE 7

Concurrent Execution Case 2: With Shared State

  • Locks form equivalence classes over shared variables
  • Every shared variable is protected by one lock
  • Shared variables in the same class protected by same lock
  • Locks acquired/released in standard order

[Figure: operators o1, o2, o3 with variables v and w; caption: Minimal locking]
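The locking rules on this slide can be sketched in Python. The slide does not say how the equivalence classes are computed, so the sketch makes an assumption: variables accessed by exactly the same set of operators share one lock. The names `assign_locks`, `locks_for`, and `LockTable`, and the example graph, are hypothetical.

```python
import threading
from collections import defaultdict

def assign_locks(op_vars):
    """Map each shared variable to a lock id; op_vars maps operator -> set of variables."""
    users = defaultdict(set)
    for op, vs in op_vars.items():
        for v in vs:
            users[v].add(op)
    # Only variables used by more than one operator are shared and need a lock.
    shared = {v: frozenset(ops) for v, ops in users.items() if len(ops) > 1}
    # Assumption: variables accessed by the same set of operators form one class.
    classes = sorted(set(shared.values()), key=sorted)
    class_id = {cls: i for i, cls in enumerate(classes)}
    return {v: class_id[cls] for v, cls in shared.items()}

def locks_for(op, op_vars, var_lock):
    """Lock ids an operator must hold, listed in one global (ascending) order."""
    return sorted({var_lock[v] for v in op_vars[op] if v in var_lock})

class LockTable:
    def __init__(self, n):
        self.locks = [threading.Lock() for _ in range(n)]

    def acquire(self, ids):
        for i in ids:                  # same ascending order everywhere avoids lock-order deadlock
            self.locks[i].acquire()

    def release(self, ids):
        for i in reversed(ids):
            self.locks[i].release()

# Hypothetical graph: o1 uses v alone; o2 and o3 both use w and x.
op_vars = {"o1": {"v"}, "o2": {"w", "x"}, "o3": {"w", "x"}}
var_lock = assign_locks(op_vars)
```

In this example `v` needs no lock because only `o1` touches it, while `w` and `x` fall into one class and are guarded by a single lock.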

SLIDE 8

Restricted Execution Bounded Queues

[Figure: operators o1, o2, o3 with variables v and w, and bounded queue q]

  • Naïve approach: block when output queue is full

– o2 waits because output queue q is full
– o3 waits because o2 has locked w
– Deadlock!

SLIDE 9

Restricted Execution Safe Back-Pressure

[Figure: operators o1, o2, o3 with variables v and w, and output queue q]

  • Our approach: only block on output queue when not holding locks on variables

Steps for o2:

  1. Acquire locks
  2. Fire operator
  3. Buffer data in local queue
  4. Release locks
  5. Move data to output queue
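The five steps can be sketched with Python's standard `threading` and `queue` modules. This is an illustration under stated assumptions, not River's implementation: `fire_with_backpressure`, the shared variable `w`, and the capacity-1 queue are hypothetical.

```python
import queue
import threading

def fire_with_backpressure(func, item, locks, out_q):
    """Fire one operator without ever blocking on out_q while holding locks."""
    for lock in locks:                 # 1. acquire locks (caller passes them in a fixed order)
        lock.acquire()
    try:
        local = list(func(item))       # 2. fire operator, 3. buffer results in a local queue
    finally:
        for lock in reversed(locks):   # 4. release locks
            lock.release()
    for out in local:                  # 5. move data to the bounded output queue;
        out_q.put(out)                 #    put() may block, but no lock is held here

# Hypothetical setup: shared variable w guarded by w_lock, output queue of capacity 1.
w = {"total": 0}
w_lock = threading.Lock()
out_q = queue.Queue(maxsize=1)

def add_to_total(x):
    w["total"] += x
    return [w["total"]]

producer = threading.Thread(
    target=lambda: [fire_with_backpressure(add_to_total, x, [w_lock], out_q) for x in (1, 2, 3)]
)
producer.start()
results = [out_q.get(timeout=5) for _ in range(3)]
producer.join(timeout=5)
```

Because the producer blocks only in step 5, another operator that needs `w_lock` can still make progress while the output queue is full.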
SLIDE 10

Applications of an Execution Environment

  • Easier to develop source languages

– Implementation language
– Language modules
– Operator templates

  • Possible to reuse optimizations

– Annotations provide additional information between source and intermediate language

SLIDE 11

Function Implementations and Translations

CQL source:

  logs : {origin : string; target : string} stream;
  hits : {origin : string; count : int} stream
    = select istream(origin, count(origin))
      from logs[range 300]
      where origin != target

[Figure: operator graph with operators Select, Range, Aggr, IStream and labels count, win]

Expose operators, communication, and state

Pre-existing operator templates:

  Bag.filter (fun x -> #expr)

Instantiated for the query:

  Bag.filter (fun x -> origin != target)
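Filling in an operator template can be sketched as textual substitution. The sketch assumes that placeholders have the form `#name` and that substitution is purely textual; the actual translator may well work on syntax trees instead. The helper `instantiate` is hypothetical.

```python
import re

def instantiate(template, bindings):
    """Replace each #name placeholder in an operator template with its binding."""
    def substitute(match):
        name = match.group(1)
        if name not in bindings:
            raise KeyError(f"no binding for placeholder #{name}")
        return bindings[name]
    return re.sub(r"#([A-Za-z_]\w*)", substitute, template)

# Template and expression taken from the slide.
select_template = "Bag.filter (fun x -> #expr)"
select_op = instantiate(select_template, {"expr": "origin != target"})
```

Raising on an unbound placeholder keeps a half-instantiated template from silently reaching the generated code.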

SLIDE 12

Translation Support: Pluggable Compiler Modules

CQL = SQL + Streaming + Expressions

  select istream(*)
  from quotes[now], history
  where quotes.ask <= history.low and quotes.ticker = history.ticker

[Figure: compiler modules Expression analyzer, SQL analyzer, CQL analyzer, and Symbol table, connected by is-a and has-a relationships]

SLIDE 13

Optimization Support: Extensible Annotations

Source language → River (execution environment) → System S (platform)

  • Source language: establishes properties by construction, e.g., Sawzall reducers commute
  • System S (platform): establishes, e.g., available resources
  • Optimizer needs to know:

– Safety
– Profitability

SLIDE 14

Optimization Support: Current Annotations

Annotation     | Description                                            | Optimization
---------------+--------------------------------------------------------+-------------
@Fuse(ID)      | Fuse operators with same ID in the same process        | Fusion
@Parallel()    | Perform fission on an operator                         | Fission
@Commutative() | An operator’s function is commutative                  | Fission
@Keys(k1,…,kn) | An operator’s state is partitionable by fields k1,…,kn | Fission
@Group(ID)     | Place operators with same ID on the same machine       | Placement
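To show why @Keys makes fission safe, here is a minimal Python sketch of key-partitioned routing. It is an illustration only: `replica_for`, `run_fissioned`, the CRC-based routing, and the sample tuples are assumptions, not River's optimizer. Tuples with equal key fields always reach the same replica, so each replica can keep its share of the state privately.

```python
import zlib
from collections import Counter

def replica_for(tup, keys, n_replicas):
    """Pick a replica from the key fields so equal keys always meet the same state."""
    key = "|".join(str(tup[k]) for k in keys)
    return zlib.crc32(key.encode("utf-8")) % n_replicas   # stable across runs, unlike hash()

def run_fissioned(tuples, keys, n_replicas):
    """Simulate n replicas of a counting operator, each with private state."""
    states = [Counter() for _ in range(n_replicas)]
    for tup in tuples:
        r = replica_for(tup, keys, n_replicas)
        states[r][tuple(tup[k] for k in keys)] += 1
    return states

# Hypothetical input, partitioned as if the operator carried @Keys(origin).
logs = [{"origin": "a"}, {"origin": "b"}, {"origin": "a"}, {"origin": "c"}, {"origin": "a"}]
states = run_fissioned(logs, keys=["origin"], n_replicas=3)
merged = sum(states, Counter())
```

Merging the per-replica counters gives the same totals as a single unreplicated operator, which is the property the annotation lets the optimizer rely on.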

SLIDE 15

Evaluation

  • Four benchmark applications

– CQL linear road
– StreamIt FM radio
– Sawzall web log analyzer (batch)
– CQL web log analyzer (continuous)

  • Three optimizations

– Placement
– Fission
– Fusion

SLIDE 16

Distributed Linear Road

(simplified version from Arasu/Babu/Widom [VLDBJ’06])

[Figure: Linear Road operator graph built from now, project, istream, dup-split, range, join, aggregate, select, partition, distinct, and rstream operators]

First distributed CQL implementation

SLIDE 17

CQL: Placement, Fusion, Fission

  • Placement + Fusion → 4x speedup on 4 machines
  • Fission → 2x speedup on 16 machines

  • Insufficient work per operator
SLIDE 18

StreamIt: Placement

  • Optimization reuse → 1.8x speedup on 4 machines
SLIDE 19

Sawzall (MapReduce on River): Fission + Fusion

  • Same fission optimizer for Sawzall as for CQL
  • 8.92x speedup on 16 machines, 14.80x on 64 cores
  • With fusion, 50.32x on 64 cores
SLIDE 20

Related Work

Areas:

  • Stream processing
  • Execution environment
  • Translators from languages to IL

Related systems:

  • CQL, Arasu et al. [VLDB J.’06]
  • SVM, Labonte et al. [PACT’04]
  • P-Code, Nelson [CC’79]
  • This paper

SLIDE 21

Conclusions

  • River, execution environment for streaming
  • Semantics specified by formal calculus

– Brooklet, Soulé et al. [ESOP’10]

  • 3 source languages, 3 optimizations

– First distributed CQL
– Language compiler module reuse
– Optimization enabled by annotations

  • Encourages innovation in stream processing
  • http://www.cs.nyu.edu/brooklet/