Drinking From The Fire Hose: The Rise of Scalable Stream Processing Systems
SLIDE 1

Drinking From The Fire Hose: The Rise of Scalable Stream Processing Systems

Peter R. Pietzuch <prp@doc.ic.ac.uk>
Large-Scale Distributed Systems Group, Department of Computing
http://lsds.doc.ic.ac.uk

Cambridge MPhil – February 2014

SLIDE 2

The Data Deluge

  • 150 Exabytes (billion GBs) created in 2005 alone

– Increased to 1200 Exabytes in 2010

  • Many new sources of data become available

– Sensors, mobile devices
– Web feeds, social networking
– Cameras
– Databases
– Scientific instruments

→ How can we make sense of all this data?

– Most data is not interesting
– New data supersedes old data
– Challenge is not only storage but also querying

SLIDE 3

Real Time Traffic Monitoring


  • Instrumenting a country’s transportation infrastructure
  • Many parties interested in the data

– Road authorities, traffic planners, emergency services, commuters
– But access is not everything: privacy matters

  • High-level queries

– “What is the best time/route for my commute through central London between 7–8am?”

[Image: TIME-EACM (Cambridge)]

SLIDE 4

Web/Social Feed Mining


Social Cascade Detection

  • Detection and reaction to social cascades
SLIDE 5

Fraud Detection

  • How to detect identity fraud as it happens?
  • Illegal use of mobile phone, credit card, etc.

– Offline: avoid aggravating the customer
– Online: detect and intervene

  • Huge volume of call records
  • More sophisticated forms of fraud

– e.g. insider trading

  • Supervision of laws and regulations

– e.g. Sarbanes-Oxley, real-time risk analysis

SLIDE 6

Astronomic Data Processing


  • Analysing transient cosmic events: γ-ray bursts
  • Large Synoptic Survey Telescope (LSST)

– Generates 1.28 Petabytes per year

SLIDE 7

Stream Processing to the Rescue!

  • Stream data rates can be high

– High resource requirements for processing (clusters, data centres)

  • Processing stream data has real-time aspect

– Latency of data processing matters – Must be able to react to events as they occur

→ Process data streams on the fly without storage

SLIDE 8

Traditional Databases (Boring)

  • Database Management System (DBMS):
  • Data relatively static but queries dynamic

[Diagram: data with an index stored inside the DBMS; queries arrive and results are returned]

– Persistent relations

  • Random access
  • Low update rate
  • Unbounded disk storage

– One-time queries

  • Finite query result
  • Queries exploit (static) indices
SLIDE 9

Data Stream Processing System

  • DSPS: Queries static but data dynamic
  • Data represented as time-dependent data stream

[Diagram: streams flow through the DSPS, which keeps only working storage; registered queries produce a result stream]

– Transient streams

  • Sequential access
  • Potentially high rate
  • Bounded main memory

– Continuous queries

  • Produce time-dependent result stream
  • Indexing?
SLIDE 10

Overview

  • Why Stream Processing?
  • Stream Processing Models

– Streams, windows, operators – Data mining of streams

  • Stream Processing Systems

– Distributed Stream Processing – Scalable Stream Processing in the Cloud

SLIDE 11

Stream Processing

  • Need to define
  • 1. Data model for streams
  • 2. Processing (query) model for streams

SLIDE 12

Data Stream

  • “A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety.”

[Golab & Ozsu (SIGMOD 2003)]

  • Relational model for stream structure?

– Can’t represent audio/video data – Can’t represent analogue measurements

SLIDE 13

Relational Data Stream Model

  • Streams consist of infinite sequence of tuples

– Tuples often have associated time stamp

  • e.g. arrival time, time of reading, ...
  • Tuples have fixed relational schema

– Set of attributes

[Diagram: sensor output tuple (id = 27182, temp = 24 °C, rain = 20 mm) entering the Sensors(id, temp, rain) data stream at times t1, t2, t3, t4, ...]
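A minimal sketch of this tuple model in Python (the class and generator are illustrative, not from the talk):

from dataclasses import dataclass

# Fixed relational schema Sensors(id, temp, rain), plus a timestamp.
@dataclass(frozen=True)
class SensorTuple:
    ts: float    # e.g. arrival time or time of reading
    id: int
    temp: float  # degrees Celsius
    rain: float  # millimetres

def sensors_stream(source):
    """Yield an unbounded, ordered sequence of SensorTuple items."""
    for ts, sensor_id, temp, rain in source:
        yield SensorTuple(ts, sensor_id, temp, rain)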

SLIDE 14

Stream Relational Model

  • Window converts stream to dynamic relation

– Similar to maintaining a view
– Use regular relational algebra operators on tuples
– Can combine streams and relations in a single query

[Diagram: window specifications map streams to relations; the special operators Istream, Dstream and Rstream map relations back to streams; any relational query applies between relations]

SLIDE 15

Sliding Window I

  • How many tuples should we process each time?
  • Process tuples in window-sized batches

  • Time-based window with size τ at current time t:

– [t − τ : t]  Sensors [Range τ seconds]
– [t : t]  Sensors [Now]

  • Count-based window with size n (last n tuples):

– Sensors [Rows n]

[Diagram: window over the most recent tuples of the stream, ending now]

SLIDE 16

Sliding Window II

  • How often should we evaluate the window?
  • 1. Output new result tuples as soon as available

– Difficult to implement efficiently

  • 2. Slide window by s seconds (or m tuples): Sensors [Slide s seconds]

– Sliding window: s < τ
– Tumbling window: s = τ

[Diagram: window advancing over the stream in steps of size s]
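A sketch of the two window types in Python, assuming tuples arrive ordered by a ts field (function names are illustrative):

from collections import deque

def time_window(stream, tau):
    """[Range tau seconds]: tuples with ts in [t - tau, t] at current time t."""
    win = deque()
    for item in stream:                   # items assumed ordered by item.ts
        win.append(item)
        while win[0].ts < item.ts - tau:  # expire tuples outside the range
            win.popleft()
        yield list(win)                   # evaluate on every arrival

def count_window(stream, n):
    """[Rows n]: the last n tuples."""
    win = deque(maxlen=n)
    for item in stream:
        win.append(item)
        yield list(win)

A [Slide s seconds] variant would emit the window contents only every s seconds; with s = τ this degenerates to the tumbling case.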

SLIDE 17

Continuous Query Language (CQL)

  • Based on SQL with streaming constructs

– Tuple- and time-based windows – Sampling primitives

  • Apart from that, regular SQL syntax

SELECT temp
FROM Sensors [Range 1 hour]
WHERE temp > 42;

SELECT *
FROM S1 [Rows 1000], S2 [Range 2 mins]
WHERE S1.A = S2.A AND S1.A > 42;

SLIDE 18

Join Processing

  • Naturally supports joins over windows
  • Only meaningful with window specification for streams

– Otherwise requires unbounded state!

Sensors(time, id, temp, rain)
Faulty(time, id)

SELECT S.id, S.rain
FROM Sensors [Rows 10] AS S, Faulty [Range 1 day] AS F
WHERE S.rain > 10 AND F.id != S.id;

-- without windows, this join needs unbounded state:
SELECT * FROM S1, S2 WHERE S1.a = S2.b;

SLIDE 19

Converting Relations → Streams

  • Define mapping from relation back to stream

– Assumes discrete, monotonically increasing timestamps τ, τ+1, τ+2, τ+3, ...

  • Istream(R)

– Stream of all tuples (r, τ) where r∈R at time τ but r∉R at time τ-1

  • Dstream(R)

– Stream of all tuples (r, τ) where r∈R at time τ-1 but r∉R at time τ

  • Rstream(R)

– Stream of all tuples (r, τ) where r∈R at time τ
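A minimal sketch of the three operators, treating each relation snapshot as a Python set of tuples:

def istream(prev, curr, tau):
    """Insertions: r in R at time tau but not at tau - 1."""
    return {(r, tau) for r in curr - prev}

def dstream(prev, curr, tau):
    """Deletions: r in R at time tau - 1 but not at tau."""
    return {(r, tau) for r in prev - curr}

def rstream(curr, tau):
    """Everything: all r in R at time tau."""
    return {(r, tau) for r in curr}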

SLIDE 20

Data Mining in Streams

SLIDE 21

Stream Data Mining

  • Often continuous queries relate to long-term characteristics of streams

– Frequency of stock trades, number of invalid sensor readings, ...

  • May have insufficient memory to evaluate query

– Consider stream with window of 10^9 integers

  • Can store this in 4 GB of memory

– What about 10^6 such streams?

  • Cannot keep all windows in memory

→ Need to compress data in windows

SLIDE 22

Limitations of Window Compression

  • Consider window compression for the following query:

SELECT SUM(num) FROM Numbers [Rows 10^9];

  • Assume that W can be compressed as C(W) = W_C

– Since C maps more possible windows than compressed values, there must exist W1 ≠ W2 with C(W1) = C(W2)
– Let t be the oldest time in the window at which W1 and W2 differ, e.g. W1(t) = 3 but W2(t) = 4
– When the tuple at t expires: for W1, subtract W1(t) = 3; for W2, subtract W2(t) = 4

  • Cannot distinguish between the cases from C(W1) = C(W2)

– No correct compression scheme C(W) is possible

[Diagram: windows W1 and W2, identical except at the oldest differing position t]

SLIDE 23

Approximate Sum Calculation

  • Keep sums Σi for each group of n tuples in the window

– Compression ratio is 1/n
– Estimate of window sum ΣW is the total of the group sums:
  ΣW = Σ1 + Σ2 + ... + Σincomplete

  • Now v1 leaves the window and v2n+3 arrives:

– v1’s exact value is gone, so scale the oldest group’s sum down proportionally:
  ΣW = ((n − 1)/n)·Σ1 + Σ2 + ... + Σincomplete
– Accuracy of the approximation depends on the variance of the values

[Diagram: tuples v1 ... vn and vn+1 ... v2n form complete groups with sums Σ1, Σ2; the newest tuples v2n+1, v2n+2 form an incomplete group]
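A sketch of this estimator (names illustrative); with one expired tuple in the oldest group it reproduces the ((n − 1)/n)·Σ1 + Σ2 + ... + Σincomplete estimate above:

def estimate_window_sum(group_sums, incomplete_sum, n, expired_in_oldest):
    """Estimate SUM over the window from per-group sums (n tuples each),
    assuming at least one complete group exists.

    The exact values of expired tuples are gone, so the oldest group's
    sum is scaled by the fraction of its tuples still inside the window.
    """
    remaining_fraction = (n - expired_in_oldest) / n
    return group_sums[0] * remaining_fraction + sum(group_sums[1:]) + incomplete_sum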

SLIDE 24

Counting Bits

  • Assume sliding window W of size N contains bits 1 and 0

– How many 1s are there in the most recent k bits? (1 ≤ k ≤ N)

  • Could answer question trivially with O(N) storage

– But can we approximate answer with, say, logarithmic storage?

[Diagram: bit stream with window W of size N ending at the most recent tuple]

SLIDE 25

Approximate Counting with Buckets
  • Divide window into multiple buckets B(m, t)

– B(m, t) contains 2^m 1s and starts at t
– Size of buckets does not decrease as t increases
– Either one or two buckets for each size m
– Largest (oldest) bucket may be only partially inside the window

  • Estimate sum of 1s in the last k tuples, Σk:

Σk = {sizes of buckets within k} + ½·{size of last, partial bucket}
ΣN = 2^0 + 2^0 + 2^1 + 2^2 + ½·2^3 = 12 (exact answer: 13)

[Diagram: bit stream split into buckets B(0,1), B(0,2), B(1,4), B(2,6), B(3,11)]

SLIDE 26

Maintaining Buckets
  • Discard/merge buckets as window slides

– Discard largest (oldest) bucket once outside of window
– Create new bucket B(0, t) for each new tuple that is a 1
– Merge buckets to restore the invariant of at most 2 buckets of each size m

[Diagram: bucket B(3,11) is discarded as it leaves the window; two size-1 buckets merge into B(1,2), leaving B(0,1), B(1,2), B(1,5), B(2,7), B(3,12)]
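This bucket scheme is essentially DGIM. A compact sketch, assuming each bucket is tagged with the timestamp of its most recent 1 and that timestamps advance by one per arriving bit:

class BitCounter:
    """Approximate count of 1s in the last N bits using DGIM-style buckets.

    Buckets are (timestamp of most recent 1, size); sizes are powers of
    two with at most two buckets of each size.
    """
    def __init__(self, N):
        self.N, self.t = N, 0
        self.buckets = []                    # newest first

    def add(self, bit):
        self.t += 1
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()               # oldest bucket left the window
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))
            i = 0                            # merge triples of equal size
            while i + 2 < len(self.buckets) and \
                    self.buckets[i][1] == self.buckets[i + 2][1]:
                ts, size = self.buckets[i + 1]        # keep newer timestamp
                self.buckets[i + 1] = (ts, 2 * size)  # merge the two older ones
                del self.buckets[i + 2]
                i += 1

    def estimate(self, k):
        """Sizes of buckets within the last k bits, minus half the oldest."""
        total = last = 0
        for ts, size in self.buckets:
            if ts <= self.t - k:
                break
            total, last = total + size, size
        return total - last // 2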

SLIDE 27

Space Complexity

  • Need O(log N) buckets for window of size N
  • Need O(log N) bits to represent bucket B(m, t):

– Bucket sizes are powers of 2 up to N, so the exponent m ≤ log2 N can be represented with O(log log N) bits
– t is representable as t mod N, so t can be represented with O(log N) bits

  • Overall window compressed to O(log² N) bits
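A compact restatement of the counting argument:

O(\log N) \text{ buckets}
  \times \Big( \underbrace{O(\log N)}_{t \bmod N}
             + \underbrace{O(\log\log N)}_{\text{exponent } m} \Big)
  \text{ bits per bucket}
  = O(\log^2 N) \text{ bits}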

SLIDE 28

Stream Processing Systems

SLIDE 29

General DSPS Architecture

[Diagram: general DSPS architecture (source: Golab & Ozsu 2003)]

SLIDE 30

Stream Query Execution

  • Continuous queries are long-running

è properties of base streams may change

– Tuple distribution, arrival characteristics, query load, available CPU, memory and disk resources, system conditions, ...

  • Solution: Use adaptive query plans

– Monitor system conditions – Re-optimise query plans at run-time

  • DBMS didn’t quite have this problem...

SLIDE 31

Query Plan Execution

  • Executed query plans include:

– Operators
– Queues between operators
– State/“synopsis” (windows, ...)
– Base streams

  • Challenges

– State may get large (e.g. large windows)

SELECT *
FROM S1 [Rows 1000], S2 [Range 2 mins]
WHERE S1.A = S2.A AND S1.A > 42;

[Diagram: query plan with operators, queues and synopses (source: STREAM project)]

SLIDE 32

Operator Scheduling

  • Need scheduler to invoke operators (for time slice)

– Scheduling must be adaptive

  • Different scheduling disciplines possible:
  • 1. Round-robin
  • 2. Minimise queue length
  • 3. Minimise tuple delay
  • 4. Combination of the above

SLIDE 33

Load Shedding

  • DSMS must handle overload: tuples arrive faster than the processing rate

  • Two options when overloaded:
  • 1. Load shedding: Drop tuples
  • Much research on deciding which tuples to drop, trading result correctness against resource relief

  • e.g. sample tuples from stream
  • 2. Approximate processing:

Replace operators with approximate versions

  • Saves resources
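A sketch of the simplest shedding policy mentioned above, uniform random sampling; the rate parameters are illustrative:

import random

def shed(stream, input_rate, capacity):
    """Drop tuples uniformly at random so the expected kept rate fits capacity.

    Uniform sampling keeps aggregates approximately correct after
    rescaling by 1/keep_p, at the cost of exact results.
    """
    keep_p = min(1.0, capacity / input_rate)
    for item in stream:
        if random.random() < keep_p:
            yield item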

SLIDE 34

Distributed DSPS

SLIDE 35

Distributed DSPS

  • Interconnect multiple DSPSs with network

– Better scalability, handles geographically distributed stream sources

  • Interconnect on LAN or Internet?

– Different assumptions about time and failure models

[Diagram: DSPS nodes on a wide-area network connecting stream sources (scientific instruments, traffic monitors, mobile sensing devices, RFID tags, body sensor networks) to queries]

SLIDE 36

Stream Processing to the Rescue!

→ Process data streams on-the-fly: Apache S4, Twitter Storm, Nokia Dempsy, ...
→ Exploit intra-query parallelism for scale out

Most interesting operators are stateful

SLIDE 37

Query Planning in DSPS

  • Query Plan

– Operator placement
– Stream connections
– Resource allocation: CPU, network bandwidth, ...

  • State-of-the-art planners

– Based on heuristics (e.g. IBM’s SODA)
– Assume an over-provisioned system

  • Simplifies query planning
  • Not true when you pay for resources...

[Diagram: query plan placed across nodes, producing the final stream]

SLIDE 38

Planning Challenges

  • Premature exhaustion of resources
    → multi-resource constraints

  • Waste of resources due to query overlap
    → reuse streams

SLIDE 39

SQPR: Stream Query Planning with Reuse [ICDE’11]

  • Unified optimisation problem for

– query admission
– operator allocation
– stream reuse

  • This is hard!

– Solve approximate problem to obtain tractable solution

maximise: λ1·(no. of satisfied queries) − λ2·(CPU usage) − λ3·(network usage) − λ4·(load imbalance)

subject to constraints:

  • 1. availability: streams for operators exist on nodes
  • 2. resource: allocations within resource limits
  • 3. demand: final query streams are eventually generated
  • 4. acyclicity: all streams come from real sources

Evangelia Kalyvianaki, Wolfram Wiesemann, Quang Hieu Vu and Peter Pietzuch, “SQPR: Stream Query Planning with Reuse”, IEEE International Conference on Data Engineering (ICDE), Hannover, Germany, April 2011

SLIDE 40

Tractable Optimisation Model

  • Idea: Only optimise over streams related to new query

– Add relay operators to work around constraints under reuse

SLIDE 41

Scalable Stream Processing

SLIDE 42

Stream Processing in the Cloud

  • Clouds provide virtually infinite pools of resources

– Fast and cheap access to new machines for operators
– Needlessly overprovisioning the system is expensive
– Using too few nodes leads to poor performance

→ How do you decide on the optimal number of VMs?

[Diagram: streams fan out to n virtual machines in a cloud data centre and produce results]

SLIDE 43

Challenge 1: Elastic Data-Parallel Processing

  • Typical stream processing workloads are bursty

[Chart: cluster utilisation over time, 09/07 to 09/13 (courtesy of MSRC)]

High + bursty input rates → detect bottleneck + parallelise

SLIDE 44

Challenge 2: Fault-Tolerant Processing

Large-scale deployment → handle node failures

  • Failure is a common occurrence

– Active fault-tolerance requires 2x resources – Passive fault-tolerance leads to long recovery times

SLIDE 45

State in Stream Processing

→ Most online machine learning algorithms require state

  • Consider a streaming recommender application (collaborative filtering) running on a stream processing system (e.g. Twitter Storm, Yahoo S4, ...)

– User activities (e.g. item purchases, page views, clicks, ...) stream in; recommendations stream out
– Processing state: matrix of user/item ratings (e.g. user A rated item 2, user B rated item 1)

SLIDE 46

State Complicates Things…

  • 1. Dynamic scale out impacts state
  • 2. Recovery from failures

[Diagram: scale out requires partitioning of state; a node failure causes loss of state]

SLIDE 47

Current Approaches for Stateful Processing

  • Stateless stream processing systems (e.g. Yahoo S4, Twitter Storm, ...)

– Developers manage state
– Typically combined with an external system to store state (e.g. Cassandra)
– Design complexity

  • Relational stream processing systems (e.g. Borealis, STREAM)

– State is a window over the stream
– No support for arbitrary state
– Hard to realise complex ML algorithms

[Diagram: window of tuples over the stream as the only form of state]

SLIDE 48

Stateful Stream Processing Model


Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, and Peter Pietzuch, "Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management”, ACM International Conference on Management of Data (SIGMOD), New York, NY, June 2013

  • Operators can maintain arbitrary state
  • State management primitives to:

– Backup and recover state – Partition state

  • Integrated mechanism for scale out and failure recovery

– Operator recovery and scale out equivalent from state perspective

SLIDE 49

Idea: State as First Class Citizen

  • Operators have direct access to state
  • System manages state

→ Expose operator state as an external entity so that it can be managed by the stream processing system

SLIDE 50

Operator State Management

  • State cannot be lost, or stream results are affected
  • On scale out:

– Partition operator state correctly, maintaining consistency

  • On failure recovery:

– Restore state of failed operator
– Define primitives for state management and build other mechanisms on top of them

→ Make operator state an external entity that can be managed by the stream processing system

SLIDE 51

What is State?

  • Processing state: e.g. the matrix of user/item ratings of the recommender
  • Routing state: the dynamic data flow graph; based on the data, A → B or A → C
  • Buffer state: tuples (Data ts1, Data ts2, Data ts3, Data ts4) buffered between operators

[Diagram: operators A, B, C with their processing, routing and buffer state]

SLIDE 52

State Management Primitives

  • Checkpoint: makes state available to the system and attaches the timestamp ts of the last processed tuple
  • Backup and Restore: move a copy of state from one operator to another
  • Partition: splits the state of operator A to scale it out into A1 and A2

[Diagram: state checkpointed with timestamp ts, backed up, restored, and partitioned from A into A1 and A2]
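A hypothetical sketch of the shape of these primitives; the class and method names are illustrative, not SEEP’s actual API:

class StatefulOperator:
    def __init__(self):
        self.state = {}        # processing state as a (key, value) dictionary
        self.last_ts = None    # timestamp of the last processed tuple

    def checkpoint(self):
        """Expose state to the system, tagged with the last processed
        timestamp so upstream buffers can be trimmed and later replayed."""
        return {"ts": self.last_ts, "state": dict(self.state)}

    def restore(self, snapshot):
        """Adopt a checkpoint, e.g. on a new operator after a failure."""
        self.state = dict(snapshot["state"])
        self.last_ts = snapshot["ts"]

def backup(snapshot, node):
    """Move a copy of checkpointed state to another (upstream) node."""
    node.store(snapshot)       # hypothetical transport call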

SLIDE 53

State Primitives: Backup and Restore

[Diagram: operator A checkpoints after processing Data t1 and Data t2; the checkpoint, tagged t2, is backed up to B and can later be restored; Data t3 and Data t4 remain buffered for replay]

SLIDE 54

State Primitives: Partition

  • Processing state modelled as a (key, value) dictionary
  • State partitioned according to key k of tuples

– Same key used to partition streams

[Diagram: state keyed on userId 0 to n is split at x; A1 receives keys 0 to x, A2 receives keys x to n]
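A sketch of key-range partitioning of dictionary state, mirroring the userId split in the diagram; the split point and keys are illustrative:

def partition_state(state, split_key):
    """Split (key, value) state by key range; the input stream is
    partitioned on the same key so tuples meet their state."""
    low = {k: v for k, v in state.items() if k < split_key}
    high = {k: v for k, v in state.items() if k >= split_key}
    return low, high

# e.g. userId keys split at x = 8: A1 gets [0, 8), A2 gets [8, n]
a1_state, a2_state = partition_state({3: "u3", 7: "u7", 9: "u9"}, split_key=8)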

SLIDE 55

Scale Out and Failure Recovery

Two cases for operator B, downstream of A:

  • Operator B becomes a bottleneck → scale out
  • Operator B fails → recover

SLIDE 56

Scaling Out Stateful Operators

  • Periodically, stateful operators checkpoint and back up state to a designated upstream backup node
  • For scale out, the backup node already has the state of the operator to be parallelised
  • A new operator B1 is created; B’s state is partitioned between B and B1 and restored

→ Checkpoint → Backup → Partition → Restore

[Diagram: upstream operator A holds B’s checkpoint; the state is split between B and the new operator B1]

Finally, upstream operators replay unprocessed tuples to update checkpointed state
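An orchestration sketch of these four steps, reusing the hypothetical primitives and partition_state from the earlier sketches; latest_backup and replay_since are likewise illustrative:

def scale_out(upstream, bottleneck, new_operator, split_key):
    """Checkpoint -> Backup -> Partition -> Restore, then replay the delta."""
    snapshot = upstream.latest_backup(bottleneck)   # from periodic backups
    low, high = partition_state(snapshot["state"], split_key)
    bottleneck.restore({"ts": snapshot["ts"], "state": low})
    new_operator.restore({"ts": snapshot["ts"], "state": high})
    upstream.replay_since(snapshot["ts"])           # update checkpointed state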

SLIDE 57

Recovering Failed Operators

  • Use backed-up state to recover quickly: a new operator is created, the state is restored (→ Restore) and unprocessed tuples are replayed from the upstream buffer

[Diagram: failed operator B replaced by a new operator initialised from the backup]

SLIDE 58

SEEP Stream Processing System

[Diagram: SEEP architecture on EC2: a coordinator with query manager, deployment manager, fault detector, bottleneck detector, scaling policy and VM pool]

  • Experimental stateful stream processing platform
  • Implements dynamic scale out and recovery

– Detect failed or overloaded operators – Have fast access to new VMs

SLIDE 59

Detecting Bottlenecks

[Diagram: operators report CPU utilisation (35%, 85%, 30%) into a local infrastructure view; the bottleneck detector flags the operator at 85%]

SLIDE 60

VM Pool for Adding Operators

  • Problem: Allocating new VMs takes minutes...

  • Keep a dynamically sized pool of pre-provisioned VMs:

– Bottleneck detected → select a pre-provisioned VM from the pool (order of seconds)
– Provision a replacement VM from the cloud provider and add it to the pool (order of minutes)

[Diagram: bottleneck detector feeds monitoring information into the scale-out decision; VMs are drawn from the pool while the cloud provider refills it]
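A sketch of the pool idea; provider.provision() stands in for a real cloud API call:

class VMPool:
    """Keep pre-provisioned VMs so scale-out takes seconds, not minutes."""
    def __init__(self, provider, target_size):
        self.provider = provider
        self.pool = [provider.provision() for _ in range(target_size)]  # slow, up front

    def acquire(self):
        if self.pool:
            return self.pool.pop()        # fast path: order of seconds
        return self.provider.provision()  # slow path: order of minutes

    def refill(self):
        """Run in the background after each acquire to restore pool size."""
        self.pool.append(self.provider.provision())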

SLIDE 61

Evaluation: Goals and Methodology

  • 1. Effectiveness of dynamic scale out
  • 2. Measurement of failure recovery time
  • 3. Overhead of state management

Workload: Linear Road Benchmark [VLDB’04]

– Operator state depends on whole stream history
– Input stream rate increases over time according to load factor L
– SLA: results < 5 secs
– Data flow graph with 7 operators

Deployed SEEP on Amazon EC2

SLIDE 62
Scale Out with Elastic Workload

E SEEP scales out dynamically with low impact on latency

Scales to load factor L=350 with 60 VMs on Amazon EC2

  • L=512 is the highest reported result [VLDB’12]

SLIDE 63

Upstream Backup

  • Upstream backup: saves all tuples in upstream buffers until acknowledged
  • Source replay: saves tuples only at the source

[Diagram: data and ACKs flowing across operators under upstream backup vs. source replay]

SLIDE 64

Failure Recovery Time

Workload: Windowed word counting query

– 30 sec window with 5 sec checkpointing interval

E Checkpointing leads to smaller buffers

SLIDE 65

Overhead of Checkpointing

E Tradeoff between latency and recovery time

SLIDE 66

Related Work

  • Scalable stream processing systems

– Twitter Storm, Yahoo S4, Nokia Dempsy: exploit operator parallelism, mainly for stateless queries
– ParaSplit operator [VLDB’12]: partitions streams for intra-query parallelism

  • Support for elasticity

– StreamCloud [TPDS’12]: dynamic scale out/in for a subset of relational stream operators
– Esc [ICCC’11]: dynamic support for stateless scale out

  • Resource-efficient fault tolerance models

– Active Replication at (almost) no cost [SRDS’11]: uses under-utilised machines to run operator replicas
– Discretized Streams [HotCloud’12]: data is checkpointed and recovered in parallel in the event of failure

SLIDE 67

Conclusions


  • Stream processing will grow in importance

– Handling the data deluge
– Just provide a view/window on a subset of the data
– Enables real-time response and decision making

  • Principled models to express stream processing semantics

– Enables automatic optimisation of queries, e.g. finding parallelism
– What is the right model?

  • Resource allocation matters due to long running queries

– High stream rates and many queries require scalable systems
– Handling overload becomes a crucial requirement
– Volatile workloads benefit from an elastic DSPS in cloud environments

SLIDE 68

Thank You! Any Questions?


Peter Pietzuch

<prp@doc.ic.ac.uk> http://lsds.doc.ic.ac.uk