SLIDE 1

Systems Infrastructure for Data Science

Web Science Group, Uni Freiburg, WS 2012/13

SLIDE 2

Data Stream Processing

SLIDE 3

Today’s Topic

  • Stream Processing

– Model Issues
– System Issues
– Distributed Processing Issues

SLIDE 4

Distributed Stream Processing

Motivation

  • Distributed data sources
  • Performance and Scalability
  • High availability and Fault tolerance

SLIDE 5

Design Options for Distributed DSMS

  • Almost the same split as with distributed databases vs. cloud databases

  • Currently, most of the work is on fairly tightly coupled, strongly maintained distributed DSMSs

  • We will study a number of general/traditional approaches for most of the lecture, and look at some ideas for cloud‐based streaming

  • As usual, distributed processing is about tradeoffs!

SLIDE 6

Distributed Stream Processing

Borealis Example

[Figure: Borealis example: push‐based data sources feed a network of Aurora/Borealis processing nodes, and results are delivered to end‐point applications.]

SLIDE 7

Distributed Stream Processing

Major Problem Areas

  • Load distribution and balancing

– Dynamic / Correlation‐based techniques
– Static / Load‐resilient techniques
– (Network‐aware techniques)

  • Distributed load shedding
  • High availability and Fault tolerance

– Handling node failures
– Handling link failures (esp. network partitions)

SLIDE 8

Load Distribution

  • Goal: to distribute a given set of continuous query operators onto multiple stream processing server nodes

  • What makes an operator distribution good?

– Load balance across nodes
– Resiliency to load variations
– Low operator migration overhead
– Low network bandwidth usage

SLIDE 9

Correlation‐based Techniques

  • Goals:

– to minimize end‐to‐end query processing latency
– to balance load across nodes to avoid overload

  • Key ideas:

– Group boxes (operators) with small load correlation together

 helps minimize the overall load variance on that node
 keeps the node load steady as input rates change

– Maximize load correlation among nodes

 helps minimize the need for load migration

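To make the grouping idea concrete, here is a minimal sketch (hypothetical, not the Borealis algorithm; the greedy strategy, helper names, and statistics window are assumptions) that places operators so that the load time series grouped on a node are as uncorrelated as possible:

```python
# Hypothetical sketch of correlation-based operator placement. Assumes each
# operator's load time series was sampled over the same statistics window.
import numpy as np

def correlation(x, y):
    """Pearson correlation of two load time series (0.0 if either is constant)."""
    if x.std() == 0 or y.std() == 0:
        return 0.0
    return float(np.corrcoef(x, y)[0, 1])

def place_operators(op_loads, num_nodes):
    """Greedy placement: put each operator (heaviest first) on the node whose
    current load series correlates least with it, breaking ties toward the
    node with the lower average load."""
    node_ops = [[] for _ in range(num_nodes)]
    node_series = [None] * num_nodes

    ordered = sorted(op_loads, key=lambda o: op_loads[o].mean(), reverse=True)
    for op in ordered:
        series = op_loads[op]
        def key(n):
            if node_series[n] is None:
                return (-1.0, 0.0)                        # empty node: best target
            return (correlation(node_series[n], series),  # small correlation first,
                    node_series[n].mean())                # then the lighter node
        best = min(range(num_nodes), key=key)
        node_ops[best].append(op)
        node_series[best] = series if node_series[best] is None else node_series[best] + series
    return node_ops
```
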
SLIDE 10

Example

Connected Plan vs. Cut Plan

[Figure: Two inputs r1 and r2 fluctuate in anti‐phase between rate r and 2r; each input feeds two operators of cost c. In the Connected Plan, each node hosts both operators of one input, so node load swings between 2cr and 4cr. In the Cut Plan, each node hosts one operator from each input, so node load stays at 3cr.]

SLIDE 11

Example: Cut Plan beats the Connected Plan

SLIDE 12

Formal Problem Definition

  • n: number of server nodes
  • Xi: load time series of node Ni
  • ρij: correlation coefficient of Xi and Xj, 1 ≤ i, j ≤ n
  • Find a plan that maps operators to nodes with the following properties:

– EX1 ≈ EX2 ≈ … ≈ EXn (average node loads are balanced)
– (1/n) ∑i var(Xi), the average load variance, is minimized, or ∑ ρij over all pairs 1 ≤ i < j ≤ n is maximized.
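A minimal sketch of how the two objectives could be evaluated for a candidate mapping, assuming the aggregate load time series of each node is available as a numpy array (names are illustrative):

```python
# Hypothetical sketch: evaluate the two objectives for a candidate mapping,
# given each node's aggregate load time series X[i] as a numpy array.
import numpy as np

def balance_and_variance(X):
    means = [x.mean() for x in X]                  # should be roughly equal across nodes
    avg_variance = np.mean([x.var() for x in X])   # (1/n) * sum_i var(X_i): minimize this
    n = len(X)
    corr_sum = sum(np.corrcoef(X[i], X[j])[0, 1]   # sum of rho_ij over i < j: maximize this
                   for i in range(n) for j in range(i + 1, n))
    return means, avg_variance, corr_sum
```
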

SLIDE 13

Dynamic Load Distribution Algorithms

  • Periodically repeat (a sketch follows below):
  • 1. Collect load statistics from all nodes.
  • 2. Order nodes by their average load.
  • 3. Pair the ith node with the (n‐i+1)th node.
  • 4. If there exists a pair (A, B) such that |A.load – B.load| ≥ threshold, then move operators between them to balance their average load and to minimize their average load variance.

  • Two load movement algorithms for pairs in Step 4:

– One‐way
– Two‐way

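A minimal sketch of this periodic loop, assuming simple node objects with an avg_load field and a move_load callback that stands in for the one‐way/two‐way algorithms of the next slides:

```python
# Hypothetical sketch of the periodic pairing step (names and the balance
# threshold are assumptions, not the Borealis implementation).
def rebalance(nodes, threshold, move_load):
    """nodes: list of node objects with an .avg_load attribute;
    move_load(heavy, light): shifts operators from heavy to light
    (the one-way or two-way algorithm)."""
    # 1. + 2. Collect load statistics and order nodes by average load.
    ordered = sorted(nodes, key=lambda n: n.avg_load)
    n = len(ordered)
    # 3. Pair the i-th least loaded node with the i-th most loaded node.
    for i in range(n // 2):
        light, heavy = ordered[i], ordered[n - 1 - i]
        # 4. Move operators only if the imbalance exceeds the threshold.
        if abs(heavy.avg_load - light.avg_load) >= threshold:
            move_load(heavy, light)
```
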
SLIDE 14

One‐way Algorithm

  • Given a pair (A, B) that must move load, the node with the higher load (say A) offloads half of its excess load to the other node (B).

  • Operators of A are ordered based on a score, and the operator with the largest score is moved to B until balance is achieved (see the sketch below).

  • Score of an operator O is computed as follows:

score(O) = correlation_coefficient(O, other operators at A) − correlation_coefficient(O, other operators at B)

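A minimal sketch of the one‐way move, assuming hypothetical node/operator structures and a Pearson‐correlation helper like the one sketched earlier:

```python
# Hypothetical sketch of the one-way move; node/operator structures and the
# correlation() helper (Pearson correlation of load series) are assumptions.
def one_way_move(A, B, correlation):
    """A is the more loaded node: offload half of its excess load to B,
    moving the highest-scoring operator first until balance is reached."""
    excess = (A.avg_load - B.avg_load) / 2.0
    moved = 0.0
    while moved < excess and A.operators:
        def score(op):
            rest_of_a = A.load_series_without(op)   # A's load excluding op
            return (correlation(op.load_series, rest_of_a)
                    - correlation(op.load_series, B.load_series))
        op = max(A.operators, key=score)
        A.remove(op)
        B.add(op)
        moved += op.avg_load
```
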
SLIDE 15

Two‐way Algorithm

  • All operators in a given pair can be moved in both ways (see the sketch below).
  • Assume both nodes are initially empty.
  • Score all the operators.
  • Select the largest‐score operator and place it at the less loaded node.
  • Continue until all operators are placed.
  • The two‐way algorithm can result in a better placement.
  • But its load migration cost would be higher.

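A minimal sketch of the two‐way redistribution under the same assumptions (the exact scoring may differ from the original algorithm; this reuses the one‐way score for illustration and assumes correlation with an empty node is 0):

```python
# Hypothetical sketch of the two-way redistribution over a node pair;
# structures and the correlation() helper are assumptions.
def two_way_redistribute(A, B, correlation):
    """Re-place all operators of the pair from scratch: repeatedly pick the
    operator with the largest score and put it on the less loaded node."""
    pending = list(A.operators) + list(B.operators)
    A.clear()
    B.clear()                          # conceptually start from two empty nodes
    while pending:
        target = A if A.avg_load <= B.avg_load else B
        other = B if target is A else A
        def score(op):
            # Prefer operators correlated with what already runs on the other
            # node and uncorrelated with what runs on the target node.
            return (correlation(op.load_series, other.load_series)
                    - correlation(op.load_series, target.load_series))
        op = max(pending, key=score)
        pending.remove(op)
        target.add(op)
```
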
SLIDE 16

Load‐resilient Techniques

  • Goal: to tolerate as many load conditions as possible without the need for operator migration.

  • Resilient Operator Distribution (ROD)

– ROD does not become overloaded easily in the face of fluctuating input rates.
– Key idea: maximize the area of the feasible input‐rate region, i.e., the set of input‐rate combinations under which no node is overloaded (see the sketch below).

[Figure: feasible input‐rate regions of two operator distributions; ROD prefers the distribution whose region is larger ("maximize this area!").]
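A minimal sketch of the feasibility notion behind ROD, assuming normalized node capacity and hypothetical operator objects that expose their cost and arriving rate:

```python
# Hypothetical sketch of the feasibility notion behind ROD (normalized node
# capacity and the op.cost / op.rate_from(...) attributes are assumptions).
def node_load(ops, input_rates):
    """Load of one node: sum over its operators of cost * arriving rate."""
    return sum(op.cost * op.rate_from(input_rates) for op in ops)

def is_feasible(distribution, input_rates):
    """distribution: one list of operators per node. An input-rate point is
    feasible if no node exceeds its (normalized) capacity of 1.0."""
    return all(node_load(ops, input_rates) <= 1.0 for ops in distribution)

# ROD's goal, informally: among candidate distributions, prefer the one whose
# feasible region (the set of rate points passing is_feasible) is largest, so
# that input-rate fluctuations rarely force operator migration.
```
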

SLIDE 17

Comparison of Approaches

Correlation‐based

  • Dynamic
  • Medium‐to‐long‐term load variations
  • Periodic operator movement

Load‐resilient

  • Static
  • Short‐term load fluctuations
  • No operator movement

SLIDE 18

Distributed Stream Processing

Major Problem Areas

  • Load distribution and balancing

– Dynamic / Correlation‐based techniques
– Static / Load‐resilient techniques
– (Network‐aware techniques)

  • Distributed load shedding
  • High availability and Fault tolerance

– Handling node failures
– Handling link failures (esp. network partitions)

SLIDE 19

Distributed Load Shedding

  • Problem: One or more servers can be overloaded.
  • Goal: Remove excess load from all of them with minimal quality loss at query end‐points.

  • There is a load dependency among the servers.
  • To keep quality under control, servers must coordinate in their load shedding decisions.

SLIDE 20

Distributed Load Shedding

Load Dependency

[Figure: two queries, each passing through Node A and then Node B. At Node A the operators have cost 1 (query 1) and cost 2 (query 2); at Node B they have cost 3 (query 1) and cost 1 (query 2); all selectivities are 1.0. Both inputs arrive at 1 tuple/sec, node capacity is normalized to 1 (A.load ≤ 1, B.load ≤ 1), and the goal is to maximize total throughput; without shedding each output is limited to 1/4 tuple/sec.]

Plan | Rates at A | A.load | A.throughput | B.load | B.throughput
–    | 1, 1       | 3      | 1/3, 1/3     | 4/3    | 1/4, 1/4
1    | 1, 0       | 1      | 1, 0         | 3      | 1/3, 0
2    | 0, 1/2     | 1      | 0, 1/2       | 1/2    | 0, 1/2
3    | 1/5, 2/5   | 1      | 1/5, 2/5     | 1      | 1/5, 2/5

Plan 1 is optimal for A in isolation, but it overloads B (B.load = 3). Plan 2 is feasible for both nodes, but not optimal. Plan 3 is both feasible for both nodes and optimal for the whole system.

Server nodes must coordinate in their load shedding decisions to achieve high-quality results.
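The arithmetic behind the table can be reproduced with a few lines (a sketch; it models an overloaded node as uniformly processing only the fraction of input it can keep up with, which matches the numbers above):

```python
# Two queries flow through Node A and then Node B; all selectivities are 1.0.
COST_A = (1, 2)    # per-tuple cost of each query's operator at Node A
COST_B = (3, 1)    # per-tuple cost of each query's operator at Node B
CAPACITY = 1.0     # normalized node capacity

def node_output(rates_in, costs):
    """Load and output rates of one node given its input rates."""
    load = sum(r * c for r, c in zip(rates_in, costs))
    # If overloaded, the node only keeps up with a CAPACITY/load fraction.
    keep = min(1.0, CAPACITY / load) if load > 0 else 1.0
    return load, tuple(r * keep for r in rates_in)

def evaluate(rates_at_a):
    a_load, a_out = node_output(rates_at_a, COST_A)
    b_load, b_out = node_output(a_out, COST_B)
    return a_load, a_out, b_load, b_out

for plan in [(1, 1), (1, 0), (0, 0.5), (0.2, 0.4)]:
    print(plan, evaluate(plan))   # reproduces the four rows of the table
```
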

SLIDE 21

Distributed Load Shedding

as a Linear Optimization Problem

Find x1, …, xD (one keep fraction per input path, 0 ≤ xj ≤ 1) such that

– for every node i, 1 ≤ i ≤ N:  ∑_{j=1..D} rj · xj · s_j^i · c_{i,j} ≤ 1  (no node is overloaded), and
– ∑_{j=1..D} pj · rj · xj · sj  (the total weighted throughput delivered to the query end‐points) is maximized,

where rj is the input rate of path j, xj the fraction of that input that is kept, c_{i,j} and s_{i,j} the cost and selectivity of path j’s operator at node i, s_j^i the product of path j’s selectivities upstream of node i, sj its end‐to‐end selectivity, and pj the weight of its output.

[Figure: D query paths with input rates r1 … rD and keep fractions x1 … xD flow through nodes 1 … N; path j has cost c_{i,j} and selectivity s_{i,j} at node i, and its output is weighted by pj.]
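A minimal sketch of this optimization with an off‐the‐shelf LP solver (using scipy is an assumption; the instance is the two‐node example from the previous slide):

```python
# Solve the distributed load shedding LP for the two-node example above:
# maximize total weighted throughput subject to per-node load constraints.
import numpy as np
from scipy.optimize import linprog

r = np.array([1.0, 1.0])          # input rates r_j
p = np.array([1.0, 1.0])          # output weights p_j
s = np.array([1.0, 1.0])          # end-to-end selectivities s_j
cost = np.array([[1.0, 2.0],      # c_{i,j} at Node A
                 [3.0, 1.0]])     # c_{i,j} at Node B
cum_sel = np.ones_like(cost)      # s_j^i = 1 here since all selectivities are 1.0

# Variables: x_j in [0, 1]. linprog minimizes, so negate the objective.
objective = -(p * r * s)
A_ub = cost * cum_sel * r         # row i: sum_j r_j * x_j * s_j^i * c_{i,j} <= 1
b_ub = np.ones(cost.shape[0])
res = linprog(objective, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * len(r))
print(res.x)                      # -> approximately [0.2, 0.4], i.e., Plan 3
```
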

SLIDE 22

Distributed Stream Processing

Major Problem Areas

  • Load distribution and balancing

– Dynamic / Correlation‐based techniques
– Static / Load‐resilient techniques
– (Network‐aware techniques)

  • Distributed load shedding
  • High availability and Fault tolerance

– Handling node failures
– Handling link failures (esp. network partitions)

SLIDE 23

High Availability and Fault Tolerance

Overview

  • Problem: node failures and network link failures

– Query execution stalls
– Queries produce incorrect results
  • Requirements:

– Consistency -> Avoid lost, duplicate, or out‐of‐order data
– Performance -> Avoid overhead during normal processing + overhead during failure recovery

  • Major tasks:

– Failure preparation -> Replication of volatile processing state
– Failure detection -> Timeouts
– Failure recovery -> Replica coordination upon failure

SLIDE 24

High Availability and Fault Tolerance

General Approach

  • Adapt traditional approaches to stream processing
  • Two general approaches:

– State‐machine approach

  • Replicate the processing on multiple nodes
  • Send all the nodes the same input in the same order
  • Advantage: Fast fail‐over
  • Disadvantage: High resource requirements

– Rollback recovery approach (sketched below)

  • Periodically check‐point processing state to other nodes
  • Log input between check‐points
  • Advantage: Low run‐time overhead
  • Disadvantage: High recovery time
  • Different trade‐offs can be made among:

– Availability, Run‐time overhead, and Consistency

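To make the rollback recovery (passive standby) idea concrete, here is a minimal sketch with a hypothetical structure (not a Borealis API): the primary checkpoints its state periodically and logs input between checkpoints, and recovery replays the log on top of the last checkpoint:

```python
# Hypothetical sketch of rollback recovery: periodic checkpoints of operator
# state plus a log of the input tuples that arrived since the last checkpoint.
import copy

class PassiveStandby:
    def __init__(self, operator_state):
        self.state = operator_state      # volatile processing state (e.g., windows)
        self.checkpoint = copy.deepcopy(operator_state)
        self.log = []                    # input tuples since the last checkpoint

    def process(self, tup, apply_tuple):
        self.log.append(tup)             # log first, then update the state
        apply_tuple(self.state, tup)

    def take_checkpoint(self):
        # In a real system the checkpoint is shipped to another node.
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()                 # logged input is now covered by the checkpoint

    def recover(self, apply_tuple):
        # Low run-time overhead, but recovery must replay the whole log.
        self.state = copy.deepcopy(self.checkpoint)
        for tup in self.log:
            apply_tuple(self.state, tup)
        return self.state
```
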
SLIDE 25

Handling Node Failures

[Figure: approaches for handling node failures: Active Replicas and Passive Replicas (Passive Standby, Upstream Backup).]

SLIDE 26

Active Replicas

SLIDE 27

Passive Standby

SLIDE 28

Upstream Backup

SLIDE 29

Run‐time Overhead vs. Recovery Time Trade‐off

  • Active Replicas:

– High run‐time overhead
– Fast fail‐over (i.e., low recovery time)

  • Passive Standby:

– Check‐point interval can be flexibly adjusted

  • Upstream Backup:

– Low run‐time overhead
– Recovery time is proportional to the size of the upstream buffers

SLIDE 30

Handling Network Partitions

  • “Network partitions” occur when data sources, processing nodes, and clients are split into disconnected partitions due to network failures.

  • Two general options:

– Suspend processing to avoid inconsistency.
– Continue processing to avoid unavailability.

  • Delay‐Process‐Correct (DPC) Protocol

– Adjusts the trade‐off between consistency and availability using a maximum tolerable latency threshold and tentative tuples.

SLIDE 31

Other Advanced HA Techniques

  • Cooperative and Self‐configuring HA [Borealis]

– Each server node is backed up by multiple servers in a cooperative fashion, which can take over processing in parallel.
– Backup assignment dynamically changes to balance HA load.
– Wide‐area extensions

  • Integrating Fault Tolerance with Load Balancing [Flux]

– Fine‐granularity dataflow partitions
– Rebalance load after failure recovery
