Systems Infrastructure for Data Science
Web Science Group, Uni Freiburg, WS 2012/13
Data Stream Processing
Today’s Topic
- Stream Processing
  – Model Issues
  – System Issues
  – Distributed Processing Issues
Distributed Stream Processing
Motivation
- Distributed data sources
- Performance and scalability
- High availability and fault tolerance
Design Options for Distributed DSMS
- Almost the same split as with distributed databases vs. cloud databases
- Currently, most of the work is on fairly tightly coupled, strongly maintained distributed DSMSs
- We will study a number of general/traditional approaches for most of the lecture, then look at some ideas for cloud‐based streaming
- As usual, distributed processing is about tradeoffs!
Distributed Stream Processing Borealis Example
[Figure: Borealis architecture: push‐based data sources feed a network of Aurora/Borealis processing nodes, which deliver results to end‐point applications.]
Distributed Stream Processing
Major Problem Areas
- Load distribution and balancing
  – Dynamic / Correlation‐based techniques
  – Static / Load‐resilient techniques
  – (Network‐aware techniques)
- Distributed load shedding
- High availability and Fault tolerance
  – Handling node failures
  – Handling link failures (esp. network partitions)
Load Distribution
- Goal: to distribute a given set of continuous query operators onto multiple stream processing server nodes
- What makes an operator distribution good?
  – Load balance across nodes
  – Resiliency to load variations
  – Low operator migration overhead
  – Low network bandwidth usage
Correlation‐based Techniques
- Goals:
  – to minimize end‐to‐end query processing latency
  – to balance load across nodes to avoid overload
- Key ideas:
  – Group boxes (operators) with small load correlation together: this helps minimize the overall load variance on that node and keeps the node load steady as input rates change.
  – Maximize load correlation among nodes: this helps minimize the need for load migration.
Example
[Figure: Two queries, each a chain of two operators of cost c, fed by streams r1 and r2 whose rates alternate between r and 2r in opposite phase (so r1 + r2 = 3r at all times). Connected plan: each node hosts one complete chain, so its load swings between 2cr and 4cr. Cut plan: each node hosts one operator from each chain, so each node's load stays constant at 3cr.]
Example: Cut Plan beats the Connected Plan
Formal Problem Definition
- n: number of server nodes
- Xi: load time series of node Ni
- ρij: correlation coefficient of Xi and Xj, 1 ≤ i, j ≤ n
- Find a plan that maps operators to nodes with the following properties:
  – E[X1] ≈ E[X2] ≈ … ≈ E[Xn] (average node loads are balanced), and
  – (1/n) · Σ_{i=1..n} var(Xi) is minimized, or equivalently Σ_{1≤i<j≤n} ρij is maximized (see the sketch below).
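To make the objective concrete, here is a minimal numpy sketch (names and data layout are illustrative, not Borealis code) that scores a candidate operator‐to‐node mapping by exactly these criteria: balanced mean loads, low average load variance, and high pairwise node‐load correlation.

```python
import numpy as np

def plan_quality(op_loads, assignment, n_nodes):
    # Build each node's load time series X_i by summing its operators' loads.
    X = np.zeros((n_nodes, op_loads.shape[1]))
    for op, node in enumerate(assignment):
        X[node] += op_loads[op]
    mean_loads = X.mean(axis=1)            # E[X_i]: should be roughly balanced
    avg_variance = X.var(axis=1).mean()    # (1/n) * sum_i var(X_i): minimize
    rho = np.corrcoef(X)                   # pairwise correlation matrix rho_ij
    total_corr = rho[np.triu_indices(n_nodes, k=1)].sum()  # sum_{i<j}: maximize
    return mean_loads, avg_variance, total_corr

# Tiny usage example: 4 operators, 2 nodes, synthetic load time series.
rng = np.random.default_rng(0)
op_loads = rng.random((4, 50))
print(plan_quality(op_loads, assignment=[0, 1, 0, 1], n_nodes=2))
```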
Dynamic Load Distribution Algorithms
- Periodically repeat (one pass is sketched below):
  1. Collect load statistics from all nodes.
  2. Order nodes by their average load.
  3. Pair the i‐th node with the (n‐i+1)‐th node.
  4. If there exists a pair (A, B) such that |A.load – B.load| ≥ threshold, then move operators between them to balance their average load and to minimize their average load variance.
- Two load movement algorithms for pairs in Step 4:
  – One‐way
  – Two‐way
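A minimal sketch of one pass of this loop, assuming a hypothetical `move_load(a, b)` callback that performs the actual one‐way or two‐way operator movement between a pair:

```python
def balance_pass(nodes, threshold, move_load):
    """One iteration of the periodic loop above. `nodes` maps node id ->
    average load; `move_load(a, b)` stands in for the one-way/two-way
    operator movement between a pair (both names are illustrative)."""
    # Steps 1-2: collect statistics and order nodes by average load (descending).
    ordered = sorted(nodes, key=nodes.get, reverse=True)
    n = len(ordered)
    # Step 3: pair the i-th most loaded node with the (n-i+1)-th.
    for i in range(n // 2):
        a, b = ordered[i], ordered[n - 1 - i]
        # Step 4: move operators only if the pair is imbalanced enough.
        if abs(nodes[a] - nodes[b]) >= threshold:
            move_load(a, b)

# Usage: pairs (n1, n4) and (n2, n3); only the first exceeds the threshold.
loads = {"n1": 0.9, "n2": 0.6, "n3": 0.5, "n4": 0.2}
balance_pass(loads, threshold=0.3, move_load=lambda a, b: print("rebalance", a, b))
```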
One‐way Algorithm
- Given a pair (A, B) that must move load, the node with the higher load (say A) offloads half of its excess load to the other node (B).
- Operators of A are ordered by a score, and the operator with the largest score is moved to B, repeating until balance is achieved.
- The score of an operator O is computed as (see the sketch below):
  score(O) = correlation_coefficient(O, other operators at A) – correlation_coefficient(O, operators at B)
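The following sketch illustrates the greedy move under the score defined above; the dict‐based node representation and helper names are assumptions of this illustration, not the Borealis implementation.

```python
import numpy as np

def corr(x, y):
    # Pearson correlation coefficient of two load time series.
    return float(np.corrcoef(x, y)[0, 1])

def one_way_offload(sender, receiver):
    """Sketch of the one-way move. Each node is a dict
    {operator_name: load_time_series (np.ndarray)}; assumes `sender`
    is the more loaded node and both dicts are non-empty."""
    total = lambda node: sum(node.values())
    excess = (total(sender).mean() - total(receiver).mean()) / 2
    moved = 0.0
    while moved < excess and len(sender) > 1:
        # Score = corr with what would stay at the sender, minus corr with
        # the receiver; moving the top-scored operator smooths both nodes.
        score = lambda op: (corr(sender[op], total(sender) - sender[op])
                            - corr(sender[op], total(receiver)))
        best = max(sender, key=score)
        moved += float(sender[best].mean())
        receiver[best] = sender.pop(best)
```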
Two‐way Algorithm
- All operators in a given node pair are candidates to move in either direction.
- Assume both nodes are initially empty.
- Score all the operators.
- Select the operator with the largest score and place it at the less loaded node.
- Continue until all operators are placed (see the sketch below).
- The two‐way algorithm can result in a better placement, but its load migration cost is higher.
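A compact sketch of the two‐way re‐placement. For brevity it scores operators simply by their average load, whereas the full algorithm uses a correlation‐based score as in the one‐way case; all names are illustrative.

```python
import numpy as np

def two_way_redistribute(ops):
    """Sketch of the two-way algorithm: all operators of a node pair are
    re-placed from scratch onto two initially empty nodes. `ops` is a dict
    {operator: load_time_series}."""
    a, b = {}, {}
    load = lambda node: sum(float(s.mean()) for s in node.values())
    # Place operators in decreasing score order, always on the less loaded node.
    for op in sorted(ops, key=lambda o: float(ops[o].mean()), reverse=True):
        (a if load(a) <= load(b) else b)[op] = ops[op]
    return a, b

# Usage with synthetic load series for four operators:
rng = np.random.default_rng(0)
print(two_way_redistribute({f"op{i}": rng.random(50) for i in range(4)}))
```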
Load‐resilient Techniques
- Goal: to tolerate as many load conditions as possible without the need for operator migration.
- Resilient Operator Distribution (ROD)
  – ROD does not become overloaded easily in the face of fluctuating input rates.
  – Key idea: choose a static placement whose feasible region (the set of input‐rate combinations it can sustain without overloading any node) is as large as possible (the sketch below estimates this area for a given placement).
[Figure: space of input‐rate combinations; the feasible region under a given placement is the area ROD tries to maximize.]
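One way to see what "maximize this area" means: estimate, by Monte Carlo sampling, how large a fraction of the input‐rate space a given distribution can sustain. The cost‐matrix encoding and function name below are assumptions of this illustration, not ROD itself.

```python
import numpy as np

def feasible_fraction(cost_matrix, capacity=1.0, samples=100_000, rate_max=1.0):
    """Fraction of rate vectors r, sampled uniformly from [0, rate_max]^D,
    with cost_matrix @ r <= capacity on every node. cost_matrix[i, j] is
    the load that one tuple/sec on input j puts on node i."""
    rng = np.random.default_rng(0)
    rates = rng.uniform(0, rate_max, size=(samples, cost_matrix.shape[1]))
    loads = rates @ cost_matrix.T                  # per-sample node loads
    return float((loads <= capacity).all(axis=1).mean())

# Two placements with the same total cost per node: spreading each input's
# cost across both nodes yields a larger feasible area than a lopsided split.
print(feasible_fraction(np.array([[2.0, 1.0], [1.0, 2.0]])))   # ~0.17
print(feasible_fraction(np.array([[3.0, 0.0], [0.0, 3.0]])))   # ~0.11
```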
Comparison of Approaches
Correlation‐based:
- Dynamic
- Handles medium‐to‐long‐term load variations
- Periodic operator movement

Load‐resilient:
- Static
- Handles short‐term load fluctuations
- No operator movement
Distributed Stream Processing
Major Problem Areas
- Load distribution and balancing
  – Dynamic / Correlation‐based techniques
  – Static / Load‐resilient techniques
  – (Network‐aware techniques)
- Distributed load shedding
- High availability and Fault tolerance
  – Handling node failures
  – Handling link failures (esp. network partitions)
Distributed Load Shedding
- Problem: one or more servers can be overloaded.
- Goal: remove excess load from all of them with minimal quality loss at the query end‐points.
- There is a load dependency among the servers.
- To keep quality under control, servers must coordinate in their load shedding decisions.
Distributed Load Shedding
Load Dependency
[Figure: Two query chains span Node A and Node B. Chain 1: a cost‐1 operator at A followed by a cost‐3 operator at B. Chain 2: a cost‐2 operator at A followed by a cost‐1 operator at B. All selectivities are 1.0, and each input arrives at 1 tuple/sec.]

Plan | Rates at A | A.load | A.throughput | B.load | B.throughput | Note
—    | 1, 1       | 3      | 1/3, 1/3     | 4/3    | 1/4, 1/4     | no shedding: both overloaded
1    | 1, 0       | 1      | 1, 0         | 3      | 1/3, 0       | optimal for A
2    | 0, 1/2     | 1      | 0, 1/2       | 1/2    | 0, 1/2       | feasible for both
3    | 1/5, 2/5   | 1      | 1/5, 2/5     | 1      | 1/5, 2/5     | optimal for both

Constraints: A.load ≤ 1 and B.load ≤ 1; objective: maximize total throughput (reproduced in the sketch below).

Server nodes must coordinate in their load shedding decisions to achieve high‐quality results.
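The table rows can be reproduced with a small helper that scales a node's outputs down proportionally whenever its load exceeds capacity; that proportional‐degradation rule and the function name are assumptions of this illustration, chosen to match the numbers above (selectivities of 1.0 let us ignore filtering).

```python
from fractions import Fraction as F

def node_output(costs, in_rates, capacity=F(1)):
    """If a node's load exceeds its capacity, scale all of its
    output rates down proportionally."""
    load = sum(c * r for c, r in zip(costs, in_rates))
    scale = min(F(1), capacity / load) if load else F(1)
    return load, [r * scale for r in in_rates]

# First table row: rates (1, 1) at node A; costs A = (1, 2), B = (3, 1).
a_load, a_out = node_output([F(1), F(2)], [F(1), F(1)])   # -> 3, [1/3, 1/3]
b_load, b_out = node_output([F(3), F(1)], a_out)          # -> 4/3, [1/4, 1/4]
print(a_load, a_out, b_load, b_out)
```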
Distributed Load Shedding
as a Linear Optimization Problem
Find x_j, 1 ≤ j ≤ D, such that:
  – for every node i, 1 ≤ i ≤ N: Σ_{j=1..D} c_{i,j} · s^i_j · r_j · x_j ≤ 1 (no node is overloaded), and
  – Σ_{j=1..D} p_j · s_j · r_j · x_j is maximized (total weighted throughput; see the solver sketch below).

Here:
  – N: number of server nodes; D: number of dataflows (queries)
  – r_j: input rate of dataflow j; x_j ∈ [0, 1]: fraction of dataflow j's input kept after shedding
  – c_{i,j}: per‐tuple processing cost of dataflow j at node i; s_{i,j}: its selectivity at node i
  – s^i_j: cumulative selectivity of dataflow j upstream of node i; s_j: its end‐to‐end selectivity
  – p_j: value (weight) of dataflow j's output

[Figure: D dataflows with input rates r1…rD and shedding fractions x1…xD traverse Node 1 through Node N; dataflow j incurs cost c_{i,j} and selectivity s_{i,j} at node i, and its output is weighted by p_j.]
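Since this is an ordinary linear program, an off‐the‐shelf LP solver can compute the optimal keep rates. Below is a sketch using scipy (an assumed tool, not part of the original material) for the two‐node example from the load‐dependency slide; plan 3 falls out as the optimum.

```python
import numpy as np
from scipy.optimize import linprog

# Keep-rates x_j for the two dataflows of the earlier example: r_j = 1,
# all selectivities 1.0, output weights p_j = 1, node capacity 1.
A_ub = np.array([[1.0, 2.0],    # load constraint for node A (costs 1 and 2)
                 [3.0, 1.0]])   # load constraint for node B (costs 3 and 1)
b_ub = np.array([1.0, 1.0])
c = np.array([-1.0, -1.0])      # maximize x1 + x2  ->  minimize -(x1 + x2)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1), (0, 1)])
print(res.x)  # ~[0.2, 0.4], i.e., plan 3 from the load-dependency table
```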
Distributed Stream Processing
Major Problem Areas
- Load distribution and balancing
  – Dynamic / Correlation‐based techniques
  – Static / Load‐resilient techniques
  – (Network‐aware techniques)
- Distributed load shedding
- High availability and Fault tolerance
  – Handling node failures
  – Handling link failures (esp. network partitions)
High Availability and Fault Tolerance
Overview
- Problem: node failures and network link failures
  – Query execution stalls
  – Queries produce incorrect results
- Requirements:
  – Consistency -> avoid lost, duplicate, or out‐of‐order data
  – Performance -> avoid overhead during normal processing + overhead during failure recovery
- Major tasks:
  – Failure preparation -> replication of volatile processing state
  – Failure detection -> timeouts
  – Failure recovery -> replica coordination upon failure
High Availability and Fault Tolerance
General Approach
- Adapt traditional approaches to stream processing
- Two general approaches:
  – State‐machine approach
    - Replicate the processing on multiple nodes
    - Send all the nodes the same input in the same order
    - Advantage: fast fail‐over
    - Disadvantage: high resource requirements
  – Rollback recovery approach (sketched below)
    - Periodically checkpoint processing state to other nodes
    - Log input between checkpoints
    - Advantage: low run‐time overhead
    - Disadvantage: high recovery time
- Different trade‐offs can be made among availability, run‐time overhead, and consistency.
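A minimal sketch of the rollback‐recovery idea, assuming a toy operator whose state is a per‐key running count; where checkpoints and logs physically live (standby node, upstream buffers) is simplified away, and all names are illustrative.

```python
import copy

class PassiveStandby:
    """Rollback recovery sketch: checkpoint periodically, log input
    between checkpoints, replay the log after a failure."""
    def __init__(self, checkpoint_interval=100):
        self.state = {}        # volatile operator state (e.g., window contents)
        self.input_log = []    # tuples received since the last checkpoint
        self.checkpoint = {}   # last snapshot shipped to the standby
        self.interval = checkpoint_interval
        self.count = 0

    def process(self, tup):
        self.input_log.append(tup)
        self.apply(tup)
        self.count += 1
        if self.count % self.interval == 0:
            # Ship a snapshot to the standby and truncate the input log.
            self.checkpoint = copy.deepcopy(self.state)
            self.input_log.clear()

    def apply(self, tup):
        # Example stateful operation: per-key running count.
        key = tup["key"]
        self.state[key] = self.state.get(key, 0) + 1

    def recover(self):
        # Restore the last checkpoint, then replay the logged input
        # (assumed to survive the failure, e.g., buffered upstream).
        self.state = copy.deepcopy(self.checkpoint)
        for tup in list(self.input_log):
            self.apply(tup)
```

Recovery time grows with the amount of input logged since the last checkpoint, which is exactly the run‐time overhead vs. recovery time knob discussed below.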
Handling Node Failures
[Figure: taxonomy of node‐failure handling approaches: active replicas vs. passive replicas, the latter covering passive standby and upstream backup.]
Active Replicas
– Each query runs in parallel on multiple replica nodes that all receive the same input; when one fails, another takes over immediately.
Passive Standby
– The primary node periodically checkpoints its processing state to a standby, which resumes from the latest checkpoint upon failure.
Upstream Backup
– Upstream nodes buffer the tuples they send; after a failure, a recovery node rebuilds the lost state by replaying these buffers.
Run‐time Overhead vs. Recovery Time Trade‐off
- Active Replicas:
  – High run‐time overhead
  – Fast fail‐over (i.e., low recovery time)
- Passive Standby:
  – The check‐point interval can be flexibly adjusted to trade run‐time overhead against recovery time
- Upstream Backup:
  – Low run‐time overhead
  – Recovery time proportional to the size of the upstream buffers (see the sketch below)
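A sketch of the upstream‐backup buffer logic: the producer retains sent tuples until they are acknowledged as safely processed downstream, and replays the unacknowledged tail to a replacement node after a failure. The class, method names, and `receive` callback are assumptions of this illustration.

```python
from collections import deque

class UpstreamBuffer:
    """Upstream backup sketch: keep sent tuples until acknowledged."""
    def __init__(self):
        self.buffer = deque()   # (seq_no, tuple) pairs not yet acknowledged
        self.next_seq = 0

    def send(self, tup, downstream):
        self.buffer.append((self.next_seq, tup))
        downstream.receive(self.next_seq, tup)   # hypothetical downstream API
        self.next_seq += 1

    def ack(self, seq_no):
        # Downstream has durably processed everything up to seq_no: trim.
        while self.buffer and self.buffer[0][0] <= seq_no:
            self.buffer.popleft()

    def replay(self, replacement):
        # On downstream failure, replay unacknowledged tuples to the
        # replacement node; recovery time grows with len(self.buffer).
        for seq_no, tup in self.buffer:
            replacement.receive(seq_no, tup)
```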
Handling Network Partitions
- "Network partitions" occur when data sources, processing nodes, and clients are split into disconnected partitions due to network failures.
- Two general options:
  – Suspend processing to avoid inconsistency.
  – Continue processing to avoid unavailability.
- Delay‐Process‐Correct (DPC) protocol (sketched below)
  – Adjusts the trade‐off between consistency and availability using a maximum tolerable latency threshold and tentative tuples.
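A sketch of the DPC decision at an operator's output: results are held back until their inputs are known to be complete, but never longer than the latency bound; results released early are marked tentative and corrected once the partition heals. Field and function names are illustrative, not the DPC implementation.

```python
import time

def dpc_emit(pending, max_latency, now=None):
    """For each pending result (a dict with 'arrival_time',
    'inputs_complete', and 'value'), emit it as STABLE once its inputs
    are complete, or as TENTATIVE once max_latency seconds have passed."""
    now = now if now is not None else time.time()
    out = []
    for res in pending:
        if res["inputs_complete"]:
            out.append(("STABLE", res["value"]))
        elif now - res["arrival_time"] >= max_latency:
            # Latency bound exceeded: favor availability over consistency.
            out.append(("TENTATIVE", res["value"]))
    return out

# Usage: one complete result, one stale incomplete one (emitted tentatively).
pending = [{"arrival_time": 0.0, "inputs_complete": True,  "value": 42},
           {"arrival_time": 0.0, "inputs_complete": False, "value": 7}]
print(dpc_emit(pending, max_latency=5.0, now=10.0))
```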
Other Advanced HA Techniques
- Cooperative and Self‐configuring HA [Borealis]
  – Each server node is backed up by multiple servers in a cooperative fashion, which can take over processing in parallel.
  – Backup assignment dynamically changes to balance HA load.
  – Wide‐area extensions
- Integrating Fault Tolerance with Load Balancing [Flux]
  – Fine‐granularity dataflow partitions
  – Rebalance load after failure recovery