
Addressing Deployment Challenges in Data Stream Processing

Engineering Macro-area
Department of Civil Engineering and Computer Science Engineering
Course: Systems and Architectures for Big Data (SABD)
Academic year 2019/2020
Valeria Cardellini
Master's degree in Computer Engineering

DSP deployment challenges
• Let's consider the challenges that arise when deploying DSP applications
1. Optimize the DSP application
  • Lazy evaluation in Flink and Spark Streaming
2. Place the DSP operators on the underlying computing infrastructure
  • Most frameworks use simple placement policies
    – E.g., in Storm: Round Robin as default strategy
    – Recently added: Resource Aware Scheduler
      • Takes into account resource availability on machines and resource requirements of workloads
      • But requires the user to specify memory and CPU requirements for individual topology components

V. Cardellini - SABD 2019/2020

DSP operator placement
• Goal: determine which distributed computing nodes should host and execute each application operator, so as to optimize the application QoS

Placement: new distributed environment
• Fog + Cloud computing: makes it possible to increase scalability and availability, and to reduce latency, network traffic, and power consumption

Placement: challenges
• Network latencies are significant
  – e.g., geo-distributed resources
• Computing and networking resources are heterogeneous
  – e.g., capacity limits, business constraints
• Computing/network resources can be unavailable
• Data cannot be quickly moved around the network
• Peculiarities of DSP applications:
  – computational requirements are unknown a priori
  – requirements can change continuously
  – load is imposed over long provisioning periods
→ Need to adapt to internal and external changes

Placement: frameworks
• Most frameworks use simple placement policies, e.g., in Storm
  – Round Robin as default strategy
  – Resource Aware Scheduler as alternative
    • Takes into account resource availability on machines and resource requirements of workloads
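To make the "simple placement policy" concrete, here is a minimal Python sketch of round-robin placement. It is not Storm's scheduler code: the function name, the operator names, and the node names are hypothetical, and the sketch deliberately ignores load, capacity, and network state, which is exactly the limitation noted above.

```python
from itertools import cycle


def round_robin_placement(operators, nodes):
    """Assign operators to nodes in turn, ignoring load and network state."""
    node_cycle = cycle(nodes)
    return {op: next(node_cycle) for op in operators}


# Hypothetical four-operator application on three nodes.
placement = round_robin_placement(
    ["source", "parse", "count", "sink"],
    ["node-a", "node-b", "node-c"],
)
print(placement)
```

Because the policy only looks at the order of operators and nodes, two heavily communicating operators can easily end up on different machines.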

Placement: different approaches
• Several operator placement policies exist in the literature (mainly heuristics) that address the problem, but they:
  – Make different assumptions (system model, application topology, QoS attributes and metrics, …)
  – Pursue different objectives
  – Are not easily comparable
• Main methodologies:
  – Mathematical programming
    • Formalization of the operator placement problem: NP-hard problem
    • Does not scale well, but provides useful insights
  – Heuristics

Placement: different approaches
• Who is the decision maker?
  – Centralized placement strategies
    • Require a global view (full resource and network state, application state, workload information)
    • Pros: capable of determining the globally optimal solution
    • Cons: scalability
  – Decentralized placement strategies
    • Take decisions based only on local information
    • Pros: scalability, better suited for runtime adaptation
    • Cons: optimality is not guaranteed

ODP: Optimal DSP Placement
• We propose ODP
  – Centralized policy for optimal placement of DSP applications
  – Formulated as an Integer Linear Programming (ILP) problem
• Our goals:
  – To compute the optimal placement (of course!)
  – To provide a unified general formulation of the placement problem for DSP applications (but not only!)
  – To consider multiple QoS attributes of applications and resources
  – To provide a benchmark for heuristics
V. Cardellini, V. Grassi, F. Lo Presti, M. Nardelli, Optimal Operator Placement for Distributed Stream Processing Applications, DEBS '16

ODP: model
DSP application
• Operators
  – $C_i$: required computing resources
  – $R_i$: execution time per data unit
• Data streams
  – $\lambda_{i,j}$: data rate from operator i to j

ODP: model
Computing and network resources
• Computing resources
  – $C_u$: amount of resources
  – $S_u$: processing speed
  – $A_u$: resource availability
• (Logical) network links
  – $d_{u,v}$: network delay from u to v
  – $B_{u,v}$: bandwidth from u to v
  – $A_{u,v}$: link availability

ODP: model
Decision variables
• Determine where to map DSP operators and data streams
  – $x_{i,u} = 1$: operator i is placed on node u
  – $y_{(i,j),(u,v)} = 1$: data stream (i, j) is mapped onto link (u, v)
[Figure: operators i and j placed on nodes u and v ($x_{i,u} = 1$, $x_{j,v} = 1$), with stream (i, j) mapped onto link (u, v) ($y_{(i,j),(u,v)} = 1$); z and w are other nodes]
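As a hedged illustration of how these decision variables can be tied together, the constraints below sketch one plausible ILP encoding using only the symbols defined in the model (with $\lambda_{i,j}$ denoting the data rate of stream (i, j)). This is a sketch under assumptions, not necessarily the exact formulation of the cited paper: each operator is placed on exactly one node, node capacity and link bandwidth are respected, and the stream variables are forced to agree with the placement of their endpoints.

```latex
\begin{align}
\sum_{u} x_{i,u} &= 1 && \forall i \\
\sum_{i} C_i \, x_{i,u} &\le C_u && \forall u \\
\sum_{(i,j)} \lambda_{i,j} \, y_{(i,j),(u,v)} &\le B_{u,v} && \forall (u,v) \\
x_{i,u} &= \sum_{v} y_{(i,j),(u,v)} && \forall (i,j),\ \forall u \\
x_{j,v} &= \sum_{u} y_{(i,j),(u,v)} && \forall (i,j),\ \forall v \\
x_{i,u},\ y_{(i,j),(u,v)} &\in \{0, 1\}
\end{align}
```

The last two equalities are what make the problem linear: $y_{(i,j),(u,v)}$ plays the role of the product $x_{i,u} \cdot x_{j,v}$ without introducing a quadratic term.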

ODP: some QoS metrics
• Response time (R): max end-to-end delay between sources and destination
• Application availability: probability that all components/links are up and running
• Inter-node traffic: overall network data rate
• Network usage: in-flight bytes, $\sum_{l \in \text{links}} \text{rate}(l)\,\text{Lat}(l)$

ODP: optimal problem formulation
[Figure: ILP formulation, annotated as follows]
• Tunable knobs to set the optimal placement goals
• Latency
• Availability
• Network bandwidth and node capacity constraints
• Assignment and integer constraints
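The QoS metrics listed above can be evaluated for a given placement with a few lines of Python. The sketch below uses a hypothetical three-operator chain on two nodes; all names and numbers are invented for illustration, and it assumes that the execution time of operator i on node u is $R_i / S_u$ and that availabilities are independent.

```python
from math import prod

# Hypothetical toy instance (all values are illustrative assumptions).
streams = {("src", "map"): 100.0, ("map", "sink"): 20.0}  # data rate per stream
exec_time = {"src": 1.0, "map": 4.0, "sink": 1.0}          # R_i
speed = {"edge": 1.0, "cloud": 2.0}                        # S_u
node_avail = {"edge": 0.99, "cloud": 0.999}                # A_u
delay = {("edge", "cloud"): 30.0, ("cloud", "edge"): 30.0}  # d_{u,v}
link_avail = {("edge", "cloud"): 0.995, ("cloud", "edge"): 0.995}  # A_{u,v}
paths = [["src", "map", "sink"]]                           # source-to-sink paths
placement = {"src": "edge", "map": "cloud", "sink": "cloud"}


def link_delay(u, v):
    """Delay between two nodes; co-located operators communicate for free."""
    return 0.0 if u == v else delay[(u, v)]


def response_time(pl):
    """Max over paths of processing time plus network delay."""
    def cost(path):
        processing = sum(exec_time[i] / speed[pl[i]] for i in path)
        network = sum(link_delay(pl[a], pl[b]) for a, b in zip(path, path[1:]))
        return processing + network
    return max(cost(p) for p in paths)


def availability(pl):
    """Probability that all used nodes and inter-node links are up."""
    used_nodes = set(pl.values())
    used_links = {(pl[a], pl[b]) for a, b in streams if pl[a] != pl[b]}
    return prod(node_avail[u] for u in used_nodes) * prod(
        link_avail[l] for l in used_links
    )


def inter_node_traffic(pl):
    """Total data rate of streams that cross node boundaries."""
    return sum(rate for (a, b), rate in streams.items() if pl[a] != pl[b])


def network_usage(pl):
    """Sum over streams of rate times latency (in-flight data)."""
    return sum(rate * link_delay(pl[a], pl[b]) for (a, b), rate in streams.items())
```

An optimizer such as ODP would combine metrics like these into a single weighted objective; here they are only computed for one fixed placement.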

ODP: scalability issue
• The placement problem is NP-hard: it does not scale well!
• We need heuristics to compute the placement in a feasible amount of time

Centralized placement heuristics
• Two heuristics that aim to reduce inter-node traffic
1. Aniello et al.: co-locate pairs of communicating tasks on the same computing node so as to minimize inter-node communication and balance CPU demand
  – Key idea: greedy heuristic
    • Rank task pairs according to exchanged traffic
    • For each pair:
      » If neither task has been assigned yet, assign both to the same node
      » If either is already assigned, consider the least loaded node and the nodes where they have been assigned, and pick the configuration that minimizes the inter-process traffic
2. Xu et al. use a similar idea but assign tasks in isolation
L. Aniello, R. Baldoni and L. Querzoni, Adaptive online scheduling in Storm, DEBS '13
J. Xu, Z. Chen, J. Tang and S. Su, T-Storm: traffic-aware online scheduling in Storm, ICDCS '14
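The greedy idea described above can be sketched in Python as follows. This is a simplified variant written for illustration, not the algorithm of Aniello et al. as published: load is counted as number of tasks per node, `capacity` is a hypothetical per-node task limit, and total capacity is assumed to be sufficient for all tasks.

```python
def greedy_colocate(traffic, nodes, capacity):
    """Place tasks greedily, heaviest-communicating pairs first.

    traffic:  {(task_a, task_b): exchanged data rate}
    nodes:    list of node names
    capacity: max number of tasks per node (simplifying assumption)
    """
    placement = {}
    load = {n: 0 for n in nodes}

    def cross_traffic(pl):
        # Traffic between already placed tasks that sit on different nodes.
        return sum(
            rate
            for (a, b), rate in traffic.items()
            if a in pl and b in pl and pl[a] != pl[b]
        )

    def least_loaded():
        return min(nodes, key=lambda n: load[n])

    ranked = sorted(traffic.items(), key=lambda kv: kv[1], reverse=True)
    for (a, b), _rate in ranked:
        for task, peer in ((a, b), (b, a)):
            if task in placement:
                continue
            # Candidates: the least loaded node and, if it has room,
            # the node already hosting the communicating peer.
            candidates = [least_loaded()]
            if peer in placement and load[placement[peer]] < capacity:
                candidates.append(placement[peer])
            best = min(
                candidates, key=lambda n: cross_traffic({**placement, task: n})
            )
            placement[task] = best
            load[best] += 1
    return placement


# Hypothetical chain s -> p -> c -> k with decreasing traffic.
traffic = {("s", "p"): 100.0, ("p", "c"): 80.0, ("c", "k"): 10.0}
result = greedy_colocate(traffic, ["n1", "n2"], capacity=2)
print(result)
```

In this example the heaviest pair (s, p) is co-located first, so the only stream that crosses nodes is (p, c); a capacity of two tasks per node makes it impossible to avoid every cut.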

ODP as benchmark
• Using ODP, we can evaluate how well the heuristics perform
V. Cardellini, V. Grassi, F. Lo Presti, M. Nardelli, Optimal Operator Placement for Distributed Stream Processing Applications, DEBS '16

Decentralized placement heuristic
• Heuristic's goal: reduce network usage
  – The network usage metric combines link latencies and exchanged data rates among DSP operators: $\sum_{l \in \text{links}} \text{rate}(l)\,\text{Lat}(l)$
• Pietzuch et al. exploit the spring relaxation idea:
  – The application is regarded as a system of springs, whose minimum energy configuration corresponds to minimizing network usage
• Features
  – Decentralized policy to minimize network impact
  – Adaptive to changes in network conditions
P. Pietzuch et al., Network-aware operator placement for stream-processing systems, ICDE '06

Decentralized placement heuristic
1. Represents the DSP application as an equivalent system of springs

Decentralized placement heuristic
2. Determines the placement of the operators in the cost space by minimizing the elastic energy of the equivalent system
  • The network of springs tries to minimize its potential energy E
  • Streams act as springs that exert a restoring force $F = \tfrac{1}{2} \cdot k \cdot s$:
    – k (spring constant): exchanged data rate on the link (k = DR)
    – s (spring extension): latency on the link (s = Lat)

Decentralized placement heuristic
3. Maps its decision back to physical nodes

ODP as benchmark
• Pietzuch et al.: distributed placement heuristic that minimizes network usage
V. Cardellini, V. Grassi, F. Lo Presti, M. Nardelli, Optimal Operator Placement for Distributed Stream Processing Applications, DEBS '16
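The three steps of the spring-relaxation heuristic (spring model, relaxation in the cost space, mapping back to physical nodes) can be sketched in Python. This is an illustrative simplification, not the implementation of Pietzuch et al.: it assumes a 2-D cost space with Euclidean distance standing in for latency, a quadratic energy $\sum \tfrac{1}{2} k s^2$, and hypothetical operator and node names. Under that energy, the best position of a free operator given its neighbours is their rate-weighted centroid, which is what each relaxation step computes.

```python
def relax(positions, pinned, springs, iterations=100):
    """Iteratively move unpinned operators to the rate-weighted centroid
    of their neighbours, which lowers the energy sum(0.5 * k * s**2).

    positions: {operator: (x, y)} initial coordinates in the cost space
    pinned:    operators that cannot move (e.g. sources and sinks)
    springs:   {(op_a, op_b): k} with k the exchanged data rate
    """
    pos = dict(positions)
    for _ in range(iterations):
        for op in pos:
            if op in pinned:
                continue
            neighbours = [
                (b if a == op else a, k)
                for (a, b), k in springs.items()
                if op in (a, b)
            ]
            total_k = sum(k for _, k in neighbours)
            if total_k == 0:
                continue
            pos[op] = tuple(
                sum(k * pos[n][d] for n, k in neighbours) / total_k
                for d in range(len(pos[op]))
            )
    return pos


def map_to_node(point, node_coords):
    """Step 3: pick the physical node closest to the virtual position."""
    return min(
        node_coords,
        key=lambda n: sum((p - q) ** 2 for p, q in zip(point, node_coords[n])),
    )


# Hypothetical example: a filter between a pinned source and a pinned sink.
positions = {"source": (0.0, 0.0), "filter": (5.0, 5.0), "sink": (10.0, 0.0)}
springs = {("source", "filter"): 3.0, ("filter", "sink"): 1.0}
relaxed = relax(positions, pinned={"source", "sink"}, springs=springs)
node_coords = {"n1": (2.0, 0.0), "n2": (8.0, 0.0)}
chosen = map_to_node(relaxed["filter"], node_coords)
print(relaxed["filter"], chosen)
```

The filter is pulled toward the source because the source-to-filter stream carries three times the data rate of the filter-to-sink stream, which is the intended effect: high-rate streams should cross short (low-latency) links.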

Not only placement
• Stream processing workloads are characterized by:
  – High volume
  – High production rate
• Exploit replication (i.e., data parallelism): concurrent execution of multiple operator replicas on different data portions

Operator placement and replication
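As a minimal illustration of data parallelism, the Python sketch below splits a stream of words among several replicas of a counting operator using key-based partitioning. The function names are hypothetical and the replicas are simulated sequentially in one process; a real DSP framework would run them concurrently on different nodes. A stable hash (`zlib.crc32`) is used so that the same key always reaches the same replica.

```python
import zlib
from collections import Counter


def replica_for(key, num_replicas):
    """Route a key to a replica with a stable hash (same key, same replica)."""
    return zlib.crc32(key.encode("utf-8")) % num_replicas


def replicated_word_count(words, num_replicas):
    """Each replica counts only the portion of the stream routed to it."""
    partials = [Counter() for _ in range(num_replicas)]
    for word in words:
        partials[replica_for(word, num_replicas)][word] += 1
    return partials


words = ["storm", "flink", "storm", "heron", "flink", "storm"]
partials = replicated_word_count(words, num_replicas=3)
merged = sum(partials, Counter())
print(merged)
```

Because routing is by key, each word is counted by exactly one replica, so the partial results can be merged without any coordination between replicas.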

ODRP: Optimal DSP Replication and Placement
• We propose ODRP
  – Centralized policy for optimal replication and placement of DSP applications
  – Formulated as an Integer Linear Programming (ILP) problem that extends ODP
• Our goals:
  – Jointly determine the optimal number of replicas and their placement
  – Consider multiple QoS attributes of applications and resources
  – Provide a unified general formulation
  – Provide a benchmark for heuristics
• Limitation: scalability; in practice we need heuristics
V. Cardellini, V. Grassi, F. Lo Presti, M. Nardelli, Optimal operator replication and placement for distributed stream processing systems, ACM Perf. Eval. Rev., 2017

DSP deployment challenges
3. Manage load variations
  • Some frameworks (Flink, Heron, Storm) support backpressure
    – In Storm: backpressure mechanism based on configurable high/low watermarks, expressed as a percentage of a task's buffer size
      • If the high watermark is reached, Storm slows down the topology's spouts; it stops throttling when the low watermark is reached
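The high/low watermark mechanism described above is a form of hysteresis, which the following Python sketch illustrates. It is not Storm's implementation: the class name, the method names, and the default thresholds (0.9 and 0.4) are assumptions chosen for the example.

```python
class WatermarkBuffer:
    """Bounded buffer that raises a throttle flag with hysteresis."""

    def __init__(self, capacity, high=0.9, low=0.4):
        self.capacity = capacity
        self.high = high  # fill ratio at which throttling starts
        self.low = low    # fill ratio at which throttling stops
        self.size = 0
        self.throttled = False

    def _update(self):
        fill = self.size / self.capacity
        if fill >= self.high:
            self.throttled = True
        elif fill <= self.low:
            self.throttled = False
        # Between the two watermarks the previous state is kept.

    def enqueue(self, n=1):
        self.size = min(self.capacity, self.size + n)
        self._update()

    def dequeue(self, n=1):
        self.size = max(0, self.size - n)
        self._update()


buf = WatermarkBuffer(capacity=10)
states = []
buf.enqueue(10)   # buffer full: above the high watermark
states.append(buf.throttled)
buf.dequeue(4)    # 60% full: between watermarks, still throttled
states.append(buf.throttled)
buf.dequeue(3)    # 30% full: below the low watermark
states.append(buf.throttled)
buf.enqueue(3)    # 60% full again: between watermarks, not throttled
states.append(buf.throttled)
print(states)
```

The same fill level (60%) yields different throttle states depending on the direction from which it was reached. Using two thresholds instead of one prevents the sources from rapidly switching between throttled and unthrottled when the buffer hovers around a single limit.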
