
Fast and Accurate Load Balancing for Geo-Distributed Storage Systems

Kirill L. Bogdanov (1), Waleed Reda (1,2), Gerald Q. Maguire Jr. (1), Dejan Kostic (1), Marco Canini (3)
(1) KTH Royal Institute of Technology, (2) Université Catholique de Louvain, (3) KAUST


  1. Fast and Accurate Load Balancing for Geo-Distributed Storage Systems. Kirill L. Bogdanov (1), Waleed Reda (1,2), Gerald Q. Maguire Jr. (1), Dejan Kostic (1), Marco Canini (3). Affiliations: (1) KTH Royal Institute of Technology, (2) Université Catholique de Louvain, (3) KAUST.

  2. Geo-Distributed Services. Service Level Objective (SLO): request completion time at the target percentile (e.g., 30 ms at the 95th percentile). [Figure labels: Clients, Datacenter]
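
As a rough illustration of the SLO definition on this slide, the sketch below checks a batch of request completion times against a percentile target. The function names, the nearest-rank percentile method, and the sample latencies are assumptions made for this example; they are not taken from the presentation.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_violation_fraction(samples, slo_ms):
    """Fraction of requests whose completion time exceeds the SLO bound."""
    return sum(1 for s in samples if s > slo_ms) / len(samples)


def meets_slo(samples, slo_ms=30.0, pct=95.0):
    """True if the pct-th percentile completion time is within the SLO bound."""
    return percentile(samples, pct) <= slo_ms


# Made-up completion times in milliseconds: 96 fast requests, 4 slow ones.
latencies_ms = [12.0] * 96 + [45.0] * 4
print(meets_slo(latencies_ms))
print(slo_violation_fraction(latencies_ms, 30.0))
```

With a 95th-percentile target, up to 5% of requests may exceed the bound before the SLO counts as violated, which is why the violation fraction is the quantity tracked in the later slides.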

  3. Geo-Distributed Services. Web-based services demonstrate temporal and spatial variability in load. Problem: it is difficult to meet strict SLOs while maintaining high resource utilization and low cost. [Figure labels: Clients, Datacenter]

  4-6. Approach 1 - Datacenter Elasticity. [Figure: arrival rate, 1000x req/s]

  7. Approach 1 - Datacenter Elasticity. Provisioning delays lead to overprovisioning! Provisioning delay (minutes) is due to the time needed to spawn and warm up a VM. It is hard to predict workload far into the future. Load spikes can be short-lived. [Figure: arrival rate, 1000x req/s; labels: Unused capacity, SLO violations, Provisioning delay (minutes)]

  8. Approach 2 - Geo-Distributed Load Balancing. Excessive redirection delays. Inaccurate response time estimation. How much to redirect? Excessive or insufficient redirection. [Figure: arrival rate, 1000x req/s; labels: Redirection, Redirection delay, SLO violations]

  9. Our Approach: Kurma. Reacts to changes in load within seconds. Avoids unnecessary scaling out. Accurately estimates the remote rate of SLO violations. Tames SLO violations at the target level. [Figure: arrival rate, 1000x req/s]

  10. Request Completion Time. Base Propagation: stable component associated with packet propagation along a network path. Delay Variance: variable component associated with competing traffic and queuing. Service Time: variable component associated with load on the server. Kurma solves a global optimization model while considering Base Propagation + Delay Variance + Service Time at all datacenters. [Figure labels: Datacenter Frankfurt (Server 1), Datacenter Ireland (Server 2), Wide Area Network]
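
The three-way decomposition on this slide can be written down as a small data structure. This is a minimal sketch, assuming the three components simply add up for a single request; the class name, field names, and the millisecond values in the usage lines are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompletionTime:
    """One request's completion time, split into the slide's three components."""
    base_propagation_ms: float  # stable: packet propagation along the network path
    delay_variance_ms: float    # variable: competing traffic and queuing
    service_time_ms: float      # variable: depends on load at the serving datacenter

    @property
    def total_ms(self) -> float:
        return self.base_propagation_ms + self.delay_variance_ms + self.service_time_ms

    def violates(self, slo_ms: float) -> bool:
        return self.total_ms > slo_ms


# Hypothetical numbers for a request redirected over the WAN.
lightly_loaded = CompletionTime(base_propagation_ms=22.0, delay_variance_ms=1.5, service_time_ms=4.0)
heavily_loaded = CompletionTime(base_propagation_ms=22.0, delay_variance_ms=1.5, service_time_ms=8.0)
print(lightly_loaded.total_ms, lightly_loaded.violates(30.0))
print(heavily_loaded.total_ms, heavily_loaded.violates(30.0))
```

Only the service-time term differs between the two requests, which shows how remote load alone can push a redirected request over the SLO bound.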

  11. Understanding Service Time. [Figure: Datacenter Frankfurt, 5-server Cassandra cluster]

  12. Understanding Service Time. Challenge: how to accurately estimate the remote fraction of SLO violations at runtime under variable network conditions? [Figure: Datacenter Frankfurt, 5-server Cassandra cluster]

  13. Understanding Service Time. [Figure: Datacenter Frankfurt, 5-server Cassandra cluster]

  14. Understanding Service Time. Insight: the farther away a remote datacenter is, the less loaded it should be to serve remote requests within a given SLO target. [Figure: Datacenter Frankfurt and Datacenter Ireland, each a 5-server Cassandra cluster, connected by a Wide Area Network]
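
The insight can be made concrete with a small calculation: a fixed round-trip time is subtracted from the SLO bound, and whatever remains is the budget for service time at the remote datacenter. The sketch below assumes a constant round-trip time (delay variance is ignored) and a hypothetical table of 95th-percentile service times per load level; neither the table nor the function name comes from the slides.

```python
def max_tolerable_load(p95_service_ms_by_load, slo_ms, rtt_ms):
    """Highest tabulated load whose p95 service time still fits in the SLO after the WAN round trip.

    With a constant round-trip time, p95(rtt + service) equals rtt + p95(service),
    so the remaining budget for service time is slo_ms - rtt_ms.
    Returns None when no tabulated load fits.
    """
    budget_ms = slo_ms - rtt_ms
    feasible = [load for load, p95 in p95_service_ms_by_load.items() if p95 <= budget_ms]
    return max(feasible) if feasible else None


# Hypothetical measurements: load in req/s mapped to p95 service time in ms.
p95_by_load = {2000: 3.0, 4000: 5.0, 6000: 9.0, 8000: 20.0, 10000: 60.0}

for rtt in (0.0, 10.0, 22.0, 28.0):
    print(rtt, max_tolerable_load(p95_by_load, slo_ms=30.0, rtt_ms=rtt))
```

As the round-trip time grows, the tolerable load shrinks, and beyond some distance no load level leaves enough budget at all.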

  15-16. Understanding WAN Latency. Monte Carlo simulations combine the base propagation delay with the service time distribution recorded locally at a specific load.

  17. Understanding WAN Latency. Monte Carlo simulations combine the base propagation delay with the service time distribution recorded locally at a specific load. This gives the SLO violation rate at a specific load and under specific WAN conditions.
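
A Monte Carlo estimate of this kind might look like the sketch below. It assumes that delay variance and service time are sampled independently from recorded values and added to a fixed base propagation delay; the function name, parameters, and sample data are illustrative assumptions, not Kurma's actual implementation.

```python
import random


def estimate_violation_rate(service_samples_ms, base_rtt_ms, jitter_samples_ms,
                            slo_ms, trials=100_000, seed=0):
    """Estimate the fraction of remote requests that would exceed the SLO.

    Each trial draws one recorded service time and one recorded delay-variance
    value, adds the base propagation delay, and compares the sum to the SLO.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    violations = 0
    for _ in range(trials):
        total_ms = base_rtt_ms + rng.choice(jitter_samples_ms) + rng.choice(service_samples_ms)
        if total_ms > slo_ms:
            violations += 1
    return violations / trials


# Hypothetical recordings: 90% of requests served in 5 ms, 10% in 50 ms.
service_ms = [5.0] * 9 + [50.0]
jitter_ms = [0.0, 0.5, 1.0, 2.0]
print(estimate_violation_rate(service_ms, base_rtt_ms=20.0, jitter_samples_ms=jitter_ms, slo_ms=30.0))
```

Repeating the estimate for each recorded load level and each measured WAN condition would yield a lookup from (load, WAN latency) to expected SLO violation rate.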

  18. Understanding WAN Latency. [Figure labels: Base propagation delay, Estimation Error]

  19-22. Incorporating WAN and Load. [Figures]

  23. Optimisation Model. Inputs include the runtime load in each datacenter, {λ1, λ2, λ3}. Optimisation problem: minimize global SLO violations (KurmaPerf), or minimize the cost of running a service (KurmaCost).
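
To give a feel for the "minimize global SLO violations" objective, here is a toy two-datacenter version solved by grid search. Everything in it is an assumption made for illustration: the piecewise-linear violation curve, the `knee` parameters, and the brute-force search stand in for the model and solver, which the slides do not describe in detail.

```python
def violation_fraction(load, knee):
    """Hypothetical curve: no violations up to `knee`, then a linear rise to 100% at 2 * knee."""
    return min(1.0, max(0.0, (load - knee) / knee))


def total_violations(f, load_a, load_b, local_knee, remote_knee):
    """Expected violating requests per second when datacenter A redirects fraction f to B."""
    kept = (1.0 - f) * load_a
    redirected = f * load_a
    load_at_b = load_b + redirected
    return (
        kept * violation_fraction(kept, local_knee)               # A's requests served at A
        + load_b * violation_fraction(load_at_b, local_knee)      # B's own requests
        + redirected * violation_fraction(load_at_b, remote_knee)  # A's requests served at B
    )


def best_redirect_fraction(load_a, load_b, local_knee, remote_knee, steps=100):
    """Grid search over redirect fractions 0, 1/steps, ..., 1 for the fewest violations."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates,
               key=lambda f: total_violations(f, load_a, load_b, local_knee, remote_knee))


# Made-up loads in req/s; the remote knee is lower because WAN delay eats into the SLO budget.
f = best_redirect_fraction(load_a=9000, load_b=2000, local_knee=6000, remote_knee=4000)
print(f, total_violations(f, 9000, 2000, 6000, 4000), total_violations(0.0, 9000, 2000, 6000, 4000))
```

The lower `remote_knee` encodes the earlier insight that a distant datacenter must be less loaded to serve redirected requests within the SLO, so the search settles on a partial redirection rather than shifting all excess load.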

  24. Implementation. Each epoch (2.5 sec → 0.4 Hz): perform run-time WAN latency measurements; aggregate load information (rates of requests); exchange metrics to obtain a global view (latencies + loads); solve the decentralized performance model. [Figure labels: Datacenter Stockholm, Datacenter London, Datacenter Frankfurt]

  25. Implementation. Each epoch (2.5 sec → 0.4 Hz): perform run-time WAN latency measurements; aggregate load information (rates of requests); exchange metrics to obtain a global view; solve the decentralized performance model; enforce the computed rates of request redirection. [Figure labels: Datacenter Stockholm, Datacenter London, Datacenter Frankfurt]
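
The per-epoch steps listed on this slide can be sketched as a simple control loop. The five callbacks are hypothetical placeholders for the measurement, aggregation, exchange, solving, and enforcement steps; only the 2.5-second epoch length comes from the slide.

```python
import time
from typing import Callable, Dict

EPOCH_SECONDS = 2.5  # from the slide: one epoch every 2.5 s, i.e. 1 / 2.5 = 0.4 Hz


def run_epochs(
    measure_wan: Callable[[], Dict[str, float]],
    aggregate_load: Callable[[], float],
    exchange: Callable[[Dict[str, float], float], dict],
    solve: Callable[[dict], Dict[str, float]],
    enforce: Callable[[Dict[str, float]], None],
    epochs: int,
    epoch_seconds: float = EPOCH_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run the five per-epoch steps in order, then wait out the rest of the epoch."""
    for _ in range(epochs):
        started = time.monotonic()
        latencies = measure_wan()                # run-time WAN latency measurements
        load = aggregate_load()                  # local rate of requests
        global_view = exchange(latencies, load)  # share metrics with the other datacenters
        rates = solve(global_view)               # compute redirection rates from the model
        enforce(rates)                           # apply the computed redirection rates
        remaining = epoch_seconds - (time.monotonic() - started)
        if remaining > 0:
            sleep(remaining)


# Stub callbacks with made-up values, only to show the call order.
applied = []
run_epochs(
    measure_wan=lambda: {"London": 12.0, "Stockholm": 25.0},
    aggregate_load=lambda: 4000.0,
    exchange=lambda lat, load: {"latencies_ms": lat, "loads": {"Frankfurt": load}},
    solve=lambda view: {"London": 0.1, "Stockholm": 0.0},
    enforce=applied.append,
    epochs=2,
    sleep=lambda _seconds: None,  # skip real waiting in this demo
)
print(len(applied))  # one enforcement per epoch
```

Passing `sleep` as a parameter keeps the loop testable without waiting in real time.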

  26. Evaluation Setup
      Geo-distributed Cassandra cluster:
      • 3 Amazon EC2 datacenters (Ireland, Frankfurt, London)
      • 5 x r5.large VMs per datacenter
      • SLO: 30 ms at the 95th percentile
      • Modified YCSB to replay workload traces (World Cup, http://ita.ee.lbl.gov/html/contrib/WorldCup.html)
      Experiments:
      • Minimizing SLO violations for reads
      • Maintaining target SLO (accuracy)
      • Cost savings for 1 min billing intervals (simulations)
      • Reads and writes, scalability, etc.

  27. Workload Trace. No elastic scaling. [Figure label: load threshold for 5% SLO violations]

  28-29. Cumulative Normalized SLO Violations. Kurma's SLO violations are at 2.4%. The numbers shown above the bars indicate the amount of inter-datacentre traffic transferred; whiskers indicate the 75th percentile.

  30. Average Provisioning Cost Over 30 Consecutive Days. Reactive threshold-based elastic controller. Minimum billing period of 1 minute. Results obtained using simulations. [Figure: total cost [US$] per day; labels: All Shared, All local, WAN latency = 0 ms, Bandwidth cost = $0]

  31. Average Provisioning Cost Over 30 Consecutive Days. KurmaCost: keeps SLO violations under 5% (minimizes redirections while avoiding scaling out). KurmaPerf: minimizes SLO violations (no consideration for traffic usage). Reactive threshold-based elastic controller. Minimum billing period of 1 minute. Results obtained using simulations. [Figure: total cost [US$] per day; labels: All Shared, KurmaCost, KurmaPerf, All local, WAN latency = 0 ms, Bandwidth cost = $0]

  32. Taming SLO Violations. [Figure labels: Under Elastic Threshold, No elastic scaling]

  33. Conclusion. Kurma: a fast and accurate load balancer for geo-distributed systems that takes advantage of spatial variability in load. It decouples end-to-end response time into components of base propagation latency, network congestion, and service time distribution. By operating at the granularity of a few seconds, Kurma reduces SLO violations or lowers the costs of running services by avoiding excessive global service overprovisioning. Contact: KIRILLB@kth.se
