Fast and Accurate Load Balancing for Geo-Distributed Storage Systems

SLIDE 1

Fast and Accurate Load Balancing for Geo-Distributed Storage Systems

Kirill L. Bogdanov1 Waleed Reda1,2 Gerald Q. Maguire Jr.1 Dejan Kostic1 Marco Canini3

1KTH Royal Institute of Technology 2Université Catholique de Louvain 3KAUST

SLIDE 2

Geo-Distributed Services

[Figure: datacenters serving geo-distributed clients]

Service Level Objective (SLO): request completion time at the target percentile (e.g., 30 ms at the 95th percentile)
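As a rough illustration (helper names are hypothetical, not from the talk), an SLO of this form can be checked against a window of observed completion times:

```python
def slo_violation_rate(completion_times_ms, slo_ms):
    """Fraction of requests that finished later than the latency target."""
    late = sum(1 for t in completion_times_ms if t > slo_ms)
    return late / len(completion_times_ms)

def meets_slo(completion_times_ms, slo_ms=30.0, percentile=0.95):
    """An SLO like '30 ms at the 95th percentile' holds when at most
    (1 - percentile) of the requests exceed the latency target."""
    return slo_violation_rate(completion_times_ms, slo_ms) <= 1.0 - percentile

# 4% of requests are late -> within the 5% budget allowed at p95
print(meets_slo([10.0] * 96 + [40.0] * 4))  # → True
```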

SLIDE 3

Geo-Distributed Services

Web-based services demonstrate temporal and spatial variability in load.

Problem: it is difficult to meet strict SLOs while maintaining high resource utilization and low cost.

SLIDE 4

Approach 1 - Datacenter Elasticity

[Figure: request arrival rate over time, in thousands of req/s]


SLIDE 7

Approach 1 - Datacenter Elasticity

Leads to overprovisioning!

  • Provisioning delay (minutes) due to the time needed to spawn and warm up a VM
  • Hard to predict workload far into the future
  • Load spikes can be short-lived
  • Provisioning delays → SLO violations
  • Unused capacity

[Figure: request arrival rate over time, in thousands of req/s]

SLIDE 8

Approach 2 - Geo-Distributed Load Balancing

  • Excessive or insufficient redirection
  • Redirection delays
  • Inaccurate response time estimation

[Figure annotations: redirection delay → SLO violations; excessive redirection]

[Figure: arrival rates at two datacenters, in thousands of req/s]

How much to redirect?

SLIDE 9

Our Approach: Kurma

  • Tames SLO violations at the target level
  • Reacts to changes in load within seconds
  • Accurately estimates the remote rate of SLO violations
  • Avoids unnecessary scaling out

[Figure: arrival rates at two datacenters, in thousands of req/s]

SLIDE 10

Request Completion Time

[Figure: request path from Server 1 (Datacenter Ireland) to Server 2 (Datacenter Frankfurt) across the wide area network]

  • Base Propagation: stable component associated with packet propagation along a network path
  • Delay Variance: variable component associated with competing traffic and queuing
  • Service Time: variable component associated with load on the server

Kurma solves a global optimization model while considering Base Propagation + Delay Variance + Service Time at all datacenters.
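A toy sketch of that decomposition (all distributions and names are hypothetical): the completion time of a redirected request is the sum of the stable propagation delay, a congestion-dependent variance term, and a load-dependent service time.

```python
import random

def remote_completion_time_ms(base_rtt_ms, jitter_sampler, service_sampler):
    """End-to-end completion time of a redirected request, decomposed as
    base propagation + delay variance (congestion) + service time."""
    return base_rtt_ms + jitter_sampler() + service_sampler()

random.seed(1)
# Hypothetical distributions: ~10 ms RTT Ireland<->Frankfurt,
# exponential queuing jitter, log-normal service time under moderate load.
sample = remote_completion_time_ms(
    base_rtt_ms=10.0,
    jitter_sampler=lambda: random.expovariate(1 / 2.0),      # mean 2 ms
    service_sampler=lambda: random.lognormvariate(2.0, 0.5),  # median ~7.4 ms
)
print(round(sample, 1))
```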

SLIDE 11

Understanding Service Time

[Figure: 5-server Cassandra cluster, Datacenter Frankfurt]

SLIDE 12

Understanding Service Time

[Figure: 5-server Cassandra cluster, Datacenter Frankfurt]

Challenge: how to accurately estimate the remote fraction of SLO violations at runtime under variable network conditions?


SLIDE 14

Understanding Service Time

[Figure: 5-server Cassandra clusters in Datacenter Frankfurt and Datacenter Ireland, connected over the wide area network]

Insight: the farther away a remote datacenter is, the less loaded it should be to serve remote requests within a given SLO target.
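A small numerical illustration of this insight, using a made-up load/latency curve (not the paper's measurements): subtracting the propagation delay from the SLO budget leaves less headroom for service time, so the admissible remote load shrinks with distance.

```python
def p95_service_time_ms(load_kreq_s):
    """Hypothetical 95th-percentile service time for a cluster that
    saturates near 10k req/s (latency blows up as load approaches it)."""
    return 5.0 + 40.0 / (10.0 - load_kreq_s)

def max_remote_load(base_rtt_ms, slo_ms=30.0):
    """Highest load (in k req/s) at which a remote datacenter can still
    answer redirected requests within the SLO: its p95 service time must
    fit in the budget left after subtracting the propagation delay."""
    budget = slo_ms - base_rtt_ms
    best = 0.0
    for tenths in range(1, 100):
        load = tenths / 10.0
        if p95_service_time_ms(load) <= budget:
            best = load
        else:
            break
    return best

for rtt_ms in (5, 15, 22):
    print(rtt_ms, max_remote_load(rtt_ms))  # → 8.0, 6.0, 0.0 respectively
```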

SLIDE 15

Understanding WAN Latency

Inputs:
  • Base propagation delay
  • Service time distribution recorded locally at a specific load

Monte Carlo Simulations


SLIDE 17

Understanding WAN Latency

Inputs:
  • Base propagation delay
  • Service time distribution recorded locally at a specific load

Monte Carlo Simulations: give the SLO violation rate for a specific load and WAN conditions
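A sketch of such a Monte Carlo estimator (the distribution choices are illustrative, not the paper's): combine locally recorded service time samples with the base propagation delay and a sampled delay-variance term, and count how often the sum misses the SLO.

```python
import random

def estimate_remote_violations(base_rtt_ms, service_samples_ms,
                               jitter_sampler, slo_ms=30.0, trials=100_000):
    """Monte Carlo estimate of the fraction of redirected requests that
    would miss the SLO: draw a locally recorded service time, add the
    base propagation delay and a sampled delay-variance term."""
    late = 0
    for _ in range(trials):
        total = (base_rtt_ms
                 + jitter_sampler()
                 + random.choice(service_samples_ms))
        if total > slo_ms:
            late += 1
    return late / trials

random.seed(42)
# Hypothetical inputs: service times recorded at the local datacenter
# under the current load, 12 ms base RTT, exponential jitter (mean 1 ms).
service = [random.lognormvariate(2.3, 0.4) for _ in range(5000)]
rate = estimate_remote_violations(12.0, service, lambda: random.expovariate(1.0))
print(f"estimated remote SLO violation rate: {rate:.3f}")
```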

SLIDE 18

Understanding WAN Latency

[Figure: estimation error as a function of base propagation delay]

SLIDE 19

Incorporating WAN and Load


SLIDE 23

Optimisation Model

Runtime load in each datacenter {λ1, λ2, λ3}

Optimisation Problem:
  • Minimize global SLO violations (KurmaPerf)
  • Minimize the cost of running a service (KurmaCost)
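The KurmaPerf objective can be caricatured as follows (a toy grid search over a made-up violation model, not the paper's decentralized solver): choose the redirection fraction that minimizes estimated global SLO violations, with a small penalty for requests that cross the WAN.

```python
# Hypothetical per-datacenter model: estimated SLO violation rate as a
# function of served load, near zero below 70% utilization and growing
# convexly toward saturation.
def violation_rate(load, capacity):
    u = min(load / capacity, 0.999)
    return max(0.0, (u - 0.7) / 0.3) ** 2

def best_redirection(local_load, remote_load, capacity=10.0,
                     remote_penalty=0.01, step=0.05):
    """Grid search over the fraction of local load redirected to a
    remote datacenter, minimizing violation-weighted load plus a
    WAN-propagation penalty on redirected requests."""
    best_f, best_v = 0.0, float("inf")
    f = 0.0
    while f <= 1.0:
        moved = f * local_load
        stay = local_load - moved
        v = (violation_rate(stay, capacity) * stay
             + (violation_rate(remote_load + moved, capacity)
                + remote_penalty) * moved)
        if v < best_v:
            best_f, best_v = f, v
        f = round(f + step, 2)
    return best_f

# Overloaded local DC (9.5k req/s) next to a lightly loaded one (4.0k):
print(best_redirection(local_load=9.5, remote_load=4.0))  # → 0.3
```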

SLIDE 24

Implementation

Global view: latencies + loads

Each epoch: 2.5 s → 0.4 Hz

  • Perform run-time WAN latency measurements
  • Aggregate load information (rates of requests)
  • Exchange metrics to obtain a global view
  • Solve the decentralized performance model

[Figure: Datacenter London, Datacenter Frankfurt, and Datacenter Stockholm exchanging metrics]

SLIDE 25

Implementation

Each epoch: 2.5 s → 0.4 Hz

  • Perform run-time WAN latency measurements
  • Aggregate load information (rates of requests)
  • Exchange metrics to obtain a global view
  • Solve the decentralized performance model
  • Enforce the computed rates of request redirection
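The per-epoch steps above can be sketched as a control loop with pluggable components (every callable here is a hypothetical stub, not Kurma's implementation):

```python
import time

def kurma_epoch_loop(measure_wan, aggregate_load, exchange_metrics,
                     solve_model, enforce_rates, epochs, epoch_s=2.5):
    """Every 2.5 s (0.4 Hz): measure WAN latency, aggregate local load,
    exchange metrics for a global view, solve the decentralized
    performance model, and enforce the computed redirection rates."""
    for _ in range(epochs):
        start = time.monotonic()
        latencies = measure_wan()
        load = aggregate_load()
        global_view = exchange_metrics(latencies, load)
        rates = solve_model(global_view)
        enforce_rates(rates)
        # Sleep out the remainder of the epoch (skipped if we overran).
        time.sleep(max(0.0, epoch_s - (time.monotonic() - start)))

# Dry run with stub components and a zero-length epoch:
log = []
kurma_epoch_loop(
    measure_wan=lambda: {"fra": 10.5, "lon": 7.2},
    aggregate_load=lambda: 4.2,
    exchange_metrics=lambda lat, load: {"local": (lat, load)},
    solve_model=lambda view: {"fra": 0.1, "lon": 0.0},
    enforce_rates=log.append,
    epochs=2, epoch_s=0.0,
)
print(log)  # → [{'fra': 0.1, 'lon': 0.0}, {'fra': 0.1, 'lon': 0.0}]
```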

SLIDE 26

Evaluation Setup

Geo-distributed Cassandra cluster

  • 3 Amazon EC2 datacenters (Ireland, Frankfurt, London)
  • 5 x r5.large VMs per datacenter
  • SLO: 30 ms at the 95th percentile
  • Modified YCSB to replay workload traces

(World Cup http://ita.ee.lbl.gov/html/contrib/WorldCup.html)

Experiments:

  • Minimizing SLO violations for reads
  • Maintaining Target SLO (accuracy)
  • Cost Savings for 1 min billing intervals (simulations)
  • Reads and writes, scalability, etc.

SLIDE 27

Workload Trace

[Figure: workload trace; annotations "Load threshold for 5% SLO violations" and "No elastic scaling"]

SLIDE 28

Cumulative Normalized SLO Violations

The numbers shown above the bars indicate the amount of inter-datacentre traffic transferred; whiskers → 75th percentile. Kurma's SLO violations are at 2.4%.

SLIDE 30

Average Provisioning Cost Over 30 Consecutive Days

[Figure: total cost [US$] per day; bars for "All shared" (WAN latency = 0 ms, bandwidth cost = $0) and "All local"]

  • Reactive threshold-based elastic controller
  • Minimum billing period of 1 minute
  • Results obtained using simulations

SLIDE 31

Average Provisioning Cost Over 30 Consecutive Days

[Figure: total cost [US$] per day; bars for "All shared" (WAN latency = 0 ms, bandwidth cost = $0), KurmaCost, KurmaPerf, and "All local"]

  • KurmaCost: keeps SLO violations under 5% (minimizes redirections while avoiding scaling out)
  • KurmaPerf: minimizes SLO violations (no consideration for traffic usage)
  • Reactive threshold-based elastic controller
  • Minimum billing period of 1 minute
  • Results obtained using simulations

SLIDE 32

Taming SLO Violations Under Elastic Threshold

[Figure annotation: "No elastic scaling"]

SLIDE 33

Conclusion

  • Kurma: a fast and accurate load balancer for geo-distributed systems that takes advantage of spatial variability in load
  • Decouples end-to-end response time into components of base propagation latency, network congestion, and service time distribution
  • By operating at the granularity of a few seconds, Kurma reduces SLO violations or lowers the cost of running services by avoiding excessive global service overprovisioning

Contact: KIRILLB@kth.se