Availability Knob: Flexible User-Defined Availability in the Cloud

SLIDE 1

Availability Knob

Flexible User-Defined Availability in the Cloud

Mohammad Shahrad and David Wentzlaff

October 5, 2016

SLIDE 2

IaaS Providers and Availability Guarantees

One thing in common: Fixed 99.95% availability!

SLIDE 3

What’s wrong with fixed availability?

Cloud customers:

  • Various downtime demands
  • Different WTP*

Cloud infrastructures:

  • Heterogeneous HW & SW reliability

* WTP= Willingness to Pay

SLIDE 4

The Availability Knob (AK)

Let’s have clients ask for their desired availability and be charged correspondingly.

SLIDE 5

What should change in the cloud to support AK?

Cloud management:

  • Gathering failure data and building failure statistics
  • Availability-aware scheduling (cloud scheduler)

Service Level Agreements (SLAs)

SLIDE 6

How do SLAs look with AK?

  • 1. Desired availability / period (e.g. 99.8% / 7 days)
  • 2. Availability price scale, e.g. (99.95%, 1.00), (99.9%, 0.95)
  • 3. Variable service credit (penalty)
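The three SLA terms can be pictured as a small record. The sketch below is only illustrative: the class name `AkSla`, the rule of charging the cheapest listed tier that covers the demand, and the linear service credit are all assumptions, not details given on the slide. It mainly shows the arithmetic behind "99.8% / 7 days".

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AkSla:
    """Hypothetical container for the three SLA terms (names are illustrative)."""
    desired_availability: float                    # e.g. 0.998 for 99.8%
    period_hours: float                            # e.g. 7 * 24 for 7 days
    price_scale: Tuple[Tuple[float, float], ...]   # (availability, relative price)

    def allowed_downtime_minutes(self) -> float:
        """Downtime budget implied by the desired availability over one period."""
        return (1.0 - self.desired_availability) * self.period_hours * 60.0

    def relative_price(self) -> float:
        """Assumption: pay for the cheapest listed tier that covers the demand."""
        eligible = [price for avail, price in self.price_scale
                    if avail >= self.desired_availability]
        if not eligible:
            raise ValueError("desired availability is above every listed tier")
        return min(eligible)

    def service_credit_fraction(self, experienced_dt_minutes: float) -> float:
        """Assumption: credit grows linearly with the overshoot, capped at 100%."""
        allowed = self.allowed_downtime_minutes()
        overshoot = max(0.0, experienced_dt_minutes - allowed)
        if allowed == 0.0:
            return 1.0 if overshoot > 0.0 else 0.0
        return min(1.0, overshoot / allowed)


sla = AkSla(0.998, 7 * 24, ((0.9995, 1.00), (0.999, 0.95)))
print(round(sla.allowed_downtime_minutes(), 2))   # 20.16 minutes per 7 days
print(sla.relative_price())                       # 0.95
```

With these numbers, a 99.8% request over 7 days leaves a budget of about 20 minutes of downtime before any service credit would be owed.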

SLIDE 7

The AK Scheduler

  • 1. Check for available resources
  • 2. Find the cheapest resource considering possible penalties, using:
      • User’s experienced vs. requested DT**
      • Expected PM* time-to-next-failure
      • VM size and expected DT length in case of failure

Data sources: PM Failure DB, Service Record DB

* PM = Physical Machine
** DT = Downtime
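The two scheduler steps can be sketched as a filter followed by a cost minimization. This is a minimal sketch under stated assumptions, not the AK scheduler itself: the `Machine` fields, the single `failure_prob` summarizing the time-to-next-failure estimate, and the rule that a penalty is owed only when one failure would exceed the user's remaining downtime budget are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Machine:
    name: str
    cost_per_hour: float        # provider's cost of hosting the VM here
    free_slots: int             # remaining capacity, in VM slots
    failure_prob: float         # assumed P(failure before period end), from the PM failure DB
    expected_dt_minutes: float  # assumed downtime per failure for this VM size


def expected_cost(m: Machine, hours_left: float,
                  remaining_dt_budget_min: float, penalty: float) -> float:
    """Hosting cost plus the expected penalty if a failure would break the SLO."""
    hosting = m.cost_per_hour * hours_left
    breaks_slo = m.expected_dt_minutes > remaining_dt_budget_min
    return hosting + (m.failure_prob * penalty if breaks_slo else 0.0)


def pick_machine(machines: List[Machine], vm_slots: int, hours_left: float,
                 remaining_dt_budget_min: float, penalty: float) -> Optional[Machine]:
    # Step 1: keep only machines with enough free capacity.
    candidates = [m for m in machines if m.free_slots >= vm_slots]
    if not candidates:
        return None
    # Step 2: cheapest machine once possible penalties are priced in.
    return min(candidates, key=lambda m: expected_cost(
        m, hours_left, remaining_dt_budget_min, penalty))


fleet = [
    Machine("cheap", 0.05, 4, 0.20, 30.0),
    Machine("reliable", 0.08, 4, 0.01, 30.0),
    Machine("full", 0.01, 0, 0.01, 30.0),
]
tight = pick_machine(fleet, 1, 168.0, remaining_dt_budget_min=20.0, penalty=50.0)
loose = pick_machine(fleet, 1, 168.0, remaining_dt_budget_min=45.0, penalty=50.0)
print(tight.name, loose.name)
```

In this toy fleet, a user with only 20 minutes of budget left lands on the more reliable machine, while a user with 45 minutes left can be placed on the cheaper one.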

SLIDE 8

AK-Specific Scheduler Features

Extra knowledge of user availability demand enables new scheduling features:

  • Benign VM* Migration (BVM)
  • Deliberate Downtimes (DDT)

* VM= Virtual Machine

SLIDE 9

Benign VM Migration (BVM)

  • VMs can be over-served:
      • Low failure rate
      • Assignment to HR resources (resource shortfall)

Periodic migration of over-served VMs to cheaper resources

* DTF = Downtime Fulfillment
** SLO = Service Level Objective
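A periodic BVM pass needs some way to rank over-served VMs. The sketch below assumes a hypothetical "slack" measure (the unused share of a client's downtime budget) and picks the top fraction of over-served clients as migration candidates. The 10% default mirrors the setting quoted in the results slides, while the ranking rule itself is an assumption rather than the deck's definition of DTF.

```python
from typing import Dict, List, Tuple


def downtime_slack(experienced_min: float, allowed_min: float) -> float:
    """Unused share of the downtime budget: 1.0 means no downtime used so far."""
    return 1.0 - experienced_min / allowed_min


def select_for_bvm(records: Dict[str, Tuple[float, float]],
                   fraction: float = 0.10) -> List[str]:
    """Return the most over-served clients as candidates for cheaper machines.

    records maps client id -> (experienced downtime, allowed downtime), in minutes.
    """
    over_served = [(downtime_slack(exp, allowed), cid)
                   for cid, (exp, allowed) in records.items()
                   if allowed > 0 and exp < allowed]
    if not over_served:
        return []
    over_served.sort(reverse=True)                 # largest slack first
    k = max(1, round(fraction * len(over_served)))
    return [cid for _, cid in over_served[:k]]


# Ten over-served clients with a 20-minute budget, plus one who already missed it.
records = {f"c{i}": (2.0 * i, 20.0) for i in range(10)}
records["late"] = (25.0, 20.0)
candidates = select_for_bvm(records)
print(candidates)
```

Clients that have already exceeded their budget are never selected, since moving them to less reliable machines would only add to the penalty risk.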

SLIDE 10

Deliberate Downtimes (DDT)

  • Providers can deliberately fail VMs near the end of the period.
  • Motivations:
      • Building market incentives
      • Lowering energy consumption
      • Bidding redeemed resources
      • etc.

[Figure: requested vs. delivered availability, with a safety margin]
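How much deliberate downtime is safe near the end of a period can be estimated from the remaining downtime budget. The sketch below is an assumption-laden illustration: it treats the safety margin as a fraction of the budget that is never spent, which is one possible reading of the figure and not a formula stated on the slide.

```python
def deliberate_downtime_minutes(requested_availability: float,
                                period_minutes: float,
                                experienced_dt_minutes: float,
                                safety_margin: float) -> float:
    """Downtime the provider could still impose without breaking the SLO.

    safety_margin is the (assumed) fraction of the budget kept in reserve.
    """
    allowed = (1.0 - requested_availability) * period_minutes
    target = allowed * (1.0 - safety_margin)
    return max(0.0, target - experienced_dt_minutes)


period = 7 * 24 * 60                      # 7 days, in minutes
ddt = deliberate_downtime_minutes(0.998, period, 5.0, 0.25)
delivered = 1.0 - (5.0 + ddt) / period    # availability after the deliberate downtime
print(round(ddt, 2), delivered >= 0.998)
```

For a 99.8% / 7 day request with 5 minutes already lost and a 25% reserve, roughly 10 minutes could be reclaimed while the delivered availability stays above the requested level.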

SLIDE 11

Economics of AK

  • How to set prices to ensure mutual benefit?
  • How does AK make money?

SLIDE 12

Incentive Compatibility

Clients may:

  • run buggy VMs
  • cause deliberate DTs**

Providers can:

  • neglect meeting SLOs*

Pricing for incentive compatibility

Using game theory to ensure:

  • Providers maximize profit margin by not violating SLOs
  • Clients pay less by asking for their true demands

* SLO = Service Level Objective
** DT = Downtime
SLIDE 13

How does AK make money?

  • 1. Adapting service to real demand: higher market efficiency through supply chain flexibility
  • 2. More efficient resource utilization: lowering OpEx, extra bidding/sprinting
  • 3. Variable profit margins: compensates risks & supply/demand disparity

~10% Cost Reduction, ~20% Profit Increase

SLIDE 14

AK Deployment

  • No hardware change required
  • Low technology adoption cost
  • Existing fixed availability is a subset of AK
  • Can be offered as an optional feature
  • Easy shift to the new model
SLIDE 15

How to evaluate AK?

Infrequency of Failures

Accelerated testing, Simulations

Data center scale

  • 1. Stochastic simulations in MATLAB
  • 2. Prototype implementation with OpenStack


SLIDE 16

AKSim: Stochastic Cloud Simulator

  • Scalability
  • Resolution/accuracy trade-off
  • Diverse applications
  • Multiple VMs
  • Various machine types (cost/resilience trade-off)
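To give a feel for what a stochastic availability simulation involves, here is a minimal Monte Carlo sketch in Python. It is not AKSim (which the deck describes as MATLAB-based): the exponential time-to-failure model, the fixed repair time, and every function and parameter name are assumptions chosen for illustration.

```python
import random


def simulate_availability(mtbf_hours: float, repair_hours: float,
                          period_hours: float, rng: random.Random) -> float:
    """One machine over one period: exponential up-times, fixed-length repairs."""
    t = 0.0
    down = 0.0
    while True:
        t += rng.expovariate(1.0 / mtbf_hours)     # time until the next failure
        if t >= period_hours:
            break
        outage = min(repair_hours, period_hours - t)
        down += outage
        t += outage
    return 1.0 - down / period_hours


def miss_rate(requested: float, mtbf_hours: float, repair_hours: float,
              period_hours: float, runs: int, seed: int = 0) -> float:
    """Fraction of simulated periods whose delivered availability misses the request."""
    rng = random.Random(seed)
    misses = sum(
        simulate_availability(mtbf_hours, repair_hours, period_hours, rng) < requested
        for _ in range(runs)
    )
    return misses / runs


# Two hypothetical machine types on the cost/resilience trade-off, 30-day period.
cheap = miss_rate(0.999, mtbf_hours=200.0, repair_hours=1.0, period_hours=720.0, runs=2000)
reliable = miss_rate(0.999, mtbf_hours=10000.0, repair_hours=1.0, period_hours=720.0, runs=2000)
print(cheap, reliable)
```

Raising `runs` tightens the estimate at the cost of run time, which is one simple instance of a resolution/accuracy trade-off.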

SLIDE 17

OpenStack AK Prototype

SLIDE 18

Availability-aware Scheduler

1000 machines, 12000 users, normal demand distribution, 6 months. BVM every 1 hr for the top 10% of over-served clients.

SLIDE 19

Benign VM Migration (BVM)

~7% Cost Reduction

Increased Miss Rate: 0.19%, 0.34%

1000 machines, 12000 users, uniform demand distribution [3 nines, 5 nines], 30 days. BVM every 1 hr for the top 10% of over-served clients.

Benefits of BVM depend on machine type blend and data-center utilization.

SLIDE 20

Deliberate Downtimes (DDT)

1000 machines, 12000 users, normal demand distribution [3 nines, 5 nines], 6 months. BVM every 1 hr for the top 10% of over-served clients.

Benefits of DDT depend on demand distribution.

SLIDE 21

Improved Service Satisfaction

[Figure: AK satisfaction vs. fixed-availability satisfaction; axes: downtime, price]

* WTP = Willingness to Pay

SLIDE 22

Things to Remember

  • Supply chain flexibility -> market efficiency
  • Knowing user demand can enable new techniques
  • Game theory to ensure mutual economic incentive
  • Leveraging reliability/cost trade-offs
SLIDE 23

The Availability Knob

Mohammad Shahrad

mshahrad@princeton.edu

SLIDE 24

Back-up Slides

SLIDE 25

What if client’s demand changed?

Client must have the incentive to change his plan.

Price under each option:

  • No change; fixed A1: $P_{A_1}$
  • Deliberate failures by user to earn cash back: $P_{A_1} - SC_{A_1}(\alpha A_1 + (1-\alpha) A_2)$
  • Change to A2: $\alpha P_{A_1} + (1-\alpha) P_{A_2}$

Plan update condition: upper bound of SC given arbitrary P
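One way to read the plan update condition, offered as a sketch and not as the authors' stated formula: assume the client should find changing to A2 no more expensive than keeping A1 and collecting service credit through deliberate failures. Comparing those two prices gives an upper bound on the service credit in terms of the prices:

```latex
\alpha P_{A_1} + (1-\alpha) P_{A_2}
  \;\le\; P_{A_1} - SC_{A_1}\!\bigl(\alpha A_1 + (1-\alpha) A_2\bigr)
\quad\Longrightarrow\quad
SC_{A_1}\!\bigl(\alpha A_1 + (1-\alpha) A_2\bigr)
  \;\le\; (1-\alpha)\,\bigl(P_{A_1} - P_{A_2}\bigr)
```

The right-hand side follows from $P_{A_1} - \alpha P_{A_1} - (1-\alpha) P_{A_2} = (1-\alpha)(P_{A_1} - P_{A_2})$. The symbol $\alpha$ is taken here as it appears on the slide, without further interpretation.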

SLIDE 26

Nash Equilibrium

Nash equilibrium:

SLIDE 27

Catastrophic Failure & AK

  • When the whole cloud service is down.

[Figure: missed SLOs (%) vs. catastrophic event length (hours, 1 to 8), comparing AK (uniform dist.) with fixed availability]

SLIDE 28

Why OpenStack

  • VM migration (unlike Eucalyptus)
  • Diverse hypervisor support (KVM)
  • AWS Compatibility
  • Big community (good support)
  • Real world adoption in public/private/hybrid clouds

SLIDE 29

Some More Results

SLIDE 30

Service Credit Reshaping

SLIDE 31

Availability Monitoring Tools

  • There are some performance monitoring tools AK can use to gather availability data:
      • Nagios (used in AWS)
      • Zabbix
      • Ganglia
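Whatever tool collects the data, the availability figure itself is simple to derive from periodic up/down checks. The sketch below assumes evenly spaced health-check samples already exported as booleans. It does not use any Nagios, Zabbix, or Ganglia API, and the function name is hypothetical.

```python
from typing import Sequence


def availability_from_checks(samples: Sequence[bool]) -> float:
    """Share of evenly spaced health checks that reported the VM as up."""
    if not samples:
        raise ValueError("no samples collected for this period")
    return sum(samples) / len(samples)


# One check per minute for 7 days, with 20 failed checks.
checks = [True] * (7 * 24 * 60 - 20) + [False] * 20
measured = availability_from_checks(checks)
print(round(measured * 100, 3), measured >= 0.998)
```

The sampling interval bounds the resolution: with one-minute checks, outages shorter than a minute may go unnoticed.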