Reliability Support in Virtual Infrastructures




SLIDE 1

RESO

Reliability Support in Virtual Infrastructures

Guilherme Koslovski (INRIA – University of Lyon)
Wai-Leong Yeow (DoCoMo USA Labs)
Cedric Westphal (DoCoMo USA Labs)
Tram Truong Huu (University of Nice – I3S)
Johan Montagnat (CNRS – I3S)
Pascale Vicat-Blanc Primet (INRIA - LYaTiss)

2nd IEEE International Conference on Cloud Computing Technology and Science, Indianapolis, 2010

SLIDE 2

2nd IEEE CloudCom – 2010

  • G. Koslovski, W. Yeow, C. Westphal,
  • T. Huu, J. Montagnat, P. Vicat-Blanc


Reliability as a Service

  • Reliability: probability that a system will survive failures
  • Availability: fraction of time that a system is functional
  • In practice, nothing more than SLAs:

– Failure => credits
– Lock-ins
– No guarantees at all

[Figure labels, advertised service levels: 99.95% availability, 99.9% availability, 99.95% reliability, 100% uptime, 100% network uptime]
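
To make the two definitions above concrete, here is a minimal sketch (not part of the original slides). It converts an availability figure into the downtime it allows per year, and computes steady-state availability from an MTBF and a mean time to repair. The function names and the 600 s repair time are illustrative assumptions.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def downtime_per_year(availability: float) -> float:
    """Seconds per year a system may be down at the given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR


def steady_state_availability(mtbf_s: float, mttr_s: float) -> float:
    """Fraction of time the system is functional: MTBF / (MTBF + MTTR)."""
    return mtbf_s / (mtbf_s + mttr_s)


if __name__ == "__main__":
    for a in (0.999, 0.9995):
        hours = downtime_per_year(a) / 3600
        print(f"{a:.2%} availability allows {hours:.2f} h of downtime per year")
    # Hypothetical repair time of 600 s, chosen only for illustration.
    print(steady_state_availability(60000, 600))
```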

SLIDE 3


Context

Convergence of computing and communication:
– Virtual Infrastructure is a concept emerging from Virtual Networks and Infrastructures as a Service
– New models and tools to manage the virtualized substrate and to help users in the execution of their applications

[Figure labels: Network virtualization; Grid computing experience; Resources virtualization; Users; Complex applications; Distributed & virtualized substrate; IaaS, PaaS, … XaaS concepts]

SLIDE 4


Issue

  • Network and IT resources are subject to random failures
  • Failures can be measured: mean time between failures (MTBF)
  • Impact of a failure on a distributed application:
– worker node failure: can affect the total execution time
– database and servers: can compromise the entire execution
  • Some applications can recover from failures, but:
– this process usually affects the execution time
– this complicates the application development
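
The link between MTBF and the chance of a failure during a run can be sketched as follows. This assumes exponentially distributed times between failures and independent nodes; the slides only state that failures are random and measured by MTBF, so both assumptions and the function names are illustrative.

```python
import math


def failure_probability(mtbf_s: float, window_s: float) -> float:
    """P(a node fails within window_s), assuming exponential failure times."""
    return 1.0 - math.exp(-window_s / mtbf_s)


def any_failure_probability(p: float, n: int) -> float:
    """P(at least one of n independent nodes fails), each with probability p."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    p = failure_probability(mtbf_s=60000, window_s=1205)
    print(f"single node: {p:.4f}, any of 30 nodes: {any_failure_probability(p, 30):.4f}")
```

Even a small per-node probability grows quickly with the number of nodes, which is why worker failures tend to show up as longer execution times.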
SLIDE 5


Our proposal

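Reliability as a service offered by the infrastructure provider

[Figure: the User and the Application sit on top of the infrastructure. VMs (VM 1, VM 2, …, VM n) and backups (BKP 1, BKP 2, …, BKP n) are each paired with a PM. User requests shown: "Provide me a basic infrastructure" and "Provide me a reliable infrastructure".]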

SLIDE 6


Our proposal

[Figure: the User and the Application sit on top of the infrastructure. VMs (VM 1, VM 2, …, VM n) and backups (BKP 1, BKP 2, …, BKP n) are each paired with a PM.]

Reliability becomes a service offered by the infrastructure provider

Transparent reliability provisioning: users (applications) have no knowledge of physical failures

SLIDE 7


Outline

Providing Transparent Reliability
– Reliable Virtual Infrastructure description
– Automatic generation of backup nodes and backup links
– Allocation algorithm

Evaluation through a use case application

Conclusion & Future work

SLIDE 8


Mechanism for providing transparent reliability

I. Virtual Infrastructure description
II. Translation of reliability requirements into real backup nodes
III. Allocation of a reliable virtual infrastructure

SLIDE 9


Virtual Infrastructure description: VXDL language

VXDL: Virtual private eXecution infrastructure Description Language

– http://www.ens-lyon.fr/LIP/RESO/Software/vxdl/

A VXDL file contains: general description, resources description, network topology description, timeline description

[Figure labels, example description (vm1):]
– database: [1 GB, 2 GB], 2 GHz, 2 cores, Location: lyon.fr, Reliability: 99.99%
– workers [100 nodes]: 1 GB, 2 GHz, 2 cores, Location: lyon.fr, Reliability: 99.9%
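
The slide does not show VXDL syntax, so the following is not VXDL. It is a hypothetical Python model that holds the same fields as the example description, to show which information a reliability-aware request carries. The class name, field names, and the reading of "[1 GB, 2 GB]" as a (min, max) memory range are assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class ResourceGroup:
    """One group of identical virtual nodes (hypothetical model, not VXDL)."""
    name: str
    count: int
    memory_gb: Tuple[float, float]  # assumed to be a (min, max) range
    cpu_ghz: float
    cores: int
    location: str
    reliability: float  # required reliability as a fraction, e.g. 0.999


# Values copied from the example on the slide.
database = ResourceGroup("database", 1, (1.0, 2.0), 2.0, 2, "lyon.fr", 0.9999)
workers = ResourceGroup("workers", 100, (1.0, 1.0), 2.0, 2, "lyon.fr", 0.999)


def nodes_per_reliability(groups: Iterable[ResourceGroup]) -> Dict[float, int]:
    """Count how many virtual nodes must be protected at each reliability level."""
    totals: Counter = Counter()
    for group in groups:
        totals[group.reliability] += group.count
    return dict(totals)


if __name__ == "__main__":
    print(nodes_per_reliability([database, workers]))
```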

SLIDE 10


Virtual Infrastructure extension

Translation of reliability requirements into replica nodes

Opportunistic Redundancy Pooling (ORP) mechanism [W. Yeow et al, 2010]:

Input:
– Reliability level (user requirement)
– Probability of physical failures (from MTBF)
– Number of protected virtual nodes (user requirement)

Output: the number of backup nodes
– Backup nodes can be shared among different groups of critical nodes
– For example, two sets of backup nodes (k1 and k2) can be shared to protect two groups of critical nodes. Thanks to ORP, only min(k1, k2) is required.

[W. Yeow et al, 2010]: Designing and embedding reliable virtual infrastructures, VISA workshop 2010.
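
The input/output relation listed above can be illustrated with a simple binomial model. This is an assumption made for illustration, not the ORP formulation from [W. Yeow et al, 2010], and it does not pool backups across groups. It assumes every node (critical or backup) fails independently with probability p, and that a group of n critical nodes with k backups survives when at most k of the n + k nodes fail. Function names are hypothetical.

```python
from math import comb


def group_reliability(n: int, k: int, p: float) -> float:
    """P(at most k of the n + k nodes fail), with independent failures of probability p."""
    total = n + k
    return sum(comb(total, i) * p**i * (1 - p) ** (total - i) for i in range(k + 1))


def backup_nodes(n: int, p: float, target: float, max_k: int = 1000) -> int:
    """Smallest k for which the group reaches the target reliability."""
    for k in range(max_k + 1):
        if group_reliability(n, k, p) >= target:
            return k
    raise ValueError("target reliability not reachable within max_k backups")


if __name__ == "__main__":
    # One critical node, failure probability 0.03, required reliability 99.99%.
    print(backup_nodes(n=1, p=0.03, target=0.9999))
```

Under these assumptions, a single critical node with p = 0.03 and a 99.99% target needs 2 backups: with one backup the group survives with probability 1 - 0.03^2 = 0.9991, and with two it survives with probability 1 - 0.03^3 ≈ 0.99997.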

SLIDE 11


Virtual Infrastructure extension

Backup links: consistent network topology

[Figure: construction of backup links in three steps (Step 1, Step 2, Step 3)]

SLIDE 12


Allocation of a Reliable Virtual Infrastructure

An extended graph is composed of the original description + backup components

Backup components can have specific constraints:
– For example, an original node and its backup node should be allocated on different physical racks

Subgraph-isomorphism detection [Lischka et al, 2009]

[Figure: physical substrate and embedded graph]
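
The idea of embedding the extended graph under a rack-separation constraint can be sketched with a small backtracking search. This is a simplified illustration, not the algorithm of [Lischka et al, 2009]: it assumes every virtual link must map onto a direct physical link, ignores capacities, and uses hypothetical names throughout.

```python
from typing import Dict, List, Optional, Set, Tuple

Pair = Tuple[str, str]


def embed(v_nodes: List[str], v_links: List[Pair], p_adj: Dict[str, Set[str]],
          rack_of: Dict[str, str], separate: List[Pair]) -> Optional[Dict[str, str]]:
    """Map virtual nodes to distinct physical nodes, or return None if impossible."""
    mapping: Dict[str, str] = {}

    def partner(pair: Pair, v: str) -> Optional[str]:
        a, b = pair
        return b if a == v else a if b == v else None

    def consistent(v: str, p: str) -> bool:
        if p in mapping.values():
            return False  # each physical node hosts one virtual node
        for link in v_links:
            other = partner(link, v)
            if other in mapping and mapping[other] not in p_adj[p]:
                return False  # virtual link has no direct physical link
        for pair in separate:
            other = partner(pair, v)
            if other in mapping and rack_of[mapping[other]] == rack_of[p]:
                return False  # original and backup would share a rack
        return True

    def search(i: int) -> bool:
        if i == len(v_nodes):
            return True
        v = v_nodes[i]
        for p in p_adj:
            if consistent(v, p):
                mapping[v] = p
                if search(i + 1):
                    return True
                del mapping[v]
        return False

    return dict(mapping) if search(0) else None


if __name__ == "__main__":
    p_adj = {"p1": {"p2", "p3"}, "p2": {"p1", "p3"}, "p3": {"p1", "p2"}}
    rack_of = {"p1": "A", "p2": "A", "p3": "B"}
    print(embed(["db", "bkp", "w"], [("db", "w"), ("bkp", "w")],
                p_adj, rack_of, separate=[("db", "bkp")]))
```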

SLIDE 13


From mapping to allocation

  • The map provided by the allocation is interpreted and instantiated using the HIPerNet framework [P. Primet et al, 2010]
  • Original VMs and replicas are synchronized by a modified version of the Remus live protection mechanism [B. Cully et al, 2008]

SLIDE 14


Evaluation through a use case application

Bronze Standard: distributed large-scale application

– Quantifies the maximal error resulting from medical-image analysis
– Large databases: the more data, the higher the accuracy
– 31 VMs: 512 MB, 1 GHz
– 10 Mbps for each virtual link between the database and the workers

Two scenarios of reliability requirements:

– Database protection: a failure stops the application execution
– Workers protection: a failure increases the execution time

Testbed: Grid’5000

– Physical substrate is composed of 100 nodes
– MTBF simulation values: 60000s, 30000s, 15000s

[D. Atwood et al., 2008]

I) Translated into VXDL
II) Submitted to HIPerNet

SLIDE 15


Experimental results

Prices for computing nodes protection (30 VMs, 99.9%):

                          Short term                           Long term
MTBF      Backup nodes    Total cost   Reliability cost /      Total cost   Reliability cost /
                                       total cost                           total cost
60000 s   5               $3.42        16.1%                   $3.10        5.3%
30000 s   8               $3.71        25.8%                   $3.19        8.4%
15000 s   12              $4.09        38.7%                   $3.32        12.6%

Goal: quantify the cost of a reliable virtual infrastructure

– Prices are based on Amazon EC2 for Europe
– We do not include any specific link pricing
– Cost without reliability support (short term lease): $2.95 / h
– VM specifications: basic node; short term lease $0.095, long term lease $0.031
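
The figures on this slide are consistent with a simple cost model, sketched below as an inference from the numbers rather than something the slides state. It assumes the 31 VMs are priced at the short term rate (31 × $0.095 ≈ $2.95), that each backup node adds one node price, and that in the "long term" columns only the backups use the long term rate. Under this reading, the percentage column matches the reliability cost divided by the $2.95 baseline. Names are hypothetical.

```python
from typing import Tuple

NODE_PRICE_SHORT = 0.095  # $ per node-hour, short term lease (from the slide)
NODE_PRICE_LONG = 0.031   # $ per node-hour, long term lease (from the slide)


def reliable_cost(n_vms: int, n_backups: int, backup_price: float,
                  vm_price: float = NODE_PRICE_SHORT) -> Tuple[float, float]:
    """Return (total cost, reliability cost relative to the baseline cost)."""
    base = n_vms * vm_price
    reliability = n_backups * backup_price
    return base + reliability, reliability / base


if __name__ == "__main__":
    for backups in (5, 8, 12):
        short_total, short_ratio = reliable_cost(31, backups, NODE_PRICE_SHORT)
        long_total, long_ratio = reliable_cost(31, backups, NODE_PRICE_LONG)
        print(f"{backups:2d} backups: short ${short_total:.2f} ({short_ratio:.1%}), "
              f"long ${long_total:.2f} ({long_ratio:.1%})")
```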

SLIDE 16


Experimental results

Goal: evaluate the application behavior when executing with reliability support

Application makespan without substrate failures: 1205s, used as baseline

– Database protection (DB label): the database is the only component protected. Makespan increases proportionally to the number of failures
– Worker nodes protection (WN label): only computing nodes are protected. Makespan slightly increases

MTBF     DB increase   WN increase
60000s   16.26%        0.2%
30000s   26.47%        1.7%
15000s   40.08%        3.2%

[Figure: bar chart of makespan for NI, 60000s, 30000s and 15000s, with series DB and WN; axis ticks 200 to 1800]
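
As a small worked example, the percentage increases can be turned back into absolute makespans. This assumes each increase is relative to the 1205 s baseline, which the slide suggests but does not spell out. The function name is hypothetical.

```python
BASELINE_S = 1205  # makespan without substrate failures (from the slide)

# MTBF (s) -> (DB increase %, WN increase %), copied from the table.
INCREASE_PCT = {
    60000: (16.26, 0.2),
    30000: (26.47, 1.7),
    15000: (40.08, 3.2),
}


def makespan(baseline_s: float, increase_pct: float) -> float:
    """Makespan after applying a percentage increase to the baseline."""
    return baseline_s * (1 + increase_pct / 100)


if __name__ == "__main__":
    for mtbf, (db_pct, wn_pct) in INCREASE_PCT.items():
        print(f"MTBF {mtbf}s: DB {makespan(BASELINE_S, db_pct):.0f}s, "
              f"WN {makespan(BASELINE_S, wn_pct):.0f}s")
```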

SLIDE 17


Experimental results

Goal: reliability service vs resubmission mechanism:

– Application is aware of substrate failures
– A task is resubmitted on a new computing node
– The makespan difference would have been larger if backup nodes were not pre-allocated and configured

MTBF     Makespan increase
60000s   +13.08%
30000s   +19.67%
15000s   +22.19%

[Figure: bar chart comparing Reliability and Resubmission for MTBF 60000s, 30000s and 15000s; axis ticks 200 to 1600]

SLIDE 18


Conclusions

Reliability becomes a service offered by the infrastructure provider

We have developed a framework to provide transparent reliability:

– A language to specify the reliability requirements
– A mechanism to interpret these requirements and transform them into replicas (nodes and links)
– A mapping and allocation process to provision the reliability level required by the user

The framework was implemented on top of the HIPerNet framework, and validated over the Grid’5000 testbed

Future work includes:

– the implementation of a mechanism to protect virtual links
– a detailed investigation of the economic aspects

– Tomorrow there is a demonstration about the industry version of the HIPerNet framework (LYaTiss core) - http://www.lyatiss.com/

SLIDE 19


Thank you for your attention! Any questions?

guilherme.koslovski@ens-lyon.fr, wlyeow@ieee.org, cwestphal@docomolabs-usa.com, tram@polytech.unice.fr, johan@i3s.unice.fr, pvb@lyatiss.com

SLIDE 20


Some references

  • Specifying and provisioning Virtual Infrastructures with HIPerNET. Fabienne Anhalt, Guilherme Koslovski, and Pascale Vicat-Blanc Primet. ACM International Journal of Network Management (IJNM) - special issue on Network Virtualization and its Management, 2010.
  • Joint elastic cloud and network framework for application performance optimization and cost reduction. Tram Truong Huu, Guilherme Koslovski, Fabienne Anhalt, Pascale Vicat-Blanc Primet, and Johan Montagnat. Journal of Grid Computing (JoGC), 2010.
  • Reliability support in virtual infrastructures. Guilherme Koslovski, Wai-Leong Yeow, Cedric Westphal, Tram Truong Huu, Pascale Vicat-Blanc Primet, and Johan Montagnat. In 2nd IEEE CloudCom 2010, Indianapolis, USA.
  • A scalable security model for enabling Dynamic Virtual Private Execution Infrastructures on the Internet. Pascale Vicat-Blanc Primet, Jean-Patrick Gelas, Olivier Mornard, Guilherme Koslovski, Vincent Roca, Lionel Giraud, Johan Montagnat, and Tram Truong Huu. In IEEE/ACM CCGrid2009, Shanghai, May 2009.
  • Exploring the virtual infrastructure service concept in Grid'5000. Pascale Vicat-Blanc Primet, Fabienne Anhalt, and Guilherme Koslovski. In 20th ITC Specialist Seminar on Network Virtualization, Hoi An, Vietnam, May 2009.
  • Executing distributed applications on virtualized infrastructures specified with the VXDL language and managed by the HIPerNET framework. Guilherme Koslovski, Tram Truong Huu, Johan Montagnat, and Pascale Vicat-Blanc Primet. In CLOUDCOMP 2009, Munich, Germany, October 2009.
  • Virtual Resources and Interconnection Networks Description Language. Guilherme Koslovski, Pascale Vicat-Blanc Primet, and Andrea Schwertner Charão. In GridNets 2008, Oct. 2008.
  • HIPernet: A Decentralized Security Infrastructure for Large Scale Grid Environments. Julien Laganier, Pascale Vicat-Blanc Primet. In 6th IEEE/ACM International Conference on Grid Computing (GRID 2005), November 13-14, 2005, Seattle, Washington, USA, Proceedings, pages 140-147, 2005. IEEE.

SLIDE 21


Backup slides

SLIDE 22


ViPXi description: General description

VXDL file: general description, resources description, network topology description, timeline description

– VI identification (name, owner, users)
– Reservation (start and end time)
– General properties: location, security, monitoring

SLIDE 23


ViPXi description: Resources


Resources and groups of resources

Resource types:

– Virtual routers/switches
– Virtual machines
– Virtual storages
– Virtual access points

Cross-layer parameters:

– Exclusivity
– Specific devices
– Location

SLIDE 24


ViPXi description: Network topology


Topology description:

– (Provisioned) virtual links

Specification of QoS constraints:

– Bandwidth (forward and reverse)
– Latency
– Reliability
– Security

SLIDE 25


ViPXi description: Timeline


  • components are often not used simultaneously or all along the VI lifetime
  • an internal timeline for each VI can help in optimizing the allocation, scheduling, and provisioning processes
  • a timeline is composed of stages, delimited by temporal marks

SLIDE 26


Reliability prices

Total cost = original cost + reliability cost

[Figure labels: original cost, reliability cost]

SLIDE 27


Evaluation through a use case application

Bronze Standard workflow is analyzed and translated into a VXDL file:

– 31 VMs: 512 MB, 1 GHz
– 10 Mbps for each virtual link between the database and the workers

Two scenarios of reliability requirements:

– Database protection: a failure stops the application execution
– Workers protection: a failure increases the execution time

Physical substrate: Grid’5000 testbed

I) Translated into VXDL
II) Submitted to HIPerNet

SLIDE 28


ViPXi description: reliability

[Figure labels, example virtual infrastructure:]
– Nodes: vm1, vm2, vm3, vm4
– Node vm2: [1 GB, 2 GB], 2 GHz, 2 cores, Location: lyon.fr, Reliability: 99.9%
– Routers r1, r2: Layer: Ethernet, dynamically configured
– Link l1: [20 Mb/s, 200 Mb/s] with monitoring
– Links l2, l3: [10 Mb/s, 100 Mb/s]
– Links l4, l5: {10 Mb/s, 50 Mb/s, 100 Mb/s}
– Reliability: 99.99%

Reliability is specified by the user, considering the information exposed by the physical substrate.

SLIDE 29


Scenario composition and metrics

Goals:

– Evaluate the application behavior when executing with reliability support
– Quantify the cost of a reliable ViPXi

Application makespan without substrate failures: 1205s, used as baseline

Simple cost model for the pricing:

– ViPXi cost + replicas cost
– Prices are based on Amazon EC2 for Europe
– 1 hour reservation
– We do not include any specific link pricing

ViPXi cost without reliability support (short term lease): $2.95, used as baseline

Physical substrate is composed of 100 nodes:

– MTBF simulation values: 60000s, 30000s, 15000s

VM specifications: basic node (1.7GB RAM); short term lease $0.095, long term lease $0.031

SLIDE 30


Experimental results

Prices for database protection (1 VM, 99.99%):

                                       Short term                           Long term
MTBF      Fail prob.   Backup nodes    Total cost   Reliability cost /      Total cost   Reliability cost /
                                                    total cost                           total cost
60000 s   0.03         2               $3.13        6%                      $3.01        2%
30000 s   0.06         3               $3.23        10%                     $3.04        3%
15000 s   0.12         4               $3.33        13%                     $3.07        4%

Prices for computing nodes protection (30 VMs, 99.9%):

                                       Short term                           Long term
MTBF      Fail prob.   Backup nodes    Total cost   Reliability cost /      Total cost   Reliability cost /
                                                    total cost                           total cost
60000 s   0.03         5               $3.42        16.1%                   $3.10        5.3%
30000 s   0.06         8               $3.71        25.8%                   $3.19        8.4%
15000 s   0.12         12              $4.09        38.7%                   $3.32        12.6%

Goal: Quantify the cost of a reliable ViPXi

– Prices are based on Amazon EC2 for Europe
– We do not include any specific link pricing
– Cost without reliability support (short term lease): $2.95
– VM specifications: basic node; short term lease $0.095, long term lease $0.031