
Reliability Support in Virtual Infrastructures



  1. Reliability Support in Virtual Infrastructures
     2nd IEEE International Conference on Cloud Computing Technology and Science, Indianapolis, 2010
     RESO
     • Guilherme Koslovski (INRIA – University of Lyon)
     • Wai-Leong Yeow (DoCoMo USA Labs)
     • Cedric Westphal (DoCoMo USA Labs)
     • Tram Truong Huu (University of Nice – I3S)
     • Johan Montagnat (CNRS – I3S)
     • Pascale Vicat-Blanc Primet (INRIA – LYaTiss)

  2. Reliability as a Service
     • Reliability: probability that a system will survive failures
     • Availability: fraction of time that a system is functional
     • Guarantees shown on the slide: 99.95% availability, 99.9% availability, 99.95% reliability, 100% uptime, 100% network uptime
     • In practice these are nothing more than SLAs:
         – Failure => credits
         – Lock-ins
         – No guarantees at all
     2nd IEEE CloudCom 2010. G. Koslovski, W. Yeow, C. Westphal, T. Huu, J. Montagnat, P. Vicat-Blanc
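To make the availability figures above concrete, here is a minimal sketch that converts an availability fraction into the downtime it permits per year. It assumes a 365-day year; the function name is illustrative and not taken from the slides.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def downtime_hours_per_year(availability: float) -> float:
    """Hours per year a system may be down while still meeting `availability`."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for level in (0.999, 0.9995):
        print(f"{level:.2%} availability allows {downtime_hours_per_year(level):.2f} h/year of downtime")
```

Availability says nothing about how the downtime is distributed: a single long outage and many short ones give the same figure, which is why the slide treats reliability as a separate quantity.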

  3. Context
     • Convergence of computing and communication: Virtual Infrastructure is a concept emerging from Virtual Networks and Infrastructures as a Service
     • New models and tools to manage the virtualized substrate and to help users execute their applications
     [Figure labels: Users; Complex applications; Network virtualization; Resources virtualization; Distributed & virtualized substrate; Grid computing experience; IaaS, PaaS, … XaaS concepts]

  4. Issue
     • Network and IT resources are subject to random failures
     • Failures can be measured: mean time between failures (MTBF)
     • Impact of a failure on a distributed application:
         – Worker node failure: can affect the total execution time
         – Database and servers: can compromise the entire execution
     • Some applications can recover from failures, but:
         – This process usually affects the execution time
         – This complicates the application development
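As a rough illustration of how an MTBF value translates into a failure probability, the sketch below assumes that failures follow a Poisson process, so the time to failure is exponentially distributed. The slides do not state which failure model is used, so this is an assumption, and the function name is hypothetical.

```python
import math


def failure_probability(duration_s: float, mtbf_s: float) -> float:
    """Probability of at least one failure within `duration_s` seconds.

    Assumes exponentially distributed time to failure with mean `mtbf_s`.
    """
    if duration_s < 0 or mtbf_s <= 0:
        raise ValueError("duration must be >= 0 and MTBF must be > 0")
    # expm1 keeps precision when duration_s is much smaller than mtbf_s.
    return -math.expm1(-duration_s / mtbf_s)


if __name__ == "__main__":
    # MTBF values and the 1205 s baseline makespan are the ones quoted later in the deck.
    for mtbf in (60000, 30000, 15000):
        print(f"MTBF {mtbf} s: P(failure during 1205 s) = {failure_probability(1205, mtbf):.4f}")
```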

  5. Our proposal
     • Reliability as a service offered by the infrastructure provider
     • User requests: "Provide me a basic infrastructure" / "Provide me a reliable infrastructure"
     [Figure: the user's application runs on VM 1, VM 2, ..., VM n, each hosted on a physical machine (PM); backups BKP 1, BKP 2, ..., BKP n are each hosted on a PM]

  6. Our proposal
     • Reliability becomes a service offered by the infrastructure provider
     • Transparent reliability provisioning
     • Users (applications) have no knowledge about physical failures
     [Figure: the user's application runs on VM 1, VM 2, ..., VM n, each hosted on a physical machine (PM); backups BKP 1, BKP 2, ..., BKP n are each hosted on a PM]

  7. Outline
     • Providing Transparent Reliability
         – Reliable Virtual Infrastructure description
         – Automatic generation of backup nodes and backup links
         – Allocation algorithm
     • Evaluation through a use case application
     • Conclusion & Future work

  8. Mechanism for providing transparent reliability
     I. Virtual Infrastructure description
     II. Translation of reliability requirements into real backup nodes
     III. Allocation of a reliable virtual infrastructure

  9. Virtual Infrastructure description: VXDL language
     • VXDL: Virtual private eXecution infrastructure Description Language – http://www.ens-lyon.fr/LIP/RESO/Software/vxdl/
     • VXDL file sections: General description, Resources description, Network topology description, Timeline description
     • Example resources shown in the figure (which also carries a "vm1" label):
         – workers [100 nodes]: 1 GB, 2 GHz, 2 cores, Location: lyon.fr, Reliability: 99.9%
         – database: [1 GB, 2 GB], 2 GHz, 2 cores, Location: lyon.fr, Reliability: 99.99%

  10. Virtual Infrastructure extension
     • Translation of reliability requirements into replica nodes
     • Opportunistic Redundancy Pooling (ORP) mechanism [W. Yeow et al, 2010]:
         – Input: reliability level (user requirement); probability of physical failures (from MTBF); number of protected virtual nodes (user requirement)
         – Output: the number of backup nodes
     • Backup nodes can be shared among different groups of critical nodes
         – For example, two sets of backup nodes (k1 and k2) can be shared to protect two groups of critical nodes. With ORP, only min(k1, k2) backup nodes are required.
     [W. Yeow et al, 2010]: Designing and embedding reliable virtual infrastructures, VISA workshop 2010.
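The inputs and output listed above can be illustrated with a simple k-out-of-n calculation. This is only a sketch under stated assumptions, not the ORP algorithm itself: it assumes every node (protected or backup) fails independently with the same probability, that any backup can replace any failed node, and it does not model pooling across groups. The function name and parameters are hypothetical.

```python
from math import comb


def min_backups(protected: int, p_fail: float, target: float, limit: int = 1000) -> int:
    """Smallest k such that at most k of the protected + k nodes fail
    with probability >= target (independent, identical failures assumed)."""
    if protected < 1 or not 0.0 <= p_fail < 1.0 or not 0.0 < target <= 1.0:
        raise ValueError("invalid arguments")
    for k in range(limit + 1):
        total = protected + k
        # Binomial probability that the number of failed nodes is <= k.
        survive = sum(
            comb(total, x) * p_fail**x * (1.0 - p_fail) ** (total - x)
            for x in range(k + 1)
        )
        if survive >= target:
            return k
    raise ValueError("target not reachable within the backup limit")


if __name__ == "__main__":
    # Illustrative inputs only; the failure probability is not taken from the slides.
    print(min_backups(protected=30, p_fail=0.02, target=0.999))
```

Under this simplified model the required number of backups grows as the failure probability rises, which matches the trend in the evaluation slides where a lower MTBF needs more backup nodes.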

  11. Virtual Infrastructure extension
     • Backup links: consistent network topology
     [Figure: three-panel illustration labelled Step 1, Step 2, Step 3]

  12. Allocation of a Reliable Virtual Infrastructure
     • An extended graph is composed of the original description plus the backup components
     • Backup components can have specific constraints:
         – For example, an original node and its backup node should be allocated on different physical racks
     • Subgraph-isomorphism detection [Lischka et al, 2009]
     [Figure labels: Physical substrate; Embedded graph]
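The rack constraint above can be sketched with a small backtracking search. This is a deliberately naive illustration, not the subgraph-isomorphism algorithm of Lischka et al.: it assumes one virtual node per physical node, requires every virtual link to map onto a direct physical link, and ignores capacities. All names (`embed`, `rack_of`, `backup_of`) are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def embed(
    virtual_nodes: List[str],
    virtual_links: Set[Tuple[str, str]],
    physical_links: Set[Tuple[str, str]],
    rack_of: Dict[str, str],      # physical node -> rack identifier
    backup_of: Dict[str, str],    # backup virtual node -> original virtual node
) -> Optional[Dict[str, str]]:
    """Map each virtual node to a distinct physical node, or return None."""
    hosts = sorted(rack_of)
    # Treat physical links as undirected.
    adjacent = physical_links | {(b, a) for a, b in physical_links}
    mapping: Dict[str, str] = {}

    def compatible(vnode: str, host: str) -> bool:
        if host in mapping.values():
            return False  # host already used by another virtual node
        for other, other_host in mapping.items():
            linked = (vnode, other) in virtual_links or (other, vnode) in virtual_links
            if linked and (host, other_host) not in adjacent:
                return False  # virtual link has no matching physical link
            paired = backup_of.get(vnode) == other or backup_of.get(other) == vnode
            if paired and rack_of[host] == rack_of[other_host]:
                return False  # original and backup must be on different racks
        return True

    def search(index: int) -> bool:
        if index == len(virtual_nodes):
            return True
        vnode = virtual_nodes[index]
        for host in hosts:
            if compatible(vnode, host):
                mapping[vnode] = host
                if search(index + 1):
                    return True
                del mapping[vnode]  # undo and try the next host
        return False

    return dict(mapping) if search(0) else None
```

A practical allocator would also check CPU, memory and bandwidth, and would allow virtual links to span multi-hop physical paths; the sketch only shows how an anti-affinity constraint prunes candidate placements.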

  13. From mapping to allocation
     • The map provided by the allocation is interpreted and instantiated using the HIPerNet framework [P. Primet et al, 2010]
     • Original VMs and replicas are synchronized by a modified version of the Remus live protection mechanism [B. Cully et al, 2008]

  14. Evaluation through a use case application
     • Bronze Standard: distributed large-scale application
         – Quantifies the maximal error resulting from medical-image analysis
         – Large databases: the more data, the higher the accuracy
         – 31 VMs: 512 MB, 1 GHz
         – 10 Mbps for each virtual link between the database and the workers
     • Workflow: I) translated into VXDL; II) submitted to HIPerNet
     • Two scenarios of reliability requirements:
         – Database protection: a failure stops the application execution
         – Workers protection: a failure increases the execution time
     • Testbed: Grid'5000
         – Physical substrate is composed of 100 nodes
         – MTBF simulation values: 60000s, 30000s, 15000s [D. Atwood et al., 2008]

  15. Experimental results
     • Goal: quantify the cost of a reliable virtual infrastructure
     • Prices are based on Amazon EC2 for Europe; no specific link pricing is included
     • VM specifications (basic node): short term lease $0.095, long term lease $0.031
     • Cost without reliability support (short term lease): $2.95 / h
     • Prices for computing nodes protection (30 VMs, 99.9%):

       MTBF      Backup   Short term                              Long term
                 nodes    Total cost   Reliability cost / total   Total cost   Reliability cost / total
       60000 s   5        $3.42        16.1%                      $3.10        5.3%
       30000 s   8        $3.71        25.8%                      $3.19        8.4%
       15000 s   12       $4.09        38.7%                      $3.32        12.6%
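The table values can be reproduced with a short calculation. The pricing model here is inferred from the numbers rather than stated on the slide: it assumes the 31 original VMs are always billed at the short term rate, that backup nodes are billed at the short or long term rate depending on the column, that rates are per VM per hour, and that the percentage is the reliability cost divided by the cost without reliability support.

```python
from typing import Tuple

SHORT_TERM = 0.095   # $ per VM, short term lease (from the slide; per hour assumed)
LONG_TERM = 0.031    # $ per VM, long term lease (from the slide; per hour assumed)
ORIGINAL_VMS = 31    # size of the virtual infrastructure in the use case


def hourly_cost(backup_nodes: int, backup_rate: float) -> Tuple[float, float]:
    """Return (total cost per hour, reliability cost as a fraction of the base cost)."""
    base = ORIGINAL_VMS * SHORT_TERM          # cost without reliability support
    reliability = backup_nodes * backup_rate  # extra cost of the backup nodes
    return base + reliability, reliability / base


if __name__ == "__main__":
    for mtbf, backups in ((60000, 5), (30000, 8), (15000, 12)):
        short_total, short_share = hourly_cost(backups, SHORT_TERM)
        long_total, long_share = hourly_cost(backups, LONG_TERM)
        print(
            f"MTBF {mtbf} s, {backups} backups: "
            f"short ${short_total:.3f} ({short_share:.1%}), "
            f"long ${long_total:.3f} ({long_share:.1%})"
        )
```

Under these assumptions the results agree with the table to within rounding, which suggests the long term column applies the lower rate only to the backup nodes.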

  16. Experimental results
     • Goal: evaluate the application behavior when executing with reliability support
     • Application makespan without substrate failures: 1205s, used as baseline
     • Database protection (DB label): the database is the only component protected
         – Makespan increases proportionally to the number of failures
     • Worker nodes protection (WN label): only computing nodes are protected
         – Makespan slightly increases
     [Figure: bar chart of makespan (axis 0 to 1800) for NI and for MTBF 60000s, 30000s, 15000s; series NI, DB, WN]

       MTBF      DB increase   WN increase
       60000s    16.26%        0.2%
       30000s    26.47%        1.7%
       15000s    40.08%        3.2%
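For readers who prefer absolute numbers, the percentages can be converted back into makespans. The sketch assumes each increase is measured relative to the 1205 s baseline, as the slide implies; the helper name is illustrative.

```python
BASELINE_S = 1205.0  # makespan without substrate failures (from the slide)

# Percentage increases from the table, keyed by MTBF in seconds.
INCREASE_PERCENT = {
    60000: {"DB": 16.26, "WN": 0.2},
    30000: {"DB": 26.47, "WN": 1.7},
    15000: {"DB": 40.08, "WN": 3.2},
}


def makespan_seconds(increase_percent: float) -> float:
    """Absolute makespan for a given percentage increase over the baseline."""
    return BASELINE_S * (1.0 + increase_percent / 100.0)


if __name__ == "__main__":
    for mtbf, scenarios in INCREASE_PERCENT.items():
        for label, increase in scenarios.items():
            print(f"MTBF {mtbf} s, {label}: {makespan_seconds(increase):.0f} s")
```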

  17. Experimental results
     • Goal: compare the reliability service with a resubmission mechanism:
         – The application is aware of substrate failures
         – A task is resubmitted on a new computing node
         – The makespan difference would have been larger if backup nodes were not pre-allocated and configured
     [Figure: bar chart of makespan (axis 0 to 1600) for MTBF 60000s, 30000s, 15000s; series Reliability, Resubmission]

       MTBF      Makespan increase
       60000s    +13.08%
       30000s    +19.67%
       15000s    +22.19%

  18. Conclusions
     • Reliability becomes a service offered by the infrastructure provider
     • We have developed a framework to provide transparent reliability:
         – A language to specify the reliability requirements
         – A mechanism to interpret these requirements and transform them into replicas (nodes and links)
         – A mapping and allocation process to provision the reliability level required by the user
     • The framework was implemented on top of the HIPerNet framework and validated over the Grid'5000 testbed
     • Future work includes:
         – The implementation of a mechanism to protect virtual links
         – A detailed investigation of the economic aspects
     • Tomorrow there is a demonstration of the industry version of the HIPerNet framework (LYaTiss core) - http://www.lyatiss.com/
