
UNIBUS: Aspects of Heterogeneity and Fault Tolerance in Cloud Computing



  1. UNIBUS: Aspects of Heterogeneity and Fault Tolerance in Cloud Computing
     Magdalena Slawinska, Jaroslaw Slawinski, Vaidy Sunderam
     {magg, jaross, vss}@mathcs.emory.edu
     Emory University, Dept. of Mathematics and Computer Science, Atlanta, GA, USA
     Atlanta, Georgia, April 19, 2010, in conjunction with IPDPS 2010

  2. Creating a problem
     1. What do I want?
        - Execute an MPI application
     2. What do I need?
        - Target resource: MPI cluster
        - FT services: Checkpoint, Heartbeat
        - Reliability
     3. What do I have?
        - Access to the Rackspace cloud
     4. Why might I want FT on a cloud?
        - To reduce costs (money, time, energy, ...)
     5. What is the overhead introduced by FT?
     6. Can I do that? How?

  3. Problem
     User's requirements:
     - Execute MPI software
     - Target resource: MPI cluster
     - Target platform: FT-flavor
     User's resources:
     - Rackspace cloud (credentials)
     Manual resource transformation:
     - interaction with the web page
     - prepare the image: install required software and dependencies
     - instantiate servers
     - configure passwordless authentication
     - ...
     - ca. 1 man-hour for 16+1 nodes

  4. Unibus: a resource orchestrator
     User's requirements:
     - Execute MPI software
     - Target resource: MPI cluster
     - Target platform: FT-flavor
     User's resources:
     - Rackspace cloud (credentials)
     Unibus transforms the user's available resources (Rackspace cloud, EC2 cloud, workstations) into the target resource.

  5. Outline
     - Unibus - an infrastructure framework for orchestrating resources
       - Resource access virtualization
       - Resource provisioning
     - Unibus - an FT MPI platform on demand
       - Automatic assembly of an FT MPI-enabled platform
       - Execution of an MPI application on the Unibus-created FT MPI-enabled platform
       - Discussion of the FT overhead

  6. Unibus resource sharing model
     Resource exposition:
     - Traditional Model: Virtual Organization (VO)
     - Proposed Model: resource provider
     Resource usage:
     - Traditional Model: determined by the VO
     - Proposed Model: determined by a particular resource provider
     Resource virtualization and aggregation:
     - Traditional Model: resource providers belonging to the VO
     - Proposed Model: software at the client side

  7. Handling heterogeneity in Unibus
     - Resources are exposed in an arbitrary manner as access points
     - The Capability Model implements abstract operations available on providers' resources
     - Mediators implement the specifics of access points (access protocols)
     - A knowledge engine infers relevant facts
     (Diagram: User -> Unibus [Capability Model, engine, mediators, access library] -> network -> resources' access points, exposed by access daemons)

  8. Complicating a big picture ...
     - Resources are exposed in an arbitrary manner as access points
     - The Capability Model implements abstract operations on resources
     - Mediators implement the specifics of access points
     - A knowledge engine infers relevant facts
     - Resource descriptors describe resources semantically (OWL-DL)
     - Services (standard and third-party), e.g., heartbeat, checkpoint, resource discovery, etc.
     - Metaapplications orchestrate the execution of applications on relevant resources
     (Diagram: the Unibus access device - metaapplications and services on top of the Capability Model, engine, mediators, and resource descriptors, reaching resources over the network)

  9. Virtualizing access to resources: Capability Model and mediators
     - Capability Model:
       - provides virtually homogenized access to heterogeneous resources
       - specifies abstract operations, grouped in interfaces
       - an interface hierarchy is not appropriate (e.g., fs:ITransferable and ssh:ISftp)
     - Mediators:
       - implement the details of resource access point protocols (e.g., an Ssh access point on the Rackspace cloud, a cluster, or workstations)

  10. Virtualizing access to resources
      - The ISsh interface specifies the abstract operations shell, exec, and subsystem
      - The Ssh Mediator implements them as invoke_shell, exec_command, and get_subsystem (e.g., get_subsystem for sftp)
      - The mediator is compatibleWith the sshd access point on a workstation
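The mapping on slide 10 can be sketched in a few lines of Python. This is an illustrative stand-in, not Unibus's actual code: the class names, the stub transport, and the `compatible_with` attribute are all assumptions made for the example.

```python
from abc import ABC, abstractmethod

class ISsh(ABC):
    """Abstract ISsh interface: shell, exec, subsystem (as on slide 10)."""
    @abstractmethod
    def shell(self): ...
    @abstractmethod
    def exec(self, command): ...
    @abstractmethod
    def subsystem(self, name): ...

class SshMediator(ISsh):
    """Maps the abstract ISsh operations onto concrete SSH-style calls."""
    compatible_with = "sshd"              # consumed by the knowledge engine

    def __init__(self, transport):
        self.transport = transport

    def shell(self):                      # abstract 'shell' -> invoke_shell
        return self.transport.invoke_shell()

    def exec(self, command):              # abstract 'exec' -> exec_command
        return self.transport.exec_command(command)

    def subsystem(self, name):            # abstract 'subsystem' -> get_subsystem
        return self.transport.get_subsystem(name)

class FakeTransport:
    """Stand-in for a real SSH connection, for demonstration only."""
    def invoke_shell(self):
        return "<interactive shell>"
    def exec_command(self, command):
        return f"ran: {command}"
    def get_subsystem(self, name):
        return f"<{name} subsystem>"

mediator = SshMediator(FakeTransport())
print(mediator.exec("uname -a"))    # ran: uname -a
print(mediator.subsystem("sftp"))   # <sftp subsystem>
```

A real mediator would wrap an actual SSH library connection in place of `FakeTransport`; the point here is only the abstract-to-concrete operation mapping.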

  11. Knowledge engine
      - The mediator's developer supplies a knowledge set: the Ssh Mediator implements ISsh (invoke_shell implements shell, exec_command implements exec, get_subsystem implements subsystem) and is compatibleWith some access point (an OpenSshD daemon)
      - Resource facts: resource emily hasOS Linux, hasAccessPoint OpenSshD, ...
      - The user requests the ISsh interface on resource emily
      - The knowledge engine infers that the Ssh Mediator can satisfy the request
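The inference on slide 11 can be illustrated with a toy triple store. The fact names mirror the slide, but this engine is a deliberately simplified stand-in: Unibus uses OWL-DL descriptions and a real reasoner, not this hand-rolled matcher.

```python
# Toy fact base: (subject, relation, object) triples from slide 11.
facts = {
    ("SshMediator", "implements", "ISsh"),
    ("SshMediator", "compatibleWith", "sshd"),
    ("OpenSshD", "isA", "sshd"),
    ("emily", "hasOS", "Linux"),
    ("emily", "hasAccessPoint", "OpenSshD"),
}

def mediators_for(resource, interface):
    """Find mediators implementing `interface` that are compatible
    with some access point exposed by `resource`."""
    result = []
    for (m, rel, i) in facts:
        if rel != "implements" or i != interface:
            continue
        wanted = {ap for (m2, r2, ap) in facts
                  if m2 == m and r2 == "compatibleWith"}
        exposed = {ap for (res, r3, ap) in facts
                   if res == resource and r3 == "hasAccessPoint"}
        # Follow one classification link: OpenSshD isA sshd.
        exposed |= {cls for (ap, r4, cls) in facts
                    if ap in exposed and r4 == "isA"}
        if wanted & exposed:
            result.append(m)
    return result

print(mediators_for("emily", "ISsh"))   # ['SshMediator']
```

The interesting step is the `isA` link: the mediator declares compatibility with the sshd *class* of access points, while emily exposes a concrete OpenSshD daemon, and the engine bridges the two.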

  12. Composite operations
      - ISimpleCloud specifies addhosts and deletehosts; IRackspace specifies create_server and delete_server
      - The composite operation rs_addhosts implements addhosts and dependsOn create_server
      - create_server is implemented by the Rackspace (RS) Mediator, so the RS Mediator also implements addhosts, with rs_addhosts as the entry point
      - Composite operations dynamically expand a mediator's operations
      - They may result in classification of mediators and compatible resources to new interfaces (composite operation definitions, e.g. ISimpleCloud_RS.py)

  13. Resource access unification via composite operations
      - A unified interface, ISimpleCloud (addhosts, deletehosts), eliminates the need for standardization
      - The Rackspace Mediator implements IRackspace (create_server, delete_server); the composite operation rs_addhosts implements addhosts
      - The EC2 Mediator implements IEC2 (run_instance, ...); the composite operation ec2_addhosts implements addhosts
      - Different resources, yet semantically similar

  14. Resource provisioning: homogenizing resource heterogeneity
      - Conditioning increases resource specialization levels
      - Soft conditioning
        - changes resource software capabilities
        - e.g., installing MPI enables execution of MPI apps
      - Successive conditioning
        - enhances resource capabilities in terms of available access points (may use soft conditioning)
        - e.g., deploying the Globus Toolkit makes the resource accessible via Grid protocols
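The distinction between the two kinds of conditioning can be sketched as follows; the `Resource` class and function names are hypothetical, chosen only to make slide 14's examples concrete.

```python
class Resource:
    """Minimal model of a resource: installed software + access points."""
    def __init__(self, name):
        self.name = name
        self.software = set()
        self.access_points = {"ssh"}   # assume an ssh-enabled server

def soft_condition(resource, package):
    """Soft conditioning changes software capabilities,
    e.g. installing MPI enables execution of MPI apps."""
    resource.software.add(package)

def successive_condition(resource, middleware, access_point):
    """Successive conditioning adds access points (and may use
    soft conditioning underneath), e.g. deploying the Globus
    Toolkit makes the resource reachable via Grid protocols."""
    soft_condition(resource, middleware)
    resource.access_points.add(access_point)

node = Resource("rs-server-1")
soft_condition(node, "openmpi")
successive_condition(node, "globus-toolkit", "gridftp")
print(sorted(node.software), sorted(node.access_points))
```

Each call raises the resource's specialization level: first it can merely be reached over ssh, then it can run MPI apps, then it exposes an additional Grid access point.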

  15. Transforming Rackspace to an FT-enabled MPI platform
      - User's requirements:
        - Execute software: NAS Parallel Benchmarks (NPB)
        - Target resource: MPI cluster
        - FT services: Heartbeat, Checkpoint
      - Inputs to Unibus: the user's credentials, the Rackspace resource descriptor, a metaapplication
      - Unibus applies soft conditioning, successive conditioning, and composite operations to the Rackspace cloud, yielding an FT MPI cluster and, after execution, the NPB logs

  16. Rackspace cloud to MPI cluster
      Each step obtains a higher level of abstraction:
      1. Creating a new group of resources (Rackspace ssh-enabled servers) in terms of new access points
      2. Deployment of MPI on the new resources
      3. Installing other services (FT)

  17. Metaapplications
      - User's requirements:
        - Execute software: NAS Parallel Benchmarks (NPB)
        - Target resource: MPI cluster
        - FT services: Heartbeat, Checkpoint
      - A metaapplication runs on the Unibus access device, on top of the services, Capability Model, engine, mediators, and resource descriptors, and orchestrates resources over the network

  18. Metaapplication
      - Requests IClusterMPI
      - Requests FT services: IHeartbeat, ICheckpointRestart
      - Specifies available resources
      - Performs benchmarks
      - Transfers benchmark execution logs to the head node (requests ISftp)
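The steps on slide 18 can be sketched as a short script. The `request_interface`/`mpirun`/`put` API is invented here purely for illustration (Unibus's real metaapplication API is not shown in the deck), and the FakeUnibus stub exists only so the sketch runs end to end.

```python
class FakeUnibus:
    """Stand-in that records which interfaces the metaapp requests."""
    head_node = "hn"

    def __init__(self):
        self.requested = []

    def request_interface(self, name, on=None):
        self.requested.append(name)
        return self                      # acts as the granted interface

    # Stubs for the operations the metaapp uses:
    def mpirun(self, binary):
        return f"logs-of-{binary}"

    def put(self, src, dst):
        return (src, dst)

def run_npb_metaapp(unibus, resources):
    cluster = unibus.request_interface("IClusterMPI", on=resources)    # MPI cluster
    unibus.request_interface("IHeartbeat", on=resources)               # FT services
    unibus.request_interface("ICheckpointRestart", on=resources)
    logs = cluster.mpirun("lu.B.64")                                   # run NPB
    sftp = unibus.request_interface("ISftp", on=cluster.head_node)     # gather logs
    return sftp.put(logs, "npb-logs/")

ub = FakeUnibus()
run_npb_metaapp(ub, [f"node{i}" for i in range(17)])
print(ub.requested)
```

The essential point is that the metaapplication states *which interfaces* it needs (IClusterMPI, IHeartbeat, ICheckpointRestart, ISftp) and leaves it to Unibus to condition resources until those interfaces are available.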

  19. Rackspace testbed
      - 16 working nodes (WN) + 1 head node (HN)
        - Node: 4-core, 64-bit, AMD 2GHz
        - HN: 256MB RAM, 10GB HDD; WN: 1GB RAM, 40GB HDD
        - Private IPs
      - Debian 5.0 (Lenny)
      - OpenMPI v1.3.4 (GNU suite v4.3.2: gcc, gfortran)
      - NAS Parallel Benchmarks v3.3, class B
      FT setup:
      - Heartbeat service: OpenMPI-based; in case of failure, the service determines the failed node(s) and raises an exception
      - Checkpoint/restart service: DMTCP (Distributed MultiThreaded CheckPointing), user-level transparent checkpointing
        - dmtcp_coordinator runs on the HN; processes are launched via: dmtcp_checkpoint -j -h headNode_privateIP mpirun ...
        - Executes dmtcp_command every 60 secs on the HN to checkpoint 81 processes (64 MPI processes, 16+1 OpenMPI supervisor processes)
        - Moves local checkpoint files from the WNs to the HN (in parallel)
        - Checkpoint time: ~5 sec; moving checkpoints from the WNs to the HN: less than 10 sec; compressed checkpoint size: ca. 1GB
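The 60-second checkpoint round described above could be sketched as follows. This is our reconstruction, not the authors' script: the `--checkpoint` flag of dmtcp_command, the scp-based file transfer, and all host names and paths are assumptions. The sketch only builds the command lines rather than executing them.

```python
# Worker-node host names and checkpoint directory are hypothetical.
WORKERS = [f"wn{i:02d}" for i in range(16)]
CKPT_DIR = "/tmp/dmtcp"

def checkpoint_round():
    """Return the shell commands one 60-second round would issue:
    ask the DMTCP coordinator on the head node to checkpoint, then
    pull the checkpoint images from every worker node (the slide
    notes this transfer runs in parallel)."""
    cmds = ["dmtcp_command --checkpoint"]   # sent to dmtcp_coordinator on the HN
    cmds += [f"scp {wn}:{CKPT_DIR}/*.dmtcp {CKPT_DIR}/" for wn in WORKERS]
    return cmds

for cmd in checkpoint_round()[:2]:
    print(cmd)
```

In the reported setup one such round costs roughly 5 s of checkpointing plus under 10 s of transfer for about 1 GB of compressed images, which is consistent with the 2%-10% overhead on the next slide.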

  20. Results: NPB, class B, Rackspace, DMTCP, OpenMPI heartbeat
      - 16 worker nodes (WN) + 1 head node; WN: 4-core, 64-bit, AMD Opteron 2GHz, 1GB RAM, 40GB HDD
      - Checkpoints every 60 sec; results averaged over 8 series
      - FT overhead: 2% - 10% (HB = heartbeat)

  21. Summary
      - The Unibus infrastructure framework
        - Virtualization of access to various resources
        - Automatic resource provisioning
        - Innovatively used to assemble an FT MPI execution platform on cloud resources
      - Reduces effort to a bare minimum (server instantiation, etc.): 15-20 min instead of 1 man-hour
      - Observed FT overhead: 2%-10% (at least 8% was expected)
      - Future work
        - Migration and restart of MPI-based computations across two different clouds, or a cloud and a local cluster
        - Work with an MPI application
