Designing highly available systems – presentation slides, 4/22/2009

  1. Distributed Systems: Clusters
     Paul Krzyzanowski, pxk@cs.rutgers.edu
     Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.

     Designing highly available systems
     • Incorporate elements of fault-tolerant design
       – Replication, TMR
     • A fully fault-tolerant system will offer non-stop availability
       – You can't achieve this!
     • Problem: expensive!

     Designing highly scalable systems
     • SMP architecture
     • Problem: performance gain as f(# processors) is sublinear
       – Contention for resources (bus, memory, devices)
       – Also … the solution is expensive!

     Clustering
     • Achieve reliability and scalability by interconnecting multiple independent systems
     • Cluster: a group of standard, autonomous servers configured so they appear on the network as a single machine
     • Approach: single system image

     Ideally…
     • Bunch of off-the-shelf machines
     • Interconnected on a high-speed LAN
     • Appear as one system to external users
     • Processors are load-balanced
       – May migrate
       – May run on different systems
       – All IPC mechanisms and file access available
     • Fault tolerant
       – Components may fail
       – Machines may be taken down

     In practice we don't get all that (yet), at least not in one package.

  2. Clustering types
     • Supercomputing (HPC)
     • Batch processing
     • High availability (HA)
     • Load balancing

     High Performance Computing (HPC)
     The evolution of supercomputers

     Clustering for performance
     • Target complex applications:
       – Large amounts of data
       – Lots of computation
       – Parallelizable application
     • Many custom efforts
       – Typically Linux + message passing software + remote exec + remote monitoring

     Example: One popular effort – Beowulf
     • Initially built to address problems associated with large data sets in Earth and Space Science applications
     • From the Center of Excellence in Space Data & Information Sciences (CESDIS), a division of the University Space Research Association at the Goddard Space Flight Center

     What makes it possible
     • Commodity off-the-shelf computers are cost effective
     • Publicly available software:
       – Linux, GNU compilers & tools
       – MPI (message passing interface)
       – PVM (parallel virtual machine)
     • Low cost, high speed networking
     • Experience with parallel software
       – Difficult: solutions tend to be custom

     What can you run?
     • Programs that do not require fine-grain communication
     • Nodes are dedicated to the cluster
       – Performance of nodes not subject to external factors
     • Interconnect network isolated from external network
       – Network load is determined only by application
     • Global process ID provided
       – Global signaling mechanism

  3. Beowulf configuration
     Includes:
     – BPROC: Beowulf distributed process space
       • Start processes on other machines
       • Global process ID, global signaling
     – Network device drivers
       • Channel bonding, scalable I/O
     – File system (file sharing is generally not critical)
       • NFS root
       • unsynchronized
       • synchronized periodically via rsync

     Programming tools: MPI
     • Message Passing Interface
     • API for sending/receiving messages (a minimal sketch appears after this page)
       – Optimizations for shared memory & NUMA
       – Group communication support
     • Other features:
       – Scalable file I/O
       – Dynamic process management
       – Synchronization (barriers)
       – Combining results

     Programming tools: PVM
     • Software that emulates a general-purpose heterogeneous computing framework on interconnected computers
     • Present a view of virtual processing elements
       – Create tasks
       – Use global task IDs
       – Manage groups of tasks
       – Basic message passing

     Beowulf programming tools
     • PVM and MPI libraries
     • Distributed shared memory
       – Page based: software-enforced ownership and consistency policy
     • Cluster monitor
     • Global ps, top, uptime tools
     • Process management
       – Batch system
       – Write software to control synchronization and load balancing with MPI and/or PVM
       – Preemptive distributed scheduling: not part of Beowulf (two packages: Condor and Mosix)

     Another example
     • Rocks Cluster Distribution
       – Based on CentOS Linux
       – Mass installation is a core part of the system
         • Mass re-installation for application-specific configurations
       – Front-end central server + compute & storage nodes
       – Rolls: collections of packages
         • Base roll includes: PBS (portable batch system), PVM (parallel virtual machine), MPI (message passing interface), job launchers, …

     Another example
     • Microsoft HPC Server 2008
       – Windows Server 2008 + clustering package
       – Systems Management
         • Management Console: plug-in to System Center UI with support for Windows PowerShell
         • RIS (Remote Installation Service)
       – Networking
         • MS-MPI (Message Passing Interface)
         • ICS (Internet Connection Sharing): NAT for cluster nodes
         • Network Direct RDMA (Remote DMA)
       – Job scheduler
       – Storage: iSCSI SAN and SMB support
       – Failover support
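
     To make the message-passing model above concrete, here is a minimal coordinator/worker sketch. It uses the mpi4py Python bindings rather than the C MPI API the slides refer to (an assumption made purely for brevity); the ranks, the global communicator, and point-to-point send/receive map directly onto the MPI concepts listed on this page. Run it with something like: mpiexec -n 4 python mpi_sketch.py

         # Minimal MPI sketch (hypothetical file mpi_sketch.py; assumes mpi4py is installed).
         from mpi4py import MPI

         comm = MPI.COMM_WORLD
         rank = comm.Get_rank()   # this process's global ID within the communicator
         size = comm.Get_size()   # total number of processes in the job

         if rank == 0:
             # Coordinator: hand one piece of work to every other rank, then collect results.
             for dest in range(1, size):
                 comm.send({"task": dest * 100}, dest=dest, tag=1)
             results = [comm.recv(source=src, tag=2) for src in range(1, size)]
             print("coordinator collected:", results)
         else:
             # Worker: receive a task, compute something trivial, reply to rank 0.
             task = comm.recv(source=0, tag=1)
             comm.send(task["task"] + rank, dest=0, tag=2)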

  4. Batch Processing

     Batch processing
     • Common application: graphics rendering
       – Maintain a queue of frames to be rendered
       – Have a dispatcher to remotely exec the process (a dispatcher sketch appears after this page)
     • Virtually no IPC needed
     • Coordinator dispatches jobs

     Single-queue work distribution
     Render Farms:
     • Pixar:
       – 1,024 2.8 GHz Xeon processors running Linux and RenderMan
       – 2 TB RAM, 60 TB disk space
       – Custom Linux software for articulating, animating/lighting (Marionette), scheduling (Ringmaster), and rendering (RenderMan)
       – Cars: each frame took 8 hours to render; consumes ~32 GB storage on a SAN
     • DreamWorks:
       – >3,000 servers and >1,000 Linux desktops; HP xw9300 workstations and HP DL145 G2 servers with 8 GB/server
       – Shrek 3: 20 million CPU render hours; Platform LSF used for scheduling + Maya for modeling + Avid for editing + Python for pipelining; movie uses 24 TB storage

     Single-queue work distribution
     Render Farms:
     • ILM:
       – 3,000 processor (AMD) renderfarm; expands to 5,000 by harnessing desktop machines
       – 20 Linux-based SpinServer NAS storage systems and 3,000 disks from Network Appliance
       – 10 Gbps ethernet
     • Sony Pictures' Imageworks:
       – Over 1,200 processors
       – Dell and IBM workstations
       – Almost 70 TB data for Polar Express

     Batch Processing
     • OpenPBS.org:
       – Portable Batch System
       – Developed by Veridian MRJ for NASA
     • Commands
       – Submit job scripts
         • Submit interactive jobs
         • Force a job to run
       – List jobs
       – Delete jobs
       – Hold jobs

     Load Balancing for the web
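
     The single-queue work distribution pattern above needs very little machinery: a shared queue of frames and a dispatcher that remotely executes a renderer on whichever node is free, with essentially no IPC between frames. The sketch below illustrates the idea in Python; the node names and the render_frame command are hypothetical, and a real render farm would submit work through a batch system such as PBS or LSF rather than bare ssh.

         # Single-queue work distribution sketch for a render-farm style batch job.
         import queue
         import subprocess
         import threading

         frames = queue.Queue()
         for frame in range(1, 101):        # frames 1..100 waiting to be rendered
             frames.put(frame)

         hosts = ["node01", "node02", "node03", "node04"]   # hypothetical compute nodes

         def worker(host):
             """Pull frames off the shared queue and render them remotely until it is empty."""
             while True:
                 try:
                     frame = frames.get_nowait()
                 except queue.Empty:
                     return
                 # Remote exec of a (hypothetical) renderer; one process per frame, no IPC needed.
                 subprocess.run(["ssh", host, "render_frame", str(frame)], check=False)

         threads = [threading.Thread(target=worker, args=(h,)) for h in hosts]
         for t in threads:
             t.start()
         for t in threads:
             t.join()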

  5. Functions of a load balancer
     • Load balancing
     • Failover
     • Planned outage management

     Redirection
     • Simplest technique
     • HTTP REDIRECT error code
     [Figure, built up over several slides: the client requests www.mysite.com, receives a REDIRECT to www03.mysite.com, and then sends its requests to www03.mysite.com directly. A sketch of the technique appears after this page.]

     Redirection
     • Trivial to implement
     • Successive requests automatically go to the same web server
       – Important for sessions
     • Visible to customer
       – Some don't like it
     • Bookmarks will usually tag a specific site
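
     A redirect-based balancer is easy to sketch because the front end never proxies traffic: it answers every request with an HTTP redirect naming a real server, chosen here round-robin. The toy example below uses Python's standard http.server module and made-up www01–www03 host names; it also shows why the technique is visible to the customer, since the redirected host name is what ends up in the address bar and in bookmarks.

         # Redirect-based load balancing sketch: reply to every GET with an HTTP 302
         # pointing at one of the real web servers (host names are hypothetical).
         import itertools
         from http.server import BaseHTTPRequestHandler, HTTPServer

         backends = itertools.cycle(
             ["www01.mysite.com", "www02.mysite.com", "www03.mysite.com"])

         class Redirector(BaseHTTPRequestHandler):
             def do_GET(self):
                 target = next(backends)          # round-robin choice of back-end server
                 self.send_response(302)          # HTTP redirect status code
                 self.send_header("Location", "http://%s%s" % (target, self.path))
                 self.end_headers()
                 # The client now talks to the chosen server directly, so successive
                 # requests (and bookmarks) stick to that specific machine.

         if __name__ == "__main__":
             HTTPServer(("", 8080), Redirector).serve_forever()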

  6. Software load balancer
     • e.g.: IBM Interactive Network Dispatcher Software
     • Forwards requests via load balancing
       – Leaves original source address
       – Load balancer not in path of outgoing traffic (high bandwidth)
       – Kernel extensions for routing TCP and UDP requests
     • Each back-end machine accepts connections on its own address and the dispatcher's address
     • Dispatcher changes the MAC address of packets
     [Figure, built up over several slides: client bobby sends a request to www.mysite.com; the dispatcher forwards it to www03 (src=bobby, dest=www03), and www03 sends the response directly back to the client.]

     Load balancing router
     • Routers have been getting smarter
       – Most support packet filtering
       – Add load balancing
     • Cisco LocalDirector, Alteon, F5 Big-IP

     Load balancing router
     • Assign one or more virtual addresses to a physical address
       – Incoming request gets mapped to a physical address
     • Special assignments can be made per port
       – e.g., all FTP traffic goes to one machine
     • Balancing decisions (illustrated in the sketch after this page):
       – Pick machine with least # TCP connections
       – Factor in weights when selecting machines
       – Pick machines round-robin
       – Pick fastest connecting machine (SYN/ACK time)
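
     The balancing decisions listed above are just selection policies over a table of back-end servers. The sketch below illustrates three of them (least TCP connections, weighted selection, round-robin) with made-up addresses and connection counts; a real load-balancing router applies a policy like this when mapping its virtual address onto a physical server, and may also measure SYN/ACK times to pick the fastest-connecting machine.

         # Selection policies a load-balancing router might use (illustrative data only).
         import itertools
         import random

         servers = {
             "10.0.0.11": {"weight": 3, "tcp_conns": 12},
             "10.0.0.12": {"weight": 1, "tcp_conns": 4},
             "10.0.0.13": {"weight": 2, "tcp_conns": 9},
         }

         def least_connections():
             # Pick the machine currently holding the fewest TCP connections.
             return min(servers, key=lambda ip: servers[ip]["tcp_conns"])

         def weighted_choice():
             # Factor in configured weights when selecting a machine.
             ips = list(servers)
             return random.choices(ips, weights=[servers[ip]["weight"] for ip in ips])[0]

         _rr = itertools.cycle(servers)
         def round_robin():
             # Rotate through the machines in a fixed order.
             return next(_rr)

         if __name__ == "__main__":
             print(least_connections(), weighted_choice(), round_robin())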
