Designing highly available systems – presentation slides, 4/22/2009

  1. Distributed Systems: Clusters
     Paul Krzyzanowski, pxk@cs.rutgers.edu
     Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.

     Designing highly available systems
     • Incorporate elements of fault-tolerant design
       – Replication, TMR
     • A fully fault-tolerant system will offer non-stop availability
       – You can't achieve this!
     • Problem: expensive!

     Designing highly scalable systems
     • SMP architecture
     • Problem: performance gain as f(# processors) is sublinear
       – Contention for resources (bus, memory, devices)
       – Also … the solution is expensive!

     Clustering
     • Achieve reliability and scalability by interconnecting multiple independent systems
     • Cluster: a group of standard, autonomous servers configured so they appear on the network as a single machine
     • Approach: single system image

     Ideally…
     • Bunch of off-the-shelf machines
     • Interconnected on a high-speed LAN
     • Appear as one system to external users
     • Processors are load-balanced
       – May migrate
       – May run on different systems
       – All IPC mechanisms and file access available
     • Fault tolerant
       – Components may fail
       – Machines may be taken down

     In practice we don't get all that (yet), at least not in one package.

  2. Clustering types
     • Supercomputing (HPC)
     • Batch processing
     • High availability (HA)
     • Load balancing

     High Performance Computing (HPC)
     The evolution of supercomputers

     Clustering for performance
     • Target complex applications:
       – Large amounts of data
       – Lots of computation
       – Parallelizable application
     • Many custom efforts
       – Typically Linux + message passing software + remote exec + remote monitoring

     Example: One popular effort – Beowulf
     • Initially built to address problems associated with large data sets in Earth and Space Science applications
     • From the Center of Excellence in Space Data & Information Sciences (CESDIS), a division of the University Space Research Association at the Goddard Space Flight Center

     What makes it possible
     • Commodity off-the-shelf computers are cost effective
     • Publicly available software:
       – Linux, GNU compilers & tools
       – MPI (message passing interface)
       – PVM (parallel virtual machine)
     • Low cost, high speed networking
     • Experience with parallel software
       – Difficult: solutions tend to be custom

     What can you run?
     • Programs that do not require fine-grain communication
     • Nodes are dedicated to the cluster
       – Performance of nodes not subject to external factors
     • Interconnect network isolated from external network
       – Network load is determined only by application
     • Global process ID provided
       – Global signaling mechanism

  3. Beowulf configuration
     Includes:
     – BPROC: Beowulf distributed process space
       • Start processes on other machines
       • Global process ID, global signaling
     – Network device drivers
       • Channel bonding, scalable I/O
     – File system (file sharing is generally not critical)
       • NFS root
       • unsynchronized
       • synchronized periodically via rsync

     Programming tools: MPI
     • Message Passing Interface
     • API for sending/receiving messages (a minimal sketch appears after this page)
       – Optimizations for shared memory & NUMA
       – Group communication support
     • Other features:
       – Scalable file I/O
       – Dynamic process management
       – Synchronization (barriers)
       – Combining results

     Programming tools: PVM
     • Software that emulates a general-purpose heterogeneous computing framework on interconnected computers
     • Present a view of virtual processing elements
       – Create tasks
       – Use global task IDs
       – Manage groups of tasks
       – Basic message passing

     Beowulf programming tools
     • PVM and MPI libraries
     • Distributed shared memory
       – Page based: software-enforced ownership and consistency policy
     • Cluster monitor
     • Global ps, top, uptime tools
     • Process management
       – Batch system
       – Write software to control synchronization and load balancing with MPI and/or PVM
       – Preemptive distributed scheduling: not part of Beowulf (two packages: Condor and Mosix)

     Another example
     • Rocks Cluster Distribution
       – Based on CentOS Linux
       – Mass installation is a core part of the system
         • Mass re-installation for application-specific configurations
       – Front-end central server + compute & storage nodes
       – Rolls: collections of packages
         • Base roll includes: PBS (portable batch system), PVM (parallel virtual machine), MPI (message passing interface), job launchers, …

     Another example
     • Microsoft HPC Server 2008
       – Windows Server 2008 + clustering package
       – Systems Management
         • Management Console: plug-in to System Center UI with support for Windows PowerShell
         • RIS (Remote Installation Service)
       – Networking
         • MS-MPI (Message Passing Interface)
         • ICS (Internet Connection Sharing): NAT for cluster nodes
         • Network Direct RDMA (Remote DMA)
       – Job scheduler
       – Storage: iSCSI SAN and SMB support
       – Failover support
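
     To make the message-passing model above concrete, here is a minimal coordinator/worker sketch. It uses the mpi4py Python bindings rather than the C MPI API the slides refer to (an assumption made purely for brevity); the ranks, the global communicator, and point-to-point send/receive map directly onto the MPI concepts listed on this page. Run it with something like: mpiexec -n 4 python mpi_sketch.py

         # Minimal MPI sketch (hypothetical file mpi_sketch.py; assumes mpi4py is installed).
         from mpi4py import MPI

         comm = MPI.COMM_WORLD
         rank = comm.Get_rank()   # this process's global ID within the communicator
         size = comm.Get_size()   # total number of processes in the job

         if rank == 0:
             # Coordinator: hand one piece of work to every other rank, then collect results.
             for dest in range(1, size):
                 comm.send({"task": dest * 100}, dest=dest, tag=1)
             results = [comm.recv(source=src, tag=2) for src in range(1, size)]
             print("coordinator collected:", results)
         else:
             # Worker: receive a task, compute something trivial, reply to rank 0.
             task = comm.recv(source=0, tag=1)
             comm.send(task["task"] + rank, dest=0, tag=2)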

  4. Batch Processing

     Batch processing
     • Common application: graphics rendering
       – Maintain a queue of frames to be rendered
       – Have a dispatcher to remotely exec the process (a dispatcher sketch appears after this page)
     • Virtually no IPC needed
     • Coordinator dispatches jobs

     Single-queue work distribution
     Render Farms:
     • Pixar:
       – 1,024 2.8 GHz Xeon processors running Linux and RenderMan
       – 2 TB RAM, 60 TB disk space
       – Custom Linux software for articulating, animating/lighting (Marionette), scheduling (Ringmaster), and rendering (RenderMan)
       – Cars: each frame took 8 hours to render; consumes ~32 GB storage on a SAN
     • DreamWorks:
       – >3,000 servers and >1,000 Linux desktops; HP xw9300 workstations and HP DL145 G2 servers with 8 GB/server
       – Shrek 3: 20 million CPU render hours; Platform LSF used for scheduling + Maya for modeling + Avid for editing + Python for pipelining; movie uses 24 TB storage

     Single-queue work distribution
     Render Farms:
     • ILM:
       – 3,000 processor (AMD) renderfarm; expands to 5,000 by harnessing desktop machines
       – 20 Linux-based SpinServer NAS storage systems and 3,000 disks from Network Appliance
       – 10 Gbps ethernet
     • Sony Pictures' Imageworks:
       – Over 1,200 processors
       – Dell and IBM workstations
       – Almost 70 TB data for Polar Express

     Batch Processing
     • OpenPBS.org:
       – Portable Batch System
       – Developed by Veridian MRJ for NASA
     • Commands
       – Submit job scripts
         • Submit interactive jobs
         • Force a job to run
       – List jobs
       – Delete jobs
       – Hold jobs

     Load Balancing for the web
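
     The single-queue work distribution pattern above needs very little machinery: a shared queue of frames and a dispatcher that remotely executes a renderer on whichever node is free, with essentially no IPC between frames. The sketch below illustrates the idea in Python; the node names and the render_frame command are hypothetical, and a real render farm would submit work through a batch system such as PBS or LSF rather than bare ssh.

         # Single-queue work distribution sketch for a render-farm style batch job.
         import queue
         import subprocess
         import threading

         frames = queue.Queue()
         for frame in range(1, 101):        # frames 1..100 waiting to be rendered
             frames.put(frame)

         hosts = ["node01", "node02", "node03", "node04"]   # hypothetical compute nodes

         def worker(host):
             """Pull frames off the shared queue and render them remotely until it is empty."""
             while True:
                 try:
                     frame = frames.get_nowait()
                 except queue.Empty:
                     return
                 # Remote exec of a (hypothetical) renderer; one process per frame, no IPC needed.
                 subprocess.run(["ssh", host, "render_frame", str(frame)], check=False)

         threads = [threading.Thread(target=worker, args=(h,)) for h in hosts]
         for t in threads:
             t.start()
         for t in threads:
             t.join()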

  5. Functions of a load balancer
     • Load balancing
     • Failover
     • Planned outage management

     Redirection
     • Simplest technique
     • HTTP REDIRECT error code
     [Figure, built up over several slides: the client requests www.mysite.com, receives a REDIRECT to www03.mysite.com, and then sends its requests to www03.mysite.com directly. A sketch of the technique appears after this page.]

     Redirection
     • Trivial to implement
     • Successive requests automatically go to the same web server
       – Important for sessions
     • Visible to customer
       – Some don't like it
     • Bookmarks will usually tag a specific site
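
     A redirect-based balancer is easy to sketch because the front end never proxies traffic: it answers every request with an HTTP redirect naming a real server, chosen here round-robin. The toy example below uses Python's standard http.server module and made-up www01–www03 host names; it also shows why the technique is visible to the customer, since the redirected host name is what ends up in the address bar and in bookmarks.

         # Redirect-based load balancing sketch: reply to every GET with an HTTP 302
         # pointing at one of the real web servers (host names are hypothetical).
         import itertools
         from http.server import BaseHTTPRequestHandler, HTTPServer

         backends = itertools.cycle(
             ["www01.mysite.com", "www02.mysite.com", "www03.mysite.com"])

         class Redirector(BaseHTTPRequestHandler):
             def do_GET(self):
                 target = next(backends)          # round-robin choice of back-end server
                 self.send_response(302)          # HTTP redirect status code
                 self.send_header("Location", "http://%s%s" % (target, self.path))
                 self.end_headers()
                 # The client now talks to the chosen server directly, so successive
                 # requests (and bookmarks) stick to that specific machine.

         if __name__ == "__main__":
             HTTPServer(("", 8080), Redirector).serve_forever()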

  6. Software load balancer
     • e.g.: IBM Interactive Network Dispatcher Software
     • Forwards requests via load balancing
       – Leaves original source address
       – Load balancer not in path of outgoing traffic (high bandwidth)
       – Kernel extensions for routing TCP and UDP requests
     • Each back-end machine accepts connections on its own address and the dispatcher's address
     • Dispatcher changes the MAC address of packets
     [Figure, built up over several slides: client bobby sends a request to www.mysite.com; the dispatcher forwards it to www03 (src=bobby, dest=www03), and www03 sends the response directly back to the client.]

     Load balancing router
     • Routers have been getting smarter
       – Most support packet filtering
       – Add load balancing
     • Cisco LocalDirector, Alteon, F5 Big-IP

     Load balancing router
     • Assign one or more virtual addresses to a physical address
       – Incoming request gets mapped to a physical address
     • Special assignments can be made per port
       – e.g., all FTP traffic goes to one machine
     • Balancing decisions (illustrated in the sketch after this page):
       – Pick machine with least # TCP connections
       – Factor in weights when selecting machines
       – Pick machines round-robin
       – Pick fastest connecting machine (SYN/ACK time)
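
     The balancing decisions listed above are just selection policies over a table of back-end servers. The sketch below illustrates three of them (least TCP connections, weighted selection, round-robin) with made-up addresses and connection counts; a real load-balancing router applies a policy like this when mapping its virtual address onto a physical server, and may also measure SYN/ACK times to pick the fastest-connecting machine.

         # Selection policies a load-balancing router might use (illustrative data only).
         import itertools
         import random

         servers = {
             "10.0.0.11": {"weight": 3, "tcp_conns": 12},
             "10.0.0.12": {"weight": 1, "tcp_conns": 4},
             "10.0.0.13": {"weight": 2, "tcp_conns": 9},
         }

         def least_connections():
             # Pick the machine currently holding the fewest TCP connections.
             return min(servers, key=lambda ip: servers[ip]["tcp_conns"])

         def weighted_choice():
             # Factor in configured weights when selecting a machine.
             ips = list(servers)
             return random.choices(ips, weights=[servers[ip]["weight"] for ip in ips])[0]

         _rr = itertools.cycle(servers)
         def round_robin():
             # Rotate through the machines in a fixed order.
             return next(_rr)

         if __name__ == "__main__":
             print(least_connections(), weighted_choice(), round_robin())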
