Assessing Fault Sensitivity in MPI Applications (Charng-Da Lu and Daniel A. Reed)


  1. Assessing Fault Sensitivity in MPI Applications
     Charng-Da Lu, Center for Computational Research, SUNY at Buffalo
     Daniel A. Reed, Microsoft Research

  2. Outline
     • Introduction
       – background and motivations
       – reliability challenges of large PC clusters
     • Failure modes
       – memory and communication errors
     • Fault injection experiments
       – methodology and experiments
       – analysis and implications
     • Conclusions
       – large-scale cluster design
       – software strategies for reliability

  3. Large Computing Systems

     Machine       Processor Cores   PetaFLOPS   Year
     K Computer    705,000           10.5        2011
     Jaguar        224,000           1.8         2009
     Tianhe-1A     186,000           2.6         2010
     Hopper        153,000           1.1         2011
     Cielo         142,000           1.1         2011
     Tera100       138,000           1.0         2010
     RoadRunner    122,000           1.0         2008

     • Dominant constraints on size
       – power consumption, reliability and usability

  4. Node Failure Challenges
     • Domain decomposition
       – spreads vital data across all nodes
       – each spatial cell exists in one memory
         » except possible ghost or halo cells
     • Single node failure
       – causes blockage of the overall simulation
       – data is lost and must be recovered
     • "Bathtub" failure model operating regimes
       – infant mortality (burn-in)
       – normal mode
       – late failure mode (aging)
       [Figure: bathtub curve of failure rate vs. elapsed time]
     • Simple checkpointing helps; the optimum checkpoint interval (a reference
       formula follows below) depends on δ, the time to complete a checkpoint,
       M, the time before failure, and R, the restart time due to lost work
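The slide's own expression for the optimum interval was an image and is not reproduced above. As a hedged reference, the widely cited first-order approximation due to Young, written in the slide's notation (δ and M as defined above; R affects the recovery cost but drops out of this first-order optimum), is:

```latex
% Young's first-order approximation of the optimal checkpoint interval
% (reference formula, not the slide's own expression):
%   \delta = time to complete a checkpoint, M = mean time before failure
\tau_{\text{opt}} \;\approx\; \sqrt{2\,\delta\,M}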

  5. Large Systems Reliability

     Machine         Core Count   Reliability
     ASCI Q          8,192        MTBI 6.5 hr; 114 unplanned outages/month.
                                  HW outage sources: storage, CPU, memory *
     ASCI White      8,192        MTBF 5 hr ('01) and 40 hr ('03).
                                  HW outage sources: storage, CPU, 3rd-party hardware **
     NERSC Seaborg   6,656        MTBI 14 days; MTTR 3.3 hr; availability 98.74%.
                                  SW is the main outage source ***
     PSC Lemieux     3,016        MTBI 9.7 hr; availability 98.33% ****
     Google          ~15,000      20 reboots/day; 2-3% of machines replaced/year.
                                  HW outage sources: storage, memory *****

     * J. Morrison (LANL), "The ASCI Q System at Los Alamos," SOS7, 2003
     ** M. Seager (LLNL), "Operational machines: ASCI White," SOS7, 2003
     *** http://hpcf.nersc.gov/computers/stats/AvailStats
     **** M. Levine (PSC), "NSF's terascale computing system," SOS7, 2003
     ***** J. Hennessy et al., "Computer Architecture: A Quantitative Approach," 3rd edition, 2002

  6. Large System Reliability
     • Facing the issues
       – component MTBF
       – system size
       – usable capability
     • A few assumptions
       – independent component failures
         » an optimistic and not realistic assumption
       – N is the number of processors
       – r is the probability a component operates for 1 hour
       – R is the probability the system operates for 1 hour
     • Then R = r^N, which for large N is approximately e^(-N(1-r))
       (a numerical illustration follows below)
       [Figure: 1-hour system reliability and MTTF (hours) vs. system size]
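A small numerical illustration of these formulas (not from the slides; the per-component reliability value r is assumed purely for illustration):

```c
/* System 1-hour reliability under independent component failures:
 * R = r^N, and R ~ exp(-N(1-r)) for large N. The value of r below is
 * an assumed example, not a measured figure from the presentation. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double r = 0.9999;                 /* assumed per-node 1-hour reliability */
    int sizes[] = { 100, 1000, 10000, 100000 };
    for (int i = 0; i < 4; i++) {
        int N = sizes[i];
        double R_exact  = pow(r, N);            /* R = r^N                 */
        double R_approx = exp(-N * (1.0 - r));  /* large-N approximation   */
        /* Under an exponential (memoryless) failure model, the system
         * MTTF is roughly 1/(N(1-r)) hours. */
        double mttf = 1.0 / (N * (1.0 - r));
        printf("N=%6d  R=%.6f  (approx %.6f)  MTTF ~ %.1f h\n",
               N, R_exact, R_approx, mttf);
    }
    return 0;
}
```

Even with r = 0.9999, a 100,000-processor system completes a one-hour run without any node failure only about 0.005% of the time, which is the point of the plot on the slide.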

  7. Component Reliability
     • Two basic types
       – hard (permanent) errors
       – soft (recoverable) errors
     • Hard errors
       – permanent physical defects
       – memory: 160-1000 years MTBF for 32-64 Mb DRAM chips
       – disk: 50-100 years MTBF (?)
       – node: 3-5 years (warranty period)
     • Soft errors
       – transient faults in semiconductor devices
         » alpha particles, cosmic rays, overheating, poor power supplies, ...
       – ECC memory isn't 100% secure
         » 80-95% protection rate
       – much more likely than hard errors
         » 10 days MTBF for 1 GB RAM
       – continue to worsen as chip geometries shrink

  8. Memory Soft Error Rates

     Memory Type                     MTBF in days (per 1 GB)
     Commercial CMOS memory          0.8
     4M SRAM                         > 1.2
     1 Gb memory (NightHawk)         2.3
     SRAM and DRAM                   2.6-5.2
     8.2 Gb SRAM (Cray YMP-8)        4
     SRAM                            5.2
     256 MB                          7.4
     160 Gb DRAM (FermiLab)          7.4
     32 Gb DRAM (Cray YMP-8)         8.7
     MoSys 1T-SRAM (no ECC)          10.4
     Micron estimates, 256 MB        43-86

     Source: Tezzaron Semiconductor, "Soft Errors in Electronic Memory – A White Paper"

  9. Communication Errors
     • Soft errors occur on networks as well
       – routers, switches, NICs, links ...
     • Link-level checksum = reliable transmission?
       – Stone and Partridge's study* shows the probability of Ethernet's 32-bit CRC
         not catching errors is 1/1,100 to 1/32,000
       – theoretically, it should be about 1 in 4 billion
     • To make things worse
       – performance-oriented computing favors OS-bypass protocols
         » relative to TCP
       – message integrity relies solely on the link-level checksum
         (an end-to-end check is sketched below)
     * J. Stone and C. Partridge, "When the CRC and TCP checksum disagree," ACM SIGCOMM 2000
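One way to see what "integrity relies solely on the link-level checksum" implies is an application-level end-to-end check. The sketch below is not part of the authors' work; the helper names (checked_send/checked_recv) are hypothetical, and a toy rotate-and-xor checksum stands in for the CRC32 or stronger hash a real implementation would use:

```c
/* Sketch: end-to-end integrity check for an MPI point-to-point message.
 * Helper names and the toy checksum are illustrative only. */
#include <mpi.h>
#include <stdio.h>

static unsigned simple_sum(const void *buf, int len) {
    const unsigned char *p = buf;
    unsigned s = 0;
    for (int i = 0; i < len; i++)
        s = ((s << 1) | (s >> (8 * sizeof(s) - 1))) ^ p[i]; /* rotate-and-xor */
    return s;
}

/* Send the payload, followed by its checksum on a separate tag. */
static void checked_send(const void *buf, int len, int dest, MPI_Comm comm) {
    unsigned sum = simple_sum(buf, len);
    MPI_Send(buf, len, MPI_BYTE, dest, 0, comm);
    MPI_Send(&sum, 1, MPI_UNSIGNED, dest, 1, comm);
}

/* Receive payload and checksum, then verify; returns 0 on success. */
static int checked_recv(void *buf, int len, int src, MPI_Comm comm) {
    unsigned expected, sum;
    MPI_Recv(buf, len, MPI_BYTE, src, 0, comm, MPI_STATUS_IGNORE);
    MPI_Recv(&expected, 1, MPI_UNSIGNED, src, 1, comm, MPI_STATUS_IGNORE);
    sum = simple_sum(buf, len);
    if (sum != expected) {
        fprintf(stderr, "payload corrupted in transit\n");
        return 1;
    }
    return 0;
}
```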

  10. Terminology
     • Error/failure
       – system behavior deviates from specification
       – omission
         » occasionally no response ...
       – response
         » incorrect
         » performance: response is correct but not timely
       – crash/hang is the manifestation
     • Fault
       – the source of errors/failures
       – single event upset
         » bit flips
       – single event burnout
         » power surge
       – Bohrbug
         » determinism
       – Heisenbug
         » race condition
         » rare input
       – ageing
         » resource exhaustion

  11. Experiments
     • Goal: study the impact of bit-flip faults on MPI codes
     • Rationale
       – it is easier to detect hard errors and assess their damage
       – what about transient faults?
       – crash? hang? incorrect output? ...
     • Approach: fault injection
       – Hardware-based
         » expensive
         » heavy ion bombardment or lasers
         » pin-level probes and sockets
         » alpha particles, bit flips, power surge, 0/1 stuck-at ...
       – Software-based
         » inexpensive and portable
         » targets a wide range of components
         » OS, libraries, applications ...
         » address bus, ALU, memory ...

  12. Register Fault Injection
     • Processor (x86)
       – User-space injection (a sketch follows below)
       – Regular registers and x87 FPU registers
       – No injection into special-purpose registers (needs root privilege)
         » System control registers, debug and performance registers
         » Virtual memory management registers, MMX/SSE ...
       – No injection into L2/L3 caches or the TLB
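A minimal sketch of what user-space register injection via ptrace can look like on x86-64 Linux (this is not the authors' injector; the register choice, random bit position, and omitted error handling are simplifications):

```c
/* Sketch: flip one bit in a general-purpose register of a halted process.
 * x86-64 Linux; the target PID is taken from the command line. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
    pid_t pid = (pid_t)atoi(argv[1]);
    srand((unsigned)time(NULL));

    struct user_regs_struct regs;
    ptrace(PTRACE_ATTACH, pid, NULL, NULL);    /* attach and halt the target */
    waitpid(pid, NULL, 0);
    ptrace(PTRACE_GETREGS, pid, NULL, &regs);  /* peek register contents     */

    /* Flip one random bit of one register; rax is picked for illustration
     * (on 32-bit x86 the field would be eax). */
    unsigned long long *target = &regs.rax;
    *target ^= 1ULL << (rand() % 64);

    ptrace(PTRACE_SETREGS, pid, NULL, &regs);  /* poke modified registers    */
    ptrace(PTRACE_DETACH, pid, NULL, NULL);    /* resume the target          */
    return 0;
}
```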

  13. Memory Fault Injection
     • Memory
       – Focus on application memory
       – Injection addresses have a uniform distribution (a sketch follows below)
       – Skip library memory
         » MPI and shared libraries
       – Text, Data, BSS
       – Heap and stack
       [Figure: Linux process memory model]
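The memory side reduces to choosing a uniformly distributed address inside a known application region and flipping one bit there. A sketch, assuming the caller supplies region bounds [lo, hi) from its own map of text/data/BSS/heap/stack ranges (region discovery is described on a later slide):

```c
/* Sketch: flip one bit at a uniformly distributed address within a known
 * application memory region of a halted process. Region bounds are assumed
 * to come from the injector's own map; no error handling for brevity. */
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

void inject_memory_bitflip(pid_t pid, unsigned long lo, unsigned long hi) {
    unsigned long addr = lo + ((unsigned long)rand() % (hi - lo));
    addr &= ~(sizeof(long) - 1);                  /* align to a word boundary */

    ptrace(PTRACE_ATTACH, pid, NULL, NULL);
    waitpid(pid, NULL, 0);

    long word = ptrace(PTRACE_PEEKDATA, pid, (void *)addr, NULL);
    word ^= 1L << (rand() % (8 * sizeof(long)));  /* flip one random bit      */
    ptrace(PTRACE_POKEDATA, pid, (void *)addr, (void *)word);

    ptrace(PTRACE_DETACH, pid, NULL, NULL);
}
```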

  14. Message Fault Injection
     • Simulate faults that link-level checksums miss
       – Use MPICH for communication
       – Inject at the level closest to the operating system
         » but avoid perturbing the operating system (for testability)
       – Can affect all kinds of messages
         » Control, point-to-point, collective operations ...

  15. Memory Fault Injector
     • ptrace UNIX system call
       – Attach to and halt a host process
       – Peek/poke register and memory contents (like gdb)
     • Static objects (Text, Data, BSS)
       – Used the nm and objdump utilities to find the range of injection
       – Skipped all MPI objects
     • Dynamic objects (Heap and stack)
       – Created customized malloc/free (a sketch follows below)
         » separates application objects from MPI objects
       – Examined return addresses in stack frames
         » to determine the range of stack injection
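A sketch of the heap bookkeeping idea, assuming the application is linked against hypothetical app_malloc/app_free wrappers so the injector can restrict heap injection targets to application-owned objects rather than MPI-internal allocations:

```c
/* Sketch: record application heap allocations so the injector can pick
 * injection addresses only from application objects. Wrapper names are
 * hypothetical, not the authors' actual interface. */
#include <stdlib.h>

#define MAX_OBJS 65536

static struct { void *ptr; size_t size; } app_objs[MAX_OBJS];
static int n_objs = 0;

void *app_malloc(size_t size) {
    void *p = malloc(size);
    if (p && n_objs < MAX_OBJS) {
        app_objs[n_objs].ptr  = p;        /* remember the object's range */
        app_objs[n_objs].size = size;
        n_objs++;
    }
    return p;
}

void app_free(void *p) {
    for (int i = 0; i < n_objs; i++) {
        if (app_objs[i].ptr == p) {       /* drop it from the target table */
            app_objs[i] = app_objs[--n_objs];
            break;
        }
    }
    free(p);
}
```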

  16. Message Fault Injector
     • MPICH
       – Developed by Argonne National Laboratory
       – Highly portable MPI implementation
       – Adopted by many hardware vendors
     • Fault injector
       – Modified MPICH library
       – Uses the "ch_p4" channel (TCP/IP)
       – Faults injected into the payload
         » immediately after receipt from a socket (a sketch follows below)
       – Both MPICH and user applications are vulnerable to message faults
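Conceptually, the payload injection amounts to flipping a bit in the receive buffer right after the socket read. The sketch below shows only that concept and does not reproduce MPICH's actual ch_p4 code; the injection probability is an assumed parameter:

```c
/* Sketch: corrupt a received message payload with some probability,
 * immediately after it has been read from a TCP socket. Stands in for a
 * hook inside a modified MPICH receive path; not the real library code. */
#include <stdlib.h>
#include <unistd.h>

#define INJECT_PROB 0.001   /* assumed per-message injection probability */

ssize_t recv_with_fault(int sockfd, void *buf, size_t len) {
    ssize_t n = read(sockfd, buf, len);            /* receive from the socket */
    if (n > 0 && (double)rand() / RAND_MAX < INJECT_PROB) {
        size_t byte = (size_t)rand() % (size_t)n;
        ((unsigned char *)buf)[byte] ^= 1u << (rand() % 8);  /* flip one bit */
    }
    return n;
}
```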

  17. Experimental Environment
     • A meta-cluster formed from two clusters
       – Rhapsody
         » 32 dual 930 MHz Pentium III nodes
         » 1 GB RAM/node
         » 10/100 Gigabit Ethernet
       – Symphony
         » 16 dual 500 MHz Pentium II nodes
         » 512 MB RAM/node
         » Ethernet and Myrinet

  18. Fault Assessment Code Suite
     • Cactus Wavetoy
       – PDE solver for wave functions in physics
       – Test problem
         » 150x150x150 for 100 steps
         » 196 processes
     • CAM
       – Community Atmospheric Model
       – Test problem
         » default test dataset for 24 hours of simulated time
         » 64 processes
     • NAMD
       – Molecular dynamics code
       – Test problem
         » 92,000 atoms and 20 steps
         » 96 processes

  19. Test Code Suite Characteristics

     Injection Location   Cactus        NAMD        CAM
     Memory               1.1 MB        25-30 MB    80 MB
       Text size          330 KB        2 MB        2 MB
       Data size          130 KB        110 KB      32 MB
       BSS size           5 KB          598 KB      38 MB
       Heap size          450-500 KB    22-27 MB    8 MB
     Message              2.4-4.8 MB    13-33 MB    125-150 MB

  20. Experimental Fault Assessment
     • Failure modes
       – Application crash
         » MPI error detected via the MPI error handler (a sketch follows below)
         » Application error detected via assertion checks
         » Other (e.g., segmentation fault)
       – Application hang (no termination)
       – Application execution completion
         » correct (fault not manifest) or incorrect output
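Distinguishing the "MPI error detected" case can be done with a communicator error handler. A minimal sketch using the standard MPI error-handler API (the logging and abort policy are illustrative, not the authors' harness):

```c
/* Sketch: install an error handler on MPI_COMM_WORLD so MPI-detected
 * errors are logged and classified before the job aborts. */
#include <mpi.h>
#include <stdio.h>

static void report_mpi_error(MPI_Comm *comm, int *errcode, ...) {
    char msg[MPI_MAX_ERROR_STRING];
    int len, rank;
    MPI_Comm_rank(*comm, &rank);
    MPI_Error_string(*errcode, msg, &len);
    fprintf(stderr, "[rank %d] MPI error detected: %s\n", rank, msg);
    MPI_Abort(*comm, *errcode);          /* classify as an MPI-detected crash */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Errhandler errh;
    MPI_Comm_create_errhandler(report_mpi_error, &errh);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, errh);

    /* ... application work; faults surfacing as MPI errors now reach
     * report_mpi_error instead of the default silent abort. */

    MPI_Finalize();
    return 0;
}
```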

  21. Cactus Wavetoy Results (500-2000 injections for each category)

  22. NAMD Results (~500 injections for each category)

  23. CAM Results (~500 injections for each category)

  24. Register Injection Analysis
     • Registers are the most vulnerable to transient faults
       – 39-63% error rate overall
       – Results could depend on register management
         » Live register allocation and size of the register file
         » Optimization increases register use
     • Error rates for floating-point registers are much lower
       – 4-8% error rate
       – Most injections into control registers do not generate errors
         » Except the Tag Word register, which turns a number into NaN
       – Injections into data registers do not yield high error rates
         » At most 4 of the 8 data registers are in use
         » A data register is actually 80 bits long, but only 64 bits can be read out
