 
              Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Failure distributions: with several processors Processor (or node): any entity subject to failures ⇒ approach agnostic to granularity If the MTBF is µ with one processor, what is its value µ p with p processors? Well, it depends � Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 16/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Failure distributions: with several processors Processor (or node): any entity subject to failures ⇒ approach agnostic to granularity If the MTBF is µ with one processor, what is its value µ p with p processors? Well, it depends � Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 16/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion With rejuvenation Rebooting all p processors after a failure Platform failure distribution ⇒ minimum of p IID processor distributions With p distributions Exp ( λ ): � � min Exp ( λ 1 ) , Exp ( λ 2 ) = Exp ( λ 1 + λ 2 ) µ = 1 λ ⇒ µ p = µ p With p distributions Weibull ( k , λ ): = Weibull ( k , p 1 / k λ ) � � min Weibull ( k , λ ) 1 .. p µ = 1 λ Γ(1 + 1 µ k ) ⇒ µ p = p 1 / k Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 17/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Without rejuvenation (= real life) Rebooting only faulty processor Platform failure distribution ⇒ superposition of p IID processor distributions Theorem: µ p = µ p for arbitrary distributions Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 18/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Values from the literature MTBF of one processor: between 1 and 125 years Shape parameters for Weibull: k = 0 . 5 or k = 0 . 7 Failure trace archive from INRIA ( http://fta.inria.fr ) Computer Failure Data Repository from LANL ( http://institutes.lanl.gov/data/fdata ) Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 19/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Does it matter? Parallel machine (10 6 nodes) 1 0.9 0.8 Failure Probability 0.7 0.6 0.5 0.4 0.3 0.2 Exp(1/100) Weibull(0.7, 1/100) 0.1 Weibull(0.5, 1/100) 0 0h 3h 6h 9h 12h 15h 18h 21h 24h Time (hours) Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 20/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Outline 1 Introduction Large-scale computing platforms Faults and failures 2 ABFT for dense linear algebra kernels 3 Checkpointing Process checkpointing Coordinated checkpointing Young/Daly’s approximation 4 Probabilistic models for checkpointing Coordinated checkpointing Hierarchical checkpointing 5 Other techniques Replication Failure Prediction Silent errors In-memory checkpointing 6 Conclusion Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 21/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Tiled LU factorization U U A L A' Solve A · x = b (hard) Transform A into a LU factorization Solve L · y = B · b , then U · x = y Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 22/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Tiled LU factorization TRSM - Update row block U U A L A' GETF2: factorize a GEMM: Update column block the trailing matrix Solve A · x = b (hard) Transform A into a LU factorization Solve L · y = B · b , then U · x = y Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 22/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Tiled LU factorization TRSM - Update row block U U U U L L L GETF2: factorize a GEMM: Update column block the trailing matrix Solve A · x = b (hard) Transform A into a LU factorization Solve L · y = B · b , then U · x = y Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 22/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Tiled LU factorization Failure of rank 2 0 2 4 0 2 4 0 2 0 2 4 0 2 4 0 2 1 3 5 1 3 5 1 3 1 3 5 1 3 5 1 3 0 2 4 0 2 4 0 2 0 2 4 0 2 4 0 2 1 3 5 1 3 5 1 3 1 3 5 1 3 5 1 3 0 2 4 0 2 4 0 2 0 2 4 0 2 4 0 2 1 3 5 1 3 5 1 3 1 3 5 1 3 5 1 3 0 2 4 0 2 4 0 2 0 2 4 0 2 4 0 2 1 3 5 1 3 5 1 3 1 3 5 1 3 5 1 3 2D Block Cyclic Distribution (here 2 × 3) A single failure ⇒ many data lost Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 22/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Algorithm Based Fault Tolerant LU decomposition N nb Q N/Q P mb 0 2 4 0 2 4 0 2 0 2 4 0 2 4 0 2 A A A 1 3 5 1 3 5 1 3 1 3 5 1 3 5 1 3 B B B 0 2 4 0 2 4 0 2 0 2 4 0 2 4 0 2 A A A 1 3 5 1 3 5 1 3 1 3 5 1 3 5 1 3 B B B M 0 2 4 0 2 4 0 2 0 2 4 0 2 4 0 2 A A A 1 3 5 1 3 5 1 3 1 3 5 1 3 5 1 3 B B B 0 2 4 0 2 4 0 2 0 2 4 0 2 4 0 2 A A A 1 3 5 1 3 5 1 3 1 3 5 1 3 5 1 3 B B B + + + Checksum: invertible operation on row/column data Checksum replication avoided by dedicating additional computing resources to checksum storage Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 23/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Algorithm Based Fault Tolerant LU decomposition N Q nb < 2N / Q + nb P mb 0 2 4 0 2 4 0 2 0 2 4 0 2 4 0 2 4 0 2 4 0 2 1 3 5 1 3 5 1 3 1 3 5 1 3 5 1 3 5 1 3 5 1 3 0 2 4 0 2 4 0 2 0 2 4 0 2 4 0 2 4 0 2 4 0 2 1 3 5 1 3 5 1 3 1 3 5 1 3 5 1 3 5 1 3 5 1 3 M 0 2 4 0 2 4 0 2 0 2 4 0 2 4 0 2 4 0 2 4 0 2 1 3 5 1 3 5 1 3 1 3 5 1 3 5 1 3 5 1 3 5 1 3 0 2 4 0 2 4 0 2 0 2 4 0 2 4 0 2 4 0 2 4 0 2 1 3 5 1 3 5 1 3 1 3 5 1 3 5 1 3 5 1 3 5 1 3 + + + Checksum: invertible operation on row/column data Checksum blocks are doubled, to allow recovery when data and checksum are lost together (no extra resource needed) Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 23/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Algorithm Based Fault Tolerant LU decomposition 0 2 4 0 2 4 0 2 A A A 0 2 4 0 2 4 0 2 A A A 1 3 5 1 3 5 1 3 B B B 1 3 5 1 3 5 1 3 B B B 0 2 4 0 2 4 0 2 A A A 0 2 4 0 2 4 0 2 A A A 1 3 5 1 3 5 1 3 B B B 1 3 5 1 3 5 1 3 B B B 0 2 4 0 2 4 0 2 A A A 0 2 4 0 2 4 0 2 A A A TRSM 1 3 5 1 3 5 1 3 B B B 1 3 5 1 3 5 1 3 B B B 0 2 4 0 2 4 0 2 A A A 0 2 4 0 2 4 0 2 A A A 1 3 5 1 3 5 1 3 B B B 1 3 5 1 3 5 1 3 B B B GETF2 GEMM Checksum: invertible operation on row/column data Key idea of ABFT: applying the operation on data and checksum preserves the checksum properties Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 23/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Performance FT1LU"performance" Non1FT"LU"performance" Overhead" 60" 30" 50" 25" Performance'(Tflop/s)' Tflop/s'Overhead'(%)' 40" 20" 30" 15" 20" 10" 10" 5" 0" 0" 20k"(6x6)" 40k"(12x12)" 80k"(24x24)" 160k"(48x48)" 320k"(96x96)" 640k"(192x192)" Matrix'size'(grid'size)' MPI-Next ULFM Performance Open MPI with ULFM; Kraken supercomputer; Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 24/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Outline 1 Introduction Large-scale computing platforms Faults and failures 2 ABFT for dense linear algebra kernels 3 Checkpointing Process checkpointing Coordinated checkpointing Young/Daly’s approximation 4 Probabilistic models for checkpointing Coordinated checkpointing Hierarchical checkpointing 5 Other techniques Replication Failure Prediction Silent errors In-memory checkpointing 6 Conclusion Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 25/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Outline 1 Introduction Large-scale computing platforms Faults and failures 2 ABFT for dense linear algebra kernels 3 Checkpointing Process checkpointing Coordinated checkpointing Young/Daly’s approximation 4 Probabilistic models for checkpointing Coordinated checkpointing Hierarchical checkpointing 5 Other techniques Replication Failure Prediction Silent errors In-memory checkpointing 6 Conclusion Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 26/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Maintaining Redundant Information Goal General Purpose Fault Tolerance Techniques: work despite application behavior Two adversaries: Failures & Application Use automatically computed redundant information At given instants: checkpoint At any instant: replication Anything in between: checkpoint + message logging Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 27/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Process Checkpointing Goal Save the current state of the process FT Protocols save a possible state of the parallel application Techniques User-level checkpointing System-level checkpointing Blocking call Asynchronous call Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 28/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion System-level checkpointing Blocking Checkpointing Relatively intuitive: checkpoint(filename) Cost: no process activity during whole checkpoint operation Different implementations: OS syscall; dynamic library; compiler assisted Create a serial file that can be loaded in a process image. Usually on same architecture / OS / software environment Entirely transparent Preemptive (often needed for library-level checkpointing) Lack of portability Large size of checkpoint ( ≈ memory footprint) Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 29/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Storage Remote Reliable Storage Intuitive. I/O intensive. Disk usage. Memory Hierarchy local memory local disk (SSD, HDD) remote disk Scalable Checkpoint Restart Library http://scalablecr.sourceforge.net Checkpoint is valid when finished on reliable storage Distributed Memory Storage In-memory checkpointing Disk-less checkpointing Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 30/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Outline 1 Introduction Large-scale computing platforms Faults and failures 2 ABFT for dense linear algebra kernels 3 Checkpointing Process checkpointing Coordinated checkpointing Young/Daly’s approximation 4 Probabilistic models for checkpointing Coordinated checkpointing Hierarchical checkpointing 5 Other techniques Replication Failure Prediction Silent errors In-memory checkpointing 6 Conclusion Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 31/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Coordinated checkpointing orphan orphan missing Definition (Missing Message) A message is missing if in the current configuration, the sender sent it, while the receiver did not receive it Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 32/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Coordinated checkpointing orphan orphan missing Definition (Orphan Message) A message is orphan if in the current configuration, the receiver received it, while the sender did not send it Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 33/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Coordinated checkpointing Create a consistent view of the application (no orphan messages) Messages belong to a checkpoint wave or another All communication channels must be flushed (all2all) Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 34/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Coordinated checkpointing App. Message Marker Message Silences the network during checkpoint Missing messages recorded Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 35/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Outline 1 Introduction Large-scale computing platforms Faults and failures 2 ABFT for dense linear algebra kernels 3 Checkpointing Process checkpointing Coordinated checkpointing Young/Daly’s approximation 4 Probabilistic models for checkpointing Coordinated checkpointing Hierarchical checkpointing 5 Other techniques Replication Failure Prediction Silent errors In-memory checkpointing 6 Conclusion Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 36/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Checkpointing cost Time spent working Time spent checkpointing Time Computing the first chunk Checkpointing the first chunk Processing the first chunk Processing the second chunk Blocking model: while a checkpoint is taken, no computation can be performed Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 37/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Framework Periodic checkpointing policy of period T Independent and identically distributed failures Applies to a single processor with MTBF µ = µ ind Applies to a platform with p processors with MTBF µ = µ ind p coordinated checkpointing tightly-coupled application progress ⇔ all processors available Waste : fraction of time not spent for useful computations Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 38/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Waste in fault-free execution Time base : application base time Time spent working Time spent checkpointing Time Time FF : with periodic checkpoints Computing the first chunk Checkpointing the first chunk but failure-free Processing the first chunk Processing the second chunk Time FF = Time base + # checkpoints × C � Time base � ≈ Time base # checkpoints = (valid for large jobs) T − C T − C Waste [ FF ] = Time FF − Time base = C T Time FF Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 39/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Waste due to failures Time base : application base time Time FF : with periodic checkpoints but failure-free Time final : expectation of time with failures Time final = Time FF + N faults × T lost N faults number of failures during execution T lost : average time lost par failures N faults = Time final µ T lost ? Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 40/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Waste due to failures Time base : application base time Time FF : with periodic checkpoints but failure-free Time final : expectation of time with failures Time final = Time FF + N faults × T lost N faults number of failures during execution T lost : average time lost par failures N faults = Time final µ T lost ? Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 40/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Computing T lost Time spent working Time spent checkpointing Downtime Recovery time Time P 0 P 1 P 2 P 3 T lost D R T − C C T T lost = D + R + T 2 ⇒ Instants when periods begin and failures strike are independent ⇒ Valid for all distribution laws, regardless of their particular shape Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 41/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Waste due to failures Time final = Time FF + N faults × T lost Waste [ fail ] = Time final − Time FF = 1 � D + R + T � Time final µ 2 Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 42/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Total waste T -C C T -C C T -C C T -C C T -C C Time FF = Time Final (1- Waste [Fail]) Time Final × Waste [ Fail ] Time Final Waste = Time final − Time base Time final 1 − Waste = (1 − Waste [ FF ])(1 − Waste [ fail ]) � 1 Waste = C � 1 − C � D + R + T � T + T µ 2 Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 43/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Waste minimization � 1 Waste = C � 1 − C � D + R + T � T + T µ 2 Waste = u T + v + wT 1 − D + R v = D + R − C / 2 w = 1 � � u = C µ µ 2 µ Waste minimized for T = � u w � T = 2( µ − ( D + R )) C Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 44/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Comparison with Young/Daly T -C C T -C C T -C C T -C C T -C C Time FF = Time Final (1- Waste [Fail]) Time Final × Waste [ Fail ] Time Final � � 1 − Waste [ fail ] Time final = Time FF � ⇒ T = 2( µ − ( D + R )) C � � Daly : Time final = 1 + Waste [ fail ] Time FF � ⇒ T = 2( µ + ( D + R )) C + C � � Young : Time final = 1 + Waste [ fail ] Time FF and D = R = 0 ⇒ T = √ 2 µ C + C Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 45/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Validity of the approach (1/3) Technicalities E ( N faults ) = Time final and E ( T lost ) = D + R + T µ 2 but expectation of product is not product of expectations (not independent RVs here) Enforce C ≤ T to get Waste [ FF ] ≤ 1 Enforce D + R ≤ µ and bound T to get Waste [ fail ] ≤ 1 but µ = µ ind too small for large p , regardless of µ ind p Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 46/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Validity of the approach (2/3) Several failures within same period? Waste [fail] accurate only when two or more faults do not take place within same period Cap period: T ≤ γµ , where γ is some tuning parameter Poisson process of parameter θ = T µ Probability of having k ≥ 0 failures : P ( X = k ) = θ k k ! e − θ Probability of having two or more failures: π = P ( X ≥ 2) = 1 − ( P ( X = 0)+ P ( X = 1)) = 1 − (1+ θ ) e − θ γ = 0 . 27 ⇒ π ≤ 0 . 03 ⇒ overlapping faults for only 3% of checkpointing segments Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 47/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Validity of the approach (3/3) Enforce T ≤ γµ , C ≤ γµ , and D + R ≤ γµ � Optimal period 2( µ − ( D + R )) C may not belong to admissible interval [ C , γµ ] Waste is then minimized for one of the bounds of this admissible interval (by convexity) Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 48/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Wrap up Capping periods, and enforcing a lower bound on MTBF ⇒ mandatory for mathematical rigor � Not needed for practical purposes � • actual job execution uses optimal value • account for multiple faults by re-executing work until success Approach surprisingly robust � Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 49/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Outline 1 Introduction Large-scale computing platforms Faults and failures 2 ABFT for dense linear algebra kernels 3 Checkpointing Process checkpointing Coordinated checkpointing Young/Daly’s approximation 4 Probabilistic models for checkpointing Coordinated checkpointing Hierarchical checkpointing 5 Other techniques Replication Failure Prediction Silent errors In-memory checkpointing 6 Conclusion Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 50/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Outline 1 Introduction Large-scale computing platforms Faults and failures 2 ABFT for dense linear algebra kernels 3 Checkpointing Process checkpointing Coordinated checkpointing Young/Daly’s approximation 4 Probabilistic models for checkpointing Coordinated checkpointing Hierarchical checkpointing 5 Other techniques Replication Failure Prediction Silent errors In-memory checkpointing 6 Conclusion Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 51/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Background: coordinated checkpointing protocols P 0 m 1 m 2 m 3 Coordinated checkpoints over all P 1 processes m 4 m 5 Global restart after a failure P 2 � No risk of cascading rollbacks � No need to log messages � All processors need to roll back Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 52/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Background: message logging protocols P 0 m 1 m 2 m 3 Message content logging P 1 (sender memory) m 4 m 5 Restart of failed process only P 2 � No cascading rollbacks � Number of processes to roll back � Memory occupation � Overhead Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 53/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Background: hierarchical protocols P 0 Clusters of processes Coordinated checkpointing P 1 protocol within clusters m 1 m 3 m 5 Message logging protocols P 2 between clusters m 2 m 4 Only processors from failed group P 3 need to roll back � Need to log inter-groups messages • Slowdowns failure-free execution • Increases checkpoint size/time � Faster re-execution with logged messages Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 54/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Which checkpointing protocol to use? Coordinated checkpointing � No risk of cascading rollbacks � No need to log messages � All processors need to roll back � Rumor: May not scale to very large platforms Hierarchical checkpointing � Need to log inter-groups messages • Slowdowns failure-free execution • Increases checkpoint size/time � Only processors from failed group need to roll back � Faster re-execution with logged messages � Rumor: Should scale to very large platforms Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 55/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Coordinated checkpointing Time spent working Time spent checkpointing Time Computing the first chunk Checkpointing the first chunk Processing the first chunk Processing the second chunk Blocking model: checkpointing blocks all computations Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 56/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Coordinated checkpointing Time spent working Time spent checkpointing Time Computing the first chunk Checkpointing the first chunk Processing the first chunk Processing the second chunk Non-blocking model: checkpointing has no impact on computations (e.g., first copy state to RAM, then copy RAM to disk) Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 56/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Coordinated checkpointing Time spent working Time spent checkpointing Time spent working with slowdown Time Computing the first chunk Checkpointing the first chunk Processing the first chunk General model: checkpointing slows computations down: during a checkpoint of duration C , the same amount of computation is done as during a time α C without checkpointing (0 ≤ α ≤ 1) Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 56/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Waste in fault-free execution Time spent working Time spent checkpointing Time spent working with slowdown Time P 0 P 1 P 2 P 3 T − C C T Time elapsed since last checkpoint: T Amount of computations executed: Work = ( T − C ) + α C Waste [ FF ] = T − Work T Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 57/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Waste due to failures Time spent working Time spent checkpointing Time spent working with slowdown Time P 0 P 1 P 2 P 3 Failure can happen 1 During computation phase 2 During checkpointing phase Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 57/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Waste due to failures Time spent working Time spent checkpointing Time spent working with slowdown Time P 0 P 1 P 2 P 3 Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 57/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Waste due to failures Time spent working Time spent checkpointing Time spent working with slowdown Time P 0 P 1 P 2 P 3 Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 57/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Waste due to failures Time spent working Time spent checkpointing Time spent working with slowdown Time P 0 P 1 P 2 P 3 T lost Coordinated checkpointing protocol: when one processor is victim of a failure, all processors lose their work and must roll back to last checkpoint Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 57/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Waste due to failures in computation phase Time spent working Time spent checkpointing Time spent working with slowdown Downtime Time P 0 P 1 P 2 P 3 D Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 57/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Waste due to failures in computation phase Time spent working Time spent checkpointing Time spent working with slowdown Downtime Recovery time Time P 0 P 1 P 2 P 3 R Coordinated checkpointing protocol: all processors must recover from last checkpoint Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 57/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Waste due to failures in computation phase Time spent working Time spent checkpointing Time spent working with slowdown Downtime Recovery time Re-executing slowed-down work Time P 0 P 1 P 2 P 3 C α C Redo the work destroyed by the failure, that was done in the checkpointing phase before the computation phase But no checkpoint is taken in parallel, hence this re-execution is faster than the original computation Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 57/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Waste due to failures in computation phase Time spent working Time spent checkpointing Time spent working with slowdown Downtime Recovery time Re-executing slowed-down work Time P 0 P 1 P 2 P 3 T − C Re-execute the computation phase Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 57/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Waste due to failures in computation phase Time spent working Time spent checkpointing Time spent working with slowdown Downtime Recovery time Re-executing slowed-down work Time P 0 P 1 P 2 P 3 C Finally, the checkpointing phase is executed Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 57/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Total waste Time spent working Time spent checkpointing Time spent working with slowdown Downtime Recovery time Re-executing slowed-down work Time P 0 P 1 P 2 P 3 T lost D R α C T − C C T ∆ Waste [ fail ] = 1 � D + R + α C + T � µ 2 � Optimal period T opt = 2(1 − α ) C ( µ − ( D + R + α C )) Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 57/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Outline 1 Introduction Large-scale computing platforms Faults and failures 2 ABFT for dense linear algebra kernels 3 Checkpointing Process checkpointing Coordinated checkpointing Young/Daly’s approximation 4 Probabilistic models for checkpointing Coordinated checkpointing Hierarchical checkpointing 5 Other techniques Replication Failure Prediction Silent errors In-memory checkpointing 6 Conclusion Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 58/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Hierarchical checkpointing Time spent working Time spent checkpointing Time spent working with slowdown Downtime Recovery time Re-executing slowed-down work Time G 1 G 2 G g G 4 G 5 T lost D R T lost G . C α ( G − g +1) C T − G . C − T lost T Processors partitioned into G groups Each group includes q processors Inside each group: coordinated checkpointing in time C ( q ) Inter-group messages are logged Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 59/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Accounting for message logging: Impact on work � Logging messages slows down execution: ⇒ Work becomes λ Work , where 0 < λ < 1 Typical value: λ ≈ 0 . 98 � Re-execution after a failure is faster: ⇒ Re-Exec becomes Re-Exec , where ρ ∈ [1 .. 2] ρ Typical value: ρ ≈ 1 . 5 Waste [ FF ] = T − λ Work T � � Waste [ fail ] = 1 D ( q ) + R ( q ) + Re-Exec µ ρ Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 60/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Accounting for message logging: Impact on checkpoint size Inter-groups messages logged continuously Checkpoint size increases with amount of work executed before a checkpoint � C 0 ( q ): Checkpoint size of a group without message logging C ( q ) = C 0 ( q )(1 + β Work ) ⇔ β = C ( q ) − C 0 ( q ) C 0 ( q ) Work Work = λ ( T − (1 − α ) GC ( q )) C 0 ( q )(1 + βλ T ) C ( q ) = 1 + GC 0 ( q ) βλ (1 − α ) Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 61/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Three case studies Coord-IO Coordinated approach: C = C Mem = Mem b io where Mem is the memory footprint of the application Hierarch-IO Several (large) groups, I/O-saturated ⇒ groups checkpoint sequentially C 0 ( q ) = C Mem = Mem G G b io Hierarch-Port Very large number of smaller groups, port-saturated ⇒ some groups checkpoint in parallel Groups of q min processors, where q min b port ≥ b io Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 62/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Three applications 1 2D-stencil 2 Matrix product 3 3D-Stencil Plane Line Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 63/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Computing β for 2D-Stencil C ( q ) = C 0 ( q ) + Logged Msg = C 0 ( q )(1 + β Work ) Real n × n matrix and p × p grid Work = 9 b 2 s p , b = n / p Each process sends a block to its 4 neighbors Hierarch-IO : 1 group = 1 grid row 2 out of the 4 messages are logged β = 2s p 9 b 3 Hierarch-Port : β doubles Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 64/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Four platforms: basic characteristics Name Number of Number of Number of cores Memory I/O Network Bandwidth (b io ) I/O Bandwidth (b port ) cores processors p total per processor per processor Read Write Read/Write per processor Titan 299,008 16,688 16 32GB 300GB/s 300GB/s 20GB/s K-Computer 705,024 88,128 8 16GB 150GB/s 96GB/s 20GB/s Exascale-Slim 1,000,000,000 1,000,000 1,000 64GB 1TB/s 1TB/s 200GB/s Exascale-Fat 1,000,000,000 100,000 10,000 640GB 1TB/s 1TB/s 400GB/s Name Scenario G ( C ( q )) β for β for 2D-Stencil Matrix-Product 1 (2,048s) / / Coord-IO Titan Hierarch-IO 136 (15s) 0.0001098 0.0004280 Hierarch-Port 1,246 (1.6s) 0.0002196 0.0008561 Coord-IO 1 (14,688s) / / K-Computer 296 (50s) 0.0002858 0.001113 Hierarch-IO Hierarch-Port 17,626 (0.83s) 0.0005716 0.002227 Coord-IO 1 (64,000s) / / Exascale-Slim 1,000 (64s) 0.0002599 0.001013 Hierarch-IO Hierarch-Port 200,0000 (0.32s) 0.0005199 0.002026 Coord-IO 1 (64,000s) / / Exascale-Fat Hierarch-IO 316 (217s) 0.00008220 0.0003203 33,3333 (1.92s) 0.00016440 0.0006407 Hierarch-Port Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 65/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Checkpoint time Name C K-Computer 14,688s Exascale-Slim 64,000 Exascale-Fat 64,000 Large time to dump the memory Using 1% C Comparing with 0 . 1% C for exascale platforms α = 0 . 3, λ = 0 . 98 and ρ = 1 . 5 Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 66/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Plotting formulas – Platform: Titan Stencil 2D Matrix product Stencil 3D Waste as a function of processor MTBF µ ind Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 67/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Platform: K-Computer Stencil 2D Matrix product Stencil 3D Waste as a function of processor MTBF µ ind Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 68/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Plotting formulas – Platform: Exascale Waste = 1 for all scenarios!!! Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 69/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Plotting formulas – Platform: Exascale Waste = 1 for all scenarios!!! Goodbye Exascale?! Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 69/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Plotting formulas – Platform: Exascale with C = 1 , 000 Stencil 2D Matrix product Stencil 3D Exascale-Slim Exascale-Fat Waste as a function of processor MTBF µ ind , C = 1 , 000 Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 70/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Plotting formulas – Platform: Exascale with C = 100 Stencil 2D Matrix product Stencil 3D Exascale-Slim Exascale-Fat Waste as a function of processor MTBF µ ind , C = 100 Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 71/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Simulations – Platform: Titan Stencil 2D Matrix product Coordinated BestPer Hierarchical Line BestPer Coordinated Hierarchical Hierarchical Port Coordinated BestPer Hierarchical BestPer Hierarchical Port BestPer 90 90 Coordinated Coordinated 80 80 Coordinated BestPer Coordinated BestPer Hierarchical Hierarchical 70 70 Hierarchical BestPer Hierarchical BestPer Makespan (days) Hierarchical Port Makespan (days) Hierarchical Port 60 60 Hierarchical Port BestPer Hierarchical Port BestPer 50 50 40 40 30 30 20 20 10 10 0 0 1 2 3 4 5 7.5 10 15 20 35 50 75 100 1 2 3 4 5 7.5 10 15 20 35 50 75 100 MTBF (years) MTBF (years) Makespan (in days) as a function of processor MTBF µ ind Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 72/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Simulations – Platform: Exascale with C = 1 , 000 Stencil 2D Matrix product Coordinated BestPer Hierarchical Line BestPer Coordinated Hierarchical Hierarchical Port Coordinated BestPer Hierarchical BestPer Hierarchical Port BestPer Exascale-Slim 250 300 Coordinated Coordinated Coordinated BestPer Coordinated BestPer 250 200 Hierarchical Hierarchical Hierarchical BestPer Hierarchical BestPer Makespan (days) Makespan (days) Hierarchical Port Hierarchical Port 200 Hierarchical Port BestPer Hierarchical Port BestPer 150 150 100 100 50 50 0 0 1 2 3 4 5 7.5 10 15 20 35 50 75 100 1 2 3 4 5 7.5 10 15 20 35 50 75 100 MTBF (years) MTBF (years) Exascale-Fat 250 250 Coordinated Coordinated Coordinated BestPer Coordinated BestPer Hierarchical Hierarchical 200 200 Hierarchical BestPer Hierarchical BestPer Makespan (days) Hierarchical Port Makespan (days) Hierarchical Port Hierarchical Port BestPer Hierarchical Port BestPer 150 150 100 100 50 50 0 0 1 2 3 4 5 7.5 10 15 20 35 50 75 100 1 2 3 4 5 7.5 10 15 20 35 50 75 100 MTBF (years) MTBF (years) Makespan (in days) as a function of processor MTBF µ ind , C = 1 , 000 Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 73/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Simulations – Platform: Exascale with C = 100 Stencil 2D Matrix product Coordinated BestPer Hierarchical Line BestPer Coordinated Hierarchical Hierarchical Port Coordinated BestPer Hierarchical BestPer Hierarchical Port BestPer Exascale-Slim 120 120 Coordinated Coordinated Coordinated BestPer Coordinated BestPer 100 100 Hierarchical Hierarchical Hierarchical BestPer Hierarchical BestPer Makespan (days) Makespan (days) Hierarchical Port Hierarchical Port 80 80 Hierarchical Port BestPer Hierarchical Port BestPer 60 60 40 40 20 20 0 0 1 2 3 4 5 7.5 10 15 20 35 50 75 100 1 2 3 4 5 7.5 10 15 20 35 50 75 100 MTBF (years) MTBF (years) Exascale-Fat 15 15 Coordinated Coordinated Daly 14 14 Coordinated BestPer Coordinated BestPer 13 Hierarchical 13 Hierarchical Hierarchical BestPer Hierarchical BestPer 12 12 Makespan (days) Hierarchical Port Makespan (days) Hierarchical Port 11 11 Hierarchical Port BestPer Hierarchical Port BestPer 10 10 9 9 8 8 7 7 6 6 5 5 4 4 1 2 3 4 5 7.5 10 15 20 35 50 75 100 1 2 3 4 5 7.5 10 15 20 35 50 75 100 MTBF (years) MTBF (years) Makespan (in days) as a function of processor MTBF µ ind , C = 100 Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 74/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Outline 1 Introduction Large-scale computing platforms Faults and failures 2 ABFT for dense linear algebra kernels 3 Checkpointing Process checkpointing Coordinated checkpointing Young/Daly’s approximation 4 Probabilistic models for checkpointing Coordinated checkpointing Hierarchical checkpointing 5 Other techniques Replication Failure Prediction Silent errors In-memory checkpointing 6 Conclusion Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 75/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Outline 1 Introduction Large-scale computing platforms Faults and failures 2 ABFT for dense linear algebra kernels 3 Checkpointing Process checkpointing Coordinated checkpointing Young/Daly’s approximation 4 Probabilistic models for checkpointing Coordinated checkpointing Hierarchical checkpointing 5 Other techniques Replication Failure Prediction Silent errors In-memory checkpointing 6 Conclusion Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 76/ 98
Introduction ABFT for dense linear algebra kernels Checkpointing Probabilistic models for checkpointing Other techniques Conclusion Replication State Update P replica P replica Both process the same messages P P Passive Replication Active Replication Idea Each process is replicated on a resource that has small chance to be hit by the same failure as its replica In case of failure, one of the replicas will continue working, while the other recovers Passive Replication / Active Replication Yves.Robert@ens-lyon.fr Fault-tolerance for HPC 77/ 98
Recommend
More recommend