
Containment Domains Resilience Mechanisms and Tools Toward Exascale - PowerPoint PPT Presentation


  1. Containment Domains: Resilience Mechanisms and Tools Toward Exascale Resilience. Mattan Erez, The University of Texas at Austin

  2. (c) Mattan Erez
     • Yes, resilience is an exascale concern
       – Checkpoint-restart is not good enough on its own
       – Commercial datacenters face different problems
       – Heterogeneity keeps growing
       – Correctness is also at risk (integrity)

  3. • Containment Domains (CDs)
       – Isolate application resilience from the system
       – Increase performance and efficiency
       – Simplify defensive (resilient) code
       – Adapt hardware and software
     • Portable. Performant. Resilient. Proportional.

  4. • Efficient resilience is an exascale problem

  5. [Figure: performance efficiency (0–100%) vs. machine scale for CDs (NT), h-CPR (80%), and g-CPR (80%)]
     • The failure rate is possibly too high for checkpoint/restart
     • Correctness is also at risk

  6. [Figure: energy overhead (0–20%) vs. machine scale (2.5 PF to 2.5 EF) for CDs (NT), h-CPR (80%), and g-CPR (80%)]
     • Energy is also problematic

  7. • Something bad happens roughly every minute at exascale
     • Something bad happens roughly every year commercially
       – Smaller units of execution
       – Different requirements
       – Different ramifications

  8. • Rapid adoption of new technology and accelerators
       – Again, a potential mismatch with the commercial setting

  9. • So who's responsible for resilience?
     • Hardware? Software?
     • Algorithm?

  10. • Can hardware alone solve the problem?
      • Yes, but at a cost
        – Significant and fixed/hidden overheads
        – Different tradeoffs in commercial settings

  11. • Fixed-overhead examples (estimated), in energy and/or throughput
        – Up to ~25% for chipkill-correct vs. chipkill-detect
        – 20–40% for pipeline SDC reduction
        – >2X for arbitrary correction
        – Even greater overhead if approximate units are allowed

  12. • Relaxed reliability and precision
        – Some lunacy: rare, easy-to-detect errors + parallelism
        – Lunatic fringe: bounded imprecision
        – Lunacy: live with real, unpredictable errors
      [Figure: arithmetic headroom for Today, Scaled, Researchy, Some lunacy, Lunatic fringe, and Lunacy; rough estimated numbers for illustration purposes]

  13. • Can software do it alone?
        – Detection is likely very costly
        – Recovery effectiveness depends on error/failure frequency
        – Tradeoffs are more limited

  14. • Locality and hierarchy are key
        – Hierarchical constructs
        – Distributed operation
      • The algorithm is key
        – Correctness is a range

  15. • Containment Domains elevate resilience to a first-class abstraction
        – Program-structure abstractions
        – Composable resilient program components
        – A regimented development flow
        – Supporting tools and mechanisms
        – Ideally combined with adaptive hardware reliability
      • Portable. Performant. Resilient. Proportional.

  18. • CDs help bridge the gap
        – Help us figure out exactly how
        – Open source: lph.ece.utexas.edu/public/CDs and bitbucket.org/cdresilience/cdruntime

  19. CDs embed resilience within the application
      • Express resilience as a tree of CDs
        – Match the CD, task, and machine hierarchies
        – Escalation for differentiated error handling
      • Semantics
        – Erroneous data is never communicated
        – Each CD provides a recovery mechanism
      • Components of a CD
        – Preserve data on domain start
        – Compute (the domain body)
        – Detect faults before the domain commits
        – Recover from detected errors
      [Figure: a CD tree with a root CD and nested child CDs]
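The preserve / compute / detect / recover cycle of a single CD can be sketched with a tiny self-contained mock. Note that `MockCD` and its methods here are hypothetical stand-ins for illustration only, not the real CD runtime API:

```cpp
#include <cassert>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical mock of one containment domain, illustrating the four
// components named on the slide: preserve on start, compute the body,
// detect before commit, and recover (restore + re-execute) on error.
class MockCD {
  std::vector<double> preserved_;  // copy taken on domain start
  double* live_ = nullptr;
  size_t n_ = 0;

 public:
  void Preserve(double* data, size_t n) {  // preserve on domain start
    live_ = data;
    n_ = n;
    preserved_.assign(data, data + n);
  }

  // Run the domain body, then check an invariant before "committing".
  // A detected error triggers local recovery: restore the preserved data
  // and re-execute the body (with a bounded number of retries).
  void Execute(const std::function<void()>& body,
               const std::function<bool()>& detect_ok) {
    for (int attempt = 0; attempt < 3; ++attempt) {
      body();                   // compute
      if (detect_ok()) return;  // detect faults before commit
      std::memcpy(live_, preserved_.data(), n_ * sizeof(double));  // recover
    }
  }
};
```

In the full model, exhausting the retry bound would correspond to escalation: the error is re-raised to the parent CD, which recovers from its own (coarser) preserved state.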

  20. Mapping example: SpMV
      [Figure: matrix M and vector V]

      void task<inner> SpMV(in M, in Vi, out Ri) {
        cd = GetCurrentCD()->CreateAndBegin();
        cd->Preserve(matrix, size, kCopy);
        forall (…) reduce(…)
          SpMV(M[…], Vi[…], Ri[…]);
        cd->Complete();
      }

      void task<leaf> SpMV(…) {
        cd = GetCurrentCD()->CreateAndBegin();
        cd->Preserve(M, sizeof(M), kRef);
        cd->Preserve(Vi, sizeof(Vi), kCopy);
        for r = 0..N
          for c = rowS[r]..rowS[r+1]
            resi[r] += data[c] * Vi[cIdx[c]];
        cd->CDAssert(idx > prevIdx, kSoft);
        prevC = c;
        cd->Complete();
      }

  21. Mapping example: SpMV
      [Figure: M partitioned into blocks N11, N12, N21, N22 and V into blocks W1, W2]
      (Same inner and leaf SpMV tasks as on slide 20.)

  22. Mapping example: SpMV
      [Figure: blocks N11, N21, N12, N22 distributed to 4 nodes, each node holding its input-vector block (W1, W1, W2, W2)]
      (Same leaf SpMV task as on slide 20.)

  23. Mapping example: SpMV
      [Figure: M and V distributed to 4 nodes, as on slide 22]
      (Same leaf SpMV task as on slide 20.)

  24. A concise abstraction for complex behavior
      [Figure: the leaf SpMV task from slide 20, with preservation options annotated: local copy or regeneration, a sibling, or the parent (unchanged)]

  25. • General abstractions: a "language" for resilience
      [Figure: preservation/recovery choices for data blocks A, B, C: replicate in space, in time, or not at all; preserve via a local copy or regeneration, via a sibling, or via the parent (unchanged)]
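The "local copy vs. parent (unchanged)" choice above can be sketched as follows. `SketchCD`, `kCopy`, and `kRef` are illustrative stand-ins modeled on the API shown in the SpMV example, not the real runtime: when a parent CD already preserves a datum that the child will not modify, the child can record a reference instead of making a redundant copy.

```cpp
#include <cassert>
#include <cstring>
#include <map>
#include <vector>

// Illustrative sketch (not the real CD runtime): kCopy makes a local
// preserved copy; kRef notes that an ancestor's unchanged copy can be
// reused, avoiding redundant preservation in the child.
enum PreserveKind { kCopy, kRef };

struct SketchCD {
  SketchCD* parent = nullptr;
  std::map<const void*, std::vector<char>> copies;  // locally preserved data

  void Preserve(const void* p, size_t n, PreserveKind kind) {
    if (kind == kRef) {
      for (SketchCD* cd = parent; cd; cd = cd->parent)
        if (cd->copies.count(p)) return;  // an ancestor already preserves it
    }
    const char* bytes = static_cast<const char*>(p);
    copies[p].assign(bytes, bytes + n);   // make a local copy
  }

  // Restore from the nearest CD (self or ancestor) that holds a copy.
  void Restore(void* p, size_t n) {
    for (SketchCD* cd = this; cd; cd = cd->parent)
      if (cd->copies.count(p)) {
        std::memcpy(p, cd->copies.at(p).data(), n);
        return;
      }
  }
};
```

This is what makes preservation overhead proportional: read-only inputs like the matrix M in the SpMV example are preserved once, high in the tree, while frequently modified data is copied locally.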

  26. • CDs are a natural fit for:
        – Hierarchical SPMD
        – Task-based systems
      • CDs are still general:
        – Opportunistic approaches to add hierarchical resilience
        – You can always fall back to more checkpoint-like mappings

  27. • A reminder of why you care

  28. • CDs enable per-experiment/system "optimality"
        – (Portable) Use the same resilience abstractions across programming models and implementations
          • MPI ULFM? MPI-Reinit? OpenMP? UPC++? Legion?
          • CPU, GPU, FPGA accelerator, memory accelerator, …?
          • Don't keep rethinking correctness and recovery
        – (Performant) Resilient patterns that scale
          • Hierarchical / local
          • Aware of application semantics
          • Auto-tuned efficiency/reliability tradeoffs
        – (Resilient) Defensive coding
          • Algorithms, implementations, and systems
          • Reasonable default schemes
          • Programmer customization
        – (Proportional) Adapt hardware and software redundancy

  29. CD runtime system architecture
      [Diagram: CD-annotated applications and libraries on top of the CD runtime system (state preservation, error detection and handling, logging, auto-tuning, unified runtime interface), with compiler support, a debugger, a mapper, a scaling tool, and profiling/visualization tools (Sight); user interaction for customized error detection, handling, tolerance, and injection; a persistence layer (LWM2, BLCR, CD-Storage) over DRAM, SSD, HDD, and PFS; communication via the Legion + GASNet runtime; and a HW/SW interface for error reporting and machine checks]
      – Annotations, persistence, reporting, recovery, tools
