SLIDE 1

VELOC: Very Low Overhead Checkpointing System

Bogdan Nicolae, Rinku Gupta, Franck Cappello (ANL) Adam Moody, Elsa Gonsiorowski, Kathryn Mohror (LLNL)

SLIDE 2

Part 1: Overview of VELOC

SLIDE 3

HPC Resilience: Checkpoint-Restart (CR)

  • Main resilience technique for HPC due to tight coupling
  • “Defensive” checkpointing: save state to parallel file system

In action games, autosave checkpoints are points where a game will automatically save your progress and restart the player upon death. As such, the player does not need to restart the entire level over again. This reduces the frustration and tedium that is potentially felt without such a design.

"Checkpointing is one of these things that's simpler in theory than it is in implementation. The reality is, you're trying to balance many competing interests." (Brianna Wu, head of development, Giant Spacekat)

Bad checkpoints ask players to replay large parts of the game due to their death or failure in some task, and this can lead to frustration and anger.

SLIDE 4

CR at Exascale: Challenges (1)

[Figure: checkpoint I/O funneling into shared storage (object store, caching layer, etc.)]

  • Checkpointing generates a lot of I/O contention to storage
  • Impact on performance and scalability is significant
  • At Exascale, this issue is amplified:

○ Bigger systems -> more frequent failures -> need to checkpoint more frequently
○ Large increase in CPU power but modest increase in I/O capability -> less I/O bandwidth available per processing element

SLIDE 5

CR at Exascale: Challenges (2)

[Figure: heterogeneous storage hierarchy: parallel file system, object store, caching layer, etc.]

  • Storage hierarchy is heterogeneous and complex at Exascale:

○ Many options in addition to PFS: burst buffers, object stores, caching layers, etc.
○ Each HPC machine has its own combination
○ Many vendors, each with its own API and performance characteristics

  • The need to customize the CR strategy reduces productivity and leads to inefficiencies, as application developers are not I/O experts

SLIDE 6

VELOC: CR Solution at Exascale

Goal: Provide a checkpoint restart solution for HPC applications that delivers high performance and scalability for complex heterogeneous storage hierarchies without sacrificing ease of use and flexibility

SLIDE 7

Key idea: Multi-Level CR

  • Multi-level checkpoint-restart uses a layered approach with increasing resilience guarantees but higher checkpointing overhead:

○ L1: local checkpoints
○ L2: partner copies, erasure codes
○ L3: parallel file system

  • Higher levels defend against more complex types of failures, which typically happen less frequently
  • The cost of higher levels can be masked asynchronously

VELOC improves performance and scalability by using multi-level CR

SLIDE 8

The checkpoint interval of each level is optimized for the type of failures not covered by the previous levels

  • L1 survives software errors
  • L2 survives a majority of simultaneous node failures
  • L3 survives catastrophic failures (rack or system down)

[Figure: How to use multiple levels. A soft failure is recovered from L1 (local FS); a single node crash from L2-1 (partner-node copy); the crash of partner nodes from L2-2 (distributed erasure codes); all nodes crashing requires L3 (parallel file system). After a failure, the work done since the last usable checkpoint is done twice.]
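
As a rough rule of thumb (not stated on this slide), the classic Young/Daly approximation can be applied per level to pick these intervals, using only the failures that a given level is responsible for:

    tau_i ≈ sqrt(2 * C_i * M_i)

where C_i is the time to write one checkpoint at level i and M_i is the mean time between the failures that only level i (or higher) can handle. Higher levels face rarer failures (larger M_i) but cost more (larger C_i), so they are taken less often, which matches the layering described above.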

SLIDE 9

Example of observed failures by level

SLIDE 10

Hidden Complexity of Heterogeneous Storage

One simple VeloC API vs. many complex vendor APIs:

  • Cray DataWarp
  • DDN IME
  • EMC 2 Tiers
  • IBM CORAL burst buffer

Complex Heterogeneous Storage Hierarchy (Burst Buffers, Parallel File Systems, Object Stores, etc.)

VELOC facilitates ease of use by transparent interaction with the heterogeneous storage hierarchy

SLIDE 11

Modular Architecture

  • Configurable resilience strategy:

○ L1: Local write
○ L2: Partner replication, XOR encoding, RS encoding
○ L3: Optimized transfer to external storage

  • Configurable mode of operation (see the configuration example below):

○ Synchronous mode: resilience engine runs in the application process
○ Asynchronous mode: resilience engine runs in a separate backend process (the backend survives software failures in the application)

  • Easily extensible:

○ Custom modules can be added for additional post-processing in the engine (e.g. compression)

VELOC facilitates flexibility thanks to its modular design
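
For instance, switching between the two modes of operation is a one-line change in the veloc.cfg configuration file introduced in the hands-on part of this deck:

scratch = ./scratch
persistent = ./persistent
mode = async

With mode = sync the resilience engine runs inside the application process; with mode = async it runs in the separate veloc-backend process started before the application.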

SLIDE 12

VELOC API

  • Application-level checkpoint and restart API
  • Minimizes code changes in applications
  • Two possible modes:

○ File-oriented API: manually write files and tell VeloC about them
○ Memory-oriented API: declare and capture memory regions automatically

  • Fire-and-forget: VeloC operates in the background
  • Waiting for checkpoints is optional; a primitive is used to check progress

(A minimal usage sketch follows the function list below.)

Initializing VELOC:

  • VELOC_Init()
  • VELOC_Finalize()

Memory registration:

  • VELOC_Mem_protect()
  • VELOC_Mem_unprotect()

File registration:

  • VELOC_Route_file()

Checkpoint functions:

  • VELOC_Checkpoint_wait()
  • VELOC_Checkpoint_begin()
  • VELOC_Checkpoint_mem()
  • VELOC_Checkpoint_end()

Restart functions:

  • VELOC_Restart_test()
  • VELOC_Restart_begin()
  • VELOC_Recover_mem()
  • VELOC_Restart_end()

Environmental functions:

  • VELOC_Get_version()

Convenience functions (Mem. only):

  • VELOC_Checkpoint()
  • VELOC_Restart()
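
To make the call order concrete, here is a minimal sketch of the memory-oriented API put together as a complete program (it mirrors the heatdis solution later in this deck; the checkpoint name "ckpt", MAX_ITER and CKPT_FREQ are placeholders, and error checking is omitted):

#include <mpi.h>
#include <veloc.h>

#define MAX_ITER  1000   /* placeholder iteration count */
#define CKPT_FREQ 100    /* placeholder checkpoint frequency */

int main(int argc, char *argv[]) {
    int rank, i = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    VELOC_Init(rank, argv[1]);                  /* argv[1]: path to the VeloC config file */

    VELOC_Mem_protect(0, &i, 1, sizeof(int));   /* register critical regions for restart */

    int v = VELOC_Restart_test("ckpt", 0);      /* any previous checkpoint available? */
    if (v > 0)
        VELOC_Restart("ckpt", v);               /* convenience restart (memory mode) */

    for (; i < MAX_ITER; i++) {
        /* ... computation ... */
        if (i > 0 && i % CKPT_FREQ == 0) {
            VELOC_Checkpoint_wait();            /* in async mode, wait for the previous checkpoint */
            VELOC_Checkpoint("ckpt", i);        /* convenience checkpoint (memory mode) */
        }
    }

    VELOC_Finalize();
    MPI_Finalize();
    return 0;
}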

SLIDE 13

VeloC Initialization and Finalize
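
As a reference for this slide, the initialization/finalization pattern used in the heatdis solution later in this deck is (argv[2] being the path to the VeloC configuration file):

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
...
if (VELOC_Init(rank, argv[2]) != VELOC_SUCCESS) {
    printf("Error initializing VELOC! Aborting...\n");
    exit(2);
}
/* ... protect memory, checkpoint, restart ... */
VELOC_Finalize();
MPI_Finalize();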

SLIDE 14

VELOC Memory-Based Mode

In memory-based mode, applications need to register any critical memory regions needed for restart. Registration is allowed at any moment before initiating a checkpoint or restart. Memory regions can also be unregistered if they become non-critical at any moment during runtime.
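
For example, the heatdis solution later in this deck registers the iteration counter and the two grid arrays with an id, a pointer, an element count and an element size; unregistering is sketched here under the assumption that VELOC_Mem_unprotect takes the region id:

VELOC_Mem_protect(0, &i, 1, sizeof(int));
VELOC_Mem_protect(1, h, M * nbLines, sizeof(double));
VELOC_Mem_protect(2, g, M * nbLines, sizeof(double));
...
VELOC_Mem_unprotect(2);   /* assumed signature: drop region 2 if it becomes non-critical */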

SLIDE 15

VELOC File-Based Mode

In the file-based mode, applications need to manually serialize/recover the critical data structures to/from checkpoint files. This mode provides fine-grain control over the serialization process and is especially useful when the application uses non-contiguous memory regions for which the memory-based API is not convenient to use.
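
A rough sketch of the file-oriented flow is given below; the exact signature of VELOC_Route_file (and the VELOC_MAX_NAME buffer size) is an assumption here, so check the API documentation before using it:

char ckpt_path[VELOC_MAX_NAME];               /* assumed constant from veloc.h */
VELOC_Checkpoint_wait();                      /* async mode: wait for the previous checkpoint */
VELOC_Checkpoint_begin("heatdis", version);
VELOC_Route_file("heatdis.dat", ckpt_path);   /* ask VeloC where the file should be written (assumed signature) */
FILE *f = fopen(ckpt_path, "wb");
fwrite(data, sizeof(double), count, f);       /* application-controlled serialization */
fclose(f);
VELOC_Checkpoint_end(1);                      /* 1 = success; lets VeloC run the remaining levels */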

SLIDE 16

VELOC Checkpoint Functions
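
Based on the function list from the API slide, the decomposed memory-mode checkpoint sequence looks roughly as follows (a hedged reconstruction; it is roughly what the convenience call VELOC_Checkpoint(name, version) does):

VELOC_Checkpoint_wait();                      /* in async mode, make sure the previous checkpoint finished */
VELOC_Checkpoint_begin("heatdis", version);
VELOC_Checkpoint_mem();                       /* capture all regions registered with VELOC_Mem_protect */
VELOC_Checkpoint_end(1);                      /* 1 = success */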

SLIDE 17

VELOC Checkpointing Functions (cont.)

Needed in file mode: VeloC needs to know when writing to the checkpoint file is done, so that it can start the next steps (synchronous or asynchronous) of multi-level checkpointing.

SLIDE 18

VELOC Checkpointing Functions (cont.)

SLIDE 19

VELOC Restart Functions
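
The restart sequence mirrors the checkpoint one; a hedged reconstruction from the function list and the heatdis solution:

int v = VELOC_Restart_test("heatdis", 0);   /* latest available version, or <= 0 if none */
if (v > 0) {
    VELOC_Restart_begin("heatdis", v);
    VELOC_Recover_mem();                    /* repopulate all regions registered with VELOC_Mem_protect */
    VELOC_Restart_end(1);                   /* 1 = success */
}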

SLIDE 20

VELOC Restart Functions (cont.)

SLIDE 21

VELOC Restart Functions (cont.)

SLIDE 22

Examples of ECP apps using VELOC

LatticeQCD

  • Helps understand particle dynamics (quarks, gluons)
  • Based on CPS (Columbia Physics System)
  • Needs to checkpoint a 1D array

HACC

  • Helps understand structure formation of universe
  • Needs to checkpoint 6 x 1D arrays

SLIDE 23

Industry Interest for VELOC

  • Total SA

○ Major French oil and gas multinational
○ Needs HPC to accelerate studies
○ Largest industrial supercomputer (6 PFlop)

  • Application: PoroDG

○ Simulations of porous media
○ Discontinuous Galerkin method
○ Written in Fortran
○ Needs efficient checkpoint-restart

  • Collaborative project

○ Fortran bindings for VELOC
○ Evaluations of VELOC in progress

SLIDE 24

Results: Sync vs. Async Mode

  • Experimental platform: Theta (thousands of KNL nodes, Lustre PFS)
  • What people did so far: blocking writes to the PFS (purple curve)

○ The result: poor scalability

  • What VeloC can do: asynchronous writes to the PFS (green curve)

○ Applications are blocked only during local writes (to DRAM)
○ Much better scalability

  • The cost of doing async flushes to the PFS:

○ They generate noticeable interference, but it does not grow at scale

  • Overall: a rapidly growing gap between sync and async with increasing numbers of PEs

SLIDE 25

Heterogeneity of Local Storage

  • Local storage is increasingly complex
  • Example: KNL node (ANL Theta)

○ MCDRAM
○ DDR4 RAM
○ Flash storage (SSD)

  • VELOC can leverage heterogeneous local storage to improve performance
  • Example:

○ Scenario: 256 concurrent writers, each writing 256 MB
○ Hybrid local storage: 6 GB DDR4 + 128 GB SSD
○ Hybrid local storage much faster than SSD only, despite the small DDR4 size

SLIDE 26

Zoom on Hybrid Local Storage

  • Problem: naive strategies that write to the fastest available local storage are not enough for multi-level checkpointing
  • Example (see the sketch below):

○ Nodes equipped with a small RAM cache (6 GB) and flash storage (128 GB)
○ Two resilience levels: local and parallel file system (async flush from local)
○ When the RAM cache is full, if the PFS is faster than flash storage, it is better to wait for the RAM cache instead of writing to flash

  • VELOC has a multi-level aware strategy to manage local storage
  • Experiments on ANL Theta (KNL): better performance for the strategy employed by VELOC vs. the naive strategy
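
The decision described in the example above can be sketched as follows; this is illustrative logic only (not VELOC's actual implementation), and the free-space and bandwidth figures are assumed to come from measurements made elsewhere:

#include <stddef.h>

/* Illustrative multi-level-aware placement decision for a local checkpoint. */
typedef enum { WRITE_TO_RAM, WAIT_FOR_RAM, WRITE_TO_FLASH } placement_t;

placement_t choose_local_target(size_t ckpt_bytes, size_t ram_free_bytes,
                                double flash_bw, double pfs_bw) {
    if (ckpt_bytes <= ram_free_bytes)
        return WRITE_TO_RAM;      /* the fastest tier has room: use it */
    if (pfs_bw > flash_bw)
        return WAIT_FOR_RAM;      /* the RAM cache drains quickly to the PFS: waiting beats flash */
    return WRITE_TO_FLASH;        /* otherwise spill to the slower local tier */
}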

SLIDE 27

Use of CR Beyond Resilience (1)

  • “Administrative” checkpointing:

○ Suspend-resume

■ Reservations too short
■ Make room for real-time jobs

○ Migration
○ Debugging

  • Example: real-time analysis and experimental steering

○ Classic HPC: process and validate data only after the experiment has finished
○ Issues:

■ Errors detected too late or not at all
■ Cannot act early on results

○ Solution:

■ Mix real-time stream processing (on-demand jobs) with batch jobs
■ Apply suspend-resume to batch jobs to make room for on-demand jobs

SLIDE 28

Use of CR Beyond Resilience (2)

  • “Productive” checkpointing:

○ Large state space that needs to be constantly revisited
○ Ensemble searches with shared states

  • Example: adjoint computations

○ Modelling of a fluid-dynamics code (e.g. atmospheric simulation)
○ Initial parameters x0, x'l+1
○ Two phases:

■ Forward simulation (F0, F1, ...): model the system using intermediate states
■ Inverse problem (F'l+1, F'l+2, ...): how well the intermediate states fit the goal

○ All intermediate states from the forward simulation are needed
○ However, there is not enough room to save them all in DRAM
○ Solution: use CR to save and restore intermediate states optimally

SLIDE 29

Conclusions

  • Checkpoint-Restart at Exascale is challenging

○ High I/O contention but limited I/O bandwidth per processing unit
○ Heterogeneous storage with different performance characteristics and vendor APIs

  • VELOC: Very Low Overhead Checkpointing System

○ Multi-level checkpointing delivers high performance and scalability
○ Hiding the complexity of heterogeneous storage facilitates ease of use
○ Modular architecture facilitates high flexibility and extensibility

  • Supports

○ Synchronous and asynchronous modes
○ Memory-based and file-based APIs

  • Results

○ Survives up to 85% of failures without the need to checkpoint to the parallel file system
○ Up to an order of magnitude improvement in async mode over blocking checkpointing to the parallel file system

SLIDE 30

Part 2: Hands-on Session

SLIDE 31

Installation

VeloC is available on Spack, the ECP package manager:

$ git clone https://github.com/spack/spack.git
$ . spack/share/spack/setup-env.sh
$ spack install veloc

VeloC also has its own automated installation tools:

$ git clone https://github.com/ECP-VeloC/VELOC.git
$ ./bootstrap.sh
$ ./auto-install.py <install_directory>

Installation is not covered in this tutorial

SLIDE 32

First Step: Setup

For the purpose of this tutorial, we will use a Docker image that has both ULFM and VeloC pre-installed:

$ apt-get install docker.io      # install if needed (Ubuntu)
$ sudo usermod -aG docker $USER  # log out to refresh
$ docker run hello-world         # test docker installation
$ docker pull bnicolae/veloc-tutorial

For Mac users, follow the instructions here: https://store.docker.com/editions/community/docker-ce-desktop-mac (you will have to create an account on DockerHub to be able to download).

The tutorial uses a sample application and some helper scripts available here: https://goo.gl/nDtDPa

SLIDE 33

Second Step: Run Original Application

Set up aliases for make and mpirun so that they run in a Docker container based on the image previously downloaded:

$ . create-aliases.sh
$ alias   # check the aliases

Compile the sample application (modeling of heat distribution):

$ make

Run the application (4 ranks per node, 256 MB per rank):

$ mpirun -np 4 heatdis 256 heatdis.cfg

SLIDE 34

Successful Output

Local data size is 8192 x 2051 = 256.000000 MB (256).
Target precision : 0.000010
Maximum number of iterations : 600
Step : 0, error = 1.000000
Step : 50, error = 0.484743
Step : 100, error = 0.242139
Step : 150, error = 0.161172
Step : 200, error = 0.121036
Step : 250, error = 0.096793
Step : 300, error = 0.080644
Step : 350, error = 0.069129
Step : 400, error = 0.060499
Step : 450, error = 0.053781
Step : 500, error = 0.048396
Step : 550, error = 0.043974
Execution finished in 162.528864 seconds

SLIDE 35

Third Step: Add VELOC Checkpointing

  • Follow the comments in the source code of the application (heatdis.c)
  • Replace the VELOC code comments with the missing VeloC API calls
  • Consult the documentation: http://veloc.rtfd.io
  • Check out in particular the API section: https://veloc.readthedocs.io/en/latest/api.html

SLIDE 36

Third Step: Solution Part 1

Example application: Heat Distribution (included with VeloC)

Initialize VeloC:

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nbProcs);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
...
if (VELOC_Init(rank, argv[2]) != VELOC_SUCCESS) {
    printf("Error initializing VELOC! Aborting...\n");
    exit(2);
}

Protect essential data structures:

nbLines = (M / nbProcs) + 3;
h = (double *) malloc(sizeof(double) * M * nbLines);
g = (double *) malloc(sizeof(double) * M * nbLines);
initData(nbLines, M, rank, g);
...
VELOC_Mem_protect(0, &i, 1, sizeof(int));
VELOC_Mem_protect(1, h, M * nbLines, sizeof(double));
VELOC_Mem_protect(2, g, M * nbLines, sizeof(double));

SLIDE 37

Third Step: Solution Part 2

Check if a previous checkpoint exists and restore the essential data structures:

int v = VELOC_Restart_test("heatdis", 0);
if (v > 0) {
    printf("Previous checkpoint at iteration %d, initiating restart...\n", v);
    assert(VELOC_Restart("heatdis", v) == VELOC_SUCCESS);
} else // no previous checkpoint found
    i = 0;

SLIDE 38

Third Step: Solution Part 3

Inside the main loop, checkpoint every CKPT_FREQ iterations:

while (i < ITER_TIMES) {
    err = doWork(nbProcs, rank, M, nbLines, g, h);
    if (((i % ITER_OUT) == 0) && (rank == 0))
        printf("Step : %d, error = %f\n", i, globalerr);
    if ((i % REDUCE) == 0)
        MPI_Allreduce(&err, &globalerr, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    if (globalerr < PRECISION)
        break;
    i++;
    if (i % CKPT_FREQ == 0) {
        // wait for previous checkpoint to finish (only in async mode)
        assert(VELOC_Checkpoint_wait() == VELOC_SUCCESS);
        // capture the protected data structures
        assert(VELOC_Checkpoint("heatdis", i) == VELOC_SUCCESS);
    }
}
...
VELOC_Finalize();
MPI_Finalize();

SLIDE 39

Fourth Step: Configure VELOC & Run

Create veloc.cfg, then specify the path to the local scratch directory (L1), the persistent PFS directory (L3) and the mode of operation (the minimum mandatory parameters). L2 is disabled for a single node. The directories will be created automatically by VELOC if they don't exist:

scratch = ./scratch
persistent = ./persistent
mode = sync

Run the application with VELOC up to iteration 250, then confirm VELOC created checkpoints:

$ mpirun -np 4 heatdis 256 veloc.cfg
$ ls -Al ./scratch

Kill the application (Ctrl+C), then run it again. The application will pick up from where it left off. Check the final result to confirm correctness.

Consult the documentation to learn about more configuration parameters: https://veloc.readthedocs.io/en/latest/userguide.html

SLIDE 40

Bonus: Asynchronous Mode

Edit veloc.cfg to activate the asynchronous mode:

scratch = ./scratch
persistent = ./persistent
mode = async

Remove all previous checkpoints and start the active backend:

$ rm -rf scratch persistent
$ veloc-backend veloc.cfg

Run the application in a different terminal, same as in sync mode:

$ . create-aliases.sh
$ mpirun -np 4 heatdis 256 veloc.cfg

SLIDE 41

Feel free to visit our web site:

http://veloc.rtfd.io

Thank you!