SLIDE 1

Fast In-Memory Checkpointing with POSIX API for Legacy Exascale Applications

Jan Fajerski, Matthias Noack, Alexander Reinefeld, Florian Schintke, Thorsten Schütt, Thomas Steinke, Zuse Institute Berlin

SPPEXA Symposium, 26.01.2016

SLIDE 2

FFMK: A Fast and Fault‐tolerant Microkernel‐based Operating System for Exascale Computing

Project goals
  • improve fault tolerance
  • reduce workload imbalances
  • reduce OS noise/overhead
Means
  • in-memory checkpoint/restart
  • fine-grained monitoring & control
  • low-overhead microkernel OS

SLIDE 3

Outline

1. In-memory C/R
  • First results on Cray XC40
2. Reed-Solomon Erasure Codes in Distributed, Unreliable Systems
3. Restart
  • Process placement after restart
  • Oversubscription after restart

SLIDE 4

Checkpointing Today

Cray XC40 “Konrad” @ ZIB
  • 1 Petaflop/s peak
  • 1872 nodes
  • 44928 cores
  • 117 TB memory

[Diagram: checkpoints are written to two Lustre file systems (Lustre 1 and Lustre 2), each with metadata servers (bmds1–4) and object storage servers (boss1–16) reached through LNET routers; one file system has 800 SATA disks with 2.3 PB net capacity, the other 480 SATA disks with 1.4 PB net capacity.]
SLIDE 5

Checkpointing on Exascale

  • Exascale systems: more components, more complexity
    → With today’s methods the time to save a checkpoint may be larger than the MTBF.
  • Must reduce …
    • data size → application-level checkpointing; can use multilevel checkpointing and utilize storage layers, e.g. NVRAM
    • writing time → stripe over many remote DRAMs via the fast interconnect, but this needs redundant storage (erasure coding)
  • Need POSIX compliance for legacy applications

Goal: checkpoint frequency of a few minutes
SLIDE 6

IN‐MEMORY C/R

Cray XC40 implementation and first results

SLIDE 7

In-Memory C/R with XtreemFS
  • We use XtreemFS, a user-space file system.
  • OSD data is stored in a tmpfs file system (RAM disk); legacy applications write checkpoints through the plain POSIX API (see the sketch below).

Stender, Berlin, Reinefeld. XtreemFS ‐ a File System for the Cloud, In: Data Intensive Storage Services for Cloud Environments, IGI Global, 2013.
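Because the checkpoint volume is mounted like any other file system, a legacy application can keep its existing checkpoint code. The following is a minimal sketch under assumptions of my own: the mount path /xtreemfs/ckpt and the write-then-rename pattern are illustrative, not prescribed by the slides.

/* Application-level checkpoint written through the plain POSIX API.
 * The target directory is assumed to be an XtreemFS mount backed by
 * remote tmpfs OSDs. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int write_checkpoint(const char *path, const void *state, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;

    const char *p = state;
    size_t left = len;
    while (left > 0) {                        /* handle short writes */
        ssize_t n = write(fd, p, left);
        if (n < 0) { close(fd); return -1; }
        p += n; left -= n;
    }
    if (fsync(fd) < 0 || close(fd) < 0) return -1;

    return rename(tmp, path);                 /* atomically publish checkpoint */
}

int main(void)
{
    double state[1024] = { 0 };               /* stand-in for application data */
    /* "/xtreemfs/ckpt/rank0.dat" is a hypothetical mount path. */
    if (write_checkpoint("/xtreemfs/ckpt/rank0.dat", state, sizeof state) != 0) {
        perror("checkpoint");
        return 1;
    }
    return 0;
}

Writing to a temporary file and renaming it keeps the previous checkpoint intact until the new one is complete, which matters when the checkpoint interval approaches the MTBF.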

SLIDE 8

Accessing the Remote RAM File System: 3 Options

  • FUSE (file system in user space): requires kernel modules
  • libxtreemfs: link the library into the client code
  • LD_PRELOAD: intercepts calls to dynamically linked libraries (see the sketch below)
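Of the three options, LD_PRELOAD needs the least from the system: it relies only on the dynamic linker and leaves the application binary unchanged. A minimal interposer sketch follows; it is my illustration, not the actual XtreemFS preload library.

/* Minimal LD_PRELOAD interposer: wraps the libc write() call. A real
 * interposer would check whether fd belongs to an intercepted path
 * (e.g. an XtreemFS volume) and forward the call to libxtreemfs. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>

typedef ssize_t (*write_fn)(int fd, const void *buf, size_t count);
static write_fn real_write;

ssize_t write(int fd, const void *buf, size_t count)
{
    if (!real_write)                                   /* resolve libc write() once */
        real_write = (write_fn)dlsym(RTLD_NEXT, "write");

    /* Here one could inspect fd and redirect to the user-space client. */
    return real_write(fd, buf, count);
}

Build it as a shared object (e.g. gcc -shared -fPIC -o libintercept.so intercept.c -ldl) and start the unmodified application with LD_PRELOAD pointing at it; a complete interposer would also wrap open(), read(), close(), and friends.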

SLIDE 9

Deployment on a Cray XC40

  • Experimental results with the BQCD code, which uses application-level C/R
  • OpenMPI
  • Node 1: DIR, MRC
  • Nodes 1…n: 1 OSD each
  • We manually killed the job and restarted it from the checkpoint.
  • Used CCM (cluster compatibility mode) on the Cray
  • need to run XtreemFS and the application side by side
  • need to be able to restart from the checkpoint
  • ibverbs (emulates InfiniBand on Aries)

SLIDE 10

Results (1): Sequential Write on Remote RAM Disk

[Plot: sequential write bandwidth when the client uses libxtreemfs, FUSE, or LD_PRELOAD]

Limited bandwidth because (1) synchronous access pattern, (2) single data stream, (3) single client

SLIDE 11

Results (2): Writing Erasure Coded Data

[Plots: read/write bandwidth vs. number of data OSDs, for access to a striped volume and to an erasure-coded volume. Setup: n = 2…13 clients, each writes 1 GB; striped volume: 13 nodes with 1 OSD and 1 client each; erasure-coded volume: 13 nodes with 1 OSD and 1 client each, plus 2 redundancy nodes.]

Writing encoded data is 17% to 49% slower than writing striped data (15% to 100% more data, see the calculation below).
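As a plausibility check (my arithmetic, not stated on the slide), the extra data volume follows directly from the m = 2 redundancy blocks written per stripe:

\[
\text{extra data} \;=\; \frac{m}{k}, \qquad m = 2,\; k = 2,\dots,13
\quad\Longrightarrow\quad
\frac{2}{13} \approx 15\,\% \;\le\; \frac{m}{k} \;\le\; \frac{2}{2} = 100\,\% ,
\]

which matches the 15–100% range quoted above.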

SLIDE 12

USING REED‐SOLOMON ERASURE CODES

IN DISTRIBUTED, UNRELIABLE SYSTEMS

Update Order Problem

SLIDE 13

RS Erasure Codes

  • RS erasure codes provide more flexibility than RAID: n = k + m, for various k, m
  • But the blocks in one stripe are dependent on each other (not so in replication)
  • Must ensure sequential consistency for client accesses
    • with failing servers
    • with failing (other) clients

SLIDE 14

RS Erasure Codes

MDS erasure code with k = 3 original blocks and m = 2 redundancy blocks

[Diagram: data blocks D1, D2, D3 and redundancy blocks R1, R2; the client reads the old D2, writes the new D2, and sends diffD2 to R1 and R2, which encode and add it]

Write block:
  • Read old data block D2
  • Calculate the difference (diff)
  • Write new data block D2new to the storage server
  • Encode and add the diff at redundancy servers R1, R2 (commutative; see the formula below)
Read block:
  • Read data block D1
Recovery:
  • Read any k = 3 blocks out of the k + m, recover the lost blocks
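The reason the diff is sufficient, and why “encode and add” is commutative, is that every redundancy block is a linear combination of the data blocks over a Galois field. The notation below is my own (the slide only names the steps), but the relation is the standard one for an MDS/Reed-Solomon code with coefficients g_{j,i}:

\[
R_j \;=\; \sum_{i=1}^{k} g_{j,i}\, D_i
\quad\Longrightarrow\quad
R_j^{\text{new}} \;=\; R_j \;+\; g_{j,i}\,\underbrace{\bigl(D_i^{\text{new}} - D_i^{\text{old}}\bigr)}_{\text{diff}_{D_i}},
\qquad j = 1,\dots,m .
\]

In GF(2^w) addition and subtraction are both XOR, so each redundancy server can fold the encoded diff into its block independently of the others; the next slide shows why the order of such updates still matters once two clients update different data blocks concurrently and failures occur.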

SLIDE 15

Update Order Problem

Concurrent updates of data blocks D1 and D2

[Diagram: two clients concurrently update D1 and D2; each reads its old block, writes the new one, and sends diffD1 / diffD2 to R1 and R2, which may encode and add the diffs in different orders]

Problem:
  • Update order of the redundancy blocks R1, R2
  • Inconsistency in case of failures (data block or client failure)

SLIDE 16

2 Solutions for the Update Order Problem

  • PSW: Pessimistic protocol with Sequential Writing
    • a master (sequencer) enforces a total order of the updates on the Ri (see the sketch below)
    • a separate master per file is used for scaling
  • OCW: Optimistic protocol with Concurrent Writing
    • uses buffers and a replicated state machine
  • Performance
    • OCW is as fast as replication (RAID-1), but needs additional buffer space
    • PSW is slower, but needs no buffers

K. Peter, A. Reinefeld. Consistency and fault tolerance for erasure-coded distributed storage systems, DIDC 2012.
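A minimal, single-process sketch of the sequencing idea behind PSW follows; the names and data are hypothetical, and the real protocol (Peter & Reinefeld, DIDC 2012) additionally covers failure handling and per-file masters.

/* Sequencer hands out a total order; redundancy servers apply encoded
 * diffs strictly in that order, so R1 and R2 stay mutually consistent. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 4

typedef struct { uint64_t next_seq; } sequencer_t;          /* per-file master */

static uint64_t sequencer_assign(sequencer_t *s) { return s->next_seq++; }

typedef struct {
    uint64_t applied_seq;                 /* last sequence number applied */
    uint8_t  block[BLOCK_SIZE];           /* redundancy block R_j         */
} redundancy_server_t;

static int apply_diff(redundancy_server_t *r, uint64_t seq,
                      const uint8_t encoded_diff[BLOCK_SIZE])
{
    if (seq != r->applied_seq + 1)
        return -1;                        /* gap: earlier update still missing */
    for (int i = 0; i < BLOCK_SIZE; i++)
        r->block[i] ^= encoded_diff[i];   /* GF(2^w) add == XOR */
    r->applied_seq = seq;
    return 0;
}

int main(void)
{
    sequencer_t seq = { .next_seq = 1 };
    redundancy_server_t r1 = { 0 }, r2 = { 0 };

    /* Two clients update different data blocks; each server would receive
     * the diff encoded with its own coefficient (same bytes reused here). */
    uint8_t diff_d2[BLOCK_SIZE] = { 0x0f, 0x0f, 0x0f, 0x0f };
    uint8_t diff_d1[BLOCK_SIZE] = { 0xf0, 0x00, 0xf0, 0x00 };

    uint64_t s1 = sequencer_assign(&seq);                 /* update of D2 */
    uint64_t s2 = sequencer_assign(&seq);                 /* update of D1 */

    apply_diff(&r1, s1, diff_d2);  apply_diff(&r2, s1, diff_d2);
    apply_diff(&r1, s2, diff_d1);  apply_diff(&r2, s2, diff_d1);

    printf("R1[0]=%02x R2[0]=%02x\n", r1.block[0], r2.block[0]);
    return 0;
}

Out-of-order diffs are simply rejected here; an actual implementation would queue them until the gap closes, which is part of why PSW is slower than OCW.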
SLIDE 17

RESTART

Q1: Where to restart? Q2: Oversubscribing or Downscaling?

SLIDE 18

Restart

Does it matter where to restart a crashed process?

[Diagram: four cabinets (cabinet 0…3); a process is restarted elsewhere after a node failure in cabinet 3]

F. Wende, Th. Steinke, A. Reinefeld. The Impact of Process Placement and Oversubscription on Application Performance: A Case Study for Exascale Computing, EASC 2015.

SLIDE 19

Process Placement: CP2K on Cray XC40

  • CP2K setup: H2O-1024 with 5 MD steps
  • Placement across 4 cabinets is (color-)encoded into the string C1-C2-C3-C4

[Plot: runtime for different placements, from all processes in the same cabinet to processes 1…16 in a different electrical group]

Experiment setup
  • avg. of 6 runs
  • 16 procs. per node
  • explicit node allocation via Moab
  • exclusive system use
SLIDE 20

Process Placement: CP2K on Cray XC40

  • Communication matrix for H2O-1024, 512 MPI processes
  • Some MPI ranks are sources/destinations of gather and scatter operations
    → placing them far away from other processes may cause a performance decrease
  • Intra-group and nearest-neighbor communication

Notes:
  • tracing experiment with CrayPAT
  • some communication paths pruned away
SLIDE 21

Restart without free nodes

Oversubscribing or downscaling the application?

[Diagram: four cabinets (cabinet 0…3); oversubscribing in cabinet 2 after a node failure in cabinet 3?]

F. Wende, Th. Steinke, A. Reinefeld. The Impact of Process Placement and Oversubscription on Application Performance: A Case Study for Exascale Computing, EASC 2015.

SLIDE 22

Oversubscription

[Plots: oversubscription with the same application and oversubscription with different applications; runs marked better/worse than without oversubscription]

  • Oversubscribing for the whole runtime (on the Cray XC40: ALPS_APP_PE; see the sketch below)
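For context (my sketch, not the experiment’s code): under ALPS/aprun each process can read its placement rank from the ALPS_APP_PE environment variable even before MPI_Init, which is one way to implement rank-aware behaviour, e.g. staggering work between co-located ranks when oversubscribing.

/* Read the rank assigned by Cray ALPS (set by aprun on Cray XC systems). */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *pe = getenv("ALPS_APP_PE");
    long rank = pe ? strtol(pe, NULL, 10) : -1;

    if (rank < 0) {
        fprintf(stderr, "not running under ALPS/aprun\n");
        return 1;
    }
    /* Example policy (assumption, not from the slides): with 2x
     * oversubscription, ranks r and r + N share a core; the rank could
     * be used to stagger I/O or checkpoint phases between them. */
    printf("ALPS_APP_PE = %ld\n", rank);
    return 0;
}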
SLIDE 23

Summary & Outlook

  • We demonstrated the feasibility of in-memory, erasure-coded C/R on HPC systems
    • supports legacy applications (POSIX)
  • Next steps
    • implement full Reed-Solomon erasure codes with the Jerasure library
    • data scheduling: checkpointing should not impact running applications
    • improve coding speed on manycore, coprocessors, …
  • Publications
    • F. Wende, Th. Steinke, A. Reinefeld. The Impact of Process Placement and Oversubscription on Application Performance: A Case Study for Exascale Computing, EASC 2015.
    • Ch. Kleineweber, A. Reinefeld, T. Schütt. QoS-Aware Storage Virtualization for Cloud File Systems, 1st Programmable File Systems Workshop, HPDC’14.
    • K. Peter, A. Reinefeld. Consistency and fault tolerance for erasure-coded distributed storage systems, DIDC 2012.
    • H. Härtig, S. Matsuoka, F. Müller, A. Reinefeld. Resilience in Exascale Computing, Dagstuhl Reports, vol. 4, no. 9, pp. 124-139, doi: 10.4230/DagRep.4.9.124.