 
Fast In-Memory Checkpointing with POSIX API for Legacy Exascale-Applications
Jan Fajerski, Matthias Noack, Alexander Reinefeld, Florian Schintke, Thorsten Schütt, Thomas Steinke
Zuse Institute Berlin
SPPEXA Symposium, 26.01.2016
FFMK: A Fast and Fault-tolerant Microkernel-based Operating System for Exascale Computing
Project Goals and Means
- Goal: to improve fault-tolerance. Means: in-memory checkpoint/restart
- Goal: to reduce workload imbalances. Means: fine-grained monitoring & control
- Goal: to reduce OS noise/overhead. Means: low-overhead microkernel OS
Alexander Reinefeld 2
Outline
1. In-memory C/R
   o First results on Cray XC40
2. Reed-Solomon Erasure Codes in Distributed, Unreliable Systems
3. Restart
   o Process placement after restart
   o Oversubscription after restart
Checkpointing Today
Cray XC40 “Konrad” @ZIB
- 1 Petaflop/s peak
- 1872 nodes
- 44928 cores
- 117 TB memory
[Figure: Lustre storage setup with nodes LNET1 to LNET15, metadata servers bmds1 to bmds4, and object servers boss1 to boss16]
Lustre 1: 480 SATA disks, 1.4 PB net capacity
Lustre 2: 800 SATA disks, 2.3 PB net capacity
Checkpointing on Exascale
• Exascale systems
  o more components
  o more complexity
  → With today’s methods the time to save the checkpoint may be larger than MTBF.
• Must reduce …
  o data size
    → application-level checkpointing
    → can use multilevel checkpointing
    → utilize storage layers, e.g. NVRAM
  o writing time
    → stripe over many remote DRAMs
    → via fast interconnect
    → but need redundant storage (EC)
• Need POSIX compliance
  o for legacy applications
Goal: Checkpoint frequency of a few minutes
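The relation between checkpoint writing time and MTBF can be put into numbers. The sketch below is an illustration only and not part of the talk: the 117 TB figure is the memory size of "Konrad" quoted earlier, while the bandwidth and MTBF values are hypothetical. The interval estimate uses Young's first-order approximation, a standard rule of thumb that the slides do not mention.

```python
import math


def checkpoint_write_time(data_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Seconds needed to write one checkpoint at a sustained bandwidth."""
    return data_bytes / bandwidth_bytes_per_s


def young_interval(write_time_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the checkpoint interval: sqrt(2 * delta * MTBF)."""
    return math.sqrt(2.0 * write_time_s * mtbf_s)


TB = 1e12
GB = 1e9

memory_bytes = 117 * TB   # full memory image of the machine (from the slides)
pfs_bandwidth = 50 * GB   # hypothetical parallel file system bandwidth, bytes/s
mtbf_s = 3600.0           # hypothetical system MTBF of one hour

write_time = checkpoint_write_time(memory_bytes, pfs_bandwidth)
interval = young_interval(write_time, mtbf_s)

print(f"write time: {write_time:.0f} s, MTBF: {mtbf_s:.0f} s")
print(f"suggested interval between checkpoints: {interval:.0f} s")
if write_time >= mtbf_s:
    print("checkpoint cannot complete before the next expected failure")
```

With these assumed values the write time is already a large fraction of the MTBF, which is the situation the slide warns about; lowering either the data size or the writing time shrinks that fraction.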
IN-MEMORY C/R
Cray XC40 implementation and first results
In-Memory C/R with XtreemFS
• We use XtreemFS, a user space file system.
  o OSD data is stored in the tmpfs file system (RAM disk)
Stender, Berlin, Reinefeld. XtreemFS - a File System for the Cloud, In: Data Intensive Storage Services for Cloud Environments, IGI Global, 2013.
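Because the RAM-backed volume is reached through ordinary POSIX calls, an application-level checkpoint needs nothing specific to XtreemFS. The following is a minimal sketch under stated assumptions: the function names are hypothetical, a temporary directory stands in for the mounted volume so the example runs anywhere, and the write-then-rename pattern is a common convention rather than something the slides prescribe.

```python
import os
import tempfile


def write_checkpoint(directory: str, name: str, payload: bytes) -> str:
    """Write payload with plain POSIX calls, then publish it with an atomic rename."""
    final_path = os.path.join(directory, name)
    tmp_path = final_path + ".tmp"
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(payload)
        while view:
            written = os.write(fd, view)   # write() may be partial, so loop
            view = view[written:]
        os.fsync(fd)                       # push the data to the backing store
    finally:
        os.close(fd)
    os.replace(tmp_path, final_path)       # readers never see a half-written file
    return final_path


def read_checkpoint(path: str) -> bytes:
    """Read a checkpoint back during restart."""
    with open(path, "rb") as f:
        return f.read()


# Hypothetical usage: in a real deployment `mount_point` would be the
# directory where the in-memory volume is mounted.
with tempfile.TemporaryDirectory() as mount_point:
    path = write_checkpoint(mount_point, "step_000100.ckpt", b"lattice state")
    assert read_checkpoint(path) == b"lattice state"
```

A legacy code that already writes its checkpoints through such calls would only need its output directory pointed at the mounted volume.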
Accessing Remote RAM File System: 3 Options
• FUSE (file system in user space): requires kernel modules
• libxtreemfs: link the library to client code
• LD_PRELOAD: intercepts calls to DLLs
Deployment on a Cray XC40
• Experimental results with BQCD code with application-level C/R
  o OpenMPI
    → Node 1: DIR, MRC
    → Node 1...n: 1 OSD each
  o We manually killed the job and restarted it from the checkpoint.
• Used CCM mode (cluster compatibility mode) on Cray
  o need to run XtreemFS and application side-by-side
  o need to be able to restart the checkpoint
  o ibverbs (emulates IB on Aries)
Results (1): Sequential Write on Remote RAM Disk
[Figure: sequential write results with the client using libxtreemfs, LD_PRELOAD, or FUSE]
Limited bandwidth because of
(1) synchronous access pattern,
(2) single data stream,
(3) single client.
Results (2): Writing Erasure Coded Data
n = 2 ... 13 clients, each writes 1 GB
[Figure, two panels with read and write curves over #data OSDs]
• Access to striped volume: 13 nodes with 1 OSD and 1 client each
• Access to erasure-coded volume: 13 nodes with 1 OSD and 1 client each + 2 redundancy nodes
Writing encoded data is 17% to 49% slower than writing striped data (15 .. 100% more data).
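The "15 .. 100% more data" range is consistent with two redundancy blocks being written per stripe of k data blocks. The short calculation below assumes that the number of data OSDs k runs over the same range as the number of clients (2 to 13) and that m = 2 throughout; both are inferences from the slide labels, not stated results.

```python
def redundancy_overhead(k: int, m: int = 2) -> float:
    """Extra data written, as a fraction of the payload, for k data and m redundancy blocks."""
    return m / k


for k in range(2, 14):
    print(f"k = {k:2d} data OSDs + 2 redundancy OSDs: "
          f"{redundancy_overhead(k):.0%} more data")
```

Under these assumptions the overhead falls from 100% at k = 2 to about 15% at k = 13, matching the two ends of the quoted range.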
USING REED-SOLOMON ERASURE CODES IN DISTRIBUTED, UNRELIABLE SYSTEMS
Update Order Problem
RS Erasure Codes
• RS erasure codes provide more flexibility than RAID
  o n = k + m, for various k, m
• But blocks in one stripe are dependent on each other
  o not so in replication
• Must ensure sequential consistency for client accesses
  o w/ failing servers
  o w/ failing (other) clients
RS Erasure Codes
MDS erasure code with k=3 original blocks and m=2 redundancy blocks
[Figure: data blocks D1, D2, D3 and redundancy blocks R1, R2; a write to D2 reads the old block, writes the new one, and sends diff D2 to R1 and R2, which encode and add it; a read goes to D1]
Write block:
  → Read old data block D2
  → Calculate difference (diff)
  → Write new data block D2new to storage server
  → Encode and add at redundancy servers R1, R2 (commutative)
Read block:
  → Read data block D1
Recovery:
  → Read any k=3 blocks out of k + m, recover lost blocks
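The write path above can be sketched in a few lines. This is an illustrative toy, not the implementation used in the talk: it assumes a k=3, m=2 linear code over GF(2^8) with coefficients chosen in the style of RAID-6 P/Q parity, and all function names are hypothetical. It shows that adding the encoded diff at R1 and R2 gives the same result as re-encoding the whole stripe.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8), reduction polynomial 0x11d."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


# Assumed coefficients: row i holds the factors redundancy block R(i+1)
# applies to D1, D2, D3. Any two lost blocks remain recoverable.
COEFF = ((1, 1, 1), (1, 2, 4))


def encode(data):
    """Compute both redundancy blocks from scratch."""
    blocks = []
    for row in COEFF:
        acc = bytearray(len(data[0]))
        for coeff, block in zip(row, data):
            for i, byte in enumerate(block):
                acc[i] ^= gf_mul(coeff, byte)
        blocks.append(bytes(acc))
    return blocks


def xor_diff(old: bytes, new: bytes) -> bytes:
    """Difference between old and new data block (addition in GF(2^8) is XOR)."""
    return bytes(o ^ n for o, n in zip(old, new))


def add_diff(redundancy: bytes, coeff: int, diff: bytes) -> bytes:
    """'Encode and add': scale the diff by the block's coefficient and add it."""
    return bytes(r ^ gf_mul(coeff, d) for r, d in zip(redundancy, diff))


data = [b"D1..", b"D2..", b"D3.."]
red = encode(data)

new_d2 = b"d2!!"
diff = xor_diff(data[1], new_d2)       # read old D2, calculate diff
data[1] = new_d2                       # write new D2
red = [add_diff(r, COEFF[i][1], diff) for i, r in enumerate(red)]

assert red == encode(data)             # same as re-encoding the whole stripe
```

Because the addition is a byte-wise XOR, diffs from different writes can be added in any order and yield the same final redundancy block, which is what "commutative" refers to.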
Update Order Problem
Concurrent updates of data blocks D1 and D2
[Figure: two concurrent writes to D1 and D2; each reads and writes its data block and sends its diff (diff D1, diff D2) to R1 and R2, which encode and add]
Problem:
  → Update order of redundancy blocks R1, R2
  → Inconsistency in case of failures (data block or client failure)
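One way a failure leads to inconsistency can be shown with a small simulation. The sketch assumes a toy linear code over the integers modulo 257 as a stand-in for Reed-Solomon, with one number per block; the coefficients, values, and the failure scenario (a client crashing after updating R1 but before R2) are illustrative assumptions.

```python
P = 257                              # small prime; arithmetic is modulo P
COEFF = ((1, 1, 1), (1, 2, 3))       # assumed coefficients for R1 and R2


def encode(data):
    """Redundancy values for a stripe of three data values."""
    return [sum(c * d for c, d in zip(row, data)) % P for row in COEFF]


def add_diff(red, row, col, diff):
    """Apply the diff of data block `col` to redundancy block `row`."""
    red[row] = (red[row] + COEFF[row][col] * diff) % P


def recover_d2(d1, d3, r, row):
    """Reconstruct D2 from D1, D3 and one redundancy block."""
    c1, c2, c3 = COEFF[row]
    return (r - c1 * d1 - c3 * d3) * pow(c2, -1, P) % P


data = [10, 20, 30]
red = encode(data)

# Client A updates D1 from 10 to 11; its diff reaches both R1 and R2.
data[0] = 11
add_diff(red, 0, 0, 1)
add_diff(red, 1, 0, 1)

# Client B updates D2 from 20 to 25, sends the diff to R1, then crashes
# before the diff reaches R2.
data[1] = 25
add_diff(red, 0, 1, 5)

via_r1 = recover_d2(data[0], data[2], red[0], 0)
via_r2 = recover_d2(data[0], data[2], red[1], 1)
print("D2 recovered via R1:", via_r1)
print("D2 recovered via R2:", via_r2)
```

After the partial update the two redundancy blocks describe different versions of D2, so a later recovery returns a different value depending on which redundancy block it reads.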
2 Solutions for Update Order Problem
• PSW: Pessimistic protocol with Sequential Writing
  o a master (sequencer) enforces the total order of updates on Ri
  o use separate master per file for scaling
• OCW: Optimistic protocol with Concurrent Writing
  o uses buffers and replicated state machine
• Performance
  o OCW is as fast as replication (RAID-1), but needs additional buffer space
  o PSW is slower, but no buffers
K. Peter, A. Reinefeld. Consistency and fault tolerance for erasure-coded distributed storage systems, DIDC 2012.
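The idea of a sequencer enforcing a total order can be illustrated with an in-process toy. This is not the PSW protocol from the cited paper: the class names are hypothetical, threads stand in for clients, a lock stands in for the per-file master, and XOR stands in for "encode and add". It only shows that routing every update through one sequencer makes all redundancy nodes see the same order.

```python
import threading


class RedundancyNode:
    """Toy redundancy server that records the order in which diffs arrive."""

    def __init__(self):
        self.value = 0
        self.log = []

    def apply(self, seq: int, diff: int) -> None:
        self.value ^= diff           # stand-in for encode-and-add
        self.log.append(seq)


class Sequencer:
    """Toy per-file master: one update at a time, same order on every node."""

    def __init__(self, nodes):
        self._nodes = nodes
        self._lock = threading.Lock()
        self._next_seq = 0

    def update(self, diff: int) -> int:
        with self._lock:
            seq = self._next_seq
            self._next_seq += 1
            for node in self._nodes:
                node.apply(seq, diff)
            return seq


def run_clients(n_clients: int = 8, updates_per_client: int = 50):
    """Let several client threads push diffs through one sequencer."""
    nodes = [RedundancyNode(), RedundancyNode()]
    master = Sequencer(nodes)

    def client(cid: int) -> None:
        for i in range(updates_per_client):
            master.update(cid * 1000 + i)

    threads = [threading.Thread(target=client, args=(c,)) for c in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return nodes


r1, r2 = run_clients()
assert r1.log == r2.log              # identical update order on both nodes
assert r1.value == r2.value
```

Serializing updates in this way costs concurrency, which fits the slide's remark that PSW is slower; using a separate sequencer per file limits that cost to writers of the same file.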
RESTART
Q1: Where to restart?
Q2: Oversubscribing or Downscaling?
Restart
Does it matter where to restart a crashed process?
[Figure: cabinets 0 to 3; process restart after node failure in cabinet 3]
F. Wende, Th. Steinke, A. Reinefeld. The Impact of Process Placement and Oversubscription on Application Performance: A Case Study for Exascale Computing, EASC-2015.
Process Placement: CP2K on Cray XC40
• CP2K setup: H2O-1024 with 5 MD steps
• Placement across 4 cabinets is (color)encoded into string C1-C2-C3-C4
[Figure labels: "processes 1..16 in different electrical group" and "all processes in same cabinet"]
Experiment setup
  o avg. of 6 runs
  o 16 procs. per node
  o explicit node allocation via Moab
  o exclusive system use
Process Placement: CP2K on Cray XC40
• Communication matrix for H2O-1024, 512 MPI processes
  o Some MPI ranks are src./dest. of gather and scatter operations
    → Placing them far away from other processes may cause performance decrease
  o Intra-group and nearest neighbor communication
Notes:
  o tracing experiment with CrayPAT
  o some comm. paths pruned away
Restart without free nodes
Oversubscribing or downscaling the application?
[Figure: cabinets 0 to 3; oversubscribing in cabinet 2 after node failure in cabinet 3?]
F. Wende, Th. Steinke, A. Reinefeld. The Impact of Process Placement and Oversubscription on Application Performance: A Case Study for Exascale Computing, EASC-2015.
Oversubscription
[Figure, two panels with axes marked "better" and "worse": "Oversubscription with the same application" and "Oversubscription with different applications"]
Oversubscribing for the whole runtime on Cray XC40, ALPS_APP_PE
Summary & Outlook
• We demonstrated the feasibility of in-memory erasure-coded C/R on HPC
  o supports legacy applications (POSIX)
• Next steps
  o implement full Reed-Solomon erasure codes with Jerasure lib
  o data scheduling: checkpointing should not impact running applications
  o improve coding speed on manycore, coprocessor, …
• Publications
  o F. Wende, Th. Steinke, A. Reinefeld. The Impact of Process Placement and Oversubscription on Application Performance: A Case Study for Exascale Computing, EASC-2015.
  o Ch. Kleineweber, A. Reinefeld, T. Schütt. QoS-Aware Storage Virtualization for Cloud File Systems. 1st Programmable File Systems Workshop, HPDC’14.
  o K. Peter, A. Reinefeld. Consistency and fault tolerance for erasure-coded distributed storage systems, DIDC 2012.
  o H. Härtig, S. Matsuoka, F. Müller, A. Reinefeld. Resilience in Exascale Computing, Dagstuhl Reports, vol. 4, no. 9, pp. 124-139, doi: 10.4230/DagRep.4.9.124