Fast In‐Memory Checkpointing with POSIX API for Legacy Exascale‐Applications
SPPEXA Symposium, 26.01.2016
Jan Fajerski, Matthias Noack, Alexander Reinefeld, Florian Schintke, Thorsten Schütt, Thomas Steinke
Zuse Institute Berlin
FFMK: A
Alexander Reinefeld 2
Cray XC40 “Konrad” @ZIB
- 1 Petaflop/s peak
- 1872 nodes
- 44928 cores
- 117 TB memory
[Figure: storage attached to the Cray XC40. Two Lustre file systems (Lustre 1, Lustre 2), each with metadata servers: one with 480 SATA disks, 1.4 PB net capacity (boss1 to boss8, bmds1, bmds2), and one with 800 SATA disks, 2.3 PB net capacity (boss9 to boss16, bmds3, bmds4); connected via LNET1 to LNET15.]
Goal: Checkpoint frequency
Stender, Berlin, Reinefeld. XtreemFS ‐ a File System for the Cloud, In: Data Intensive Storage Services for Cloud Environments, IGI Global, 2013.
FUSE (file system in user space): requires kernel modules
libxtreemfs: link the library to client code
LD_PRELOAD: intercepts calls to dynamically linked libraries
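The interception idea behind the LD_PRELOAD option can be illustrated in a few lines. The following is only a minimal Python sketch, not the XtreemFS client and not an actual LD_PRELOAD shim: it wraps an `open`-like function so that paths under a hypothetical prefix `/ckpt/` are served from an in-memory dictionary, while all other paths are passed through to the original function. The names `InMemoryFile`, `make_intercepting_open` and the prefix are assumptions made for illustration, and only binary read and write modes are handled.

```python
import io


class InMemoryFile(io.BytesIO):
    """Binary file object that publishes its content to a shared dict on close."""

    def __init__(self, store, path):
        super().__init__()
        self._store = store
        self._path = path

    def close(self):
        # Save the written bytes before the buffer is released.
        if not self.closed:
            self._store[self._path] = self.getvalue()
        super().close()


def make_intercepting_open(real_open, prefix, store):
    """Return an open()-like function that redirects paths under `prefix`."""

    def intercepting_open(path, mode="rb", *args, **kwargs):
        path = str(path)
        if not path.startswith(prefix):
            # Not a checkpoint path: forward to the original function unchanged.
            return real_open(path, mode, *args, **kwargs)
        if "r" in mode:
            if path not in store:
                raise FileNotFoundError(path)
            return io.BytesIO(store[path])
        return InMemoryFile(store, path)

    return intercepting_open


# Usage: the application code keeps calling an open()-like function.
checkpoints = {}
ckpt_open = make_intercepting_open(open, "/ckpt/", checkpoints)

with ckpt_open("/ckpt/rank0.bin", "wb") as f:
    f.write(b"solver state")

with ckpt_open("/ckpt/rank0.bin", "rb") as f:
    restored = f.read()
```

The point of the design is that the calling code does not change: only the function behind the familiar file interface is swapped, which is what makes this style of interception attractive for legacy applications.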
Limited bandwidth because (1) synchronous access pattern, (2) single data stream, (3) single client
[Figure, two panels; x-axis: #data OSDs]
Access to striped volume: 13 nodes with 1 OSD and 1 client each
Access to erasure-coded volume: 13 nodes with 1 OSD and 1 client each + 2 redundancy nodes
Writing encoded data is 17% to 49% slower than writing striped data (while writing 15% to 100% more data)
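The range of additional data can be reproduced with simple arithmetic. The sketch below assumes that the extra volume is the ratio of redundancy blocks to data blocks, m / k, with m = 2 taken from the "+ 2 redundancy nodes" setup; that the quoted range corresponds to k = 2 up to k = 13 data OSDs is an inference, not something stated on the slide. The function name is hypothetical.

```python
def extra_data_fraction(k: int, m: int) -> float:
    """Fraction of additional data written when k data blocks get m redundancy blocks."""
    if k <= 0 or m < 0:
        raise ValueError("need k > 0 and m >= 0")
    return m / k


# m = 2 redundancy blocks, varying number of data OSDs (assumed range)
for k in (2, 4, 8, 13):
    print(f"k={k:2d}, m=2: {extra_data_fraction(k, 2):.1%} more data")
```

With k = 2 the redundancy doubles the written volume (100% more data), and with k = 13 it adds about 15.4%, which is consistent with the range given above.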
[Figure: write of block D2; diffD2 is sent to the redundancy servers (encode and add)]

Write block:
- Read old data block D2
- Calculate difference (diff)
- Write new data block D2new to storage server
- Encode and add at redundancy servers R1, R2 (commutative)

Read block:
- Read data block D1

Recovery:
- Read any k=3 blocks out of k + m, recover lost blocks
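The write and recovery steps above can be sketched as follows. This is a toy illustration under stated assumptions, not the XtreemFS implementation: it uses a linear code over the small prime field GF(257) with hypothetical coefficients, k = 3 data blocks and m = 2 redundancy blocks, whereas production erasure codes typically work over GF(2^8). Recovery is shown only for the case of two lost data blocks; recovering from an arbitrary set of k blocks would solve the corresponding linear system.

```python
P = 257                          # small prime field, chosen only for illustration
COEFF = [[1, 1, 1], [1, 2, 3]]   # hypothetical coefficients: rows R1, R2; columns D1..D3


def encode(data):
    """Compute the m=2 redundancy blocks from k=3 data blocks (lists of ints < P)."""
    n = len(data[0])
    return [[sum(c * blk[x] for c, blk in zip(row, data)) % P for x in range(n)]
            for row in COEFF]


def diff(old, new):
    """Element-wise difference between the new and the old content of a data block."""
    return [(b - a) % P for a, b in zip(old, new)]


def encode_and_add(parity_block, j, i, d):
    """At redundancy server j: encode the diff of data block i and add it in place."""
    c = COEFF[j][i]
    for x, dx in enumerate(d):
        parity_block[x] = (parity_block[x] + c * dx) % P


def recover_two(data, parity, a, b):
    """Recover lost data blocks a and b from the surviving data block and R1, R2."""
    known = [i for i in range(len(COEFF[0])) if i not in (a, b)]
    c1a, c1b = COEFF[0][a], COEFF[0][b]
    c2a, c2b = COEFF[1][a], COEFF[1][b]
    det_inv = pow((c1a * c2b - c1b * c2a) % P, -1, P)   # needs Python 3.8+
    out_a, out_b = [], []
    for x in range(len(parity[0])):
        s1 = (parity[0][x] - sum(COEFF[0][i] * data[i][x] for i in known)) % P
        s2 = (parity[1][x] - sum(COEFF[1][i] * data[i][x] for i in known)) % P
        out_a.append((s1 * c2b - s2 * c1b) * det_inv % P)   # Cramer's rule
        out_b.append((s2 * c1a - s1 * c2a) * det_inv % P)
    return out_a, out_b


# Write block D2 (index 1)
data = [[10, 20], [30, 40], [50, 60]]
parity = encode(data)
new_d2 = [31, 45]
d = diff(data[1], new_d2)             # read old block, calculate diff
data[1] = new_d2                      # write new block to the storage server
for j in range(2):
    encode_and_add(parity[j], j, 1, d)   # R1 and R2 each add the encoded diff

# Recovery: D1 and D2 are lost; read D3, R1, R2 (k = 3 out of k + m = 5 blocks)
rec_d1, rec_d2 = recover_two([None, None, data[2]], parity, 0, 1)
```

Because the redundancy servers only add encoded diffs, the result does not depend on the order in which diffs of different data blocks arrive, which is the commutativity noted in the write steps.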
[Figure: concurrent writes of D1 and D2; diffD1 and diffD2 are sent to the redundancy servers (encode and add)]
Problem: Update order of redundancy blocks R1, R2. Inconsistency in case of failures (data block or client failure).
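The inconsistency can be made concrete with a small example. The sketch below is an assumption-laden toy, using single symbols instead of blocks, a linear code over GF(257), and hypothetical coefficients: a client updates D2, R1 receives the encoded diff, and the client fails before R2 receives it. Recovering a lost D1 then gives different answers depending on which redundancy symbol is used.

```python
P = 257
C = [(1, 1, 1), (1, 2, 3)]   # hypothetical coefficients for R1 and R2


def parity(data, row):
    """Redundancy symbol for one coefficient row."""
    return sum(c * d for c, d in zip(C[row], data)) % P


def recover_with(row, data, r, lost):
    """Recover data[lost] from the other data symbols and one redundancy symbol r."""
    rest = sum(c * d for i, (c, d) in enumerate(zip(C[row], data)) if i != lost)
    return (r - rest) * pow(C[row][lost], -1, P) % P   # needs Python 3.8+


data = [10, 20, 30]
r1, r2 = parity(data, 0), parity(data, 1)

# The client updates D2 (index 1) from 20 to 25 ...
d = (25 - data[1]) % P
data[1] = 25
r1 = (r1 + C[0][1] * d) % P   # R1 receives the encoded diff
# ... and fails here, so R2 never receives it.

# Now D1 (index 0) is lost and has to be reconstructed.
via_r1 = recover_with(0, data, r1, lost=0)
via_r2 = recover_with(1, data, r2, lost=0)
print(via_r1, via_r2)
```

Here `via_r1` yields the original value 10, while `via_r2` does not, because R2 still reflects the old content of D2. A reader has no way of telling from the stored blocks alone which of the two results is correct.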
[Figure: cabinets 0 to 3; process restart after node failure in cabinet 3]
Exascale Computing, EASC‐2015.
Experiment Setup
- all processes in same cabinet
- processes 1..16 in different electrical group
[Figure: cabinets 0 to 3]
Oversubscribing in cabinet 2 after node failure in cabinet 3?
Exascale Computing, EASC‐2015.
[Figure: Oversubscribing for the whole runtime; axes annotated "better" / "worse"]
Performance: A Case Study for Exascale Computing, EASC‐2015.
Programmable File Systems Workshop, HPDC’14
2012.