SLIDE 1

Fast In-Memory Checkpointing with POSIX API for Legacy Exascale Applications

Jan Fajerski, Matthias Noack, Alexander Reinefeld, Florian Schintke, Thorsten Schütt, Thomas Steinke, Zuse Institute Berlin

SPPEXA Symposium, 26.01.2016

SLIDE 2

FFMK: A Fast and Fault‐tolerant Microkernel‐based Operating System for Exascale Computing

Project goals
  • improve fault tolerance
  • reduce workload imbalances
  • reduce OS noise/overhead
Means
  • in-memory checkpoint/restart
  • fine-grained monitoring & control
  • low-overhead microkernel OS

SLIDE 3

Outline

1. In-memory C/R
  • First results on Cray XC40
2. Reed-Solomon Erasure Codes in Distributed, Unreliable Systems
3. Restart
  • Process placement after restart
  • Oversubscription after restart

SLIDE 4

Checkpointing Today

Cray XC40 “Konrad” @ ZIB
  • 1 Petaflop/s peak
  • 1872 nodes
  • 44928 cores
  • 117 TB memory

[Diagram: checkpoints are written to two Lustre file systems (Lustre 1 and Lustre 2), each with metadata servers (bmds1–4) and object storage servers (boss1–16) reached through LNET routers; one file system has 800 SATA disks with 2.3 PB net capacity, the other 480 SATA disks with 1.4 PB net capacity.]
SLIDE 5

Checkpointing on Exascale

  • Exascale systems: more components, more complexity
    → With today’s methods the time to save a checkpoint may be larger than the MTBF.
  • Must reduce …
    • data size → application-level checkpointing; can use multilevel checkpointing and utilize storage layers, e.g. NVRAM
    • writing time → stripe over many remote DRAMs via the fast interconnect, but this needs redundant storage (erasure coding)
  • Need POSIX compliance for legacy applications

Goal: checkpoint frequency of a few minutes
SLIDE 6

IN‐MEMORY C/R

Cray XC40 implementation and first results

SLIDE 7

In-Memory C/R with XtreemFS
  • We use XtreemFS, a user-space file system.
  • OSD data is stored in a tmpfs file system (RAM disk); legacy applications write checkpoints through the plain POSIX API (see the sketch below).

Stender, Berlin, Reinefeld. XtreemFS ‐ a File System for the Cloud, In: Data Intensive Storage Services for Cloud Environments, IGI Global, 2013.
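Because the checkpoint volume is mounted like any other file system, a legacy application can keep its existing checkpoint code. The following is a minimal sketch under assumptions of my own: the mount path /xtreemfs/ckpt and the write-then-rename pattern are illustrative, not prescribed by the slides.

/* Application-level checkpoint written through the plain POSIX API.
 * The target directory is assumed to be an XtreemFS mount backed by
 * remote tmpfs OSDs. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int write_checkpoint(const char *path, const void *state, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;

    const char *p = state;
    size_t left = len;
    while (left > 0) {                        /* handle short writes */
        ssize_t n = write(fd, p, left);
        if (n < 0) { close(fd); return -1; }
        p += n; left -= n;
    }
    if (fsync(fd) < 0 || close(fd) < 0) return -1;

    return rename(tmp, path);                 /* atomically publish checkpoint */
}

int main(void)
{
    double state[1024] = { 0 };               /* stand-in for application data */
    /* "/xtreemfs/ckpt/rank0.dat" is a hypothetical mount path. */
    if (write_checkpoint("/xtreemfs/ckpt/rank0.dat", state, sizeof state) != 0) {
        perror("checkpoint");
        return 1;
    }
    return 0;
}

Writing to a temporary file and renaming it keeps the previous checkpoint intact until the new one is complete, which matters when the checkpoint interval approaches the MTBF.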

SLIDE 8

Accessing the Remote RAM File System: 3 Options

  • FUSE (file system in user space): requires kernel modules
  • libxtreemfs: link the library into the client code
  • LD_PRELOAD: intercepts calls to dynamically linked libraries (see the sketch below)
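Of the three options, LD_PRELOAD needs the least from the system: it relies only on the dynamic linker and leaves the application binary unchanged. A minimal interposer sketch follows; it is my illustration, not the actual XtreemFS preload library.

/* Minimal LD_PRELOAD interposer: wraps the libc write() call. A real
 * interposer would check whether fd belongs to an intercepted path
 * (e.g. an XtreemFS volume) and forward the call to libxtreemfs. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>

typedef ssize_t (*write_fn)(int fd, const void *buf, size_t count);
static write_fn real_write;

ssize_t write(int fd, const void *buf, size_t count)
{
    if (!real_write)                                   /* resolve libc write() once */
        real_write = (write_fn)dlsym(RTLD_NEXT, "write");

    /* Here one could inspect fd and redirect to the user-space client. */
    return real_write(fd, buf, count);
}

Build it as a shared object (e.g. gcc -shared -fPIC -o libintercept.so intercept.c -ldl) and start the unmodified application with LD_PRELOAD pointing at it; a complete interposer would also wrap open(), read(), close(), and friends.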

SLIDE 9

Deployment on a Cray XC40

  • Experimental results with the BQCD code, which uses application-level C/R
  • OpenMPI
  • Node 1: DIR, MRC
  • Nodes 1…n: 1 OSD each
  • We manually killed the job and restarted it from the checkpoint.
  • Used CCM (cluster compatibility mode) on the Cray
  • need to run XtreemFS and the application side by side
  • need to be able to restart from the checkpoint
  • ibverbs (emulates InfiniBand on Aries)

SLIDE 10

Results (1): Sequential Write on Remote RAM Disk

[Plot: sequential write bandwidth when the client uses libxtreemfs, FUSE, or LD_PRELOAD]

Limited bandwidth because (1) synchronous access pattern, (2) single data stream, (3) single client

SLIDE 11

Results (2): Writing Erasure Coded Data

[Plots: read/write bandwidth vs. number of data OSDs, for access to a striped volume and to an erasure-coded volume. Setup: n = 2…13 clients, each writes 1 GB; striped volume: 13 nodes with 1 OSD and 1 client each; erasure-coded volume: 13 nodes with 1 OSD and 1 client each, plus 2 redundancy nodes.]

Writing encoded data is 17% to 49% slower than writing striped data (15% to 100% more data, see the calculation below).
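As a plausibility check (my arithmetic, not stated on the slide), the extra data volume follows directly from the m = 2 redundancy blocks written per stripe:

\[
\text{extra data} \;=\; \frac{m}{k}, \qquad m = 2,\; k = 2,\dots,13
\quad\Longrightarrow\quad
\frac{2}{13} \approx 15\,\% \;\le\; \frac{m}{k} \;\le\; \frac{2}{2} = 100\,\% ,
\]

which matches the 15–100% range quoted above.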

SLIDE 12

USING REED‐SOLOMON ERASURE CODES

IN DISTRIBUTED, UNRELIABLE SYSTEMS

Update Order Problem

SLIDE 13

RS Erasure Codes

  • RS erasure codes provide more flexibility than RAID: n = k + m, for various k, m
  • But the blocks in one stripe are dependent on each other (not so in replication)
  • Must ensure sequential consistency for client accesses
    • with failing servers
    • with failing (other) clients

SLIDE 14

RS Erasure Codes

MDS erasure code with k = 3 original blocks and m = 2 redundancy blocks

[Diagram: data blocks D1, D2, D3 and redundancy blocks R1, R2; the client reads the old D2, writes the new D2, and sends diffD2 to R1 and R2, which encode and add it]

Write block:
  • Read old data block D2
  • Calculate the difference (diff)
  • Write new data block D2new to the storage server
  • Encode and add the diff at redundancy servers R1, R2 (commutative; see the formula below)
Read block:
  • Read data block D1
Recovery:
  • Read any k = 3 blocks out of the k + m, recover the lost blocks
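The reason the diff is sufficient, and why “encode and add” is commutative, is that every redundancy block is a linear combination of the data blocks over a Galois field. The notation below is my own (the slide only names the steps), but the relation is the standard one for an MDS/Reed-Solomon code with coefficients g_{j,i}:

\[
R_j \;=\; \sum_{i=1}^{k} g_{j,i}\, D_i
\quad\Longrightarrow\quad
R_j^{\text{new}} \;=\; R_j \;+\; g_{j,i}\,\underbrace{\bigl(D_i^{\text{new}} - D_i^{\text{old}}\bigr)}_{\text{diff}_{D_i}},
\qquad j = 1,\dots,m .
\]

In GF(2^w) addition and subtraction are both XOR, so each redundancy server can fold the encoded diff into its block independently of the others; the next slide shows why the order of such updates still matters once two clients update different data blocks concurrently and failures occur.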

SLIDE 15

Update Order Problem

Concurrent updates of data blocks D1 and D2

[Diagram: two clients concurrently update D1 and D2; each reads its old block, writes the new one, and sends diffD1 / diffD2 to R1 and R2, which may encode and add the diffs in different orders]

Problem:
  • Update order of the redundancy blocks R1, R2
  • Inconsistency in case of failures (data block or client failure)

SLIDE 16

2 Solutions for the Update Order Problem

  • PSW: Pessimistic protocol with Sequential Writing
    • a master (sequencer) enforces a total order of the updates on the Ri (see the sketch below)
    • a separate master per file is used for scaling
  • OCW: Optimistic protocol with Concurrent Writing
    • uses buffers and a replicated state machine
  • Performance
    • OCW is as fast as replication (RAID-1), but needs additional buffer space
    • PSW is slower, but needs no buffers

K. Peter, A. Reinefeld. Consistency and fault tolerance for erasure-coded distributed storage systems, DIDC 2012.
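A minimal, single-process sketch of the sequencing idea behind PSW follows; the names and data are hypothetical, and the real protocol (Peter & Reinefeld, DIDC 2012) additionally covers failure handling and per-file masters.

/* Sequencer hands out a total order; redundancy servers apply encoded
 * diffs strictly in that order, so R1 and R2 stay mutually consistent. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 4

typedef struct { uint64_t next_seq; } sequencer_t;          /* per-file master */

static uint64_t sequencer_assign(sequencer_t *s) { return s->next_seq++; }

typedef struct {
    uint64_t applied_seq;                 /* last sequence number applied */
    uint8_t  block[BLOCK_SIZE];           /* redundancy block R_j         */
} redundancy_server_t;

static int apply_diff(redundancy_server_t *r, uint64_t seq,
                      const uint8_t encoded_diff[BLOCK_SIZE])
{
    if (seq != r->applied_seq + 1)
        return -1;                        /* gap: earlier update still missing */
    for (int i = 0; i < BLOCK_SIZE; i++)
        r->block[i] ^= encoded_diff[i];   /* GF(2^w) add == XOR */
    r->applied_seq = seq;
    return 0;
}

int main(void)
{
    sequencer_t seq = { .next_seq = 1 };
    redundancy_server_t r1 = { 0 }, r2 = { 0 };

    /* Two clients update different data blocks; each server would receive
     * the diff encoded with its own coefficient (same bytes reused here). */
    uint8_t diff_d2[BLOCK_SIZE] = { 0x0f, 0x0f, 0x0f, 0x0f };
    uint8_t diff_d1[BLOCK_SIZE] = { 0xf0, 0x00, 0xf0, 0x00 };

    uint64_t s1 = sequencer_assign(&seq);                 /* update of D2 */
    uint64_t s2 = sequencer_assign(&seq);                 /* update of D1 */

    apply_diff(&r1, s1, diff_d2);  apply_diff(&r2, s1, diff_d2);
    apply_diff(&r1, s2, diff_d1);  apply_diff(&r2, s2, diff_d1);

    printf("R1[0]=%02x R2[0]=%02x\n", r1.block[0], r2.block[0]);
    return 0;
}

Out-of-order diffs are simply rejected here; an actual implementation would queue them until the gap closes, which is part of why PSW is slower than OCW.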
SLIDE 17

RESTART

Q1: Where to restart? Q2: Oversubscribing or Downscaling?

SLIDE 18

Restart

Does it matter where to restart a crashed process?

[Diagram: four cabinets (cabinet 0…3); a process is restarted elsewhere after a node failure in cabinet 3]

F. Wende, Th. Steinke, A. Reinefeld. The Impact of Process Placement and Oversubscription on Application Performance: A Case Study for Exascale Computing, EASC 2015.

SLIDE 19

Process Placement: CP2K on Cray XC40

  • CP2K setup: H2O-1024 with 5 MD steps
  • Placement across 4 cabinets is (color-)encoded into the string C1-C2-C3-C4

[Plot: runtime for different placements, from all processes in the same cabinet to processes 1…16 in a different electrical group]

Experiment setup
  • avg. of 6 runs
  • 16 procs. per node
  • explicit node allocation via Moab
  • exclusive system use
SLIDE 20

Process Placement: CP2K on Cray XC40

  • Communication matrix for H2O-1024, 512 MPI processes
  • Some MPI ranks are sources/destinations of gather and scatter operations
    → placing them far away from other processes may cause a performance decrease
  • Intra-group and nearest-neighbor communication

Notes:
  • tracing experiment with CrayPAT
  • some communication paths pruned away
SLIDE 21

Restart without free nodes

Oversubscribing or downscaling the application?

[Diagram: four cabinets (cabinet 0…3); oversubscribing in cabinet 2 after a node failure in cabinet 3?]

F. Wende, Th. Steinke, A. Reinefeld. The Impact of Process Placement and Oversubscription on Application Performance: A Case Study for Exascale Computing, EASC 2015.

SLIDE 22

Oversubscription

[Plots: oversubscription with the same application and oversubscription with different applications; runs marked better/worse than without oversubscription]

  • Oversubscribing for the whole runtime (on the Cray XC40: ALPS_APP_PE; see the sketch below)
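For context (my sketch, not the experiment’s code): under ALPS/aprun each process can read its placement rank from the ALPS_APP_PE environment variable even before MPI_Init, which is one way to implement rank-aware behaviour, e.g. staggering work between co-located ranks when oversubscribing.

/* Read the rank assigned by Cray ALPS (set by aprun on Cray XC systems). */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *pe = getenv("ALPS_APP_PE");
    long rank = pe ? strtol(pe, NULL, 10) : -1;

    if (rank < 0) {
        fprintf(stderr, "not running under ALPS/aprun\n");
        return 1;
    }
    /* Example policy (assumption, not from the slides): with 2x
     * oversubscription, ranks r and r + N share a core; the rank could
     * be used to stagger I/O or checkpoint phases between them. */
    printf("ALPS_APP_PE = %ld\n", rank);
    return 0;
}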
SLIDE 23

Summary & Outlook

  • We demonstrated the feasibility of in-memory, erasure-coded C/R on HPC systems
    • supports legacy applications (POSIX)
  • Next steps
    • implement full Reed-Solomon erasure codes with the Jerasure library
    • data scheduling: checkpointing should not impact running applications
    • improve coding speed on manycore, coprocessors, …
  • Publications
    • F. Wende, Th. Steinke, A. Reinefeld. The Impact of Process Placement and Oversubscription on Application Performance: A Case Study for Exascale Computing, EASC 2015.
    • Ch. Kleineweber, A. Reinefeld, T. Schütt. QoS-Aware Storage Virtualization for Cloud File Systems, 1st Programmable File Systems Workshop, HPDC’14.
    • K. Peter, A. Reinefeld. Consistency and fault tolerance for erasure-coded distributed storage systems, DIDC 2012.
    • H. Härtig, S. Matsuoka, F. Müller, A. Reinefeld. Resilience in Exascale Computing, Dagstuhl Reports, vol. 4, no. 9, pp. 124-139, doi: 10.4230/DagRep.4.9.124.