SLIDE 1

Towards Scalable Application Checkpointing with Parallel File System Delegation

Dulcardo Arteaga Ming Zhao

darte003@fiu.edu ming@cs.fiu.edu

School of Computing and Information Sciences Florida International University Miami, FL

SLIDE 2

High Performance Computing Systems


SLIDE 3

Introduction

Scalability

  • Large-scale applications run on HPC systems
  • One important challenge is fault tolerance
  • A common approach is checkpointing

Checkpointing

  • Store a snapshot of the current application state
  • Applications recover from a valid snapshot in case of failure

HPC systems use a parallel file system (PFS) for checkpointing


SLIDE 4

Parallel File Systems (PFSes)

Components:

  • Metadata servers: store metadata information about files
  • Data servers: store the actual data of files
  • Clients: run on compute nodes and provide the interface to the storage system


SLIDE 5

Problem Statement

Problem: Large-scale checkpointing causes a serious bottleneck at the metadata servers of HPC systems

Approach: Delegate the management of the PFS storage space used for checkpointing to applications, reducing metadata overhead


SLIDE 6

Outline

  1. Introduction
  2. Checkpointing Modes
  3. Approach
  4. Experimental Evaluation
  5. Conclusions

SLIDE 7

Checkpointing Modes

File-per-Process (N-N)

  • Every process writes to a different file

Metadata management overhead

  • Implies the creation of many files
  • Metadata operations per file and per process

[Figure: N-N mode; processes P1-P6 on compute nodes N1-N3 each write to a separate file]


SLIDE 8

Checkpointing Modes

Shared-File (N-1)

[Figure: N-1 segmented and N-1 strided modes; processes P1-P6 on nodes N1-N3 all write to one shared file]

Shared-File (N-1) segmented

  • Processes write sequentially to contiguous regions of the shared file

Shared-File (N-1) strided

  • Processes write to interleaved parts of the shared file

Metadata management overhead

  • Every process requests the same metadata every time
  • File locking
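Below is a minimal MPI-IO sketch (not from the talk; the file names and the 1 MiB per-process checkpoint size are assumptions) contrasting an N-N write with an N-1 segmented write; a strided write would interleave smaller blocks instead:

    /* Minimal sketch contrasting N-N and N-1 segmented checkpoint
     * writes with MPI-IO. File names and sizes are illustrative. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define CKPT_SIZE (1 << 20)   /* assumed 1 MiB checkpoint per process */

    static char buf[CKPT_SIZE];

    int main(int argc, char **argv) {
        int rank;
        MPI_File fh;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, rank, CKPT_SIZE);

        /* N-N (file-per-process): every rank creates its own file, so
         * the metadata server sees one create/lookup per process. */
        char name[64];
        snprintf(name, sizeof(name), "ckpt.%d", rank);
        MPI_File_open(MPI_COMM_SELF, name,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write(fh, buf, CKPT_SIZE, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        /* N-1 segmented: all ranks share one file; each writes a
         * contiguous region at rank * CKPT_SIZE. Only one file exists,
         * but every process still queries its metadata. */
        MPI_File_open(MPI_COMM_WORLD, "ckpt.shared",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at(fh, (MPI_Offset)rank * CKPT_SIZE, buf,
                          CKPT_SIZE, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }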


SLIDE 9

Approach - PFS-delegation

[Figure: compute nodes run the application over MPI-IO and a PFS-D layer holding a metadata table; a reserved space on the data servers is divided into per-process checkpoint spaces (Proc. 1 through Proc. n)]

  1. Create the reserved space (only one time)
  2. Receive the metadata of the reserved space (only one time)
  3. Perform I/O (reads/writes of checkpoints) directly to the data servers

Reading and writing checkpoints requires only step 3 once the reserved space has been created

The application uses the PFS-delegation interfaces; PFS-delegation uses the MPI-IO API to communicate with the servers
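As a hedged end-to-end sketch of this flow (not the authors' code; open_reserved() and PFS_write_file() are hypothetical helpers fleshed out on later slides):

    /* Steps 1-2 run once at startup; step 3 is the only interaction
     * needed for every later checkpoint. */
    #include <mpi.h>
    #include <stddef.h>

    MPI_File open_reserved(void);                                  /* sketched later */
    int PFS_write_file(MPI_File fh, const void *buf, size_t len);  /* sketched later */

    void checkpoint_loop(const void *state, size_t len, int iterations) {
        /* Steps 1-2: create the reserved space and receive its
         * metadata (both assumed hidden inside open_reserved here). */
        MPI_File fh = open_reserved();
        for (int i = 0; i < iterations; i++) {
            /* ... application computes one step ... */
            /* Step 3: write the checkpoint directly to the data servers. */
            PFS_write_file(fh, state, len);
        }
        MPI_File_close(&fh);
    }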


SLIDE 17

Reserving Delegated Storage Space

The reservation is made by creating one large logical file across the PFS data servers

[Figure: the reserved space spans the data servers and is divided into per-process checkpoint spaces]

To avoid a large upfront cost at reservation time, different techniques can be used:

  • Create a sparse file by writing only the last byte of the corresponding data file (PVFS2; see the sketch below)
  • Use fallocate (GPFS)

This process is executed only once. The size of the reserved space should account for:

  • The size of a single checkpoint
  • The number of checkpoints to keep
  • The storage policy
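A minimal sketch of the sparse-file technique, assuming PVFS2 semantics (this is not the authors' implementation; path, size, and hints are caller-supplied):

    /* Reserve a large logical file cheaply by writing only its last
     * byte, which creates a sparse file of the requested size. */
    #include <mpi.h>

    int reserve_space(const char *path, MPI_Offset size, MPI_Info info) {
        MPI_File fh;
        char last = 0;
        int rank, rc;

        rc = MPI_File_open(MPI_COMM_WORLD, path,
                           MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);
        if (rc != MPI_SUCCESS) return rc;

        /* Only rank 0 touches the file: writing byte (size - 1) extends
         * the logical size without transferring any checkpoint data. */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            MPI_File_write_at(fh, size - 1, &last, 1, MPI_BYTE,
                              MPI_STATUS_IGNORE);

        MPI_Barrier(MPI_COMM_WORLD);
        return MPI_File_close(&fh);
    }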


SLIDE 18

Data Layout

The layout is specified as a regular file layout using MPI-IO


PFS-delegation uses the following MPI-IO hints for layout definition:

  • striping_factor: number of data servers involved
  • striping_unit: stripe size in bytes

The PVFS2 implementation uses the simple-stripe, round-robin distribution:

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "4");
    MPI_Info_set(info, "striping_unit", "65536");
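Building on these hints, a sketch of opening the reserved file with this layout (the file name "pfs_reserved_space" is an assumption; with these hints PVFS2 stripes the file across 4 data servers in 64 KiB units):

    #include <mpi.h>

    MPI_File open_reserved(void) {
        MPI_Info info;
        MPI_File fh;
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "4");   /* 4 data servers */
        MPI_Info_set(info, "striping_unit", "65536"); /* 64 KiB stripes */
        MPI_File_open(MPI_COMM_WORLD, "pfs_reserved_space",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);
        MPI_Info_free(&info);
        return fh;
    }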


SLIDE 19

Reserved-Space Distribution


Metadata Table

Each process's entry holds four fields: offset_start, offset_end, offset_next, and revision

  • offset_start and offset_end: specify the limits of the client's assigned region
  • offset_next: specifies the next valid offset at which to write a checkpoint
  • revision: checkpointing counter

A sketch of such an entry follows below.
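A sketch of one metadata-table entry per process; the field names follow the slide, but the exact types are assumptions:

    #include <stdint.h>

    typedef struct {
        uint64_t offset_start; /* first byte of this process's region    */
        uint64_t offset_end;   /* last byte of this process's region     */
        uint64_t offset_next;  /* next valid offset for a new checkpoint */
        uint32_t revision;     /* checkpointing counter                  */
    } pfsd_meta_entry;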


SLIDE 20

Accessing Checkpoints in Delegated Space

PFS-delegation provides interfaces for applications to write and read checkpoints in the delegated space (possible signatures are sketched after the table below)


  Interface Name           Description
  PFS_write_file           Writes a checkpoint to the delegated space
  PFS_read_file            Reads the last valid checkpoint from the delegated space
  PFS_read_file_revision   Reads a specific past checkpoint stored in the delegated space
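The slides give only names and descriptions, so the following prototypes are plausible assumptions, not the actual API:

    #include <mpi.h>
    #include <stddef.h>
    #include <stdint.h>

    int PFS_write_file(MPI_File fh, const void *buf, size_t len);
    int PFS_read_file(MPI_File fh, void *buf, size_t len);
    int PFS_read_file_revision(MPI_File fh, void *buf, size_t len,
                               uint32_t revision);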


SLIDE 21

Single Checkpoint Write Process

  1. Read the metadata table
  2. Get the offset "offset_next" (available space)
  3. Call MPI-IO functions to do the write
  4. Update the metadata table with the new offsets
  5. Increase the revision number

  • Only one process (rank 0) performs the lookup and update of the metadata table
  • In the N-N and N-1 modes, many processes update metadata information

A sketch of how these steps could map onto PFS_write_file follows below.
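A hedged sketch of steps 1-5, assuming the pfsd_meta_entry from the Reserved-Space Distribution slide is cached client-side; rank-0 coordination, table persistence, and the wrap-around policy are illustrative simplifications:

    #include <mpi.h>
    #include <stddef.h>
    #include <stdint.h>

    extern pfsd_meta_entry *my_entry;  /* this process's cached table entry */

    int PFS_write_file(MPI_File fh, const void *buf, size_t len) {
        /* Steps 1-2: look up the table and take offset_next. */
        MPI_Offset off = (MPI_Offset)my_entry->offset_next;
        if (my_entry->offset_next + len > my_entry->offset_end)
            off = (MPI_Offset)my_entry->offset_start; /* assumed wrap policy */

        /* Step 3: MPI-IO write straight to the data servers. */
        int rc = MPI_File_write_at(fh, off, (void *)buf, (int)len,
                                   MPI_BYTE, MPI_STATUS_IGNORE);
        if (rc != MPI_SUCCESS) return rc;

        /* Steps 4-5: advance offset_next and bump the revision counter;
         * rank 0 would then write the updated table back. */
        my_entry->offset_next = (uint64_t)off + len;
        my_entry->revision += 1;
        return MPI_SUCCESS;
    }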


SLIDE 22

Single Checkpoint Read Process

  1. Read the metadata table
  2. Get the corresponding offset where the data is located
  3. Call MPI-IO functions to perform the read in parallel

  • Only one process (rank 0) performs the lookup at the metadata table
  • In the N-N and N-1 modes, many processes update metadata information

A sketch of how these steps could map onto PFS_read_file follows below.
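A hedged sketch of steps 1-3, reading the last valid checkpoint back under the same cached-entry assumption as the write sketch above:

    #include <mpi.h>
    #include <stddef.h>

    int PFS_read_file(MPI_File fh, void *buf, size_t len) {
        /* Steps 1-2: the last checkpoint of length len ends at
         * offset_next, so it starts len bytes before it. */
        MPI_Offset off = (MPI_Offset)(my_entry->offset_next - len);

        /* Step 3: all processes read their regions in parallel. */
        return MPI_File_read_at(fh, off, buf, (int)len,
                                MPI_BYTE, MPI_STATUS_IGNORE);
    }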


SLIDE 23

Experimental Evaluation

Setup

Evaluation was performed on our cluster:

  • Eleven DELL cluster nodes
  • 2 six-core 2.4 GHz Opterons per node
  • 32 GB RAM, 500 GB SAS disk
  • OS: Ubuntu 8.04, kernel 2.6.24-16-server

Benchmark: IOR2

                       Distributed Metadata Server       Centralized Metadata Server
  Node 1 to Node 4     4 Meta Servers, 4 Data Servers    4 Data Servers
  Node 5                                                 1 Meta Server
  Node 6 to Node 11    16 to 128 Processes               16 to 128 Processes


SLIDE 24

Centralized Metadata Server

Checkpointing Time

[Figure: runtime in seconds vs. number of clients (16, 32, 64, 128) for PFS-Delegation, Shared-File, and File-Per-Process]

  • Performance is similar with fewer than 64 clients
  • With 128 clients, PFS-delegation is:
      • 7% faster than "shared-file"
      • 10% faster than "file-per-process"


SLIDE 26

Centralized Metadata Server

Total Number of Metadata Operations

  Clients             16    32    64    128
  PFS-Delegation      33    65   129    257
  Shared-File        188   348   669   1310
  File-Per-Process   608  1034  2376   4132

  • PFS-delegation reduces metadata operations to about 7% of "file-per-process" and 20% of "shared-file"; "shared-file" is about 30% of "file-per-process"
  • With 128 processes, metadata operations are reduced by 1053 compared to "shared-file" and by 3875 compared to "file-per-process"


SLIDE 28

Centralized Metadata Server

Different Metadata Operations with 128 Processes

[Figure: number of GETCONFIG, GETATTR, CREATE, LOOKUP, and CRDIRENT messages for PFS-Delegation, Shared-File, and File-Per-Process]

  • PFS-delegation issues far fewer "GETATTR" operations than the other two methods
  • GETATTR is triggered by create, read, and write


SLIDE 30

Distributed Metadata Server

Checkpointing Time

[Figure: runtime in seconds vs. number of clients (16, 32, 64, 128) for PFS-Delegation, Shared-File, and File-Per-Process]

  • Performance is similar with fewer than 32 clients
  • With 128 clients, PFS-delegation is:
      • 22% faster than "shared-file"
      • 31% faster than "file-per-process"


SLIDE 32

Distributed Metadata Server

Total Number of Metadata Operations

  Clients              16    32    64    128
  PFS-Delegation       81   161   321    641
  Shared-File         563   733  2397   3667
  File-Per-Process   1008  2016  3830   8114

  • More metadata operations than with the centralized metadata server
  • PFS-delegation reduces metadata operations to about 20% of "shared-file" and 10% of "file-per-process"

SLIDE 33

Distributed Metadata Server

Different Metadata Operations with 128 Processes

[Figure: number of GETCONFIG, GETATTR, CREATE, LOOKUP, and CRDIRENT messages for PFS-Delegation, Shared-File, and File-Per-Process]

  • PFS-delegation issues far fewer "GETATTR" operations than "shared-file" and "file-per-process"


SLIDE 35

Related Work

PLFS - Parallel Log-structured File System

  • Maps the access pattern from N-1 to N-N
  • Creates an interposition layer between the application and the PFS
  • Implements access transparently by providing an ad_plfs MPI-IO driver

GFS - Google File System

  • Handles large workloads
  • Performs better with append-only writes

LWFS - Light-Weight File System

  • No traditional PFS services
  • Provides secure access and high-level services


SLIDE 36

Conclusions

PFS-delegation is a checkpointing technique that reduces the metadata management overhead

  • Requires no modifications to the PFS
  • Provides simple interfaces to applications

A prototype on PVFS2 was implemented, with good results compared to shared-file and file-per-process:

  • 7% and 10% speedup using the centralized metadata server
  • 22% and 31% speedup using the distributed metadata server


SLIDE 37

Future Work

  • Implement a PFS-delegation MPI-IO driver to provide full transparency to applications
  • Integrate PFS-delegation with netCDF/HDF5 to structure the reserved space
  • Scale up the number of clients/servers in future experiments


SLIDE 38

Acknowledgment

This research is sponsored by the National Science Foundation under grant CCF-0938045 and the Department of Homeland Security under grant 2010-ST-062-00039

Questions?

WEB: http://visa.cis.fiu.edu darte003@fiu.edu ming@cs.fiu.edu
