Towards Scalable Application Checkpointing with Parallel File System Delegation
Dulcardo Arteaga Ming Zhao
darte003@fiu.edu ming@cs.fiu.edu
School of Computing and Information Sciences Florida International University Miami, FL
Background Checkpointing Modes Approach Experimental Evaluation Conclusions 1 / 25
1 Introduction
2 Checkpointing Modes
3 Approach
4 Experimental Evaluation
5 Conclusion
File-per-Process
Figure: three compute nodes, N1 (P1, P2), N2 (P3, P4) and N3 (P5, P6); every process writes its checkpoint to its own file.
Shared-File
Figure: the same nodes and processes, N1 (P1, P2), N2 (P3, P4) and N3 (P5, P6), all writing their checkpoints into one shared file.
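As a rough illustration of how the two modes place data, here is a minimal Python sketch. It assumes every process checkpoints an equal-size block, and the file names `ckpt.<rank>` and `ckpt.shared` are hypothetical. In file-per-process mode each rank writes a private file starting at offset 0. In shared-file mode every rank writes a disjoint region of a single file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    path: str    # file the process writes its checkpoint to
    offset: int  # byte offset inside that file


def file_per_process(rank: int, prefix: str = "ckpt") -> Placement:
    # One private file per process (hypothetical naming scheme); the file
    # creations are what load the metadata servers in this mode.
    return Placement(path=f"{prefix}.{rank}", offset=0)


def shared_file(rank: int, chunk_size: int, path: str = "ckpt.shared") -> Placement:
    # One file for everybody; each rank gets a disjoint block
    # (assumption: all ranks checkpoint exactly chunk_size bytes).
    return Placement(path=path, offset=rank * chunk_size)


if __name__ == "__main__":
    # Six processes P1..P6 (ranks 0..5), two per node on N1..N3.
    for rank in range(6):
        node = rank // 2 + 1
        print(f"N{node} P{rank + 1}: {file_per_process(rank)} | {shared_file(rank, 4096)}")
```

The sketch only computes placements; it does not perform any I/O.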
Figure: Compute Nodes, Metadata Servers and Data Servers. Labels: Application, MPI-IO, PFS-D, Metadata table, and a RESERVED SPACE made up of checkpoint space regions; numbered steps 1, 2 and 3 mark the flow.
Figure: the same components scaled out, with ellipses indicating additional compute nodes, servers and checkpoint space regions.
Checkpoint write:
1 Read the metadata table
2 Get the offset “offset next” (the start of the available space)
3 Call MPI-IO functions to perform the write
4 Update the metadata table with the new offsets
5 Increase the revision number
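The write steps can be sketched in Python, with an in-memory buffer standing in for the reserved space. This is an illustration under assumptions, not the PFS-D implementation. Only “offset next” and the revision number come from the slides. The capacity field, the per-rank extents map and all class and function names are hypothetical. A real client would issue MPI-IO writes (for example `MPI_File_write_at`) instead of the slice assignment.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MetadataTable:
    """Hypothetical metadata table for the reserved checkpoint space."""
    capacity: int          # size of the reserved space in bytes (assumed field)
    offset_next: int = 0   # "offset next": start of the available space
    revision: int = 0      # increased once per completed checkpoint
    extents: dict[int, tuple[int, int]] = field(default_factory=dict)  # rank -> (offset, length)


class ReservedSpace:
    """In-memory stand-in for the reserved space on the data servers."""

    def __init__(self, capacity: int) -> None:
        self.buf = bytearray(capacity)

    def write_at(self, offset: int, data: bytes) -> None:
        # A PFS-D client would call an MPI-IO routine such as MPI_File_write_at here.
        self.buf[offset:offset + len(data)] = data


def checkpoint_write(table: MetadataTable, space: ReservedSpace,
                     chunks: dict[int, bytes]) -> int:
    """Write one checkpoint (rank -> bytes) and return the new revision number."""
    # 1-2. Read the table and start at "offset next".
    offset = table.offset_next
    needed = sum(len(data) for data in chunks.values())
    if offset + needed > table.capacity:
        raise RuntimeError("reserved checkpoint space is exhausted")

    # Give each rank a contiguous, non-overlapping extent.
    extents: dict[int, tuple[int, int]] = {}
    for rank in sorted(chunks):
        extents[rank] = (offset, len(chunks[rank]))
        offset += len(chunks[rank])

    # 3. Perform the writes (sequentially here, one per process in practice).
    for rank, (start, _length) in extents.items():
        space.write_at(start, chunks[rank])

    # 4. Record the new offsets, then 5. increase the revision number.
    table.extents = extents
    table.offset_next = offset
    table.revision += 1
    return table.revision
```

In this sketch the new extents are published and the revision is increased only after every write has completed. A reader that trusts the latest revision therefore never picks up a partially written checkpoint. That ordering is a design assumption of the sketch.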
Checkpoint read:
1 Read the metadata table
2 Get the offset where each piece of checkpoint data is located
3 Call MPI-IO functions to perform the read in parallel
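The read steps can be sketched the same way. In this sketch threads stand in for the processes that read their data back in parallel. The `extents` argument is a hypothetical rank -> (offset, length) view of the metadata table obtained in step 1. A PFS-D client would use an MPI-IO call such as `MPI_File_read_at` rather than slicing a byte string.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor


def checkpoint_read(extents: dict[int, tuple[int, int]], space: bytes) -> dict[int, bytes]:
    """Return rank -> checkpoint bytes, reading all ranks concurrently.

    `extents` stands for the metadata table already read in step 1
    (hypothetical layout: rank -> (offset, length)).
    """

    def read_one(rank: int) -> tuple[int, bytes]:
        # 2. Look up where this rank's data was written.
        offset, length = extents[rank]
        # 3. Read it back; a PFS-D client would use an MPI-IO call such as MPI_File_read_at.
        return rank, bytes(space[offset:offset + length])

    # Worker threads play the role of the processes reading in parallel.
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(read_one, sorted(extents)))


if __name__ == "__main__":
    reserved = b"......ccdd"
    table = {0: (6, 2), 1: (8, 2)}
    print(checkpoint_read(table, reserved))
```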
Setup
Checkpointing Time
Figure: runtime (seconds, y-axis up to 20 s) versus number of clients (16, 32, 64, 128) for PFS-Delegation, Shared-File and File-Per-Process.
Total number of Metadata Operations (number of messages)

Clients   PFS-Delegation   Shared-File   File-Per-Process
16        33               188           608
32        65               348           1034
64        129              669           2376
128       257              1310          4132
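A quick arithmetic check on these totals helps explain the scaling gap. It assumes the twelve chart values group by series in legend order (PFS-Delegation, Shared-File, File-Per-Process). Under that assumption every PFS-Delegation total equals 2n + 1 for n clients, which works out to about two metadata messages per client. Shared-File needs roughly 10 to 12 messages per client, and File-Per-Process roughly 32 to 38.

```python
from __future__ import annotations

# Totals transcribed from the chart; grouping by series is an assumption.
CLIENTS = [16, 32, 64, 128]
TOTALS = {
    "PFS-Delegation": [33, 65, 129, 257],
    "Shared-File": [188, 348, 669, 1310],
    "File-Per-Process": [608, 1034, 2376, 4132],
}


def messages_per_client(series: str) -> list[float]:
    """Metadata messages divided by the number of clients, per client count."""
    return [total / n for n, total in zip(CLIENTS, TOTALS[series])]


def fits_two_n_plus_one(series: str) -> bool:
    """True if every total in the series equals 2 * clients + 1."""
    return all(total == 2 * n + 1 for n, total in zip(CLIENTS, TOTALS[series]))


if __name__ == "__main__":
    for name in TOTALS:
        ratios = ", ".join(f"{r:.1f}" for r in messages_per_client(name))
        print(f"{name:<17} messages per client: {ratios}")
    print("PFS-Delegation equals 2n + 1:", fits_two_n_plus_one("PFS-Delegation"))
```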
Different metadata operations with 128 processes
Figure: number of messages per metadata operation (GETCONFIG, GETATTR, CREATE, LOOKUP, CRDIRENT) for PFS-Delegation, Shared-File and File-Per-Process. Bar values as extracted, in order: 128, 128, 1, 128, 1146, 9, 18, 9, 128, 918, 1122, 813, 1152.
Checkpointing Time
Figure: runtime (seconds, y-axis up to 16 s) versus number of clients (16, 32, 64, 128) for PFS-Delegation, Shared-File and File-Per-Process.
Total number of Metadata Operations (number of messages)

Clients   PFS-Delegation   Shared-File   File-Per-Process
16        81               563           1008
32        161              733           2016
64        321              2397          3830
128       641              3667          8114
Different metadata operations with 128 processes
Figure: number of messages per metadata operation (GETCONFIG, GETATTR, CREATE, LOOKUP, CRDIRENT) for PFS-Delegation, Shared-File and File-Per-Process. Bar values as extracted, in order: 640, 128, 1, 640, 3119, 9, 18, 9, 640, 4500, 1152, 798, 1152.