DMTCP
Transparent Checkpointing for Cluster Computations and the Desktop
Jason Ansel (1), Kapil Arya (2), Gene Cooperman (2)
(1) MIT, (2) Northeastern University
May 26, 2009
Jason Ansel (MIT) DMTCP May 26, 2009 1 / 39
Introduction
Introduction Background
Introduction Motivation
Introduction Related work
Introduction Motivation
Introduction Short Demo
Design and Implementation
Design and Implementation How it works
1. User space memory - read from checkpoint management thread
2. Processor state - hijack user threads and copy to memory
3. Data in network - drained to process memory
4. Kernel state - probing at checkpoint time
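The third item, draining in-flight network data into process memory, can be illustrated with a small sketch. This is not DMTCP code. It assumes one possible scheme, which the slide does not spell out: at checkpoint time the sender writes a marker after its last application byte, and the receiver reads until it sees that marker, so that everything still buffered in the kernel ends up in user-space memory where it can be saved with the rest of the process image. The names `MARKER`, `drain`, and `demo` are hypothetical, and the sketch assumes nothing is sent after the marker.

```python
import socket

# Hypothetical sentinel used only in this sketch; not a DMTCP constant.
MARKER = b"<<drain-marker>>"


def drain(sock, marker=MARKER):
    """Read from sock until the marker arrives; return the bytes before it.

    Assumes the peer sends nothing after the marker (for example because
    its user threads are suspended while the checkpoint is taken).
    """
    buf = bytearray()
    while not buf.endswith(marker):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed before the marker arrived")
        buf.extend(chunk)
    # Everything ahead of the marker was "in the network" and now lives
    # in ordinary process memory.
    return bytes(buf[:-len(marker)])


def demo():
    """Simulate two connected peers inside one process."""
    sender, receiver = socket.socketpair()
    try:
        sender.sendall(b"in-flight payload")  # data not yet read by the receiver
        sender.sendall(MARKER)                # checkpoint time: flush a marker
        return drain(receiver)                # pull buffered data into memory
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print(demo())
```

On restart, data captured this way would have to be put back in flight, for instance by handing it back to the sender to retransmit; that step is outside this sketch.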
Design and Implementation Distributed checkpointing algorithm
Design and Implementation Other features
Faster Smaller
Results
Results Performance trends
Results Benchmarks
Conclusions
Conclusions Final remarks
Conclusions Questions
Backup Slides
[Figure: three bar charts, each comparing Uncompressed and Compressed checkpoints: Checkpoint Time (s), Restart Time (s), and Checkpoint Size (MB). Benchmarks: iPython/Shell[1], iPython/Demo[1], Baseline[2], ParGeant4[2], NAS/CG[2], Baseline[3], NAS/EP[3], NAS/LU[3], NAS/SP[3], NAS/MG[3], NAS/IS[3], NAS/BT[3].]
[Figure: two bar charts for desktop applications: Time (s), showing Checkpoint Time and Restart Time, and Checkpoint Size (MB). Applications: bc, emacs, ghci (Haskell), ghostscript, gnuplot, gst (Smalltalk), lynx, macaulay2, matlab, mzscheme, perl, php, python, ruby, slsh (S-Lang), sqlite, tclsh, tightvnc+twm, vim/cscope.]
[Figure: Checkpoint and Restart series; axes: Time (s) and Total Memory Usage (GB).]