DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop (PowerPoint presentation)


SLIDE 1

DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop

Jason Ansel (MIT), Kapil Arya (Northeastern University), Gene Cooperman (Northeastern University)

May 26, 2009

Jason Ansel (MIT) DMTCP May 26, 2009 1 / 39

SLIDE 2

Introduction

Outline

1. Introduction: Background, Motivation, Related Work, Short Demo
2. Design and Implementation: How It Works, Distributed Checkpointing Algorithm, Other Features
3. Results: Performance Trends, Benchmarks
4. Conclusions: Final Remarks, Questions

SLIDE 3-8

Introduction: Background

What is DMTCP / checkpointing?

We present DMTCP: Distributed MultiThreaded CheckPointing. Checkpointing is taking a snapshot of an application's state so that it can later be restarted. DMTCP is:

distributed - can checkpoint a network of programs connected by sockets
multithreaded - each program can have many threads
transparent - works on unmodified binaries
user-level - the kernel is not modified

SLIDE 9-12

Introduction: Motivation

The traditional motivation for checkpointing

A long-running computation on a large cluster takes 30 days
On day 29... a node crashes. Disaster! Must restart from the beginning
With checkpointing: restart from the last checkpoint
Gives fault tolerance with no programmer support

SLIDE 13-18

Introduction: Motivation

Haven't we heard of checkpointing before?

Surveying existing checkpointing systems:
Most don't work; others have never been released
The difficulty in checkpointing is robustness. Going from checkpointing one application to most applications was a four-year effort, now with about 10 developers

Exception: BLCR
Also works for most applications (though it fails on many of our benchmarks)
Kernel level: can't be bundled with the application, and harder to maintain
Doesn't support sockets; distributed support (with customized MPI libraries) is less robust

SLIDE 19-21

Introduction: Related Work

Related work

Kernel level:
Berkeley Lab Checkpoint/Restart (BLCR) - doesn't support sockets; open source
Zap (from Columbia University) - distributed/multithreaded; closed source, not publicly available

User level:
Deja Vu (from Virginia Tech) - distributed/multithreaded; closed source, not publicly available; reported overheads 97x slower for a benchmark of similar scale
DMTCP (our system) - distributed/multithreaded; open source

SLIDE 22

Introduction: Motivation

Other uses for checkpointing

Fault tolerance
Process migration
Replacement for save/restore workspace
Skip past long startup times
Debugging: the ultimate bug report
Speculative execution

SLIDE 23

Introduction

Short Demo


SLIDE 25-28

Design and Implementation: How It Works

Gaining initial control

Dynamic library injection (LD_PRELOAD) forces the user application to load dmtcphijack.so
A checkpoint manager thread is spawned in each process
Additional forked processes are hijacked recursively
Remote processes (spawned with ssh) are detected and hijacked
The result: our library and a checkpoint manager thread in every user process

SLIDE 29-34

Design and Implementation: How It Works

Saving program state

1. User space memory - read from the checkpoint manager thread
2. Processor state - hijack user threads and copy registers to memory
3. Data in the network - drained into process memory
4. Kernel state - probed at checkpoint time:
   Memory maps - /proc filesystem
   File descriptors (files) - /proc filesystem, fstat, etc.
   File descriptors (sockets, pipes, pts, etc.) - /proc filesystem, getsockopt, wrappers around creation functions
   Other information (signal handlers, etc.) - POSIX API
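The kernel-state probing step can be sketched as follows. This assumes a Linux /proc filesystem (as DMTCP does) and falls back to an empty result elsewhere; the function name is illustrative:

```python
import os

def probe_kernel_state():
    """Sketch of checkpoint-time probing: enumerate memory maps and
    open file descriptors from /proc, as DMTCP does on Linux."""
    state = {"maps": [], "fds": {}}
    try:
        with open("/proc/self/maps") as f:
            state["maps"] = [line.split()[0] for line in f]  # address ranges
        for fd in os.listdir("/proc/self/fd"):
            try:
                state["fds"][int(fd)] = os.readlink(f"/proc/self/fd/{fd}")
            except OSError:
                pass  # fd may have vanished (e.g., the listdir handle)
    except FileNotFoundError:
        pass  # no /proc: nothing to probe in this sketch
    return state

state = probe_kernel_state()
print(sorted(state))
```

Real DMTCP combines this with getsockopt and its own wrapper bookkeeping, since /proc alone does not reveal a socket's endpoints.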

SLIDE 35

Design and Implementation: Distributed Checkpointing Algorithm

Our checkpointing algorithm

A distributed algorithm; the only global communication is a barrier
Coordinated / "stop the world" style checkpointing
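The barrier-separated phase structure can be sketched with threads standing in for processes (illustrative only; DMTCP's barriers span processes on multiple nodes via its coordinator, and the phase names below paraphrase the slides):

```python
import threading

PHASES = ["suspend", "elect leaders", "drain sockets",
          "write checkpoint", "refill sockets", "resume"]

NUM_PROCESSES = 4                 # stand-ins for processes A-D
barrier = threading.Barrier(NUM_PROCESSES)
log = []                          # (phase, pid) completion records
lock = threading.Lock()

def manager_thread(pid):
    """Each checkpoint-manager thread runs the same phase sequence;
    the barrier guarantees no one starts phase k+1 until every
    process has finished phase k."""
    for phase in PHASES:
        # ... per-process work for this phase would happen here ...
        with lock:
            log.append((phase, pid))
        barrier.wait()            # the only global communication

threads = [threading.Thread(target=manager_thread, args=(p,))
           for p in range(NUM_PROCESSES)]
for t in threads: t.start()
for t in threads: t.join()

# Every "suspend" entry precedes every "resume" entry.
first_resume = min(i for i, (ph, _) in enumerate(log) if ph == "resume")
last_suspend = max(i for i, (ph, _) in enumerate(log) if ph == "suspend")
print(last_suspend < first_resume)  # → True
```

The barrier is what makes the "stop the world" guarantee hold: no process can checkpoint while another is still running user code.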

SLIDE 36-49

Design and Implementation: Distributed Checkpointing Algorithm

Checkpointing algorithm, by example

1. Running normally; wait for a checkpoint to begin
2. Suspend user threads; barrier
3. Elect shared-resource leaders; barrier
4. Drain socket data; barrier
5. Perform single-process checkpointing; barrier
6. Refill socket data; barrier
7. Resume user threads; running normally again

[Diagram on each slide: Processes A-D across Nodes 1-3, connected by sockets including one shared socket; legend: DMTCP Control, User Control, Socket Data]
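The drain and refill steps can be sketched with a socketpair. This is a simplified stand-in: DMTCP does this in C across real TCP connections, and refills on the receiver's side rather than by re-sending:

```python
import socket

# In-flight data must be saved at checkpoint time and re-injected at
# restart; otherwise bytes sitting in kernel buffers would be lost.
a, b = socket.socketpair()
a.sendall(b"in-flight data")      # data now buffered in the kernel

# Drain: read everything buffered on the receiving side into user memory.
b.setblocking(False)
drained = b""
try:
    while True:
        chunk = b.recv(4096)
        if not chunk:
            break
        drained += chunk
except BlockingIOError:
    pass                          # kernel buffer empty: fully drained

# ... the single-process checkpoint happens here; `drained` is part of it ...

# Refill: push the saved bytes back so the application sees them on restart.
a.sendall(drained)
b.setblocking(True)
print(b.recv(4096))               # → b'in-flight data'
```

Draining before the checkpoint is what lets the image capture a consistent cut: after the drain barrier, no checkpointed state is hiding in the kernel's socket buffers.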

SLIDE 50-62

Design and Implementation: Distributed Checkpointing Algorithm

Restart algorithm, by example

1. Start with nothing (possibly on different nodes)
2. Start a restart process on each node
3. Recreate files, sockets, etc.
4. Fork user processes
5. Rearrange FDs to match each user process
6. Restore memory/threads
7. Continue as if after a checkpoint

[Diagram on each slide: restart processes on Nodes 1-3 recreating Processes A-D and their sockets; legend: DMTCP Control, User Control, Socket Data]
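The FD-rearrangement step rests on POSIX descriptor duplication. A minimal sketch of moving a recreated resource to the descriptor number the user process expects (the target number here is made up; in reality it comes from the checkpoint image):

```python
import os, tempfile

# At restart, a recreated resource may come back on the "wrong" fd
# number; dup2() moves it to the number recorded in the checkpoint.
f = tempfile.TemporaryFile()
current_fd = f.fileno()
TARGET_FD = 42                    # fd number from the checkpoint (illustrative)

os.dup2(current_fd, TARGET_FD)    # TARGET_FD now refers to the same file
os.write(TARGET_FD, b"restored")
os.lseek(TARGET_FD, 0, os.SEEK_SET)
print(os.read(TARGET_FD, 8))      # → b'restored'
os.close(TARGET_FD)
```

Because the restored user code addresses resources only by descriptor number, getting the numbers right is all that is needed for it to continue obliviously.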

SLIDE 63-66

Design and Implementation: Other Features

Other features supported by DMTCP

Threads, mutexes/semaphores, fork, exec, ssh
Shared memory (between processes)
TCP/IP sockets, UNIX domain sockets, pipes
Pseudo terminals, terminal modes, ownership of controlling terminals
Signals and signal handlers
I/O (including the readline library), shared fds
Parent-child process relationships; process id and thread id virtualization; session and process group ids
Syslogd, vdso
Address space randomization, exec shield
Checkpoint image compression, forked checkpointing
...
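Process-id virtualization, one of the features above, can be sketched as a translation table between checkpoint-time and restart-time pids. The table contents and function names here are illustrative, not DMTCP's actual data structures:

```python
# Original pid (recorded at checkpoint) -> current pid (after restart).
# DMTCP keeps such a table and translates pids inside wrapped syscalls
# (kill, waitpid, getppid, ...) so the application keeps seeing the
# original numbers across a restart.
pid_table = {1234: 9876, 1235: 9877}        # made-up pids

def virtual_to_real(pid):
    """App-visible pid -> pid the kernel currently knows."""
    return pid_table.get(pid, pid)          # unknown pids pass through

def real_to_virtual(pid):
    """Kernel pid -> pid the app believes it is talking to."""
    inverse = {real: virt for virt, real in pid_table.items()}
    return inverse.get(pid, pid)

# The app calls kill(1234, sig); the wrapper would actually signal 9876.
print(virtual_to_real(1234))                # → 9876
print(real_to_virtual(9877))                # → 1235
```

The same table idea extends to thread ids and session/process-group ids listed on the slide.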

SLIDE 67-73

Design and Implementation: Other Features

Pseudo terminals

Example execution:
Process 1 opens /dev/ptmx
Process 1 calls ptsname() on the FD, which returns the string "/dev/pts/7"
The string is copied and shared...
At restart time /dev/pts/7 is in use! Problem: we can't change the string hidden in user memory

Solution: virtualize in a sneaky way
ptsname() returns /tmp/unique
/tmp/unique is a symlink to /dev/pts/7
At restart time we can redirect /tmp/unique to an available device
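The symlink trick can be sketched as follows. All paths here are stand-ins created in a temp directory (the real scheme links a uniquely named path to the actual /dev/pts device):

```python
import os, tempfile

workdir = tempfile.mkdtemp()
link = os.path.join(workdir, "unique")      # stand-in for /tmp/unique

# Before checkpoint: ptsname() hands the app this link, which points
# at the device the kernel actually allocated.
old_device = os.path.join(workdir, "pts7")  # stand-in for /dev/pts/7
open(old_device, "w").close()
os.symlink(old_device, link)

# At restart the old device may be taken; retarget the link without
# touching the path string the application already holds in memory.
new_device = os.path.join(workdir, "pts9")  # stand-in for a free device
open(new_device, "w").close()
os.remove(link)
os.symlink(new_device, link)

print(os.path.realpath(link) == os.path.realpath(new_device))  # → True
```

The point of the indirection: the string in user memory never changes, only what it resolves to.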

SLIDE 74-76

Design and Implementation: Other Features

Checkpoint image compression

Three checkpointing modes, trading time for space (normal is fastest, compressed is smallest):

1. Uncompressed (normal) checkpoints
2. Compressed checkpoints: calls "gzip --fast" as a filter; on our distributed benchmarks, 2.1x to 28.0x (mean 7.3x) compression
3. Forked checkpointing: completed in parallel with the user application
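The compressed mode pipes the checkpoint image through gzip at its fastest setting; a sketch using Python's gzip module with compresslevel=1, the equivalent of "gzip --fast" (the sample "image" bytes are made up):

```python
import gzip

# Stand-in for a checkpoint image: process memory is typically very
# compressible (zeroed pages, repeated structures), which is why the
# slides report 2.1x-28.0x ratios.
image = b"\x00" * 4096 + b"checkpoint payload " * 100

compressed = gzip.compress(image, compresslevel=1)   # "gzip --fast"
assert len(compressed) < len(image)                  # smaller on compressible data

# Restart must recover the exact original bytes.
assert gzip.decompress(compressed) == image
print(len(image), len(compressed))
```

Level 1 trades ratio for speed, matching the slide's point that compression time, not disk time, then dominates the checkpoint.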


SLIDE 78

Results: Performance Trends

Time vs. number of nodes

[Plot: checkpoint and restart times (s) for ParGeant4 against 16-128 compute processes]

Compression enabled. ParGeant4 benchmark, 4 nodes through 32 nodes × 4 cores per node.

SLIDE 79-81

Results: Performance Trends

What controls checkpoint time?

With compression: time(checkpoint) ≈ time(gzip memory), in parallel across the cluster
Without compression: dominated by writing to disk

Stage                  Compressed  Uncompressed
Suspend user threads   0.02        0.03
Elect FD leaders       0.00        0.00
Drain kernel buffers   0.10        0.10
Write checkpoint       3.94        0.63
Refill kernel buffers  0.00        0.00
Total                  4.07        0.76

Times in seconds. NAS/MG benchmark with 32 compute processes on 8 nodes.

SLIDE 82-83

Results: Benchmarks

Benchmarks overview

Distributed benchmarks (10 benchmarks): run on a 32-node (128-core) cluster
Single node benchmarks (20 benchmarks): run on an 8-core machine; some, not all, are multithreaded/multiprocess

SLIDE 84-88

Results: Benchmarks

Distributed benchmarks

Based on sockets directly:
iPython/Shell and iPython/Demo - parallel/distributed Python shell

Run using MPICH2:
Baseline
ParGeant4 - a million-line C++ toolkit for simulating particle-matter interaction
NAS NPB2.4: CG (Conjugate Gradient)

Run using OpenMPI:
Baseline
NAS NPB2.4: BT (Block Tridiagonal), SP (Scalar Pentadiagonal), EP (Embarrassingly Parallel), LU (Lower-Upper Symmetric Gauss-Seidel), MG (Multi Grid), and IS (Integer Sort)

SLIDE 89

Results: Benchmarks

Single node benchmarks

Scripting languages:
BC - an arbitrary precision calculator language
GHCi - the Glasgow Haskell Compiler interactive environment
Ghostscript - PostScript and PDF language interpreter
GNUPlot - an interactive plotting program
GST - the GNU Smalltalk virtual machine
Macaulay2 - a system supporting research in algebraic geometry and commutative algebra
MATLAB - a high-level language and interactive environment for technical computing
MZScheme - the PLT Scheme implementation
OCaml - the Objective Caml interactive shell
Octave - a high-level interactive language for numerical computations
PERL - Practical Extraction and Report Language interpreter


slide-93
SLIDE 93

Results Benchmarks

Single node benchmarks (continued)

Scripting languages (continued):

PHP – an HTML-embedded scripting language
Python – an interpreted, interactive, object-oriented programming language
Ruby – an interpreted object-oriented scripting language
SLSH – an interpreter for S-Lang scripts
tclsh – a simple shell containing the Tcl interpreter

Other programs:

Emacs – a well-known text editor
vim/cscope – interactively examine a C program
Lynx – a command-line web browser
SQLite – a command-line interface for the SQLite database
tightvnc/twm – headless X server and window manager
RunCMS – simulation of the CMS experiment at LHC/CERN

Jason Ansel (MIT) DMTCP May 26, 2009 24 / 39


slide-95
SLIDE 95

Results Benchmarks

RunCMS Benchmark


Developed at CERN
Simulates the CMS experiment of the Large Hadron Collider (LHC)
2 million lines of code
700 dynamic libraries
12-minute startup time

Checkpoint time (with compression) is 25.2 seconds
Restart time is 18.4 seconds
680 MB memory image, compressed to 225 MB

Jason Ansel (MIT) DMTCP May 26, 2009 25 / 39

slide-96
SLIDE 96

Conclusions

Outline

1. Introduction: Background, Motivation, Related work, Short Demo
2. Design and Implementation: How it works, Distributed checkpointing algorithm, Other features
3. Results: Performance trends, Benchmarks
4. Conclusions: Final remarks, Questions

Jason Ansel (MIT) DMTCP May 26, 2009 26 / 39


slide-100
SLIDE 100

Conclusions Final remarks

Future work

Integration with Condor

Condor is a groundbreaking process migration system
Based on its own single-process checkpointer

Requires relinking
Doesn't support threads, multiple processes, mmap, etc.

DMTCP will remove these limitations
Hope to release an experimental beta version by the end of summer

DMTCP as a save/restore workspace feature in SCIRun

Computational workbench
Visual programming
For modelling, simulation, and visualization
Millions of lines of code

Improving support for X windows applications

Jason Ansel (MIT) DMTCP May 26, 2009 27 / 39

slide-101
SLIDE 101

Conclusions Final remarks

Special thanks/credit goes to...

MTCP (our single-process component):

Michael Rieker

Colleagues at U Wisconsin (integration with Condor):

Peter Keller and others

Colleagues at CERN (help with runCMS, ParGeant4):

John Apostolakis, Giulio Eulisse, Lassi Tuura, and others

Other DMTCP developers / contributors:

Alex Brick, Tyler Denniston, Xin Dong, Daniel Kunkle, Artem Polyakov, Praveen Solanki, and Ana-Maria Visan

Jason Ansel (MIT) DMTCP May 26, 2009 28 / 39

slide-102
SLIDE 102

Conclusions Questions

For more information

Source code (LGPL), documentation, other publications: http://dmtcp.sourceforge.net/
Questions?

Jason Ansel (MIT) DMTCP May 26, 2009 29 / 39

slide-103
SLIDE 103

Conclusions Questions

Thank you

Jason Ansel (MIT) DMTCP May 26, 2009 30 / 39

slide-104
SLIDE 104

Backup Slides

Backup Slides

Jason Ansel (MIT) DMTCP May 26, 2009 31 / 39


slide-107
SLIDE 107

Backup Slides

Usage

1. Start your program under DMTCP: dmtcp_checkpoint [options] <program>
   For example:
   dmtcp_checkpoint mpdboot -n 32
   dmtcp_checkpoint mpirun -np 32 hellompi
2. Request a checkpoint: dmtcp_command --checkpoint
3. Restart: ./dmtcp_restart_script.sh

Jason Ansel (MIT) DMTCP May 26, 2009 32 / 39

slide-108
SLIDE 108

Backup Slides

MultiThreaded CheckPointing (MTCP)

MTCP is our single-process checkpointing component
Separate/modular so that it can be swapped out (when porting)
Requires its own talk to properly describe
See our past publication: "Transparent User-Level Checkpointing for the Native POSIX Thread Library for Linux," Michael Rieker, Jason Ansel, and Gene Cooperman.
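A single-process checkpointer like MTCP saves and later restores the process's memory regions, which on Linux are enumerated in /proc/&lt;pid&gt;/maps. As a minimal illustrative sketch (not MTCP's actual data structures or parser), here is how one record in that format can be decoded; the sample libc line is made up for the example:

```python
import re
from dataclasses import dataclass

@dataclass
class MapRegion:
    start: int   # region start address
    end: int     # region end address (exclusive)
    perms: str   # e.g. "r-xp"
    path: str    # backing file, or "" for anonymous mappings

# One line in the format used by /proc/<pid>/maps on Linux:
# start-end perms offset dev inode [path]
_MAPS_RE = re.compile(
    r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S{4})\s+\S+\s+\S+\s+\S+\s*(.*)$")

def parse_maps_line(line: str) -> MapRegion:
    m = _MAPS_RE.match(line.strip())
    if m is None:
        raise ValueError("not a maps line: " + line)
    return MapRegion(int(m.group(1), 16), int(m.group(2), 16),
                     m.group(3), m.group(4))

# Illustrative sample line:
region = parse_maps_line(
    "7f2c4e000000-7f2c4e021000 r-xp 00000000 08:01 131112  /lib/libc-2.7.so")
```

At checkpoint time, each writable region's bytes go into the image together with records like this; at restart, mmap recreates each region at its original address before the bytes are copied back.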

Jason Ansel (MIT) DMTCP May 26, 2009 33 / 39

slide-109
SLIDE 109

Backup Slides

Distributed benchmark timings

[Charts: checkpoint time (s), restart time (s), and checkpoint size (MB), uncompressed vs. compressed, for iPython/Shell, iPython/Demo, ParGeant4 and NAS/CG (MPICH2), and NAS EP/LU/SP/MG/IS/BT (OpenMPI), each with its baseline.]

Jason Ansel (MIT) DMTCP May 26, 2009 34 / 39

slide-110
SLIDE 110

Backup Slides

Single node benchmark performance

[Charts: checkpoint time, restart time (s), and checkpoint size (MB) for bc, emacs, ghci (Haskell), ghostscript, gnuplot, gst (Smalltalk), lynx, macaulay2, matlab, mzscheme, ocaml, octave, perl, php, python, ruby, slsh (S-Lang), sqlite, tclsh, tightvnc+twm, and vim/cscope.]

Jason Ansel (MIT) DMTCP May 26, 2009 35 / 39

slide-111
SLIDE 111

Backup Slides

Experimental Setup

Distributed (cluster) tests:

32-node cluster, 4 cores per node (128 total cores)
Dual-socket, dual-core Xeon 5130
8 or 16 GB RAM per node
64-bit Red Hat Enterprise 4, Linux 2.6.9

Single node tests:

8 cores: dual-socket, quad-core Xeon E5320
8 GB RAM
64-bit Debian “sid”, Linux 2.6.28

DMTCP has been tested on:

Ubuntu, Debian, OpenSUSE, Fedora, RHEL, ...
Linux 2.6.9 and up
x86, x86_64

Jason Ansel (MIT) DMTCP May 26, 2009 36 / 39

slide-112
SLIDE 112

Backup Slides

Our checkpoint algorithm

The checkpoint management thread, in each user process, performs the following:

1. Wait for the checkpoint to begin
2. Hijack and suspend user threads
3. Node-local elections for shared resources
4. Drain sockets to process memory
5. Single-process checkpointing
6. Refill sockets
7. Resume user threads
8. Go to step 1

A cluster-wide barrier separates each step from the next.
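Steps 4 and 6 of the list above can be sketched at small scale. This is not DMTCP's implementation (DMTCP drains each connection cooperatively over the network until a marker token arrives); here a local socketpair stands in for a network connection, in-flight bytes are drained into process memory before the checkpoint, and refilled into a fresh connection at restart:

```python
import socket

def drain(sock):
    """Pull any in-flight bytes out of the kernel buffer into memory."""
    sock.setblocking(False)
    buf = b""
    while True:
        try:
            chunk = sock.recv(4096)
        except BlockingIOError:
            break  # kernel buffer is empty
        if not chunk:
            break
        buf += chunk
    sock.setblocking(True)
    return buf

# "Before checkpoint": the peer has sent data still sitting in the kernel.
a, b = socket.socketpair()
a.sendall(b"in-flight payload")
saved = drain(b)          # checkpoint: socket buffer now lives in memory
a.close(); b.close()

# "After restart": a new connection replaces the old one; the saved
# bytes are refilled so the reader observes no loss.
a2, b2 = socket.socketpair()
a2.sendall(saved)         # refill the kernel buffer
restored = b2.recv(4096)
a2.close(); b2.close()
```

Because the drained bytes travel inside the checkpoint image, the receiving end of each connection sees an unbroken byte stream across checkpoint and restart.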

Jason Ansel (MIT) DMTCP May 26, 2009 37 / 39

slide-113
SLIDE 113

Backup Slides

Our restart algorithm

Initially, one restart process per node, in each restart process:

1. Restore files, ptys, and other single-process FDs
2. Reconnect sockets using a cluster-wide discovery service
3. Fork into user processes
4. Rearrange FDs for each process
5. Restore each process's memory and threads
6. Continue with the final steps of the checkpoint algorithm: refill kernel buffers, resume user threads
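The "rearrange FDs for each process" step is essentially repeated dup2 calls: after forking, each restored process moves its descriptors back to the numbers the original process was using, since application code refers to them by number. A minimal sketch; the target number 17 is arbitrary, chosen only for illustration:

```python
import os

def move_fd(fd, target):
    """Place fd at descriptor number `target`, closing the original."""
    if fd != target:
        os.dup2(fd, target)   # dup2 atomically closes target first if open
        os.close(fd)
    return target

# Restore a pipe's read end to the descriptor number recorded at checkpoint.
r, w = os.pipe()
r = move_fd(r, 17)
os.write(w, b"hello")
data = os.read(17, 5)         # the pipe is usable at its restored number
os.close(17)
os.close(w)
```

dup2 is the natural primitive here because it is atomic: if the target number is already occupied, it is closed and reused in one step, so no other descriptor can race into that slot.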

Jason Ansel (MIT) DMTCP May 26, 2009 38 / 39

slide-114
SLIDE 114

Backup Slides

Varying memory usage

[Plot: checkpoint and restart times (s) vs. total memory usage (1–8 GB).]

Checkpoint time is dominated by writing checkpoints to disk. Compression is disabled. A synthetic program on 32 nodes.

Jason Ansel (MIT) DMTCP May 26, 2009 39 / 39