

slide-1
SLIDE 1

Partially supported by

Presenter: Box Leangsuksun, SWEPCO Endowed Professor*, Computer Science, Louisiana Tech University, box@latech.edu

  • S. Laosooksathit, N. Naksinehaboon, Louisiana Tech University
  • A. Dhungana, Amir Fabin, U of Texas, Arlington
  • K. Chanchio, Thammasat Univ
  • C. Chandler, Louisiana Tech U

4th HPCVirt workshop, Paris, France, April 13, 2010

slide-2
SLIDE 2

 Motivations
 Background: VCCP
 GPU checkpoint protocols: Memcopy vs. simpleStream
 CheCUDA (related work)
 GPU checkpoint protocols: CUDA Streams
 Restart protocols
 Scheduling model and analysis
 Conclusion


slide-3
SLIDE 3

 More attention on GPUs
 ORNL-NVIDIA 10 petaflop machine
 Large-scale GPU cluster -> fault tolerance for GPU applications

  • A normal checkpoint doesn’t help GPU applications when a failure occurs.
  • GPU execution isn’t saved when the checkpoint is taken on the CPU


slide-4
SLIDE 4

 High transparency

  • Checkpoint/restart mechanisms should be transparent to applications, OS, and runtime environments; no modification required

 Efficiency

  • Checkpoint/restart mechanisms should not generate unacceptable overheads

[Timeline legend: Normal Execution, Communication, Checkpointing Delay]


slide-5
SLIDE 5

[Diagram: run apps/OS unmodified; FIFO, reliable checkpoint/restart protocols]

slide-6
SLIDE 6
  • 1. Pause VM computation
  • 2. Flush messages out of the network
  • 3. Locally save the state of every VM
  • 4. Continue computation
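The four-step round above can be sketched as a toy coordinator loop. Plain Python stands in for the VM and network layers; the class and method names (`VM`, `pause`, `flush`, `save`, `resume`) are illustrative, not the VCCP API:

```python
# Toy sketch of one coordinated checkpoint round: pause every VM,
# drain in-flight messages, save state locally, then resume.
class VM:
    def __init__(self, name):
        self.name = name
        self.inbox = []          # stand-in for the network channel
        self.running = True
        self.saved_state = None

    def pause(self):
        self.running = False

    def flush(self):
        # drain every in-flight message before saving
        drained, self.inbox = self.inbox, []
        return drained

    def save(self):
        # locally save VM state plus the drained message buffer
        self.saved_state = {"name": self.name, "buffer": self.flush()}

    def resume(self):
        self.running = True

def checkpoint_round(vms):
    for vm in vms:           # 1. pause computation
        vm.pause()
    for vm in vms:           # 2 + 3. flush channels, save state & buffer
        vm.save()
    for vm in vms:           # 4. continue computation
        vm.resume()

nodes = [VM("compute01"), VM("compute02")]
nodes[0].inbox.append("msg-in-flight")
checkpoint_round(nodes)
print(nodes[0].saved_state["buffer"])  # ['msg-in-flight']
```

Saving the drained buffer together with the VM state is what keeps the global checkpoint consistent: no message is lost between the paused VMs.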


slide-7
SLIDE 7

VCCP checkpoint protocol

[Diagram: the Head node signals compute01 and compute02 to save]


slide-8
SLIDE 8

VCCP checkpoint protocol

[Diagram: Head, compute01, and compute02 flush the communication channels; once each channel is empty, every node saves its VM & buffer]


slide-9
SLIDE 9

VCCP checkpoint protocol

[Diagram: each node reports its result; on success the Head sends resume and all nodes continue]


slide-10
SLIDE 10

 Published in IEEE Cluster 2009
 Average overhead: 12%
 Provides transparent checkpoint/restart


slide-11
SLIDE 11

1. Device initialization
2. Device memory allocation
3. Copy data to device memory
4. Execute kernel (calling a __global__ function)
5. Copy data from device memory (retrieve results)

 Issues: round-trip latency of data movement

[Diagram: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid contains blocks, e.g. Block (0, 0) … Block (2, 1), and each block contains threads, e.g. Thread (0, 0) … Thread (4, 2)]
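The five host-side steps can be sketched as follows. Plain Python lists stand in for device memory, since the real calls (`cudaMalloc`, `cudaMemcpy`, a `kernel<<<grid, block>>>` launch) need a GPU:

```python
# Simulated host-side CUDA flow (step 1, device initialization,
# is implicit here; real code would query and select a device).
def run_on_device(host_data, kernel):
    device_mem = [0] * len(host_data)   # 2. device memory allocation
    device_mem[:] = host_data           # 3. copy data to device memory
    for i in range(len(device_mem)):    # 4. execute kernel: one "thread"
        device_mem[i] = kernel(device_mem[i])   #    per element
    return list(device_mem)             # 5. copy results back to host

out = run_on_device([1, 2, 3, 4], lambda x: x * x)
print(out)  # [1, 4, 9, 16]
```

Steps 3 and 5 are the round trips whose latency the slide flags as the issue: every checkpoint that uses plain memcopy pays step 5 again.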

slide-12
SLIDE 12

 Long-running GPU applications
 (Relatively) high failure rate in a large-scale GPU cluster in an MPI & GPU environment
 Save GPU software state
 Move data back from the GPU with low latency

  • Memcopy (pauses the GPU) vs. simpleStream (concurrency)


slide-13
SLIDE 13

 “CheCUDA: A Checkpoint/Restart Tool for CUDA Applications” by H. Takizawa, K. Sato, K. Komatsu, and H. Kobayashi
 A prototype add-on package for BLCR for GPU checkpointing
 Memcopy approach


slide-14
SLIDE 14

[Diagram: GPU checkpointing (1) followed by CPU checkpointing / migration (2)]


slide-15
SLIDE 15

[Timeline: Process starts → H-D memory copy → Kernel starts → Syncthread() → GPU checkpoint (the GPU checkpoint duration overlaps the CPU checkpoint / migration) → Kernel completes → D-H memory copy → Process ends]

slide-16
SLIDE 16

1. Copy all the user data in the device memory to the host memory
2. Write the current status of the application and the user data to a checkpoint file


slide-17
SLIDE 17

1. Read the checkpoint file
2. Initialize the GPU and recreate CUDA resources
3. Send the user data back to the device memory
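The checkpoint steps on the previous slide and the restart steps above can be sketched together. Python `pickle` stands in for the checkpoint file, and lists simulate device memory; in the real tool the device-to-host copy is a CUDA memcopy:

```python
import os
import pickle
import tempfile

def checkpoint(path, app_status, device_data):
    host_copy = list(device_data)              # 1. device -> host copy
    with open(path, "wb") as f:                # 2. write status + user data
        pickle.dump({"status": app_status, "data": host_copy}, f)

def restart(path):
    with open(path, "rb") as f:                # 1. read the checkpoint file
        ckpt = pickle.load(f)
    device_data = []                           # 2. reinit GPU, recreate resources
    device_data[:] = ckpt["data"]              # 3. send user data back to device
    return ckpt["status"], device_data

ckpt_path = os.path.join(tempfile.gettempdir(), "gpu.ckpt")
checkpoint(ckpt_path, "iter-42", [3.0, 1.0, 4.0])
status, data = restart(ckpt_path)
print(status, data)  # iter-42 [3.0, 1.0, 4.0]
```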


slide-18
SLIDE 18

 Transferring data from device to host = overhead

  • Must pause GPU computation until the copy is completed

 SimpleStream

  • Uses latency hiding (streams) to reduce the overhead
  • CUDA streams = overlap memory copy and kernel execution
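The latency-hiding idea, keep the kernel running in one stream while another stream copies the checkpoint data out, can be illustrated with two worker threads standing in for CUDA streams (`compute_stream` / `copy_stream` are illustrative names; the real mechanism is `cudaMemcpyAsync` on separate `cudaStream_t` handles):

```python
import threading
import time

events = []

def compute_stream():
    # the kernel keeps executing while the checkpoint copy runs
    for step in range(3):
        time.sleep(0.05)
        events.append(("compute", step))

def copy_stream():
    # asynchronous device->host copy of the checkpoint image
    time.sleep(0.005)
    events.append(("copy_done", None))

t1 = threading.Thread(target=compute_stream)
t2 = threading.Thread(target=copy_stream)
t1.start(); t2.start()
t1.join(); t2.join()

# the copy finished while compute was still in flight: overlap, not a pause
print(events.index(("copy_done", None)) < len(events) - 1)  # True
```

With the memcopy approach the copy would instead sit between two compute phases, stalling the GPU for the whole transfer.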


slide-19
SLIDE 19

[Flowchart: Process starts → H-D memory copy → Kernel starts → code analysis → Syncthread() → "after the sync point, OVERWRITE?" (YES / NO) → GPU checkpoint duration with CPU checkpoint / migration → Kernel completes → D-H memory copy → Process ends]


slide-20
SLIDE 20

[Flowchart detail: after Syncthread(), code analysis asks whether the data is overwritten after the sync point (YES / NO)]


slide-21
SLIDE 21

[Flowchart, NO branch: the sync-point image is not overwritten, so the CPU checkpoint / migration runs during the GPU checkpoint duration → Kernel completes → D-H memory copy → Process ends]


slide-22
SLIDE 22

[Flowchart, YES branch: the sync-point image would be overwritten, so the image is duplicated inside the GPU; the CPU checkpoint / migration then runs during the GPU checkpoint duration → Kernel completes → D-H memory copy → Process ends]


slide-23
SLIDE 23

 Restart the CPU
 Transfer the last GPU checkpoint back to the CPU
 Recreate the CUDA context from the checkpoint file
 Restart the kernel execution from the marked synchronization point


slide-24
SLIDE 24

 GPU checkpoint after a thread synchronization
 NOT after every thread synchronization

 QUESTION:

  • At which thread synchronization should a checkpoint be invoked?

 FACTORS:

  • GPU checkpoint overhead
  • Chance of a failure occurrence


slide-25
SLIDE 25

Perform the checkpoint:

[Equation garbled in extraction: expected wasted time Ĉ at the j-th of m synchronization points, in terms of the checkpoint overhead O and the failure probability P_f]


slide-26
SLIDE 26

Skip the checkpoint:

[Equation garbled in extraction]

Perform the checkpoint:

[Equation garbled in extraction; the rule compares the expected wasted time Ĉ of the two choices in terms of the checkpoint overhead O and the failure probability P_f]


slide-27
SLIDE 27

 Simulate failures & the wasted time

  • total checkpoint overhead + re-computation due to a failure

 Overhead

  • Non-stream: 10 milliseconds – 3 seconds
  • Streams: negligible

 MTTF: 12 hours – 7 days
 Thread sync interval: 10 and 30 minutes
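The simulated quantity can be sketched as a small expected-waste calculator. The parameter values mirror the ranges on the slide, but the exponential failure model and the half-interval re-computation assumption are mine, not taken from the slide:

```python
import math

def expected_waste_per_interval(mttf_s, overhead_s, sync_interval_s):
    """Expected wasted seconds per sync interval when checkpointing at
    every sync point: the overhead is always paid, and a failure inside
    the interval (prob. 1 - exp(-t/MTTF), assuming exponential failures)
    re-does on average half the interval's work."""
    p_fail = 1.0 - math.exp(-sync_interval_s / mttf_s)
    return overhead_s + p_fail * (sync_interval_s / 2.0)

# MTTF 12 h, non-stream (memcopy) overhead 3 s, sync interval 10 min
w_mem = expected_waste_per_interval(12 * 3600, 3.0, 600)
# streams: negligible checkpoint overhead
w_stream = expected_waste_per_interval(12 * 3600, 0.0, 600)
print(round(w_mem - w_stream, 2))  # 3.0 (the stream variant saves the overhead)
```

Under this toy model the two variants differ only by the overhead term, which is the pattern the results slides show: negligible difference for small transfers, a growing gap as the memcopy overhead grows.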


slide-28
SLIDE 28

[Charts: wasted time for thread sync intervals of 10 mins and 30 mins]

slide-29
SLIDE 29

[Charts: wasted time for thread sync intervals of 10 mins and 30 mins]

slide-30
SLIDE 30

[Charts: wasted time against MTTFs and against overheads]

slide-31
SLIDE 31

 GPU checkpointing with streams to reduce overhead
 Non-stream and stream checkpoints differ insignificantly when the data transfer is insignificant
 BUT the stream checkpoint potentially performs better when the memcopy checkpoint overhead is larger.


slide-32
SLIDE 32

 Implement the GPU checkpoint/restart mechanism
 Work on other checkpoint protocols
 Include GPU process migration
