

slide-1
SLIDE 1

Partially supported by

Presenter: Box Leangsuksun, SWEPCO Endowed Professor*, Computer Science, Louisiana Tech University, box@latech.edu

  • S. Laosooksathit, N. Naksinehaboon, Louisiana Tech University
  • A. Dhungana, Amir Fabin, U of Texas, Arlington
  • K. Chanchio, Thammasat Univ
  • C. Chandler, Louisiana Tech U

4th HPCVirt workshop, Paris, France, April 13, 2010

slide-2
SLIDE 2

 Motivations
 Background: VCCP
 GPU checkpoint protocols: Memcopy vs. simpleStream
 CheCUDA (related work)
 GPU checkpoint protocols: CUDA Streams
 Restart protocols
 Scheduling model and analysis
 Conclusion


slide-3
SLIDE 3

 More attention on GPUs
 ORNL-NVIDIA 10 petaflop machine
 Large-scale GPU cluster -> fault tolerance for GPU applications

  • A normal checkpoint doesn’t help GPU applications when a failure occurs.
  • GPU execution isn’t saved when the checkpoint is taken on the CPU


slide-4
SLIDE 4

 High transparency

  • Checkpoint/restart mechanisms should be transparent to applications, OS, and runtime environments; no modification required

 Efficiency

  • Checkpoint/restart mechanisms should not generate unacceptable overheads

[Timeline legend: Normal Execution, Communication, Checkpointing Delay]


slide-5
SLIDE 5

[Diagram: run apps/OS unmodified; FIFO, reliable checkpoint/restart protocols]

slide-6
SLIDE 6
  • 1. Pause VM computation
  • 2. Flush messages out of the network
  • 3. Locally save the state of every VM
  • 4. Continue computation
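The four-step round above can be sketched as a toy coordinator loop. Plain Python stands in for the VM and network layers; the class and method names (`VM`, `pause`, `flush`, `save`, `resume`) are illustrative, not the VCCP API:

```python
# Toy sketch of one coordinated checkpoint round: pause every VM,
# drain in-flight messages, save state locally, then resume.
class VM:
    def __init__(self, name):
        self.name = name
        self.inbox = []          # stand-in for the network channel
        self.running = True
        self.saved_state = None

    def pause(self):
        self.running = False

    def flush(self):
        # drain every in-flight message before saving
        drained, self.inbox = self.inbox, []
        return drained

    def save(self):
        # locally save VM state plus the drained message buffer
        self.saved_state = {"name": self.name, "buffer": self.flush()}

    def resume(self):
        self.running = True

def checkpoint_round(vms):
    for vm in vms:           # 1. pause computation
        vm.pause()
    for vm in vms:           # 2 + 3. flush channels, save state & buffer
        vm.save()
    for vm in vms:           # 4. continue computation
        vm.resume()

nodes = [VM("compute01"), VM("compute02")]
nodes[0].inbox.append("msg-in-flight")
checkpoint_round(nodes)
print(nodes[0].saved_state["buffer"])  # ['msg-in-flight']
```

Saving the drained buffer together with the VM state is what keeps the global checkpoint consistent: no message is lost between the paused VMs.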


slide-7
SLIDE 7

VCCP checkpoint protocol

[Diagram: the Head node signals compute01 and compute02 to save]


slide-8
SLIDE 8

VCCP checkpoint protocol

[Diagram: Head, compute01, and compute02 flush the communication channels; once each channel is empty, every node saves its VM & buffer]


slide-9
SLIDE 9

VCCP checkpoint protocol

[Diagram: each node reports its result; on success the Head sends resume and all nodes continue]


slide-10
SLIDE 10

 Published in IEEE Cluster 2009
 Average overhead: 12%
 Provides transparent checkpoint/restart


slide-11
SLIDE 11

1. Device initialization
2. Device memory allocation
3. Copy data to device memory
4. Execute kernel (calling a __global__ function)
5. Copy data from device memory (retrieve results)

 Issues: round-trip latency of data movement

[Diagram: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid contains blocks, e.g. Block (0, 0) … Block (2, 1), and each block contains threads, e.g. Thread (0, 0) … Thread (4, 2)]
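The five host-side steps can be sketched as follows. Plain Python lists stand in for device memory, since the real calls (`cudaMalloc`, `cudaMemcpy`, a `kernel<<<grid, block>>>` launch) need a GPU:

```python
# Simulated host-side CUDA flow (step 1, device initialization,
# is implicit here; real code would query and select a device).
def run_on_device(host_data, kernel):
    device_mem = [0] * len(host_data)   # 2. device memory allocation
    device_mem[:] = host_data           # 3. copy data to device memory
    for i in range(len(device_mem)):    # 4. execute kernel: one "thread"
        device_mem[i] = kernel(device_mem[i])   #    per element
    return list(device_mem)             # 5. copy results back to host

out = run_on_device([1, 2, 3, 4], lambda x: x * x)
print(out)  # [1, 4, 9, 16]
```

Steps 3 and 5 are the round trips whose latency the slide flags as the issue: every checkpoint that uses plain memcopy pays step 5 again.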

slide-12
SLIDE 12

 Long-running GPU applications
 (Relatively) high failure rate in a large-scale GPU cluster in an MPI & GPU environment
 Save GPU software state
 Move data back from the GPU with low latency

  • Memcopy (pauses the GPU) vs. simpleStream (concurrency)


slide-13
SLIDE 13

 “CheCUDA: A Checkpoint/Restart Tool for CUDA Applications” by H. Takizawa, K. Sato, K. Komatsu, and H. Kobayashi
 A prototype add-on package for BLCR for GPU checkpointing
 Memcopy approach


slide-14
SLIDE 14

[Diagram: GPU checkpointing (1) followed by CPU checkpointing / migration (2)]


slide-15
SLIDE 15

[Timeline: Process starts → H-D memory copy → Kernel starts → Syncthread() → GPU checkpoint (the GPU checkpoint duration overlaps the CPU checkpoint / migration) → Kernel completes → D-H memory copy → Process ends]

slide-16
SLIDE 16

1. Copy all the user data in the device memory to the host memory
2. Write the current status of the application and the user data to a checkpoint file


slide-17
SLIDE 17

1. Read the checkpoint file
2. Initialize the GPU and recreate CUDA resources
3. Send the user data back to the device memory
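The checkpoint steps on the previous slide and the restart steps above can be sketched together. Python `pickle` stands in for the checkpoint file, and lists simulate device memory; in the real tool the device-to-host copy is a CUDA memcopy:

```python
import os
import pickle
import tempfile

def checkpoint(path, app_status, device_data):
    host_copy = list(device_data)              # 1. device -> host copy
    with open(path, "wb") as f:                # 2. write status + user data
        pickle.dump({"status": app_status, "data": host_copy}, f)

def restart(path):
    with open(path, "rb") as f:                # 1. read the checkpoint file
        ckpt = pickle.load(f)
    device_data = []                           # 2. reinit GPU, recreate resources
    device_data[:] = ckpt["data"]              # 3. send user data back to device
    return ckpt["status"], device_data

ckpt_path = os.path.join(tempfile.gettempdir(), "gpu.ckpt")
checkpoint(ckpt_path, "iter-42", [3.0, 1.0, 4.0])
status, data = restart(ckpt_path)
print(status, data)  # iter-42 [3.0, 1.0, 4.0]
```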


slide-18
SLIDE 18

 Transferring data from device to host = overhead

  • Must pause GPU computation until the copy is completed

 SimpleStream

  • Uses latency hiding (streams) to reduce the overhead
  • CUDA streams = overlap memory copy and kernel execution
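The latency-hiding idea, keep the kernel running in one stream while another stream copies the checkpoint data out, can be illustrated with two worker threads standing in for CUDA streams (`compute_stream` / `copy_stream` are illustrative names; the real mechanism is `cudaMemcpyAsync` on separate `cudaStream_t` handles):

```python
import threading
import time

events = []

def compute_stream():
    # the kernel keeps executing while the checkpoint copy runs
    for step in range(3):
        time.sleep(0.05)
        events.append(("compute", step))

def copy_stream():
    # asynchronous device->host copy of the checkpoint image
    time.sleep(0.005)
    events.append(("copy_done", None))

t1 = threading.Thread(target=compute_stream)
t2 = threading.Thread(target=copy_stream)
t1.start(); t2.start()
t1.join(); t2.join()

# the copy finished while compute was still in flight: overlap, not a pause
print(events.index(("copy_done", None)) < len(events) - 1)  # True
```

With the memcopy approach the copy would instead sit between two compute phases, stalling the GPU for the whole transfer.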


slide-19
SLIDE 19

[Flowchart: Process starts → H-D memory copy → Kernel starts → code analysis → Syncthread() → "after the sync point, OVERWRITE?" (YES / NO) → GPU checkpoint duration with CPU checkpoint / migration → Kernel completes → D-H memory copy → Process ends]


slide-20
SLIDE 20

[Flowchart detail: after Syncthread(), code analysis asks whether the data is overwritten after the sync point (YES / NO)]


slide-21
SLIDE 21

[Flowchart, NO branch: the sync-point image is not overwritten, so the CPU checkpoint / migration runs during the GPU checkpoint duration → Kernel completes → D-H memory copy → Process ends]


slide-22
SLIDE 22

[Flowchart, YES branch: the sync-point image would be overwritten, so the image is duplicated inside the GPU; the CPU checkpoint / migration then runs during the GPU checkpoint duration → Kernel completes → D-H memory copy → Process ends]


slide-23
SLIDE 23

 Restart the CPU
 Transfer the last GPU checkpoint back to the CPU
 Recreate the CUDA context from the checkpoint file
 Restart the kernel execution from the marked synchronization point


slide-24
SLIDE 24

 GPU checkpoint after a thread synchronization
 NOT after every thread synchronization

 QUESTION:

  • At which thread synchronization should a checkpoint be invoked?

 FACTORS:

  • GPU checkpoint overhead
  • Chance of a failure occurrence


slide-25
SLIDE 25

Perform the checkpoint:

[Equation garbled in extraction: expected wasted time Ĉ at the j-th of m synchronization points, in terms of the checkpoint overhead O and the failure probability P_f]


slide-26
SLIDE 26

Skip the checkpoint:

[Equation garbled in extraction]

Perform the checkpoint:

[Equation garbled in extraction; the rule compares the expected wasted time Ĉ of the two choices in terms of the checkpoint overhead O and the failure probability P_f]


slide-27
SLIDE 27

 Simulate failures & the wasted time

  • total checkpoint overhead + re-computation due to a failure

 Overhead

  • Non-stream: 10 milliseconds – 3 seconds
  • Streams: negligible

 MTTF: 12 hours – 7 days
 Thread sync interval: 10 and 30 minutes
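The simulated quantity can be sketched as a small expected-waste calculator. The parameter values mirror the ranges on the slide, but the exponential failure model and the half-interval re-computation assumption are mine, not taken from the slide:

```python
import math

def expected_waste_per_interval(mttf_s, overhead_s, sync_interval_s):
    """Expected wasted seconds per sync interval when checkpointing at
    every sync point: the overhead is always paid, and a failure inside
    the interval (prob. 1 - exp(-t/MTTF), assuming exponential failures)
    re-does on average half the interval's work."""
    p_fail = 1.0 - math.exp(-sync_interval_s / mttf_s)
    return overhead_s + p_fail * (sync_interval_s / 2.0)

# MTTF 12 h, non-stream (memcopy) overhead 3 s, sync interval 10 min
w_mem = expected_waste_per_interval(12 * 3600, 3.0, 600)
# streams: negligible checkpoint overhead
w_stream = expected_waste_per_interval(12 * 3600, 0.0, 600)
print(round(w_mem - w_stream, 2))  # 3.0 (the stream variant saves the overhead)
```

Under this toy model the two variants differ only by the overhead term, which is the pattern the results slides show: negligible difference for small transfers, a growing gap as the memcopy overhead grows.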


slide-28
SLIDE 28

[Charts: wasted time for thread sync intervals of 10 mins and 30 mins]

slide-29
SLIDE 29

[Charts: wasted time for thread sync intervals of 10 mins and 30 mins]

slide-30
SLIDE 30

[Charts: wasted time against MTTFs and against overheads]

slide-31
SLIDE 31

 GPU checkpointing with streams to reduce overhead
 Non-stream and stream checkpoints differ insignificantly when the data transfer is insignificant
 BUT the stream checkpoint potentially performs better when the memcopy checkpoint overhead is larger.


slide-32
SLIDE 32

 Implement the GPU checkpoint/restart mechanism
 Work on other checkpoint protocols
 Include GPU process migration
