1. Realizing Extremely Large-Scale Stencil Applications on GPU Supercomputers
Toshio Endo, Yuki Takasaki, Satoshi Matsuoka
GSIC, Tokyo Institute of Technology

2. Stencil Computations
Important kernels for various simulations (CFD, materials, ...): the ASUCA weather simulator, Phase-Field computation (2011 Gordon Bell Prize), air flow simulation.
Each grid point at time t+1 is computed from neighboring points at time t.
These are memory-intensive computations: highly successful in speed, but not in scale.

3. CPU-GPU Hybrid Supercomputers with Memory Hierarchy
Tokyo Tech TSUBAME2.5 node architecture (simplified): 3 K20X GPUs per node, each with 6 GB of GPU memory (250 GB/s bandwidth, 1.5 MB L2 cache); 2 Xeon CPUs with 54~96 GB of host memory; GPUs attached via PCIe at 8 GB/s; 1,408 such nodes connected to each other.
In typical stencil implementations on GPUs, array sizes are configured to stay below the (aggregated) GPU memory -> this prohibits extremely Big & Fast simulation.

4. Stencil Code Example on GPU (double buffering)
Typical structure:
  copy domain Host -> Device
  temporal loop:
    compute grid points
    MPI comm. of boundary
  copy domain Device -> Host
LBM performance on a K20X GPU (6 GB), using one TSUBAME2.5 node: fast, but not big; the problem size cannot exceed the 6 GB GPU memory.
[Figure: speed (GFlops, 0~80) vs. problem size (GB, 0~54) for the NORMAL version, which runs only below the GPU memory size.]
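To make this structure concrete, here is a minimal CUDA+MPI sketch of the "NORMAL" in-memory pattern. All names (d_a/d_b, stencil_kernel, exchange_boundary, the grid dimensions) are illustrative assumptions, not the application's actual code:

```c
// Minimal sketch of the NORMAL in-memory structure (illustrative names).
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);        // copy domain Host -> Device
for (int t = 0; t < NT; t++) {                              // temporal loop
    stencil_kernel<<<grid, block>>>(d_b, d_a, nx, ny, nz);  // compute grid points
    exchange_boundary(d_b);                                 // MPI comm. of boundary (hypothetical helper)
    float *tmp = d_a; d_a = d_b; d_b = tmp;                 // double buffering: swap read/write arrays
}
cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);        // copy domain Device -> Host
```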

5. How Can We Exceed Memory Size?
(1) Using many GPUs
(2) Using the capacity of host memory (host memory > device memory)
(3) Using both

6. Motivating Example: What if We Exceed GPU Memory Simply? (1)
• A simple method is:
  – Put domain data on host memory
  – Divide the domain into small sub-domains (spatial blocks)
  – Repeat: copy a "sub-domain" into the GPU -> compute -> copy back the results
[Figure: the domain resides in host memory; sub-domains are streamed one at a time through device memory to the GPU cores.]

7. Motivating Example: What if We Exceed GPU Memory Simply? (2)
Structure:
  temporal loop:
    loop over sub-domains:
      copy sub-domain Host -> Device
      compute points in sub-domain
      copy sub-domain Device -> Host
    MPI comm. of boundary
7-point stencil performance on a K20X GPU (6 GB): 20~30x slower than the in-memory version, due to the large PCIe cost!! This ratio is close to 8 GB/s : 250 GB/s (PCIe vs. device memory bandwidth).
[Figure: speed (GFlops, 0~80) vs. problem size (GB, 0~54); NORMAL vs. HH.]
Keys for improvement are communication avoiding & locality improvement.
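A minimal sketch of this naive out-of-core structure, with the same illustrative names as before (h_sub, d_in/d_out, and the helpers are assumptions, not the actual code):

```c
// Naive out-of-core structure: every sub-domain crosses PCIe twice per step.
for (int t = 0; t < NT; t++) {                            // temporal loop
    for (int s = 0; s < num_sub; s++) {                   // loop over sub-domains
        cudaMemcpy(d_in, h_sub[s], sub_bytes, cudaMemcpyHostToDevice);   // Host -> Device
        stencil_kernel<<<grid, block>>>(d_out, d_in, nx, ny, nz);        // compute points in sub-domain
        cudaMemcpy(h_sub[s], d_out, sub_bytes, cudaMemcpyDeviceToHost);  // Device -> Host
    }
    exchange_boundary();                                  // MPI comm. of boundary (hypothetical helper)
}
// Each point moves over PCIe (8 GB/s) twice per step instead of staying
// in device memory (250 GB/s) -- hence the 20~30x slowdown.
```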

8. Goals of This Work
When we have existing apps, we want to realize the following:
• Large scale & high performance: locality improvement using temporal blocking, combined with the memory swapping of the HHRT library
• High productivity: a co-design approach that spans the algorithm layer, runtime layer, and architecture layer

9. Current Target GPU Stencil Application
• City-wind simulation by Naoyuki Onodera
• Based on the lattice Boltzmann method (D3Q19 model, 19-point stencil)
• Written in MPI+CUDA, ~12,000 lines of code
• 600 TFlops with ~4,000 GPUs
In the original design, "total array size < total GPU memory". How can we exceed this limitation?

10. Contributions
For real existing applications, the following are realized:
– [Scale] Problem sizes larger than GPU memory
– [Performance] Up to 85% of the performance of smaller, in-memory cases
– [Productivity] Required modifications of ~150 lines for the basic change, and ~1,000 lines for optimization

11. Contents
• HHRT library – Expands available memory capacity by data swapping
• Temporal blocking – Optimizations of stencils for locality improvement
• Combining the above two on real applications
• Results

12. Contents
• HHRT library – Expands available memory capacity by data swapping
• Temporal blocking – Optimizations of stencils for locality improvement
• Combining the above two on real applications
• Results

13. The HHRT Runtime Library for GPU Memory Swapping [Endo & Jin, Cluster'14]
• HHRT supports applications written in CUDA and MPI
  – HHRT is a wrapper library of CUDA/MPI
  – The original CUDA and MPI are not modified
  – Not only for stencil applications
Software stack: w/o HHRT, App -> CUDA/MPI -> OS/HW; with HHRT, App -> HHRT -> CUDA/MPI -> OS/HW.
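The slides do not show HHRT's internals; as a purely conceptual illustration of how a wrapper library can sit between an unmodified app and an unmodified CUDA runtime, here is an LD_PRELOAD-style interposition sketch. This is an assumption for illustration, not HHRT's actual mechanism, and register_allocation is a hypothetical helper:

```c
// Conceptual LD_PRELOAD-style interposition on cudaMalloc (NOT HHRT's
// actual implementation). The wrapper forwards to the real CUDA runtime
// and records each device allocation so it could later be swapped out.
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>

void register_allocation(void *ptr, size_t size);  /* hypothetical bookkeeping */

cudaError_t cudaMalloc(void **ptr, size_t size) {
    static cudaError_t (*real_cudaMalloc)(void **, size_t) = NULL;
    if (!real_cudaMalloc)  /* look up the real symbol in libcudart */
        real_cudaMalloc = (cudaError_t (*)(void **, size_t))
                          dlsym(RTLD_NEXT, "cudaMalloc");
    cudaError_t err = real_cudaMalloc(ptr, size);
    if (err == cudaSuccess)
        register_allocation(*ptr, size);
    return err;
}
```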

14. Functions of HHRT
(1) HHRT supports overprovisioning of MPI processes on each GPU
  – Each GPU is shared by m MPI processes
(2) HHRT implicitly executes memory swapping between device memory and host memory
  – "Process-wise" swapping: OS-like "page-wise" swapping is currently hard without modifying the original CUDA device/runtime

15. Execution Model of HHRT
w/o HHRT (typically): one MPI process per GPU; the process's data reside in device memory, with cudaMemcpy between device and host memory, and MPI communication between nodes.
With HHRT: m MPI processes share a single GPU (in the figure, m = 6); their aggregate data exceed device memory and spill into host memory.
[Figure: node diagrams contrasting the two models.]

16. Processes on HHRT
[Figure: a node where running processes keep their data on device memory, while sleeping processes have theirs on host memory.]
• We suppose s < device memory capacity < m × s, where s is the size of data each process allocates on device memory, and m is the number of processes sharing a GPU
  -> We can support a larger total data size than device memory
• We cannot keep all m processes running
  -> HHRT makes some processes "sleep" forcibly and implicitly
• Blocking MPI calls are the "yield" points
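To make the inequality concrete (illustrative numbers, not taken from the slides): on a 6 GB GPU shared by m = 6 processes each allocating s = 2 GB, we have s = 2 GB < 6 GB < m × s = 12 GB. The processes hold 12 GB of device data in total, twice the physical capacity, so at most three of them can have their data resident at once; the rest must sleep with their data evacuated to host memory.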

17. State Transition of Each Process
• Running -> Swapping out: the process is blocked by an MPI operation (MPI_Recv, MPI_Wait, ...); all data on the device (cudaMalloc'ed) are evacuated to host memory
• Swapping out -> Sleeping (Blocked): swapping finished
• Sleeping (Blocked) -> Sleeping (Runnable): the MPI operation is now unblocked (e.g., the message arrived)
• Sleeping (Runnable) -> Swapping in: there is enough space on device memory; all data are restored from host to device
• Swapping in -> Running: swapping finished
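A minimal sketch of these states as a C enum (the identifiers are illustrative, not HHRT's actual names):

```c
/* Process states from the transition diagram above (illustrative names). */
typedef enum {
    HH_RUNNING,           /* holds device memory and makes progress        */
    HH_SWAPPING_OUT,      /* cudaMalloc'ed data being evacuated to host    */
    HH_SLEEPING_BLOCKED,  /* waiting for the blocking MPI call to unblock  */
    HH_SLEEPING_RUNNABLE, /* MPI unblocked; waiting for device memory      */
    HH_SWAPPING_IN        /* data being restored from host to device       */
} hh_state_t;
```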

18. Running the LBM Code on HHRT
Performance on a K20X GPU: the capacity wall was broken, but it is too slow!!
[Figure: speed (GFlops, 0~80) vs. problem size (GB, 0~54); NORMAL vs. HH.]
We can support problem sizes larger than GPU memory with HHRT, but it is too slow -> we need aggressive optimization!

19. Contents
• HHRT library – Expands available memory capacity by data swapping
• Temporal blocking – Optimizations of stencils for locality improvement
• Combining the above two on real applications
• Results

20. Why Slow if We Use Host Memory?
• Each process can suffer heavy memory-swapping costs every iteration
  – This corresponds to transferring the entire process's sub-domain between GPU and CPU
  – This is done automatically, but such heavy costs cannot be hidden
• The cause is the lack of locality in stencil computations: array data are swapped out every iteration
• We need optimizations to improve locality!!

21. Temporal Blocking (TB) for Locality Improvement
• Temporal blocking: when we pick up a sub-domain, we perform k update steps on it at once, before going to the next sub-domain [Wolf 91]
  – k is the temporal block size; k = 1 corresponds to no TB
  – This introduces a "larger halo": with k = 2, advancing from t = 100 to t = 102 requires a halo two cells wide
• Mainly used for cache optimization [Wonnacott 00] [Datta 08] ... (k = 2~8)
• We use it to reduce PCIe communication (k = 10~200)

22. Code Structure with TB
Typical code ("small"):
  copy domain Host -> Device
  temporal loop:
    compute grid points
    MPI comm. of boundary
  copy domain Device -> Host
Naive version ("big but slow"):
  temporal loop:
    loop over sub-domains:
      copy sub-domain Host -> Device
      compute points in sub-domain
      copy sub-domain Device -> Host
    MPI comm. of boundary
  (very slow due to frequent PCIe transfers)
Hand-coded TB:
  outer temporal loop (Nt/k times):
    loop over sub-domains:
      copy sub-domain Host -> Device
      inner temporal loop (k times):
        compute points in sub-domain
      copy sub-domain Device -> Host
    MPI comm. of k boundary layers
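A hedged CUDA+MPI sketch of the hand-coded TB structure, assuming a radius-1 stencil so that k update steps need a k-wide halo; the helpers (copy_subdomain_with_halo, exchange_k_boundary_layers), the extra halo parameter on the kernel, and the halo bookkeeping are illustrative assumptions, not the paper's actual code:

```c
// Hand-coded temporal blocking: PCIe and MPI communication happen once
// per k steps instead of once per step (illustrative names throughout).
for (int tt = 0; tt < NT / k; tt++) {                 // outer temporal loop (Nt/k times)
    for (int s = 0; s < num_sub; s++) {               // loop over sub-domains
        copy_subdomain_with_halo(d_a, h_grid, s, k);  // Host -> Device, k-wide halo
        for (int i = 0; i < k; i++) {                 // inner temporal loop (k times)
            // each step consumes one halo layer; k - i layers remain valid
            stencil_kernel<<<grid, block>>>(d_b, d_a, nx, ny, nz, k - i);
            float *tmp = d_a; d_a = d_b; d_b = tmp;   // double buffering
        }
        copy_subdomain(h_grid, d_a, s);               // Device -> Host (interior only)
    }
    exchange_k_boundary_layers(h_grid, k);            // MPI comm. of k boundary layers
}
```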

23. What Makes TB Code Complex?
Differences between "typical" and "hand-coded TB":
(1) A "sub-domain" loop is introduced
(2) The temporal loop is divided into "inner" and "outer" loops
(3) The larger "halo" must be considered
(4) PCIe and MPI communication are done outside the "inner" loop
(1) and (4) are automated by the HHRT runtime; for (2) and (3), yes, we currently rely on code refactoring!

24. Contents
• HHRT library – Expands available memory capacity by data swapping
• Temporal blocking – Optimizations of stencils for locality improvement
• Combining the above two on real applications
• Results

25. Implementing Temporal Blocking on HHRT
How do we reduce the refactoring costs of existing apps?
• How do we map multiple sub-domains to a GPU?
  – w/o HHRT: 1 GPU <- 1 process <- m sub-domains
  – With HHRT: 1 GPU <- m processes, each maintaining only one sub-domain
  -> We don't need an additional sub-domain loop
• How is domain data moved?
  – w/o HHRT: PCIe communication is done explicitly
  – With HHRT: implicitly, within MPI communication
• On the other hand, the doubly nested temporal loops must still be written by hand

26. Implementing Temporal Blocking on HHRT (2)
TB on HHRT (per process):
  copy grid Host -> Device
  outer temporal loop (Nt/k times):
    inner temporal loop (k times):
      compute grid points
    MPI comm. of k boundary layers
  copy grid Device -> Host
The k-times update is done without intervention; swapping (PCIe communication) is done implicitly at the MPI calls. Compared with hand-coded TB, the sub-domain loop and the explicit sub-domain copies disappear.
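A hedged sketch of the per-process structure under HHRT, reusing the illustrative names from the earlier sketches; each MPI process owns exactly one sub-domain, so neither a sub-domain loop nor an explicit sub-domain copy appears:

```c
// TB on HHRT: one process = one sub-domain; swapping is implicit.
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);   // copy grid Host -> Device
for (int tt = 0; tt < NT / k; tt++) {                  // outer temporal loop (Nt/k times)
    for (int i = 0; i < k; i++) {                      // inner temporal loop (k times)
        // k-step update runs without intervention
        stencil_kernel<<<grid, block>>>(d_b, d_a, nx, ny, nz, k - i);
        float *tmp = d_a; d_a = d_b; d_b = tmp;        // double buffering
    }
    // Blocking MPI calls are HHRT's yield points: the runtime may evacuate
    // this process's device data to host memory here, implicitly.
    exchange_k_boundary_layers(d_a, k);                // MPI comm. of k boundary layers
}
cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);   // copy grid Device -> Host
```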
