Realizing Extremely Large-Scale Stencil Applications on GPU Supercomputers
Toshio Endo, Yuki Takasaki, Satoshi Matsuoka
GSIC, Tokyo Institute of Technology
Stencil Computations
Examples: ASUCA weather simulator, Phase-Field computation (2011 Gordon Bell Prize), air flow simulation
Important kernels for various simulations (CFD, materials, …)
[Figure: each grid point at time t+1 is computed from its neighboring points at time t]
Memory-intensive computations. Highly successful in speed, but not in scale.
CPU‐GPU Hybrid Supercomputers with Memory Hierarchy
Tokyo Tech TSUBAME2.5 Supercomputer
Node architecture (simplified), ×1408 nodes: 2 Xeon CPUs and 3 K20X GPUs per node. Each GPU card has 6 GB of GPU memory (250 GB/s, 1.5 MB L2$) attached to the GPU cores; the CPU cores have 54~96 GB of host memory. The GPU is connected to the host over PCIe at 8 GB/s, and the node is connected to other nodes.
In typical stencil implementations on GPUs, array sizes are configured to be smaller than the (aggregated) GPU memory capacity. This prohibits extremely Big & Fast simulation.
[Graph: Speed (GFlops) vs. Problem Size (GB), NORMAL version]
Stencil Code Example on GPU
Typical code structure:
    Copy domain Host → Device
    Temporal loop:
        MPI comm. of boundary
        Compute grid points
    Copy domain Device → Host
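As a concrete illustration (not the actual application code), a minimal CUDA+MPI sketch of this typical structure for a 3D 7-point stencil could look as follows. The kernel stencil7, the driver run_typical, and the halo-exchange helper exchange_boundary are hypothetical names, and error checking is omitted:

    // Minimal sketch of the typical in-GPU-memory structure (hypothetical names).
    // The whole sub-domain of this rank fits in device memory; double buffering
    // keeps two copies, and only the boundary is communicated each step.
    #include <mpi.h>
    #include <cuda_runtime.h>

    void exchange_boundary(float *d_grid, int nx, int ny, int nz);  // MPI halo exchange (omitted)

    __global__ void stencil7(const float *in, float *out, int nx, int ny, int nz) {
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      int z = blockIdx.z * blockDim.z + threadIdx.z;
      if (x < 1 || y < 1 || z < 1 || x >= nx - 1 || y >= ny - 1 || z >= nz - 1) return;
      size_t i = ((size_t)z * ny + y) * nx + x;
      out[i] = (in[i] + in[i - 1] + in[i + 1] + in[i - nx] + in[i + nx]
                + in[i - (size_t)nx * ny] + in[i + (size_t)nx * ny]) / 7.0f;
    }

    void run_typical(float *h_dom, int nx, int ny, int nz, int nt) {
      size_t bytes = (size_t)nx * ny * nz * sizeof(float);
      float *d_a, *d_b;
      cudaMalloc(&d_a, bytes);
      cudaMalloc(&d_b, bytes);
      cudaMemcpy(d_a, h_dom, bytes, cudaMemcpyHostToDevice);   // copy domain Host -> Device (once)
      dim3 tb(32, 4, 4), gb((nx + 31) / 32, (ny + 3) / 4, (nz + 3) / 4);
      for (int t = 0; t < nt; t++) {                           // temporal loop
        exchange_boundary(d_a, nx, ny, nz);                    // MPI comm. of boundary
        stencil7<<<gb, tb>>>(d_a, d_b, nx, ny, nz);            // compute grid points
        float *tmp = d_a; d_a = d_b; d_b = tmp;                // double buffering: swap pointers
      }
      cudaMemcpy(h_dom, d_a, bytes, cudaMemcpyDeviceToHost);   // copy domain Device -> Host (once)
      cudaFree(d_a); cudaFree(d_b);
    }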
With double buffering, the total array size must be smaller than the 6 GB GPU memory size: fast, but not Big.
LBM performance on a K20X GPU (6 GB); a TSUBAME2.5 node is used.
How Can We Exceed Memory Size?
To support domain sizes > GPU memory, we can consider:
(1) Using many GPUs
(2) Using the capacity of host memory
(3) Using both
Motivating Example: What if We Exceed GPU memory Simply? (1)
- A simple method is:
  – Put domain data on host memory
  – Divide the domain into small sub-domains (spatial blocks)
  – Repeat: copy a sub-domain into the GPU, compute, and copy back the results
[Figure: sub-domains are staged one at a time from host memory into device memory, computed on the GPU cores, and copied back]
[Graph: Speed (GFlops) vs. Problem Size (GB), NORMAL and HH versions]
Motivating Example: What if We Exceed GPU memory Simply? (2)
Beyond the device memory capacity, the simple method is 20~30x slower due to the large PCIe cost!! This ratio is close to the bandwidth ratio of 8 GB/s (PCIe) : 250 GB/s (GPU memory).
Keys for improvement are "communication avoiding & locality improvement".
7-point stencil performance on a K20X GPU (6 GB).
Naive out-of-core structure:
    Temporal loop:
        MPI comm. of boundary
        Loop over sub-domains:
            Copy sub-domain Host → Device
            Compute points in sub-domain
            Copy sub-domain Device → Host
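As a rough sketch of this naive structure (hypothetical names; halo bookkeeping between neighboring sub-domains is omitted for brevity), reusing the stencil7 kernel from the earlier sketch. Note that the entire domain now crosses PCIe twice per time step:

    // Sketch of the naive out-of-core structure (hypothetical names).
    // The whole domain stays in host memory; each sub-domain is staged
    // through device memory every single time step.
    void exchange_boundary_host(float *h_dom, int nx, int ny, int nz);  // MPI halo exchange on host (omitted)

    void run_naive(float *h_dom, int nx, int ny, int nz, int nt, int nsub) {
      int snz = nz / nsub;                                  // sub-domain thickness in z
      size_t sub_bytes = (size_t)nx * ny * snz * sizeof(float);
      float *d_in, *d_out;
      cudaMalloc(&d_in, sub_bytes);
      cudaMalloc(&d_out, sub_bytes);
      dim3 tb(32, 4, 4), gb((nx + 31) / 32, (ny + 3) / 4, (snz + 3) / 4);
      for (int t = 0; t < nt; t++) {                        // temporal loop
        exchange_boundary_host(h_dom, nx, ny, nz);          // MPI comm. of boundary
        for (int s = 0; s < nsub; s++) {                    // loop over sub-domains
          float *h_sub = h_dom + (size_t)s * snz * ny * nx;
          cudaMemcpy(d_in, h_sub, sub_bytes, cudaMemcpyHostToDevice);   // copy sub-domain Host -> Device
          stencil7<<<gb, tb>>>(d_in, d_out, nx, ny, snz);               // compute points in sub-domain
          cudaMemcpy(h_sub, d_out, sub_bytes, cudaMemcpyDeviceToHost);  // copy sub-domain Device -> Host
        }
      }
      cudaFree(d_in); cudaFree(d_out);
    }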
Goals of This Work
When we have existing applications, we want to realize the following: large scale, high performance, and high productivity.
- Using memory swapping of the HHRT library
- Locality improvement with temporal blocking
- A co-design approach that spans the algorithm layer, runtime layer, and architecture layer
Current Target GPU Stencil Application
- City‐Wind Simulation by Naoyuki Onodera
- Based on the Lattice-Boltzmann method
- Written in MPI+CUDA
- ~12,000 Lines of code
- 600TFlops with ~4,000 GPUs
In the original design, "total array size < total GPU memory" is assumed. How can we exceed this limitation?
D3Q19 model (19-point stencil)
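For reference, the 19 discrete velocities of the D3Q19 lattice are listed below. This is a standard textbook ordering, not necessarily the one used in the application:

    // D3Q19 discrete velocity set: 1 rest particle + 6 face neighbors + 12 edge neighbors = 19.
    static const int c_d3q19[19][3] = {
      { 0, 0, 0},                                        // rest particle
      { 1, 0, 0}, {-1, 0, 0}, { 0, 1, 0}, { 0,-1, 0},    // 6 axis-aligned (face) neighbors
      { 0, 0, 1}, { 0, 0,-1},
      { 1, 1, 0}, {-1, 1, 0}, { 1,-1, 0}, {-1,-1, 0},    // 12 edge-diagonal neighbors
      { 1, 0, 1}, {-1, 0, 1}, { 1, 0,-1}, {-1, 0,-1},
      { 0, 1, 1}, { 0,-1, 1}, { 0, 1,-1}, { 0,-1,-1},
    };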
Contributions
- For real existing applications, the following are realized:
  – [Scale] Problem sizes > GPU memory size
  – [Performance] Up to 85% performance compared with smaller (in-GPU-memory) cases
  – [Productivity] Required modifications of ~150 lines for the basic change, and ~1,000 lines for optimization
Contents
- HHRT library
– Expands available memory capacity by data swapping
- Temporal blocking
– Optimizations of stencils for locality improvement
- Combining the above two on real applications
- Results
The HHRT Runtime Library for GPU Memory Swapping [Endo and Jin, Cluster'14]
- HHRT supports applications written in CUDA and MPI
– HHRT works as a wrapper library of CUDA/MPI
– The original CUDA and MPI are not modified
– It is not only for stencil applications
[Figure: without HHRT, the stack is App / MPI, CUDA / OS & HW; with HHRT, the HHRT layer is inserted between the App and MPI, CUDA]
Functions of HHRT
(1) HHRT supports overprovisioning of MPI processes on each GPU
– Each GPU is shared by m MPI processes
(2) HHRT implicitly executes memory swapping between device memory and host memory
– Swapping is "process-wise"
– OS-like "page-wise" swapping is currently hard without modifying the original CUDA device driver/runtime
Execution model of HHRT
[Figure: w/o HHRT (typically), each GPU is used by one process whose data resides in device memory, with cudaMemcpy and MPI communication done explicitly; with HHRT, m MPI processes share a single GPU (in this figure, m = 6), and each process's data may reside in device memory or host memory]
Processes on HHRT
- We suppose s < (device memory capacity) < m × s, where
  – s: the size of data that each process allocates on device memory
  – m: the number of processes sharing a GPU
In total, we can support a larger data size than the device memory.
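For example (hypothetical numbers): with a 6 GB device memory and s = 2.5 GB per process, m = 6 processes hold 15 GB of data in total, but only two processes' data (5 GB) fit on the device at a time, so s < 6 GB < m × s holds.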
- We cannot keep all of the m processes running
  – HHRT forcibly and implicitly makes some processes "sleep"
- Blocking MPI calls are “yield” points
[Figure: running processes keep their data in device memory; sleeping processes have their data evacuated to host memory]
State Transition of Each Process
States: Running → Swapping out → Sleeping (Blocked) → Sleeping (Runnable) → Swapping in → Running
– Running → Swapping out: the process is blocked by an MPI operation (MPI_Recv, MPI_Wait, ...); all data on the device (cudaMalloc'ed) are evacuated to host memory
– Swapping out → Sleeping (Blocked): swapping has finished
– Sleeping (Blocked) → Sleeping (Runnable): the MPI operation is now unblocked (e.g., the message has arrived)
– Sleeping (Runnable) → Swapping in: there is enough space in device memory; all data are restored from host to device
– Swapping in → Running: swapping has finished
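The slides do not show HHRT's internals; purely as an illustration of how a blocking MPI call can act as a yield/swap point, a wrapper built on the standard MPI profiling interface (PMPI) might look like the sketch below. The helpers swap_out_device_data, wait_for_device_space, and swap_in_device_data are hypothetical, and a real implementation would track cudaMalloc'ed regions and coordinate the m processes sharing the GPU:

    // Illustrative sketch only (not HHRT's actual code): intercept MPI_Recv,
    // evacuate this process's device data to host, block, then restore the
    // data once enough device memory is available again.
    #include <mpi.h>

    void swap_out_device_data(void);    // hypothetical: copy cudaMalloc'ed regions to host, free them
    void wait_for_device_space(void);   // hypothetical: wait until enough device memory is free
    void swap_in_device_data(void);     // hypothetical: re-allocate on device, restore from host

    int MPI_Recv(void *buf, int count, MPI_Datatype dt, int src, int tag,
                 MPI_Comm comm, MPI_Status *status) {
      swap_out_device_data();                                       // Running -> Swapping out
      int rc = PMPI_Recv(buf, count, dt, src, tag, comm, status);   // Sleeping (Blocked) until message arrives
      wait_for_device_space();                                      // Sleeping (Runnable) until space exists
      swap_in_device_data();                                        // Swapping in -> Running
      return rc;
    }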
Running LBM Code on HHRT
Performance on a K20X GPU: with HHRT we can support larger problem sizes than GPU memory, but it is too slow. We need aggressive optimization!
[Graph: Speed (GFlops) vs. Problem Size (GB), NORMAL and HH versions]
The capacity wall was broken, but it is too slow!!
Contents
- HHRT library
– Expands available memory capacity by data swapping
- Temporal blocking
– Optimizations of stencils for locality improvement
- Combining the above two on real applications
- Results
Why Slow if We Use Host Memory?
- Each process can suffer from heavy memory swapping costs every iteration
  – This corresponds to transferring the entire process's sub-domain between GPU and CPU
- This is done automatically, but such heavy costs cannot be hidden
- This is due to the lack of locality of stencil computations
– Array data are swapped out every iteration
- We need optimizations to improve locality!!
Temporal Blocking (TB) for Locality Improvement
- Temporal blocking: when we pick up a sub-domain, we perform k update steps on it at once, before going to the next sub-domain [Wolf 91]
- Mainly used for cache optimization [Wonnacott 00] [Datta 08] ... (k=2~8)
- We use it to reduce PCIe communication (k=10~200)
Introducing “larger halo”
[Figure: k = 1 (w/o TB) vs. k = 2, updating from t = 100 to t = 102; the halo region grows with the temporal block size k]
Code Structure with TB

Typical code ("Small"):
    Copy domain Host → Device
    Temporal loop:
        MPI comm. of boundary
        Compute grid points
    Copy domain Device → Host

Naive version ("Big but slow"; very slow due to frequent PCIe transfers):
    Temporal loop:
        MPI comm. of boundary
        Loop over sub-domains:
            Copy sub-domain Host → Device
            Compute points in sub-domain
            Copy sub-domain Device → Host

Hand-coded TB:
    Outer temporal loop (Nt/k times):
        MPI comm. of k-cell boundary
        Loop over sub-domains:
            Copy sub-domain Host → Device
            Inner temporal loop (k times):
                Compute points in sub-domain
            Copy sub-domain Device → Host
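A sketch of the hand-coded TB structure in CUDA follows (hypothetical helper names; the shrinking compute region and halo bookkeeping are glossed over). Each sub-domain now crosses PCIe once per k steps rather than once per step:

    // Sketch of hand-coded temporal blocking (hypothetical names).
    // PCIe traffic per time step is reduced by roughly a factor of k,
    // at the cost of a halo of width k and some redundant computation.
    void exchange_boundary_host_k(float *h_dom, int nx, int ny, int nz, int k);        // MPI comm. of k-cell boundary (omitted)
    void copy_subdomain_to_device(float *d, const float *h, int s, int snz, int k);    // Host -> Device incl. k-cell halo (omitted)
    void copy_subdomain_to_host(float *h, const float *d, int s, int snz);             // Device -> Host, interior only (omitted)
    void launch_stencil(const float *in, float *out, int nx, int ny, int nz_local);    // one update step (omitted)

    void run_tb(float *h_dom, int nx, int ny, int nz, int nt, int nsub, int k) {
      int snz = nz / nsub;
      size_t sub_bytes = (size_t)nx * ny * (snz + 2 * k) * sizeof(float);  // sub-domain plus halo of width k
      float *d_a, *d_b;
      cudaMalloc(&d_a, sub_bytes);
      cudaMalloc(&d_b, sub_bytes);
      for (int T = 0; T < nt / k; T++) {                        // outer temporal loop (Nt/k times)
        exchange_boundary_host_k(h_dom, nx, ny, nz, k);         // MPI comm. of k-cell boundary
        for (int s = 0; s < nsub; s++) {                        // loop over sub-domains
          copy_subdomain_to_device(d_a, h_dom, s, snz, k);      // copy sub-domain Host -> Device (once per k steps)
          for (int kk = 0; kk < k; kk++) {                      // inner temporal loop (k times)
            launch_stencil(d_a, d_b, nx, ny, snz + 2 * k);      // the valid region shrinks by 1 cell per step
            float *tmp = d_a; d_a = d_b; d_b = tmp;
          }
          copy_subdomain_to_host(h_dom, d_a, s, snz);           // copy sub-domain Device -> Host (once per k steps)
        }
      }
      cudaFree(d_a); cudaFree(d_b);
    }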
What Makes TB Code Complex?
Differences between "typical" and "hand-coded TB":
(1) A "sub-domain" loop is introduced
(2) The temporal loop is divided into "inner" and "outer" loops
(3) A larger "halo" must be considered
(4) PCIe and MPI communication is done outside of the "inner" loop
Some of these (notably the sub-domain loop and the PCIe communication) are automated by the HHRT runtime; for the rest, yes, we currently rely on code refactoring!
Contents
- HHRT library
– Expands available memory capacity by data swapping
- Temporal blocking
– Optimizations of stencils for locality improvement
- Combining the above two on real applications
- Results
Implementing Temporal Blocking on HHRT
How do we reduce the refactoring costs of existing apps?
- How do we map multiple sub-domains to a GPU?
  – w/o HHRT: 1 GPU ↔ 1 process ↔ m sub-domains
  – With HHRT: 1 GPU ↔ m processes ↔ m domains; each process maintains only one domain, so we don't need an additional sub-domain loop
- How is domain data moved?
  – w/o HHRT: PCIe communication is done explicitly
  – With HHRT: implicitly, within MPI communication
- On the other hand, the doubly nested temporal loops still have to be written by hand
Implementing Temporal Blocking on HHRT (2)
Typical code:
    Copy grid Host → Device
    Temporal loop:
        MPI comm. of boundary
        Compute grid points
    Copy grid Device → Host

Hand-coded TB:
    Outer temporal loop (Nt/k times):
        MPI comm. of k-cell boundary
        Loop over sub-domains:
            Copy sub-domain Host → Device
            Inner temporal loop (k times):
                Compute points in sub-domain
            Copy sub-domain Device → Host

TB on HHRT:
    Copy grid Host → Device
    Outer temporal loop (Nt/k times):
        MPI comm. of k-cell boundary  ← swapping (PCIe comm) is done implicitly here
        Inner temporal loop (k times):
            Compute grid points  ← the k-step update is done without intervention
    Copy grid Device → Host
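Under HHRT, the same idea collapses to roughly the following sketch (hypothetical names; one MPI process per domain, reusing launch_stencil from the previous sketch): there is no sub-domain loop and no explicit whole-domain cudaMemcpy inside the temporal loops, because swapping happens implicitly inside the blocking MPI calls.

    // Sketch of the TB-on-HHRT structure (hypothetical names).
    // Each of the m processes sharing a GPU owns exactly one domain;
    // HHRT swaps that domain to/from host memory around blocking MPI calls.
    void exchange_boundary_mpi_k(float *d_grid, int nx, int ny, int nz, int k);  // k-cell halo exchange via MPI (omitted)

    void run_tb_on_hhrt(float *h_dom, int nx, int ny, int nz, int nt, int k) {
      size_t bytes = (size_t)nx * ny * nz * sizeof(float);
      float *d_a, *d_b;
      cudaMalloc(&d_a, bytes);
      cudaMalloc(&d_b, bytes);
      cudaMemcpy(d_a, h_dom, bytes, cudaMemcpyHostToDevice);   // copy grid Host -> Device (once)
      for (int T = 0; T < nt / k; T++) {                       // outer temporal loop (Nt/k times)
        exchange_boundary_mpi_k(d_a, nx, ny, nz, k);           // MPI comm. of k-cell boundary;
                                                               // HHRT may swap this process out/in here
        for (int kk = 0; kk < k; kk++) {                       // inner temporal loop (k times), no PCIe traffic
          launch_stencil(d_a, d_b, nx, ny, nz);                // compute grid points
          float *tmp = d_a; d_a = d_b; d_b = tmp;
        }
      }
      cudaMemcpy(h_dom, d_a, bytes, cudaMemcpyDeviceToHost);   // copy grid Device -> Host (once)
      cudaFree(d_a); cudaFree(d_b);
    }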
Code Refactoring
- Original: ~12,000 lines (MPI+CUDA)
– ~4000 lines correspond to computation kernels
- Basic code change: ~150 lines
– Introducing outer/inner temporal loop
- Communication optimization: ~900 more lines
– X, Y, Z boundary communications use MPI_Waitall
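The optimized communication code itself is not shown in the slides; as a generic illustration only (hypothetical buffer and neighbor-rank names), a non-blocking halo exchange in one direction completed with MPI_Waitall looks like:

    // Generic sketch of a non-blocking halo exchange in the X direction,
    // completed with MPI_Waitall (hypothetical buffer and neighbor-rank names).
    void exchange_x(float *send_to_left, float *send_to_right,
                    float *recv_from_left, float *recv_from_right,
                    int halo_count, int left_rank, int right_rank, MPI_Comm comm) {
      MPI_Request reqs[4];
      MPI_Irecv(recv_from_left,  halo_count, MPI_FLOAT, left_rank,  0, comm, &reqs[0]);
      MPI_Irecv(recv_from_right, halo_count, MPI_FLOAT, right_rank, 1, comm, &reqs[1]);
      MPI_Isend(send_to_right,   halo_count, MPI_FLOAT, right_rank, 0, comm, &reqs[2]);
      MPI_Isend(send_to_left,    halo_count, MPI_FLOAT, left_rank,  1, comm, &reqs[3]);
      MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);   // complete all X-direction transfers at once
      // The Y and Z boundaries are handled analogously.
    }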
Performance of Real LBM Code with Larger Problem Sizes
Performance on a K20X GPU: >15x speed-up with temporal blocking! A first step toward "Extreme Big & Fast" simulation.
[Graph: Speed (GFlops) vs. Problem Size (GB); NORMAL, HH, HH_TB, HHTB_MPI, and HHTBMPI_HINT versions]
The capacity wall was broken, and the performance wall was broken.
Multi GPU/Node Performance
TSUBAME2.5 (1 GPU per node). Performance: 11.2 TFlops, 5.9 TB/s; problem size: 2.8 TB.
Good weak scalability (11 GB per node > 6 GB GPU memory): ×203 speedup with 256 GPUs (though the 1-GPU case already suffers overhead).
[Graph: Speed vs. # of GPUs]
Summary
Towards Extreme Fast&Big Simulations
- Architecture: Hierarchical Hybrid memory
- System software: Reducing programming cost
- App. Algorithm: Reducing communication
System software for the memory hierarchy
Co-design is the key
Future Work
- More performance
  – We still suffer from several costs:
    - Redundant computations
    - Costs of process oversubscription
- More scale
– Using SSDs and burst buffers
- More productivity