An Optimized GPU Implementation of a Multicolor Point-Implicit Linear Solver - PowerPoint PPT Presentation



SLIDE 1

An Optimized GPU Implementation of a Multicolor Point-Implicit Linear Solver

SLIDE 2

Outline

  • Motivation
  • Problem and Challenge
  • Approach
  • Performance Results
  • Future Work and Conclusion
SLIDE 3

Motivation

SLIDE 4

FUN3D

  • FUN3D is a suite of computational fluid dynamics software developed

at the NASA Langley Research Center to solve the Navier-Stokes equations for a broad range of aerodynamics applications

  • For high-energy flows, traditional perfect gas assumptions are invalid

and the system must be further expanded to accommodate finite-rate chemistry models. For these reasons, accurate and efficient simulations of complex aerodynamic flows are challenging and require significant computational resources.

SLIDE 5

FUN3D

  • FUN3D uses an implicit time-integration strategy on unstructured grids

that requires the solution of a large, tightly coupled system of block-sparse linear equations at each time step

  • Computing solutions to these linear systems accounts for a significant

fraction of the overall runtime in virtually all FUN3D simulations.

SLIDE 6

Problem and Challenges

SLIDE 7

Linear Solver

  • Implicit scheme results in linear systems of equations:

π‘©π’š = 𝒄, 𝑩 is a π‘œ Γ— π‘œ block matrix for π‘œ grid points; block is of size π‘œπ‘ Γ— π‘œπ‘

  • Matrix A is segregated into two separate matrices:

A ≡ D + O, where D and O represent the diagonal and off-diagonal blocks of A

  • Array for storing D: D(nb, nb, n)

Prior to performing each linear solve, each diagonal block D_j is decomposed in place into lower and upper triangular matrices L_j and U_j, 1 ≤ j ≤ n
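The per-block factorization and the forward/backward solve it enables can be sketched in plain Python (an illustrative sketch only; the actual solver is CUDA Fortran, and the function names here are mine):

```python
# Illustrative sketch (not the FUN3D CUDA Fortran code): in-place LU
# factorization of one dense nb x nb diagonal block, Doolittle style without
# pivoting, leaving the unit lower factor and the upper factor packed in the
# same storage.

def lu_in_place(block):
    nb = len(block)
    for k in range(nb):
        for i in range(k + 1, nb):
            block[i][k] /= block[k][k]            # multiplier l_ik
            for j in range(k + 1, nb):
                block[i][j] -= block[i][k] * block[k][j]
    return block

def solve_block(block, rhs):
    """Solve (L U) y = rhs using the packed factors: forward, then backward."""
    nb = len(block)
    z = rhs[:]
    for i in range(nb):                            # forward solve with unit L
        for j in range(i):
            z[i] -= block[i][j] * z[j]
    for i in reversed(range(nb)):                  # backward solve with U
        for j in range(i + 1, nb):
            z[i] -= block[i][j] * z[j]
        z[i] /= block[i][i]
    return z
```

For example, factoring the 2 × 2 block [[4, 3], [6, 3]] in place leaves the multiplier 1.5 below the diagonal and the pivot -1.5 on it, after which solve_block recovers the solution of the original block system.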

SLIDE 8

Sparse Block-Matrix Storage Format for O
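A common realization of such a storage format is a block-CSR (BSR) layout, where the row pointers and column indices address nb × nb blocks rather than scalars. The sketch below is illustrative Python with names of my choosing; the deck does not show FUN3D's exact layout:

```python
# Hypothetical block-CSR (BSR) sketch for the block-sparse off-diagonal
# matrix: row_ptr/col_idx index nb x nb blocks, and vals stores the dense
# blocks contiguously. Field names are illustrative, not FUN3D's.

def bsr_matvec(row_ptr, col_idx, vals, x, nb):
    """Compute y = O @ x for a block-sparse matrix with nb x nb dense blocks."""
    n = len(row_ptr) - 1                      # number of block rows
    y = [0.0] * (n * nb)
    for row in range(n):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            col = col_idx[k]
            blk = vals[k]                     # one nb x nb dense block
            for i in range(nb):
                for j in range(nb):
                    y[row * nb + i] += blk[i][j] * x[col * nb + j]
    return y
```

Each non-zero block touches a contiguous slice of x, which is what makes coalesced loads possible on the GPU.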

SLIDE 9

Multicolor Solver

  • Grid points are colored such that no two adjacent points are assigned the

same color.

  • Unknowns of the same color carry no data dependency and may be

updated in parallel.

  • Updates of unknowns of each color use the latest updated values at grid

points corresponding to other colors.

  • The overall process may be repeated using several outer sweeps over the

entire system.
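The coloring step above can be sketched as follows (a minimal Python illustration assuming a greedy first-fit scheme; the deck does not name the exact coloring algorithm used):

```python
# Greedy first-fit coloring sketch (assumed scheme, for illustration only):
# adjacent grid points never share a color, so all unknowns of one color
# form an independent set that can be updated in parallel.

def greedy_color(adjacency):
    """adjacency: dict mapping each node to the set of its neighbors."""
    color = {}
    for node in sorted(adjacency):
        used = {color[nbr] for nbr in adjacency[node] if nbr in color}
        c = 0
        while c in used:                 # first color unused by any neighbor
            c += 1
        color[node] = c
    return color
```

On a path graph 0-1-2 this yields two colors; the solver then sweeps the colors in sequence, using the latest values from the other colors, as described above.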

SLIDE 10

SLIDE 11

Distribution of non-zeros in the sample matrix

SLIDE 12

Solver Challenges

  • With an average of only a few off-diagonal blocks per row, the arithmetic intensity of the

computation is quite low (≈ 0.5 for a matrix of dimension 983633) – memory-bound on the GPU

  • The number of rows associated with a color can vary significantly. Consequently, the

amount of parallelism available for different colors also varies significantly

  • To support strong scalability, single-node performance for light workloads should also

be good

  • The solver uses indirect memory addressing
  • The implementation must support a broad range of block sizes
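The ≈ 0.5 figure can be checked with a back-of-envelope estimator. The assumptions below are mine: single-precision values (4 bytes, as suggested by the S-prefixed BLAS routines mentioned later in the deck), 4-byte block column indices, x reused from cache, and y written once.

```python
# Back-of-envelope arithmetic intensity (flops per byte) for the block
# sparse matrix-vector product. Assumptions (mine, not from the deck):
# single-precision values, 4-byte column indices, x cached, y written once.

def spmv_arithmetic_intensity(n_rows, nnz_blocks, nb, bytes_per_word=4.0):
    flops = 2.0 * nnz_blocks * nb * nb                     # one FMA per entry
    bytes_moved = (nnz_blocks * nb * nb * bytes_per_word   # block values
                   + nnz_blocks * 4.0                      # column indices
                   + n_rows * nb * bytes_per_word)         # writing y
    return flops / bytes_moved
```

With a handful of blocks per row the estimate stays just under 2 / bytes_per_word, i.e. a bit below 0.5 in single precision, consistent with the figure quoted above; in double precision it would drop toward 0.25.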
SLIDE 13

Solver Computation

SLIDE 14

Solver Computation – MPI Implementation

subroutine fun3d_solver
  do color = 1, nc
    ! Process rows of each color
    ! MPI communication: update portion of x array
  end do
end subroutine fun3d_solver

This computation is identical to the sequential computation

SLIDE 15

Approach

SLIDE 16

Solver Computation

  • Sparse block-matrix (A) vector operation
  • Multiple triangular solves using the dense blocks of matrix D

SLIDE 17

Naive Approach

Assign a thread to process a row block. As all row blocks can be processed independently, the parallelism is determined by the number of row blocks.

[Diagram: forward & backward substitution with the diagonal blocks; sparse off-diagonal matrix-vector product]

SLIDE 18

Solver Computation

  • Use sparse BLAS: cusparseSbsrmv
  • Use batched dense BLAS: cublasStrsmBatched

SLIDE 19

Sparse and Dense BLAS Formulation

  • Advantage: makes the code portable across different architectures while

retaining performance

  • Issue: for the block sizes of interest, the BLAS routines do not perform well
  • Solution: write optimized BLAS routines for the block sizes of interest
SLIDE 20

Overall Structure of Optimized Solver

SLIDE 21

Approach for sparse matrix vector (for nb = 1 to 16)

SLIDE 22

Approach for sparse matrix – vector (for nb = 1 to 16)

SLIDE 23

Approach for sparse matrix – vector (for nb = 1 to 16)

SLIDE 24

Approach for sparse matrix – vector (for nb = 17 to 64)

Assign four warps to four consecutive non-zero blocks in a row. Once the processing of the four blocks is complete, move to the next set of four blocks in the same row. Once the whole row is processed, aggregate the partial sums held by the four warps using shared memory.
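This scheme can be emulated sequentially (illustrative Python; on the GPU each partial sum lives in a warp's registers and the final aggregation goes through shared memory):

```python
# Sequential emulation (illustration only) of the four-warp scheme for large
# blocks: each "warp" w owns every fourth non-zero block of the row and keeps
# a private partial sum; the partial sums are then aggregated, which on the
# GPU happens through shared memory.

def row_partial_sums(blocks, x_pieces, nb, n_warps=4):
    """blocks: nb x nb non-zero blocks of one row; x_pieces: matching x slices."""
    partial = [[0.0] * nb for _ in range(n_warps)]
    for k, (blk, xs) in enumerate(zip(blocks, x_pieces)):
        w = k % n_warps                   # warp that owns this block
        for i in range(nb):
            for j in range(nb):
                partial[w][i] += blk[i][j] * xs[j]
    # aggregation step across the four partial sums
    return [sum(p[i] for p in partial) for i in range(nb)]
```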

SLIDE 25

Approach for triangular solve (for nb <= 32)

Process the columns of the lower triangular factor of D from left to right using a single warp. The amount of parallelism available to the warp decreases as we move from left to right. A value from the previous column's processing (b(j1)) must be broadcast – this uses the shuffle instruction.
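A scalar sketch of this column-oriented substitution (Python for illustration; on the GPU the inner loop maps to the threads of the warp, and the broadcast is a shuffle):

```python
# Column-oriented forward substitution sketch (illustrative only): process
# the columns of the unit lower-triangular factor left to right; the value
# finalized by column j is broadcast to the remaining rows, which on the GPU
# is done with a warp shuffle instead of shared memory.

def forward_solve_by_columns(L, b):
    nb = len(L)
    b = b[:]
    for j in range(nb):                  # columns, left to right
        bj = b[j]                        # value broadcast from this column
        for i in range(j + 1, nb):       # one warp thread per remaining row;
            b[i] -= L[i][j] * bj         # this range shrinks as j grows
    return b
```

Note how the inner range shrinks with j, which is exactly the decreasing parallelism the slide describes.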

SLIDE 26

Approach for triangular solve (for nb <= 32)

All threads of the warp execute this line

SLIDE 27

Prefetching

The first prefetch is outside the loop; subsequent prefetches are inside the loop

SLIDE 28

Approach for triangular solve (for nb > 32)

SLIDE 29

Performance Results

SLIDE 30

Platform and Testbed

  • We ran all our experiments on Sandy Bridge host processors with an

NVIDIA K40 GPU (NASA Pleiades).

  • The codes were written in CUDA Fortran and compiled and executed using the PGI

Fortran Compiler 15.10 and CUDA Toolkit 7.5.

  • Performance testing was done on matrix sparsity structures that occur

often in the FUN3D application. Using a sparsity structure, we generated matrices with different block sizes. The blocks were populated with random values. To keep the memory requirement manageable, we only work with a small matrix for large block sizes.

SLIDE 31

SLIDE 32

SLIDE 33

SLIDE 34

SLIDE 35

Future Work and Conclusion

  • It is possible to benefit from emerging architectures, but doing so requires

restructuring and designing new algorithms that map to the underlying architecture.

  • Explore new sparse storage formats tuned for GPUs
  • Explore NVIDIA Pascal
  • Device memory bandwidth: 720 GB/s