An Optimized GPU Implementation of a Multicolor Point-Implicit Linear Solver - PowerPoint PPT Presentation



SLIDE 1

An Optimized GPU Implementation of a Multicolor Point-Implicit Linear Solver

SLIDE 2

Outline

  • Motivation
  • Problem and Challenge
  • Approach
  • Performance Results
  • Future Work and Conclusion
SLIDE 3

Motivation

SLIDE 4

FUN3D

  • FUN3D is a suite of computational fluid dynamics software developed

at the NASA Langley Research Center to solve the Navier-Stokes equations for a broad range of aerodynamics applications

  • For high-energy flows, traditional perfect gas assumptions are invalid

and the system must be further expanded to accommodate finite-rate chemistry models. For these reasons, accurate and efficient simulations of complex aerodynamic flows are challenging and require significant computational resources.

SLIDE 5

FUN3D

  • FUN3D uses an implicit time-integration strategy on unstructured grids

that requires the solution of a large, tightly coupled system of block-sparse linear equations at each time step

  • Computing solutions to these linear systems accounts for a significant

fraction of the overall runtime in virtually all FUN3D simulations.

SLIDE 6

Problem and Challenges

SLIDE 7

Linear Solver

  • Implicit scheme results in linear systems of equations:

π‘©π’š = 𝒄, 𝑩 is a π‘œ Γ— π‘œ block matrix for π‘œ grid points; block is of size π‘œπ‘ Γ— π‘œπ‘

  • Matrix A is segregated into two separate matrices:

A ≡ D + O, where D and O represent the diagonal and off-diagonal blocks of A

  • Array for storing D: D(nb, nb, n)

Prior to performing each linear solve, each diagonal block D_j is decomposed in place into lower and upper triangular matrices L_j and U_j, 1 ≤ j ≤ n
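The per-block factorization and the forward/backward solve it enables can be sketched in plain Python (an illustrative sketch only; the actual solver is CUDA Fortran, and the function names here are mine):

```python
# Illustrative sketch (not the FUN3D CUDA Fortran code): in-place LU
# factorization of one dense nb x nb diagonal block, Doolittle style without
# pivoting, leaving the unit lower factor and the upper factor packed in the
# same storage.

def lu_in_place(block):
    nb = len(block)
    for k in range(nb):
        for i in range(k + 1, nb):
            block[i][k] /= block[k][k]            # multiplier l_ik
            for j in range(k + 1, nb):
                block[i][j] -= block[i][k] * block[k][j]
    return block

def solve_block(block, rhs):
    """Solve (L U) y = rhs using the packed factors: forward, then backward."""
    nb = len(block)
    z = rhs[:]
    for i in range(nb):                            # forward solve with unit L
        for j in range(i):
            z[i] -= block[i][j] * z[j]
    for i in reversed(range(nb)):                  # backward solve with U
        for j in range(i + 1, nb):
            z[i] -= block[i][j] * z[j]
        z[i] /= block[i][i]
    return z
```

For example, factoring the 2 × 2 block [[4, 3], [6, 3]] in place leaves the multiplier 1.5 below the diagonal and the pivot -1.5 on it, after which solve_block recovers the solution of the original block system.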

SLIDE 8

Sparse Block-Matrix Storage Format for O
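A common realization of such a storage format is a block-CSR (BSR) layout, where the row pointers and column indices address nb × nb blocks rather than scalars. The sketch below is illustrative Python with names of my choosing; the deck does not show FUN3D's exact layout:

```python
# Hypothetical block-CSR (BSR) sketch for the block-sparse off-diagonal
# matrix: row_ptr/col_idx index nb x nb blocks, and vals stores the dense
# blocks contiguously. Field names are illustrative, not FUN3D's.

def bsr_matvec(row_ptr, col_idx, vals, x, nb):
    """Compute y = O @ x for a block-sparse matrix with nb x nb dense blocks."""
    n = len(row_ptr) - 1                      # number of block rows
    y = [0.0] * (n * nb)
    for row in range(n):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            col = col_idx[k]
            blk = vals[k]                     # one nb x nb dense block
            for i in range(nb):
                for j in range(nb):
                    y[row * nb + i] += blk[i][j] * x[col * nb + j]
    return y
```

Each non-zero block touches a contiguous slice of x, which is what makes coalesced loads possible on the GPU.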

SLIDE 9

Multicolor Solver

  • Grid points are colored such that no two adjacent points are assigned the

same color.

  • Unknowns of the same color carry no data dependency and may be

updated in parallel.

  • Updates of unknowns of each color use the latest updated values at grid

points corresponding to other colors.

  • The overall process may be repeated using several outer sweeps over the

entire system.
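The coloring step above can be sketched as follows (a minimal Python illustration assuming a greedy first-fit scheme; the deck does not name the exact coloring algorithm used):

```python
# Greedy first-fit coloring sketch (assumed scheme, for illustration only):
# adjacent grid points never share a color, so all unknowns of one color
# form an independent set that can be updated in parallel.

def greedy_color(adjacency):
    """adjacency: dict mapping each node to the set of its neighbors."""
    color = {}
    for node in sorted(adjacency):
        used = {color[nbr] for nbr in adjacency[node] if nbr in color}
        c = 0
        while c in used:                 # first color unused by any neighbor
            c += 1
        color[node] = c
    return color
```

On a path graph 0-1-2 this yields two colors; the solver then sweeps the colors in sequence, using the latest values from the other colors, as described above.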

SLIDE 10

SLIDE 11

Distribution of non-zeros in the sample matrix

SLIDE 12

Solver Challenges

  • With an average of only a few off-diagonal blocks per row, the arithmetic intensity of the

computation is quite low (≈ 0.5 for a matrix of dimension 983633) – memory-bound on the GPU

  • The number of rows associated with a color can vary significantly. Consequently, the

amount of parallelism available for different colors also varies significantly

  • To support strong scalability, single-node performance for light workloads should also

be good

  • The solver uses indirect memory addressing
  • The implementation must support a broad range of block sizes
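The ≈ 0.5 figure can be checked with a back-of-envelope estimator. The assumptions below are mine: single-precision values (4 bytes, as suggested by the S-prefixed BLAS routines mentioned later in the deck), 4-byte block column indices, x reused from cache, and y written once.

```python
# Back-of-envelope arithmetic intensity (flops per byte) for the block
# sparse matrix-vector product. Assumptions (mine, not from the deck):
# single-precision values, 4-byte column indices, x cached, y written once.

def spmv_arithmetic_intensity(n_rows, nnz_blocks, nb, bytes_per_word=4.0):
    flops = 2.0 * nnz_blocks * nb * nb                     # one FMA per entry
    bytes_moved = (nnz_blocks * nb * nb * bytes_per_word   # block values
                   + nnz_blocks * 4.0                      # column indices
                   + n_rows * nb * bytes_per_word)         # writing y
    return flops / bytes_moved
```

With a handful of blocks per row the estimate stays just under 2 / bytes_per_word, i.e. a bit below 0.5 in single precision, consistent with the figure quoted above; in double precision it would drop toward 0.25.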
SLIDE 13

Solver Computation

SLIDE 14

Solver Computation – MPI Implementation

subroutine fun3d_solver
  do color = 1, nc
    ! Process rows of each color
    ! MPI communication: update portion of x array
  end do
end subroutine fun3d_solver

This computation is identical to the sequential computation

SLIDE 15

Approach

SLIDE 16

Solver Computation

  • Sparse block-matrix (A) vector operation
  • Multiple triangular solves using the dense blocks of matrix D

SLIDE 17

Naive Approach

Assign a thread to process a row block. As all row blocks can be processed independently, the parallelism is determined by the number of row blocks.

[Diagram: forward & backward substitution with the diagonal blocks; sparse off-diagonal matrix-vector product]

SLIDE 18

Solver Computation

  • Use sparse BLAS: cusparseSbsrmv
  • Use batched dense BLAS: cublasStrsmBatched

SLIDE 19

Sparse and Dense BLAS Formulation

  • Advantage: makes the code portable across different architectures while

retaining performance

  • Issue: for the block sizes of interest, the BLAS routines do not perform well
  • Solution: write optimized BLAS routines for the block sizes of interest
SLIDE 20

Overall Structure of Optimized Solver

SLIDE 21

Approach for sparse matrix vector (for nb = 1 to 16)

SLIDE 22

Approach for sparse matrix – vector (for nb = 1 to 16)

SLIDE 23

Approach for sparse matrix – vector (for nb = 1 to 16)

SLIDE 24

Approach for sparse matrix – vector (for nb = 17 to 64)

Assign four warps to four consecutive non-zero blocks in a row. Once the processing of the four blocks is complete, move to the next set of four blocks in the same row. Once the whole row is processed, aggregate the partial sums held by the four warps using shared memory.
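This scheme can be emulated sequentially (illustrative Python; on the GPU each partial sum lives in a warp's registers and the final aggregation goes through shared memory):

```python
# Sequential emulation (illustration only) of the four-warp scheme for large
# blocks: each "warp" w owns every fourth non-zero block of the row and keeps
# a private partial sum; the partial sums are then aggregated, which on the
# GPU happens through shared memory.

def row_partial_sums(blocks, x_pieces, nb, n_warps=4):
    """blocks: nb x nb non-zero blocks of one row; x_pieces: matching x slices."""
    partial = [[0.0] * nb for _ in range(n_warps)]
    for k, (blk, xs) in enumerate(zip(blocks, x_pieces)):
        w = k % n_warps                   # warp that owns this block
        for i in range(nb):
            for j in range(nb):
                partial[w][i] += blk[i][j] * xs[j]
    # aggregation step across the four partial sums
    return [sum(p[i] for p in partial) for i in range(nb)]
```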

SLIDE 25

Approach for triangular solve (for nb <= 32)

Process the columns of the lower triangular factor of D from left to right using a single warp. The amount of parallelism available to the warp decreases as we move from left to right. A value from the previous column's processing (b(j1)) must be broadcast – this uses the shuffle instruction.
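A scalar sketch of this column-oriented substitution (Python for illustration; on the GPU the inner loop maps to the threads of the warp, and the broadcast is a shuffle):

```python
# Column-oriented forward substitution sketch (illustrative only): process
# the columns of the unit lower-triangular factor left to right; the value
# finalized by column j is broadcast to the remaining rows, which on the GPU
# is done with a warp shuffle instead of shared memory.

def forward_solve_by_columns(L, b):
    nb = len(L)
    b = b[:]
    for j in range(nb):                  # columns, left to right
        bj = b[j]                        # value broadcast from this column
        for i in range(j + 1, nb):       # one warp thread per remaining row;
            b[i] -= L[i][j] * bj         # this range shrinks as j grows
    return b
```

Note how the inner range shrinks with j, which is exactly the decreasing parallelism the slide describes.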

SLIDE 26

Approach for triangular solve (for nb <= 32)

All threads of the warp execute this line

SLIDE 27

Prefetching

The first prefetch is outside the loop; subsequent prefetches are inside the loop

SLIDE 28

Approach for triangular solve (for nb > 32)

SLIDE 29

Performance Results

SLIDE 30

Platform and Testbed

  • We ran all our experiments on Sandy Bridge host processors with an

NVIDIA K40 GPU (NASA Pleiades).

  • The codes were written in CUDA Fortran and compiled and executed using the PGI

Fortran Compiler 15.10 and CUDA Toolkit 7.5.

  • Performance testing was done on matrix sparsity structures that occur

often in the FUN3D application. Using a sparsity structure, we generated matrices with different block sizes. The blocks were populated with random values. To keep the memory requirement manageable, we only work with a small matrix for large block sizes.

SLIDE 31

SLIDE 32

SLIDE 33

SLIDE 34

SLIDE 35

Future Work and Conclusion

  • It is possible to benefit from emerging architectures, but doing so requires

restructuring and designing new algorithms that map to the underlying architecture.

  • Explore new sparse storage formats tuned for GPUs
  • Explore NVIDIA Pascal
  • Device memory bandwidth: 720 GB/s