  1. An Optimized GPU Implementation of a Multicolor Point-Implicit Linear Solver

  2. Outline • Motivation • Problem and Challenge • Approach • Performance Results • Future Work and Conclusion

  3. Motivation

  4. FUN3D • FUN3D is a suite of computational fluid dynamics software developed at the NASA Langley Research Center to solve the Navier-Stokes equations for a broad range of aerodynamics applications • For high-energy flows, traditional perfect-gas assumptions are invalid and the system must be expanded further to accommodate finite-rate chemistry models. For these reasons, accurate and efficient simulations of complex aerodynamic flows are challenging and require significant computational resources.

  5. FUN3D • FUN3D uses an implicit time-integration strategy on unstructured grids that requires the solution of a large, tightly coupled system of block-sparse linear equations at each time step • Computing solutions to these linear systems accounts for a significant fraction of the overall runtime in virtually all FUN3D simulations.

  6. Problem and Challenges

  7. Linear Solver • Implicit scheme results in linear systems of equations: Ax = b, where A is an n × n block matrix for n grid points; each block is of size nb × nb • Matrix A is segregated into two separate matrices: A ≡ D + O, where D and O represent the diagonal and off-diagonal blocks of A • Array for storing D: D(nb, nb, n) • Prior to performing each linear solve, each diagonal block D_i is decomposed in-place into lower and upper triangular matrices L_i and U_i, 1 ≤ i ≤ n
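The in-place factorization of the diagonal blocks can be sketched on the CPU as follows (a minimal Python reference, assuming the D(nb, nb, n) array layout above and no pivoting; the actual FUN3D kernels are written in CUDA Fortran):

```python
import numpy as np

def factor_diagonal_blocks(D):
    """In-place LU factorization (Doolittle, no pivoting) of each
    nb x nb diagonal block, mirroring the D(nb, nb, n) layout.
    After the call, block i holds L_i below the diagonal (unit
    diagonal implicit) and U_i on and above the diagonal."""
    nb, _, n = D.shape
    for i in range(n):
        B = D[:, :, i]
        for k in range(nb - 1):
            B[k+1:, k] /= B[k, k]                              # column of L_i
            B[k+1:, k+1:] -= np.outer(B[k+1:, k], B[k, k+1:])  # Schur complement update
    return D
```

Each block is independent, which is what lets the batched GPU version process many blocks concurrently.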

  8. Sparse Block-Matrix Storage Format for O
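The slide's figure is not reproduced here, but a block-CSR (BSR) layout of the kind commonly used for such matrices can be sketched in Python; the names `vals`, `col_idx`, and `row_ptr` are illustrative, not necessarily FUN3D's:

```python
import numpy as np

def bsr_from_blocks(n, nb, blocks):
    """Pack the off-diagonal blocks of O into a block-CSR layout:
    vals(nb, nb, nnzb) plus block column indices and row pointers.
    `blocks` maps (row, col) -> nb x nb ndarray, with row != col."""
    entries = sorted(blocks.items(), key=lambda kv: kv[0])
    col_idx = np.array([c for (r, c), _ in entries], dtype=np.int64)
    vals = (np.stack([b for _, b in entries], axis=2)
            if entries else np.zeros((nb, nb, 0)))
    row_ptr = np.zeros(n + 1, dtype=np.int64)
    for (r, _), _ in entries:
        row_ptr[r + 1] += 1          # count blocks per row...
    row_ptr = np.cumsum(row_ptr)     # ...then prefix-sum into pointers
    return vals, col_idx, row_ptr
```

Row i's off-diagonal blocks then live at positions `row_ptr[i]:row_ptr[i+1]` of `vals` and `col_idx`.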

  9. Multicolor Solver • Grid points are colored such that no two adjacent points are assigned the same color. • Unknowns of the same color carry no data dependency and may be updated in parallel. • Updates of unknowns of each color use the latest updated values at grid points corresponding to other colors. • The overall process may be repeated using several outer sweeps over the entire system.
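The coloring step itself can be sketched with a simple greedy heuristic (illustrative Python only; the slide does not specify which coloring algorithm FUN3D uses):

```python
def greedy_coloring(adj):
    """Greedy graph coloring of the grid points so that no two
    adjacent points share a color; all unknowns of one color can
    then be updated in parallel. `adj` maps point -> set of
    neighboring points."""
    color = {}
    for v in sorted(adj):
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:             # smallest color not used by a neighbor
            c += 1
        color[v] = c
    return color
```

The rows of each color are then processed together, one color at a time, with later colors reading the freshly updated values of earlier ones.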

  10. Distribution of non-zeros in the sample matrix

  11. Solver Challenges • With an average of only a few off-diagonal blocks per row, the arithmetic intensity of the computation is quite low (≈ 0.5 for a matrix of size 983633) – memory bound on GPU • The number of rows associated with a color can vary significantly. Consequently, the amount of parallelism available for different colors varies significantly • To support strong scalability, single-node performance for light workloads should also be good • The solver uses indirect memory addressing • The implementation must support a broad range of block sizes

  12. Solver Computation

  13. Solver Computation – MPI Implementation This computation is identical to the sequential computation:
      subroutine fun3d_solver
        do color = 1, nc
          ! Process rows of each color
          ! MPI communication: update portion of x array
        end do
      end subroutine fun3d_solver

  14. Approach

  15. Solver Computation Sparse block-matrix (A) vector operation; multiple triangular solves using dense blocks of matrix D

  16. Naive Approach [Figure: x ← D⁻¹(r − Ox), with forward & backward substitution per block] Assign a thread to process a row block. As all rows can be processed independently, the parallelism is determined by the number of row blocks.
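A CPU reference for one multicolor sweep might look like the following (hypothetical Python sketch; the GPU version assigns one thread per row block and replaces the dense solve with forward and backward substitution using the prefactored L_i and U_i):

```python
import numpy as np

def color_sweep(colors, rows_of_color, D, O, x, r):
    """One multicolor sweep: each row block i of the current color
    computes x_i <- D_i^{-1} (r_i - sum_j O_ij x_j).
    `O[i]` lists the (j, block) off-diagonal entries of row i;
    `D[:, :, i]` is the dense diagonal block; `x` and `r` are lists
    of length-nb vectors."""
    for c in colors:
        for i in rows_of_color[c]:       # rows of one color are independent:
            rhs = r[i].copy()            # one GPU thread per row block
            for j, blk in O.get(i, []):
                rhs -= blk @ x[j]        # latest values from the other colors
            x[i] = np.linalg.solve(D[:, :, i], rhs)
    return x
```

Repeating the sweep gives the outer iterations mentioned earlier; on a diagonally dominant system it behaves like block Gauss-Seidel.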

  17. Solver Computation Use sparse BLAS: cusparseSbsrmv Use batched dense BLAS: cublasStrsmBatched

  18. Sparse and Dense BLAS Formulation • Advantage: makes the code portable across different architectures while preserving performance • Issue: for the block sizes of interest, the BLAS routines do not perform well • Solution: write optimized BLAS-like kernels for the block sizes of interest

  19. Overall Structure of Optimized Solver

  20. Approach for sparse matrix – vector (for nb = 1 to 16)

  21. Approach for sparse matrix – vector (for nb = 1 to 16)

  22. Approach for sparse matrix – vector (for nb = 1 to 16)

  23. Approach for sparse matrix – vector (for nb = 17 to 64) Assign warps to four consecutive non-zero blocks in a row. Once the processing of the four blocks is complete, move to the next set of four blocks in the same row. Once the whole row has been processed, aggregate the partial sums held by the four warps using shared memory.
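The block-striding and partial-sum reduction can be modeled sequentially as follows (a CPU sketch under the assumption of four warps per row; on the GPU the final reduction goes through shared memory):

```python
import numpy as np

def row_block_matvec_chunked(blocks, xs, nwarps=4):
    """Sketch of the nb = 17..64 scheme for one row: each of `nwarps`
    warps takes one of four consecutive non-zero blocks, strides
    ahead by four, and the per-warp partial sums are reduced at the
    end. `blocks[k]` is an nb x nb ndarray and `xs[k]` the matching
    slice of x."""
    nb = blocks[0].shape[0]
    partial = [np.zeros(nb) for _ in range(nwarps)]
    for k, (blk, xk) in enumerate(zip(blocks, xs)):
        partial[k % nwarps] += blk @ xk   # warp k % 4 handles every 4th block
    return sum(partial)                   # shared-memory reduction on the GPU
```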

  24. Approach for triangular solve (for nb <= 32) Process the columns of the lower triangular matrix of D from left to right using a single warp. The amount of parallelism available to the warp decreases as we move from left to right. A value from the previous column's processing (b(j1)) must be broadcast – use the shuffle instruction.

  25. Approach for triangular solve (for nb <= 32) All threads of the warp execute this line
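The column-oriented access pattern that the talk maps to a single warp can be sketched as follows (Python reference for the forward sweep only, assuming a unit lower triangular factor; the vectorized update stands in for the warp-wide shuffle broadcast):

```python
import numpy as np

def forward_subst_columnwise(L, b):
    """Column-oriented forward substitution: columns of L are
    processed left to right. Once b[j] is final it is 'broadcast'
    (shuffle on the GPU) and the remaining entries are updated in
    parallel; the active set shrinks with each column, matching the
    decreasing parallelism noted on the slide. L is unit lower
    triangular; b is overwritten with the solution."""
    n = len(b)
    for j in range(n):
        b[j+1:] -= L[j+1:, j] * b[j]   # remaining "threads" update in parallel
    return b
```

The backward sweep with U is symmetric, processing columns right to left.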

  26. Prefetching First prefetch outside the loop; subsequent prefetches inside the loop

  27. Approach for triangular solve (for nb > 32)

  28. Performance Results

  29. Platform and Testbed • We ran all our experiments on Sandy Bridge host processors with an NVIDIA K40 GPU (NASA Pleiades) • The codes were written in CUDA Fortran and were compiled and executed using the PGI Fortran Compiler 15.10 and CUDA Toolkit 7.5 • Performance testing was done on matrix sparsity structures that occur often in the FUN3D application. Using a given sparsity structure, we generated matrices with different block sizes and populated the blocks with random values. To keep the memory requirement manageable, we worked with a smaller matrix for large block sizes.

  30. Future Work and Conclusion • It is possible to benefit from emerging architectures, but doing so requires restructuring and designing new algorithms that map to the underlying architecture • Explore new sparse storage formats tuned for GPUs • Explore NVIDIA Pascal • Device memory bandwidth: 720 GB/s
