

SLIDE 1

Performance Portable Line Smoother for Multiphysics Problems using Compact Batched BLAS

Kyungjoo Kim, Mehmet Deveci, Andrew M. Bradley, Simon D. Hammond, Sivasankaran Rajamanickam
Center for Computing Research, Sandia National Labs

BLIS Retreat, Sep 18 - 19, 2017

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525.

SAND NO. SAND2017-10026 C

SLIDE 2

• Introduction: Line Smoother
• Compact Data Layout
• Numerical Experiments
• Summary

SLIDE 3

Line Smoother

SLIDE 4

Line Smoother

Consider a block sparse system of equations Ax = b arising from coupled PDEs. Splitting A = M − S gives

    (M − S)x = b  ⇒  Mx = b + Sx  ⇒  x = M^{-1}(b + Sx) = x + M^{-1}(b − Ax),

where M is a set of block tridiagonal matrices corresponding to the extracted lines of elements. This system is solved iteratively:

    x_{k+1} = M^{-1}(b + S x_k).

[Figure: an m × n × k structured grid, with a set of line elements extracted along the streamline direction.]
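For reference, a minimal sketch of one smoother application in code (our own illustration, not the SPARC or KokkosKernels implementation; apply_A and solve_M are assumed to be supplied by the application):

#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

// nsweep fixed sweeps of x <- x + M^{-1}(b - A x).
// apply_A computes y = A x; solve_M solves M y = r line by line.
void line_smooth(const std::function<void(const Vec&, Vec&)>& apply_A,
                 const std::function<void(const Vec&, Vec&)>& solve_M,
                 const Vec& b, Vec& x, int nsweep) {
  Vec r(b.size()), dx(b.size());
  for (int s = 0; s < nsweep; ++s) {
    apply_A(x, r);                                                  // r = A x
    for (std::size_t i = 0; i < b.size(); ++i) r[i] = b[i] - r[i];  // r = b - A x
    solve_M(r, dx);                                                 // dx = M^{-1} r
    for (std::size_t i = 0; i < b.size(); ++i) x[i] += dx[i];       // x += dx
  }
}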

SLIDE 5

Considerations for Line Smoother Implementation

• A fixed number of these iterations, x_{k+1} = M^{-1}(b + S x_k), is used as a preconditioner.
• M is factorized once per solve (or once per nonlinear iteration) and applied multiple times.
• Typical block sizes b are 3, 5, 10, and 15, arising from scientific applications, e.g., elasticity, ideal gas, and multiphysics fluid problems.
• Memory usage is limited to 16 GB, i.e., MCDRAM on KNL and GPU device memory. Under this constraint, typical local problem sizes (m × n × k) are 128 × 128 × 128 for b = 3, 5 and 64 × 64 × 128 for b = 10, 15.
• Batch parallelism: a sequential algorithm runs inside a parallel for.
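As a back-of-the-envelope check of this memory budget (our own arithmetic, not from the slides): each line stores k diagonal blocks and 2(k − 1) off-diagonal blocks of size b × b, so the factored M occupies

\[
\underbrace{m\,n}_{\text{lines}} \times (3k-2) \times b^2 \times 8\,\text{B}
\approx
\begin{cases}
1.25\ \text{GB}, & b = 5,\; m = n = k = 128,\\
2.82\ \text{GB}, & b = 15,\; m = n = 64,\; k = 128,
\end{cases}
\]

leaving ample headroom under 16 GB for right-hand sides and workspace.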

[Figure: one block tridiagonal line T with diagonal blocks Â_r, super-diagonal blocks B̂_r, and sub-diagonal blocks Ĉ_r; the batch comprises T_0, T_1, ..., T_{m·n−1}.]

Initialization of the line smoother (block tridiagonal factorization):

    for T in {T_0, T_1, ..., T_{m×n−1}} do in parallel
        for r ← 0 to k − 2 do
            Â_r     := LU(Â_r)
            B̂_r     := L^{−1} B̂_r
            Ĉ_r     := Ĉ_r U^{−1}
            Â_{r+1} := Â_{r+1} − Ĉ_r B̂_r
        Â_{k−1} := LU(Â_{k−1})
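For concreteness, a minimal serial C++ sketch of this recursion for one line (our own illustration: unpivoted LU and plain triple loops stand in for the batched LU/Trsm/Gemm kernels, and the struct layout is an assumption):

#include <vector>

// One block tridiagonal line: k diagonal blocks A[r], super-diagonals B[r],
// sub-diagonals C[r], each a bs x bs row-major matrix.
struct Line {
  int k, bs;
  std::vector<double> A, B, C;  // sizes k*bs*bs, (k-1)*bs*bs, (k-1)*bs*bs
  double* blk(std::vector<double>& v, int r) { return v.data() + r * bs * bs; }
};

// In-place unpivoted LU of a bs x bs block (L unit lower, U upper).
static void lu(double* A, int bs) {
  for (int p = 0; p < bs; ++p)
    for (int i = p + 1; i < bs; ++i) {
      A[i * bs + p] /= A[p * bs + p];
      for (int j = p + 1; j < bs; ++j)
        A[i * bs + j] -= A[i * bs + p] * A[p * bs + j];
    }
}

// B := L^{-1} B, with L the unit lower factor stored in A.
static void trsm_l(const double* A, double* B, int bs) {
  for (int j = 0; j < bs; ++j)
    for (int i = 1; i < bs; ++i)
      for (int p = 0; p < i; ++p)
        B[i * bs + j] -= A[i * bs + p] * B[p * bs + j];
}

// C := C U^{-1}, with U the upper factor stored in A.
static void trsm_r(const double* A, double* C, int bs) {
  for (int i = 0; i < bs; ++i)
    for (int j = 0; j < bs; ++j) {
      for (int p = 0; p < j; ++p)
        C[i * bs + j] -= C[i * bs + p] * A[p * bs + j];
      C[i * bs + j] /= A[j * bs + j];
    }
}

// A2 := A2 - C * B (gemm update of the next diagonal block).
static void gemm_update(const double* C, const double* B, double* A2, int bs) {
  for (int i = 0; i < bs; ++i)
    for (int j = 0; j < bs; ++j)
      for (int p = 0; p < bs; ++p)
        A2[i * bs + j] -= C[i * bs + p] * B[p * bs + j];
}

// The recursion from the slide, for one line T.
void factorize_line(Line& T) {
  for (int r = 0; r < T.k - 1; ++r) {
    lu(T.blk(T.A, r), T.bs);
    trsm_l(T.blk(T.A, r), T.blk(T.B, r), T.bs);
    trsm_r(T.blk(T.A, r), T.blk(T.C, r), T.bs);
    gemm_update(T.blk(T.C, r), T.blk(T.B, r), T.blk(T.A, r + 1), T.bs);
  }
  lu(T.blk(T.A, T.k - 1), T.bs);
}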

SLIDE 6

Objective: Develop Performance Portable Line Preconditioner

Architecture Constraints:

• Wide vector length (AVX512) on KNL for small problem sizes.
• A large number of CUDA threads, including vector lanes (132K), on GPUs.
• Hierarchical parallelism to improve concurrency when solving each tridiagonal system.
• Different data access patterns on different architectures: core-affine access on KNL vs. coalesced access on GPUs.

SLIDE 7

Problems Using Vendor DLA Libraries

Batched BLAS

• Batched BLAS computes many dense problems in parallel; conceptually, it is similar to BLAS calls within a parallel for.
• In the line smoother context, a batched tridiagonal factor/solve expressed as a sequence of batch operations is not efficient:
  • No data locality is exploited across the sequence of batch calls.
  • Each parallel for implicitly imposes a synchronization.

BLAS within OpenMP

• BLAS is not optimized for 3 × 3 matrix computations.
• cuBLAS provides only a batch interface.
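To make the "BLAS within a parallel for" picture concrete, here is a minimal sketch (our own, not from the slides) of one batch step emulated with LAPACKE inside an OpenMP loop; the implicit barrier at the end of the loop is exactly the synchronization noted above:

#include <cstddef>
#include <lapacke.h>
#include <vector>

int main() {
  const int nbatch = 16384, bs = 5;
  // nbatch small row-major blocks, stored back to back.
  std::vector<double> A(static_cast<std::size_t>(nbatch) * bs * bs, 0.01);
  std::vector<lapack_int> piv(static_cast<std::size_t>(nbatch) * bs);
  for (int b = 0; b < nbatch; ++b)   // make each block diagonally dominant
    for (int i = 0; i < bs; ++i)     // so the factorization is well posed
      A[static_cast<std::size_t>(b) * bs * bs + i * bs + i] = 1.0;

  // One "batch" step emulated as a parallel loop over tiny LAPACK calls.
#pragma omp parallel for
  for (int b = 0; b < nbatch; ++b)
    LAPACKE_dgetrf(LAPACK_ROW_MAJOR, bs, bs,
                   &A[static_cast<std::size_t>(b) * bs * bs], bs,
                   &piv[static_cast<std::size_t>(b) * bs]);
  // Implicit barrier here: a subsequent batched trsm/gemm starts only after
  // every factorization has finished, and each block has likely been evicted
  // from cache, so no locality carries over between the batch calls.
  return 0;
}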

SLIDE 8

Compact Data Layout

SLIDE 9

Compact Data Layout (SIMD type)

• Blocks are too small to use wide vector units → vectorize across batch instances.
• Compact BLAS uses packed vectors as the computing unit instead of scalars.
• By overloading the arithmetic operators, the scalar BLAS code is reused and naturally vectorized (see the reuse sketch after the code below).

#include <immintrin.h>

// computing unit: one AVX512 register holding 8 doubles
struct VectorAVX512D {
  union {
    __m512d v;
    double s[8];
  };
};

// overload arithmetic operators (+ - * /); operator+ shown here
VectorAVX512D operator+(VectorAVX512D const &a, VectorAVX512D const &b) {
  VectorAVX512D r;
  r.v = _mm512_add_pd(a.v, b.v);
  return r;
}
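To illustrate the reuse (our own sketch, not the KokkosKernels source): given analogous operator- and operator* overloads, a scalar-looking gemm template instantiated with the vector type updates eight interleaved matrices at once.

// Assumed analogous overloads (the slide lists + - * /, shows only +):
VectorAVX512D operator-(VectorAVX512D const &a, VectorAVX512D const &b) {
  VectorAVX512D r; r.v = _mm512_sub_pd(a.v, b.v); return r;
}
VectorAVX512D operator*(VectorAVX512D const &a, VectorAVX512D const &b) {
  VectorAVX512D r; r.v = _mm512_mul_pd(a.v, b.v); return r;
}

// Scalar-looking update C := C - A*B on bs x bs row-major blocks. With
// T = double this is ordinary scalar code; with T = VectorAVX512D every
// "scalar" operation acts on 8 interleaved matrices simultaneously.
template <typename T>
void compact_gemm_update(const T *A, const T *B, T *C, int bs) {
  for (int i = 0; i < bs; ++i)
    for (int j = 0; j < bs; ++j)
      for (int p = 0; p < bs; ++p)
        C[i * bs + j] = C[i * bs + j] - A[i * bs + p] * B[p * bs + j];
}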

[Figure: the entries of four 3 × 3 blocks A1..A4, shown first matrix by matrix and then interleaved in the compact layout, where the same (i, j) entry of consecutive matrices occupies consecutive vector lanes.]

In collaboration with the Intel MKL team, the compact batched BLAS format was adopted as a new batch interface.
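The packed layout can be summarized by a single index map. A small sketch, where the exact ordering is our own assumption consistent with the figure above:

#include <cstddef>

// Offset of entry (i, j) of matrix g in a compact batch of bs x bs
// row-major blocks packed with vector length v: matrices are grouped into
// packs of v, and the same (i, j) entry of the v matrices in a pack
// occupies v consecutive lanes.
inline std::size_t compact_offset(int g, int i, int j, int bs, int v) {
  const int pack = g / v;  // which pack of v matrices
  const int lane = g % v;  // vector lane within the pack
  return (static_cast<std::size_t>(pack) * bs * bs + i * bs + j) * v + lane;
}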

SLIDE 10

Line Smoother Impl. using Compact Data Layout

[Figure: the corresponding blocks Â_r, B̂_r, Ĉ_r, Â_{r+1} of a pair of lines T_0 and T_1 are packed so that matching entries (α^{T_0}_{00}, α^{T_1}_{00}, α^{T_0}_{01}, α^{T_1}_{01}, ...) occupy adjacent vector lanes.]

    for each pair (T_{2i}, T_{2i+1}) in {(T_0, T_1), (T_2, T_3), ..., (T_{m×n−2}, T_{m×n−1})} do in parallel
        for r ← 0 to k − 2 do
            Â_r     := LU(Â_r)
            B̂_r     := L^{−1} B̂_r
            Ĉ_r     := Ĉ_r U^{−1}
            Â_{r+1} := Â_{r+1} − Ĉ_r B̂_r
        Â_{k−1} := LU(Â_{k−1})

Layered APIs for Kokkos Hierarchical Parallelism

• Serial (Vector): a single thread is mapped to a single instance of the parallel batch.
• Team: a team of threads is mapped to a single instance of the parallel batch.
• Device: the entire device is used, as in the current batch interfaces provided by vendors.

The Serial and Team APIs are used to compose the batched block tridiagonal factorization/solve; a schematic sketch follows.
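A schematic sketch of the Serial and Team mappings in Kokkos (our own illustration; it approximates, but is not, the exact KokkosKernels interface):

#include <Kokkos_Core.hpp>

// Unpivoted LU of one bs x bs block, executed by a single thread
// (the "Serial"/Vector layer).
template <class BlockView>
KOKKOS_INLINE_FUNCTION void serial_lu(const BlockView &A, const int bs) {
  for (int p = 0; p < bs; ++p)
    for (int i = p + 1; i < bs; ++i) {
      A(i, p) /= A(p, p);
      for (int j = p + 1; j < bs; ++j) A(i, j) -= A(i, p) * A(p, j);
    }
}

int main(int argc, char *argv[]) {
  Kokkos::initialize(argc, argv);
  {
    constexpr int N = 16384, bs = 5;  // batch and block size (assumed values)
    Kokkos::View<double ***> blocks("blocks", N, bs, bs);
    // Diagonally dominant test data so the unpivoted LU is well posed.
    Kokkos::parallel_for("init", N, KOKKOS_LAMBDA(const int b) {
      for (int i = 0; i < bs; ++i)
        for (int j = 0; j < bs; ++j)
          blocks(b, i, j) = (i == j) ? bs + 1.0 : 1.0;
    });

    // Serial layer: one thread (or vector lane) per batch instance.
    Kokkos::parallel_for("lu_serial", N / 2, KOKKOS_LAMBDA(const int b) {
      serial_lu(Kokkos::subview(blocks, b, Kokkos::ALL(), Kokkos::ALL()), bs);
    });

    // Team layer: one thread team per batch instance; the team cooperates
    // on the rows of each elimination step.
    using policy_t = Kokkos::TeamPolicy<>;
    Kokkos::parallel_for("lu_team", policy_t(N / 2, Kokkos::AUTO),
      KOKKOS_LAMBDA(const policy_t::member_type &member) {
        const int b = N / 2 + member.league_rank();
        auto A = Kokkos::subview(blocks, b, Kokkos::ALL(), Kokkos::ALL());
        for (int p = 0; p < bs; ++p) {
          Kokkos::parallel_for(Kokkos::TeamThreadRange(member, p + 1, bs),
            [&](const int i) {
              A(i, p) /= A(p, p);
              for (int j = p + 1; j < bs; ++j) A(i, j) -= A(i, p) * A(p, j);
            });
          member.team_barrier();
        }
      });
  }
  Kokkos::finalize();
  return 0;
}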

SLIDE 11

Line Smoother Impl. using Compact Data Layout

(Figure and factorization algorithm repeated from the previous slide.)

Some Issues

• Because vectorization is performed across matrices, pivoting in the LU is not feasible. → For preconditioning purposes, this does not matter.
• There is a repacking overhead when the standard format is used. → Block tridiagonal matrices are extracted and repacked at the same time.

SLIDE 12

Numerical Experiments

SLIDE 13

Test Setup

Intel Knights Landing

• 34 tiles, 2 cores/tile (4 threads/core), 2 AVX512 units/core, 1 MB L2 per tile
• 3+ TFLOP/s in double precision, 400+ GB/s (MCDRAM)

NVIDIA Tesla P100

• 56 SMs, 32 FP64 CUDA cores/SM
• 4.7 TFLOP/s in double precision, 732+ GB/s (HBM2)

Benchmark

• Compact batched BLAS is implemented in KokkosKernels.
• We evaluate our compact batched BLAS against vendor-optimized implementations: 1) MKL with OpenMP, 2) MKL batched APIs, and 3) cuBLAS batched APIs.
• The line smoother is implemented using our compact batched BLAS Serial and Team interfaces and compared with SPARC, Sandia's production simulation code for solving the Navier-Stokes equations for compressible and reacting flows.

SLIDE 14

Intel Knights Landing: Batched BLAS

The SIMD data layout enables vectorization for small block sizes.

[Plots: GFLOPS/core vs. # cores (1, 2, 4, 8, 16, 34, 68) for block sizes 3, 5, 10, and 15; KokkosKernels vs. MKL OpenMP. Batched LU.]

SLIDE 15

Intel Knights Landing: Batched BLAS

The SIMD data layout enables vectorization for small block sizes.

[Plots: GFLOPS/core vs. # cores (1, 2, 4, 8, 16, 34, 68) for block sizes 3, 5, 10, and 15; KokkosKernels vs. MKL Batch vs. MKL OpenMP. Batched Trsm.]

SLIDE 16

Intel Knights Landing: Batched BLAS

The SIMD data layout enables vectorization for small block sizes.

[Plots: GFLOPS/core vs. # cores (1, 2, 4, 8, 16, 34, 68) for block sizes 3, 5, 10, and 15; KokkosKernels vs. MKL Batch vs. MKL OpenMP vs. libxsmm. Batched Gemm.]

SLIDE 17

Intel Knights Landing: Roofline Analysis

The APEX toolkit (developed at Sandia) is used for the performance analysis. The computation is memory bandwidth bound, and our implementation fully exploits the MCDRAM bandwidth.

[Roofline plots for batched Gemm: attainable performance (GFlop/s) vs. arithmetic intensity (flops/byte), with peak GFlop/s ceilings (with FMA, no FMA, no FMA/no vectorization) and DDR, MCDRAM, and L2 bandwidth slopes; left: MKL OpenMP, right: MKL Batch.]

SLIDE 18

Intel Knights Landing: Roofline Analysis

The APEX toolkit (developed at Sandia) is used for the performance analysis. The computation is memory bandwidth bound, and our implementation fully exploits the MCDRAM bandwidth.

[Roofline plots for batched Gemm: same axes and ceilings as on the previous slide; left: KokkosKernels, right: libxsmm.]

SLIDE 19

Intel Knights Landing: Line Smoother

Our batched block tridiagonal solver with the compact data layout is fully vectorized regardless of block size. The batch size (m × n) is 16,384 and the length of each block tridiagonal matrix (k) is 128.

[Plots: factorizations/s and solves/s (with corresponding GFLOPS) vs. # threads (4–272) for block sizes 3 and 5 (128 × 128 × 128); KokkosKernels vs. SPARC.]

SLIDE 20

Intel Knights Landing: Line Smoother

Our batched block tridiagonal solver with the compact data layout is fully vectorized regardless of block size. The batch size (m × n) is 4,096 and the length of each block tridiagonal matrix (k) is 128.

[Plots: factorizations/s and solves/s (with corresponding GFLOPS) vs. # threads (4–272) for block sizes 10 and 15 (64 × 64 × 128); KokkosKernels vs. SPARC.]

SLIDE 21

NVIDIA Tesla P100: Batched BLAS

• The device is not fully occupied at small workset sizes: ~130K threads vs. a 16K batch size.
• Flat parallelism (called KK Range) vs. hierarchical parallelism (called KK Team).
• Team parallelism is essential for performance; however, identifying the right team size is challenging.

[Plots: GFLOP/s vs. workset size N (1024–16384) for LU, Trsm, and Gemm at block size 3; cuBLAS vs. KK Range vs. KK Team.]

SLIDE 22

NVIDIA Tesla P100: Batched BLAS

(Bullets repeated from the previous slide.)

[Plots: GFLOP/s vs. workset size N (1024–16384) for LU, Trsm, and Gemm at block size 5; cuBLAS vs. KK Range vs. KK Team.]

SLIDE 23

NVIDIA Tesla P100: Batched BLAS

(Bullets repeated from the previous slide.)

[Plots: GFLOP/s vs. workset size N (1024–16384) for LU, Trsm, and Gemm at block size 10; cuBLAS vs. KK Range vs. KK Team.]

SLIDE 24

NVIDIA Tesla P100: Batched BLAS

(Bullets repeated from the previous slide.)

[Plots: GFLOP/s vs. workset size N (1024–16384) for LU, Trsm, and Gemm at block size 15; cuBLAS vs. KK Range vs. KK Team.]

SLIDE 25

NVIDIA Tesla P100: Line Smoother

The native version of SPARC does not include a GPU implementation of the line smoother. KokkosKernels provides a portable implementation for the two architectures, i.e., KNL and P100, and demonstrates comparable performance on each.

[Bar charts: Line Smoother Initialization (block tridiagonal factorizations per minute) and Line Smoother Application (solves per minute) vs. block size (3, 5, 10, 15); KK Team (P100) vs. KK (KNL).]

SLIDE 26

Summary

• Demonstrated a performance portable block line smoother implementation.
• Numerical experiments are presented for Intel Knights Landing and the NVIDIA P100.
• Functor-level interfaces (Serial and Team) are used as building blocks for implementing the line smoother.
• By packing data aligned with the hardware-specific vector length, the code is easily vectorized.
