 
              Supported by SPPEXA (German software for exascale computing project, DFG), project ExaDG Supported by Bayerisches Kompetenznetzwerk f¨ ur Technisch-Wissenschaftliches Hoch- und H¨ ochstleistungsrechnen (KONWIHR) The State of Matrix-free Methods and HPC Martin Kronbichler Institute for Computational Mechanics Technical University of Munich, Germany May 26, 2020 Eighth deal.II Users and Developers Workshop M. Kronbichler The State of Matrix-free Methods and HPC 1
Outline What kind of matrix-free methods are there in deal.II and why? Use cases of matrix-free methods Geometric multigrid M. Kronbichler The State of Matrix-free Methods and HPC 2
Today’s big HPC theme: On the road towards exascale HPC community is moving towards exascale (machine with 1 Exaflop/s = 10 18 floating point ops / sec) We want to enable the user of deal.II to ◮ Select most efficient discretization parameters ◮ Meshing (high-order curved, adaptive, . . . ) ◮ Polynomial degree ◮ Select most efficient iterative solver SuperMUC-NG supercomputer (top) (source: ◮ > 10 9 spatial unknowns, millions of time steps → www.lrz.de) and efficiency of various instruc- tion sets on Intel Skylake-X for DG operator optimal-complexity solvers essential evaluation (bottom) ◮ Example: multigrid ◮ Select most efficient implementation for given hardware ◮ Matrix-free vs matrix-based → avoid memory wall ◮ Usage of most efficient instruction set – e.g. vectorization with AVX/AVX-512 ◮ Intel/AMD CPUs, NVIDIA GPUs, ARM SVE, . . . ◮ Scalability to 10,000+ nodes M. Kronbichler The State of Matrix-free Methods and HPC 3
Implement by ourselves or use external libraries? External libraries Implemented inside deal.II ◮ Matrix-based linear algebra and ◮ Algorithms for finite elements and beyond preconditioners: PETSc, Trilinos, ◮ Linear algebra infrastructure not available UMFPACK, cuSPARSE, with adequate functionality or performance cuSOLVERS externally ◮ Mesh partitioning: p4est, METIS ◮ MPI-parallel vector ◮ . . . LinearAlgebra::distributed ::Vector<Number> Programming frameworks & backends ◮ SIMD abstraction class ◮ Distributed memory: MPI VectorizedArray<Number> ◮ Threading (TBB; taskflow, . . . ?) ◮ Wrapper classes for unified interface to ◮ CUDA for GPUs external libraries ◮ Vectorization: via compiler ◮ Help our users to concentrate on their (intrinsics) or C++ standard library application: separation of concerns M. Kronbichler The State of Matrix-free Methods and HPC 4
Outline What kind of matrix-free methods are there in deal.II and why? Use cases of matrix-free methods Geometric multigrid M. Kronbichler The State of Matrix-free Methods and HPC 5
Matrix-free algorithms: hot topic in HPC ◮ What is limiting resource in FEM programs? 2 , 000 1 , 500 ◮ Iterative solvers (or explicit time stepping) typically spend Bytes / DoF 60–95% of time in matrix-vector product 1 , 000 ◮ Classical approach: sparse matrix-vector product 500 ◮ 2 arithmetic operations (mult, add) per matrix entry 0 1 2 3 4 5 6 7 8 loaded from memory (8 byte for value + 4.x byte for index Polynomial degree p data): 0.16–0.25 Flop/Byte 700 ◮ Modern CPUs/GPUs can do 3–20 Flop/Byte 600 Operations / DoF ◮ Performance limit is memory bandwidth 500 ◮ SpMV leaves resources unused 400 300 ◮ Memory wall: gap to memory widens over time 200 ◮ Matrix-free algorithm: transfer less, compute more 100 0 ◮ Choice in deal.II: fast computation of FEM integrals 1 2 3 4 5 6 7 8 ◮ 1–8 Flop/Byte Polynomial degree p ◮ Method of choice for higher polynomial degrees p sparse matrix-vector ◮ Closely related to US CEED project matrix-free M. Kronbichler The State of Matrix-free Methods and HPC 6
Matrix-free algorithm layout in deal.II Matrix-vector product Matrix-free evaluation of generic FEM operator matrix-based: ◮ v = 0  N el P T ◮ loop over elements e = 1 ,..., N el ∑  A = e A e P e (assembly)  e = 1 (i) Extract local vector values: u e = P e u (matrix-vector product  v = Au  (ii) Apply operation locally by integration: within iterative solver) v K = A K u K ( without forming A K ) (iii) Sum results from (ii) into the global solution vector: v = v + P T e v e matrix-free: N el P T ∑ v = e A e ( P e u ) Design goals : e = 1 ◮ Data locality for higher arithmetic intensity: single implication : cell loop within sweep through data iterative solvers, need optimized ◮ Absolute performance in unknowns per second loops + vector access (DoFs/s), not maximal GFlop/s or GB/s M. Kronbichler, K. Kormann, A generic interface for parallel finite element operator application. Comput. Fluids 63:135–147, 2012 M. Kronbichler, K. Kormann, Fast matrix-free evaluation of discontinuous Galerkin finite element operators. ACM TOMS 45(3), 29, 2019 M. Kronbichler The State of Matrix-free Methods and HPC 7
Matrix-vector product on cell by integration Contribution of cell K to matrix-vector product: example for Laplacian � w q det J q ∇ x φ j · ∇ x u h � K ∇ x φ j · ∇ x u h d x ≈ ∑ ( A K u K ) j = � � x = x q q � = ∑ ∇ ξ φ j J − 1 q ( w q det J q ) J − T q ∑ ∇ ξ φ i u K , i x = x q , j = 1 ,..., cell dofs � � q i (a) Compute unit cell gradients ∇ ξ u h = ∑ ( ∇ ξ φ i ) u K , i at all Matrix notation: quadrature points v K = A K u K (b) At each quadrature point, apply geometry J − T q , multiply = S T WSu K by quadrature weight and Jacobian determinant, apply geometry for test function J − 1 q with (c) Test by unit cell gradients of all basis functions and sum � S qi = ∇ ξ φ i � ξ q over quadrature points W qq = J − 1 q ( w q det J q ) J − T q M. Kronbichler The State of Matrix-free Methods and HPC 8
Fast interpolation and integration: sum factorization via FEEvaluation   S ζ ⊗ S η ⊗ D ξ ◮ Efficient evaluation of S and S T matrices with structure S 3D = S ζ ⊗ D η ⊗ S ξ   D ζ ⊗ S η ⊗ S ξ ◮ Ideas from spectral elements (1980s) for tensor product shape functions and tensor product quadrature ◮ Visualization of interpolation of ∂ u ∂ξ with Q 3 element (Lagrange basis) S η ⊗ D ξ : successively apply 1D kernels D ξ S η ∂ u h Vector values u K on nodes on quadrature points ∂ξ Tensor-based evaluation reduces evaluation cost from 4 4 to 2 × 4 3 In general for degree p and dimension d : O (( p + 1 ) 2 d ) to O ( d ( p + 1 ) d + 1 ) M. Kronbichler The State of Matrix-free Methods and HPC 9
Sum factorization – implement ourselves or use performance libraries? ◮ Sum factorization consists of series of matrix-matrix multiplication of size own implementation 800 ( p + 1 ) × ( p + 1 ) and ( p + 1 ) × ( p + 1 ) 2 for Batched BLAS projected BLAS-3 via MKL polynomial degree p 600 ◮ Standard BLAS optimized for large N GFlop/s ◮ bad here due to function call overhead & 400 rearrangement cost ◮ Series of mat-mat! Batched BLAS possible? ◮ Combine small mat-mat of all elements 200 ◮ Limit : Data locality between mat-mat lost ◮ Unclear how to best utilize data in caches 0 0 5 10 15 20 25 (CPU) or registers (GPU) Polynomial degree p ◮ Own implementation with SIMD and unrolled System : 2 × 14 Intel Broadwell 2690 v4, peak: 1.3 TFlop/s, 115 GB/s loops (compiler via templated C++ code) Own implementation 2–10 times faster in interesting regime 2 ≤ p ≤ 10 M. Kronbichler The State of Matrix-free Methods and HPC 10
Node-level performance over time Track performance of operator eval- 125 million DoFs / [s × cores] uation per core over 10 years of 100 hardware ◮ Baseline: sparse matrix-vector 75 product, FE Q(1) 50 ◮ Continuous H 1 conforming elements, affine (arithmetic 25 intensive) and high-order 0 deformed (memory intensive), Opteron 2 × 8C Sandy Bridge 2 × 8C Haswell 2 × 8C Haswell 2 × 14C Broadwell 2 × 20C Skylake 2 × 24C FE Q(4) ◮ Discontinuous elements, affine, FE DGQ(4) Matrix-free and SpMV within a fac- matrix-free, H 1 affine Q 4 matrix-free, H 1 curved Q 4 tor 2 in 2010, more than 10 × apart matrix-free, DG-IP affine Q 4 sparse matrix-vector Q 1 today ! M. Kronbichler The State of Matrix-free Methods and HPC 11
Comparison versus matrix-based: matrix-free and high order wins Continuous finite elements, DG-SIP , hybridizable discontinuous Galerkin (HDG) representing efficient sparse matrix-based scheme Throughput of matrix-vector product measured using run time against N el k 3 “equivalent” DoFs to make different discretizations comparable 1 3D affine mesh 3D curved mesh 10 9 10 9 Equivalent DoF/s Equivalent DoF/s 10 8 10 8 10 7 10 7 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Polynomial degree Polynomial degree continuous FEM matrix-free continuous FEM stat. cond. matrix DG-SIP matrix-free HDG trace matrix 1 Kronbichler, Wall, A performance comparison of continuous and discontinuous Galerkin methods with fast multigrid solvers, SIAM J Sci Comput 40:A3423–48, 2018 M. Kronbichler The State of Matrix-free Methods and HPC 12
Recommend
More recommend