The State of Matrix-free Methods and HPC
Martin Kronbichler


  1. The State of Matrix-free Methods and HPC
     Martin Kronbichler
     Institute for Computational Mechanics, Technical University of Munich, Germany
     May 26, 2020, Eighth deal.II Users and Developers Workshop
     Supported by SPPEXA (German software for exascale computing project, DFG), project ExaDG, and by the Bayerisches Kompetenznetzwerk für Technisch-Wissenschaftliches Hoch- und Höchstleistungsrechnen (KONWIHR)

  2. Outline
     - What kind of matrix-free methods are there in deal.II and why?
     - Use cases of matrix-free methods
     - Geometric multigrid

  3. Today's big HPC theme: on the road towards exascale
     The HPC community is moving towards exascale (machines with 1 Exaflop/s = 10^18 floating point operations per second). We want to enable the user of deal.II to:
     - Select the most efficient discretization parameters
       - Meshing (high-order curved, adaptive, ...)
       - Polynomial degree
     - Select the most efficient iterative solver
       - With > 10^9 spatial unknowns and millions of time steps, optimal-complexity solvers are essential
       - Example: multigrid
     - Select the most efficient implementation for the given hardware
       - Matrix-free vs. matrix-based, to avoid the memory wall
       - Usage of the most efficient instruction set, e.g. vectorization with AVX/AVX-512
       - Intel/AMD CPUs, NVIDIA GPUs, ARM SVE, ...
       - Scalability to 10,000+ nodes
     [Figures: SuperMUC-NG supercomputer (source: www.lrz.de); efficiency of various instruction sets on Intel Skylake-X for DG operator evaluation]

  4. Implement by ourselves or use external libraries?
     External libraries:
     - Matrix-based linear algebra and preconditioners: PETSc, Trilinos, UMFPACK, cuSPARSE, cuSOLVER
     - Mesh partitioning: p4est, METIS
     - ...
     Programming frameworks & backends:
     - Distributed memory: MPI
     - Threading (TBB; taskflow, ...?)
     - CUDA for GPUs
     - Vectorization: via the compiler (intrinsics) or the C++ standard library
     Implemented inside deal.II:
     - Algorithms for finite elements and beyond
     - Linear algebra infrastructure that is not available externally with adequate functionality or performance:
       - MPI-parallel vector LinearAlgebra::distributed::Vector<Number>
       - SIMD abstraction class VectorizedArray<Number> (see the sketch below)
     - Wrapper classes for a unified interface to external libraries
     - Help our users concentrate on their application: separation of concerns
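To illustrate the SIMD abstraction mentioned above, here is a minimal sketch (the function name and printed values are ours, not from the slides; it assumes a recent deal.II release, where the static size() function replaced the older n_array_elements member): VectorizedArray<Number> wraps one SIMD register and overloads the arithmetic operators, so the same source code compiles to SSE, AVX, or AVX-512 instructions depending on the build.

```cpp
#include <deal.II/base/vectorization.h>
#include <iostream>

using namespace dealii;

// Minimal sketch: VectorizedArray<double> holds several doubles in one
// SIMD register (e.g. 4 with AVX, 8 with AVX-512) and applies every
// arithmetic operation to all lanes at once.
void vectorized_array_demo()
{
  VectorizedArray<double> x, y;
  x = 1.5; // assignment from a scalar broadcasts to all lanes
  y = 2.0;
  const VectorizedArray<double> z = 3.0 * x + y; // one SIMD multiply + add

  // Individual lanes are accessible via operator[]; size() is the
  // number of lanes, known at compile time:
  for (unsigned int lane = 0; lane < VectorizedArray<double>::size(); ++lane)
    std::cout << z[lane] << '\n'; // prints 6.5 for every lane
}
```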

  5. Outline
     - What kind of matrix-free methods are there in deal.II and why?
     - Use cases of matrix-free methods
     - Geometric multigrid

  6. Matrix-free algorithms: a hot topic in HPC
     - What is the limiting resource in FEM programs? Iterative solvers (or explicit time stepping) typically spend 60-95% of their time in the matrix-vector product
     - Classical approach: sparse matrix-vector product (SpMV)
       - 2 arithmetic operations (multiply, add) per matrix entry loaded from memory (8 bytes for the value + 4.x bytes for index data): 0.16-0.25 Flop/Byte (a worked example follows below)
       - Modern CPUs/GPUs can do 3-20 Flop/Byte, so the performance limit is memory bandwidth and SpMV leaves compute resources unused
       - Memory wall: the gap to memory widens over time
     - Matrix-free algorithm: transfer less, compute more
       - Choice in deal.II: fast computation of FEM integrals, at 1-8 Flop/Byte
       - Method of choice for higher polynomial degrees p
       - Closely related to the US CEED project
     [Plots: bytes per DoF and operations per DoF versus polynomial degree p (1-8), comparing the sparse matrix-vector product with matrix-free evaluation]
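As a worked check of the Flop/Byte range quoted above (our own back-of-the-envelope arithmetic, assuming double-precision values stored in a CRS-like format with 4-byte column indices):

```latex
% Sparse matrix-vector product: 2 flops (multiply + add) per nonzero,
% which costs 8 bytes (double value) + 4 bytes (column index) of traffic:
\frac{2~\mathrm{Flop}}{(8+4)~\mathrm{Byte}} \approx 0.17~\mathrm{Flop/Byte}
% Row-pointer and vector traffic push this toward 0.16; conversely,
% if the index data were free, the best case would be
\frac{2~\mathrm{Flop}}{8~\mathrm{Byte}} = 0.25~\mathrm{Flop/Byte}
```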

  7. Matrix-free algorithm layout in deal.II
     Matrix-vector product:
     - matrix-based: $A = \sum_{e=1}^{N_\mathrm{el}} P_e^T A_e P_e$ (assembly), then $v = A u$ (matrix-vector product within the iterative solver)
     - matrix-free: $v = \sum_{e=1}^{N_\mathrm{el}} P_e^T A_e (P_e u)$; implication: a cell loop inside the iterative solver, which needs optimized loops and vector access
     Matrix-free evaluation of a generic FEM operator (a deal.II sketch follows below):
     - $v = 0$
     - loop over elements $e = 1, \ldots, N_\mathrm{el}$:
       (i) Extract local vector values: $u_e = P_e u$
       (ii) Apply the operation locally by integration: $v_e = A_e u_e$ (without ever forming $A_e$)
       (iii) Sum the result of (ii) into the global solution vector: $v = v + P_e^T v_e$
     Design goals:
     - Data locality for higher arithmetic intensity: a single sweep through the data
     - Absolute performance in unknowns per second (DoF/s), not maximal GFlop/s or GB/s
     M. Kronbichler, K. Kormann, A generic interface for parallel finite element operator application. Comput. Fluids 63:135-147, 2012
     M. Kronbichler, K. Kormann, Fast matrix-free evaluation of discontinuous Galerkin finite element operators. ACM TOMS 45(3):29, 2019
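As a concrete illustration of the cell loop above, here is a minimal sketch of a matrix-free Laplace operator application with deal.II's MatrixFree and FEEvaluation classes (the function name apply_laplace is ours; the sketch assumes a recent deal.II release and a MatrixFree object that was set up elsewhere with the DoFHandler, constraints, and quadrature):

```cpp
#include <deal.II/lac/la_parallel_vector.h>
#include <deal.II/matrix_free/fe_evaluation.h>
#include <deal.II/matrix_free/matrix_free.h>

using namespace dealii;

// Sketch of v = sum_e P_e^T A_e (P_e u) for the Laplacian.
template <int dim, int fe_degree>
void apply_laplace(
  const MatrixFree<dim, double>                    &data,
  LinearAlgebra::distributed::Vector<double>       &dst,
  const LinearAlgebra::distributed::Vector<double> &src)
{
  dst = 0.;
  data.cell_loop(
    [](const MatrixFree<dim, double>                    &data,
       LinearAlgebra::distributed::Vector<double>       &dst,
       const LinearAlgebra::distributed::Vector<double> &src,
       const std::pair<unsigned int, unsigned int>      &cell_range) {
      FEEvaluation<dim, fe_degree> phi(data);
      for (unsigned int cell = cell_range.first; cell < cell_range.second;
           ++cell)
        {
          phi.reinit(cell);
          phi.read_dof_values(src);              // step (i):  u_e = P_e u
          phi.evaluate(EvaluationFlags::gradients);
          for (unsigned int q = 0; q < phi.n_q_points; ++q)
            phi.submit_gradient(phi.get_gradient(q), q); // step (ii)
          phi.integrate(EvaluationFlags::gradients);
          phi.distribute_local_to_global(dst);   // step (iii): v += P_e^T v_e
        }
    },
    dst, src);
}
```

Note that steps (i)-(iii) never form the element matrix A_e; the quadrature-point loop is where a variable coefficient or a different operator would be inserted.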

  8. Matrix-vector product on a cell by integration
     Contribution of cell K to the matrix-vector product, for the example of the Laplacian:
     $(A_K u_K)_j = \int_K \nabla_x \varphi_j \cdot \nabla_x u_h \,\mathrm{d}x \approx \sum_q w_q \det J_q \left.\left(\nabla_x \varphi_j \cdot \nabla_x u_h\right)\right|_{x=x_q} = \sum_q \left.\left(\nabla_\xi \varphi_j\right)^T J_q^{-1} \left(w_q \det J_q\right) J_q^{-T} \sum_i \nabla_\xi \varphi_i \, u_{K,i}\right|_{x=x_q}$, $\quad j = 1, \ldots, n_\text{cell dofs}$
     (a) Compute unit-cell gradients $\nabla_\xi u_h = \sum_i (\nabla_\xi \varphi_i)\, u_{K,i}$ at all quadrature points
     (b) At each quadrature point, apply the geometry $J_q^{-T}$, multiply by the quadrature weight and the Jacobian determinant, and apply the geometry $J_q^{-1}$ for the test function
     (c) Test by the unit-cell gradients of all basis functions and sum over quadrature points
     Matrix notation: $v_K = A_K u_K = S^T W S u_K$ with $S_{qi} = \left.\nabla_\xi \varphi_i\right|_{\xi_q}$ and $W_{qq} = J_q^{-1} (w_q \det J_q) J_q^{-T}$

  9. Fast interpolation and integration: sum factorization via FEEvaluation
     - Efficient evaluation of the matrices $S$ and $S^T$ exploiting the tensor product structure
       $S_\text{3D} = \begin{pmatrix} S_\zeta \otimes S_\eta \otimes D_\xi \\ S_\zeta \otimes D_\eta \otimes S_\xi \\ D_\zeta \otimes S_\eta \otimes S_\xi \end{pmatrix}$
     - Ideas from spectral elements (1980s) for tensor product shape functions and tensor product quadrature
     - Visualization of the interpolation of $\partial u_h / \partial \xi$ with a $Q_3$ element (Lagrange basis): $S_\eta \otimes D_\xi$ applies the 1D kernels $D_\xi$ and $S_\eta$ successively, taking the vector values $u_K$ on the nodes to $\partial u_h / \partial \xi$ on the quadrature points
     - Tensor-based evaluation reduces the evaluation cost from $4^4$ to $2 \times 4^3$; in general, for degree $p$ and dimension $d$, from $O((p+1)^{2d})$ to $O(d\,(p+1)^{d+1})$ (see the sketch below)
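The following is a minimal standalone sketch of the sum-factorization idea for 2D interpolation (plain C++, not deal.II's actual optimized kernels; the names Matrix and interpolate_2d are ours):

```cpp
#include <array>

// n = p + 1 points per direction; S is the 1D interpolation matrix
// from the n nodes to the n quadrature points.
template <int n>
using Matrix = std::array<std::array<double, n>, n>;

// Interpolate nodal values of a Q_{p} element to the n x n quadrature
// points by applying S along each direction: u_q = (S (x) S) u_K.
// This takes 2 n^3 multiply-adds instead of the n^4 of a full
// matrix-vector product, i.e. O(d (p+1)^{d+1}) instead of O((p+1)^{2d}).
template <int n>
Matrix<n> interpolate_2d(const Matrix<n> &S, const Matrix<n> &u)
{
  Matrix<n> tmp{}; // zero-initialized intermediate
  Matrix<n> out{};
  // contract over the xi index: tmp(i, q1) = sum_j S(q1, j) u(i, j)
  for (int i = 0; i < n; ++i)
    for (int q1 = 0; q1 < n; ++q1)
      for (int j = 0; j < n; ++j)
        tmp[i][q1] += S[q1][j] * u[i][j];
  // contract over the eta index: out(q2, q1) = sum_i S(q2, i) tmp(i, q1)
  for (int q2 = 0; q2 < n; ++q2)
    for (int q1 = 0; q1 < n; ++q1)
      for (int i = 0; i < n; ++i)
        out[q2][q1] += S[q2][i] * tmp[i][q1];
  return out;
}
```

Replacing S by the 1D differentiation matrix D in one of the two passes gives exactly the $S_\eta \otimes D_\xi$ kernel visualized on the slide.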

  10. Sum factorization: implement ourselves or use performance libraries?
     - Sum factorization consists of a series of matrix-matrix multiplications of sizes $(p+1) \times (p+1)$ and $(p+1) \times (p+1)^2$ for polynomial degree $p$
     - Standard BLAS is optimized for large $N$ and performs badly here due to function call overhead and rearrangement cost
     - A series of small mat-mats: is batched BLAS an option? It combines the small mat-mats of all elements, but the data locality between consecutive mat-mats is lost, and it is unclear how to best utilize the data in caches (CPU) or registers (GPU)
     - Own implementation with SIMD and unrolled loops (generated by the compiler from templated C++ code)
     - The own implementation is 2-10 times faster in the interesting regime $2 \le p \le 10$
     [Plot: GFlop/s versus polynomial degree p (0-25) for the own implementation, projected batched BLAS, and BLAS-3 via MKL; system: 2 x 14-core Intel Broadwell 2690 v4, peak 1.3 TFlop/s, 115 GB/s]

  11. Node-level performance over time
     Track the performance of operator evaluation per core over 10 years of hardware:
     - Baseline: sparse matrix-vector product, FE_Q(1)
     - Continuous H1-conforming elements FE_Q(4), on affine meshes (arithmetically intensive) and on high-order deformed meshes (memory intensive)
     - Discontinuous elements FE_DGQ(4), affine meshes
     Matrix-free and SpMV were within a factor of 2 in 2010; today they are more than 10x apart!
     [Plot: million DoF/s per core for matrix-free H1 affine Q4, matrix-free H1 curved Q4, matrix-free DG-IP affine Q4, and sparse matrix-vector Q1, on Opteron 2x8C, Sandy Bridge 2x8C, Haswell 2x8C, Haswell 2x14C, Broadwell 2x20C, Skylake 2x24C]

  12. Comparison versus matrix-based: matrix-free and high order win
     - Continuous finite elements, DG-SIP, and hybridizable discontinuous Galerkin (HDG), the latter representing an efficient sparse matrix-based scheme
     - Throughput of the matrix-vector product measured as run time against $N_\mathrm{el} k^3$ "equivalent" DoFs to make the different discretizations comparable [1]
     [Plots: equivalent DoF/s versus polynomial degree (1-8) on a 3D affine mesh and a 3D curved mesh, comparing continuous FEM matrix-free, continuous FEM statically condensed matrix, DG-SIP matrix-free, and HDG trace matrix]
     [1] Kronbichler, Wall, A performance comparison of continuous and discontinuous Galerkin methods with fast multigrid solvers, SIAM J. Sci. Comput. 40:A3423-A3448, 2018
