ExaDG: High-order discontinuous Galerkin for the exa-scale


  1. ExaDG: High-order discontinuous Galerkin for the exa-scale
     Guido Kanschat 1, Katharina Kormann 2, Martin Kronbichler 3, Wolfgang A. Wall 3
     1 Interdisziplinäres Institut für Wissenschaftliches Rechnen, Universität Heidelberg
     2 Max-Planck-Institut für Plasmaphysik
     3 Institute for Computational Mechanics, Technische Universität München
     January 27, 2016

  2. Outline
     ◮ ExaDG: Operator evaluation framework (matrix-free algorithm)
     ◮ Parallel adaptive multigrid
     ◮ Applications
     ◮ Summary

  3. ExaDG: Motivation of project
     ◮ Modern PDE simulators spend about 60–95 % of CPU time on matrix-vector products; the unstructured/adaptive case is typically handled with sparse matrices
     ◮ Bottleneck: memory-limited, only 1–5 % of arithmetic peak
     ◮ Matrix-free alternative based on cellwise integration 1
       ◮ (much) lower memory transfer
       ◮ CPUs: 30–70 % of arithmetic peak
       ◮ 2–5 times as fast for Q2 elements on a petascale machine (SuperMUC)
     ◮ Medium/high order 2 ≤ p ≤ 6 honors local computations and the cache hierarchy rather than data transfer
     [Figure: time [s] per DoF vs. element degree 1–8, Laplacian on continuous elements; curves: SpMV usual, SpMV static condensation, matrix-free general, matrix-free Cartesian; CPU: Xeon E5 (SNB @ SuperMUC)]
     ⇒ A matrix-free implementation of general-purpose PDE operator evaluation is promising for the exascale
     1 Kronbichler, Kormann, Comput. Fluids, 2012

  4. Generic operator evaluation framework
     ◮ Integrated in the deal.II finite element library (www.dealii.org)
     ◮ Targets generic weak forms on quadrilaterals/hexahedra; user input: the operator in one quadrature point → a C++ templated function (see the sketch below)
     ◮ Mesh loop according to the specific algorithm structure (operator evaluation on active cells, evaluation on geometric multigrid levels, multigrid smoothing on cell patches)
     ◮ The library should handle the loops in a hardware-specific way
       ◮ loops over quadrature points and cells/faces/vertices
       ◮ full support for mesh adaptivity with hanging nodes (L2-, H(div)-, H(curl)-, H1-conforming elements)
       ◮ parallelization: MPI (domain decomposition), threads
       ◮ vectorization
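To make the "operator in one quadrature point" input concrete, here is a minimal sketch of a matrix-free Laplace operator written against deal.II's MatrixFree/FEEvaluation interface. The class name, vector type, and fixed polynomial degree are illustrative choices, and the EvaluationFlags spelling is the current deal.II one (the 2016-era interface passed boolean flags). The user-written part is the body of the quadrature-point loop; the library drives the cell loop, vectorization, and MPI exchange.

```cpp
#include <deal.II/matrix_free/matrix_free.h>
#include <deal.II/matrix_free/fe_evaluation.h>
#include <deal.II/lac/la_parallel_vector.h>

using namespace dealii;

template <int dim, int fe_degree>
class LaplaceOperator
{
public:
  using VectorType = LinearAlgebra::distributed::Vector<double>;

  LaplaceOperator(const MatrixFree<dim, double> &mf) : matrix_free(mf) {}

  void vmult(VectorType &dst, const VectorType &src) const
  {
    dst = 0.;
    matrix_free.cell_loop(&LaplaceOperator::local_apply, this, dst, src);
  }

private:
  void local_apply(const MatrixFree<dim, double> &data,
                   VectorType &dst, const VectorType &src,
                   const std::pair<unsigned int, unsigned int> &cell_range) const
  {
    // FEEvaluation performs sum-factorized interpolation; the template
    // arguments make the loop bounds known at compile time (cf. slide 10).
    FEEvaluation<dim, fe_degree> phi(data);
    for (unsigned int cell = cell_range.first; cell < cell_range.second; ++cell)
      {
        phi.reinit(cell);
        phi.read_dof_values(src);                      // u_K = P_K u
        phi.evaluate(EvaluationFlags::gradients);      // values at quad points
        for (unsigned int q = 0; q < phi.n_q_points; ++q)
          phi.submit_gradient(phi.get_gradient(q), q); // weak-form Laplacian
        phi.integrate(EvaluationFlags::gradients);     // multiply by test fns
        phi.distribute_local_to_global(dst);           // v += P_K^T v_K
      }
  }

  const MatrixFree<dim, double> &matrix_free;
};
```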

  5. Infrastructure for parallel meshes in deal.II
     ◮ deal.II implements quad/hex-only meshes, forest-of-trees concept
     ◮ deal.II integrates the functionality of the p4est library 2 for dynamically adapted meshes based on hierarchical refinement into trees (with hanging nodes)
     ◮ Each processor has its own view of the mesh (refined from the coarse mesh) 3
     ◮ Scaling results in deal.II up to 65,536 cores (p4est has been scaled to 1.5M MPI ranks)
     [Figure: adaptively refined mesh partitioned among four processors P0–P3]
     2 www.p4est.org
     3 W. Bangerth, C. Burstedde, T. Heister, M. Kronbichler, ACM TOMS 38(2), 2011
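As a usage illustration (not shown on the slide), the p4est forest-of-trees functionality is reached in deal.II through parallel::distributed::Triangulation; a minimal sketch, with the refinement criterion purely illustrative:

```cpp
#include <deal.II/base/mpi.h>
#include <deal.II/distributed/tria.h>
#include <deal.II/grid/grid_generator.h>

using namespace dealii;

int main(int argc, char **argv)
{
  Utilities::MPI::MPI_InitFinalize mpi_init(argc, argv);

  // p4est-backed mesh: each rank stores its locally owned cells plus a
  // layer of ghost cells, all refined from the same coarse mesh.
  parallel::distributed::Triangulation<3> tria(MPI_COMM_WORLD);
  GridGenerator::hyper_cube(tria); // one coarse cell = one p4est tree
  tria.refine_global(4);           // hierarchical refinement into the tree

  // Adaptivity: flag some owned cells (criterion purely illustrative),
  // then refine; this creates hanging nodes and rebalances among ranks.
  for (const auto &cell : tria.active_cell_iterators())
    if (cell->is_locally_owned() && cell->center()[0] < 0.5)
      cell->set_refine_flag();
  tria.execute_coarsening_and_refinement();
}
```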

  6. Faster matrix-vector products for high-order elements
     Matrix-vector product v = A u
     ◮ Matrix-based: assemble A = Σ_{K ∈ cells} P_K^T A_K P_K once, then multiply v = A u
     ◮ Matrix-free: evaluate v = Σ_{K ∈ cells} P_K^T A_K (P_K u) on the fly:
       ◮ set v = 0
       ◮ loop over cells K:
         (i) extract local vector values on the cell: u_K = P_K u
         (ii) apply the operation locally on the cell: v_K = A_K u_K (without forming A_K; sum factorization, like EXA-DUNE)
         (iii) sum the result from (ii) into the global solution vector: v = v + P_K^T v_K
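The three steps map one-to-one onto code. Below is a library-agnostic sketch of the matrix-free product; the data layout and the trivial stand-in kernel are illustrative, since in the real algorithm step (ii) evaluates A_K u_K by numerical quadrature with sum factorization rather than touching a stored element matrix:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using GlobalVector = std::vector<double>;
using CellDofs     = std::vector<std::vector<unsigned int>>; // per-cell global dof indices

// Stand-in for step (ii): the real kernel evaluates A_K u_K by quadrature
// (sum factorization); here we just copy so the sketch is runnable.
void apply_cell_operator(const std::vector<double> &u_K, std::vector<double> &v_K)
{
  for (std::size_t i = 0; i < u_K.size(); ++i)
    v_K[i] = u_K[i];
}

// v = sum_K P_K^T A_K (P_K u), without ever assembling A
void matrix_free_vmult(GlobalVector &v, const GlobalVector &u,
                       const CellDofs &cell_dofs)
{
  std::fill(v.begin(), v.end(), 0.0);              // v = 0
  for (const auto &dofs : cell_dofs)               // loop over cells K
    {
      std::vector<double> u_K(dofs.size()), v_K(dofs.size());
      for (std::size_t i = 0; i < dofs.size(); ++i)
        u_K[i] = u[dofs[i]];                       // (i)   u_K = P_K u
      apply_cell_operator(u_K, v_K);               // (ii)  v_K = A_K u_K
      for (std::size_t i = 0; i < dofs.size(); ++i)
        v[dofs[i]] += v_K[i];                      // (iii) v += P_K^T v_K
    }
}
```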

  7. Characterization matrix-free vs. sparse matrices
     [Figure, left: arithmetic operations per DoF vs. element degree 1–8; right: memory [bytes] per DoF (≈ main memory transfer) vs. element degree; curves: SpMV, matrix-free general, matrix-free Cartesian, matrix-free GL]
     Setup: 3D Laplacian, hexahedral elements Q_p, periodic b.c. (= no boundary), pre-evaluated Jacobians on all quadrature points, "even-odd decomposition" to decrease the work of the 1D kernels from (p+1)^2 to roughly (p+1)(⌊p/2⌋ + 1) multiplications
     The sparse matrix-vector product is memory-bandwidth bound → speedup for Q2 and Q3: trade memory transfer for arithmetic
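The even-odd decomposition mentioned above exploits the symmetry S(i,j) = S(n-1-i, n-1-j) that 1D interpolation matrices inherit from symmetric node and quadrature-point layouts. Here is a sketch of one 1D kernel y = S x under that assumption (for brevity n = p+1 is taken even; odd n adds one middle row and column), showing where the roughly halved multiplication count comes from:

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>; // (p+1) x (p+1), dense

// y = S x exploiting S(i,j) = S(n-1-i, n-1-j); n = p+1 assumed even,
// y must be sized n by the caller.
void apply_1d_even_odd(const Matrix &S, const std::vector<double> &x,
                       std::vector<double> &y)
{
  const std::size_t n = x.size(), h = n / 2;
  std::vector<double> xe(h), xo(h);
  for (std::size_t i = 0; i < h; ++i)
    {
      xe[i] = x[i] + x[n - 1 - i]; // even part of the input
      xo[i] = x[i] - x[n - 1 - i]; // odd part of the input
    }
  for (std::size_t i = 0; i < h; ++i)
    {
      double ye = 0., yo = 0.;
      for (std::size_t j = 0; j < h; ++j)
        {
          // Half-size matrices Se, So; an optimized kernel precomputes
          // them, so each pass costs ~2*(n/2)^2 = (p+1)^2/2 multiplications.
          ye += 0.5 * (S[i][j] + S[i][n - 1 - j]) * xe[j];
          yo += 0.5 * (S[i][j] - S[i][n - 1 - j]) * xo[j];
        }
      y[i]         = ye + yo; // recombine even and odd contributions
      y[n - 1 - i] = ye - yo;
    }
}
```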

  8. Performance continuous elements
     [Figure: the operation-count and memory-transfer plots of slide 7 combined into measured time [s] per DoF vs. element degree 1–8; curves: SpMV usual, SpMV static condensation, matrix-free general, matrix-free Cartesian]
     3D Laplacian, Q_p elements, Cartesian & curved geometry
     Intel Xeon E5 Sandy Bridge 2.7 GHz (SuperMUC); time per degree of freedom (DoF) per core (but node fully loaded)

  9. Performance discontinuous Galerkin
     [Figure: time [s] per DoF vs. element degree 1–8; curves: SpMV usual, matrix-free curved, matrix-free Cartesian]
     3D Laplacian, Q_p elements, symmetric interior penalty DG, Cartesian & curved geometry
     Intel Xeon E5 Sandy Bridge 2.7 GHz (SuperMUC); time per degree of freedom per core (but node fully loaded)

  10. Characteristics of element kernels
      Operator evaluation: 30–70 % of arithmetic peak
      ◮ Vectorization over elements
        ◮ data layout: array-of-struct-of-array
        ◮ intrinsics via the C++ template class VectorizedArray<double>
        ◮ everything but vector read/write fully vectorized (DG: fully vectorized)
      ◮ Loop bounds known at compile time
        ◮ element degree and number of 1D quadrature points as template parameters
        ◮ unroll loops of length ⌊p/2⌋ + 1 or p + 1
      ◮ Challenge: mix of operations
        ◮ memory intensive: vector read/write, pre-evaluated Jacobians, coefficients, etc. at quadrature points
        ◮ arithmetic intensive: sum factorization
      → Develop a performance model!
      [Listing: excerpt of the generated AVX2 assembly, dominated by packed vmovapd/vmulpd/vsubpd/vfmadd231pd instructions on %ymm registers]
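The array-of-struct-of-array layout means each lane of a VectorizedArray<double> belongs to a different element of a cell batch, so scalar-looking quadrature-point code compiles to the packed instructions shown in the excerpt. A minimal sketch using deal.II's VectorizedArray class (function and variable names are illustrative):

```cpp
#include <deal.II/base/vectorization.h>

using dealii::VectorizedArray;

// Multiply the quadrature-point values of a whole batch of cells by the
// precomputed Jacobian determinant times quadrature weight. One lane of
// each VectorizedArray belongs to one cell of the batch, so this scalar-
// looking loop issues packed vmulpd instructions (4 doubles with AVX2).
void scale_by_JxW(const VectorizedArray<double> *u,
                  const VectorizedArray<double> *JxW,
                  VectorizedArray<double>       *v,
                  const unsigned int             n_q_points)
{
  for (unsigned int q = 0; q < n_q_points; ++q)
    v[q] = u[q] * JxW[q]; // one operation, several cells at once
}
```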

  11. Project goal: Parallel adaptive multigrid
      Implicit solvers for large-scale problems
      ◮ Many established solvers need more than just operator evaluation (mat-vec)
      ◮ Matrix-free evaluation requires specific preconditioners
      ◮ Simple example: geometric multigrid with a polynomial (Chebyshev) smoother (sketched after slide 12)

  12. Project goal: Parallel adaptive multigrid (continued)
      ◮ Goal: GMG with overlapping Schwarz smoothers for high-order bases
        ◮ Schwarz smoothers approximately invert the differential operator on patches of elements by tensorial techniques
      ◮ Integrate with MPI and domain decomposition
      ◮ Combine with algebraic multigrid for coarse meshes with millions of elements, using low-order bases
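The Chebyshev smoother of slide 11 is attractive precisely because it needs nothing beyond the matrix-free mat-vec plus a diagonal. deal.II ships this as PreconditionChebyshev; the sketch below spells out the standard three-term recurrence under the assumption that eigenvalue bounds of the diagonally preconditioned operator have already been estimated (in practice via a few CG/Lanczos iterations). All names are illustrative:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

// apply_A:  matrix-free operator evaluation r = A x (e.g. the cell loop above)
// inv_diag: inverse diagonal of A, used as the inner preconditioner P^{-1}
// lambda_min/lambda_max: eigenvalue bounds of P^{-1} A (assumed given)
void chebyshev_smooth(const std::function<void(const Vec &, Vec &)> &apply_A,
                      const Vec &inv_diag, const Vec &b, Vec &x,
                      const double lambda_min, const double lambda_max,
                      const unsigned int degree)
{
  const std::size_t n     = x.size();
  const double      delta = 0.5 * (lambda_max - lambda_min);
  const double      theta = 0.5 * (lambda_max + lambda_min);
  const double      sigma = theta / delta;

  Vec r(n), d(n);
  apply_A(x, r); // r = A x, turned into the residual below
  for (std::size_t i = 0; i < n; ++i)
    {
      r[i] = b[i] - r[i];
      d[i] = inv_diag[i] * r[i] / theta; // first correction direction
      x[i] += d[i];
    }

  double rho_old = 1.0 / sigma;
  for (unsigned int k = 1; k < degree; ++k)
    {
      apply_A(x, r); // refresh the residual with the updated x
      const double rho = 1.0 / (2.0 * sigma - rho_old);
      for (std::size_t i = 0; i < n; ++i)
        {
          r[i] = b[i] - r[i];
          d[i] = rho * rho_old * d[i] + 2.0 * rho / delta * inv_diag[i] * r[i];
          x[i] += d[i];
        }
      rho_old = rho;
    }
}
```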
