Is it performance portability when I’m using (small) DGEMM?


  1. Is it performance portability when I’m using (small) DGEMM?
     Dagstuhl Seminar: Performance Portability in Extreme Scale Computing: Metrics, Challenges, Solutions
     Michael Bader (and many others!), Technical University of Munich
     Oct 23–27, 2017

  2. Co-Authors – Current SeisSol Group
     LMU Munich – Geophysics: Alice-Agnes Gabriel, Elizabeth Madden, Stephanie Wollherr, Thomas Ulrich
     Technical University of Munich – HPC: Sebastian Rettenberger, Carsten Uphoff
     Further/former members: Alexander Breuer (TUM → San Diego), Alexander Heinecke (Intel), Christian Pelties (LMU → MunichRe), Leonhard Rannabauer (TUM)
     M. Bader et al. | Is it performance portability when I’m using DGEMM? | Dagstuhl Seminar | Oct 2017

  3. Dynamic Rupture and Earthquake Simulation
     [Figure: Landers fault system: simulated ground motion and seismic waves [2]]
     SeisSol – ADER-DG for seismic simulations (www.seissol.org):
     • adaptive tetrahedral meshes → complex geometries, heterogeneous media, multiphysics
     • complicated fault systems with multiple branches → non-linear multiphysics dynamic rupture simulation
     • ADER-DG: high-order discretisation in space and time

  4. Part I: Simulation of the 2004 Sumatra Megathrust Earthquake
     SC17 paper [5] by Sebastian Rettenberger, Carsten Uphoff, Alice Gabriel, Betsy Madden, Stephanie Wollherr, Thomas Ulrich

  5. Sumatra Earthquake – Seismology Challenges
     [Figure: domain, mesh and geometry of the Sumatra scenario (images from [5]). Labels: megathrust, forethrust, upper backthrust, lower backthrust, layered continental crust, oceanic crust; axes North, East, Depth; scale marks 50 km and 1000 km; volume continues to 500 km]
     • multiscale: the rupture extends over 1500 km, but happens on the meter scale
     • complex geometry: shallow angles in the subduction zone; splay faults, topography, multiple material layers
     • extremely long duration of the earthquake: 500 s simulated time (over 3 million smallest time steps) → local time stepping imperative

  6. Sumatra Earthquake – HPC Challenges
     [Figure: Sumatra: histogram of LTS clusters and extrapolated runtimes (plots from [5]). Histogram: count of elements and dynamic rupture faces per cluster, for cluster time steps of 1 to 1024 ⋅ ∆t_min. Runtime plot: extrapolated time (h) over number of nodes (16 to 512) for the series C: BL G6, C: BL L6, C: SC G6, C: SC L6, S: SC G6, S: SC L6]
     • target manycore CPUs (Knights Landing → Cori supercomputer)
       → available cache/local memory per core → new flux computation
       → dynamic rupture became bottleneck → matrix-based code generation
     • dynamic rupture plus local time stepping with strong(!) scalability required
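The histogram axis (1, 2, 4, ..., 1024 ⋅ ∆t_min) suggests rate-2 clustering: cluster c advances with a time step of 2^c ⋅ ∆t_min. The sketch below illustrates why local time stepping pays off under that assumption. The element counts are made up for illustration and are not the Sumatra mesh data, and every element update is assumed to cost the same.

```python
def lts_update_counts(cluster_sizes):
    """Count element updates needed to advance the whole mesh by one
    time step of the coarsest cluster.

    cluster_sizes[c] is the number of elements whose time step is
    2**c * dt_min (rate-2 clustering, an assumption of this sketch).
    Returns (updates with global time stepping, updates with LTS).
    """
    n_clusters = len(cluster_sizes)
    steps_of_finest = 2 ** (n_clusters - 1)  # dt_min steps per coarsest step
    # Global time stepping: every element takes the smallest step.
    gts = sum(cluster_sizes) * steps_of_finest
    # Local time stepping: cluster c needs only 2**(n_clusters - 1 - c) steps.
    lts = sum(n * 2 ** (n_clusters - 1 - c)
              for c, n in enumerate(cluster_sizes))
    return gts, lts


# Hypothetical histogram: few tiny elements, many large ones.
sizes = [10, 100, 1_000, 10_000]
gts, lts = lts_update_counts(sizes)
print(f"GTS updates: {gts}, LTS updates: {lts}, ratio: {gts / lts:.2f}")
```

The ratio grows with the number of clusters and with the share of elements in the coarse clusters. The model ignores the load-balancing and scalability issues that the slide names as the actual challenge.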

  7. Sumatra 2004: 220 Million Elements on SuperMUC
     HPC Facts – 13.9 Hours Production Run:
     • 221 million elements with order 6 accuracy
     • 111 billion degrees of freedom
     • 11 LTS clusters: “smallest” elements performed 3.3 million time steps
     • 500 s simulated time
     • 1500 km fault size; 400 m geometrical resolution
     • 2.2 Hz frequency content of the seismic wave field
     • 0.94 PFLOPS sustained performance (86,016 Haswell cores, 2.2 GHz)
     • 13 TB checkpoint data, 2.8 TB for post-processing (asynchronous IO; costs entirely overlapped by computation)
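The headline numbers can be cross-checked with a little arithmetic. The sketch below assumes that order 6 means polynomial degree 5 on tetrahedra (56 basis functions per element) and that each element carries the 9 quantities of the velocity-stress formulation. The per-core peak of 16 double-precision flops per cycle at the nominal 2.2 GHz is also an assumption; the sustained AVX clock on the actual machine may differ.

```python
def basis_functions(order):
    """Number of polynomial basis functions on a tetrahedron for a given
    convergence order (polynomial degree = order - 1)."""
    return order * (order + 1) * (order + 2) // 6


elements = 221e6
quantities = 9  # six stresses + three velocities
dofs = elements * basis_functions(6) * quantities
print(f"degrees of freedom: {dofs:.3e}")

cores = 86_016
sustained = 0.94e15            # 0.94 PFLOPS, from the slide
per_core = sustained / cores   # sustained flop/s per core
peak_per_core = 16 * 2.2e9     # assumption: 2 FMA units x 4 doubles x 2 flops
print(f"per core: {per_core / 1e9:.1f} GFLOPS, "
      f"fraction of assumed peak: {per_core / peak_per_core:.0%}")
```

Under these assumptions the element count reproduces the quoted 111 billion degrees of freedom, and the sustained rate comes out at roughly a third of the assumed per-core peak.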

  8. Sumatra 2004 – Results: Splay Fault Activation and Ocean Floor Displacements

  10. SeisSol – Recent Extensions
      “Multiphysics” Simulations:
      • viscoelastic attenuation; implementation based on new matrix-based code generator (C. Uphoff, [4])
      • off-fault plasticity (current work by S. Wollherr)
      Workflow and HPC:
      • asynchronous parallel IO using staging nodes or writer cores (S. Rettenberger, [13])
      • input of 3D velocity models from data files via parallel library ASAGI (S. Rettenberger, [14])
      • simplified CAD generation and close-to-automatic meshing using SimModeler and Simulation Modeling Suite by Simmetrix

  11. Part II: SeisSol as a Compute-Bound Code: Code Generation for Matrix Kernels
      • Breuer, Heinecke, Rannabauer, Bader [1]: High-Order ADER-DG Minimizes Energy- and Time-to-Solution of SeisSol (ISC’15)
      • Uphoff, Bader [4]: Generating high performance matrix kernels for earthquake simulations with viscoelastic attenuation (HPCS 2016)

  12. Seismic Wave Propagation with SeisSol
      Elastic wave equations (velocity-stress formulation):
      \[ q_t + A q_x + B q_y + C q_z = 0, \qquad q = (\sigma_{11}, \sigma_{22}, \sigma_{33}, \sigma_{12}, \sigma_{23}, \sigma_{13}, u, v, w)^T \]
      with
      \[ A = \begin{pmatrix}
      0 & 0 & 0 & 0 & 0 & 0 & -\lambda - 2\mu & 0 & 0 \\
      0 & 0 & 0 & 0 & 0 & 0 & -\lambda & 0 & 0 \\
      0 & 0 & 0 & 0 & 0 & 0 & -\lambda & 0 & 0 \\
      0 & 0 & 0 & 0 & 0 & 0 & 0 & -\mu & 0 \\
      0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
      0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -\mu \\
      -\rho^{-1} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
      0 & 0 & 0 & -\rho^{-1} & 0 & 0 & 0 & 0 & 0 \\
      0 & 0 & 0 & 0 & 0 & -\rho^{-1} & 0 & 0 & 0
      \end{pmatrix}, \]
      \[ B = \begin{pmatrix}
      0 & 0 & 0 & 0 & 0 & 0 & 0 & -\lambda & 0 \\
      0 & 0 & 0 & 0 & 0 & 0 & 0 & -\lambda - 2\mu & 0 \\
      0 & 0 & 0 & 0 & 0 & 0 & 0 & -\lambda & 0 \\
      0 & 0 & 0 & 0 & 0 & 0 & -\mu & 0 & 0 \\
      0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -\mu \\
      0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
      0 & 0 & 0 & -\rho^{-1} & 0 & 0 & 0 & 0 & 0 \\
      0 & -\rho^{-1} & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
      0 & 0 & 0 & 0 & -\rho^{-1} & 0 & 0 & 0 & 0
      \end{pmatrix} \]
      • high order discontinuous Galerkin discretisation
      • ADER-DG: high approximation order in space and time
      • additional features: local time stepping, high accuracy of earthquake faulting (full frictional sliding)
      → Dumbser, Käser et al., e.g. [8]
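As a sanity check on the velocity-stress system, the sketch below builds the Jacobian A for the ordering q = (σ11, σ22, σ33, σ12, σ23, σ13, u, v, w) and verifies one P-wave eigenpair: the eigenvalue is −c_p with c_p = sqrt((λ + 2µ)/ρ). The material parameters are hypothetical, non-dimensional values chosen only to give round numbers, and the helper names are illustrative.

```python
def jacobian_A(lam, mu, rho):
    """Jacobian A of the velocity-stress system for
    q = (s11, s22, s33, s12, s23, s13, u, v, w)."""
    A = [[0.0] * 9 for _ in range(9)]
    A[0][6] = -(lam + 2.0 * mu)   # s11 <- u_x
    A[1][6] = -lam                # s22 <- u_x
    A[2][6] = -lam                # s33 <- u_x
    A[3][7] = -mu                 # s12 <- v_x
    A[5][8] = -mu                 # s13 <- w_x
    A[6][0] = -1.0 / rho          # u <- s11_x
    A[7][3] = -1.0 / rho          # v <- s12_x
    A[8][5] = -1.0 / rho          # w <- s13_x
    return A


def matvec(A, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


lam, mu, rho = 2.0, 1.0, 1.0              # hypothetical material
cp = ((lam + 2.0 * mu) / rho) ** 0.5      # P-wave speed
A = jacobian_A(lam, mu, rho)
r = [lam + 2.0 * mu, lam, lam, 0.0, 0.0, 0.0, cp, 0.0, 0.0]
residual = max(abs(ar + cp * ri) for ar, ri in zip(matvec(A, r), r))
nnz = sum(1 for row in A for a in row if a != 0.0)
print(f"c_p = {cp}, |A r + c_p r| = {residual}, nonzeros: {nnz} of 81")
```

Only a handful of the 81 entries are nonzero, which is one reason the later slides care about how sparsity is exploited in the matrix kernels.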

  13. SeisSol in a Nutshell – ADER-DG
      Update scheme:
      \[ Q_k^{n+1} = Q_k^n - \frac{|S_k|}{|J_k|} M^{-1} \Bigl( \sum_{i=1}^{4} F^{-,i} \, I(t^n, t^{n+1}, Q_k^n) \, N_{k,i} A_k^+ N_{k,i}^{-1} + \sum_{i=1}^{4} F^{+,i,j,h} \, I(t^n, t^{n+1}, Q_{k(i)}^n) \, N_{k,i} A_{k(i)}^- N_{k,i}^{-1} \Bigr) \]
      \[ \qquad + M^{-1} K^\xi \, I(t^n, t^{n+1}, Q_k^n) \, A_k^* + M^{-1} K^\eta \, I(t^n, t^{n+1}, Q_k^n) \, B_k^* + M^{-1} K^\zeta \, I(t^n, t^{n+1}, Q_k^n) \, C_k^* \]
      Cauchy-Kovalewski:
      \[ I(t^n, t^{n+1}, Q_k^n) = \sum_{j=0}^{J} \frac{(t^{n+1} - t^n)^{j+1}}{(j+1)!} \, \frac{\partial^j}{\partial t^j} Q_k(t^n) \]
      \[ (Q_k)_t = -M^{-1} \Bigl( (K^\xi)^T Q_k A_k^* + (K^\eta)^T Q_k B_k^* + (K^\zeta)^T Q_k C_k^* \Bigr) \]
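The Cauchy-Kovalewski step boils down to a chain of small matrix products: each time derivative is obtained from the previous one by multiplying from the left and from the right. The sketch below keeps only one space dimension and uses a hypothetical matrix `K_hat` as a stand-in for M^{-1} (K^ξ)^T, so it illustrates the structure of the recursion and is not SeisSol's kernel.

```python
from math import exp, factorial


def matmul(X, Y):
    """Small dense matrix product on nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def time_integral(Q, K_hat, A_star, dt, J):
    """Taylor-series time integral
        I = sum_{j=0}^{J} dt^(j+1) / (j+1)! * d^j Q / dt^j
    with the recursion d^(j+1)Q/dt^(j+1) = -K_hat (d^j Q/dt^j) A_star.
    K_hat is a hypothetical stand-in for M^-1 (K^xi)^T."""
    rows, cols = len(Q), len(Q[0])
    I = [[0.0] * cols for _ in range(rows)]
    D = [row[:] for row in Q]              # j-th time derivative, j = 0
    for j in range(J + 1):
        scale = dt ** (j + 1) / factorial(j + 1)
        for r in range(rows):
            for c in range(cols):
                I[r][c] += scale * D[r][c]
        # Next derivative: two small matrix products and a sign flip.
        D = [[-v for v in row] for row in matmul(matmul(K_hat, D), A_star)]
    return I


# Scalar check: dQ/dt = -(2 * 0.5) Q = -Q, so the exact integral of
# Q(t) = exp(-t) over [0, dt] is 1 - exp(-dt).
dt = 0.1
approx = time_integral([[1.0]], [[2.0]], [[0.5]], dt, J=5)[0][0]
exact = 1.0 - exp(-dt)
print(f"approx = {approx:.12f}, exact = {exact:.12f}")
```

In the full scheme the same pattern repeats for the ξ, η and ζ directions, which is where the many small matrix products per element come from.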

  14. Sparse, Dense → Block-Sparse
      Consider equivalent sparsity patterns (Uphoff, [4]).
      [Figure: sparsity patterns, panels labelled A_1, A_2, A_3: graph representation and block-sparse memory layouts]
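The block-sparse idea can be sketched as follows: instead of a fully dense product or an element-wise sparse product, the multiplication is restricted to a dense block that covers all nonzeros. The matrices, the block bounds and the helper names below are hypothetical; the sketch only compares flop counts and checks that both products agree.

```python
def dense_matmul(A, B):
    """Plain triple loop over the full matrices; returns (C, flops)."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C, 2 * m * n * k


def block_matmul(A, B, rows, cols):
    """Multiply using only the block A[rows, cols]; everything outside
    the block is assumed to be zero. Returns (C, flops)."""
    m, n = len(A), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in rows:
        for j in range(n):
            for p in cols:
                C[i][j] += A[i][p] * B[p][j]
    return C, 2 * len(rows) * n * len(cols)


# Hypothetical 6x6 matrix whose nonzeros sit in rows 0..2, columns 2..5.
rows, cols = range(0, 3), range(2, 6)
A = [[0.0] * 6 for _ in range(6)]
for i in rows:
    for p in cols:
        A[i][p] = float(i + p + 1)
A[1][3] = 0.0  # a zero inside the block is simply multiplied as data
B = [[float(p - j) for j in range(3)] for p in range(6)]

C_dense, flops_dense = dense_matmul(A, B)
C_block, flops_block = block_matmul(A, B, rows, cols)
print(f"same result: {C_dense == C_block}, "
      f"flops dense: {flops_dense}, flops block: {flops_block}")
```

The block variant still runs as a small dense multiplication, so it can be handed to an optimized small-GEMM routine while skipping most of the multiplications by zero.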
