Is it performance portability when I'm using (small) DGEMM?
Dagstuhl Seminar: Performance Portability in Extreme Scale Computing: Metrics, Challenges, Solutions
Michael Bader (and many others!)
Technical University of Munich
Oct 23–27, 2017
Co-Authors – Current SeisSol Group
LMU Munich – Geophysics:
Alice-Agnes Gabriel, Elizabeth Madden, Stephanie Wollherr, Thomas Ulrich
Technical University of Munich – HPC:
Sebastian Rettenberger, Carsten Uphoff
Further/former members: Alexander Breuer (TUM → San Diego), Alexander Heinecke (Intel), Christian Pelties (LMU → MunichRe), Leonhard Rannabauer (TUM)
Dynamic Rupture and Earthquake Simulation
Landers fault system: simulated ground motion and seismic waves [2]
SeisSol – ADER-DG for seismic simulations (www.seissol.org):
- adaptive tetrahedral meshes → complex geometries, heterogeneous media, multiphysics
- complicated fault systems with multiple branches → non-linear multiphysics dynamic rupture simulation
- ADER-DG: high-order discretisation in space and time
Part I Simulation of the 2004 Sumatra Megathrust Earthquake
SC17 paper [5] by Sebastian Rettenberger, Carsten Uphoff, Alice Gabriel, Betsy Madden, Stephanie Wollherr, Thomas Ulrich
Sumatra Earthquake – Seismology Challenges
[Figure: cross-section of the model domain showing layered continental and oceanic crust, the megathrust, a forethrust, and upper/lower backthrusts; roughly 1000 km x 50 km section, volume continues to 500 km depth]
Domain, mesh and geometry of the Sumatra scenario (images from [5])
- multiscale: the rupture extends over 1500 km, but happens on the meter scale
- complex geometry: shallow angles in the subduction zone; splay faults, topography, multiple material layers
- extremely long earthquake duration: 500 s simulated time (over 3 million smallest time steps) → local time stepping imperative (see the sketch below)
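As a minimal sketch of why clustered local time stepping helps (rate-2 clustering in the style of [5]; the function and names below are illustrative, not SeisSol's API): each element is binned into the largest cluster c whose step 2^c · Δt_min still satisfies the element's CFL limit, so large elements take far fewer steps than the millions taken by the smallest ones.

    def assign_lts_clusters(dt_elements, rate=2):
        """Bin elements into LTS clusters: cluster c advances with rate**c * dt_min.
        Illustrative sketch only; names do not correspond to SeisSol internals."""
        dt_min = min(dt_elements)
        clusters = {}
        for elem, dt in enumerate(dt_elements):
            c = 0
            while rate ** (c + 1) * dt_min <= dt:   # largest c with rate**c * dt_min <= dt
                c += 1
            clusters.setdefault(c, []).append(elem)
        return clusters

    # Example: per-element CFL time steps spanning two orders of magnitude
    for c, elems in sorted(assign_lts_clusters([1e-4, 2.5e-4, 8e-4, 3.2e-3, 1e-2]).items()):
        print(f"cluster {c}: dt = {2**c} * dt_min, elements {elems}")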
Sumatra Earthquake – HPC Challenges
[Plots: extrapolated runtime (h) over the number of nodes (16–512), and histogram of element and dynamic-rupture-face counts per LTS cluster (time step as multiple of Δt_min); configurations C: BL G6, C: BL L6, C: SC G6, C: SC L6, S: SC G6, S: SC L6]
Sumatra: histogram of LTS clusters and extrapolated runtimes (plots from [5])
- target manycore CPUs (Knights Landing → Cori supercomputer):
  limited cache/local memory per core → new flux computation;
  dynamic rupture became the bottleneck → matrix-based code generation
- dynamic rupture plus local time stepping with strong(!) scalability required
Sumatra 2004: 220 Million Elements on SuperMUC
HPC Facts – 13.9-Hour Production Run:
- 221 million elements with order 6 accuracy
- 111 billion degrees of freedom
- 11 LTS clusters: “smallest” elements performed 3.3 million time steps
- 500 s simulated time
- 1500 km fault size; 400 m geometrical resolution
- 2.2 Hz frequency content of the seismic wave field
- 0.94 PFLOPS sustained performance (86,016 Haswell cores at 2.2 GHz)
- 13 TB checkpoint data, 2.8 TB for post-processing
(asynchronous IO; costs entirely overlapped by computation)
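For orientation, the numbers are consistent: order-6 ADER-DG uses degree-5 polynomials, i.e. (5+1)(5+2)(5+3)/6 = 56 basis functions per tetrahedron; with 9 elastic quantities this gives 221 × 10^6 · 56 · 9 ≈ 1.11 × 10^11, i.e. the 111 billion degrees of freedom quoted above.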
Sumatra 2004 – Results
Splay Fault Activation and Ocean Floor Displacements
SeisSol – Recent Extensions
“Multiphysics” Simulations:
- viscoelastic attenuation; implementation based on the new matrix-based code generator (C. Uphoff, [4])
- off-fault plasticity (current work by S. Wollherr)
Workflow and HPC:
- asynchronous parallel IO using staging nodes or writer cores (S. Rettenberger, [13])
- input of 3D velocity models from data files via the parallel library ASAGI (S. Rettenberger, [14])
- simplified CAD generation and close-to-automatic meshing using SimModeler and the Simulation Modeling Suite by Simmetrix
Part II SeisSol as a Compute-Bound Code: Code Generation for Matrix Kernels
Breuer, Heinecke, Rannabauer, Bader [1]: High-Order ADER-DG Minimizes Energy- and Time-to-Solution of SeisSol (ISC’15)
Uphoff, Bader [4]: Generating high performance matrix kernels for earthquake simulations with viscoelastic attenuation (HPCS 2016)
Seismic Wave Propagation with SeisSol
Elastic Wave Equations (velocity-stress formulation):
$$q_t + A\,q_x + B\,q_y + C\,q_z = 0, \qquad q = (\sigma_{11}, \sigma_{22}, \sigma_{33}, \sigma_{12}, \sigma_{23}, \sigma_{13}, u, v, w)^T$$
The Jacobians $A, B, C \in \mathbb{R}^{9\times 9}$ are sparse; their nonzero entries are $-(\lambda + 2\mu)$, $-\lambda$, $-\mu$ and $-\rho^{-1}$ (Lamé parameters $\lambda$, $\mu$; density $\rho$).
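For reference, with the ordering of $q$ above, the x-direction Jacobian takes its standard velocity-stress form (the slide lists only its nonzero values; $B$ and $C$ are analogous, with the roles of $u$, $v$, $w$ and the stress components permuted):
$$
A = -\begin{pmatrix}
 & & & & & & \lambda+2\mu & & \\
 & & & & & & \lambda & & \\
 & & & & & & \lambda & & \\
 & & & & & & & \mu & \\
 & & & & & & & & \\
 & & & & & & & & \mu \\
 \rho^{-1} & & & & & & & & \\
 & & & \rho^{-1} & & & & & \\
 & & & & & \rho^{-1} & & &
\end{pmatrix}
\qquad \text{(blank entries are zero).}
$$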
- high order discontinuous Galerkin discretisation
- ADER-DG: high approximation order in space and time
- additional features: local time stepping, high accuracy of earthquake faulting (full frictional sliding) → Dumbser, Käser et al., e.g. [8]
SeisSol in a Nutshell – ADER-DG
Update scheme:
$$
Q_k^{n+1} = Q_k^n
- \frac{|S_k|}{|J_k|}\, M^{-1} \left[\, \sum_{i=1}^{4} F^{-,i}\, I(t^n, t^{n+1}, Q_k^n)\, N_{k,i}\, A_k^{+}\, N_{k,i}^{-1}
+ \sum_{i=1}^{4} F^{+,i,j,h}\, I(t^n, t^{n+1}, Q_{k(i)}^n)\, N_{k,i}\, A_{k(i)}^{-}\, N_{k,i}^{-1} \right]
$$
$$
\qquad + M^{-1} K^{\xi}\, I(t^n, t^{n+1}, Q_k^n)\, A_k^{*}
+ M^{-1} K^{\eta}\, I(t^n, t^{n+1}, Q_k^n)\, B_k^{*}
+ M^{-1} K^{\zeta}\, I(t^n, t^{n+1}, Q_k^n)\, C_k^{*}
$$

Cauchy–Kovalewski time integration:
$$
I(t^n, t^{n+1}, Q_k^n) = \sum_{j=0}^{J} \frac{(t^{n+1} - t^n)^{j+1}}{(j+1)!}\, \frac{\partial^j}{\partial t^j} Q_k(t^n),
\qquad
\frac{\partial}{\partial t} Q_k = -M^{-1} \left( (K^{\xi})^T Q_k A_k^{*} + (K^{\eta})^T Q_k B_k^{*} + (K^{\zeta})^T Q_k C_k^{*} \right)
$$
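A minimal numpy sketch of the Cauchy–Kovalewski time integration above (matrix names follow the update scheme; shapes and layouts are simplified assumptions, not SeisSol's generated kernels):

    import numpy as np

    def time_integrated_dofs(Q, Kxi, Keta, Kzeta, Astar, Bstar, Cstar, Minv, dt, order):
        """Cauchy-Kovalewski time integration of the element DOFs Q
        (rows: basis functions, columns: quantities). Illustrative only."""
        dQ = Q                      # j-th time derivative, starting with j = 0
        I = dt * Q                  # j = 0 term: dt^1 / 1! * Q
        factor = dt
        for j in range(1, order):   # J = order - 1 terms of the Taylor expansion
            # ADER recursion: d/dt Q = -M^{-1} (Kxi^T Q A* + Keta^T Q B* + Kzeta^T Q C*)
            dQ = -Minv @ (Kxi.T @ dQ @ Astar + Keta.T @ dQ @ Bstar + Kzeta.T @ dQ @ Cstar)
            factor *= dt / (j + 1)  # dt^(j+1) / (j+1)!
            I += factor * dQ
        return I

Every operation in the loop is a small matrix-matrix product, which is why the whole scheme maps onto (small) DGEMM-like kernels.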
Sparse, Dense → Block-Sparse
Consider equivalent sparsity patterns: (Uphoff, [4])
[Figure: sparsity patterns of matrices A1, A2, A3, their graph representation, and the resulting block-sparse memory layouts]
Code Generator for Matrix Chain Products
Programming Interface:
    db = Tools.parseMatrixFile('matrices.xml')
    Tools.memoryLayoutFromFile('layout.xml', db)
    arch = Arch.getArchitectureByIdentifier('dhsw')
    volume = db['kXiDivM'] * db['timeIntegrated'] * db['AstarT'] \
           + db['timeIntegrated'] * db['ET']
    kernels = [('volume', volume)]
    Tools.generate('path/to/output', db, kernels,
                   'path/to/libxsmm_gemm_generator', arch)
Code Generation:
- auto-tuning to choose dense / sparse / block-sparse matrix representations
- automatically determine the best order to evaluate matrix chain products (see the sketch below)
- efficient matrix multiplication backend: libxsmm library [10]
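Determining the evaluation order is an instance of the classic matrix-chain problem; a minimal sketch of the textbook dynamic-programming solution (dense flop counts only; the actual generator in [4] additionally weighs sparsity patterns and memory layouts):

    def optimal_chain_order(dims):
        """Textbook matrix-chain ordering: the i-th matrix has shape dims[i] x dims[i+1].
        Returns the minimal dense multiply count and the split table. Illustration only."""
        n = len(dims) - 1
        cost = [[0] * n for _ in range(n)]
        split = [[0] * n for _ in range(n)]
        for length in range(2, n + 1):              # length of the sub-chain
            for i in range(n - length + 1):
                j = i + length - 1
                cost[i][j] = float('inf')
                for k in range(i, j):               # try every split point
                    c = cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                    if c < cost[i][j]:
                        cost[i][j], split[i][j] = c, k
        return cost[0][n - 1], split

    # e.g. a chain of a 56x56 stiffness matrix, 56x9 time-integrated DOFs, 9x9 star matrix
    flops, _ = optimal_chain_order([56, 56, 9, 9])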
Floating-Point Performance (Haswell vs. KNC)
Single node, 65,000 elements, 1000 time steps, 6th order (Uphoff, [4])
[Bar chart: percentage of peak performance for elastic SD, visco Dense, visco SD and visco Tuned kernels over the number of attenuation mechanisms]
Dual-socket Xeon E5-2697 v3, 28 cores
Non-zero flops increase by 13% due to matrix partitioning.
[Bar chart: percentage of peak performance for elastic SD, visco Dense, visco SD and visco Tuned kernels over the number of attenuation mechanisms]
Xeon Phi 5110P, 60 cores
Non-zero flops increase by 7% due to matrix partitioning.
Part III Optimisation for Knights Landing
Heinecke et al., ISC’16 [3]; Uphoff et al., SC17 [5]
Optimizing SeisSol for Xeon Phi (Knights Landing)
Step 1: Memory Optimization (Heinecke, Breuer et al., ISC’16 [3])
- benefit from the Knights Landing optimizations of the libxsmm library [10]
- examine impact of DRAM-only, CACHE and FLAT mode
- FLAT mode: careful placement of element-local matrices in MCDRAM, depending on order (table below; see also the sketch after it):

  Order | Q_k    | B_k, D_k | A^{ξc}_k, A^{-,i}_k, A^{+,i}_k | K^{ξc}, F^{-,i}, F^{+,i,j,h}
    2   | MCDRAM | MCDRAM   | MCDRAM                         | MCDRAM
    3   | MCDRAM | MCDRAM   | MCDRAM                         | MCDRAM
    4   | DDR4   | MCDRAM   | MCDRAM                         | MCDRAM
    5   | DDR4   | MCDRAM   | DDR4                           | MCDRAM
    6   | DDR4   | MCDRAM   | DDR4                           | MCDRAM
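The table can be read as a capacity-driven decision: keep the most bandwidth-critical matrix groups in the 16 GB MCDRAM and spill the rest to DDR4 once they no longer fit. A hypothetical greedy sketch (group names, sizes and the greedy rule are illustrative assumptions, not the implementation of [3]):

    def place_in_mcdram(groups, mcdram_bytes):
        """Greedy FLAT-mode placement: walk matrix groups from most to least
        bandwidth-critical and keep them in MCDRAM while capacity lasts."""
        placement, used = {}, 0
        for name, size in groups:   # groups sorted by decreasing access intensity
            if used + size <= mcdram_bytes:
                placement[name], used = "MCDRAM", used + size
            else:
                placement[name] = "DDR4"
        return placement

    # Example: global matrices are tiny and constantly reused; DOFs grow with order
    groups = [("global stiffness/flux matrices", 2 << 20),
              ("buffers/derivatives B_k, D_k", 6 << 30),
              ("flux solvers A_k", 4 << 30),
              ("DOFs Q_k", 12 << 30)]
    print(place_in_mcdram(groups, mcdram_bytes=16 << 30))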
Step 2: Improved Flux Computation & Dynamic Rupture (Uphoff, SC17)
- exploit code generation based on matrix chain products
- fluxes: Riemann solvers expressed via matrix chain products → reformulated via smaller matrices (slightly fewer operations; much smaller cache footprint)
- dynamic rupture: derive new scheme based on chain products
Performance Results on Knights Landing
Phase 1: Heinecke et al., ISC 16
[Bar chart: speedup over dual-socket Haswell (HSX) with GTS for convergence orders 2–6, comparing HSX, KNC, and KNL in DDR4-only, CACHE and FLAT modes; measured with AVX core frequency and with Turbo Boost]
Landers scenario, 466,574 elements (plot from [3])
Performance Results on Knights Landing
Phase 2: New Results on Cori (C. Uphoff et al., SC17)
[Plot: GFLOPS per node over the number of nodes (16–512); configurations C: BL G6, C: BL L6, C: SC G6, C: SC L6, S: SC G6, S: SC L6]
Sumatra scenario, mesh with 51 million elements (plot from [5])
Performance Results on Haswell
Phase 2: New Results on SuperMUC and Shaheen-II (SC17)
[Plot: GFLOPS per node over the number of nodes (128–3072); configurations M: BL G6, M: BL L6, M: SC G6, M: SC L6, S: SC G6*, S: SC L6*, M: SC L7]
Sumatra scenario, production mesh with 220 million elements (plot from [5])
Discussion – Performance Portability in SeisSol??
Element-local kernels:
- code generation based on matrix chain products to accelerate all element kernels
- high convergence order and high computational intensity of ADER-DG → compute-bound performance on current and imminent CPUs
- matrix-based representation seems to be “the right thing” → only requires some “syntactic sugar”?
- driven by model extensions: visco-elastic attenuation, off-fault plasticity → will your “user” express everything in matrices?
- careful tuning and parallelisation of the entire simulation pipeline (scalable mesh input, output and checkpointing)
Mesh iteration: (not discussed in this talk)
- performance portability basically non-existent, yet (x86 focus; no GPUs, etc.)
- key issues encountered on CPUs: cluster-wise local time stepping, pre-fetching of element-local data
Acknowledgements
Special thanks go to:
- the entire SeisSol team and all contributors, esp.:
– Sebastian Rettenberger, Carsten Uphoff (TUM)
– Alice Gabriel, Christian Pelties, Stephanie Wollherr (LMU)
– Alex Breuer (SDSC, former: TUM)
– Alex Heinecke (Intel, former: TUM)
- Leibniz Supercomputing Centre (esp. Nicolay Hammer): 30 million CPU hours; 30-hour block operation on SuperMUC
- KAUST (esp. Martin Mai): access to Shaheen-II
- NERSC, Berkeley Lab (Rich Gerber, Jack Deslippe): access to Cori
- Intel: IPCC ExScaMIC – “Extreme Scaling on MIC-KNL”
- Volkswagen Foundation (project ASCETE)
Publications
[1] A. Breuer, A. Heinecke, L. Rannabauer, M. Bader: High-Order ADER-DG Minimizes Energy- and Time-to-Solution of SeisSol. In: High Performance Computing, Proceedings of ISC’15, LNCS 9137, p. 340–357, 2015.
[2] A. Heinecke, A. Breuer, S. Rettenberger, M. Bader, A.-A. Gabriel, C. Pelties, A. Bode, W. Barth, X.-K. Liao, K. Vaidyanathan, M. Smelyanskiy, P. Dubey: Petascale High Order Dynamic Rupture Earthquake Simulations on Heterogeneous Supercomputers. Gordon Bell Prize Finalist 2014.
[3] A. Heinecke, A. Breuer, M. Bader, P. Dubey: High Order Seismic Simulations on the Intel Xeon Phi Processor (Knights Landing). ISC High Performance, 2016.
[4] C. Uphoff, M. Bader: Generating high performance matrix kernels for earthquake simulations with viscoelastic attenuation. The 2016 International Conference on High Performance Computing & Simulation (HPCS 2016), p. 908–916. IEEE, 2016.
[5] C. Uphoff, S. Rettenberger, M. Bader, E. H. Madden, T. Ulrich, S. Wollherr, A.-A. Gabriel: Extreme Scale Multi-Physics Simulations of the Tsunamigenic 2004 Sumatra Megathrust Earthquake. SC ’17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2017. Finalist for Best Paper Award.
Publications and References
[8] M. Dumbser, M. Käser: An arbitrary high-order discontinuous Galerkin method for elastic waves on unstructured meshes – II. The three-dimensional isotropic case. Geophys. J. Int. 167(1), 2006.
[9] M. Dumbser, M. Käser, E. Toro: An Arbitrary High Order Discontinuous Galerkin Method for Elastic Waves on Unstructured Meshes – V. Local Time Stepping and p-Adaptivity. Geophys. J. Int. 171(2), 2007.
[10] A. Heinecke, G. Henry, M. Hutchinson, H. Pabst: LIBXSMM: Accelerating Small Matrix Multiplications by Runtime Code Generation. SC16.
[11] C. Pelties, A.-A. Gabriel, J.-P. Ampuero: Verification of an ADER-DG method for complex dynamic rupture problems. Geoscientific Model Development, 7(3), p. 847–866.
[12] C. Pelties, J. de la Puente, J.-P. Ampuero, G. B. Brietzke, M. Käser: Three-dimensional dynamic rupture simulation with a high-order discontinuous Galerkin method on unstructured tetrahedral meshes. J. Geophys. Res.: Solid Earth, 117(B2), 2012.
[13] S. Rettenberger, M. Bader: Optimizing Large Scale I/O for Petascale Seismic Simulations on Unstructured Meshes. 2015 IEEE International Conference on Cluster Computing (CLUSTER), p. 314–317. IEEE Xplore, 2015.
[14] S. Rettenberger, O. Meister, M. Bader, A.-A. Gabriel: ASAGI – A Parallel Server for Adaptive Geoinformation. Proceedings of the Exascale Applications and Software Conference 2016 (EASC ’16), p. 2:1–2:9. ACM, 2016.