
EDGE: Extreme-scale Discontinuous Galerkin Environment



  1. EDGE: Extreme-scale Discontinuous Galerkin Environment Alexander Breuer, Alexander Heinecke (Intel), Yifeng Cui

  2. Getting Started: Advection Equation
     $q_t(x,t) + v \cdot q_x(x,t) = 0, \quad v \in \mathbb{R}$
     • “Simplest” hyperbolic Partial Differential Equation (PDE)
     • Elastic wave equations are similar: a linear system with variable coefficients
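
As a short worked step, the constant-coefficient advection equation has a closed-form solution: the initial profile is transported unchanged with velocity $v$. The symbol $q_0$ for the initial condition is introduced here for illustration and does not appear in the slides.

```latex
q(x,t) = q_0(x - v t), \qquad
q_t = -v \, q_0'(x - v t), \quad
q_x = q_0'(x - v t)
\;\Longrightarrow\;
q_t + v \cdot q_x = 0 .
```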

  3. Getting Started: Fused Solver
     $q_t(x,t) + v \cdot q_x(x,t) = 0, \quad v \in \mathbb{R}$
     • Non-Fused: $o_1 = s(i_1)$, $o_2 = s(i_2)$, $o_3 = s(i_3)$, $o_4 = s(i_4)$
     • Fused: $O_4 = (o_1, o_2, o_3, o_4) = S_4(I_4) = S_4(i_1, i_2, i_3, i_4)$
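
The difference between the non-fused solver s and the fused solver S_4 can be sketched in a few lines of Python. This is a minimal illustration, not EDGE's implementation: it assumes a first-order upwind finite-volume update with periodic boundaries and v > 0, and the names `step`, `fused_step`, `inputs`, and `c` (the ratio v·dt/dx) are hypothetical.

```python
def step(q, c):
    """Non-fused solver s: one upwind step for a single simulation.

    q is a list of cell averages, c = v * dt / dx with v > 0.
    q[i - 1] wraps around for i = 0, which gives periodic boundaries.
    """
    return [q[i] - c * (q[i] - q[i - 1]) for i in range(len(q))]


def fused_step(Q, c):
    """Fused solver S: the same update applied to several runs at once.

    Q[i][r] holds cell i of run r, so the runs form the innermost
    (fastest) dimension, which is the one a vectorized kernel would
    map to SIMD lanes.
    """
    n_runs = len(Q[0])
    return [
        [Q[i][r] - c * (Q[i][r] - Q[i - 1][r]) for r in range(n_runs)]
        for i in range(len(Q))
    ]


# Four independent inputs i_1..i_4 (illustrative initial conditions).
inputs = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]
c = 0.5

# Non-fused: four separate solver calls, o_k = s(i_k).
non_fused = [step(i_k, c) for i_k in inputs]

# Fused: interleave the inputs into I_4, call the solver once,
# then split O_4 back into the per-run outputs.
I4 = [list(cell) for cell in zip(*inputs)]
O4 = fused_step(I4, c)
fused = [list(run) for run in zip(*O4)]

print(fused == non_fused)
```

Both paths perform the same arithmetic per value; only the data layout and the number of solver calls differ.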

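  5. DOFs: Non-Fused vs. Fused
     [Figure: memory layout of the degrees of freedom (DOFs). Axes: fused runs, modes, elements. Non-fused: with three modes per element, the DOFs are numbered mode by mode within each element (element 0: 0-2, element 1: 3-5, element 2: 6-8). Fused: with four fused runs, the runs are the fastest dimension, followed by modes and then elements (element 0: mode 0 → 0-3, mode 1 → 4-7, mode 2 → 8-11; the next element starts at 12).]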
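
The two numberings in the figure can be written as index formulas. The sketch below is an illustration inferred from the figure's numbering; the function and parameter names are hypothetical and not taken from EDGE.

```python
def non_fused_index(element, mode, n_modes):
    """Linear DOF index without fusion: modes are contiguous per element."""
    return element * n_modes + mode


def fused_index(element, mode, run, n_modes, n_fused):
    """Linear DOF index with fusion: fused runs are the fastest dimension,
    followed by modes, then elements."""
    return (element * n_modes + mode) * n_fused + run


# Reproduce the numbering of the figure: 3 modes, 4 fused runs.
n_modes, n_fused = 3, 4
for element in range(2):
    for mode in range(n_modes):
        row = [fused_index(element, mode, run, n_modes, n_fused)
               for run in range(n_fused)]
        print(f"element {element}, mode {mode}: {row}")
```

Because the values of all runs for one mode sit next to each other, a single vector load covers the same DOF of every fused simulation.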

  6. Key Advantages
     • Full vector operations, even for sparse matrix operators
     • Automatic memory alignment
     • Read-only data shared among all runs
     • Lower sensitivity to latency (memory & network)
     [Figure: bar chart of the relative arithmetic intensity for 1, 2, 4, 8, and 16 fused simulations per order; values range from 1.0 (non-fused) to 6.8.] Relative arithmetic intensities. Shown are convergence rates 2-5 and fusion of 2, 4, 8, 16 simulations vs. non-fused for the elastic wave equations, using an ADER-DG solver. [ISC17]

  7. “Similar Enough”: EDGE’s Approach
     1. Identical mesh for all fused simulations
     2. Identical simulation parameters:
        1. Start and end time
        2. Convergence rate
        3. “Frequency” of wave field output, “frequency” and location of seismic receivers
     3. Identical material parameters (velocity model)
     4. “Sources”:
        1. Arbitrary initial DOFs
        2. Kinematic sources: Fused or non-fused point sources
        3. Spontaneous rupture: Identical friction law, other parameters (e.g., nucleation, initial stresses, coefficients) arbitrary
     [Figure: eight panels, numbered 1-8. Caption, start lost in extraction: “[...] simulations (SoA) with point sources at different locations”]
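
The criteria above amount to a compatibility check: settings that must be identical are compared across all candidate simulations, while the source description may differ. The sketch below is purely illustrative; `SimulationConfig`, its fields, and `can_fuse` are hypothetical names and do not describe EDGE's actual configuration format.

```python
from dataclasses import dataclass, replace
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class SimulationConfig:
    """Hypothetical per-simulation settings (not EDGE's real format)."""
    mesh: str
    start_time: float
    end_time: float
    order: int                      # convergence rate
    wave_field_interval: float      # "frequency" of wave field output
    receiver_interval: float        # "frequency" of receiver output
    receivers: Tuple[Tuple[float, float, float], ...]
    velocity_model: str             # material parameters
    friction_law: Optional[str]     # only used for spontaneous rupture
    source: str                     # may differ between fused simulations

    def shared_settings(self):
        """Everything that has to match for fusion; the source is excluded."""
        return (self.mesh, self.start_time, self.end_time, self.order,
                self.wave_field_interval, self.receiver_interval,
                self.receivers, self.velocity_model, self.friction_law)


def can_fuse(configs: Sequence[SimulationConfig]) -> bool:
    """True if all simulations agree on every shared setting."""
    return len({c.shared_settings() for c in configs}) <= 1


base = SimulationConfig(
    mesh="loh1.msh", start_time=0.0, end_time=9.0, order=4,
    wave_field_interval=0.5, receiver_interval=0.01,
    receivers=((8647.0, 5764.0, 0.0),), velocity_model="loh1",
    friction_law=None, source="point_source_a",
)
other_source = replace(base, source="point_source_b")
other_order = replace(base, order=5)

print(can_fuse([base, other_source]))  # sources may differ
print(can_fuse([base, other_order]))   # convergence rate must match
```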

  8. Performance: LOH.1
     • Orders: 2-6 (non-fused), 2-4 (fused)
     • Unstructured tetrahedral mesh: 350,264 elements
     • Single node of Cori-II (68 core Intel Xeon Phi x200, code-named Knights Landing)
     • EDGE vs. SeisSol (GTS, git-tag 201511)
     [Figure: LOH.1 Benchmark: Example mesh and material regions [ISC16_1]]
     [Figure: seismogram, u (m/s) over time (s); legend: reference, EDGE O4.] Synthetic seismogram of EDGE for quantity u at the ninth seismic receiver located at (8647 m, 5764 m, 0) in red. The reference solution is shown in black. Detailed setup: [ISC17]

  9. Fused Simulations: Speedup
     [Figure: bar chart of the speedup of EDGE over SeisSol per configuration (order, #fused simulations): O2C1, O2C8, O3C1, O3C8, O4C1, O4C8, O5C1, O6C1. Bar labels as extracted, not matched to configurations: 4.60, 2.87, 1.82, 1.24, 0.91, 0.96, 0.80, 0.74.]
     Speedup of EDGE over SeisSol (GTS, git-tag 201511). Convergence rates O2 − O6: single non-fused forward simulations (O2C1-O6C1). Additionally, per-simulation speedups for orders O2 − O4 when using EDGE’s full capabilities by fusing eight simulations (O2C8-O4C8). [ISC17]

  10. Weak: Setup
     • Regular cubic mesh, 5 Tets per Cube, 4th order (P3) and 6th order (P5)
     • Imitates convergence benchmark
     • 276K elements per node
     • 1-9000 nodes of Cori-II (9000 nodes = 612,000 cores)
     [Figure: convergence plot, L∞ error over edge length (m); curves for orders O1-O5, quantity Q8, simulations C1, C4, C8.] Convergence of EDGE in the L∞-norm. Shown are orders O1 − O5 for v (Q8) when utilizing EDGE’s fusion capabilities with shifted initial conditions. For clarity, from the total of eight fused simulations, only errors of the first (C1), fourth (C4) and last simulation (C8) are shown. [ISC17]

  11. Weak: Results
     • O6C1 @ 9K nodes: 10.4 PFLOPS (38% of peak)
     • O4C8 vs. O4C1 @ 9K nodes: 2.0x speedup
     [Figure: weak scaling plot.] Weak scaling study on Cori-II. Shown are hardware and non-zero peak efficiencies in flat mode. O denotes the order and C the number of fused simulations. [ISC17]

  12. Strong: LOH.1
     • Orders: 4 & 6 (non-fused), 4 (fused)
     • Unstructured tetrahedral mesh: 172,386,915 elements
     • 32-3200 nodes of Theta (64 core Intel Xeon Phi x200, code-named Knights Landing)
     • 3200 nodes = 204,800 cores
     [Figure: LOH.1 Benchmark: Example mesh and material regions [ISC16_1]]
     [Figure: time-frequency misfit plot, frequency (Hz) over time (s).] Time-frequency misfit for quantity u at the ninth seismic receiver located at (8647 m, 5764 m, 0) and in a frequency range between 0.13 Hz and 5 Hz. Detailed setup: [ISC17], Visualization: TF-MISFIT_GOF_CRITERIA, http://nuquake.eu

  13. Strong: Results
     • O6C1 @ 3.2K nodes: 3.4 PFLOPS (40% of peak)
     • O4C8 vs. O4C1 @ 3.2K nodes: 2.0x speedup
     [Figure: strong scaling plot.] Strong scaling study on Theta. Shown are hardware and non-zero peak efficiencies in flat mode. O denotes the order and C the number of fused simulations. [ISC17]
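
The quoted figures allow a back-of-the-envelope check of the machine sizes and the peak rates they imply. The sketch below only rearranges numbers from the slides (10.4 PFLOPS at 38% on 9000 Cori-II nodes, 3.4 PFLOPS at 40% on 3200 Theta nodes). The percentages are rounded and the slides do not say which peak they refer to, so the derived values are rough estimates and not official specifications. The function name is hypothetical.

```python
def implied_peak_pflops(sustained_pflops, fraction_of_peak):
    """Peak rate implied by a sustained rate and its fraction of peak."""
    return sustained_pflops / fraction_of_peak


# Core counts as stated on the setup slides.
cori_cores = 9000 * 68    # weak scaling, Cori-II
theta_cores = 3200 * 64   # strong scaling, Theta

# Implied full-machine peaks in PFLOPS (rough, percentages are rounded).
weak_peak = implied_peak_pflops(10.4, 0.38)
strong_peak = implied_peak_pflops(3.4, 0.40)

# Implied per-node peaks in TFLOPS (1 PFLOPS = 1000 TFLOPS).
weak_per_node = weak_peak * 1000 / 9000
strong_per_node = strong_peak * 1000 / 3200

print(cori_cores, theta_cores)
print(round(weak_peak, 1), round(strong_peak, 1))
print(round(weak_per_node, 2), round(strong_per_node, 2))
```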

  14. EDGE: Current and Upcoming
     Current:
     • Elements: Line, rectangular quads, 3-node triangles, rectangular hexes, 4-node tets
     • Equations: Advection (FV+ADER-DG: 1D, 2D, 3D), Shallow Water (FV: 1D), Elastic Wave Equations (FV+ADER-DG: 2D, 3D)
     • Parallelization: Assembly kernels for WSM, SNB, HSW, KNC (non-fused), KNL (fused & non-fused), OpenMP (custom), MPI (overlapping)
     • Continuity: Continuous Integration (sanity checks), Continuous Delivery (automated convergence + benchmark runs), automated code coverage, automated license checks, container bootstrap
     Upcoming:
     • Sparse, fused assembly kernels for orders 5+
     • Kinematic Sources (Standard Rupture Format): Support for fused and non-fused source descriptions
     • Spontaneous Rupture Simulations
     • Grouped Local Time Stepping
     • EDGEcut: Automated surface and volume meshing
     • Public in next few weeks: http://dial3343.org
     • License: 3-clause BSD
