
S7527 - Unstructured low-order finite-element earthquake simulation using OpenACC on Pascal GPUs



  1. GPU Technology Conference, May 11, 2017
  S7527 - Unstructured low-order finite-element earthquake simulation using OpenACC on Pascal GPUs
  Takuma Yamaguchi, Kohei Fujita, Tsuyoshi Ichimura, Muneo Hori, Lalith Maddegedara

  2. Introduction
  • Society highly anticipates contributions of high-performance computing to earthquake disaster mitigation
  • We are developing a comprehensive earthquake simulation that simulates all phases of an earthquake disaster by making full use of the CPU-based K computer system
  • Simulating all phases of an earthquake is made possible by speeding up the core solver
  • SC14 Gordon Bell Prize finalist, SC15 Gordon Bell Prize finalist & SC16 Best Poster award
  • The core solver is also useful for the manufacturing industry
  • Today's topic: porting this solver to a GPU-CPU heterogeneous environment and reporting performance on Pascal GPUs
  • K computer: 8-core CPU x 82,944-node system with a peak performance of 10.6 PFLOPS (7th in the Top 500)
  [Figure: earthquake disaster process]

  3. Comprehensive earthquake simulation
  [Figure: a) Earthquake wave propagation, from the surface down to -7 km; b) City response simulation across central Tokyo (Ikebukuro, Shinjuku, Ueno, Tokyo station, Shinbashi, Shibuya), the world's largest finite-element simulation, enabled by the developed solver; c) Resident evacuation, with two million agents evacuating to the nearest safe site after the earthquake]

  4. Target problem
  • Solve a large matrix equation many times
  • Arises from the unstructured finite-element analyses used in many components of the comprehensive earthquake simulation
  • Involves many random data accesses & much communication
  • Difficulty of the problem: attaining load balance, peak performance, convergence of the iterative solver, and a short time-to-solution at the same time
  • The equation is Ku = f, where K is a sparse, symmetric positive-definite matrix, u is the unknown vector with 1 trillion degrees of freedom, and f is the outer force vector

  5. Designing a scalable & fast finite-element solver
  • Design an algorithm that can obtain equal granularity at O(million) cores
  • The matrix-free matrix-vector product (Element-by-Element method) is promising: good load balance when the number of elements per core is equal
  • Also high peak performance, as it is on-cache computation
  • Element-by-Element method: f = Σ_e P_e^T K_e P_e u, where each element matrix K_e is generated on the fly and the per-element results for elements #0 to #N-1 are added into the global vector (a serial sketch follows below)
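A minimal serial sketch of the matrix-free Element-by-Element product above, for linear tetrahedra with 3 degrees of freedom per node. The connectivity array cny and the routine element_matrix are illustrative assumptions, not names from the original code; the point is only the gather, on-the-fly K_e, and scatter-add structure.

    ! Sketch of f = sum_e Pe^T Ke Pe u for linear tetrahedra (4 nodes x 3 dof each).
    ! cny(1:4,e) holds the global node numbers of element e; Ke is built on the fly.
    subroutine ebe_matvec(ne, nn, cny, coor, u, f)
      implicit none
      integer, intent(in)  :: ne, nn, cny(4, ne)
      real(8), intent(in)  :: coor(3, nn), u(3, nn)
      real(8), intent(out) :: f(3, nn)
      real(8) :: Ke(12, 12), ue(12), fe(12)
      integer :: e, a

      f = 0.0d0
      do e = 1, ne
        call element_matrix(coor(:, cny(:, e)), Ke)   ! generate Ke on the fly
        do a = 1, 4                                   ! gather: ue = Pe u
          ue(3*a-2:3*a) = u(:, cny(a, e))
        end do
        fe = matmul(Ke, ue)                           ! fe = Ke ue
        do a = 1, 4                                   ! scatter-add: f += Pe^T fe
          f(:, cny(a, e)) = f(:, cny(a, e)) + fe(3*a-2:3*a)
        end do
      end do
    end subroutine ebe_matvec

The scatter-add in the last inner loop is exactly the data recurrence that later slides address with per-thread buffers on CPUs and with coloring or atomics on GPUs.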

  6. Designing a scalable & fast finite-element solver
  • Conjugate Gradient method + Element-by-Element method + simple preconditioner ➔ scalability & peak performance good, but poor convergence ➔ time-to-solution not good
  • Conjugate Gradient method + sophisticated preconditioner ➔ convergence good, but scalability or peak performance (sometimes both) not good ➔ time-to-solution not good

  7. Designing a scalable & fast finite-element solver
  • Conjugate Gradient method + Element-by-Element method + multi-grid + mixed precision + adaptive preconditioner ➔ scalability & peak performance good (all computation based on Element-by-Element), convergence good ➔ time-to-solution good
  • Key to making this solver even faster: make the Element-by-Element method super fast (a hedged outline of the mixed-precision loop follows below)
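The slide names the ingredients but not the code, so the following outline is an assumption rather than the authors' implementation: a double-precision preconditioned CG loop whose preconditioning step is a rough single-precision solve (the placeholder routine inner_cg_sp stands in for the multi-grid/adaptive part), so that most FLOPs run in single precision while the outer iteration keeps double-precision accuracy. In practice a flexible CG variant may be needed when the preconditioner changes between iterations; ebe_matvec_dp and inner_cg_sp are hypothetical helpers.

    ! Hedged outline of mixed-precision preconditioned CG (not the authors' code).
    subroutine pcg_mixed(n, u, b, tol, maxiter)
      implicit none
      integer, intent(in)    :: n, maxiter
      real(8), intent(inout) :: u(n)
      real(8), intent(in)    :: b(n), tol
      real(8) :: r(n), z(n), p(n), q(n)
      real(4) :: r_sp(n), z_sp(n)
      real(8) :: rho, rho_old, alpha, bnorm
      integer :: iter

      call ebe_matvec_dp(u, q)                 ! q = K u (double-precision EBE)
      r = b - q
      bnorm = sqrt(dot_product(b, b))
      rho_old = 1.0d0
      do iter = 1, maxiter
        r_sp = real(r, 4)
        call inner_cg_sp(r_sp, z_sp)           ! z ~= K^-1 r, cheap single-precision solve
        z = real(z_sp, 8)
        rho = dot_product(r, z)
        if (iter == 1) then
          p = z
        else
          p = z + (rho / rho_old) * p
        end if
        call ebe_matvec_dp(p, q)               ! q = K p via Element-by-Element
        alpha = rho / dot_product(p, q)
        u = u + alpha * p
        r = r - alpha * q
        rho_old = rho
        if (sqrt(dot_product(r, r)) < tol * bnorm) exit
      end do
    end subroutine pcg_mixed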

  8. Fast Element-by-Element method
  • The Element-by-Element method on an unstructured mesh involves many random accesses & much computation
  • Use a structured mesh where possible to reduce these costs (a sketch of the structured-part kernel follows below)
  • Fast & scalable solver algorithm + fast Element-by-Element method enables very good scalability, peak performance, convergence & time-to-solution on the K computer
  • Nominated as Gordon Bell Prize finalist at SC14 and SC15
  • Operation count for the Element-by-Element kernel (linear elements): compared with the pure unstructured mesh, the structured mesh reduces random register-to-L1 cache accesses to roughly 1/3.6 and the FLOP count to roughly 1/3.0
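Referenced in the second bullet above: a hedged sketch of why the structured (voxel) part is cheaper. All equally sized voxels can share one precomputed element matrix, and their connectivity can be generated from (i, j, k) indices instead of being loaded from memory, which removes most of the random access. The names Kvox and node_ids are illustrative, not taken from the original code.

    ! Sketch: EBE product over the structured (voxel) part of the mesh.
    ! Every voxel shares one precomputed 24x24 matrix Kvox (8 nodes x 3 dof),
    ! and node numbers are computed from (i,j,k) instead of read from memory.
    subroutine ebe_matvec_voxel(nx, ny, nz, Kvox, u, f)
      implicit none
      integer, intent(in)    :: nx, ny, nz
      real(4), intent(in)    :: Kvox(24, 24), u(3, (nx+1)*(ny+1)*(nz+1))
      real(4), intent(inout) :: f(3, (nx+1)*(ny+1)*(nz+1))   ! contributions are added to f
      integer :: i, j, k, a, nid(8)
      real(4) :: ue(24), fe(24)

      do k = 1, nz
        do j = 1, ny
          do i = 1, nx
            nid = node_ids(i, j, k, nx, ny)        ! connectivity from indices
            do a = 1, 8
              ue(3*a-2:3*a) = u(:, nid(a))
            end do
            fe = matmul(Kvox, ue)                  ! same Ke reused for all voxels
            do a = 1, 8
              f(:, nid(a)) = f(:, nid(a)) + fe(3*a-2:3*a)
            end do
          end do
        end do
      end do
    contains
      pure function node_ids(i, j, k, nx, ny) result(nid)
        integer, intent(in) :: i, j, k, nx, ny
        integer :: nid(8), base
        base = (k-1)*(nx+1)*(ny+1) + (j-1)*(nx+1) + i
        nid = [ base, base+1, base+(nx+1), base+(nx+1)+1,            &
                base+(nx+1)*(ny+1), base+(nx+1)*(ny+1)+1,            &
                base+(nx+1)*(ny+1)+(nx+1), base+(nx+1)*(ny+1)+(nx+1)+1 ]
      end function node_ids
    end subroutine ebe_matvec_voxel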

  9. Motivation & aim of this study
  • Demand for conducting comprehensive earthquake simulations on a variety of compute systems
  • Joint projects are ongoing with government/companies for actual use in disaster mitigation
  • Users have access to different types of compute environments
  • Advances in GPU accelerator systems: improvement in compute capability & performance-per-watt
  • We aim to port the high-performance CPU-based solver to GPU-CPU heterogeneous systems
  • Extend usability to a wider range of compute systems & attain further speedup

  10. Porting approach
  • The same algorithm is expected to be even more effective on GPU-CPU heterogeneous systems
  • Use of mixed precision (most computation done in single precision instead of double precision) is more effective
  • Reducing random accesses via the structured mesh is more effective
  • Developing a high-performance Element-by-Element kernel for the GPU becomes the key to a fast solver
  • Our approach: attain high performance with low porting cost
  • Directly port CPU code for simple kernels with OpenACC (a minimal example follows below)
  • Redesign the algorithm of the Element-by-Element kernel for the GPU
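As an illustration of the "directly port simple kernels with OpenACC" step, the sketch below adds directives to a vector update of the kind that appears throughout a CG solver. The routine name and arrays are illustrative; the point is that the loop body is unchanged and only directives are added.

    ! Sketch: porting a simple solver kernel (u = u + alpha * p) with OpenACC.
    subroutine axpy_acc(n, alpha, p, u)
      implicit none
      integer, intent(in)    :: n
      real(4), intent(in)    :: alpha, p(3, n)
      real(4), intent(inout) :: u(3, n)
      integer :: i

    !$acc data present(p, u)        ! arrays assumed to be resident on the GPU already
    !$acc parallel loop
      do i = 1, n
        u(1, i) = u(1, i) + alpha * p(1, i)
        u(2, i) = u(2, i) + alpha * p(2, i)
        u(3, i) = u(3, i) + alpha * p(3, i)
      end do
    !$acc end data
    end subroutine axpy_acc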

  11. Element-by-Element kernel algorithm for CPUs
  • The Element-by-Element kernel involves data recurrence: different elements add into the same node of the global vector f
  • Algorithm for avoiding data recurrence on CPUs: use temporary buffers per core & per SIMD lane
  • Suitable for small core counts with large cache capacity (a simplified per-thread sketch follows below)
  [Figure: Element-by-Element method, elements #0 to #N-1 adding their K_e u contributions into shared entries of f]
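A hedged sketch of the CPU strategy above, simplified to one temporary buffer per OpenMP thread (the original also keeps buffers per SIMD lane): each thread accumulates into its own private copy of f, and the copies are summed afterwards, so no two threads ever add to the same node concurrently. The routine element_matrix and the array names are the same illustrative assumptions used in the earlier sketch.

    ! Sketch: CPU EBE with per-thread temporary buffers to avoid the data recurrence.
    subroutine ebe_matvec_omp(ne, nn, cny, coor, u, f)
      use omp_lib
      implicit none
      integer, intent(in)  :: ne, nn, cny(4, ne)
      real(8), intent(in)  :: coor(3, nn), u(3, nn)
      real(8), intent(out) :: f(3, nn)
      real(8), allocatable :: fbuf(:, :, :)
      real(8) :: Ke(12, 12), ue(12), fe(12)
      integer :: e, a, t, nt

      nt = omp_get_max_threads()
      allocate(fbuf(3, nn, nt))
      fbuf = 0.0d0

    !$omp parallel private(e, a, t, Ke, ue, fe)
      t = omp_get_thread_num() + 1
    !$omp do
      do e = 1, ne
        call element_matrix(coor(:, cny(:, e)), Ke)
        do a = 1, 4
          ue(3*a-2:3*a) = u(:, cny(a, e))
        end do
        fe = matmul(Ke, ue)
        do a = 1, 4                          ! add into this thread's private buffer
          fbuf(:, cny(a, e), t) = fbuf(:, cny(a, e), t) + fe(3*a-2:3*a)
        end do
      end do
    !$omp end do
    !$omp end parallel

      f = sum(fbuf, dim=3)                   ! reduce the per-thread buffers
      deallocate(fbuf)
    end subroutine ebe_matvec_omp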

  12. Element-by-Element kernel algorithm for GPUs
  • GPUs are designed to hide latency by running many threads on O(10^3) physical cores
  • Cannot allocate temporary buffers per thread in GPU memory
  • Algorithms for adding up thread-wise results on GPUs:
  • Coloring, often used on previous GPUs: independent of cache and atomics
  • Recent GPUs have improved caches and atomics: using atomics is expected to improve performance, since the data (u) can be reused from cache
  [Figure: elements #0, #1, ... adding into f with atomic add, with u kept on cache]

  13. Implementation of GPU computation
  • OpenACC: port to the GPU by inserting a few directives
  • Parallelize the element loop i (threads are launched for the element loop)
  • Atomically operate to avoid data races (atomic version)
  • CPU-GPU data transfer handled by data directives

  a) Coloring version:

    !$ACC DATA PRESENT(...)
    do icolor = 1, ncolor
    !$ACC PARALLEL LOOP
      do i = ns(icolor), ne(icolor)
        ! read arrays
        ...
        ! compute Ku
        Ku11 = ...
        Ku12 = ...
        ...
        ! add to global vector
        f(1,cny1) = Ku11 + f(1,cny1)
        f(2,cny1) = Ku21 + f(2,cny1)
        ...
        f(3,cny4) = Ku34 + f(3,cny4)
      enddo
    enddo
    !$ACC END DATA

  b) Atomic add version:

    !$ACC DATA PRESENT(...)
    !$ACC PARALLEL LOOP
    do i = 1, ne
      ! read arrays
      ...
      ! compute Ku
      Ku11 = ...
      Ku12 = ...
      ...
      ! add to global vector
    !$ACC ATOMIC
      f(1,cny1) = Ku11 + f(1,cny1)
    !$ACC ATOMIC
      f(2,cny1) = Ku21 + f(2,cny1)
      ...
    !$ACC ATOMIC
      f(3,cny4) = Ku34 + f(3,cny4)
    enddo
    !$ACC END DATA
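For reference, OpenACC Fortran like the above would typically be built with the PGI/NVIDIA compiler, e.g. something along the lines of "pgfortran -acc -ta=tesla:cc60 -Minfo=accel" for a Pascal (compute capability 6.0) target; the exact compiler version and flags used by the authors are not stated in the slides, so treat this as an assumption.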

  14. Comparison of algorithms
  • Coloring vs. atomics, with pure unstructured computation
  • NVIDIA K40 and P100 with OpenACC (K40: 4.29 TFLOPS SP, P100: 10.6 TFLOPS SP)
  • 10,427,823 DOF and 2,519,867 elements
  • Atomics is the faster algorithm, thanks to high data locality and the enhanced atomic functions
  • P100 shows the better speedup
  • Similar performance is obtained with CUDA
  [Chart: elapsed time per EBE call (ms) for coloring vs. atomic; the atomic version takes roughly 1/2.8 of the coloring time on K40 and roughly 1/4.2 on P100]

  15. Performance in structured computation
  • Effectiveness of mixed structured/unstructured computation
  • K40 and P100
  • 2,519,867 tetrahedral elements ➔ 204,185 voxels and 1,294,757 tetrahedral elements
  • 1.81 times speedup in the structured computation part
  [Chart: elapsed time per EBE call (ms), tetra vs. voxel parts on K40 and P100; the tetra ➔ voxel conversion reduces the converted part's time to 1/1.81]

  16. Overlap of EBE computation and MPI communication
  • Use multiple GPUs to solve larger-scale problems
  • MPI communication is then required and is one of the bottlenecks in GPU computation
  • Overlap the communication by splitting the EBE kernel: compute the boundary part on the GPU first, pack and send it while the inner part is computed, and unpack the received values afterwards (a hedged code sketch follows below)
  [Diagram: timeline across GPU #0-#2 showing boundary EBE, packing, MPI send/recv on the CPU, inner EBE on the GPU, and unpacking]
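A hedged sketch of the overlap scheme above, assuming the elements are pre-sorted into boundary and inner sets and, for brevity, a single neighbor rank. ebe_boundary, ebe_inner, pack_boundary and unpack_boundary are hypothetical helpers, and the inner kernel is assumed to launch its work on an asynchronous OpenACC queue so that the GPU computes while MPI transfers the halo.

    ! Sketch: overlapping the inner EBE kernel with the MPI halo exchange.
    subroutine ebe_matvec_overlap(u, f, sendbuf, recvbuf, nsend, nrecv, neighbor, comm)
      use mpi
      implicit none
      real(4), intent(in)    :: u(:, :)
      real(4), intent(inout) :: f(:, :)
      real(4), intent(inout) :: sendbuf(:), recvbuf(:)
      integer, intent(in)    :: nsend, nrecv, neighbor, comm
      integer :: req(2), ierr

      call ebe_boundary(u, f)                     ! boundary elements first (on GPU)
      call pack_boundary(f, sendbuf)              ! gather boundary values on GPU
    !$acc update host(sendbuf)                    ! copy only the packed halo to the CPU

      call MPI_Irecv(recvbuf, nrecv, MPI_REAL, neighbor, 0, comm, req(1), ierr)
      call MPI_Isend(sendbuf, nsend, MPI_REAL, neighbor, 0, comm, req(2), ierr)

      call ebe_inner(u, f)                        ! inner EBE, assumed async on the GPU
    !$acc wait                                    ! finish the asynchronous GPU work
      call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)

    !$acc update device(recvbuf)
      call unpack_boundary(recvbuf, f)            ! add received contributions on GPU
    end subroutine ebe_matvec_overlap

The essential point is that only the small packed halo moves between GPU and CPU, while the bulk of the EBE work proceeds on the GPU concurrently with the network transfer.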

  17. Performance in the solver
  • 82,196,106 DOF and 19,921,530 elements
  • Systems compared (# of nodes / CPU per node / GPU per node / hardware peak FLOPS / memory bandwidth):
  • K computer: 8 nodes / 1 x SPARC64 VIIIfx / no GPU / 1.02 TFLOPS / 512 GB/s
  • GPU cluster: 4 nodes / 2 x Xeon E5-2695 v2 / 2 x K40 / 34.3 TFLOPS / 2.30 TB/s
  • NVIDIA DGX-1: 1 node / 2 x Xeon E5-2698 v4 / 8 x P100 / 84.8 TFLOPS / 5.76 TB/s
  • 19.6 times speedup for DGX-1 in the EBE kernel
  [Chart: elapsed time (s, 0-50) split into the EBE-kernel target part and the other part (CPU), for the K computer, the GPU cluster (K40), and DGX-1 (P100)]

  18. Conclusion
  • Accelerated the EBE kernel of unstructured implicit low-order finite-element solvers with OpenACC
  • Designed the solver to attain equal granularity on many cores
  • Ported the key kernel to GPUs
  • Obtained high performance with low development cost
  • Computation with low power consumption
  • Many-case simulations within a short time
  • Good performance expected with larger GPU-based architectures (100 million DOF per P100) and in other finite-element simulations
