SLIDE 1

Porting and Optimization of Search of Neighbour-particle by Using OpenACC

Takaaki Miyajima and Naoyuki Fujita, Numerical Simulation Tech. Unit, Aeronautical Tech. Directorate, Japan Aerospace Exploration Agency

SLIDE 2

Japan Aerospace Exploration Agency

JAXA is a public agency for researching and developing aerospace science and technology. Supersonic aircraft, satellites, rockets, and space science are our research targets.

SLIDE 3

High-Fidelity Computational Fluid Dynamics

Apply Computational Fluid Dynamics (CFD) more broadly

  • From steady-state simulation to unsteady-state simulation

Much research has been done on simulating the steady state in cruise condition. The goal is to simulate the unsteady state over the whole flight envelope: understanding the behavior of aircraft during takeoff, landing, or turns, and applying newly developed theory to models of actual aircraft.

SLIDE 4

An example of unsteady-state simulation

Landing in rainy weather: raindrops have negative effects on wings and tires.

  • Simulation of interactions between raindrops and aircraft is needed
  • Higher computing power is necessary as well :)

Raindrops decrease the lift coefficient

Cite: http://akihito114.exblog.jp/21064589/

Rain puddles make the landing run longer

Cite: http://blogs.yahoo.co.jp/qf104j/28794118.htm

We adopt the MPS (Moving Particle Semi-implicit) method for simulating raindrops

SLIDE 5

Agenda

  • 1. Moving Particle Semi-implicit (MPS) Method
  • 2. OpenACC
  • 3. Porting and Optimization
SLIDE 6

MPS (Moving Particle Semi-implicit) Method

【Overview】

  • MPS method is attracting attention in CFD area.
  • Particle-base simulation (not a stencil computation)
  • Target fluids are divided into thousands of particles; each particle interacts with its neighbour particles.

【Features from a computer science viewpoint】

1. The # of particles exceeds tens of thousands, so a parallel computing device is necessary
2. All the particles can be calculated independently
3. Memory-bound application
4. The “search for neighbour-particle” is the main bottleneck

MPS simulation: A collapse of water column

SLIDE 7

NSRU-MPS: in-house MPS program

【Features of original program】

  • Physicists wrote the program
  • Over 7,000 lines of Fortran 90
  • Physical quantities are single-precision floating-point
  • Structure of Arrays (SoA) style data structure
  • Parallelized only by MPI

We’re developing an in-house MPS method program

Simulation done by NSRU-MPS

SLIDE 8

NSRU-MPS: preliminary evaluation

Profiling of elapsed time on a Xeon CPU (Ivy Bridge) 【Result】

  • 1 time step: 7093.75 [ms]
  • Search for neighbour-particle and MPI-related processing accounted for 56% and 21% of the total processing time, respectively.

[Pie chart] Breakdown of one time step: Search for neighbour-particle 3973.5 [ms], Proc 1 1911.4 [ms], Proc 4 1706.5 [ms], Others 1645.1 [ms], MPI related 1475.0 [ms], Proc 5 355.5 [ms]

Target problem: A collapse of water column, 40[cm] x 40[cm] x 8[cm]
# of particles: 2,247,750
# of MPI processes: 24
CPU: Intel Xeon E5-2697 v2 @ 2.7GHz, 12 cores x 2 CPUs
Memory: 128GB of DDR3-12800
Compiler: PGI Fortran 16.10 with "-O3 -fast" option
MPI Library: OpenMPI 1.10.5 with "-bind-to socket -npersocket 12 -n 24" option
Measurement method: Average of the first 200 steps, measured with MPI_Wtime()

SLIDE 9

Relationship between elapsed time and MPI

Profile the elapsed time while changing the # of processes from 2 to 24

  • Elapsed time decreased as the # of procs increased
  • MPI communication increased in proportion to the # of procs

Processing time [msec]:
                                 2MPI      4MPI      6MPI      8MPI      12MPI     24MPI
Search of neighbour-particles    41122.1   22086.9   14294.9   11468.3   8839.2    3973.5
MPI related                      98.3      901.8     1063.0    1068.2    1045.7    1475.1
Others                           9757.1    5562.7    3801.2    3139.9    2562.3    1645.2
Total                            50977.4   28551.4   19159.2   15676.4   12447.2   7093.8

Total elapsed time: reduced to 1/7; MPI related: increased x15

Goal: decrease the elapsed time while keeping the # of procs small

SLIDE 10

Search for neighbour particle (w/ bucket)

【Bucket】
  • Divide the simulation space into cells called “buckets”
  • The volume of a bucket is equal to 3^3 particles
  • The effect radius (cut-off distance) is 3 buckets

【Search for neighbour particle】
1. Pick up a target particle (red)
2. Traverse the adjacent 3^3 buckets
   ✓ No fixed order to traverse buckets
3. Search the particles in a bucket
4. Calculate the distance and weight between the target particle and each particle found
5. Accumulate the weighted physical value to the target particle
   ✓ No fixed order to accumulate physical values

※ Other particle-based simulations (Molecular Dynamics or N-body sim) have similar computation

SLIDE 11

Search for neighbour particle in NSRU-MPS

A quadruple nested loop is used

Loop nesting (outermost to innermost): pick up a target particle → traverse the adjacent 3x3x3 buckets → search the particles in a bucket → calculate distance and weight, accumulate the physical value (see the sketch below)
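A minimal Fortran sketch of this quadruple nested loop (array names, the bucket layout, and the weight function below are illustrative assumptions, not the actual NSRU-MPS identifiers):

subroutine search_neighbours(np, nb_total, max_in_bucket, x, y, z, r_cut, &
                             bucket_of, neighbour_bucket, bucket_count, bucket_particle, dens)
  implicit none
  integer, intent(in)  :: np, nb_total, max_in_bucket
  real,    intent(in)  :: x(np), y(np), z(np), r_cut
  integer, intent(in)  :: bucket_of(np)                   ! bucket that holds each particle
  integer, intent(in)  :: neighbour_bucket(27, nb_total)  ! indices of the 3x3x3 adjacent buckets
  integer, intent(in)  :: bucket_count(nb_total)          ! # of particles stored in each bucket
  integer, intent(in)  :: bucket_particle(max_in_bucket, nb_total)
  real,    intent(out) :: dens(np)
  integer :: ip, ib, k, nb, jp
  real    :: r

  do ip = 1, np                                   ! 1) pick up a target particle
    dens(ip) = 0.0
    do ib = 1, 27                                 ! 2) traverse the adjacent 3x3x3 buckets
      nb = neighbour_bucket(ib, bucket_of(ip))    !    in-direct access to the bucket index
      do k = 1, bucket_count(nb)                  ! 3) indefinite loop over the particles in a bucket
        jp = bucket_particle(k, nb)
        if (jp == ip) cycle
        r = sqrt((x(jp)-x(ip))**2 + (y(jp)-y(ip))**2 + (z(jp)-z(ip))**2)
        if (r < r_cut) then                       ! 4) distance and weight
          dens(ip) = dens(ip) + (r_cut / r - 1.0) !    MPS-style weight w(r) = r_e/r - 1 (assumed)
        end if                                    ! 5) accumulate the weighted value
      end do
    end do
  end do
end subroutine search_neighbours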

SLIDE 12

Analysis of Search for neighbour particle

  • Not easy to vectorize and to utilize the cache
  • The computation naturally fits the SIMT model

✓ Each target particle accesses a different index of bucket and particle
✓ Thousands of in-flight data requests hide the latency
✓ No fixed order to traverse and accumulate values

[Diagram] In-direct access: search the particles in a bucket. Indefinite loop: the # of particles in a bucket is uncertain, giving an inefficient access pattern.

SLIDE 13

Agenda


  • 1. Moving Particle Semi-implicit (MPS) Method
  • 2. OpenACC
  • 3. Porting and Optimization
SLIDE 14

Add directives to existing C/C++ or Fortran code, and the compiler automatically generates a binary for the GPU. No need to write CUDA C/Fortran from scratch. Typical targets of offload:

  • Loop
  • Data transfer (CPU from/to GPU)
  • User defined functions
  • CUDA libraries: cuBLAS, cuFFT, etc. can be integrated

Quite a few practical applications have been ported with OpenACC

  • Sunway TaihuLight added their own extensions.
  • Most of those applications adopt stencil computation

PGI Compiler (Community Edition) is free for personal use.

An overview of OpenACC

SLIDE 15

acc data directive

  • Transfers data between the host and the device memory at an arbitrary timing
  • The data transfer happens at this position in the code

acc kernels directive

  • Specifies regions of code to offload from the CPU to the GPU
  • The compiler automatically analyzes the loops and the necessary data

acc parallel / loop directive

  • Optimizes single or nested loops
  • A loop can be mapped to blocks, warps, and threads

※ Each directive can have additional clauses to augment the information

Sample code: Jacobi method (see the sketch below)

Three directives provided by OpenACC
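The Jacobi sample itself is not reproduced on this page; the following is a minimal stand-in sketch showing how the three directives combine (sizes and iteration count are arbitrary):

program jacobi_acc
  implicit none
  integer, parameter :: n = 1024, m = 1024, iters = 1000
  real :: a(n, m), anew(n, m)
  integer :: i, j, it

  a = 0.0
  a(1, :) = 1.0                        ! fixed boundary values

  !$acc data copy(a) create(anew)      ! acc data: keep a and anew on the GPU across iterations
  do it = 1, iters
    !$acc kernels                      ! acc kernels: compiler analyzes the loops and offloads them
    !$acc loop independent
    do j = 2, m - 1
      !$acc loop independent
      do i = 2, n - 1
        anew(i, j) = 0.25 * (a(i-1, j) + a(i+1, j) + a(i, j-1) + a(i, j+1))
      end do
    end do
    !$acc loop independent collapse(2)
    do j = 2, m - 1
      do i = 2, n - 1
        a(i, j) = anew(i, j)
      end do
    end do
    !$acc end kernels
  end do
  !$acc end data                       ! acc data: results are copied back to the host here
end program jacobi_acc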

SLIDE 16

Gang, Worker, Vector are provided to model SIMT

  • Map loops and functions explicitly
  • Gang = Block
  • Worker = Warp
  • Vector = CUDA Thread

[Diagram] A grid of thread blocks maps to gangs: Block(0,0) = Gang(0,0), Block(1,0) = Gang(1,0), Block(2,0) = Gang(2,0) with num_gangs = 3. Inside Block(1,0), Thread(i,j) = Vector(i,j) with vector_length = 5 and num_workers = 3; shared memory corresponds to the shared clause. (A small clause sketch follows below.)

OpenACC’s three levels of parallelism
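A small sketch of how the clauses select these levels explicitly; the sizes (3 gangs, 3 workers, vector length 5) just mirror the numbers in the diagram above and are not realistic tuning values:

subroutine add2d(ni, nj, a, b, c)
  implicit none
  integer, intent(in) :: ni, nj
  real, intent(in)  :: a(ni, nj), b(ni, nj)
  real, intent(out) :: c(ni, nj)
  integer :: i, j

  !$acc parallel num_gangs(3) num_workers(3) vector_length(5) copyin(a, b) copyout(c)
  !$acc loop gang                 ! outer loop -> thread blocks (gangs)
  do j = 1, nj
    !$acc loop worker vector      ! inner loop -> warps (workers) and CUDA threads (vectors)
    do i = 1, ni
      c(i, j) = a(i, j) + b(i, j)
    end do
  end do
  !$acc end parallel
end subroutine add2d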

SLIDE 17

Clauses and their functions

Clauses used in our implementation

Clause       Function
gang(N)      map the loop to N thread blocks
worker(N)    map the loop to N warps
vector(N)    map the loop to N threads
seq          run the loop sequentially
collapse(N)  fuse an N-deep loop nest into one large loop
independent  treat each iteration as independent
atomic       perform an atomic operation

SLIDE 18

Agenda


  • 1. Moving Particle Semi-implicit (MPS) Method
  • 2. OpenACC
  • 3. Porting and Optimization
SLIDE 19

Three optimizations: Naive, Atomic, 3-D Thread

  • 1. Naive: 1 particle = 1 CUDA thread
  • Simplest optimization
  • No code modification is required
  • 2. Atomic: 1 bucket = 1 CUDA thread
  • Uses an atomic operation for accumulation
  • Small code modification is required
  • 3. 3-D Thread: 1 bucket = 1 CUDA thread
  • Considers the physical background when mapping threads
  • Small code modification is required
SLIDE 20

Naive: 1 particle = 1 CUDA thread

[Diagram: Naive mapping] 128 threads / warp; one CUDA thread per particle, each traversing buckets 1, 2, 3, 4, …, 26 in sequence.

Each particle is mapped to a CUDA thread; the inner loops are performed sequentially (see the sketch below).
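A hedged OpenACC sketch of the Naive mapping, applied to the illustrative loop nest from slide 11 (not the actual NSRU-MPS source): only the particle loop is parallelized, and the bucket and in-bucket loops are forced to run sequentially in each thread.

  !$acc parallel loop gang vector vector_length(128) &
  !$acc&         present(x, y, z, bucket_of, neighbour_bucket, bucket_count, bucket_particle, dens)
  do ip = 1, np                                   ! 1 particle = 1 CUDA thread
    dens(ip) = 0.0
    !$acc loop seq                                ! bucket traversal stays inside the thread
    do ib = 1, 27
      nb = neighbour_bucket(ib, bucket_of(ip))
      !$acc loop seq                              ! particles in the bucket, sequential
      do k = 1, bucket_count(nb)
        jp = bucket_particle(k, nb)
        if (jp == ip) cycle
        r = sqrt((x(jp)-x(ip))**2 + (y(jp)-y(ip))**2 + (z(jp)-z(ip))**2)
        if (r < r_cut) dens(ip) = dens(ip) + (r_cut / r - 1.0)
      end do
    end do
  end do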

SLIDE 21

Atomic: 1 bucket = 1 CUDA thread

  • An atomic operation is used for accumulation
  • Each bucket traversal is mapped to a CUDA thread
  • The bucket-number calculation is moved to the start of each thread's work (see the sketch below)

[Diagram: Atomic mapping] 128 threads / warp; one CUDA thread per (particle, bucket) pair, up to bucket 26, each accumulating with Atomic Add.
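A hedged sketch of the Atomic mapping on the same illustrative loop nest: the particle and bucket loops are collapsed so that one CUDA thread handles one (particle, bucket) pair, and the accumulation into dens is protected by acc atomic (dens is assumed to be zero-initialized beforehand).

  !$acc parallel loop gang vector vector_length(128) collapse(2) independent &
  !$acc&         present(x, y, z, bucket_of, neighbour_bucket, bucket_count, bucket_particle, dens)
  do ip = 1, np
    do ib = 1, 27                                 ! (particle, bucket) pair = 1 CUDA thread
      nb = neighbour_bucket(ib, bucket_of(ip))    ! bucket number computed once per thread
      !$acc loop seq
      do k = 1, bucket_count(nb)
        jp = bucket_particle(k, nb)
        if (jp == ip) cycle
        r = sqrt((x(jp)-x(ip))**2 + (y(jp)-y(ip))**2 + (z(jp)-z(ip))**2)
        if (r < r_cut) then
          !$acc atomic update                     ! 27 threads may update the same dens(ip)
          dens(ip) = dens(ip) + (r_cut / r - 1.0)
        end if
      end do
    end do
  end do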

SLIDE 22

3-D Thread: 1 bucket = 1 CUDA thread

[Diagram: 3-D Thread mapping] 27 threads / warp; the bucket traversal uses a 3-D index (threadIdx.x, threadIdx.y, threadIdx.z) and accumulates with Atomic Add.

  • The bucket traversal is mapped to threadIdx.{x,y,z}, respectively
  • The physical background (3-D adjacency of buckets) is considered
  • An atomic operation is used for accumulation (see the sketch below)
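A hedged sketch of the 3-D Thread mapping on the same illustrative loop nest. The original work maps the three bucket-offset loops to threadIdx.{x,y,z}; exactly which clause combination produces a 3-D thread block is compiler-specific, so this sketch uses a spec-conforming approximation that collapses the 3x3x3 offsets onto one vector dimension of length 27 (neighbour_bucket3d is a hypothetical lookup from offsets to a bucket index).

  !$acc parallel loop gang vector_length(27) &
  !$acc&         present(x, y, z, bucket_of, neighbour_bucket3d, bucket_count, bucket_particle, dens)
  do ip = 1, np                                   ! one thread block (gang) per target particle
    !$acc loop vector collapse(3) independent     ! 3x3x3 bucket offsets -> the 27 threads of the block
    do kz = -1, 1
      do ky = -1, 1
        do kx = -1, 1
          nb = neighbour_bucket3d(kx, ky, kz, bucket_of(ip))
          !$acc loop seq
          do k = 1, bucket_count(nb)
            jp = bucket_particle(k, nb)
            if (jp == ip) cycle
            r = sqrt((x(jp)-x(ip))**2 + (y(jp)-y(ip))**2 + (z(jp)-z(ip))**2)
            if (r < r_cut) then
              !$acc atomic update                 ! the 27 threads accumulate into the same dens(ip)
              dens(ip) = dens(ip) + (r_cut / r - 1.0)
            end if
          end do
        end do
      end do
    end do
  end do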

SLIDE 23

Evaluation setup (GPUs and data sets)

  • Compiler: PGI Fortran 16.10 with "-acc -ta=nvidia,cuda8.0,fastmath,cc60"
  • Four different GPUs (one Kepler and three Pascal architecture)
  • Three data sets: collapse of water column (40[cm]×40[cm]×8[cm])

GPU            Single prec. [TFLOPS]  Op. Frequency [MHz]  CUDA Cores  Mem bandwidth [GB/s]  BW of CPU-GPU (comm. system, bw)  Host CPU
K20c           3.5                    706                  2,496       208                   PCIe Gen2 x16 (8GB/s)             Intel Xeon E5-2697 v2
GTX1080        8.8                    1,733                2,560       320                   PCIe Gen3 x16 (16GB/s)            Intel Xeon E5-2697 v2
P100 (PCIe)    9.3                    1,303                3,584       732                   PCIe Gen3 x16 (16GB/s)            Intel Xeon E5-2630L v3
P100 (NVLink)  10.6                   1,406                3,584       732                   NVLink (40GB/s)                   IBM POWER8 NVL

Data set  # of particles  # of buckets
Small     25,704          35×35×7
Medium    224,910         70×70×14
Large     2,247,750       150×150×30

SLIDE 24

Evaluation: elapsed time

  • 2,247,750 particles: Naive is the fastest
  • 224,910 particles: Naive is the fastest (Naive ≈ Atomic)
  • 25,704 particles: 3-D is the fastest for P100※

Elapsed time [ms] (Naive / Atomic / 3-D Thread):
                      K20c                    GTX1080                 P100 (NVLink, linuxpower)  P100 (PCIe, x86-64)
2,247,750 particles   263.7 / 377.8 / 744.5   80.3 / 151.8 / 200.0    45.2 / 89.1 / 112.8        45.3 / 99.1 / 118.6
224,910 particles     26.7 / 36.0 / 69.4      8.7 / 15.1 / 20.6       4.3 / 4.3 / 11.1           4.7 / 4.8 / 11.6
25,704 particles      8.6 / 6.2 / 15.9        1.4 / 1.9 / 2.5         1.7 / 1.7 / 1.5            1.6 / 1.3 / 1.2

※ This data set is too small to benefit from offloading

Flat-MPI: 1927.4 [ms]

x42.6 faster

SLIDE 25

Stall Reasons of P100 (PCIe)

Analysis of the stall reasons of each implementation with NVPROF

  • Data Request stalls of Atomic and 3-D accounted for over 80%
    → Too many in-flight memory requests
  • Execution Dependency stalls are caused by the distance calculation

[Bar chart] Stall reasons, Naive / Atomic / 3-D Thread:
  Instruction Fetch      0.5% / 1.7% / 2.8%
  Execution Dependency   22.1% / 8.7% / 9.7%
  Data Request           59.9% / 83.6% / 83.9%
  Texture                3.7% / 2.3% / 0.2%
  Synchronization        8.3% / 2.8% / 1.9%
  (remainder: Immediate constant, Pipe Busy, Other)

※ Data Request stall: the percentage of stalls occurring because a memory operation cannot be performed due to the required resources not being available or fully utilized, or because too many requests of a given type are outstanding.

SLIDE 26

Preliminary evaluation of Xeon and KNL

Similar optimizations are done with OpenMP

  • Naive + reduction is best for Ivy Bridge
  • Naive is best for KNL

Each particle is mapped to a CPU thread

  • The reduction clause is added only for Ivy Bridge
  • reduction is slow on the GPU (see the sketch below)
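A hedged OpenMP sketch of the Naive CPU/KNL variant, again on the illustrative loop nest rather than the real NSRU-MPS code (the "Naive + reduction" Ivy Bridge variant additionally places the accumulation in a reduction clause; that detail is omitted here):

  !$omp parallel do private(ib, nb, k, jp, r) schedule(static)
  do ip = 1, np                                   ! 1 particle = 1 CPU/KNL thread
    dens(ip) = 0.0
    do ib = 1, 27
      nb = neighbour_bucket(ib, bucket_of(ip))
      do k = 1, bucket_count(nb)
        jp = bucket_particle(k, nb)
        if (jp == ip) cycle
        r = sqrt((x(jp)-x(ip))**2 + (y(jp)-y(ip))**2 + (z(jp)-z(ip))**2)
        if (r < r_cut) dens(ip) = dens(ip) + (r_cut / r - 1.0)
      end do
    end do
  end do
  !$omp end parallel do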
SLIDE 27

Comparison: Elapsed time

  • KNL-7210 @ 1.3GHz 64 core (Quadrant+Flat), MCDRAM
  • Ivy Bridge E5-2697 v2 @ 2.70GHz (12 core × 2 CPU), DDR3-1600MHz

Elapsed time [ms]:
  GPU:                                                K20c 263.7, GTX1080 80.3, P100 45.2
  E5-2697 v2 @ 2.70GHz (12 core × 2), DDR3-1600MHz:   8p6t 373.8, 12p4t 395.4, 24p2t 330.4
  KNL-7210 @ 1.3GHz (64 core), Quadrant+Flat:         2p128t 541.2, 4p64t 570.3, 8p32t 554.7

P100 is x7.3 faster than the best Xeon configuration and x12.0 faster than the best KNL configuration

SLIDE 28

Conclusion

Search for neighbour particle (w/ bucket) is ported and optimized by OpenACC
  • Naive optimization: Simplest optimization. Fastest for large data set
  • Atomic optimization: Use atomic operation for accumulation
  • 3-D thread optimization: Physical background is considered
  • Four GPUs and three data sets are used for evaluation
  • Preliminary evaluation of Xeon and Xeon Phi is shown

Evaluation

                   E5-2697 v2 (MPI only)  E5-2697 v2 (MPI+OpenMP)  KNL-7210 (MPI+OpenMP)  P100 (MPI+OpenACC)
Optimization       N/A                    Naive + reduction        Naive                  Naive
Elapsed time [ms]  1927.4                 330.4                    541.2                  45.2
Speed up           1.0                    5.8                      3.6                    42.6

(P100 is x7.3 faster than the Xeon and x12.0 faster than the KNL)

SLIDE 29

Thank you :)

SLIDE 30

Comparison of each implementation

Size of Grid, Block and occupancy (Np: # of particles)

Implementation  Total # of threads  Grid size               Block size  Occupancy  Used regs
Naive           Np                  Np/128                  128         100%       70
Atomic          Np×(3^3) = Np×27    Np×(3^3)/128 = Np×0.21  128         100%       70
3-D thread      Np×(3^3) = Np×27    Np×(3^3)/(3^3) = Np     3^3 = 27    21%        40

[Diagrams of the Naive, Atomic, and 3-D thread mappings, repeated from slides 20-22]

SLIDE 31

Data transfer bandwidth from CPU to GPU

[Chart] Host-to-device bandwidth [GB/s] vs. transfer size [bytes]: P100 (PCIe) H2D BW vs. Minsky H2D BW

The measured bandwidth is 2.72x higher on Minsky.

Data set        Minsky     Intel machine
Small (1.14MB)  44.6[us]   158.4[us]
Large (1.71MB)  81.9[us]   236.9[us]

For this application as well, the measured data transfer completes about 1.8x faster.

SLIDE 32

Data transfer bandwidth from GPU to CPU

[Chart] Device-to-host bandwidth [GB/s] vs. transfer size [bytes]: P100 (PCIe) D2H BW vs. Minsky D2H BW

The measured bandwidth is 2.68x higher on Minsky (transfer-time table same as the previous slide).

SLIDE 33

Processing time for the small and medium data sets