Beyond Pair Potential: A CUDA implementation of REBO Potential


Przemysław Trędak, Faculty of Physics, University of Warsaw. GTC 2015, March 19, 2015.


SLIDE 1

Many-body potential Proposed algorithm

Beyond Pair Potential: A CUDA implementation of REBO Potential

Przemysław Trędak

Faculty of Physics, University of Warsaw

GTC 2015, March 19, 2015

Przemysław Trędak Beyond Pair Potential

SLIDE 2

Naïve approach to parallelizing MD potentials

    for all i in atoms do in parallel
        for all (j, k, ...) in atoms interacting with i do
            compute forces acting on atom i
        end for
    end parallel for

  • Very simple approach
  • 1 thread per atom
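The loop above can be mimicked serially, with each outer iteration standing in for one GPU thread. This is a minimal sketch: the 1D positions, the `neighbors` lists and the toy inverse-square `pair_force` are illustrative assumptions, not part of the talk.

```python
# Serial stand-in for the naive scheme: outer iteration i plays the role of one thread.
# The 1D positions and the toy inverse-square pair force are assumptions for illustration.

def pair_force(xi, xj):
    """Toy repulsive force on the atom at xi due to the atom at xj (1D)."""
    d = xi - xj
    return d / abs(d) ** 3

def naive_forces(positions, neighbors):
    forces = [0.0] * len(positions)
    for i in range(len(positions)):          # "for all i in atoms do in parallel"
        for j in neighbors[i]:               # atoms interacting with i
            forces[i] += pair_force(positions[i], positions[j])
    return forces                            # "thread" i writes only to forces[i]

positions = [0.0, 1.0, 3.0]
neighbors = [[1, 2], [0, 2], [0, 1]]
forces = naive_forces(positions, neighbors)
```

Each pair is evaluated twice, once from each side, which is the price of letting every thread write only to its own atom.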

SLIDE 4

Naïve approach to parallelizing MD potentials

For 2-body potentials it works reasonably well!
For 3-body and more complicated potentials, not so much:

    for all i in atoms do in parallel
        for all j in atoms interacting with i do        // 2-body
            for all k in atoms interacting with i do    // 3-body
                ...
                compute forces acting on atom i
            end for
        end for
    end parallel for
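The slide does not spell out why the naive scheme degrades. One factor visible in the pseudocode itself is that the work done by a single thread grows with every extra loop level. The sketch below only counts innermost-loop iterations; the uniform neighbor count `n` is an assumption.

```python
# Counts innermost-loop iterations per atom for the nested loops above.
# n is the number of atoms interacting with atom i (an assumed, uniform value).

def inner_iterations(n, body_order):
    """Iterations of the innermost loop for one atom with n interacting atoms."""
    return n ** (body_order - 1)

for n in (4, 16, 64):
    print(n, inner_iterations(n, 2), inner_iterations(n, 3))
```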

SLIDE 5

Different many-body potentials

$$E = \sum_{i}^{M} V_N(\ldots)$$

  • Bonded interactions: N, M constant
  • Nonbonded N-body interactions: N constant, M variable
  • "Real" many-body potentials: N, M variable ← focus of this talk

SLIDE 6

REBO potential

  • 2nd generation Brenner potential
  • Used for simulation of hydrocarbons
  • Many-body potential

SLIDE 7

Form of REBO potential

$$E = \sum_{i} \sum_{j>i} \left[ V_R(r_{ij}) - \bar{b}_{ij} V_A(r_{ij}) \right]$$

  • $V_R$ and $V_A$ are simple two-body terms
  • Difficulty hidden in the $\bar{b}_{ij}$ term
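The structure of this sum can be sketched in a few lines of Python. Everything below the outer double loop is a toy stand-in and an assumption: the exponential `v_r` and `v_a`, and the `bond_order` function that only counts other neighbors, are far simpler than the real REBO terms. The point is only that the pair terms need `r` alone, while the bond-order factor needs the neighborhoods of both i and j.

```python
import math

# Toy stand-ins (assumptions): the real REBO V_R, V_A and bond-order terms are far more involved.
def v_r(r):
    return math.exp(-2.0 * r)        # repulsive pair term

def v_a(r):
    return math.exp(-r)              # attractive pair term

def bond_order(i, j, neighbors):
    """Toy b_ij: weakens the i-j bond as atom i gains other neighbors."""
    others = len([k for k in neighbors[i] if k != j])
    return 1.0 / math.sqrt(1.0 + others)

def rebo_like_energy(positions, neighbors):
    energy = 0.0
    for i in range(len(positions)):
        for j in neighbors[i]:
            if j <= i:
                continue                                  # sum over j > i only
            r = abs(positions[i] - positions[j])
            # symmetric average of b_ij and b_ji (other REBO contributions omitted)
            b_bar = 0.5 * (bond_order(i, j, neighbors) + bond_order(j, i, neighbors))
            energy += v_r(r) - b_bar * v_a(r)
    return energy

dimer = rebo_like_energy([0.0, 1.0], [[1], [0]])
chain = rebo_like_energy([0.0, 1.0, 2.0], [[1], [0, 2], [1]])
```

In the three-atom chain, the middle atom's second neighbor lowers the bond order of both bonds, so `chain` is less negative than twice `dimer` even though every pair distance is the same.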

SLIDE 8

Challenges in parallel implementation

  • Effective impact of a single interaction

[Figure: diagram with labels V_A, V_R, b_ij, b_ji, F, T]

  • Complexity of the computation of the interaction (3D cubic splines)

SLIDE 9

Design decisions and assumptions

  • During one kernel, write only to nearest neighbors; the work therefore needs to be split into several steps
  • Use neighbor lists for nearest neighbors
  • No atomic operations during force computation; better to use more memory
  • Small number of nearest neighbors: no more than 16 during a normal simulation

SLIDE 10

Impact of GPU architecture

  • CUDA GPUs employ a SIMT (Single Instruction, Multiple Threads) architecture
  • 1 warp of threads executes in lockstep
  • Starting with Kepler (SM 3.0), instructions (__shfl) are available to share data inside a warp
  • Easy to logically split a single warp into several pieces of size 2^n
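The idea of splitting a warp into power-of-two pieces can be emulated on the CPU. The sketch below is not CUDA: it models a 32-lane warp as a Python list and reproduces the access pattern of an XOR shuffle, where every lane reads from lane `lane ^ mask`. The function names are hypothetical. Because every mask used is smaller than `width`, data never crosses a group boundary.

```python
# CPU emulation (a sketch, not CUDA): a 32-lane warp is split into groups of
# `width` lanes, and each group sums its values with an XOR butterfly,
# the access pattern of an XOR warp shuffle restricted to sub-warp groups.

WARP_SIZE = 32

def shfl_xor(values, lane_mask):
    """Every lane reads the value held by lane (lane ^ lane_mask)."""
    return [values[lane ^ lane_mask] for lane in range(len(values))]

def group_sum(values, width):
    assert width > 0 and width & (width - 1) == 0 and WARP_SIZE % width == 0
    mask = width // 2
    while mask >= 1:
        partner = shfl_xor(values, mask)
        values = [a + b for a, b in zip(values, partner)]
        mask //= 2
    return values                      # every lane now holds its group's total

warp = list(range(WARP_SIZE))
sums = group_sum(warp, 8)              # four independent groups of 8 lanes
```

With `width = 8`, lanes 0 to 7 all end up holding 28, lanes 8 to 15 hold 92, and so on, after only three exchange rounds.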

SLIDE 11

Proposed algorithm

Let N be the maximum number of nearest neighbors of any atom, rounded up to the nearest power of 2.

  • Every N threads are grouped to work on the interactions of a single atom i

[Figure: diagram with elements labeled A, B, C, D, E]

SLIDE 12

Proposed algorithm

Let N be the maximum number of nearest neighbors of any atom, rounded up to the nearest power of 2.

  • Every thread j of a group computes, in parallel, the interaction between i and j

[Figure: diagram with elements labeled A, B, C, D, E]

SLIDE 13

Proposed algorithm

Let N be the maximum number of nearest neighbors of any atom, rounded up to the nearest power of 2.

  • During this computation, all of the forces acting on an atom k ≠ i, j are sent with shuffle instructions to the appropriate thread of the group

[Figure: diagram with elements labeled A, B, C, D, E]
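The grouping described on the "Proposed algorithm" slides can be emulated serially. The sketch below is one possible reading, not the author's kernel: the rotation schedule (lane t sends to lane t + shift in round shift) and the toy `contribution` function are hypothetical. Lane t handles neighbor j of atom i; whatever it computes for another neighbor k is delivered to the lane that owns k, so each lane accumulates only its own atom's force and no atomic operations are needed.

```python
# CPU emulation of one thread group working on atom i (a sketch under assumptions:
# the rotation schedule and the toy contribution function are hypothetical).

def contribution(j, k):
    """Toy force contribution on neighbor k produced while processing pair (i, j)."""
    return 1.0 / (1 + abs(j - k))

def group_forces(neighbor_ids, n):
    """neighbor_ids: neighbors of atom i, padded with None up to group size n."""
    acc = [0.0] * n                                  # lane s accumulates the force on its neighbor
    for shift in range(1, n):                        # n - 1 exchange rounds
        sent = [None] * n
        for lane in range(n):                        # all lanes act in lockstep
            target = (lane + shift) % n
            j, k = neighbor_ids[lane], neighbor_ids[target]
            if j is not None and k is not None:
                sent[lane] = contribution(j, k)
        for lane in range(n):                        # "shuffle": read from lane - shift
            value = sent[(lane - shift) % n]
            if value is not None:
                acc[lane] += value
    return acc

ids = [10, 11, 13, None]                             # 3 real neighbors, group size N = 4
acc = group_forces(ids, 4)
```

The padded lane (the `None` entry) takes part in every round but contributes nothing, which is the idle-thread cost when an atom has fewer than N neighbors.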

SLIDE 15

Challenges

High divergence of threads if the number of neighbors is less than N

  • When the real number of neighbors is less than N, some threads in a group are idle

Solution

  • During neighbor list creation, atoms are divided into groups with the same nearest-neighbor count
  • Kernels are templated, so that the lowest N is used for every group
  • Nearest-neighbor count for most atoms is ≤ 4; minimum efficiency is 75%
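The 75% figure follows from rounding each neighbor count up to a power of 2: with at most 4 neighbors, the worst case is 3 neighbors in a group of 4 threads. A minimal sketch of the bucketing, with hypothetical neighbor counts and function names:

```python
from collections import Counter

def group_size(count):
    """Smallest power of 2 that is >= count (at least 1)."""
    n = 1
    while n < count:
        n *= 2
    return n

def efficiency_by_group(neighbor_counts):
    """Maps neighbor count -> (group size N, fraction of busy threads, number of atoms)."""
    table = {}
    for count, atoms in sorted(Counter(neighbor_counts).items()):
        n = group_size(count)
        table[count] = (n, count / n, atoms)
    return table

counts = [1, 4, 4, 3, 2, 4, 3, 1]          # hypothetical neighbor counts
table = efficiency_by_group(counts)
worst = min(eff for _, eff, _ in table.values())
```

For counts of 1, 2 and 4 every thread of the group is busy; only the 3-neighbor bucket runs with one idle thread out of four.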

SLIDE 17

Challenges

High amount of GPU memory used to avoid atomic operations

  • Maximum number of atoms per K20 GPU (5 GB of RAM): 2.5M atoms

Analysis

  • With this many atoms, achieved performance would be 0.5 ns/day
  • For real simulations the desired performance is higher, so the system size achievable on 1 GPU is not limiting
  • Other GPUs have much more RAM
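The two numbers on the slide imply a rough upper bound on the memory budget per atom. The arithmetic below takes "5 GB" as decimal gigabytes and assumes all of it is available to the simulation; both are assumptions, and the 4-byte element size is only an example.

```python
GPU_MEMORY_BYTES = 5 * 10**9       # "5 GB" from the slide, read as decimal gigabytes (assumption)
MAX_ATOMS = 2_500_000              # "2.5M atoms" from the slide

bytes_per_atom = GPU_MEMORY_BYTES / MAX_ATOMS
floats_per_atom = bytes_per_atom / 4   # if stored as 4-byte values (assumption)
```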

SLIDE 19

Challenges

Very high register pressure and local memory spilling

  • Due to the complexity of the main kernel, even 128 registers per thread are not enough to avoid spilling
  • Limited occupancy with 256 registers per thread hurts performance

Solution

  • Careful optimizations to reduce register pressure
  • Spline computation in separate kernels
  • Tesla K80
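The occupancy trade-off can be estimated with simple arithmetic. The sketch below considers registers only and ignores allocation granularity, shared memory and block-size limits, which is an assumption. The per-SM register-file sizes and the 2048-thread limit come from NVIDIA's published Kepler specifications, not from the talk; the slide's figure of 256 registers per thread is used as given.

```python
# Rough occupancy estimate limited only by registers (allocation granularity,
# shared memory and block-size limits are ignored, which is an assumption).

MAX_THREADS_PER_SM = 2048                       # Kepler limit
REGISTERS_PER_SM = {"K20/K40 (GK110)": 65536,   # 32-bit registers per SM
                    "K80 (GK210)": 131072}

def register_limited_occupancy(regs_per_thread, regs_per_sm):
    threads = min(MAX_THREADS_PER_SM, regs_per_sm // regs_per_thread)
    return threads / MAX_THREADS_PER_SM

for gpu, regs in REGISTERS_PER_SM.items():
    for per_thread in (128, 256):
        print(gpu, per_thread, register_limited_occupancy(per_thread, regs))
```

Under these assumptions the larger register file of the K80 doubles the number of resident threads at the same register count per thread, which is consistent with the K80 being listed among the solutions.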

SLIDE 20

Performance tests

CPU version: OpenMP implementation of REBO in LAMMPS, Intel Core i7-4930K 3.40 GHz (Ivy Bridge-E)

GPU version: custom code, tested on

  • NVIDIA Tesla K20 GPU, Intel Xeon E5620 2.4 GHz (Westmere)
  • NVIDIA Tesla K40 GPU, default clocks, Intel Xeon E5-2690 v2 3.0 GHz (Ivy Bridge-EP)
  • 1/2 NVIDIA Tesla K80 GPU, default clocks, Intel Xeon E5-2650 v3 2.3 GHz (Haswell-EP)

Tests:

  • Methane gas (625000 atoms)
  • Ethylene gas (768000 atoms)
  • Polyethylene (32640 atoms)
  • Polyethylene (587520 atoms)

SLIDE 21

Speedup over 1 CPU core

[Figure: bar chart of speedup (axis ticks 5 to 25) for K20, K40 and 1/2 K80 on the Methane, Ethylene, Polyethylene and Polyethylene big tests]

SLIDE 22

Speedup over full node

[Figure: bar chart of speedup (axis ticks 1 to 7) for K20, K40 and 1/2 K80 on the Methane, Ethylene, Polyethylene and Polyethylene big tests]

SLIDE 23

Conclusions and future work

Conclusions

  • Taking advantage of the SIMT architecture enables an efficient algorithm for the many-body REBO potential
  • The GPU version of the REBO potential achieves a great speedup over optimized CPU code

Future work

  • Reducing the performance impact of data movement between CPU and GPU
  • Open-sourcing the code

SLIDE 24

Thank you

Questions? You can contact me at przemyslaw.tredak@fuw.edu.pl
