

SLIDE 1

Using a Hybrid Cray Supercomputer to Model Non-Icing Surfaces for Cold-Climate Wind Turbines

Accelerating Three-Body Potentials Using GPUs (NVIDIA Tesla K20X)

Masako Yamada, GE Global Research

SLIDE 2

Worldwide wind energy capacity > 285 GW and growing

  • Cold regions are favorable
    • Lower human population
    • Good wind conditions
    • 45-50 GW opportunity from 2013-2017, at ~$2 million/MW installed
  • Technical need: anti-icing surfaces
    • 3-10% energy losses due to icing
    • Shut-downs
    • Active heating is expensive

Opportunity in Cold-Climate Wind

VTT Technical Research Centre of Finland

http://www.vtt.fi/news/2013/28052013_wind_energy.jsp?lang=en

SLIDE 3

ALCC Awards: 40 + 40 million hours

DOE ASCR Leadership Computing Challenge (ALCC) awards for energy-relevant applications

  • 1. Non-Icing Surfaces for Cold-Climate Wind Turbines
    • Jaguar (Cray XK6) at Oak Ridge National Lab
    • Molecular dynamics using LAMMPS
    • 1-million-molecule mW water droplets on engineered surfaces
    • Completed >300 simulations
    • Achieved >200x speedup from 2011 to 2013, including >5x from GPU acceleration
  • 2. Accelerated Non-Icing Surfaces for Cold-Climate Wind Turbines
    • Titan (Cray XK7, hybrid) at Oak Ridge National Lab
    • “Time parallelization” via the Parallel Replica method
    • Expected 10-100x faster results
SLIDE 4

Titan enables leadership-class study

  • Size of simulation ~ 1 million molecules
    • Droplet size >> critical nucleus size
    • Mimics physical dimensions (*somewhat)
  • Duration of simulation ~ 1 microsecond
    • Nucleation is an activated process
    • Freezing is rarely observed in MD simulations
  • Number of simulations ~ hundreds
    • Study requires “embarrassingly parallel” runs
    • Different surfaces, ambient temperatures, conductivities
    • Multiple replicates required due to the stochastic nature of nucleation

*A million-molecule droplet is ~50 nm in diameter

SLIDE 5

Year     | Software/Language  | # of Molecules | Hardware
1995     | Pascal             | Few            | Desktop Mac
2000     | C, Fortran90       | Hundreds       | IBM SP, SGI O2K
2010     | NAMD, LAMMPS       | Thousands      | Linux HPC
Present  | GPU-enabled LAMMPS | Millions       | Titan

Personal history with MD


SLIDE 6

>200x overall speedup since 2011

  • 1. Switched to the mW water potential – the 3-body model is more expensive/complex per interaction than a 2-body model, but:
    • Particle reduction – at least 3x
    • Timestep increase – 10x
    • No long-range forces
  • 2. LAMMPS dynamic load balancing – 2-3x
  • 3. GPU acceleration of the 3-body model – 5x

2011: 6 femtoseconds per 1024 CPU-seconds (SPC/E)
2013: 2 picoseconds per 1024 CPU-seconds (mW)
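A minimal sketch of how these three levers look in a LAMMPS run, driven here through the LAMMPS Python wrapper. The data file name, the mW parameter file name, and the numeric settings are placeholders for illustration, not the production inputs used in this study:

```python
# Sketch only: drive LAMMPS from its Python wrapper to illustrate the three
# speedup levers above (mW potential, dynamic load balancing, GPU package).
from lammps import lammps

# "-sf gpu" switches pair styles to their GPU-accelerated variants;
# "-pk gpu 1" configures the GPU package for one device per node.
lmp = lammps(cmdargs=["-sf", "gpu", "-pk", "gpu", "1"])

lmp.command("units real")
lmp.command("atom_style atomic")        # mW: one particle per water molecule
lmp.command("read_data droplet.data")   # placeholder droplet + surface data file

# 1. mW water = a Stillinger-Weber parameterization (Molinero & Moore, 2009).
lmp.command("pair_style sw")
lmp.command("pair_coeff * * mW.sw mW")  # placeholder parameter file / element name

# 2. Dynamic load balancing: periodically resize processor sub-domains so
#    each MPI rank owns a similar number of particles.
lmp.command("fix lb all balance 1000 1.05 shift xyz 10 1.05")

# 3. The coarse-grained mW model permits a ~10x larger timestep than
#    point-charge water models.
lmp.command("timestep 10.0")            # 10 fs
lmp.command("run 100000")
```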

SLIDE 7

  • 1. mW water potential

Stillinger-Weber 3-body potential; one particle = one water molecule

  • Introduced in 2009; Nature paper in 2011
  • Bulk water properties comparable to or better than existing point-charge models
  • Much faster than point-charge models
    • Exemplary test case by the authors: 180x faster than SPC/E
    • GE production simulation: 40-50x faster than SPC/E

[Figure: asymmetric million-molecule droplet on an engineered surface, distributed across 64 nodes; SPC/E vs. mW comparison]

SLIDE 8

  • 2. LAMMPS dynamic load balancing

Introduced in 2012. Adjusts the size of processor sub-domains to equalize the number of particles per processor. Gives a 2-3x speedup for 1-million-molecule droplets on 64 nodes (with user-specified processor mapping).

[Figure: domain decomposition with no load balancing, default load balancing, and user-specified mapping]

SLIDE 9

  • 3. GPU-acceleration of 3-body potential

See details in:

  • W. Michael Brown and Masako Yamada. Implementing molecular dynamics on hybrid high performance computers – three-body potentials. Computer Physics Communications (2013).

SLIDE 10

Load 1 million molecules on the host (CPU)

  • 1 million molecules across 64 nodes
  • Processor sub-domains correspond to a “spatial” partitioning of the droplet
  • 8 MPI tasks/node
  • 1 core per paired unit
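A quick back-of-the-envelope check of this decomposition (plain Python; the counts come straight from the numbers above):

```python
# Rough arithmetic for the spatial decomposition described above.
molecules = 1_000_000
nodes = 64
mpi_tasks_per_node = 8

per_node = molecules / nodes              # molecules owned by each node
per_task = per_node / mpi_tasks_per_node  # molecules owned by each MPI task

print(f"molecules per node: {per_node:,.0f}")      # 15,625
print(f"molecules per MPI task: {per_task:,.0f}")  # 1,953
```
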
SLIDE 11

Per node ~ 15,000 molecules

[Figure: per-node layout. Host: AMD Opteron 6274 CPU with 16 cores (Core0 … Core15) and host memory. Accelerator: NVIDIA Tesla K20X GPU with 14 processors of 192 cores each and global/local/private memory; a kernel executes as work groups of work items. Work item = fundamental unit of activity.]

SLIDE 12

Parallelization in LAMMPS

Accelerator (GPU): 3-body potential, neighbor lists
Host (CPU): time integration, thermostat/barostat, bond/angle calculations, statistics

SLIDE 13

Generic 3-body potential

$$V \;=\; \sum_{j}\,\sum_{k \neq j}\,\sum_{l > k} \varrho(\mathbf{q}_j, \mathbf{q}_k, \mathbf{q}_l) \quad \text{if } s_{jk} < s_d \text{ and } s_{jl} < s_d, \qquad 0 \text{ otherwise}$$

where $s_d$ = cutoff and $s_\beta$ = neighbor skin.

[Figure: particles j, k, l at positions q_j, q_k, q_l, with separations s_jk and s_jl relative to the cutoff s_d.]

Good candidate for the GPU:
  • 1. Occupies the majority of the computational time
  • 2. Can be decomposed into independent kernels/work-items

Examples: Stillinger-Weber, MEAM, Tersoff, REBO/AIREBO, bond-order potentials, …
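A naive Python sketch of this generic form, assuming some short-ranged three-body term `phi3` (purely illustrative, not the Stillinger-Weber functional form). Each central particle j contributes an independent block of work, which is why the sum maps naturally onto per-atom GPU work items:

```python
import numpy as np

def three_body_energy(pos, cutoff, phi3):
    """Naive O(N^3) evaluation of V = sum_j sum_{k!=j} sum_{l>k} phi3(q_j, q_k, q_l),
    counting a triplet only when both neighbors lie inside the cutoff of atom j."""
    n = len(pos)
    total = 0.0
    for j in range(n):                       # each j is an independent unit of work
        for k in range(n):
            if k == j:
                continue
            r_jk = np.linalg.norm(pos[k] - pos[j])
            if r_jk >= cutoff:
                continue
            for l in range(k + 1, n):
                if l == j:
                    continue
                r_jl = np.linalg.norm(pos[l] - pos[j])
                if r_jl < cutoff:
                    total += phi3(pos[j], pos[k], pos[l])
    return total

# Toy usage with a placeholder 3-body term (not a physical potential):
rng = np.random.default_rng(0)
pos = rng.random((20, 3)) * 5.0
toy_phi3 = lambda qj, qk, ql: 1.0            # simply counts triplets within the cutoff
print(three_body_energy(pos, cutoff=2.0, phi3=toy_phi3))
```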

SLIDE 14

Redundant Computation Approach

Atom decomposition
  • 1 atom → 1 computational kernel only
  • Fewest operations (and effective parallelization), but shared memory access is a bottleneck

Force decomposition
  • 1 atom → 3 computational kernels required
  • Redundant computations, but reduced shared memory issues; many work-items = more effective use of cores
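A schematic Python sketch of how the two decompositions assign work items over a toy neighbor list; the names and the exact mapping are illustrative, not the actual GPU kernels from the study:

```python
# Toy neighbor list: atom index -> neighbors within the cutoff.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

def triplets(j):
    """All (j, k, l) triplets centered on atom j, with l listed after k."""
    nbrs = neighbors[j]
    return [(j, k, l) for i, k in enumerate(nbrs) for l in nbrs[i + 1:]]

# Atom decomposition: one work item per central atom j.  The work item
# evaluates all of j's triplets but must also accumulate forces on k and l,
# which is where the shared-memory contention comes from.
work_items_atom = {j: triplets(j) for j in neighbors}

# Force decomposition: roughly one work item per (atom, triplet) combination,
# so each triplet is evaluated up to three times (redundant computation),
# but every work item writes only the force on "its" own atom.
work_items_force = [(atom, t) for j in neighbors for t in triplets(j) for atom in t]

print(sum(map(len, work_items_atom.values())), "triplet evaluations (atom decomposition)")
print(len(work_items_force), "triplet evaluations (force decomposition)")
```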

SLIDE 15

Stillinger-Weber Parallelization

Per-atom decomposition of the potential:

$$V \;=\; \sum_{j}\left[\,\sum_{k<j} \varrho_2(s_{jk}) \;+\; \sum_{k \neq j}\,\sum_{l>k} \varrho_3(s_{jk}, s_{jl}, \theta_{kjl})\right]$$

Split into 3 kernels with no data dependencies:
  • 2-body operations
  • 3-body operations with (s_jk < s_β) .AND. (s_jl < s_β) == .TRUE. – update forces on the central atom i only
  • 3-body operations with (s_jk < s_β) .AND. (s_jl < s_β) == .FALSE. – neighbor-of-neighbor interactions

SLIDE 16

Neighbor List

  • The 3-body force-decomposition approach involves neighbor-of-neighbor operations
  • This requires additional overhead:
    • Increase in the border region shared by two processes
    • Neighbor lists for ghost atoms “straddling” across cores
  • The GPU neighbor-list build is not necessarily faster than the CPU build, but less time is spent in host-accelerator data transfer (note: neighbor lists are huge)
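A minimal cell-list neighbor-search sketch in Python (generic, not the LAMMPS implementation), which also shows why neighbor lists are memory-heavy: every local atom stores all neighbors within the cutoff plus skin:

```python
import numpy as np
from collections import defaultdict
from itertools import product

def build_neighbor_list(pos, cutoff, skin):
    """Generic cell-list neighbor search: bin atoms into cells of edge
    (cutoff + skin), then compare each atom only against atoms in the
    27 surrounding cells.  Returns {atom index: [neighbor indices]}."""
    r = cutoff + skin
    cells = defaultdict(list)
    for i, p in enumerate(pos):
        cells[tuple((p // r).astype(int))].append(i)

    neighbors = defaultdict(list)
    for cell, members in cells.items():
        for offset in product((-1, 0, 1), repeat=3):
            other = tuple(c + o for c, o in zip(cell, offset))
            for i in members:
                for j in cells.get(other, ()):
                    # i < j ensures each pair is tested exactly once
                    if i < j and np.linalg.norm(pos[i] - pos[j]) < r:
                        neighbors[i].append(j)
                        neighbors[j].append(i)
    return neighbors

rng = np.random.default_rng(1)
pos = rng.random((1000, 3)) * 20.0
nl = build_neighbor_list(pos, cutoff=2.5, skin=0.3)
print("average neighbors per atom:", sum(map(len, nl.values())) / len(pos))
```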

SLIDE 17

GPU acceleration benefit

>5x speedup achieved for a production water droplet of 1 million molecules on an engineered surface (64 nodes)

Not limited to Stillinger-Weber – also applicable to MEAM, Tersoff, REBO, AIREBO, bond-order potentials, etc.

SLIDE 18

Implementation

SLIDE 19

6 different surfaces

Interaction potential developed at GE Global Research

SLIDE 20

Freezing front propagation

Visualization of “latent heat” release

SLIDE 21

Visualizing crystalline regions

[Figure: Steinhardt-Nelson order parameter and particle mobility; side view and bottom view]
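A hedged sketch of a local Steinhardt(-Nelson) bond-orientational order parameter q_l, one common way to flag ice-like molecules; this generic version is not the shape-matching code by Keys et al. used in the study:

```python
import numpy as np
from scipy.special import sph_harm

def steinhardt_q(l, center, neighbor_positions):
    """Local Steinhardt order parameter q_l for one particle, given the
    positions of its nearest neighbors (e.g. the 4 nearest for mW water)."""
    bonds = np.asarray(neighbor_positions) - np.asarray(center)
    theta = np.arccos(bonds[:, 2] / np.linalg.norm(bonds, axis=1))  # polar angle
    phi = np.arctan2(bonds[:, 1], bonds[:, 0])                      # azimuthal angle

    # Average the spherical harmonics over all bonds, then combine over m.
    q_sum = 0.0
    for m in range(-l, l + 1):
        qlm = np.mean(sph_harm(m, l, phi, theta))  # scipy: sph_harm(m, l, azimuth, polar)
        q_sum += abs(qlm) ** 2
    return np.sqrt(4.0 * np.pi / (2 * l + 1) * q_sum)

# Toy usage: a perfect tetrahedral arrangement (ice-like local geometry).
center = np.zeros(3)
tetra = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], dtype=float)
print("q6 (tetrahedral):", steinhardt_q(6, center, tetra))
```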

SLIDE 22

Advanced visualization

Mike Matheson, Oak Ridge National Lab

[Visuals/movies to be included here]

SLIDE 23

Next steps

  • Quasi “time parallelization” using the Parallel Replica Method
    • Launch dozens of replicates simultaneously; monitor ensemble behavior
    • Expected outcome: 10-100x faster results
  • Analysis and application of simulation results
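A conceptual sketch of the Parallel Replica idea (Voter, 1998): many statistically independent replicas advance simultaneously, and the first replica to show the rare event (here, nucleation) ends the search, with the time accumulated over all replicas standing in for the serial trajectory time. The `replica_has_nucleated` check below is a placeholder, not the actual event detection used in the study:

```python
import random

def run_parallel_replica(n_replicas, block_time, max_blocks, replica_has_nucleated):
    """Conceptual Parallel Replica loop: advance all replicas one block at a
    time; when any replica shows the rare event, the simulation time
    accumulated over all replicas is the effective time to the event."""
    accumulated_time = 0.0
    for block in range(max_blocks):
        for replica in range(n_replicas):
            accumulated_time += block_time
            if replica_has_nucleated(replica, block):
                return replica, accumulated_time
    return None, accumulated_time

# Toy stand-in for the event check: each replica has a small chance of
# nucleating per block (purely illustrative, not a physical model).
random.seed(0)
toy_check = lambda replica, block: random.random() < 0.001

replica, t_eff = run_parallel_replica(
    n_replicas=48, block_time=1.0, max_blocks=10_000, replica_has_nucleated=toy_check
)
print(f"first nucleation in replica {replica}, effective time {t_eff:.0f} (arbitrary units)")
```
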
SLIDE 24

Credits

  • Mike Brown (ORNL) – GPU acceleration
  • Paul Crozier (Sandia) – dynamic load balancing
  • Valeria Molinero (Utah) – mW potential
  • Aaron Keys (U. Michigan, Berkeley) – Steinhardt-Nelson order parameters
  • Art Voter/Danny Perez (LANL) – Parallel Replica method
  • Mike Matheson (ORNL) -- Visualization
  • Jack Wells, Suzy Tichenor (ORNL) – General
  • Azar Alizadeh, Branden Moore, Rick Arthur, Margaret Blohm (GE Global Research)

This research was conducted in part under the auspices of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC. This research was also conducted in part under the auspices of the GE Global Research High Performance Computing program.

SLIDE 25

References

  • http://www.vtt.fi/news/2013/28052013_wind_energy.jsp?lang=en
  • Brown, W. M. and Yamada, M. Implementing molecular dynamics on hybrid high performance computers – three-body potentials. Computer Physics Communications (2013).
  • Shi, B. and Dhir, V. K. Molecular dynamics simulation of the contact angle of liquids on solid surfaces. The Journal of Chemical Physics, 130(3), 034705 (2009).
  • Sergi, D., Scocchi, G. and Ortona, A. Molecular dynamics simulations of the contact angle between water droplets and graphite surfaces. Fluid Phase Equilibria, 332, 173-177 (2012).
  • Oxtoby, D. W. Homogeneous nucleation: theory and experiment. Journal of Physics: Condensed Matter, 4(38), 7627 (1992).
  • Plimpton, S. Fast parallel algorithms for short-range molecular dynamics. Journal of Computational Physics, 117(1), 1-19 (1995).
  • Humphrey, W., Dalke, A. and Schulten, K. VMD: Visual molecular dynamics. Journal of Molecular Graphics, 14(1), 33-38 (1996).
  • Keys, A. S. Shape Matching Analysis Code. University of Michigan (2011).
  • Keys, A. S., Iacovella, C. R. and Glotzer, S. C. Characterizing structure through shape matching and applications to self-assembly. Annual Review of Condensed Matter Physics, 2(1), 263-285 (2011).
  • Steinhardt, P. J., Nelson, D. R. and Ronchetti, M. Bond-orientational order in liquids and glasses. Physical Review B, 28(2), 784-805 (1983).
  • Stillinger, F. H. and Weber, T. A. Computer simulation of local order in condensed phases of silicon. Physical Review B, 31(8), 5262-5271 (1985).
  • Berendsen, H. J. C., Grigera, J. R. and Straatsma, T. P. The missing term in effective pair potentials. The Journal of Physical Chemistry, 91(24), 6269-6271 (1987).
  • Molinero, V. and Moore, E. B. Water modeled as an intermediate element between carbon and silicon. The Journal of Physical Chemistry B, 113(13), 4008-4016 (2009).
  • Moore, E. B. and Molinero, V. Structural transformation in supercooled water controls the crystallization rate of ice. Nature, 479(7374), 506-508 (2011).
  • Yamada, M., Mossa, S., Stanley, H. E. and Sciortino, F. Interplay between time-temperature transformation and the liquid-liquid phase transition in water. Physical Review Letters, 88(19), 195701 (2002).
  • Brown, W. M., Wang, P., Plimpton, S. J. and Tharrington, A. N. Implementing molecular dynamics on hybrid high performance computers – short range forces. Computer Physics Communications, 182(4), 898-911 (2011).
  • Voter, A. F. Parallel replica method for dynamics of infrequent events. Physical Review B, 57(22), R13985-R13988 (1998).
SLIDE 26

Titan

  • 299,008 Opteron cores (18,688 nodes)
  • 18,688 NVIDIA Tesla K20X GPU accelerators
  • Each node: 16-core AMD Opteron CPU + 1 Tesla K20X GPU accelerator (2,688 compute cores)
  • Gemini interconnect (ASIC, MPI messages); PCI-Express 2.0 bus
  • LAMMPS was one of the acceptance-testing applications for Titan