

SLIDE 1

May 5, 2016

Molecular Dynamics (MD) on GPUs

SLIDE 2

Accelerating Discoveries

Using a supercomputer powered by the Tesla Platform with over 3,000 Tesla accelerators, University of Illinois scientists performed the first all-atom simulation of the HIV virus and discovered the chemical structure of its capsid: "the perfect target for fighting the infection." Without GPUs, the supercomputer would need to be 5x larger to reach similar performance.

SLIDE 3

Overview of Life & Material Accelerated Apps

MD: All key codes are GPU-accelerated. Great multi-GPU performance. Focus on dense (up to 16-GPU) nodes and/or large numbers of GPU nodes.

ACEMD*, AMBER (PMEMD)*, BAND, CHARMM, DESMOND, ESPResso, Folding@Home, GPUgrid.net, GROMACS, HALMD, HOOMD-Blue*, LAMMPS, Lattice Microbes*, mdcore, MELD, miniMD, NAMD, OpenMM, PolyFTS, SOP-GPU* & more

QC: All key codes are ported or being optimized. Focus on GPU-accelerated math libraries and OpenACC directives. GPU-accelerated and available today:

ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS, GAMESS-UK, GPAW, LATTE, LSDalton, LSMS, MOLCAS, MOPAC2012, NWChem, OCTOPUS*, PEtot, QUICK, Q-Chem, QMCPack, Quantum Espresso/PWscf, TeraChem*

Active GPU acceleration projects:

CASTEP, GAMESS, Gaussian, ONETEP, Quantum Supercharger Library*, VASP & more

green* = application where >90% of the workload runs on the GPU

SLIDE 4

MD vs. QC on GPUs

"Classical" Molecular Dynamics vs. Quantum Chemistry (MO, PW, DFT, semi-empirical):

  • MD simulates the positions of atoms over time (chemical-biological or chemical-material behaviors); QC calculates electronic properties (ground state, excited states, spectral properties, making/breaking bonds, physical properties).
  • MD forces are calculated from simple empirical formulas (bond rearrangement generally forbidden); QC forces are derived from the electron wave function (bond rearrangement OK, e.g., bond energies).
  • MD handles up to millions of atoms; QC up to a few thousand atoms.
  • MD includes solvent without difficulty; QC generally runs in vacuum, with solvent, if needed, treated classically (QM/MM) or with implicit methods.
  • MD is single-precision dominated; in QC, double precision is important.
  • MD uses cuBLAS, cuFFT, CUDA; QC uses cuBLAS, cuFFT, OpenACC.
  • MD runs on GeForce (academia) or Tesla (servers); for QC, Tesla is recommended.
  • MD: ECC off; QC: ECC on.

SLIDE 5

GPU-Accelerated Molecular Dynamics Apps

ACEMD, AMBER, CHARMM, DESMOND, ESPResSo, Folding@Home, GPUGrid.net, GROMACS, HALMD, HOOMD-Blue, LAMMPS, mdcore, MELD, NAMD, OpenMM, PolyFTS

Green lettering indicates performance slides are included.

GPU performance is compared against a dual-socket, multi-core x86 CPU node.

SLIDE 6

Benefits of GPU-Accelerated MD Computing

  • 3x-8x faster than CPU-only systems (on average, across all tests)
  • Most major compute-intensive aspects of classical MD are ported
  • Large performance boost for a marginal price increase
  • Energy usage cut by more than half
  • GPUs scale well within a node and/or across multiple nodes
  • The K80 is our fastest and most power-efficient high-performance GPU yet

Try GPU accelerated MD apps for free – www.nvidia.com/GPUTestDrive

Why wouldn’t you want to turbocharge your research?

SLIDE 7

ACEMD

SLIDE 8

www.acellera.com

470 ns/day on 1 GPU for L-iduronic acid (1,362 atoms)
116 ns/day on 1 GPU for DHFR (23K atoms)

  • M. Harvey, G. Giupponi and G. De Fabritiis, "ACEMD: Accelerated molecular dynamics simulations in the microseconds timescale," J. Chem. Theory Comput. 5, 1632 (2009)
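Throughput in ns/day follows directly from the integration timestep and the number of timesteps executed per wall-clock second. A minimal sketch of the conversion (Python; the 4 fs timestep and step rate are illustrative assumptions, not published ACEMD settings):

```python
def ns_per_day(timestep_fs: float, steps_per_second: float) -> float:
    """Convert integrator throughput to simulated nanoseconds per day."""
    seconds_per_day = 86_400
    return timestep_fs * 1e-6 * steps_per_second * seconds_per_day  # fs -> ns

# Illustration: a 4 fs timestep at ~1,360 steps/s gives roughly the
# 470 ns/day quoted above for L-iduronic acid.
print(ns_per_day(4.0, 1360))  # ~470 ns/day
```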
SLIDE 9

www.acellera.com

NVT, NPT, PME, TCL, PLUMED, CAMSHIFT [1]

[1] M. J. Harvey and G. De Fabritiis, "An implementation of the smooth particle-mesh Ewald (PME) method on GPU hardware," J. Chem. Theory Comput. 5, 2371-2377 (2009)
[2] For a list of selected references see http://www.acellera.com/acemd/publications

SLIDE 10

AMBER 14

SLIDE 11

Ross Walker (AMBER developer) video

SLIDE 12

AMBER 14 vs. AMBER 12

Courtesy of Scott Le Grand, from a GTC 2014 presentation

SLIDE 13

AMBER 14: Large P2P Impact, Small Boost-Clock Impact

AMBER 14 (ns/day) on 4x K40, DHFR NVE PME 2 fs benchmark (CUDA 6.0, ECC off):

2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@745MHz (no P2P):  125.77
2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@875MHz (no P2P):  132.97
2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@745MHz (P2P):     196.68
2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@875MHz (P2P):     215.18
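Boost clocks on the K40 are set through application clocks, and ECC is toggled the same way; a minimal sketch (Python driving nvidia-smi; the 3004 MHz memory / 875 MHz graphics pair is the K40's supported application-clock setting, and the device index is an assumption):

```python
import subprocess

def set_k40_boost(device: int = 0) -> None:
    """Raise a Tesla K40 from its 745 MHz base clock to the 875 MHz boost clock.

    nvidia-smi -ac takes <memory_MHz>,<graphics_MHz>; requires admin rights.
    """
    subprocess.run(["nvidia-smi", "-i", str(device), "-ac", "3004,875"],
                   check=True)

def disable_ecc(device: int = 0) -> None:
    """Turn ECC off, as in these benchmarks; takes effect after a reboot."""
    subprocess.run(["nvidia-smi", "-i", str(device), "-e", "0"], check=True)
```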

SLIDE 14

AMBER Performance Over Time

Courtesy of Scott Le Grand, from a GTC 2014 presentation

SLIDE 15

Protein Folding Simulation With AMBER Accelerated By GPUs

168 ns/day: 2x E5-2670 (8-core) CPUs
585 ns/day: 2x E5-2670 (8-core) CPUs + Tesla K20X GPU

2.95x Faster

Data courtesy of AMBER.org

SLIDE 16

AMBER 14 with P2P, Higher Density Nodes

AMBER 14 (ns/day) on K40 with P2P and boost clocks, DHFR NVE PME 2 fs benchmark (CUDA 5.5, ECC off):

2 x Xeon E5-2690 v2@3.00GHz:                          23.79
2 x Xeon E5-2690 v2@3.00GHz + 1 x Tesla K40@875MHz:  133.74
2 x Xeon E5-2690 v2@3.00GHz + 2 x Tesla K40@875MHz:  200.58
2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@875MHz:  223.56

SLIDE 17

AMBER 14 and K40 with P2P, fastest GPU yet!

AMBER 14 (ns/day) on K40 with P2P and boost clocks, Factor IX NPT PME 2 fs benchmark (CUDA 5.5, ECC off):

2 x Xeon E5-2690 v2@3.00GHz:                          6.10
2 x Xeon E5-2690 v2@3.00GHz + 1 x Tesla K40@875MHz:  37.15
2 x Xeon E5-2690 v2@3.00GHz + 2 x Tesla K40@875MHz:  56.08
2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@875MHz:  60.27

SLIDE 18

AMBER 14 and K40 with P2P, fastest GPU yet!

AMBER 14 (ns/day) on K40 with P2P and boost clocks, Cellulose NVE PME 2 fs benchmark (CUDA 5.5, ECC off):

2 x Xeon E5-2690 v2@3.00GHz:                          1.32
2 x Xeon E5-2690 v2@3.00GHz + 1 x Tesla K40@875MHz:   9.22
2 x Xeon E5-2690 v2@3.00GHz + 2 x Tesla K40@875MHz:  14.03
2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@875MHz:  16.64

SLIDE 19

AMBER 14 and K40 with P2P, fastest GPU yet!

AMBER 14 (ns/day) on K40 with P2P and boost clocks, Nucleosome GB 2 fs benchmark (CUDA 5.5, ECC off):

2 x Xeon E5-2690 v2@3.00GHz:                          0.08
2 x Xeon E5-2690 v2@3.00GHz + 1 x Tesla K40@875MHz:   3.97
2 x Xeon E5-2690 v2@3.00GHz + 2 x Tesla K40@875MHz:   7.60
2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@875MHz:  13.77

SLIDE 20

Cellulose on K40s, K80s and M6000s

Running AMBER version 14. The blue node contains dual Intel E5-2698 v3@2.3GHz (3.6GHz Turbo) CPUs; the green nodes contain dual Intel E5-2698 v3@2.3GHz (3.6GHz Turbo) CPUs + either NVIDIA Tesla K40@875MHz, Tesla K80@562MHz (autoboost), or Quadro M6000@987MHz GPUs.

PME-Cellulose_NVE, simulated time (ns/day):

1 Haswell node:          1.93
1 CPU node + 1x K40:     8.96   (4.6x)
1 CPU node + 0.5x K80:   7.87   (4.1x)
1 CPU node + 1x K80:    11.76   (6.1x)
1 CPU node + 1x M6000:  10.49   (5.4x)
1 CPU node + 2x K40:    13.67   (7.1x)
1 CPU node + 2x K80:    15.38   (8.0x)
1 CPU node + 2x M6000:  14.90   (7.7x)

SLIDE 21

Factor IX on K40s, K80s and M6000s

Running AMBER version 14. The blue node contains dual Intel E5-2698 v3@2.3GHz (3.6GHz Turbo) CPUs; the green nodes contain dual Intel E5-2698 v3@2.3GHz (3.6GHz Turbo) CPUs + either NVIDIA Tesla K40@875MHz, Tesla K80@562MHz (autoboost), or Quadro M6000@987MHz GPUs.

PME-FactorIX_NVE, simulated time (ns/day), with speedups of 3.5x to 7.0x over the CPU node:

1 Haswell node:          9.68
1 CPU node + 1x K40:    40.48
1 CPU node + 0.5x K80:  33.59
1 CPU node + 1x K80:    50.70
1 CPU node + 1x M6000:  47.80
1 CPU node + 2x K40:    61.18
1 CPU node + 2x K80:    60.93
1 CPU node + 2x M6000:  66.89

SLIDE 22

JAC on K40s, K80s and M6000s

Running AMBER version 14. The blue node contains dual Intel E5-2698 v3@2.3GHz (3.6GHz Turbo) CPUs; the green nodes contain dual Intel E5-2698 v3@2.3GHz (3.6GHz Turbo) CPUs + either NVIDIA Tesla K40@875MHz, Tesla K80@562MHz (autoboost), or Quadro M6000@987MHz GPUs.

PME-JAC_NVE, simulated time (ns/day):

1 Haswell node:          37.38
1 CPU node + 1x K40:    134.82   (3.6x)
1 CPU node + 0.5x K80:  121.30   (3.2x)
1 CPU node + 1x K80:    174.34   (4.7x)
1 CPU node + 1x M6000:  161.53   (4.3x)
1 CPU node + 2x K40:    200.34   (5.4x)
1 CPU node + 2x K80:    225.34   (6.0x)
1 CPU node + 2x M6000:  219.83   (5.9x)

SLIDE 23

March 2016

AMBER 14

SLIDE 24

Cellulose on M40s

Running AMBER version 14. The blue node contains a single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPU; the green nodes contain a single Intel Xeon E5-2697 v2@2.70GHz (Ivy Bridge) CPU + Tesla M40 (autoboost) GPUs.

PME - Cellulose_NPT, simulated time (ns/day):

1 node:             1.07
1 node + 1x M40:   10.12   (9.5x)
1 node + 2x M40:   14.40   (13.5x)
1 node + 4x M40:   15.90   (14.9x)

SLIDE 25

Cellulose on M40s

PME - Cellulose_NVE, simulated time (ns/day):

1 node:             1.07
1 node + 1x M40:   10.50   (9.8x)
1 node + 2x M40:   15.41   (14.4x)
1 node + 4x M40:   17.13   (16.0x)

Running AMBER version 14. The blue node contains a single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPU; the green nodes contain a single Intel Xeon E5-2697 v2@2.70GHz (Ivy Bridge) CPU + Tesla M40 (autoboost) GPUs.

SLIDE 26

FactorIX on M40s

PME - FactorIX_NPT, simulated time (ns/day):

1 node:             5.38
1 node + 1x M40:   46.90   (8.7x)
1 node + 2x M40:   67.37   (12.5x)
1 node + 4x M40:   72.96   (13.6x)

Running AMBER version 14. The blue node contains a single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPU; the green nodes contain a single Intel Xeon E5-2697 v2@2.70GHz (Ivy Bridge) CPU + Tesla M40 (autoboost) GPUs.

SLIDE 27

FactorIX on M40s

PME - FactorIX_NVE, simulated time (ns/day):

1 node:             5.47
1 node + 1x M40:   49.33   (9.0x)
1 node + 2x M40:   73.00   (13.3x)
1 node + 4x M40:   80.04   (14.6x)

Running AMBER version 14. The blue node contains a single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPU; the green nodes contain a single Intel Xeon E5-2697 v2@2.70GHz (Ivy Bridge) CPU + Tesla M40 (autoboost) GPUs.

SLIDE 28

JAC on M40s

PME - JAC_NPT, simulated time (ns/day):

1 node:            20.88
1 node + 1x M40:  149.40   (7.2x)
1 node + 2x M40:  211.97   (10.2x)
1 node + 4x M40:  226.63   (10.9x)

Running AMBER version 14. The blue node contains a single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPU; the green nodes contain a single Intel Xeon E5-2697 v2@2.70GHz (Ivy Bridge) CPU + Tesla M40 (autoboost) GPUs.

SLIDE 29

JAC on M40s

PME - JAC_NVE, simulated time (ns/day):

1 node:            21.11
1 node + 1x M40:  157.68   (7.5x)
1 node + 2x M40:  230.18   (10.9x)
1 node + 4x M40:  246.15   (11.7x)

Running AMBER version 14. The blue node contains a single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPU; the green nodes contain a single Intel Xeon E5-2697 v2@2.70GHz (Ivy Bridge) CPU + Tesla M40 (autoboost) GPUs.

SLIDE 30

Myoglobin on M40s

GB - Myoglobin, simulated time (ns/day):

1 node:             9.83
1 node + 1x M40:  232.20   (23.6x)
1 node + 2x M40:  300.86   (30.6x)
1 node + 4x M40:  322.09   (32.8x)

Running AMBER version 14. The blue node contains a single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPU; the green nodes contain a single Intel Xeon E5-2697 v2@2.70GHz (Ivy Bridge) CPU + Tesla M40 (autoboost) GPUs.

SLIDE 31

Nucleosome on M40s

GB - Nucleosome, simulated time (ns/day):

1 node:            0.13
1 node + 1x M40:   4.67   (35.9x)
1 node + 2x M40:   9.05   (69.6x)
1 node + 4x M40:  16.11   (123.9x)

Running AMBER version 14. The blue node contains a single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPU; the green nodes contain a single Intel Xeon E5-2697 v2@2.70GHz (Ivy Bridge) CPU + Tesla M40 (autoboost) GPUs.

SLIDE 32

TrpCage on M40s

GB - TrpCage, simulated time (ns/day); for a system as small as TrpCage, multi-GPU overhead outweighs the added compute, so throughput falls beyond one M40:

1 node:           408.88
1 node + 1x M40:  831.91   (2.0x)
1 node + 2x M40:  551.36   (1.3x)
1 node + 4x M40:  464.63   (1.1x)

Running AMBER version 14. The blue node contains a single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPU; the green nodes contain a single Intel Xeon E5-2697 v2@2.70GHz (Ivy Bridge) CPU + Tesla M40 (autoboost) GPUs.

SLIDE 33

JAC on K40s and K80s

AMBER 14, PME-JAC_NVE on Tesla K40s and K80s & IVB CPUs (1 node, simulated time in ns/day):

2 x Xeon E5-2697 v2@2.70GHz (1 Ivy Bridge node):            24.94
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz:        131.03
2 x Xeon E5-2697 v2@2.70GHz + 0.5 x Tesla K80 (autoboost): 116.67
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K80 (autoboost):   167.81
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K40@875MHz:        195.39
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K80 (autoboost):   205.30
2 x Xeon E5-2697 v2@2.70GHz + 4 x Tesla K40@875MHz:        218.62

SLIDE 34

FactorIX on K40s and K80s

AMBER 14, PME-FactorIX_NVE on Tesla K40s and K80s & IVB CPUs (1 node, simulated time in ns/day):

2 x Xeon E5-2697 v2@2.70GHz (1 Ivy Bridge node):         6.61
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz:     36.31
2 x Xeon E5-2697 v2@2.70GHz + 0.5 x Tesla K80@562MHz:   31.52
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K80@562MHz:     46.65
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K40@875MHz:     53.90
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K80@562MHz:     53.28
2 x Xeon E5-2697 v2@2.70GHz + 4 x Tesla K40@875MHz:     59.45

SLIDE 35

Cellulose on K40s and K80s

AMBER 14, PME-Cellulose_NVE on Tesla K40s and K80s & IVB CPUs (1 node, simulated time in ns/day):

2 x Xeon E5-2697 v2@2.70GHz (1 Ivy Bridge node):         1.35
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz:      8.45
2 x Xeon E5-2697 v2@2.70GHz + 0.5 x Tesla K80@562MHz:    7.36
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K80@562MHz:     10.96
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K40@875MHz:     12.49
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K80@562MHz:     13.48
2 x Xeon E5-2697 v2@2.70GHz + 4 x Tesla K40@875MHz:     14.68

SLIDE 36

Kepler - Our Fastest Family of GPUs Yet

Running AMBER 14. The blue node contains dual E5-2697 CPUs (12 cores per CPU); the green nodes contain dual E5-2697 CPUs (12 cores per CPU) and either 1x or 2x NVIDIA K20X, K40 or K80 GPUs. Benchmark: DHFR (JAC).

AMBER 14, SPFP-DHFR_production_NVE (ns/day):

1 x Xeon E5-2697 v2@2.70GHz:                            14.54
1 x Xeon E5-2697 v2@2.70GHz + 1 x Phi 5110p (offload):   4.08
1 x Xeon E5-2697 v2@2.70GHz + 1 x Phi 7120p (offload):   3.82
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K20X:          111.32
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz:    134.08
1 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K20X:          159.25
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K80 board:     175.43
1 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K40@875MHz:    196.69
2 x Xeon E5-2697 v2@2.70GHz:                            25.80
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K20X:          110.87
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz:    132.68
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K20X:          159.06
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K40@875MHz:    196.86

SLIDE 37

Kepler - Our Fastest Family of GPUs Yet

Running AMBER 14. The blue node contains dual E5-2697 CPUs (12 cores per CPU); the green nodes contain dual E5-2697 CPUs (12 cores per CPU) and either 1x or 2x NVIDIA K20X, K40 or K80 GPUs. Benchmark: Factor IX.

AMBER 14, SPFP-Factor_IX_Production_NVE (ns/day):

1 x Xeon E5-2697 v2@2.70GHz:                             3.70
1 x Xeon E5-2697 v2@2.70GHz + 1 x Phi 5110p (offload):   3.29
1 x Xeon E5-2697 v2@2.70GHz + 1 x Phi 7120p (offload):   3.35
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K20X:           32.45
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz:     38.65
1 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K20X:           46.58
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K80 board:      51.12
1 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K40@875MHz:     57.89
2 x Xeon E5-2697 v2@2.70GHz:                             6.87
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K20X:           32.30
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz:     38.60
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K20X:           46.50
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K40@875MHz:     57.83

SLIDE 38

Kepler - Our Fastest Family of GPUs Yet

Running AMBER 14. The blue node contains dual E5-2697 CPUs (12 cores per CPU); the green nodes contain dual E5-2697 CPUs (12 cores per CPU) and either 1x or 2x NVIDIA K20X, K40 or K80 GPUs. Benchmark: Cellulose.

AMBER 14, SPFP-Cellulose_Production_NVE (ns/day):

1 x Xeon E5-2697 v2@2.70GHz:                             0.74
1 x Xeon E5-2697 v2@2.70GHz + 1 x Phi 5110p (offload):   1.50
1 x Xeon E5-2697 v2@2.70GHz + 1 x Phi 7120p (offload):   1.56
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K20X:            7.60
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz:      8.95
1 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K20X:           10.82
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K80 board:      11.86
1 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K40@875MHz:     13.29
2 x Xeon E5-2697 v2@2.70GHz:                             1.38
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K20X:            7.60
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz:      8.95
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K20X:           10.83
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K40@875MHz:     13.29

SLIDE 39

SAN DIEGO SUPERCOMPUTER CENTER

Cost Comparison

                               Traditional Cluster     GPU Workstation
Nodes required                 12                      1 (4 GPUs)
Interconnect                   QDR IB                  None
Time to complete simulations   4.98 days               2.25 days
Power consumption              5.7 kW (681.3 kWh)      1.0 kW (54.0 kWh)
System cost (per day)          $96,800 ($88.40)        $5,200 ($4.75)
Simulation cost                (681.3 x 0.18) +        (54.0 x 0.18) +
                               (88.40 x 4.98)          (4.75 x 2.25)
                               = $562.87               = $20.41

4 simultaneous simulations, 23,000 atoms, 250 ns each, 5 days maximum time to solution: >25x cheaper, AND the solution is obtained in less than half the time.
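The simulation-cost rows follow a simple model: energy drawn over the campaign times an electricity rate (the slide's arithmetic implies $0.18/kWh), plus the per-day amortized system cost times days to completion. A minimal sketch reproducing the table's numbers (Python):

```python
def simulation_cost(energy_kwh: float, days: float, system_cost_per_day: float,
                    electricity_rate: float = 0.18) -> float:
    """Cost of one campaign: electricity plus amortized system cost."""
    return energy_kwh * electricity_rate + system_cost_per_day * days

# Values from the cost-comparison table above.
cluster = simulation_cost(energy_kwh=681.3, days=4.98, system_cost_per_day=88.40)
gpu_ws = simulation_cost(energy_kwh=54.0, days=2.25, system_cost_per_day=4.75)
print(f"${cluster:.2f} vs ${gpu_ws:.2f}")  # $562.87 vs $20.41
```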

SLIDE 40

Replace 8 Nodes with 1 K20 GPU

Cut simulation costs to 1/4 and gain higher performance.

Running AMBER 12 GPU Support Revision 12.1, SPFP, with CUDA 4.2.9, ECC off. The eight blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU); the green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K20 GPU. Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

DHFR:

8 CPU nodes:           65.00 ns/day at $32,000
1 CPU node + 1x K20:   81.09 ns/day at  $6,500

SLIDE 41

Replace 7 Nodes with 1 K10 GPU

Cut simulation costs to 1/4 and increase performance by 70%.

Running AMBER 12 GPU Support Revision 12.1, SPFP, with CUDA 4.2.9, ECC off. The eight blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU); the green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K10 GPU. Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

DHFR, cost: $32,000 (CPU only) vs. $7,000 (GPU enabled).

[Chart: performance on JAC NVE, ns/day; the GPU-enabled node is roughly 70% faster than CPU only]

SLIDE 42

Extra CPUs Decrease Performance

When used with GPUs, dual CPU sockets perform worse than single CPU sockets.

Running AMBER 12 GPU Support Revision 12.1. The orange bars contain one E5-2687W CPU (8 cores); the blue bars contain dual E5-2687W CPUs (8 cores per CPU).

[Chart: Cellulose NVE, ns/day; 1 CPU + 2x K20 vs. 2 CPUs + 2x K20, with the single-CPU configuration faster]

SLIDE 43

Kepler - Greener Science

The GPU-accelerated systems use 65-75% less energy. Energy expended = power x time.

[Chart: energy expended (kJ) in simulating 1 ns of DHFR (JAC); CPU only vs. CPU + K10, K20, or K20X; lower is better]

Running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (150 W each, 8 cores per CPU); the green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 1x NVIDIA K10, K20, or K20X GPU (235 W each).
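Energy per simulated nanosecond is node power multiplied by the wall-clock time needed to simulate 1 ns, so a faster-but-hungrier GPU node can still come out far ahead. A minimal sketch (Python; the node powers follow the slide's 150 W CPUs and 235 W GPU, while the throughput numbers are hypothetical, chosen only to illustrate savings in the 65-75% range):

```python
def energy_per_ns_kj(node_watts: float, ns_per_day: float) -> float:
    """Energy (kJ) to simulate 1 ns at a given node power and throughput."""
    seconds_per_ns = 86_400 / ns_per_day
    return node_watts * seconds_per_ns / 1_000  # W*s -> kJ

# Hypothetical throughputs, for illustration only.
cpu = energy_per_ns_kj(node_watts=300, ns_per_day=15)   # dual 150 W CPUs
gpu = energy_per_ns_kj(node_watts=535, ns_per_day=100)  # same node + one 235 W GPU
print(f"{cpu:.0f} kJ vs {gpu:.0f} kJ: {100 * (1 - gpu / cpu):.0f}% less energy")
```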

SLIDE 44

Recommended GPU Node Configuration for AMBER Computational Chemistry

Workstation or single-node configuration:

# of CPU sockets:              2
Cores per CPU socket:          6+ (1 CPU core drives 1 GPU)
CPU speed (GHz):               2.66+
System memory per node (GB):   16
GPUs:                          Kepler K20, K40, K80
# of GPUs per CPU socket:      1-4
GPU memory preference (GB):    6
GPU-to-CPU connection:         PCIe 3.0 x16 or higher
Server storage:                2 TB
Network configuration:         InfiniBand QDR or better

Scale to multiple nodes with the same single-node configuration.
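On a node like this, a multi-GPU AMBER run is typically launched with one MPI rank per GPU, which is where the "1 CPU core drives 1 GPU" guidance comes from. A minimal sketch (Python wrapping the launch; pmemd.cuda.MPI is AMBER's multi-GPU engine, and mdin/prmtop/inpcrd are the conventional input file names, used here as placeholders):

```python
import os
import subprocess

def run_amber_multi_gpu(n_gpus: int = 4) -> None:
    """Launch pmemd.cuda.MPI with one MPI rank per visible GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(n_gpus))
    subprocess.run(
        ["mpirun", "-np", str(n_gpus), "pmemd.cuda.MPI",
         "-O", "-i", "mdin", "-o", "mdout", "-p", "prmtop", "-c", "inpcrd"],
        env=env,
        check=True,
    )

run_amber_multi_gpu(4)
```

The P2P transfers that the earlier slides show dominating multi-GPU scaling require the selected GPUs to share a PCIe root complex.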

SLIDE 45

CHARMM

SLIDE 46

Courtesy of Antti-Pekka Hynninen @ NREL

SLIDE 47

Courtesy of Antti-Pekka Hynninen @ NREL

SLIDE 48

Courtesy of Antti-Pekka Hynninen @ NREL

SLIDE 49

Courtesy of Antti-Pekka Hynninen @ NREL

SLIDE 50

Greener Science with NVIDIA

Running CHARMM release C37b1. The blue configuration uses 64 X5667 CPUs (95 W, 4 cores per CPU); the green nodes contain 2 X5667 CPUs and 1 or 2 NVIDIA C2070 GPUs (238 W each). Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Using GPUs decreases energy use by 75%. Energy expended = power x time.

[Chart: energy expended (kJ) in simulating 1 ns of the Daresbury G1nBP system (61.2K atoms); 64x X5667 vs. 2x X5667 + 1x C2070 vs. 2x X5667 + 2x C2070; lower is better]

SLIDE 51

May 2016

CHARMM c40a2

SLIDE 52

465K System on K80s

Running CHARMM version c40a2. The blue node contains dual Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPUs; the green nodes contain dual Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs. "gpuonly" means all forces are calculated on the GPU; "gpuon" means only non-bonded forces are calculated on the GPU.

465K-atom system, ns/day:

1 Haswell node:              0.36
1 node + 1x K80 (gpuonly):   2.15   (6.0x)
1 node + 1x K80 (gpuon):     1.70   (4.7x)
1 node + 2x K80 (gpuon):     1.80   (5.0x)
1 node + 4x K80 (gpuon):     1.62   (4.5x)

SLIDE 53

534K System on K80s

Running CHARMM version c40a2. The blue node contains dual Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPUs; the green nodes contain dual Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs. "gpuonly" means all forces are calculated on the GPU; "gpuon" means only non-bonded forces are calculated on the GPU.

534K-atom system, ns/day:

1 Haswell node:              0.18
1 node + 1x K80 (gpuonly):   1.43   (8.0x)
1 node + 1x K80 (gpuon):     1.44   (8.0x)
1 node + 2x K80 (gpuon):     1.44   (8.0x)
1 node + 4x K80 (gpuon):     1.86   (10.3x)

SLIDE 54

October 2015

GROMACS 5.1

SLIDE 55

Erik Lindahl (GROMACS developer) video

SLIDE 56

384K Waters on K40s and K80s

Running GROMACS version 5.1. The blue node contains dual Intel E5-2698 v3@2.3GHz CPUs; the green nodes contain dual Intel E5-2698 v3@2.3GHz CPUs + either NVIDIA Tesla K40@875MHz or Tesla K80@562MHz (autoboost) GPUs.

Water [PME] 384K, simulated time (ns/day):

1 Haswell node:         7.16
1 CPU node + 1x K40:   10.45   (1.5x)
1 CPU node + 1x K80:   16.99   (2.4x)
1 CPU node + 2x K40:   17.07   (2.4x)
1 CPU node + 2x K80:   22.95   (3.2x)
1 CPU node + 4x K40:   22.36   (3.1x)
1 CPU node + 4x K80:   24.72   (3.5x)
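For runs like these, GROMACS 5.x controls GPU offload from the mdrun command line. A minimal sketch (Python wrapping the launch; topol.tpr is a placeholder input, and the one-rank-per-GPU mapping is a reasonable starting point rather than the tuned settings behind these numbers):

```python
import subprocess

def run_gromacs(n_gpus: int = 4, tpr: str = "topol.tpr") -> None:
    """Launch gmx mdrun with non-bonded work offloaded to the GPUs."""
    gpu_ids = "".join(str(i) for i in range(n_gpus))  # e.g. "0123"
    subprocess.run(
        ["gmx", "mdrun", "-s", tpr,
         "-nb", "gpu",           # offload non-bonded interactions
         "-ntmpi", str(n_gpus),  # one thread-MPI rank per GPU
         "-gpu_id", gpu_ids],
        check=True,
    )

run_gromacs(4)
```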

SLIDE 57

384K Waters on Titan X

Running GROMACS version 5.1. The blue node contains dual Intel E5-2698 v3@2.3GHz CPUs; the green nodes contain dual Intel E5-2698 v3@2.3GHz CPUs + GeForce GTX TitanX@1000MHz GPUs.

Water [PME] 384K, simulated time (ns/day):

1 Haswell node:            7.16
1 CPU node + 1x TitanX:   16.08   (2.2x)
1 CPU node + 2x TitanX:   18.13   (2.5x)
1 CPU node + 4x TitanX:   21.74   (3.0x)

SLIDE 58

768K Waters on K40s and K80s

Running GROMACS version 5.1. The blue node contains dual Intel E5-2698 v3@2.3GHz CPUs; the green nodes contain dual Intel E5-2698 v3@2.3GHz CPUs + either NVIDIA Tesla K40@875MHz or Tesla K80@562MHz (autoboost) GPUs.

Water [PME] 768K, simulated time (ns/day):

1 Haswell node:         3.58
1 CPU node + 1x K40:    5.37   (1.5x)
1 CPU node + 1x K80:    8.50   (2.4x)
1 CPU node + 2x K40:    8.60   (2.4x)
1 CPU node + 2x K80:   11.36   (3.2x)
1 CPU node + 4x K40:   11.31   (3.2x)
1 CPU node + 4x K80:   12.78   (3.6x)

SLIDE 59

768K Waters on Titan X

Running GROMACS version 5.1. The blue node contains dual Intel E5-2698 v3@2.3GHz CPUs; the green nodes contain dual Intel E5-2698 v3@2.3GHz CPUs + GeForce GTX TitanX@1000MHz GPUs.

Water [PME] 768K, simulated time (ns/day):

1 Haswell node:            3.58
1 CPU node + 1x TitanX:    8.19   (2.3x)
1 CPU node + 2x TitanX:    9.12   (2.5x)
1 CPU node + 4x TitanX:   11.51   (3.2x)

SLIDE 60

1.5M Waters on K40s and K80s

Running GROMACS version 5.1. The blue node contains dual Intel E5-2698 v3@2.3GHz CPUs; the green nodes contain dual Intel E5-2698 v3@2.3GHz CPUs + either NVIDIA Tesla K40@875MHz or Tesla K80@562MHz (autoboost) GPUs.

Water [PME] 1.5M, simulated time (ns/day):

1 Haswell node:        1.72
1 CPU node + 1x K40:   2.69   (1.6x)
1 CPU node + 1x K80:   4.13   (2.4x)
1 CPU node + 2x K40:   4.16   (2.4x)
1 CPU node + 2x K80:   5.67   (3.3x)
1 CPU node + 4x K40:   5.61   (3.3x)
1 CPU node + 4x K80:   6.07   (3.5x)

SLIDE 61

1.5M Waters on Titan X

Running GROMACS version 5.1. The blue node contains dual Intel E5-2698 v3@2.3GHz CPUs; the green nodes contain dual Intel E5-2698 v3@2.3GHz CPUs + GeForce GTX TitanX@1000MHz GPUs.

Water [PME] 1.5M, simulated time (ns/day):

1 Haswell node:           1.72
1 CPU node + 1x TitanX:   3.75   (2.2x)
1 CPU node + 2x TitanX:   4.64   (2.7x)
1 CPU node + 4x TitanX:   5.87   (3.4x)

SLIDE 62

3M Waters on K40s and K80s

Running GROMACS version 5.1. The blue node contains dual Intel E5-2698 v3@2.3GHz CPUs; the green nodes contain dual Intel E5-2698 v3@2.3GHz CPUs + either NVIDIA Tesla K40@875MHz or Tesla K80@562MHz (autoboost) GPUs.

Water [PME] 3M, simulated time (ns/day):

1 Haswell node:        0.81
1 CPU node + 1x K40:   1.32   (1.6x)
1 CPU node + 1x K80:   1.88   (2.3x)
1 CPU node + 2x K40:   1.85   (2.3x)
1 CPU node + 2x K80:   2.72   (3.4x)
1 CPU node + 4x K40:   2.76   (3.4x)
1 CPU node + 4x K80:   3.23   (4.0x)

SLIDE 63

3M Waters on Titan X

Running GROMACS version 5.1. The blue node contains dual Intel E5-2698 v3@2.3GHz CPUs; the green nodes contain dual Intel E5-2698 v3@2.3GHz CPUs + GeForce GTX TitanX@1000MHz GPUs.

Water [PME] 3M, simulated time (ns/day):

1 Haswell node:           0.81
1 CPU node + 1x TitanX:   1.53   (1.9x)
1 CPU node + 2x TitanX:   2.36   (2.9x)
1 CPU node + 4x TitanX:   2.99   (3.7x)

SLIDE 64

GROMACS 5.0: Phi vs. Kepler, K40 fastest GPU!

GROMACS 5.0 RC1 (ns/day) on K40 with boost clocks and Intel Phi, 192K waters benchmark (CUDA 6.0):

1 x Xeon E5-2697 v2@2.70GHz:                           4.96
1 x Intel Phi 3120p (native mode):                     6.02
1 x Intel Phi 5110p (native mode):                     5.90
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz:   18.19
1 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K40@875MHz:   18.55
2 x Xeon E5-2697 v2@2.70GHz:                           7.90
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz:   19.29
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K40@875MHz:   25.84

SLIDE 65

GROMACS 5.0 & Fastest Kepler GPUs yet!

GROMACS 5.0, cresta_ion_channel, single node with & without Kepler GPUs (ns/day):

1 x Xeon E5-2697 v2@2.70GHz:                                  7.92
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K20X:                20.01
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz:          21.79
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K80 (autoboost):     18.63
2 x Xeon E5-2697 v2@2.70GHz (1 node):                        11.60
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K20X (1 node):       25.49
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz (1 node): 26.00

SLIDE 66

GROMACS 5.0 & Fastest Kepler GPUs yet!

GROMACS 5.0, cresta_ion_channel_vsites, single node with & without Kepler GPUs (ns/day):

1 x Xeon E5-2697 v2@2.70GHz:                                 13.66
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K20X:                35.27
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz:          37.00
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K80 board:           31.86
2 x Xeon E5-2697 v2@2.70GHz (1 node):                        17.98
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K20X (1 node):       41.94
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz (1 node): 45.29
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K20X (1 node):       42.57
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K40@875MHz (1 node): 45.37

SLIDE 67

GROMACS 5.0 & Fastest Kepler GPUs yet!

[Chart: GROMACS 5.0, cresta_methanol, single node with & without Kepler GPUs; configurations of 1-2 x Xeon E5-2697 v2@2.70GHz with K20X, K40@875MHz, or K80 board (autoboost); ns/day axis 0.00-0.45, per-configuration values not recoverable]

SLIDE 68

GROMACS 5.0 & Fastest Kepler GPUs yet!

GROMACS 5.0, cresta_methanol_rf, single node with & without Kepler GPUs (ns/day):

1 x Xeon E5-2697 v2@2.70GHz:                                 0.12
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K20X:                0.27
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz:          0.31
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K80 board:           0.34
2 x Xeon E5-2697 v2@2.70GHz (1 node):                        0.19
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K20X (1 node):       0.30
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz (1 node): 0.36
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K20X (1 node):       0.46
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K40@875MHz (1 node): 0.52

SLIDE 69

GROMACS 5.0 & Fastest Kepler GPUs yet!

GROMACS 5.0, cresta_virus_capsid, single node with & without Kepler GPUs (ns/day):

1 x Xeon E5-2697 v2@2.70GHz:                                 0.92
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K20X:                2.79
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz:          2.99
1 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K80 board:           3.30
2 x Xeon E5-2697 v2@2.70GHz (1 node):                        1.54
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K20X (1 node):       3.24
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz (1 node): 3.83
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K20X (1 node):       4.58
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K40@875MHz (1 node): 5.18

SLIDE 70

GROMACS 5.0 & Fastest Kepler GPUs yet!

GROMACS 5.0, cresta_ion_channel, 2 to 8 nodes, with & without Kepler GPUs (ns/day):

4 x Xeon E5-2697 v2 (2 nodes):                          21.32
4 x Xeon E5-2697 v2 + 2 x Tesla K20X (2 nodes):         31.80
4 x Xeon E5-2697 v2 + 2 x Tesla K40@875MHz (2 nodes):   33.76
4 x Xeon E5-2697 v2 + 4 x Tesla K20X (2 nodes):         44.49
4 x Xeon E5-2697 v2 + 4 x Tesla K40@875MHz (2 nodes):   45.92
8 x Xeon E5-2697 v2 (4 nodes):                          35.99
8 x Xeon E5-2697 v2 + 4 x Tesla K20X (4 nodes):         48.85
8 x Xeon E5-2697 v2 + 4 x Tesla K40@875MHz (4 nodes):   52.25
8 x Xeon E5-2697 v2 + 8 x Tesla K20X (4 nodes):         59.28
8 x Xeon E5-2697 v2 + 8 x Tesla K40@875MHz (4 nodes):   61.16
16 x Xeon E5-2697 v2 (8 nodes):                         54.72
16 x Xeon E5-2697 v2 + 8 x Tesla K20X (8 nodes):        62.95
16 x Xeon E5-2697 v2 + 8 x Tesla K40@875MHz (8 nodes):  68.11
16 x Xeon E5-2697 v2 + 16 x Tesla K20X (8 nodes):       72.18
16 x Xeon E5-2697 v2 + 16 x Tesla K40@875MHz (8 nodes): 78.48
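Multi-node numbers like these are easier to compare as parallel efficiency: the throughput gain over a baseline divided by the resource ratio. A small helper (Python; the example numbers are the CPU-only cresta_ion_channel rows above):

```python
def parallel_efficiency(base_nsday: float, base_nodes: int,
                        nsday: float, nodes: int) -> float:
    """Speedup over the baseline divided by the node ratio."""
    return (nsday / base_nsday) / (nodes / base_nodes)

# cresta_ion_channel, CPU only: 2 nodes -> 21.32 ns/day, 8 nodes -> 54.72 ns/day
print(f"{parallel_efficiency(21.32, 2, 54.72, 8):.0%}")  # ~64% at 8 nodes
```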

SLIDE 71

GROMACS 5.0 & Fastest Kepler GPUs yet!

GROMACS 5.0, cresta_ion_channel_vsites, 2 to 8 nodes, with & without Kepler GPUs (ns/day):

4 x Xeon E5-2697 v2 (2 nodes):                           32.81
4 x Xeon E5-2697 v2 + 2 x Tesla K20X (2 nodes):          47.92
4 x Xeon E5-2697 v2 + 2 x Tesla K40@875MHz (2 nodes):    53.98
4 x Xeon E5-2697 v2 + 4 x Tesla K20X (2 nodes):          70.02
4 x Xeon E5-2697 v2 + 4 x Tesla K40@875MHz (2 nodes):    76.48
8 x Xeon E5-2697 v2 (4 nodes):                           55.66
8 x Xeon E5-2697 v2 + 4 x Tesla K20X (4 nodes):          75.50
8 x Xeon E5-2697 v2 + 4 x Tesla K40@875MHz (4 nodes):    81.26
8 x Xeon E5-2697 v2 + 8 x Tesla K20X (4 nodes):          99.26
8 x Xeon E5-2697 v2 + 8 x Tesla K40@875MHz (4 nodes):    98.37
16 x Xeon E5-2697 v2 (8 nodes):                          82.31
16 x Xeon E5-2697 v2 + 8 x Tesla K20X (8 nodes):        102.47
16 x Xeon E5-2697 v2 + 8 x Tesla K40@875MHz (8 nodes):  105.78
16 x Xeon E5-2697 v2 + 16 x Tesla K20X (8 nodes):       131.88
16 x Xeon E5-2697 v2 + 16 x Tesla K40@875MHz (8 nodes): 140.66

SLIDE 72

GROMACS 5.0 & Fastest Kepler GPUs yet!

GROMACS 5.0, cresta_methanol, 2 to 8 nodes, with & without Kepler GPUs (ns/day):

4 x Xeon E5-2697 v2 (2 nodes):                          0.33
4 x Xeon E5-2697 v2 + 2 x Tesla K20X (2 nodes):         0.44
4 x Xeon E5-2697 v2 + 2 x Tesla K40@875MHz (2 nodes):   0.47
4 x Xeon E5-2697 v2 + 4 x Tesla K20X (2 nodes):         0.63
4 x Xeon E5-2697 v2 + 4 x Tesla K40@875MHz (2 nodes):   0.80
8 x Xeon E5-2697 v2 (4 nodes):                          0.60
8 x Xeon E5-2697 v2 + 4 x Tesla K20X (4 nodes):         0.84
8 x Xeon E5-2697 v2 + 4 x Tesla K40@875MHz (4 nodes):   0.97
8 x Xeon E5-2697 v2 + 8 x Tesla K20X (4 nodes):         1.38
8 x Xeon E5-2697 v2 + 8 x Tesla K40@875MHz (4 nodes):   1.53
16 x Xeon E5-2697 v2 (8 nodes):                         1.25
16 x Xeon E5-2697 v2 + 8 x Tesla K20X (8 nodes):        1.73
16 x Xeon E5-2697 v2 + 8 x Tesla K40@875MHz (8 nodes):  1.83
16 x Xeon E5-2697 v2 + 16 x Tesla K20X (8 nodes):       2.73
16 x Xeon E5-2697 v2 + 16 x Tesla K40@875MHz (8 nodes): 2.85

SLIDE 73

GROMACS 5.0 & Fastest Kepler GPUs yet!

GROMACS 5.0, cresta_methanol_rf, 2 to 8 nodes, with & without Kepler GPUs (ns/day):

4 x Xeon E5-2697 v2 (2 nodes):                          0.38
4 x Xeon E5-2697 v2 + 2 x Tesla K20X (2 nodes):         0.49
4 x Xeon E5-2697 v2 + 2 x Tesla K40@875MHz (2 nodes):   0.57
4 x Xeon E5-2697 v2 + 4 x Tesla K20X (2 nodes):         0.89
4 x Xeon E5-2697 v2 + 4 x Tesla K40@875MHz (2 nodes):   1.05
8 x Xeon E5-2697 v2 (4 nodes):                          0.75
8 x Xeon E5-2697 v2 + 4 x Tesla K20X (4 nodes):         0.91
8 x Xeon E5-2697 v2 + 4 x Tesla K40@875MHz (4 nodes):   1.17
8 x Xeon E5-2697 v2 + 8 x Tesla K20X (4 nodes):         1.73
8 x Xeon E5-2697 v2 + 8 x Tesla K40@875MHz (4 nodes):   2.12
16 x Xeon E5-2697 v2 (8 nodes):                         1.48
16 x Xeon E5-2697 v2 + 8 x Tesla K20X (8 nodes):        1.86
16 x Xeon E5-2697 v2 + 8 x Tesla K40@875MHz (8 nodes):  2.23
16 x Xeon E5-2697 v2 + 16 x Tesla K20X (8 nodes):       3.65
16 x Xeon E5-2697 v2 + 16 x Tesla K40@875MHz (8 nodes): 4.16

SLIDE 74

GROMACS 5.0 & Fastest Kepler GPUs yet!

GROMACS 5.0, cresta_virus_capsid, 2 to 8 nodes, with & without Kepler GPUs (ns/day):

4 x Xeon E5-2697 v2 (2 nodes):                           2.93
4 x Xeon E5-2697 v2 + 2 x Tesla K20X (2 nodes):          5.44
4 x Xeon E5-2697 v2 + 2 x Tesla K40@875MHz (2 nodes):    5.71
4 x Xeon E5-2697 v2 + 4 x Tesla K20X (2 nodes):          8.36
4 x Xeon E5-2697 v2 + 4 x Tesla K40@875MHz (2 nodes):    8.63
8 x Xeon E5-2697 v2 (4 nodes):                           5.53
8 x Xeon E5-2697 v2 + 4 x Tesla K20X (4 nodes):          8.99
8 x Xeon E5-2697 v2 + 4 x Tesla K40@875MHz (4 nodes):    9.81
8 x Xeon E5-2697 v2 + 8 x Tesla K20X (4 nodes):         14.20
8 x Xeon E5-2697 v2 + 8 x Tesla K40@875MHz (4 nodes):   12.93
16 x Xeon E5-2697 v2 (8 nodes):                          9.18
16 x Xeon E5-2697 v2 + 8 x Tesla K20X (8 nodes):        15.24
16 x Xeon E5-2697 v2 + 8 x Tesla K40@875MHz (8 nodes):  15.57
16 x Xeon E5-2697 v2 + 16 x Tesla K20X (8 nodes):       20.30
16 x Xeon E5-2697 v2 + 16 x Tesla K40@875MHz (8 nodes): 22.01

SLIDE 75

Slides – courtesy of GROMACS Dev Team

SLIDE 76

Slides – courtesy of GROMACS Dev Team

SLIDE 77

Slides – courtesy of GROMACS Dev Team

SLIDE 78

Slides – courtesy of GROMACS Dev Team

SLIDE 79

Greener Science

Running GROMACS 4.6 with CUDA 4.1. The blue nodes contain 2x Intel X5550 CPUs (95 W TDP, 4 cores per CPU); the green node contains 2x Intel X5550 CPUs (4 cores per CPU) and 2x NVIDIA M2090 GPUs (225 W TDP per GPU).

In simulating each nanosecond, the GPU-accelerated system uses 33% less energy. Energy expended = power x time.

[Chart: energy expended (kJ) for ADH in water (134K atoms); 4 nodes (760 W) vs. 1 node + 2x M2090 (640 W); lower is better]

SLIDE 80

Recommended GPU Node Configuration for GROMACS Computational Chemistry

Workstation or single-node configuration:

# of CPU sockets:                2
Cores per CPU socket:            6+
CPU speed (GHz):                 2.66+
System memory per socket (GB):   32
GPUs:                            Kepler K20, K40, K80
# of GPUs per CPU socket:        1 (Kepler GPUs need a fast Sandy Bridge or Ivy Bridge Xeon, or high-end AMD Opterons)
GPU memory preference (GB):      6
GPU-to-CPU connection:           PCIe 3.0 or higher
Server storage:                  500 GB or higher
Network configuration:           Gemini, InfiniBand

SLIDE 81

March 2016

HOOMD-Blue

SLIDE 82

[Chart: hours to complete 10^6 sweeps vs. number of nodes (1-64) for 2^23 dodecahedra in HPMC, running on Comet. Blue nodes contain dual Intel Xeon E5-2680 v3@2.50GHz (Haswell) CPUs, in configurations of 24 to 1,536 CPU cores; green nodes contain Tesla K80 (autoboost) GPUs, in configurations of 4 to 16 K80s]

SLIDE 83

October 2015

HOOMD-BLUE 1.0

SLIDE 84

HOOMD-Blue 1.0, K40 & K80, Boost impact!

Running HOOMD-Blue version 1.0. The green nodes contain dual Intel E5-2697 v2@2.70GHz CPUs + either NVIDIA Tesla K40@875MHz or Tesla K80 (autoboost) GPUs.

Liquid benchmark, average timesteps per second:

1 CPU node + 1x K40:   1184.44
1 CPU node + 1x K80:   1496.42
1 CPU node + 2x K40:   1516.91
1 CPU node + 2x K80:   2068.27

SLIDE 85

HOOMD-Blue 1.0, K40 & K80, Boost impact!

Running HOOMD-Blue version 1.0. The green nodes contain dual Intel E5-2697 v2@2.70GHz CPUs + either NVIDIA Tesla K40@875MHz or Tesla K80 (autoboost) GPUs.

Polymer benchmark, average timesteps per second:

1 CPU node + 1x K40:   1031.79
1 CPU node + 1x K80:   1173.01
1 CPU node + 2x K40:   1203.83
1 CPU node + 2x K80:   1580.45

SLIDE 86

HOOMD-Blue 1.0, K40 & K80, Boost impact!

HOOMD-Blue 1.0, Liquid, single node with 1 or 2 Kepler GPUs (average timesteps per second):

2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz (1 node):   1184.44
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K80 (autoboost):       1496.42
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K40@875MHz (1 node):   1516.91
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K80 (autoboost):       2068.27

SLIDE 87

HOOMD-Blue 1.0, K40 & K80, Boost impact!

HOOMD-Blue, Polymer, single node with 1 or 2 Kepler GPUs (average timesteps per second):

2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K40@875MHz (1 node):   1031.79
2 x Xeon E5-2697 v2@2.70GHz + 1 x Tesla K80 (autoboost):       1173.01
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K40@875MHz (1 node):   1203.83
2 x Xeon E5-2697 v2@2.70GHz + 2 x Tesla K80 (autoboost):       1580.45

SLIDE 88

HOOMD-Blue 1.0.0 and K40, Boost impact!

HOOMD-Blue (timesteps/s) on K40 with and without boost clocks, lj_liquid (64K particles) benchmark (CUDA 5.5, ECC on, gcc 4.7.3):

2 x Xeon E5-2690 v2@3.00GHz:                          183.6
2 x Xeon E5-2690 v2@3.00GHz + 1 x Tesla K40@745MHz:  1017.4
2 x Xeon E5-2690 v2@3.00GHz + 1 x Tesla K40@875MHz:  1180.6
2 x Xeon E5-2690 v2@3.00GHz + 2 x Tesla K40@745MHz:  1412.9
2 x Xeon E5-2690 v2@3.00GHz + 2 x Tesla K40@875MHz:  1599.0
2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@745MHz:  1989.7
2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@875MHz:  2232.1
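The lj_liquid benchmark is an ordinary HOOMD-Blue Python script; a minimal sketch of that kind of run, assuming the HOOMD-Blue 1.x hoomd_script API (the system size matches the benchmark, but the other parameters are illustrative, not the exact benchmark settings):

```python
from hoomd_script import *

# Build a 64,000-particle random configuration at a liquid-like density.
init.create_random(N=64000, phi_p=0.2)

# Lennard-Jones pair potential (cutoff and coefficients are illustrative).
lj = pair.lj(r_cut=2.5)
lj.pair_coeff.set('A', 'A', epsilon=1.0, sigma=1.0)

# NVT integration; HOOMD reports average timesteps/second when the run ends.
integrate.mode_standard(dt=0.005)
integrate.nvt(group=group.all(), T=1.2, tau=0.5)
run(10000)
```

Launched as, e.g., "hoomd script.py --mode=gpu --gpu=0" to pin the run to a device (command-line options as in HOOMD-Blue 1.x).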

SLIDE 89

HOOMD-Blue 1.0.0 and K40, fastest GPU yet!

HOOMD-Blue (timesteps/s) on K40 with boost clocks, lj_liquid (64K particles) benchmark (CUDA 5.5, ECC on, gcc 4.7.3):

2 x Xeon E5-2690 v2@3.00GHz:                          183.6
2 x Xeon E5-2690 v2@3.00GHz + 1 x Tesla K40@875MHz:  1180.6
2 x Xeon E5-2690 v2@3.00GHz + 2 x Tesla K40@875MHz:  1599.0
2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@875MHz:  2232.1
4 x Xeon E5-2690 v2@3.00GHz:                          343.4
4 x Xeon E5-2690 v2@3.00GHz + 2 x Tesla K40@875MHz:  1621.9
4 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@875MHz:  2166.2
4 x Xeon E5-2690 v2@3.00GHz + 8 x Tesla K40@875MHz:  2721.6
8 x Xeon E5-2690 v2@3.00GHz:                          582.5
8 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@875MHz:  2257.0
8 x Xeon E5-2690 v2@3.00GHz + 8 x Tesla K40@875MHz:  2684.5
8 x Xeon E5-2690 v2@3.00GHz + 16 x Tesla K40@875MHz: 3235.4

SLIDE 90

HOOMD-Blue 1.0.0 and K40, fastest GPU yet!

HOOMD-Blue (timesteps/s) on K40 with boost clocks, polymer (64,017 particles) benchmark (CUDA 5.5, ECC on, gcc 4.7.3):

2 x Xeon E5-2690 v2@3.00GHz:                          179.4
2 x Xeon E5-2690 v2@3.00GHz + 1 x Tesla K40@875MHz:  1015.5
2 x Xeon E5-2690 v2@3.00GHz + 2 x Tesla K40@875MHz:  1249.5
2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@875MHz:  1759.0
4 x Xeon E5-2690 v2@3.00GHz:                          338.5
4 x Xeon E5-2690 v2@3.00GHz + 2 x Tesla K40@875MHz:  1214.2
4 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@875MHz:  1696.5
4 x Xeon E5-2690 v2@3.00GHz + 8 x Tesla K40@875MHz:  2082.1
8 x Xeon E5-2690 v2@3.00GHz:                          576.2
8 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@875MHz:  1773.6
8 x Xeon E5-2690 v2@3.00GHz + 8 x Tesla K40@875MHz:  2038.4
8 x Xeon E5-2690 v2@3.00GHz + 16 x Tesla K40@875MHz: 2434.8

SLIDE 91

HOOMD-Blue 1.0.0 and K40, fastest GPU yet!

HOOMD-Blue (timesteps/s) on K40 with boost clocks, lj_liquid (512K particles) benchmark (CUDA 5.5, ECC on, gcc 4.7.3):

2 x Xeon E5-2690 v2@3.00GHz:                           20.6
2 x Xeon E5-2690 v2@3.00GHz + 1 x Tesla K40@875MHz:   161.6
2 x Xeon E5-2690 v2@3.00GHz + 2 x Tesla K40@875MHz:   268.3
2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@875MHz:   458.0
4 x Xeon E5-2690 v2@3.00GHz:                           40.2
4 x Xeon E5-2690 v2@3.00GHz + 2 x Tesla K40@875MHz:   273.9
4 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@875MHz:   463.5
4 x Xeon E5-2690 v2@3.00GHz + 8 x Tesla K40@875MHz:   778.9
8 x Xeon E5-2690 v2@3.00GHz:                           77.5
8 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@875MHz:   474.0
8 x Xeon E5-2690 v2@3.00GHz + 8 x Tesla K40@875MHz:   757.5
8 x Xeon E5-2690 v2@3.00GHz + 16 x Tesla K40@875MHz: 1150.2

SLIDE 92

HOOMD-Blue on ARM vs. Ivy Bridge w/ & w/o K20: Equivalent Performance on ARM + K20

HOOMD-Blue 1.0.0 (timesteps/s) on ARM & Ivy Bridge with/without K20, lj_liquid (64K particles) benchmark (OpenMPI 1.8.1):

ARMv8 64-bit (2.4 GHz), 8 cores, no GPU:           31.0
ARMv8 64-bit (2.4 GHz), 8 cores + K20:            896.2
Ivy Bridge (E5-2690 v2@3.00GHz), 8 cores:          85.4
Ivy Bridge (E5-2690 v2@3.00GHz), 20 cores:        181.8
Ivy Bridge (E5-2690 v2@3.00GHz), 20 cores + K20:  896.2

SLIDE 93

Webinar - June '14: Application-Level Evaluation (HOOMD-blue)

  • Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
  • Strong scaling (fixed 64K particles): Loopback and Fastcopy get up to 45% and 48% improvement for 32 GPUs
  • Weak scaling (fixed 2K particles / GPU): Loopback and Fastcopy get up to 54% and 56% improvement for 16 GPUs

[Charts: HOOMD-blue strong and weak scaling; average TPS vs. number of GPU nodes (4-64) for MV2-2.0b-GDR, MV2-NewGDR-Loopback, and MV2-NewGDR-Fastcopy]

SLIDE 94

October 2015

LAMMPS

SLIDE 95

Lennard-Jones on K20X, K40s & K80s

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2@2.7GHz CPUs; the green nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs + either NVIDIA Tesla K20X@732MHz, Tesla K40@875MHz or Tesla K80 (autoboost) GPUs.

Atomic fluid, Lennard-Jones (2.5 cutoff), single precision (2,048,000 atoms), 1 node, average loop time in seconds (lower is better):

1 Ivy Bridge node:      6.19
1 CPU node + 1x K20X:   2.86   (2.2x)
1 CPU node + 1x K40:    2.51   (2.5x)
1 CPU node + 1x K80:    2.31   (2.7x)
1 CPU node + 2x K20X:   2.32   (2.7x)
1 CPU node + 2x K40:    2.14   (2.9x)
1 CPU node + 2x K80:    2.21   (2.8x)
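LAMMPS reports loop time rather than ns/day, so the speedup factors shown here are simply the CPU loop time divided by the accelerated loop time. A quick check of the factors above (Python):

```python
# Average loop times in seconds from the chart above; lower is better.
cpu_time = 6.19
accelerated = {"1x K20X": 2.86, "1x K40": 2.51, "1x K80": 2.31,
               "2x K20X": 2.32, "2x K40": 2.14, "2x K80": 2.21}

for config, t in accelerated.items():
    print(f"{config}: {cpu_time / t:.1f}x")  # 2.2x, 2.5x, 2.7x, 2.7x, 2.9x, 2.8x
```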

SLIDE 96

Lennard-Jones on K20X, K40s & K80s

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2@2.7GHz CPUs; the green nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs + either NVIDIA Tesla K20X@732MHz, Tesla K40@875MHz or Tesla K80 (autoboost) GPUs.

Atomic fluid, Lennard-Jones (2.5 cutoff), double precision (2,048,000 atoms), 1 node, average loop time in seconds (lower is better):

1 Ivy Bridge node:      7.98
1 CPU node + 1x K20X:   6.14   (1.3x)
1 CPU node + 1x K40:    3.60   (2.2x)
1 CPU node + 1x K80:    2.56   (3.1x)
1 CPU node + 2x K20X:   3.85   (2.1x)
1 CPU node + 2x K40:    2.62   (3.0x)
1 CPU node + 2x K80:    2.47   (3.2x)

SLIDE 97

Lennard-Jones on K20X, K40s & K80s

Running LAMMPS. The blue nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs; the green nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs + either NVIDIA Tesla K20X@732MHz, Tesla K40@875MHz or Tesla K80 (autoboost) GPUs.

Atomic fluid, Lennard-Jones (2.5 cutoff), single precision (2,048,000 atoms), 2 nodes, average loop time in seconds (lower is better):

2 Ivy Bridge nodes:      3.15
2 CPU nodes + 1x K20X:   1.60   (2.0x)
2 CPU nodes + 1x K40:    1.34   (2.4x)
2 CPU nodes + 1x K80:    1.11   (2.8x)
2 CPU nodes + 2x K20X:   1.08   (3.0x)
2 CPU nodes + 2x K40:    0.99   (3.2x)
2 CPU nodes + 2x K80:    1.04   (3.0x)

SLIDE 98

Lennard-Jones on K20X, K40s & K80s

Running LAMMPS. The blue nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs; the green nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs + either NVIDIA Tesla K20X@732MHz, Tesla K40@875MHz or Tesla K80 (autoboost) GPUs.

Atomic fluid, Lennard-Jones (2.5 cutoff), double precision (2,048,000 atoms), 2 nodes, average loop time in seconds (lower is better):

2 Ivy Bridge nodes:      4.08
2 CPU nodes + 1x K20X:   2.56   (1.6x)
2 CPU nodes + 1x K40:    2.03   (2.0x)
2 CPU nodes + 1x K80:    1.30   (3.1x)
2 CPU nodes + 2x K20X:   1.53   (2.7x)
2 CPU nodes + 2x K40:    1.29   (3.2x)
2 CPU nodes + 2x K80:    1.17   (3.5x)

SLIDE 99

Lennard-Jones on K20X, K40s & K80s

Running LAMMPS. The blue nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs; the green nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs + either NVIDIA Tesla K20X@732MHz, Tesla K40@875MHz or Tesla K80 (autoboost) GPUs.

Atomic fluid, Lennard-Jones (2.5 cutoff), single precision (2,048,000 atoms), 4 nodes, average loop time in seconds (lower is better):

4 Ivy Bridge nodes:      1.64
4 CPU nodes + 1x K20X:   1.00   (1.6x)
4 CPU nodes + 1x K40:    0.80   (2.1x)
4 CPU nodes + 1x K80:    0.65   (2.5x)
4 CPU nodes + 2x K20X:   0.61   (2.7x)
4 CPU nodes + 2x K40:    0.53   (3.1x)
4 CPU nodes + 2x K80:    0.53   (3.1x)

SLIDE 100

Lennard-Jones on K20X, K40s & K80s

Running LAMMPS. The blue nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs; the green nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs + either NVIDIA Tesla K20X@732MHz, Tesla K40@875MHz or Tesla K80 (autoboost) GPUs.

Atomic fluid, Lennard-Jones (2.5 cutoff), double precision (2,048,000 atoms), 4 nodes, average loop time in seconds (lower is better):

4 Ivy Bridge nodes:      2.09
4 CPU nodes + 1x K20X:   1.46   (1.4x)
4 CPU nodes + 1x K40:    1.17   (1.8x)
4 CPU nodes + 1x K80:    0.77   (2.7x)
4 CPU nodes + 2x K20X:   0.86   (2.4x)
4 CPU nodes + 2x K40:    0.71   (2.9x)
4 CPU nodes + 2x K80:    0.61   (3.4x)

SLIDE 101

Lennard-Jones on K20X, K40s & K80s

Running LAMMPS. The blue nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs; the green nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs + either NVIDIA Tesla K20X@732MHz, Tesla K40@875MHz or Tesla K80 (autoboost) GPUs.

Atomic fluid, Lennard-Jones (2.5 cutoff), single precision (2,048,000 atoms), 8 nodes, average loop time in seconds (lower is better):

8 Ivy Bridge nodes:      0.91
8 CPU nodes + 1x K20X:   0.62   (1.5x)
8 CPU nodes + 1x K40:    0.48   (1.9x)
8 CPU nodes + 2x K20X:   0.36   (2.5x)
8 CPU nodes + 2x K40:    0.29   (3.1x)

SLIDE 102

Lennard-Jones on K20X, K40s & K80s

Running LAMMPS. The blue nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs; the green nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs + either NVIDIA Tesla K20X@732MHz, Tesla K40@875MHz or Tesla K80 (autoboost) GPUs.

Atomic fluid, Lennard-Jones (2.5 cutoff), double precision (2,048,000 atoms), 8 nodes, average loop time in seconds (lower is better):

8 Ivy Bridge nodes:      1.18
8 CPU nodes + 1x K20X:   0.89   (1.3x)
8 CPU nodes + 1x K40:    0.70   (1.7x)
8 CPU nodes + 2x K20X:   0.51   (2.3x)
8 CPU nodes + 2x K40:    0.41   (2.9x)

SLIDE 103

Lennard-Jones on K20X, K40s & K80s

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2@2.7GHz CPUs; the green nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs + either NVIDIA Tesla K20X@732MHz, Tesla K40@875MHz or Tesla K80 (autoboost) GPUs.

Atomic fluid, Lennard-Jones (5.0 cutoff), single precision (2,048,000 atoms), 1 node, average loop time in seconds (lower is better):

1 Ivy Bridge node:      23.80
1 CPU node + 1x K40:    11.76   (2.0x)
1 CPU node + 1x K80:     6.36   (3.7x)
1 CPU node + 2x K20X:    7.56   (3.1x)
1 CPU node + 2x K40:     6.57   (3.6x)
1 CPU node + 2x K80:     3.95   (6.0x)

SLIDE 104

Lennard-Jones on K20X, K40s & K80s

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2@2.7GHz CPUs; the green nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs + either NVIDIA Tesla K20X@732MHz, Tesla K40@875MHz or Tesla K80 (autoboost) GPUs.

Atomic fluid, Lennard-Jones (5.0 cutoff), double precision (2,048,000 atoms), 1 node, average loop time in seconds (lower is better):

1 Ivy Bridge node:      31.31
1 CPU node + 1x K40:    18.77   (1.7x)
1 CPU node + 1x K80:     8.76   (3.6x)
1 CPU node + 2x K20X:   11.81   (2.7x)
1 CPU node + 2x K40:    10.30   (3.0x)
1 CPU node + 2x K80:     5.23   (6.0x)

SLIDE 105

Lennard-Jones on K20X, K40s & K80s

Running LAMMPS. The blue nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs; the green nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs + either NVIDIA Tesla K20X@732MHz, Tesla K40@875MHz or Tesla K80 (autoboost) GPUs.

Atomic fluid, Lennard-Jones (5.0 cutoff), single precision (2,048,000 atoms), 2 nodes, average loop time in seconds (lower is better):

2 Ivy Bridge nodes:      12.23
2 CPU nodes + 1x K40:     6.25   (2.0x)
2 CPU nodes + 1x K80:     3.53   (3.5x)
2 CPU nodes + 2x K20X:    4.21   (2.9x)
2 CPU nodes + 2x K40:     3.64   (3.4x)
2 CPU nodes + 2x K80:     2.27   (5.4x)

SLIDE 106

Lennard-Jones on K20X, K40s & K80s

Running LAMMPS. The blue nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs; the green nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs + either NVIDIA Tesla K20X@732MHz, Tesla K40@875MHz or Tesla K80 (autoboost) GPUs.

Atomic fluid, Lennard-Jones (5.0 cutoff), double precision (2,048,000 atoms), 2 nodes, average loop time in seconds (lower is better):

2 Ivy Bridge nodes:      16.03
2 CPU nodes + 1x K40:     9.24   (1.7x)
2 CPU nodes + 1x K80:     4.46   (3.6x)
2 CPU nodes + 2x K20X:    6.05   (2.6x)
2 CPU nodes + 2x K40:     5.02   (3.2x)
2 CPU nodes + 2x K80:     2.73   (5.9x)

SLIDE 107

Lennard-Jones on K20X, K40s & K80s

Running LAMMPS. The blue nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs; the green nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs + either NVIDIA Tesla K20X@732MHz, Tesla K40@875MHz or Tesla K80 (autoboost) GPUs.

Atomic fluid, Lennard-Jones (5.0 cutoff), single precision (2,048,000 atoms), 4 nodes, average loop time in seconds (lower is better):

4 Ivy Bridge nodes:      6.41
4 CPU nodes + 1x K40:    3.14   (2.0x)
4 CPU nodes + 1x K80:    1.78   (3.6x)
4 CPU nodes + 2x K20X:   2.30   (2.8x)
4 CPU nodes + 2x K40:    1.74   (3.7x)
4 CPU nodes + 2x K80:    1.22   (5.3x)

SLIDE 108

Lennard-Jones on K20X, K40s & K80s

Running LAMMPS. The blue nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs; the green nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs + either NVIDIA Tesla K20X@732MHz, Tesla K40@875MHz or Tesla K80 (autoboost) GPUs.

Atomic fluid, Lennard-Jones (5.0 cutoff), double precision (2,048,000 atoms), 4 nodes, average loop time in seconds (lower is better):

4 Ivy Bridge nodes:      8.56
4 CPU nodes + 1x K40:    5.12   (1.7x)
4 CPU nodes + 1x K80:    2.49   (3.4x)
4 CPU nodes + 2x K20X:   3.38   (2.5x)
4 CPU nodes + 2x K40:    2.82   (3.0x)
4 CPU nodes + 2x K80:    1.58   (5.4x)

SLIDE 109

Lennard-Jones on K20X and K40s

Running LAMMPS. The blue nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs; the green nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs + either NVIDIA Tesla K20X@732MHz or Tesla K40@875MHz GPUs.

Atomic fluid, Lennard-Jones (5.0 cutoff), single precision (2,048,000 atoms), 8 nodes, average loop time in seconds (lower is better):

8 Ivy Bridge nodes:      3.52
8 CPU nodes + 1x K40:    1.90   (1.9x)
8 CPU nodes + 2x K20X:   1.28   (2.8x)
8 CPU nodes + 2x K40:    1.03   (3.4x)

SLIDE 110

Lennard-Jones on K20X and K40s

Running LAMMPS. The blue nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs; the green nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs + either NVIDIA Tesla K20X@732MHz or Tesla K40@875MHz GPUs.

Atomic fluid, Lennard-Jones (5.0 cutoff), double precision (2,048,000 atoms), 8 nodes, average loop time in seconds (lower is better):

8 Ivy Bridge nodes:      4.60
8 CPU nodes + 1x K40:    2.73   (1.7x)
8 CPU nodes + 2x K20X:   1.95   (2.4x)
8 CPU nodes + 2x K40:    1.48   (3.1x)

SLIDE 111

Lennard-Jones single/multi-node throughput

Figure 1: Lennard-Jones single-node throughput (strong scaling). Figure 2: Lennard-Jones multi-node throughput.

SLIDE 112

EAM on K20X, K40s & K80s

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2@2.7GHz CPUs; the green nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs + either NVIDIA Tesla K20X@732MHz, Tesla K40@875MHz or Tesla K80 (autoboost) GPUs.

EAM, single precision (2,048,000 atoms), 1 node, average loop time in seconds (lower is better):

1 Ivy Bridge node:      203.01
1 CPU node + 1x K20X:   106.32   (1.9x)
1 CPU node + 1x K40:     85.71   (2.4x)
1 CPU node + 1x K80:     44.17   (4.6x)
1 CPU node + 2x K20X:    47.53   (4.3x)
1 CPU node + 2x K40:     40.27   (5.0x)
1 CPU node + 2x K80:     44.12   (4.6x)
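Loop time over a fixed number of steps converts to atom-steps per second, a throughput metric that is comparable across system sizes. A small helper (Python; the 100-step loop length is an assumption, since the deck does not state the benchmark's step count):

```python
def atom_steps_per_sec(n_atoms: int, n_steps: int, loop_time_s: float) -> float:
    """Throughput in atom-steps/s implied by a LAMMPS loop time."""
    return n_atoms * n_steps / loop_time_s

# EAM single precision, 1 CPU node + 2x K40, assuming a 100-step loop:
print(f"{atom_steps_per_sec(2_048_000, 100, 40.27):.2e}")  # ~5.09e+06
```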

SLIDE 113

EAM on K20X, K40s & K80s

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2@2.7GHz CPUs; the green nodes contain dual Intel Xeon E5-2697 v2@2.7GHz CPUs + either NVIDIA Tesla K20X@732MHz, Tesla K40@875MHz or Tesla K80 (autoboost) GPUs.

EAM, double precision (2,048,000 atoms), 1 node, average loop time in seconds (lower is better):

1 Ivy Bridge node:      202.76
1 CPU node + 1x K20X:   150.44   (1.3x)
1 CPU node + 1x K40:    119.60   (1.7x)
1 CPU node + 1x K80:     66.51   (3.0x)
1 CPU node + 2x K20X:    67.65   (3.0x)
1 CPU node + 2x K40:     54.34   (3.7x)
1 CPU node + 2x K80:     67.93   (3.0x)

slide-114
SLIDE 114

114

EAM - Single Precision (2,048,000 atoms), 2-node runs. Average loop time in seconds (lower is better):

  2 Ivybridge Nodes       104.74   (baseline)
  2 CPU Nodes + 1x K20X    57.77   (1.8X)
  2 CPU Nodes + 1x K40     46.95   (2.2X)
  2 CPU Nodes + 1x K80     23.68   (4.4X)
  2 CPU Nodes + 2x K20X    26.03   (4.0X)
  2 CPU Nodes + 2x K40     22.08   (4.7X)
  2 CPU Nodes + 2x K80     23.65   (4.4X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

EAM on K20X, K40s & K80s

slide-115
SLIDE 115

115

EAM - Double Precision (2,048,000 atoms), 2-node runs. Average loop time in seconds (lower is better):

  2 Ivybridge Nodes       104.81   (baseline)
  2 CPU Nodes + 1x K20X    80.30   (1.3X)
  2 CPU Nodes + 1x K40     64.62   (1.6X)
  2 CPU Nodes + 1x K80     33.49   (3.1X)
  2 CPU Nodes + 2x K20X    36.89   (2.8X)
  2 CPU Nodes + 2x K40     29.96   (3.5X)
  2 CPU Nodes + 2x K80     32.67   (3.2X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

EAM on K20X, K40s & K80s

slide-116
SLIDE 116

116

EAM - Single Precision (2,048,000 atoms), 4-node runs. Average loop time in seconds (lower is better):

  4 Ivybridge Nodes        54.89   (baseline)
  4 CPU Nodes + 1x K20X    33.38   (1.6X)
  4 CPU Nodes + 1x K40     28.09   (2.0X)
  4 CPU Nodes + 1x K80     13.96   (3.9X)
  4 CPU Nodes + 2x K20X    15.00   (3.7X)
  4 CPU Nodes + 2x K40     12.99   (4.2X)
  4 CPU Nodes + 2x K80     14.03   (3.9X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

EAM on K20X, K40s & K80s

slide-117
SLIDE 117

117

EAM - Double Precision (2,048,000 atoms), 4-node runs. Average loop time in seconds (lower is better):

  4 Ivybridge Nodes        54.63   (baseline)
  4 CPU Nodes + 1x K20X    44.86   (1.2X)
  4 CPU Nodes + 1x K40     36.95   (1.5X)
  4 CPU Nodes + 1x K80     18.41   (3.0X)
  4 CPU Nodes + 2x K20X    20.78   (2.6X)
  4 CPU Nodes + 2x K40     17.41   (3.1X)
  4 CPU Nodes + 2x K80     18.21   (3.0X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

EAM on K20X, K40s & K80s

slide-118
SLIDE 118

118

EAM - Single Precision (2,048,000 atoms), 8-node runs. Average loop time in seconds (lower is better):

  8 Ivybridge Nodes        28.49   (baseline)
  8 CPU Nodes + 1x K20X    24.46   (1.2X)
  8 CPU Nodes + 1x K40     21.49   (1.3X)
  8 CPU Nodes + 2x K20X    12.35   (2.3X)
  8 CPU Nodes + 2x K40     11.06   (2.6X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

EAM on K20X and K40s

slide-119
SLIDE 119

119

EAM - Double Precision (2,048,000 atoms), 8-node runs. Average loop time in seconds (lower is better):

  8 Ivybridge Nodes        29.11   (baseline)
  8 CPU Nodes + 1x K40     24.39   (1.2X)
  8 CPU Nodes + 2x K20X    14.35   (2.0X)
  8 CPU Nodes + 2x K40     12.66   (2.3X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

EAM on K20X and K40s

slide-120
SLIDE 120

120

EAM single/multi-node throughput

Figure 1: EAM single-node throughput (strong scaling). Figure 2: EAM multi-node throughput.

slide-121
SLIDE 121

121

Gay-Berne - Single Precision (2,097,152 atoms), 1-node runs. Average loop time in seconds (lower is better):

  1 Ivybridge Node        162.20   (baseline)
  1 CPU Node + 1x K40      52.66   (3.1X)
  1 CPU Node + 1x K80      34.06   (4.8X)
  1 CPU Node + 2x K20X     39.05   (4.2X)
  1 CPU Node + 2x K40      35.35   (4.6X)
  1 CPU Node + 2x K80      27.48   (5.9X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

Gay-Berne on K20X, K40s & K80s

slide-122
SLIDE 122

122

Gay-Berne - Double Precision (2,097,152 atoms), 1-node runs. Average loop time in seconds (lower is better):

  1 Ivybridge Node        254.81   (baseline)
  1 CPU Node + 1x K40     186.42   (1.4X)
  1 CPU Node + 1x K80      80.08   (3.2X)
  1 CPU Node + 2x K20X    133.79   (1.9X)
  1 CPU Node + 2x K40     102.55   (2.5X)
  1 CPU Node + 2x K80      49.06   (5.2X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

Gay-Berne on K20X, K40s & K80s

slide-123
SLIDE 123

123

Gay-Berne - Single Precision (2,097,152 atoms), 2-node runs. Average loop time in seconds (lower is better):

  2 Ivybridge Nodes        82.40   (baseline)
  2 CPU Nodes + 1x K40     24.98   (3.3X)
  2 CPU Nodes + 1x K80     17.02   (4.8X)
  2 CPU Nodes + 2x K20X    18.60   (4.4X)
  2 CPU Nodes + 2x K40     16.89   (4.9X)
  2 CPU Nodes + 2x K80     14.08   (5.9X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

Gay-Berne on K20X, K40s & K80s

slide-124
SLIDE 124

124

Gay-Berne - Double Precision (2,097,152 atoms), 2-node runs. Average loop time in seconds (lower is better):

  2 Ivybridge Nodes       128.97   (baseline)
  2 CPU Nodes + 1x K40     92.50   (1.4X)
  2 CPU Nodes + 1x K80     53.11   (2.4X)
  2 CPU Nodes + 2x K20X    66.49   (1.9X)
  2 CPU Nodes + 2x K40     50.85   (2.5X)
  2 CPU Nodes + 2x K80     23.95   (5.4X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

Gay-Berne on K20X, K40s & K80s

slide-125
SLIDE 125

125

Gay-Berne - Single Precision (2,097,152 atoms), 4-node runs. Average loop time in seconds (lower is better):

  4 Ivybridge Nodes        42.43   (baseline)
  4 CPU Nodes + 1x K40     12.38   (3.4X)
  4 CPU Nodes + 1x K80      9.19   (4.6X)
  4 CPU Nodes + 2x K20X     9.53   (4.5X)
  4 CPU Nodes + 2x K40      8.47   (5.0X)
  4 CPU Nodes + 2x K80      7.31   (5.8X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

Gay-Berne on K20X, K40s & K80s

slide-126
SLIDE 126

126

Gay-Berne - Double Precision (2,097,152 atoms), 4-node runs. Average loop time in seconds (lower is better):

  4 Ivybridge Nodes        66.17   (baseline)
  4 CPU Nodes + 1x K40     45.98   (1.4X)
  4 CPU Nodes + 1x K80     19.43   (3.4X)
  4 CPU Nodes + 2x K20X    33.24   (2.0X)
  4 CPU Nodes + 2x K40     25.26   (2.6X)
  4 CPU Nodes + 2x K80     12.33   (5.4X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

Gay-Berne on K20X, K40s & K80s

slide-127
SLIDE 127

127

Gay-Berne - Single Precision (2,097,152 atoms), 8-node runs. Average loop time in seconds (lower is better):

  8 Ivybridge Nodes        22.51   (baseline)
  8 CPU Nodes + 1x K40      6.16   (3.7X)
  8 CPU Nodes + 2x K20X     5.14   (4.4X)
  8 CPU Nodes + 2x K40      4.35   (5.2X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

Gay-Berne on K20X and K40s

slide-128
SLIDE 128

128

Gay-Berne - Double Precision (2,097,152 atoms), 8-node runs. Average loop time in seconds (lower is better):

  8 Ivybridge Nodes        35.00   (baseline)
  8 CPU Nodes + 1x K40     23.23   (1.5X)
  8 CPU Nodes + 2x K20X    17.05   (2.1X)
  8 CPU Nodes + 2x K40     13.01   (2.7X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

Gay-Berne on K20X and K40s

slide-129
SLIDE 129

129

Rhodopsin - Single Precision (2,048,000 atoms), 1-node runs. Average loop time in seconds (lower is better):

  1 Ivybridge Node        104.04   (baseline)
  1 CPU Node + 1x K40      55.25   (1.9X)
  1 CPU Node + 1x K80      30.12   (3.5X)
  1 CPU Node + 2x K20X     38.96   (2.7X)
  1 CPU Node + 2x K40      32.17   (3.2X)
  1 CPU Node + 2x K80      23.34   (4.5X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

Rhodopsin on K20X, K40s & K80s

slide-130
SLIDE 130

130

Rhodopsin - Double Precision (2,048,000 atoms), 1-node runs. Average loop time in seconds (lower is better):

  1 Ivybridge Node        138.20   (baseline)
  1 CPU Node + 1x K80      56.53   (2.4X)
  1 CPU Node + 2x K20X    100.37   (1.4X)
  1 CPU Node + 2x K40      76.35   (1.8X)
  1 CPU Node + 2x K80      32.43   (4.3X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

Rhodopsin on K20X, K40s & K80s

slide-131
SLIDE 131

131

Rhodopsin - Single Precision (2,048,000 atoms), 2-node runs. Average loop time in seconds (lower is better):

  2 Ivybridge Nodes        53.28   (baseline)
  2 CPU Nodes + 1x K40     27.27   (2.0X)
  2 CPU Nodes + 1x K80     14.93   (3.6X)
  2 CPU Nodes + 2x K20X    19.57   (2.7X)
  2 CPU Nodes + 2x K40     15.85   (3.4X)
  2 CPU Nodes + 2x K80     11.99   (4.4X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

Rhodopsin on K20X, K40s & K80s

slide-132
SLIDE 132

132

Rhodopsin - Double Precision (2,048,000 atoms), 2-node runs. Average loop time in seconds (lower is better):

  2 Ivybridge Nodes        70.42   (baseline)
  2 CPU Nodes + 1x K80     27.42   (2.6X)
  2 CPU Nodes + 2x K20X    50.59   (1.4X)
  2 CPU Nodes + 2x K40     38.15   (1.8X)
  2 CPU Nodes + 2x K80     15.78   (4.5X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

Rhodopsin on K20X, K40s & K80s

slide-133
SLIDE 133

133

Rhodopsin - Single Precision (2,048,000 atoms), 4-node runs. Average loop time in seconds (lower is better):

  4 Ivybridge Nodes        28.52   (baseline)
  4 CPU Nodes + 1x K40     14.64   (1.9X)
  4 CPU Nodes + 1x K80      8.34   (3.4X)
  4 CPU Nodes + 2x K20X    10.18   (2.8X)
  4 CPU Nodes + 2x K40      8.58   (3.3X)
  4 CPU Nodes + 2x K80      7.04   (4.1X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

Rhodopsin on K20X, K40s & K80s

slide-134
SLIDE 134

134

Rhodopsin - Double Precision (2,048,000 atoms), 4-node runs. Average loop time in seconds (lower is better):

  4 Ivybridge Nodes        37.36   (baseline)
  4 CPU Nodes + 1x K40     36.97   (1.0X)
  4 CPU Nodes + 1x K80     14.21   (2.6X)
  4 CPU Nodes + 2x K20X    25.49   (1.5X)
  4 CPU Nodes + 2x K40     19.68   (1.9X)
  4 CPU Nodes + 2x K80      8.33   (4.5X)

Running LAMMPS. The blue node contains dual Intel E5-2698 v3 @ 2.3 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 @ 562 MHz GPUs.

Rhodopsin on K20X, K40s & K80s

slide-135
SLIDE 135

135

Rhodopsin - Single Precision (2,048,000 atoms), 8-node runs. Average loop time in seconds (lower is better):

  8 Ivybridge Nodes        15.10   (baseline)
  8 CPU Nodes + 1x K40      7.93   (1.9X)
  8 CPU Nodes + 2x K20X     5.50   (2.7X)
  8 CPU Nodes + 2x K40      4.79   (3.2X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

Rhodopsin on K20X and K40s

slide-136
SLIDE 136

136

Rhodopsin - Double Precision (2,048,000 atoms), 8-node runs. Average loop time in seconds (lower is better):

  8 Ivybridge Nodes        19.64   (baseline)
  8 CPU Nodes + 1x K40     19.31   (1.0X)
  8 CPU Nodes + 2x K20X    13.40   (1.5X)
  8 CPU Nodes + 2x K40     10.28   (1.9X)

Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.

Rhodopsin on K20X and K40s

slide-137
SLIDE 137
slide-138
SLIDE 138

More Science for Your Money (LAMMPS)

Embedded Atom Model. Speedup compared to CPU-only (higher is better):

  CPU Only           1.0X
  CPU + 1x K10       1.7X
  CPU + 1x K20       2.47X
  CPU + 1x K20X      2.92X
  CPU + 2x K10       3.3X
  CPU + 2x K20       4.5X
  CPU + 2x K20X      5.5X

The blue node uses 2x E5-2687W CPUs (8 cores and 150 W per CPU). Green nodes have 2x E5-2687W CPUs and 1 or 2 NVIDIA K10, K20, or K20X GPUs (235 W each).

Experience performance increases of up to 5.5x with Kepler GPU nodes.

slide-139
SLIDE 139

Excellent Strong Scaling on Large Clusters

LAMMPS Gay-Berne, 134M atoms. [Chart: loop time in seconds vs. node count (300-900); GPU-accelerated Cray XK6 vs. CPU-only Cray XE6.]

Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU). Each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090.

From 300 to 900 nodes, the GPU-powered XK6 sustained roughly 3.5x the performance of the CPU-only XE6 (3.55x, 3.45x, and 3.48x at the measured points).

slide-140
SLIDE 140

GPUs Sustain 5x Performance for Weak Scaling

LAMMPS weak scaling with 32K atoms per node. [Chart: loop time in seconds vs. node count (1-729).] In weak scaling the per-node problem size is fixed, so the total system grows with the node count: from 32K atoms on 1 node to roughly 23-24 million atoms on 729 nodes.

GPU-accelerated nodes delivered 4.8x-6.7x the performance of CPU-only nodes (4.8x, 5.8x, and 6.7x at the labeled points).

Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU). Each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090.

slide-141
SLIDE 141

Faster, Greener — Worth It! (LAMMPS)

Energy Expended = Power x Time

Lower is better

GPU-accelerated computing uses 53% less energy than CPU only

Power calculated by combining the components' TDPs. The blue node uses 2x E5-2687W CPUs (8 cores and 150 W per CPU) and CUDA 4.2.9. Green nodes have 2x E5-2687W CPUs and 1 or 2 NVIDIA K20X GPUs (235 W each) running CUDA 5.0.35.

[Chart: energy expended (kJ) in one loop of EAM for 1 Node, 1 Node + 1x K20X, and 1 Node + 2x K20X.]

Try GPU-accelerated LAMMPS for free – www.nvidia.com/GPUTestDrive
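
As a worked example of the slide's energy estimate (TDPs from this slide; the loop times below are illustrative placeholders, not the measured chart values):

    # Energy expended = power x time, with power taken as the sum of component TDPs.
    def energy_kj(power_w, time_s):
        return power_w * time_s / 1e3       # watts x seconds -> kilojoules

    cpu_only = 2 * 150                      # W: 2x E5-2687W at 150 W each
    with_k20x = cpu_only + 235              # W: add one K20X at 235 W

    t_cpu, t_gpu = 300.0, 80.0              # s: illustrative EAM loop times
    print(energy_kj(cpu_only, t_cpu))       # 90.0 kJ, CPU only
    print(energy_kj(with_k20x, t_gpu))      # 42.8 kJ: less energy despite higher power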

slide-142
SLIDE 142

Accelerate LAMMPS Simulations with GPUs

Speedup on 1 node, relative to a single XK7 CPU: XK7 without GPU 1.0X; XK7 with GPU 6.6X.

*Brown, W.M., Yamada, M., "Implementing Molecular Dynamics on Hybrid High Performance Computers – Three-Body Potentials," Computer Physics Communications (2013, submitted)

“Summary of best speedups versus running on a single XK7 CPU for CPU-only and accelerated runs. Simulation is 400 timesteps for a 1 million molecule droplet. The speedups are calculated based on the single node loop time of 440.3 seconds.”
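
Because the speedups are defined against the 440.3-second single-node loop time, they convert directly back to loop times:

    # Converting the quoted speedups back to loop times.
    base = 440.3          # s, single XK7 CPU-only loop time (from the slide)
    print(base / 1.0)     # 440.3 s: XK7 without GPU, 1 node
    print(base / 6.6)     # ~66.7 s: XK7 with GPU, 1 node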

slide-143
SLIDE 143

Accelerate LAMMPS Simulations with GPUs

Speedup on 64 nodes, relative to a single XK7 CPU: XK7 without GPU 41.6X; XK7 with GPU 211X.

“Summary of best speedups versus running on a single XK7 CPU for CPU-only and accelerated runs. Simulation is 400 timesteps for a 1 million molecule droplet. The speedups are calculated based on the single node loop time of 440.3 seconds.”

*Brown, W.M., Yamada, M., "Implementing Molecular Dynamics on Hybrid High Performance Computers – Three-Body Potentials," Computer Physics Communications (2013, submitted)

slide-144
SLIDE 144

144

Recommended GPU Node Configuration for LAMMPS Computational Chemistry

Workstation or Single Node Configuration

  # of CPU sockets                2
  Cores per CPU socket            6+
  CPU speed (GHz)                 2.66+
  System memory per socket (GB)   32
  GPUs                            GTX Titan X; Kepler K20, K40, K80; M40
  # of GPUs per CPU socket        1-2
  GPU memory preference (GB)      6+
  GPU-to-CPU connection           PCIe 3.0 or higher
  Server storage                  500 GB or higher
  Network configuration           Gemini, InfiniBand

Scale to thousands of nodes with same single node configuration

144

slide-145
SLIDE 145

NAMD 2.11 – Up to 2X Faster

slide-146
SLIDE 146

146

New GPU features in NAMD 2.11

  • GPU-accelerated simulations up to twice as fast as NAMD 2.10
  • Pressure calculation with fixed atoms on GPU works as on CPU
  • Improved scaling for GPU-accelerated particle-mesh Ewald calculation
  • CPU-side operations overlap better and are parallelized across cores.
  • Improved scaling for GPU-accelerated simulations
  • Nonbonded force calculation results are streamed from the GPU for better overlap.
  • NVIDIA CUDA GPU-acceleration binaries for Mac OS X

Selected Text from the NAMD website
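
For context, here is a minimal sketch of launching a GPU run with a CUDA build of NAMD 2.11; the binary name, thread count, and device list are assumptions about your system, and apoa1.namd is the standard ApoA1 benchmark input used on the following slides.

    # Minimal sketch, assuming a CUDA build of NAMD 2.11 on PATH and the
    # standard ApoA1 benchmark files in the working directory.
    import subprocess

    subprocess.run([
        "namd2",
        "+p28",               # CPU worker threads (match your core count)
        "+devices", "0,1",    # bind to both GPUs of a single K80 board
        "apoa1.namd",         # benchmark configuration file
    ], check=True)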

slide-147
SLIDE 147

147

NAMD 2.11 is up to 2x faster

APoA1 (92,224 atoms), simulated time in ns/day. NAMD 2.11 speedup over NAMD 2.10: 1 node 1.2X, 2 nodes 1.6X, 4 nodes 2.0X.

Both the NAMD 2.10 and NAMD 2.11 runs used dual Intel E5-2697 v2 @ 2.7 GHz (Ivy Bridge) CPUs + 2x Tesla K80 (autoboost) GPUs per node.

slide-148
SLIDE 148

148

NAMD 2.11 APoA1 on 1 and 2 nodes

Running NAMD version 2.11. The blue nodes contain dual Intel E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the green nodes contain the same CPUs plus Tesla K80 (autoboost) GPUs.

APoA1 (92,224 atoms), simulated time in ns/day (higher is better); speedups are relative to the CPU-only run on the same number of nodes:

  1 Node              2.77   (baseline)
  1 Node + 1x K80    11.67   (4.2X)
  1 Node + 2x K80    16.99   (6.1X)
  2 Nodes             5.22   (baseline)
  2 Nodes + 1x K80   19.73   (3.8X)
  2 Nodes + 2x K80   24.31   (4.7X)
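
The ns/day rates translate directly into wall-clock estimates. For example, using the best configuration above (the 100 ns target is illustrative):

    # Wall-clock estimate from a simulation rate in ns/day.
    rate = 24.31             # ns/day: 2 nodes + 2x K80, from the chart above
    target_ns = 100          # illustrative trajectory length
    print(target_ns / rate)  # ~4.1 days of wall clock for 100 ns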

slide-149
SLIDE 149

149

NAMD 2.11 APoA1 on 4 and 8 nodes

Running NAMD version 2.11. The blue nodes contain dual Intel E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the green nodes contain the same CPUs plus Tesla K80 (autoboost) GPUs.

APoA1 (92,224 atoms), simulated time in ns/day (higher is better); speedups are relative to the CPU-only run on the same number of nodes:

  4 Nodes            10.27   (baseline)
  4 Nodes + 1x K80   20.64   (2.0X)
  4 Nodes + 2x K80   23.52   (2.3X)
  8 Nodes            16.85   (baseline)
  8 Nodes + 1x K80   27.83   (1.7X)
  8 Nodes + 2x K80   27.74   (1.6X)

slide-150
SLIDE 150

150

NAMD 2.11 is up to 1.8x faster

F1-ATPase (327,506 atoms), simulated time in ns/day. NAMD 2.11 speedup over NAMD 2.10: 1 node 1.1X, 2 nodes 1.8X, 4 nodes 1.4X.

Both the NAMD 2.10 and NAMD 2.11 runs used dual Intel E5-2697 v2 @ 2.7 GHz (Ivy Bridge) CPUs + 2x Tesla K80 (autoboost) GPUs per node.

slide-151
SLIDE 151

151

NAMD 2.11 F1-ATPase on 1 and 2 nodes

Running NAMD version 2.11. The blue nodes contain dual Intel E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the green nodes contain the same CPUs plus Tesla K80 (autoboost) GPUs.

F1-ATPase (327,506 atoms), simulated time in ns/day (higher is better); speedups are relative to the CPU-only run on the same number of nodes:

  1 Node              0.94   (baseline)
  1 Node + 1x K80     3.87   (4.1X)
  1 Node + 2x K80     6.11   (6.5X)
  2 Nodes             1.86   (baseline)
  2 Nodes + 1x K80    7.23   (3.9X)
  2 Nodes + 2x K80   10.58   (5.7X)

slide-152
SLIDE 152

152

NAMD 2.11 F1-ATPase on 4 and 8 nodes

Running NAMD version 2.11. The blue nodes contain dual Intel E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the green nodes contain the same CPUs plus Tesla K80 (autoboost) GPUs.

F1-ATPase (327,506 atoms), simulated time in ns/day (higher is better); speedups are relative to the CPU-only run on the same number of nodes:

  4 Nodes             3.63   (baseline)
  4 Nodes + 1x K80   11.66   (3.2X)
  4 Nodes + 2x K80   12.62   (3.5X)
  8 Nodes             6.88   (baseline)
  8 Nodes + 1x K80   14.22   (2.1X)
  8 Nodes + 2x K80   15.74   (2.3X)

slide-153
SLIDE 153

153

NAMD 2.11 is up to 1.5x faster

STMV (1,066,628 atoms), simulated time in ns/day. NAMD 2.11 speedup over NAMD 2.10: 1 node 1.5X, 2 nodes 1.1X, 4 nodes 1.5X.

Both the NAMD 2.10 and NAMD 2.11 runs used dual Intel E5-2697 v2 @ 2.7 GHz (Ivy Bridge) CPUs + 2x Tesla K80 (autoboost) GPUs per node.

slide-154
SLIDE 154

154

NAMD 2.11 STMV on 1 and 2 nodes

Running NAMD version 2.11. The blue nodes contain dual Intel E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the green nodes contain the same CPUs plus Tesla K80 (autoboost) GPUs.

STMV (1,066,628 atoms), simulated time in ns/day (higher is better); speedups are relative to the CPU-only run on the same number of nodes:

  1 Node              0.23   (baseline)
  1 Node + 1x K80     1.03   (4.5X)
  1 Node + 2x K80     1.75   (7.6X)
  2 Nodes             0.46   (baseline)
  2 Nodes + 1x K80    1.98   (4.3X)
  2 Nodes + 2x K80    3.27   (7.1X)

slide-155
SLIDE 155

155

NAMD 2.11 STMV on 4 and 8 nodes

Running NAMD version 2.11. The blue nodes contain dual Intel E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the green nodes contain the same CPUs plus Tesla K80 (autoboost) GPUs.

STMV (1,066,628 atoms), simulated time in ns/day (higher is better); speedups are relative to the CPU-only run on the same number of nodes:

  4 Nodes             0.90   (baseline)
  4 Nodes + 1x K80    3.61   (4.0X)
  4 Nodes + 2x K80    4.54   (5.0X)
  8 Nodes             1.74   (baseline)
  8 Nodes + 1x K80    5.86   (3.4X)
  8 Nodes + 2x K80    6.24   (3.6X)

slide-156
SLIDE 156

156

Recommended GPU Node Configuration for NAMD Computational Chemistry

Workstation or Single Node Configuration

  # of CPU sockets                2
  Cores per CPU socket            6+
  CPU speed (GHz)                 2.66+
  System memory per socket (GB)   32
  GPUs                            Kepler K20, K40, K80
  # of GPUs per CPU socket        1-2
  GPU memory preference (GB)      6-12
  GPU-to-CPU connection           PCIe 3.0 or higher
  Server storage                  500 GB or higher
  Network configuration           Gemini, InfiniBand

Scale to thousands of nodes with same single node configuration

156

slide-157
SLIDE 157

157

Benefits of MD GPU-Accelerated Computing

  • 3x-8x Faster than CPU only systems in all tests (on average)
  • Most major compute intensive aspects of classical MD ported
  • Large performance boost with marginal price increase
  • Energy usage cut by more than half
  • GPUs scale well within a node and/or over multiple nodes
  • K80 GPU is our fastest and lowest power high performance GPU yet

Try GPU accelerated MD apps for free – www.nvidia.com/GPUTestDrive

Why wouldn’t you want to turbocharge your research?

slide-158
SLIDE 158

February 11, 2016

Molecular Dynamics (MD) on GPUs