Molecular Dynamics (MD) on GPUs


SLIDE 1

Molecular Dynamics (MD) on GPUs

Feb. 2, 2017

SLIDE 2

Accelerating Discoveries

Using a supercomputer powered by the Tesla Platform with over 3,000 Tesla accelerators, University of Illinois scientists performed the first all-atom simulation of the HIV virus and discovered the chemical structure of its capsid, “the perfect target for fighting the infection.” Without GPUs, the supercomputer would need to be 5x larger for similar performance.

SLIDE 3

Overview of Life & Material Accelerated Apps

MD: All key codes are GPU-accelerated
  • Great multi-GPU performance
  • Focus on dense (up to 16) GPU nodes and/or a large number of GPU nodes
  • ACEMD*, AMBER (PMEMD)*, BAND, CHARMM, DESMOND, ESPResSo, Folding@Home, GPUgrid.net, GROMACS, HALMD, HOOMD-Blue*, LAMMPS, Lattice Microbes*, mdcore, MELD, miniMD, NAMD, OpenMM, PolyFTS, SOP-GPU* & more

QC: All key codes are ported or optimizing
  • Focus on using GPU-accelerated math libraries, OpenACC directives
  • GPU-accelerated and available today: ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS, GAMESS-UK, GPAW, LATTE, LSDalton, LSMS, MOLCAS, MOPAC2012, NWChem, OCTOPUS*, PEtot, QUICK, Q-Chem, QMCPack, Quantum Espresso/PWscf, TeraChem*
  • Active GPU acceleration projects: CASTEP, GAMESS, Gaussian, ONETEP, Quantum Supercharger Library*, VASP & more

green* = application where >90% of the workload is on GPU

SLIDE 4

MD vs. QC on GPUs

“Classical” Molecular Dynamics
  • Simulates positions of atoms over time; chemical-biological or chemical-material behaviors
  • Forces calculated from simple empirical formulas (bond rearrangement generally forbidden)
  • Up to millions of atoms
  • Solvent included without difficulty
  • Single precision dominated
  • Uses cuBLAS, cuFFT, CUDA
  • GeForce (Workstations), Tesla (Servers)
  • ECC off

Quantum Chemistry (MO, PW, DFT, Semi-Emp)
  • Calculates electronic properties; ground state, excited states, spectral properties, making/breaking bonds, physical properties
  • Forces derived from electron wave function (bond rearrangement OK, e.g., bond energies)
  • Up to a few thousand atoms
  • Generally in a vacuum but if needed, solvent treated classically (QM/MM) or using implicit methods
  • Double precision is important
  • Uses cuBLAS, cuFFT, OpenACC
  • Tesla recommended
  • ECC on
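As a rough illustration of what "simulates positions of atoms over time" with "forces calculated from simple empirical formulas" means, the sketch below integrates a single particle on a harmonic (spring-like) bond using the velocity Verlet scheme. This is a toy example under stated assumptions: the function names, the one-dimensional setup, and the unit-free parameters are all hypothetical, and it is not how any of the listed MD codes is implemented.

```python
def harmonic_force(x, k=1.0, x0=0.0):
    """Empirical 'bond' force: a spring with stiffness k and rest position x0."""
    return -k * (x - x0)


def velocity_verlet(x, v, mass, dt, n_steps, force=harmonic_force):
    """Advance one particle n_steps of size dt; returns final x, v and the positions visited."""
    f = force(x)
    trajectory = [x]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        trajectory.append(x)
    return x, v, trajectory


def total_energy(x, v, mass=1.0, k=1.0, x0=0.0):
    """Kinetic plus harmonic potential energy (hypothetical reduced units)."""
    return 0.5 * mass * v * v + 0.5 * k * (x - x0) ** 2


if __name__ == "__main__":
    x_end, v_end, traj = velocity_verlet(x=1.0, v=0.0, mass=1.0, dt=0.01, n_steps=1000)
    print(f"energy drift: {total_energy(x_end, v_end) - 0.5:.2e}")
```

Production MD codes evaluate far richer force fields over millions of atoms, which is where the GPU does its work; the time-stepping structure is the part this sketch is meant to show.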

SLIDE 5

GPU-Accelerated Molecular Dynamics Apps

ACEMD, AMBER, CHARMM, DESMOND, ESPResSo, Folding@Home, GPUGrid.net, GROMACS, HALMD, HOOMD-Blue, LAMMPS, mdcore, MELD, NAMD, OpenMM, PolyFTS

Green lettering indicates performance slides included.

GPU performance is compared against a dual-socket multi-core x86 CPU node.

SLIDE 6

Benefits of MD GPU-Accelerated Computing

  • 3x-8x faster than CPU-only systems in all tests (on average)
  • Most major compute-intensive aspects of classical MD ported
  • Large performance boost with marginal price increase
  • Energy usage cut by more than half
  • GPUs scale well within a node and/or over multiple nodes
  • K80 GPU is our fastest and lowest-power high-performance GPU yet

Try GPU-accelerated MD apps for free: www.nvidia.com/GPUTestDrive

Why wouldn’t you want to turbocharge your research?

SLIDE 7

ACEMD

SLIDE 8

www.acellera.com

  • 470 ns/day on 1 GPU for L-Iduronic acid (1,362 atoms)
  • 116 ns/day on 1 GPU for DHFR (23K atoms)

M. Harvey, G. Giupponi and G. De Fabritiis, ACEMD: Accelerated molecular dynamics simulations in the microseconds timescale, J. Chem. Theory Comput. 5, 1632 (2009)

SLIDE 9

www.acellera.com

NVT, NPT, PME, TCL, PLUMED, CAMSHIFT [1]

[1] M. J. Harvey and G. De Fabritiis, An implementation of the smooth particle-mesh Ewald (PME) method on GPU hardware, J. Chem. Theory Comput. 5, 2371–2377 (2009)
[2] For a list of selected references see http://www.acellera.com/acemd/publications

SLIDE 10

June 2017

AMBER 16

SLIDE 11

JAC_NVE on GP100s

Running AMBER version 16. The green nodes contain dual Intel(R) Core(TM) i7-4820K @ 3.70GHz CPUs + Quadro GP100 GPUs (PCIe and NVLink).

Benchmark: 23,558 atoms, PME, 2 fs time step. Results in ns/day:

  • 1 node + 1x GP100 per node (PCIe): 320.19
  • 1 node + 1x GP100 per node (NVLink): 320.14
  • 1 node + 2x GP100 per node (PCIe): 370.32
  • 1 node + 2x GP100 per node (NVLink): 404.09

SLIDE 12

JAC_NVE on GP100s

Running AMBER version 16. The green nodes contain dual Intel(R) Core(TM) i7-4820K @ 3.70GHz CPUs + Quadro GP100 GPUs (PCIe and NVLink).

Benchmark: 23,558 atoms, PME, 4 fs time step. Results in ns/day:

  • 1 node + 1x GP100 per node (PCIe): 614.42
  • 1 node + 1x GP100 per node (NVLink): 613.16
  • 1 node + 2x GP100 per node (PCIe): 714.23
  • 1 node + 2x GP100 per node (NVLink): 782.11
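The JAC_NVE figures for the 2 fs and 4 fs time steps (320.19 and 614.42 ns/day on one PCIe GP100) can be compared on equal footing by converting ns/day into integration steps per wall-clock second. The sketch below does that conversion; the function and constant names are illustrative, not taken from AMBER.

```python
SECONDS_PER_DAY = 86_400
FS_PER_NS = 1_000_000


def steps_per_second(ns_per_day: float, timestep_fs: float) -> float:
    """Convert simulated ns/day into integration steps per wall-clock second."""
    steps_per_day = ns_per_day * FS_PER_NS / timestep_fs
    return steps_per_day / SECONDS_PER_DAY


# Single-GPU (PCIe) JAC_NVE results quoted on the slides.
rate_2fs = steps_per_second(320.19, timestep_fs=2.0)
rate_4fs = steps_per_second(614.42, timestep_fs=4.0)

if __name__ == "__main__":
    print(f"2 fs: {rate_2fs:.0f} steps/s, 4 fs: {rate_4fs:.0f} steps/s")
```

Both cases come out at a similar step rate, which suggests the near-doubling in ns/day comes mostly from the longer time step rather than from faster per-step execution.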

SLIDE 13

JAC_NPT on GP100s

Running AMBER version 16. The green nodes contain dual Intel(R) Core(TM) i7-4820K @ 3.70GHz CPUs + Quadro GP100 GPUs (PCIe and NVLink).

Benchmark: 23,558 atoms, PME, 2 fs time step. Results in ns/day:

  • 1 node + 1x GP100 per node (PCIe): 295.75
  • 1 node + 1x GP100 per node (NVLink): 295.42
  • 1 node + 2x GP100 per node (PCIe): 333.03
  • 1 node + 2x GP100 per node (NVLink): 360.64

SLIDE 14

JAC_NPT on GP100s

Running AMBER version 16. The green nodes contain dual Intel(R) Core(TM) i7-4820K @ 3.70GHz CPUs + Quadro GP100 GPUs (PCIe and NVLink).

Benchmark: 23,558 atoms, PME, 4 fs time step. Results in ns/day:

  • 1 node + 1x GP100 per node (PCIe): 580.47
  • 1 node + 1x GP100 per node (NVLink): 578.48
  • 1 node + 2x GP100 per node (PCIe): 654.66
  • 1 node + 2x GP100 per node (NVLink): 706.53

SLIDE 15

FactorIX_NVE on GP100s

Running AMBER version 16. The green nodes contain dual Intel(R) Core(TM) i7-4820K @ 3.70GHz CPUs + Quadro GP100 GPUs (PCIe and NVLink).

Benchmark: 90,906 atoms, PME. Results in ns/day:

  • 1 node + 1x GP100 per node (PCIe): 106.23
  • 1 node + 1x GP100 per node (NVLink): 105.98
  • 1 node + 2x GP100 per node (PCIe): 142.45
  • 1 node + 2x GP100 per node (NVLink): 166.61

SLIDE 16

FactorIX_NPT on GP100s

Running AMBER version 16. The green nodes contain dual Intel(R) Core(TM) i7-4820K @ 3.70GHz CPUs + Quadro GP100 GPUs (PCIe and NVLink).

Benchmark: 90,906 atoms, PME. Results in ns/day:

  • 1 node + 1x GP100 per node (PCIe): 102.27
  • 1 node + 1x GP100 per node (NVLink): 102.26
  • 1 node + 2x GP100 per node (PCIe): 126.75
  • 1 node + 2x GP100 per node (NVLink): 146.34

SLIDE 17

Cellulose_NVE on GP100s

Running AMBER version 16. The green nodes contain dual Intel(R) Core(TM) i7-4820K @ 3.70GHz CPUs + Quadro GP100 GPUs (PCIe and NVLink).

Benchmark: 408,609 atoms, PME. Results in ns/day:

  • 1 node + 1x GP100 per node (PCIe): 24.01
  • 1 node + 1x GP100 per node (NVLink): 24.02
  • 1 node + 2x GP100 per node (PCIe): 31.35
  • 1 node + 2x GP100 per node (NVLink): 36.91

SLIDE 18

Cellulose_NPT on GP100s

Running AMBER version 16. The green nodes contain dual Intel(R) Core(TM) i7-4820K @ 3.70GHz CPUs + Quadro GP100 GPUs (PCIe and NVLink).

Benchmark: 408,609 atoms, PME. Results in ns/day:

  • 1 node + 1x GP100 per node (PCIe): 22.76
  • 1 node + 1x GP100 per node (NVLink): 22.8
  • 1 node + 2x GP100 per node (PCIe): 28.76
  • 1 node + 2x GP100 per node (NVLink): 32.37

SLIDE 19

STMV_NPT on GP100s

Running AMBER version 16. The green nodes contain dual Intel(R) Core(TM) i7-4820K @ 3.70GHz CPUs + Quadro GP100 GPUs (PCIe and NVLink).

Benchmark: 1,067,095 atoms, PME. Results in ns/day:

  • 1 node + 1x GP100 per node (PCIe): 15.64
  • 1 node + 1x GP100 per node (NVLink): 15.43
  • 1 node + 2x GP100 per node (PCIe): 20.22
  • 1 node + 2x GP100 per node (NVLink): 23.13

SLIDE 20

TRPCAGE on GP100s

Running AMBER version 16. The green nodes contain dual Intel(R) Core(TM) i7-4820K @ 3.70GHz CPUs + Quadro GP100 GPUs (PCIe and NVLink).

Benchmark: 304 atoms, GB. Results in ns/day:

  • 1 node + 1x GP100 per node (PCIe): 1216.56
  • 1 node + 1x GP100 per node (NVLink): 1187.3

SLIDE 21

Myoglobin on GP100s

Running AMBER version 16. The green nodes contain dual Intel(R) Core(TM) i7-4820K @ 3.70GHz CPUs + Quadro GP100 GPUs (PCIe and NVLink).

Benchmark: 2,492 atoms, GB. Results in ns/day:

  • 1 node + 1x GP100 per node (PCIe): 470.41
  • 1 node + 1x GP100 per node (NVLink): 458.28
  • 1 node + 2x GP100 per node (PCIe): 443.49
  • 1 node + 2x GP100 per node (NVLink): 447.23

SLIDE 22

Nucleosome on GP100s

Running AMBER version 16. The green nodes contain dual Intel(R) Core(TM) i7-4820K @ 3.70GHz CPUs + Quadro GP100 GPUs (PCIe and NVLink).

Benchmark: 25,095 atoms, GB. Results in ns/day:

  • 1 node + 1x GP100 per node (PCIe): 11.47
  • 1 node + 1x GP100 per node (NVLink): 11.29
  • 1 node + 2x GP100 per node (PCIe): 21.29
  • 1 node + 2x GP100 per node (NVLink): 20.51

SLIDE 23

February 2017

AMBER 16

SLIDE 24

PME-Cellulose_NPT on K80s

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs. 1x K80 is paired with a single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

PME-Cellulose_NPT (ns/day):

  • 1 Broadwell node: 2.35
  • 1 node + 1x K80 per node: 11.36 (4.8X)
  • 1 node + 2x K80 per node: 15.43 (6.6X)
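The "X" labels on these benchmark charts are consistent with dividing each GPU configuration's ns/day by the CPU-only node's ns/day and rounding to one decimal. A minimal sketch using the PME-Cellulose_NPT K80 numbers quoted above (the dictionary layout and function name are illustrative):

```python
def speedup(throughput_ns_day: float, baseline_ns_day: float) -> float:
    """Speedup of a configuration relative to the CPU-only baseline."""
    return throughput_ns_day / baseline_ns_day


# ns/day values for PME-Cellulose_NPT on K80s, as quoted on the slide.
results = {
    "1 Broadwell node": 2.35,
    "1 node + 1x K80 per node": 11.36,
    "1 node + 2x K80 per node": 15.43,
}

baseline = results["1 Broadwell node"]
factors = {
    label: round(speedup(value, baseline), 1)
    for label, value in results.items()
    if label != "1 Broadwell node"
}

if __name__ == "__main__":
    for label, factor in factors.items():
        print(f"{label}: {factor}X")
```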

SLIDE 25

PME-Cellulose_NPT on P100s PCIe

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs. 1x P100 PCIe is paired with a single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

PME-Cellulose_NPT (ns/day):

  • 1 Broadwell node: 2.35
  • 1 node + 1x P100 PCIe (16GB) per node: 21.85 (9.3X)
  • 1 node + 2x P100 PCIe (16GB) per node: 30.00 (12.8X)

SLIDE 26

PME-Cellulose_NPT on P100s SXM2

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

PME-Cellulose_NPT (ns/day):

  • 1 Broadwell node: 2.35
  • 1 node + 1x P100 SXM2 per node: 23.37 (9.9X)
  • 1 node + 2x P100 SXM2 per node: 32.22 (13.7X)
  • 1 node + 4x P100 SXM2 per node: 36.65 (15.6X)
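One way to read the multi-GPU results is parallel efficiency: throughput on n GPUs divided by n times the single-GPU throughput. The sketch below applies that definition to the PME-Cellulose_NPT P100 SXM2 figures (23.37, 32.22 and 36.65 ns/day on 1, 2 and 4 GPUs). The metric and names are a common convention chosen here for illustration, not something the slides define.

```python
def parallel_efficiency(throughput_by_gpus: dict[int, float]) -> dict[int, float]:
    """Efficiency of each GPU count relative to ideal linear scaling from 1 GPU."""
    single = throughput_by_gpus[1]
    return {
        n: throughput / (n * single)
        for n, throughput in sorted(throughput_by_gpus.items())
    }


# ns/day for PME-Cellulose_NPT on P100 SXM2, keyed by GPUs per node.
cellulose_npt_sxm2 = {1: 23.37, 2: 32.22, 4: 36.65}
efficiency = parallel_efficiency(cellulose_npt_sxm2)

if __name__ == "__main__":
    for n, eff in efficiency.items():
        print(f"{n} GPU(s): {eff:.0%} of linear scaling")
```

For this benchmark the efficiency falls to roughly 69% on two GPUs and 39% on four, so adding GPUs to a single run still raises ns/day but with diminishing returns.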

SLIDE 27

PME-Cellulose_NVE on K80s

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs. 1x K80 is paired with a single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

PME-Cellulose_NVE (ns/day):

  • 1 Broadwell node: 2.47
  • 1 node + 1x K80 per node: 11.85 (4.8X)
  • 1 node + 2x K80 per node: 16.53 (6.7X)

SLIDE 28

PME-Cellulose_NVE on P100s PCIe

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs. 1x P100 PCIe is paired with a single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

PME-Cellulose_NVE (ns/day):

  • 1 Broadwell node: 2.47
  • 1 node + 1x P100 PCIe (16GB) per node: 23.34 (9.4X)
  • 1 node + 2x P100 PCIe (16GB) per node: 32.55 (13.2X)

SLIDE 29

PME-Cellulose_NVE on P100s SXM2

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

PME-Cellulose_NVE (ns/day):

  • 1 Broadwell node: 2.47
  • 1 node + 1x P100 SXM2 per node: 24.94 (10.1X)
  • 1 node + 2x P100 SXM2 per node: 35.16 (14.2X)
  • 1 node + 4x P100 SXM2 per node: 40.88 (16.6X)

SLIDE 30

PME-FactorIX_NPT on K80s

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs. 1x K80 is paired with a single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

PME-FactorIX_NPT (ns/day):

  • 1 Broadwell node: 11.43
  • 1 node + 1x K80 per node: 48.54 (4.2X)
  • 1 node + 2x K80 per node: 66.68 (5.8X)

SLIDE 31

PME-FactorIX_NPT on P100s PCIe

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs. 1x P100 PCIe is paired with a single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

PME-FactorIX_NPT (ns/day):

  • 1 Broadwell node: 11.43
  • 1 node + 1x P100 PCIe (16GB) per node: 98.77 (8.6X)
  • 1 node + 2x P100 PCIe (16GB) per node: 132.86 (11.6X)

SLIDE 32

PME-FactorIX_NPT on P100s SXM2

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

PME-FactorIX_NPT (ns/day):

  • 1 Broadwell node: 11.43
  • 1 node + 1x P100 SXM2 per node: 106.25 (9.3X)
  • 1 node + 2x P100 SXM2 per node: 144.11 (12.6X)
  • 1 node + 4x P100 SXM2 per node: 159.80 (14.0X)

SLIDE 33

PME-FactorIX_NVE on K80s

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs. 1x K80 is paired with a single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

PME-FactorIX_NVE (ns/day):

  • 1 Broadwell node: 11.98
  • 1 node + 1x K80 per node: 51.14 (4.3X)
  • 1 node + 2x K80 per node: 71.49 (6.0X)

SLIDE 34

PME-FactorIX_NVE on P100s PCIe

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs. 1x P100 PCIe is paired with a single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

PME-FactorIX_NVE (ns/day):

  • 1 Broadwell node: 11.98
  • 1 node + 1x P100 PCIe (16GB) per node: 105.86 (8.8X)
  • 1 node + 2x P100 PCIe (16GB) per node: 145.83 (12.2X)

SLIDE 35

PME-FactorIX_NVE on P100s SXM2

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

PME-FactorIX_NVE (ns/day):

  • 1 Broadwell node: 11.98
  • 1 node + 1x P100 SXM2 per node: 114.88 (9.6X)
  • 1 node + 2x P100 SXM2 per node: 159.24 (13.3X)
  • 1 node + 4x P100 SXM2 per node: 178.02 (14.9X)

SLIDE 36

PME-JAC_NPT on K80s

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs. 1x K80 is paired with a single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

PME-JAC_NPT (ns/day):

  • 1 Broadwell node: 45.89
  • 1 node + 1x K80 per node: 162.09 (3.5X)
  • 1 node + 2x K80 per node: 216.78 (4.7X)

SLIDE 37

PME-JAC_NPT on P100s PCIe

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs. 1x P100 PCIe is paired with a single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

PME-JAC_NPT (ns/day):

  • 1 Broadwell node: 45.89
  • 1 node + 1x P100 PCIe (16GB) per node: 283.60 (6.2X)
  • 1 node + 2x P100 PCIe (16GB) per node: 327.69 (7.1X)

SLIDE 38

PME-JAC_NPT on P100s SXM2

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

PME-JAC_NPT (ns/day):

  • 1 Broadwell node: 45.89
  • 1 node + 1x P100 SXM2 per node: 310.52 (6.8X)
  • 1 node + 2x P100 SXM2 per node: 360.64 (7.9X)
  • 1 node + 4x P100 SXM2 per node: 423.09 (9.2X)

SLIDE 39

PME-JAC_NVE on K80s

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs. 1x K80 is paired with a single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

PME-JAC_NVE (ns/day):

  • 1 Broadwell node: 47.90
  • 1 node + 1x K80 per node: 173.20 (3.6X)
  • 1 node + 2x K80 per node: 234.99 (4.9X)

SLIDE 40

PME-JAC_NVE on P100s PCIe

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs. 1x P100 PCIe is paired with a single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

PME-JAC_NVE (ns/day):

  • 1 Broadwell node: 47.90
  • 1 node + 1x P100 PCIe (16GB) per node: 308.46 (6.4X)
  • 1 node + 2x P100 PCIe (16GB) per node: 363.79 (7.6X)

SLIDE 41

PME-JAC_NVE on P100s SXM2

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

PME-JAC_NVE (ns/day):

  • 1 Broadwell node: 47.90
  • 1 node + 1x P100 SXM2 per node: 339.81 (7.1X)
  • 1 node + 2x P100 SXM2 per node: 402.18 (8.4X)
  • 1 node + 4x P100 SXM2 per node: 473.10 (9.9X)

SLIDE 42

GB-Myoglobin on K80s

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs. 1x K80 is paired with a single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

GB-Myoglobin (ns/day):

  • 1 Broadwell node: 28.86
  • 1 node + 1x K80 per node: 288.47 (10.0X)
  • 1 node + 2x K80 per node: 339.45 (11.8X)

SLIDE 43

GB-Myoglobin on P100s PCIe

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs. 1x P100 PCIe is paired with a single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

GB-Myoglobin (ns/day):

  • 1 Broadwell node: 28.86
  • 1 node + 1x P100 PCIe (16GB) per node: 483.37 (16.7X)
  • 1 node + 4x P100 PCIe (16GB) per node: 561.94 (19.5X)

SLIDE 44

GB-Myoglobin on P100s SXM2

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

GB-Myoglobin (ns/day):

  • 1 Broadwell node: 28.86
  • 1 node + 1x P100 SXM2 per node: 534.28 (18.5X)
  • 1 node + 4x P100 SXM2 per node: 639.37 (22.2X)

SLIDE 45

GB-Nucleosome on K80s

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs. 1x K80 is paired with a single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

GB-Nucleosome (ns/day):

  • 1 Broadwell node: 0.40
  • 1 node + 1x K80 per node: 5.84 (14.6X)
  • 1 node + 2x K80 per node: 11.31 (28.3X)
  • 1 node + 4x K80 per node: 20.55 (51.4X)

SLIDE 46

GB-Nucleosome on P100s PCIe

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs. 1x P100 PCIe is paired with a single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

GB-Nucleosome (ns/day):

  • 1 Broadwell node: 0.40
  • 1 node + 1x P100 PCIe (16GB) per node: 11.91 (29.8X)
  • 1 node + 2x P100 PCIe (16GB) per node: 22.77 (56.9X)
  • 1 node + 4x P100 PCIe (16GB) per node: 39.91 (99.8X)
  • 1 node + 8x P100 PCIe (16GB) per node: 45.92 (114.8X)

SLIDE 47

GB-Nucleosome on P100s SXM2

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

GB-Nucleosome (ns/day):

  • 1 Broadwell node: 0.40
  • 1 node + 1x P100 SXM2 per node: 13.36 (33.4X)
  • 1 node + 2x P100 SXM2 per node: 25.53 (63.8X)
  • 1 node + 4x P100 SXM2 per node: 46.29 (115.7X)
  • 1 node + 8x P100 SXM2 per node: 48.29 (120.7X)

SLIDE 48

Rubisco-75K on K80s

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs. 1x K80 is paired with a single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

Rubisco-75K (ns/day):

  • 1 Broadwell node: 0.01
  • 1 node + 1x K80 per node: 0.35 (35.0X)
  • 1 node + 2x K80 per node: 0.69 (69.0X)
  • 1 node + 4x K80 per node: 1.34 (134.0X)

SLIDE 49

Rubisco-75K on P100s PCIe

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs. 1x P100 PCIe is paired with a single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

Rubisco-75K (ns/day):

  • 1 Broadwell node: 0.01
  • 1 node + 1x P100 PCIe (16GB) per node: 0.71 (71.0X)
  • 1 node + 2x P100 PCIe (16GB) per node: 1.40 (140.0X)
  • 1 node + 4x P100 PCIe (16GB) per node: 2.69 (269.0X)
  • 1 node + 8x P100 PCIe (16GB) per node: 4.20 (420.0X)

SLIDE 50

Rubisco-75K on P100s SXM2

Running AMBER version 16.3. The blue node contains dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell).

Rubisco-75K (ns/day):

  • 1 Broadwell node: 0.01
  • 1 node + 1x P100 SXM2 per node: 0.80 (80.0X)
  • 1 node + 2x P100 SXM2 per node: 1.57 (157.0X)
  • 1 node + 4x P100 SXM2 per node: 3.06 (306.0X)
  • 1 node + 8x P100 SXM2 per node: 4.46 (446.0X)

SLIDE 51

AMBER 14

SLIDE 52

AMBER 14 vs. AMBER 12

Courtesy of Scott Le Grand, from a GTC 2014 presentation

SLIDE 53

AMBER 14: Large P2P and Small Boost Clock Impacts

AMBER 14 (ns/day) on 4x K40; P2P and Boost Clocks Impact. DHFR NVE PME, 2fs benchmark (CUDA 6.0, ECC off).

  • 2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@745MHz (No Boost, No P2P): 125.77
  • 2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@875MHz (Boost, No P2P): 132.97
  • 2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@745MHz (No Boost, P2P): 196.68
  • 2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@875MHz (Boost, P2P): 215.18
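The slide title's claim (a large effect from P2P, a small one from boost clocks) can be checked from the four DHFR results, assuming the values pair with the configurations in the order listed: 125.77 and 132.97 ns/day without P2P at 745 MHz and 875 MHz, and 196.68 and 215.18 ns/day with P2P. The sketch below computes the relative gains; the data layout and function name are illustrative.

```python
# ns/day keyed by (GPU clock in MHz, P2P enabled); the pairing of values to
# configurations follows the order they are listed on the slide (an assumption).
dhfr_nve = {
    (745, False): 125.77,
    (875, False): 132.97,
    (745, True): 196.68,
    (875, True): 215.18,
}


def relative_gain(after: float, before: float) -> float:
    """Fractional improvement of `after` over `before` (0.10 means +10%)."""
    return after / before - 1.0


p2p_gain = {clock: relative_gain(dhfr_nve[(clock, True)], dhfr_nve[(clock, False)])
            for clock in (745, 875)}
boost_gain = {p2p: relative_gain(dhfr_nve[(875, p2p)], dhfr_nve[(745, p2p)])
              for p2p in (False, True)}

if __name__ == "__main__":
    for clock, gain in p2p_gain.items():
        print(f"P2P gain at {clock} MHz: {gain:+.1%}")
    for p2p, gain in boost_gain.items():
        print(f"Boost gain with P2P={p2p}: {gain:+.1%}")
```

Under that pairing, enabling P2P adds roughly 56% to 62% while the boost clock adds roughly 6% to 9%, in line with the slide title.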

SLIDE 54

AMBER Performance Over Time

Courtesy of Scott Le Grand, from a GTC 2014 presentation

SLIDE 55

Cellulose on K40s, K80s and M6000s

Running AMBER version 14. The blue node contains dual Intel E5-2698 v3@2.3GHz, 3.6GHz Turbo CPUs. The green nodes contain dual Intel E5-2698 v3@2.3GHz, 3.6GHz Turbo CPUs + either NVIDIA Tesla K40@875MHz, Tesla K80@562MHz (autoboost), or Quadro M6000@987MHz GPUs.

PME-Cellulose_NVE, simulated time (ns/day):

  • 1 Haswell node: 1.93
  • 1 CPU node + 1x K40: 8.96 (4.6X)
  • 1 CPU node + 0.5x K80: 7.87 (4.1X)
  • 1 CPU node + 1x K80: 11.76 (6.1X)
  • 1 CPU node + 1x M6000: 10.49 (5.4X)
  • 1 CPU node + 2x K40: 13.67 (7.1X)
  • 1 CPU node + 2x K80: 15.38 (8.0X)
  • 1 CPU node + 2x M6000: 14.90 (7.7X)

SLIDE 56

Factor IX on K40s, K80s and M6000s

Running AMBER version 14. The blue node contains dual Intel E5-2698 v3@2.3GHz, 3.6GHz Turbo CPUs. The green nodes contain dual Intel E5-2698 v3@2.3GHz, 3.6GHz Turbo CPUs + either NVIDIA Tesla K40@875MHz, Tesla K80@562MHz (autoboost), or Quadro M6000@987MHz GPUs.

PME-FactorIX_NVE, simulated time (ns/day):

  • 1 Haswell node: 9.68
  • 1 CPU node + 1x K40: 40.48 (4.2X)
  • 1 CPU node + 0.5x K80: 33.59 (3.5X)
  • 1 CPU node + 1x K80: 50.70 (5.2X)
  • 1 CPU node + 1x M6000: 47.80 (5.0X)
  • 1 CPU node + 2x K40: 61.18 (6.4X)
  • 1 CPU node + 2x K80: 60.93 (6.3X)
  • 1 CPU node + 2x M6000: 66.89 (7.0X)

SLIDE 57

JAC on K40s, K80s and M6000s

Running AMBER version 14. The blue node contains dual Intel E5-2698 v3@2.3GHz, 3.6GHz Turbo CPUs. The green nodes contain dual Intel E5-2698 v3@2.3GHz, 3.6GHz Turbo CPUs + either NVIDIA Tesla K40@875MHz, Tesla K80@562MHz (autoboost), or Quadro M6000@987MHz GPUs.

PME-JAC_NVE, simulated time (ns/day):

  • 1 Haswell node: 37.38
  • 1 CPU node + 1x K40: 134.82 (3.6X)
  • 1 CPU node + 0.5x K80: 121.30 (3.2X)
  • 1 CPU node + 1x K80: 174.34 (4.7X)
  • 1 CPU node + 1x M6000: 161.53 (4.3X)
  • 1 CPU node + 2x K40: 200.34 (5.4X)
  • 1 CPU node + 2x K80: 225.34 (6.0X)
  • 1 CPU node + 2x M6000: 219.83 (5.9X)

SLIDE 58

Cellulose on M40s

Running AMBER version 14. The blue node contains a single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPU. The green nodes contain a single Intel Xeon E5-2697 v2@2.70GHz (IvyBridge) CPU + Tesla M40 (autoboost) GPUs.

PME - Cellulose_NPT, simulated time (ns/day):

  • 1 node: 1.07
  • 1 node + 1x M40 per node: 10.12 (9.5X)
  • 1 node + 2x M40 per node: 14.40 (13.5X)
  • 1 node + 4x M40 per node: 15.90 (14.9X)

SLIDE 59

Cellulose on M40s

Running AMBER version 14. The blue node contains a single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPU. The green nodes contain a single Intel Xeon E5-2697 v2@2.70GHz (IvyBridge) CPU + Tesla M40 (autoboost) GPUs.

PME - Cellulose_NVE, simulated time (ns/day):

  • 1 node: 1.07
  • 1 node + 1x M40 per node: 10.50 (9.8X)
  • 1 node + 2x M40 per node: 15.41 (14.4X)
  • 1 node + 4x M40 per node: 17.13 (16.0X)

SLIDE 60

FactorIX on M40s

Running AMBER version 14. The blue node contains a single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPU. The green nodes contain a single Intel Xeon E5-2697 v2@2.70GHz (IvyBridge) CPU + Tesla M40 (autoboost) GPUs.

PME - FactorIX_NPT, simulated time (ns/day):

  • 1 node: 5.38
  • 1 node + 1x M40 per node: 46.90 (8.7X)
  • 1 node + 2x M40 per node: 67.37 (12.5X)
  • 1 node + 4x M40 per node: 72.96 (13.6X)

SLIDE 61

FactorIX on M40s

Running AMBER version 14. The blue node contains a single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPU. The green nodes contain a single Intel Xeon E5-2697 v2@2.70GHz (IvyBridge) CPU + Tesla M40 (autoboost) GPUs.

PME - FactorIX_NVE, simulated time (ns/day):

  • 1 node: 5.47
  • 1 node + 1x M40 per node: 49.33 (9.0X)
  • 1 node + 2x M40 per node: 73.00 (13.3X)
  • 1 node + 4x M40 per node: 80.04 (14.6X)

SLIDE 62

JAC on M40s

Running AMBER version 14. The blue node contains a single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPU. The green nodes contain a single Intel Xeon E5-2697 v2@2.70GHz (IvyBridge) CPU + Tesla M40 (autoboost) GPUs.

PME - JAC_NPT, simulated time (ns/day):

  • 1 node: 20.88
  • 1 node + 1x M40 per node: 149.40 (7.2X)
  • 1 node + 2x M40 per node: 211.97 (10.2X)
  • 1 node + 4x M40 per node: 226.63 (10.9X)

SLIDE 63

JAC on M40s

Running AMBER version 14. The blue node contains a single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPU. The green nodes contain a single Intel Xeon E5-2697 v2@2.70GHz (IvyBridge) CPU + Tesla M40 (autoboost) GPUs.

PME - JAC_NVE, simulated time (ns/day):

  • 1 node: 21.11
  • 1 node + 1x M40 per node: 157.68 (7.5X)
  • 1 node + 2x M40 per node: 230.18 (10.9X)
  • 1 node + 4x M40 per node: 246.15 (11.7X)

SLIDE 64

Myoglobin on M40s

Running AMBER version 14. The blue node contains a single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPU. The green nodes contain a single Intel Xeon E5-2697 v2@2.70GHz (IvyBridge) CPU + Tesla M40 (autoboost) GPUs.

GB - Myoglobin, simulated time (ns/day):

  • 1 node: 9.83
  • 1 node + 1x M40 per node: 232.20 (23.6X)
  • 1 node + 2x M40 per node: 300.86 (30.6X)
  • 1 node + 4x M40 per node: 322.09 (32.8X)

SLIDE 65

Nucleosome on M40s

Running AMBER version 14. The blue node contains a single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPU. The green nodes contain a single Intel Xeon E5-2697 v2@2.70GHz (IvyBridge) CPU + Tesla M40 (autoboost) GPUs.

GB - Nucleosome, simulated time (ns/day):

  • 1 node: 0.13
  • 1 node + 1x M40 per node: 4.67 (35.9X)
  • 1 node + 2x M40 per node: 9.05 (69.6X)
  • 1 node + 4x M40 per node: 16.11 (123.9X)

SLIDE 66

TrpCage on M40s

Running AMBER version 14. The blue node contains a single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPU. The green nodes contain a single Intel Xeon E5-2697 v2@2.70GHz (IvyBridge) CPU + Tesla M40 (autoboost) GPUs.

GB - TrpCage, simulated time (ns/day):

  • 1 node: 408.88
  • 1 node + 1x M40 per node: 831.91 (2.03X)
  • 1 node + 2x M40 per node: 551.36 (1.3X)
  • 1 node + 4x M40 per node: 464.63 (1.1X)

SLIDE 67

Recommended GPU Node Configuration for AMBER Computational Chemistry

Workstation or Single Node Configuration

  • Number of CPU sockets: 2
  • Cores per CPU socket: 6+ (1 CPU core drives 1 GPU)
  • CPU speed (GHz): 2.66+
  • System memory per node (GB): 16
  • GPUs: Kepler K20, K40, K80; Pascal P100
  • Number of GPUs per CPU socket: 1-4
  • GPU memory preference (GB): 6
  • GPU to CPU connection: PCIe 3.0 16x or higher
  • Server storage: 2 TB
  • Network configuration: InfiniBand QDR or better

Scale to multiple nodes with same single node configuration

SLIDE 68

July 2016

CHARMM DOMDEC-GUI

SLIDE 69

CHARMM DOMDEC-GUI 465 K System Benchmark

Running CHARMM version c40a1. The blue node contains dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs. The green nodes contain dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs.

Benchmarks were done based on the STANDARD CHARMM c40a1 version by the Yang group (FSU), which is responsible for possible benchmarking error.

465 K System (Her1_HER1_membrane), ns/day (higher is better):

  • 1 Haswell node: 0.36
  • 1 node + 1x K80 per node: 2.15 (6.0X)

slide-70
SLIDE 70

70

CHARMM DOMDEC-GUI 534 K System Benchmark

Running CHARMM version c40a1 The blue node contains Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs

Benchmarks were done based on the STANDARD CHARMM c40a1 version by the Yang group (FSU), who is responsible for possible benchmarking error.

534 K System (POPC_PSPC_CHL1mixture): ns/day (*higher is better)
1 Haswell node: 0.18
1 node + 1x K80 per node: 1.43 (8.0X)

slide-71
SLIDE 71

71

CHARMM DOMDEC-GUI 20 K System Benchmark

Running CHARMM version c40a1 The blue node contains Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs + Tesla M40 GPUs

Benchmarks were done based on the STANDARD CHARMM c40a1 version by the Yang group (FSU), who is responsible for possible benchmarking error.

20 K System (Crambin): ns/day (*higher is better)
1 Haswell node: 16.00
1 node + 1x M40 per node: 59.68 (3.7X)

slide-72
SLIDE 72

72

CHARMM DOMDEC-GUI 61 K System Benchmark

Running CHARMM version c40a1 The blue node contains Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs + Tesla M40 GPUs

Benchmarks were done based on the STANDARD CHARMM c40a1 version by the Yang group (FSU), who is responsible for possible benchmarking error.

61 K System (GlnBP): ns/day (*higher is better)
1 Haswell node: 3.90
1 node + 1x M40 per node: 25.08 (6.4X)

slide-73
SLIDE 73

73

CHARMM DOMDEC-GUI 465 K System Benchmark

Running CHARMM version c40a1 The blue node contains Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs + Tesla M40 GPUs

Benchmarks were done based on the STANDARD CHARMM c40a1 version by the Yang group (FSU), who is responsible for possible benchmarking error.

465 K System (Her1_HER1_membrane): ns/day (*higher is better)
1 Haswell node: 0.36
1 node + 1x M40 per node: 2.27 (6.3X)

slide-74
SLIDE 74

October 2016

GROMACS 2016

slide-75
SLIDE 75

75

Erik Lindahl (GROMACS developer) video

slide-76
SLIDE 76

76

Water 1.5M on K80s

Running GROMACS version 2016 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.

Water 1.5M: ns/day
1 Broadwell node: 2.79
1 node + 2x K80 per node: 5.22 (1.9X)
1 node + 4x K80 per node: 6.14 (2.2X)

slide-77
SLIDE 77

77

Water 3M on K80s

Running GROMACS version 2016 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

Water 3M: ns/day
1 Broadwell node: 1.32
1 node + 2x K80 per node: 2.66 (2.0X)
1 node + 4x K80 per node: 3.05 (2.3X)

slide-78
SLIDE 78

78

Water 1.5M on M40s

Running GROMACS version 2016 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla M40 (autoboost) GPUs

Water 1.5M: ns/day
1 Broadwell node: 2.79
1 node + 2x M40 per node: 6.15 (2.2X)
1 node + 4x M40 per node: 7.60 (2.7X)

slide-79
SLIDE 79

79

Water 3M on M40s

Running GROMACS version 2016 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla M40 (autoboost) GPUs

Water 3M: ns/day
1 Broadwell node: 1.32
1 node + 2x M40 per node: 2.97 (2.3X)
1 node + 4x M40 per node: 3.94 (3.0X)

slide-80
SLIDE 80

80

Water 1.5M on P40s

Running GROMACS version 2016 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P40 GPUs

Water 1.5M: ns/day
1 Broadwell node: 2.79
1 node + 2x P40 per node: 6.60 (2.4X)
1 node + 4x P40 per node: 8.07 (2.9X)

slide-81
SLIDE 81

81

Water 3M on P40s

Running GROMACS version 2016 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P40 GPUs

Water 3M: ns/day
1 Broadwell node: 1.32
1 node + 2x P40 per node: 3.36 (2.5X)
1 node + 4x P40 per node: 4.19 (3.2X)

slide-82
SLIDE 82

82

Water 1.5M on P100 PCIes

Running GROMACS version 2016 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs

Water 1.5M: ns/day
1 Broadwell node: 2.79
1 node + 2x P100 PCIe (16GB) per node: 6.34 (2.3X)
1 node + 4x P100 PCIe (16GB) per node: 7.11 (2.5X)

slide-83
SLIDE 83

83

Water 3M on P100 PCIes

Running GROMACS version 2016 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs

Water 3M: ns/day
1 Broadwell node: 1.32
1 node + 2x P100 PCIe (16GB) per node: 3.16 (2.4X)
1 node + 4x P100 PCIe (16GB) per node: 3.43 (2.6X)

slide-84
SLIDE 84

February 2017

GROMACS 5.1.2

slide-85
SLIDE 85

85

Water 1.5M on K80s

Running GROMACS version 5.1.2 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs ➢ 1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Water 1.5M: ns/day
1 Broadwell node: 3.04
1 node + 1x K80 per node: 3.49 (1.1X)
1 node + 2x K80 per node: 5.75 (1.9X)

slide-86
SLIDE 86

86

Water 1.5M on P100s PCIe

Water 1.5M: ns/day
1 Broadwell node: 3.04
1 node + 1x P100 PCIe (16GB) per node: 4.39 (1.4X)
1 node + 2x P100 PCIe (16GB) per node: 6.96 (2.3X)
1 node + 4x P100 PCIe (16GB) per node: 7.21 (2.4X)

Running GROMACS version 5.1.2 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

slide-87
SLIDE 87

87

Water 1.5M on P100s SXM2

Running GROMACS version 5.1.2 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Water 1.5M: ns/day
1 Broadwell node: 3.04
1 node + 1x P100 SXM2 per node: 4.11 (1.4X)
1 node + 2x P100 SXM2 per node: 6.70 (2.2X)
1 node + 4x P100 SXM2 per node: 7.18 (2.4X)
1 node + 8x P100 SXM2 per node: 7.88 (2.6X)
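To put ns/day into practical terms, the wall-clock time for a run is the target simulated time divided by the throughput. A small Python sketch, where the 100 ns target is a hypothetical example and `wallclock_days` is an illustrative helper name:

```python
def wallclock_days(target_ns, ns_per_day):
    """Days of wall-clock time needed to simulate target_ns at a given ns/day rate."""
    if ns_per_day <= 0:
        raise ValueError("ns_per_day must be positive")
    return target_ns / ns_per_day


# Water 1.5M throughputs from this slide (GROMACS 5.1.2, P100 SXM2)
water_1_5m = {
    "1 Broadwell node": 3.04,
    "1 node + 8x P100 SXM2": 7.88,
}

TARGET_NS = 100  # hypothetical target trajectory length
days = {label: wallclock_days(TARGET_NS, rate) for label, rate in water_1_5m.items()}
for label, d in days.items():
    print(f"{label}: {d:.1f} days for {TARGET_NS} ns")
```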

slide-88
SLIDE 88

88

Water 3M on K80s

Water 3M: ns/day
1 Broadwell node: 1.38
1 node + 1x K80 per node: 1.59 (1.2X)
1 node + 2x K80 per node: 2.98 (2.2X)

Running GROMACS version 5.1.2 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs ➢ 1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

slide-89
SLIDE 89

89

Water 3M on P100s PCIe

Water 3M: ns/day
1 Broadwell node: 1.38
1 node + 1x P100 PCIe (16GB) per node: 1.96 (1.4X)
1 node + 2x P100 PCIe (16GB) per node: 3.43 (2.5X)
1 node + 4x P100 PCIe (16GB) per node: 3.80 (2.8X)

Running GROMACS version 5.1.2 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

slide-90
SLIDE 90

90

Water 3M on P100s SXM2

Running GROMACS version 5.1.2 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Water 3M: ns/day
1 Broadwell node: 1.38
1 node + 1x P100 SXM2 per node: 1.84 (1.3X)
1 node + 2x P100 SXM2 per node: 3.50 (2.5X)
1 node + 4x P100 SXM2 per node: 3.82 (2.8X)

slide-91
SLIDE 91

91

Recommended GPU Node Configuration for GROMACS Computational Chemistry

Workstation or Single Node Configuration

# of CPU sockets: 2
Cores per CPU socket: 6+
CPU speed (GHz): 2.66+
System memory per socket (GB): 32
GPUs: Kepler K20, K40, K80
# of GPUs per CPU socket: 1x (Kepler GPUs: need fast Sandy Bridge or Ivy Bridge, or high-end AMD Opterons)
GPU memory preference (GB): 6
GPU to CPU connection: PCIe 3.0 or higher
Server storage: 500 GB or higher
Network configuration: Gemini, InfiniBand

slide-92
SLIDE 92

February 2017

HOOMD-Blue 1.3.3

slide-93
SLIDE 93

93

lj-liquid on K80s

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs ➢ 1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

lj-liquid: avg timesteps/sec
1 Broadwell node: 326.52
1 node + 1x K80 per node: 1324.84 (4.1X)
1 node + 2x K80 per node: 1594.37 (4.9X)
1 node + 4x K80 per node: 1942.12 (5.9X)

slide-94
SLIDE 94

94

lj-liquid on P100s PCIe

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

lj-liquid: avg timesteps/sec
1 Broadwell node: 326.52
1 node + 1x P100 PCIe (16GB) per node: 2912.66 (8.9X)
1 node + 8x P100 PCIe (16GB) per node: 3217.68 (9.9X)

slide-95
SLIDE 95

95

lj-liquid on P100s SXM2

lj-liquid: avg timesteps/sec
1 Broadwell node: 326.52
1 node + 1x P100 SXM2 per node: 3129.11 (9.6X)
1 node + 8x P100 SXM2 per node: 3397.74 (10.4X)

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

slide-96
SLIDE 96

96

lj_liquid_512k on K80s

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs ➢ 1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

lj_liquid_512k: avg timesteps/sec
1 Broadwell node: 43.43
1 node + 1x K80 per node: 220.10 (5.1X)
1 node + 2x K80 per node: 334.59 (7.7X)
1 node + 4x K80 per node: 526.47 (12.1X)

slide-97
SLIDE 97

97

lj_liquid_512k on P100s PCIe

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

lj_liquid_512k: avg timesteps/sec
1 Broadwell node: 43.43
1 node + 1x P100 PCIe (16GB) per node: 398.12 (9.2X)
1 node + 2x P100 PCIe (16GB) per node: 534.54 (12.3X)
1 node + 4x P100 PCIe (16GB) per node: 770.18 (17.7X)
1 node + 8x P100 PCIe (16GB) per node: 1045.50 (24.1X)

slide-98
SLIDE 98

98

lj_liquid_512k on P100s SXM2

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

lj_liquid_512k: avg timesteps/sec
1 Broadwell node: 43.43
1 node + 1x P100 SXM2 per node: 443.74 (10.2X)
1 node + 2x P100 SXM2 per node: 568.51 (13.1X)
1 node + 4x P100 SXM2 per node: 793.36 (18.3X)
1 node + 8x P100 SXM2 per node: 1119.76 (25.8X)
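Speedup over the CPU baseline hides how well throughput scales as GPUs are added. One common way to look at it is parallel efficiency relative to a single GPU: the throughput on n GPUs divided by n times the single-GPU throughput. The sketch below applies that definition to the lj_liquid_512k P100 SXM2 numbers from this slide; the metric itself is not on the slide, and `scaling_efficiency` is an illustrative name.

```python
def scaling_efficiency(throughput_by_gpus):
    """Throughput on n GPUs as a fraction of ideal linear scaling from 1 GPU."""
    base = throughput_by_gpus[1]
    return {n: rate / (n * base) for n, rate in sorted(throughput_by_gpus.items())}


# avg timesteps/sec, keyed by number of P100 SXM2 GPUs in the node
lj_512k_sxm2 = {1: 443.74, 2: 568.51, 4: 793.36, 8: 1119.76}

efficiency = scaling_efficiency(lj_512k_sxm2)
for n, eff in efficiency.items():
    print(f"{n}x P100 SXM2: {eff:.0%} of ideal linear scaling")
```

On these numbers the efficiency falls steadily as GPUs are added, which is consistent with a 512k-particle system being too small to keep eight P100s busy.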

slide-99
SLIDE 99

99

lj_liquid_1m on K80s

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs ➢ 1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

lj_liquid_1m: avg timesteps/sec
1 Broadwell node: 22.07
1 node + 1x K80 per node: 109.54 (5.0X)
1 node + 2x K80 per node: 181.42 (8.2X)
1 node + 4x K80 per node: 303.00 (13.7X)

slide-100
SLIDE 100

100

lj_liquid_1m on P100s PCIe

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

lj_liquid_1m: avg timesteps/sec
1 Broadwell node: 22.07
1 node + 1x P100 PCIe (16GB) per node: 204.67 (9.3X)
1 node + 2x P100 PCIe (16GB) per node: 294.88 (13.4X)
1 node + 4x P100 PCIe (16GB) per node: 465.58 (21.1X)
1 node + 8x P100 PCIe (16GB) per node: 672.46 (30.5X)

slide-101
SLIDE 101

101

lj_liquid_1m on P100s SXM2

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

lj_liquid_1m: avg timesteps/sec
1 Broadwell node: 22.07
1 node + 1x P100 SXM2 per node: 221.02 (10.0X)
1 node + 2x P100 SXM2 per node: 315.07 (14.3X)
1 node + 4x P100 SXM2 per node: 488.04 (22.1X)
1 node + 8x P100 SXM2 per node: 707.73 (32.1X)

slide-102
SLIDE 102

102

Microsphere on K80s

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs ➢ 1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

microsphere: avg timesteps/sec
1 Broadwell node: 17.53
1 node + 1x K80 per node: 64.87 (3.7X)
1 node + 2x K80 per node: 98.43 (5.6X)
1 node + 4x K80 per node: 166.74 (9.5X)

slide-103
SLIDE 103

103

Microsphere on P100s PCIe

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

microsphere: avg timesteps/sec
1 Broadwell node: 17.53
1 node + 1x P100 PCIe (16GB) per node: 145.71 (8.3X)
1 node + 2x P100 PCIe (16GB) per node: 179.54 (10.2X)
1 node + 4x P100 PCIe (16GB) per node: 257.58 (14.7X)
1 node + 8x P100 PCIe (16GB) per node: 371.24 (21.2X)

slide-104
SLIDE 104

104

Microsphere on P100s SXM2

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

microsphere: avg timesteps/sec
1 Broadwell node: 17.53
1 node + 1x P100 SXM2 per node: 151.51 (8.6X)
1 node + 2x P100 SXM2 per node: 186.01 (10.6X)
1 node + 4x P100 SXM2 per node: 271.21 (15.5X)
1 node + 8x P100 SXM2 per node: 384.72 (21.9X)

slide-105
SLIDE 105

105

Polymer on K80s

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs ➢ 1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

polymer: avg timesteps/sec
1 Broadwell node: 362.19
1 node + 1x K80 per node: 975.14 (2.7X)
1 node + 2x K80 per node: 1209.45 (3.3X)
1 node + 4x K80 per node: 1518.99 (4.2X)

slide-106
SLIDE 106

106

Polymer on P100s PCIe

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

polymer: avg timesteps/sec
1 Broadwell node: 362.19
1 node + 1x P100 PCIe (16GB) per node: 1999.64 (5.5X)
1 node + 4x P100 PCIe (16GB) per node: 2143.15 (5.9X)
1 node + 8x P100 PCIe (16GB) per node: 2480.70 (6.8X)

slide-107
SLIDE 107

107

Polymer on P100s SXM2

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

polymer: avg timesteps/sec
1 Broadwell node: 362.19
1 node + 1x P100 SXM2 per node: 2111.99 (5.8X)
1 node + 4x P100 SXM2 per node: 2272.27 (6.3X)
1 node + 8x P100 SXM2 per node: 2651.56 (7.3X)

slide-108
SLIDE 108

108

Quasicrystal on K80s

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs ➢ 1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

quasicrystal: avg timesteps/sec
1 Broadwell node: 78.32
1 node + 1x K80 per node: 502.53 (6.4X)
1 node + 2x K80 per node: 767.90 (9.8X)
1 node + 4x K80 per node: 1280.44 (16.3X)

slide-109
SLIDE 109

109

Quasicrystal on P100s PCIe

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

quasicrystal: avg timesteps/sec
1 Broadwell node: 78.32
1 node + 1x P100 PCIe (16GB) per node: 851.29 (10.9X)
1 node + 2x P100 PCIe (16GB) per node: 1199.64 (15.3X)
1 node + 4x P100 PCIe (16GB) per node: 1791.41 (22.9X)
1 node + 8x P100 PCIe (16GB) per node: 2261.72 (28.9X)

slide-110
SLIDE 110

110

Quasicrystal on P100s SXM2

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

quasicrystal: avg timesteps/sec
1 Broadwell node: 78.32
1 node + 1x P100 SXM2 per node: 939.53 (12.0X)
1 node + 2x P100 SXM2 per node: 1249.90 (16.0X)
1 node + 4x P100 SXM2 per node: 1940.29 (24.8X)
1 node + 8x P100 SXM2 per node: 2429.68 (31.0X)

slide-111
SLIDE 111

111

Triblock-copolymer on K80s

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs ➢ 1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

triblock-copolymer: avg timesteps/sec
1 Broadwell node: 361.42
1 node + 1x K80 per node: 953.01 (2.6X)
1 node + 2x K80 per node: 1170.47 (3.2X)
1 node + 4x K80 per node: 1492.01 (4.1X)

slide-112
SLIDE 112

112

Triblock-copolymer on P100s PCIe

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

triblock-copolymer: avg timesteps/sec
1 Broadwell node: 361.42
1 node + 1x P100 PCIe (16GB) per node: 1999.14 (5.5X)
1 node + 4x P100 PCIe (16GB) per node: 2155.27 (6.0X)
1 node + 8x P100 PCIe (16GB) per node: 2456.09 (6.8X)

slide-113
SLIDE 113

113

Triblock-copolymer on P100s SXM2

Running HOOMD-Blue version 1.3.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

triblock-copolymer: avg timesteps/sec
1 Broadwell node: 361.42
1 node + 1x P100 SXM2 per node: 2132.92 (5.9X)
1 node + 4x P100 SXM2 per node: 2253.83 (6.2X)
1 node + 8x P100 SXM2 per node: 2587.91 (7.2X)

slide-114
SLIDE 114

February 2017

LAMMPS 2016

slide-115
SLIDE 115

115

Atomic-Fluid Lennard-Jones 2.5 Cutoff on K80s

Running LAMMPS version 2016 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

Atomic-Fluid Lennard-Jones 2.5 Cutoff: 1/seconds
1 Broadwell node: 0.37
1 node + 2x K80 per node: 0.57 (1.5X)

slide-116
SLIDE 116

116

Atomic-Fluid Lennard-Jones 2.5 Cutoff on P100s PCIe

Running LAMMPS version 2016 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs

Atomic-Fluid Lennard-Jones 2.5 Cutoff: 1/seconds
1 Broadwell node: 0.37
1 node + 2x P100 PCIe (16GB) per node: 0.62 (1.7X)

slide-117
SLIDE 117

117

Atomic-Fluid Lennard-Jones 2.5 Cutoff on P100s SXM2

Running LAMMPS version 2016 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 (autoboost) GPUs

Atomic-Fluid Lennard-Jones 2.5 Cutoff: 1/seconds
1 Broadwell node: 0.37
1 node + 2x P100 SXM2 per node: 0.64 (1.7X)

slide-118
SLIDE 118

118

Atomic-Fluid Lennard-Jones 5.0 Cutoff on K80s

Running LAMMPS version 2016 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs ➢ 1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Atomic-Fluid Lennard-Jones 5.0 Cutoff: 1/seconds
1 Broadwell node: 0.10
1 node + 1x K80 per node: 0.14 (1.4X)
1 node + 2x K80 per node: 0.26 (2.6X)
1 node + 4x K80 per node: 0.36 (3.6X)

slide-119
SLIDE 119

119

Atomic-Fluid Lennard-Jones 5.0 Cutoff on P100s PCIe

Running LAMMPS version 2016 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Atomic-Fluid Lennard-Jones 5.0 Cutoff: 1/seconds
1 Broadwell node: 0.10
1 node + 1x P100 PCIe (16GB) per node: 0.22 (2.2X)
1 node + 2x P100 PCIe (16GB) per node: 0.35 (3.5X)
1 node + 4x P100 PCIe (16GB) per node: 0.37 (3.7X)
1 node + 8x P100 PCIe (16GB) per node: 0.38 (3.8X)
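The LAMMPS charts report "1/seconds". Assuming this is the reciprocal of the wall-clock time of the benchmark run (the slides do not define it), the values can be converted back to seconds, and the gain from each added step of GPUs can be compared. A hedged Python sketch using the values from this slide, with illustrative helper names:

```python
def seconds_per_run(rate):
    """Wall-clock seconds implied by a 1/seconds rate (assumed to be 1 / run time)."""
    return 1.0 / rate


def marginal_gains(rates):
    """Ratio of each configuration's rate to the previous configuration's rate."""
    return [(label, cur / prev) for (_, prev), (label, cur) in zip(rates, rates[1:])]


# Lennard-Jones 5.0 cutoff, P100 PCIe (16GB), values read from the chart
lj_5_pcie = [
    ("CPU only", 0.10),
    ("1x P100", 0.22),
    ("2x P100", 0.35),
    ("4x P100", 0.37),
    ("8x P100", 0.38),
]

gains = marginal_gains(lj_5_pcie)
for label, rate in lj_5_pcie:
    print(f"{label}: {seconds_per_run(rate):.1f} s")
for label, gain in gains:
    print(f"{label}: {gain:.2f}x over the previous configuration")
```

Read this way, most of the benefit on this benchmark arrives with the first two GPUs, and going from 4 to 8 GPUs changes the rate by only a few percent.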

slide-120
SLIDE 120

120

Atomic-Fluid Lennard-Jones 5.0 Cutoff on P100s SXM2

Running LAMMPS version 2016 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Atomic-Fluid Lennard-Jones 5.0 Cutoff: 1/seconds
1 Broadwell node: 0.10
1 node + 1x P100 SXM2 per node: 0.22 (2.2X)
1 node + 2x P100 SXM2 per node: 0.36 (3.6X)
1 node + 4x P100 SXM2 per node: 0.41 (4.1X)

slide-121
SLIDE 121

121

Coarse-grain Water on K80s

Running LAMMPS version 2016 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

Coarse-grain Water: 1/seconds
1 Broadwell node: 0.00437
1 node + 4x K80 per node: 0.00444 (1.0X)

slide-122
SLIDE 122

122

Coarse-grain Water on P100s PCIe

Running LAMMPS version 2016 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs

Coarse-grain Water: 1/seconds
1 Broadwell node: 0.0044
1 node + 4x P100 PCIe (16GB) per node: 0.0061 (1.4X)
1 node + 8x P100 PCIe (16GB) per node: 0.0093 (2.1X)

slide-123
SLIDE 123

123

Coarse-grain Water on P100s SXM2

Running LAMMPS version 2016 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

Coarse-grain Water: 1/seconds
1 Broadwell node: 0.0044
1 node + 4x P100 SXM2 per node: 0.0069 (1.6X)
1 node + 8x P100 SXM2 per node: 0.0110 (2.5X)

slide-124
SLIDE 124

124

EAM on K80s

Running LAMMPS version 2016 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs ➢ 1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

EAM: 1/seconds
1 Broadwell node: 0.01
1 node + 1x K80 per node: 0.02 (2.0X)
1 node + 2x K80 per node: 0.04 (4.0X)
1 node + 4x K80 per node: 0.07 (7.0X)

slide-125
SLIDE 125

125

EAM on P100s PCIe

Running LAMMPS version 2016 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

EAM: 1/seconds
1 Broadwell node: 0.01
1 node + 1x P100 PCIe (16GB) per node: 0.03 (3.0X)
1 node + 2x P100 PCIe (16GB) per node: 0.05 (5.0X)
1 node + 4x P100 PCIe (16GB) per node: 0.08 (8.0X)
1 node + 8x P100 PCIe (16GB) per node: 0.13 (13.0X)

slide-126
SLIDE 126

126

EAM on P100s SXM2

Running LAMMPS version 2016 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

EAM: 1/seconds
1 Broadwell node: 0.01
1 node + 1x P100 SXM2 per node: 0.03 (3.0X)
1 node + 2x P100 SXM2 per node: 0.05 (5.0X)
1 node + 4x P100 SXM2 per node: 0.08 (8.0X)
1 node + 8x P100 SXM2 per node: 0.13 (13.0X)

slide-127
SLIDE 127

127

Gay-Berne on K80s

Running LAMMPS version 2016 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs ➢ 1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Gay-Berne: 1/seconds
1 Broadwell node: 0.01
1 node + 1x K80 per node: 0.02 (2.0X)
1 node + 2x K80 per node: 0.03 (3.0X)
1 node + 4x K80 per node: 0.04 (4.0X)

slide-128
SLIDE 128

128

Gay-Berne on P100s PCIe

Running LAMMPS version 2016 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Gay-Berne: 1/seconds
1 Broadwell node: 0.01
1 node + 1x P100 PCIe (16GB) per node: 0.02 (2.0X)
1 node + 2x P100 PCIe (16GB) per node: 0.04 (4.0X)
1 node + 4x P100 PCIe (16GB) per node: 0.05 (5.0X)

slide-129
SLIDE 129

129

Gay-Berne on P100s SXM2

Running LAMMPS version 2016 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Gay-Berne: 1/seconds
1 Broadwell node: 0.01
1 node + 1x P100 SXM2 per node: 0.02 (2.0X)
1 node + 2x P100 SXM2 per node: 0.04 (4.0X)
1 node + 4x P100 SXM2 per node: 0.05 (5.0X)

slide-130
SLIDE 130

130

Rhodopsin on K80s

Running LAMMPS version 2016 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs ➢ 1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Rhodopsin: 1/seconds
1 Broadwell node: 0.22
1 node + 1x K80 per node: 0.22
1 node + 2x K80 per node: 0.31 (1.4X)
1 node + 4x K80 per node: 0.38 (1.7X)

slide-131
SLIDE 131

131

Rhodopsin on P100s PCIe

Running LAMMPS version 2016.
The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs.
The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs.
➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Rhodopsin, 1/seconds (speedup vs. 1 Broadwell node):

  • 1 Broadwell node: 0.22
  • 1 node + 1x P100 PCIe (16GB) per node: 0.29 (1.3X)
  • 1 node + 2x P100 PCIe (16GB) per node: 0.33 (1.5X)
  • 1 node + 4x P100 PCIe (16GB) per node: 0.48 (2.2X)
  • 1 node + 8x P100 PCIe (16GB) per node: 0.52 (2.4X)
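The speedup labels on these LAMMPS charts appear to be each GPU configuration's throughput divided by the CPU-only baseline. A minimal Python sketch of that arithmetic, using the Rhodopsin P100 PCIe values from this slide; the dictionary keys and the `speedups` helper are illustrative names, not part of the original deck:

```python
# Rhodopsin throughput (1/seconds) on P100 PCIe, copied from the slide.
throughput = {
    "1 Broadwell node": 0.22,
    "1x P100 PCIe": 0.29,
    "2x P100 PCIe": 0.33,
    "4x P100 PCIe": 0.48,
    "8x P100 PCIe": 0.52,
}


def speedups(results, baseline_key):
    """Return each configuration's throughput relative to the baseline,
    rounded to one decimal place as on the slide."""
    base = results[baseline_key]
    return {
        name: round(value / base, 1)
        for name, value in results.items()
        if name != baseline_key
    }


rhodopsin_speedups = speedups(throughput, "1 Broadwell node")
for name, factor in rhodopsin_speedups.items():
    print(f"{name}: {factor}X")
```

The rounded ratios reproduce the 1.3X, 1.5X, 2.2X and 2.4X labels shown on the slide.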

slide-132
SLIDE 132

132

Rhodopsin on P100s SXM2

Running LAMMPS version 2016.
The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs.
The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs.
➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Rhodopsin, 1/seconds (speedup vs. 1 Broadwell node):

  • 1 Broadwell node: 0.22
  • 1 node + 1x P100 SXM2 per node: 0.30 (1.4X)
  • 1 node + 2x P100 SXM2 per node: 0.38 (1.7X)
  • 1 node + 4x P100 SXM2 per node: 0.49 (2.2X)
  • 1 node + 8x P100 SXM2 per node: 0.50 (2.3X)

slide-133
SLIDE 133

133

Recommended GPU Node Configuration for LAMMPS Computational Chemistry

Workstation or Single Node Configuration

  • # of CPU sockets: 2
  • Cores per CPU socket: 6+
  • CPU speed (GHz): 2.66+
  • System memory per socket (GB): 32
  • GPUs: GTX Titan X, Kepler K20, K40, K80, M40
  • # of GPUs per CPU socket: 1-2
  • GPU memory preference (GB): 6+
  • GPU to CPU connection: PCIe 3.0 or higher
  • Server storage: 500 GB or higher
  • Network configuration: Gemini, InfiniBand

Scale to thousands of nodes with same single node configuration


slide-134
SLIDE 134

July 2017

NAMD 2.12

slide-135
SLIDE 135

135

APOA1 on K80s

Running NAMD version 2.12.
The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs.
The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs.

APOA1, ns/day:

  • 1 Broadwell node: 3.45
  • 1 node + 1x K80 per node: 14.92
  • 1 node + 2x K80 per node: 17.73
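The ns/day figures translate directly into wall-clock time for a given trajectory length. The sketch below applies that conversion to the APOA1 K80 numbers from this slide; the 100 ns target is a hypothetical value chosen only for illustration, and the names are not from the original deck:

```python
# APOA1 throughput in ns/day, copied from the slide (NAMD 2.12, K80).
ns_per_day = {
    "1 Broadwell node": 3.45,
    "1 node + 1x K80": 14.92,
    "1 node + 2x K80": 17.73,
}

# Hypothetical trajectory length; the slides do not specify one.
TARGET_NS = 100.0


def days_needed(target_ns, rate_ns_per_day):
    """Wall-clock days required to simulate target_ns at the given rate."""
    return target_ns / rate_ns_per_day


wall_days = {
    name: days_needed(TARGET_NS, rate) for name, rate in ns_per_day.items()
}
for name, days in wall_days.items():
    print(f"{name}: {days:.1f} days for {TARGET_NS:.0f} ns")
```

Under that assumed target, the CPU-only node needs roughly 29 days, while the one- and two-K80 configurations need about 6.7 and 5.6 days.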

slide-136
SLIDE 136

136

APOA1 on P100s PCIe

Running NAMD version 2.12.
The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs.
The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs.

APOA1, ns/day:

  • 1 Broadwell node: 3.45
  • 1 node + 1x P100 PCIe (16GB) per node: 22.58
  • 1 node + 2x P100 PCIe (16GB) per node: 22.85

slide-137
SLIDE 137

137

APOA1 on P100s SXM2

Running NAMD version 2.12.
The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs.
The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs.

APOA1, ns/day:

  • 1 Broadwell node: 3.45
  • 1 node + 1x P100 SXM2 per node: 22.98
  • 1 node + 2x P100 SXM2 per node: 23.44
  • 1 node + 4x P100 SXM2 per node: 23.87

slide-138
SLIDE 138

138

F1ATPASE on K80s

Running NAMD version 2.12.
The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs.
The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs.

F1ATPASE, ns/day:

  • 1 Broadwell node: 1.15
  • 1 node + 1x K80 per node: 4.81
  • 1 node + 2x K80 per node: 6.27

slide-139
SLIDE 139

139

F1ATPASE on P100s PCIe

Running NAMD version 2.12.
The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs.
The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs.

F1ATPASE, ns/day:

  • 1 Broadwell node: 1.15
  • 1 node + 1x P100 PCIe (16GB) per node: 7.34
  • 1 node + 2x P100 PCIe (16GB) per node: 6.99
  • 1 node + 4x P100 PCIe (16GB) per node: 7.40
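On this slide, adding P100 GPUs beyond the first barely changes F1ATPASE throughput. One simple way to quantify that is the percent change relative to the single-GPU result. This is a sketch using the slide's values; `change_vs_first` is an illustrative helper, not something from the deck:

```python
# F1ATPASE throughput (ns/day) on P100 PCIe, copied from the slide.
gpu_counts = [1, 2, 4]
ns_per_day = [7.34, 6.99, 7.40]


def change_vs_first(values):
    """Percent change of each entry relative to the first entry."""
    first = values[0]
    return [round(100.0 * (v - first) / first, 1) for v in values]


pct_change = change_vs_first(ns_per_day)
for count, pct in zip(gpu_counts, pct_change):
    print(f"{count}x P100 PCIe: {pct:+.1f}% vs. 1x P100 PCIe")
```

The 2x and 4x configurations land within about 5% of the 1x result (-4.8% and +0.8%), which suggests that, for this benchmark and setup, a single P100 already saturates what the node can deliver.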

slide-140
SLIDE 140

140

F1ATPASE on P100s SXM2

Running NAMD version 2.12.
The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs.
The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs.

F1ATPASE, ns/day:

  • 1 Broadwell node: 1.15
  • 1 node + 1x P100 SXM2 per node: 7.11
  • 1 node + 2x P100 SXM2 per node: 6.85
  • 1 node + 4x P100 SXM2 per node: 7.11

slide-141
SLIDE 141

141

STMV on K80s

Running NAMD version 2.12.
The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs.
The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs.

STMV, ns/day:

  • 1 Broadwell node: 0.292
  • 1 node + 1x K80 per node: 1.274
  • 1 node + 2x K80 per node: 2.085

slide-142
SLIDE 142

142

STMV on P100s PCIe

Running NAMD version 2.12.
The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs.
The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs.

STMV, ns/day:

  • 1 Broadwell node: 0.29
  • 1 node + 1x P100 PCIe (16GB) per node: 2.15
  • 1 node + 2x P100 PCIe (16GB) per node: 2.32

slide-143
SLIDE 143

143

STMV on P100s SXM2

Running NAMD version 2.12.
The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs.
The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs.

STMV, ns/day:

  • 1 Broadwell node: 0.292
  • 1 node + 1x P100 SXM2 per node: 2.077

slide-144
SLIDE 144

NAMD 2.11 – Up to 2X Faster

slide-145
SLIDE 145

145

New GPU features in NAMD 2.11

  • GPU-accelerated simulations up to twice as fast as NAMD 2.10
  • Pressure calculation with fixed atoms on GPU works as on CPU
  • Improved scaling for GPU-accelerated particle-mesh Ewald calculation
  • CPU-side operations overlap better and are parallelized across cores.
  • Improved scaling for GPU-accelerated simulations
  • Nonbonded force calculation results are streamed from the GPU for better overlap.
  • NVIDIA CUDA GPU-acceleration binaries for Mac OS X

Selected Text from the NAMD website

slide-146
SLIDE 146

146

NAMD 2.11 is up to 2x faster

APoA1 (92,224 atoms), simulated time (ns/day). Speedup of NAMD 2.11 over NAMD 2.10:

  • 1 Node: 1.2X
  • 2 Nodes: 1.6X
  • 4 Nodes: 2.0X

Both NAMD 2.10 and NAMD 2.11 were run on nodes containing Dual Intel E5-2697 v2@2.7GHz (IvyBridge) CPUs + 2 Tesla K80 (autoboost) GPUs.

slide-147
SLIDE 147

147

NAMD 2.11 APoA1 on 1 and 2 nodes

Running NAMD version 2.11.
The blue nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs.
The green nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs.

APoA1 (92,224 atoms), simulated time in ns/day (speedup vs. CPU-only at the same node count):

  • 1 Node: 2.77
  • 1 Node + 1x K80: 11.67 (4.2X)
  • 1 Node + 2x K80: 16.99 (6.1X)
  • 2 Nodes: 5.22
  • 2 Nodes + 1x K80: 19.73 (3.8X)
  • 2 Nodes + 2x K80: 24.31 (4.7X)

slide-148
SLIDE 148

148

NAMD 2.11 APoA1 on 4 and 8 nodes

Running NAMD version 2.11.
The blue nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs.
The green nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs.

APoA1 (92,224 atoms), simulated time in ns/day (speedup vs. CPU-only at the same node count):

  • 4 Nodes: 10.27
  • 4 Nodes + 1x K80: 20.64 (2.0X)
  • 4 Nodes + 2x K80: 23.52 (2.3X)
  • 8 Nodes: 16.85
  • 8 Nodes + 1x K80: 27.83 (1.7X)
  • 8 Nodes + 2x K80: 27.74 (1.6X)
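The shrinking GPU speedup at higher node counts can also be read as multi-node parallel efficiency, defined here as throughput on N nodes divided by N times the single-node throughput. The sketch below applies that definition to the NAMD 2.11 APoA1 numbers for 1, 2, 4 and 8 nodes (the 1- and 2-node values come from the preceding slide); the efficiency metric and the helper name are illustrative choices, not part of the deck:

```python
# NAMD 2.11 APoA1 throughput (ns/day) by node count, copied from the slides.
cpu_only = {1: 2.77, 2: 5.22, 4: 10.27, 8: 16.85}
with_2x_k80 = {1: 16.99, 2: 24.31, 4: 23.52, 8: 27.74}


def parallel_efficiency(rates_by_nodes):
    """Efficiency = rate(N) / (N * rate(1)), rounded to two decimals."""
    base = rates_by_nodes[1]
    return {
        nodes: round(rate / (nodes * base), 2)
        for nodes, rate in sorted(rates_by_nodes.items())
    }


cpu_eff = parallel_efficiency(cpu_only)
gpu_eff = parallel_efficiency(with_2x_k80)
for nodes in cpu_eff:
    print(f"{nodes} node(s): CPU-only {cpu_eff[nodes]:.2f}, "
          f"2x K80 per node {gpu_eff[nodes]:.2f}")
```

For this roughly 92,000-atom system the CPU-only runs keep an efficiency of 0.76 at 8 nodes, while the 2x K80 runs fall to 0.20. That is consistent with a single GPU node already being fast enough that there is little work left to distribute.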

slide-149
SLIDE 149

149

NAMD 2.11 is up to 1.8x faster

F1-ATPase (327,506 atoms), simulated time (ns/day). Speedup of NAMD 2.11 over NAMD 2.10:

  • 1 Node: 1.1X
  • 2 Nodes: 1.8X
  • 4 Nodes: 1.4X

Both NAMD 2.10 and NAMD 2.11 were run on nodes containing Dual Intel E5-2697 v2@2.7GHz (IvyBridge) CPUs + 2 Tesla K80 (autoboost) GPUs.

slide-150
SLIDE 150

150

NAMD 2.11 F1-ATPase on 1 and 2 nodes

Running NAMD version 2.11.
The blue nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs.
The green nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs.

F1-ATPase (327,506 atoms), simulated time in ns/day (speedup vs. CPU-only at the same node count):

  • 1 Node: 0.94
  • 1 Node + 1x K80: 3.87 (4.1X)
  • 1 Node + 2x K80: 6.11 (6.5X)
  • 2 Nodes: 1.86
  • 2 Nodes + 1x K80: 7.23 (3.9X)
  • 2 Nodes + 2x K80: 10.58 (5.7X)

slide-151
SLIDE 151

151

NAMD 2.11 F1-ATPase on 4 and 8 nodes

Running NAMD version 2.11.
The blue nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs.
The green nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs.

F1-ATPase (327,506 atoms), simulated time in ns/day (speedup vs. CPU-only at the same node count):

  • 4 Nodes: 3.63
  • 4 Nodes + 1x K80: 11.66 (3.2X)
  • 4 Nodes + 2x K80: 12.62 (3.5X)
  • 8 Nodes: 6.88
  • 8 Nodes + 1x K80: 14.22 (2.1X)
  • 8 Nodes + 2x K80: 15.74 (2.3X)

slide-152
SLIDE 152

152

NAMD 2.11 is up to 1.5x faster

STMV (1,066,628 atoms), simulated time (ns/day). Speedup of NAMD 2.11 over NAMD 2.10:

  • 1 Node: 1.5X
  • 2 Nodes: 1.1X
  • 4 Nodes: 1.5X

Both NAMD 2.10 and NAMD 2.11 were run on nodes containing Dual Intel E5-2697 v2@2.7GHz (IvyBridge) CPUs + 2 Tesla K80 (autoboost) GPUs.

slide-153
SLIDE 153

153

NAMD 2.11 STMV on 1 and 2 nodes

Running NAMD version 2.11.
The blue nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs.
The green nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs.

STMV (1,066,628 atoms), simulated time in ns/day (speedup vs. CPU-only at the same node count):

  • 1 Node: 0.23
  • 1 Node + 1x K80: 1.03 (4.5X)
  • 1 Node + 2x K80: 1.75 (7.6X)
  • 2 Nodes: 0.46
  • 2 Nodes + 1x K80: 1.98 (4.3X)
  • 2 Nodes + 2x K80: 3.27 (7.1X)

slide-154
SLIDE 154

154

NAMD 2.11 STMV on 4 and 8 nodes

Running NAMD version 2.11.
The blue nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs.
The green nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs.

STMV (1,066,628 atoms), simulated time in ns/day (speedup vs. CPU-only at the same node count):

  • 4 Nodes: 0.90
  • 4 Nodes + 1x K80: 3.61 (4.0X)
  • 4 Nodes + 2x K80: 4.54 (5.0X)
  • 8 Nodes: 1.74
  • 8 Nodes + 1x K80: 5.86 (3.4X)
  • 8 Nodes + 2x K80: 6.24 (3.6X)

slide-155
SLIDE 155

155

Benefits of MD GPU-Accelerated Computing

  • 3x-8x faster than CPU-only systems across all tests (on average)
  • Most major compute-intensive aspects of classical MD ported
  • Large performance boost with marginal price increase
  • Energy usage cut by more than half
  • GPUs scale well within a node and/or over multiple nodes
  • K80 GPU is our fastest and lowest-power high-performance GPU yet

Try GPU accelerated MD apps for free – www.nvidia.com/GPUTestDrive

Why wouldn’t you want to turbocharge your research?

slide-156
SLIDE 156
  • Dec. 19, 2016

Molecular Dynamics (MD) on GPUs

slide-157
SLIDE 157

157

GPU-Accelerated Quantum Chemistry Apps

Abinit, ACES III, ADF, BigDFT, CP2K, GAMESS-US, Gaussian, GPAW, LATTE, LSDalton, MOLCAS, Mopac2012, NWChem, Octopus, ONETEP, Petot, Q-Chem, QMCPACK, Quantum Espresso, Quantum SuperCharger Library, RMG, TeraChem, UNM, VASP, WL-LSMS

Green lettering indicates performance slides included.

GPU performance compared against dual-socket, multi-core x86 CPUs.