October 2017 and RELION too Accelerating Discoveries Using a - - PowerPoint PPT Presentation


slide-1
SLIDE 1

October 2017 and RELION too

Molecular Dynamics (MD) on GPUs

slide-2
SLIDE 2

Accelerating Discoveries

Using a supercomputer powered by the Tesla Platform with over 3,000 Tesla accelerators, University of Illinois scientists performed the first all-atom simulation of the HIV virus and discovered the chemical structure of its capsid, “the perfect target for fighting the infection.” Without GPUs, the supercomputer would need to be 5x larger for similar performance.

slide-3
SLIDE 3

Overview of Life & Material Accelerated Apps

MD: All key codes are GPU-accelerated

  • Great multi-GPU performance
  • Focus on dense (up to 16) GPU nodes and/or a large number of GPU nodes

ACEMD*, AMBER (PMEMD)*, BAND, CHARMM, DESMOND, ESPResSo, Folding@Home, GPUgrid.net, GROMACS, HALMD, HOOMD-Blue*, LAMMPS, Lattice Microbes*, mdcore, MELD, miniMD, NAMD, OpenMM, PolyFTS, SOP-GPU* & more

QC: All key codes are ported or optimizing

  • Focus on using GPU-accelerated math libraries, OpenACC directives

GPU-accelerated and available today:

ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS, GAMESS-UK, GPAW, LATTE, LSDalton, LSMS, MOLCAS, MOPAC2012, NWChem, OCTOPUS*, PEtot, QUICK, Q-Chem, QMCPack, Quantum ESPRESSO/PWscf, TeraChem*

Active GPU acceleration projects:

CASTEP, GAMESS, Gaussian, ONETEP, Quantum Supercharger Library*, VASP & more

* = application where >90% of the workload is on GPU

slide-4
SLIDE 4

MD vs. QC on GPUs

“Classical” Molecular Dynamics

  • Simulates positions of atoms over time; chemical-biological or chemical-material behaviors
  • Forces calculated from simple empirical formulas (bond rearrangement generally forbidden)
  • Up to millions of atoms
  • Solvent included without difficulty
  • Single precision dominated
  • Uses cuBLAS, cuFFT, CUDA
  • GeForce (Workstations), Tesla (Servers)
  • ECC off

Quantum Chemistry (MO, PW, DFT, Semi-Emp)

  • Calculates electronic properties; ground state, excited states, spectral properties, making/breaking bonds, physical properties
  • Forces derived from electron wave function (bond rearrangement OK, e.g., bond energies)
  • Up to a few thousand atoms
  • Generally in a vacuum but if needed, solvent treated classically (QM/MM) or using implicit methods
  • Double precision is important
  • Uses cuBLAS, cuFFT, OpenACC
  • Tesla recommended
  • ECC on

slide-5
SLIDE 5

GPU-Accelerated Molecular Dynamics Apps

ACEMD, AMBER, CHARMM, DESMOND, ESPResSo, Folding@Home, GENESIS, GPUGrid.net, GROMACS, HALMD, HOOMD-Blue, HTMD, LAMMPS, mdcore, MELD, NAMD, OpenMM, PolyFTS

Green lettering indicates performance slides included.

GPU performance is compared against dual-socket multi-core x86 CPUs.

slide-6
SLIDE 6

Benefits of MD GPU-Accelerated Computing

  • 3x-8x faster than CPU-only systems across all tests (on average)
  • Most major compute-intensive aspects of classical MD ported
  • Large performance boost while saving “Big Money” on CPUs and networks
  • Energy usage cut by more than half
  • GPUs scale well within a node and/or over multiple nodes
  • P100 GPU is our fastest and lowest-power high-performance GPU yet

Try GPU-accelerated MD apps for free: www.nvidia.com/GPUTestDrive

Why wouldn’t you want to turbocharge your research?

slide-7
SLIDE 7

August 2017

RELION 2.0.3

slide-8
SLIDE 8

Plasmodium ribosome on P100s PCIe

Running RELION version 2.0.3 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs

Data Citation: http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2017/03/14/application-performance-on-p100-pcie-gpus

Performance (1/Minutes):

  • 1 Broadwell node: 0.0003
  • 1 node + 1x P100 PCIe (16GB) per node: 0.0027
  • 1 node + 2x P100 PCIe (16GB) per node: 0.0046
  • 1 node + 4x P100 PCIe (16GB) per node: 0.0070
  • 2 nodes + 8x P100 PCIe (16GB): 0.0101
  • 3 nodes + 12x P100 PCIe (16GB): 0.0112
  • 4 nodes + 16x P100 PCIe (16GB): 0.0120
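
Because this slide reports throughput in 1/Minutes, the wall-clock time per run is the reciprocal of each value. The following is a minimal Python sketch of that conversion. It assumes the seven values on the slide map, in order, onto the seven configuration labels; the names `relion_rates` and `summarize` are illustrative and not part of RELION.

```python
# Throughput values (1/minutes) as read from the RELION slide.
# Assumption: values are listed in the same order as the configuration labels.
relion_rates = {
    "1 Broadwell node": 0.0003,
    "1 node + 1x P100 PCIe": 0.0027,
    "1 node + 2x P100 PCIe": 0.0046,
    "1 node + 4x P100 PCIe": 0.0070,
    "2 nodes + 8x P100 PCIe": 0.0101,
    "3 nodes + 12x P100 PCIe": 0.0112,
    "4 nodes + 16x P100 PCIe": 0.0120,
}


def summarize(rates, baseline="1 Broadwell node"):
    """Return (label, minutes per run, speedup over baseline) for each entry."""
    base_rate = rates[baseline]
    return [(label, 1.0 / rate, rate / base_rate) for label, rate in rates.items()]


if __name__ == "__main__":
    for label, minutes, speedup in summarize(relion_rates):
        print(f"{label:<26} {minutes:8.1f} min  {speedup:5.1f}x")
```

Read this way, the CPU-only node needs roughly 3,300 minutes per run, while the 16-GPU configuration is about 40 times faster.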

slide-9
SLIDE 9

ACEMD

slide-10
SLIDE 10

ACEMD: Extremely efficient and robust MD software built on GPUs

610 ns/day on 1 GPU for DHFR (23K atoms)

  • M. Harvey, G. Giupponi and G. de Fabritiis, ACEMD: Accelerated molecular dynamics simulations in the microseconds timescale, J. Chem. Theory Comput. 5, 1632 (2009)

slide-11
SLIDE 11
  • Standardised and easy to use: ACEMD reads CHARMM/NAMD and AMBER input files and uses similar syntax to other MD software.
  • Fully featured: NVT, NPT, PME, TCL, PLUMED [1].
  • Robust: ACEMD is a proven computational engine and is used in one of the largest distributed projects worldwide: GPUGRID.
  • Compatible: ACEMD works with CUDA and OpenCL, the new standard framework for parallel and high-performance computing.
  • Validated: ACEMD is used in reputable academic and industrial institutions. Results describing its applications have appeared in peer-reviewed journals of high impact such as Nature Chemistry, PNAS, Scientific Reports, PLoS and JACS [2].

[1] M. J. Harvey and G. de Fabritiis, An implementation of the smooth particle-mesh Ewald (PME) method on GPU hardware, J. Chem. Theory Comput., 5, 2371-2377 (2009)
[2] For a list of selected references see http://www.acellera.com/science
slide-12
SLIDE 12

October 2017

AMBER 16.8

slide-13
SLIDE 13

PME-Cellulose_NPT on V100s PCIe

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 1.94
  • 1 node + 2x V100 PCIe per node (16GB): 47.67 (24.6X)

slide-14
SLIDE 14

PME-Cellulose_NPT on V100s SXM2

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 1.94
  • 1 node + 2x V100 SXM2 per node (16GB): 54.74 (28.2X)
  • 1 node + 4x V100 SXM2 per node (16GB): 55.52 (28.6X)
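
The speedup figures on these benchmark slides are the ratio of GPU-node throughput to CPU-only throughput, both in ns/day. The short Python sketch below reproduces the two figures on this slide and adds a simple scaling-efficiency measure for going from 2 to 4 GPUs. The function names are illustrative, and the efficiency definition (observed gain divided by ideal linear gain) is an assumption of this sketch, not something the slides define.

```python
def speedup(gpu_ns_per_day, cpu_ns_per_day):
    """Throughput ratio of a GPU configuration over the CPU-only baseline."""
    return gpu_ns_per_day / cpu_ns_per_day


def scaling_efficiency(perf_small, gpus_small, perf_large, gpus_large):
    """Observed throughput gain divided by the ideal (linear) gain."""
    return (perf_large / perf_small) / (gpus_large / gpus_small)


# Values from the PME-Cellulose_NPT on V100s SXM2 slide (ns/day).
cpu_only = 1.94
two_gpus = 54.74
four_gpus = 55.52

print(round(speedup(two_gpus, cpu_only), 1))   # 28.2
print(round(speedup(four_gpus, cpu_only), 1))  # 28.6
print(round(scaling_efficiency(two_gpus, 2, four_gpus, 4), 2))
```

For this benchmark the 4-GPU result is only marginally above the 2-GPU one, so the efficiency comes out near one half.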

slide-15
SLIDE 15

PME-Cellulose_NVE on V100s PCIe

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 1.96
  • 1 node + 2x V100 PCIe per node (16GB): 54.08 (27.6X)

slide-16
SLIDE 16

PME-Cellulose_NVE on V100s SXM2

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 1.96
  • 1 node + 2x V100 SXM2 per node (16GB): 63.04 (32.2X)
  • 1 node + 4x V100 SXM2 per node (16GB): 65.02 (33.2X)

slide-17
SLIDE 17

PME-FactorIX_NPT on V100s PCIe

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 9.33
  • 1 node + 2x V100 PCIe per node (16GB): 193.16 (20.7X)

slide-18
SLIDE 18

PME-FactorIX_NPT on V100s SXM2

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 9.33
  • 1 node + 2x V100 SXM2 per node (16GB): 217.95 (23.4X)
  • 1 node + 4x V100 SXM2 per node (16GB): 224.23 (24.0X)

slide-19
SLIDE 19

PME-FactorIX_NVE on V100s PCIe

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 9.61
  • 1 node + 2x V100 PCIe per node (16GB): 217.95 (22.7X)

slide-20
SLIDE 20

PME-FactorIX_NVE on V100s SXM2

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 9.61
  • 1 node + 2x V100 SXM2 per node (16GB): 249.63 (26.0X)
  • 1 node + 4x V100 SXM2 per node (16GB): 261.19 (27.2X)

slide-21
SLIDE 21

PME-JAC_NPT on V100s PCIe

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 34.35
  • 1 node + 2x V100 PCIe per node (16GB): 439.87 (12.8X)

slide-22
SLIDE 22

PME-JAC_NPT on V100s SXM2

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 34.35
  • 1 node + 2x V100 SXM2 per node (16GB): 481.75 (14.0X)
  • 1 node + 4x V100 SXM2 per node (16GB): 515.36 (15.0X)

slide-23
SLIDE 23

PME-JAC_NVE on V100s PCIe

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 36.53
  • 1 node + 2x V100 PCIe per node (16GB): 490.77 (13.4X)

slide-24
SLIDE 24

PME-JAC_NVE on V100s SXM2

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 36.53
  • 1 node + 2x V100 SXM2 per node (16GB): 539.78 (14.8X)
  • 1 node + 4x V100 SXM2 per node (16GB): 583.33 (16.0X)

slide-25
SLIDE 25

PME-JAC_NPT_4fs on V100s PCIe

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 65.74
  • 1 node + 2x V100 PCIe per node (16GB): 863.80 (13.1X)
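
The "_4fs" benchmarks report roughly twice the ns/day of their plain counterparts (863.80 here versus 439.87 on slide 21), which is what a doubled timestep gives when the number of steps per second stays about the same. The Python sketch below converts ns/day into steps per second to make that visible. It assumes, without confirmation from the slides, that the plain PME-JAC_NPT benchmark uses a 2 fs timestep; the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400
FS_PER_NS = 1_000_000


def steps_per_second(ns_per_day, timestep_fs):
    """Convert simulated ns/day into MD integration steps per wall-clock second."""
    return ns_per_day * FS_PER_NS / timestep_fs / SECONDS_PER_DAY


def ns_per_day(steps_per_sec, timestep_fs):
    """Inverse conversion: steps per second back to simulated ns/day."""
    return steps_per_sec * timestep_fs * SECONDS_PER_DAY / FS_PER_NS


# 2x V100 PCIe results from slides 21 and 25.
# Assumption: 2 fs timestep for the plain run, 4 fs for the "_4fs" run.
plain = steps_per_second(439.87, timestep_fs=2.0)
four_fs = steps_per_second(863.80, timestep_fs=4.0)

print(f"PME-JAC_NPT     : {plain:8.1f} steps/s")
print(f"PME-JAC_NPT_4fs : {four_fs:8.1f} steps/s")
```

Under that assumption both runs execute about 2,500 steps per second, so the higher ns/day comes almost entirely from the longer timestep.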

slide-26
SLIDE 26

PME-JAC_NPT_4fs on V100s SXM2

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 65.74
  • 1 node + 2x V100 SXM2 per node (16GB): 946.57 (14.4X)
  • 1 node + 4x V100 SXM2 per node (16GB): 1006.32 (15.3X)

slide-27
SLIDE 27

PME-JAC_NVE_4fs on V100s PCIe

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 67.10
  • 1 node + 2x V100 PCIe per node (16GB): 940.32 (14.0X)

slide-28
SLIDE 28

PME-JAC_NVE_4fs on V100s SXM2

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 67.10
  • 1 node + 2x V100 SXM2 per node (16GB): 1027.44 (15.3X)
  • 1 node + 4x V100 SXM2 per node (16GB): 1123.40 (16.7X)

slide-29
SLIDE 29

PME-STMV_NPT_4fs on V100s PCIe

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 1.06
  • 1 node + 2x V100 PCIe per node (16GB): 33.21 (31.3X)

slide-30
SLIDE 30

PME-STMV_NPT_4fs on V100s SXM2

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 1.06
  • 1 node + 2x V100 SXM2 per node (16GB): 37.24 (35.1X)

slide-31
SLIDE 31

GB-Myoglobin on V100s PCIe

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 22.30
  • 1 node + 2x V100 PCIe per node (16GB): 699.21 (31.4X)

slide-32
SLIDE 32

GB-Myoglobin on V100s SXM2

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 22.30
  • 1 node + 2x V100 SXM2 per node (16GB): 750.76 (33.7X)

slide-33
SLIDE 33

GB-Nucleosome on V100s PCIe

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 0.31
  • 1 node + 2x V100 PCIe per node (16GB): 49.14 (158.5X)
  • 1 node + 4x V100 PCIe per node (16GB): 78.39 (252.9X)

slide-34
SLIDE 34

GB-Nucleosome on V100s SXM2

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 0.31
  • 1 node + 2x V100 SXM2 per node (16GB): 52.89 (170.6X)
  • 1 node + 4x V100 SXM2 per node (16GB): 92.46 (298.3X)

slide-35
SLIDE 35

Rubisco on V100s PCIe

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 0.01
  • 1 node + 2x V100 PCIe per node (16GB): 2.79 (279.0X)
  • 1 node + 4x V100 PCIe per node (16GB): 5.22 (522.0X)
  • 1 node + 8x V100 PCIe per node (16GB): 6.78 (678.0X)

slide-36
SLIDE 36

Rubisco on V100s SXM2

(Untuned on Volta) Running AMBER version 16.8 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 0.01
  • 1 node + 2x V100 SXM2 per node (16GB): 3.00 (300.0X)
  • 1 node + 4x V100 SXM2 per node (16GB): 5.96 (596.0X)
  • 1 node + 8x V100 SXM2 per node (16GB): 7.00 (700.0X)

slide-37
SLIDE 37

February 2017

AMBER 16

slide-38
SLIDE 38

PME-Cellulose_NPT on P100s PCIe

Running AMBER version 16.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 2.35
  • 1 node + 1x P100 PCIe (16GB) per node: 21.85 (9.3X)
  • 1 node + 2x P100 PCIe (16GB) per node: 30.00 (12.8X)

slide-39
SLIDE 39

PME-Cellulose_NPT on P100s SXM2

Running AMBER version 16.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 2.35
  • 1 node + 1x P100 SXM2 per node: 23.37 (9.9X)
  • 1 node + 2x P100 SXM2 per node: 32.22 (13.7X)
  • 1 node + 4x P100 SXM2 per node: 36.65 (15.6X)

slide-40
SLIDE 40

PME-Cellulose_NVE on P100s PCIe

Running AMBER version 16.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 2.47
  • 1 node + 1x P100 PCIe (16GB) per node: 23.34 (9.4X)
  • 1 node + 2x P100 PCIe (16GB) per node: 32.55 (13.2X)

slide-41
SLIDE 41

PME-Cellulose_NVE on P100s SXM2

Running AMBER version 16.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 2.47
  • 1 node + 1x P100 SXM2 per node: 24.94 (10.1X)
  • 1 node + 2x P100 SXM2 per node: 35.16 (14.2X)
  • 1 node + 4x P100 SXM2 per node: 40.88 (16.6X)

slide-42
SLIDE 42

PME-FactorIX_NPT on P100s PCIe

Running AMBER version 16.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 11.43
  • 1 node + 1x P100 PCIe (16GB) per node: 98.77 (8.6X)
  • 1 node + 2x P100 PCIe (16GB) per node: 132.86 (11.6X)

slide-43
SLIDE 43

PME-FactorIX_NPT on P100s SXM2

Running AMBER version 16.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 11.43
  • 1 node + 1x P100 SXM2 per node: 106.25 (9.3X)
  • 1 node + 2x P100 SXM2 per node: 144.11 (12.6X)
  • 1 node + 4x P100 SXM2 per node: 159.80 (14.0X)

slide-44
SLIDE 44

PME-FactorIX_NVE on P100s PCIe

Running AMBER version 16.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 11.98
  • 1 node + 1x P100 PCIe (16GB) per node: 105.86 (8.8X)
  • 1 node + 2x P100 PCIe (16GB) per node: 145.83 (12.2X)

slide-45
SLIDE 45

PME-FactorIX_NVE on P100s SXM2

Running AMBER version 16.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 11.98
  • 1 node + 1x P100 SXM2 per node: 114.88 (9.6X)
  • 1 node + 2x P100 SXM2 per node: 159.24 (13.3X)
  • 1 node + 4x P100 SXM2 per node: 178.02 (14.9X)

slide-46
SLIDE 46

PME-JAC_NPT on P100s PCIe

Running AMBER version 16.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 45.89
  • 1 node + 1x P100 PCIe (16GB) per node: 283.60 (6.2X)
  • 1 node + 2x P100 PCIe (16GB) per node: 327.69 (7.1X)

slide-47
SLIDE 47

PME-JAC_NPT on P100s SXM2

Running AMBER version 16.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 45.89
  • 1 node + 1x P100 SXM2 per node: 310.52 (6.8X)
  • 1 node + 2x P100 SXM2 per node: 360.64 (7.9X)
  • 1 node + 4x P100 SXM2 per node: 423.09 (9.2X)

slide-48
SLIDE 48

PME-JAC_NVE on P100s PCIe

Running AMBER version 16.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 47.90
  • 1 node + 1x P100 PCIe (16GB) per node: 308.46 (6.4X)
  • 1 node + 2x P100 PCIe (16GB) per node: 363.79 (7.6X)

slide-49
SLIDE 49

PME-JAC_NVE on P100s SXM2

Running AMBER version 16.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 47.90
  • 1 node + 1x P100 SXM2 per node: 339.81 (7.1X)
  • 1 node + 2x P100 SXM2 per node: 402.18 (8.4X)
  • 1 node + 4x P100 SXM2 per node: 473.10 (9.9X)

slide-50
SLIDE 50

GB-Myoglobin on P100s PCIe

Running AMBER version 16.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 28.86
  • 1 node + 1x P100 PCIe (16GB) per node: 483.37 (16.7X)
  • 1 node + 4x P100 PCIe (16GB) per node: 561.94 (19.5X)

slide-51
SLIDE 51

GB-Myoglobin on P100s SXM2

Running AMBER version 16.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 28.86
  • 1 node + 1x P100 SXM2 per node: 534.28 (18.5X)
  • 1 node + 4x P100 SXM2 per node: 639.37 (22.2X)

slide-52
SLIDE 52

GB-Nucleosome on P100s PCIe

Running AMBER version 16.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 0.40
  • 1 node + 1x P100 PCIe (16GB) per node: 11.91 (29.8X)
  • 1 node + 2x P100 PCIe (16GB) per node: 22.77 (56.9X)
  • 1 node + 4x P100 PCIe (16GB) per node: 39.91 (99.8X)
  • 1 node + 8x P100 PCIe (16GB) per node: 45.92 (114.8X)

slide-53
SLIDE 53

GB-Nucleosome on P100s SXM2

Running AMBER version 16.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 0.40
  • 1 node + 1x P100 SXM2 per node: 13.36 (33.4X)
  • 1 node + 2x P100 SXM2 per node: 25.53 (63.8X)
  • 1 node + 4x P100 SXM2 per node: 46.29 (115.7X)
  • 1 node + 8x P100 SXM2 per node: 48.29 (120.7X)

slide-54
SLIDE 54

Rubisco-75K on P100s PCIe

Running AMBER version 16.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 0.01
  • 1 node + 1x P100 PCIe (16GB) per node: 0.71 (71.0X)
  • 1 node + 2x P100 PCIe (16GB) per node: 1.40 (140.0X)
  • 1 node + 4x P100 PCIe (16GB) per node: 2.69 (269.0X)
  • 1 node + 8x P100 PCIe (16GB) per node: 4.20 (420.0X)

slide-55
SLIDE 55

Rubisco-75K on P100s SXM2

Running AMBER version 16.3 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 0.01
  • 1 node + 1x P100 SXM2 per node: 0.80 (80.0X)
  • 1 node + 2x P100 SXM2 per node: 1.57 (157.0X)
  • 1 node + 4x P100 SXM2 per node: 3.06 (306.0X)
  • 1 node + 8x P100 SXM2 per node: 4.46 (446.0X)

slide-56
SLIDE 56

Recommended GPU Node Configuration for AMBER Computational Chemistry

Workstation or Single Node Configuration

  • Number of CPU sockets: 2
  • Cores per CPU socket: 6+ (1 CPU core drives 1 GPU)
  • CPU speed (GHz): 2.66+
  • System memory per node (GB): 16
  • GPUs: P100, V100
  • Number of GPUs per CPU socket: 1-4
  • GPU memory preference (GB): 6
  • GPU to CPU connection: PCIe 3.0 16x or higher
  • Server storage: 2 TB
  • Network configuration: InfiniBand QDR or better

Scale to multiple nodes with same single node configuration
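
The recommended configuration on this slide can be expressed as a small checklist. The Python sketch below is one way to do that and is not an NVIDIA or AMBER tool: `NodeSpec` and `check_node` are hypothetical names, and treating the memory figures as minimums is an assumption, since the slide lists them without a "+" sign.

```python
from dataclasses import dataclass

# GPU models named on the slide.
RECOMMENDED_GPUS = {"P100", "V100"}


@dataclass
class NodeSpec:
    cpu_sockets: int
    cores_per_socket: int
    cpu_ghz: float
    system_memory_gb: int
    gpu_model: str
    gpus_per_socket: int
    gpu_memory_gb: int


def check_node(node):
    """Return a list of deviations from the slide's recommended configuration."""
    issues = []
    if node.cpu_sockets != 2:
        issues.append("expected 2 CPU sockets")
    if node.cores_per_socket < 6:
        issues.append("fewer than 6 cores per CPU socket")
    if node.cores_per_socket < node.gpus_per_socket:
        issues.append("fewer CPU cores than GPUs per socket (1 core drives 1 GPU)")
    if node.cpu_ghz < 2.66:
        issues.append("CPU speed below 2.66 GHz")
    if node.system_memory_gb < 16:  # assumption: 16 GB is a minimum
        issues.append("less than 16 GB system memory")
    if node.gpu_model not in RECOMMENDED_GPUS:
        issues.append("GPU model is not P100 or V100")
    if not 1 <= node.gpus_per_socket <= 4:
        issues.append("GPUs per CPU socket outside 1-4")
    if node.gpu_memory_gb < 6:  # assumption: 6 GB is a minimum
        issues.append("less than 6 GB GPU memory")
    return issues


if __name__ == "__main__":
    node = NodeSpec(2, 8, 2.7, 64, "V100", 2, 16)
    print(check_node(node) or "node matches the recommended configuration")
```

Interconnect, storage, and network rows are left out of the sketch because they are harder to reduce to a single number.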

slide-57
SLIDE 57

July 2016

CHARMM DOMDEC-GUI

slide-58
SLIDE 58

CHARMM DOMDEC-GUI 465 K System Benchmark

Running CHARMM version c40a1 The blue node contains Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs

Benchmarks were run on the standard CHARMM c40a1 version by the Yang group (FSU), which is responsible for any benchmarking errors.

465 K System (Her1_HER1_membrane), performance in ns/day (higher is better), with speedup over the CPU-only node:

  • 1 Haswell node: 0.36
  • 1 node + 1x K80 per node: 2.15 (6.0X)

slide-59
SLIDE 59

CHARMM DOMDEC-GUI 534 K System Benchmark

Running CHARMM version c40a1 The blue node contains Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs

Benchmarks were run on the standard CHARMM c40a1 version by the Yang group (FSU), which is responsible for any benchmarking errors.

534 K System (POPC_PSPC_CHL1mixture), performance in ns/day (higher is better), with speedup over the CPU-only node:

  • 1 Haswell node: 0.18
  • 1 node + 1x K80 per node: 1.43 (8.0X)

slide-60
SLIDE 60

CHARMM DOMDEC-GUI 20 K System Benchmark

Running CHARMM version c40a1 The blue node contains Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs + Tesla M40 GPUs

Benchmarks were run on the standard CHARMM c40a1 version by the Yang group (FSU), which is responsible for any benchmarking errors.

20 K System (Crambin), performance in ns/day (higher is better), with speedup over the CPU-only node:

  • 1 Haswell node: 16.00
  • 1 node + 1x M40 per node: 59.68 (3.7X)

slide-61
SLIDE 61

CHARMM DOMDEC-GUI 61 K System Benchmark

Running CHARMM version c40a1 The blue node contains Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs + Tesla M40 GPUs

Benchmarks were run on the standard CHARMM c40a1 version by the Yang group (FSU), which is responsible for any benchmarking errors.

61 K System (GlnBP), performance in ns/day (higher is better), with speedup over the CPU-only node:

  • 1 Haswell node: 3.90
  • 1 node + 1x M40 per node: 25.08 (6.4X)

slide-62
SLIDE 62

CHARMM DOMDEC-GUI 465 K System Benchmark

Running CHARMM version c40a1 The blue node contains Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs + Tesla M40 GPUs

Benchmarks were run on the standard CHARMM c40a1 version by the Yang group (FSU), which is responsible for any benchmarking errors.

465 K System (Her1_HER1_membrane), performance in ns/day (higher is better), with speedup over the CPU-only node:

  • 1 Haswell node: 0.36
  • 1 node + 1x M40 per node: 2.27 (6.3X)

slide-63
SLIDE 63

October 2017

GROMACS 2016.4

slide-64
SLIDE 64

Water 1.5M on V100s PCIe

(Untuned on Volta) Running GROMACS version 2016.4 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 2.28
  • 1 node + 2x V100 PCIe per node (16GB): 5.34 (2.3X)
  • 1 node + 2x V100 PCIe per node (16GB): 7.30 (3.2X)

slide-65
SLIDE 65

Water 3M on V100s PCIe

(Untuned on Volta) Running GROMACS version 2016.4 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 1.12
  • 1 node + 2x V100 PCIe per node (16GB): 2.53 (2.3X)
  • 1 node + 2x V100 PCIe per node (16GB): 3.85 (3.4X)

slide-66
SLIDE 66

October 2016

GROMACS 2016

slide-67
SLIDE 67

Water 1.5M on P100 PCIes

Running GROMACS version 2016 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs

NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.

Performance (ns/day), with speedup over the CPU-only node:

  • 1 Broadwell node: 2.79
  • 1 node + 2x P100 PCIe (16GB) per node: 6.34 (2.3X)
  • 1 node + 4x P100 PCIe (16GB) per node: 7.11 (2.5X)

slide-68
SLIDE 68

Water 3M on P100 PCIes

Running GROMACS version 2016 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs

Performance (ns/day):
  • 1 Broadwell node: 1.32
  • 1 node + 2x P100 PCIe (16GB) per node: 3.16 (2.4X)
  • 1 node + 4x P100 PCIe (16GB) per node: 3.43 (2.6X)

slide-69
SLIDE 69

February 2017

GROMACS 5.1.2

slide-70
SLIDE 70

70

Water 1.5M on P100s PCIe

Performance (ns/day):
  • 1 Broadwell node: 3.04
  • 1 node + 1x P100 PCIe (16GB) per node: 4.39 (1.4X)
  • 1 node + 2x P100 PCIe (16GB) per node: 6.96 (2.3X)
  • 1 node + 4x P100 PCIe (16GB) per node: 7.21 (2.4X)

Running GROMACS version 5.1.2 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)
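The "X" factors on these charts are the GPU-node throughput divided by the CPU-only node throughput. A minimal sketch of that arithmetic, using the ns/day values from the Water 1.5M chart above; the function and variable names are illustrative, not part of any benchmark tooling:

```python
def speedup(baseline: float, accelerated: float) -> float:
    """Return accelerated throughput divided by baseline throughput."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return accelerated / baseline


# ns/day values read from the Water 1.5M chart (GROMACS 5.1.2, P100 PCIe).
cpu_only = 3.04
gpu_runs = {"1x P100 PCIe": 4.39, "2x P100 PCIe": 6.96, "4x P100 PCIe": 7.21}

# Round to one decimal, matching the precision printed on the slides.
speedups = {label: round(speedup(cpu_only, value), 1) for label, value in gpu_runs.items()}

for label, factor in speedups.items():
    print(f"{label}: {factor:.1f}X")
```

The same ratio applies to the other charts, whatever the throughput unit, as long as both bars use the same unit.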

slide-71
SLIDE 71

71

Water 1.5M on P100s SXM2

Running GROMACS version 5.1.2 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Performance (ns/day):
  • 1 Broadwell node: 3.04
  • 1 node + 1x P100 SXM2 per node: 4.11 (1.4X)
  • 1 node + 2x P100 SXM2 per node: 6.70 (2.2X)
  • 1 node + 4x P100 SXM2 per node: 7.18 (2.4X)
  • 1 node + 8x P100 SXM2 per node: 7.88 (2.6X)

slide-72
SLIDE 72

72

Water 3M on P100s PCIe

Performance (ns/day):
  • 1 Broadwell node: 1.38
  • 1 node + 1x P100 PCIe (16GB) per node: 1.96 (1.4X)
  • 1 node + 2x P100 PCIe (16GB) per node: 3.43 (2.5X)
  • 1 node + 4x P100 PCIe (16GB) per node: 3.80 (2.8X)

Running GROMACS version 5.1.2 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

slide-73
SLIDE 73

73

Water 3M on P100s SXM2

Running GROMACS version 5.1.2 The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Performance (ns/day):
  • 1 Broadwell node: 1.38
  • 1 node + 1x P100 SXM2 per node: 1.84 (1.3X)
  • 1 node + 2x P100 SXM2 per node: 3.50 (2.5X)
  • 1 node + 4x P100 SXM2 per node: 3.82 (2.8X)

slide-74
SLIDE 74

74

Recommended GPU Node Configuration for GROMACS Computational Chemistry

Workstation or Single Node Configuration

  • # of CPU sockets: 2
  • Cores per CPU socket: 6+
  • CPU speed (GHz): 2.66+
  • System memory per socket (GB): 32
  • GPUs: Tesla P100, V100
  • # of GPUs per CPU socket: 1x
  • Kepler GPUs: need fast Sandy Bridge or Ivy Bridge, or high-end AMD Opterons
  • GPU memory preference (GB): 6
  • GPU to CPU connection: PCIe 3.0 or higher
  • Server storage: 500 GB or higher
  • Network configuration: Gemini, InfiniBand

slide-75
SLIDE 75

September 2017

HOOMD-Blue 2.1.6

slide-76
SLIDE 76

76

lj-liquid on V100s PS PCIe

(Untuned on Volta) Running HOOMD-Blue version 2.1.6 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) or V100 PS PCIe (16GB) GPUs

Performance (Average TPS):
  • 1 Broadwell node: 238.47
  • 1 node + 2x P100 PCIe per node (16GB): 2730.94
  • 1 node + 2x V100 PS PCIe per node (16GB): 3890.73

slide-77
SLIDE 77

77

microsphere on V100s PS PCIe

(Untuned on Volta) Running HOOMD-Blue version 2.1.6 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) or V100 PS PCIe (16GB) GPUs

Performance (Average TPS):
  • 1 Broadwell node: 9.79
  • 1 node + 2x P100 PCIe per node (16GB): 182.20
  • 1 node + 4x P100 PCIe per node (16GB): 262.79
  • 1 node + 8x P100 PCIe per node (16GB): 360.26
  • 1 node + 2x V100 PS PCIe per node (16GB): 298.15
  • 1 node + 4x V100 PS PCIe per node (16GB): 371.06
  • 1 node + 8x V100 PS PCIe per node (16GB): 466.88

slide-78
SLIDE 78

78

quasicrystal on V100s PS PCIe

(Untuned on Volta) Running HOOMD-Blue version 2.1.6 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) or V100 PS PCIe (16GB) GPUs

Performance (Average TPS):
  • 1 Broadwell node: 52.82
  • 1 node + 2x P100 PCIe per node (16GB): 1184.14
  • 1 node + 4x P100 PCIe per node (16GB): 1819.76
  • 1 node + 2x V100 PS PCIe per node (16GB): 2371.16
  • 1 node + 4x V100 PS PCIe per node (16GB): 2530.74

slide-79
SLIDE 79

79

triblock-copolymer on V100s PS PCIe

(Untuned on Volta) Running HOOMD-Blue version 2.1.6 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) or V100 PS PCIe (16GB) GPUs

Performance (Average TPS):
  • 1 Broadwell node: 234.71
  • 1 node + 2x P100 PCIe per node (16GB): 1972.93
  • 1 node + 2x V100 PS PCIe per node (16GB): 2761.75

slide-80
SLIDE 80

80

dodecahedron on V100s PS PCIe

(Untuned on Volta) Running HOOMD-Blue version 2.1.6 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) or V100 PS PCIe (16GB) GPUs

Performance (Average TPS):
  • 1 Broadwell node: 25.84
  • 1 node + 2x P100 PCIe per node (16GB): 121.49
  • 1 node + 4x P100 PCIe per node (16GB): 196.18
  • 1 node + 8x P100 PCIe per node (16GB): 226.39
  • 1 node + 2x V100 PS PCIe per node (16GB): 172.28
  • 1 node + 4x V100 PS PCIe per node (16GB): 277.85
  • 1 node + 8x V100 PS PCIe per node (16GB): 293.25

slide-81
SLIDE 81

81

hexagon on V100s PS PCIe

(Untuned on Volta) Running HOOMD-Blue version 2.1.6 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) or V100 PS PCIe (16GB) GPUs

Performance (Average TPS):
  • 1 Broadwell node: 6.33
  • 1 node + 2x P100 PCIe per node (16GB): 30.30
  • 1 node + 4x P100 PCIe per node (16GB): 55.15
  • 1 node + 8x P100 PCIe per node (16GB): 102.16
  • 1 node + 2x V100 PS PCIe per node (16GB): 37.30
  • 1 node + 4x V100 PS PCIe per node (16GB): 69.55
  • 1 node + 8x V100 PS PCIe per node (16GB): 126.70
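These September charts put P100 and V100 bars side by side without printing a ratio. A small sketch of the generation-over-generation gain at equal GPU count, using the hexagon values above; the ratio is unitless, and the variable names are illustrative:

```python
# Throughput values read from the hexagon chart, keyed by GPUs per node.
p100_pcie = {2: 30.30, 4: 55.15, 8: 102.16}
v100_pcie = {2: 37.30, 4: 69.55, 8: 126.70}

# V100 throughput divided by P100 throughput at the same GPU count.
generation_gain = {n: v100_pcie[n] / p100_pcie[n] for n in sorted(p100_pcie)}

for n, gain in generation_gain.items():
    print(f"{n} GPUs: V100 delivers {gain:.2f}x the P100 throughput")
```

For this benchmark the gain stays in a narrow band across 2, 4, and 8 GPUs, so the V100 advantage here does not depend much on GPU count.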

slide-82
SLIDE 82

October 2017

HOOMD-Blue 2.1.6

slide-83
SLIDE 83

83

lj-liquid on V100s PCIe

(Untuned on Volta) Running HOOMD-Blue version 2.1.6 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

Performance (Average TPS):
  • 1 Broadwell node: 238.47
  • 1 node + 2x V100 PCIe per node (16GB): 3890.73 (16.3X)

slide-84
SLIDE 84

84

lj-liquid on V100s SXM2

(Untuned on Volta) Running HOOMD-Blue version 2.1.6 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs

Performance (Average TPS):
  • 1 Broadwell node: 238.47
  • 1 node + 2x V100 SXM2 per node (16GB): 4285.59 (18.0X)
  • 1 node + 4x V100 SXM2 per node (16GB): 4435.12 (18.6X)

slide-85
SLIDE 85

85

microsphere on V100 PCIe

(Untuned on Volta) Running HOOMD-Blue version 2.1.6 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

Performance (Average TPS):
  • 1 Broadwell node: 9.79
  • 1 node + 2x V100 PCIe per node (16GB): 298.15 (30.5X)
  • 1 node + 4x V100 PCIe per node (16GB): 371.06 (37.9X)
  • 1 node + 8x V100 PCIe per node (16GB): 466.88 (47.7X)

slide-86
SLIDE 86

86

microsphere on V100s SXM2

(Untuned on Volta) Running HOOMD-Blue version 2.1.6 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs

Performance (Average TPS):
  • 1 Broadwell node: 9.79
  • 1 node + 2x V100 SXM2 per node (16GB): 329.43 (33.6X)
  • 1 node + 4x V100 SXM2 per node (16GB): 506.09 (51.7X)
  • 1 node + 8x V100 SXM2 per node (16GB): 688.99 (70.4X)
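The speedup factors compare each GPU configuration with the CPU-only node, which hides how well throughput scales as GPUs are added. A minimal sketch of one common way to look at that, parallel efficiency relative to the smallest GPU count, using the microsphere SXM2 values above; this metric is an addition for illustration and does not appear on the slides:

```python
from typing import Dict


def scaling_efficiency(tps_by_gpus: Dict[int, float]) -> Dict[int, float]:
    """Efficiency vs. the smallest GPU count: (tps_n / tps_ref) / (n / n_ref)."""
    n_ref = min(tps_by_gpus)
    tps_ref = tps_by_gpus[n_ref]
    return {
        n: (tps / tps_ref) / (n / n_ref)
        for n, tps in sorted(tps_by_gpus.items())
    }


# Average TPS read from the microsphere chart (V100 SXM2), keyed by GPUs per node.
microsphere_sxm2 = {2: 329.43, 4: 506.09, 8: 688.99}
efficiency = scaling_efficiency(microsphere_sxm2)

for n, eff in efficiency.items():
    print(f"{n} GPUs: {eff:.0%} of ideal scaling from 2 GPUs")
```

An efficiency of 100% would mean throughput doubles each time the GPU count doubles; values below that show the diminishing return from adding GPUs to a single node.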

slide-87
SLIDE 87

87

quasicrystal on V100s PCIe

(Untuned on Volta) Running HOOMD-Blue version 2.1.6 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

Performance (Average TPS):
  • 1 Broadwell node: 52.82
  • 1 node + 2x V100 PCIe per node (16GB): 2371.16 (44.9X)
  • 1 node + 4x V100 PCIe per node (16GB): 2530.74 (47.9X)

slide-88
SLIDE 88

88

quasicrystal on V100s SXM2

(Untuned on Volta) Running HOOMD-Blue version 2.1.6 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs

Performance (Average TPS):
  • 1 Broadwell node: 52.82
  • 1 node + 2x V100 SXM2 per node (16GB): 2546.38 (48.2X)
  • 1 node + 4x V100 SXM2 per node (16GB): 3015.42 (57.1X)

slide-89
SLIDE 89

89

triblock-copolymer on V100s PCIe

(Untuned on Volta) Running HOOMD-Blue version 2.1.6 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

Performance (Average TPS):
  • 1 Broadwell node: 234.71
  • 1 node + 2x V100 PCIe per node (16GB): 2761.75 (11.8X)

slide-90
SLIDE 90

90

triblock-copolymer on V100s SXM2

(Untuned on Volta) Running HOOMD-Blue version 2.1.6 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs

Performance (Average TPS):
  • 1 Broadwell node: 234.71
  • 1 node + 2x V100 SXM2 per node (16GB): 2958.60 (12.6X)
  • 1 node + 4x V100 SXM2 per node (16GB): 3188.84 (13.6X)

slide-91
SLIDE 91

91

dodecahedron on V100s PCIe

(Untuned on Volta) Running HOOMD-Blue version 2.1.6 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

Performance (Average TPS):
  • 1 Broadwell node: 25.84
  • 1 node + 2x V100 PCIe per node (16GB): 172.28 (6.7X)
  • 1 node + 4x V100 PCIe per node (16GB): 277.85 (10.8X)
  • 1 node + 8x V100 PCIe per node (16GB): 293.25 (11.3X)

slide-92
SLIDE 92

92

dodecahedron on V100s SXM2

(Untuned on Volta) Running HOOMD-Blue version 2.1.6 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs

Performance (Average TPS):
  • 1 Broadwell node: 25.84
  • 1 node + 2x V100 SXM2 per node (16GB): 179.94 (7.0X)
  • 1 node + 4x V100 SXM2 per node (16GB): 309.65 (12.0X)
  • 1 node + 8x V100 SXM2 per node (16GB): 317.00 (12.3X)

slide-93
SLIDE 93

93

hexagon on V100s PCIe

(Untuned on Volta) Running HOOMD-Blue version 2.1.6 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

Performance (Average TPS):
  • 1 Broadwell node: 6.33
  • 1 node + 2x V100 PCIe per node (16GB): 37.30 (5.9X)
  • 1 node + 4x V100 PCIe per node (16GB): 69.55 (11.0X)
  • 1 node + 8x V100 PCIe per node (16GB): 126.70 (20.0X)

slide-94
SLIDE 94

94

hexagon on V100s SXM2

(Untuned on Volta) Running HOOMD-Blue version 2.1.6 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs

Performance (Average TPS):
  • 1 Broadwell node: 6.33
  • 1 node + 2x V100 SXM2 per node (16GB): 38.70 (6.1X)
  • 1 node + 4x V100 SXM2 per node (16GB): 69.08 (10.9X)
  • 1 node + 8x V100 SXM2 per node (16GB): 119.50 (18.9X)

slide-95
SLIDE 95

October 2017

LAMMPS 2017

slide-96
SLIDE 96

96

Atomic-Fluid Lennard-Jones 2.5 Cutoff on V100s PCIe

Performance (1/seconds), 2,048,000 atoms:
  • 1 Broadwell node: 0.25
  • 1 node + 2x V100 PCIe per node (16GB): 0.73 (3.0X)

(Untuned on Volta) Running LAMMPS version 2017. The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs. The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs.

slide-97
SLIDE 97

97

Atomic-Fluid Lennard-Jones 2.5 Cutoff on V100s SXM2

Performance (1/seconds), 2,048,000 atoms:
  • 1 Broadwell node: 0.25
  • 1 node + 2x V100 SXM2 per node (16GB): 0.82 (3.3X)

(Untuned on Volta) Running LAMMPS version 2017. The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs.

slide-98
SLIDE 98

98

Atomic-Fluid Lennard-Jones 5.0 Cutoff on V100s PCIe

Performance (1/seconds), 2,048,000 atoms:
  • 1 Broadwell node: 0.06
  • 1 node + 2x V100 PCIe per node (16GB): 0.45 (7.5X)
  • 1 node + 4x V100 PCIe per node (16GB): 0.47 (7.8X)
  • 1 node + 8x V100 PCIe per node (16GB): 0.60 (10.0X)

(Untuned on Volta) Running LAMMPS version 2017. The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs. The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs.

slide-99
SLIDE 99

99

Atomic-Fluid Lennard-Jones 5.0 Cutoff on V100s SXM2

Performance (1/seconds), 2,048,000 atoms:
  • 1 Broadwell node: 0.06
  • 1 node + 2x V100 SXM2 per node (16GB): 0.48 (8.0X)
  • 1 node + 4x V100 SXM2 per node (16GB): 0.55 (9.2X)
  • 1 node + 8x V100 SXM2 per node (16GB): 0.56 (9.3X)

(Untuned on Volta) Running LAMMPS version 2017. The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs.

slide-100
SLIDE 100

100

Coarse-grain Water on V100s PCIe

Performance (1/seconds), 2,048,000 atoms:
  • 1 Broadwell node: 0.003
  • 1 node + 2x V100 PCIe per node (16GB): 0.007 (2.3X)
  • 1 node + 4x V100 PCIe per node (16GB): 0.011 (3.7X)
  • 1 node + 8x V100 PCIe per node (16GB): 0.016 (5.3X)

(Untuned on Volta) Running LAMMPS version 2017. The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs. The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs.

slide-101
SLIDE 101

101

Coarse-grain Water on V100s SXM2

Performance (1/seconds), 2,048,000 atoms:
  • 1 Broadwell node: 0.003
  • 1 node + 2x V100 SXM2 per node (16GB): 0.009 (3.0X)
  • 1 node + 4x V100 SXM2 per node (16GB): 0.014 (4.7X)
  • 1 node + 8x V100 SXM2 per node (16GB): 0.020 (6.7X)

(Untuned on Volta) Running LAMMPS version 2017. The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs.

slide-102
SLIDE 102

102

Gay-Berne on V100s PCIe

Performance (1/seconds), 2,097,152 atoms:
  • 1 Broadwell node: 0.01
  • 1 node + 2x V100 PCIe per node (16GB): 0.05 (7.5X)

(Untuned on Volta) Running LAMMPS version 2017 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

slide-103
SLIDE 103

103

Gay-Berne on V100s SXM2

Performance (1/seconds), 2,097,152 atoms:
  • 1 Broadwell node: 0.01
  • 1 node + 2x V100 SXM2 per node (16GB): 0.05 (5.0X)

(Untuned on Volta) Running LAMMPS version 2017. The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs.
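The two Gay-Berne charts print the same bar values (0.01 and 0.05) but different speedups (7.5X on PCIe, 5.0X on SXM2). One plausible reading, stated here as an assumption, is that the bar labels are rounded to two decimals while the speedups were computed from unrounded measurements. The sketch below computes the range of speedups consistent with a pair of rounded values; the function name is illustrative:

```python
from typing import Tuple


def speedup_bounds(baseline: float, accelerated: float, decimals: int = 2) -> Tuple[float, float]:
    """Range of true speedups consistent with values rounded to `decimals` places."""
    half = 0.5 * 10 ** (-decimals)  # largest rounding error on each value
    if baseline - half <= 0:
        raise ValueError("baseline too small for the given rounding precision")
    lowest = (accelerated - half) / (baseline + half)
    highest = (accelerated + half) / (baseline - half)
    return lowest, highest


# Rounded 1/seconds values printed on both Gay-Berne charts.
lowest, highest = speedup_bounds(0.01, 0.05)
print(f"speedup could lie between {lowest:.1f}X and {highest:.1f}X")
```

Under that assumption both printed factors fall inside the computed range, so the charts do not contradict each other; the plotted values simply carry too few digits to recover the ratio.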

slide-104
SLIDE 104

104

Rhodopsin on V100s PCIe

Performance (1/seconds), 256,000 atoms:
  • 1 Broadwell node: 0.17
  • 1 node + 2x V100 PCIe per node (16GB): 0.44 (2.6X)
  • 1 node + 4x V100 PCIe per node (16GB): 0.53 (3.1X)
  • 1 node + 8x V100 PCIe per node (16GB): 0.58 (3.4X)

(Untuned on Volta) Running LAMMPS version 2017. The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs. The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs.

slide-105
SLIDE 105

105

Rhodopsin on V100s SXM2

Performance (1/seconds), 256,000 atoms:
  • 1 Broadwell node: 0.17
  • 1 node + 2x V100 SXM2 per node (16GB): 0.42 (2.5X)
  • 1 node + 4x V100 SXM2 per node (16GB): 0.60 (3.5X)
  • 1 node + 8x V100 SXM2 per node (16GB): 0.68 (4.0X)

(Untuned on Volta) Running LAMMPS version 2017. The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs. The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs.

slide-106
SLIDE 106

September 2017

LAMMPS 2017

slide-107
SLIDE 107

107

Atomic-Fluid Lennard-Jones 2.5 Cutoff on V100s PS PCIe

Performance (1/seconds), 2,048,000 atoms:
  • 1 Broadwell node: 0.25
  • 1 node + 2x P100 PCIe per node (16GB): 0.73
  • 1 node + 2x V100 PS PCIe per node (16GB): 0.73

(Untuned on Volta) Running LAMMPS version 2017 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) or V100 PS PCIe (16GB) GPUs

slide-108
SLIDE 108

108

Atomic-Fluid Lennard-Jones 2.5 Cutoff on V100s PS SXM2

Performance (1/seconds), 2,048,000 atoms:
  • 1 Broadwell node: 0.25
  • 1 node + 2x P100 SXM2 per node (16GB): 0.70
  • 1 node + 2x V100 PS SXM2 per node (16GB): 0.82

(Untuned on Volta) Running LAMMPS version 2017 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 (16GB) or V100 PS SXM2 (16GB) GPUs

slide-109
SLIDE 109

109

Atomic-Fluid Lennard-Jones 5.0 Cutoff on V100s PS PCIe

Performance (1/seconds), 2,048,000 atoms:
  • 1 Broadwell node: 0.06
  • 1 node + 2x P100 PCIe per node (16GB): 0.41
  • 1 node + 4x P100 PCIe per node (16GB): 0.46
  • 1 node + 8x P100 PCIe per node (16GB): 0.56
  • 1 node + 2x V100 PS PCIe per node (16GB): 0.45
  • 1 node + 4x V100 PS PCIe per node (16GB): 0.47
  • 1 node + 8x V100 PS PCIe per node (16GB): 0.60

(Untuned on Volta) Running LAMMPS version 2017 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) or V100 PS PCIe (16GB) GPUs

slide-110
SLIDE 110

110

Atomic-Fluid Lennard-Jones 5.0 Cutoff on V100s PS SXM2

Performance (1/seconds), 2,048,000 atoms:
  • 1 Broadwell node: 0.06
  • 1 node + 2x P100 SXM2 per node (16GB): 0.36
  • 1 node + 4x P100 SXM2 per node (16GB): 0.44
  • 1 node + 8x P100 SXM2 per node (16GB): 0.44
  • 1 node + 2x V100 PS SXM2 per node (16GB): 0.48
  • 1 node + 4x V100 PS SXM2 per node (16GB): 0.55
  • 1 node + 8x V100 PS SXM2 per node (16GB): 0.56

(Untuned on Volta) Running LAMMPS version 2017 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 (16GB) or V100 PS SXM2 (16GB) GPUs

slide-111
SLIDE 111

111

Coarse-grain Water on V100s PS PCIe

Performance (1/seconds), 2,048,000 atoms:
  • 1 Broadwell node: 0.003
  • 1 node + 2x P100 PCIe per node (16GB): 0.004
  • 1 node + 4x P100 PCIe per node (16GB): 0.007
  • 1 node + 8x P100 PCIe per node (16GB): 0.011
  • 1 node + 2x V100 PS PCIe per node (16GB): 0.007
  • 1 node + 4x V100 PS PCIe per node (16GB): 0.011
  • 1 node + 8x V100 PS PCIe per node (16GB): 0.016

(Untuned on Volta) Running LAMMPS version 2017 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) or V100 PS PCIe (16GB) GPUs

slide-112
SLIDE 112

112

Coarse-grain Water on V100s PS SXM2

Performance (1/seconds), 2,048,000 atoms:
  • 1 Broadwell node: 0.003
  • 1 node + 2x P100 SXM2 per node (16GB): 0.004
  • 1 node + 4x P100 SXM2 per node (16GB): 0.007
  • 1 node + 8x P100 SXM2 per node (16GB): 0.012
  • 1 node + 2x V100 PS SXM2 per node (16GB): 0.009
  • 1 node + 4x V100 PS SXM2 per node (16GB): 0.014
  • 1 node + 8x V100 PS SXM2 per node (16GB): 0.020

(Untuned on Volta) Running LAMMPS version 2017 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 (16GB) or V100 PS SXM2 (16GB) GPUs

slide-113
SLIDE 113

113

Gay-Berne on V100s PS PCIe

Performance (1/seconds), 2,097,152 atoms:
  • 1 Broadwell node: 0.01
  • 1 node + 2x P100 PCIe per node (16GB): 0.04
  • 1 node + 2x V100 PS PCIe per node (16GB): 0.05

(Untuned on Volta) Running LAMMPS version 2017 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) or V100 PS PCIe (16GB) GPUs

slide-114
SLIDE 114

114

Gay-Berne on V100s PS SXM2

Performance (1/seconds), 2,097,152 atoms:
  • 1 Broadwell node: 0.01
  • 1 node + 2x P100 SXM2 per node (16GB): 0.04
  • 1 node + 2x V100 PS SXM2 per node (16GB): 0.05

(Untuned on Volta) Running LAMMPS version 2017 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 (16GB) or V100 PS SXM2 (16GB) GPUs

slide-115
SLIDE 115

115

Rhodopsin on V100s PS PCIe

Performance (1/seconds), 256,000 atoms:
  • 1 Broadwell node: 0.17
  • 1 node + 2x P100 PCIe per node (16GB): 0.41
  • 1 node + 4x P100 PCIe per node (16GB): 0.55
  • 1 node + 2x V100 PS PCIe per node (16GB): 0.44
  • 1 node + 4x V100 PS PCIe per node (16GB): 0.53

(Untuned on Volta) Running LAMMPS version 2017 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) or V100 PS PCIe (16GB) GPUs

slide-116
SLIDE 116

116

Rhodopsin on V100s PS SXM2

Performance (1/seconds), 256,000 atoms:
  • 1 Broadwell node: 0.17
  • 1 node + 2x P100 SXM2 per node (16GB): 0.40
  • 1 node + 4x P100 SXM2 per node (16GB): 0.54
  • 1 node + 2x V100 PS SXM2 per node (16GB): 0.42
  • 1 node + 4x V100 PS SXM2 per node (16GB): 0.60

(Untuned on Volta) Running LAMMPS version 2017 The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 (16GB) or V100 PS SXM2 (16GB) GPUs

slide-117
SLIDE 117

117

Recommended GPU Node Configuration for LAMMPS Computational Chemistry

Workstation or Single Node Configuration

  • # of CPU sockets: 2
  • Cores per CPU socket: 6+
  • CPU speed (GHz): 2.66+
  • System memory per socket (GB): 32
  • GPUs: GTX Titan X, Tesla P100, V100
  • # of GPUs per CPU socket: 1-2
  • GPU memory preference (GB): 6+
  • GPU to CPU connection: PCIe 3.0 or higher
  • Server storage: 500 GB or higher
  • Network configuration: Gemini, InfiniBand

Scale to thousands of nodes with same single node configuration
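The recommended configuration can be turned into a quick check for a candidate node. This is a sketch under stated assumptions: the `NodeSpec` fields and `check_lammps_node` are hypothetical names, and the 32 GB system memory figure is read as a minimum, which the slide does not state explicitly.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeSpec:
    """Hypothetical description of one node; field names are illustrative."""
    cpu_sockets: int
    cores_per_socket: int
    cpu_ghz: float
    memory_per_socket_gb: int
    gpus_per_socket: int
    gpu_memory_gb: int
    storage_gb: int


def check_lammps_node(node: NodeSpec) -> List[str]:
    """Return the recommendations the node misses (empty list means none)."""
    problems = []
    if node.cpu_sockets != 2:
        problems.append("expected 2 CPU sockets")
    if node.cores_per_socket < 6:
        problems.append("expected 6+ cores per CPU socket")
    if node.cpu_ghz < 2.66:
        problems.append("expected CPU speed of 2.66+ GHz")
    if node.memory_per_socket_gb < 32:  # assumption: 32 GB treated as a minimum
        problems.append("expected 32 GB system memory per socket")
    if not 1 <= node.gpus_per_socket <= 2:
        problems.append("expected 1-2 GPUs per CPU socket")
    if node.gpu_memory_gb < 6:
        problems.append("expected 6+ GB GPU memory")
    if node.storage_gb < 500:
        problems.append("expected 500 GB or more server storage")
    return problems


# Two made-up nodes, not taken from the benchmark systems.
good = NodeSpec(2, 12, 2.7, 32, 2, 16, 1000)
weak = NodeSpec(2, 4, 2.2, 16, 2, 16, 250)

print(check_lammps_node(good))
print(check_lammps_node(weak))
```

GPU model, PCIe generation, and network type are left out because they are categorical choices rather than thresholds.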

slide-118
SLIDE 118

July 2017

NAMD 2.12

slide-119
SLIDE 119

119

APOA1 on P100s PCIe

Running NAMD version 2.12 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs

Performance (ns/day):
  • 1 Broadwell node: 3.45
  • 1 node + 1x P100 PCIe (16GB) per node: 22.58
  • 1 node + 2x P100 PCIe (16GB) per node: 22.85

slide-120
SLIDE 120

120

APOA1 on P100s SXM2

Running NAMD version 2.12 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

Performance (ns/day):
  • 1 Broadwell node: 3.45
  • 1 node + 1x P100 SXM2 per node: 22.98
  • 1 node + 2x P100 SXM2 per node: 23.44
  • 1 node + 4x P100 SXM2 per node: 23.87

slide-121
SLIDE 121

121

F1ATPASE on P100s PCIe

Running NAMD version 2.12 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs

Performance (ns/day):
  • 1 Broadwell node: 1.15
  • 1 node + 1x P100 PCIe (16GB) per node: 7.34
  • 1 node + 2x P100 PCIe (16GB) per node: 6.99
  • 1 node + 4x P100 PCIe (16GB) per node: 7.40

slide-122
SLIDE 122

122

F1ATPASE on P100s SXM2

Running NAMD version 2.12 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

Performance (ns/day):
  • 1 Broadwell node: 1.15
  • 1 node + 1x P100 SXM2 per node: 7.11
  • 1 node + 2x P100 SXM2 per node: 6.85
  • 1 node + 4x P100 SXM2 per node: 7.11

slide-123
SLIDE 123

123

STMV on P100s PCIe

Running NAMD version 2.12 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs

Performance (ns/day):
  • 1 Broadwell node: 0.29
  • 1 node + 1x P100 PCIe (16GB) per node: 2.15
  • 1 node + 2x P100 PCIe (16GB) per node: 2.32

slide-124
SLIDE 124

124

STMV on P100s SXM2

Running NAMD version 2.12 The blue node contains Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 v4@2.6GHz [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

Performance (ns/day):
  • 1 Broadwell node: 0.292
  • 1 node + 1x P100 SXM2 per node: 2.077

slide-125
SLIDE 125

125

Benefits of MD GPU-Accelerated Computing

  • 3x-8x faster than CPU-only systems, on average, across all tests
  • Most major compute-intensive aspects of classical MD have been ported
  • Large performance boost while saving “Big Money” on CPUs and networks
  • Energy usage cut by more than half
  • GPUs scale well within a node and/or over multiple nodes
  • K80 GPU is our fastest and lowest-power high-performance GPU yet
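An "on average" speedup depends on how the average is taken. The sketch below compares the arithmetic and geometric means over a hand-picked sample of 2x V100 PCIe speedups from the GROMACS and LAMMPS charts in this deck. The sample is illustrative only; it is not the data set behind the 3x-8x figure, which the slides do not list.

```python
from statistics import geometric_mean, mean

# Hand-picked sample: speedup of 1 node + 2x V100 PCIe over 1 Broadwell node.
sample_speedups = {
    "GROMACS Water 1.5M": 2.3,
    "LAMMPS Lennard-Jones 2.5 cutoff": 3.0,
    "LAMMPS Lennard-Jones 5.0 cutoff": 7.5,
    "LAMMPS coarse-grain water": 2.3,
    "LAMMPS Gay-Berne": 7.5,
    "LAMMPS Rhodopsin": 2.6,
}

values = list(sample_speedups.values())
arithmetic = mean(values)
geometric = geometric_mean(values)  # less sensitive to a few large ratios

print(f"arithmetic mean: {arithmetic:.1f}X")
print(f"geometric mean:  {geometric:.1f}X")
```

The geometric mean is the usual choice for averaging ratios, since it gives the same answer whether speedups or slowdowns are averaged.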

Try GPU accelerated MD apps for free – www.nvidia.com/GPUTestDrive

Why wouldn’t you want to turbocharge your research?

slide-126
SLIDE 126
  • Dec. 19, 2016

Molecular Dynamics (MD) on GPUs

slide-127
SLIDE 127

127

GPU-Accelerated Quantum Chemistry Apps

Abinit, ACES III, ADF, BigDFT, CP2K, GAMESS-US, Gaussian, GPAW, LATTE, LSDalton, MOLCAS, Mopac2012, NWChem, Octopus, ONETEP, Petot, Q-Chem, QMCPACK, Quantum Espresso, Quantum SuperCharger Library, RMG, TeraChem, UNM, VASP, WL-LSMS

Green lettering indicates performance slides included.

GPU performance compared against dual multi-core x86 CPU socket.