May 5, 2016
Accelerating Molecular Dynamics (MD) on GPUs
2
Accelerating Discoveries
Using a supercomputer powered by the Tesla Platform with over 3,000 Tesla accelerators, University of Illinois scientists performed the first all-atom simulation of the HIV virus and discovered the chemical structure of its capsid — “the perfect target for fighting the infection.” Without GPUs, the supercomputer would need to be 5x larger to achieve similar performance.
3
Overview of Life & Material Accelerated Apps
MD: All key codes are GPU-accelerated. Great multi-GPU performance. Focus on dense GPU nodes (up to 16 GPUs) and/or large numbers of GPU nodes.
ACEMD*, AMBER (PMEMD)*, BAND, CHARMM, DESMOND, ESPResso, Folding@Home, GPUgrid.net, GROMACS, HALMD, HOOMD-Blue*, LAMMPS, Lattice Microbes*, mdcore, MELD, miniMD, NAMD, OpenMM, PolyFTS, SOP-GPU* & more
QC: All key codes are ported or being optimized. Focus on using GPU-accelerated math libraries and OpenACC directives. GPU-accelerated and available today:
ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS, GAMESS-UK, GPAW, LATTE, LSDalton, LSMS, MOLCAS, MOPAC2012, NWChem, OCTOPUS*, PEtot, QUICK, Q-Chem, QMCPack, Quantum Espresso/PWscf, TeraChem*
Active GPU acceleration projects:
CASTEP, GAMESS, Gaussian, ONETEP, Quantum Supercharger Library*, VASP & more
green* = application where >90% of the workload is on GPU
4
MD vs. QC on GPUs
“Classical” Molecular Dynamics Quantum Chemistry (MO, PW, DFT, Semi-Emp)
- MD: Simulates positions of atoms over time; chemical-biological or chemical-material behaviors. QC: Calculates electronic properties: ground state, excited states, spectral properties, making/breaking bonds, physical properties.
- MD: Forces calculated from simple empirical formulas (bond rearrangement generally forbidden). QC: Forces derived from the electron wave function (bond rearrangement OK, e.g., bond energies).
- MD: Up to millions of atoms. QC: Up to a few thousand atoms.
- MD: Solvent included without difficulty. QC: Generally in vacuum; if needed, solvent treated classically (QM/MM) or with implicit methods.
- MD: Single precision dominated. QC: Double precision is important.
- MD: Uses cuBLAS, cuFFT, CUDA. QC: Uses cuBLAS, cuFFT, OpenACC.
- MD: GeForce (academics), Tesla (servers). QC: Tesla recommended.
- MD: ECC off. QC: ECC on.
5
GPU-Accelerated Molecular Dynamics Apps
ACEMD, AMBER, CHARMM, DESMOND, ESPResSo, Folding@Home, GPUGrid.net, GROMACS, HALMD, HOOMD-Blue, LAMMPS, mdcore (green lettering indicates performance slides are included)
GPU performance compared against a dual-socket multi-core x86 CPU system.
MELD, NAMD, OpenMM, PolyFTS
6
Benefits of MD GPU-Accelerated Computing
- 3x-8x faster than CPU-only systems (on average, across all tests)
- Most major compute intensive aspects of classical MD ported
- Large performance boost with marginal price increase
- Energy usage cut by more than half
- GPUs scale well within a node and/or over multiple nodes
- K80 is our fastest and most power-efficient high-performance GPU yet
Try GPU accelerated MD apps for free – www.nvidia.com/GPUTestDrive
Why wouldn’t you want to turbocharge your research?
ACEMD
www.acellera.com
470 ns/day on 1 GPU for L-Iduronic acid (1,362 atoms); 116 ns/day on 1 GPU for DHFR (23K atoms)
- M. Harvey, G. Giupponi and G. De Fabritiis, “ACEMD: Accelerated molecular dynamics simulations in the microseconds timescale,” J. Chem. Theory Comput. 5, 1632 (2009)
NVT, NPT, PME, TCL, PLUMED, CAMSHIFT1
1 M. J. Harvey and G. De Fabritiis, “An implementation of the smooth particle-mesh Ewald (PME) method on GPU hardware,” J. Chem. Theory Comput. 5, 2371–2377 (2009)
2 For a list of selected references see http://www.acellera.com/acemd/publications
AMBER 14
11
Ross Walker (AMBER developer) video
12
AMBER 14 vs. AMBER 12
Courtesy of Scott Le Grand From GTC 2014 presentation
13
AMBER 14: Large Impact from P2P, Small Impact from Boost Clocks

AMBER 14 (ns/day) on 4x K40, DHFR NVE PME 2 fs benchmark (CUDA 6.0, ECC off):
- 2x Xeon E5-2690 v2@3.00GHz + 4x Tesla K40@745MHz (no P2P): 125.77
- 2x Xeon E5-2690 v2@3.00GHz + 4x Tesla K40@875MHz (no P2P): 132.97
- 2x Xeon E5-2690 v2@3.00GHz + 4x Tesla K40@745MHz (P2P): 196.68
- 2x Xeon E5-2690 v2@3.00GHz + 4x Tesla K40@875MHz (P2P): 215.18
14
AMBER Performance Over Time
Courtesy of Scott Le Grand From GTC 2014 presentation
Protein Folding Simulation With AMBER Accelerated By GPUs
168 ns/day 2x8xE5-2670 CPU 585 ns/day 2x8xE5-2670 CPU + Tesla K20X GPU
2.95x Faster
Data courtesy of AMBER.org
16
AMBER 14 with P2P, Higher Density Nodes
AMBER 14 (ns/day) on K40 with P2P and Boost Clocks, DHFR NVE PME 2 fs benchmark (CUDA 5.5, ECC off):
- 2x Xeon E5-2690 v2@3.00GHz (CPU only): 23.79
- + 1x Tesla K40@875MHz: 133.74
- + 2x Tesla K40@875MHz: 200.58
- + 4x Tesla K40@875MHz: 223.56
17
AMBER 14 and K40 with P2P, fastest GPU yet!
AMBER 14 (ns/day) on K40 with P2P and Boost Clocks, Factor IX NPT PME 2 fs benchmark (CUDA 5.5, ECC off):
- 2x Xeon E5-2690 v2@3.00GHz (CPU only): 6.1
- + 1x Tesla K40@875MHz: 37.15
- + 2x Tesla K40@875MHz: 56.08
- + 4x Tesla K40@875MHz: 60.27
18
AMBER 14 and K40 with P2P, fastest GPU yet!
AMBER 14 (ns/day) on K40 with P2P and Boost Clocks, Cellulose NVE PME 2 fs benchmark (CUDA 5.5, ECC off):
- 2x Xeon E5-2690 v2@3.00GHz (CPU only): 1.32
- + 1x Tesla K40@875MHz: 9.22
- + 2x Tesla K40@875MHz: 14.03
- + 4x Tesla K40@875MHz: 16.64
19
AMBER 14 and K40 with P2P, fastest GPU yet!
AMBER 14 (ns/day) on K40 with P2P and Boost Clocks, Nucleosome GB 2 fs benchmark (CUDA 5.5, ECC off):
- 2x Xeon E5-2690 v2@3.00GHz (CPU only): 0.08
- + 1x Tesla K40@875MHz: 3.97
- + 2x Tesla K40@875MHz: 7.6
- + 4x Tesla K40@875MHz: 13.77
20
Cellulose on K40s, K80s and M6000s
Running AMBER version 14. The blue node contains dual Intel E5-2698 v3@2.3GHz (3.6GHz Turbo) CPUs; the green nodes contain dual Intel E5-2698 v3@2.3GHz (3.6GHz Turbo) CPUs plus either NVIDIA Tesla K40@875MHz, Tesla K80@562MHz (autoboost), or Quadro M6000@987MHz GPUs.

PME-Cellulose_NVE, simulated time (ns/day):
- 1 Haswell node: 1.93
- 1 CPU node + 1x K40: 8.96 (4.6x)
- 1 CPU node + 0.5x K80: 7.87 (4.1x)
- 1 CPU node + 1x K80: 11.76 (6.1x)
- 1 CPU node + 1x M6000: 10.49 (5.4x)
- 1 CPU node + 2x K40: 13.67 (7.1x)
- 1 CPU node + 2x K80: 15.38 (8.0x)
- 1 CPU node + 2x M6000: 14.90 (7.7x)
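The speedup factors quoted throughout these slides are just GPU-node throughput (ns/day) divided by the CPU-only baseline. A minimal sketch of that arithmetic, using three of the Cellulose numbers above:

```python
# Speedup = GPU-accelerated throughput / CPU-only throughput, both in ns/day.
# Values taken from the Cellulose chart above.
baseline = 1.93  # ns/day on 1 CPU-only Haswell node

gpu_runs = {"1x K40": 8.96, "1x K80": 11.76, "2x K80": 15.38}
speedups = {cfg: round(ns_day / baseline, 1) for cfg, ns_day in gpu_runs.items()}

print(speedups)  # {'1x K40': 4.6, '1x K80': 6.1, '2x K80': 8.0}
```

Because ns/day is a throughput (higher is better), the ratio can be read directly as "N times more simulated time per wall-clock day."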
21
Factor IX on K40s, K80s and M6000s
Running AMBER version 14. The blue node contains dual Intel E5-2698 v3@2.3GHz (3.6GHz Turbo) CPUs; the green nodes contain dual Intel E5-2698 v3@2.3GHz (3.6GHz Turbo) CPUs plus either NVIDIA Tesla K40@875MHz, Tesla K80@562MHz (autoboost), or Quadro M6000@987MHz GPUs.

PME-FactorIX_NVE, simulated time (ns/day):
- 1 Haswell node: 9.68
- 1 CPU node + 1x K40: 40.48 (4.2x)
- 1 CPU node + 0.5x K80: 33.59 (3.5x)
- 1 CPU node + 1x K80: 50.70 (5.2x)
- 1 CPU node + 1x M6000: 47.80 (5.0x)
- 1 CPU node + 2x K40: 61.18 (6.4x)
- 1 CPU node + 2x K80: 60.93 (6.3x)
- 1 CPU node + 2x M6000: 66.89 (7.0x)
22
JAC on K40s, K80s and M6000s
Running AMBER version 14. The blue node contains dual Intel E5-2698 v3@2.3GHz (3.6GHz Turbo) CPUs; the green nodes contain dual Intel E5-2698 v3@2.3GHz (3.6GHz Turbo) CPUs plus either NVIDIA Tesla K40@875MHz, Tesla K80@562MHz (autoboost), or Quadro M6000@987MHz GPUs.

PME-JAC_NVE, simulated time (ns/day):
- 1 Haswell node: 37.38
- 1 CPU node + 1x K40: 134.82 (3.6x)
- 1 CPU node + 0.5x K80: 121.30 (3.2x)
- 1 CPU node + 1x K80: 174.34 (4.7x)
- 1 CPU node + 1x M6000: 161.53 (4.3x)
- 1 CPU node + 2x K40: 200.34 (5.4x)
- 1 CPU node + 2x K80: 225.34 (6.0x)
- 1 CPU node + 2x M6000: 219.83 (5.9x)
March 2016
AMBER 14
24
Cellulose on M40s
Running AMBER version 14. The blue node contains a single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPU; the green nodes contain a single Intel Xeon E5-2697 v2@2.70GHz (Ivy Bridge) CPU plus Tesla M40 (autoboost) GPUs.

PME-Cellulose_NPT, simulated time (ns/day):
- 1 node (CPU only): 1.07
- 1 node + 1x M40: 10.12 (9.5x)
- 1 node + 2x M40: 14.40 (13.5x)
- 1 node + 4x M40: 15.90 (14.9x)
25
Cellulose on M40s
PME-Cellulose_NVE, simulated time (ns/day), same node configurations as above:
- 1 node (CPU only): 1.07
- 1 node + 1x M40: 10.50 (9.8x)
- 1 node + 2x M40: 15.41 (14.4x)
- 1 node + 4x M40: 17.13 (16.0x)
26
FactorIX on M40s
PME-FactorIX_NPT, simulated time (ns/day), same node configurations as above:
- 1 node (CPU only): 5.38
- 1 node + 1x M40: 46.90 (8.7x)
- 1 node + 2x M40: 67.37 (12.5x)
- 1 node + 4x M40: 72.96 (13.6x)
27
FactorIX on M40s
PME-FactorIX_NVE, simulated time (ns/day), same node configurations as above:
- 1 node (CPU only): 5.47
- 1 node + 1x M40: 49.33 (9.0x)
- 1 node + 2x M40: 73.00 (13.3x)
- 1 node + 4x M40: 80.04 (14.6x)
28
JAC on M40s
PME-JAC_NPT, simulated time (ns/day), same node configurations as above:
- 1 node (CPU only): 20.88
- 1 node + 1x M40: 149.40 (7.2x)
- 1 node + 2x M40: 211.97 (10.2x)
- 1 node + 4x M40: 226.63 (10.9x)
29
JAC on M40s
PME-JAC_NVE, simulated time (ns/day), same node configurations as above:
- 1 node (CPU only): 21.11
- 1 node + 1x M40: 157.68 (7.5x)
- 1 node + 2x M40: 230.18 (10.9x)
- 1 node + 4x M40: 246.15 (11.7x)
30
Myoglobin on M40s
GB-Myoglobin, simulated time (ns/day), same node configurations as above:
- 1 node (CPU only): 9.83
- 1 node + 1x M40: 232.20 (23.6x)
- 1 node + 2x M40: 300.86 (30.6x)
- 1 node + 4x M40: 322.09 (32.8x)
31
Nucleosome on M40s
GB-Nucleosome, simulated time (ns/day), same node configurations as above:
- 1 node (CPU only): 0.13
- 1 node + 1x M40: 4.67 (35.9x)
- 1 node + 2x M40: 9.05 (69.6x)
- 1 node + 4x M40: 16.11 (123.9x)
32
TrpCage on M40s
GB-TrpCage, simulated time (ns/day), same node configurations as above:
- 1 node (CPU only): 408.88
- 1 node + 1x M40: 831.91 (2.0x)
- 1 node + 2x M40: 551.36 (1.3x)
- 1 node + 4x M40: 464.63 (1.1x)
JAC on K40s and K80s
AMBER 14, PME-JAC_NVE on Tesla K40s and K80s with IVB CPUs (1 node, simulated time in ns/day):
- 2x Xeon E5-2697 v2@2.70GHz (1 Ivy Bridge node): 24.94
- + 1x Tesla K40@875MHz: 131.03
- + 0.5x Tesla K80 (autoboost): 116.67
- + 1x Tesla K80 (autoboost): 167.81
- + 2x Tesla K40@875MHz: 195.39
- + 2x Tesla K80 (autoboost): 205.30
- + 4x Tesla K40@875MHz: 218.62
FactorIX on K40s and K80s
AMBER 14, PME-FactorIX_NVE on Tesla K40s and K80s with IVB CPUs (1 node, simulated time in ns/day):
- 2x Xeon E5-2697 v2@2.70GHz (1 Ivy Bridge node): 6.61
- + 1x Tesla K40@875MHz: 36.31
- + 0.5x Tesla K80@562MHz: 31.52
- + 1x Tesla K80@562MHz: 46.65
- + 2x Tesla K40@875MHz: 53.90
- + 2x Tesla K80@562MHz: 53.28
- + 4x Tesla K40@875MHz: 59.45
Cellulose on K40s and K80s
AMBER 14, PME-Cellulose_NVE on Tesla K40s and K80s with IVB CPUs (1 node, simulated time in ns/day):
- 2x Xeon E5-2697 v2@2.70GHz (1 Ivy Bridge node): 1.35
- + 1x Tesla K40@875MHz: 8.45
- + 0.5x Tesla K80@562MHz: 7.36
- + 1x Tesla K80@562MHz: 10.96
- + 2x Tesla K40@875MHz: 12.49
- + 2x Tesla K80@562MHz: 13.48
- + 4x Tesla K40@875MHz: 14.68
36
Kepler - Our Fastest Family of GPUs Yet
Running AMBER 14. The blue node contains dual E5-2697 CPUs (12 cores per CPU); the green nodes contain dual E5-2697 CPUs (12 cores per CPU) plus 1x or 2x NVIDIA K20X, K40 or K80 GPUs. Benchmark: DHFR (JAC).

AMBER 14, SPFP-DHFR_production_NVE (ns/day):
- 1x Xeon E5-2697 v2@2.70GHz: 14.54
- 1x CPU + 1x Phi 5110p (offload): 4.08
- 1x CPU + 1x Phi 7120p (offload): 3.82
- 1x CPU + 1x Tesla K20X: 111.32
- 1x CPU + 1x Tesla K40@875MHz: 134.08
- 1x CPU + 2x Tesla K20X: 159.25
- 1x CPU + 1x Tesla K80 board: 175.43
- 1x CPU + 2x Tesla K40@875MHz: 196.69
- 2x Xeon E5-2697 v2@2.70GHz: 25.80
- 2x CPU + 1x Tesla K20X: 110.87
- 2x CPU + 1x Tesla K40@875MHz: 132.68
- 2x CPU + 2x Tesla K20X: 159.06
- 2x CPU + 2x Tesla K40@875MHz: 196.86
37
Kepler - Our Fastest Family of GPUs Yet
Running AMBER 14. The blue node contains dual E5-2697 CPUs (12 cores per CPU); the green nodes contain dual E5-2697 CPUs (12 cores per CPU) plus 1x or 2x NVIDIA K20X, K40 or K80 GPUs. Benchmark: Factor IX.

AMBER 14, SPFP-Factor_IX_Production_NVE (ns/day):
- 1x Xeon E5-2697 v2@2.70GHz: 3.70
- 1x CPU + 1x Phi 5110p (offload): 3.29
- 1x CPU + 1x Phi 7120p (offload): 3.35
- 1x CPU + 1x Tesla K20X: 32.45
- 1x CPU + 1x Tesla K40@875MHz: 38.65
- 1x CPU + 2x Tesla K20X: 46.58
- 1x CPU + 1x Tesla K80 board: 51.12
- 1x CPU + 2x Tesla K40@875MHz: 57.89
- 2x Xeon E5-2697 v2@2.70GHz: 6.87
- 2x CPU + 1x Tesla K20X: 32.30
- 2x CPU + 1x Tesla K40@875MHz: 38.60
- 2x CPU + 2x Tesla K20X: 46.50
- 2x CPU + 2x Tesla K40@875MHz: 57.83
38
Kepler - Our Fastest Family of GPUs Yet
Running AMBER 14. The blue node contains dual E5-2697 CPUs (12 cores per CPU); the green nodes contain dual E5-2697 CPUs (12 cores per CPU) plus 1x or 2x NVIDIA K20X, K40 or K80 GPUs. Benchmark: Cellulose.

AMBER 14, SPFP-Cellulose_Production_NVE (ns/day):
- 1x Xeon E5-2697 v2@2.70GHz: 0.74
- 1x CPU + 1x Phi 5110p (offload): 1.50
- 1x CPU + 1x Phi 7120p (offload): 1.56
- 1x CPU + 1x Tesla K20X: 7.60
- 1x CPU + 1x Tesla K40@875MHz: 8.95
- 1x CPU + 2x Tesla K20X: 10.82
- 1x CPU + 1x Tesla K80 board: 11.86
- 1x CPU + 2x Tesla K40@875MHz: 13.29
- 2x Xeon E5-2697 v2@2.70GHz: 1.38
- 2x CPU + 1x Tesla K20X: 7.60
- 2x CPU + 1x Tesla K40@875MHz: 8.95
- 2x CPU + 2x Tesla K20X: 10.83
- 2x CPU + 2x Tesla K40@875MHz: 13.29
SAN DIEGO SUPERCOMPUTER CENTER
Cost Comparison
Traditional Cluster vs. GPU Workstation:
- Nodes required: 12 vs. 1 (4 GPUs)
- Interconnect: QDR IB vs. none
- Time to complete simulations: 4.98 days vs. 2.25 days
- Power consumption: 5.7 kW (681.3 kWh) vs. 1.0 kW (54.0 kWh)
- System cost (per day): $96,800 ($88.40) vs. $5,200 ($4.75)
- Simulation cost: (681.3 × 0.18) + (88.40 × 4.98) = $562.87 vs. (54.0 × 0.18) + (4.75 × 2.25) = $20.41
39
4 simultaneous simulations of 23,000 atoms, 250 ns each, with a 5-day maximum time to solution: more than 25x cheaper, and the solution is obtained in less than half the time.
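The simulation-cost arithmetic in the table is energy cost plus amortized hardware cost. A sketch of that model; the $0.18/kWh electricity rate and the three-year amortization are assumptions inferred from the slide's numbers ($96,800 / $88.40 per day ≈ 1,095 days), not stated facts:

```python
def simulation_cost(power_kw, days, system_cost_usd,
                    kwh_rate=0.18, life_days=3 * 365):
    """Energy cost plus amortized hardware cost for one simulation campaign."""
    energy_kwh = power_kw * days * 24          # e.g. 5.7 kW * 4.98 days -> 681.3 kWh
    hw_per_day = system_cost_usd / life_days   # e.g. $96,800 / 1095 -> ~$88.40/day
    return energy_kwh * kwh_rate + hw_per_day * days

cluster = simulation_cost(5.7, 4.98, 96_800)      # ~$562.87
workstation = simulation_cost(1.0, 2.25, 5_200)   # ~$20.40
print(round(cluster / workstation, 1))            # >25x cheaper on the workstation
```

The ratio reproduces the slide's ">25x cheaper" claim directly from the table's inputs.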
40
Replace 8 Nodes with 1 K20 GPU
Cut down simulation costs to ¼ and gain higher performance
Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The eight blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU); each green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K20 GPU. Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
DHFR: 8 CPU nodes deliver 65.00 ns/day at $32,000; 1 node + 1x K20 delivers 81.09 ns/day at $6,500.
41
Replace 7 Nodes with 1 K10 GPU
Cut down simulation costs to ¼ and increase performance by 70%
Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The eight blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU); the green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K10 GPU. Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
DHFR cost: $32,000 (CPU only) vs. $7,000 (GPU enabled).

Performance on JAC NVE (ns/day): CPU only vs. GPU enabled.
When used with GPUs, dual CPU sockets perform worse than single CPU sockets.
Extra CPUs decrease Performance
Running AMBER 12 GPU Support Revision 12.1. The orange bars represent a single E5-2687W CPU (8 cores per CPU); the blue bars represent dual E5-2687W CPUs.

Cellulose NVE (ns/day): CPU only vs. CPU with dual K20s, for the 1x E5-2687W and 2x E5-2687W configurations.
Kepler - Greener Science
The GPU-accelerated systems use 65-75% less energy. Energy expended = power × time.

Energy used in simulating 1 ns of DHFR (JAC), in kJ (lower is better): CPU only vs. CPU + K10, CPU + K20, and CPU + K20X.

Running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (150W each, 8 cores per CPU); the green nodes contain dual E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K10, K20, or K20X GPU (235W each).
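Because energy is average power integrated over wall-clock time, a GPU node can draw more watts yet spend far less energy per simulated nanosecond if it finishes much sooner. A toy calculation; the power draws and runtimes below are hypothetical illustrations, not the measured values behind the chart:

```python
def energy_kj(power_w, hours):
    """Energy = power x time, converted from watt-hours to kilojoules (1 Wh = 3.6 kJ)."""
    return power_w * hours * 3600 / 1000

# Hypothetical example: a 600 W CPU-only node needing 1.0 h per simulated ns
# vs. an 835 W CPU+GPU node (e.g. the same CPUs plus one 235 W GPU)
# finishing the same nanosecond in 0.2 h.
cpu_only = energy_kj(600, 1.0)   # 2160.0 kJ
cpu_gpu = energy_kj(835, 0.2)    # ~601 kJ, roughly 72% less energy
```

The higher instantaneous power is more than paid back by the 5x shorter runtime, matching the slide's 65-75% energy-savings range.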
44
Recommended GPU Node Configuration for AMBER Computational Chemistry
Workstation or Single Node Configuration
- # of CPU sockets: 2
- Cores per CPU socket: 6+ (1 CPU core drives 1 GPU)
- CPU speed: 2.66+ GHz
- System memory per node: 16 GB
- GPUs: Kepler K20, K40, K80
- # of GPUs per CPU socket: 1-4
- GPU memory preference: 6 GB
- GPU-to-CPU connection: PCIe 3.0 x16 or higher
- Server storage: 2 TB
- Network configuration: InfiniBand QDR or better
Scale to multiple nodes with same single node configuration
CHARMM
Courtesy of Antti-Pekka Hynninen @ NREL
Greener Science with NVIDIA
Running CHARMM release C37b1. The blue node contains 64 X5667 CPUs (95W, 4 cores per CPU); the green nodes contain 2 X5667 CPUs and 1 or 2 NVIDIA C2070 GPUs (238W each). Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Using GPUs will decrease energy use by 75%.

Energy used in simulating 1 ns of Daresbury G1nBP (61.2K atoms), in kJ (lower is better): 64x X5667 vs. 2x X5667 + 1x C2070 vs. 2x X5667 + 2x C2070.
Energy Expended = Power x Time
May 2016
CHARMM c40a2
52
465K System on K80s
Running CHARMM version c40a2. The blue node contains dual Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPUs; the green nodes contain dual Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPUs plus Tesla K80 (autoboost) GPUs. "gpuonly" means all forces are calculated on the GPU; "gpuon" means only non-bonded forces are calculated on the GPU.

465K-atom system (ns/day):
- 1 Haswell node: 0.36
- 1 node + 1x K80 (gpuonly): 2.15 (6.0x)
- 1 node + 1x K80 (gpuon): 1.70 (4.7x)
- 1 node + 2x K80 (gpuon): 1.80 (5.0x)
- 1 node + 4x K80 (gpuon): 1.62 (4.5x)
53
534K System on K80s
534K-atom system (ns/day), same node configurations and "gpuonly"/"gpuon" conventions as above:
- 1 Haswell node: 0.18
- 1 node + 1x K80 (gpuonly): 1.43 (8.0x)
- 1 node + 1x K80 (gpuon): 1.44 (8.0x)
- 1 node + 2x K80 (gpuon): 1.44 (8.0x)
- 1 node + 4x K80 (gpuon): 1.86 (10.3x)
October 2015
GROMACS 5.1
55
Erik Lindahl (GROMACS developer) video
56
384K Waters on K40s and K80s
Running GROMACS version 5.1. The blue node contains dual Intel E5-2698 v3@2.3GHz CPUs; the green nodes contain dual Intel E5-2698 v3@2.3GHz CPUs plus either NVIDIA Tesla K40@875MHz or Tesla K80@562MHz (autoboost) GPUs.

Water [PME] 384K, simulated time (ns/day):
- 1 Haswell node: 7.16
- 1 CPU node + 1x K40: 10.45 (1.5x)
- 1 CPU node + 1x K80: 16.99 (2.4x)
- 1 CPU node + 2x K40: 17.07 (2.4x)
- 1 CPU node + 2x K80: 22.95 (3.2x)
- 1 CPU node + 4x K40: 22.36 (3.1x)
- 1 CPU node + 4x K80: 24.72 (3.5x)
57
384K Waters on Titan X
Running GROMACS version 5.1. The blue node contains dual Intel E5-2698 v3@2.3GHz CPUs; the green nodes contain dual Intel E5-2698 v3@2.3GHz CPUs plus GeForce GTX TitanX@1000MHz GPUs.

Water [PME] 384K, simulated time (ns/day):
- 1 Haswell node: 7.16
- 1 CPU node + 1x TitanX: 16.08 (2.2x)
- 1 CPU node + 2x TitanX: 18.13 (2.5x)
- 1 CPU node + 4x TitanX: 21.74 (3.0x)
58
768K Waters on K40s and K80s
Water [PME] 768K, simulated time (ns/day), same K40/K80 node configurations as the 384K chart:
- 1 Haswell node: 3.58
- 1 CPU node + 1x K40: 5.37 (1.5x)
- 1 CPU node + 1x K80: 8.50 (2.4x)
- 1 CPU node + 2x K40: 8.60 (2.4x)
- 1 CPU node + 2x K80: 11.36 (3.2x)
- 1 CPU node + 4x K40: 11.31 (3.2x)
- 1 CPU node + 4x K80: 12.78 (3.6x)
59
768K Waters on Titan X
Water [PME] 768K, simulated time (ns/day), same TitanX node configurations as the 384K chart:
- 1 Haswell node: 3.58
- 1 CPU node + 1x TitanX: 8.19 (2.3x)
- 1 CPU node + 2x TitanX: 9.12 (2.5x)
- 1 CPU node + 4x TitanX: 11.51 (3.2x)
60
1.5M Waters on K40s and K80s
Water [PME] 1.5M, simulated time (ns/day), same K40/K80 node configurations as the 384K chart:
- 1 Haswell node: 1.72
- 1 CPU node + 1x K40: 2.69 (1.6x)
- 1 CPU node + 1x K80: 4.13 (2.4x)
- 1 CPU node + 2x K40: 4.16 (2.4x)
- 1 CPU node + 2x K80: 5.67 (3.3x)
- 1 CPU node + 4x K40: 5.61 (3.3x)
- 1 CPU node + 4x K80: 6.07 (3.5x)
61
1.5M Waters on Titan X
Water [PME] 1.5M, simulated time (ns/day), same TitanX node configurations as the 384K chart:
- 1 Haswell node: 1.72
- 1 CPU node + 1x TitanX: 3.75 (2.2x)
- 1 CPU node + 2x TitanX: 4.64 (2.7x)
- 1 CPU node + 4x TitanX: 5.87 (3.4x)
62
3M Waters on K40s and K80s
Water [PME] 3M, simulated time (ns/day), same K40/K80 node configurations as the 384K chart:
- 1 Haswell node: 0.81
- 1 CPU node + 1x K40: 1.32 (1.6x)
- 1 CPU node + 1x K80: 1.88 (2.3x)
- 1 CPU node + 2x K40: 1.85 (2.3x)
- 1 CPU node + 2x K80: 2.72 (3.4x)
- 1 CPU node + 4x K40: 2.76 (3.4x)
- 1 CPU node + 4x K80: 3.23 (4.0x)
63
3M Waters on Titan X
Water [PME] 3M, simulated time (ns/day), same TitanX node configurations as the 384K chart:
- 1 Haswell node: 0.81
- 1 CPU node + 1x TitanX: 1.53 (1.9x)
- 1 CPU node + 2x TitanX: 2.36 (2.9x)
- 1 CPU node + 4x TitanX: 2.99 (3.7x)
GROMACS 5.0: Intel Phi vs. Kepler K40, Our Fastest GPU Yet

GROMACS 5.0 RC1 (ns/day) with K40 Boost Clocks and Intel Phi, 192K waters benchmark (CUDA 6.0):
- 1x Xeon E5-2697 v2@2.70GHz: 4.96
- 1x Intel Phi 3120p (native mode): 6.02
- 1x Intel Phi 5110p (native mode): 5.9
- 1x CPU + 1x Tesla K40@875MHz: 18.19
- 1x CPU + 2x Tesla K40@875MHz: 18.55
- 2x Xeon E5-2697 v2@2.70GHz: 7.9
- 2x CPU + 1x Tesla K40@875MHz: 19.29
- 2x CPU + 2x Tesla K40@875MHz: 25.84
GROMACS 5.0 & Fastest Kepler GPUs yet!
GROMACS 5.0, cresta_ion_channel, single node with & without Kepler GPUs (ns/day):
- 1x Xeon E5-2697 v2@2.70GHz: 7.92
- 1x CPU + 1x Tesla K20X: 20.01
- 1x CPU + 1x Tesla K40@875MHz: 21.79
- 1x CPU + 1x Tesla K80 (autoboost): 18.63
- 2x Xeon E5-2697 v2@2.70GHz (1 node): 11.60
- 2x CPU + 1x Tesla K20X (1 node): 25.49
- 2x CPU + 1x Tesla K40@875MHz (1 node): 26.00
GROMACS 5.0 & Fastest Kepler GPUs yet!
GROMACS 5.0, cresta_ion_channel_vsites, single node with & without Kepler GPUs (ns/day):
- 1x Xeon E5-2697 v2@2.70GHz: 13.66
- 1x CPU + 1x Tesla K20X: 35.27
- 1x CPU + 1x Tesla K40@875MHz: 37.00
- 1x CPU + 1x Tesla K80 board: 31.86
- 2x Xeon E5-2697 v2@2.70GHz (1 node): 17.98
- 2x CPU + 1x Tesla K20X (1 node): 41.94
- 2x CPU + 1x Tesla K40@875MHz (1 node): 45.29
- 2x CPU + 2x Tesla K20X (1 node): 42.57
- 2x CPU + 2x Tesla K40@875MHz (1 node): 45.37
GROMACS 5.0 & Fastest Kepler GPUs yet!
GROMACS 5.0, cresta_methanol, single node with & without Kepler GPUs (ns/day), for configurations from 1x Xeon E5-2697 v2@2.70GHz up to 2x Xeon E5-2697 v2@2.70GHz + 2x Tesla K40@875MHz (1 node).
GROMACS 5.0 & Fastest Kepler GPUs yet!
GROMACS 5.0, cresta_methanol_rf, single node with & without Kepler GPUs (ns/day):
- 1x Xeon E5-2697 v2@2.70GHz: 0.12
- 1x CPU + 1x Tesla K20X: 0.27
- 1x CPU + 1x Tesla K40@875MHz: 0.31
- 1x CPU + 1x Tesla K80 board: 0.34
- 2x Xeon E5-2697 v2@2.70GHz (1 node): 0.19
- 2x CPU + 1x Tesla K20X (1 node): 0.30
- 2x CPU + 1x Tesla K40@875MHz (1 node): 0.36
- 2x CPU + 2x Tesla K20X (1 node): 0.46
- 2x CPU + 2x Tesla K40@875MHz (1 node): 0.52
GROMACS 5.0 & Fastest Kepler GPUs yet!
GROMACS 5.0, cresta_virus_capsid, single node with & without Kepler GPUs (ns/day):
- 1x Xeon E5-2697 v2@2.70GHz: 0.92
- 1x CPU + 1x Tesla K20X: 2.79
- 1x CPU + 1x Tesla K40@875MHz: 2.99
- 1x CPU + 1x Tesla K80 board: 3.30
- 2x Xeon E5-2697 v2@2.70GHz (1 node): 1.54
- 2x CPU + 1x Tesla K20X (1 node): 3.24
- 2x CPU + 1x Tesla K40@875MHz (1 node): 3.83
- 2x CPU + 2x Tesla K20X (1 node): 4.58
- 2x CPU + 2x Tesla K40@875MHz (1 node): 5.18
GROMACS 5.0 & Fastest Kepler GPUs yet!
GROMACS 5.0, cresta_ion_channel, 2 to 8 nodes with & without Kepler GPUs (ns/day):
- 4x Xeon E5-2697 v2@2.70GHz (2 nodes): 21.32
- + 2x Tesla K20X (2 nodes): 31.80
- + 2x Tesla K40@875MHz (2 nodes): 33.76
- + 4x Tesla K20X (2 nodes): 44.49
- + 4x Tesla K40@875MHz (2 nodes): 45.92
- 8x Xeon E5-2697 v2@2.70GHz (4 nodes): 35.99
- + 4x Tesla K20X (4 nodes): 48.85
- + 4x Tesla K40@875MHz (4 nodes): 52.25
- + 8x Tesla K20X (4 nodes): 59.28
- + 8x Tesla K40@875MHz (4 nodes): 61.16
- 16x Xeon E5-2697 v2@2.70GHz (8 nodes): 54.72
- + 8x Tesla K20X (8 nodes): 62.95
- + 8x Tesla K40@875MHz (8 nodes): 68.11
- + 16x Tesla K20X (8 nodes): 72.18
- + 16x Tesla K40@875MHz (8 nodes): 78.48
GROMACS 5.0 & Fastest Kepler GPUs yet!
GROMACS 5.0, cresta_ion_channel_vsites, 2 to 8 nodes with & without Kepler GPUs (ns/day):
- 4x Xeon E5-2697 v2@2.70GHz (2 nodes): 32.81
- + 2x Tesla K20X (2 nodes): 47.92
- + 2x Tesla K40@875MHz (2 nodes): 53.98
- + 4x Tesla K20X (2 nodes): 70.02
- + 4x Tesla K40@875MHz (2 nodes): 76.48
- 8x Xeon E5-2697 v2@2.70GHz (4 nodes): 55.66
- + 4x Tesla K20X (4 nodes): 75.50
- + 4x Tesla K40@875MHz (4 nodes): 81.26
- + 8x Tesla K20X (4 nodes): 99.26
- + 8x Tesla K40@875MHz (4 nodes): 98.37
- 16x Xeon E5-2697 v2@2.70GHz (8 nodes): 82.31
- + 8x Tesla K20X (8 nodes): 102.47
- + 8x Tesla K40@875MHz (8 nodes): 105.78
- + 16x Tesla K20X (8 nodes): 131.88
- + 16x Tesla K40@875MHz (8 nodes): 140.66
GROMACS 5.0 & Fastest Kepler GPUs yet!
GROMACS 5.0, cresta_methanol, 2 to 8 nodes, with & without Kepler GPUs. ns/day, higher is better; each node has 2x Xeon E5-2697 v2 @ 2.70 GHz, K40s clocked at 875 MHz.
2 nodes: CPU only 0.33 | +2x K20X 0.44 | +2x K40 0.47 | +4x K20X 0.63 | +4x K40 0.80
4 nodes: CPU only 0.60 | +4x K20X 0.84 | +4x K40 0.97 | +8x K20X 1.38 | +8x K40 1.53
8 nodes: CPU only 1.25 | +8x K20X 1.73 | +8x K40 1.83 | +16x K20X 2.73 | +16x K40 2.85
GROMACS 5.0 & Fastest Kepler GPUs yet!
GROMACS 5.0, cresta_methanol_rf, 2 to 8 nodes, with & without Kepler GPUs. ns/day, higher is better; each node has 2x Xeon E5-2697 v2 @ 2.70 GHz, K40s clocked at 875 MHz.
2 nodes: CPU only 0.38 | +2x K20X 0.49 | +2x K40 0.57 | +4x K20X 0.89 | +4x K40 1.05
4 nodes: CPU only 0.75 | +4x K20X 0.91 | +4x K40 1.17 | +8x K20X 1.73 | +8x K40 2.12
8 nodes: CPU only 1.48 | +8x K20X 1.86 | +8x K40 2.23 | +16x K20X 3.65 | +16x K40 4.16
GROMACS 5.0 & Fastest Kepler GPUs yet!
GROMACS 5.0, cresta_virus_capsid, 2 to 8 nodes, with & without Kepler GPUs. ns/day, higher is better; each node has 2x Xeon E5-2697 v2 @ 2.70 GHz, K40s clocked at 875 MHz.
2 nodes: CPU only 2.93 | +2x K20X 5.44 | +2x K40 5.71 | +4x K20X 8.36 | +4x K40 8.63
4 nodes: CPU only 5.53 | +4x K20X 8.99 | +4x K40 9.81 | +8x K20X 14.20 | +8x K40 12.93
8 nodes: CPU only 9.18 | +8x K20X 15.24 | +8x K40 15.57 | +16x K20X 20.30 | +16x K40 22.01
Slides – courtesy of GROMACS Dev Team
Greener Science
Running GROMACS 4.6 with CUDA 4.1. The blue nodes contain 2x Intel X5550 CPUs (95 W TDP, 4 cores per CPU). The green node contains 2x Intel X5550 CPUs (4 cores per CPU) and 2x NVIDIA M2090 GPUs (225 W TDP per GPU).
In simulating each nanosecond, the GPU-accelerated system uses 33% less energy. Energy expended = power x time.
Energy expended (kilojoules consumed, lower is better): 4 nodes at 760 W vs. 1 node + 2x M2090 at 640 W.
ADH in Water (134K Atoms)
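The comparison above is just energy = power x time applied to each configuration's wall-clock time per simulated nanosecond. A minimal sketch of the arithmetic; the runtimes below are hypothetical placeholders for illustration, not figures read off the chart:

```python
def energy_kj(power_watts, seconds):
    """Energy expended (kJ) = power (W) x time (s) / 1000."""
    return power_watts * seconds / 1000.0

# Hypothetical wall-clock times to simulate 1 ns (illustrative only):
cpu_only = energy_kj(760, 14000)  # 4 CPU-only nodes drawing 760 W total
gpu_node = energy_kj(640, 11000)  # 1 node + 2x M2090 drawing 640 W total
savings = 1 - gpu_node / cpu_only
print(f"CPU-only: {cpu_only:.0f} kJ; GPU node: {gpu_node:.0f} kJ; saved: {savings:.0%}")
```

The GPU node wins twice: it draws less total power and finishes each nanosecond sooner, so both factors of the product shrink.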
80
Recommended GPU Node Configuration for GROMACS Computational Chemistry
Workstation or Single Node Configuration
Number of CPU sockets: 2
Cores per CPU socket: 6+
CPU speed (GHz): 2.66+
System memory per socket (GB): 32
GPUs: Kepler K20, K40, K80
Number of GPUs per CPU socket: 1x (Kepler GPUs need a fast Sandy Bridge or Ivy Bridge Xeon, or a high-end AMD Opteron)
GPU memory preference (GB): 6
GPU-to-CPU connection: PCIe 3.0 or higher
Server storage: 500 GB or higher
Network configuration: Gemini, InfiniBand
March 2016
HOOMD-Blue
82
Hours to complete 10e6 sweeps (lower is better) vs. number of nodes (1 to 64), for 2^23 dodecahedra in HPMC, running on Comet.
Blue nodes contain dual Intel Xeon E5-2680 v3 @ 2.50 GHz (Haswell) CPUs (96, 384, 768, and 1536 CPU cores); green nodes contain Tesla K80 (autoboost) GPUs (4, 8, and 16 K80s; 24, 48, and 192 CPU cores).
October 2015
HOOMD-BLUE 1.0
84
Running HOOMD-Blue version 1.0. The green nodes contain dual Intel E5-2697 v2 @ 2.70 GHz CPUs plus either NVIDIA Tesla K40 @ 875 MHz or Tesla K80 (autoboost) GPUs.
Liquid benchmark, average timesteps per second (higher is better):
1 CPU node + 1x K40: 1184.44 | + 1x K80: 1496.42 | + 2x K40: 1516.91 | + 2x K80: 2068.27
HOOMD-Blue 1.0, K40 & K80, Boost impact!
85
Running HOOMD-Blue version 1.0. The green nodes contain dual Intel E5-2697 v2 @ 2.70 GHz CPUs plus either NVIDIA Tesla K40 @ 875 MHz or Tesla K80 (autoboost) GPUs.
Polymer benchmark, average timesteps per second (higher is better):
1 CPU node + 1x K40: 1031.79 | + 1x K80: 1173.01 | + 2x K40: 1203.83 | + 2x K80: 1580.45
HOOMD-Blue 1.0, K40 & K80, Boost impact!
HOOMD-Blue 1.0.0 and K40, Boost impact!
HOOMD-Blue (timesteps/s) on K40 with and without boost clocks, lj_liquid (64K particles) benchmark (CUDA 5.5, ECC on, gcc 4.7.3). Host: 2x Xeon E5-2690 v2 @ 3.00 GHz.
CPU only: 183.6 | +1x K40@745 MHz: 1017.4 | +1x K40@875 MHz: 1180.6 | +2x K40@745 MHz: 1412.9 | +2x K40@875 MHz: 1599.0 | +4x K40@745 MHz: 1989.7 | +4x K40@875 MHz: 2232.1
HOOMD-Blue 1.0.0 and K40, fastest GPU yet!
HOOMD-Blue (timesteps/s) on K40 with boost clocks (875 MHz), lj_liquid (64K particles) benchmark (CUDA 5.5, ECC on, gcc 4.7.3):
2x Xeon E5-2690 v2 @ 3.00 GHz: CPU only 183.6 | +1x K40 1180.6 | +2x K40 1599.0 | +4x K40 2232.1
4x Xeon: CPU only 343.4 | +2x K40 1621.9 | +4x K40 2166.2 | +8x K40 2721.6
8x Xeon: CPU only 582.5 | +4x K40 2257.0 | +8x K40 2684.5 | +16x K40 3235.4
HOOMD-Blue 1.0.0 and K40, fastest GPU yet!
HOOMD-Blue (timesteps/s) on K40 with boost clocks (875 MHz), polymer (64,017 particles) benchmark (CUDA 5.5, ECC on, gcc 4.7.3):
2x Xeon E5-2690 v2 @ 3.00 GHz: CPU only 179.4 | +1x K40 1015.5 | +2x K40 1249.5 | +4x K40 1759.0
4x Xeon: CPU only 338.5 | +2x K40 1214.2 | +4x K40 1696.5 | +8x K40 2082.1
8x Xeon: CPU only 576.2 | +4x K40 1773.6 | +8x K40 2038.4 | +16x K40 2434.8
HOOMD-Blue 1.0.0 and K40, fastest GPU yet!
HOOMD-Blue (timesteps/s) on K40 with boost clocks (875 MHz), lj_liquid (512K particles) benchmark (CUDA 5.5, ECC on, gcc 4.7.3):
2x Xeon E5-2690 v2 @ 3.00 GHz: CPU only 20.6 | +1x K40 161.6 | +2x K40 268.3 | +4x K40 458.0
4x Xeon: CPU only 40.2 | +2x K40 273.9 | +4x K40 463.5 | +8x K40 778.9
8x Xeon: CPU only 77.5 | +4x K40 474.0 | +8x K40 757.5 | +16x K40 1150.2
HOOMD-Blue on ARM vs. Ivy Bridge, with & without K20: equivalent performance on ARM + K20
HOOMD-Blue 1.0.0 (timesteps/s) on ARM & Ivy Bridge, with/without K20, lj_liquid (64K particles) benchmark (OpenMPI 1.8.1):
ARMv8 64-bit (2.4 GHz), 8 cores, no GPU: 31.0 | ARMv8, 8 cores + K20: 896.2 | Ivy Bridge (E5-2690 v2 @ 3.00 GHz), 8 cores: 85.4 | 20 cores: 181.8 | 20 cores + K20: 896.2
93
Webinar - June ‘14
Application-Level Evaluation (HOOMD-blue): average TPS vs. number of GPU nodes (4 to 64), comparing MV2-2.0b-GDR, MV2-NewGDR-Loopback, and MV2-NewGDR-Fastcopy.
- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- HOOMD-blue strong scaling (fixed 64K particles): Loopback and Fastcopy deliver up to 45% and 48% improvement for 32 GPUs
- HOOMD-blue weak scaling (fixed 2K particles per GPU): Loopback and Fastcopy deliver up to 54% and 56% improvement for 16 GPUs
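Strong scaling (fixed total problem size) and weak scaling (fixed work per GPU) are judged differently: under strong scaling the ideal runtime falls as 1/N, while under weak scaling it stays flat. A small sketch of the two efficiency metrics; the timings below are made up for illustration, not taken from the charts:

```python
def strong_scaling_efficiency(t1, tn, n):
    """Fixed total problem size: ideal time on n GPUs is t1/n."""
    return t1 / (n * tn)

def weak_scaling_efficiency(t1, tn):
    """Fixed problem size per GPU: ideal time on n GPUs equals t1."""
    return t1 / tn

# Illustrative timings in seconds (hypothetical):
print(strong_scaling_efficiency(100.0, 15.0, 8))  # 8-GPU strong-scaling efficiency
print(weak_scaling_efficiency(100.0, 120.0))      # 8-GPU weak-scaling efficiency
```

An efficiency of 1.0 means perfect scaling; values below 1.0 reflect communication overhead, which is exactly what the GDR/Loopback/Fastcopy variants above attack.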
October 2015
LAMMPS
95
Running LAMMPS. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes contain the same CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.
Atomic fluid, Lennard-Jones (2.5 cutoff), single precision, 2,048,000 atoms, 1 Ivy Bridge node. Average loop time in seconds (lower is better):
1 Ivy Bridge node: 6.19 | +1x K20X: 2.86 (2.2x) | +1x K40: 2.51 (2.5x) | +1x K80: 2.31 (2.7x) | +2x K20X: 2.32 (2.7x) | +2x K40: 2.14 (2.9x) | +2x K80: 2.21 (2.8x)
Lennard-Jones on K20X, K40s & K80s
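The speedup labels on these LAMMPS charts are simply the CPU-only loop time divided by each GPU configuration's loop time (lower loop time means higher speedup). Reproducing the single-precision figures above:

```python
# Average loop times (s) from the LJ 2.5-cutoff single-precision chart.
cpu_baseline = 6.19
gpu_configs = {
    "1x K20X": 2.86, "1x K40": 2.51, "1x K80": 2.31,
    "2x K20X": 2.32, "2x K40": 2.14, "2x K80": 2.21,
}
for name, loop_time in gpu_configs.items():
    print(f"+{name}: {cpu_baseline / loop_time:.1f}x")
```

The same ratio is used on every slide that follows, for both single and double precision.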
96
Atomic fluid, Lennard-Jones (2.5 cutoff), double precision, 2,048,000 atoms, 1 Ivy Bridge node. Average loop time in seconds (lower is better):
1 Ivy Bridge node: 7.98 | +1x K20X: 6.14 (1.3x) | +1x K40: 3.60 (2.2x) | +1x K80: 2.56 (3.1x) | +2x K20X: 3.85 (2.1x) | +2x K40: 2.62 (3.0x) | +2x K80: 2.47 (3.2x)
Lennard-Jones on K20X, K40s & K80s
97
Atomic fluid, Lennard-Jones (2.5 cutoff), single precision, 2,048,000 atoms, 2 Ivy Bridge nodes. Average loop time in seconds (lower is better):
2 Ivy Bridge nodes: 3.15 | +1x K20X: 1.60 (2.0x) | +1x K40: 1.34 (2.4x) | +1x K80: 1.11 (2.8x) | +2x K20X: 1.08 (3.0x) | +2x K40: 0.99 (3.2x) | +2x K80: 1.04 (3.0x)
Lennard-Jones on K20X, K40s & K80s
98
Atomic fluid, Lennard-Jones (2.5 cutoff), double precision, 2,048,000 atoms, 2 Ivy Bridge nodes. Average loop time in seconds (lower is better):
2 Ivy Bridge nodes: 4.08 | +1x K20X: 2.56 (1.6x) | +1x K40: 2.03 (2.0x) | +1x K80: 1.30 (3.1x) | +2x K20X: 1.53 (2.7x) | +2x K40: 1.29 (3.2x) | +2x K80: 1.17 (3.5x)
Lennard-Jones on K20X, K40s & K80s
99
Atomic fluid, Lennard-Jones (2.5 cutoff), single precision, 2,048,000 atoms, 4 Ivy Bridge nodes. Average loop time in seconds (lower is better):
4 Ivy Bridge nodes: 1.64 | +1x K20X: 1.00 (1.6x) | +1x K40: 0.80 (2.1x) | +1x K80: 0.65 (2.5x) | +2x K20X: 0.61 (2.7x) | +2x K40: 0.53 (3.1x) | +2x K80: 0.53 (3.1x)
Lennard-Jones on K20X, K40s & K80s
100
Atomic fluid, Lennard-Jones (2.5 cutoff), double precision, 2,048,000 atoms, 4 Ivy Bridge nodes. Average loop time in seconds (lower is better):
4 Ivy Bridge nodes: 2.09 | +1x K20X: 1.46 (1.4x) | +1x K40: 1.17 (1.8x) | +1x K80: 0.77 (2.7x) | +2x K20X: 0.86 (2.4x) | +2x K40: 0.71 (2.9x) | +2x K80: 0.61 (3.4x)
Lennard-Jones on K20X, K40s & K80s
101
Atomic fluid, Lennard-Jones (2.5 cutoff), single precision, 2,048,000 atoms, 8 Ivy Bridge nodes. Average loop time in seconds (lower is better):
8 Ivy Bridge nodes: 0.91 | +1x K20X: 0.62 (1.5x) | +1x K40: 0.48 (1.9x) | +2x K20X: 0.36 (2.5x) | +2x K40: 0.29 (3.1x)
Lennard-Jones on K20X, K40s & K80s
102
Atomic fluid, Lennard-Jones (2.5 cutoff), double precision, 2,048,000 atoms, 8 Ivy Bridge nodes. Average loop time in seconds (lower is better):
8 Ivy Bridge nodes: 1.18 | +1x K20X: 0.89 (1.3x) | +1x K40: 0.70 (1.7x) | +2x K20X: 0.51 (2.3x) | +2x K40: 0.41 (2.9x)
Lennard-Jones on K20X, K40s & K80s
103
Atomic fluid, Lennard-Jones (5.0 cutoff), single precision, 2,048,000 atoms, 1 Ivy Bridge node. Average loop time in seconds (lower is better):
1 Ivy Bridge node: 23.80 | +1x K40: 11.76 (2.0x) | +1x K80: 6.36 (3.7x) | +2x K20X: 7.56 (3.1x) | +2x K40: 6.57 (3.6x) | +2x K80: 3.95 (6.0x)
Lennard-Jones on K20X, K40s & K80s
104
Atomic fluid, Lennard-Jones (5.0 cutoff), double precision, 2,048,000 atoms, 1 Ivy Bridge node. Average loop time in seconds (lower is better):
1 Ivy Bridge node: 31.31 | +1x K40: 18.77 (1.7x) | +1x K80: 8.76 (3.6x) | +2x K20X: 11.81 (2.7x) | +2x K40: 10.30 (3.0x) | +2x K80: 5.23 (6.0x)
Lennard-Jones on K20X, K40s & K80s
105
Atomic fluid, Lennard-Jones (5.0 cutoff), single precision, 2,048,000 atoms, 2 Ivy Bridge nodes. Average loop time in seconds (lower is better):
2 Ivy Bridge nodes: 12.23 | +1x K40: 6.25 (2.0x) | +1x K80: 3.53 (3.5x) | +2x K20X: 4.21 (2.9x) | +2x K40: 3.64 (3.4x) | +2x K80: 2.27 (5.4x)
Lennard-Jones on K20X, K40s & K80s
106
Atomic fluid, Lennard-Jones (5.0 cutoff), double precision, 2,048,000 atoms, 2 Ivy Bridge nodes. Average loop time in seconds (lower is better):
2 Ivy Bridge nodes: 16.03 | +1x K40: 9.24 (1.7x) | +1x K80: 4.46 (3.6x) | +2x K20X: 6.05 (2.6x) | +2x K40: 5.02 (3.2x) | +2x K80: 2.73 (5.9x)
Lennard-Jones on K20X, K40s & K80s
107
Atomic fluid, Lennard-Jones (5.0 cutoff), single precision, 2,048,000 atoms, 4 Ivy Bridge nodes. Average loop time in seconds (lower is better):
4 Ivy Bridge nodes: 6.41 | +1x K40: 3.14 (2.0x) | +1x K80: 1.78 (3.6x) | +2x K20X: 2.30 (2.8x) | +2x K40: 1.74 (3.7x) | +2x K80: 1.22 (5.3x)
Lennard-Jones on K20X, K40s & K80s
108
Atomic fluid, Lennard-Jones (5.0 cutoff), double precision, 2,048,000 atoms, 4 Ivy Bridge nodes. Average loop time in seconds (lower is better):
4 Ivy Bridge nodes: 8.56 | +1x K40: 5.12 (1.7x) | +1x K80: 2.49 (3.4x) | +2x K20X: 3.38 (2.5x) | +2x K40: 2.82 (3.0x) | +2x K80: 1.58 (5.4x)
Lennard-Jones on K20X, K40s & K80s
109
Atomic fluid, Lennard-Jones (5.0 cutoff), single precision, 2,048,000 atoms, 8 Ivy Bridge nodes. Average loop time in seconds (lower is better):
8 Ivy Bridge nodes: 3.52 | +1x K40: 1.90 (1.9x) | +2x K20X: 1.28 (2.8x) | +2x K40: 1.03 (3.4x)
Lennard-Jones on K20X and K40s
110
Atomic fluid, Lennard-Jones (5.0 cutoff), double precision, 2,048,000 atoms, 8 Ivy Bridge nodes. Average loop time in seconds (lower is better):
8 Ivy Bridge nodes: 4.60 | +1x K40: 2.73 (1.7x) | +2x K20X: 1.95 (2.4x) | +2x K40: 1.48 (3.1x)
Lennard-Jones on K20X and K40s
111
Lennard-Jones single/multi-node throughput
Figure 1: Lennard-Jones single-node throughput (strong scaling). Figure 2: Lennard-Jones multi-node throughput.
112
EAM, single precision, 2,048,000 atoms, 1 Ivy Bridge node. Average loop time in seconds (lower is better):
1 Ivy Bridge node: 203.01 | +1x K20X: 106.32 (1.9x) | +1x K40: 85.71 (2.4x) | +1x K80: 44.17 (4.6x) | +2x K20X: 47.53 (4.3x) | +2x K40: 40.27 (5.0x) | +2x K80: 44.12 (4.6x)
EAM on K20X, K40s & K80s
113
EAM, double precision, 2,048,000 atoms, 1 Ivy Bridge node. Average loop time in seconds (lower is better):
1 Ivy Bridge node: 202.76 | +1x K20X: 150.44 (1.3x) | +1x K40: 119.60 (1.7x) | +1x K80: 66.51 (3.0x) | +2x K20X: 67.65 (3.0x) | +2x K40: 54.34 (3.7x) | +2x K80: 67.93 (3.0x)
EAM on K20X, K40s & K80s
114
EAM, single precision, 2,048,000 atoms, 2 Ivy Bridge nodes. Average loop time in seconds (lower is better):
2 Ivy Bridge nodes: 104.74 | +1x K20X: 57.77 (1.8x) | +1x K40: 46.95 (2.2x) | +1x K80: 23.68 (4.4x) | +2x K20X: 26.03 (4.0x) | +2x K40: 22.08 (4.7x) | +2x K80: 23.65 (4.4x)
EAM on K20X, K40s & K80s
115
EAM, double precision, 2,048,000 atoms, 2 Ivy Bridge nodes. Average loop time in seconds (lower is better):
2 Ivy Bridge nodes: 104.81 | +1x K20X: 80.30 (1.3x) | +1x K40: 64.62 (1.6x) | +1x K80: 33.49 (3.1x) | +2x K20X: 36.89 (2.8x) | +2x K40: 29.96 (3.5x) | +2x K80: 32.67 (3.2x)
EAM on K20X, K40s & K80s
116
EAM, single precision, 2,048,000 atoms, 4 Ivy Bridge nodes. Average loop time in seconds (lower is better):
4 Ivy Bridge nodes: 54.89 | +1x K20X: 33.38 (1.6x) | +1x K40: 28.09 (2.0x) | +1x K80: 13.96 (3.9x) | +2x K20X: 15.00 (3.7x) | +2x K40: 12.99 (4.2x) | +2x K80: 14.03 (3.9x)
EAM on K20X, K40s & K80s
117
EAM, double precision, 2,048,000 atoms, 4 Ivy Bridge nodes. Average loop time in seconds (lower is better):
4 Ivy Bridge nodes: 54.63 | +1x K20X: 44.86 (1.2x) | +1x K40: 36.95 (1.5x) | +1x K80: 18.41 (3.0x) | +2x K20X: 20.78 (2.6x) | +2x K40: 17.41 (3.1x) | +2x K80: 18.21 (3.0x)
EAM on K20X, K40s & K80s
118
EAM, single precision, 2,048,000 atoms, 8 Ivy Bridge nodes. Average loop time in seconds (lower is better):
8 Ivy Bridge nodes: 28.49 | +1x K20X: 24.46 (1.2x) | +1x K40: 21.49 (1.3x) | +2x K20X: 12.35 (2.3x) | +2x K40: 11.06 (2.6x)
EAM on K20X and K40s
119
EAM, double precision, 2,048,000 atoms, 8 Ivy Bridge nodes. Average loop time in seconds (lower is better):
8 Ivy Bridge nodes: 29.11 | +1x K40: 24.39 (1.2x) | +2x K20X: 14.35 (2.0x) | +2x K40: 12.66 (2.3x)
EAM on K20X and K40s
120
EAM single/multi-node throughput
Figure 1: EAM single-node throughput (strong scaling). Figure 2: EAM multi-node throughput.
121
Gay-Berne, single precision, 2,097,152 atoms, 1 Ivy Bridge node. Average loop time in seconds (lower is better):
1 Ivy Bridge node: 162.20 | +1x K40: 52.66 (3.1x) | +1x K80: 34.06 (4.8x) | +2x K20X: 39.05 (4.2x) | +2x K40: 35.35 (4.6x) | +2x K80: 27.48 (5.9x)
Gay-Berne on K20X, K40s & K80s
122
Gay-Berne, double precision, 2,097,152 atoms, 1 Ivy Bridge node. Average loop time in seconds (lower is better):
1 Ivy Bridge node: 254.81 | +1x K40: 186.42 (1.4x) | +1x K80: 80.08 (3.2x) | +2x K20X: 133.79 (1.9x) | +2x K40: 102.55 (2.5x) | +2x K80: 49.06 (5.2x)
Gay-Berne on K20X, K40s & K80s
123
Gay-Berne, single precision, 2,097,152 atoms, 2 Ivy Bridge nodes. Average loop time in seconds (lower is better):
2 Ivy Bridge nodes: 82.40 | +1x K40: 24.98 (3.3x) | +1x K80: 17.02 (4.8x) | +2x K20X: 18.60 (4.4x) | +2x K40: 16.89 (4.9x) | +2x K80: 14.08 (5.9x)
Gay-Berne on K20X, K40s & K80s
124
Gay-Berne, double precision, 2,097,152 atoms, 2 Ivy Bridge nodes. Average loop time in seconds (lower is better):
2 Ivy Bridge nodes: 128.97 | +1x K40: 92.50 (1.4x) | +1x K80: 53.11 (2.4x) | +2x K20X: 66.49 (1.9x) | +2x K40: 50.85 (2.5x) | +2x K80: 23.95 (5.4x)
Gay-Berne on K20X, K40s & K80s
125
Gay-Berne, single precision, 2,097,152 atoms, 4 Ivy Bridge nodes. Average loop time in seconds (lower is better):
4 Ivy Bridge nodes: 42.43 | +1x K40: 12.38 (3.4x) | +1x K80: 9.19 (4.6x) | +2x K20X: 9.53 (4.5x) | +2x K40: 8.47 (5.0x) | +2x K80: 7.31 (5.8x)
Gay-Berne on K20X, K40s & K80s
126
Gay-Berne, double precision, 2,097,152 atoms, 4 Ivy Bridge nodes. Average loop time in seconds (lower is better):
4 Ivy Bridge nodes: 66.17 | +1x K40: 45.98 (1.4x) | +1x K80: 19.43 (3.4x) | +2x K20X: 33.24 (2.0x) | +2x K40: 25.26 (2.6x) | +2x K80: 12.33 (5.4x)
Gay-Berne on K20X, K40s & K80s
127
Gay-Berne, single precision, 2,097,152 atoms, 8 Ivy Bridge nodes. Average loop time in seconds (lower is better):
8 Ivy Bridge nodes: 22.51 | +1x K40: 6.16 (3.7x) | +2x K20X: 5.14 (4.4x) | +2x K40: 4.35 (5.2x)
Gay-Berne on K20X and K40s
128
Gay-Berne, double precision, 2,097,152 atoms, 8 Ivy Bridge nodes. Average loop time in seconds (lower is better):
8 Ivy Bridge nodes: 35.00 | +1x K40: 23.23 (1.5x) | +2x K20X: 17.05 (2.1x) | +2x K40: 13.01 (2.7x)
Gay-Berne on K20X and K40s
129
Rhodopsin, single precision, 2,048,000 atoms, 1 Ivy Bridge node. Average loop time in seconds (lower is better):
1 Ivy Bridge node: 104.04 | +1x K40: 55.25 (1.9x) | +1x K80: 30.12 (3.5x) | +2x K20X: 38.96 (2.7x) | +2x K40: 32.17 (3.2x) | +2x K80: 23.34 (4.5x)
Rhodopsin on K20X, K40s & K80s
130
Rhodopsin, double precision, 2,048,000 atoms, 1 Ivy Bridge node. Average loop time in seconds (lower is better):
1 Ivy Bridge node: 138.20 | +1x K80: 56.53 (2.4x) | +2x K20X: 100.37 (1.4x) | +2x K40: 76.35 (1.8x) | +2x K80: 32.43 (4.3x)
Rhodopsin on K20X, K40s & K80s
131
Rhodopsin, single precision, 2,048,000 atoms, 2 Ivy Bridge nodes. Average loop time in seconds (lower is better):
2 Ivy Bridge nodes: 53.28 | +1x K40: 27.27 (2.0x) | +1x K80: 14.93 (3.6x) | +2x K20X: 19.57 (2.7x) | +2x K40: 15.85 (3.4x) | +2x K80: 11.99 (4.4x)
Rhodopsin on K20X, K40s & K80s
132
Rhodopsin, double precision, 2,048,000 atoms, 2 Ivy Bridge nodes. Average loop time in seconds (lower is better):
2 Ivy Bridge nodes: 70.42 | +1x K80: 27.42 (2.6x) | +2x K20X: 50.59 (1.4x) | +2x K40: 38.15 (1.8x) | +2x K80: 15.78 (4.5x)
Rhodopsin on K20X, K40s & K80s
133
Rhodopsin, single precision, 2,048,000 atoms, 4 Ivy Bridge nodes. Average loop time in seconds (lower is better):
4 Ivy Bridge nodes: 28.52 | +1x K40: 14.64 (1.9x) | +1x K80: 8.34 (3.4x) | +2x K20X: 10.18 (2.8x) | +2x K40: 8.58 (3.3x) | +2x K80: 7.04 (4.1x)
Rhodopsin on K20X, K40s & K80s
134
Rhodopsin, Double Precision (2,048,000 atoms), average loop time in seconds (lower is better):
4 Ivybridge Nodes: 37.36 s (baseline)
4 CPU Nodes + 1x K40: 36.97 s (1.0X)
4 CPU Nodes + 1x K80: 14.21 s (2.6X)
4 CPU Nodes + 2x K20X: 25.49 s (1.5X)
4 CPU Nodes + 2x K40: 19.68 s (1.9X)
4 CPU Nodes + 2x K80: 8.33 s (4.5X)
Running LAMMPS: the blue node contains dual Intel E5-2698 v3 @ 2.3 GHz CPUs; the green nodes contain dual Intel E5-2698 v3 @ 2.3 GHz CPUs plus either NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 @ 562 MHz GPUs.
Rhodopsin on K20X, K40s & K80s
Rhodopsin, Single Precision (2,048,000 atoms), average loop time in seconds (lower is better):
8 Ivybridge Nodes: 15.10 s (baseline)
8 CPU Nodes + 1x K40: 7.93 s (1.9X)
8 CPU Nodes + 2x K20X: 5.50 s (2.7X)
8 CPU Nodes + 2x K40: 4.79 s (3.2X)
Rhodopsin on K20X and K40s
Rhodopsin, Double Precision (2,048,000 atoms), average loop time in seconds (lower is better):
8 Ivybridge Nodes: 19.64 s (baseline)
8 CPU Nodes + 1x K40: 19.31 s (1.0X)
8 CPU Nodes + 2x K20X: 13.40 s (1.5X)
8 CPU Nodes + 2x K40: 10.28 s (1.9X)
More Science for Your Money (LAMMPS)
Embedded Atom Model, speedup compared to CPU only (higher is better):
CPU Only: 1.0X
CPU + 1x K10: 1.7X
CPU + 1x K20: 2.47X
CPU + 1x K20X: 2.92X
CPU + 2x K10: 3.3X
CPU + 2x K20: 4.5X
CPU + 2x K20X: 5.5X
The blue node uses 2x E5-2687W CPUs (8 cores and 150 W per CPU). The green nodes have 2x E5-2687W CPUs and 1 or 2 NVIDIA K10, K20, or K20X GPUs (235 W each).
Experience performance increases of up to 5.5x with Kepler GPU nodes.
Excellent Strong Scaling on Large Clusters
[Chart: LAMMPS Gay-Berne, 134M atoms; loop time in seconds vs. node count (300 to 900) for the GPU-accelerated XK6 and the CPU-only XE6.]
Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU). Each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090 GPU.
From 300 to 900 nodes, the NVIDIA GPU-powered XK6 maintained roughly 3.5x the performance of the XE6 CPU nodes (chart callouts: 3.55x, 3.45x, 3.48x).
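In a strong-scaling test like the one above, the problem size is fixed (134M atoms) while nodes are added, so a useful figure of merit is parallel efficiency: the fraction of the ideal speedup actually retained. A minimal sketch with hypothetical loop times (the slide shows the scaling curve, not the raw numbers):

```python
# Strong scaling: fixed problem size, increasing node count.
def strong_scaling_efficiency(t_base, n_base, t, n):
    """Fraction of ideal speedup retained when scaling from n_base to n nodes.

    t_base, t: loop times (seconds) at n_base and n nodes.
    """
    ideal = n / n_base        # e.g. 3x more nodes -> ideally 3x faster
    actual = t_base / t       # observed speedup from the loop times
    return actual / ideal

# Hypothetical XK6 loop times at 300 and 900 nodes (placeholders, not slide data):
eff = strong_scaling_efficiency(600.0, 300, 220.0, 900)
print(f"Parallel efficiency at 900 nodes: {eff:.2f}")
```

Near-constant GPU-vs-CPU ratios across node counts, as reported on the slide, mean both machines lose efficiency at a similar rate.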
GPUs Sustain 5x Performance for Weak Scaling
[Chart: LAMMPS weak scaling with 32K atoms per node; loop time in seconds vs. node count (1 to 729) for XE6 CPU-only and XK6 GPU-accelerated nodes.]
Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU). Each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090 GPU.
GPU-accelerated nodes delivered 4.8x to 6.7x the performance of CPU-only nodes (chart callouts: 4.8x, 5.8x, 6.7x).
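In a weak-scaling run the problem grows with the machine (here 32K atoms per node), so the GPU speedup is computed as the ratio of CPU-only to GPU-accelerated loop time at each node count. A sketch with hypothetical loop times chosen to reproduce the reported 4.8x and 6.7x endpoints (the slide gives only the ratios, not the raw times):

```python
# Weak scaling: problem size grows with node count (32K atoms per node),
# so the ideal loop time stays flat as nodes are added.
def weak_scaling_speedups(cpu_times, gpu_times):
    """Per-node-count speedup: CPU-only loop time / GPU-accelerated loop time."""
    return [c / g for c, g in zip(cpu_times, gpu_times)]

# Hypothetical XE6 (CPU) and XK6 (GPU) loop times at increasing node counts;
# placeholders, not slide data.
cpu_times = [24.0, 26.0, 33.5]
gpu_times = [5.0, 4.5, 5.0]
print(weak_scaling_speedups(cpu_times, gpu_times))
```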
Faster, Greener — Worth It! (LAMMPS)
Energy Expended = Power x Time
Lower is better
GPU-accelerated computing uses 53% less energy than CPU only
Power is calculated by combining the components' TDPs. The blue node uses 2x E5-2687W CPUs (8 cores and 150 W per CPU) with CUDA 4.2.9. The green nodes have 2x E5-2687W CPUs and 1 or 2 NVIDIA K20X GPUs (235 W each) running CUDA 5.0.35.
[Chart: energy expended (kJ) in one loop of EAM for 1 node, 1 node + 1x K20X, and 1 node + 2x K20X.]
Try GPU-accelerated LAMMPS for free: www.nvidia.com/GPUTestDrive
Accelerate LAMMPS Simulations with GPUs
Single-node speedup: XK7 without GPU 1.0X; XK7 with GPU 6.6X.
*Brown, W.M., Yamada, M., "Implementing Molecular Dynamics on Hybrid High Performance Computers – Three-Body Potentials," Computer Physics Communications (2013, submitted)
“Summary of best speedups versus running on a single XK7 CPU for CPU-only and accelerated runs. Simulation is 400 timesteps for a 1 million molecule droplet. The speedups are calculated based on the single node loop time of 440.3 seconds.”
Accelerate LAMMPS Simulations with GPUs
Speedup on 64 nodes: XK7 without GPU 41.6X; XK7 with GPU 211X.
Recommended GPU Node Configuration for LAMMPS Computational Chemistry
Workstation or Single Node Configuration
# of CPU sockets: 2
Cores per CPU socket: 6+
CPU speed (GHz): 2.66+
System memory per socket (GB): 32
GPUs: GTX Titan X, Kepler K20, K40, K80, M40
# of GPUs per CPU socket: 1-2
GPU memory preference (GB): 6+
GPU to CPU connection: PCIe 3.0 or higher
Server storage: 500 GB or higher
Network configuration: Gemini, InfiniBand
Scale to thousands of nodes with same single node configuration
NAMD 2.11 – Up to 2X Faster
New GPU features in NAMD 2.11
- GPU-accelerated simulations up to twice as fast as NAMD 2.10
- Pressure calculation with fixed atoms on GPU works as on CPU
- Improved scaling for GPU-accelerated particle-mesh Ewald calculation
- CPU-side operations overlap better and are parallelized across cores.
- Improved scaling for GPU-accelerated simulations
- Nonbonded force calculation results are streamed from the GPU for better overlap.
- NVIDIA CUDA GPU-acceleration binaries for Mac OS X
Selected Text from the NAMD website
147
NAMD 2.11 is up to 2x faster
[Chart: NAMD 2.11 vs. NAMD 2.10 simulated time (ns/day), APoA1 (92,224 atoms).]
Speedup of NAMD 2.11 over NAMD 2.10: 1 node 1.2X; 2 nodes 1.6X; 4 nodes 2.0X.
The NAMD 2.10 and NAMD 2.11 runs used nodes containing dual Intel E5-2697 v2 @ 2.7 GHz (Ivy Bridge) CPUs plus 2 Tesla K80 (autoboost) GPUs.
NAMD 2.11 APoA1 on 1 and 2 nodes
Running NAMD version 2.11: the blue nodes contain dual Intel E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the green nodes contain dual Intel E5-2698 v3 @ 2.3 GHz (Haswell) CPUs plus Tesla K80 (autoboost) GPUs.
APoA1 (92,224 atoms), simulated time in ns/day (higher is better):
1 Node: 2.77
1 Node + 1x K80: 11.67 (4.2X)
1 Node + 2x K80: 16.99 (6.1X)
2 Nodes: 5.22
2 Nodes + 1x K80: 19.73 (3.8X)
2 Nodes + 2x K80: 24.31 (4.7X)
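The ns/day throughput reported here converts directly into the wall-clock time needed for a target trajectory length: days = target_ns / (ns/day). A sketch using the single-node APoA1 rates above (16.99 ns/day with 2x K80, 2.77 ns/day CPU only):

```python
# Wall-clock estimate from a simulation throughput given in ns/day.
# Rates taken from the APoA1 chart above.
def days_to_simulate(target_ns, ns_per_day):
    """Days of wall-clock time to produce target_ns of trajectory."""
    return target_ns / ns_per_day

# How long would a 100 ns APoA1 trajectory take?
print(f"{days_to_simulate(100, 16.99):.1f} days on 1 node + 2x K80")
print(f"{days_to_simulate(100, 2.77):.1f} days on the CPU-only node")
```

This is why ns/day, rather than raw loop time, is the headline metric for MD throughput comparisons.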
NAMD 2.11 APoA1 on 4 and 8 nodes
APoA1 (92,224 atoms), simulated time in ns/day (higher is better):
4 Nodes: 10.27
4 Nodes + 1x K80: 20.64 (2.0X)
4 Nodes + 2x K80: 23.52 (2.3X)
8 Nodes: 16.85
8 Nodes + 1x K80: 27.83 (1.7X)
8 Nodes + 2x K80: 27.74 (1.6X)
NAMD 2.11 is up to 1.8x faster
[Chart: NAMD 2.11 vs. NAMD 2.10 simulated time (ns/day), F1-ATPase (327,506 atoms).]
Speedup of NAMD 2.11 over NAMD 2.10: 1 node 1.1X; 2 nodes 1.8X; 4 nodes 1.4X.
The NAMD 2.10 and NAMD 2.11 runs used nodes containing dual Intel E5-2697 v2 @ 2.7 GHz (Ivy Bridge) CPUs plus 2 Tesla K80 (autoboost) GPUs.
NAMD 2.11 F1-ATPase on 1 and 2 nodes
F1-ATPase (327,506 atoms), simulated time in ns/day (higher is better):
1 Node: 0.94
1 Node + 1x K80: 3.87 (4.1X)
1 Node + 2x K80: 6.11 (6.5X)
2 Nodes: 1.86
2 Nodes + 1x K80: 7.23 (3.9X)
2 Nodes + 2x K80: 10.58 (5.7X)
NAMD 2.11 F1-ATPase on 4 and 8 nodes
F1-ATPase (327,506 atoms), simulated time in ns/day (higher is better):
4 Nodes: 3.63
4 Nodes + 1x K80: 11.66 (3.2X)
4 Nodes + 2x K80: 12.62 (3.5X)
8 Nodes: 6.88
8 Nodes + 1x K80: 14.22 (2.1X)
8 Nodes + 2x K80: 15.74 (2.3X)
NAMD 2.11 is up to 1.5x faster
[Chart: NAMD 2.11 vs. NAMD 2.10 simulated time (ns/day), STMV (1,066,628 atoms).]
Speedup of NAMD 2.11 over NAMD 2.10: 1 node 1.5X; 2 nodes 1.1X; 4 nodes 1.5X.
The NAMD 2.10 and NAMD 2.11 runs used nodes containing dual Intel E5-2697 v2 @ 2.7 GHz (Ivy Bridge) CPUs plus 2 Tesla K80 (autoboost) GPUs.
NAMD 2.11 STMV on 1 and 2 nodes
STMV (1,066,628 atoms), simulated time in ns/day (higher is better):
1 Node: 0.23
1 Node + 1x K80: 1.03 (4.5X)
1 Node + 2x K80: 1.75 (7.6X)
2 Nodes: 0.46
2 Nodes + 1x K80: 1.98 (4.3X)
2 Nodes + 2x K80: 3.27 (7.1X)
NAMD 2.11 STMV on 4 and 8 nodes
STMV (1,066,628 atoms), simulated time in ns/day (higher is better):
4 Nodes: 0.90
4 Nodes + 1x K80: 3.61 (4.0X)
4 Nodes + 2x K80: 4.54 (5.0X)
8 Nodes: 1.74
8 Nodes + 1x K80: 5.86 (3.4X)
8 Nodes + 2x K80: 6.24 (3.6X)
Recommended GPU Node Configuration for NAMD Computational Chemistry
Workstation or Single Node Configuration
# of CPU sockets: 2
Cores per CPU socket: 6+
CPU speed (GHz): 2.66+
System memory per socket (GB): 32
GPUs: Kepler K20, K40, K80
# of GPUs per CPU socket: 1-2
GPU memory preference (GB): 6-12
GPU to CPU connection: PCIe 3.0 or higher
Server storage: 500 GB or higher
Network configuration: Gemini, InfiniBand
Scale to thousands of nodes with same single node configuration
Benefits of MD GPU-Accelerated Computing
- 3x-8x faster on average than CPU-only systems across all tests
- Most major compute-intensive aspects of classical MD are ported
- Large performance boost with a marginal price increase
- Energy usage cut by more than half
- GPUs scale well within a node and across multiple nodes
- The K80 is NVIDIA's fastest and lowest-power high-performance GPU yet
Try GPU accelerated MD apps for free – www.nvidia.com/GPUTestDrive
Why wouldn’t you want to turbocharge your research?
February 11, 2016