SLIDE 1


Speed Without Compromise: Precision and Methodology Innovation in the AMBER GPU MD Software

Ross Walker, Associate Professor and NVIDIA CUDA Fellow
San Diego Supercomputer Center
UC San Diego Department of Chemistry & Biochemistry

SLIDE 2


Molecular Dynamics for the 99%

  • Develop a GPU-accelerated version of AMBER's PMEMD.

Ross C. Walker (San Diego Supercomputer Center), Scott Le Grand (NVIDIA)


Partly funded under the NSF SI2-SSE Program.

Taking MD to 11

SLIDE 3


Project Info

  • AMBER Website: http://ambermd.org/gpus/

Publications

1. Salomon-Ferrer, R.; Goetz, A.W.; Poole, D.; Le Grand, S.; Walker, R.C. "Routine microsecond molecular dynamics simulations with AMBER - Part II: Particle Mesh Ewald", J. Chem. Theory Comput., 2013, 9 (9), pp 3878-3888. DOI: 10.1021/ct400314y
2. Goetz, A.W.; Williamson, M.J.; Xu, D.; Poole, D.; Le Grand, S.; Walker, R.C. "Routine microsecond molecular dynamics simulations with AMBER - Part I: Generalized Born", J. Chem. Theory Comput., 2012, 8 (5), pp 1542-1555. DOI: 10.1021/ct200909j
3. Pierce, L.C.T.; Salomon-Ferrer, R.; de Oliveira, C.A.F.; McCammon, J.A.; Walker, R.C. "Routine access to millisecond timescale events with accelerated molecular dynamics", J. Chem. Theory Comput., 2012, 8 (9), pp 2997-3002. DOI: 10.1021/ct300284c
4. Salomon-Ferrer, R.; Case, D.A.; Walker, R.C. "An overview of the Amber biomolecular simulation package", WIREs Comput. Mol. Sci., 2012, in press. DOI: 10.1002/wcms.1121
5. Le Grand, S.; Goetz, A.W.; Walker, R.C. "SPFP: Speed without compromise - a mixed precision model for GPU accelerated molecular dynamics simulations", Comput. Phys. Commun., 2013, 184, pp 374-380. DOI: 10.1016/j.cpc.2012.09.022


SLIDE 4


Design Goals

Overriding Design Goal: Sampling for the 99%

  • Focus on systems of roughly 4 million atoms or fewer.
  • Maximize single-workstation performance.
  • Focus on minimizing costs.
  • Be able to use very cheap nodes.
  • Support both gaming and Tesla cards.
  • Ease of use (same input, same output).


(Figure: pyramid of users: the <0.0001%, the 1.0%, the 99.0%.)

SLIDE 5


Simplicity - Appliances for the 99%


SLIDE 6


AMBER Server (ca. 2013)


$8999

SLIDE 7


Digits Dev Box (ca. 2015)


$15,000

SLIDE 8



http://exxactcorp.com/index.php/solution/solu_detail/225

SLIDE 9


DGX-99 (Deep Learning for the 99%)


http://exxactcorp.com/index.php/solution/solu_detail/252

20x Titan-X = 133 TFLOPS FP32 in one node. DGX-1 = 85 TFLOPS FP32 in one node.

SLIDE 10



SLIDE 11


Map problem onto GPU hardware

  • Subdivide the force matrix into 3 classes of independent tiles: off-diagonal, on-diagonal, and redundant.
  • Map non-redundant tiles to warps.
  • SMs consume tiles.
  • Avoid race conditions by dividing the calculation in both space (tiles) and time (warps).

Example: nonbonded forces, with the force matrix indexed by atom i and atom j.

(Figure: per-SM queues of Warp 0 ... Warp n across SM 0 ... SM m; labels: Shared Memory, Registers.)

Patent: US 8473948 B1
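As a concrete illustration of this decomposition, here is a minimal warp-per-tile sketch. It is not the pmemd.cuda kernel: the Tile struct, the nonbond_tiles name, the placeholder pair interaction, and the fixed-point accumulator are invented for this example; only the tile-to-warp mapping itself comes from the slide.

    // Hypothetical warp-per-tile kernel (illustration only, not pmemd.cuda).
    // Launched with blockDim.x = 32, so each thread block is exactly one warp.
    #include <cuda_runtime.h>

    #define TILE 32

    struct Tile { int row, col; };                    // tile coordinates in the force matrix

    __global__ void nonbond_tiles(const float4* atoms, const Tile* tiles, int numTiles,
                                  unsigned long long* fxAcc)   // fixed-point x-force per atom
    {
        __shared__ float4 j[TILE];                    // j atoms staged in shared memory
        if (blockIdx.x >= numTiles) return;
        Tile t = tiles[blockIdx.x];                   // this warp consumes one tile
        int lane = threadIdx.x;

        float4 iAtom = atoms[t.row * TILE + lane];    // each lane holds one i atom in registers
        j[lane] = atoms[t.col * TILE + lane];
        __syncthreads();

        float fx = 0.0f;
        for (int k = 0; k < TILE; ++k) {
            float4 ja = j[(lane + k) % TILE];         // staggered sweep avoids bank conflicts
            fx += iAtom.x - ja.x;                     // placeholder for the real pair force
        }
        // Accumulate in 64-bit fixed point; integer adds are order-independent.
        atomicAdd(&fxAcc[t.row * TILE + lane],
                  (unsigned long long)llrintf(fx * (float)(1ll << 40)));
    }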

SLIDE 12


Version History

  • AMBER 10 – Released Apr 2008
    • Implicit solvent GB GPU support released as a patch Sept 2009.
  • AMBER 11 – Released Apr 2010
    • Implicit and explicit solvent supported internally on a single GPU.
    • Oct 2010 – Bugfix.9 doubled performance on a single GPU and added multi-GPU support.
  • AMBER 12 – Released Apr 2012
    • Added umbrella sampling support, REMD, simulated annealing, aMD, IPS, and Extra Points.
    • Aug 2012 – Bugfix.9: new SPFP precision model, support for Kepler I, GPU-accelerated NMR restraints, improved performance.
    • Jan 2013 – Bugfix.14: support for CUDA 5.0, Jarzynski on GPU, GBSA, Kepler II support.


SLIDE 13


Version History

  • AMBER 14 – Released Apr 2014
  • ~20-30% performance improvement for single GPU runs.
  • Peer-to-peer support for multi-GPU runs providing enhanced multi-GPU scaling.
  • Hybrid bitwise-reproducible fixed-point precision model (SPFP) as standard.
  • Support for Extra Points in Multi-GPU runs.
  • Jarzynski Sampling
  • GBSA support
  • Support for off-diagonal modifications to VDW parameters.
  • Multi-dimensional Replica Exchange (Temperature and Hamiltonian)
  • Support for CUDA 5.0, 5.5 and 6.0
  • Support for latest generation GPUs.
  • Monte Carlo barostat support providing NPT performance equivalent to NVT.
  • ScaledMD support.
  • Improved accelerated (aMD) MD support.
  • Explicit solvent constant pH support.
  • NMR restraint support on multiple GPUs.
  • Improved error messages and checking.
  • Hydrogen mass repartitioning support (4fs time steps).


SLIDE 14


AMBER 16 (GPU) – Coming Apr 2016

  • Enhanced NMR restraints.
  • R^6 restraint averaging.
  • Gaussian Accelerated Molecular Dynamics.
  • Optimized binary I/O support (mdcrd and restrt).
  • External electric field support.
  • Expanded umbrella sampling.
  • Maxwell-specific optimizations.
  • Another 20 to 30% performance improvement!
  • New SPXP precision model for Maxwell and future hardware.


Amber 2016 Reference Manual

(Covers Amber16 and AmberTools16)

SLIDE 15


A Question of Dynamic Range

32-bit floating point has approximately 7 significant figures. When it happens: PBC, SHAKE, and force accumulation.

          1.456702 + 0.3046714 = 1.761373
          1.761373 - 1.456702  = 0.3046710   (lost a sig fig)

    1456702.0000000 + 0.3046714       = 1456702.0000000
    1456702.0000000 - 1456702.0000000 = 0.0000000   (lost everything)
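A few lines of ordinary C++ reproduce the effect; this is a demo of the arithmetic above, not AMBER code.

    // Demonstrates the dynamic-range problem of 32-bit floats.
    #include <cstdio>

    int main()
    {
        const float small = 0.3046714f;

        // Operands of similar magnitude: the sum keeps ~7 significant figures.
        float a = 1.456702f + small;
        printf("%.7f\n", a - 1.456702f);   // ~0.304671x: only the last figure is noise

        // Operands 7 orders of magnitude apart: the ULP at 1456702.0f is 0.125,
        // so almost all of the small addend is rounded away.
        float b = 1456702.0f + small;
        printf("%.7f\n", b - 1456702.0f);  // prints 0.2500000
        return 0;
    }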

SLIDE 16


Precision Models

SPSP – Use single precision for the entire calculation, with the exception of SHAKE, which is always done in double precision.
SPDP – Use a combination of single precision for the calculation and double precision for accumulation (default < AMBER 12.9).
DPDP – Use double precision for the entire calculation.


SLIDE 17


Validation and Precision Testing

  • Measure a combination of elements that depend on both static energies/forces and ensemble averages:
    • Energy conservation.
    • Optimized structures.
    • Free energy surfaces.
    • Order parameters.
    • RMSF.
    • Radial distribution functions, etc.
  • Two aims:
    • Is our implementation valid/correct?
    • What level of approximation in precision is acceptable?

SLIDE 18


Force Accuracy

SLIDE 19


Energy Conservation

SLIDE 20


Free Energy Surfaces

(Panels: CPU (DP) vs GPU (SPDP).)

SLIDE 21


Explicit Solvent Performance (JAC DHFR Production Benchmark)

SLIDE 22


But then…


GTX 680 and K10 ruined the party: DP performance REALLY sucked. A 4-month delay in usefulness while we developed and tested a new precision model.

SLIDE 23


SPFP

  • Single/Double/Fixed precision hybrid, designed for optimum performance on Kepler I.
  • Uses fire-and-forget atomic ops.
  • Fully deterministic, faster and more precise than SPDP, minimal memory overhead. (Default >= AMBER 12.9.)


Q24.40 for Forces, Q34.30 for Energies / Virials
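The core of the scheme fits in a few lines. The sketch below is an assumed illustration of Q24.40 force accumulation, not AMBER source; atomic_add_force is an invented name, and FORCESCALEF follows the convention used on the SPXP slides later in this deck.

    // SPFP-style fixed-point force accumulation (illustrative sketch).
    #define FORCESCALEF ((float)(1ll << 40))   // Q24.40: 40 fractional bits

    __device__ void atomic_add_force(unsigned long long* acc, float f)
    {
        // Round the single-precision force into 64-bit fixed point and add it
        // atomically ("fire and forget"). Integer addition is associative, so
        // the total is bitwise identical whatever order threads contribute in.
        atomicAdd(acc, (unsigned long long)llrintf(f * FORCESCALEF));
    }

    // Readback: double f = (double)(long long)acc_value / (double)FORCESCALEF;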

SLIDE 24


Reproducibility: Critical for Debugging Software and Hardware


  • The SPFP precision model is bitwise reproducible.
  • Same simulation from the same random seed = same result.
  • Used to validate hardware (catch misbehaving GPUs) for Exxact AMBER Certified Machines.
  • Successfully identified 3 GPU models with underlying hardware issues that needed post-release fixes: GTX-Titan, GTX-780Ti, GTX-Titan-Black.

SLIDE 25


Reproducibility

Run     Good GPU Etot    Bad GPU Etot
0.0     -58229.3134      -58229.3134
0.1     -58229.3134      -58227.1072
0.2     -58229.3134      -58229.3134
0.3     -58229.3134      -58218.9033
0.4     -58229.3134      -58217.2088
0.5     -58229.3134      -58229.3134
0.6     -58229.3134      -58228.3001
0.7     -58229.3134      -58229.3134
0.8     -58229.3134      -58229.3134
0.9     -58229.3134      -58231.6743
0.10    -58229.3134      -58229.3134


Final Energy after 10^6 MD steps (~45 mins per run)

SLIDE 26


Worked Great Until Maxwell

Performance (ns/day), DHFR (NVE) HMR 4 fs, 23,558 atoms:

Hardware                              ns/day
2x E5-2660v2 CPU (16 cores)            30.21
1x C2075                               81.26
2x C2075                              129.79
1x GTX 780                            251.43
1x GTX 980                            262.39
1x GTX Titan Black                    280.54
2x GTX Titan Black                    383.32
GTX-Titan-Z (1 GPU, 1/2 board)        261.82
GTX-Titan-Z (2 GPUs, full board)      356.48
1x K8                                 116.09
1x K20                                196.99
2x K20                                263.85
1x K40                                266.07
2x K40                                364.67
4x K40                                489.68
1/2x K80 board (1 GPU)                229.29
1x K80 board (2 GPUs)                 334.05
2x K80 boards (4 GPUs)                423.69

SLIDE 27


Titan-X Helps (But Only Through Brute Force)


SLIDE 28


Beating Hardware: Faster llrintf

    static __device__ __inline__ long long fast_llrintf(float x)
    {
    #if ((__CUDA_ARCH__ == 500) || (__CUDA_ARCH__ == 520) || (__CUDA_ARCH__ == 530)) // Maxwell hardware
        float z = x * (float)0x1.00000p-32;
        int hi = __float2int_rz(z);                              // First convert high bits
        float delta = x - ((float)0x1.00000p32 * ((float)hi));   // Check remainder sign
        int test = (__float_as_uint(delta) > 0xbf000000);        // true if delta < -0.5f
        int lo = __float2uint_rn(fabsf(delta));                  // Convert the (unsigned) remainder
        lo = (test) ? -lo : lo;
        hi -= test;                                              // Two's complement correction
        long long res = __double_as_longlong(__hiloint2double(hi, lo)); // Return 64-bit result
        return res;
    #else
        return llrintf(x);
    #endif
    }

Kate Clark (NVIDIA). Requires CUDA >= 7.5.
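A quick way to exercise the routine is to run it next to the stock conversion and diff the results on the host; the test kernel below is my own sketch, not part of the deck.

    // Hypothetical sanity check: compute both conversions per element.
    __global__ void check_llrintf(const float* x, long long* fast, long long* ref, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            fast[i] = fast_llrintf(x[i]);   // Maxwell fast path (stock path elsewhere)
            ref[i]  = llrintf(x[i]);        // built-in conversion for reference
        }
    }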

SLIDE 29


A Future-Proof(?) Solution Needed: SPXP

  • Use 2 x 32 bits (~48-bit FP).
  • "Extended-Precision Floating-Point Numbers for GPU Computation", Andrew Thall, Alma College. http://andrewthall.org/papers/df64_qf128.pdf
  • "High-Performance Quasi Double-Precision Method Using Single-Precision Hardware for Molecular Dynamics on GPUs", Tetsuo Narumi et al., HPC Asia and JAPAN 2009.

SLIDE 30


Knuth & Dekker Summation

Represented as 2 floats:

    struct Accumulator {
        float hs;   // high word of the running sum
        float ls;   // low word: accumulated rounding error
        Accumulator() : hs(0.0f), ls(0.0f) {}
    };

SLIDE 31


Addition

    void add_forces(Accumulator& a, float ys)
    {
        float hs, ws;
        // Knuth and Dekker addition (fast two-sum)
        hs = a.hs + ys;      // rounded sum
        ws = hs - a.hs;      // the part of ys that was absorbed into hs
        a.hs = hs;
        a.ls += ys - ws;     // fold the rounding error into the low word
    }

SLIDE 32


Conversion to 64-bit Fixed Point

    long long int upcast_forces(Accumulator& a)
    {
        // Scale both words into Q24.40 fixed point and combine into one 64-bit value.
        long long int l = llrintf(a.hs * FORCESCALEF)
                        + llrintf(a.ls * FORCESCALEF);
        return l;
    }
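Putting the pieces together, an SPXP-style inner loop might look like the sketch below. This is an assumed usage pattern, not AMBER source: pair_force_x and gAccX are invented names, and add_forces/upcast_forces are assumed to be compiled as __device__ functions.

    // Illustrative SPXP flow: ~FP48 accumulation in registers for the inner
    // loop, folded into the global U64 fixed-point total at the end.
    __device__ float pair_force_x(int i, int k) { return 0.0f; }  // stand-in pair force

    __device__ void tile_inner_loop(unsigned long long* gAccX, int i)
    {
        Accumulator ax;                            // per-thread x-force accumulator
        for (int k = 0; k < 32; ++k)
            add_forces(ax, pair_force_x(i, k));    // two-float (quasi-FP48) summation
        atomicAdd(gAccX + i, (unsigned long long)upcast_forces(ax));
    }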

SLIDE 33


Something for Everyone

  • DPFP: 64-bit everything.
  • SPFP: 32-bit forces, U64 force summation, 64-bit state.
  • SPXP: 32-bit forces, ~FP48 force summation for inner loops, U64 summation, 64-bit state.

SLIDE 34


Side by Side

    DP:   22.855216396810960
    DPFP: 22.855216396810960
    SPFP: 22.855216396810xxx
    SPXP: 22.8552163xxxxxxxx
    SP:   22.855xxxxxxxxxxxx

SLIDE 35


Performance (ns/day), DHFR (NVE) HMR 4 fs, 23,558 atoms:

Hardware                        ns/day
1x GTX 980                      263.91
1x GTX 980 (AMBER 16)           309.70
2x GTX 980                      379.52
2x GTX 980 (AMBER 16)           437.05
1x GTX 980 Ti                   293.23
1x GTX 980 Ti (AMBER 16)        344.11
1x GTX-Titan-X                  316.29
1x GTX-Titan-X (AMBER 16)       374.25
2x GTX-Titan-X                  429.80
2x GTX-Titan-X (AMBER 16)       487.83
4x GTX-Titan-X (AMBER 16)       555.15

SLIDE 36


Summary

  • GPUs are awesome, but continual 'internal' performance changes are crippling development.
  • Lots of new things in the pipeline; there would be more if we didn't have to keep rewriting the guts.

SLIDE 37


But that said.... GPUs are still awesome!

Performance (ns/day), DHFR (NVE) HMR 4 fs, 23,558 atoms:

Hardware                        ns/day
2x E5-2650v3 CPU (20 cores)      34.03
1x GTX 980                      309.70
2x GTX 980                      437.05
1x GTX 980 Ti                   344.11
1x GTX-Titan-X                  374.25
2x GTX-Titan-X                  487.83
4x GTX-Titan-X                  555.15
1x GP100                        548.45
2x GP100                        636.49
4x GP100                        757.90

SLIDE 38


BIG PROBLEMS NEED FAST COMPUTERS

2.5x Faster than the Largest CPU Data Center

(Chart: ns/day vs. number of processors, CPUs and GPUs.)

AMBER Simulation of CRISPR, Nature's Tool for Genome Editing

1 node with 4 P100 GPUs vs. 48 CPU nodes of the Comet supercomputer.

AMBER 16 pre-release; CRISPR based on PDB ID 5f9r, 336,898 atoms. CPU: dual-socket Intel E5-2680v3, 12 cores, 128 GB DDR4 per node, FDR IB.

"Biotech discovery of the century" (MIT Technology Review, 12/2014)
SLIDE 39
SLIDE 40


Acknowledgements

San Diego Supercomputer Center
University of California San Diego
National Science Foundation (NSF SI2-SSE Program)
NVIDIA Corporation (hardware + people)

People: Scott Le Grand, Kate Clark, Duncan Poole, Mark Berger, Simon Layton, Longhu Yang, Perri Needham, Romelia Salomon, Age Skjevik, Robin Betz, Ben Madej, Andreas Goetz, Charles Lin, Dan Mermelstein, Adrian Roitberg