SLIDE 1


Speed Without Compromise: Precision and Methodology Innovation in the AMBER GPU MD Software

Ross Walker, Associate Professor and NVIDIA CUDA Fellow
San Diego Supercomputer Center
UC San Diego Department of Chemistry & Biochemistry

SLIDE 2


Molecular Dynamics for the 99%

  • Develop a GPU-accelerated version of AMBER's PMEMD.

Ross C. Walker (San Diego Supercomputer Center), Scott Le Grand (NVIDIA)


Partly funded under the NSF SI2-SSE Program.

Taking MD to 11

SLIDE 3


Project Info

  • AMBER Website: http://ambermd.org/gpus/

Publications

1. Salomon-Ferrer, R.; Goetz, A.W.; Poole, D.; Le Grand, S.; Walker, R.C. "Routine microsecond molecular dynamics simulations with AMBER - Part II: Particle Mesh Ewald", J. Chem. Theory Comput., 2013, 9 (9), pp 3878-3888. DOI: 10.1021/ct400314y
2. Goetz, A.W.; Williamson, M.J.; Xu, D.; Poole, D.; Le Grand, S.; Walker, R.C. "Routine microsecond molecular dynamics simulations with AMBER - Part I: Generalized Born", J. Chem. Theory Comput., 2012, 8 (5), pp 1542-1555. DOI: 10.1021/ct200909j
3. Pierce, L.C.T.; Salomon-Ferrer, R.; de Oliveira, C.A.F.; McCammon, J.A.; Walker, R.C. "Routine access to millisecond timescale events with accelerated molecular dynamics", J. Chem. Theory Comput., 2012, 8 (9), pp 2997-3002. DOI: 10.1021/ct300284c
4. Salomon-Ferrer, R.; Case, D.A.; Walker, R.C. "An overview of the Amber biomolecular simulation package", WIREs Comput. Mol. Sci., 2012, in press. DOI: 10.1002/wcms.1121
5. Le Grand, S.; Goetz, A.W.; Walker, R.C. "SPFP: Speed without compromise - a mixed precision model for GPU accelerated molecular dynamics simulations", Comput. Phys. Commun., 2013, 184, pp 374-380. DOI: 10.1016/j.cpc.2012.09.022


SLIDE 4


Design Goals

Overriding Design Goal: Sampling for the 99%

  • Focus on systems of roughly 4 million atoms or fewer.
  • Maximize single-workstation performance.
  • Focus on minimizing costs.
  • Be able to use very cheap nodes.
  • Support both gaming and Tesla cards.
  • Ease of use (same input, same output).


(Figure: pyramid of users: the <0.0001%, the 1.0%, the 99.0%.)

SLIDE 5


Simplicity - Appliances for the 99%


SLIDE 6


AMBER Server (ca. 2013)


$8999

SLIDE 7


Digits Dev Box (ca. 2015)


$15,000

SLIDE 8



http://exxactcorp.com/index.php/solution/solu_detail/225

SLIDE 9


DGX-99 (Deep Learning for the 99%)


http://exxactcorp.com/index.php/solution/solu_detail/252

20x Titan-X = 133 TFLOPS FP32 in one node. DGX-1 = 85 TFLOPS FP32 in one node.

SLIDE 10



SLIDE 11


Map problem onto GPU hardware

  • Subdivide the force matrix into 3 classes of independent tiles: off-diagonal, on-diagonal, and redundant.
  • Map non-redundant tiles to warps.
  • SMs consume tiles.
  • Avoid race conditions by dividing the calculation in both space (tiles) and time (warps).

Example: nonbonded forces, with the force matrix indexed by atom i and atom j.

(Figure: per-SM queues of Warp 0 ... Warp n across SM 0 ... SM m; labels: Shared Memory, Registers.)

Patent: US 8473948 B1
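As a concrete illustration of this decomposition, here is a minimal warp-per-tile sketch. It is not the pmemd.cuda kernel: the Tile struct, the nonbond_tiles name, the placeholder pair interaction, and the fixed-point accumulator are invented for this example; only the tile-to-warp mapping itself comes from the slide.

    // Hypothetical warp-per-tile kernel (illustration only, not pmemd.cuda).
    // Launched with blockDim.x = 32, so each thread block is exactly one warp.
    #include <cuda_runtime.h>

    #define TILE 32

    struct Tile { int row, col; };                    // tile coordinates in the force matrix

    __global__ void nonbond_tiles(const float4* atoms, const Tile* tiles, int numTiles,
                                  unsigned long long* fxAcc)   // fixed-point x-force per atom
    {
        __shared__ float4 j[TILE];                    // j atoms staged in shared memory
        if (blockIdx.x >= numTiles) return;
        Tile t = tiles[blockIdx.x];                   // this warp consumes one tile
        int lane = threadIdx.x;

        float4 iAtom = atoms[t.row * TILE + lane];    // each lane holds one i atom in registers
        j[lane] = atoms[t.col * TILE + lane];
        __syncthreads();

        float fx = 0.0f;
        for (int k = 0; k < TILE; ++k) {
            float4 ja = j[(lane + k) % TILE];         // staggered sweep avoids bank conflicts
            fx += iAtom.x - ja.x;                     // placeholder for the real pair force
        }
        // Accumulate in 64-bit fixed point; integer adds are order-independent.
        atomicAdd(&fxAcc[t.row * TILE + lane],
                  (unsigned long long)llrintf(fx * (float)(1ll << 40)));
    }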

SLIDE 12


Version History

  • AMBER 10 – Released Apr 2008
    • Implicit solvent GB GPU support released as a patch Sept 2009.
  • AMBER 11 – Released Apr 2010
    • Implicit and explicit solvent supported internally on a single GPU.
    • Oct 2010 – Bugfix.9 doubled performance on a single GPU and added multi-GPU support.
  • AMBER 12 – Released Apr 2012
    • Added umbrella sampling support, REMD, simulated annealing, aMD, IPS, and Extra Points.
    • Aug 2012 – Bugfix.9: new SPFP precision model, support for Kepler I, GPU-accelerated NMR restraints, improved performance.
    • Jan 2013 – Bugfix.14: support for CUDA 5.0, Jarzynski on GPU, GBSA, Kepler II support.


SLIDE 13


Version History

  • AMBER 14 – Released Apr 2014
  • ~20-30% performance improvement for single GPU runs.
  • Peer-to-peer support for multi-GPU runs providing enhanced multi-GPU scaling.
  • Hybrid bitwise-reproducible fixed-point precision model (SPFP) as standard.
  • Support for Extra Points in Multi-GPU runs.
  • Jarzynski Sampling
  • GBSA support
  • Support for off-diagonal modifications to VDW parameters.
  • Multi-dimensional Replica Exchange (Temperature and Hamiltonian)
  • Support for CUDA 5.0, 5.5 and 6.0
  • Support for latest generation GPUs.
  • Monte Carlo barostat support providing NPT performance equivalent to NVT.
  • ScaledMD support.
  • Improved accelerated (aMD) MD support.
  • Explicit solvent constant pH support.
  • NMR restraint support on multiple GPUs.
  • Improved error messages and checking.
  • Hydrogen mass repartitioning support (4fs time steps).


SLIDE 14


AMBER 16 (GPU) – Coming Apr 2016

  • Enhanced NMR restraints.
  • R^6 restraint averaging.
  • Gaussian Accelerated Molecular Dynamics.
  • Optimized binary I/O support (mdcrd and restrt).
  • External electric field support.
  • Expanded umbrella sampling.
  • Maxwell-specific optimizations.
  • Another 20 to 30% performance improvement!
  • New SPXP precision model for Maxwell and future hardware.


Amber 2016 Reference Manual

(Covers Amber16 and AmberTools16)

SLIDE 15


A Question of Dynamic Range

32-bit floating point has approximately 7 significant figures. When it happens: PBC, SHAKE, and force accumulation.

          1.456702 + 0.3046714 = 1.761373
          1.761373 - 1.456702  = 0.3046710   (lost a sig fig)

    1456702.0000000 + 0.3046714       = 1456702.0000000
    1456702.0000000 - 1456702.0000000 = 0.0000000   (lost everything)
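A few lines of ordinary C++ reproduce the effect; this is a demo of the arithmetic above, not AMBER code.

    // Demonstrates the dynamic-range problem of 32-bit floats.
    #include <cstdio>

    int main()
    {
        const float small = 0.3046714f;

        // Operands of similar magnitude: the sum keeps ~7 significant figures.
        float a = 1.456702f + small;
        printf("%.7f\n", a - 1.456702f);   // ~0.304671x: only the last figure is noise

        // Operands 7 orders of magnitude apart: the ULP at 1456702.0f is 0.125,
        // so almost all of the small addend is rounded away.
        float b = 1456702.0f + small;
        printf("%.7f\n", b - 1456702.0f);  // prints 0.2500000
        return 0;
    }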

SLIDE 16


Precision Models

SPSP – Use single precision for the entire calculation, with the exception of SHAKE, which is always done in double precision.
SPDP – Use a combination of single precision for the calculation and double precision for accumulation (default < AMBER 12.9).
DPDP – Use double precision for the entire calculation.


SLIDE 17


Validation and Precision Testing

  • Measure a combination of elements that depend on both static energies/forces and ensemble averages:
    • Energy conservation.
    • Optimized structures.
    • Free energy surfaces.
    • Order parameters.
    • RMSF.
    • Radial distribution functions, etc.
  • Two aims:
    • Is our implementation valid/correct?
    • What level of approximation in precision is acceptable?

SLIDE 18


Force Accuracy

SLIDE 19


Energy Conservation

SLIDE 20


Free Energy Surfaces

(Panels: CPU (DP) vs GPU (SPDP).)

SLIDE 21


Explicit Solvent Performance (JAC DHFR Production Benchmark)

SLIDE 22


But then…


GTX 680 and K10 ruined the party: DP performance REALLY sucked. A 4-month delay in usefulness while we developed and tested a new precision model.

SLIDE 23


SPFP

  • Single/Double/Fixed precision hybrid, designed for optimum performance on Kepler I.
  • Uses fire-and-forget atomic ops.
  • Fully deterministic, faster and more precise than SPDP, minimal memory overhead. (Default >= AMBER 12.9.)


Q24.40 for Forces, Q34.30 for Energies / Virials
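The core of the scheme fits in a few lines. The sketch below is an assumed illustration of Q24.40 force accumulation, not AMBER source; atomic_add_force is an invented name, and FORCESCALEF follows the convention used on the SPXP slides later in this deck.

    // SPFP-style fixed-point force accumulation (illustrative sketch).
    #define FORCESCALEF ((float)(1ll << 40))   // Q24.40: 40 fractional bits

    __device__ void atomic_add_force(unsigned long long* acc, float f)
    {
        // Round the single-precision force into 64-bit fixed point and add it
        // atomically ("fire and forget"). Integer addition is associative, so
        // the total is bitwise identical whatever order threads contribute in.
        atomicAdd(acc, (unsigned long long)llrintf(f * FORCESCALEF));
    }

    // Readback: double f = (double)(long long)acc_value / (double)FORCESCALEF;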

SLIDE 24


Reproducibility: Critical for Debugging Software and Hardware


  • The SPFP precision model is bitwise reproducible.
  • Same simulation from the same random seed = same result.
  • Used to validate hardware (catch misbehaving GPUs) for Exxact AMBER Certified Machines.
  • Successfully identified 3 GPU models with underlying hardware issues that needed post-release fixes: GTX-Titan, GTX-780Ti, GTX-Titan-Black.

SLIDE 25


Reproducibility

Run     Good GPU Etot    Bad GPU Etot
0.0     -58229.3134      -58229.3134
0.1     -58229.3134      -58227.1072
0.2     -58229.3134      -58229.3134
0.3     -58229.3134      -58218.9033
0.4     -58229.3134      -58217.2088
0.5     -58229.3134      -58229.3134
0.6     -58229.3134      -58228.3001
0.7     -58229.3134      -58229.3134
0.8     -58229.3134      -58229.3134
0.9     -58229.3134      -58231.6743
0.10    -58229.3134      -58229.3134


Final Energy after 10^6 MD steps (~45 mins per run)

SLIDE 26


Worked Great Until Maxwell

Performance (ns/day), DHFR (NVE) HMR 4 fs, 23,558 atoms:

Hardware                              ns/day
2x E5-2660v2 CPU (16 cores)            30.21
1x C2075                               81.26
2x C2075                              129.79
1x GTX 780                            251.43
1x GTX 980                            262.39
1x GTX Titan Black                    280.54
2x GTX Titan Black                    383.32
GTX-Titan-Z (1 GPU, 1/2 board)        261.82
GTX-Titan-Z (2 GPUs, full board)      356.48
1x K8                                 116.09
1x K20                                196.99
2x K20                                263.85
1x K40                                266.07
2x K40                                364.67
4x K40                                489.68
1/2x K80 board (1 GPU)                229.29
1x K80 board (2 GPUs)                 334.05
2x K80 boards (4 GPUs)                423.69

SLIDE 27


Titan-X Helps (But Only Through Brute Force)


SLIDE 28


Beating Hardware: Faster llrintf

    static __device__ __inline__ long long fast_llrintf(float x)
    {
    #if ((__CUDA_ARCH__ == 500) || (__CUDA_ARCH__ == 520) || (__CUDA_ARCH__ == 530)) // Maxwell hardware
        float z = x * (float)0x1.00000p-32;
        int hi = __float2int_rz(z);                              // First convert high bits
        float delta = x - ((float)0x1.00000p32 * ((float)hi));   // Check remainder sign
        int test = (__float_as_uint(delta) > 0xbf000000);        // true if delta < -0.5f
        int lo = __float2uint_rn(fabsf(delta));                  // Convert the (unsigned) remainder
        lo = (test) ? -lo : lo;
        hi -= test;                                              // Two's complement correction
        long long res = __double_as_longlong(__hiloint2double(hi, lo)); // Return 64-bit result
        return res;
    #else
        return llrintf(x);
    #endif
    }

Kate Clark (NVIDIA). Requires CUDA >= 7.5.
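A quick way to exercise the routine is to run it next to the stock conversion and diff the results on the host; the test kernel below is my own sketch, not part of the deck.

    // Hypothetical sanity check: compute both conversions per element.
    __global__ void check_llrintf(const float* x, long long* fast, long long* ref, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            fast[i] = fast_llrintf(x[i]);   // Maxwell fast path (stock path elsewhere)
            ref[i]  = llrintf(x[i]);        // built-in conversion for reference
        }
    }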

SLIDE 29


A Future-Proof(?) Solution Needed: SPXP

  • Use 2 x 32 bits (~48-bit FP).
  • "Extended-Precision Floating-Point Numbers for GPU Computation", Andrew Thall, Alma College. http://andrewthall.org/papers/df64_qf128.pdf
  • "High-Performance Quasi Double-Precision Method Using Single-Precision Hardware for Molecular Dynamics on GPUs", Tetsuo Narumi et al., HPC Asia and JAPAN 2009.

SLIDE 30


Knuth & Dekker Summation

Represented as 2 floats:

    struct Accumulator {
        float hs;   // high word of the running sum
        float ls;   // low word: accumulated rounding error
        Accumulator() : hs(0.0f), ls(0.0f) {}
    };

SLIDE 31


Addition

    void add_forces(Accumulator& a, float ys)
    {
        float hs, ws;
        // Knuth and Dekker addition (fast two-sum)
        hs = a.hs + ys;      // rounded sum
        ws = hs - a.hs;      // the part of ys that was absorbed into hs
        a.hs = hs;
        a.ls += ys - ws;     // fold the rounding error into the low word
    }

SLIDE 32


Conversion to 64-bit Fixed Point

    long long int upcast_forces(Accumulator& a)
    {
        // Scale both words into Q24.40 fixed point and combine into one 64-bit value.
        long long int l = llrintf(a.hs * FORCESCALEF)
                        + llrintf(a.ls * FORCESCALEF);
        return l;
    }
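Putting the pieces together, an SPXP-style inner loop might look like the sketch below. This is an assumed usage pattern, not AMBER source: pair_force_x and gAccX are invented names, and add_forces/upcast_forces are assumed to be compiled as __device__ functions.

    // Illustrative SPXP flow: ~FP48 accumulation in registers for the inner
    // loop, folded into the global U64 fixed-point total at the end.
    __device__ float pair_force_x(int i, int k) { return 0.0f; }  // stand-in pair force

    __device__ void tile_inner_loop(unsigned long long* gAccX, int i)
    {
        Accumulator ax;                            // per-thread x-force accumulator
        for (int k = 0; k < 32; ++k)
            add_forces(ax, pair_force_x(i, k));    // two-float (quasi-FP48) summation
        atomicAdd(gAccX + i, (unsigned long long)upcast_forces(ax));
    }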

SLIDE 33


Something for Everyone

  • DPFP: 64-bit everything.
  • SPFP: 32-bit forces, U64 force summation, 64-bit state.
  • SPXP: 32-bit forces, ~FP48 force summation for inner loops, U64 summation, 64-bit state.

SLIDE 34


Side by Side

    DP:   22.855216396810960
    DPFP: 22.855216396810960
    SPFP: 22.855216396810xxx
    SPXP: 22.8552163xxxxxxxx
    SP:   22.855xxxxxxxxxxxx

SLIDE 35


Performance (ns/day), DHFR (NVE) HMR 4 fs, 23,558 atoms:

Hardware                        ns/day
1x GTX 980                      263.91
1x GTX 980 (AMBER 16)           309.70
2x GTX 980                      379.52
2x GTX 980 (AMBER 16)           437.05
1x GTX 980 Ti                   293.23
1x GTX 980 Ti (AMBER 16)        344.11
1x GTX-Titan-X                  316.29
1x GTX-Titan-X (AMBER 16)       374.25
2x GTX-Titan-X                  429.80
2x GTX-Titan-X (AMBER 16)       487.83
4x GTX-Titan-X (AMBER 16)       555.15

SLIDE 36


Summary

  • GPUs are awesome, but continual 'internal' performance changes are crippling development.
  • Lots of new things in the pipeline; there would be more if we didn't have to keep rewriting the guts.

SLIDE 37


But that said.... GPUs are still awesome!

Performance (ns/day), DHFR (NVE) HMR 4 fs, 23,558 atoms:

Hardware                        ns/day
2x E5-2650v3 CPU (20 cores)      34.03
1x GTX 980                      309.70
2x GTX 980                      437.05
1x GTX 980 Ti                   344.11
1x GTX-Titan-X                  374.25
2x GTX-Titan-X                  487.83
4x GTX-Titan-X                  555.15
1x GP100                        548.45
2x GP100                        636.49
4x GP100                        757.90

SLIDE 38


BIG PROBLEMS NEED FAST COMPUTERS

2.5x Faster than the Largest CPU Data Center

(Chart: ns/day vs. number of processors, CPUs and GPUs.)

AMBER Simulation of CRISPR, Nature's Tool for Genome Editing

1 node with 4 P100 GPUs vs. 48 CPU nodes of the Comet supercomputer.

AMBER 16 pre-release; CRISPR based on PDB ID 5f9r, 336,898 atoms. CPU: dual-socket Intel E5-2680v3, 12 cores, 128 GB DDR4 per node, FDR IB.

"Biotech discovery of the century" (MIT Technology Review, 12/2014)
SLIDE 39
SLIDE 40


Acknowledgements

San Diego Supercomputer Center
University of California San Diego
National Science Foundation (NSF SI2-SSE Program)
NVIDIA Corporation (hardware + people)

People: Scott Le Grand, Kate Clark, Duncan Poole, Mark Berger, Simon Layton, Longhu Yang, Perri Needham, Romelia Salomon, Age Skjevik, Robin Betz, Ben Madej, Andreas Goetz, Charles Lin, Dan Mermelstein, Adrian Roitberg