
Speed Without Compromise: Precision and Methodology Innovation in the AMBER GPU MD Software
Ross Walker, Associate Professor and NVIDIA CUDA Fellow
San Diego Supercomputer Center, UC San Diego Department of Chemistry & Biochemistry


  1. Speed Without Compromise: Precision and Methodology Innovation in the AMBER GPU MD Software. Ross Walker, Associate Professor and NVIDIA CUDA Fellow, San Diego Supercomputer Center, UC San Diego Department of Chemistry & Biochemistry.

  2. Molecular Dynamics for the 99% • Develop a GPU-accelerated version of AMBER's PMEMD. Ross C. Walker (San Diego Supercomputer Center) and Scott Le Grand (NVIDIA). "Taking MD to 11." Partly funded under the NSF SI2-SSE Program.

  3. Project Info • AMBER Website: http://ambermd.org/gpus/
  Publications:
  1. Salomon-Ferrer, R.; Goetz, A.W.; Poole, D.; Le Grand, S.; Walker, R.C. "Routine microsecond molecular dynamics simulations with AMBER - Part II: Particle Mesh Ewald", J. Chem. Theory Comput., 2013, 9 (9), pp 3878-3888. DOI: 10.1021/ct400314y
  2. Goetz, A.W.; Williamson, M.J.; Xu, D.; Poole, D.; Le Grand, S.; Walker, R.C. "Routine microsecond molecular dynamics simulations with AMBER - Part I: Generalized Born", J. Chem. Theory Comput., 2012, 8 (5), pp 1542-1555. DOI: 10.1021/ct200909j
  3. Pierce, L.C.T.; Salomon-Ferrer, R.; de Oliveira, C.A.F.; McCammon, J.A.; Walker, R.C. "Routine access to millisecond timescale events with accelerated molecular dynamics", J. Chem. Theory Comput., 2012, 8 (9), pp 2997-3002. DOI: 10.1021/ct300284c
  4. Salomon-Ferrer, R.; Case, D.A.; Walker, R.C. "An overview of the Amber biomolecular simulation package", WIREs Comput. Mol. Sci., 2012, in press. DOI: 10.1002/wcms.1121
  5. Le Grand, S.; Goetz, A.W.; Walker, R.C. "SPFP: Speed without compromise - a mixed precision model for GPU accelerated molecular dynamics simulations", Comput. Phys. Commun., 2013, 184, pp 374-380. DOI: 10.1016/j.cpc.2012.09.022

  4. Design Goals • Overriding design goal: sampling for the 99%. • Focus on systems of up to ~4 million atoms. • Maximize single-workstation performance. • Focus on minimizing cost. • Be able to use very cheap nodes. • Support both gaming and Tesla cards. • Ease of use (same input, same output). [Chart: user base split into the <0.0001%, the 1.0%, and the 99.0%]

  5. Simplicity - Appliances for the 99%

  6. AMBER Server (ca. 2013) $8999

  7. DIGITS DevBox (ca. 2015) $15,000

  8. http://exxactcorp.com/index.php/solution/solu_detail/225

  9. DGX-99 (Deep Learning for the 99%) http://exxactcorp.com/index.php/solution/solu_detail/252 • 20x Titan X = 133 TFLOPS FP32 in 1 node. • DGX-1 = 85 TFLOPS FP32 in 1 node.

  10. SECOND

  11. Map the problem onto GPU hardware. Example: non-bonded forces.
  • Subdivide the force matrix (atom i vs. atom j) into 3 classes of independent tiles: on-diagonal, off-diagonal, and redundant.
  • Map non-redundant tiles to warps.
  • SMs consume tiles. [Diagram: SM 0 ... SM m, each with shared memory, registers, and warps 0 ... n]
  • Avoid race conditions by dividing the calculation in both space (tiles) and time (warps). Patent: US 8473948 B1
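As an illustration of the warp-per-tile idea, here is a deliberately simplified CUDA sketch (not the pmemd.cuda kernel; the names ljForce and tileForces, the toy pair force, and the omission of exclusions and diagonal-tile handling are all assumptions made for this example). One warp owns one 32x32 tile, each lane owns one i atom, and the 32 j atoms are rotated around the warp with shuffles so every pair in the tile is visited without shared-memory traffic:

```cuda
// Toy pair force (placeholder for the real Lennard-Jones/electrostatic term).
__device__ float3 ljForce(float4 ai, float4 aj)
{
    float dx = ai.x - aj.x, dy = ai.y - aj.y, dz = ai.z - aj.z;
    float r2 = dx*dx + dy*dy + dz*dz + 1.0e-6f;      // softened to avoid i == j blow-up
    float inv_r6 = 1.0f / (r2 * r2 * r2);
    float f = (12.0f * inv_r6 * inv_r6 - 6.0f * inv_r6) / r2;
    return make_float3(f * dx, f * dy, f * dz);
}

// One warp per 32x32 off-diagonal tile; exclusions and the symmetric
// j-atom forces are omitted for brevity.
__global__ void tileForces(const float4* atoms,   // x, y, z (w unused)
                           const int2*   tiles,   // (iStart, jStart) per tile
                           float3*       force,   // per-atom force accumulators
                           int           nTiles)
{
    const unsigned FULL = 0xffffffffu;
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;
    int lane   = threadIdx.x & 31;
    if (warpId >= nTiles) return;

    int2   t  = tiles[warpId];
    float4 ai = atoms[t.x + lane];                 // this lane's i atom
    float4 aj = atoms[t.y + lane];                 // one j atom, rotated below
    float3 fi = make_float3(0.f, 0.f, 0.f);

    for (int k = 0; k < 32; ++k) {
        float3 f = ljForce(ai, aj);
        fi.x += f.x;  fi.y += f.y;  fi.z += f.z;
        // Rotate the j atoms one lane to the left so every lane sees all 32.
        aj.x = __shfl_sync(FULL, aj.x, (lane + 1) & 31);
        aj.y = __shfl_sync(FULL, aj.y, (lane + 1) & 31);
        aj.z = __shfl_sync(FULL, aj.z, (lane + 1) & 31);
        aj.w = __shfl_sync(FULL, aj.w, (lane + 1) & 31);
    }
    // Other tiles also touch atom i, so the final write needs atomics.
    atomicAdd(&force[t.x + lane].x, fi.x);
    atomicAdd(&force[t.x + lane].y, fi.y);
    atomicAdd(&force[t.x + lane].z, fi.z);
}
```

In this sketch the only cross-tile contention is the final per-atom accumulation, which mirrors the space/time partitioning argument above: within a tile, each lane owns its own i-force and no race can occur.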

  12. Version History
  • AMBER 10 – Released Apr 2008
    • Implicit solvent (GB) GPU support released as a patch, Sept 2009.
  • AMBER 11 – Released Apr 2010
    • Implicit and explicit solvent supported internally on a single GPU.
    • Oct 2010 – Bugfix.9 doubled performance on a single GPU and added multi-GPU support.
  • AMBER 12 – Released Apr 2012
    • Added umbrella sampling support, REMD, simulated annealing, aMD, IPS and extra points.
    • Aug 2012 – Bugfix.9: new SPFP precision model, support for Kepler I, GPU-accelerated NMR restraints, improved performance.
    • Jan 2013 – Bugfix.14: support for CUDA 5.0, Jarzynski on GPU, GBSA, Kepler II support.

  13. Version History
  • AMBER 14 – Released Apr 2014
    • ~20-30% performance improvement for single-GPU runs.
    • Peer-to-peer support for multi-GPU runs, providing enhanced multi-GPU scaling.
    • Hybrid bitwise-reproducible fixed-point precision model (SPFP) as standard.
    • Support for extra points in multi-GPU runs.
    • Jarzynski sampling.
    • GBSA support.
    • Support for off-diagonal modifications to VDW parameters.
    • Multi-dimensional replica exchange (temperature and Hamiltonian).
    • Support for CUDA 5.0, 5.5 and 6.0.
    • Support for latest-generation GPUs.
    • Monte Carlo barostat support, providing NPT performance equivalent to NVT.
    • ScaledMD support.
    • Improved accelerated MD (aMD) support.
    • Explicit solvent constant pH support.
    • NMR restraint support on multiple GPUs.
    • Improved error messages and checking.
    • Hydrogen mass repartitioning support (4 fs time steps).

  14. AMBER 16 (GPU) – Coming Apr 2016
  • Enhanced NMR restraints.
  • R^6 restraint averaging.
  • Gaussian Accelerated Molecular Dynamics.
  • Optimized binary I/O support (mdcrd and restrt).
  • External electric field support.
  • Expanded umbrella sampling.
  • Maxwell-specific optimizations.
  • Another 20 to 30% performance improvement!
  • New SPXP precision model for Maxwell and future hardware.
  [Image: Amber 2016 Reference Manual, covering Amber16 and AmberTools16]

  15. A Question of Dynamic Range
  32-bit floating point has approximately 7 significant figures.

     1.456702                 1456702.0000000
  +  0.3046714              +       0.3046714
  ------------              -----------------
     1.761373                 1456702.0000000
  -  1.456702              - 1456702.0000000
  ------------              -----------------
     0.3046710                     0.0000000
  Lost a sig fig.            Lost everything.

  When it happens: PBC, SHAKE, and force accumulation.
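The same effect can be reproduced in a few lines of host code (this toy program is not from the slides; it builds with nvcc or any C++ compiler). With the slide's 1456702.0 base the hardware still salvages a bit or two of the addend; one order of magnitude larger and it vanishes completely:

```cuda
// Minimal demonstration of the FP32 dynamic-range problem: the same small
// number is added to, then subtracted from, accumulators of growing
// magnitude. This is the situation hit by PBC wrapping, SHAKE and force
// accumulation when everything is kept in single precision.
#include <cstdio>

static float roundTrip(float base, float addend)
{
    float acc = base;
    acc += addend;   // low-order bits of 'addend' may be rounded away
    acc -= base;     // what survives of 'addend'?
    return acc;
}

int main()
{
    const float x = 0.3046714f;
    printf("base 1.456702   -> %.7f\n", roundTrip(1.456702f, x));   // loses ~1 sig fig
    printf("base 1456702.0  -> %.7f\n", roundTrip(1456702.0f, x));  // only ~1 bit survives
    printf("base 14567020.0 -> %.7f\n", roundTrip(14567020.0f, x)); // 0.0: lost everything
    return 0;
}
```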

  16. Precision Models
  • SPSP – single precision for the entire calculation, with the exception of SHAKE, which is always done in double precision.
  • SPDP – a combination of single precision for the calculation and double precision for accumulation (the default before AMBER 12.9).
  • DPDP – double precision for the entire calculation.
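A minimal sketch of the SPDP idea, assuming nothing about AMBER's actual kernels (the function name and launch configuration are illustrative): the per-pair arithmetic stays in fast single precision, but each contribution is promoted to double before it is summed, so small terms are not swallowed as in the example above.

```cuda
// Single-precision inputs, double-precision accumulation (SPDP-style sketch).
__global__ void accumulateSPDP(const float* __restrict__ pairForce,  // n SP contributions
                               int n, double* total)
{
    double local = 0.0;                        // per-thread DP accumulator
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        local += (double)pairForce[i];         // promote before summing
    atomicAdd(total, local);                   // DP atomic (compute capability >= 6.0)
}
```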

  17. Validation and Precision Testing
  • Measure a combination of properties that depend on both static energies/forces and ensemble averages: energy conservation, optimized structures, free energy surfaces, order parameters, RMSF, radial distribution functions, etc.
  • Two aims:
    • Is our implementation valid/correct?
    • What level of approximation in precision is acceptable?
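As one concrete instance of these checks, an energy-conservation test boils down to fitting the total energy of an NVE trajectory against time and inspecting the slope. The sketch below is illustrative only (the helper name energyDrift and the synthetic data are assumptions, not from the slides); plain host code that builds with nvcc or any C++ compiler:

```cuda
#include <cstdio>
#include <vector>

// Least-squares slope of E(t): the energy drift, in energy units per unit time.
double energyDrift(const std::vector<double>& t, const std::vector<double>& E)
{
    double st = 0, sE = 0, stt = 0, stE = 0;
    const double n = (double)t.size();
    for (size_t i = 0; i < t.size(); ++i) {
        st += t[i];  sE += E[i];  stt += t[i] * t[i];  stE += t[i] * E[i];
    }
    return (n * stE - st * sE) / (n * stt - st * st);
}

int main()
{
    // Fake 1 ns trajectory sampled every 1 ps, with noise but no trend.
    std::vector<double> t, E;
    for (int i = 0; i < 1000; ++i) {
        t.push_back(i * 0.001);                                    // time in ns
        E.push_back(-58229.3134 + 1.0e-4 * ((i * 7919) % 13 - 6)); // kcal/mol
    }
    printf("energy drift: %.6f kcal/mol/ns\n", energyDrift(t, E));
    return 0;
}
```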

  18. Force Accuracy

  19. Energy Conservation

  20. Free Energy Surfaces [Panels: CPU (DP) vs. GPU (SPDP)]

  21. Explicit Solvent Performance (JAC DHFR Production Benchmark)

  22. But then… the GTX 680 and K10 ruined the party. DP performance REALLY sucked. A 4-month delay in usefulness while we developed and tested a new precision model.

  23. SPFP
  • Single/double/fixed-precision hybrid, designed for optimum performance on Kepler I.
  • Uses fire-and-forget atomic ops.
  • Fully deterministic, faster and more precise than SPDP, with minimal memory overhead (the default from AMBER 12.9 onwards).
  • Q24.40 fixed point for forces, Q34.30 for energies/virials.
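The core of the trick can be sketched in a few lines. This is a toy illustration of the Q24.40 format named on the slide, not pmemd.cuda code; the helper names and the no-overflow assumption are mine. The single-precision force is scaled by 2^40, rounded to a 64-bit integer, and accumulated with an integer atomic, which is exact and independent of the order in which the atomics arrive:

```cuda
#define FORCE_SCALE 0x10000000000ull      // 2^40: 40 fractional bits of Q24.40

// "Fire and forget" accumulation of a single-precision force contribution
// into a Q24.40 fixed-point accumulator (wrap-around addition mod 2^64 is
// exactly two's-complement integer addition, so negatives work too).
__device__ void addForceFixed(unsigned long long* fxAccum, float fx)
{
    long long contribution = llrintf(fx * (float)FORCE_SCALE);
    atomicAdd(fxAccum, (unsigned long long)contribution);
}

// Convert the accumulated Q24.40 value back to floating point
// (two's-complement reinterpretation of the unsigned accumulator).
__device__ double readForceFixed(const unsigned long long* fxAccum)
{
    return (double)(long long)(*fxAccum) / (double)FORCE_SCALE;
}
```

Because integer addition is associative, the total is bitwise identical no matter how the hardware schedules the atomics, which is what makes the model both deterministic and cheap.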

  24. Reproducibility – Critical for Debugging Software and Hardware
  • The SPFP precision model is bitwise reproducible.
  • Same simulation from the same random seed = same result.
  • Used to validate hardware and catch misbehaving GPUs (Exxact AMBER Certified Machines).
  • This approach successfully identified 3 GPU models with underlying hardware issues that needed post-release fixes (GTX Titan, GTX 780 Ti, GTX Titan Black).
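To make the reproducibility claim concrete, the toy program below (not from the slides; the 2^40 scale mirrors the previous sketch) sums the same array several times with float atomics and with fixed-point integer atomics. The float result is allowed to wobble in its last bits from run to run because the arrival order of the atomics varies and float addition is not associative, while the integer result is identical every time:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sumFloat(const float* x, int n, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, x[i]);                 // rounding depends on arrival order
}

__global__ void sumFixed(const float* x, int n, unsigned long long* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(out, (unsigned long long)llrintf(x[i] * 1099511627776.0f)); // * 2^40
}

int main()
{
    const int n = 1 << 20;
    float* x;  float* fsum;  unsigned long long* isum;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&fsum, sizeof(float));
    cudaMallocManaged(&isum, sizeof(unsigned long long));
    for (int i = 0; i < n; ++i) x[i] = 1.0e-3f * (float)((i * 2654435761u) % 1000u);

    for (int run = 0; run < 3; ++run) {
        *fsum = 0.0f;  *isum = 0ull;
        sumFloat<<<(n + 255) / 256, 256>>>(x, n, fsum);
        sumFixed<<<(n + 255) / 256, 256>>>(x, n, isum);
        cudaDeviceSynchronize();
        printf("run %d: float sum = %.6f   fixed-point sum = %llu\n", run, *fsum, *isum);
    }
    return 0;
}
```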

  25. Reproducibility – Final energy after 10^6 MD steps (~45 mins per run)

  Run     Good GPU (Etot)   Bad GPU (Etot)
  0.0     -58229.3134       -58229.3134
  0.1     -58229.3134       -58227.1072
  0.2     -58229.3134       -58229.3134
  0.3     -58229.3134       -58218.9033
  0.4     -58229.3134       -58217.2088
  0.5     -58229.3134       -58229.3134
  0.6     -58229.3134       -58228.3001
  0.7     -58229.3134       -58229.3134
  0.8     -58229.3134       -58229.3134
  0.9     -58229.3134       -58231.6743
  0.10    -58229.3134       -58229.3134

  26. Worked Great Until Maxwell
  DHFR (NVE), HMR 4 fs, 23,558 atoms. Performance (ns/day):

  2x K80 boards (4 GPUs)            423.69
  1x K80 board (2 GPUs)             334.05
  1/2x K80 board (1 GPU)            229.29
  4x K40                            489.68
  2x K40                            364.67
  1x K40                            266.07
  2x K20                            263.85
  1x K20                            196.99
  1x K8                             116.09
  GTX-Titan-Z (2 GPUs, full board)  356.48
  GTX-Titan-Z (1 GPU, 1/2 board)    261.82
  2x GTX Titan Black                383.32
  1x GTX Titan Black                280.54
  1x GTX 980                        262.39
  1x GTX 780                        251.43
  2x C2075                          129.79
  1x C2075                           81.26
  2x E5-2660v2 CPU (16 cores)        30.21

  27. Titan-X Helps (But only through brute force)
