s m mniszewski c f a negre m j cawkwell a m n niklasson
play

S. M. Mniszewski, C. F. A. Negre, M. J. Cawkwell, A. M. N. Niklasson - PowerPoint PPT Presentation

Distributed Graph-based Density Matrix Calculation for Quantum Molecular Dynamics using GPUs S. M. Mniszewski, C. F. A. Negre, M. J. Cawkwell, A. M. N. Niklasson Los Alamos National Laboratory smm@lanl.gov 2016 GPU Technology Conference April


  1. Distributed Graph-based Density Matrix Calculation for Quantum Molecular Dynamics using GPUs S. M. Mniszewski, C. F. A. Negre, M. J. Cawkwell, A. M. N. Niklasson Los Alamos National Laboratory smm@lanl.gov 2016 GPU Technology Conference April 4-7, 2016 U N C L A S S I F I E D Slide 1

  2. Why Molecular Dynamics? U N C L A S S I F I E D Slide 2

  3. Background • In molecular dynamics simulation, the relative positions of atoms evolve over a series of time steps according to the force acting on each atom • Employed in materials science, chemistry, and biology to study structures, defects, and equilibrium and non-equilibrium phenomena • Dependence on an interatomic potential to calculate forces and energy • Quantum-based models capture the making and breaking of covalent bonds, charge transfer between species of differing electronegativities, and long-range electrostatic interactions - reactions U N C L A S S I F I E D Slide 3

  4. Quantum Molecular Dynamics (QMD) Integrate the equations of motion Born-Oppenheimer MD of classical molecular trajectories R I = − ∂ U ( R ; ρ sc ) SCF � M I ¨ SCF � U ( R ; ρ ) ∂ R I SCF � with the forces calculated on the fly from a self-consistent quantum U ( R ; ρ sc ) mechanical description of the electronic structure: H [ ρ ] Ψ i = ε i Ψ i SCF � → ρ sc X | Ψ i | 2 ρ = #SCF × O ( N 3 ) occ . U N C L A S S I F I E D Slide 4

  5. Example: Biosynthesis of Histidine • What is responsible for the biosynthesis of histidine? • Present in several pathogenic bacteria • Allosteric mechanism was determined with MD simulation • Several thousand atoms over timescales of 100s of ns! Rivalta I, Sultan MM, Lee N-S, Manley GA, Loria JP, Batista VS. PNAS, 2012 109(22), E1428-E1436. U N C L A S S I F I E D Slide 5

  6. Future QMD Simulations IAPP dimerization and type-2 diabetes (537 atoms). Beta Amyloid Peptide and Alzheimer’s Disease (410 atoms). Raffa DF, A. Rauk A, J. Phys. Chem. B, 2 MT007, 111 (14), 3789-99. Dupuis NF, Wu C, Shea J-E,,Bowers, MT J. Am. Chem. Soc., 2011, 133 (19), 7240-43. U N C L A S S I F I E D Slide 6

  7. Computational Cost • High computational cost Quantum Molecular Dynamics and complexity of QMD Wall-Clock Time / MD Time Step calculations 3 ) Diagonalization O(N • The MD timestep is the O(N) Regular linear scaling O(N) Low pre-factor most expensive – the density matrix construction Significant pre-factor reduction � is required for practical � • The second order spectral large scale QMD! � projection (SP2) algorithm breakthrough System Size (N) • Use of hybrid parallelism on GPU-accelerated clusters U N C L A S S I F I E D Slide 7

  8. The Density Matrix Computation • Typically, algorithms used in quantum-based models, most notably matrix diagonalization, are not ideally suited to GPUs – Due to their complexity – Difficulty in extracting thread-level parallelism – Difficulty of avoiding branching within warps • New SP2 approach – Computed directly from the Hamiltonian through a recursive expansion of the Fermi Operator with the second order spectral projection (SP2) algorithm – Based on a series of generalized matrix-matrix multiplications – Only one matrix-matrix multiplication is required per iteration – Maps very well to GPUs U N C L A S S I F I E D Slide 8

  9. The Second Order Spectral Projection Algorithm (SP2) – Reduced Complexity Recursive Fermi Operator expansion ρ = θ µ I − H ' = lim i →∞ f i [ f i − 1 [ … f 0 [ X 0 ] … ]] $ & % 2 f(x) = x 1 2 f(x) = 2x - x X 0 = ε max I − H f 8 (f 7 (...f 1 (x) ...)) f(x) 0.5 ε max − ε min 0 2 if 2Tr[ X i ] ≥ N e X i 0 0.5 1 x f i [ X i ] = 2 if 2Tr[ X i ] < N e 2 X i − X i Niklasson AMN, Phys. Rev. B 66, 155115 (2002). U N C L A S S I F I E D Slide 9

  10. SP2 Algorithm Using the GPU Approach Estimate ε max and ε min X = ( ε max I - H )/( ε max - ε min ) TraceX = Tr[ X ] /* Trace kernel on GPU */ Until converged do X tmp = X X tmp = X 2 + X tmp /*CUBLAS xGEMM */ TraceX tmp = Tr[ X tmp ] /*Trace kernel on GPU */ if |2TraceX – 2TraceX tmp – N e | > |2TraceX + 2TraceX tmp –N e | X = X + X tmp /* CUBLAS xAXPY */ TraceX = TraceX + TraceX tmp /* CUBLAS xAXPY */ else X = X – X tmp /* CUBLAS xAXPY */ TraceX = TraceX – TraceX tmp end until ρ = X Cawkwell MJ, Mniszewski SM, Niklasson AMN, Fast Quantum Molecular Dynamics on Multi-GPU Architectures in LATTE, GTC 2013. U N C L A S S I F I E D Slide 10

  11. Density Matrix Calculation (Nvidia M2090) – Liquid Methane (10 – 1250 molecules) 90 SP2: 1 GPU Time per density matrix build (s) SP2: 2 GPUs SP2: 3 GPUs SP2: CPU Diagonalization 60 30 0 0 2000 4000 6000 8000 10000 Matrix dimension Cawkwell MJ, Mniszewski SM, Niklasson AMN, Fast Quantum Molecular Dynamics on Multi-GPU Architectures in LATTE, GTC 2013. U N C L A S S I F I E D Slide 11

  12. Sparse Matrix SP2 – ELLPACK-R Format Sparse"Matrix" Dense"Matrix" Values" Columns" #"Non4zeroes" 1" 1" 1" 1" 1" 1" 1" 1" 2" 4" 6" 1" 1" 1" 3" 3" 1" 1" 1" 1" 1" 1" 1" 1" 2" 4" 5" 1" 1" 1" 3" 4" 5 1" 1" 1" 1" 2" 1" 1" 1" 1" 2" 6" 1" 2" • Described by 3 arrays, 2-D values and indices, 1-D non-zero entries per row • N rows and M max non-zeroes per row, O(Nm 2 ) computational complexity • Row-wise storage for parallelism opportunities • No insertion cost compared to CSR Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yosuf J, Bock N, Germann TG, Niklasson AMN, Efficient Parallel Linear Scaling Construction of the Density Matrix for Born–Oppenheimer Molecular Dynamics, J. Chem. Theory Comput., 2015, 11 (10), pp 4644–4654. U N C L A S S I F I E D Slide 12

  13. SP2 Shared Memory – Significant Cost Reduction Density Matrix Construction Time (s) Polyethylene Diag. Serial 6 Diag. 16 Threads SP2, CSR Serial SP2, ELL 1 Thread SP2, ELL 4 Threads 4 SP2, ELL 16 Threads 2 0 0 1000 2000 3000 4000 5000 6000 Number of atoms Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yosuf J, Bock N, Germann TG, Niklasson AMN, Efficient Parallel Linear Scaling Construction of the Density Matrix for Born–Oppenheimer Molecular Dynamics, J. Chem. Theory Comput., 2015, 11 (10), pp 4644–4654. U N C L A S S I F I E D Slide 13

  14. Shared Memory SP2 – TRP Cage Protein 303 atom Trp Cage Protein LATTE Simulation – 18.8 ps solvated by 2682 water molecules (8349 atoms) -39027.8 -38850 Microcanonical Simulation Total energy (eV) -39027.9 Total energy (eV) -38900 -39028.0 -39028.1 -38950 6 8 10 12 14 16 18 Time (ps) -39000 NVT NVE (Thermalization) -39050 0 5 10 15 Time (ps) Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yosuf J, Bock N, Germann TG, Niklasson AMN, Efficient Parallel Linear Scaling Construction of the Density Matrix for Born–Oppenheimer Molecular Dynamics, J. Chem. Theory Comput. , 2015, 11 (10), pp 4644–4654. U N C L A S S I F I E D Slide 14

  15. Graph-based Electronic Structure Theory Divide and Conquer Sparse Matrix Algebra Graph Theory U N C L A S S I F I E D Slide 15 15

  16. Graph-based SP2 Data dependency Graph S τ S τ � ⇥ Fermi Operator Expansion ⇤ τ (global) halo core i o N n s ( i ) S τ = s ( i ) τ i =1 τ Recursive Fermi-operator expansion o N n n →∞ f n ( f n − 1 ( . . . f 0 ( h [ s ( i ) D τ = lim τ ]) . . . )) o N i =1 n h [ s ( i ) H = τ ]) i =1 Exact Relation! U N C L A S S I F I E D Slide 16

  17. Graph-based SP2 – Hybrid approach On Gpus! Niklasson AMN, et al, Graph-based Linear Scaling Electronic U N C L A S S I F I E D Structure Theory, http://arxiv.org/abs/1603.00937, 2016. Slide 17

  18. Graph Partitioning SP2 Halo Base X X X X X X X X X X X X X X X X X X X X X D t-1 X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X Graph of H Partitioned graph of H Subgraph of H Submatrix of H/D D t H t Graph Partitioning Subgraph Processing Structure-based 1. Determine core+halo Graph-based 2. Extract submatrix Hypergraph-based 3. Run Dense SP2 Community-based 4. Collect into next D Trivial parallelism Partitions are sets of core based on dense matrix algebra rows/orbitals of H at BLAS3 performance U N C L A S S I F I E D Slide 18

  19. Subgraph Processing For all sub-graphs: Determine elements Extract submatrix Run SP2 Assemble into D Determine elements Extract submatrix Run SP2 Assemble into D D t Determine elements Extract submatrix Run SP2 Assemble into D . . . . . . . . . . . . • Trivial parallelism • Same HOMO-LUMO sequence through SP2 • Dense matrix & communication-free SP2 • Small subgraphs, process single-threaded • Large subgraphs, process multi-threaded • Tunable accuracy - thresholds U N C L A S S I F I E D Slide 19

  20. Graph-based XL Born-Oppenheimer Molecular Dynamics (XL-BOMD) Liquid water (H 2 O) 100 DFTB-LATTE Niklasson AMN, et al, Graph-based Linear Scaling Electronic U N C L A S S I F I E D Structure Theory, http://arxiv.org/abs/1603.00937, 2016. Slide 20

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend