Distributed Graph-based Density Matrix Calculation for Quantum Molecular Dynamics using GPUs S. M. Mniszewski, C. F. A. Negre, M. J. Cawkwell, A. M. N. Niklasson Los Alamos National Laboratory smm@lanl.gov 2016 GPU Technology Conference April 4-7, 2016 U N C L A S S I F I E D Slide 1
Why Molecular Dynamics? U N C L A S S I F I E D Slide 2
Background • In molecular dynamics simulation, the relative positions of atoms evolve over a series of time steps according to the force acting on each atom • Employed in materials science, chemistry, and biology to study structures, defects, and equilibrium and non-equilibrium phenomena • Dependence on an interatomic potential to calculate forces and energy • Quantum-based models capture the making and breaking of covalent bonds, charge transfer between species of differing electronegativities, and long-range electrostatic interactions - reactions U N C L A S S I F I E D Slide 3
Quantum Molecular Dynamics (QMD) Integrate the equations of motion Born-Oppenheimer MD of classical molecular trajectories R I = − ∂ U ( R ; ρ sc ) SCF � M I ¨ SCF � U ( R ; ρ ) ∂ R I SCF � with the forces calculated on the fly from a self-consistent quantum U ( R ; ρ sc ) mechanical description of the electronic structure: H [ ρ ] Ψ i = ε i Ψ i SCF � → ρ sc X | Ψ i | 2 ρ = #SCF × O ( N 3 ) occ . U N C L A S S I F I E D Slide 4
Example: Biosynthesis of Histidine • What is responsible for the biosynthesis of histidine? • Present in several pathogenic bacteria • Allosteric mechanism was determined with MD simulation • Several thousand atoms over timescales of 100s of ns! Rivalta I, Sultan MM, Lee N-S, Manley GA, Loria JP, Batista VS. PNAS, 2012 109(22), E1428-E1436. U N C L A S S I F I E D Slide 5
Future QMD Simulations IAPP dimerization and type-2 diabetes (537 atoms). Beta Amyloid Peptide and Alzheimer’s Disease (410 atoms). Raffa DF, A. Rauk A, J. Phys. Chem. B, 2 MT007, 111 (14), 3789-99. Dupuis NF, Wu C, Shea J-E,,Bowers, MT J. Am. Chem. Soc., 2011, 133 (19), 7240-43. U N C L A S S I F I E D Slide 6
Computational Cost • High computational cost Quantum Molecular Dynamics and complexity of QMD Wall-Clock Time / MD Time Step calculations 3 ) Diagonalization O(N • The MD timestep is the O(N) Regular linear scaling O(N) Low pre-factor most expensive – the density matrix construction Significant pre-factor reduction � is required for practical � • The second order spectral large scale QMD! � projection (SP2) algorithm breakthrough System Size (N) • Use of hybrid parallelism on GPU-accelerated clusters U N C L A S S I F I E D Slide 7
The Density Matrix Computation • Typically, algorithms used in quantum-based models, most notably matrix diagonalization, are not ideally suited to GPUs – Due to their complexity – Difficulty in extracting thread-level parallelism – Difficulty of avoiding branching within warps • New SP2 approach – Computed directly from the Hamiltonian through a recursive expansion of the Fermi Operator with the second order spectral projection (SP2) algorithm – Based on a series of generalized matrix-matrix multiplications – Only one matrix-matrix multiplication is required per iteration – Maps very well to GPUs U N C L A S S I F I E D Slide 8
The Second Order Spectral Projection Algorithm (SP2) – Reduced Complexity Recursive Fermi Operator expansion ρ = θ µ I − H ' = lim i →∞ f i [ f i − 1 [ … f 0 [ X 0 ] … ]] $ & % 2 f(x) = x 1 2 f(x) = 2x - x X 0 = ε max I − H f 8 (f 7 (...f 1 (x) ...)) f(x) 0.5 ε max − ε min 0 2 if 2Tr[ X i ] ≥ N e X i 0 0.5 1 x f i [ X i ] = 2 if 2Tr[ X i ] < N e 2 X i − X i Niklasson AMN, Phys. Rev. B 66, 155115 (2002). U N C L A S S I F I E D Slide 9
SP2 Algorithm Using the GPU Approach Estimate ε max and ε min X = ( ε max I - H )/( ε max - ε min ) TraceX = Tr[ X ] /* Trace kernel on GPU */ Until converged do X tmp = X X tmp = X 2 + X tmp /*CUBLAS xGEMM */ TraceX tmp = Tr[ X tmp ] /*Trace kernel on GPU */ if |2TraceX – 2TraceX tmp – N e | > |2TraceX + 2TraceX tmp –N e | X = X + X tmp /* CUBLAS xAXPY */ TraceX = TraceX + TraceX tmp /* CUBLAS xAXPY */ else X = X – X tmp /* CUBLAS xAXPY */ TraceX = TraceX – TraceX tmp end until ρ = X Cawkwell MJ, Mniszewski SM, Niklasson AMN, Fast Quantum Molecular Dynamics on Multi-GPU Architectures in LATTE, GTC 2013. U N C L A S S I F I E D Slide 10
Density Matrix Calculation (Nvidia M2090) – Liquid Methane (10 – 1250 molecules) 90 SP2: 1 GPU Time per density matrix build (s) SP2: 2 GPUs SP2: 3 GPUs SP2: CPU Diagonalization 60 30 0 0 2000 4000 6000 8000 10000 Matrix dimension Cawkwell MJ, Mniszewski SM, Niklasson AMN, Fast Quantum Molecular Dynamics on Multi-GPU Architectures in LATTE, GTC 2013. U N C L A S S I F I E D Slide 11
Sparse Matrix SP2 – ELLPACK-R Format Sparse"Matrix" Dense"Matrix" Values" Columns" #"Non4zeroes" 1" 1" 1" 1" 1" 1" 1" 1" 2" 4" 6" 1" 1" 1" 3" 3" 1" 1" 1" 1" 1" 1" 1" 1" 2" 4" 5" 1" 1" 1" 3" 4" 5 1" 1" 1" 1" 2" 1" 1" 1" 1" 2" 6" 1" 2" • Described by 3 arrays, 2-D values and indices, 1-D non-zero entries per row • N rows and M max non-zeroes per row, O(Nm 2 ) computational complexity • Row-wise storage for parallelism opportunities • No insertion cost compared to CSR Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yosuf J, Bock N, Germann TG, Niklasson AMN, Efficient Parallel Linear Scaling Construction of the Density Matrix for Born–Oppenheimer Molecular Dynamics, J. Chem. Theory Comput., 2015, 11 (10), pp 4644–4654. U N C L A S S I F I E D Slide 12
SP2 Shared Memory – Significant Cost Reduction Density Matrix Construction Time (s) Polyethylene Diag. Serial 6 Diag. 16 Threads SP2, CSR Serial SP2, ELL 1 Thread SP2, ELL 4 Threads 4 SP2, ELL 16 Threads 2 0 0 1000 2000 3000 4000 5000 6000 Number of atoms Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yosuf J, Bock N, Germann TG, Niklasson AMN, Efficient Parallel Linear Scaling Construction of the Density Matrix for Born–Oppenheimer Molecular Dynamics, J. Chem. Theory Comput., 2015, 11 (10), pp 4644–4654. U N C L A S S I F I E D Slide 13
Shared Memory SP2 – TRP Cage Protein 303 atom Trp Cage Protein LATTE Simulation – 18.8 ps solvated by 2682 water molecules (8349 atoms) -39027.8 -38850 Microcanonical Simulation Total energy (eV) -39027.9 Total energy (eV) -38900 -39028.0 -39028.1 -38950 6 8 10 12 14 16 18 Time (ps) -39000 NVT NVE (Thermalization) -39050 0 5 10 15 Time (ps) Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yosuf J, Bock N, Germann TG, Niklasson AMN, Efficient Parallel Linear Scaling Construction of the Density Matrix for Born–Oppenheimer Molecular Dynamics, J. Chem. Theory Comput. , 2015, 11 (10), pp 4644–4654. U N C L A S S I F I E D Slide 14
Graph-based Electronic Structure Theory Divide and Conquer Sparse Matrix Algebra Graph Theory U N C L A S S I F I E D Slide 15 15
Graph-based SP2 Data dependency Graph S τ S τ � ⇥ Fermi Operator Expansion ⇤ τ (global) halo core i o N n s ( i ) S τ = s ( i ) τ i =1 τ Recursive Fermi-operator expansion o N n n →∞ f n ( f n − 1 ( . . . f 0 ( h [ s ( i ) D τ = lim τ ]) . . . )) o N i =1 n h [ s ( i ) H = τ ]) i =1 Exact Relation! U N C L A S S I F I E D Slide 16
Graph-based SP2 – Hybrid approach On Gpus! Niklasson AMN, et al, Graph-based Linear Scaling Electronic U N C L A S S I F I E D Structure Theory, http://arxiv.org/abs/1603.00937, 2016. Slide 17
Graph Partitioning SP2 Halo Base X X X X X X X X X X X X X X X X X X X X X D t-1 X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X Graph of H Partitioned graph of H Subgraph of H Submatrix of H/D D t H t Graph Partitioning Subgraph Processing Structure-based 1. Determine core+halo Graph-based 2. Extract submatrix Hypergraph-based 3. Run Dense SP2 Community-based 4. Collect into next D Trivial parallelism Partitions are sets of core based on dense matrix algebra rows/orbitals of H at BLAS3 performance U N C L A S S I F I E D Slide 18
Subgraph Processing For all sub-graphs: Determine elements Extract submatrix Run SP2 Assemble into D Determine elements Extract submatrix Run SP2 Assemble into D D t Determine elements Extract submatrix Run SP2 Assemble into D . . . . . . . . . . . . • Trivial parallelism • Same HOMO-LUMO sequence through SP2 • Dense matrix & communication-free SP2 • Small subgraphs, process single-threaded • Large subgraphs, process multi-threaded • Tunable accuracy - thresholds U N C L A S S I F I E D Slide 19
Graph-based XL Born-Oppenheimer Molecular Dynamics (XL-BOMD) Liquid water (H 2 O) 100 DFTB-LATTE Niklasson AMN, et al, Graph-based Linear Scaling Electronic U N C L A S S I F I E D Structure Theory, http://arxiv.org/abs/1603.00937, 2016. Slide 20
Recommend
More recommend