

Slide 1

U N C L A S S I F I E D

Distributed Graph-based Density Matrix Calculation for Quantum Molecular Dynamics using GPUs

April 4-7, 2016 2016 GPU Technology Conference

  • S. M. Mniszewski, C. F. A. Negre, M. J. Cawkwell, A. M. N. Niklasson

Los Alamos National Laboratory

smm@lanl.gov

Slide 2

Why Molecular Dynamics?

Slide 3

Background

  • In molecular dynamics simulation, the relative positions of atoms evolve over a series of time steps according to the force acting on each atom
  • Employed in materials science, chemistry, and biology to study structures, defects, and equilibrium and non-equilibrium phenomena
  • Depends on an interatomic potential to calculate forces and energies
  • Quantum-based models capture the making and breaking of covalent bonds (reactions), charge transfer between species of differing electronegativities, and long-range electrostatic interactions

Slide 4

Quantum Molecular Dynamics (QMD)

Integrate the equations of motion of classical molecular trajectories, with the forces calculated on the fly from a self-consistent quantum mechanical description of the electronic structure:

$$M_I \ddot{R}_I = -\frac{\partial U(R;\rho_{\mathrm{sc}})}{\partial R_I}, \qquad H[\rho]\Psi_i = \varepsilon_i \Psi_i, \qquad \rho = \sum_i^{\mathrm{occ}} |\Psi_i|^2 \rightarrow \rho_{\mathrm{sc}}$$

Each Born-Oppenheimer MD step iterates SCF cycles, U(R; ρ) → U(R; ρ_sc), at a cost of #SCF × O(N³).
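The "integrate with forces computed on the fly" loop can be sketched as a velocity Verlet integrator. This is a minimal sketch with illustrative names: `force_fn` is a stand-in for the self-consistent quantum force evaluation, here any callable returning forces.

```python
import numpy as np

def velocity_verlet(r, v, masses, force_fn, dt, n_steps):
    """Velocity Verlet integration of M_I d2R_I/dt2 = F_I.
    r, v: (N, 3) positions and velocities; masses: (N,) array.
    force_fn(r) returns the (N, 3) forces -- in QMD this would wrap
    the self-consistent electronic structure solve."""
    f = force_fn(r)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / masses[:, None]
        r = r + dt * v_half
        f = force_fn(r)          # the expensive on-the-fly QMD step
        v = v_half + 0.5 * dt * f / masses[:, None]
    return r, v
```

In QMD the `force_fn(r)` call dominates the cost of each step; the integrator bookkeeping is negligible.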

Slide 5

Example: Biosynthesis of Histidine

  • What is responsible for the biosynthesis of histidine?
  • Present in several pathogenic bacteria
  • Allosteric mechanism was determined with MD simulation
  • Several thousand atoms over timescales of 100s of ns!

Rivalta I, Sultan MM, Lee N-S, Manley GA, Loria JP, Batista VS. PNAS, 2012 109(22), E1428-E1436.

Slide 6

Future QMD Simulations

IAPP dimerization and type-2 diabetes (537 atoms). Beta amyloid peptide and Alzheimer's disease (410 atoms).

Raffa DF, Rauk A, J. Phys. Chem. B, 2007, 111 (14), 3789-99. Dupuis NF, Wu C, Shea J-E, Bowers MT, J. Am. Chem. Soc., 2011, 133 (19), 7240-43.

Slide 7

Computational Cost

  • High computational cost and complexity of QMD calculations
  • The density matrix construction is the most expensive part of each MD timestep
  • The second order spectral projection (SP2) algorithm is the breakthrough
  • Use of hybrid parallelism on GPU-accelerated clusters

[Figure: wall-clock time per MD time step vs. system size N for quantum molecular dynamics, comparing O(N³) diagonalization with regular O(N) linear scaling and low-prefactor O(N) linear scaling.]

Significant pre-factor reduction is required for practical large-scale QMD!

Slide 8

The Density Matrix Computation

  • Typically, algorithms used in quantum-based models, most notably matrix diagonalization, are not ideally suited to GPUs
    – Due to their complexity
    – Difficulty in extracting thread-level parallelism
    – Difficulty of avoiding branching within warps
  • New SP2 approach
    – Computed directly from the Hamiltonian through a recursive expansion of the Fermi operator with the second order spectral projection (SP2) algorithm
    – Based on a series of generalized matrix-matrix multiplications
    – Only one matrix-matrix multiplication is required per iteration
    – Maps very well to GPUs

Slide 9

The Second Order Spectral Projection Algorithm (SP2) – Reduced Complexity

[Figure: the projection polynomials f(x) = x² and f(x) = 2x − x² on [0, 1], and their repeated composition f₈(f₇(…f₁(x)…)) approaching a step function.]

Recursive Fermi operator expansion:

$$\rho = \theta(\mu I - H) = \lim_{i\to\infty} f_i[f_{i-1}[\ldots f_0[X_0]\ldots]]$$

$$X_0 = \frac{\varepsilon_{\max} I - H}{\varepsilon_{\max} - \varepsilon_{\min}}$$

$$f_i[X_i] = \begin{cases} X_i^2 & \text{if } 2\,\mathrm{Tr}[X_i] \ge N_e \\ 2X_i - X_i^2 & \text{if } 2\,\mathrm{Tr}[X_i] < N_e \end{cases}$$

Niklasson AMN, Phys. Rev. B 66, 155115 (2002).
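To see why the recursion converges, it helps to apply the two polynomials to a single eigenvalue x ∈ [0, 1]: repeated composition drives values on one side of the step toward 1 (occupied) and on the other side toward 0 (unoccupied). A minimal scalar sketch; the alternating branch sequence below is only an illustration, not the trace-driven sequence SP2 actually uses:

```python
def sp2_scalar(x, branches):
    """Apply the SP2 polynomials to a scalar x in [0, 1]:
    branch True -> f(x) = x**2 (pushes x toward 0),
    branch False -> f(x) = 2*x - x**2 (pushes x toward 1)."""
    for square in branches:
        x = x * x if square else 2 * x - x * x
    return x
```

Both polynomials fix 0 and 1, so alternating them sharpens the composition into a step function: eigenvalues well above the step saturate at 1, those below at 0.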

Slide 10

SP2 Algorithm Using the GPU Approach

    Estimate εmax and εmin
    X = (εmax·I − H) / (εmax − εmin)
    TraceX = Tr[X]                                  /* trace kernel on GPU */
    until converged do
        Xtmp = X
        Xtmp = X·X − Xtmp                           /* CUBLAS xGEMM */
        TraceXtmp = Tr[Xtmp]                        /* trace kernel on GPU */
        if |2·TraceX − 2·TraceXtmp − Ne| > |2·TraceX + 2·TraceXtmp − Ne| then
            X = X + Xtmp                            /* CUBLAS xAXPY */
            TraceX = TraceX + TraceXtmp
        else
            X = X − Xtmp                            /* CUBLAS xAXPY */
            TraceX = TraceX − TraceXtmp
        end if
    end until
    ρ = X

Cawkwell MJ, Mniszewski SM, Niklasson AMN, Fast Quantum Molecular Dynamics on Multi-GPU Architectures in LATTE, GTC 2013.
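The pseudocode above maps onto a dense NumPy sketch: the CUBLAS xGEMM/xAXPY calls become `@` and array updates, the trace kernel becomes `np.trace`, and the spectral bounds εmin/εmax are estimated with Gershgorin circles. Function and parameter names here are illustrative, not the LATTE API.

```python
import numpy as np

def sp2_density_matrix(H, n_electrons, max_iter=100, tol=1e-10):
    """Dense SP2: build the density matrix rho from the symmetric
    Hamiltonian H via the recursive x^2 / 2x - x^2 expansion
    (spin-paired: 2*Tr[rho] converges to n_electrons)."""
    n = H.shape[0]
    # Gershgorin circle estimates of the spectral bounds of H
    radii = np.sum(np.abs(H), axis=1) - np.abs(np.diag(H))
    e_min = float(np.min(np.diag(H) - radii))
    e_max = float(np.max(np.diag(H) + radii))
    # Map the spectrum of H into [0, 1] with its order reversed,
    # so occupied (low-energy) states sit near 1
    X = (e_max * np.eye(n) - H) / (e_max - e_min)
    trace = np.trace(X)
    for _ in range(max_iter):
        Xtmp = X @ X - X                 # the single xGEMM per iteration
        trace_tmp = np.trace(Xtmp)
        if abs(trace_tmp) < tol:         # Tr[X^2 - X] -> 0 at idempotency
            break
        # Pick whichever projection leaves 2*Tr[X] closer to n_electrons
        if abs(2 * trace - 2 * trace_tmp - n_electrons) > \
           abs(2 * trace + 2 * trace_tmp - n_electrons):
            X, trace = X + Xtmp, trace + trace_tmp      # f(x) = x^2
        else:
            X, trace = X - Xtmp, trace - trace_tmp      # f(x) = 2x - x^2
    return X
```

Since the eigenvalues of X stay in [0, 1], every term of Tr[X² − X] is non-positive, so its magnitude bounds the distance from idempotency and makes a safe convergence test.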

Slide 11

Density Matrix Calculation (Nvidia M2090) – Liquid Methane (10 – 1250 molecules)

[Figure: time per density matrix build (s) vs. matrix dimension (2000–10000), comparing SP2 on 1, 2, and 3 GPUs against SP2 on the CPU and diagonalization.]

Cawkwell MJ, Mniszewski SM, Niklasson AMN, Fast Quantum Molecular Dynamics on Multi-GPU Architectures in LATTE, GTC 2013.

Slide 12

Sparse Matrix SP2 – ELLPACK-R Format

  • Described by 3 arrays: 2-D values and column indices, plus a 1-D count of non-zero entries per row
  • N rows with at most M non-zeroes per row gives O(NM²) computational complexity
  • Row-wise storage for parallelism opportunities
  • No insertion cost compared to CSR

1" 1" 2" 4" 6" 1" 3" 1" 2" 4" 5" 1" 4" 5 1" 2" 6" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 3" 1" 3" 2" 2"

Values" Columns" #"Non4zeroes"

Dense"Matrix" Sparse"Matrix"

Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yosuf J, Bock N, Germann TG, Niklasson AMN, Efficient Parallel Linear Scaling Construction of the Density Matrix for Born–Oppenheimer Molecular Dynamics, J. Chem. Theory Comput., 2015, 11 (10), pp 4644–4654.
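The three-array layout can be sketched in plain NumPy. This is an illustrative sketch (function names are assumptions, not a library API); the row-wise matvec shows why the format maps one GPU thread per row.

```python
import numpy as np

def dense_to_ellpack_r(A, threshold=0.0):
    """Convert a dense matrix to ELLPACK-R: 2-D `values` and `columns`
    arrays padded to M = max non-zeros per row, plus a 1-D `nnz` array
    holding each row's actual non-zero count."""
    N = A.shape[0]
    nnz = np.array([(np.abs(A[i]) > threshold).sum() for i in range(N)])
    M = int(nnz.max())                       # max non-zeros in any row
    values = np.zeros((N, M))
    columns = np.zeros((N, M), dtype=np.int64)
    for i in range(N):
        cols = np.nonzero(np.abs(A[i]) > threshold)[0]
        values[i, :len(cols)] = A[i, cols]
        columns[i, :len(cols)] = cols
    return values, columns, nnz

def ellpack_r_matvec(values, columns, nnz, x):
    """y = A @ x: each row is an independent dot product over its own
    nnz[i] stored entries (the parallelism opportunity of the format)."""
    N = values.shape[0]
    y = np.zeros(N)
    for i in range(N):
        k = nnz[i]
        y[i] = values[i, :k] @ x[columns[i, :k]]
    return y
```

Unlike CSR, appending an entry to a row only touches that row's padded slots, which is the "no insertion cost" advantage the slide mentions.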

Slide 13

SP2 Shared Memory – Significant Cost Reduction

[Figure: density matrix construction time (s) vs. number of atoms (1000–6000) for polyethylene, comparing serial and 16-thread diagonalization against SP2 with CSR (serial) and with ELLPACK-R on 1, 4, and 16 threads.]

Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yosuf J, Bock N, Germann TG, Niklasson AMN, Efficient Parallel Linear Scaling Construction of the Density Matrix for Born–Oppenheimer Molecular Dynamics, J. Chem. Theory Comput., 2015, 11 (10), pp 4644–4654.

Slide 14

Shared Memory SP2 – TRP Cage Protein

[Figure: total energy (eV) vs. time (ps) for the NVT thermalization run (left) and the NVE run (right); in NVE the total energy stays near −39,028 eV, fluctuating by less than ~0.3 eV.]

Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yosuf J, Bock N, Germann TG, Niklasson AMN, Efficient Parallel Linear Scaling Construction of the Density Matrix for Born–Oppenheimer Molecular Dynamics, J. Chem. Theory Comput., 2015, 11 (10), pp 4644–4654.

303-atom Trp cage protein solvated by 2682 water molecules (8349 atoms). LATTE simulation, 18.8 ps: NVT (thermalization) followed by an NVE (microcanonical) simulation.

Slide 15

Graph-based Electronic Structure Theory

Sparse matrix algebra + divide and conquer + graph theory

Slide 16

Graph-based SP2

Data dependency graph S_τ: each row/orbital i has a core+halo subgraph s_τ^(i), and

$$S_\tau = \bigcup_{i=1}^{N} s_\tau^{(i)}, \qquad H = \bigcup_{i=1}^{N} h[s_\tau^{(i)}]$$

Recursive Fermi-operator expansion, applied subgraph by subgraph:

$$D_\tau = \bigcup_{i=1}^{N} \lim_{n\to\infty} f_n(f_{n-1}(\ldots f_0(h[s_\tau^{(i)}])\ldots))$$

The data dependency graph S_τ determines the τ-thresholded global Fermi operator expansion. Exact relation!

Slide 17

Graph-based SP2 – Hybrid Approach on GPUs

Niklasson AMN, et al, Graph-based Linear Scaling Electronic Structure Theory, http://arxiv.org/abs/1603.00937, 2016.

Slide 18

Graph Partitioning SP2

Graph of H → partitioned graph of H (partitions derived from D_{t-1}).

Graph partitioning options: structure-based, graph-based, hypergraph-based, community-based.

Subgraph processing:
  1. Determine core+halo
  2. Extract submatrix
  3. Run dense SP2
  4. Collect into the next D

[Figure: a core ("base") plus halo subgraph of H and the corresponding dense submatrix extracted from H_t and D_{t-1}.]

Trivial parallelism based on dense matrix algebra at BLAS3 performance. Partitions are sets of core rows/orbitals of H.

Slide 19

Subgraph Processing

For all subgraphs, in parallel: determine elements → extract submatrix → run SP2 → assemble into D_t.

  • Trivial parallelism
  • Same HOMO-LUMO sequence through SP2
  • Dense matrix and communication-free SP2
  • Small subgraphs are processed single-threaded
  • Large subgraphs are processed multi-threaded
  • Tunable accuracy via thresholds
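The four subgraph-processing steps can be sketched as follows. This is a simplified sketch with hypothetical names: the dense kernel here is an eigensolver-based projector standing in for dense SP2, and the per-partition occupations are supplied by the caller, whereas the actual method reuses the same global SP2 branch (HOMO-LUMO) sequence on every subgraph.

```python
import numpy as np

def density_projector(h_sub, n_occ):
    """Stand-in for the dense SP2 kernel: the projector onto the
    n_occ lowest-energy states of the core+halo submatrix."""
    _, v = np.linalg.eigh(h_sub)
    return v[:, :n_occ] @ v[:, :n_occ].T

def graph_partitioned_density(H, cores, halos, occupations):
    """Subgraph processing: for each partition, determine core+halo,
    extract the dense submatrix of H, run the dense kernel on it, and
    collect only the core rows into the assembled density matrix D."""
    N = H.shape[0]
    D = np.zeros((N, N))
    for core, halo, n_occ in zip(cores, halos, occupations):
        idx = np.array(sorted(set(core) | set(halo)))   # 1. core+halo
        h_sub = H[np.ix_(idx, idx)]                     # 2. extract submatrix
        d_sub = density_projector(h_sub, n_occ)         # 3. run dense kernel
        for a, i in enumerate(idx):                     # 4. collect core rows
            if i in core:
                D[i, idx] = d_sub[a, :]
    return D
```

Each loop iteration is independent of the others, which is the "trivial parallelism": subgraphs can be farmed out to separate GPUs with no communication during the SP2 iterations.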
Slide 20

Liquid water (H2O)100 DFTB-LATTE

Graph-based XL Born-Oppenheimer Molecular Dynamics (XL-BOMD)

Niklasson AMN, et al, Graph-based Linear Scaling Electronic Structure Theory, http://arxiv.org/abs/1603.00937, 2016.

Slide 21

Graph-based SP2 – Tunable Accuracy

[Figure: density matrix error ‖D − D_exact‖_F per atom (10⁻⁵ to 10⁻¹) vs. numerical threshold τ (10⁻⁸ to 10⁻⁴) for 512, 1024, and 2048 subgraphs. Polyalanine in water, 20,000 atoms.]

Niklasson AMN, et al, Graph-based Linear Scaling Electronic Structure Theory, http://arxiv.org/abs/1603.00937, 2016.

Slide 22

Graph-based SP2 – Distributed GPU Performance

[Figure: SP2 run time (0.5–128 s) vs. number of communities (64–4096), comparing sparse matrix algebra on 1 CPU (MKL) with graph partitioning on 1 CPU, 1 GPU, 16 GPUs, and 32 GPUs.]

Density matrix calculations for polyalanine in water, 20,000 atoms:
  • 64–4096 METIS partitions
  • Single Nvidia Tesla M2090 GPU per node
  • Partition core+halo sizes vary; load balancing is required
  • 16,384 GPU threads, still perfect strong scaling
  • Best: ~25 µs/atom

Niklasson AMN, et al, Graph-based Linear Scaling Electronic Structure Theory, http://arxiv.org/abs/1603.00937, 2016.

Slide 23

Graph-based SP2 – Distributed CPU vs. GPU

Density matrix calculations for polyalanine in water, 20,000 atoms:
  • 1024 and 2048 METIS partitions
  • 16–256 CPU/GPU nodes
  • MKL vs. cuBLAS
  • Similar results for 1024 and 2048 partitions
  • Near-linear scaling
  • Speedup of 1.7× on GPUs
  • Best: ~5 µs/atom

Slide 24

Basic Matrix Library (BML) & PROGRESS Library

bml C API

  • bml_multiply()
  • bml_add()
  • ...

bml Fortran API

  • call bml_multiply()
  • call bml_add()
  • ...

The core bml library supports dense, sparse ELLPACK, and sparse CSR matrix formats, with CPU/SMP, GPGPU, and MPI back ends.

Uses BLAS Level 2 and 3 routines.

BML available at http://qmmd.github.io/bml/ under the BSD 3-clause license

Slide 25

Summary

  • Distributed graph-based SP2 provides significant speedup on distributed GPU-accelerated architectures
  • Available as part of the BML & PROGRESS libraries
  • Allows QMD simulations of larger systems and longer timescales than previously possible