

Slide 1

U N C L A S S I F I E D

Distributed Graph-based Density Matrix Calculation for Quantum Molecular Dynamics using GPUs

April 4-7, 2016 2016 GPU Technology Conference

  • S. M. Mniszewski, C. F. A. Negre, M. J. Cawkwell, A. M. N. Niklasson

Los Alamos National Laboratory

smm@lanl.gov

Slide 2

Why Molecular Dynamics?

Slide 3

Background

  • In molecular dynamics simulation, the relative positions of atoms evolve over a series of time steps according to the force acting on each atom
  • Employed in materials science, chemistry, and biology to study structures, defects, and equilibrium and non-equilibrium phenomena
  • Depends on an interatomic potential to calculate forces and energies
  • Quantum-based models capture the making and breaking of covalent bonds (reactions), charge transfer between species of differing electronegativities, and long-range electrostatic interactions

Slide 4

Quantum Molecular Dynamics (QMD)

Integrate the equations of motion of classical molecular trajectories, with the forces calculated on the fly from a self-consistent quantum mechanical description of the electronic structure:

$$M_I \ddot{R}_I = -\frac{\partial U(R;\rho_{\mathrm{sc}})}{\partial R_I}, \qquad H[\rho]\Psi_i = \varepsilon_i \Psi_i, \qquad \rho = \sum_i^{\mathrm{occ}} |\Psi_i|^2 \rightarrow \rho_{\mathrm{sc}}$$

Each Born-Oppenheimer MD step iterates SCF cycles, U(R; ρ) → U(R; ρ_sc), at a cost of #SCF × O(N³).
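The "integrate with forces computed on the fly" loop can be sketched as a velocity Verlet integrator. This is a minimal sketch with illustrative names: `force_fn` is a stand-in for the self-consistent quantum force evaluation, here any callable returning forces.

```python
import numpy as np

def velocity_verlet(r, v, masses, force_fn, dt, n_steps):
    """Velocity Verlet integration of M_I d2R_I/dt2 = F_I.
    r, v: (N, 3) positions and velocities; masses: (N,) array.
    force_fn(r) returns the (N, 3) forces -- in QMD this would wrap
    the self-consistent electronic structure solve."""
    f = force_fn(r)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / masses[:, None]
        r = r + dt * v_half
        f = force_fn(r)          # the expensive on-the-fly QMD step
        v = v_half + 0.5 * dt * f / masses[:, None]
    return r, v
```

In QMD the `force_fn(r)` call dominates the cost of each step; the integrator bookkeeping is negligible.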

Slide 5

Example: Biosynthesis of Histidine

  • What is responsible for the biosynthesis of histidine?
  • Present in several pathogenic bacteria
  • Allosteric mechanism was determined with MD simulation
  • Several thousand atoms over timescales of 100s of ns!

Rivalta I, Sultan MM, Lee N-S, Manley GA, Loria JP, Batista VS. PNAS, 2012 109(22), E1428-E1436.

Slide 6

Future QMD Simulations

IAPP dimerization and type-2 diabetes (537 atoms). Beta amyloid peptide and Alzheimer's disease (410 atoms).

Raffa DF, Rauk A, J. Phys. Chem. B, 2007, 111 (14), 3789-99. Dupuis NF, Wu C, Shea J-E, Bowers MT, J. Am. Chem. Soc., 2011, 133 (19), 7240-43.

Slide 7

Computational Cost

  • High computational cost and complexity of QMD calculations
  • The density matrix construction is the most expensive part of each MD timestep
  • The second order spectral projection (SP2) algorithm is the breakthrough
  • Use of hybrid parallelism on GPU-accelerated clusters

[Figure: wall-clock time per MD time step vs. system size N for quantum molecular dynamics, comparing O(N³) diagonalization with regular O(N) linear scaling and low-prefactor O(N) linear scaling.]

Significant pre-factor reduction is required for practical large-scale QMD!

Slide 8

The Density Matrix Computation

  • Typically, algorithms used in quantum-based models, most notably matrix diagonalization, are not ideally suited to GPUs
    – Due to their complexity
    – Difficulty in extracting thread-level parallelism
    – Difficulty of avoiding branching within warps
  • New SP2 approach
    – Computed directly from the Hamiltonian through a recursive expansion of the Fermi operator with the second order spectral projection (SP2) algorithm
    – Based on a series of generalized matrix-matrix multiplications
    – Only one matrix-matrix multiplication is required per iteration
    – Maps very well to GPUs

Slide 9

The Second Order Spectral Projection Algorithm (SP2) – Reduced Complexity

[Figure: the projection polynomials f(x) = x² and f(x) = 2x − x² on [0, 1], and their repeated composition f₈(f₇(…f₁(x)…)) approaching a step function.]

Recursive Fermi operator expansion:

$$\rho = \theta(\mu I - H) = \lim_{i\to\infty} f_i[f_{i-1}[\ldots f_0[X_0]\ldots]]$$

$$X_0 = \frac{\varepsilon_{\max} I - H}{\varepsilon_{\max} - \varepsilon_{\min}}$$

$$f_i[X_i] = \begin{cases} X_i^2 & \text{if } 2\,\mathrm{Tr}[X_i] \ge N_e \\ 2X_i - X_i^2 & \text{if } 2\,\mathrm{Tr}[X_i] < N_e \end{cases}$$

Niklasson AMN, Phys. Rev. B 66, 155115 (2002).
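To see why the recursion converges, it helps to apply the two polynomials to a single eigenvalue x ∈ [0, 1]: repeated composition drives values on one side of the step toward 1 (occupied) and on the other side toward 0 (unoccupied). A minimal scalar sketch; the alternating branch sequence below is only an illustration, not the trace-driven sequence SP2 actually uses:

```python
def sp2_scalar(x, branches):
    """Apply the SP2 polynomials to a scalar x in [0, 1]:
    branch True -> f(x) = x**2 (pushes x toward 0),
    branch False -> f(x) = 2*x - x**2 (pushes x toward 1)."""
    for square in branches:
        x = x * x if square else 2 * x - x * x
    return x
```

Both polynomials fix 0 and 1, so alternating them sharpens the composition into a step function: eigenvalues well above the step saturate at 1, those below at 0.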

Slide 10

SP2 Algorithm Using the GPU Approach

    Estimate εmax and εmin
    X = (εmax·I − H) / (εmax − εmin)
    TraceX = Tr[X]                                  /* trace kernel on GPU */
    until converged do
        Xtmp = X
        Xtmp = X·X − Xtmp                           /* CUBLAS xGEMM */
        TraceXtmp = Tr[Xtmp]                        /* trace kernel on GPU */
        if |2·TraceX − 2·TraceXtmp − Ne| > |2·TraceX + 2·TraceXtmp − Ne| then
            X = X + Xtmp                            /* CUBLAS xAXPY */
            TraceX = TraceX + TraceXtmp
        else
            X = X − Xtmp                            /* CUBLAS xAXPY */
            TraceX = TraceX − TraceXtmp
        end if
    end until
    ρ = X

Cawkwell MJ, Mniszewski SM, Niklasson AMN, Fast Quantum Molecular Dynamics on Multi-GPU Architectures in LATTE, GTC 2013.
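The pseudocode above maps onto a dense NumPy sketch: the CUBLAS xGEMM/xAXPY calls become `@` and array updates, the trace kernel becomes `np.trace`, and the spectral bounds εmin/εmax are estimated with Gershgorin circles. Function and parameter names here are illustrative, not the LATTE API.

```python
import numpy as np

def sp2_density_matrix(H, n_electrons, max_iter=100, tol=1e-10):
    """Dense SP2: build the density matrix rho from the symmetric
    Hamiltonian H via the recursive x^2 / 2x - x^2 expansion
    (spin-paired: 2*Tr[rho] converges to n_electrons)."""
    n = H.shape[0]
    # Gershgorin circle estimates of the spectral bounds of H
    radii = np.sum(np.abs(H), axis=1) - np.abs(np.diag(H))
    e_min = float(np.min(np.diag(H) - radii))
    e_max = float(np.max(np.diag(H) + radii))
    # Map the spectrum of H into [0, 1] with its order reversed,
    # so occupied (low-energy) states sit near 1
    X = (e_max * np.eye(n) - H) / (e_max - e_min)
    trace = np.trace(X)
    for _ in range(max_iter):
        Xtmp = X @ X - X                 # the single xGEMM per iteration
        trace_tmp = np.trace(Xtmp)
        if abs(trace_tmp) < tol:         # Tr[X^2 - X] -> 0 at idempotency
            break
        # Pick whichever projection leaves 2*Tr[X] closer to n_electrons
        if abs(2 * trace - 2 * trace_tmp - n_electrons) > \
           abs(2 * trace + 2 * trace_tmp - n_electrons):
            X, trace = X + Xtmp, trace + trace_tmp      # f(x) = x^2
        else:
            X, trace = X - Xtmp, trace - trace_tmp      # f(x) = 2x - x^2
    return X
```

Since the eigenvalues of X stay in [0, 1], every term of Tr[X² − X] is non-positive, so its magnitude bounds the distance from idempotency and makes a safe convergence test.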

Slide 11

Density Matrix Calculation (Nvidia M2090) – Liquid Methane (10 – 1250 molecules)

[Figure: time per density matrix build (s) vs. matrix dimension (2000–10000), comparing SP2 on 1, 2, and 3 GPUs against SP2 on the CPU and diagonalization.]

Cawkwell MJ, Mniszewski SM, Niklasson AMN, Fast Quantum Molecular Dynamics on Multi-GPU Architectures in LATTE, GTC 2013.

Slide 12

Sparse Matrix SP2 – ELLPACK-R Format

  • Described by 3 arrays: 2-D values and column indices, plus a 1-D count of non-zero entries per row
  • N rows with at most M non-zeroes per row gives O(NM²) computational complexity
  • Row-wise storage for parallelism opportunities
  • No insertion cost compared to CSR

1" 1" 2" 4" 6" 1" 3" 1" 2" 4" 5" 1" 4" 5 1" 2" 6" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 1" 3" 1" 3" 2" 2"

Values" Columns" #"Non4zeroes"

Dense"Matrix" Sparse"Matrix"

Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yosuf J, Bock N, Germann TG, Niklasson AMN, Efficient Parallel Linear Scaling Construction of the Density Matrix for Born–Oppenheimer Molecular Dynamics, J. Chem. Theory Comput., 2015, 11 (10), pp 4644–4654.
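The three-array layout can be sketched in plain NumPy. This is an illustrative sketch (function names are assumptions, not a library API); the row-wise matvec shows why the format maps one GPU thread per row.

```python
import numpy as np

def dense_to_ellpack_r(A, threshold=0.0):
    """Convert a dense matrix to ELLPACK-R: 2-D `values` and `columns`
    arrays padded to M = max non-zeros per row, plus a 1-D `nnz` array
    holding each row's actual non-zero count."""
    N = A.shape[0]
    nnz = np.array([(np.abs(A[i]) > threshold).sum() for i in range(N)])
    M = int(nnz.max())                       # max non-zeros in any row
    values = np.zeros((N, M))
    columns = np.zeros((N, M), dtype=np.int64)
    for i in range(N):
        cols = np.nonzero(np.abs(A[i]) > threshold)[0]
        values[i, :len(cols)] = A[i, cols]
        columns[i, :len(cols)] = cols
    return values, columns, nnz

def ellpack_r_matvec(values, columns, nnz, x):
    """y = A @ x: each row is an independent dot product over its own
    nnz[i] stored entries (the parallelism opportunity of the format)."""
    N = values.shape[0]
    y = np.zeros(N)
    for i in range(N):
        k = nnz[i]
        y[i] = values[i, :k] @ x[columns[i, :k]]
    return y
```

Unlike CSR, appending an entry to a row only touches that row's padded slots, which is the "no insertion cost" advantage the slide mentions.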

Slide 13

SP2 Shared Memory – Significant Cost Reduction

[Figure: density matrix construction time (s) vs. number of atoms (1000–6000) for polyethylene, comparing serial and 16-thread diagonalization against SP2 with CSR (serial) and with ELLPACK-R on 1, 4, and 16 threads.]

Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yosuf J, Bock N, Germann TG, Niklasson AMN, Efficient Parallel Linear Scaling Construction of the Density Matrix for Born–Oppenheimer Molecular Dynamics, J. Chem. Theory Comput., 2015, 11 (10), pp 4644–4654.

Slide 14

Shared Memory SP2 – TRP Cage Protein

[Figure: total energy (eV) vs. time (ps) for the NVT thermalization run (left) and the NVE run (right); in NVE the total energy stays near −39,028 eV, fluctuating by less than ~0.3 eV.]

Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yosuf J, Bock N, Germann TG, Niklasson AMN, Efficient Parallel Linear Scaling Construction of the Density Matrix for Born–Oppenheimer Molecular Dynamics, J. Chem. Theory Comput., 2015, 11 (10), pp 4644–4654.

303-atom Trp cage protein solvated by 2682 water molecules (8349 atoms). LATTE simulation, 18.8 ps: NVT (thermalization) followed by an NVE (microcanonical) simulation.

Slide 15

Graph-based Electronic Structure Theory

Sparse matrix algebra + divide and conquer + graph theory

Slide 16

Graph-based SP2

Data dependency graph S_τ: each row/orbital i has a core+halo subgraph s_τ^(i), and

$$S_\tau = \bigcup_{i=1}^{N} s_\tau^{(i)}, \qquad H = \bigcup_{i=1}^{N} h[s_\tau^{(i)}]$$

Recursive Fermi-operator expansion, applied subgraph by subgraph:

$$D_\tau = \bigcup_{i=1}^{N} \lim_{n\to\infty} f_n(f_{n-1}(\ldots f_0(h[s_\tau^{(i)}])\ldots))$$

The data dependency graph S_τ determines the τ-thresholded global Fermi operator expansion. Exact relation!

Slide 17

Graph-based SP2 – Hybrid Approach on GPUs

Niklasson AMN, et al, Graph-based Linear Scaling Electronic Structure Theory, http://arxiv.org/abs/1603.00937, 2016.

Slide 18

Graph Partitioning SP2

Graph of H → partitioned graph of H (partitions derived from D_{t-1}).

Graph partitioning options: structure-based, graph-based, hypergraph-based, community-based.

Subgraph processing:
  1. Determine core+halo
  2. Extract submatrix
  3. Run dense SP2
  4. Collect into the next D

[Figure: a core ("base") plus halo subgraph of H and the corresponding dense submatrix extracted from H_t and D_{t-1}.]

Trivial parallelism based on dense matrix algebra at BLAS3 performance. Partitions are sets of core rows/orbitals of H.

Slide 19

Subgraph Processing

For all subgraphs, in parallel: determine elements → extract submatrix → run SP2 → assemble into D_t.

  • Trivial parallelism
  • Same HOMO-LUMO sequence through SP2
  • Dense matrix and communication-free SP2
  • Small subgraphs are processed single-threaded
  • Large subgraphs are processed multi-threaded
  • Tunable accuracy via thresholds
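The four subgraph-processing steps can be sketched as follows. This is a simplified sketch with hypothetical names: the dense kernel here is an eigensolver-based projector standing in for dense SP2, and the per-partition occupations are supplied by the caller, whereas the actual method reuses the same global SP2 branch (HOMO-LUMO) sequence on every subgraph.

```python
import numpy as np

def density_projector(h_sub, n_occ):
    """Stand-in for the dense SP2 kernel: the projector onto the
    n_occ lowest-energy states of the core+halo submatrix."""
    _, v = np.linalg.eigh(h_sub)
    return v[:, :n_occ] @ v[:, :n_occ].T

def graph_partitioned_density(H, cores, halos, occupations):
    """Subgraph processing: for each partition, determine core+halo,
    extract the dense submatrix of H, run the dense kernel on it, and
    collect only the core rows into the assembled density matrix D."""
    N = H.shape[0]
    D = np.zeros((N, N))
    for core, halo, n_occ in zip(cores, halos, occupations):
        idx = np.array(sorted(set(core) | set(halo)))   # 1. core+halo
        h_sub = H[np.ix_(idx, idx)]                     # 2. extract submatrix
        d_sub = density_projector(h_sub, n_occ)         # 3. run dense kernel
        for a, i in enumerate(idx):                     # 4. collect core rows
            if i in core:
                D[i, idx] = d_sub[a, :]
    return D
```

Each loop iteration is independent of the others, which is the "trivial parallelism": subgraphs can be farmed out to separate GPUs with no communication during the SP2 iterations.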
Slide 20

Liquid water (H2O)100 DFTB-LATTE

Graph-based XL Born-Oppenheimer Molecular Dynamics (XL-BOMD)

Niklasson AMN, et al, Graph-based Linear Scaling Electronic Structure Theory, http://arxiv.org/abs/1603.00937, 2016.

Slide 21

Graph-based SP2 – Tunable Accuracy

[Figure: density matrix error ‖D − D_exact‖_F per atom (10⁻⁵ to 10⁻¹) vs. numerical threshold τ (10⁻⁸ to 10⁻⁴) for 512, 1024, and 2048 subgraphs. Polyalanine in water, 20,000 atoms.]

Niklasson AMN, et al, Graph-based Linear Scaling Electronic Structure Theory, http://arxiv.org/abs/1603.00937, 2016.

Slide 22

Graph-based SP2 – Distributed GPU Performance

[Figure: SP2 run time (0.5–128 s) vs. number of communities (64–4096), comparing sparse matrix algebra on 1 CPU (MKL) with graph partitioning on 1 CPU, 1 GPU, 16 GPUs, and 32 GPUs.]

Density matrix calculations for polyalanine in water, 20,000 atoms:
  • 64–4096 METIS partitions
  • Single Nvidia Tesla M2090 GPU per node
  • Partition core+halo sizes vary; load balancing is required
  • 16,384 GPU threads, still perfect strong scaling
  • Best: ~25 µs/atom

Niklasson AMN, et al, Graph-based Linear Scaling Electronic Structure Theory, http://arxiv.org/abs/1603.00937, 2016.

Slide 23

Graph-based SP2 – Distributed CPU vs. GPU

Density matrix calculations for polyalanine in water, 20,000 atoms:
  • 1024 and 2048 METIS partitions
  • 16–256 CPU/GPU nodes
  • MKL vs. cuBLAS
  • Similar results for 1024 and 2048 partitions
  • Near-linear scaling
  • Speedup of 1.7× on GPUs
  • Best: ~5 µs/atom

Slide 24

Basic Matrix Library (BML) & PROGRESS Library

bml C API

  • bml_multiply()
  • bml_add()
  • ...

bml Fortran API

  • call bml_multiply()
  • call bml_add()
  • ...

The core bml library supports dense, sparse ELLPACK, and sparse CSR matrix formats, with CPU/SMP, GPGPU, and MPI back ends.

Uses BLAS Level 2 and 3 routines.

BML available at http://qmmd.github.io/bml/ under the BSD 3-clause license

Slide 25

Summary

  • Distributed graph-based SP2 provides significant speedup on distributed GPU-accelerated architectures
  • Available as part of the BML & PROGRESS libraries
  • Allows QMD simulations of larger systems and longer timescales than previously possible