towards highly scalable ab initio molecular dynamics (aimd) on the intel knights landing manycore processor Mathias Jacquelin mjacquelin@lbl.gov Wibe De Jong wadejong@lbl.gov Computational Research Department, Lawrence Berkeley National Laboratory
introduction: plane wave methods
QM-CC QM-DFT AIMD QM/MM MM
∙ 100-1000 atoms ∙ Uses plane wave basis ∙ Many FFTs ∙ SUMMA-like for global operations
Energy gradient and orthonormality constraint:
∂E/∂Ψ = (−1/2)∇²Ψ + VextΨ + VHΨ + VxcΨ, with ⟨Ψi|Ψj⟩ = δij
Cost of each term:
∙ (−1/2)∇²Ψ: Ne·Npack
∙ VextΨ: (Na·Npack + Ng·log(Npack) + Ne·Npack) + Na·Ne·Npack
∙ VHΨ: Ne·Ng·log(Ng) + Ne·Ng + 2·Ng·log(Ng) + Ng + Ne·Ng
∙ VxcΨ: Ne·Ng·log(Ng) + Ne·Ng
∙ orthogonality: Ne²·Npack + Ne³
Na - number of atoms, Ne - number of electrons, Ng - size of FFT grid, Npack - size of reciprocal space
1/22
introduction: plane wave discretization
Hψi(r) = [−(1/2)∇² + VL(r) + (1 − α)Vx[ρ](r) + Vc[ρ](r) + VNL + VH[ρ](r)] ψi(r) − α Σj Kij(r)ψj(r)
∇²VH,X,C(r) = −4πρ(r)    ∇²Kij(r) = −4πψi(r)ψj(r)    ρ(r) = Σi=1..N |ψi(r)|²    ∫ω ψi(r)ψj(r)dr = δij
∙ Kinetic and nonlocal terms: matrix multiplication in reciprocal space ∙ Poisson solves: Ne 3D-FFTs (exact exchange: (Ne + 1)Ne 3D-FFTs) ∙ Density ∙ Orthogonality: matrix multiplication
2/22
introduction: plane wave dft solutions
∙ Avoid direct diagonalization because of large basis sets (much larger than Gaussian basis sets)
∙ Instead, evaluate the wave function gradient using a conjugate gradient algorithm to solve the DFT equations
∙ Kinetic and nonlocal pseudopotential = matrix multiplications in reciprocal space (Ng × Ne, Ng = 96³)
∙ Local pseudopotential, Coulomb, and exchange-correlation = Ne FFTs
∙ Exact exchange ((Ne + 1)Ne FFTs), nonlocal pseudopotential, and wavefunction orthogonalization
Expensive parts: global operations
∙ 20 ps of simulation time ≈ 200,000 steps ∙ 1 s/step = 2-3 days ∙ 10 s/step = 23 days ∙ 30 s/step = 70 days
∙ Mesoscale phenomena at longer time scales ∙ Assume 1 s/step ∙ 100 ps = 10-15 days ∙ 1 ns = 100-150 days
3/22
3d ffts
∙ At every AIMD step, perform:
∙ Ne Reverse 3D FFTs ∙ Ne Forward 3D FFTs
∙ In reciprocal space, a sphere of radius Ecut is stored ∙ Each forward FFT is performed in 6 steps:
- 1. Unpack sphere into a 3D cube (z,x,y)
- 2. Perform nx × ny 1D FFTs along the z-dimension
- 3. Rotate the cube (z, x, y) → (y, z, x)
- 4. Perform nz × nx 1D FFTs along the y-dimension
- 5. Rotate the cube (y, z, x) → (x, y, z)
- 6. Perform ny × nz 1D FFTs along the x-dimension
∙ Reverse FFTs: steps in reverse order
4/22
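The rotation steps above, e.g. (z, x, y) → (y, z, x), are plain axis permutations of the cube. A minimal sketch of one such rotation, assuming a row-major layout where the last listed axis is contiguous (the function name and layout are illustrative, not the actual NWChem kernel):

```cpp
#include <cstddef>
#include <vector>

// Rotate a cube stored in (z,x,y) axis order into (y,z,x) order.
// Row-major: src element (iz,ix,iy) lives at src[(iz*nx + ix)*ny + iy];
// after rotation it lands at dst[(iy*nz + iz)*nx + ix].
std::vector<double> rotate_zxy_to_yzx(const std::vector<double>& src,
                                      std::size_t nz, std::size_t nx,
                                      std::size_t ny) {
    std::vector<double> dst(src.size());
    for (std::size_t iz = 0; iz < nz; ++iz)
        for (std::size_t ix = 0; ix < nx; ++ix)
            for (std::size_t iy = 0; iy < ny; ++iy)
                dst[(iy * nz + iz) * nx + ix] = src[(iz * nx + ix) * ny + iy];
    return dst;
}
```

After this permutation the 1D FFTs of the next step run over the new contiguous axis, which is what makes the 6-step decomposition cache-friendly.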
pipe-lined 3d ffts
∙ At each AIMD step:
∙ Ne Forward 3D FFTs ∙ Ne Reverse 3D FFTs ∙ 3D FFT steps are pipe-lined
5/22
preserving orthogonality: lagrange multipliers
∙ At each AIMD step, wave functions need to be orthogonalized ∙ Lagrange multiplier method:
∙ Matrix Riccati equations are solved ∙ Expensive step, critical to scalability
∙ Mainly matrix-matrix multiplications:
∙ M is Ne-by-Ne, F is Npack-by-Ne (or transpose) ∙ 3 letters to describe a C = A × B matrix product ∙ First letter for A, second for B, and third for C
FFM: Fᵀ (Ne × Npack) · F (Npack × Ne) = M (Ne × Ne)
MMM: M (Ne × Ne) · M (Ne × Ne) = M (Ne × Ne)
FMF: F (Npack × Ne) · M (Ne × Ne) = F (Npack × Ne)
6/22
lagrange multipliers: flowchart of operations
FFM: ψ1ᵀ · ψ1   FFM: ψ2ᵀ · ψ1   FFM: ψ2ᵀ · ψ2
MMM iterations on the overlap matrices
FMF: ψ2 = ψ1 ∗ ov
1. Compute overlap matrices between the ψ1 and ψ2 wave functions (FFM)
2. Iterate to solve the matrix Riccati equation over the overlap matrices (MMM)
3. Compute a ψ2 wave function orthogonal to ψ1 (FMF)
7/22
what’s wrong with lagrange multipliers?
∙ Vendor libraries provide good multi-threaded BLAS kernels ∙ Heatmap collected on Intel Knights Corner
∙ FFM type of matrix product ∙ Matrix C is Ne-by-Ne
∙ Not efficient for all matrix dimensions ∙ FFM falls in the “dark area” for practical problem sizes
[Heatmap: MKL GFLOP/s with 240 threads, over Ne and Npack each ranging from 40 to 15000]
8/22
parallelization strategy for lagrange multipliers
∙ nthr is the number of threads being used ∙ Matrix B is entirely accessed by each thread ∙ Parallelization along the Npack rows of A and C ∙ Each thread updates Npack/nthr rows of C
[Figure: row blocks A0..A3 and C0..C3 of height Npack/nthr; B read by all threads]
Parallelization strategy used for the FMF operations
9/22
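The FMF strategy above can be sketched with plain loops, using std::thread in place of the OpenMP threads of the actual code (the function name and the naive triple loop are illustrative; the real kernel would call a BLAS dgemm per row block):

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// FMF-style product: C (npack x ne) = A (npack x ne) * B (ne x ne), row-major.
// Rows of A and C are partitioned across nthr threads; B is read by everyone.
void fmf_multiply(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, int npack, int ne, int nthr) {
    auto worker = [&](int tid) {
        int chunk = (npack + nthr - 1) / nthr;           // rows per thread
        int lo = tid * chunk, hi = std::min(npack, lo + chunk);
        for (int i = lo; i < hi; ++i)                     // this thread's rows
            for (int j = 0; j < ne; ++j) {
                double s = 0.0;
                for (int k = 0; k < ne; ++k) s += A[i * ne + k] * B[k * ne + j];
                C[i * ne + j] = s;  // each thread owns its row block: no races
            }
    };
    std::vector<std::thread> pool;
    for (int t = 0; t < nthr; ++t) pool.emplace_back(worker, t);
    for (auto& th : pool) th.join();
}
```

Because the output rows are disjoint, no reduction or locking is needed, which is why FMF parallelizes more easily than FFM.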
parallelization strategy for lagrange multipliers
∙ A and B matrices are split along their common dimension (Npack) ∙ Each thread contributes to the entire matrix C in a temporary buffer ∙ Contributions are finally reduced
[Figure: slabs A0..A3 and B0..B3 along Npack; step 1: multiply into per-thread buffers C0..C3, step 2: reduce into C]
Parallelization strategy used for the FFM operations Ctmpi = A(∗, i × nb : (i + 1) × nb) × B(i × nb : (i + 1) × nb, ∗) Three FFMs computed concurrently on nthr/3 threads
10/22
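A minimal sketch of the FFM reduction strategy, again with std::thread and naive loops standing in for OpenMP and BLAS. The lock-based reduction of the actual code (next slide) is replaced here by a serial reduction after the join, purely for clarity:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// FFM-style product: C (ne x ne) = A^T * B, where A and B are npack x ne
// row-major. The common dimension npack is split across threads; each thread
// accumulates a private ne x ne partial product, reduced into C afterwards.
void ffm_multiply(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, int npack, int ne, int nthr) {
    std::vector<double> partial(static_cast<std::size_t>(nthr) * ne * ne, 0.0);
    auto worker = [&](int tid) {
        double* P = &partial[static_cast<std::size_t>(tid) * ne * ne];
        int chunk = (npack + nthr - 1) / nthr;            // slab per thread
        int lo = tid * chunk, hi = std::min(npack, lo + chunk);
        for (int k = lo; k < hi; ++k)                     // thread-private slab
            for (int i = 0; i < ne; ++i)
                for (int j = 0; j < ne; ++j)
                    P[i * ne + j] += A[k * ne + i] * B[k * ne + j];
    };
    std::vector<std::thread> pool;
    for (int t = 0; t < nthr; ++t) pool.emplace_back(worker, t);
    for (auto& th : pool) th.join();
    for (int t = 0; t < nthr; ++t)                        // reduce partials
        for (int ij = 0; ij < ne * ne; ++ij)
            C[ij] += partial[static_cast<std::size_t>(t) * ne * ne + ij];
}
```

The per-thread buffers are what makes the dark-area FFM shapes (tall-skinny A and B, tiny C) scale: every thread gets a full slab of work regardless of how small C is.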
a quick look at the code
double overlap[3][N_e*N_e];
//Initialize OpenMP locks
omp_lock_t reduce_locks[3];
omp_init_lock(&reduce_locks[0]);
omp_init_lock(&reduce_locks[1]);
omp_init_lock(&reduce_locks[2]);
//Allocate temporary buffers
int mthr = omp_get_max_threads();
double * buffer = new double[N_e*N_e*mthr];
#pragma omp parallel
while(...) {
  int tid = omp_get_thread_num();
  int nthr = omp_get_num_threads();
  //Get the tid-th temporary buffer
  double * thBuffer = buffer + N_e*N_e*tid;
  //Zero out each overlap matrix
  #pragma omp for
  ...
  if ( tid < nthr / 3 ) {
    //Compute local matrix product 1
    thBuffer = ...
    omp_set_lock(&reduce_locks[0]);
    //Reduce thBuffer into overlap[0]
    overlap[0] += thBuffer ...
    omp_unset_lock(&reduce_locks[0]);
  } else if ( tid >= nthr / 3 && tid < 2 * nthr / 3 ) {
    //Compute local matrix product 2
    thBuffer = ...
    omp_set_lock(&reduce_locks[1]);
    //Reduce thBuffer into overlap[1]
    overlap[1] += thBuffer ...
    omp_unset_lock(&reduce_locks[1]);
  } else {
    //Compute local matrix product 3
    thBuffer = ...
    omp_set_lock(&reduce_locks[2]);
    //Reduce thBuffer into overlap[2]
    overlap[2] += thBuffer ...
    omp_unset_lock(&reduce_locks[2]);
  }
  #pragma omp barrier
  //All three FFMs are now complete
  ...
}
//Delete temporary buffers
delete [] buffer;
//Destroy OpenMP locks
omp_destroy_lock(&reduce_locks[2]);
omp_destroy_lock(&reduce_locks[1]);
omp_destroy_lock(&reduce_locks[0]);
11/22
hardware platform
∙ Experiments conducted on NERSC Cori
∙ 14 PFlop/s peak (27 PFlop/s theoretical) ∙ 5th place in the 2016 TOP500 ranking
∙ Intel Knights Landing partition ∙ Intel Xeon Phi 7250 processor
∙ 68 cores ∙ 272 hardware threads (4 threads per core) ∙ 96GB DRAM ∙ 16GB MCDRAM
∙ Cray Dragonfly interconnect
12/22
a roofline model of the planewave code
[Roofline plot: GFLOP/s vs. arithmetic intensity (FLOPs/byte). Ceilings: L1 6762.71 GB/s, L2 2099.42 GB/s, MCDRAM 405.44 GB/s, DRAM 40.14 GB/s; 2521.47 GFLOP/s maximum. Data points: Lagrange (MT BLAS) for Water 32, Water 64, and Water 128]
Roofline model of a KNL node. Collected using the Empirical Roofline Toolkit (ERT)
13/22
a roofline model of the planewave code (optimized)
[Roofline plot: same ceilings as above; data points for Lagrange (MT BLAS) and the optimized Lagrange, each for Water 32, Water 64, and Water 128]
Roofline model of a KNL node. Higher arithmetic intensity, higher performance
14/22
strong scalability of lagrange: 32 water molecules
[Plot: run times (s, log scale) of the Lagrange multipliers, FFM, and FMF on 32 water molecules vs. thread count (1 to 64), multi-threaded BLAS vs. optimized versions]
Npack = 53,228, Ne = 128. Speedup of 9x on average
15/22
strong scalability of lagrange: 64 water molecules
[Plot: run times (s, log scale) of the Lagrange multipliers, FFM, and FMF on 64 water molecules vs. thread count (1 to 64), multi-threaded BLAS vs. optimized versions]
Npack = 106,456, Ne = 256. Speedup of 20x on average
16/22
strong scalability of lagrange: 128 water molecules
[Plot: run times (s, log scale) of the Lagrange multipliers, FFM, and FMF on 128 water molecules vs. thread count (1 to 64), multi-threaded BLAS vs. optimized versions]
Npack = 212,912, Ne = 512. Speedup of 21x on average
17/22
parallel efficiency of lagrange: 128 water molecules
[Plot: parallel efficiency (%) of the Lagrange multipliers, FFM, and FMF on 128 water molecules vs. thread count (1 to 64), multi-threaded BLAS vs. optimized versions]
Npack = 212,912, Ne = 512. Efficiency of 60% when using 68 threads
18/22
strong scalability of the full aimd step for 64 water molecules
[Plot: run times (s, log scale) of the full AIMD step and its components (Lagrange multipliers, FFM, FMF, queue, 3D FFTs, non-local pseudopotentials) on 64 water molecules vs. thread count (1 to 64)]
∙ Overall speedup of 7.8x over the sequential run ∙ Lagrange multipliers speedup of 21x ∙ Pipe-lined 3D FFTs speedup of 15.5x ∙ Non-local pseudopotentials speedup of 5.5x ∙ Similar to Lagrange, with a smaller C matrix
19/22
strong scalability on intel knl vs. intel haswell
[Plot: run times (s) of the full AIMD step on 64 water molecules vs. number of threads (1 to 64), Intel Knights Landing node vs. Intel Haswell node]
∙ The Intel Haswell node has 32 cores (2 × 16) ∙ Speedup of a full KNL node over a full Haswell node is 1.8x
20/22
conclusion
∙ Conclusion
∙ Large speedups over the straightforward approach ∙ Data reuse is critical ∙ The BSP programming style does not scale well ∙ Exploit concurrency when scalability is limited
∙ Lessons learned
∙ An OpenMP reduction cannot reuse the same buffer in FFM ∙ Allocating memory through OpenMP at every FFM is too expensive ∙ Thread placement is essential ∙ Many algorithms do not have long vector loops; heterogeneous models will be useful ∙ A lot of time is spent synchronizing threads
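The buffer-reuse lesson can be illustrated with a toy pattern: per-thread scratch space allocated once before the time-stepping loop and reused at every step, instead of being reallocated inside each kernel call (std::thread and all names below are illustrative, not the NWChem implementation):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// One scratch workspace per worker, allocated once and reused across steps.
struct Scratch {
    std::vector<std::vector<double>> buf;  // buf[tid] is thread tid's workspace
    Scratch(int nthr, std::size_t n) : buf(nthr, std::vector<double>(n, 0.0)) {}
};

double run_steps(int nsteps, int nthr, std::size_t n) {
    Scratch scratch(nthr, n);  // allocated once, before the hot loop
    for (int step = 0; step < nsteps; ++step) {
        std::vector<std::thread> pool;
        for (int t = 0; t < nthr; ++t)
            pool.emplace_back([&, t] {
                auto& w = scratch.buf[t];  // reuse: no allocation per step
                for (std::size_t i = 0; i < w.size(); ++i) w[i] += 1.0;
            });
        for (auto& th : pool) th.join();
    }
    double acc = 0.0;
    for (auto& w : scratch.buf) acc += w[0];
    return acc;
}
```

Hoisting the allocation out of the loop is exactly why the code slide allocates `buffer` once for all `mthr` threads rather than per FFM call.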
21/22
ongoing and future work