towards highly scalable ab initio molecular dynamics (aimd) on the intel knights landing manycore processor Mathias Jacquelin mjacquelin@lbl.gov Wibe De Jong wadejong@lbl.gov Computational Research Department, Lawrence Berkeley National Laboratory
introduction: plane wave methods
QM-CC QM-DFT AIMD QM/MM MM
∙ 100-1000 atoms ∙ Uses plane wave basis ∙ Many FFTs ∙ SUMMA-like for global operations
Energy gradient and orthonormality constraint:
∂E/∂Ψ = (−1/2)∇²Ψ + VextΨ + VHΨ + VxcΨ, with ⟨Ψi|Ψj⟩ = δij
Cost of each term:
∙ (−1/2)∇²Ψ: Ne·Npack
∙ VextΨ: (Na·Npack + Ng·log(Npack) + Ne·Npack) + Na·Ne·Npack
∙ VHΨ: Ne·Ng·log(Ng) + Ne·Ng + 2·Ng·log(Ng) + Ng + Ne·Ng
∙ VxcΨ: Ne·Ng·log(Ng) + Ne·Ng
∙ orthogonality: Ne²·Npack + Ne³
Na - number of atoms, Ne - number of electrons, Ng - size of FFT grid, Npack - size of reciprocal space
1/22
introduction: plane wave discretization
Hψi(r) = [−(1/2)∇² + VL(r) + (1 − α)Vx[ρ](r) + Vc[ρ](r) + VNL + VH[ρ](r)] ψi(r) − α Σj Kij(r)ψj(r)
∇²VH,X,C(r) = −4πρ(r)    ∇²Kij(r) = −4πψi(r)ψj(r)    ρ(r) = Σi=1..N |ψi(r)|²    ∫ω ψi(r)ψj(r)dr = δij
∙ Kinetic and nonlocal terms: matrix multiplication in reciprocal space ∙ Poisson solves: Ne 3D-FFTs (exact exchange: (Ne + 1)Ne 3D-FFTs) ∙ Density ∙ Orthogonality: matrix multiplication
2/22
introduction: plane wave dft solutions
∙ Avoid direct diagonalization because of large basis sets (much larger than Gaussian basis sets)
∙ Instead, evaluate the wave function gradient using a conjugate gradient algorithm to solve the DFT equations
∙ Kinetic and nonlocal pseudopotential = matrix multiplications in reciprocal space (Ng × Ne, Ng = 96³)
∙ Local pseudopotential, Coulomb, and exchange-correlation = Ne FFTs
∙ Exact exchange ((Ne + 1)Ne FFTs), nonlocal pseudopotential, and wavefunction orthogonalization
Expensive parts: global operations
∙ 20 ps of simulation time ≈ 200,000 steps ∙ 1 s/step = 2-3 days ∙ 10 s/step = 23 days ∙ 30 s/step = 70 days
∙ Mesoscale phenomena at longer time scales ∙ Assume 1 s/step ∙ 100 ps = 10-15 days ∙ 1 ns = 100-150 days
3/22
3d ffts
∙ At every AIMD step, perform:
∙ Ne Reverse 3D FFTs ∙ Ne Forward 3D FFTs
∙ In reciprocal space, a sphere of radius Ecut is stored ∙ Each forward FFT is performed in 6 steps:
- 1. Unpack sphere into a 3D cube (z,x,y)
- 2. Perform nx × ny 1D FFTs along the z-dimension
- 3. Rotate the cube (z, x, y) → (y, z, x)
- 4. Perform nz × nx 1D FFTs along the y-dimension
- 5. Rotate the cube (y, z, x) → (x, y, z)
- 6. Perform ny × nz 1D FFTs along the x-dimension
∙ Reverse FFTs: steps in reverse order
4/22
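The rotation steps above, e.g. (z, x, y) → (y, z, x), are plain axis permutations of the cube. A minimal sketch of one such rotation, assuming a row-major layout where the last listed axis is contiguous (the function name and layout are illustrative, not the actual NWChem kernel):

```cpp
#include <cstddef>
#include <vector>

// Rotate a cube stored in (z,x,y) axis order into (y,z,x) order.
// Row-major: src element (iz,ix,iy) lives at src[(iz*nx + ix)*ny + iy];
// after rotation it lands at dst[(iy*nz + iz)*nx + ix].
std::vector<double> rotate_zxy_to_yzx(const std::vector<double>& src,
                                      std::size_t nz, std::size_t nx,
                                      std::size_t ny) {
    std::vector<double> dst(src.size());
    for (std::size_t iz = 0; iz < nz; ++iz)
        for (std::size_t ix = 0; ix < nx; ++ix)
            for (std::size_t iy = 0; iy < ny; ++iy)
                dst[(iy * nz + iz) * nx + ix] = src[(iz * nx + ix) * ny + iy];
    return dst;
}
```

After this permutation the 1D FFTs of the next step run over the new contiguous axis, which is what makes the 6-step decomposition cache-friendly.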
pipe-lined 3d ffts
∙ At each AIMD step:
∙ Ne Forward 3D FFTs ∙ Ne Reverse 3D FFTs ∙ 3D FFT steps are pipe-lined
5/22
preserving orthogonality: lagrange multipliers
∙ At each AIMD step, wave functions need to be orthogonalized ∙ Lagrange multiplier method:
∙ Matrix Riccati equations are solved ∙ Expensive step, critical to scalability
∙ Mainly matrix-matrix multiplications:
∙ M is Ne-by-Ne, F is Npack-by-Ne (or transpose) ∙ 3 letters to describe a C = A × B matrix product ∙ First letter for A, second for B, and third for C
FFM: Fᵀ (Ne × Npack) · F (Npack × Ne) = M (Ne × Ne)
MMM: M (Ne × Ne) · M (Ne × Ne) = M (Ne × Ne)
FMF: F (Npack × Ne) · M (Ne × Ne) = F (Npack × Ne)
6/22
lagrange multipliers: flowchart of operations
FFM: ψ1ᵀ · ψ1   FFM: ψ2ᵀ · ψ1   FFM: ψ2ᵀ · ψ2
MMM iterations on the overlap matrices
FMF: ψ2 = ψ1 ∗ ov
1. Compute overlap matrices between the ψ1 and ψ2 wave functions (FFM)
2. Iterate to solve the matrix Riccati equation over the overlap matrices (MMM)
3. Compute a ψ2 wave function orthogonal to ψ1 (FMF)
7/22
what’s wrong with lagrange multipliers?
∙ Vendor libraries provide good multi-threaded BLAS kernels ∙ Heatmap collected on Intel Knights Corner
∙ FFM type of matrix product ∙ Matrix C is Ne-by-Ne
∙ Not efficient for all matrix dimensions ∙ FFM falls in the “dark area” for practical problem sizes
[Heatmap: MKL GFLOP/s with 240 threads, over Ne and Npack each ranging from 40 to 15000]
8/22
parallelization strategy for lagrange multipliers
∙ nthr is the number of threads being used ∙ Matrix B is entirely accessed by each thread ∙ Parallelization along the Npack rows of A and C ∙ Each thread updates Npack/nthr rows of C
[Figure: row blocks A0..A3 and C0..C3 of height Npack/nthr; B read by all threads]
Parallelization strategy used for the FMF operations
9/22
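The FMF strategy above can be sketched with plain loops, using std::thread in place of the OpenMP threads of the actual code (the function name and the naive triple loop are illustrative; the real kernel would call a BLAS dgemm per row block):

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// FMF-style product: C (npack x ne) = A (npack x ne) * B (ne x ne), row-major.
// Rows of A and C are partitioned across nthr threads; B is read by everyone.
void fmf_multiply(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, int npack, int ne, int nthr) {
    auto worker = [&](int tid) {
        int chunk = (npack + nthr - 1) / nthr;           // rows per thread
        int lo = tid * chunk, hi = std::min(npack, lo + chunk);
        for (int i = lo; i < hi; ++i)                     // this thread's rows
            for (int j = 0; j < ne; ++j) {
                double s = 0.0;
                for (int k = 0; k < ne; ++k) s += A[i * ne + k] * B[k * ne + j];
                C[i * ne + j] = s;  // each thread owns its row block: no races
            }
    };
    std::vector<std::thread> pool;
    for (int t = 0; t < nthr; ++t) pool.emplace_back(worker, t);
    for (auto& th : pool) th.join();
}
```

Because the output rows are disjoint, no reduction or locking is needed, which is why FMF parallelizes more easily than FFM.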
parallelization strategy for lagrange multipliers
∙ A and B matrices are split along their common dimension (Npack) ∙ Each thread contributes to the entire matrix C in a temporary buffer ∙ Contributions are finally reduced
[Figure: slabs A0..A3 and B0..B3 along Npack; step 1: multiply into per-thread buffers C0..C3, step 2: reduce into C]
Parallelization strategy used for the FFM operations Ctmpi = A(∗, i × nb : (i + 1) × nb) × B(i × nb : (i + 1) × nb, ∗) Three FFMs computed concurrently on nthr/3 threads
10/22
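A minimal sketch of the FFM reduction strategy, again with std::thread and naive loops standing in for OpenMP and BLAS. The lock-based reduction of the actual code (next slide) is replaced here by a serial reduction after the join, purely for clarity:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// FFM-style product: C (ne x ne) = A^T * B, where A and B are npack x ne
// row-major. The common dimension npack is split across threads; each thread
// accumulates a private ne x ne partial product, reduced into C afterwards.
void ffm_multiply(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, int npack, int ne, int nthr) {
    std::vector<double> partial(static_cast<std::size_t>(nthr) * ne * ne, 0.0);
    auto worker = [&](int tid) {
        double* P = &partial[static_cast<std::size_t>(tid) * ne * ne];
        int chunk = (npack + nthr - 1) / nthr;            // slab per thread
        int lo = tid * chunk, hi = std::min(npack, lo + chunk);
        for (int k = lo; k < hi; ++k)                     // thread-private slab
            for (int i = 0; i < ne; ++i)
                for (int j = 0; j < ne; ++j)
                    P[i * ne + j] += A[k * ne + i] * B[k * ne + j];
    };
    std::vector<std::thread> pool;
    for (int t = 0; t < nthr; ++t) pool.emplace_back(worker, t);
    for (auto& th : pool) th.join();
    for (int t = 0; t < nthr; ++t)                        // reduce partials
        for (int ij = 0; ij < ne * ne; ++ij)
            C[ij] += partial[static_cast<std::size_t>(t) * ne * ne + ij];
}
```

The per-thread buffers are what makes the dark-area FFM shapes (tall-skinny A and B, tiny C) scale: every thread gets a full slab of work regardless of how small C is.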
a quick look at the code
double overlap[3][N_e*N_e];
//Initialize OpenMP locks
omp_lock_t reduce_locks[3];
omp_init_lock(&reduce_locks[0]);
omp_init_lock(&reduce_locks[1]);
omp_init_lock(&reduce_locks[2]);
//Allocate temporary buffers
int mthr = omp_get_max_threads();
double * buffer = new double[N_e*N_e*mthr];
#pragma omp parallel
while(...) {
  int tid = omp_get_thread_num();
  int nthr = omp_get_num_threads();
  //Get the tid-th temporary buffer
  double * thBuffer = buffer + N_e*N_e*tid;
  //Zero out each overlap matrix
  #pragma omp for
  ...
  if ( tid < nthr / 3 ) {
    //Compute local matrix product 1
    thBuffer = ...
    omp_set_lock(&reduce_locks[0]);
    //Reduce thBuffer into overlap[0]
    overlap[0] += thBuffer ...
    omp_unset_lock(&reduce_locks[0]);
  } else if ( tid >= nthr / 3 && tid < 2 * nthr / 3 ) {
    //Compute local matrix product 2
    thBuffer = ...
    omp_set_lock(&reduce_locks[1]);
    //Reduce thBuffer into overlap[1]
    overlap[1] += thBuffer ...
    omp_unset_lock(&reduce_locks[1]);
  } else {
    //Compute local matrix product 3
    thBuffer = ...
    omp_set_lock(&reduce_locks[2]);
    //Reduce thBuffer into overlap[2]
    overlap[2] += thBuffer ...
    omp_unset_lock(&reduce_locks[2]);
  }
  #pragma omp barrier
  //All three FFMs are now complete
  ...
}
//Delete temporary buffers
delete [] buffer;
//Destroy OpenMP locks
omp_destroy_lock(&reduce_locks[2]);
omp_destroy_lock(&reduce_locks[1]);
omp_destroy_lock(&reduce_locks[0]);
11/22
hardware platform
∙ Experiments conducted on NERSC Cori
∙ 14 PFlop/s peak (27 PFlop/s theoretical) ∙ 5th place in the 2016 TOP500 ranking
∙ Intel Knights Landing partition ∙ Intel Xeon Phi 7250 processor
∙ 68 cores ∙ 272 hardware threads (4 threads per core) ∙ 96GB DRAM ∙ 16GB MCDRAM
∙ Cray Dragonfly interconnect
12/22
a roofline model of the planewave code
[Roofline plot: GFLOP/s vs. arithmetic intensity (FLOPs/byte). Ceilings: L1 6762.71 GB/s, L2 2099.42 GB/s, MCDRAM 405.44 GB/s, DRAM 40.14 GB/s; 2521.47 GFLOP/s maximum. Data points: Lagrange (MT BLAS) for Water 32, Water 64, and Water 128]
Roofline model of a KNL node. Collected using the Empirical Roofline Toolkit (ERT)
13/22
a roofline model of the planewave code (optimized)
[Roofline plot: same ceilings as above; data points for Lagrange (MT BLAS) and the optimized Lagrange, each for Water 32, Water 64, and Water 128]
Roofline model of a KNL node. Higher arithmetic intensity, higher performance
14/22
strong scalability of lagrange: 32 water molecules
[Plot: run times (s, log scale) of the Lagrange multipliers, FFM, and FMF on 32 water molecules vs. thread count (1 to 64), multi-threaded BLAS vs. optimized versions]
Npack = 53,228, Ne = 128. Speedup of 9x on average
15/22
strong scalability of lagrange: 64 water molecules
[Plot: run times (s, log scale) of the Lagrange multipliers, FFM, and FMF on 64 water molecules vs. thread count (1 to 64), multi-threaded BLAS vs. optimized versions]
Npack = 106,456, Ne = 256. Speedup of 20x on average
16/22
strong scalability of lagrange: 128 water molecules
[Plot: run times (s, log scale) of the Lagrange multipliers, FFM, and FMF on 128 water molecules vs. thread count (1 to 64), multi-threaded BLAS vs. optimized versions]
Npack = 212,912, Ne = 512. Speedup of 21x on average
17/22
parallel efficiency of lagrange: 128 water molecules
[Plot: parallel efficiency (%) of the Lagrange multipliers, FFM, and FMF on 128 water molecules vs. thread count (1 to 64), multi-threaded BLAS vs. optimized versions]
Npack = 212,912, Ne = 512. Efficiency of 60% when using 68 threads
18/22
strong scalability of the full aimd step for 64 water molecules
[Plot: run times (s, log scale) of the full AIMD step and its components (Lagrange multipliers, FFM, FMF, queue, 3D FFTs, non-local pseudopotentials) on 64 water molecules vs. thread count (1 to 64)]
∙ Overall speedup of 7.8x over the sequential run ∙ Lagrange multipliers speedup of 21x ∙ Pipe-lined 3D FFTs speedup of 15.5x ∙ Non-local pseudopotentials speedup of 5.5x ∙ Similar to Lagrange, with a smaller C matrix
19/22
strong scalability on intel knl vs. intel haswell
[Plot: run times (s) of the full AIMD step on 64 water molecules vs. number of threads (1 to 64), Intel Knights Landing node vs. Intel Haswell node]
∙ The Intel Haswell node has 32 cores (2 × 16) ∙ Speedup of a full KNL node over a full Haswell node is 1.8x
20/22
conclusion
∙ Conclusion
∙ Large speedups over the straightforward approach ∙ Data reuse is critical ∙ The BSP programming style does not scale well ∙ Exploit concurrency when scalability is limited
∙ Lessons learned
∙ An OpenMP reduction cannot reuse the same buffer in FFM ∙ Allocating memory through OpenMP at every FFM is too expensive ∙ Thread placement is essential ∙ Many algorithms do not have long vector loops; heterogeneous models will be useful ∙ A lot of time is spent synchronizing threads
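The buffer-reuse lesson can be illustrated with a toy pattern: per-thread scratch space allocated once before the time-stepping loop and reused at every step, instead of being reallocated inside each kernel call (std::thread and all names below are illustrative, not the NWChem implementation):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// One scratch workspace per worker, allocated once and reused across steps.
struct Scratch {
    std::vector<std::vector<double>> buf;  // buf[tid] is thread tid's workspace
    Scratch(int nthr, std::size_t n) : buf(nthr, std::vector<double>(n, 0.0)) {}
};

double run_steps(int nsteps, int nthr, std::size_t n) {
    Scratch scratch(nthr, n);  // allocated once, before the hot loop
    for (int step = 0; step < nsteps; ++step) {
        std::vector<std::thread> pool;
        for (int t = 0; t < nthr; ++t)
            pool.emplace_back([&, t] {
                auto& w = scratch.buf[t];  // reuse: no allocation per step
                for (std::size_t i = 0; i < w.size(); ++i) w[i] += 1.0;
            });
        for (auto& th : pool) th.join();
    }
    double acc = 0.0;
    for (auto& w : scratch.buf) acc += w[0];
    return acc;
}
```

Hoisting the allocation out of the loop is exactly why the code slide allocates `buffer` once for all `mthr` threads rather than per FFM call.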
21/22
ongoing and future work