SLIDE 1

Parallelizing the Hamiltonian Computation in DQMC Simulations: Checkerboard Method for Sparse Matrix Exponentials on Multicore and GPU

Che-Rung Lee

National Tsing Hua University
Joint work with Zhi-Hung Chen and Quey-Liang Kao

Second International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), May 25, 2012
SLIDE 2

Outline

1. Determinant quantum Monte Carlo simulations
2. Matrix multiplication of sparse matrix exponentials
3. Parallel block checkerboard methods on multicore and GPU
4. Experiments and results
5. Concluding remarks

SLIDE 4

Computational Material Science

To study the properties of solid-state materials: magnetism, metal-insulator transition, high-temperature superconductivity, ...

The Hubbard model: the energy operator H is associated with a lattice of particles. The Boltzmann weight is expressed as a path integral

  e^{-βH} ≈ e^{-τH(h_1)} e^{-τH(h_2)} · · · e^{-τH(h_L)}.

β = 1/T is the "imaginary time". τ = β/L is the discretized time step. {h_i} is the "Hubbard-Stratonovich field".

SLIDE 5

Determinant Quantum Monte Carlo (DQMC) Simulations

DQMC algorithm

1. Given a random h = (h_{ℓ,i}) = (±1).
2. Until there are enough measurements, for ℓ = 1, ..., L and i = 1, ..., N:
   1. Propose a new HS configuration h′.
   2. Compute the ratio γ of the determinants of the new and old configurations.
   3. Generate a random number ρ ∈ [0, 1].
   4. If γ > ρ, accept h = h′.
   5. If the system is thermalized, sample the physical measurements of interest.
3. Aggregate the sampled measurements.

[Flowchart: a random HS field is warmed up by DQMC steps until thermalized; sampling DQMC steps then take measurements until enough samples are collected, followed by aggregation.]
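As an illustration only, here is a minimal Python sketch of the Metropolis loop above. The helpers config_weight (standing in for det(e^{-βH(h)})) and measure are hypothetical placeholders, not code from the talk.

import numpy as np

rng = np.random.default_rng(0)

def dqmc(config_weight, measure, L, N, n_warmup, n_samples):
    h = rng.choice([-1, 1], size=(L, N))       # random HS field h_{l,i} = +/-1
    samples = []
    sweep = 0
    while len(samples) < n_samples:
        for l in range(L):
            for i in range(N):
                h_new = h.copy()
                h_new[l, i] = -h_new[l, i]     # propose a local change h -> h'
                gamma = config_weight(h_new) / config_weight(h)   # determinant ratio
                if gamma > rng.uniform():      # accept if gamma > rho, rho in [0, 1]
                    h = h_new
        sweep += 1
        if sweep > n_warmup:                   # system thermalized: start sampling
            samples.append(measure(h))
    return np.mean(samples, axis=0)            # aggregate the sampled measurements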

SLIDE 8

DQMC Parallelization

The parallel Monte Carlo method can speed up DQMC simulations by parallelizing the sampling stage. This is coarse-grained parallelization: communication only happens before sampling and in aggregation. It scales strongly if the number of desired samples is much larger than the number of processors.

[Flowchart: after warmup, multiple DQMC sampling chains run in parallel, each producing measurements that are aggregated at the end.]

SLIDE 11

Computational Challenges

By Amdahl's law, the speedup of the parallel Monte Carlo method is limited by the warmup stage, which is not parallelizable:

  Speedup = (T_warmup + T_sampling) / (T_warmup + T_sampling / p)  →  (T_warmup + T_sampling) / T_warmup  as p → ∞.

The parallel Monte Carlo method also does not scale with problem size, i.e., with the number of particles and the discretized time length.
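A quick numeric illustration of this bound; the timings below are made up, only to show the saturation.

# Amdahl-style bound: the warmup stage stays serial, so speedup saturates.
T_warmup, T_sampling = 1.0, 14.0                 # hours (hypothetical numbers)
for p in (1, 4, 16, 64, 256):
    print(p, (T_warmup + T_sampling) / (T_warmup + T_sampling / p))
# As p grows the speedup approaches (1 + 14) / 1 = 15, no matter how many processors are used.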

Coarse-grained parallelization also does not fit well on multicore and GPU: the computation of each DQMC step is complicated, execution slows down because of resource contention, and the memory per core shrinks as the number of cores grows.

SLIDE 13

Inside Each DQMC Step


A DQMC step

1. Propose a local change: h → h′.
2. Throw a random number 0 < r < 1.
3. Accept the change if r < det(e^{-βH(h′)}) / det(e^{-βH(h)}).

Computational kernel: the Green's function calculation G = (I + B_L · · · B_2 B_1)^{-1}, used to compute det(e^{-βH(h′)}) and the physical measurements.

SLIDE 17

Green’s Function Calculation

G = (I + B_L · · · B_2 B_1)^{-1}, where N is the number of particles and L is the number of time slices. The time complexity of computing G is O(N^3 L). For 10^3 warmup steps and 10^4 sampling steps, it takes 15 hours.

For large simulations, N = O(10^4) and L = O(10^2), the projected execution time ranges from several days to months.

Profile of a DQMC simulation (N = 256, L = 96):

  Matrix kernel                  Execution time
  Matrix-matrix multiplication   72.39%
  Pivoted QR decomposition       17.83%
  Matrix inversion                3.02%
  Others                          6.76%
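For reference, a naive dense NumPy sketch of this kernel (assuming the B_ℓ are given as dense N × N arrays) shows where the O(N^3 L) cost comes from. Production DQMC codes stabilize the product, e.g. with the pivoted QR step in the profile above, which is omitted here.

import numpy as np

def greens_function(B_list):
    # B_list = [B_1, B_2, ..., B_L];  G = (I + B_L ... B_2 B_1)^{-1}
    N = B_list[0].shape[0]
    M = np.eye(N)
    for B in B_list:                     # L dense products, each O(N^3)
        M = B @ M
    return np.linalg.inv(np.eye(N) + M)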

SLIDE 24

Matrix-matrix Multiplication

Some tuned results on multicore and GPU (Fermi):

DGEMM on a 4-core Intel Core i7-920 with MKL is about 40 Gflop/s (my laptop). SGEMM can reach 662 Gflop/s on Fermi [Jakub Kurzak, LAWN 245, 2010]. DGEMM reaches 362 Gflop/s on Fermi [Guangming Tan et al., SC11].

This is great, but the running time grows cubically with the problem size N, whereas sparse-dense matrix multiplication takes only O(N^2) time. In the Green's function calculation, G = (I + B_L B_{L-1} · · · B_2 B_1)^{-1}, each B_i = e^A is a matrix exponential:

  e^A = I + A + A^2/2! + A^3/3! + · · · = Σ_{k=0}^∞ A^k / k!.

Matrix A is highly sparse, but e^A is dense.
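A quick check of this last point (not from the talk): build a 1D-ring hopping matrix, which has two nonzeros per row, and compare its nonzero count with that of its exponential. The ring size and the value of h are arbitrary.

import numpy as np
from scipy.linalg import expm

n, h = 16, 1.0
A = np.zeros((n, n))
for i in range(n):                                   # nearest-neighbour ring
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = h
E = expm(A)
print(np.count_nonzero(A))                           # 32 nonzeros in A
print(np.count_nonzero(np.abs(E) > 1e-12))           # all 256 entries of e^A are nonzero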

SLIDE 28

Checkerboard Method

The checkerboard method approximates e^A by e^A ≈ e^{A_1} e^{A_2} · · · e^{A_k}, in which each e^{A_j} is sparse.

Theorem

If A_i is strictly sparse with zero diagonal, e^{A_i} has the same sparsity pattern as A_i with diagonal fill-in.

Definition (strictly sparse)

A matrix is strictly sparse if it contains at most one nonzero per row and per column.

Checkerboard method for computing e^A

1. Split A = A_1 + A_2 + · · · + A_k such that each A_i is strictly sparse.
2. Exponentiate each A_i.
3. Return e^{A_1} e^{A_2} · · · e^{A_k} as an approximation to e^A.
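A small sketch of these three steps on the one-dimensional ring of the next slide: the odd links and the even links each form a strictly sparse matrix, and the product of their exponentials approximates e^A. It uses scipy.linalg.expm both for the pieces and for the exact reference; the values of n and h are arbitrary.

import numpy as np
from scipy.linalg import expm

n, h = 8, 0.125
A1 = np.zeros((n, n))                    # odd links:  (1,2), (3,4), ...
A2 = np.zeros((n, n))                    # even links: (2,3), (4,5), ..., (n,1)
for i in range(0, n, 2):
    A1[i, i + 1] = A1[i + 1, i] = h
for i in range(1, n, 2):
    A2[i, (i + 1) % n] = A2[(i + 1) % n, i] = h

A = A1 + A2                              # step 1: A = A1 + A2, each piece strictly sparse
approx = expm(A1) @ expm(A2)             # steps 2-3: exponentiate the pieces and multiply
print(np.linalg.norm(approx - expm(A)) / np.linalg.norm(expm(A)))   # O(h^2) splitting error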

SLIDE 30

Example: One Dimensional Ring

A is the ring hopping matrix: zero diagonal, h on the first super- and sub-diagonals, and h in the two corner entries (periodic boundary).

[Figure: an 8-site ring; alternating edges are labeled odd links and even links.]

Splitting the ring into odd and even links gives

  e^A ≈ (block-diagonal factor with 2×2 blocks D on the odd links) × (the corresponding factor with the same blocks on the even links, wrapped around the ring boundary), where

  D = [ cosh(h)  sinh(h) ]
      [ sinh(h)  cosh(h) ].
Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 12 / 31

slide-31
SLIDE 31

Block Checkerboard Method

Exploring block structure of sparse matrices can obtain better performance and accuracy.

12x12x12 16x16x16 18x18x18 20x20x20 20 40 60 80 100 120 Matrix size Running time (second) Exact Checkerboard Block Checkerboard 10

−4

10

−3

10

−2

10

−1

10

−8

10

−6

10

−4

10

−2

10 Discretize parameter ∆ τ Relative error Checkerboard Block Checkerboard Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 13 / 31

slide-32
SLIDE 32

Block Checkerboard Method

Exploring block structure of sparse matrices can obtain better performance and accuracy.

12x12x12 16x16x16 18x18x18 20x20x20 20 40 60 80 100 120 Matrix size Running time (second) Exact Checkerboard Block Checkerboard 10

−4

10

−3

10

−2

10

−1

10

−8

10

−6

10

−4

10

−2

10 Discretize parameter ∆ τ Relative error Checkerboard Block Checkerboard

The algorithm is similar, but the basic element is a block submatrix.

Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 13 / 31

slide-33
SLIDE 33

Block Checkerboard Method

Exploiting the block structure of the sparse matrices yields better performance and accuracy.

[Figures: (left) running time versus lattice size (12×12×12 to 20×20×20) for the exact method, the checkerboard method, and the block checkerboard method; (right) relative error versus discretization parameter Δτ for the checkerboard and block checkerboard methods.]

The algorithm is similar, but the basic element is a block submatrix. We will focus on the parallelization of the block checkerboard method.

SLIDE 40

Computational Difficulties

A combination of sparse-matrix and dense-matrix computation (a generalized SpMV):

  Dense matrix-matrix multiplication    compute bound
  Sparse matrix-vector multiplication   memory bound
  Sparse matrix-matrix multiplication   ?

Multiplication of a sequence of sparse matrices with different characteristics. The matrix size is not large enough to reach good performance.

SLIDE 41

2D and 3D Torus Lattices

The kinetic matrices in our study come from 2D and 3D torus lattices.

Images are from http://www.trampelwurm.ch/schmidt/wilhelmtux/swissremix/html/Linuxfibel/netstruct.htm and http://www.fujitsu.com/global/news/pr/archives/month/2009/20090717-01.html
SLIDE 42

2D Lattice and Kinetic Matrix

The kinetic matrix of a 2D lattice can be split into three block strictly sparse matrices, K1, K2, and K3. The checkerboard method approximates the matrix exponential by e^{K3/2} e^{K2/2} e^{K1} e^{K2/2} e^{K3/2}. The figure shows the sparsity pattern of the kinetic matrix of a 4 × 4 2D lattice and of the matrix exponentials e^{K1}, e^{K2}, and e^{K3}.

[Figure: sparsity patterns of (a) the kinetic matrix K, (b) expm(K1), (c) expm(K2), (d) expm(K3).]

SLIDE 43

3D Lattice and Kinetic Matrix

The kinetic matrix of a 3D lattice can be split into five block strictly sparse matrices, K1, K2, K3, K4, and K5. The checkerboard method approximates the matrix exponential by the symmetric product

  e^{K5/2} e^{K4/2} e^{K3/2} e^{K2/2} e^{K1} e^{K2/2} e^{K3/2} e^{K4/2} e^{K5/2}.

The figure shows the sparsity pattern of the kinetic matrix of a 4 × 4 × 4 3D lattice and of the matrix exponentials e^{K1}, e^{K2}, e^{K3}, e^{K4}, and e^{K5}.

[Figure: sparsity patterns of (a) the kinetic matrix K and (b)-(f) expm(K1) through expm(K5).]

SLIDE 45

e^{K1}: Block Diagonal Matrix

In both the 2D and 3D problems, e^{K1} = diag(A_1, A_2, ..., A_k) is block diagonal. For B partitioned into blocks B_ij, i, j = 1, ..., k, the product e^{K1}B has blocks

  (e^{K1}B)_ij = A_i B_ij,

i.e. block row i of B is simply multiplied by A_i.
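A minimal sketch of this block-diagonal product: each block row of B is multiplied by its own A_i, so the cost is O(N^2 b) for block size b instead of O(N^3). The function name and the list-of-blocks representation are my own choices, not from the talk.

import numpy as np

def blockdiag_mult(A_blocks, B):
    # A_blocks: the diagonal blocks A_1, ..., A_k (each b x b); B: dense N x N
    b = A_blocks[0].shape[0]
    F = np.empty_like(B)
    for i, Ai in enumerate(A_blocks):            # k independent block rows
        F[i * b:(i + 1) * b, :] = Ai @ B[i * b:(i + 1) * b, :]
    return F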

SLIDE 48

e^{Ki}, i ≠ 1: Bidiagonal Matrices

For i ≠ 1, e^{Ki} couples block rows in pairs: its (i, i), (i, j), (j, i), and (j, j) blocks are the diagonal matrices D_i, C_ij, C_ji, and D_j, so there are only two nonzeros per row/column. Let B = (B_ij) be partitioned into the same blocks. The i-th and j-th block rows of F = e^{Ki}B are then

  F_i = D_i B_i + C_ij B_j,    F_j = C_ji B_i + D_j B_j,

where B_i denotes the i-th block row of B.
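A sketch of applying one such factor to a dense B, assuming it is handed over as a list of coupled pairs (i, j, D_i, C_ij, C_ji, D_j) with the diagonal blocks stored as 1-D arrays of length b; this data layout is my own assumption.

import numpy as np

def apply_pair_factor(pairs, B, b):
    F = B.copy()                                  # block rows outside any pair act as identity
    for i, j, Di, Cij, Cji, Dj in pairs:
        Bi = B[i * b:(i + 1) * b, :]
        Bj = B[j * b:(j + 1) * b, :]
        # two diagonal blocks per block row: only O(b * N) work per pair
        F[i * b:(i + 1) * b, :] = Di[:, None] * Bi + Cij[:, None] * Bj
        F[j * b:(j + 1) * b, :] = Cji[:, None] * Bi + Dj[:, None] * Bj
    return F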

SLIDE 51

Parallel Algorithms for e^{K1}B

Different algorithms need to be designed for the different types of matrices. A block-based algorithm for e^{K1}B (a multicore sketch in code follows the pseudocode):

Parallelization of e^{K1}B
  For each block B_ij do
    Compute C_ij = A_i B_ij in parallel.
  End for

For cache efficiency, finer-grained parallelization is needed.

Finer-grained parallelization of e^{K1}B
  For each block B_ij do
    For each sub-block C_ij(k, ℓ) of C_ij do
      Compute C_ij(k, ℓ) in parallel.
    End for
  End for
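A rough multicore analogue of the block loop above, using a Python thread pool; each task computes one A_i B_ij. NumPy releases the GIL inside the BLAS products, so the tasks can overlap. This is only an illustrative sketch, not the tuned code measured later in the talk.

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_blockdiag_mult(A_blocks, B, n_threads=4):
    b = A_blocks[0].shape[0]
    k = len(A_blocks)
    F = np.empty_like(B)

    def task(ij):
        i, j = ij
        rows = slice(i * b, (i + 1) * b)
        cols = slice(j * b, (j + 1) * b)
        F[rows, cols] = A_blocks[i] @ B[rows, cols]   # C_ij = A_i B_ij

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(task, [(i, j) for i in range(k) for j in range(k)]))
    return F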

SLIDE 53

Parallel Algorithms for e^{Ki}B, i ≠ 1

Column-based parallelization: for better performance, aggregate the computation of the e^{Ki}B factors.

Parallelization of e^{Ki}B, i ≠ 1
  For each column b_i in B do
    Compute e^{K2/2} e^{K3/2} b_i (2D) or e^{K2/2} · · · e^{K5/2} b_i (3D) in parallel.
  End for

To fit the cache, divide each b_i into segments (a column-blocked sketch in code follows).

Cache-blocked parallelization of e^{Ki}B, i ≠ 1
  For each column b_i in B do
    For each segment b_i(:) do
      Compute e^{K2/2} e^{K3/2} b_i(:) or e^{K2/2} · · · e^{K5/2} b_i(:) in parallel.
    End for
  End for
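A column-blocked sketch of this aggregation: one cache-sized slab of columns of B is pushed through the whole chain of sparse exponential factors before moving on, instead of forming each full intermediate product. The factors are assumed to be scipy.sparse matrices, listed in the order they are applied (rightmost factor first); this interface is my own assumption.

import numpy as np

def apply_chain_by_columns(factors, B, col_block=64):
    F = np.array(B, copy=True)
    for c in range(0, B.shape[1], col_block):         # one cache-sized column slab at a time
        block = F[:, c:c + col_block]
        for E in factors:                             # e.g. e^{K5/2}, e^{K4/2}, ..., e^{K2/2}
            block = E @ block                         # sparse-dense product, O(nnz * cols)
        F[:, c:c + col_block] = block
    return F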

SLIDE 55

Experimental Setting

We evaluated the parallel algorithms on multicore and GPU. The matrices come from 2D and 3D toruses.

Multicore: Intel Core i7 950 CPU, 8 GB RAM, peak performance 48.48 Gflop/s. DGEMM from MKL 11 is used for the block matrix multiplications and for comparison.

GPU: GeForce GTX 480, peak performance 1344 Gflop/s. CUDA 4 is used; DGEMM and SGEMM from the CUDA SDK are used for comparison.

SLIDE 56

Results of 2D Problems on Multicore

[Figures: performance (Gflop/s) versus number of threads (1 to 8), for problem size/block size 1024/32, 4096/64, and 16384/128.]

Lattice sizes: 32 × 32, 64 × 64, and 128 × 128. Larger problems show more stable performance; the 32 × 32 results are less stable than the other cases. Hyperthreading has almost no effect.

SLIDE 57

Results of 3D Problems on Multicore

[Figures: performance (Gflop/s) versus problem size/block size (64/4 through 13824/24), for 1, 2, 3, 4, 6, and 8 threads.]

Lattice sizes: 4 × 4 × 4, 8 × 8 × 8, 12 × 12 × 12, 16 × 16 × 16, 20 × 20 × 20, and 24 × 24 × 24. The speedup is about 3 using 4 cores. The Gflop/s of the e^{K1} multiplication is much better than the overall performance. Hyperthreading helps, but its effect diminishes.

SLIDE 58

Results of 2D Problems on GPU

[Figures: execution time (microseconds, log scale) for the CUDA SDK GEMM and the block checkerboard kernel, and speedup over the SDK, versus problem size (64 to 4096), in single and double precision.]

Lattice sizes: 8 × 8, 16 × 16, 32 × 32, and 64 × 64. The speedup is up to 67× for single precision. The CUDA SDK DGEMM cannot finish the 64 × 64 case.

SLIDE 59

Results of 3D Problems on GPU

[Figures: execution time (microseconds, log scale) for the CUDA SDK GEMM and the block checkerboard kernel, and speedup over the SDK, versus problem size (64 to 4096), in single and double precision.]

Lattice sizes: 4 × 4 × 4, 8 × 8 × 8, 12 × 12 × 12, and 16 × 16 × 16. The speedup is up to 147× for single precision. The CUDA SDK DGEMM cannot finish the 16 × 16 × 16 case.

SLIDE 63

Concluding Remarks

Scientific applications demand more and more computational power to enable larger and larger simulations. But to extract performance from modern HPC machines, which combine coarse-grained and fine-grained parallel architectures, application developers need to redesign their programs.

For DQMC simulations, the checkerboard method makes the algorithm more scalable with problem size. This paper presents a preliminary study of the parallelization of the checkerboard method on multicore and GPU; further performance tuning is required.

The parallelization of the checkerboard method for general lattice geometries needs to be studied, and the parallel checkerboard method needs to be integrated into the simulation package.
