SLIDE 1

Parallelization of DQMC Simulations for Strongly Correlated Electron Systems

Che-Rung Lee

Dept. of Computer Science, National Tsing-Hua University, Taiwan

Joint work with I-Hsin Chung (IBM Research) and Zhaojun Bai (UC Davis)

IEEE International Parallel and Distributed Processing Symposium 2010


SLIDE 2

Outline

1. DQMC simulations
2. DQMC parallelization: algorithmic approaches and system approaches
3. Experimental results
4. Conclusion

SLIDE 3

Computational Material Science

Understanding and exploiting the properties of solid-state materials: magnetism, metal-insulator transitions, high-temperature superconductivity, ...

[Figure: four lattice maps: (A) density, (B) density fluctuations, (C) spin correlations, (D) pairing correlations.]

SLIDES 4–5

Hubbard Model and DQMC Simulations

Many-body simulation on multi-layer lattices using the Hubbard model and the quantum Monte Carlo method.

QUEST (QUantum Electron Simulation Toolbox): a Fortran 90 package for Determinant Quantum Monte Carlo (DQMC) simulations.

SLIDE 6

DQMC Algorithm

Two stages: a warmup stage and a sampling stage.

A DQMC step:

1. Propose a local change: h → h′.
2. Draw a random number 0 < r < 1.
3. Accept the change if r < det(e^{-\beta H(h')}) / det(e^{-\beta H(h)}).

[Flowchart: in the warmup stage, a random HS field is evolved by DQMC steps until thermalized; in the sampling stage, DQMC steps alternate with measurements until enough samples are collected, then results are aggregated.]
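As a concrete illustration of the step above, here is a minimal Python sketch. The weight uses the fermion-determinant form det(I + B_L ... B_1) that det(e^{-βH(h)}) reduces to in DQMC; `propose` and `b_matrices` are hypothetical stand-ins for the actual QUEST routines, and a production code would use determinant ratios rather than forming the determinants directly.

```python
import numpy as np

def weight(B_list):
    """Configuration weight det(I + B_L ... B_1) for one HS field h.

    Forming the dense product and determinant like this is only for
    illustration; it is neither fast nor stable for large L."""
    N = B_list[0].shape[0]
    A = np.eye(N)
    for B in reversed(B_list):          # multiply B_L ... B_1
        A = A @ B
    return np.linalg.det(np.eye(N) + A)

def dqmc_step(h, propose, b_matrices, rng):
    """One DQMC step: propose h -> h', accept if r < weight ratio."""
    h_new = propose(h, rng)                         # local change h -> h'
    ratio = weight(b_matrices(h_new)) / weight(b_matrices(h))
    r = rng.random()                                # 0 < r < 1
    return h_new if r < ratio else h
```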

SLIDES 7–9

Computational Kernels

The equal-time Green's function:

G_k = (I + B_k B_{k+1} \cdots B_L B_1 \cdots B_{k-1})^{-1}

The unequal-time Green's function:

G^\tau = \begin{pmatrix} I & & & B_1 \\ -B_2 & I & & \\ & \ddots & \ddots & \\ & & -B_L & I \end{pmatrix}^{-1}

Physical measurements: operations on G_k and G^\tau, Fourier transforms, etc.
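To make the equal-time formula concrete, a naive numpy sketch follows. Direct inversion like this is exactly what the stability discussion below warns against; it is shown only to fix notation.

```python
import numpy as np

def equal_time_greens(B_list, k):
    """Naive G_k = (I + B_k B_{k+1} ... B_L B_1 ... B_{k-1})^{-1}.

    B_list = [B_1, ..., B_L], k is 1-based. Unstable for long
    products; real codes factorize as they multiply."""
    L, N = len(B_list), B_list[0].shape[0]
    A = np.eye(N)
    for i in range(L):                   # cyclic order k, k+1, ..., k-1
        A = A @ B_list[(k - 1 + i) % L]
    return np.linalg.inv(np.eye(N) + A)
```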

SLIDES 10–11

Computational Challenges

For simulating strongly correlated electron systems:

• The lattice size needs to be large.
• A longer warmup stage is required.
• Numerical stability is an issue: additional stabilization steps are required, most calculations need double precision, and many fast-updating methods and parallel algorithms cannot be used.

SLIDE 12

DQMC Parallelization

Algorithmic approaches:

• Parallel Markov chains
• Rolling feeder algorithm
• Parallel matrix computations

System approaches:

• Task decomposition
• Communication and computation overlapping
• Message compression
• Load balance

SLIDES 13–14

Parallel Markov Chain

The sampling stage can be parallelized embarrassingly: after warmup, independent chains sample and measure in parallel and their results are aggregated. The speedup is limited by the time of the warmup stage (Amdahl's law):

\rho_{\text{speedup}} = \frac{T_{\text{warmup}} + T_{\text{sampling}}}{T_{\text{warmup}} + T_{\text{sampling}}/N_p} < \frac{T_{\text{warmup}} + T_{\text{sampling}}}{T_{\text{warmup}}}

[Flowchart: one warmup chain (random HS field, DQMC steps until thermalized) feeds many independent sampling chains of DQMC steps and measurements.]
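To see where the "speedup < 21" figure quoted in the summary comes from, plug the benchmark's 1 : 20 warmup-to-sampling ratio into the bound; a two-line Python check:

```python
def sampling_speedup(t_warmup, t_sampling, n_p):
    """Speedup when only the sampling stage is parallelized."""
    return (t_warmup + t_sampling) / (t_warmup + t_sampling / n_p)

# Warmup : sampling = 1 : 20, the benchmark ratio used later.
print(sampling_speedup(1.0, 20.0, 1024))   # ~20.6
print((1.0 + 20.0) / 1.0)                  # limit as n_p -> infinity: 21.0
```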

SLIDES 15–20

Green's Function Calculation

The matrix G_k must be recomputed cyclically as B_{k-1} is updated:

G_1 = (I + B_1 B_2 \cdots B_{L-1} B_L)^{-1}
G_2 = (I + B_2 B_3 \cdots B_L B_1)^{-1}
G_3 = (I + B_3 B_4 \cdots B_1 B_2)^{-1}
\cdots

Parallel reduction computes each product in O(N^3 \log L) time.

[Diagram: DQMC steps interleaved with "Compute G" tasks; the B-matrix products for successive G_k are combined across processors in a reduction tree.]

Numerically unstable!
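The stabilization that DQMC codes apply to these long products is typically a graded QR (UDT) factorization; the sketch below shows that general idea under that assumption. It is not necessarily QUEST's exact scheme.

```python
import numpy as np

def stabilized_product(B_list):
    """Accumulate B_L ... B_1 as Q @ diag(d) @ T via repeated QR.

    Re-factorizing after every multiply keeps the wildly different
    scales of the product isolated in d instead of mixing into one
    ill-conditioned matrix; this is the usual form of the
    'stabilization steps' mentioned above."""
    N = B_list[0].shape[0]
    Q, d, T = np.eye(N), np.ones(N), np.eye(N)
    for B in B_list:                    # B_1 first, B_L last
        M = (B @ Q) * d                 # scale columns by d
        Q, R = np.linalg.qr(M)
        d = np.abs(np.diag(R))
        T = (R / d[:, None]) @ T        # keep T well conditioned
    return Q, d, T                      # B_L ... B_1 = Q @ diag(d) @ T
```

G_k can then be assembled from the factors of (I + Q diag(d) T)^{-1} without ever forming the raw product.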

SLIDES 21–23

Rolling Feeder Algorithm

The matrix product can be computed stably in its sequential order by pipelining: the multiplications for successive G_k are staggered round-robin across feeders, so each feeder performs one multiplication and one stabilization step per DQMC step.

[Diagram: DQMC steps interleaved with "Compute G" tasks rotating across four feeders.]

Tasks to get one G_k:

                                   Sequential   Parallel reduction   Rolling feeder
  1. Matrix multiplications        L            log L                1
  2. Stabilization steps           O(L)         O(log L)             1
  3. Inverting (I + B_1 ... B_L)   1            1                    1
  4. Data transmission             N^2          O(L N^2)             N^2

Comparison of resources and stability:

  Processors                       O(1)         O(L)                 O(L)
  Numerically stable               Y            N                    Y
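My reading of the diagram and the cost table is a one-multiplication-per-feeder pipeline; the toy serial emulation below illustrates that interpretation only (it omits the stabilization steps and the distribution over feeder processes).

```python
import numpy as np

def rolling_feeder_pipeline(B_stream, L, N):
    """Toy emulation: one new B per DQMC step, one G per step once full.

    Each live partial product absorbs the new B (one multiplication
    per feeder per step); after L steps a partial holds all L factors
    and yields its Green's function."""
    partials = []
    for B in B_stream:
        partials.append(np.eye(N))             # start a product for a future G
        partials = [P @ B for P in partials]   # every feeder multiplies once
        if len(partials) == L:                 # oldest product is complete
            A = partials.pop(0)
            yield np.linalg.inv(np.eye(N) + A)
```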

SLIDES 24–26

Parallel Matrix Computations

Two matrix computation kernels are parallelized.

1. The unequal-time Green's function is computed by blocks in parallel:

G^\tau_{k,\ell} =
\begin{cases}
(I + B_k \cdots B_1 B_L \cdots B_{k+1})^{-1} B_k \cdots B_{\ell+1} & k > \ell \\
(I + B_k \cdots B_1 B_L \cdots B_{k+1})^{-1} & k = \ell \\
-(I + B_k \cdots B_1 B_L \cdots B_{k+1})^{-1} B_k \cdots B_1 B_L \cdots B_{\ell+1} & k < \ell
\end{cases}

2. The matrix-matrix multiplication of G_k with each block of G^\tau is sped up using multicore parallelism.

The dimension of G_k (100–1000) is too small for MPI-style parallelization of these matrix computations to pay off.
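Since the blocks are independent given the (read-only) B matrices, they map naturally onto a thread pool; a sketch, where `block_fn` is a hypothetical kernel evaluating one case of the formula above:

```python
from concurrent.futures import ThreadPoolExecutor

def unequal_time_blocks(B_list, block_fn, n_workers):
    """Evaluate every block G^tau_{k,l} independently in parallel.

    block_fn(k, l, B_list) computes one block; threads suffice here
    because dense linear algebra releases the GIL inside BLAS."""
    L = len(B_list)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = {(k, l): pool.submit(block_fn, k, l, B_list)
                   for k in range(1, L + 1) for l in range(1, L + 1)}
        return {kl: f.result() for kl, f in futures.items()}
```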

SLIDE 27

System Design

The system contains several "simulators" for the parallel Markov chains. Each simulator consists of a "walker" and an "M-server".

[Diagram: the MC walker runs DQMC steps through an iterator and several feeders operating on the HS field; the M-server hosts equal-time and unequal-time measurators, each with Green's function calculators (GC), for the physical measurements.]

SLIDE 28

Implementation Techniques

The system is implemented for hybrid systems (cluster + multicore).

[Table: each task (parallel Markov chain, rolling feeder algorithm, unequal-time Green's function, physical measurement) is marked against the techniques it uses: MPI, OpenMP, communication/computation overlapping, message compression, and load balance. The individual marks did not survive extraction.]

SLIDE 29

Communication/Computation Overlapping

Using the fast update algorithm (FUA) to reduce waiting time.

[Timelines: without overlapping, the MC iterator idles while a feeder multiplies B_0 and computes G_0 for the next time slice. With overlapping, the iterator keeps iterating, obtains G_0 early by the FUA, and converts it to G_1 by the FUA while the feeder multiplies B_1, hiding communication behind computation.]
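A hedged mpi4py sketch of the overlap pattern (the slides do not name the MPI binding, and `mc_iterate` and `fast_update` are hypothetical stand-ins for the real kernels): post the receive for the feeder's next Green's function early and keep fast-updating until it lands.

```python
from mpi4py import MPI
import numpy as np

def iterate_with_overlap(comm, feeder_rank, G, n_slices,
                         mc_iterate, fast_update):
    """Hide the feeder's recomputation of G behind MC iteration."""
    G_fresh = np.empty_like(G)
    for _ in range(n_slices):
        req = comm.Irecv(G_fresh, source=feeder_rank)  # post receive early
        mc_iterate(G)
        while not req.Test():          # fresh G not here yet:
            G = fast_update(G)         # advance G by the fast update algorithm
            mc_iterate(G)
        G[...] = G_fresh               # adopt the feeder's recomputed G
```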

SLIDES 30–33

Load Balance

The iterators are fully occupied, so they are the bottleneck of speedup. Processor utilization can be enhanced by merging tasks: for example, when computing the unequal-time Green's function, each processor can take care of more than one block submatrix.

The load-balance problem: how many block submatrices should one processor compute? Queueing theory (Little's law) gives the estimate

n_C = \max_{P \le 1} \frac{P}{\lambda T} = \frac{1}{\lambda T},

where λ is the arrival rate, T the processing time, and P the processor utilization.
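A one-function sanity check of the estimate, with made-up numbers for λ and T:

```python
def blocks_per_processor(arrival_rate, processing_time):
    """n_C = 1/(lambda*T): the most blocks one processor can absorb
    while its utilization P = n_C * lambda * T stays at or below 1."""
    return max(1, int(1.0 / (arrival_rate * processing_time)))

# Blocks arriving at 0.5/s, each taking 0.4 s: n_C = 5, utilization 1.0.
print(blocks_per_processor(0.5, 0.4))
```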

SLIDES 34–35

System and Benchmark

System:

• Runs on an IBM Blue Gene/P.
• Each compute node has an 850 MHz quad-core PowerPC 450 processor and 2 GB of memory.
• IBM XL compilers with the IBM BLAS and LAPACK libraries.

Benchmark:

• DQMC simulation on a two-dimensional periodic lattice of size N = 16 × 16 = 256.
• The ratio of DQMC steps in the warmup stage to the sampling stage is 1 : 20.

SLIDE 36

Communication Pattern

[Trace: MPI timelines of two simulators, each with one iterator, six feeders, and one measurator.]

Green bands show the waiting time in MPI_RECV. The iterators are fully occupied once started.

SLIDE 37

Speedup for Different L

[Plot: speedup versus the number of processors n_P, for several values of L.]

SLIDE 38

Effect of Load Balance (L = 96)

[Plot: speedup versus n_P for different values of n_C, the number of block submatrices computed per processor.]

SLIDES 39–41

Summary

DQMC simulation of strongly correlated materials is a computationally intensive task that calls for parallelization. We targeted hybrid massively parallel systems and exploited the parallelism of DQMC simulations at several levels of granularity. Our implementation shows over 80× speedup on a thousand processors, far better than embarrassing parallelization alone (speedup < 21).

SLIDES 42–45

Future Work

• More fine-grained parallel matrix computation kernels (pivoted QR, QR, matrix inversion) to fully utilize the computational power of multicore processors.
• Better system design to enhance processor utilization.
• Different physics models and methods.
• The code is still at the experimental stage; further development is required for practical use.