SLIDE 1

Parallelizing the Hamiltonian Computation in DQMC Simulations: Checkerboard Method for Sparse Matrix Exponentials on Multicore and GPU

Che-Rung Lee

National Tsing Hua University
Joint work with Zhi-Hung Chen and Quey-Liang Kao

Second International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), May 25, 2012
SLIDE 2

Outline

1. Determinant quantum Monte Carlo simulations
2. Matrix multiplication of sparse matrix exponentials
3. Parallel block checkerboard methods on multicore and GPU
4. Experiments and results
5. Concluding remarks

SLIDE 4

Computational Material Science

To study the properties of solid-state materials: magnetism, metal-insulator transition, high-temperature superconductivity, ...

The Hubbard model: the energy operator H is associated with a lattice of particles. The Boltzmann weight is expressed as a path integral

  e^{-βH} ≈ e^{-τH(h_1)} e^{-τH(h_2)} · · · e^{-τH(h_L)}.

β = 1/T is the "imaginary time". τ = β/L is the discretized time step. {h_i} is the "Hubbard-Stratonovich field".

SLIDE 5

Determinant Quantum Monte Carlo (DQMC) Simulations

DQMC algorithm

1. Given a random h = (h_{ℓ,i}) = (±1).
2. Until there are enough measurements, for ℓ = 1, ..., L and i = 1, ..., N:
   1. Propose a new HS configuration h′.
   2. Compute the ratio γ of the determinants of the new and old configurations.
   3. Generate a random number ρ ∈ [0, 1].
   4. If γ > ρ, accept h = h′.
   5. If the system is thermalized, sample the physical measurements of interest.
3. Aggregate the sampled measurements.

[Flowchart: a random HS field is warmed up by DQMC steps until thermalized; sampling DQMC steps then take measurements until enough samples are collected, followed by aggregation.]
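As an illustration only, here is a minimal Python sketch of the Metropolis loop above. The helpers config_weight (standing in for det(e^{-βH(h)})) and measure are hypothetical placeholders, not code from the talk.

import numpy as np

rng = np.random.default_rng(0)

def dqmc(config_weight, measure, L, N, n_warmup, n_samples):
    h = rng.choice([-1, 1], size=(L, N))       # random HS field h_{l,i} = +/-1
    samples = []
    sweep = 0
    while len(samples) < n_samples:
        for l in range(L):
            for i in range(N):
                h_new = h.copy()
                h_new[l, i] = -h_new[l, i]     # propose a local change h -> h'
                gamma = config_weight(h_new) / config_weight(h)   # determinant ratio
                if gamma > rng.uniform():      # accept if gamma > rho, rho in [0, 1]
                    h = h_new
        sweep += 1
        if sweep > n_warmup:                   # system thermalized: start sampling
            samples.append(measure(h))
    return np.mean(samples, axis=0)            # aggregate the sampled measurements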

SLIDE 8

DQMC Parallelization

The parallel Monte Carlo method can speed up DQMC simulations by parallelizing the sampling stage. This is coarse-grained parallelization: communication only happens before sampling and in aggregation. It scales strongly if the number of desired samples is much larger than the number of processors.

[Flowchart: after warmup, multiple DQMC sampling chains run in parallel, each producing measurements that are aggregated at the end.]

SLIDE 11

Computational Challenges

By Amdahl's law, the speedup of the parallel Monte Carlo method is limited by the warmup stage, which is not parallelizable:

  Speedup = (T_warmup + T_sampling) / (T_warmup + T_sampling / p)  →  (T_warmup + T_sampling) / T_warmup  as p → ∞.

The parallel Monte Carlo method also does not scale with problem size, i.e., with the number of particles and the discretized time length.
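A quick numeric illustration of this bound; the timings below are made up, only to show the saturation.

# Amdahl-style bound: the warmup stage stays serial, so speedup saturates.
T_warmup, T_sampling = 1.0, 14.0                 # hours (hypothetical numbers)
for p in (1, 4, 16, 64, 256):
    print(p, (T_warmup + T_sampling) / (T_warmup + T_sampling / p))
# As p grows the speedup approaches (1 + 14) / 1 = 15, no matter how many processors are used.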

Coarse-grained parallelization also does not fit well on multicore and GPU: the computation of each DQMC step is complicated, execution slows down because of resource contention, and the memory per core shrinks as the number of cores grows.

SLIDE 13

Inside Each DQMC Step


A DQMC step

1. Propose a local change: h → h′.
2. Throw a random number 0 < r < 1.
3. Accept the change if r < det(e^{-βH(h′)}) / det(e^{-βH(h)}).

Computational kernel: the Green's function calculation G = (I + B_L · · · B_2 B_1)^{-1}, used to compute det(e^{-βH(h′)}) and the physical measurements.

SLIDE 17

Green’s Function Calculation

G = (I + B_L · · · B_2 B_1)^{-1}, where N is the number of particles and L is the number of time slices. The time complexity of computing G is O(N^3 L). For 10^3 warmup steps and 10^4 sampling steps, it takes 15 hours.

For large simulations, N = O(10^4) and L = O(10^2), the projected execution time ranges from several days to months.

Profile of a DQMC simulation (N = 256, L = 96):

  Matrix kernel                  Execution time
  Matrix-matrix multiplication   72.39%
  Pivoted QR decomposition       17.83%
  Matrix inversion                3.02%
  Others                          6.76%
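For reference, a naive dense NumPy sketch of this kernel (assuming the B_ℓ are given as dense N × N arrays) shows where the O(N^3 L) cost comes from. Production DQMC codes stabilize the product, e.g. with the pivoted QR step in the profile above, which is omitted here.

import numpy as np

def greens_function(B_list):
    # B_list = [B_1, B_2, ..., B_L];  G = (I + B_L ... B_2 B_1)^{-1}
    N = B_list[0].shape[0]
    M = np.eye(N)
    for B in B_list:                     # L dense products, each O(N^3)
        M = B @ M
    return np.linalg.inv(np.eye(N) + M)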

SLIDE 24

Matrix-matrix Multiplication

Some tuned results on multicore and GPU (Fermi):

DGEMM on a 4-core Intel Core i7-920 with MKL is about 40 Gflop/s (my laptop). SGEMM can reach 662 Gflop/s on Fermi [Jakub Kurzak, LAWN 245, 2010]. DGEMM reaches 362 Gflop/s on Fermi [Guangming Tan et al., SC11].

This is great, but the running time grows cubically with the problem size N, whereas sparse-dense matrix multiplication takes only O(N^2) time. In the Green's function calculation, G = (I + B_L B_{L-1} · · · B_2 B_1)^{-1}, each B_i = e^A is a matrix exponential:

  e^A = I + A + A^2/2! + A^3/3! + · · · = Σ_{k=0}^∞ A^k / k!.

Matrix A is highly sparse, but e^A is dense.
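A quick check of this last point (not from the talk): build a 1D-ring hopping matrix, which has two nonzeros per row, and compare its nonzero count with that of its exponential. The ring size and the value of h are arbitrary.

import numpy as np
from scipy.linalg import expm

n, h = 16, 1.0
A = np.zeros((n, n))
for i in range(n):                                   # nearest-neighbour ring
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = h
E = expm(A)
print(np.count_nonzero(A))                           # 32 nonzeros in A
print(np.count_nonzero(np.abs(E) > 1e-12))           # all 256 entries of e^A are nonzero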

SLIDE 28

Checkerboard Method

The checkerboard method approximates e^A by e^A ≈ e^{A_1} e^{A_2} · · · e^{A_k}, in which each e^{A_j} is sparse.

Theorem

If A_i is strictly sparse with zero diagonal, e^{A_i} has the same sparsity pattern as A_i with diagonal fill-in.

Definition (strictly sparse)

A matrix is strictly sparse if it contains at most one nonzero per row and per column.

Checkerboard method for computing e^A

1. Split A = A_1 + A_2 + · · · + A_k such that each A_i is strictly sparse.
2. Exponentiate each A_i.
3. Return e^{A_1} e^{A_2} · · · e^{A_k} as an approximation to e^A.
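A small sketch of these three steps on the one-dimensional ring of the next slide: the odd links and the even links each form a strictly sparse matrix, and the product of their exponentials approximates e^A. It uses scipy.linalg.expm both for the pieces and for the exact reference; the values of n and h are arbitrary.

import numpy as np
from scipy.linalg import expm

n, h = 8, 0.125
A1 = np.zeros((n, n))                    # odd links:  (1,2), (3,4), ...
A2 = np.zeros((n, n))                    # even links: (2,3), (4,5), ..., (n,1)
for i in range(0, n, 2):
    A1[i, i + 1] = A1[i + 1, i] = h
for i in range(1, n, 2):
    A2[i, (i + 1) % n] = A2[(i + 1) % n, i] = h

A = A1 + A2                              # step 1: A = A1 + A2, each piece strictly sparse
approx = expm(A1) @ expm(A2)             # steps 2-3: exponentiate the pieces and multiply
print(np.linalg.norm(approx - expm(A)) / np.linalg.norm(expm(A)))   # O(h^2) splitting error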

SLIDE 30

Example: One Dimensional Ring

A is the ring hopping matrix: zero diagonal, h on the first super- and sub-diagonals, and h in the two corner entries (periodic boundary).

[Figure: an 8-site ring; alternating edges are labeled odd links and even links.]

Splitting the ring into odd and even links gives

  e^A ≈ (block-diagonal factor with 2×2 blocks D on the odd links) × (the corresponding factor with the same blocks on the even links, wrapped around the ring boundary), where

  D = [ cosh(h)  sinh(h) ]
      [ sinh(h)  cosh(h) ].
Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 12 / 31

slide-31
SLIDE 31

Block Checkerboard Method

Exploring block structure of sparse matrices can obtain better performance and accuracy.

12x12x12 16x16x16 18x18x18 20x20x20 20 40 60 80 100 120 Matrix size Running time (second) Exact Checkerboard Block Checkerboard 10

−4

10

−3

10

−2

10

−1

10

−8

10

−6

10

−4

10

−2

10 Discretize parameter ∆ τ Relative error Checkerboard Block Checkerboard Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 13 / 31

slide-32
SLIDE 32

Block Checkerboard Method

Exploring block structure of sparse matrices can obtain better performance and accuracy.

12x12x12 16x16x16 18x18x18 20x20x20 20 40 60 80 100 120 Matrix size Running time (second) Exact Checkerboard Block Checkerboard 10

−4

10

−3

10

−2

10

−1

10

−8

10

−6

10

−4

10

−2

10 Discretize parameter ∆ τ Relative error Checkerboard Block Checkerboard

The algorithm is similar, but the basic element is a block submatrix.

Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 13 / 31

slide-33
SLIDE 33

Block Checkerboard Method

Exploiting the block structure of the sparse matrices yields better performance and accuracy.

[Figures: (left) running time versus lattice size (12×12×12 to 20×20×20) for the exact method, the checkerboard method, and the block checkerboard method; (right) relative error versus discretization parameter Δτ for the checkerboard and block checkerboard methods.]

The algorithm is similar, but the basic element is a block submatrix. We will focus on the parallelization of the block checkerboard method.

SLIDE 40

Computational Difficulties

A combination of sparse-matrix and dense-matrix computation (a generalized SpMV):

  Dense matrix-matrix multiplication    compute bound
  Sparse matrix-vector multiplication   memory bound
  Sparse matrix-matrix multiplication   ?

Multiplication of a sequence of sparse matrices with different characteristics. The matrix size is not large enough to reach good performance.

SLIDE 41

2D and 3D Torus Lattices

The kinetic matrices in our study come from 2D and 3D torus lattices.

Images are from http://www.trampelwurm.ch/schmidt/wilhelmtux/swissremix/html/Linuxfibel/netstruct.htm and http://www.fujitsu.com/global/news/pr/archives/month/2009/20090717-01.html
SLIDE 42

2D Lattice and Kinetic Matrix

The kinetic matrix of a 2D lattice can be split into three block strictly sparse matrices, K1, K2, and K3. The checkerboard method approximates the matrix exponential by e^{K3/2} e^{K2/2} e^{K1} e^{K2/2} e^{K3/2}. The figure shows the sparsity pattern of the kinetic matrix of a 4 × 4 2D lattice and of the matrix exponentials e^{K1}, e^{K2}, and e^{K3}.

[Figure: sparsity patterns of (a) the kinetic matrix K, (b) expm(K1), (c) expm(K2), (d) expm(K3).]

SLIDE 43

3D Lattice and Kinetic Matrix

The kinetic matrix of a 3D lattice can be split into five block strictly sparse matrices, K1, K2, K3, K4, and K5. The checkerboard method approximates the matrix exponential by the symmetric product

  e^{K5/2} e^{K4/2} e^{K3/2} e^{K2/2} e^{K1} e^{K2/2} e^{K3/2} e^{K4/2} e^{K5/2}.

The figure shows the sparsity pattern of the kinetic matrix of a 4 × 4 × 4 3D lattice and of the matrix exponentials e^{K1}, e^{K2}, e^{K3}, e^{K4}, and e^{K5}.

[Figure: sparsity patterns of (a) the kinetic matrix K and (b)-(f) expm(K1) through expm(K5).]

SLIDE 45

e^{K1}: Block Diagonal Matrix

In both the 2D and 3D problems, e^{K1} = diag(A_1, A_2, ..., A_k) is block diagonal. For B partitioned into blocks B_ij, i, j = 1, ..., k, the product e^{K1}B has blocks

  (e^{K1}B)_ij = A_i B_ij,

i.e. block row i of B is simply multiplied by A_i.
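A minimal sketch of this block-diagonal product: each block row of B is multiplied by its own A_i, so the cost is O(N^2 b) for block size b instead of O(N^3). The function name and the list-of-blocks representation are my own choices, not from the talk.

import numpy as np

def blockdiag_mult(A_blocks, B):
    # A_blocks: the diagonal blocks A_1, ..., A_k (each b x b); B: dense N x N
    b = A_blocks[0].shape[0]
    F = np.empty_like(B)
    for i, Ai in enumerate(A_blocks):            # k independent block rows
        F[i * b:(i + 1) * b, :] = Ai @ B[i * b:(i + 1) * b, :]
    return F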

SLIDE 48

e^{Ki}, i ≠ 1: Bidiagonal Matrices

For i ≠ 1, e^{Ki} couples block rows in pairs: its (i, i), (i, j), (j, i), and (j, j) blocks are the diagonal matrices D_i, C_ij, C_ji, and D_j, so there are only two nonzeros per row/column. Let B = (B_ij) be partitioned into the same blocks. The i-th and j-th block rows of F = e^{Ki}B are then

  F_i = D_i B_i + C_ij B_j,    F_j = C_ji B_i + D_j B_j,

where B_i denotes the i-th block row of B.
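A sketch of applying one such factor to a dense B, assuming it is handed over as a list of coupled pairs (i, j, D_i, C_ij, C_ji, D_j) with the diagonal blocks stored as 1-D arrays of length b; this data layout is my own assumption.

import numpy as np

def apply_pair_factor(pairs, B, b):
    F = B.copy()                                  # block rows outside any pair act as identity
    for i, j, Di, Cij, Cji, Dj in pairs:
        Bi = B[i * b:(i + 1) * b, :]
        Bj = B[j * b:(j + 1) * b, :]
        # two diagonal blocks per block row: only O(b * N) work per pair
        F[i * b:(i + 1) * b, :] = Di[:, None] * Bi + Cij[:, None] * Bj
        F[j * b:(j + 1) * b, :] = Cji[:, None] * Bi + Dj[:, None] * Bj
    return F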

SLIDE 51

Parallel Algorithms for e^{K1}B

Different algorithms need to be designed for the different types of matrices. A block-based algorithm for e^{K1}B (a multicore sketch in code follows the pseudocode):

Parallelization of e^{K1}B
  For each block B_ij do
    Compute C_ij = A_i B_ij in parallel.
  End for

For cache efficiency, finer-grained parallelization is needed.

Finer-grained parallelization of e^{K1}B
  For each block B_ij do
    For each sub-block C_ij(k, ℓ) of C_ij do
      Compute C_ij(k, ℓ) in parallel.
    End for
  End for
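A rough multicore analogue of the block loop above, using a Python thread pool; each task computes one A_i B_ij. NumPy releases the GIL inside the BLAS products, so the tasks can overlap. This is only an illustrative sketch, not the tuned code measured later in the talk.

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_blockdiag_mult(A_blocks, B, n_threads=4):
    b = A_blocks[0].shape[0]
    k = len(A_blocks)
    F = np.empty_like(B)

    def task(ij):
        i, j = ij
        rows = slice(i * b, (i + 1) * b)
        cols = slice(j * b, (j + 1) * b)
        F[rows, cols] = A_blocks[i] @ B[rows, cols]   # C_ij = A_i B_ij

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(task, [(i, j) for i in range(k) for j in range(k)]))
    return F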

SLIDE 53

Parallel Algorithms for e^{Ki}B, i ≠ 1

Column-based parallelization: for better performance, aggregate the computation of the e^{Ki}B factors.

Parallelization of e^{Ki}B, i ≠ 1
  For each column b_i in B do
    Compute e^{K2/2} e^{K3/2} b_i (2D) or e^{K2/2} · · · e^{K5/2} b_i (3D) in parallel.
  End for

To fit the cache, divide each b_i into segments (a column-blocked sketch in code follows).

Cache-blocked parallelization of e^{Ki}B, i ≠ 1
  For each column b_i in B do
    For each segment b_i(:) do
      Compute e^{K2/2} e^{K3/2} b_i(:) or e^{K2/2} · · · e^{K5/2} b_i(:) in parallel.
    End for
  End for
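A column-blocked sketch of this aggregation: one cache-sized slab of columns of B is pushed through the whole chain of sparse exponential factors before moving on, instead of forming each full intermediate product. The factors are assumed to be scipy.sparse matrices, listed in the order they are applied (rightmost factor first); this interface is my own assumption.

import numpy as np

def apply_chain_by_columns(factors, B, col_block=64):
    F = np.array(B, copy=True)
    for c in range(0, B.shape[1], col_block):         # one cache-sized column slab at a time
        block = F[:, c:c + col_block]
        for E in factors:                             # e.g. e^{K5/2}, e^{K4/2}, ..., e^{K2/2}
            block = E @ block                         # sparse-dense product, O(nnz * cols)
        F[:, c:c + col_block] = block
    return F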

SLIDE 55

Experimental Setting

We evaluated the parallel algorithms on multicore and GPU. The matrices come from 2D and 3D toruses.

Multicore: Intel Core i7 950 CPU, 8 GB RAM, peak performance 48.48 Gflop/s. DGEMM from MKL 11 is used for the block matrix multiplications and for comparison.

GPU: GeForce GTX 480, peak performance 1344 Gflop/s. CUDA 4 is used; DGEMM and SGEMM from the CUDA SDK are used for comparison.

SLIDE 56

Results of 2D Problems on Multicore

[Figures: performance (Gflop/s) versus number of threads (1 to 8), for problem size/block size 1024/32, 4096/64, and 16384/128.]

Lattice sizes: 32 × 32, 64 × 64, and 128 × 128. Larger problems show more stable performance; the 32 × 32 results are less stable than the other cases. Hyperthreading has almost no effect.

SLIDE 57

Results of 3D Problems on Multicore

[Figures: performance (Gflop/s) versus problem size/block size (64/4 through 13824/24), for 1, 2, 3, 4, 6, and 8 threads.]

Lattice sizes: 4 × 4 × 4, 8 × 8 × 8, 12 × 12 × 12, 16 × 16 × 16, 20 × 20 × 20, and 24 × 24 × 24. The speedup is about 3 using 4 cores. The Gflop/s of the e^{K1} multiplication is much better than the overall performance. Hyperthreading helps, but its effect diminishes.

SLIDE 58

Results of 2D Problems on GPU

[Figures: execution time (microseconds, log scale) for the CUDA SDK GEMM and the block checkerboard kernel, and speedup over the SDK, versus problem size (64 to 4096), in single and double precision.]

Lattice sizes: 8 × 8, 16 × 16, 32 × 32, and 64 × 64. The speedup is up to 67× for single precision. The CUDA SDK DGEMM cannot finish the 64 × 64 case.

SLIDE 59

Results of 3D Problems on GPU

[Figures: execution time (microseconds, log scale) for the CUDA SDK GEMM and the block checkerboard kernel, and speedup over the SDK, versus problem size (64 to 4096), in single and double precision.]

Lattice sizes: 4 × 4 × 4, 8 × 8 × 8, 12 × 12 × 12, and 16 × 16 × 16. The speedup is up to 147× for single precision. The CUDA SDK DGEMM cannot finish the 16 × 16 × 16 case.

SLIDE 63

Concluding Remarks

Scientific applications demand more and more computational power to enable larger and larger simulations. But to extract performance from modern HPC machines, which combine coarse-grained and fine-grained parallel architectures, application developers need to redesign their programs.

For DQMC simulations, the checkerboard method makes the algorithm more scalable with problem size. This paper presents a preliminary study of the parallelization of the checkerboard method on multicore and GPU; further performance tuning is required.

The parallelization of the checkerboard method for general lattice geometries needs to be studied, and the parallel checkerboard method needs to be integrated into the simulation package.
