A Physicist's Guide to Parallelization at the IHPC. Viðar Guðmundsson.
A Physicist's Guide to Parallelization at the IHPC
Viðar Guðmundsson, Science Institute, University of Iceland, vidar@hi.is, 2017
Transport of electrons through dots in a photon cavity
Exactly interacting electrons and photons, geometry
Equation of motion
Complicated construction through functional spaces
⇓ Non-interacting single-electron basis, Hilbert space
⇓ Interacting many-electron states, Fock space
⇓ Non-interacting electron-photon basis, Fock space
⇓ Photon-dressed electron states, Fock space
⇓ Open system → projection on the central system
⇓ Equation of motion, linear space of transitions, Liouville space
A lot of analytical work and planning
⇓ Heavy use of linear algebra
⇓ Truncation of infinite-dimensional spaces
⇓ Compact (dense) complex matrices
⇓ Intel MKL, BLAS, LAPACK, OpenMP – cuBLAS, MAGMA
⇓ gfortran + ifort – ifort + CUDA
BLAS
Always use the (parallel, F95-interface) BLAS for simple linear-algebra tasks:
Level 1: vector-vector operations
Level 2: matrix-vector operations
Level 3: matrix-matrix operations
Your own hand-written code cannot compete!
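The same three BLAS levels can be illustrated with NumPy, which dispatches these operations to an underlying BLAS (MKL or OpenBLAS). This is a minimal sketch, not the talk's Fortran code; the array names and sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
y = rng.standard_normal(n) + 1j * rng.standard_normal(n)

# Level 1: vector-vector, e.g. an (unconjugated) dot product (cf. BLAS zdotu)
level1 = np.dot(x, y)

# Level 2: matrix-vector product (cf. BLAS zgemv)
level2 = A @ x

# Level 3: matrix-matrix product (cf. BLAS zgemm)
level3 = A @ B

# The library call agrees with an explicit triple loop,
# but the BLAS version is far faster on large matrices.
loop = np.array([[sum(A[i, l] * B[l, j] for l in range(n)) for j in range(n)]
                 for i in range(n)])
assert np.allclose(level3, loop)
```

The point of the slide holds here as well: the hand-written loop gives the same numbers, but only the library call gets blocking, vectorization, and threading for free.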
Module – wrapper
Subroutine → function – matrix multiplication
MODULE PAB
CONTAINS
  FUNCTION MATMULVG(Af, Bf)
    USE Mod_Precision
    USE lapack95
    USE blas95
    USE omp_lib
    IMPLICIT NONE
    INTEGER :: Nd1, Nd2, ierr
    COMPLEX(KIND=dp), ALLOCATABLE, DIMENSION(:,:) :: MATMULVG, Af, Bf
    Nd1 = SIZE(Af,1)
    Nd2 = SIZE(Bf,2)
    ierr = 0
    ALLOCATE(MATMULVG(Nd1,Nd2), STAT=ierr)
    MATMULVG = (0.0_dp, 0.0_dp)
    CALL GEMM3M(Af, Bf, MATMULVG)   ! MKL complex matrix-matrix product
  END FUNCTION MATMULVG
END MODULE PAB
Usage – readability
ρvec(t) = {V exp(Lt)U} ρ(0)
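One way to read this formula is as a spectral decomposition of the Liouvillian. This is a sketch; the decomposition L = V Λ U with UV = 1 is an assumption about the notation, consistent with the vrV, vlU, and expiLt names in the subroutine below:

```latex
\mathcal{L} = V \Lambda U, \qquad U V = \mathbb{1}
\quad\Longrightarrow\quad
e^{\mathcal{L}t} = V\, e^{\Lambda t}\, U,
\qquad
\vec{\rho}(t) = \left\{ V\, e^{\Lambda t}\, U \right\} \vec{\rho}(0)
```

Since Λ is diagonal, e^{Λt} is computed elementwise, so the full propagator never has to be re-exponentiated for each time step; only dense matrix-vector and matrix-matrix products remain, which is exactly what BLAS is for.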
SUBROUTINE NaZwa_gR_errCorr_I
  ...
  USE PAB
  USE PcAB
  USE PABz
  USE PAv
  USE PAvT
  ...
  rhotv = MATvecVG(MATMULVG(MATMULVG(vrV, expiLt), vlU), rho0v)
  ...
END SUBROUTINE NaZwa_gR_errCorr_I
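What the chained wrapper calls compute can be sketched in a few lines of NumPy. Everything here is hypothetical stand-in data (a random matrix in place of the Liouvillian; in the talk vrV, vlU, and expiLt are precomputed once), used only to check the algebra:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
# Hypothetical stand-in for the Liouvillian: a random complex matrix.
L = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

lam, V = np.linalg.eig(L)   # right eigenvectors: L V = V diag(lam)
U = np.linalg.inv(V)        # matrix of "left" vectors with U V = 1

rho0 = rng.standard_normal(n) + 1j * rng.standard_normal(n)

def rho_t(t):
    # rho(t) = { V exp(lam t) U } rho0, exp taken elementwise on the spectrum
    return V @ (np.exp(lam * t) * (U @ rho0))

# Consistency checks: rho(0) = rho0, and d/dt rho = L rho
t = 0.3
drho_dt = V @ (lam * np.exp(lam * t) * (U @ rho0))
assert np.allclose(rho_t(0.0), rho0)
assert np.allclose(L @ rho_t(t), drho_dt)
```

The readability win on the slide is the same one as here: a one-line expression that mirrors the formula, instead of a cascade of named temporaries around GEMM calls.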
GPU-cards
We can also off-load the matrix operations to GPU cards
MODULE Mod_MVG
CONTAINS
  FUNCTION MVG(A, B)
    ! Input:  two complex N x N matrices A and B
    ! Output: complex N x N matrix C = A*B
    USE Mod_Precision
    USE mkl95_lapack
    USE blas95
    USE omp_lib
    IMPLICIT NONE
    INTEGER :: m, k, n, ierr
    COMPLEX(KIND=dp), ALLOCATABLE, DIMENSION(:,:) :: MVG, A, B
    m = SIZE(A,1)
    n = SIZE(B,2)
    k = SIZE(A,2)
    ierr = 0
    ALLOCATE(MVG(m,n), STAT=ierr)
    MVG = (0.0_dp, 0.0_dp)
    CALL matmulvgcuda(A, B, MVG, m, k, n)   ! C/CUDA wrapper, see below
  END FUNCTION MVG
END MODULE Mod_MVG
extern "C" void matmulvgcuda_(cuDoubleComplex *A, cuDoubleComplex *B, cuDoubleComplex *C,
                              int *m, int *k, int *n)
{
  cudaSetDevice(0);
  cuDoubleComplex *d_A = 0;
  cuDoubleComplex *d_B = 0;
  cuDoubleComplex *d_C = 0;
  cuDoubleComplex alpha = {.x = 1.0, .y = 0.0};
  cuDoubleComplex beta  = {.x = 0.0, .y = 0.0};
  int matrixSizeA = (*m)*(*k);
  int matrixSizeB = (*k)*(*n);
  int matrixSizeC = (*m)*(*n);
  cublasHandle_t handle;

  /* Initialize CUBLAS */
  CUBLAS_HANDLE_ERROR( cublasCreate(&handle) );

  /* Allocate device memory for the matrices */
  CUDA_HANDLE_ERROR( cudaMalloc((void **)&d_A, matrixSizeA*sizeof(cuDoubleComplex)) );
  CUDA_HANDLE_ERROR( cudaMalloc((void **)&d_B, matrixSizeB*sizeof(cuDoubleComplex)) );
  CUDA_HANDLE_ERROR( cudaMalloc((void **)&d_C, matrixSizeC*sizeof(cuDoubleComplex)) );

  /* Copy host matrices to the device */
  CUBLAS_HANDLE_ERROR( cublasSetVector(matrixSizeA, sizeof(cuDoubleComplex), A, 1, d_A, 1) );
  CUBLAS_HANDLE_ERROR( cublasSetVector(matrixSizeB, sizeof(cuDoubleComplex), B, 1, d_B, 1) );

  /* Perform the multiplication with CUBLAS */
  CUBLAS_HANDLE_ERROR( cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, *m, *n, *k,
                                   &alpha, d_A, *m, d_B, *k, &beta, d_C, *m) );

  /* Read the result back */
  CUBLAS_HANDLE_ERROR( cublasGetVector(matrixSizeC, sizeof(cuDoubleComplex), d_C, 1, C, 1) );

  /* Clean up device memory */
  CUDA_HANDLE_ERROR( cudaFree(d_A) );
  CUDA_HANDLE_ERROR( cudaFree(d_B) );
  CUDA_HANDLE_ERROR( cudaFree(d_C) );

  /* Shutdown */
  CUBLAS_HANDLE_ERROR( cublasDestroy(handle) );
}
GPU – off loading
Information on the FORTRAN-CUDA modules: vidar@hi.is
LAPACK: MAGMA, Matrix Algebra on GPU and Multicore Architectures, http://icl.cs.utk.edu/magma/
cuBLAS: https://developer.nvidia.com/cublas
OpenMP
Matrix elements:
!$OMP PARALLEL DO PRIVATE(fs, sumLp, sumRp, FermiS, Cpnu) &
!$OMP SHARED(TL, TR, TauLp, TauRp, gamma, Plus, i1, i2, j1, j2) SCHEDULE(DYNAMIC)
DO nu = 1, N_mes
  DO mu = 1, N_mes
    sumLp = Czero
    sumRp = Czero
    DO ia = 1, N_ses
      FermiS = gamma(nu,ia)
      Cpnu   = Plus(nu,ia)
      IF (mu .EQ. Cpnu) THEN
        fs    = (-1)**FermiS
        sumLp = sumLp + TL(i1,j1,ia)*CMPLX(REAL(fs,dp), 0.0_dp, dp)
        sumRp = sumRp + TR(i2,j2,ia)*CMPLX(REAL(fs,dp), 0.0_dp, dp)
      END IF
    END DO
    TauLp(mu,nu) = sumLp
    TauRp(mu,nu) = sumRp
  END DO
END DO
!$OMP END PARALLEL DO
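The iterations over nu are independent, which is why the outer loop parallelizes with no synchronization. A Python transcription of the serial logic (all names and the toy sizes here are hypothetical; TL1[ia] stands in for TL(i1,j1,ia) at fixed i1, j1) makes the selection rule mu == Plus(nu, ia) explicit:

```python
import numpy as np

N_mes, N_ses = 4, 3
rng = np.random.default_rng(2)

# Toy stand-ins for the arrays on the slide
TL1 = rng.standard_normal(N_ses) + 1j * rng.standard_normal(N_ses)
TR1 = rng.standard_normal(N_ses) + 1j * rng.standard_normal(N_ses)
gamma = rng.integers(0, 2, size=(N_mes, N_ses))     # Fermi-sign exponents
Plus = rng.integers(0, N_mes, size=(N_mes, N_ses))  # target-state indices (0-based here)

TauLp = np.zeros((N_mes, N_mes), dtype=complex)
TauRp = np.zeros((N_mes, N_mes), dtype=complex)
for nu in range(N_mes):          # the OpenMP-parallel loop
    for mu in range(N_mes):
        for ia in range(N_ses):
            if mu == Plus[nu, ia]:             # selection rule
                fs = (-1) ** gamma[nu, ia]     # Fermi sign
                TauLp[mu, nu] += fs * TL1[ia]
                TauRp[mu, nu] += fs * TR1[ia]
```

Each column nu collects its N_ses contributions in the rows picked out by Plus(nu, :), so every write stays inside that thread's column and no two threads touch the same element.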
Many dimensions
OpenMP – shared RAM – up to 32 cores on Garðar
MPI – distributed RAM – images
OpenMP – depth of parallel tasks
BLAS – linear algebra
Modules – readability – easy usage
Importance of analytical + numerical work
Efficient determination of the Markovian time-evolution towards a steady-state of a complex open quantum system, Computer Physics Communications, in press (2017), https://doi.org/10.1016/j.cpc.2017.06.018
Conclusions
MKL, OpenMP, cuBLAS, MAGMA
Faster development than with MPI
Build modules (wrappers)
Get clever students
gfortran – ifort
Collaboration and support
Þorsteinn Hjörtur Jónsson (UI) Andrei Manolescu (RU) Chi-Shung Tang (NUU) Hsi-Sheng Goan (NTU) Anna Sitek (UI) Nzar Rauf Abdullah (KUS) Maria Laura Bernodusson (ALUF) Sigtryggur Hauksson (McGill) Árni Johnsen (UI) Elías Snorrason (UI)
University of Iceland Research Fund
The Icelandic Research Fund
The Taiwan Ministry of Technology