  1. Interoperability of Shared Memory Parallel Programming Models with Charm++
     Jaemin Choi
     University of Illinois Urbana-Champaign
     May 2, 2019

  2. Overview
     1. Why Interoperate with Charm++?
     2. Compiling Libraries
     3. Creating Hybrid Programs
     4. Vector Addition Example
     5. Kokkos vs. RAJA
     6. Future Work

  3. Why Interoperate with Charm++?
     ▶ Kokkos (SNL) and RAJA (LLNL)
       ▶ 'Performance portability'
       ▶ Abstractions for parallel execution and data management
       ▶ Limited to shared memory parallelism by themselves
       ▶ Use MPI for distributed memory execution
     ▶ Charm++ is another option
       ▶ Support for a wide variety of architectures
       ▶ Load balancing

  4. Basic Interoperability (intra- & inter-node)
     ▶ Let Kokkos/RAJA handle shared memory parallelism
       ▶ OpenMP backend for CPU
       ▶ CUDA backend for GPU
     ▶ Use Charm++ for communication between processes (see the sketch below)
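
     To make this division of labor concrete before the hybrid examples, below is a minimal, self-contained Kokkos sketch (not from the talk): the parallel loop runs on whichever shared-memory backend Kokkos was built with, and in the hybrid programs that follow, Charm++ drives one such call per process.

     #include <Kokkos_Core.hpp>

     int main(int argc, char* argv[]) {
       Kokkos::initialize(argc, argv); // selects the compiled backend (OpenMP/CUDA)
       {
         const int n = 1 << 20;
         Kokkos::View<double*> x("x", n); // allocated in the default memory space
         Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) {
           x(i) = 2.0 * i; // shared-memory work handled entirely by Kokkos
         });
         Kokkos::fence();
       }
       Kokkos::finalize();
       return 0;
     }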

  5. Compilation: Kokkos

     mkdir build && cd build
     ../generate_makefile.bash --prefix=<absolute path to install> \
       --with-cuda=<path to CUDA toolkit> \
       --with-cuda-options=enable_lambda --with-openmp \
       --arch=<CPU arch>,<GPU arch> \
       --compiler=<path to included NVCC wrapper>
     make kokkoslib -j
     make install

     ▶ Assume GPUs are available
     ▶ OpenMP and CUDA backends
     ▶ Headers (build/include) and library file (build/lib) after install are all we need
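
     For illustration, a concrete invocation might look like the following; the CUDA path, architecture names (SNB, Kepler35), and install prefix are placeholder values, not from the talk:

     # Illustrative only: substitute your own paths and architectures
     mkdir build && cd build
     ../generate_makefile.bash --prefix=$HOME/kokkos-install \
       --with-cuda=/usr/local/cuda \
       --with-cuda-options=enable_lambda --with-openmp \
       --arch=SNB,Kepler35 \
       --compiler=$PWD/../bin/nvcc_wrapper
     make kokkoslib -j
     make install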

  6. Compilation: RAJA

     mkdir build && cd build
     cmake -DENABLE_CUDA=On -DCMAKE_INSTALL_PREFIX=<path to RAJA install folder> ../
     make -j
     make install

     ▶ Assume GPUs are available
     ▶ OpenMP and CUDA backends
     ▶ Headers (install/include) and library file (install/lib) after install are all we need
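
     An illustrative invocation; the install prefix is a placeholder, and -DENABLE_OPENMP=On is spelled out here as an assumption (OpenMP support is typically on by default in RAJA builds):

     # Illustrative only: substitute your own install prefix
     mkdir build && cd build
     cmake -DENABLE_CUDA=On -DENABLE_OPENMP=On \
       -DCMAKE_INSTALL_PREFIX=$HOME/raja-install ../
     make -j
     make install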

  7. Creating a Kokkos/RAJA + Charm++ Hybrid Program
     ▶ Write Kokkos/RAJA code in a .cpp file
       ▶ Can be put in the same file as Charm++ if GPU is not used (if CUDA backend not built)
     ▶ Write Charm++ code in a separate .C file
       ▶ A nodegroup chare for each Kokkos/RAJA instance / process
     ▶ Compile Kokkos/RAJA code with NVCC
       ▶ Additional options needed (e.g. -fopenmp)
       ▶ Use NVCC wrapper with Kokkos
     ▶ Use charmc to compile Charm++ code and link (see the sketch below)
       ▶ Need to link Kokkos/RAJA library
     ▶ Examples (Hello World, vector addition) in examples/charm++/shared_runtimes/[kokkos,raja]
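
     As a sketch of these build steps for the Kokkos case (all paths are illustrative, assuming the install layout from the compilation slide):

     # Compile the Kokkos translation unit with the NVCC wrapper
     nvcc_wrapper -fopenmp -I$HOME/kokkos-install/include -c vecadd_kokkos.cpp
     # Process the interface file (generates vecadd.decl.h / vecadd.def.h)
     charmc vecadd.ci
     # Compile the Charm++ translation unit, then link everything
     charmc -c vecadd_charm.C
     charmc -o vecadd vecadd_charm.o vecadd_kokkos.o \
       -L$HOME/kokkos-install/lib -lkokkos -fopenmp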

  8. Vector Addition Example: Kokkos

     Listing 1: vecadd.ci

     mainmodule vecadd {
       mainchare Main {
         ...
       };

       // Encapsulate a Kokkos instance / process
       nodegroup Process {
         entry Process();
         entry void run();
       };
     };

  9. Vector Addition Example: Kokkos

     Listing 2: vecadd_charm.C

     class Process : public CBase_Process {
     public:
       Process() {
         kokkosInit(); // Calls Kokkos::initialize() internally
       }

       void run() {
         // Execute vector addition
         // Uses OpenMP by default, uses CUDA if use_gpu
         vecadd(n, CkMyNode(), use_gpu);
         kokkosFinalize(); // Calls Kokkos::finalize() internally
         // Contribute to Main to end the program
         ...
       }
     };
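
     The Main chare is elided on the slide; the following is a hypothetical sketch of what it might look like, where the proxy names and the done() reduction target are assumptions rather than the talk's actual code:

     class Main : public CBase_Main {
     public:
       Main(CkArgMsg* m) {
         delete m;
         // Create one Process per logical node and start the computation
         CProxy_Process procs = CProxy_Process::ckNew();
         procs.run();
       }
       // Hypothetical reduction target, reached once every Process has contributed
       void done() { CkExit(); }
     };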

  10. Vector Addition Example: Kokkos

      Listing 3: vecadd_kokkos.cpp

      #include <Kokkos_Core.hpp>

      // Views
      typedef Kokkos::View<double*, Kokkos::LayoutLeft,
        Kokkos::CudaSpace> CudaView;
      typedef Kokkos::View<double*, Kokkos::LayoutRight,
        Kokkos::CudaHostPinnedSpace> HostView;

      // Functors
      template <typename ViewType>
      struct Compute {
        ViewType a, b;
        Compute(const ViewType& d_a, const ViewType& d_b) : a(d_a), b(d_b) {}
        KOKKOS_INLINE_FUNCTION
        void operator() (const int& i) const { a(i) += b(i); }
      };
      ...

      void vecadd(const uint64_t n, int process, bool use_gpu) {
        HostView h_a("Host A", n);
        CudaView d_a("Device A", n);
        CudaView d_b("Device B", n);
        ...
        Kokkos::parallel_for(Kokkos::RangePolicy<Kokkos::Cuda>(0, n),
          Compute<CudaView>(d_a, d_b));
        Kokkos::deep_copy(h_a, d_a);
      }
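
      Only the CUDA path appears on the slide. As a sketch of what the use_gpu == false branch might look like (an assumption, not shown in the talk), the same Compute functor can run through the OpenMP execution space on host views:

      // Assumes the includes and the Compute functor from Listing 3.
      typedef Kokkos::View<double*, Kokkos::HostSpace> PlainHostView; // hypothetical

      void vecadd_host(const uint64_t n) {
        PlainHostView a("A", n);
        PlainHostView b("B", n);
        Kokkos::parallel_for(Kokkos::RangePolicy<Kokkos::OpenMP>(0, n),
          Compute<PlainHostView>(a, b)); // runs on CPU threads via OpenMP
        Kokkos::fence();
      }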

  11. Vector Addition Example: RAJA

      Listing 4: vecadd.ci

      mainmodule vecadd {
        mainchare Main {
          ...
        };

        // Encapsulate a RAJA instance / process
        nodegroup Process {
          entry Process();
          entry void run();
        };
      };

  12. Vector Addition Example: RAJA

      Listing 5: vecadd_charm.C

      class Process : public CBase_Process {
      public:
        Process() {
          // No initialization / cleanup needed
        }

        void run() {
          // Execute vector addition
          // Uses OpenMP by default, uses CUDA if use_gpu
          vecadd(n, CkMyNode(), use_gpu);
          // Contribute to Main to end the program
          ...
        }
      };

  13. Vector Addition Example: RAJA

      Listing 6: vecadd_raja.cpp

      void vecadd(const uint64_t n, int process, bool use_gpu) {
        double *h_a, *d_a, *d_b;
        cudaErrchk(cudaMallocHost((void**)&h_a, n * sizeof(double)));
        cudaErrchk(cudaMalloc((void**)&d_a, n * sizeof(double)));
        cudaErrchk(cudaMalloc((void**)&d_b, n * sizeof(double)));
        ...
        RAJA::forall<RAJA::cuda_exec<256>>(RAJA::RangeSegment(0, n),
          [=] RAJA_DEVICE (int i) {
            d_a[i] += d_b[i];
          });
        cudaErrchk(cudaMemcpy(h_a, d_a, n * sizeof(double),
          cudaMemcpyDeviceToHost));
      }
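
      Again only the CUDA path is shown; below is a sketch of a possible use_gpu == false branch (an assumption, not from the talk) using RAJA's OpenMP execution policy on ordinary host memory:

      // Assumes <vector> and the RAJA headers from Listing 6 are included.
      void vecadd_host(const uint64_t n) {
        std::vector<double> a(n, 1.0), b(n, 2.0); // hypothetical host buffers
        double* pa = a.data();
        const double* pb = b.data();
        RAJA::forall<RAJA::omp_parallel_for_exec>(RAJA::RangeSegment(0, n),
          [=](int i) { pa[i] += pb[i]; }); // runs on CPU threads via OpenMP
      }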

  14. Kokkos vs. RAJA
      ▶ Both allow C++ functors and lambdas for computation kernels
      ▶ Kokkos needs initialize and finalize calls
      ▶ Kokkos provides the View abstraction for memory management
      ▶ Explicit memory management in RAJA
      ▶ No performance difference in vector addition

  15. Future Work
      ▶ What if we want more than one Kokkos/RAJA instance per node?
        ▶ In NUMA environments, etc.
        ▶ Should be able to pin Charm++ processes to a set of cores
        ▶ A more involved integration with the Charm++ scheduler
      ▶ Other shared memory parallel frameworks: StarPU, OmpSs
      ▶ Performance comparison with a standardized set of benchmarks

  16. Thank You
