Interoperability of Shared Memory Parallel Programming Models with Charm++
Jaemin Choi
University of Illinois Urbana-Champaign
May 2, 2019

Overview
1. Why Interoperate with Charm++?
2. Compiling Libraries
3. Creating Hybrid Programs
4. Vector Addition Example
5. Kokkos vs. RAJA
6. Future Work

Why Interoperate with Charm++?
▶ Kokkos (SNL) and RAJA (LLNL)
  ▶ 'Performance portability'
  ▶ Abstractions for parallel execution and data management
  ▶ Limited to shared memory parallelism by themselves
▶ Use MPI for distributed memory execution
▶ Charm++ is another option
  ▶ Support for a wide variety of architectures
  ▶ Load balancing

Basic Interoperability (intra- & inter-node)
▶ Let Kokkos/RAJA handle shared memory parallelism
  ▶ OpenMP backend for CPU
  ▶ CUDA backend for GPU
▶ Use Charm++ for communication between processes

Compilation: Kokkos
    mkdir build && cd build
    ../generate_makefile.bash --prefix=<absolute path to build> \
        --with-cuda=<path to CUDA toolkit> --with-cuda-options=enable_lambda --with-openmp --arch=<CPU arch>,<GPU arch> \
        --compiler=<path to included NVCC wrapper>
    make -j kokkoslib
    make install
▶ Assume GPUs are available
  ▶ OpenMP and CUDA backends
▶ Headers (build/include) and library file (build/lib) after install are all we need

Compilation: RAJA
    mkdir build && cd build && mkdir install
    cmake -DENABLE_CUDA=On -DCMAKE_INSTALL_PREFIX=<path to RAJA install folder> ../
    make -j
    make install
▶ Assume GPUs are available
  ▶ OpenMP and CUDA backends
▶ Headers (install/include) and library file (install/lib) after install are all we need

Creating a Kokkos/RAJA + Charm++ Hybrid Program
▶ Write Kokkos/RAJA code in a .cpp file
  ▶ Can be put in the same file as Charm++ if GPU is not used (if CUDA backend not built)
▶ Write Charm++ code in a separate .C file
  ▶ A nodegroup chare for each Kokkos/RAJA instance
▶ Compile Kokkos/RAJA code with NVCC
  ▶ Additional options needed (e.g. -fopenmp)
  ▶ Use NVCC wrapper with Kokkos
▶ Use charmc to compile Charm++ code and link
  ▶ Need to link Kokkos/RAJA library
▶ Examples (Hello World, vector addition) in examples/charm++/shared_runtimes/[kokkos,raja]

Vector Addition Example: Kokkos
Listing 1: vecadd.ci
    mainmodule vecadd {
      ...
      mainchare Main {
        ...
      }

      // Encapsulate a Kokkos process/instance
      nodegroup Process {
        entry Process();
        entry void run();
      }
    }

Vector Addition Example: Kokkos
Listing 2: vecadd_charm.C
    ...
    class Process : public CBase_Process {
    public:
      Process() {
        kokkosInit(); // Calls Kokkos::initialize() internally
      }

      void run() {
        // Execute vector addition
        // Uses OpenMP by default, uses CUDA if use_gpu
        vecadd(n, CkMyNode(), use_gpu);

        kokkosFinalize(); // Calls Kokkos::finalize() internally

        // Contribute to Main to end the program
        ...
      }
    };
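The split between the Charm++ file and the Kokkos file can be sketched in plain C++ as follows. The names kokkosInit, kokkosFinalize, and vecadd come from the listing; everything in the standin namespace and the ProcessSketch type are hypothetical placeholders, and the function bodies only track state rather than calling Kokkos. The point is that the Charm++ side needs nothing more than the three declarations.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// ---- What the Charm++ side (.C file) would see --------------------------
// Plain declarations only: no Kokkos header is needed to call these.
void kokkosInit();
void kokkosFinalize();
void vecadd(std::uint64_t n, int process, bool use_gpu);

// ---- Stand-in for the Kokkos side (.cpp file) ---------------------------
// Assumption: the real definitions would call Kokkos::initialize(),
// Kokkos::finalize(), and launch a kernel. These only record state.
namespace standin {
bool initialized = false;
int kernels_run = 0;
}

void kokkosInit() { standin::initialized = true; }
void kokkosFinalize() { standin::initialized = false; }

void vecadd(std::uint64_t n, int /*process*/, bool /*use_gpu*/) {
  assert(standin::initialized && "runtime must be initialized first");
  std::vector<double> a(n, 1.0), b(n, 2.0);
  for (std::uint64_t i = 0; i < n; ++i) a[i] += b[i];
  ++standin::kernels_run;
}

// ---- Shape of the per-node object (no Charm++ base class here) ----------
struct ProcessSketch {
  ProcessSketch() { kokkosInit(); }  // initialize once per instance
  void run(std::uint64_t n, int node, bool use_gpu) {
    vecadd(n, node, use_gpu);
    kokkosFinalize();                // finalize after the last kernel
  }
};
```

Keeping the wrappers behind ordinary function declarations is one way to let each file be built by its own compiler, as the hybrid-program slide describes.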

Vector Addition Example: Kokkos
Listing 3: vecadd_kokkos.cpp
    #include <Kokkos_Core.hpp>
    ...
    // Views
    typedef Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::CudaSpace> CudaView;
    typedef Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::CudaHostPinnedSpace> HostView;

    // Functors
    template <typename ViewType>
    struct Compute {
      ViewType a, b;

      Compute(const ViewType& d_a, const ViewType& d_b) : a(d_a), b(d_b) {}

      KOKKOS_INLINE_FUNCTION
      void operator()(const int& i) const { a(i) += b(i); }
    };
    ...
    void vecadd(const uint64_t n, int process, bool use_gpu) {
      HostView h_a("Host A", n);
      CudaView d_a("Device A", n);
      CudaView d_b("Device B", n);

      Kokkos::parallel_for(Kokkos::RangePolicy<Kokkos::Cuda>(0, n), Compute<CudaView>(d_a, d_b));
      Kokkos::deep_copy(h_a, d_a);
    }
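The functor pattern in Listing 3 can be tried without Kokkos installed. The sketch below is a serial, host-only approximation: for_each_index and SpanView are hypothetical stand-ins (not Kokkos APIs) for the parallel dispatch and the View type, and add_with_functor is an illustrative helper. Only the shape of Compute follows the listing.

```cpp
#include <vector>

// Hypothetical stand-in for a parallel dispatch: calls f(i) for each i in
// [begin, end). A real backend would spread the iterations over threads.
template <typename Functor>
void for_each_index(int begin, int end, const Functor& f) {
  for (int i = begin; i < end; ++i) f(i);
}

// Hypothetical non-owning handle with View-like a(i) indexing.
struct SpanView {
  double* data;
  double& operator()(int i) const { return data[i]; }
};

// Same shape as the Compute functor in Listing 3.
template <typename ViewType>
struct Compute {
  ViewType a, b;
  Compute(const ViewType& a_, const ViewType& b_) : a(a_), b(b_) {}
  // const is fine here: the handle is unchanged, only the data it points to.
  void operator()(const int& i) const { a(i) += b(i); }
};

// Adds b into a element-wise and returns the result.
std::vector<double> add_with_functor(std::vector<double> a, std::vector<double> b) {
  SpanView va{a.data()}, vb{b.data()};
  for_each_index(0, static_cast<int>(a.size()), Compute<SpanView>(va, vb));
  return a;
}
```

For example, add_with_functor({1.0, 2.0, 3.0}, {10.0, 20.0, 30.0}) yields {11.0, 22.0, 33.0}.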

Vector Addition Example: RAJA
Listing 4: vecadd.ci
    mainmodule vecadd {
      ...
      mainchare Main {
        ...
      }

      // Encapsulate a RAJA process/instance
      nodegroup Process {
        entry Process();
        entry void run();
      }
    }

Vector Addition Example: RAJA
Listing 5: vecadd_charm.C
    ...
    class Process : public CBase_Process {
    public:
      Process() {
        // No initialization/cleanup needed
      }

      void run() {
        // Execute vector addition
        // Uses OpenMP by default, uses CUDA if use_gpu
        vecadd(n, CkMyNode(), use_gpu);

        // Contribute to Main to end the program
        ...
      }
    };

Vector Addition Example: RAJA
Listing 6: vecadd_raja.cpp
    void vecadd(const uint64_t n, int process, bool use_gpu) {
      double *h_a, *d_a, *d_b;
      cudaErrchk(cudaMallocHost((void**)&h_a, n * sizeof(double)));
      cudaErrchk(cudaMalloc((void**)&d_a, n * sizeof(double)));
      cudaErrchk(cudaMalloc((void**)&d_b, n * sizeof(double)));

      RAJA::forall<RAJA::cuda_exec<256>>(RAJA::RangeSegment(0, n),
        [=] RAJA_DEVICE (int i) {
          d_a[i] += d_b[i];
        });

      cudaErrchk(cudaMemcpy(h_a, d_a, n * sizeof(double), cudaMemcpyDeviceToHost));
    }
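The lambda style with explicitly managed buffers in Listing 6 can also be approximated on the host alone. In this sketch forall_range is a hypothetical serial stand-in (not a RAJA API), std::malloc and std::free take the place of the CUDA allocation calls, and add_with_lambda is an illustrative helper that fills the buffers with fixed values.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical host-only stand-in for a forall-style call:
// runs body(i) for each i in [begin, end).
template <typename Body>
void forall_range(int begin, int end, Body body) {
  for (int i = begin; i < end; ++i) body(i);
}

// Allocates two raw buffers, adds them with a lambda, frees them,
// and returns the sum of the resulting elements of a (-1.0 on failure).
double add_with_lambda(int n) {
  const std::size_t bytes = static_cast<std::size_t>(n) * sizeof(double);
  double* a = static_cast<double*>(std::malloc(bytes));
  double* b = static_cast<double*>(std::malloc(bytes));
  if (!a || !b) { std::free(a); std::free(b); return -1.0; }

  for (int i = 0; i < n; ++i) { a[i] = 1.0; b[i] = 2.0; }

  // [=] copies the two pointers into the lambda, not the arrays.
  forall_range(0, n, [=](int i) { a[i] += b[i]; });

  double total = 0.0;
  for (int i = 0; i < n; ++i) total += a[i];

  // The caller owns the buffers and must release them explicitly.
  std::free(a);
  std::free(b);
  return total;
}
```

With n = 4 every element of a becomes 3.0, so the function returns 12.0.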

Kokkos vs. RAJA
▶ Both allow C++ functors and lambdas for computation kernels
▶ Kokkos needs initialize and finalize calls
▶ Kokkos provides the View abstraction for memory management
  ▶ Explicit memory management in RAJA
▶ No performance difference in vector addition

Future Work
▶ What if we want more than one Kokkos/RAJA instance per node?
  ▶ In NUMA environments, etc.
  ▶ Should be able to pin Charm++ processes to a set of cores
▶ A more involved integration with Charm++ scheduler
▶ Other shared memory parallel frameworks: StarPU, OmpSS
▶ Performance comparison with standardized set of benchmarks

Thank You
