efficient abstractions for gpgpu programming
play

Efficient Abstractions for GPGPU Programming . Mathias Bourgoin - PowerPoint PPT Presentation

. Efficient Abstractions for GPGPU Programming . Mathias Bourgoin 10.03.2015 Efficient abstractions for GPGPU programming . PhD (LIP6/UPMC) . GPGPU programming general purpose computations on the GPU Abstractions languages and


  1. . Efficient Abstractions for GPGPU Programming . Mathias Bourgoin 10.03.201​5

  2. Efficient abstractions for GPGPU programming . PhD (LIP6/UPMC) . GPGPU programming → general purpose computations on the GPU Abstractions → languages and algorithmic constructs Efficient → High Performance Computing Applications → computational science and numerical simulation . . OpenGPU project . Systematic Cluster Academic and Industrial partners Goal : provide open-source solutions for GPGPU programming Success : develop real size numerical applications . Mathias Bourgoin (INPG-Verimag) Efficient Abstractions for GPGPU Programming 10.03.15 2 / 25

  3. Graphic card . Properties of a dedicated graphic card . Several multi-processors Dedicated memory Connected to a host (CPU) via a PCI-Express bus Implies data transfers between host and graphic card memories Complex and specific programming . . Current hardware . CPU GPU # cores 4-16 300-2000 Max memory 32GB 6GB GFLOPS SP 200 1000-4000 GFLOPS DP 100 100-1000 . Mathias Bourgoin (INPG-Verimag) Efficient Abstractions for GPGPU Programming 10.03.15 3 / 25

  4. GPGPU Programming Two main frameworks Cuda (NVidia) OpenCL (Consortium OpenCL) Different languages To write kernels Assembly (PTX, SPIR, IL,…) Subsets of C/C++ To manage kernels C/C++/Objective-C Bindings : Fortran, Python, Java, … . Stream Processing . From a data set (stream), a series of computations (kernel) is applied to each element of the stream. . Mathias Bourgoin (INPG-Verimag) Efficient Abstractions for GPGPU Programming 10.03.15 4 / 25

  5. GPGPU programming in practice 1 Grid Block 0 Block 1 Shared memory Shared memory Registers Registers Registers Registers Thread (0,0) Thread (1,0) Thread (0,1) Thread (1,1) Local mem. Local mem. Local mem. Local mem. Global memory . Do not forget tranfers between the host and its guests CPU-X86 GPU Mobile GPU Gamer GPU HPC i7-3770K GTX 680M GTX 680 7970HD K20X Memory bandwidth 25.6GB/s 115.2 GB/s 192.2GB/s 264GB/s 250GB/s PCI-Express 3.0 maximum bandwidth is 16GB/s . . . . . . . . . . . . . . . . . . . . Mathias Bourgoin (INPG-Verimag) Efficient Abstractions for GPGPU Programming 10.03.15 5 / 25

  6. GPGPU programming in practice 2 Kernel : small example using OpenCL . Vector addition . __kernel void vec_add ( __global const double * a , __global const double * b , __global double * c , i n t N ) { i n t nIndex = get_global_id ( 0 ) ; i f ( nIndex >= N ) return ; c [ nIndex ] = a [ nIndex ] + b [ nIndex ] ; } . Mathias Bourgoin (INPG-Verimag) Efficient Abstractions for GPGPU Programming 10.03.15 6 / 25

  7. GPGPU programming in practice 2 Host : small example using C / / c r e a t e OpenCL d e v i c e & c o n t e x t CL_MEM_READ_ONLY | ← ֓ cl_context hContext ; CL_MEM_COPY_HOST_PTR , hContext = clCreateContextFromType ( 0 , ← cnDimension * s i z e o f ( cl_double ) , ֓ CL_DEVICE_TYPE_GPU , pA , 0 , 0 , 0) ; 0) ; / / query a l l d e v i c e s a v a i l a b l e to the c o n t e x t hDeviceMemB = clCreateBuffer ( hContext , size_t nContextDescriptorSize ; CL_MEM_READ_ONLY | ← ֓ clGetContextInfo ( hContext , CL_CONTEXT_DEVICES , CL_MEM_COPY_HOST_PTR , 0 , 0 , & nContextDescriptorSize ) ; cnDimension * s i z e o f ( cl_double ) , cl_device_id * aDevices = malloc ( ← pA , ֓ nContextDescriptorSize ) ; 0) ; clGetContextInfo ( hContext , CL_CONTEXT_DEVICES , hDeviceMemC = clCreateBuffer ( hContext , nContextDescriptorSize , aDevices , 0) ← CL_MEM_WRITE_ONLY , ֓ ; cnDimension * s i z e o f ( cl_double ) , / / c r e a t e a command queue f o r f i r s t d e v i c e the ← 0 , 0) ; ֓ c o n t e x t r e p o r t e d / / setup parameter v a l u e s cl_command_queue hCmdQueue ; clSetKernelArg ( hKernel , 0 , s i z e o f ( cl_mem ) , ( void * )& ← ֓ hCmdQueue = clCreateCommandQueue ( hContext , aDevices ← hDeviceMemA ) ; ֓ [ 0 ] , 0 , 0) ; clSetKernelArg ( hKernel , 1 , s i z e o f ( cl_mem ) , ( void * )& ← ֓ / / c r e a t e & compile program hDeviceMemB ) ; cl_program hProgram ; clSetKernelArg ( hKernel , 2 , s i z e o f ( cl_mem ) , ( void * )& ← ֓ hProgram = clCreateProgramWithSource ( hContext , 1 , hDeviceMemC ) ; sProgramSource , ← / / e x e c u t e k e r n e l ֓ 0 , 0) ; clEnqueueNDRangeKernel ( hCmdQueue , hKernel , 1 , 0 , clBuildProgram ( hProgram , 0 , 0 , 0 , 0 , 0) ; & cnDimension , 0 , 0 , 0 , 0) ; / / copy r e s u l t s from d e v i c e back to host / / c r e a t e k e r n e l clEnqueueReadBuffer ( hContext , hDeviceMemC , CL_TRUE , ← ֓ cl_kernel hKernel ; 0 , hKernel = clCreateKernel ( hProgram , “” vec_add , 0) ; cnDimension * s i z e o f ( cl_double ) , pC , 0 , 0 , 0) ; / / a l l o c a t e d e v i c e memory clReleaseMemObj ( hDeviceMemA ) ; cl_mem hDeviceMemA , hDeviceMemB , hDeviceMemC ; clReleaseMemObj ( hDeviceMemB ) ; hDeviceMemA = clCreateBuffer ( hContext , clReleaseMemObj ( hDeviceMemC ) ; Mathias Bourgoin (INPG-Verimag) Efficient Abstractions for GPGPU Programming 10.03.15 7 / 25

  8. GPGPU Programming with OCaml Mathias Bourgoin (INPG-Verimag) Efficient Abstractions for GPGPU Programming 10.03.15 8 / 25

  9. Main Goals . Target Cuda/OpenCL frameworks with OCaml Unify these two frmeworks Abstract memory transfers Use static type checking to verify kernels Propose abstractions for GPGPU programming Keep the high performance . . Host-side solution : an OCaml library . . Mathias Bourgoin (INPG-Verimag) Efficient Abstractions for GPGPU Programming 10.03.15 9 / 25

  10. SPOC overview . Abstract frameworks . Unify both APIs (Cuda/OpenCL), dynamic linking . Portable solution, multi-GPGPU, heterogeneous . . Abstract transfers . Vectors move automatically between CPU and GPGPUs On-demand (lazy) transfers Automatic allocation/dealloction of the memory space used by vectors (on the host as well as on GPGPU devices) Failure during allocation on a GPGPU triggers a garbage collection . Mathias Bourgoin (INPG-Verimag) Efficient Abstractions for GPGPU Programming 10.03.15 10 / 25

  11. External kernels . Type safety . Static type checking of kernel parameters (at compile-time). Kernel.run compiles kernels from .ptx / .cl sources. . kernel vec_add : Vector.vfloat64 -> Vector.vfloat64 -> Vector.vfloat64 -> int -> unit = «kernels» «vec_add» kernel launch dev Compilation/Execution Kernel.run vec_add dev .entry vec_add (…){ 
 … 
 } Cuda for i = 0 to Vector.length v3 - 1 do ! kernels.ptx ! printf « res[%d] = %f\n » ! Kernel.run vec_add dev ! ! i v3.[<i>] ! done; __kernel void vec_add (…){ 
 … 
 } OpenCL kernels.cl Mathias Bourgoin (INPG-Verimag) Efficient Abstractions for GPGPU Programming 10.03.15 11 / 25

  12. Sarek : Stream ARchitecture using Extensible Kernels . Vector addition with Sarek . l e t vec_add = kern a b c n − > l e t open Std in l e t open Math . Float64 in l e t idx = global_thread_id in i f idx < n then c .[ < idx >] < − add a .[ < idx >] b .[ < idx >] . . Vector addition with OpenCL . __kernel void vec_add ( __global const double * a , __global const double * b , __global double * c , i n t N ) { i n t nIndex = get_global_id ( 0 ) ; i f ( nIndex >= N ) return ; c [ nIndex ] = a [ nIndex ] + b [ nIndex ] ; } . Mathias Bourgoin (INPG-Verimag) Efficient Abstractions for GPGPU Programming 10.03.15 12 / 25

  13. Sarek . Vector addition with Sarek . l e t vec_add = kern a b c n − > l e t open Std in l e t open Math . Float64 in l e t idx = global_thread_id in i f idx < n then c .[ < idx >] < − add a .[ < idx >] b .[ < idx >] . . Sarek features . ML-like syntax type inference static type checking static compilation to OCaml code dynamic compilation to Cuda/OpenCL . Mathias Bourgoin (INPG-Verimag) Efficient Abstractions for GPGPU Programming 10.03.15 13 / 25

  14. Sarek static compilation Sarek code . . . . . . . . . . . . . . . . kern a → let idx = Std .global_thread_id () in a. [ < idx > ] ← 0 IR Bind( (Id 0), (ModuleAccess((Std), Typing (global_thread_id)), (VecSet(VecAcc…)))) typed IR OCaml code generation spoc_kernel generation Kir generation spoc_kernel OCaml Code Kir fun a − > Kern class spoc_class1 let idx = Params method run = ... Std.global_thread_id () VecVar 0 method compile = ... in a. [ < idx > ] < − 0l VecVar 1 end … new spoc_class1 Mathias Bourgoin (INPG-Verimag) Efficient Abstractions for GPGPU Programming 10.03.15 14 / 25

  15. Sarek dynamic compilation . . . . . . . . .let my_kernel = kern ... − > ... . . . . . . . . . . . . nvcc -O3 -ptx… ... ;; Compile to Compile Kirc.gen my_kernel ; Cuda C source file to OpenCL C99 Kirc.run my_kernel dev (block,grid) ; Cuda ptx assembly device OpenCL Cuda Compile kernel OpenCL C99 Cuda ptx assembly and source Run Return to OCaml code execution Mathias Bourgoin (INPG-Verimag) Efficient Abstractions for GPGPU Programming 10.03.15 15 / 25

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend