Efficient Abstractions for GPGPU Programming. Mathias Bourgoin. PowerPoint presentation transcript.



SLIDE 1

Efficient Abstractions for GPGPU Programming

Mathias Bourgoin 10.03.2015

SLIDE 2

Efficient abstractions for GPGPU programming


PhD (LIP6/UPMC)

GPGPU programming → general purpose computations on the GPU
Abstractions → languages and algorithmic constructs
Efficient → High Performance Computing
Applications → computational science and numerical simulation

OpenGPU project

Systematic Cluster: academic and industrial partners
Goal: provide open-source solutions for GPGPU programming
Success: develop real-size numerical applications

Mathias Bourgoin (INPG-Verimag) Efficient Abstractions for GPGPU Programming 10.03.15 2 / 25

SLIDE 3

Graphic card


Properties of a dedicated graphic card

Several multi-processors
Dedicated memory
Connected to a host (CPU) via a PCI-Express bus
Implies data transfers between host and graphic-card memories
Complex and specific programming

Current hardware

              CPU      GPU
# cores       4-16     300-2000
Max memory    32 GB    6 GB
GFLOPS SP     200      1000-4000
GFLOPS DP     100      100-1000

SLIDE 4

GPGPU Programming

Two main frameworks: Cuda (NVidia) and OpenCL (the OpenCL consortium).
Different languages:
To write kernels: assembly (PTX, SPIR, IL, …) or subsets of C/C++
To manage kernels: C/C++/Objective-C; bindings for Fortran, Python, Java, …

Stream Processing

From a data set (stream), a series of computations (a kernel) is applied to each element of the stream.

SLIDE 5

GPGPU programming in practice 1

[Diagram: the GPU memory hierarchy. A grid shares global memory; each block (Block 0, Block 1) has its own shared memory; each thread (0,0), (1,0), (0,1), (1,1) has private registers and local memory.]

Do not forget transfers between the host and its guests.

                   CPU x86     Mobile GPU   Gamer GPU   Gamer GPU   HPC GPU
                   i7-3770K    GTX 680M     GTX 680     7970HD      K20X
Memory bandwidth   25.6 GB/s   115.2 GB/s   192.2 GB/s  264 GB/s    250 GB/s

PCI-Express 3.0 maximum bandwidth is 16GB/s

SLIDE 6

GPGPU programming in practice 2

Kernel : small example using OpenCL


Vector addition

__kernel void vec_add(__global const double *a,
                      __global const double *b,
                      __global double *c,
                      int N)
{
    int nIndex = get_global_id(0);
    if (nIndex >= N) return;
    c[nIndex] = a[nIndex] + b[nIndex];
}

SLIDE 7

GPGPU programming in practice 2

Host : small example using C

// create OpenCL device & context
cl_context hContext;
hContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, 0, 0, 0);

// query all devices available to the context
size_t nContextDescriptorSize;
clGetContextInfo(hContext, CL_CONTEXT_DEVICES, 0, 0, &nContextDescriptorSize);
cl_device_id *aDevices = malloc(nContextDescriptorSize);
clGetContextInfo(hContext, CL_CONTEXT_DEVICES, nContextDescriptorSize, aDevices, 0);

// create a command queue for the first device the context reported
cl_command_queue hCmdQueue;
hCmdQueue = clCreateCommandQueue(hContext, aDevices[0], 0, 0);

// create & compile program
cl_program hProgram;
hProgram = clCreateProgramWithSource(hContext, 1, sProgramSource, 0, 0);
clBuildProgram(hProgram, 0, 0, 0, 0, 0);

// create kernel
cl_kernel hKernel;
hKernel = clCreateKernel(hProgram, "vec_add", 0);

// allocate device memory
cl_mem hDeviceMemA, hDeviceMemB, hDeviceMemC;
hDeviceMemA = clCreateBuffer(hContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             cnDimension * sizeof(cl_double), pA, 0);
hDeviceMemB = clCreateBuffer(hContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             cnDimension * sizeof(cl_double), pB, 0);
hDeviceMemC = clCreateBuffer(hContext, CL_MEM_WRITE_ONLY,
                             cnDimension * sizeof(cl_double), 0, 0);

// set up parameter values
clSetKernelArg(hKernel, 0, sizeof(cl_mem), (void *)&hDeviceMemA);
clSetKernelArg(hKernel, 1, sizeof(cl_mem), (void *)&hDeviceMemB);
clSetKernelArg(hKernel, 2, sizeof(cl_mem), (void *)&hDeviceMemC);

// execute kernel
clEnqueueNDRangeKernel(hCmdQueue, hKernel, 1, 0, &cnDimension, 0, 0, 0, 0);

// copy results from device back to host
clEnqueueReadBuffer(hCmdQueue, hDeviceMemC, CL_TRUE, 0,
                    cnDimension * sizeof(cl_double), pC, 0, 0, 0);

clReleaseMemObject(hDeviceMemA);
clReleaseMemObject(hDeviceMemB);
clReleaseMemObject(hDeviceMemC);

SLIDE 8

GPGPU Programming with OCaml

SLIDE 9

Main Goals

Target Cuda/OpenCL frameworks with OCaml
Unify these two frameworks
Abstract memory transfers
Use static type checking to verify kernels
Propose abstractions for GPGPU programming
Keep the high performance

Host-side solution : an OCaml library


SLIDE 10

SPOC overview


Abstract frameworks

Unify both APIs (Cuda/OpenCL) via dynamic linking
Portable solution: multi-GPGPU, heterogeneous

Abstract transfers

Vectors move automatically between CPU and GPGPUs
On-demand (lazy) transfers
Automatic allocation/deallocation of the memory space used by vectors (on the host as well as on GPGPU devices)
A failure during allocation on a GPGPU triggers a garbage collection

SLIDE 11

External kernels


Type safety

Static type checking of kernel parameters (at compile time).
Kernel.run compiles kernels from .ptx / .cl sources.

kernel vec_add : Vector.vfloat64 -> Vector.vfloat64 -> Vector.vfloat64 -> int -> unit = "kernels" "vec_add"

[Diagram: a kernel launch, Kernel.run vec_add dev, dispatches to either Cuda or OpenCL compilation/execution depending on the device.]

.entry vec_add(…){
 …
 }

kernels.ptx

__kernel void vec_add(…){
 …
 }

kernels.cl

Kernel.run vec_add dev;
for i = 0 to Vector.length v3 - 1 do
  Printf.printf "res[%d] = %f\n" i v3.[<i>]
done

SLIDE 12

Sarek : Stream ARchitecture using Extensible Kernels


Vector addition with Sarek

let vec_add = kern a b c n ->
  let open Std in
  let open Math.Float64 in
  let idx = global_thread_id in
  if idx < n then c.[<idx>] <- add a.[<idx>] b.[<idx>]


Vector addition with OpenCL

__kernel void vec_add(__global const double *a,
                      __global const double *b,
                      __global double *c,
                      int N)
{
    int nIndex = get_global_id(0);
    if (nIndex >= N) return;
    c[nIndex] = a[nIndex] + b[nIndex];
}

SLIDE 13

Sarek


Vector addition with Sarek

let vec_add = kern a b c n ->
  let open Std in
  let open Math.Float64 in
  let idx = global_thread_id in
  if idx < n then c.[<idx>] <- add a.[<idx>] b.[<idx>]


Sarek features

ML-like syntax
Type inference
Static type checking
Static compilation to OCaml code
Dynamic compilation to Cuda/OpenCL

SLIDE 14

Sarek static compilation

[Diagram: the static compilation pipeline. Sarek code (kern a -> let idx = Std.global_thread_id () in a.[<idx>] <- 0) is parsed into an IR (Bind, ModuleAccess, VecSet, VecAcc nodes), typed, and then compiled three ways: OCaml code generation (fun a -> let idx = Std.global_thread_id () in a.[<idx>] <- 0l), Kir generation (the kernel intermediate representation with its parameters: VecVar 0, VecVar 1, …), and spoc_kernel generation (class spoc_class1 with run and compile methods, instantiated with new spoc_class1).]

SLIDE 15

Sarek dynamic compilation

[Diagram: dynamic compilation. Given let my_kernel = kern ... -> ..., the call Kirc.gen my_kernel compiles the kernel source either to a Cuda C source file (then to Cuda ptx assembly via nvcc -O3 -ptx) or to OpenCL C99. The call Kirc.run my_kernel dev (block,grid) picks the representation matching the device (ptx for Cuda, C99 for OpenCL), compiles and runs it on the device, then returns to OCaml code execution.]

SLIDE 16

Vector addition


SPOC + Sarek

open Spoc

let vec_add = kern a b c n ->
  let open Std in
  let open Math.Float64 in
  let idx = global_thread_id in
  if idx < n then c.[<idx>] <- add a.[<idx>] b.[<idx>]

let dev = Devices.init ()
let n = 1_000_000
let v1 = Vector.create Vector.float64 n
let v2 = Vector.create Vector.float64 n
let v3 = Vector.create Vector.float64 n
let block = {blockX = 1024; blockY = 1; blockZ = 1}
let grid = {gridX = (n + 1024 - 1) / 1024; gridY = 1; gridZ = 1}

let main () =
  random_fill v1;
  random_fill v2;
  Kirc.gen vec_add;
  Kirc.run vec_add (v1, v2, v3, n) (block, grid) dev.(0);
  for i = 0 to Vector.length v3 - 1 do
    Printf.printf "res[%d] = %f; " i v3.[<i>]
  done

OCaml: no explicit transfers, type inference, static type checking, portable, heterogeneous.

SLIDE 17

Sarek transformations


Using Sarek

Transformations are OCaml functions modifying the Sarek AST. Example:

map ( kern a −> b)

Scalar computations ('a -> 'b) are transformed into vector ones ('a vector -> 'b vector).

Vector addition

let v1 = Vector.create Vector.float64 10_000
and v2 = Vector.create Vector.float64 10_000 in
let v3 = map2 (kern a b -> a + b) v1 v2

val map2 : ('a -> 'b -> 'c) sarek_kernel ->
  ?dev:Spoc.Devices.device ->
  'a Spoc.Vector.vector -> 'b Spoc.Vector.vector -> 'c Spoc.Vector.vector

SLIDE 18

Skeletons and Composition


Skeleton

(* 'a : environment, 'b : input, 'c : output *)
val SKEL_MAP : 'a external_kernel -> 'b vector -> 'c vector -> ('a, 'b, 'c) skeleton
val run : ('a, 'b, 'c) skeleton -> 'a -> 'c vector

Automatic grid/block mapping on the GPU
Automatic parallelization on multiple GPUs

Composition

val SKEL_PIPE : ('a, 'b, 'c) skeleton -> ('d, 'c, 'e) skeleton -> ('f, 'b, 'e) skeleton

Automatic overlapping of transfers by computations

SLIDE 19

Real size example


PROP

Awarded by the UK Research Councils' HEC Strategy Committee
Simulates the scattering of e− in H-like ions at intermediate energies
Programmed in FORTRAN
Compatible with sequential architectures, HPC clusters, supercomputers

Version                            Time (s)
FORTRAN CPU                        4271
FORTRAN GPU                        951
OCaml GPU                          1195
OCaml GPU (with native kernels)    1018

SPOC+Sarek achieves 80% of hand-tuned Fortran performance; SPOC with external kernels is on par with Fortran (93%).
Type-safe; 30% code reduction; memory manager + GC; no more explicit transfers.

SLIDE 20

Conclusion


Implementation : SPOC

Unifies Cuda/OpenCL
Automatic transfers
Compatible with existing optimized libraries

Implementation : Sarek

OCaml-like syntax
Type inference and static type checking
Easily extensible

Implementation : Skeletons

Simplifies programming
Offers additional automatic optimizations

SLIDE 21

Conclusion


Benchmarks

Same performance as with other solutions
Heterogeneous
Efficient with GPGPUs as well as with multicore CPUs

Application : PROP

More safety (memory/types)
Keeps the level of performance
Validates our solution

SLIDE 22

Last year work (ATER - LIP6)


SPOC for the web

Access GPGPUs from web browsers
Using the js_of_ocaml compiler
Translation of the low-level part of SPOC + development of a dedicated memory manager
Source and web demos/tutorials: http://www.algo-prog.info/spoc/
SPOC can be installed via OPAM (OCaml Package Manager)

Accessibility and teaching

Simpler than classic tools: no more explicit transfers
Web = instantly accessible
Perfect playground for GPGPU/HPC courses, focused not only on kernel optimization but mostly on algorithm composition

SLIDE 23

Current (and future) work


Extend implementation

Extend Sarek: types, functions, recursion, polymorphism…
Optimize code generation
Dynamic and automatic optimizations for multiple architectures
Target new architectures (Kalray MPPA 256)

Extend skeletons

Cost model for Sarek
More skeletons based on Sarek
Skeletons dedicated to very heterogeneous architectures (supercomputers)

ktype color = Spades | Hearts | Diamonds | Clubs;;
ktype colval = {c : color; v : int32};;
ktype card =
  | Ace of color | King of color | Queen of color
  | Jack of color | Other of colval;;

let compute = kern cards trump values n ->
  let value = fun a trump ->
    match a with
    | Ace c -> 11
    | King c -> 4
    | Queen c -> 3
    | Jack c -> if c = trump then 20 else 2
    | Other cv ->
      if cv.v = 10 then 10
      else if (cv.c = trump) && (cv.v = 9) then 14
      else 0
  in
  let open Std in
  let i = thread_idx_x + block_dim_x * block_idx_x in
  if i < n then values.[<i>] <- value cards.[<i>] trump.[<0>]

SLIDE 24

Current (and future) work


Extend implementation

Extend Sarek: types, functions, recursion, polymorphism…

ktype color = Spades | Hearts | Diamonds | Clubs;;
ktype colval = {c : color; v : int32};;
ktype card =
  | Ace of color | King of color | Queen of color
  | Jack of color | Other of colval;;

let compute = kern cards trump values n ->
  let value = fun a trump ->
    match a with
    | Ace c -> 11
    | King c -> 4
    | Queen c -> 3
    | Jack c -> if c = trump then 20 else 2
    | Other cv ->
      if cv.v = 10 then 10
      else if (cv.c = trump) && (cv.v = 9) then 14
      else 0
  in
  let open Std in
  let i = thread_idx_x + block_dim_x * block_idx_x in
  if i < n then values.[<i>] <- value cards.[<i>] trump.[<0>]

SLIDE 25

Thanks

SPOC: http://www.algo-prog.info/spoc/
SPOC is compatible with x86_64 Unix (Linux, Mac OS X) and Windows.
For more information: mathias.bourgoin@imag.fr

SLIDE 26

A small example

[Diagram: CPU RAM, GPU0 RAM, GPU1 RAM; over the following animation steps, v1, v2 and v3 migrate between host and device memories as the kernel runs and results are read back.]


Example

let dev = Devices.init ()
let n = 1_000_000
let v1 = Vector.create Vector.float64 n
let v2 = Vector.create Vector.float64 n
let v3 = Vector.create Vector.float64 n
let k = vec_add (v1, v2, v3, n)
let block = {blockX = 1024; blockY = 1; blockZ = 1}
let grid = {gridX = (n + 1024 - 1) / 1024; gridY = 1; gridZ = 1}

let main () =
  random_fill v1;
  random_fill v2;
  Kernel.run k (block, grid) dev.(0);
  for i = 0 to Vector.length v3 - 1 do
    Printf.printf "res[%d] = %f; " i v3.[<i>]
  done


SLIDE 32

Sarek transformations

sort (kern a b -> a - b) vec1
val sort : ('a -> 'a -> int) sarek_kernel -> 'a vector -> unit

Injection into sort kernel

let bitonic_sort = kern v j k ->
  let open Std in
  let i = thread_idx_x + block_dim_x * block_idx_x in
  let ixj = Math.xor i j in
  let mutable temp = 0. in
  if ixj >= i then (
    if (Math.logical_and i k) = 0 then (
      if v.[<i>] - v.[<ixj>] > 0 then
        (temp := v.[<ixj>]; v.[<ixj>] <- v.[<i>]; v.[<i>] <- temp))
    else if v.[<i>] - v.[<ixj>] <= 0 then
      (temp := v.[<ixj>]; v.[<ixj>] <- v.[<i>]; v.[<i>] <- temp))

Host composition

while !k <= size do
  j := !k lsr 1;
  while !j > 0 do
    run bitonic_sort (vec1, !j, !k) device;
    j := !j lsr 1
  done;
  k := !k lsl 1
done
