GPU Computing: Development and Analysis Part 1
Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten
Based on the NLeSC GPU Course by Rob van Nieuwpoort & Ben van Werkhoven
Who are we? Anton Wijs, Assistant professor, Software
We will cover approximately the first five chapters.
The computer chip on a graphics card. Nowadays, GPUs are used for much more than graphics, e.g. machine learning.
[Chart: GPU vs. CPU peak performance (according to Nvidia)]
– GPU assumes the work load is highly parallel
– CPU must be good at everything, parallel or not
CPU design:
– Big on-chip caches
– Sophisticated control logic
GPU design:
– Multithreading can hide latency, so no big caches
– Simple control logic
[Diagram: a CPU with a few cores next to a GPU with many cores]
– High memory bandwidth: 320 GB/s for the GTX 1080
– Large-scale parallelism
– Transistors are used more efficiently (the #1 system on the Green500 uses NVIDIA Tesla P100 GPUs)
– Huge market, mass production, economies of scale (gamers pay for our HPC needs!)
– Thousands or even millions of elements that can be processed in parallel
– have high arithmetic intensity (lots of computations per element)
– have regular data access patterns
– do not have a lot of data dependencies between elements
– do the same set of instructions for all elements
Before we start: GPU threads should be seen as a parallel programming concept, not as a description of how the work is executed by the hardware. Do not compare them to CPU threads.
A CUDA program consists of a host (CPU) part and a device (GPU) part:
– The host part runs on the CPU, and starts GPU functions, called kernels
– The host part also handles the allocation and freeing of GPU memory
– Host and device have different physical memories, so data must be copied between them explicitly
[Diagram: host memory and device memory, with the Host and the Device (GPU) connected by a PCI Express link]
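A minimal sketch of this host-side pattern (the names h_data, d_data and the size N are illustrative; error checking omitted):

#include <cuda_runtime.h>
#include <stdlib.h>

int main() {
    const int N = 1024;
    size_t bytes = N * sizeof(float);

    // Host memory: an ordinary CPU-side allocation
    float *h_data = (float*)malloc(bytes);
    for (int i = 0; i < N; i++) h_data[i] = 1.0f;

    // Device memory: allocated through the CUDA runtime
    float *d_data;
    cudaMalloc((void**)&d_data, bytes);

    // Explicit transfer, host -> device, over the PCI Express link
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // ... kernel launches operating on d_data would go here ...

    // Transfer the results back, device -> host
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    free(h_data);
    return 0;
}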
A kernel is executed by many threads; we need to organize them in some manner.
[Diagram: a grid of thread blocks, indexed (0,0) to (2,1), each block containing threads indexed (0,0,0) to (2,1,0)]
Typical block sizes: 256, 512, 1024 threads
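At launch time, the grid and block dimensions are configured with dim3 values; a sketch (myKernel, d_data, and the 2D sizes are hypothetical):

int width = 1024, height = 768;   // illustrative problem size
dim3 block(16, 16);               // 16 x 16 = 256 threads per block
dim3 grid((width + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);
myKernel<<<grid, block>>>(d_data, width, height);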
Each thread uses built-in variables to determine which part of the computations it must perform:
– threadIdx.x, .y, .z: the thread index within the thread block
– blockIdx: the index of each block within the grid
– blockDim and gridDim: the block and grid dimensions
All threads execute the same code: the Single Instruction Multiple Data (SIMD) principle.
The hardware imposes a maximum thread block size.
The total number of threads launched is often not exactly equal to the problem size.
Function qualifiers determine where code executes and from where it may be called:
__global__ void KernelFunc() – a kernel: executed on the device, callable from the host
__device__ float DeviceFunc() – executed on the device, callable only from the device
__host__ float HostFunc() – executed on the host, callable only from the host
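As an illustration of how these qualifiers combine (the function names are hypothetical):

// Executed on the device, callable only from device code
__device__ float square(float x) { return x * x; }

// A kernel: executed on the device, launched from the host
__global__ void squareAll(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);
}

// Executed on the host, callable only from the host (the default)
__host__ float addOne(float x) { return x + 1.0f; }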
Hands-on setup: log in with username gpuser and password cuda2018. Manage the remote connection, and set the Remote executable folder.
[Diagram: threads 0–3 within thread block 0, thread block 1, and thread block 2]
Global thread index: blockDim.x × blockIdx.x + threadIdx.x
Create N threads using multiple thread blocks; all threads execute the same code, following the Single Instruction Multiple Data (SIMD) principle.
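A minimal sketch (the vec_add kernel and the device pointers d_a, d_b, d_c are assumptions); each thread computes its global index and guards against N not being a multiple of the block size:

__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    // Global thread index, as in the diagram above
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)      // guard: the number of threads may exceed the problem size
        c[i] = a[i] + b[i];
}

// Host side: round the number of blocks up, so at least N threads are created
int threadsPerBlock = 256;
int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;
vec_add<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, N);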
[Diagram: the CUDA memory hierarchy: each thread has registers, each thread block has shared memory, and all blocks in the grid share global and constant memory]
Registers:
– Thread-local scalars and small constant-size arrays are stored in registers
– Implicit in the programming model
– Behavior is very similar to that of normal local variables
– Not persistent: after the kernel has finished, the values in registers are lost
Global memory:
– Allocated by the host program using cudaMalloc()
– Initialized by the host program using cudaMemcpy(), or by previous kernels
– Persistent: the values in global memory remain across kernel invocations
– Not coherent: writes by other threads will not be visible until the kernel has finished
Constant memory:
– Statically defined by the host program using the __constant__ qualifier
– Defined as a global variable
– Initialized by the host program using cudaMemcpyToSymbol()
– Read-only to the GPU, cannot be accessed directly by the host
– Values are cached in a special cache optimized for broadcast access by multiple threads simultaneously; access should not depend on threadIdx
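A sketch of this pattern (the array coeffs and the scale kernel are illustrative):

// Statically defined as a global variable, with the __constant__ qualifier
__constant__ float coeffs[16];

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Every thread reads the same element: broadcast-friendly access
    if (i < n) data[i] *= coeffs[0];
}

// Host side: initialize constant memory with cudaMemcpyToSymbol
float h_coeffs[16] = { 2.0f };    // remaining entries are zero-initialized
cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));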
– Implement the kernel to perform a single iteration of parallel reduction
– It is assumed that enough threads are launched such that each thread only needs to compute the sum of two elements of the input array
– In each iteration, an array of size n is reduced into an array of size n/2
– Each thread stores its result at a designated position in the output array
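A sketch of such a single iteration (the kernel name and signature are assumptions): launched with at least n/2 threads, thread i sums elements 2i and 2i+1 of the input and writes position i of the output.

__global__ void reduce_step(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n / 2)
        out[i] = in[2 * i] + in[2 * i + 1];
}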
– Each iteration is executed as a new grid, through repeated kernel launches
[Diagram: Kernel launch 0 through Kernel launch 4, each operating on a successively halved array]
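The corresponding host loop might look as follows (a sketch reusing reduce_step and threadsPerBlock from the sketches above; d_in and d_out are device arrays, and n is assumed to be a power of two):

int n = N;
while (n > 1) {
    int blocks = (n / 2 + threadsPerBlock - 1) / threadsPerBlock;
    reduce_step<<<blocks, threadsPerBlock>>>(d_in, d_out, n);
    float *tmp = d_in; d_in = d_out; d_out = tmp;   // output becomes next input
    n /= 2;
}
// Every kernel launch boundary acts as a global synchronisation point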
– Within a kernel, __syncthreads() barriers provide a mechanism to synchronise all threads in the same block (more on this tomorrow afternoon)
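A sketch of block-level synchronisation (the block_sum kernel is hypothetical, and 256 threads per block are assumed): threads first fill shared memory, then wait at a barrier before reading each other's values.

__global__ void block_sum(const float *in, float *out) {
    __shared__ float buf[256];   // shared by all threads in this block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();             // wait until every thread has written

    // Tree-shaped reduction within the block
    for (int s = blockDim.x / 2; s > 0; s /= 2) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();         // barrier between every step
    }
    if (tid == 0) out[blockIdx.x] = buf[0];
}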
[Diagram: compilation pipeline: CUDA program → (Nvidia compiler nvcc) → PTX assembly → (runtime compiler driver) → CUBIN bytecode / machine-level binary]
Terminology across programming models:
CUDA          | OpenCL                                       | OpenACC        | OpenMP 4
Grid          | NDRange                                      | compute region | parallel region
Thread block  | Work group                                   | Gang           | Team
Warp          | CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE | Worker         | SIMD Chunk
Thread        | Work item                                    | Vector         | Thread or SIMD
– The exact mapping of these concepts onto hardware is implementation-specific and may differ across computing platforms (more on other programming models tomorrow afternoon)
– Threads in a warp run in lock-step: each instruction is executed by all currently active threads in the warp
– The hardware switches between warps, hiding the long latency of instructions such as memory loads
[Diagram: a streaming multiprocessor (SM): blocks of 32 cores, a register file, shared memory, and L1 caches]
– Multiple thread blocks can run concurrently on an SM, resulting in a certain occupancy
– Occupancy is bounded by the SM's resources: the register file, shared memory, and thread slots
– Global memory accesses are cached in L2, and on some GPUs also in L1
– When accessing global memory, think about the following (see the sketch after this list):
– the total number of values that are accessed by the warp that the thread belongs to
– the cache-line length, and the number of cache lines that those values will belong to
– the alignment of the data accesses with the cache lines
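A sketch contrasting the two extremes (illustrative copy kernels):

// Coalesced: consecutive threads in a warp access consecutive addresses,
// so the warp's 32 loads fall into as few cache lines as possible
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads access addresses 32 floats apart, so each
// load touches a different cache line and many more lines must be fetched
__global__ void copy_strided(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * 32 < n) out[i] = in[i * 32];
}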
[Diagram: on the GPU, each SM has its own L1 cache and all SMs share an L2 cache in front of device memory; compare the CPU, where a cache sits between the CPU and main memory]
– Memory is optimized for reading in (row-wise) bursts
– All memory accesses happen through the cache
– The cache fetches memory at the granularity of cache lines
Summary:
– CUDA programming model (API): think in terms of threads; reason about program correctness. Proving correctness: tomorrow morning / afternoon!
– GPU hardware: think in terms of warps; reason about program performance. Tomorrow in Part 2 of GPU Development!