GPU Programming
René Kloth Florian Wende
Pro Seminar: Parallel Programming Freie Universität Berlin, WS 2012/13, Prof. Dr. M. Esponda
Presentation Outline
■ Current Hardware Accelerators ■ Nvidia Fermi GPU Architecture ■ The CUDA/OpenCL Programming Model ■ Matrix-Matrix-Multiplication on GPU and CPU
■ On Using the GPU’s Shared Memory ■ Tiling Techniques ■ Efficient Memory Access Patterns ■ OpenMP + Vectorization on CPU
■ Molecular Dynamics ■ Ray Tracing
Pro Seminar: Parallel Programming, GPU Programming : René Kloth, Florian Wende 3 / 51
Hardware Accelerators
Hardware accelerator: Computer hardware that allows to perform certain functions faster than a standard CPU can do in software. Specialized hardware acting as co-processor for the CPU
■ 1980 — Intel 8087 x87 floating-point co-processor for 8086 CPUs. ■ 1985 — Amiga 1000: Co-processors for Video/Audio/DMA/. . . ■ 2000 and later — Single thread performance stalls!
→ Instruction-level parallelism is limited by single-thread performance.
→ Application performance increases through task/thread-level parallelism.
Examples of hardware accelerators:
■ FPGA ■ ClearSpeed ■ Cell processor, IBM PowerXCell 8i ■ GPU ■ Intel Xeon Phi
Modern computer systems consist of a combination of CPU(s) and hardware accelerator(s).
[Diagram: CPU with its memory and peripherals, connected via a channel such as PCI(e) or HyperTransport to an accelerator (GPU, FPGA, Cell, etc.) with its own memory.]
Trend: Multiple CPUs + multiple accelerators per compute node.
Top 500 Supercomputers, November 2012:
■ More than 12% of the systems use hardware accelerators:
80.6% Nvidia GPU, 11.3% Intel Xeon Phi, 4.8% AMD GPU.
■ 84.6% of the systems use processors with 6 or more cores, and 46.2% with 8 or more cores.
Green 500 Supercomputers, November 2012:
■ The most power-efficient system uses Intel Xeon Phi.
■ 33% of the top 100 systems use GPU or Xeon Phi hardware accelerators.
→ Heterogeneous computer systems are in the ascendant.
Challenging aspect: making multiple/different compute devices work together.
GPU Computing
Since 2007, GPUs have increasingly been used for non-graphics computations (GPGPU) due to
■ high compute performance, and ■ low acquisition and maintenance costs.
[Chart: theoretical peak GFLOP/s of Nvidia GPUs vs. Intel CPUs, 2001 to 2012; annotated peaks include Tesla C2050 at 1.31 TFLOP/s and Tesla K20X at 3.95 TFLOP/s (single precision), far ahead of contemporary Intel CPUs (Pentium 4 through Sandy Bridge).]
On the GPU we observe a strongly increasing number of compute cores, disproportionate to the increase in the memory bandwidth.
[Chart: memory bandwidth in GB/s of Nvidia GPUs vs. Intel CPUs, 2003 to 2012; the Tesla K20X reaches 250 GB/s, well above contemporary Intel CPUs.]
Before the introduction of the unified compute shader in 2007, programming GPUs for non-graphics applications had to be done by means of GLSL, Cg, OpenGL, or DirectX: complicated! Current GPUs (based on the unified-shader architecture) can be ‘easily’ programmed using
■ CUDA: Nvidia-proprietary parallel computing platform supporting Nvidia GPUs from the G80 series (2007) onwards.
■ OpenCL: Parallel programming platform for heterogeneous computer systems.
  ■ Apple, AMD, Intel, IBM, Nvidia: OpenCL 1.0 by the end of 2008.
  ■ More general than CUDA + available for any computer architecture supporting OpenCL.
  ■ Programming API similar to CUDA.
→ We focus on CUDA.
Nvidia Fermi GPU Architecture
Writing efficient programs for the GPU requires a ‘fundamental’ understanding of the underlying hardware.
Nvidia’s unified-shader GPU architectures to date: Tesla, Fermi, Kepler.
■ The GPU acts as co-processor to the host/CPU.
■ The GPU consists of streaming multiprocessors (SMs).
■ SMs are organized into compute clusters.
■ SMs consist of scalar processors (SPs), special function units (SFUs), scheduling and dispatch unit(s), local memory, and load/store units.
■ Scalar processors execute independent thread programs.
[Diagram: host/CPU attached to a GPU; the GPU consists of compute clusters of SMs, each SM containing SPs, SFUs, shared memory, scheduling/dispatch units, and texture/load/store units; an interconnection network links the SMs to several DRAM partitions.]
The GPU consists of up to 16 SMs, each containing 32 SPs → up to 512 SPs.
■ SMs can execute independent GPU programs (MIMD):
■ different execution paths within the same GPU program. ■ concurrent execution of different GPU programs.
■ SPs execute the thread programs the GPU program consists of:
■ SIMD — all SPs per SM execute the same instruction. ■ SIMT — a subset of the SM’s SPs execute the same ‘SIMD instruction’.
→ The GPU architecture is SIMT.
SIMT — Single Instruction, Multiple Threads: for a given instruction, the threads within a work unit that are ready to execute it are marked active; all other threads within that work unit are marked passive. Active threads are mapped onto SPs for execution (SIMD).
For every GPU program the number of threads that execute the program is defined using a grid-block hierarchy of threads:
■ Grid = 1D/2D/3D arrangement of blocks (max: 65535×65535×65535).
■ Block = 1D/2D/3D arrangement of threads (max: 1024×1024×64).
■ Blocks are distributed to SMs: at most 8 blocks per SM at a time.
■ Threads within blocks are distributed to SPs in SIMT manner by the SM’s thread schedulers: 32 threads form a work unit (warp).
Fermi’s SMs have 32 physical cores and 2 thread schedulers, handling up to 1536 concurrent threads (48 warps) per SM with zero switching overhead.
Interleaved Multithreading (IMT):
→ Memory access latency hiding (more cores than caches).
→ Minimization of idle cycles of the GPU’s cores.
[Diagram: Fermi SM block diagram: instruction cache; 2 warp schedulers with dispatch units; 32768 × 32-bit register file; 32 SPs, each with dispatch port, operand collector, FP unit, INT unit, and result queue; 16 LD/ST units; 4 SFUs; interconnect network; 64 kB shared memory/L1 cache; unified L2 cache.]
Streaming Multiprocessor (SM):
■ 32 physical cores, each with a floating-point and an integer unit (IEEE 754-2008).
■ Up to 16 double-precision operations per SM per clock (only Tesla/Quadro).
■ 4 special function units (SFUs): cos(x), sin(x), rsqrt(x), . . .
■ 16 load/store units (LD/ST).
■ 2 warp schedulers and instruction dispatch units: each of them issues one instruction from a warp to 16 SPs, 16 LD/ST units, or 4 SFUs.
■ 64 kB configurable on-chip memory: shared memory & L1 cache.
■ First GPU architecture featuring a true cache hierarchy for load/store operations.
For the selection of ready-for-execution warps, each thread scheduler maintains a scoreboard: ready warps are prioritized. Prioritization considers warp type, instruction type, and ‘fairness’ to all warps executing on the same SM.
[Diagram: over time, the two warp schedulers each issue one instruction from a different ready warp, e.g. warp 8/instruction 11 and warp 5/instruction 13 in the same cycle.]
■ Strongly increased double-precision compute performance over the previous Tesla GPU architecture.
■ Cache hierarchy (not necessary for graphics processing).
■ ECC memory support throughout all memory layers (important for HPC).
■ Concurrent kernel execution: up to 16 different GPU programs within the same CUDA context can execute on the same GPU concurrently (MIMD).
[Diagram: timeline contrasting serial kernel execution (programs 1 to 5 one after another) with concurrent kernel execution (the programs overlapping in time).]
                       Tesla C1060     Tesla M2090     Tesla K20X
Architecture           Tesla, GT200b   Fermi, GF110    Kepler, GK110
Transistor count       1.4×10⁹         3.0×10⁹         7.1×10⁹
SM count               30              16              14
SPs per SM             8               32              192
SP count               240             512             2688
Clock rate             1.3 GHz         1.3 GHz
Peak perf. (GFLOP/s)   933 (sp)        1330 (sp)       3950 (sp)
                       78 (dp)         665 (dp)        1310 (dp)
Main memory            4 GB            6 GB            6 GB
Memory bus width       512-bit         384-bit         384-bit
Memory bandwidth       102 GB/s        177 GB/s        250 GB/s
TDP                    200 W           225 W           235 W
■ Why CUDA? ■ CUDA kernels ■ Threads, blocks, and grids ■ Memory ■ CUDA specific declarations and commands
■ Example: Vector addition
■ Synchronization
The CUDA Programming Model
The most widely used APIs for making use of the GPU’s compute capabilities:
CUDA:
■ Nvidia proprietary. ■ Latest features.
OpenCL:
■ All compute devices with OpenCL support can be used for computations.
■ More general approach for handling heterogeneous computer systems compared to CUDA.
■ Not yet widely accepted.
Kernels contain the portion of the code that is offloaded to the GPU for parallel execution.
■ Define the actions of each CUDA thread. ■ Execute asynchronously to the host program. ■ Kernel invocation → creation of ‘many’ lightweight threads.
Semantics:
Declaration:
__global__ void kernel();
Invocation:
kernel<<<grid,block>>>();
Function type qualifiers:
__host__
■ Optional (tells the compiler to also compile the code for the host).
■ Standard C function to be executed on the CPU.
__global__ (→ kernel)
■ Function that is executed on the GPU.
■ Callable from within the host program (CPU).
__device__
■ GPU function callable from within a kernel only (GPU).
Kernels need to be configured when they are called:
__global__ void kernel(){
    // kernel definition
}

int main(){
    ...
    dim3 block(blockX,blockY,blockZ), grid(gridX,gridY,gridZ);
    // kernel invocation
    kernel<<<grid,block>>>();
    ...
}
dim3 is a struct containing the x-, y-, and z-extent of the block/grid.
The programmer specifies the number of threads that should execute the kernel using a grid-block hierarchy of threads:
■ Threads are organized into blocks.
■ Blocks are organized into a grid.
■ Cooperation and synchronization between threads are possible within thread blocks only.
■ Thread blocks are distributed to SMs at runtime by the hardware.
■ Thread blocks need to be independent.
[Diagram: a grid of 3 × 2 blocks; block (1,1) expanded into a 4 × 3 arrangement of threads.]
[Diagram: the host launches kernel 1 on grid 1 (3 × 2 blocks of 5 × 3 threads) and afterwards kernel 2 on grid 2 on the device.]
Grid 1:

dim3 grid(3,2);
dim3 block(5,3);
kernel1<<<grid,block>>>();

■ Grid 2D: 3 × 2 = 6 blocks.
■ Block 2D: 5 × 3 = 15 threads.
■ Kernel 1: 6 × 15 = 90 threads.
Thread and block IDs, grid and block extents are accessible within the kernel through built-in variables → each thread has a unique (global) thread ID:
■ Blocks within the grid: blockIdx.x, blockIdx.y, blockIdx.z.
■ Threads within a block: threadIdx.x, threadIdx.y, threadIdx.z.
■ Block extent: blockDim.x, blockDim.y, blockDim.z.
■ Grid extent: gridDim.x, gridDim.y, gridDim.z.
Threads within a CUDA kernel can load (store) data from (to) different memory layers. Variable type qualifiers:
__device__
■ Global memory (RAM). ■ Accessible from all threads.
__shared__
■ On-chip memory (fast).
■ Accessible from all threads within the same block.
__constant__
■ Read-only memory.
■ Accessible from all threads.
[Diagram: per-thread local memory; per-block shared memory; global memory spanning the whole grid.]
[Diagram: CUDA memory model on the device: each thread has registers; each block has shared memory; all blocks of the grid access global and constant memory, which the host can also read and write.]
■ cudaMalloc(void **,size_t);
■ cudaFree(void *);
■ cudaMemcpy(void *,void *,size_t,enum cudaMemcpyKind);
  where cudaMemcpyKind is one of
  ■ cudaMemcpyHostToDevice ■ cudaMemcpyDeviceToHost ■ cudaMemcpyDeviceToDevice ■ cudaMemcpyHostToHost
■ cudaMemset(void *,int,size_t);
■ cudaDeviceSynchronize();
Vector Addition:
// kernel definition
__global__ void vecAdd(float *A,float *B,float *C){
    int i=threadIdx.x;
    C[i]=A[i]+B[i];
}

int main(){
    ...
    // kernel invocation with N threads (one block -> N must not exceed 1024)
    vecAdd<<<1,N>>>(A,B,C);
    ...
}
Barrier synchronization between threads is possible only within blocks, using
__syncthreads().
There is no global barrier! A deadlock arises when not all threads of a block reach the same __syncthreads() call, e.g. with barriers in both branches of a divergent condition:
if(...){
    ...
    __syncthreads();
}else{
    ...
    __syncthreads();
}
Matrix-Matrix-Multiplication on GPU and CPU
Naive matrix-matrix multiplication: C = A · B, A ∈ R^(m×n), B ∈ R^(n×p):

    c_ij = Σ_{k=1}^{n} a_ik · b_kj ,   1 ≤ i ≤ m , 1 ≤ j ≤ p .
■ Straightforward implementation:
  ■ Loop permutation trick (CPU).
■ Optimized implementation:
  ■ GPU — Shared memory, Tiling.
  ■ CPU — OpenMP + vectorization.
Fermi GPU:
■ Max. threads per SM: 1536 → full occupancy with blocks of size 16 × 16.
■ 32768 registers per SM → at most 21 registers per thread at full occupancy: we use fewer than 21.
■ 1536/(16 × 16) = 6 blocks: fewer than the limit of 8 blocks per SM.
■ Adjust the size of a matrix X ∈ R^(m×n) to meet the block geometry:
  m′ × n′ = ceilN(m,16) × ceilN(n,16) , with ceilN(x,N) = ⌈x/N⌉ · N ,
  and X′ ∈ R^(m′×n′) with x′_ij = x_ij if 1 ≤ i ≤ m, 1 ≤ j ≤ n, and x′_ij = 0 otherwise.
■ C is of type R^(m×p) → grid geometry: ⌈m/16⌉ × ⌈p/16⌉ blocks.
■ Initialize C with 0.
Host
#define BLOCK 16

int ceilN(int x,int n){ return (x/n+((x%n)==0?0:1))*n; }

int main(int argc,char **argv){
    int hA=atoi(argv[1]), wA=atoi(argv[2]), hB=atoi(argv[3]), wB=atoi(argv[4]),
        kMax=ceilN(wA,BLOCK);
    dim3 block(BLOCK,BLOCK),
         grid(ceilN(wB,BLOCK)/BLOCK, ceilN(hA,BLOCK)/BLOCK);
    float *a=(float *)malloc(kMax*grid.y*BLOCK*sizeof(float)),
          *b=(float *)malloc(grid.x*BLOCK*kMax*sizeof(float)),
          *c=(float *)malloc(grid.x*BLOCK*grid.y*BLOCK*sizeof(float)),
          *da,*db,*dc;
    cudaMalloc((void **)&da, kMax*grid.y*BLOCK*sizeof(float));
    cudaMalloc((void **)&db, grid.x*BLOCK*kMax*sizeof(float));
    cudaMalloc((void **)&dc, grid.x*BLOCK*grid.y*BLOCK*sizeof(float));
    ...
Host

    ... // set up matrix A at random
    memset(a, 0, kMax*grid.y*BLOCK*sizeof(float));
    for (int m = 0; m < hA; m++)
        for (int n = 0; n < wA; n++)
            a[m*kMax+n] = (float)(rand())/RAND_MAX;
    cudaMemcpy(da, a, kMax*grid.y*BLOCK*sizeof(float), cudaMemcpyHostToDevice);
    // set up matrix B at random
    ...
    // initialize matrix C (GPU only)
    cudaMemset(dc, 0, grid.x*BLOCK*grid.y*BLOCK*sizeof(float));
    // kernel invocation
    matMulGPU<<<grid, block>>>(da, db, dc, kMax);
    // copy matrix C from the GPU to the host
    cudaMemcpy(c, dc, grid.x*BLOCK*grid.y*BLOCK*sizeof(float), cudaMemcpyDeviceToHost);
}
GPU straightforward

__global__ void matMulGPU(float *a, float *b, float *c, int kMax) {
    int row = blockIdx.y*BLOCK + threadIdx.y,
        col = blockIdx.x*BLOCK + threadIdx.x;
    float cRes = 0.0F;
    for (int k = 0; k < kMax; k++)
        cRes += a[row*kMax+k] * b[k*gridDim.x*BLOCK+col];
    c[row*gridDim.x*BLOCK+col] = cRes;
}
GPU shared memory

__global__ void matMulGPU(float *a, float *b, float *c, int kMax) {
    int row = blockIdx.y*BLOCK + threadIdx.y,
        col = blockIdx.x*BLOCK + threadIdx.x;
    float cRes = 0.0F;
    __shared__ float aSh[BLOCK][BLOCK];
    __shared__ float bSh[BLOCK][BLOCK];
    for (int k = 0; k < kMax; k += BLOCK) {
        aSh[threadIdx.y][threadIdx.x] = a[row*kMax + k + threadIdx.x];
        bSh[threadIdx.y][threadIdx.x] = b[(k + threadIdx.y)*gridDim.x*BLOCK + col];
        __syncthreads();
        for (int kk = 0; kk < BLOCK; kk++)
            cRes += aSh[threadIdx.y][kk] * bSh[kk][threadIdx.x];
        __syncthreads();
    }
    c[row*gridDim.x*BLOCK + col] = cRes;
}
Tiling: a programming strategy for distributing data and computations, and for enhancing locality in parallel/sequential programs. Here: sub-matrices (tiles) are loaded into shared memory.

■ 2 × n/BLOCK load operations per thread instead of 2n.
■ Memory loads are coalesced → high memory throughput.
■ The number of arithmetic operations per word read is up to BLOCK× higher than in the straightforward implementation (on later-than-Fermi GPUs memory accesses are cached).

Attention (GPU):
■ SM occupancy: shared memory is a limited resource.
■ Shared memory bank conflicts: these do not occur in the matrix-matrix multiplication.
We draw on the host code given for the GPU implementation above: remove all CUDA runtime API calls and replace matMulGPU() with matMulCPU().

CPU straightforward

void matMulCPU(float *a, float *b, float *c, int hA, int wA, int wB) {
    for (int m = 0; m < hA; m++)
        for (int n = 0; n < wB; n++)
            for (int k = 0; k < wA; k++)
                c[m*wB+n] += a[m*wA+k] * b[k*wB+n];
}

CPU loop permutation

Better cache utilization, and more suitable for vectorization: Intel's C compiler automatically permutes loops and vectorizes them (if possible).

void matMulCPU(float *a, float *b, float *c, int hA, int wA, int wB) {
    for (int m = 0; m < hA; m++)
        for (int k = 0; k < wA; k++)
            for (int n = 0; n < wB; n++)
                c[m*wB+n] += a[m*wA+k] * b[k*wB+n];
}
Vectorization using AVX intrinsics (requires including immintrin.h).

■ 32-byte alignment for b, c is necessary: _mm_malloc(.., 32).
■ The block size is now given by the SIMD width: #define BLOCK 8.

void matMulCPU(float *a, float *b, float *c, int hA, int wA, int wB) {
    int pitchA = ceilN(wA, BLOCK)/BLOCK,
        pitchB = ceilN(wB, BLOCK)/BLOCK;
    __m256 aScalar, *ptrB = (__m256 *)b, *ptrC = (__m256 *)c;
    for (int m = 0; m < hA; m++)
        for (int k = 0; k < (pitchA*BLOCK); k++) {
            aScalar = _mm256_set1_ps(a[m*(pitchA*BLOCK)+k]);
            for (int n = 0; n < pitchB; n++)
                ptrC[m*pitchB+n] = _mm256_add_ps(ptrC[m*pitchB+n],
                                       _mm256_mul_ps(aScalar, ptrB[k*pitchB+n]));
        }
}
Multithreading using OpenMP (requires including omp.h).

#define NUM_THREADS 16

void matMulCPU(float *a, float *b, float *c, int hA, int wA, int hB, int wB) {
    int pitchA = ceilN(wA, BLOCK)/BLOCK,
        pitchB = ceilN(wB, BLOCK)/BLOCK;
    __m256 aScalar, *ptrB = (__m256 *)b, *ptrC = (__m256 *)c;
    #pragma omp parallel private(aScalar) num_threads(NUM_THREADS)
    {
        for (int m = omp_get_thread_num(); m < hA; m += NUM_THREADS)
            for (int k = 0; k < (pitchA*BLOCK); k++) {
                aScalar = _mm256_set1_ps(a[m*(pitchA*BLOCK)+k]);
                for (int n = 0; n < pitchB; n++)
                    ptrC[m*pitchB+n] = _mm256_add_ps(ptrC[m*pitchB+n],
                                           _mm256_mul_ps(aScalar, ptrB[k*pitchB+n]));
            }
    }
}
CPU: straightforward vs. loop permutation (g++-4.7)

[Plot: execution time in ms (log scale) vs. matrix size n = 128 … 8192 — Xeon E5-2670, straightforward vs. loop permutation]
CPU: loop permutation vs. AVX vs. AVX + OpenMP (g++-4.7, icpc-13)

[Plot: execution time in ms (log scale) vs. matrix size n = 128 … 8192 — Xeon E5-2670: loop permutation; AVX; AVX, 16 threads; AVX, 16 threads, Intel compiler. Speedup over sequential: ≈35]
GPU vs. CPU (CUDA 4.1, icpc-13)

[Plot: execution time in ms (log scale) vs. matrix size n = 128 … 8192 — Tesla M2090; Tesla M2090, shared memory; Xeon E5-2670, AVX, 16 threads, Intel compiler. Speedup over CPU: ≈4]
■ System of n atoms/molecules interacting with each other: n-body problem.
■ Properties (entropy, pressure, etc.) of such systems cannot be determined analytically, even for small n.
■ MD simulations use numerical methods to solve the equations of motion for the system's constituents.

Lennard-Jones potential + Coulomb potential:

V_int(x_ij) = 4 ε_ij [ (σ_ij/x_ij)^12 − (σ_ij/x_ij)^6 ] + (1/(4πε_0)) · q_i q_j / x_ij .

Hamiltonian:

H = ∑_i [ (m_i/2) v_i^2 + V_box(x_i) + ∑_{j≠i} V_int(|x_i − x_j|) ] .

Force on particle i: F_i = −∂H/∂x_i .

■ Simple MD update scheme (Euler): for all particles i compute
1) F_i(t),
2) velocities v_i(t + ∆t) using F_i(t) and v_i(t),
3) positions x_i(t + ∆t) using v_i(t) and x_i(t).
Better: the Verlet algorithm (used for our MD simulation).
[Plots: billion particle-particle interactions per second, and speedup over sequential / 8-thread CPU code, vs. #particles n = 1k … 1M — Xeon E5620 (sequential, 4 threads, 8 threads) vs. Tesla M2090 (CUDA, OpenCL)]
High-quality image production by tracing rays through a scene made up of geometric primitives.

■ One ray per pixel: ray ↔ thread.
■ Geometric primitives: triangles.
■ The number of effective ray-triangle intersection tests can be reduced by partitioning the scene using an appropriate data structure: an octree.
■ No† recursion on GPUs:
■ Reflection/refraction rays.
■ Octree traversal.
[Figure: image/screen with camera basis vectors u, v, w and one ray per pixel into the scene; octree traversal order over child boxes 1–6, with 'skip' links (no ray-box intersection) and 'child' links (ray-box intersection), terminated by FIN markers]
King's treasure scene: ≈280000 triangles; almost all surfaces are reflective.
King's treasure scene: 4× super-sampling (anti-aliasing)
[Plot: execution time in seconds (log scale) per n × n image, n = 64 … 8192 — Intel Core i7-920 (32 threads), Nvidia Tesla C1060, Nvidia Tesla M2090]
King's treasure scene: 4× super-sampling (anti-aliasing)
[Plot: speedup over the Core i7-920 (32 threads) for an n × n image, n = 64 … 8192 — Nvidia Tesla C1060, Nvidia Tesla M2090]
■ Heterogeneous systems: CPU + GPU + Xeon Phi + FPGA + . . .
■ Appropriate programming model: OpenCL? OpenACC? For Nvidia GPU setups, CUDA seems to be the unofficial standard.
■ Understanding how the compute devices in heterogeneous systems work may help increase program performance/efficiency.
■ Performance comparisons (CPU vs. GPU) should draw on optimized (and parallelized) codes:
■ Many people compare GPU programs against single-threaded, less optimized CPU codes.
■ 'About-one-order-of-magnitude' speedups are OK (usually).
■ Don't forget data transfers: CPU → GPU → CPU ≠ GPU only (all our benchmarks incorporate data transfers).
Nvidia Corp., Nvidia CUDA C Programming Guide, v. 4.0, 2011.
Nvidia Corp., Fermi Whitepaper, 2009.
Nvidia Corp., Kepler Whitepaper, 2012.
Lindholm E., Nickolls J., Oberman S., Montrym J., Nvidia Tesla: A Unified Graphics and Computing Architecture, 2008.
Rauber Th., Rünger G., Parallele Programmierung, 2012.
Intel Corp., Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2 (2A & 2B): Instruction Set Reference, A-Z, 2011.
Kindratenko V., Overview of Hardware Accelerators, wsdmhp09.hpcl.gwu.edu/kindratenko.pdf, 2009.
Griebel M., Knapek S., Zumbusch G., Numerical Simulation in Molecular Dynamics: Numerics, Algorithms, Parallelization, Applications, 2007.
Kay T. L., Kajiya J. T., Ray Tracing Complex Scenes, SIGGRAPH Comput. Graph., 20(4):269-278, 1986.
Shirley P., Morley K. R., Realistic Ray Tracing, 2003.
Top 500 Supercomputers, www.top500.org/lists/2012/11
Green 500 Supercomputers, www.green500.org/lists/green201211
www.wikipedia.org