SLIDE 1

GPU Programming

René Kloth Florian Wende

Pro Seminar: Parallel Programming Freie Universität Berlin, WS 2012/13, Prof. Dr. M. Esponda

SLIDE 2

Presentation Outline

■ Current Hardware Accelerators
■ Nvidia Fermi GPU Architecture
■ The CUDA/OpenCL Programming Model
■ Matrix-Matrix Multiplication on GPU and CPU
    ■ On Using the GPU's Shared Memory
    ■ Tiling Techniques
    ■ Efficient Memory Access Patterns
    ■ OpenMP + Vectorization on CPU
■ Molecular Dynamics
■ Ray Tracing

SLIDE 3

Hardware Accelerators

Hardware accelerator: computer hardware that performs certain functions faster than a standard CPU can in software; specialized hardware acting as a co-processor for the CPU.

■ 1980 — Intel 8087: x87 floating-point co-processor for 8086 CPUs.
■ 1985 — Amiga 1000: co-processors for video/audio/DMA/. . .
■ 2000 and later — single-thread performance stalls! Instruction-level parallelism is limited by single-thread performance; application performance increases through task/thread-level parallelism.

Examples: FPGA, ClearSpeed, Cell processor (IBM PowerXCell 8i), GPU, Intel Xeon Phi.

SLIDE 6

Heterogeneous Computing

Modern computer systems consist of a combination of CPU(s) and hardware accelerator(s).

[Figure: host with CPU, shared memory, and peripherals connected to an accelerator (GPU, FPGA, Cell, etc.) via PCI(e), HyperTransport, or a similar channel.]

Trend: multiple CPUs + multiple accelerators per compute node.

SLIDE 7

Heterogeneous Computing

Top 500 Supercomputers, November 2012:

■ More than 12% of the systems use hardware accelerators: 80.6% Nvidia GPU, 11.3% Intel Xeon Phi, 4.8% AMD GPU.
■ 84.6% of the systems use processors with 6 or more cores, and 46.2% with 8 or more cores.

Green 500 Supercomputers, November 2012:

■ The most power-efficient system uses the Intel Xeon Phi.
■ 33% of the top 100 systems use GPU or Xeon Phi hardware accelerators.

Heterogeneous computer systems are in the ascendant. Challenging aspect: making multiple/different compute devices work together.

SLIDE 10

GPU Computing

Since 2007, GPUs have increasingly been used for non-graphics computations (GPGPU) due to

■ high compute performance, and
■ low acquisition and maintenance costs.

[Figure: theoretical peak GFLOP/s from Sep 2001 to Aug 2012, single and double precision — Nvidia GPUs (GeForce FX 5800 through Tesla K20X, the latter reaching 3.95 TFLOP/s single precision) vs. Intel CPUs (Pentium 4 through Sandy Bridge).]

SLIDE 11

GPU Computing

On the GPU we observe a strongly increasing number of compute cores, disproportionate to the increase in memory bandwidth.

[Figure: memory bandwidth in GB/s over time — Nvidia GPUs (GeForce FX 5900 through Tesla K20X, 250 GB/s) vs. Intel CPUs (Pentium 4 through Sandy Bridge).]

SLIDE 12

GPU Computing

Before the introduction of the unified compute shader in 2007, programming GPUs for non-graphics applications was done by means of GLSL, Cg, OpenGL, DirectX: complicated! Current GPUs (based on the unified-shader architecture) can be 'easily' programmed using

■ CUDA: Nvidia's proprietary parallel computing platform, supporting Nvidia GPUs from the G80 series (2007) onwards.
■ OpenCL: a parallel programming platform for heterogeneous computer systems.
    ■ Apple, AMD, Intel, IBM, Nvidia: OpenCL 1.0 by the end of 2008.
    ■ More general than CUDA + available for any computer architecture supporting OpenCL.
    ■ Programming API similar to CUDA.

→ We focus on CUDA.

SLIDE 14

Nvidia Fermi GPU Architecture

Writing efficient programs for the GPU requires a 'fundamental' understanding of the GPU architecture. Nvidia's unified-shader GPU architectures so far: Tesla, Fermi, Kepler.

■ The GPU acts as a co-processor to the host/CPU.
■ The GPU consists of streaming multiprocessors (SMs).
■ SMs are organized into compute clusters.
■ SMs consist of scalar processors (SPs), special function units (SFUs), scheduling and dispatch unit(s), local memory, and load/store units.
■ Scalar processors execute independent thread programs.

[Figure: block diagram of the GPU as co-processor to the host/CPU — compute clusters of SMs (each with SPs, SFUs, shared memory, texture/load/store units, and scheduling/dispatch units), an interconnection network, and DRAM partitions.]

SLIDE 16

Nvidia Fermi GPU Architecture — SIMT

The GPU consists of up to 16 SMs, each containing 32 SPs → up to 512 SPs.

■ SMs can execute independent GPU programs (MIMD):
    ■ different execution paths within the same GPU program.
    ■ concurrent execution of different GPU programs.
■ SPs execute the thread programs the GPU program consists of:
    ■ SIMD — all SPs per SM execute the same instruction.
    ■ SIMT — a subset of the SM's SPs execute the same 'SIMD instruction'.

The GPU architecture is SIMT. SIMT — Single Instruction, Multiple Threads: for a given instruction, threads within a work unit that are ready to execute this instruction are marked active; all other threads within that work unit are marked passive. Active threads are mapped onto SPs for execution (SIMD).
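To make the active/passive mechanism concrete, a minimal sketch (hypothetical kernel, assuming the warp size of 32): the two branches of the conditional are executed one after the other within a warp, with the non-participating threads masked passive.

__global__ void divergent(int *out){
    int tid=blockIdx.x*blockDim.x+threadIdx.x;
    if((tid&1)==0)
        out[tid]=2*tid;    // even lanes active, odd lanes passive
    else
        out[tid]=-tid;     // odd lanes active, even lanes passive
}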

SLIDE 18

Nvidia Fermi GPU Architecture — IMT

For every GPU program, the number of threads that execute the program is defined using a grid-block hierarchy of threads:

■ Grid = 1D/2D/3D arrangement of blocks (max: 65535 × 65535 × 65535).
■ Block = 1D/2D/3D arrangement of threads (max: 1024 × 1024 × 64).
■ Blocks are distributed to SMs: at most 8 blocks per SM at a time.
■ Threads within blocks are distributed to SPs in SIMT manner by the SM's thread schedulers: 32 threads form a work unit (warp).

Fermi's SMs each have 32 physical cores and 2 thread schedulers, handling up to 1536 concurrent threads (48 warps) with zero overhead: Interleaved Multithreading (IMT).

→ Memory access latency hiding (more cores than caches).
→ Minimization of idle cycles of the GPU's cores.

SLIDE 20

Nvidia Fermi GPU Architecture — SMs

[Figure: Fermi streaming multiprocessor — instruction cache, 2 warp schedulers with dispatch units, a 32768 × 32-bit register file, 32 SPs (each with dispatch port, operand collector, FP and INT units, and result queue), 16 LD/ST units, 4 SFUs, interconnect network, 64 kB shared memory/L1 cache, and a unified L2 cache.]

Streaming Multiprocessor (SM)

■ 32 physical cores, each with a floating-point and an integer unit (IEEE 754-2008).
■ Up to 16 double precision operations per SM per clock (only Tesla/Quadro).
■ 4 special function units (SFU): cos(x), sin(x), rsqrt(x), . . .
■ 16 load/store units (LD/ST).
■ 2 warp schedulers and instruction dispatch units: each of them issues one instruction from a warp to 16 SPs, 16 LD/ST units, or 4 SFUs.
■ 64 kB configurable on-chip memory: shared memory & L1 cache.
■ First GPU architecture featuring a true cache hierarchy for load/store operations.
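The split of the 64 kB on-chip memory between shared memory and L1 cache can be requested per kernel through the CUDA runtime; a minimal sketch (the kernel name is hypothetical):

// Request the 48 kB shared memory / 16 kB L1 split for a kernel that
// uses shared memory heavily; cudaFuncCachePreferL1 selects the
// 16 kB shared / 48 kB L1 split instead.
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);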

SLIDE 21

Nvidia Fermi GPU Architecture — Warp Scheduler

For the selection of ready-for-execution warps, the thread schedulers each maintain a scoreboard: ready warps are prioritized. Prioritization considers warp type, instruction type, and 'fairness' to all warps executing on the same SM.

[Figure: the two warp schedulers with their instruction dispatch units issuing instructions from different warps over time, e.g. warp 8/instruction 11, warp 2/instruction 14, . . . on one scheduler, and warp 5/instruction 13, warp 11/instruction 3, . . . on the other.]

SLIDE 22

Nvidia Fermi GPU Architecture — Highlights

■ Strongly increased double precision compute performance over the previous Tesla GPU architecture.
■ Cache hierarchy (not necessary for graphics processing).
■ ECC memory support throughout all memory layers (important for HPC).
■ Concurrent kernel execution: up to 16 different GPU programs within the same CUDA context can execute on the same GPU concurrently (MIMD).

[Figure: serial vs. concurrent kernel execution of programs 1-5 over time.]

SLIDE 23

Nvidia GPU Architectures

                      Tesla C1060      Tesla M2090      Tesla K20X
Architecture          Tesla, GT200b    Fermi, GF110     Kepler, GK110
Transistor count      1.4×10⁹          3.0×10⁹          7.1×10⁹
SM count              30               16               14
SPs per SM            8                32               192
SP count              240              512              2688
Clock rate            1.3 GHz          1.3 GHz
Peak performance      933 (sp)         1330 (sp)        3950 (sp)
in GFLOP/s            78 (dp)          665 (dp)         1310 (dp)
Main memory           4 GB             6 GB             6 GB
Memory bus width      512-bit          384-bit          384-bit
Memory bandwidth      102 GB/s         177 GB/s         250 GB/s
TDP                   200 W            225 W            235 W

SLIDE 24

CUDA Programming Model

■ Why CUDA?
■ CUDA kernels
■ Threads, blocks, and grids
■ Memory
■ CUDA-specific declarations and commands
■ Example: Vector addition
■ Synchronization

SLIDE 25

CUDA Programming Model

Which are the most prominent APIs for making use of the GPU's compute capabilities?

CUDA:

■ Nvidia proprietary.
■ Latest features.

OpenCL:

■ All compute devices with OpenCL support can be used for computations.
■ More general approach for handling heterogeneous computer systems compared to CUDA.
■ Not yet widely accepted.

SLIDE 26

CUDA Programming Model — Kernels

Kernels contain the portion of the code that is offloaded to the GPU for parallel execution.

■ Define the behavior of each CUDA thread.
■ Execute asynchronously to the host program.
■ Kernel invocation → creation of 'many' lightweight threads.

Declaration:

__global__ void kernel();

Invocation:

kernel<<<grid,block>>>();


SLIDE 28

CUDA Programming Model — Kernels

Function type qualifiers:

__host__

■ Optional (tells the compiler to also compile the code for the host).
■ Standard C function to be executed on the CPU.

__global__ (→ kernel)

■ Function that is executed on the GPU.
■ Callable from within the host program (CPU).

__device__

■ GPU function callable from within a kernel only (GPU).
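A minimal sketch of how the qualifiers combine (function names hypothetical); __host__ and __device__ may be combined to compile a function for both sides:

// compiled for CPU and GPU: callable from host code and from kernels
__host__ __device__ float square(float x){ return x*x; }

// GPU-only helper: callable from kernels and other __device__ functions
__device__ float cube(float x){ return x*square(x); }

// kernel: executed on the GPU, launched from the host
__global__ void compute(float *out){ out[threadIdx.x]=cube((float)threadIdx.x); }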

SLIDE 29

CUDA Programming Model — Kernels

Kernels need to be configured when they are called:

__global__ void kernel(){
    // kernel definition
}

int main(){
    ...
    dim3 block(blockX,blockY,blockZ), grid(gridX,gridY,gridZ);
    // kernel invocation
    kernel<<<grid,block>>>();
    ...
}

dim3 is a struct containing the x-, y-, and z-extent of the block/grid.
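A common configuration pattern is to round the grid up so that a fixed block size covers all N work items; a small sketch (N and the kernel are hypothetical placeholders):

int N=10000;                             // number of work items
dim3 block(256,1,1);                     // 256 threads per block (1D)
dim3 grid((N+block.x-1)/block.x,1,1);    // ceil(N/256) blocks
kernel<<<grid,block>>>();
// inside the kernel, threads with global ID >= N must return early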

SLIDE 30

CUDA Programming Model — Threads, Blocks, ...

The programmer specifies the number of threads that should execute the kernel using a grid-block hierarchy of threads:

■ Threads are organized into blocks.
■ Blocks are organized into a grid.
■ Cooperation and synchronization between threads is possible within thread blocks only.
■ Thread blocks are distributed to SMs at runtime by the hardware.
■ Thread blocks need to be independent of each other → scalability.

[Figure: a 3 × 2 grid of blocks; block (1,1) expanded into a 4 × 3 arrangement of threads.]

SLIDE 31

CUDA Programming Model — Threads, Blocks, ...

[Figure: host launching kernel 1 on grid 1 (a 3 × 2 arrangement of blocks) and kernel 2 on grid 2; block (1,1) expanded into a 5 × 3 arrangement of threads.]

Grid 1:

dim3 grid(3,2);
dim3 block(5,3);
kernel1<<<grid,block>>>();

■ Grid 2D: 3 × 2 = 6 blocks.
■ Block 2D: 5 × 3 = 15 threads.
■ Kernel 1: 6 × 15 = 90 threads.

SLIDE 32

CUDA Programming Model — Threads, Blocks, ...

Thread and block IDs, grid and block extents are accessible within the kernel through built-in variables → each thread has a unique (global) thread ID, computed as in the sketch below:

■ Blocks within the grid: blockIdx.x, blockIdx.y, blockIdx.z.
■ Threads within a block: threadIdx.x, threadIdx.y, threadIdx.z.
■ Block extent: blockDim.x, blockDim.y, blockDim.z.
■ Grid extent: gridDim.x, gridDim.y, gridDim.z.
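A minimal sketch of how the built-ins combine into a global thread ID for a 1D launch (kernel and arguments hypothetical); 2D/3D launches compose the y- and z-components analogously:

__global__ void kernel(float *data,int n){
    // unique global thread ID within a 1D grid of 1D blocks
    int tid=blockIdx.x*blockDim.x+threadIdx.x;
    if(tid<n)          // guard: the grid may have been rounded up
        data[tid]*=2.0f;
}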

SLIDE 33

CUDA Programming Model — Memory

Threads within a CUDA kernel can load (store) data from (to) different memory layers. Variable type qualifiers:

__device__

■ Global memory (RAM).
■ Accessible from all threads.

__shared__

■ On-chip memory (fast).
■ Accessible from all threads within the same block.

__constant__

■ Read-only memory.
■ Accessible from all threads.

[Figure: per-thread local memory, per-block shared memory, and grid-wide global memory.]
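A small sketch of __constant__ memory in use (names and sizes hypothetical); the host writes it via cudaMemcpyToSymbol(), and all threads read it through a cached read-only path:

__constant__ float coeff[16];              // read-only, visible to all threads

__global__ void scale(float *data){
    data[threadIdx.x]*=coeff[threadIdx.x%16];
}

// host side:
// float hCoeff[16]={...};
// cudaMemcpyToSymbol(coeff,hCoeff,sizeof(hCoeff));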

SLIDE 34

CUDA Programming Model — Memory

[Figure: memory hierarchy — per-thread registers and per-block shared memory within each block of the grid; global and constant memory shared across all blocks and accessible from the host.]

SLIDE 35

CUDA Programming Model — Some API Calls

■ cudaMalloc(void **,size_t);
■ cudaFree(void *);
■ cudaMemcpy(void *,void *,size_t,enum cudaMemcpyKind);
  where cudaMemcpyKind is one of
    ■ cudaMemcpyHostToDevice,
    ■ cudaMemcpyDeviceToHost,
    ■ cudaMemcpyDeviceToDevice,
    ■ cudaMemcpyHostToHost.
■ cudaMemset(void *,byte,size_t);
■ cudaDeviceSynchronize();

SLIDE 36

CUDA Programming Model — Example

Vector Addition:

// kernel definition
__global__ void vecAdd(float *A,float *B,float *C){
    int i=threadIdx.x;
    C[i]=A[i]+B[i];
}

int main(){
    ...
    // kernel invocation with N threads
    vecAdd<<<1,N>>>(A,B,C);
    ...
}
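The elided host code has to allocate device memory and move the operands explicitly; a sketch using the API calls from the previous slide (N chosen arbitrarily):

int main(){
    const int N=256;
    size_t bytes=N*sizeof(float);
    float *A=(float *)malloc(bytes),*B=(float *)malloc(bytes),*C=(float *)malloc(bytes);
    // ... fill A and B ...
    float *dA,*dB,*dC;
    cudaMalloc((void **)&dA,bytes); cudaMalloc((void **)&dB,bytes); cudaMalloc((void **)&dC,bytes);
    cudaMemcpy(dA,A,bytes,cudaMemcpyHostToDevice);
    cudaMemcpy(dB,B,bytes,cudaMemcpyHostToDevice);
    vecAdd<<<1,N>>>(dA,dB,dC);                        // asynchronous launch
    cudaMemcpy(C,dC,bytes,cudaMemcpyDeviceToHost);    // blocks until the kernel is done
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(A); free(B); free(C);
    return 0;
}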

SLIDE 37

CUDA Programming Model — Synchronization

Barrier synchronization between threads is possible only within blocks, using

__syncthreads();

There is no global barrier! Deadlock situation:

if(...){
    ...
    __syncthreads();
}else{
    ...
    __syncthreads();
}
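Every thread of a block must reach the same barrier; as a sketch, the divergence-free variant of the pattern above simply hoists the barrier out of the conditional:

if(...){
    ...                // no barrier inside the divergent branches
}else{
    ...
}
__syncthreads();       // reached by all threads of the block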

SLIDE 39

Matrix-Matrix Multiplication

Naive matrix-matrix multiplication: $C = A \cdot B$, $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$,

$$c_{ij} = \sum_{k=1}^{n} a_{ik} \cdot b_{kj} \,, \quad 1 \le i \le m \,, \; 1 \le j \le p \,.$$

■ Straightforward implementation:
    ■ Loop permutation trick (CPU).
■ Optimized implementation:
    ■ GPU — shared memory, tiling.
    ■ CPU — OpenMP + vectorization.

SLIDE 41

Matrix-Matrix Multiplication — GPU

Fermi GPU:

■ Max. threads per SM: 1536 → full occupancy with blocks of size 16 × 16.
    ■ 32768 registers per SM → 21 registers per thread: we use less than 21.
    ■ 1536/(16 × 16) = 6 blocks: less than 8 blocks per SM.
■ Adjust the size of a matrix $X \in \mathbb{R}^{m \times n}$ to meet the block geometry: $m' \times n' = \mathrm{ceilN}(m,16) \times \mathrm{ceilN}(n,16)$, where $\mathrm{ceilN}(x,N) = \lceil x/N \rceil \, N$, and $X' \in \mathbb{R}^{m' \times n'}$ with $x'_{ij} = x_{ij}$ if $1 \le i \le m$, $1 \le j \le n$, and $x'_{ij} = 0$ otherwise.
■ C is of type $\mathbb{R}^{m \times p}$ → grid geometry: $\lceil m/16 \rceil \times \lceil p/16 \rceil$.
■ Initialize C with 0.

SLIDE 42

Matrix-Matrix Multiplication — GPU

Host

#define BLOCK 16

int ceilN(int x,int n){
    return (x/n+((x%n)==0?0:1))*n;
}

int main(int argc,char **argv){
    int hA=atoi(argv[1]),wA=atoi(argv[2]),hB=atoi(argv[3]),wB=atoi(argv[4]),
        kMax=ceilN(wA,BLOCK);
    dim3 block(BLOCK,BLOCK),
         grid(ceilN(wB,BLOCK)/BLOCK,ceilN(hA,BLOCK)/BLOCK);
    float *a=(float *)malloc(kMax*grid.y*BLOCK*sizeof(float)),
          *b=(float *)malloc(grid.x*BLOCK*kMax*sizeof(float)),
          *c=(float *)malloc(grid.x*BLOCK*grid.y*BLOCK*sizeof(float)),
          *da,*db,*dc;
    cudaMalloc((void **)&da,kMax*grid.y*BLOCK*sizeof(float));
    cudaMalloc((void **)&db,grid.x*BLOCK*kMax*sizeof(float));
    cudaMalloc((void **)&dc,grid.x*BLOCK*grid.y*BLOCK*sizeof(float));
    ...

SLIDE 43

Matrix-Matrix Multiplication — GPU

Host

    ...
    // set up matrix A at random
    memset(a,0,kMax*grid.y*BLOCK*sizeof(float));
    for(int m=0;m<hA;m++)
        for(int n=0;n<wA;n++)
            a[m*kMax+n]=(float)(rand())/RAND_MAX;
    cudaMemcpy(da,a,kMax*grid.y*BLOCK*sizeof(float),cudaMemcpyHostToDevice);
    // set up matrix B at random
    ...
    // initialize matrix C (GPU only)
    cudaMemset(dc,0,grid.x*BLOCK*grid.y*BLOCK*sizeof(float));
    // kernel invocation
    matMulGPU<<<grid,block>>>(da,db,dc,kMax);
    // copy matrix C from GPU to host
    cudaMemcpy(c,dc,grid.x*BLOCK*grid.y*BLOCK*sizeof(float),cudaMemcpyDeviceToHost);
}

SLIDE 44

Matrix-Matrix Multiplication — GPU

GPU straightforward

__global__ void matMulGPU(float *a,float *b,float *c,int kMax){
    int row=blockIdx.y*BLOCK+threadIdx.y,
        col=blockIdx.x*BLOCK+threadIdx.x;
    float cRes=0.0F;
    for(int k=0;k<kMax;k++)
        cRes+=a[row*kMax+k]*b[k*gridDim.x*BLOCK+col];
    c[row*gridDim.x*BLOCK+col]=cRes;
}

SLIDE 45

Matrix-Matrix Multiplication — GPU

GPU shared memory

__global__ void matMulGPU(float *a,float *b,float *c,int kMax){
    int row=blockIdx.y*BLOCK+threadIdx.y,
        col=blockIdx.x*BLOCK+threadIdx.x;
    float cRes=0.0F;
    __shared__ float aSh[BLOCK][BLOCK];
    __shared__ float bSh[BLOCK][BLOCK];
    for(int k=0;k<kMax;k+=BLOCK){
        aSh[threadIdx.y][threadIdx.x]=a[row*kMax+k+threadIdx.x];
        bSh[threadIdx.y][threadIdx.x]=b[(k+threadIdx.y)*gridDim.x*BLOCK+col];
        __syncthreads();
        for(int kk=0;kk<BLOCK;kk++)
            cRes+=aSh[threadIdx.y][kk]*bSh[kk][threadIdx.x];
        __syncthreads();
    }
    c[row*gridDim.x*BLOCK+col]=cRes;
}

SLIDE 46

Matrix-Matrix Multiplication — GPU

Tiling: programming strategy for distributing data and computations, and for locality enhancement in parallel/sequential programs. Here: sub-matrices (tiles) are loaded into shared memory.

■ 2 × n/BLOCK load operations per thread instead of 2n.
■ Memory loads are coalesced → high memory throughput.
■ The number of arithmetic operations per word read is up to BLOCK× higher than in the straightforward implementation (on later-than-Fermi GPUs memory accesses are cached).

Attention (GPU):

■ SM occupancy: shared memory is a limited resource.
■ Shared memory bank conflicts: these do not occur in the matrix-matrix multiplication.

SLIDE 48

Matrix-Matrix Multiplication — CPU

We draw on the host code given for the GPU implementation above: remove all CUDA runtime API calls, and replace matMulGPU() by matMulCPU().

CPU straightforward

void matMulCPU(float *a,float *b,float *c,int hA,int wA,int wB){
    for(int m=0;m<hA;m++)
        for(int n=0;n<wB;n++)
            for(int k=0;k<wA;k++)
                c[m*wB+n]+=a[m*wA+k]*b[k*wB+n];
}

CPU loop permutation

Better cache utilization, and more suitable for vectorization: Intel's C compiler automatically permutes loops and vectorizes them (if possible).

void matMulCPU(float *a,float *b,float *c,int hA,int wA,int wB){
    for(int m=0;m<hA;m++)
        for(int k=0;k<wA;k++)
            for(int n=0;n<wB;n++)
                c[m*wB+n]+=a[m*wA+k]*b[k*wB+n];
}

SLIDE 50

Matrix-Matrix Multiplication — CPU

Vectorization using AVX intrinsics (requires including immintrin.h).

■ 32-byte alignment for b, c necessary: _mm_malloc(..,32).
■ The block size is now given by the SIMD width: #define BLOCK 8.

void matMulCPU(float *a,float *b,float *c,int hA,int wA,int wB){
    int pitchA=ceilN(wA,BLOCK)/BLOCK,
        pitchB=ceilN(wB,BLOCK)/BLOCK;
    __m256 aScalar, *ptrB=(__m256 *)b, *ptrC=(__m256 *)c;
    for(int m=0;m<hA;m++)
        for(int k=0;k<(pitchA*BLOCK);k++){
            aScalar=_mm256_set1_ps(a[m*(pitchA*BLOCK)+k]);
            for(int n=0;n<pitchB;n++)
                ptrC[m*pitchB+n]=_mm256_add_ps(ptrC[m*pitchB+n],
                    _mm256_mul_ps(aScalar,ptrB[k*pitchB+n]));
        }
}

SLIDE 51

Matrix-Matrix Multiplication — CPU

Multithreading using OpenMP (requires including omp.h).

#define NUM_THREADS 16

void matMulCPU(float *a,float *b,float *c,int hA,int wA,int hB,int wB){
    int pitchA=ceilN(wA,BLOCK)/BLOCK,
        pitchB=ceilN(wB,BLOCK)/BLOCK;
    __m256 aScalar, *ptrB=(__m256 *)b, *ptrC=(__m256 *)c;
    #pragma omp parallel private(aScalar) num_threads(NUM_THREADS)
    {
        for(int m=omp_get_thread_num();m<hA;m+=NUM_THREADS)
            for(int k=0;k<(pitchA*BLOCK);k++){
                aScalar=_mm256_set1_ps(a[m*(pitchA*BLOCK)+k]);
                for(int n=0;n<pitchB;n++)
                    ptrC[m*pitchB+n]=_mm256_add_ps(ptrC[m*pitchB+n],
                        _mm256_mul_ps(aScalar,ptrB[k*pitchB+n]));
            }
    }
}

SLIDE 52

Matrix-Matrix Multiplication — Benchmark

CPU: straightforward vs. loop permutation (g++-4.7)

[Figure: execution time in ms (log scale) over matrix size n = 128 to 8192 for the Xeon E5-2670, with and without loop permutation.]

SLIDE 53

Matrix-Matrix Multiplication — Benchmark

CPU: loop permutation vs. AVX vs. AVX + OpenMP (g++-4.7, icpc-13)

[Figure: execution time in ms (log scale) over matrix size n = 128 to 8192 for the Xeon E5-2670 with loop permutation, AVX, AVX + 16 threads, and AVX + 16 threads with the Intel compiler; speedup over sequential: ≈35.]

SLIDE 54

Matrix-Matrix Multiplication — Benchmark

GPU vs. CPU (CUDA 4.1, icpc-13)

[Figure: execution time in ms (log scale) over matrix size n = 128 to 8192 for the Tesla M2090 with and without shared memory, and the Xeon E5-2670 (AVX, 16 threads, Intel compiler); speedup over the CPU: ≈4.]

SLIDE 55

Molecular Dynamics (MD)

■ System of n atoms/molecules interacting with each other: n-body problem.
■ Properties (entropy, pressure, etc.) of such systems cannot be determined analytically, even for small n.
■ MD simulations use numerical methods to solve the equations of motion for the system's constituents.

Lennard-Jones potential + Coulomb potential:

$$V_{\mathrm{int}}(x_{ij}) = 4\varepsilon_{ij} \left\{ \left(\frac{\sigma_{ij}}{x_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{x_{ij}}\right)^{6} \right\} + \frac{1}{4\pi\varepsilon_0} \frac{q_i q_j}{x_{ij}} \,.$$

Hamiltonian:

$$H = \sum_i \left( \frac{m_i}{2} v_i^2 + V_{\mathrm{box}}(x_i) + \sum_{j \neq i} V_{\mathrm{int}}(|x_i - x_j|) \right) .$$

Force on particle i: $F_i = -\partial H / \partial x_i$.

■ Simple MD update scheme (Euler): for all particles i compute 1) the forces F_i(t), 2) the velocities v_i(t + Δt) using F_i(t) and v_i(t), and 3) the positions x_i(t + Δt) using v_i(t) and x_i(t). Better: the Verlet algorithm (used for our MD simulation). A CUDA sketch of the Euler step follows below.
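As a rough sketch, the Euler scheme maps onto CUDA with one thread per particle (all names hypothetical; 1D positions and a generic pairForce() __device__ helper for brevity). Since there is no global barrier, positions are double-buffered so that force evaluation never reads a partially updated configuration:

__global__ void eulerStep(const float *xOld,float *xNew,float *v,
                          const float *m,int n,float dt){
    int i=blockIdx.x*blockDim.x+threadIdx.x;
    if(i>=n) return;
    float f=0.0f;
    for(int j=0;j<n;j++)                          // 1) F_i(t): sum of pair forces
        if(j!=i) f+=pairForce(xOld[i],xOld[j]);   // pairForce(): hypothetical helper
    xNew[i]=xOld[i]+dt*v[i];                      // 3) x_i(t+dt) from x_i(t), v_i(t)
    v[i]+=dt*f/m[i];                              // 2) v_i(t+dt) from v_i(t), F_i(t)
}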

SLIDE 56

Molecular Dynamics (MD)

[Figure: billion particle-particle interactions per second over the number of particles n (1k to 1M), and speedup over the 8-thread CPU code — Xeon E5620 (sequential, 4 threads, 8 threads) vs. Tesla M2090 (CUDA and OpenCL).]

SLIDE 57

Ray Tracing

High-quality image production by tracing rays through a scene made up of geometric primitives.

■ One ray per pixel: ray ↔ thread.
■ Geometric primitives: triangles.
■ The number of effective ray-triangle intersection tests can be reduced by partitioning the scene using an appropriate data structure: octree.
■ No† recursion on GPUs:
    ■ Reflection/refraction rays.
    ■ Octree traversal.

[Figure: camera and image plane with one ray per pixel traced into the scene; iterative octree traversal with skip pointers (skip if there is no ray-box intersection, descend to the child on intersection).]

SLIDE 58

Ray Tracing

King's treasure scene: ≈280000 triangles + almost all surfaces are reflective.

SLIDE 59

Ray Tracing

King's treasure scene: 4× super-sampling (anti-aliasing)

[Figure: execution time in seconds (log scale) per n × n image, n = 64 to 8192 — Intel Core i7-920 (32 threads), Nvidia Tesla C1060, and Nvidia Tesla M2090.]

SLIDE 60

Ray Tracing

King's treasure scene: 4× super-sampling (anti-aliasing)

[Figure: speedup over the Core i7-920 (32 threads) for an n × n image, n = 64 to 8192 — Nvidia Tesla C1060 and Tesla M2090.]

SLIDE 61

Summary

■ Heterogeneous systems: CPU + GPU + Xeon Phi + FPGA + . . .
■ Appropriate programming model: OpenCL? OpenACC? For Nvidia GPU setups, CUDA seems to be the unofficial standard?!
■ Understanding the functioning of the compute devices in heterogeneous systems may help increase program performance/efficiency.
■ Performance comparisons (CPU vs. GPU) should draw on optimized (and parallelized) codes:
    ■ Many people compare GPU programs against single-threaded, less optimized CPU programs: speedups >100?
    ■ 'About one order of magnitude' speedups are OK (usually).
■ Don't forget data transfers: CPU → GPU → CPU ≠ GPU only (all our benchmarks incorporate data transfers); a sketch of timing a kernel including transfers follows below.
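A sketch of measuring a kernel including transfers with CUDA events (the buffers, sizes, and kernel are hypothetical):

cudaEvent_t start,stop;
cudaEventCreate(&start); cudaEventCreate(&stop);
cudaEventRecord(start,0);                          // timestamp before the upload
cudaMemcpy(dIn,hIn,bytes,cudaMemcpyHostToDevice);
kernel<<<grid,block>>>(dIn,dOut);
cudaMemcpy(hOut,dOut,bytes,cudaMemcpyDeviceToHost);
cudaEventRecord(stop,0);                           // timestamp after the download
cudaEventSynchronize(stop);
float ms; cudaEventElapsedTime(&ms,start,stop);    // elapsed time in ms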

SLIDE 62

Questions?

SLIDE 63

Literature (Hardware & Software Models)

■ Nvidia Corp., Nvidia CUDA C Programming Guide, v. 4.0, 2011.
■ Nvidia Corp., Fermi Whitepaper, 2009.
■ Nvidia Corp., Kepler Whitepaper, 2012.
■ Lindholm E., Nickolls J., Oberman S., Montrym J., Nvidia Tesla: A Unified Graphics and Computing Architecture, 2008.
■ Rauber Th., Rünger G., Parallele Programmierung, 2012.
■ Intel Corp., Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2 (2A & 2B): Instruction Set Reference, A-Z, 2011.
■ Kindratenko V., Overview of Hardware Accelerators, wsdmhp09.hpcl.gwu.edu/kindratenko.pdf, 2009.

SLIDE 64

Literature (Molecular Dynamics)

■ Griebel M., Knapek S., Zumbusch G., Numerical Simulation in Molecular Dynamics: Numerics, Algorithms, Parallelization, Applications, 2007.

Literature (Ray Tracing)

■ Kay, Timothy L., Kajiya, James T., Ray Tracing Complex Scenes, SIGGRAPH Comput. Graph., 20(4):269-278, 1986.
■ Shirley P., Morley, Keith R., Realistic Ray Tracing, 2003.

Web Links

■ Top 500 Supercomputers, www.top500.org/lists/2012/11
■ Green 500 Supercomputers, www.green500.org/lists/green201211
■ www.wikipedia.org