TDT24 Presentation - GPU Optimization Principles
Johannes Kvam


  1. TDT24 Presentation - GPU Optimization Principles
     Johannes Kvam
     Department of Engineering Cybernetics
     November 1, 2011

  2. GPU Optimization Principles
     — Paper studies optimizations in CUDA, using a GeForce 8800
     — Demonstrates the optimization process on a matrix multiplication kernel
     — Ports several well-known algorithms to CUDA and discusses their performance on the platform

  3. GPU - Major optimization principles
     — Zero-overhead thread scheduling to hide memory latency
     — Optimize use of off-chip memory
     — Optimize use of on-chip memory
     — Grouping threads to avoid SIMD penalties
     — Limited synchronization

  4. Zero-Overhead Scheduling
     — The run-time is able to schedule warps of threads (a subset of the threads within a thread block executing on an SM) with practically zero overhead
     — Able to mask the latency of slow global memory accesses
     — A high compute-to-memory-access ratio is necessary to avoid saturating the memory channels
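To make the compute-to-memory-access ratio concrete, here is a minimal sketch (an illustrative kernel, not one from the slides) of a kernel whose throughput is bound by memory rather than arithmetic:

    // SAXPY: 2 flops per 12 bytes of traffic (8 read, 4 written). At this
    // ratio the memory channels saturate long before the ALUs do, so extra
    // warps only help hide latency up to the bandwidth limit. Kernels with
    // more arithmetic per load keep the additional warps productive.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }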

  5. Optimize use of off-chip memory
     — Bandwidth to off-chip memory is very high (86.4 GB/s for the GeForce 8800)
     — Can saturate if many threads request access within a short period of time
     — This bandwidth can only be obtained when accesses are contiguous 16-word lines
     — Optimizations that coalesce accesses and reuse data are hence important
     — Each non-contiguous access is a separate DRAM access request
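As an illustration of the coalescing requirement (hypothetical kernels, not taken from the paper), compare a contiguous and a strided access pattern:

    // Coalesced: thread k of a warp reads word k of a contiguous line, so
    // the hardware merges the warp's loads into a few wide DRAM requests.
    __global__ void copy_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Non-contiguous: the stride scatters the warp's addresses, so each
    // access becomes a separate DRAM request and most bandwidth is wasted.
    __global__ void copy_strided(const float *in, float *out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }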

  6. Optimize use of on-chip memory
     — Threads can share data through on-chip shared memory
     — Shared memory is limited-capacity, low-latency memory
     — Enables inter-thread data reuse
     — Fewer loads from high-latency global memory
     An incremental increase in the use of registers or shared memory per thread can degrade performance, as the memory usage may limit the number of threads able to execute simultaneously.
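A common pattern for such reuse is staging a tile of the input in shared memory. A minimal sketch (an illustrative 1D stencil, not the paper's kernel; boundary handling is omitted by assuming the input is padded by RADIUS on both sides):

    #define BLOCK  256
    #define RADIUS 3

    // Each input element is read by up to 2*RADIUS+1 neighbouring threads.
    // Staging the block's working set in shared memory once replaces those
    // repeated global loads with low-latency on-chip reads.
    __global__ void stencil_1d(const float *in, float *out) {
        __shared__ float s[BLOCK + 2 * RADIUS];
        int g = blockIdx.x * blockDim.x + threadIdx.x + RADIUS;
        int l = threadIdx.x + RADIUS;

        s[l] = in[g];
        if (threadIdx.x < RADIUS) {          // halo on both sides
            s[l - RADIUS] = in[g - RADIUS];
            s[l + BLOCK]  = in[g + BLOCK];
        }
        __syncthreads();                     // tile fully staged

        float acc = 0.0f;
        for (int o = -RADIUS; o <= RADIUS; ++o)
            acc += s[l + o];
        out[g] = acc;
    }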

  7. Grouping threads to avoid SIMD penalties
     — CUDA follows the SPMD (Single Program Multiple Data) execution model
     — However, warps follow the SIMD (Single Instruction Multiple Data) execution model
     — Branching within a warp is hence serialized, yielding a longer critical path of execution
     — This can be avoided by grouping threads so that all threads within the same warp follow the same path of execution
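A small sketch of the difference (hypothetical kernels; expensive_a and expensive_b are placeholder device functions):

    __device__ float expensive_a(float x) { return sinf(x); }  // placeholder
    __device__ float expensive_b(float x) { return cosf(x); }  // placeholder

    // Divergent: even and odd threads of the same warp take different
    // branches, so the warp executes both paths back to back.
    __global__ void branch_by_thread(float *d) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0) d[i] = expensive_a(d[i]);
        else            d[i] = expensive_b(d[i]);
    }

    // Grouped: the condition is uniform across each 32-thread warp, so
    // every warp follows a single path and nothing is serialized.
    __global__ void branch_by_warp(float *d) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((i / 32) % 2 == 0) d[i] = expensive_a(d[i]);
        else                   d[i] = expensive_b(d[i]);
    }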

  8. Limited synchronization
     — Threads within the same block can synchronize using the on-chip shared memory
     — Global synchronization, however, is only obtained when a kernel terminates
     — This limits the kind of parallelism that can be utilized within a single kernel call
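The classic illustration is a two-pass reduction (a sketch, not from the slides): __syncthreads() coordinates threads within a block, while combining the per-block results requires a second kernel launch.

    // Reduce one block's slice; __syncthreads() is the intra-block barrier.
    // Assumes blockDim.x == 256 and a power-of-two block size.
    __global__ void reduce_block(const float *in, float *partial) {
        __shared__ float s[256];
        int t = threadIdx.x;
        s[t] = in[blockIdx.x * blockDim.x + t];
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (t < stride)
                s[t] += s[t + stride];
            __syncthreads();             // every level needs the barrier
        }
        if (t == 0)
            partial[blockIdx.x] = s[0];
    }

    // Host side: blocks cannot synchronize with each other, so the global
    // combine happens only at the kernel boundary, as a second launch.
    //   reduce_block<<<numBlocks, 256>>>(d_in, d_partial);
    //   reduce_block<<<1, 256>>>(d_partial, d_out);  // assumes numBlocks == 256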

  9. Effects of memory optimizations
     The number of thread blocks that are simultaneously resident on an SM is limited by whichever limit of registers, shared memory, or thread blocks is reached first.
     — Optimizations may hence have negative effects
     — It is relatively easy to be "trapped" in a local maximum
     — Widely varying configurations should therefore be explored
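One standard way to explore such configurations is to constrain the compiler's per-kernel resource usage and measure the result. A sketch using nvcc's __launch_bounds__ qualifier (the kernel itself is illustrative):

    // Promise at most 256 threads per block and ask for at least 3 resident
    // blocks per SM. The compiler may spill registers to meet this, trading
    // per-thread speed for occupancy; whether that wins must be measured.
    __global__ void __launch_bounds__(256, 3)
    scale(float *data, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = a * data[i];
    }

Compiling with nvcc --ptxas-options=-v reports the registers and shared memory each kernel actually uses, which makes the resource limits above visible.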

  10. Compiler optimizations
     — Complete loop unrolling yields a significant performance increase (it removes the induction variable's register and the condition evaluation)
     — Common compiler optimizations, e.g. common subexpression elimination, increase the number of registers used as a side effect
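A sketch of how complete unrolling is requested in CUDA (an illustrative kernel; the trip count must be known at compile time):

    // With a compile-time trip count, #pragma unroll replicates the body 16
    // times: the induction variable's register and the per-iteration
    // condition test both disappear from the generated code.
    __global__ void dot16(const float *a, const float *b, float *out) {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int base = tid * 16;
        float acc = 0.0f;
        #pragma unroll
        for (int k = 0; k < 16; ++k)
            acc += a[base + k] * b[base + k];
        out[tid] = acc;
    }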

  11. Matrix-Matrix multiply results

  12. Matrix-Matrix multiply results cont.
     — The most efficient implementation used tiling and loop unrolling, taking advantage of shared memory and coalesced global memory accesses. It obtained 91.14 GFLOPS.
     — A second implementation added a prefetching scheme to further hide global memory latency. This implementation needed two more registers, which reduced the number of thread blocks that could be scheduled on each SM by one.
     — This implementation obtained 87.10 GFLOPS (about −5%) with 2/3 the number of active threads!
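A minimal sketch of the tiling-plus-prefetch pattern described above (not the paper's tuned kernel; assumes square matrices with dimensions a multiple of TILE). The two registers holding the prefetched values are exactly the kind of extra per-thread cost noted on the slide:

    #define TILE 16

    // While the current tiles are being consumed from shared memory, the
    // next tiles are already in flight from global memory into registers.
    __global__ void matmul_prefetch(const float *A, const float *B,
                                    float *C, int n) {
        __shared__ float As[TILE][TILE], Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;

        float nextA = A[row * n + threadIdx.x];      // prefetch first tiles
        float nextB = B[threadIdx.y * n + col];
        float acc = 0.0f;

        for (int t = 0; t < n / TILE; ++t) {
            As[threadIdx.y][threadIdx.x] = nextA;    // commit prefetched data
            Bs[threadIdx.y][threadIdx.x] = nextB;
            __syncthreads();

            if (t + 1 < n / TILE) {                  // prefetch next tiles
                nextA = A[row * n + (t + 1) * TILE + threadIdx.x];
                nextB = B[((t + 1) * TILE + threadIdx.y) * n + col];
            }

            #pragma unroll
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                         // tiles safe to overwrite
        }
        C[row * n + col] = acc;
    }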

  13. Article: Optimization Principles and Application Performance Evaluation of a Multithreaded GPU using CUDA
      URL: http://dl.acm.org/citation.cfm?id=1345206.1345220
