TDT24 Presentation - GPU Optimization Principles
Johannes Kvam


  1. TDT24 Presentation - GPU Optimization Principles
     Johannes Kvam
     Department of Engineering Cybernetics
     November 1, 2011

  2. GPU Optimization Principles
     — Paper studies optimizations in CUDA, using a GeForce 8800
     — Demonstrates the optimization process on a matrix multiplication kernel
     — Ports several well-known algorithms to CUDA and discusses their performance on the platform

  3. GPU - Major optimization principles
     — Zero-overhead thread scheduling to hide memory latency
     — Optimize use of off-chip memory
     — Optimize use of on-chip memory
     — Grouping threads to avoid SIMD penalties
     — Limited synchronization

  4. Zero-Overhead Scheduling
     — The run-time is able to schedule warps of threads (a subset of the threads within a thread block executing on an SM) with practically zero overhead
     — Able to mask the latency of slow global memory accesses
     — A high compute-to-memory-access ratio is necessary to avoid saturating the memory channels
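To make the compute-to-memory-access ratio concrete, here is a minimal sketch (an illustrative kernel, not one from the slides) of a kernel whose throughput is bound by memory rather than arithmetic:

    // SAXPY: 2 flops per 12 bytes of traffic (8 read, 4 written). At this
    // ratio the memory channels saturate long before the ALUs do, so extra
    // warps only help hide latency up to the bandwidth limit. Kernels with
    // more arithmetic per load keep the additional warps productive.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }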

  5. Optimize use of off-chip memory
     — Bandwidth to off-chip memory is very high (86.4 GB/s for the GeForce 8800)
     — Can saturate if many threads request access within a short period of time
     — This bandwidth can only be obtained when accesses are contiguous 16-word lines
     — Optimizations that coalesce accesses and reuse data are hence important
     — Each non-contiguous access is a separate DRAM access request
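As an illustration of the coalescing requirement (hypothetical kernels, not taken from the paper), compare a contiguous and a strided access pattern:

    // Coalesced: thread k of a warp reads word k of a contiguous line, so
    // the hardware merges the warp's loads into a few wide DRAM requests.
    __global__ void copy_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Non-contiguous: the stride scatters the warp's addresses, so each
    // access becomes a separate DRAM request and most bandwidth is wasted.
    __global__ void copy_strided(const float *in, float *out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }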

  6. Optimize use of on-chip memory
     — Threads can share data through on-chip shared memory
     — Shared memory is limited-capacity, low-latency memory
     — Enables inter-thread data reuse
     — Fewer loads from high-latency global memory
     An incremental increase in the use of registers or shared memory per thread can degrade performance, as the memory usage may limit the number of threads able to execute simultaneously.
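A common pattern for such reuse is staging a tile of the input in shared memory. A minimal sketch (an illustrative 1D stencil, not the paper's kernel; boundary handling is omitted by assuming the input is padded by RADIUS on both sides):

    #define BLOCK  256
    #define RADIUS 3

    // Each input element is read by up to 2*RADIUS+1 neighbouring threads.
    // Staging the block's working set in shared memory once replaces those
    // repeated global loads with low-latency on-chip reads.
    __global__ void stencil_1d(const float *in, float *out) {
        __shared__ float s[BLOCK + 2 * RADIUS];
        int g = blockIdx.x * blockDim.x + threadIdx.x + RADIUS;
        int l = threadIdx.x + RADIUS;

        s[l] = in[g];
        if (threadIdx.x < RADIUS) {          // halo on both sides
            s[l - RADIUS] = in[g - RADIUS];
            s[l + BLOCK]  = in[g + BLOCK];
        }
        __syncthreads();                     // tile fully staged

        float acc = 0.0f;
        for (int o = -RADIUS; o <= RADIUS; ++o)
            acc += s[l + o];
        out[g] = acc;
    }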

  7. Grouping threads to avoid SIMD penalties
     — CUDA follows the SPMD (Single Program Multiple Data) execution model
     — However, warps follow the SIMD (Single Instruction Multiple Data) execution model
     — Branching within a warp is hence serialized, yielding a longer critical path of execution
     — This can be avoided by grouping threads so that all threads within the same warp follow the same path of execution
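A small sketch of the difference (hypothetical kernels; expensive_a and expensive_b are placeholder device functions):

    __device__ float expensive_a(float x) { return sinf(x); }  // placeholder
    __device__ float expensive_b(float x) { return cosf(x); }  // placeholder

    // Divergent: even and odd threads of the same warp take different
    // branches, so the warp executes both paths back to back.
    __global__ void branch_by_thread(float *d) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0) d[i] = expensive_a(d[i]);
        else            d[i] = expensive_b(d[i]);
    }

    // Grouped: the condition is uniform across each 32-thread warp, so
    // every warp follows a single path and nothing is serialized.
    __global__ void branch_by_warp(float *d) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((i / 32) % 2 == 0) d[i] = expensive_a(d[i]);
        else                   d[i] = expensive_b(d[i]);
    }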

  8. Limited synchronization
     — Threads within the same block can synchronize using the on-chip shared memory
     — Global synchronization, however, is only obtained when a kernel terminates
     — This limits the kind of parallelism that can be utilized within a single kernel call
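The classic illustration is a two-pass reduction (a sketch, not from the slides): __syncthreads() coordinates threads within a block, while combining the per-block results requires a second kernel launch.

    // Reduce one block's slice; __syncthreads() is the intra-block barrier.
    // Assumes blockDim.x == 256 and a power-of-two block size.
    __global__ void reduce_block(const float *in, float *partial) {
        __shared__ float s[256];
        int t = threadIdx.x;
        s[t] = in[blockIdx.x * blockDim.x + t];
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (t < stride)
                s[t] += s[t + stride];
            __syncthreads();             // every level needs the barrier
        }
        if (t == 0)
            partial[blockIdx.x] = s[0];
    }

    // Host side: blocks cannot synchronize with each other, so the global
    // combine happens only at the kernel boundary, as a second launch.
    //   reduce_block<<<numBlocks, 256>>>(d_in, d_partial);
    //   reduce_block<<<1, 256>>>(d_partial, d_out);  // assumes numBlocks == 256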

  9. Effects of memory optimizations
     The number of thread blocks that are simultaneously resident on an SM is limited by whichever limit of registers, shared memory, or thread blocks is reached first.
     — Optimizations may hence have negative effects
     — It is relatively easy to be "trapped" in a local maximum
     — Widely varying configurations should therefore be explored
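One standard way to explore such configurations is to constrain the compiler's per-kernel resource usage and measure the result. A sketch using nvcc's __launch_bounds__ qualifier (the kernel itself is illustrative):

    // Promise at most 256 threads per block and ask for at least 3 resident
    // blocks per SM. The compiler may spill registers to meet this, trading
    // per-thread speed for occupancy; whether that wins must be measured.
    __global__ void __launch_bounds__(256, 3)
    scale(float *data, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = a * data[i];
    }

Compiling with nvcc --ptxas-options=-v reports the registers and shared memory each kernel actually uses, which makes the resource limits above visible.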

  10. Compiler optimizations
     — Complete loop unrolling yields a significant performance increase (it removes the induction variable's register and the condition evaluation)
     — Common compiler optimizations, e.g. common subexpression elimination, increase the number of registers used as a side effect
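A sketch of how complete unrolling is requested in CUDA (an illustrative kernel; the trip count must be known at compile time):

    // With a compile-time trip count, #pragma unroll replicates the body 16
    // times: the induction variable's register and the per-iteration
    // condition test both disappear from the generated code.
    __global__ void dot16(const float *a, const float *b, float *out) {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int base = tid * 16;
        float acc = 0.0f;
        #pragma unroll
        for (int k = 0; k < 16; ++k)
            acc += a[base + k] * b[base + k];
        out[tid] = acc;
    }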

  11. Matrix-Matrix multiply results

  12. Matrix-Matrix multiply results cont.
     — The most efficient implementation used tiling and loop unrolling, taking advantage of shared memory and coalesced global memory accesses. It obtained 91.14 GFLOPS.
     — A second implementation added a prefetching scheme to further hide global memory latency. This implementation needed two more registers, which reduced the number of thread blocks that could be scheduled on each SM by one.
     — This implementation obtained 87.10 GFLOPS (about −5%) with 2/3 the number of active threads!
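A minimal sketch of the tiling-plus-prefetch pattern described above (not the paper's tuned kernel; assumes square matrices with dimensions a multiple of TILE). The two registers holding the prefetched values are exactly the kind of extra per-thread cost noted on the slide:

    #define TILE 16

    // While the current tiles are being consumed from shared memory, the
    // next tiles are already in flight from global memory into registers.
    __global__ void matmul_prefetch(const float *A, const float *B,
                                    float *C, int n) {
        __shared__ float As[TILE][TILE], Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;

        float nextA = A[row * n + threadIdx.x];      // prefetch first tiles
        float nextB = B[threadIdx.y * n + col];
        float acc = 0.0f;

        for (int t = 0; t < n / TILE; ++t) {
            As[threadIdx.y][threadIdx.x] = nextA;    // commit prefetched data
            Bs[threadIdx.y][threadIdx.x] = nextB;
            __syncthreads();

            if (t + 1 < n / TILE) {                  // prefetch next tiles
                nextA = A[row * n + (t + 1) * TILE + threadIdx.x];
                nextB = B[((t + 1) * TILE + threadIdx.y) * n + col];
            }

            #pragma unroll
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                         // tiles safe to overwrite
        }
        C[row * n + col] = acc;
    }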

  13. Article: Optimization Principles and Application Performance Evaluation of a Multithreaded GPU using CUDA
      URL: http://dl.acm.org/citation.cfm?id=1345206.1345220
