GTC 2016, San Jose, CA, USA, April 7th, 2016
Rommel Anatoli Quintanilla Cruz / Master's Student Esteban Clua / Associate Professor
Universidade Federal Fluminense
Overview of Performance Prediction Tools for Better Development and Tuning Support
* Adapted from S5173 CUDA Optimization with NVIDIA NSIGHT ECLIPSE Edition – GTC 2015
NVIDIA Visual Profiler
The NVIDIA CUDA Profiling Tools Interface
The PAPI CUDA Component
NVIDIA Visual Profiler
Performance model
Inputs: Source code, PTX, Pseudocode, CUBIN, Target Device Information
Outputs: Power consumption estimation, Execution time prediction, Performance bottlenecks identification
* GPGPU performance and power estimation using machine learning. - Wu, Gene, et al.
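The performance-model slide pairs a description of the kernel with target device information and yields an execution time prediction and a bottleneck identification. A minimal sketch of that idea follows, assuming a simple roofline-style bound. It is not the machine-learning approach of Wu et al. nor any model from the talk, and the names `Device`, `KernelProfile`, and `predict`, as well as all numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Target device information (hypothetical figures, not a real GPU)."""
    peak_gflops: float          # peak compute throughput in GFLOP/s
    peak_bandwidth_gbs: float   # peak memory bandwidth in GB/s


@dataclass
class KernelProfile:
    """Kernel characteristics, e.g. counted from PTX or source code."""
    flops: float        # floating-point operations executed
    bytes_moved: float  # bytes read from and written to device memory


def predict(kernel, device):
    """Return (estimated seconds, bottleneck) under a roofline-style bound.

    Assumption: compute and memory traffic overlap perfectly, so the
    slower of the two determines the execution time.
    """
    compute_s = kernel.flops / (device.peak_gflops * 1e9)
    memory_s = kernel.bytes_moved / (device.peak_bandwidth_gbs * 1e9)
    bottleneck = "compute" if compute_s >= memory_s else "memory"
    return max(compute_s, memory_s), bottleneck


if __name__ == "__main__":
    device = Device(peak_gflops=1000.0, peak_bandwidth_gbs=100.0)
    kernel = KernelProfile(flops=2e9, bytes_moved=1e9)
    seconds, bottleneck = predict(kernel, device)
    print(f"estimated time: {seconds:.4f} s, bottleneck: {bottleneck}")
```

A bound like this ignores latency, occupancy, and caching, which is why the models surveyed in the talk use richer inputs such as PTX or CUBIN.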
PTX Kernel PTX Emulation LLVM Translation GPU Execution
* Image from http://www.turkpaylasim.com/cevahir
Hong, Sunpyo, and Hyesoon Kim. "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness." ACM SIGARCH Computer Architecture News. Vol. 37. No. 3. ACM, 2009.
Kim, Hyesoon, et al. "Performance analysis and tuning for general purpose graphics processing units (GPGPU)." Synthesis Lectures on Computer Architecture 7.2 (2012): 1-96.
Lopez-Novoa, Unai, Alexander Mendiburu, and José Miguel-Alonso. "A survey of performance modeling and simulation techniques for accelerator-based computing." Parallel and Distributed Systems, IEEE Transactions on 26.1 (2015): 272-281.
Zhong, Jianlong, and Bingsheng He. "Kernelet: High-throughput gpu kernel executions with dynamic slicing and scheduling." Parallel and Distributed Systems, IEEE Transactions on 25.6 (2014): 1522-1532.
rquintanillac@ic.uff.br
esteban@ic.uff.br
http://medialab.ic.uff.br
#GTC16
nvcc (CUDA Compiler) compilation flow:
.cu -> cudafe (CUDA Front End) -> .gpu (Device code) and .cpu (Host code)
.gpu -> cicc (High level optimizer and PTX generator) -> .ptx (Virtual Instruction Set)
.ptx -> ptxas (PTX Optimizing Assembler) -> .cubin (CUDA Binary File)
.cubin -> .fatbinary
.cpu and .fatbinary -> Host Compiler -> CUDA Executable
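The intermediate files named on this slide (.ptx, .cubin) can be requested from nvcc directly, which is useful when a prediction tool needs PTX or CUBIN as input. The sketch below only builds the command lines and does not run them. The helper name `nvcc_stage_commands` and the file name `saxpy.cu` are hypothetical; the `-ptx`, `-cubin`, and `-o` options are standard nvcc options.

```python
from pathlib import Path


def nvcc_stage_commands(source):
    """Build nvcc command lines that stop at different compilation stages.

    The commands are returned as argument lists (suitable for
    subprocess.run) and are not executed here.
    """
    stem = Path(source).stem
    return {
        # Stop after PTX generation (virtual instruction set).
        "ptx": ["nvcc", "-ptx", source, "-o", f"{stem}.ptx"],
        # Stop after ptxas has produced the CUDA binary file.
        "cubin": ["nvcc", "-cubin", source, "-o", f"{stem}.cubin"],
        # Full pipeline: device code is embedded and linked with host code.
        "executable": ["nvcc", source, "-o", stem],
    }


if __name__ == "__main__":
    for stage, command in nvcc_stage_commands("saxpy.cu").items():
        print(f"{stage:>10}: {' '.join(command)}")
```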
Timeline: K1 16 blocks | K1 16 blocks | ... | K1 4 blocks + K2 12 blocks | K2 16 blocks | ...
Timeline: ... | K1 6 blocks + K2 10 blocks | K1 6 blocks + K2 10 blocks | K1 6 blocks + K2 10 blocks | K1 6 blocks + K2 10 blocks
* Improving GPGPU energy-efficiency through concurrent kernel execution and DVFS - Jiao, Qing, et al.
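The two timelines can be reproduced with a small block-placement sketch. It assumes the device holds 16 blocks per time slot, as the figures suggest, and compares running K1 to completion before K2 with co-scheduling fixed slices of both kernels. The function names and block counts are hypothetical, and the sketch models placement only, not execution time or power.

```python
def sequential_schedule(k1_blocks, k2_blocks, capacity=16):
    """Fill each slot with K1 blocks first; K2 only gets leftover space."""
    schedule = []
    while k1_blocks > 0 or k2_blocks > 0:
        k1 = min(capacity, k1_blocks)
        k2 = min(capacity - k1, k2_blocks)
        schedule.append((k1, k2))
        k1_blocks -= k1
        k2_blocks -= k2
    return schedule


def sliced_schedule(k1_blocks, k2_blocks, k1_slice=6, k2_slice=10):
    """Co-schedule a fixed-size slice of each kernel in every slot."""
    schedule = []
    while k1_blocks > 0 or k2_blocks > 0:
        k1 = min(k1_slice, k1_blocks)
        k2 = min(k2_slice, k2_blocks)
        schedule.append((k1, k2))
        k1_blocks -= k1
        k2_blocks -= k2
    return schedule


if __name__ == "__main__":
    # Hypothetical workload: K1 has 24 blocks, K2 has 40 blocks.
    for name, plan in (("sequential", sequential_schedule(24, 40)),
                       ("sliced", sliced_schedule(24, 40))):
        slots = " | ".join(f"K1 {a} + K2 {b}" for a, b in plan)
        print(f"{name:>10}: {slots}")
```

Both plans use the same number of slots here; the difference is that the sliced plan keeps both kernels resident in every slot, which is the situation the second timeline depicts.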