Overview of Performance Prediction Tools for Better Development and Tuning Support



SLIDE 1

Overview of Performance Prediction Tools for Better Development and Tuning Support

Rommel Anatoli Quintanilla Cruz / Master's Student
Esteban Clua / Associate Professor

Universidade Federal Fluminense

GTC 2016, San Jose, CA, USA, April 7th, 2016

SLIDE 2

What you will learn from this talk ...

SLIDE 3

Outline

  • Motivation
  • Performance models
  • Applications
  • Challenges
SLIDE 4

Performance Optimization Cycle*

  • 1. Profile Application
  • 2. Identify Performance Limiters
  • 3. Analyze Profile & Find Indicators
  • 4. Reflect
  • 5. Change and Test Code

* Adapted from S5173 CUDA Optimization with NVIDIA NSIGHT ECLIPSE Edition – GTC 2015

SLIDE 5

Performance Analysis Tools

  • NVIDIA Visual Profiler
  • The NVIDIA CUDA Profiling Tools Interface (CUPTI)
  • The PAPI CUDA Component

SLIDE 6

Performance tools are still evolving

  • NVIDIA Visual Profiler
  • CUDA 7.5: instruction-level profiling

SLIDE 7

Performance tools are still evolving

But it's still not enough for:

  • Power
  • Concurrent Kernel Execution
  • Streaming

SLIDE 8

Outline

  • Motivation
  • Performance models
  • Applications
  • Challenges
SLIDE 9

Performance models

SLIDE 10

Performance models

  • Input: source code, PTX, pseudocode, CUBIN, target device information

SLIDE 11

Performance models

  • Input: source code, PTX, pseudocode, CUBIN, target device information
  • Output (on a target device): execution time prediction, power consumption estimation, performance bottleneck identification
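The input/output relationship above can be sketched as a small interface: a function that consumes kernel statistics plus target-device information and returns an estimate. Everything below is an illustrative assumption (the names, the fields, and the toy roofline-style formula), not the interface of any real tool.

```python
# Hypothetical performance-model interface: kernel description + device
# information in, execution-time estimate out. The linear "bound by the
# slower of compute and memory" formula is a deliberately simple stand-in.
from dataclasses import dataclass

@dataclass
class DeviceInfo:
    sm_count: int            # number of streaming multiprocessors
    clock_ghz: float         # core clock in GHz
    mem_bandwidth_gbs: float # DRAM bandwidth in GB/s

@dataclass
class KernelStats:
    # A real front end would extract these from source, PTX, or CUBIN.
    compute_ops: int   # total arithmetic instructions
    bytes_moved: int   # total DRAM traffic in bytes

def predict_time_ms(k: KernelStats, d: DeviceInfo) -> float:
    """Toy estimate: the kernel is limited by whichever takes longer,
    raw compute throughput or memory bandwidth."""
    compute_ms = k.compute_ops / (d.sm_count * d.clock_ghz * 1e9) * 1e3
    memory_ms = k.bytes_moved / (d.mem_bandwidth_gbs * 1e9) * 1e3
    return max(compute_ms, memory_ms)
```

The point of the sketch is the shape of the contract, not the formula: analytical, statistical, and simulation-based models all fit behind this kind of signature.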

SLIDE 12

Types of performance models

  • Analytical models
  • Statistical models
  • Simulation

Each type has its own advantages & disadvantages.

SLIDE 13

Analytical models

The MWP-CWP model [Hong & Kim 2009]
  • MWP: memory warp parallelism
  • CWP: computation warp parallelism
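The core of the MWP-CWP idea can be sketched in a few lines. This is only the high-level skeleton of Hong & Kim's model, heavily simplified: the full paper adds memory-bandwidth limits, synchronization costs, and several more terms, so treat the formulas and parameter names below as a rough illustration.

```python
# Simplified sketch of the MWP-CWP intuition from Hong & Kim [2009].
# All inputs are in cycles (or warp counts); the bandwidth-limited MWP
# term from the full model is omitted for brevity.
def mwp_cwp(active_warps, mem_latency, departure_delay,
            comp_cycles, mem_cycles):
    # MWP: how many warps can overlap their memory requests, bounded by
    # how many warps are actually resident.
    mwp = min(mem_latency / departure_delay, active_warps)
    # CWP: how many warps' computation fits under one memory waiting
    # period, again bounded by the resident warp count.
    cwp = min((mem_cycles + comp_cycles) / comp_cycles, active_warps)
    # Roughly: enough memory parallelism to hide latency -> compute-bound.
    limiter = "compute" if mwp >= cwp else "memory"
    return mwp, cwp, limiter
```

Comparing the two quantities is what lets the model classify a kernel as compute- or memory-limited without running it.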

SLIDE 14

Statistical models

* GPGPU performance and power estimation using machine learning – Wu, Gene, et al.

SLIDE 15

Simulation

GPU Ocelot
  • PTX kernel → PTX emulation, LLVM translation, or GPU execution

SLIDE 16

Outline

  • Motivation
  • Performance models
  • Applications
  • Challenges
SLIDE 17

Applications of performance models

Successfully used to …

  • schedule concurrent kernels
  • perform auto-tuning
  • estimate power consumption
  • identify performance bottlenecks
  • balance workloads
SLIDE 18

Auto-tuning

  • Optimization goals
  • Parameters
  • Large search space
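The three bullets above combine into a simple loop: pick an optimization goal, enumerate parameter configurations, keep the best. A minimal sketch, assuming a caller-supplied `measure` function (which in practice would launch and time the kernel, or query a performance model):

```python
# Minimal exhaustive auto-tuning sketch. `measure(block_size, tile)` is a
# hypothetical callback returning the cost (e.g. runtime) of one
# configuration; the parameter names are illustrative.
import itertools

def autotune(measure, block_sizes, tiles):
    best_cfg, best_t = None, float("inf")
    for cfg in itertools.product(block_sizes, tiles):
        t = measure(*cfg)
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t
```

Exhaustive search like this is exactly what the "large search space" bullet rules out at scale, which is why performance models are used to prune or rank configurations instead of measuring every one.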
SLIDE 19

Concurrent Kernel Execution

  • Supported since Fermi
  • Limitations: registers, shared memory, occupancy

* Image from http://www.turkpaylasim.com/cevahir
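The listed limitations can be made concrete with a small occupancy-style calculation: registers and shared memory each cap how many blocks fit on one SM, and the smallest cap wins. The per-SM limits used as defaults below are illustrative placeholders, not the figures for any particular GPU.

```python
# Sketch: how per-SM resources bound resident blocks (and therefore how
# much room is left for a second, concurrent kernel). Default limits are
# illustrative assumptions; real values depend on the compute capability.
def blocks_per_sm(regs_per_thread, threads_per_block, smem_per_block,
                  regs_per_sm=65536, smem_per_sm=49152, max_blocks=16):
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks
    return min(by_regs, by_smem, max_blocks)
```

A kernel that saturates any one of these limits leaves no residency for other kernels, which is why such counts feed into concurrent-kernel scheduling decisions.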

SLIDE 20

Outline

  • Motivation
  • Performance models
  • Applications
  • Challenges
SLIDE 21

Challenges

  • Multiple-GPU systems, heterogeneous systems
  • Each microarchitecture has its own features
  • More complex execution behavior is harder to model accurately

SLIDE 22

References and Further Reading

  • Hong, Sunpyo, and Hyesoon Kim. "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness." ACM SIGARCH Computer Architecture News, Vol. 37, No. 3. ACM, 2009.
  • Kim, Hyesoon, et al. "Performance analysis and tuning for general purpose graphics processing units (GPGPU)." Synthesis Lectures on Computer Architecture 7.2 (2012): 1-96.
  • Lopez-Novoa, Unai, Alexander Mendiburu, and José Miguel-Alonso. "A survey of performance modeling and simulation techniques for accelerator-based computing." IEEE Transactions on Parallel and Distributed Systems 26.1 (2015): 272-281.
  • Zhong, Jianlong, and Bingsheng He. "Kernelet: High-throughput GPU kernel executions with dynamic slicing and scheduling." IEEE Transactions on Parallel and Distributed Systems 25.6 (2014): 1522-1532.

SLIDE 23

Acknowledgements

SLIDE 24

Thank you!

Contact:
  rquintanillac@ic.uff.br
  esteban@ic.uff.br
  http://medialab.ic.uff.br

#GTC16

SLIDE 25

Questions & Answers

SLIDE 26

Backup Slides

SLIDE 27

Simplified compilation flow

  • nvcc drives the flow; the CUDA front end (cudafe) splits a .cu file into host code (.cpu) and device code (.gpu)
  • cicc, the high-level optimizer and PTX generator, compiles the device code to .ptx (the virtual instruction set)
  • ptxas, the PTX optimizing assembler, produces .cubin (the CUDA binary file)
  • The host compiler builds the host code, and the device binaries are embedded as a .fatbinary in the CUDA executable

SLIDE 28

Concurrent Kernel Execution

Leftover policy:
  Timeline: K1 16 blocks | K1 16 blocks | ... | K1 4 blocks + K2 12 blocks | K2 16 blocks | ...

Kernel slicing:
  Timeline: K1 6 blocks + K2 10 blocks | K1 6 blocks + K2 10 blocks | K1 6 blocks + K2 10 blocks | K1 6 blocks + K2 10 blocks | ...

* Improving GPGPU energy-efficiency through concurrent kernel execution and DVFS – Jiao, Qing, et al.
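The slicing timeline above can be sketched as a scheduling loop: instead of letting K1 fill the GPU until only leftover resources remain for K2, each kernel is cut into fixed-size slices so both co-reside in every round. The slice sizes and function name below are illustrative, not taken from the cited paper.

```python
# Sketch of the kernel-slicing schedule: each round co-schedules one
# slice of K1 and one slice of K2 until both kernels run out of blocks.
def slice_schedule(k1_blocks, k2_blocks, k1_slice, k2_slice):
    rounds = []
    while k1_blocks > 0 or k2_blocks > 0:
        s1 = min(k1_slice, k1_blocks)  # blocks of K1 in this round
        s2 = min(k2_slice, k2_blocks)  # blocks of K2 in this round
        rounds.append((s1, s2))
        k1_blocks -= s1
        k2_blocks -= s2
    return rounds
```

With K1 = 24 blocks sliced by 6 and K2 = 40 blocks sliced by 10, this yields four rounds of (6, 10), matching the slicing timeline sketched on the slide.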