GPU tuning, part 1 (updated) CSE 6230: HPC Tools & Apps, Fall 2014 - PowerPoint PPT Presentation



slide-1
SLIDE 1

GPU tuning, part 1 (updated)

CSE 6230: HPC Tools & Apps Fall 2014 — September 30 & October 2

vuduc.org/cse6230

slide-2
SLIDE 2

2

Recall:

slide-3
SLIDE 3

2

6 GB/s

Recall:

slide-4
SLIDE 4

3

Recall:

slide-5
SLIDE 5

4

Recall:

slide-6
SLIDE 6

5

Recall:

slide-7
SLIDE 7

6

Recall:

slide-8
SLIDE 8

7

Recall:

slide-9
SLIDE 9

8

Recall:

slide-10
SLIDE 10

9

Recall:

slide-11
SLIDE 11

vuduc.org/cse6230

Performance engineering principles

(See HPCA’10 tutorial)

slide-12
SLIDE 12

von Neumann bottleneck

[Diagram: xPU ↔ fast memory (total size = Z) ↔ slow memory]

W ≡ # (fl)ops; Q ≡ # mem. ops (mops) = Q(Z)

Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …

slide-13
SLIDE 13

von Neumann bottleneck

[Diagram: xPU ↔ fast memory (total size = Z) ↔ slow memory; τmem = time/mop, τflop = time/flop]

T = max(W·τflop, Q·τmem) = W·τflop · max(1, (Q/W)·(τmem/τflop)) = W·τflop · max(1, Bτ/I)
E = W·εflop + Q·εmem = W·εflop · (1 + Bε/I)
Consider: (W·τflop)/T and (W·εflop)/E

Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …

SLIDE 14–19

von Neumann bottleneck

[Diagram: xPU ↔ fast memory (total size = Z) ↔ slow memory; τmem = time/mop, τflop = time/flop]

T = max(W·τflop, Q·τmem) = W·τflop · max(1, (Q/W)·(τmem/τflop)) = W·τflop · max(1, Bτ/I)   ← minimum time
E = W·εflop + Q·εmem = W·εflop · (1 + Bε/I)
Consider: (W·τflop)/T and (W·εflop)/E

where I = W/Q is the intensity (flop : mop), and Bτ = τmem/τflop, Bε = εmem/εflop are the machine balance (flop : mop).

Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …

SLIDE 20–26

[Roofline plot: relative performance (GFLOP/s, ticks 1/32 … 1 of peak) vs. intensity (FLOP : Byte, ticks 1/2 … 128), with the balance point marked at 3.6 flop : byte. Balance estimate for a high-end NVIDIA Fermi in double precision, according to Keckler et al., IEEE Micro (2011).]

Below the balance point, kernels are memory-(bandwidth-)bound; above it, compute-bound. Example kernels, in order of increasing intensity: sparse matvec and stencils, then FFTs (memory-bound), then dense matrix multiply (compute-bound).

“Roofline” — Williams et al. (Comm. ACM ’09)

slide-27
SLIDE 27

vuduc.org/cse6230

TLP vs. ILP

(thread- vs. instruction-level parallelism)

See also:
  • https://bitbucket.org/rvuduc/volkov-gtc10
  • http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
  • http://www.realworldtech.com/fermi/

slide-28
SLIDE 28

Recall Little’s Law, which quantifies the degree of concurrency needed to hide latency.

[Diagram: concurrency = latency (time) × throughput (ops/time)]

slide-29
SLIDE 29

The NVIDIA M2090 implements a fused multiply-add (FMA) with a latency of ~20 cycles. It issues up to 32 FMAs per cycle.

  • Concurrency ~ (20 cy) × (32 ops/cy),
  • or ~640 operations.
  • So, a thread block size of 640 threads should fully hide the latency.

[Diagram: concurrency = latency (time) × throughput (ops/time)]

slide-30
SLIDE 30

#define N <constant-value>

__global__
void kernel (float *pa, float b, float c) {
  float a = *pa;
  #pragma unroll 8
  for (int i = 0; i < N; ++i)
    a = a * b + c;
  *pa = a;
}

https://bitbucket.org/rvuduc/volkov-gtc10

slide-31
SLIDE 31

vuduc.org/cse6230

[Plot: fraction of peak vs. threads per block (32 … 1024), rising from ~0.05 to a plateau at ~0.63]

Plateau starts roughly where expected (~ 640 threads)

https://bitbucket.org/rvuduc/volkov-gtc10

slide-32
SLIDE 32

#define N <constant-value>

__global__
void kernel (float *pa, float b, float c) {
  float a[2] = {0, 0};
  #pragma unroll 8
  for (int i = 0; i < N; ++i) {
    a[0] = a[0] * b + c;
    a[1] = a[1] * b + c;
  }
  *pa += a[0] + a[1];
}

https://bitbucket.org/rvuduc/volkov-gtc10

slide-33
SLIDE 33

#define N <constant-value>

__global__
void kernel (float *pa, float b, float c) {
  float a[2] = {0, 0};
  #pragma unroll 8
  for (int i = 0; i < N; ++i) {
    a[0] = a[0] * b + c;   // These two updates are
    a[1] = a[1] * b + c;   // mutually independent.
  }
  *pa += a[0] + a[1];
}

https://bitbucket.org/rvuduc/volkov-gtc10

slide-34
SLIDE 34

vuduc.org/cse6230

[Plot repeated from slide 31: fraction of peak vs. threads per block, plateau at ~0.63]

Plateau starts roughly where expected (~ 640 threads)

https://bitbucket.org/rvuduc/volkov-gtc10

slide-35
SLIDE 35

vuduc.org/cse6230

[Plot: fraction of peak vs. threads per block, for 1 and 2 independent accumulators; plateaus at ~0.63 and ~0.73]

  • But fraction of peak increases with ILP

https://bitbucket.org/rvuduc/volkov-gtc10

slide-36
SLIDE 36

#define N <constant-value>
#define K <constant-value>

__global__
void kernel (float *pa, float b, float c) {
  float a[K] = {0, …, 0};
  #pragma unroll 8
  for (int i = 0; i < N; ++i)
    #pragma unroll
    for (int k = 0; k < K; ++k)
      a[k] = a[k] * b + c;   // Mutually independent across k
  for (int k = 0; k < K; ++k)
    *pa += a[k];
}

https://bitbucket.org/rvuduc/volkov-gtc10

slide-37
SLIDE 37

vuduc.org/cse6230

[Plot: 1, 2, and 4 independent accumulators; plateaus at ~0.63, ~0.73, and ~0.89]

  • Fraction of peak increases with ILP

https://bitbucket.org/rvuduc/volkov-gtc10

slide-38
SLIDE 38

vuduc.org/cse6230

[Plot: 1, 2, 4, and 8 independent accumulators; plateaus at ~0.63, ~0.73, ~0.89, and ~0.94]

  • Fraction of peak increases with ILP

https://bitbucket.org/rvuduc/volkov-gtc10

slide-39
SLIDE 39

vuduc.org/cse6230

[Plot: 1, 2, 4, 8, and 16 independent accumulators; plateaus at ~0.63, ~0.73, ~0.89, ~0.94, and ~0.96]

  • Fraction of peak increases with ILP

https://bitbucket.org/rvuduc/volkov-gtc10

slide-40
SLIDE 40

In short, you should exploit both TLP and ILP.

  • Indeed, on Jinx’s Fermi-based NVIDIA GPUs, you must exploit ILP to reach peak arithmetic performance!
  • There are 48 FMA-capable functional units, and one instruction controls 16 such units. Therefore, 3 instructions should be available per cycle to keep them busy. However, the hardware has only 2 warp schedulers! To compensate, it issues 2 instructions per warp per cycle.

[Diagram: concurrency = latency (time) × throughput (ops/time)]

slide-41
SLIDE 41

The ILP idea applies to memory, too.

  • On the NVIDIA M2090, the clock is 1.3 GHz, memory latency is ~800 cycles, and pin bandwidth is ~177 GB/s.¹ Therefore, your computation needs
  • (800 cy) / (1.3 GHz) × (177 GB/s) ~ 106 KiB in flight.²
  • ¹ Measured bandwidth is closer to ~120 GB/s.
  • ² If each thread moves 4 B, then you might need upwards of 25k threads! Given 1,024 threads, each needs to move about 106 B.

[Diagram: concurrency = latency (time) × throughput (ops/time)]

slide-42
SLIDE 42

#define K <constant-value>

__global__
void memcpy (float* dest, float* src) {
  int i = …;   // index calculation
  float a[K];
  #pragma unroll
  for (int k = 0; k < K; ++k)
    a[k] = src[f(i,k)];
  #pragma unroll
  for (int k = 0; k < K; ++k)
    dest[f(i,k)] = a[k];
}

slide-43
SLIDE 43

vuduc.org/cse6230

[Plot: fraction of peak bandwidth vs. threads per block, for 1, 2, 4, and 8 values moved per thread; plateaus from ~0.25 up to ~0.72]

Fraction of peak increases with ILP for memory accesses, too

slide-44
SLIDE 44

vuduc.org/cse6230

[Plot: fraction of peak bandwidth vs. threads per block, for float, float2, and float4 word sizes]

Fraction of peak also increases with word size

slide-45
SLIDE 45

These ILP-enhancing techniques have one more positive side effect: they increase register usage!

  • Why does that matter? Here is Volkov’s argument.

slide-46
SLIDE 46

vuduc.org/cse6230

Fewer threads = more registers per thread

Registers per thread:
  – GF100: 20 at 100% occupancy, 63 at 33% occupancy (3×)
  – GT200: 16 at 100% occupancy, ≈128 at 12.5% occupancy (8×)

Is using more registers per thread better?

[Diagram: more threads vs. more registers per thread, out of 32768 registers per SM]

slide-47
SLIDE 47

vuduc.org/cse6230

Only registers are fast enough to get the peak

Consider a*b+c: 2 flops, 12 bytes in, 4 bytes out.
That is 8.1 TB/s of operand traffic for 1.3 Tflop/s! Registers can accommodate it. Can shared memory?
  – 4 B × 32 banks × 15 SMs × half of 1.4 GHz = only 1.3 TB/s

[Diagram: a, b, c read @ 8.1 TB/s; a*b+c @ 1.3 Tflop/s; result written @ 2.7 TB/s]

slide-48
SLIDE 48

vuduc.org/cse6230

Bandwidth needed vs. bandwidth available

[Chart: global memory 177 GB/s → (7.6×) → shared memory 1.3 TB/s → (6×) → ~8 TB/s needed to get the peak. Registers are at least this fast.]

slide-49
SLIDE 49

vuduc.org/cse6230

Application: Tuning reductions

(Adapted by Jee Choi [GT] from original tutorial by Mark Harris)

See: http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

SLIDE 59–62

[Roofline plot repeated: relative performance (GFLOP/s) vs. intensity (FLOP : Byte), with compute-bound and memory-(bandwidth-)bound regions]

Compute or memory bound?

M2090: ~177 GB/s [125 GB/s measured], ~0.25 flop/byte

slide-63
SLIDE 63

Baseline

[Shared-memory tree reduction, iteration by iteration:
  iteration 1:  8  6  3  7  3  1  9  5  4  2  9  7  5  3  4
  iteration 2: 14  6 10  7  4  1 14  5  4 11  9 12  5  7  4
  iteration 3: 24  6 10  7 18  1 14  5 15 11  9 19  5  7  4
  iteration 4: 42  6 10  7 18  1 14  5 34 11  9 19  5  7  4
  active thread IDs: 2 4 6 8 10 12 14 → 4 8 12 → 8]

slide-64
SLIDE 64

vuduc.org/cse6230

__global__ void reduce (const int* In, int* Out) {
  int tid = threadIdx.x;                 // Local thread ID
  int i = blockIdx.x*blockDim.x + tid;   // Global index

  extern __shared__ int Local[];
  Local[tid] = In[i];                    // Load into shared mem
  __syncthreads ();

  for (int s=1; s<blockDim.x; s*=2) {    // Reduce loop
    if (tid % (2*s) == 0)                // Is a multiple of 2s (2, 4, 8, …)?
      Local[tid] += Local[tid + s];
    __syncthreads ();
  }

  if (tid == 0) Out[blockIdx.x] = Local[0];
}

SLIDE 65–67

(The kernel above, stepped through: the shared-memory load produces iteration 1; each pass of the reduce loop produces iterations 2, 3, … of the figure on slide 63.)

slide-68
SLIDE 68

Performance

  Version                     N     Time    Performance  Speedup
  0: Baseline (interleaved)   16 M  7.7 ms   8.7 GB/s    —

SLIDE 69

(Same kernel repeated. The test tid % (2*s) == 0 causes highly divergent branching within each warp.)

SLIDE 70

Baseline

(Tree-reduction figure repeated from slide 63.)

slide-71
SLIDE 71

Version 1: Non-divergent threading

[Same interleaved reduction, but with compacted thread IDs:
  iteration 1:  8  6  3  7  3  1  9  5  4  2  9  7  5  3  4   (threads 1–7 active)
  iteration 2: 14  6 10  7  4  1 14  5  4 11  9 12  5  7  4   (threads 1–3)
  iteration 3: 24  6 10  7 18  1 14  5 15 11  9 19  5  7  4   (thread 1)
  iteration 4: 42  6 10  7 18  1 14  5 34 11  9 19  5  7  4]

slide-72
SLIDE 72

Version 1 – non-divergent threading

Before:

  for (int s=1; s<blockDim.x; s*=2) {   // Reduce loop
    if (tid % (2*s) == 0)               // Is a multiple of 2s (2, 4, 8, …)?
      Local[tid] += Local[tid + s];

After:

  for (int s=1; s<blockDim.x; s*=2) {   // Element stride
    int index = 2*s*tid;                // Thread ID stride
    if (index < blockDim.x)
      Local[index] += Local[index + s];

[Figure: the same iterations as slide 71, with compacted thread IDs]

slide-73
SLIDE 73

Performance

  Version                     N     Time    Performance  Speedup
  0: Baseline                 16 M  7.7 ms   8.7 GB/s    —
  1: Non-divergent threading  16 M  5.5 ms  12 GB/s      1.4×

slide-74
SLIDE 74

Version 1: Non-divergent threading

(Figure repeated from slide 71.)

Problem: the strided access pattern causes shared memory bank conflicts.

slide-75
SLIDE 75

vuduc.org/cse6230

Example: an array of 9 elements mapped to 4 shared-memory banks. Maximum throughput occurs when delivering 1 word per bank.

[Diagram: array elements 1–9 assigned round-robin to the 4 banks]

slide-76
SLIDE 76

Version 2 – sequential addressing

[Shared-memory reduction with contiguous active threads:
  iteration 1:  8  6  3  7  3  1  9  5  4  2  9  7  5  3  4   (threads 1–7 active)
  iteration 2: 12  6  5 16 10  6 12  9  4  2  9  7  5  3  4   (threads 1–3)
  iteration 3: 22 12 17 25 10  6 12  9  4  2  9  7  5  3  4   (thread 1)
  iteration 4: 39 37 17 25 10  6 12  9  4  2  9  7  5  3  4]

slide-77
SLIDE 77

Performance

  Version                     N     Time    Performance  Speedup
  0: Baseline                 16 M  7.7 ms   8.7 GB/s    —
  1: Non-divergent threading  16 M  5.5 ms  12 GB/s      1.4×
  2: Sequential addressing    16 M  4.0 ms  17 GB/s      1.9×

SLIDE 78–80

Version 3 – first add during load

  • We’ve done most of the basic optimizations:
    – divergence, bank conflicts
    – data loading is already coalesced
  • However, performance is still far from our target. Why?
    – at most half of the threads are working at any given time
  • Halve the number of threads, and have each thread load 2 values and add them before the local reduction in shared memory:
    – reduces wasted threads
    – doubles the number of memory requests per thread

slide-81
SLIDE 81

Performance

  Version                     N     Time    Performance  Speedup
  0: Baseline                 16 M  7.7 ms   8.7 GB/s    —
  1: Non-divergent threading  16 M  5.5 ms  12 GB/s      1.4×
  2: Sequential addressing    16 M  4.0 ms  17 GB/s      1.9×
  3: 1/2 threads & first add  16 M  2.2 ms  31 GB/s      3.5×

slide-82
SLIDE 82

Version 4 – unrolling the last warp

  • With each iteration, the number of working threads, t, halves
  • When t <= 32, only a single warp is left:
    – we don’t need __syncthreads ()
    – we don’t need conditionals, because all threads in the warp execute synchronously
  • So: unroll the loop when t <= 32
slide-83
SLIDE 83

vuduc.org/cse6230

for (int s = … /* decreasing */ …) {
  if (tid < s)
    Local[tid] += Local[tid + s];
  __syncthreads ();
}

[Figure: iterations 2–4 repeated from slide 76]

slide-84
SLIDE 84

vuduc.org/cse6230

for (int s = … /* decreasing, but more than warp size, e.g., 32 */ …) {
  if (tid < s)
    Local[tid] += Local[tid + s];
  __syncthreads ();
}
if (tid < 32)
  warpReduce (Local, tid);

[Figure: iterations 2–4 repeated from slide 76]

slide-85
SLIDE 85

vuduc.org/cse6230

__device__ void warpReduce (volatile int* Local, int tid) {
  Local[tid] += Local[tid + 32];
  Local[tid] += Local[tid + 16];
  Local[tid] += Local[tid + 8];
  Local[tid] += Local[tid + 4];
  Local[tid] += Local[tid + 2];
  Local[tid] += Local[tid + 1];
}

[Figure: iterations 2–4 repeated from slide 76]

slide-86
SLIDE 86

vuduc.org/cse6230

(warpReduce repeated; the last steps execute warp-synchronously, e.g.:)

  tid=0: … Local[0] += Local[2]; Local[0] += Local[1];
  tid=1: … Local[1] += Local[3]; Local[1] += Local[2];

slide-87
SLIDE 87

Performance

  Version                     N     Time    Performance  Speedup
  0: Baseline                 16 M  7.7 ms   8.7 GB/s    —
  1: Non-divergent threading  16 M  5.5 ms  12 GB/s      1.4×
  2: Sequential addressing    16 M  4.0 ms  17 GB/s      1.9×
  3: 1/2 threads & first add  16 M  2.2 ms  31 GB/s      3.5×
  4: Unroll the last warp     16 M  1.3 ms  51 GB/s      5.9×

slide-88
SLIDE 88

Version 5 – unrolling the entire loop

  • The maximum number of threads in a thread block is fixed (per GPU generation)
  • The most commonly used thread block sizes are powers of 2
  • Conditionals based on constant values are evaluated at compile time
  • You can use:
    – C++ templates
    – constants for the thread block size (via compiler options or #define)
slide-89
SLIDE 89

Version 5 – unrolling the entire loop

  • Example: the template parameter is evaluated at compile time

template <int TBS>   // Thread-block size
__global__ void gpuReduce (…) {
  …
  if (TBS >= 1024) if (tid < 512) { Local[tid] += Local[tid+512]; __syncthreads(); }
  if (TBS >=  512) if (tid < 256) { Local[tid] += Local[tid+256]; __syncthreads(); }
  if (TBS >=  256) if (tid < 128) { Local[tid] += Local[tid+128]; __syncthreads(); }
  if (TBS >=  128) if (tid <  64) { Local[tid] += Local[tid+ 64]; __syncthreads(); }
  if (tid < 32) warpReduce (…);
}

slide-90
SLIDE 90

Performance

  Version                     N     Time    Performance  Speedup
  0: Baseline                 16 M  7.7 ms   8.7 GB/s    —
  1: Non-divergent threading  16 M  5.5 ms  12 GB/s      1.4×
  2: Sequential addressing    16 M  4.0 ms  17 GB/s      1.9×
  3: 1/2 threads & first add  16 M  2.2 ms  31 GB/s      3.5×
  4: Unroll the last warp     16 M  1.3 ms  51 GB/s      5.9×
  5: Complete unrolling       16 M  1.2 ms  54 GB/s      6.2×

slide-91
SLIDE 91

Version 6 – more work per thread

  • Increase the amount of work done by each thread:
    – instead of each thread adding two elements during the load, add more
  • Performance impact:
    – reduces thread-block scheduling overhead
    – increases the number of memory requests per thread

slide-92
SLIDE 92

Performance

  Version                     N     Time     Performance  Speedup
  0: Baseline                 16 M  7.7 ms    8.7 GB/s    —
  1: Non-divergent threading  16 M  5.5 ms   12 GB/s      1.4×
  2: Sequential addressing    16 M  4.0 ms   17 GB/s      1.9×
  3: 1/2 threads & first add  16 M  2.2 ms   31 GB/s      3.5×
  4: Unroll the last warp     16 M  1.3 ms   51 GB/s      5.9×
  5: Complete unrolling       16 M  1.2 ms   54 GB/s      6.2×
  6: More work per thread     16 M  0.65 ms  103 GB/s     11.8×

slide-93
SLIDE 93

vuduc.org/cse6230

Additional material

slide-94
SLIDE 94

  • W(n) = total work
  • D(n) = depth or span (critical path length)
  • W(n) / D(n) = inherent parallelism
  • Design criteria: make work optimal; maximize parallelism

slide-95
SLIDE 95

Example: Reduction

  • W(n) = O(n) — work optimal ✓
  • D(n) = O(log n)
  • Parallelism = O(n / log n) ✓

[Figure: binary tree of ⊕ operations]

slide-96
SLIDE 96

Example: Reduction

[Figure: reduction tree; a subtree’s working set = Z words, with ← L → marking a run of consecutive elements, assuming consecutive layout]

slide-97
SLIDE 97

vuduc.org/cse6230

An abstract distributed memory machine:

Processing “nodes” have private memory, communicate by “message passing.”

[Diagram: nine nodes, each an xPU with its own private memory]

slide-98
SLIDE 98

vuduc.org/cse6230

An abstract distributed memory machine:

Processing “nodes” have private memory, communicate by “message passing.”

  • Relevant even on 1 node, due to non-uniform memory access (NUMA).

Example: two quad-core CPUs (Socket-0 and Socket-1), each attached to its own DRAM (memory that is logically shared but physically distributed)

[Diagram: Core0–Core3 on each socket, each socket with local DRAM]

slide-99
SLIDE 99

vuduc.org/cse6230

Example: adding GPUs, which have private memory address spaces, to the previous example.

[Diagram of one node: two quad-core CPUs (CPU 1, CPU 2), each with DDR3 DRAM; QPI links between the CPUs and to the I/O hubs; three GPUs (GPU 1, GPU 2, GPU 3) attached via PCIe x16; Infiniband for off-node communication]