
GPU tuning, part 1 (updated). CSE 6230: HPC Tools & Apps, Fall 2014



  1. vuduc.org/cse6230 GPU tuning, part 1 (updated) CSE 6230: HPC Tools & Apps Fall 2014 — September 30 & October 2

  2.–10. [Recall: a sequence of figures recapping earlier lectures. Apart from the "Recall" labels and slide page numbers, the only text to survive extraction is a "6 GB/s" bandwidth figure.]

  11. vuduc.org/cse6230 Performance engineering principles (See HPCA’10 tutorial)

  12. Setup: an xPU with a fast memory of total size Z, connected to a slow memory across the von Neumann bottleneck. Let W ≡ # of (fl)ops and Q ≡ # of memory operations (mops) moved across the bottleneck; Q depends on the fast-memory size, Q = Q(Z). Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …
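  To make the notation concrete, here is an illustrative example that is not on the slides: a dot product of two length-n vectors that do not fit in fast memory.

      W = 2n \quad \text{(one multiply and one add per element)}
      Q \approx 2n \quad \text{(each input element read once from slow memory)}
      I = W/Q \approx 1\ \text{flop:mop}

  An intensity near 1 flop:mop is far below the balance of any modern machine, so the dot product is firmly memory bound; the model delivers that verdict without any profiling.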

  13.–19. (One slide, built up in steps.) With τ_flop = time/flop, τ_mem = time/mop, and ε_flop, ε_mem the corresponding energies per operation, define the intensity I ≡ W/Q (flop:mop) and the machine balance points B_τ ≡ τ_mem/τ_flop and B_ε ≡ ε_mem/ε_flop (both flop:mop). If computation and communication overlap perfectly, the minimum time is

      T = max(W τ_flop, Q τ_mem)
        = W τ_flop · max(1, (Q τ_mem) / (W τ_flop))
        = W τ_flop · max(1, B_τ / I)

  whereas energy accumulates rather than overlaps:

      E = W ε_flop + Q ε_mem
        = W ε_flop · (1 + B_ε / I)

  Consider: W τ_flop / T and W ε_flop / E. Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …
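  One step the build-up implies but never states outright (my restatement, not slide text): dividing the compute-only time W τ_flop by T gives the attainable fraction of peak.

      \frac{W\tau_{\mathrm{flop}}}{T}
        = \frac{1}{\max(1,\ B_\tau/I)}
        = \min\!\left(1,\ \frac{I}{B_\tau}\right)

  Below balance (I < B_τ) the attainable fraction grows linearly with intensity (memory bound); above it, it saturates at 1 (compute bound). This is precisely the roofline shape on the next slides.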

  20.–26. “Roofline” — Williams et al. (Comm. ACM ’09). [Figure, built up over several slides: a log-log plot of relative performance (fraction of peak GFLOP/s, gridlines at 1/32 through 1) versus intensity (flop:byte, 1/2 through 128). The ridge sits at the machine balance (flop:mop), estimated at 3.6 flop:byte for a high-end NVIDIA Fermi in double precision, according to Keckler et al., IEEE Micro (2011). Kernels left of balance are memory (bandwidth) bound; kernels right of it are compute bound. Annotated examples, by increasing intensity: sparse matvec and stencils, FFTs, dense matrix multiply.]
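  As a sanity check on that 3.6 flop:byte figure, here is a back-of-envelope estimate using approximate published numbers for an M2090-class Fermi; the exact peak and bandwidth values are my assumptions, not slide content.

      B_\tau \approx \frac{\text{peak compute}}{\text{peak bandwidth}}
             \approx \frac{665\ \mathrm{GFLOP/s\ (double)}}{178\ \mathrm{GB/s}}
             \approx 3.7\ \mathrm{flop{:}byte}

  A kernel at I ≈ 1/4 flop:byte (sparse-matvec territory) would then attain at most min(1, 0.25/3.6) ≈ 7% of peak, which is why those kernels sit near the bottom of the roofline.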

  27. vuduc.org/cse6230 TLP vs. ILP (thread- vs. instruction-level parallelism). See also: https://bitbucket.org/rvuduc/volkov-gtc10 , http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf , http://www.realworldtech.com/fermi/

  28. Recall Little’s Law, which quantifies the degree of concurrency needed to hide latency: concurrency = latency × throughput. [Figure: a stream of in-flight operations, with latency [time] along one axis and throughput [ops/time] along the other.]

  29. The NVIDIA M2090 implements a fused multiply-add (FMA) with a latency of ~20 cycles, and it issues up to 32 FMAs per cycle. ⇒ Concurrency ~ (20 cy) × (32 ops/cy) = 640 operations in flight. So, a thread block size of 640 threads should fully hide the latency.

  30.

      #define N <constant-value>

      __global__ void kernel (float *pa, float b, float c) {
          float a = *pa;

          #pragma unroll 8
          for (int i = 0; i < N; ++i)
              a = a * b + c;   // one FMA per iteration; each depends on the last

          *pa = a;
      }

  https://bitbucket.org/rvuduc/volkov-gtc10
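  For context, here is a minimal host-side harness of the sort that could produce the threads-per-block sweep on the next slide. This is a sketch under assumptions (N fixed at 8192, one block per SM, an FMA counted as 2 flops), not code from the slides, and error checking is omitted.

      #include <cstdio>
      #include <cuda_runtime.h>

      #define N 8192   // assumed stand-in for <constant-value>

      __global__ void kernel (float *pa, float b, float c);   // as defined above

      int main () {
          int dev = 0, sms = 0;
          cudaGetDevice(&dev);
          cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, dev);

          float *d_a;
          cudaMalloc(&d_a, sizeof(float));

          for (int threads = 32; threads <= 1024; threads += 32) {
              cudaEvent_t t0, t1;
              cudaEventCreate(&t0);
              cudaEventCreate(&t1);

              cudaEventRecord(t0);
              // One block per SM; all threads write the same scalar, which
              // races harmlessly -- only arithmetic throughput matters here.
              kernel<<<sms, threads>>>(d_a, 0.99f, 0.01f);
              cudaEventRecord(t1);
              cudaEventSynchronize(t1);

              float ms = 0;
              cudaEventElapsedTime(&ms, t0, t1);
              // 2 flops (multiply + add, fused) per iteration per thread
              double gflops = 2.0 * N * threads * sms / (ms * 1e6);
              printf("%4d threads/block: %7.1f GFLOP/s\n", threads, gflops);

              cudaEventDestroy(t0);
              cudaEventDestroy(t1);
          }
          cudaFree(d_a);
          return 0;
      }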

  31. [Plot: fraction of peak (gridlines at 0.05, 0.63, and 1) versus threads per block (32 to 1024).] Plateau starts roughly where expected (~640 threads). https://bitbucket.org/rvuduc/volkov-gtc10

  32.–33.

      #define N <constant-value>

      __global__ void kernel (float *pa, float b, float c) {
          float a[2] = {0, 0};

          #pragma unroll 8
          for (int i = 0; i < N; ++i) {
              a[0] = a[0] * b + c;   // the a[0] and a[1] updates are
              a[1] = a[1] * b + c;   // mutually independent
          }

          *pa += a[0] + a[1];
      }

  Each thread now carries two independent FMA chains, so the same in-flight concurrency can be reached with fewer threads: ILP substituting for TLP. https://bitbucket.org/rvuduc/volkov-gtc10

  34. [Plot, same axes as slide 31, for the two-accumulator kernel.] Plateau starts roughly where expected (~640 threads). https://bitbucket.org/rvuduc/volkov-gtc10
