Comment on bitonic merging; more CUDA performance tuning
CSE 6230: HPC Tools & Apps, Tu Sep 18, 2012


  1. Comment on bitonic merging; more CUDA performance tuning. CSE 6230: HPC Tools & Apps. Tu Sep 18, 2012.

  2. ๏ Comment on bitonic merging, including ideas & hints for Lab 3. Note: some figures are taken from the Grama et al. book (2003), http://www-users.cs.umn.edu/~karypis/parbook/ . This book is also available online through the GT library; see our course website.

  3. [Figure. Source: Grama et al. (2003)]

  4. Summary so far: bitonicMerge(bitonic sequence) == sorted. Q: How do we get a bitonic sequence?
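A minimal host-side sketch of the merge step above, assuming a power-of-two length (plain C++ rather than the course's CUDA kernel; function and variable names are my own):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Given a bitonic sequence a[lo .. lo+n), produce a sorted run
// (ascending if `up`, descending otherwise). n must be a power of two.
void bitonicMerge(std::vector<float>& a, std::size_t lo, std::size_t n, bool up) {
    if (n <= 1) return;
    std::size_t half = n / 2;
    for (std::size_t i = lo; i < lo + half; ++i) {
        // compare-exchange: "+" direction keeps (min, max), "-" keeps (max, min)
        if ((a[i] > a[i + half]) == up)
            std::swap(a[i], a[i + half]);
    }
    bitonicMerge(a, lo, half, up);        // each half is again bitonic
    bitonicMerge(a, lo + half, half, up);
}
```

For example, starting from the bitonic input {1, 3, 5, 7, 8, 6, 4, 2}, `bitonicMerge(v, 0, 8, true)` leaves the whole sequence sorted ascending.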

  5. [Figure. Source: Grama et al. (2003)]

  6. “⊕” = (min, max); “⊖” = (max, min). Source: Grama et al. (2003)


  10. [Figure. Source: Grama et al. (2003)]

  11. [Figure. Source: Grama et al. (2003)]

  12. Bitonic sort parallel complexity (work-depth)?
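For reference, the standard work-depth answer, sketched here (not spelled out on the slide):

```latex
% A bitonic merge of n keys does \log n rounds of n/2 compare-exchanges:
%   D_{merge}(n) = \log n, \qquad W_{merge}(n) = (n/2)\log n.
% Bitonic sort recursively sorts two halves (in opposite directions), then merges:
\[
  D(n) = D(n/2) + \log n = \sum_{j=1}^{\log n} j
       = \frac{\log n \,(\log n + 1)}{2} = O(\log^2 n),
\]
\[
  W(n) = 2\,W(n/2) + \frac{n}{2}\log n = O(n \log^2 n).
\]
```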

  13. [Figure: 16 keys, indices 0–15 shown in binary, 0000–1111.]

  14. [Figure: indices 0–15 in binary; Block layout (p = 4).] log p steps: communication required; log(n/p) steps: no communication.

  15. [Figure: Block layout (p = 4), as before.] Rounds of communication = O(log n); pairwise exchanges per round = O(P); words sent per exchange = O(n/P); total words sent = O(n log n).


  17. [Figure: indices 0–15 in binary; Cyclic layout (p = 4).] log(n/p) steps: no communication; log(p) steps: communication required.

  18. These examples (block or cyclic layout) are binary exchange algorithms. Question: Can we get the “best” of these two schemes?

  19. [Figure: “Transpose” scheme (p = 4): start cyclic, one all-to-all exchange, finish in block layout.] log(p) steps: no communication; log(n/p) steps: no communication.

  20. [Figure: “Transpose” scheme (p = 4), as before.] Rounds of communication = 1; pairwise exchanges per round = O(P²); words sent per exchange = O(n/P²); total words sent = O(n).


  22. All-to-all exchange ≡ matrix transpose:

      Cyclic            Block
       0  1  2  3        0  4  8 12
       4  5  6  7        1  5  9 13
       8  9 10 11        2  6 10 14
      12 13 14 15        3  7 11 15

  23. “Binary exchange” algorithm (block or cyclic layout):
      rounds of communication = O(log n)
      pairwise exchanges per round = O(P)
      total pairwise exchanges = O(P log n)
      words sent per exchange = O(n/P)
      total words sent = O(n log n)
      “Transpose” algorithm (cyclic → all-to-all → block):
      rounds of communication = 1
      pairwise exchanges per round = O(P²)
      total pairwise exchanges = O(P²)
      words sent per exchange = O(n/P²)
      total words sent = O(n)
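The two totals above can be sanity-checked with concrete values, taking each O(·) bound with constant 1 (helper names are my own, not from the lecture code):

```cpp
#include <cmath>

// Binary exchange: O(log n) rounds * O(P) exchanges/round * O(n/P) words/exchange.
double binaryExchangeWords(double n, double P) {
    return std::log2(n) * P * (n / P);            // = n log n
}

// Transpose: 1 round * O(P^2) exchanges * O(n/P^2) words/exchange.
double transposeWords(double n, double P) {
    return 1.0 * (P * P) * (n / (P * P));         // = n, independent of P
}
```

With n = 2²⁰ and P = 16, binary exchange moves 20n words in total while the transpose scheme moves only n, at the cost of P² (rather than P) pairwise exchanges.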

  24. ๏ More CUDA tuning: occupancy and ILP. References:
      http://developer.nvidia.com/cuda/get-started-cuda-cc
      http://developer.download.nvidia.com/CUDA/training/cuda_webinars_WarpsAndOccupancy.pdf
      http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
      http://www.cs.berkeley.edu/~volkov/volkov11-unrolling.pdf

  25. [Screenshot] https://piazza.com/class#fall2012/cse6230/52


  30. Occupancy. Occupancy = active warps / maximum active warps.
      Remember: resources are allocated for the entire block, and resources are finite; using too many resources per thread may limit occupancy.
      Potential occupancy limiters: register usage, shared memory usage, block size.
      Jinx’s Fermi GPUs: 48 max active warps/SM, 32 threads/warp.
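The occupancy formula above, written out as a small helper (defaults are Jinx's Fermi numbers from the slide; the function name is my own):

```cpp
#include <algorithm>

// Occupancy = active warps / maximum active warps, capped at 1.0.
// Defaults: 48 max active warps per SM, 32 threads per warp (Fermi).
double occupancy(int activeThreadsPerSM, int maxWarpsPerSM = 48, int warpSize = 32) {
    int activeWarps = std::min(activeThreadsPerSM / warpSize, maxWarpsPerSM);
    return static_cast<double>(activeWarps) / maxWarpsPerSM;
}
```

For instance, 1536 active threads is 48 warps, giving occupancy 1.0, while 512 threads is 16 warps, giving 16/48 ≈ 0.33.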



  33. /opt/cuda-4.0/cuda/bin/nvcc -arch=sm_20 --ptxas-options=-v -O3 \
        -o bitmerge-cuda.o -c bitmerge-cuda.cu
      ptxas info : Compiling entry function '_Z12bitonicSplitjPfj' for 'sm_20'
      ptxas info : Function properties for _Z12bitonicSplitjPfj
          0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
      ptxas info : Used 8 registers, 52 bytes cmem[0]
      icpc -O3 -g -o bitmerge timer.o bitmerge.o bitmerge-seq.o \
        bitmerge-cilk.o bitmerge-cuda.o \
        -L/opt/cuda-4.0/cuda/bin/../lib64 \
        -Wl,-rpath /opt/cuda-4.0/cuda/bin/../lib64 -lcudart

  34. Occupancy limiters: registers.
      Register usage: compile with --ptxas-options=-v. Fermi has 32K registers per SM.
      Example 1: kernel uses 20 registers per thread (+1 implicit). Active threads = 32K/21 = 1560 > 1536, thus an occupancy of 1.
      Example 2: kernel uses 63 registers per thread (+1 implicit). Active threads = 32K/64 = 512; 512/1536 = 0.3333 occupancy.
      You can control register usage with the nvcc flag --maxrregcount.
      (Occupancy = active warps / max active warps. Jinx’s Fermi GPUs: 48 max active warps/SM, 32 threads/warp.)
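The two register examples above, reproduced as arithmetic (a sketch; the helper name is my own):

```cpp
// Register-limited thread count on a Fermi SM with a 32K-register file:
// threads that fit = registers per SM / registers per thread.
int regLimitedThreads(int regsPerThread) {
    const int regsPerSM = 32 * 1024;
    return regsPerSM / regsPerThread;
}
```

`regLimitedThreads(21)` gives 1560 (> 1536 hardware thread limit, so occupancy 1), while `regLimitedThreads(64)` gives 512 (512/1536 ≈ 0.33 occupancy).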



  38. Recall: Reduction example.


  41. Recall: Reduction example. b = 256 threads/block ⇒ shmem = 256 × (4 bytes/int) = 1024 bytes.

  42. Occupancy limiters: shared memory.
      Shared memory usage: compile with --ptxas-options=-v (reports shared memory per block). Fermi has either 16K or 48K of shared memory per SM.
      Example 1 (48K shared memory): kernel uses 32 bytes of shared memory per thread; 48K/32 = 1536 threads, occupancy = 1.
      Example 2 (16K shared memory): kernel uses 32 bytes per thread; 16K/32 = 512 threads, occupancy = 0.3333.
      Don’t use too much shared memory; choose the L1/shared config appropriately.
      (Occupancy = active warps / max active warps. Jinx’s Fermi GPUs: 48 max active warps/SM, 32 threads/warp.)
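The two shared-memory examples above, as the same kind of arithmetic (helper name is my own; shmemPerSM is 48K or 16K depending on the L1/shared configuration chosen):

```cpp
// Shared-memory-limited thread count: bytes of shared memory per SM
// divided by bytes used per thread.
int shmemLimitedThreads(int shmemBytesPerThread, int shmemPerSM) {
    return shmemPerSM / shmemBytesPerThread;
}
```

With 32 bytes/thread, a 48K configuration allows 1536 threads (occupancy 1), while a 16K configuration allows only 512 (≈ 0.33 occupancy), which is why the L1/shared split matters.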

