Comment on bitonic merging; more CUDA performance tuning (CSE 6230: HPC Tools & Apps, Tu Sep 18, 2012)



SLIDE 1

Comment on bitonic merging; more CUDA performance tuning

CSE 6230: HPC Tools & Apps Tu Sep 18, 2012

Tuesday, September 18, 12

SLIDE 2

๏ Comment on bitonic merging, including ideas & hints for Lab 3

Note: Some figures taken from Grama et al. book (2003) http://www-users.cs.umn.edu/~karypis/parbook/ This book is also available online through the GT library – see our course website.

SLIDE 3

Source: Grama et al. (2003)

SLIDE 4

Summary so far: bitonicMerge (bitonic sequence) == sorted Q: How do we get a bitonic sequence?

SLIDE 5

Source: Grama et al. (2003)

SLIDE 6

Source: Grama et al. (2003)

“⊕” = (min, max) “⊖” = (max, min)

SLIDE 10

Source: Grama et al. (2003)

SLIDE 12

Bitonic sort parallel complexity (work-depth)?

SLIDE 13

0: 0000 1: 0001 2: 0010 3: 0011 4: 0100 5: 0101 6: 0110 7: 0111 8: 1000 9: 1001 10: 1010 11: 1011 12: 1100 13: 1101 14: 1110 15: 1111

SLIDE 14

Block Layout (p=4)

0: 0000 1: 0001 2: 0010 3: 0011 4: 0100 5: 0101 6: 0110 7: 0111 8: 1000 9: 1001 10: 1010 11: 1011 12: 1100 13: 1101 14: 1110 15: 1111

log(n/p) steps: no communication; log p steps: communication required

SLIDE 15

Block Layout (p=4)

0: 0000 1: 0001 2: 0010 3: 0011 4: 0100 5: 0101 6: 0110 7: 0111 8: 1000 9: 1001 10: 1010 11: 1011 12: 1100 13: 1101 14: 1110 15: 1111

log(n/p) steps: no communication; log p steps: communication required

rounds of communication = O(log n)
number of pairwise exchanges per round = O(P)
words sent per exchange = O(n/P)
total words sent = O(n log n)

SLIDE 17

0: 0000 1: 0001 2: 0010 3: 0011 4: 0100 5: 0101 6: 0110 7: 0111 8: 1000 9: 1001 10: 1010 11: 1011 12: 1100 13: 1101 14: 1110 15: 1111

Cyclic Layout (p=4): log p steps: communication required; log(n/p) steps: no communication

SLIDE 18

These (block or cyclic) examples are binary exchange algorithms. Question: Can we get the “best” of these two schemes?

SLIDE 19

0: 0000 1: 0001 2: 0010 3: 0011 4: 0100 5: 0101 6: 0110 7: 0111 8: 1000 9: 1001 10: 1010 11: 1011 12: 1100 13: 1101 14: 1110 15: 1111

“Transpose” (p=4): log p steps: no comm; log(n/p) steps: no comm; … all-to-all exchange in between …

SLIDE 20

0: 0000 1: 0001 2: 0010 3: 0011 4: 0100 5: 0101 6: 0110 7: 0111 8: 1000 9: 1001 10: 1010 11: 1011 12: 1100 13: 1101 14: 1110 15: 1111

“Transpose” (p=4): log p steps: no comm; log(n/p) steps: no comm; … all-to-all exchange in between …

rounds of communication = 1
number of pairwise exchanges per round = O(P²)
words sent per exchange = O(n/P²)
total words sent = O(n)

SLIDE 22

Cyclic (p=4): 0 4 8 12 | 1 5 9 13 | 2 6 10 14 | 3 7 11 15
Block (p=4):  0 1 2 3 | 4 5 6 7 | 8 9 10 11 | 12 13 14 15

All-to-all exchange

Matrix transpose

SLIDE 23

“Binary exchange” algorithm (block or cyclic):
- rounds of communication = O(log n)
- number of pairwise exchanges per round = O(P); total = O(P log n)
- words sent per exchange = O(n/P); total words sent = O(n log n)

“Transpose” algorithm (cyclic → all-to-all → block):
- rounds of communication = 1
- total number of pairwise exchanges = O(P²)
- words sent per exchange = O(n/P²); total words sent = O(n)

SLIDE 24

๏ More CUDA tuning: Occupancy and ILP

References:

http://developer.nvidia.com/cuda/get-started-cuda-cc
http://developer.download.nvidia.com/CUDA/training/cuda_webinars_WarpsAndOccupancy.pdf
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
http://www.cs.berkeley.edu/~volkov/volkov11-unrolling.pdf

SLIDE 25

https://piazza.com/class#fall2012/cse6230/52

SLIDE 30

Occupancy

Occupancy = active warps / maximum active warps. Remember: resources are allocated for the entire block.

Resources are finite; using too many resources per thread may limit the occupancy.

Potential occupancy limiters:
- register usage
- shared memory usage
- block size

Jinx’s Fermi GPUs: 48 max active warps/SM, 32 threads/warp

SLIDE 33

/opt/cuda-4.0/cuda/bin/nvcc -arch=sm_20 --ptxas-options=-v -O3 \
  -o bitmerge-cuda.o -c bitmerge-cuda.cu

ptxas info : Compiling entry function '_Z12bitonicSplitjPfj' for 'sm_20'
ptxas info : Function properties for _Z12bitonicSplitjPfj
  0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 8 registers, 52 bytes cmem[0]

icpc -O3 -g -o bitmerge timer.o bitmerge.o bitmerge-seq.o \
  bitmerge-cilk.o bitmerge-cuda.o \
  -L/opt/cuda-4.0/cuda/bin/../lib64 \
  -Wl,-rpath /opt/cuda-4.0/cuda/bin/../lib64 -lcudart

SLIDE 34

Occupancy Limiters: Registers

Register usage: compile with --ptxas-options=-v. Fermi has 32K registers per SM.

Example 1: kernel uses 20 registers per thread (+1 implicit)
  active threads = 32K/21 = 1560 > 1536, thus an occupancy of 1

Example 2: kernel uses 63 registers per thread (+1 implicit)
  active threads = 32K/64 = 512; 512/1536 = 0.333 occupancy

Can control register usage with the nvcc flag: --maxrregcount

Occupancy = (Active warps) / (Max active warps)

Jinx’s Fermi GPUs: 48 max active warps/SM, 32 threads/warp

SLIDE 38

Recall: Reduction example

SLIDE 41

Recall: Reduction example

b = 256 threads/block ⇒ shmem = 256 * (4 Bytes/int) = 1024 Bytes

SLIDE 42

Occupancy Limiters: Shared Memory

Shared memory usage: compile with --ptxas-options=-v (reports shared memory per block). Fermi has either 16K or 48K of shared memory.

Example 1 (48K shared memory): kernel uses 32 bytes of shared memory per thread
  48K/32 = 1536 threads → occupancy = 1

Example 2 (16K shared memory): kernel uses 32 bytes of shared memory per thread
  16K/32 = 512 threads → occupancy = 0.333

Don’t use too much shared memory; choose the L1/shared config appropriately.

Occupancy = (Active warps) / (Max active warps)

Jinx’s Fermi GPUs: 48 max active warps/SM, 32 threads/warp

SLIDE 45

Occupancy Limiter: Block Size

Each SM can have up to 8 active blocks (Fermi-specific). A small block size limits the total number of threads; avoid small block sizes, generally 128-256 threads is sufficient.

Block size | Active threads        | Occupancy
32         | 256                   | 0.167
64         | 512                   | 0.333
128        | 1024                  | 0.667
192        | 1536                  | 1
256        | 2048 (capped at 1536) | 1

Occupancy = (Active warps) / (Max active warps)

Jinx’s Fermi GPUs: 48 max active warps/SM, 32 threads/warp

SLIDE 47

What Occupancy Do I Need?

Depends on your problem. Many find 66% is enough to saturate the bandwidth.

Look at increasing occupancy only if both of the following are true:
- the kernel is bandwidth bound
- the achieved bandwidth is significantly less than peak

Instruction-level parallelism (ILP) can have a greater effect than increasing occupancy. See Vasily Volkov’s GTC 2010 talk, “Better Performance at Lower Occupancy”: http://nvidia.fullviewmedia.com/gtc2010/0922-a5-2238.html

SLIDE 48

CUDA Occupancy Calculator

A tool to help you investigate occupancy:
http://developer.download.nvidia.com/compute/cuda/4_0/sdk/docs/CUDA_Occupancy_Calculator.xls
Demo: CUDA_Occupancy_Calculator.xls

SLIDE 49

Better Performance at Lower Occupancy

Vasily Volkov UC Berkeley September 22, 2010

SLIDE 50

Prologue

It is common to recommend:
- running more threads per multiprocessor
- running more threads per thread block

Motivation: this is the only way to hide latencies. But…

SLIDE 51

Maximizing occupancy, you may lose performance. Faster codes run at lower occupancy:

Batch of 1024-point complex-to-complex FFTs, single precision:
                  CUFFT 2.2    CUFFT 2.3
Threads per block 256          64            (4x smaller thread blocks)
Occupancy (G80)   33%          17%           (2x lower occupancy)
Performance (G80) 45 Gflop/s   93 Gflop/s    (2x higher performance)

Multiplication of two large matrices, single precision (SGEMM):
                  CUBLAS 1.1   CUBLAS 2.0
Threads per block 512          64            (8x smaller thread blocks)
Occupancy (G80)   67%          33%           (2x lower occupancy)
Performance (G80) 128 Gflop/s  204 Gflop/s   (1.6x higher performance)

SLIDE 52

Two common fallacies:
- multithreading is the only way to hide latency on GPU
- shared memory is as fast as registers

SLIDE 53

Arithmetic latency

Latency: time required to perform an operation
- ≈20 cycles for arithmetic; 400+ cycles for memory
- can’t start a dependent operation during this time
- can hide it by overlapping with other operations

x = a + b; // takes ≈20 cycles to execute
y = a + c; // independent, can start anytime
z = x + d; // dependent, must wait for x to complete (stall)

SLIDE 54

Arithmetic throughput

Latency is often confused with throughput:
- e.g. “arithmetic is 100x faster than memory: arithmetic costs 4 cycles per warp (G80), whence a memory operation costs 400 cycles”
- one is a rate, the other is a time

Throughput: how many operations complete per cycle
- arithmetic: 1.3 Tflop/s = 480 ops/cycle (op = multiply-add)
- memory: 177 GB/s ≈ 32 ops/cycle (op = 32-bit load)

SLIDE 55

Why parallelism? Reason 2: time to move data

Recall: Little’s Law (queuing theory) explains how concurrency helps to hide latency. (Figure: a pipe characterized by its bandwidth and latency.)

SLIDE 56

Why parallelism? Reason 2: time to move data

Little’s Law (queuing theory) explains how concurrency helps to hide latency:

Concurrency = Latency × Bandwidth

Historical note: latency halves every ~9 years; bandwidth doubles every ~3 years.

SLIDE 57

Arithmetic parallelism in numbers

GPU model   Latency (cycles)   Throughput (cores/SM)   Parallelism (operations/SM)
G80-GT200   ≈24                8                       ≈192
GF100       ≈18                32                      ≈576
GF104       ≈18                48                      ≈864

(Latency varies between different types of ops.) Can’t get 100% throughput with less parallelism: not enough operations in flight means idle cycles.

SLIDE 58

Thread-level parallelism (TLP)

It is usually recommended to use threads to supply the needed parallelism, e.g. 192 threads per SM on G80. Four independent operations, one per thread:

thread 1: x = x + a; x = x + b; x = x + c
thread 2: y = y + a; y = y + b; y = y + c
thread 3: z = z + a; z = z + b; z = z + c
thread 4: w = w + a; w = w + b; w = w + c

SLIDE 59

Instruction-level parallelism (ILP)

But you can also use parallelism among instructions in a single thread. One thread, four independent operations per step:

thread: x = x + a; y = y + a; z = z + a; w = w + a
        x = x + b; y = y + b; z = z + b; w = w + b

SLIDE 60

You can use both ILP and TLP on GPU

This applies to all CUDA-capable GPUs. E.g. on G80:
- get ≈100% peak with 25% occupancy if no ILP
- or with 8% occupancy, if 3 operations from each thread can be concurrently processed

On GF104 you must use ILP to get >66% of peak!
- 48 cores/SM; one instruction is broadcast across 16 cores
- so the SM must issue 3 instructions per cycle, but it has only 2 warp schedulers
- instead, it can issue 2 instructions per warp in the same cycle

SLIDE 61

Let’s check it experimentally

Do many arithmetic instructions with no ILP:

#pragma unroll UNROLL
for( int i = 0; i < N_ITERATIONS; i++ )
{
    a = a * b + c;
}

- choose large N_ITERATIONS and a suitable UNROLL
- ensure a, b, and c are in registers and a is used later
- run 1 block (use 1 SM), vary block size
- see what fraction of peak (1.3 Tflop/s ÷ 15 SMs) we get

SLIDE 62

Experimental result (GTX 480): with no ILP, we need 576 threads to get 100% utilization. (Chart: fraction of peak vs. threads per SM, 128-1024; peak = 89.6 Gflop/s.)

SLIDE 63

Introduce instruction-level parallelism

Try ILP=2: two independent instructions per thread:

#pragma unroll UNROLL
for( int i = 0; i < N_ITERATIONS; i++ )
{
    a = a * b + c;
    d = d * b + c;
}

If multithreading were the only way to hide latency on a GPU, we’d have to get the same performance as before.

SLIDE 64

GPUs can hide latency using ILP: with ILP=2, we need only 320 threads to get 100% utilization. (Chart: fraction of peak vs. threads per SM, 128-1024.)

SLIDE 65

Add more instruction-level parallelism

ILP=3: triples of independent instructions:

#pragma unroll UNROLL
for( int i = 0; i < N_ITERATIONS; i++ )
{
    a = a * b + c;
    d = d * b + c;
    e = e * b + c;
}

How far can we push it?

SLIDE 66

Have more ILP, need fewer threads: with ILP=3, 256 threads suffice for 100% utilization. (Chart: fraction of peak vs. threads per SM, 128-1024.)

SLIDE 67

ILP=4: 192 threads suffice for 100% utilization; unfortunately, it doesn’t scale past ILP=4. (Chart: fraction of peak vs. threads per SM, 128-1024.)

SLIDE 68

Summary: can hide latency either way. (Left chart: fraction of peak vs. thread parallelism, 256-1024 threads, fixed instruction parallelism ILP=1. Right chart: fraction of peak vs. instruction parallelism, ILP = 1-6, fixed thread parallelism at 12.5% occupancy.)

SLIDE 69

Fallacy: increasing occupancy is the only way to improve latency hiding. No: increasing ILP is another way.

SLIDE 70

Fallacy:

Occupancy is a metric of utilization – No, ¡it’s ¡only ¡one ¡of ¡the ¡contributing ¡factors.

25

SLIDE 71

Part II: Hide memory latency using fewer threads

SLIDE 72

Hiding memory latency

Apply the same formula, needed parallelism = latency × throughput, but for memory operations:

            Latency            Throughput        Parallelism
Arithmetic  ≈18 cycles         32 ops/SM/cycle   576 ops/SM
Memory      < 800 cycles (?)   < 177 GB/s        < 100 KB

So, to hide memory latency, keep ≈100 KB in flight; less if the kernel is compute bound (needs fewer GB/s).

SLIDE 73

Empirical validation: copy one float per thread. Run many blocks; allocate shared memory dynamically to control occupancy.

__global__ void memcpy( float *dst, float *src )
{
    int block = blockIdx.x + blockIdx.y * gridDim.x;
    int index = threadIdx.x + block * blockDim.x;
    float a0 = src[index];
    dst[index] = a0;
}

SLIDE 74

Copying 1 float per thread (GTX 480): must we maximize occupancy to hide latency? (Chart: fraction of peak vs. occupancy; peak = 177.4 GB/s.)

SLIDE 75

Do more parallel work per thread. Note: threads don’t stall on the memory access itself, only on a data dependency.

__global__ void memcpy( float *dst, float *src )
{
    int iblock = blockIdx.x + blockIdx.y * gridDim.x;
    int index  = threadIdx.x + 2 * iblock * blockDim.x;
    float a0 = src[index];                // no latency stall
    float a1 = src[index + blockDim.x];   // stall
    dst[index] = a0;
    dst[index + blockDim.x] = a1;
}

SLIDE 76

Copying 2 float values per thread: can get away with lower occupancy now. (Chart: fraction of peak vs. occupancy.)

SLIDE 77

Do more parallel work per thread. Note: local arrays are allocated in registers if possible.

__global__ void memcpy( float *dst, float *src )
{
    int iblock = blockIdx.x + blockIdx.y * gridDim.x;
    int index  = threadIdx.x + 4 * iblock * blockDim.x;
    float a[4];   // allocated in registers
    for( int i = 0; i < 4; i++ ) a[i] = src[index + i * blockDim.x];
    for( int i = 0; i < 4; i++ ) dst[index + i * blockDim.x] = a[i];
}

SLIDE 78

Copying 4 float values per thread: a mere 25% occupancy is sufficient. How far can we go? (Chart: fraction of peak vs. occupancy.)

SLIDE 79

Copying 14 float4 values per thread: 84% of peak at 4% occupancy. (Chart: fraction of peak vs. occupancy.)

SLIDE 80

Two ways to hide memory latency. (Left chart: fraction of peak vs. occupancy. Right chart: fraction of peak vs. bytes per thread, 64-256.)

SLIDE 81

Fewer threads = more registers per thread

With 32768 registers per SM, we trade more threads against more registers per thread:
- GF100: 20 registers/thread at 100% occupancy, 63 at 33% occupancy (3x)
- GT200: 16 registers/thread at 100% occupancy, ≈128 at 12.5% occupancy (8x)

Is using more registers per thread better?

SLIDE 82

Only registers are fast enough to get the peak

Consider a*b+c: 2 flops, 12 bytes in, 4 bytes out. At 1.3 Tflop/s, that means reading a, b, c at 8.1 TB/s and writing the result at 2.7 TB/s. Registers can accommodate it. Can shared memory?
- 4 B × 32 banks × 15 SMs × half of 1.4 GHz = 1.3 TB/s only

SLIDE 83

Bandwidth needed vs bandwidth available

Needed to get the peak: 8 TB/s (registers are at least this fast)
Shared memory: 1.3 TB/s (6x less than needed)
Global memory: 177 GB/s (7.6x less than shared memory)

SLIDE 84

Fallacy: “In fact, for all threads of a warp, accessing the shared memory is as fast as accessing a register as long as there are no bank conflicts between the threads.” (CUDA Programming Guide) No: shared memory bandwidth is ≈6x lower than register bandwidth on Fermi (3x before Fermi).

SLIDE 85

Running fast may require low occupancy

- must use registers to run close to the peak
- the larger the bandwidth gap, the more data must come from registers
- this may require many registers = low occupancy

This can often be accomplished by computing multiple outputs per thread.

SLIDE 86

Compute multiple outputs per thread

- more data is local to a thread in registers, so it may need fewer shared memory accesses
- fewer threads, but more parallel work per thread, so low occupancy should not be a problem

(Figure: a 4x4 matrix computed with 16 threads at 1 output/thread, 8 threads at 2 outputs/thread, or 4 threads at 4 outputs/thread.)

SLIDE 87

From Tesla to Fermi: regression?

The gap between shared memory and arithmetic throughput has increased:
- G80-GT200: 16 banks vs 8 thread processors (2:1)
- GF100: 32 banks vs 32 thread processors (1:1)
- GF104: 32 banks vs 48 thread processors (2:3)

Using fast register memory could help, but instead register use is restricted:
- G80-GT200: up to ≈128 registers per thread
- Fermi: up to ≈64 registers per thread

SLIDE 88

๏ Two-level memory optimizations (whiteboard)
