Alberto Magni
SymEngine: Symbolic Execution of OpenCL Kernels
2
Optimize code for GPUs
Optimize Memory Accesses
3
GPU Memory Transactions
Coalesced Access
32 Threads, GPU Core, L1 Cache, GPU Memory
1 Load Request = 4 Bytes per Thread
128 Bytes = 1 Cache Line
4
GPU Memory Transactions
Uncoalesced Access
32 Threads, GPU Core, L1 Cache, GPU Memory
1 Load Request = 4 Bytes per Thread
512 Bytes = 4 Cache Lines
5
GPU Memory Transactions
Uncoalesced Access
32 Threads, GPU Core, L1 Cache, GPU Memory
1 Load Request = 4 Bytes per Thread
512 Bytes = 4 Cache Lines
Wasted Bandwidth
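The two cases above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not SymEngine itself: it takes the 128-byte cache line, 32-thread warp, and 4-byte load from the slides, and assumes (hypothetically, since the slides do not give the access pattern) that the uncoalesced case comes from a stride of 4 elements between neighbouring threads. The function names are illustrative only.

```python
CACHE_LINE_BYTES = 128  # L1 cache line size shown on the slides
WARP_SIZE = 32          # threads per warp shown on the slides
WORD_BYTES = 4          # bytes loaded per thread shown on the slides


def cache_lines_touched(addresses, line_bytes=CACHE_LINE_BYTES):
    """Count the distinct cache lines covered by a list of byte addresses."""
    return len({addr // line_bytes for addr in addresses})


def warp_addresses(base, stride_words, warp_size=WARP_SIZE, word_bytes=WORD_BYTES):
    """Byte address loaded by each thread when thread t reads element t * stride_words."""
    return [base + t * stride_words * word_bytes for t in range(warp_size)]


# Unit stride: 32 threads * 4 bytes fit in a single 128-byte line.
coalesced = cache_lines_touched(warp_addresses(base=0, stride_words=1))

# Assumed stride of 4 elements: the same 32 loads spread over 512 bytes.
strided = cache_lines_touched(warp_addresses(base=0, stride_words=4))

print(coalesced, strided)
```

In the strided case only 128 of the 512 fetched bytes are used, which is the wasted bandwidth the slide refers to. A unit-stride access that starts in the middle of a line (for example `base=64`) also costs a second cache line in this model.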
6
SymEngine
Statically Detect Suboptimal Accesses to Memory
7
SymEngine
Statically Detect Suboptimal Accesses to Memory
OpenCL Kernel:
    int threadID = get_global_id(0);
    /* Per-thread loads, indexed by the global thread ID. */
    sX = x[threadID];
    sY = y[threadID];
    sZ = z[threadID];
    sQr = Qr[threadID];
    sQi = Qi[threadID];
    /* Accumulate over the k-space samples; ck[kIndex] does not depend on the thread ID. */
    for (int kIndex = 0; kIndex < KERNEL_ELEMS_PER_GRID; kIndex++, kGlobalIndex++) {
        float expArg = PIx2 * (ck[kIndex].Kx * sX + ck[kIndex].Ky * sY + ck[kIndex].Kz * sZ);
        sQr += ck[kIndex].PhiMag * cos(expArg);
        sQi += ck[kIndex].PhiMag * sin(expArg);
    }
    /* Per-thread stores. */
    Qr[threadID] = sQr;
    Qi[threadID] = sQi;
Resolve Address
Compute Number of Transactions
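To illustrate the two steps on this kernel, the sketch below resolves the addresses of two of its accesses for one warp and counts the cache lines they cover. It is a hand-written model under several assumptions that are not in the slides: a 1-D launch in which a warp holds 32 consecutive global IDs, a 16-byte `ck` element (four floats: `Kx`, `Ky`, `Kz`, `PhiMag`), arbitrary base addresses, and every access going through the same 128-byte cache line model (a real device may serve `ck` from a different memory space). All function names are hypothetical.

```python
WARP_SIZE = 32
CACHE_LINE_BYTES = 128
FLOAT_BYTES = 4
KVALUES_BYTES = 16  # assumed struct layout: Kx, Ky, Kz, PhiMag as four floats


def x_address(base_x, thread_id):
    """Address of x[threadID]: varies with the thread index."""
    return base_x + thread_id * FLOAT_BYTES


def ck_kx_address(base_ck, k_index):
    """Address of ck[kIndex].Kx: independent of the thread index."""
    return base_ck + k_index * KVALUES_BYTES


def transactions(addresses):
    """Number of distinct cache lines covered by the given byte addresses."""
    return len({a // CACHE_LINE_BYTES for a in addresses})


def warp_thread_ids(warp_id):
    """Global IDs of the threads in one warp, assuming a 1-D launch."""
    start = warp_id * WARP_SIZE
    return range(start, start + WARP_SIZE)


ids = warp_thread_ids(warp_id=2)

# Step 1: resolve the address each thread accesses (base addresses are arbitrary).
x_addrs = [x_address(0x1000, t) for t in ids]
ck_addrs = [ck_kx_address(0x8000, 5) for _ in ids]

# Step 2: count the transactions each access needs.
x_tx = transactions(x_addrs)
ck_tx = transactions(ck_addrs)
print(x_tx, ck_tx)
```

Under these assumptions both accesses need a single cache line: `x[threadID]` because consecutive threads read consecutive floats, and `ck[kIndex].Kx` because every thread reads the same address.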
8
Symbolic Execution
SymEngine
Inputs: OpenCL Code, Warp-Id, Input Values, Number of Threads
Output: Hardware Memory Transactions
9
Symbolic Execution
Threads in a Warp: 1 2 3 4 … 29 30 31
[Figure: for each thread, Memory Instruction and Address, each address labelled SCEV]
10
Symbolic Execution
Threads in a Warp: 1 2 3 4 … 29 30 31
[Figure: for each thread, Memory Instruction and Address, each address labelled SCEV]
Transaction Number
Number of Cache Lines Touched
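The idea on these slides (one address expression per memory instruction, evaluated for every thread in the warp, then reduced to a count of cache lines) can be sketched as follows. This is an assumption-laden illustration rather than SymEngine's implementation: the slides label the addresses as SCEV, presumably LLVM scalar-evolution expressions, whereas the sketch uses a hypothetical tuple-based expression tree. The symbol names `tid`, `base`, and `n` are also made up for the example.

```python
def evaluate(expr, env):
    """Evaluate a small address expression tree for one thread."""
    kind = expr[0]
    if kind == "const":
        return expr[1]
    if kind == "sym":
        return env[expr[1]]
    if kind == "add":
        return evaluate(expr[1], env) + evaluate(expr[2], env)
    if kind == "mul":
        return evaluate(expr[1], env) * evaluate(expr[2], env)
    raise ValueError(f"unknown node: {kind}")


def transactions_for_warp(expr, warp_id, inputs, warp_size=32, line_bytes=128):
    """Evaluate the address for each thread of a warp and count the cache lines touched."""
    lines = set()
    for lane in range(warp_size):
        # Bind the thread ID for this lane on top of the kernel input values.
        env = dict(inputs, tid=warp_id * warp_size + lane)
        lines.add(evaluate(expr, env) // line_bytes)
    return len(lines)


# Address of a[tid * n] with 4-byte elements: base + (tid * n) * 4
address = (
    "add",
    ("sym", "base"),
    ("mul", ("mul", ("sym", "tid"), ("sym", "n")), ("const", 4)),
)

unit_stride = transactions_for_warp(address, warp_id=0, inputs={"base": 0, "n": 1})
stride_8 = transactions_for_warp(address, warp_id=0, inputs={"base": 0, "n": 8})
print(unit_stride, stride_8)
```

The arguments of `transactions_for_warp` mirror the inputs listed earlier in the deck: the code (here reduced to one address expression), a warp ID, and the input values. Keeping the expression symbolic means the same tree can be re-evaluated for different inputs without re-analysing the kernel.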
11
Validation – Nvidia GTX480
Against Hardware Performance Counters
[Plot: Total HW Transactions for Black-Scholes, by Program Version; series: HW Counter]
12
Validation – Nvidia GTX480
Against Hardware Performance Counters
[Plot: Total HW Transactions for Black-Scholes, by Program Version; series: HW Counter, Prediction]
13
Validation – Nvidia GTX480
14
Validation – Nvidia GTX480
0.99 correlation with HW counters
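For readers who want to run the same kind of check on their own measurements, a correlation between predicted and measured transaction counts could be computed as below. The slides do not say which correlation measure was used, so this sketch assumes the Pearson coefficient. The numbers in the usage lines are placeholders invented for illustration, not data from the talk.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder values only: one entry per program version.
predicted = [100, 220, 310, 450]
measured = [110, 210, 330, 440]
print(pearson(predicted, measured))
```

A coefficient close to 1 means the prediction tracks the hardware counter across program versions, even if the absolute counts differ by a constant factor.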
15