SymEngine: Symbolic Execution of OpenCL Kernels, by Alberto Magni




SLIDE 1

Alberto Magni

SymEngine

Symbolic Execution of OpenCL Kernels

SLIDE 2

Optimize code for GPUs

Optimize Memory Accesses

SLIDE 3

GPU Memory Transactions

Coalesced Access

32 Threads, 1 Load Request = 4 Bytes per Thread

128 Bytes = 1 Cache Line

[Diagram: GPU Core, L1 Cache, GPU Memory]

SLIDE 4

GPU Memory Transactions

Uncoalesced Access

32 Threads, 1 Load Request = 4 Bytes per Thread

512 Bytes = 4 Cache Lines

[Diagram: GPU Core, L1 Cache, GPU Memory]

SLIDE 5

GPU Memory Transactions

Uncoalesced Access

32 Threads, 1 Load Request = 4 Bytes per Thread

512 Bytes = 4 Cache Lines

[Diagram: GPU Core, L1 Cache, GPU Memory]

Wasted Bandwidth
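The wasted bandwidth on these slides can be worked out directly from the numbers shown. The sketch below is a minimal illustration in Python (chosen so it runs on the host; it is not part of SymEngine). The function name `wasted_fraction` is hypothetical; the constants are the ones given on the slides.

```python
CACHE_LINE_BYTES = 128   # size of one cache line, as on the slides
WARP_SIZE = 32           # threads issuing the load together
BYTES_PER_THREAD = 4     # one 4-byte load request per thread


def wasted_fraction(cache_lines_touched):
    """Fraction of the fetched bytes that no thread in the warp requested."""
    useful = WARP_SIZE * BYTES_PER_THREAD            # 128 bytes actually needed
    fetched = cache_lines_touched * CACHE_LINE_BYTES
    return 1.0 - useful / fetched


coalesced = wasted_fraction(1)    # 128 useful bytes out of 128 fetched
uncoalesced = wasted_fraction(4)  # 128 useful bytes out of 512 fetched
print(coalesced, uncoalesced)     # prints: 0.0 0.75
```

With four cache lines fetched for the same 128 useful bytes, three quarters of the transferred data is never used.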

SLIDE 6

SymEngine

Statically Detect Suboptimal Accesses to Memory

SLIDE 7

SymEngine

Statically Detect Suboptimal Accesses to Memory

Resolve Address

Compute Number of Transactions

OpenCL Kernel

int threadId = get_global_id(0);

sX = x[threadId];
sY = y[threadId];
sZ = z[threadId];
sQr = Qr[threadId];
sQi = Qi[threadId];

for (int kIndex = 0; kIndex < KERNEL_ELEMS_PER_GRID; kIndex++, kGlobalIndex++) {
  float expArg = PIx2 * (ck[kIndex].Kx * sX + ck[kIndex].Ky * sY + ck[kIndex].Kz * sZ);
  sQr += ck[kIndex].PhiMag * cos(expArg);
  sQi += ck[kIndex].PhiMag * sin(expArg);
}

Qr[threadId] = sQr;
Qi[threadId] = sQi;
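To make the "resolve address" step concrete, the following sketch evaluates the byte addresses that two of the kernel's loads would produce for one warp. It is a hand-written illustration, not SymEngine's implementation. The assumptions are that both arrays start at base address 0, that a float is 4 bytes, and that each `ck` element is a 16-byte struct of four floats (the slides do not state the struct layout). The helper names `addr_x` and `addr_ck_kx` are hypothetical.

```python
WARP_SIZE = 32
FLOAT_BYTES = 4
CK_ELEM_BYTES = 16  # assumption: struct of four floats (Kx, Ky, Kz, PhiMag)


def addr_x(thread_id, base=0):
    """Byte address read by `sX = x[threadId]`: it grows with the thread id."""
    return base + FLOAT_BYTES * thread_id


def addr_ck_kx(thread_id, k_index, base=0):
    """Byte address read by `ck[kIndex].Kx`: thread_id is deliberately unused."""
    return base + CK_ELEM_BYTES * k_index


warp = range(WARP_SIZE)
x_addresses = [addr_x(t) for t in warp]
ck_addresses = [addr_ck_kx(t, k_index=5) for t in warp]

print(x_addresses[:4])          # consecutive 4-byte words: [0, 4, 8, 12]
print(len(set(ck_addresses)))   # every thread reads the same address: 1
```

The `x[threadId]` load gives each thread its own consecutive word, while the `ck[kIndex]` load is uniform across the warp because its index does not depend on the thread id.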

SLIDE 8

Symbolic Execution

SymEngine

[Diagram labels: Warp-Id, Input Values, Number of Threads, Hardware Memory Transactions, OpenCL Code]

SLIDE 9

Symbolic Execution

Threads in a Warp: 0 1 2 3 4 … 29 30 31

[Diagram: for every thread in the warp, each Memory Instruction is mapped to an Address through SCEV]

SLIDE 10

Symbolic Execution

Threads in a Warp: 0 1 2 3 4 … 29 30 31

[Diagram: for every thread in the warp, each Memory Instruction is mapped to an Address through SCEV]

Transaction Number: Number of Cache Lines touched
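Counting the cache lines touched by a warp can be sketched in a few lines. This is an illustrative model only, assuming 128-byte cache lines and 32-thread warps as on the earlier slides. The function `transactions_per_warp` and the two access patterns are hypothetical examples, not SymEngine code.

```python
CACHE_LINE_BYTES = 128
WARP_SIZE = 32


def transactions_per_warp(address_of, warp_id=0):
    """Count the distinct cache lines one memory instruction touches in one warp.

    `address_of` maps a global thread id to the byte address that thread accesses.
    """
    first = warp_id * WARP_SIZE
    lines = {
        address_of(tid) // CACHE_LINE_BYTES
        for tid in range(first, first + WARP_SIZE)
    }
    return len(lines)


def unit_stride(tid):
    """Address of x[tid] for 4-byte elements, array based at address 0."""
    return 4 * tid


def stride_four(tid):
    """Address of x[4 * tid] for 4-byte elements, array based at address 0."""
    return 4 * (4 * tid)


print(transactions_per_warp(unit_stride))   # 1 cache line: coalesced
print(transactions_per_warp(stride_four))   # 4 cache lines: uncoalesced
```

The two results match the 1-line and 4-line cases shown on the GPU Memory Transactions slides.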

SLIDE 11

Validation – Nvidia GTX480

Against Hardware Performance Counters

[Plot: Total HW Transactions for Black-Scholes, by Program Version. Series: HW Counter]

SLIDE 12

Validation – Nvidia GTX480

Against Hardware Performance Counters

[Plot: Total HW Transactions for Black-Scholes, by Program Version. Series: HW Counter, Prediction]

SLIDE 13

Validation – Nvidia GTX480

SLIDE 14

Validation – Nvidia GTX480

0.99 correlation with HW counters
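The slides do not say which correlation measure was used. Assuming it is the Pearson coefficient, a comparison between predicted and measured transaction counts could be computed as in the sketch below. The function `pearson` is a hypothetical helper, and the numbers in the sanity check are synthetic, not measurements from the GTX480.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences.

    Raises ZeroDivisionError if either sequence is constant (zero variance).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic sanity check: a perfectly linear relationship gives 1.0.
sanity = pearson([1, 2, 3], [2, 4, 6])
print(sanity)
```

In use, one sequence would hold the predicted transaction count for each program version and the other the corresponding hardware counter value.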

SLIDE 15

It's on GitHub!

http://github.com/HariSeldon/SymEngine