Understanding GPU performance: How to get peak FLOPS (GPU version)


SLIDE 1

Understanding GPU performance

How to get peak FLOPS (GPU version)

Kenjiro Taura

SLIDE 2

Contents

1 Data Access Performance

SLIDE 4

Data access performance

Data access performance is important on the GPU, too.

SLIDE 5

Memory organization

Pascal (P100)

level           line size   capacity           associativity
L1              32B         24KB/SM            ?
L2              32B         4MB/device         ?
Global Memory   -           12/16GB            N/A
Shared Memory   -           64KB (∗)           N/A

Volta (V100)

level           line size   capacity           associativity
L1              32B         32-128 KB/SM (∗)   ?
L2              32B         6MB/device         ?
Global Memory   -           16GB               N/A
Shared Memory   -           ≤96KB (∗)          N/A

(∗) 128KB is split between L1 and Shared Memory (configurable). Source: https://arxiv.org/abs/1804.06826
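The capacities can be queried at runtime (line sizes and associativities are not exposed by the API and have to be measured, which is what the cited paper does). A minimal sketch using the CUDA runtime call cudaGetDeviceProperties:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);          // device 0
      printf("device           : %s\n", prop.name);
      printf("global memory    : %zu MB\n", prop.totalGlobalMem >> 20);
      printf("L2 cache         : %d KB\n", prop.l2CacheSize >> 10);
      printf("shared memory/SM : %zu KB\n", prop.sharedMemPerMultiprocessor >> 10);
      printf("number of SMs    : %d\n", prop.multiProcessorCount);
      return 0;
    }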

SLIDE 6

Global vs. Shared Memory

Global memory and the L1/L2 caches form the ordinary memory hierarchy.

cudaMalloc returns (a pointer to) global memory; accesses to global memory are transparently cached in the L1/L2 caches.

Shared memory is an explicitly managed scratch memory:

its latency is shorter than L1's (especially on Pascal)
you explicitly move data between global and shared memory
data are shared only within a thread block
the programming interface is covered shortly (a sketch follows below)
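A minimal sketch of that interface (the kernel name, TILE, and the block-level sum are illustrative, not from the slides): each thread block explicitly copies a tile of global memory into __shared__ scratch, synchronizes, and then works out of the fast scratch.

    #include <cuda_runtime.h>

    #define TILE 256   // threads per block = elements staged per block

    __global__ void stage_and_sum(const float *g, float *out, int n) {
      __shared__ float s[TILE];                 // explicitly managed scratch
      int i = blockIdx.x * TILE + threadIdx.x;
      s[threadIdx.x] = (i < n) ? g[i] : 0.0f;   // global -> shared, explicit
      __syncthreads();                          // s[] now visible to the block
      if (threadIdx.x == 0) {                   // data shared within this block only
        float v = 0.0f;
        for (int j = 0; j < TILE; j++) v += s[j];
        out[blockIdx.x] = v;                    // one partial sum per block
      }
    }

It would be launched as stage_and_sum<<<(n + TILE - 1) / TILE, TILE>>>(g, out, n), with g obtained from cudaMalloc (global memory) while s[] lives in the SM's shared memory.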

SLIDE 7

Latency measurement

The same pointer-chasing experiment as we did on the CPU:

    for (N times) {
      p = p->next;
    }

[Figure: N elements, each the size of a cache line, linked through next pointers in a random order]
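A GPU version of the loop might look like the following sketch, assuming a 32B node to match the line size above and timing with the device-side clock64() cycle counter (names are illustrative):

    #include <cuda_runtime.h>

    // one list element, padded to a 32B line
    struct node { node *next; char pad[32 - sizeof(node*)]; };

    __global__ void chase(node *p, long n, long long *cycles, node **sink) {
      long long t0 = clock64();
      for (long i = 0; i < n; i++) p = p->next;  // each load depends on the previous one
      long long t1 = clock64();
      *cycles = t1 - t0;   // total cycles; divide by n for latency/load
      *sink = p;           // keep the chain from being optimized away
    }

A single thread (chase<<<1, 1>>>) is enough, since latency, not bandwidth, is being measured.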

SLIDE 8

Data size vs. latency

Even an L1 cache hit takes 30 (Volta) to 100 (Pascal) cycles.

[Plot: latency per load in a random list traversal; x-axis: size of the region (bytes), y-axis: latency/load (GPU cycles); one curve each for Pascal and Volta]
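For completeness, a sketch of the host side that produces one point of such a curve, continuing the chase kernel above (managed memory keeps it short; a real harness would sweep the region size and warm up the cache first):

    #include <cstdio>
    #include <vector>
    #include <random>
    #include <algorithm>
    #include <cuda_runtime.h>

    void measure(long n) {                        // n nodes = 32n bytes
      node *a;
      cudaMallocManaged(&a, n * sizeof(node));
      std::vector<long> idx(n);
      for (long i = 0; i < n; i++) idx[i] = i;
      std::shuffle(idx.begin(), idx.end(), std::mt19937(42)); // random order
      for (long i = 0; i + 1 < n; i++) a[idx[i]].next = &a[idx[i + 1]];
      a[idx[n - 1]].next = &a[idx[0]];            // close the cycle
      long long *cycles; node **sink;
      cudaMallocManaged(&cycles, sizeof(*cycles));
      cudaMallocManaged(&sink, sizeof(*sink));
      chase<<<1, 1>>>(&a[idx[0]], 10 * n, cycles, sink);
      cudaDeviceSynchronize();
      printf("%ld bytes : %.1f cycles/load\n",
             n * (long)sizeof(node), *cycles / (10.0 * n));
    }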

SLIDE 9

Shared memory
