What You Must Know about Memory, Caches, and Shared Memory
Kenjiro Taura
Contents
1. Introduction
2. Organization of processors, caches, and memory
3. Caches
4. So how costly is it to access data?
5. How costly is it to communicate between threads?
Intel Xeon E5-2698 v3 (40M cache, 2.30 GHz):
http://ark.intel.com/products/81060/Intel-Xeon-Processor-E5-2698-v3-40M-Cache-2_30-GHz
double t0 = cur_time();
memcpy(a, b, nb);
double t1 = cur_time();
$ gcc -O3 memcpy.c
$ ./a.out $((1 << 26)) # 64M long elements = 512MB
536870912 bytes copied in 0.117333 sec 4.575611 GB/sec
[Figure: a chip with a memory controller, a shared L3 cache, and physical cores, each core with its own cache]
[Figure: a physical core with multi-level caches (L1 and L2)]
[Figure: a chip (socket, node, CPU) containing physical cores, each running hardware threads (virtual cores, CPUs) and holding private L1/L2 caches; the cores share an L3 cache and a memory controller, and chips are connected by an interconnect]
a 32KB cache with 64-byte lines (holds the 512 most recently accessed distinct blocks)
[Figure: address breakdown — bits 0–5 select the byte within a line (2^6 = 64 bytes); bits 6–14 index the set in the cache (2^9 = 512 sets); the remaining bits form the tag]
float a[100][8192];      /* column stride = 32KB: a[i][j] for fixed j all map to the same set */
float a[100][8192+16];   /* padding each row by 64 bytes spreads a column across sets */
for (long i = 0; i < N; i++) {
  p = p->next;
}
$ numactl --cpunodebind 0 --interleave 0 ./traverse

[Figure: latency per load in a list traversal (local memory), plotted against the size of the traversed region in bytes]
size (bytes)  level  latency (cycles)
12736         L1       3.73
101312        L2       9.69
1047232       L3      47.46
104387776     main   184.37

[Figure: latency per load vs. region size, with plateaus corresponding to L1, L2, L3, and main memory]
$ numactl --cpunodebind 0 --interleave 1 ./traverse

[Figure: latency per load in a list traversal, local vs. remote memory]
[Figure: bandwidth (GB/sec) vs. size of the region (bytes), local and remote]
[Figure: the same measurement restricted to regions ≥ 10^8 bytes (y-axis 0.1–0.8 GB/sec)]
for (long i = 0; i < N; i++) {
  p1 = p1->next;
  p2 = p2->next;
  ...
}
[Figure: bandwidth vs. region size with 1, 2, 4, 8, 10, 12, and 14 chains (local)]
[Figure: the same for regions ≥ 10^8 bytes, local (y-axis 1–6 GB/sec)]
[Figure: the same for regions ≥ 10^8 bytes, remote (y-axis 0.5–4 GB/sec)]
for (i = 0; i < M; i++)
  for (j = 0; j < N; j++)
    y[i] += a[i][j] * x[j];
[Figure: bandwidth of an address-ordered list traversal vs. a randomly ordered one]
for (long i = 0; i < N; i++) {
  sum += a[j];   /* unlike a list traversal, the next index does not depend on the loaded value */
  j = (j + s) % N;
}
[Figure: bandwidth of random list traversal vs. random array traversal]
[Figure: bandwidth of various access patterns for regions ≥ 10^8 bytes: list/index/sequential accesses, ordered/random, with 1 or 10 chains]
[Figure: bandwidth with 10 chains and 1, 2, 4, 8, 12, or 16 threads (local), regions ≥ 10^8 bytes]
x = 100;   /* one thread writes x */
... = x;   /* another thread reads it */
[Figure: step-by-step illustration of a cache line moving between cores — through L1/L2, the shared L3, and the interconnect — when one thread writes and another reads]
volatile long x = 0;
volatile long y = 0;

/* ping thread */
for (i = 0; i < n; i++) {
  x = i + 1;
  while (y <= i) ;
}

/* pong thread */
for (i = 0; i < n; i++) {
  while (x <= i) ;
  y = i + 1;
}
Measure it for all pairs of hardware threads (2p(p − 1) combinations) and show the results as a matrix.
[Figure: matrix of ping-pong times over all hardware-thread pairs (both axes 8–56), values roughly 200–1600]