Optimizing Indirect Memory References with milk
Vladimir Kiriansky, Yunming Zhang, Saman Amarasinghe MIT PACT ’16 September 13, 2016, Haifa, Israel
[Figure: Indirect accesses with OpenMP. Speedup (axis 1 to 5) of OpenMP vs. OpenMP + Milk; uniform indices in [0..100M), 8 threads, 8 MB L3. A second build of the slide adds the annotation if(!milk).]
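The benchmark setup above (uniformly distributed indices into a large array) can be illustrated with a minimal sequential sketch. The function names `uniform_indices` and `count_indirect` are hypothetical and not taken from the slides, and the OpenMP parallelization is omitted to keep the example free of data races.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: n indices drawn uniformly from [0, range).
std::vector<uint32_t> uniform_indices(std::size_t n, uint32_t range, uint32_t seed = 1) {
    assert(range > 0);
    std::mt19937 gen(seed);
    std::uniform_int_distribution<uint32_t> dist(0, range - 1);
    std::vector<uint32_t> d(n);
    for (auto &x : d) x = dist(gen);
    return d;
}

// Baseline indirect access: the address count[d[i]] is only known after
// d[i] has been loaded, so consecutive iterations touch unrelated cache lines
// when the indices are uniform over a range much larger than the cache.
void count_indirect(const std::vector<uint32_t> &d, std::vector<uint32_t> &count) {
    for (std::size_t i = 0; i < d.size(); ++i) {
        assert(d[i] < count.size());
        count[d[i]] += 1;
    }
}
```

With a range of 100M four-byte counters the target array is far larger than an 8 MB L3, which is the regime the figure refers to.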
[Figure sequence: Address vs. Time plots of memory accesses, built up over several slides; one build is labeled "8 threads".]
http://research.blogs.lincoln.ac.uk/ files/2011/02/map-of-internet.png
[Figure: Ideal Cache Hit % (0 to 100), with panels for Temporal Locality (infinite cache) and Spatial Locality (64 byte). Kernels [GAPBS]: Betweenness Centrality, Breadth-First Search, Connected Components, PageRank, Single-Source Shortest Paths. Graphs: R = Road (d=2.4), T = Twitter (d=24), W = Web (d=39).]
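One plausible reading of "ideal cache hit %" is the fraction of accesses that would hit in a cache of unlimited capacity at a given line granularity. The sketch below computes that quantity for an address trace; the function name and the choice of granularities are assumptions, not the measurement method used for the figure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Fraction of accesses that hit in an infinite cache whose lines are
// line_bytes wide. With line_bytes equal to the element size, only reuse of
// the same element counts (temporal locality); with line_bytes = 64, accesses
// to neighbours within a 64-byte line also count (spatial locality).
double ideal_hit_rate(const std::vector<uint64_t> &addrs, uint64_t line_bytes) {
    assert(line_bytes > 0);
    if (addrs.empty()) return 0.0;
    std::unordered_set<uint64_t> seen;  // lines touched so far
    std::size_t hits = 0;
    for (uint64_t a : addrs) {
        // insert() reports false in .second when the line was already present.
        if (!seen.insert(a / line_bytes).second) ++hits;
    }
    return static_cast<double>(hits) / static_cast<double>(addrs.size());
}
```

A real cache is finite, so these numbers are upper bounds on the achievable hit rate.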
[Animation: the indirect update "+= f(i);" applied through the index array d (7 14 5 18 7 7) into the array count (slots 1 to 19). The updates f(0) ... f(7) are first shown in program order, then regrouped by target index (5 7 7 7 14 18) in the order f(5) f(1) f(3) f(0) f(6) f(7) f(2) f(4).]
tag: address to group by
pack: additional state
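The tag and pack roles can be sketched as a deferred update: record each (tag, pack) pair, group the pairs by tag, and only then touch the target array. This is an illustration only. `std::stable_sort` stands in for whatever grouping Milk actually performs, the `Deferred` struct and function names are hypothetical, and `f` is a precomputed vector rather than the function f(i) from the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record: tag is the target index, pack is the value to add.
struct Deferred {
    uint32_t tag;
    double pack;
};

// Direct form: count[d[i]] += f[i], in program order.
void update_direct(const std::vector<uint32_t> &d, const std::vector<double> &f,
                   std::vector<double> &count) {
    assert(d.size() == f.size());
    for (std::size_t i = 0; i < d.size(); ++i) count[d[i]] += f[i];
}

// Deferred form: collect (tag, pack) pairs, group them by tag, then apply.
// After grouping, updates to the same or nearby targets are adjacent in time.
void update_grouped(const std::vector<uint32_t> &d, const std::vector<double> &f,
                    std::vector<double> &count) {
    assert(d.size() == f.size());
    std::vector<Deferred> pending;
    pending.reserve(d.size());
    for (std::size_t i = 0; i < d.size(); ++i) pending.push_back({d[i], f[i]});
    // Stable sort keeps program order among equal tags, so each target
    // receives its additions in the same order as in the direct form.
    std::stable_sort(pending.begin(), pending.end(),
                     [](const Deferred &a, const Deferred &b) { return a.tag < b.tag; });
    for (const Deferred &p : pending) count[p.tag] += p.pack;
}
```

Both functions produce the same `count`; the grouped form trades extra sequential traffic for the pending buffer against better locality on the random accesses.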
[Animation, continued: with the updates grouped by target index (5 7 7 7 14 18), updates that share a target are combined (+) before being applied to count; f(3) goes to target 5.]
parallel multipass radix partitioning
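As an illustration of the multipass idea, the sketch below partitions items on one group of tag bits per pass. It is a sequential, least-significant-digit-first variant chosen for brevity; the slides name a parallel scheme, which this does not reproduce. `Item`, `partition_pass`, and `multipass_partition` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Item {
    uint32_t tag;  // address to group by
    double pack;   // additional state
};

// One stable counting pass over `bits` bits of the tag, starting at `shift`.
std::vector<Item> partition_pass(const std::vector<Item> &in, unsigned shift, unsigned bits) {
    assert(bits > 0 && bits < 32 && shift < 32);
    const std::size_t buckets = std::size_t(1) << bits;
    const uint32_t mask = static_cast<uint32_t>(buckets - 1);
    // Histogram shifted by one slot, so the prefix sum yields bucket start offsets.
    std::vector<std::size_t> start(buckets + 1, 0);
    for (const Item &it : in) ++start[((it.tag >> shift) & mask) + 1];
    for (std::size_t b = 0; b < buckets; ++b) start[b + 1] += start[b];
    std::vector<Item> out(in.size());
    for (const Item &it : in) out[start[(it.tag >> shift) & mask]++] = it;
    return out;
}

// Repeats the pass from the lowest digit upward. Because every pass is stable,
// the result is ordered by the low `total_bits` bits of the tag.
std::vector<Item> multipass_partition(std::vector<Item> items, unsigned total_bits,
                                      unsigned bits_per_pass) {
    for (unsigned shift = 0; shift < total_bits; shift += bits_per_pass)
        items = partition_pass(items, shift, bits_per_pass);
    return items;
}
```

For example, `multipass_partition(items, 18, 9)` runs two 9-bit passes, so each pass scatters into only 512 destinations.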
9-bit radix partition pails
[Figure sequence: tagged values (for example p=7 with entries such as 17 and 0.5) are appended to pails held in L2; in later builds the pail contents appear in tubs held in DRAM.]
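The pails and tubs shown above can be read as two levels of buffering: small per-partition buffers that fit in cache, copied in bulk to larger per-partition buffers in main memory. The sketch below follows that reading with a 9-bit radix. The class name, the pail capacity of 16 items, and the flush-when-full policy are assumptions made for illustration, not details taken from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Item {
    uint32_t tag;  // address to group by
    double pack;   // additional state
};

constexpr unsigned kBits = 9;                        // 9-bit radix, as on the slide
constexpr std::size_t kPails = std::size_t(1) << kBits;
constexpr std::size_t kPailCapacity = 16;            // assumed size, small enough to stay cached

class PailPartitioner {
public:
    explicit PailPartitioner(unsigned shift)
        : shift_(shift), fill_(kPails, 0), pails_(kPails * kPailCapacity), tubs_(kPails) {
        assert(shift + kBits <= 32);
    }

    // Append to the small pail for this item's partition; flush when it is full.
    void push(Item it) {
        const std::size_t p = (it.tag >> shift_) & (kPails - 1);
        pails_[p * kPailCapacity + fill_[p]++] = it;
        if (fill_[p] == kPailCapacity) flush(p);
    }

    // Move any partially filled pails into their tubs.
    void flush_all() {
        for (std::size_t p = 0; p < kPails; ++p) flush(p);
    }

    const std::vector<Item> &tub(std::size_t p) const { return tubs_[p]; }

private:
    // Copy one pail to its tub in a single contiguous write.
    void flush(std::size_t p) {
        const Item *first = &pails_[p * kPailCapacity];
        tubs_[p].insert(tubs_[p].end(), first, first + fill_[p]);
        fill_[p] = 0;
    }

    unsigned shift_;
    std::vector<std::size_t> fill_;      // items currently in each pail
    std::vector<Item> pails_;            // kPails fixed-size buffers, stored contiguously
    std::vector<std::vector<Item>> tubs_;  // large per-partition buffers
};
```

The point of the small buffers is that random scatter happens only within the cached pails, while writes to the large buffers are sequential bursts.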
TLB, SIMD, NUMA, non-temporal writes, software write buffers
[Figure: Speedup (0x to 3x) on [GAPBS] kernels: BC (Betweenness Centrality), BFS (Breadth-First Search), CC (Connected Components), PR (PageRank), SSSP (Single-Source Shortest Paths). Labeled speedups: 1.4×, 2.7×. V=32M, [i7-4790K], 8 MB L3.]
[Figure: Cache Hit % (20 to 100) for BC, BFS, CC, PR, SSSP, baseline vs. milk. V=32M, [i7-4790K], 8 MB L3, 256 KB L2. DRAM: > 80% → < 22%.]
[Figure: % of Total Cycles (0% to 100%) spent in L2 miss stalls (256 KB L2) and L3 miss stalls (8 MB L3), baseline vs. milk. PageRank, V=32M, d=16, uniform.]
baseline: 6 of 7 cycles stalled!
[Figure: Speedup (0x to 3x) for BC, BFS, CC, PR, SSSP at V = 2M, 8M, 32M. d=16, uniform, 8 MB L3, [i7-4790K].]
[Figure: CountDegree speedup (0x to 5x) vs. Average Degree (1, 2, 4, 8, 16, 32, 64) for V=16M and V=32M; 16M edges to 2B edges.]
http://milk-lang.org/
Category   Social                           Web              Road
Graph      Facebook   Twitter   Twitter62   CC12     .sk     US
Vertices   1.5 B      300 M     62 M        3.5 B    51 M    24 M
Degree     290        200       24          36       39      2.4
[Backstrom14][Ching15][Beamer15] [CommonCrawl]
[Figure: Cumulative Edges % (0% to 100%) vs. Vertex Degree Rank for RMAT25 and Uniform25 (V=32M, d=16) and Twitter’ (V=62M, d=24); the L3 capacity is marked on the rank axis.]