Thread and Memory Placement on NUMA Systems: Asymmetry Matters
Baptiste Lepers, Alexandra Fedorova (Simon Fraser University), Vivien Quéma (Grenoble INP) ATC 2015
Introduction
Current thread and memory placement: minimizing hop-count
[Diagram: interconnect topology of the 8-node machine (Node 0 to Node 7). Legend: 8b link, 16b link, 16b/8b link.]
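The link legend (8b and 16b links) hints at why hop-count alone may be a poor placement metric: two node pairs can be one hop apart yet connected by links of different width. The sketch below illustrates this on a made-up 4-node topology. The topology, the `LINK_WIDTH_BITS` table, and the helper names are assumptions for illustration only and do not describe the machine in the diagram.

```python
# Toy 4-node topology (hypothetical, NOT the machine from the slides).
# Each direct link is tagged with its width in bits, as in the legend.
LINK_WIDTH_BITS = {
    frozenset({0, 1}): 16,
    frozenset({0, 2}): 8,
    frozenset({1, 3}): 16,
    frozenset({2, 3}): 8,
}


def hop_count(a, b):
    """Hops between two nodes; 2 is the maximum in this toy topology."""
    if a == b:
        return 0
    return 1 if frozenset({a, b}) in LINK_WIDTH_BITS else 2


def link_width(a, b):
    """Width in bits of the direct link, or 0 if the nodes are not adjacent."""
    return LINK_WIDTH_BITS.get(frozenset({a, b}), 0)


# Two candidate placements for a pair of communicating thread groups.
candidates = [(0, 1), (0, 2)]

by_hops = {pair: hop_count(*pair) for pair in candidates}
best_by_width = max(candidates, key=lambda pair: link_width(*pair))

print("hop counts:", by_hops)              # both candidates are 1 hop apart
print("widest direct link:", best_by_width)
```

A hop-count policy sees the two candidates as equivalent, while a policy that looks at link width prefers the pair joined by the wider link.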
Figure 2: Performance difference between the best and worst thread placement with respect to the average thread placement.
[Bar charts, three panels. Y-axis: performance difference relative to average placement (%). Series: Worst Placement, Best Placement. Benchmarks: bt.B.x, cg.C.x, ep.C.x, ft.C.x, is.D.x, lu.B.x, mg.C.x, sp.A.x, ua.B.x, swaptions, kmeans, matrixmultiply, wc, wr, wrmem; graph500, specjbb; streamcluster, pca, facerec.]
Figure 3: Difference in latency of memory accesses between the best and worst thread placement with respect to the average thread placement.
[Bar charts, three panels. Y-axis: latency of memory accesses compared to average placement (cycles). Series: Worst Placement, Best Placement. Benchmarks: bt.B.x, cg.C.x, ep.C.x, ft.C.x, is.D.x, lu.B.x, mg.C.x, sp.A.x, ua.B.x, swaptions, kmeans, matrixmultiply, wc, wr, wrmem; graph500, specjbb; streamcluster, pca, facerec.]
[Table: one row per node configuration and master thread node. The row labels were drawn as small topology diagrams on the slide and are not reproduced here; "-" marks a cell that is absent from the extracted text.]

Execution   Diff with   Latency of memory accesses      % accesses via   Bandwidth to the
time (s)    0-1 (%)     (cycles) (compared to 0-1)      2-hop links      "master" node (MB/s)
-           0%          750                             -                5598
-           56%         1169 (56%)                      -                2999
228         56%         1179 (57%)                      -                2973
168         15%         855 (14%)                       -                4329
340         133%        1527 (104%)                     98               1915
185         27%         1040 (39%)                      98               3741
340         133%        1601 (113%)                     98               1903
228         56%         1206 (61%)                      98               2884
185         27%         1020 (36%)                      -                3748
338         132%        1614 (115%)                     98               1928
338         132%        1612 (115%)                     98               1891
230         58%         1200 (60%)                      -                2880
167         15%         867 (16%)                       98               3748
225         54%         1220 (63%)                      -                3014
230         58%         1205 (60%)                      -                2959
226         55%         1203 (60%)                      98               2880
◮ Threads on same node share data
◮ Between nodes with 'high' communication threads of same
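One way to read the first bullet is that threads sharing data should be grouped into a cluster and the cluster kept on a single node. The slides do not show how the grouping is computed, so the following is only a minimal sketch under that assumption: it uses a union-find structure over hypothetical `sharing_pairs` (pairs of thread IDs observed to share data) and then assigns one cluster per node. The function name, the input format, and the one-cluster-per-node mapping are all illustrative choices, not the authors' algorithm.

```python
from collections import defaultdict


def cluster_threads(num_threads, sharing_pairs):
    """Group threads that (transitively) share data, using union-find."""
    parent = list(range(num_threads))

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in sharing_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for thread in range(num_threads):
        clusters[find(thread)].append(thread)
    return sorted(clusters.values())


# Hypothetical input: threads 0-2 share data, threads 3-4 share data, 5 is alone.
clusters = cluster_threads(6, [(0, 1), (1, 2), (3, 4)])

# Illustrative mapping: cluster i goes to node i, so sharers end up co-located.
node_of_thread = {
    thread: node for node, members in enumerate(clusters) for thread in members
}

print(clusters)
print(node_of_thread)
```

Deciding which node each cluster should go to, given the link widths between nodes, is a separate step that this sketch leaves out.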
                                     cg.B    ft.C     is.D   sp.A   streamcluster   graph500   specJBB
Migrated memory (GB)                 0.17     2.5       20    0.1            0.15        0.3        10
Average time - Linux syscall (ms)     860   12700   101000    490             750       1500     50500
Average time - fast migration (ms)     51     380     3050     30              45         90      1500
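The migration figures above are easier to compare as ratios. The short script below takes the values exactly as listed and derives, per benchmark, the speedup of fast migration over the Linux syscall and the effective migration throughput. The dictionary layout and helper names are illustrative assumptions; only the numbers come from the table.

```python
# (migrated GB, Linux syscall ms, fast migration ms), copied from the table.
MIGRATION = {
    "cg.B": (0.17, 860, 51),
    "ft.C": (2.5, 12700, 380),
    "is.D": (20, 101000, 3050),
    "sp.A": (0.1, 490, 30),
    "streamcluster": (0.15, 750, 45),
    "graph500": (0.3, 1500, 90),
    "specJBB": (10, 50500, 1500),
}


def speedup(linux_ms, fast_ms):
    """How many times faster the fast migration is than the Linux syscall."""
    return linux_ms / fast_ms


def throughput_gb_per_s(gigabytes, millis):
    """Effective migration throughput in GB/s."""
    return gigabytes / (millis / 1000.0)


speedups = {
    name: speedup(linux_ms, fast_ms)
    for name, (_, linux_ms, fast_ms) in MIGRATION.items()
}

for name, (gb, linux_ms, fast_ms) in MIGRATION.items():
    print(
        f"{name:14s} speedup {speedups[name]:5.1f}x  "
        f"linux {throughput_gb_per_s(gb, linux_ms):.2f} GB/s  "
        f"fast {throughput_gb_per_s(gb, fast_ms):.2f} GB/s"
    )
```

On these numbers the speedup ranges from about 16x to about 34x, with the larger migrations (ft.C, is.D, specJBB) at the upper end.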
Figure 4: Performance difference between the best and worst static thread placement, dynamic memory placement, and AsymSched.
[Bar charts, three panels. Y-axis: performance difference relative to average placement (%). Series: Worst placement, Best placement, Dynamic Memory Placement Only, AsymSched. Benchmarks: bt.B.x, cg.C.x, ep.C.x, ft.C.x, is.D.x, lu.B.x, mg.C.x, sp.A.x, ua.B.x, swaptions, kmeans, matrixmultiply, wc, wr, wrmem; graph500, specjbb; streamcluster, pca, facerec.]
Figure 5: Memory latency under the best and worst static thread placement, dynamic memory placement, AsymSched
[Bar charts, three panels. Y-axis: latency of memory accesses compared to average placement (cycles). Series: Worst Placement, Best Placement, Dynamic Memory Placement Only, AsymSched. Benchmarks: bt.B.x, cg.C.x, ep.C.x, ft.C.x, is.D.x, lu.B.x, mg.C.x, sp.A.x, ua.B.x, swaptions, kmeans, matrixmultiply, wc, wr, wrmem; graph500, specjbb; streamcluster, pca, facerec.]
[Bar charts, two panels, multi-application workloads. Y-axes: performance difference relative to average placement (%); latency of memory accesses compared to average placement (cycles). Series: Worst Thread Placement, Best Thread Placement, Dynamic Memory Placement, AsymSched. Bar labels, in slide order: specjbb-3, graph500-3, matrixmultiply-2, streamcluster-3, graph500-3, specjbb-2, streamcluster-3, streamcluster-3, streamcluster-2, specjbb-5, matrixmultiply-3, specjbb-5, streamcluster-3.]