Thread and Memory Placement on NUMA Systems: Asymmetry Matters (PowerPoint PPT Presentation)



SLIDE 1

Thread and Memory Placement on NUMA Systems: Asymmetry Matters

Baptiste Lepers, Alexandra Fedorova (Simon Fraser University), Vivien Quéma (Grenoble INP) ATC 2015

1 / 12

SLIDE 2

Introduction

Current thread and memory placement policies minimize hop-count (e.g., in Linux). Contributions:

◮ Interconnect links are asymmetric; bandwidth matters more than hop count.

◮ AsymSched, an algorithm that dynamically places threads and memory.


SLIDE 3

Inter-node bandwidths for 4 AMD Opteron 6272 processors

[Figure 1: interconnect topology of the 8 NUMA nodes (0-7); links are 8-bit, 16-bit, or mixed 16-bit/8-bit.]


SLIDE 4

Measurements

Applications running on 3 nodes, with different node placements.

[Figure 2: Performance difference between the best and worst thread placement, relative to the average placement (%). Benchmarks: bt.B, cg.C, ep.C, ft.C, is.D, lu.B, mg.C, sp.A, ua.B, swaptions, kmeans, matrixmultiply, wc, wr, wrmem (up to ~15%); graph500, specjbb (up to ~40%); streamcluster, pca, facerec (up to ~100%).]

[Figure 3: Difference in latency of memory accesses between the best and worst thread placement, relative to the average placement (cycles): up to ~150 cycles for the first group of benchmarks, ~200 for graph500 and specjbb, and ~1000 for streamcluster, pca, and facerec.]


SLIDE 5

More Measurements

streamcluster running on 2 nodes, with different node placements.

Table: streamcluster on 2 nodes under different node placements (the node-pair labels for each row were garbled in extraction; the first row is the baseline 0-1 placement).

Execution time (s) | Diff. with 0-1 (%) | Latency of memory accesses (cycles, vs. 0-1) | % accesses via 2-hop links | Bandwidth to the “master” node (MB/s)
148 | 0%   | 750 (0%)    | –  | 5598
228 | 56%  | 1169 (56%)  | –  | 2999
228 | 56%  | 1179 (57%)  | –  | 2973
168 | 15%  | 855 (14%)   | –  | 4329
340 | 133% | 1527 (104%) | 98 | 1915
185 | 27%  | 1040 (39%)  | 98 | 3741
340 | 133% | 1601 (113%) | 98 | 1903
228 | 56%  | 1206 (61%)  | 98 | 2884
185 | 27%  | 1020 (36%)  | –  | 3748
338 | 132% | 1614 (115%) | 98 | 1928
338 | 132% | 1612 (115%) | 98 | 1891
230 | 58%  | 1200 (60%)  | –  | 2880
167 | 15%  | 867 (16%)   | 98 | 3748
225 | 54%  | 1220 (63%)  | –  | 3014
230 | 58%  | 1205 (60%)  | –  | 2959
226 | 55%  | 1203 (60%)  | 98 | 2880


SLIDE 6

AsymSched

◮ User-level thread + memory placement manager.

◮ Continuously measures communication.

◮ Decides every second whether threads/memory should be migrated.


SLIDE 7

AsymSched – Measurement

◮ Reads hardware counters (data accesses from each CPU to each node).

◮ No CPU-to-CPU counter is available.

◮ Assumes, for decision making:

◮ Threads on the same node share data.

◮ Between nodes with “high” communication, threads of the same application share data.
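The measurement step can be sketched as follows. This is an illustration, not the paper's implementation: the counter values and the 4-CPU, 2-node machine are made up, since the real system samples hardware performance counters.

```python
# Sketch of AsymSched's measurement step (illustrative; the real system
# reads hardware counters for "data accesses from CPU to node").
# cpu_to_node_accesses[cpu][node] = sampled access count (made-up numbers).

NODE_OF_CPU = {0: 0, 1: 0, 2: 1, 3: 1}  # hypothetical 4-CPU, 2-node machine

cpu_to_node_accesses = {
    0: {0: 9000, 1: 4000},
    1: {0: 8500, 1: 3500},
    2: {0: 4200, 1: 9100},
    3: {0: 100,  1: 9700},
}

def node_comm_matrix(samples, node_of_cpu):
    """Aggregate per-CPU access counts into a node-to-node matrix."""
    matrix = {}
    for cpu, per_node in samples.items():
        src = node_of_cpu[cpu]
        for dst, count in per_node.items():
            matrix[(src, dst)] = matrix.get((src, dst), 0) + count
    return matrix

def high_comm_pairs(matrix, threshold):
    """Node pairs with heavy remote traffic: AsymSched assumes threads of
    the same application on such pairs share data."""
    return sorted(p for p, c in matrix.items() if p[0] != p[1] and c >= threshold)

m = node_comm_matrix(cpu_to_node_accesses, NODE_OF_CPU)
print(high_comm_pairs(m, 5000))  # -> [(0, 1)]
```

The threshold for “high” communication is a free parameter here; the slides do not give the value AsymSched uses.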


SLIDE 8

AsymSched – Decision

◮ Puts threads of same application that share data into clusters. ◮ Each cluster gets weight

Cw = log (#remote memory accesses).

◮ For each placement (mapping of clusters to nodes), compute

Pw =

C∈Clusters Cw · (max bandwidth for C). ◮ Select placements whose Pw ≥ 90% of maximal Pw. Of those

choose that with least page migrations.

◮ If cost for memory migration (assuming 0.3s per GB) is too

high, do not apply placement.

◮ Because of symmetry, not all placements need to be tested.

Also “obviously bad” placement are ignored.
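The scoring step above can be sketched as follows. Everything concrete here is a made-up assumption for illustration: the bandwidth table, the cluster access counts, and the three candidate placements; only the Cw and Pw formulas and the 90% cutoff come from the slides.

```python
# Sketch of AsymSched's placement scoring (illustrative numbers).
import math

# Hypothetical max attainable bandwidth (MB/s) for a cluster placed on a
# given pair of nodes (in reality derived from the machine's topology).
BANDWIDTH = {
    frozenset({0, 1}): 5598,
    frozenset({0, 4}): 2999,
    frozenset({2, 3}): 5598,
    frozenset({1, 5}): 2973,
}

# Clusters of threads that share data, with their remote-access counts.
clusters = {"A": 1_000_000, "B": 10_000}

def weight(remote_accesses):
    # Cw = log(#remote memory accesses)
    return math.log(remote_accesses)

def score(placement):
    # Pw = sum over clusters of Cw * (max bandwidth for the cluster's nodes)
    return sum(weight(clusters[c]) * BANDWIDTH[frozenset(nodes)]
               for c, nodes in placement.items())

candidates = [
    {"A": (0, 1), "B": (2, 3)},   # both clusters on fast 16-bit links
    {"A": (0, 4), "B": (2, 3)},   # heavy cluster on a slow link
    {"A": (2, 3), "B": (1, 5)},   # light cluster on a slow link
]

best = max(score(p) for p in candidates)
# Keep placements within 90% of the best score; among these, AsymSched
# would then pick the one needing the fewest page migrations.
good = [p for p in candidates if score(p) >= 0.9 * best]
```

The log weight means a cluster with 100x more remote accesses does not dominate the score by 100x, so smaller clusters still influence the chosen placement.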


SLIDE 9

AsymSched – Migration

◮ Uses dynamic (lazy) migration.

◮ If after 2 seconds > 90% of accesses still go to the old node, performs a full migration.

◮ Full migration uses a special system call that is faster than migrate_pages because it stops the application and needs fewer locks.

                                   | cg.B | ft.C  | is.D   | sp.A | streamcluster | graph500 | specJBB
Migrated memory (GB)               | 0.17 | 2.5   | 20     | 0.1  | 0.15          | 0.3      | 10
Average time, Linux syscall (ms)   | 860  | 12700 | 101000 | 490  | 750           | 1500     | 50500
Average time, fast migration (ms)  | 51   | 380   | 3050   | 30   | 45            | 90       | 1500
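The migration policy can be sketched as two checks. The thresholds (0.3 s/GB, 90%, 2 s) are the ones quoted on the slides; the function names and the example workload numbers are made up for illustration.

```python
# Sketch of AsymSched's migration decisions (thresholds from the slides,
# workload numbers invented).

MIGRATION_COST_S_PER_GB = 0.3   # assumed cost of moving memory
OLD_NODE_THRESHOLD = 0.90       # fraction of accesses still hitting old node
LAZY_WINDOW_S = 2.0             # grace period before forcing a full move

def migration_worthwhile(gb_to_move, expected_gain_s):
    """Apply a new placement only if its expected gain outweighs the
    estimated cost of moving the memory."""
    return gb_to_move * MIGRATION_COST_S_PER_GB < expected_gain_s

def needs_full_migration(elapsed_s, old_node_fraction):
    """After lazy migration, force a full migration if most accesses are
    still served by the old node."""
    return elapsed_s >= LAZY_WINDOW_S and old_node_fraction > OLD_NODE_THRESHOLD

print(migration_worthwhile(gb_to_move=2.5, expected_gain_s=5.0))    # True
print(needs_full_migration(elapsed_s=2.1, old_node_fraction=0.95))  # True
```

For ft.C above (2.5 GB migrated), the assumed cost model predicts about 0.75 s of migration work, which is in the same ballpark as the 380 ms measured for the fast migration path.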


SLIDE 10

Evaluation – 1 application on 3 nodes

[Figure 4: Performance difference between the best and worst static thread placement, dynamic memory placement only, and AsymSched, relative to the average placement (%). Benchmarks: bt.B, cg.C, ep.C, ft.C, is.D, lu.B, mg.C, sp.A, ua.B, swaptions, kmeans, matrixmultiply, wc, wr, wrmem (up to ~15%); graph500, specjbb (up to ~40%); streamcluster, pca, facerec (up to ~250%).]

[Figure 5: Memory latency under the best and worst static thread placement, dynamic memory placement only, and AsymSched, relative to the average placement (cycles): up to ~250 cycles for the first group of benchmarks, ~200 for graph500 and specjbb, and ~1500 for streamcluster, pca, and facerec.]


SLIDE 11

Evaluation – 3 applications

[Figure 6: Multi-application workloads (specjbb-3 + graph500-3 + matrixmultiply-2, streamcluster-3 + graph500-3 + specjbb-2, streamcluster-3 + streamcluster-3 + streamcluster-2, specjbb-5 + matrixmultiply-3, specjbb-5 + streamcluster-3): performance improvement relative to the average placement (%, up to ~250) for the worst thread placement, best thread placement, dynamic memory placement, and AsymSched, and the corresponding differences in memory-access latency (cycles, up to ~2000).]


SLIDE 12

Discussion

◮ What’s the matter with memory migration?

◮ How well would this work without the magic constants?

◮ What if #threads is not a multiple of #cores in a NUMA domain?
