

  1. Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems
Mohammad Dashti 1, Alexandra Fedorova 1, Justin Funston 1, Fabien Gaud 1, Renaud Lachaize 2, Baptiste Lepers 3, Vivien Quéma 4, Mark Roth 1
1 Simon Fraser University  2 Université Joseph Fourier  3 CNRS  4 Grenoble INP
March 19, 2013
Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems 1 / 20

  2. New multicore machines are NUMA
[Diagram: a four-node machine, four cores per node, each node with its own memory controller (MC) and DRAM; local access costs 240 cycles at 5.5 GB/s, remote access 300 cycles at 2.8 GB/s]

  3. Well-known issue: remote access latency overhead
[Diagram: a thread on node 3 accessing memory on node 1; the remote access costs 300 cycles]
◮ Impacts performance by at most 30%

  4. New issue: memory controller and interconnect congestion
[Diagram: many threads contending for node 1's memory; under congestion the access latency rises to 1200 cycles]

  5. Current solutions
◮ Try to improve locality
  ◮ Thread scheduling and page migration (USENIX ATC’11)
  ◮ Thread clustering (EuroSys’07)
  ◮ Page replication (ASPLOS’96)
  ◮ Etc.
◮ But the main problem is MC/interconnect congestion

  6. MC/interconnect congestion impact on performance
◮ 16 threads, one per core
◮ Memory either allocated on first touch or interleaved
Example: streamcluster
[Diagram: with first touch, one node's controller serves 97% of memory requests (the other three serve 1% each); with interleaving, the load is balanced at 25% per node]
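The first-touch skew on this slide (97% of requests hitting one controller, versus 25% per controller when interleaved) can be reproduced with a toy model. This is a sketch of the two placement policies, not code from the talk; the node count, page count, and access pattern are illustrative assumptions:

```python
# Sketch: how first-touch vs. interleaved allocation distributes
# memory requests across memory controllers. Illustrative model only.

from collections import Counter

NODES = 4

def place_pages(num_pages, policy, first_touch_node=0):
    """Return the home node of each page under a placement policy."""
    if policy == "first_touch":
        # All pages are first touched on one node, e.g. because the
        # master thread initializes the whole data set.
        return [first_touch_node] * num_pages
    if policy == "interleave":
        # Pages are spread round-robin across all nodes.
        return [p % NODES for p in range(num_pages)]
    raise ValueError(policy)

def controller_load(page_homes, accesses):
    """Fraction of memory requests served by each node's controller."""
    hits = Counter(page_homes[p] for p in accesses)
    total = sum(hits.values())
    return {n: hits.get(n, 0) / total for n in range(NODES)}

# 16 threads each touch every page once (streamcluster-like sharing).
pages = 1024
accesses = list(range(pages)) * 16

ft = controller_load(place_pages(pages, "first_touch"), accesses)
il = controller_load(place_pages(pages, "interleave"), accesses)
print("first touch:", ft)   # one controller serves everything
print("interleave:", il)    # each controller serves an equal share
```

With all data first touched on one node, that node's controller serves 100% of requests; interleaving spreads the load to 25% per controller, which is the imbalance the slide illustrates.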

  7. MC/interconnect congestion impact on performance (2)
[Bar chart: performance difference between the best and worst placement policy, up to 100%, across BT, CG, DC, EP, FT, IS, LU, MG, SP, UA, bodytrack, facesim, fluidanimate, streamcluster, swaptions, x264, kmeans, matrixmult, PCA, wrmem; the best policy is first touch for some applications and interleaving for others]

  8. Why do applications benefit from interleaving? (1)

Streamcluster                        Interleaving   First touch
Local access ratio                   25%            25%
Memory latency (cycles)              471            1169
Memory controller imbalance          7%             200%
Interconnect imbalance               21%            86%
Perf. improvement / first touch      105%           -

⇒ Interconnect and memory controller congestion drive up memory access latency

  9. Why do applications benefit from interleaving? (2)

PCA                                  Interleaving   First touch
Local access ratio                   25%            33%
Memory latency (cycles)              480            665
Memory controller imbalance          4%             154%
Interconnect imbalance               19%            64%
Perf. improvement / first touch      38%            -

⇒ Balancing load on memory controllers is more important than improving locality
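The imbalance percentages in these tables can be derived from per-controller load. One plausible definition (an assumption here, not stated on the slides) is the standard deviation of the load across controllers divided by the mean, expressed as a percentage:

```python
# Sketch: a plausible "imbalance" metric for the tables above --
# standard deviation of per-controller load over the mean, in percent.
# The exact formula used in the talk is an assumption here.

import statistics

def imbalance_pct(loads):
    """Imbalance of a load distribution: stddev / mean * 100."""
    mean = statistics.mean(loads)
    if mean == 0:
        return 0.0
    return statistics.pstdev(loads) / mean * 100

balanced = [25, 25, 25, 25]   # interleaved: controllers equally loaded
skewed = [97, 1, 1, 1]        # first touch: one controller serves nearly all

print(imbalance_pct(balanced))  # 0.0
print(imbalance_pct(skewed))    # roughly 166
```

A perfectly balanced distribution scores 0%, while the first-touch distribution from the earlier streamcluster example scores far above 100%, in the same regime as the 154-200% figures in the tables.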

  10. Conclusions
◮ Balance is more important than locality
◮ Memory controller and interconnect congestion can drive up access latency
◮ Always manually interleaving memory is NOT the way to go
[Bar chart: manual interleaving vs. Linux on the NAS benchmarks (BT, CG, DC, EP, FT, IS, LU, MG, SP, UA); performance ranges from roughly +10% to -40%]
⇒ Need a new solution

  11. Carrefour: a new memory traffic management algorithm
◮ First goal: balance memory pressure on interconnect and MC
◮ Second goal: improve locality

  12. Mechanism #1: Page relocation
[Diagram: initial state; a thread accesses a page sitting on a remote node]

  13. Mechanism #1: Page relocation
[Diagram: the page has been migrated to the accessing thread's node]
+ Better locality
+ Lower interconnect load
+ Balanced load on MC
- Cannot be applied if a region is shared by multiple threads
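The relocation condition can be sketched as a small decision function; this is my own illustration of the rule on the slide, not Carrefour's implementation. A page observed to be accessed by a single node is migrated to that node, while shared pages are left alone:

```python
# Sketch: deciding whether a page qualifies for relocation.
# A page accessed from only one node migrates to that node;
# shared pages are skipped (relocation does not apply to them).
# The data representation is an illustrative assumption.

def relocation_target(accessing_nodes, home_node):
    """Return the node to migrate the page to, or None."""
    nodes = set(accessing_nodes)
    if len(nodes) == 1:
        (node,) = nodes
        if node != home_node:
            return node   # private page on the wrong node: migrate
    return None           # shared page, or already local: leave it

print(relocation_target([2, 2, 2], home_node=0))  # 2: migrate to node 2
print(relocation_target([0, 1, 2], home_node=0))  # None: shared page
print(relocation_target([0, 0], home_node=0))     # None: already local
```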

  14. Mechanism #2: Page replication
[Diagram: initial state; threads on several nodes access a page on one node]

  15. Mechanism #2: Page replication
[Diagram: the page is copied so each accessing node has a local replica]
+ Better locality
+ Lower interconnect load
+ Balanced load on MC
- Higher memory consumption
- Expensive synchronization
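Because replicas must be kept consistent, replication only pays off for read-mostly pages shared across nodes. A hedged sketch of that trade-off (the threshold is my illustrative assumption, not a value from the talk):

```python
# Sketch: deciding whether a shared page is worth replicating.
# Replication helps read-mostly pages accessed from several nodes;
# write-heavy pages would need costly replica synchronization.

READ_RATIO_THRESHOLD = 0.95   # assumed cutoff, not from the talk

def should_replicate(reads, writes, num_accessing_nodes):
    total = reads + writes
    if total == 0 or num_accessing_nodes < 2:
        return False          # unused or private page: nothing to gain
    return reads / total >= READ_RATIO_THRESHOLD

print(should_replicate(reads=990, writes=10, num_accessing_nodes=4))   # True
print(should_replicate(reads=500, writes=500, num_accessing_nodes=4))  # False
```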

  16. Mechanism #3: Page interleaving
[Diagram: initial state; one node's controller serves most of the pages]

  17. Mechanism #3: Page interleaving
[Diagram: the pages are spread across all four nodes]
+ Balanced load on interconnect
+ Balanced load on MC
- Can decrease locality

  18. Carrefour in detail
◮ Goal: combine these techniques to:
  1. Balance memory pressure
  2. Increase locality
[Decision flow in three stages:
 Per-application profiling — global application metrics (memory intensity, memory imbalance, local access ratio, memory read ratio) and per-page metrics (RW ratio, set of accessing nodes);
 Per-application decisions — is memory congested? enable migrations? enable interleaving? enable replications?
 Per-page decisions — migrate, interleave, or replicate each page]
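The per-page stage of this flow can be sketched as one decision function that ties the three mechanisms together. This is my own hedged reconstruction of the slide's flow chart; the threshold and exact ordering are assumptions:

```python
# Sketch of the per-page decision, reconstructed from the slide's flow:
# private pages migrate, read-mostly shared pages replicate, and other
# shared pages interleave. Thresholds and ordering are assumptions.

def decide(page, migrations_on, replications_on, interleaving_on):
    """page: dict with 'accessing_nodes' (set), 'read_ratio', 'home'."""
    nodes = page["accessing_nodes"]
    if not nodes:
        return "leave"
    if len(nodes) == 1:                       # private page
        (node,) = nodes
        if migrations_on and node != page["home"]:
            return "migrate"
        return "leave"
    # Shared page: replicate if read-mostly, otherwise interleave.
    if replications_on and page["read_ratio"] >= 0.95:
        return "replicate"
    if interleaving_on:
        return "interleave"
    return "leave"

private = {"accessing_nodes": {2}, "read_ratio": 0.5, "home": 0}
shared_ro = {"accessing_nodes": {0, 1, 2, 3}, "read_ratio": 0.99, "home": 0}
shared_rw = {"accessing_nodes": {0, 1}, "read_ratio": 0.4, "home": 0}

print(decide(private, True, True, True))    # migrate
print(decide(shared_ro, True, True, True))  # replicate
print(decide(shared_rw, True, True, True))  # interleave
```

The per-application flags (`migrations_on`, `replications_on`, `interleaving_on`) stand in for the "enable …?" decisions in the middle stage of the flow, which gate the per-page mechanisms globally.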

  19. Carrefour in detail
[Same decision flow as before; gathering global metrics (via hardware counters, HWC) and per-page metrics (via instruction-based sampling, IBS) is flagged as expensive]
◮ Accurate and low-overhead page access statistics
  ◮ Adaptive IBS sampling
  ◮ Include cache accesses
  ◮ Use hardware counter feedback
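Adaptive sampling can be pictured as a feedback loop: profile per-page accesses at a low rate by default, and raise the rate only when hardware counters indicate that the memory system is congested. The sketch below is an illustration of that idea, not Carrefour's actual code; the rates and latency threshold are assumptions:

```python
# Sketch: adaptive sampling rate driven by hardware-counter feedback.
# Keep profiling cheap when the memory system is healthy; sample
# aggressively only under congestion. All constants are assumptions.

LOW_RATE = 1 / 65536      # sample 1 in 64K ops when uncongested
HIGH_RATE = 1 / 4096      # sample 1 in 4K ops under congestion
LATENCY_THRESHOLD = 600   # cycles; assumed congestion cutoff

def sampling_rate(avg_memory_latency_cycles):
    """Pick an IBS-style sampling rate from observed memory latency."""
    if avg_memory_latency_cycles > LATENCY_THRESHOLD:
        return HIGH_RATE
    return LOW_RATE

print(sampling_rate(471))   # low rate: memory system is healthy
print(sampling_rate(1169))  # high rate: congestion detected
```

The two example latencies are the interleaved and first-touch figures from the streamcluster table earlier in the deck.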

  20. Carrefour in detail
[Same decision flow; page replication is flagged as the other expensive step]
◮ Efficient page replication
  ◮ Use a careful implementation (fine-grained locks)
  ◮ Prevent data synchronization

  21. Evaluation
◮ Carrefour is implemented in Linux 3.6
◮ Machines
  ◮ 16 cores, 4 nodes, 64 GB of RAM
  ◮ 24 cores, 4 nodes, 64 GB of RAM
◮ Benchmarks (23 applications)
  ◮ PARSEC
  ◮ FaceRec
  ◮ Metis (Map/Reduce)
  ◮ NAS
◮ Compare Carrefour to
  ◮ Linux (default)
  ◮ Linux AutoNUMA
  ◮ Manual interleaving

  22. Performance
[Bar chart: performance improvement of AutoNUMA and Carrefour over default Linux, per application (including FaceRec, PCA, and streamcluster); Carrefour's gains reach up to 270%, AutoNUMA's are much smaller and sometimes negative]
⇒ Carrefour significantly improves performance!

  23. Carrefour overhead

Configuration   Maximum overhead vs. default
AutoNUMA        25%
Carrefour       4%

◮ Carrefour's average overhead when no decisions are taken: 2%

  24. Conclusion
◮ In modern NUMA systems:
  ◮ Remote latency overhead is not the main bottleneck
  ◮ MC and interconnect congestion can drive up memory latency
◮ Carrefour: a memory traffic management algorithm
  ◮ First goal: balance memory pressure on interconnect and MC
  ◮ Second goal: improve locality
◮ Performance:
  ◮ Improves performance significantly (up to 270%)
  ◮ Outperforms other solutions
