SLIDE 1

Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems

Mohammad Dashti1, Alexandra Fedorova1, Justin Funston1, Fabien Gaud1, Renaud Lachaize2, Baptiste Lepers3, Vivien Quéma4, Mark Roth1

1 Simon Fraser University   2 Université Joseph Fourier   3 CNRS   4 Grenoble INP

March 19, 2013

1 / 20 Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems

SLIDE 2

New multicore machines are NUMA

[Diagram: four-node NUMA machine. Each node has four cores, local DRAM, and a memory controller (MC). Local access: 240 cycles / 5.5 GB/s; remote access: 300 cycles / 2.8 GB/s.]

SLIDE 3

Well-known issue: remote-access latency overhead

[Diagram: a thread on one node accessing memory on a remote node; the remote access costs 300 cycles.]

◮ Impacts performance by at most 30%

SLIDE 4

New issue: memory controller and interconnect congestion

[Diagram: a thread accessing memory through a congested memory controller and interconnect; latency reaches 1200 cycles.]

SLIDE 5

Current solutions

◮ Try to improve locality
  ◮ Thread scheduling and page migration (USENIX ATC’11)
  ◮ Thread clustering (EuroSys’07)
  ◮ Page replication (ASPLOS’96)
  ◮ Etc.
◮ But the main problem is MC/interconnect congestion

SLIDE 6

MC/interconnect congestion: impact on performance

◮ 16 threads, one per core
◮ Memory either allocated on first touch or interleaved

Example: Streamcluster
◮ First-touch scenario: memory concentrated on one node (1% / 1% / 1% / 97%)
◮ Interleave scenario: memory spread evenly (25% / 25% / 25% / 25%)
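The two scenarios above can be reproduced with a toy placement model (a sketch for intuition only, not the authors' code): first touch places every page on the node of the thread that first initializes it, while interleaving stripes pages round-robin across nodes.

```python
# Toy model of the two allocation policies compared on this slide.

def place_pages(num_pages, num_nodes, policy, init_node=0):
    """Return per-node page counts under a given placement policy."""
    counts = [0] * num_nodes
    for page in range(num_pages):
        if policy == "first_touch":
            # A page lands on the node whose thread first touches it; if one
            # thread initializes all the data, every page lands on its node.
            counts[init_node] += 1
        elif policy == "interleave":
            # Pages are striped round-robin across all nodes.
            counts[page % num_nodes] += 1
    return counts

print(place_pages(1000, 4, "first_touch"))  # -> [1000, 0, 0, 0]
print(place_pages(1000, 4, "interleave"))   # -> [250, 250, 250, 250]
```

With a single initializing thread, first touch concentrates nearly all pages, and hence all memory traffic, on one node, matching the 97% skew shown for Streamcluster.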

SLIDE 7

MC/interconnect congestion: impact on performance (2)

[Bar chart: performance difference (%) between the best and worst placement policy for BT, CG, DC, EP, FT, IS, LU, MG, SP, UA, bodytrack, facesim, fluidanimate, streamcluster, swaptions, x264, kmeans, matrixmult, PCA, and wrmem; bars are marked by whether the best policy is first touch or interleaving.]

◮ Up to 100% performance difference

SLIDE 8

Why do applications benefit from interleaving? (1)

Streamcluster:

                                 Interleaving   First touch
  Local access ratio                  25%           25%
  Memory latency (cycles)             471          1169
  Memory controller imbalance          7%          200%
  Interconnect imbalance              21%           86%

◮ Performance improvement over first touch: 105%
◮ ⇒ Interconnect and memory controller congestion drive up memory access latency
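The imbalance rows in the table can be reproduced with a simple metric. The slide does not spell out the formula, so the sketch below assumes imbalance is the standard deviation of per-controller load relative to the mean:

```python
import statistics

def imbalance(loads):
    """Imbalance of per-node memory-controller loads, as stddev/mean (%).
    0% means perfectly balanced; larger values mean more skew.
    (Assumed metric; the paper may define imbalance differently.)"""
    mean = statistics.mean(loads)
    return 100.0 * statistics.pstdev(loads) / mean

# Interleaved traffic: four controllers, roughly equal load.
print(imbalance([100, 100, 100, 100]))        # -> 0.0
# First-touch skew: almost all traffic hits one controller.
print(round(imbalance([970, 10, 10, 10])))    # -> 166
```

Under this metric, a perfectly interleaved workload scores 0%, while the first-touch skew from the previous slide lands in the same high range as the 200% reported for Streamcluster.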

SLIDE 9

Why do applications benefit from interleaving? (2)

PCA:

                                 Interleaving   First touch
  Local access ratio                  25%           33%
  Memory latency (cycles)             480           665
  Memory controller imbalance          4%          154%
  Interconnect imbalance              19%           64%

◮ Performance improvement over first touch: 38%
◮ ⇒ Balancing load on memory controllers is more important than improving locality

SLIDE 10

Conclusions

◮ Balance is more important than locality
◮ Memory controller and interconnect congestion can drive up access latency
◮ Always manually interleaving memory is NOT the way to go

[Bar chart: performance improvement of manual interleaving with respect to Linux (%) on BT, CG, DC, EP, FT, IS, LU, MG, SP, and UA; interleaving helps some applications but hurts others.]

⇒ Need a new solution

SLIDE 11

Carrefour: a new memory traffic management algorithm

◮ First goal: balance memory pressure on interconnect and MC
◮ Second goal: improve locality

SLIDE 12

Mechanism #1: Page relocation

[Diagram: four-node NUMA machine; a thread's memory sits on a remote node before relocation.]

SLIDE 13

Mechanism #1: Page relocation (continued)

[Same diagram after relocation: the pages have been migrated to the thread's node.]

◮ Better locality
◮ Lower interconnect load
◮ Balanced load on MC
◮ Cannot be applied if a region is shared by multiple threads
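The applicability rule above can be sketched as a small predicate (hypothetical code, not Carrefour's): a page qualifies for relocation only when exactly one node accesses it, and that node is not already where the page lives.

```python
# Hypothetical sketch of the page-relocation rule: migrate a page to the
# one node that uses it; skip pages shared by multiple nodes.

def relocation_target(page_node, accessing_nodes):
    """Return the node to migrate the page to, or None when relocation
    does not apply (shared page, or already local)."""
    if len(accessing_nodes) != 1:
        return None                  # shared by multiple threads/nodes
    (user_node,) = tuple(accessing_nodes)
    return user_node if user_node != page_node else None

print(relocation_target(0, {2}))     # -> 2    (migrate to node 2)
print(relocation_target(0, {0}))     # -> None (already local)
print(relocation_target(0, {1, 2}))  # -> None (shared: cannot relocate)
```

On Linux, the migration itself could then be carried out with the move_pages(2) system call.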

SLIDE 14

Mechanism #2: Page replication

[Diagram: four-node NUMA machine; a page is accessed from several nodes before replication.]

SLIDE 15

Mechanism #2: Page replication (continued)

[Same diagram after replication: each accessing node holds a local copy of the page.]

◮ Better locality
◮ Lower interconnect load
◮ Balanced load on MC
◮ Higher memory consumption
◮ Expensive synchronization
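A minimal model of the trade-offs listed above (a sketch, not Carrefour's implementation): replication gives each node a local copy for reads, but a write must collapse the replicas to keep them coherent, which is why replication pays off only for (mostly) read-only pages.

```python
class ReplicatedPage:
    """Toy model of page replication: one copy per accessing node for reads,
    collapsed back to a single copy on the first write (simplified; a real
    kernel would need careful synchronization here)."""

    def __init__(self, data, nodes):
        self.copies = {n: data for n in nodes}   # higher memory consumption

    def read(self, node):
        # Local copy if the node has one, otherwise any remote copy.
        return self.copies.get(node, next(iter(self.copies.values())))

    def write(self, node, data):
        # Keeping replicas coherent is expensive; this model simply
        # collapses them to one up-to-date copy on the writer's node.
        self.copies = {node: data}

page = ReplicatedPage("v0", nodes={0, 1, 2})
print(len(page.copies))   # -> 3 (one replica per accessing node)
page.write(1, "v1")
print(len(page.copies))   # -> 1 (a write collapses replication)
```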

SLIDE 16

Mechanism #3: Page interleaving

[Diagram: four-node NUMA machine; a hot region's pages sit on one node before interleaving.]

SLIDE 17

Mechanism #3: Page interleaving (continued)

[Same diagram after interleaving: the pages are spread round-robin across all four nodes.]

◮ Balanced load on interconnect
◮ Balanced load on MC
◮ Can decrease locality

SLIDE 18

Carrefour in detail

◮ Goal: combine these techniques to:
  1. Balance memory pressure
  2. Increase locality

[Flowchart: per-application profiling produces global metrics (memory intensity, memory imbalance, local access ratio, memory read ratio); per-application decisions ("Is memory congested? Enable migrations? Enable interleaving? Enable replication?") gate per-page decisions (migrate / interleave / replicate each page), which are driven by per-page metrics (read/write ratio, set of accessing nodes).]
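The per-page branch of the flowchart can be sketched as follows. The structure (single-node pages migrate, read-mostly shared pages replicate, written shared pages interleave) follows the metrics named above; the 0.95 read-ratio threshold and the function name are illustrative assumptions, not the paper's values.

```python
# Assumed sketch of Carrefour's per-page decision, reconstructed from the
# flowchart: pick a mechanism from a page's accessing nodes and R/W ratio.

def per_page_decision(accessing_nodes, read_ratio,
                      migrations=True, interleaving=True, replication=True):
    """Return 'migrate', 'replicate', 'interleave', or None for one page.
    The enable flags model the per-application decisions that gate each
    mechanism when memory is (or is not) congested."""
    if len(accessing_nodes) == 1:
        # Used from a single node: move the page next to its user.
        return "migrate" if migrations else None
    if read_ratio > 0.95:   # threshold is an assumption, not the paper's
        # Shared but (almost) read-only: one copy per node is safe.
        return "replicate" if replication else None
    # Shared and written: spread pages to balance MC/interconnect load.
    return "interleave" if interleaving else None

print(per_page_decision({2}, 0.50))      # -> migrate
print(per_page_decision({0, 3}, 0.99))   # -> replicate
print(per_page_decision({0, 1}, 0.40))   # -> interleave
```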

SLIDE 19

Carrefour in detail (continued)

[Same flowchart; the hardware-counter (HWC) and IBS profiling inputs are both marked as expensive.]

◮ Accurate and low-overhead page access statistics
  ◮ Adaptive IBS sampling
  ◮ Include cache accesses
  ◮ Use hardware-counter feedback
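Adaptive IBS sampling with hardware-counter feedback might look like the following sketch; the thresholds, periods, and the doubling/halving policy are all illustrative assumptions, not the paper's.

```python
# Hypothetical sketch of adaptive sampling: sample aggressively only while
# hardware counters indicate the application is memory-intensive.

def next_sampling_period(period, mem_intensity,
                         threshold=0.5, min_period=1_000, max_period=100_000):
    """Shrink the IBS sampling period (sample more often) for memory-bound
    phases; grow it (cheaper profiling) otherwise. All constants here are
    illustrative, not from the paper."""
    if mem_intensity > threshold:
        period //= 2    # memory-bound: pay for finer-grained samples
    else:
        period *= 2     # not memory-bound: back off to cut overhead
    return max(min_period, min(period, max_period))

print(next_sampling_period(10_000, 0.9))  # -> 5000
print(next_sampling_period(10_000, 0.1))  # -> 20000
```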

SLIDE 20

Carrefour in detail (continued)

[Same flowchart.]

◮ Efficient page replication
  ◮ Careful implementation (fine-grained locks)
  ◮ Prevent data synchronization

SLIDE 21

Evaluation

◮ Carrefour is implemented in Linux 3.6
◮ Machines:
  ◮ 16 cores, 4 nodes, 64 GB of RAM
  ◮ 24 cores, 4 nodes, 64 GB of RAM
◮ Benchmarks (23 applications): PARSEC, FaceRec, Metis (Map/Reduce), NAS
◮ Carrefour compared against: Linux (default), Linux AutoNUMA, manual interleaving

SLIDE 22

Performance

[Bar chart: performance improvement with respect to Linux (%) for AutoNUMA and Carrefour on Facesim, Streamcluster, FaceRec, FaceRecLong, PCA, EP, and SP; Carrefour's gains reach up to 270%.]

⇒ Carrefour significantly improves performance!

SLIDE 23

Carrefour overhead

  Configuration   Maximum overhead vs. default
  AutoNUMA        25%
  Carrefour       4%

◮ Carrefour's average overhead when no decisions are taken: 2%

SLIDE 24

Conclusion

◮ In modern NUMA systems:
  ◮ Remote latency overhead is not the main bottleneck
  ◮ MC and interconnect congestion can drive up memory latency
◮ Carrefour: a memory traffic management algorithm
  ◮ First goal: balance memory pressure on interconnect and MC
  ◮ Second goal: improve locality
◮ Performance:
  ◮ Improves performance significantly (up to 270%)
  ◮ Outperforms other solutions

SLIDE 25

Questions?

https://github.com/Carrefour


SLIDE 26

Carrefour supports multi-application workloads

[Bar chart: performance improvement with respect to Linux (%) for AutoNUMA, manual interleaving, and Carrefour on co-scheduled pairs: MG + Streamcluster, PCA + Streamcluster, and FaceRecLong + Streamcluster; both applications in each pair are reported.]

SLIDE 27

Detailed profiling

[Three bar charts comparing Linux, AutoNUMA, manual interleaving, and Carrefour on Facesim, Streamcluster, FaceRec, FaceRecLong, PCA, MG, and SP: (1) load imbalance on memory controllers (%), (2) ratio of local memory accesses (%), (3) average latency (cycles/request).]

SLIDE 28

Energy consumption

[Bar chart: increase in energy consumption and completion time with respect to Linux (%) for AutoNUMA, manual interleaving, and Carrefour on SP, MG, PCA, FaceRecLong, FaceRec, Streamcluster, and Facesim.]