SLIDE 1

A Case for NUMA-aware Contention Management on Multicore Systems

Sergey Blagodurov (sergey_blagodurov@sfu.ca), Sergey Zhuravlev (sergey_zhuravlev@sfu.ca), Mohammad Dashti (mohammad_dashti@sfu.ca), Alexandra Fedorova (alexandra_fedorova@sfu.ca)

USENIX ATC’11 / Scheduling session, 15th of June

SLIDE 2

An AMD Opteron 8356 Barcelona domain

[Diagram: one NUMA domain. Four cores (Core 0–3), each with private L1/L2 caches, share an L3 cache; a System Request Interface and crossbar switch connect them to the memory controller (serving Memory node 0) and to HyperTransport links leading to the other domains.]
SLIDE 3

An AMD Opteron system with 4 domains

[Diagram: four NUMA domains (0–3) holding Cores 0–15. Each domain contains four cores with private L1/L2 caches and a shared L3 cache, plus a memory controller (MC), HyperTransport (HT) links, and a local memory node (0–3).]
SLIDE 4

Contention for the shared last-level cache (CA)

[Diagram: the same four-domain system; the shared L3 cache within each domain is highlighted as the contended resource.]
SLIDE 5

Contention for the memory controller (MC)

[Diagram: the same four-domain system; each domain's memory controller is highlighted as the contended resource.]
SLIDE 6

Contention for the inter-domain interconnect (IC)

[Diagram: the same four-domain system; the HyperTransport links between domains are highlighted as the contended resource.]
SLIDE 7

Remote access latency (RL)

[Diagram: the same four-domain system; thread A accesses memory on a remote node, paying remote access latency across the interconnect.]
SLIDE 8

Isolating memory controller contention (MC)

[Diagram: threads A and B are placed so that both access Memory node 0, isolating contention for that node's memory controller from the other degradation factors.]
SLIDE 9

Dominant degradation factors

Memory controller (MC) and interconnect (IC) contention are the key factors hurting performance.
SLIDE 10

Contention-aware scheduling

Characterization method
  • Given two threads, decide whether they will hurt each other's performance if co-scheduled

Scheduling algorithm
  • Separate threads that are expected to interfere

[Diagram: competing threads A and B are moved apart onto different domains.]
SLIDE 11

Characterization method

Limited observability
  • We do not know for sure whether threads compete, or how severely
  • The hardware does not tell us

Trial and error is infeasible on large systems
  • Can't try all possible combinations
  • Even sampling becomes difficult

A good trade-off: measure the LLC miss rate! (See the sketch below.)
  • Assumes that threads interfere if they have high miss rates
  • Does not account for the impact of cache contention
  • Works well because cache contention is not dominant
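The slides do not show how the miss rate is collected. A minimal sketch of per-thread LLC miss-rate sampling with Linux perf might look like the following; llc_missrate is a hypothetical helper, not part of the authors' scheduler, event names vary by CPU, and since the slides do not specify the normalization, misses per 1000 instructions is an assumption here:

    import subprocess

    def llc_missrate(pid, interval=1.0):
        # Count LLC misses and retired instructions for one thread over
        # `interval` seconds, then normalize to misses per 1000 instructions.
        cmd = ["perf", "stat", "-x", ",",
               "-e", "LLC-load-misses,instructions",
               "-p", str(pid), "--", "sleep", str(interval)]
        # perf prints its CSV counter lines to stderr.
        out = subprocess.run(cmd, capture_output=True, text=True).stderr
        counts = {}
        for line in out.splitlines():
            fields = line.split(",")
            if len(fields) >= 3 and fields[0].strip().isdigit():
                counts[fields[2]] = int(fields[0])
        misses = counts.get("LLC-load-misses", 0)
        instrs = max(counts.get("instructions", 0), 1)
        return 1000.0 * misses / instrs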
SLIDE 12

Our previous work: Distributed Intensity (DI-Plain), an algorithm for UMA systems

Goal: isolate threads that compete for shared resources.
  • Sort threads by LLC miss rate
  • Migrate competing threads to different domains (see the sketch below)

[Diagram: a high-contention placement puts the high-miss threads together in one domain; the low-contention placement spreads threads A, B, C, D, X, Y across domains 1 and 2.]
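A minimal sketch of the DI placement idea under assumed interfaces (the real scheduler works online and binds threads to specific cores; di_plain and its inputs are illustrative names):

    def di_plain(threads, num_domains):
        # threads: list of (pid, llc_missrate) pairs.
        # Sort by miss rate, then deal the threads out in snake order, so
        # each domain pairs high-miss threads with low-miss ones instead
        # of concentrating the memory-intensive threads together.
        ranked = sorted(threads, key=lambda t: t[1], reverse=True)
        placement = {d: [] for d in range(num_domains)}
        for i, (pid, _) in enumerate(ranked):
            idx = i % (2 * num_domains)
            d = idx if idx < num_domains else 2 * num_domains - 1 - idx
            placement[d].append(pid)
        return placement

On a UMA machine, separating the threads is the whole story; the next slides show why it is not enough once each domain has its own memory node.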
SLIDE 13

Failing to migrate memory leaves MC contention in place and introduces RL

[Diagram: threads A and B; a migrated thread's memory stays behind on Memory node 0, so the thread now pays remote access latency while node 0's memory controller remains contended.]

SLIDE 14

DI-Plain hurts performance on NUMA systems because it does not migrate memory!

[Chart: % improvement over DEFAULT for SPEC CPU 2006 and SPEC MPI 2007 workloads.]
SLIDE 15

Solution #1: Distributed Intensity with memory migration (DI-Migrate)

Goal: isolate threads that compete for shared resources, and pull the memory to the local node upon migration.
  • Sort threads by LLC miss rate
  • Migrate competing threads, along with their memory, to different domains (see the sketch below)

[Diagram: threads A, B, C, D, X, Y are redistributed across domains 1 and 2, and the migrated threads' memory follows them to the local memory node.]
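From user level, "migrate the thread along with its memory" can be approximated with the taskset and migratepages tools (the latter ships with the numactl package). This is a sketch under those assumptions, not the authors' implementation (their scheduler, Clavis, is linked on slide 23):

    import subprocess

    def migrate_thread_with_memory(pid, core, old_node, new_node):
        # Re-pin the thread to a core in the destination domain...
        subprocess.run(["taskset", "-pc", str(core), str(pid)], check=True)
        # ...then pull its resident pages after it, so it stops paying
        # remote access latency to the old node.
        subprocess.run(["migratepages", str(pid),
                        str(old_node), str(new_node)], check=True)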
SLIDE 16

DI-Migrate performs too many migrations for MPI. Migrations are expensive on NUMA systems.

[Chart: % improvement over DEFAULT; SPEC CPU 2006 (low migration rate) vs. SPEC MPI 2007 (high migration rate).]
SLIDE 17

Migrating too frequently causes IC contention

[Diagram: threads A and B migrate back and forth between domains; moving their memory across the interconnect each time makes the interconnect itself a bottleneck.]

SLIDE 18

Solution #2: Distributed Intensity NUMA Online (DINO)

DI-Migrate: threads are sorted by miss rate; if array positions change, we migrate the thread and its memory.
DINO: threads are sorted by class; we only migrate if a thread jumps from one class to another (see the sketch below).

Classes: C1 ≤ 10, 10 < C2 ≤ 100, 100 < C3.

[Example: per-thread miss rates 1, 2, 3, 5, 7, 7, 12, 15, 21, 27, 35, 47, 51, 78, 92, 110, 150, 170, 190, 200 fall into the three classes; reordering within a class no longer triggers migration.]
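A minimal sketch of DINO's migration filter, using the class boundaries from this slide (function names are illustrative):

    def miss_class(rate):
        # Class boundaries from the slide: C1 <= 10, 10 < C2 <= 100, 100 < C3.
        if rate <= 10:
            return 1
        if rate <= 100:
            return 2
        return 3

    def should_migrate(old_rate, new_rate):
        # DI-Migrate reacts to any change in the sorted order; DINO acts
        # only when a thread crosses a class boundary, filtering out the
        # migrations that a noisy miss rate would otherwise trigger.
        return miss_class(old_rate) != miss_class(new_rate)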
SLIDE 19

There is only a loose correlation between miss rate and degradation, so most migrations will not pay off.
SLIDE 20

DINO significantly reduces the number of migrations

[Chart: average number of memory migrations per hour of execution, DI-Migrate vs. DINO.]
SLIDE 21

DINO results

[Chart: % improvement over DEFAULT for SPEC CPU 2006, SPEC MPI 2007, and LAMP workloads.]
SLIDE 22

Summary

On NUMA systems we need to schedule threads and memory:
  • Memory controller contention arises when memory is not migrated
  • Interconnect contention arises when memory is migrated too frequently

DINO is the contention-aware scheduling algorithm for NUMA systems that:
  • migrates the memory along with the application
  • eliminates excessive migrations by trying to keep workloads on their old nodes, if possible
  • utilizes Instruction-Based Sampling to perform partial memory migration of "hot" pages (see the sketch below)
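The slides do not show how the IBS-driven partial migration is performed. A sketch of moving a set of "hot" pages with libnuma's numa_move_pages wrapper around the Linux move_pages(2) syscall could look like this; collecting the hot addresses from IBS samples is out of scope here, and migrate_hot_pages is an illustrative name, not the authors' code:

    import ctypes
    import ctypes.util

    libnuma = ctypes.CDLL(ctypes.util.find_library("numa"), use_errno=True)
    MPOL_MF_MOVE = 2  # move only pages owned by the target process

    def migrate_hot_pages(pid, hot_addrs, target_node):
        # hot_addrs: page-aligned addresses identified as hot (e.g. via IBS).
        n = len(hot_addrs)
        pages = (ctypes.c_void_p * n)(*hot_addrs)
        nodes = (ctypes.c_int * n)(*([target_node] * n))
        status = (ctypes.c_int * n)()
        # int numa_move_pages(int pid, unsigned long count, void **pages,
        #                     const int *nodes, int *status, int flags);
        rc = libnuma.numa_move_pages(pid, n, pages, nodes, status,
                                     MPOL_MF_MOVE)
        if rc != 0:
            raise OSError(ctypes.get_errno(), "numa_move_pages failed")
        return list(status)  # per-page result: node number or negative errno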
SLIDE 23
For further information

  • Read our Linux Symposium 2011 paper: “User-level scheduling on NUMA multicore systems under Linux”
  • Source code is available at: http://clavis.sourceforge.net
SLIDE 24

Any [time for] questions?

A Case for NUMA-aware Contention Management on Multicore Systems