SLIDE 1

A Case for NUMA-aware Contention Management on Multicore Systems

Sergey Blagodurov (sergey_blagodurov@sfu.ca), Sergey Zhuravlev (sergey_zhuravlev@sfu.ca), Mohammad Dashti (mohammad_dashti@sfu.ca), Alexandra Fedorova (alexandra_fedorova@sfu.ca)

USENIX ATC’11 / Scheduling session, 15th of June

SLIDE 2

An AMD Opteron 8356 Barcelona domain

[Diagram: one NUMA domain. Four cores (Core 0–3), each with private L1/L2 caches, share an L3 cache; a System Request Interface and crossbar switch connect them to the memory controller (serving Memory node 0) and to HyperTransport links leading to the other domains.]
SLIDE 3

An AMD Opteron system with 4 domains

[Diagram: four NUMA domains (0–3) holding Cores 0–15. Each domain contains four cores with private L1/L2 caches and a shared L3 cache, plus a memory controller (MC), HyperTransport (HT) links, and a local memory node (0–3).]
SLIDE 4

Contention for the shared last-level cache (CA)

[Diagram: the same four-domain system; the shared L3 cache within each domain is highlighted as the contended resource.]
SLIDE 5

Contention for the memory controller (MC)

[Diagram: the same four-domain system; each domain's memory controller is highlighted as the contended resource.]
SLIDE 6

Contention for the inter-domain interconnect (IC)

[Diagram: the same four-domain system; the HyperTransport links between domains are highlighted as the contended resource.]
SLIDE 7

Remote access latency (RL)

[Diagram: the same four-domain system; thread A accesses memory on a remote node, paying remote access latency across the interconnect.]
SLIDE 8

Isolating memory controller contention (MC)

[Diagram: threads A and B are placed so that both access Memory node 0, isolating contention for that node's memory controller from the other degradation factors.]
SLIDE 9

Dominant degradation factors

Memory controller (MC) and interconnect (IC) contention are the key factors hurting performance.
SLIDE 10

Contention-aware scheduling

Characterization method
  • Given two threads, decide whether they will hurt each other's performance if co-scheduled

Scheduling algorithm
  • Separate threads that are expected to interfere

[Diagram: competing threads A and B are moved apart onto different domains.]
SLIDE 11

Characterization method

Limited observability
  • We do not know for sure whether threads compete, or how severely
  • The hardware does not tell us

Trial and error is infeasible on large systems
  • Can't try all possible combinations
  • Even sampling becomes difficult

A good trade-off: measure the LLC miss rate! (See the sketch below.)
  • Assumes that threads interfere if they have high miss rates
  • Does not account for the impact of cache contention
  • Works well because cache contention is not dominant
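The slides do not show how the miss rate is collected. A minimal sketch of per-thread LLC miss-rate sampling with Linux perf might look like the following; llc_missrate is a hypothetical helper, not part of the authors' scheduler, event names vary by CPU, and since the slides do not specify the normalization, misses per 1000 instructions is an assumption here:

    import subprocess

    def llc_missrate(pid, interval=1.0):
        # Count LLC misses and retired instructions for one thread over
        # `interval` seconds, then normalize to misses per 1000 instructions.
        cmd = ["perf", "stat", "-x", ",",
               "-e", "LLC-load-misses,instructions",
               "-p", str(pid), "--", "sleep", str(interval)]
        # perf prints its CSV counter lines to stderr.
        out = subprocess.run(cmd, capture_output=True, text=True).stderr
        counts = {}
        for line in out.splitlines():
            fields = line.split(",")
            if len(fields) >= 3 and fields[0].strip().isdigit():
                counts[fields[2]] = int(fields[0])
        misses = counts.get("LLC-load-misses", 0)
        instrs = max(counts.get("instructions", 0), 1)
        return 1000.0 * misses / instrs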
SLIDE 12

Our previous work: Distributed Intensity (DI-Plain), an algorithm for UMA systems

Goal: isolate threads that compete for shared resources.
  • Sort threads by LLC miss rate
  • Migrate competing threads to different domains (see the sketch below)

[Diagram: a high-contention placement puts the high-miss threads together in one domain; the low-contention placement spreads threads A, B, C, D, X, Y across domains 1 and 2.]
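A minimal sketch of the DI placement idea under assumed interfaces (the real scheduler works online and binds threads to specific cores; di_plain and its inputs are illustrative names):

    def di_plain(threads, num_domains):
        # threads: list of (pid, llc_missrate) pairs.
        # Sort by miss rate, then deal the threads out in snake order, so
        # each domain pairs high-miss threads with low-miss ones instead
        # of concentrating the memory-intensive threads together.
        ranked = sorted(threads, key=lambda t: t[1], reverse=True)
        placement = {d: [] for d in range(num_domains)}
        for i, (pid, _) in enumerate(ranked):
            idx = i % (2 * num_domains)
            d = idx if idx < num_domains else 2 * num_domains - 1 - idx
            placement[d].append(pid)
        return placement

On a UMA machine, separating the threads is the whole story; the next slides show why it is not enough once each domain has its own memory node.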
SLIDE 13

Failing to migrate memory leaves MC contention in place and introduces RL

[Diagram: threads A and B; a migrated thread's memory stays behind on Memory node 0, so the thread now pays remote access latency while node 0's memory controller remains contended.]

SLIDE 14

DI-Plain hurts performance on NUMA systems because it does not migrate memory!

[Chart: % improvement over DEFAULT for SPEC CPU 2006 and SPEC MPI 2007 workloads.]
SLIDE 15

Solution #1: Distributed Intensity with memory migration (DI-Migrate)

Goal: isolate threads that compete for shared resources, and pull the memory to the local node upon migration.
  • Sort threads by LLC miss rate
  • Migrate competing threads, along with their memory, to different domains (see the sketch below)

[Diagram: threads A, B, C, D, X, Y are redistributed across domains 1 and 2, and the migrated threads' memory follows them to the local memory node.]
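From user level, "migrate the thread along with its memory" can be approximated with the taskset and migratepages tools (the latter ships with the numactl package). This is a sketch under those assumptions, not the authors' implementation (their scheduler, Clavis, is linked on slide 23):

    import subprocess

    def migrate_thread_with_memory(pid, core, old_node, new_node):
        # Re-pin the thread to a core in the destination domain...
        subprocess.run(["taskset", "-pc", str(core), str(pid)], check=True)
        # ...then pull its resident pages after it, so it stops paying
        # remote access latency to the old node.
        subprocess.run(["migratepages", str(pid),
                        str(old_node), str(new_node)], check=True)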
SLIDE 16

DI-Migrate performs too many migrations for MPI. Migrations are expensive on NUMA systems.

[Chart: % improvement over DEFAULT; SPEC CPU 2006 (low migration rate) vs. SPEC MPI 2007 (high migration rate).]
SLIDE 17

Migrating too frequently causes IC contention

[Diagram: threads A and B migrate back and forth between domains; moving their memory across the interconnect each time makes the interconnect itself a bottleneck.]

SLIDE 18

Solution #2: Distributed Intensity NUMA Online (DINO)

DI-Migrate: threads are sorted by miss rate; if array positions change, we migrate the thread and its memory.
DINO: threads are sorted by class; we only migrate if a thread jumps from one class to another (see the sketch below).

Classes: C1 ≤ 10, 10 < C2 ≤ 100, 100 < C3.

[Example: per-thread miss rates 1, 2, 3, 5, 7, 7, 12, 15, 21, 27, 35, 47, 51, 78, 92, 110, 150, 170, 190, 200 fall into the three classes; reordering within a class no longer triggers migration.]
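A minimal sketch of DINO's migration filter, using the class boundaries from this slide (function names are illustrative):

    def miss_class(rate):
        # Class boundaries from the slide: C1 <= 10, 10 < C2 <= 100, 100 < C3.
        if rate <= 10:
            return 1
        if rate <= 100:
            return 2
        return 3

    def should_migrate(old_rate, new_rate):
        # DI-Migrate reacts to any change in the sorted order; DINO acts
        # only when a thread crosses a class boundary, filtering out the
        # migrations that a noisy miss rate would otherwise trigger.
        return miss_class(old_rate) != miss_class(new_rate)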
SLIDE 19

There is only a loose correlation between miss rate and degradation, so most migrations will not pay off.
SLIDE 20

DINO significantly reduces the number of migrations

[Chart: average number of memory migrations per hour of execution, DI-Migrate vs. DINO.]
SLIDE 21

DINO results

[Chart: % improvement over DEFAULT for SPEC CPU 2006, SPEC MPI 2007, and LAMP workloads.]
SLIDE 22

Summary

On NUMA systems we need to schedule threads and memory:
  • Memory controller contention arises when memory is not migrated
  • Interconnect contention arises when memory is migrated too frequently

DINO is the contention-aware scheduling algorithm for NUMA systems that:
  • migrates the memory along with the application
  • eliminates excessive migrations by trying to keep workloads on their old nodes, if possible
  • utilizes Instruction-Based Sampling to perform partial memory migration of "hot" pages (see the sketch below)
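The slides do not show how the IBS-driven partial migration is performed. A sketch of moving a set of "hot" pages with libnuma's numa_move_pages wrapper around the Linux move_pages(2) syscall could look like this; collecting the hot addresses from IBS samples is out of scope here, and migrate_hot_pages is an illustrative name, not the authors' code:

    import ctypes
    import ctypes.util

    libnuma = ctypes.CDLL(ctypes.util.find_library("numa"), use_errno=True)
    MPOL_MF_MOVE = 2  # move only pages owned by the target process

    def migrate_hot_pages(pid, hot_addrs, target_node):
        # hot_addrs: page-aligned addresses identified as hot (e.g. via IBS).
        n = len(hot_addrs)
        pages = (ctypes.c_void_p * n)(*hot_addrs)
        nodes = (ctypes.c_int * n)(*([target_node] * n))
        status = (ctypes.c_int * n)()
        # int numa_move_pages(int pid, unsigned long count, void **pages,
        #                     const int *nodes, int *status, int flags);
        rc = libnuma.numa_move_pages(pid, n, pages, nodes, status,
                                     MPOL_MF_MOVE)
        if rc != 0:
            raise OSError(ctypes.get_errno(), "numa_move_pages failed")
        return list(status)  # per-page result: node number or negative errno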
SLIDE 23
For further information

  • Read our Linux Symposium 2011 paper: “User-level scheduling on NUMA multicore systems under Linux”
  • Source code is available at: http://clavis.sourceforge.net
SLIDE 24

Any [time for] questions?

A Case for NUMA-aware Contention Management on Multicore Systems