SLIDE 1

Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems

Mohammad Dashti1, Alexandra Fedorova1, Justin Funston1, Fabien Gaud1, Renaud Lachaize2, Baptiste Lepers3, Vivien Quéma4, Mark Roth1

1 Simon Fraser University   2 Université Joseph Fourier   3 CNRS   4 Grenoble INP

March 19, 2013

1 / 20 Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems

SLIDE 2

New multicore machines are NUMA

[Diagram: four-node NUMA machine. Each node has four cores, local DRAM, and a memory controller (MC). Local access: 240 cycles / 5.5 GB/s; remote access: 300 cycles / 2.8 GB/s.]

SLIDE 3

Well-known issue: remote-access latency overhead

[Diagram: a thread on one node accessing memory on a remote node; the remote access costs 300 cycles.]

◮ Impacts performance by at most 30%

SLIDE 4

New issue: memory controller and interconnect congestion

[Diagram: a thread accessing memory through a congested memory controller and interconnect; latency reaches 1200 cycles.]

SLIDE 5

Current solutions

◮ Try to improve locality
  ◮ Thread scheduling and page migration (USENIX ATC’11)
  ◮ Thread clustering (EuroSys’07)
  ◮ Page replication (ASPLOS’96)
  ◮ Etc.
◮ But the main problem is MC/interconnect congestion

SLIDE 6

MC/interconnect congestion: impact on performance

◮ 16 threads, one per core
◮ Memory either allocated on first touch or interleaved

Example: Streamcluster
◮ First-touch scenario: memory concentrated on one node (1% / 1% / 1% / 97%)
◮ Interleave scenario: memory spread evenly (25% / 25% / 25% / 25%)
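The two scenarios above can be reproduced with a toy placement model (a sketch for intuition only, not the authors' code): first touch places every page on the node of the thread that first initializes it, while interleaving stripes pages round-robin across nodes.

```python
# Toy model of the two allocation policies compared on this slide.

def place_pages(num_pages, num_nodes, policy, init_node=0):
    """Return per-node page counts under a given placement policy."""
    counts = [0] * num_nodes
    for page in range(num_pages):
        if policy == "first_touch":
            # A page lands on the node whose thread first touches it; if one
            # thread initializes all the data, every page lands on its node.
            counts[init_node] += 1
        elif policy == "interleave":
            # Pages are striped round-robin across all nodes.
            counts[page % num_nodes] += 1
    return counts

print(place_pages(1000, 4, "first_touch"))  # -> [1000, 0, 0, 0]
print(place_pages(1000, 4, "interleave"))   # -> [250, 250, 250, 250]
```

With a single initializing thread, first touch concentrates nearly all pages, and hence all memory traffic, on one node, matching the 97% skew shown for Streamcluster.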

SLIDE 7

MC/interconnect congestion: impact on performance (2)

[Bar chart: performance difference (%) between the best and worst placement policy for BT, CG, DC, EP, FT, IS, LU, MG, SP, UA, bodytrack, facesim, fluidanimate, streamcluster, swaptions, x264, kmeans, matrixmult, PCA, and wrmem; bars are marked by whether the best policy is first touch or interleaving.]

◮ Up to 100% performance difference

SLIDE 8

Why do applications benefit from interleaving? (1)

Streamcluster:

                                 Interleaving   First touch
  Local access ratio                  25%           25%
  Memory latency (cycles)             471          1169
  Memory controller imbalance          7%          200%
  Interconnect imbalance              21%           86%

◮ Performance improvement over first touch: 105%
◮ ⇒ Interconnect and memory controller congestion drive up memory access latency
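The imbalance rows in the table can be reproduced with a simple metric. The slide does not spell out the formula, so the sketch below assumes imbalance is the standard deviation of per-controller load relative to the mean:

```python
import statistics

def imbalance(loads):
    """Imbalance of per-node memory-controller loads, as stddev/mean (%).
    0% means perfectly balanced; larger values mean more skew.
    (Assumed metric; the paper may define imbalance differently.)"""
    mean = statistics.mean(loads)
    return 100.0 * statistics.pstdev(loads) / mean

# Interleaved traffic: four controllers, roughly equal load.
print(imbalance([100, 100, 100, 100]))        # -> 0.0
# First-touch skew: almost all traffic hits one controller.
print(round(imbalance([970, 10, 10, 10])))    # -> 166
```

Under this metric, a perfectly interleaved workload scores 0%, while the first-touch skew from the previous slide lands in the same high range as the 200% reported for Streamcluster.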

SLIDE 9

Why do applications benefit from interleaving? (2)

PCA:

                                 Interleaving   First touch
  Local access ratio                  25%           33%
  Memory latency (cycles)             480           665
  Memory controller imbalance          4%          154%
  Interconnect imbalance              19%           64%

◮ Performance improvement over first touch: 38%
◮ ⇒ Balancing load on memory controllers is more important than improving locality

SLIDE 10

Conclusions

◮ Balance is more important than locality
◮ Memory controller and interconnect congestion can drive up access latency
◮ Always manually interleaving memory is NOT the way to go

[Bar chart: performance improvement of manual interleaving with respect to Linux (%) on BT, CG, DC, EP, FT, IS, LU, MG, SP, and UA; interleaving helps some applications but hurts others.]

⇒ Need a new solution

SLIDE 11

Carrefour: a new memory traffic management algorithm

◮ First goal: balance memory pressure on interconnect and MC
◮ Second goal: improve locality

SLIDE 12

Mechanism #1: Page relocation

[Diagram: four-node NUMA machine; a thread's memory sits on a remote node before relocation.]

SLIDE 13

Mechanism #1: Page relocation (continued)

[Same diagram after relocation: the pages have been migrated to the thread's node.]

◮ Better locality
◮ Lower interconnect load
◮ Balanced load on MC
◮ Cannot be applied if a region is shared by multiple threads
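The applicability rule above can be sketched as a small predicate (hypothetical code, not Carrefour's): a page qualifies for relocation only when exactly one node accesses it, and that node is not already where the page lives.

```python
# Hypothetical sketch of the page-relocation rule: migrate a page to the
# one node that uses it; skip pages shared by multiple nodes.

def relocation_target(page_node, accessing_nodes):
    """Return the node to migrate the page to, or None when relocation
    does not apply (shared page, or already local)."""
    if len(accessing_nodes) != 1:
        return None                  # shared by multiple threads/nodes
    (user_node,) = tuple(accessing_nodes)
    return user_node if user_node != page_node else None

print(relocation_target(0, {2}))     # -> 2    (migrate to node 2)
print(relocation_target(0, {0}))     # -> None (already local)
print(relocation_target(0, {1, 2}))  # -> None (shared: cannot relocate)
```

On Linux, the migration itself could then be carried out with the move_pages(2) system call.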

SLIDE 14

Mechanism #2: Page replication

[Diagram: four-node NUMA machine; a page is accessed from several nodes before replication.]

SLIDE 15

Mechanism #2: Page replication (continued)

[Same diagram after replication: each accessing node holds a local copy of the page.]

◮ Better locality
◮ Lower interconnect load
◮ Balanced load on MC
◮ Higher memory consumption
◮ Expensive synchronization
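A minimal model of the trade-offs listed above (a sketch, not Carrefour's implementation): replication gives each node a local copy for reads, but a write must collapse the replicas to keep them coherent, which is why replication pays off only for (mostly) read-only pages.

```python
class ReplicatedPage:
    """Toy model of page replication: one copy per accessing node for reads,
    collapsed back to a single copy on the first write (simplified; a real
    kernel would need careful synchronization here)."""

    def __init__(self, data, nodes):
        self.copies = {n: data for n in nodes}   # higher memory consumption

    def read(self, node):
        # Local copy if the node has one, otherwise any remote copy.
        return self.copies.get(node, next(iter(self.copies.values())))

    def write(self, node, data):
        # Keeping replicas coherent is expensive; this model simply
        # collapses them to one up-to-date copy on the writer's node.
        self.copies = {node: data}

page = ReplicatedPage("v0", nodes={0, 1, 2})
print(len(page.copies))   # -> 3 (one replica per accessing node)
page.write(1, "v1")
print(len(page.copies))   # -> 1 (a write collapses replication)
```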

SLIDE 16

Mechanism #3: Page interleaving

[Diagram: four-node NUMA machine; a hot region's pages sit on one node before interleaving.]

SLIDE 17

Mechanism #3: Page interleaving (continued)

[Same diagram after interleaving: the pages are spread round-robin across all four nodes.]

◮ Balanced load on interconnect
◮ Balanced load on MC
◮ Can decrease locality

SLIDE 18

Carrefour in detail

◮ Goal: combine these techniques to:
  1. Balance memory pressure
  2. Increase locality

[Flowchart: per-application profiling produces global metrics (memory intensity, memory imbalance, local access ratio, memory read ratio); per-application decisions ("Is memory congested? Enable migrations? Enable interleaving? Enable replication?") gate per-page decisions (migrate / interleave / replicate each page), which are driven by per-page metrics (read/write ratio, set of accessing nodes).]
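The per-page branch of the flowchart can be sketched as follows. The structure (single-node pages migrate, read-mostly shared pages replicate, written shared pages interleave) follows the metrics named above; the 0.95 read-ratio threshold and the function name are illustrative assumptions, not the paper's values.

```python
# Assumed sketch of Carrefour's per-page decision, reconstructed from the
# flowchart: pick a mechanism from a page's accessing nodes and R/W ratio.

def per_page_decision(accessing_nodes, read_ratio,
                      migrations=True, interleaving=True, replication=True):
    """Return 'migrate', 'replicate', 'interleave', or None for one page.
    The enable flags model the per-application decisions that gate each
    mechanism when memory is (or is not) congested."""
    if len(accessing_nodes) == 1:
        # Used from a single node: move the page next to its user.
        return "migrate" if migrations else None
    if read_ratio > 0.95:   # threshold is an assumption, not the paper's
        # Shared but (almost) read-only: one copy per node is safe.
        return "replicate" if replication else None
    # Shared and written: spread pages to balance MC/interconnect load.
    return "interleave" if interleaving else None

print(per_page_decision({2}, 0.50))      # -> migrate
print(per_page_decision({0, 3}, 0.99))   # -> replicate
print(per_page_decision({0, 1}, 0.40))   # -> interleave
```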

SLIDE 19

Carrefour in detail (continued)

[Same flowchart; the hardware-counter (HWC) and IBS profiling inputs are both marked as expensive.]

◮ Accurate and low-overhead page access statistics
  ◮ Adaptive IBS sampling
  ◮ Include cache accesses
  ◮ Use hardware-counter feedback
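Adaptive IBS sampling with hardware-counter feedback might look like the following sketch; the thresholds, periods, and the doubling/halving policy are all illustrative assumptions, not the paper's.

```python
# Hypothetical sketch of adaptive sampling: sample aggressively only while
# hardware counters indicate the application is memory-intensive.

def next_sampling_period(period, mem_intensity,
                         threshold=0.5, min_period=1_000, max_period=100_000):
    """Shrink the IBS sampling period (sample more often) for memory-bound
    phases; grow it (cheaper profiling) otherwise. All constants here are
    illustrative, not from the paper."""
    if mem_intensity > threshold:
        period //= 2    # memory-bound: pay for finer-grained samples
    else:
        period *= 2     # not memory-bound: back off to cut overhead
    return max(min_period, min(period, max_period))

print(next_sampling_period(10_000, 0.9))  # -> 5000
print(next_sampling_period(10_000, 0.1))  # -> 20000
```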

SLIDE 20

Carrefour in detail (continued)

[Same flowchart.]

◮ Efficient page replication
  ◮ Careful implementation (fine-grained locks)
  ◮ Prevent data synchronization

SLIDE 21

Evaluation

◮ Carrefour is implemented in Linux 3.6
◮ Machines:
  ◮ 16 cores, 4 nodes, 64 GB of RAM
  ◮ 24 cores, 4 nodes, 64 GB of RAM
◮ Benchmarks (23 applications): PARSEC, FaceRec, Metis (Map/Reduce), NAS
◮ Carrefour compared against: Linux (default), Linux AutoNUMA, manual interleaving

SLIDE 22

Performance

[Bar chart: performance improvement with respect to Linux (%) for AutoNUMA and Carrefour on Facesim, Streamcluster, FaceRec, FaceRecLong, PCA, EP, and SP; Carrefour's gains reach up to 270%.]

⇒ Carrefour significantly improves performance!

SLIDE 23

Carrefour overhead

  Configuration   Maximum overhead vs. default
  AutoNUMA        25%
  Carrefour       4%

◮ Carrefour's average overhead when no decisions are taken: 2%

SLIDE 24

Conclusion

◮ In modern NUMA systems:
  ◮ Remote latency overhead is not the main bottleneck
  ◮ MC and interconnect congestion can drive up memory latency
◮ Carrefour: a memory traffic management algorithm
  ◮ First goal: balance memory pressure on interconnect and MC
  ◮ Second goal: improve locality
◮ Performance:
  ◮ Improves performance significantly (up to 270%)
  ◮ Outperforms other solutions

SLIDE 25

Questions?

https://github.com/Carrefour


SLIDE 26

Carrefour supports multi-application workloads

[Bar chart: performance improvement with respect to Linux (%) for AutoNUMA, manual interleaving, and Carrefour on co-scheduled pairs: MG + Streamcluster, PCA + Streamcluster, and FaceRecLong + Streamcluster; both applications in each pair are reported.]

SLIDE 27

Detailed profiling

[Three bar charts comparing Linux, AutoNUMA, manual interleaving, and Carrefour on Facesim, Streamcluster, FaceRec, FaceRecLong, PCA, MG, and SP: (1) load imbalance on memory controllers (%), (2) ratio of local memory accesses (%), (3) average latency (cycles/request).]

SLIDE 28

Energy consumption

[Bar chart: increase in energy consumption and completion time with respect to Linux (%) for AutoNUMA, manual interleaving, and Carrefour on SP, MG, PCA, FaceRecLong, FaceRec, Streamcluster, and Facesim.]