SLIDE 1

NUMA Support for Charm++

Does memory affinity matter?

Christiane Pousa Ribeiro, Maxime Martinasso, Jean-François Méhaut

SLIDE 2

Outline

  • Introduction
  • Motivation
  • NUMA Problem
  • Support NUMA for Charm++
  • First Results
  • Conclusion and Future work
SLIDE 3

Motivation for NUMA Platforms

  • The number of cores per processor is increasing
  • Hierarchical shared memory multiprocessors
  • cc-NUMA is coming back (NUMA factor)
  • AMD HyperTransport and Intel QuickPath

SLIDE 4

SLIDE 5

NUMA Problem

  • Remote access and memory contention
  • Optimizes:
  • Latency
  • Bandwidth
  • Assure memory affinity

[Figure: NUMA platform with eight nodes, Node#0 through Node#7]

SLIDE 11

NUMA Problem

  • Memory access types:
  • Read and write
  • Different costs
  • Write operations are more expensive
  • Special memory policies
  • On NUMA, data distribution matters!
SLIDE 12

NUMA support on Operating Systems

  • Operating systems have some support for NUMA machines
  • Physical memory allocation:
  • First-touch, next-touch (see the first-touch sketch below)
  • Libraries and tools to distribute data
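
First-touch is the default policy on Linux: a page is physically allocated on the NUMA node of the core whose thread first writes to it. A minimal sketch of how a program can exploit this (illustrative only, not from the talk; the thread-to-core mapping is an assumption): each thread pins itself to a core and initializes its own chunk, so that chunk's pages end up on that thread's node.

// First-touch sketch: compile with  g++ -O2 -pthread first_touch.cpp
#include <pthread.h>
#include <sched.h>
#include <cstdlib>
#include <thread>
#include <vector>

int main() {
    const std::size_t N = 16UL * 1024 * 1024;                 // spans many pages
    double *data = static_cast<double *>(std::malloc(N * sizeof(double)));

    unsigned T = std::thread::hardware_concurrency();
    if (T == 0) T = 4;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < T; ++t) {
        workers.emplace_back([=] {
            // Pin this thread to core t so its "first touch" happens on a
            // well-defined NUMA node (errors ignored for brevity).
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(t, &set);
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

            // First write: the pages of this chunk are allocated on the
            // node local to core t.
            const std::size_t chunk = N / T;
            for (std::size_t i = t * chunk; i < (t + 1) * chunk; ++i)
                data[i] = 0.0;
        });
    }
    for (auto &w : workers) w.join();
    std::free(data);
    return 0;
}
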
SLIDE 13

Memory Affinity on Linux

  • The current support for NUMA on Linux:
  • Physical memory allocation:
  • First-touch: placement on the first memory access
  • NUMA API: developers do everything!
  • System calls to bind memory pages
  • numactl, a user-level tool to bind memory and pin threads
  • libnuma, an interface to place memory pages on physical memory (see the sketch below)
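
As a minimal sketch of what "developers do everything" means with the NUMA API (illustrative; the buffer sizes and node number are arbitrary), libnuma can place allocations explicitly:

// Explicit placement with libnuma: compile with  g++ numa_place.cpp -lnuma
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {                    // no NUMA support in the kernel
        std::printf("NUMA not available\n");
        return 1;
    }
    // 4 MB bound to node 0, and 4 MB interleaved over all nodes.
    void *on_node0    = numa_alloc_onnode(4 << 20, 0);
    void *interleaved = numa_alloc_interleaved(4 << 20);

    /* ... compute with the buffers ... */

    numa_free(on_node0, 4 << 20);
    numa_free(interleaved, 4 << 20);
    return 0;
}

At the command level, numactl requests the same kind of policy for a whole run without touching the source, e.g. numactl --membind=0 ./prog or numactl --interleave=all ./prog.
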
SLIDE 14

Charm++ Parallel Programming System

  • Portability over different platforms
  • Shared memory
  • Distributed memory
  • Architecture abstraction => programmer productivity
  • Virtualization and transparency

SLIDE 15

Charm++ Parallel Programming System

  • Data management:
  • Stack and heap
  • Memory allocation based on malloc
  • Isomalloc:
  • Based on the mmap system call
  • Allows thread migration (see the sketch below)
  • What about physical memory?
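
The property isomalloc depends on is that mmap can be asked for a specific virtual address, so the same range can be recreated wherever a thread migrates to and pointers into it stay valid. A rough sketch of that idea (illustrative only, not Charm++'s implementation; the address constant is arbitrary):

// Map an anonymous region at a chosen virtual address (hint honoured or rejected).
#include <sys/mman.h>
#include <cstdio>

int main() {
    const std::size_t len = 1 << 20;                              // 1 MB
    void *wanted = reinterpret_cast<void *>(0x100000000000ULL);  // arbitrary range

    void *p = mmap(wanted, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED || p != wanted) {
        std::printf("could not map the requested range\n");
        return 1;
    }
    // A thread whose stack or heap lives in [wanted, wanted+len) can migrate:
    // the destination maps the same range and copies the contents over, and
    // every pointer into the block is still valid there.
    munmap(p, len);
    return 0;
}
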

SLIDE 16

NUMA Support on Charm++

  • Our approach:
  • Study the impact of memory affinity on Charm++
  • Bind virtual memory pages to memory banks
  • Based on three parts:
  • +maffinity option
  • Interleaved heap
  • NUMA-aware memory allocator

SLIDE 17

Impact of Memory Affinity on Charm++

  • Study the impact of memory affinity
  • With different memory allocators and memory policies
  • Memory allocators:
  • ptmalloc and NUMA-aware tcmalloc
  • Memory policies:
  • First-touch, bind and interleaved
  • NUMA machine: AMD Opteron

SLIDE 18

AMD Opteron

  • NUMA machine
  • AMD Opteron
  • 8 processors (2 cores each) at 2.2 GHz
  • L2 cache (2 MB)
  • Main memory: 32 GB
  • Low latency for local memory access
  • NUMA factor: 1.2 – 1.5
  • Linux 2.6.32.6

SLIDE 19

[Chart: kNeighbor application, charm++ multicore64, different memory allocators (ptmalloc, NUMA-aware tcmalloc, each with and without +setcpuaffinity); y-axis: average time (us)]

[Chart: kNeighbor application, charm++ multicore64, 100 iterations; original vs. numactl bind vs. numactl interleave; x-axis: number of cores; y-axis: average time of 3 kNeighbor iterations (us)]

SLIDE 20

[Chart: Molecular2D, charm++ multicore64; original vs. numactl bind vs. numactl interleave; x-axis: number of cores; y-axis: step time (ms/step)]

[Chart: Molecular2D, charm++ multicore64, different memory allocators (ptmalloc, NUMA-aware tcmalloc, each with and without +setcpuaffinity); y-axis: benchmark time (ms)]

SLIDE 21

+maffinity option

  • Sets memory affinity for processes or threads
  • Based on the Linux NUMA system call
  • Sets the process/thread memory policy
  • Bind, preferred and interleave are used in our implementation (see the sketch below)
  • Must be used with the +setcpuaffinity option
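
As a rough illustration of "set the process/thread memory policy" at the system-call level (a sketch under the assumption that the Linux set_mempolicy call is used; this is not Charm++'s actual source), switching the calling thread to a preferred-node policy looks like this:

// Per-thread memory policy: compile with  g++ mempolicy.cpp -lnuma
#include <numaif.h>
#include <cstdio>

int main() {
    // Prefer NUMA node 1 for this thread's future page allocations.
    unsigned long nodemask = 1UL << 1;                    // one bit per node
    if (set_mempolicy(MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8) != 0) {
        std::perror("set_mempolicy");
        return 1;
    }
    /* ... pages touched from here on are placed on node 1 when possible ... */

    set_mempolicy(MPOL_DEFAULT, nullptr, 0);              // back to first-touch
    return 0;
}

MPOL_BIND and MPOL_INTERLEAVE are requested the same way, which matches the bind/preferred/interleave choices exposed by +maffinity.
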
SLIDE 22

./charmrun prog +p6 +setcpuaffinity +coremap 0,2,4,8,12,13 +maffinity +nodemap 0,0,1,2,3,3 +mempol preferred

In this example the 6 threads are pinned to cores 0, 2, 4, 8, 12 and 13, and with the preferred policy their memory is placed on NUMA nodes 0, 0, 1, 2, 3 and 3, respectively.

[Figure: four NUMA nodes (Node#0–Node#3), each with its CPUs and local memory]

SLIDE 23

Interleaved Heap

  • Based on the mbind Linux system call
  • Spreads data over the NUMA nodes
  • The objective is to reduce memory contention by optimizing bandwidth
  • One mbind per mmap (see the sketch below)
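
A minimal sketch of the "one mbind per mmap" idea (illustrative; the arena size and the four-node mask are assumptions): immediately after a heap arena is obtained with mmap, its pages are given the interleave policy so they are spread round-robin across the nodes.

// Interleave a freshly mmap'ed region across NUMA nodes 0-3:
// compile with  g++ interleaved_heap.cpp -lnuma
#include <sys/mman.h>
#include <numaif.h>
#include <cstdio>

int main() {
    const std::size_t len = 8 << 20;                    // 8 MB heap arena
    void *arena = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (arena == MAP_FAILED) { std::perror("mmap"); return 1; }

    unsigned long nodemask = 0xFUL;                     // nodes 0,1,2,3
    if (mbind(arena, len, MPOL_INTERLEAVE, &nodemask,
              sizeof(nodemask) * 8, 0) != 0) {
        std::perror("mbind");
        return 1;
    }
    // Pages of `arena` are now distributed round-robin over the four nodes
    // as they are first touched, spreading the heap's bandwidth demand.
    munmap(arena, len);
    return 0;
}

Interleaving trades some locality for bandwidth: no single memory controller serves all of the heap's traffic, which is exactly the contention this part targets.
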
SLIDE 24

[Figure: the heap's memory pages being distributed over four NUMA nodes (Node#0–Node#3), each with its CPUs and local memory]


SLIDE 26

[Figure: the heap's virtual memory pages bound to the physical memory banks of nodes Node#0–Node#3]

SLIDE 27

First Results

  • Charm++ version:
  • 6.1.3
  • net-linux-amd64
  • Applications:
  • Molecular2D
  • kNeighbor (1000 iterations, msg 1024)

SLIDE 28

First Results

  • NUMA machine
  • AMD Opteron
  • 8 processors (2 cores each) at 2.2 GHz
  • Shared L2 cache (2 MB)
  • Main memory: 32 GB
  • Low latency for local memory access
  • NUMA factor: 1.2 – 1.5
  • Linux 2.6.32.6

SLIDE 29

SLIDE 30

Intel Xeon

  • NUMA machine
  • Intel EM64T
  • 4 (24 cores) x 2.66 GHz processors
  • Shared L3 cache (16 MB)
  • Main memory: 192 GB
  • High latency for local memory access
  • NUMA factor: 1.2 – 5
  • Linux 2.6.27

SLIDE 31

[Chart: Charm++ with memory affinity, kNeighbor application; original vs. maffinity vs. interleave; x-axis: number of cores (24, 48, 64); y-axis: time (us)]

[Chart: Charm++ with memory affinity, Molecular2D application; original vs. maffinity vs. interleave; x-axis: number of cores (24, 48, 64); y-axis: time (ms)]

SLIDE 32

HeapAlloc

  • NUMA-aware memory allocator
  • Reduces lock contention and optimizes data locality
  • Several memory policies, applied according to the access mode (read, write or read/write)

SLIDE 33

HeapAlloc

  • Default memory policy is bind
  • High-level interface: glibc compatible, no modifications to the source code
  • Low-level interface: allows developers to manage their heaps

SLIDE 34

[Figure: four NUMA nodes (Node#0–Node#3) with their CPUs and local memory; one heap per core of a node, backed by the memory of Node#2]

SLIDE 35

[Figure: heaps 'core0'–'core3' in the memory of Node#0. A thread running on node#0 calls malloc and the memory is allocated from heap 'core0'; a thread running on node#3 later calls free on memory allocated by that thread, and the memory is returned to heap 'core0'.]
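
A much-simplified sketch of that behaviour (an illustration of the idea only, not HeapAlloc's code; the names CoreHeap, core_malloc and core_free are made up): every core owns a heap protected by its own lock, malloc serves memory from the calling core's heap, and free returns a block to the heap it was taken from, even when called from another node.

#include <array>
#include <cstdlib>
#include <mutex>
#include <unordered_map>
#include <vector>

// One heap per core; a block remembers its owner so a remote free
// returns it to the heap it came from (as in the figure above).
struct CoreHeap {
    std::mutex lock;                  // per-core lock: little cross-core contention
    std::vector<void *> free_blocks;  // stand-in for real size-class free lists
};

static std::array<CoreHeap, 64> heaps;         // one per core (count assumed)
static std::unordered_map<void *, int> owner;  // block -> owning core
static std::mutex owner_lock;

void *core_malloc(int core, std::size_t size) {
    void *p;
    {
        std::lock_guard<std::mutex> g(heaps[core].lock);
        p = std::malloc(size);        // stand-in: a real allocator would carve
                                      // from a node-local (bound) arena instead
    }
    std::lock_guard<std::mutex> g(owner_lock);
    owner[p] = core;
    return p;
}

void core_free(void *p) {
    int core;
    {
        std::lock_guard<std::mutex> g(owner_lock);
        core = owner.at(p);
        owner.erase(p);
    }
    std::lock_guard<std::mutex> g(heaps[core].lock);
    heaps[core].free_blocks.push_back(p);   // back to the owning heap, not the caller's
}

int main() {
    void *p = core_malloc(/*core=*/0, 256);  // allocated from heap 'core0'
    core_free(p);                            // returned to heap 'core0'
    return 0;
}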

SLIDE 36

Conclusions

  • Charm++ performance on NUMA can be improved:
  • NUMA-aware tcmalloc
  • +maffinity
  • Interleaved heap
  • Proposal of an optimized memory allocator for NUMA machines

SLIDE 37

Future Work

  • Conclude the integration of HeapAlloc in Charm++
  • Study the impact of different memory allocators on Charm++
  • What about several memory policies?
  • Bind, interleave, next-touch, skew_mapp, ...

SLIDE 38

Questions?

pousa@imag.fr