SLIDE 1

NUMA Support for Charm++

Does memory affinity matter?

Christiane Pousa Ribeiro, Maxime Martinasso, Jean-François Méhaut

SLIDE 2

Outline

  • Introduction
  • Motivation
  • NUMA Problem
  • Support NUMA for Charm++
  • First Results
  • Conclusion and Future work
SLIDE 3

Motivation for NUMA Platforms

  • The number of cores per processor is increasing
  • Hierarchical shared memory multiprocessors
  • cc-NUMA is coming back (NUMA factor)
  • AMD HyperTransport and Intel QuickPath

SLIDE 4

SLIDE 5

NUMA Problem

  • Remote access and memory contention
  • Optimizes:
  • Latency
  • Bandwidth
  • Assure memory affinity

[Figure: NUMA platform with eight nodes, Node#0 through Node#7]

SLIDE 11

NUMA Problem

  • Memory access types:
  • Read and write
  • Different costs
  • Write operations are more expensive
  • Special memory policies
  • On NUMA, data distribution matters!
SLIDE 12

NUMA support on Operating Systems

  • Operating systems have some support for NUMA machines
  • Physical memory allocation:
  • First-touch, next-touch (see the first-touch sketch below)
  • Libraries and tools to distribute data
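
First-touch is the default policy on Linux: a page is physically allocated on the NUMA node of the core whose thread first writes to it. A minimal sketch of how a program can exploit this (illustrative only, not from the talk; the thread-to-core mapping is an assumption): each thread pins itself to a core and initializes its own chunk, so that chunk's pages end up on that thread's node.

// First-touch sketch: compile with  g++ -O2 -pthread first_touch.cpp
#include <pthread.h>
#include <sched.h>
#include <cstdlib>
#include <thread>
#include <vector>

int main() {
    const std::size_t N = 16UL * 1024 * 1024;                 // spans many pages
    double *data = static_cast<double *>(std::malloc(N * sizeof(double)));

    unsigned T = std::thread::hardware_concurrency();
    if (T == 0) T = 4;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < T; ++t) {
        workers.emplace_back([=] {
            // Pin this thread to core t so its "first touch" happens on a
            // well-defined NUMA node (errors ignored for brevity).
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(t, &set);
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

            // First write: the pages of this chunk are allocated on the
            // node local to core t.
            const std::size_t chunk = N / T;
            for (std::size_t i = t * chunk; i < (t + 1) * chunk; ++i)
                data[i] = 0.0;
        });
    }
    for (auto &w : workers) w.join();
    std::free(data);
    return 0;
}
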
SLIDE 13

Memory Affinity on Linux

  • The current support for NUMA on Linux:
  • Physical memory allocation:
  • First-touch: placement on the first memory access
  • NUMA API: developers do everything!
  • System calls to bind memory pages
  • numactl, a user-level tool to bind memory and pin threads
  • libnuma, an interface to place memory pages on physical memory (see the sketch below)
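
As a minimal sketch of what "developers do everything" means with the NUMA API (illustrative; the buffer sizes and node number are arbitrary), libnuma can place allocations explicitly:

// Explicit placement with libnuma: compile with  g++ numa_place.cpp -lnuma
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {                    // no NUMA support in the kernel
        std::printf("NUMA not available\n");
        return 1;
    }
    // 4 MB bound to node 0, and 4 MB interleaved over all nodes.
    void *on_node0    = numa_alloc_onnode(4 << 20, 0);
    void *interleaved = numa_alloc_interleaved(4 << 20);

    /* ... compute with the buffers ... */

    numa_free(on_node0, 4 << 20);
    numa_free(interleaved, 4 << 20);
    return 0;
}

At the command level, numactl requests the same kind of policy for a whole run without touching the source, e.g. numactl --membind=0 ./prog or numactl --interleave=all ./prog.
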
SLIDE 14

Charm++ Parallel Programming System

  • Portability over different platforms
  • Shared memory
  • Distributed memory
  • Architecture abstraction => programmer productivity
  • Virtualization and transparency

SLIDE 15

Charm++ Parallel Programming System

  • Data management:
  • Stack and heap
  • Memory allocation based on malloc
  • Isomalloc:
  • Based on the mmap system call
  • Allows thread migration (see the sketch below)
  • What about physical memory?
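
The property isomalloc depends on is that mmap can be asked for a specific virtual address, so the same range can be recreated wherever a thread migrates to and pointers into it stay valid. A rough sketch of that idea (illustrative only, not Charm++'s implementation; the address constant is arbitrary):

// Map an anonymous region at a chosen virtual address (hint honoured or rejected).
#include <sys/mman.h>
#include <cstdio>

int main() {
    const std::size_t len = 1 << 20;                              // 1 MB
    void *wanted = reinterpret_cast<void *>(0x100000000000ULL);  // arbitrary range

    void *p = mmap(wanted, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED || p != wanted) {
        std::printf("could not map the requested range\n");
        return 1;
    }
    // A thread whose stack or heap lives in [wanted, wanted+len) can migrate:
    // the destination maps the same range and copies the contents over, and
    // every pointer into the block is still valid there.
    munmap(p, len);
    return 0;
}
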

SLIDE 16

NUMA Support on Charm++

  • Our approach:
  • Study the impact of memory affinity on Charm++
  • Bind virtual memory pages to memory banks
  • Based on three parts:
  • +maffinity option
  • Interleaved heap
  • NUMA-aware memory allocator

SLIDE 17

Impact of Memory Affinity on Charm++

  • Study the impact of memory affinity
  • With different memory allocators and memory policies
  • Memory allocators:
  • ptmalloc and NUMA-aware tcmalloc
  • Memory policies:
  • First-touch, bind and interleaved
  • NUMA machine: AMD Opteron

SLIDE 18

AMD Opteron

  • NUMA machine
  • AMD Opteron
  • 8 processors (2 cores each) at 2.2 GHz
  • L2 cache (2 MB)
  • Main memory: 32 GB
  • Low latency for local memory access
  • NUMA factor: 1.2 – 1.5
  • Linux 2.6.32.6

SLIDE 19

[Chart: kNeighbor application, charm++ multicore64, different memory allocators (ptmalloc, NUMA-aware tcmalloc, each with and without +setcpuaffinity); y-axis: average time (us)]

[Chart: kNeighbor application, charm++ multicore64, 100 iterations; original vs. numactl bind vs. numactl interleave; x-axis: number of cores; y-axis: average time of 3 kNeighbor iterations (us)]

SLIDE 20

[Chart: Molecular2D, charm++ multicore64; original vs. numactl bind vs. numactl interleave; x-axis: number of cores; y-axis: step time (ms/step)]

[Chart: Molecular2D, charm++ multicore64, different memory allocators (ptmalloc, NUMA-aware tcmalloc, each with and without +setcpuaffinity); y-axis: benchmark time (ms)]

SLIDE 21

+maffinity option

  • Sets memory affinity for processes or threads
  • Based on the Linux NUMA system call
  • Sets the process/thread memory policy
  • Bind, preferred and interleave are used in our implementation (see the sketch below)
  • Must be used with the +setcpuaffinity option
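
As a rough illustration of "set the process/thread memory policy" at the system-call level (a sketch under the assumption that the Linux set_mempolicy call is used; this is not Charm++'s actual source), switching the calling thread to a preferred-node policy looks like this:

// Per-thread memory policy: compile with  g++ mempolicy.cpp -lnuma
#include <numaif.h>
#include <cstdio>

int main() {
    // Prefer NUMA node 1 for this thread's future page allocations.
    unsigned long nodemask = 1UL << 1;                    // one bit per node
    if (set_mempolicy(MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8) != 0) {
        std::perror("set_mempolicy");
        return 1;
    }
    /* ... pages touched from here on are placed on node 1 when possible ... */

    set_mempolicy(MPOL_DEFAULT, nullptr, 0);              // back to first-touch
    return 0;
}

MPOL_BIND and MPOL_INTERLEAVE are requested the same way, which matches the bind/preferred/interleave choices exposed by +maffinity.
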
SLIDE 22

./charmrun prog +p6 +setcpuaffinity +coremap 0,2,4,8,12,13 +maffinity +nodemap 0,0,1,2,3,3 +mempol preferred

In this example the 6 threads are pinned to cores 0, 2, 4, 8, 12 and 13, and with the preferred policy their memory is placed on NUMA nodes 0, 0, 1, 2, 3 and 3, respectively.

[Figure: four NUMA nodes (Node#0–Node#3), each with its CPUs and local memory]

SLIDE 23

Interleaved Heap

  • Based on the mbind Linux system call
  • Spreads data over the NUMA nodes
  • The objective is to reduce memory contention by optimizing bandwidth
  • One mbind per mmap (see the sketch below)
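
A minimal sketch of the "one mbind per mmap" idea (illustrative; the arena size and the four-node mask are assumptions): immediately after a heap arena is obtained with mmap, its pages are given the interleave policy so they are spread round-robin across the nodes.

// Interleave a freshly mmap'ed region across NUMA nodes 0-3:
// compile with  g++ interleaved_heap.cpp -lnuma
#include <sys/mman.h>
#include <numaif.h>
#include <cstdio>

int main() {
    const std::size_t len = 8 << 20;                    // 8 MB heap arena
    void *arena = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (arena == MAP_FAILED) { std::perror("mmap"); return 1; }

    unsigned long nodemask = 0xFUL;                     // nodes 0,1,2,3
    if (mbind(arena, len, MPOL_INTERLEAVE, &nodemask,
              sizeof(nodemask) * 8, 0) != 0) {
        std::perror("mbind");
        return 1;
    }
    // Pages of `arena` are now distributed round-robin over the four nodes
    // as they are first touched, spreading the heap's bandwidth demand.
    munmap(arena, len);
    return 0;
}

Interleaving trades some locality for bandwidth: no single memory controller serves all of the heap's traffic, which is exactly the contention this part targets.
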
SLIDE 24

[Figure: the heap's memory pages being distributed over four NUMA nodes (Node#0–Node#3), each with its CPUs and local memory]


SLIDE 26

[Figure: the heap's virtual memory pages bound to the physical memory banks of nodes Node#0–Node#3]

SLIDE 27

First Results

  • Charm++ version:
  • 6.1.3
  • net-linux-amd64
  • Applications:
  • Molecular2D
  • kNeighbor (1000 iterations, msg 1024)

SLIDE 28

First Results

  • NUMA machine
  • AMD Opteron
  • 8 processors (2 cores each) at 2.2 GHz
  • Shared L2 cache (2 MB)
  • Main memory: 32 GB
  • Low latency for local memory access
  • NUMA factor: 1.2 – 1.5
  • Linux 2.6.32.6

SLIDE 29

SLIDE 30

Intel Xeon

  • NUMA machine
  • Intel EM64T
  • 4 (24 cores) x 2.66 GHz processors
  • Shared L3 cache (16 MB)
  • Main memory: 192 GB
  • High latency for local memory access
  • NUMA factor: 1.2 – 5
  • Linux 2.6.27

SLIDE 31

[Chart: Charm++ with memory affinity, kNeighbor application; original vs. maffinity vs. interleave; x-axis: number of cores (24, 48, 64); y-axis: time (us)]

[Chart: Charm++ with memory affinity, Molecular2D application; original vs. maffinity vs. interleave; x-axis: number of cores (24, 48, 64); y-axis: time (ms)]

SLIDE 32

HeapAlloc

  • NUMA-aware memory allocator
  • Reduces lock contention and optimizes data locality
  • Several memory policies, applied according to the access mode (read, write or read/write)

SLIDE 33

HeapAlloc

  • Default memory policy is bind
  • High-level interface: glibc compatible, no modifications to the source code
  • Low-level interface: allows developers to manage their heaps

SLIDE 34

[Figure: four NUMA nodes (Node#0–Node#3) with their CPUs and local memory; one heap per core of a node, backed by the memory of Node#2]

SLIDE 35

[Figure: heaps 'core0'–'core3' in the memory of Node#0. A thread running on node#0 calls malloc and the memory is allocated from heap 'core0'; a thread running on node#3 later calls free on memory allocated by that thread, and the memory is returned to heap 'core0'.]
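
A much-simplified sketch of that behaviour (an illustration of the idea only, not HeapAlloc's code; the names CoreHeap, core_malloc and core_free are made up): every core owns a heap protected by its own lock, malloc serves memory from the calling core's heap, and free returns a block to the heap it was taken from, even when called from another node.

#include <array>
#include <cstdlib>
#include <mutex>
#include <unordered_map>
#include <vector>

// One heap per core; a block remembers its owner so a remote free
// returns it to the heap it came from (as in the figure above).
struct CoreHeap {
    std::mutex lock;                  // per-core lock: little cross-core contention
    std::vector<void *> free_blocks;  // stand-in for real size-class free lists
};

static std::array<CoreHeap, 64> heaps;         // one per core (count assumed)
static std::unordered_map<void *, int> owner;  // block -> owning core
static std::mutex owner_lock;

void *core_malloc(int core, std::size_t size) {
    void *p;
    {
        std::lock_guard<std::mutex> g(heaps[core].lock);
        p = std::malloc(size);        // stand-in: a real allocator would carve
                                      // from a node-local (bound) arena instead
    }
    std::lock_guard<std::mutex> g(owner_lock);
    owner[p] = core;
    return p;
}

void core_free(void *p) {
    int core;
    {
        std::lock_guard<std::mutex> g(owner_lock);
        core = owner.at(p);
        owner.erase(p);
    }
    std::lock_guard<std::mutex> g(heaps[core].lock);
    heaps[core].free_blocks.push_back(p);   // back to the owning heap, not the caller's
}

int main() {
    void *p = core_malloc(/*core=*/0, 256);  // allocated from heap 'core0'
    core_free(p);                            // returned to heap 'core0'
    return 0;
}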

SLIDE 36

Conclusions

  • Charm++ performance on NUMA can be improved:
  • NUMA-aware tcmalloc
  • +maffinity
  • Interleaved heap
  • Proposal of an optimized memory allocator for NUMA machines

SLIDE 37

Future Work

  • Conclude the integration of HeapAlloc in Charm++
  • Study the impact of different memory allocators on Charm++
  • What about several memory policies?
  • Bind, interleave, next-touch, skew_mapp, ...

SLIDE 38

Questions?

pousa@imag.fr