NUMA Support for Charm++

  1. NUMA Support for Charm++ Does memory affinity matter? Christiane Pousa Ribeiro Maxime Martinasso Jean-François Méhaut

  2. Outline ● Introduction ● Motivation ● NUMA Problem ● Support NUMA for Charm++ ● First Results ● Conclusion and Future work

  3. Motivation for NUMA Platforms ● The number of cores per processor is increasing ● Hierarchical shared memory multiprocessors ● cc-NUMA is coming back (NUMA factor) ● AMD HyperTransport and Intel QuickPath

  4-9. NUMA Problem ● Remote access and memory contention ● Optimizes: latency and bandwidth ● Assures memory affinity [Diagram, repeated over slides 4-9: eight NUMA nodes, Node#0 to Node#7, with CPUs accessing local and remote memory]

  10. NUMA Problem ● Memory access types: ● Read and write ● Different costs ● Write operations are more expensive ● Special memory policies ● On NUMA, data distribution matters!

  11. NUMA support on Operating Systems ● Operating systems have some support for NUMA machines ● Physical memory allocation: ● First-touch, next-touch ● Libraries and tools to distribute data
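
The first-touch placement mentioned above can be seen with a small experiment. The sketch below is a minimal illustration, not code from the talk: one large buffer is mapped by the main thread, but two workers each touch half of it, so under Linux's default first-touch policy the physical pages of each half land on the NUMA node where the touching worker runs, provided the workers are pinned to cores on different nodes (compile with -pthread).

    /* First-touch in practice: the thread that first writes a page decides
     * where its physical frame is allocated, not the thread that called mmap. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <string.h>
    #include <sys/mman.h>

    #define NTHREADS 2
    #define CHUNK    (64UL * 1024 * 1024)      /* 64 MB per worker */

    static char *buf;

    static void *touch_my_chunk(void *arg)
    {
        long id = (long)arg;
        /* Assuming this worker is pinned to a core of "its" node (for example
         * via sched_setaffinity or Charm++'s +setcpuaffinity), these pages are
         * allocated on that node. */
        memset(buf + id * CHUNK, 0, CHUNK);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        /* mmap only reserves virtual pages; no physical frame is chosen yet. */
        buf = mmap(NULL, NTHREADS * CHUNK, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, touch_my_chunk, (void *)i);
        for (long i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        munmap(buf, NTHREADS * CHUNK);
        return 0;
    }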

  12. Memory Affinity on Linux ● The current support for NUMA on Linux: ● Physical memory allocation: – First-touch: placement at the first memory access ● NUMA API: developers do it all! – System call to bind memory pages – numactl, a user-level tool to bind memory and pin threads – libnuma, an interface to place memory pages on physical memory
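
For concreteness: numactl can launch an unmodified program under a given policy from the shell (for example numactl --interleave=all ./prog or numactl --membind=0 ./prog), while libnuma gives the programmer per-allocation control. The snippet below is a generic libnuma sketch (link with -lnuma), not code from the Charm++ work.

    /* Explicit page placement with libnuma. */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "this machine has no NUMA support\n");
            return 1;
        }
        size_t size = 16UL * 1024 * 1024;

        /* Bind: every page of this buffer comes from node 0. */
        void *bound = numa_alloc_onnode(size, 0);

        /* Interleave: pages are spread round-robin over all allowed nodes. */
        void *spread = numa_alloc_interleaved(size);

        /* ... compute with the buffers ... */

        numa_free(bound, size);
        numa_free(spread, size);
        return 0;
    }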

  13. Charm++ Parallel Programming System ● Portability over different platforms ● Shared memory ● Distributed memory ● Architecture abstraction => programmer productivity ● Virtualization and transparency

  14. Charm++ Parallel Programming System ● Data management: ● Stack and Heap ● Memory allocation based on malloc ● Isomalloc: ● based on the mmap system call ● allows thread migration ● What about physical memory?
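
To make the isomalloc point concrete, the sketch below shows the underlying idea in isolation (a rough illustration with made-up parameters, not Charm++'s actual implementation): each processor reserves its own slice of the virtual address space with mmap, so a migrated thread can be re-mapped at the same virtual addresses and its pointers stay valid. mmap, however, only reserves virtual pages; on a NUMA machine the physical frame is still chosen at first touch, which is exactly the question the slide raises.

    /* Isomalloc-style reservation of a virtual address range (sketch only;
     * SLOT_BASE and SLOT_SIZE are hypothetical values, the real layout is
     * computed by the runtime). */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #define SLOT_BASE ((void *)0x700000000000UL)   /* hypothetical per-PE slot */
    #define SLOT_SIZE (256UL * 1024 * 1024)

    int main(void)
    {
        /* The address is only a hint here; the point is that every PE agrees
         * on which virtual range belongs to which migratable thread. */
        void *region = mmap(SLOT_BASE, SLOT_SIZE, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        printf("reserved %lu MB of virtual memory at %p\n",
               SLOT_SIZE >> 20, region);
        /* No physical page has been allocated yet: that happens at first
         * touch, on whichever NUMA node the touching thread runs. */
        munmap(region, SLOT_SIZE);
        return 0;
    }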

  15. NUMA Support on Charm++ ● Our approach ● Study the impact of memory affinity on charm++ ● Bind virtual memory pages to memory banks ● Based on three parts: ● +maffinity option ● Interleaved heap ● NUMA-aware memory allocator

  16. Impact of Memory Affinity on charm++ ● Study the impact of memory affinity ● different memory allocators and memory policies ● Memory allocators ● ptmalloc and NUMA-aware tcmalloc ● Memory policies ● First-touch, bind and interleaved ● NUMA machine: AMD Opteron

  17. AMD Opteron ● NUMA machine ● AMD Opteron ● 8 dual-core 2.2 GHz processors (16 cores) ● L2 cache (2 MB) ● 32 GB main memory ● Low latency for local memory access ● NUMA factor: 1.2 – 1.5 ● Linux 2.6.32.6

  18. Different Memory Allocators [Chart: kNeighbor on charm++ multicore64, average time (us) for ptmalloc, NUMA-aware tcmalloc, ptmalloc + setcpu, NUMA-aware tcmalloc + setcpu, and numactl] [Chart: kNeighbor on charm++ multicore64, average 3-kN iteration time (us) over 100 iterations for original, bind and interleave on 8 and 16 cores]

  19. Different Memory Allocators [Chart: Molecular2D on charm++ multicore64, benchmark time (ms) for ptmalloc, NUMA-aware tcmalloc, ptmalloc + setcpu, NUMA-aware tcmalloc + setcpu, and numactl] [Chart: Molecular2D on charm++ multicore64, step time (ms/step) for original, bind and interleave on 8 and 16 cores]

  20. +maffinity option ● set memory affinity for processes or threads ● Based on Linux NUMA system call ● Set the process/thread memory policy ● Bind, preferred and interleave are used in our implementation ● Must be used with +setcpuaffinity option
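
A guess at what such an option boils down to underneath (the actual Charm++ code is not shown in the talk): each process or worker thread calls the Linux set_mempolicy system call once at start-up, so that all of its later page allocations follow the chosen bind, preferred or interleave placement. The sketch links with -lnuma for the numaif.h wrapper.

    /* Setting a per-thread memory policy with set_mempolicy. */
    #include <numaif.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned long nodemask = 1UL << 0;              /* prefer node 0 */
        unsigned long maxnode  = 8 * sizeof(nodemask);

        /* MPOL_PREFERRED: allocate from the preferred node when possible,
         * fall back to other nodes when it is full.  MPOL_BIND and
         * MPOL_INTERLEAVE are set the same way (with a multi-node mask);
         * only the mode changes. */
        if (set_mempolicy(MPOL_PREFERRED, &nodemask, maxnode) != 0) {
            perror("set_mempolicy");
            return 1;
        }
        /* From here on, pages first touched by this thread follow the policy. */
        return 0;
    }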

  21. ./charmrun prog +p6 +setcpuaffinity +coremap 0,2,4,8,12,13 +maffinity +nodemap 0,0,1,2,3,3 +mempol preferred [Diagram: four NUMA nodes, Node#0 to Node#3, each with CPU and memory, showing the processes mapped onto the nodes listed in +nodemap]

  22. Interleaved Heap ● Based on mbind Linux system call ● Spread data over the NUMA nodes ● The objective is to reduce memory contention by optimizing bandwidth ● One mbind per mmap
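
The "one mbind per mmap" idea can be pictured as follows (an assumed shape, not the actual Charm++ patch): right after the allocator obtains a heap arena from mmap, the whole virtual range is marked MPOL_INTERLEAVE with mbind, so the physical pages are distributed round-robin over the nodes as they are first touched. Link with -lnuma.

    /* Interleaving one mmap'ed heap arena over the NUMA nodes. */
    #define _GNU_SOURCE
    #include <numaif.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 64UL * 1024 * 1024;              /* one heap arena */
        void *arena = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (arena == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Spread the arena over nodes 0-3 (adjust the mask to the machine). */
        unsigned long nodemask = 0xFUL;
        if (mbind(arena, len, MPOL_INTERLEAVE, &nodemask,
                  8 * sizeof(nodemask), 0) != 0) {
            perror("mbind");
            return 1;
        }
        /* Pages get their physical home round-robin as the allocator hands
         * out memory from this arena and it is written for the first time. */
        munmap(arena, len);
        return 0;
    }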

  23-25. Heap [Diagram sequence: the heap's memory pages are spread over four NUMA nodes, Node#0 to Node#3; virtual memory pages are bound to physical memory banks]

  26. First Results ● Charm++ version: ● 6.1.3 ● net-linux-amd64 ● Applications: ● Molecular2D ● kNeighbor (1000 iterations, message size 1024)

  27. First Results ● NUMA machine ● AMD Opteron ● 8 dual-core 2.2 GHz processors (16 cores) ● Shared L2 cache (2 MB) ● 32 GB main memory ● Low latency for local memory access ● NUMA factor: 1.2 – 1.5 ● Linux 2.6.32.6

  28. Intel Xeon ● NUMA machine ● Intel EM64T ● 4 (24 cores) x 2.66 GHz processors ● Shared L3 cache (16 MB) ● 192 GB main memory ● High latency for local memory access ● NUMA factor: 1.2 - 5 ● Linux 2.6.27

  29. Charm - Memory Affinity [Chart: kN application, time (us) for original, maffinity and interleave on 24, 48 and 64 cores] [Chart: Mol2d application, time (ms) for original, maffinity and interleave on 24, 48 and 64 cores]

  30. HeapAlloc ● NUMA-aware memory allocator ● Reduces lock contention and optimizes data locality ● Several memory policies: applied considering the access mode (read, write or read/write)

  31. HeapAlloc ● Default memory policy is bind ● High-level interface: glibc compatible, no modifications to the source code are needed ● Low-level interface: allows developers to manage their heaps
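
Since the HeapAlloc API itself is not spelled out in the talk, the sketch below only illustrates what a low-level interface of this kind could look like: heap_create, heap_malloc and heap_destroy are made-up names, implemented here as a trivial bump allocator over libnuma (link with -lnuma). The high-level interface needs none of this; a glibc-compatible malloc/free can simply be linked in or preloaded in front of the unmodified program.

    /* Hypothetical low-level heap interface bound to one NUMA node. */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        char  *base;     /* arena whose pages live on one node (bind policy) */
        size_t size;
        size_t used;
    } heap_t;

    static heap_t *heap_create(int node, size_t size)
    {
        heap_t *h = malloc(sizeof(*h));
        h->base = numa_alloc_onnode(size, node);   /* pages come from 'node' */
        h->size = size;
        h->used = 0;
        return h;
    }

    static void *heap_malloc(heap_t *h, size_t n)
    {
        if (h->used + n > h->size)
            return NULL;                           /* sketch: no free list */
        void *p = h->base + h->used;
        h->used += n;
        return p;
    }

    static void heap_destroy(heap_t *h)
    {
        numa_free(h->base, h->size);
        free(h);
    }

    int main(void)
    {
        if (numa_available() < 0)
            return 1;
        heap_t *h = heap_create(0, 1UL << 20);     /* 1 MB heap on node 0 */
        double *v = heap_malloc(h, 1000 * sizeof(double));
        printf("1000 doubles allocated at %p from a node-0 heap\n", (void *)v);
        heap_destroy(h);
        return 0;
    }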

  32. One heap per core of a node [Diagram: four NUMA nodes, Node#0 to Node#3; the memory of Node#2 is divided into one heap per core]

  33. [Diagram: the memory of Node#0 is divided into heaps for core0 to core3. A thread running on node#0 calls malloc and the memory is allocated from heap 'core0'. A thread running on node#3 later calls free on that memory, and the memory is returned to heap 'core0'.]
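
The remote-free scenario above only needs a small amount of bookkeeping. The sketch below is a hypothetical layout, not HeapAlloc's real one: each block carries a pointer to the per-core heap it came from in a small header, so a free() issued by a thread on any node returns the block to its owning heap (heap 'core0' in the figure); plain malloc/free stand in for the real node-local arenas.

    /* Remote free returning a block to its owning per-core heap. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        const char *name;       /* e.g. "core0" */
        int         returned;   /* blocks handed back to this heap */
    } heap_t;

    typedef struct {
        heap_t *owner;          /* heap the block was allocated from */
    } block_header_t;

    /* Allocate from a given per-core heap. */
    static void *heap_malloc(heap_t *h, size_t n)
    {
        block_header_t *hdr = malloc(sizeof(*hdr) + n);
        hdr->owner = h;
        return hdr + 1;                     /* user pointer sits past the header */
    }

    /* Free from any thread, on any node: the header says where the block goes. */
    static void heapalloc_free(void *ptr)
    {
        block_header_t *hdr = (block_header_t *)ptr - 1;
        hdr->owner->returned++;             /* bookkeeping: back to the owning heap */
        free(hdr);                          /* plain free stands in for the arena */
    }

    int main(void)
    {
        heap_t core0 = { "core0", 0 };
        void *p = heap_malloc(&core0, 128); /* thread on node#0 allocates */
        heapalloc_free(p);                  /* thread on node#3 frees     */
        printf("blocks returned to heap %s: %d\n", core0.name, core0.returned);
        return 0;
    }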

  34. Conclusions ● Charm++ performance on NUMA can be improved ● NUMA-aware tcmalloc ● +maffinity ● Interleaved heap ● Proposal of an optimized memory allocator for NUMA machines

  35. Future Work ● Conclude the integration of HeapAlloc in charm++ ● Study the impact of different memory allocators on charm++ ● What about several memory policies? ● Bind, interleave, next-touch, skew_mapp .....
