SLIDE 1

An Analysis of SMP Memory Allocators: MapReduce on Large Shared-Memory Systems

Robert Döbbelin, Thorsten Schütt, Alexander Reinefeld

Zuse Institute Berlin

September 10, 2012

SLIDE 2

SGI Altix UltraViolet (UV) 1000

32 blades in one rack
2×8 cores per blade
64 GB memory per blade

[Blade diagram: two Intel Xeon X7560 sockets (8 cores each), each with 32 GB DDR3 RAM, connected via QPI to a HUB; NUMAlink5 links the HUB to other blades]

QPI for memory on the same blade
inter-blade communication via NUMAlink5

SLIDE 4

Memory allocation

First-touch policy

When a process requests memory from the OS:
the thread gets an (unmapped) virtual address
a page fault occurs on the first touch
the OS allocates the physical page on the NUMA node on which the accessing thread is running
Once a virtual address is mapped, this mapping persists until the page is released to the OS.
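The first-touch behavior can be sketched in C++ (a minimal illustration; the helper names are hypothetical, a 4 KB page size is assumed, and pinning the worker to a NUMA node via `pthread_setaffinity_np` is omitted):

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <thread>

// Assumed page size; on Linux the real value comes from sysconf(_SC_PAGESIZE).
constexpr std::size_t kPageSize = 4096;

// mmap hands back virtual address space with no physical backing yet.
char* reserve_pages(std::size_t n) {
    void* p = mmap(nullptr, n * kPageSize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : static_cast<char*>(p);
}

// A dedicated thread performs the first write to each page; under the
// first-touch policy, the page fault backs each page with physical memory
// on the NUMA node this worker thread is running on.
void first_touch_on_worker(char* buf, std::size_t n) {
    std::thread worker([=] {
        for (std::size_t i = 0; i < n; ++i)
            buf[i * kPageSize] = 1;   // page fault allocates the page here
    });
    worker.join();
}
```

Subsequent accesses from other threads reuse the mapping established here, which is exactly why the placement of the first touch matters.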

SLIDE 5

Memory allocation

Successive malloc/free operations

[Diagram: threads A–D running in one process; the process-level allocator sits between the threads and the OS; page placement still undetermined]

SLIDE 6

Memory allocation

Successive malloc/free operations

malloc: Thread A gets virtual page and touches it

SLIDE 7

Memory allocation

Successive malloc/free operations

malloc: Thread A gets a virtual page and touches it
free: the page may be released to the allocator's cache

SLIDE 8

Memory allocation

Successive malloc/free operations

malloc: Thread A gets a virtual page and touches it
free: the page may be released to the allocator's cache
malloc: Thread D gets this page
Thread D got remote memory from the allocator!
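This malloc/free sequence can be modeled with a toy process-wide allocator cache (an illustrative sketch, not glibc's or TBB's actual implementation; a single size class is assumed): free() parks the block in a shared list, and the next malloc() from *any* thread may receive it, together with the NUMA placement its pages got on first touch.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Toy model of a process-wide allocator cache. If thread A first-touched
// the pages of a block, a later thread D that gets the cached block back
// ends up working on memory local to A's NUMA node.
struct ToyAllocator {
    std::mutex m;
    std::vector<void*> cache;   // blocks handed back by release(), kept for reuse

    void* alloc(std::size_t n) {
        std::lock_guard<std::mutex> lock(m);
        if (!cache.empty()) {          // reuse a cached block: its physical
            void* p = cache.back();    // pages keep their original NUMA home
            cache.pop_back();
            return p;
        }
        return ::operator new(n);      // fresh, still-untouched virtual memory
    }

    void release(void* p) {            // like free(): cache it, don't unmap
        std::lock_guard<std::mutex> lock(m);
        cache.push_back(p);
    }
};
```

Real allocators keep per-size-class and often per-thread caches, but the cross-thread handover shown here is the mechanism behind the remote-memory problem on the slide.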

SLIDE 9

MapReduce

MapReduce workflow

MapReduce stages:
map: apply the map function to the input
shuffle: merge partitions
reduce: apply the reduce function to all kv-pairs with the same key

[Diagram: dataflow through the map, optional combine, shuffle, and reduce stages]

size of buffers unknown a priori
iterative MapReduce: the output of one MR step is the input for the next
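The three stages can be written out as a minimal single-node word-count sketch (illustrative only; this is not the authors' MR-Search code, and the stage functions are hypothetical names):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, int>;

// map: emit one (word, 1) pair per input word
std::vector<KV> map_stage(const std::vector<std::string>& words) {
    std::vector<KV> out;
    for (const auto& w : words) out.push_back({w, 1});
    return out;
}

// shuffle: group all emitted values by key (merging the partitions)
std::map<std::string, std::vector<int>> shuffle_stage(const std::vector<KV>& kvs) {
    std::map<std::string, std::vector<int>> groups;
    for (const auto& kv : kvs) groups[kv.first].push_back(kv.second);
    return groups;
}

// reduce: fold the value list of each key into a single count
std::map<std::string, int> reduce_stage(
        const std::map<std::string, std::vector<int>>& groups) {
    std::map<std::string, int> out;
    for (const auto& g : groups) {
        int sum = 0;
        for (int v : g.second) sum += v;
        out[g.first] = sum;
    }
    return out;
}
```

In the iterative setting on the slide, the output map of one such pipeline becomes the input of the next iteration, and the intermediate vectors are exactly the buffers whose sizes are unknown a priori.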

SLIDE 10

MapReduce

How to speed things up?

Memory allocators for SMPs (tbbmalloc): provide fast concurrent allocations
Memory reuse (reuse): reuse buffers for subsequent MapReduce iterations
Memory preallocation (prealloc): allocate the needed amount of memory for each buffer
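The reuse strategy can be sketched as a per-thread buffer that survives across iterations (a hypothetical helper, not the talk's implementation): instead of freeing output buffers after each MapReduce step, and thereby returning first-touched pages to the allocator where another thread might pick them up, each thread keeps its buffer and only grows it on demand.

```cpp
#include <cstddef>
#include <vector>

// One buffer per owning thread. Because the owner performed the first
// touch, reusing the same storage keeps the pages NUMA-local across
// MapReduce iterations.
struct IterationBuffer {
    std::vector<char> data;

    // Reuse existing capacity where possible; grow only when the next
    // iteration needs more space than any previous one.
    char* acquire(std::size_t bytes) {
        if (data.size() < bytes) data.resize(bytes);
        return data.data();
    }
};
```

The prealloc variant would call `acquire` once up front with the full required size, which is only possible when the buffer size is known before the iteration starts.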

SLIDE 11

Evaluation

[Plot: relative speedup (1–8) over 7–127 threads for glibc, reuse, tbbmalloc, tbb_pool, and prealloc]

MR-Search with various allocators. Speedup is relative to glibc.

Significant speedup if more than one blade is used.

SLIDE 12

Evaluation

[Plot: sent data on NUMAlink (GByte) and runtime (s) for glibc, reuse, tbbmalloc, tbb_pool, and prealloc]

NUMA traffic and runtime with various allocators (127 Threads).

Traffic on NUMAlink traced with Performance Co-Pilot
TBB does not prevent remote memory accesses

SLIDE 14

Evaluation

[Plot: speedup vs. cores (up to 500) for perfect speedup, prealloc (MR only), prealloc, tbbmalloc, reuse, and glibc]

Scalability with various allocators.

SLIDE 15

Evaluation

[Plot: speedup vs. cores for perfect speedup, MPI on a cluster, MPI on the UV, and OpenMP on the UV (prealloc, MR only)]

Comparing scalability: OpenMP vs. explicit message passing

SLIDE 16

Summary

It is not that easy to write scalable code for large SMPs:
large variability of memory access costs on large SMPs
allocators for SMPs help to increase scalability
but they do not prevent remote memory accesses
the programmer needs to keep track of memory location (if possible)
