SLIDE 1

Improving CHARM++ Performance with a NUMA-aware Load Balancer

Laércio Lima Pilla1,2, Christiane Pousa2, Daniel Cordeiro2,3, Abhinav Bhatele4, Philippe O. A. Navaux1, Jean-François Méhaut2, Laxmikant V. Kale4

1 Federal University of Rio Grande do Sul – Porto Alegre, Brazil
2 Grenoble University – Grenoble, France
3 University of São Paulo – São Paulo, Brazil
4 University of Illinois at Urbana-Champaign – Urbana, IL, USA

9th Annual Workshop on CHARM++ and its Applications

SLIDE 2

Summary

How we used NUMA architectural information to build a CHARM++ load balancer and obtained improvements in overall performance.

4/18/2011 Improving Charm++ Performance with a NUMA-aware Load Balancer 2

SLIDE 3

Agenda

  • NUMA
  • Our Load Balancer: NUMALB
  • Experimental Setup
  • Results
  • Concluding Remarks

SLIDE 4

UMA vs. NUMA

Uniform Memory Access (UMA)

  • Centralized shared memory

– Uniform latencies

  • Data placement does not matter

Non-Uniform Memory Access (NUMA)

  • Distributed shared memory

– Non-uniform latencies

  • Data placement matters

[Figure: UMA (processors P, interconnection, one memory) versus NUMA (memories M, processors P, interconnection, single address space)]

SLIDE 5

NUMA

Reduce latencies

[Figure: NUMA machine with memory banks M0 to M5, each local to four cores (C)]

SLIDE 6

NUMA

Reduce contention/improve bandwidth

[Figure: NUMA machine with memory banks M0 to M5, each local to four cores (C)]

SLIDE 7

NUMA

CHARM++ does not consider these characteristics

[Figure: physical organization (memory banks M0 to M5, four cores each) versus CHARM++’s vision (UMA), where all cores share a single memory M]

  • No memory hierarchy
  • No locality

SLIDE 8

Agenda

  • NUMA
  • Our Load Balancer: NUMALB
  • Experimental Setup
  • Results
  • Concluding Remarks

SLIDE 9

Load Balancer

  • Application data – CHARM++ LB framework

– Processor load: execution time
– Chare load: execution time
– Communication graph: size and number of messages

  • NUMA topology – archTopology (our library)

– Core to NUMA node (socket) hierarchy mapping
– NUMA factor

$$\mathrm{NUMA\ factor}(i, j) = \frac{\text{read latency from } i \text{ to } j}{\text{read latency on } i}$$
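The NUMA factor definition above can be illustrated with a small sketch. The latency values and the function names (`numa_factor`, `numa_factor_matrix`) are hypothetical and are not taken from archTopology; the sketch only assumes a matrix of measured read latencies between NUMA nodes.

```python
def numa_factor(latency, i, j):
    """Read latency from node i to node j, divided by the local read latency on i."""
    return latency[i][j] / latency[i][i]


def numa_factor_matrix(latency):
    """NUMA factor for every ordered pair of nodes."""
    n = len(latency)
    return [[numa_factor(latency, i, j) for j in range(n)] for i in range(n)]


# Hypothetical read latencies (in ns) for a two-node machine.
latency_ns = [
    [100.0, 150.0],
    [140.0, 100.0],
]
factors = numa_factor_matrix(latency_ns)
```

By construction the diagonal is 1.0, and larger off-diagonal values mean a higher penalty for remote accesses.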

SLIDE 10

Load Balancer

  • Heuristic

– Task mapping is NP-hard
– No initial assumptions about the application

  • List scheduling

– Put tasks on a priority list ordered by load
– Greedily assign each task to the processor with the smallest cost

  • Improve performance

– by reducing load imbalance
– by reducing remote communication costs
– while avoiding migrations (data movement costs)

SLIDE 11

Load Balancer

  • Cost function

$$\mathrm{cost}(c, p) = \mathrm{load}(p) + \alpha \times \left( \mathrm{rcomm}(c, p) \times \mathrm{NUMA\ factor} - \mathrm{lcomm}(c, p) \right)$$

Where

– c: chare
– p: core
– load(p): load (execution time) on core p
– rcomm(c,p): number of messages sent by chare c to chares on other NUMA nodes
– lcomm(c,p): number of messages sent by chare c to chares on the same NUMA node
– α: communication weight
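As a worked example of the cost function, the sketch below evaluates it for one chare on two candidate cores. All input values (loads, message counts, NUMA factor, and α) are made up for illustration, and the function signature is an assumption rather than NUMALB's actual interface.

```python
def cost(load_p, rcomm, lcomm, numa_factor, alpha):
    """cost(c, p) = load(p) + alpha * (rcomm(c, p) * NUMA factor - lcomm(c, p))."""
    return load_p + alpha * (rcomm * numa_factor - lcomm)


# Hypothetical chare that sends 10 messages per step.
# Core A: more loaded, but 6 of the 10 messages stay on its NUMA node.
cost_a = cost(load_p=10.0, rcomm=4, lcomm=6, numa_factor=1.5, alpha=0.5)
# Core B: less loaded, but only 4 of the 10 messages stay local.
cost_b = cost(load_p=8.0, rcomm=6, lcomm=4, numa_factor=1.5, alpha=0.5)
```

With these numbers core A costs 10.0 and core B costs 10.5, so the more loaded core is still preferred because it keeps more communication local. A smaller α shifts the decision back toward pure load balance.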

SLIDE 12

Load Balancer

NUMALB’s Algorithm

Input: C set of chares, P set of cores, M mapping of chares to cores
Output: M′ mapping of chares to cores

  • 1. M′ ← M
  • 2. while C ≠ ∅ do // for the number of chares
  • 3. c ← v | v ∈ arg max_{u ∈ C} load(u) // take heaviest chare
  • 4. C ← C \ {c}
  • 5. p ← q | q ∈ P ∧ (c, q) ∈ M // get its core
  • 6. load(p) ← load(p) − load(c) // remove its load from its core
  • 7. M′ ← M′ \ {(c, p)} // remove from mapping
  • 8. p′ ← q | q ∈ arg min_{r ∈ P} cost(c, r) // find core with smallest cost
  • 9. load(p′) ← load(p′) + load(c) // add chare load to new core
  • 10. M′ ← M′ ∪ {(c, p′)} // map to new core
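The ten steps above can be sketched in Python as follows. This is a minimal illustration, not the CHARM++ implementation: the data structures and the name `numalb` are hypothetical, and the sketch assumes that each remote message is weighted by the NUMA factor between the candidate core's node and the receiver's node, a detail the slides do not spell out.

```python
def numalb(chare_load, mapping, core_node, messages, numa_factor, alpha):
    """Greedy remapping of chares to cores.

    chare_load:  chare -> load (execution time)
    mapping:     chare -> core (input mapping M)
    core_node:   core -> NUMA node
    messages:    (sender, receiver) -> number of messages
    numa_factor: (node_i, node_j) -> NUMA factor, for i != j
    alpha:       communication weight
    """
    new_mapping = dict(mapping)                        # step 1: M' <- M
    core_load = {core: 0.0 for core in core_node}
    for chare, core in mapping.items():
        core_load[core] += chare_load[chare]

    def cost(c, r):
        comm = 0.0
        for (src, dst), count in messages.items():
            if src != c or dst == c:
                continue
            dst_node = core_node[new_mapping[dst]]
            if dst_node == core_node[r]:
                comm -= count                          # lcomm term
            else:                                      # rcomm term (assumed weighting)
                comm += count * numa_factor[(core_node[r], dst_node)]
        return core_load[r] + alpha * comm

    # Steps 2-4: visit every chare once, heaviest first.
    for c in sorted(chare_load, key=chare_load.get, reverse=True):
        p = new_mapping[c]                             # step 5
        core_load[p] -= chare_load[c]                  # step 6
        best = min(core_node, key=lambda r: cost(c, r))  # step 8
        core_load[best] += chare_load[c]               # step 9
        new_mapping[c] = best                          # steps 7 and 10
    return new_mapping


# Hypothetical input: two cores on two NUMA nodes, all chares start on core 0.
result = numalb(
    chare_load={"a": 5.0, "b": 4.0, "c": 3.0},
    mapping={"a": 0, "b": 0, "c": 0},
    core_node={0: 0, 1: 1},
    messages={("a", "b"): 2, ("b", "a"): 2},
    numa_factor={(0, 1): 2.0, (1, 0): 2.0},
    alpha=1.0,
)
```

In this toy input the two communicating chares `a` and `b` end up together on core 1, while `c`, which sends no messages, stays on core 0.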

SLIDE 13

Agenda

  • NUMA
  • Our Load Balancer: NUMALB
  • Experimental Setup
  • Results
  • Concluding Remarks

SLIDE 14

Experimental Setup

  • 2 NUMA machines
  • 3 CHARM++ benchmarks
  • 4 other CHARM++ load balancers
  • Statistical confidence of 95%

– 5% relative error
– Student’s t-distribution
– Minimum of 25 executions

  • Performance

– Gains: average iteration time (baseline = no LB)
– Costs: load balancing overhead
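One plausible reading of the statistical criterion above is that, after at least 25 executions, the half-width of the 95% confidence interval must stay within 5% of the mean. The sketch below checks this for hypothetical timings; the critical value 2.064 is the two-sided 95% Student's t value for 24 degrees of freedom, and the procedure itself is an assumption, not the authors' script.

```python
import math
import statistics

# Two-sided 95% Student's t critical value for 24 degrees of freedom (n = 25).
T_CRIT_95_DF24 = 2.064


def relative_error(samples, t_crit):
    """Half-width of the confidence interval divided by the sample mean."""
    mean = statistics.mean(samples)
    half_width = t_crit * statistics.stdev(samples) / math.sqrt(len(samples))
    return half_width / mean


# Hypothetical iteration times (in s) from 25 executions.
times = [0.95, 1.05] * 12 + [1.0]
error = relative_error(times, T_CRIT_95_DF24)
accepted = len(times) >= 25 and error <= 0.05
```

If the relative error exceeded 5%, more executions would be needed, with the critical value adjusted to the new number of degrees of freedom.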

SLIDE 15

Experimental Setup: Machines

  • NUMA16

– AMD Opteron
– 8×2 cores @ 2.2 GHz
– 1 MB private L2 cache
– 32 GB main memory
– Low latency for memory access
– Crossbar
– NUMA factor: 1.1–1.5

[Figure: NUMA16 topology, eight NUMA nodes, each with a memory bank (M) and two cores (C) with private L2 caches]

SLIDE 16

Experimental Setup: Machines

  • NUMA32

– Intel Xeon X7560
– 4×8 cores @ 2.27 GHz
– 256 KB private L2
– 24 MB shared L3
– 64 GB main memory
– QuickPath
– NUMA factor: 1.36–3.6

[Figure: NUMA32 topology, four NUMA nodes, each with a memory bank (M), a shared L3 cache, and eight cores (C) with private L2 caches]

SLIDE 17

Experimental Setup: Benchmarks

  • kNeighbor

– Synthetic iterative benchmark where each chare communicates with k other chares at each step
– Completely I/O bound
– 200 chares, 16 KB messages, k = 8

  • lb_test

– Synthetic unbalanced benchmark with different possible communication patterns
– 200 chares, random communication graph, load between 50 and 200 ms

  • jacobi2D

– Unbalanced two-dimensional five-point stencil
– 100 chares, 32² data array

SLIDE 18

Experimental Setup: LBs

  • GREEDYLB

– Iteratively maps the most loaded chares to the least loaded cores

  • RECBIPARTLB

– Recursive bipartition of the communication graph
– Breadth-first traversal until each group reaches the required load

  • METISLB

– Graph partitioning algorithms from METIS

  • SCOTCHLB

– Graph partitioning algorithms from SCOTCH

  • None of them considers the current chare mapping

SLIDE 19

Agenda

  • NUMA
  • Our Load Balancer: NUMALB
  • Experimental Setup
  • Results
  • Concluding Remarks

SLIDE 20

Results: kNeighbor

Average iteration time (in ms), smaller is better

| Load balancer | NUMA16 | NUMA32 |
|---|---|---|
| Baseline | 32.0 | 26.9 |
| NumaLB | 22.4 | 14.9 |
| GreedyLB | 22.1 | 13.5 |
| MetisLB | 21.5 | 17.9 |
| RecBipartLB | 22.8 | 16.9 |
| ScotchLB | 22.1 | 16.1 |

  • NumaLB improvement over the baseline: 30% (NUMA16), 45% (NUMA32)
  • No significant difference among LBs
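The 30% and 45% figures can be reproduced from the baseline and NumaLB averages reported on this slide. The helper name `improvement` is hypothetical; it computes the relative reduction in average iteration time.

```python
def improvement(baseline, value):
    """Relative reduction in iteration time with respect to the baseline."""
    return (baseline - value) / baseline


# (baseline, NumaLB) average iteration times in ms for kNeighbor, from this slide.
kneighbor = {"NUMA16": (32.0, 22.4), "NUMA32": (26.9, 14.9)}

for machine, (baseline, numalb) in kneighbor.items():
    print(f"{machine}: {improvement(baseline, numalb):.0%}")
# prints "NUMA16: 30%" and "NUMA32: 45%"
```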

SLIDE 21

Results: kNeighbor

[Figure: bar chart of average iteration time (in ms) for kNeighbor on NUMA16 and NUMA32, same data as on slide 20; NumaLB improves on the baseline by 30% and 45%]

  • Homogeneous distribution
  • Group chares and migrate them together to the same core
  • Shared cache, faster communication

SLIDE 22

Results: lb_test

Average iteration time (in s)

| Load balancer | NUMA16 | NUMA32 |
|---|---|---|
| Baseline | 1.01 | 0.6 |
| NumaLB | 0.83 | 0.43 |
| GreedyLB | 0.93 | 0.51 |
| MetisLB | 0.84 | 0.46 |
| RecBipartLB | 0.83 | 0.47 |
| ScotchLB | 0.88 | 0.43 |

  • NumaLB improvement over the baseline: 17% (NUMA16), 28% (NUMA32)
  • Best performance by communication-aware LBs
  • Best average performance

SLIDE 23

Results: jacobi2D

Average iteration time (in s)

| Load balancer | NUMA16 | NUMA32 |
|---|---|---|
| Baseline | 1.74 | 0.42 |
| NumaLB | 1.03 | 0.27 |
| GreedyLB | 1.24 | 0.36 |
| MetisLB | 1.31 | 0.4 |
| RecBipartLB | 1.21 | 0.39 |
| ScotchLB | 1.11 | 0.29 |

  • NumaLB improvement over the baseline: 41% (NUMA16), 36% (NUMA32)
  • Best performance. Keeps proximity among chares on a NUMA node scale
  • SCOTCHLB shows similar performance

SLIDE 24

Results: jacobi2D - Projections

  • jacobi2D on NUMA16

– 2 steps before LB
– 4 steps after LB

  • The smaller the idle parts, the higher the efficiency

  • NUMALB: 93.5% efficiency
  • METISLB: 75% efficiency

SLIDE 25

Results: overheads

Average number of chares migrated

| Benchmark | Machine | NUMALB | GREEDYLB | METISLB | RECBIPARTLB | SCOTCHLB |
|---|---|---|---|---|---|---|
| kNeighbor | NUMA16 | 25 | 189 | 188 | 176 | 185 |
| kNeighbor | NUMA32 | 57 | 194 | 195 | 185 | 194 |
| lb_test | NUMA16 | 40 | 188 | 187 | 184 | 184 |
| lb_test | NUMA32 | 48 | 194 | 194 | 192 | 192 |
| jacobi2D | NUMA16 | 26 | 94 | 94 | 91 | 93 |
| jacobi2D | NUMA32 | 33 | 97 | 96 | 93 | 98 |

  • All load balancers took less than 7 ms for their algorithms
  • Maximum migrations with NUMALB = 33% of the chares
  • Minimum migrations with the other LBs = 88% of the chares
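The 33% and 88% figures follow from dividing the migration counts by the number of chares in each benchmark (200 for kNeighbor and lb_test, 100 for jacobi2D, as listed in the benchmark setup). The sketch below is only a cross-check of that arithmetic; the variable names are hypothetical.

```python
CHARES = {"kNeighbor": 200, "lb_test": 200, "jacobi2D": 100}

# (benchmark, machine, NUMALB, GREEDYLB, METISLB, RECBIPARTLB, SCOTCHLB)
ROWS = [
    ("kNeighbor", "NUMA16", 25, 189, 188, 176, 185),
    ("kNeighbor", "NUMA32", 57, 194, 195, 185, 194),
    ("lb_test", "NUMA16", 40, 188, 187, 184, 184),
    ("lb_test", "NUMA32", 48, 194, 194, 192, 192),
    ("jacobi2D", "NUMA16", 26, 94, 94, 91, 93),
    ("jacobi2D", "NUMA32", 33, 97, 96, 93, 98),
]

# Largest fraction of chares migrated by NUMALB in any experiment.
numalb_max = max(row[2] / CHARES[row[0]] for row in ROWS)
# Smallest fraction migrated by any of the other four load balancers.
others_min = min(count / CHARES[row[0]] for row in ROWS for count in row[3:])
```

Here `numalb_max` comes from jacobi2D on NUMA32 (33 of 100 chares) and `others_min` from RECBIPARTLB with kNeighbor on NUMA16 (176 of 200 chares).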

SLIDE 26

Results: migration times for NUMA16

[Figure: average migration time (in s, log scale from 0.001 to 100) versus size of chares (log scale from 1 KB to 100 MB) on NUMA16, for ScotchLB and NumaLB with 100 and 200 chares. Chart annotation: "Similar"]

Speedup of 2.9 200 chares Speedup of 5.3 100 chares Speedup of 7.1 200 chares

SLIDE 27

Agenda

  • NUMA
  • Our Load Balancer: NUMALB
  • Experimental Setup
  • Results
  • Concluding Remarks

SLIDE 28

Conclusions

  • Multi-core machines with NUMA design introduce new challenges for their efficient use
  • CHARM++ does not consider NUMA asymmetries
  • With our NUMA-aware LB we obtained

– An average speedup of 1.51 over the baseline (transparent to the user, no previous knowledge)
– 10% improvement over most LBs
– Migration overheads up to 7 times smaller (migrating at most 33% of all chares)
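As a rough cross-check of the average speedup, the sketch below takes the arithmetic mean of baseline/NumaLB ratios over the six benchmark and machine combinations, using the rounded values from the results slides. Whether the 1.51 figure was computed exactly this way is an assumption.

```python
# (baseline, NumaLB) average iteration times from the results slides.
RESULTS = {
    ("kNeighbor", "NUMA16"): (32.0, 22.4),   # ms
    ("kNeighbor", "NUMA32"): (26.9, 14.9),   # ms
    ("lb_test", "NUMA16"): (1.01, 0.83),     # s
    ("lb_test", "NUMA32"): (0.6, 0.43),      # s
    ("jacobi2D", "NUMA16"): (1.74, 1.03),    # s
    ("jacobi2D", "NUMA32"): (0.42, 0.27),    # s
}

speedups = [baseline / numalb for baseline, numalb in RESULTS.values()]
average_speedup = sum(speedups) / len(speedups)
```

With the rounded chart values the mean comes out at roughly 1.515, in line with the reported 1.51; the small difference is within what rounding of the inputs can explain.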

SLIDE 29

Future Work

  • Multi-core load balancer

– UMA and NUMA machines
– Communication latencies among cores
– Use HWLOC representation of cache hierarchy

  • Distributed multi-core load balancer

– For clusters of multi-core machines

  • Gather and organize communication information

– Latencies, bandwidth
– Provide this data to other libraries (like SCOTCH)

SLIDE 30

Improving CHARM++ Performance with a NUMA-aware Load Balancer

Laércio Lima Pilla

Contact: llpilla@inf.ufrgs.br

9th Annual Workshop on CHARM++ and its Applications

Thank you.