
Improving Charm++ Performance with a NUMA-aware Load Balancer


  1. 9th Annual Workshop on Charm++ and its Applications
     Improving Charm++ Performance with a NUMA-aware Load Balancer
     Laércio Lima Pilla (1,2), Christiane Pousa (2), Daniel Cordeiro (2,3), Abhinav Bhatele (4), Philippe O. A. Navaux (1), Jean-François Méhaut (2), Laxmikant V. Kale (4)
     (1) Federal University of Rio Grande do Sul – Porto Alegre, Brazil
     (2) Grenoble University – Grenoble, France
     (3) University of São Paulo – São Paulo, Brazil
     (4) University of Illinois at Urbana-Champaign – Urbana, IL, USA

  2. Summary
     How we used NUMA architectural information to build a Charm++ load balancer and improved overall performance.
     /30 4/18/2011 Improving Charm++ Performance with a NUMA-aware Load Balancer 2

  3. Agenda
     • NUMA
     • Our Load Balancer: NumaLB
     • Experimental Setup
     • Results
     • Concluding Remarks

  4. UMA vs. NUMA
     Uniform Memory Access (UMA)
     • Centralized shared memory
       – Uniform latencies
     • Data placement does not matter
     Non-Uniform Memory Access (NUMA)
     • Distributed shared memory
       – Non-uniform latencies
     • Data placement matters
     [Figure: UMA and NUMA organizations, showing processors (P), memories (M), the interconnection, and the address space]

  5. NUMA
     • Reduce latencies
     [Figure: cores (C) grouped around memory banks M0 to M5]

  6. NUMA
     • Reduce contention/improve bandwidth
     [Figure: cores (C) grouped around memory banks M0 to M5]

  7. NUMA
     Charm++ does not consider these characteristics
     • Physical organization
       [Figure: cores (C) grouped around memory banks M0 to M5]
     • Charm++'s vision (UMA)
       – No memory hierarchy
       – No locality
       [Figure: all cores (C) attached to a single memory (M)]

  8. Agenda
     • NUMA
     • Our Load Balancer: NumaLB
     • Experimental Setup
     • Results
     • Concluding Remarks

  9. Load Balancer
     • Application data
       – Charm++ LB framework
       – Processor load: execution time
       – Chare load: execution time
       – Communication graph: size and number of messages
     • NUMA topology
       – archTopology (our library)
       – Core to NUMA node (socket) hierarchy mapping
       – NUMA factor
     $\text{NUMA factor}(i, j) = \dfrac{\text{read latency from } i \text{ to } j}{\text{read latency on } i}$
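The NUMA factor definition above can be illustrated with a small sketch. The function name `numa_factors` and the latency values are assumptions made for this example; they are not part of archTopology or of the measurements in the talk.

```python
def numa_factors(read_latency):
    """Return the NUMA factor matrix for a square latency matrix.

    read_latency[i][j] is the latency for a core on node i to read
    memory that lives on node j (same unit for every entry).
    """
    n = len(read_latency)
    return [
        [read_latency[i][j] / read_latency[i][i] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical latencies in nanoseconds for a two-node machine.
latency_ns = [
    [100.0, 150.0],
    [140.0, 100.0],
]
factors = numa_factors(latency_ns)
```

By construction the diagonal is 1.0 (local access), and larger off-diagonal entries mean more expensive remote reads.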

  10. Load Balancer
      • Heuristic
        – Task mapping is NP-hard
        – No initial assumptions about the application
      • List scheduling
        – Put tasks on a priority list by load
        – Assign tasks to the processor with the smallest cost in a greedy fashion
      • Improve performance
        – by reducing load imbalance
        – by reducing remote communication costs
        – while avoiding migrations (data movement costs)

  11. Load Balancer
      • Cost function
      $\mathrm{cost}(c, p) = \mathrm{load}(p) + \alpha \times \bigl( r_{\mathrm{comm}}(c, p) \times \text{NUMA factor} - l_{\mathrm{comm}}(c, p) \bigr)$
      where
      – $c$: chare
      – $p$: core
      – $\mathrm{load}(p)$: load (execution time) on core $p$
      – $r_{\mathrm{comm}}(c, p)$: number of messages sent by chare $c$ to chares on other NUMA nodes
      – $l_{\mathrm{comm}}(c, p)$: number of messages sent by chare $c$ to chares on the same NUMA node
      – $\alpha$: communication weight
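A minimal sketch of how the cost function could be evaluated follows. The slide does not say which pair of nodes the NUMA factor refers to, so this example assumes each remote message is weighted by the factor between the candidate core's node and the receiver's node. All names (`cost`, `node_of`, `mapping`, `msgs`) and all numbers are hypothetical.

```python
def cost(chare, core, load, node_of, mapping, msgs, factor, alpha):
    """Cost of placing `chare` on `core`, following the slide's formula.

    load[p]        : current load (execution time) of core p
    node_of[p]     : NUMA node that core p belongs to
    mapping[c]     : core currently hosting chare c
    msgs[c][d]     : number of messages chare c sends to chare d
    factor[i][j]   : NUMA factor between nodes i and j (assumed weighting)
    alpha          : communication weight
    """
    node = node_of[core]
    remote = 0.0  # r_comm weighted by the NUMA factor
    local = 0     # l_comm
    for other, count in msgs.get(chare, {}).items():
        other_node = node_of[mapping[other]]
        if other_node == node:
            local += count
        else:
            remote += count * factor[node][other_node]
    return load[core] + alpha * (remote - local)


# Hypothetical machine: cores 0-1 on node 0, cores 2-3 on node 1.
node_of = [0, 0, 1, 1]
load = [2.0, 1.0, 0.5, 0.5]
mapping = {"a": 0, "b": 1, "c": 2}
msgs = {"a": {"b": 10, "c": 4}}
factor = [[1.0, 1.5], [1.5, 1.0]]

near_b = cost("a", 1, load, node_of, mapping, msgs, factor, alpha=0.1)
near_c = cost("a", 2, load, node_of, mapping, msgs, factor, alpha=0.1)
```

In this made-up case core 1 is cheaper than core 2 even though it is more loaded, because most of chare `a`'s messages go to a chare on the same NUMA node.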

  12. Load Balancer: NumaLB's Algorithm
      Input: C set of chares, P set of cores, M mapping of chares to cores
      Output: M′ mapping of chares to cores
      1. M′ ← M
      2. while C ≠ ∅ do (for the number of chares)
      3.   c ← v | v ∈ arg max_{u ∈ C} load(u) (take heaviest chare)
      4.   C ← C \ {c}
      5.   p ← q | q ∈ P ∧ (c, q) ∈ M (get its core)
      6.   load(p) ← load(p) − load(c) (remove its load from its core)
      7.   M′ ← M′ \ {(c, p)} (remove from mapping)
      8.   p′ ← q | q ∈ arg min_{r ∈ P} cost(c, r) (find core with smallest cost)
      9.   load(p′) ← load(p′) + load(c) (add chare load to new core)
      10.  M′ ← M′ ∪ {(c, p′)} (map to new core)
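The steps of the algorithm can be sketched in a few lines of Python. This is an illustration of the list-scheduling loop only, not the Charm++ implementation; `numa_lb` and `load_only_cost` are hypothetical names, and the cost callback here ignores communication to keep the example short.

```python
def numa_lb(chare_load, cores, mapping, cost):
    """Greedy remapping loop following steps 1-10 of the slide.

    chare_load : dict chare -> load (execution time)
    cores      : list of core identifiers
    mapping    : dict chare -> core (current mapping M)
    cost       : callable (chare, core, core_load, new_mapping) -> number
    """
    new_mapping = dict(mapping)                      # step 1
    core_load = {p: 0.0 for p in cores}
    for c, p in mapping.items():
        core_load[p] += chare_load[c]

    # Chare loads never change, so sorting once equals repeated arg max.
    for c in sorted(chare_load, key=chare_load.get, reverse=True):  # steps 2-4
        p = new_mapping.pop(c)                       # steps 5 and 7
        core_load[p] -= chare_load[c]                # step 6
        best = min(cores, key=lambda q: cost(c, q, core_load, new_mapping))  # step 8
        core_load[best] += chare_load[c]             # step 9
        new_mapping[c] = best                        # step 10
    return new_mapping


def load_only_cost(chare, core, core_load, new_mapping):
    # Simplified stand-in for the cost function: load(p) only.
    return core_load[core]


chare_load = {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}
start = {"a": 0, "b": 0, "c": 0, "d": 0}   # everything on core 0
result = numa_lb(chare_load, [0, 1], start, load_only_cost)
```

With this input the sketch moves only chares `a` and `d`, leaving both cores with a load of 5.0; ties are broken by core order, which is a choice of this example.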

  13. Agenda
      • NUMA
      • Our Load Balancer: NumaLB
      • Experimental Setup
      • Results
      • Concluding Remarks

  14. Experimental Setup
      • 2 NUMA machines
      • 3 Charm++ benchmarks
      • 4 other Charm++ load balancers
      • Statistical confidence of 95%
        – 5% relative error
        – Student's t-distribution
        – Minimum of 25 executions
      • Performance
        – Gains: Average iteration time (baseline = no LB)
        – Costs: Load balancing overhead
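One plausible reading of the statistical criterion is that runs are repeated until the 95% confidence interval of the mean, computed with Student's t-distribution, has a half-width within 5% of the mean. The sketch below checks that condition for 25 runs; the sample values are made up, and the critical value 2.064 is the two-sided 95% quantile for 24 degrees of freedom.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for 24 degrees of freedom
# (25 executions); hard-coded because the standard library has no t quantile.
T_95_DF24 = 2.064


def relative_error(samples, t_crit):
    """Half-width of the confidence interval divided by the sample mean."""
    mean = statistics.mean(samples)
    half_width = t_crit * statistics.stdev(samples) / math.sqrt(len(samples))
    return half_width / mean


# Hypothetical iteration times in ms for 25 executions.
times_ms = [22.0 + 0.1 * (i % 5) for i in range(25)]
rel_err = relative_error(times_ms, T_95_DF24)
enough_runs = len(times_ms) >= 25 and rel_err <= 0.05
```

If `enough_runs` were false, more executions would be added and the check repeated with the critical value for the new sample size.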

  15. Experimental Setup: Machines
      • NUMA16
        – AMD Opteron
        – 8×2 cores @ 2.2 GHz
        – 1 MB private L2 cache
        – 32 GB main memory
        – Low latency for memory access
        – Crossbar
        – NUMA factor: 1.1 – 1.5
      [Figure: NUMA16 topology with cores (C), private L2 caches, and memory banks (M)]

  16. Experimental Setup: Machines
      • NUMA32
        – Intel Xeon X7560
        – 4×8 cores @ 2.27 GHz
        – 256 KB private L2
        – 24 MB shared L3
        – 64 GB main memory
        – QuickPath
        – NUMA factor: 1.36 – 3.6
      [Figure: NUMA32 topology with cores (C), private L2 caches, shared L3 caches, and memory banks (M)]

  17. Experimental Setup: Benchmarks
      • kNeighbor
        – Synthetic iterative benchmark where a chare communicates with k other chares at each step
        – Completely I/O bound
        – 200 chares, 16 KB messages, k = 8
      • lb_test
        – Synthetic unbalanced benchmark with different possible communication patterns
        – 200 chares, random communication graph, load between 50 and 200 ms
      • jacobi2D
        – Unbalanced two-dimensional five-point stencil
        – 100 chares, 32² data array
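For readers unfamiliar with the jacobi2D kernel, a five-point stencil update can be sketched as below. This is a serial illustration that assumes the new value is the average of a cell and its four neighbors with fixed boundary cells; it does not reproduce the chare decomposition or the exact update rule of the benchmark.

```python
def jacobi_step(grid):
    """One Jacobi sweep over the interior of a 2D grid (list of lists)."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # boundary cells are copied unchanged
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Five points: the cell itself plus north, south, west, east.
            new[i][j] = 0.2 * (
                grid[i][j]
                + grid[i - 1][j]
                + grid[i + 1][j]
                + grid[i][j - 1]
                + grid[i][j + 1]
            )
    return new


# Hypothetical 3x3 grid: hot boundary (1.0), cold interior (0.0).
grid = [
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
after = jacobi_step(grid)
```

Each cell depends only on its direct neighbors, which is why keeping neighboring chares on the same NUMA node reduces remote communication for this benchmark.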

  18. Experimental Setup: LBs
      • GreedyLB
        – Iteratively maps the most loaded chares to the least loaded cores
      • RecBipartLB
        – Recursive bipartition of the communication graph
        – Breadth-first traversal until a group reaches the required load
      • MetisLB
        – Graph partitioning algorithms from METIS
      • ScotchLB
        – Graph partitioning algorithms from SCOTCH
      • None of them considers the current chare mapping
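The greedy strategy described for GreedyLB can be sketched with a min-heap of core loads. This is a generic illustration of "heaviest chare to least loaded core", not the Charm++ source; `greedy_lb` and the loads are hypothetical.

```python
import heapq


def greedy_lb(chare_load, n_cores):
    """Map chares to cores, heaviest first, always onto the least loaded core."""
    heap = [(0.0, core) for core in range(n_cores)]  # (load, core) pairs
    heapq.heapify(heap)
    mapping = {}
    for chare in sorted(chare_load, key=chare_load.get, reverse=True):
        load, core = heapq.heappop(heap)             # least loaded core
        mapping[chare] = core
        heapq.heappush(heap, (load + chare_load[chare], core))
    return mapping


greedy = greedy_lb({"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}, n_cores=2)
```

The function never looks at where a chare currently runs, so a balanced result can still require migrating most chares, which is the cost NumaLB tries to avoid.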

  19. Agenda
      • NUMA
      • Our Load Balancer: NumaLB
      • Experimental Setup
      • Results
      • Concluding Remarks

  20. Results: kNeighbor
      • No appreciable difference among LBs
      [Figure: bar chart of average iteration time (in ms, smaller is better) on NUMA16 and NUMA32 for Baseline, NumaLB, GreedyLB, MetisLB, RecBipartLB, and ScotchLB; annotated improvements of 30% and 45%]

  21. Results: kNeighbor
      • Homogeneous distribution
      • Shared cache, faster communication
      • Group chares and migrate them together to the same core
      [Figure: bar chart of average iteration time (in ms) on NUMA16 and NUMA32 for Baseline, NumaLB, GreedyLB, MetisLB, RecBipartLB, and ScotchLB; annotated improvements of 30% and 45%]

  22. Results: lb_test
      • Best performance by communication-aware LBs
      • Best average performance
      [Figure: bar chart of average iteration time (in s) on NUMA16 and NUMA32 for Baseline, NumaLB, GreedyLB, MetisLB, RecBipartLB, and ScotchLB; annotated improvements of 17% and 28%]

  23. Results: jacobi2D
      • Best performance
      • Keeps proximity among chares on a NUMA node scale
      • ScotchLB shows similar performance
      [Figure: bar chart of average iteration time (in s) on NUMA16 and NUMA32 for Baseline, NumaLB, GreedyLB, MetisLB, RecBipartLB, and ScotchLB; annotated improvements of 41% and 36%]

  24. Results: jacobi2D - Projections
      • jacobi2D on NUMA16
        – 2 steps before LB
        – 4 steps after LB
      • MetisLB: 75% efficiency
      • NumaLB: 93.5% efficiency
      • The smaller the idle parts, the higher the efficiency
      [Figure: Projections views for MetisLB and NumaLB]
