The Impact of Network Noise on Large-Scale Communication Performance

The Impact of Network Noise on Large-Scale Communication Performance. Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine. Open Systems Lab, Indiana University, Bloomington, USA. Workshop on Large-Scale Parallel Processing/IPDPS'09, Rome, Italy.


  1. The Impact of Network Noise on Large-Scale Communication Performance. Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine. Open Systems Lab, Indiana University, Bloomington, USA. Workshop on Large-Scale Parallel Processing/IPDPS’09, Rome, Italy, May 29th, 2009.

  2. Motivation. Operating system noise is a known phenomenon: local interruptions by daemons, interrupts, ... It is not problematic (<2%) for serial applications. Noise propagation is problematic and can lower application performance significantly. These are pure system issues, often “simple” to solve.

  3. Motivation. Effects in the network can cause similar behavior ⇒ network noise (net noise). Sources are management, filesystem, other application, ... traffic. Such congestion causes delays: it delays optimized communication patterns (collectives), and propagation can lead to further delays. Applications interfering with themselves is not net noise!

  4. Our Approach. OS noise is modelled with statistical or signal-processing methods. Network noise depends on: topology and routing; network technology, buffer policies, sizes, etc.; number of PEs per endpoint (multicore); communication pattern of all endpoints ⇒ not as easy to model. Approach: benchmark + simulation.

  5. Target Architecture. Complex topologies: network noise can easily be avoided in tori/hypercubes (make sure all allocations are convex sets); other topologies (fat tree, Kautz) are not as simple; we focus on fat trees. Random application/application interaction: most common; also models random filesystem traffic. Collective communication patterns (including stencil): most common in HPC scenarios.

  6. Benchmark Method. Create two random communicators with a given ratio and warm them up. Synchronize clocks on all ranks and start the next step synchronously. Perturbed step: one communicator runs a random pattern while the other runs MPI_Bcast(..); the measured time is t_pert. Synchronize clocks on all ranks and start the next step synchronously. Unperturbed step: only MPI_Bcast(..) runs; the measured time is t_nopert.
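The partitioning and the slowdown metric of the benchmark can be sketched without MPI. This is a minimal illustration and not the authors' benchmark code: the function names are hypothetical, the perturbation ratio is assumed to be the fraction of ranks that generate background traffic, and the slowdown is assumed to be the perturbed time expressed as a percentage of the unperturbed time (so 100% means no slowdown, matching the plot axes that start at 100).

```python
import random
import statistics


def split_ranks(num_ranks, perturbation_ratio, seed=None):
    """Randomly partition ranks into a perturbing and a measured group.

    Assumption: `perturbation_ratio` is the fraction of ranks that
    generate background traffic; the slides do not define it precisely.
    """
    if not 0.0 <= perturbation_ratio <= 1.0:
        raise ValueError("perturbation_ratio must be in [0, 1]")
    rng = random.Random(seed)
    ranks = list(range(num_ranks))
    rng.shuffle(ranks)
    num_perturbing = round(num_ranks * perturbation_ratio)
    perturbing = sorted(ranks[:num_perturbing])
    measured = sorted(ranks[num_perturbing:])
    return perturbing, measured


def slowdown_percent(t_pert, t_nopert):
    """Mean perturbed time as a percentage of the mean unperturbed time."""
    return 100.0 * statistics.mean(t_pert) / statistics.mean(t_nopert)
```

In a real run, each group would become its own communicator (for example via a communicator split), and `t_pert` and `t_nopert` would hold the per-measurement timings of the collective with and without the random background pattern.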

  7. Benchmark Results. [Figure: boxplot of time [us] against nodes in collective (2 to 30), no perturbation vs. with perturbation.] Boxplot, 32 nodes, MPI_Allreduce (single MPI_DOUBLE). Open MPI 1.2.8, SDR/IB, 566 node fat tree, FBB.

  8. Benchmark Results. [Figure: slowdown relative to unperturbed run [%] against perturbation ratio (0.2 to 0.8) for Broadcast with 208 nodes, Reduce with 492 nodes, and Allreduce with 492 nodes.] Different collectives, 128 measurements, average plotted. Very high variance (only 128 samples, background load).

  9. Benchmark Results. [Figure: slowdown at ratio of 0.5 [in %] against application communicator size (0 to 250).] Fixed perturbation ratio (0.5). Slowdown with increasing communicator size.

  10. Simulation Methodology. Needs to consider topology and routing. Use IB as a model. Simple linear congestion model. We model collective operations as a set of dependencies. (Collective) level-wise simulation. [Figure: dependency tree over ranks 0 to 7.]
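As an illustration of modelling a collective as a set of dependencies grouped by level, the sketch below builds the send/receive pairs of a binomial tree broadcast. It is an assumption that this matches the pattern used in the simulator; the function name is hypothetical. Every message in level k can only be sent once the sender has received the data in an earlier level.

```python
def binomial_tree_levels(num_ranks, root=0):
    """Return the levels of a binomial tree broadcast.

    Each level is a list of (sender, receiver) pairs that may proceed
    concurrently; level k depends on all earlier levels.
    """
    levels = []
    distance = 1
    while distance < num_ranks:
        level = []
        # Ranks that already hold the data (relative ranks below
        # `distance`) each send to the rank `distance` positions away.
        for rel in range(distance):
            peer = rel + distance
            if peer < num_ranks:
                sender = (rel + root) % num_ranks
                receiver = (peer + root) % num_ranks
                level.append((sender, receiver))
        levels.append(level)
        distance *= 2
    return levels
```

For eight ranks and root 0 this yields three levels: 0 sends to 1, then 0 and 1 send to 2 and 3, then 0 to 3 send to 4 to 7.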

  11. Simulation Methodology. Route every logical link through the network. Record congestion on edges. Annotate the collective graph with maximum path congestion. The longest path from any root node is reported. [Figure: fat tree of 4x4 crossbars connecting endpoints 0 to 15, with per-edge congestion counts, and the annotated collective graph.]
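The steps on this slide can be sketched as follows. This is a simplified illustration under stated assumptions, not the ORCS implementation: `toy_route` is a hypothetical routing function on a toy tree with one top switch (not a full bisection bandwidth fat tree), the cost of a message is assumed to equal the largest number of messages sharing any edge on its route (a linear congestion model), and a sender is assumed to stay busy until its message is delivered.

```python
from collections import Counter

# Binomial tree broadcast over 8 ranks, root 0, grouped by level.
BCAST_LEVELS_8 = [[(0, 1)], [(0, 2), (1, 3)], [(0, 4), (1, 5), (2, 6), (3, 7)]]


def toy_route(src, dst, hosts_per_switch=4):
    """Hypothetical route: host -> leaf switch [-> top -> leaf] -> host."""
    s_leaf, d_leaf = src // hosts_per_switch, dst // hosts_per_switch
    edges = [("h", src, "leaf", s_leaf)]
    if s_leaf != d_leaf:
        edges += [("leaf", s_leaf, "top"), ("top", "leaf", d_leaf)]
    edges.append(("leaf", d_leaf, "h", dst))
    return edges


def level_congestion(level, route, background=None):
    """Annotate each logical link of one level with its maximum edge congestion."""
    usage = Counter(background or {})  # optional perturbing traffic per edge
    paths = {}
    for src, dst in level:
        paths[(src, dst)] = route(src, dst)
        usage.update(paths[(src, dst)])  # record congestion on edges
    return {link: max(usage[e] for e in path) for link, path in paths.items()}


def simulate(levels, route, root=0, background=None):
    """Return the longest congestion-weighted dependency path from the root."""
    finish = {root: 0}
    for level in levels:
        cost = level_congestion(level, route, background)
        start = dict(finish)  # all sends of a level start from pre-level times
        for (src, dst), c in cost.items():
            done = start[src] + c
            finish[src] = max(finish[src], done)  # sender busy until delivery
            finish[dst] = done
    return max(finish.values())
```

With eight hosts on a single leaf switch, `simulate(BCAST_LEVELS_8, lambda s, d: toy_route(s, d, 8))` returns 3, one unit per level. With four hosts per leaf switch, the four messages of the last level share one uplink, so `simulate(BCAST_LEVELS_8, toy_route)` returns 6. Passing a `background` counter adds perturbing traffic to selected edges.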

  12. Simulation Systems. Real-world system inputs (IB network maps): Odin @ IU (128 nodes, FBB fat tree); CHiC @ TUC (566 nodes, FBB fat tree); Atlas @ LLNL (1142 nodes, FBB fat tree); Ranger @ TACC (3908 nodes, FBB fat tree); TBird @ SNL (4391 nodes, 1/2 BB fat tree). ⇒ Your system? Please give us the maps!

  13. Simulation Results. [Figure: slowdown relative to unperturbed run [%] against perturbation ratio (0.3 to 0.7) for Odin (128), CHiC (566), Atlas (1142), Ranger (3908), and TBird (4391).] Binomial tree pattern (small message Bcast, Reduce). CHiC results reflect the microbenchmark accurately!

  14. Real Large-Scale Simulation Results. Simulated large extended generalized fat trees (XGFT). 24 port crossbars, full bisection bandwidth. Fat tree optimized routing (OpenSM). 144 nodes (one level) to 20,736 nodes (three levels). [Figures: above: 144 nodes, below: 1152 nodes.]

  15. Simulation Results. [Figure: slowdown at ratio of 0.5 [in %] against network size (0 to 20000).] Perturbation ratio 0.5, tree pattern. Logarithmic shape reflects CHiC benchmarks!

  16. Conclusions and Future Work. Conclusions: network noise must be considered; significant impact, similar to OS noise; no known real-world analyses yet; network topology and routing are very important. Future Work: good process-to-node mapping could reduce problems; topology-aware communication algorithms; extend analysis to real applications (profiling, tracing); analyze several network topologies and workarounds.

  17. Thanks for your attention! Questions? Download the (research-quality) ORCS simulator at: http://www.unixer.de/ORCS
