Open MPI on the Cray XT


1. Open MPI on the Cray XT
   Presented by Richard L. Graham and Galen Shipman

2. Open MPI Is…
• Open source project / community
• Consolidation and evolution of several prior MPI implementations
• All of MPI-1 and MPI-2
• Production quality
• Vendor-friendly
• Research- and academic-friendly

3. Current Membership
• 14 members, 9 contributors, 1 partner
  − 4 US DOE labs
  − 8 universities
  − 10 vendors
  − 1 individual

4. Some Current Highlights
• Production MPI on SNL’s Thunderbird
• Production MPI on LANL’s Roadrunner
• Working on bringing it up on TACC’s Ranger
• The MPI used for the EU QosCosGrid project: quasi-opportunistic complex system simulations on the grid
• Tightly integrated with VampirTrace (as of version 1.3)

5. Modular Component Architecture
• Framework:
  − API targeted at a specific task, e.g. point-to-point message management, point-to-point transfer layer, collectives, process startup
• Component:
  − An implementation of a framework’s API
• Module:
  − An instance of a component
(Diagram: a user application sits on Open MPI, which hosts frameworks A, B, …; a framework contains components (Comp A, Comp B, …); a component is instantiated as modules (Mod A1, Mod A2, Mod B, …).)
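
As a rough illustration of the framework / component / module layering above, here is a minimal C sketch of a hypothetical point-to-point transfer framework. All struct and field names are invented for this example; they are not Open MPI’s actual MCA headers.

    /* Illustrative only: hypothetical names, not Open MPI's real MCA types.
     * A framework defines the API for one task; a component implements that
     * API; a module is a runtime instance of a component (e.g. one per NIC). */
    #include <stddef.h>

    typedef struct btl_module btl_module_t;

    /* "Component": one implementation of the framework's API. */
    typedef struct {
        const char *name;                        /* e.g. "portals", "tcp", "sm" */
        btl_module_t *(*init)(int *num_modules); /* create the module instances */
    } btl_component_t;

    /* "Module": one instance of a component, carrying its own state. */
    struct btl_module {
        const btl_component_t *component;
        int (*send)(btl_module_t *self, const void *buf, size_t len, int peer);
    };

    /* At startup the framework opens all of its components and keeps the
     * modules usable on this node; several may coexist (e.g. sm + portals). */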

6. Open MPI’s CNL Port
• Portals port from Catamount to CNL
• Enhance point-to-point BTL component
• ALPS support added
• Add process control components for ALPS
• mpirun wraps multiple calls to aprun to:
  − Support MPI-2 dynamic process control (see the sketch after this list)
  − Support recovery from process failure
  − Support an arbitrary number of processes per node (even oversubscription)
• Pick up full MPI 2.0 support
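
The MPI-2 dynamic process control mentioned above is the standard MPI_Comm_spawn interface; a minimal sketch follows. The "./worker" executable name is a placeholder, and on the CNL port each such spawn corresponds to an extra aprun invocation issued by mpirun (see slide 9).

    /* Minimal MPI-2 dynamic process sketch; "./worker" is a placeholder.
     * Under the CNL port, mpirun issues an additional aprun call to start
     * the spawned processes. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm children;
        int errcodes[4];

        MPI_Init(&argc, &argv);

        /* Collectively launch 4 new processes and obtain an
         * intercommunicator connecting parents and children. */
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &children, errcodes);

        /* ... exchange messages with the children over 'children' ... */

        MPI_Comm_free(&children);
        MPI_Finalize();
        return 0;
    }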

7. Modular Component Architecture - Data Transfer
(Diagram: the user application calls into Open MPI’s point-to-point framework, whose Portals, TCP, and InfiniBand components drive the SeaStar, the NICs, and an HCA.)

8. Process Startup on CNL - Start
(Diagram: mpirun invokes aprun, which starts a daemon and the application processes on each allocated node.)

9. Process Startup on CNL - Spawn
(Diagram: an MPI-2 spawn causes mpirun to issue an additional aprun, starting daemons and the newly spawned application processes on further allocated nodes.)

10. Features in Open MPI for Multi-Core Support
• Shared-memory point-to-point communications
  − On par with other network devices
  − Does not use any network resources
• Shared-memory collective optimizations
  − On-host-communicator optimization
  − Hierarchical collectives on the way

11. Hierarchical Collectives
• Exist in the code base (HLRS / University of Houston)
• Need to be tested with the new shared-memory module
• Need to be optimized

12. Collective Communication Pattern - Per Process
(Diagram: the communication phases I–IV of a single process, combining exchanges with on-host peers and traffic over the network.)

13. Collective Communication Pattern - Total
(Diagram: total inter-host traffic: phase I within each host over shared memory (sm), phase II between hosts over the network.)
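
The two-level pattern in the two slides above (on-host exchanges over shared memory, one exchange per host over the network) can be sketched with plain MPI calls as below. Grouping ranks by a hash of the processor name and the reduce / leader-allreduce / broadcast structure are illustrative assumptions, not the actual implementation in the Open MPI code base.

    /* Sketch of the two-level pattern: reduce inside each host over shared
     * memory, allreduce among one leader per host over the network, then
     * broadcast back inside each host.  Grouping ranks by a hash of the
     * processor name is a simplification (hash collisions are ignored). */
    #include <mpi.h>

    static int host_color(void)
    {
        char name[MPI_MAX_PROCESSOR_NAME];
        int len, i;
        unsigned h = 5381;
        MPI_Get_processor_name(name, &len);
        for (i = 0; i < len; i++)
            h = h * 33u + (unsigned char)name[i];
        return (int)(h & 0x7fffffff);
    }

    void hier_allreduce_sum(double *in, double *out, int n, MPI_Comm comm)
    {
        int rank, local_rank;
        MPI_Comm on_host, leaders;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_split(comm, host_color(), rank, &on_host);  /* sm domain */
        MPI_Comm_rank(on_host, &local_rank);

        /* One leader per host (local rank 0) joins the inter-host group. */
        MPI_Comm_split(comm, local_rank == 0 ? 0 : MPI_UNDEFINED, rank,
                       &leaders);

        /* Phase I: reduce onto the on-host leader over shared memory. */
        MPI_Reduce(in, out, n, MPI_DOUBLE, MPI_SUM, 0, on_host);

        /* Phase II: a single allreduce per host over the network. */
        if (leaders != MPI_COMM_NULL) {
            MPI_Allreduce(MPI_IN_PLACE, out, n, MPI_DOUBLE, MPI_SUM, leaders);
            MPI_Comm_free(&leaders);
        }

        /* Fan the result back out inside each host. */
        MPI_Bcast(out, n, MPI_DOUBLE, 0, on_host);
        MPI_Comm_free(&on_host);
    }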

14. Performance Data

15. Ping-Pong 0-Byte MPI Latency: Inter-Node

    MPI / Protocol            Latency (uSec)
    Open MPI / CM             6.18
    Open MPI / OB1            8.65
    Open MPI / OB1 - no ack   7.24
    Cray MPT (3.0.7)          7.44

16. Ping-Pong 0-Byte MPI Latency
• CM
  − 0 bytes: 6.18 uSec
  − 16 bytes: 6.88 uSec
  − 17 bytes: 9.69 uSec (measured on a different system)
• OB1
  − 0 bytes, with ACK: 8.65 uSec
  − 0 bytes, without ACK: 7.24 uSec
  − 1 byte, without ACK: 10.14 uSec (measured on a different system)

17. Ping-Pong 0-Byte MPI Latency: Intra-Node

    MPI / Protocol            Latency (uSec)
    Open MPI / CM
    Open MPI / OB1            0.64
    Open MPI / OB1 - no ack
    Cray MPT (3.0.7)          0.51
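
Latency figures like the ones in the tables above are usually obtained from a 0-byte ping-pong between two ranks, reporting half of the average round-trip time. The following is a generic sketch of such a benchmark, not necessarily the exact code behind these measurements.

    /* Generic 0-byte ping-pong between ranks 0 and 1; reported latency is
     * half the average round-trip time. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int warmup = 1000, iters = 10000;
        int rank, i;
        double t0 = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < warmup + iters; i++) {
            if (i == warmup) {            /* start timing after warm-up */
                MPI_Barrier(MPI_COMM_WORLD);
                t0 = MPI_Wtime();
            }
            if (rank == 0) {
                MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }

        if (rank == 0)
            printf("0-byte latency: %.2f uSec\n",
                   (MPI_Wtime() - t0) / iters / 2.0 * 1.0e6);

        MPI_Finalize();
        return 0;
    }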

18. Ping-Pong Latency Data - Off Host

19. Ping-Pong Data - Off Host

20. Ping-Pong Data - On Host

21. Ping-Pong Bandwidth Data - On Host

22. Barrier - 16 Cores per Host

23. Barrier - 16 Cores per Host - Hierarchical

24. Barrier - XT

25. Shared-Memory Reduction - 16 Processes

26. Reduction - 16-Core Nodes - 8 Bytes

27. Reduction - 16-Core Nodes - 8 Bytes - Hierarchical

28. Shared-Memory Reduction - 16 Processes

29. Reduction - 16-Core Nodes - 512 KBytes

30. Reduction - XT

31. Shared-Memory Allreduce - 16 Processes

32. Allreduce - 16 Cores per Node - 8 Bytes

33. Allreduce - 16 Cores per Node - 8 Bytes - Hierarchical

34. Shared-Memory Allreduce - 16 Processes

35. Allreduce - XT
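
The barrier, reduction, and allreduce results in the preceding slides are the kind of numbers a simple collective timing loop produces: run the operation many times, average the per-call time, and take the maximum over ranks, since the slowest rank bounds a collective. Below is a generic sketch, assuming an 8-byte (one double) allreduce as in the small-message charts; it is not necessarily the harness used for these slides.

    /* Generic collective timing loop (here MPI_Allreduce on one double,
     * i.e. 8 bytes); a sketch, not the harness behind the charts above. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int iters = 1000;
        int rank, i;
        double in = 1.0, out, t, tmax;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);          /* start all ranks together */
        t = MPI_Wtime();
        for (i = 0; i < iters; i++)
            MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        t = (MPI_Wtime() - t) / iters;

        /* The slowest rank determines the cost of the collective. */
        MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("allreduce, 8 bytes: %.2f uSec per call\n", tmax * 1.0e6);

        MPI_Finalize();
        return 0;
    }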
