SLIDE 1
Open MPI on the Cray XT
Presented by Richard L. Graham and Galen Shipman
SLIDE 2
Open MPI Is…
- Open source project / community
- Consolidation and evolution of several prior MPI implementations
- All of MPI-1 and MPI-2
- Production quality
- Vendor-friendly
- Research- and academic-friendly
SLIDE 3
Current Membership
- 14 members, 9 contributors, 1 partner
− 4 US DOE labs
− 8 universities
− 10 vendors
− 1 individual
SLIDE 4
Some Current Highlights
- Production MPI on SNL’s Thunderbird
- Production MPI on LANL’s Roadrunner
- Work under way to bring it up on TACC’s Ranger
- The MPI used for the EU QosCosGrid: Quasi-Opportunistic Complex System Simulations on Grid
- Tightly integrated with VampirTrace (as of v1.3)
SLIDE 5
Modular Component Architecture
- Framework:
  − An API targeted at a specific task, e.g.:
    - PTP message management
    - PTP transfer layer
    - Collectives
    - Process startup …
- Component:
  − An implementation of a framework’s API
- Module:
  − An instance of a component
[Diagram: a user application calls into Open MPI, which hosts frameworks A, B, …; each framework contains components (Comp A, Comp B, …), and each component is instantiated as one or more modules (Mod A1, Mod A2, Mod B, …).]
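At run time, a component within a framework is selected through MCA parameters. A minimal sketch of such an invocation, choosing the ob1 point-to-point component and restricting the transfer layer to the shared-memory, Portals, and loopback components (./app is a placeholder executable):

    mpirun --mca pml ob1 --mca btl sm,portals,self -np 16 ./app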
SLIDE 6
Open MPI’s CNL Port
- Portals port from Catamount to CNL
  − Enhance the point-to-point BTL component
- ALPS support added
  − Process control components added for ALPS
  − mpirun wraps multiple calls to aprun to:
    - Support MPI-2 dynamic process control (sketched below)
    - Support recovery from process failure
    - Support an arbitrary number of procs per node (even oversubscription)
- Pick up full MPI-2.0 support
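A minimal sketch of the MPI-2 dynamic process support this enables, assuming a hypothetical worker executable named ./worker; on CNL, each such spawn is backed by a further aprun invocation from mpirun:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm children;

        MPI_Init(&argc, &argv);

        /* Ask the runtime to start 4 additional processes. */
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &children,
                       MPI_ERRCODES_IGNORE);

        /* ... exchange data with the children over the resulting
           intercommunicator ... */

        MPI_Comm_disconnect(&children);
        MPI_Finalize();
        return 0;
    }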
SLIDE 7
Modular Component Architecture - Data Transfer
[Diagram: the user application calls Open MPI’s point-to-point framework, which selects among transfer components (TCP, InfiniBand, Portals, …); each component drives its devices: NIC 0 and NIC 1 for TCP, an HCA for InfiniBand, and the SeaStar for Portals.]
SLIDE 8
Process Startup on CNL - Start
[Diagram: mpirun invokes aprun, which starts a daemon on each allocated node; the daemons then launch the application processes.]
SLIDE 9
Process Startup on CNL - Spawn
[Diagram: on an MPI-2 spawn, mpirun issues a second aprun, starting daemons and the newly spawned application processes on additional allocated nodes alongside the original set.]
SLIDE 10
Features in Open MPI for Multi-Core Support
- Shared-memory point-to-point communications
  − On par with other network devices
  − Does not use any network resources
- Shared-memory collective optimizations
  − On-host-communicator optimization
  − Hierarchical collectives on the way
SLIDE 11
Hierarchical Collectives
- Exist in the code base (HLRS / University of Houston)
- Need to be tested with the new shared-memory module
- Need to be optimized (a plain-MPI sketch of the two-level idea follows)
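The idea behind these hierarchical collectives can be sketched with plain MPI-2 calls: reduce on-host first (through shared memory), let one leader per host cross the network, then fan the result back out. The hostname-hash split below is an assumption for illustration; the real modules discover the topology internally.

    #include <mpi.h>
    #include <unistd.h>

    /* Map each host name to a nonnegative color (illustrative only;
       hash collisions across hosts are ignored in this sketch). */
    static int host_color(void)
    {
        char name[256];
        unsigned h = 5381;
        char *p;
        gethostname(name, sizeof(name));
        for (p = name; *p; p++)
            h = h * 33u + (unsigned char)*p;
        return (int)(h & 0x7fffffff);
    }

    /* Two-level allreduce: on-host reduce, inter-host allreduce
       among the per-host leaders, then an on-host broadcast. */
    void hier_allreduce_sum(double *val, MPI_Comm comm)
    {
        MPI_Comm onhost, leaders;
        int rank, hrank;
        double partial = 0.0, result = 0.0;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_split(comm, host_color(), rank, &onhost);
        MPI_Comm_rank(onhost, &hrank);

        /* Stage 1: shared-memory reduction, no network traffic. */
        MPI_Reduce(val, &partial, 1, MPI_DOUBLE, MPI_SUM, 0, onhost);

        /* Stage 2: only the host leaders talk over the network. */
        MPI_Comm_split(comm, hrank == 0 ? 0 : MPI_UNDEFINED, rank,
                       &leaders);
        if (leaders != MPI_COMM_NULL) {
            MPI_Allreduce(&partial, &result, 1, MPI_DOUBLE, MPI_SUM,
                          leaders);
            MPI_Comm_free(&leaders);
        }

        /* Stage 3: fan the result back out within each host. */
        MPI_Bcast(&result, 1, MPI_DOUBLE, 0, onhost);
        *val = result;
        MPI_Comm_free(&onhost);
    }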
SLIDE 12
Collective Communication Pattern - per process
[Diagram: four hosts with four processes each; in the flat, per-process pattern every process sends its own collective traffic across the network (stages I-IV).]
SLIDE 13
Collective Communication Pattern - Total Interhost traffic
[Diagram: with shared memory (sm), each host first combines its four processes’ contributions locally, so only one message per host crosses the network (stages I-II); a factor-of-four reduction in inter-host traffic for this 4x4 example.]
SLIDE 14
Performance Data
SLIDE 15
Ping-Pong 0 byte MPI latency : Inter-node
Latency (uSec) by MPI / protocol (benchmark sketched below):
- Cray MPT (3.0.7): 7.44
- Open MPI / OB1, no ack: 7.24
- Open MPI / OB1: 8.65
- Open MPI / CM: 6.18
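A minimal sketch of the kind of ping-pong benchmark behind these numbers (the iteration count and output format are illustrative assumptions):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i, iters = 10000;
        double t0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {          /* rank 0 sends first */
                MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {   /* rank 1 echoes */
                MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)  /* one-way latency = half the round trip */
            printf("0-byte latency: %.2f uSec\n",
                   (MPI_Wtime() - t0) * 1e6 / (2.0 * iters));

        MPI_Finalize();
        return 0;
    }

Switching between the OB1 and CM paths is then just a matter of the MCA parameters shown earlier.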
SLIDE 16
Ping-Pong 0 byte MPI latency
- CM
  − 0 bytes: 6.18 uSec
  − 16 bytes: 6.88 uSec
  − 17 bytes: 9.69 uSec (measured on a different system)
- OB1
  − 0 bytes, with ACK: 8.65 uSec
  − 0 bytes, without ACK: 7.24 uSec
  − 1 byte, without ACK: 10.14 uSec (measured on a different system)
SLIDE 17
Ping-Pong 0 byte MPI latency : Intra-node
Latency (uSec) by MPI / protocol:
- Cray MPT (3.0.7): 0.51
- Open MPI / OB1: 0.64
(Open MPI / OB1, no ack, and Open MPI / CM appear without values)
SLIDE 18
Ping-Pong Latency Data - Off Host
SLIDE 19
Ping-Pong Data - Off Host
SLIDE 20
Ping-Pong Data - On Host
SLIDE 21
Ping-Pong Bandwidth Data - On Host
SLIDE 22
Barrier - 16 cores per host
SLIDE 23
Barrier - 16 cores per host - Hierarchical
SLIDE 24
Barrier - XT
SLIDE 25
Shared-Memory Reduction - 16 processes
SLIDE 26
Reduction - 16 core nodes - 8 Bytes
SLIDE 27
Reduction - 16 core nodes - 8 Bytes - Hierarchical
SLIDE 28
Shared-Memory Reduction - 16 Processes
SLIDE 29
Reduction - 16 core nodes - 512 KBytes
SLIDE 30
Reduction - XT
SLIDE 31
Shared Memory Allreduce - 16 processes
SLIDE 32
Allreduce - 16 cores per node - 8 Bytes
SLIDE 33
Allreduce - 16 cores per node - 8 Bytes - Hierarchical
SLIDE 34
Shared Memory Allreduce - 16 processes
SLIDE 35