SLIDE 1

Open MPI on the Cray XT

presented by Richard L. Graham and Galen Shipman

SLIDE 2

Open MPI Is…

  • Open source project / community
  • Consolidation and evolution of several prior MPI implementations

  • All of MPI-1 and MPI-2
  • Production quality
  • Vendor-friendly
  • Research- and academic-friendly

SLIDE 3

Current Membership

  • 14 members, 9 contributors, 1 partner
    − 4 US DOE labs
    − 8 universities
    − 10 vendors
    − 1 individual

SLIDE 4

Some Current Highlights

  • Production MPI on SNL’s Thunderbird
  • Production MPI on LANL’s Roadrunner
  • Working on bringing it up on TACC (Ranger)
  • The MPI used for the EU QosCosGrid: Quasi-Opportunistic Complex System Simulations on Grid
  • Tightly integrated with VampirTrace (v1.3)

SLIDE 5

Modular Component Architecture

  • Framework:
    − API targeted at a specific task
      • PTP message management
      • PTP transfer layer
      • Collectives
      • Process startup …
  • Component:
    − An implementation of a framework’s API
  • Module:
    − An instance of a component

[Diagram: a User application sits on Open MPI; Open MPI contains Frameworks A, B, …; each framework contains Components (Comp A, Comp B, …); each component provides Modules (Mod A1, Mod A2, Mod B, …).]
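
A framework is essentially a table of operations for one task, a component is one implementation of that table, and a module is a run-time instance of a component. The C sketch below illustrates this pattern only; the type and function names are invented for the example and are not Open MPI's actual MCA interfaces.

    /* Hypothetical sketch of the framework/component/module pattern.
       Names are invented for illustration; not real Open MPI MCA types. */
    #include <stdio.h>
    #include <string.h>

    /* "Framework": the API every point-to-point component must implement. */
    typedef struct {
        const char *name;                         /* e.g. "portals", "tcp", "sm" */
        int (*send)(int peer, const void *buf, int len);
    } ptp_component_t;

    /* Two "components": alternative implementations of the same framework API. */
    static int portals_send(int peer, const void *buf, int len)
    { (void)buf; printf("portals: send %d bytes to %d\n", len, peer); return 0; }

    static int sm_send(int peer, const void *buf, int len)
    { (void)buf; printf("shared memory: send %d bytes to %d\n", len, peer); return 0; }

    static const ptp_component_t components[] = {
        { "portals", portals_send },
        { "sm",      sm_send      },
    };

    /* A "module" would be a per-device instance of a component; here we just
       pick a component by name, the way an MCA parameter selects one at run time. */
    int main(void)
    {
        const char *wanted = "portals";
        for (unsigned i = 0; i < sizeof components / sizeof components[0]; i++)
            if (strcmp(components[i].name, wanted) == 0)
                return components[i].send(1, "hello", 5);
        return 1;
    }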

SLIDE 6

Open MPI’s CNL Port

  • Portals port from Catamount to CNL
  • Enhance Point-to-Point BTL component
  • ALPS support added
  • Add process control components for ALPS
  • mpirun wraps multiple calls to APRUN to:
    − Support MPI-2 dynamic process control
    − Support recovery from process failure
    − Support an arbitrary number of procs per node (even oversubscribe)
  • Pick up full MPI 2.0 support

SLIDE 7

Modular Component Architecture - Data Transfer

[Diagram: the User application calls into Open MPI; the Point-to-Point framework selects among transfer components such as TCP (driving NIC 0 and NIC 1), InfiniBand (driving the HCA), and Portals (driving the SeaStar).]
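
Which transfer component is used can be chosen at run time through MCA parameters rather than by rebuilding; for example, an invocation along the lines of mpirun --mca btl portals,sm,self -np 4 ./app restricts the point-to-point layer to the Portals, shared-memory, and self components. Component names and defaults vary by Open MPI release, so this command is illustrative rather than exact.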

SLIDE 8

Process Startup on CNL - Start

[Diagram: MPIRUN invokes APRUN to start a Daemon on each allocated node; the Daemons then launch the App processes on their nodes.]

SLIDE 9

Process Startup on CNL - Spawn

[Diagram: for MPI-2 spawn, MPIRUN issues an additional APRUN call that starts new Daemons on the allocated nodes; those Daemons launch the newly spawned App processes alongside the existing ones.]
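
This spawn path is what backs MPI-2 dynamic process creation at the application level. As a minimal illustration (not code from the talk; the "./worker" name and the process count are placeholders), a parent program requests additional ranks like this:

    /* Minimal MPI-2 dynamic process creation sketch (illustrative). */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm children;
        int errcodes[4];

        MPI_Init(&argc, &argv);

        /* Ask the runtime (mpirun, which calls aprun again on CNL) to start
           4 more processes running "./worker"; parent and children end up
           connected by an intercommunicator. */
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &children, errcodes);

        /* Parent and children can now communicate over 'children',
           e.g. with point-to-point or collective calls. */
        MPI_Comm_free(&children);
        MPI_Finalize();
        return 0;
    }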

SLIDE 10

Features in Open MPI for Multi-Core Support

  • Shared Memory point-to-point communications
    − On par with other network devices
    − Does not use any network resources
  • Shared Memory Collective optimizations
    − On-host-communicator optimization
    − Hierarchical collectives on the way

SLIDE 11

Hierarchical Collectives

  • Exist in the code base (HLRS/University of Houston); the basic pattern is sketched below
  • Need to be tested with the new shared-memory module
  • Need to be optimized
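
As a hedged sketch of the hierarchical idea (illustrative; not the HLRS/University of Houston code): processes on the same host first combine their data over a node-local communicator, only the per-node leaders exchange data across the network, and the result is broadcast back within each node. The MPI-3 call MPI_Comm_split_type is used for brevity and postdates this talk.

    /* Hierarchical allreduce sketch (illustrative; not the actual component). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Step 1: group the processes that share a host. */
        MPI_Comm node;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node);
        int node_rank;
        MPI_Comm_rank(node, &node_rank);

        /* Step 2: one leader per host joins an inter-host communicator. */
        MPI_Comm leaders;
        MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                       rank, &leaders);

        double val = rank, node_sum = 0.0, total = 0.0;

        /* Step 3: reduce inside the host (shared memory), allreduce among the
           leaders only (network), then broadcast the result back on each host. */
        MPI_Reduce(&val, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node);
        if (leaders != MPI_COMM_NULL) {
            MPI_Allreduce(&node_sum, &total, 1, MPI_DOUBLE, MPI_SUM, leaders);
            MPI_Comm_free(&leaders);
        }
        MPI_Bcast(&total, 1, MPI_DOUBLE, 0, node);

        if (rank == 0) printf("sum of ranks = %g\n", total);
        MPI_Comm_free(&node);
        MPI_Finalize();
        return 0;
    }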

SLIDE 12

Collective Communication Pattern - per process

[Diagram: per-process collective pattern; the processes on each host (I, II, III, IV) all communicate individually over the network.]

SLIDE 13

Collective Communication Pattern - Total Interhost traffic

[Diagram: with shared-memory (sm) aggregation, the processes on each host combine their contributions locally first, so only the aggregated per-host traffic (I, II) crosses the network.]

SLIDE 14

Performance Data

SLIDE 15

Ping-Pong 0 byte MPI latency : Inter-node

MPI / Protocol             Latency (uSec)
Cray MPT (3.0.7)           7.44
Open MPI / OB1 - no ack    7.24
Open MPI / OB1             8.65
Open MPI / CM              6.18

SLIDE 16

Ping-Pong 0 byte MPI latency

CM
  0 Bytes  - 6.18 uSec
  16 Bytes - 6.88 uSec
  17 Bytes - 9.69 uSec (measured on a different system)

OB1
  0 Bytes - With ACK:    8.65 uSec
  0 Bytes - Without ACK: 7.24 uSec
  1 Byte  - Without ACK: 10.14 uSec (measured on a different system)
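
Numbers of this kind come from a standard ping-pong microbenchmark: two ranks bounce a message back and forth and report half the average round-trip time. A minimal sketch follows (illustrative; not the benchmark actually used for these measurements):

    /* 0-byte ping-pong latency sketch (illustrative). Run with 2 ranks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int iters = 10000;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        /* One-way latency is half the average round trip. */
        if (rank == 0)
            printf("0-byte latency: %.2f uSec\n", (t1 - t0) / iters / 2 * 1e6);

        MPI_Finalize();
        return 0;
    }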

SLIDE 17

Ping-Pong 0 byte MPI latency : Intra-node

MPI / Protocol             Latency (uSec)
Cray MPT (3.0.7)           0.51
Open MPI / OB1 - no ack
Open MPI / OB1             0.64
Open MPI / CM

SLIDE 18

Ping-Pong Latency Data - Off Host

SLIDE 19

Ping-Pong Data - Off Host

SLIDE 20

Ping-Pong Data - On Host

SLIDE 21

Ping-Pong Bandwidth Data - On Host

SLIDE 22

Barrier - 16 cores per host

SLIDE 23

Barrier - 16 cores per host - Hierarchical

SLIDE 24

Barrier - XT

SLIDE 25

Shared-Memory Reduction - 16 processes

SLIDE 26

Reduction - 16 core nodes - 8 Bytes

SLIDE 27

Reduction - 16 core nodes - 8 Bytes - Hierarchical

SLIDE 28

Shared-Memory Reduction - 16 Processes

SLIDE 29

Reduction - 16 core nodes - 512 KBytes

SLIDE 30

Reduction - XT

SLIDE 31

Shared Memory Allreduce - 16 processes

SLIDE 32

Allreduce - 16 cores per node - 8 Bytes

SLIDE 33

Allreduce - 16 cores per node - 8 Bytes - Hierarchical

SLIDE 34

Shared Memory Allreduce - 16 processes

SLIDE 35

Allreduce - XT