Collective Framework and Performance Optimization to Open MPI for Cray XT5 Platforms

Cray Users Group 2011


SLIDE 1

Collective Framework and Performance Optimization to Open MPI for Cray XT5 Platforms

Cray Users Group 2011

SLIDE 2

Collectives are Critical for HPC Application Performance

  • A large percentage of application execution time is spent in global synchronization operations (collectives)
  • Moving towards exascale systems (a million processor cores), the time spent in collectives only increases
  • Performance and scalability of HPC applications require efficient and scalable collective operations

SLIDE 3

Weaknesses in the current Open MPI implementation

Open MPI lacks support for:

  • Customized collective implementations for arbitrary communication hierarchies
  • Concurrent progress of collectives on different communication hierarchies
  • Nonblocking collectives (see the usage sketch after this list)
  • Taking advantage of the capabilities of recent network interfaces (e.g., offload capabilities)
  • An efficient point-to-point message protocol for Cray XT platforms
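To make the nonblocking-collectives point concrete, here is a minimal usage sketch, assuming an MPI-3 capable library (MPI_Ibarrier was standardized after this work), of how an application overlaps computation with a collective:

/* Sketch: overlapping computation with an MPI-3 nonblocking barrier.
 * do_local_work() stands in for any application computation. */
#include <mpi.h>

static void do_local_work(void) { /* application computation goes here */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Request req;
    MPI_Ibarrier(MPI_COMM_WORLD, &req);  /* start the collective; returns immediately */
    do_local_work();                     /* overlap computation with the synchronization */
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* complete the collective */

    MPI_Finalize();
    return 0;
}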

SLIDE 4

Cheetah: A Framework for Scalable Hierarchical Collectives

Goals of the framework

  • Provide building blocks for implementing collectives for an arbitrary communication hierarchy
  • Support collectives tailored to the communication hierarchy
  • Support both blocking and nonblocking collectives efficiently
  • Enable building collectives customized for the hardware architecture

SLIDE 5

Cheetah Framework: Design principles

  • A collective operation is split into collective primitives over different communication hierarchies
  • Collective primitives over the different hierarchies are allowed to progress concurrently
  • The topology of a collective operation is decoupled from its implementation, enabling the reusability of primitives
  • Design decisions are driven by nonblocking collective design; blocking collectives are a special case of nonblocking ones
  • Uses the Open MPI component architecture
SLIDE 6

Cheetah is Implemented as a Part of Open MPI

[Component diagram: Cheetah adds the BCOL, SBGP, and ML components to Open MPI (OMPI) alongside the existing COLL framework. BCOL: BASESMUMA, IBOFFLOAD, PTPCOLL; SBGP: BASESMUMA, BASESOCKET, IBNET, P2P; COLL: ML, DEFAULT.]

SLIDE 7

Cheetah Components and their Functions

  • Base Collectives (BCOL) – Implements basic collective primitives
  • Subgrouping (SBGP) – Provides rules for grouping the processes
  • Multilevel (ML) – Coordinates collective primitive execution, manages data and control buffers, and maps MPI semantics to BCOL primitives
  • Schedule – Defines the collective primitives that are part of a collective operation
  • Progress Engine – Responsible for starting, progressing, and completing the collective primitives (a simplified sketch of this scheduling model follows)
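A hypothetical sketch of the scheduling model, for illustration only (the types and fields below are not the actual Cheetah data structures): a schedule is an ordered list of primitives, each bound to one subgroup, that a progress loop starts and polls to completion.

/* Hypothetical sketch (not the real Cheetah API): a "schedule" of
 * collective primitives driven by a simple progress loop. */
#include <stdbool.h>
#include <stddef.h>

typedef enum { PRIM_PENDING, PRIM_STARTED, PRIM_DONE } prim_state_t;

typedef struct coll_primitive {
    prim_state_t state;
    bool (*start)(struct coll_primitive *self);    /* returns true once started */
    bool (*progress)(struct coll_primitive *self); /* returns true once complete */
    void *subgroup;   /* SBGP-defined group this primitive runs on */
} coll_primitive_t;

typedef struct {
    coll_primitive_t *prims;   /* primitives in dependency order */
    size_t            n_prims;
} coll_schedule_t;

/* Progress engine: start each primitive and poll it to completion.
 * This simplified loop runs the primitives strictly in order; the real
 * framework can also progress primitives on different hierarchies
 * concurrently. */
void progress_schedule(coll_schedule_t *sched)
{
    size_t done = 0;
    while (done < sched->n_prims) {
        coll_primitive_t *p = &sched->prims[done];
        if (p->state == PRIM_PENDING && p->start(p))
            p->state = PRIM_STARTED;
        if (p->state == PRIM_STARTED && p->progress(p)) {
            p->state = PRIM_DONE;
            done++;
        }
    }
}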

SLIDE 8

BCOL Component – Base collective primitives

  • Provides collective primitives that are optimized for certain communication hierarchies
    – BASESMUMA: shared memory
    – P2P: SeaStar 2+, Ethernet, InfiniBand
    – IBNET: ConnectX-2
  • A collective operation is implemented as a combination of these primitives
    – Example: an n-level Barrier can be a combination of Fanin (first n-1 levels), Barrier (nth level), and Fanout (first n-1 levels); see the sketch below
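A hypothetical sketch of that composition (the level_t type and the bcol_* primitives below are illustrative stand-ins, not the real BCOL API):

/* Illustrative stand-ins for per-level BCOL primitives (not the real API). */
typedef struct {
    int is_member;   /* does this process belong to the level's subgroup? */
} level_t;

void bcol_fanin(level_t *lvl);    /* gather "arrived" notifications toward the level leader */
void bcol_barrier(level_t *lvl);  /* barrier among the top-level leaders */
void bcol_fanout(level_t *lvl);   /* propagate the "released" notification from the leader */

/* n-level barrier: fan-in on the first n-1 levels, barrier on the nth
 * level, then fan-out back down the first n-1 levels. */
void hierarchical_barrier(level_t *levels, int n_levels)
{
    for (int lvl = 0; lvl < n_levels - 1; lvl++)
        if (levels[lvl].is_member)
            bcol_fanin(&levels[lvl]);

    if (levels[n_levels - 1].is_member)
        bcol_barrier(&levels[n_levels - 1]);

    for (int lvl = n_levels - 2; lvl >= 0; lvl--)
        if (levels[lvl].is_member)
            bcol_fanout(&levels[lvl]);
}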

SLIDE 9

SBGP Component – Groups the Processes Based on the Communication Hierarchy

[Diagram: two nodes (Node 1, Node 2), each with CPU sockets containing allocated and unallocated cores. Allocated cores are grouped into socket subgroups with socket group leaders and a per-node UMA subgroup with a UMA group leader; a P2P subgroup spans the nodes.]
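For illustration, a similar two-level grouping (a per-node UMA subgroup plus a P2P subgroup of node leaders) can be sketched with standard MPI-3 calls; this is not the SBGP API, which performs the grouping inside Open MPI and can also split by CPU socket:

/* Sketch: build a per-node (UMA) communicator and a leader (P2P)
 * communicator with standard MPI-3 calls. leader_comm is
 * MPI_COMM_NULL on processes that are not node leaders. */
#include <mpi.h>

void build_hierarchy(MPI_Comm world, MPI_Comm *uma_comm, MPI_Comm *leader_comm)
{
    int world_rank, uma_rank;
    MPI_Comm_rank(world, &world_rank);

    /* UMA subgroup: processes that share a node's memory. */
    MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, uma_comm);
    MPI_Comm_rank(*uma_comm, &uma_rank);

    /* P2P subgroup: one leader per node (UMA rank 0). */
    MPI_Comm_split(world, uma_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, leader_comm);
}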

SLIDE 10

Open MPI Portals BTL optimization

[Diagram: sender and receiver MPI processes; the Portals-level acknowledgment returned for each MPI message is eliminated (crossed out).]

Portals acknowledgments are not required on Cray XT5 platforms, since they use the Basic End-to-End protocol (BEER) for message transfer.

SLIDE 11

Experimental Setup

  • Hardware: Jaguar
    – 18,688 compute nodes
    – 2.6 GHz AMD Opteron (Istanbul)
    – SeaStar 2+ routers connected in a 3D torus topology
  • Benchmarks:
    – Point-to-point: OSU latency and bandwidth
    – Collectives: broadcast in a tight loop, barrier in a tight loop (a sketch of such a loop follows)
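A sketch of the "barrier in a tight loop" measurement; the warm-up and iteration count are illustrative, not the exact harness used:

/* Micro-benchmark sketch: time MPI_Barrier in a tight loop and report
 * the average per-iteration latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 1000;   /* illustrative iteration count */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);           /* warm-up and synchronized start */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("average barrier latency: %.2f usec\n",
               (t1 - t0) / iters * 1e6);

    MPI_Finalize();
    return 0;
}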
SLIDE 12

1-Byte Open MPI P2P Latency is 15% Better than Cray MPI

[Plot: OMPI vs. Cray Portals latency; latency (usec) vs. message size (1 byte to 1 MB). Curves: OMPI with Portals optimization, OMPI without Portals optimization, Cray MPI.]

SLIDE 13

Open MPI and Cray MPI Bandwidth Saturate at ~2 GB/s

[Plot: OMPI vs. Cray Portals bandwidth; bandwidth (MB/s) vs. message size (1 byte to 10 MB). Curves: OMPI with Portals optimization, OMPI without Portals optimization, Cray MPI.]

SLIDE 14

Hierarchical Collective Algorithms

SLIDE 15

Flat Barrier Algorithm

[Diagram: flat barrier over processes 1–4 on Host 1 and Host 2; communication steps 1 and 2 include inter-host communication.]

SLIDE 16

Hierarchical Barrier Algorithm

[Diagram: hierarchical barrier over processes 1–4 on Host 1 and Host 2, proceeding in three steps; inter-host communication is shown between the hosts.]
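Consistent with the Fanin/Barrier/Fanout composition described on the BCOL slide, here is a minimal sketch of such a hierarchical barrier using plain MPI subcommunicators, assuming the three steps are an on-node fan-in, an inter-host exchange among node leaders, and an on-node fan-out (illustrative only; Cheetah uses its own shared-memory and point-to-point primitives rather than MPI calls):

/* Sketch: three-step hierarchical barrier over MPI subcommunicators.
 * node_comm groups the processes on one host; leader_comm groups the
 * per-host leaders and is MPI_COMM_NULL on every other process. */
#include <mpi.h>

void hierarchical_barrier(MPI_Comm node_comm, MPI_Comm leader_comm)
{
    int send = 0, recv = 0;

    /* Step 1: fan-in, on-node processes signal the node leader (rank 0). */
    MPI_Reduce(&send, &recv, 1, MPI_INT, MPI_SUM, 0, node_comm);

    /* Step 2: inter-host exchange among the node leaders. */
    if (leader_comm != MPI_COMM_NULL)
        MPI_Barrier(leader_comm);

    /* Step 3: fan-out, the leader releases the on-node processes. */
    MPI_Bcast(&send, 1, MPI_INT, 0, node_comm);
}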

SLIDE 17

Cheetah's Barrier Collective Outperforms the Cray MPI Barrier by 10%

[Plot: barrier latency (microsec.) vs. number of MPI processes (2,000–14,000). Curves: Cheetah, Cray MPI.]

SLIDE 18

Data Flow in a Hierarchical Broadcast Algorithm

[Diagram: data flow between NODE 1 and NODE 2; "S" marks the source of the broadcast.]

SLIDE 19

Hierarchical Broadcast Algorithms

  • Knownroot Hierarchical Broadcast
    – the suboperations are ordered based on the source of the data
    – the suboperations are started concurrently after the suboperation containing the source of the broadcast has executed
    – uses a k-nomial tree for data distribution (see the sketch after this list)
  • N-ary Hierarchical Broadcast
    – same as the Knownroot algorithm, but uses an N-ary tree for data distribution
  • Sequential Hierarchical Broadcast
    – the suboperations are ordered sequentially; there is no concurrent execution
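As a point of reference for the k-nomial data distribution, here is a sketch of a tree broadcast for radix k = 2 (a binomial tree), written with blocking point-to-point calls; the Cheetah algorithms use a general radix and nonblocking progress, which this sketch does not capture:

/* Sketch: binomial-tree broadcast (k-nomial tree with radix 2). Every
 * process receives the data from its parent, then forwards it to its
 * children in the tree rooted at "root". */
#include <mpi.h>

void knomial_bcast(void *buf, int count, MPI_Datatype dtype,
                   int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int vrank = (rank - root + size) % size;   /* rotate so the root is virtual rank 0 */

    /* Receive from the parent: the rank obtained by clearing the lowest set bit. */
    int mask = 1;
    while (mask < size) {
        if (vrank & mask) {
            int parent = ((vrank - mask) + root) % size;
            MPI_Recv(buf, count, dtype, parent, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }

    /* Forward to children: virtual ranks vrank + mask for decreasing mask. */
    mask >>= 1;
    while (mask > 0) {
        if (vrank + mask < size) {
            int child = (vrank + mask + root) % size;
            MPI_Send(buf, count, dtype, child, 0, comm);
        }
        mask >>= 1;
    }
}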

SLIDE 20

Cheetah's Broadcast Collective Outperforms the Cray MPI Broadcast by 10% (8-Byte Messages)

[Plot: broadcast latency (microsec.) vs. number of MPI processes (5,000–25,000). Curves: Cray MPI, Cheetah three-level known k-nomial, Cheetah three-level known n-ary, Cheetah three-level sequential bcast.]

SLIDE 21

Cheetah's Broadcast Collective Outperforms the Cray MPI Broadcast by 92% (4 KB Messages)

[Plot: broadcast latency (microsec.) vs. number of MPI processes (10,000–50,000). Curves: Cray MPI, Cheetah three-level known k-nomial, Cheetah three-level known NB n-ary, Cheetah three-level known NB k-nomial, Cheetah sequential bcast.]

SLIDE 22

Cheetah's Broadcast Collective Outperforms the Cray MPI Broadcast by 9% (4 MB Messages)

[Plot: broadcast latency (usec) vs. number of MPI processes (5,000–25,000). Curves: Cray MPI, Cheetah three-level known k-nomial, Cheetah three-level known n-ary, Cheetah three-level sequential bcast.]

SLIDE 23

Summary

  • Cheetah's Broadcast is up to 92% better than Cray MPI's Broadcast
  • Cheetah's Barrier outperforms Cray MPI's Barrier by 10%
  • Open MPI point-to-point message latency is 15% better than Cray MPI's (1-byte messages)
  • The keys to the performance and scalability of the collective operations:
    – concurrent execution of sub-operations
    – scalable resource usage techniques
    – asynchronous semantics and progress
    – customized collective primitives for each communication hierarchy

SLIDE 24

Acknowledgements

  • US Department of Energy ASCR FASTOS program
  • National Center for Computational Sciences, ORNL