Collective Framework and Performance Optimization to Open MPI for Cray XT5 Platforms

Collective Framework and Performance Optimization to Open MPI for Cray XT5 Platforms - PowerPoint PPT Presentation

Collective Framework and Performance Optimization to Open MPI for Cray XT5 Platforms. Cray Users Group 2011.


  1. Collective Framework and Performance Optimization to Open MPI for Cray XT5 Platforms
     Cray Users Group 2011
     Managed by UT-Battelle for the Department of Energy

  2. Collectives are Critical for HPC Application Performance
     • A large percentage of application execution time is spent in global synchronization operations (collectives)
     • Moving toward exascale systems (millions of processor cores), the time spent in collectives only increases
     • The performance and scalability of HPC applications require efficient and scalable collective operations

  3. Weaknesses in the Current Open MPI Implementation
     Open MPI lacks support for:
     • Customized collective implementations for arbitrary communication hierarchies
     • Concurrent progress of collectives on different communication hierarchies
     • Nonblocking collectives (see the sketch below)
     • Taking advantage of the capabilities of recent network interfaces (for example, offload capabilities)
     • An efficient point-to-point message protocol for Cray XT platforms
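
The nonblocking-collective gap is the easiest of these to picture. The sketch below is an illustration only, not something Open MPI supported at the time: it uses MPI_Ibcast, which was standardized later in MPI-3, purely to show the start/overlap/complete usage model that such support enables.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    double buf[1024];
    double local = 0.0;
    MPI_Request req;
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        for (i = 0; i < 1024; i++) buf[i] = (double)i;

    /* Start the broadcast; the call returns immediately. */
    MPI_Ibcast(buf, 1024, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);

    /* Overlap: computation that does not depend on the broadcast data. */
    for (i = 0; i < 1000000; i++)
        local += (double)i * 1e-9;

    /* Complete the collective only when the data is actually needed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("overlap result %f, buf[1] = %f\n", local, buf[1]);
    MPI_Finalize();
    return 0;
}
```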

  4. Cheetah: A Framework for Scalable Hierarchical Collectives
     Goals of the framework:
     • Provide building blocks for implementing collectives for arbitrary communication hierarchies
     • Support collectives tailored to the communication hierarchy
     • Support both blocking and nonblocking collectives efficiently
     • Enable building collectives customized for the hardware architecture

  5. Cheetah Framework: Design Principles
     • A collective operation is split into collective primitives over the different communication hierarchies
     • Collective primitives over the different hierarchies are allowed to progress concurrently
     • The topology of a collective operation is decoupled from its implementation, enabling the reuse of primitives
     • Design decisions are driven by the nonblocking collective design; blocking collectives are a special case of nonblocking ones (see the sketch below)
     • Built on the Open MPI component architecture
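
A toy sketch of the blocking-as-a-special-case principle, using hypothetical coll_* names rather than Cheetah's real API: a blocking barrier is simply the nonblocking start followed by a progress loop until completion.

```c
#include <stdio.h>

/* Toy illustration (not the actual Cheetah API): a blocking collective is
 * just a nonblocking start plus a progress loop until completion. */
typedef struct {
    int steps_left;   /* stand-in for outstanding schedule steps */
} coll_request_t;

static void coll_ibarrier_start(coll_request_t *req) { req->steps_left = 3; }
static void coll_progress(coll_request_t *req)       { if (req->steps_left > 0) req->steps_left--; }
static int  coll_is_done(const coll_request_t *req)  { return req->steps_left == 0; }

/* Blocking barrier = nonblocking start + progress until complete. */
static void coll_barrier_blocking(void)
{
    coll_request_t req;

    coll_ibarrier_start(&req);        /* post the nonblocking barrier */
    while (!coll_is_done(&req))
        coll_progress(&req);          /* drive the schedule forward   */
}

int main(void)
{
    coll_barrier_blocking();
    puts("barrier complete");
    return 0;
}
```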

  6. Cheetah Is Implemented as Part of Open MPI
     [Component diagram: the Cheetah components ML, BCOL, and SBGP plug into Open MPI's OMPI and COLL layers; the BCOL and SBGP instances shown include BASESMUMA, BASESOCKET, PTPCOLL, IBOFFLOAD, P2P, IBNET, and DEFAULT.]

  7. Cheetah Components and Their Functions
     • Base Collectives (BCOL) – implements the basic collective primitives
     • Subgrouping (SBGP) – provides the rules for grouping the processes
     • Multilevel (ML) – coordinates collective primitive execution, manages data and control buffers, and maps MPI semantics to BCOL primitives
     • Schedule – defines the collective primitives that make up a collective operation
     • Progress Engine – responsible for starting, progressing, and completing the collective primitives (see the sketch below)
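
The sketch below illustrates how a Schedule and the Progress Engine might fit together: the schedule holds the ordered primitives, and the progress engine advances one primitive at a time. All names and types here are hypothetical stand-ins, not Cheetah's actual data structures.

```c
#include <stdio.h>

#define COLL_DONE        0
#define COLL_INPROGRESS  1

/* One BCOL-style primitive step: returns COLL_DONE when it has finished. */
typedef int (*bcol_primitive_fn)(void *args);

/* A Schedule: the primitives that make up one collective, in order. */
typedef struct {
    bcol_primitive_fn *prims;    /* primitives in execution order           */
    void             **args;     /* per-primitive arguments (e.g. subgroup) */
    int                n_prims;
    int                current;  /* primitive currently being progressed    */
} coll_schedule_t;

/* Progress engine step: advance the current primitive, move on when done. */
static int schedule_progress(coll_schedule_t *s)
{
    if (s->current < s->n_prims &&
        s->prims[s->current](s->args[s->current]) == COLL_DONE)
        s->current++;
    return (s->current == s->n_prims) ? COLL_DONE : COLL_INPROGRESS;
}

/* Dummy primitives standing in for real shared-memory / p2p BCOL steps. */
static int fanin_prim(void *a)  { (void)a; puts("fanin");  return COLL_DONE; }
static int fanout_prim(void *a) { (void)a; puts("fanout"); return COLL_DONE; }

int main(void)
{
    bcol_primitive_fn prims[] = { fanin_prim, fanout_prim };
    void *args[] = { NULL, NULL };
    coll_schedule_t sched = { prims, args, 2, 0 };

    while (schedule_progress(&sched) != COLL_DONE)
        ;   /* a real progress engine would poll many schedules here */
    return 0;
}
```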

  8. BCOL Component – Base Collective Primitives
     • Provides collective primitives that are optimized for particular communication hierarchies:
       – BASESMUMA: shared memory
       – P2P: SeaStar 2+, Ethernet, InfiniBand
       – IBNET: ConnectX-2
     • A collective operation is implemented as a combination of these primitives
       – For example, an n-level Barrier can be a combination of Fanin (first n-1 levels), Barrier (nth level), and Fanout (first n-1 levels), as sketched below
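
As a small illustration of that composition (hypothetical types, not Cheetah's BCOL interface), the following sketch assembles the ordered primitive list for an n-level barrier.

```c
#include <stdio.h>

/* Illustrative only: the primitive order for an n-level barrier, following
 * the composition above (Fanin over the first n-1 levels, a Barrier at the
 * top level, Fanout back down).  Not Cheetah's real types. */
enum prim { FANIN, BARRIER, FANOUT };
static const char *prim_name[] = { "fanin", "barrier", "fanout" };

/* Fill order[] with the primitive run at each step; returns 2*(n-1) + 1. */
static int build_barrier_schedule(int n_levels, enum prim *order)
{
    int step = 0, lvl;

    for (lvl = 0; lvl < n_levels - 1; lvl++)    /* go up the hierarchy */
        order[step++] = FANIN;
    order[step++] = BARRIER;                    /* top-level barrier   */
    for (lvl = n_levels - 2; lvl >= 0; lvl--)   /* release back down   */
        order[step++] = FANOUT;
    return step;
}

int main(void)
{
    enum prim order[16];
    int i, steps = build_barrier_schedule(3, order);  /* e.g. socket, node, network */

    for (i = 0; i < steps; i++)
        printf("step %d: %s\n", i, prim_name[order[i]]);
    return 0;
}
```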

  9. SBGP Component – Groups the Processes Based on the Communication Hierarchy
     [Figure: two nodes, each with CPU sockets containing allocated and unallocated cores; socket subgroups with socket group leaders, a UMA subgroup per node with a UMA group leader, and a P2P subgroup connecting the nodes.]
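
A hedged way to reproduce this kind of grouping with plain MPI is sketched below. MPI_Comm_split_type (an MPI-3 call) is used only to mimic the on-node UMA subgroup; Cheetah's SBGP components perform this discovery internally rather than through this API, and the socket-level subgroup is omitted here.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm uma_comm, leader_comm;
    int world_rank, uma_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* UMA subgroup: all ranks that share a node (shared memory). */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &uma_comm);
    MPI_Comm_rank(uma_comm, &uma_rank);

    /* P2P subgroup: one leader per node (local rank 0) connected over the
     * network; non-leaders get MPI_COMM_NULL. */
    MPI_Comm_split(MPI_COMM_WORLD, uma_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    printf("rank %d: local rank %d, group leader: %s\n",
           world_rank, uma_rank, leader_comm != MPI_COMM_NULL ? "yes" : "no");

    MPI_Finalize();
    return 0;
}
```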

  10. Open MPI Portals BTL Optimization
      [Figure: an MPI message from the sender MPI process to the receiver, carried as an Open MPI message over a Portals message; the Portals acknowledgment back to the sender is crossed out.]
      • The Portals acknowledgment is not required on Cray XT5 platforms because they use the Basic End to End Protocol (BEER) for message transfer

  11. Experimental Setup
      • Hardware: Jaguar
        – 18,688 compute nodes
        – 2.6 GHz AMD Opteron (Istanbul) processors
        – SeaStar 2+ routers connected in a 3D torus topology
      • Benchmarks:
        – Point-to-point: OSU latency and bandwidth
        – Collectives: Broadcast in a tight loop; Barrier in a tight loop (a minimal timing-loop sketch follows)
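
The collective numbers that follow come from running the operation in a tight loop and averaging. The sketch below shows the kind of timing loop assumed here; the iteration count and averaging scheme are assumptions, not the exact harness used for these results.

```c
#include <mpi.h>
#include <stdio.h>

#define ITERS 1000

int main(int argc, char **argv)
{
    char buf[8] = {0};          /* small message, as in the 8-byte results */
    double t0, t1;
    int i, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* line everyone up first */
    t0 = MPI_Wtime();
    for (i = 0; i < ITERS; i++)
        MPI_Bcast(buf, sizeof(buf), MPI_CHAR, 0, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg bcast latency: %f usec\n", (t1 - t0) * 1e6 / ITERS);

    MPI_Finalize();
    return 0;
}
```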

  12. 1-Byte Open MPI P2P Latency Is 15% Better Than Cray MPI
      [Plot: OMPI vs. Cray MPI Portals latency; latency (usec) versus message size (1 byte to 1 MB) for OMPI with the Portals optimization, OMPI without it, and Cray MPI.]

  13. Open MPI and Cray MPI Bandwidth Saturate at ~2 Gb/s
      [Plot: OMPI vs. Cray MPI Portals bandwidth; bandwidth (Mb/s) versus message size (1 byte to 10 MB) for OMPI with the Portals optimization, OMPI without it, and Cray MPI.]

  14. Hierarchical Collective Algorithms

  15. Flat Barrier Algorithm
      [Figure: processes 1-4 on Host 1 and Host 2 complete the barrier in two steps of pairwise exchanges, with messages crossing the inter-host link.]

  16. Hierarchical Barrier Algorithm
      [Figure: processes 1-4 on Host 1 and Host 2 complete the barrier in three steps: an intra-host step, an inter-host exchange between one process per host, and an intra-host release step.]
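
The three steps in the figure can be mimicked with plain MPI sub-communicators, as in the hedged sketch below: MPI_Reduce, MPI_Barrier, and MPI_Bcast stand in for Cheetah's fanin, point-to-point barrier, and fanout primitives, and the subgrouping follows the slide-9 sketch rather than SBGP itself.

```c
#include <mpi.h>

/* Three-step hierarchical barrier: (1) on-host fanin to the local leader,
 * (2) barrier among host leaders over the network, (3) on-host release.
 * Plain MPI calls stand in for Cheetah's BCOL primitives. */
static void hierarchical_barrier(MPI_Comm uma_comm, MPI_Comm leader_comm)
{
    int local_rank, arrived = 1, all = 0;

    MPI_Comm_rank(uma_comm, &local_rank);

    /* Step 1: fanin inside the host (combine "arrived" flags at rank 0). */
    MPI_Reduce(&arrived, &all, 1, MPI_INT, MPI_SUM, 0, uma_comm);

    /* Step 2: inter-host step, leaders only. */
    if (local_rank == 0 && leader_comm != MPI_COMM_NULL)
        MPI_Barrier(leader_comm);

    /* Step 3: fanout inside the host to release everyone. */
    MPI_Bcast(&all, 1, MPI_INT, 0, uma_comm);
}

int main(int argc, char **argv)
{
    MPI_Comm uma_comm, leader_comm;
    int rank, lrank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Subgroups as in the slide-9 sketch: on-node comm + per-node leaders. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &uma_comm);
    MPI_Comm_rank(uma_comm, &lrank);
    MPI_Comm_split(MPI_COMM_WORLD, lrank == 0 ? 0 : MPI_UNDEFINED,
                   rank, &leader_comm);

    hierarchical_barrier(uma_comm, leader_comm);

    MPI_Finalize();
    return 0;
}
```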

  17. Cheetah's Barrier Collective Outperforms the Cray MPI Barrier by 10%
      [Plot: barrier latency (microsec.) versus number of MPI processes, up to about 14,000, for Cheetah and Cray MPI.]

  18. Data Flow in a Hierarchical Broadcast Algorithm
      [Figure: data flows from the broadcast source S on Node 1 to the other processes on Node 1 and across the network to the processes on Node 2.]

  19. Hierarchical Broadcast Algorithms
      • Knownroot Hierarchical Broadcast
        – The suboperations are ordered based on the source of the data
        – The suboperations are started concurrently after the suboperation containing the broadcast source has executed
        – Uses a k-nomial tree for data distribution (sketched below)
      • N-ary Hierarchical Broadcast
        – Same as the Knownroot algorithm, but uses an N-ary tree for data distribution
      • Sequential Hierarchical Broadcast
        – The suboperations are ordered sequentially; there is no concurrent execution
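
The k-nomial data distribution used by the Knownroot variant can be sketched with standard point-to-point calls over one subgroup. The tree arithmetic below is the usual k-nomial construction, offered as an illustration rather than Cheetah's BCOL code.

```c
#include <mpi.h>

/* k-nomial broadcast over one communicator: receive once from the parent,
 * then forward to up to (radix - 1) children per level, largest block first. */
static void knomial_bcast(void *buf, int count, MPI_Datatype dtype,
                          int root, int radix, MPI_Comm comm)
{
    int rank, size, rrank, mask, r;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    rrank = (rank - root + size) % size;        /* rank relative to root */

    /* Receive phase: everyone except the root waits for its parent. */
    for (mask = 1; mask < size; mask *= radix) {
        if (rrank % (radix * mask) != 0) {
            int parent = (rrank / (radix * mask)) * (radix * mask);
            MPI_Recv(buf, count, dtype, (parent + root) % size,
                     0, comm, MPI_STATUS_IGNORE);
            break;
        }
    }

    /* Send phase: forward down the k-nomial tree. */
    for (mask /= radix; mask > 0; mask /= radix) {
        for (r = 1; r < radix; r++) {
            int child = rrank + r * mask;
            if (child < size)
                MPI_Send(buf, count, dtype, (child + root) % size, 0, comm);
        }
    }
}

int main(int argc, char **argv)
{
    double data[512] = {0};
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) data[0] = 42.0;

    knomial_bcast(data, 512, MPI_DOUBLE, 0, 4, MPI_COMM_WORLD);  /* radix 4 */

    MPI_Finalize();
    return 0;
}
```

With radix 2 this reduces to the familiar binomial tree; the N-ary variant on this slide differs only in the tree shape used for the same suboperation ordering.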

  20. Cheetah's Broadcast Collective Outperforms the Cray MPI Broadcast by 10% (8 Bytes)
      [Plot: 8-byte broadcast latency (microsec.) versus number of MPI processes, up to about 25,000, for Cray MPI and the Cheetah three-level knownroot k-nomial, knownroot n-ary, and sequential broadcasts.]

  21. Cheetah's Broadcast Collective Outperforms the Cray MPI Broadcast by 92% (4 KB)
      [Plot: 4 KB broadcast latency (microsec.) versus number of MPI processes, up to about 50,000, for Cray MPI and the Cheetah three-level knownroot k-nomial, knownroot NB n-ary, knownroot NB k-nomial, and sequential broadcasts.]

  22. Cheetah's Broadcast Collective Outperforms the Cray MPI Broadcast by 9% (4 MB)
      [Plot: 4 MB broadcast latency (usec) versus number of MPI processes, up to about 25,000, for Cray MPI and the Cheetah three-level knownroot k-nomial, knownroot n-ary, and sequential broadcasts.]

  23. Summary
      • Cheetah's Broadcast is up to 92% better than Cray MPI's Broadcast (4 KB messages)
      • Cheetah's Barrier outperforms Cray MPI's Barrier by 10%
      • Open MPI point-to-point message latency is 15% better than Cray MPI's (1-byte messages)
      • The keys to the performance and scalability of the collective operations:
        – Concurrent execution of suboperations
        – Scalable resource-usage techniques
        – Asynchronous semantics and progress
        – Collective primitives customized for each communication hierarchy
