SLIDE 1

Managed by UT-Battelle for the Department of Energy

Graham_CAC_2010

Richard L. Graham, Joshua S. Ladd, Manjunath Venkata

Hierarchy Aware Blocking and Nonblocking Collective Communications – The Effects of Shared Memory Communications in the Cray XT Environment

SLIDE 2

Acknowledgements

  • US Department of Energy FASTOS program

SLIDE 3

Outline

  • Statement of the problem
  • Design Overview
  • Results
  • Next steps

SLIDE 4

Problems being addressed

  • Optimization of collective operations
  • Implementation of extensible optimized collective operations
  • Implementation of nonblocking collective operations

SLIDE 5

Why Optimize Collective Communications

  • Collective operations limit application scalability
  • Communication pattern involving multiple processes (in MPI, all ranks in the communicator are involved)
  • Optimized collectives involve a communicator-wide data-dependent communication pattern
  • Data needs to be manipulated at intermediate stages of a collective operation
  • Collective operations magnify the effects of system noise
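
The noise bullet above can be made concrete with a small sketch. It is not taken from the slides: it assumes each rank is delayed independently with a hypothetical probability `p_noise` per phase, and that a synchronizing collective completes only when its slowest participant arrives. Under those assumptions the chance that a collective is delayed grows quickly with the rank count.

```python
def prob_collective_delayed(num_ranks: int, p_noise: float) -> float:
    """Probability that at least one of num_ranks ranks is hit by noise.

    Assumes noise events are independent across ranks (an illustrative
    assumption, not a measurement from the slides).
    """
    return 1.0 - (1.0 - p_noise) ** num_ranks


def collective_finish_time(arrival_times):
    """A synchronizing collective cannot finish before its slowest rank arrives."""
    return max(arrival_times)


if __name__ == "__main__":
    # p_noise = 1e-4 is a hypothetical per-phase noise probability.
    for n in (1, 16, 1024, 16384):
        print(n, prob_collective_delayed(n, 1e-4))
```

A single delayed rank is enough to delay every participant, which is why `collective_finish_time` takes the maximum rather than the mean of the arrival times.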

SLIDE 6

Scalability of Collective Operations

[Figure: two timeline panels, "Ideal Algorithm" and "Impact of System Noise". Axes: Process Rank and Time. Legend: Work, Communication, Reduction, Use Result; the second panel adds Noise.]

SLIDE 7

Scalability of Collective Operations - II

[Figure: two timeline panels, "Offloaded Algorithm" and "Nonblocking Algorithm". Axes: Process Rank and Time. Legend: Work, Communication, Reduction, Use Result, Delegation Agent.]

SLIDE 8

Mapping the collectives onto the system

  • Consider communication hierarchies
  • Schedule the network

SLIDE 9

Example – 4 Process Recursive Doubling

[Figure: processes 1 to 4 shown at each stage of the exchange. Labels: Host 1, Host 2, Inter Host Communication, Step 1, Step 2.]
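
As a rough illustration of the exchange pattern named in this slide, the sketch below simulates recursive doubling for a power-of-two number of ranks. The slide does not say which operation is being performed, so a sum allreduce is assumed here; the function name is hypothetical, and ranks are numbered from 0 rather than from 1 as in the figure.

```python
def recursive_doubling_allreduce(values):
    """Simulate a sum allreduce using recursive doubling.

    values[r] is the contribution of rank r. The number of ranks must be
    a power of two. Returns (per-rank results, exchange pairs per step).
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("number of ranks must be a power of two")

    data = list(values)
    schedule = []
    distance = 1
    while distance < n:
        # In each step, rank r exchanges with the rank whose number
        # differs from r in exactly one bit.
        pairs = [(r, r ^ distance) for r in range(n) if r < (r ^ distance)]
        # Both partners combine their own data with the data received.
        data = [data[r] + data[r ^ distance] for r in range(n)]
        schedule.append(pairs)
        distance *= 2
    return data, schedule


if __name__ == "__main__":
    results, steps = recursive_doubling_allreduce([1, 2, 3, 4])
    print(results)
    print(steps)
```

With four ranks this takes two steps: neighbours exchange first, then ranks two apart. Which of those steps crosses the host boundary depends on how ranks are placed on hosts.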

SLIDE 10

Example – 4 Process Recursive Doubling – On host optimization

[Figure: processes 1 to 4 shown at each stage of the exchange. Labels: Host 1, Host 2, Inter Host Communication, Step 1, Step 2, Step 3.]
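
One plausible reading of the three steps in this slide, offered here as an assumption rather than as the authors' algorithm, is: combine data on each host, exchange between one leader per host, then distribute the result on each host. The sketch below counts inter-host messages for that scheme and for a flat recursive doubling pattern, again assuming a sum allreduce. All function names are hypothetical.

```python
def flat_inter_host_messages(hosts):
    """Count inter-host messages in flat recursive doubling.

    hosts[r] is the host id of rank r; len(hosts) must be a power of two.
    """
    n = len(hosts)
    if n == 0 or n & (n - 1):
        raise ValueError("number of ranks must be a power of two")
    count = 0
    distance = 1
    while distance < n:
        for r in range(n):
            if hosts[r] != hosts[r ^ distance]:
                count += 1  # rank r sends one message to an off-host partner
        distance *= 2
    return count


def hierarchical_allreduce(values, hosts):
    """Three-step sum allreduce; returns (results, inter-host message count)."""
    # Step 1: on-host reduction (would use shared memory on a real system).
    partial = {}
    for value, host in zip(values, hosts):
        partial[host] = partial.get(host, 0) + value

    # Step 2: one leader per host exchanges its partial sum with every
    # other leader (a simple all-to-all, chosen here for brevity).
    num_hosts = len(partial)
    inter_host_messages = num_hosts * (num_hosts - 1)
    total = sum(partial.values())

    # Step 3: each leader distributes the final result on its own host.
    return [total] * len(values), inter_host_messages


if __name__ == "__main__":
    hosts = [0, 0, 1, 1]
    print(flat_inter_host_messages(hosts))
    print(hierarchical_allreduce([1, 2, 3, 4], hosts))
```

For two ranks on each of two hosts, the flat pattern sends four messages across the host boundary while the hierarchical one sends two, at the cost of an extra on-host step.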

SLIDE 11

Design strategy

  • Decouple
    – Hierarchy detection
    – Network specific collective algorithm implementation (“single” level)
    – Full collective function implementation (hierarchical)
    – Basic building blocks from MPI level functions
  • Share resources between levels without breaking the abstraction between layers
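
To illustrate what keeping hierarchy detection separate from the collective algorithms might look like, here is a minimal sketch. It is not the framework's actual interface: the `(host, socket)` location tuples, the function name, and the choice of the lowest rank as subgroup leader are all assumptions made for the example.

```python
from collections import defaultdict


def build_hierarchy(locations):
    """Group ranks into nested subgroups from their (host, socket) location.

    locations[r] = (host, socket) for rank r. Returns a list of levels;
    each level is a list of subgroups, and the first rank in each subgroup
    acts as its leader at the next level up.
    """
    # Level 0: ranks that share a socket.
    socket_groups = defaultdict(list)
    for rank, (host, socket) in enumerate(locations):
        socket_groups[(host, socket)].append(rank)
    level0 = [ranks for _, ranks in sorted(socket_groups.items())]

    # Level 1: socket leaders that share a host.
    host_groups = defaultdict(list)
    for (host, _socket), ranks in sorted(socket_groups.items()):
        host_groups[host].append(ranks[0])
    level1 = [leaders for _, leaders in sorted(host_groups.items())]

    # Level 2: one leader per host, communicating over the network.
    level2 = [[leaders[0] for leaders in level1]]
    return [level0, level1, level2]


if __name__ == "__main__":
    locations = [(0, 0), (0, 0), (0, 1), (0, 1),
                 (1, 0), (1, 0), (1, 1), (1, 1)]
    for level in build_hierarchy(locations):
        print(level)
```

A collective implementation could then walk these levels and apply a different single-level algorithm to each subgroup, without needing to know how the subgroups were discovered.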

SLIDE 12

Collectives – Software Layers

[Diagram: software layers. Labels as they appear on the slide: OMPI; Module Component Architecture; Collective Framework; Tuned (pt2pt) Collectives Comp.; ML – Hierarchical Collectives Comp.; Basic Collectives (bcol) Framework; Subgroup Framework; IB OFFLOAD; Pt2Pt; SM; NUMA; IBNET; MUMA; MLNX OFED.]

SLIDE 13

Benchmarks

SLIDE 14

System setup

  • Jaguar
    – 2.6 GHz Istanbul processor
    – Dual socket
    – Hex-core
  • Smoky
    – 2.0 GHz Opteron
    – Quad socket
    – Quad core

SLIDE 15

Barrier as a function of Process count – Jaguar – 2 Level hierarchy

[Figure: latency of the barrier (usecs) versus number of processes. Series: Shared Memory, pt-2-pt.]

SLIDE 16

Barrier as a function of Process count – Smoky – 2 Level hierarchy

[Figure: latency of the barrier (usecs) versus number of processes. Series: Shared Memory, pt-2-pt.]

SLIDE 17

Barrier As a function of number of sockets - Jaguar

[Figure: latency of the barrier (usecs) for 2 and 4 processes. Series: Processes on Different Sockets, Processes on Same Socket.]

SLIDE 18

Barrier As a function of number of sockets (1,2) – Smoky

[Figure: latency of the barrier (usecs) for 2 and 4 processes. Series: Processes on Different Sockets, Processes on Same Socket.]

SLIDE 19

Barrier As a function of number of sockets (1,4) – Smoky

[Figure: latency of the barrier (usecs) for 4 processes. Series: Message Traffic between Sockets, Message Traffic within Socket.]

SLIDE 20

Summary

  • Added hardware support for offloading collective operations
  • Developed MPI-level support for asynchronous collectives

  • Good barrier performance
  • Good overlap capabilities
  • Work is continuing