
Implementing Optimized Collective Communication Routines on the IBM BlueGene/L Supercomputer

CS 425 term project, by Sam Miller (samm@scl.ameslab.gov), April 18, 2005


Outline

  • What is BlueGene/L? (5 slides)
  • Hardware (3 slides)
  • Communication Networks (2 slides)
  • Software (2 slides)
  • MPI and MPICH (1 slide)
  • Collective Algorithms (5 slides)
  • Better Collective Algorithms! (12 slides)
  • Performance
  • Conclusion

Abbreviations Today

  • BGL = BlueGene/L
  • CNK = Compute Node Kernel
  • MPI = Message Passing Interface
  • MPICH2 = MPICH 2.0 from Argonne Labs
  • ASIC = Application Specific Integrated Circuit
  • ALU = Arithmetic Logic Unit
  • IBM = International Biscuit Makers (duh)

What is BGL 1/2

  • Massively parallel distributed memory cluster of embedded processors

  • 65,536 nodes! 131,072 processors!
  • Low power requirements
  • Relatively small, compared to predecessors
  • Half system installed at LLNL
  • Other systems going online too

What is BGL 2/2

  • BlueGene/L at LLNL (360 Tflops)
    – 2,500 square feet, half a tennis court
  • Earth Simulator (40 Tflops)
    – 35,000 square feet, requires an entire building


Hardware 1/3

  • CPU is PowerPC 440
    – Designed for embedded applications
    – Low power, low clock frequency (700 MHz)
    – 32 bit :-(
  • FPU is custom 64-bit
    – Each PPC 440 core has two of these
    – The two FPUs operate in parallel
    – At 700 MHz this is 2.8 Gflops per PPC 440 core (two FPUs x 2 flops per fused multiply-add per cycle)


Hardware 2/3

  • BGL ASIC
    – Two PPC 440 cores, four FPUs
    – L1, L2, L3 caches
    – DDR memory controller
    – Logic for 5 separate communications networks
    – This forms one compute node


Hardware 3/3

  • To build the entire 65,536 node system
    – Two ASICs with 256 or 512 MB DDR RAM form a compute card
    – Sixteen compute cards form a node board
    – Sixteen node boards form a midplane
    – Two midplanes form a rack
    – Sixty-four racks brings us to:
    – 2 x 16 x 16 x 2 x 64 = 65,536!
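
The same arithmetic as a tiny C snippet, in case it helps; only the numbers come from the slide, the constant names are mine:

    #include <stdio.h>

    int main(void)
    {
        /* Packaging hierarchy from the slide above. */
        const int asics_per_compute_card = 2;
        const int cards_per_node_board   = 16;
        const int boards_per_midplane    = 16;
        const int midplanes_per_rack     = 2;
        const int racks                  = 64;

        printf("compute nodes: %d\n",
               asics_per_compute_card * cards_per_node_board *
               boards_per_midplane * midplanes_per_rack * racks);   /* 65536 */
        return 0;
    }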



Communication Networks 1/2

  • Five different networks
    – 3D torus
      • Primary network for the MPI library
    – Global tree
      • Used for collectives on MPI_COMM_WORLD
      • Used by compute nodes to communicate with I/O nodes
    – Global interrupt
      • 1.5 usec latency over the entire 65k node system!
    – JTAG
      • Used for node bootup and servicing
    – Gigabit Ethernet
      • Used by I/O nodes

Communication Networks 2/2

  • Torus
    – 6 neighbors have bi-directional links at 154 MB/sec
    – Guarantees reliable, deadlock free delivery
    – Chosen due to high bandwidth nearest neighbor connectivity
    – Used in prior supercomputers, such as the Cray T3E
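
As a side note (not from the slides), the six neighbors fall directly out of the torus coordinates, with each dimension wrapping around; a minimal sketch, assuming an illustrative 32 x 32 x 64 node layout:

    #include <stdio.h>

    /* Illustrative dimensions only: 32 x 32 x 64 = 65,536 nodes. */
    enum { DIM_X = 32, DIM_Y = 32, DIM_Z = 64 };

    /* Print the six torus neighbors of node (x, y, z); every dimension wraps. */
    static void print_torus_neighbors(int x, int y, int z)
    {
        printf("+x: (%d,%d,%d)  -x: (%d,%d,%d)\n",
               (x + 1) % DIM_X, y, z, (x + DIM_X - 1) % DIM_X, y, z);
        printf("+y: (%d,%d,%d)  -y: (%d,%d,%d)\n",
               x, (y + 1) % DIM_Y, z, x, (y + DIM_Y - 1) % DIM_Y, z);
        printf("+z: (%d,%d,%d)  -z: (%d,%d,%d)\n",
               x, y, (z + 1) % DIM_Z, x, y, (z + DIM_Z - 1) % DIM_Z);
    }

    int main(void)
    {
        print_torus_neighbors(0, 0, 0);   /* even a corner node has 6 neighbors */
        return 0;
    }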


Software 1/2

  • Compute node runs a stripped down Linux called CNK
    – Two threads, 1 per CPU
    – No context switching, no VM
    – Standard glibc interface, easy to port
    – 5000 lines of C++
  • I/O nodes run standard PPC Linux
    – They have disk access
    – Run a daemon called console I/O daemon (ciod)


Software 2/2

  • Network software has 3 layers
    – Topmost is the MPI library
    – Middle is the Message Layer
      • Allows transmission of arbitrary buffer sizes
    – Bottom is the Packet Layer
      • Very simple
      • Stateless interface to torus, tree, and GI hardware
      • Facilitates sending & receiving packets

MPICH

  • Developed by Argonne National Labs
  • Open source, freely available, standards-compliant MPI implementation
  • Used by many vendors
  • Chosen by IBM due to use of the Abstract Device Interface (ADI) and design for scalability


Collective Algorithms 1/5

  • Collectives can be implemented with basic sends and receives (see the sketch below)
    – Better algorithms exist
  • Default MPICH2 collectives perform poorly on BGL
    – Assume crossbar network, poor node mapping
    – Point-to-point messages incur high overhead
    – No knowledge of network specific features
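
To make the first bullet concrete, here is a minimal, unoptimized broadcast built only from point-to-point calls (my own sketch, not BGL or MPICH2 code): the root simply sends the buffer to every other rank in turn.

    #include <mpi.h>

    /* Naive broadcast: the root sends the whole buffer to every other rank.
       Correct, but serialized at the root; this is the kind of baseline the
       optimized algorithms improve on. */
    void naive_bcast(void *buf, int count, MPI_Datatype type,
                     int root, MPI_Comm comm)
    {
        int rank, size, i;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank == root) {
            for (i = 0; i < size; i++)
                if (i != root)
                    MPI_Send(buf, count, type, i, 0, comm);
        } else {
            MPI_Recv(buf, count, type, root, 0, comm, MPI_STATUS_IGNORE);
        }
    }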


Collective Algorithms 2/5

  • Optimization is tricky
    – Message size and communicator shape are deciding factors
    – Large messages == optimize bandwidth
    – Short messages == optimize latency
  • I will not talk about short message collectives further today
  • If an optimized algorithm isn’t available, BGL falls back on the default MPICH2 algorithm
    – It will work because point-to-point messages work
    – Performance will suck, however


Collective Algorithms 3/5

  • Conditions for selecting the optimized collective algorithm are evaluated locally
    – What is wrong with this?
  • Example:

        char buf[100], buf2[20000];
        if (rank == 0)
            MPI_Bcast(buf, 100, …);
        else
            MPI_Bcast(buf2, 20000, …);

    – Not legal according to the MPI standard, but…
    – What if one node uses the optimized algorithm and the others use the MPICH2 algorithm?
      • Deadlock, or worse

Collective Algorithms 4/5

  • Solution to the previous problem:
    – Make optimization decisions globally (see the sketch below)
    – This incurs a slight latency hit
    – Thus, only used when offsetting increases in bandwidth are important, e.g. long message collectives
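
One way to picture the global decision (a sketch of mine, not the actual BGL implementation): every rank contributes its local message size and the communicator agrees on the minimum, so all ranks take the same branch. The extra allreduce is the "slight latency hit" mentioned above; the 4096-byte threshold is purely illustrative.

    #include <mpi.h>

    /* Returns the same value on every rank, so all ranks pick the same
       algorithm.  The threshold and function name are illustrative only. */
    int use_optimized_long_message_path(long local_bytes, MPI_Comm comm)
    {
        long global_min;
        MPI_Allreduce(&local_bytes, &global_min, 1, MPI_LONG, MPI_MIN, comm);
        return global_min >= 4096;
    }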


Collective Algorithms 5/5

  • Remainder of slides
    – MPI_Bcast
    – MPI_Reduce, MPI_Allreduce
    – MPI_Alltoall, MPI_Alltoallv
  • Using both the tree and torus networks
    – Tree operates only on MPI_COMM_WORLD
      • Has a built-in ALU, but only fixed point :-(
    – Torus has a deposit bit feature, requires rectangular communicator shape (for most algorithms)


Broadcast 1/3

  • MPICH2
    – Binomial tree for short messages (sketched below)
    – Scatter then allgather for large messages
    – Performs poorly on BGL due to high CPU overhead and lack of topology awareness
  • Torus
    – Uses deposit bit feature
    – For an n-dimension mesh, 1/n of the message is sent in each direction concurrently
  • Tree
    – Does not use the ALU
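
For reference, here is the textbook binomial-tree broadcast that MPICH2 uses for short messages, in my own rendering (not the MPICH2 source): in each round, every rank that already holds the data forwards it to a rank a power-of-two distance away, so the broadcast finishes in about log2(P) steps.

    #include <mpi.h>

    /* Binomial-tree broadcast: ceil(log2(P)) rounds of point-to-point messages. */
    void binomial_bcast(void *buf, int count, MPI_Datatype type,
                        int root, MPI_Comm comm)
    {
        int rank, size, vrank, mask;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        vrank = (rank - root + size) % size;    /* rank relative to the root */

        /* Receive from the parent (my virtual rank with its lowest set bit cleared). */
        for (mask = 1; mask < size; mask <<= 1) {
            if (vrank & mask) {
                MPI_Recv(buf, count, type, (rank - mask + size) % size,
                         0, comm, MPI_STATUS_IGNORE);
                break;
            }
        }
        /* Forward to children at decreasing power-of-two distances. */
        for (mask >>= 1; mask > 0; mask >>= 1) {
            if (vrank + mask < size)
                MPI_Send(buf, count, type, (rank + mask) % size, 0, comm);
        }
    }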


Broadcast 2/3

  • Red lines represent one spanning tree of half the message
  • Blue lines represent another spanning tree of the other message half

Broadcast 3/3


Reduce & Allreduce 1/4

  • Reduce is essentially a reverse broadcast
  • Allreduce is a reduce followed by a broadcast (see the sketch below)
  • Torus
    – Can’t use deposit bit feature
    – CPU bound, bandwidth is poor
    – Solution: Hamiltonian path; huge latency penalty, but great bandwidth
  • Tree
    – Natural choice for reduction using integers!
    – Floating point performance is bad
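
The second bullet is literally a two-call composition; a minimal sketch (mine, not the optimized BGL version) using standard MPI:

    #include <mpi.h>

    /* Allreduce expressed as a reduce to rank 0 followed by a broadcast.
       The optimized BGL algorithms replace both steps with network-aware ones. */
    void allreduce_via_reduce_bcast(void *sendbuf, void *recvbuf, int count,
                                    MPI_Datatype type, MPI_Op op, MPI_Comm comm)
    {
        MPI_Reduce(sendbuf, recvbuf, count, type, op, 0, comm);
        MPI_Bcast(recvbuf, count, type, 0, comm);
    }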


Reduce & Allreduce 2/4

  • Hamiltonian path for 4x4x4 cube

Reduce & Allreduce 3/4


Reduce & Allreduce 4/4


Alltoall and Alltoallv 1/5

  • MPICH2 has 4 algorithms
    – Yes, 4 separate ones
    – BGL performance is bad due to network hot spots and CPU overhead
  • Torus
    – No communicator size restriction!
    – Does not use deposit bit feature
  • Tree
    – No alltoall tree algorithm; it would not make sense


Alltoall and Alltoallv 2/5

  • BGL optimized torus algorithm
    – Uses randomized packet injection (see the sketch below)
    – Each node creates a destination list
    – Each node has the same seed value, different offset
  • Bad memory performance?
    – Yes!
    – Torus payload is 240 bytes (8 cache lines)
    – Multiple packets in adjacent cache lines to each destination are injected before advancing
  • Measurements showed 2 packets to be optimal
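
A rough sketch of the destination-list idea (my own illustration; the actual BGL code is not shown in these slides): every rank builds the identical pseudo-random permutation of destination ranks from the shared seed, then walks it starting at its own offset, so at any instant different ranks are injecting packets toward different targets and hot spots are avoided.

    #include <stdlib.h>

    /* Build the shared permutation of destination ranks.  Seeding with the
       same value on every rank makes the permutation identical everywhere. */
    void build_dest_list(int *dest, int nranks, unsigned shared_seed)
    {
        int i, j, tmp;
        for (i = 0; i < nranks; i++)
            dest[i] = i;
        srand(shared_seed);
        for (i = nranks - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
            j = rand() % (i + 1);
            tmp = dest[i]; dest[i] = dest[j]; dest[j] = tmp;
        }
    }

    /* The i-th destination this rank injects toward: same list, per-rank offset. */
    int next_dest(const int *dest, int nranks, int myrank, int i)
    {
        return dest[(myrank + i) % nranks];
    }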

Alltoall and Alltoallv 3/5


Alltoall and Alltoallv 4/5


Alltoall and Alltoallv 5/5


Conclusion

  • Optimized collectives on BGL are off to a good start
    – Superior performance compared to MPICH2
    – Exploit knowledge about network features
    – Avoid performance penalties like memory copies and network hot spots
  • Much work remains
    – Short message collectives
    – Non-rectangular communicators for the torus network
    – Tree collectives using communicators other than MPI_COMM_WORLD
    – Other collectives: scatter, gather, etc.


Questions?