
Implementing Optimized Collective Communication Routines on the IBM BlueGene/L Supercomputer

CS 425 term project, by Sam Miller (samm@scl.ameslab.gov), April 18, 2005


Outline

  • What is BlueGene/L? (5 slides)
  • Hardware (3 slides)
  • Communication Networks (2 slides)
  • Software (2 slides)
  • MPI and MPICH (1 slide)
  • Collective Algorithms (5 slides)
  • Better Collective Algorithms! (12 slides)
  • Performance
  • Conclusion

Abbreviations Today

  • BGL = BlueGene/L
  • CNK = Compute Node Kernel
  • MPI = Message Passing Interface
  • MPICH2 = MPICH 2.0 from Argonne Labs
  • ASIC = Application Specific Integrated Circuit
  • ALU = Arithmetic Logic Unit
  • IBM = International Biscuit Makers (duh)

What is BGL 1/2

  • Massively parallel distributed memory cluster of embedded processors

  • 65,536 nodes! 131,072 processors!
  • Low power requirements
  • Relatively small, compared to predecessors
  • Half system installed at LLNL
  • Other systems going online too

What is BGL 2/2

  • BlueGene/L at LLNL (360 Tflops)
    – 2,500 square feet, half a tennis court
  • Earth Simulator (40 Tflops)
    – 35,000 square feet, requires an entire building


Hardware 1/3

  • CPU is PowerPC 440
    – Designed for embedded applications
    – Low power, low clock frequency (700 MHz)
    – 32 bit :-(
  • FPU is custom 64-bit
    – Each PPC 440 core has two of these
    – The two FPUs operate in parallel
    – At 700 MHz this is 2.8 Gflops per PPC 440 core (two FPUs x 2 flops per fused multiply-add per cycle)


Hardware 2/3

  • BGL ASIC
    – Two PPC 440 cores, four FPUs
    – L1, L2, L3 caches
    – DDR memory controller
    – Logic for 5 separate communications networks
    – This forms one compute node


Hardware 3/3

  • To build the entire 65,536 node system
    – Two ASICs with 256 or 512 MB DDR RAM form a compute card
    – Sixteen compute cards form a node board
    – Sixteen node boards form a midplane
    – Two midplanes form a rack
    – Sixty-four racks brings us to:
    – 2 x 16 x 16 x 2 x 64 = 65,536!
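
The same arithmetic as a tiny C snippet, in case it helps; only the numbers come from the slide, the constant names are mine:

    #include <stdio.h>

    int main(void)
    {
        /* Packaging hierarchy from the slide above. */
        const int asics_per_compute_card = 2;
        const int cards_per_node_board   = 16;
        const int boards_per_midplane    = 16;
        const int midplanes_per_rack     = 2;
        const int racks                  = 64;

        printf("compute nodes: %d\n",
               asics_per_compute_card * cards_per_node_board *
               boards_per_midplane * midplanes_per_rack * racks);   /* 65536 */
        return 0;
    }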



Communication Networks 1/2

  • Five different networks
    – 3D torus
      • Primary network for the MPI library
    – Global tree
      • Used for collectives on MPI_COMM_WORLD
      • Used by compute nodes to communicate with I/O nodes
    – Global interrupt
      • 1.5 usec latency over the entire 65k node system!
    – JTAG
      • Used for node bootup and servicing
    – Gigabit Ethernet
      • Used by I/O nodes

Communication Networks 2/2

  • Torus
    – 6 neighbors have bi-directional links at 154 MB/sec
    – Guarantees reliable, deadlock free delivery
    – Chosen due to high bandwidth nearest neighbor connectivity
    – Used in prior supercomputers, such as the Cray T3E
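
As a side note (not from the slides), the six neighbors fall directly out of the torus coordinates, with each dimension wrapping around; a minimal sketch, assuming an illustrative 32 x 32 x 64 node layout:

    #include <stdio.h>

    /* Illustrative dimensions only: 32 x 32 x 64 = 65,536 nodes. */
    enum { DIM_X = 32, DIM_Y = 32, DIM_Z = 64 };

    /* Print the six torus neighbors of node (x, y, z); every dimension wraps. */
    static void print_torus_neighbors(int x, int y, int z)
    {
        printf("+x: (%d,%d,%d)  -x: (%d,%d,%d)\n",
               (x + 1) % DIM_X, y, z, (x + DIM_X - 1) % DIM_X, y, z);
        printf("+y: (%d,%d,%d)  -y: (%d,%d,%d)\n",
               x, (y + 1) % DIM_Y, z, x, (y + DIM_Y - 1) % DIM_Y, z);
        printf("+z: (%d,%d,%d)  -z: (%d,%d,%d)\n",
               x, y, (z + 1) % DIM_Z, x, y, (z + DIM_Z - 1) % DIM_Z);
    }

    int main(void)
    {
        print_torus_neighbors(0, 0, 0);   /* even a corner node has 6 neighbors */
        return 0;
    }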


Software 1/2

  • Compute node runs a stripped down Linux called CNK
    – Two threads, 1 per CPU
    – No context switching, no VM
    – Standard glibc interface, easy to port
    – 5000 lines of C++
  • I/O nodes run standard PPC Linux
    – They have disk access
    – Run a daemon called console I/O daemon (ciod)


Software 2/2

  • Network software has 3 layers
    – Topmost is the MPI library
    – Middle is the Message Layer
      • Allows transmission of arbitrary buffer sizes
    – Bottom is the Packet Layer
      • Very simple
      • Stateless interface to torus, tree, and GI hardware
      • Facilitates sending & receiving packets

MPICH

  • Developed by Argonne National Labs
  • Open source, freely available, standards-compliant MPI implementation
  • Used by many vendors
  • Chosen by IBM due to use of the Abstract Device Interface (ADI) and design for scalability


Collective Algorithms 1/5

  • Collectives can be implemented with basic sends and receives (see the sketch below)
    – Better algorithms exist
  • Default MPICH2 collectives perform poorly on BGL
    – Assume crossbar network, poor node mapping
    – Point-to-point messages incur high overhead
    – No knowledge of network specific features
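
To make the first bullet concrete, here is a minimal, unoptimized broadcast built only from point-to-point calls (my own sketch, not BGL or MPICH2 code): the root simply sends the buffer to every other rank in turn.

    #include <mpi.h>

    /* Naive broadcast: the root sends the whole buffer to every other rank.
       Correct, but serialized at the root; this is the kind of baseline the
       optimized algorithms improve on. */
    void naive_bcast(void *buf, int count, MPI_Datatype type,
                     int root, MPI_Comm comm)
    {
        int rank, size, i;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank == root) {
            for (i = 0; i < size; i++)
                if (i != root)
                    MPI_Send(buf, count, type, i, 0, comm);
        } else {
            MPI_Recv(buf, count, type, root, 0, comm, MPI_STATUS_IGNORE);
        }
    }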


Collective Algorithms 2/5

  • Optimization is tricky
    – Message size and communicator shape are deciding factors
    – Large messages == optimize bandwidth
    – Short messages == optimize latency
  • I will not talk about short message collectives further today
  • If an optimized algorithm isn’t available, BGL falls back on the default MPICH2 algorithm
    – It will work because point-to-point messages work
    – Performance will suck, however


Collective Algorithms 3/5

  • Conditions for selecting the optimized collective algorithm are evaluated locally
    – What is wrong with this?
  • Example:

        char buf[100], buf2[20000];
        if (rank == 0)
            MPI_Bcast(buf, 100, …);
        else
            MPI_Bcast(buf2, 20000, …);

    – Not legal according to the MPI standard, but…
    – What if one node uses the optimized algorithm and the others use the MPICH2 algorithm?
      • Deadlock, or worse

Collective Algorithms 4/5

  • Solution to the previous problem:
    – Make optimization decisions globally (see the sketch below)
    – This incurs a slight latency hit
    – Thus, only used when offsetting increases in bandwidth are important, e.g. long message collectives
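
One way to picture the global decision (a sketch of mine, not the actual BGL implementation): every rank contributes its local message size and the communicator agrees on the minimum, so all ranks take the same branch. The extra allreduce is the "slight latency hit" mentioned above; the 4096-byte threshold is purely illustrative.

    #include <mpi.h>

    /* Returns the same value on every rank, so all ranks pick the same
       algorithm.  The threshold and function name are illustrative only. */
    int use_optimized_long_message_path(long local_bytes, MPI_Comm comm)
    {
        long global_min;
        MPI_Allreduce(&local_bytes, &global_min, 1, MPI_LONG, MPI_MIN, comm);
        return global_min >= 4096;
    }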


Collective Algorithms 5/5

  • Remainder of slides
    – MPI_Bcast
    – MPI_Reduce, MPI_Allreduce
    – MPI_Alltoall, MPI_Alltoallv
  • Using both the tree and torus networks
    – Tree operates only on MPI_COMM_WORLD
      • Has a built-in ALU, but only fixed point :-(
    – Torus has a deposit bit feature, requires rectangular communicator shape (for most algorithms)


Broadcast 1/3

  • MPICH2
    – Binomial tree for short messages (sketched below)
    – Scatter then allgather for large messages
    – Performs poorly on BGL due to high CPU overhead and lack of topology awareness
  • Torus
    – Uses deposit bit feature
    – For an n-dimension mesh, 1/n of the message is sent in each direction concurrently
  • Tree
    – Does not use the ALU
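
For reference, here is the textbook binomial-tree broadcast that MPICH2 uses for short messages, in my own rendering (not the MPICH2 source): in each round, every rank that already holds the data forwards it to a rank a power-of-two distance away, so the broadcast finishes in about log2(P) steps.

    #include <mpi.h>

    /* Binomial-tree broadcast: ceil(log2(P)) rounds of point-to-point messages. */
    void binomial_bcast(void *buf, int count, MPI_Datatype type,
                        int root, MPI_Comm comm)
    {
        int rank, size, vrank, mask;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        vrank = (rank - root + size) % size;    /* rank relative to the root */

        /* Receive from the parent (my virtual rank with its lowest set bit cleared). */
        for (mask = 1; mask < size; mask <<= 1) {
            if (vrank & mask) {
                MPI_Recv(buf, count, type, (rank - mask + size) % size,
                         0, comm, MPI_STATUS_IGNORE);
                break;
            }
        }
        /* Forward to children at decreasing power-of-two distances. */
        for (mask >>= 1; mask > 0; mask >>= 1) {
            if (vrank + mask < size)
                MPI_Send(buf, count, type, (rank + mask) % size, 0, comm);
        }
    }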


Broadcast 2/3

  • Red lines represent one spanning tree of half the message
  • Blue lines represent another spanning tree of the other message half

Broadcast 3/3


Reduce & Allreduce 1/4

  • Reduce is essentially a reverse broadcast
  • Allreduce is a reduce followed by a broadcast (see the sketch below)
  • Torus
    – Can’t use deposit bit feature
    – CPU bound, bandwidth is poor
    – Solution: Hamiltonian path; huge latency penalty, but great bandwidth
  • Tree
    – Natural choice for reduction using integers!
    – Floating point performance is bad
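
The second bullet is literally a two-call composition; a minimal sketch (mine, not the optimized BGL version) using standard MPI:

    #include <mpi.h>

    /* Allreduce expressed as a reduce to rank 0 followed by a broadcast.
       The optimized BGL algorithms replace both steps with network-aware ones. */
    void allreduce_via_reduce_bcast(void *sendbuf, void *recvbuf, int count,
                                    MPI_Datatype type, MPI_Op op, MPI_Comm comm)
    {
        MPI_Reduce(sendbuf, recvbuf, count, type, op, 0, comm);
        MPI_Bcast(recvbuf, count, type, 0, comm);
    }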


Reduce & Allreduce 2/4

  • Hamiltonian path for 4x4x4 cube

Reduce & Allreduce 3/4


Reduce & Allreduce 4/4


Alltoall and Alltoallv 1/5

  • MPICH2 has 4 algorithms
    – Yes, 4 separate ones
    – BGL performance is bad due to network hot spots and CPU overhead
  • Torus
    – No communicator size restriction!
    – Does not use deposit bit feature
  • Tree
    – No alltoall tree algorithm; it would not make sense


Alltoall and Alltoallv 2/5

  • BGL optimized torus algorithm
    – Uses randomized packet injection (see the sketch below)
    – Each node creates a destination list
    – Each node has the same seed value, different offset
  • Bad memory performance?
    – Yes!
    – Torus payload is 240 bytes (8 cache lines)
    – Multiple packets in adjacent cache lines to each destination are injected before advancing
  • Measurements showed 2 packets to be optimal
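
A rough sketch of the destination-list idea (my own illustration; the actual BGL code is not shown in these slides): every rank builds the identical pseudo-random permutation of destination ranks from the shared seed, then walks it starting at its own offset, so at any instant different ranks are injecting packets toward different targets and hot spots are avoided.

    #include <stdlib.h>

    /* Build the shared permutation of destination ranks.  Seeding with the
       same value on every rank makes the permutation identical everywhere. */
    void build_dest_list(int *dest, int nranks, unsigned shared_seed)
    {
        int i, j, tmp;
        for (i = 0; i < nranks; i++)
            dest[i] = i;
        srand(shared_seed);
        for (i = nranks - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
            j = rand() % (i + 1);
            tmp = dest[i]; dest[i] = dest[j]; dest[j] = tmp;
        }
    }

    /* The i-th destination this rank injects toward: same list, per-rank offset. */
    int next_dest(const int *dest, int nranks, int myrank, int i)
    {
        return dest[(myrank + i) % nranks];
    }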

Alltoall and Alltoallv 3/5


Alltoall and Alltoallv 4/5


Alltoall and Alltoallv 5/5


Conclusion

  • Optimized collectives on BGL are off to a good start
    – Superior performance compared to MPICH2
    – Exploit knowledge about network features
    – Avoid performance penalties like memory copies and network hot spots
  • Much work remains
    – Short message collectives
    – Non-rectangular communicators for the torus network
    – Tree collectives using communicators other than MPI_COMM_WORLD
    – Other collectives: scatter, gather, etc.


Questions?