Implementing Optimized Collective Communication Routines on the IBM BlueGene/L Supercomputer
CS 425 term project
By Sam Miller (samm@scl.ameslab.gov)
April 18, 2005
Outline
- What is BlueGene/L? (5 slides)
- Hardware (3 slides)
- Communication Networks (2 slides)
- Software (2 slides)
- MPI and MPICH (1 slide)
- Collective Algorithms (5 slides)
- Better Collective Algorithms! (12 slides)
- Performance
- Conclusion
Abbreviations Today
- BGL = BlueGene/L
- CNK = Compute Node Kernel
- MPI = Message Passing Interface
- MPICH2 = MPICH 2.0 from Argonne Labs
- ASIC = Application Specific Integrated Circuit
- ALU = Arithmetic Logic Unit
- IBM = International Biscuit Makers (duh)
What is BGL 1/2
- Massively parallel, distributed-memory cluster of embedded processors
- 65,536 nodes! 131,072 processors!
- Low power requirements
- Relatively small, compared to predecessors
- Half system installed at LLNL
- Other systems going online too
What is BGL 2/2
- BlueGene/L at LLNL (360 Tflops)
– 2,500 square feet, half a tennis court
- Earth Simulator (40 Tflops)
– 35,000 square feet, requires an entire building
Hardware 1/3
- CPU is PowerPC 440
– Designed for embedded applications
– Low power, low clock frequency (700 MHz)
– 32 bit :-(
- FPU is custom 64-bit
– Each PPC 440 core has two of these
– The two FPUs operate in parallel
– @ 700 MHz this is 2.8 Gflops per PPC 440 core
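(The 2.8 Gflops figure presumably assumes each FPU retires one fused multiply-add, i.e. 2 flops, per cycle: 2 FPUs x 2 flops x 700 MHz = 2.8 Gflops.)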
Hardware 2/3
- BGL ASIC
– Two PPC 440 cores, four FPUs
– L1, L2, L3 caches
– DDR memory controller
– Logic for 5 separate communications networks
– This forms one compute node
Hardware 3/3
- To build the entire 65,536 node system
– Two ASICs with 256 or 512 MB DDR RAM form a compute card
– Sixteen compute cards form a node board
– Sixteen node boards form a midplane
– Two midplanes form a rack
– Sixty-four racks brings us to:
– 2 x 16 x 16 x 2 x 64 = 65,536!
Communication Networks 1/2
- Five different networks
– 3D torus
- Primary for MPI library
– Global tree
- Used for collectives on MPI_COMM_WORLD
- Used by compute nodes to communicate with I/O nodes
– Global interrupt
- 1.5 usec latency over entire 65k node system!
– JTAG
- Used for node bootup and servicing
– Gigabit Ethernet
- Used by I/O nodes
Communication Networks 2/2
- Torus
– 6 neighbors have bi-directional links at 154 MB/sec
– Guarantees reliable, deadlock-free delivery
– Chosen due to high-bandwidth nearest-neighbor connectivity
– Used in prior supercomputers, such as the Cray T3E
Software 1/2
- Compute nodes run a stripped-down, Linux-like kernel called CNK
– Two threads, one per CPU
– No context switching, no VM
– Standard glibc interface, easy to port
– 5,000 lines of C++
- I/O nodes run standard PPC Linux
– They have disk access
– Run a daemon called the console I/O daemon (ciod)
Software 2/2
- Network software has 3 layers
– Topmost is the MPI library
– Middle is the Message Layer
- Allows transmission of arbitrary buffer sizes
– Bottom is Packet layer
- Very simple
- Stateless interface to torus, tree, and GI hardware
- Facilitates sending & receiving packets
MPICH
- Developed by Argonne National Labs
- Open source, freely available, standards-compliant MPI implementation
- Used by many vendors
- Chosen by IBM due to its Abstract Device Interface (ADI) and design for scalability
Collective Algorithms 1/5
- Collectives can be implemented with basic sends and receives
– Better algorithms exist
- Default MPICH2 collectives perform poorly on BGL
– Assume a crossbar network; poor node mapping
– Point-to-point messages incur high overhead
– No knowledge of network-specific features
Collective Algorithms 2/5
- Optimization is tricky
– Message size and communicator shape are deciding factors
– Large messages == optimize bandwidth
– Short messages == optimize latency
- I will not talk about short message collectives further today
- If an optimized algorithm isn't available, BGL falls back on the default MPICH2 algorithm
– It will work because point-to-point messages work
– Performance will suck, however
Collective Algorithms 3/5
- The decision to use an optimized collective algorithm is made locally
– What is wrong with this?
- Example:
    char buf[100], buf2[20000];
    if (rank == 0)
        MPI_Bcast(buf, 100, …);
    else
        MPI_Bcast(buf2, 20000, …);
– Not legal according to the MPI standard, but…
– What if one node uses the optimized algorithm and the others use the MPICH2 algorithm?
- Deadlock, or worse
Collective Algorithms 4/5
- Solution to previous problem:
– Make optimization decisions globally (see the sketch below)
– This incurs a slight latency hit
– Thus, only used when offsetting increases in bandwidth are important, e.g. long message collectives
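A minimal sketch of the idea in MPI terms (the helper below and its threshold are illustrative assumptions, not the actual BGL code): each rank evaluates the condition locally, then an MPI_Allreduce with MPI_LAND turns it into a single global yes/no, so every rank takes the same branch.

    #include <mpi.h>

    /* Hypothetical local test, e.g. "message is long enough to justify the optimized path". */
    static int can_use_optimized(int count)
    {
        return count >= 2048;   /* placeholder threshold (assumption) */
    }

    /* Stand-in for the optimized torus/tree broadcast (hypothetical; here it just
     * calls MPI_Bcast so the sketch compiles). */
    static int optimized_bcast(void *buf, int count, MPI_Datatype type, int root, MPI_Comm comm)
    {
        return MPI_Bcast(buf, count, type, root, comm);
    }

    static int smart_bcast(void *buf, int count, MPI_Datatype type, int root, MPI_Comm comm)
    {
        int local_ok  = can_use_optimized(count);
        int global_ok = 0;

        /* The extra allreduce is the "slight latency hit": with MPI_LAND the optimized
         * algorithm is chosen only if every rank in the communicator agrees. */
        MPI_Allreduce(&local_ok, &global_ok, 1, MPI_INT, MPI_LAND, comm);

        if (global_ok)
            return optimized_bcast(buf, count, type, root, comm);
        return MPI_Bcast(buf, count, type, root, comm);   /* default MPICH2 fallback */
    }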
Collective Algorithms 5/5
- Remainder of slides
– MPI_Bcast
– MPI_Reduce, MPI_Allreduce
– MPI_Alltoall, MPI_Alltoallv
- Using both the tree and torus networks
– Tree operates only on MPI_COMM_WORLD
- Has a built-in ALU, but only fixed point :-(
– Torus has a deposit bit feature; requires a rectangular communicator shape (for most algorithms)
Broadcast 1/3
- MPICH2
– Binomial tree for short messages
– Scatter then allgather for large messages
– Performs poorly on BGL due to high CPU overhead and lack of topology awareness
- Torus
– Uses the deposit bit feature
– For an n-dimensional mesh, 1/n of the message is sent in each direction concurrently (see the sketch after this list)
- Tree
– Does not use ALU
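A minimal sketch of the splitting step only (the 3-dimension assumption and byte-based splitting are illustrative; the real implementation drives the torus hardware and deposit bit directly):

    #include <stddef.h>

    #define DIMS 3   /* 3D torus/mesh */

    /* Split len bytes into DIMS nearly equal chunks, one per dimension; each chunk
     * would then be broadcast along its own spanning tree concurrently. */
    static void split_for_torus_bcast(size_t len, size_t off[DIMS], size_t cnt[DIMS])
    {
        size_t chunk = len / DIMS, pos = 0;
        for (int d = 0; d < DIMS; d++) {
            off[d] = pos;
            cnt[d] = (d == DIMS - 1) ? len - pos : chunk;   /* last chunk absorbs the remainder */
            pos += cnt[d];
        }
    }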
Broadcast 2/3
- Red lines represent one spanning tree of half the message
- Blue lines represent another spanning tree of the other message half
Broadcast 3/3
Reduce & Allreduce 1/4
- Reduce is essentially a reverse broadcast
- Allreduce is a reduce followed by a broadcast (see the sketch after this list)
- Torus
– Can't use the deposit bit feature
– CPU bound, bandwidth is poor
– Solution: Hamiltonian path; huge latency penalty, but great bandwidth
- Tree
– Natural choice for reductions using integers!
– Floating point performance is bad
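A minimal sketch of the "reduce followed by broadcast" composition mentioned above (a real implementation fuses the two phases, but the result is equivalent):

    #include <mpi.h>

    static int allreduce_by_composition(const void *sendbuf, void *recvbuf, int count,
                                        MPI_Datatype type, MPI_Op op, MPI_Comm comm)
    {
        const int root = 0;   /* any fixed rank works */
        int rc = MPI_Reduce(sendbuf, recvbuf, count, type, op, root, comm);
        if (rc != MPI_SUCCESS)
            return rc;
        /* root now holds the reduced result; broadcast it back to everyone */
        return MPI_Bcast(recvbuf, count, type, root, comm);
    }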
Reduce & Allreduce 2/4
- Hamiltonian path for 4x4x4 cube
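One simple way to generate such a path is a "snake" ordering that reverses direction on alternate rows and planes, so consecutive points are always nearest neighbors (the 4x4x4 dimensions below match the figure; the ordering itself is an illustrative choice, not necessarily the one BGL uses):

    #include <stdio.h>

    #define X 4
    #define Y 4
    #define Z 4

    int main(void)
    {
        for (int z = 0; z < Z; z++) {
            for (int j = 0; j < Y; j++) {
                int y = (z % 2) ? (Y - 1 - j) : j;               /* reverse y on odd planes */
                for (int i = 0; i < X; i++) {
                    int x = ((z * Y + j) % 2) ? (X - 1 - i) : i; /* reverse x on odd rows   */
                    printf("(%d,%d,%d)\n", x, y, z);             /* next node on the path   */
                }
            }
        }
        return 0;
    }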
Reduce & Allreduce 3/4
Reduce & Allreduce 4/4
Alltoall and Alltoallv 1/5
- MPICH2 has 4 algorithms
– Yes, 4 separate ones
– BGL performance is bad due to network hot spots and CPU overhead
- Torus
– No communicator size restriction!
– Does not use the deposit bit feature
- Tree
– No alltoall tree algorithm; it would not make sense
Alltoall and Alltoallv 2/5
- BGL Optimized torus algorithm
– Uses randomized packet injection (see the sketch after this list)
– Each node creates a destination list
– Each node has the same seed value, different offset
- Bad memory performance?
– Yes!
– Torus payload is 240 bytes (8 cache lines)
– Multiple packets in adjacent cache lines to each destination are injected before advancing
- Measurements showed 2 packets to be optimal
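A minimal sketch of the destination ordering described above (the seed value and the use of the C library rand() are illustrative assumptions): every rank builds the same shuffled list from a common seed, then starts at its own offset, so ranks walk the list out of phase and avoid hammering one destination at a time.

    #include <stdlib.h>

    /* Fill dest[0..nprocs-1] with a randomized destination order for this rank. */
    static void make_dest_list(int nprocs, int myrank, int *dest)
    {
        int *perm = malloc(nprocs * sizeof *perm);
        for (int i = 0; i < nprocs; i++)
            perm[i] = i;

        srand(12345);                                /* same seed on every rank (arbitrary value) */
        for (int i = nprocs - 1; i > 0; i--) {       /* Fisher-Yates shuffle */
            int j = rand() % (i + 1);
            int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }

        for (int i = 0; i < nprocs; i++)             /* rotate by a per-rank offset */
            dest[i] = perm[(i + myrank) % nprocs];

        free(perm);
    }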
Alltoall and Alltoallv 3/5
Alltoall and Alltoallv 4/5
Alltoall and Alltoallv 5/5
Conclusion
- Optimized collectives on BGL are off to a good start
– Superior performance compared to MPICH2
– Exploit knowledge about network features
– Avoid performance penalties like memory copies and network hot spots
- Much work remains
– Short message collectives
– Non-rectangular communicators for the torus network
– Tree collectives using communicators other than MPI_COMM_WORLD
– Other collectives: scatter, gather, etc.