Cluster Computing
Massively Parallel Architectures
MPP Specifics
- No shared memory
- Scales to hundreds or thousands of processors
- Homogeneous sub-components
- Advanced custom interconnects
MPP Architectures
- There are numerous approaches to interconnecting CPUs in MPP architectures:
– Rings
– Grids
– Full interconnect
– Trees
– Dancehalls
– Hypercubes
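As a rough comparison of these options, here is a small sketch (mine, not from the slides) that prints the standard textbook worst-case distance (diameter) and link count for n nodes; it assumes n is a power of two and a perfect square where the topology needs it.

    /* Sketch: diameter and link cost of common MPP topologies with n nodes.
     * Closed forms are the standard textbook ones; n is assumed to be a
     * power of two / perfect square where the topology requires it. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        int n = 64;                   /* example node count */
        int d = (int)log2(n);         /* hypercube dimension */
        int side = (int)sqrt(n);      /* side of a square grid/torus */

        printf("ring:        diameter %d, links %d\n", n / 2, n);
        printf("2-D torus:   diameter %d, links %d\n", 2 * (side / 2), 2 * n);
        printf("binary tree: diameter ~%d, links %d\n", 2 * d, n - 1);
        printf("hypercube:   diameter %d, links %d\n", d, n * d / 2);
        printf("complete:    diameter 1, links %d\n", n * (n - 1) / 2);
        return 0;
    }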
Rings
- Worst-case distance: n-1 (one-directional ring), n/2 (bi-directional ring)
- Cost: n links
Chordal Ring 3
Chordal Ring 4
Barrel Shifter
- Worst-case distance: (log2 n)/2
Grid/Torus/Illiac Torus
- Worst-case distance: on the order of sqrt(n) (about 2(sqrt(n)-1) for a grid, sqrt(n) for a bi-directional torus)
- Cost: on the order of n links
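For the torus case the worst-case figure comes from taking the shorter way around each dimension's ring; a small illustrative routine (mine, not from the slides):

    /* Hop distance between nodes (x1,y1) and (x2,y2) on a k x k
     * bi-directional torus: per dimension, take the shorter way around
     * the ring, then sum the dimensions. Illustrative sketch only. */
    #include <stdio.h>

    static int ring_dist(int a, int b, int k) {
        int d = (a - b + k) % k;          /* distance going one way round */
        return d < k - d ? d : k - d;     /* shorter of the two directions */
    }

    static int torus_dist(int x1, int y1, int x2, int y2, int k) {
        return ring_dist(x1, x2, k) + ring_dist(y1, y2, k);
    }

    int main(void) {
        /* worst case on an 8 x 8 torus: 4 + 4 = 8 hops (~ sqrt(n)) */
        printf("%d\n", torus_dist(0, 0, 4, 4, 8));
        return 0;
    }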
Fully interconnected
- Worst-case distance: 1
- Cost: n(n-1)/2 links
Trees
- Worst-case distance: 2 log n
- Cost: n links
Fat Trees
- Worst-case distance: 2 log n
- Cost: n log n (link capacity grows towards the root)
Dancehalls/Butterflies
- Worst-case distance: log n
- Cost: n log n
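Routing through a butterfly is destination-tag routing: stage i of the log n stages looks only at bit i of the destination address, which is why every path crosses exactly log n stages. A minimal sketch (mine, not from the slides):

    /* Destination-tag routing through a butterfly with 2^stages terminals:
     * at stage i the 2x2 switch forwards on port 0 or 1 according to bit i
     * (MSB first) of the destination address. Illustrative sketch only. */
    #include <stdio.h>

    static void butterfly_route(unsigned dest, int stages) {
        for (int i = stages - 1; i >= 0; i--) {
            int port = (dest >> i) & 1;   /* bit i of the destination */
            printf("stage %d: output port %d\n", stages - 1 - i, port);
        }
    }

    int main(void) {
        butterfly_route(5, 3);   /* route to node 5 in an 8-terminal butterfly */
        return 0;
    }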
Hypercubes
- Worst-case distance: d (for n = 2^d nodes)
- Cost: nd/2 links (node degree d)
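Routing on a hypercube is dimension-order (e-cube) routing: neighbours differ in exactly one address bit, so a message corrects the differing bits of src XOR dst one at a time and never needs more than d hops, matching the worst-case distance above. A small sketch (mine, not from the slides):

    /* Dimension-order (e-cube) routing on a d-dimensional hypercube:
     * flip the differing bits of (src XOR dst) one dimension at a time.
     * The hop count equals the number of differing bits (<= d). */
    #include <stdio.h>

    static int hypercube_route(unsigned src, unsigned dst, int d) {
        unsigned node = src;
        int hops = 0;
        for (int bit = 0; bit < d; bit++) {
            if (((node ^ dst) >> bit) & 1) {   /* this dimension still wrong */
                node ^= 1u << bit;             /* move along that dimension */
                printf("hop %d: now at node %u\n", ++hops, node);
            }
        }
        return hops;
    }

    int main(void) {
        hypercube_route(0, 13, 4);   /* 0 -> 13 in a 16-node hypercube: 3 hops */
        return 0;
    }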
Intel Paragon
- Intel i860-based machine
- "Dual CPU" nodes
– 50 MHz CPUs
– Share a 400 MB/sec cache-coherent bus
- Grid architecture
- Mother of ASCI Red
SP2
- Based on RS/6000 nodes
– POWER2 processors
- Special NIC: the MSMU, on the Micro Channel bus
- Standard Ethernet on the Micro Channel bus
- MSMUs interconnected via an HPS backplane
SP2 MSMU
SP2 HPS
- Links are 8-bit parallel
- Contention-free latency is 5 ns per stage
– 875 ns latency for 512 nodes
ASCI Red
- Built by Intel for the Department of Energy
- Consists of almost 5,000 dual Pentium Pro boards with a special adaptation for user-level message passing
- Special support for internal 'firewalls'
ASCI Red Node
ASCI Red MRC
ASCI Red Grid
Scali
- Based on Intel or SPARC nodes
- Nodes are connected by a Dolphin SCI interface, using a grid of rings
- Very high-performance MPI and support for commodity operating systems
Performance???
Earth Simulator
BlueGene/L
October 2003: BG/L half-rack prototype – 500 MHz, 512 nodes / 1,024 processors, 2 TFlop/s peak, 1.4 TFlop/s sustained
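As a sanity check (my arithmetic, not the slides'), the 2 TFlop/s peak follows from 1,024 processors at 500 MHz if each double FPU retires two fused multiply-adds, i.e. 4 flops, per cycle (an assumption about the FPU):

    /* Back-of-the-envelope check of the prototype's peak rate:
     * 1,024 processors x 500 MHz x 4 flops/cycle ~= 2 TFlop/s.
     * The 4 flops/cycle (two fused multiply-adds) is an assumption. */
    #include <stdio.h>

    int main(void) {
        double procs = 1024, clock_hz = 500e6, flops_per_cycle = 4;
        printf("peak = %.3f TFlop/s\n", procs * clock_hz * flops_per_cycle / 1e12);
        return 0;
    }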
BlueGene/L ASIC node
Node components: PowerPC 440 cores, double 64-bit FPU, 2 KB L2, L3 cache directory (SRAM), L3 cache (EDRAM), DDR memory interface, JTAG, Gigabit Ethernet adapter
BlueGene/L Interconnection Networks
3-Dimensional Torus
– Interconnects all compute nodes (65,536)
– Virtual cut-through hardware routing
– 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
– Communications backbone for computations
– 350/700 GB/s bisection bandwidth (see the check after this list)
Global Tree
– One-to-all broadcast functionality
– Reduction operations functionality
– 2.8 Gb/s of bandwidth per link
– Latency of tree traversal on the order of 5 µs
– Interconnects all compute and I/O nodes (1024)
Ethernet
– Incorporated into every node ASIC
– Active in the I/O nodes (1:64)
– All external comm. (file I/O, control, user interaction, etc.)
Low-Latency Global Barrier and Interrupt Control Network
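A rough consistency check of the 350/700 GB/s bisection figure (my arithmetic; the 64 x 32 x 32 shape of the full 65,536-node torus is an assumption): halving the longest torus dimension is crossed by 2 x 32 x 32 links of 1.4 Gb/s each.

    /* Rough check of the quoted bisection bandwidth, assuming the full
     * 65,536-node system is a 64 x 32 x 32 torus.  Bisecting the long
     * dimension cuts 32*32 links plus the same number of wrap-around
     * links, each carrying 1.4 Gb/s (0.175 GB/s) per direction. */
    #include <stdio.h>

    int main(void) {
        double links = 2.0 * 32 * 32;        /* links crossing the cut */
        double gbytes_per_link = 1.4 / 8.0;  /* 1.4 Gb/s = 0.175 GB/s */
        double one_way = links * gbytes_per_link;
        printf("bisection: %.0f GB/s one way, %.0f GB/s both ways\n",
               one_way, 2 * one_way);
        return 0;
    }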
BG/L – Familiar software environment
- Fortran, C, C++ with MPI (a minimal example follows this list)
– Full language support
– Automatic SIMD FPU exploitation
- Linux development environment
– Cross-compilers and other cross-tools execute on Linux front-end nodes
– Users interact with the system from front-end nodes
- Tools – support for debuggers, hardware performance monitors, trace-based visualization
- POSIX system calls – compute processes "feel like" they are executing in a Linux environment (with restrictions)
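Since the programming model is ordinary MPI, a minimal, generic point-to-point example in C (nothing here is BlueGene/L-specific) of the kind that runs unchanged on such a machine:

    /* Minimal MPI example: rank 0 sends one integer to rank 1.
     * Plain MPI-1 point-to-point code, not specific to any machine. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }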
Measured MPI Send Bandwidth
Latency @ 500 MHz = (5.9 + 0.13 * Manhattan distance) µs
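A small helper (my sketch; only the coefficients come from the measurement above) that turns the model into a latency estimate given the per-dimension hop counts:

    /* Estimate of MPI latency from the measured model above:
     * latency [microseconds] ~= 5.9 + 0.13 * Manhattan distance,
     * where the Manhattan distance is the total number of torus hops
     * (x + y + z) between the two communicating nodes. */
    #include <stdio.h>

    static double bgl_latency_us(int hops_x, int hops_y, int hops_z) {
        return 5.9 + 0.13 * (hops_x + hops_y + hops_z);
    }

    int main(void) {
        /* nodes 10 hops apart in x, 5 in y, 3 in z -> 18 hops total */
        printf("estimated latency: %.2f us\n", bgl_latency_us(10, 5, 3));
        return 0;
    }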
NAS Parallel Benchmarks
- All NAS Parallel Benchmarks run successfully on 256 nodes (and many other configurations)
– No tuning / code changes
- Compared 500 MHz BG/L and 450 MHz Cray T3E
- All BG/L benchmarks were compiled with the GNU and XL compilers
– Report best result (GNU for IS)
- BG/L is a factor of two to three faster on five benchmarks (BT, FT, LU, MG, and SP), and a bit slower on one (EP)