Massively Parallel Architectures – MPP Specifics (Cluster Computing)


SLIDE 1

Cluster Computing

Massively Parallel Architectures

SLIDE 2

MPP Specifics

  • No shared memory
  • Scales to hundreds or thousands of processors
  • Homogeneous sub-components
  • Advanced custom interconnects
SLIDE 3

MPP Architectures

  • There are numerous approaches to interconnecting CPUs in MPP architectures:

– Rings
– Grids
– Full Interconnect
– Trees
– Dancehalls
– Hypercubes

SLIDE 4

Rings

Worst-case distance: n − 1 (one-way ring), ⌊n/2⌋ (bi-directional ring)
Cost: n links
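
A minimal sketch (not from the slides; the function names and the 8-node example are illustrative) of how the two ring metrics arise:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hop count from node i to node j in a ring of n nodes. */

/* One-way ring: messages travel in a single direction. */
static int ring_distance_oneway(int i, int j, int n) {
    return (j - i + n) % n;           /* worst case: n - 1 hops */
}

/* Bi-directional ring: take the shorter way around. */
static int ring_distance_bidir(int i, int j, int n) {
    int d = abs(i - j);
    return d < n - d ? d : n - d;     /* worst case: n / 2 hops */
}

int main(void) {
    int n = 8;
    printf("one-way 0 -> 7: %d hops\n", ring_distance_oneway(0, 7, n)); /* 7 = n - 1 */
    printf("bi-dir  0 -> 4: %d hops\n", ring_distance_bidir(0, 4, n));  /* 4 = n / 2 */
    return 0;
}
```

The cost is the same either way: one link per node, n links in total.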

SLIDE 5

Chordal Ring 3

SLIDE 6

Chordal Ring 4

SLIDE 7

Barrel Shifter

Worst-case distance: n/2

SLIDE 8

Grid/Torus/Illiac Torus

Worst-case distance: 2(√n − 1) for the grid, ≈ √n for the torus
Cost: ≈ 2n links
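
A small sketch (illustrative names; not from the slides) of the torus hop count, which is the Manhattan distance where each axis may wrap around:

```c
#include <stdio.h>
#include <stdlib.h>

/* Per-axis hop count on a torus: take the shorter way around. */
static int wrap_dist(int a, int b, int side) {
    int d = abs(a - b);
    return d < side - d ? d : side - d;
}

/* Hop count between (x1,y1) and (x2,y2) on a side x side 2-D torus. */
static int torus_distance(int x1, int y1, int x2, int y2, int side) {
    return wrap_dist(x1, x2, side) + wrap_dist(y1, y2, side);
}

int main(void) {
    int side = 4;   /* n = 16 nodes arranged 4 x 4 */
    /* Worst case: half-way around in both dimensions. */
    printf("(0,0) -> (2,2): %d hops\n", torus_distance(0, 0, 2, 2, side)); /* 4 = sqrt(n) */
    return 0;
}
```

Dropping the wrap-around links turns this into the plain grid, where the same corner-to-corner trip costs 2(√n − 1) hops.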

SLIDE 9

Fully interconnected

Worst-case distance: 1
Cost: n(n − 1)/2 links

SLIDE 10

Trees

Worst-case distance: 2 log n
Cost: n links
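
A sketch (not from the slides; the 1-based heap numbering is an assumption for illustration) showing why two leaves can be 2 log n hops apart:

```c
#include <stdio.h>

/* Hop count between nodes u and v in a complete binary tree,
 * using 1-based heap numbering (the parent of node k is k / 2).
 * Repeatedly move the deeper node up until the two paths meet
 * at the lowest common ancestor. */
static int tree_distance(unsigned u, unsigned v) {
    int hops = 0;
    while (u != v) {
        if (u > v) u /= 2; else v /= 2;  /* the larger index is at least as deep */
        hops++;
    }
    return hops;
}

int main(void) {
    /* Depth-3 tree: leftmost leaf is node 8, rightmost leaf is node 15.
     * Their paths meet only at the root: 3 hops up plus 3 hops down. */
    printf("8 -> 15: %d hops\n", tree_distance(8, 15)); /* 6 = 2 * log2(8) */
    return 0;
}
```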

SLIDE 11

Fat Trees

Worst-case distance: 2 log n
Cost: n log n (link bandwidth doubles toward the root)

SLIDE 12

Dancehalls/Butterflies

Worst-case distance: log n
Cost: n log n switches

SLIDE 13

Hypercubes

Worst-case distance: d = log₂ n
Cost: n · d / 2 links
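
In a hypercube the routing distance has a particularly clean form, so a short sketch (illustrative, not from the slides) is easy to give: node labels are d-bit numbers, neighbours differ in exactly one bit, and the hop count is the Hamming distance of the labels:

```c
#include <stdio.h>

/* Hop count between nodes a and b in a d-dimensional hypercube:
 * the Hamming distance of their labels (popcount of the XOR).
 * Worst case is d hops: a node and its bitwise complement. */
static int hypercube_distance(unsigned a, unsigned b) {
    unsigned x = a ^ b;
    int hops = 0;
    while (x) { hops += x & 1u; x >>= 1; }
    return hops;
}

int main(void) {
    /* d = 4, n = 16 nodes: node 0b0000 to its complement 0b1111. */
    printf("0 -> 15: %d hops\n", hypercube_distance(0, 15)); /* 4 = d */
    return 0;
}
```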

SLIDE 14

Intel Paragon

  • Intel i860 based machine
  • “Dual CPU”

– 50 MHz CPUs
– Share a 400 MB/sec cache-coherent bus

  • Grid architecture
  • Mother of ASCI Red
SLIDE 15

Intel Paragon

SLIDE 16

SP2

  • Based on RS/6000 nodes

– POWER2 processors

  • Special NIC: the MSMU, on the Micro Channel bus
  • Standard Ethernet on the Micro Channel bus
  • MSMUs interconnected via an HPS backplane

SLIDE 17

SP2 MSMU

SLIDE 18

SP2 HPS

  • Links are 8-bit parallel
  • Contention-free latency is 5 ns per stage

– 875 ns latency for 512 nodes

SLIDE 19

SP2 HPS

SLIDE 20

ASCI Red

  • Built by Intel for the Department of Energy
  • Consists of almost 5000 dual Pentium Pro (PPro) boards, with a special adaptation for user-level message passing
  • Special support for internal ‘firewalls’
SLIDE 21

ASCI Red Node

SLIDE 22

ASCI Red MRC

SLIDE 23

ASCI Red Grid

SLIDE 24

Scali

  • Based on Intel- or SPARC-based nodes
  • Nodes are connected by a Dolphin SCI interface, using a grid of rings
  • Very high-performance MPI and support for commodity operating systems

SLIDE 25

Performance???

SLIDE 26

Earth Simulator

SLIDE 27

ES

SLIDE 28

ES

SLIDE 29

ES

SLIDE 30

ES

SLIDE 31

BlueGene/L

October 2003: BG/L half-rack prototype

– 500 MHz
– 512 nodes / 1024 processors
– 2 TFlop/s peak, 1.4 TFlop/s sustained

SLIDE 32

BlueGene/L ASIC node

[Node ASIC diagram: PowerPC 440 cores, double 64-bit FPU, 2 KB L2, L3 cache directory (SRAM), L3 cache (embedded DRAM), DDR interface, JTAG, Gigabit Ethernet adapter]

SLIDE 33

BlueGene/L Interconnection Networks

3-Dimensional Torus

– Interconnects all compute nodes (65,536)
– Virtual cut-through hardware routing
– 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
– Communications backbone for computations
– 350/700 GB/s bisection bandwidth

Global Tree

– One-to-all broadcast functionality
– Reduction operations functionality (see the MPI sketch below)
– 2.8 Gb/s of bandwidth per link
– Latency of tree traversal on the order of 5 µs
– Interconnects all compute and I/O nodes (1024)

Ethernet

– Incorporated into every node ASIC
– Active in the I/O nodes (1:64)
– All external comm. (file I/O, control, user interaction, etc.)

Low-Latency Global Barrier and Interrupt Control Network
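
Broadcast and reduction are exactly the collectives the Global Tree accelerates. A minimal C/MPI sketch of both operations (generic MPI, not BG/L-specific code):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* One-to-all broadcast: the root's value reaches every node. */
    int config = (rank == 0) ? 42 : 0;
    MPI_Bcast(&config, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Reduction: a global sum collected at the root. */
    int local = rank, sum = 0;
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", nprocs - 1, sum);

    MPI_Finalize();
    return 0;
}
```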

SLIDE 34

BG/L – Familiar software environment

  • Fortran, C, C++ with MPI (see the sketch after this list)

– Full language support
– Automatic SIMD FPU exploitation

  • Linux development environment

– Cross-compilers and other cross-tools execute on Linux front-end nodes
– Users interact with the system from front-end nodes

  • Tools – support for debuggers, hardware performance monitors, trace-based visualization
  • POSIX system calls – compute processes “feel like” they are executing in a Linux environment (with restrictions)
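
As a concrete illustration of the familiar programming model, a minimal C/MPI program of the kind the cross-compilers on the front-end nodes would build for the compute nodes (generic MPI, not BG/L-specific):

```c
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Rank 0 sends a message to rank 1. */
        char msg[] = "hello from rank 0";
        MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        char buf[64];
        MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received: %s\n", buf);
    }

    MPI_Finalize();
    return 0;
}
```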

SLIDE 35

Measured MPI Send Bandwidth

Latency @ 500 MHz = 5.9 + 0.13 × (Manhattan distance) µs
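
The model is straightforward to evaluate for any pair of torus coordinates. A small sketch (the 8×8×8 torus and the function names are illustrative assumptions) combining the formula above with the wrap-around Manhattan distance:

```c
#include <stdio.h>
#include <stdlib.h>

/* Per-axis hop count on a torus of the given side (shorter way around). */
static int wrap_dist(int a, int b, int side) {
    int d = abs(a - b);
    return d < side - d ? d : side - d;
}

/* Measured BG/L model at 500 MHz: 5.9 us base cost plus 0.13 us per hop
 * of Manhattan distance on the 3-D torus. */
static double mpi_latency_us(int x1, int y1, int z1,
                             int x2, int y2, int z2, int side) {
    int hops = wrap_dist(x1, x2, side)
             + wrap_dist(y1, y2, side)
             + wrap_dist(z1, z2, side);
    return 5.9 + 0.13 * hops;
}

int main(void) {
    /* Opposite corners of an 8x8x8 torus: 4 + 4 + 4 = 12 hops. */
    printf("latency: %.2f us\n", mpi_latency_us(0, 0, 0, 4, 4, 4, 8));
    return 0;
}
```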

SLIDE 36

NAS Parallel Benchmarks

  • All NAS Parallel Benchmarks run successfully on 256 nodes (and many other configurations)

– No tuning / code changes

  • Compared 500 MHz BG/L and 450 MHz Cray T3E
  • All BG/L benchmarks were compiled with GNU and XL compilers

– Best result reported (GNU for IS)

  • BG/L is a factor of two to three faster on five benchmarks (BT, FT, LU, MG, and SP), and a bit slower on one (EP)