Cluster Computing
Massively Parallel Architectures
MPP Specifics
- No shared memory
- Scales to hundreds or thousands of processors
- Homogeneous sub-components
- Advanced custom interconnects
MPP Architectures
- There are numerous approaches to interconnecting CPUs in MPP architectures:
– Rings
– Grids
– Full interconnect
– Trees
– Dancehalls
– Hypercubes
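As a rough comparison of these options, here is a small sketch (mine, not from the slides) that prints the standard textbook worst-case distance (diameter) and link count for n nodes; it assumes n is a power of two and a perfect square where the topology needs it.

    /* Sketch: diameter and link cost of common MPP topologies with n nodes.
     * Closed forms are the standard textbook ones; n is assumed to be a
     * power of two / perfect square where the topology requires it. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        int n = 64;                   /* example node count */
        int d = (int)log2(n);         /* hypercube dimension */
        int side = (int)sqrt(n);      /* side of a square grid/torus */

        printf("ring:        diameter %d, links %d\n", n / 2, n);
        printf("2-D torus:   diameter %d, links %d\n", 2 * (side / 2), 2 * n);
        printf("binary tree: diameter ~%d, links %d\n", 2 * d, n - 1);
        printf("hypercube:   diameter %d, links %d\n", d, n * d / 2);
        printf("complete:    diameter 1, links %d\n", n * (n - 1) / 2);
        return 0;
    }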
Rings
- Worst-case distance: n-1 (one-directional ring), n/2 (bi-directional ring)
- Cost: n links
Chordal Ring 3
Chordal Ring 4
Barrel Shifter
- Worst-case distance: (log2 n)/2
Grid/Torus/Illiac Torus
- Worst-case distance: on the order of sqrt(n) (about 2(sqrt(n)-1) for a grid, sqrt(n) for a bi-directional torus)
- Cost: on the order of n links
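For the torus case the worst-case figure comes from taking the shorter way around each dimension's ring; a small illustrative routine (mine, not from the slides):

    /* Hop distance between nodes (x1,y1) and (x2,y2) on a k x k
     * bi-directional torus: per dimension, take the shorter way around
     * the ring, then sum the dimensions. Illustrative sketch only. */
    #include <stdio.h>

    static int ring_dist(int a, int b, int k) {
        int d = (a - b + k) % k;          /* distance going one way round */
        return d < k - d ? d : k - d;     /* shorter of the two directions */
    }

    static int torus_dist(int x1, int y1, int x2, int y2, int k) {
        return ring_dist(x1, x2, k) + ring_dist(y1, y2, k);
    }

    int main(void) {
        /* worst case on an 8 x 8 torus: 4 + 4 = 8 hops (~ sqrt(n)) */
        printf("%d\n", torus_dist(0, 0, 4, 4, 8));
        return 0;
    }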
Fully interconnected
- Worst-case distance: 1
- Cost: n(n-1)/2 links
Trees
- Worst-case distance: 2 log n
- Cost: n links
Fat Trees
- Worst-case distance: 2 log n
- Cost: n log n (link capacity grows towards the root)
Dancehalls/Butterflies
- Worst-case distance: log n
- Cost: n log n
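Routing through a butterfly is destination-tag routing: stage i of the log n stages looks only at bit i of the destination address, which is why every path crosses exactly log n stages. A minimal sketch (mine, not from the slides):

    /* Destination-tag routing through a butterfly with 2^stages terminals:
     * at stage i the 2x2 switch forwards on port 0 or 1 according to bit i
     * (MSB first) of the destination address. Illustrative sketch only. */
    #include <stdio.h>

    static void butterfly_route(unsigned dest, int stages) {
        for (int i = stages - 1; i >= 0; i--) {
            int port = (dest >> i) & 1;   /* bit i of the destination */
            printf("stage %d: output port %d\n", stages - 1 - i, port);
        }
    }

    int main(void) {
        butterfly_route(5, 3);   /* route to node 5 in an 8-terminal butterfly */
        return 0;
    }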
Hypercubes
- Worst-case distance: d (for n = 2^d nodes)
- Cost: nd/2 links (node degree d)
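Routing on a hypercube is dimension-order (e-cube) routing: neighbours differ in exactly one address bit, so a message corrects the differing bits of src XOR dst one at a time and never needs more than d hops, matching the worst-case distance above. A small sketch (mine, not from the slides):

    /* Dimension-order (e-cube) routing on a d-dimensional hypercube:
     * flip the differing bits of (src XOR dst) one dimension at a time.
     * The hop count equals the number of differing bits (<= d). */
    #include <stdio.h>

    static int hypercube_route(unsigned src, unsigned dst, int d) {
        unsigned node = src;
        int hops = 0;
        for (int bit = 0; bit < d; bit++) {
            if (((node ^ dst) >> bit) & 1) {   /* this dimension still wrong */
                node ^= 1u << bit;             /* move along that dimension */
                printf("hop %d: now at node %u\n", ++hops, node);
            }
        }
        return hops;
    }

    int main(void) {
        hypercube_route(0, 13, 4);   /* 0 -> 13 in a 16-node hypercube: 3 hops */
        return 0;
    }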
Intel Paragon
- Intel i860-based machine
- "Dual CPU" nodes
– 50 MHz CPUs
– Share a 400 MB/sec cache-coherent bus
- Grid architecture
- Mother of ASCI Red
SP2
- Based on RS/6000 nodes
– POWER2 processors
- Special NIC: the MSMU, on the Micro Channel bus
- Standard Ethernet on the Micro Channel bus
- MSMUs interconnected via an HPS backplane
SP2 MSMU
SP2 HPS
- Links are 8-bit parallel
- Contention-free latency is 5 ns per stage
– 875 ns latency for 512 nodes
ASCI Red
- Built by Intel for the Department of Energy
- Consists of almost 5,000 dual Pentium Pro boards with a special adaptation for user-level message passing
- Special support for internal 'firewalls'
ASCI Red Node
ASCI Red MRC
ASCI Red Grid
Scali
- Based on Intel or SPARC nodes
- Nodes are connected by a Dolphin SCI interface, using a grid of rings
- Very high-performance MPI and support for commodity operating systems
Performance???
Earth Simulator
BlueGene/L
October 2003: BG/L half-rack prototype – 500 MHz, 512 nodes / 1,024 processors, 2 TFlop/s peak, 1.4 TFlop/s sustained
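As a sanity check (my arithmetic, not the slides'), the 2 TFlop/s peak follows from 1,024 processors at 500 MHz if each double FPU retires two fused multiply-adds, i.e. 4 flops, per cycle (an assumption about the FPU):

    /* Back-of-the-envelope check of the prototype's peak rate:
     * 1,024 processors x 500 MHz x 4 flops/cycle ~= 2 TFlop/s.
     * The 4 flops/cycle (two fused multiply-adds) is an assumption. */
    #include <stdio.h>

    int main(void) {
        double procs = 1024, clock_hz = 500e6, flops_per_cycle = 4;
        printf("peak = %.3f TFlop/s\n", procs * clock_hz * flops_per_cycle / 1e12);
        return 0;
    }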
BlueGene/L ASIC node
Node components: PowerPC 440 cores, double 64-bit FPU, 2 KB L2, L3 cache directory (SRAM), L3 cache (EDRAM), DDR memory interface, JTAG, Gigabit Ethernet adapter
BlueGene/L Interconnection Networks
3-Dimensional Torus
– Interconnects all compute nodes (65,536)
– Virtual cut-through hardware routing
– 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
– Communications backbone for computations
– 350/700 GB/s bisection bandwidth (see the check after this list)
Global Tree
– One-to-all broadcast functionality
– Reduction operations functionality
– 2.8 Gb/s of bandwidth per link
– Latency of tree traversal on the order of 5 µs
– Interconnects all compute and I/O nodes (1024)
Ethernet
– Incorporated into every node ASIC
– Active in the I/O nodes (1:64)
– All external comm. (file I/O, control, user interaction, etc.)
Low-Latency Global Barrier and Interrupt Control Network
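A rough consistency check of the 350/700 GB/s bisection figure (my arithmetic; the 64 x 32 x 32 shape of the full 65,536-node torus is an assumption): halving the longest torus dimension is crossed by 2 x 32 x 32 links of 1.4 Gb/s each.

    /* Rough check of the quoted bisection bandwidth, assuming the full
     * 65,536-node system is a 64 x 32 x 32 torus.  Bisecting the long
     * dimension cuts 32*32 links plus the same number of wrap-around
     * links, each carrying 1.4 Gb/s (0.175 GB/s) per direction. */
    #include <stdio.h>

    int main(void) {
        double links = 2.0 * 32 * 32;        /* links crossing the cut */
        double gbytes_per_link = 1.4 / 8.0;  /* 1.4 Gb/s = 0.175 GB/s */
        double one_way = links * gbytes_per_link;
        printf("bisection: %.0f GB/s one way, %.0f GB/s both ways\n",
               one_way, 2 * one_way);
        return 0;
    }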
BG/L – Familiar software environment
- Fortran, C, C++ with MPI (a minimal example follows this list)
– Full language support
– Automatic SIMD FPU exploitation
- Linux development environment
– Cross-compilers and other cross-tools execute on Linux front-end nodes
– Users interact with the system from front-end nodes
- Tools – support for debuggers, hardware performance monitors, trace-based visualization
- POSIX system calls – compute processes "feel like" they are executing in a Linux environment (with restrictions)
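Since the programming model is ordinary MPI, a minimal, generic point-to-point example in C (nothing here is BlueGene/L-specific) of the kind that runs unchanged on such a machine:

    /* Minimal MPI example: rank 0 sends one integer to rank 1.
     * Plain MPI-1 point-to-point code, not specific to any machine. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }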
Measured MPI Send Bandwidth
Latency @ 500 MHz = (5.9 + 0.13 * Manhattan distance) µs
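A small helper (my sketch; only the coefficients come from the measurement above) that turns the model into a latency estimate given the per-dimension hop counts:

    /* Estimate of MPI latency from the measured model above:
     * latency [microseconds] ~= 5.9 + 0.13 * Manhattan distance,
     * where the Manhattan distance is the total number of torus hops
     * (x + y + z) between the two communicating nodes. */
    #include <stdio.h>

    static double bgl_latency_us(int hops_x, int hops_y, int hops_z) {
        return 5.9 + 0.13 * (hops_x + hops_y + hops_z);
    }

    int main(void) {
        /* nodes 10 hops apart in x, 5 in y, 3 in z -> 18 hops total */
        printf("estimated latency: %.2f us\n", bgl_latency_us(10, 5, 3));
        return 0;
    }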
NAS Parallel Benchmarks
- All NAS Parallel Benchmarks run successfully on 256 nodes (and many other configurations)
– No tuning / code changes
- Compared 500 MHz BG/L and 450 MHz Cray T3E
- All BG/L benchmarks were compiled with the GNU and XL compilers
– Report best result (GNU for IS)
- BG/L is a factor of two to three faster on five benchmarks (BT, FT, LU, MG, and SP), and a bit slower on one (EP)