Introduction to PC-Cluster Hardware I


SLIDE 1

Introduction to PC-Cluster Hardware I

Russian-German School on High-Performance Computer Systems, 27th June - 6th July, Novosibirsk

  • Day 1, 27th of June, 2005

HLRS, University of Stuttgart

SLIDE 2

Outline

  • Motivation
  • Hardware Architectures

    – Architectural design of classic Personal Computers
    – IA32 architecture, Pentium-4 series of processors
    – Pipelining
    – Multiprocessor architecture
    – Examples
    – Evolution of Supercomputers

SLIDE 3

Motivation

SLIDE 4

We need the compute power

  • Relevant engineering problems require performance that is orders of magnitude higher than what is available
  • CFD: simulation of turbulence at a reasonable level of resolution
  • Combustion: combination of turbulence simulation and realistic chemical models
  • Climate simulation: resolution required that is orders of magnitude higher than today
  • Bioinformatics, chemistry, ...
SLIDE 5

How Has Compute Power Been Increasing?

  • Moore's law: the performance of a computer doubles every 18 months
  • This was realized by:
    – Downsizing the structures on the silicon
    – Increasing the clock frequency
    – Adding functional units
    – Improving the functional units
  • Physical limits:
    – Speed of light: at a clock rate of 10 GHz, the signal travel distance within one clock tick is 3 cm
    – Cooling (packaging)

SLIDE 6

We cannot go on like this

  • Surprisingly, it looks like we are already at the physical limit:
    – Intel cancelled the current Pentium IV development line
    – The clock rate can no longer grow by orders of magnitude (7 GHz looks to be the current limit due to leakage currents)
  • Fast hardware (e.g. ECL or GaAs) has a high power consumption, therefore the potential for higher integration is limited
  ⇒ The processor suppliers have announced that future CPUs will have several processors on a die (currently 2 processors / 2 HT)
  ⇒ In the future, parallel architectures will be essential and everywhere, even at the desk.

SLIDE 7

Motivation

Questions & Response

SLIDE 8

Abstract Model – Questions & Response

  Reality → Physical Model → Mathematical Model → Numerical Scheme → Application Program → a few parallel Programming Models (e.g. MPI, HPF, OpenMP) → Hardware Architecture

SLIDE 9

Hardware Architectures

SLIDE 10

History – Intel chips of the 4th generation (starting 1972)

  • "Highly Integrated Circuits" ("Large Scale Integration", LSI – VLSI)
    Back then: thousands of transistors per cm²
  • In 1971 Intel designed an all-round processor for the Japanese firm Busicom: the 4004

                            4004         Pentium 4
  Transistors               2300         42 million
  Technology                10 µm        0.13 µm
  Frequency                 108 kHz      3.5 GHz
  Addressable memory        640 Byte     4 GB
  Bus width                 4 Bit        64 Bit
  Performance (instr./s)    0.06 MIPS    3792 MIPS
  Die size                  12 mm²       217 mm²

SLIDE 11

How does a processor work? 1/3

The architecture of a Personal Computer (numbers are theoretical!):

  [Diagram: Processor – Cache: 12.8 GB/s; Northbridge – Memory: 6.4 GB/s; Northbridge – Graphics Card: 2.13 GB/s; Southbridge – Hard disks 1/2: 320 MB/s; Southbridge – USB: 60 MB/s]

  • The processor executes simple commands
  • These are read out of memory
  • But: main memory is slow (theor.: 7-8 ns)
  • The cache decouples the processor from memory (for well-behaved codes)
  • Access to the devices and hard disks is especially slow (hard disk: ~10 ms)

  These are theoretical values only! In practice you see about 1.2 GB/s to memory.
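To relate the theoretical numbers above to what a program actually sees, a plain copy loop can be timed. This is only an illustrative sketch in the spirit of a streaming benchmark; the array size and the convention of counting one read plus one write per element are our assumptions, not values from the slide.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Time a plain array copy and report the achieved memory bandwidth.
     * Counts 2 * N * sizeof(double) bytes: one read and one write per element. */
    int main(void) {
        const size_t N = 1u << 24;            /* 16 Mi doubles = 128 MB per array */
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        if (!a || !b) return 1;
        for (size_t i = 0; i < N; i++) a[i] = (double)i;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < N; i++) b[i] = a[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double mbytes = 2.0 * N * sizeof(double) / 1e6;
        /* Print one element so the compiler cannot drop the copy loop. */
        printf("copy: %.0f MB in %.3f s -> %.0f MB/s (b[1] = %.0f)\n",
               mbytes, sec, mbytes / sec, b[1]);
        free(a); free(b);
        return 0;
    }

The measured MB/s is typically well below the theoretical peak of the memory bus, which is exactly the point the slide makes.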

SLIDE 12

How does a processor work? 2/3

  [Diagram: Register 1-3, ALU, FPU, Stack Pointer, Program Counter, Cache, Memory Management Unit]

  • Instruction Fetch: fetch the instruction which the program counter (PC) points to.
  • Instruction Decode: decode the instruction add r3, r1, r2 and load the registers.
  • Instruction Execute: the Arithmetic Logic Unit adds up the arguments.
  • Write Back: write the register value.
  • Increment the Program Counter.

SLIDE 13

How does a processor work? 3/3

  [Diagram: Register 1-3, ALU, FPU, Stack Pointer, Program Counter, Cache, Memory Management Unit]

  • Instruction Fetch: fetch the instruction which the PC points to.
  • Instruction Decode: decode the instruction jmp switch (= PC + Offset); load PC and Offset into the ALU.
  • Instruction Execute: the Arithmetic Logic Unit adds PC and Offset.
  • Write Back: write the register value.
  • Increment the Program Counter – not necessary here.
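To make the cycle concrete, here is a small C sketch of a toy register machine that runs the two instruction types from the last two slides (an add and a relative jump). The opcodes, encoding and register count are invented for this illustration; it is not a model of the IA32 instruction set.

    #include <stdio.h>

    /* Toy register machine.
     * OP_ADD rd, ra, rb : r[rd] = r[ra] + r[rb]
     * OP_JMP offset     : pc = pc + offset   (the ALU adds PC and offset)
     * OP_HALT           : stop                                              */
    enum { OP_ADD, OP_JMP, OP_HALT };

    typedef struct { int op, a, b, c; } instr_t;

    int main(void) {
        int r[4] = {0, 5, 7, 0};           /* register file */
        instr_t mem[] = {                  /* "main memory" holding the program */
            { OP_ADD, 3, 1, 2 },           /* r3 = r1 + r2 */
            { OP_JMP, 1, 0, 0 },           /* skip the next instruction */
            { OP_ADD, 3, 3, 3 },           /* (skipped) */
            { OP_HALT, 0, 0, 0 },
        };
        int pc = 0;                        /* program counter */
        for (;;) {
            instr_t in = mem[pc];          /* instruction fetch */
            pc = pc + 1;                   /* increment the program counter */
            switch (in.op) {               /* instruction decode */
            case OP_ADD:                   /* execute: ALU adds the arguments */
                r[in.a] = r[in.b] + r[in.c];   /* write back the register value */
                break;
            case OP_JMP:                   /* execute: ALU adds PC and offset */
                pc = pc + in.a;            /* no register write-back needed */
                break;
            case OP_HALT:
                printf("r3 = %d\n", r[3]); /* prints 12 */
                return 0;
            }
        }
    }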
SLIDE 14

Pentium IV Hyperthreading

SLIDE 15

Picture of Pentium IV Die

SLIDE 16

Pentium IV processors

  • A jump (backwards?) from Northwood (130 nm) to Prescott (90 nm):
SLIDE 17

Cache performance comparison

  • Comparison of read/write performance of Northwood & Prescott:

  L1 Read Bandwidth       MB/s     Bytes/cycle
  Northwood 3.06 GHz      23705    7.73
  Prescott  3.20 GHz      23206    7.25

  L2 Read Bandwidth       MB/s     Bytes/cycle
  Northwood 3.06 GHz      12162    3.97
  Prescott  3.20 GHz      13146    4.11

  Source: http://www.hardwareanalysis.com
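For orientation: the bytes/cycle column is simply the bandwidth divided by the clock rate, e.g. 23705 MB/s ÷ 3.06 GHz ≈ 7.7 bytes per cycle for the Northwood L1 value (≈ 7.73 if the nominal 3.066 GHz clock is used, which presumably is how the table was computed).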

SLIDE 18

Cache – Functioning of a Cache 1/3

  • How is 1 GB of memory mapped into a 1 MB cache?
  • The cache is organized in lines: 64 Bytes / line, 16384 lines
  • If you load one byte within a cache line (not yet in cache), the whole line is loaded: a 4-byte access pulls a full 64-byte line from memory into the cache.

  [Diagram: Registers, ALU, FPU, Stack Pointer, Program Counter, Cache, Memory Management Unit; memory transfers a 64-byte line into the cache for a 4-byte access]
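The consequence of the line granularity can be made visible with a small timing sketch: touching only one byte out of every 64 still forces every line to be loaded, so it is nowhere near 64x faster than touching every byte. The buffer size and timing method below are illustrative choices; the 64-byte line size is the one quoted on the slide.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define LINE 64                        /* cache-line size from the slide */

    /* Touch one byte out of every `stride` bytes and return the elapsed time. */
    static double touch(volatile char *buf, size_t n, size_t stride) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < n; i += stride) buf[i]++;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

    int main(void) {
        const size_t n = 256u << 20;       /* 256 MB, much larger than the cache */
        char *buf = calloc(n, 1);
        if (!buf) return 1;
        touch(buf, n, LINE);               /* warm-up: fault in all pages first */
        /* Both runs load every 64-byte line once, so the stride-64 run is far
         * less than 64x faster than the stride-1 run. */
        printf("stride  1: %.3f s\n", touch(buf, n, 1));
        printf("stride %2d: %.3f s\n", LINE, touch(buf, n, LINE));
        free(buf);
        return 0;
    }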

SLIDE 19

Cache – Functioning of a Cache 2/3

  • Associativity of the cache:
    – Direct Mapped Cache: every cache line is hard-allocated to memory – here 16384 memory addresses would share the same cache line: inefficient.
    – Fully Associative Cache: any cache line may store data from any address in memory – this is not feasible in hardware: here 256 address comparators would be needed!
    – N-Way Set Associative: a compromise between the previous two. N parallel comparators are used, i.e. a line in memory may fit into one of N cache lines.
  • Pentium-4 Northwood: 4-way associativity
  • Pentium-4 Prescott: 8-way associativity (better?, slower!)
  • If the address is cached in a cache line: good.
  • If the address is not cached: fetch from memory, expel an "old" cache line.
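How an address is split for the lookup can be sketched for the configuration on the previous slide (1 MB cache, 64-byte lines, 16384 lines): with 8 ways there are 16384 / 8 = 2048 sets, so 6 bits select the byte within the line, 11 bits select the set, and the remaining bits form the tag that the 8 comparators check in parallel. The helper below illustrates that arithmetic only; it is not the real Pentium-4 lookup logic.

    #include <stdio.h>
    #include <stdint.h>

    /* Split a 32-bit address for a 1 MB, 8-way set-associative cache with
     * 64-byte lines: offset 6 bits, set index 11 bits, tag = the rest. */
    #define LINE_BITS 6
    #define SET_BITS  11

    int main(void) {
        uint32_t addr   = 0x12345678;      /* an arbitrary example address */
        uint32_t offset = addr & ((1u << LINE_BITS) - 1);
        uint32_t set    = (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);
        uint32_t tag    = addr >> (LINE_BITS + SET_BITS);
        printf("addr 0x%08x -> tag 0x%x, set %u, offset %u\n",
               addr, tag, set, offset);
        /* On a lookup, the 8 stored tags of this set are compared in parallel;
         * a direct-mapped cache would have exactly one candidate line instead. */
        return 0;
    }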
SLIDE 20

Cache – Functioning of a Cache 3/3

  • Which cache line (of the N possible) to expel?
  • Theory: expel the one that is least likely (if at all) to be used in the future.
  • The Pentium-4 uses a pseudo Least-Recently Used (LRU) algorithm:
    – the part of the address information not needed for the lookup is used for that.
  [Diagram: address bit fields (bits 31 ... 15 ... 5) with a "touched" marker]
  • Why is there a separate instruction cache?
  • The instruction stream has different access characteristics (more locality due to loops, jumps).
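As an illustration of what "pseudo-LRU" means, here is a minimal tree-based pseudo-LRU for one 4-way set, the scheme commonly used in hardware; whether the Pentium-4 implements exactly this variant is not stated on the slide. Three bits per set point toward the least-recently-used half at each level of a small binary tree.

    #include <stdio.h>

    /* Tree pseudo-LRU for one 4-way set. b[0] chooses between the pairs
     * {way0,way1} and {way2,way3}, b[1] within the first pair, b[2] within
     * the second. A bit value of 0 means "the LRU candidate is the lower way
     * on this branch". */
    typedef struct { unsigned char b[3]; } plru_t;

    /* Mark way w (0..3) as most recently used: point the bits on its path
     * away from it. */
    static void plru_touch(plru_t *s, int w) {
        if (w < 2) { s->b[0] = 1; s->b[1] = (w == 0); }
        else       { s->b[0] = 0; s->b[2] = (w == 2); }
    }

    /* Choose the victim by following the bits toward the LRU side. */
    static int plru_victim(const plru_t *s) {
        if (s->b[0] == 0) return s->b[1] ? 1 : 0;
        return s->b[2] ? 3 : 2;
    }

    int main(void) {
        plru_t set = {{0, 0, 0}};
        int accesses[] = {0, 1, 2, 3};
        for (int i = 0; i < 4; i++) plru_touch(&set, accesses[i]);
        /* After touching ways 0,1,2,3 in order, way 0 is the oldest. */
        printf("victim: way %d\n", plru_victim(&set));   /* prints: way 0 */
        return 0;
    }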

SLIDE 21

Dual-Core CPUs

  • To speed up computers, the frequency will become less and less important.
  • Instead, multiple cores are being employed on the die:
    e.g. the fastest dual-core chip, the Intel 840D: two cores, each with two hardware threads (HT).
  • All of them share the cache...

  [Diagram: two Pentium 4 cores (3.4 GHz each) and AGP 4x attached to the Memory Controller Hub (MCH), which connects to memory and to the I/O Controller Hub; link bandwidths 3.2 GB/s, 1.6 GB/s, 1.6 GB/s, 1 GB/s and 266 MB/s]

SLIDE 22

Dual-Core CPUs

  • AMD Opteron's HyperTransport is a solution for dual-CPU/dual-core SMP systems with high memory I/O requirements:

  [Diagram: two Opterons at 2.6 GHz, each with its own memory, coupled by HyperTransport (16-bit, 1 GHz, 8 GB/s); a PCI-X tunnel and bus connection attach PCI-Express, Gigabit Ethernet, SATA disks and legacy peripherals]

SLIDE 23

Where do we find parallelism?

  • In the functional units: pipelining -> vector computing
  • Combined instructions -> e.g. multiply-add as one instruction
  • Functional Parallelism -> modern processor technology
  • Multithreading
  • Array-Processing
  • Multiprocessors (strongly coupled) -> Shared memory
  • Multicomputers (weakly coupled) -> Distributed memory

Memory access concepts:

  • Cache based
  • Vector access via several memory banks
  • Pre-load, pre-fetch

  → MFLOP/s performance and MB/s or MWord/s memory bandwidth

  • Hybrid architectures

SLIDE 24

The Classification of Flynn

  • Classify architectures according to the multiplicity of data and instructions
  • SI: single instruction for all processors
  • MI: multiple instructions for different processors
  • SD: single data for all processors
  • MD: multiple data for different processors

  • SISD → classical processor
  • SIMD → array processor
  • MIMD → distributed or shared memory
  • SPMD → single program & multiple data
  • MPMD → multiple program & multiple data
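SPMD in practice: a single program is started on all processes, and each process selects its own work from its rank. A minimal MPI sketch (assumes an MPI installation; compile with mpicc and start e.g. with mpirun -np 4):

    #include <stdio.h>
    #include <mpi.h>

    /* SPMD: one program, many processes; the rank decides what each one does. */
    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)
            printf("rank 0 of %d: I could read the input and distribute it\n", size);
        else
            printf("rank %d of %d: I work on my own part of the data\n", rank, size);

        MPI_Finalize();
        return 0;
    }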
SLIDE 25

Pipelining

  a + b = c

  [Diagram: floating-point add pipeline with the stages "Adopt Exponent", "Add Mantissa", "Overflow Exponent", "Normalize"; successive operand pairs (a1,b1), (a2,b2), ... occupy the stages one step apart]

SLIDE 26

Pipelining

  • c = a + b
  • Pipeline stages: load a(i), load b(i), add exponent, add mantissa, handle overflow, normalize result, store into c(i)
  • The elements i = 1, 2, 3, ... enter the pipeline one cycle apart (startup time of the pipeline)
  • After the startup time, a result value is stored in each cycle and each unit of the pipeline is active.
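The cycle count follows directly from this picture: with s pipeline stages the first result appears after s cycles (the startup time) and every further result one cycle later, i.e. s + (n - 1) cycles for n additions. A small illustrative computation (the 7 stages are the ones named on the slide):

    #include <stdio.h>

    /* Cycles needed for c(i) = a(i) + b(i), i = 1..n, on a pipeline with
     * `stages` stages: the first result after `stages` cycles, then one
     * result per cycle. */
    static long pipeline_cycles(long stages, long n) {
        return stages + (n - 1);
    }

    int main(void) {
        long stages = 7;
        long n[] = {1, 10, 100, 1000000};
        for (int i = 0; i < 4; i++) {
            long c = pipeline_cycles(stages, n[i]);
            printf("n = %8ld: %8ld cycles, %.3f results/cycle\n",
                   n[i], c, (double)n[i] / c);
        }
        return 0;
    }

For long vectors the throughput approaches one result per cycle, which is why pipelining pays off mainly for long, regular operand streams.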

SLIDE 27

Multiprocessor - shared memory

  [Diagram: four CPUs connected through a memory interconnect to four memory segments]

SLIDE 28

Multiprocessor

  • A number of processors is coupled to a number of memory banks by a fast network
  • Each CPU has the same access speed to each memory bank
  • This concept is often referred to as Uniform Memory Access (UMA)
  • The bottleneck may be the network (crossbar, bus)
  • Systems: NEC SX-6, dual-/quad-processor PCs
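On such a shared-memory (UMA) system all CPUs see the same address space, so a loop can simply be split among threads. A minimal OpenMP sketch (compile e.g. with gcc -fopenmp; the array size and contents are arbitrary illustration values):

    #include <stdio.h>
    #include <omp.h>

    /* Shared-memory parallelism: all threads work on the same arrays in the
     * same address space; OpenMP splits the loop iterations among them. */
    int main(void) {
        const int n = 1000000;
        static double a[1000000], b[1000000], c[1000000];

        for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2.0 * i; }

        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];

        printf("computed with up to %d threads, c[42] = %.1f\n",
               omp_get_max_threads(), c[42]);
        return 0;
    }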
SLIDE 29

Switch (SX-4, SX-5)

  • Connects N CPUs to M memory banks
  • Number of switching elements N*M
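For example (illustrative numbers, not from the slide): connecting 16 CPUs to 1024 memory banks with a single-stage crossbar takes 16 x 1024 = 16384 switching elements, which is why the hardware cost of a crossbar grows quickly with N and M.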
SLIDE 30

Multiprocessor – Bus Network

  [Diagram: four CPUs connected via a shared bus to four memory segments]

SLIDE 31

Multicomputer - distributed memory (I)

  [Diagram: four nodes (PEs), each a CPU with its own memory segment, connected by a node interconnect]

SLIDE 32

Multicomputer - distributed memory (II)

  • A number of full processors with memory is coupled by a fast network
  • Each CPU has fast access to its own memory but slower access, or even none, to other CPUs' memories
  • This concept is often referred to as Non-Uniform Memory Access (NUMA) if the non-local memory is accessible by the other CPUs
  • Again, the network may become a bottleneck
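On a distributed-memory machine a CPU cannot simply load from a remote memory segment; data must be sent as messages over the node interconnect. A minimal MPI point-to-point sketch (assumes an MPI installation and at least two processes):

    #include <stdio.h>
    #include <mpi.h>

    /* Distributed memory: each process owns its own array; remote data must
     * be transferred explicitly over the interconnect. */
    int main(int argc, char **argv) {
        int rank;
        double local[4] = {0};
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < 4; i++) local[i] = i + 1;  /* data lives on node 0 */
            MPI_Send(local, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(local, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %.0f %.0f %.0f %.0f\n",
                   local[0], local[1], local[2], local[3]);
        }

        MPI_Finalize();
        return 0;
    }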
SLIDE 33

Multicomputer – switched network

  [Diagram: four nodes (PEs), each a CPU with its own memory segment, connected through a central switch]

SLIDE 34

Interconnects – 2D/3D mesh or torus

  – Each processor is connected by a link with 4 or 6 neighbors
  [Diagram: 3-D torus (8x8x3 nodes)]

SLIDE 35

Hybrid architectures

  • Most modern high-performance computing (HPC) systems are clusters of SMP nodes
  • SMP (symmetric multi-processing) inside each node
  • DMP (distributed memory parallelization) on the node interconnect

  [Diagram: several SMP nodes connected by a node interconnect]
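A common way to program such a cluster of SMP nodes combines both models: one MPI process per node for the DMP part and OpenMP threads inside each node for the SMP part. A minimal hybrid sketch (assumes MPI and OpenMP are available; compile e.g. with mpicc -fopenmp):

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    /* Hybrid parallelization: MPI between nodes, OpenMP threads inside a node. */
    int main(int argc, char **argv) {
        int rank, provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            #pragma omp critical
            printf("MPI rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }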

SLIDE 36

Other Architectures

  • ccNUMA (cache coherent non-uniform memory access)
    – a distributed (hybrid) architecture
    – looks like one big SMP
    – programmable like one big SMP
    – but a cluster of several small SMPs in reality
    – cache coherent
    – programming:
      • global access with the same load/store instructions as local access
      • parallelization, e.g., with OpenMP
      • ccNUMA with >500 CPUs and a multi-level network
      • parallelization, e.g., with Multi Level Parallelism (MLP)
  • DMP with RDMA (remote direct memory access)
    – programming:
      • global memory access with special instructions, but without the OS
      • e.g. Co-array Fortran, UPC (Unified Parallel C), shmem
SLIDE 37

Abbreviations

  • Network of workstations (NOW) → distributed memory
  • Beowulf-class systems = clusters of Commercial Off-The-Shelf (COTS) PCs → distributed memory
  • Multiboard workstations/PCs → shared memory
  • SMP = symmetric multiprocessing → shared memory
  • PVP = parallel vector processing
  • MPP = massively parallel processing
  • PE = processing element, e.g., one node of an MPP system

SLIDE 38

Examples

SLIDE 39

Cluster of Workstations (COW)

  [Diagram: cluster nodes connected by several networks – a Fibre Channel fast disk I/O network, a Myrinet/Quadrics/Infiniband network for high-performance computing and visualization, and a Gigabit Ethernet system network]

SLIDE 40

Networks for PC Clusters

  Network            Bandwidth (MB/s)   Latency (µs)
  Infiniband, 4x     900                3-4
  Quadrics           900                2-3
  Myrinet            300                3
  Gigabit Ethernet   100-110            11-40

SLIDE 41

e.g. Cray Strider @ HLRS

  • 128 nodes:
    – 2 x 2 GHz AMD Opteron processors
    – 4 GB of memory
  • Network:
    – Gigabit Ethernet
    – Myrinet
  • 2 TB of disk capacity
    – connected to 2 I/O nodes
  • In total:
    – 516 GB of memory
    – 1024 GFlops peak performance
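For orientation, the totals are consistent with the per-node figures: 128 nodes x 2 Opterons x 2 GHz x 2 floating-point operations per cycle = 1024 GFlops peak (assuming two flops per cycle per Opteron), and 128 x 4 GB = 512 GB, so the quoted 516 GB presumably also counts the memory of the two I/O nodes.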

SLIDE 42

Earth Simulator Project ESRDC / GS 40 (NEC)

  • System:
    – 640 nodes, 40 TFLOP/s, 10 TB memory
    – Optical 640x640 crossbar
    – 50 m x 20 m without peripherals
  • Node:
    – 8 CPUs, 64 GFLOP/s, 16 GB, SMP
    – ext. b/w: 2x16 GB/s
  • CPU:
    – Vector, 8 GFLOP/s, 500 MHz
    – Single-chip, 0.15 µm
    – 32 GB/s memory b/w
  • Virtual Earth – simulating:
    – Climate change (global warming)
    – El Niño, hurricanes, droughts
    – Air pollution (acid rain, ozone hole)
    – Diastrophism (earthquakes, volcanism)
  • Installation: 2002
    http://www.gaia.jaeri.go.jp/public/e_publicconts.html

  [Diagram: optical single-stage crossbar 640*640 (!) connecting Node 1 ... Node 640]

SLIDE 43

Parallel Hardware / Criteria

  • Processor performance
  • Network performance

    – Latency
    – Bandwidth of a connection
    – Total bandwidth of the network

  • I/O Performance (Is I/O parallel ?)
  • Scalability
  • Usability (Administration, User)
  • Programmability (models, tools)
  • Total Performance (Peak Performance, Sustained Performance)
SLIDE 44

Parallel Hardware / Criteria

  Criteria            COW           SMP           MPP
  Processor           Medium        Medium/High   Medium
  Network             Low-Medium    High          Medium
  I/O                 Low           High          Medium
  Scalability         Low-Medium    Medium        High
  Usability           Medium        High          High
  Programmability     Low           Medium        Medium
  Performance ratio   5-10%         5-80%         5-25%
  Hardware costs      Low-Medium    Low-High      High
  Personnel costs     High          Medium        Medium

SLIDE 45

Evolution of Supercomputers

  [Chart: peak performance (GFLOPS, 10 to 10,000) versus shipment year (1985 to 2000-05), with legend "Parallel" / "Vector". Machines shown include SX-2, SX-3, SX-4/32, SX-5/32, S-820, S-3800, CRAY-YMP, VPP300, VPP500, VPP700, nCUBE2, CM200, CM-5, Paragon, T3D, T3E, T3E-900, SR2001, SR2201, SR4300, SR8000, ASCI and the Earth Simulator. Annotations: improving cost performance; increase of parallel programs and low-cost hardware; difficulty of performance improvement, fading of new development.]