CSL 860: Modern Parallel Computation


SLIDE 1

CSL 860: Modern Parallel Computation

SLIDE 2

Categories of Processing

  • Flynn's classification (SISD, SIMD, MISD, MIMD)
  • Granularity

– Coarse grain: Cray C90, Fujitsu

  • small number of very powerful processors

– Fine grain: CM-2, Quadrics

  • Large number of relatively less powerful processors

– Medium grain: IBM SP2, CM-5

  • between the two extremes.

– Communication cost >> computation cost → coarse grain
– Communication cost << computation cost → fine grain

  • Address Space Organization

– Single/shared address space

  • Uniform Memory Access (UMA): SMP
  • Non-Uniform Memory Access (NUMA)

– Distributed memory

  • Message passing
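To make the message-passing model concrete, here is a minimal sketch in C using standard MPI point-to-point calls (MPI_Send/MPI_Recv); the example is illustrative and not part of the original slides:

```c
#include <mpi.h>
#include <stdio.h>

/* Message-passing sketch: each process owns a private address space,
 * so all data sharing is explicit. Rank 0 sends one int to rank 1. */
int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Build with mpicc and run with at least two processes (e.g. mpirun -np 2).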
SLIDE 3

Modern Multi-Processor

[Diagram: several CPUs, each with ALU, FPU, L1 cache, and state, connected by a bus / crossbar switch to shared memory (possibly with an L2 cache)]

Multi-CPU

[Diagram: one chip with several cores (ALU, FPU, L1 cache, state) sharing a system bus, bus-request logic, and shared memory]

Multi-core
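For contrast with message passing, a minimal shared-address-space sketch using OpenMP in C (illustrative, not from the slides): every thread, whether on a separate CPU or a core of one chip, reads and writes the same array through the shared memory pictured above.

```c
#include <stdio.h>

#define N 1000000

/* Shared-memory sketch: threads divide the loop among themselves and
 * all touch the same array; no explicit communication is required. */
int main(void) {
    static double a[N];
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;   /* each thread writes its chunk */
        sum += a[i];      /* per-thread partial sums are combined */
    }

    printf("sum = %f\n", sum);
    return 0;
}
```

Compile with gcc -fopenmp; without the flag the same code still runs, just serially.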

SLIDE 4

n-dim Grid/Mesh

SLIDE 5

Torus
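The only difference between the grid/mesh of the previous slide and the torus is the wraparound links. A small C sketch of neighbor computation for both (the 4×4 dimensions and function names are illustrative assumptions):

```c
#include <stdio.h>

#define ROWS 4
#define COLS 4

/* Torus: indices wrap around in every dimension. */
int torus_neighbor(int r, int c, int dr, int dc) {
    int nr = (r + dr + ROWS) % ROWS;
    int nc = (c + dc + COLS) % COLS;
    return nr * COLS + nc;            /* flattened node id */
}

/* Mesh: no links past the boundary. */
int mesh_neighbor(int r, int c, int dr, int dc) {
    int nr = r + dr, nc = c + dc;
    if (nr < 0 || nr >= ROWS || nc < 0 || nc >= COLS)
        return -1;                    /* edge node: neighbor missing */
    return nr * COLS + nc;
}

int main(void) {
    /* Node (0,0): the mesh has no "up" neighbor; the torus wraps. */
    printf("mesh : %d\n", mesh_neighbor(0, 0, -1, 0));   /* -1 */
    printf("torus: %d\n", torus_neighbor(0, 0, -1, 0));  /* 12 */
    return 0;
}
```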

SLIDE 6

Hypercube
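In a d-dimensional hypercube (as in the CM-2 and nCube later in these slides), node labels are d-bit integers and two nodes are linked exactly when their labels differ in one bit, so each node's d neighbors fall out of a single XOR. A tiny sketch (the dimension chosen here is illustrative):

```c
#include <stdio.h>

#define DIM 4   /* 2^4 = 16 nodes, each with 4 neighbors */

int main(void) {
    int node = 5;   /* binary 0101 */
    /* Flipping bit i with XOR yields the neighbor along dimension i. */
    for (int i = 0; i < DIM; i++)
        printf("dim %d neighbor: %d\n", i, node ^ (1 << i));
    return 0;
}
```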

SLIDE 7

Tree Network

SLIDE 8

Fat Tree Network

SLIDE 9

Butterfly
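Under one common labeling of a butterfly network (2^k rows of switches in k+1 ranks), the switch at (rank i, row j) has a straight edge to (i+1, j) and a cross edge to (i+1, j XOR 2^i). The sketch below prints that wiring; the labeling convention is an assumption, as some texts flip the bits in the opposite order:

```c
#include <stdio.h>

#define K 3   /* 2^3 = 8 rows, K+1 = 4 ranks of switches */

int main(void) {
    for (int i = 0; i < K; i++)            /* rank i wires to rank i+1 */
        for (int j = 0; j < (1 << K); j++)
            printf("(%d,%d) -> (%d,%d) straight, (%d,%d) cross\n",
                   i, j, i + 1, j, i + 1, j ^ (1 << i));
    return 0;
}
```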

SLIDE 10

Current Computer Speed

  • ~15 Gflop/core
  • ~60 Gflop for Quad-core
  • ~3GHz clocks


  • ~$1000
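Taking the slide's round numbers at face value, price/performance works out to roughly (a back-of-the-envelope estimate, not from the slides):

\[
\frac{\$1000}{\sim 60\ \text{Gflop/s}} \approx \$17\ \text{per Gflop/s},
\]

over three orders of magnitude cheaper than the ~$40,000/Gflop nCube3 figure quoted a few slides later.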
SLIDE 11

Cray

  • Late 70s
  • Small # vector processors
  • $9 million


  • 80 MHz clock
  • Later (Early 80s)

– 105 to 117 MHz clock
– 800 megaflops for 4-processor machine
– $15-20 million

SLIDE 12

Connection Machine

  • CM-2 (SIMD)

– Host connected
– ~1989
– 64K single-bit SIMD processors connected in a hypercube, plus 2K Weitek floating-point units
– 8 MHz clock
– 6 GFLOPS
– 400 MFLOPS per million dollars
– $15 million

  • CM-5 (MIMD)

– ~1991
– Fat tree network of 896 SPARC RISC processors

SLIDE 13

nCube

  • nCube 2 costs between $500,000 and $2m
  • $2m for 27 GFLOPS machine

nCube3 (1994)

  • 50 MHz
  • Processor Module: 512 nodes and 32 GB memory
  • Up to 20 Modules for a 1.0 TFLOP system of 10,240 nodes

  • $40 million
  • $40,000/Gflop
SLIDE 14

Maspar

  • Host + Array Control Unit (SIMD)
  • PEs connected to 8 neighbors; also a slow global router
  • 32-bit ALUs
  • 32 PEs per chip, up to 16K processors overall
  • 12.5 MHz clock
  • 1.2 Gflops
  • $1.5 million (~1000 flops/dollar-second)
  • Early 90s

SLIDE 15

Cray T90

  • 1995
  • 450 MHz
  • 4-32 vector processors

– Peak 1.8 Gflops per processor
– 57.6 Gflops total (32 processors)

  • Shared (up to) 8 GB memory
  • Multiple ports

– 3 × 64-bit words per cycle per CPU; × 32 CPUs → over 300 GB/s aggregate

  • 32-processor version cost $39 million.
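The >300 GB/s figure follows directly from the numbers on this slide (a quick check, assuming 64-bit = 8-byte words):

\[
3\ \text{words} \times 8\ \text{B} \times 450\ \text{MHz} \times 32\ \text{CPUs} \approx 345\ \text{GB/s}.
\]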
SLIDE 16

Roadrunner

  • $133 million
  • Multi-stage InfiniBand interconnect

– InfiniBand: 2-level fat tree; each leaf switch has 180 down links and 96 up links (18 such CUs); 12 up links from each CU connect to each of the 2nd-level switches

  • Cluster of 122,400 cores

– 6912 dual-core Opterons
– 12960 PowerXCell eDP: 116640 cores

  • peak 1.45 PetaFlops
SLIDE 17

IBM Cell Processor

SLIDE 18

NVIDIA GF8800

[Block diagram of the GF8800 (G80): Host → Data Assembler → vertex/geometry/pixel thread issue and Setup/Raster/ZCull; an array of streaming processors (SP) grouped with texture fetch (TF) units and L1 caches under a thread processor; L2 caches paired with frame buffer (FB) partitions]