
CSL 860: Modern Parallel Computation



  1. CSL 860: Modern Parallel Computation

  2. Categories of Processing
     • Flynn's classification
     • Granularity
       – Coarse grain: Cray C90, Fujitsu; small number of very powerful processors
       – Fine grain: CM-2, Quadrics; large number of relatively less powerful processors
       – Medium grain: IBM SP2, CM-5; between the two extremes
       – Communication cost >> computational cost → coarse grain
       – Communication cost << computational cost → fine grain
     • Address space organization
       – Single/shared address space
         • Uniform Memory Access (UMA): SMP
         • Non-Uniform Memory Access (NUMA)
       – Distributed memory
         • Message passing
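The granularity rule of thumb on this slide can be sketched as a small function. This is only an illustration: the name `suggested_granularity` and the `factor` threshold that stands in for ">>" and "<<" are assumptions, not something the slides define.

```python
def suggested_granularity(comm_cost: float, comp_cost: float, factor: float = 10.0) -> str:
    """Apply the slide's rule of thumb to a communication/computation cost pair.

    `factor` is an assumed threshold: a ratio of at least `factor` is read as
    "communication >> computation", a ratio of at most 1/factor as the reverse.
    """
    if comm_cost < 0 or comp_cost <= 0:
        raise ValueError("costs must be non-negative, and comp_cost must be positive")
    ratio = comm_cost / comp_cost
    if ratio >= factor:
        return "coarse"   # communication dominates: use few, large tasks
    if ratio <= 1.0 / factor:
        return "fine"     # computation dominates: many small tasks are affordable
    return "medium"       # between the two extremes


print(suggested_granularity(100.0, 1.0))  # communication-heavy case
print(suggested_granularity(1.0, 100.0))  # computation-heavy case
```

The costs can be in any unit (seconds, cycles) as long as both arguments use the same one.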

  3. Modern Multi-Processor [Figure: two shared-memory designs. Multi-CPU: several CPUs, each with its own state, ALU, FPU and L1 cache, connected by a bus / crossbar switch to shared memory (maybe with L2 cache). Multi-core: several cores, each with its own state, ALU and FPU, sharing an L1 cache that sends bus requests over the system bus to memory.]

  4. n-dim Grid/Mesh

  5. Torus

  6. Hypercube

  7. Tree Network

  8. Fat Tree Network

  9. Butterfly

  10. Current Computer Speed
      • ~15 Gflop/core
      • ~60 Gflop for quad-core
      • ~3 GHz clocks
      • ~$1000

  11. Cray
      • Late 70s
      • Small number of vector processors
      • $9 million
      • 80 MHz clock
      • Later (early 80s)
        – 105 to 117 MHz clock
        – 800 megaflops for 4-processor machine
        – $15-20 million

  12. Connection Machine
      • CM-2 (SIMD)
        – Host connected
        – ~1989
        – 64K single-bit SIMD processors connected in a hypercube, plus 2K Weitek floating-point units
        – 8 MHz clock
        – 6 GFLOPS
        – 400 MFLOPS per million dollars
        – Hypercube architecture
        – $15 million
      • CM-5 (MIMD)
        – ~1991
        – Fat tree network of 896 SPARC RISC processors

  13. nCube
      • nCube 2 costs between $500,000 and $2 million
      • $2 million for 27 GFLOPS machine
      nCube3 (1994):
      • 50 MHz
      • Processor module: 512 nodes and 32 GB memory
      • Up to 20 modules for 1.0 TFLOP system of 10,240 nodes
      • $40 million
      • $40,000/Gflop
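The nCube figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the slide. The helper name `dollars_per_gflop` is hypothetical, and the per-node memory assumes the 32 GB is spread evenly over a module's 512 nodes.

```python
def dollars_per_gflop(price_dollars: float, peak_gflops: float) -> float:
    """Price-performance as dollars per Gflop of peak speed."""
    return price_dollars / peak_gflops


# nCube3 figures as given on the slide.
NODES_PER_MODULE = 512
MAX_MODULES = 20
MODULE_MEMORY_GB = 32

total_nodes = NODES_PER_MODULE * MAX_MODULES                  # nodes in the largest system
memory_per_node_mb = MODULE_MEMORY_GB * 1024 / NODES_PER_MODULE  # assumes an even split

ncube3_cost = dollars_per_gflop(40e6, 1000.0)  # $40 million for 1.0 TFLOP
ncube2_cost = dollars_per_gflop(2e6, 27.0)     # $2 million for 27 GFLOPS

print(total_nodes, memory_per_node_mb)
print(round(ncube3_cost), round(ncube2_cost))
```

The node count and the nCube3 dollars-per-Gflop value agree with the slide's 10,240 nodes and $40,000/Gflop.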

  14. MasPar
      • Host
      • Array Control Unit
      • PEs connected to 8 neighbors
      • 32-bit ALUs
      • SIMD
      • Also a slow global router
      • 32 PEs per chip, up to 16K processors overall
      • 12.5 MHz clock
      • 1.2 Gflops
      • $1.5 million
      • ~1000 flops/dollar-second
      • Early 90s
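The "PEs connected to 8 neighbors" layout can be pictured as a 2-D grid in which each processing element talks to the cells around it. The sketch below is illustrative only. The 128 x 128 arrangement of the 16K PEs and the wraparound at the grid edges are assumptions, not details stated on the slide, and `eight_neighbors` is a hypothetical helper name.

```python
def eight_neighbors(row: int, col: int, rows: int = 128, cols: int = 128) -> list[tuple[int, int]]:
    """Return the 8 grid neighbors of PE (row, col).

    Assumes a rows x cols grid whose edges wrap around (a torus), so every
    PE has exactly 8 neighbors, including those on the border.
    """
    neighbors = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the PE itself
            neighbors.append(((row + dr) % rows, (col + dc) % cols))
    return neighbors


corner = eight_neighbors(0, 0)
print(len(corner), corner)
```

Under these assumptions, 128 x 128 gives the 16K processors mentioned on the slide, and a corner PE still reaches 8 distinct neighbors through the wraparound links.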

  15. Cray T90
      • 1995
      • 450 MHz
      • 4-32 vector processors
        – Peak 1.8 Gflops per processor
        – Peak 57.6 Gflops with 32 processors
      • Shared memory (up to 8 GB)
      • Multiple ports
        – 3 64-bit words per cycle per CPU; x32 CPUs gives > 300 GB/s
      • 32-processor version cost $39 million
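The peak-speed and memory-bandwidth figures for the T90 follow from the clock rate and the per-CPU numbers on the slide. A minimal check, using only slide values (the variable names are illustrative, and a 64-bit word is taken as 8 bytes):

```python
CLOCK_HZ = 450e6               # 450 MHz clock
CPUS = 32                      # largest configuration
PEAK_GFLOPS_PER_CPU = 1.8
WORDS_PER_CYCLE_PER_CPU = 3    # memory ports: 3 words per cycle per CPU
WORD_BYTES = 8                 # one 64-bit word

# Aggregate peak speed across all processors.
peak_gflops = CPUS * PEAK_GFLOPS_PER_CPU

# Floating-point operations each processor must retire per clock to hit its peak.
flops_per_cycle = PEAK_GFLOPS_PER_CPU * 1e9 / CLOCK_HZ

# Aggregate memory bandwidth in GB/s (1 GB = 1e9 bytes).
bandwidth_gb_s = WORDS_PER_CYCLE_PER_CPU * WORD_BYTES * CLOCK_HZ * CPUS / 1e9

print(peak_gflops, flops_per_cycle, bandwidth_gb_s)
```

The results, about 57.6 Gflops and 345.6 GB/s, are consistent with the slide's "57.6 Gflops" and "> 300 GB/s".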

  16. Roadrunner
      • $133 million
      • Multi-stage InfiniBand interconnect
        – InfiniBand: 2-level fat tree; each leaf switch has 180 down links and 96 up links (18 such CUs); 12 up links from each CU connected to each of the 2nd-level switches
      • Cluster
      • 122400 cores
        – 6912 dual-core Opterons
        – 12960 PowerXCell eDP: 116640 cores
      • Peak 1.45 petaflops
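Some of the Roadrunner numbers can be derived from one another. The sketch below assumes 9 cores per Cell chip (one PPE plus eight SPEs) and assumes each leaf switch spreads its up links evenly over the second-level switches. It does not try to reproduce the slide's 122400 total, which may count a different configuration.

```python
CELL_CHIPS = 12960
CORES_PER_CELL = 9          # assumption: 1 PPE + 8 SPEs per chip
OPTERON_CHIPS = 6912
CORES_PER_OPTERON = 2       # dual-core

UP_LINKS_PER_LEAF = 96
DOWN_LINKS_PER_LEAF = 180
UP_LINKS_PER_SECOND_LEVEL_SWITCH = 12
LEAF_SWITCHES = 18          # one per connected unit (CU)

cell_cores = CELL_CHIPS * CORES_PER_CELL
opteron_cores = OPTERON_CHIPS * CORES_PER_OPTERON

# If every leaf sends 12 of its 96 up links to each second-level switch,
# the number of second-level switches follows directly.
second_level_switches = UP_LINKS_PER_LEAF // UP_LINKS_PER_SECOND_LEVEL_SWITCH

# Links arriving at one second-level switch from all leaves.
links_per_second_level_switch = LEAF_SWITCHES * UP_LINKS_PER_SECOND_LEVEL_SWITCH

print(cell_cores, opteron_cores)
print(second_level_switches, links_per_second_level_switch)
```

The Cell core count matches the slide's 116640, and the link figures imply 8 second-level switches under the even-distribution assumption.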

  17. IBM Cell Processor

  18. NVIDIA GF8800 [Figure: block diagram. Host; Data Assembler; Setup / Rstr / ZCull; Vtx Thread Issue, Geom Thread Issue, Pixel Thread Issue; Thread Processor; arrays of SP units with TF units and L1 caches; L2 caches; FB partitions.]
