
High Performance Hardware, Memory & CPU - PowerPoint PPT Presentation



  1. High Performance Hardware, Memory & CPU
Step Back, Look Inside (Many Don't). Insight, not numbers! Face the real world.
[Figure: memory hierarchy: CPU, 32 KB cache (6 GB/s), 2 MB cache, 2 GB RAM, 2 TB main store @ 11 Mb/s.]
Rubin H. Landau, with Sally Haerer and Scott Clark
Computational Physics for Undergraduates BS Degree Program: Oregon State University
"Engaging People in Cyber Infrastructure." Support by EPICS/NSF & OSU
Introductory Computational Science 1, Rubin Landau, EPIC/OSU 2006

  2. Problem: Optimize for Speedup
• Faster by smarter (algorithm), not bigger
• Yet at the limit you tune the program to the architecture: first locate the hot spots, then speed them up
• Negative side: hard work and (your) time intensive; tuning to local hardware/software lowers portability and readability
• CS view: "that's the compiler's job, not yours"
• CSE view: for large, complex, frequently run programs, tuning can gain 3-5x
• "CSE solves tomorrow's problems with yesterday's hardware; CS the reverse" (Press)
Introductory Computational Science 2, Rubin Landau, EPIC/OSU 2006
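
To make "first locate hot spots" concrete, here is a minimal sketch in C that times two candidate regions separately (the two loops are stand-in workloads, not from the slides); whichever region dominates the total runtime is the one worth tuning.

    #include <stdio.h>
    #include <time.h>

    /* Time two candidate regions separately to find the hot spot. */
    int main(void) {
        double s1 = 0.0, s2 = 0.0;

        clock_t t0 = clock();
        for (long i = 1; i <= 1000000L; i++)      /* region 1: cheap */
            s1 += 1.0 / (double) i;
        clock_t t1 = clock();
        for (long i = 1; i <= 100000000L; i++)    /* region 2: 100x the work */
            s2 += 1.0 / (double) i;
        clock_t t2 = clock();

        printf("region 1: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("region 2: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        printf("(sums: %f %f)\n", s1, s2);  /* keep the work from being optimized away */
        return 0;
    }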

  3. Theory: Rules of Optimization
1. "More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason - including blind stupidity." - W.A. Wulf
2. "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." - D. Knuth
3. "The best is the enemy of the good." - Voltaire
4. Do not do it.
5. (For experts only): "Do not do it yet." - M.A. Jackson
6. "Do not optimize as you go." - Jonathan Hardwick, www.cs.cmu.edu/~jch
7. Remember the 80/20 rule: 80% of the results come from 20% of the effort (also stated as 90/10).
8. Always run "before" and "after" benchmarks - fast wrong answers are not compatible with the search for truth (or with safe bridges).
9. Use the right algorithms and data structures!
Introductory Computational Science 3, Rubin Landau, EPIC/OSU 2006
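
Rules 8 and 9 in practice: a minimal before/after benchmark sketch in C, comparing a naive O(n^2) selection sort against the library's qsort (the array size and random data are illustrative choices, not from the slides). Both versions are timed and their outputs compared, so a fast wrong answer cannot slip through.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define N 20000

    static void selection_sort(double *a, int n) {   /* "before": O(n^2) */
        for (int i = 0; i < n - 1; i++) {
            int min = i;
            for (int j = i + 1; j < n; j++)
                if (a[j] < a[min]) min = j;
            double t = a[i]; a[i] = a[min]; a[min] = t;
        }
    }

    static int cmp(const void *p, const void *q) {   /* comparator for qsort */
        double x = *(const double *)p, y = *(const double *)q;
        return (x > y) - (x < y);
    }

    int main(void) {
        static double a[N], b[N];
        for (int i = 0; i < N; i++) a[i] = b[i] = rand() / (double) RAND_MAX;

        clock_t t0 = clock();
        selection_sort(a, N);                        /* "before" benchmark */
        clock_t t1 = clock();
        qsort(b, N, sizeof(double), cmp);            /* "after" benchmark */
        clock_t t2 = clock();

        printf("selection sort: %.3f s, qsort: %.3f s, same result: %s\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC,
               memcmp(a, b, sizeof a) == 0 ? "yes" : "no");
        return 0;
    }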

  4. Theory: HPC Components
• Supercomputers = the fastest, most powerful machines
• Now: parallel machines, PC/workstation based
• Linux/Unix ($$ if MS)
• HPC = a good balance of the major components:
  - multistaged (pipelined) units
  - multiple CPUs (parallel)
  - fast CPUs, but compatible ones
  - very large, very fast memories
  - very fast communications
  - vector, array processors (?)
  - software that integrates it all
[Figure: packaging hierarchy: Chip (2 processors, 2.8/5.6 Gflops, 4 MB) -> Card (2 chips, 11.2 Gflops, 1 GB DDR) -> Board (16 cards, 80 Gflops, 16 GB DDR) -> Cabinet (32 boards, 5.7 Tflops, 512 GB) -> System (64 cabinets, 360 Tflops, 32 TB).]
Introductory Computational Science 4, Rubin Landau, EPIC/OSU 2006

  5. Memory Hierarchy vs Arrays
• Ideal world: array storage is one contiguous block: A(1), A(2), A(3), ..., A(N)
• Real world: matrices are not stored as blocks; they are broken into lines spread across the hierarchy
[Figure: elements A(1),...,A(16) and A(2032),...,A(2048) sit in the data cache; M(1,1), M(2,1), M(3,1), ..., M(N,1), M(1,2), M(2,2), ..., M(N,N) are spread over RAM pages 1...N, swap space, and CPU registers.]
• Row major (C, Java): m(0,0) m(0,1) m(0,2) m(1,0) m(1,1) m(1,2) m(2,0) m(2,1) m(2,2)
• Column major (F90): m(1,1) m(2,1) m(3,1) m(1,2) m(2,2) m(3,2) m(1,3) m(2,3) m(3,3)
Introductory Computational Science 5, Rubin Landau, EPIC/OSU 2006
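
To see why storage order matters, a minimal sketch in C (row major): traversing the matrix in stored order walks memory with stride 1 and stays inside cache lines, while the swapped loop order jumps a whole row per access. The matrix size N is an illustrative choice, not from the slides.

    #include <stdio.h>

    #define N 1024

    static double m[N][N];   /* C stores this row by row */

    int main(void) {
        double sum = 0.0;

        /* Cache-friendly in C: the inner loop runs over the second
           index, so successive accesses are adjacent in memory. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];

        /* Cache-unfriendly in C (but the natural order for Fortran's
           column-major layout): stride of N doubles between accesses. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[i][j];

        printf("sum = %f\n", sum);
        return 0;
    }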

  6. Memory Hierarchy: Cost vs Speed
• CPU: registers, instructions, FPA (8 GB/s)
• Cache: high-speed buffer, 32 KB (6 GB/s)
• Cache lines: latency issues (5.5 GB/s)
• RAM: random access memory, 2 GB (with 2 MB cache)
• RISC: reduced instruction set computer (see slide 9)
• Hard disk: cheap and slow; 2 TB main store @ 11 Mb/s
• Pages: length = 4 KB (386), 8-16 KB (Unix)
• Virtual memory: extends apparent RAM (32-bit addresses ≈ 4 GB limit)
• A little effort; the cost ($$) is time: page faults, e.g. with multitasking/windows
Introductory Computational Science 6, Rubin Landau, EPIC/OSU 2006
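
A quick way to see the page length the slide refers to: a minimal sketch in C, assuming a Unix-like system where the POSIX call sysconf(_SC_PAGESIZE) is available.

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* sysconf(_SC_PAGESIZE) reports the virtual-memory page size
           in bytes; 4096 is typical on x86 Linux. */
        long page = sysconf(_SC_PAGESIZE);
        printf("page size = %ld bytes\n", page);
        return 0;
    }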

  7. High Performance Hardware, Memory & CPU (Part II: Examples)
[Figure: memory hierarchy: CPU, 32 KB cache (6 GB/s), 2 MB cache, 2 GB RAM, 2 TB main store @ 11 Mb/s.]
Rubin H. Landau, with Sally Haerer and Scott Clark
Computational Physics for Undergraduates BS Degree Program: Oregon State University
"Engaging People in Cyber Infrastructure." Support by EPICS/NSF & OSU
Introductory Computational Science 7, Rubin Landau, EPIC/OSU 2006

  8. Central Processing Unit
• Interacting memories: registers, data cache, RAM pages, swap space (see the hierarchy figure above)
• Pipelines give speed: prepare the next step during the previous one, like a bucket brigade
• Example: c = (a + b) / (d * f)

  Unit | Step 1  | Step 2  | Step 3   | Step 4
  A1   | Fetch a | Fetch b | Add      |
  A2   | Fetch d | Fetch f | Multiply |
  A3   |         |         |          | Divide

Introductory Computational Science 8, Rubin Landau, EPIC/OSU 2006
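
The same idea drives a common hand tuning: give the pipeline independent work so the bucket brigade stays full. A minimal sketch in C; the two-accumulator split is a standard technique, not from the slides.

    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N];
        for (int i = 0; i < N; i++) a[i] = 1.0 / (i + 1.0);

        /* One accumulator: each add must wait for the previous add,
           so the floating-point pipeline stalls between iterations. */
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            sum += a[i];

        /* Two independent accumulators: the adds to sum0 and sum1
           can overlap in the pipeline. */
        double sum0 = 0.0, sum1 = 0.0;
        for (int i = 0; i < N; i += 2) {   /* N is even */
            sum0 += a[i];
            sum1 += a[i + 1];
        }
        printf("%f %f\n", sum, sum0 + sum1);
        return 0;
    }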

  9. CPU Design: RISC
• RISC = Reduced Instruction Set Computer (the HPC approach)
• Contrast CISC = Complex Instruction Set Computer (the earlier design):
  - high-level microcode on chip (1000s of instructions)
  - complex instructions run slowly (~10 cycles/instruction)
• RISC: a smaller (simpler) instruction set on chip
  - F90 and C compilers translate for the RISC architecture
  - simpler (fewer cycles/instruction), cheaper, possibly faster
  - saved instruction space means more CPU registers
  - more pipelines, fewer memory conflicts, some parallelism
• Theory: T_CPU = (# instructions) x (cycles/instruction) x (cycle time)
  CISC: fewer instructions executed; RISC: fewer cycles per instruction
Introductory Computational Science 9, Rubin Landau, EPIC/OSU 2006
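
A worked example of the timing equation, with illustrative numbers (not from the slides): suppose a task compiles to 1.0 x 10^6 CISC instructions at 10 cycles each, or to 1.5 x 10^6 RISC instructions at 1.5 cycles each, both on a 1 ns clock. Then

    T_CISC = 10^6 x 10  x 1 ns = 10 ms
    T_RISC = 1.5 x 10^6 x 1.5 x 1 ns = 2.25 ms

so the RISC version wins despite executing more instructions: fewer cycles per instruction outweighs the larger instruction count.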

  10. Latest & Greatest: IBM Blue Gene
• A. Gara et al., IBM J. Res. & Dev. (2005)
• From specific aims (gene science) to a general supercomputer
• Linux ($$ if MS)
• Designed by committee
[Figure: packaging hierarchy: Chip (2 processors, 2.8/5.6 Gflops, 4 MB) -> Card (2 chips, 11.2 Gflops, 1 GB DDR) -> Board (16 cards, 80 Gflops, 16 GB DDR, double data rate) -> Cabinet (32 boards, 5.7 Tflops, 512 GB) -> System (64 cabinets, 360 Tflops, 32 TB).]
• Extreme scale: 65,536 (2^16) nodes
• Balances cost/performance and performance/watt
• Peak = 360 teraflops (1 teraflop = 10^12 flops)
• Medium-speed 5.6 Gflop chips (run cool)
• On-chip and off-chip distributed memories
• 2 chips/card, 16 cards/board
• 2 cores per chip: 1 computes, 1 communicates
• Control: distributed-memory MPI
Introductory Computational Science 10, Rubin Landau, EPIC/OSU 2006

  11. Blue Gene's 3 Communication Networks
• Fig (a): 64 x 32 x 32 3-D torus (a 2 x 2 x 2 version shown); the links are chips that also compute; supports both nearest-neighbor and cut-through routing; all nodes see roughly the same effective bandwidth; node to node: 1.4 Gb/s, ~1 ns
• Program for speed: keep communication local; 100 ns < latency < 6.4 us (64 hops)
• Fig (b): global collective network; broadcasts to all processors
• Fig (c): control network plus Gb-Ethernet for I/O, switches, and devices; > Tb/s
Introductory Computational Science 11, Rubin Landau, EPIC/OSU 2006
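
A minimal sketch of how a program addresses such a torus through MPI's standard Cartesian topology calls; the 2 x 2 x 2 dimensions here are illustrative, not the machine's 64 x 32 x 32.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Map the processes onto a small 3-D torus (periodic in all
           dimensions); reorder = 1 lets MPI match ranks to hardware. */
        int dims[3] = {2, 2, 2}, periods[3] = {1, 1, 1};
        MPI_Comm torus;
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &torus);

        /* Find the nearest neighbors along the x direction. */
        int left, right;
        MPI_Cart_shift(torus, 0, 1, &left, &right);

        /* Nearest-neighbor exchange: send right, receive from left. */
        int rank, recvd;
        MPI_Comm_rank(torus, &rank);
        MPI_Sendrecv(&rank, 1, MPI_INT, right, 0,
                     &recvd, 1, MPI_INT, left, 0,
                     torus, MPI_STATUS_IGNORE);

        printf("rank %d received %d from its left neighbor\n", rank, recvd);
        MPI_Comm_free(&torus);
        MPI_Finalize();
        return 0;
    }

Run with, e.g., mpirun -np 8 ./torus so the 2 x 2 x 2 grid is fully populated.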
