Introduction to PC-Cluster Hardware I


  1. Introduction to PC-Cluster Hardware I
     Russian-German School on High-Performance Computer Systems, 27th June - 6th July, Novosibirsk
     Day 1, 27th of June, 2005
     HLRS, University of Stuttgart
     Introduction to PC-Cluster Hardware I, Slide 1, High Performance Computing Center Stuttgart

  2. Outline
     • Motivation
     • Hardware Architectures
       – Architectural design of classic personal computers
       – IA32 architecture, Pentium 4 series of processors
       – Pipelining
       – Multiprocessor architecture
       – Examples
       – Evolution of supercomputers

  3. Motivation

  4. We need the compute power
     • Relevant engineering problems require performance that is orders of magnitude higher than what is available
     • CFD: Simulation of turbulence at a reasonable level of resolution
     • Combustion: Combination of turbulence simulation and realistic chemical models
     • Climate simulation: The required resolution is orders of magnitude higher than today's
     • Bioinformatics, Chemistry, ...

  5. How Has Compute Power Been Increasing?
     • Moore's law (in its popular form): The performance of a computer doubles every 18 months
     • This was realized by:
       – Downsizing the structures on the silicon
       – Increasing the clock frequency
       – Adding functional units
       – Improving the functional units
     • Physical limits
       – Speed of light: at a clock rate of 10 GHz, a signal travels at most 3 cm within one clock tick
       – Cooling (packaging)
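
The 3 cm figure on this slide can be checked with a short calculation. The sketch below assumes the signal moves at the vacuum speed of light, which is an upper bound (real on-chip signals are slower); the function name is chosen here for illustration.

```python
# Distance a signal can travel in one clock tick, assuming it moves at the
# vacuum speed of light (an upper bound; real on-chip signals are slower).
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre


def distance_per_tick_cm(clock_hz: float) -> float:
    """Upper bound on the signal travel distance per clock period, in cm."""
    period_s = 1.0 / clock_hz
    return SPEED_OF_LIGHT_M_PER_S * period_s * 100.0


if __name__ == "__main__":
    for ghz in (1.0, 3.5, 10.0):
        print(f"{ghz:>4} GHz: {distance_per_tick_cm(ghz * 1e9):.2f} cm")
```

At 10 GHz this gives just under 3 cm, which is on the order of the size of a processor package.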

  6. We cannot go on like this
     • Surprisingly, it looks like we are already at the physical limit:
       – Intel cancelled the current Pentium IV development line
       – The clock rate can no longer grow by orders of magnitude (7 GHz looks to be the current limit due to leakage current)
     • Fast hardware (e.g. ECL or GaAs) has a high power consumption, therefore the potential for higher integration is limited
     ⇒ The processor suppliers announced that future CPUs will have several processors on a die (currently 2 processors / 2 HT)
     ⇒ In future, parallel architectures will be essential and everywhere, even at the desk.

  7. Motivation
     (Diagram labels: Questions, Response)

  8. (Diagram of layers linked by "Questions & Response". Labels: Abstract Model, Reality, Physical Model, Mathematical Model, Numerical Scheme, Application Program, a few parallel Programming Models (e.g. MPI, HPF, OpenMP), Hardware Architecture)

  9. Hardware Architectures

  10. History – Intel chips of the 4th generation (starting 1972)
      • "Highly Integrated Circuits" ("Large Scale Integration", LSI – VLSI). Back then: thousands of transistors per cm²
      • In 1971 Intel designs an all-round processor for the Japanese firm Busicom: the 4004

                                4004         Pentium 4
      Transistors               2300         42 million
      Technology                10 µm        0.13 µm
      Frequency                 108 kHz      3.5 GHz
      Addressable memory        640 Byte     4 GB
      Width of bus              4 bit        64 bit
      Performance (instr./s)    0.06 MIPS    3792 MIPS
      Die size                  12 mm²       217 mm²

  11. How does a processor work? 1/3
      The architecture of a Personal Computer (numbers are theoretical!):
      • The processor executes simple commands
      • These are read out of memory
      • But: Main memory is slow (theor.: 7-8 ns)
      • The cache decouples the processor from memory (for well-behaved codes)
      • Access to the devices and hard disks is especially slow (hard disk: ~10 ms)
      (Diagram components: Processor, Cache, Graphics Card, Northbridge, Memory, Southbridge, Harddisk 1, Harddisk 2, USB. Link bandwidths shown: 12.8 GB/s, 2.13 GB/s, 6.4 GB/s, 320 MB/s, 60 MB/s.)
      These are theoretical values only! To memory, you see 1.2 GB/s.
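
To get a feeling for how slow memory and disks are from the processor's point of view, the latencies on this slide can be converted into clock cycles. This is a rough sketch: the 3.5 GHz clock is an assumption taken from the Pentium 4 column of the table on slide 10, and the helper name is made up for illustration.

```python
# Convert an access latency into the number of processor clock cycles that
# pass while waiting. CLOCK_HZ is an assumption (Pentium 4 frequency quoted
# on slide 10), not a value given on this slide.
CLOCK_HZ = 3.5e9


def cycles_waited(latency_s: float, clock_hz: float = CLOCK_HZ) -> float:
    """Number of clock cycles that elapse during one access."""
    return latency_s * clock_hz


memory_cycles = cycles_waited(8e-9)   # theoretical main-memory latency, 8 ns
disk_cycles = cycles_waited(10e-3)    # hard-disk access, about 10 ms

if __name__ == "__main__":
    print(f"memory: about {memory_cycles:.0f} cycles")
    print(f"disk:   about {disk_cycles:.0f} cycles")
```

Under these assumptions a memory access costs a few tens of cycles and a disk access tens of millions, which is why the cache matters for well-behaved codes.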

  12. How does a processor work? 2/3
      • Instruction Fetch: Fetch the instruction that the PC points to.
      • Instruction Decode: Decode the instruction (add r3, r1, r2) and load the registers.
      • Instruction Execute: The Arithmetic Logic Unit adds up the arguments.
      • Write Back: Write the register value.
      • Increment the Program Counter.
      (Diagram components: Register 1, Register 2, Register 3, Stack Pointer, Prog. Counter, ALU, FPU, Memory Management Unit, Cache.)

  13. How does a processor work? 3/3
      • Instruction Fetch: Fetch the instruction that the PC points to.
      • Instruction Decode: Decode the instruction (jmp switch, with switch = PC + Offset) and load PC and Offset into the ALU.
      • Instruction Execute: The Arithmetic Logic Unit adds PC and Offset.
      • Write Back: Write the register value.
      • Increment the Program Counter: not necessary here.
      (Diagram components: Register 1, Register 2, Register 3, Stack Pointer, Prog. Counter, ALU, FPU, Memory Management Unit, Cache.)
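
The fetch, decode, execute and write-back steps of the last two slides can be sketched as a toy interpreter. This is only an illustration under simplifying assumptions: instructions are Python tuples rather than machine words, the register names and the two opcodes mirror the slide examples, and the program itself is hypothetical.

```python
def run(program, registers, max_steps=100):
    """Execute a toy program and return the final register file."""
    pc = 0  # program counter: index of the next instruction
    steps = 0
    while 0 <= pc < len(program) and steps < max_steps:
        instruction = program[pc]          # instruction fetch
        opcode, *operands = instruction    # instruction decode
        if opcode == "add":
            dest, src1, src2 = operands
            result = registers[src1] + registers[src2]  # execute in the ALU
            registers[dest] = result                    # write back
            pc += 1                                     # increment the PC
        elif opcode == "jmp":
            (offset,) = operands
            pc = pc + offset  # ALU adds PC and offset; no increment needed
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        steps += 1
    return registers


def demo():
    program = [
        ("add", "r3", "r1", "r2"),  # r3 = r1 + r2
        ("jmp", 2),                 # skip the next instruction
        ("add", "r3", "r3", "r3"),  # never executed
        ("add", "r1", "r3", "r2"),  # r1 = r3 + r2
    ]
    return run(program, {"r1": 2, "r2": 3, "r3": 0})


if __name__ == "__main__":
    print(demo())
```

The jump writes the new program counter directly, so the usual increment is skipped, as noted on the slide.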

  14. Pentium IV Hyperthreading

  15. Picture of Pentium IV Die

  16. Pentium IV processors
      • A jump (backwards?) from Northwood (130 nm) to Prescott (90 nm)

  17. Cache performance comparison
      • Comparison of Read/Write Performance of Northwood & Prescott:

                            L1 Read Bandwidth        L2 Read Bandwidth
                            MB/s     Bytes/cycle     MB/s     Bytes/cycle
      Northwood 3.06 GHz    23705    7.73            12162    3.97
      Prescott 3.20 GHz     23206    7.25            13146    4.11

      Source: http://www.hardwareanalysis.com
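
The Bytes/cycle columns follow from dividing the measured bandwidth by the clock frequency. The sketch below assumes 1 MB = 10^6 bytes and uses the nominal clock rates from the table; the function name is chosen here for illustration.

```python
def bytes_per_cycle(bandwidth_mb_s: float, clock_ghz: float) -> float:
    """Bandwidth per clock cycle, assuming 1 MB = 10**6 bytes."""
    bytes_per_second = bandwidth_mb_s * 1e6
    cycles_per_second = clock_ghz * 1e9
    return bytes_per_second / cycles_per_second


if __name__ == "__main__":
    measurements = [
        ("Northwood L1", 23705, 3.06),
        ("Northwood L2", 12162, 3.06),
        ("Prescott L1", 23206, 3.20),
        ("Prescott L2", 13146, 3.20),
    ]
    for name, mb_s, ghz in measurements:
        print(f"{name}: {bytes_per_cycle(mb_s, ghz):.2f} bytes/cycle")
```

With the nominal 3.06 GHz the Northwood L1 value comes out slightly above the 7.73 in the table, presumably because the actual clock is not exactly 3.06 GHz; the Prescott values match.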

  18. Cache – Functioning of a Cache 1/3
      • How is 1 GB of memory mapped into 1 MB of cache?
      • The cache is organized in lines: 64 bytes / line, 16384 lines
      • If you load one byte within a cache line (not yet in cache), the whole line is loaded.
      (Diagram components: Register 1, Register 2, Register 3, Stack Pointer, Prog. Counter, ALU, FPU, Memory Management Unit, Cache, Memory. Transfer sizes shown: 4 bytes, 64 bytes.)
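
The line organization can be illustrated with a little address arithmetic. A minimal sketch, using the 1 MB cache and 64-byte lines from this slide; the helper name and the example address are made up for illustration.

```python
CACHE_SIZE_BYTES = 1 * 1024 * 1024  # 1 MB cache, as on the slide
LINE_SIZE_BYTES = 64                # 64 bytes per cache line
NUM_LINES = CACHE_SIZE_BYTES // LINE_SIZE_BYTES


def line_span(address: int) -> tuple[int, int]:
    """Return (first, last) byte address of the line that holds `address`."""
    first = address & ~(LINE_SIZE_BYTES - 1)  # clear the 6 offset bits
    return first, first + LINE_SIZE_BYTES - 1


if __name__ == "__main__":
    print("lines in cache:", NUM_LINES)
    first, last = line_span(0x1234)
    print(f"loading byte 0x1234 loads 0x{first:X}..0x{last:X}")
```

Touching any single byte therefore brings in its 63 neighbours as well, which is what makes sequential access patterns cheap.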

  19. Cache – Functioning of a Cache 2/3
      • Associativity of Cache:
        – Direct Mapped Cache: Every cache line would be hard-allocated to memory; here 16384 memory addresses would share the same cache line: inefficient.
        – Fully Associative Cache: Any cache line may store data from any address in memory; this is not possible to do in hardware: here it would need 256 address comparators!
        – N-Way Set Associative: A compromise between the previous two. N parallel comparators are used, i.e. a line in memory may fit into one of N cache lines.
      • Pentium-4 Northwood: 4-way associativity
      • Pentium-4 Prescott: 8-way associativity (better? slower!)
      • If the address is cached in a cache line: good.
      • If the address is not cached: fetch from memory, expel the "old" cache line
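
How an address is assigned to a set under the three schemes can be sketched as follows. This is a simplified model, not the Pentium 4's actual hardware: it assumes the 1 MB cache with 64-byte lines from the previous slide and a plain modulo mapping, and the function name is hypothetical.

```python
CACHE_BYTES = 1 * 1024 * 1024  # assumption: cache size from the previous slide
LINE_BYTES = 64


def locate(address: int, ways: int) -> tuple[int, int]:
    """Return (set_index, tag) for `address` in a `ways`-way cache.

    ways == 1 models a direct-mapped cache; ways == number of lines
    models a fully associative cache (a single set).
    """
    num_lines = CACHE_BYTES // LINE_BYTES
    num_sets = num_lines // ways
    line_number = address // LINE_BYTES   # which memory line is addressed
    set_index = line_number % num_sets    # the set this line must go into
    tag = line_number // num_sets         # identifies the line within the set
    return set_index, tag


if __name__ == "__main__":
    address = 5 * 2**20 + 130
    for ways in (1, 4, 8, 16384):
        print(f"{ways:>5}-way: set, tag = {locate(address, ways)}")
```

Only the tags of the N lines in the selected set have to be compared in parallel, which is why N comparators suffice.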

  20. Cache – Functioning of a Cache 3/3
      • Which cache line (of the N possible) to expel?
      • Theory: Expel the one that is least likely (if at all) to be used in future.
      • The Pentium-4 uses a pseudo Least-Recently Used (LRU) algorithm:
        – The part of the address information not needed is used for that.
        (Diagram: 32-bit address with bit positions 31, 15, 5 and 0 marked, and a "touched" field.)
      • Why is there a separate Instruction Cache?
      • The instruction stream has different access characteristics (more locality due to loops, jumps).
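
The replacement decision can be illustrated with a small model of a single cache set. Note that this sketch implements true LRU, not the pseudo-LRU approximation the slide attributes to the Pentium 4; the class and method names are made up for illustration.

```python
from collections import OrderedDict


class LruSet:
    """One cache set with `ways` lines and true LRU replacement."""

    def __init__(self, ways: int):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag: int):
        """Access a line; return (hit, evicted_tag_or_None)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            return True, None
        evicted = None
        if len(self.lines) >= self.ways:
            evicted, _ = self.lines.popitem(last=False)  # expel the oldest
        self.lines[tag] = None
        return False, evicted


if __name__ == "__main__":
    cache_set = LruSet(ways=2)
    for tag in (1, 2, 1, 3):
        print(tag, cache_set.access(tag))
```

In the usage example, tag 2 is expelled when tag 3 arrives, because tag 1 was touched more recently.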

  21. Dual-Core CPUs
      • To speed up computers, the frequency will be less and less important.
      • Instead, multiple cores are being employed on the die: e.g. the fastest dual-core chip, Intel 840D: two cores, each with two HT.
      • All of them share the cache...
      (Diagram components: two Pentium 4 at 3.4 GHz, Memory Controller Hub (MCH), Memory, AGP 4x, I/O Controller Hub. Link bandwidths shown: 3.2 GB/s, 1.6 GB/s, 1.6 GB/s, 1 GB/s, 266 MB/s.)

  22. Dual-Core CPUs
      • AMD Opteron's HyperTransport is a solution for dual-CPU/dual-core SMP systems with high memory I/O requirements.
      (Diagram components: two Opterons at 2.6 GHz, each with its own memory (Mem), linked by HyperTransport (16 bit, 1 GHz, 8 GB/s); PCI-X tunnel; bus connections to PCI-Express, Gigabit Ethernet, SATA disks, legacy peripherals.)
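
One reading that reproduces the 8 GB/s quoted for the HyperTransport link is sketched below. It assumes two transfers per clock (double data rate) and that the figure counts both directions of the link together; both are assumptions of this sketch, as are the function and parameter names.

```python
def link_bandwidth_gb_s(
    width_bits: int,
    clock_ghz: float,
    transfers_per_clock: int = 2,  # assumption: double data rate
    directions: int = 2,           # assumption: both directions are counted
) -> float:
    """Aggregate link bandwidth in GB/s under the stated assumptions."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_ghz * transfers_per_clock * directions


if __name__ == "__main__":
    print(link_bandwidth_gb_s(16, 1.0), "GB/s aggregate")
    print(link_bandwidth_gb_s(16, 1.0, directions=1), "GB/s per direction")
```

Under these assumptions each direction carries 4 GB/s, so a single transfer stream between two Opterons would see at most half of the headline number.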
