  1. Institut für Numerische Mathematik und Optimierung Introduction to High Performance Computing and Optimization Oliver Ernst Audience: 1./3. CMS, 5./7./9. Mm, doctoral students Wintersemester 2012/13

  2. Contents
     1. Introduction
     2. Processor Architecture
     3. Optimization of Serial Code
        3.1 Performance Measurement
        3.2 Optimization Guidelines
        3.3 Compiler-Aided Optimization
        3.4 Combine example
     Oliver Ernst (INMO) Wintersemester 2012/13 1 HPC

  3. Contents
     1. Introduction
     2. Processor Architecture
     3. Optimization of Serial Code

  4. Processor Architecture: Von Neumann Architecture
     John von Neumann (1903–1957), Hungarian-American mathematician and computer science pioneer.
     “First Draft of a Report on the EDVAC” (1945): computer design based on a stored program; builds on previous work by J. P. Eckert and J. W. Mauchly (University of Pennsylvania ENIAC project) and earlier theoretical work by A. Turing.
     Essentially all electronic digital computers are based on this model.
     Von Neumann bottleneck: manipulation of data in memory happens only via traffic to the arithmetic logical unit (ALU); the width of this data path constrains all computing throughput. (John Backus, 1977)
     Inherently sequential architecture.
     [Diagram: CPU (Control Unit, ALU), Memory, Input, Output]

  5. Processor Architecture: Current Microprocessors
     Extremely complex manufactured devices. Feature size currently at 22 nm and decreasing. Transistor count ≈ 1.4 billion on 160 mm².
     Fortunately, it is enough to understand the basic schematic workings of modern microprocessors.
     [Images: Intel Westmere die shot; Intel Ivy Bridge die labelling]

  6. Processor Architecture: Microprocessor Block Diagram
     Arithmetic units for floating point (FP) and integer (INT) operations.
     CPU registers (FP and general-purpose).
     Load (LD) and store (ST) units for transferring operands to/from registers.
     Instructions sorted into queues.
     Caches hold data and instructions.
     [Diagram: L1 data and L1 instruction caches, unified L2 cache, memory queue and memory interface to main memory; INT and FP register files; functional units (shift, mask, INT op, FP mult, FP add); LD/ST units. Source: Hager & Wellein]

  7. Processor Architecture: Performance
     For scientific computing, performance is typically measured in floating point operations per second, i.e.,
        performance = (# floating point operations) / runtime
     Unit: FLOPS, FLOP/s, Flop/sec, ...
     What constitutes a FLOP?
       - An operation on IEEE double precision floating point (FP) numbers (64 bits)
       - FP add or FP multiply; division, square roots, etc. take several cycles
     Peak performance is defined as
        [max # floating point operations per cycle] × clock rate [Hz] × # cores × # sockets × # nodes
     Question: what is the peak performance of klio?

  8. Processor Architecture: Example: Intel Xeon 5160 (Woodcrest, June 2006)
     Architecture: Intel 64
     Microarchitecture: Core (successor to NetBurst); first CPU with this microarchitecture, server/workstation version of the Intel Core 2 processor.
     65 nm manufacturing process technology, socket LGA771.
     Dual core, 2 sockets, total of 4 cores.
     Clock frequency: 3 GHz.
     Each core of Woodcrest can perform 4 Flops in each clock cycle.
     Peak performance: 4 Flops × 2 cores × 2 sockets × 3 GHz = 48 GFlops
     But: higher rates are possible using SIMD instructions (MMX, SSE, AVX). More later.
     [Image: Woodcrest die shot; source: Intel]
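The peak-performance formula can be written out in a few lines of Python (a hypothetical helper for checking numbers, not part of the course material); the Woodcrest figures from this slide reproduce the 48 GFlops result:

```python
def peak_flops(flops_per_cycle, clock_hz, cores, sockets, nodes=1):
    """Peak performance = flops/cycle x clock rate x cores x sockets x nodes."""
    return flops_per_cycle * clock_hz * cores * sockets * nodes

# Intel Xeon 5160 (Woodcrest): 4 Flops/cycle, 3 GHz, 2 cores/socket, 2 sockets
woodcrest = peak_flops(flops_per_cycle=4, clock_hz=3e9, cores=2, sockets=2)
print(woodcrest / 1e9, "GFlops")  # 48.0 GFlops
```

The same function answers the klio question once that machine's core counts and clock rate are plugged in.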

  9. Processor Architecture: Some definitions
     Architecture: the instruction set of the CPU, also called instruction set architecture (ISA); the parts of a processor design that one needs to understand to write assembly code. Examples of ISAs:
       - Intel Architectures (IA): IA32/x86, Intel 64/EM64T, IA64
       - MIPS (SGI)
       - POWER (IBM)
       - SPARC (Sun)
       - ARM (Acorn)
     Microarchitecture: implementation of an ISA; invisible features such as caches, cache structure, CPU cycle time, details of the virtual memory system.
     Process technology: the size of the physical features (such as transistors) that make up the processor. Roughly: smaller is better, due to lower power consumption and more chips per silicon wafer in production.

  10. Processor Architecture: Intel architecture/microarchitecture/process roadmap
     Intel tick-tock schedule:
       Tick: die shrink, i.e., new process technology
       Tock: new microarchitecture
     Microarchitecture   Processor codename (server)   Process technology   Date introduced
     Core                Woodcrest/Clovertown          65 nm                06/2006
     Core                Dunnington/Harpertown         45 nm                11/2007
     Nehalem             Nehalem                       45 nm                11/2008
     Nehalem             Westmere                      32 nm                01/2010
     Sandy Bridge        Sandy Bridge                  32 nm                01/2011
     Sandy Bridge        Ivy Bridge                    22 nm                04/2012
     Haswell             Haswell                       22 nm                03/2013 (?)
     Haswell             Broadwell                     14 nm
     Skylake             Skylake                       14 nm
     Skylake             Skymont                       10 nm

  11. Processor Architecture: Modern processor features
     Pipelined instructions: separate complex instructions into simpler ones which are executed by different functional units in an overlapping fashion; increases throughput; an example of instruction-level parallelism (ILP).
     Superscalar architecture: multiple functional units operating concurrently.
     SIMD instructions (Single Instruction, Multiple Data): one instruction operates on a vector of data simultaneously. (Examples: Intel's SSE, AMD's 3dNow!, Power/PowerPC AltiVec)
     Out-of-order execution: when instruction operands are not available in registers, execute the next possible instruction(s) to avoid idle time (eligible instructions are held in a reorder buffer).
     Caches: small, fast, on-chip buffers for holding data which has recently been used (temporal locality) or is close in memory to data which has recently been used (spatial locality).
     Simplified instruction sets: Reduced Instruction Set Computers (RISC, 1980s), in contrast with CISC; simple instructions executing in few clock cycles allowed higher clock frequencies and freed up transistors; x86 processors translate to “µ-ops” on the fly.

  12. Processor Architecture: Pipelining: Example
     “Pipelining: it's natural.” [D. Patterson, UC Berkeley]
     Four students (Anoop, Brian, Christine & Djamal) doing laundry, one load each.
       Washing takes 30 minutes.
       Drying takes 40 minutes.
       Folding takes 20 minutes.
     Each complete laundry load, done sequentially, takes 90 minutes.

  13. Processor Architecture: Pipelining: Example: Sequential laundry
     [Timeline: 6 PM to Midnight; tasks A–D each occupy 30 + 40 + 20 minutes, back to back]
     Sequential laundry takes 6 hours for 4 loads.
     How long would the pipelined laundry take?

  14. Processor Architecture: Pipelining: Example: Pipelined laundry
     Start work as soon as possible.
     [Timeline: 6 PM onward; occupied slots of 30, 40, 40, 40, 40, 20 minutes for tasks A–D]
     Start each task as soon as the functional unit is available. Takes only 3.5 hours.

  15. Processor Architecture: Pipelining: Example: Pipelining lessons
     Pipelining doesn't help the latency of individual tasks; it helps the throughput of the entire workload.
     The pipeline is limited by its slowest stage.
     Multiple stages operate concurrently.
     Potential speedup = number of pipeline stages.
     Unbalanced lengths of pipeline stages reduce the speedup.
     Time to “fill” (start up) the pipeline and time to “drain” (wind down) it reduce the speedup, especially if there are few loads relative to the number of stages.
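The laundry timings can be checked with a small model (a sketch, not from the slides, assuming a finished load can always wait for its next stage): with n loads, the pipelined schedule finishes after sum(stages) + (n - 1) * max(stages) minutes, since once the first load is done the slowest stage, the dryer, gates every subsequent load.

```python
def sequential_time(stages, n_loads):
    """Each load runs through all stages before the next load starts."""
    return n_loads * sum(stages)

def pipelined_time(stages, n_loads):
    """Stages overlap across loads; after the first load drains the
    pipeline, the slowest stage emits one finished load per max(stages)."""
    return sum(stages) + (n_loads - 1) * max(stages)

stages = [30, 40, 20]                    # wash, dry, fold (minutes)
print(sequential_time(stages, 4) / 60)   # 6.0 (hours)
print(pipelined_time(stages, 4) / 60)    # 3.5 (hours)
```

The model also makes the "unbalanced stages" lesson concrete: shrinking the 40-minute drying stage would cut the per-load gating time directly.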

  16. Processor Architecture: Pipelining in computers
     Pipelining is the basis of vector processors and the greatest source of ILP today. Invisible to the programmer.
     Typical pipeline stages on a CPU (MIPS ISA):
       (1) Fetch instruction from memory
       (2) Read registers while decoding the instruction
       (3) Execute the operation or calculate an address
       (4) Access an operand in data memory
       (5) Write the result into a register
     Helpful: machine instructions of equal length (x86: from 1 to 17 bytes).

  17. Processor Architecture: Pipelining in computers: Pipelined floating point multiply (FPM)
     Pipeline stages: separate mantissas/exponents, multiply mantissas, add exponents, normalize result, insert sign.
     [Figure 1.5, source: Hager & Wellein: timeline for a simplified floating-point multiplication pipeline executing A(:)=B(:)*C(:). After a four-cycle wind-up (“filling”) phase, one result is generated on each cycle, followed by a wind-down phase.]
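Under the unit-time-per-stage model of Figure 1.5, N multiplies take N + 4 cycles in this five-stage pipeline (four wind-up cycles, then one result per cycle), versus 5N cycles without pipelining. A quick sketch (hypothetical helper, assuming every stage takes exactly one cycle):

```python
def pipelined_cycles(n_ops, n_stages=5):
    """Cycles to push n_ops through an n_stages pipeline:
    (n_stages - 1) wind-up cycles, then one result per cycle."""
    return (n_stages - 1) + n_ops

def speedup(n_ops, n_stages=5):
    """Speedup over unpipelined execution (n_stages cycles per op)."""
    return (n_stages * n_ops) / pipelined_cycles(n_ops, n_stages)

print(pipelined_cycles(100))   # 104 cycles for 100 multiplies
print(round(speedup(100), 2))  # 4.81, approaching the stage count of 5
```

This quantifies the earlier lesson that fill/drain overhead matters most for short loops: for large N the speedup approaches the number of stages.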
