  1. Institut für Numerische Mathematik und Optimierung Introduction to High Performance Computing and Optimization Oliver Ernst Audience: 1./3. CMS, 5./7./9. Mm, doctoral students Wintersemester 2012/13

  2. Contents
     1. Introduction
     2. Processor Architecture
     3. Optimization of Serial Code
        3.1 Performance Measurement
        3.2 Optimization Guidelines
        3.3 Compiler-Aided Optimization
        3.4 Combine example
     Oliver Ernst (INMO) Wintersemester 2012/13 1 HPC

  3. Contents
     1. Introduction
     2. Processor Architecture
     3. Optimization of Serial Code

  4. Processor Architecture: Von Neumann Architecture
     John von Neumann (1903–1957), Hungarian-American mathematician and computer science pioneer.
     “First Draft of a Report on the EDVAC” (1945): computer design based on a stored program; builds on previous work by J. P. Eckert and J. W. Mauchly (University of Pennsylvania ENIAC project) and earlier theoretical work by A. Turing.
     Essentially all electronic digital computers are based on this model.
     Von Neumann bottleneck: manipulation of data in memory happens only via traffic to the arithmetic logical unit (ALU); the width of this data path constrains all computing throughput. (John Backus, 1977)
     Inherently sequential architecture.
     [Diagram: CPU (Control Unit, ALU), Memory, Input, Output]

  5. Processor Architecture: Current Microprocessors
     Extremely complex manufactured devices. Feature size currently at 22 nm and decreasing. Transistor count ≈ 1.4 billion on 160 mm².
     Fortunately, it is enough to understand the basic schematic workings of modern microprocessors.
     [Images: Intel Westmere die shot; Intel Ivy Bridge die labelling]

  6. Processor Architecture: Microprocessor Block Diagram
     Arithmetic units for floating point (FP) and integer (INT) operations.
     CPU registers (FP and general-purpose).
     Load (LD) and store (ST) units for transferring operands to/from registers.
     Instructions sorted into queues.
     Caches hold data and instructions.
     [Diagram: L1 data and L1 instruction caches, unified L2 cache, memory queue and memory interface to main memory; INT and FP register files; functional units (shift, mask, INT op, FP mult, FP add); LD/ST units. Source: Hager & Wellein]

  7. Processor Architecture: Performance
     For scientific computing, performance is typically measured in floating point operations per second, i.e.,
        performance = (# floating point operations) / runtime
     Unit: FLOPS, FLOP/s, Flop/sec, ...
     What constitutes a FLOP?
       - An operation on IEEE double precision floating point (FP) numbers (64 bits)
       - FP add or FP multiply; division, square roots, etc. take several cycles
     Peak performance is defined as
        [max # floating point operations per cycle] × clock rate [Hz] × # cores × # sockets × # nodes
     Question: what is the peak performance of klio?

  8. Processor Architecture: Example: Intel Xeon 5160 (Woodcrest, June 2006)
     Architecture: Intel 64
     Microarchitecture: Core (successor to NetBurst); first CPU with this microarchitecture, server/workstation version of the Intel Core 2 processor.
     65 nm manufacturing process technology, socket LGA771.
     Dual core, 2 sockets, total of 4 cores.
     Clock frequency: 3 GHz.
     Each core of Woodcrest can perform 4 Flops in each clock cycle.
     Peak performance: 4 Flops × 2 cores × 2 sockets × 3 GHz = 48 GFlops
     But: higher rates are possible using SIMD instructions (MMX, SSE, AVX). More later.
     [Image: Woodcrest die shot; source: Intel]
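The peak-performance formula can be written out in a few lines of Python (a hypothetical helper for checking numbers, not part of the course material); the Woodcrest figures from this slide reproduce the 48 GFlops result:

```python
def peak_flops(flops_per_cycle, clock_hz, cores, sockets, nodes=1):
    """Peak performance = flops/cycle x clock rate x cores x sockets x nodes."""
    return flops_per_cycle * clock_hz * cores * sockets * nodes

# Intel Xeon 5160 (Woodcrest): 4 Flops/cycle, 3 GHz, 2 cores/socket, 2 sockets
woodcrest = peak_flops(flops_per_cycle=4, clock_hz=3e9, cores=2, sockets=2)
print(woodcrest / 1e9, "GFlops")  # 48.0 GFlops
```

The same function answers the klio question once that machine's core counts and clock rate are plugged in.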

  9. Processor Architecture: Some definitions
     Architecture: the instruction set of the CPU, also called instruction set architecture (ISA); the parts of a processor design that one needs to understand to write assembly code. Examples of ISAs:
       - Intel Architectures (IA): IA32/x86, Intel 64/EM64T, IA64
       - MIPS (SGI)
       - POWER (IBM)
       - SPARC (Sun)
       - ARM (Acorn)
     Microarchitecture: implementation of an ISA; invisible features such as caches, cache structure, CPU cycle time, details of the virtual memory system.
     Process technology: the size of the physical features (such as transistors) that make up the processor. Roughly: smaller is better, due to lower power consumption and more chips per silicon wafer in production.

  10. Processor Architecture: Intel architecture/microarchitecture/process roadmap
     Intel tick-tock schedule:
       Tick: die shrink, i.e., new process technology
       Tock: new microarchitecture
     Microarchitecture   Processor codename (server)   Process technology   Date introduced
     Core                Woodcrest/Clovertown          65 nm                06/2006
     Core                Dunnington/Harpertown         45 nm                11/2007
     Nehalem             Nehalem                       45 nm                11/2008
     Nehalem             Westmere                      32 nm                01/2010
     Sandy Bridge        Sandy Bridge                  32 nm                01/2011
     Sandy Bridge        Ivy Bridge                    22 nm                04/2012
     Haswell             Haswell                       22 nm                03/2013 (?)
     Haswell             Broadwell                     14 nm
     Skylake             Skylake                       14 nm
     Skylake             Skymont                       10 nm

  11. Processor Architecture: Modern processor features
     Pipelined instructions: separate complex instructions into simpler ones which are executed by different functional units in an overlapping fashion; increases throughput; an example of instruction-level parallelism (ILP).
     Superscalar architecture: multiple functional units operating concurrently.
     SIMD instructions (Single Instruction, Multiple Data): one instruction operates on a vector of data simultaneously. (Examples: Intel's SSE, AMD's 3dNow!, Power/PowerPC AltiVec)
     Out-of-order execution: when instruction operands are not available in registers, execute the next possible instruction(s) to avoid idle time (eligible instructions are held in a reorder buffer).
     Caches: small, fast, on-chip buffers for holding data which has recently been used (temporal locality) or is close in memory to data which has recently been used (spatial locality).
     Simplified instruction sets: Reduced Instruction Set Computers (RISC, 1980s), in contrast with CISC; simple instructions executing in few clock cycles allowed higher clock frequencies and freed up transistors; x86 processors translate to “µ-ops” on the fly.

  12. Processor Architecture: Pipelining: Example
     “Pipelining: it's natural.” [D. Patterson, UC Berkeley]
     Four students (Anoop, Brian, Christine & Djamal) doing laundry, one load each.
       Washing takes 30 minutes.
       Drying takes 40 minutes.
       Folding takes 20 minutes.
     Each complete laundry load, done sequentially, takes 90 minutes.

  13. Processor Architecture: Pipelining: Example: Sequential laundry
     [Timeline: 6 PM to Midnight; tasks A–D each occupy 30 + 40 + 20 minutes, back to back]
     Sequential laundry takes 6 hours for 4 loads.
     How long would the pipelined laundry take?

  14. Processor Architecture: Pipelining: Example: Pipelined laundry
     Start work as soon as possible.
     [Timeline: 6 PM onward; occupied slots of 30, 40, 40, 40, 40, 20 minutes for tasks A–D]
     Start each task as soon as the functional unit is available. Takes only 3.5 hours.

  15. Processor Architecture: Pipelining: Example: Pipelining lessons
     Pipelining doesn't help the latency of individual tasks; it helps the throughput of the entire workload.
     The pipeline is limited by its slowest stage.
     Multiple stages operate concurrently.
     Potential speedup = number of pipeline stages.
     Unbalanced lengths of pipeline stages reduce the speedup.
     Time to “fill” (start up) the pipeline and time to “drain” (wind down) it reduce the speedup, especially if there are few loads relative to the number of stages.
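The laundry timings can be checked with a small model (a sketch, not from the slides, assuming a finished load can always wait for its next stage): with n loads, the pipelined schedule finishes after sum(stages) + (n - 1) * max(stages) minutes, since once the first load is done the slowest stage, the dryer, gates every subsequent load.

```python
def sequential_time(stages, n_loads):
    """Each load runs through all stages before the next load starts."""
    return n_loads * sum(stages)

def pipelined_time(stages, n_loads):
    """Stages overlap across loads; after the first load drains the
    pipeline, the slowest stage emits one finished load per max(stages)."""
    return sum(stages) + (n_loads - 1) * max(stages)

stages = [30, 40, 20]                    # wash, dry, fold (minutes)
print(sequential_time(stages, 4) / 60)   # 6.0 (hours)
print(pipelined_time(stages, 4) / 60)    # 3.5 (hours)
```

The model also makes the "unbalanced stages" lesson concrete: shrinking the 40-minute drying stage would cut the per-load gating time directly.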

  16. Processor Architecture: Pipelining in computers
     Pipelining is the basis of vector processors and the greatest source of ILP today. Invisible to the programmer.
     Typical pipeline stages on a CPU (MIPS ISA):
       (1) Fetch instruction from memory
       (2) Read registers while decoding the instruction
       (3) Execute the operation or calculate an address
       (4) Access an operand in data memory
       (5) Write the result into a register
     Helpful: machine instructions of equal length (x86: from 1 to 17 bytes).

  17. Processor Architecture: Pipelining in computers: Pipelined floating point multiply (FPM)
     Pipeline stages: separate mantissas/exponents, multiply mantissas, add exponents, normalize result, insert sign.
     [Figure 1.5, source: Hager & Wellein: timeline for a simplified floating-point multiplication pipeline executing A(:)=B(:)*C(:). After a four-cycle wind-up (“filling”) phase, one result is generated on each cycle, followed by a wind-down phase.]
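Under the unit-time-per-stage model of Figure 1.5, N multiplies take N + 4 cycles in this five-stage pipeline (four wind-up cycles, then one result per cycle), versus 5N cycles without pipelining. A quick sketch (hypothetical helper, assuming every stage takes exactly one cycle):

```python
def pipelined_cycles(n_ops, n_stages=5):
    """Cycles to push n_ops through an n_stages pipeline:
    (n_stages - 1) wind-up cycles, then one result per cycle."""
    return (n_stages - 1) + n_ops

def speedup(n_ops, n_stages=5):
    """Speedup over unpipelined execution (n_stages cycles per op)."""
    return (n_stages * n_ops) / pipelined_cycles(n_ops, n_stages)

print(pipelined_cycles(100))   # 104 cycles for 100 multiplies
print(round(speedup(100), 2))  # 4.81, approaching the stage count of 5
```

This quantifies the earlier lesson that fill/drain overhead matters most for short loops: for large N the speedup approaches the number of stages.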
