Vectorization & Cache Organization
ASD Shared Memory HPC Workshop
Computer Systems Group, Research School of Computer Science
Australian National University, Canberra, Australia
February 11, 2020

Schedule - Day 2

Computer Systems (ANU) Vectorization & Cache Organization Feb 11, 2020 2 / 85
Single Instruction Multiple Data (SIMD) Operations
Outline

1. Single Instruction Multiple Data (SIMD) Operations
   - SIMD CPU Extensions
   - Understanding SIMD Operations
   - SIMD Registers
   - Using SIMD Operations
2. Cache Basics
3. Multiprocessor Cache Organization
4. Thread Basics
Single Instruction Multiple Data (SIMD) Operations SIMD CPU Extensions
Flynn’s Taxonomy
- SISD: single instruction, single data
- MISD: multiple instructions, single data (streaming processors)
- SIMD: single instruction, multiple data (array, vector processors)
- MIMD: multiple instructions, multiple data (multi-threaded processors)

Mike Flynn, "Very High-Speed Computing Systems", Proceedings of the IEEE, 1966
Types of Parallelism
Data parallelism: performing the same operation on different pieces of data
- SIMD: e.g. summing two vectors element by element

Task parallelism: executing different threads of control in parallel

Instruction-level parallelism: multiple instructions are executed concurrently
- Superscalar: multiple functional units
- Out-of-order execution and pipelining
- Very long instruction word (VLIW)

SIMD: multiple operations are concurrent, while the instruction is the same
History of SIMD - Vector Processors
- Instructions operate on vectors rather than scalar values
- Has vector registers from which vectors can be loaded and to which they can be stored
- Vectors may be of variable length, i.e. vector registers must support variable vector lengths
- Data elements to be loaded into a vector register may not be contiguous in memory, i.e. support for strides, or distances between two elements of a vector
- The Cray-1 used vector processing:
  - Clocked at 80 MHz; installed at Los Alamos National Laboratory, 1976
  - Introduced CPU registers for SIMD vector operations
  - 250 MFLOPS when SIMD operations were utilized effectively
- Primary disadvantage: works well only if the parallelism is regular
- Superseded by contemporary scalar processors with support for vector operations, i.e. SIMD extensions
SIMD Extensions
Extensive use of SIMD extensions in contemporary hardware:

Complex Instruction Set Computers (CISC)
- Intel MMX: 64-bit wide registers; the first widely used SIMD instruction set on desktop computers (1996)
- Intel Streaming SIMD Extensions (SSE): 128-bit wide XMM registers
- Intel Advanced Vector Extensions (AVX): 256-bit wide YMM registers

Reduced Instruction Set Computers (RISC)
- SPARC64 VIIIfx (HPC-ACE): 128-bit registers
- PowerPC A2 (AltiVec, VSX): 128-bit registers
- ARMv7, ARMv8 (NEON): 64-bit and 128-bit registers

A similar architecture, Single Instruction Multiple Thread (SIMT), is used in GPUs
Single Instruction Multiple Data (SIMD) Operations Understanding SIMD Operations
SIMD Processing - Vector addition
C[i] = A[i] + B[i]
    void VectorAdd(float *a, float *b, float *c, size_t size) {
        size_t i;
        for (i = 0; i < size; i++) {
            c[i] = a[i] + b[i];
        }
    }
- Assume arrays A and B contain short integer elements
- No dependencies between operations, i.e. embarrassingly parallel
- Note: arrays A and B may not be contiguously allocated
- How can this operation be parallelized?
SIMD Processing - Vector addition
- Scalar: 8 loads + 4 scalar adds + 4 stores = 16 ops
- Vector: 2 loads + 1 vector add + 1 store = 4 ops
- Speedup: 16/4 = 4×

Fundamental idea: perform multiple operations on multiple data items concurrently using single instructions

Advantages:
- Performance improvement
- Fewer instructions: reduced code size, maximization of data bandwidth
- Automatic parallelization by the compiler for vectorizable code
Single Instruction Multiple Data (SIMD) Operations SIMD Registers
Intel SSE
Intel Streaming SIMD Extensions (1999): 70 new instructions
SSE2 (2000): 144 new instructions, with support for double-precision data and 32-bit integers
SSE3 (2005): 13 new instructions for multi-thread support and Hyper-Threading
SSE4 (2007): 54 new instructions for text processing, strings, fixed-point arithmetic

Registers:
- 8 (in 32-bit mode) or 16 (in 64-bit mode) 128-bit XMM registers, XMM0-XMM15
- 8-, 16-, 32-, 64-bit integers
- 32-bit single-precision and 64-bit double-precision floating-point
Intel AVX
Intel Advanced Vector Extensions (2008): extended vectors to 256 bits
AVX2 (2013): expands most integer SSE and AVX instructions to 256 bits
Intel FMA3 (2013): fused multiply-add, introduced with Haswell

Registers:
- 8 or 16 256-bit YMM registers, YMM0-YMM15
- SSE instructions operate on the lower half of the YMM registers

- Introduces new three-operand instructions, i.e. one destination and two source operands
- Previously, SSE instructions had the form a = a + b
- With AVX, the source operands are preserved, i.e. c = a + b
ARM NEON
ARM Advanced SIMD (NEON); ARM Advanced SIMDv2 adds:
- Support for fused multiply-add
- Support for the half-precision extension
- Available in ARM Cortex-A15

Separate register file:
- 32 64-bit registers, shared with VFPv3/VFPv4 instructions
- Separate 10-stage execution pipeline

NEON register views:
- D0-D31: 32 64-bit doubleword registers
- Q0-Q15: 16 128-bit quadword registers

Data types:
- 8-, 16-, 32-, 64-bit integers
- ARMv7: 32-bit single-precision floating-point
- ARMv8: 32-bit single-precision and 64-bit double-precision floating-point
SIMD Instruction Types
- Data movement: load and store vectors between main memory and SIMD registers
- Arithmetic operations: addition, subtraction, multiplication, division, absolute difference, maximum, minimum, saturation arithmetic, square root, multiply-accumulate, multiply-subtract, halving-subtract, folding maximum and minimum
- Logical operations: bitwise AND, OR, NOT and their combinations
- Data value comparisons: =, <=, <, >=, >
- Pack, unpack, shuffle: initializing vectors from bit patterns, rearranging bits based on a control mask
- Conversion: between floating-point and integer data types using saturation arithmetic
- Bit shift: often used for integer arithmetic such as division and multiplication
- Other: cache-specific operations, casting, bit insert, cache line flush, data prefetch, execution pause, etc.
Single Instruction Multiple Data (SIMD) Operations Using SIMD Operations
How to use SIMD operations
- Compiler auto-vectorization: requires a compiler with vectorizing capabilities. Least time consuming; performance is variable and entirely dependent on compiler quality.
- Compiler intrinsic functions: an almost one-to-one mapping to assembly instructions, without having to deal with register allocation, instruction scheduling, type checking and call stack maintenance.
- Inline assembly: writing assembly instructions directly into higher-level code.
- Low-level assembly: best approach for high performance; most time consuming, least portable.
Compiler Auto-vectorization
- Requires a vectorizing compiler, e.g. gcc, icc, clang
- Loop unrolling combined with the generation of packed SIMD instructions
- GCC enables vectorization with -O3
  - Enabled with -O2 by the Intel compiler (icc)
  - Instruction set specified by -msse2 (-msse4.1, -mavx) for Intel systems
  - Enabled with -mfpu=neon on ARM systems
- Reports from the vectorization process:
  - -ftree-vectorizer-verbose=<level> (gcc), where level is between 1 and 5
  - -vec-report5 (Intel icc)
Compiler Auto-vectorization - Loops
What kinds of loops are good candidates for auto-vectorization?
- Countable: the loop trip count must be known at entry to the loop at runtime
- Single entry and single exit: no break
- Straight-line code: different iterations must not have different control flow (must not branch); if statements are allowed if they can be implemented as masked assignments
- The innermost loop of a nest: loop interchange may occur in previous optimization phases
- No function calls: some intrinsic math functions are allowed (sin, log, pow, etc.)
- No aliasing: pointers to vector arguments should be declared with the keyword restrict, which guarantees that no aliases exist for them
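A loop meeting all of these criteria might look like the following sketch; the function name and the scaling constant are illustrative only:

```c
#include <stddef.h>

/* A good auto-vectorization candidate: countable trip count, single
 * entry/exit, straight-line body, innermost loop, no function calls,
 * and restrict-qualified pointers so the compiler knows a, b and c
 * do not alias. Compile with e.g. gcc -O3 to let the vectorizer run. */
void scale_add(size_t n, const float *restrict a,
               const float *restrict b, float *restrict c) {
    for (size_t i = 0; i < n; i++) {
        c[i] = 2.0f * a[i] + b[i];
    }
}
```

With the restrict qualifiers removed, the compiler must assume c may overlap a or b and will typically emit a runtime overlap check or skip vectorization.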
Compiler Auto-vectorization – Obstacles
- Non-contiguous memory accesses: four consecutive floats (or ints) can be loaded directly; if there is a stride, they have to be loaded separately using multiple instructions
- Non-aligned data structures: may result in multiple load instructions; arrays can be aligned (here on 16-byte boundaries) dynamically or statically as follows:

    float *a = (float *) memalign(16, N * sizeof(float));
    float b[N] __attribute__((aligned(16)));

- Data dependencies: RAW (read-after-write), WAR (write-after-read), WAW (write-after-write)
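As a sketch of why a true (RAW) dependence blocks vectorization, consider a running-sum loop (the function name is illustrative):

```c
#include <stddef.h>

/* Each iteration reads a[i-1], which the previous iteration wrote:
 * a read-after-write (RAW) dependence carried across iterations.
 * The lanes of a vector add would need values that have not been
 * computed yet, so the compiler cannot vectorize this loop naively. */
void running_sum(size_t n, float *a) {
    for (size_t i = 1; i < n; i++) {
        a[i] = a[i] + a[i - 1];
    }
}
```

By contrast, WAR and WAW dependences can often be removed by renaming or reordering, which is why RAW is the fundamental obstacle.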
Compiler Auto-vectorization - Example
SAXPY:
- Y = αX + Y in single precision
- Here α is a scalar constant and X, Y are single-precision vectors
- Used in the BLAS (Basic Linear Algebra Subprograms) library

    void saxpy(int n, float a, float* __restrict__ X, float* __restrict__ Y) {
        int i;
        for (i = 0; i < n; i++)
            Y[i] = a*X[i] + Y[i];
    }

Compile using vectorization options:

    $ gcc -O3 -ftree-vectorizer-verbose=1 saxpy.cc -o saxpy
    Analyzing loop at saxpy.cc:31
    Vectorizing loop at saxpy.cc:31
    saxpy.cc:31: note: === vect_do_peeling_for_alignment ===
    saxpy.cc:31: note: === vect_update_inits_of_dr ===
    saxpy.cc:31: note: === vect_do_peeling_for_loop_bound ===
    Setting upper bound of nb iterations for epilogue loop to 2
    saxpy.cc:31: note: LOOP VECTORIZED.
    saxpy.cc:28: note: vectorized 1 loops in function.
    saxpy.cc:31: note: Completely unroll loop 2 times
    saxpy.cc:28: note: Completely unroll loop 3 times
SIMD Data Types - Intel
    Data type   Content                                               SSE  SSE2  SSE3  AVX
    __m64       8×char, 4×short, 2×int32, 2×float, 1×int64, 1×double   ✓    ✗     ✓    ✓
    __m128      4×float                                                ✓    ✓     ✓    ✓
    __m128d     2×double                                               ✗    ✓     ✓    ✓
    __m128i     16×char, 8×short, 4×int32, 2×int64                     ✗    ✓     ✓    ✓
    __m256      8×float                                                ✗    ✗     ✗    ✓
    __m256d     4×double                                               ✗    ✗     ✗    ✓
    __m256i     32×char, 16×short, 8×int32, 4×int64                    ✗    ✗     ✗    ✓
SIMD Data Types - ARM
NEON vector data types follow the pattern: <type><size>x<number of lanes>_t

    64-bit type (D-register)   128-bit type (Q-register)   Content
    int8x8_t                   int8x16_t                   8-bit char
    int16x4_t                  int16x8_t                   16-bit int
    int32x2_t                  int32x4_t                   32-bit int
    int64x1_t                  int64x2_t                   64-bit int
    uint8x8_t                  uint8x16_t                  8-bit unsigned int
    uint16x4_t                 uint16x8_t                  16-bit unsigned int
    uint32x2_t                 uint32x4_t                  32-bit unsigned int
    uint64x1_t                 uint64x2_t                  64-bit unsigned int
    float16x4_t                float16x8_t                 16-bit floating-point
    float32x2_t                float32x4_t                 32-bit floating-point
    poly8x8_t                  poly8x16_t                  8-bit polynomial
    poly16x4_t                 poly16x8_t                  16-bit polynomial

- It is also possible to have types representing arrays of 1 to 4 vectors
- E.g. int8x8x2_t represents an array of two int8x8_t vectors
- Individual vectors in these arrays are accessed as <var name>.val[0], etc.
Compiler Intrinsic Functions - Intel
SSE and AVX intrinsic function names use the following notational convention:

    _mm_<opname>_<suffix>

- <opname>: indicates the basic operation of the intrinsic function; for example, add for addition and sub for subtraction
- <suffix>: denotes the type of data the instruction operates on
  - The first one or two letters of each suffix denote whether the data is:
    - p: packed
    - ep: extended packed
    - s: scalar
  - The remaining letters and numbers denote the type, with notation as follows:
    - s: single-precision floating-point
    - d: double-precision floating-point
    - i32: signed 32-bit integer
    - u32: unsigned 32-bit integer
    - ...
Compiler Intrinsic Functions - ARM
NEON intrinsic function names use the following notational convention:

    v<opname><flags>_<type>

- <opname>: indicates the basic operation of the intrinsic function; for example, add for addition and sub for subtraction
- <flags>:
  - Number (between 1 and 4): denotes the array size of the result vector, i.e. size 1 (int16x4_t), size 2 (int16x4x2_t), size 3 (int16x4x3_t) or size 4 (int16x4x4_t)
  - q: denotes that Q registers must be used by both operands and the result
  - l: long shape; the number of bits in each result element is double the number of bits in each operand element, i.e. the operands are usually doubleword vectors and the result is a quadword vector
  - w: wide shape; the result and first operand are twice the width of the second operand, i.e. a doubleword and a quadword operand make a quadword result
- <type>: denotes the type of data the instruction operates on
Compiler Intrinsic Functions - Examples
Load four SP FP values, address aligned:
- Intel SSE2: __m128 _mm_load_ps(float* p);
- ARM NEON: float32x4_t vld1q_f32(const float32_t* p);
- Result lanes: R0 = p[0], R1 = p[1], R2 = p[2], R3 = p[3]

Store four SP FP values; the address must be 16-byte aligned:
- Intel SSE2: void _mm_store_ps(float* p, __m128 a);
- ARM NEON: void vst1q_f32(float32_t* p, float32x4_t a);
- Memory: p[0] = a0, p[1] = a1, p[2] = a2, p[3] = a3
Compiler Intrinsic Functions - Examples
Add four SP FP values:
- Intel SSE2: __m128 _mm_add_ps(__m128 a, __m128 b);
- ARM NEON: float32x4_t vaddq_f32(float32x4_t a, float32x4_t b);
- Result lanes: R0 = a0+b0, R1 = a1+b1, R2 = a2+b2, R3 = a3+b3

Fused multiply-add four SP FP values:
- Intel FMA3: __m128 _mm_fmadd_ps(__m128 a, __m128 b, __m128 c);
- ARM NEON: float32x4_t vfmaq_f32(float32x4_t a, float32x4_t b, float32x4_t c);
- Result lanes: R0 = a0*b0 + c0, R1 = a1*b1 + c1, R2 = a2*b2 + c2, R3 = a3*b3 + c3
Compiler Intrinsic Functions - Portability
- Code vectorized with intrinsic functions for a specific processor is not portable: it only runs if the architecture supports the specific SIMD extension
- Portability is ensured by using conditional compilation in the code:
  - The program must contain both a scalar and a vectorized version of the same computation
  - Using #ifdef statements, the applicable version is chosen at compile time
  - E.g. in GCC, if the -msse2 option is passed, then the macro __SSE2__ is defined

Usage of conditional compilation:

    #ifdef __SSE2__
    /* Code optimized with SSE2 intrinsic functions */
    #elif defined(__ARM_NEON__)
    /* Code optimized with ARM NEON intrinsic functions */
    #else
    /* Scalar code */
    #endif
Reference Manuals
Intel:
- Intel Intrinsics Guide
- Intel 64 and IA-32 Architectures Software Developer's Manual

ARM:
- ARM NEON Programmer's Guide (version 1.0)
- ARM NEON Intrinsics in GCC
Optimizing Vector Addition
C[i] = A[i] + B[i]
    void VectorAdd(float *a, float *b, float *c, size_t size) {
        size_t i;
        for (i = 0; i < size; i++) {
            c[i] = a[i] + b[i];
        }
    }
Optimizing Vector Addition - Intel SSE2
    void VectorAddSSE(float* __restrict__ a, float* __restrict__ b,
                      float* __restrict__ c, size_t size) {
        size_t i;
        for (i = 0; i < (size/4) * 4; i += 4) {
            /* Load into SSE XMM registers */
            __m128 sse_a = _mm_load_ps(&a[i]);
            __m128 sse_b = _mm_load_ps(&b[i]);

            /* Perform addition */
            __m128 sse_c = _mm_add_ps(sse_a, sse_b);

            /* Store back to memory */
            _mm_store_ps(&c[i], sse_c);
        }

        /* Scalar epilogue for the remaining elements */
        for (i = (size/4) * 4; i < size; i++) {
            c[i] = a[i] + b[i];
        }
    }
Optimizing Vector Addition - ARM NEON
    void VectorAddNEON(float32_t* __restrict__ a, float32_t* __restrict__ b,
                       float32_t* __restrict__ c, size_t size) {
        size_t i;
        /* Declare vector data types */
        float32x4_t a4, b4, c4;

        for (i = 0; i < (size/4) * 4; i += 4) {
            /* Load into quad NEON registers */
            a4 = vld1q_f32(a+i);
            b4 = vld1q_f32(b+i);

            /* Perform addition */
            c4 = vaddq_f32(a4, b4);

            /* Store back to memory */
            vst1q_f32(c+i, c4);
        }

        /* Scalar epilogue for the remaining elements */
        for (i = (size/4) * 4; i < size; i++) {
            c[i] = a[i] + b[i];
        }
    }
Hands-on Exercise: Vectorizing Loops
Objective: Vectorizing loops using SSE & NEON instructions
Cache Basics
Outline

1. Single Instruction Multiple Data (SIMD) Operations
2. Cache Basics
3. Multiprocessor Cache Organization
4. Thread Basics
Importance
- An infinitely fast CPU must still load and store its data to memory
- HPC applications have execution time t(n) = O(n^m) for problem size n, implying at least n and at most t(n) data accesses, plus t(n) instruction accesses
  - Discrete Fourier Transform: O(n) data items, O(n log n) or O(n^2) operations
  - Reduction to upper/lower triangular form: O(n^2) data items, O(n^3) operations
  - Forward/backward substitution: O(n^2) data items, O(n^2) operations
- Hence we need large memories with fast access times; achieved by (cost-per-bit vs speed trade-off):
  - cache memory: low cost
  - wide (parallel) memory access: moderate cost
  - faster technology (?): high cost
Access Times
- Memory access time: the amount of time it takes to read or write a memory location
- Memory cycle time: how quickly a memory access can be repeated
  - For example, a memory chip may have an access time of 50 ns but a cycle time of 200 ns
- Over the last ≈20 years, CPU speed has improved much faster than memory access times:
  - In the mid 1980s, commodity DRAMs had an access time of 200 ns, while the IBM PC had a CPU clock of 4.77 MHz (a 210 ns period)
  - Today commodity DRAMs have access times of around 50 ns, but CPU clock periods have decreased to 1 ns or less
  - In part this is because the CPU is now within one chip, while memory is on external chips
- Memory sizes are also now much larger
Memory Technologies
Two main memory technologies:

                        cost/bit   access time     used in
    SRAM (Static RAM)   high       low (≈10 ns)    caches
    DRAM (Dynamic RAM)  low        high (≈50 ns)   most main memories

- SRAM: each bit uses at least 3 transistors (6 for best results) and a constant power supply
- DRAM: each bit uses 1 transistor and a capacitor; over time the charge leaks from the capacitor and it must be refreshed
- A memory access involves the stages:
  1. select row address
  2. select column address (R/W)
  3. access the selected bit(s)
  as well as transferring addresses and data over the relevant bus; hence there is scope for pipelining
Memory Hierarchy
Modern microprocessors have a memory hierarchy:

                        Access speed
    Registers           clock cycle
    Cache               few cycles
    Memory (DRAM)       many cycles
    Virtual memory      long!

- Idea: data that is "currently most needed" is brought into a smaller, faster memory
- Observation: memory accesses in most programs exhibit:
  - Temporal locality: if address X is accessed, X is likely to be accessed again soon
  - Spatial locality: if address X is accessed, X+1 is likely to be accessed soon
- Hence caches are organized into lines (units) of L words, L = 2^l (e.g. L = 2, 4, 8)
  ✓ blocked memory accesses (faster) and less control information needed per word
  ✗ redundant memory traffic if only one word per line is ever used, e.g. pointer chasing (non-unit stride guaranteed!)
- If locality holds, this yields a good cost-speed trade-off
Memory Hierarchy
Intel Core i7 Xeon 5500 at 2.8 GHz:

                    Cycles   Time (≈ ns)
    Registers          1        0.3
    L1 cache hit       4        1.5
    L2 cache hit      10        3.5
    L3 cache hit      40       15
    Local DRAM       160       60
    Remote DRAM      280      100

Memory access time includes the cost of getting data over the bus.
Registers
- Registers are a very limited resource, accessed in one cycle
- The goal is to keep operands in registers as much as possible
- Example: X = G * 2.41 + A/W − W/B
  - RISC arithmetic instructions are limited to two source operands, so the results of G * 2.41, A/W and W/B must be stored to registers before being added together
  - Also, W is used twice, and we don't want it loaded from (slow) memory twice
- A fundamental job of the compiler is to optimize register use, e.g.:

    ld    [%W],%f1       ! load W from memory into register f1
    ld    [%A],%f2       ! load A from memory into register f2
    fdiv  %f2,%f1,%f2    ! form A/W and overwrite f2 (A) with the result
    ld    [%B],%f3       ! load B from memory into register f3
    fdiv  %f1,%f3,%f3    ! form W/B and overwrite f3 (B) with the result
Cache Memory
- A small amount of SRAM memory
- Cache hit rate: the percentage of (word) accesses in a program for which the data is in cache
  - Needs to be high (e.g. > 95%) for good performance
  - Problem: programs may need to be re-written to achieve this!
  - Only possible with sufficient inherent data re-use, g(n)/n, where g(n) is the operation count, e.g.:
    - (n^{1/2} × n^{1/2}) matrix multiply: g(n) = 2n^{3/2} (mixed unit and non-unit stride)
    - FFT (Fast Fourier Transform): g(n) = 8n lg n (power-of-2 stride)
  - Thus, the FFT is more likely to be dominated by memory access time
- Consistency of the data cache and main memory: when a store instruction is executed, the relevant line is updated first in the cache
  - write-through: the resulting word is written to main memory at the same time; slow, but consistency is maintained
  - copy-back: a modified ('dirty') cache line is written to memory when it is thrown out of the cache
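The re-use ratios implied above can be worked out explicitly from the g(n) values quoted on this slide:

```latex
% Data re-use = operations per data item = g(n)/n
\text{Matrix multiply:}\quad \frac{g(n)}{n} = \frac{2n^{3/2}}{n} = 2\sqrt{n}
\qquad\qquad
\text{FFT:}\quad \frac{g(n)}{n} = \frac{8n\lg n}{n} = 8\lg n
```

Since 8 lg n grows far more slowly than 2√n, the FFT performs far fewer operations per data item, which is why it is more likely to be memory-bound.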
Cache Friendly
Friendly:

    for (i = 0; i < 100000; i += 1) {
        sum += a[i];
    }

    double a[200][200];
    for (i = 0; i < 200; i++) {
        for (j = 0; j < 200; j++) {
            sum += a[i][j];
        }
    }

Unfriendly:

    for (i = 0; i < 800000; i += 8) {
        sum += a[i];
    }

    double a[200][200];
    for (i = 0; i < 200; i++) {
        for (j = 0; j < 200; j++) {
            sum += a[j][i];
        }
    }

    while (ptr->next != NULL)
        ptr = ptr->next;
Direct-Mapped Caches
- Cache memory is organized into lines of size 2^l; for a cache of size 2^c there are C' = 2^{c−l} lines
- An address X is split into three fields: tag a0 (bits 31..c), line index a1 (bits c−1..l) and offset a2 (bits l−1..0)
- All addresses with the same a1 are mapped to the same cache line
  ✓ easy to implement, low chip area per word (note: the cache must store the value of a0), so a large C' is possible (better performance)
  ✗ cache conflicts: 2 (or more) words from memory map to the same cache line
    - can be instruction-instruction, instruction-data or data-data conflicts
    - can cause large, often unpredictable, performance losses
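The address decomposition above can be sketched in C; the cache parameters here (64-byte lines, a 4 KB cache) are illustrative assumptions:

```c
#include <stdint.h>

#define LINE_BITS  6    /* l: 2^6  = 64-byte lines (illustrative)  */
#define CACHE_BITS 12   /* c: 2^12 = 4 KB cache    (illustrative)  */
#define NUM_LINES  (1u << (CACHE_BITS - LINE_BITS))  /* C' = 2^(c-l) = 64 lines */

/* Offset within a line: the low l bits (field a2). */
static uint32_t line_offset(uint32_t addr) {
    return addr & ((1u << LINE_BITS) - 1);
}

/* Line index: bits c-1..l (field a1); selects the cache line. */
static uint32_t line_index(uint32_t addr) {
    return (addr >> LINE_BITS) & (NUM_LINES - 1);
}

/* Tag: bits 31..c (field a0); stored alongside the line for comparison. */
static uint32_t tag(uint32_t addr) {
    return addr >> CACHE_BITS;
}
```

Two addresses exactly one cache size apart (e.g. 0x1040 and 0x2040) have the same line index but different tags: a conflict in a direct-mapped cache.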
K-way Set Associative Caches
- Like a direct-mapped cache of size C', but every line is extended to a set of K lines (the total number of cache lines is now C'K)
- Addresses with the same a1 can map into the corresponding line of any of the K sets
- Typically K = 1, 2, 4, 5, 6, 8
- Reduces the chance of conflicts by a factor of K, at some extra cost

Examples of cache thrashing:

4 KB direct-mapped cache (each array is 4 KB, so a[i] and b[i] map to the same cache line):

    float a[1024], b[1024];
    for (i = 0; i < 1024; i++) {
        a[i] = a[i]+b[i];
    }

4 KB 2-way set associative cache (three 4 KB arrays contend for two ways):

    float a[1024], b[1024], c[1024];
    for (i = 0; i < 1024; i++) {
        a[i] = a[i]+b[i]+c[i];
    }
Cache: Other Issues
- Modern processors usually have separate data and instruction first-level caches
- Multiple (2 or 3) levels of cache, i.e. a deep memory hierarchy
  - Typically: top-level (data) cache: 16 KB, K = 4, write-through; 2nd-level (instruction/data) cache: 1 MB, K = 1, copy-back
  ✗ Harder still to tune a program for 2 levels!
- (Top-level) cache prefetching (requires load/store pipelines)
  - An access time (latency) of δ cycles can be hidden by performing each load δ cycles in advance of when it is needed
  - Can be done via:
    - prefetching or software pipelining (by the programmer or compiler, e.g. UltraSPARC)
    - hardware instruction re-ordering (e.g. Pentium 4): more effective, but the hardware is more expensive and complex
- Increase the data bus width to the cache line size L
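Software prefetching can be sketched with GCC/Clang's __builtin_prefetch; the prefetch distance of 16 elements is an illustrative guess that would need tuning per machine:

```c
#include <stddef.h>

#define PREFETCH_AHEAD 16  /* illustrative distance; tune to hide the latency delta */

/* Sum an array while hinting the hardware to fetch a later element's
 * cache line early, so load latency overlaps with useful work. The
 * result is identical without the prefetch; the builtin is only a hint
 * (arguments: address, 0 = read, 1 = low temporal locality). */
double sum_prefetched(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 1);
        sum += a[i];
    }
    return sum;
}
```

On hardware with an aggressive stride prefetcher, this explicit hint often makes no difference for a simple sequential scan; it pays off mainly for irregular but predictable access patterns.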
Hands-on Exercise: IPC and Cache Misses
Objective: use the PAPI advanced interface to measure instructions per cycle and cache behaviour for a variety of loops
Multiprocessor Cache Organization
Outline

1. Single Instruction Multiple Data (SIMD) Operations
2. Cache Basics
3. Multiprocessor Cache Organization
4. Thread Basics
Shared Memory Hardware
(Fig 2.5, Grama et al., Introduction to Parallel Computing)
Shared Address Space Systems
- Systems with caches but otherwise flat memory are generally called UMA (uniform memory access)
- If access to local memory is cheaper than access to remote memory (NUMA), this should be built into your algorithm
- Global address space systems are easier to program
  - Read-only interactions are invisible to the programmer and coded as in a sequential program
  - Read/write interactions are harder, requiring mutual exclusion for concurrent accesses
- Programmed using threads; synchronization using locks and related mechanisms
Cache Hierarchy on Intel Core i7 (2013)
(64-byte cache line size)

Ref: http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1
Caches on Multiprocessors
- Multiple copies of some data word may be manipulated by two or more processors at the same time
- Usually there are separate I-caches (instruction caches) on the first 1-2 levels (usually of similar sizes to the data caches)
  - How is instruction access different from data access? Why is this useful?
- Two requirements:
  - An address translation mechanism that locates each physical memory word in the system
  - Concurrent operations on multiple copies must have well-defined semantics
- The latter is generally known as a cache coherency protocol
- Input/output using direct memory access (DMA) on machines with caches also leads to coherency issues
- Some machines only provide shared address space mechanisms and leave coherence to the programmer, e.g. the Texas Instruments Keystone II system
Cache Coherency
- Intuitive behaviour: reading the value at address X should return the last value written to address X by any processor
  - What does "last" mean? What if writes are simultaneous, or closer in time than the time required to communicate between two processors?
- In a sequential program, "last" is determined by program order (not time)
  - This holds within each thread of a parallel program, but what does it mean across multiple threads?
Cache/Memory Coherency
A memory system is coherent if:

- Ordered as issued: a read by processor P to address X that follows a write by P to address X returns the value of that write (assuming no other processor writes to X in between)
- Write propagation: a read by processor P1 to address X that follows a write by processor P2 to X returns the written value if the read and write are sufficiently separated in time (assuming no other write to X occurs in between)
- Write serialization: writes to the same address are serialized: two writes by any two processors are observed in the same order by all processors

(Later to be contrasted with memory consistency!)
Two Cache Coherency Protocols
(Fig 2.21, Grama et al., Introduction to Parallel Computing)
Cache Line View
Ref: http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1
Cache lines need to be augmented with state information regarding their validity
Update vs. Invalidate
Update protocol: when a data item is written, all of its copies in the system are updated

Invalidate protocol (most common): before a data item is written, all other copies are marked as invalid

Comparison:
- Multiple writes to the same word with no intervening reads require multiple write broadcasts in an update protocol, but only one initial invalidation
- With multiword cache blocks, each word written in a cache block must be broadcast in an update protocol, but only one invalidate per line is required
- The delay between writing a word on one processor and reading the written value on another is usually less for update
False Sharing
Two processors modify different parts
- f the same cache line
Invalidate protocol leads to ping-ponged cache lines Update protocol performs reads locally but updates much traffic between processors This effect is entirely an artefact of hardware Need to design parallel systems / programs with this issue in mind
Cache line size, longer more likely Alignment of data structures with respect to cache line size
http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1
Multiprocessor Cache Organization
Implementing Cache Coherency
On small scale bus based machines
A processor must obtain access to the bus to broadcast a write invalidation
With two competing processors, the first to gain access to the bus will invalidate the other's data
A cache miss needs to locate the top copy of the data
Easy for a write-through cache
For write-back caches, each processor snoops the bus and responds by providing the data if it has the top copy
For writes, we would like to know if any other copies of the block are cached
i.e. whether a write-back cache needs to put details on the bus
Handled by having a tag to indicate shared status
Minimizing processor stalls
Either by duplication of tags or having multiple inclusive caches
Multiprocessor Cache Organization
3 State (MSI) Cache Coherency Protocol
read: local read
write: local write
c read (coherency read): a read on a remote processor gives rise to the shown transition in the local cache
c write (coherency write): a write miss, or a write in the Shared state, on a remote processor gives rise to the shown transition in the local cache
(Fig 2.22 Grama et al, Intro to Parallel Computing)
Multiprocessor Cache Organization
MSI Coherency Protocol
(Fig 2.23 Grama et al, Intro to Parallel Computing)
Multiprocessor Cache Organization
Snoopy Cache Systems
All caches broadcast all transactions (read or write misses, writes in S state)
Suited to bus or ring interconnects
However, scalability is limited (i.e. ≤ 8 processors)
All processors monitor the bus for transactions of interest
Each processor's cache has a set of tag bits that determine the state of the cache block
Tags are updated according to state diagram for relevant protocol
e.g. when snoop hardware detects that a read has been issued for a cache block of which it holds a dirty copy, it asserts control of the bus, puts the data out (to the requesting cache and to main memory), and sets the tag to the S state
What sort of data access characteristics are likely to perform well/badly on snoopy based systems?
Multiprocessor Cache Organization
Snoopy Cache Based System
(Fig 2.24 Grama et al, Intro to Parallel Computing)
Multiprocessor Cache Organization
Snoopy Cache-Based System: Ring
The Core i7 (Sandy Bridge) on-chip interconnect revisited:
a ring-based interconnect between Cores, Graphics, Last Level Cache (LLC) and System Agent domains
has 4 physical rings: Data (32B), Request, Acknowledge and Snoop
rings are fully pipelined; bandwidth, latency and power scale with cores
shortest path chosen to minimize latency
has distributed arbitration & sophisticated protocols to handle coherency and ordering
(courtesy www.lostcircuits.com)
Multiprocessor Cache Organization
Directory Cache Based Systems
The need to broadcast is clearly not scalable
A solution is to only send information to processing elements specifically interested in that data
This requires a directory to store information
Augment global memory with presence bitmap to indicate which caches each memory block is located in
Multiprocessor Cache Organization
Directory Based Cache Coherency
To implement, we must track the state of each cache block
A simple protocol might be:
Shared: one or more processors have the block cached, and the value in memory is up to date
Uncached: no processor has a copy
Exclusive: only one processor (the owner) has a copy, and the value in memory is out of date
Must handle a read/write miss and a write to a shared, clean cache block
These first reference the directory entry to determine the current state of this block
Then update the entry's status and presence bitmap
Send the appropriate state update transactions to the processors in the presence bitmap
Multiprocessor Cache Organization
Directory-Based Cache Coherency
(Fig 2.25 Grama et al, Intro to Parallel Computing)
Multiprocessor Cache Organization
Directory-Based Systems
How much memory is required to store the directory?
What sort of data access characteristics are likely to perform well/badly on directory based systems?
How do distributed and centralized systems compare?
Should the presence bitmaps be replicated in the caches? Must they be?
How would you implement sending an invalidation message to all (and only to all) processors in the presence bitmap?
Multiprocessor Cache Organization
Costs on SGI Origin 3000 (clock cycles)
                                             <= 16 CPU    > 16 CPU
Cache hit                                         1            1
Cache miss to local memory                       85           85
Cache miss to remote home directory             125          150
Cache miss to remotely cached data (3 hop)      140          170
Figure from http://people.nas.nasa.gov/∼schang/origin opt.html
Data from: Computer Architecture: A Quantitative Approach, by David A. Patterson, John L. Hennessy, David Goldberg, 3rd Ed, Morgan Kaufmann, 2003
Multiprocessor Cache Organization
Real Cache Coherency Protocols
From Wikipedia
Most modern systems use variants of the MSI protocol to reduce the amount of traffic in the coherency interconnect.
The MESI protocol adds an "Exclusive" state to reduce the traffic caused by writes of blocks that only exist in one cache.
The MOSI protocol adds an "Owned" state to reduce the traffic caused by write-backs of blocks that are read by other caches [the processor owning the cache line services requests for that data].
The MOESI protocol does both of these things.
The MESIF protocol uses the "Forward" state to reduce the traffic caused by multiple responses to read requests when the coherency architecture allows caches to respond to snoop requests with data.
Multiprocessor Cache Organization
MESI (on a bus)
Ref: https://www.cs.tcd.ie/Jeremy.Jones/vivio/caches/MESIHelp.htm
Multiprocessor Cache Organization
Multi-Level Cache
What is visibility of changes between levels of cache?
http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1
Easiest model is inclusive:
If line is in owned state in L1, it is also in owned state in L2
Multiprocessor Cache Organization
Cache Summary
Cache coherency arises because the abstraction of a single shared address space is not actually implemented by a single storage unit in a machine
Three components to cache coherency
Issue order, write propagation, write serialization
Two implementations
Broadcast and Directory
False sharing is a potential performance issue
More likely the longer the cache line
Multiprocessor Cache Organization
Hands-on Exercise: Matrix Multiplication Performance
Objective: To understand the effect of various matrix multiply loop orderings on IPC and cache misses
Thread Basics
Outline
1
Single Instruction Multiple Data (SIMD) Operations
2
Cache Basics
3
Multiprocessor Cache Organization
4
Thread Basics
Thread Basics
Fork/Join Programming Model
Thread Basics
(Heavyweight) UNIX Processes
An O/S like UNIX is based on the notion of a process
The CPU is shared between different processes
UNIX processes are created via fork()
The child is an exact copy of the parent, except that it has a unique process ID
Processes are "joined" using the system call wait()
This introduces a synchronization point
Thread Basics
UNIX Fork Example
pid = fork ();
if (pid == 0) {
    /* code to be executed by child */
} else {
    /* code to be executed by parent */
}
if (pid == 0)
    exit (0);
else
    wait (0);
Thread Basics
Processes and Threads
Thread Basics
Why Threads
Software Portability
Applications can be developed on serial machine and run on parallel machines without changes (is this really true?)
Latency Hiding
Ability to mask access to memory, I/O or communication by having another thread execute in the meantime (but how quickly can execution switch between threads?)
Scheduling and Load Balancing
For unstructured and dynamic applications (e.g. game playing), load balancing can be very hard. One option is to create more threads than CPU resources and let the O/S sort out the scheduling
Ease of Programming, Widespread Use
Due to the above, threaded programs are easier to write (or develop incrementally), so there has been widespread acceptance of the POSIX thread API (generally referred to as pthreads)
Thread Basics
Pthread Creation
Thread Basics
Threads and Threaded Code
pthread_self() provides the ID of the calling thread
pthread_equal(thread1, thread2) tests whether two thread IDs are equal
Detached Threads
Threads that will never synchronize via a join
Specified via an attribute
Re-entrant or thread-safe routines are those that can be safely called while another instance has been suspended in the middle of its invocation
Thread Basics
Example: Computing Pi
Ratio of area of circle to area of the square is π/4
Guess points within the domain of the square at random
Identify those that are at distance less than 1 from the origin
Ratio of points in circle to total points is π/4
Thread Basics
Example: Computing Pi
#include <pthread.h>
#include <stdlib.h>
#define MAX_THREADS 512
void *compute_pi (void *);
....
main () {
    ...
    pthread_t p_threads[MAX_THREADS];
    pthread_attr_t attr;
    pthread_attr_init (&attr);
    for (i = 0; i < num_threads; i++) {
        hits[i] = i;
        pthread_create (&p_threads[i], &attr,
                        compute_pi, (void *) &hits[i]);
    }
    for (i = 0; i < num_threads; i++) {
        pthread_join (p_threads[i], NULL);
        total_hits += hits[i];
    }
    ...
}
Thread Basics
Example: Computing Pi (cont)
void *compute_pi (void *s) {
    int seed, i, *hit_pointer;
    int local_hits;
    hit_pointer = (int *) s;
    seed = *hit_pointer;
    local_hits = 0;
    double rx, ry;
    for (i = 0; i < sample_points_per_thread; i++) {
        rx = ((double) rand_r (&seed)) / RAND_MAX - 0.5;
        ry = ((double) rand_r (&seed)) / RAND_MAX - 0.5;
        if (rx*rx + ry*ry < 0.25)
            local_hits++;
    }
    *hit_pointer = local_hits;
    pthread_exit (0);
}
Thread Basics
Programming and Performance Notes
Note the use of the function rand_r (instead of superior random number generators such as drand48)
Executing this on a 4-processor SGI Origin, we observe a 3.91-fold speedup at 32 threads
This corresponds to a parallel efficiency of 0.98!
We can also modify the program slightly to observe the effect of false sharing
The program can also be used to assess the secondary cache line size
Thread Basics
Performance
(Fig 7.2 Grama et al, Intro to Parallel Computing)
4-processor SGI Origin system using up to 32 threads
Instead of incrementing local_hits, we add to a shared array using strides of 1, 16 and 32
Thread Basics
Hands-on Exercise: False Cache Line Aliasing
Objective: To observe the effect of cache line contention and false cache line sharing on performance
Thread Basics
Summary
Topics covered today - Vectorization & Cache Organization:
SIMD operations
Cache basics
Multiprocessor cache organization
Tomorrow - Multi Processor Parallelism!