Vectorization & Cache Organization
ASD Shared Memory HPC Workshop
Computer Systems Group, Research School of Computer Science
Australian National University, Canberra, Australia
February 11, 2020

Schedule - Day 2

Computer Systems (ANU) Vectorization & Cache Organization Feb 11, 2020 2 / 85
Single Instruction Multiple Data (SIMD) Operations
Outline

1. Single Instruction Multiple Data (SIMD) Operations
   - SIMD CPU Extensions
   - Understanding SIMD Operations
   - SIMD Registers
   - Using SIMD Operations
2. Cache Basics
3. Multiprocessor Cache Organization
4. Thread Basics
Single Instruction Multiple Data (SIMD) Operations SIMD CPU Extensions
Flynn’s Taxonomy
- SISD: single instruction, single data
- MISD: multiple instructions, single data (streaming processors)
- SIMD: single instruction, multiple data (array, vector processors)
- MIMD: multiple instructions, multiple data (multi-threaded processors)

Mike Flynn, "Very High-Speed Computing Systems", Proceedings of the IEEE, 1966
Types of Parallelism
Data parallelism: performing the same operation on different pieces of data
- SIMD: e.g. summing two vectors element by element

Task parallelism: executing different threads of control in parallel

Instruction-level parallelism: multiple instructions are executed concurrently
- Superscalar: multiple functional units
- Out-of-order execution and pipelining
- Very long instruction word (VLIW)

SIMD: multiple operations are concurrent, while the instruction is the same
History of SIMD - Vector Processors
- Instructions operate on vectors rather than scalar values
- Has vector registers from which vectors can be loaded and to which they can be stored
- Vectors may be of variable length, i.e. vector registers must support variable vector lengths
- Data elements to be loaded into a vector register may not be contiguous in memory, i.e. support for strides, or distances between two elements of a vector
- The Cray-1 used vector processing:
  - Clocked at 80 MHz; installed at Los Alamos National Laboratory, 1976
  - Introduced CPU registers for SIMD vector operations
  - 250 MFLOPS when SIMD operations were utilized effectively
- Primary disadvantage: works well only if the parallelism is regular
- Superseded by contemporary scalar processors with support for vector operations, i.e. SIMD extensions
SIMD Extensions
Extensive use of SIMD extensions in contemporary hardware:

Complex Instruction Set Computers (CISC)
- Intel MMX: 64-bit wide registers; the first widely used SIMD instruction set on desktop computers (1996)
- Intel Streaming SIMD Extensions (SSE): 128-bit wide XMM registers
- Intel Advanced Vector Extensions (AVX): 256-bit wide YMM registers

Reduced Instruction Set Computers (RISC)
- SPARC64 VIIIfx (HPC-ACE): 128-bit registers
- PowerPC A2 (AltiVec, VSX): 128-bit registers
- ARMv7, ARMv8 (NEON): 64-bit and 128-bit registers

A similar architecture, Single Instruction Multiple Thread (SIMT), is used in GPUs
Single Instruction Multiple Data (SIMD) Operations Understanding SIMD Operations
SIMD Processing - Vector addition
C[i] = A[i] + B[i]
    void VectorAdd(float *a, float *b, float *c, size_t size) {
        size_t i;
        for (i = 0; i < size; i++) {
            c[i] = a[i] + b[i];
        }
    }
- Assume arrays A and B contain short integer elements
- No dependencies between operations, i.e. embarrassingly parallel
- Note: arrays A and B may not be contiguously allocated
- How can this operation be parallelized?
SIMD Processing - Vector addition
- Scalar: 8 loads + 4 scalar adds + 4 stores = 16 ops
- Vector: 2 loads + 1 vector add + 1 store = 4 ops
- Speedup: 16/4 = 4×

Fundamental idea: perform multiple operations on multiple data items concurrently using single instructions

Advantages:
- Performance improvement
- Fewer instructions: reduced code size, maximization of data bandwidth
- Automatic parallelization by the compiler for vectorizable code
Single Instruction Multiple Data (SIMD) Operations SIMD Registers
Intel SSE
Intel Streaming SIMD Extensions (1999): 70 new instructions
SSE2 (2000): 144 new instructions, with support for double-precision data and 32-bit integers
SSE3 (2005): 13 new instructions for multi-thread support and Hyper-Threading
SSE4 (2007): 54 new instructions for text processing, strings, fixed-point arithmetic

Registers:
- 8 (in 32-bit mode) or 16 (in 64-bit mode) 128-bit XMM registers, XMM0-XMM15
- 8-, 16-, 32-, 64-bit integers
- 32-bit single-precision and 64-bit double-precision floating-point
Intel AVX
Intel Advanced Vector Extensions (2008): extended vectors to 256 bits
AVX2 (2013): expands most integer SSE and AVX instructions to 256 bits
Intel FMA3 (2013): fused multiply-add, introduced with Haswell

Registers:
- 8 or 16 256-bit YMM registers, YMM0-YMM15
- SSE instructions operate on the lower half of the YMM registers

- Introduces new three-operand instructions, i.e. one destination and two source operands
- Previously, SSE instructions had the form a = a + b
- With AVX, the source operands are preserved, i.e. c = a + b
ARM NEON
ARM Advanced SIMD (NEON); ARM Advanced SIMDv2 adds:
- Support for fused multiply-add
- Support for the half-precision extension
- Available in ARM Cortex-A15

Separate register file:
- 32 64-bit registers, shared with VFPv3/VFPv4 instructions
- Separate 10-stage execution pipeline

NEON register views:
- D0-D31: 32 64-bit doubleword registers
- Q0-Q15: 16 128-bit quadword registers

Data types:
- 8-, 16-, 32-, 64-bit integers
- ARMv7: 32-bit single-precision floating-point
- ARMv8: 32-bit single-precision and 64-bit double-precision floating-point
SIMD Instruction Types
- Data movement: load and store vectors between main memory and SIMD registers
- Arithmetic operations: addition, subtraction, multiplication, division, absolute difference, maximum, minimum, saturation arithmetic, square root, multiply-accumulate, multiply-subtract, halving-subtract, folding maximum and minimum
- Logical operations: bitwise AND, OR, NOT and their combinations
- Data value comparisons: =, <=, <, >=, >
- Pack, unpack, shuffle: initializing vectors from bit patterns, rearranging bits based on a control mask
- Conversion: between floating-point and integer data types using saturation arithmetic
- Bit shift: often used for integer arithmetic such as division and multiplication
- Other: cache-specific operations, casting, bit insert, cache line flush, data prefetch, execution pause, etc.
Single Instruction Multiple Data (SIMD) Operations Using SIMD Operations
How to use SIMD operations
- Compiler auto-vectorization: requires a compiler with vectorizing capabilities. Least time consuming; performance is variable and entirely dependent on compiler quality.
- Compiler intrinsic functions: an almost one-to-one mapping to assembly instructions, without having to deal with register allocation, instruction scheduling, type checking and call stack maintenance.
- Inline assembly: writing assembly instructions directly into higher-level code.
- Low-level assembly: best approach for high performance; most time consuming, least portable.
Compiler Auto-vectorization
- Requires a vectorizing compiler, e.g. gcc, icc, clang
- Loop unrolling combined with the generation of packed SIMD instructions
- GCC enables vectorization with -O3
  - Enabled with -O2 by the Intel compiler (icc)
  - Instruction set specified by -msse2 (-msse4.1, -mavx) for Intel systems
  - Enabled with -mfpu=neon on ARM systems
- Reports from the vectorization process:
  - -ftree-vectorizer-verbose=<level> (gcc), where level is between 1 and 5
  - -vec-report5 (Intel icc)
Compiler Auto-vectorization - Loops
What kinds of loops are good candidates for auto-vectorization?
- Countable: the loop trip count must be known at entry to the loop at runtime
- Single entry and single exit: no break
- Straight-line code: different iterations must not have different control flow (must not branch); if statements are allowed if they can be implemented as masked assignments
- The innermost loop of a nest: loop interchange may occur in previous optimization phases
- No function calls: some intrinsic math functions are allowed (sin, log, pow, etc.)
- No aliasing: pointers to vector arguments should be declared with the keyword restrict, which guarantees that no aliases exist for them
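A loop meeting all of these criteria might look like the following sketch; the function name and the scaling constant are illustrative only:

```c
#include <stddef.h>

/* A good auto-vectorization candidate: countable trip count, single
 * entry/exit, straight-line body, innermost loop, no function calls,
 * and restrict-qualified pointers so the compiler knows a, b and c
 * do not alias. Compile with e.g. gcc -O3 to let the vectorizer run. */
void scale_add(size_t n, const float *restrict a,
               const float *restrict b, float *restrict c) {
    for (size_t i = 0; i < n; i++) {
        c[i] = 2.0f * a[i] + b[i];
    }
}
```

With the restrict qualifiers removed, the compiler must assume c may overlap a or b and will typically emit a runtime overlap check or skip vectorization.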
Compiler Auto-vectorization – Obstacles
- Non-contiguous memory accesses: four consecutive floats (or ints) can be loaded directly; if there is a stride, they have to be loaded separately using multiple instructions
- Non-aligned data structures: may result in multiple load instructions; arrays can be aligned (here on 16-byte boundaries) dynamically or statically as follows:

    float *a = (float *) memalign(16, N * sizeof(float));
    float b[N] __attribute__((aligned(16)));

- Data dependencies: RAW (read-after-write), WAR (write-after-read), WAW (write-after-write)
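As a sketch of why a true (RAW) dependence blocks vectorization, consider a running-sum loop (the function name is illustrative):

```c
#include <stddef.h>

/* Each iteration reads a[i-1], which the previous iteration wrote:
 * a read-after-write (RAW) dependence carried across iterations.
 * The lanes of a vector add would need values that have not been
 * computed yet, so the compiler cannot vectorize this loop naively. */
void running_sum(size_t n, float *a) {
    for (size_t i = 1; i < n; i++) {
        a[i] = a[i] + a[i - 1];
    }
}
```

By contrast, WAR and WAW dependences can often be removed by renaming or reordering, which is why RAW is the fundamental obstacle.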
Compiler Auto-vectorization - Example
SAXPY:
- Y = αX + Y in single precision
- Here α is a scalar constant and X, Y are single-precision vectors
- Used in the BLAS (Basic Linear Algebra Subprograms) library

    void saxpy(int n, float a, float* __restrict__ X, float* __restrict__ Y) {
        int i;
        for (i = 0; i < n; i++)
            Y[i] = a*X[i] + Y[i];
    }

Compile using vectorization options:

    $ gcc -O3 -ftree-vectorizer-verbose=1 saxpy.cc -o saxpy
    Analyzing loop at saxpy.cc:31
    Vectorizing loop at saxpy.cc:31
    saxpy.cc:31: note: === vect_do_peeling_for_alignment ===
    saxpy.cc:31: note: === vect_update_inits_of_dr ===
    saxpy.cc:31: note: === vect_do_peeling_for_loop_bound ===
    Setting upper bound of nb iterations for epilogue loop to 2
    saxpy.cc:31: note: LOOP VECTORIZED.
    saxpy.cc:28: note: vectorized 1 loops in function.
    saxpy.cc:31: note: Completely unroll loop 2 times
    saxpy.cc:28: note: Completely unroll loop 3 times
SIMD Data Types - Intel
    Data type   Content                                               SSE  SSE2  SSE3  AVX
    __m64       8×char, 4×short, 2×int32, 2×float, 1×int64, 1×double   ✓    ✗     ✓    ✓
    __m128      4×float                                                ✓    ✓     ✓    ✓
    __m128d     2×double                                               ✗    ✓     ✓    ✓
    __m128i     16×char, 8×short, 4×int32, 2×int64                     ✗    ✓     ✓    ✓
    __m256      8×float                                                ✗    ✗     ✗    ✓
    __m256d     4×double                                               ✗    ✗     ✗    ✓
    __m256i     32×char, 16×short, 8×int32, 4×int64                    ✗    ✗     ✗    ✓
SIMD Data Types - ARM
NEON vector data types follow the pattern: <type><size>x<number of lanes>_t

    64-bit type (D-register)   128-bit type (Q-register)   Content
    int8x8_t                   int8x16_t                   8-bit char
    int16x4_t                  int16x8_t                   16-bit int
    int32x2_t                  int32x4_t                   32-bit int
    int64x1_t                  int64x2_t                   64-bit int
    uint8x8_t                  uint8x16_t                  8-bit unsigned int
    uint16x4_t                 uint16x8_t                  16-bit unsigned int
    uint32x2_t                 uint32x4_t                  32-bit unsigned int
    uint64x1_t                 uint64x2_t                  64-bit unsigned int
    float16x4_t                float16x8_t                 16-bit floating-point
    float32x2_t                float32x4_t                 32-bit floating-point
    poly8x8_t                  poly8x16_t                  8-bit polynomial
    poly16x4_t                 poly16x8_t                  16-bit polynomial

- It is also possible to have types representing arrays of 1 to 4 vectors
- E.g. int8x8x2_t represents an array of two int8x8_t vectors
- Individual vectors in these arrays are accessed as <var name>.val[0], etc.
Compiler Intrinsic Functions - Intel
SSE and AVX intrinsic function names use the following notational convention:

    _mm_<opname>_<suffix>

- <opname>: indicates the basic operation of the intrinsic function; for example, add for addition and sub for subtraction
- <suffix>: denotes the type of data the instruction operates on
  - The first one or two letters of each suffix denote whether the data is:
    - p: packed
    - ep: extended packed
    - s: scalar
  - The remaining letters and numbers denote the type, with notation as follows:
    - s: single-precision floating-point
    - d: double-precision floating-point
    - i32: signed 32-bit integer
    - u32: unsigned 32-bit integer
    - ...
Compiler Intrinsic Functions - ARM
NEON intrinsic function names use the following notational convention:

    v<opname><flags>_<type>

- <opname>: indicates the basic operation of the intrinsic function; for example, add for addition and sub for subtraction
- <flags>:
  - Number (between 1 and 4): denotes the array size of the result vector, i.e. size 1 (int16x4_t), size 2 (int16x4x2_t), size 3 (int16x4x3_t) or size 4 (int16x4x4_t)
  - q: denotes that Q registers must be used by both operands and the result
  - l: long shape; the number of bits in each result element is double the number of bits in each operand element, i.e. the operands are usually doubleword vectors and the result is a quadword vector
  - w: wide shape; the result and first operand are twice the width of the second operand, i.e. a doubleword and a quadword operand make a quadword result
- <type>: denotes the type of data the instruction operates on
Compiler Intrinsic Functions - Examples
Load four SP FP values, address aligned:
- Intel SSE2: __m128 _mm_load_ps(float* p);
- ARM NEON: float32x4_t vld1q_f32(const float32_t* p);
- Result lanes: R0 = p[0], R1 = p[1], R2 = p[2], R3 = p[3]

Store four SP FP values; the address must be 16-byte aligned:
- Intel SSE2: void _mm_store_ps(float* p, __m128 a);
- ARM NEON: void vst1q_f32(float32_t* p, float32x4_t a);
- Memory: p[0] = a0, p[1] = a1, p[2] = a2, p[3] = a3
Compiler Intrinsic Functions - Examples
Add four SP FP values:
- Intel SSE2: __m128 _mm_add_ps(__m128 a, __m128 b);
- ARM NEON: float32x4_t vaddq_f32(float32x4_t a, float32x4_t b);
- Result lanes: R0 = a0+b0, R1 = a1+b1, R2 = a2+b2, R3 = a3+b3

Fused multiply-add four SP FP values:
- Intel FMA3: __m128 _mm_fmadd_ps(__m128 a, __m128 b, __m128 c);
- ARM NEON: float32x4_t vfmaq_f32(float32x4_t a, float32x4_t b, float32x4_t c);
- Result lanes: R0 = a0*b0 + c0, R1 = a1*b1 + c1, R2 = a2*b2 + c2, R3 = a3*b3 + c3
Compiler Intrinsic Functions - Portability
- Code vectorized with intrinsic functions for a specific processor is not portable: it only runs if the architecture supports the specific SIMD extension
- Portability is ensured by using conditional compilation in the code:
  - The program must contain both a scalar and a vectorized version of the same computation
  - Using #ifdef statements, the applicable version is chosen at compile time
  - E.g. in GCC, if the -msse2 option is passed, then the macro __SSE2__ is defined

Usage of conditional compilation:

    #ifdef __SSE2__
    /* Code optimized with SSE2 intrinsic functions */
    #elif defined(__ARM_NEON__)
    /* Code optimized with ARM NEON intrinsic functions */
    #else
    /* Scalar code */
    #endif
Reference Manuals
Intel:
- Intel Intrinsics Guide
- Intel 64 and IA-32 Architectures Software Developer's Manual

ARM:
- ARM NEON Programmer's Guide (version 1.0)
- ARM NEON Intrinsics in GCC
Optimizing Vector Addition
C[i] = A[i] + B[i]
    void VectorAdd(float *a, float *b, float *c, size_t size) {
        size_t i;
        for (i = 0; i < size; i++) {
            c[i] = a[i] + b[i];
        }
    }
Optimizing Vector Addition - Intel SSE2
    void VectorAddSSE(float* __restrict__ a, float* __restrict__ b,
                      float* __restrict__ c, size_t size) {
        size_t i;
        for (i = 0; i < (size/4) * 4; i += 4) {
            /* Load into SSE XMM registers */
            __m128 sse_a = _mm_load_ps(&a[i]);
            __m128 sse_b = _mm_load_ps(&b[i]);

            /* Perform addition */
            __m128 sse_c = _mm_add_ps(sse_a, sse_b);

            /* Store back to memory */
            _mm_store_ps(&c[i], sse_c);
        }

        /* Scalar epilogue for the remaining elements */
        for (i = (size/4) * 4; i < size; i++) {
            c[i] = a[i] + b[i];
        }
    }
Optimizing Vector Addition - ARM NEON
    void VectorAddNEON(float32_t* __restrict__ a, float32_t* __restrict__ b,
                       float32_t* __restrict__ c, size_t size) {
        size_t i;
        /* Declare vector data types */
        float32x4_t a4, b4, c4;

        for (i = 0; i < (size/4) * 4; i += 4) {
            /* Load into quad NEON registers */
            a4 = vld1q_f32(a+i);
            b4 = vld1q_f32(b+i);

            /* Perform addition */
            c4 = vaddq_f32(a4, b4);

            /* Store back to memory */
            vst1q_f32(c+i, c4);
        }

        /* Scalar epilogue for the remaining elements */
        for (i = (size/4) * 4; i < size; i++) {
            c[i] = a[i] + b[i];
        }
    }
Hands-on Exercise: Vectorizing Loops
Objective: Vectorizing loops using SSE & NEON instructions
Cache Basics
Outline

1. Single Instruction Multiple Data (SIMD) Operations
2. Cache Basics
3. Multiprocessor Cache Organization
4. Thread Basics
Importance
- An infinitely fast CPU must still load and store its data to memory
- HPC applications have execution time t(n) = O(n^m) for problem size n, implying at least n and at most t(n) data accesses, plus t(n) instruction accesses
  - Discrete Fourier Transform: O(n) data items, O(n log n) or O(n^2) operations
  - Reduction to upper/lower triangular form: O(n^2) data items, O(n^3) operations
  - Forward/backward substitution: O(n^2) data items, O(n^2) operations
- Hence we need large memories with fast access times; achieved by (cost-per-bit vs speed trade-off):
  - cache memory: low cost
  - wide (parallel) memory access: moderate cost
  - faster technology (?): high cost
Access Times
- Memory access time: the amount of time it takes to read or write a memory location
- Memory cycle time: how quickly a memory access can be repeated
  - For example, a memory chip may have an access time of 50 ns but a cycle time of 200 ns
- Over the last ≈20 years, CPU speed has improved much faster than memory access times:
  - In the mid 1980s, commodity DRAMs had an access time of 200 ns, while the IBM PC had a CPU clock of 4.77 MHz (a 210 ns period)
  - Today commodity DRAMs have access times of around 50 ns, but CPU clock periods have decreased to 1 ns or less
  - In part this is because the CPU is now within one chip, while memory is on external chips
- Memory sizes are also now much larger
Memory Technologies
Two main memory technologies:

                        cost/bit   access time     used in
    SRAM (Static RAM)   high       low (≈10 ns)    caches
    DRAM (Dynamic RAM)  low        high (≈50 ns)   most main memories

- SRAM: each bit uses at least 3 transistors (6 for best results) and a constant power supply
- DRAM: each bit uses 1 transistor and a capacitor; over time the charge leaks from the capacitor and it must be refreshed
- A memory access involves the stages:
  1. select row address
  2. select column address (R/W)
  3. access the selected bit(s)
  as well as transferring addresses and data over the relevant bus; hence there is scope for pipelining
Memory Hierarchy
Modern microprocessors have a memory hierarchy:

                        Access speed
    Registers           clock cycle
    Cache               few cycles
    Memory (DRAM)       many cycles
    Virtual memory      long!

- Idea: data that is "currently most needed" is brought into a smaller, faster memory
- Observation: memory accesses in most programs exhibit:
  - Temporal locality: if address X is accessed, X is likely to be accessed again soon
  - Spatial locality: if address X is accessed, X+1 is likely to be accessed soon
- Hence caches are organized into lines (units) of L words, L = 2^l (e.g. L = 2, 4, 8)
  ✓ blocked memory accesses (faster) and less control information needed per word
  ✗ redundant memory traffic if only one word per line is ever used, e.g. pointer chasing (non-unit stride guaranteed!)
- If locality holds, this yields a good cost-speed trade-off
Memory Hierarchy
Intel Core i7 Xeon 5500 at 2.8 GHz:

                    Cycles   Time (≈ ns)
    Registers          1        0.3
    L1 cache hit       4        1.5
    L2 cache hit      10        3.5
    L3 cache hit      40       15
    Local DRAM       160       60
    Remote DRAM      280      100

Memory access time includes the cost of getting data over the bus.
Registers
- Registers are a very limited resource, accessed in one cycle
- The goal is to keep operands in registers as much as possible
- Example: X = G * 2.41 + A/W − W/B
  - RISC arithmetic instructions are limited to two source operands, so the results of G * 2.41, A/W and W/B must be stored to registers before being added together
  - Also, W is used twice, and we don't want it loaded from (slow) memory twice
- A fundamental job of the compiler is to optimize register use, e.g.:

    ld    [%W],%f1       ! load W from memory into register f1
    ld    [%A],%f2       ! load A from memory into register f2
    fdiv  %f2,%f1,%f2    ! form A/W and overwrite f2 (A) with the result
    ld    [%B],%f3       ! load B from memory into register f3
    fdiv  %f1,%f3,%f3    ! form W/B and overwrite f3 (B) with the result
Cache Memory
- A small amount of SRAM memory
- Cache hit rate: the percentage of (word) accesses in a program for which the data is in cache
  - Needs to be high (e.g. > 95%) for good performance
  - Problem: programs may need to be re-written to achieve this!
  - Only possible with sufficient inherent data re-use, g(n)/n, where g(n) is the operation count, e.g.:
    - (n^{1/2} × n^{1/2}) matrix multiply: g(n) = 2n^{3/2} (mixed unit and non-unit stride)
    - FFT (Fast Fourier Transform): g(n) = 8n lg n (power-of-2 stride)
  - Thus, the FFT is more likely to be dominated by memory access time
- Consistency of the data cache and main memory: when a store instruction is executed, the relevant line is updated first in the cache
  - write-through: the resulting word is written to main memory at the same time; slow, but consistency is maintained
  - copy-back: a modified ('dirty') cache line is written to memory when it is thrown out of the cache
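The re-use ratios implied above can be worked out explicitly from the g(n) values quoted on this slide:

```latex
% Data re-use = operations per data item = g(n)/n
\text{Matrix multiply:}\quad \frac{g(n)}{n} = \frac{2n^{3/2}}{n} = 2\sqrt{n}
\qquad\qquad
\text{FFT:}\quad \frac{g(n)}{n} = \frac{8n\lg n}{n} = 8\lg n
```

Since 8 lg n grows far more slowly than 2√n, the FFT performs far fewer operations per data item, which is why it is more likely to be memory-bound.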
Cache Friendly
Friendly:

    for (i = 0; i < 100000; i += 1) {
        sum += a[i];
    }

    double a[200][200];
    for (i = 0; i < 200; i++) {
        for (j = 0; j < 200; j++) {
            sum += a[i][j];
        }
    }

Unfriendly:

    for (i = 0; i < 800000; i += 8) {
        sum += a[i];
    }

    double a[200][200];
    for (i = 0; i < 200; i++) {
        for (j = 0; j < 200; j++) {
            sum += a[j][i];
        }
    }

    while (ptr->next != NULL)
        ptr = ptr->next;
Direct-Mapped Caches
- Cache memory is organized into lines of size 2^l; for a cache of size 2^c there are C' = 2^{c−l} lines
- An address X is split into three fields: tag a0 (bits 31..c), line index a1 (bits c−1..l) and offset a2 (bits l−1..0)
- All addresses with the same a1 are mapped to the same cache line
  ✓ easy to implement, low chip area per word (note: the cache must store the value of a0), so a large C' is possible (better performance)
  ✗ cache conflicts: 2 (or more) words from memory map to the same cache line
    - can be instruction-instruction, instruction-data or data-data conflicts
    - can cause large, often unpredictable, performance losses
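The address decomposition above can be sketched in C; the cache parameters here (64-byte lines, a 4 KB cache) are illustrative assumptions:

```c
#include <stdint.h>

#define LINE_BITS  6    /* l: 2^6  = 64-byte lines (illustrative)  */
#define CACHE_BITS 12   /* c: 2^12 = 4 KB cache    (illustrative)  */
#define NUM_LINES  (1u << (CACHE_BITS - LINE_BITS))  /* C' = 2^(c-l) = 64 lines */

/* Offset within a line: the low l bits (field a2). */
static uint32_t line_offset(uint32_t addr) {
    return addr & ((1u << LINE_BITS) - 1);
}

/* Line index: bits c-1..l (field a1); selects the cache line. */
static uint32_t line_index(uint32_t addr) {
    return (addr >> LINE_BITS) & (NUM_LINES - 1);
}

/* Tag: bits 31..c (field a0); stored alongside the line for comparison. */
static uint32_t tag(uint32_t addr) {
    return addr >> CACHE_BITS;
}
```

Two addresses exactly one cache size apart (e.g. 0x1040 and 0x2040) have the same line index but different tags: a conflict in a direct-mapped cache.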
K-way Set Associative Caches
- Like a direct-mapped cache of size C', but every line is extended to a set of K lines (the total number of cache lines is now C'K)
- Addresses with the same a1 can map into the corresponding line of any of the K sets
- Typically K = 1, 2, 4, 5, 6, 8
- Reduces the chance of conflicts by a factor of K, at some extra cost

Examples of cache thrashing:

4 KB direct-mapped cache (each array is 4 KB, so a[i] and b[i] map to the same cache line):

    float a[1024], b[1024];
    for (i = 0; i < 1024; i++) {
        a[i] = a[i]+b[i];
    }

4 KB 2-way set associative cache (three 4 KB arrays contend for two ways):

    float a[1024], b[1024], c[1024];
    for (i = 0; i < 1024; i++) {
        a[i] = a[i]+b[i]+c[i];
    }
Cache: Other Issues
- Modern processors usually have separate data and instruction first-level caches
- Multiple (2 or 3) levels of cache, i.e. a deep memory hierarchy
  - Typically: top-level (data) cache: 16 KB, K = 4, write-through; 2nd-level (instruction/data) cache: 1 MB, K = 1, copy-back
  ✗ Harder still to tune a program for 2 levels!
- (Top-level) cache prefetching (requires load/store pipelines)
  - An access time (latency) of δ cycles can be hidden by performing each load δ cycles in advance of when it is needed
  - Can be done via:
    - prefetching or software pipelining (by the programmer or compiler, e.g. UltraSPARC)
    - hardware instruction re-ordering (e.g. Pentium 4): more effective, but the hardware is more expensive and complex
- Increase the data bus width to the cache line size L
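Software prefetching can be sketched with GCC/Clang's __builtin_prefetch; the prefetch distance of 16 elements is an illustrative guess that would need tuning per machine:

```c
#include <stddef.h>

#define PREFETCH_AHEAD 16  /* illustrative distance; tune to hide the latency delta */

/* Sum an array while hinting the hardware to fetch a later element's
 * cache line early, so load latency overlaps with useful work. The
 * result is identical without the prefetch; the builtin is only a hint
 * (arguments: address, 0 = read, 1 = low temporal locality). */
double sum_prefetched(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 1);
        sum += a[i];
    }
    return sum;
}
```

On hardware with an aggressive stride prefetcher, this explicit hint often makes no difference for a simple sequential scan; it pays off mainly for irregular but predictable access patterns.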
Hands-on Exercise: IPC and Cache Misses
Objective: use the PAPI advanced interface to measure instructions per cycle and cache behaviour for a variety of loops
Multiprocessor Cache Organization
Outline

1. Single Instruction Multiple Data (SIMD) Operations
2. Cache Basics
3. Multiprocessor Cache Organization
4. Thread Basics
Shared Memory Hardware
(Fig 2.5, Grama et al., Introduction to Parallel Computing)
Shared Address Space Systems
- Systems with caches but otherwise flat memory are generally called UMA (uniform memory access)
- If access to local memory is cheaper than access to remote memory (NUMA), this should be built into your algorithm
- Global address space systems are easier to program
  - Read-only interactions are invisible to the programmer and coded as in a sequential program
  - Read/write interactions are harder, requiring mutual exclusion for concurrent accesses
- Programmed using threads; synchronization using locks and related mechanisms
Cache Hierarchy on Intel Core i7 (2013)
(64-byte cache line size)

Ref: http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1
Caches on Multiprocessors
- Multiple copies of some data word may be manipulated by two or more processors at the same time
- Usually there are separate I-caches (instruction caches) on the first 1-2 levels (usually of similar sizes to the data caches)
  - How is instruction access different from data access? Why is this useful?
- Two requirements:
  - An address translation mechanism that locates each physical memory word in the system
  - Concurrent operations on multiple copies must have well-defined semantics
- The latter is generally known as a cache coherency protocol
- Input/output using direct memory access (DMA) on machines with caches also leads to coherency issues
- Some machines only provide shared address space mechanisms and leave coherence to the programmer, e.g. the Texas Instruments Keystone II system
Cache Coherency
- Intuitive behaviour: reading the value at address X should return the last value written to address X by any processor
  - What does "last" mean? What if writes are simultaneous, or closer in time than the time required to communicate between two processors?
- In a sequential program, "last" is determined by program order (not time)
  - This holds within each thread of a parallel program, but what does it mean across multiple threads?
Cache/Memory Coherency
A memory system is coherent if:

- Ordered as issued: a read by processor P to address X that follows a write by P to address X returns the value of that write (assuming no other processor writes to X in between)
- Write propagation: a read by processor P1 to address X that follows a write by processor P2 to X returns the written value if the read and write are sufficiently separated in time (assuming no other write to X occurs in between)
- Write serialization: writes to the same address are serialized: two writes by any two processors are observed in the same order by all processors

(Later to be contrasted with memory consistency!)
Two Cache Coherency Protocols
(Fig 2.21, Grama et al., Introduction to Parallel Computing)
Cache Line View
Ref: http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1
Cache lines need to be augmented with state information regarding their validity
Update vs. Invalidate
Update protocol: when a data item is written, all of its copies in the system are updated

Invalidate protocol (most common): before a data item is written, all other copies are marked as invalid

Comparison:
- Multiple writes to the same word with no intervening reads require multiple write broadcasts in an update protocol, but only one initial invalidation
- With multiword cache blocks, each word written in a cache block must be broadcast in an update protocol, but only one invalidate per line is required
- The delay between writing a word on one processor and reading the written value on another is usually less for update
False Sharing
Two processors modify different parts
- f the same cache line
Invalidate protocol leads to ping-ponged cache lines Update protocol performs reads locally but updates much traffic between processors This effect is entirely an artefact of hardware Need to design parallel systems / programs with this issue in mind
Cache line size, longer more likely Alignment of data structures with respect to cache line size
http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1
Multiprocessor Cache Organization
Implementing Cache Coherency
On small scale bus based machines
A processor must obtain access to the bus to broadcast a write invalidation
With two competing processors, the first to gain access to the bus will invalidate the other's data
A cache miss needs to locate the top copy of the data
Easy for a write-through cache
For write-back caches, each processor snoops the bus and responds by providing the data if it has the top copy
For writes, we would like to know if any other copies of the block are cached
i.e. whether a write-back cache needs to put details on the bus
Handled by having a tag to indicate shared status
Minimizing processor stalls
Either by duplication of tags or having multiple inclusive caches
Multiprocessor Cache Organization
3 State (MSI) Cache Coherency Protocol
read: local read
write: local write
c read (coherency read): a read on a remote processor gives rise to the shown transition in the local cache
c write (coherency write): a write miss, or a write in the Shared state, on a remote processor gives rise to the shown transition in the local cache
(Fig 2.22 Grama et al, Intro to Parallel Computing)
Multiprocessor Cache Organization
MSI Coherency Protocol
(Fig 2.23 Grama et al, Intro to Parallel Computing)
Multiprocessor Cache Organization
Snoopy Cache Systems
All caches broadcast all transactions (read or write misses, writes in S state)
Suited to bus or ring interconnects
However, scalability is limited (i.e. ≤ 8 processors)
All processors monitor the bus for transactions of interest
Each processor's cache has a set of tag bits that determine the state of the cache block
Tags are updated according to state diagram for relevant protocol
e.g. when snoop hardware detects that a read has been issued for a cache block of which it holds a dirty copy, it asserts control of the bus, puts the data out (to the requesting cache and to main memory), and sets the tag to the S state
What sort of data access characteristics are likely to perform well/badly on snoopy based systems?
Multiprocessor Cache Organization
Snoopy Cache Based System
(Fig 2.24 Grama et al, Intro to Parallel Computing)
Multiprocessor Cache Organization
Snoopy Cache-Based System: Ring
The Core i7 (Sandy Bridge) on-chip interconnect revisited:
a ring-based interconnect between Cores, Graphics, Last Level Cache (LLC) and System Agent domains
has 4 physical rings: Data (32B), Request, Acknowledge and Snoop
rings are fully pipelined; bandwidth, latency and power scale with cores
shortest path chosen to minimize latency
has distributed arbitration & sophisticated protocols to handle coherency and ordering
(courtesy www.lostcircuits.com)
Multiprocessor Cache Organization
Directory Cache Based Systems
The need to broadcast is clearly not scalable
A solution is to only send information to processing elements specifically interested in that data
This requires a directory to store information
Augment global memory with presence bitmap to indicate which caches each memory block is located in
Multiprocessor Cache Organization
Directory Based Cache Coherency
To implement, we must track the state of each cache block
A simple protocol might be:
Shared: one or more processors have the block cached, and the value in memory is up to date
Uncached: no processor has a copy
Exclusive: only one processor (the owner) has a copy, and the value in memory is out of date
Must handle a read/write miss and a write to a shared, clean cache block
These first reference the directory entry to determine the current state of this block
Then update the entry's status and presence bitmap
Send the appropriate state update transactions to the processors in the presence bitmap
Multiprocessor Cache Organization
Directory-Based Cache Coherency
(Fig 2.25 Grama et al, Intro to Parallel Computing)
Multiprocessor Cache Organization
Directory-Based Systems
How much memory is required to store the directory?
What sort of data access characteristics are likely to perform well/badly on directory based systems?
How do distributed and centralized systems compare?
Should the presence bitmaps be replicated in the caches? Must they be?
How would you implement sending an invalidation message to all (and only to all) processors in the presence bitmap?
Multiprocessor Cache Organization
Costs on SGI Origin 3000 (clock cycles)
                                             <= 16 CPU    > 16 CPU
Cache hit                                         1            1
Cache miss to local memory                       85           85
Cache miss to remote home directory             125          150
Cache miss to remotely cached data (3 hop)      140          170
Figure from http://people.nas.nasa.gov/∼schang/origin opt.html
Data from: Computer Architecture: A Quantitative Approach, by David A. Patterson, John L. Hennessy, David Goldberg, 3rd Ed, Morgan Kaufmann, 2003
Multiprocessor Cache Organization
Real Cache Coherency Protocols
From Wikipedia
Most modern systems use variants of the MSI protocol to reduce the amount of traffic in the coherency interconnect.
The MESI protocol adds an "Exclusive" state to reduce the traffic caused by writes of blocks that only exist in one cache.
The MOSI protocol adds an "Owned" state to reduce the traffic caused by write-backs of blocks that are read by other caches [the processor owning the cache line services requests for that data].
The MOESI protocol does both of these things.
The MESIF protocol uses the "Forward" state to reduce the traffic caused by multiple responses to read requests when the coherency architecture allows caches to respond to snoop requests with data.
Multiprocessor Cache Organization
MESI (on a bus)
Ref: https://www.cs.tcd.ie/Jeremy.Jones/vivio/caches/MESIHelp.htm
Multiprocessor Cache Organization
Multi-Level Cache
What is visibility of changes between levels of cache?
http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1
Easiest model is inclusive:
If line is in owned state in L1, it is also in owned state in L2
Multiprocessor Cache Organization
Cache Summary
Cache coherency arises because the abstraction of a single shared address space is not actually implemented by a single storage unit in a machine
Three components to cache coherency
Issue order, write propagation, write serialization
Two implementations
Broadcast and Directory
False sharing is a potential performance issue
More likely the longer the cache line
Multiprocessor Cache Organization
Hands-on Exercise: Matrix Multiplication Performance
Objective: To understand the effect of various matrix multiply loop orderings on IPC and cache misses
Thread Basics
Outline
1
Single Instruction Multiple Data (SIMD) Operations
2
Cache Basics
3
Multiprocessor Cache Organization
4
Thread Basics
Thread Basics
Fork/Join Programming Model
Thread Basics
(Heavyweight) UNIX Processes
An O/S like UNIX is based on the notion of a process
The CPU is shared between different processes
UNIX processes are created via fork()
The child is an exact copy of the parent, except that it has a unique process ID
Processes are "joined" using the system call wait()
This introduces a synchronization point
Thread Basics
UNIX Fork Example
pid = fork ();
if (pid == 0) {
    /* code to be executed by child */
} else {
    /* code to be executed by parent */
}
if (pid == 0)
    exit (0);
else
    wait (0);
Thread Basics
Processes and Threads
Thread Basics
Why Threads
Software Portability
Applications can be developed on serial machine and run on parallel machines without changes (is this really true?)
Latency Hiding
Ability to mask access to memory, I/O or communication by having another thread execute in the meantime (but how quickly can execution switch between threads?)
Scheduling and Load Balancing
For unstructured and dynamic applications (e.g. game playing), load balancing can be very hard. One option is to create more threads than CPU resources and let the O/S sort out the scheduling
Ease of Programming, Widespread Use
Due to the above, threaded programs are easier to write (or develop incrementally), so there has been widespread acceptance of the POSIX thread API (generally referred to as pthreads)
Thread Basics
Pthread Creation
Thread Basics
Threads and Threaded Code
pthread_self() provides the ID of the calling thread
pthread_equal(thread1, thread2) tests whether two thread IDs are equal
Detached Threads
Threads that will never synchronize via a join
Specified via an attribute
Re-entrant or thread-safe routines are those that can be safely called while another instance has been suspended in the middle of its invocation
Thread Basics
Example: Computing Pi
Ratio of area of circle to area of the square is π/4
Guess points within the domain of the square at random
Identify those that are at distance less than 1 from the origin
Ratio of points in circle to total points is π/4
Thread Basics
Example: Computing Pi
#include <pthread.h>
#include <stdlib.h>
#define MAX_THREADS 512
void *compute_pi (void *);
....
main () {
    ...
    pthread_t p_threads[MAX_THREADS];
    pthread_attr_t attr;
    pthread_attr_init (&attr);
    for (i = 0; i < num_threads; i++) {
        hits[i] = i;
        pthread_create (&p_threads[i], &attr,
                        compute_pi, (void *) &hits[i]);
    }
    for (i = 0; i < num_threads; i++) {
        pthread_join (p_threads[i], NULL);
        total_hits += hits[i];
    }
    ...
}
Thread Basics
Example: Computing Pi (cont)
void *compute_pi (void *s) {
    int seed, i, *hit_pointer;
    int local_hits;
    hit_pointer = (int *) s;
    seed = *hit_pointer;
    local_hits = 0;
    double rx, ry;
    for (i = 0; i < sample_points_per_thread; i++) {
        rx = ((double) rand_r (&seed)) / RAND_MAX - 0.5;
        ry = ((double) rand_r (&seed)) / RAND_MAX - 0.5;
        if (rx*rx + ry*ry < 0.25)
            local_hits++;
    }
    *hit_pointer = local_hits;
    pthread_exit (0);
}
Thread Basics
Programming and Performance Notes
Note the use of the function rand_r (instead of superior random number generators such as drand48)
Executing this on a 4-processor SGI Origin, we observe a 3.91-fold speedup at 32 threads
This corresponds to a parallel efficiency of 0.98!
We can also modify the program slightly to observe the effect of false sharing
The program can also be used to assess the secondary cache line size
Thread Basics
Performance
(Fig 7.2 Grama et al, Intro to Parallel Computing)
4-processor SGI Origin system using up to 32 threads
Instead of incrementing local_hits, we add to a shared array using strides of 1, 16 and 32
Thread Basics
Hands-on Exercise: False Cache Line Aliasing
Objective: To observe the effect of cache line contention and false cache line sharing on performance
Thread Basics
Summary
Topics covered today - Vectorization & Cache Organization:
SIMD operations
Cache basics
Multiprocessor cache organization
Tomorrow - Multi Processor Parallelism!