Vectorization & Cache Organization: ASD Shared Memory HPC Workshop



SLIDE 1

Vectorization & Cache Organization ASD Shared Memory HPC Workshop

Computer Systems Group

Research School of Computer Science Australian National University Canberra, Australia

February 11, 2020

SLIDE 2

Schedule - Day 2

Computer Systems (ANU) Vectorization & Cache Organization Feb 11, 2020 2 / 85

SLIDE 3

Single Instruction Multiple Data (SIMD) Operations

Outline

1. Single Instruction Multiple Data (SIMD) Operations
   - SIMD CPU Extensions
   - Understanding SIMD Operations
   - SIMD Registers
   - Using SIMD Operations

2. Cache Basics

3. Multiprocessor Cache Organization

4. Thread Basics

SLIDE 4

Single Instruction Multiple Data (SIMD) Operations SIMD CPU Extensions

Flynn’s Taxonomy

- SISD: single instruction, single data
- MISD: multiple instruction, single data (streaming processors)
- SIMD: single instruction, multiple data (array, vector processors)
- MIMD: multiple instruction, multiple data (multi-threaded processors)

Mike Flynn, 'Very High-Speed Computing Systems', Proceedings of the IEEE, 1966

SLIDE 5

Single Instruction Multiple Data (SIMD) Operations SIMD CPU Extensions

Types of Parallelism

- Data parallelism: performing the same operation on different pieces of data
  - SIMD: e.g. summing two vectors element by element
- Task parallelism: executing different threads of control in parallel
- Instruction-level parallelism: multiple instructions are executed concurrently
  - Superscalar: multiple functional units
  - Out-of-order execution and pipelining
  - Very long instruction word (VLIW)
- SIMD: multiple operations are concurrent, while the instruction is the same

SLIDE 6

Single Instruction Multiple Data (SIMD) Operations SIMD CPU Extensions

History of SIMD - Vector Processors

- Instructions operate on vectors rather than scalar values
- There are vector registers from which vectors can be loaded and stored
- Vectors may be of variable length, i.e. vector registers must support variable vector lengths
- The data elements to be loaded into a vector register may not be contiguous in memory,
  i.e. there is support for strides, or distances between two elements of a vector
- The Cray-1 used vector processors:
  - Clocked at 80 MHz at Los Alamos National Laboratory, 1976
  - Introduced CPU registers for SIMD vector operations
  - 250 MFLOPS when SIMD operations were utilized effectively
- Primary disadvantage: works well only if the parallelism is regular
- Superseded by contemporary scalar processors with support for vector
  operations, i.e. SIMD extensions

SLIDE 7

Single Instruction Multiple Data (SIMD) Operations SIMD CPU Extensions

SIMD Extensions

Extensive use of SIMD extensions in contemporary hardware:

- Complex Instruction Set Computers (CISC)
  - Intel MMX: 64-bit wide registers; the first widely used SIMD instruction set
    on the desktop computer, in 1996
  - Intel Streaming SIMD Extensions (SSE): 128-bit wide XMM registers
  - Intel Advanced Vector Extensions (AVX): 256-bit wide YMM registers
- Reduced Instruction Set Computers (RISC)
  - SPARC64 VIIIfx (HPC-ACE): 128-bit registers
  - PowerPC A2 (AltiVec, VSX): 128-bit registers
  - ARMv7, ARMv8 (NEON): 64-bit and 128-bit registers

A similar architecture, Single Instruction Multiple Thread (SIMT), is used in GPUs.

SLIDE 8

Single Instruction Multiple Data (SIMD) Operations Understanding SIMD Operations

SIMD Processing - Vector addition

C[i] = A[i] + B[i]

void VectorAdd(float *a, float *b, float *c, size_t size) {
    size_t i;
    for (i = 0; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}

- Assume arrays A and B contain 8-bit short integers
- There are no dependencies between operations, i.e. the loop is embarrassingly parallel
- Note: arrays A and B may not be contiguously allocated
- How can this operation be parallelized?

SLIDE 9

Single Instruction Multiple Data (SIMD) Operations Understanding SIMD Operations

SIMD Processing - Vector addition

- Scalar: 8 loads + 4 scalar adds + 4 stores = 16 ops
- Vector: 2 loads + 1 vector add + 1 store = 4 ops
- Speedup: 16/4 = 4×
- Fundamental idea: perform multiple operations on multiple data items
  concurrently using single instructions
- Advantages:
  - Performance improvement
  - Fewer instructions: reduced code size, maximization of data bandwidth
  - Automatic parallelization by the compiler for vectorizable code

SLIDE 10

Single Instruction Multiple Data (SIMD) Operations SIMD Registers

Intel SSE

- Intel Streaming SIMD Extensions, SSE (1999): 70 new instructions
- SSE2 (2000): 144 new instructions, with support for double-precision data and 32-bit integers
- SSE3 (2005): 13 new instructions for multi-thread support and HyperThreading
- SSE4 (2007): 54 new instructions for text processing, strings and fixed-point arithmetic
- 8 (in 32-bit mode) or 16 (in 64-bit mode) 128-bit XMM registers: XMM0-XMM15
- Operands: 8-, 16-, 32-, 64-bit integers; 32-bit SP and 64-bit DP floating-point

SLIDE 11

Single Instruction Multiple Data (SIMD) Operations SIMD Registers

Intel AVX

- Intel Advanced Vector Extensions (2008): extended vectors to 256 bits
- AVX2 (2013): expands most integer SSE and AVX instructions to 256 bits
- Intel FMA3 (2013): fused multiply-add, introduced in Haswell
- 8 or 16 256-bit YMM registers: YMM0-YMM15
- SSE instructions operate on the lower half of the YMM registers
- Introduces new three-operand instructions, i.e. one destination and two source operands
  - Previously, SSE instructions had the form a = a + b
  - With AVX, the source operands are preserved, i.e. c = a + b

SLIDE 12

Single Instruction Multiple Data (SIMD) Operations SIMD Registers

ARM NEON

- ARM Advanced SIMD (NEON); ARM Advanced SIMDv2 adds:
  - Support for fused multiply-add
  - Support for the half-precision extension
  - Available in ARM Cortex-A15
- Separate register file: 32 64-bit registers, shared by VFPv3/VFPv4 instructions
- Separate 10-stage execution pipeline
- NEON register views:
  - D0-D31: 32 64-bit double-words
  - Q0-Q15: 16 128-bit quad-words
- 8-, 16-, 32-, 64-bit integers
- ARMv7: 32-bit SP floating-point; ARMv8: 32-bit SP and 64-bit DP

SLIDE 13

Single Instruction Multiple Data (SIMD) Operations SIMD Registers

SIMD Instruction Types

- Data movement: load and store vectors between main memory and SIMD registers
- Arithmetic operations: addition, subtraction, multiplication, division,
  absolute difference, maximum, minimum, saturation arithmetic, square root,
  multiply-accumulate, multiply-subtract, halving-subtract, folding maximum and minimum
- Logical operations: bitwise AND, OR, NOT and their combinations
- Data value comparisons: =, <=, <, >=, >
- Pack, unpack, shuffle: initializing vectors from bit patterns, rearranging
  bits based on a control mask
- Conversion: between floating-point and integer data types using saturation arithmetic
- Bit shift: often used for integer arithmetic such as division and multiplication
- Other: cache-specific operations, casting, bit insert, cache line flush,
  data prefetch, execution pause, etc.

SLIDE 14

Single Instruction Multiple Data (SIMD) Operations Using SIMD Operations

How to use SIMD operations

- Compiler auto-vectorization: requires a compiler with vectorizing
  capabilities. Least time consuming; performance is variable and entirely
  dependent on compiler quality.
- Compiler intrinsic functions: almost one-to-one mapping to assembly
  instructions, without having to deal with register allocation, instruction
  scheduling, type checking and call-stack maintenance.
- Inline assembly: writing assembly instructions directly into higher-level code.
- Low-level assembly: best approach for high performance. Most time consuming,
  least portable.

SLIDE 15

Single Instruction Multiple Data (SIMD) Operations Using SIMD Operations

Compiler Auto-vectorization

- Requires a vectorizing compiler, e.g. gcc, icc, clang
- Loop unrolling combined with the generation of packed SIMD instructions
- GCC enables vectorization with -O3
- Enabled with -O2 on the Intel compiler (icc)
- Instruction set specified by -msse2 (-msse4.1, -mavx) for Intel systems
- Enabled with -mfpu=neon on ARM systems
- Reports from the vectorization process:
  - -ftree-vectorizer-verbose=<level> (gcc), where level is between 1 and 5
  - -vec-report5 (Intel icc)

SLIDE 16

Single Instruction Multiple Data (SIMD) Operations Using SIMD Operations

Compiler Auto-vectorization - Loops

What kinds of loops are good candidates for auto-vectorization?

- Countable: the loop trip count must be known on entry to the loop at runtime
- Single entry and single exit: no break
- Straight-line code: different iterations must not have different flow control
  (the body must not branch); if statements are allowed if they can be
  implemented as masked assignments
- The innermost loop of a nest: loop interchange in previous optimization
  phases may make a loop innermost
- No function calls: some intrinsic math functions are allowed (sin, log, pow, etc.)
- No aliasing: pointers to vector arguments should be declared with the keyword
  restrict, which guarantees that no aliases exist for them

SLIDE 17

Single Instruction Multiple Data (SIMD) Operations Using SIMD Operations

Compiler Auto-vectorization – Obstacles

- Non-contiguous memory accesses: four consecutive floats (ints) can be loaded
  directly; if there is a stride, they have to be loaded separately using
  multiple instructions
- Non-aligned data structures: may result in multiple load instructions.
  Arrays can be aligned (here on 16-byte boundaries) dynamically or statically
  as follows:

float *a = (float *) memalign(16, N * sizeof(float));
float b[N] __attribute__((aligned(16)));

- Data dependencies: RAW, WAR, WAW

SLIDE 18

Single Instruction Multiple Data (SIMD) Operations Using SIMD Operations

Compiler Auto-vectorization - Example

SAXPY:

- Y = αX + Y in single precision
- Here α is a scalar constant and X, Y are SP vectors
- Used in the BLAS (Basic Linear Algebra Subprograms) library

void saxpy(int n, float a, float* __restrict__ X, float* __restrict__ Y) {
    int i;
    for (i = 0; i < n; i++)
        Y[i] = a*X[i] + Y[i];
}

Compile using vectorization options:

$ gcc -O3 -ftree-vectorizer-verbose=1 saxpy.cc -o saxpy
Analyzing loop at saxpy.cc:31
Vectorizing loop at saxpy.cc:31
saxpy.cc:31: note: === vect_do_peeling_for_alignment ===
saxpy.cc:31: note: === vect_update_inits_of_dr ===
saxpy.cc:31: note: === vect_do_peeling_for_loop_bound ===
Setting upper bound of nb iterations for epilogue loop to 2
saxpy.cc:31: note: LOOP VECTORIZED.
saxpy.cc:28: note: vectorized 1 loops in function.
saxpy.cc:31: note: Completely unroll loop 2 times
saxpy.cc:28: note: Completely unroll loop 3 times

SLIDE 19

Single Instruction Multiple Data (SIMD) Operations Using SIMD Operations

SIMD Data Types - Intel

    Data Type   Content                                   SSE   SSE2   SSE3   AVX
    __m64       8×char, 4×short, 2×int32, 2×float,
                1×int64, 1×double                         ✓     ✗      ✓      ✓
    __m128      4×float                                   ✓     ✓      ✓      ✓
    __m128d     2×double                                  ✗     ✓      ✓      ✓
    __m128i     16×char, 8×short, 4×int32, 2×int64        ✗     ✓      ✓      ✓
    __m256      8×float                                   ✗     ✗      ✗      ✓
    __m256d     4×double                                  ✗     ✗      ✗      ✓
    __m256i     32×char, 16×short, 8×int32, 4×int64       ✗     ✗      ✗      ✓

SLIDE 20

Single Instruction Multiple Data (SIMD) Operations Using SIMD Operations

SIMD Data Types - ARM

NEON vector data types follow the pattern <type><size>x<number of lanes>_t:

    64-bit type (D-register)   128-bit type (Q-register)   Content
    int8x8_t                   int8x16_t                   8-bit int
    int16x4_t                  int16x8_t                   16-bit int
    int32x2_t                  int32x4_t                   32-bit int
    int64x1_t                  int64x2_t                   64-bit int
    uint8x8_t                  uint8x16_t                  8-bit unsigned int
    uint16x4_t                 uint16x8_t                  16-bit unsigned int
    uint32x2_t                 uint32x4_t                  32-bit unsigned int
    uint64x1_t                 uint64x2_t                  64-bit unsigned int
    float16x4_t                float16x8_t                 16-bit floating-point
    float32x2_t                float32x4_t                 32-bit floating-point
    poly8x8_t                  poly8x16_t                  8-bit polynomial
    poly16x4_t                 poly16x8_t                  16-bit polynomial

- It is also possible to have types representing arrays of from 1 to 4 vectors,
  e.g. int8x8x2_t represents an array of two int8x8_t vectors
- Individual vectors in these arrays can be accessed using <var_name>.val[0], etc.

SLIDE 21

Single Instruction Multiple Data (SIMD) Operations Using SIMD Operations

Compiler Intrinsic Functions - Intel

SSE and AVX intrinsic function names use the following notational convention:
_mm_<opname>_<suffix>

- <opname>: indicates the basic operation of the intrinsic function;
  for example, add for addition and sub for subtraction
- <suffix>: denotes the type of data the instruction operates on
  - The first one or two letters of each suffix denote whether the data is:
    - p: packed
    - ep: extended packed
    - s: scalar
  - The remaining letters and numbers denote the type, as follows:
    - s: single-precision floating-point
    - d: double-precision floating-point
    - i32: signed 32-bit integer
    - u32: unsigned 32-bit integer
    - ...

SLIDE 22

Single Instruction Multiple Data (SIMD) Operations Using SIMD Operations

Compiler Intrinsic Functions - ARM

NEON intrinsic function names use the following notational convention:
v<opname><flags>_<type>

- <opname>: indicates the basic operation of the intrinsic function;
  for example, add for addition and sub for subtraction
- <flags>:
  - A number (between 1 and 4) denotes the array size of the result vector,
    i.e. size 1 (int16x4_t), size 2 (int16x4x2_t), size 3 (int16x4x3_t)
    or size 4 (int16x4x4_t)
  - q: denotes that Q registers must be used by both operands and the result
  - l: long shape; the number of bits in each result element is double the
    number of bits in each operand element, i.e. the operands are usually
    double-word vectors and the result is a quad-word vector
  - w: wide shape; the result and first operand are twice the width of the
    second operand, i.e. a quad-word and a double-word operand make a
    quad-word result
- <type>: denotes the type of data the instruction operates on

SLIDE 23

Single Instruction Multiple Data (SIMD) Operations Using SIMD Operations

Compiler Intrinsic Functions - Examples

Load four SP FP values (address aligned):

    Intel SSE2: __m128 _mm_load_ps(float* p);
    ARM NEON:   float32x4_t vld1q_f32(const float32_t* p);

    R0 R1 R2 R3 ← p[0] p[1] p[2] p[3]

Store four SP FP values (address must be 16-byte aligned):

    Intel SSE2: void _mm_store_ps(float* p, __m128 a);
    ARM NEON:   void vst1q_f32(float32_t* p, float32x4_t a);

    p[0] p[1] p[2] p[3] ← a0 a1 a2 a3

SLIDE 24

Single Instruction Multiple Data (SIMD) Operations Using SIMD Operations

Compiler Intrinsic Functions - Examples

Add four SP FP values:

    Intel SSE2: __m128 _mm_add_ps(__m128 a, __m128 b);
    ARM NEON:   float32x4_t vaddq_f32(float32x4_t a, float32x4_t b);

    R0 R1 R2 R3 ← a0+b0 a1+b1 a2+b2 a3+b3

Fused multiply-add four SP FP values:

    Intel FMA3: __m128 _mm_fmadd_ps(__m128 a, __m128 b, __m128 c);
    ARM NEON:   float32x4_t vfmaq_f32(float32x4_t a, float32x4_t b, float32x4_t c);

    R0 R1 R2 R3 ← a0*b0+c0 a1*b1+c1 a2*b2+c2 a3*b3+c3

SLIDE 25

Single Instruction Multiple Data (SIMD) Operations Using SIMD Operations

Compiler Intrinsic Functions - Portability

- Code vectorized with intrinsic functions for a specific processor is not portable:
  it only runs if the architecture supports the specific SIMD extension
- Portability is ensured by using conditional compilation in the code:
  - The program must contain both a scalar and a vectorized version of the
    same computation
  - Using #ifdef directives, the applicable version is chosen at compile time
  - For example, in GCC, if the -msse2 option is passed, then the macro
    __SSE2__ is defined
- Usage of conditional compilation:

#ifdef __SSE2__
    /* Code optimized with SSE2 intrinsic functions */
#elif defined(__ARM_NEON__)
    /* Code optimized with ARM NEON intrinsic functions */
#else
    /* Scalar code */
#endif

SLIDE 26

Single Instruction Multiple Data (SIMD) Operations Using SIMD Operations

Reference Manuals

- Intel: Intel Intrinsics Guide; Intel 64 and IA-32 Architectures Software Developer's Manual
- ARM: ARM NEON Programmer's Guide (version 1.0); ARM NEON Intrinsics in GCC

SLIDE 27

Single Instruction Multiple Data (SIMD) Operations Using SIMD Operations

Optimizing Vector Addition

C[i] = A[i] + B[i]

void VectorAdd(float *a, float *b, float *c, size_t size) {
    size_t i;
    for (i = 0; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}

SLIDE 28

Single Instruction Multiple Data (SIMD) Operations Using SIMD Operations

Optimizing Vector Addition - Intel SSE2

void VectorAddSSE(float* __restrict__ a, float* __restrict__ b,
                  float* __restrict__ c, size_t size) {
    size_t i;
    for (i = 0; i < (size/4)*4; i += 4) {
        /* Load into SSE XMM registers */
        __m128 sse_a = _mm_load_ps(&a[i]);
        __m128 sse_b = _mm_load_ps(&b[i]);

        /* Perform addition */
        __m128 sse_c = _mm_add_ps(sse_a, sse_b);

        /* Store back to memory */
        _mm_store_ps(&c[i], sse_c);
    }

    /* Handle remaining elements */
    for (i = (size/4)*4; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}

SLIDE 29

Single Instruction Multiple Data (SIMD) Operations Using SIMD Operations

Optimizing Vector Addition - ARM NEON

void VectorAddNEON(float32_t* __restrict__ a, float32_t* __restrict__ b,
                   float32_t* __restrict__ c, size_t size) {
    size_t i;
    /* Declare vector data types */
    float32x4_t a4, b4, c4;

    for (i = 0; i < (size/4)*4; i += 4) {
        /* Load into quad NEON registers */
        a4 = vld1q_f32(a + i);
        b4 = vld1q_f32(b + i);

        /* Perform addition */
        c4 = vaddq_f32(a4, b4);

        /* Store back to memory */
        vst1q_f32(c + i, c4);
    }

    /* Handle remaining elements */
    for (i = (size/4)*4; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}

SLIDE 30

Single Instruction Multiple Data (SIMD) Operations Using SIMD Operations

Hands-on Exercise: Vectorizing Loops

Objective: Vectorizing loops using SSE & NEON instructions

SLIDE 31

Cache Basics

Outline

1. Single Instruction Multiple Data (SIMD) Operations

2. Cache Basics

3. Multiprocessor Cache Organization

4. Thread Basics

SLIDE 32

Cache Basics

Importance

- An infinitely fast CPU must still load and store its data to memory
- HPC applications have execution time t(n) = O(n^m) for problem size n,
  i.e. between n and t(n) data accesses for t(n) instructions:
  - Discrete Fourier Transform: O(n) data items, O(n log n) or O(n^2) operations
  - Reduction to upper/lower triangular form: O(n^2) data items, O(n^3) operations
  - Forward/backward substitution: O(n^2) data items, O(n^2) operations
- Hence we need large memories with fast access times. Achieve this via the
  cost (per bit) vs speed trade-off:

    cache memory                    low
    wide (parallel) memory access   moderate
    faster technology (?)           high

SLIDE 33

Cache Basics

Access Times

- Memory access time: the time it takes to read or write a memory location
- Memory cycle time: how quickly a memory access can be repeated
- For example, a memory chip may have an access time of 200 ns but a cycle time of 50 ns
- Over the last ≈20 years, CPU speed has improved much faster than memory access times:
  - In the mid 1980s, commodity DRAMs had an access time of 200 ns, while the
    IBM PC had a CPU clock of 4.77 MHz (210 ns per cycle)
  - Today commodity DRAMs have access times of around 50 ns, while CPU cycle
    times have decreased to 1 ns or less
  - This is in part because the CPU is now within one chip, while memory is
    on external chips
- Memory sizes are also now much larger

SLIDE 34

Cache Basics

Memory Technologies

Two main memory technologies:

                         cost/bit   access time     used in:
    SRAM (Static RAM)    high       low (≈ 10 ns)   caches
    DRAM (Dynamic RAM)   low        high (≈ 50 ns)  most main memories

- SRAM: each bit uses at least 3 transistors (6 for best results) and a
  constant power supply
- DRAM: each bit uses 1 transistor and a capacitor; over time, the charge
  leaks from the capacitor and it must be refreshed
- A memory access involves the stages:
  1. select the row address
  2. select the column address
  3. (read/write) access the selected bit(s)
  as well as transferring addresses and data over the relevant bus.
  Hence there is scope for pipelining.

SLIDE 35

Cache Basics

Memory Hierarchy

Modern microprocessors have a memory hierarchy:

                     Access speed
    Registers        Clock cycle
    Cache            Few cycles
    Memory (DRAM)    Many cycles
    Virtual memory   Long!

- Idea: data that is "currently most needed" is brought into a (smaller) faster memory
- Observation: memory accesses in most programs exhibit:
  - Temporal locality: if address X is accessed, X is likely to be accessed again soon
  - Spatial locality: if address X is accessed, X+1 is likely to be accessed soon
- ⇒ caches are organized into lines (units) of L words (L = 2^l, e.g. L = 2, 4, 8)
  ✓ blocked memory accesses (faster) and less control information needed (per word)
  ✗ redundant memory traffic if only one word per line is ever used,
    e.g. pointer chasing (non-unit stride guaranteed!)
- If locality holds, this yields a good cost-speed trade-off

SLIDE 36

Cache Basics

Memory Hierarchy

Intel Core i7 Xeon 5500 at 2.8 GHz:

                    Cycles   Time (≈ ns)
    Registers       1        0.3
    L1 cache hit    4        1.5
    L2 cache hit    10       3.5
    L3 cache hit    40       15
    Local DRAM      160      60
    Remote DRAM     280      100

Memory access time includes the cost of getting data over the bus.

SLIDE 37

Cache Basics

Registers

- A very limited resource, accessed in one cycle
- The goal is to keep operands in registers as much as possible
- Example: X = G * 2.41 + A/W - W/B
  - RISC instructions are limited to two source operands, so the results of
    G * 2.41, A/W and W/B must be stored back to registers before adding them together
  - Also, W is used twice, and we don't want it loaded from (slow) memory twice
- A fundamental job of the compiler is to optimize register use, e.g.

ld   [%W],%f1        !load W from memory into register f1
ld   [%A],%f2        !load A from memory into register f2
fdiv %f2,%f1,%f2     !form A/W and overwrite f2 (A) with the result
ld   [%B],%f3        !load B from memory into register f3
fdiv %f1,%f3,%f3     !form W/B and overwrite f3 (B) with the result

SLIDE 38

Cache Basics

Cache Memory

- A small amount of SRAM memory
- Cache hit rate: the % of (word) accesses in a program for which the data is in cache
  - Needs to be high (e.g. > 95%) for good performance
  - Problem: programs may need to be re-written to achieve this!
  - Only possible with sufficient inherent data re-use g(n)/n, e.g.:
    - (n^(1/2) × n^(1/2)) matrix multiply: g(n) = 2n^(3/2)
      (mixed unit and non-unit stride)
    - FFT (Fast Fourier Transform): g(n) = 8n log2(n) (power-of-2 stride)
  - Thus, the FFT is more likely to be dominated by memory access time
- Consistency of data cache and main memory: when a store instruction is
  executed, the relevant line is updated first in the cache
  - write-through: the resulting word is written to main memory at the same
    time (slow, but consistency is maintained)
  - copy-back: a ('dirty') cache line is written to memory when it is evicted
    from the cache

SLIDE 39

Cache Basics

Cache Friendly

Friendly:

for (i = 0; i < 100000; i += 1) {
    sum += a[i];              /* unit stride */
}

double a[200][200];
for (i = 0; i < 200; i++) {
    for (j = 0; j < 200; j++) {
        sum += a[i][j];       /* row-major traversal: unit stride */
    }
}

Unfriendly:

for (i = 0; i < 800000; i += 8) {
    sum += a[i];              /* stride of 8: one element used per cache line */
}

double a[200][200];
for (i = 0; i < 200; i++) {
    for (j = 0; j < 200; j++) {
        sum += a[j][i];       /* column traversal: stride of 200 doubles */
    }
}

while (ptr->next != NULL)
    ptr = ptr->next;          /* pointer chasing: no spatial locality */

SLIDE 40

Cache Basics

Direct-Mapped Caches

- Cache memory is organized into lines of size 2^l; for a cache of size 2^c
  there are C' = 2^(c-l) lines
- A 32-bit address X is split into three fields: a tag a0 (bits 31..c),
  a line index a1 (bits c-1..l) and an offset within the line a2 (bits l-1..0)
- All addresses with the same a1 are mapped to the same cache line
  ✓ easy to implement, with low chip area per word (note: the cache must store
    the value of a0), so a large C' is possible (better performance)
  ✗ cache conflicts: 2 (or more) words from memory map to the same cache line
    - conflicts can be instruction-instruction, instruction-data and data-data
    - they can cause large, often unpredictable, performance losses

SLIDE 41

Cache Basics

K-way Set Associative Caches

- Like a direct-mapped cache of size C', but every line is extended to a set
  of K lines (the total number of cache lines is now C'K)
- Addresses with the same a1 can map into the corresponding line of any of the K sets
- Typically K = 1, 2, 4, 5, 6, 8
- Reduces the chance of conflicts by a factor of K, at some extra cost

Examples of cache thrashing:

4 KB direct-mapped cache:

float a[1024], b[1024];
for (i = 0; i < 1024; i++) {
    a[i] = a[i] + b[i];
}

4 KB 2-way set associative cache:

float a[1024], b[1024], c[1024];
for (i = 0; i < 1024; i++) {
    a[i] = a[i] + b[i] + c[i];
}

SLIDE 42

Cache Basics

Cache: Other Issues

- Modern processors usually have separate data and instruction 1st-level caches
- Multiple (2 or 3) levels of cache, i.e. a deep memory hierarchy
  - Typically: top-level (data) cache: C' = 16 KB, K = 4, write-through;
    2nd-level (instruction/data) cache: C' = 1 MB, K = 1, copy-back
  ✗ Harder still to tune a program for 2 levels!
- (Top-level) cache prefetching (requires load/store pipelines)
  - An access time (latency) of δ cycles can be hidden by performing each
    load δ cycles in advance of when it is needed
  - Can be done via:
    - prefetching or software pipelining (by the programmer or compiler,
      e.g. UltraSPARC)
    - hardware instruction re-ordering (e.g. Pentium IV): more effective,
      but the hardware is more expensive and complex
- Increase the data bus width to the cache line size L

SLIDE 43

Cache Basics

Hands-on Exercise: IPC and Cache Misses

Objective: Using the PAPI advanced interface to measure instructions per cycle for a variety of loops and cache behaviour

SLIDE 44

Multiprocessor Cache Organization

Outline

1. Single Instruction Multiple Data (SIMD) Operations

2. Cache Basics

3. Multiprocessor Cache Organization

4. Thread Basics

SLIDE 45

Multiprocessor Cache Organization

Shared Memory Hardware

(Fig 2.5 Grama et al, Intro to Parallel Computing)

SLIDE 46

Multiprocessor Cache Organization

Shared Address Space Systems

- Systems with caches but otherwise flat memory are generally called UMA
- If access to local memory is cheaper than to remote memory (NUMA), this
  should be built into your algorithm
- Global address space systems are easier to program:
  - Read-only interactions are invisible to the programmer and coded like a
    sequential program
  - Read/write interactions are harder, requiring mutual exclusion for
    concurrent accesses
- Programmed using threads, with synchronization using locks and related mechanisms

SLIDE 47

Multiprocessor Cache Organization

Cache Hierarchy on Intel Core i7 (2013)

(64 byte cache line size)

Ref: http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1

SLIDE 48

Multiprocessor Cache Organization

Caches on Multiprocessors

- Multiple copies of some data word may be manipulated by two or more
  processors at the same time
- There is usually a separate I-cache (instruction cache) at the first 1-2
  levels, usually of similar size to the data caches
  - How is instruction access different to data access? Why is this useful?
- Two requirements:
  - An address translation mechanism that locates each physical memory word
    in the system
  - Concurrent operations on multiple copies have well-defined semantics
- The latter is generally known as a cache coherency protocol
- Input/output using direct memory access (DMA) on machines with caches also
  leads to coherency issues
- Some machines only provide shared address space mechanisms and leave
  coherence to the programmer, e.g. the Texas Instruments Keystone II system

SLIDE 49

Multiprocessor Cache Organization

Cache Coherency

- Intuitive behaviour: reading the value at address X should return the last
  value written to address X by any processor
- But what does "last" mean? What if writes are simultaneous, or closer in
  time than the time required to communicate between two processors?
- In a sequential program, "last" is determined by program order (not time)
- This holds true within a thread of a parallel program, but what does it
  mean with multiple threads?

SLIDE 50

Multiprocessor Cache Organization

Cache/Memory Coherency

A memory system is coherent if:

- Ordered as issued: a read by processor P of address X that follows a write
  by P to X returns the value of that write (assuming no other processor
  writes to X in between)
- Write propagation: a read by processor P1 of address X that follows a write
  by processor P2 to X returns the written value, if the read and write are
  sufficiently separated in time (assuming no other write to X occurs in between)
- Write serialization: writes to the same address are serialized; two writes
  by any two processors are observed in the same order by all processors

(Later to be contrasted with memory consistency!)

SLIDE 51

Multiprocessor Cache Organization

Two Cache Coherency Protocols

(Fig 2.21 Grama et al, Intro to Parallel Computing)

SLIDE 52

Multiprocessor Cache Organization

Cache Line View

Ref: http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1

Need to augment cache line information with information regarding validity

SLIDE 53

Multiprocessor Cache Organization

Update vs. Invalidate

Update Protocol

When a data item is written, all of its copies in the system are updated

Invalidate Protocol (most common)

Before a data item is written, all other copies are marked as invalid

Comparison

Multiple writes to the same word with no intervening reads require multiple write broadcasts in an update protocol, but only one initial invalidation
With multiword cache blocks, each word written in a cache block must be broadcast in an update protocol, but only one invalidate per line is required
The delay between writing a word on one processor and reading the written value on another is usually less for update

SLIDE 54

Multiprocessor Cache Organization

False Sharing

Two processors modify different parts of the same cache line

An invalidate protocol leads to ping-ponged cache lines
An update protocol performs reads locally, but updates generate much traffic between processors
This effect is entirely an artefact of the hardware
Need to design parallel systems / programs with this issue in mind

Cache line size: the longer the line, the more likely false sharing becomes
Alignment of data structures with respect to cache line size

http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1

SLIDE 55

Multiprocessor Cache Organization

Implementing Cache Coherency

On small scale bus based machines

A processor must obtain access to the bus to broadcast a write invalidation
With two competing processors, the first to gain access to the bus will invalidate the other's data

A cache miss needs to locate the top copy of the data

Easy for a write-through cache
For write-back caches, each processor snoops the bus and responds by providing data if it has the top copy

For writes, we would like to know if any other copies of the block are cached

i.e. whether a write-back cache needs to put details on the bus
Handled by having a tag to indicate shared status

Minimizing processor stalls

Either by duplicating tags or by using multilevel inclusive caches

SLIDE 56

Multiprocessor Cache Organization

3 State (MSI) Cache Coherency Protocol

read: local read
write: local write
c read (coherency read): a read on a remote processor gives rise to the shown transition in the local cache
c write (coherency write): a write miss, or a write in the Shared state, on a remote processor gives rise to the shown transition in the local cache

(Fig 2.22 Grama et al, Intro to Parallel Computing)

SLIDE 57

Multiprocessor Cache Organization

MSI Coherency Protocol

(Fig 2.23 Grama et al, Intro to Parallel Computing)

SLIDE 58

Multiprocessor Cache Organization

Snoopy Cache Systems

All caches broadcast all transactions (read or write misses, writes in S state)

Suited to bus or ring interconnects
However, scalability is limited (typically ≤ 8 processors)

All processors monitor the bus for transactions of interest
Each processor's cache has a set of tag bits that determine the state of the cache block

Tags are updated according to state diagram for relevant protocol

e.g. when the snoop hardware detects a read for a cache block of which it holds a dirty copy, it asserts control of the bus and puts the data out (to the requesting cache and to main memory), then sets the tag to the S state
What sort of data access characteristics are likely to perform well/badly on snoopy-based systems?

SLIDE 59

Multiprocessor Cache Organization

Snoopy Cache Based System

(Fig 2.24 Grama et al, Intro to Parallel Computing)

SLIDE 60

Multiprocessor Cache Organization

Snoopy Cache-Based System: Ring

The Core i7 (Sandy Bridge) on-chip interconnect revisited:

a ring-based interconnect between Cores, Graphics, Last Level Cache (LLC) and System Agent domains
has 4 physical rings: Data (32B), Request, Acknowledge and Snoop
rings fully pipelined; bandwidth, latency and power scale with cores
shortest path chosen to minimize latency
has distributed arbitration & sophisticated protocols to handle coherency and ordering

(courtesy www.lostcircuits.com)

SLIDE 61

Multiprocessor Cache Organization

Directory Cache Based Systems

The need to broadcast is clearly not scalable

A solution is to only send information to processing elements specifically interested in that data

This requires a directory to store information

Augment global memory with a presence bitmap to indicate which caches each memory block is located in

SLIDE 62

Multiprocessor Cache Organization

Directory Based Cache Coherency

To implement, we must track the state of each cache block
A simple protocol might be:

Shared: one or more processors have the block cached, and the value in memory is up to date
Uncached: no processor has a copy
Exclusive: only one processor (the owner) has a copy, and the value in memory is out of date

Must handle a read/write miss and a write to a shared, clean cache block

These first reference the directory entry to determine the current state of this block
Then update the entry's status and presence bitmap
Send the appropriate state update transactions to the processors in the presence bitmap

SLIDE 63

Multiprocessor Cache Organization

Directory-Based Cache Coherency

(Fig 2.25 Grama et al, Intro to Parallel Computing)

SLIDE 64

Multiprocessor Cache Organization

Directory-Based Systems

How much memory is required to store the directory?
What sort of data access characteristics are likely to perform well/badly on directory-based systems?

How do distributed and centralized systems compare?

Should the presence bitmaps be replicated in the caches? Must they be?
How would you implement sending an invalidation message to all (and only to all) processors in the presence bitmap?

SLIDE 65

Multiprocessor Cache Organization

Costs on SGI Origin 3000 (clock cycles)

                                             <= 16 CPU   > 16 CPU
Cache hit                                         1           1
Cache miss to local memory                       85          85
Cache miss to remote home directory             125         150
Cache miss to remotely cached data (3 hop)      140         170

Figure from http://people.nas.nasa.gov/~schang/origin_opt.html
Data from: Computer Architecture: A Quantitative Approach, by David A. Patterson, John L. Hennessy and David Goldberg, 3rd ed., Morgan Kaufmann, 2003

SLIDE 66

Multiprocessor Cache Organization

Real Cache Coherency Protocols

From Wikipedia

Most modern systems use variants of the MSI protocol to reduce the amount of traffic in the coherency interconnect.
The MESI protocol adds an "Exclusive" state to reduce the traffic caused by writes of blocks that only exist in one cache.
The MOSI protocol adds an "Owned" state to reduce the traffic caused by write-backs of blocks that are read by other caches [the processor owning the cache line services requests for that data].
The MOESI protocol does both of these things.
The MESIF protocol uses the "Forward" state to reduce the traffic caused by multiple responses to read requests when the coherency architecture allows caches to respond to snoop requests with data.

SLIDE 67

Multiprocessor Cache Organization

MESI (on a bus)

Ref: https://www.cs.tcd.ie/Jeremy.Jones/vivio/caches/MESIHelp.htm

SLIDE 68

Multiprocessor Cache Organization

Multi-Level Cache

What is the visibility of changes between levels of cache?

http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1

The easiest model is inclusive:

If a line is in the owned state in L1, it is also in the owned state in L2

SLIDE 69

Multiprocessor Cache Organization

Cache Summary

Cache coherency arises because the abstraction of a single shared address space is not actually implemented by a single storage unit in a machine
Three components to cache coherency

Issue order, write propagation, write serialization

Two implementations

Broadcast and Directory

False sharing is a potential performance issue

More likely the longer the cache line

SLIDE 70

Multiprocessor Cache Organization

Hands-on Exercise: Matrix Multiplication Performance

Objective: To understand the effect of various matrix multiply loop orderings on IPC and cache misses

SLIDE 71

Thread Basics

Outline

1 Single Instruction Multiple Data (SIMD) Operations

2 Cache Basics

3 Multiprocessor Cache Organization

4 Thread Basics

SLIDE 72

Thread Basics

Fork/Join Programming Model

SLIDE 73

Thread Basics

(Heavyweight) UNIX Processes

An O/S like UNIX is based on the notion of a process

The CPU is shared between different processes

UNIX processes are created via fork()

The child is an exact copy of the parent, except that it has a unique process ID

Processes are "joined" using the system call wait()

This introduces a synchronization point

SLIDE 74

Thread Basics

UNIX Fork Example

pid = fork ();
if (pid == 0) {
  // code to be executed by child
} else {
  // code to be executed by parent
}
if (pid == 0)
  exit (0);
else
  wait (0);

SLIDE 75

Thread Basics

Processes and Threads

SLIDE 76

Thread Basics

Why Threads

Software Portability

Applications can be developed on serial machine and run on parallel machines without changes (is this really true?)

Latency Hiding

Ability to mask access to memory, I/O or communication by having another thread execute in the meantime (but how quickly can execution switch between threads?)

Scheduling and Load Balancing

For unstructured and dynamic applications (e.g. game playing), load balancing can be very hard. One option is to create more threads than CPU resources and let the O/S sort out the scheduling

Ease of Programming, Widespread Use

Due to above, threaded programs are easier to write (or develop incrementally), so there has been widespread acceptance of the POSIX thread API (generally referred to as pthreads)

SLIDE 77

Thread Basics

Pthread Creation

SLIDE 78

Thread Basics

Threads and Threaded Code

pthread_self() provides the ID of the calling thread

pthread_equal(thread1, thread2) tests whether two thread IDs are equal

Detached Threads

Threads that will never synchronize via a join
Specified via an attribute

Re-entrant or thread-safe routines are those that can be safely called while another instance has been suspended in the middle of its invocation

SLIDE 79

Thread Basics

Example: Computing Pi

Ratio of the area of the circle to that of the square is π/4

Guess points within the domain of the square at random
Identify those that are at distance less than 1 from the origin

Ratio of points in the circle to total points is π/4

SLIDE 80

Thread Basics

Example: Computing Pi

#include <pthread.h>
#include <stdlib.h>
#define MAX_THREADS 512
void *compute_pi (void *);
....
main () {
  ...
  pthread_t p_threads[MAX_THREADS];
  pthread_attr_t attr;
  pthread_attr_init (&attr);
  for (i = 0; i < num_threads; i++) {
    hits[i] = i;
    pthread_create (&p_threads[i], &attr,
                    compute_pi, (void *) &hits[i]);
  }
  for (i = 0; i < num_threads; i++) {
    pthread_join (p_threads[i], NULL);
    total_hits += hits[i];
  }
  ...
}

SLIDE 81

Thread Basics

Example: Computing Pi (cont)

void *compute_pi (void *s) {
  unsigned int seed;  // rand_r expects unsigned int *
  int i, *hit_pointer;
  int local_hits;
  hit_pointer = (int *) s;
  seed = *hit_pointer;
  local_hits = 0;
  double rx, ry;
  for (i = 0; i < sample_points_per_thread; i++) {
    rx = ((double) rand_r (&seed)) / RAND_MAX - 0.5;
    ry = ((double) rand_r (&seed)) / RAND_MAX - 0.5;
    if (rx*rx + ry*ry < 0.25)
      local_hits++;
  }
  *hit_pointer = local_hits;
  pthread_exit (0);
}

SLIDE 82

Thread Basics

Programming and Performance Notes

Note the use of the function rand_r (instead of superior random number generators such as drand48).
Executing this on a 4-processor SGI Origin, we observe a 3.91-fold speedup at 32 threads.
This corresponds to a parallel efficiency of 0.98!
We can also modify the program slightly to observe the effect of false sharing.
The program can also be used to assess the secondary cache line size.

SLIDE 83

Thread Basics

Performance

(Fig 7.2 Grama et al, Intro to Parallel Computing)

4-processor SGI Origin system using up to 32 threads
Instead of incrementing local hits, we add to a shared array using strides of 1, 16 and 32

SLIDE 84

Thread Basics

Hands-on Exercise: False Cache Line Aliasing

Objective: To observe the effect of cache line contention and false cache line sharing on performance

SLIDE 85

Thread Basics

Summary

Topics covered today - Vectorization & Cache Organization:

SIMD operations
Cache basics
Multiprocessor cache organization

Tomorrow - Multi Processor Parallelism!
