slide-1
SLIDE 1

Changelog

Changes made in this version not seen in first lecture:

15 November: vector add picture: make order of result consistent with order of inputs

15 November: correct square to matmul on several vector slides

15 November: correct mixups of A and B, B and C on several matmul vector slides

15 November: correct some si128 instances to si256 in vectorization slides

15 November: addressing transformation: correct more A/B/C mixups

slide-2
SLIDE 2

Vector Insts / Profilers / Exceptions intro

1

slide-3
SLIDE 3

last time

loop unrolling / cache blocking

instruction queues and out-of-order execution:
    list of available instructions
    multiple execution units (ALUs + other things that can run instructions)
    each cycle: ready instructions move from the queue to execution units

reassociation:
    reorder operations to reduce data dependencies, expose more parallelism
    multiple accumulators — reassociation for loops

shifting bottlenecks:
    need to optimize what’s slowest — it determines the longest latency
    e.g. loop unrolling helps until you hit the parallelism limit…, but after improving parallelism, more loop unrolling helps again
    e.g. cache optimizations won’t matter until loop overhead is lowered, or vice-versa

2

slide-4
SLIDE 4

aliasing problems with cache blocking

for (int k = 0; k < N; k++) {
    for (int i = 0; i < N; i += 2) {
        for (int j = 0; j < N; j += 2) {
            C[(i+0)*N + j+0] += A[i*N+k] * B[k*N+j];
            C[(i+1)*N + j+0] += A[(i+1)*N+k] * B[k*N+j];
            C[(i+0)*N + j+1] += A[i*N+k] * B[k*N+j+1];
            C[(i+1)*N + j+1] += A[(i+1)*N+k] * B[k*N+j+1];
        }
    }
}

can compiler keep A[i*N+k] in a register?

3

slide-5
SLIDE 5

“register blocking”

for (int k = 0; k < N; ++k) {
    for (int i = 0; i < N; i += 2) {
        float Ai0k = A[(i+0)*N + k];
        float Ai1k = A[(i+1)*N + k];
        for (int j = 0; j < N; j += 2) {
            float Bkj0 = B[k*N + j+0];
            float Bkj1 = B[k*N + j+1];
            C[(i+0)*N + j+0] += Ai0k * Bkj0;
            C[(i+1)*N + j+0] += Ai1k * Bkj0;
            C[(i+0)*N + j+1] += Ai0k * Bkj1;
            C[(i+1)*N + j+1] += Ai1k * Bkj1;
        }
    }
}

4

slide-6
SLIDE 6

vector instructions

modern processors have registers that hold a “vector” of values
example: current x86-64 processors have 256-bit registers

8 ints or 8 floats or 4 doubles or …

256-bit registers named %ymm0 through %ymm15
instructions that act on all values in a register

vector instructions or SIMD (single instruction, multiple data) instructions

extra copies of ALUs only accessed by vector instructions
(also 128-bit versions named %xmm0 through %xmm15)

5

slide-7
SLIDE 7

example vector instruction

vpaddd %ymm0, %ymm1, %ymm2 (packed add dword (32-bit))

Suppose the registers contain (interpreted as 8 ints):

%ymm0: [1, 2, 3, 4, 5, 6, 7, 8]
%ymm1: [9, 10, 11, 12, 13, 14, 15, 16]

Result will be:

%ymm2: [10, 12, 14, 16, 18, 20, 22, 24]

6

slide-8
SLIDE 8

vector instructions

void add(int * restrict a, int * restrict b) {
    for (int i = 0; i < 512; ++i)
        a[i] += b[i];
}

add:
    xorl %eax, %eax
the_loop:
    vmovdqu (%rdi,%rax), %ymm0   /* load A into ymm0 */
    vmovdqu (%rsi,%rax), %ymm1   /* load B into ymm1 */
    vpaddd %ymm1, %ymm0, %ymm0   /* ymm1 + ymm0 -> ymm0 */
    vmovdqu %ymm0, (%rdi,%rax)   /* store ymm0 into A */
    addq $32, %rax               /* increment index by 32 bytes */
    cmpq $2048, %rax
    jne the_loop
    vzeroupper                   /* ← for calling convention reasons */
    ret
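For reference, a compiler invocation that would plausibly produce AVX2 code like this (the exact flags are an assumption, not from the slides):

gcc -O3 -mavx2 -S add.c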

7

slide-9
SLIDE 9

vector add picture

[diagram: arrays A and B laid out in memory; vmovdqu loads A[8]…A[15] into %ymm0 and B[8]…B[15] into %ymm1, then vpaddd produces A[8]+B[8] through A[15]+B[15]]

8

slide-10
SLIDE 10
one view of vector functional units

[diagram: a vector ALU built from four lanes of pipelined scalar ALUs (lanes 1-4, each with stages 1-3); one set of input values enters per cycle and one set of output values leaves per cycle]

9

slide-11
SLIDE 11

why vector instructions?

lots of logic not dedicated to computation

instruction queue, reorder buffer, instruction fetch, branch prediction, …

adding vector instructions — little extra control logic …but a lot more computational capacity

10

slide-12
SLIDE 12

vector instructions and compilers

compilers can sometimes figure out how to use vector instructions

(and have gotten much, much better at it over the past decade)

but easily messed up:

    by aliasing
    by conditionals
    by some operation with no vector instruction
    …

11

slide-13
SLIDE 13

fickle compiler vectorization (1)

GCC 8.2 and Clang 7.0 generate vector instructions for this:

#define N 1024
void foo(unsigned int *A, unsigned int *B) {
    for (int k = 0; k < N; ++k)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                B[i * N + j] += A[i * N + k] * A[k * N + j];
}

but not:

#define N 1024
void foo(unsigned int *A, unsigned int *B) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                B[i * N + j] += A[i * N + k] * A[j * N + k];
}

12

slide-14
SLIDE 14

fickle compiler vectorization (2)

Clang 5.0.0 generates vector instructions for this:

void foo(int N, unsigned int *A, unsigned int *B) {
    for (int k = 0; k < N; ++k)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                B[i * N + j] += A[i * N + k] * A[k * N + j];
}

but not: (fixed in later versions)

void foo(long N, unsigned int *A, unsigned int *B) {
    for (long k = 0; k < N; ++k)
        for (long i = 0; i < N; ++i)
            for (long j = 0; j < N; ++j)
                B[i * N + j] += A[i * N + k] * A[k * N + j];
}

13

slide-15
SLIDE 15

vector intrinsics

if the compiler doesn’t work… could write vector instruction assembly by hand
second option: “intrinsic functions”
    C functions that compile to particular instructions

14

slide-16
SLIDE 16

vector intrinsics: add example

void vectorized_add(int *a, int *b) {
    for (int i = 0; i < 128; i += 8) {
        // "si256" --> 256-bit integer
        // a_values = {a[i], a[i+1], ..., a[i+7]}
        __m256i a_values = _mm256_loadu_si256((__m256i*) &a[i]);
        // b_values = {b[i], b[i+1], ..., b[i+7]}
        __m256i b_values = _mm256_loadu_si256((__m256i*) &b[i]);
        // add eight 32-bit integers
        // sums = {a[i] + b[i], a[i+1] + b[i+1], ....}
        __m256i sums = _mm256_add_epi32(a_values, b_values);
        // {a[i], a[i+1], ..., a[i+7]} = sums
        _mm256_storeu_si256((__m256i*) &a[i], sums);
    }
}

special type __m256i — “256 bits of integers”

other types: __m256 (floats), __m128d (doubles)

functions to store/load
    si256 means “256-bit integer value”
    u for “unaligned” (otherwise, pointer address must be a multiple of 32)
function to add
    epi32 means “8 32-bit integers”

15


slide-20
SLIDE 20

vector intrinsics: different size

void vectorized_add_64bit(long *a, long *b) {
    for (int i = 0; i < 128; i += 4) {
        // a_values = {a[i], a[i+1], ...} (4 x 64 bits)
        __m256i a_values = _mm256_loadu_si256((__m256i*) &a[i]);
        // b_values = {b[i], b[i+1], ...} (4 x 64 bits)
        __m256i b_values = _mm256_loadu_si256((__m256i*) &b[i]);
        // add four 64-bit integers: vpaddq %ymm0, %ymm1
        // sums = {a[i] + b[i], a[i+1] + b[i+1], ...}
        __m256i sums = _mm256_add_epi64(a_values, b_values);
        // {a[i], a[i+1], ...} = sums
        _mm256_storeu_si256((__m256i*) &a[i], sums);
    }
}

16


slide-22
SLIDE 22

128-bit version, too

history: 256-bit vectors added in an extension called AVX (c. 2011)
before that: 128-bit vectors added in an extension called SSE (c. 1999)
128-bit intrinsics exist, too:

__m256i becomes __m128i
_mm256_add_epi32 becomes _mm_add_epi32
_mm256_loadu_si256 becomes _mm_loadu_si128
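For instance, a 128-bit (SSE) version of the earlier add loop might look like this sketch (not from the slides; assumes SSE2 is available and the array length is a multiple of 4):

#include <emmintrin.h>   /* SSE2 intrinsics */

void vectorized_add_sse(int *a, int *b) {
    for (int i = 0; i < 128; i += 4) {
        /* load four 32-bit ints from each array (unaligned loads) */
        __m128i a_values = _mm_loadu_si128((__m128i*) &a[i]);
        __m128i b_values = _mm_loadu_si128((__m128i*) &b[i]);
        /* add four 32-bit integers at once */
        __m128i sums = _mm_add_epi32(a_values, b_values);
        /* store the four sums back into a */
        _mm_storeu_si128((__m128i*) &a[i], sums);
    }
}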

17

slide-23
SLIDE 23

intrinsics in assignments

smooth assignment: you will use intrinsics
    (compiler vectorization is disabled)
goal: you understand how the vectorization optimization works
goal: in case you need to do more than the compiler would do

missing “pattern” for how to use vectors, aliasing, code size tradeoffs, …

18

slide-24
SLIDE 24

matrix multiply

void matmul(unsigned int *A, unsigned int *B, unsigned int *C) {
    for (int k = 0; k < N; ++k)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

(simple version, no cache blocking, no avoiding aliasing between C, B, A, …)

19

slide-25
SLIDE 25

matmul unrolled

void matmul(unsigned int *A, unsigned int *B, unsigned int *C) {
    for (int k = 0; k < N; ++k) {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; j += 8) {
                /* goal: vectorize this */
                C[i * N + j + 0] += A[i * N + k] * B[k * N + j + 0];
                C[i * N + j + 1] += A[i * N + k] * B[k * N + j + 1];
                C[i * N + j + 2] += A[i * N + k] * B[k * N + j + 2];
                C[i * N + j + 3] += A[i * N + k] * B[k * N + j + 3];
                C[i * N + j + 4] += A[i * N + k] * B[k * N + j + 4];
                C[i * N + j + 5] += A[i * N + k] * B[k * N + j + 5];
                C[i * N + j + 6] += A[i * N + k] * B[k * N + j + 6];
                C[i * N + j + 7] += A[i * N + k] * B[k * N + j + 7];
            }
    }
}

(NB: would probably also want to do cache blocking…)

20

slide-26
SLIDE 26

handy intrinsic functions for matmul

_mm256_set1_epi32 — load eight copies of a 32-bit value into a 256-bit value

instructions generated vary; one example: vmovd + vpbroadcastd

_mm256_mullo_epi32 — multiply eight pairs of 32-bit values, give lowest 32-bits of results

generates vpmulld
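A tiny standalone illustration of what these two intrinsics compute (example values chosen here, not from the slides; needs immintrin.h and AVX2); the following slides use them inside matmul:

#include <immintrin.h>

void set1_mullo_demo(void) {
    /* fives  = {5, 5, 5, 5, 5, 5, 5, 5} */
    __m256i fives  = _mm256_set1_epi32(5);
    /* values = {1, 2, 3, 4, 5, 6, 7, 8} */
    __m256i values = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);
    /* elementwise multiply, keeping the low 32 bits of each product:
       products = {5, 10, 15, 20, 25, 30, 35, 40} */
    __m256i products = _mm256_mullo_epi32(fives, values);
    (void) products;   /* just illustrating the values */
}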

21

slide-27
SLIDE 27

vectorizing matmul

/* goal: vectorize this */
C[i * N + j + 0] += A[i * N + k] * B[k * N + j + 0];
C[i * N + j + 1] += A[i * N + k] * B[k * N + j + 1];
...
C[i * N + j + 6] += A[i * N + k] * B[k * N + j + 6];
C[i * N + j + 7] += A[i * N + k] * B[k * N + j + 7];

22

slide-28
SLIDE 28

vectorizing matmul

/* goal: vectorize this */
C[i * N + j + 0] += A[i * N + k] * B[k * N + j + 0];
C[i * N + j + 1] += A[i * N + k] * B[k * N + j + 1];
...
C[i * N + j + 6] += A[i * N + k] * B[k * N + j + 6];
C[i * N + j + 7] += A[i * N + k] * B[k * N + j + 7];

// load eight elements from C
Cij = _mm256_loadu_si256((__m256i*) &C[i * N + j + 0]);
... // manipulate vector here
// store eight elements into C
_mm256_storeu_si256((__m256i*) &C[i * N + j + 0], Cij);

22

slide-29
SLIDE 29

vectorizing matmul

/* goal: vectorize this */
C[i * N + j + 0] += A[i * N + k] * B[k * N + j + 0];
C[i * N + j + 1] += A[i * N + k] * B[k * N + j + 1];
...
C[i * N + j + 6] += A[i * N + k] * B[k * N + j + 6];
C[i * N + j + 7] += A[i * N + k] * B[k * N + j + 7];

// load eight elements from B
Bkj = _mm256_loadu_si256((__m256i*) &B[k * N + j + 0]);
... // multiply each by A[i * N + k] here

22

slide-30
SLIDE 30

vectorizing matmul

/* goal: vectorize this */
C[i * N + j + 0] += A[i * N + k] * B[k * N + j + 0];
C[i * N + j + 1] += A[i * N + k] * B[k * N + j + 1];
...
C[i * N + j + 6] += A[i * N + k] * B[k * N + j + 6];
C[i * N + j + 7] += A[i * N + k] * B[k * N + j + 7];

// load eight elements starting with B[k * N + j]
Bkj = _mm256_loadu_si256((__m256i*) &B[k * N + j + 0]);
// load eight copies of A[i * N + k]
Aik = _mm256_set1_epi32(A[i * N + k]);
// multiply each pair
multiply_results = _mm256_mullo_epi32(Aik, Bkj);

22

slide-31
SLIDE 31

vectorizing matmul

/* goal: vectorize this */
C[i * N + j + 0] += A[i * N + k] * B[k * N + j + 0];
C[i * N + j + 1] += A[i * N + k] * B[k * N + j + 1];
...
C[i * N + j + 6] += A[i * N + k] * B[k * N + j + 6];
C[i * N + j + 7] += A[i * N + k] * B[k * N + j + 7];

Cij = _mm256_add_epi32(Cij, multiply_results);
// store back results
_mm256_storeu_si256(..., Cij);

22

slide-32
SLIDE 32

matmul vectorized

__m256i Cij, Bkj, Aik, Aik_times_Bkj;
// Cij = {Ci,j, Ci,j+1, Ci,j+2, ..., Ci,j+7}
Cij = _mm256_loadu_si256((__m256i*) &C[i * N + j]);
// Bkj = {Bk,j, Bk,j+1, Bk,j+2, ..., Bk,j+7}
Bkj = _mm256_loadu_si256((__m256i*) &B[k * N + j]);
// Aik = {Ai,k, Ai,k, ..., Ai,k}
Aik = _mm256_set1_epi32(A[i * N + k]);
// Aik_times_Bkj = {Ai,k × Bk,j, Ai,k × Bk,j+1, ..., Ai,k × Bk,j+7}
Aik_times_Bkj = _mm256_mullo_epi32(Aik, Bkj);
// Cij = {Ci,j + Ai,k × Bk,j, Ci,j+1 + Ai,k × Bk,j+1, ...}
Cij = _mm256_add_epi32(Cij, Aik_times_Bkj);
// store Cij into C
_mm256_storeu_si256((__m256i*) &C[i * N + j], Cij);

23

slide-33
SLIDE 33

moving values in vectors?

sometimes values aren’t in the right place in a vector
example: have [1, 2, 3, 4]; want [3, 4, 1, 2]
there are instructions/intrinsics for doing this

called shuffling/swizzling/permute/…

sometimes might need a combination of them
worst-case: could rearrange via the stack…, I guess

24

slide-34
SLIDE 34

example shuffling operation (1)

goal: [1, 2, 3, 4] to [3, 4, 1, 2] (64-bit values)

/* x = {1, 2, 3, 4} */
__m256i x = _mm256_setr_epi64x(1, 2, 3, 4);
__m256i result = _mm256_permute4x64_epi64(
    x,
    /* index 2, then 3, then 0, then 1 */
    2 | (3 << 2) | (0 << 4) | (1 << 6)
    /* could also write _MM_SHUFFLE(1, 0, 3, 2) */
);
/* result = {3, 4, 1, 2} */

25

slide-35
SLIDE 35

256-bit with 128-bit?

Intel designed the 256-bit vector instructions with the 128-bit ones in mind
goal: make it possible to use 128-bit vector ALUs to implement 256-bit instructions

split a 256-bit instruction into two ALU operations

means few instructions move values between the top and bottom halves of a vector

in particular, it is complicated to move a 16-bit value between halves

26

slide-36
SLIDE 36

aside on AVX and clock speeds

some processors run slower when the 256-bit ALUs are being used

includes a lot of notable Intel CPUs

why? they give off heat — can’t maintain the higher clock speed

for energy reasons, the wider ALUs are shut down when not used

still faster assuming you’re using vectors a lot

27

slide-37
SLIDE 37

alternate vector interfaces

intrinsic functions/assembly aren’t the only way to write vector code
e.g. GCC vector extensions: more like normal C code (see the sketch below)

    types for each kind of vector
    write + instead of _mm_add_epi32

e.g. CUDA (GPUs): looks like writing multithreaded code, but each thread is a vector “lane”
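A rough illustration of the GCC vector extensions mentioned above (a sketch assuming GCC or Clang; the type name u32x8 is made up for this example):

typedef unsigned int u32x8 __attribute__((vector_size(32)));  /* 8 x 32-bit ints */

void add_vec_ext(unsigned int * restrict a, unsigned int * restrict b) {
    for (int i = 0; i < 512; i += 8) {
        u32x8 va, vb;
        __builtin_memcpy(&va, &a[i], sizeof va);   /* unaligned load */
        __builtin_memcpy(&vb, &b[i], sizeof vb);
        va = va + vb;                              /* + works elementwise on vector types */
        __builtin_memcpy(&a[i], &va, sizeof va);   /* store back */
    }
}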

28

slide-38
SLIDE 38
other vector instructions

multiple extensions to the x86 instruction set for vector instructions
first versions: SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2

128-bit vectors

this class: AVX, AVX2

256-bit vectors

not this class: AVX-512

512-bit vectors

also other ISAs have these: e.g. NEON on ARM, MSA on MIPS, AltiVec/VMX on POWER, …

29

slide-39
SLIDE 39
other vector instruction features

SSE pretty limiting

other vector instruction sets often more featureful:

(and require more sophisticated HW support)

better conditional handling
variable-length vectors
ability to load/store non-contiguous values
(some of these features are in AVX-512)

30

slide-40
SLIDE 40
optimizing real programs

spend effort where it matters
e.g. 90% of program time spent reading files, but optimize computation?
e.g. 90% of program time spent in routine A, but optimize routine B?

31

slide-41
SLIDE 41

profilers

first step — tool to determine where you spend time
tools exist to do this for programs
example on Linux: perf

32

slide-42
SLIDE 42

perf usage

sampling profiler

stops periodically, takes a look at what’s running

perf record OPTIONS program

example OPTIONS:

-F 200 — record 200 samples/second
--call-graph=lbr — record stack traces (using method “lbr”)

perf report or perf annotate
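Putting it together, a typical session might look like this (the program name ./matmul is just a placeholder):

perf record -F 200 --call-graph=lbr ./matmul
perf report     # interactive summary of where samples landed
perf annotate   # per-instruction breakdown within a function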

33

slide-43
SLIDE 43

children/self

“children” — samples in the function or things it called
“self” — samples in the function alone

(e.g. if main() only calls compute(), which does all the work, then main() has a high “children” count but near-zero “self”.)

34

slide-44
SLIDE 44

demo

35

slide-45
SLIDE 45
other profiling techniques

count the number of times each function is called
not sampling — exact counts, but higher overhead

might give less insight into amount of time

36

slide-46
SLIDE 46

tuning optimizations

biggest factor: how fast is it, actually?
set up a benchmark

make sure it’s realistic (right size? uses the answer? etc.)

compare the alternatives

37

slide-47
SLIDE 47

addressing efficiency

for (int kk = 0; kk < N; kk += 2) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            float Cij = C[i * N + j];
            for (int k = kk; k < kk + 2; ++k) {
                Cij += A[i * N + k] * B[k * N + j];
            }
            C[i * N + j] = Cij;
        }
    }
}

tons of multiplies by N?? isn’t that slow?

38

slide-48
SLIDE 48

addressing transformation

for (int kk = 0; kk < N; kk += 2)
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            float Cij = C[i * N + j];
            float *Bkj_pointer = &B[kk * N + j];
            for (int k = kk; k < kk + 2; ++k) {
                // Cij += A[i * N + k] * B[k * N + j];
                Cij += A[i * N + k] * *Bkj_pointer;
                Bkj_pointer += N;
            }
            C[i * N + j] = Cij;
        }
    }

transforms the loop to iterate with a pointer
compiler will usually do this!
increment/decrement by N (× sizeof(float))

39


slide-50
SLIDE 50

addressing efficiency

compiler will usually eliminate these slow multiplies

if it does, doing the transformation yourself is often slower

e.g. turn i * N recomputed each ++i into i_times_N updated with i_times_N += N
way to check: see if the assembly uses lots of multiplies in the loop
if the compiler didn’t eliminate them — do it yourself
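A minimal sketch of doing this strength reduction by hand (a hypothetical column-sum loop, not from the slides; only worth it if the compiler left the multiplies in):

long sum_column(const int *matrix, int N, int col) {
    long sum = 0;
    for (int i = 0; i < N; ++i)
        sum += matrix[i * N + col];        /* multiply every iteration */
    return sum;
}

long sum_column_strength_reduced(const int *matrix, int N, int col) {
    long sum = 0;
    for (int i = 0, i_times_N = 0; i < N; ++i, i_times_N += N)
        sum += matrix[i_times_N + col];    /* running offset, no multiply */
    return sum;
}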

40

slide-51
SLIDE 51

another addressing transformation

// before:
for (int i = 0; i < n; i += 4) {
    C[(i+0) * n + j] += A[(i+0) * n + k] * B[k * n + j];
    C[(i+1) * n + j] += A[(i+1) * n + k] * B[k * n + j];
    // ...

// after:
float *Ai0_base = &A[k];
float *Ai1_base = Ai0_base + n;
float *Ai2_base = Ai1_base + n;
// ...
for (int i = 0; i < n; i += 4) {
    C[(i+0) * n + j] += Ai0_base[i*n] * B[k * n + j];
    C[(i+1) * n + j] += Ai1_base[i*n] * B[k * n + j];
    // ...

compiler will do this, too

41


slide-53
SLIDE 53

another addressing transformation

// before:
for (int i = 0; i < n; i += 20) {
    C[(i+0) * n + j] += A[(i+0) * n + k] * B[k * n + j];
    C[(i+1) * n + j] += A[(i+1) * n + k] * B[k * n + j];
    // ...

// after:
float *Ai0_base = &A[0*n + k];
float *Ai1_base = Ai0_base + n;
float *Ai2_base = Ai1_base + n;
// ...
for (int i = 0; i < n; i += 20) {
    C[(i+0) * n + j] += Ai0_base[i*n] * B[k * n + j];
    C[(i+1) * n + j] += Ai1_base[i*n] * B[k * n + j];
    // ...

storing 20 AiX_base? — need the stack

42


slide-55
SLIDE 55

alternative addressing transformation

// with one base pointer per row of A:
float *Ai0_base = &A[0*n + k];
float *Ai1_base = Ai0_base + n;
// ...
for (int i = 0; i < n; i += 20) {
    C[(i+0) * n + j] += Ai0_base[i*n] * B[k * n + j];
    C[(i+1) * n + j] += Ai1_base[i*n] * B[k * n + j];
    // ...

// alternative: one pointer that steps down A:
float *Ai0_base = &A[k];
for (int i = 0; i < n; i += 20) {
    float *A_ptr = &Ai0_base[i*n];
    C[(i+0) * n + j] += *A_ptr * B[k * n + j];
    A_ptr += n; // what about multiple accumulators???
    C[(i+1) * n + j] += *A_ptr * B[k * n + j];
    // ...

more dependencies (latency bound?), more additions?, fewer registers used
might need the multiple accumulators transformation?

43


slide-57
SLIDE 57

44

slide-58
SLIDE 58

an infinite loop

int main(void) {
    while (1) {
        /* waste CPU time */
    }
}

If I run this on a shared department machine, can you still use it? …if the machine only has one core?

45

slide-59
SLIDE 59

timing nothing

long times[NUM_TIMINGS];

int main(void) {
    for (int i = 0; i < N; ++i) {
        long start, end;
        start = get_time();
        /* do nothing */
        end = get_time();
        times[i] = end - start;
    }
    output_timings(times);
}

same instructions — same difference each time?

46

slide-60
SLIDE 60

doing nothing on a busy system

[plot omitted: time (ns) for the empty timed region vs. sample # (up to ~1,000,000); times range from roughly 10^1 to 10^8 ns]

time for empty loop body

47


slide-62
SLIDE 62

time multiplexing

[diagram: over time, the CPU runs loop.exe, then ssh.exe, then firefox.exe, then loop.exe, then ssh.exe]

...
call get_time
// whatever get_time does
movq %rax, %rbp

    (million cycle delay while other programs run)

call get_time
// whatever get_time does
subq %rbp, %rax
...

49


slide-65
SLIDE 65

time multiplexing really

[diagram: the same timeline of loop.exe, ssh.exe, firefox.exe, loop.exe, ssh.exe, with the operating system running in between: an exception happens, the OS runs, then it returns from the exception into the next program]

50


slide-67
SLIDE 67

OS and time multiplexing

the OS starts running instead of the normal program

mechanism for this: exceptions (later)

saves the old program counter and registers somewhere
sets new registers, jumps to a new program counter
this is called a context switch

saved information called context

51

slide-68
SLIDE 68

context

all register values

    %rax, %rbx, …, %rsp, …

condition codes
program counter
i.e. all visible state in your CPU except memory
address space: map from program addresses to real addresses

52

slide-69
SLIDE 69

context switch pseudocode

context_switch(last, next):
    copy_preexception_pc last->pc
    mov rax, last->rax
    mov rcx, last->rcx
    mov rdx, last->rdx
    ...
    mov next->rdx, rdx
    mov next->rcx, rcx
    mov next->rax, rax
    jmp next->pc

53

slide-70
SLIDE 70

contexts (A running)

[diagram: while A runs, the CPU holds A’s registers (%rax, %rbx, %rcx, %rsp, …), condition codes (SF, ZF), and PC; in memory are process A’s code/stack/etc., process B’s code/stack/etc., and OS memory, which holds B’s saved register values (%rax, %rbx, %rcx, …, SF, ZF, PC)]

54

slide-71
SLIDE 71

contexts (B running)

[diagram: the same picture with B running: the CPU now holds B’s registers, condition codes, and PC, and A’s saved register values sit in OS memory]

55

slide-72
SLIDE 72

memory protection

reading from another program’s memory?

Program A:
    0x10000: .word 42
    // ...
    // do work
    // ...
    movq 0x10000, %rax

Program B:
    // while A is working:
    movq $99, %rax
    movq %rax, 0x10000
    ...

result for A: %rax is 42 (always)
result for B: might crash

56


slide-74
SLIDE 74

program memory

[diagram: program memory layout, high to low addresses:
    0xFFFF FFFF FFFF FFFF down to 0xFFFF 8000 0000 0000: used by OS
    around 0x7F…: stack
    (below that) heap / other dynamic
    writable data
    code + constants, starting near 0x0000 0000 0040 0000]

57

slide-75
SLIDE 75

program memory (two programs)

[diagram: two copies of the same layout, one for Program A and one for Program B; each has its own used-by-OS region, stack, heap / other dynamic, writable data, and code + constants]

58

slide-76
SLIDE 76

address space

programs have the illusion of their own memory
called a program’s address space

[diagram: Program A addresses and Program B addresses are each mapped (mappings set by the OS) onto real memory holding Program A code/data, Program B code/data, OS data, …; some addresses trigger an error; OS data is marked kernel-mode only]

59


slide-79
SLIDE 79

address space mechanisms

this is the next topic, called virtual memory
the mapping is called page tables
the mapping is part of what is changed in a context switch

62

slide-80
SLIDE 80

context

all register values

    %rax, %rbx, …, %rsp, …

condition codes
program counter
i.e. all visible state in your CPU except memory
address space: map from program addresses to real addresses

63

slide-81
SLIDE 81

The Process

process = thread(s) + address space
illusion of a dedicated machine:

thread = illusion of own CPU
address space = illusion of own memory

64

slide-82
SLIDE 82

synchronous versus asynchronous

synchronous — triggered by a particular instruction

traps and faults

asynchronous — comes from outside the program

interrupts and aborts
    timer event
    keypress, other input event

65

slide-83
SLIDE 83

types of exceptions

interrupts — externally-triggered

timer — keep a program from hogging the CPU
I/O devices — key presses, hard drives, networks, …

faults — errors/events in programs

memory not in address space (“Segmentation fault”)
divide by zero
invalid instruction

traps — intentionally triggered exceptions

system calls — ask OS to do something

aborts

66


slide-85
SLIDE 85
overlapping loads and arithmetic

[pipeline diagram: over time, a few loads overlap with a longer stream of multiplies and adds]

speed of the loads might not matter if these (the multiplies and adds) are slower

68

slide-86
SLIDE 86
optimization and bottlenecks

arithmetic/loop efficiency was the bottleneck
after fixing this, cache performance was the bottleneck
common theme when optimizing:

X may not matter until Y is optimized

69

slide-87
SLIDE 87

cache blocking performance (big sizes)

[plot omitted: cycles per multiply or add vs. N (2000 to 10000), unblocked vs. blocked; an annotation marks where the matrix no longer fits in the L3 cache]

70

slide-88
SLIDE 88

cache blocking performance (small sizes)

[plots omitted: cycles per multiply/add vs. N, unblocked vs. blocked; one panel for a less optimized loop (N up to 500), one for an optimized loop (N up to 1000)]

71

slide-89
SLIDE 89

constant multiplies/divides (1)

unsigned int fiveEights(unsigned int x) {
    return x * 5 / 8;
}

fiveEights:
    leal (%rdi,%rdi,4), %eax   # eax = x + 4*x = 5*x
    shrl $3, %eax              # eax = (5*x) >> 3 = 5*x / 8
    ret

72

slide-90
SLIDE 90

constant multiplies/divides (2)

int oneHundredth(int x) {
    return x / 100;
}

oneHundredth:
    movl %edi, %eax
    movl $1374389535, %edx
    sarl $31, %edi
    imull %edx
    sarl $5, %edx
    movl %edx, %eax
    subl %edi, %eax
    ret

1374389535 / 2^37 ≈ 1 / 100
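A quick check of that constant (my arithmetic, not from the slides): imull leaves the high 32 bits of the 64-bit product in %edx, i.e. a shift right by 32, and sarl $5 shifts by 5 more, 37 in total:

2^37 = 137438953472
2^37 / 100 ≈ 1374389534.72, rounded up to 1374389535
so (x * 1374389535) >> 37 ≈ x / 100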

73

slide-91
SLIDE 91

constant multiplies/divides

the compiler is very good at handling these
…but you need to actually use constants (values known at compile time)
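For example (a sketch, not from the slides): only the version with a literal constant gets the multiply-and-shift treatment; a divisor known only at runtime forces an actual divide instruction:

int divideBy100(int x) {
    return x / 100;   /* constant divisor: compiler emits multiply + shifts */
}

int divideBy(int x, int d) {
    return x / d;     /* divisor known only at runtime: compiler emits idiv */
}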

74

slide-92
SLIDE 92

wiggles on prior graphs

[plot omitted: cycles per multiply/add vs. N (up to 1000) for the optimized loop, unblocked and blocked; the curves wiggle as N varies]

variance comes from this optimization: 8 elements per vector, so sizes that are multiples of 8 are easier

75

slide-93
SLIDE 93

aliasing

void twiddle(long *px, long *py) {
    *px += *py;
    *px += *py;
}

the compiler cannot generate this:

twiddle: // BROKEN
    // %rdi = px, %rsi = py
    movq (%rsi), %rax     // rax ← *py
    addq %rax, %rax       // rax ← 2 * *py
    addq %rax, (%rdi)     // *px ← 2 * *py
    ret

76

slide-94
SLIDE 94

aliasing problem

void twiddle(long *px, long *py) {
    *px += *py;
    *px += *py;
    // NOT the same as *px += 2 * *py;
}
...
long x = 1;
twiddle(&x, &x);
// result should be 4, not 3

twiddle: // BROKEN
    // %rdi = px, %rsi = py
    movq (%rsi), %rax     // rax ← *py
    addq %rax, %rax       // rax ← 2 * *py
    addq %rax, (%rdi)     // *px ← 2 * *py
    ret

77

slide-95
SLIDE 95

non-contrived aliasing

void sumRows1(int *result, int *matrix, int N) {
    for (int row = 0; row < N; ++row) {
        result[row] = 0;
        for (int col = 0; col < N; ++col)
            result[row] += matrix[row * N + col];
    }
}

void sumRows2(int *result, int *matrix, int N) {
    for (int row = 0; row < N; ++row) {
        int sum = 0;
        for (int col = 0; col < N; ++col)
            sum += matrix[row * N + col];
        result[row] = sum;
    }
}
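Another option is to keep the sumRows1 style but promise the compiler that the two arrays never overlap, as the earlier add() example did with restrict (a sketch, assuming callers really never pass overlapping pointers):

void sumRows3(int * restrict result, int * restrict matrix, int N) {
    for (int row = 0; row < N; ++row) {
        result[row] = 0;
        for (int col = 0; col < N; ++col)
            result[row] += matrix[row * N + col];   /* compiler may now keep this in a register */
    }
}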

78


slide-97
SLIDE 97

aliasing and performance (1) / GCC 5.4 -O2

[plot omitted: cycles/count vs. N, up to 1000]

79

slide-98
SLIDE 98

aliasing and performance (2) / GCC 5.4 -O3

[plot omitted: cycles/count vs. N, up to 1000]

80

slide-99
SLIDE 99

aliasing and cache optimizations

for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            B[i*N+j] += A[i * N + k] * A[k * N + j];

for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k)
            B[i*N+j] += A[i * N + k] * A[k * N + j];

B = A? B = &A[10]?

compiler can’t generate same code for both

81