slide-1
SLIDE 1

Changelog

Changes made in this version not seen in first lecture:

15 November: vector add picture: make order of result consistent with order of inputs

15 November: correct square to matmul on several vector slides

15 November: correct mixups of A and B, B and C on several matmul vector slides

15 November: correct some si128 instances to si256 in vectorization slides

15 November: addressing transformation: correct more A/B/C mixups

slide-2
SLIDE 2

Vector Insts / Profilers / Exceptions intro

1

slide-3
SLIDE 3

last time

loop unrolling / cache blocking

instruction queues and out-of-order execution:
    list of available instructions
    multiple execution units (ALUs + other things that can run instructions)
    each cycle: ready instructions move from the queue to execution units

reassociation:
    reorder operations to reduce data dependencies, expose more parallelism
    multiple accumulators — reassociation for loops

shifting bottlenecks:
    need to optimize what’s slowest — it determines the longest latency
    e.g. loop unrolling helps until you hit the parallelism limit…, but after improving parallelism, more loop unrolling helps again
    e.g. cache optimizations won’t matter until loop overhead is lowered, or vice-versa

2

slide-4
SLIDE 4

aliasing problems with cache blocking

for (int k = 0; k < N; k++) {
    for (int i = 0; i < N; i += 2) {
        for (int j = 0; j < N; j += 2) {
            C[(i+0)*N + j+0] += A[i*N+k] * B[k*N+j];
            C[(i+1)*N + j+0] += A[(i+1)*N+k] * B[k*N+j];
            C[(i+0)*N + j+1] += A[i*N+k] * B[k*N+j+1];
            C[(i+1)*N + j+1] += A[(i+1)*N+k] * B[k*N+j+1];
        }
    }
}

can compiler keep A[i*N+k] in a register?

3

slide-5
SLIDE 5

“register blocking”

for (int k = 0; k < N; ++k) {
    for (int i = 0; i < N; i += 2) {
        float Ai0k = A[(i+0)*N + k];
        float Ai1k = A[(i+1)*N + k];
        for (int j = 0; j < N; j += 2) {
            float Bkj0 = B[k*N + j+0];
            float Bkj1 = B[k*N + j+1];
            C[(i+0)*N + j+0] += Ai0k * Bkj0;
            C[(i+1)*N + j+0] += Ai1k * Bkj0;
            C[(i+0)*N + j+1] += Ai0k * Bkj1;
            C[(i+1)*N + j+1] += Ai1k * Bkj1;
        }
    }
}

4

slide-6
SLIDE 6

vector instructions

modern processors have registers that hold a “vector” of values
example: current x86-64 processors have 256-bit registers

8 ints or 8 floats or 4 doubles or …

256-bit registers named %ymm0 through %ymm15
instructions that act on all values in a register

vector instructions or SIMD (single instruction, multiple data) instructions

extra copies of ALUs only accessed by vector instructions
(also 128-bit versions named %xmm0 through %xmm15)

5

slide-7
SLIDE 7

example vector instruction

vpaddd %ymm0, %ymm1, %ymm2 (packed add dword (32-bit))

Suppose the registers contain (interpreted as 8 ints):

%ymm0: [1, 2, 3, 4, 5, 6, 7, 8]
%ymm1: [9, 10, 11, 12, 13, 14, 15, 16]

Result will be:

%ymm2: [10, 12, 14, 16, 18, 20, 22, 24]

6

slide-8
SLIDE 8

vector instructions

void add(int * restrict a, int * restrict b) {
    for (int i = 0; i < 512; ++i)
        a[i] += b[i];
}

add:
    xorl %eax, %eax
the_loop:
    vmovdqu (%rdi,%rax), %ymm0   /* load A into ymm0 */
    vmovdqu (%rsi,%rax), %ymm1   /* load B into ymm1 */
    vpaddd %ymm1, %ymm0, %ymm0   /* ymm1 + ymm0 -> ymm0 */
    vmovdqu %ymm0, (%rdi,%rax)   /* store ymm0 into A */
    addq $32, %rax               /* increment index by 32 bytes */
    cmpq $2048, %rax
    jne the_loop
    vzeroupper                   /* ← for calling convention reasons */
    ret
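For reference, a compiler invocation that would plausibly produce AVX2 code like this (the exact flags are an assumption, not from the slides):

gcc -O3 -mavx2 -S add.c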

7

slide-9
SLIDE 9

vector add picture

[diagram: arrays A and B laid out in memory; vmovdqu loads A[8]…A[15] into %ymm0 and B[8]…B[15] into %ymm1, then vpaddd produces A[8]+B[8] through A[15]+B[15]]

8

slide-10
SLIDE 10
one view of vector functional units

[diagram: a vector ALU built from four lanes of pipelined scalar ALUs (lanes 1-4, each with stages 1-3); one set of input values enters per cycle and one set of output values leaves per cycle]

9

slide-11
SLIDE 11

why vector instructions?

lots of logic not dedicated to computation

instruction queue, reorder buffer, instruction fetch, branch prediction, …

adding vector instructions — little extra control logic …but a lot more computational capacity

10

slide-12
SLIDE 12

vector instructions and compilers

compilers can sometimes figure out how to use vector instructions

(and have gotten much, much better at it over the past decade)

but easily messed up:

    by aliasing
    by conditionals
    by some operation with no vector instruction
    …

11

slide-13
SLIDE 13

fickle compiler vectorization (1)

GCC 8.2 and Clang 7.0 generate vector instructions for this:

#define N 1024
void foo(unsigned int *A, unsigned int *B) {
    for (int k = 0; k < N; ++k)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                B[i * N + j] += A[i * N + k] * A[k * N + j];
}

but not:

#define N 1024
void foo(unsigned int *A, unsigned int *B) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                B[i * N + j] += A[i * N + k] * A[j * N + k];
}

12

slide-14
SLIDE 14

fickle compiler vectorization (2)

Clang 5.0.0 generates vector instructions for this:

void foo(int N, unsigned int *A, unsigned int *B) {
    for (int k = 0; k < N; ++k)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                B[i * N + j] += A[i * N + k] * A[k * N + j];
}

but not: (fixed in later versions)

void foo(long N, unsigned int *A, unsigned int *B) {
    for (long k = 0; k < N; ++k)
        for (long i = 0; i < N; ++i)
            for (long j = 0; j < N; ++j)
                B[i * N + j] += A[i * N + k] * A[k * N + j];
}

13

slide-15
SLIDE 15

vector intrinsics

if the compiler doesn’t work… could write vector instruction assembly by hand
second option: “intrinsic functions”
    C functions that compile to particular instructions

14

slide-16
SLIDE 16

vector intrinsics: add example

void vectorized_add(int *a, int *b) {
    for (int i = 0; i < 128; i += 8) {
        // "si256" --> 256-bit integer
        // a_values = {a[i], a[i+1], ..., a[i+7]}
        __m256i a_values = _mm256_loadu_si256((__m256i*) &a[i]);
        // b_values = {b[i], b[i+1], ..., b[i+7]}
        __m256i b_values = _mm256_loadu_si256((__m256i*) &b[i]);
        // add eight 32-bit integers
        // sums = {a[i] + b[i], a[i+1] + b[i+1], ....}
        __m256i sums = _mm256_add_epi32(a_values, b_values);
        // {a[i], a[i+1], ..., a[i+7]} = sums
        _mm256_storeu_si256((__m256i*) &a[i], sums);
    }
}

special type __m256i — “256 bits of integers”

other types: __m256 (floats), __m128d (doubles)

functions to store/load
    si256 means “256-bit integer value”
    u for “unaligned” (otherwise, pointer address must be a multiple of 32)
function to add
    epi32 means “8 32-bit integers”

15


slide-20
SLIDE 20

vector intrinsics: different size

void vectorized_add_64bit(long *a, long *b) {
    for (int i = 0; i < 128; i += 4) {
        // a_values = {a[i], a[i+1], ...} (4 x 64 bits)
        __m256i a_values = _mm256_loadu_si256((__m256i*) &a[i]);
        // b_values = {b[i], b[i+1], ...} (4 x 64 bits)
        __m256i b_values = _mm256_loadu_si256((__m256i*) &b[i]);
        // add four 64-bit integers: vpaddq %ymm0, %ymm1
        // sums = {a[i] + b[i], a[i+1] + b[i+1], ...}
        __m256i sums = _mm256_add_epi64(a_values, b_values);
        // {a[i], a[i+1], ...} = sums
        _mm256_storeu_si256((__m256i*) &a[i], sums);
    }
}

16


slide-22
SLIDE 22

128-bit version, too

history: 256-bit vectors added in an extension called AVX (c. 2011)
before that: 128-bit vectors added in an extension called SSE (c. 1999)
128-bit intrinsics exist, too:

__m256i becomes __m128i
_mm256_add_epi32 becomes _mm_add_epi32
_mm256_loadu_si256 becomes _mm_loadu_si128
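For instance, a 128-bit (SSE) version of the earlier add loop might look like this sketch (not from the slides; assumes SSE2 is available and the array length is a multiple of 4):

#include <emmintrin.h>   /* SSE2 intrinsics */

void vectorized_add_sse(int *a, int *b) {
    for (int i = 0; i < 128; i += 4) {
        /* load four 32-bit ints from each array (unaligned loads) */
        __m128i a_values = _mm_loadu_si128((__m128i*) &a[i]);
        __m128i b_values = _mm_loadu_si128((__m128i*) &b[i]);
        /* add four 32-bit integers at once */
        __m128i sums = _mm_add_epi32(a_values, b_values);
        /* store the four sums back into a */
        _mm_storeu_si128((__m128i*) &a[i], sums);
    }
}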

17

slide-23
SLIDE 23

intrinsics in assignments

smooth assignment: you will use intrinsics
    (compiler vectorization is disabled)
goal: you understand how the vectorization optimization works
goal: in case you need to do more than the compiler would do

missing “pattern” for how to use vectors, aliasing, code size tradeoffs, …

18

slide-24
SLIDE 24

matrix multiply

void matmul(unsigned int *A, unsigned int *B, unsigned int *C) {
    for (int k = 0; k < N; ++k)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

(simple version, no cache blocking, no avoiding aliasing between C, B, A, …)

19

slide-25
SLIDE 25

matmul unrolled

void matmul(unsigned int *A, unsigned int *B, unsigned int *C) {
    for (int k = 0; k < N; ++k) {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; j += 8) {
                /* goal: vectorize this */
                C[i * N + j + 0] += A[i * N + k] * B[k * N + j + 0];
                C[i * N + j + 1] += A[i * N + k] * B[k * N + j + 1];
                C[i * N + j + 2] += A[i * N + k] * B[k * N + j + 2];
                C[i * N + j + 3] += A[i * N + k] * B[k * N + j + 3];
                C[i * N + j + 4] += A[i * N + k] * B[k * N + j + 4];
                C[i * N + j + 5] += A[i * N + k] * B[k * N + j + 5];
                C[i * N + j + 6] += A[i * N + k] * B[k * N + j + 6];
                C[i * N + j + 7] += A[i * N + k] * B[k * N + j + 7];
            }
    }
}

(NB: would probably also want to do cache blocking…)

20

slide-26
SLIDE 26

handy intrinsic functions for matmul

_mm256_set1_epi32 — load eight copies of a 32-bit value into a 256-bit value

instructions generated vary; one example: vmovd + vpbroadcastd

_mm256_mullo_epi32 — multiply eight pairs of 32-bit values, give lowest 32-bits of results

generates vpmulld
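A tiny standalone illustration of what these two intrinsics compute (example values chosen here, not from the slides; needs immintrin.h and AVX2); the following slides use them inside matmul:

#include <immintrin.h>

void set1_mullo_demo(void) {
    /* fives  = {5, 5, 5, 5, 5, 5, 5, 5} */
    __m256i fives  = _mm256_set1_epi32(5);
    /* values = {1, 2, 3, 4, 5, 6, 7, 8} */
    __m256i values = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);
    /* elementwise multiply, keeping the low 32 bits of each product:
       products = {5, 10, 15, 20, 25, 30, 35, 40} */
    __m256i products = _mm256_mullo_epi32(fives, values);
    (void) products;   /* just illustrating the values */
}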

21

slide-27
SLIDE 27

vectorizing matmul

/* goal: vectorize this */
C[i * N + j + 0] += A[i * N + k] * B[k * N + j + 0];
C[i * N + j + 1] += A[i * N + k] * B[k * N + j + 1];
...
C[i * N + j + 6] += A[i * N + k] * B[k * N + j + 6];
C[i * N + j + 7] += A[i * N + k] * B[k * N + j + 7];

22

slide-28
SLIDE 28

vectorizing matmul

/* goal: vectorize this */
C[i * N + j + 0] += A[i * N + k] * B[k * N + j + 0];
C[i * N + j + 1] += A[i * N + k] * B[k * N + j + 1];
...
C[i * N + j + 6] += A[i * N + k] * B[k * N + j + 6];
C[i * N + j + 7] += A[i * N + k] * B[k * N + j + 7];

// load eight elements from C
Cij = _mm256_loadu_si256((__m256i*) &C[i * N + j + 0]);
... // manipulate vector here
// store eight elements into C
_mm256_storeu_si256((__m256i*) &C[i * N + j + 0], Cij);

22

slide-29
SLIDE 29

vectorizing matmul

/* goal: vectorize this */
C[i * N + j + 0] += A[i * N + k] * B[k * N + j + 0];
C[i * N + j + 1] += A[i * N + k] * B[k * N + j + 1];
...
C[i * N + j + 6] += A[i * N + k] * B[k * N + j + 6];
C[i * N + j + 7] += A[i * N + k] * B[k * N + j + 7];

// load eight elements from B
Bkj = _mm256_loadu_si256((__m256i*) &B[k * N + j + 0]);
... // multiply each by A[i * N + k] here

22

slide-30
SLIDE 30

vectorizing matmul

/* goal: vectorize this */
C[i * N + j + 0] += A[i * N + k] * B[k * N + j + 0];
C[i * N + j + 1] += A[i * N + k] * B[k * N + j + 1];
...
C[i * N + j + 6] += A[i * N + k] * B[k * N + j + 6];
C[i * N + j + 7] += A[i * N + k] * B[k * N + j + 7];

// load eight elements starting with B[k * N + j]
Bkj = _mm256_loadu_si256((__m256i*) &B[k * N + j + 0]);
// load eight copies of A[i * N + k]
Aik = _mm256_set1_epi32(A[i * N + k]);
// multiply each pair
multiply_results = _mm256_mullo_epi32(Aik, Bkj);

22

slide-31
SLIDE 31

vectorizing matmul

/* goal: vectorize this */
C[i * N + j + 0] += A[i * N + k] * B[k * N + j + 0];
C[i * N + j + 1] += A[i * N + k] * B[k * N + j + 1];
...
C[i * N + j + 6] += A[i * N + k] * B[k * N + j + 6];
C[i * N + j + 7] += A[i * N + k] * B[k * N + j + 7];

Cij = _mm256_add_epi32(Cij, multiply_results);
// store back results
_mm256_storeu_si256(..., Cij);

22

slide-32
SLIDE 32

matmul vectorized

__m256i Cij, Bkj, Aik, Aik_times_Bkj;
// Cij = {Ci,j, Ci,j+1, Ci,j+2, ..., Ci,j+7}
Cij = _mm256_loadu_si256((__m256i*) &C[i * N + j]);
// Bkj = {Bk,j, Bk,j+1, Bk,j+2, ..., Bk,j+7}
Bkj = _mm256_loadu_si256((__m256i*) &B[k * N + j]);
// Aik = {Ai,k, Ai,k, ..., Ai,k}
Aik = _mm256_set1_epi32(A[i * N + k]);
// Aik_times_Bkj = {Ai,k × Bk,j, Ai,k × Bk,j+1, ..., Ai,k × Bk,j+7}
Aik_times_Bkj = _mm256_mullo_epi32(Aik, Bkj);
// Cij = {Ci,j + Ai,k × Bk,j, Ci,j+1 + Ai,k × Bk,j+1, ...}
Cij = _mm256_add_epi32(Cij, Aik_times_Bkj);
// store Cij into C
_mm256_storeu_si256((__m256i*) &C[i * N + j], Cij);

23

slide-33
SLIDE 33

moving values in vectors?

sometimes values aren’t in the right place in a vector
example: have [1, 2, 3, 4]; want [3, 4, 1, 2]
there are instructions/intrinsics for doing this

called shuffling/swizzling/permute/…

sometimes might need a combination of them
worst-case: could rearrange via the stack…, I guess

24

slide-34
SLIDE 34

example shuffling operation (1)

goal: [1, 2, 3, 4] to [3, 4, 1, 2] (64-bit values)

/* x = {1, 2, 3, 4} */
__m256i x = _mm256_setr_epi64x(1, 2, 3, 4);
__m256i result = _mm256_permute4x64_epi64(
    x,
    /* index 2, then 3, then 0, then 1 */
    2 | (3 << 2) | (0 << 4) | (1 << 6)
    /* could also write _MM_SHUFFLE(1, 0, 3, 2) */
);
/* result = {3, 4, 1, 2} */

25

slide-35
SLIDE 35

256-bit with 128-bit?

Intel designed the 256-bit vector instructions with the 128-bit ones in mind
goal: make it possible to use 128-bit vector ALUs to implement 256-bit instructions

split a 256-bit instruction into two ALU operations

means few instructions move values between the top and bottom halves of a vector

in particular, it is complicated to move a 16-bit value between halves

26

slide-36
SLIDE 36

aside on AVX and clock speeds

some processors run slower when the 256-bit ALUs are being used

includes a lot of notable Intel CPUs

why? they give off heat — can’t maintain the higher clock speed

for energy reasons, the wider ALUs are shut down when not used

still faster assuming you’re using vectors a lot

27

slide-37
SLIDE 37

alternate vector interfaces

intrinsic functions/assembly aren’t the only way to write vector code
e.g. GCC vector extensions: more like normal C code (see the sketch below)

    types for each kind of vector
    write + instead of _mm_add_epi32

e.g. CUDA (GPUs): looks like writing multithreaded code, but each thread is a vector “lane”
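A rough illustration of the GCC vector extensions mentioned above (a sketch assuming GCC or Clang; the type name u32x8 is made up for this example):

typedef unsigned int u32x8 __attribute__((vector_size(32)));  /* 8 x 32-bit ints */

void add_vec_ext(unsigned int * restrict a, unsigned int * restrict b) {
    for (int i = 0; i < 512; i += 8) {
        u32x8 va, vb;
        __builtin_memcpy(&va, &a[i], sizeof va);   /* unaligned load */
        __builtin_memcpy(&vb, &b[i], sizeof vb);
        va = va + vb;                              /* + works elementwise on vector types */
        __builtin_memcpy(&a[i], &va, sizeof va);   /* store back */
    }
}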

28

slide-38
SLIDE 38
other vector instructions

multiple extensions to the x86 instruction set for vector instructions
first versions: SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2

128-bit vectors

this class: AVX, AVX2

256-bit vectors

not this class: AVX-512

512-bit vectors

also other ISAs have these: e.g. NEON on ARM, MSA on MIPS, AltiVec/VMX on POWER, …

29

slide-39
SLIDE 39
other vector instruction features

SSE pretty limiting

other vector instruction sets often more featureful:

(and require more sophisticated HW support)

better conditional handling
variable-length vectors
ability to load/store non-contiguous values
(some of these features are in AVX-512)

30

slide-40
SLIDE 40
optimizing real programs

spend effort where it matters
e.g. 90% of program time spent reading files, but optimize computation?
e.g. 90% of program time spent in routine A, but optimize routine B?

31

slide-41
SLIDE 41

profilers

first step — tool to determine where you spend time
tools exist to do this for programs
example on Linux: perf

32

slide-42
SLIDE 42

perf usage

sampling profiler

stops periodically, takes a look at what’s running

perf record OPTIONS program

example OPTIONS:

-F 200 — record 200 samples/second
--call-graph=lbr — record stack traces (using method “lbr”)

perf report or perf annotate
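Putting it together, a typical session might look like this (the program name ./matmul is just a placeholder):

perf record -F 200 --call-graph=lbr ./matmul
perf report     # interactive summary of where samples landed
perf annotate   # per-instruction breakdown within a function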

33

slide-43
SLIDE 43

children/self

“children” — samples in the function or things it called
“self” — samples in the function alone

(e.g. if main() only calls compute(), which does all the work, then main() has a high “children” count but near-zero “self”.)

34

slide-44
SLIDE 44

demo

35

slide-45
SLIDE 45
other profiling techniques

count the number of times each function is called
not sampling — exact counts, but higher overhead

might give less insight into amount of time

36

slide-46
SLIDE 46

tuning optimizations

biggest factor: how fast is it, actually?
set up a benchmark

make sure it’s realistic (right size? uses the answer? etc.)

compare the alternatives

37

slide-47
SLIDE 47

addressing efficiency

for (int kk = 0; kk < N; kk += 2) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            float Cij = C[i * N + j];
            for (int k = kk; k < kk + 2; ++k) {
                Cij += A[i * N + k] * B[k * N + j];
            }
            C[i * N + j] = Cij;
        }
    }
}

tons of multiplies by N?? isn’t that slow?

38

slide-48
SLIDE 48

addressing transformation

for (int kk = 0; kk < N; kk += 2)
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            float Cij = C[i * N + j];
            float *Bkj_pointer = &B[kk * N + j];
            for (int k = kk; k < kk + 2; ++k) {
                // Cij += A[i * N + k] * B[k * N + j];
                Cij += A[i * N + k] * *Bkj_pointer;
                Bkj_pointer += N;
            }
            C[i * N + j] = Cij;
        }
    }

transforms the loop to iterate with a pointer
compiler will usually do this!
increment/decrement by N (× sizeof(float))

39


slide-50
SLIDE 50

addressing efficiency

compiler will usually eliminate these slow multiplies

if it does, doing the transformation yourself is often slower

e.g. turn i * N recomputed each ++i into i_times_N updated with i_times_N += N
way to check: see if the assembly uses lots of multiplies in the loop
if the compiler didn’t eliminate them — do it yourself
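A minimal sketch of doing this strength reduction by hand (a hypothetical column-sum loop, not from the slides; only worth it if the compiler left the multiplies in):

long sum_column(const int *matrix, int N, int col) {
    long sum = 0;
    for (int i = 0; i < N; ++i)
        sum += matrix[i * N + col];        /* multiply every iteration */
    return sum;
}

long sum_column_strength_reduced(const int *matrix, int N, int col) {
    long sum = 0;
    for (int i = 0, i_times_N = 0; i < N; ++i, i_times_N += N)
        sum += matrix[i_times_N + col];    /* running offset, no multiply */
    return sum;
}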

40

slide-51
SLIDE 51

another addressing transformation

// before:
for (int i = 0; i < n; i += 4) {
    C[(i+0) * n + j] += A[(i+0) * n + k] * B[k * n + j];
    C[(i+1) * n + j] += A[(i+1) * n + k] * B[k * n + j];
    // ...

// after:
float *Ai0_base = &A[k];
float *Ai1_base = Ai0_base + n;
float *Ai2_base = Ai1_base + n;
// ...
for (int i = 0; i < n; i += 4) {
    C[(i+0) * n + j] += Ai0_base[i*n] * B[k * n + j];
    C[(i+1) * n + j] += Ai1_base[i*n] * B[k * n + j];
    // ...

compiler will do this, too

41


slide-53
SLIDE 53

another addressing transformation

// before:
for (int i = 0; i < n; i += 20) {
    C[(i+0) * n + j] += A[(i+0) * n + k] * B[k * n + j];
    C[(i+1) * n + j] += A[(i+1) * n + k] * B[k * n + j];
    // ...

// after:
float *Ai0_base = &A[0*n + k];
float *Ai1_base = Ai0_base + n;
float *Ai2_base = Ai1_base + n;
// ...
for (int i = 0; i < n; i += 20) {
    C[(i+0) * n + j] += Ai0_base[i*n] * B[k * n + j];
    C[(i+1) * n + j] += Ai1_base[i*n] * B[k * n + j];
    // ...

storing 20 AiX_base? — need the stack

42


slide-55
SLIDE 55

alternative addressing transformation

// with one base pointer per row of A:
float *Ai0_base = &A[0*n + k];
float *Ai1_base = Ai0_base + n;
// ...
for (int i = 0; i < n; i += 20) {
    C[(i+0) * n + j] += Ai0_base[i*n] * B[k * n + j];
    C[(i+1) * n + j] += Ai1_base[i*n] * B[k * n + j];
    // ...

// alternative: one pointer that steps down A:
float *Ai0_base = &A[k];
for (int i = 0; i < n; i += 20) {
    float *A_ptr = &Ai0_base[i*n];
    C[(i+0) * n + j] += *A_ptr * B[k * n + j];
    A_ptr += n; // what about multiple accumulators???
    C[(i+1) * n + j] += *A_ptr * B[k * n + j];
    // ...

more dependencies (latency bound?), more additions?, fewer registers used
might need the multiple accumulators transformation?

43


slide-57
SLIDE 57

44

slide-58
SLIDE 58

an infinite loop

int main(void) {
    while (1) {
        /* waste CPU time */
    }
}

If I run this on a shared department machine, can you still use it? …if the machine only has one core?

45

slide-59
SLIDE 59

timing nothing

long times[NUM_TIMINGS];

int main(void) {
    for (int i = 0; i < N; ++i) {
        long start, end;
        start = get_time();
        /* do nothing */
        end = get_time();
        times[i] = end - start;
    }
    output_timings(times);
}

same instructions — same difference each time?

46

slide-60
SLIDE 60

doing nothing on a busy system

[plot omitted: time (ns) for the empty timed region vs. sample # (up to ~1,000,000); times range from roughly 10^1 to 10^8 ns]

time for empty loop body

47


slide-62
SLIDE 62

time multiplexing

[diagram: over time, the CPU runs loop.exe, then ssh.exe, then firefox.exe, then loop.exe, then ssh.exe]

...
call get_time
// whatever get_time does
movq %rax, %rbp

    (million cycle delay while other programs run)

call get_time
// whatever get_time does
subq %rbp, %rax
...

49


slide-65
SLIDE 65

time multiplexing really

[diagram: the same timeline of loop.exe, ssh.exe, firefox.exe, loop.exe, ssh.exe, with the operating system running in between: an exception happens, the OS runs, then it returns from the exception into the next program]

50


slide-67
SLIDE 67

OS and time multiplexing

the OS starts running instead of the normal program

mechanism for this: exceptions (later)

saves the old program counter and registers somewhere
sets new registers, jumps to a new program counter
this is called a context switch

saved information called context

51

slide-68
SLIDE 68

context

all register values

    %rax, %rbx, …, %rsp, …

condition codes
program counter
i.e. all visible state in your CPU except memory
address space: map from program addresses to real addresses

52

slide-69
SLIDE 69

context switch pseudocode

context_switch(last, next):
    copy_preexception_pc last->pc
    mov rax, last->rax
    mov rcx, last->rcx
    mov rdx, last->rdx
    ...
    mov next->rdx, rdx
    mov next->rcx, rcx
    mov next->rax, rax
    jmp next->pc

53

slide-70
SLIDE 70

contexts (A running)

[diagram: while A runs, the CPU holds A’s registers (%rax, %rbx, %rcx, %rsp, …), condition codes (SF, ZF), and PC; in memory are process A’s code/stack/etc., process B’s code/stack/etc., and OS memory, which holds B’s saved register values (%rax, %rbx, %rcx, …, SF, ZF, PC)]

54

slide-71
SLIDE 71

contexts (B running)

[diagram: the same picture with B running: the CPU now holds B’s registers, condition codes, and PC, and A’s saved register values sit in OS memory]

55

slide-72
SLIDE 72

memory protection

reading from another program’s memory?

Program A:
    0x10000: .word 42
    // ...
    // do work
    // ...
    movq 0x10000, %rax

Program B:
    // while A is working:
    movq $99, %rax
    movq %rax, 0x10000
    ...

result for A: %rax is 42 (always)
result for B: might crash

56


slide-74
SLIDE 74

program memory

[diagram: program memory layout, high to low addresses:
    0xFFFF FFFF FFFF FFFF down to 0xFFFF 8000 0000 0000: used by OS
    around 0x7F…: stack
    (below that) heap / other dynamic
    writable data
    code + constants, starting near 0x0000 0000 0040 0000]

57

slide-75
SLIDE 75

program memory (two programs)

[diagram: two copies of the same layout, one for Program A and one for Program B; each has its own used-by-OS region, stack, heap / other dynamic, writable data, and code + constants]

58

slide-76
SLIDE 76

address space

programs have the illusion of their own memory
called a program’s address space

[diagram: Program A addresses and Program B addresses are each mapped (mappings set by the OS) onto real memory holding Program A code/data, Program B code/data, OS data, …; some addresses trigger an error; OS data is marked kernel-mode only]

59


slide-79
SLIDE 79

address space mechanisms

this is the next topic, called virtual memory
the mapping is called page tables
the mapping is part of what is changed in a context switch

62

slide-80
SLIDE 80

context

all register values

    %rax, %rbx, …, %rsp, …

condition codes
program counter
i.e. all visible state in your CPU except memory
address space: map from program addresses to real addresses

63

slide-81
SLIDE 81

The Process

process = thread(s) + address space
illusion of a dedicated machine:

thread = illusion of own CPU
address space = illusion of own memory

64

slide-82
SLIDE 82

synchronous versus asynchronous

synchronous — triggered by a particular instruction

traps and faults

asynchronous — comes from outside the program

interrupts and aborts
    timer event
    keypress, other input event

65

slide-83
SLIDE 83

types of exceptions

interrupts — externally-triggered

timer — keep a program from hogging the CPU
I/O devices — key presses, hard drives, networks, …

faults — errors/events in programs

memory not in address space (“Segmentation fault”)
divide by zero
invalid instruction

traps — intentionally triggered exceptions

system calls — ask OS to do something

aborts

66


slide-85
SLIDE 85
overlapping loads and arithmetic

[pipeline diagram: over time, a few loads overlap with a longer stream of multiplies and adds]

speed of the loads might not matter if these (the multiplies and adds) are slower

68

slide-86
SLIDE 86
optimization and bottlenecks

arithmetic/loop efficiency was the bottleneck
after fixing this, cache performance was the bottleneck
common theme when optimizing:

X may not matter until Y is optimized

69

slide-87
SLIDE 87

cache blocking performance (big sizes)

[plot omitted: cycles per multiply or add vs. N (2000 to 10000), unblocked vs. blocked; an annotation marks where the matrix no longer fits in the L3 cache]

70

slide-88
SLIDE 88

cache blocking performance (small sizes)

[plots omitted: cycles per multiply/add vs. N, unblocked vs. blocked; one panel for a less optimized loop (N up to 500), one for an optimized loop (N up to 1000)]

71

slide-89
SLIDE 89

constant multiplies/divides (1)

unsigned int fiveEights(unsigned int x) {
    return x * 5 / 8;
}

fiveEights:
    leal (%rdi,%rdi,4), %eax   # eax = x + 4*x = 5*x
    shrl $3, %eax              # eax = (5*x) >> 3 = 5*x / 8
    ret

72

slide-90
SLIDE 90

constant multiplies/divides (2)

int oneHundredth(int x) {
    return x / 100;
}

oneHundredth:
    movl %edi, %eax
    movl $1374389535, %edx
    sarl $31, %edi
    imull %edx
    sarl $5, %edx
    movl %edx, %eax
    subl %edi, %eax
    ret

1374389535 / 2^37 ≈ 1 / 100
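A quick check of that constant (my arithmetic, not from the slides): imull leaves the high 32 bits of the 64-bit product in %edx, i.e. a shift right by 32, and sarl $5 shifts by 5 more, 37 in total:

2^37 = 137438953472
2^37 / 100 ≈ 1374389534.72, rounded up to 1374389535
so (x * 1374389535) >> 37 ≈ x / 100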

73

slide-91
SLIDE 91

constant multiplies/divides

the compiler is very good at handling these
…but you need to actually use constants (values known at compile time)
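For example (a sketch, not from the slides): only the version with a literal constant gets the multiply-and-shift treatment; a divisor known only at runtime forces an actual divide instruction:

int divideBy100(int x) {
    return x / 100;   /* constant divisor: compiler emits multiply + shifts */
}

int divideBy(int x, int d) {
    return x / d;     /* divisor known only at runtime: compiler emits idiv */
}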

74

slide-92
SLIDE 92

wiggles on prior graphs

[plot omitted: cycles per multiply/add vs. N (up to 1000) for the optimized loop, unblocked and blocked; the curves wiggle as N varies]

variance comes from this optimization: 8 elements per vector, so sizes that are multiples of 8 are easier

75

slide-93
SLIDE 93

aliasing

void twiddle(long *px, long *py) {
    *px += *py;
    *px += *py;
}

the compiler cannot generate this:

twiddle: // BROKEN
    // %rdi = px, %rsi = py
    movq (%rsi), %rax     // rax ← *py
    addq %rax, %rax       // rax ← 2 * *py
    addq %rax, (%rdi)     // *px ← 2 * *py
    ret

76

slide-94
SLIDE 94

aliasing problem

void twiddle(long *px, long *py) {
    *px += *py;
    *px += *py;
    // NOT the same as *px += 2 * *py;
}
...
long x = 1;
twiddle(&x, &x);
// result should be 4, not 3

twiddle: // BROKEN
    // %rdi = px, %rsi = py
    movq (%rsi), %rax     // rax ← *py
    addq %rax, %rax       // rax ← 2 * *py
    addq %rax, (%rdi)     // *px ← 2 * *py
    ret

77

slide-95
SLIDE 95

non-contrived aliasing

void sumRows1(int *result, int *matrix, int N) {
    for (int row = 0; row < N; ++row) {
        result[row] = 0;
        for (int col = 0; col < N; ++col)
            result[row] += matrix[row * N + col];
    }
}

void sumRows2(int *result, int *matrix, int N) {
    for (int row = 0; row < N; ++row) {
        int sum = 0;
        for (int col = 0; col < N; ++col)
            sum += matrix[row * N + col];
        result[row] = sum;
    }
}
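Another option is to keep the sumRows1 style but promise the compiler that the two arrays never overlap, as the earlier add() example did with restrict (a sketch, assuming callers really never pass overlapping pointers):

void sumRows3(int * restrict result, int * restrict matrix, int N) {
    for (int row = 0; row < N; ++row) {
        result[row] = 0;
        for (int col = 0; col < N; ++col)
            result[row] += matrix[row * N + col];   /* compiler may now keep this in a register */
    }
}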

78


slide-97
SLIDE 97

aliasing and performance (1) / GCC 5.4 -O2

[plot omitted: cycles/count vs. N, up to 1000]

79

slide-98
SLIDE 98

aliasing and performance (2) / GCC 5.4 -O3

[plot omitted: cycles/count vs. N, up to 1000]

80

slide-99
SLIDE 99

aliasing and cache optimizations

for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            B[i*N+j] += A[i * N + k] * A[k * N + j];

for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k)
            B[i*N+j] += A[i * N + k] * A[k * N + j];

B = A? B = &A[10]?

compiler can’t generate same code for both

81