CPU design effects that can degrade performance of your programs

SLIDE 1

CPU design effects that can degrade performance of your programs

Jakub Beránek jakub.beranek@vsb.cz

slide-2
SLIDE 2

whoami

  • PhD student @ VSB-TUO, Ostrava, Czech Republic
  • Research assistant @ IT4Innovations (HPC center)
  • HPC, distributed systems, program optimization

SLIDES 3-6

How do we get maximum performance?

  • Select the right algorithm
  • Use a low-overhead language
  • Compile properly
  • Tune to the underlying hardware

SLIDES 7-10

Why should we care?

  • We write code for the C++ abstract machine
  • Intel CPUs fulfill the contract of this abstract machine
  • But inside they can do whatever they want
  • Understanding CPU trade-offs can get us more performance
slide-11
SLIDE 11

C++ abstract machine example

How fast are the individual array increments?

void foo(int* arr, int count) {
    for (int i = 0; i < count; i++) {
        arr[i]++;
    }
}
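
The abstract machine gives no answer: the per-element cost depends on caches, prefetching and vectorization, none of which it models. As a minimal sketch (not code from the deck; the array sizes are assumptions chosen to roughly straddle the cache levels), one can time the loop for different sizes and watch the cost per increment change:

#include <chrono>
#include <cstdio>
#include <vector>

void foo(int* arr, int count) {
    for (int i = 0; i < count; i++) {
        arr[i]++;
    }
}

int main() {
    // A real benchmark would warm up and repeat each measurement; this only sketches the idea.
    for (int count : {1 << 10, 1 << 16, 1 << 20, 1 << 24}) {
        std::vector<int> arr(count, 0);
        auto start = std::chrono::steady_clock::now();
        foo(arr.data(), count);
        auto end = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(end - start).count();
        std::printf("%10d elements: %.2f ns per increment\n", count, ns / count);
    }
}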

SLIDES 12-14

Hardware effects

  • Performance effects caused by a specific CPU/memory implementation
  • Demonstrate some CPU/memory trade-off or assumption
  • Impossible to predict from (C++) code alone
slide-15
SLIDE 15

Hardware is getting more and more complex

Source: karlrupp.net

SLIDES 16-18

Microarchitecture (Haswell): frontend and backend

Source: Intel Architectures Optimization Reference Manual

SLIDES 19-22

How bad is it?

  • C++ final draft: well over a thousand pages
  • Intel x86 manual: 5764 pages!

http://www.open-std.org/jtc/sc/wg/docs/papers//n.pdf
https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf
https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf

SLIDES 23-31

Plan of attack

  • Show example C++ programs
  • short, (hopefully) comprehensible
  • compiled with -O3
  • Demonstrate weird performance behaviour
  • Let you guess what might cause it
  • Explain (a possible) cause
  • Show how to measure and fix it
  • Disclaimer #1: Everything will be Intel x86 specific
  • Disclaimer #2: I'm not an expert on this and I may be wrong :-)
slide-32
SLIDE 32

Let's see some examples...

slide-33
SLIDE 33

Code (backup)

std::vector<float> data = /* 32K random floats in [1, 10] */;
float sum = 0;

// std::sort(data.begin(), data.end());

for (auto x : data) {
    if (x < 6.0f) {
        sum += x;
    }
}
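
The perf invocation later in the deck (--benchmark_filter=nosort) suggests the examples are driven by Google Benchmark. Purely as an assumption about the harness (names and sizes here are illustrative, not the original benchmark code), the two variants could be registered roughly like this:

#include <algorithm>
#include <random>
#include <vector>
#include <benchmark/benchmark.h>

static std::vector<float> make_data() {
    std::vector<float> data(32 * 1024);
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(1.0f, 10.0f);
    for (auto& x : data) x = dist(rng);
    return data;
}

static void sum_below_six(benchmark::State& state, bool sorted) {
    std::vector<float> data = make_data();
    if (sorted) std::sort(data.begin(), data.end());
    for (auto _ : state) {
        float sum = 0;
        for (auto x : data) {
            if (x < 6.0f) sum += x;
        }
        benchmark::DoNotOptimize(sum);
    }
}

BENCHMARK_CAPTURE(sum_below_six, nosort, false);
BENCHMARK_CAPTURE(sum_below_six, sort, true);
BENCHMARK_MAIN();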

slide-34
SLIDE 34

Result (backup)

slide-35
SLIDE 35

Most upvoted Stack Overflow question

slide-36
SLIDE 36

What is going on? (Intel VTune Amplifier)

slide-37
SLIDE 37

What is going on? (perf)

$ perf stat ./example0a --benchmark_filter=nosort

       853,672012   task-clock (msec)   #    0,997 CPUs utilized
                30   context-switches    #    0,035 K/sec
                 0   cpu-migrations      #    0,000 K/sec
               199   page-faults         #    0,233 K/sec
     3 159 530 915   cycles              #    3,701 GHz
     1 475 799 619   instructions        #    0,47  insn per cycle
       419 608 357   branches            #  491,533 M/sec
       102 425 035   branch-misses       #   24,41% of all branches

slide-38
SLIDE 38

Branch predictor

SLIDES 39-44

CPU pipeline

Stages: Fetch, Decode, Execute, Write (cycles 1-7 shown in the animation)

 7  xor rax,rdx
 8  add rax,rcx
 9  cmp rax,rbx
10  je 15
11  inc rcx
...
15  ret

The frames step these instructions through the pipeline stage by stage. When the conditional jump (je 15) enters the pipeline, the CPU does not yet know whether instruction 11 or instruction 15 should be fetched next; the last frame marks this with a "?".

SLIDES 45-46

Branch predictor

  • CPU tries to predict results of branches
  • A misprediction is expensive: the speculatively fetched work is thrown away and the pipeline has to be refilled, wasting many cycles

SLIDES 47-64

Simple branch predictor - unsorted array

if (data[i] < 6) { ... }

[Animation: the predictor is stepped through the unsorted data. Because values below and above 6 arrive in random order, the prediction keeps flipping between "Taken" and "Not taken" and is frequently wrong; the final frame shows the resulting hit/miss counts and a poor hit rate.]

SLIDES 65-82

Simple branch predictor - sorted array

if (data[i] < 6) { ... }

[Animation: with the data sorted, all values below 6 come first, so after a warm-up the prediction stays "Taken" for the first part of the array and "Not taken" for the rest. It mispredicts only around the boundary; the final frame shows the resulting hit/miss counts and a very high hit rate.]

slide-83
SLIDE 83

How can the compiler help?

With float, there are two branches per iteration

slide-84
SLIDE 84

How can the compiler help?

With int, one branch is removed (using cmov)
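
For reference, a sketch (not from the slides) of the integer variant this refers to. Because conditionally adding zero is a harmless no-op, compilers can typically select between x and 0 with a cmov instead of emitting a data-dependent jump:

#include <vector>

int sum_below_six(const std::vector<int>& data) {
    int sum = 0;
    for (int x : data) {
        if (x < 6) {
            sum += x;   // usually compiled as a conditional move, not a branch
        }
    }
    return sum;
}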

SLIDES 85-86

How to measure?

branch-misses

How many times was a branch mispredicted?

$ perf stat -e branch-misses ./example0a
with sort    ->     383 902
without sort -> 101 652 009

SLIDES 87-90

How to help the branch predictor?

  • More predictable data
  • Profile-guided optimization
  • Remove (unpredictable) branches (see the sketch below)
  • Compiler hints (use with caution)

if (__builtin_expect(will_it_blend(), 0)) {
    // this branch is not likely to be taken
}
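
As an illustration of removing an unpredictable branch (a sketch, not code from the deck), the float loop from the earlier example can be rewritten so that the comparison feeds an arithmetic select instead of a jump:

#include <vector>

float sum_below_six_branchless(const std::vector<float>& data) {
    float sum = 0.0f;
    for (float x : data) {
        // (x < 6.0f) is 0 or 1, so the condition becomes a data dependency
        // rather than a branch the predictor has to guess.
        sum += x * static_cast<float>(x < 6.0f);
    }
    return sum;
}

Whether this pays off depends on the data: with predictable inputs the branchy version can be just as fast or faster.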

SLIDES 91-94

Branch target prediction

  • Target of a jump is not known at compile time:
  • Function pointer
  • Function return address
  • Virtual method
slide-95
SLIDE 95

Code (backup)

struct A {
    virtual void handle(size_t* data) const = 0;
};
struct B: public A {
    void handle(size_t* data) const final { *data += 1; }
};
struct C: public A {
    void handle(size_t* data) const final { *data += 2; }
};

std::vector<std::unique_ptr<A>> data = /* 4K random B/C instances */;
// std::sort(data.begin(), data.end(), /* sort by instance type */);

size_t sum = 0;
for (auto& x : data) {
    x->handle(&sum);
}
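
The comparator for the commented-out sort is elided on the slide. One hypothetical way to write it (an assumption, not the original code) is to order objects by their dynamic type, so that consecutive virtual calls keep jumping to the same target:

#include <algorithm>
#include <typeindex>

// Hypothetical comparator: groups all B instances together and all C instances together.
std::sort(data.begin(), data.end(),
          [](const std::unique_ptr<A>& lhs, const std::unique_ptr<A>& rhs) {
              return std::type_index(typeid(*lhs)) < std::type_index(typeid(*rhs));
          });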

slide-96
SLIDE 96

Result (backup)

slide-97
SLIDE 97

perf (backup)

$ perf stat -e branch-misses ./example0b
with sort    ->    337 274
without sort -> 84 183 161

slide-98
SLIDE 98

Code (backup)

// Addresses of N integers, each `offset` bytes apart
std::vector<int*> data = ...;

for (auto ptr: data) {
    *ptr += 1;
}

// Offsets: 4, 64, 4000, 4096, 4128
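
The construction of the pointer array is not shown. A hypothetical setup (names and details are assumptions, not the original benchmark code) could look like this:

#include <cstddef>
#include <vector>

// Returns `count` pointers into a backing array, each `offset_bytes` apart
// (offset_bytes is assumed to be a multiple of sizeof(int)).
std::vector<int*> make_data(std::size_t count, std::size_t offset_bytes) {
    static std::vector<int> storage;               // backing store, kept alive
    const std::size_t stride = offset_bytes / sizeof(int);
    storage.assign(count * stride, 0);
    std::vector<int*> ptrs;
    ptrs.reserve(count);
    for (std::size_t i = 0; i < count; i++) {
        ptrs.push_back(&storage[i * stride]);
    }
    return ptrs;
}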

slide-99
SLIDE 99

Result (backup)

slide-100
SLIDE 100

Cache memory

SLIDES 101-103

How are (L1) caches implemented

  • N-way set associative table
  • Hardware hash table
  • Key = address
  • Entry = cache line (64 B)

SLIDES 104-109

N-way set associative cache

Size - total # of cache lines
Associativity (N) - # of cache lines per bucket
# of buckets = Size / N

N = 1 (direct mapped)
N = Size (fully associative)
In practice, N sits in between (e.g. 8 for the Intel L1 data cache shown later)

SLIDES 110-113

How are addresses hashed?

  • 64-bit address:

    Tag | Index | Offset

  • Offset
  • Selects byte within a cache line
  • log2(cache line size) bits
  • Index
  • Selects bucket within the cache
  • log2(bucket count) bits
  • Tag
  • Used for matching

SLIDES 114-126

N-way set associative cache

[Animation: cache lines A, B and C, whose addresses share the same index bits, are inserted into the cache for different associativities N. With a small N the shared bucket overflows and earlier lines are evicted even though the rest of the cache is empty; with a larger N all three fit into the same bucket.]

SLIDES 127-132

Intel L1 cache

$ getconf -a | grep LEVEL1_DCACHE
LEVEL1_DCACHE_SIZE      32768
LEVEL1_DCACHE_ASSOC     8
LEVEL1_DCACHE_LINESIZE  64

  • Cache line size - 64 B (6 offset bits)
  • Associativity (N) - 8
  • Size - 32768 B
  • 32768 / 64 => 512 cache lines
  • 512 / 8 => 64 buckets (6 index bits)
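
Putting those numbers together, a small sketch (not from the deck) of how an address splits into offset, index and tag for this cache:

#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t addr   = 0x7ffd1234abcdULL;    // arbitrary example address
    const uint64_t offset = addr & 0x3f;          // low 6 bits: byte within the 64 B line
    const uint64_t index  = (addr >> 6) & 0x3f;   // next 6 bits: one of 64 buckets
    const uint64_t tag    = addr >> 12;           // remaining bits: matched against stored tags
    std::printf("offset=%llu index=%llu tag=0x%llx\n",
                (unsigned long long)offset,
                (unsigned long long)index,
                (unsigned long long)tag);
    return 0;
}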

SLIDES 133-138

Offset = 4 B

[Animation: numbers A, B, C and D are placed 4 bytes apart; their addresses differ only in the offset bits.]

  • Same bucket, same cache line for each number
  • Most efficient, no space is wasted

SLIDES 139-145

Offset = 64 B

[Animation: numbers A, B, C and D are placed 64 bytes apart; each starts a new cache line, so their index bits differ.]

  • Different bucket for each number
  • Wastes 60 B of every 64 B cache line (only one 4 B int per line is used)
  • Equally distributed among buckets

SLIDES 146-151

Offset = 4096 B

[Animation: numbers A, B, C and D are placed 4096 bytes apart; 4096 B is a multiple of 64 buckets x 64 B, so the index bits are identical and only the tag differs.]

  • Same bucket, but different cache lines for each number!
  • Bucket full => evictions necessary (only 8 lines fit in one bucket)

SLIDES 152-153

How to measure?

l1d.replacement

How many times was a cache line loaded into L1?

$ perf stat -e l1d.replacement ./example1
4B offset    ->     149 558
4096B offset -> 426 218 383

slide-154
SLIDE 154

Code (backup)

float F = static_cast<float>(std::stof(argv[1]));
std::vector<float> data(4 * 1024 * 1024, 1);

for (int r = 0; r < 100; r++) {
    for (auto& item: data) {
        item *= F;
    }
}

slide-155
SLIDE 155

Result (backup)

slide-156
SLIDE 156

Denormal floating point numbers

SLIDES 157-159

Denormal floating point numbers

  • Zero exponent, non-zero significand
  • Numbers close to zero
  • Hidden bit = 0, smaller bias

Operations on denormal numbers are slow!
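
For a concrete sense of the range (standard IEEE 754 single-precision values, not figures from the deck), the smallest positive normal float is about 1.18e-38, while denormals extend down to about 1.4e-45:

#include <cstdio>
#include <limits>

int main() {
    std::printf("smallest normal:   %e\n", std::numeric_limits<float>::min());         // ~1.18e-38
    std::printf("smallest denormal: %e\n", std::numeric_limits<float>::denorm_min());  // ~1.40e-45
}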

slide-160
SLIDE 160

Floating point handling

SLIDES 161-162

How to measure?

fp_assist.any

How many times did the CPU switch to the microcode FP handler?

$ perf stat -e fp_assist.any ./example2
0   ->          0
0.3 -> 15 728 640

SLIDES 163-165

How to fix it?

  • The nuclear option: -ffast-math
  • Sacrifice correctness to gain more FP performance
  • Set CPU flags:
  • Flush-to-zero - treat denormal outputs as 0
  • Denormals-to-zero - treat denormal inputs as 0

_mm_setcsr(_mm_getcsr() | 0x8040);
// or
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
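
A slightly fuller sketch of the intrinsic-based option, with the headers these macros come from (the MXCSR register is per-thread state, so each worker thread needs the call):

#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE, _MM_DENORMALS_ZERO_ON
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE, _MM_FLUSH_ZERO_ON

// Call once per thread before the FP-heavy work.
void enable_ftz_daz() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}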

slide-166
SLIDE 166

There are many other effects

  • NUMA
  • 4K aliasing
  • Misaligned accesses, cache line boundaries
  • Instruction data dependencies
  • Software prefetching
  • Non-temporal stores & cache pollution
  • Bandwidth saturation
  • DRAM refresh intervals
  • AVX/SSE transition penalty
  • ...
slide-167
SLIDE 167

Thank you!

For more examples visit: github.com/kobzol/hardware-effects

Jakub Beránek

Slides built with github.com/spirali/elsie

slide-168
SLIDE 168

Code (backup)

// tid - [0, NO_OF_THREADS)
void thread_fn(int tid, double* data) {
    size_t repetitions = 1024 * 1024 * 1024UL;
    for (size_t i = 0; i < repetitions; i++) {
        data[tid] *= i;
    }
}
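
A hypothetical driver for thread_fn (not shown on the slide): one thread per index, all writing to adjacent doubles in the same array:

#include <thread>
#include <vector>

int main() {
    constexpr int NO_OF_THREADS = 2;
    std::vector<double> data(NO_OF_THREADS, 1.0);

    std::vector<std::thread> threads;
    for (int tid = 0; tid < NO_OF_THREADS; tid++) {
        threads.emplace_back(thread_fn, tid, data.data());
    }
    for (auto& t : threads) {
        t.join();
    }
}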

slide-169
SLIDE 169

Result (backup)

slide-170
SLIDE 170

Cache system

SLIDES 171-179

Cache coherency

[Animation: variables A and B sit next to each other in memory, on the same cache line. One core reads A and the whole line (A and B) is copied into its cache; the other core reads B and gets its own copy of the same line. When a core then writes B, the coherency protocol must invalidate the copy held by the other core, even though that core only uses A.]

SLIDES 180-186

False sharing

double arr[16];

[Animation: the 16 doubles (8 B each) span two cache lines, with the 64 B cache line boundary falling between arr[7] and arr[8]. Thread 0 repeatedly writes one element and Thread 1 another; whenever both elements lie in the same cache line, every write by one thread invalidates the line in the other thread's cache.]

SLIDES 187-188

How to measure?

l2_rqsts.all_rfo

How many times did some core invalidate data in other cores?

$ perf stat -e l2_rqsts.all_rfo ./example3
1 thread  ->        59 711
2 threads -> 1 112 258 710
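
A common fix (a sketch, not from the deck) is to give each thread its own cache line, for example by padding each slot to 64 bytes:

#include <cstddef>

// Each slot is 64 B-aligned and 64 B large, so writes from different threads
// never touch the same cache line.
struct alignas(64) PaddedDouble {
    double value;
};

void thread_fn_padded(int tid, PaddedDouble* data) {
    const size_t repetitions = 1024 * 1024 * 1024UL;
    for (size_t i = 0; i < repetitions; i++) {
        data[tid].value *= i;
    }
}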