Blowing up the (C++11) atomic barrier: Optimizing C++11 atomics in LLVM

slide-1
SLIDE 1

Blowing up the (C++11) atomic barrier

Optimizing C++11 atomics in LLVM

Robin Morisset, Intern at Google

slide-2
SLIDE 2

  • Background: C++11 atomics
  • Optimizing around atomics
  • Fence elimination
  • Miscellaneous optimizations
  • Further work: Problems with atomics

slide-4
SLIDE 4

Thread 1: x <- 1; print y;
Thread 2: y <- 1; print x;

Can this possibly print 0-0?

slide-5
SLIDE 5

Thread 1: print y; x <- 1;
Thread 2: print x; y <- 1;

Can this possibly print 0-0?

Yes, if your compiler reorders the accesses.

slide-6
SLIDE 6

Thread 1: x <- 1; mfence; print y;
Thread 2: y <- 1; mfence; print x;

Can this possibly print 0-0?

Yes on x86, unless each thread has a fence (mfence) that flushes its (FIFO) store buffer.

slide-7
SLIDE 7

Thread 1: x <- 42; ready <- 1;
Thread 2: if (ready) print x;

Can this possibly print 0?

slide-8
SLIDE 8

Thread 1: x <- 42; dmb ish; ready <- 1;
Thread 2: if (ready) print x;

Can this possibly print 0?

Yes on ARM. A fence (dmb ish) in Thread 1 flushes the (non-FIFO) store buffer.

slide-9
SLIDE 9

Thread 1: x <- 42; dmb ish; ready <- 1;
Thread 2: if (ready) dmb ish; print x;

Can this possibly print 0?

Yes on ARM: preventing it needs 2 fences.

  • Thread 1: flush your (non-FIFO) store buffer
  • Thread 2: don't speculate reads across the fence

slide-10
SLIDE 10
C11/C++11 memory model

Doing it portably

  • data race (dynamic) = undefined
  • no data race (using mutexes) = intuitive behavior (“Sequentially consistent”)
  • for lock-free code: atomic accesses

slide-11
SLIDE 11

Sequentially consistent

Thread 1: x.store(1, seq_cst); print(y.load(seq_cst));
Thread 2: y.store(1, seq_cst); print(x.load(seq_cst));
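As a rough illustration, the two-thread program on this slide can be written as a small C++ harness. The function names and the idea of counting outcomes are assumptions made for this sketch, not part of the talk. With seq_cst on every access, the 0-0 outcome is forbidden, so the counter should stay at zero.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One run of the store-buffering test from the slide. Returns true when
// both threads read 0, which seq_cst forbids. Names are illustrative.
bool prints_zero_zero_once() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;
    std::thread t1([&] {
        x.store(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);  // "print y"
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);  // "print x"
    });
    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}

// Repeats the run and counts how often the forbidden 0-0 outcome appears.
int count_zero_zero(int runs) {
    int hits = 0;
    for (int i = 0; i < runs; ++i)
        if (prints_zero_zero_once()) ++hits;
    return hits;
}
```

Calling count_zero_zero(200) should return 0 on a conforming implementation; weakening the orderings to relaxed would make a nonzero count permissible.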

slide-12
SLIDE 12

Release/acquire

Thread 1: x = 42; ready.store(1, release);
Thread 2: if (ready.load(acquire)) print(x);
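A minimal sketch of this message-passing pattern in real C++ follows. It assumes the reader spins until the flag is set (the slide uses a single if), and the function name is made up for the example. Because the acquire load that sees 1 synchronizes with the release store, the plain read of x is ordered after the plain write.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Writer publishes a plain int through a release store; reader picks it up
// after an acquire load observes the flag. Returns what the reader saw.
int message_passing_once() {
    int x = 0;                   // ordinary, non-atomic payload
    std::atomic<int> ready{0};
    int seen = -1;
    std::thread writer([&] {
        x = 42;
        ready.store(1, std::memory_order_release);
    });
    std::thread reader([&] {
        // Assumption: spin instead of the slide's single `if`.
        while (!ready.load(std::memory_order_acquire))
            std::this_thread::yield();
        seen = x;                // ordered after x = 42, so no data race
    });
    writer.join();
    reader.join();
    return seen;
}
```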

slide-15
SLIDE 15

  • Background: C++11 atomics
  • Optimizing around atomics
  • Fence elimination
  • Miscellaneous optimizations
  • Further work: Problems with atomics

slide-16
SLIDE 16

Compiler optimizations?

void foo(int *x, int n) {
    for (int i = 0; i < n; ++i) {
        *x *= 42;
    }
}

After LICM:

void foo(int *x, int n) {
    int tmp = *x;
    for (int i = 0; i < n; ++i) {
        tmp *= 42;
    }
    *x = tmp;
}

slide-18
SLIDE 18

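Compiler optimizations?

With the loop not executed (for example, n == 0), the original function does nothing:

void foo(int *x, int n) { }

After LICM, it still reads and writes *x:

void foo(int *x, int n) {
    int tmp = *x;
    *x = tmp;
}

++(*x); // in another thread...

The introduced store can race with this increment and overwrite it.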

slide-19
SLIDE 19

Never introduce a store where there was none
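One way to respect this rule while still hoisting the accesses is to guard the hoisted code so that *x is only touched when the original loop would have touched it. The sketch below is an assumption about how that could look at the source level; the name foo_hoisted is made up, and this is not claimed to be what LLVM emits.

```cpp
#include <cassert>

// Hoisted form of foo() that performs no load or store of *x when the
// original loop body would never have run (n <= 0).
void foo_hoisted(int *x, int n) {
    if (n <= 0) return;          // original code makes no access here
    int tmp = *x;                // single load instead of n loads
    for (int i = 0; i < n; ++i) {
        tmp *= 42;
    }
    *x = tmp;                    // single store; the original stored too
}
```

With n == 0 the function never dereferences x, so a concurrent ++(*x) in another thread is left undisturbed.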

slide-20
SLIDE 20

x = 42; … x = 43;

Dead store elimination ?

slide-21
SLIDE 21

Dead store elimination?

x = 42;
flag1.store(true, release);
while (!flag2.load(acquire)) continue;
x = 43;

slide-22
SLIDE 22

Dead store elimination?

Thread 1:
x = 42;
flag1.store(true, release);
while (!flag2.load(acquire)) continue;
x = 43;

Thread 2:
while (!flag1.load(acquire)) continue;
print(x);
flag2.store(true, release);

slide-23
SLIDE 23

Dead store elimination?

Thread 1:
x = 42;
while (!flag2.load(acquire)) continue;
x = 43;

Thread 2:
print(x);
flag2.store(true, release);

Race! (print(x) is not ordered after x = 42.)

slide-24
SLIDE 24

Dead store elimination?

Thread 1:
x = 42;
flag1.store(true, release);
x = 43;

Thread 2:
while (!flag1.load(acquire)) continue;
print(x);

Race! (print(x) is not ordered before x = 43.)

slide-25
SLIDE 25

Anything can happen to memory between a release and an acquire
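The two-flag handshake from the earlier dead-store slide can be run as real C++ to see why the first store is not dead. This is a sketch with an invented function name; the pair it returns (value printed by the second thread, final value of x) exists only for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Thread 1 writes 42, releases flag1, waits on flag2, then writes 43.
// Thread 2 waits on flag1, reads x, then releases flag2.
// Returns {value read by thread 2, final value of x}.
std::pair<int, int> handshake_once() {
    int x = 0;
    std::atomic<bool> flag1{false}, flag2{false};
    int printed = -1;
    std::thread t1([&] {
        x = 42;                                        // not a dead store
        flag1.store(true, std::memory_order_release);
        while (!flag2.load(std::memory_order_acquire))
            std::this_thread::yield();
        x = 43;
    });
    std::thread t2([&] {
        while (!flag1.load(std::memory_order_acquire))
            std::this_thread::yield();
        printed = x;                                   // must observe 42
        flag2.store(true, std::memory_order_release);
    });
    t1.join();
    t2.join();
    return {printed, x};
}
```

If a compiler deleted x = 42 here, the second thread would read 0 in a race-free program, which is why a release followed by an acquire blocks the elimination.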

slide-26
SLIDE 26

  • Background: C++11 atomics
  • Optimizing around atomics
  • Fence elimination
  • Miscellaneous optimizations
  • Further work: Problems with atomics

slide-27
SLIDE 27

Fence elimination

int t = y.load(acquire);
…
x.store(1, release);

Generated code (ARM):

ldr r0, [r0]
dmb ish
…
dmb ish
str r2, [r1]

slide-28
SLIDE 28

2 fences on main path

[Figure: control-flow graph. Main path: ldr …; dmb ish; dmb ish; str …. Other block: str ….]

slide-29
SLIDE 29

1 fence on main path

[Figure: control-flow graph. Main path: ldr …; dmb ish; str …. Other block: dmb ish; str ….]

slide-30
SLIDE 30

1 fence on main path

[Figure: control-flow graph with instructions ldr …; dmb ish; str …; str …; dmb ish.]

slide-31
SLIDE 31

  • Build graph from CFG

[Figure: graph with nodes ldr …, str …, str …]

slide-32
SLIDE 32

  • Build graph from CFG
  • Identify sources/sinks

[Figure: graph with nodes ldr …, str …, str …, plus added Source and Sink nodes]

slide-34
SLIDE 34

  • Build graph from CFG
  • Identify sources/sinks
  • Annotate with frequency

[Figure: graph with Source, ldr …, str …, str …, Sink; edge weights 5, 5, 2, 2, ∞, ∞, ∞]

slide-35
SLIDE 35

  • Build graph from CFG
  • Identify sources/sinks
  • Annotate with frequency
  • Find min-cut

2 + 5 = 7 is minimum

[Figure: graph with Source, ldr …, str …, str …, Sink; edge weights 5, 5, 2, 2, ∞, ∞, ∞]

slide-36
SLIDE 36

  • Build graph from CFG
  • Identify sources/sinks
  • Annotate with frequency
  • Find min-cut
  • Move fences

Result: ldr …; dmb ish; str …; dmb ish; str …
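The min-cut step can be sketched with a textbook max-flow computation, since the weight of a minimum cut equals the maximum flow. Everything below is an assumption for illustration: the graph shape, the node numbering, and the choice of the Edmonds-Karp algorithm are not taken from the talk or from LLVM. The edge weights are picked so that the cheapest cut costs 2 + 5 = 7, as on the slide.

```cpp
#include <algorithm>
#include <cassert>
#include <queue>
#include <vector>

using Graph = std::vector<std::vector<long long>>;  // capacity matrix
constexpr long long kInf = 1LL << 40;               // stands in for "∞"

// Edmonds-Karp max-flow; its value equals the weight of a minimum cut.
long long min_cut_weight(Graph cap, int source, int sink) {
    const int n = static_cast<int>(cap.size());
    long long total = 0;
    for (;;) {
        // Breadth-first search for an augmenting path in the residual graph.
        std::vector<int> parent(n, -1);
        parent[source] = source;
        std::queue<int> q;
        q.push(source);
        while (!q.empty() && parent[sink] == -1) {
            int u = q.front();
            q.pop();
            for (int v = 0; v < n; ++v) {
                if (parent[v] == -1 && cap[u][v] > 0) {
                    parent[v] = u;
                    q.push(v);
                }
            }
        }
        if (parent[sink] == -1) return total;  // no augmenting path left
        long long push = kInf;
        for (int v = sink; v != source; v = parent[v])
            push = std::min(push, cap[parent[v]][v]);
        for (int v = sink; v != source; v = parent[v]) {
            cap[parent[v]][v] -= push;         // use forward capacity
            cap[v][parent[v]] += push;         // add residual capacity
        }
        total += push;
    }
}

// Hypothetical CFG: 0 = Source, 1 = block with the acquire load,
// 2 = cold block, 3 = hot store, 4 = cold store, 5 = Sink.
Graph example_cfg() {
    Graph cap(6, std::vector<long long>(6, 0));
    cap[0][1] = kInf;  // Source -> load
    cap[1][3] = 5;     // hot edge, frequency 5
    cap[1][2] = 2;     // cold edge, frequency 2
    cap[2][4] = 2;     // cold edge, frequency 2
    cap[3][5] = kInf;  // store -> Sink
    cap[4][5] = kInf;  // store -> Sink
    return cap;
}
```

Here min_cut_weight(example_cfg(), 0, 5) evaluates to 7: the two augmenting paths carry 5 and 2 units. In a real pass the cut edges, not just the total weight, would say where to place the fences.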

slide-37
SLIDE 37

while (flag.load(acquire)) {}

.loop:
    ldr r0, [r1]
    dmb ish
    bnz .loop

slide-38
SLIDE 38

while (flag.load(acquire)) {}

.loop:
    ldr r0, [r1]
    bnz .loop
    dmb ish
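The same idea can be expressed by hand at the source level: spin with relaxed loads and issue one acquire fence after the loop exits. This is a sketch of that idiom under the C++11 fence rules; the function name and the payload variable are invented for the example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Waits for `flag` to become false using relaxed loads, then issues a
// single acquire fence, mirroring the "fence after the loop" code shape.
int wait_then_read() {
    int data = 0;                       // plain payload guarded by flag
    std::atomic<bool> flag{true};
    int seen = -1;
    std::thread setter([&] {
        data = 7;
        flag.store(false, std::memory_order_release);
    });
    std::thread waiter([&] {
        while (flag.load(std::memory_order_relaxed))
            std::this_thread::yield();  // no fence inside the loop
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = data;                    // ordered after data = 7
    });
    setter.join();
    waiter.join();
    return seen;
}
```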

slide-39
SLIDE 39

.loop:
    ldr r0, [r1]
    dmb ish
    bnz .loop
    …
    memory access

[Figure: graph with Source and Sink; edge frequencies 98, 100, 2]

slide-40
SLIDE 40

.loop:
    ldr r0, [r1]
    bnz .loop
    …
    dmb ish
    memory access

[Figure: graph with Source and Sink; edge frequencies 98, 100, 2]

slide-41
SLIDE 41

Background: C++11 atomics Optimizing around atomics Fence elimination Miscellaneous optimizations Further work: Problems with atomics

slide-42
SLIDE 42

x.load(release) ?

slide-43
SLIDE 43

x.fetch_add(0, release)

x.load(release) ?

slide-45
SLIDE 45

x86

x.fetch_add(0, release)
    mov %eax, $0
    lock xadd (%ebx), %eax

x.load(release) ?
    mfence
    mov %eax, (%ebx)

7200% speedup for a seqlock*
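For context, here is a sketch of a seqlock whose reader ends with fetch_add(0, release), the kind of code where this operation shows up. It assumes a single writer and two int fields; the class and member names are invented, and it is not the benchmark behind the figure above.

```cpp
#include <atomic>
#include <cassert>
#include <utility>

// Minimal single-writer seqlock protecting two ints.
class SeqLock {
public:
    // Odd sequence number means "write in progress".
    void write(int a, int b) {
        unsigned s = seq_.load(std::memory_order_relaxed);
        seq_.store(s + 1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release);
        a_.store(a, std::memory_order_relaxed);
        b_.store(b, std::memory_order_relaxed);
        seq_.store(s + 2, std::memory_order_release);
    }

    // Retries until it reads the same even sequence number before and
    // after the data loads.
    std::pair<int, int> read() {
        unsigned s0, s1;
        int a, b;
        do {
            s0 = seq_.load(std::memory_order_acquire);
            a = a_.load(std::memory_order_relaxed);
            b = b_.load(std::memory_order_relaxed);
            // "Release load": a read-modify-write that adds 0.
            s1 = seq_.fetch_add(0, std::memory_order_release);
        } while (s0 != s1 || (s0 & 1u));
        return {a, b};
    }

private:
    std::atomic<unsigned> seq_{0};
    std::atomic<int> a_{0}, b_{0};
};
```

The read-modify-write in read() is what a naive x86 lowering turns into lock xadd, even though it never changes the value.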

slide-46
SLIDE 46

                        Power             ARM
x.store(0, release)     hwsync; stw …     dmb sy; str …
x.load(acquire)         lwz …; hwsync     ldr …; dmb sy

slide-47
SLIDE 47

                        Power             ARM
x.store(0, release)     lwsync; stw …     dmb ish; str …
x.load(acquire)         lwz …; lwsync     ldr …; dmb ish

slide-48
SLIDE 48

                        Power             ARM (Swift)
x.store(0, release)     lwsync; stw …     dmb ishst; str …
x.load(acquire)         lwz …; lwsync     ldr …; dmb ish

slide-49
SLIDE 49

Power

x.store(2, relaxed)

    rlwinm r2, r3, 3, 27, 28
    li r4, 2
    xori r5, r2, 24
    rlwinm r2, r3, 0, 0, 29
    li r3, 255
    slw r4, r4, r5
    slw r3, r3, r5
    and r4, r4, r3
LBB4_1:
    lwarx r5, 0, r2
    andc r5, r5, r3
    or r5, r4, r5
    stwcx. r5, 0, r2
    bne cr0, LBB4_1

slide-52
SLIDE 52

Power

x.store(2, relaxed)

Shuffling:
    rlwinm r2, r3, 3, 27, 28
    li r4, 2
    xori r5, r2, 24
    rlwinm r2, r3, 0, 0, 29
    li r3, 255
    slw r4, r4, r5
    slw r3, r3, r5
    and r4, r4, r3
Loop:
LBB4_1:
    lwarx r5, 0, r2        (load linked)
    andc r5, r5, r3
    or r5, r4, r5
    stwcx. r5, 0, r2       (store conditional)
    bne cr0, LBB4_1

slide-53
SLIDE 53

Power

x.store(2, relaxed)
    li r2, 2
    stb r2, 0(r3)

slide-54
SLIDE 54

x86

x.store(2, relaxed)
    mov %eax, $2
    mov (%ebx), %eax

slide-55
SLIDE 55

x86

x.store(2, relaxed)
    mov (%ebx), $2

slide-56
SLIDE 56

  • Background: C++11 atomics
  • Optimizing around atomics
  • Fence elimination
  • Miscellaneous optimizations
  • Further work: Problems with atomics

slide-58
SLIDE 58

Relaxed attribute

Thread 1: print(y.load(relaxed)); x.store(1, relaxed);
Thread 2: print(x.load(relaxed)); y.store(1, relaxed);

Can print 1-1
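The relaxed test above can be written out as a small C++ function that returns both loaded values. The struct and function names are assumptions for this sketch. The memory model allows any combination of 0 and 1 here, including 1-1, although many machines will rarely or never show that outcome in practice.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// The two values "printed" by the threads in one run.
struct Outcome {
    int r1;  // what thread 1 read from y
    int r2;  // what thread 2 read from x
};

// One run of the load-buffering test with relaxed accesses only.
Outcome load_buffering_once() {
    std::atomic<int> x{0}, y{0};
    Outcome out{-1, -1};
    std::thread t1([&] {
        out.r1 = y.load(std::memory_order_relaxed);
        x.store(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        out.r2 = x.load(std::memory_order_relaxed);
        y.store(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return out;
}
```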

slide-59
SLIDE 59

Relaxed attribute

Thread 1: t_y = y.load(relaxed); x.store(t_y, relaxed);
Thread 2: t_x = x.load(relaxed); y.store(t_x, relaxed);

x = y = ???

slide-60
SLIDE 60

Relaxed attribute

Thread 1: if (y.load(relaxed)) { x.store(1, relaxed); print("foo"); }
Thread 2: if (x.load(relaxed)) { y.store(1, relaxed); print("bar"); }

Can print foobar!

slide-61
SLIDE 61

Consume attribute

Thread 1: *x = 42; x.store(1, release);
Thread 2: t = x.load(acquire); print(*t);

slide-62
SLIDE 62

Consume attribute

Thread 1: *x = 42; x.store(1, release);
Thread 2: t = x.load(consume); print(*t);

Ordered
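A pointer-publishing version of the ordered case might look like the sketch below. The variable and function names are invented, and the reader spins until the pointer is non-null, which the slide does not show. Note that current compilers generally treat memory_order_consume as acquire, so this shows the intended usage rather than a guaranteed cheaper code path.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Producer fills a payload and publishes its address with a release store.
// Consumer loads the pointer with consume and dereferences it, so the read
// of the payload carries a data dependency on the loaded pointer.
int consume_once() {
    int payload = 0;
    std::atomic<int*> ptr{nullptr};
    int seen = -1;
    std::thread producer([&] {
        payload = 42;
        ptr.store(&payload, std::memory_order_release);
    });
    std::thread consumer([&] {
        int* t = nullptr;
        while ((t = ptr.load(std::memory_order_consume)) == nullptr)
            std::this_thread::yield();
        seen = *t;  // dependent on t, hence ordered after payload = 42
    });
    producer.join();
    consumer.join();
    return seen;
}
```

Reading through some unrelated pointer instead of t would lose the dependency, which is the unordered case on the following slide.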

slide-63
SLIDE 63

Consume attribute

Thread 1: *x = 42; x.store(1, release);
Thread 2: t = x.load(consume); print(*y);

Unordered!

slide-64
SLIDE 64

Consume attribute

Thread 1: *x = 42; x.store(1, release);
Thread 2: t = x.load(consume); print(*(y + t - t));

???

slide-65
SLIDE 65
Conclusion

  • Atomics = portable lock-free code in C11/C++11
  • Tricky to compile, but can be done
  • Lots of open questions

slide-66
SLIDE 66

Questions ?