
Blowing up the (C++11) atomic barrier: Optimizing C++11 atomics in LLVM

Robin Morisset, Intern at Google


  1. Blowing up the (C++11) atomic barrier
     Optimizing C++11 atomics in LLVM
     Robin Morisset, Intern at Google

  2. Outline
     ● Background: C++11 atomics
     ● Optimizing around atomics
     ● Fence elimination
     ● Miscellaneous optimizations
     ● Further work: Problems with atomics

  3. Outline (now: Background: C++11 atomics)

  4. Can this possibly print 0-0?
     Thread 1: x <- 1; print y;
     Thread 2: y <- 1; print x;

  5. Can this possibly print 0-0? Yes, if your compiler reorders accesses.
     Thread 1: print y; x <- 1;
     Thread 2: print x; y <- 1;

  6. Can this possibly print 0-0? Yes on x86: needs a fence to flush your (FIFO) store buffer.
     Thread 1: x <- 1; mfence; print y;
     Thread 2: y <- 1; mfence; print x;

  7. Can this possibly print 0?
     Thread 1: x <- 42; ready <- 1;
     Thread 2: if (ready) print x;

  8. Can this possibly print 0? Yes on ARM.
     Thread 1: x <- 42; dmb ish; ready <- 1;   (flush your non-FIFO store buffer)
     Thread 2: if (ready) print x;

  9. Can this possibly print 0? Yes on ARM: needs 2 fences to prevent it.
     Thread 1: x <- 42; dmb ish; ready <- 1;   (flush your non-FIFO store buffer)
     Thread 2: if (ready) dmb ish; print x;    (don't speculate reads across)

  10. Doing it portably: C11/C++11 memory model
      ● data race (dynamic) = undefined
      ● no data race (using mutexes) = intuitive behavior (“Sequentially consistent”)
      ● for lock-free code: atomic accesses

  11. Sequentially consistent
      Thread 1: x.store(1, seq_cst); print(y.load(seq_cst));
      Thread 2: y.store(1, seq_cst); print(x.load(seq_cst));
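The sequentially consistent version of the 0-0 test can be written as a small C++11 program. This is a minimal sketch; the function name `run_store_buffering` and the use of local variables in place of `print` are assumptions made for illustration, not part of the slides.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Runs the two-thread test once and returns what each thread would have
// printed: {value of y seen by thread 1, value of x seen by thread 2}.
std::pair<int, int> run_store_buffering() {
    std::atomic<int> x{0}, y{0};
    int seen_y = -1, seen_x = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_seq_cst);
        seen_y = y.load(std::memory_order_seq_cst);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_seq_cst);
        seen_x = x.load(std::memory_order_seq_cst);
    });
    t1.join();
    t2.join();
    return {seen_y, seen_x};
}
```

With `seq_cst` on all four accesses there is a single total order over them, so whichever store comes first in that order is visible to the other thread's load, and the 0-0 outcome is ruled out.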

  12. Release/acquire
      Thread 1: x = 42; ready.store(1, release);
      Thread 2: if (ready.load(acquire)) print(x);
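The release/acquire pattern can be sketched in C++11 as follows. The function name `run_message_passing` is hypothetical, and the consumer spins until `ready` is set (instead of the single `if` on the slide) so that the example always has a value to return.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Producer writes plain data, then publishes it with a release store.
// Consumer waits with acquire loads, then reads the plain data.
int run_message_passing() {
    int x = 0;                    // ordinary, non-atomic data
    std::atomic<int> ready{0};
    int seen = -1;

    std::thread producer([&] {
        x = 42;
        ready.store(1, std::memory_order_release);
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = x;                 // ordered after the producer's write to x
    });
    producer.join();
    consumer.join();
    return seen;
}
```

The acquire load that observes 1 synchronizes with the release store, so the read of `x` is not a data race and yields 42.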


  15. Outline (now: Optimizing around atomics)

  16. Compiler optimizations?
      Before:
        void foo(int *x, int n) {
          for (int i = 0; i < n; ++i) {
            *x *= 42;
          }
        }
      After LICM:
        void foo(int *x, int n) {
          int tmp = *x;
          for (int i = 0; i < n; ++i) {
            tmp *= 42;
          }
          *x = tmp;
        }

  17. Compiler optimizations? (when the loop body never runs)
      Before:
        void foo(int *x, int n) {
        }
      After LICM:
        void foo(int *x, int n) {
          int tmp = *x;
          *x = tmp;
        }

  18. Compiler optimizations? (when the loop body never runs)
      Before:
        void foo(int *x, int n) {
        }
      After LICM:
        void foo(int *x, int n) {
          int tmp = *x;
          *x = tmp;
        }
      In another thread:
        ++(*x);

  19. Never introduce a store where there was none
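One way to keep the LICM transformation while respecting this rule is to guard the hoisted store so it only happens when the original loop would have stored. The sketch below is an illustration, not the transformation LLVM performs; `foo_hoisted` is a hypothetical name.

```cpp
#include <cassert>

// Original: stores to *x only if the loop body runs at least once.
void foo(int *x, int n) {
    for (int i = 0; i < n; ++i) {
        *x *= 42;
    }
}

// Hoisted variant: the load and store are moved out of the loop, but the
// guard ensures no store is introduced on the path where n <= 0.
void foo_hoisted(int *x, int n) {
    if (n <= 0) {
        return;  // the original performs no access here, so neither do we
    }
    int tmp = *x;
    for (int i = 0; i < n; ++i) {
        tmp *= 42;
    }
    *x = tmp;
}
```

When `n > 0` the original already writes `*x`, so a concurrent `++(*x)` in another thread would be a data race in the source program as well; the guarded version adds no new racy store.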

  20. Dead store elimination?
      x = 42; … x = 43;

  21. Dead store elimination?
      x = 42;
      flag1.store(true, release);
      while (!flag2.load(acquire)) continue;
      x = 43;

  22. Dead store elimination?
      Thread 1: x = 42; flag1.store(true, release); while (!flag2.load(acquire)) continue; x = 43;
      Thread 2: while (!flag1.load(acquire)) continue; print(x); flag2.store(true, release);

  23. Dead store elimination? Race!
      Thread 1: x = 42; while (!flag2.load(acquire)) continue; x = 43;
      Thread 2: print(x); flag2.store(true, release);

  24. Dead store elimination? Race!
      Thread 1: x = 42; flag1.store(true, release); x = 43;
      Thread 2: while (!flag1.load(acquire)) continue; print(x);

  25. Anything can happen to memory between a release and an acquire
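The two-flag handshake from the dead store elimination slides can be run as a C++11 program to show why the first store is observable. This is a sketch; `run_handshake` is a hypothetical name and the returned pair stands in for `print`.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Returns {value of x read by thread 2, final value of x}.
std::pair<int, int> run_handshake() {
    int x = 0;
    std::atomic<bool> flag1{false}, flag2{false};
    int observed = -1;

    std::thread t1([&] {
        x = 42;                                        // looks dead, but is not
        flag1.store(true, std::memory_order_release);
        while (!flag2.load(std::memory_order_acquire)) continue;
        x = 43;
    });
    std::thread t2([&] {
        while (!flag1.load(std::memory_order_acquire)) continue;
        observed = x;                                  // reads between the two stores
        flag2.store(true, std::memory_order_release);
    });
    t1.join();
    t2.join();
    return {observed, x};
}
```

Thread 2's read happens after `x = 42` (via `flag1`) and before `x = 43` (via `flag2`), so removing the first store would change the value it sees.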

  26. Outline (now: Fence elimination)

  27. Fence elimination
      Source: int t = y.load(acquire); … x.store(1, release);
      ARM:    ldr r0, [r0]; dmb ish; …; dmb ish; str r2, [r1]

  28. CFG: ldr …; dmb ish; str …; dmb ish; str …
      2 fences on main path

  29. CFG: ldr …; dmb ish; str …; dmb ish; str …
      1 fence on main path


  31. Steps: Build graph from CFG
      Graph nodes: ldr …; str …; str …

  32. Steps: Build graph from CFG; Identify sources/sinks
      Graph nodes: Source; ldr …; str …; str …; Sink


  34. Steps: Build graph from CFG; Identify sources/sinks; Annotate with frequency
      Graph nodes: Source; ldr …; str …; str …; Sink
      Edge weights: 5, ∞, 2, ∞, 2, ∞, 5

  35. Steps: Build graph from CFG; Identify sources/sinks; Annotate with frequency; Find min-cut
      Graph nodes: Source; ldr …; str …; str …; Sink
      Edge weights: 5, ∞, 2, ∞, 2, ∞, 5
      2 + 5 = 7 is minimum

  36. Steps: Build graph from CFG; Identify sources/sinks; Annotate with frequency; Find min-cut; Move fences
      Result: ldr …; dmb ish; str …; dmb ish; str …
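The "find min-cut" step can be illustrated with a textbook max-flow computation. The sketch below uses Edmonds-Karp, which is a stand-in chosen for brevity and not necessarily the algorithm the LLVM pass uses. The graph shape is an assumption: the transcript preserves the weights (5, 2, 2, 5 and ∞) but not which edges carry them, so the edges in `example_min_cut` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <queue>
#include <vector>

// Edmonds-Karp max-flow on a dense capacity matrix. By max-flow/min-cut
// duality, the returned value is also the weight of the cheapest cut.
long long max_flow(std::vector<std::vector<long long> > cap, int s, int t) {
    const int n = static_cast<int>(cap.size());
    long long flow = 0;
    for (;;) {
        // Breadth-first search for a shortest augmenting path.
        std::vector<int> parent(n, -1);
        parent[s] = s;
        std::queue<int> q;
        q.push(s);
        while (!q.empty() && parent[t] == -1) {
            int u = q.front();
            q.pop();
            for (int v = 0; v < n; ++v) {
                if (parent[v] == -1 && cap[u][v] > 0) {
                    parent[v] = u;
                    q.push(v);
                }
            }
        }
        if (parent[t] == -1) return flow;  // no augmenting path left

        long long push = LLONG_MAX;
        for (int v = t; v != s; v = parent[v])
            push = std::min(push, cap[parent[v]][v]);
        for (int v = t; v != s; v = parent[v]) {
            cap[parent[v]][v] -= push;  // use up forward capacity
            cap[v][parent[v]] += push;  // add residual capacity
        }
        flow += push;
    }
}

// Hypothetical graph: 0 = Source, 1 = ldr, 2 = first str, 3 = last str, 4 = Sink.
// Finite weights are execution frequencies of CFG edges; INF marks edges
// on which a fence may not be placed.
long long example_min_cut() {
    const long long INF = 1000000000LL;
    std::vector<std::vector<long long> > cap(5, std::vector<long long>(5, 0));
    cap[0][1] = INF;  // Source -> ldr
    cap[1][2] = 2;    // ldr -> first str (cold path)
    cap[1][3] = 5;    // ldr -> last str (hot path)
    cap[2][3] = 2;    // first str -> last str
    cap[2][4] = INF;  // first str -> Sink
    cap[3][4] = INF;  // last str -> Sink
    return max_flow(cap, 0, 4);
}
```

On this assumed graph the cheapest cut severs the two edges leaving `ldr`, with total weight 2 + 5 = 7; fences would then be placed on the cut edges.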

  37. Source: while(flag.load(acquire)) {}
      ARM:    .loop: ldr r0, [r1]; dmb ish; bnz .loop

  38. Source: while(flag.load(acquire)) {}
      ARM:    .loop: ldr r0, [r1]; bnz .loop; dmb ish

  39. Graph: Source; .loop: ldr r0, [r1]; dmb ish; bnz .loop; …; memory access; Sink
      Edge frequencies: 98, 100, 2

  40. Graph: Source; .loop: ldr r0, [r1]; bnz .loop; …; dmb ish; memory access; Sink
      Edge frequencies: 98, 100, 2
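The effect of moving the fence out of the spin loop has a source-level analogue in C++11: spin with relaxed loads and issue one acquire fence after the loop exits. The slides describe a backend pass working on ARM `dmb` instructions, so this is only an analogy; `wait_then_read` and `run_hoisted_fence_demo` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Spin while the flag is non-zero, as in the slide's loop, but pay for
// ordering once after the loop instead of on every iteration.
int wait_then_read(const std::atomic<int>& flag, const int& data) {
    while (flag.load(std::memory_order_relaxed)) {
    }
    std::atomic_thread_fence(std::memory_order_acquire);
    return data;
}

int run_hoisted_fence_demo() {
    std::atomic<int> flag{1};
    int data = 0;
    std::thread producer([&] {
        data = 42;
        flag.store(0, std::memory_order_release);  // lets the waiter proceed
    });
    int seen = wait_then_read(flag, data);
    producer.join();
    return seen;
}
```

The relaxed load that finally reads 0 takes its value from the release store, and the acquire fence sequenced after it establishes the same ordering an acquire load would have, so the read of `data` returns 42.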

  41. Outline (now: Miscellaneous optimizations)

  42. x.load(release)?

  43. x.load(release)?
      x.fetch_add(0, release)

  44. x86: x.load(release)?
      x.fetch_add(0, release) compiles to:
        mov %eax, $0
        lock xadd (%ebx), %eax

  45. x86: x.load(release)?
      x.fetch_add(0, release), before:
        mov %eax, $0
        lock xadd (%ebx), %eax
      After:
        mfence
        mov %eax, (%ebx)
      7200% speedup for a seqlock*
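A seqlock reader is where a "release load" is wanted: the data reads must not move past the final read of the sequence number. The sketch below shows where `fetch_add(0, release)` sits in such a reader. The `SeqLock` type, its two-field payload, the single-writer restriction, and the choice of orderings on the writer side are assumptions made for this illustration, not code from the talk.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Single-writer seqlock sketch (hypothetical type).
struct SeqLock {
    std::atomic<unsigned> seq{0};
    std::atomic<int> a{0};
    std::atomic<int> b{0};

    void write(int x, int y) {
        // Odd sequence number: update in progress. The acquire half pairs
        // with a reader's release RMW on seq.
        seq.fetch_add(1, std::memory_order_acq_rel);
        a.store(x, std::memory_order_relaxed);
        b.store(y, std::memory_order_relaxed);
        // Even again: publish the new values.
        seq.fetch_add(1, std::memory_order_release);
    }

    std::pair<int, int> read() {
        for (;;) {
            unsigned s0 = seq.load(std::memory_order_acquire);
            int x = a.load(std::memory_order_relaxed);
            int y = b.load(std::memory_order_relaxed);
            // The operation the slides write as x.load(release).
            unsigned s1 = seq.fetch_add(0, std::memory_order_release);
            if (s0 == s1 && (s0 & 1u) == 0) return {x, y};
        }
    }
};

// One writer publishes pairs (i, -i); the reader checks it never sees a
// torn pair. Returns true if every read was consistent.
bool seqlock_demo(int rounds) {
    SeqLock lock;
    std::thread writer([&] {
        for (int i = 1; i <= rounds; ++i) lock.write(i, -i);
    });
    bool ok = true;
    for (int i = 0; i < rounds; ++i) {
        std::pair<int, int> p = lock.read();
        if (p.first + p.second != 0) ok = false;
    }
    writer.join();
    return ok;
}
```

The `fetch_add(0, release)` in `read` is the hot operation the x86 slide targets: it never changes the value, so emitting `mfence` plus a plain `mov` instead of `lock xadd` avoids taking the cache line exclusively on every read.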

  46.                      Power            ARM
      x.store(0, release)  hwsync; stw …    dmb sy; str …
      x.load(acquire)      lwz …; hwsync    ldr …; dmb sy

  47.                      Power            ARM
      x.store(0, release)  lwsync; stw …    dmb ish; str …
      x.load(acquire)      lwz …; lwsync    ldr …; dmb ish

  48.                      Power            ARM (Swift)
      x.store(0, release)  lwsync; stw …    dmb ishst; str …
      x.load(acquire)      lwz …; lwsync    ldr …; dmb ish

  49. Power: x.store(2, relaxed) compiles to:
        rlwinm r2, r3, 3, 27, 28
        li r4, 2
        xori r5, r2, 24
        rlwinm r2, r3, 0, 0, 29
        li r3, 255
        slw r4, r4, r5
        slw r3, r3, r5
        and r4, r4, r3
      LBB4_1:
        lwarx r5, 0, r2
        andc r5, r5, r3
        or r5, r4, r5
        stwcx. r5, 0, r2
        bne cr0, LBB4_1

  50. Power: x.store(2, relaxed), same code as slide 49, annotated:
      the instructions before LBB4_1 are shuffling.

  51. Power: x.store(2, relaxed), same code as slide 49, annotated:
      the instructions before LBB4_1 are shuffling;
      LBB4_1 through bne cr0, LBB4_1 is a loop.

  52. Power: x.store(2, relaxed), same code as slide 49, annotated:
      the instructions before LBB4_1 are shuffling;
      lwarx is a load linked;
      stwcx. is a store conditional;
      LBB4_1 through bne cr0, LBB4_1 is a loop.

  53. Power: x.store(2, relaxed) compiles to:
        li r2, 2
        stb r2, 0(r3)

  54. x86: x.store(2, relaxed) compiles to:
        mov %eax, $2
        mov (%ebx), %eax

  55. x86: x.store(2, relaxed) compiles to:
        mov (%ebx), $2

  56. Outline (now: Further work: Problems with atomics)

  57. Relaxed attribute
      Thread 1: print(y.load(relaxed)); x.store(1, relaxed);
      Thread 2: print(x.load(relaxed)); y.store(1, relaxed);

  58. Relaxed attribute
      Thread 1: print(y.load(relaxed)); x.store(1, relaxed);
      Thread 2: print(x.load(relaxed)); y.store(1, relaxed);
      Can print 1-1
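The relaxed test above can be written in C++11 as shown below. This is a sketch with a hypothetical function name, `run_load_buffering`; running it does not demonstrate the 1-1 outcome on any particular machine, it only shows the program for which the memory model permits that outcome.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Returns {value of y seen by thread 1, value of x seen by thread 2}.
// Each thread loads first and stores second, with no ordering between them.
std::pair<int, int> run_load_buffering() {
    std::atomic<int> x{0}, y{0};
    int seen_y = -1, seen_x = -1;

    std::thread t1([&] {
        seen_y = y.load(std::memory_order_relaxed);
        x.store(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        seen_x = x.load(std::memory_order_relaxed);
        y.store(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return {seen_y, seen_x};
}
```

All four combinations of 0 and 1 are allowed results here, including {1, 1}, because relaxed accesses to different locations may be reordered within a thread.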

  59. Relaxed attribute
      Thread 1: t_y = y.load(relaxed); x.store(t_y, relaxed);
      Thread 2: t_x = x.load(relaxed); y.store(t_x, relaxed);
      x = y = ???

  60. Relaxed attribute
      Thread 1: if (y.load(relaxed)) { x.store(1, relaxed); print("foo"); }
      Thread 2: if (x.load(relaxed)) { y.store(1, relaxed); print("bar"); }
      Can print foobar!

  61. Consume attribute
      Thread 1: *x = 42; x.store(1, release);
      Thread 2: t = x.load(acquire); print(*t);

  62. Consume attribute
      Thread 1: *x = 42; x.store(1, release);
      Thread 2: t = x.load(consume); print(*t);
      Ordered

  63. Consume attribute
      Thread 1: *x = 42; x.store(1, release);
      Thread 2: t = x.load(consume); print(*y);
      Unordered!

  64. Consume attribute
      Thread 1: *x = 42; x.store(1, release);
      Thread 2: t = x.load(consume); print(*(y + t - t));
      ???

  65. Conclusion
      ● Atomics = portable lock-free code in C11/C++11
      ● Tricky to compile, but can be done
      ● Lots of open questions

  66. Questions?
