Blowing up the (C++11) atomic barrier
Optimizing C++11 atomics in LLVM
Robin Morisset, Intern at Google

● Background: C++11 atomics
● Optimizing around atomics
● Fence elimination
● Miscellaneous optimizations
● Further work: Problems with atomics

Can this possibly print 0-0 ?
Thread 1: x <- 1; print y;
Thread 2: y <- 1; print x;

Can this possibly print 0-0 ? Yes, if your compiler reorders accesses
Thread 1: print y; x <- 1;
Thread 2: print x; y <- 1;

Can this possibly print 0-0 ? Yes on x86: needs a fence (mfence: flush your (FIFO) store buffer)
Thread 1: x <- 1; mfence; print y;
Thread 2: y <- 1; mfence; print x;

Can this possibly print 0 ?
Thread 1: x <- 42; ready <- 1;
Thread 2: if (ready) print x;

Can this possibly print 0 ? Yes on ARM
Thread 1: x <- 42; dmb ish; ready <- 1;   (flush your (non-FIFO) store buffer)
Thread 2: if (ready) print x;

Can this possibly print 0 ? Yes on ARM: needs 2 fences to prevent
Thread 1: x <- 42; dmb ish; ready <- 1;   (flush your (non-FIFO) store buffer)
Thread 2: if (ready) dmb ish; print x;   (don't speculate reads across)

Doing it portably
C11/C++11 memory model
● data race (dynamic) = undefined
● no data race (using mutexes) = intuitive behavior (“Sequentially consistent”)
● for lock-free code: atomic accesses

Sequentially consistent
Thread 1: x.store(1, seq_cst); print(y.load(seq_cst));
Thread 2: y.store(1, seq_cst); print(x.load(seq_cst));
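The two-thread test on this slide can be written out as a small C++ sketch. The function name `store_buffering_seq_cst` and the result variables `r1`/`r2` are illustrative assumptions, not from the talk. With `seq_cst` on every access, the outcome where both threads read 0 is forbidden by the memory model.

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Hypothetical helper: runs the two-thread test once and returns
// (value Thread 1 read from y, value Thread 2 read from x).
std::pair<int, int> store_buffering_seq_cst() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;
    std::thread t1([&] {
        x.store(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });
    t1.join();
    t2.join();
    // join() synchronizes with thread completion, so r1/r2 are safe to read.
    return {r1, r2};
}
```

All seq_cst operations fall into one total order, so whichever store comes first in that order is visible to the other thread's later load; at least one of the two reads must return 1.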

Release/acquire
Thread 1: x = 42; ready.store(1, release);
Thread 2: if (ready.load(acquire)) print(x);
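A minimal sketch of the release/acquire message-passing pattern in full C++ follows. The function name `message_passing` is an assumption, and the consumer spins on `ready` instead of testing it once (as the slide's `if` does) so that the result is deterministic.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: returns the value of x observed by the consumer.
int message_passing() {
    int x = 0;                    // plain, non-atomic payload
    std::atomic<int> ready{0};
    int seen = -1;
    std::thread producer([&] {
        x = 42;
        ready.store(1, std::memory_order_release);  // publishes the write to x
    });
    std::thread consumer([&] {
        // Assumption: spin instead of the slide's single `if` check.
        while (!ready.load(std::memory_order_acquire)) {}
        seen = x;                 // ordered after the producer's x = 42
    });
    producer.join();
    consumer.join();
    return seen;
}
```

Once the acquire load reads the value written by the release store, everything the producer wrote before that store is visible to the consumer, so the plain read of `x` is race-free.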

Compiler optimizations ?
Before: void foo(int *x, int n) { for(int i=0; i<n; ++i){ *x *= 42; } }
After LICM: void foo(int *x, int n) { int tmp = *x; for(int i=0; i < n; ++i){ tmp *= 42; } *x = tmp; }

Compiler optimizations ?
Before: void foo(int *x, int n) { }
After LICM: void foo(int *x, int n) { int tmp = *x; *x = tmp; }

Compiler optimizations ?
Before: void foo(int *x, int n) { }
After LICM: void foo(int *x, int n) { int tmp = *x; *x = tmp; }
++(*x); // in another thread...

Never introduce a store where there was none
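One way to keep the hoisting while respecting this rule is to guard it, so that no load or store happens when the loop body would not run. The sketch below is an illustration under that assumption, not LLVM's actual transformation; the names `scale_loop` and `scale_hoisted` are hypothetical.

```cpp
// Original loop: writes *x only if the body executes at least once.
void scale_loop(int* x, int n) {
    for (int i = 0; i < n; ++i) {
        *x *= 42;
    }
}

// Hoisted version that never introduces a store: when n <= 0 the
// function touches *x not at all, exactly like the original.
void scale_hoisted(int* x, int n) {
    if (n <= 0) return;          // guard keeps the "no store" paths store-free
    int tmp = *x;                // load hoisted out of the loop
    for (int i = 0; i < n; ++i) {
        tmp *= 42;
    }
    *x = tmp;                    // single store, only on paths that stored before
}
```

Without the guard, a concurrent `++(*x)` in another thread could be overwritten by the introduced `*x = tmp` even though the original program never wrote to `*x`.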

Dead store elimination ? x = 42; … x = 43;

Dead store elimination ?
x = 42; flag1.store(true, release); while (!flag2.load(acquire)) continue; x = 43;

Dead store elimination ?
Thread 1: x = 42; flag1.store(true, release); while (!flag2.load(acquire)) continue; x = 43;
Thread 2: while (!flag1.load(acquire)) continue; print(x); flag2.store(true, release);

Dead store elimination ? Race !
Thread 1: x = 42; while (!flag2.load(acquire)) continue; x = 43;
Thread 2: print(x); flag2.store(true, release);

Dead store elimination ? Race !
Thread 1: x = 42; flag1.store(true, release); x = 43;
Thread 2: while (!flag1.load(acquire)) continue; print(x);

Anything can happen to memory between a release and an acquire
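The race-free two-flag handshake from the slides above can be sketched as a complete function to show why `x = 42` is not a dead store. The function name `handshake` and the variable `printed` are assumptions for illustration.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: returns the value of x that Thread 2 observes.
int handshake() {
    int x = 0;
    std::atomic<bool> flag1{false}, flag2{false};
    int printed = -1;
    std::thread t1([&] {
        x = 42;                                        // looks dead, but is observed
        flag1.store(true, std::memory_order_release);
        while (!flag2.load(std::memory_order_acquire)) continue;
        x = 43;
    });
    std::thread t2([&] {
        while (!flag1.load(std::memory_order_acquire)) continue;
        printed = x;                                   // reads between the release and the acquire
        flag2.store(true, std::memory_order_release);
    });
    t1.join();
    t2.join();
    return printed;
}
```

Thread 2's read sits between Thread 1's release on `flag1` and its acquire on `flag2`, so it must see 42; eliminating the first store would change the observable result.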

Fence elimination
C++: int t = y.load(acquire); … x.store(1, release);
ARM: ldr r0, [r0]; dmb ish; …; dmb ish; str r2, [r1]

2 fences on main path
[CFG: ldr …, str …, str …, with two dmb ish fences]

1 fence on main path
[CFG: ldr …, str …, str …, with two dmb ish fences]

Build graph from CFG
[CFG: ldr …, str …, str …]

Build graph from CFG
Identify sources/sinks
[Graph: Source, ldr …, str …, str …, Sink]

Build graph from CFG
Identify sources/sinks
Annotate with frequency
[Graph: Source, ldr …, str …, str …, Sink; edge weights 5, 2, 2, 5 and ∞]

Build graph from CFG
Identify sources/sinks
Annotate with frequency
Find min-cut: 2 + 5 = 7 is minimum
[Graph: Source, ldr …, str …, str …, Sink; edge weights 5, 2, 2, 5 and ∞]

Build graph from CFG
Identify sources/sinks
Annotate with frequency
Find min-cut
Move fences
[CFG: ldr …, dmb ish, str …, dmb ish, str …]
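The min-cut step can be sketched with a textbook Edmonds-Karp max-flow, since the max-flow value equals the min-cut weight. This is not LLVM's implementation, and the graph in `example_fence_graph` is an assumed topology chosen only to reproduce the slide's 2 + 5 = 7; the slide's real edges are not recoverable from the text.

```cpp
#include <algorithm>
#include <climits>
#include <queue>
#include <vector>

using Graph = std::vector<std::vector<long long>>;  // capacity matrix

// Edmonds-Karp: returns the max-flow value from s to t (= min-cut weight).
long long min_cut(Graph cap, int s, int t) {
    const int n = static_cast<int>(cap.size());
    long long flow = 0;
    while (true) {
        // BFS for a shortest augmenting path in the residual graph.
        std::vector<int> parent(n, -1);
        parent[s] = s;
        std::queue<int> q;
        q.push(s);
        while (!q.empty() && parent[t] == -1) {
            int u = q.front();
            q.pop();
            for (int v = 0; v < n; ++v) {
                if (parent[v] == -1 && cap[u][v] > 0) {
                    parent[v] = u;
                    q.push(v);
                }
            }
        }
        if (parent[t] == -1) return flow;  // no augmenting path left
        long long push = LLONG_MAX;
        for (int v = t; v != s; v = parent[v])
            push = std::min(push, cap[parent[v]][v]);
        for (int v = t; v != s; v = parent[v]) {
            cap[parent[v]][v] -= push;   // use forward capacity
            cap[v][parent[v]] += push;   // add residual capacity
        }
        flow += push;
    }
}

// Assumed topology: 0 = Source, 1 = load, 2 = first store,
// 3 = second store, 4 = Sink. Finite weights are edge frequencies.
Graph example_fence_graph() {
    const long long INF = 1000000000LL;  // stands in for the slide's ∞
    Graph cap(5, std::vector<long long>(5, 0));
    cap[0][1] = INF;  // Source -> load
    cap[1][2] = 2;    // load -> first store (cold path)
    cap[1][3] = 5;    // load -> second store (main path)
    cap[2][3] = 2;    // first store -> second store
    cap[2][4] = INF;  // first store -> Sink
    cap[3][4] = INF;  // second store -> Sink
    return cap;
}
```

Calling `min_cut(example_fence_graph(), 0, 4)` yields 7 on this assumed graph. The ∞ edges can never be part of a minimum cut, which is how they mark places where a fence may not be put.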

C++: while(flag.load(acquire)) {}
ARM: .loop: ldr r0, [r1]; dmb ish; bnz .loop

C++: while(flag.load(acquire)) {}
ARM: .loop: ldr r0, [r1]; bnz .loop; dmb ish
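The same effect can be written by hand at source level: spin with relaxed loads and issue a single acquire fence after the loop exits. This is a sketch of the standard fence idiom, not a claim about what LLVM emits; the function name `wait_until_clear` and the `data` parameter are hypothetical.

```cpp
#include <atomic>

// Hypothetical helper: spins while `flag` is set, then reads `data`.
int wait_until_clear(std::atomic<bool>& flag, const int& data) {
    // Relaxed loads inside the loop: no fence is paid per iteration.
    while (flag.load(std::memory_order_relaxed)) {}
    // One acquire fence after the loop orders the final load of `flag`
    // before every later memory access, such as the read of `data`.
    std::atomic_thread_fence(std::memory_order_acquire);
    return data;
}
```

Only the load that ends the loop needs acquire semantics, which is why moving the fence to the loop exit is enough.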

[Graph: Source, .loop: ldr r0, [r1]; dmb ish; bnz .loop, …, memory access, Sink; edge frequencies 98, 100, 2]

[Graph: Source, .loop: ldr r0, [r1]; bnz .loop, …, dmb ish, memory access, Sink; edge frequencies 98, 100, 2]

x.load( release ) ?

x.load( release ) ? x.fetch_add(0, release )

x86
x.load(release) ? x.fetch_add(0, release)
mov %eax, $0; lock xadd (%ebx), %eax

x86
x.load(release) ? x.fetch_add(0, release)
Before: mov %eax, $0; lock xadd (%ebx), %eax
After: mfence; mov %eax, (%ebx)
7200% speedup for a seqlock*
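At source level, the two shapes look roughly like the sketch below. C++ does not allow `release` as the order of a plain `load`, which is why the idiom uses a read-modify-write that adds 0. The second function only mirrors the shape of the optimized x86 output (a full fence followed by an ordinary load); it is an assumption made for illustration and is not claimed to be equivalent under the abstract C++ memory model. Both function names are hypothetical.

```cpp
#include <atomic>

// The idiom from the slide: an RMW that adds 0, used as a "release load".
int read_with_rmw(std::atomic<int>& x) {
    return x.fetch_add(0, std::memory_order_release);
}

// Shape of the optimized code: full fence, then a plain atomic load.
// Assumption: source-level approximation of "mfence; mov", for illustration only.
int read_with_fence(std::atomic<int>& x) {
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return x.load(std::memory_order_relaxed);
}
```

Both functions return the current value and leave `x` unchanged; the difference is that the first needs exclusive ownership of the cache line, while the second only reads it.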

x.store(0, release)   Power: hwsync; stw …   ARM: dmb sy; str …
x.load(acquire)       Power: lwz …; hwsync   ARM: ldr …; dmb sy

x.store(0, release)   Power: lwsync; stw …   ARM: dmb ish; str …
x.load(acquire)       Power: lwz …; lwsync   ARM: ldr …; dmb ish

x.store(0, release)   Power: lwsync; stw …   ARM (Swift): dmb ishst; str …
x.load(acquire)       Power: lwz …; lwsync   ARM (Swift): ldr …; dmb ish

Power
x.store(2, relaxed)
Shuffling:
  rlwinm r2, r3, 3, 27, 28
  li r4, 2
  xori r5, r2, 24
  rlwinm r2, r3, 0, 0, 29
  li r3, 255
  slw r4, r4, r5
  slw r3, r3, r5
  and r4, r4, r3
Loop:
LBB4_1:
  lwarx r5, 0, r2      (load linked)
  andc r5, r5, r3
  or r5, r4, r5
  stwcx. r5, 0, r2     (store conditional)
  bne cr0, LBB4_1

Power x.store(2, relaxed ) li r2, 2 stb r2, 0(r3)

x86 x.store(2, relaxed ) mov %eax, $2 mov (%ebx), %eax

x86 x.store(2, relaxed ) mov (%ebx), $2

Relaxed attribute
Thread 1: print(y.load(relaxed)); x.store(1, relaxed);
Thread 2: print(x.load(relaxed)); y.store(1, relaxed);

Relaxed attribute
Thread 1: print(y.load(relaxed)); x.store(1, relaxed);
Thread 2: print(x.load(relaxed)); y.store(1, relaxed);
Can print 1-1
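The test above can be written out as the sketch below; the function name `load_buffering_relaxed` is an assumption. The memory model permits the 1-1 outcome, but any single run on real hardware will most likely return something else, so the code only illustrates the shape of the test.

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Hypothetical helper: returns (value Thread 1 read from y,
// value Thread 2 read from x). Any of 0-0, 0-1, 1-0, 1-1 is allowed.
std::pair<int, int> load_buffering_relaxed() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;
    std::thread t1([&] {
        r1 = y.load(std::memory_order_relaxed);
        x.store(1, std::memory_order_relaxed);  // may become visible before the load completes
    });
    std::thread t2([&] {
        r2 = x.load(std::memory_order_relaxed);
        y.store(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return {r1, r2};
}
```

Relaxed accesses to different variables carry no ordering between them, so each thread's store may be reordered before its own load.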

Relaxed attribute
Thread 1: t_y = y.load(relaxed); x.store(t_y, relaxed);
Thread 2: t_x = x.load(relaxed); y.store(t_x, relaxed);
x = y = ???

Relaxed attribute
Thread 1: if(y.load(relaxed)) x.store(1, relaxed); print("foo");
Thread 2: if(x.load(relaxed)) y.store(1, relaxed); print("bar");
Can print foobar !

Consume attribute
Thread 1: *x = 42; x.store(1, release);
Thread 2: t = x.load(acquire); print(*t);

Consume attribute
Thread 1: *x = 42; x.store(1, release);
Thread 2: t = x.load(consume); print(*t);
Ordered

Consume attribute
Thread 1: *x = 42; x.store(1, release);
Thread 2: t = x.load(consume); print(*y);
Unordered !

Consume attribute
Thread 1: *x = 42; x.store(1, release);
Thread 2: t = x.load(consume); print(*(y + t - t));
???
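The ordered case, where the consumer dereferences the pointer it just loaded, can be sketched as a pointer-publication example. The names `consume_publish`, `payload` and `ptr` are assumptions, and the consumer spins until the pointer is non-null so the result is deterministic. In practice compilers generally strengthen `consume` to `acquire`.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: returns the value read through the published pointer.
int consume_publish() {
    int payload = 0;
    std::atomic<int*> ptr{nullptr};
    int seen = -1;
    std::thread producer([&] {
        payload = 42;
        ptr.store(&payload, std::memory_order_release);  // publish the pointer
    });
    std::thread consumer([&] {
        int* t = nullptr;
        // Assumption: spin until the pointer has been published.
        while (!(t = ptr.load(std::memory_order_consume))) {}
        seen = *t;  // carries a data dependency on the loaded pointer: ordered
    });
    producer.join();
    consumer.join();
    return seen;
}
```

The read `*t` depends on the loaded value, which is the only kind of access that `consume` orders; a read through an unrelated pointer would get no such guarantee.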

Conclusion
● Atomics = portable lock-free code in C11/C++11
● Tricky to compile, but can be done
● Lots of open questions

Questions ?
