Debugging and improving the C/C++11 memory model Viktor Vafeiadis Max Planck Institute for Software Systems (MPI-SWS) January 2016

The C11 memory model
Defines the semantics of concurrent memory accesses in C/C++. Standardised by ISO C/C++ 2011.
Used:
◮ By several POPL/PLDI/OOPSLA papers
◮ Internally by LLVM IR
◮ Indirectly by every program

The C11 memory model: Atomics
Two types of locations:
◮ Ordinary (non-atomic): races are errors
◮ Atomic: welcome to the expert mode

The C11 memory model: a spectrum of accesses
◮ Seq. consistent: full memory fence
◮ Release write: no fence (x86); lwsync (PPC)
◮ Acquire read: no fence (x86); isync (PPC)
◮ Relaxed: no fence
◮ Non-atomic: no fence, races are errors
Explicit primitives for fences

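An execution in C11: actions and relations (and axioms)
Initially a = x = 0.
Thread 1: a = 5; x.store(1, release);
Thread 2: while (x.load(acq) == 0); print(a);
[Figure: execution graph. Initial writes W_na(a, 0) and W_na(x, 0). Thread 1: W_na(a, 5) then W_rel(x, 1). Thread 2: R_acq(x, 0), then R_acq(x, 1), then R_na(a, 5). po edges follow program order. rf edges: W_na(x, 0) to R_acq(x, 0), W_rel(x, 1) to R_acq(x, 1), W_na(a, 5) to R_na(a, 5). sw edge: W_rel(x, 1) to R_acq(x, 1).]
hb ≜ (po ∪ sw)⁺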
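The program on this slide can be sketched in C++ roughly as follows. This is an illustration rather than code from the talk: the function name `message_passing`, the two lambdas, and the local `seen` are assumptions introduced here, and `print(a)` is replaced by returning the value that was read.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: runs the two-thread program from the slide once
// and returns the value the reader thread observed for `a`.
int message_passing() {
    int a = 0;               // ordinary (non-atomic) location
    std::atomic<int> x{0};   // atomic location used as a flag
    int seen = -1;           // stands in for print(a)

    std::thread writer([&] {
        a = 5;                                   // W_na(a, 5)
        x.store(1, std::memory_order_release);   // W_rel(x, 1)
    });
    std::thread reader([&] {
        // Spin until the acquire load reads the release store (sw edge).
        while (x.load(std::memory_order_acquire) == 0) {
        }
        seen = a;                                // R_na(a, 5), hb-after W_na(a, 5)
    });

    writer.join();
    reader.join();
    return seen;
}
```

Because the acquire load that exits the loop reads from the release store, the write to `a` happens-before the read of `a`, so the non-atomic accesses do not race and the function is expected to return 5.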

Relaxed behaviour: store buffering
Initially x = y = 0.
Thread 1: x.store(1, rlx); t1 = y.load(rlx);
Thread 2: y.store(1, rlx); t2 = x.load(rlx);
This can return t1 = t2 = 0.
Justification: starting from [x = y = 0], Thread 1 performs W_rlx(x, 1) then R_rlx(y, 0), and Thread 2 performs W_rlx(y, 1) then R_rlx(x, 0).
Behaviour observed on x86/Power/ARM.
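A store-buffering litmus test along these lines might look as follows in C++. The names `store_buffering_once` and `count_weak_outcomes` are hypothetical helpers introduced for this sketch. Whether the weak outcome t1 = t2 = 0 actually shows up in a given run depends on the hardware, compiler, and scheduling, so the code only counts it and makes no claim about how often it occurs.

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Hypothetical helper: one run of the store-buffering program.
// Returns (t1, t2) as read by the two threads.
std::pair<int, int> store_buffering_once() {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    int t1 = -1;
    int t2 = -1;

    std::thread first([&] {
        x.store(1, std::memory_order_relaxed);
        t1 = y.load(std::memory_order_relaxed);
    });
    std::thread second([&] {
        y.store(1, std::memory_order_relaxed);
        t2 = x.load(std::memory_order_relaxed);
    });

    first.join();
    second.join();
    return {t1, t2};
}

// Hypothetical helper: repeats the test and counts runs with t1 == t2 == 0.
int count_weak_outcomes(int runs) {
    int weak = 0;
    for (int i = 0; i < runs; ++i) {
        const std::pair<int, int> result = store_buffering_once();
        if (result.first == 0 && result.second == 0) {
            ++weak;
        }
    }
    return weak;
}
```

Starting a fresh pair of threads per run keeps the sketch short, at the cost of making the weak outcome harder to observe than in a dedicated litmus-testing harness.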

Coherence
Programs with a single shared variable behave as under SC.
Thread 1: x.store(1, rlx); x.store(2, rlx);
Thread 2: a = x.load(rlx); b = x.load(rlx);
The outcome a = 2 ∧ b = 1 is forbidden.
[Figure: execution graph. Thread 1: W_rlx(x, 1), then W_rlx(x, 2), with an mo_x edge between them. Thread 2: R_rlx(x, 2), then R_rlx(x, 1). An rb_x edge runs from R_rlx(x, 1) to W_rlx(x, 2).]
◮ Modification order, mo_x: a total order of the writes to x.
◮ Reads-before: rb_x ≜ (rf⁻¹ ; mo_x) ∩ (≠)
◮ Coherence: hb ∪ rf_x ∪ mo_x ∪ rb_x is acyclic for all x.
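The coherence condition can be checked mechanically on a small execution graph. The sketch below is an assumption-laden illustration, not the formalisation used in the talk: events are numbered by hand, relations are plain edge lists, initial writes are omitted, and only a single location x is considered. All type and function names are hypothetical.

```cpp
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;   // (from, to) over event indices
using Relation = std::vector<Edge>;

// rb = (rf^-1 ; mo) minus identity: a read is rb-before every write
// that is mo-after the write it reads from.
Relation reads_before(const Relation& rf, const Relation& mo) {
    Relation rb;
    for (const Edge& r : rf) {
        for (const Edge& m : mo) {
            if (r.first == m.first && r.second != m.second) {
                rb.push_back({r.second, m.second});
            }
        }
    }
    return rb;
}

// Acyclicity via Kahn's algorithm: a graph is acyclic exactly when
// every node can be removed in topological order.
bool is_acyclic(int event_count, const Relation& edges) {
    std::vector<std::vector<int>> successors(event_count);
    std::vector<int> in_degree(event_count, 0);
    for (const Edge& e : edges) {
        successors[e.first].push_back(e.second);
        ++in_degree[e.second];
    }
    std::vector<int> ready;
    for (int v = 0; v < event_count; ++v) {
        if (in_degree[v] == 0) ready.push_back(v);
    }
    int removed = 0;
    while (!ready.empty()) {
        const int v = ready.back();
        ready.pop_back();
        ++removed;
        for (int s : successors[v]) {
            if (--in_degree[s] == 0) ready.push_back(s);
        }
    }
    return removed == event_count;
}

// Coherence for one location: hb ∪ rf ∪ mo ∪ rb must be acyclic.
bool is_coherent(int event_count, const Relation& hb, const Relation& rf,
                 const Relation& mo) {
    Relation all = hb;
    all.insert(all.end(), rf.begin(), rf.end());
    all.insert(all.end(), mo.begin(), mo.end());
    const Relation rb = reads_before(rf, mo);
    all.insert(all.end(), rb.begin(), rb.end());
    return is_acyclic(event_count, all);
}

// Events: 0 = W(x,1), 1 = W(x,2) in Thread 1; 2 = first read (a),
// 3 = second read (b) in Thread 2. hb here is just program order.
bool outcome_a2_b1_is_coherent() {
    const Relation hb = {{0, 1}, {2, 3}};
    const Relation mo = {{0, 1}};
    const Relation rf = {{1, 2}, {0, 3}};   // a reads 2, b reads 1
    return is_coherent(4, hb, rf, mo);
}

bool outcome_a1_b2_is_coherent() {
    const Relation hb = {{0, 1}, {2, 3}};
    const Relation mo = {{0, 1}};
    const Relation rf = {{0, 2}, {1, 3}};   // a reads 1, b reads 2
    return is_coherent(4, hb, rf, mo);
}
```

For a = 2 ∧ b = 1 the derived rb edge from the second read back to W(x,2) closes a cycle through rf and program order, so the check fails, matching the slide. The outcome a = 1 ∧ b = 2 passes.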

Causality cycles with relaxed accesses
Initially x = y = 0.
Thread 1: if (x.load(rlx) == 1) y.store(1, rlx);
Thread 2: if (y.load(rlx) == 1) x.store(1, rlx);
C11 allows the outcome x = y = 1.
Justification: Thread 1 performs R_rlx(x, 1) then W_rlx(y, 1); Thread 2 performs R_rlx(y, 1) then W_rlx(x, 1); each read reads from the other thread's write.
Relaxed accesses don’t synchronize.

No causality cycles with non-atomics
Initially x = y = 0.
Thread 1: if (x == 1) y = 1;
Thread 2: if (y == 1) x = 1;
C11 forbids the outcome x = y = 1.
Justification: the non-atomic read axiom, rf ∩ (_ × NA) ⊆ hb
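To make the contrast between the relaxed and the non-atomic version concrete, the non-atomic read axiom can be evaluated on the cyclic execution. This is a rough sketch under simplifying assumptions: only this one axiom is checked, initial writes are left out, hb is taken to be program order (there is no synchronisation in either program), and hb is passed in already transitively closed. The names are hypothetical.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;   // (from, to) over event indices
using Relation = std::vector<Edge>;

// Checks rf ∩ (_ × NA) ⊆ hb: every rf edge whose target is a
// non-atomic read must also be an hb edge.
bool satisfies_na_read_axiom(const Relation& rf, const Relation& hb,
                             const std::vector<bool>& is_non_atomic) {
    for (const Edge& edge : rf) {
        const bool reader_is_na = is_non_atomic[edge.second];
        const bool in_hb = std::find(hb.begin(), hb.end(), edge) != hb.end();
        if (reader_is_na && !in_hb) {
            return false;
        }
    }
    return true;
}

// The cyclic execution behind the outcome x = y = 1.
// Events: 0 = R(x,1), 1 = W(y,1) in Thread 1;
//         2 = R(y,1), 3 = W(x,1) in Thread 2.
bool cycle_passes_axiom(bool accesses_are_non_atomic) {
    const Relation hb = {{0, 1}, {2, 3}};   // program order only
    const Relation rf = {{3, 0}, {1, 2}};   // each read reads the other thread's write
    const std::vector<bool> is_non_atomic(4, accesses_are_non_atomic);
    return satisfies_na_read_axiom(rf, hb, is_non_atomic);
}
```

With relaxed accesses the axiom does not apply to any read, so the cyclic execution passes; with non-atomic accesses both rf edges cross threads without being in hb, so it is rejected.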

Is the C11 memory model definition...
1. Mathematically sane?
   ◮ For example, it is monotone.
2. Not too weak?
   ◮ Provides useful reasoning principles.
3. Not too strong?
   ◮ Can be implemented efficiently.
4. Actually useful?
   ◮ Admits the intended program optimisations.

Is the C11 memory model definition...
1. Mathematically sane?
   ✗ No, it is not monotone.
2. Not too weak?
   ≈ Reasoning principles for C11 subsets.
3. Not too strong?
   ≈ Compilation to x86/Power/ARM.
4. Actually useful?
   ✗ No, it disallows intended program transformations.

Non-atomic reads of atomic variables are unsound!
Initially, x = 0.
Thread 1: if (x.load(rlx) == 1) t = (int) x;
Thread 2: x.store(1, rlx);
The program can get stuck!
[Figure: execution graph. Initial write W_na(x, 0). Thread 1: R_rlx(x, 1) then R_na(x, ?). Thread 2: W_rlx(x, 1).]
◮ Reading 0 contradicts coherence.
◮ Reading 1 contradicts the non-atomic read axiom.

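Sequentialisation is invalid
Initially, a = x = y = 0.
Thread 1: a = 1;
Thread 2: if (x.load(rlx) == 1) if (a == 1) y.store(1, rlx);
Thread 3: if (y.load(rlx) == 1) x.store(1, rlx);
The only possible output is: a = 1, x = y = 0.
Recall the non-atomic read axiom: rf ∩ (_ × NA) ⊆ hb
If Threads 1 and 2 are sequentialised (a = 1; followed by Thread 2's code), the read of a becomes hb-after the write of a, so the axiom no longer rules out a = x = y = 1: the transformation introduces a new behaviour.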

Tentative fixes
Open problem
Remove non-atomic read axiom.
◮ gives extremely weak guarantees, if any
In addition, forbid (hb ∪ rf)-cycles.
◮ rules out causal loops
◮ forbids some reorderings
◮ more costly on ARM/Power
Or alternatively forbid (hb ∪ rf)-cycles with NA accesses.
◮ allows more racy behaviours
◮ forbids some reorderings

Monotonicity
“Adding synchronisation should not introduce new behaviours”
Examples:
◮ Reducing parallelism: C1 ∥ C2 ⇝ C1 ; C2
◮ Expression evaluation linearisation: x = a + b; ⇝ t1 = a; t2 = b; x = t1 + t2;
◮ Adding a memory fence
◮ Strengthening the access mode of an operation
◮ (Roach motel reorderings)

Other problems fixed (POPL’15, POPL’16)
The axiom of SC reads is too weak.
◮ Makes strengthening unsound.
The axioms of SC fences are too weak.
◮ They do not guarantee sequential consistency.
The definition of release sequences is too strong.
◮ Removing (po ∪ rf)-final events is unsound.

Transformation correctness

Valid instruction reorderings: a ; b ⇝ b ; a (POPL’15)

↓ a \ b →  R≠sc  R sc  W na  W rlx  W⊒rel  C rlx|acq  C⊒rel  F acq  F rel
R na       ✓     ✓     ✓     ✓      ✗      ✓          ✗      ✓      ✗
R rlx      ✓     ✓     ✓     (✓)    ✗      (✓)        ✗      ✗      ✗
R⊒acq      ✗     ✗     ✗     ✗      ✗      ✗          ✗      ✓      ✗
W≠sc       ✓     ✓     ✓     ✓      ✗      ✓          ✗      ✓      ✗
W sc       ✓     ✗     ✓     ✓      ✗      ✓          ✗      ✓      ✗
C rlx|rel  ✓     ✓     ✓     (✓)    ✗      (✓)        ✗      ✗      ✗
C⊒acq      ✗     ✗     ✗     ✗      ✗      ✗          ✗      ✓      ✗
F acq      ✗     ✗     ✗     ✗      ✗      ✗          ✗      =      ✗
F rel      ✓     ✓     ✓     ✗      ✓      ✗          ✓      ✓      =

Redundant instruction eliminations (POPL’15)
Overwritten write: x.store(v, M); C; x.store(v′, M) ⇝ C; x.store(v′, M), provided C has no rel and no x accesses.
Read after write: x.store(v, M); C; t = x.load(M′) ⇝ x.store(v, M); C; t = v, provided C has no acq and no x accesses.
Read after read: t = x.load(M); C; t′ = x.load(M) ⇝ t = x.load(M); C; t′ = t, provided C has no acq and no x accesses.
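As a small illustration of the read-after-write rule, the two functions below show a program fragment before and after the elimination. This is a hand-written sketch: the function names are hypothetical, both access modes are taken to be relaxed, and the intervening code C is assumed to be a single update of an unrelated variable, which contains no acquire operation and no access to x.

```cpp
#include <atomic>

// Before: x.store(v, M); C; t = x.load(M')
int read_after_write_original(std::atomic<int>& x, int v, int& unrelated) {
    x.store(v, std::memory_order_relaxed);
    unrelated += 1;                                   // C: no acq, no x access
    const int t = x.load(std::memory_order_relaxed);  // load to be eliminated
    return t;
}

// After: x.store(v, M); C; t = v
int read_after_write_optimised(std::atomic<int>& x, int v, int& unrelated) {
    x.store(v, std::memory_order_relaxed);
    unrelated += 1;                                   // C is unchanged
    const int t = v;                                  // reuse the stored value
    return t;
}
```

In a single-threaded run the two versions are indistinguishable. In a concurrent setting the optimised version only ever exhibits the behaviours in which the original load happened to read the thread's own store.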

Is DRF semantics really what we want?

Should these transformations be allowed?
1. CSE over a lock acquire:
   t1 = X; lock(); t2 = X; ⇝ t1 = X; lock(); t2 = t1;
   If X changes in between, the program is racy.
2. Load hoisting:
   if (c) r = X; ⇝ t = X; r = c ? t : r;
   This may introduce a race, but the racy value is not used.

Allowing both is clearly wrong!
Consider the transformation sequence:
   if (c) r1 = X; lock(); r2 = X;
⇝ t = X; r1 = c ? t : r1; lock(); r2 = X;
⇝ t = X; r1 = c ? t : r1; lock(); r2 = t;
When c is false, the read of X is moved out of the critical region!
So we have to forbid one transformation.
◮ C11 forbids load hoisting, allows CSE over lock().
◮ LLVM allows load hoisting, forbids CSE over lock().
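The three stages of this transformation sequence can be written out as C++ functions to see where the read of X ends up. This is only a sketch: `std::mutex` stands in for the slide's `lock()`, the matching `unlock()` calls and the initial value 0 for r1 are assumptions added so the code is well-formed, and the struct and function names are hypothetical.

```cpp
#include <mutex>

struct Result {
    int r1;
    int r2;
};

// Stage 0: the original program. When c is false, X is read only
// inside the critical region.
Result original(bool c, const int& X, std::mutex& m) {
    int r1 = 0;
    if (c) r1 = X;
    m.lock();
    const int r2 = X;
    m.unlock();   // assumption: the slide does not show the unlock
    return {r1, r2};
}

// Stage 1: load hoisting. X is now read unconditionally before the
// lock, but when c is false the hoisted value t is not used.
Result after_load_hoisting(bool c, const int& X, std::mutex& m) {
    int r1 = 0;
    const int t = X;
    r1 = c ? t : r1;
    m.lock();
    const int r2 = X;
    m.unlock();
    return {r1, r2};
}

// Stage 2: CSE over the lock acquire. r2 now comes from the read
// performed before the lock, i.e. outside the critical region.
Result after_hoisting_and_cse(bool c, const int& X, std::mutex& m) {
    int r1 = 0;
    const int t = X;
    r1 = c ? t : r1;
    m.lock();
    const int r2 = t;
    m.unlock();
    return {r1, r2};
}
```

Run single-threaded, all three stages return the same values. The difference only matters when another thread updates X while holding the lock: in the last stage r2 can then hold a value read without the lock, even when c is false.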

Taming the release-acquire fragment

Recall the spectrum of C11 access types
◮ Seq. consistent: full memory fence
◮ Release write: no fence (x86); lwsync (PPC)
◮ Acquire read: no fence (x86); isync (PPC)
◮ Relaxed: no fence
◮ Non-atomic: no fence, races are errors

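C11’s release-acquire memory model
C11 model where all reads are acquire, all writes are release, and all atomic updates are acquire/release.

Store buffering
Initially x = y = 0.
Thread 1: x := 1; print y
Thread 2: y := 1; print x
Both threads may print 0.
[Figure: execution graph. The initial writes [x = y = 0] are mo_x-before W x,1 and mo_y-before W y,1; R y,0 and R x,0 both read (rf) from the initial writes.]

Message passing
Initially x = m = 0.
Thread 1: m := 42; x := 1
Thread 2: while x = 0 skip; print m
Only 42 may be printed.
[Figure: candidate execution with W m,42 and W x,1 in Thread 1 and R x,1 and R m,0 in Thread 2, with mo_m, mo_x, rf and hb edges. This execution, which would print 0, is the one the model rules out.]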

Good news
◮ Verified compilation schemes:
   ◮ x86-TSO (trivial compilation) [Batty et al. ’11]
   ◮ Power [Batty et al. ’12] [Sarkar et al. ’12]
◮ RA supports intended optimizations:
   ◮ In particular, write-read reordering (unlike SC): W x → R y ⇝ R y → W x
◮ DRF theorem:
   ◮ No data races under SC ensures no weak behaviors
◮ Monotonicity:
   ◮ Adding synchronization does not introduce new behaviors
◮ Program logics:
   ◮ RSL [Vafeiadis and Narayan ’13]
   ◮ GPS [Turon et al. ’14]
   ◮ OGRA [Lahav and Vafeiadis ’15]
