Memory Barriers in the Linux Kernel: Semantics and Practices - PowerPoint PPT Presentation



SLIDE 1

Memory Barriers in the Linux Kernel

Semantics and Practices

Embedded Linux Conference – April 2016. San Diego, CA.

Davidlohr Bueso <dave@stgolabs.net> SUSE Labs.

SLIDE 2

Agenda

  • 1. Introduction
    ‒ Reordering examples
    ‒ Underlying need for memory barriers
  • 2. Barriers in the kernel
    ‒ Building blocks
    ‒ Implicit barriers
    ‒ Atomic operations
    ‒ Acquire/release semantics
SLIDE 3

References

  • i. David Howells, Paul E. McKenney. Linux kernel source: Documentation/memory-barriers.txt
  • ii. Paul E. McKenney. Is Parallel Programming Hard, And, If So, What Can You Do About It?
  • iii. Paul E. McKenney. Memory Barriers: a Hardware View for Software Hackers. June 2010.
  • iv. Sorin, Hill, Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture. 2011.
SLIDE 4

SLIDES 5-10

Flagship Example

A = 0, B = 0 (shared variables)

  CPU0          CPU1
  A = 1         B = 1
  x = B         y = A

Possible results: (x, y) = (0, 1), (1, 0) or (1, 1), depending on how the two CPUs' operations interleave. On real hardware a fourth result, (0, 0), is also possible: each CPU's store can sit in its store buffer while the subsequent load reads the other variable's old value.

SLIDE 11

Memory Consistency Models

  • Most modern multicore systems are coherent but not consistent.

‒ Accesses to the same address are kept coherent by the cache coherency protocol.

  • A consistency model describes what the CPU can do regarding instruction ordering across different addresses.

‒ Helps programmers make sense of the world.
‒ The CPU is not aware whether the application is single- or multi-threaded; when optimizing, it only ensures single-threaded correctness.

SLIDE 12

Sequential Consistency (SC)

“A multiprocessor is sequentially consistent if the result of any execution is the same as some sequential order, and within any processor, the operations are executed in program order”

– Lamport, 1979.

  • Intuitively a programmer's ideal scenario.

‒ Instructions are executed by each CPU in the order in which they were written.
‒ All processors see the same interleaving of operations.

SLIDES 13-17

Total Store Order (TSO)

  • SPARC, x86 (Intel, AMD)
  • Similar to SC, but:

‒ A later load may be reordered before an earlier store (S→L); L→L, S→S and L→S ordering is preserved.

  L→L   [l] A … [l] B   preserved
  S→S   [s] A … [s] B   preserved
  L→S   [l] A … [s] B   preserved
  S→L   [s] A … [l] B   may be reordered

SLIDE 18

Relaxed Models

  • Arbitrary reordering, limited only by explicit memory-barrier instructions.
  • ARM, Power, Tilera, Alpha.
SLIDES 19-25

Fixing the Example

A = 0, B = 0 (shared variables)

  CPU0          CPU1
  A = 1         B = 1
  <MB>          <MB>
  x = B         y = A

With a full memory barrier between each CPU's store and load, the (0, 0) result is no longer possible. The kernel offers several flavors:

  • Compiler barrier
  • Mandatory barriers (general + rw)
  • SMP-conditional barriers
  • acquire/release
  • Data dependency barriers
  • Device barriers
SLIDE 26

Barriers in the Linux Kernel

SLIDE 27

Abstracting Architectures

  • Most kernel programmers need not worry about the ordering specifics of every architecture.

‒ Some notion of barrier usage is handy nonetheless – implicit vs explicit, semantics, etc.

  • Linux must handle each CPU's memory-ordering specifics in a portable way, with lowest-common-denominator (LCD) semantics for memory barriers:

‒ The CPU appears to execute in program order.
‒ Single-variable consistency.
‒ Barriers operate in pairs.
‒ Sufficient to implement synchronization primitives.

SLIDE 28

Abstracting Architectures

mb() → mfence (x86), dsb (ARM), sync (Power), ...

  • Each architecture must implement its own calls or otherwise default to the generic, highly unoptimized behavior.
  • <arch/xxx/include/asm/barrier.h> will always define the low-level CPU specifics, then rely on <include/asm-generic/barrier.h> for the rest.
SLIDE 29

A Note on barrier()

  • Prevents the compiler from getting smart, acting as a general compiler barrier.
  • Within a loop, forces the compiler to reload conditional variables – see READ_ONCE()/WRITE_ONCE().

SLIDE 30

Implicit Barriers

  • Calls that have implied barriers, which the caller can safely rely on:

‒ Locking functions
‒ Scheduler functions
‒ Interrupt-disabling functions
‒ Others.

SLIDES 31-33

Sleeping/Waking

  • An extremely common task in the kernel, and the flagship example of flag-based CPU-CPU interaction.

  CPU0 (sleeper)                   CPU1 (waker)
  while (!done) {                  done = true;
      set_current_state(…);        wake_up_process(t);
      schedule();
  }

  • set_current_state() is implemented with smp_store_mb(): [s] →state = …, then smp_mb(), so the store to →state is ordered before the re-check of the condition.

SLIDES 34-35

Atomic Operations

  • Any atomic operation that modifies some state in memory and returns information about that state can imply a full SMP barrier:

‒ smp_mb() on each side of the actual operation.
‒ Conditional calls imply barriers only when successful.

  [atomic_*_]xchg()
  atomic_*_return()
  atomic_*_and_test()
  atomic_*_add_negative()
  [atomic_*_]cmpxchg()      (conditional)
  atomic_*_add_unless()     (conditional)

SLIDE 36

Atomic Operations

  • The most basic operations therefore do not imply barriers.
  • Many contexts can require explicit barriers:

  cpumask_set_cpu(cpu, vec->mask);
  /*
   * When adding a new vector, we update the mask first,
   * do a write memory barrier, and then update the count, to
   * make sure the vector is visible when count is set.
   */
  smp_mb__before_atomic();
  atomic_inc(&(vec)->count);

SLIDE 37

Atomic Operations

  • The most basic operations therefore do not imply barriers.
  • Many contexts can require explicit barriers:

  /*
   * When removing from the vector, we decrement the counter first,
   * do a memory barrier and then clear the mask.
   */
  atomic_dec(&(vec)->count);
  smp_mb__after_atomic();
  cpumask_clear_cpu(cpu, vec->mask);

SLIDE 38

Acquire/Release Semantics

  • One-way barriers.
  • Used to pass information reliably between threads about a variable.

‒ Ideal in producer/consumer type situations (pairing!!).
‒ After an ACQUIRE on a given variable, all memory accesses preceding any prior RELEASE on that same variable are guaranteed to be visible.
‒ All accesses of all previous critical sections for that variable are guaranteed to have completed.
‒ Compare C++11's memory_order_acquire, memory_order_release and memory_order_relaxed.

SLIDES 39-45

Acquire/Release Semantics

  CPU0                      CPU1
  spin_lock(&l)
    [CR]
  spin_unlock(&l)    →      spin_lock(&l)
                              [CR]
                            spin_unlock(&l)

  RELEASE (LS, SS)  pairs with  ACQUIRE (LL, LS)

  smp_store_release(&lock->val, 0)  <->  cmpxchg_acquire(&lock->val, 0, LOCKED)

  • The unlock's RELEASE keeps the critical section's prior loads and stores (LS, SS) before it; the next lock's ACQUIRE keeps its later loads and stores (LL, LS) after it, so CPU1's critical region sees everything CPU0 did in its own.

SLIDE 46

Acquire/Release Semantics

  • Regular atomic/RMW calls have been fine-grained for archs that support strict acquire/release semantics.
  • Currently only used by arm64 and PPC.

‒ LDAR/STLR

  cmpxchg()
  cmpxchg_acquire()
  cmpxchg_release()
  cmpxchg_relaxed()
  smp_load_acquire()
  smp_cond_acquire()
  smp_store_release()

SLIDE 47

Acquire/Release Semantics

  • These are minimal guarantees.

‒ Ensuring barriers on both sides of a lock operation therefore requires full barrier semantics:

  smp_mb__before_spinlock()
  smp_mb__after_spinlock()

  • Certainly not limited to locking.

‒ perf, IPI paths, scheduler, tty, etc.

SLIDES 48-51

Acquire/Release Semantics

  • Busy-waiting on a variable that requires ACQUIRE semantics:

  CPU0                           CPU1
  while (!done)                  smp_store_release(done, 1);
      cpu_relax();    [LS]
  smp_rmb();          [LL]

  • The loop-plus-smp_rmb() pair can be written more directly as:

  while (!smp_load_acquire(&done))
      cpu_relax();

  • Fine-graining SMP barriers, while a performance optimization, makes life harder for kernel programmers.
SLIDE 52

Concluding Remarks

  • Assume nothing.
  • Read memory-barriers.txt
  • Use barrier pairings.
  • Comment barriers.
SLIDE 53

Thank you.
