From weak to weedy: Effective use of memory barriers in the ARM Linux Kernel




SLIDE 1

Introduction | ARM’s memory model | Linux’s memory model | Finer-grained control | Questions | Future work

From weak to weedy

Effective use of memory barriers in the ARM Linux Kernel

Will Deacon (will.deacon@arm.com)

Embedded Linux Conference Europe, Edinburgh, UK

October 24, 2013

SLIDE 2

Scope

Memory ordering is a complex topic!

  • Different rules across different versions/implementations of different architectures

  • Not well understood by most software engineers
  • Great potential for subtle, non-repeatable software bugs
  • Key contributor to overall system performance

We will focus on the ARMv7 Linux kernel from a SW perspective (the ARM ARM remains authoritative!).

SLIDE 3

Sequential Consistency

A talk about memory ordering wouldn’t be complete without a brief description of sequential consistency.

Sequential Consistency (SC):

‘A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.’ – Leslie Lamport (1979)

SLIDE 4

Sequential Consistency (2)

[Diagram: programs A, B and C; processors p0, p1 and p2]

SLIDE 5

Sequential Consistency (3)

SC makes SMP systems nice and easy to reason about...

...but the hardware guys hate it!

  • Out-of-order and speculative execution
  • Caches (and coherency in SMP)
  • Write atomicity
  • Store buffers (read bypass and write merging)
  • Multi-ported bus topologies
  • Memory-mapped I/O

Back to square one with memory latency!

SLIDE 6

Memory Ordering

To facilitate these hardware optimisations, ordering of memory operations is often relaxed from program order, potentially leading to SC violations.

Initially: A = B = 0

    p0              p1
    a: A = 2;       c: C = B;
    b: B = 1;       d: D = A;

    Results              SC
    (C, D) == (0, 0)     ?
    (C, D) == (0, 2)     ?
    (C, D) == (1, 2)     ?
    (C, D) == (1, 0)     ?

This is defined by the memory (consistency) model for the architecture.

SLIDE 7

Memory Ordering

To facilitate these hardware optimisations, ordering of memory operations is often relaxed from program order, potentially leading to SC violations.

Initially: A = B = 0

    p0              p1
    a: A = 2;       c: C = B;
    b: B = 1;       d: D = A;

    Results              SC
    (C, D) == (0, 0)     Y (c, d, a, b)
    (C, D) == (0, 2)     Y (c, a, d, b)
    (C, D) == (1, 2)     Y (a, b, c, d)
    (C, D) == (1, 0)     N (d, a, b, c)

This is defined by the memory (consistency) model for the architecture.
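The SC column above can be checked mechanically. The following is a minimal sketch, not part of the talk: it enumerates every interleaving of p0 and p1 that preserves each processor's program order and records which (C, D) pairs can occur. The function names (`run`, `enumerate_sc`, `sc_allows`) are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* seen[C][D] is true if some SC interleaving produces the pair (C, D). */
static bool seen[2][3];

/*
 * Execute one interleaving. 'order' holds four operation ids:
 * 0 = a (A = 2), 1 = b (B = 1), 2 = c (C = B), 3 = d (D = A).
 */
static void run(const int order[4])
{
    int A = 0, B = 0, C = 0, D = 0;

    for (int i = 0; i < 4; i++) {
        switch (order[i]) {
        case 0: A = 2; break;
        case 1: B = 1; break;
        case 2: C = B; break;
        case 3: D = A; break;
        }
    }
    seen[C][D] = true;
}

/* Enumerate all merges of (a, b) and (c, d) that keep program order. */
static void enumerate_sc(void)
{
    /* Pick the two slots used by p0 (i < j); p1 fills the rest in order. */
    for (int i = 0; i < 4; i++) {
        for (int j = i + 1; j < 4; j++) {
            int order[4];
            int next_p1 = 2;            /* p1 issues c, then d */

            for (int k = 0; k < 4; k++) {
                if (k == i)
                    order[k] = 0;       /* a */
                else if (k == j)
                    order[k] = 1;       /* b */
                else
                    order[k] = next_p1++;
            }
            run(order);
        }
    }
}

/* True if the result (c, d) is reachable under sequential consistency. */
bool sc_allows(int c, int d)
{
    static bool done;

    assert(c >= 0 && c < 2 && d >= 0 && d < 3);
    if (!done) {
        enumerate_sc();
        done = true;
    }
    return seen[c][d];
}
```

Only six interleavings exist, and none of them yields (1, 0): that result needs d to run before a and b to run before c, which contradicts p1's program order.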

SLIDE 8

Safety Nets

Weakly ordered memory models offer safety nets to the programmer for explicit control over access ordering. These are commonly referred to as barriers or fences. The ARMv7 memory model includes:

  • A range of barrier instructions
  • Defined dependencies between accesses
  • Memory types with different ordering constraints

SLIDE 9

Observers

An observer is an agent in the system that can access memory:

  • Not necessarily a CPU (which contains multiple observers!)
  • Master within a given shareability domain (more later)
  • Slave interfaces cannot observe any accesses

SLIDE 10

Shareability Domains

Shareability domains define sets of observers within a system.

  • {Non, Inner, Outer}-shareable and Full System
  • Impact on cache coherency and shared memory
  • Multiple domain instances (not strictly nested)
  • System-specific, but architectural (and Linux) expectations

‘This architecture (ARMv7) is written with an expectation that all processors using the same operating system or hypervisor are in the same Inner Shareable shareability domain.’

SLIDE 11

Example Domains

[Diagram: observers A, B, C and D, Memory and DMA]

SLIDE 12

Example Domains (NSH)

[Diagram: observers A, B, C and D, Memory and DMA]

SLIDE 13

Example Domains (ISH)

[Diagram: observers A, B, C and D, Memory and DMA]

SLIDE 14

Example Domains (OSH)

[Diagram: observers A, B, C and D, Memory and DMA]

SLIDE 15

Example Domains (SY)

[Diagram: observers A, B, C and D, Memory and DMA]

SLIDE 16

Observability

Ordering is defined in terms of observability by memory masters.

Writes

‘A write to a location in memory is said to be observed by an observer when: (1) A subsequent read of the location by the same observer will return the value written by the observed write, or written by a write to that location by any observer that is sequenced in the coherence order of the location after the observed write and (2) A subsequent write of the location by the same observer will be sequenced in the coherence order of the location after the observed write’

This is actually pretty intuitive...

SLIDE 17

Observability (2)

...but reads are observable too!

Reads

‘A read of a location in memory is said to be observed by an observer when a subsequent write to the location by the same observer will have no effect on the value returned by the read.’

SLIDE 18

Global Observability and Completion

  • A normal memory access is globally observed for a shareability domain when it is observed by all observers in that domain.
  • A table walk is complete for a shareability domain when its accesses are globally observed in that domain and the TLB is updated.
  • An access is complete for a shareability domain when it is globally observed in that domain and any table walks associated with it have completed in the same domain.

Maintenance operations also have the notion of completion.

SLIDE 19

Ordering Diagrams

[Diagram: observers A, B, C and D; legend: Read, Write]

SLIDE 20

Ordering Diagrams

[Diagram: observers A, B, C and D; legend: Read, Write; markers: b, d, a, a]

SLIDE 21

Dependencies

In the absence of explicit barriers, dependencies define observation order of normal memory accesses.

  • Address: value returned by a read is used to compute the address of a subsequent access.
  • Control: value returned by a read is used to determine the condition flags and the flags are used in the condition code checking that determines the address of a subsequent access.
  • Data: value returned by a read is used as data written by a subsequent write.

There are also a few other rules (RaR, store speculation).

SLIDE 22

Dependency Examples

    ldr   r1, [r0, #4]
    and   r1, #0xfff
    ldr   r3, [r2, r1]      (address)

    ldr   r1, [r0, #4]
    cmp   r1, #1
    addeq r2, #4
    ldr   r3, [r2]          (control)

    ldr   r1, [r0, #4]
    add   r1, #5
    str   r1, [r2]          (data)

Question: Which dependencies enforce ordering of observability?
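For readers more at home in C, the three fragments correspond roughly to the source shapes below. This is only a reading aid under stated assumptions: the function names are invented, `volatile` stands in for a kernel-style single read, the address case uses a byte table rather than the word load on the slide, and a compiler is free to emit a different instruction sequence, so writing C like this does not by itself guarantee the dependency survives.

```c
#include <assert.h>
#include <stdint.h>

/* Address: the value read from r0[1] selects the address of the next load. */
uint8_t address_dep(const volatile uint32_t *r0, const uint8_t *r2)
{
    assert(r0 && r2);
    uint32_t r1 = r0[1] & 0xfff;    /* ldr r1, [r0, #4]; and r1, #0xfff */
    return r2[r1];                  /* load from r2 + r1 (byte-sized here) */
}

/* Control: the value read decides, via the flags, which address is loaded. */
uint32_t control_dep(const volatile uint32_t *r0, const uint32_t *r2)
{
    assert(r0 && r2);
    uint32_t r1 = r0[1];            /* ldr r1, [r0, #4] */
    if (r1 == 1)                    /* cmp r1, #1 */
        r2 += 1;                    /* addeq r2, #4 (one 32-bit word) */
    return *r2;                     /* ldr r3, [r2] */
}

/* Data: the value read is (part of) the data written by the next store. */
void data_dep(const volatile uint32_t *r0, volatile uint32_t *r2)
{
    assert(r0 && r2);
    uint32_t r1 = r0[1];            /* ldr r1, [r0, #4] */
    *r2 = r1 + 5;                   /* add r1, #5; str r1, [r2] */
}
```

In each case the second access cannot be issued with its final address or data until the first load has returned a value, which is what the definitions on the previous slide describe.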

SLIDE 23

Memory Barriers

The ARMv7 architecture defines three barrier instructions:

    isb             Pipeline flush and context synchronisation
    dmb <option>    Ensure ordering of memory accesses
    dsb <option>    Ensure completion of memory accesses

The <option> argument specifies the required shareability domain (NSH, ISH, OSH, SY) and access type (ST). Defaults to ‘full system’, all access types if omitted.

SLIDE 24

Ordering Diagrams (DMB)

[Diagram: observers A, B, C and D; markers: b0]

    b0: data = 42;
    b1: dmb ishst;
    b2: flag = VALID;

SLIDE 25

Ordering Diagrams (DMB)

[Diagram: observers A, B, C and D; markers: b1, b0]

    b0: data = 42;
    b1: dmb ishst;
    b2: flag = VALID;

SLIDE 26

Ordering Diagrams (DMB)

[Diagram: observers A, B, C and D; markers: b2, b1, b0, X]

    b0: data = 42;
    b1: dmb ishst;
    b2: flag = VALID;

SLIDE 27

Ordering Diagrams (DMB)

[Diagram: observers A, B, C and D; markers: b2, b0]

    b0: data = 42;
    b1: dmb ishst;
    b2: flag = VALID;
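The b0/b1/b2 sequence is the writer half of a message-passing pattern. A minimal portable sketch follows, assuming C11 atomics rather than the kernel's own primitives: the release fence stands in for `dmb ishst` (it is at least as strong, since it also orders earlier loads), and the reader side with its acquire fence is an addition not shown on the slide. The names `publish` and `try_consume` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define VALID 1

static int data;            /* payload, ordinary memory */
static atomic_int flag;     /* stays 0 until the payload has been published */

/* Writer: b0, b1 and b2 from the slide. */
void publish(int value)
{
    data = value;                                               /* b0 */
    atomic_thread_fence(memory_order_release);                  /* b1: stand-in for dmb ishst */
    atomic_store_explicit(&flag, VALID, memory_order_relaxed);  /* b2 */
}

/* Reader: returns true and fills *out only once the flag has been observed. */
bool try_consume(int *out)
{
    assert(out);
    if (atomic_load_explicit(&flag, memory_order_relaxed) != VALID)
        return false;
    /* Keep the data read from being satisfied before the flag read. */
    atomic_thread_fence(memory_order_acquire);
    *out = data;
    return true;
}
```

In real use `publish` and `try_consume` would run on different CPUs; without the fence in the writer, another observer could see `flag == VALID` before the store to `data`.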

SLIDE 28

Ordering Diagrams (DSB)

[Diagram: observers A, B, C and D; markers: b2, b1, b0]

    b2: flag = VALID;
    b3: dsb ishst;
    b4: sev();

SLIDE 29

Ordering Diagrams (DSB)

[Diagram: observers A, B, C and D; markers: b3, b2, b1, b0]

    b2: flag = VALID;
    b3: dsb ishst;
    b4: sev();

SLIDE 30

Ordering Diagrams (DSB)

[Diagram: observers A, B, C and D; markers: b3, b2, b0]

    b2: flag = VALID;
    b3: dsb ishst;
    b4: sev();

SLIDE 31

Ordering Diagrams (DSB)

[Diagram: observers A, B, C and D; markers: b2]

    b2: flag = VALID;
    b3: dsb ishst;
    b4: sev();

SLIDE 32

Ordering Diagrams (DSB)

[Diagram: observers A, B, C and D]

    b2: flag = VALID;
    b3: dsb ishst;
    b4: sev();

SLIDE 33

Overloading of barrier instructions

The barrier instructions are also overloaded to affect other parts of the system:

  • Cache maintenance ordered by dmb [st] and completed using dsb [st] on the same CPU
  • Branch predictor maintenance is completed at a context synchronisation operation (e.g. isb)
  • TLB maintenance completed using dsb
  • PTE updates ‘published’ to walker with dsb [st] (MP extensions)

isb required for explicit synchronisation with instruction stream.

SLIDE 34

Barriers in Linux

Linux defines more barrier types than you can shake a stick at!

  • Compiler: barrier()
  • Mandatory: mb(), wmb(), rmb(), (read_barrier_depends())
  • SMP conditional: smp_* – domain?
  • MMIO write: (mmiowb())

Also implicit barriers in locks, atomics, bitops, I/O accessors... (see Documentation/memory-barriers.txt).

SLIDE 35

Low-level barriers

The ARM architecture port maps the Linux barriers onto the v7 instruction set:

  • smp_* ⇒ dmb [sy]; (SMP)
  • rmb ⇒ dsb [sy];
  • [w]mb ⇒ dsb [sy]; [outer_sync();] (DMA)

There are also low-level barrier macros for ARM-specific code:

  • dmb ⇒ dmb [sy];
  • dsb ⇒ dsb [sy];

Spot the problem? (we’ve been getting away with it so far. . . )

SLIDE 36

Extended API

From Linux 3.12, we can specify the domain and access type for low-level barriers. This gives us a measurable performance boost, but increases the scope for horrible bugs!

    /* Write local pte */
    dsb(nshst);
    /* TLB invalidation */
    dsb(nsh);

All implemented write barriers take the -st option and the smp_* barriers become inner-shareable. Be sure to grab a ‘recent’ binutils.
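One way a macro can accept an option such as `nshst` is to stringify the token into the instruction text. The sketch below is modelled on that idea but is not the kernel's definition: `BARRIER_TEXT`, `my_dsb`, `my_dmb` and `update_local_pte` are hypothetical names, and on anything other than a 32-bit ARMv7 build the macros fall back to a C11 sequentially consistent fence, which is an assumption made so the example compiles on a development host.

```c
#include <assert.h>
#include <stdatomic.h>

/* Paste instruction and option into one string, e.g. "dsb nshst". */
#define BARRIER_TEXT(insn, option) #insn " " #option

#if defined(__arm__) && defined(__ARM_ARCH) && __ARM_ARCH >= 7
/* ARMv7: emit the instruction; the "memory" clobber also blocks the compiler. */
#define my_dsb(option) __asm__ __volatile__(BARRIER_TEXT(dsb, option) : : : "memory")
#define my_dmb(option) __asm__ __volatile__(BARRIER_TEXT(dmb, option) : : : "memory")
#else
/* Host fallback (assumption): a full fence, whatever option was requested. */
#define my_dsb(option) atomic_thread_fence(memory_order_seq_cst)
#define my_dmb(option) atomic_thread_fence(memory_order_seq_cst)
#endif

/* Hypothetical helper following the slide's two-step sequence. */
void update_local_pte(volatile unsigned long *pte, unsigned long val)
{
    assert(pte);
    *pte = val;         /* Write local pte */
    my_dsb(nshst);      /* complete the store before any maintenance */
    /* The TLB invalidation itself is privileged and is omitted here. */
    my_dsb(nsh);        /* complete the (omitted) invalidation */
}
```

Because the option is passed straight to the assembler, an assembler that does not know the newer option names will reject the file, which is presumably why the slide asks for a recent binutils.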

SLIDE 37

Example: spin unlock

    /*
     * Ensure accesses don’t leak out
     * from critical section
     */
    smp_mb();
    /* Release the lock */
    lock->tickets.owner++;
    /* Wake up waiting CPUs */
    dsb_sev();

    @ 3.11
    dmb   sy
    ldrh  r3, [r0]
    add   r3, r3, #1
    strh  r3, [r0]
    dsb   sy
    sev

SLIDE 38

Example: spin unlock

    /*
     * Ensure accesses don’t leak out
     * from critical section
     */
    smp_mb();
    /* Release the lock */
    lock->tickets.owner++;
    /* Wake up waiting CPUs */
    dsb_sev();

    @ 3.12
    dmb   ish
    ldrh  r3, [r0]
    add   r3, r3, #1
    strh  r3, [r0]
    dsb   ishst
    sev
    @ ~5% hackbench
    @ improvement on TC2!
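For context, the unlock path above belongs to a ticket lock. The following is a portable sketch of that structure using C11 atomics, not the kernel's implementation: the type and function names are invented, release ordering on the owner store stands in for the `smp_mb()` (which is a stronger, full barrier), and the `wfe`/`sev` wake-up is replaced by a plain spin.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical ticket lock: 'next' hands out tickets, 'owner' is now serving. */
struct ticket_lock {
    atomic_ushort next;
    atomic_ushort owner;
};

void ticket_lock_acquire(struct ticket_lock *lock)
{
    assert(lock);
    /* Take a ticket; relaxed is enough because the spin below acquires. */
    unsigned short ticket =
        atomic_fetch_add_explicit(&lock->next, 1, memory_order_relaxed);

    /* Spin until our ticket is served (the kernel would wait with wfe). */
    while (atomic_load_explicit(&lock->owner, memory_order_acquire) != ticket)
        ;
}

void ticket_lock_release(struct ticket_lock *lock)
{
    assert(lock);
    /* Only the holder updates 'owner', so load-then-store is sufficient. */
    unsigned short owner =
        atomic_load_explicit(&lock->owner, memory_order_relaxed);

    /* Release ordering keeps critical-section accesses from leaking out. */
    atomic_store_explicit(&lock->owner, (unsigned short)(owner + 1),
                          memory_order_release);
}
```

The load, add and store in `ticket_lock_release` mirror the `ldrh`/`add`/`strh` sequence on the slide; the surrounding barriers are where the 3.11 and 3.12 listings differ.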

SLIDE 39

Example: DMA To Device

[Diagram: CPU, DMA (ctrl), DMA (master) and Memory System]

    a0: str data, [mem]
    a1: ?<barrier>?
    a2: str #DMA_EN, [ctrl]

SLIDE 40

Example: DMA To Device

[Diagram: CPU, DMA (ctrl), DMA (master) and Memory System]

    a0: str data, [mem]
    a1: dmb st
    a2: str #DMA_EN, [ctrl]

SLIDE 41

Example: DMA To Device

[Diagram: CPU, DMA (ctrl), DMA (master) and Memory System; markers: a0, a1, a2]

    a0: str data, [mem]
    a1: dmb st
    a2: str #DMA_EN, [ctrl]

SLIDE 42

Ordering of Observability Satisfied!

[Diagram: CPU, DMA (ctrl), DMA (master) and Memory System; markers: a0, a2, a1, a1]

    a0: str data, [mem]
    a1: dmb st
    a2: str #DMA_EN, [ctrl]

Race condition!

SLIDE 43

Example: DMA To Device

[Diagram: CPU, DMA (ctrl), DMA (master) and Memory System; markers: a0, a1]

    a0: str data, [mem]
    a1: dsb st /* wmb() */
    a2: str #DMA_EN, [ctrl]
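At C level the same three steps look roughly as follows. This is a host-side sketch under loud assumptions: `struct fake_dma`, `DMA_EN` and `wmb_stand_in()` are invented for illustration, the "register" is just a volatile field, and the C11 fence only marks where `wmb()` would sit in a driver. It cannot reproduce the completion guarantee of `dsb st` towards a real DMA master.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define DMA_EN 0x1u     /* hypothetical control bit */

/* Hypothetical device model; a driver would map 'ctrl' as device memory. */
struct fake_dma {
    volatile uint32_t ctrl;
    uint8_t mem[64];    /* buffer the DMA master will read from */
};

/* Placeholder for wmb(); on ARMv7 the slide shows this must be dsb st. */
static inline void wmb_stand_in(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}

void dma_to_device(struct fake_dma *dev, const void *src, size_t len)
{
    assert(dev && src && len <= sizeof dev->mem);
    memcpy(dev->mem, src, len);     /* a0: str data, [mem] */
    wmb_stand_in();                 /* a1: barrier before kicking the device */
    dev->ctrl = DMA_EN;             /* a2: str #DMA_EN, [ctrl] */
}
```

The ordering of the three statements is the point: the buffer contents must be complete in memory before the control write that lets the DMA master start reading.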

SLIDE 44

Example: DMA From Device

[Diagram: CPU, DMA (ctrl), DMA (master) and Memory System]

    a0: ldr stat, [ctrl]
    a1: cmp stat, #DMA_DONE
    a2: bne a0
    a3: ?<barrier>?
    a4: ldr data, [mem]

SLIDE 45

Example: DMA From Device

[Diagram: CPU, DMA (ctrl), DMA (master) and Memory System]

    a0: ldr stat, [ctrl]
    a1: cmp stat, #DMA_DONE
    a2: bne a0
    a3: dmb
    a4: ldr data, [mem]

SLIDE 46

Example: DMA From Device

[Diagram: CPU, DMA (ctrl), DMA (master) and Memory System; markers: a0]

    a0: ldr stat, [ctrl]
    a1: cmp stat, #DMA_DONE
    a2: bne a0
    a3: dmb
    a4: ldr data, [mem]

SLIDE 47

Speculation Through Control Dependency!

[Diagram: CPU, DMA (ctrl), DMA (master) and Memory System; markers: a0, a3, a3, a4]

    a0: ldr stat, [ctrl]
    a1: cmp stat, #DMA_DONE
    a2: bne a0
    a3: dmb
    a4: ldr data, [mem]

SLIDE 48

Speculation Through Control Dependency!

[Diagram: CPU, DMA (ctrl), DMA (master) and Memory System; markers: a4, a0, a3, a3, a3, d0]

    a0: ldr stat, [ctrl]
    a1: cmp stat, #DMA_DONE
    a2: bne a0
    a3: dmb
    a4: ldr data, [mem]

Race condition!

SLIDE 49

Example: DMA From Device

[Diagram: CPU, DMA (ctrl), DMA (master) and Memory System; markers: a0, a3]

    a0: ldr stat, [ctrl]
    a1: cmp stat, #DMA_DONE
    a2: bne a0
    a3: dsb /* rmb() */
    a4: ldr data, [mem]
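The receive direction can be sketched in C the same way. Again the names are hypothetical (`struct fake_dma_rx`, `DMA_DONE`, `rmb_stand_in()`), the status "register" is a volatile field on the host, and the C11 fence only marks the position of `rmb()`. The poll limit is an addition so that the function always returns; the slide's loop has none.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DMA_DONE 0x2u   /* hypothetical status value */

/* Hypothetical device model: status register plus one word of DMA'd data. */
struct fake_dma_rx {
    volatile uint32_t stat;
    uint32_t mem;
};

/* Placeholder for rmb(); on ARMv7 the slide shows this must be dsb. */
static inline void rmb_stand_in(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}

/* Returns 0 and stores the data in *out, or -1 if max_polls is exhausted. */
int dma_from_device(struct fake_dma_rx *dev, unsigned max_polls, uint32_t *out)
{
    assert(dev && out);
    while (dev->stat != DMA_DONE) {     /* a0-a2: poll the status register */
        if (max_polls-- == 0)
            return -1;
    }
    rmb_stand_in();                     /* a3: no data read before this point */
    *out = dev->mem;                    /* a4: ldr data, [mem] */
    return 0;
}
```

The barrier sits between the final status read and the data read because, as the preceding slides show, the branch alone does not stop the data load from being performed speculatively.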

SLIDE 50

Which Barrier Should I Use?

Ignoring maintenance operations, memory barriers are typically required when publishing to and consuming from other observers (data vs control).

  1. Do you even need a barrier? (dependencies)
  2. Do you only care about ordering between CPUs? (smp_*)
  3. Only care about reads or writes? (*[rw]mb)
  4. Low-level barriers rarely needed (nsh, osh and maintenance)
  5. I/O accessors and relaxed variants (readl, writel)

SLIDE 51

Questions?

SLIDE 52

ARMv8

ARMv8 introduces some exciting new features to the memory model!

  • ld barrier option to order reads against reads/writes
  • Half barriers in the form of acquire/release operations
  • Device memory attributes nGnRnE

There’s also the problem of defining *_relaxed across architectures...
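The acquire/release "half barriers" mentioned above have a direct counterpart in C11 atomics, which gives a feel for how they are used. The sketch below is illustrative only: `publish_release` and `consume_acquire` are invented names, and the mapping to ARMv8 store-release and load-acquire instructions is what compilers typically choose, not something this code controls.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static int payload;         /* ordinary memory protected by 'ready' */
static atomic_bool ready;   /* false until the payload is published */

void publish_release(int value)
{
    payload = value;
    /* Store-release: earlier accesses cannot move after this store. */
    atomic_store_explicit(&ready, true, memory_order_release);
}

bool consume_acquire(int *out)
{
    assert(out);
    /* Load-acquire: later accesses cannot move before this load. */
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

Each operation orders in one direction only, so no standalone barrier instruction is needed on either side of the handoff.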