Università degli studi di Udine Sistemi operativi – Operating Systems

Memory Consistency Models

Sources of out-of-order memory accesses

  • Compiler optimizations
  • Store buffers
      • FIFOs for uncommitted writes
  • Invalidate queues (for cache coherency)
  • Data prefetch
  • Banked cache architectures
  • Networked interconnect
  • Non-uniform memory access (NUMA) architectures
      • different accesses to memory have different latencies
  • ...

Compiler optimizations

The language semantics do not consider:

  • 1. Side effects of memory accesses
  • 2. Multi-threading
  • 3. Asynchronous execution

The compiler can:

  • Reorder instructions
  • Eliminate operations

Some compiler optimizations can be controlled with the volatile qualifier

C code:

    void waitval (int *ptr)
    {
        while (*ptr == 0)
            continue;
    }

ARM assembly code:

    waitval:
        ldr   r3, [r0]
        cmp   r3, #0
        movne pc, lr
    loop:
        b     loop

The compiler does not need to consider that someone else can change *ptr, so *ptr is loaded only once

C code:

    int add3 (int x)
    {
        int i;
        for (i=0; i<3; i++)
            x += x;
        return x;
    }

ARM assembly code:

    add3:
        mov r0, r0, asl #3
        mov pc, lr

This function always returns 8·x: the compiler can replace the loop with a single shift

C code:

    int add_vals (int *vec)
    {
        int y = vec[1];
        y += vec[0];
        return y;
    }

ARM assembly code:

    add_vals:
        ldr r3, [r0]
        ldr r0, [r0, #4]
        add r0, r0, r3
        mov pc, lr

The result does not depend on the access order: the compiler can change the order of the loads

Volatile

Semantics

  • Each read from a volatile variable requires an actual load and may return a different value
      • Compiler optimizations cannot merge reads from the same address
  • Each write to a volatile variable requires an actual store
      • Compiler optimizations cannot cancel stores
  • Required to access the I/O address space

Note: these are the C/C++ semantics; the Java semantics differ (they also imply atomicity)
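As a minimal sketch of what the qualifier changes, the waitval example shown earlier can be rewritten with a pointer to volatile int, so that the load stays inside the loop. The name waitval_volatile and the returned poll count are illustrative assumptions, not part of the original slides. The sketch only concerns what the compiler may do; it gives neither atomicity nor hardware ordering.

```c
/* Sketch: with a pointer to volatile int, every evaluation of *ptr
 * must perform an actual load, so the compiler cannot hoist the read
 * out of the loop. The function name is hypothetical. */
int waitval_volatile (volatile int *ptr)
{
    int polls = 0;

    while (*ptr == 0)   /* one real load per iteration */
        polls++;
    return polls;       /* how many times the value was found to be 0 */
}
```

If the value is already non-zero when the function is called, it returns immediately with a count of 0.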

Examples

    int *ptr;                               /* pointer to int */
    volatile int *ptr_to_vol;               /* pointer to volatile int */
    int *volatile vol_ptr;                  /* volatile pointer to int */
    volatile int *volatile vol_ptr_to_vol;  /* volatile pointer to volatile int */

Beware of the semantics:

  • a = *ptr_to_vol; is a volatile access
  • a = *vol_ptr; is not a volatile access to the int (only the pointer itself is volatile)

Volatile

  • Inconsistent qualification causes errors
  • Volatile does not enforce ordering with non-volatile accesses
  • Volatile does not enforce the order in which accesses are actually performed
  • Volatile does not mean atomic

Volatile

    volatile int A;
    volatile int B;
    A=1;   /* these two lines won't be */
    B=1;   /* reordered by compiler    */

    int A;
    volatile int B;
    A=1;   /* these two lines can be   */
    B=1;   /* reordered by compiler    */

    volatile int A;
    volatile int B;
    A=1;   /* these two lines won't be  */
    B=1;   /* reordered by compiler but */
           /* accesses can be reordered */
           /* by HW                     */

    volatile int X;
    X=1;   /* this assignment can be interrupted or preempted */

Memory barrier

Implementation on GCC:

    asm volatile ("" : : : "memory");

This inline assembly code:

  • 1. contains no instructions
  • 2. may read or write all of RAM

Hence the compiler is not allowed to reorder memory accesses across the barrier, in either direction
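The following sketch shows one way the GCC barrier could be used between two stores. The macro name compiler_barrier and the functions publish, is_ready and get_data are hypothetical; the __asm__ __volatile__ spelling is used so that the code also compiles in strict ISO C mode with GCC or Clang. The barrier emits no instruction, so the hardware may still reorder the two stores.

```c
/* GCC/Clang-specific compiler barrier (same statement as on the slide). */
#define compiler_barrier() __asm__ __volatile__ ("" : : : "memory")

static int data;
static int ready;

/* Hypothetical producer: the barrier keeps the compiler from moving
 * the store to 'ready' before the store to 'data'. It constrains the
 * compiler only, not the processor. */
void publish (int value)
{
    data = value;
    compiler_barrier ();
    ready = 1;
}

int is_ready (void) { return ready; }
int get_data (void) { return data; }
```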

Store Buffer

  • Records a store in a buffer until it is actually performed
  • Hides memory latency
      • Cache latency
      • Cache miss on write
  • The processor can execute other instructions in the meantime
  • Data dependency (RAW), two options:
      • Wait until the write is actually performed in memory or in cache
      • Read the data from the store buffer (store forwarding)
  • Data dependency (WAW), two options:
      • Add a new entry in the store buffer
      • Replace the previous write in the store buffer

Example

Processor P1 executes:

  • 1) store A
  • 2) store B

A and B are shared with P2:

  • A is in P2 cache
  • B is in both caches

[Figure: P1 and P2, each with a store buffer and a cache, connected by the interconnect; P1 cache holds B, P2 cache holds B and A]

Example

Execution:

  • 1: store A: cache miss
      • write the updated value in the store buffer
      • send a read request (data will come from P2 cache): several clock cycles needed
      • P1 can proceed (the new value is in the store buffer)
      • P2 does not see the write
  • 2: store B: cache hit
      • data is written in cache
      • a coherence message is sent to P2
      • P2 sees the write
  • 3: A is loaded in P1 cache
  • 4: A is updated in P1 cache
      • a coherence message is sent to P2
      • P2 sees the write

P2 sees the store on B first, then the store on A
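The four steps above can be replayed with a small, single-threaded simulation. This is only a sketch under strong assumptions: a store that hits the cache becomes visible at once, a store that misses waits in the store buffer until the line arrives, and all names (struct cpu, cpu_store, cpu_line_arrives, first_store_seen_by_p2) are hypothetical. Real store buffers and coherence protocols are more elaborate.

```c
#include <stdbool.h>
#include <stddef.h>

enum { VAR_A, VAR_B, NVARS, MAX_LOG = 8 };

struct cpu {
    bool in_cache[NVARS];     /* is the line present in the local cache? */
    bool pending[NVARS];      /* is a store waiting in the store buffer? */
    int  pending_val[NVARS];
};

static int    shared[NVARS];          /* values visible to the other processor */
static int    visible_order[MAX_LOG]; /* order in which stores became visible  */
static size_t n_visible;

static void make_visible (int var, int val)
{
    shared[var] = val;
    if (n_visible < MAX_LOG)
        visible_order[n_visible++] = var;
}

/* A store either hits the cache and becomes visible at once, or misses
 * and waits in the store buffer while the processor goes on. */
void cpu_store (struct cpu *c, int var, int val)
{
    if (c->in_cache[var]) {
        make_visible (var, val);
    } else {
        c->pending[var] = true;
        c->pending_val[var] = val;
    }
}

/* The missing line arrives: the buffered store is finally performed. */
void cpu_line_arrives (struct cpu *c, int var)
{
    c->in_cache[var] = true;
    if (c->pending[var]) {
        c->pending[var] = false;
        make_visible (var, c->pending_val[var]);
    }
}

/* Replays the example; returns the variable whose store P2 sees first. */
int first_store_seen_by_p2 (void)
{
    struct cpu p1 = { .in_cache = { [VAR_A] = false, [VAR_B] = true } };

    n_visible = 0;
    cpu_store (&p1, VAR_A, 1);       /* 1: cache miss, store is buffered */
    cpu_store (&p1, VAR_B, 1);       /* 2: cache hit, store is visible   */
    cpu_line_arrives (&p1, VAR_A);   /* 3-4: A is loaded, then updated   */
    return visible_order[0];
}
```

In this model the store to B is logged before the store to A, which matches the order perceived by P2 in the example.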

Consequence

initially: A=0 and B=0

P1:

    A = 1;
    B = 1;

P2:

    while (B==0) continue;
    assert (A==1); /* this can fail! */

If P2 sees the stores performed by P1 in reverse order, the assertion fails

Note: A and B are volatile

Cache coherency

Cache coherency can require cache line invalidation

  • A processor sends an invalidate message to another one
  • The target processor must invalidate the cache line

Invalidate Queue

  • Stores invalidate requests while the cache is busy
  • Invalidates the line when the cache is ready

Data prefetch

The processor can read data before the actual load instruction

  • Hide memory latency
  • Preload data in cache

Speculative execution

  • Execute the instructions that follow a branch before the branch is resolved

Banked cache architectures

Caches are split in several banks

While accesses to busy banks must wait, accesses to idle banks can proceed

[Figure: processor with store buffer and two cache banks attached to the interconnect]

Definitions

Program order

The order of operations as specified by software

Execution order

The order of operations as executed by a processor

Perceived order

The order of operations as seen by processors and memories

Memory consistency model

Rules that specify the allowed behavior of programs in terms of memory accesses

Rules: order restrictions

Definitions

Performed

  • Write: a write by processor i is performed with respect to processor k when a read issued by k to the same address returns the value stored by i
  • Read: a read by processor i is performed with respect to processor k when a write issued by k to the same address cannot affect the value read by i

Globally performed: performed with respect to all processors

  • Write: a write is globally performed when its modification has been propagated to all processors
  • Read: a read is globally performed when the value it returns is bound and the write that wrote this value is globally performed

Memory consistency models

Rules on access ordering can regard:

  • Location (address of access)
  • Direction
      • read, write, read-write
  • Value
  • Causality
      • behavior of an access depends on the behavior of another one
  • Category
      • shared / private
      • synchronizing / not synchronizing

Memory consistency models

Uniform consistency models

  • Rules do not concern the category of accesses

Hybrid consistency models

  • The category of accesses matters

Uniform consistency models

Local Consistency (LC)

  • Each process sees its own accesses in program order
  • There is no restriction on the order of the accesses seen by other processors
      • Different processes may see different orders
  • The weakest consistency model:
      • it only guarantees sequential behavior on uniprocessor systems
      • not usable to program in parallel environments

Uniform consistency models

Sequential consistency (SC)

  • There is a global total order of all memory accesses (of all processors)
      • all processors agree with such global order
      • the global order can change at each run
  • Each processor sees its own accesses in program order
  • Rules out many architectural optimizations
  • Easy to use
  • Model implied by a cacheless system, with a single memory device, with processors unable to perform out-of-order execution

SC: Consequence

initially: A=0 and B=0

P1:

  • 1a. A = 1
  • 2a. B = 1

P2:

  • 1b. while (B==0) continue;
  • 2b. assert (A==1);

The assertion cannot fail

Note: A and B are volatile

History notation: in W(A)1, W is the access type (W: write, R: read), A is the variable/address and 1 is the data stored/read; time flows from left to right

SC: Consequence

History of an execution (time flows from left to right):

    P1:  W(A)1 [1a]   W(B)1 [2a]
    P2:  R(B)0 [1b]   R(B)0 [1b]   R(B)0 [1b]   R(B)1 [1b]   R(A)1 [2b]

SC: Consequence

Why the assertion cannot fail:

  • Each processor sees its own accesses in program order
  • All processors agree with a global order
  • Access 1a is before access 2b

It is easy to enforce order between accesses from different processors

Sequential consistency

Cache based system, no constraint on the interconnect

Sufficient conditions

  • All processors issue their accesses in program order
  • A processor does not issue an access until its previous accesses have been globally performed
      • This requires waiting for acknowledgements from other processors

Rules out many architectural optimizations

  • No out-of-order execution
  • A write hit in cache must wait for the answers

Easy to use

Comparison

The union of all the perceived orders can be valid or not for a given consistency model

Example:

P1 executes:

    I1: store A
    I2: store B

P2 executes:

    I3: load A
    I4: load B

  • 1) P1 sees I1, I2; P2 sees I1, I3, I4, I2
      • valid execution for Sequential consistency
      • total order implied: I1, I3, I4, I2
  • 2) P1 sees I1, I2; P2 sees I3, I2, I4, I1
      • invalid execution for Sequential consistency: there is no single total order
      • valid execution for Local consistency: P1 and P2 see their own accesses in order
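The comparison of two perceived orders can be sketched as a pairwise check: if some pair of instructions appears in opposite orders in the two sequences, no single total order can contain both. The function names and the encoding of I1..I4 as the integers 1..4 are assumptions made for illustration. The check does not verify program order, so it is only one ingredient of a test for Sequential consistency.

```c
#include <stdbool.h>
#include <stddef.h>

/* Position of instruction 'id' in a perceived order, or -1 if absent. */
static int position (const int *order, size_t n, int id)
{
    for (size_t i = 0; i < n; i++)
        if (order[i] == id)
            return (int) i;
    return -1;
}

/* True if no pair of instructions seen by both processors appears in
 * opposite orders in the two perceived orders p and q. */
bool orders_agree (const int *p, size_t np, const int *q, size_t nq)
{
    for (size_t i = 0; i < np; i++) {
        for (size_t j = i + 1; j < np; j++) {
            int qi = position (q, nq, p[i]);
            int qj = position (q, nq, p[j]);

            if (qi >= 0 && qj >= 0 && qi > qj)
                return false;   /* p puts p[i] first, q puts p[j] first */
        }
    }
    return true;
}
```

With the two cases of the example, the orders (I1, I2) and (I1, I3, I4, I2) agree, while (I1, I2) and (I3, I2, I4, I1) do not, because I1 and I2 appear in opposite orders.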

Comparison

Consistency model A is stronger than consistency model B if:

  • each execution valid on A is also valid on B
  • also: B is weaker than A

If there exist

  • some execution E1 valid on A and not valid on B
  • some execution E2 valid on B and not valid on A

then A and B are incomparable

Uniform consistency models

Causal consistency (Causal)

All processors agree on the order of causally related events

  • causally unrelated events can be observed in different orders

Example

  • X is initially 0
  • event1: P1 writes 1 to X
  • event2: P2 reads X and obtains 1
  • event3: P1 writes 2 to X
  • hence:
      • event1 happened before event2
      • event2 happened before event3
      • all processors agree on such an ordering

Example

  • X is initially 0
  • event1: P2 reads X and obtains 0
  • event2: P1 writes 1 to X
  • event3: P2 reads X and obtains 1
  • hence:
      • event1 happened before event2
      • event2 happened before event3
      • all processors agree on such an ordering

Causal consistency: example 1

initially: X=0

    P1:  1a: X = 1   2a: X = 3
    P2:  1b: A = X   2b: X = 2
    P3:  1c: B = X   2c: D = X   3c: F = X
    P4:  1d: C = X   2d: E = X   3d: G = X

result: A=1 ; B=1 ; C=1 ; D=3 ; E=2 ; F=2 ; G=3

Causal consistency: example 1

initially: X=0

    P1:  W(X)1   W(X)3
    P2:  R(X)1   W(X)2
    P3:  R(X)1   R(X)3   R(X)2
    P4:  R(X)1   R(X)2   R(X)3

Causal consistency: example 1

  • for P3: 2a < 2b
  • for P4: 2b < 2a
  • a single global order is not possible (2a ? 2b)
      • execution is not Sequentially consistent
  • no contradictions on causal dependencies
      • execution is Causally consistent
  • Note: 2a and 2b are not causally related

Causal consistency: example 2

initially: X=0 , Y=0

    P1:  1a: X = 1   2a: X = 2
    P2:  1b: A = X   2b: Y = 3
    P3:  1c: B = Y   2c: C = X

result: A=2 ; B=3 ; C=1

Causal consistency: example 2

initially: X=0 , Y=0

    P1:  W(X)1   W(X)2
    P2:  R(X)2   W(Y)3
    P3:  R(Y)3   R(X)1

Causal consistency: example 2

  • for P2: 2a < 2b (A=2)
  • for P3: 2b < 2a (B=3 and C=1)
  • P2 and P3 disagree on the order between 2a and 2b
  • 2a and 2b are causally related (constraint due to A=2)
  • execution is not Causally consistent

Causal consistency: example 3

initially: X=0 , Y=0

    P1:  1a: X = 1   2a: X = 2
    P2:  1b: A = X   2b: Y = 3
    P3:  1c: B = Y   2c: C = X

result: A=2 ; B=3 ; C=2

execution is Causally consistent

Uniform consistency models

PRAM (pipelined RAM) consistency (PRAM)

  • Writes performed by a single process are seen by all other processes in the order in which they were issued
  • The perceived order of all writes can be different for each process

Cache consistency (CC)

  • All writes to the same memory location are performed in some sequential order
  • All processes see the same order of writes for each location (but the order of all writes can differ)

PRAM consistency: example

initially: X=0

    P1:  1a: X = 1
    P2:  1b: A = X   2b: X = 2
    P3:  1c: B = X   2c: D = X
    P4:  1d: C = X   2d: E = X

result: A=1 ; B=1 ; C=2 ; D=2 ; E=1

PRAM consistency: example

initially: X=0

    P1:  W(X)1
    P2:  R(X)1   W(X)2
    P3:  R(X)1   R(X)2
    P4:  R(X)2   R(X)1

PRAM consistency: example

  • all processors see the same order for the writes of P1 (1a) and for the writes of P2 (2b) (trivial: each of them performs a single write)
      • execution is PRAM consistent
  • a single global order is not possible
      • execution is not Sequentially consistent
  • P3 and P4 do not agree on the causal relation between 1a and 2b
      • execution is not Causally consistent

Cache consistency: example

initially: X=0 ; Y=0

    P1:  1a: X = 1   2a: A = Y
    P2:  1b: Y = 1   2b: B = X

result: A=0 ; B=0

Cache consistency: example

initially: X=0 ; Y=0

    P1:  W(X)1   R(Y)0
    P2:  W(Y)1   R(X)0

Cache consistency: example

  • for P1: 1a < 1b (A=0)
  • for P2: 1b < 1a (B=0)
  • all processors see the same order for writes on X (1a) and on Y (1b) (trivial: there is a single write per location)
      • execution is Cache consistent
  • a single global order is not possible
      • execution is not Sequentially consistent

Uniform consistency models

Processor consistency (PC)

  • PRAM consistent and Cache consistent
  • Tie-Breaker (Peterson's) algorithm executes correctly under Processor consistency
  • Bakery algorithm needs Sequential consistency
  • Processor consistent machines are easier to build than sequentially consistent systems

Processor consistency: example

initially: X=0 ; Y=0

    P1:  1a: X = 1   2a: c = 1   3a: A = Y
    P2:  1b: Y = 1   2b: c = 2   3b: B = X

result: A=0 ; B=0

Processor consistency: example

initially: X=0 ; Y=0

    P1:  W(X)1   W(c)1   R(Y)0
    P2:  W(Y)1   W(c)2   R(X)0

Processor consistency: example

  • A=0: for P1, 3a < 1b
  • B=0: for P2, 3b < 1a
  • for P1: 1a < 2a < 3a < 1b < 2b
  • for P2: 1b < 2b < 3b < 1a < 2a
  • processors see different orders for writes on c
      • execution is not Processor consistent

Uniform consistency models

Slow consistency (SC)

  • All processors agree on the order of observed writes to each location by a single processor
  • Writes by a process must be immediately visible to itself
  • Models a system where writes propagate slowly to memory and other processors

Uniform consistency models

[Figure: hierarchy of the uniform consistency models: Sequential Consistency, Causal Consistency, Processor Consistency, PRAM Consistency, Cache Consistency, Slow Consistency, Local Consistency]

SC vs PC: example

initially: A=0 and B=0

P1:

  • 1a. A = 1;
  • 2a. X = B;

P2:

  • 1b. B = 1;
  • 2b. Y = A;

Note: A and B are volatile

  • On sequentially consistent systems, X==0 and Y==0 is not possible
  • On processor consistent systems, X==0 and Y==0 is possible
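The claim about sequentially consistent systems can be checked by brute force: under Sequential consistency every execution is some interleaving of the two programs that respects program order, so it is enough to enumerate them. The sketch below does that for the four accesses 1a, 2a, 1b, 2b; the function names and the bit-mask encoding of an interleaving are illustrative assumptions.

```c
#include <stdbool.h>

/* Runs one interleaving of the four accesses. Bit k of 'mask' selects
 * the processor that performs step k (0 = P1, 1 = P2). Returns false
 * when the mask does not give exactly two steps to each processor. */
static bool run (unsigned mask, int *x, int *y)
{
    int A = 0, B = 0, X = -1, Y = -1;
    int done1 = 0, done2 = 0;       /* accesses already performed */

    for (int step = 0; step < 4; step++) {
        if ((mask >> step) & 1u) {
            if (done2 == 0)      B = 1;     /* 1b */
            else if (done2 == 1) Y = A;     /* 2b */
            else return false;
            done2++;
        } else {
            if (done1 == 0)      A = 1;     /* 1a */
            else if (done1 == 1) X = B;     /* 2a */
            else return false;
            done1++;
        }
    }
    *x = X;
    *y = Y;
    return true;
}

/* Number of interleavings that respect program order. */
int count_interleavings (void)
{
    int n = 0, x, y;

    for (unsigned mask = 0; mask < 16; mask++)
        if (run (mask, &x, &y))
            n++;
    return n;
}

/* Number of those interleavings that end with X==0 and Y==0. */
int count_both_zero (void)
{
    int n = 0, x, y;

    for (unsigned mask = 0; mask < 16; mask++)
        if (run (mask, &x, &y) && x == 0 && y == 0)
            n++;
    return n;
}
```

None of the six interleavings ends with both X and Y equal to 0. The enumeration says nothing about processor consistent systems, where the two reads may be performed before the other processor's write becomes visible.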

SC vs PC: example

initially: A=0 , B=0 , C=0 , D=0 , E=0

P1:

  • 1a. A = 1;
  • 2a. B = D;
  • 3a. C = 1;

P2:

  • 1b. D = 1;
  • 2b. E = A;
  • 3b. while (C==0) continue;
  • 4b. assert(B==1 || E==1);

The assertion 4b cannot fail on sequentially consistent systems, but can fail on processor consistent systems

Note: A, B, C, D, E are volatile

Consistency model and synchronization

For 2 processes,

  • many synchronization patterns work in the same way in processor consistent systems as well as in sequentially consistent systems

It is possible to construct a situation in which processor ordering fails, but such code is unlikely to be useful

Signaling

initially: A=0 and B=0

P1:

  • 1a. A = 1;
  • 2a. B = 1;

P2:

  • 1b. while (B==0) continue;
  • 2b. assert (A==1);

The assertion cannot fail on sequentially consistent and on processor consistent systems

    P1:  W(A)1 [1a]   W(B)1 [2a]
    P2:  R(B)0 [1b]   R(B)0 [1b]   R(B)0 [1b]   R(B)1 [1b]   R(A)1 [2b]

Note: A and B are volatile

Barrier

initially: A=0 , B=0 , C=0 , D=0

P1:

  • 1a. A = 1;
  • 2a. B = 1;
  • 3a. while (D==0) continue;
  • 4a. assert (A==1 && C==1);

P2:

  • 1b. C=1;
  • 2b. D=1;
  • 3b. while (B==0) continue;
  • 4b. assert (A==1 && C==1);

The assertions cannot fail on sequentially consistent and on processor consistent systems

Note: A, B, C, D are volatile

Consistency model and synchronization

For 3 or more processes,

  • there are simple synchronization patterns that work in sequentially consistent systems but not in processor consistent systems

However,

  • it is easy to introduce small changes to obtain a correct synchronization even in processor consistent systems

Signaling

initially: A=0 , B=0 , C=0

P1:

  • 1a. A = 1;
  • 2a. B = 1;

P2:

  • 1b. while (B==0) continue;
  • 2b. C=1;

P3:

  • 1c. while (C==0) continue;
  • 2c. assert (A==1);

Note: A, B, C are volatile

The assertion 2c cannot fail on sequentially consistent systems, but can fail on processor consistent systems

WA1 and WC1 are performed on different variables by different cores: on PC systems no order is enforced

Signaling exploiting cache coherency

initially: A=0 and B=0

P1:

  • 1a. A = 1;
  • 2a. B = 1;

P2:

  • 1b. while (B==0) continue;
  • 2b. B=2;

P3:

  • 1c. while (B!=2) continue;
  • 2c. assert (A==1);

Note: A and B are volatile

The assertion 2c cannot fail on processor consistent systems

WB1 and WB2 are performed by different cores on the same variable: cache coherency enforces access ordering

Signaling exploiting cache coherency

initially: A=0 , B=0 , C=0

P1:

  • 1a. A = 1;
  • 2a. B = 1;

P2:

  • 1b. while (B==0) continue;
  • 2b. B=1;
  • 3b. C=1;

P3:

  • 1c. while (C==0) continue;
  • 2c. assert (A==1);

Note: A, B, C are volatile

The assertion 2c cannot fail on processor consistent systems

  • WB1 (2a) and WB1 (2b) are performed by different cores on the same variable: cache coherency enforces access ordering
  • WB1 and WC1 are performed by the same processor: order is enforced by PRAM consistency

Hybrid consistency models

  • Weak Consistency (WC)
  • Release Consistency (RC)
  • Entry Consistency (EC)
  • Others
      • Scope Consistency
      • Location Consistency
      • Dag Consistency

Hybrid consistency models

Weak Consistency (WC)

  • 2 types of accesses
      • not synchronizing (read, write, read-write)
      • synchronizing
  • Accesses to synchronization variables are sequentially consistent
  • No access to a synchronization variable is issued by a processor before all previous data accesses have been performed
  • No access is issued by a processor before a previous access to a synchronization variable has been performed
  • Standard reads and writes obey Local consistency
  • A synchronization access works as a fence

Weak consistency

    P1:  W(X)1 [1a]        sync_W(Y)1 [2a]
    P2:  sync_R(Y)1 [1b]   R(X)1 [2b]

  • 1a < 2a: cannot be reordered, since 2a is a synch. access
  • 1b < 2b: cannot be reordered, since 1b is a synch. access
  • Y=1: 2a < 1b

Hence: in 2b, X must be 1

Data Race

Conflicting accesses

  • accesses to the same address from different processors, where at least one is a write

Access order

  • order can be enforced by the consistency model (SC) or by using a synchronization access

Data race:

  • 2 conflicting accesses with no ordering imposed

SC-DRF

SC-DRF: Sequential consistency for Data-race free programs

A program executing on a weakly consistent system appears sequentially consistent if:

  • there are no data races (i.e., no competing accesses)
  • synchronization is visible to the memory system

Hybrid consistency models

Release Consistency (RC)

  • 2 kinds of synchronization accesses
      • acquire
          • only delays future accesses
          • often associated to a read: load_acquire
      • release
          • only waits for previous accesses
          • often associated to a write: store_release
  • Synchronization accesses are Processor consistent
  • acquire and release act as a semi-permeable barrier

Memory Acquire and Release

    1: access1
    2: access2
    3: acquire
    4: access3
    5: access4
    6: release
    7: access5
    8: access6

  • access1 and access2 can be reordered before and after “Acquire”, but not after “Release”
  • access3 and access4 can be reordered only between “Acquire” and “Release”
  • access5 and access6 can be reordered before and after “Release”, but not before “Acquire”
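A critical section is the typical use of an acquire/release pair. The sketch below builds a simple spin lock with the C11 <stdatomic.h> interface; the names spin_lock, spin_unlock, increment and counter are hypothetical, and C11 atomics are an assumption of this example (the slides do not name a specific API).

```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static int counter;             /* protected by 'lock' */

void spin_lock (void)
{
    /* acquire: later accesses cannot be moved before this point */
    while (atomic_flag_test_and_set_explicit (&lock, memory_order_acquire))
        continue;
}

void spin_unlock (void)
{
    /* release: earlier accesses cannot be moved after this point */
    atomic_flag_clear_explicit (&lock, memory_order_release);
}

int increment (void)
{
    int value;

    spin_lock ();
    value = ++counter;          /* stays between acquire and release */
    spin_unlock ();
    return value;
}
```

The accesses to counter play the role of access3 and access4: they can be reordered with each other, but they cannot leave the region delimited by the acquire and the release.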

Release consistency

    P1:  W(X)1 [1a]       W_rel(Y)1 [2a]
    P2:  R_acq(Y)1 [1b]   R(X)1 [2b]

  • 1a < 2a: cannot be reordered, since 2a is a release access
  • 1b < 2b: cannot be reordered, since 1b is an acquire access
  • Y=1: 2a < 1b

Hence: in 2b, X must be 1
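The same history can be written with C11 atomics, assuming that memory_order_release and memory_order_acquire are an acceptable stand-in for W_rel and R_acq. The function names producer and consumer_poll are hypothetical, and the consumer polls once instead of spinning so that the sketch can also be exercised from a single thread.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int X;                 /* ordinary data            */
static atomic_int Y;          /* synchronization variable */

/* P1:  1a: W(X)1   2a: W_rel(Y)1 */
void producer (void)
{
    X = 1;
    atomic_store_explicit (&Y, 1, memory_order_release);
}

/* P2:  1b: R_acq(Y)   2b: R(X)
 * Returns true and stores X in *out if Y was seen as 1. */
bool consumer_poll (int *out)
{
    if (atomic_load_explicit (&Y, memory_order_acquire) != 1)
        return false;
    *out = X;                 /* must observe the value written in 1a */
    return true;
}
```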

Hybrid consistency models

Entry Consistency (EC)

  • similar to RC
  • differences:
      • each shared variable is associated to a synchronizing variable
          • the association can change dynamically under program control
      • a synchronizing variable is a lock or a barrier
      • acquire accesses can be exclusive or non-exclusive

Synchronizing accesses

  • Full fences: Weak consistency
  • Release and Acquire: Release consistency

A synchronizing access combines:

  • 1. Reordering constraint
  • 2. Memory access

Memory barriers

Synchronizing accesses without access

  • mechanism to control the out-of-order execution

Instructions that prevent memory access reordering

  • read barriers: prevent reordering of reads
      • e.g., wait until the invalidate queue is empty
  • write barriers: prevent reordering of writes
      • e.g., wait until the store buffer is empty
  • full barriers: act on all accesses

Memory barriers

    P1:  W(X)1 [1a]   barrier1 [2a]   W(Y)1 [3a]
    P2:  R(Y)1 [1b]   R(X)? [2b]

For P2:

  • 1b < 2b: program order
  • 1a and 3a are executed in order, but 1b and 2b can be executed out of order
      • for P2 this is the same as: “1a and 3a can be executed out of order”
  • Y=1: 3a < 1b

Hence: in 2b, X can be either 0 or 1

Memory barriers

    P1:  W(X)1 [1a]   barrier1 [2a]   W(Y)1 [3a]
    P2:  R(Y)1 [1b]   barrier2 [2b]   R(X)1 [3b]

For P2:

  • 1b < 2b < 3b: program order
  • 1a < 2a < 3a: barrier1 (and barrier2)
  • Y=1: 3a < 1b

Hence:

  • 1a < 2a < 3a < 1b < 2b < 3b
  • in 3b, X must be 1
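The two-barrier history can be approximated in portable C with relaxed atomic accesses separated by fences. This is a sketch that assumes C11 atomic_thread_fence with memory_order_seq_cst as a full barrier; the names writer and reader_poll are hypothetical, and the reader polls once so that the example can run from a single thread.

```c
#include <stdatomic.h>

static atomic_int X, Y;

/* P1:  1a: W(X)1   2a: barrier1   3a: W(Y)1 */
void writer (void)
{
    atomic_store_explicit (&X, 1, memory_order_relaxed);
    atomic_thread_fence (memory_order_seq_cst);     /* barrier1 */
    atomic_store_explicit (&Y, 1, memory_order_relaxed);
}

/* P2:  1b: R(Y)   2b: barrier2   3b: R(X)
 * Returns the value of X read after Y was seen as 1, or -1 if Y is
 * still 0. */
int reader_poll (void)
{
    if (atomic_load_explicit (&Y, memory_order_relaxed) != 1)
        return -1;
    atomic_thread_fence (memory_order_seq_cst);     /* barrier2 */
    return atomic_load_explicit (&X, memory_order_relaxed);
}
```

Removing the fence in reader_poll gives the situation of the one-barrier example: the two loads may be performed out of order, and X can be read as 0 even after Y was seen as 1.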

CPU's memory consistency models

  • Processors implement out-of-order execution
      • Store buffer, cache coherency, ...
  • CPU specifications provide rules about possible reordering
      • Different memory areas can have different rules
  • ISAs provide instructions to control reordering
      • Barriers

CPU barriers

Full barrier

  • WC synchronizing accesses without access
  • Barriers are SC ordered
  • Can distinguish memory accesses direction

Write memory barrier

  • Only impacts write accesses (of the current processor)
  • e.g., wait until the store buffer is empty

Read memory barrier

  • Only affects read accesses (of the current processor)
  • e.g., wait until the invalidate queue is empty

Full memory barrier

  • Acts on reordering of all accesses

CPU barriers

Semi-barrier

  • RC synchronizing accesses
  • Barriers are PC ordered
  • Do distinguish memory accesses direction
      • load_acquire
      • store_release

Note: a release-acquire pair can be seen in a different order by another processor

[Figure: sequence executed by P1 and order seen by P2, involving load_acq(X), store_rel(X), access1 and access2]

Memory consistency model – Alpha

  • There is a partial order: BEFORE (or <=)
      • global relation (memory order)
  • Processors can perform accesses out-of-order
      • accesses: Instruction-fetch (IF), Read (R), Write (W)
      • when addresses overlap:
          • IF-IF: maintain order
          • IF-W: maintain order
          • R-R: maintain order
          • R-W: maintain order
          • W-W: maintain order
  • I-cache and pipeline are not coherent
  • Three kinds of barriers:
      • MB: force no-reordering between reads and writes
      • WMB: force no-reordering between writes
      • IMB: force no-reordering for reads, writes and I-fetches

Memory consistency model – Alpha

Order is not enforced for data dependency

initially: global_ptr = NULL

P1:

    1a: ptr = malloc(...);
    2a: ptr->key = val;
    3a: ptr->data = data;
    4a: wmb
    5a: global_ptr = ptr;

P2:

    1b: while (global_ptr==NULL) continue;
    2b: mb
    3b: myval = global_ptr->key;
    4b: mydata = global_ptr->data;

  • there is a data dependency between 1b and 3b, but addresses do not overlap
  • a barrier is required for P2
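In portable C the same publication pattern can be sketched with C11 atomics, assuming that a release store can stand in for wmb followed by the pointer assignment and that an acquire load can stand in for the read followed by mb. The struct layout and the names publish_node and read_node are hypothetical; read_node polls once instead of spinning.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct node { int key; int data; };

static _Atomic (struct node *) global_ptr;   /* initially NULL */

/* P1, steps 1a-5a: fill the node, then publish the pointer.
 * Returns 0 on success, -1 if the allocation fails. */
int publish_node (int key, int data)
{
    struct node *ptr = malloc (sizeof *ptr);

    if (ptr == NULL)
        return -1;
    ptr->key = key;
    ptr->data = data;
    /* release: the two stores above cannot be moved after this one */
    atomic_store_explicit (&global_ptr, ptr, memory_order_release);
    return 0;
}

/* P2, steps 1b-4b: read the pointer, then the fields.
 * Returns 0 on success, -1 if nothing has been published yet. */
int read_node (int *key, int *data)
{
    /* acquire: the two loads below cannot be moved before this one */
    struct node *ptr = atomic_load_explicit (&global_ptr,
                                             memory_order_acquire);
    if (ptr == NULL)
        return -1;
    *key = ptr->key;
    *data = ptr->data;
    return 0;
}
```

The sketch never frees a published node; a real program would need a reclamation policy, which is outside the scope of the example.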

Memory consistency model – ARMv7

  • No global memory order
  • Accesses to a single address are seen in the same order by all processors (Cache coherency)
  • Instruction fetches, data reads, data writes can be performed out-of-order
  • Data dependent loads are not reordered
  • I-cache and pipeline are not coherent

Memory consistency model – ARMv7

Note:

  • DMB, DSB, and ISB instructions are added in ARMv7
  • Previous versions use CP15 to implement barrier operations
  • In ARMv6, barrier operations are always defined
  • In ARMv4 and ARMv5, barrier operations may not exist

Three kinds of barriers:

  • DMB: Data Memory Barrier
      • All specified memory accesses before the barrier must be completed before any (specified) memory access after the barrier is started
  • DSB: Data Synchronization Barrier
      • All specified memory accesses before the barrier must be completed before any instruction after the barrier is started
  • ISB: Instruction Synchronization Barrier
      • Flushes the pipeline
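From C, a DMB is usually reached through inline assembly or a compiler builtin. The helper below is a hypothetical sketch for GCC or Clang: on ARMv7 and later it emits dmb ish, and on any other target it falls back to the __sync_synchronize builtin so that the example still compiles. The names data_memory_barrier, send_value, payload and flag are assumptions.

```c
/* Hypothetical helper: full data memory barrier.
 * On ARMv7 and later it emits DMB for the inner shareable domain;
 * elsewhere it falls back to the GCC/Clang full-barrier builtin. */
static inline void data_memory_barrier (void)
{
#if defined(__ARM_ARCH) && __ARM_ARCH >= 7
    __asm__ __volatile__ ("dmb ish" : : : "memory");
#else
    __sync_synchronize ();
#endif
}

static volatile int payload;
static volatile int flag;

/* Writes the payload, then raises the flag; the barrier keeps the two
 * stores in this order for the other processors. */
void send_value (int value)
{
    payload = value;
    data_memory_barrier ();
    flag = 1;
}

int flag_is_set (void)  { return flag; }
int read_payload (void) { return payload; }
```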

Normal

3 levels of shareability

Non-shareable

for Normal memory that is used by only a single processor

Inner Shareable

for Normal memory that is shared between several processors

Outer Shareable

for Normal memory that is shared between processors and devices Cacheability

Non-cacheable Write-Through Cacheable Write-Back Write-Allocate Cacheable Write-Back no Write-Allocate Cacheable

Memory consistency model – ARMv7

Memory types


Memory consistency model – ARMv7

Memory types

Device

Accesses are strongly ordered: all memory accesses occur in program order

Shareability

Shareable: for memory-mapped peripherals that are shared by several processors
Non-shareable: for memory-mapped peripherals that are used only by a single processor

Cacheability: Non-cacheable

A write to Device memory is permitted to complete before it reaches the target

Strongly-ordered

Accesses are strongly ordered

Shareability: all Strongly-ordered regions are assumed to be Shareable

Cacheability: Non-cacheable

A write to Strongly-ordered memory can complete only when it reaches the target


Memory consistency model – ARMv7

DMB (or DSB) sy: barrier for all memory accesses that refer to domain “Outer Shareable” (full system barrier)

DMB (or DSB) st: barrier for writes that refer to domain “Outer Shareable”

DMB (or DSB) sh: barrier for all memory accesses that refer to domain “Inner Shareable”

DMB (or DSB) shst: barrier for writes that refer to domain “Inner Shareable”

DMB (or DSB) un: barrier for all memory accesses that refer to domain “Non-Shareable”

DMB (or DSB) unst: barrier for writes that refer to domain “Non-Shareable”
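A minimal sketch of how the DMB options above might be wrapped in C. The macro names `dmb_ish`, `dmb_ishst` and `dsb_sy`, and the `mailbox` type, are hypothetical. The inner-shareable options are spelled `ish` and `ishst`, the form current assemblers expect (`sh` and `shst` are older aliases for the same options). On targets other than ARM the macros fall back to C11 fences, which are at least as strong, so the sketch still builds and runs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#if (defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)) || defined(__aarch64__)
/* ARM: emit the instruction; the "memory" clobber also stops compiler reordering. */
#define dmb_ish()   __asm__ __volatile__("dmb ish"   ::: "memory")
#define dmb_ishst() __asm__ __volatile__("dmb ishst" ::: "memory")
#define dsb_sy()    __asm__ __volatile__("dsb sy"    ::: "memory")
#else
/* Assumption: on other targets a C11 fence is an acceptable (stronger) stand-in. */
#define dmb_ish()   atomic_thread_fence(memory_order_seq_cst)
#define dmb_ishst() atomic_thread_fence(memory_order_release)
#define dsb_sy()    atomic_thread_fence(memory_order_seq_cst)
#endif

/* Hypothetical one-slot mailbox shared between two processors. */
struct mailbox {
    uint32_t payload;
    _Atomic uint32_t ready;
};

/* Writer: the payload write must be visible before the flag write. */
void mailbox_post(struct mailbox *m, uint32_t value)
{
    assert(m != NULL);
    m->payload = value;
    dmb_ishst();    /* write-write ordering, inner shareable domain */
    atomic_store_explicit(&m->ready, 1, memory_order_relaxed);
}

/* Reader: returns 1 and stores the payload in *out if a message is ready, else 0. */
int mailbox_poll(struct mailbox *m, uint32_t *out)
{
    assert(m != NULL && out != NULL);
    if (atomic_load_explicit(&m->ready, memory_order_relaxed) == 0)
        return 0;
    dmb_ish();      /* flag read ordered before the payload read */
    *out = m->payload;
    return 1;
}
```

A write-only barrier is enough on the writer side because only two stores need ordering; the reader orders two loads, so it needs the option that covers all accesses.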


Memory consistency model – MIPS32

Three kinds of barriers:

Completion barrier

  • all specified memory accesses before the barrier must be completed (globally performed) before the barrier
  • memory accesses after the barrier are started after the barrier
  • SYNC (or SYNC 0): acts on R and W (required in all implementations)

Ordering barrier

  • all specified memory accesses before the barrier must be completed before the barrier
  • SYNC_WMB (or SYNC 4): acts on W
  • SYNC_MB (or SYNC 16): acts on R and W
  • SYNC_ACQUIRE (or SYNC 17): acts on R (before) and R and W (after)
  • SYNC_RELEASE (or SYNC 18): acts on R and W (before) and W (after)
  • SYNC_RMB (or SYNC 19): acts on R

Instruction cache barrier

  • Synchronize Caches to Make Instruction Writes Effective
  • SYNCI: an I-cache line is updated to be used after a code change
  • optional
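The SYNC_ACQUIRE and SYNC_RELEASE orderings listed above have the same shape as the acquire and release orderings of C11 atomics. The sketch below shows a spinlock written with those orderings; `struct spinlock` and its functions are hypothetical names, and which SYNC variant a given compiler emits for them on MIPS32 is an assumption not covered by the slides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical spinlock type built on a C11 atomic flag. */
struct spinlock {
    atomic_flag held;
};

#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Acquire: reads and writes after the lock cannot move before it. */
void spin_lock(struct spinlock *l)
{
    assert(l != NULL);
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        continue;
}

/* Single attempt: returns 1 if the lock was taken, 0 if it was already held. */
int spin_trylock(struct spinlock *l)
{
    assert(l != NULL);
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Release: reads and writes before the unlock cannot move after it. */
void spin_unlock(struct spinlock *l)
{
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Accesses inside the critical section may still be reordered with each other; only movement across the lock and unlock points is constrained, which is why these one-directional barriers can be cheaper than a full SYNC_MB.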


Memory consistency model – IA-32

Memory areas can be:

UC: uncacheable

strong ordering is enforced

useful for memory-mapped devices

WC: write-combining

cached in special buffers, coherence not enforced

useful for framebuffers (write order is not relevant)

WB: cacheable, with write-back policy

coherence enforced

WT: cacheable, with write-through policy

coherence enforced

useful for devices that access memory (DMA-capable devices) without implementing cache coherency protocols


Memory consistency model – IA-32

For WB and WT:

there is a global memory ordering

order is maintained for: R-R, R-W, W-W

order is not maintained for: W-R

a read may be performed before an earlier write to a different address; a read of the same address obtains its data from the forwarding path

some streaming store instructions allow W-W reordering: MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD

string operations allow W-W reordering
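The W-R case is the one that breaks flag-based handshakes: each processor writes its own flag and then reads the other's. A minimal sketch in C11, with the hypothetical names `want`, `try_enter` and `leave`; the full fence between the write and the read is what rules out the outcome where both processors read 0.

```c
#include <assert.h>
#include <stdatomic.h>

/* One "I want to enter" flag per processor, both initially 0. */
static atomic_int want[2];

/* Returns 1 if processor `me` (0 or 1) may enter, 0 if the other one is already in. */
int try_enter(int me)
{
    assert(me == 0 || me == 1);
    atomic_store_explicit(&want[me], 1, memory_order_relaxed);        /* W */
    /* Without this fence the read below may complete while the write
       above is still waiting in the store buffer, so both processors
       could read 0 and both enter. */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&want[1 - me], memory_order_acquire) == 0)  /* R */
        return 1;
    /* The other processor announced first: withdraw and report failure. */
    atomic_store_explicit(&want[me], 0, memory_order_relaxed);
    return 0;
}

/* Leave the critical section; earlier accesses are ordered before the flag is cleared. */
void leave(int me)
{
    assert(me == 0 || me == 1);
    atomic_store_explicit(&want[me], 0, memory_order_release);
}
```

If both processors call `try_enter` at the same time, both may fail and have to retry, but they can never both succeed.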


Memory consistency model – IA-32

For WB memory areas:

Individual processors use the same ordering principles as in a single-processor system.

Writes by a single processor are observed in the same order by all processors.

Writes from an individual processor are NOT ordered with respect to the writes from other processors.

Memory ordering obeys causality (memory ordering respects transitive visibility).

Any two stores are seen in a consistent order by processors other than those performing the stores.

Locked instructions have a total order.


Memory consistency model – IA-32

Three kinds of barriers:

MFENCE

Serializes load and store operations

guarantees that all loads and stores specified before the fence are globally observable prior to any loads or stores being carried out after the fence.

LFENCE

Serializes load operations

guarantees ordering between two loads and prevents speculative loads from passing the load fence

SFENCE

Serializes store operations

guarantees that every store instruction that precedes the SFENCE in program order becomes globally visible before any store instruction that follows the SFENCE
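SFENCE matters mostly together with the streaming stores mentioned earlier, since ordinary stores to WB memory are already kept in order. The sketch below fills a buffer with MOVNTI through the `_mm_stream_si32` intrinsic and issues `_mm_sfence` before setting a ready flag. `fill_and_publish` and `HAVE_X86_STREAMING` are hypothetical names, and on non-x86 targets the code falls back to ordinary stores so that it still builds.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__SSE2__)
#include <emmintrin.h>          /* _mm_stream_si32, _mm_sfence */
#define HAVE_X86_STREAMING 1
#else
#define HAVE_X86_STREAMING 0    /* assumption: plain stores are an acceptable fallback */
#endif

/* Fill buf[0..n) with value, then set *ready to 1 once the data is visible. */
void fill_and_publish(int *buf, size_t n, int value, atomic_int *ready)
{
    assert(buf != NULL && ready != NULL);
    for (size_t i = 0; i < n; i++) {
#if HAVE_X86_STREAMING
        _mm_stream_si32(&buf[i], value);   /* MOVNTI: weakly ordered store */
#else
        buf[i] = value;                    /* ordinary store on other targets */
#endif
    }
#if HAVE_X86_STREAMING
    /* SFENCE: the streaming stores above become globally visible
       before any store that follows. */
    _mm_sfence();
#endif
    atomic_store_explicit(ready, 1, memory_order_release);
}
```

The header `<emmintrin.h>` also provides `_mm_lfence` and `_mm_mfence` for the other two fences.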


Memory consistency models and OS

OS must provide primitives to enforce access ordering

processor vs processor accesses

not required on uni-processor systems

processor vs device accesses

required even on uni-processor systems

Multi-architecture issue

Portable code must assume the weakest model among all supported architectures

Linux

weakest model: ALPHA consistency model

does not guarantee ordering between data dependent accesses


Linux memory barriers

Compiler barrier

prevents compiler reordering of accesses

processor can still perform out-of-order accesses

barrier(): compiler directive, no instruction

Processor vs processor barriers

smp_mb(): full memory barrier
smp_rmb(): memory barrier for reads
smp_wmb(): memory barrier for writes
smp_read_barrier_depends(): memory barrier for data dependencies

Processor vs anything barriers

mb(): full memory barrier
rmb(): memory barrier for reads
wmb(): memory barrier for writes
read_barrier_depends(): memory barrier for data dependencies
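The kernel primitives are not available in user space, so the sketch below uses hypothetical stand-ins (`u_barrier`, `u_smp_mb`, `u_smp_rmb`, `u_smp_wmb`) built from C11 fences to show where each kind of barrier goes in a single-producer, single-consumer ring. The C11 fences are at least as strong as the barriers they imitate; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for the Linux primitives. */
#define u_barrier()  atomic_signal_fence(memory_order_seq_cst)  /* compiler only, no instruction */
#define u_smp_mb()   atomic_thread_fence(memory_order_seq_cst)  /* full barrier */
#define u_smp_rmb()  atomic_thread_fence(memory_order_acquire)  /* covers read-read */
#define u_smp_wmb()  atomic_thread_fence(memory_order_release)  /* covers write-write */

#define RING_SIZE 4u    /* must be a power of two */

struct ring {
    int slot[RING_SIZE];
    atomic_uint head;   /* advanced only by the producer */
    atomic_uint tail;   /* advanced only by the consumer */
};

/* Producer: returns 1 if v was queued, 0 if the ring is full. */
int ring_put(struct ring *r, int v)
{
    assert(r != NULL);
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    if (head - tail == RING_SIZE)
        return 0;
    u_smp_mb();                         /* tail read before the slot is overwritten */
    r->slot[head % RING_SIZE] = v;
    u_smp_wmb();                        /* slot write before head update */
    atomic_store_explicit(&r->head, head + 1, memory_order_relaxed);
    return 1;
}

/* Consumer: returns 1 and stores the oldest value in *out, or 0 if the ring is empty. */
int ring_get(struct ring *r, int *out)
{
    assert(r != NULL && out != NULL);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    if (head == tail)
        return 0;
    u_smp_rmb();                        /* head read before slot read */
    *out = r->slot[tail % RING_SIZE];
    u_smp_mb();                         /* slot read before tail update */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_relaxed);
    return 1;
}
```

`u_barrier()` is defined for completeness: it would be enough if producer and consumer ran on the same processor, where only compiler reordering has to be prevented.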


Linux memory barriers – examples

uni-processor systems

        smp_mb    smp_rmb   smp_wmb   smp_read_barrier_depends  mb      rmb     wmb     read_barrier_depends
Alpha   barrier   barrier   barrier   nothing                   mb      mb      wmb     mb
ARMv7   barrier   barrier   barrier   nothing                   dsb     dsb     dsb st  nothing
MIPS32  barrier   barrier   barrier   nothing                   sync    sync    sync    nothing
IA-32   barrier   barrier   barrier   nothing                   mfence  lfence  sfence  nothing

multi-processor systems

        smp_mb    smp_rmb   smp_wmb   smp_read_barrier_depends  mb      rmb     wmb     read_barrier_depends
Alpha   mb        mb        wmb       mb                        mb      mb      wmb     mb
ARMv7   dmb sh    dmb sh    dmb shst  nothing                   dsb     dsb     dsb st  nothing
MIPS32  sync      sync      sync      nothing                   sync    sync    sync    nothing
IA-32   mfence    barrier   barrier   nothing                   mfence  lfence  sfence  nothing


Linux memory barriers

  • Other (smp only) barriers
  • smp_mb__before_atomic()
  • smp_mb__after_atomic()
  • smp_mb__after_unlock_lock()
      only in Linux 3.14 – 4.2
      true barrier only on powerpc
  • smp_mb__before_spinlock()
      since Linux 3.11
      write barrier since 4.2 (full barrier on powerpc)
      since 4.8, full barrier on arm64


Further readings

D. Howells, P. McKenney, W. Deacon, P. Zijlstra, “Linux Kernel Memory Barriers,” Linux Documentation (memory-barriers.txt)

David Mosberger, “Memory Consistency Models,” ACM SIGOPS Operating Systems Review, 1993, vol. 27, no. 1, pp. 18-26.

Paul McKenney, “Memory Barriers: a Hardware View for Software Hackers,” 2010.

Paul McKenney, “Is Parallel Programming Hard, And, If So, What Can You Do About It?”