SLIDE 1

Synchronising C/C++ and POWER

Susmit Sarkar¹, Kayvan Memarian¹, Scott Owens¹, Mark Batty¹, Peter Sewell¹, Luc Maranget², Jade Alglave³,⁴, Derek Williams⁵

¹ University of Cambridge, ² INRIA, ³ Oxford University, ⁴ Queen Mary London, ⁵ IBM Austin

June 2012

SLIDE 2

Relaxed Memory Concurrency

Concurrency on modern hardware/compilers: Relaxed Memory, not Sequential Consistency (SC)
Semantics of concurrent programming languages
ISO C/C++: introduces a new concurrency model
Hardware: very different concurrency models

◮ Different between x86, Power, ARM
◮ Different from C/C++

Susmit Sarkar (Cambridge) Synchronising C/C++ and POWER June 2012 2 / 23

SLIDE 4

Correct implementations of C/C++ on hardware

Can it be done?

◮ ... on highly relaxed hardware? e.g. Power

What is involved?

◮ Mapping new constructs to assembly
◮ Optimizations: which ones are legal?

SLIDE 9

Implementing C/C++11 on POWER: Pointwise Mapping

C/C++11 Operation      POWER Implementation
Store (non-atomic)     st
Load (non-atomic)      ld
Store relaxed          st
Store release          lwsync; st
Store seq-cst          lwsync; st
Load relaxed           ld
Load consume           ld (and preserve dependency)
Load acquire           ld; cmp; bc; isync
Load seq-cst           hwsync; ld; cmp; bc; isync
Fence acquire          lwsync
Fence release          lwsync
Fence seq-cst          hwsync
CAS relaxed            loop: lwarx; cmp; bc exit; stwcx.; bc loop; exit:
CAS seq-cst            hwsync; loop: lwarx; cmp; bc exit; stwcx.; bc loop; isync; exit:
. . .                  ...

Is that mapping correct?

(From Paul McKenney and Raul Silvera)

SLIDE 10

Implementing C/C++11 on POWER: Pointwise Mapping

C/C++11 Operation      POWER Implementation
Store (non-atomic)     st
Load (non-atomic)      ld
Store relaxed          st
Store release          lwsync; st
Store seq-cst          hwsync; st   (was: lwsync; st)
Load relaxed           ld
Load consume           ld (and preserve dependency)
Load acquire           ld; cmp; bc; isync
Load seq-cst           hwsync; ld; cmp; bc; isync
Fence acquire          lwsync
Fence release          lwsync
Fence seq-cst          hwsync
CAS relaxed            loop: lwarx; cmp; bc exit; stwcx.; bc loop; exit:
CAS seq-cst            hwsync; loop: lwarx; cmp; bc exit; stwcx.; bc loop; isync; exit:
. . .                  ...

Is that mapping correct? Answer: No! The seq-cst store needs hwsync, not lwsync.

(From Paul McKenney and Raul Silvera)

SLIDE 11

Implementing C/C++11 on POWER: Pointwise Mapping

C/C++11 Operation      POWER Implementation
Store (non-atomic)     st
Load (non-atomic)      ld
Store relaxed          st
Store release          lwsync; st
Store seq-cst          hwsync; st
Load relaxed           ld
Load consume           ld (and preserve dependency)
Load acquire           ld; cmp; bc; isync
Load seq-cst           hwsync; ld; cmp; bc; isync
Fence acquire          lwsync
Fence release          lwsync
Fence seq-cst          hwsync
CAS relaxed            loop: lwarx; cmp; bc exit; stwcx.; bc loop; exit:
CAS seq-cst            hwsync; loop: lwarx; cmp; bc exit; stwcx.; bc loop; isync; exit:
. . .                  ...

Is that mapping correct? Answer: Yes!

(From Paul McKenney and Raul Silvera)

SLIDE 12

Implementing C/C++11 on POWER: Pointwise Mapping

Is that the only correct mapping? Answer: No!

(From Paul McKenney and Raul Silvera)

SLIDE 13

Implementing C/C++11 on POWER: Pointwise Mapping

C/C++11 Operation      POWER Implementation                  Alternative
Store (non-atomic)     st
Load (non-atomic)      ld
Store relaxed          st
Store release          lwsync; st
Store seq-cst          hwsync; st                            hwsync; st; hwsync;
Load relaxed           ld
Load consume           ld (and preserve dependency)
Load acquire           ld; cmp; bc; isync
Load seq-cst           hwsync; ld; cmp; bc; isync            ld; hwsync
Fence acquire          lwsync
Fence release          lwsync
Fence seq-cst          hwsync
CAS relaxed            loop: lwarx; cmp; bc exit; stwcx.; bc loop; exit:
CAS seq-cst            hwsync; loop: lwarx; cmp; bc exit; stwcx.; bc loop; isync; exit:
. . .                  ...

All compilers must agree for separate compilation
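
For readers who want the left-hand column of the table in source form, here is a minimal C++11 sketch. The wrapper names and the shared variable x are assumptions made for illustration; the trailing comments repeat the POWER sequences from the corrected table (hwsync; st for seq-cst stores), and whether a given compiler emits exactly those sequences is not something this code checks.

```cpp
#include <atomic>

// Hypothetical shared location; the name is illustrative only.
std::atomic<int> x{0};

// Stores (comment = POWER sequence listed in the table)
void store_relaxed(int v) { x.store(v, std::memory_order_relaxed); }  // st
void store_release(int v) { x.store(v, std::memory_order_release); }  // lwsync; st
void store_seq_cst(int v) { x.store(v, std::memory_order_seq_cst); }  // hwsync; st

// Loads
int load_relaxed() { return x.load(std::memory_order_relaxed); }      // ld
int load_acquire() { return x.load(std::memory_order_acquire); }      // ld; cmp; bc; isync
int load_seq_cst() { return x.load(std::memory_order_seq_cst); }      // hwsync; ld; cmp; bc; isync

// Fences
void fence_acquire() { std::atomic_thread_fence(std::memory_order_acquire); }  // lwsync
void fence_release() { std::atomic_thread_fence(std::memory_order_release); }  // lwsync
void fence_seq_cst() { std::atomic_thread_fence(std::memory_order_seq_cst); }  // hwsync

// Compare-and-swap: returns true if x held `expected` and was set to `desired`.
bool cas_relaxed(int expected, int desired) {
    // loop: lwarx; cmp; bc exit; stwcx.; bc loop; exit:
    return x.compare_exchange_strong(expected, desired, std::memory_order_relaxed);
}
bool cas_seq_cst(int expected, int desired) {
    // hwsync; loop: lwarx; cmp; bc exit; stwcx.; bc loop; isync; exit:
    return x.compare_exchange_strong(expected, desired, std::memory_order_seq_cst);
}
```

Because the mapping is pointwise, each wrapper corresponds to one row of the table; code compiled separately must use the same row choices for the guarantees to compose.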

SLIDE 14

Implementing C/C++11 on POWER correctly

Theorem: For any sane, non-optimising compiler following the mapping:

[Diagram: compilation takes the C/C++ prog to the POWER prog; the C/C++11 semantics gives the C/C++11 execution and observations of the C/C++ prog; the POWER semantics gives the POWER execution and observations of the POWER prog.]

Showed previous mapping incorrect
Easily adapt proof for an alternative mapping

SLIDE 15

Benefits of a formal proof

Reasoning about industrial-strength concurrency

Enables:
Confidence in C/C++ and Power concurrency models
Confidence in compiler implementations [gcc]
Reasoning about C/C++ and Power
(Path to) Reasoning about ARM ??

SLIDE 16

Context of This Paper

Before [POPL’12]: just loads and stores
Power concurrency model (of loads and stores) [PLDI’11]
C++11 concurrency model [POPL’11]
Proof:

◮ some concepts correspond (e.g. coherence → modification order)
◮ others depend on key properties of abstract machine

This paper: also with synchronisation constructs
Power: load-reserve and store-conditional
C++11: locks, read-modify-writes, fences
Proof:

◮ extends smoothly (new cases to be checked)
◮ points out interesting features of the models

SLIDE 17

Outline

1 Introduction
2 Relaxed Memory Behaviour (examples)
3 Reasoning about Synchronising Operations
4 Proof Outline; and What We Learned

SLIDE 19

Example: Message Passing (racy)

Initially: d = 0; f = 0;

Thread 0:
    d = 1;
    f = 1;

Thread 1:
    while (f == 0) {};
    r = d;

Finally: r = 0 ?? Forbidden on SC

In C/C++11, this has undefined semantics
Data race on d and f variables

SLIDE 21

Example (contd.): mark atomics

Mark atomic variables (accesses have memory order parameter)

Initially: d = 0; f = 0;

Thread 0:
    d.store(1,rlx);
    f.store(1,rlx);

Thread 1:
    while (f.load(rlx) == 0) {};
    r = d.load(rlx);

Finally: r = 0 ?? (Forbidden on SC)

Defined, and possible, in C/C++11
Allows for hardware (and compiler) optimisations

SLIDE 22

Example (contd.): release-acquire synchronization

Mark release stores and acquire loads

Initially: d = 0; f = 0;

Thread 0:
    d.store(1,rlx);
    f.store(1,rel);

Thread 1:
    while (f.load(acq) == 0) {};
    r = d.load(rlx);

Finally: r = 0 ?? (Forbidden on SC)

Forbidden in C/C++11 due to release-acquire synchronization
Implementation must ensure result not observed
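
The release-acquire example can be written out as a complete C++11 program fragment. This is a minimal sketch: the function name message_passing and the use of std::thread with lambdas are assumptions for illustration, while the variable names d, f and r and the memory orders follow the slide.

```cpp
#include <atomic>
#include <thread>

// Runs the message-passing test once and returns the value Thread 1 reads from d.
int message_passing() {
    std::atomic<int> d{0};
    std::atomic<int> f{0};
    int r = -1;

    std::thread t0([&] {
        d.store(1, std::memory_order_relaxed);   // data
        f.store(1, std::memory_order_release);   // flag: release store
    });
    std::thread t1([&] {
        while (f.load(std::memory_order_acquire) == 0) {}  // acquire load of flag
        r = d.load(std::memory_order_relaxed);
    });

    t0.join();
    t1.join();
    return r;
}
```

Once the acquire load reads the value written by the release store, the store to d happens-before the load of d, so the function can only return 1. Changing both flag accesses to memory_order_relaxed gives the previous example, where 0 is also an allowed result.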

SLIDE 23

Implementation of acquire/release on POWER

Initially: d = 0; f = 0;

Thread 0:
    st d 1;
    lwsync;
    st f 1;

Thread 1:
    loop: ld f rtmp;
          cmp rtmp 0;
          beq loop;
          isync;
    ld d r;

Finally: r = 0 ?? Forbidden (and not observed) on POWER7, and ARM

lwsync prevents write reordering
control dependency with isync prevents read speculation

SLIDE 26

What about Synchronising (Atomic) Operations?

Synchronization operations, e.g. “atomic add”, “CAS”, ...
RISC-friendly alternative: Load-reserve/Store-conditional
Can be used to implement CAS, spinlocks, ...
Universal (like CAS) [Herlihy’93], but no ABA problem

Atomic Addition

    loop: lwarx r, d;
          add r,v,r;
          stwcx r, d;
          bne loop;

Informally, stwcx succeeds only if no other write to the same address since last lwarx
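
At the C++11 level the same retry-loop shape can be sketched with compare_exchange_weak, which is allowed to fail spuriously and so fits a load-reserve/store-conditional implementation. The function names atomic_add and run_adders are assumptions for illustration; in practice std::atomic<int>::fetch_add does this job directly.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Atomically adds v to d and returns the new value, retrying until the update succeeds.
int atomic_add(std::atomic<int>& d, int v) {
    int r = d.load(std::memory_order_relaxed);           // plays the role of lwarx
    while (!d.compare_exchange_weak(r, r + v,             // plays the role of stwcx
                                    std::memory_order_relaxed)) {
        // On failure, r has been refreshed with the current value of d; retry.
    }
    return r + v;
}

// Starts `threads` threads, each adding 1 to a shared counter `iterations` times.
int run_adders(int threads, int iterations) {
    std::atomic<int> d{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&d, iterations] {
            for (int i = 0; i < iterations; ++i) atomic_add(d, 1);
        });
    }
    for (auto& w : workers) w.join();
    return d.load();
}
```

No increment is lost, so run_adders(4, 10000) returns 40000. Unlike the lwarx/stwcx loop, a CAS loop compares values, so it cannot tell whether the location was changed and then changed back in between; that is the ABA problem the slide says load-reserve/store-conditional avoids.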

SLIDE 28

What is no write since ... ?

In machine time?

◮ Neither necessary, nor sufficient

Microarchitecturally (simplified): if cache-line ownership not lost since last lwarx

SLIDE 30

Modeling “not lost since”

Abstractly: ownership chain modeled by building up coherence order
Coherence: order relating stores to the same location (eventually linear)
A stwcx succeeds only if it is (becomes) coherence-next-to the write read from by lwarx
... and no other write can later come in between
Isolate key concept: write reaching coherence point:

◮ coherence is linear below this write, and no new edges will be added below
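
The success condition can be illustrated with a deliberately simplified toy model. Everything here is an assumption made for illustration: a single location, a coherence order that is already linear (a vector of write ids), and a load-reserve that always reads the coherence-latest write. The model in the paper is more general, since coherence there is built up incrementally.

```cpp
#include <vector>

// Toy model of one memory location (illustrative only, not the paper's model).
struct Location {
    // Write ids in coherence order; the last element is the coherence-latest write.
    std::vector<int> coherence{0};  // id 0 stands for the initial write

    // Load-reserve: returns the id of the write it reads from.
    int load_reserve() const { return coherence.back(); }

    // Ordinary store: appended to the coherence order.
    void store(int id) { coherence.push_back(id); }

    // Store-conditional: succeeds only if the new write becomes
    // coherence-next-to the write that the load-reserve read from.
    bool store_conditional(int read_from, int id) {
        if (coherence.back() != read_from) return false;  // another write came in between
        coherence.push_back(id);
        return true;
    }
};
```

In this sketch an intervening store makes the later store_conditional fail, while an uninterrupted load_reserve/store_conditional pair succeeds and extends the coherence order by one write.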

SLIDE 31

Load-reserve/store-conditional and ordering

Same-thread load-reserve/store-conditionals ordered by program order
If all memory accesses are atomic sequences
Then: only SC behaviour
But ... normal loads/stores (to different addresses) not ordered
Confusion here led to Linux bug
... bad barrier placement in atomic-add-return
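
Since the read-modify-write by itself does not order ordinary accesses to other addresses, a primitive that callers expect to act as a full barrier has to add that ordering explicitly. A minimal C++11 sketch follows, with the hypothetical name add_return_fully_ordered. It is not the Linux kernel code, and the slide does not say which barriers that code uses.

```cpp
#include <atomic>

// Adds v to counter and returns the new value. The two seq_cst fences are
// there so that accesses before the call and accesses after it are ordered
// around the update; the relaxed fetch_add alone would not provide that.
int add_return_fully_ordered(std::atomic<int>& counter, int v) {
    std::atomic_thread_fence(std::memory_order_seq_cst);        // barrier before the update
    int old = counter.fetch_add(v, std::memory_order_relaxed);  // the atomic add itself
    std::atomic_thread_fence(std::memory_order_seq_cst);        // barrier after the update
    return old + v;
}
```

Dropping or misplacing either fence leaves the arithmetic correct but weakens the ordering guarantee, which is the kind of mistake single-threaded testing does not reveal.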

SLIDE 34

Proof outline

Theorem: For any sane, non-optimising compiler following the mapping:

[Diagram: compilation takes the DRF C/C++ prog to the POWER prog; the C/C++11 semantics gives the C/C++11 execution and observations of the C/C++ prog; the POWER semantics gives the POWER execution and observations of the POWER prog.]

Preserves memory accesses
Uses the mapping table
Respects the thread local semantics of C/C++, preserving dependencies

SLIDE 35

Proof outline

Theorem: For any sane, non-optimising compiler following the mapping:

[Diagram: compilation takes the DRF C/C++ prog to the POWER prog; the C/C++11 semantics gives the C/C++11 execution and observations of the C/C++ prog; the POWER semantics gives the POWER execution and observations of the POWER prog.]

From POWER trace, build key relations (happens-before, SC order)
Required properties from abs. machine properties
If trace looks like it produces data race, build the C/C++ data race

SLIDE 36

Also in the paper

A formal model of load-reserve/store-conditional (in Lem)
An executable model with exploration tool (ppcmem)
Simplifications to the C/C++11 lock model
Models “tight” against each other: relaxing the Power model would make C/C++11 unimplementable

SLIDE 37

Conclusion

Reasoning about industrial-strength concurrency

Correct compilation of C/C++ concurrency primitives on Power

  • Formal relaxed-memory semantics of load-reserve/store-conditional
  • Allow proof of SC via atomic RMW sequences
  • Technical simplifications to the C/C++ lock model

Confidence in both models
Compiler implementation relevance
Reasoning about machine code at C/C++ level

SLIDE 38

Thank You!

More details at: http://www.cl.cam.ac.uk/~pes20/cppppc

SLIDE 39

Store-conditional speculation?

Power allows stores to forward value to same thread speculatively
Can (and should) stwcx be allowed to be speculated (even before the lwarx)?

Initially: d = 0; f = 0;

Thread 0:
    d = 1;                  # d.store(1,rlx)
    lwsync;                 # f.store(1,rel)
    f = 1;

Thread 1:
    loop: lwarx f, rl;      # CAS (f,1,2)
          cmp rl 1;
          bne exit;
          stwcx f 2;
          bne loop;
    exit:
    ld r1 f;                # r1 = f.load(con)
    xor r2, r1,r1;          # r2 = r1 ⊕ r1
    ld [d + r2] r;          # r = d[r2]

Finally: r = 0 ??

SLIDE 40

Store-conditional speculation?

Can (and should) stwcx be allowed to be speculated (even before the lwarx)?

Initially: d = 0; f = 0;

Thread 0:
    d = 1;                  # d.store(1,rlx)
    lwsync;                 # f.store(1,rel)
    f = 1;

Thread 1:
    loop: lwarx f, rl;      # CAS (f,1,2)
          cmp rl 1;
          bne exit;
          stwcx f 2;
          bne loop;
    exit:
    ld r1 f;                # r1 = f.load(con)
    xor r2, r1,r1;          # r2 = r1 ⊕ r1
    ld [d + r2] r;          # r = d[r2]

Finally: r = 0 ??

C/C++11 mapping would break (and no good way of fixing)
Fortunately, current hardware does not do this
... and now we know why future hardware should not

SLIDE 41

60 second pitch

Hi, I am Susmit Sarkar, and I am going to be speaking about shared-memory concurrency not as we would like it to be, but as it actually is in the real world, on mainstream hardware such as PowerPC or ARM and on software such as the new C and C++ concurrency model. These two models are quite strange, and quite different from each other, so it is a real question whether you can even compile from one to the other. Yes you can, and we prove this. This explains how these very different models really work. Come to Room B, just after lunch.