From C/C++11 to POWER and ARM: What is Shared-Memory Concurrency, Anyway?
Susmit Sarkar, University of St Andrews
MMnet, Heriot Watt, May 2013

Shared Memory Concurrency: Since 1962
Burroughs D825 (first multiprocessing computer)
Outstanding features include truly modular hardware with parallel processing throughout.
FUTURE PLANS: The complement of compiling languages is to be expanded.
Susmit Sarkar (St Andrews) From C/C++11 to POWER and ARM: May 2013 2 / 34

And Since 2011: In C/C++
ISO C/C++11 introduces a new concurrency model

Example: Message Passing (racy)
Initially: d = 0; f = 0;
    Thread 0        Thread 1
    d = 1;          while (f == 0) {};
    f = 1;          r = d;
Finally: r = 0 ??
Programmer would hope this is Forbidden
In C/C++11, this has undefined semantics: there is a data race on the variables d and f

C11: A Data Race Free Model
Idea: writing a data race is a programmer mistake
This is the basis of C11 concurrency

Example (contd.): mark atomics
Mark atomic variables (accesses have a memory-order parameter)
Initially: atomic d = 0; f = 0;
    Thread 0            Thread 1
    d.store(1,sc);      while (f.load(sc) == 0) {};
    f.store(1,sc);      r = d.load(sc);
Finally: r = 0 ??
Races on atomic accesses are ignored (the program now has defined semantics)

Shared Memory Concurrency
Multiple threads with a single shared memory
Question: How do we reason about it?
Answer [1979]: Sequential Consistency
". . . the result of any execution is the same as if the operations of all the processors were executed in some sequential order, respecting the order specified by the program." [Lamport, 1979]

Sequential Consistency
[Figure: Thread 0, Thread 1, Thread 2, Thread 3; (Shared) Memory]
Traditional assumption (concurrent algorithms, semantics, verification): Sequential Consistency (SC)
Implies: can use interleaving semantics
False on modern (since 1972) multiprocessors, or with optimizing compilers
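
To make "interleaving semantics" concrete, here is a minimal sketch that enumerates every SC interleaving of a simplified message-passing test. The simplification is an assumption made for illustration: Thread 1's spin loop is replaced by a single load of f followed by a load of d, so the slide's "Finally: r = 0" corresponds to the outcome (f read as 1, d read as 0). The names Op, State, apply, interleave and sc_outcomes are hypothetical and not taken from the talk.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>

// The four memory events of the simplified message-passing test.
enum class Op { StoreD, StoreF, LoadF, LoadD };

// Shared memory (d, f) plus the two values read by Thread 1 (rf, rd).
struct State { int d = 0, f = 0, rf = 0, rd = 0; };

// What Thread 1 observed: (value read from f, value read from d).
using Outcome = std::pair<int, int>;

// Executes one event against the single shared memory.
State apply(State s, Op op) {
    switch (op) {
        case Op::StoreD: s.d = 1; break;
        case Op::StoreF: s.f = 1; break;
        case Op::LoadF:  s.rf = s.f; break;
        case Op::LoadD:  s.rd = s.d; break;
    }
    return s;
}

// Explores every interleaving that respects each thread's program order.
void interleave(std::size_t i, std::size_t j, State s, std::set<Outcome>& out) {
    static const std::array<Op, 2> t0{{Op::StoreD, Op::StoreF}};  // Thread 0
    static const std::array<Op, 2> t1{{Op::LoadF, Op::LoadD}};    // Thread 1
    if (i == t0.size() && j == t1.size()) {
        out.insert(Outcome{s.rf, s.rd});
        return;
    }
    if (i < t0.size()) interleave(i + 1, j, apply(s, t0[i]), out);
    if (j < t1.size()) interleave(i, j + 1, apply(s, t1[j]), out);
}

// Collects the outcomes of all SC executions.
std::set<Outcome> sc_outcomes() {
    std::set<Outcome> out;
    interleave(0, 0, State{}, out);
    return out;
}

// Checks that no SC execution sees the flag set but the data still 0.
void check_sc_forbids_stale_data() {
    assert(sc_outcomes().count(Outcome{1, 0}) == 0);
}
```

Calling check_sc_forbids_stale_data() passes because none of the six interleavings lets Thread 1 read f as 1 and then d as 0. The enumeration says nothing about real hardware or compilers, which is the point of the last line of the slide.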

Our world is not SC
Not since IBM System 370/158MP (1972) . . .
. . . Nor in x86, ARM, POWER, SPARC, Itanium, . . .
. . . Nor in C, C++, Java, . . .

Example (contd.): mark atomics relaxed
Mark atomic variables as relaxed (a memory-order parameter)
Initially: atomic d = 0; f = 0;
    Thread 0             Thread 1
    d.store(1,rlx);      while (f.load(rlx) == 0) {};
    f.store(1,rlx);      r = d.load(rlx);
Finally: r = 0 ?? (Forbidden on SC)
Defined, and possible, in C/C++11
Allows for hardware (and compiler) optimisations

C11 Concurrency: An Axiomatic Model
Complete executions are considered (threadwise operational, reading arbitrary values)
Relations are defined over memory events (e.g. happens-before)
A predicate says whether an execution is consistent
Further, no consistent execution should have races

Example (contd.): release-acquire synchronization
Mark release stores and acquire loads
Initially: atomic d = 0; f = 0;
    Thread 0             Thread 1
    d.store(1,rlx);      while (f.load(acq) == 0) {};
    f.store(1,rel);      r = d.load(rlx);
Finally: r = 0 ?? (Forbidden on SC)
Forbidden in C/C++11 due to release-acquire synchronization
Implementation must ensure this result is not observed
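
The slide's shorthand (rlx, rel, acq, sc) corresponds to the standard std::memory_order constants. The following is a minimal sketch of the message-passing test in real C++11, with the flag's memory orders passed in so the variants from the earlier slides can be tried as well. The function name message_passing and its parameters are hypothetical, chosen for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the message-passing test once and returns the value r read by Thread 1.
// store_order is used for the store to the flag f, load_order for the loads
// of f; accesses to the data d are always relaxed, as on the slide.
// Precondition: store_order is valid for a store (relaxed, release, seq_cst)
// and load_order is valid for a load (relaxed, consume, acquire, seq_cst).
int message_passing(std::memory_order store_order, std::memory_order load_order) {
    std::atomic<int> d{0};
    std::atomic<int> f{0};
    int r = -1;  // sentinel, overwritten by Thread 1

    std::thread t0([&] {
        d.store(1, std::memory_order_relaxed);
        f.store(1, store_order);
    });
    std::thread t1([&] {
        while (f.load(load_order) == 0) {
            // spin until the flag is observed
        }
        r = d.load(std::memory_order_relaxed);
    });

    t0.join();
    t1.join();
    assert(r != -1);  // Thread 1 always assigns r before it finishes
    return r;
}
```

With message_passing(std::memory_order_release, std::memory_order_acquire) the language guarantees a result of 1. With relaxed orders for both parameters the result may be 0 or 1 according to the standard, although a given compiler and machine may never actually produce 0.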

Implementation of acquire/release on POWER
Initially: d = 0; f = 0;
    Thread 0        Thread 1
    st d 1;         loop: ld f rtmp;
    lwsync;         cmp rtmp 0;
    st f 1;         beq loop;
                    isync;
                    ld d r;
Finally: r = 0 ??
Forbidden (and not observed) on POWER7, and ARM
lwsync prevents write reordering
control dependency with isync prevents read speculation

Correct implementations of C/C++ on hardware
Can it be done?
◮ . . . on highly relaxed hardware? e.g. POWER/ARM
What is involved?
◮ Mapping new constructs to assembly
◮ Optimizations: which ones are legal?

Implementing C/C++11 on POWER: Pointwise Mapping
    C/C++11 Operation     POWER Implementation
    Store (non-atomic)    st
    Load (non-atomic)     ld
    Store relaxed         st
    Store release         lwsync; st
    Store seq-cst         lwsync; st
    Load relaxed          ld
    Load consume          ld (and preserve dependency)
    Load acquire          ld; cmp; bc; isync
    Load seq-cst          hwsync; ld; cmp; bc; isync
    Fence acquire         lwsync
    Fence release         lwsync
    Fence seq-cst         hwsync
    CAS relaxed           loop: lwarx; cmp; bc exit; stwcx.; bc loop; exit:
    CAS seq-cst           hwsync; loop: lwarx; cmp; bc exit; stwcx.; bc loop; isync; exit:
    . . .                 ...
(From Paul McKenney and Raul Silvera)
Is that mapping correct? Answer: No!
Store seq-cst must be hwsync; st rather than lwsync; st
Is the mapping correct with that row fixed? Answer: Yes!
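
The load and store rows of the corrected mapping can be written down as a small lookup, sketched here in C++. The function names power_store and power_load are hypothetical; the returned strings are the instruction sequences from the slide, and an empty string stands for a combination the table does not list.

```cpp
#include <atomic>
#include <string>

// POWER instruction sequence for an atomic store with the given memory order,
// following the corrected table (Store seq-cst uses hwsync, not lwsync).
std::string power_store(std::memory_order mo) {
    switch (mo) {
        case std::memory_order_relaxed: return "st";
        case std::memory_order_release: return "lwsync; st";
        case std::memory_order_seq_cst: return "hwsync; st";
        default:                        return "";  // not listed for stores
    }
}

// POWER instruction sequence for an atomic load with the given memory order.
std::string power_load(std::memory_order mo) {
    switch (mo) {
        case std::memory_order_relaxed: return "ld";
        case std::memory_order_consume: return "ld";  // and preserve dependency
        case std::memory_order_acquire: return "ld; cmp; bc; isync";
        case std::memory_order_seq_cst: return "hwsync; ld; cmp; bc; isync";
        default:                        return "";  // not listed for loads
    }
}
```

For example, power_store(std::memory_order_seq_cst) yields "hwsync; st", the entry that the slide corrects. A real compiler back end emits instructions rather than strings, so this only restates the table in executable form.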
