Shared Memory Consistency Models: A Tutorial By Sarita Adve, - PowerPoint PPT Presentation

“Shared Memory Consistency Models: A Tutorial” By Sarita Adve, Kourosh Gharachorloo WRL Research Report, 1995 Presentation: Vince Schuster

Contents  Overview  Uniprocessor Review  Sequential Consistency  Relaxed Memory Models  Program Abstractions  Conclusions 2

Overview  Correct & Efficient Shmem Programs  Require precise notion of behavior w.r.t. read (R) and write (W) operations between processor memories. Example 1, Figure 1 P 2 , P 3 , …, P n Initially, all ptrs = NULL; all ints = 0; While (MyTask == null) { Begin Critical Section P 1 if (Head != null) { While ( no more tasks ) { MyTask = Head; Task = GetFromFreeList(); Head = Head->Next; Task- >Data = …; } insert Task in task queue End Critical Section } } Head = head of task queue; … = MyTask->Data; Q: What will Data be? A: Could be old Data 3

Definitions  Memory Consistency Model  Formal Specification of Mem System Behavior to Programmer  Program Order  The order in which memory operations appear in program  Sequential Consistency (SC): An MP is SC if  Exec Result is same as if all Procs were in some sequence.  Operations of each Proc appear in this sequence in order specified by its program. (Lamport [16])  Relaxed Memory Consistency Models (RxM)  An RxM less restrictive than SC. Valuable for efficient shmem.  System Centric : HW/SW mechanism enabling Mem Model  Programmer-centric : Observation of Program behavior a memory model from programmer’s viewpoint .  Cache-Coherence: A write is eventually made visible to all MPs. 1. Writes to same loc appear as serialized (same order) by MPs 2. NOTE: not equivalent to Sequential Consistency (SC) 4

UniProcessor Review  Only needs to maintain control and data dependencies . Compiler can perform extreme Optz: (reg alloc, code motion, value propagation,  loop transformations, vectorizing, SW pipelining, prefetching , …  A multi-threaded program will look like: T3 T2 T4 T1 . . . Tn Memory All of memory will appear to have the same values to the threads in a UniProcessor System. You still have to deal with the normal multi-threaded Conceptually, SC wants the one program problems by one processor, but you don’t memory w/ switch that connects procs to have to deal with issues such as Write memory + Program Order on a per- Buffer problems or Cache Coherence. Processor basis 5

Seq. Consist. Examples Dekker’s Algorithm: P 1 // init: all = 0 P 2 // init: all = 0 What if Flag1 set to 1 then Flag1 = 1 Flag2 = 1 Flag2 set to 1 then if s? Or If (Flag2 == 0) If (Flag1 == 0) F2 Read bypasses F1 Write? critical section critical section A: Sequential Consistency (program order & Proc seq) What if P2 gets Read of A P 1 P 2 P 3 but P3 gets old value of A ? A = 1 If (A == 1) A: Atomicity of memops B = 1 (All procs see instant and If (B == 1) identical view of memops.) reg1 = A NOTE: UniProcessor system doesn’t have to deal with old values or R/W bypasses. 6

Architectures  Will visit:  Architectures w/o Cache  Write Bufferes w/ Bypass Capability  Overlapping Write Operations  Non-Blocking Read Operations  Architectures w/ Cache  Cache Coherence & SC  Detecting Completion of Write Operations  Illusion of Write Atomicity 7

Write Buffer w/ Bypass Capability Bus-based Mem P1 P2 System w/o Read Read Write Flag1 t3 Write Flag2 t4 Flag2 Flag1 Cache t1 t2 • Bypass can hide Write latency Shared Bus • Violates Sequential Consistency Q: What happens if NOTE: Write Buffer not Read of Flag1 & Flag2 a problem on Flag1: 0 bypass Writes? UniProcessor Programs Flag2: 0 P 1 // init: all = 0 P 2 // init: all = 0 A: Both enter critical section Flag1 = 1 Flag2 = 1 If (Flag2 == 0) If (Flag1 == 0) critical section critical section 8

Overlapping Writes • Interconnection network P1 P2 alleviates the serialization Read Data t3 bottleneck of a bus-based Read Head t2 design. Also, Writes can be coalesced. Write Head Write Data t1 t4 Q: What happens if Write of Head bypasses Memory Write of Data? Head: 0 Data: 0 P 1 // init: all = 0 P 2 // init: all = 0 A: Data Read Data = 2000 While (Head == 0) ; returns 0 Head = 1 ... = Data 9

Non-Blocking Reads Non-Blocking Reads Enable Interconnect P1 P2 • non-blocking caches • speculative execution Write Head t3 • dynamic scheduling Write Data t2 Read Head Read Data t4 t1 Q: What happens if Memory Read of Data bypasses Head: 0 Data: 0 Read of Head? P 1 // init: all = 0 P 2 // init: all = 0 A: Data Read Data = 2000 While (Head == 0) ; returns 0 Head = 1 ... = Data 10

Cache-Coherence & SC  Write buffer w/o cache similar to Write-thru cache  Reads can proceed before Write completes (on other MPs)  Cache-Coherence: not equiv to Sequential Consistency (SC) A write is eventually made visible to all MPs. 1. Writes to same loc appear as serialized (same order) by MPs 2. Propagate value via invalidating or updating cache-copy(ies) 3. Detecting Completion of Write Operation  What if P2 gets new Head but old Data? P1 P2  Avoided if invalidate/update before 2 nd Write Read Data t3  Read Head t2 Write ACK needed  Or at least Invalidate ACK  Write Head Write Data t1 t4 Write-thru cache Head: 0 Data: 0 Memory Memory 11

Illusion of Write Atomicity  Cache-coherence Problems:  Cache-coherence Problems: Cache-coherence (cc) Protocol must propogate value to all copies. Cache-coherence (cc) Protocol must propogate value to all copies. 1. 1. Detecting Write completion takes multi ops w/ multiple replications Detecting Write completion takes multi ops w/ multiple replications 2. 2. Hard to create “ Illusion of Atomicity” w/ non-atomic writes. Hard to create “ Illusion of Atomicity” w/ non-atomic writes. 3. 3. Q: What if P1 & P2 updates reach P3 & P4 differently? A: Reg1 & Reg2 might have different results (& violates SC) Solution: Can serialize writes to same location Alternative: Delay updates until ACK of previous to same loc Still not equiv to Sequential Consistency. P 1: A=B=C=0 P 2 = 0 P 3 P 4 A = 1 A = 2 While (B != 1) ; While (B != 1) ; B = 1 C = 1 While (C != 1) ; While (C != 1) ; Reg1 = A Reg2 = A 12

Ex2: Illusion of Wr Atomicity Q: What if P2 reads new A before P3 gets updated w/ A; AND P2 update of B reaches P3 before its update of A AND P3 reads new B & old A? A: Prohibit read from new value until all have ACK’d . Update Protocol (2-phase scheme): 1. Send update, Recv ACK from each MP 2. Updated MPs get ACK of all ACKs. (Note: Writing proc can consider Write complete after #1.) P 1 P 2 P 3 A = 1 If (A == 1) If (B == 1) B = 1 reg1 = A 13

Compilers  Compilers do many optz w.r.t. mem reorderings:  CSE, Code motion, reg alloc, SW pipe, vect temps, const prop,…  All done from uni-processor perspective. Violates shmem SC  e.g. Would never exit from many of our while loops.  Compiler needs to know shmem objects and/or Sync points or must forego many optz. 14

Sequential Consistency Summary  SC imposes many HW and Compiler constraints  Requirements: Complete of all mem ops before next (or Illusion thereof) 1. Writes to same loc need be serialized (cache-based). 2. Write Atomicity (or illusion thereof) 3. Discuss HW Techniques useful for SC & Efficiency:  Pre- Exclusive Rd (Delays due to Program Order); cc invalid mems  Read Rolebacks (Due to speculative exec or dyn sched).  Global shmem data dep analysis (Shasha & Snir)  Relaxed Memory Models (RxM) next  15

Relaxed Memory Models  Characterization (3 models, 5 specific types) 1a. Relax Write to Read program order (PO) (assume different 1b. Relax Write to Write PO locations) 1c. Relax Read to Read & Read to Write POs Relaxation 2. Read others’ Write early (cache-based only) (most allow & usually safe; but 3. Read own Write early what if two writers to same loc?) • Some RxMs can be detected by programmer, others not. • Various Systems use different fence techniques to provide safety net s. • AlphaServer, Cray T3D, SparcCenter, Sequent, IBM 370, PowerPC 16

Relaxed Write to Read PO  Relax constraint of Write then Read to a diff loc.  Reorder Reads w.r.t. previous Writes w/ memory disambiguation .  3 Models handle it differently. All do it to hide Write Latency  Only IBM 370 provides serialization instr as safety net between W&R  TSO can use Read-Modify-Write (RMW) of either Read or Write  PC must use RMW of Read since it uses less stringent RMW requirements. P 1 P 2 P 1 P 2 P 3 F1 = 1 F2 = 1 A = 1 A = 1 A = 2 if(A==1) Rg1 = A Rg3 = A B = 1 Rg2 = F2 Rg4 = F1 if (B==1) Rg1 = A Rslt: Rg1 = 1, Rg3 = 2 Rg2 = Rg4 = 0 Rslt: Rg1 = 0, B = 1 • TSO & PC since they allow • IBM 370 since it allows P 2 Read of new Read of F1/F2 before Write of A while P 3 Reads old A F1/F2 on each proc 17

Shared Memory Consistency Models: A Tutorial By Sarita Adve, - PowerPoint PPT Presentation

Shared Memory Consistency Models: A Tutorial By Sarita Adve, Kourosh Gharachorloo WRL Research Report, 1995 Presentation: Vince Schuster Contents Overview Uniprocessor Review Sequential Consistency Relaxed Memory Models

Outline Asynchronous shared memory model Wait-free Consensus in shared memory with R/W

Distributed Shared Memory 1 Distributed Shared Memory Making the main memory of a cluster of

Distributed Shared Memory Shared memory : difficult to realize vs . easy to program with.

COMP 590-154: Computer Architecture Shared-Memory Multi-Processors Shared-Memory Multiprocessors

Memory II. Memory improvement III. Problems with memory 3 systems/stages of Memory: memory

Distributed Shared Memory Presented by Humayun Arafat 1 Outline Background Shared Memory,

1 Memory SoC Persistent Memory-Driven Memory Memory Processor-Centric Memory SoC SoC

Programming with Shared Memory In a shared memory system, any memory location can be accessible by

Shared Memory Programming Introduction to OpenMP Overview Shared memory systems Basic

Distributed Shared Memory and Machine Learning CSci 8211 Chai-Wen Hsieh 11/5/2018 Agenda

Threaded Programming Lecture 1: Concepts Overview Shared memory systems Basic Concepts

Shared Memory Bus for Multiprocessor Systems Mat Laibowitz and Albert Chiou Group 6 Shared

Networks Computer-Computer Comm CPU CPU CPU CPU Memory Device Device Memory Memory

Operating Systems WT 2019/20 Memory Management Shared Memory Process 1 virtual memory most

Distributed Shared Memory Distributed Shared Memory Systems Page based

Message Passing DM519 Concurrent Programming 1 1 Absence Of Shared Memory In previous lectures

Advanced Systems Security: Principles Trent Jaeger Systems and Internet Infrastructure Security

Homework Assignment 1/3 America COMPETES Reauthoriza6on Act (H.R. 1806

QCD Simulations at Realistic Quark Masses: Probing the Chiral Limit G. Schierholz Deutsches

So=ware Security using Language Engineering and mbeddr Markus

Tutorial on CPLEX Linear Programming Combinatorial Problem Solving (CPS) Enric Rodr

Non-Planarity Measures and Small Cycles EuroCG 2020 PhD School Markus Chimani Theoretical

Finding Triangles for Maximum Planar Subgraphs Ghent Graph Theory Workshop 2017 Parinya

CUG 2009 May 7, 2009 * National Center for Computational Sciences + Center for Transportation