
Shared Memory Consistency Models: A Tutorial, by Sarita Adve and Kourosh Gharachorloo

By Sarita Adve and Kourosh Gharachorloo, WRL Research Report, 1995. Presentation: Vince Schuster. Contents: Overview, Uniprocessor Review, Sequential Consistency, Relaxed Memory Models.


  1. “Shared Memory Consistency Models: A Tutorial” By Sarita Adve, Kourosh Gharachorloo WRL Research Report, 1995 Presentation: Vince Schuster

  2. Contents
     • Overview
     • Uniprocessor Review
     • Sequential Consistency
     • Relaxed Memory Models
     • Program Abstractions
     • Conclusions

  3. Overview
     • Correct and efficient shared-memory (shmem) programs require a precise notion of behavior w.r.t. read (R) and write (W) operations between processors and memories.

     Example 1 (Figure 1). Initially, all ptrs = NULL; all ints = 0.

     P1:
         While (there are more tasks) {
             Task = GetFromFreeList();
             Task->Data = …;
             insert Task in task queue
         }
         Head = head of task queue;

     P2, P3, …, Pn:
         While (MyTask == null) {
             Begin Critical Section
             if (Head != null) {
                 MyTask = Head;
                 Head = Head->Next;
             }
             End Critical Section
         }
         … = MyTask->Data;

     Q: What will Data be?
     A: Could be old Data.

  4. Definitions
     • Memory Consistency Model: a formal specification of memory system behavior as presented to the programmer.
     • Program Order: the order in which memory operations appear in the program.
     • Sequential Consistency (SC): a multiprocessor (MP) is SC if
         - the result of any execution is the same as if the operations of all processors were executed in some sequential order, and
         - the operations of each processor appear in this sequence in the order specified by its program (Lamport [16]).
     • Relaxed Memory Consistency Models (RxM): an RxM is less restrictive than SC, which is valuable for efficient shared memory.
         - System-centric: the HW/SW mechanisms that implement the memory model.
         - Programmer-centric: the memory model from the programmer's viewpoint, stated in terms of observable program behavior.
     • Cache Coherence:
         1. A write is eventually made visible to all processors.
         2. Writes to the same location appear serialized (in the same order) to all processors.
         NOTE: Cache coherence is not equivalent to Sequential Consistency (SC).

  5. Uniprocessor Review
     • Only control and data dependencies need to be maintained.
     • The compiler can perform aggressive optimizations: register allocation, code motion, value propagation, loop transformations, vectorizing, SW pipelining, prefetching, …
     • A multi-threaded program will look like:
       [Figure: threads T1, T2, T3, T4, …, Tn all attached to one Memory]
     • All of memory appears to have the same values to every thread in a uniprocessor system. You still have to deal with the normal multi-threaded problems on one processor, but you don't have to deal with issues such as write-buffer problems or cache coherence.
     • Conceptually, SC wants one memory with a switch that connects the processors to memory, plus program order on a per-processor basis.

  6. Sequential Consistency Examples
     Dekker's Algorithm (init: all = 0):

         P1                          P2
         Flag1 = 1                   Flag2 = 1
         If (Flag2 == 0)             If (Flag1 == 0)
             critical section            critical section

     Q: What if Flag1 is set to 1, then Flag2 is set to 1, and only then the ifs run? Or what if the Read of Flag2 bypasses the Write of Flag1?
     A: Sequential Consistency (program order plus a single sequence across processors) prevents both from entering the critical section.

     Three processors (init: all = 0):

         P1              P2                  P3
         A = 1           If (A == 1)         If (B == 1)
                             B = 1               reg1 = A

     Q: What if P2 reads the new value of A but P3 gets the old value of A?
     A: Atomicity of memory operations (all processors see an instant and identical view of memory operations).

     NOTE: A uniprocessor system doesn't have to deal with old values or R/W bypasses.
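
The SC definition can be checked mechanically for the Dekker fragment above. The sketch below is only an illustration, not part of the original slides: it enumerates every interleaving that keeps each processor's program order, runs each one against a single memory, and collects what the two `if` statements would read. The names `interleavings`, `run`, `sc_outcomes`, `r1` and `r2` are made up for this example.

```python
def interleavings(a, b):
    """Yield every merge of lists a and b that preserves each list's own order."""
    if not a or not b:
        yield a + b
        return
    for rest in interleavings(a[1:], b):
        yield a[:1] + rest
    for rest in interleavings(a, b[1:]):
        yield b[:1] + rest


def run(schedule):
    """Execute operations one at a time against a single shared memory."""
    memory = {"Flag1": 0, "Flag2": 0}
    registers = {}
    for op, location, operand in schedule:
        if op == "W":
            memory[location] = operand             # operand is the value written
        else:
            registers[operand] = memory[location]  # operand is the register name
    return registers["r1"], registers["r2"]


# Dekker-style entry code from the slide; r1 and r2 hold what each `if` reads.
P1 = [("W", "Flag1", 1), ("R", "Flag2", "r1")]
P2 = [("W", "Flag2", 1), ("R", "Flag1", "r2")]


def sc_outcomes(p1, p2):
    """Collect the (r1, r2) results of every sequentially consistent execution."""
    return {run(schedule) for schedule in interleavings(p1, p2)}


print(sorted(sc_outcomes(P1, P2)))  # [(0, 1), (1, 0), (1, 1)]
```

The pair (0, 0), which would let both processors enter the critical section, never appears: under SC at least one processor must observe the other's flag.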

  7. Architectures
     Will visit:
     • Architectures w/o Cache
         - Write Buffers w/ Bypass Capability
         - Overlapping Write Operations
         - Non-Blocking Read Operations
     • Architectures w/ Cache
         - Cache Coherence & SC
         - Detecting Completion of Write Operations
         - Illusion of Write Atomicity

  8. Write Buffer w/ Bypass Capability
     [Figure: bus-based memory system w/o cache. P1 and P2 share a bus to memory. P1: Read Flag2 (t1), Write Flag1 (t3). P2: Read Flag1 (t2), Write Flag2 (t4). Memory: Flag1: 0, Flag2: 0.]
     • Bypass can hide Write latency
     • Violates Sequential Consistency
     NOTE: The write buffer is not a problem for uniprocessor programs.

         P1 // init: all = 0         P2 // init: all = 0
         Flag1 = 1                   Flag2 = 1
         If (Flag2 == 0)             If (Flag1 == 0)
             critical section            critical section

     Q: What happens if the Reads of Flag1 & Flag2 bypass the Writes?
     A: Both enter the critical section.
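
A minimal single-threaded model of the scenario on this slide, assuming a FIFO write buffer per processor whose contents reach memory only when drained. The `Processor` class, its method names, and the forwarding of a processor's own pending write are assumptions made for this sketch; the slide itself only gives the t1 to t4 timeline.

```python
from collections import deque


class Processor:
    """Toy processor with a FIFO write buffer that later reads may bypass."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = deque()  # pending (location, value) writes

    def write(self, location, value):
        self.write_buffer.append((location, value))  # not yet visible in memory

    def read(self, location):
        # Forward this processor's own newest pending write to the same location.
        for pending_location, value in reversed(self.write_buffer):
            if pending_location == location:
                return value
        return self.memory[location]  # otherwise bypass the buffer

    def drain(self):
        while self.write_buffer:
            location, value = self.write_buffer.popleft()
            self.memory[location] = value


def dekker_with_write_buffers():
    memory = {"Flag1": 0, "Flag2": 0}
    p1, p2 = Processor(memory), Processor(memory)
    p1.write("Flag1", 1)     # issued in program order, parked in P1's buffer
    p2.write("Flag2", 1)     # parked in P2's buffer
    r1 = p1.read("Flag2")    # t1: bypasses P1's buffered write
    r2 = p2.read("Flag1")    # t2: bypasses P2's buffered write
    p1.drain()               # t3: Write Flag1 reaches memory
    p2.drain()               # t4: Write Flag2 reaches memory
    return r1, r2, memory


print(dekker_with_write_buffers())  # (0, 0, {'Flag1': 1, 'Flag2': 1})
```

Both reads return 0 even though each processor issued its write first, so both would enter the critical section.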

  9. Overlapping Writes
     [Figure: P1 and P2 connected to memory through an interconnection network. P1: Write Head (t1), Write Data (t4). P2: Read Head (t2), Read Data (t3). Memory: Head: 0, Data: 0.]
     • An interconnection network alleviates the serialization bottleneck of a bus-based design. Also, Writes can be coalesced.

         P1 // init: all = 0         P2 // init: all = 0
         Data = 2000                 While (Head == 0) ;
         Head = 1                    ... = Data

     Q: What happens if the Write of Head bypasses the Write of Data?
     A: The Read of Data returns 0.

  10. Non-Blocking Reads
     [Figure: P1 and P2 connected to memory through an interconnect. P1: Write Data (t2), Write Head (t3). P2: Read Data (t1), Read Head (t4). Memory: Head: 0, Data: 0.]
     Non-Blocking Reads Enable:
     • non-blocking caches
     • speculative execution
     • dynamic scheduling

         P1 // init: all = 0         P2 // init: all = 0
         Data = 2000                 While (Head == 0) ;
         Head = 1                    ... = Data

     Q: What happens if the Read of Data bypasses the Read of Head?
     A: The Read of Data returns 0.
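
The Head/Data timelines of the Overlapping Writes and Non-Blocking Reads slides can be replayed against a single memory to confirm the stated answers. This is an illustrative sketch only: `replay` and the three timeline constants are hypothetical names, and each list simply orders the four operations by the time at which they reach memory.

```python
def replay(timeline):
    """Apply operations in the order they reach memory; return (Head, Data) as read by P2."""
    memory = {"Head": 0, "Data": 0}
    reads = {}
    for op, location, value in timeline:
        if op == "W":
            memory[location] = value
        else:
            reads[location] = memory[location]
    return reads["Head"], reads["Data"]


# Program order respected everywhere: P2 sees Head == 1 and the new Data.
PROGRAM_ORDER = [("W", "Data", 2000), ("W", "Head", 1),
                 ("R", "Head", None), ("R", "Data", None)]

# Overlapping writes: Write Head (t1) reaches memory before Write Data (t4).
OVERLAPPING_WRITES = [("W", "Head", 1), ("R", "Head", None),
                      ("R", "Data", None), ("W", "Data", 2000)]

# Non-blocking reads: Read Data (t1) is performed before Read Head (t4).
NON_BLOCKING_READS = [("R", "Data", None), ("W", "Data", 2000),
                      ("W", "Head", 1), ("R", "Head", None)]

print(replay(PROGRAM_ORDER))        # (1, 2000)
print(replay(OVERLAPPING_WRITES))   # (1, 0)
print(replay(NON_BLOCKING_READS))   # (1, 0)
```

In both reordered timelines P2 leaves the loop (Head is 1) yet reads Data as 0, which no sequentially consistent execution allows.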

  11. Cache Coherence & SC
     • A write buffer w/o cache is similar to a write-through cache
     • Reads can proceed before a Write completes (on other processors)
     • Cache coherence is not equivalent to Sequential Consistency (SC):
         1. A write is eventually made visible to all processors.
         2. Writes to the same location appear serialized (in the same order) to all processors.
         3. The value is propagated by invalidating or updating the cached copy (or copies).

     Detecting Completion of Write Operation
     [Figure: P1 and P2, each with a write-through cache, connected to two memory modules holding Head: 0 and Data: 0. P1: Write Head (t1), Write Data (t4). P2: Read Head (t2), Read Data (t3).]
     • What if P2 gets the new Head but the old Data?
     • Avoided if the invalidate/update happens before the 2nd Write
     • Write ACK needed
     • Or at least an Invalidate ACK

  12. Illusion of Write Atomicity
     • Cache-coherence problems:
         1. The cache-coherence (cc) protocol must propagate the value to all copies.
         2. Detecting Write completion takes multiple operations when there are multiple replicas.
         3. Hard to create the "illusion of atomicity" w/ non-atomic writes.

         Init: A = B = C = 0

         P1          P2          P3                      P4
         A = 1       A = 2       While (B != 1) ;        While (B != 1) ;
         B = 1       C = 1       While (C != 1) ;        While (C != 1) ;
                                 Reg1 = A                Reg2 = A

     Q: What if the updates from P1 & P2 reach P3 & P4 in different orders?
     A: Reg1 & Reg2 might have different results (which violates SC).
     Solution: Serialize writes to the same location.
     Alternative: Delay an update until the previous update to the same location has been ACKed.
     Still not equivalent to Sequential Consistency.
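
A small model of the A = 1 / A = 2 example, showing why updates that arrive in different orders leave P3 and P4 with different values, and how one agreed order for writes to the same location repairs that. The sequence-number scheme is an assumption chosen for this sketch (the slide only says that writes to the same location can be serialized), and all class and function names are hypothetical.

```python
class UnorderedCopy:
    """Cached copy of A that applies updates in whatever order they arrive."""

    def __init__(self):
        self.value = 0

    def receive(self, seq, value):
        self.value = value


class SerializedCopy:
    """Cached copy of A that only moves forward in one agreed write order."""

    def __init__(self):
        self.value = 0
        self.newest_seq = 0

    def receive(self, seq, value):
        if seq > self.newest_seq:  # ignore an update older than one already applied
            self.newest_seq, self.value = seq, value


def final_values(copy_type, arrivals_at_p3, arrivals_at_p4):
    """Return (Reg1, Reg2): the value of A that P3 and P4 end up reading."""
    p3, p4 = copy_type(), copy_type()
    for seq, value in arrivals_at_p3:
        p3.receive(seq, value)
    for seq, value in arrivals_at_p4:
        p4.receive(seq, value)
    return p3.value, p4.value


WRITE_BY_P1 = (1, 1)  # (assumed sequence number, value) for A = 1
WRITE_BY_P2 = (2, 2)  # (assumed sequence number, value) for A = 2

to_p3 = [WRITE_BY_P1, WRITE_BY_P2]  # P3 sees A = 1, then A = 2
to_p4 = [WRITE_BY_P2, WRITE_BY_P1]  # P4 sees them the other way round

print(final_values(UnorderedCopy, to_p3, to_p4))   # (2, 1)
print(final_values(SerializedCopy, to_p3, to_p4))  # (2, 2)
```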

  13. Ex2: Illusion of Write Atomicity

         P1          P2                  P3
         A = 1       If (A == 1)         If (B == 1)
                         B = 1               reg1 = A

     Q: What if P2 reads the new A before P3 gets updated w/ A, AND P2's update of B reaches P3 before the update of A, AND P3 reads the new B & the old A?
     A: Prohibit reads of the new value until all processors have ACKed.
     Update protocol (2-phase scheme):
         1. Send the update; receive an ACK from each processor.
         2. The updated processors get an ACK of all ACKs.
         (Note: The writing processor can consider the Write complete after #1.)
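
One way to picture the 2-phase scheme is as a per-copy "readable" bit: phase 1 installs the new value but blocks reads of it, and phase 2 unblocks every copy once all ACKs are in. This is a hedged sketch under that assumption; `CachedCopy`, `TwoPhaseUpdate` and their methods are invented names, and a stalled read is modelled by returning `None`.

```python
class CachedCopy:
    def __init__(self):
        self.value = 0
        self.readable = True


class TwoPhaseUpdate:
    """Toy model of an update protocol that hides a new value until all copies ACK."""

    def __init__(self, names):
        self.copies = {name: CachedCopy() for name in names}
        self.acks = set()

    def deliver_update(self, name, value):
        """Phase 1: install the new value, block reads of it, and ACK the writer."""
        copy = self.copies[name]
        copy.value, copy.readable = value, False
        self.acks.add(name)

    def release_if_complete(self):
        """Phase 2: once every copy has ACKed, let all copies serve reads again."""
        if self.acks != set(self.copies):
            return False
        for copy in self.copies.values():
            copy.readable = True
        return True

    def try_read(self, name):
        """Return the value, or None if the reader must stall."""
        copy = self.copies[name]
        return copy.value if copy.readable else None


protocol = TwoPhaseUpdate(["P2", "P3"])
protocol.deliver_update("P2", 1)          # the update of A reaches P2 first
early_p2 = protocol.try_read("P2")        # None: P2 may not read the new A yet
early_p3 = protocol.try_read("P3")        # 0: P3 still holds the old A
released_early = protocol.release_if_complete()  # False: P3 has not ACKed
protocol.deliver_update("P3", 1)
released = protocol.release_if_complete()        # True: all ACKs are in
print(early_p2, early_p3, released_early, released,
      protocol.try_read("P2"), protocol.try_read("P3"))
```

Because P2 cannot observe A == 1 until P3 also holds the new A, P2 cannot publish B = 1 while P3 is still able to read the old A.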

  14. Compilers
     • Compilers perform many optimizations that reorder memory operations:
         - CSE, code motion, register allocation, SW pipelining, vectorization temporaries, constant propagation, …
     • All are done from a uniprocessor perspective, which violates shared-memory SC.
         - e.g., with the loop variable kept in a register, many of our while loops would never exit.
     • The compiler needs to know the shared-memory objects and/or sync points, or must forgo many optimizations.
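
The "would never exit" remark can be illustrated with a toy spin loop on Head in which the value is either reloaded from memory on every iteration or held in a local variable that stands in for a register. This is a sketch under stated assumptions, not compiler output: `polls_until_exit` and its parameters are hypothetical, and the other processor's write is simulated by updating memory at a fixed poll.

```python
def polls_until_exit(reload_each_iteration, write_at_poll=3, max_polls=10):
    """Spin on Head; return the poll at which the loop exits, or None if it never does."""
    memory = {"Head": 0}
    head = memory["Head"]              # initial load into a "register"
    for poll in range(max_polls):
        if poll == write_at_poll:
            memory["Head"] = 1         # the other processor's write lands
        if reload_each_iteration:
            head = memory["Head"]      # what the source-level loop asks for
        if head != 0:
            return poll
    return None                        # still spinning when we gave up


print(polls_until_exit(reload_each_iteration=True))   # 3
print(polls_until_exit(reload_each_iteration=False))  # None
```

Hoisting the load out of the loop is legal from a uniprocessor viewpoint, since nothing in the loop body writes Head, which is why the compiler must be told which objects are shared.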

  15. Sequential Consistency Summary
     • SC imposes many HW and compiler constraints
     • Requirements:
         1. Completion of each memory operation before the next (or the illusion thereof)
         2. Writes to the same location need to be serialized (cache-based)
         3. Write atomicity (or the illusion thereof)
     • HW techniques useful for SC & efficiency:
         - Pre-Exclusive Rd (delays due to program order); cc invalid mems
         - Read rollbacks (due to speculative execution or dynamic scheduling)
         - Global shmem data dependence analysis (Shasha & Snir)
     • Relaxed Memory Models (RxM) next

  16. Relaxed Memory Models
     • Characterization (3 models, 5 specific types)
       Relaxation:
         1a. Relax Write to Read program order (PO)
         1b. Relax Write to Write PO
         1c. Relax Read to Read & Read to Write POs
             (1a to 1c assume different locations)
         2. Read others' Write early (cache-based only)
         3. Read own Write early
             (side note on 2 and 3: most allow & usually safe; but what if two writers to same loc?)
     • Some RxMs can be detected by the programmer, others not.
     • Various systems use different fence techniques to provide safety nets.
     • AlphaServer, Cray T3D, SparcCenter, Sequent, IBM 370, PowerPC

  17. Relaxed Write to Read PO
     • Relax the constraint on a Write followed by a Read to a different location.
     • Reorder Reads w.r.t. previous Writes w/ memory disambiguation.
     • 3 models handle it differently. All do it to hide Write latency.
         - Only IBM 370 provides a serialization instruction as a safety net between W & R.
         - TSO can use a Read-Modify-Write (RMW) for either the Read or the Write.
         - PC must use an RMW for the Read, since it uses less stringent RMW requirements.

     Example (a):

         P1              P2
         F1 = 1          F2 = 1
         A = 1           A = 2
         Rg1 = A         Rg3 = A
         Rg2 = F2        Rg4 = F1

         Rslt: Rg1 = 1, Rg3 = 2, Rg2 = Rg4 = 0
         • Possible under TSO & PC, since they allow the Read of F1/F2 before the Write of F1/F2 on each processor.

     Example (b):

         P1          P2                  P3
         A = 1       if (A == 1)         if (B == 1)
                         B = 1               Rg1 = A

         Rslt: Rg1 = 0, B = 1
         • Possible under PC, since it allows P2 to Read the new A while P3 still Reads the old A. IBM 370 and TSO keep writes atomic and rule this result out.
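
The effect of relaxing Write to Read program order, and of a safety net that restores it, can be explored by enumerating a small write-buffer model of the F1/F2 part of the first example. This is a simplified sketch, not a model of any specific machine: each processor's write sits in a private buffer until an explicit "drain" step, and placing the drain before the read stands in for a serialization instruction or fence. All names are hypothetical.

```python
def merges(a, b):
    """Yield every interleaving of lists a and b that keeps each list's order."""
    if not a or not b:
        yield a + b
        return
    for rest in merges(a[1:], b):
        yield a[:1] + rest
    for rest in merges(a, b[1:]):
        yield b[:1] + rest


def outcomes(p1, p2):
    """Return the set of (P1's read of F2, P2's read of F1) over all interleavings."""
    results = set()
    for schedule in merges(p1, p2):
        memory = {"F1": 0, "F2": 0}
        buffers = {1: [], 2: []}           # per-processor pending writes of 1
        regs = {}
        for proc, op, location in schedule:
            if op == "write":
                buffers[proc].append(location)
            elif op == "drain":
                for pending in buffers[proc]:
                    memory[pending] = 1
                buffers[proc].clear()
            else:                          # "read": own buffer first, then memory
                regs[proc] = 1 if location in buffers[proc] else memory[location]
        results.add((regs[1], regs[2]))
    return results


# Relaxed W->R order: the buffered write drains only after the following read.
RELAXED_P1 = [(1, "write", "F1"), (1, "read", "F2"), (1, "drain", None)]
RELAXED_P2 = [(2, "write", "F2"), (2, "read", "F1"), (2, "drain", None)]

# Safety net: the write is forced out to memory before the read is performed.
FENCED_P1 = [(1, "write", "F1"), (1, "drain", None), (1, "read", "F2")]
FENCED_P2 = [(2, "write", "F2"), (2, "drain", None), (2, "read", "F1")]

print(sorted(outcomes(RELAXED_P1, RELAXED_P2)))  # [(0, 0), (0, 1), (1, 0)]
print(sorted(outcomes(FENCED_P1, FENCED_P2)))    # [(0, 1), (1, 0), (1, 1)]
```

The relaxed variant admits (0, 0), matching Rg2 = Rg4 = 0 on the slide, while the fenced variant produces exactly the sequentially consistent outcomes.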
