slide-1
SLIDE 1

CS6354: Memory models


slide-2
SLIDE 2

To read more…

This day’s papers:

Adve and Gharachorloo, “Shared Memory Consistency Models: A Tutorial”
Boehm and Adve, “Foundations of the C++ Concurrency Memory Model”, section 1 only

Supplementary readings:

Hennessy and Patterson, section 5.6
Sorin, Hill, and Wood, A Primer on Memory Consistency and Coherence
Boehm, “Threads Cannot Be Implemented as a Library”

slide-3
SLIDE 3

double-checked locking

class Foo { // BROKEN code
    private Helper helper = null;
    public Helper getHelper() {
        if (helper == null)
            synchronized (this) {
                if (helper == null)
                    helper = new Helper();
            }
        return helper;
    }
    int value;
    // ...
}

helper.value write visible after helper write?
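One way to repair the pattern is C++11 acquire/release atomics. This is a sketch, not the slides' code: the `Foo`/`Helper` names mirror the Java example, but the `std::atomic` mapping and member names are my own. The acquire load pairs with the release store, so a thread that sees a non-null pointer also sees the initialized fields behind it.

```cpp
#include <atomic>
#include <mutex>

// Hypothetical stand-in for the slide's Java Helper class.
struct Helper { int value = 42; };

class Foo {
    std::atomic<Helper*> helper{nullptr};
    std::mutex m;
public:
    Helper* getHelper() {
        // Acquire load: seeing a non-null pointer guarantees we also
        // see the writes that initialized *helper.
        Helper* h = helper.load(std::memory_order_acquire);
        if (h == nullptr) {
            std::lock_guard<std::mutex> lock(m);
            // Re-check under the lock (relaxed is enough here:
            // the mutex already orders this load).
            h = helper.load(std::memory_order_relaxed);
            if (h == nullptr) {
                h = new Helper();
                // Release store: publishes the fully constructed object.
                helper.store(h, std::memory_order_release);
            }
        }
        return h;
    }
};
```

With a plain (non-atomic) pointer, nothing stops the `helper` write from becoming visible before `helper.value`, which is exactly the question the slide poses.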



slide-5
SLIDE 5

compare-and-swap

compare-and-swap(address, old, new) {
    with ownership of *address in cache:
        if (*address == old) {
            *address = new;
            return TRUE;
        } else {
            return FALSE;
        }
}
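The pseudocode above maps directly onto C++'s `std::atomic`. A minimal sketch (the `cas` wrapper name is mine): `compare_exchange_strong` atomically performs the compare-and-conditional-store the slide describes.

```cpp
#include <atomic>

// Sketch of the slide's compare-and-swap in standard C++.
// compare_exchange_strong atomically does:
//   if (*addr == expected) { *addr = desired; return true; } else return false;
bool cas(std::atomic<int>& addr, int expected, int desired) {
    // Note: on failure the library overwrites `expected` with the value
    // actually observed; we discard that to match the slide's interface.
    return addr.compare_exchange_strong(expected, desired);
}
```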


slide-6
SLIDE 6

CAS lock

Alleged lock with compare-and-swap:

class Lock {
    int lockValue = 0;
    void lock() {
        while (!compare-and-swap(&lockValue, 0, 1)) {
            // retry
        }
    }
    void unlock() {
        lockValue = 0;
    }
};

slide-7
SLIDE 7

CAS lock: usage

Lock counterLock;
int counter = 0;

Thread 1:                   Thread 2:
counterLock.lock();         counterLock.lock();
counter += 1;               counter += 1;
counterLock.unlock();       counterLock.unlock();

possible result: counter == 2


slide-8
SLIDE 8

CAS lock: broken timeline

CPU1: read-to-own lock → Dir/Mem: “lock = 0, you own it”
CPU1: lock = 1 (cached)
CPU1: read counter → counter = 0
CPU1: read-to-own counter; counter = 1 (in write buffer)
CPU1: lock = 0 (cached) — unlock locally, no need to wait for ownership
CPU2: read-to-own lock → CPU1 writes back lock = 0 → Dir/Mem: “lock = 0, you own it”
CPU2: lock = 1 (cached)
CPU2: read counter → counter = 0 (CPU1’s buffered counter write not yet visible)

CPU2 gets the lock before the counter = 1 write is complete


slide-11
SLIDE 11

Writing lock before counter?

write buffering hides store latency
lock release is just lockValue = 0 — nothing special
the local write can complete faster than the remote one

slide-12
SLIDE 12

CAS lock: fixed

class Lock {
    int lockValue = 0;
    void lock() {
        while (!compare-and-swap(&lockValue, 0, 1)) {
            // retry
        }
        MEMORY_FENCE();
    }
    void unlock() {
        MEMORY_FENCE();
        lockValue = 0;
    }
};


slide-13
SLIDE 13

fences

operations before the fence complete entirely before operations after it
this includes waiting for invalidations to finish
… but a fence doesn’t change the order of other threads’ operations

slide-14
SLIDE 14

the acquire/release model

acquire — one-way fence:

  • operations after the acquire aren’t done earlier

release — one-way fence:

  • operations before the release aren’t done later
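The classic use of this pairing is message passing: a release store publishes data, and a matching acquire load receives it. A minimal sketch (names `producer`/`consumer`/`ready` are mine):

```cpp
#include <atomic>
#include <thread>

int data = 0;                 // plain, non-atomic payload
std::atomic<bool> ready{false};

void producer() {
    data = 42;                                    // write the payload
    ready.store(true, std::memory_order_release); // one-way fence: the
                                                  // data write can't move later
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) {
        // spin; the acquire keeps the data read from moving earlier
    }
    return data; // guaranteed to observe 42
}
```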


slide-15
SLIDE 15

memory inconsistency on x86

x = y = 0

thread 1        thread 2
x = 1;          y = 1;
r1 = y;         r2 = x;

r1=0 r2=0:  3914      (00.003%)
r1=0 r2=1:  50196062  (50.196%)
r1=1 r2=0:  49798135  (49.798%)
r1=1 r2=1:  1889      (00.001%)

  • outcomes on my desktop (100M trials)
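This litmus test can be reproduced in portable C++. A sketch (the `both_zero_once` helper is mine): with the default `memory_order_seq_cst` the outcome r1 == 0 and r2 == 0 is forbidden, whereas plain x86 stores and loads (or relaxed atomics) permit it, as the 3914-in-100M count shows.

```cpp
#include <atomic>
#include <thread>

// Runs one trial of the slide's test with seq_cst atomics and reports
// whether the forbidden outcome r1 == 0 && r2 == 0 appeared.
bool both_zero_once() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;
    std::thread t1([&] { x.store(1); r1 = y.load(); }); // seq_cst by default
    std::thread t2([&] { y.store(1); r2 = x.load(); });
    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}
```

Swapping the operations to `std::memory_order_relaxed` lets the store-buffering outcome reappear on x86.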


slide-16
SLIDE 16

possible orders

order 1:  x = 1;  y = 1;  r1 = y;  r2 = x   →  r1 == 1, r2 == 1
order 2:  x = 1;  r1 = y;  y = 1;  r2 = x   →  r1 == 0, r2 == 1
order 3:  y = 1;  r2 = x;  x = 1;  r1 = y   →  r1 == 1, r2 == 0

(thread 1 runs x = 1; r1 = y — thread 2 runs y = 1; r2 = x)


slide-18
SLIDE 18

X86’s omission

stores can be reordered after loads to different addresses
… but a thread always sees its own writes immediately

slide-19
SLIDE 19

inconsistency causes

in the interprocessor network (not possible with a bus)
in the processor:

  • out-of-order execution of reads and/or writes
  • write buffering (don’t wait for invalidates)

slide-20
SLIDE 20

out-of-order read/write

track dependencies between loads and stores
don’t move loads across stores to the same address
don’t move stores across stores to the same address
with one CPU — this provides sequential consistency

slide-21
SLIDE 21

load bypassing

pending stores (stores before the load):

    address        value
    0x1234         not computed
    0x2345         0xFFFED
    0x4567         not computed
    0x9543         0x4123
    not computed   not computed

pending load: address 0x5678
check for conflicts; if no conflicts, run the load immediately


slide-23
SLIDE 23

load forwarding

pending stores (stores before the load):

    address        value
    0x1234         not computed
    0x5678         0xFFFED
    0x4567         not computed
    0x9543         0x4123
    not computed   not computed

pending load: address 0x5678
check for conflicts; use the value from the matching store


slide-25
SLIDE 25

sequentially consistent reordering

[timeline figure: while a block stays Shared, reading at any point is equivalent to reading at the commit (“read early”, commit checks state); while a block stays Modified/Exclusive, writing at any point is equivalent to writing at the commit (commit checks state, “write later”)]


slide-28
SLIDE 28

conflicts with optimizations

write buffers — need to reserve cache blocks early
load bypassing — needs to check cache state after stores happen
load forwarding — needs to check cache state (even though the value comes from the buffer)

slide-29
SLIDE 29

interaction with compilers

compilers also reorder loads/stores
e.g. loop optimizations, instruction scheduling
is this correct? depends on the memory model the compiler presents to the user

slide-30
SLIDE 30

two definitions

starting point: sequential consistency
system-centric: what reorderings can I observe?
programmer-centric: what do I do to get sequential consistency?

slide-31
SLIDE 31

relaxations


slide-32
SLIDE 32

read other’s write early

T3 reads X, post-update, before T4 receives its update
fix: delay reads until invalidations are entirely finished

figures from Boehm, “Foundations of the C++ Concurrency Memory Model”

slide-33
SLIDE 33

read other’s write early

delay reads until invalidations are entirely finished

figures from Boehm, “Foundations of the C++ Concurrency Memory Model”

slide-34
SLIDE 34

data-race-free

race: two operations, at least one a write, not separated by a synchronization operation
sequentially consistent only if there are no races
solution to races: add a synchronization operation
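A concrete instance of the definition above, as a C++ sketch (the `counter`/`add_many` names are mine): two threads writing a plain `int` without synchronization would be a data race; making the variable atomic turns each access into a synchronization operation and restores sequential consistency for this program.

```cpp
#include <atomic>
#include <thread>

// With a plain `int` here, the two writers below would race
// (two operations, at least one a write, no synchronization between
// them) and the program would have undefined behavior in C++.
std::atomic<int> counter{0};

void add_many(int n) {
    for (int i = 0; i < n; ++i) {
        counter.fetch_add(1); // atomic = synchronization operation, no race
    }
}
```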


slide-35
SLIDE 35

example: C++ memory model

almost data-race-free
explicit synchronization operations:

  library functions

compiler can do aggressive optimization in between
user’s perspective: anything can happen if you don’t synchronize

slide-36
SLIDE 36

prohibited optimization (1)

x = y = 0

thread 1:              thread 2:
if (x == 1)            if (y == 1)
    ++y;                   ++x;

optimized to:

thread 1:              thread 2:
++y;                   ++x;
if (x != 1) --y;       if (y != 1) --x;

Example from: Boehm, “Threads Cannot be Implemented as a Library”, 2004.


slide-37
SLIDE 37

prohibited optimization (2)

struct { char a; char b; char c; char d; } x;
...
x.b = 1;
x.c = 2;
x.d = 3;

  • optimized to:

struct { char a; char b; char c; char d; } x;
...
// pseudo-C code:
value = x.a | 0x01020300;
x = value;

Example from: Boehm, “Threads Cannot be Implemented as a Library”, 2004.


slide-38
SLIDE 38

lock-free stack (1)

class StackNode {
    StackNode *next;
    int value;
};
StackNode *head;

void Push(int newValue) {
    StackNode* newItem = new StackNode;
    newItem->value = newValue;
    do {
        newItem->next = head;
        MEMORY_FENCE(); // ???
    } while (!compare-and-swap(&head, newItem->next, newItem));
}
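One way to answer the `// ???` in C++: the fence becomes release ordering on the publishing CAS, so the new node's fields are visible before the node itself. This is a sketch with my own `std::atomic` mapping, and it deliberately ignores the reclamation problem the next slide mentions.

```cpp
#include <atomic>

struct StackNode {
    StackNode* next;
    int value;
};

std::atomic<StackNode*> head{nullptr};

void Push(int newValue) {
    StackNode* newItem = new StackNode;
    newItem->value = newValue;
    newItem->next = head.load(std::memory_order_relaxed);
    // Release on success publishes newItem's fields along with the
    // pointer. On failure, compare_exchange_weak reloads head into
    // newItem->next, so the loop body is just a retry.
    while (!head.compare_exchange_weak(newItem->next, newItem,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
        // retry
    }
}
```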


slide-39
SLIDE 39

lock-free stack (2)

class StackNode {
    StackNode *next;
    int value;
};
StackNode *head;

int Pop() {
    StackNode* removed;
    do {
        removed = head;
        MEMORY_FENCE(); // ???
    } while (!compare-and-swap(&head, removed, removed->next));
    /* missing: deallocating removed safely */
    return removed->value;
}


slide-40
SLIDE 40

wait-freedom

if you stop all other threads, one thread can always make progress
not true with locks — no progress if the thread holding the lock is stopped
good for latency?
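The contrast can be sketched with two ways to increment a counter (names are mine; the wait-freedom claim assumes hardware with a native fetch-and-add, e.g. x86 `LOCK XADD`):

```cpp
#include <atomic>

std::atomic<int> counter{0};

// Wait-free on hardware with native fetch-and-add: completes in a
// bounded number of steps regardless of what other threads do.
int increment_wait_free() {
    return counter.fetch_add(1) + 1;
}

// Only lock-free: some thread always succeeds, but this particular
// thread's CAS loop can in principle retry forever if others keep
// winning, so its step count is unbounded.
int increment_lock_free() {
    int old = counter.load();
    while (!counter.compare_exchange_weak(old, old + 1)) {
        // retry; failure reloads `old` with the current value
    }
    return old + 1;
}
```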


slide-41
SLIDE 41

next time: synchronization performance

this lock has a performance problem if contended
the cache block changes ownership many times

class Lock {
    int lockValue = 0;
    void lock() {
        while (!compare-and-swap(&lockValue, 0, 1)) {
            // retry
        }
    }
    void unlock() {
        MEMORY_FENCE();
        lockValue = 0;
    }
};

slide-42
SLIDE 42

next time — two papers

Anderson, 1990: how to do better than spinlocks
Guiroux et al., 2016: benchmarks 27 different locks on 35 applications

slide-43
SLIDE 43

aside: futex

Linux kernel mechanism to deschedule a thread
avoids the race condition where the lock value changes after unscheduling begins
explicit call to reschedule a thread
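A Linux-only sketch of the re-check the slide alludes to (wrapper names are mine): `FUTEX_WAIT` atomically compares the word against the expected value inside the kernel before sleeping, so a wakeup that lands between the user-space check and the sleep is not lost; if the value already changed, the call returns immediately with `EAGAIN` instead of sleeping.

```cpp
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <atomic>
#include <cerrno>

std::atomic<int> futex_word{1};

// Sleep until woken, but only if *addr still equals `expected`.
// Returns false without sleeping (errno == EAGAIN) if the value
// already changed -- the kernel-side re-check that avoids the race.
bool futex_wait(std::atomic<int>* addr, int expected) {
    long r = syscall(SYS_futex, reinterpret_cast<int*>(addr),
                     FUTEX_WAIT, expected, nullptr, nullptr, 0);
    return r == 0;
}

// Wake at most one thread sleeping on *addr.
void futex_wake(std::atomic<int>* addr) {
    syscall(SYS_futex, reinterpret_cast<int*>(addr),
            FUTEX_WAKE, 1, nullptr, nullptr, 0);
}
```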


slide-44
SLIDE 44

Homework 2 notes
