

slide-1
SLIDE 1

locks / cache coherency / spinlocks / other sync (intro)

1

slide-2
SLIDE 2

Changelog

12 Feb 2020: add solution slide for cache coherency exercise

1

slide-3
SLIDE 3

last time

pthread API

pthread_create — unlike fork, run particular function
pthread_join — like waitpid
pthread_exit — like exit

passing values to pthreads
atomic operations

entire operation visible or none of it (nothing in between)

writing notes (atomic load/store) to synchronize threads

technically possible, but we want better tools

mutual exclusion / critical section / locks

(code that) runs one at a time; abstraction to implement

2

slide-4
SLIDE 4

some definitions

mutual exclusion: ensuring only one thread does a particular thing at a time

like checking for and, if needed, buying milk

critical section: code that exactly one thread can execute at a time

result of critical section

lock: object only one thread can hold at a time

interface for creating critical sections

3


slide-7
SLIDE 7

the lock primitive

locks: an object with (at least) two operations:

acquire or lock — wait until lock is free, then “grab” it
release or unlock — let others use lock, wakeup waiters

typical usage: everyone acquires lock before using shared resource

forget to acquire lock? weird things happen

Lock(MilkLock);
if (no milk) {
    buy milk
}
Unlock(MilkLock);

4

slide-8
SLIDE 8

pthread mutex

#include <pthread.h>

pthread_mutex_t MilkLock;
pthread_mutex_init(&MilkLock, NULL);
...
pthread_mutex_lock(&MilkLock);
if (no milk) {
    buy milk
}
pthread_mutex_unlock(&MilkLock);

5

slide-9
SLIDE 9

xv6 spinlocks

#include "spinlock.h"
...
struct spinlock MilkLock;
initlock(&MilkLock, "name for debugging");
...
acquire(&MilkLock);
if (no milk) {
    buy milk
}
release(&MilkLock);

6

slide-10
SLIDE 10

lock analogy

agreement: whoever holds the flag can access shared resource

flag doesn’t actually do anything by itself…

acquire/lock ≈ wait for and grab flag from table
release/unlock ≈ put flag back on table
agreement is voluntary

thread tries to manipulate shared resource without lock? lock won’t stop it…

lock is held by particular thread

7


slide-12
SLIDE 12

C++ containers and locking

can you use a vector from multiple threads?
…question: how is it implemented?

dynamically allocated array, reallocated on size changes

can access from multiple threads…
…as long as not append/erase/etc.?
assuming it’s implemented like we expect…

but can we really depend on that?
e.g. could it shrink the internal array after a while with no expansion, to save memory?

8


slide-15
SLIDE 15

C++ standard rules for containers

multiple threads can read anything at the same time
can only read element if no other thread is modifying it
can safely add/remove elements if no other threads are accessing container

(sometimes can safely add/remove in extra cases)

exception: vectors of bools — can’t safely read and write at same time

might be implemented by putting multiple bools in one int

9

slide-16
SLIDE 16

implementing locks: single core

intuition: context switch only happens on interrupt

timer expiration, I/O, etc. causes OS to run

solution: disable them

reenable on unlock

x86 instructions:

cli — disable interrupts
sti — enable interrupts

10


slide-18
SLIDE 18

naive interrupt enable/disable (1)

Lock() { disable interrupts }
Unlock() { enable interrupts }

problem: user can hang the system:

Lock(some_lock);
while (true) {}

problem: can’t do I/O within lock

Lock(some_lock);
read from disk /* waits forever for (disabled) interrupt from disk IO finishing */

11


slide-21
SLIDE 21

naive interrupt enable/disable (2)

Lock() { disable interrupts }
Unlock() { enable interrupts }

problem: nested locks

Lock(milk_lock);
if (no milk) {
    Lock(store_lock);
    buy milk
    Unlock(store_lock); /* interrupts enabled here?? */
}
Unlock(milk_lock);

12


slide-25
SLIDE 25

xv6 interrupt disabling (1)

acquire(struct spinlock *lk) {
    pushcli(); // disable interrupts to avoid deadlock
    ... /* this part basically just for multicore */
}
release(struct spinlock *lk) {
    ... /* this part basically just for multicore */
    popcli();
}

13

slide-26
SLIDE 26

xv6 push/popcli

pushcli / popcli — need to be in pairs
pushcli — disable interrupts if not already
popcli — enable interrupts if corresponding pushcli disabled them

don’t enable them if they were already disabled

14

slide-27
SLIDE 27

a simple race

thread_A:
    movl $1, x      /* x ← 1 */
    movl y, %eax    /* return y */
    ret
thread_B:
    movl $1, y      /* y ← 1 */
    movl x, %eax    /* return x */
    ret

x = y = 0;
pthread_create(&A, NULL, thread_A, NULL);
pthread_create(&B, NULL, thread_B, NULL);
pthread_join(A, &A_result);
pthread_join(B, &B_result);
printf("A:%d B:%d\n", (int) A_result, (int) B_result);

if loads/stores atomic, then possible results:

A:1 B:1 — both moves into x and y, then both moves into eax execute
A:0 B:1 — thread A executes before thread B
A:1 B:0 — thread B executes before thread A

15


slide-29
SLIDE 29

a simple race: results

thread_A:
    movl $1, x      /* x ← 1 */
    movl y, %eax    /* return y */
    ret
thread_B:
    movl $1, y      /* y ← 1 */
    movl x, %eax    /* return x */
    ret

x = y = 0;
pthread_create(&A, NULL, thread_A, NULL);
pthread_create(&B, NULL, thread_B, NULL);
pthread_join(A, &A_result);
pthread_join(B, &B_result);
printf("A:%d B:%d\n", (int) A_result, (int) B_result);

my desktop, 100M trials:

frequency | result
99 823 739 | A:0 B:1 (‘A executes before B’)
171 161 | A:1 B:0 (‘B executes before A’)
4 706 | A:1 B:1 (‘execute moves into x+y first’)
394 | A:0 B:0 ???

16


slide-31
SLIDE 31

load/store reordering

loads/stores atomic, but run out of order

recall?: out-of-order processors
processor optimization: execute instructions in non-program order

hide delays from slow caches, variable computation rates, etc.

track side-effects within a thread to make it look as if in-order

but common choice: don’t worry as much between cores/threads
design decision: if programmer cares, they worry about it

17

slide-32
SLIDE 32

why load/store reordering?

prior example: load of x executing before store of y
why do this? otherwise delay the load

if x and y unrelated — no benefit to waiting

18

slide-33
SLIDE 33

aside: some x86 reordering rules

each core sees its own loads/stores in order

(if a core stores something, it can always load it back)

stores from other cores appear in a consistent order

(but a core might observe its own stores too early)

causality: if a core reads X=a and (after reading X=a) writes Y=b, then a core that reads Y=b cannot later read X=older value than a

Source: Intel 64 and IA-32 Software Developer’s Manual, Volume 3A, Chapter 8

19

slide-34
SLIDE 34

how do you do anything with this?

difficult to reason about what modern CPUs’ reordering rules do
typically: don’t depend on details; instead use special instructions with stronger (and simpler) ordering rules

often the same instructions that help with implementing locks in other ways

special instructions that restrict ordering of instructions around them (“fences”)

loads/stores can’t cross the fence

20

slide-35
SLIDE 35

compilers change loads/stores too (1)

void Alice() {
    note_from_alice = 1;
    do {} while (note_from_bob);
    if (no_milk) {++milk;}
}

Alice:
    movl $1, note_from_alice  // note_from_alice ← 1
    movl note_from_bob, %eax  // eax ← note_from_bob
.L2:
    testl %eax, %eax
    jne .L2                   // while (eax != 0) repeat
    cmpl $0, no_milk          // if (no_milk != 0) ...
    ...

21


slide-37
SLIDE 37

compilers change loads/stores too (2)

void Alice() {
    note_from_alice = 1;      // "Alice waiting" signal for Bob()
    do {} while (note_from_bob);
    if (no_milk) {++milk;}
    note_from_alice = 2;
}

Alice:
    // compiler optimization: don't set note_from_alice to 1
    // (why? it will be set to 2 anyway)
    movl note_from_bob, %eax  // eax ← note_from_bob
.L2:
    testl %eax, %eax
    jne .L2                   // while (eax != 0) repeat
    ...
    movl $2, note_from_alice  // note_from_alice ← 2

22


slide-40
SLIDE 40

pthreads and reordering

many pthreads functions prevent reordering

everything before function call actually happens before

includes preventing some optimizations

e.g. keeping global variable in register for too long

pthread_mutex_lock/unlock, pthread_create, pthread_join, …

basically: if pthreads is waiting for/starting something, no weird ordering

23

slide-41
SLIDE 41

C++: preventing reordering

to help implement things like pthread_mutex_lock
C++ 2011 standard: atomic header, std::atomic class
prevent CPU reordering and prevent compiler reordering
also provide other tools for implementing locks (more later)
could also hand-write assembly code

compiler can’t know what assembly code is doing

24

slide-42
SLIDE 42

C++: preventing reordering example

#include <atomic>
void Alice() {
    note_from_alice = 1;
    do {
        std::atomic_thread_fence(std::memory_order_seq_cst);
    } while (note_from_bob);
    if (no_milk) {++milk;}
}

Alice:
    movl $1, note_from_alice  // note_from_alice ← 1
.L2:
    mfence                    // make sure store visible on/from other cores
    cmpl $0, note_from_bob    // if (note_from_bob != 0) repeat fence
    jne .L2
    cmpl $0, no_milk
    ...

25

slide-43
SLIDE 43

C++ atomics: no reordering

std::atomic<int> note_from_alice, note_from_bob;
void Alice() {
    note_from_alice.store(1);
    do { } while (note_from_bob.load());
    if (no_milk) {++milk;}
}

Alice:
    movl $1, note_from_alice
    mfence
.L2:
    movl note_from_bob, %eax
    testl %eax, %eax
    jne .L2
    ...

26

slide-44
SLIDE 44

mfence

x86 instruction mfence:
make sure all loads/stores in progress finish
…and make sure no loads/stores were started early
fairly expensive

Intel ‘Skylake’: on the order of 33 cycles + time waiting for pending stores/loads

27

slide-45
SLIDE 45

GCC: built-in atomic functions

used to implement std::atomic, etc.
predate std::atomic
builtin functions starting with __sync and __atomic
these are what xv6 uses

28

slide-46
SLIDE 46

connecting CPUs and memory

multiple processors, common memory
how do processors communicate with memory?

29

slide-47
SLIDE 47

shared bus

[diagram: CPU1–CPU4 and MEM1, MEM2 all connected to one shared bus]

tagged messages — everyone gets everything, filters
contention if multiple communicators

some hardware enforces only one at a time

30

slide-48
SLIDE 48

shared buses and scaling

shared buses perform poorly with “too many” CPUs
so, there are other designs
we’ll gloss over these for now

31

slide-49
SLIDE 49

shared buses and caches

remember caches? memory is pretty slow
each CPU wants to keep local copies of memory
what happens when multiple CPUs cache same memory?

32

slide-50
SLIDE 50

the cache coherency problem

[diagram: CPU1 and CPU2, each with a cache, on a shared bus to MEM1]

CPU1’s cache:
address | value
0xA300 | 100
0xC400 | 200
0xE500 | 300

CPU2’s cache:
address | value
0x9300 | 172
0xA300 | 100
0xC500 | 200

CPU1 writes 101 to 0xA300 — when does each cached copy of 0xA300 change?

33

slide-51
SLIDE 51

the cache coherency problem

CPU1’s cache:
address | value
0xA300 | 100 → 101
0xC400 | 200
0xE500 | 300

CPU2’s cache:
address | value
0x9300 | 172
0xA300 | 100
0xC500 | 200

CPU1 writes 101 to 0xA300 — when does each cached copy of 0xA300 change?

33

slide-52
SLIDE 52

“snooping” the bus

every processor already receives every read/write to memory
take advantage of this to update caches
idea: use messages to clean up “bad” cache entries

34

slide-53
SLIDE 53

cache coherency states

extra information for each cache block

overlaps with/replaces valid, dirty bits

stored in each cache
update states based on reads, writes and heard messages on bus
different caches may have different states for same block

35

slide-54
SLIDE 54

MSI state summary

Modified — value may be different than memory and I am the only one who has it
Shared — value is the same as memory
Invalid — I don’t have the value; I will need to ask for it

36

slide-55
SLIDE 55

MSI scheme

from state | hear read | hear write | read | write
Invalid | — | — | to Shared | to Modified
Shared | — | to Invalid | — | to Modified
Modified | to Shared | to Invalid | — | —

blue: transition requires sending message on bus

example: write while Shared
must send write — inform others with Shared state
then change to Modified

example: hear write while Shared
change to Invalid
can send read later to get value from writer

example: write while Modified
nothing to do — no other CPU can have a copy

37


slide-58
SLIDE 58

MSI example

[diagram: CPU1 and CPU2, each with a cache, on a shared bus to MEM1]

CPU1’s cache:
address | value | state
0xA300 | 100 | Shared
0xC400 | 200 | Shared
0xE500 | 300 | Shared

CPU2’s cache:
address | value | state
0x9300 | 172 | Shared
0xA300 | 100 | Shared
0xC500 | 200 | Shared

CPU1 writes 101 to 0xA300 — bus message: “CPU1 is writing 0xA300”; CPU2’s cache sees the write and invalidates 0xA300 (maybe update memory?)
CPU1 writes 102 to 0xA300 — Modified state: nothing communicated! will “fix” later if there’s a read; nothing changed in memory yet (writeback)
CPU2 reads 0xA300 — bus message: “What is 0xA300?”; CPU1 in Modified state must update for CPU2: “Write 102 into 0xA300”
CPU2 reads 0xA300 — value written back to memory early (CPU1 could also become Invalid instead of Shared)

38

slide-59
SLIDE 59

MSI example

CPU1’s cache:
address | value | state
0xA300 | 100 → 101 | Modified
0xC400 | 200 | Shared
0xE500 | 300 | Shared

CPU2’s cache:
address | value | state
0x9300 | 172 | Shared
0xA300 | 100 | Invalid
0xC500 | 200 | Shared


38

slide-60
SLIDE 60

MSI example

CPU1’s cache:
address | value | state
0xA300 | 101 → 102 | Modified
0xC400 | 200 | Shared
0xE500 | 300 | Shared

CPU2’s cache:
address | value | state
0x9300 | 172 | Shared
0xA300 | 100 | Invalid
0xC500 | 200 | Shared


38

slide-61
SLIDE 61

MSI example

CPU1’s cache:
address | value | state
0xA300 | 102 | Modified
0xC400 | 200 | Shared
0xE500 | 300 | Shared

CPU2’s cache:
address | value | state
0x9300 | 172 | Shared
0xA300 | 100 | Invalid
0xC500 | 200 | Shared


38

slide-62
SLIDE 62

MSI example

CPU1’s cache:
address | value | state
0xA300 | 102 | Shared
0xC400 | 200 | Shared
0xE500 | 300 | Shared

CPU2’s cache:
address | value | state
0x9300 | 172 | Shared
0xA300 | 100 | Invalid
0xC500 | 200 | Shared


38

slide-63
SLIDE 63

MSI example

CPU1’s cache:
address | value | state
0xA300 | 102 | Shared
0xC400 | 200 | Shared
0xE500 | 300 | Shared

CPU2’s cache:
address | value | state
0x9300 | 172 | Shared
0xA300 | 100 → 102 | Shared
0xC500 | 200 | Shared


38

slide-64
SLIDE 64

MSI: update memory

to write value (enter Modified state), need to invalidate others
can avoid sending actual value (shorter message/faster)
“I am writing address X” versus “I am writing Y to address X”

39

slide-65
SLIDE 65

MSI: on cache replacement/writeback

still happens — e.g. want to store something else
changes state to Invalid
requires writeback if Modified (= dirty bit)

40

slide-66
SLIDE 66

cache coherency exercise

Modified/Shared/Invalid; all initially Invalid; 32B blocks, 8B reads/writes

CPU 1: read 0x1000
CPU 2: read 0x1000
CPU 1: write 0x1000
CPU 1: read 0x2000
CPU 2: read 0x1000
CPU 2: write 0x2008
CPU 3: read 0x1008

Q1: final state of 0x1000 in caches?
Modified/Shared/Invalid for CPU 1: ___ CPU 2: ___ CPU 3: ___

Q2: final state of 0x2000 in caches?
Modified/Shared/Invalid for CPU 1: ___ CPU 2: ___ CPU 3: ___

41

slide-67
SLIDE 67

cache coherency exercise solution

action | 0x1000–0x101f (CPU 1 / 2 / 3) | 0x2000–0x201f (CPU 1 / 2 / 3)
(initially) | I I I | I I I
CPU 1: read 0x1000 | S I I | I I I
CPU 2: read 0x1000 | S S I | I I I
CPU 1: write 0x1000 | M I I | I I I
CPU 1: read 0x2000 | M I I | S I I
CPU 2: read 0x1000 | S S I | S I I
CPU 2: write 0x2008 | S S I | I M I
CPU 3: read 0x1008 | S S S | I M I

43

slide-68
SLIDE 68

MSI extensions

real cache coherency protocols are sometimes more complex:
separately track modifications from whether other caches have a copy
send values directly between caches (maybe skip write to memory)
send messages only to cores which might care (no shared bus)

44

slide-69
SLIDE 69

modifying cache blocks in parallel

cache coherency works on cache blocks but typical memory access — less than cache block

e.g. one 4-byte array element in 64-byte cache block

what if two processors modify different parts of the same cache block?

4-byte writes to 64-byte cache block

cache coherency — write instructions happen one at a time:

processor ‘locks’ 64-byte cache block, fetching latest version
processor updates 4 bytes of 64-byte cache block
later, processor might give up cache block

45

slide-70
SLIDE 70

modifying things in parallel (code)

void *sum_up(void *raw_dest) {
    int *dest = (int *) raw_dest;
    for (int i = 0; i < 64 * 1024 * 1024; ++i) {
        *dest += data[i];
    }
}

__attribute__((aligned(4096)))
int array[1024]; /* aligned = address is mult. of 4096 */

void sum_twice(int distance) {
    pthread_t threads[2];
    pthread_create(&threads[0], NULL, sum_up, &array[0]);
    pthread_create(&threads[1], NULL, sum_up, &array[distance]);
    pthread_join(threads[0], NULL);
    pthread_join(threads[1], NULL);
}

46

slide-71
SLIDE 71

performance v. array element gap

(assuming sum_up compiled to not omit memory accesses)

[plot: time (cycles, up to ~500 000 000) versus distance between array elements (10–70 bytes)]

47

slide-72
SLIDE 72

false sharing

synchronizing to access two independent things
two parts of same cache block
solution: separate them

48

slide-73
SLIDE 73

atomic read-modify-write

really hard to build locks from just atomic load/store

and normal loads/stores aren’t even atomic…

…so processors provide read/modify/write operations

one instruction that atomically reads and modifies and writes back a value

49

slide-74
SLIDE 74

x86 atomic exchange

lock xchg (%ecx), %eax

atomic exchange:
temp ← M[ECX]
M[ECX] ← EAX
EAX ← temp
…without being interrupted by other processors, etc.

50

slide-75
SLIDE 75

test-and-set: using atomic exchange

one instruction that…

writes a fixed new value and reads the old value
write: mark a lock as TAKEN (no matter what)
read: see if it was already TAKEN (if not, we are the ones who took it)

51


slide-77
SLIDE 77

implementing atomic exchange

get cache block into Modified state
do read+modify+write operation while state doesn’t change
recall: Modified state = “I am the only one with a copy”

52

slide-78
SLIDE 78

x86-64 spinlock with xchg

lock variable in shared memory: the_lock
if 1: someone has the lock; if 0: lock is free to take

acquire:
    movl $1, %eax            // %eax ← 1
    lock xchg %eax, the_lock // swap %eax and the_lock
                             // sets the_lock to 1 (taken)
                             // sets %eax to prior val. of the_lock
    test %eax, %eax          // if the_lock wasn't 0 before:
    jne acquire              //     try again
    ret

release:
    mfence                   // for memory order reasons
    movl $0, the_lock        // then, set the_lock to 0 (not taken)
    ret

set lock variable to 1 (taken); read old value
if lock was already locked: retry — “spin” until lock is released elsewhere
release lock by setting it to 0 (not taken); allows looping acquire to finish
Intel’s manual says: no reordering of loads/stores across a lock-prefixed or mfence instruction

53


slide-83
SLIDE 83

some common atomic operations (1)

// x86: emulate with exchange
test_and_set(address) {
    old_value = memory[address];
    memory[address] = 1;
    return old_value != 0; // e.g. set ZF flag
}

// x86: xchg REGISTER, (ADDRESS)
exchange(register, address) {
    temp = memory[address];
    memory[address] = register;
    register = temp;
}

54

slide-84
SLIDE 84

some common atomic operations (2)

// x86: mov OLD_VALUE, %eax; lock cmpxchg NEW_VALUE, (ADDRESS)
compare-and-swap(address, old_value, new_value) {
    if (memory[address] == old_value) {
        memory[address] = new_value;
        return true; // x86: set ZF flag
    } else {
        return false; // x86: clear ZF flag
    }
}

// x86: lock xaddl REGISTER, (ADDRESS)
fetch-and-add(address, register) {
    old_value = memory[address];
    memory[address] += register;
    register = old_value;
}

55

slide-85
SLIDE 85

common atomic operation pattern

try to do the operation, …
detect if it failed
if so, repeat

the atomic operation does the "try and see if it failed" part

56

slide-86
SLIDE 86

backup slides

57

slide-87
SLIDE 87

GCC: preventing reordering example (1)

void Alice() {
    int one = 1;
    __atomic_store(&note_from_alice, &one, __ATOMIC_SEQ_CST);
    do {
    } while (__atomic_load_n(&note_from_bob, __ATOMIC_SEQ_CST));
    if (no_milk) { ++milk; }
}

Alice:
    movl $1, note_from_alice
    mfence
.L2:
    movl note_from_bob, %eax
    testl %eax, %eax
    jne .L2
    ...

58

slide-88
SLIDE 88

GCC: preventing reordering example (2)

void Alice() {
    note_from_alice = 1;
    do {
        __atomic_thread_fence(__ATOMIC_SEQ_CST);
    } while (note_from_bob);
    if (no_milk) { ++milk; }
}

Alice:
    movl $1, note_from_alice // note_from_alice ← 1
.L3:
    mfence                   // make sure store is visible to other cores before loading
                             // on x86: not needed on second+ iteration of loop
    cmpl $0, note_from_bob   // if (note_from_bob == 0) repeat fence
    jne .L3
    cmpl $0, no_milk
    ...

59

slide-89
SLIDE 89

xv6 spinlock: debugging stuff

void acquire(struct spinlock *lk) {
    ...
    if (holding(lk))
        panic("acquire");
    ...
    // Record info about lock acquisition for debugging.
    lk->cpu = mycpu();
    getcallerpcs(&lk, lk->pcs);
}

void release(struct spinlock *lk) {
    if (!holding(lk))
        panic("release");
    lk->pcs[0] = 0;
    lk->cpu = 0;
    ...
}

60


slide-93
SLIDE 93

exercise: fetch-and-add with compare-and-swap

exercise: implement fetch-and-add with compare-and-swap

compare_and_swap(address, old_value, new_value) {
    if (memory[address] == old_value) {
        memory[address] = new_value;
        return true;  // x86: set ZF flag
    } else {
        return false; // x86: clear ZF flag
    }
}

61

slide-94
SLIDE 94

solution

long my_fetch_and_add(long *p, long amount) {
    long old_value;
    do {
        old_value = *p;
    } while (!compare_and_swap(p, old_value, old_value + amount));
    return old_value;
}

62

slide-95
SLIDE 95

xv6 spinlock: acquire

void acquire(struct spinlock *lk) {
    pushcli(); // disable interrupts to avoid deadlock.
    ...
    // The xchg is atomic.
    while (xchg(&lk->locked, 1) != 0)
        ;

    // Tell the C compiler and the processor to not move loads or stores
    // past this point, to ensure that the critical section's memory
    // references happen after the lock is acquired.
    __sync_synchronize();
    ...
}

don’t let us be interrupted while we have the lock
    problem: an interruption might try to do something with the lock
    …but that can never succeed until we release the lock
    …but we won’t release the lock until the interruption finishes
xchg wraps the lock xchg instruction; same loop as before
avoid load/store reordering (including by the compiler)
    on x86, xchg alone is enough to avoid the processor’s reordering
    (but the compiler may need more hints)

63


slide-99
SLIDE 99

xv6 spinlock: release

void release(struct spinlock *lk) {
    ...
    // Tell the C compiler and the processor to not move loads or stores
    // past this point, to ensure that all the stores in the critical
    // section are visible to other cores before the lock is released.
    // Both the C compiler and the hardware may re-order loads and
    // stores; __sync_synchronize() tells them both not to.
    __sync_synchronize();

    // Release the lock, equivalent to lk->locked = 0.
    // This code can't use a C assignment, since it might
    // not be atomic. A real OS would use C atomics here.
    asm volatile("movl $0, %0" : "+m" (lk->locked) : );

    popcli();
}

__sync_synchronize() turns into an instruction telling the processor not to reorder, and also tells the compiler not to reorder
the asm turns into a mov of constant 0 into lk->locked
popcli() reenables interrupts (taking nested locks into account)

64


slide-103
SLIDE 103

fetch-and-add with CAS (1)

compare_and_swap(address, old_value, new_value) {
    if (memory[address] == old_value) {
        memory[address] = new_value;
        return true;
    } else {
        return false;
    }
}

long my_fetch_and_add(long *pointer, long amount) {
    ...
}

implementation sketch:

fetch the value from pointer (call it old)
compute in a temporary value the result of the addition (new)
try to change the value at pointer from old to new [compare-and-swap]
if not successful, repeat

65

slide-104
SLIDE 104

fetch-and-add with CAS (2)

long my_fetch_and_add(long *p, long amount) {
    long old_value;
    do {
        old_value = *p;
    } while (!compare_and_swap(p, old_value, old_value + amount));
    return old_value;
}

66

slide-105
SLIDE 105

exercise: append to singly-linked list

ListNode is a singly-linked list
assume: threads only append to the list (no deletions, no reordering)
use compare-and-swap(pointer, old, new):
    atomically change *pointer from old to new
    return true if successful
    return false (and change nothing) if *pointer is not old

void append_to_list(ListNode *head, ListNode *new_last_node) { ... }

67

slide-106
SLIDE 106

append to singly-linked list

/* assumption: other threads may be appending to the list,
 * but nodes are not being removed, reordered, etc. */
void append_to_list(ListNode *head, ListNode *new_last_node) {
    memory_ordering_fence();
    ListNode *current_last_node;
    do {
        current_last_node = head;
        while (current_last_node->next) {
            current_last_node = current_last_node->next;
        }
    } while (
        !compare_and_swap(&current_last_node->next, NULL, new_last_node)
    );
}

69