Replication and Consistency: 08 Spin Locking and Contention



SLIDE 1

Replication and Consistency

08 Spin Locking and Contention Annette Bieniusa

AG Softech FB Informatik TU Kaiserslautern

Annette Bieniusa Replication and Consistency 1/ 76

SLIDE 2

Thank you!

These slides are based on companion material of the following books:

- The Art of Multiprocessor Programming by Maurice Herlihy and Nir Shavit
- Synchronization Algorithms and Concurrent Programming by Gadi Taubenfeld

SLIDE 3

Previously on Replication and Consistency

Models

- Accurate (we never lied to you)
- But idealized (we forgot to mention a few things)

Protocols

- Elegant
- Essential
- But naive

SLIDE 4

New Focus: Performance in Real Systems

Models

- More complicated (more details)
- Still focus on principles (not soon to become obsolete)

Protocols

- Elegant (in their fashion)
- Important (why else would we discuss them)
- And realistic (more optimizations will be possible, though)

SLIDE 5

Mutual Exclusion, revisited

- Think of performance, not just correctness and progress
- Begin to understand how performance depends on our software properly utilizing the multiprocessor machine’s hardware
- And get to know a collection of locking algorithms

SLIDE 7

If a processor doesn’t get a lock . . .

Question

What can the processor do?

Keep trying

- “Spin” or “busy-wait”, as in the Filter and Bakery algorithms
- Useful on multiprocessors if expected delays are short

Suspend and allow the scheduler to schedule other processes

- “Blocking”, as with Java’s monitors
- Good if delays are long
- Always good on uniprocessors

In practice, often a mix of both strategies

- Spin for a short time
- Then suspend
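The mixed strategy can be sketched in a few lines of Java. This is only an illustration, not code from the lecture: the class names and the SPIN_LIMIT constant are made up, and the blocking part is delegated to java.util.concurrent.locks.ReentrantLock. The thread spins on the non-blocking tryLock() for a bounded number of attempts and suspends via lock() only if that fails.

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch only: spin briefly, then fall back to a blocking acquire.
class SpinThenBlockLock {
    // Assumed tuning knob; a sensible value depends on the machine.
    static final int SPIN_LIMIT = 1_000;

    private final ReentrantLock inner = new ReentrantLock();

    void lock() {
        for (int i = 0; i < SPIN_LIMIT; i++) {
            if (inner.tryLock()) {
                return;               // acquired while spinning
            }
            Thread.onSpinWait();      // busy-wait hint (Java 9+)
        }
        inner.lock();                 // delay looks long: suspend until available
    }

    void unlock() {
        inner.unlock();
    }
}

class SpinThenBlockDemo {
    // Runs `threads` workers that each perform `perThread` locked increments.
    static int run(int threads, int perThread) {
        SpinThenBlockLock lock = new SpinThenBlockLock();
        int[] counter = {0};          // protected by the lock
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    lock.lock();
                    try {
                        counter[0]++;
                    } finally {
                        lock.unlock();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10_000));
    }
}
```

How long to spin before suspending is a tuning decision; the constant above is a placeholder, not a recommendation.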

SLIDE 8

Basic Spin-Lock

- Contention: Multiple threads try to acquire the lock at the same time
- How can we avoid or alleviate contention?

SLIDE 9

Test-and-Set (TAS) revisited

- Machine instruction on one word (here: for boolean values)
- Atomically swaps a new value with the prior value and returns the prior value
- Swapping in true is called Test-And-Set
- Aka getAndSet() in Java

    // Package java.util.concurrent.atomic (simplified model, not the real source)
    public class AtomicBoolean {
        boolean value;

        // "synchronized" only models the atomicity here;
        // the real method is implemented as one hardware instruction
        public synchronized boolean getAndSet(boolean newValue) {
            boolean prior = value;
            value = newValue;
            return prior;
        }
    }

SLIDE 10

Task: Design a lock using Test-and-Set (TAS)!

    class TASLock implements Lock {
        // if false, lock is free
        // if true, lock is taken
        AtomicBoolean state = new AtomicBoolean(false);

        void lock() {
            // TODO
        }

        void unlock() {
            // TODO
        }
    }

SLIDE 11

Test-and-Set Lock

    class TASLock {
        AtomicBoolean state = new AtomicBoolean(false);

        void lock() {
            // keep swapping in true until the prior value was false (lock was free)
            while (state.getAndSet(true)) {}
        }

        void unlock() {
            state.set(false);
        }
    }

SLIDE 13

Space Complexity

TAS spin-lock has small “footprint”

- An N-thread spin-lock uses O(1) space
- As opposed to O(N) for Peterson/Bakery

Question

How did we overcome the Ω(N) lower bound?
⇒ Use an object with a higher consensus number! The bound only holds for locks built from read/write registers.

SLIDE 14

Performance Evaluation

Experiment

- Spawn N threads
- Increment a shared counter 1 million times
- Work is split between the threads, i.e. each thread does 10^6/N increments
- Each thread takes the lock, increments the counter, releases the lock

How long should it take?
How long does it take?
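The experiment above could be set up roughly as follows. This is a sketch under assumptions: the harness class CounterExperiment and its methods are invented for illustration, timing uses System.nanoTime(), and the total number of increments is assumed to be divisible by the thread count. Absolute timings depend entirely on the machine, so none are quoted here.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// TAS lock as on the slides, plus a spin-wait hint.
class TASLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    void lock() {
        while (state.getAndSet(true)) {
            Thread.onSpinWait();
        }
    }

    void unlock() {
        state.set(false);
    }
}

// Hypothetical harness for the counter experiment.
class CounterExperiment {
    private final TASLock lock = new TASLock();
    private long counter = 0;   // shared counter, protected by the lock

    // Splits `total` increments evenly over `threads` workers (total is
    // assumed to be divisible by threads); returns the elapsed nanoseconds.
    long run(int threads, int total) {
        counter = 0;
        int perThread = total / threads;
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    lock.lock();
                    counter++;
                    lock.unlock();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return System.nanoTime() - start;
    }

    long count() {
        return counter;
    }

    public static void main(String[] args) {
        CounterExperiment experiment = new CounterExperiment();
        for (int n : new int[] {1, 2, 4, 8}) {
            long nanos = experiment.run(n, 1_000_000);
            System.out.println(n + " threads: " + nanos / 1_000_000 + " ms");
        }
    }
}
```

Swapping in a different lock class with the same lock()/unlock() methods is enough to compare the algorithms from the rest of this lecture.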

SLIDE 15

Hypothesis

No speedup because the lock is a sequential bottleneck (Amdahl’s law!)
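As a reminder, Amdahl's law can be written as below. The symbols are assumptions introduced here, not taken from the slides: p is the fraction of the work that can run in parallel, N the number of processors, and S(N) the resulting speedup. In the experiment every increment happens inside the critical section, so p is close to 0 and the best case is no speedup at all.

```latex
S(N) = \frac{1}{(1 - p) + \frac{p}{N}},
\qquad
p = 0 \;\Rightarrow\; S(N) = 1 \quad \text{for all } N
```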

SLIDE 16

Mystery 1

A typical evaluation looks like this: [plot not preserved]

SLIDE 17

New approach: Test-and-Test-and-Set Locks

Lurking stage

- Wait until the lock seems to be free
- Spin while read returns true (lock taken)

Pouncing stage

- As soon as the lock seems to be available (read returns false, lock free)
- Call TAS to acquire the lock
- If TAS loses, go back to lurking

SLIDE 18

Test-and-Test-and-Set Locks

    class TTASLock extends TASLock {
        void lock() {
            while (true) {
                while (state.get()) {}          // Lurk: read-only spinning
                if (!state.getAndSet(true))     // Pounce
                    return;
            }
        }
    }

SLIDE 21

Mystery 2

- Both TAS and TTAS do the same thing in our model
- But TTAS performs much better in actual evaluations
- Neither approach is ideal

Our memory abstraction is broken! We need a more detailed model!

SLIDE 22

Bus-Based Architectures

Random Access Memory (access time: 10s of cycles)

Shared bus as broadcast medium

- One broadcaster at a time
- Other processors and memory can passively listen

Per-processor caches (access time: 1-2 cycles)

SLIDE 24

Cache Coherence

We have lots of copies of data

- Original copy in memory
- Cached copies at processors

If some processor modifies its own copy:

- What do we do with the others?
- How do we avoid confusion about the actual value?

Cache coherence protocol!

SLIDE 25

Write-Back Caches

Idea: Accumulate changes in the cache and write back only when needed

- Because we need the cache for something else
- Or because another processor wants to read the changed value

On first modification, invalidate all other entries

A cache entry can be marked as dirty (i.e. it must eventually be written back to main memory)

SLIDE 26

When a thread modifies its cached value, it invalidates all other caches.

When another thread wants to read, the owner responds.

SLIDE 30

Mystery Explained!

TAS-Lock

- Spinning threads invalidate the cache line with every TAS, which keeps the bus busy
- A thread that wants to release the lock is delayed behind the spinners

TTAS-Lock

- Threads spin on their local cache
- No bus use while the lock is taken
- Problem: When the lock is released, reads are satisfied sequentially on the bus
- Eventually the system quiesces after the lock has been acquired
  → quiescence time is linear in the number of threads for a bus architecture

SLIDE 32

Solution: Introduce Delay

“If the lock looks free, but I fail to get it, there must be lots of contention!”
⇒ Better to back off than to collide again

Example: Exponential Backoff

- If I fail to get the lock, wait a random duration before retrying
- Each subsequent failure doubles the expected wait (up to a fixed maximum)

SLIDE 33

Exponential Backoff Lock

    class Backoff extends TTASLock {
        // sleep() and random() are pseudocode helpers
        void lock() {
            int delay = MIN_DELAY;
            while (true) {
                while (state.get()) {}           // lurk
                if (!state.getAndSet(true))      // pounce
                    return;
                // if not successful, we wait
                sleep(random() % delay);
                if (delay < MAX_DELAY)
                    delay = 2 * delay;
            }
        }
    }
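A self-contained variant of the backoff lock might look as follows. It is a sketch, not the lecture's reference implementation: the delay bounds are arbitrary placeholders, the random delay comes from ThreadLocalRandom, the wait uses LockSupport.parkNanos(), and the demo() helper exists only to exercise the lock.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

class BackoffLock {
    // Assumed bounds; suitable values differ between platforms.
    static final long MIN_DELAY_NANOS = 1_000;
    static final long MAX_DELAY_NANOS = 1_000_000;

    private final AtomicBoolean state = new AtomicBoolean(false);

    void lock() {
        long limit = MIN_DELAY_NANOS;
        while (true) {
            while (state.get()) {
                Thread.onSpinWait();                 // lurk
            }
            if (!state.getAndSet(true)) {
                return;                              // pounce succeeded
            }
            // Lost the race: wait a random time below the current limit.
            long delay = 1 + ThreadLocalRandom.current().nextLong(limit);
            LockSupport.parkNanos(delay);
            if (limit < MAX_DELAY_NANOS) {
                limit *= 2;                          // double the expected wait
            }
        }
    }

    void unlock() {
        state.set(false);
    }

    // Demo helper: `threads` workers each do `perThread` locked increments.
    static int demo(int threads, int perThread) {
        BackoffLock lock = new BackoffLock();
        int[] counter = {0};                         // protected by the lock
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    lock.lock();
                    counter[0]++;
                    lock.unlock();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 10_000));
    }
}
```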

SLIDE 35

Exponential Backoff Lock

- Easy to implement
- But must choose parameters carefully
- Not portable across platforms

Idea

- Avoid useless invalidations by keeping a queue of threads
- Each thread notifies the next in line without bothering the others

SLIDE 43

Anderson Queue Lock

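    class ALock implements Lock {
        final int n;                          // maximum number of threads
        // one flag per thread; real Java needs volatile element access
        // here (e.g. AtomicIntegerArray)
        final boolean[] flags;
        AtomicInteger next = new AtomicInteger(0);
        ThreadLocal<Integer> mySlot = new ThreadLocal<Integer>();   // per thread

        ALock(int n) {
            this.n = n;
            flags = new boolean[n];
            flags[0] = true;                  // {true, false, ..., false}
        }

        void lock() {
            int slot = next.getAndIncrement() % n;
            mySlot.set(slot);
            while (!flags[slot]) {}           // spin
            flags[slot] = false;              // prepare for re-use (wrong in Figure!)
        }

        void unlock() {
            flags[(mySlot.get() + 1) % n] = true;   // tell next thread
        }
    }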

SLIDE 44

Anderson Lock

- FIFO fairness, no lockout
- Scalable performance
  - Threads spin on a locally cached copy of a single array location
  - But beware of false sharing of items on the same cache line!
  - Invalidations always happen per cache line
  - Trick: Use padding to avoid sharing
- Not space-efficient
- Requires knowledge about the number of threads
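The padding trick can be sketched like this. Assumptions: cache lines are 64 bytes, so slots are placed 16 ints apart in an AtomicIntegerArray (which also provides the volatile element access that a plain boolean[] lacks); the capacity is at least the number of competing threads; counter wrap-around is ignored. The spin loop calls Thread.yield() only so that the demo also finishes quickly on machines with fewer cores than threads.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

class PaddedALock {
    // 16 ints = 64 bytes: assumes 64-byte cache lines.
    static final int PADDING = 16;

    private final int capacity;               // maximum number of threads
    private final AtomicIntegerArray flags;   // 1 = may enter, 0 = must wait
    private final AtomicInteger next = new AtomicInteger(0);
    private final ThreadLocal<Integer> mySlot = new ThreadLocal<>();

    PaddedALock(int capacity) {
        this.capacity = capacity;
        this.flags = new AtomicIntegerArray(capacity * PADDING);
        flags.set(0, 1);                       // slot 0 may enter first
    }

    void lock() {
        // Sketch ignores wrap-around of the ticket counter.
        int slot = Math.floorMod(next.getAndIncrement(), capacity);
        mySlot.set(slot);
        while (flags.get(slot * PADDING) == 0) {
            Thread.yield();                    // spin on my own, padded slot
        }
        flags.set(slot * PADDING, 0);          // prepare slot for re-use
    }

    void unlock() {
        int nextSlot = (mySlot.get() + 1) % capacity;
        flags.set(nextSlot * PADDING, 1);      // tell next thread only
    }

    // Demo helper: `threads` workers each do `perThread` locked increments.
    static int demo(int threads, int perThread) {
        PaddedALock lock = new PaddedALock(threads);
        int[] counter = {0};                   // protected by the lock
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    lock.lock();
                    counter[0]++;
                    lock.unlock();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 5_000));
    }
}
```

The padding makes the space overhead worse, which is the trade-off named in the list above: 16 array entries are reserved per thread and only one of them is used.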

SLIDE 45

CLH Lock (by Craig, Landin, Hagersten)

[Figure slides not preserved: CLH Lock: Acquiring a lock; CLH Lock: It’s a Queue!; CLH Lock: Releasing a lock]
SLIDE 51

Remarks

- Threads spin on a cached copy (efficient)
- A thread can reuse its predecessor’s node for future lock accesses

SLIDE 52

CLH Lock

    class Qnode {
        volatile boolean locked = false;
    }

    class CLHLock implements Lock {
        // tail initially points to a released dummy node
        AtomicReference<Qnode> tail = new AtomicReference<Qnode>(new Qnode());
        ThreadLocal<Qnode> myNode = ThreadLocal.withInitial(Qnode::new);   // per thread
        ThreadLocal<Qnode> myPred = new ThreadLocal<Qnode>();              // per thread

        void lock() {
            Qnode qnode = myNode.get();
            qnode.locked = true;
            Qnode pred = tail.getAndSet(qnode);   // swap my node into queue
            myPred.set(pred);
            while (pred.locked) {}                // spin
        }

        void unlock() {
            myNode.get().locked = false;
            myNode.set(myPred.get());             // "reuse" predecessor's qnode (see book)
        }
    }

SLIDE 53

CLH Lock

- Lock release affects only the successor
- Does not depend on prior knowledge about the number of threads
- FIFO fairness
- But doesn’t work (efficiently) for uncached NUMA architectures

SLIDE 54

NUMA Architectures

- Non-Uniform Memory Access
- Model: Flat shared memory, no caches (in most variants)
- Some memory regions are faster to access than others
- Spinning on remote memory is slow

SLIDE 55

MCS Lock (by Mellor-Crummey and Scott)

- FIFO order
- Spin on local memory only
- Small, constant-size overhead

Idea: To acquire the lock, place own Qnode at the tail of the list. If it has a predecessor, modify the predecessor’s node to refer to own Qnode.

[Figure slides not preserved: MCS Lock: Acquiring a lock; MCS Lock: Releasing a lock]
SLIDE 63

MCS Lock

    class Qnode {
        // only reads/writes required; volatile so that spinners see updates
        volatile boolean locked = false;
        volatile Qnode next = null;
    }

SLIDE 64

MCS Lock

    class MCSLock implements Lock {
        AtomicReference<Qnode> tail = new AtomicReference<Qnode>(null);
        ThreadLocal<Qnode> qnode = ThreadLocal.withInitial(Qnode::new);   // per thread

        void lock() {
            Qnode qnode = this.qnode.get();
            // reset for reuse
            qnode.next = null;
            qnode.locked = false;
            // swap my node in
            Qnode pred = tail.getAndSet(qnode);
            if (pred != null) {
                // lock is taken, so set my status to wait
                qnode.locked = true;
                // tell predecessor where to find me
                pred.next = qnode;
                // spin on my node
                while (qnode.locked) {}
            }
        }
        ...

SLIDE 65

MCS Lock: Releasing

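- qnode.next == null alone does not mean that no thread is waiting: another thread may already have swapped itself into tail without having set the predecessor’s next field yet
- In that case, we need to wait for it to finish linking itself in and start spinning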

SLIDE 68

MCS Lock

        void unlock() {
            Qnode qnode = this.qnode.get();
            if (qnode.next == null) {
                // if really no thread waiting
                if (tail.compareAndSet(qnode, null))
                    return;
                // otherwise, wait for successor to finish
                while (qnode.next == null) {}
            }
            // tell successor that it can start
            qnode.next.locked = false;
        }
    }
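Putting the two halves together, a complete MCS lock could look like the sketch below. It follows the structure shown on the slides; the details that are assumptions made here are the volatile fields, the ThreadLocal.withInitial() setup, clearing next in unlock(), the demo() helper, and the use of Thread.yield() in the spin loops so that the demo also finishes quickly when there are more threads than cores.

```java
import java.util.concurrent.atomic.AtomicReference;

class MCSLock {
    static final class QNode {
        volatile boolean locked = false;
        volatile QNode next = null;
    }

    private final AtomicReference<QNode> tail = new AtomicReference<>(null);
    private final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);

    void lock() {
        QNode node = myNode.get();
        node.next = null;                       // reset for reuse
        QNode pred = tail.getAndSet(node);      // swap my node in
        if (pred != null) {
            node.locked = true;                 // lock is taken: I must wait
            pred.next = node;                   // tell predecessor where I am
            while (node.locked) {
                Thread.yield();                 // spin on my own node only
            }
        }
    }

    void unlock() {
        QNode node = myNode.get();
        if (node.next == null) {
            if (tail.compareAndSet(node, null)) {
                return;                         // really no thread waiting
            }
            while (node.next == null) {
                Thread.yield();                 // successor is still linking in
            }
        }
        node.next.locked = false;               // hand the lock to the successor
        node.next = null;
    }

    // Demo helper: `threads` workers each do `perThread` locked increments.
    static int demo(int threads, int perThread) {
        MCSLock lock = new MCSLock();
        int[] counter = {0};                    // protected by the lock
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    lock.lock();
                    counter[0]++;
                    lock.unlock();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 5_000));
    }
}
```

Note the order in lock(): locked is set to true before the node is published through pred.next, so the predecessor can never release a node that has not yet been marked as waiting.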

SLIDE 70

Abortable Locks

What if you want to give up waiting for a lock?

For example: timeout, transaction aborted by user, . . .

Simple for Backoff-Lock

- Just return from the lock() call
- No cleanup, wait-free, immediate

Problematic for Queue Locks

- Can’t just quit
- The thread in line behind will starve

Idea: Let successor deal with the problem! ⇒ Abortable CLH Lock
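Why aborting is simple for a backoff lock can be seen in a small sketch. The class name, the tryLock(timeoutNanos) signature, and the delay bounds are assumptions made for illustration. A thread that runs out of patience just returns false: it never announced itself anywhere, so there is nothing to clean up.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

class TimedBackoffLock {
    // Assumed bounds; suitable values differ between platforms.
    static final long MIN_DELAY_NANOS = 1_000;
    static final long MAX_DELAY_NANOS = 1_000_000;

    private final AtomicBoolean state = new AtomicBoolean(false);

    // Returns true if the lock was acquired within the timeout.
    boolean tryLock(long timeoutNanos) {
        long start = System.nanoTime();
        long limit = MIN_DELAY_NANOS;
        while (true) {
            // Test first, then test-and-set.
            if (!state.get() && !state.getAndSet(true)) {
                return true;
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return false;                   // abort: just return, no cleanup
            }
            long delay = 1 + ThreadLocalRandom.current().nextLong(limit);
            LockSupport.parkNanos(delay);       // back off before retrying
            if (limit < MAX_DELAY_NANOS) {
                limit *= 2;
            }
        }
    }

    void unlock() {
        state.set(false);
    }

    public static void main(String[] args) {
        TimedBackoffLock lock = new TimedBackoffLock();
        System.out.println(lock.tryLock(1_000_000L));   // lock is free
        System.out.println(lock.tryLock(1_000_000L));   // lock is held: times out
        lock.unlock();
    }
}
```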

[Figure slides not preserved: Timeout Lock: Acquire; Timeout Lock: While waiting, . . .; Timeout Lock: Thread times out]
SLIDE 80

Timeout Locks: Implementation

    class TOLock {
        // Qnode here has a single field: volatile Qnode prev
        static Qnode AVAILABLE = new Qnode();    // signifies free lock
        AtomicReference<Qnode> tail = new AtomicReference<Qnode>(null);
        ThreadLocal<Qnode> myNode = new ThreadLocal<Qnode>();   // per thread

        // Return value indicates success
        boolean lock(long timeout) {
            // Initialize node
            Qnode qnode = new Qnode();
            myNode.set(qnode);
            qnode.prev = null;
            // swap with tail
            Qnode myPred = tail.getAndSet(qnode);
            // if predecessor absent or released, we are done
            if (myPred == null || myPred.prev == AVAILABLE) {
                return true;
            }
            ...

SLIDE 81

Timeout Locks

            ...
            // Keep trying for a while
            long start = now();
            while (now() - start < timeout) {
                // Spin on predecessor's prev field
                Qnode predPred = myPred.prev;
                if (predPred == AVAILABLE) {
                    // predecessor released lock
                    return true;
                } else if (predPred != null) {
                    // predecessor aborted, we advance in queue
                    myPred = predPred;
                }
            }
            ...

SLIDE 82

Timeout Locks

            ...
            // In case timeout happened, we waited long enough
            if (!tail.compareAndSet(qnode, myPred)) {
                // If CAS fails, tell successor about my predecessor
                qnode.prev = myPred;
            }
            // If CAS succeeds, no successor, nothing to do
            return false;
        }

SLIDE 83

Timeout Locks

        void unlock() {
            Qnode qnode = myNode.get();
            if (!tail.compareAndSet(qnode, null)) {
                // If CAS failed: there is a successor
                // Notify successor that it can enter
                qnode.prev = AVAILABLE;
            }
            // If CAS succeeds: no successor waiting
            // Set tail to null, no clean up
        }
    }
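The fragments of the timeout lock can be assembled into one class as sketched below. Assumptions made here: the acquire method is called tryLock and takes the timeout in nanoseconds, time is measured with System.nanoTime() in place of the slides' now(), and the probe() helper (which tries the lock from a second thread) exists only for the demo. Each attempt must come from the thread that will later call unlock(), because the node is remembered per thread.

```java
import java.util.concurrent.atomic.AtomicReference;

class TOLock {
    static final class QNode {
        volatile QNode prev = null;
    }

    static final QNode AVAILABLE = new QNode();   // signifies a released lock

    private final AtomicReference<QNode> tail = new AtomicReference<>(null);
    private final ThreadLocal<QNode> myNode = new ThreadLocal<>();

    // Returns true if the lock was acquired within the timeout.
    boolean tryLock(long timeoutNanos) {
        QNode qnode = new QNode();                // fresh node per attempt
        myNode.set(qnode);
        QNode myPred = tail.getAndSet(qnode);     // swap with tail
        if (myPred == null || myPred.prev == AVAILABLE) {
            return true;                          // predecessor absent or released
        }
        long start = System.nanoTime();
        while (System.nanoTime() - start < timeoutNanos) {
            QNode predPred = myPred.prev;         // spin on predecessor's prev
            if (predPred == AVAILABLE) {
                return true;                      // predecessor released the lock
            } else if (predPred != null) {
                myPred = predPred;                // predecessor aborted: skip it
            }
            Thread.onSpinWait();
        }
        // Timed out: remove my node if I am last, else point successor past me.
        if (!tail.compareAndSet(qnode, myPred)) {
            qnode.prev = myPred;
        }
        return false;
    }

    void unlock() {
        QNode qnode = myNode.get();
        if (!tail.compareAndSet(qnode, null)) {
            qnode.prev = AVAILABLE;               // successor exists: notify it
        }
    }

    // Demo helper: tries the lock from another thread, releasing it on success.
    static boolean probe(TOLock lock, long timeoutNanos) {
        boolean[] result = new boolean[1];
        Thread t = new Thread(() -> {
            result[0] = lock.tryLock(timeoutNanos);
            if (result[0]) {
                lock.unlock();
            }
        });
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        TOLock lock = new TOLock();
        System.out.println(lock.tryLock(1_000_000L));   // lock is free
        System.out.println(probe(lock, 1_000_000L));    // held by main: times out
        lock.unlock();
        System.out.println(probe(lock, 1_000_000L));    // free again
    }
}
```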

SLIDE 84

Summary: One Lock To Rule Them All?

- TTAS+Backoff, CLH, MCS, TOLock, . . .
- Each one is better than the others in some way
- There is no one solution
- The decision really depends on:
  - the application
  - the hardware
  - which properties are important
