6 Transactional Memory - Chip Multiprocessors (ACS MPhil) - Robert Mullins - PowerPoint PPT Presentation



SLIDE 1

6 Transactional Memory

Chip Multiprocessors (ACS MPhil) - Robert Mullins

SLIDE 2

Overview

  • Limitations of lock-based programming
  • Transactional memory

– Programming with TM
– Software TM (STM)
– Hardware TM (HTM)

SLIDE 3

Lock-based programming

  • Lock-based programming is a low-level model

– Close to basic hardware primitives
– For some problems, lock-based solutions that perform well are complex and error-prone

  • difficult to write, debug, and maintain
  • Not true of all problems
  • Parallel programming for the masses

– The majority of programmers will need to be able to produce highly parallel and robust software

SLIDE 4

Lock-based programming

  • Challenges:

– Must remember to use (the correct) locks

  • Be careful to avoid locking when not required (for performance)

– Coarse-grain vs. fine-grain locks

  • Simplicity
  • Unnecessary serialisation of operations

– Lock may not actually be required in most cases (data dependent). Lock-based programming may be pessimistic.

  • We must also consider the time taken to acquire and release locks! (even uncontended locks have a cost)

– What is the optimal granularity of locking? HW dependent.

SLIDE 5

Lock-based programming

  • Other issues:

– Deadlock
– Scheduling threads

  • Priority inversion (e.g. the Mars Pathfinder problems)

– Low-priority thread is preempted (while holding a lock)
– Medium-priority thread runs
– High-priority thread (needing the lock) can't make progress

  • Convoying

– Thread holding a lock is descheduled and a queue of waiting threads forms

– Lost wake-ups (wait on a CV, but forget to signal)
– Horribly complicated error recovery
– Cannot even easily compose lock-based programs

SLIDE 6

Lost wake-up example

push:
    mutex::scoped_lock lock(pushMutex)
    queue.push(item)
    if (queue.size() == 1)
        m_emptyCond.notify_one()
    // (implicit lock release when leaving scope)

pop:
    mutex::scoped_lock lock(popMutex)
    while (queue.empty())
        m_emptyCond.wait()
    item = queue.front()
    queue.pop()
    return item

SLIDE 7

Lock-based programming

  • Deadlock

– We are free to do anything when we hold a lock, even take a lock on another mutex
– This can quickly lead to deadlock if we are not careful

  • Limiting ourselves to only being able to take a single lock at a time would force us to use coarse-grain locks

  • e.g. consider maintaining two queues. These are each accessed by many different threads. We are infrequently required to transfer data from one queue to the other (atomically)

// Trivial deadlock example
// Thread 1          // Thread 2
a.lock();            b.lock();
b.lock();            a.lock();
...                  ...

SLIDE 8

Lock-based programming

  • Avoiding deadlock

– Requires programmer to adopt some sort of policy (although this isn't automatically enforced)
– Often difficult to maintain/understand

  • Lock hierarchies

– All code must take locks in the same order
– Lock chaining: take first lock, take second, release first, etc.

  • Try and back off

– More flexible than imposing a fixed order
– Get first lock
– Then try and lock additional mutexes in the required set. If we fail, release locks and retry

  • pthread_mutex_trylock
SLIDE 9

Lock-based programming

  • Composing lock-based programs

– Consider our example of two queues
– There is no simple way of dequeuing from one and enqueuing to the other in an atomic fashion

  • We would need to expose synchronization state and force the caller to manage locks

– Can't compose methods that block either (wait/notify)

  • How do we describe the operation where we want to dequeue from either queue, whichever has data?

  • Each queue implementation blocks internally
SLIDE 10

Transactions

  • Focus on where atomicity is necessary rather than specific locking mechanisms

  • The transactional memory system will ensure that the transaction is run in isolation from other threads

– Transactions are typically run in parallel optimistically
– If transactions perform conflicting memory accesses, we must abort and ensure none of the side-effects of the abandoned transactions are visible

atomic {
    x = q0.deq();
    q1.enq(x);
}

SLIDE 11

Transactions

  • Atomicity (all-or-nothing)

– We guarantee that it appears that either all the instructions are executed or none of them are (if the transaction fails: failure atomicity)
– The transaction either commits or aborts

  • Transactions execute in isolation

– Other operations cannot access a transaction's intermediate state
– The result of executing concurrent transactions must be identical to a result in which the transactions executed sequentially (serializability)

SLIDE 12

Transactions

  • Retry

– Abandon the transaction and try again
– An implementation could wait until some changes occur in memory locations read by the aborted transaction

  • Or specify a specific watch set [Atomos/PLDI'06]

void Queue::enq(int v) {
    atomic {
        // queue is full
        if (count == MAX_LEN) retry;
        buf[tail] = v;
        if (++tail == MAX_LEN) tail = 0;
        count++;
    }
}

“Composable memory transactions”, Harris et al.

SLIDE 13

Transactions

  • Choice

– Try to dequeue from q0 first; if this retries (i.e. queue is empty), then try the second
– If both retry, retry the whole orElse block

atomic {
    x = q0.deq();
} orElse {
    x = q1.deq();
}

“Composable memory transactions”, Harris et al.

SLIDE 14

Critical sections ≠ transactions

  • Converting critical sections to transactions

– Pitfall: “A critical section that was previously atomic only with respect to other critical sections guarded by the same lock is now atomic with respect to all other critical sections.”

“Deconstructing Transactional Semantics: The Subtleties of Atomicity”, Colin Blundell, E Christopher Lewis, Milo M. K. Martin, WDDD, 2005

proc1 {                  proc2 {
    acquire(m1)              acquire(m2)
    while (!flagA) {}        flagA = true
    flagB = true             while (!flagB) {}
    ....                     ....
    release(m1)              release(m2)
}                        }

SLIDE 15

Implementing a TM system

  • Transaction granularity

– Object, word or block

  • How do we provide isolation?

– Direct or deferred update?

  • Update object directly and keep undo log
  • Update private copy, discard or replace object

– Also called eager and lazy versioning

  • When and how do we detect conflicts?

– Eager or lazy conflict detection?

  • A software or hardware-supported implementation?
SLIDE 16

Hardware support for TM

  • An introduction to hardware mechanisms for supporting transactional memory

– See the Larus/Rajwar book for a more complete survey
– We'll look at:

  • Knight, “An architecture for mostly functional languages”, in LFP, 1986.

  • A simple HTM with lazy conflict detection
  • Herlihy/Moss (1993)

– Discuss others in reading group

SLIDE 17

Hardware support for TM

  • 1. Tom Knight (1986)

– Not really a TM scheme: Knight describes a scheme for parallelising the execution of a single thread
– Blocks are identified by the compiler and executed in parallel assuming there are no memory-carried dependencies between them
– Hardware support is provided to detect memory dependency violations
– This work introduces the basic ideas of using caches and the cache coherence protocol to support TM

Larus/Rajwar book p.140

SLIDE 18

Hardware support for TM

[Knight86]

SLIDE 19

Hardware support for TM

  • Confirm Cache

– A block executes to completion and then commits. Blocks are committed in the original program order

  • Any data written in the block is temporarily held in the confirm cache (not visible to other processors). This is swept and written back during commit.
  • On a processor read, priority is given to the data in the confirm cache

– The block needs to see any writes it has made

[Knight86]

SLIDE 20

Hardware support for TM

  • Dependency Cache

– The dependency cache holds data read from memory. Data read during a block is held in state D (Depends)

  • A memory dependency violation is detected if a bus write (made by a block that is currently committing) updates a value in a dependency cache in state D

  • This indicates that the block read the data too early and must be aborted

[Knight86]

SLIDE 21

Hardware support for TM

  • Simplified state transition diagram for the dependency cache

– In the real scheme there are also “predict” states to handle conditional execution (execute down both paths) and loops (predict next index to avoid abort)

[Knight86]

SLIDE 22

Hardware support for TM

[Knight86]

SLIDE 23

Hardware support for TM

  • 2. A simple HTM scheme

– Lazy Conflict Detection
– Lazy Version Management
– Committer Wins

  • Similar to Knight's scheme

– The TCC and Bulk HTMs also take a similar approach

Many thanks to Christos Kozyrakis at Stanford

SLIDE 24

Hardware support for TM

  • Lazy Conflict Detection

– We will only start looking for conflicts when we try to commit
– We will only allow one transaction to commit at once

  • Serialised commit

– To commit, we request exclusive access for all locations in our write set

  • Active transactions abort if data in their read set is invalidated
  • This is known as “committer wins” and guarantees forward progress

SLIDE 25

Hardware support for TM

  • Lazy Conflict Detection

– We will only start looking for conflicts when we try to commit
– We will only allow one transaction to commit at once

  • Serialised commit

– To commit, we request exclusive access for all locations in our write set*

  • Active transactions abort if data in their read set is invalidated
  • This is known as “committer wins” and guarantees forward progress

*In practice, many systems check against both the read set and the write set. This is due to granularity issues (we are working with cache lines, not individual words) and in order to support strong isolation

SLIDE 26

Hardware support for TM

// t1                // t2
atomic {             atomic {
    write x              write z
    read t               read x
    write y              write z
}                    }

SLIDE 27

Hardware support for TM

  • Registers are saved (checkpointed) in case we need to abort
  • R and W bits in the cache track each transaction's read/write sets

– Each transaction's write set is only visible locally

SLIDE 28

Hardware support for TM

  • Let's assume t1 gets to commit first
  • Validate: request exclusive access to write-set lines
  • Commit: reset R/W bits, turn write set into valid (dirty) data
SLIDE 29

Hardware support for TM

  • BusUpgrX transactions arrive at t2's cache

– Check: exclusive requests against its read set
– Abort: (if necessary) invalidate write set, reset R/W bits, restore registers

SLIDE 30

Hardware support for TM

  • 3. Herlihy/Moss (1993)

– Coined the term “transactional memory”
– Eager Conflict Detection
– Lazy Version Management

SLIDE 31

Hardware support for TM

  • Their scheme exploits Eager Conflict Detection

– It detects possible conflicts as each transaction executes
– This reduces the amount of computation lost by an aborted transaction
– It may also abort transactions that could have committed

  • We don't have the luxury of knowing anything about the order in which transactions will actually commit

  • In addition, we are not committing transactions one-at-a-time (serialised commit), so we have to worry about write-write conflicts too

SLIDE 32

Hardware support for TM

  • If we adopt eager (pessimistic) conflict detection we don't know the order in which T1 and T2 will commit when we are detecting conflicts

– We have to assume the worst-case situation. In this case, that T2 will commit first and T1 must be aborted
– If T1 actually commits first and we use lazy (optimistic) conflict detection, it is not necessary to abort either transaction

SLIDE 33

Hardware support for TM

  • Similar setup to our simple HTM scheme
  • But now, all caches will snoop our read/write requests

– What happens if our request hits in another cache?

  • If we are performing a bus read and the data is only in the read set of the other transaction: no problem

  • Any other situation where we access the data set of another transaction will cause the remote cache to initiate a bus transaction to indicate that we should abort

– This covers all situations where there is a potential conflict: two transactions accessing the same memory location where at least one operation is a write
– We assume that the requester aborts, but there are other policies

SLIDE 34

Hardware support for TM

  • Herlihy & Moss' Implementation

– Their implementation isn't as simple as described!

  • Extensions (see paper)

– Store old value in cache

  • XCOMMIT (old value)
  • XABORT (new value)
  • We discard one depending on the outcome, commit or abort

– Dual caches

  • They don't use a single cache; they have a regular and a transactional cache

  • Why is this advantageous?
SLIDE 35

Hardware support for TM

  • Problems

– Assumes transactions are short lived and have small data sets

  • Maximum transaction size is bounded by the size of the transactional cache

  • They suggest trapping to software to support large transactions (as in the LimitLESS directory protocol)

– Contention management

  • How do we stop transactions repeatedly aborting each other?
  • As described the scheme doesn't guarantee forward progress
  • They suggest addressing this at the software level, by having aborted transactions execute exponential backoff in SW

SLIDE 36

Hardware support for TM

  • Other approaches to build TM systems:

– Unbounded HTMs
– Use of Bloom filters (SigTM)
– Hybrid TM schemes

SLIDE 37

Retaining locks

  • Speculative Lock Elision (SLE)

– Rajwar and Goodman (MICRO 2001)
– Retains the lock-based programming model but exploits optimistic execution

  • Another possibility is to identify critical sections (transactions) but construct a set of locks automatically by analysing the whole program

– Pessimistic rather than optimistic concurrency
– “Lock inference”

SLIDE 38

Software TM (STM)

  • Software TM systems are important too

– Hardware is useful in accelerating the most frequent or expensive operations

  • We won't cover software TM systems in detail here

– Read Chapter 3 of the Larus/Rajwar book