Software Transactional Memory Should Not Be Obstruction Free
Robert Ennals
Intel Research Cambridge, 15 JJ Thomson Avenue, Cambridge, CB3 0FD, UK
robert.ennals@intel.com
Presented by Ted Cooper for CS510 Concurrent Systems (Spring 2014)
Grand Context (courtesy of Professor Walpole)
- Locking is slow and hard to get right. Clearly, non-blocking algorithms must be
the answer!
- But non-blocking algorithms (harder to get right) might starve out threads. Thus,
they should be wait-free.
- Wait-free algorithms must use “helping” to ensure all threads make progress, so
they perform poorly, and are no simpler to reason about.
- Transactions look like lock-based and sequential programs, so maybe they're
easier to reason about. Can we make them fast?
- But hardware transactional memory implementations have limits on transaction
size and other problems, must coexist with locks in real systems, and don't seem to be faster than locks in practice. Can we at least get an STM that handles transactions of arbitrary size and length and performs reasonably?
- What properties do we really need in an STM? Does it need to be some flavor
of non-blocking?
STM Context
- STM performance not stellar compared to conventional locks.
- Processor speed growing faster than memory bandwidth. Can
we reduce memory accesses to improve STM performance?
- Do existing STM implementations maximize processor use? If
not, can we improve processor use to improve performance?
- “Obstruction-freedom” has been borrowed by STM researchers
from distributed systems (which have independent failure domains, so it's important that one node be able to continue progressing if another fails). Is this a useful property for STM? How does it affect performance?
Terminology
- Thread: Programmer-level idea, single parallelizable control flow. Think
green threads, user-level threads. Transactions run on threads.
- Task: OS-level idea, one runs per available core. Runtime multiplexes
threads onto tasks. Think OS threads.
- Non-blocking: At any given time, there is some thread whose progress is
not blocked (e.g. by mutual exclusion).
- Obstruction-free: A property non-blocking algorithms can have. If all other
threads are suspended (i.e. no contention), a thread can complete its
operation in a finite number of its own steps. This may require retrying.
Does not guarantee progress in the presence of conflicting operations, e.g. livelock is possible.
- Obstruction-free is the weakest additional “natural” property a non-blocking
algorithm can have.
Livelock?
- Threads are doing work, but each one's work prevents the others from
progressing. Just like deadlock, you can have 2-participant,
3-participant, n-participant livelock.
- “A real-world example of livelock occurs when two people meet in a
narrow corridor, and each tries to be polite by moving aside to let the
other pass, but they end up swaying from side to side without making
any progress because they both repeatedly move the same way at the same time.” http://en.wikipedia.org/wiki/Deadlock#Livelock
- In this example, each person's “sway deterministically until there is no
obstacle” algorithm is obstruction-free since it can proceed if the other
person holds still, but is not guaranteed to make progress while the
other person does the same thing.
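The corridor example can be sketched as a tiny simulation. This is only an illustration under assumed rules (two lanes, both people observing and moving at the same instant); the function and parameter names are hypothetical and come from neither the paper nor ltx.

```python
def corridor(max_steps, other_holds_still=False):
    """Two people meet in a two-lane corridor, both starting in lane 0.

    Each follows the same deterministic rule: if the other person is in
    my lane, switch lanes. Returns the step at which they end up in
    different lanes (and can pass), or None if that never happens
    within max_steps.
    """
    a, b = 0, 0
    for step in range(1, max_steps + 1):
        blocked = (a == b)  # both observe the same state at the same time
        if blocked:
            a = 1 - a  # person A sways to the other lane
            if not other_holds_still:
                b = 1 - b  # person B applies the same rule simultaneously
        if a != b:
            return step
    return None


if __name__ == "__main__":
    print("both swaying:", corridor(1000))
    print("other holds still:", corridor(1000, other_holds_still=True))
```

With the other person suspended the rule finishes in a bounded number of its own steps, which is all that obstruction-freedom promises; with both running in lockstep it never finishes.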
Non-blocking algorithms
[Figure: wait-free, lock-free, obstruction-free]
- Wait-free: Under contention,
every thread makes progress, i.e. no starvation
- Lock-free: Under contention,
some thread makes progress. If multiple threads try to operate
on the same data, someone will
win. A given thread may never
win, so could be starved, but the system as a whole will make progress, so no livelock.
- Obstruction-free: In isolation (all
contenders suspended), a given thread makes progress. Under contention, this progress may not be useful, i.e. 2 threads could forever interfere and retry, livelocking.
Do we need obstruction-free STM?
- STM common case: parallelizing existing
sequential programs
- Sequential programmers are used to blocking
semantics, e.g. system calls(?)
- If we map tasks to cores 1-1, and run in-flight
transactions to completion before scheduling new ones, it's unlikely that any thread will be suspended mid-transaction, and only suspended transactions can block other transactions.
There is no one thread use case to rule them all
- Threading for convenience: Multiple threads to track
computations that proceed independently, e.g. compute and GUI threads. Blocking locks are fine here, may need priority levels for locks to ensure low-priority threads don't block high-priority threads.
- Threading for performance: Actual concurrent computation
is possible. Blocking fine in sequential code, so also fine in transactions (draw picture)
- To STMify lock-based code, we can map lock-protected
critical sections to transactions. This is no worse, since locks don't allow any concurrency in critical sections.
Why obstruction-freedom isn't as useful as it might seem
Obstruction-free misconception 1
- Misconception: Obstruction-freedom prevents a
long-running transaction from blocking others
- Counterexample: A transaction t reads an object
x, computes for a year, writes to x. t completes
only if any other transaction that needs x blocks
until t finishes. So, either t blocks contending transactions or t never completes.
- Question: Is it a problem for a transaction to
block others of the same or lower priority?
Obstruction-free misconception 2
- Misconception: Obstruction-freedom prevents the system from locking
up if a thread t is switched out mid-transaction.
- Argument 1: The OS will always switch the task running t back in
eventually (provided all tasks have the same OS scheduling priority), so you don't need obstruction-freedom to make progress as long as temporary interruptions are okay.
- Argument 2: STM runtime can match the number of tasks to the number
of available cores (dynamically). In this situation tasks (and the threads
they run) will be switched out by the OS rarely, if ever.
- Argument 3: STM runtime can start a new transaction on a given
task only when that task's last transaction completes, i.e. the runtime never preempts an in-flight transaction. That is, we allow in-flight transactions to obstruct new ones :)
Obstruction-free misconception 3
- Misconception: Obstruction-freedom prevents the system
from locking up if a thread t fails. i.e. the system should continue to make progress as a whole if transactions fail silently.
- Argument 1: If it's a software failure, an equivalent lock-based
or sequential program would also fail.
- Argument 2: If it's a hardware failure, then a) node failures in
distributed systems are common, while independent core failures in shared memory multiprocessors that don't bork the whole system are exceedingly rare, and b) again, a hardware failure would also break a lock-based or sequential program.
What does abandoning obstruction-freedom buy us?
Improved cache locality
- If object metadata lives in the same cache line
as object data, only one memory access is needed to load a shared object. If the program is memory bandwidth-limited, run time is directly proportional to the number of memory accesses.
- Any metadata we can't fit in the object data
cache line should live in memory that is private to a given transaction, so transactions don't fight over it and so it stays in one cache.
Improved cache locality cont'd
- What does this have to do with obstruction-freedom?
- No obstruction-free STM can store object metadata and data in the same cache line.
They all require object data to be behind a level of indirection to prevent the following situation:
– Transaction t is writing to object x and is switched out.
– Transaction s runs, needs x. What can s do?
- s could wait for t to finish with x, but that isn't obstruction-free.
- s could access x, but if t wakes up again it might overwrite x, invalidating s'
transaction and leaving s in an undefined state.
- s could abort t, but we can't guarantee abort has succeeded without an
acknowledgement from t, and that isn't obstruction-free. Even if s could abort t, then t could restart and abort s, resulting in livelock. My question: Could we avoid livelock with a total ordering of abort precedence, i.e. s can abort t but t can't abort s?
- This is the same reason we need pointers and copies in relativistic programming.
Optimal number of in-flight transactions
- Consider N in-flight transactions on N cores.
- A new transaction t tries to start before any of the N complete.
- While t exists but has not yet been scheduled to run, it can make no progress
in isolation, and so is not obstruction-free.
- So as soon as t exists, we have to switch out an in-flight transaction and share
N cores among N+1 transactions.
- This introduces context-switching overhead, which was previously avoided,
and which wastes cycles.
- This also increases the number of concurrently running transactions,
increasing the probability of conflicts among transactions.
- Why not just let each transaction complete without context-switching it out, and
once it completes run the new transaction in its task? Then we'd always have
N transactions running on N cores.
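The run-to-completion policy above can be sketched as a fixed pool of N worker tasks that each pull the next pending transaction only after finishing the current one. This is a minimal sketch, not how ltx schedules work: `run_to_completion` and the `stats` dictionary are hypothetical names, and Python threads stand in for OS tasks without being pinned to cores.

```python
import queue
import threading


def run_to_completion(transactions, n_tasks):
    """Run each transaction (a callable) to completion on one of n_tasks tasks.

    A task never starts a new transaction until its current one finishes,
    so at most n_tasks transactions are ever in flight.
    """
    pending = queue.Queue()
    for txn in transactions:
        pending.put(txn)

    stats = {"in_flight": 0, "max_in_flight": 0, "completed": 0}
    stats_lock = threading.Lock()

    def task():
        while True:
            try:
                txn = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to run
            with stats_lock:
                stats["in_flight"] += 1
                stats["max_in_flight"] = max(stats["max_in_flight"],
                                             stats["in_flight"])
            txn()  # never preempted by this runtime
            with stats_lock:
                stats["in_flight"] -= 1
                stats["completed"] += 1

    workers = [threading.Thread(target=task) for _ in range(n_tasks)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return stats


if __name__ == "__main__":
    print(run_to_completion([lambda: None] * 10, n_tasks=4))
```

A newly created transaction simply waits in the queue, which is exactly the behavior an obstruction-free design is not allowed to have.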
What does a non-obstruction-free STM that employs these optimizations look like, and how does it perform against existing obstruction-free STMs?
The Lightweight Transaction Library
- Ennals et al wrote a non-obstruction-free STM
library to test these ideas.
- In summary, it handily beats Fraser's STM and
Fraser's C implementation of DSTM, both of which are obstruction-free.
- It is available at:
http://sourceforge.net/projects/libltx
Memory Layout
- ltx designates a public memory region all transactions
can access, where shared objects (and only shared
objects) live.
- It also allocates a private memory region to each
transaction for the transaction state, which other transactions (usually) do not access. Each private region is allocated once, contiguously and starting at an aligned address,
and is reused by subsequent transactions that run on
the same core, so it stays in that core's cache. This means that cache misses on private memory are rare.
What lives in private memory?
- At the very beginning (i.e. the aligned base address), a
descriptor for the transaction itself from which its priority can be determined.
- Read and write descriptors, one for each shared object x the
current transaction t has accessed.
- Read descriptors contain:
– x's version number as of the last time t read it. This is used to
check whether t needs to restart because the data it read changed before t could commit.
– A pointer to x, so t can read the data, check x's version, and check
whether x has been locked for writing by another transaction.
What lives in private memory? cont'd
- Write descriptors contain:
– The object's version number as of the last time t read it. This is used
to compute a new version number on a successful commit, or to roll x back to its previous version on abort.
– A pointer to x, so t knows where to write on commit or abort.
– A copy of x's object data. This is where t stages changes to x before
committing. Note that unlike in RP, where changes are made visible
by replacing a public pointer to the old version with a public pointer to the new version, ltx copies this staged object data back to the public
object data during commit, enforcing the public/private division. This is
unavoidable, since object metadata and data are stored adjacently in the public region in a fixed location (to avoid the extra memory accesses imposed by indirection).
Object handles
- Each public object has a handle (metadata) stored
adjacent to the object data.
- The last bit of the handle signals whether a transaction
is currently writing to the object x:
– If 1, no transaction is currently writing, and the rest of the
handle represents x's current version number.
– If 0, a transaction t is currently writing to x, and the rest of the
handle is a pointer to t's write descriptor (more on this later) for x. Some fixed number of higher order bits in this pointer can also be used to locate t's transaction descriptor, since private regions are allocated in aligned contiguous blocks.
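The handle encoding described on this slide can be illustrated with plain integers. This is a sketch under stated assumptions: the region size (`REGION_BITS`), the exact bit layout, and all function names are hypothetical, chosen only to show the low-bit tag and the masking trick, and are not taken from ltx.

```python
REGION_BITS = 16  # assumption: each private region is 2**16 bytes, aligned to its size


def version_handle(version):
    """Unlocked handle: version number in the upper bits, low bit set to 1."""
    return (version << 1) | 1


def locked_handle(write_descriptor_addr):
    """Locked handle: the address of the owner's write descriptor, low bit 0."""
    if write_descriptor_addr & 1:
        raise ValueError("descriptor addresses must be at least 2-byte aligned")
    return write_descriptor_addr


def is_locked(handle):
    return (handle & 1) == 0


def version_of(handle):
    if is_locked(handle):
        raise ValueError("handle holds a descriptor pointer, not a version")
    return handle >> 1


def transaction_descriptor_addr(handle):
    """Clear the low REGION_BITS of the pointer to get the region's base address,
    where the transaction descriptor is assumed to live."""
    if not is_locked(handle):
        raise ValueError("handle is not locked")
    return handle & ~((1 << REGION_BITS) - 1)


if __name__ == "__main__":
    h = locked_handle(0x30000 + 0x48)  # write descriptor 0x48 bytes into a region
    print(hex(transaction_descriptor_addr(h)))
```

Because the version and the lock share one word, a single atomic write to the handle can both release the lock and publish a new version.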
Is this figure correct?
[Figure omitted; labels include “Version Seen” and “0 or more”]
How could “Version Seen” be a pointer?
Writes
- Managed using revocable two-phase locking:
– A transaction locks every object to which it needs to write,
but keeps enough information around to release the lock and restore the object to its previous state on abort.
– If two transactions deadlock on write sets, one aborts.
My question: How does deadlock detection work in this case? Does a transaction s who needs an object x locked by t use x's handle to find t's write descriptors and some record of the set of objects t intends to ultimately lock, compare that to its own write descriptors and pending locks, look for a cycle, and abort if it finds one?
Writes cont'd
- How does t lock x for writing?
– t reads x's handle. If it ends in a 1 then the rest is x's
version number, and t stores that and a pointer to x in a write descriptor d, then uses a compare and swap
or other atomic operation to replace x's handle with a
pointer to d with a trailing 0. If the atomic operation succeeds, t has locked x. Otherwise some other transaction has concurrently updated x, and t must
retry. If t successfully locks x, it makes a copy of x's
object data in the write descriptor.
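The locking step above might look roughly like the following. It is a sketch only: Python has no hardware compare-and-swap, so a `threading.Lock` stands in for the atomic instruction, the handle is modeled as either an integer version (unlocked) or a descriptor object (locked) rather than a tagged word, and every class and function name is hypothetical.

```python
import threading


class SharedObject:
    def __init__(self, data, version=0):
        self.handle = version  # int = unlocked version; descriptor = locked
        self.data = data
        self._guard = threading.Lock()  # stand-in for hardware atomicity

    def compare_and_swap(self, expected, new):
        """Atomically replace the handle with `new` if it still equals `expected`."""
        with self._guard:
            if self.handle == expected:
                self.handle = new
                return True
            return False


class WriteDescriptor:
    def __init__(self, obj, version):
        self.obj = obj          # pointer back to the shared object
        self.version = version  # version seen when the lock was taken
        self.copy = None        # private staging copy of the object data


def try_lock_for_write(obj):
    """Return a write descriptor on success, or None if the caller must wait or retry."""
    seen = obj.handle
    if not isinstance(seen, int):
        return None  # already locked by another transaction
    d = WriteDescriptor(obj, seen)
    if not obj.compare_and_swap(seen, d):
        return None  # someone updated the handle concurrently
    d.copy = list(obj.data)  # stage changes privately from now on
    return d


if __name__ == "__main__":
    x = SharedObject([1, 2, 3], version=5)
    d = try_lock_for_write(x)
    print(d.version, d.copy, try_lock_for_write(x))
```

The copy is taken only after the lock succeeds, so no other writer can change the object data while it is being copied.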
Writes cont'd
- What if x is already locked by another
transaction s?
– t (busy?) waits for a bounded number of cycles for x
to become available. If this time expires and x is still locked, t gets s' transaction descriptor (available via the pointer in the locked handle) and checks whether s is of the same or lower priority. If so, t requests that s abort itself.
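A rough sketch of this bounded wait follows. The slide does not say what t does when s has higher priority, so the "keep-waiting" result is an assumption, as are the spin budget, the convention that larger numbers mean higher priority, and all names used here.

```python
def wait_for_lock(is_locked, owner_priority, my_priority, request_abort,
                  max_spins=1000):
    """Spin for a bounded number of iterations, then maybe ask the owner to abort.

    is_locked:      callable returning True while the object is still locked
    request_abort:  callable that asks the owning transaction to abort itself
    Returns "free", "abort-requested", or "keep-waiting".
    """
    for _ in range(max_spins):
        if not is_locked():
            return "free"
    if owner_priority <= my_priority:
        # Owner has the same or lower priority: ask it to abort itself.
        request_abort()
        return "abort-requested"
    # Assumption: against a higher-priority owner, the waiter just keeps waiting.
    return "keep-waiting"


if __name__ == "__main__":
    requests = []
    print(wait_for_lock(lambda: True, 1, 1, lambda: requests.append("s"),
                        max_spins=10))
```

Note that the waiter only requests the abort; the owner aborts itself, which is why this scheme is not obstruction-free.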
Reads
- Managed using optimistic concurrency control:
– t reads x's handle. If x is not locked, it logs the
version number from the handle in a read descriptor for x, along with a pointer to x. If x is locked, t waits in the same fashion as for writing.
– When t attempts to commit, it compares its logged
copy of x's version number to the current value in x's handle, and the commit fails if they differ.
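The optimistic read path can be sketched in a few lines. This is a single-threaded illustration with hypothetical names; the handle is modeled as an integer version when unlocked and as any non-integer value when locked, and the waiting step for locked objects is left to the caller.

```python
class SharedObject:
    def __init__(self, data, version=0):
        self.handle = version  # int = unlocked version; anything else = write lock
        self.data = data


def is_locked(obj):
    return not isinstance(obj.handle, int)


def transactional_read(obj, read_log):
    """Log the version seen and return the data, or None if obj is locked
    (in which case the caller would wait, as for writes)."""
    if is_locked(obj):
        return None
    read_log.append((obj, obj.handle))  # read descriptor: pointer + version seen
    return obj.data


def reads_still_valid(read_log):
    """Commit-time check: every logged version must match the current handle."""
    return all(obj.handle == seen for obj, seen in read_log)


if __name__ == "__main__":
    x = SharedObject("payload", version=3)
    log = []
    print(transactional_read(x, log), reads_still_valid(log))
    x.handle = 4  # another transaction commits a write to x
    print(reads_still_valid(log))
```

A handle that currently holds a lock also fails the comparison, since a lock value never equals a logged version number.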
Commits
- When t is ready to commit, it first checks whether it is still valid:
– If no other transaction has written to an object in t's read set (i.e. the
version numbers in the read descriptors still match the handles), t is valid.
- If t is valid, it can commit. t must have locked all the objects in
its write set, so we don't need to check those to determine
validity. For each write descriptor d for an object x, t simply
copies the updated object data in d (private memory) to the corresponding object data in public memory, then overwrites the lock in x's handle with an incremented version number for x, releasing the lock and publishing the new version of x in
one fell swoop.
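Putting validation and write-back together, a commit might be sketched as below. This is a single-threaded model with hypothetical names, not ltx's code. The handle is an integer version when unlocked and a descriptor object when locked. Handling an object that is in both the read and write sets by comparing against the version recorded in t's own write descriptor is an assumption, and `abort` shows only the handle rollback.

```python
class SharedObject:
    def __init__(self, data, version=0):
        self.handle = version  # int = unlocked version; WriteDescriptor = locked
        self.data = data


class WriteDescriptor:
    def __init__(self, obj, version, copy):
        self.obj, self.version, self.copy = obj, version, copy


def lock_for_write(obj):
    """Simplified, non-atomic lock: fine for this single-threaded sketch only."""
    d = WriteDescriptor(obj, obj.handle, list(obj.data))
    obj.handle = d
    return d


def commit(read_log, write_descriptors):
    """Validate the read set, then publish staged writes. Returns True on success."""
    own = {id(d.obj): d for d in write_descriptors}
    for obj, seen in read_log:
        d = own.get(id(obj))
        # Assumption: for objects we locked ourselves, compare against the
        # version recorded when the lock was taken.
        current = d.version if d is not None else obj.handle
        if current != seen:
            return False  # someone wrote (or is writing) an object we read
    for d in write_descriptors:
        d.obj.data = d.copy            # copy staged data back to public memory
        d.obj.handle = d.version + 1   # release lock and publish new version
    return True


def abort(write_descriptors):
    """Release locks by restoring each handle to the version seen at lock time."""
    for d in write_descriptors:
        d.obj.handle = d.version


if __name__ == "__main__":
    x, y = SharedObject([1], 3), SharedObject([10], 7)
    reads = [(y, y.handle)]
    d = lock_for_write(x)
    d.copy[0] = 2
    print(commit(reads, [d]), x.data, x.handle)
```

Writing the incremented version into the handle is the last step for each object, so readers never see a new version number paired with old data in this model.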
Commits cont'd
- What if t isn't valid?
– t may have read inconsistent data and gone into a
weird state, e.g. an infinite loop or a segfault from reading an out-of-date or corrupted array index.
– Because we can't predict the behavior caused by
inconsistent data, t may not retry properly, so the runtime has to periodically abort outstanding invalid transactions.
Performance Evaluation
- Benchmarks on Fraser's testbed to ensure that
comparison to Fraser's STM and C DSTM is fair.
- SunFire 15K server
– 106 UltraSparc III processors @ 1.2GHz
- Benchmarks
– Red-black tree and skip-list, both read and write
random set elements, 75% reads, 25% writes.
- Lower on y axis (CPU time per operation in microseconds) is
better.
- Key space varied to compare performance under contention
- ltx takes 50-60% of the time of Fraser's STM and 35% of the time of C DSTM
- Probably wins because of cache locality optimization (fewer total
memory accesses): ltx incurs 48% L2 misses, 58% L1 misses, and 22% TLB misses compared to Fraser
- Lower on y axis (CPU time per operation in microseconds) is better.
- Key space varied from 16 to 2^19 to compare performance under
contention, number of processors used fixed at 90.
- Under high contention (left region of each graph) ltx takes ~20% time of
Fraser, C DSTM barely runs.
- Fraser's transactions help blockers, performs poorly for the same
reason wait-free algorithms do.
- Lower on y axis (CPU time per operation in microseconds) is better.
- Run on 4-way SPARC machine and number of tasks varied to measure
effect of OS context-switching.
- Unsurprisingly, as rate of context-switching increases performance
degrades.
- ltx more affected by context-switching than Fraser since switched-out
transactions can block others in ltx, but ltx is still faster.
- Under normal ltx deployment, number of tasks always upper-bounded
by available cores, so context-switching rarely occurs.
Conclusions
- Obstruction-freedom is not necessary for STM.
- 2 non-obstruction-free STM optimizations that maximize
cache locality and minimize context-switching are demonstrated in an implementation that outperforms existing best-in-class obstruction-free STM implementations.
- Therefore, Ennals et al. believe that STM designers
should abandon obstruction-freedom.
- But wait, ltx writers use locks. Weren't we trying to get