SLIDE 1

CO538: Concurrency Design and Practice

Bonus Lecture: Other Concurrency Models

  • Dr. Fred Barnes

S113, Computing Laboratory (ext. 4278) frmb@kent.ac.uk

(ack to Matt Groening for Nibbler)


So Far ...

We’re all fairly familiar with the occam-π approach to concurrency:

  • isolated processes with synchronous channel communication.
  • ... and though you may not know it, the CSP process algebra (you get this for free!) [Hoare, 1985].
  • the ‘π’ bit comes from elements of the π-calculus [Milner, 1999].

This is not the only approach, however! In brief:

  • shared-memory based models with threads-and-locks implementations.
  • transactional-memory based approaches.
  • message-passing approaches (including occam-π/CSP).
  • data-parallel models and their implementations.
  • hybrid approaches taking different elements from the above.

SLIDE 2

A Brief History of Parallel Computing

Concurrency and parallelism have mostly been used as a mechanism for improving performance:

  • in networks of workstations (e.g. the Pi-Cluster, “Grid”, cloud computing).
  • in supercomputers such as Cray or IBM’s BlueGene, and, not entirely unrelated, the transputer.
  • in more recent multiprocessor computers (last 20 years or so).
  • and most recently in multicore processors (including things like GPUs and Sony/Toshiba/IBM’s “Cell” processor).
  • future massively multicore (8, 16, 80, 400, ...) multiprocessor platforms.


Technology

Processor technology has, for the time being, reached its limits in terms of raw clock speed (e.g. 4 GHz):

  • we’re at 50-60 nanometre processes in CPU manufacturing.
  • the maximum clock speed is limited by heat dissipation.
  • power consumption of the CPU is roughly proportional to (Nt × Tp × V²), thus to pack in more transistors (Nt) to give more raw capability (e.g. more cores, more cache) we want to make the manufacturing process small (Tp) and run at low voltages (V).
  • the maximum clock speed is also limited by voltage: too low and there isn’t enough ‘juice’ to get the transistors to work reliably.
  • there are also limits on how fast we can drive external memory buses.

There have been assorted interesting attempts to push clock speeds ever higher (an AMD CPU ran at 7.12 GHz not long ago; research has seen up to 500 GHz+), generally by overclocking and overvolting:

  • a lot of fun efforts towards extreme cooling (with liquid nitrogen and suchlike).

SLIDE 3

Into the Multicore Era

As far as raw processing capability goes, multicore CPUs are increasingly popular:

  • a divide-and-conquer approach to getting more processing cores (transistors) on a single chip; the potential benefits (more cores) outweigh the losses (slower cores).

In the past, parallel programming was primarily of interest to scientists (including us!) across a range of disciplines:

  • some problems are trivially parallel — e.g. Mandelbrot, render-farms, ...
  • others not so much — e.g. N-body problems, complex-systems simulations, ...
  • most ‘regular’ programming was (and still is to some extent) entirely sequential.

Cannot rely on faster clock speeds as a way of extracting more performance; need to cater for multicore CPUs:

  • leads to the problem of how to parallelise sequential codes (as an incremental change).
  • essentially what led to threads-and-locks.


Philosophy

Parallel computing has generally been used to attain performance:

  • we’re now forced to consider it when programming, so we have to deal with it one way or another.

A generally held view is that concurrency is hard:

  • and as such, should be avoided wherever possible.

On the other hand, the view taken by ourselves (and others!), and presented in CO538, is that concurrency is easy!

  • both views are correct; it depends on how you manage that concurrency ...
  • more specifically, concurrency need not be hard.

The view we try to get across is that concurrency can be used as a fundamental design methodology, not just as a mechanism for extracting performance on modern systems (indeed, concurrency should simplify design and implementation!).

SLIDE 4

Why Threads?

This is what the operating system typically provides as its concurrency abstraction:

  • essentially multiple processes that share the same address space.
  • a fairly flexible abstraction; can be scheduled in a variety of ways.

Threads interact by sharing data in the heap and through OS-provided mechanisms:

  • semaphores, mutexes, pipes, condition variables, ...

Scheduled by the OS on:

  • uniprocessor machines (trivial scheduling).
  • multiprocessor machines (simplistic approach).
  • multiprocessor machines (gang scheduling).

[Diagram: a process’s 0 GB - 4 GB virtual address space holds the text, data and heap segments plus a separate stack for each thread (T0, T1, T2); the threads are interleaved on a single processor (Proc0), or spread across processors (Proc1, Proc2) in various schedules.]


Thread Hazards

Uncontrolled access by threads to data in the heap is likely to result in race hazards:

  • more so in languages that permit aliasing of references (pointers).

Bits of code that modify shared state (data) must do so carefully:

  • by ensuring the mutual exclusion of other threads.
  • by using lock-free and/or wait-free algorithms.
  • by using OS or language mechanisms that incorporate the above.

Worth noting that such locking is only necessary when multiple threads are running concurrently or can preempt each other in unpredictable ways:

  • unfortunately, OSs provide little in the way of support for control over thread scheduling (POSIX threads are what you tend to get; gang scheduling is rare).
  • OSs have always had to deal with concurrency arising from interrupt handling (more in CO527 next term!).

SLIDE 5

Traditional Locking Methods

Semaphores: essentially a non-negative integer value with a wait set, and two operations:

  • wait: if the semaphore value is zero, the process is added to the wait set; else decrement the value and continue.
  • notify: if the wait set is non-empty, wake up one process; else increment the value.
  • the ‘notify’ operation never blocks; the ‘wait’ operation can block, however.

Mutex: a mechanism for mutual exclusion; can be implemented as a semaphore initialised to 1:

  • lock: wait on the semaphore; unlock: notify the semaphore.
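The semaphore described above can be sketched directly on top of Java’s built-in monitors. A minimal sketch for illustration only: the names `SimpleSemaphore`, `semWait`, `semNotify` and `available` are our own, not from any library.

```java
// Counting semaphore sketch: a non-negative value plus a wait set
// (here, the wait set of the object's own monitor).
class SimpleSemaphore {
    private int value;

    SimpleSemaphore(int initial) { value = initial; }

    // wait: if the value is zero, join the wait set; else decrement and continue.
    public synchronized void semWait() {
        boolean interrupted = false;
        while (value == 0) {
            try { wait(); } catch (InterruptedException e) { interrupted = true; }
        }
        value--;
        if (interrupted) Thread.currentThread().interrupt();  // preserve status
    }

    // notify: wake one waiter (if any) and bank the value; never blocks.
    public synchronized void semNotify() {
        value++;
        notify();
    }

    public synchronized int available() { return value; }
}
```

A mutex is then just `new SimpleSemaphore(1)`, with lock = `semWait` and unlock = `semNotify`.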

Spin-lock: a ‘fast’ mutex that does not involve the OS (which would be required to put a thread to sleep or wake one up):

  • attempts to lock on top of an already-held lock spin (100% of a CPU core) until the other thread unlocks.
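The spinning behaviour can be sketched with an atomic test-and-set; `SpinLock` and its methods are invented names for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Spin-lock sketch: burns CPU cycles instead of asking the OS to sleep the thread.
class SpinLock {
    private final AtomicBoolean held = new AtomicBoolean(false);

    public void lock() {
        // spin until we atomically flip 'held' from false to true
        while (!held.compareAndSet(false, true)) {
            Thread.onSpinWait();   // hint to the CPU that we're busy-waiting
        }
    }

    public void unlock() {
        held.set(false);
    }
}
```

Two threads incrementing a shared counter under this lock never lose an update, which is the mutual-exclusion property being bought with the spinning.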


Monitors in Java

Every object in Java has a monitor associated with it.

  • natural use is to “wait for a change in state of an object until notified”.
  • often used with condition variables.

Monitors contain a mutually exclusive lock, which must be held before any action can be performed on the monitor:

  • the operations are wait, notify and notify-all.

Monitors require some degree of concurrency in order to work correctly:

  • when a thread calls ‘wait()’, it will be put to sleep.
  • only another thread calling ‘notify()’ or ‘notifyAll()’ can wake it up.
SLIDE 6

Java Monitor Operations

[Diagram: `synchronized (x) { ... }` acquires the mutex lock of the monitor associated with object ‘x’; threads are let in one at a time (no lock (active) → got mutex lock → no lock (inactive)). `x.wait()` moves the calling thread into the wait-set, which holds suspended threads; `x.notify()` wakes one thread, while `x.notifyAll()` wakes all threads (T1, T2, T3). Waking threads must wait for the mutex, and are eventually let back in (maybe).]

Java threads suffer from spurious wakeup when in the wait-set:

  • caused by an underlying problem (feature) with the POSIX threads mechanism in some OSs ...
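Spurious wakeup is exactly why the waited-for condition must be re-tested in a loop. A sketch of a one-slot buffer whose `wait()` calls are guarded by `while`, not `if` (`OneSlot` is our own illustrative name):

```java
// One-slot buffer: the while loops re-check state after every wakeup,
// so a spurious (or 'wrong') wakeup simply goes back to sleep.
class OneSlot {
    private Object item;   // null means empty

    public synchronized void put(Object v) throws InterruptedException {
        while (item != null) {
            wait();          // guard with 'while', never 'if'
        }
        item = v;
        notifyAll();         // wake any waiting getters
    }

    public synchronized Object get() throws InterruptedException {
        while (item == null) {
            wait();
        }
        Object v = item;
        item = null;
        notifyAll();         // wake any waiting putters
        return v;
    }
}
```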


Monitors in Java

  • The code that calls ‘wait’ must use a try-catch block to catch something called InterruptedException, or throw it upwards.
  • The wait-set is a set — i.e. unordered, not guaranteed to be fair. Entry to synchronized blocks is unordered too.
  • Threads can hold multiple mutex locks (nested ‘synchronized’ blocks); when a thread ‘wait’s on one of the monitors, the associated lock is temporarily released.
  • A thread may acquire the same mutex lock multiple times safely, but there is still a potential for deadlock from cyclic ordering of locks.

SLIDE 7

Java Concurrency Abstractions

The low-level mechanisms of Java (threads and monitors) are not easy to work with:

  • enter higher-level primitives (“java.util.concurrent”).

This provides a range of concurrency classes that can be used to implement synchronous or asynchronous communication, barriers (multi-party synchronisation) and other useful application-level features:

  • someone has to implement such things of course, and needs a comprehensive understanding of how it all works at the low level — computer scientists..!

At the end of the day, we still have the potential for race hazards between threads on shared data:

  • a big frustration is that such bugs are incredibly hard to pin down (and the subject of some interesting research).
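As a taster of these higher-level classes: `SynchronousQueue` behaves much like a zero-buffered occam-π channel, in that a `put` blocks until another thread `take`s, giving synchronous communication without explicit locks. `ChannelDemo` and `rendezvous` are our own illustrative names.

```java
import java.util.concurrent.SynchronousQueue;

// Channel-style rendezvous built from java.util.concurrent.
class ChannelDemo {
    static String rendezvous() throws InterruptedException {
        SynchronousQueue<String> chan = new SynchronousQueue<>();
        Thread writer = new Thread(() -> {
            try {
                chan.put("ping");          // blocks until a reader takes it
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();
        String msg = chan.take();          // synchronises with the put
        writer.join();
        return msg;
    }
}
```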


Other Concurrency Models

Two other concurrency models which are worth considering are:

  • that used by Polyphonic C#, also called C-omega.
  • that used by concurrent Haskell with software transactional memory; this is gaining popularity as it is more comprehensible to J. Random Programmer than locking strategies or wait-free/lock-free algorithms.

Both of these are somewhat different from the occam-π model. Discovering which of the various concurrency abstractions are most useful, and which are most easily understood, is left largely to the reader... We’ve mostly established that the threads-and-locks model doesn’t work:

  • it may be easy to understand, but it is hard to apply correctly and certainly does not scale well (at all).

SLIDE 8

Software Transactional Memory

Provides an alternative to locks in a shared-memory environment:

  • can be applied fairly easily to any language which uses mutex locks to control access to shared state [Shavit and Touitou, 1995].

A well-documented implementation is provided by concurrent Haskell [Harris et al., 2005]. The idea stems from databases, and is based around recording a log of memory transactions, with the ability to commit or rollback.


Software Transactional Memory

Rather than delving into Haskell, where STM would normally be used for protection in I/O monads, we’ll pretend we have a C/Java-like implementation:

    void add_to_run_queue (Process *p)
    {
        atomic {
            p->next = rq;    /* read from shared var */
            rq = p;          /* write to shared var */
        }
    }

The atomic transaction is considered to have succeeded if rq is not changed by another thread. At this sort of level, we require the underlying hardware to have a load-linked/store-conditional instruction pair (as MIPS does); a bit harder on Intel.
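On Intel-style hardware the same effect is usually obtained with compare-and-swap rather than LL/SC. A hedged Java sketch of the run-queue push using `AtomicReference` (`RunQueue` and its members are our own names, not from the lecture):

```java
import java.util.concurrent.atomic.AtomicReference;

// Optimistic retry loop: read the shared head, prepare the update privately,
// commit only if the head is unchanged, otherwise retry -- the same pattern
// as the 'atomic' block above.
class RunQueue {
    static final class Process {
        final String name;
        volatile Process next;
        Process(String name) { this.name = name; }
    }

    private final AtomicReference<Process> rq = new AtomicReference<>(null);

    public void addToRunQueue(Process p) {
        Process head;
        do {
            head = rq.get();                      // read from shared var
            p.next = head;                        // tentative link (private until commit)
        } while (!rq.compareAndSet(head, p));     // commit; fails if rq changed
    }

    public Process head() { return rq.get(); }
}
```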

SLIDE 9

Software Transactional Memory

The resulting atomic block of code generated by the compiler (for MIPS) looks something like this:

            la   $a0, rq
            la   $a1, p
    l1:     ll   $t0, 0($a0)
            lw   $t1, 0($a1)
            sw   $t0, 16($t1)
            sc   $t1, 0($a0)
            beq  $t1, $zero, l1
            sync

  • the load-linked instruction marks the start of an atomic memory operation.
  • if the memory at &rq has changed since the load-linked, the store-conditional fails; in which case, we jump back and retry the operation.

This is a trivial example: it does not require a log of transactions to be maintained.

  • hard to engineer for a very free-range language like C.

Is this algorithm lock-free? Is it wait-free?


Software Transactional Memory

It is easier to engineer STM into a more tightly controlled language, such as Haskell. For complex sequences of code, we will need to record a log, e.g.:

    read mem (A)
    read mem (B)
    write mem (A)
    read mem (C)
    write mem (B)
    write mem (C)

If the transaction on C fails, we might need to undo A and B; we also need to check A and B at the end. The more threads we throw into the mix, the higher the probability that a collision will occur. This approach is better than a locks approach, but does not scale well with a large number of threads, nor with complex atomic transactions:

  • needs a little care in programming to avoid infinite-wait scenarios.
  • various solutions exist, such as Eliot Moss’s open nested transactions (in Java-type languages), but these are harder to use and reason about.

SLIDE 10

Polyphonic C# (C-omega)

An extension of the object-oriented C# language:

  • still has all the existing threads-and-locks mechanisms [Benton et al., 2004].

The extension is a synchronisation mechanism based on the join calculus [Fournet and Gonthier, 1996].

The language mechanism uses chords, which synchronise multiple threads:

    public class Buffer {
        public string Get() & public async Put(string s) {
            return s;
        }
    }


Polyphonic C#

Calls to asynchronous methods do not block the caller, but queue the call. Calls to synchronous methods block until calls for all the other (asynchronous) methods in the chord are available, at which point the chord executes the method body. At most one synchronous call is allowed per chord, so the method body executes in that particular thread:

  • a restriction of the implementation, rather than the calculus.

When all parts of the chord are asynchronous, the method body executes in a new (or pooled) thread. There is no protection against concurrent chord execution, so we may still need to perform some level of locking inside method bodies:

  • the difference is the synchronisation mechanism.
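For comparison, the Buffer chord above can be emulated in plain Java: the asynchronous Put queues its argument and returns immediately, while the synchronous Get blocks until some queued Put is available to join with. `ChordBuffer` is an illustrative name of ours, not C-omega machinery.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Emulating 'string Get() & async Put(string s)': each Get consumes
// exactly one queued Put, blocking if none is pending.
class ChordBuffer {
    private final LinkedBlockingQueue<String> puts = new LinkedBlockingQueue<>();

    public void put(String s) {        // async half: queue the call, never block
        puts.add(s);
    }

    public String get() throws InterruptedException {
        return puts.take();            // sync half: wait for a matching Put
    }
}
```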

SLIDE 11

Chorded Synchronisation

May appear fairly simple at first glance (i.e. something we can already do in occam-π), but gets more complex with inheritance and overriding, e.g.:

    public class Buffer {
        public string Get() & public async Put(string s) {
            return s;
        }
        public string Get() & public async Put(int i) {
            return i.ToString();
        }
    }

A single call to ‘Get’ must essentially offer to synchronise against various chords:

  • if more than one is ready, it makes an arbitrary choice.

We cannot override a synchronous method as asynchronous, or vice versa.


Chorded Synchronisation

Can create some interesting synchronisation patterns, but sometimes complex.

  • in writing a fully synchronised rendezvous, we would like to write:

    public class OccamChannel {
        public int Read() & public void Write(int i) {
            return to Write;
            return i to Read;
        }
    }

But this is clearly a bit messy, and violates the at-most-one-synchronous rule:

  • C-like languages are fairly restricted to single returns.

A slightly nicer syntax (in my opinion) might be:

    return {i,};

SLIDE 12

Chorded Synchronisation

In the object-oriented world of C#, the choice the implementation makes about which thread to run the method body in can affect program behaviour:

  • a problem because threads own locks (as they do in Java), permitting reentrant locking by the same thread, but introducing a potential for deadlock if we choose the wrong thread.

We might think that we can get around this by introducing a rule which states that multiple synchronous calls inherit all locks during the method body:

  • but this is heading towards a performance-damaging implementation.
  • there are also problems with stack-based security (whose stack?) and thread-local variables.

But by doing such things, we run the risk of introducing inconsistencies with the formal semantics (the join calculus); better to restrict.


Chorded Synchronisation

End up having to program half of the OccamChannel synchronisation explicitly:

    public class OccamChannel {
        private class Thunk {
            int wait() & async reply(int v) {
                return v;
            }
        }
        public int Read() {
            Thunk t = new Thunk();
            aRead(t);
            return t.wait();
        }
        public void Write(int i) & private async aRead(Thunk t) {
            t.reply(i);
        }
    }

SLIDE 13

Chorded Synchronisation

Some things are a little tricky to program (because of implementation restrictions); nevertheless, some traditional parallel programming algorithms can be expressed elegantly:

    public class OneCell {
        public OneCell() {
            empty();
        }
        public void Put(object o) & private async empty() {
            contains(o);
        }
        public object Get() & private async contains(object o) {
            empty();
            return o;
        }
    }


Readers and Writers

This is a more complex example (it does CREW-style locking):

    public class ReaderWriter {
        ReaderWriter() {
            idle();
        }
        public void Shared() & async idle() {
            s(1);
        }
        public void Shared() & async s(int n) {
            s(n+1);
        }
        public void ReleaseShared() & async s(int n) {
            if (n == 1) idle(); else s(n-1);
        }
        public void Exclusive() & async idle() { }
        public void ReleaseExclusive() {
            idle();
        }
    }

This, like various other implementations of the algorithm, suffers from a lack of writer priority (readers keep the writer locked out).

giving the writers priority isn’t too hard, unlike in some other implementations.
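For comparison, java.util.concurrent provides the same CREW behaviour ready-made: `ReentrantReadWriteLock` in its fair mode queues newly arriving readers behind any waiting writer, which addresses the reader lock-out problem described above. `CrewDemo` is our own illustrative wrapper.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// CREW locking via java.util.concurrent; 'true' selects the fair
// (queue-ordered) mode, so a stream of readers cannot starve a writer.
class CrewDemo {
    static boolean demo() {
        ReentrantReadWriteLock rw = new ReentrantReadWriteLock(true);

        rw.readLock().lock();                          // Shared()
        boolean readerHeld = rw.getReadLockCount() == 1;
        rw.readLock().unlock();                        // ReleaseShared()

        rw.writeLock().lock();                         // Exclusive()
        boolean exclusive = rw.isWriteLocked();
        rw.writeLock().unlock();                       // ReleaseExclusive()

        return readerHeld && exclusive;
    }
}
```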

SLIDE 14

Readers and Writers

    public class ReaderWriterFair {
        // ... existing ReaderWriter methods ...
        public void ReleaseShared() & async t(int n) {
            if (n == 0) idleExclusive(); else t(n-1);
        }
        public void Exclusive() & async s(int n) {
            t(n);
            wait();
        }
        void wait() & async idleExclusive() { }
    }

This type of synchronisation doesn’t quite exist in CSP nor occam-π, but we can replicate its functionality to a certain point.

What’s missing, essentially, is a multiway synchronisation:

    ALT
      c ? x & d ? y
        P ()

    Q = ((c, d) → P) ✷ ...

  • essentially waiting for both c and d to become ready before accepting either event.


Emulating Chorded Synchronisation

The extended rendezvous mechanism can be used to “hold up” messages.

    ALT
      c ? x & d ? y
        P ()
      d ? y & e ? z
        Q ()
      e ? z
        R ()

    INITIAL INT sel IS -1:
    SEQ
      ALT                          -- select
        c ?? x
          WHILE sel = (-1)
            ALT
              d ? y
                sel := 0
              e ? z
                R ()               -- (cannot read from channel 'c'
                                   --  in this instance of 'R')
        d ?? y
          ALT
            c ? x
              sel := 0
            e ? z
              sel := 1
        e ? z
          R ()
      CASE sel                     -- execute
        0
          P ()
        1
          Q ()
        ELSE
          SKIP

As can be seen, it’s not entirely pleasant, has restrictions, and cannot cater for all situations.

SLIDE 15

Formally Based Models

The CSP model can produce programs with deadlocks, livelocks and starvation; however, we have design patterns for avoiding these:

  • the occam-π implementation guarantees freedom from race-hazard errors (no shared data).
  • the Haskell implementation (Brown’s CHP [Brown, 2008]) is also race-hazard free (by the nature of the language: functional), with features such as conjoined events (multiway synchronisation).
  • implementations for languages such as C (CCSP/CIF, C++CSP) and Java (JCSP, CTJ) cannot guarantee freedom from race hazards (the programmer has to get it right).
  • other language support (e.g. Py-CSP) varies depending on the compiler/interpreter used; performance issues.

The join-calculus model is elegant, but language implementations are limited (C-omega and domain-specific).


Lock Based Models

Lock-based models (semaphores, mutexes, monitors) are simple in concept, but the interactions between multiple locks get complex quickly:

  • semaphores and suchlike are foremost a synchronisation mechanism, not a method for communication; the same is mostly true of monitors.
  • deadlocks are still an issue; guaranteeing freedom from them, and debugging, are still hard to impossible.

They are mostly reliant on shared-memory implementations (as threads communicate via the heap):

  • scaling to large numbers of cores, or across non-uniform memory (NUMA) multiprocessors and clusters, is complex.

This tends to be what is exposed in programming languages, due to the underlying OS thread mechanisms.

SLIDE 16

Transaction Based Approaches

Still mostly reliant on shared-memory architectures, but a data-centric approach:

  • avoids the fundamental problem of deadlocks between threads, and is simple to understand and to use (for the programmer at least).
  • the costs lie in scalability and efficient algorithmic expression (avoiding livelock by infinite retry).

Effectively provides a method for turning ordinary sequential algorithms (in language X) into lock-free concurrent algorithms. Scalability is the main issue: the more code in an ‘atomic’ block, or the higher the number of interacting threads, the more likely it is that things will be rolled back and retried.

  • recent work by Eliot Moss and co. on closed-nested and open-nested transactions improves performance, but at the expense of simplicity.


Other Things

Languages that embody support for concurrent programming in some way are a good thing!

  • Erlang provides an asynchronous message-passing implementation.
  • Google’s Go concurrency mechanisms are similar to occam-π’s in many respects (a subset relation); however, the Go language is substantially more flexible than occam-π.

There are an increasing number of data-parallel implementations for specific hardware (e.g. GPUs with NVidia’s CUDA):

  • designed (primarily) for large number-crunching operations that are trivially parallelised (e.g. graphics, protein folding, ...).

SLIDE 17

Pedagogy

We’re often asked the question “why learn occam-π, when we could be learning threads in C and Java, etc.?”. The short answer is that we’re not trying to train you in the use of specific technologies, though these are a necessary vehicle for learning:

  • we’re teaching you about the nature of concurrent programming and conceptual approaches to designing concurrent systems.
  • obviously relevant, given the multicore era and the tendency towards NUMA architectures.

The fact that we use occam-π as a vehicle is deliberate:

  • although it may be hard to get a program working initially (a learning curve for the language, plus learning of concurrency ideas), it is much more likely that the resulting program will work as you expect.
  • lightweight implementations allow concurrency to be used freely, focusing our teaching towards expression and not performance.


References

Benton, N., Cardelli, L., and Fournet, C. (2004). Modern concurrency abstractions for C#. ACM Transactions on Programming Languages and Systems, 26(5):769–804.

Brown, N. (2008). Communicating Haskell Processes: Composable Explicit Concurrency Using Monads. In Communicating Process Architectures 2008. IOS Press.

Fournet, C. and Gonthier, G. (1996). The reflexive CHAM and the join-calculus. In Proceedings of the 23rd ACM Symposium on Principles of Programming Languages, pages 372–385. ACM Press.

Harris, T., Marlow, S., Peyton-Jones, S., and Herlihy, M. (2005). Composable Memory Transactions. In PPoPP ’05: Proceedings of the tenth ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, pages 48–60, New York, NY, USA. ACM Press.

Hoare, C. (1985). Communicating Sequential Processes. Prentice-Hall, London. ISBN: 0-13-153271-5.

Milner, R. (1999). Communicating and Mobile Systems: the Pi-Calculus. Cambridge University Press. ISBN: 0-52165-869-1.

Shavit, N. and Touitou, D. (1995). Software Transactional Memory. In PODC ’95: Proceedings of the fourteenth annual ACM symposium on Principles of Distributed Computing, pages 204–213, New York, NY, USA. ACM Press.