A Runtime System for Software Lock Elision



slide-1
SLIDE 1

A Runtime System for Software Lock Elision

Amitabha Roy (U. Cambridge) Steven Hand (U. Cambridge) Tim Harris (MSR Cambridge)

slide-2
SLIDE 2

Motivation

Multicores mean application scalability is key to good performance

Scaling programs synchronising with locks

Existing software systems use locks

Locks are very popular with programmers

Start with a data-race-free, correctly synchronised lock-based program

Use transactional memory opportunistically while retaining the locks

slide-3
SLIDE 3

Critical Sections & Speculation

Thread 1: Lock(L); do stuff …; Unlock(L)

Thread 2: Lock(L); do stuff …; Unlock(L)

Serialize

slide-4
SLIDE 4

Critical Sections & Speculation

Thread 1: Lock(L); do stuff …; Unlock(L)

Thread 2: Lock(L); do stuff …; Unlock(L)

Rajwar et al: Speculative Lock Elision … Micro 2001

Relies on Hardware Transactional Memory (TM) support to enable optimistic concurrency control

Exploits disjoint-access parallelism (red-black trees, hash tables, etc)

slide-5
SLIDE 5

Critical Sections & Speculation

Thread 1: Lock(L); do stuff …; Unlock(L)

Thread 2: Lock(L); do stuff …; Unlock(L)

Serialize

Thread 1: Lock(L); do stuff …; Unlock(L)

Thread 2: Lock(L); do stuff …; Unlock(L)

Can coexist (excessive conflicts, I/O, wait conditions, ...)

No need for new semantics – start from lock-based programs

This paper: Software Lock Elision (SLE); no special hardware required

slide-6
SLIDE 6

Coming Up ...

Speculation in software

Retaining lock semantics & behaviour

Implementation and evaluation

Interfacing to the runtime

slide-7
SLIDE 7

Speculation

Speculating threads and memory

Isolate using thread-private copies

Write back changes atomically

Well-developed ideas in the Software Transactional Memory (STM) field

We use a design similar to TL2

Dice et al: Transactional Locking II … DISC 2006

slide-8
SLIDE 8

Speculation: Shadowing

Shared memory: Y = 10

Lock(L) elided … X = Y + 1 … Unlock(L)

slide-9
SLIDE 9

Speculation: Shadowing

Shared memory: Y = 10

Metadata table: Hash(address of Y) → 42

Lock(L) elided … X = Y + 1 … Unlock(L)

slide-10
SLIDE 10

Speculation: Shadowing

Shared memory: Y = 10

Metadata table: Hash(address of Y) → 42

Thread-private log: <Y V42 10>

Lock(L) elided … X = Y + 1 … Unlock(L)

slide-11
SLIDE 11

Speculation: Shadowing

Shared memory: Y = 10, X = 99

Metadata table: Hash(address of Y) → 42, Hash(address of X) → 50

Thread-private log: <Y V42 10>, <X V50 11>

Lock(L) elided … X = Y + 1 … Unlock(L)
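The shadowing steps on these slides can be sketched in C roughly as follows. This is a minimal illustration and not the authors' runtime: the table size, the hash, the log layout and the names `meta_index`, `spec_read` and `spec_write` are all assumptions made for the example. Reads record the version seen in the metadata table; writes go only to the thread-private log.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1024   /* hypothetical metadata table size */
#define LOG_MAX    64     /* hypothetical per-thread log capacity */

/* One version word per hash bucket; an even value means "not locked". */
static _Atomic uint64_t metadata[TABLE_SIZE];

/* Assumed hash: drop the low alignment bits, then fold into the table. */
static size_t meta_index(const void *addr) {
    return ((uintptr_t)addr >> 3) % TABLE_SIZE;
}

/* Log entry in the slides' <address, version, value> form. */
typedef struct { long *addr; uint64_t version; long value; bool dirty; } log_entry;
typedef struct { log_entry e[LOG_MAX]; size_t n; } spec_log;

static log_entry *find_entry(spec_log *log, long *addr) {
    for (size_t i = 0; i < log->n; i++)
        if (log->e[i].addr == addr) return &log->e[i];
    return NULL;
}

/* Speculative read: returns false when the caller should abort and restart. */
static bool spec_read(spec_log *log, long *addr, long *out) {
    log_entry *hit = find_entry(log, addr);
    if (hit) { *out = hit->value; return true; }   /* read our own copy */
    if (log->n == LOG_MAX) return false;
    _Atomic uint64_t *m = &metadata[meta_index(addr)];
    uint64_t v = atomic_load(m);
    if (v & 1) return false;                       /* odd: locked by a committer */
    long val = *addr;   /* plain load for brevity; a real runtime needs more care */
    if (atomic_load(m) != v) return false;         /* version moved under us */
    log->e[log->n++] = (log_entry){ addr, v, val, false };
    *out = val;
    return true;
}

/* Speculative write: shared memory is untouched until commit. */
static bool spec_write(spec_log *log, long *addr, long value) {
    log_entry *hit = find_entry(log, addr);
    if (hit) { hit->value = value; hit->dirty = true; return true; }
    if (log->n == LOG_MAX) return false;
    uint64_t v = atomic_load(&metadata[meta_index(addr)]);
    if (v & 1) return false;
    log->e[log->n++] = (log_entry){ addr, v, value, true };
    return true;
}
```

With Y = 10 at version 42 and X = 99 at version 50, running `X = Y + 1` through `spec_read` and `spec_write` leaves X at 99 in shared memory and produces the two log entries shown on the slide.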

slide-12
SLIDE 12

Speculation: Commit

Odd version numbers used to represent locked objects

Manipulate with Compare and Swap (CAS) for atomicity

Lock(L) elided … X = Y + 1 … Unlock(L) → commit

Dirty: <X V50 11>

Clean: <Y V42 10>

Commit (2PL): Lock, Verify, Write, Unlock

slide-13
SLIDE 13

Speculation: Commit

1. Hash(X): 50 → 51 (CAS)

Abort speculation and restart on conflict

Lock(L) elided … X = Y + 1 … Unlock(L) → commit

Dirty: <X V50 11>

Clean: <Y V42 10>

Commit (2PL): Lock, Verify, Write, Unlock

slide-14
SLIDE 14

Speculation: Commit

1. Hash(X): 50 → 51 (CAS)

2. Hash(Y): == 42 ?

Abort speculation and restart on conflict

Lock(L) elided … X = Y + 1 … Unlock(L) → commit

Dirty: <X V50 11>

Clean: <Y V42 10>

Commit (2PL): Lock, Verify, Write, Unlock

slide-15
SLIDE 15

Speculation: Commit

1. Hash(X): 50 → 51 (CAS)

2. Hash(Y): == 42 ?

3. X: 99 → 11 (write)

Abort speculation and restart on conflict

Lock(L) elided … X = Y + 1 … Unlock(L) → commit

Dirty: <X V50 11>

Clean: <Y V42 10>

Commit (2PL): Lock, Verify, Write, Unlock

slide-16
SLIDE 16

Speculation: Commit

1. Hash(X): 50 → 51 (CAS)

2. Hash(Y): == 42 ?

3. X: 99 → 11 (write)

4. Hash(X): 51 → 52 (CAS)

Abort speculation and restart on conflict

Lock(L) elided … X = Y + 1 … Unlock(L) → commit

Dirty: <X V50 11>

Clean: <Y V42 10>

Commit (2PL): Lock, Verify, Write, Unlock
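The four commit steps (lock, verify, write, unlock) can be sketched in C as below. This is a hedged illustration, not the authors' code: `commit`, `unlock_dirty`, `meta_index` and the log layout are hypothetical, and the sketch assumes every log entry maps to a distinct metadata slot. The final unlock uses a plain atomic store to the next even version, which is safe here only because the committer owns the slot at that point.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1024   /* hypothetical metadata table size */

/* One version word per hash bucket; odd means locked by a committer. */
static _Atomic uint64_t metadata[TABLE_SIZE];

/* Assumed hash: drop the low alignment bits, then fold into the table. */
static size_t meta_index(const void *addr) {
    return ((uintptr_t)addr >> 3) % TABLE_SIZE;
}

typedef struct { long *addr; uint64_t version; long value; bool dirty; } log_entry;

/* Release the dirty entries in [0, upto): bump 0 rolls back, bump 2 publishes. */
static void unlock_dirty(const log_entry *log, size_t upto, uint64_t bump) {
    for (size_t i = 0; i < upto; i++)
        if (log[i].dirty)
            atomic_store(&metadata[meta_index(log[i].addr)], log[i].version + bump);
}

/* Returns true on success; false means abort the speculation and restart. */
static bool commit(const log_entry *log, size_t n) {
    /* 1. Lock: CAS each dirty slot from its logged even version to odd. */
    for (size_t i = 0; i < n; i++) {
        if (!log[i].dirty) continue;
        uint64_t expected = log[i].version;
        if (!atomic_compare_exchange_strong(&metadata[meta_index(log[i].addr)],
                                            &expected, log[i].version + 1)) {
            unlock_dirty(log, i, 0);
            return false;
        }
    }
    /* 2. Verify: clean (read-only) entries must still be at the logged version. */
    for (size_t i = 0; i < n; i++) {
        if (log[i].dirty) continue;
        if (atomic_load(&metadata[meta_index(log[i].addr)]) != log[i].version) {
            unlock_dirty(log, n, 0);
            return false;
        }
    }
    /* 3. Write: copy the private values back to shared memory. */
    for (size_t i = 0; i < n; i++)
        if (log[i].dirty) *log[i].addr = log[i].value;
    /* 4. Unlock: move each dirty slot to the next even version. */
    unlock_dirty(log, n, 2);
    return true;
}
```

For the slide's log (`<X V50 11>` dirty, `<Y V42 10>` clean) this moves Hash(X) from 50 to 51 to 52, leaves Hash(Y) at 42, and writes 11 to X.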

slide-17
SLIDE 17

Coming Up ...

Speculation in software

Retaining lock semantics & behaviour

Implementation and evaluation

Interfacing to the run-time

slide-18
SLIDE 18

Semantics

Programmers should see the same semantics with SLE as when using locks

This means:

Lock acquisition must be allowed

No constraints on memory recycling

Solve this via insertion of Safe() calls:

Safe(O): while(metadata(O) is locked) wait;

We also want to ensure there’s no unexpected (i.e. additional) blocking on other threads

Safe(O) must not wait for any other thread

slide-19
SLIDE 19

Semantics – Application Locks

Acquisition of critical section locks

Need to reconcile with speculating threads

Init: X = Y = 0

Thread 1: Lock(L) elided; X = Y + 1; Unlock(L)

Thread 2: Lock(L) acquired; Y = X + 1; Unlock(L)

Can X == Y ?

slide-20
SLIDE 20

Semantics – Application Locks

Acquisition of critical section locks

Need to reconcile with speculating threads

Init: X = Y = 0

Thread 1: Lock(L) elided; X = Y + 1 {Y = 0, X = 1}; Unlock(L)

Thread 2: Lock(L) acquired; Y = X + 1 {X = 0, Y = 1}; Unlock(L)

X == Y == 1 !!!

slide-21
SLIDE 21

Semantics – Application Locks

Basic idea: add a version number to locks

Lock is a shared memory object

Lock(L) → Lock(L); version(L)++

Unlock(L) → version(L)++; Unlock(L)

Elide(L) → L.version even: Log(L.version)

Check for non speculative access

Use Safe(O) as defined before

Additional complexity to handle reader locks

No information required about other threads

Roy et al: Brief Announcement: A Transactional Approach to Lock Scalability… SPAA’08
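A minimal C sketch of the lock-versioning idea, under stated assumptions: a spin flag stands in for the application's real lock, the `vlock_*` names are hypothetical, and the commit-time check in `vlock_validate` is an inference from "lock is a shared memory object" (the logged version is treated like any other read-set entry). Reader locks are not handled.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    atomic_flag held;           /* stand-in for the application's real lock */
    _Atomic uint64_t version;   /* odd while held non-speculatively */
} versioned_lock;

static void vlock_init(versioned_lock *l) {
    atomic_flag_clear(&l->held);
    atomic_init(&l->version, 0);
}

/* Lock(L) becomes: Lock(L); version(L)++ */
static void vlock_acquire(versioned_lock *l) {
    while (atomic_flag_test_and_set(&l->held)) { /* spin until free */ }
    atomic_fetch_add(&l->version, 1);            /* even -> odd */
}

/* Unlock(L) becomes: version(L)++; Unlock(L) */
static void vlock_release(versioned_lock *l) {
    atomic_fetch_add(&l->version, 1);            /* odd -> even */
    atomic_flag_clear(&l->held);
}

/* Elide(L): proceed only if the version is even, and log it. */
static bool vlock_elide(versioned_lock *l, uint64_t *logged) {
    uint64_t v = atomic_load(&l->version);
    if (v & 1) return false;                     /* lock is really held */
    *logged = v;
    return true;
}

/* Assumed commit-time check: any real acquisition since the elision
 * changes the version and so invalidates the speculation. */
static bool vlock_validate(versioned_lock *l, uint64_t logged) {
    return atomic_load(&l->version) == logged;
}
```

In the X/Y example, Thread 1 would log version 0 when eliding; Thread 2's acquire and release move the version to 1 and then 2, so Thread 1's validation fails and it restarts instead of committing X = 1.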

slide-22
SLIDE 22

Semantics – Privatisation

Memory no longer protected by a lock

Thread 1: Lock(L) elided; node = List_head(list); List_delete(node); Unlock(L); free(node)

Thread 2: Lock(L) elided; node = List_head(list); node.value = 42; Unlock(L)

slide-23
SLIDE 23

Semantics – Privatisation

Memory no longer protected by a lock

Thread 1: Lock(L) elided; node = List_head(list); List_delete(node); Unlock(L); free(node)

Thread 2: Lock(L) elided; node = List_head(list); node.value = 42; Unlock(L)

Memory corruption

Unmanaged environment: no garbage collector

slide-24
SLIDE 24

Semantics – Privatisation

Memory no longer protected by a lock

OK! ☺

Thread 1: Lock(L) elided; node = List_head(list); List_delete(node); Unlock(L); Safe(node); free(node)

Thread 2: Lock(L) elided; node = List_head(list); node.value = 42; Unlock(L)

slide-25
SLIDE 25

Semantics – Avoiding Blocking

Locked metadata blocks non-speculative threads

Execution behaviour changes:

Can block on other threads even if not at Lock(L)

Thread 1 (Lock(L) not elided): do stuff …; if (error) { signal(FATAL_EXIT); do cleanup }; Unlock(L)

Thread 2 (Lock(L) elided): do stuff …; Unlock(L)

Exit on SIG; blocked on held metadata

Example from Apache webserver

slide-26
SLIDE 26

Semantics – Avoiding Blocking

Harris et al: Revocable Locks for Non-Blocking Programming … PPoPP’05

We use revocable locks:

Allow lock to be revoked, displacing lock holder’s execution to a special cleanup path

Call revoke(O, v) if Safe(O) finds O locked at version v

revoke(O, v) {
    CAS(Metadata(O), v, v + 2);
    signal(previous holder);
    At this point we own the metadata
}

commit {
    …
    Checkpoint: setjmp
    …
    if (Metadata(O) == expected)
        make changes (copy new data)
    …
}

slide-27
SLIDE 27

Semantics – Avoiding Blocking

Signal handler: longjmp

revoke(O, v) {
    CAS(Metadata(O), v, v + 2);
    signal(previous holder);
    At this point we own metadata
}

commit {
    …
    Checkpoint: setjmp
    …
    if (Metadata(O) == expected)
        make changes (copy new data)
    …
}

slide-28
SLIDE 28

Semantics – Avoiding Blocking

Signal handler: longjmp

How to synchronously signal?

We use a custom signalling service implemented as a kernel module

revoke(O, v) {
    CAS(Metadata(O), v, v + 2);
    signal(previous holder);
    At this point we own the lock
}

commit {
    …
    Checkpoint: setjmp
    …
    if (Metadata(O) == expected)
        make changes (copy new data)
    …
}
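The version arithmetic behind revoke(O, v) can be sketched in C as follows. This is a deliberately reduced illustration with hypothetical names (`meta_revoke`, `meta_safe`, `commit_write_if_owner`): the signal to the previous holder and the setjmp/longjmp cleanup path are omitted, and releasing the slot at v + 3 after a successful revoke is an assumption. Without the synchronous signal, the committer's check-then-write is not atomic, which is the gap the signalling service is there to close.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Try to take over a metadata word locked at odd version v.
 * On success the caller owns it at v + 2 (still odd, so still locked). */
static bool meta_revoke(_Atomic uint64_t *meta, uint64_t v) {
    return atomic_compare_exchange_strong(meta, &v, v + 2);
}

/* Non-blocking Safe(O): never waits for the committing thread. */
static void meta_safe(_Atomic uint64_t *meta) {
    for (;;) {
        uint64_t v = atomic_load(meta);
        if (!(v & 1)) return;               /* even: nobody holds it */
        if (meta_revoke(meta, v)) {
            /* The real system signals the previous holder here. */
            atomic_store(meta, v + 3);      /* assumed release: next even version */
            return;
        }
        /* CAS lost a race (holder unlocked or someone else revoked): retry. */
    }
}

/* Committer side: only write while the slot is still at the version we set. */
static bool commit_write_if_owner(_Atomic uint64_t *meta, uint64_t expected,
                                  long *addr, long value) {
    if (atomic_load(meta) != expected) return false;   /* we were revoked */
    *addr = value;
    return true;
}
```

If a committer holds Hash(X) at 51, `meta_safe` moves it to 53 and then 54, and the committer's later write guarded by `expected == 51` is refused.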

slide-29
SLIDE 29

Semantics – Avoiding Blocking

Problem: we know nothing of target thread state

Can send an inter-processor interrupt (IPI)

Signal delivery on return to userspace

slide-30
SLIDE 30

Semantics – Avoiding Blocking

Problem: we know nothing of target thread state

Can send an inter-processor interrupt (IPI)

Signal delivery on return to userspace

Source thread: set signal pending in target; Cpu = last_running_on(target); Count = IPI_count(Cpu)

Target thread

slide-31
SLIDE 31

Semantics – Avoiding Blocking

Problem: we know nothing of target thread state

Can send an inter-processor interrupt (IPI)

Signal delivery on return to userspace

Source thread: set signal pending in target; Cpu = last_running_on(target); Count = IPI_count(Cpu); Send_IPI(Cpu)

Target thread: received; kernel to userspace transition

slide-32
SLIDE 32

Semantics – Avoiding Blocking

Problem: we know nothing of target thread state

Can send an inter-processor interrupt (IPI)

Signal delivery on return to userspace

Source thread: set signal pending in target; Cpu = last_running_on(target); Count = IPI_count(Cpu); Send_IPI(Cpu); until IPI_count(Cpu) != Count

Target thread: received; kernel to userspace transition

OK for thread to be swapped out/migrated!

slide-33
SLIDE 33

Coming Up ...

Speculation in software

Retaining lock semantics & behaviour

Implementation and evaluation

Interfacing to the run-time

slide-34
SLIDE 34

Implementation

Runtime ~ 2000 lines of C code

x86 and Itanium

Targets C/C++ applications

Extra features

Variable sized objects

Version number embedded in objects

Hash index

Per lock tuning parameters

Control cost of hash indexing

Control optimism

slide-35
SLIDE 35

Evaluation

Performance

SLE removes synchronisation bottlenecks

Design goals

Preserve blocking behavior

slide-36
SLIDE 36

STAMP

[Figure: STAMP on a 16 way x86. Speedup by benchmark (kmeans, genome, vacation, ssca2, intruder, labyrinth). Series: Sequential, SLE.]

slide-37
SLIDE 37

STAMP

[Figure: STAMP on a 16 way x86. Speedup by benchmark (kmeans, genome, vacation, ssca2, intruder, labyrinth). Series: Sequential, SLE, TL2.]

slide-38
SLIDE 38

STAMP

With larger hash

Fix hash function to be alignment agnostic

[Figure: STAMP on a 16 way x86 -- with SLE tuning. Speedup by benchmark (kmeans, genome, vacation, ssca2, intruder, labyrinth). Series: Sequential, SLE, TL2.]

slide-39
SLIDE 39

Multiprogramming

[Figure: STAMP:Vacation - 2 way x86. Runtime (normalised to sequential) against threads (1, 2, 4, 8, 16, 32). Series: SLE, SLE-norestart.]

slide-40
SLIDE 40

Coming Up ...

Speculation in software

Retaining lock semantics & behaviour

Implementation / evaluation

Interfacing to the run-time

slide-41
SLIDE 41

Programmer Interface

Manual placement of calls into SLE runtime

For declaring and acquiring locks

For thread private copies

For privatisation

[Diagram: application synchronising with locks: Source Code → Manual Changes → Compile → Binary + SLE Runtime]

slide-42
SLIDE 42

Future Work: Automation

[Diagram: application synchronising with locks. Boxes: Source Code, Manual Changes, Compile, Compiler, STM aware Compiler, Binary, Dynamic Binary Rewriting; each route ends in Binary + SLE Runtime.]

slide-43
SLIDE 43

Future Work: Profiling

[Diagram: application synchronising with locks. Boxes: Source Code, Manual Changes, Compile, Compiler, STM aware Compiler, Binary, Dynamic Binary Rewriting, Profile using PinCS, Contended Locks with DAP, PGO; each route ends in Binary + SLE Runtime.]

Roy et al: Exploring the limits of disjoint access parallelism … HotPar 2009

slide-44
SLIDE 44

Conclusion

Software Lock Elision

Off the shelf microprocessors

STM to manage speculation

Retain semantics of locks

STM reconciles with locks

Block only when lock is held

Revocable locks in software

slide-45
SLIDE 45

Backups

slide-46
SLIDE 46

Atomic Blocks

Atomic blocks ≠ transactional memory

Just one of the (very popular) ways to expose transactions to the programmer

Lock Elision subsumes atomic blocks

Atomic { } == Lock(big global lock) { } Unlock(big global lock)

Could easily build atomic blocks over SLE

Approach followed for evaluations with STAMP
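As a rough illustration of the equivalence above, an atomic block can be written in C as a macro that brackets its body with one global lock. Everything here is hypothetical: `ATOMIC`, `global_lock` and `global_unlock` are made-up names, and a plain spin flag stands in for a lock that the SLE runtime would elide.

```c
#include <stdatomic.h>

/* Stand-in for the "big global lock"; under SLE this acquisition is the
 * one the runtime would attempt to elide. */
static atomic_flag big_global_lock = ATOMIC_FLAG_INIT;

static void global_lock(void) {
    while (atomic_flag_test_and_set(&big_global_lock)) { /* spin */ }
}

static void global_unlock(void) {
    atomic_flag_clear(&big_global_lock);
}

/* ATOMIC { body } expands to a one-iteration loop: the lock is taken before
 * the body runs and released in the loop's step expression afterwards.
 * Caveat: leaving the body with break, return or goto skips the unlock. */
#define ATOMIC \
    for (int once_ = (global_lock(), 1); once_; once_ = (global_unlock(), 0))
```

A block then reads `ATOMIC { counter += 1; }`, and the body executes exactly once with the global lock held.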

slide-47
SLIDE 47

Related Work

Welc et al, ECOOP 2008

Combine monitors and transactions in Java

Use the GC in the Java runtime to get around privatisation problems

Do not optimise for reader locks

Do not retain blocking semantics

Rossbach et al, SOSP 2007

Cxspinlock in the Linux kernel

Lock elision, depends on HTM but declarative

Does not need to solve software-specific problems but would only run on a simulator ☺

slide-48
SLIDE 48

Suitability for Lock Elision

[Figure: suitability for lock elision plotted against contention (low to high) and disjoint access parallelism (low to high); labelled examples: counter, rbtree.]

slide-49
SLIDE 49

Pending Signals

SIGHUP < … < SIGALRM < … SIGUSR1

slide-50
SLIDE 50

Semantics – Avoiding Blocking

Problem: we know nothing of target thread state

Can send an inter-processor interrupt (IPI)

Signal delivery on return to userspace

Source thread: set signal pending in target; Cpu = last_running_on(target); Count = IPI_count(Cpu); Send_IPI(Cpu); until IPI_count(Cpu) != Count

Target thread: received; kernel to userspace transition

OK for thread to be swapped out/migrated!

irqsave irqs blocked