SLIDE 1

The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Author: Thomas E. Anderson
Presenter: Bin Lin

Department of Computer Science

Nov 25, 2013

Bin Lin (Department of Computer Science) CS533 Fall 2013 Nov 25, 2013 1 / 31

SLIDE 2

Outline

◮ Introduction
◮ Multiprocessor architectures overview
◮ Simple approaches to spin-waiting
◮ Software alternatives
◮ Queueing approach
◮ Evaluation
◮ Hardware solutions
◮ Conclusions

SLIDE 3

Introduction

◮ Shared-memory multiprocessors

A variety of architectures

◮ Mutual exclusion

Software
Hardware

◮ This paper focuses on spin locks

◮ Goals

Scalable
Low-latency

SLIDE 4

Multiprocessor Architectures Overview

◮ Two dimensions

Interconnect type (bus or multistage network)
Cache coherence (CC) strategy

SLIDE 5

Multiprocessor Architectures Overview

◮ Interconnect type – multistage interconnection network

SLIDE 6

Multiprocessor Architectures Overview

◮ Interconnect type – single bus

SLIDE 7

Multiprocessor Architectures Overview

◮ CC strategy – with CC

SLIDE 8

Multiprocessor Architectures Overview

◮ CC strategy – without CC

SLIDE 9

Multiprocessor Architectures Overview

◮ Two dimensions

Interconnect type (bus or multistage network)
Cache coherence (CC) strategy

◮ Six architectures considered in this paper

Multistage interconnection network without CC
Multistage interconnection network with invalidation-based CC using remote directories

Bus without CC
Bus with snoopy write-through invalidation-based CC
Bus with snoopy write-back invalidation-based CC
Bus with snoopy distributed-write CC

SLIDE 10

Multiprocessor Architectures Overview

◮ Interconnection network with CC using directories

SLIDE 11

Multiprocessor Architectures Overview

◮ Two dimensions

Interconnect type (bus or multistage network)
Cache coherence (CC) strategy

◮ Six architectures considered in this paper

Multistage interconnection network without CC
Multistage interconnection network with invalidation-based CC using remote directories

Bus without CC
Bus with snoopy write-through invalidation-based CC
Bus with snoopy write-back invalidation-based CC
Bus with snoopy distributed-write CC

SLIDE 12

Multiprocessor Architectures Overview

◮ Write-back vs write-through vs distributed-write

SLIDE 13

Simple Approaches to Spin-waiting

Spin on Test-and-Set

while (TestAndSet(lock) = BUSY);
<critical section>
lock := CLEAR;

◮ Problems

The lock holder must contend with spinning processors for exclusive access to the lock location.
Each spinning processor consumes bus or network bandwidth.

◮ Tradeoff: the more frequently a processor polls, the sooner it can acquire the lock once it is free, but the more it disrupts other processors.
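
A minimal Python sketch of spin on test-and-set, for illustration only. Python exposes no test-and-set instruction, so the `LockWord` class (a hypothetical name) emulates its atomicity with a `threading.Lock`; `run_workers` is likewise an assumed driver, not something from the paper.

```python
import threading

CLEAR, BUSY = 0, 1


class LockWord:
    """One shared memory word. Assumption: a threading.Lock stands in for
    the hardware atomicity of test-and-set."""

    def __init__(self):
        self.value = CLEAR
        self._atomic = threading.Lock()

    def test_and_set(self):
        # Atomically store BUSY and return the previous value.
        with self._atomic:
            old, self.value = self.value, BUSY
            return old

    def clear(self):
        # The store also goes through the emulated atomic path so that it
        # cannot interleave with a test-and-set in progress.
        with self._atomic:
            self.value = CLEAR


def acquire(lock):
    # Every iteration is a read-modify-write on the shared lock location.
    while lock.test_and_set() == BUSY:
        pass


def release(lock):
    lock.clear()


def run_workers(num_threads=4, iterations=200):
    """Hypothetical driver: each thread increments a counter under the lock."""
    lock, counter = LockWord(), [0]

    def worker():
        for _ in range(iterations):
            acquire(lock)
            counter[0] += 1  # critical section
            release(lock)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Each pass through the loop in `acquire` writes to the shared location, which is where the contention with the lock holder comes from.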

SLIDE 14

Simple Approaches to Spin-waiting

Spin on Read

while (lock = BUSY or TestAndSet(lock) = BUSY);
<critical section>
lock := CLEAR;

◮ It is a good idea, isn’t it?

◮ Problems

There is a gap between detecting that the lock has been released and attempting to acquire it, so several processors can see the lock free and all issue a test-and-set.
Cached copies of the lock value are invalidated by a test-and-set instruction even if the value is not changed.
Invalidation-based CC requires O(P) bus or network cycles to broadcast a value to P waiting processors.
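
A small Python model of spin on read (test-and-test-and-set), offered as a sketch. As an assumption, atomic test-and-set is emulated with a `threading.Lock`; `LockWord`, `try_acquire`, and the `tas_count` counter are hypothetical names added only to make the number of test-and-set operations visible.

```python
import threading

CLEAR, BUSY = 0, 1


class LockWord:
    """Shared lock word with an emulated atomic test-and-set (assumption)."""

    def __init__(self):
        self.value = CLEAR
        self.tas_count = 0  # instrumentation: test-and-set operations issued
        self._atomic = threading.Lock()

    def test_and_set(self):
        with self._atomic:
            self.tas_count += 1
            old, self.value = self.value, BUSY
            return old

    def clear(self):
        with self._atomic:
            self.value = CLEAR


def try_acquire(lock):
    """One iteration of the spin-on-read loop; True if the lock was taken."""
    if lock.value == BUSY:
        # Ordinary read: with coherent caches this would hit in the local
        # cache while the lock stays held, so no test-and-set is issued.
        return False
    # The lock looked free; only now attempt the expensive test-and-set.
    return lock.test_and_set() == CLEAR


def acquire(lock):
    while not try_acquire(lock):
        pass


def release(lock):
    lock.clear()
```

While the lock is held, repeated `try_acquire` calls leave `tas_count` unchanged; the test-and-set traffic appears only in the window right after a release.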

SLIDE 15

Simple Approaches to Spin-waiting – Performance

SLIDE 16

Simple Approaches to Spin-waiting – Performance

◮ Quiescence time for spin on read

SLIDE 17

Simple Approaches to Spin-waiting – Performance

◮ Why quiescence is so slow for spin on read

When the lock is released, all cached copies are invalidated
Subsequent reads by all processors incur read misses
Many processors see the lock free
Each of them tries to execute test-and-set
All but one attempt fails
When the last processor has done its test-and-set, every other processor takes one more read miss and then quiesces

SLIDE 18

Software Alternatives

The author presents five software alternatives:

◮ Four based on CSMA network protocols, which differ by:

Where to wait: after the lock has been released, or after every separate access to the lock
Whether the size of the delay is set statically or dynamically

◮ One novel approach—explicit queueing

SLIDE 19

Software Alternatives – Where to Delay

Delay after Spinner Notices Released Lock

while (lock = BUSY or TestAndSet(lock) = BUSY)
begin
  while (lock = BUSY);
  Delay();
end;

Delay between Each Memory Reference

while (lock = BUSY or TestAndSet(lock) = BUSY)
  Delay();
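
The two delay placements can be sketched in Python as follows. This is an illustrative model, not the paper's implementation: test-and-set is emulated with a `threading.Lock`, and the `delay` callable, `LockWord`, and both function names are assumptions introduced here.

```python
import threading
import time

CLEAR, BUSY = 0, 1


class LockWord:
    """Shared lock word with an emulated atomic test-and-set (assumption)."""

    def __init__(self):
        self.value = CLEAR
        self._atomic = threading.Lock()

    def test_and_set(self):
        with self._atomic:
            old, self.value = self.value, BUSY
            return old

    def clear(self):
        with self._atomic:
            self.value = CLEAR


def acquire_delay_after_release(lock, delay):
    """Spin on reads while the lock is held; delay only once it looks free."""
    while lock.value == BUSY or lock.test_and_set() == BUSY:
        while lock.value == BUSY:
            pass  # local spinning until a release is noticed
        delay()   # wait before competing for the lock


def acquire_delay_every_reference(lock, delay):
    """Delay after every failed look at the lock location."""
    while lock.value == BUSY or lock.test_and_set() == BUSY:
        delay()


def short_sleep():
    # Placeholder delay; how to choose its length is a separate question.
    time.sleep(0.001)
```

A caller would use, for example, `acquire_delay_every_reference(lock, short_sleep)` and later `lock.clear()` to release.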

SLIDE 20

Software Alternatives – How to Set the Size of Delay

Static Delay

◮ Each spinning processor is statically assigned its own slot, a fixed amount of time to delay that differs from every other processor’s.

◮ Good performance with:

Few slots when few processors are spinning
Many slots when many processors are spinning
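
As a rough illustration of static slots, the helper below computes a per-processor delay. The slot numbering, the use of `processor_id % num_slots` when there are fewer slots than processors, and the name `static_delay` are all assumptions of this sketch, not details taken from the slides.

```python
def static_delay(processor_id, num_slots, slot_time):
    """Return the fixed delay for a processor under static slot assignment.

    Assumption: slots are numbered 0..num_slots-1, a processor's slot is its
    id modulo num_slots, and the delay is the slot index times slot_time.
    """
    if num_slots < 1:
        raise ValueError("num_slots must be at least 1")
    return (processor_id % num_slots) * slot_time
```

With four slots of 10 time units, processors 0 to 3 would wait 0, 10, 20, and 30 units; more slots spread the waiters further apart at the cost of a longer worst-case wait.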

SLIDE 21

Software Alternatives – How to Set the Size of Delay

Dynamic Delay

◮ Like Ethernet’s exponential backoff

◮ If a processor “collides” with another processor, it backs off for a longer delay

◮ Problems

The spinning processor keeps backing off for as long as the lock is held, so its delay grows too large when the critical section is long
How long the initial delay should be
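
A hedged Python sketch of Ethernet-style backoff for a spin lock. The cap `max_delay` is one common way to keep the delay from growing without bound during a long critical section; the doubling factor, the default values, the randomized wait, and all names here are assumptions for illustration, and test-and-set is emulated with a `threading.Lock`.

```python
import random
import threading
import time

CLEAR, BUSY = 0, 1


class LockWord:
    """Shared lock word with an emulated atomic test-and-set (assumption)."""

    def __init__(self):
        self.value = CLEAR
        self._atomic = threading.Lock()

    def test_and_set(self):
        with self._atomic:
            old, self.value = self.value, BUSY
            return old

    def clear(self):
        with self._atomic:
            self.value = CLEAR


def backoff_windows(initial_delay, max_delay):
    """Yield successive delay windows: doubling each time, capped at max_delay."""
    delay = initial_delay
    while True:
        yield delay
        delay = min(delay * 2, max_delay)


def acquire_with_backoff(lock, initial_delay=1e-6, max_delay=1e-3,
                         sleep=time.sleep):
    windows = backoff_windows(initial_delay, max_delay)
    while lock.value == BUSY or lock.test_and_set() == BUSY:
        # Failed attempt: wait a random time within the current window,
        # then widen the window for the next failure.
        sleep(random.uniform(0, next(windows)))
```

The `sleep` parameter exists only so the waiting can be replaced in experiments; the choice of `initial_delay` remains the open tuning question noted above.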

SLIDE 22

Queueing

◮ Delay-based approaches separate contending accesses in time

◮ Queueing separates contending accesses in space

◮ A naive approach

Maintain an explicit queue of spinning processors
Each arriving processor enqueues itself and spins on a separate flag
Invalidations are reduced if each processor’s flag is kept in a separate cache block
But maintaining queues is expensive: the enqueue and dequeue operations must themselves be locked

SLIDE 23

Queueing – Implementation

◮ A more efficient approach

Init:
  flags[0] := HAS_LOCK;
  flags[1..P-1] := MUST_WAIT;
  queueLast := 0;

Lock:
  myPlace := ReadAndIncrement(queueLast);
  while (flags[myPlace mod P] = MUST_WAIT);

<critical section>

Unlock:
  flags[myPlace mod P] := MUST_WAIT;
  flags[(myPlace + 1) mod P] := HAS_LOCK;
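
The queue lock pseudocode translates fairly directly into the Python sketch below. It is a model under stated assumptions: `ReadAndIncrement` is emulated with a `threading.Lock`, the `time.sleep(0)` in the spin loop only lets other Python threads run, and `QueueLock` and `run_workers` are hypothetical names. At most P threads may contend at once.

```python
import threading
import time

MUST_WAIT, HAS_LOCK = 0, 1


class QueueLock:
    def __init__(self, num_processors):
        self.P = num_processors
        self.flags = [MUST_WAIT] * num_processors
        self.flags[0] = HAS_LOCK
        self.queue_last = 0
        self._atomic = threading.Lock()  # emulates atomic read-and-increment

    def _read_and_increment(self):
        with self._atomic:
            old = self.queue_last
            self.queue_last = old + 1
            return old

    def lock(self):
        my_place = self._read_and_increment()
        # Each waiter spins on its own flag, not on a shared location.
        while self.flags[my_place % self.P] == MUST_WAIT:
            time.sleep(0)  # yield to other Python threads (model artifact)
        return my_place

    def unlock(self, my_place):
        self.flags[my_place % self.P] = MUST_WAIT
        # Hand the lock directly to the next processor in line.
        self.flags[(my_place + 1) % self.P] = HAS_LOCK


def run_workers(num_threads=4, iterations=25):
    """Hypothetical driver: threads increment a counter under the queue lock."""
    qlock, counter = QueueLock(num_threads), [0]

    def worker():
        for _ in range(iterations):
            place = qlock.lock()
            counter[0] += 1  # critical section
            qlock.unlock(place)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Because `unlock` writes only the successor's flag, a release disturbs one waiter rather than all of them.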

SLIDE 24

Queueing – Implementation

◮ Distributed-write coherence

All processors can spin on a single counter

◮ Invalidation-based coherence

Each processor should wait on a flag in a separate cache block

◮ Multistage network without CC

Each flag should be placed in a separate memory module

◮ Bus without CC

A delay between polls is needed, since every poll of a flag goes over the bus

SLIDE 25

Queueing – Problems

◮ What if an architecture does not support an atomic read-and-increment instruction?

◮ Increases lock latency when there is no contention

◮ Makes the preemption problem more severe

◮ Makes it more difficult to wait for multiple events
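
One possible answer to the first question is to build read-and-increment out of test-and-set by guarding the counter with a small spin lock. The Python sketch below shows that idea under assumptions: test-and-set itself is emulated with a `threading.Lock`, and `TicketCounter`, `read_and_increment`, and `collect_tickets` are hypothetical names.

```python
import threading

CLEAR, BUSY = 0, 1


class LockWord:
    """Shared lock word with an emulated atomic test-and-set (assumption)."""

    def __init__(self):
        self.value = CLEAR
        self._atomic = threading.Lock()

    def test_and_set(self):
        with self._atomic:
            old, self.value = self.value, BUSY
            return old

    def clear(self):
        with self._atomic:
            self.value = CLEAR


class TicketCounter:
    """queueLast plus the test-and-set lock that guards it."""

    def __init__(self):
        self.queue_last = 0
        self.guard = LockWord()


def read_and_increment(counter):
    # Take the guard with spin on read, update the counter, release the guard.
    while counter.guard.value == BUSY or counter.guard.test_and_set() == BUSY:
        pass
    old = counter.queue_last
    counter.queue_last = old + 1
    counter.guard.clear()
    return old


def collect_tickets(num_threads=4, per_thread=100):
    """Hypothetical driver: gather every ticket handed out, sorted."""
    counter, tickets = TicketCounter(), []

    def worker():
        for _ in range(per_thread):
            tickets.append(read_and_increment(counter))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(tickets)
```

The extra lock acquisition on every enqueue is part of why queueing adds latency when there is no contention.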

SLIDE 26

Evaluation – Principal Performance Comparison

SLIDE 27

Evaluation – Spin-waiting Overhead for a Burst

SLIDE 28

Hardware Solutions – Multistage Network

◮ Combining networks

Requests to the same location can be combined and forwarded as a single request
Better performance, but may increase the latency

◮ Hardware queueing

Eliminates polling across the network
Speeds up passing control of the lock

◮ Goodman’s queue links

Stores the name of the next processor in the queue directly in each processor’s cache
The next processor can be notified without going through the original memory module

SLIDE 29

Hardware Solutions – Single Bus

◮ Use an additional bus with write-broadcast coherence

Keeps caches coherent and reduces bus contention

◮ Read broadcast

Eliminates the cascade of read misses

◮ Special handling of test-and-set requests

Eliminates redundant test-and-sets

SLIDE 30

Conclusions

◮ Spin lock performance varies among architectures

◮ A variant of Ethernet backoff has good performance when there is no contention

◮ Queueing has good performance when there is a lot of contention

◮ The performance can be further improved with special hardware support

◮ But it is an open question whether such additional hardware support is worthwhile

SLIDE 31

Thank you!!!
