SLIDE 1

The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Author: Thomas E. Anderson Presenter: Bin Lin

Department of Computer Science

Nov 25, 2013

Bin Lin (Department of Computer Science) CS533 Fall 2013 Nov 25, 2013 1 / 31

SLIDE 2

Outline

◮ Introduction
◮ Multiprocessor architectures overview
◮ Simple approaches to spin-waiting
◮ Software alternatives
◮ Queueing approach
◮ Evaluation
◮ Hardware solutions
◮ Conclusions

SLIDE 3

Introduction

◮ Shared-memory multiprocessors

  • Various different architectures

◮ Mutual exclusion

  • Software
  • Hardware

◮ This paper focuses on spin locks

◮ Goals

  • Scalable
  • Low-latency

SLIDE 4

Multiprocessor Architectures Overview

◮ Two dimensions

  • Interconnect type (bus or multistage network)
  • Cache coherence (CC) strategy

SLIDE 5

Multiprocessor Architectures Overview

◮ Interconnect type – multistage interconnection network

SLIDE 6

Multiprocessor Architectures Overview

◮ Interconnect type – single bus

SLIDE 7

Multiprocessor Architectures Overview

◮ CC strategy – with CC

SLIDE 8

Multiprocessor Architectures Overview

◮ CC strategy – without CC

SLIDE 9

Multiprocessor Architectures Overview

◮ Two dimensions

  • Interconnect type (bus or multistage network)
  • Cache coherence (CC) strategy

◮ Six architectures considered in this paper

  • Multistage interconnection network without CC
  • Multistage interconnection network with invalidation-based CC using remote directories

  • Bus without CC
  • Bus with snoopy write-through invalidation-based CC
  • Bus with snoopy write-back invalidation-based CC
  • Bus with snoopy distributed-write CC

SLIDE 10

Multiprocessor Architectures Overview

◮ Interconnection network with CC using directories

SLIDE 12

Multiprocessor Architectures Overview

◮ Write-back vs write-through vs distributed-write

SLIDE 13

Simple Approaches to Spin-waiting

Spin on Test-and-Set

while (TestAndSet(lock) = BUSY);
<critical section>
lock := CLEAR;

◮ Problems

  • The lock holder must contend with spinning processors for exclusive access to the lock location.
  • Each spinning processor consumes interconnect (bus or network) bandwidth.

◮ Tradeoff: the more frequently a processor polls, the sooner it acquires the lock once it is free, but the more it disrupts other processors.
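The spin-on-test-and-set loop above can be sketched in C11 as follows. This is a minimal illustration, not the paper's code: the names `tas_lock_t`, `tas_acquire`, `tas_try_acquire`, and `tas_release` are hypothetical, and `atomic_flag` stands in for the hardware test-and-set instruction.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names; the slides only give pseudocode. */
typedef struct {
    atomic_flag busy;               /* clear = CLEAR, set = BUSY */
} tas_lock_t;

#define TAS_LOCK_INIT { ATOMIC_FLAG_INIT }

/* One test-and-set attempt: true if the lock was CLEAR and is now ours. */
static bool tas_try_acquire(tas_lock_t *l) {
    return !atomic_flag_test_and_set(&l->busy);
}

/* Spin on test-and-set: every iteration is an atomic read-modify-write,
 * so every waiting processor keeps generating interconnect traffic. */
static void tas_acquire(tas_lock_t *l) {
    while (!tas_try_acquire(l)) {
        /* busy-wait */
    }
}

/* lock := CLEAR */
static void tas_release(tas_lock_t *l) {
    atomic_flag_clear(&l->busy);
}
```

A caller would bracket its critical section with `tas_acquire(&l)` and `tas_release(&l)`, where `l` is initialized with `TAS_LOCK_INIT`.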

SLIDE 14

Simple Approaches to Spin-waiting

Spin on Read

while (lock = BUSY or TestAndSet(lock) = BUSY);
<critical section>
lock := CLEAR;

◮ It is a good idea, isn’t it?

◮ Problems

  • There is a gap between detecting that the lock has been released and attempting to acquire it.
  • Cached copies of the lock value are invalidated by a test-and-set instruction even if the value is not changed.
  • Invalidation-based CC requires O(P) bus or network cycles to broadcast a value to P waiting processors.
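Spin on read (often called test-and-test-and-set) could look like this in C11. It is a sketch under assumptions: the `sr_` names and the `SR_CLEAR`/`SR_BUSY` constants are made up for illustration, and `atomic_exchange` plays the role of `TestAndSet`.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { SR_CLEAR = 0, SR_BUSY = 1 };     /* hypothetical constant names */

typedef struct {
    atomic_int state;
} sr_lock_t;

static void sr_init(sr_lock_t *l) {
    atomic_init(&l->state, SR_CLEAR);
}

/* Read first; only issue the atomic exchange when the lock looks free.
 * The plain read can be served from the local cache while the lock is held. */
static bool sr_try_acquire(sr_lock_t *l) {
    return atomic_load(&l->state) == SR_CLEAR &&
           atomic_exchange(&l->state, SR_BUSY) == SR_CLEAR;
}

/* while (lock = BUSY or TestAndSet(lock) = BUSY); */
static void sr_acquire(sr_lock_t *l) {
    while (!sr_try_acquire(l)) {
        /* busy-wait */
    }
}

static void sr_release(sr_lock_t *l) {
    atomic_store(&l->state, SR_CLEAR);
}
```

The gap named in the first problem is visible in `sr_try_acquire`: another processor may take the lock between the load and the exchange, in which case the exchange fails and still invalidates the other cached copies.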

SLIDE 15

Simple Approaches to Spin-waiting – Performance

SLIDE 16

Simple Approaches to Spin-waiting – Performance

◮ Quiescence time for spin on read

SLIDE 17

Simple Approaches to Spin-waiting – Performance

◮ Why quiescence is so slow for spin on read

  • When the lock is released, all cached copies are invalidated
  • The next read by each processor incurs a read miss
  • Many processors see the lock free
  • Each of them tries to execute test-and-set
  • All but one attempt fails
  • When the last processor does its test-and-set, every other processor takes one more read miss and then quiesces

SLIDE 18

Software Alternatives

The author presents five software alternatives:

◮ Four based on CSMA (carrier-sense multiple access) network protocols, which differ by:

  • Where to delay
      • Delay after the lock has been released
      • Delay after every separate access to the lock
  • Whether the size of the delay is set statically or dynamically

◮ One novel approach: explicit queueing

SLIDE 19

Software Alternatives – Where to Delay

Delay after Spinner Notices Released Lock

while (lock = BUSY or TestAndSet(lock) = BUSY)
begin
  while (lock = BUSY);
  Delay();
end;

Delay between Each Memory Reference

while (lock = BUSY or TestAndSet(lock) = BUSY)
  Delay();
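The two delay placements can be compared side by side in C11. This is only a sketch: `dl_delay` is a hypothetical stand-in for the slides' `Delay()` that burns a caller-chosen number of loop iterations, and the `dl_` names and constants are assumptions for illustration.

```c
#include <stdatomic.h>

enum { DL_CLEAR = 0, DL_BUSY = 1 };     /* hypothetical constant names */

/* Stand-in for Delay(): burn roughly `iters` loop iterations. */
static void dl_delay(unsigned iters) {
    for (volatile unsigned i = 0; i < iters; i++) {
        /* empty */
    }
}

/* Delay after the spinner notices the released lock: spin on read
 * while the lock is held, then wait before retrying the test-and-set. */
static void dl_acquire_delay_after_release(atomic_int *lock, unsigned iters) {
    while (atomic_load(lock) == DL_BUSY ||
           atomic_exchange(lock, DL_BUSY) == DL_BUSY) {
        while (atomic_load(lock) == DL_BUSY) {
            /* spin on read */
        }
        dl_delay(iters);
    }
}

/* Delay between each memory reference: wait after every failed probe,
 * whether the probe was a read or a test-and-set. */
static void dl_acquire_delay_between_refs(atomic_int *lock, unsigned iters) {
    while (atomic_load(lock) == DL_BUSY ||
           atomic_exchange(lock, DL_BUSY) == DL_BUSY) {
        dl_delay(iters);
    }
}

static void dl_release(atomic_int *lock) {
    atomic_store(lock, DL_CLEAR);
}
```

The first variant keeps polling the (possibly cached) lock value and only delays once a release is seen. The second variant reduces the polling rate itself, which matters when each read goes to the bus or network.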

SLIDE 20

Software Alternatives – How to Set the Size of Delay

Static Delay

◮ Each spinning processor is statically assigned its own fixed delay (slot), different from every other processor’s.

◮ Good performance with:

  • Few slots when there are few spinning processors
  • Many slots when there are many spinning processors

SLIDE 21

Software Alternatives – How to Set the Size of Delay

Dynamic Delay

◮ Like Ethernet’s exponential backoff

◮ If a processor “collides” with another processor, it backs off for a longer delay

◮ Problems

  • When the critical section is long, a spinning processor keeps backing off for as long as the lock is held
  • How long should the initial delay be?
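A dynamic delay in the style of exponential backoff might be sketched as below. Several choices here are assumptions rather than anything stated on the slides: the delay is deterministic (Ethernet randomizes it), it doubles only after a failed test-and-set, and it is capped at `EB_MAX_DELAY` so that a long critical section cannot grow it without bound. The `eb_` names and the bounds are hypothetical.

```c
#include <stdatomic.h>

enum { EB_CLEAR = 0, EB_BUSY = 1 };
enum { EB_MIN_DELAY = 16, EB_MAX_DELAY = 4096 };  /* assumed bounds, in loop iterations */

/* Stand-in for Delay(): burn roughly `iters` loop iterations. */
static void eb_delay(unsigned iters) {
    for (volatile unsigned i = 0; i < iters; i++) {
        /* empty */
    }
}

/* Double the delay, saturating at EB_MAX_DELAY. */
static unsigned eb_next_delay(unsigned d) {
    return d >= EB_MAX_DELAY / 2 ? EB_MAX_DELAY : d * 2;
}

/* Acquire the lock; returns the number of failed test-and-set
 * attempts ("collisions") seen along the way. */
static unsigned eb_acquire(atomic_int *lock) {
    unsigned d = EB_MIN_DELAY;
    unsigned collisions = 0;
    for (;;) {
        if (atomic_load(lock) == EB_CLEAR) {
            if (atomic_exchange(lock, EB_BUSY) == EB_CLEAR)
                return collisions;
            collisions++;
            d = eb_next_delay(d);       /* collided: back off longer */
        }
        eb_delay(d);                    /* lock busy: wait, delay unchanged */
    }
}

static void eb_release(atomic_int *lock) {
    atomic_store(lock, EB_CLEAR);
}
```

Not increasing the delay when the lock is merely observed busy is one way to soften the first problem above; the choice of `EB_MIN_DELAY` is exactly the open question in the second.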

SLIDE 22

Queueing

◮ Delay-based approaches separate contending accesses in time

◮ Queueing separates contending accesses in space

◮ A naive approach

  • Maintain an explicit queue of spinning processors
  • Each arriving processor enqueues itself and spins on a separate flag
  • Invalidations are reduced if each processor’s flag is kept in a separate cache block
  • But maintaining queues is expensive: the enqueue and dequeue operations must themselves be locked

SLIDE 23

Queueing – Implementation

◮ A more efficient approach

Init
  flags[0] := HAS_LOCK;
  flags[1..P-1] := MUST_WAIT;
  queueLast := 0;

Lock
  myPlace := ReadAndIncrement(queueLast);
  while (flags[myPlace mod P] = MUST_WAIT);

<critical section>

Unlock
  flags[myPlace mod P] := MUST_WAIT;
  flags[(myPlace + 1) mod P] := HAS_LOCK;
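The array-based queue lock above translates fairly directly to C11, with `atomic_fetch_add` standing in for `ReadAndIncrement`. This is a sketch under assumptions: `P = 8` is an arbitrary bound on the number of contending processors, the `ql_` names are hypothetical, and the flags are not yet padded to separate cache blocks.

```c
#include <stdatomic.h>

enum { P = 8 };                         /* assumed max number of contenders */
enum { MUST_WAIT = 0, HAS_LOCK = 1 };

typedef struct {
    atomic_int  flags[P];
    atomic_uint queue_last;
} queue_lock_t;

/* Init: slot 0 starts with the lock, everyone else must wait. */
static void ql_init(queue_lock_t *q) {
    atomic_init(&q->flags[0], HAS_LOCK);
    for (int i = 1; i < P; i++)
        atomic_init(&q->flags[i], MUST_WAIT);
    atomic_init(&q->queue_last, 0);
}

/* Lock: take the next place in line, then spin only on that place's flag.
 * Returns myPlace, which the caller passes back to ql_unlock.
 * P is a power of two here, so the counter wrapping around is harmless. */
static unsigned ql_lock(queue_lock_t *q) {
    unsigned my_place = atomic_fetch_add(&q->queue_last, 1);
    while (atomic_load(&q->flags[my_place % P]) == MUST_WAIT) {
        /* busy-wait on our own slot */
    }
    return my_place;
}

/* Unlock: reset our slot for reuse, then hand the lock to the next slot. */
static void ql_unlock(queue_lock_t *q, unsigned my_place) {
    atomic_store(&q->flags[my_place % P], MUST_WAIT);
    atomic_store(&q->flags[(my_place + 1) % P], HAS_LOCK);
}
```

Each acquisition costs one atomic increment, and a release touches only the successor's flag, so a handoff disturbs one waiter instead of all of them.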

SLIDE 24

Queueing – Implementation

◮ Distributed-write coherence

  • All processors can spin on a single counter

◮ Invalidation-based coherence

  • Each processor should wait on a flag in a separate cache block

◮ Multistage network without CC

  • Each flag should be placed in a separate memory module

◮ Bus without CC

  • A delay is needed
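For the invalidation-based case, placing each flag in its own cache block can be sketched with C11 alignment. The 64-byte block size is an assumption (it varies by machine), and `padded_flag_t` and `pf_same_block` are hypothetical names.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_BLOCK 64                  /* assumed cache-block size in bytes */

/* Aligning the member forces the struct's size and alignment up to one
 * full block, so adjacent array elements never share a cache block. */
typedef struct {
    alignas(CACHE_BLOCK) atomic_int flag;
} padded_flag_t;

/* True if two addresses fall into the same (assumed) cache block. */
static bool pf_same_block(const void *a, const void *b) {
    return (uintptr_t)a / CACHE_BLOCK == (uintptr_t)b / CACHE_BLOCK;
}
```

Declaring the queue's flags as an array of `padded_flag_t` means a release invalidates only the block that the next processor in line is spinning on.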

SLIDE 25

Queueing – Problems

◮ What if an architecture does not support an atomic read-and-increment instruction?

◮ Increases lock latency when there is no contention

◮ Makes the preemption problem more severe

◮ Makes it more difficult to wait for multiple events

SLIDE 26

Evaluation – Principal Performance Comparison

SLIDE 27

Evaluation – Spin-waiting Overhead for a Burst

SLIDE 28

Hardware Solutions – Multistage Network

◮ Combining networks

  • Requests to the same location can be combined and forwarded as a single request
  • Better performance, but may increase the latency

◮ Hardware queueing

  • Eliminates polling across the network
  • Speeds up passing control of the lock

◮ Goodman’s queue links

  • Stores the name of the next processor in the queue directly in each processor’s cache
  • The next processor can be notified without going through the original memory module

SLIDE 29

Hardware Solutions – Single Bus

◮ Use an additional bus with write-broadcast coherence

  • Keeps caches coherent and reduces bus contention

◮ Read broadcast

  • Eliminates the cascade of read misses

◮ Special handling of test-and-set requests

  • Eliminates redundant test-and-sets

SLIDE 30

Conclusions

◮ The performance of spin locks varies among architectures

◮ A variant of Ethernet backoff has good performance when there is no contention

◮ Queueing has good performance when there is a lot of contention

◮ The performance can be further improved with special hardware support

◮ But is such additional hardware support worth the cost?

SLIDE 31

Thank you!!!
