The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors


  1. The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors Author: Thomas E. Anderson Presenter: Bin Lin Department of Computer Science Nov 25, 2013 Bin Lin (Department of Computer Science) CS533 Fall 2013 Nov 25, 2013 1 / 31

  2. Outline ◮ Introduction ◮ Multiprocessor architectures overview ◮ Simple approaches to spin-waiting ◮ Software alternatives ◮ Queueing approach ◮ Evaluation ◮ Hardware solutions ◮ Conclusions

  3. Introduction
     ◮ Shared-memory multiprocessors
       • Various architectures
     ◮ Mutual exclusion
       • Software
       • Hardware
     ◮ This paper focuses on spin locks
     ◮ Goals
       • Scalable
       • Low-latency

  4. Multiprocessor Architectures Overview
     ◮ Two dimensions
       • Interconnect type (bus or multistage network)
       • Cache coherence (CC) strategy

  5. Multiprocessor Architectures Overview ◮ Interconnect type – multistage interconnection network

  6. Multiprocessor Architectures Overview ◮ Interconnect type – single bus

  7. Multiprocessor Architectures Overview ◮ CC strategy – with CC

  8. Multiprocessor Architectures Overview ◮ CC strategy – without CC

  9. Multiprocessor Architectures Overview
     ◮ Two dimensions
       • Interconnect type (bus or multistage network)
       • Cache coherence (CC) strategy
     ◮ Six architectures considered in this paper
       • Multistage interconnection network without CC
       • Multistage interconnection network with invalidation-based CC using remote directories
       • Bus without CC
       • Bus with snoopy write-through invalidation-based CC
       • Bus with snoopy write-back invalidation-based CC
       • Bus with snoopy distributed-write CC

  10. Multiprocessor Architectures Overview ◮ Interconnection network with CC using directories

  11. Multiprocessor Architectures Overview
      ◮ Two dimensions
        • Interconnect type (bus or multistage network)
        • Cache coherence (CC) strategy
      ◮ Six architectures considered in this paper
        • Multistage interconnection network without CC
        • Multistage interconnection network with invalidation-based CC using remote directories
        • Bus without CC
        • Bus with snoopy write-through invalidation-based CC
        • Bus with snoopy write-back invalidation-based CC
        • Bus with snoopy distributed-write CC

  12. Multiprocessor Architectures Overview ◮ Write-back vs write-through vs distributed-write

  13. Simple Approaches to Spin-waiting – Spin on Test-and-Set
        while (TestAndSet(lock) = BUSY);
        < critical section >
        lock := CLEAR;
      ◮ Problems
        • The lock holder must contend with spinning processors for exclusive access to the lock location.
        • Each spinning processor consumes bus or network bandwidth.
      ◮ Tradeoff: the more frequently a processor polls, the sooner it acquires the lock once it is free, but the more it disrupts other processors.
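The spin-on-test-and-set loop above can be sketched in C11, assuming `atomic_flag_test_and_set` is an acceptable stand-in for the slide's `TestAndSet`. The names `tas_lock`, `TAS_LOCK_INIT`, `tas_acquire`, and `tas_release` are hypothetical and chosen only for this illustration.

```c
#include <stdatomic.h>

/* Hypothetical lock type: a clear flag means CLEAR, a set flag means BUSY. */
typedef struct { atomic_flag busy; } tas_lock;

#define TAS_LOCK_INIT { ATOMIC_FLAG_INIT }

static void tas_acquire(tas_lock *l) {
    /* test_and_set returns the previous value, so "true" means the lock
       was already BUSY and the caller must try again. Every iteration is
       a read-modify-write on the shared lock location. */
    while (atomic_flag_test_and_set_explicit(&l->busy, memory_order_acquire))
        ; /* spin */
}

static void tas_release(tas_lock *l) {
    /* lock := CLEAR */
    atomic_flag_clear_explicit(&l->busy, memory_order_release);
}
```

Because every spin iteration is an atomic read-modify-write, each waiter keeps requesting exclusive access to the lock location, which is the contention the slide describes.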

  14. Simple Approaches to Spin-waiting – Spin on Read
        while (lock = BUSY or TestAndSet(lock) = BUSY);
        < critical section >
        lock := CLEAR;
      ◮ It is a good idea, isn’t it?
      ◮ Problems
        • There is a gap between detecting that the lock has been released and attempting to acquire it.
        • Cached copies of the lock value are invalidated by a test-and-set instruction even if the value is not changed.
        • Invalidation-based CC requires O(P) bus or network cycles to broadcast a value to P waiting processors.
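A minimal C11 sketch of spin on read (often called test-and-test-and-set) follows. It assumes `atomic_exchange` on a boolean models `TestAndSet`; the type and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: false = CLEAR, true = BUSY. */
typedef struct { atomic_bool busy; } ttas_lock;

static void ttas_acquire(ttas_lock *l) {
    for (;;) {
        /* Spin with plain loads; with coherent caches these can be
           satisfied from the local cached copy while the lock is held. */
        while (atomic_load_explicit(&l->busy, memory_order_relaxed))
            ; /* spin */
        /* The lock looked free: only now issue the test-and-set.
           exchange returns the old value, so false means we got it. */
        if (!atomic_exchange_explicit(&l->busy, true, memory_order_acquire))
            return;
    }
}

static void ttas_release(ttas_lock *l) {
    atomic_store_explicit(&l->busy, false, memory_order_release);
}
```

The window between the load that sees the lock free and the exchange that tries to take it is the separation named in the first problem bullet: several waiters can pass the read at once and then all issue the test-and-set.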

  15. Simple Approaches to Spin-waiting – Performance

  16. Simple Approaches to Spin-waiting – Performance ◮ Quiescence time for spin on read

  17. Simple Approaches to Spin-waiting – Performance
      ◮ Why quiescence is so slow for spin on read
        • When the lock is released, all cached copies are invalidated
        • The next read by each processor incurs a read miss
        • Many processors see the lock free and try to execute test-and-set
        • All but one attempt fails
        • Only after the last processor does its test-and-set does every other processor take a final read miss and then quiesce

  18. Software Alternatives
      The author presents five software alternatives:
      ◮ Four based on CSMA network protocols, which differ by:
        • Where to wait
          - Delay after the lock has been released
          - Delay after every separate access to the lock
        • Whether the size of the delay is set statically or dynamically
      ◮ One novel approach: explicit queueing

  19. Software Alternatives – Where to Delay
      Delay after Spinner Notices Released Lock
        while (lock = BUSY or TestAndSet(lock) = BUSY)
        begin
          while (lock = BUSY);
          Delay();
        end;
      Delay between Each Memory Reference
        while (lock = BUSY or TestAndSet(lock) = BUSY)
          Delay();
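The two delay placements can be sketched side by side in C11. The `delay` helper here is a hypothetical busy-wait, and the `iters` parameter is an assumption; the slides do not say how `Delay()` is implemented or how long it lasts.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef atomic_bool spinlock;   /* hypothetical: false = CLEAR, true = BUSY */

/* Hypothetical Delay(): burn roughly `iters` iterations locally,
   without touching the shared lock location. */
static void delay(unsigned iters) {
    for (volatile unsigned i = 0; i < iters; i++)
        ;
}

/* Delay after the spinner notices the released lock. */
static void acquire_delay_after_release(spinlock *l, unsigned iters) {
    while (atomic_load(l) || atomic_exchange(l, true)) {
        while (atomic_load(l))
            ;               /* spin on reads while the lock is held */
        delay(iters);       /* lock looks free: wait before retrying */
    }
}

/* Delay between each memory reference to the lock. */
static void acquire_delay_each_reference(spinlock *l, unsigned iters) {
    while (atomic_load(l) || atomic_exchange(l, true))
        delay(iters);       /* back off after every failed probe */
}

static void release(spinlock *l) {
    atomic_store(l, false);
}
```

In the first variant a waiter polls continuously and pauses only once the lock appears free; in the second it pauses after every probe, so it generates less traffic even where reads cannot be served from a cache.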

  20. Software Alternatives – How to Set the Size of Delay
      Static Delay
      ◮ Each spinning processor is statically assigned its own fixed delay (slot), different from that of every other processor.
      ◮ Good performance with:
        • Fewer slots when there are fewer spinning processors
        • More slots when there are more spinning processors

  21. Software Alternatives – How to Set the Size of Delay
      Dynamic Delay
      ◮ Like Ethernet’s exponential backoff
      ◮ If a processor “collides” with another processor, it backs off for a longer delay
      ◮ Problems
        • When the critical section is long, a spinning processor keeps backing off while the lock is held
        • How long the initial delay should be
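A sketch of dynamic delay in C11 follows, assuming the delay doubles after each failed attempt. The cap `max` is an assumption added here as one plausible guard against unbounded backoff during a long critical section; the slide does not give a remedy. Ethernet's scheme also randomizes the wait, which this deterministic sketch omits. All names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical Delay(): local busy-wait of roughly `iters` iterations. */
static void delay(unsigned iters) {
    for (volatile unsigned i = 0; i < iters; i++)
        ;
}

/* Double the delay, saturating at `max` (assumed cap, not from the slides).
   Written to avoid overflow when d is already large. */
static unsigned next_delay(unsigned d, unsigned max) {
    return d >= max / 2 ? max : d * 2;
}

/* `initial` must be at least 1, otherwise the delay never grows. */
static void backoff_acquire(atomic_bool *lock, unsigned initial, unsigned max) {
    unsigned d = initial;
    while (atomic_load(lock) || atomic_exchange(lock, true)) {
        delay(d);                 /* failed probe: wait */
        d = next_delay(d, max);   /* and wait longer next time */
    }
}

static void backoff_release(atomic_bool *lock) {
    atomic_store(lock, false);
}
```

Both open questions from the slide show up as parameters: `initial` is the starting delay, and `max` bounds how far a waiter drifts from the lock while it stays held.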

  22. Queueing
      ◮ Delay-based approaches separate contending accesses in time
      ◮ Queueing separates contending accesses in space
      ◮ A naive approach
        • Maintain an explicit queue of spinning processors
        • Each arriving processor enqueues itself and spins on a separate flag
        • Invalidations are reduced if each processor’s flag is kept in a separate cache block
        • But maintaining queues is expensive: the enqueue and dequeue operations must themselves be locked

  23. Queueing – Implementation
      ◮ A more efficient approach
      Init
        flags[0] := HAS_LOCK;
        flags[1..P-1] := MUST_WAIT;
        queueLast := 0;
      Lock
        myPlace := ReadAndIncrement(queueLast);
        while (flags[myPlace mod P] = MUST_WAIT);
        < critical section >
      Unlock
        flags[myPlace mod P] := MUST_WAIT;
        flags[(myPlace + 1) mod P] := HAS_LOCK;
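The array-based queue lock above maps fairly directly onto C11 atomics. This is a sketch under stated assumptions: `atomic_fetch_add` stands in for `ReadAndIncrement`, `P` is fixed at 8, and a 64-byte cache block is assumed for padding. The type and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdalign.h>

#define P 8             /* assumed number of processors; a power of two so
                           that counter wraparound stays consistent mod P */
#define CACHE_LINE 64   /* assumed cache block size in bytes */

enum { MUST_WAIT = 0, HAS_LOCK = 1 };

/* One flag per cache block, so a waiter spins on a location that only
   its predecessor writes. */
typedef struct { alignas(CACHE_LINE) atomic_int state; } padded_flag;

typedef struct {
    padded_flag flags[P];
    alignas(CACHE_LINE) atomic_uint queue_last;
} queue_lock;

static void queue_lock_init(queue_lock *q) {
    atomic_init(&q->flags[0].state, HAS_LOCK);
    for (int i = 1; i < P; i++)
        atomic_init(&q->flags[i].state, MUST_WAIT);
    atomic_init(&q->queue_last, 0u);
}

/* Returns the caller's slot, which must be passed back to release. */
static unsigned queue_lock_acquire(queue_lock *q) {
    unsigned my_place = atomic_fetch_add(&q->queue_last, 1u) % P;
    while (atomic_load_explicit(&q->flags[my_place].state,
                                memory_order_acquire) == MUST_WAIT)
        ; /* spin on our own flag only */
    return my_place;
}

static void queue_lock_release(queue_lock *q, unsigned my_place) {
    /* Reset our slot for reuse, then hand the lock to the next slot. */
    atomic_store_explicit(&q->flags[my_place].state, MUST_WAIT,
                          memory_order_relaxed);
    atomic_store_explicit(&q->flags[(my_place + 1) % P].state, HAS_LOCK,
                          memory_order_release);
}
```

A caller keeps the slot index between the two calls, for example `unsigned slot = queue_lock_acquire(&q); /* critical section */ queue_lock_release(&q, slot);`. As in the pseudocode, the scheme assumes at most `P` processors contend at once.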

  24. Queueing – Implementation
      ◮ Distributed-write coherence
        • All processors can spin on a single counter
      ◮ Invalidation-based coherence
        • Each processor should wait on a flag in a separate cache block
      ◮ Multistage network without CC
        • Each flag should be placed in a separate memory module
      ◮ Bus without CC
        • A delay is needed

  25. Queueing – Problems
      ◮ What if an architecture does not support an atomic read-and-increment instruction?
      ◮ Increases lock latency when there is no contention
      ◮ Makes the preemption problem more severe
      ◮ Makes it more difficult to wait for multiple events
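For the first problem, one possible workaround (an assumption here, since the slide only poses the question) is to emulate read-and-increment by guarding an ordinary counter with a small test-and-set lock. The sketch below uses hypothetical names and omits any backoff on the guard.

```c
#include <stdatomic.h>

/* Hypothetical counter protected by a test-and-set guard. */
typedef struct {
    atomic_flag guard;   /* clear = free, set = held */
    unsigned value;      /* plain counter, touched only under the guard */
} guarded_counter;

#define GUARDED_COUNTER_INIT { ATOMIC_FLAG_INIT, 0 }

/* Emulated ReadAndIncrement: returns the old value, then adds one. */
static unsigned read_and_increment(guarded_counter *c) {
    while (atomic_flag_test_and_set_explicit(&c->guard, memory_order_acquire))
        ; /* spin until the guard is free */
    unsigned old = c->value++;
    atomic_flag_clear_explicit(&c->guard, memory_order_release);
    return old;
}
```

The guard is itself a spin lock, so under contention it inherits the costs of spinning on test-and-set; it is held only for the increment, not for the caller's critical section.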

  26. Evaluation – Principal Performance Comparison

  27. Evaluation – Spin-waiting Overhead for a Burst

  28. Hardware Solutions – Multistage Network
      ◮ Combining networks
        • Requests to the same location can be combined and forwarded as a single request
        • Better performance, but may increase the latency
      ◮ Hardware queueing
        • Eliminates polling across the network
        • Speeds passing control of the lock
      ◮ Goodman’s queue links
        • Stores the name of the next processor in the queue directly in each processor’s cache
        • The next processor can be notified without going through the original memory module
