Synchronization, presented by Radu Teodorescu (CS533)

SLIDE 1

Synchronization

presented by Radu Teodorescu

CS533

SLIDE 2

Why do we need it?

  • Parallel programs share data!
  • Consistency of shared data structures
  • Access serialization
  • Coordination between processors
  • Allows queueing, ordering

SLIDE 3

For today

  • How synchronization works
  • Synchronization operations
  • Hardware primitives
  • Implementations

SLIDE 4

How it works

  • Components of a synchronization event:
      • ACQUIRE method - obtain the access right to the synchronization
      • WAITING algorithm - wait for the synchronization to become available
      • RELEASE method - allow other processes to proceed past the synchronization

SLIDE 5

Waiting algorithms: busy-waiting

  • The process spins in a loop, repeatedly testing for a status change
  • PROS: low latency
  • CONS: blocks the processor, higher traffic (network, bus, cache)
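A minimal sketch of busy-waiting, using Python threads as stand-ins for processors. The names `busy_wait_demo`, `state`, and `delay` are illustrative assumptions, and `time.sleep(0)` is only there so that the other Python thread gets a chance to run; a real spin loop would not yield.

```python
import threading
import time

def busy_wait_demo(delay=0.01):
    """Spin on a shared flag until another thread changes it."""
    state = {"ready": False, "spins": 0, "observed": False}

    def waiter():
        while not state["ready"]:      # repeatedly test for a status change
            state["spins"] += 1        # the waiter stays busy the whole time
            time.sleep(0)              # yield so the other Python thread can run
        state["observed"] = True

    t = threading.Thread(target=waiter)
    t.start()
    time.sleep(delay)                  # the waiter spins during this interval
    state["ready"] = True              # the status change ends the spin
    t.join()
    return state["observed"], state["spins"]
```

The spin count it returns gives a rough feel for the work wasted while waiting.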

SLIDE 6

Waiting algorithms: blocking

  • The process suspends, releases the processor, and waits to be awakened
  • PROS: releases the processor for other jobs
  • CONS: higher overhead
  • Worthwhile when scheduling overhead < expected wait time
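For contrast with spinning, here is a small sketch of a blocking wait built on Python's `threading.Event`. The function name `blocking_wait_demo` is an assumption made for illustration; the waiter is suspended by the runtime instead of polling.

```python
import threading

def blocking_wait_demo():
    """Suspend a thread until another thread wakes it up."""
    done = threading.Event()
    result = []

    def waiter():
        done.wait()                # suspend; the thread does not poll while waiting
        result.append("woken")

    t = threading.Thread(target=waiter)
    t.start()
    done.set()                     # wake the waiter (or let it pass if not yet waiting)
    t.join()
    return result
```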

SLIDE 7

Synch operations

  • Locks: grant access to one process only
  • Barriers: no process advances beyond the barrier until all have arrived
  • Semaphores
  • Monitors

Implemented with some combination of hardware primitives and software
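As a software-level illustration of the first two operations, the sketch below uses Python's library `threading.Lock` and `threading.Barrier` rather than the hardware primitives discussed in this lecture. The function `run_phases` and the log format are assumptions chosen for the example.

```python
import threading

def run_phases(n=4):
    """Each worker logs 'before', waits at a barrier, then logs 'after'."""
    log = []
    log_lock = threading.Lock()        # lock: one thread at a time touches the log
    barrier = threading.Barrier(n)     # barrier: nobody passes until all n arrive

    def worker(i):
        with log_lock:
            log.append(("before", i))
        barrier.wait()
        with log_lock:
            log.append(("after", i))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because every worker logs "before" prior to reaching the barrier, all "before" entries precede every "after" entry.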

SLIDE 8

Hardware primitives

  • Synchronization requires some ATOMIC operation
  • Some flavor of atomic read-and-modify:
      • Atomic exchange
      • Test & set
      • Fetch & increment

Used to implement locks, barriers, etc.

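SLIDE 9

Test & set

  • Test a value and set it if the test passed
  • Also used to implement locks
  • In cache coherent machines - lots of invalidations

    lock:  ADD   R2, R0, #1     ; R2 = 1, the "locked" value
           T&S   R2, (R1)       ; atomic test & set on the lock word at (R1)
           BNEZ  R2, lock       ; nonzero result: the lock was held, retry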
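The spin lock above can be modeled in software. This is a simulation under stated assumptions, not real hardware test&set: the class `LockWord` and its internal mutex `_hw` are hypothetical names, and the mutex only stands in for the hardware guarantee that each operation on the word is indivisible.

```python
import threading
import time

class LockWord:
    """One memory word; the internal mutex models hardware atomicity."""

    def __init__(self):
        self.value = 0                     # 0 = free, 1 = held
        self._hw = threading.Lock()

    def test_and_set(self):
        with self._hw:                     # the read and the write are one step
            old = self.value
            self.value = 1
            return old

    def store(self, value):
        with self._hw:
            self.value = value

def acquire(lock):
    while lock.test_and_set() != 0:        # BNEZ R2, lock
        time.sleep(0)                      # lock was held: yield, then retry

def release(lock):
    lock.store(0)

def count_with_lock(n_threads=4, iters=200):
    """Increment a shared counter inside the critical section."""
    lock, total = LockWord(), [0]

    def worker():
        for _ in range(iters):
            acquire(lock)
            total[0] += 1                  # critical section
            release(lock)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

Note that every failed attempt still writes the lock word, which is the source of the invalidation traffic mentioned on this slide.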

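SLIDE 10

Test and test & set

  • Take advantage of locality
  • Test before write (to avoid an invalidation)

    lock:  LD    R2, (R1)       ; read the lock word (a plain load, no write)
           BNEZ  R2, lock       ; still held: keep testing without writing
           ADD   R2, R0, #1     ; R2 = 1
           T&S   R2, (R1)       ; lock looked free: attempt the atomic test & set
           BNEZ  R2, lock       ; another processor got it first, retry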
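To make the saving visible, the following sketch counts how many test&set operations are issued. It is a software model: `CountingWord`, `tas_count`, and `try_acquire_ttas` are hypothetical names, and the internal mutex only models hardware atomicity.

```python
import threading

class CountingWord:
    """Lock word that counts the test&set operations issued on it."""

    def __init__(self, value=0):
        self.value = value
        self.tas_count = 0                 # each T&S is a write to the lock word
        self._hw = threading.Lock()        # models hardware atomicity

    def load(self):
        with self._hw:
            return self.value

    def store(self, value):
        with self._hw:
            self.value = value

    def test_and_set(self):
        with self._hw:
            self.tas_count += 1
            old, self.value = self.value, 1
            return old

def try_acquire_ttas(lock):
    """One pass through the loop; returns True if the lock was obtained."""
    if lock.load() != 0:                   # LD / BNEZ: lock looks busy, do not write
        return False
    return lock.test_and_set() == 0        # T&S only when the lock looked free
```

While the lock is held, repeated calls to `try_acquire_ttas` only read, so `tas_count` stays at zero until the lock is released.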

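SLIDE 11

Atomic exchange

  • EXCH Reg, Mem: exchange a value in a register with a value in memory
  • Can be used to implement a spin lock

    lock:  LD    R2, (R1)       ; read the lock word
           BNEZ  R2, lock       ; still held: keep testing
           ADD   R2, R0, #1     ; R2 = 1
           EXCH  R2, (R1)       ; atomically swap R2 with the lock word
           BNEZ  R2, lock       ; old value was nonzero: lock was taken, retry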

SLIDE 12

Fetch and increment

  • Reads a memory value and increments it atomically
  • Useful for barrier implementation
  • Can be used to determine how many processes are waiting for it
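A minimal sketch of a single-use barrier built on fetch&increment. The `Counter` class, its `_hw` mutex (which only models hardware atomicity), and the helper names are assumptions for illustration; a reusable barrier would need extra state that is not shown here.

```python
import threading
import time

class Counter:
    """Shared counter with a modeled atomic fetch&increment."""

    def __init__(self):
        self.value = 0
        self._hw = threading.Lock()        # models hardware atomicity

    def fetch_and_increment(self):
        with self._hw:
            old = self.value
            self.value = old + 1
            return old

    def load(self):
        with self._hw:
            return self.value

def barrier_wait(counter, n):
    """Single-use barrier; returns this caller's arrival order (0..n-1)."""
    order = counter.fetch_and_increment()
    while counter.load() < n:              # counter = number arrived so far
        time.sleep(0)
    return order

def run_barrier(n=4):
    counter, orders, seen = Counter(), [], []
    guard = threading.Lock()

    def worker():
        order = barrier_wait(counter, n)
        with guard:
            orders.append(order)
            seen.append(counter.load())    # equals n once past the barrier

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(orders), seen
```

The value returned by fetch&increment tells each arrival how many processes got there before it, and reading the counter shows how many are currently waiting.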

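SLIDE 13

Performance

  • How expensive is locking with test&set?
  • N processors waiting
  • No hold time

$\sum_{i=1}^{N} (2i+1) = N^2 + 2N$

    lock:  ADD   R2, R0, #1     ; R2 = 1, the "locked" value
           T&S   R2, (R1)       ; atomic test & set on the lock word at (R1)
           BNEZ  R2, lock       ; nonzero result: the lock was held, retry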
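The closed form on this slide can be checked by splitting the sum. The bounds i = 1..N are an assumption, since the slide omits them; they are the range for which the stated total holds.

```latex
\sum_{i=1}^{N} (2i+1)
  = 2\sum_{i=1}^{N} i + \sum_{i=1}^{N} 1
  = 2\cdot\frac{N(N+1)}{2} + N
  = N^2 + 2N
```

For example, with N = 4 the terms are 3 + 5 + 7 + 9 = 24, which matches 4^2 + 2·4 = 24. The cost therefore grows quadratically with the number of waiting processors.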

SLIDE 14

Hardware complexity

  • Single, atomic memory operation
  • Hard to implement in hardware:
      • Complicates the coherence protocol
      • Must be uninterrupted
      • Must avoid deadlocks

SLIDE 15

Reducing complexity

  • Maintain the atomicity requirement
  • Have two linked instructions
  • The second indicates whether the pair executed atomically
  • MIPS/SGI: Load-linked (LL), Store-conditional (SC)

SLIDE 16

LL/SC

  • LL returns the value of the memory location
  • SC attempts the store and reports whether the LL/SC pair was atomic:
      • LL/SC not atomic - returns 0, doesn’t update memory
      • LL/SC is atomic - returns 1, updates memory

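SLIDE 17

LL/SC

  • LL/SC also fails if the processor context switches between LL and SC
  • Can implement other primitives: exchange, fetch-and-add, ...

    lock:  LL    R2, (R1)       ; load-linked: read the lock word
           BNEZ  R2, lock       ; lock is held: keep spinning
           ADD   R2, R0, #1     ; R2 = 1
           SC    R2, (R1)       ; store-conditional: R2 = 1 on success, 0 on failure
           BEQZ  R2, lock       ; SC failed: retry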
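As an illustration of building another primitive from LL/SC, the sketch below simulates a linked word and derives fetch-and-add from it. Everything here is a model under assumptions: `LLSCWord` tracks links per Python thread, a successful SC invalidates all outstanding links, and failure on a context switch is not modeled.

```python
import threading

class LLSCWord:
    """Memory word with simulated load-linked / store-conditional."""

    def __init__(self, value=0):
        self.value = value
        self._hw = threading.Lock()        # models hardware atomicity
        self._linked = set()               # thread ids holding a valid link

    def load_linked(self):
        with self._hw:
            self._linked.add(threading.get_ident())
            return self.value

    def store_conditional(self, new):
        with self._hw:
            if threading.get_ident() not in self._linked:
                return 0                   # pair was not atomic: no update
            self.value = new
            self._linked.clear()           # a successful store breaks all links
            return 1

def fetch_and_add(word, delta):
    """Retry the LL/SC pair until it executes atomically."""
    while True:
        old = word.load_linked()
        if word.store_conditional(old + delta):
            return old

def parallel_adds(n_threads=4, iters=100):
    word = LLSCWord(0)

    def worker():
        for _ in range(iters):
            fetch_and_add(word, 1)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return word.value
```

The retry loop in `fetch_and_add` plays the same role as the `BEQZ R2, lock` branch in the assembly above.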

SLIDE 18

Implementations

  • HEP Multiprocessor
  • NYU Ultracomputer
  • IBM RP3
  • Illinois Cedar

SLIDE 19

HEP multiprocessor

  • Each word in memory has a Full/Empty bit
  • The bit is tested in hardware before a RD/WR
  • The RD/WR blocks until the test succeeds:
      • RD until full
      • WR until empty
  • When the test succeeds, the bit is set to the opposite value
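The full/empty semantics can be sketched in software with a condition variable. This is a model of the behaviour described on the slide, not of the HEP hardware; `FullEmptyCell` and `producer_consumer` are hypothetical names.

```python
import threading

class FullEmptyCell:
    """One memory word with a full/empty bit."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write(self, value):
        with self._cond:
            while self._full:              # WR blocks until empty
                self._cond.wait()
            self._value = value
            self._full = True              # bit flips to full
            self._cond.notify_all()

    def read(self):
        with self._cond:
            while not self._full:          # RD blocks until full
                self._cond.wait()
            self._full = False             # bit flips to empty
            self._cond.notify_all()
            return self._value

def producer_consumer(items):
    """Pass items through a single cell, one at a time."""
    cell, out = FullEmptyCell(), []

    def producer():
        for x in items:
            cell.write(x)

    def consumer():
        for _ in items:
            out.append(cell.read())

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

Each write must wait for the previous value to be read, so the items arrive in order without any explicit lock in the producer or consumer code.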

SLIDE 20

HEP multiprocessor

  • PROS:
      • Very efficient for low-level dependences (compare to locks)
  • CONS:
      • Requires complex hardware:
          • F/E bits
          • Support to queue a process if the test fails
          • Logic to implement indivisible ops

SLIDE 21

NYU ultracomputer

  • Implements fetch-and-add
  • PROS:
      • Can use message combining, scales well
      • Efficient barrier implementation
  • CONS:
      • Very complex network
      • Adders in each memory module
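
SLIDE 22

Message combining

[Figure, four switch/memory snapshots: memory location X holds 5; requests F&A(X,3) and F&A(X,1) meet at a switch, which keeps the value 3 and forwards a single F&A(X,4); memory returns 5 and X becomes 9; the switch replies with 5 to one requester and 8 to the other.]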
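A small sketch of combining, using the numbers from this slide. `MemoryModule` and `combining_switch` are hypothetical names, and the order in which the switch serializes the combined requests is an assumption of the sketch.

```python
class MemoryModule:
    """Memory word with an adder, as in the fetch-and-add design."""

    def __init__(self, value):
        self.value = value

    def fetch_and_add(self, increment):
        old = self.value
        self.value = old + increment
        return old

def combining_switch(memory, increments):
    """Combine several F&A requests to one location into a single request.

    increments are listed in the order the switch serializes them
    (an assumption of this sketch).
    """
    base = memory.fetch_and_add(sum(increments))   # one request reaches memory
    replies, offset = [], 0
    for inc in increments:
        replies.append(base + offset)      # value this request would have seen
        offset += inc
    return replies
```

With X = 5, `combining_switch(MemoryModule(5), [3, 1])` sends one F&A(X,4) to memory and hands back 5 and 8, the same results the two requests would get if executed one after the other.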

SLIDE 23

IBM RP3

  • Implements fetch-and-phi, where phi can be:
      • Add, And, Or
      • Min, Max, Store
      • Store if zero
  • PRO: generality
  • CON: hardware complexity
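Fetch-and-phi can be sketched as one generic operation parameterized by a function. This is a software model with hypothetical names (`PhiWord`, `PHI`); the internal mutex only models hardware atomicity, and the exact RP3 semantics of each operation are assumed from the operation names.

```python
import operator
import threading

class PhiWord:
    """Memory word supporting a modeled atomic fetch-and-phi."""

    def __init__(self, value=0):
        self.value = value
        self._hw = threading.Lock()        # models hardware atomicity

    def fetch_and_phi(self, phi, operand):
        with self._hw:
            old = self.value
            self.value = phi(old, operand) # new value = phi(old value, operand)
            return old                     # always return the old value

# phi functions take (old value, operand) and return the new value
PHI = {
    "add": operator.add,
    "and": operator.and_,
    "or": operator.or_,
    "min": min,
    "max": max,
    "store": lambda old, new: new,
    "store_if_zero": lambda old, new: new if old == 0 else old,
}
```

For instance, `PhiWord(6).fetch_and_phi(PHI["add"], 3)` returns 6 and leaves 9 in the word.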

SLIDE 24

Illinois Cedar

  • General atomic instruction that operates on synchronization variables
  • Synch variable has 2 words: Key and Value
  • Synch instruction: {addr; (cond); op on key; op on value}
  • F/E bit test for a read: {X; (X.key==1)*; decrement; fetch}
  • Complex hardware, special processor for each memory module
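A rough model of the synch instruction format shown on this slide. The names `SyncVar`, `sync_instruction`, and `fe_read` are hypothetical, the mutex only models indivisibility, and reading the `*` as "repeat until the condition holds" is an assumption, not something the slide states.

```python
import threading

class SyncVar:
    """Two-word synchronization variable: key and value."""

    def __init__(self, key=0, value=None):
        self.key = key
        self.value = value
        self._hw = threading.Lock()        # models the indivisible instruction

def sync_instruction(var, cond, key_op, value_op):
    """One attempt of {addr; (cond); op on key; op on value}.

    Returns (True, result) if cond held on the key, else (False, None).
    """
    with var._hw:
        if not cond(var.key):
            return False, None
        var.key = key_op(var.key)
        return True, value_op(var)

def fe_read(var):
    """{X; (X.key==1)*; decrement; fetch}, with * assumed to mean retry."""
    while True:
        ok, result = sync_instruction(
            var,
            lambda key: key == 1,          # test: key 1 is treated as "full"
            lambda key: key - 1,           # op on key: decrement
            lambda v: v.value,             # op on value: fetch
        )
        if ok:
            return result
```

With the key at 1, `fe_read` returns the value and leaves the key at 0, mirroring a full/empty read.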

SLIDE 25

That’s it for today!