Synchronization, presented by Radu Teodorescu (CS533)

Oct 11, 2022

SLIDE 1

Synchronization

presented by Radu Teodorescu

CS533

SLIDE 2

Why do we need it?

Parallel programs share data!
Consistency of shared data structures
Access serialization
Coordination between processors
Allows queueing, ordering

SLIDE 3

For today

How synchronization works
Synchronization operations
Hardware primitives
Implementations

SLIDE 4

How it works

Components of a synchronization event:
ACQUIRE method - acquire the right to the synchronization
WAITING algorithm - wait for the synchronization to become available
RELEASE method - allow other processes to proceed past the synchronization

SLIDE 5

Waiting algorithms

Busy-waiting
The process spins in a loop, repeatedly testing for a status change
PROS: low latency
CONS: blocks the processor, higher traffic (network, bus, cache)

SLIDE 6

Waiting algorithms

Blocking
The process suspends, releases the processor, and waits to be awakened
PROS: releases the processor for other jobs
CONS: higher overhead
Worthwhile when scheduling overhead < expected wait time
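The blocking alternative can be sketched in a few lines of Python. This is only an illustration: `threading.Event` is an assumed stand-in for whatever wake-up mechanism the system provides, and `run_blocking_wait` is a hypothetical name, not something from the slides.

```python
import threading

def run_blocking_wait():
    """One waiter blocks on an event; the main thread releases it."""
    ready = threading.Event()      # the synchronization being waited on
    log = []

    def waiter():
        ready.wait()               # suspend: the thread gives up the processor
        log.append("waiter proceeded")

    t = threading.Thread(target=waiter)
    t.start()
    log.append("releasing")
    ready.set()                    # release: wake the suspended waiter
    t.join()
    return log
```

The waiter consumes no processor time while suspended, at the price of a sleep/wake round trip through the scheduler.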

SLIDE 7

Synch operations

Locks: grant access to one process only
Barriers: no process advances beyond the barrier until all have arrived
Semaphores
Monitors
…

Implemented as some combination of hardware primitives and software

SLIDE 8

Hardware primitives

Synchronization requires some ATOMIC operation
Some flavor of atomic Read&Modify:
Atomic exchange
Test & set
Fetch & increment

Used to implement locks, barriers, etc.

SLIDE 9

Test & set

Test a value and set it if the test passed
Also used to implement locks
In cache-coherent machines - lots of invalidations

lock: ADD  R2, R0, #1
      T&S  R2, (R1)
      BNEZ R2, lock
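As a rough software model of the test&set spin lock, the Python sketch below uses an internal `threading.Lock` purely as a stand-in for the hardware's atomicity guarantee. The names `Word`, `acquire`, `release`, and `demo` are hypothetical, and `test_and_set` is modeled as "return the old value, then store 1", which behaves the same as the slide's description for 0/1 lock values.

```python
import threading
import time

class Word:
    """One memory word; _hw stands in for the hardware's atomicity guarantee."""
    def __init__(self, value=0):
        self.value = value
        self._hw = threading.Lock()

    def test_and_set(self):
        """Atomically return the old value and set the word to 1."""
        with self._hw:
            old = self.value
            self.value = 1
            return old

def acquire(lock_word):
    while lock_word.test_and_set() != 0:   # BNEZ R2, lock
        time.sleep(0)                      # yield so other model threads can run

def release(lock_word):
    lock_word.value = 0                    # an ordinary store frees the lock

def demo(n_threads=3, iters=100):
    """Increment a shared counter under the spin lock; return the final count."""
    lock_word, counter = Word(), [0]

    def worker():
        for _ in range(iters):
            acquire(lock_word)
            counter[0] += 1                # critical section
            release(lock_word)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Every failed attempt in `acquire` still performs a write, which is what generates the invalidation traffic on a cache-coherent machine.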

SLIDE 10

Test and test & set

Take advantage of locality
Test before write (to avoid an invalidation)

lock: LD   R2, (R1)
      BNEZ R2, lock
      ADD  R2, R0, #1
      T&S  R2, (R1)
      BNEZ R2, lock
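The benefit of testing before writing can be illustrated with a single-threaded Python model that counts writes to the lock word, used here as a crude proxy for invalidations. The `Word` class and the `spin_tas` / `spin_ttas` helpers are hypothetical names, and no real cache behavior is simulated.

```python
class Word:
    """Memory word that counts writes, a rough proxy for invalidations."""
    def __init__(self, value=0):
        self.value = value
        self.writes = 0

    def test_and_set(self):
        old, self.value = self.value, 1   # modeled as atomic (single-threaded)
        self.writes += 1                  # every T&S writes the cache line
        return old

def spin_tas(word, polls):
    """Poll the lock `polls` times with plain test&set; return the write count."""
    for _ in range(polls):
        word.test_and_set()
    return word.writes

def spin_ttas(word, polls):
    """Poll with a plain load first; only issue T&S when the lock looks free."""
    for _ in range(polls):
        if word.value == 0:               # LD R2, (R1); BNEZ R2, lock
            word.test_and_set()
    return word.writes
```

While the lock is held, `spin_ttas` only reads, so the waiter can spin on its locally cached copy.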

SLIDE 11

Atomic exchange

Exchange a value in a register with memory
Can be used to implement a spin lock

EXCH Reg, Mem

lock: LD   R2, (R1)
      BNEZ R2, lock
      ADD  R2, R0, #1
      EXCH R2, (R1)
      BNEZ R2, lock

SLIDE 12

Fetch and increment

Reads a memory value and increments it atomically
Useful for barrier implementation
Can be used to determine how many processes are waiting
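A single-use barrier built on fetch&increment might look like the Python sketch below. The inner `threading.Lock` only models hardware atomicity, and `Counter` and `run_barrier` are hypothetical names. The value returned by `fetch_and_increment` tells each arrival how many processes got there before it.

```python
import threading
import time

class Counter:
    """Shared counter; the inner lock stands in for hardware atomicity."""
    def __init__(self):
        self.value = 0
        self._hw = threading.Lock()

    def fetch_and_increment(self):
        with self._hw:
            old = self.value
            self.value = old + 1
            return old

def run_barrier(n=4):
    """Run n workers through one barrier; return (events, tickets)."""
    arrived = Counter()
    events, tickets = [], []          # list.append is thread-safe in CPython

    def worker():
        events.append("before")
        tickets.append(arrived.fetch_and_increment())  # old value = earlier arrivals
        while arrived.value < n:      # busy-wait until everyone has arrived
            time.sleep(0)             # yield so other model threads can run
        events.append("after")

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events, tickets
```

This sketch is not reusable: a real barrier would also need to reset the counter, for example with sense reversal.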

SLIDE 13

Performance

How expensive is locking with test&set?
N processors waiting
No hold time

$\sum_{i=1}^{N} (2i+1) = N^2 + 2N$

lock: ADD  R2, R0, #1
      T&S  R2, (R1)
      BNEZ R2, lock
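The closed form on this slide can be checked step by step. The slide does not show the summation bounds; the derivation below assumes the index runs from 1 to N, which is the range that reproduces the stated result.

```latex
\begin{align*}
\sum_{i=1}^{N} (2i+1)
  &= 2\sum_{i=1}^{N} i + \sum_{i=1}^{N} 1 \\
  &= 2\cdot\frac{N(N+1)}{2} + N \\
  &= N^2 + 2N
\end{align*}
% Example, N = 4: 3 + 5 + 7 + 9 = 24 = 4^2 + 2 \cdot 4
```

The cost therefore grows quadratically with the number of waiting processors.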

SLIDE 14

Hardware complexity

Single, atomic memory operation
Hard to implement in hardware:
complicates the coherence protocol
must be uninterrupted
must avoid deadlocks

SLIDE 15

Reducing complexity

Maintain the atomicity requirement
Have two linked instructions
The second indicates whether the pair executed atomically

MIPS/SGI: Load-linked (LL), Store-conditional (SC)

SLIDE 16

LL/SC

LL returns the value of a memory location
SC:
LL/SC pair not atomic - returns 0, doesn't update memory
LL/SC pair atomic - returns 1, updates memory

SLIDE 17

LL/SC

LL/SC also fails if the processor context-switches between LL and SC
Can implement other primitives: exchange, fetch-and-add, ...

lock: LL   R2, (R1)
      BNEZ R2, lock
      ADD  R2, R0, #1
      SC   R2, (R1)
      BEQZ R2, lock
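To show how LL/SC can build another primitive, here is a single-processor Python model of the pair with a retry loop implementing fetch-and-add. Everything here is a hypothetical illustration: the `Memory` class, its `linked` flag, `remote_write`, and the `interfere` hook for injecting a conflicting store. It is not a description of the MIPS hardware.

```python
class Memory:
    """One word plus a link flag, modeling LL/SC for a single processor."""
    def __init__(self, value=0):
        self.value = value
        self.linked = False

    def ll(self):
        self.linked = True            # start watching the word
        return self.value

    def sc(self, new):
        if not self.linked:           # a conflicting write broke the link
            return 0                  # fail: memory is not updated
        self.value = new
        self.linked = False
        return 1                      # success: memory is updated

    def remote_write(self, new):
        """A store by another processor; it breaks any outstanding link."""
        self.value = new
        self.linked = False

def fetch_and_add(mem, delta, interfere=None):
    """Retry LL/SC until the pair executes atomically; return the old value."""
    while True:
        old = mem.ll()
        if interfere is not None:     # optionally inject one conflicting write
            interfere()
            interfere = None
        if mem.sc(old + delta):       # BEQZ R2, retry
            return old
```

For example, with `m = Memory(5)`, calling `fetch_and_add(m, 3, interfere=lambda: m.remote_write(10))` fails its first SC, retries, and returns 10, leaving 13 in memory.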

SLIDE 18

Implementations

HEP Multiprocessor
NYU Ultracomputer
IBM RP3
Illinois Cedar

SLIDE 19

HEP multiprocessor

Each word in memory has Full/Empty bit
Bit is tested in hardware before a RD/WR
The RD/WR blocks until the test succeeds:
RD until full
WR until empty
When the test succeeds, the bit is set to the opposite value
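The Full/Empty semantics can be mimicked in software. The Python sketch below is a model only: `threading.Condition` plays the role of the hardware that blocks the RD/WR, and `FEWord` and `producer_consumer` are hypothetical names.

```python
import threading

class FEWord:
    """Memory word with a Full/Empty bit; the Condition models hardware blocking."""
    def __init__(self):
        self.value = None
        self.full = False
        self._cv = threading.Condition()

    def write(self, value):
        with self._cv:
            self._cv.wait_for(lambda: not self.full)   # WR blocks until empty
            self.value = value
            self.full = True                           # flip the bit to full
            self._cv.notify_all()

    def read(self):
        with self._cv:
            self._cv.wait_for(lambda: self.full)       # RD blocks until full
            self.full = False                          # flip the bit to empty
            self._cv.notify_all()
            return self.value

def producer_consumer(items):
    """Pass items one at a time through a single F/E word."""
    word, received = FEWord(), []

    def consumer():
        for _ in items:
            received.append(word.read())

    t = threading.Thread(target=consumer)
    t.start()
    for x in items:
        word.write(x)
    t.join()
    return received
```

The producer and consumer alternate strictly without any explicit lock, which is the kind of fine-grained dependence this mechanism targets.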

SLIDE 20

HEP multiprocessor

PROS:
Very efficient for low-level dependences (compare to locks)

CONS:
Requires complex hardware:
F/E bits
Support to queue a process if test fails
Logic to implement indivisible ops

SLIDE 21

NYU ultracomputer

Implements fetch-and-add
PROS:
Can use message combining, scales well
Efficient barrier implementation
CONS:
Very complex network
Adders in each memory module

SLIDE 22

Message combining

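[Figure: message combining for fetch-and-add, in four panels, each showing a switch and memory location X (initially 5). Requests F&A(X,3) and F&A(X,1) meet at the switch, which saves 3 and forwards a single F&A(X,4). Memory returns 5 and X becomes 9. The switch then replies 5 and 8 to the two requesters.]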
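The combining step can be written out as a small Python function, using the values that appear in the figure (X = 5, increments 3 and 1). The function name `combine_and_serve` and the dictionary used as memory are assumptions for illustration. A real combining network does this in switch hardware and may combine across several stages.

```python
def combine_and_serve(memory, addr, inc_a, inc_b):
    """Two F&A requests to one address meet in a switch and are combined."""
    saved = inc_a                    # the switch remembers the first increment
    combined = inc_a + inc_b         # only one request travels on to memory
    old = memory[addr]               # memory performs a single fetch-and-add
    memory[addr] = old + combined
    reply_a = old                    # first requester sees the original value
    reply_b = old + saved            # second sees the value as if it ran after the first
    return reply_a, reply_b
```

With `memory = {"X": 5}`, the call `combine_and_serve(memory, "X", 3, 1)` returns `(5, 8)` and leaves 9 in `memory["X"]`, the same result as serving the two requests one after the other.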

SLIDE 23

IBM RP3

Implements fetch-and-phi, where phi can be:
Add, And, Or
Min, Max, Store
Store if zero
PRO: generality
CON: hardware complexity
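Fetch-and-phi generalizes the earlier primitives by making the update function a parameter. The Python sketch below illustrates the idea with a hypothetical `PHI` table and a `fetch_and_phi` helper. It is single-threaded, so the atomicity the hardware would provide is assumed rather than modeled.

```python
import operator

# Candidate phi functions: each maps (old value, operand) to the new value.
PHI = {
    "add": operator.add,
    "and": operator.and_,
    "or": operator.or_,
    "min": min,
    "max": max,
    "store": lambda old, new: new,
    "store_if_zero": lambda old, new: new if old == 0 else old,
}

def fetch_and_phi(memory, addr, op, operand):
    """Return the old value at addr and replace it with phi(old, operand)."""
    old = memory[addr]
    memory[addr] = PHI[op](old, operand)   # assumed atomic in hardware
    return old
```

With values restricted to 0 and 1, `fetch_and_phi(memory, addr, "or", 1)` behaves like test&set, and `"add"` with operand 1 is fetch&increment.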

SLIDE 24

Illinois Cedar

General atomic instruction that operates on synchronization variables
Synch variable has 2 words: Key and Value
Synch instruction: {addr;(cond);op on key;op on value}
F/E bit test for a read: {X; (X.key==1)*; decrement; fetch}
Complex hardware, special processor for each memory module
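The four-part synch instruction can be modeled as a Python function that takes the condition and the two operations as arguments. This is a hypothetical sketch: the `synch` function and the dictionary layout are inventions for illustration, and the `*` in the slide's example is read here as "retry until the condition holds", which is an assumption. The model makes a single attempt and reports failure instead of retrying.

```python
def synch(var, cond, key_op, value_op):
    """One attempt at a Cedar-style synch instruction on var = {'key', 'value'}.

    Returns (succeeded, result). The hardware is assumed to perform the
    test and both operations as one indivisible step.
    """
    if not cond(var["key"]):
        return False, None               # with '*', the hardware would retry
    var["key"] = key_op(var["key"])      # op on key
    return True, value_op(var)           # op on value

def fe_read(var):
    """{X; (X.key==1)*; decrement; fetch}: read only when the key marks 'full'."""
    return synch(var,
                 cond=lambda key: key == 1,
                 key_op=lambda key: key - 1,
                 value_op=lambda v: v["value"])
```

With `X = {"key": 1, "value": 42}`, the first `fe_read(X)` succeeds and leaves the key at 0, while a second call fails until some writer sets the key back to 1.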

SLIDE 25

That’s it for today!