Hardware-Software Trade-offs in Synchronization
CS 252, Spring 05
David E. Culler
Computer Science Division
U.C. Berkeley
3/29/2005 CS252 S05

Role of Synchronization
• “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
• Types of Synchronization
  – Mutual Exclusion
  – Event synchronization
    » point-to-point
    » group
    » global (barriers)
• How much hardware support?
  – high-level operations?
  – atomic instructions?
  – specialized interconnect?

Layers of synch support
• Application
• User library
• Operating System Support
• Synchronization Library
• Atomic RMW ops
• HW Support

Mini-Instruction Set debate
• atomic read-modify-write instructions
  – IBM 370: included atomic compare&swap for multiprogramming
  – x86: any instruction can be prefixed with a lock modifier
  – High-level language advocates want hardware locks/barriers
    » but it goes against the “RISC” flow, and has other problems
  – SPARC: atomic register-memory ops (swap, compare&swap)
  – MIPS, IBM Power: no atomic operations but pair of instructions
    » load-locked, store-conditional
    » later used by PowerPC and DEC Alpha too
• Rich set of tradeoffs

Other forms of hardware support
• Separate lock lines on the bus
• Lock locations in memory
• Lock registers (Cray Xmp)
• Hardware full/empty bits (Tera)
• Bus support for interrupt dispatch

Components of a Synchronization Event
• Acquire method
  – Acquire right to the synch
    » enter critical section, go past event
• Waiting algorithm
  – Wait for synch to become available when it isn’t
  – busy-waiting, blocking, or hybrid
• Release method
  – Enable other processors to acquire right to the synch
• Waiting algorithm is independent of type of synchronization
  – makes no sense to put in hardware
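The three components listed above can be kept apart in software, which is one way to see why the waiting algorithm need not live in hardware. The following is a minimal sketch in C11, not code from the lecture: the names synch_lock, try_acquire, acquire, release and the spin_limit parameter are hypothetical, the atomic swap stands in for whatever atomic RMW op the hardware offers, and sched_yield (POSIX) is used as a stand-in for blocking.

```c
#include <sched.h>      /* sched_yield (POSIX) */
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: 0 = free, 1 = held. */
typedef struct { atomic_int held; } synch_lock;

/* Acquire method: one atomic swap; succeeds if the old value was 0. */
static bool try_acquire(synch_lock *l) {
    return atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0;
}

/* Waiting algorithm (hybrid): busy-wait for spin_limit failed attempts,
   then give up the processor before trying again. spin_limit is an
   assumed tuning knob, not a value from the lecture. */
static void acquire(synch_lock *l, int spin_limit) {
    int attempts = 0;
    while (!try_acquire(l)) {
        if (++attempts >= spin_limit) {
            sched_yield();
            attempts = 0;
        }
    }
}

/* Release method: an ordinary store lets another processor acquire. */
static void release(synch_lock *l) {
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

Only try_acquire depends on the atomic operation. Changing the waiting policy (pure busy-wait, longer spin, yield sooner) means editing acquire alone, with the acquire and release methods untouched.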
Atomic Instructions
• Specifies a location, register, & atomic operation
  – Value in location read into a register
  – Another value (function of value read or not) stored into location
• Many variants
  – Varying degrees of flexibility in second part
• Simple example: test&set
  – Value in location read into a specified register
  – Constant 1 stored into location
  – Successful if value loaded into register is 0
  – Other constants could be used instead of 1 and 0

Strawman Lock (Busy-Wait)
  lock:    ld   register, location   /* copy location to register */
           cmp  register, #0         /* compare with 0 */
           bnz  lock                 /* if not 0, try again */
           st   location, #1         /* store 1 to mark it locked */
           ret                       /* return control to caller */
  unlock:  st   location, #0         /* write 0 to location */
           ret                       /* return control to caller */
Why doesn’t the acquire method work? Release method?

Simple Test&Set Lock
  lock:    t&s  register, location
           bnz  lock                 /* if not 0, try again */
           ret                       /* return control to caller */
  unlock:  st   location, #0         /* write 0 to location */
           ret                       /* return control to caller */
• Other read-modify-write primitives
  – Swap, Exch
  – Fetch&op
  – Compare&swap
    » Three operands: location, register to compare with, register to swap with
    » Not commonly supported by RISC instruction sets
• cacheable or uncacheable?

Performance Criteria for Synch. Ops
• Latency (time per op)
  – especially under light contention
• Bandwidth (ops per sec)
  – especially under high contention
• Traffic
  – load on critical resources
  – especially on failures under contention
• Storage
• Fairness
• How do you measure synchronization performance?
• Under what conditions?
  – Contention?
  – Scale?
  – Duration?

T&S Lock Microbenchmark: SGI Challenge
[Figure: Time (µs), 0 to 20, versus Number of processors, 3 to 15. Curves: Test&set, c = 0; Test&set, exponential backoff, c = 3.64; Test&set, exponential backoff, c = 0; Ideal. Measured loop: lock; delay(c); unlock;]
• Why does performance degrade?
• Bus Transactions on T&S?
• Hardware support in CC protocol?

Enhancements to Simple Lock
• Reduce frequency of issuing test&sets while waiting
  – Test&set lock with backoff
  – Don’t back off too much or will be backed off when lock becomes free
  – Exponential backoff works quite well empirically: i-th time = k*c^i
• Busy-wait with read operations rather than test&set
  – Test-and-test&set lock
  – Keep testing with ordinary load
    » cached lock variable will be invalidated when release occurs
  – When value changes (to 0), try to obtain lock with test&set
    » only one attempter will succeed; others will fail and start testing again
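The two enhancements above can be sketched together in C11. This is an illustrative sketch under stated assumptions, not the benchmarked code: the slides present backoff and test-and-test&set as separate options, and combining them here is a choice made for the example. The names tts_lock, test_and_set, backoff_delay, spin_delay, the BACKOFF_* constants and the delay cap are all hypothetical, and atomic_exchange plays the role of the test&set instruction.

```c
#include <stdatomic.h>

/* Assumed tuning values, not taken from the lecture. */
enum { BACKOFF_K = 16, BACKOFF_C = 2, BACKOFF_MAX = 4096 };

typedef struct { atomic_int locked; } tts_lock;   /* 0 = free, 1 = held */

/* test&set: atomically store 1 and return the previous value. */
static int test_and_set(tts_lock *l) {
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire);
}

/* Delay after the i-th failed attempt: k * c^i, capped at max_delay so a
   waiter is not still backed off long after the lock becomes free. */
static unsigned long backoff_delay(unsigned long k, unsigned long c,
                                   unsigned i, unsigned long max_delay) {
    unsigned long d = k;
    while (i-- > 0 && d < max_delay)
        d *= c;
    return d < max_delay ? d : max_delay;
}

/* Crude delay loop; real code would likely use a pause or yield hint. */
static void spin_delay(unsigned long iterations) {
    for (volatile unsigned long n = 0; n < iterations; n++) { }
}

static void tts_acquire(tts_lock *l) {
    unsigned failures = 0;
    for (;;) {
        /* Test: spin on an ordinary load, served from the local cache
           until the release invalidates the line. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0) { }
        /* Test&set: of the waiters that saw 0, only one reads back 0. */
        if (test_and_set(l) == 0)
            return;
        /* Lost the race: back off k*c^i before testing again. */
        spin_delay(backoff_delay(BACKOFF_K, BACKOFF_C, failures++, BACKOFF_MAX));
    }
}

static void tts_release(tts_lock *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

While the lock is held, waiters execute only loads, so the test&set traffic is confined to the moment of release. The cap on the delay is one way to act on the warning about backing off too much.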