Synchronization. Computer Architecture. J. Daniel García Sánchez - PowerPoint PPT Presentation



SLIDE 1

Synchronization

Computer Architecture

J. Daniel García Sánchez (coordinator)
David Expósito Singh
Francisco Javier García Blas

ARCOS Group
Computer Science and Engineering Department
University Carlos III of Madrid

cbed

– Computer Architecture – ARCOS Group – http://www.arcos.inf.uc3m.es 1/35

SLIDE 2

Synchronization Introduction

  1. Introduction
  2. Hardware primitives
  3. Locks
  4. Barriers
  5. Conclusion

SLIDE 3

Synchronization Introduction

Synchronization in shared memory

Communication is performed through shared memory.

Multiple accesses to shared variables must be synchronized.

Alternatives:

  • 1-to-1 communication.
  • Collective communication (1-to-N).

SLIDE 4

Synchronization Introduction

Communication 1 to 1

Ensure that the read (receive) is performed after the write (send).

In case of reuse (loops):

  • Ensure that the next write (send) is performed after the previous read (receive).

Accesses must be performed with mutual exclusion:

  • Only one process accesses a variable at a time.

Critical section:

  • Sequence of instructions that accesses one or more variables with mutual exclusion.
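
The two ordering requirements above (read after write, and next write after previous read) can be illustrated with a one-slot mailbox. This is only a sketch: the names Mailbox, send, receive and transfer are hypothetical and do not come from the slides, and it assumes exactly one sender and one receiver.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical one-slot mailbox (illustrative names, not from the slides).
struct Mailbox {
    int data = 0;                    // shared variable used to communicate
    std::atomic<bool> full{false};   // true: written but not yet read
};

// send: wait until the previous value has been read, then write the new one.
void send(Mailbox& m, int value) {
    while (m.full.load(std::memory_order_acquire)) {
        std::this_thread::yield();   // busy waiting: write only after former read
    }
    m.data = value;
    m.full.store(true, std::memory_order_release);
}

// receive: wait until a value has been written, then read it.
int receive(Mailbox& m) {
    while (!m.full.load(std::memory_order_acquire)) {
        std::this_thread::yield();   // busy waiting: read only after write
    }
    int value = m.data;
    m.full.store(false, std::memory_order_release);
    return value;
}

// Sends the values 1..n from a producer thread to the calling thread.
std::vector<int> transfer(int n) {
    Mailbox m;
    std::vector<int> received;
    std::thread producer{[&] {
        for (int i = 1; i <= n; ++i) { send(m, i); }
    }};
    for (int i = 0; i < n; ++i) { received.push_back(receive(m)); }
    producer.join();
    return received;
}
```

Calling transfer(50) should return the values 1 to 50 in order, since every write waits for the previous read and every read waits for the matching write.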

SLIDE 5

Synchronization Introduction

Collective communication

Requires coordinating multiple accesses to variables:

  • Writes without interference.
  • Reads must wait for data to be available.

Accesses to the variable must be mutually exclusive.

The result must not be read until all processes/threads have executed their critical section.

SLIDE 6

Synchronization Introduction

Adding a vector

Critical section in loop

    void f(int max) {
      vector<double> v = get_vector(max);
      double sum = 0;
      auto do_sum = [&](int start, int n) {
        for (int i = start; i < n; ++i) {
          sum += v[i];      // critical section: shared update in every iteration
        }
      };
      thread t1{do_sum, 0, max/2};
      thread t2{do_sum, max/2, max};   // starts at max/2 so that no element is skipped
      t1.join();
      t2.join();
    }

Critical section out of loop

    void f(int max) {
      vector<double> v = get_vector(max);
      double sum = 0;
      auto do_sum = [&](int start, int n) {
        double local_sum = 0;
        for (int i = start; i < n; ++i) {
          local_sum += v[i];
        }
        sum += local_sum;   // critical section: one shared update per thread
      };
      thread t1{do_sum, 0, max/2};
      thread t2{do_sum, max/2, max};   // starts at max/2 so that no element is skipped
      t1.join();
      t2.join();
    }
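
In both variants the update of sum is a critical section that still has to be executed with mutual exclusion. A minimal sketch of the second variant, assuming std::mutex as the lock (the slides do not prescribe one) and a hypothetical function name sum_vector that takes the vector as a parameter:

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Adds all elements of v with two threads; only the final update is locked.
double sum_vector(const std::vector<double>& v) {
    double sum = 0;
    std::mutex m;   // assumed lock protecting sum
    auto do_sum = [&](std::size_t start, std::size_t end) {
        double local_sum = 0;
        for (std::size_t i = start; i < end; ++i) {
            local_sum += v[i];                 // private data: no synchronization
        }
        std::lock_guard<std::mutex> guard{m};  // critical section out of the loop
        sum += local_sum;
    };
    std::size_t half = v.size() / 2;
    std::thread t1{do_sum, std::size_t{0}, half};
    std::thread t2{do_sum, half, v.size()};
    t1.join();
    t2.join();
    return sum;
}
```

Keeping the critical section out of the loop means each thread acquires the lock once instead of once per element.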

SLIDE 8

Synchronization Hardware primitives

Hardware support

A global order of operations must be established.

  • The consistency model alone can be insufficient and complex.
  • It is usually complemented with read-modify-write operations.

Example in IA-32:

  • Instructions with the LOCK prefix.
  • Exclusive access to the bus if the location is not in cache.

SLIDE 9

Synchronization Hardware primitives

Primitives: Test and set

Test-and-set instruction.

Atomic sequence:

  1. Read the memory location into a register (returned as the result).
  2. Write value 1 to the memory location.

Uses: IBM 370, Sparc V9
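
In C++ the same atomic sequence is available as std::atomic_flag::test_and_set. The sketch below (count_winners is a hypothetical helper name) starts several threads that all execute test-and-set on the same location; because the read and the write form one atomic step, only one thread can observe the old value 0.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Counts how many of n threads observe the flag as clear (old value 0).
int count_winners(int n) {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;   // memory location starts at 0
    std::atomic<int> winners{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i) {
        threads.emplace_back([&] {
            bool old = flag.test_and_set();     // atomically: read old value, write 1
            if (!old) { ++winners; }
        });
    }
    for (auto& t : threads) { t.join(); }
    return winners.load();
}
```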

SLIDE 10

Synchronization Hardware primitives

Primitives: Exchange

Exchange (swap) instruction.

Atomic sequence:

  1. Exchanges the contents of a memory location and a register.
  2. Includes a memory read and a memory write.

More general than test-and-set.

Instruction IA-32:

    XCHG reg, mem

Uses: Sparc V9, IA-32, Itanium
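
A small sketch of why exchange is more general than test-and-set, using std::atomic<int>::exchange. Both function names are hypothetical and chosen only for illustration.

```cpp
#include <atomic>

// Test-and-set expressed through exchange: write 1, return the previous value.
int test_and_set_via_exchange(std::atomic<int>& mem) {
    return mem.exchange(1);
}

// Exchange can write any value, not only 1, and still returns the old content.
int take_and_replace(std::atomic<int>& mem, int new_value) {
    return mem.exchange(new_value);
}
```

Starting from a location holding 0, the first call to test_and_set_via_exchange returns 0 and every later call returns 1, which is exactly the behavior of test-and-set.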

SLIDE 11

Synchronization Hardware primitives

Primitives: Fetch and operation

Fetch-and-operation instruction (fetch-and-op).

Several operations: fetch-add, fetch-or, fetch-inc, ...

Atomic sequence:

  1. Read the memory location into a register (that value is returned).
  2. Write to the memory location the result of applying an operation to the original value.

Instruction IA-32:

    LOCK XADD mem, reg

Uses: IBM SP3, Origin 2000, IA-32, Itanium.
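
A sketch of fetch-add through std::atomic<int>::fetch_add, with a hypothetical helper count_with_fetch_add: several threads increment one shared counter and no increment is lost, because each read and write pair is a single atomic step.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each of n_threads threads adds 1 to a shared counter `iterations` times.
int count_with_fetch_add(int n_threads, int iterations) {
    std::atomic<int> counter{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                counter.fetch_add(1);   // returns the old value, stores old + 1
            }
        });
    }
    for (auto& th : threads) { th.join(); }
    return counter.load();
}
```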

SLIDE 12

Synchronization Hardware primitives

Primitives: Compare and exchange

Compare-and-exchange instruction (compare-and-swap or compare-and-exchange).

Operates on two local variables (registers a and b) and a memory location (variable x).

Atomic sequence:

  1. Read the value of x.
  2. If x is equal to register a → exchange x and register b.

Instruction IA-32:

    LOCK CMPXCHG mem, reg

Implicitly uses the additional register eax.

Uses: IBM 370, Sparc V9, IA-32, Itanium.
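
A sketch of a typical use of compare-and-exchange: an atomic maximum built as a retry loop on std::atomic<int>::compare_exchange_strong. The function name atomic_max is hypothetical. Note one difference from the sequence above: on failure the C++ call loads the current value of x into the expected argument instead of exchanging registers.

```cpp
#include <atomic>

// Atomically stores max(x, candidate) in x; returns true if x was updated.
bool atomic_max(std::atomic<int>& x, int candidate) {
    int expected = x.load();   // plays the role of register a
    while (expected < candidate) {
        // If x still equals expected, candidate (register b) is written to x.
        // Otherwise expected is refreshed with the current value of x.
        if (x.compare_exchange_strong(expected, candidate)) {
            return true;
        }
    }
    return false;              // x already held a value >= candidate
}
```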

SLIDE 13

Synchronization Hardware primitives

Primitives: Conditional store

Pair of instructions LL/SC (Load Linked/Store Conditional).

Operation:

  • If the content of the variable read through LL is modified before the SC, the store is not performed.
  • If a context switch happens between LL and SC, the store is not performed.
  • SC returns a success/failure code.

Example in PowerPC:

    LWARX
    STWCX

Uses: Origin 2000, Sparc V9, Power PC
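
C++ does not expose LL/SC directly. As an analogy only, the retry loop below uses std::atomic<int>::compare_exchange_weak, which is allowed to fail even when the value is unchanged, much as SC can fail after a context switch. The function name increment is hypothetical, and the mapping to LL/SC instructions is an assumption about how compilers for such architectures usually generate code.

```cpp
#include <atomic>

// Atomic increment written as a retry loop, in the style of LL/SC code.
int increment(std::atomic<int>& x) {
    int old = x.load();                                // analogous to LL
    while (!x.compare_exchange_weak(old, old + 1)) {   // analogous to SC
        // The store was not performed: old now holds the current value of x,
        // so the loop simply retries with the refreshed value.
    }
    return old + 1;
}
```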

SLIDE 15

Synchronization Locks

Locks

A lock is a mechanism to ensure mutual exclusion.

Two synchronization functions:

Lock(k):

  • Acquires the lock.
  • If n processes try to acquire the lock, n-1 are kept waiting.
  • If more processes arrive, they are also kept waiting.

Unlock(k):

  • Releases the lock.
  • Allows one waiting process to acquire the lock.

SLIDE 16

Synchronization Locks

Waiting mechanisms

Two alternatives: busy waiting and blocking.

Busy waiting:

  • The process waits in a loop that constantly queries the value of the wait control variable.
  • Spin-lock.

Blocking:

  • The process is suspended and yields the processor to another process.
  • If a process executes unlock and there are blocked processes, one of them is unblocked.
  • Requires support from a scheduler (usually the OS or a runtime).

The choice between the alternatives depends on cost.
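
A sketch of the blocking alternative, assuming the C++ standard library as the runtime that provides scheduler support. BlockingLock and count_with_blocking_lock are hypothetical names; the class relies on std::mutex and std::condition_variable internally, so it illustrates the waiting behavior rather than a from-scratch lock.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical blocking lock: waiting threads are suspended, not spinning.
class BlockingLock {
public:
    void lock() {
        std::unique_lock<std::mutex> guard{m_};
        cv_.wait(guard, [this] { return !closed_; });   // suspended while closed
        closed_ = true;
    }
    void unlock() {
        {
            std::lock_guard<std::mutex> guard{m_};
            closed_ = false;
        }
        cv_.notify_one();                               // unblock one waiter
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
};

// Increments a plain counter under the lock from several threads.
int count_with_blocking_lock(int n_threads, int iterations) {
    BlockingLock k;
    int counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                k.lock();
                ++counter;      // critical section
                k.unlock();
            }
        });
    }
    for (auto& th : threads) { th.join(); }
    return counter;
}
```

Blocking tends to pay off when waits are long compared with the cost of suspending and resuming a thread; for very short critical sections a spin-lock may be cheaper.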

SLIDE 17

Synchronization Locks

Components

Three design elements in a locking mechanism: acquisition, waiting, and release.

Acquisition method:

  • Used to try to acquire the lock.

Waiting method:

  • Mechanism to wait until the lock can be acquired.

Release method:

  • Mechanism to release one or several waiting processes.

SLIDE 18

Synchronization Locks

Simple locks

Shared variable k with two values:

  • 0 → open.
  • 1 → closed.

Lock(k):

  • If k=1 → busy waiting while k=1.
  • If k=0 → k=1.
  • Two processes must not be able to acquire the lock simultaneously.
  • Use a read-modify-write operation to close it.

SLIDE 19

Synchronization Locks

Simple implementations

Test and set

    void lock(atomic_flag & k) {
      while (k.test_and_set()) {}     // spin until the previous value was clear
    }

    void unlock(atomic_flag & k) {
      k.clear();
    }

Fetch and operate

    void lock(atomic<int> & k) {
      while (k.fetch_or(1) == 1) {}   // spin until the previous value was 0
    }

    void unlock(atomic<int> & k) {
      k.store(0);
    }

SLIDE 20

Synchronization Locks

Simple implementations

Exchange IA-32

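    do_lock:
            mov  eax, 1       ; value to store: 1 = closed
    repeat:
            xchg eax, _k      ; atomically swap eax with the lock variable
            cmp  eax, 1       ; previous value 1: the lock was already closed
            jz   repeat       ; retry until the previous value was 0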

SLIDE 21

Synchronization Locks

Exponential delay

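Goal:

  • Reduce the number of memory accesses.
  • Limit energy consumption.

Lock with exponential delay

    void lock(atomic_flag & k) {
      int delay = 1;               // assumed initial pause; units depend on perform_pause
      while (k.test_and_set()) {
        perform_pause(delay);      // helper assumed by the slide: waits for delay units
        delay *= 2;
      }
    }

The time between successive calls to test_and_set() grows exponentially.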

SLIDE 22

Synchronization Locks

Synchronization and modification

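Performance can be improved by using the same variable both to synchronize and to communicate.

  • Avoid shared variables that are used only to synchronize.

Add a vector

    // sum is a shared atomic variable; iproc is the index of this process
    // and nproc the number of processes.
    double partial = 0;
    for (int i = iproc; i < max; i += nproc) {
      partial += v[i];
    }
    sum.fetch_add(partial);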
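
A complete sketch of this scheme under stated assumptions. std::atomic<double>::fetch_add is only available from C++20, so the hypothetical helper atomic_add emulates it with a compare-and-exchange loop; sum_vector_interleaved is also a hypothetical name, and nproc threads stand in for the processes.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Adds value to an atomic double with a compare-and-exchange retry loop.
void atomic_add(std::atomic<double>& target, double value) {
    double old = target.load();
    while (!target.compare_exchange_weak(old, old + value)) {}
}

// Thread iproc adds elements iproc, iproc + nproc, iproc + 2*nproc, ...
double sum_vector_interleaved(const std::vector<double>& v, int nproc) {
    std::atomic<double> sum{0.0};   // same variable synchronizes and communicates
    std::vector<std::thread> threads;
    for (int iproc = 0; iproc < nproc; ++iproc) {
        threads.emplace_back([&, iproc] {
            double partial = 0;
            for (std::size_t i = iproc; i < v.size(); i += nproc) {
                partial += v[i];
            }
            atomic_add(sum, partial);   // one synchronized update per thread
        });
    }
    for (auto& t : threads) { t.join(); }
    return sum.load();
}
```

No separate lock variable is needed: the atomic update of sum is the only synchronization.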

SLIDE 23

Synchronization Locks

Locks and arrival order

Problem:

  • Simple implementations do not fix a lock acquisition order.
  • Starvation might happen.

Solution:

  • Grant the lock in order of request age (oldest request first).
  • Guarantees FIFO order.

SLIDE 24

Synchronization Locks

Tagged locks

Two counters:

  • Acquisition counter: number of processes that have requested the lock.
  • Release counter: number of times the lock has been released.

Lock:

  • Tag → current value of the acquisition counter.
  • The acquisition counter is incremented.
  • The process waits until the release counter matches its tag.

Unlock:

  • Increments the release counter.
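
The two counters above can be sketched as follows. TicketLock and count_with_ticket_lock are hypothetical names; taking the tag and incrementing the acquisition counter are done in one fetch_add so that no two processes receive the same tag.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical tagged (ticket) lock built from two counters.
class TicketLock {
public:
    void lock() {
        // Take the tag and increment the acquisition counter atomically.
        unsigned tag = acquire_.fetch_add(1);
        // Wait until the release counter matches the tag.
        while (release_.load(std::memory_order_acquire) != tag) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        release_.fetch_add(1, std::memory_order_release);
    }
private:
    std::atomic<unsigned> acquire_{0};   // acquisition counter
    std::atomic<unsigned> release_{0};   // release counter
};

// Increments a plain counter under the lock from several threads.
int count_with_ticket_lock(int n_threads, int iterations) {
    TicketLock k;
    int counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                k.lock();
                ++counter;      // critical section
                k.unlock();
            }
        });
    }
    for (auto& th : threads) { th.join(); }
    return counter;
}
```

Tags are handed out in arrival order and served in increasing order, which gives the FIFO behavior.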

SLIDE 25

Synchronization Locks

Queue based locks

Keep a queue of processes waiting to enter the critical section.

Lock:

  • Checks whether the queue is empty.
  • If a process joins the queue, it performs busy waiting on a variable.
  • Each process performs busy waiting on a different variable.

Unlock:

  • Removes the process from the queue.
  • Modifies the wait control variable of the next waiting process.
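
One possible realization is an array-based queue, sketched below with hypothetical names (ArrayQueueLock, count_with_queue_lock). It assumes that at most max_threads threads use the lock at the same time; each waiter spins on its own slot, and unlock hands the lock over by writing the slot of the next position.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical array-based queue lock; at most max_threads concurrent users.
class ArrayQueueLock {
public:
    explicit ArrayQueueLock(std::size_t max_threads) : slots_(max_threads) {
        for (auto& s : slots_) { s.store(false); }
        slots_[0].store(true);        // empty queue: first arrival enters directly
    }
    // Joins the queue and returns the slot this thread waited on.
    std::size_t lock() {
        std::size_t my = next_.fetch_add(1) % slots_.size();
        while (!slots_[my].load()) {  // each thread spins on a different variable
            std::this_thread::yield();
        }
        slots_[my].store(false);      // reset the slot for later reuse
        return my;
    }
    // Leaves the queue and modifies the wait variable of the next position.
    void unlock(std::size_t my) {
        slots_[(my + 1) % slots_.size()].store(true);
    }
private:
    std::vector<std::atomic<bool>> slots_;
    std::atomic<std::size_t> next_{0};
};

// Increments a plain counter under the lock from several threads.
int count_with_queue_lock(int n_threads, int iterations) {
    ArrayQueueLock k(static_cast<std::size_t>(n_threads));
    int counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                std::size_t slot = k.lock();
                ++counter;            // critical section
                k.unlock(slot);
            }
        });
    }
    for (auto& th : threads) { th.join(); }
    return counter;
}
```

In a tuned implementation each slot would typically be placed in its own cache line, so that a release only disturbs the thread waiting on that slot; the sketch omits this padding.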

SLIDE 27

Synchronization Barriers

Barriers

A barrier synchronizes several processes at a given point.

  • Guarantees that no process passes the barrier until all have arrived.
  • Used to synchronize phases in a program.

SLIDE 28

Synchronization Barriers

Centralized barriers

Centralized counter associated with the barrier.

  • Counts the number of processes that have arrived at the barrier.

Barrier function:

  • Increments the counter.
  • Waits for the counter to reach the number of processes to be synchronized.

SLIDE 29

Synchronization Barriers

Simple barrier

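Simple implementation

    // barrier is assumed to hold a lock, a counter and a flag;
    // n is the number of processes to synchronize.
    do_barrier(barrier, n) {
      lock(barrier.lock);
      if (barrier.counter == 0) {
        barrier.flag = 0;              // first to arrive resets the flag
      }
      local_counter = barrier.counter++;
      unlock(barrier.lock);
      if (local_counter == n - 1) {    // last to arrive (counter++ yields the old value)
        barrier.counter = 0;
        barrier.flag = 1;              // release the waiting processes
      }
      else {
        while (barrier.flag == 0) {}   // busy waiting
      }
    }

Problem if the barrier is reused in a loop: a fast process may enter the next instance and reset the flag to 0 before a slow process has left the previous one, so the slow process never sees the release.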

SLIDE 30

Synchronization Barriers

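Barrier with sense reversal

Simple implementation

    // local_flag is private to each process and keeps its value between calls;
    // n is the number of processes to synchronize.
    do_barrier(barrier, n) {
      local_flag = !local_flag;        // sense for this use of the barrier
      lock(barrier.lock);
      local_counter = barrier.counter++;
      unlock(barrier.lock);
      if (local_counter == n - 1) {    // last to arrive (counter++ yields the old value)
        barrier.counter = 0;
        barrier.flag = local_flag;     // release the waiting processes
      }
      else {
        while (barrier.flag != local_flag) {}   // wait until the flag takes the new sense
      }
    }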
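
A runnable sketch of the sense-reversing barrier, with hypothetical names (SenseBarrier, run_phases). As a simplification it increments the counter with fetch_add instead of protecting it with a lock, which is an assumption and not the slide's implementation. run_phases reuses one barrier across several phases and counts how often a thread passed the barrier before all threads had arrived.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical sense-reversing barrier for n threads.
class SenseBarrier {
public:
    explicit SenseBarrier(int n) : n_{n} {}
    // local_sense is private to each thread and must start as false.
    void wait(bool& local_sense) {
        local_sense = !local_sense;                // sense for this phase
        if (counter_.fetch_add(1) == n_ - 1) {     // last thread to arrive
            counter_.store(0);
            flag_.store(local_sense);              // release the waiting threads
        } else {
            while (flag_.load() != local_sense) {
                std::this_thread::yield();         // busy waiting
            }
        }
    }
private:
    const int n_;
    std::atomic<int> counter_{0};
    std::atomic<bool> flag_{false};
};

// Runs n_phases phases on n_threads threads; returns the number of times a
// thread left the barrier of a phase before every thread had arrived.
int run_phases(int n_threads, int n_phases) {
    SenseBarrier barrier{n_threads};
    std::vector<std::atomic<int>> arrived(static_cast<std::size_t>(n_phases));
    for (auto& a : arrived) { a.store(0); }
    std::atomic<int> violations{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t) {
        threads.emplace_back([&] {
            bool local_sense = false;
            for (int p = 0; p < n_phases; ++p) {
                arrived[p].fetch_add(1);           // work of phase p
                barrier.wait(local_sense);
                if (arrived[p].load() != n_threads) { ++violations; }
            }
        });
    }
    for (auto& th : threads) { th.join(); }
    return violations.load();
}
```

Because the flag alternates between two values, a thread that races ahead into the next phase waits for the opposite sense and cannot disturb threads still leaving the previous phase.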

SLIDE 31

Synchronization Barriers

Tree barriers

A simple implementation of barriers is not scalable.

  • Contention in accesses to shared variables.

Tree structure for process arrival and release.

  • Especially useful in distributed networks.

SLIDE 33

Synchronization Conclusion

Summary

Need for synchronization of shared memory accesses:

  • Individual (1-to-1) and collective (1-to-N) communication.

Diversity of hardware primitives for synchronization.

Locks as a mechanism for mutual exclusion:

  • Busy waiting versus blocking.
  • Three design elements: acquisition, waiting, and release.

Locks may lead to problems if the acquisition order is not fixed (starvation):

  • Solutions based on tags or queues.

Barriers offer mechanisms to structure programs in phases.

