

SLIDE 1

Slides for Lecture 23

ENCM 501: Principles of Computer Architecture Winter 2014 Term Steve Norman, PhD, PEng

Electrical & Computer Engineering Schulich School of Engineering University of Calgary

3 April, 2014

SLIDE 2

ENCM 501 W14 Slides for Lecture 23

slide 2/19

Previous Lecture

◮ introduction to cache coherency protocols

SLIDE 3

Today’s Lecture

◮ MSI protocol for cache coherency
◮ snooping to support MSI and similar protocols
◮ race conditions at ILP and TLP levels
◮ use of locks to manage TLP races
◮ introduction to ISA support for locks

Related reading in Hennessy & Patterson: Sections 5.2, 5.5

SLIDE 4

Quick review: Caches in multi-core systems

These points were made in the previous lecture:

◮ In a writeback cache in a uniprocessor (one-core, no SMT) system, possible states of a cache block are invalid, clean, or dirty.

◮ For that writeback cache in a uniprocessor system, block states usually change as a result of loads and stores processed within the single core. (Messy detail: Actions such as DMA by I/O devices can also change states of blocks.)

◮ In a multicore system, one of many possible choices for sets of possible cache block states is MSI: modified, shared, or invalid. State changes in a cache in one core may be triggered by actions in another core.

SLIDE 5

Example quad-core system, with private L1 caches and a shared L2 cache

[Figure: cores 0, 1, 2, and 3, each with private L1 instruction (L1 I) and L1 data (L1 D) caches, share a bus that connects to a unified L2 cache and a DRAM controller leading to off-chip DRAM.]

SLIDE 6

Example of propagating writes between cores

Scenario from previous lecture: Before Thread A writes to X, there is a copy of X in all 5 caches, with a status of shared.

Then, in time order: Thread A (core 0) writes X=42; later, Thread B (core 1) and Thread C (core 2) each read X.

In an MSI protocol, what effects do the write and reads have on states of cache blocks that contain X?
SLIDE 7

Example of write serialization

Scenario: Before Thread A writes to Y, there is a copy of Y in caches of cores 0 and 1, with a status of shared.

Then, in time order: Thread A (core 0) writes Y=66; Thread B (core 1) writes Y=77; one of the threads then reads Y.

In an MSI protocol, what effects do the writes and read have on states of cache blocks that contain Y?
SLIDE 8

Complete details of an MSI Protocol

See Figures 5.5, 5.6, and 5.7 in the textbook, and related discussion. Note that this is probably the simplest possible cache coherency protocol, and its sets of state transitions and related actions are substantially more complicated than what is required for a uniprocessor cache system!

SLIDE 9

Snooping

The example scenarios presented on slides 6 and 7 require all caches to snoop on a common bus. Consider the L1 D-cache in core 0 of our quad-core example. In addition to doing all the things a uniprocessor cache would need to do, it must be capable of

◮ responding to write misses broadcast from other cores—this may require writeback from the core 0 cache;

◮ responding to invalidate commands broadcast from other cores—this may require invalidation within the core 0 cache.

The requirements are symmetric for the L1 D-caches in cores 1, 2 and 3.

SLIDE 10

Snooping works reasonably well in a small SMP system—for example, two to four cores on a single chip. Scaling is a problem, as bus traffic increases at least proportionally to the number of cores.

There is potential for a lot of waste. For example, in an eight-core system, suppose threads in two cores, say cores 0 and 1, are frequently accessing shared memory. Then cores 2–7 are exposed to and slowed down by a stream of traffic that is really just a conversation between cores 0 and 1.

A more scalable alternative to snooping is use of a directory-based protocol. Time prevents us from looking at directory-based protocols in ENCM 501.

SLIDE 11

Race conditions

A race condition is a situation in which the evolution of the state of some system depends critically on the exact order of nearly-simultaneous events in two or more subsystems. Race conditions can exist as a pure hardware problem in the design of sequential logic circuits. Race conditions can also arise at various levels of parallelism in software systems, including instruction-level parallelism and thread-level parallelism.

SLIDE 12

ILP race condition example

Consider a Tomasulo-based microarchitecture for a uniprocessor system. What is the potential race condition in the following code sequence?

S.D F0, 0(R8)
S.D F2, 24(R29)

How could hardware design eliminate the race condition? Note: There are similar race condition problems related to out-of-order execution of stores and loads.
SLIDE 13

TLP race condition example

Two threads are running in the same loop, both reading and writing a global variable called counter . . .

Thread A:
while ( condition ) {
  do some work
  counter++;
}

Thread B:
while ( condition ) {
  do some work
  counter++;
}

Before considering the race condition in a multicore system, let's address this question: What is the potential failure if this program is running in a uniprocessor system?

SLIDE 14

TLP race condition example, continued

The same defective program, now with the threads running simultaneously in two cores . . .

Thread A:
while ( condition ) {
  do some work
  counter++;
}

Thread B:
while ( condition ) {
  do some work
  counter++;
}

Why does a cache coherency protocol such as MSI not prevent failure? Does it help if the ISA has memory-register instructions, such as addq $1, (%rax) in x86-64? Doesn't that do read-add-write in a single instruction?

SLIDE 15

Making the two-thread program safe

Thread A:
while ( condition ) {
  do some work
  acquire lock
  counter++;
  release lock
}

Thread B:
while ( condition ) {
  do some work
  acquire lock
  counter++;
  release lock
}

Setting up some kind of lock—often called a mutual exclusion, or mutex—prevents the failures seen in the past two slides. Let's make some notes about how this would work in a uniprocessor system.

SLIDE 16

Thread A:
while ( condition ) {
  do some work
  acquire lock
  counter++;
  release lock
}

Thread B:
while ( condition ) {
  do some work
  acquire lock
  counter++;
  release lock
}

What about the multicore case? The lock will prevent Thread B from reading counter while Thread A is doing a load-add-store update with counter. Is that enough? Consider the MSI protocol again. What must happen either as a result of Thread A releasing the lock, or as a result of Thread B releasing the lock?

SLIDE 17

ISA and microarchitecture support for concurrency

Instruction sets have to provide special atomic instructions to allow software implementation of synchronization facilities such as mutexes and semaphores. An atomic instruction (or a sequence of instructions that is intended to provide the same kind of behaviour, such as MIPS LL/SC) typically works like this:

◮ memory read is attempted at some location;
◮ some kind of write data is generated;
◮ memory write to the same location is attempted.

SLIDE 18

The key aspects of atomic instructions are:

◮ the whole operation succeeds or the whole operation fails, in a clean way that can be checked after the attempt was made;

◮ if two or more threads attempt the operation, such that the attempts overlap in time, one thread will succeed, and all the other threads will fail.

SLIDE 19

Next (and last!) lecture, Thu Apr 10

◮ atomic instructions
◮ overview of parallel programming with Pthreads

Related reading in Hennessy & Patterson: Section 5.5