
Slides for Lecture 20

ENCM 501: Principles of Computer Architecture Winter 2014 Term Steve Norman, PhD, PEng

Electrical & Computer Engineering Schulich School of Engineering University of Calgary

25 March, 2014

ENCM 501 W14 Slides for Lecture 20

slide 2/16

Previous Lecture

◮ more examples of Tomasulo’s algorithm
◮ reorder buffers and speculation
◮ multiple issue of instructions
◮ other ILP topics, if time permits

Related reading in Hennessy & Patterson: Sections 3.5–3.8

ENCM 501 W14 Slides for Lecture 20

slide 3/16

Today’s Lecture

◮ WHAT?

Related reading in Hennessy & Patterson: Sections 3.5–3.6 AND WHAT ELSE?

ENCM 501 W14 Slides for Lecture 20

slide 4/16

Resolution of practical RAW, WAR and WAW hazards (repeat from lec. 19, with edits for clarity)

RAW: S.D depends on the MUL.D result, and ADD.D depends on the L.D result.

WAR: S.D must use the MUL.D result, not the L.D result.

WAW: ADD.D must use the L.D result, not the MUL.D result, and when all these instructions are done, F2 must contain the L.D result.

MUL.D F2, F4, F6
S.D F2, 0(R8)
SUB.D F0, F12, F14
L.D F2, 0(R9)
ADD.D F8, F8, F2

Let’s trace how Tomasulo’s algorithm handles this sequence.
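As an aside not on the slides, the dependences above can be checked mechanically. The sketch below (a hypothetical Python helper, not part of the course material) walks the five instructions in program order and reports each RAW, WAR and WAW hazard against the most recent conflicting access:

```python
# Hypothetical helper (an assumption, not from the slides): classify
# register hazards in the slide's five-instruction sequence.
# Each entry is (mnemonic, registers written, registers read).

seq = [
    ("MUL.D", ["F2"], ["F4", "F6"]),
    ("S.D",   [],     ["F2", "R8"]),
    ("SUB.D", ["F0"], ["F12", "F14"]),
    ("L.D",   ["F2"], ["R9"]),
    ("ADD.D", ["F8"], ["F8", "F2"]),
]

def find_hazards(seq):
    last_writer = {}   # reg -> name of the most recent writer
    readers = {}       # reg -> names that have read it since the last write
    found = []
    for name, writes, reads in seq:
        for r in reads:                      # RAW: read after earlier write
            if r in last_writer:
                found.append(("RAW", last_writer[r], name, r))
            readers.setdefault(r, []).append(name)
        for w in writes:
            for rd in readers.get(w, []):    # WAR: write after earlier read
                if rd != name:
                    found.append(("WAR", rd, name, w))
            if w in last_writer:             # WAW: write after earlier write
                found.append(("WAW", last_writer[w], name, w))
            last_writer[w] = name
            readers[w] = []
    return found

print(find_hazards(seq))
```

Note that the helper reports ADD.D’s RAW dependence against L.D, the most recent writer of F2, which is exactly why ADD.D must not use the MUL.D result.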

ENCM 501 W14 Slides for Lecture 20

slide 5/16

Loop example

This is from page 179 of the textbook:

Loop: L.D F0, 0(R1)
      MUL.D F4, F0, F2
      S.D F4, 0(R1)
      DADDIU R1, R1, -8
      BNE R1, R2, Loop

Let’s make some notes about the DADDIU and BNE instructions.

Let’s assume that the loop starts with R1 = 0x600040 and R2 = 0x600000.

Let’s trace how Tomasulo’s algorithm might handle the first two passes through the loop.
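Before tracing the pipeline, it may help to confirm the loop’s architectural behaviour. A minimal Python sketch (an assumption, not from the slides) counts the passes through the body with the given starting values:

```python
# Sketch of the loop's architectural behaviour with the assumed start
# values R1 = 0x600040, R2 = 0x600000 (doubles at 8-byte spacing).
R1, R2 = 0x600040, 0x600000
trips = 0
while True:
    # L.D / MUL.D / S.D all touch memory at 0(R1); omitted here
    R1 -= 8           # DADDIU R1, R1, -8
    trips += 1
    if R1 == R2:      # BNE falls through once R1 == R2
        break
print(trips)          # 8 passes through the loop body
```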

ENCM 501 W14 Slides for Lecture 20

slide 6/16

History of Tomasulo’s algorithm

Tomasulo developed the algorithm in the 1960s. Note that microprocessors did not exist until the early 1970s! Also, in the ’60s and ’70s, memory was fast enough relative to processors that caches were unnecessary.

The algorithm was deployed in the IBM 360/91, a computer designed to crunch FP numbers as fast as FP numbers could possibly be crunched in the 1960s. (A web search for ibm 360/91 yields many fantastic results.)

Processor designs started to use Tomasulo’s algorithm again in the 1990s, when it became clear that it was important to find ways to work around unpredictable delays caused by cache misses.


ENCM 501 W14 Slides for Lecture 20

slide 7/16

Costs of the CDB (common data bus)

In a typical clock cycle, some reservation station will broadcast a result on the CDB, and other reservation stations and the register file will look at the result to see if it’s useful. Transmitting the result and receiving the result both have energy costs.

A complex instruction unit, reservation stations, and related hardware require lots of transistors. If Moore’s law had not applied for so many decades, we would not see Tomasulo’s algorithm used as a basis for design of modestly priced processor chips.

It’s possible, in some cycles, that two or more reservation stations will simultaneously try to broadcast their results. Why is this not a fatal defect in Tomasulo’s algorithm?

ENCM 501 W14 Slides for Lecture 20

slide 8/16

Data hazards in the memory system

Why is this a potential RAW hazard?

S.D F0, 48(R8)
L.D F2, 8(R9)

And why is this a potential WAR hazard?

L.D F4, 72(R10)
S.D F6, 0(R11)

Finally, why is this a potential WAW hazard?

S.D F8, (R12)
S.D F10, (R13)

All of these hazards are important problems in processors that may complete instructions out of order. Due to lack of time, we won’t look at solutions in detail, but be aware that processor designers must deal with these hazards correctly.
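These are hazards only if the two effective addresses can refer to the same memory location. A tiny Python sketch (with assumed register values, not from the slides) shows that different base registers and offsets can still alias:

```python
# Assumed register values, chosen to make the first S.D / L.D pair alias.
def ea(regs, base, offset):
    """Effective address: base register value plus displacement."""
    return regs[base] + offset

regs = {"R8": 0x1000, "R9": 0x1028}     # assumption for illustration
store_addr = ea(regs, "R8", 48)         # S.D F0, 48(R8)
load_addr  = ea(regs, "R9", 8)          # L.D F2, 8(R9)
print(hex(store_addr), hex(load_addr), store_addr == load_addr)
```

Because the base register values are not known at issue time, hardware must either compare computed addresses or conservatively keep the memory accesses in order.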

ENCM 501 W14 Slides for Lecture 20

slide 9/16

Tomasulo’s algorithm and branch prediction

Consider this code fragment:

BEQ R8, R0, L99
S.D F0, (R10)
ADD.D F2, F2, F4

Suppose the branch is incorrectly predicted as not taken, and S.D and ADD.D get issued while BEQ waits for some earlier instruction to provide a value for R8.

If Tomasulo’s algorithm does nothing beyond what has been presented so far in lectures, what will prevent S.D from making an incorrect update to memory, and what will prevent ADD.D from making an incorrect update to F2?

ENCM 501 W14 Slides for Lecture 20

slide 10/16

Tomasulo’s algorithm and exceptions

MUL.D F2, F4, F6
S.D F2, 0(R8)
SUB.D F0, F12, F14
L.D F2, 0(R9)
ADD.D F8, F8, F2

Suppose MUL.D gets delayed because it has to wait until a result for F6 is ready. That will delay the execution of S.D. Meanwhile, Tomasulo’s algorithm may allow completion of SUB.D, L.D, and ADD.D.

What kind of problem is created if S.D eventually results in a page fault exception?

ENCM 501 W14 Slides for Lecture 20

slide 11/16

Out-of-order execution, in-order completion

The version of Tomasulo’s algorithm presented in textbook Section 3.5 has

◮ scalar issue (that is, at most one instruction issued per clock cycle),
◮ out-of-order execution, and
◮ out-of-order completion.

Section 3.6 modifies the algorithm to include a circuit called a reorder buffer (ROB), which will enforce in-order completion. Use of a reorder buffer solves the branch prediction and exception problems described on slides 9 and 10.

ENCM 501 W14 Slides for Lecture 20

slide 12/16

In a processor with a reorder buffer, issue of an instruction sends information related to the instruction both to a reservation station and to the reorder buffer. A reservation station for a store is responsible for address computation only—it is not allowed to write to memory. The reorder buffer is a FIFO queue—instructions enter in program order, and leave in program order. When an instruction gets to the head of the ROB, it can be committed as soon as its results are known. Examples:

◮ An ADD.D can be committed if a reservation station has provided the sum to the reorder buffer.

◮ An S.D can be committed if both the data to be stored and the address to be used are ready.
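The commit discipline described above can be sketched in a few lines of Python (entry layout and helper names are assumptions for illustration, not hardware detail):

```python
from collections import deque

# Hypothetical ROB sketch: entries enter in program order, results
# arrive out of order, and commits happen only at the head, so
# completion is forced back into program order.

rob = deque()                       # FIFO of [name, ready] entries

def issue(name):
    rob.append([name, False])

def write_result(name):             # a CDB broadcast reaches the ROB
    for entry in rob:
        if entry[0] == name:
            entry[1] = True

def commit_ready():
    committed = []
    while rob and rob[0][1]:        # only the head entry may commit
        committed.append(rob.popleft()[0])
    return committed

for instr in ["MUL.D", "S.D", "SUB.D", "L.D", "ADD.D"]:
    issue(instr)

order = []
for done in ["SUB.D", "L.D", "ADD.D", "MUL.D", "S.D"]:   # out-of-order finish
    write_result(done)
    order += commit_ready()

print(order)   # commits come out in program order
```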


ENCM 501 W14 Slides for Lecture 20

slide 13/16

Register file changes:

◮ The Qi field for each register is replaced by a Busy flag and a Reorder # field. Busy = 0 means the register is up-to-date; Busy = 1 means the register is waiting for a result from whatever entry in the reorder buffer matches the Reorder #.

◮ The register file does not watch the CDB for results. The ROB must watch the CDB for results for all of the instructions within the ROB that don’t yet have results.
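A minimal sketch of the modified register status, with hypothetical names (not from the textbook), showing why the Reorder # matters when the same register is renamed twice:

```python
# Hypothetical register-status sketch: each register has a Busy flag
# and, while busy, the ROB entry number that will produce its value.
regstat = {r: {"busy": False, "rob": None} for r in ("F0", "F2", "F8")}
regfile = {"F2": 0.0}

def rename_dest(reg, rob_entry):
    # At issue time the destination now waits on this ROB entry.
    regstat[reg] = {"busy": True, "rob": rob_entry}

def commit(reg, rob_entry, value):
    regfile[reg] = value
    # Clear Busy only if no younger instruction has renamed the
    # register since; otherwise F2 still waits on the newer entry.
    if regstat[reg]["busy"] and regstat[reg]["rob"] == rob_entry:
        regstat[reg] = {"busy": False, "rob": None}

rename_dest("F2", 3)        # MUL.D F2,... occupies ROB entry 3
rename_dest("F2", 6)        # L.D F2,...  occupies ROB entry 6
commit("F2", 3, 1.5)        # MUL.D commits; F2 still waits on entry 6
print(regstat["F2"])        # still busy, waiting on Reorder # 6
```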

ENCM 501 W14 Slides for Lecture 20

slide 14/16

The reservation stations and functional units work very much as before, except:

◮ the Qj and Qk fields hold ROB entry numbers instead of reservation station numbers;

◮ each reservation station has a Dest field to hold an ROB entry number;

◮ when a reservation station broadcasts its result on the CDB, it includes the Dest field value to help both the ROB and the other reservation stations.

ENCM 501 W14 Slides for Lecture 20

slide 15/16

The reorder buffer and safe speculation

The key point about the ROB is that it can collect a large number of results without knowing whether those results should really be written to registers or memory.

Consider a branch instruction that is mispredicted as taken.

◮ What happens to all the instructions that got into the ROB before the branch?

◮ What happens to the branch target instruction, the successor of the branch target instruction, etc., which got into the ROB after the branch?

The bad effect of the above scenario is a waste of time and energy. What are the important bad effects that were prevented?
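One standard outcome for the wrong-path entries, sketched with assumed entry names (an illustration, not the textbook’s mechanism): entries younger than the mispredicted branch never reach the head of the ROB and are simply discarded before they can commit.

```python
from collections import deque

# Hypothetical squash sketch: on a mispredicted branch, every ROB
# entry younger than the branch is discarded; older entries survive
# and will still commit in program order.
rob = deque(["DADDIU", "BEQ", "S.D", "ADD.D"])   # program order

def squash_after(branch):
    while rob[-1] != branch:
        rob.pop()            # drop wrong-path entries from the tail

squash_after("BEQ")
print(list(rob))             # only the pre-branch work remains
```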

ENCM 501 W14 Slides for Lecture 20

slide 16/16

More Topics for Today

As time permits . . .

◮ multiple issue of instructions
◮ limitations of ILP

ENCM 501 W14 Slides for Lecture 20

slide 16/16

Upcoming Topics

◮ Processes and threads.
◮ Multi-core processor circuits and their caches.
◮ Multi-core support for processes and threads.

Related reading in Hennessy & Patterson: Sections 5.1–5.2