SLIDE 1 Slides for Lecture 17
ENCM 501: Principles of Computer Architecture Winter 2014 Term Steve Norman, PhD, PEng
Electrical & Computer Engineering Schulich School of Engineering University of Calgary
13 March, 2014
SLIDE 2 ENCM 501 W14 Slides for Lecture 17
Previous Lecture
◮ pipeline hazards
◮ solutions to pipeline hazards
SLIDE 3
Today’s Lecture
◮ review of floating-point numbers and operations
◮ effects of multiple-cycle EX-stage computation
◮ in-order versus out-of-order execution
◮ WAW and WAR data hazards
Related reading in Hennessy & Patterson: Sections C.5, 3.1
SLIDE 4
A quick, incomplete review of floating-point numbers
A lot of textbook examples use floating-point instructions, so a brief review might be a good idea.
Essentially, floating-point is a base-two version of scientific notation.
Here’s an example of scientific notation: The mass of the Earth is about 5973600000000000000000000 kg, more conveniently written as 5.9736 × 10^24 kg.
SLIDE 5
Any nonzero real number can be written as
    sign × 2^exponent × (1 + fraction),
where the exponent is an integer and 0 ≤ fraction < 1.0.
If we have a finite number of exponent bits, that will limit the magnitude range of the numbers we can represent.
With a finite number of fraction bits, most real numbers can only be approximated: floating-point representation involves rounding error.
For a computer to work with floating-point numbers, we need a way to organize sign, exponent, and fraction bits into fixed-size chunks . . .
SLIDE 6
Bit fields in 64-bit floating-point
bit 63: sign bit
bits 62 to 52: 11 exponent bits
bits 51 to 0: 52 fraction bits
Sign bit: 0 for positive, 1 for negative.
Exponent: Uses a bias of 011 1111 1111 (base two) = 1023 (base ten). Example bit patterns:
◮ 011 1111 1111 means the exponent is zero;
◮ 011 1111 1110 means the exponent is −1;
◮ 100 0000 0000 means the exponent is +1.
SLIDE 7
bit 63: sign bit
bits 62 to 52: 11 exponent bits
bits 51 to 0: 52 fraction bits
Fraction bits: Only bits from the right side of the “binary point” are recorded. It is assumed that there is a single 1 bit to the left of the binary point, so that bit need not be recorded.
Example: How is 1.375 (base ten) represented?
    1.375 = 1 + 0/2^1 + 1/2^2 + 1/2^3 = 1.011 (base two)
The sign, exponent, and fraction are:
    0   011 1111 1111   011000 · · · 000
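The field layout can be checked in software. The following is a minimal Python sketch, assuming that Python's float is an IEEE 754 64-bit value (true on common platforms); the helper names decode_double and rebuild are made up for this illustration.

```python
import struct

def decode_double(x):
    """Split a 64-bit float into (sign bit, exponent field, fraction field)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                         # bit 63
    exponent_field = (bits >> 52) & 0x7FF     # bits 62 to 52, biased by 1023
    fraction_field = bits & ((1 << 52) - 1)   # bits 51 to 0
    return sign, exponent_field, fraction_field

def rebuild(sign, exponent_field, fraction_field):
    """Apply sign x 2^exponent x (1 + fraction); valid for normal numbers only."""
    exponent = exponent_field - 1023          # remove the bias
    fraction = fraction_field / 2**52         # 0 <= fraction < 1.0
    return (-1) ** sign * 2.0 ** exponent * (1 + fraction)

s, e, f = decode_double(1.375)
# Shows the sign bit, the 11 exponent bits, and the 52 fraction bits.
print(s, format(e, "011b"), format(f, "052b"))
print(rebuild(s, e, f))
```

For 1.375 the exponent field is 011 1111 1111 and the fraction field starts with 011, matching the example above.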
SLIDE 8
In IEEE 754 floating-point formats there are some special bit patterns:
◮ zero
◮ +∞
◮ −∞
◮ NaN (not a number).
For example in IEEE 754, the result of 1.0/0.0 is +∞, but the result of 0.0/0.0 is NaN.
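The special patterns can be inspected with a short Python sketch. Two things here are not from the slides: the helper name fields is hypothetical, and the encodings shown (exponent field all zeros for zero, all ones for ∞ and NaN) are stated from the IEEE 754 standard. Note that Python itself raises ZeroDivisionError for 1.0/0.0 instead of returning +∞, so the sketch takes its special values from the math module.

```python
import math
import struct

def fields(x):
    """Return (sign bit, 11-bit exponent field, 52-bit fraction field)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

specials = [("zero", 0.0), ("+inf", math.inf), ("-inf", -math.inf), ("NaN", math.nan)]
for name, value in specials:
    sign, exponent_field, fraction_field = fields(value)
    print(f"{name:5} sign={sign} exponent={exponent_field:011b} "
          f"fraction is zero: {fraction_field == 0}")

# An operation with no meaningful result produces NaN.
print(math.isnan(math.inf - math.inf))
```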
SLIDE 9
FP multiplication
If A and B are nonzero, then A × B is
    sign_A × sign_B × 2^(exponent_A + exponent_B) × (1 + fraction_A) × (1 + fraction_B)
To do an FP multiplication, a logic circuit first has to check that operands are not zero or other special bit patterns.
The step that costs the most time (and energy) is the 53-bit-by-53-bit integer multiplication for (1 + fraction_A) × (1 + fraction_B).
At the end, there must be rounding, exponent adjustment, and a check for underflow or overflow.
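The steps of an FP multiplication can be sketched in software. This is an illustration under stated assumptions, not a model of any particular circuit: operands and result are assumed to be normal numbers, rounding is round-to-nearest-even (the IEEE 754 default), and the names to_fields, from_fields, and fp_mul are invented for this sketch.

```python
import struct

BIAS = 1023

def to_fields(x):
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

def from_fields(sign, exponent_field, fraction_field):
    bits = (sign << 63) | (exponent_field << 52) | fraction_field
    (x,) = struct.unpack(">d", struct.pack(">Q", bits))
    return x

def fp_mul(a, b):
    sa, ea, fa = to_fields(a)
    sb, eb, fb = to_fields(b)
    # Step 1: check for special bit patterns (this sketch simply rejects them).
    assert 0 < ea < 0x7FF and 0 < eb < 0x7FF, "special operands not handled"
    sign = sa ^ sb
    exponent = (ea - BIAS) + (eb - BIAS)
    # Step 2: 53-bit by 53-bit integer multiply, with the hidden 1 restored.
    product = ((1 << 52) | fa) * ((1 << 52) | fb)
    # Step 3: exponent adjustment if the significand product is 2.0 or more.
    shift = 52
    if product >> 105:
        shift = 53
        exponent += 1
    # Step 4: round to nearest, ties to even.
    q = product >> shift
    remainder = product & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if remainder > half or (remainder == half and q & 1):
        q += 1
        if q >> 53:            # rounding carried out of the 53-bit significand
            q >>= 1
            exponent += 1
    # Step 5: check for underflow or overflow (again, simply rejected here).
    exponent_field = exponent + BIAS
    assert 0 < exponent_field < 0x7FF, "overflow/underflow not handled"
    return from_fields(sign, exponent_field, q & ((1 << 52) - 1))

print(fp_mul(1.375, 2.5), 1.375 * 2.5)
print(fp_mul(0.1, 0.3) == 0.1 * 0.3)
```

The many sequential steps, and especially the wide integer multiply, suggest why this work is hard to fit into a single clock cycle.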
SLIDE 10
Will FP multiplication fit into a single clock cycle?
No! An example in textbook Section C.5 suggests a latency of 7 clock cycles for FP multiplication.
The same example suggests a latency of 4 clock cycles for FP addition or subtraction, which is easier than FP multiplication, but much more complicated than integer addition or subtraction.
Those numbers are examples. Together, Moore’s Law and the ingenuity of circuit designers imply that the numbers vary from year to year and from one design to another.
SLIDE 11
Fitting FP operations into the 5-stage pipeline
Actually, this applies to fitting in integer multiplication and integer division as well. Let’s follow the textbook example:
◮ 7-cycle latency for FP or integer multiplication
◮ 4-cycle latency for FP addition
◮ 24-cycle latency for FP or integer division
(Note: Division is notoriously hard to do fast in digital logic!) We are going to have to give up on our nice, easy 1-cycle EX stage in the middle of the 5-stage pipeline!
SLIDE 12
Let’s make some notes about this picture . . .
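[Figure: pipeline with four parallel functional units between ID and MEM. IF and ID feed the integer unit (EX), the FP/integer multiplier (M1 to M7), the FP adder (A1 to A4), and the FP/integer divider (DIV); all four feed MEM and then WB.]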
Image is Figure C.35 from Hennessy J. L. and Patterson D. A., Computer Architecture: A Quantitative Approach, 5th ed., © 2012, Elsevier, Inc.
SLIDE 13
Quick overview of MIPS FP instructions
Many versions of the MIPS ISA have 16 64-bit floating-point registers: F0, F2, F4, . . . , F30—note use of even numbers
(Newer ISA versions have 32 64-bit FPRs.) F0 is not special. Unlike the GPR R0, F0 is not hard-wired to have a value of 0.0.
SLIDE 14
Loads, stores and arithmetic are easy to understand. Here is a very short example:
    L.D   F2, 0(R4)     # load
    L.D   F4, 0(R5)     # load
    MUL.D F6, F2, F4    # multiply
    S.D   F6, 0(R7)     # store
Note the use of GPRs for addresses. Remember, memory addresses are integers!
The suffix .D is for double precision. Use .S instead to work with 32-bit single precision FP numbers.
To understand examples in ENCM 501, we do not need to know the details of instructions for FP comparison, branching on FP comparison results, or converting between integer and FP formats.
SLIDE 15
In-order versus out-of-order
In-order execution of instructions implies that instructions are processed in the same order that they would be in a hypothetical computer that always completes one instruction before starting the next.
The simple 5-stage pipeline is in-order, even though there are usually 5 instructions in flight within the pipeline. (What about instructions that get into the 5-stage pipeline but get cancelled due to a branch?)
Out-of-order execution implies that the start and completion of instructions are often, but not always, in-order.
SLIDE 16
5-stage pipeline with variable-length EX stage
This pipeline always starts instructions in-order. This is known as in-order issue of instructions. However, there is a design choice to be made: Should we allow instructions to complete out-of-order? What are the advantages and disadvantages of forcing instruction completion to be in-order? What are some challenges created by allowing out-of-order completion?
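To see how out-of-order completion can arise from in-order issue, here is a small Python sketch. It makes several assumptions: EX-stage lengths are taken from the textbook example (1 cycle for the integer unit, 4 for FP add, 7 for multiply), one instruction enters IF per cycle, the instructions are independent, and there are no stalls. The three-instruction program and all names are hypothetical.

```python
# Assumed number of EX-stage cycles per functional unit.
EX_CYCLES = {"integer": 1, "fp_add": 4, "multiply": 7}

def writeback_cycles(program):
    """program is a list of (name, unit); returns a list of (name, WB cycle)."""
    result = []
    for index, (name, unit) in enumerate(program):
        if_cycle = index + 1                            # in-order issue
        # IF, then ID (1 cycle), the EX stages, MEM (1 cycle), then WB.
        wb_cycle = if_cycle + 1 + EX_CYCLES[unit] + 2
        result.append((name, wb_cycle))
    return result

program = [("MUL.D", "multiply"), ("ADD.D", "fp_add"), ("L.D", "integer")]
timing = writeback_cycles(program)
completion_order = [name for name, _ in sorted(timing, key=lambda t: t[1])]

print(timing)            # WB cycle for each instruction, in issue order
print(completion_order)  # instructions sorted by WB cycle
```

Under these assumptions the load, issued last, reaches WB first, and the multiply, issued first, reaches WB last.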
SLIDE 17
WAW (write after write) data hazards
Here is a simple, but unlikely-to-occur, WAW hazard:
    MUL.D F2, F2, F4
    L.D   F2, 0(R4)
(What is the point of the multiply if its result is going to be written over by the load?)
For program correctness the load must write to F2 after the multiply writes to F2.
Practical WAW hazards are more likely to appear when processors do out-of-order issue.
SLIDE 18
A more practical WAW hazard, and a WAR hazard
WAR hazards can only occur with out-of-order issue. The WAW hazard in this example would be present only
What are the potential hazards?
    MUL.D F2, F4, F6
    S.D   F2, 0(R8)
    some hazard-free instructions
    L.D   F2, 0(R9)
    ADD.D F8, F8, F2
Why are the hazards impossible in the in-order pipeline of Figure C.35?
What bad decision did the compiler make in generating the above code?
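One way to explore the first question is to list the register dependences mechanically. The Python sketch below is an illustration only: the function name find_dependences and the tuple encoding of instructions are made up, the hazard-free instructions are omitted, and possible dependences through memory are ignored. A dependence listed here becomes an actual hazard only if the hardware could reorder the two accesses.

```python
def find_dependences(program):
    """program: list of (text, destination register or None, source registers)."""
    last_writer = {}           # register -> index of its most recent writer
    readers_since_write = {}   # register -> indices that read it since then
    found = []
    for i, (_text, dest, sources) in enumerate(program):
        for reg in sources:
            if reg in last_writer:                       # read after write
                found.append(("RAW", last_writer[reg], i, reg))
        if dest is not None:
            if dest in last_writer:                      # write after write
                found.append(("WAW", last_writer[dest], i, dest))
            for j in readers_since_write.get(dest, []):  # write after read
                found.append(("WAR", j, i, dest))
        for reg in sources:
            readers_since_write.setdefault(reg, []).append(i)
        if dest is not None:
            last_writer[dest] = i
            readers_since_write[dest] = []
    return found

program = [
    ("MUL.D F2, F4, F6", "F2", ["F4", "F6"]),
    ("S.D F2, 0(R8)", None, ["F2", "R8"]),
    ("L.D F2, 0(R9)", "F2", ["R9"]),
    ("ADD.D F8, F8, F2", "F8", ["F8", "F2"]),
]

for kind, earlier, later, reg in find_dependences(program):
    print(f"{kind} on {reg}: {program[earlier][0]}  ->  {program[later][0]}")
```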
SLIDE 19
More problems related to long latency
The divide unit in Figure C.35 has a 24-cycle latency and is not pipelined.
What kind of hazard is created by the lack of pipelining in the divide unit?
What is the effect of the 4-cycle FP add latency and the 7-cycle multiply latency on the frequency and severity of RAW data hazards?
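As a starting point for the first question, the following Python sketch compares when back-to-back independent operations can enter a pipelined unit versus a non-pipelined one. It is a simplification: the function name start_cycles is hypothetical, the first operation is assumed to start in cycle 1, and a non-pipelined unit is assumed to be busy for its whole latency.

```python
def start_cycles(count, latency, pipelined):
    """Cycle in which each of `count` back-to-back operations enters the unit."""
    # A pipelined unit accepts a new operation every cycle; a non-pipelined
    # unit cannot accept one until the previous operation has finished.
    interval = 1 if pipelined else latency
    return [1 + k * interval for k in range(count)]

print(start_cycles(3, latency=7, pipelined=True))    # multiplier
print(start_cycles(3, latency=24, pipelined=False))  # divider
```

With these assumptions, a second divide has to wait for the unit itself, even though it needs no data from the first divide.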
SLIDE 20
Upcoming Topics
◮ Processing instructions with parallel pipelines.
Related reading in Hennessy & Patterson: Sections 3.1, 3.4, 3.5