 
Slides for Lecture 17
ENCM 501: Principles of Computer Architecture
Winter 2014 Term
Steve Norman, PhD, PEng
Electrical & Computer Engineering, Schulich School of Engineering, University of Calgary
13 March, 2014
slide 2/20 ENCM 501 W14 Slides for Lecture 17
Previous Lecture
◮ pipeline hazards
◮ solutions to pipeline hazards
slide 3/20
Today's Lecture
◮ review of floating-point numbers and operations
◮ effects of multiple-cycle EX-stage computation
◮ in-order versus out-of-order execution
◮ WAW and WAR data hazards
Related reading in Hennessy & Patterson: Sections C.5, 3.1
slide 4/20
A quick, incomplete review of floating-point numbers
A lot of textbook examples use floating-point instructions, so a brief review might be a good idea.
Essentially, floating-point is a base two version of scientific notation.
Here's an example of scientific notation: The mass of the earth is about 5973600000000000000000000 kg, more conveniently written as $5.9736 \times 10^{24}$ kg.
slide 5/20
Any nonzero real number can be written as
$\text{sign} \times 2^{\text{exponent}} \times (1 + \text{fraction})$,
where sign is $+1$ or $-1$, the exponent is an integer, and $0 \le \text{fraction} < 1.0$.
If we have a finite number of exponent bits, that will limit the magnitude range of the numbers we can represent.
With a finite number of fraction bits, most real numbers can only be approximated: floating-point representation involves rounding error.
For a computer to work with floating-point numbers, we need a way to organize sign, exponent, and fraction bits into fixed-size chunks ...
slide 6/20
Bit fields in 64-bit floating-point
    bit 63:      sign bit
    bits 62-52:  11 exponent bits
    bits 51-0:   52 fraction bits
Sign bit: 0 for positive, 1 for negative.
Exponent: Uses a bias of $011\,1111\,1111_{\text{two}} = 1023_{\text{ten}}$. Example bit patterns:
◮ 011 1111 1111 means the exponent is zero;
◮ 011 1111 1110 means the exponent is −1;
◮ 100 0000 0000 means the exponent is +1.
slide 7/20
Bit layout: sign bit (bit 63), 11 exponent bits (bits 62-52), 52 fraction bits (bits 51-0).
Fraction bits: Only bits from the right side of the "binary point" are recorded. It is assumed that there is a single 1 bit to the left of the binary point, so that bit need not be recorded.
Example: How is $1.375_{\text{ten}}$ represented?
$1.375 = 1 + \frac{0}{2^1} + \frac{1}{2^2} + \frac{1}{2^3} = 1.011_{\text{two}}$
Sign, exponent, and fraction are:
    0   011 1111 1111   011000...000
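The 1.375 example can be checked on any machine whose floats are IEEE 754 doubles. The following is a minimal sketch, assuming Python's `float` is a 64-bit double (true on mainstream platforms); the helper name `fields` is made up for this illustration.

```python
import struct

def fields(x: float):
    """Split a 64-bit double into its sign, exponent, and fraction bit fields."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # bit 63
    exponent = (bits >> 52) & 0x7FF      # bits 62-52, still biased by 1023
    fraction = bits & ((1 << 52) - 1)    # bits 51-0
    return sign, exponent, fraction

sign, exponent, fraction = fields(1.375)
print(sign, format(exponent, "011b"), format(fraction, "052b"))
print("unbiased exponent:", exponent - 1023)
```

The fraction field printed for 1.375 starts with 011 and is followed by zeros, and the unbiased exponent is 0, matching the slide.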
slide 8/20
In IEEE 754 floating-point formats there are some special bit patterns:
◮ zero
◮ +∞
◮ −∞
◮ NaN (not a number)
For example, in IEEE 754 the result of 1.0/0.0 is +∞, but the result of 0.0/0.0 is NaN.
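The special patterns can be inspected with a short sketch, again assuming Python floats are IEEE 754 doubles. One caveat: Python itself raises `ZeroDivisionError` for `1.0/0.0` instead of returning +∞, so the sketch gets its infinity from `math.inf` and produces a NaN with ∞ − ∞, another invalid operation. The helper name `exponent_and_fraction` is made up for this illustration.

```python
import math
import struct

def exponent_and_fraction(x: float):
    """Return the biased exponent field and the fraction field of a double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

pos_inf = math.inf
nan = math.inf - math.inf   # invalid operation, so the result is NaN

# Zero: exponent field and fraction field are both all zeros.
print(exponent_and_fraction(0.0))
# Infinity: exponent field all ones, fraction field zero.
print(exponent_and_fraction(pos_inf))
# NaN: exponent field all ones, fraction field nonzero.
print(exponent_and_fraction(nan))
# A NaN compares unequal to everything, including itself.
print(nan == nan)
```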
slide 9/20
FP multiplication
If A and B are nonzero, then A × B is
$\text{sign}_A \times \text{sign}_B \times 2^{(\text{exponent}_A + \text{exponent}_B)} \times (1 + \text{fraction}_A) \times (1 + \text{fraction}_B)$
To do an FP multiplication, a logic circuit first has to check that operands are not zero or other special bit patterns.
The step that costs the most time (and energy) is the 53-bit-by-53-bit integer multiplication for $(1 + \text{fraction}_A) \times (1 + \text{fraction}_B)$.
At the end, there must be rounding, exponent adjustment, and a check for underflow or overflow.
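The steps above can be sketched in software. This is only an illustration under stated assumptions, not a description of any real FP unit: it handles normal operands and normal results only (zeros, infinities, NaNs, subnormals, overflow and underflow are rejected by assertions), and it assumes round-to-nearest-even, the usual IEEE 754 default. The names `decode` and `fp_mul` are made up for this sketch.

```python
import struct

FRACTION_MASK = (1 << 52) - 1

def decode(x: float):
    """Return (sign bit, biased exponent, fraction bits) of a double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 63, (bits >> 52) & 0x7FF, bits & FRACTION_MASK

def fp_mul(a: float, b: float) -> float:
    sa, ea, fa = decode(a)
    sb, eb, fb = decode(b)
    # Step 1: check for special patterns (this sketch simply rejects them).
    assert 0 < ea < 0x7FF and 0 < eb < 0x7FF, "normal operands only"
    sign = sa ^ sb                       # multiply the signs
    exp = ea + eb - 1023                 # add exponents, remove one copy of the bias
    # Step 2: the costly 53-bit by 53-bit integer multiplication.
    sig = (fa | (1 << 52)) * (fb | (1 << 52))
    # Step 3: exponent adjustment. The product of two values in [1, 2) is in [1, 4).
    if sig >> 105:
        shift = 53
        exp += 1
    else:
        shift = 52
    # Step 4: round to nearest, ties to even.
    remainder = sig & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    sig >>= shift
    if remainder > half or (remainder == half and sig & 1):
        sig += 1
    if sig >> 53:                        # rounding carried out of the top bit
        sig >>= 1
        exp += 1
    # Step 5: overflow and underflow check (rejected in this sketch).
    assert 0 < exp < 0x7FF, "result must be a normal number"
    bits = sign << 63 | exp << 52 | (sig & FRACTION_MASK)
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

print(fp_mul(1.375, 2.5), 1.375 * 2.5)
```

For normal inputs and results the sketch should agree bit-for-bit with the hardware multiply that Python's `*` uses.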
slide 10/20
Will FP multiplication fit into a single clock cycle?
No! An example in textbook Section C.5 suggests a latency of 7 clock cycles for FP multiplication.
The same example suggests a latency of 4 clock cycles for FP addition or subtraction, which is easier than FP multiplication, but much more complicated than integer addition or subtraction.
Those numbers are examples. Together, Moore's Law and the ingenuity of circuit designers imply that the numbers vary from year to year and from one design to another.
slide 11/20
Fitting FP operations into the 5-stage pipeline
Actually, this applies to fitting in integer multiplication and integer division as well.
Let's follow the textbook example:
◮ 7-cycle latency for FP or integer multiplication
◮ 4-cycle latency for FP addition
◮ 24-cycle latency for FP or integer division
(Note: Division is notoriously hard to do fast in digital logic!)
We are going to have to give up on our nice, easy 1-cycle EX stage in the middle of the 5-stage pipeline!
slide 12/20
Let's make some notes about this picture ...
[Figure: IF and ID feed four parallel execution paths that rejoin at MEM and WB: the integer unit (EX), the FP/integer multiplier (M1 to M7), the FP adder (A1 to A4), and the FP/integer divider (DIV).]
Image is Figure C.35 from Hennessy J. L. and Patterson D. A., Computer Architecture: A Quantitative Approach, 5th ed., © 2012, Elsevier, Inc.
slide 13/20
Quick overview of MIPS FP instructions
Many versions of the MIPS ISA have 16 64-bit floating-point registers (FPRs): F0, F2, F4, ..., F30. Note the use of even numbers only for FPRs. (Newer ISA versions have 32 64-bit FPRs.)
F0 is not special. Unlike the GPR R0, F0 is not hard-wired to have a value of 0.0.
slide 14/20
Loads, stores and arithmetic are easy to understand. Here is a very short example:
    L.D    F2, 0(R4)     # load
    L.D    F4, 0(R5)     # load
    MUL.D  F6, F2, F4    # multiply
    S.D    F6, 0(R7)     # store
Note the use of GPRs for addresses. Remember, memory addresses are integers!
The suffix .D is for double precision. Use .S instead to work with 32-bit single precision FP numbers.
To understand examples in ENCM 501, we do not need to know the details of instructions for FP comparison, branching on FP comparison results, or converting between integer and FP formats.
slide 15/20
In-order versus out-of-order
In-order execution of instructions implies that instructions are processed in the same order that they would be in a hypothetical computer that always completes one instruction before starting the next.
The simple 5-stage pipeline is in-order, even though there are usually 5 instructions in flight within the pipeline. (What about instructions that get into the 5-stage pipeline but get cancelled due to a branch?)
Out-of-order execution implies that the start and completion of instructions are often, but not always, in-order.
slide 16/20
5-stage pipeline with variable-length EX stage
This pipeline always starts instructions in-order. This is known as in-order issue of instructions.
However, there is a design choice to be made: Should we allow instructions to complete out-of-order?
What are the advantages and disadvantages of forcing instruction completion to be in-order?
What are some challenges created by allowing out-of-order completion?
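To see what out-of-order completion looks like, here is a rough timing sketch. It assumes one instruction is fetched per cycle, that nothing ever stalls, and that every instruction passes through IF, ID, its EX cycles, MEM and WB; the EX cycle counts are the example latencies from the earlier slide. The function name `wb_cycles` is made up for this illustration, and the model ignores data hazards and any contention for the MEM or WB stage.

```python
# EX-stage cycle counts from the textbook example; all other instructions take 1.
EX_CYCLES = {"MUL.D": 7, "ADD.D": 4}

def wb_cycles(opcodes):
    """Return the clock cycle in which each instruction reaches WB."""
    result = []
    for k, op in enumerate(opcodes):
        n = EX_CYCLES.get(op, 1)
        fetch = k + 1                       # in-order issue: one fetch per cycle
        result.append(fetch + 1 + n + 2)    # ID, then n EX cycles, then MEM, WB
    return result

program = ["MUL.D", "ADD.D", "L.D", "S.D"]
for op, cycle in zip(program, wb_cycles(program)):
    print(f"{op:6s} reaches WB in cycle {cycle}")
```

Under these assumptions the WB cycles are 11, 9, 7 and 8: the instructions start in program order, but the multiply, which started first, finishes last.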
slide 17/20
WAW (write after write) data hazards
Here is a simple, but unlikely-to-occur WAW hazard:
    MUL.D  F2, F2, F4
    L.D    F2, 0(R4)
(What is the point of the multiply if its result is going to be written over by the load?)
For program correctness the load must write to F2 after the multiply writes to F2.
Practical WAW hazards are more likely to appear when processors do out-of-order issue.
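The multiply-then-load pair can be checked with a small sketch. It assumes the same simplified timing as a stall-free pipeline with one fetch per cycle, so an instruction at program position `p` (counting from 0) with `n` EX cycles writes its register in cycle `p + 4 + n`. The names `wb_cycle` and `waw_violations` are made up for this illustration; a real pipeline would detect the conflict in ID and stall or cancel a write instead.

```python
# EX-stage cycle counts from the textbook example; all other instructions take 1.
EX_CYCLES = {"MUL.D": 7, "ADD.D": 4}

def wb_cycle(position, opcode):
    """Cycle of the register write, assuming one fetch per cycle and no stalls."""
    return position + 4 + EX_CYCLES.get(opcode, 1)

def waw_violations(program):
    """program is a list of (opcode, destination register) in program order.

    Returns pairs (i, j) with i earlier than j, both writing the same register,
    where j would write no later than i if nothing stalled.
    """
    found = []
    for j, (op_j, dest_j) in enumerate(program):
        for i in range(j):
            op_i, dest_i = program[i]
            if dest_i == dest_j and wb_cycle(j, op_j) <= wb_cycle(i, op_i):
                found.append((i, j))
    return found

print(waw_violations([("MUL.D", "F2"), ("L.D", "F2")]))
```

Here the load would write F2 in cycle 6 and the multiply in cycle 11, so without intervention F2 would end up holding the stale product.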
slide 18/20
A more practical WAW hazard, and a WAR hazard
WAR hazards can only occur with out-of-order issue. The WAW hazard in this example would be present only with out-of-order issue.
What are the potential hazards?
    MUL.D  F2, F4, F6
    S.D    F2, 0(R8)
    # some hazard-free instructions
    L.D    F2, 0(R9)
    ADD.D  F8, F8, F2
Why are the hazards impossible in the in-order pipeline of Figure C.35?
What bad decision did the compiler make in generating the above code?
slide 19/20
More problems related to long latency
The divide unit in Figure C.35 has a 24-cycle latency and is not pipelined. What kind of hazard is created by the lack of pipelining in the divide unit?
What is the effect of the 4-cycle FP add latency and the 7-cycle multiply latency on the frequency and severity of RAW data hazards?
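One way to get a feel for these questions is a back-of-envelope stall count. The sketch below rests on assumptions that are not stated on the slides: results are forwarded from the end of the producer's last EX cycle to the start of the consumer's EX stage, the producer is an arithmetic instruction (not a load), and the divider is busy for all 24 cycles of a divide. The function names are made up for this illustration.

```python
def raw_stalls(producer_ex_cycles, independent_between=0):
    """Stall cycles for a consumer that needs the producer's result in EX.

    independent_between is the number of unrelated instructions placed
    between producer and consumer, each of which hides one stall cycle.
    """
    return max(0, producer_ex_cycles - 1 - independent_between)

def divider_stalls(issue_gap, busy_cycles=24):
    """Stall cycles for a divide that reaches the divider issue_gap cycles
    after the previous divide, when the divider is not pipelined."""
    return max(0, busy_cycles - issue_gap)

print("integer ALU producer:", raw_stalls(1))
print("FP add producer:     ", raw_stalls(4))
print("FP multiply producer:", raw_stalls(7))
print("back-to-back divides:", divider_stalls(1))
```

Under these assumptions a dependent instruction right behind an FP add waits 3 cycles and one right behind a multiply waits 6, while the 1-cycle integer EX stage needs none; a second divide issued right after a first waits 23 cycles for the unit to free up.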
slide 20/20
Upcoming Topics
◮ Processing instructions with parallel pipelines.
Related reading in Hennessy & Patterson: Sections 3.1, 3.4, 3.5