DLX Floating Point: Extend MIPS Pipeline to Floating Point Operations

• Functional units are more complex than the simple integer ALU
  – Require several clock cycles for an FP arithmetic operation
  – Add: 4 cycles; Multiply: 7 cycles; Divide: 25 cycles; Square root: 112 cycles
• Different functional units (FUs) for different operations
• Separate set of floating point registers (FP registers): F0 … F31
[Figure: variable-delay EX stage with separate ADD, MUL, and DIV units]

Floating Point Unit
• Separate functional unit (FU) for each of the FP arithmetic instructions: ADD.D, MUL.D, DIV.D
• Load and store (L.D, S.D) use the integer ALU (EX) for address calculation
• Integer MUL and DIV use the FP units
• FP instructions take differing amounts of time
  – e.g. + (4 cycles), * (7 cycles), / (25 cycles)
• A monolithic ALU (as in the integer unit) is inappropriate
  – Substantially larger than simple integer ALU operations

FP Pipeline Model
[Figure: IF → ID → one of EX, ADD (2 cycles), MUL (4 cycles), DIV (5 cycles) → MEM → WB]

Design Choices: Designing for the Worst Case
• Clock the pipeline at the speed of the slowest functional unit (FU)
  – Entire pipeline operates at 1/5 frequency
• Not an attractive solution!
  – MIPS rating falls by 80%

Design Choices: Optimizing the Common Case
• Assume integer EX instructions are the common case
  – Slow instructions are less frequent
• Slow the pipeline only when needed (FP ADD, MUL, DIV instructions)
  – Insert the appropriate number of stall cycles when a slow instruction is in the EX stage
• Example to show the potential benefit: ADD 4%, MUL 4%, DIV 1% of instructions
  – CPI = 1 + 4% × 1 + 4% × 3 + 1% × 4 = 1 + 0.04 + 0.12 + 0.04 = 1.20
  – MIPS rating drops by about 17% (1/1.20 ≈ 0.83)
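
The two design choices can be compared numerically. The following is only a sketch: the names `mix` and `cpi_with_stalls` are invented for illustration, and the stall counts assume the 2/4/5-cycle ADD/MUL/DIV latencies of the FP pipeline model with a base CPI of 1.

```python
# Instruction mix from the example: (fraction of instructions, extra stall cycles).
# Stall cycles = EX latency - 1, assuming the 2/4/5-cycle pipeline model.
mix = {
    "ADD": (0.04, 1),  # 2-cycle FP add -> 1 stall cycle
    "MUL": (0.04, 3),  # 4-cycle FP multiply -> 3 stall cycles
    "DIV": (0.01, 4),  # 5-cycle FP divide -> 4 stall cycles
}


def cpi_with_stalls(mix, base_cpi=1.0):
    """Average cycles per instruction when only slow instructions stall."""
    return base_cpi + sum(freq * stalls for freq, stalls in mix.values())


cpi = cpi_with_stalls(mix)

# Throughput loss relative to an ideal CPI-1 pipeline at full clock rate.
slowdown_worst_case = 1 - 1 / 5  # clock everything at the 5-cycle divider's speed
slowdown_stalls = 1 - 1 / cpi    # stall only when a slow instruction is in EX

print(f"CPI with stalls: {cpi:.2f}")
print(f"MIPS drop, worst-case clock: {slowdown_worst_case:.0%}")
print(f"MIPS drop, stall on demand: {slowdown_stalls:.1%}")
```

Under these assumptions, stalling on demand costs roughly one sixth of the throughput, against four fifths for the worst-case clock.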

FP Pipeline Model
[Figure: IF → ID → one of EX, ADD (2 cycles), MUL (4 cycles), DIV (5 cycles) → MEM → WB, annotated with a MUL instruction in flight]

Stall due to Multi-Cycle ALU Operation

Cycle   1    2    3    4    5    6    7    8    9
A       IF   ID   EX   MEM  WB
B            IF   ID   EX   MEM  WB
C                 IF   ID   *    *    *    *    MEM
D                      IF   ID   ID   ID   ID   *
E                           IF   IF   IF   IF   ID

A: ADD R1, R2, R3
B: ADD R4, R5, R6
C: MUL.D F2, F4, F6
D: MUL.D F8, F10, F12
E: MUL.D F14, F16, F18

Structural Hazards
Are there any structural hazards in the design? Are there instructions that are delayed because of insufficient hardware resources?
1. ID/EX pipeline register: contention for the datapath
• A sequence of integer EX instructions goes through the pipeline at 1 cycle/instruction
• A MUL instruction holds the ID/EX register for 4 cycles
    MUL F0, F2, F4
    ADD R6, R8, R10    (stalls 3 cycles)

Structural Hazards
1. ID/EX pipeline register: contention for the datapath
• A sequence of integer EX instructions goes through the pipeline at 1 cycle/instruction
• A MUL instruction holds the ID/EX register for 4 cycles
    MUL F0, F2, F4
    AND R6, R8, R10    (stalls 3 cycles)
• Enhance the ID/EX pipeline register to hold 2 (or more) instructions simultaneously

FP Pipeline Model
[Figure: IF → ID → one of EX, ADD (2 cycles), MUL (4 cycles), DIV (5 cycles) → MEM → WB, annotated with an AND and a MUL instruction in flight at the same time]

Structural Hazards
2. EX stage: contention for FUs
• EX unit: no contention
• Successive (or close-by) FP instructions contend for the same functional unit
    MUL F0, F2, F4
    MUL F6, F8, F10    (stalls 3 cycles)

FP Pipeline Model
[Figure: IF → ID → one of EX, ADD (2 cycles), MUL (4 cycles), DIV (5 cycles) → MEM → WB, annotated with two MUL instructions in flight]

Structural Hazards
2. EX stage: contention for FUs
• Replicate functional units
• How much replication?
  – 2 adders: no structural hazards for any sequence of ADDs
  – 2 multipliers:
      2 consecutive MULs: no structural hazards
      3 consecutive MULs: 2 stall cycles
• If the units are insufficient, stall the instruction in the ID stage until a resource is available
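
The replication figures on this slide can be reproduced with a small model. This is a sketch under stated assumptions: the units are non-pipelined, instructions issue in order at one per cycle, and the operations arrive back to back. The function name `stalls_back_to_back` is hypothetical.

```python
def stalls_back_to_back(n_ops, n_units, latency):
    """Stall cycles per operation for n_ops identical, consecutive operations.

    Assumes n_units non-pipelined functional units, each busy for `latency`
    cycles per operation, and in-order issue of one instruction per cycle.
    """
    free_at = [0] * n_units  # cycle at which each unit becomes free again
    stalls = []
    ready = 0                # earliest cycle the next operation could enter EX
    for _ in range(n_ops):
        # Pick the unit that frees up first.
        unit = min(range(n_units), key=lambda u: free_at[u])
        start = max(ready, free_at[unit])
        stalls.append(start - ready)
        free_at[unit] = start + latency
        ready = start + 1    # the following instruction is one cycle behind
    return stalls


# Two 4-cycle multipliers, three consecutive MULs: only the third one waits.
print(stalls_back_to_back(3, n_units=2, latency=4))  # [0, 0, 2]

# Two 2-cycle adders: no sequence of ADDs ever stalls.
print(stalls_back_to_back(6, n_units=2, latency=2))  # [0, 0, 0, 0, 0, 0]
```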

Structural Hazards
2. EX stage: contention for FUs
• Replicate functional units
• Pipelined functional units
  – Require pipeline registers between stages of the FU
  – Could be slower than a non-pipelined design
• Each FU may be non-pipelined, fully pipelined, or partially pipelined
  – Depends on cost, time, and frequency of operation
[Figure: 4-stage pipelined multiplier, M1 → M2 → M3 → M4]
• Initiation interval: time between successive operations
• A fully pipelined FU has an initiation interval of 1 cycle: no stalls needed
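
The effect of the initiation interval can be illustrated the same way. The sketch below assumes a single functional unit, in-order issue of one instruction per cycle, and a stream of identical operations; `stalls_for_interval` is a made-up name.

```python
def stalls_for_interval(n_ops, initiation_interval):
    """Stall cycles per operation on one FU with the given initiation interval.

    The unit accepts a new operation every `initiation_interval` cycles;
    operations are assumed to arrive on consecutive cycles.
    """
    stalls = []
    ready = 0      # earliest cycle the next operation could start
    next_free = 0  # earliest cycle the unit accepts another operation
    for _ in range(n_ops):
        start = max(ready, next_free)
        stalls.append(start - ready)
        next_free = start + initiation_interval
        ready = start + 1
    return stalls


for interval in (1, 2, 4):
    print(interval, stalls_for_interval(4, interval))
```

With an interval of 1 (fully pipelined) no operation waits; in this model each later operation waits `initiation_interval - 1` cycles, so a 4-cycle non-pipelined unit costs 3 stall cycles per back-to-back operation.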

Pipelined Functional Units
• 2-stage fully pipelined adder
• 4-stage fully pipelined multiplier
• 5-cycle non-pipelined divider
[Figure: IF → ID → one of EX, A1 → A2, M1 → M2 → M3 → M4, DIV (5 cycles, non-pipelined) → MEM → WB]

Hybrid Functional Units
• 1 fully pipelined adder with 2-cycle latency
• 2 partially pipelined 2-stage multipliers with 4-cycle latency
• 1 monolithic divider, 5 cycles
[Figure: IF → ID → one of EX, A1 → A2, MUL 1 (2 cycles) → MUL 1 (2 cycles), MUL 2 (2 cycles) → MUL 2 (2 cycles), DIV (5 cycles) → MEM → WB]

Structural Hazards
3. MEM stage: contention for access to data memory
• Only LOAD and STORE instructions use the data memory unit
• Both follow the same path through the pipeline and access MEM in cycle 4
• No contention
4. WB stage: contention for
  i. Write ports in the register file
  ii. Data paths through the MEM stage to WB

Structural Hazard: WB stage

Cycle   1    2    3    4    5    6
A       IF   ID   +    +    MEM  WB
B            IF   ID   EX   MEM  WB

A: ADD.D F0, F2, F4
B: L.D F18, 100(R4)

Contention for:
• Write ports in the register file in the WB stage (cycle 6)
• Data paths through the MEM stage (cycle 5)

Structural Hazard: WB stage

Cycle   1    2    3    4    5    6    7    8    9
A       IF   ID   /    /    /    /    /    MEM  WB
B            IF   ID   *    *    *    *    MEM  WB
C                 IF   ID   +    +    MEM  WB
D                      IF   ID   +    +    MEM  WB
E                           IF   ID   EX   MEM  WB

A: DIV.D F0, F2, F4
B: MUL.D F6, F8, F10
C: ADD.D F12, F14, F16
D: ADD.D F18, F20, F22
E: L.D F24, 100(R4)

Contention for:
• Write ports in the register file in the WB stage (cycle 9)
• Data paths through the MEM stage (cycle 8)
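
The collision in cycle 9 follows directly from the EX latencies. As a rough check, the sketch below computes the WB cycle of each instruction when none of them stalls. The names `EX_CYCLES`, `wb_cycle`, and `wb_conflicts` are illustrative, and the latencies are the 5/4/2-cycle DIV/MUL/ADD values of the pipeline model plus a 1-cycle integer EX for L.D.

```python
from collections import defaultdict

# EX-stage occupancy in cycles (assumed from the FP pipeline model).
EX_CYCLES = {"DIV.D": 5, "MUL.D": 4, "ADD.D": 2, "L.D": 1}


def wb_cycle(fetch_cycle, opcode):
    """WB cycle of an instruction fetched in `fetch_cycle`, with no stalls."""
    # IF in fetch_cycle, ID one cycle later, then EX_CYCLES of execute, MEM, WB.
    return fetch_cycle + 1 + EX_CYCLES[opcode] + 2


def wb_conflicts(program):
    """Map each WB cycle claimed by more than one instruction to their labels."""
    by_cycle = defaultdict(list)
    for fetch_cycle, (label, opcode) in enumerate(program, start=1):
        by_cycle[wb_cycle(fetch_cycle, opcode)].append(label)
    return {cycle: labels for cycle, labels in by_cycle.items() if len(labels) > 1}


program = [
    ("A", "DIV.D"),
    ("B", "MUL.D"),
    ("C", "ADD.D"),
    ("D", "ADD.D"),
    ("E", "L.D"),
]
print(wb_conflicts(program))  # {9: ['A', 'B', 'D', 'E']}
```

Instruction C reaches WB in cycle 8 and is the only one of the five that does not collide.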

Solutions for WB Structural Hazards
1. Multiple write ports in the register file
• Extra hardware; slowdown
• Should we design for the peak or the average number of writes per cycle?
2. Buffer requests at the WB stage and write one at a time
• How deep should the buffer queue be?
3. Stall: allow only 1 write to propagate to the WB stage
• In the MEM stage (EX/MEM pipeline register)
  – Easy (+)
  – Prioritize based on heuristics (longest latency) (+)
  – Need to propagate the stall backwards (-)
  – Two sources of resource stalls (-)
• In the ID stage: only release an instruction that won't cause a hazard in the WB stage
  – Centralized handling of stalls (+)
  – Occurs earlier than necessary (-)
We will allow S.D and an FP instruction to go through the MEM stage at the same time.

Stall in MEM stage
[Figure: pipeline with IF, ID, the parallel units EX, A1 → A2, M1 → M2 → M3 → M4, and DIV (5 cycles, non-pipelined), a MUX, MEM, and WB]

Stall in ID stage
Check whether the instruction currently in ID will use WB in the same cycle as a previously issued instruction. If so, stall; otherwise, issue the instruction.
Simple hardware implementation:
• Shift register of length L, equal to the length of the longest path from ID to WB
  – Tracks the usage of WB for the next L cycles
  – Bit j of the shift register is true whenever an issued instruction will use WB j cycles from now
  – Every cycle, shift the contents by 1 bit (so bit j becomes bit j-1)
Assume the instruction in ID wants to use the register file in the WB stage:
1. Determine how many cycles later the instruction in ID will use the WB stage (say d); this depends on the FU required by the instruction
2. Check whether bit d of the shift register is set. If set, stall the current instruction for 1 cycle; otherwise, set bit d to 1
3. Shift the register one bit position
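
The three steps above can be sketched in software. This is an illustrative model, not the hardware itself: the class `WBReservation`, the function `simulate`, and the table `WB_DISTANCE` are hypothetical names, the distances assume the 5/4/2-cycle DIV/MUL/ADD units and a 1-cycle integer EX, and only the WB hazard is modeled (contention for functional units is ignored).

```python
# Cycles from ID to WB for each opcode: EX latency + MEM + WB (assumed model).
WB_DISTANCE = {"DIV.D": 7, "MUL.D": 6, "ADD.D": 4, "L.D": 3}


class WBReservation:
    """Shift register in which bit j means WB is taken j cycles from now."""

    def __init__(self, length):
        self.bits = [False] * (length + 1)

    def try_issue(self, d):
        """Step 2: reserve WB d cycles ahead; return False if the slot is taken."""
        if self.bits[d]:
            return False
        self.bits[d] = True
        return True

    def tick(self):
        """Step 3: shift by one position, so bit j becomes bit j-1."""
        self.bits = self.bits[1:] + [False]


def simulate(program, wb_distance):
    """Return the cycle in which each instruction leaves ID (first ID is cycle 2)."""
    table = WBReservation(max(wb_distance.values()))
    issue_cycle = {}
    pending = list(program)
    cycle = 2
    while pending:
        label, opcode = pending[0]
        d = wb_distance[opcode]          # step 1: distance to WB depends on the FU
        if table.try_issue(d):           # step 2: stall one cycle if bit d is set
            issue_cycle[label] = cycle
            pending.pop(0)
        table.tick()                     # step 3: advance the clock
        cycle += 1
    return issue_cycle


program = [("A", "DIV.D"), ("B", "MUL.D"), ("C", "ADD.D"), ("D", "ADD.D"), ("E", "L.D")]
issue = simulate(program, WB_DISTANCE)
wb = {label: issue[label] + WB_DISTANCE[op] for label, op in program}
print(issue)
print(wb)
```

For the DIV.D, MUL.D, ADD.D, ADD.D, L.D sequence, this model holds B for one cycle, C for two, and E for one, so that every instruction reaches WB in a different cycle.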
