Low-Complexity Reorder Buffer Architecture* Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose



slide-1
SLIDE 1

ICS’02 1

Low-Complexity Reorder Buffer Architecture*

*supported in part by DARPA through the PAC-C program and NSF

Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY 13902-6000 http://www.cs.binghamton.edu/~lowpower

16th Annual ACM International Conference on Supercomputing (ICS’02), June 24th 2002

slide-2
SLIDE 2

ICS’02 2

Outline

ROB complexities
Motivation for the low-complexity ROB
Low-complexity ROB design
Results
Concluding remarks

slide-3
SLIDE 3

ICS’02 3

Pentium III-like Superscalar Datapath

[Figure: superscalar datapath. Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch into the Issue Queue (IQ), instruction issue to function units FU1 to FUm (EX), result/status forwarding buses, ROB, Architectural Register File (ARF), LSQ, and D-cache.]

slide-4
SLIDE 4

ICS’02 4

ROB Port Requirements for a W-way CPU

ROB ports:
Writeback: W write ports to write results
Dispatch/Issue: 2W read ports to read the source operands
Decode/Dispatch: W write ports to set up entries
Commit: W read ports for instruction commitment
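
The port counts listed on this slide can be tallied for a given machine width. The following is a minimal sketch, not taken from the presentation: the function name `rob_ports` and the dictionary keys are hypothetical, and the tally only adds up the four port groups named on the slide. It does not attempt to reproduce the port counts of the bitcells compared later in the deck.

```python
def rob_ports(width: int) -> dict:
    """Tally ROB ports per pipeline stage for a W-way machine (slide 4)."""
    return {
        "writeback_write": width,          # W write ports to write results
        "dispatch_issue_read": 2 * width,  # 2W read ports for source operands
        "decode_dispatch_write": width,    # W write ports to set up entries
        "commit_read": width,              # W read ports for commitment
    }


ports = rob_ports(4)
total = sum(ports.values())
# Share of all ports that exist only to read source operands.
source_read_share = ports["dispatch_issue_read"] / total
print(ports, total, source_read_share)
```

Under these assumptions, the source-operand read ports are the largest single group, which is why the rest of the talk targets them.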

slide-5
SLIDE 5

ICS’02 5

Where are the Source Values Coming From?

[Figure: the superscalar datapath (Fetch, Decode/Dispatch, IQ, function units, forwarding buses, ROB, ARF, LSQ, D-cache) with the three possible source-operand paths numbered 1, 2, and 3.]

slide-6
SLIDE 6

ICS’02 6

Where are the Source Values Coming From?

[Figure: stacked bar chart, 0% to 100%, showing for each SPEC2K benchmark (plus integer, floating-point, and overall averages) the share of source operands obtained from Forwarding, the ARF, and the ROB. 96-entry ROB, 4-way processor.]

Labeled values, in legend order (Forwarding, ARF, ROB): 62%, 32%, 6%

slide-7
SLIDE 7

ICS’02 7

How Efficiently are the Ports Used?

ROB ports:
Writeback: W write ports to write results
Dispatch/Issue: 2W read ports to read the source operands (annotated: 6%)
Decode/Dispatch: W write ports to set up entries
Commit: W read ports for instruction commitment

slide-8
SLIDE 8

ICS’02 8

Approaches to Reducing ROB Complexity

Reduce the number of read ports for reading out the source operand values
More radical (and better): completely eliminate the read ports for reading source operand values!

slide-9
SLIDE 9

ICS’02 9

Reducing the Number of Read Ports

[Figure: bar chart, performance drop (%) per SPEC2K benchmark, with integer and FP averages, for ROBs with 1 read port and 2 read ports.]

Average IPC drop, in legend order (1 read port, 2 read ports): 3.5%, 1.0%

slide-10
SLIDE 10

ICS’02 10

Problems with Retaining Fewer Source Read Ports on the ROB

Need arbitration for the small number of ports
Additional logic is needed to block the instructions that could not get a port
Need a switching network to route the operands to the correct destinations
Multi-cycle access still remains in the critical path of the Dispatch/Issue logic

slide-11
SLIDE 11

ICS’02 11

Our Solution: Elimination of Read Ports

[Figure: the superscalar datapath with the three source-operand paths numbered 1, 2, and 3.]

slide-12
SLIDE 12

ICS’02 12

Our Solution: Elimination of Read Ports

[Figure: the superscalar datapath with the three source-operand paths numbered 1, 2, and 3 (animation step).]

slide-13
SLIDE 13

ICS’02 13

Our Solution: Elimination of Read Ports

[Figure: the same datapath with path 2 removed; only source-operand paths 1 and 3 remain, and the ROB is drawn separately.]

slide-14
SLIDE 14

ICS’02 14

Comparison of ROB Bitcells (0.18µ, TSMC)

[Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell.]

Area reduction: 71%
Shorter bitlines and wordlines

slide-15
SLIDE 15

ICS’02 15

Our Solution: Elimination of Read Ports

[Figure: the superscalar datapath with the simplified ROB.]

Area reduction: 45%

slide-16
SLIDE 16

ICS’02 16

Eliminating/Reducing the Number of Read Ports: Effects on Power Dissipation

Power is reduced because:

shorter bitlines and wordlines
lower capacitive loading
fewer decoders
fewer drivers and sense amps

slide-17
SLIDE 17

ICS’02 17

Completely Eliminating the Source Read Ports on the ROB

The Problem: issue of instructions that require a value stored in the ROB will stall
Solution: forward the value to the waiting instruction at the time of committing the value: LATE FORWARDING

slide-18
SLIDE 18

ICS’02 18

Late Forwarding: Use the Normal Forwarding Buses!

[Figure: the superscalar datapath with the result/status forwarding buses highlighted.]

slide-19
SLIDE 19

ICS’02 19

Late Forwarding: Use the Normal Forwarding Buses!

[Figure: the superscalar datapath with the result/status forwarding buses highlighted (animation step).]

slide-20
SLIDE 20

ICS’02 20

Optimizing Late Forwarding

PROBLEM: If Late Forwarding is done for every result that is committed, additional forwarding buses are needed to avoid degrading performance

SOLUTION: Selective Late Forwarding (SLF)
SLF requires an additional bit in the ROB
That bit is set by the dispatched instructions that require Late Forwarding
No additional forwarding buses are needed, since SLF traffic is very small
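
The SLF mechanism described above can be sketched as a small behavioral model. This is an illustration under stated assumptions, not the authors' hardware design: the class names `RobEntry` and `Rob`, the field `needs_late_fwd`, and the rule used to decide when late forwarding is required (the producer has already written its result back, so the consumer missed the normal forwarding) are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    dest_reg: int
    value: Optional[int] = None    # filled in at writeback
    needs_late_fwd: bool = False   # the extra SLF bit (name is an assumption)


class Rob:
    def __init__(self):
        self.entries = {}          # slot -> RobEntry; dict order = program order
        self.next_slot = 0

    def dispatch(self, dest_reg, src_slots=()):
        # Assumed rule: a consumer that finds its producer already written
        # back has missed normal forwarding, so it sets the producer's SLF bit.
        for s in src_slots:
            producer = self.entries.get(s)
            if producer is not None and producer.value is not None:
                producer.needs_late_fwd = True
        slot = self.next_slot
        self.next_slot += 1
        self.entries[slot] = RobEntry(dest_reg)
        return slot

    def writeback(self, slot, value):
        self.entries[slot].value = value

    def commit(self):
        # Retire the oldest entry; drive the forwarding bus only if the bit is set.
        slot = next(iter(self.entries))
        entry = self.entries.pop(slot)
        forwarded = (slot, entry.value) if entry.needs_late_fwd else None
        return entry.dest_reg, entry.value, forwarded


rob = Rob()
producer = rob.dispatch(dest_reg=1)
rob.writeback(producer, 42)
consumer = rob.dispatch(dest_reg=2, src_slots=[producer])  # missed forwarding
result = rob.commit()
print(result)  # (1, 42, (0, 42))
```

A result whose bit was never set commits with `forwarded` equal to `None`, which is the case that keeps SLF traffic on the forwarding buses small.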

slide-21
SLIDE 21

ICS’02 21

Late Forwarding: Use the Normal Forwarding Buses!

[Figure: the superscalar datapath with the result/status forwarding buses highlighted.]

Only 3.5% of the traffic is from SELECTIVE LATE FORWARDING

slide-22
SLIDE 22

ICS’02 22

Performance Drop of Simplified ROB

[Figure: bar chart, performance drop (%) per SPEC2K benchmark, with integer and FP averages, for three configurations: no ROB read ports with SLF, 1 read port, 2 read ports. Two bars carry the labels 37% and 17%.]

Average IPC drop, in legend order (no ROB read ports with SLF, 1 read port, 2 read ports): 9.6%, 3.5%, 1.0%

slide-23
SLIDE 23

ICS’02 23

IPC Penalty: Source Value Not Accessible within the ROB

[Figure: timeline of the lifetime of a result value. Events marked: result generation time (Forwarding) and Late Forwarding/Commitment. Intervals marked: value within ROB, value within ARF.]

slide-24
SLIDE 24

ICS’02 24

Improving IPC with No Read Ports

Cache recently generated values in a set of RETENTION LATCHES (RL)

Retention Latches are SMALL and FAST
Only 8 to 16 latches needed in the set
Entire set has 1 or 2 read ports

slide-25
SLIDE 25

ICS’02 25

Datapath with the Retention Latches

[Figure: the superscalar datapath before the retention latches are added (Fetch, Decode/Dispatch, IQ, function units, forwarding buses, ROB, ARF, LSQ, D-cache).]

slide-26
SLIDE 26

ICS’02 26

Datapath with the Retention Latches

[Figure: the same datapath with a RETENTION LATCHES block added alongside the ROB.]

slide-27
SLIDE 27

ICS’02 27

The Structure of the Retention Latch Set

[Figure: structure of the retention latch set.]

8 or 16 latches, each with a Status field and a Result Value field
L-ported CAM field (key = ROB_slot_id), searched with L ROB slot addresses (L = 1 or 2)
W write ports for writing up to W results in parallel
Output: L recently-written results (L = 1 or 2 works great)
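
The structure above behaves like a tiny associative cache keyed by ROB slot. The sketch below is a software analogy only, assuming FIFO replacement and ignoring port counts and the Status field; the class name `RetentionLatches` and its methods are hypothetical and do not come from the presentation.

```python
from collections import OrderedDict


class RetentionLatches:
    """Tiny associative store keyed by ROB slot id, FIFO replacement (assumed)."""

    def __init__(self, size=8):
        self.size = size
        self.latches = OrderedDict()   # rob_slot_id -> result value

    def write(self, rob_slot, value):
        # Called at writeback: keep the most recently produced results.
        if rob_slot in self.latches:
            del self.latches[rob_slot]           # re-insert as newest
        elif len(self.latches) >= self.size:
            self.latches.popitem(last=False)     # evict the oldest entry
        self.latches[rob_slot] = value

    def lookup(self, rob_slot):
        # Models the CAM search on ROB_slot_id; None means a miss.
        return self.latches.get(rob_slot)


rl = RetentionLatches(size=2)
rl.write(10, 111)
rl.write(11, 222)
rl.write(12, 333)            # evicts slot 10, the oldest entry
print(rl.lookup(10), rl.lookup(12))
```

On a miss the consumer would have to wait for the value to arrive by late forwarding, so the hit ratio of this small set is what determines the IPC penalty.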

slide-28
SLIDE 28

ICS’02 28

Retention Latch Management Strategies

FIFO
8-entry RL: 42% hit rate
16-entry RL: 55% hit rate

LRU
8-entry RL: 56% hit rate
16-entry RL: 62% hit rate

Random Replacement
Worse performance than FIFO
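
To see why LRU can reach a higher hit rate than FIFO with the same number of latches, the two policies can be compared on a toy access trace. This is a hedged sketch: the function `hit_rate`, the trace encoding (`"w"` for a result write, `"r"` for an operand read), and the trace itself are made up for illustration and say nothing about the SPEC numbers reported above.

```python
from collections import OrderedDict


def hit_rate(trace, size, policy):
    """Replay (op, rob_slot) pairs against a latch set; policy is 'fifo' or 'lru'."""
    latches = OrderedDict()
    hits = reads = 0
    for op, slot in trace:
        if op == "w":
            if slot in latches:
                latches.move_to_end(slot)        # rewritten slot becomes newest
            elif len(latches) >= size:
                latches.popitem(last=False)      # evict the entry at the front
            latches[slot] = True
        else:
            reads += 1
            if slot in latches:
                hits += 1
                if policy == "lru":
                    latches.move_to_end(slot)    # a read refreshes recency
    return hits / reads if reads else 0.0


# Slot 0 is read, then a new result arrives, then slot 0 is read again.
trace = [("w", 0), ("w", 1), ("r", 0), ("w", 2), ("r", 0)]
fifo = hit_rate(trace, size=2, policy="fifo")
lru = hit_rate(trace, size=2, policy="lru")
print(fifo, lru)  # 0.5 1.0
```

FIFO evicts slot 0 because it was written first, while LRU keeps it because it was read most recently; the price of LRU is the extra bookkeeping on every read.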

slide-29
SLIDE 29

ICS’02 29

Hit Ratios to Retention Latches

[Figure: bar charts, hit ratios (%) to the retention latches per SPEC2K benchmark, with integer and FP averages, for four 2-ported configurations: FIFO 8, FIFO 16, LRU 8, LRU 16.]

Average hit ratio, in legend order (FIFO 8, FIFO 16, LRU 8, LRU 16): 42%, 55%, 56%, 62%

slide-30
SLIDE 30

ICS’02 30

Accessing Retention Latch Entries

ROB index is used as a unique key in the Retention Latches to search for the result values
Need to maintain unique keys even when we have:

Reuse of a ROB slot: not a problem for FIFO; for LRU, simply flush the RL entry at commit time
Branch mispredictions

slide-31
SLIDE 31

ICS’02 31

Handling Branch Mispredictions

Selective RL Flushing: retention latch entries that are on the mispredicted path are flushed
Uses branch tags
Complicated implementation

Complete RL Flushing: all retention latch entries are flushed
Very simple implementation
Performance drop is only 1.5% compared to selective flushing
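
The key-uniqueness rules can be illustrated with a minimal model: an entry is dropped when its ROB slot commits, and the whole set is cleared on a misprediction (complete flushing). The class `LatchKeys` and its method names are hypothetical; selective flushing with branch tags is deliberately not modeled.

```python
class LatchKeys:
    """Keeps retention latch keys (ROB slot ids) unique across slot reuse."""

    def __init__(self):
        self.latches = {}              # rob_slot_id -> result value

    def write(self, slot, value):
        self.latches[slot] = value

    def lookup(self, slot):
        return self.latches.get(slot)  # None means a miss

    def on_commit(self, slot):
        # Drop the entry so a later instruction reusing this ROB slot
        # can never match a stale value.
        self.latches.pop(slot, None)

    def on_mispredict(self):
        # Complete flushing: discard every entry, no branch tags needed.
        self.latches.clear()


keys = LatchKeys()
keys.write(5, 111)
keys.on_commit(5)                      # slot 5 is now free for reuse
stale = keys.lookup(5)                 # miss, not the old value 111
keys.write(6, 222)
keys.on_mispredict()
after_flush = keys.lookup(6)
print(stale, after_flush)
```

Complete flushing also discards entries from the correct path, which is the source of the small performance drop relative to selective flushing.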

slide-32
SLIDE 32

ICS’02 32

Misprediction Handling: Performance

[Figure: bar chart, IPC (0 to 3.5) per SPEC2K benchmark, with integer, FP, and overall averages, under selective flushing and complete flushing.]

Average IPC drop: 1.5%

slide-33
SLIDE 33

ICS’02 33

Experimental Setup: the AccuPower (DATE’02)

[Figure: AccuPower tool flow. Inputs: compiled SPEC benchmarks and datapath specs to the microarchitectural simulator (rooted in SimpleScalar); VLSI layout data to a SPICE deck and SPICE. The simulator produces performance stats, transition counts, and context information. SPICE produces measures of energy per transition. The energy/power estimator outputs power/energy stats.]

slide-34
SLIDE 34

ICS’02 34

Configuration of the Simulated System

Machine width: 4-way
Issue Queue: 32 entries
Reorder Buffer: 96 entries
Load/Store Queue: 32 entries

Simulated the execution of SPEC2000 benchmarks

slide-35
SLIDE 35

ICS’02 35

Assumed Timings

Timing of the baseline model (stages D1, D2, D3):
D1: Rename Table lookup for ROB index
D2 and D3: source operand read from the ROB

Timing of the simplified ROB (stages D1, D2):
D1: Rename Table lookup for ROB index
D2: associative lookup of operand from retention latches using ROB index as a key (smaller delay: few latches)

slide-36
SLIDE 36

ICS’02 36

Experimental Results: Effect on Performance

[Figure: bar charts, performance drop (%) per SPEC2K benchmark, with integer and FP averages, on axes running from negative to positive values, for four retention latch configurations: 8 2-ported FIFO, 8 2-ported LRU, 16 2-ported FIFO, 16 2-ported LRU.]

Avg. IPC drop, in legend order: 0.1%, -1.6%, -1.0%, -2.3% (negative values are IPC gains)
slide-37
SLIDE 37

ICS’02 37

Experimental Results: Effect on Performance

[Figure: bar charts, performance drop (%) per SPEC2K benchmark, with integer and FP averages, for four retention latch configurations: 8 2-ported FIFO, 8 2-ported LRU, 16 2-ported FIFO, 16 2-ported LRU.]

Avg. IPC drop, in legend order: 3.3%, 1.7%, 2.3%, 1.0%
slide-38
SLIDE 38

ICS’02 38

Experimental Results: Effect on Power

[Figure: bar charts, power savings (%) per SPEC2K benchmark, with integer and FP averages, for five configurations: No ROB ports, 8 2-ported FIFO, 8 2-ported LRU, 16 2-ported FIFO, 16 2-ported LRU.]

Avg. savings, in legend order: 30%, 23.4%, 22.2%, 21%, 20.2%
slide-39
SLIDE 39

ICS’02 39

Summary of Results

Significantly reduced ROB complexity and power dissipation
45% area reduction
20% to 30% power reduction across SPEC 2000 benchmarks

Actual IPC improvements: 1.6% to 2.3% gain across SPEC benchmarks
IPC gains come from 1-cycle access to the RL (vs. the 2 cycles that would be needed for ROB access)

slide-40
SLIDE 40

ICS’02 40

Related Work

Value-Aging Buffer (Hu & Martonosi, PACS 2000)
Forwarding Buffer and Clustered Register Cache (Borch et al., HPCA’02)
Multiple Register Banks (Cruz et al., ISCA’00 & Balasubramonian et al., MICRO’01)
See paper for discussions

slide-41
SLIDE 41

ICS’02 41

Conclusions

Typical source operand location statistics can be successfully exploited to reduce ROB complexity
Significant reduction in ROB area and power: no ROB ports needed for reading source operands
IPC gains are possible because of the use of a small, low-ported Retention Latch set to supply cached operand values in a single cycle
