Register Packing: Exploiting Narrow-Width Operands for Reducing Register File Pressure



slide-1
SLIDE 1

Register Packing: Exploiting Narrow-Width Operands for Reducing Register File Pressure

Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev

Department of Computer Science
State University of New York - Binghamton

*currently with Intel Barcelona Research Center

slide-2
SLIDE 2

Outline

  • Introduction and motivations
  • Register Packing:
      • Conservative Packing
      • Speculative Packing
  • Results and discussions
  • Conclusion

slide-3
SLIDE 3

Introduction

  • Implications of larger instruction windows
      • Increased register pressure
      • Generally dealt with by using large register files
  • Large register files have:
      • Higher access time, or require multi-cycle access
      • Higher energy dissipation
  • Need to decrease the register file pressure

slide-4
SLIDE 4

Motivations

  • Many generated results have a lot of leading zeros or ones
  • Fewer bits are needed to represent the value
  • Register files are thus not used efficiently

slide-5
SLIDE 5

“Narrow” Values

  • Prefixes of all 1s can be replaced with a single 1, and prefixes of all 0s can be replaced with a single 0.
      • 11111111 → 1 (width = 1)
      • 00000000 → 0 (width = 1)
      • 00000001 → 01 (width = 2)
      • 11111101 → 101 (width = 3)
      • 10101001 → 10101001 (width = 8)
  • Narrow-width operands do not use the full width of a register
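
The prefix rule above can be sketched in a few lines of Python. This is only an illustration of the slide's five examples, not the hardware logic from the talk; the function names `narrow_width` and `compress` and the 8-bit default are assumptions made for the sketch.

```python
def narrow_width(value: int, bits: int = 8) -> int:
    """Smallest number of bits that still sign-extends back to `value`."""
    value &= (1 << bits) - 1            # treat value as a `bits`-wide pattern
    sign = (value >> (bits - 1)) & 1    # the bit that fills the prefix
    width = bits
    # Drop the top bit while the bit just below it repeats the prefix bit.
    while width > 1 and ((value >> (width - 2)) & 1) == sign:
        width -= 1
    return width


def compress(value: int, bits: int = 8) -> str:
    """Return the narrow form as a bit string, e.g. 11111101 -> '101'."""
    width = narrow_width(value, bits)
    return format(value & ((1 << width) - 1), f"0{width}b")


# Reproduce the examples listed on the slide.
for pattern in ("11111111", "00000000", "00000001", "11111101", "10101001"):
    v = int(pattern, 2)
    print(pattern, "->", compress(v), f"(width = {narrow_width(v)})")
```

The kept bits always start with one copy of the prefix bit, so sign extension of the narrow form restores the original value.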

slide-6
SLIDE 6

Distribution of Widths

[Figure: percentage of results (0% to 100%) falling into each width class (16 bits, 32 bits, 48 bits, 64 bits) for bzip2, gap, gcc, gzip, mcf, parser, twolf, vpr, ammp, applu, apsi, art, equake, mesa, mgrid, swim, wupwise, INT Average, FP Average, and Total Average]

slide-7
SLIDE 7

Exploiting Narrow Values

  • Packing multiple results into a single physical register improves performance, as the effective number of physical registers goes up

[Figure: diagram labeled with result widths 64, 32, 16, 48, 32]

slide-8
SLIDE 8

Main Challenges

  • Value widths are not known until the results are actually produced
  • Register allocation made to a result can change if the value turns out to be narrow
  • Consumers of the result have to be informed if it is reallocated to a different register based on its width
  • If multiple results are packed into a common register, some means must be provided to locate them unambiguously

slide-9
SLIDE 9

Detecting Value Widths

  • Have to quantize the widths to simplify implementation
      • Chunks of bytes or double bytes
  • Width detection logic is embedded into the final stages of an execution unit
  • Techniques for detecting widths are well known – Leading Zero Detectors in floating point units
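
As a rough software analogue of width quantization, the sketch below rounds a result's significant width up to whole chunks. The 64-bit register, the 16-bit (double-byte) chunk size, and the names `signed_width` and `chunks_needed` are assumptions chosen for illustration; the slides do not give the detection circuit.

```python
def signed_width(value: int, bits: int = 64) -> int:
    """Bits needed to hold `value` as a sign-extendable two's-complement number."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):             # reinterpret the pattern as signed
        value -= 1 << bits
    magnitude = value if value >= 0 else ~value
    return magnitude.bit_length() + 1   # +1 for the single retained prefix bit


def chunks_needed(value: int, bits: int = 64, chunk_bits: int = 16) -> int:
    """Quantize the width up to a whole number of chunks (ceiling division)."""
    return -(-signed_width(value, bits) // chunk_bits)


print(chunks_needed(5))        # small positive value
print(chunks_needed(0x8000))   # needs a 17th bit to keep the leading 0
print(chunks_needed(-1))       # all ones
print(chunks_needed(1 << 62))  # uses the full register
```

Note that `0x8000` needs two chunks rather than one: as a positive number it must keep a leading 0, which pushes it past the 16-bit boundary.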

slide-10
SLIDE 10

Storing Narrow Values in Registers

  • Parts of a result do not need to be stored contiguously.

[Figure: physical register P7 with partitions holding the upper half of narrow result A, the lower half of narrow result A, the upper half of narrow result B, and the lower half of narrow result B]

slide-11
SLIDE 11

Addressing Narrow Values

  • Use a bit mask, along with the register address, to specify the partitions holding components of the value

[Figure: physical register P7 holding the upper half and the lower half of narrow result A]

Address of A = P7, 1001
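
A minimal sketch of reading a value through such an address is shown below. It assumes four 16-bit partitions per register, that the leftmost mask bit refers to partition 3, and that selected partitions are concatenated from the highest-numbered to the lowest and then sign-extended; none of these details are stated on the slide, and `read_packed` is a hypothetical name.

```python
PART_BITS = 16  # assumed partition size (four partitions per 64-bit register)


def read_packed(partitions, mask, total_bits=64):
    """Gather the partitions selected by `mask` and sign-extend the result.

    `partitions[i]` holds partition i (0 = rightmost); bit i of `mask`
    is set when partition i belongs to the value being read.
    """
    value = 0
    width = 0
    for idx in range(len(partitions) - 1, -1, -1):   # highest partition first
        if (mask >> idx) & 1:
            value = (value << PART_BITS) | partitions[idx]
            width += PART_BITS
    if width == 0:
        raise ValueError("mask selects no partition")
    if (value >> (width - 1)) & 1:                   # negative: extend with 1s
        value -= 1 << width
    return value & ((1 << total_bits) - 1)


# P7 holds A in partitions 3 and 0, and B in partitions 2 and 1.
p7 = [0x5678, 0xBBBB, 0xAAAA, 0x1234]               # index = partition number
print(hex(read_packed(p7, 0b1001)))                  # address of A = P7, 1001
print(hex(read_packed(p7, 0b0110)))                  # address of B = P7, 0110
```

The mask makes the location unambiguous even when the two halves of a value are not adjacent, which is the case for A here.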

slide-12
SLIDE 12

Register Read Logic

[Figure: register read datapath. A bitcell array split into Partition 3, Partition 2, Partition 1, and Partition 0 (k bits each) feeds a sense amp array, followed by a 2:1 MUX, a 3:1 MUX, two 4:1 MUXes, and a 4:1 sign bit MUX* built from n-devices, producing a 4k-bit output. *includes 1k expander]

slide-13
SLIDE 13

Register Packing Alternatives

  • Conservative Packing
      • Assume the result will use the full width of a register at allocation time
  • Speculative Packing
      • Predict the result width at allocation time and allocate accordingly

slide-14
SLIDE 14

Conservative Packing

  • Initially allocate a full-width register
  • If the result turns out to be narrow:
      • Release the unneeded parts to the free pool
      • If there is a suitable partition: reallocate.

slide-15
SLIDE 15

Conservative Packing

Instruction I is dispatched:

[Figure: registers P2 and P5; legend: free partition, allocated partition]

slide-16
SLIDE 16

Conservative Packing

Instruction I is dispatched:
P2 is allocated

[Figure: registers P2 and P5; legend: free partition, allocated partition]

slide-17
SLIDE 17

Conservative Packing

Instruction I is dispatched:
Width of result = 2 slots
P5’s upper half is allocated and P2 is released

[Figure: registers P2 and P5; legend: free partition, allocated partition]
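
The P2/P5 walkthrough can be mimicked with a small free-map model. This is a sketch under assumptions, not the allocator from the talk: the class `PackedRegisterFile`, its search order, the choice of the highest free partitions, and the fallback of keeping the lowest partitions of the original register are all hypothetical.

```python
class PackedRegisterFile:
    """Free map with one bit per partition (1 = free); four partitions assumed."""

    def __init__(self, num_regs, parts=4):
        self.parts = parts
        self.full = (1 << parts) - 1
        self.free = [self.full] * num_regs

    def allocate_full(self):
        """Dispatch time: grab a completely free register, or None if there is none."""
        for reg, mask in enumerate(self.free):
            if mask == self.full:
                self.free[reg] = 0
                return reg, self.full
        return None

    def repack(self, reg, slots):
        """Writeback time: the result held in `reg` needs only `slots` partitions."""
        if slots >= self.parts:
            return reg, self.full                    # full width: nothing to do
        for other, mask in enumerate(self.free):
            partially_used = 0 < mask < self.full
            if other != reg and partially_used and bin(mask).count("1") >= slots:
                taken = self._take_highest(mask, slots)
                self.free[other] = mask & ~taken     # pack into the other register
                self.free[reg] = self.full           # release the original register
                return other, taken
        keep = (1 << slots) - 1                      # no suitable partition found:
        self.free[reg] = self.full & ~keep           # release only the unneeded parts
        return reg, keep

    def _take_highest(self, mask, slots):
        taken = 0
        for idx in range(self.parts - 1, -1, -1):
            if slots and (mask >> idx) & 1:
                taken |= 1 << idx
                slots -= 1
        return taken


rf = PackedRegisterFile(8)
rf.free[0] = rf.free[1] = 0       # P0 and P1 are fully in use
rf.free[5] = 0b1100               # P5: lower half in use, upper half free
reg, mask = rf.allocate_full()    # instruction I is dispatched
print(f"allocated P{reg}, {mask:04b}")
reg, mask = rf.repack(reg, 2)     # width of result = 2 slots
print(f"repacked to P{reg}, {mask:04b}")
```

With this setup the dispatch step picks P2 and the repack step moves the result to the upper half of P5, returning P2 to the free pool.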

slide-18
SLIDE 18

Taking Care of Reassignments

  • Two broadcasts are needed
  • First broadcast uses the old tag (= originally assigned register id) to inform dependents that the result will be available shortly
  • Second broadcast drives the old tag and the new tag (= newly-assigned register id + “parts” bits)
      • Old tag is used to locate dependents
      • New tag is picked up by matching entries and used later to read out the source value from the register file

slide-19
SLIDE 19

Tag Broadcast for Wakeup

[Figure: the producer function unit drives tag (P2, 1111) onto the tag bus; the consumer entry in the issue queue holds source tags (P1, 1001) and (P2, 1111)]

slide-20
SLIDE 20

Tag Rebroadcast Example

[Figure: the producer function unit drives old tag (P2, 1111) and new tag (P5, 1100); the consumer entry in the issue queue, holding source tags (P1, 1001) and (P2, 1111), replaces (P2, 1111) with (P5, 1100)]
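
The two-broadcast scheme can be sketched as two operations on an issue-queue entry. The class `IssueQueueEntry` and its methods are hypothetical, and the sketch assumes that a match compares the whole tag (register id plus parts bits), which the slides do not specify.

```python
class IssueQueueEntry:
    """A waiting consumer with a list of (register id, parts mask) source tags."""

    def __init__(self, sources):
        self.sources = list(sources)
        self.ready = [False] * len(self.sources)

    def wakeup(self, old_tag):
        """First broadcast: the result named by `old_tag` will be available shortly."""
        for i, tag in enumerate(self.sources):
            if tag == old_tag:
                self.ready[i] = True

    def rebroadcast(self, old_tag, new_tag):
        """Second broadcast: matching sources pick up the reassigned location."""
        for i, tag in enumerate(self.sources):
            if tag == old_tag:
                self.sources[i] = new_tag


consumer = IssueQueueEntry([(1, 0b1001), (2, 0b1111)])
consumer.wakeup((2, 0b1111))                      # old tag P2, 1111
consumer.rebroadcast((2, 0b1111), (5, 0b1100))    # new tag P5, 1100
print(consumer.sources, consumer.ready)
```

After both broadcasts the consumer reads its second operand from P5 using mask 1100, while the first operand's tag is untouched.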

slide-21
SLIDE 21

IPCs for Conservative Packing

[Figure: IPC (axis from 0.5 to 3) for bzip2, gap, gcc, gzip, mcf, parser, twolf, vpr, ammp, applu, apsi, art, equake, mesa, mgrid, swim, wupwise, INT Average, FP Average, and Total Average. Series: Base; 8 tag buses; 4 tag buses; 1 cycle stall on tag re-broadcasts]

slide-22
SLIDE 22

Conservative Packing: Observations

  • Extra broadcast is needed for all results that don’t use all of the partitions within a register
  • Performance is heavily constrained by the number of broadcast buses
      • 6% for 4 buses
      • 14% for 8 buses
      • 26% for 4 buses assuming an extra cycle delay for width estimation

slide-23
SLIDE 23

Speculative Packing

  • Predict the width of the result and allocate accordingly
  • Width overprediction: two choices here
      • Release unused parts of register – rebroadcast only the parts bits
      • Do not release unused parts – no rebroadcast is needed
  • Width underprediction: requires reallocation and an update broadcast
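
The three outcomes above can be written as a small decision function. This is an illustrative sketch only: the name `resolve`, the action strings, and the rule of keeping the lowest-numbered allocated partitions on an overprediction are assumptions, not details from the slides.

```python
def resolve(allocated_mask, actual_slots, release_unused=True):
    """Decide what to do at writeback given the speculatively allocated partitions.

    Returns (action, mask), where mask is the parts mask the value ends up
    with, or None when the value has to move to a new location.
    """
    predicted_slots = bin(allocated_mask).count("1")
    if actual_slots > predicted_slots:
        return "reallocate_and_update_broadcast", None   # underprediction
    if actual_slots == predicted_slots or not release_unused:
        return "no_rebroadcast", allocated_mask          # correct, or keep extra parts
    kept = 0
    idx = 0
    while actual_slots:                                  # overprediction: keep the
        if (allocated_mask >> idx) & 1:                  # lowest allocated partitions
            kept |= 1 << idx
            actual_slots -= 1
        idx += 1
    return "rebroadcast_parts_bits", kept


print(resolve(0b1100, 2))                        # prediction was right
print(resolve(0b1100, 1))                        # overprediction, release unused part
print(resolve(0b1100, 1, release_unused=False))  # overprediction, keep everything
print(resolve(0b1100, 3))                        # underprediction
```

Releasing the unused parts recovers register space at the cost of one more broadcast, while keeping them avoids the broadcast but wastes the partition.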

slide-24
SLIDE 24

Width Predictor

  • Width prediction bits are maintained within the L1 I-Cache
  • Prediction bits do not percolate down the memory hierarchy from L1
  • Default prediction is full width
  • Prediction bits are updated only on mispredictions
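
A toy model of these rules, keyed by instruction address, might look like the following. The class `WidthPredictor`, the dictionary standing in for I-cache prediction bits, the four-partition full width, and the choice to treat any mismatch (over or under) as a misprediction that updates the bits are all assumptions for the sketch.

```python
FULL_WIDTH = 4  # assumed number of partitions in a full-width register


class WidthPredictor:
    """Last-value width predictor; the dict stands in for L1 I-cache bits."""

    def __init__(self):
        self.bits = {}

    def predict(self, pc):
        """Default prediction is full width when no bits are stored."""
        return self.bits.get(pc, FULL_WIDTH)

    def update(self, pc, actual_slots):
        """Write the prediction bits only when the prediction was wrong."""
        predicted = self.predict(pc)
        if predicted == actual_slots:
            return "correct"
        self.bits[pc] = actual_slots
        return "over" if predicted > actual_slots else "under"

    def evict(self, pc):
        """Line leaves L1: the bits are dropped, not saved in lower levels."""
        self.bits.pop(pc, None)


p = WidthPredictor()
print(p.predict(0x400))        # default: full width
print(p.update(0x400, 1))      # first result was narrow
print(p.predict(0x400))
p.evict(0x400)
print(p.predict(0x400))        # back to the default after eviction
```

Defaulting to full width means a cold or evicted instruction can only be overpredicted, never underpredicted, on its first execution.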

slide-25
SLIDE 25

Width Prediction is Accurate!

[Figure: prediction accuracy (0% to 100%) for bzip2, gap, gcc, gzip, mcf, parser, twolf, vpr, ammp, applu, apsi, art, equake, mesa, mgrid, swim, wupwise, INT Average, FP Average, and Total Average. Series: prediction accuracy when counting the overpredictions as mispredictions; prediction accuracy when counting the overpredictions as correct predictions]

slide-26
SLIDE 26

Deadlock Avoidance

  • If there is a misprediction and there are no free register parts available:
      • Stall writeback and wait
          • This can still cause a deadlock if the instruction is the oldest in the pipeline
      • Create an exception and squash all instructions younger than the instruction (including itself)
      • Steal a register from a younger instruction and squash all instructions coming after the owner
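
The difference between the last two options can be illustrated by counting what each one squashes. The sketch below is a loose model under assumptions: the window is a list in program order, the stolen register comes from the youngest instruction that owns one, and the owner itself is squashed along with everything after it. The function names are hypothetical.

```python
def flush_all(window, idx):
    """Squash the mispredicted instruction at `idx` and everything younger."""
    return window[:idx], window[idx:]            # (kept, squashed)


def steal_from_younger(window, idx, owns_register):
    """Take a register from the youngest owner after `idx`; squash from there on."""
    for j in range(len(window) - 1, idx, -1):
        if window[j] in owns_register:
            return window[:j], window[j:]        # (kept, squashed)
    return None                                  # no younger owner to steal from


window = ["i0", "i1", "i2", "i3", "i4"]          # oldest first
print(flush_all(window, 1))
print(steal_from_younger(window, 1, owns_register={"i2", "i4"}))
```

In this example stealing discards one instruction where flushing discards four, which is the trade-off the two schemes are compared on.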

slide-27
SLIDE 27

Comparison of Deadlock Avoidance Schemes

[Figure: results (axis from 0.5 to 3) for bzip2, gap, gcc, gzip, mcf, parser, twolf, vpr, ammp, applu, apsi, art, equake, mesa, mgrid, swim, wupwise, INT Average, FP Average, and Total Average. Series: flush all; steal from younger]

slide-28
SLIDE 28

Speedups of Speculative Packing

[Figure: results (axis from 0.5 to 3) for bzip2, gap, gcc, gzip, mcf, parser, twolf, vpr, ammp, applu, apsi, art, equake, mesa, mgrid, swim, wupwise, INT Average, FP Average, and Total Average. Series: Base; 8 tag buses; 4 tag buses; 1 cycle stall on tag re-broadcasts]

slide-29
SLIDE 29

Performance of Packing

[Figure: results (axis from 0.5 to 3) for bzip2, gap, gcc, gzip, mcf, parser, twolf, vpr, ammp, applu, apsi, art, equake, mesa, mgrid, swim, wupwise, INT Average, FP Average, and Total Average. Series: Base 64+64; Register Packing 64+64; Base 128+128]

slide-30
SLIDE 30

Conclusions

  • We proposed and evaluated two register packing schemes
  • Because of the high number of tag broadcasts, Conservative Packing suffers in performance
  • Speculative Packing results in a 15% IPC improvement on average with 64 fp and 64 int registers (with tag bus sharing)

slide-31
SLIDE 31

Thank You!

Oguz Ergin

Department of Computer Science
State University of New York - Binghamton
www.cs.binghamton.edu/~oguz

Intel Barcelona Research Center