Register Packing: Exploiting Narrow-Width Operands for Reducing Register File Pressure (PowerPoint presentation)

SLIDE 1

Register Packing

Exploiting Narrow-Width Operands for Reducing Register File Pressure

Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev

Department of Computer Science

State University of New York - Binghamton

*currently with Intel Barcelona Research Center

SLIDE 2

Outline

Introduction and motivations

Register Packing:

Conservative Packing

Speculative Packing

Results and discussions

Conclusion

SLIDE 3

Introduction

Implications of larger instruction windows

Increases register pressure

Generally dealt with by using large register files

Large register files have:

Higher access time or require multi-cycle access

Higher energy dissipation

Need to decrease the register file pressure

SLIDE 4

Motivations

Many generated results have a lot of leading zeros or ones

Fewer bits are needed to represent the value

Register files are thus not used efficiently

SLIDE 5

"Narrow" Values

Prefixes of all 1s can be replaced with a single 1 and the prefixes of all 0s can be replaced with a single 0.

11111111 → 1 (width = 1)

00000000 → 0 (width = 1)

00000001 → 01 (width = 2)

11111101 → 101 (width = 3)

10101001 → 10101001 (width = 8)

Narrow width operands do not use the full width of a register
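The width rule above can be sketched in Python. This is a minimal illustration, assuming two's-complement values; the function name `narrow_width` and the `bits` parameter are hypothetical and do not come from the slides.

```python
def narrow_width(value: int, bits: int = 8) -> int:
    """Return how many low-order bits are needed to represent `value`
    when a run of identical leading bits is collapsed to a single bit."""
    value &= (1 << bits) - 1              # view the value as a bits-wide pattern
    sign = (value >> (bits - 1)) & 1      # the bit that the prefix repeats
    width = bits
    # Drop the top bit as long as the bit just below it is the same sign bit.
    while width > 1 and ((value >> (width - 2)) & 1) == sign:
        width -= 1
    return width
```

For the five 8-bit patterns listed on the slide, this function returns 1, 1, 2, 3, and 8 respectively.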

SLIDE 6

Distribution of Widths

[Figure: distribution of result widths (16, 32, 48, and 64 bits), on a 0% to 100% scale, for bzip2, gap, gcc, gzip, mcf, parser, twolf, vpr, ammp, applu, apsi, art, equake, mesa, mgrid, swim, and wupwise, plus the INT, FP, and total averages]

SLIDE 7

Exploiting Narrow Values

Packing multiple results into a single physical register improves performance as the effective number of physical registers goes up

[Figure: packing example with values of width 64, 32, 16, 48, and 32 bits]

SLIDE 8

Main Challenges

Value widths are not known until the results are actually produced

Register allocation made to a result can change if the value turns out to be narrow

Consumers of the result have to be informed if it is reallocated to a different register based on its width

If multiple results are packed into a common register, some means must be provided to locate them unambiguously

SLIDE 9

Detecting Value Widths

Have to quantize the widths to simplify implementation

Chunks of bytes or double bytes

Width detection logic is embedded into the final stages of an execution unit

Techniques for detecting widths are well known (e.g., leading zero detectors in floating point units)
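Quantized width detection can be modeled in software as follows. This is a sketch only, assuming 64-bit registers split into four double-byte (16-bit) chunks, which matches the 16/32/48/64-bit categories in the width distribution; the name `chunks_needed` and its parameters are hypothetical, and real hardware would use leading zero/one detection logic rather than a loop.

```python
def chunks_needed(value: int, reg_bits: int = 64, chunk_bits: int = 16) -> int:
    """Return the smallest number of chunks whose sign extension
    reproduces the full reg_bits-wide two's-complement value."""
    full_mask = (1 << reg_bits) - 1
    value &= full_mask
    total_chunks = reg_bits // chunk_bits
    for k in range(1, total_chunks):
        low_bits = k * chunk_bits
        low_mask = (1 << low_bits) - 1
        low = value & low_mask
        # Sign-extend the low k chunks back to the full register width.
        if low >> (low_bits - 1):
            extended = low | (full_mask & ~low_mask)
        else:
            extended = low
        if extended == value:
            return k
    return total_chunks
```

Note that a positive value such as 0x8000 needs two chunks under this model, because one extra bit is required to hold the sign.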

SLIDE 10

Storing Narrow Values in Registers

Parts of a result do not need to be stored contiguously.

[Figure: register P7 holding the upper and lower halves of narrow result A and the upper and lower halves of narrow result B]

SLIDE 11

Addressing Narrow Values

Use a bit mask to specify partitions holding components of the value along with the register address

[Figure: register P7 holding the upper half and the lower half of narrow result A]

Address of A = P7, 1001
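The mask-based addressing above can be illustrated with a small software model. This is a sketch under stated assumptions: four 16-bit partitions per register, mask bit i selecting partition i, and the lowest selected partition holding the least significant chunk. The slides do not specify the bit ordering, and the name `read_packed` is hypothetical.

```python
def read_packed(partitions, mask, chunk_bits=16):
    """Gather the partitions selected by `mask` (low partition first),
    concatenate them, and sign-extend to the full register width."""
    reg_bits = len(partitions) * chunk_bits
    value = 0
    count = 0
    for i, chunk in enumerate(partitions):
        if (mask >> i) & 1:
            value |= chunk << (count * chunk_bits)
            count += 1
    if count == 0:
        raise ValueError("mask selects no partitions")
    width = count * chunk_bits
    # Replicate the top stored bit into the bits that were not stored.
    if value >> (width - 1):
        value |= ((1 << reg_bits) - 1) & ~((1 << width) - 1)
    return value
```

With `p7 = [0x5678, 0xAAAA, 0xBBBB, 0x1234]` (index = partition number), `read_packed(p7, 0b1001)` reassembles result A from partitions 3 and 0, while `read_packed(p7, 0b0110)` reads the other result from partitions 2 and 1.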

SLIDE 12

Register Read Logic

[Figure: register read path. A bitcell array with Partitions 3, 2, 1, and 0 (k bits each, 4k in total) feeds a sense amp array, followed by 2:1, 3:1, 4:1, and 4:1 MUXes and a 4:1 sign bit MUX* built from n-devices. *includes 1k expander]

SLIDE 13

Register Packing Alternatives

Conservative Packing

Assume result to use the full width of a register at allocation time

Speculative Packing

Predict the result width at allocation time and allocate accordingly

SLIDE 14

Conservative Packing

Initially allocate a full-width register

If the result turns out to be narrow:

Release the unneeded parts to the free pool

If there is a suitable partition: reallocate.

SLIDE 15

Conservative Packing

Instruction I is dispatched:

[Figure: registers P2 and P5, with free and allocated partitions marked]

SLIDE 16

Conservative Packing

Instruction I is dispatched:

P2 is allocated

[Figure: registers P2 and P5, with free and allocated partitions marked]

SLIDE 17

Conservative Packing

Instruction I is dispatched:

Width of result = 2 slots

P5's upper half is allocated and P2 is released

[Figure: registers P2 and P5, with free and allocated partitions marked]
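The P2/P5 walkthrough can be approximated with a small allocator model. This is a sketch, not the authors' implementation: it assumes four partitions per register, tracks free partitions with a per-register bit mask, and treats any other partially used register with enough free partitions as "suitable". The class and method names are hypothetical.

```python
class PartitionedRegisterFile:
    def __init__(self, num_regs, parts=4):
        self.parts = parts
        self.full = (1 << parts) - 1
        # free[r] has bit i set when partition i of register r is free.
        self.free = [self.full] * num_regs

    def allocate_full(self):
        """Dispatch: claim a completely free register, or None if none exists."""
        for reg, free in enumerate(self.free):
            if free == self.full:
                self.free[reg] = 0
                return reg, self.full
        return None

    def _take(self, reg, count):
        """Claim `count` free partitions of `reg`, lowest index first."""
        mask = 0
        for i in range(self.parts):
            if count and (self.free[reg] >> i) & 1:
                mask |= 1 << i
                count -= 1
        self.free[reg] &= ~mask
        return mask

    def writeback(self, reg, chunks):
        """Result width is now known: return the final (register, parts mask)."""
        if chunks == self.parts:
            return reg, self.full
        # Prefer packing into another register that is already partly used.
        for other, free in enumerate(self.free):
            n_free = bin(free).count("1")
            if other != reg and chunks <= n_free < self.parts:
                mask = self._take(other, chunks)
                self.free[reg] = self.full      # release the original register
                return other, mask
        # Otherwise stay put and release only the unneeded partitions.
        keep = (1 << chunks) - 1
        self.free[reg] = self.full & ~keep
        return reg, keep
```

Under these assumptions, if P5 has only its upper half free and P2 is fully free, `allocate_full()` hands out P2, and `writeback(2, 2)` moves the two-slot result into P5's upper half and returns P2 to the free pool.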

SLIDE 18

Taking Care of Reassignments

Two broadcasts are needed

First broadcast uses old tag (= originally assigned register id) to inform dependents that the result will be available shortly

Second broadcast drives the old tag and the new tag (= newly-assigned register id + "parts" bits)

Old tag is used to locate dependents

New tag is picked up by matching entries and used later to read out the source value from the register file

SLIDE 19

Tag Broadcast for Wakeup

[Figure: the producer function unit drives tag (P2, 1111) on the tag bus; the consumer entry in the issue queue holds source tags (P1, 1001) and (P2, 1111)]

SLIDE 20

Tag Rebroadcast Example

[Figure: the producer function unit drives old tag (P2, 1111) and new tag (P5, 1100); the consumer entry in the issue queue holds source tags (P1, 1001) and (P2, 1111), and the matching source picks up (P5, 1100)]
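The two-broadcast protocol can be sketched as a pair of functions over a toy issue queue. This is an illustrative model, assuming each queue entry is a list of source operands tagged with a register id and a parts mask; `SourceOperand`, `broadcast_wakeup`, and `rebroadcast` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SourceOperand:
    reg: int             # physical register id
    mask: int            # "parts" bits selecting partitions
    ready: bool = False


def broadcast_wakeup(issue_queue, tag):
    """First broadcast: wake up every source waiting on the old tag."""
    for entry in issue_queue:
        for src in entry:
            if (src.reg, src.mask) == tag:
                src.ready = True


def rebroadcast(issue_queue, old_tag, new_tag):
    """Second broadcast: sources matching the old tag pick up the new tag,
    which is later used to read the value from the register file."""
    for entry in issue_queue:
        for src in entry:
            if (src.reg, src.mask) == old_tag:
                src.reg, src.mask = new_tag
```

Following the example, a consumer with sources (P1, 1001) and (P2, 1111) is woken by a broadcast of (P2, 1111); a later rebroadcast with old tag (P2, 1111) and new tag (P5, 1100) rewrites only the second source.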

SLIDE 21

IPCs for Conservative Packing

[Figure: IPC (axis from 0.5 to 3) for bzip2, gap, gcc, gzip, mcf, parser, twolf, vpr, ammp, applu, apsi, art, equake, mesa, mgrid, swim, and wupwise, plus the INT, FP, and total averages. Series: Base; 8 tag buses; 4 tag buses; 1 cycle stall on tag re-broadcasts]

SLIDE 22

Conservative Packing: Observations

Extra broadcast is needed for all results that don't use all of the partitions within a register

Performance is heavily constrained by the number of broadcast buses

6% for 4 buses

14% for 8 buses

26% for 4 buses assuming an extra cycle delay for width estimation

SLIDE 23

Speculative Packing

Predict the width of the result and allocate accordingly

Width overprediction: two choices here

Release unused parts of register (rebroadcast only the parts bits)

Do not release unused parts (no rebroadcast is needed)

Width underprediction: requires reallocation and an update broadcast
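The three outcomes listed above can be summarized as a small decision function. This is only a sketch: widths are expressed as partition counts, and the function name, the `release_unused` flag, and the returned strings are hypothetical labels rather than terms from the slides.

```python
def resolve_width_prediction(predicted, actual, release_unused=True):
    """Decide what must happen at writeback once the actual width is known."""
    if predicted < actual:
        # Underprediction: the result does not fit where it was allocated.
        return "reallocate and broadcast update"
    if predicted > actual and release_unused:
        # Overprediction, first choice: free the extra partitions.
        return "release unused parts and rebroadcast parts bits"
    # Correct prediction, or overprediction with the extra parts kept.
    return "no rebroadcast"
```

The `release_unused` flag captures the trade-off between the two overprediction choices: releasing parts recovers register space at the cost of a rebroadcast, while keeping them avoids the broadcast.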

SLIDE 24

Width Predictor

Width prediction bits are maintained within the L1 I-Cache

Prediction bits do not percolate down the memory hierarchy from L1

Default prediction is full width

Prediction bits are updated only on mispredictions
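A software model of this predictor might look like the following. It is a sketch under assumptions the slides do not state: predictions are keyed by instruction address, a misprediction stores the last observed width, and evicting a line from the L1 I-cache simply discards its prediction bits so the default applies again. All names are hypothetical.

```python
class WidthPredictor:
    def __init__(self, full_chunks=4):
        self.full_chunks = full_chunks
        self.table = {}      # pc -> predicted width in chunks
        self.updates = 0     # number of times prediction bits were rewritten

    def predict(self, pc):
        """Default prediction is full width."""
        return self.table.get(pc, self.full_chunks)

    def train(self, pc, actual_chunks):
        """Rewrite the prediction bits only on a misprediction."""
        if self.predict(pc) != actual_chunks:
            self.table[pc] = actual_chunks
            self.updates += 1

    def evict(self, pcs):
        """Model an L1 I-cache eviction: the bits are lost, not written back."""
        for pc in pcs:
            self.table.pop(pc, None)
```

Because the bits are not saved to lower levels of the hierarchy, an instruction refetched after eviction starts again from the full-width default.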

SLIDE 25

Width Prediction is Accurate!

[Figure: prediction accuracy (0% to 100%) for bzip2, gap, gcc, gzip, mcf, parser, twolf, vpr, ammp, applu, apsi, art, equake, mesa, mgrid, swim, and wupwise, plus the INT, FP, and total averages. Series: prediction accuracy when counting the overpredictions as mispredictions; prediction accuracy when counting the overpredictions as correct predictions]

SLIDE 26

Deadlock Avoidance

If there is a misprediction and there are no free register parts available:

Stall writeback and wait

This can still cause a deadlock if the instruction is the oldest in the pipeline

Create an exception and squash all instructions younger than the instruction (including itself)

Steal a register from a younger instruction and squash all instructions coming after the owner
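The "steal from younger" option can be sketched over a toy reorder buffer. The slides do not say which younger instruction is chosen or whether the owner itself is squashed; this sketch assumes the youngest instruction holding enough partitions is the victim and that the victim is squashed together with everything after it. The function name and the entry layout are hypothetical.

```python
def steal_from_younger(rob, stalled_index, needed_parts):
    """rob is a list of dicts ordered oldest first, each with an 'id' and the
    number of register partitions it holds in 'parts'.
    Returns the ids of the squashed instructions, or None if no victim exists."""
    # Scan from the youngest entry back toward the stalled instruction.
    for i in range(len(rob) - 1, stalled_index, -1):
        if rob[i]["parts"] >= needed_parts:
            squashed = [entry["id"] for entry in rob[i:]]
            del rob[i:]          # the owner and all later instructions are squashed
            return squashed
    return None
```

Scanning from the youngest end keeps the amount of squashed work small under these assumptions; a real design could pick the victim differently.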

SLIDE 27

Comparison of Deadlock Avoidance Schemes

[Figure: IPC (axis from 0.5 to 3) for bzip2, gap, gcc, gzip, mcf, parser, twolf, vpr, ammp, applu, apsi, art, equake, mesa, mgrid, swim, and wupwise, plus the INT, FP, and total averages. Series: flush all; steal from younger]

SLIDE 28

Speedups of Speculative Packing

[Figure: axis from 0.5 to 3 for bzip2, gap, gcc, gzip, mcf, parser, twolf, vpr, ammp, applu, apsi, art, equake, mesa, mgrid, swim, and wupwise, plus the INT, FP, and total averages. Series: Base; 8 tag buses; 4 tag buses; 1 cycle stall on tag re-broadcasts]

SLIDE 29

Performance of Packing

[Figure: axis from 0.5 to 3 for bzip2, gap, gcc, gzip, mcf, parser, twolf, vpr, ammp, applu, apsi, art, equake, mesa, mgrid, swim, and wupwise, plus the INT, FP, and total averages. Series: Base 64+64; Register Packing 64+64; Base 128+128]

SLIDE 30

Conclusions

We proposed and evaluated two register packing schemes

Because of the high number of tag broadcasts, Conservative Packing suffers in performance

Speculative Packing results in 15% IPC improvement on the average with 64 fp and 64 int registers (with tag bus sharing)

SLIDE 31

Thank You!

Oguz Ergin

Department of Computer Science

State University of New York - Binghamton

www.cs.binghamton.edu/~oguz

Intel Barcelona Research Center
