Lazy Retirement: A Power Aware Register Management Mechanism




slide-1
SLIDE 1

Lazy Retirement: A Power Aware Register Management Mechanism

WCED: Workshop on Complexity Efficient Design
May 2002, Anchorage, Alaska
Guillermo (Eli) Savransky, Ronny Ronen, Antonio Gonzalez
MRL - Intel Corp.

slide-2
SLIDE 2

Savransky, Ronen, Gonzalez Page 2

Agenda

Standard Retirement Algorithm
Lazy Retirement
Run Example
Simulation results
Summary

slide-3
SLIDE 3


Background

P6 architecture:

Reorder buffer (ROB) and physical register file are the same logical structure.

Values produced by the retiring instructions are copied from the ROB to the real register file (RRF).

ROB entries are deallocated on retirement.

This copy operation costs power.

Motivation: Reduce the number of copy operations without breaking the cyclic ROB structure.

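[Figure: the cyclic ROB (entries 1 to 64, each tagged with a register such as EAX or EBX and holding data, with Head and Tail pointers marking retirement and allocation) next to the RRF (EAX, EBX, … EDI, holding data).]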

slide-4
SLIDE 4


Lazy Retirement: The Idea

When retiring a ROB entry, its value is declared as architectural state but not copied to the RRF.

When the allocator needs a ROB entry, check if it is still part of the architectural state.

If it is, copy it to the RRF. If it isn’t, ignore it.

No Performance Penalty!!!

Standard Retirement: register deallocation triggers the copy to the RRF.
Lazy Retirement: register reallocation triggers the copy to the RRF.

slide-5
SLIDE 5


Example

Instruction sequence:

load eax [esp]
add ebx eax
and ebx 0xf
mov eax ebx
mov ecx 0x1

[Figure: four snapshots of ROB entries 37 to 40 (tagged EAX, EBX, EBX, EAX, and later ECX) with Head and Tail pointers, labeled "3 retire", "4 allocated", "2 retire" and "60 allocated". A callout marks the case where "Copy to RRF is needed".]

slide-6
SLIDE 6


Implementation

The Lazy Map Table remembers where the retired registers are.

A data valid bit in the ROB marks the registers containing architectural state.

[Figure: P6 versus Lazy organization. P6: a Register Map Table (per register EAX, EBX, … EDI: an "Is in RRF?" flag and a ROB index) alongside the ROB (entries 1 to 64, Head and Tail pointers) and the RRF. Lazy: adds a Lazy Map Table with the same "Is in RRF?" flag and index per register.]

slide-7
SLIDE 7


Algorithm

The valid data bit in the ROB is set if the associated entry contains an architectural register.

It is set at retirement.
It is reset when:
  another operation with the same architectural register retires, or
  the register is copied to the RRF.

The lazy map table indicates where the architectural register is: a ROB entry or the RRF.

It is updated at retirement and when the allocator forces the copying of the register to the RRF.
On mispredictions or exceptions, the lazy map table is copied to the renamer.
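
The rules on this slide can be sketched as a small software model. This is only an illustrative toy, not the authors' hardware design: the class and field names (`LazyRob`, `RobEntry`, `lazy_map`, `IN_RRF`) are hypothetical, values are supplied at allocation time because execution is not modeled, and the renamer is reduced to a snapshot of the lazy map table.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

IN_RRF = None  # lazy-map marker (assumed encoding): the value lives in the RRF


@dataclass
class RobEntry:
    dest: Optional[str] = None  # architectural destination register, if any
    value: int = 0
    valid: bool = False         # valid data bit: entry holds architectural state


class LazyRob:
    """Toy cyclic ROB with lazy retirement (hypothetical model)."""

    def __init__(self, size: int, arch_regs: List[str]) -> None:
        self.size = size
        self.entries = [RobEntry() for _ in range(size)]
        self.head = 0        # oldest in-flight entry, next to retire
        self.tail = 0        # next entry to allocate
        self.in_flight = 0
        self.rrf: Dict[str, int] = {r: 0 for r in arch_regs}
        # Lazy map table: register -> ROB index, or IN_RRF.
        self.lazy_map: Dict[str, Optional[int]] = {r: IN_RRF for r in arch_regs}
        self.copies = 0      # ROB -> RRF copies actually performed

    def allocate(self, dest: Optional[str], value: int) -> int:
        if self.in_flight == self.size:
            raise RuntimeError("ROB full")
        idx = self.tail
        old = self.entries[idx]
        if old.valid:
            # Entry still holds architectural state: forced copy to the RRF.
            self.rrf[old.dest] = old.value
            self.lazy_map[old.dest] = IN_RRF
            self.copies += 1
        self.entries[idx] = RobEntry(dest, value)  # valid bit starts cleared
        self.tail = (idx + 1) % self.size
        self.in_flight += 1
        return idx

    def retire(self) -> None:
        if self.in_flight == 0:
            raise RuntimeError("nothing to retire")
        idx = self.head
        entry = self.entries[idx]
        if entry.dest is not None:
            prev = self.lazy_map[entry.dest]
            if prev is not IN_RRF:
                # An older retired value of the same register is now dead.
                self.entries[prev].valid = False
            entry.valid = True                 # set at retirement, no copy
            self.lazy_map[entry.dest] = idx
        self.head = (idx + 1) % self.size
        self.in_flight -= 1

    def read_arch(self, reg: str) -> int:
        """Read the architectural value from wherever the lazy map points."""
        loc = self.lazy_map[reg]
        return self.rrf[reg] if loc is IN_RRF else self.entries[loc].value

    def recovery_map(self) -> Dict[str, Optional[int]]:
        """Snapshot handed to the renamer on a misprediction or exception."""
        return dict(self.lazy_map)
```

In this model, retiring four operations that write EAX, EBX, EBX, EAX performs no copies. When the four entries are later reallocated, only the two entries that still hold the newest EBX and EAX values are copied, where standard retirement would have copied all four.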

slide-8
SLIDE 8


Why It Works?

ROB size tuned for worst cases:

Cache misses.
Long latency dependency chains.

Most of the data copied to the RRF is overwritten shortly after the transfer.

[Figure: Uniformly distributed register allocation. X axis: unallocated window size. Y axis: probability of avoiding the copy (0% to 100%). One curve per series: 8, 16, 32, 64, 128.]
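
The curves themselves are not recoverable from the transcript, but the shape of such a plot can be reproduced under a simple model that is an assumption here, not something stated on the slide. Suppose each retiring operation writes one of `num_regs` architectural registers chosen uniformly and independently, and `window` further operations retire before an entry is reallocated. The copy is avoided whenever at least one of them wrote the same register, which has probability 1 - (1 - 1/num_regs)^window. Reading the series labels 8 to 128 as register counts is likewise a guess.

```python
def p_avoid_copy(window: int, num_regs: int) -> float:
    """Probability that a retired value is superseded before reallocation.

    Assumed model: `window` later retirements, each writing one of
    `num_regs` registers uniformly at random and independently.
    """
    if window < 0 or num_regs < 1:
        raise ValueError("window must be >= 0 and num_regs >= 1")
    return 1.0 - (1.0 - 1.0 / num_regs) ** window


if __name__ == "__main__":
    # Tabulate the model for a few window sizes and register counts.
    for num_regs in (8, 16, 32, 64, 128):
        row = ", ".join(
            f"W={w}: {p_avoid_copy(w, num_regs):.0%}" for w in (1, 16, 64)
        )
        print(f"{num_regs:>3} registers -> {row}")
```

Under this model the probability rises quickly with the unallocated window and more slowly as the number of registers grows, which fits the slide's argument that a ROB sized for worst cases usually leaves a large unallocated window.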

[Figure: ROB usage for SPECInt. X axis: entries used (1 to 64). Y axes: percent used (0% to 12%) and cumulative percent. Series: Average, Cumulative.]

slide-9
SLIDE 9


Simulation Setup

Used an internal performance simulator.

Simulated processor details:

IA32 architecture.
P6-like microarchitecture.
Separate ROB and RRF.
64 ROB entries.

A modified CACTI tool was used for power estimations.

Workload:

SpecInt2000
Winstone99
SYSmark98
Other multimedia traces.

slide-10
SLIDE 10


Simulation Results

Retirement ports usage per cycle (percent of clocks; clocks with zero copies not shown).

Ports used    Standard retirement    Lazy retirement
1             23.9%                  8.7%
2             12.2%                  2.2%
3             2.5%                   0.3%

With lazy retirement, three ports are used in only 0.3% of the clocks!

Improves clock gating: no port is required in 61% of the clocks for P6 versus 88% for Lazy.

slide-11
SLIDE 11


Simulation Results

The number of copies from the ROB to the RRF per retired operation.

[Figure: copies out of retired operations (0% to 70%) per trace, Standard Retirement versus Lazy Retirement. Traces: KatCh_Dec, MM99_VP07, SModem, SPECint2000_bzip204, SPECint2000_crafty07, SPECint2000_gap06, SPECint2000_gcc01, SPECint2000_gcc02, SPECint2000_gzip06, SPECint2000_gzip15, SPECint2000_gzip20, SPECint2000_link12, SPECint2000_mcf01, SPECint2000_twolf10, SPECint2000_vpr14, Smark98NT_Corel01, Smark98NT_Excel05, Smark98NT_Natur01, Smark98NT_OmniPage01, Smark98NT_Paradox01, Smark98NT_PowerP10, Smark98NT_Word03, Winst99_Cor97_7, Winst99_Lot_17, Winst99_Lot_6, Winst99_Off97_3, Average.]

75% of the copies eliminated!

slide-12
SLIDE 12


Power Modeling

Power reduction compared to original

Power consumed by the different tables as a percentage of the original consumption.

[Figure: percent of the original power (0% to 50%) per trace file. Series: Lazy Table, RRF lazy, ROB lazy. Trace files as labeled on the chart: KatCh_Dec, MM99_VP07, SModem, SPECint2000_bzip20, SPECint2000_crafty0, SPECint2000_gap06, SPECint2000_gcc01, SPECint2000_gcc02, SPECint2000_gzip06, SPECint2000_gzip15, SPECint2000_gzip20, SPECint2000_link12, SPECint2000_mcf01, SPECint2000_twolf1, SPECint2000_vpr14, Smark98NT_Corel01, Smark98NT_Excel05, Smark98NT_Natur01, Smark98NT_OmniPa, Smark98NT_Paradox, Smark98NT_PowerP, Smark98NT_Word03, Winst99_Cor97_7, Winst99_Lot_17, Winst99_Lot_6, Winst99_Off97_3, Average.]

>60% power reduction!

The lazy table uses 13% of the original retirement power.

slide-13
SLIDE 13


Considerations

ROB + RRF is about 7% of total processor power.

Renamer power changes are not modeled:
  Number of updates greatly reduced.
  Misprediction recovery is not thermally relevant.

Can be used to reduce the number of ROB, RRF and renamer physical ports used for retirement:
  High power reduction.
  Has a performance penalty (the trade-off is architecture dependent).

In a unified register file with no RRF (as in the P4 architecture) the management logic is more expensive than the P6 retirement.

slide-14
SLIDE 14


Summary

Presented a method for reducing the copies of data from the physical to the architectural register file.

Eliminates about 75% of the copies.
Can be implemented without performance penalty.

The power reduction is much higher than the overhead.

Balance algorithm complexity to reduce power:

Too dumb: lots of work, high power.
Too smart: lots of control logic, high power.

In general: balance added capacitance with lowered activity.