
Lazy Retirement: A Power Aware Register Management Mechanism



  1. Lazy Retirement: A Power Aware Register Management Mechanism
     Guillermo (Eli) Savransky, Ronny Ronen, Antonio Gonzalez
     MRL - Intel Corp.
     WCED, Workshop on Complexity Efficient Design, May 2002, Anchorage, Alaska

  2. Agenda
     • Standard Retirement Algorithm
     • Lazy Retirement
     • Run Example
     • Simulation results
     • Summary
     Savransky, Ronen, Gonzalez Page 2

  3. Background
     • P6 architecture: the reorder buffer (ROB) and the physical register file are the same logical structure.
     • Values produced by the retiring instructions are copied from the ROB to the real register file (RRF).
     • ROB entries are deallocated on retirement.
     • This copy operation costs power.
     Motivation: Reduce the number of copy operations without breaking the cyclic ROB structure.
     [Figure: the cyclic ROB (entries 0 to 64, Head and Tail pointers, Allocation and Retirement, data tagged EAX, EBX) and the RRF (data for EAX, EBX, ..., EDI).]

  4. Lazy Retirement: The Idea
     • When retiring a ROB entry, its value is declared as architectural state but not copied to the RRF.
     • When the allocator needs a ROB entry, check if it is still part of the architectural state.
         • If it is, copy it to the RRF.
         • If it isn't, ignore.
     • No performance penalty!
     Standard Retirement: Register Deallocation → Copy to RRF
     Lazy Retirement: Register Reallocation → Copy to RRF
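
The difference between the two policies can be illustrated with a minimal Python sketch. It is not the authors' simulator: the function name `count_copies`, the one-destination-per-instruction stream, and the assumption that every instruction has already retired by the time its ROB entry is reallocated are all simplifications introduced here for illustration.

```python
def count_copies(dests, rob_size):
    """Return (standard, lazy) ROB-to-RRF copy counts for a stream of
    destination registers. Hypothetical model, not taken from the slides."""
    standard = len(dests)        # standard retirement: one copy per retired value
    lazy = 0
    rob = [None] * rob_size      # architectural register written by each entry
    latest = {}                  # register -> ROB index of its newest retired value
    for i, reg in enumerate(dests):
        idx = i % rob_size       # cyclic allocation of ROB entries
        old = rob[idx]
        # Lazy retirement: copy on reallocation only if the entry still
        # holds the architectural value of its register.
        if old is not None and latest.get(old) == idx:
            lazy += 1
            del latest[old]      # the value now lives in the RRF
        # Simplifying assumption: the new instruction retires immediately.
        rob[idx] = reg
        latest[reg] = idx
    return standard, lazy
```

Under these assumptions, a register that is rewritten before its ROB entry is reused never costs a copy, which is the saving the slide describes.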

  5. Example
       load eax ← [esp]
       add  ebx ← eax
       and  ebx ← 0xf
       mov  eax ← ebx
       mov  ecx ← 0x1
     [Figure: snapshots of ROB entries 37 to 40 (EAX, EBX, EBX, EAX, followed by an entry for ECX) showing the Head and Tail pointers as instructions retire and entries are allocated. Step labels: "3 retire", "2 retire", "60 allocated", "4 allocated". The label "Copy to RRF is needed" appears twice.]

  6. Implementation
     • The Lazy Map Table remembers where the retired registers are.
     • A data valid bit in the ROB marks the registers containing architectural state.
     [Figure: the ROB (entries 0 to 64, Head and Tail pointers, data tagged EAX, EBX), the P6 Register Map Table and the Lazy Map Table (each listing EAX, EBX, ..., EDI with an index and an "Is in RRF?" flag), and the RRF (data for EAX, EBX, ..., EDI).]

  7. Algorithm
     • The valid data bit in the ROB will be set if the associated entry contains an architectural register.
         • It will be set at retirement.
         • It will be reset when:
             • Another operation with the same architectural register retires, or
             • The register is copied to the RRF.
     • The lazy map table will indicate where the architectural register is.
         • ROB entry or RRF.
         • It will be updated at retirement and if the allocator forces the copying of the register to the RRF.
     • On mispredictions or exceptions, the lazy map table is copied to the renamer.
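
The rules on this slide can be sketched as a small Python state model. The class name `LazyRetirementModel`, its method names, and the `RRF` marker are hypothetical names chosen here; the slides describe hardware tables, not this interface, and the sketch ignores timing and port limits.

```python
RRF = "RRF"  # marker: the architectural value lives in the real register file


class LazyRetirementModel:
    """Hypothetical bookkeeping model of the valid bit and lazy map table."""

    def __init__(self, rob_size, arch_regs):
        self.valid = [False] * rob_size              # per-entry valid data bit
        self.dest = [None] * rob_size                # register written by each entry
        self.lazy_map = {r: RRF for r in arch_regs}  # register -> ROB index or RRF
        self.copies = 0                              # ROB-to-RRF copies performed

    def retire(self, entry, reg):
        """Retire the op in `entry` that writes `reg`, without copying."""
        prev = self.lazy_map[reg]
        if prev != RRF:
            self.valid[prev] = False   # an op with the same register retired
        self.valid[entry] = True       # set at retirement
        self.dest[entry] = reg
        self.lazy_map[reg] = entry     # map updated at retirement

    def allocate(self, entry):
        """Reuse `entry`; return True if a copy to the RRF was forced."""
        if not self.valid[entry]:
            return False               # not architectural state: ignore
        self.lazy_map[self.dest[entry]] = RRF  # map updated on forced copy
        self.valid[entry] = False      # reset once copied to the RRF
        self.copies += 1
        return True

    def recovery_map(self):
        """Snapshot handed to the renamer on a misprediction or exception."""
        return dict(self.lazy_map)
```

As a usage example loosely based on the register names from the example slide, retiring entries 37 to 40 as EAX, EBX, EBX, EAX and then reallocating the same four entries forces a copy only for entries 39 and 40, because entries 37 and 38 were superseded by later retirements.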

  8. Why It Works?
     • ROB size is tuned for worst cases:
         • Cache misses.
         • Long latency dependency chains.
     • Most of the data copied to the RRF is overwritten shortly after the transfer.
     [Figure: "Uniformly distributed register allocation": probability of avoiding the copy vs. unallocated window size, with series 8, 16, 32, 64, 128.]
     [Figure: "ROB usage for SPECInt": percent used (average) and cumulative percent vs. entries used.]
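
One way to read the "uniformly distributed register allocation" chart is as a simple probability model, sketched below. This is an interpretation, not something the slide states: it assumes each retiring instruction writes one of `num_regs` architectural registers chosen uniformly and independently, that `window` further instructions retire before the entry is reallocated, and that the chart's series (8 to 128) are register counts. The function name `p_avoid_copy` is hypothetical.

```python
def p_avoid_copy(num_regs, window):
    """Probability that a retired value is overwritten by at least one of
    the next `window` retirements, so its copy to the RRF is avoided.
    Assumes uniform, independent destination registers (an assumption)."""
    return 1.0 - (1.0 - 1.0 / num_regs) ** window


if __name__ == "__main__":
    for regs in (8, 16, 32, 64, 128):
        print(regs, [round(p_avoid_copy(regs, w), 2) for w in (8, 32, 64)])
```

Under this model the probability rises with the unallocated window size and falls as the number of registers grows, which matches the qualitative argument that a ROB sized for worst cases usually leaves enough slack for values to be overwritten before their entries are reused.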

  9. Simulation Setup
     • Used an internal performance simulator.
     • Simulated processor details:
         • IA32 architecture.
         • P6-like microarchitecture.
         • Separate ROB and RRF.
         • 64 ROB entries.
     • A modified CACTI tool used for power estimations.
     • Workload:
         • SpecInt2000
         • Winstone99
         • SYSmark98
         • Other multimedia traces.

  10. Simulation Results
     • Retirement ports usage per cycle.
     • In 0.3% of the clocks three ports are used!
     • Improves clock gating when no port is required: P6: 61%, Lazy: 88%.
     [Figure: bar chart of the percentage of clocks using 1, 2, or 3 retirement ports, standard retirement vs. lazy retirement; clocks with zero copies not shown. Data labels in the chart: 23.9%, 12.2%, 8.7%, 2.5%, 2.2%, 0.3%.]

  11. Simulation Results
     • The number of copies from the ROB to the RRF per operation.
     • 75% of the copies eliminated!
     [Figure: "Copies out of Retired operations" (0% to 70%) per trace, Lazy Retirement vs. Standard Retirement. Traces: KatCh_Dec, MM99_VP07, SModem, the SPECint2000_* traces, the Smark98NT_* traces, the Winst99_* traces, and the average.]

  12. Power Modeling
     • Power reduction compared to original consumption.
     • The lazy table uses 13% of the original retirement power.
     • >60% power reduction!
     [Figure: "Power consumed by the different tables as a function of the original": percent of the original power (0% to 50%) per trace file, with series Lazy Table, ROB lazy, RRF lazy. Traces: KatCh_Dec, MM99_VP07, SModem, the SPECint2000_* traces, the Smark98NT_* traces, the Winst99_* traces, and the average.]

  13. Considerations
     • ROB + RRF is about 7% of total processor power.
     • Renamer power changes are not modeled:
         ☺ Number of updates greatly reduced.
         • Misprediction recovery is not thermally relevant.
     • Can be used to reduce the number of ROB, RRF and renamer physical ports used for retirement.
         • High power reduction.
         • Has a performance penalty (the trade-off is architecture dependent).
     • In a unified register file with no RRF (as in the P4 architecture) the management logic is more expensive than the P6 retirement.

  14. Summary
     • Showed a method for reducing the copies of data from the physical to the architectural register file.
         • Eliminates about 75% of the copies.
         • Can be implemented without performance penalty.
         • The power reduction is much higher than the overhead.
     • Balance algorithm complexity to reduce power:
         • Too dumb → lots of work → high power.
         • Too smart → lots of control logic → high power.
     • In general: balance added capacitance with lowered activity.
