Inherently Lower Complexity Architectures using Dynamic Optimization
Michael Gschwind, Erik Altman

What is the Problem?
Out of order superscalars achieve high performance ... but at the cost of high hardware complexity:
- Predictors
- Complex decode
- Complex issue queues with wakeup and issue logic
- Register mapping tables
- ...

What is the Problem?
Out of order superscalars achieve high performance ... but at the cost of high power.
- Many out of order components operate every cycle.
- Many components query a large set of data to operate on a single element.

What is the Problem?
Out of order superscalars achieve high performance ... but at the cost of deep pipelines.
- Complex logic has long latency.
- To achieve high frequency with long latency, super pipelining is required.
- Deep pipelines require excellent branch predictors.
- Excellent branch predictors are complex.
- Complex logic has long latency ...

What is the Problem?
Out of order superscalars achieve high performance ... but at the cost of high verification and debug complexity.
With Moore's Law, schedule slips = performance slips

Schedule Slip   Relative Performance
1 month         4%
3 month         12%
6 month         26%
9 month         41%
12 month        59%
18 month        100%
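
The percentages in the schedule slip table are consistent with performance doubling every 18 months. The slide does not state the formula, so the doubling period below is an assumption inferred from the 18 month / 100% row, and `relative_loss_percent` is a hypothetical helper name. A minimal sketch:

```python
# Assumption: performance doubles every 18 months (inferred from the
# "18 month -> 100%" row; the slide does not give the formula).
DOUBLING_MONTHS = 18


def relative_loss_percent(slip_months, doubling_months=DOUBLING_MONTHS):
    """Percent by which the competition's performance grows during the slip."""
    return (2 ** (slip_months / doubling_months) - 1) * 100


if __name__ == "__main__":
    for months in (1, 3, 6, 9, 12, 18):
        print(f"{months:2d} month slip -> {relative_loss_percent(months):.0f}%")
```

Rounded to whole percents, this reproduces the six values listed in the table.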

What is the Solution?
Software Dynamic Optimization
Allows reduced hardware complexity:
- Shorter pipelines for same frequency.
- Fewer hardware predictors.
- Simpler issue logic.
- Less power, a la Transmeta.
- Less debug and verification.
- Smaller chips and higher yield.

How to Implement the Solution
BOA Architecture for Complexity Effective Design
- BOA = Binary Translation Optimized Architecture
- BOA in combination with its dynamic ... compatible with PowerPC.

What is interesting about BOA?
- Software dynamic optimization.
- Precise behavior on most memory faults.
- Load/Store order tables ensure memory semantics and allow aggressive dynamic software reordering.
- Instruction recirculation mechanism to simplify issue and exception handling.
- Predictable latencies handled by software, unpredictable by hardware.
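
BOA System Architecture
Flowchart, for each PowerPC instruction X:
1. Is X a previously translated entry point?
   - Yes: execute the BOA translation of the group at X.
   - No: go to step 2.
2. Has X been seen 15 times?
   - Yes: form a group at X and translate its instructions to BOA instructions.
   - No: interpret instruction X (PowerPC), update statistics, and go to the next instruction.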
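
The interpret / count / translate flow can be sketched as a small dispatch loop. This is an illustration under stated assumptions, not the BOA runtime: `Dispatcher`, `interpret`, and `translate` are hypothetical names, the translated group is assumed to run right after it is formed, and only the threshold of 15 comes from the slide.

```python
from collections import defaultdict

TRANSLATE_THRESHOLD = 15  # "Seen X 15 times" from the flowchart


class Dispatcher:
    """Hypothetical interpret-then-translate loop keyed by PowerPC address."""

    def __init__(self, interpret, translate):
        self.interpret = interpret      # callable(pc) -> next pc
        self.translate = translate      # callable(pc) -> callable() -> next pc
        self.seen = defaultdict(int)    # per-entry-point statistics
        self.translations = {}          # entry point -> translated group

    def step(self, pc):
        group = self.translations.get(pc)
        if group is not None:
            return group()              # previously translated entry point
        self.seen[pc] += 1              # update statistics
        if self.seen[pc] >= TRANSLATE_THRESHOLD:
            group = self.translate(pc)  # form group at pc and translate it
            self.translations[pc] = group
            return group()              # assumption: run the new group at once
        return self.interpret(pc)       # interpret one PowerPC instruction
```

With a program that keeps returning to the same address, this sketch interprets the first 14 visits, translates on the 15th, and reuses the translation afterwards.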

BOA ISA (1)
- BOA is variable length VLIW machine.
- BOA instructions (bundles) are 128 bits.
- Bundles have 3 primitive ops.
- Primitive ops have 39 bits plus stop bit.
- Complex PowerPC ops cracked.
- 8 bits of bundle reserved for future uses such as predication.
- Instruction Issue:
  - Up to 6 primitive ops are issued together.
  - Only last op issued may have stop bit set.
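
The bundle arithmetic works out as 3 x (39 + 1) + 8 = 128 bits. The sketch below packs and unpacks such a bundle and splits an op stream into issue groups at stop bits. The field order inside the bundle and the function names are assumptions made for illustration; the slide gives only the field widths and the issue rule.

```python
OP_BITS = 39             # primitive op width (from the slide)
SLOT_BITS = OP_BITS + 1  # op plus its stop bit
OPS_PER_BUNDLE = 3
RESERVED_BITS = 8
MAX_ISSUE = 6
assert OPS_PER_BUNDLE * SLOT_BITS + RESERVED_BITS == 128


def pack_bundle(ops, reserved=0):
    """Pack three (op, stop) pairs into a 128-bit integer.

    Assumed layout: reserved byte in the top bits, then op 0, op 1, op 2.
    """
    if len(ops) != OPS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly 3 primitive ops")
    if not 0 <= reserved < (1 << RESERVED_BITS):
        raise ValueError("reserved field is 8 bits")
    word = reserved
    for op, stop in ops:
        if not 0 <= op < (1 << OP_BITS):
            raise ValueError("primitive op must fit in 39 bits")
        word = (word << SLOT_BITS) | (op << 1) | int(bool(stop))
    return word


def unpack_bundle(word):
    """Inverse of pack_bundle: returns ([(op, stop), ...], reserved)."""
    ops = []
    for _ in range(OPS_PER_BUNDLE):
        slot = word & ((1 << SLOT_BITS) - 1)
        ops.append((slot >> 1, bool(slot & 1)))
        word >>= SLOT_BITS
    ops.reverse()
    return ops, word


def issue_groups(stream):
    """Split (op, stop) pairs into issue groups; a stop bit ends a group."""
    groups, current = [], []
    for op, stop in stream:
        current.append(op)
        if len(current) > MAX_ISSUE:
            raise ValueError("more than 6 ops without a stop bit")
        if stop:
            groups.append(current)
            current = []
    if current:
        raise ValueError("stream ends without a stop bit")
    return groups
```

For example, `issue_groups([(1, False), (2, True), (3, True)])` yields two groups, `[1, 2]` and `[3]`, because the stop bit marks the last op issued together.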
BOA ISA (2) BOA ISA (2)
64 64 Integer Registers Integer Registers 64 64 Float Registers Float Registers 16 16 4-bit 4-bit Condition Registers Condition Registers Branches take Branches take 1 1 cycle: cycle:
Branch mispredicts cost Branch mispredicts cost 7 7 cycles cycles Static branch pred ( Static branch pred (using interpreter stats using interpreter stats) ) At most one branch per cycle At most one branch per cycle

PowerPC State and Precise Exceptions
[Figure: diagram labeled PowerPC Regs, Shadow Regs, and Scratch Regs.]

BOA Latencies
- Integer ops take 1 cycle
  - No bypass => Dependent ops must be 2 cycles apart
- LOADs take 3 cycles
  - No bypass => Dependent ops must be 4 cycles later
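
Because these latencies are fixed, a software scheduler can place dependent ops at known distances. The sketch below computes the earliest issue cycle of each op from dependences alone, using the 2-cycle and 4-cycle distances from the slide. It deliberately ignores issue width and functional unit limits, and the tuple format and names are assumptions for illustration.

```python
# Minimum issue distance from a producer to a dependent op (from the slide):
# integer result usable 2 cycles later, loaded value 4 cycles later.
MIN_DISTANCE = {"int": 2, "load": 4}


def earliest_cycles(ops):
    """ops: list of (name, kind, sources) in program order.

    Returns {name: earliest issue cycle}, considering only data dependences.
    """
    cycle, kind_of = {}, {}
    for name, kind, sources in ops:
        ready = 0
        for src in sources:
            if src in cycle:  # value produced earlier in this sequence
                ready = max(ready, cycle[src] + MIN_DISTANCE[kind_of[src]])
        cycle[name] = ready
        kind_of[name] = kind
    return cycle


if __name__ == "__main__":
    program = [
        ("a", "load", []),        # a = LOAD ...
        ("b", "int", ["a"]),      # b = a + 1
        ("c", "int", ["b"]),      # c = b + b
        ("d", "int", []),         # independent op
    ]
    print(earliest_cycles(program))
```

In this example the op that uses the loaded value cannot issue before cycle 4, and its own dependent not before cycle 6, while the independent op can fill an earlier slot.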

BOA Resources
- 6 Issue Slots
- 2 LOAD / STORE units
  - Each with own copy of register file
- 4 Integer units
  - Each with own copy of register file
- 2 Float units
- 1 Branch unit
- 32-entry Load and Store Buffers
- Register scoreboarding of LOAD values
  - Stall when try to use loaded value

BOA Dynamic Optimization
- BOA's software optimizer originates with IBM's earlier DAISY project.
- BOA adjusted and tuned optimizer:
  - To support a narrower, higher frequency target machine.
  - To optimize along single hyperblock paths, instead of tree region with multiple paths.
  - Improves code packing, reduces TLB misses
  - Improves code layout and helps IFetch, a la trace caches.

Dynamic Optimization Environments
Dynamic Optimization can be used in a variety of environments:
- Process level
  - Idealized virtual memory
  - Fewer difficult system/kernel code issues
- Operating system level
  - No modifications to operating system
  - More transparent
  - Less danger of compatibility issues

Dynamic Optimization Targets (1)
Simpler implementation of the same architecture
- Ability to bail out and revert to native execution:
  - If overhead too high
  - For hard to emulate sequences
  - When no benefit of DO can be measured, or DO actually degrades performance

Dynamic Optimization Targets (2)
Different architecture, e.g., RISC => VLIW
- Drastically simplify architecture
- Reduce decoding overhead even further
- Add more registers, add new concepts
- All code must be emulated. Can cause severe degradation if low reuse, e.g. WinStone.
- Get benefits of code packing

Some Optimizations
- Code packing
- Register Port arbitration
- Exploit novel architecture concepts
- Improve predictability of execution path by code layout
- Eliminate performance-degrading ops
  - Avoid use of complex idioms, e.g. condition register broadside read/write
  - Replace with easier to schedule/execute

Code Packing - Software Based Trace Caching
Code packing:
- Application-directed code compaction
- Similar concept to hardware trace cache
- Much simpler to implement
- Increase effectiveness of ICache and ITLB
- Very helpful in HP Dynamo performance

Dynamic optimization and architecture styles

Technique                 OOO                  DO+OOO                  DO+IO                   DO+VLIW
ISA                       base ISA             base ISA                base ISA                new ISA
general                   too complex          DO optimizes            DO optimizes            DO optimizes
path-predictive fetching  I fetch prediction   DO improves prediction  DO improves prediction  DO improves prediction
code compaction           trace cache          DO performs layout      DO performs layout      DO performs layout
select insns to issue     wakeup/select logic  wakeup/select logic     DO adapts at ...        DO adapts at ...
precise exceptions        register renaming    register renaming       SW recovery code        SW recovery + HW support
complex insns             decoder cracks       decoder cracks          DO or HW                DO cracks and layers
form issue groups         select logic         select logic            issue logic             DO groups packets

BOA and DAISY Differences (1)

BOA                                      DAISY
PowerPC ops from single path.            PowerPC ops from multiple paths.
6 Issue                                  8-16 Issue
Ops assigned to FU's in pipeline         Mini-Icache maps fixed cache locations to FU's
Stall-on-use                             Stall-on-miss
Memop sequence #'s, Address Comparators  Load-Verify Instructions

BOA and DAISY Differences (2)

BOA                          DAISY
Predicated bundles of 3 ops  Tree instructions
1 branch per cycle           Up to 3 branches per cycle
Branch prediction            Encode successor cache line in instruction => Fetch known ins each cycle

Additional Hardware Simplification from Dynamic Optimization
Dynamic optimizer:
- Limits number of register read/write ports, with little performance effect.
- Handles PowerPC quirks, e.g.
  - Condition register used as 32-bit value, 8x4-bit values, and as 32x1-bit values. Renaming must account for all cases.
  - PowerPC addressing treats register R0 as literal 0. BOA addressing treats all registers uniformly.
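
The three views of the PowerPC condition register can be illustrated with plain bit arithmetic. The sketch uses PowerPC numbering, where field 0 and bit 0 are the most significant. It shows why a write through one view is visible through the others; it is not BOA's renaming scheme, and the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def cr_field(cr, n):
    """4-bit field n (0..7) of the 32-bit CR; field 0 is the most significant."""
    return (cr >> (28 - 4 * n)) & 0xF


def cr_bit(cr, k):
    """Single bit k (0..31) of the CR; bit 0 is the most significant."""
    return (cr >> (31 - k)) & 1


def set_cr_field(cr, n, value):
    """Return the 32-bit CR with field n replaced by the low 4 bits of value."""
    shift = 28 - 4 * n
    return ((cr & ~(0xF << shift)) | ((value & 0xF) << shift)) & MASK32


if __name__ == "__main__":
    cr = set_cr_field(0, 7, 0b0010)          # write through the 8x4-bit view
    print(hex(cr), cr_field(cr, 7), cr_bit(cr, 30))
```

A write to one 4-bit field changes the 32-bit value and four of the single-bit values, which is why the slide notes that renaming must account for all cases.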

Speculative Load Support
- Use ctr to assign sequence number to each LOAD and STORE in a group. Sequence number part of opcode
- On STORE, hardware checks
  - STORE addr overlaps a prev LOAD addr
  - Prev LOAD addr has higher sequence number than STORE
- If aliasing:
  - Rollback group to start and re-execute
  - Possibly retranslate to unspeculate LOAD

Speculative Load Support (1)
Use ctr to assign sequence number to each LOAD and STORE in a group. Sequence number part of opcode:

PowerPC Code    BOA Group
LOAD X          1 LOAD X
...             3 LOAD Z
STORE Y         2 STORE Y
...             ...
LOAD Z
...

Speculative Load Support (2)
- STORE addr overlaps a prev LOAD addr
- Prev LOAD addr has higher sequence number than STORE

BOA Group
1 LOAD X
3 LOAD Z
2 STORE Y
...
Z aliases with Y
Seq #3 > Seq #2

Speculative Load Support (3)
If aliasing:
- Rollback group to start and re-execute
- Possibly retranslate to unspeculate LOAD
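
The store-time check can be sketched in software as follows. This models only the two conditions on the slides (address overlap, and an already executed LOAD with a higher sequence number than the STORE). The `MemOp` record, the 4-byte default access size, and the function names are assumptions for illustration, not the hardware design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    kind: str       # "LOAD" or "STORE"
    seq: int        # sequence number = position in original PowerPC order
    addr: int
    size: int = 4   # assumed access size in bytes


def overlaps(a, b):
    """True if the byte ranges of two memory ops intersect."""
    return a.addr < b.addr + b.size and b.addr < a.addr + a.size


def find_violation(group):
    """Walk memory ops in BOA (scheduled) order.

    Returns (store, load) for the first STORE that overlaps an already
    executed LOAD with a higher sequence number, else None. A violation
    means the group must be rolled back to its start and re-executed.
    """
    executed_loads = []
    for op in group:
        if op.kind == "LOAD":
            executed_loads.append(op)
            continue
        for load in executed_loads:
            if load.seq > op.seq and overlaps(load, op):
                return op, load
    return None


if __name__ == "__main__":
    X, Y = 0x100, 0x200
    Z = Y  # Z aliases with Y, as in the slide's example
    group = [MemOp("LOAD", 1, X), MemOp("LOAD", 3, Z), MemOp("STORE", 2, Y)]
    print(find_violation(group))
```

In the slide's example the LOAD of Z carries sequence number 3 but runs before the STORE with sequence number 2, so the check reports that pair; if Z and Y do not overlap, it reports nothing and the speculation succeeds.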

BOA Pipelines
[Figure: pipeline diagrams for Integer, LOAD, and STORE ops. Stages named in the figure: Fetch 1, Fetch 2, Decode, GPR Rd, Issue, Execute, AGEN, TLB, Mem 1, Mem 2, S-CAM, Broadcast, Writeback.]

Recirculation
High Frequency => Not send global stall signals. Recirculate Ins instead.
[Figure, animated over several slides: pipeline with stages Fetch 1, Fetch 2, Decode, GPR Rd, Issue, Execute, Broadcast, Writeback and a Recirculation Buffer. Labels: Input Regs Ready, Recirculate, Quash.]

BOA Caches

            Size  Line Size  Assoc  Hit Latency
L1 - Ins    256K  256        4      1
L1 - Data   64K   128        2      4
L2 - Joint  4M    128        8      14
Memory                              90

Benchmarks
- SPECint95
- TPC-C
SPECint95 Sampling Method
- Uniformly Sampled PowerPC Traces
- 2 million instructions per sample
- 50 samples per benchmark
TPC-C Sampling Method
- Special-purpose hardware
- 170 million instruction trace

BOA Baseline CPI
[Figure: stacked bar chart of CPI (axis 0.2 to 1.8) for li, perl, 88k, go, ijpeg, vortex, gcc, compr, tpcc. Components: TLB, L2, L1-D, L1-I, Interp, Xlate + DynOpt, Branch, Excep, Base.]

Branch Misprediction Rates
[Figure: bar chart (axis 5 to 30) for compress, gcc, go, ijpeg, li, m88k, perl, vortex, tpcc, mean. Legend: Basic Block Profiling; Path Prediction; Oracle Prediction, BBlk Profiling; Dynamic Predictor.]

BOA and Other Dynamic Optimization Approaches
- Produces code for a different underlying architecture, unlike Dynamo.
- Wider issue, higher frequency, more performance oriented design than Transmeta.
- Runs whole architecture and not dependent ... FX!32.
- Unlike Java JITs:
  - High Level Language Independent
  - But not portable across platforms

Conclusions
- BOA uses dynamic optimization to simplify processor design.
- BOA achieves good performance.
- Cleaned up paper available at:
  www.research.ibm.com/vliw/Pdf/wced02.pdf
  www.research.ibm.com/people/m/mikeg/papers/2002_wced.pdf