Inherently Lower Complexity - PowerPoint PPT Presentation




SLIDE 1

Inherently Lower Complexity Architectures using Dynamic Optimization

Michael Gschwind
Erik Altman
IBM T.J. Watson Research Center

SLIDE 2

What is the Problem?

Out-of-order superscalars achieve high performance... but at the cost of high hardware complexity:

  Predictors
  Complex decode
  Complex issue queues with wakeup and issue logic
  Register mapping tables
  ...

SLIDE 3

What is the Problem?

Out-of-order superscalars achieve high performance... but at the cost of high power.

  Many out-of-order components operate every cycle.
  Many components query a large set of data to operate on a single element.

SLIDE 4

What is the Problem?

Out-of-order superscalars achieve high performance... but at the cost of deep pipelines.

  Complex logic has long latency.
  To achieve high frequency with long latency, super-pipelining is required.
  Deep pipelines require excellent branch predictors.
  Excellent branch predictors are complex.
  Complex logic has long latency ...

SLIDE 5

What is the Problem?

Out-of-order superscalars achieve high performance... but at the cost of high verification and debug complexity.

With Moore's Law, schedule slips = performance slips:

  Schedule Slip    Relative Performance
   1 month           4%
   3 months         12%
   6 months         26%
   9 months         41%
  12 months         59%
  18 months        100%
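The table follows from treating Moore's Law as a performance doubling every 18 months: shipping m months late forgoes a factor of 2^(m/18). A minimal sketch of that arithmetic (the 18-month doubling period is an assumption consistent with the table, not stated on the slide):

```python
# Relative performance forgone by an m-month schedule slip, assuming
# performance doubles every `doubling_period` months (Moore's Law).
def slip_penalty(months, doubling_period=18):
    """Fraction of performance lost by shipping `months` late."""
    return 2 ** (months / doubling_period) - 1

for m in (1, 3, 6, 9, 12, 18):
    print(f"{m:2d} months: {slip_penalty(m):.0%}")
```

Rounding each value to the nearest percent reproduces the table above (4%, 12%, 26%, 41%, 59%, 100%).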

SLIDE 6

What is the Solution?

Software Dynamic Optimization allows reduced hardware complexity:

  Shorter pipelines for the same frequency.
  Fewer hardware predictors.
  Simpler issue logic.
  Less power, a la Transmeta.
  Less debug and verification.
  Smaller chips and higher yield.

SLIDE 7

How to Implement the Solution

BOA: Architecture for Complexity-Effective Design
BOA = Binary translation Optimized Architecture
BOA, in combination with its dynamic optimization software, is architecturally compatible with PowerPC.

SLIDE 8

What is interesting about BOA?

  Software dynamic optimization.
  Precise behavior on most memory faults.
  Load/Store order tables ensure memory semantics and allow aggressive dynamic software reordering.
  Instruction recirculation mechanism to simplify issue and exception handling.
  Predictable latencies handled by software, unpredictable by hardware.

SLIDE 9

BOA System Architecture

[Flowchart:]
  Interpret ins X (PowerPC).
  X previously translated (entry point)? Yes -> Exec group X's BOA translation.
  No -> Update statistics. Seen X 15 times? No -> Goto next ins.
  Yes -> Form group at X and translate ins to BOA instructions.
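The flowchart can be sketched as a translate-or-interpret loop. This is a minimal illustration: the 15-execution threshold comes from the slide, while the data structures and function names are assumptions for illustration only.

```python
# Sketch of BOA's translate-or-interpret dispatch loop (slide 9).
TRANSLATE_THRESHOLD = 15   # executions before a group is formed (from slide)

translations = {}   # entry address -> translated BOA group (assumed structure)
exec_count = {}     # entry address -> times interpreted so far

def step(x, interpret, translate_group):
    """Dispatch the instruction at address x once."""
    if x in translations:
        return translations[x].execute()        # run the BOA translation
    exec_count[x] = exec_count.get(x, 0) + 1    # update statistics
    if exec_count[x] >= TRANSLATE_THRESHOLD:
        translations[x] = translate_group(x)    # form group at x, translate
    return interpret(x)                         # interpret one PowerPC ins
```

After a hot entry point has been interpreted 15 times it is translated once, and every later visit runs the cached BOA group instead of the interpreter.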

SLIDE 10

BOA ISA (1)

  BOA is a variable-length VLIW machine.
  BOA instructions (bundles) are 128 bits.
  Bundles have 3 primitive ops.
  Primitive ops have 39 bits plus a stop bit.
  Complex PowerPC ops are cracked.
  8 bits of each bundle are reserved for future uses such as predication.

Instruction Issue:
  Up to 6 primitive ops are issued together.
  Only the last op issued may have its stop bit set.
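The bundle format above accounts for all 128 bits; a quick arithmetic check (the layout is exactly as the slide states, only the constant names are ours):

```python
# Bundle layout check: 3 primitive ops of 39 bits, each with a 1-bit stop
# bit, plus 8 reserved bits (e.g. for predication) fill a 128-bit bundle.
OP_BITS, STOP_BITS, OPS_PER_BUNDLE, RESERVED_BITS = 39, 1, 3, 8
bundle_bits = OPS_PER_BUNDLE * (OP_BITS + STOP_BITS) + RESERVED_BITS
print(bundle_bits)  # 128
```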

SLIDE 11

BOA Instructions

SLIDE 12

BOA ISA (2)

  64 Integer Registers
  64 Float Registers
  16 4-bit Condition Registers
  Branches take 1 cycle:
    Branch mispredicts cost 7 cycles
    Static branch prediction (using interpreter stats)
    At most one branch per cycle

SLIDE 13

PowerPC State and Precise Exceptions

[Figure: PowerPC Regs, Shadow Regs, Scratch Regs]

SLIDE 14

BOA Latencies

  Integer ops take 1 cycle
    No bypass => dependent ops must be 2 cycles apart
  LOADs take 3 cycles
    No bypass => dependent ops must be 4 cycles later
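The two spacing rules above follow one pattern: with no bypass network, a consumer waits the producer's latency plus one extra cycle for the result to reach the register file. A tiny illustrative calculation (the +1 no-bypass interpretation is our reading of the slide, and the names are assumptions):

```python
# Earliest cycle a dependent op can issue, given BOA's latencies and no
# bypass network: producer latency + 1 extra cycle through the register file.
LATENCY = {"int": 1, "load": 3}   # cycles, from the slide
NO_BYPASS_PENALTY = 1             # assumed extra writeback cycle

def earliest_consumer_cycle(producer_cycle, producer_kind):
    return producer_cycle + LATENCY[producer_kind] + NO_BYPASS_PENALTY

print(earliest_consumer_cycle(0, "int"))   # 2 -> "2 cycles apart"
print(earliest_consumer_cycle(0, "load"))  # 4 -> "4 cycles later"
```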

SLIDE 15

BOA Resources

  6 Issue Slots
  2 LOAD/STORE units
    Each with own copy of register file
  4 Integer units
    Each with own copy of register file
  2 Float units
  1 Branch unit
  32-entry Load and Store Buffers
  Register scoreboarding of LOAD values
    Stall when trying to use loaded value

SLIDE 16

Dynamic Optimization

SLIDE 17

BOA Dynamic Optimization

BOA's software optimizer originates with IBM's earlier DAISY project.
BOA adjusted and tuned the optimizer:

  To support a narrower, higher-frequency target machine.
  To optimize along single hyperblock paths, instead of tree regions with multiple paths.
    Improves code packing, reduces TLB misses
    Improves code layout and helps IFetch, a la trace caches.

SLIDE 18

Dynamic Optimization Environments

Dynamic optimization can be used in a variety of environments:

  Process level
    Idealized virtual memory
    Fewer difficult system/kernel code issues
  Operating system level
    No modifications to operating system
    More transparent
    Less danger of compatibility issues

SLIDE 19

Dynamic Optimization Targets (1)

Simpler implementation of the same architecture

  Ability to bail out and revert to native execution:
    If overhead too high
    For hard-to-emulate sequences
    When no benefit of DO can be measured, or DO actually degrades performance

SLIDE 20

Dynamic Optimization Targets (2)

Different architecture, e.g., RISC => VLIW

  Drastically simplify architecture
  Reduce decoding overhead even further
  Add more registers, add new concepts
  All code must be emulated. Can cause severe degradation if low reuse, e.g. WinStone.
  Get benefits of code packing

SLIDE 21

Some Optimizations

  Code packing
  Register port arbitration
  Exploit novel architecture concepts
  Improve predictability of execution path by code layout
  Eliminate performance-degrading ops
  Avoid use of complex idioms, e.g. condition register broadside read/write
    Replace with easier-to-schedule/execute ops
SLIDE 22

Code Packing - Software-Based Trace Caching

Code packing:

  Application-directed code compaction
  Similar concept to hardware trace cache
  Much simpler to implement
  Increases effectiveness of ICache and ITLB
  Very helpful to HP Dynamo performance

SLIDE 23

Dynamic optimization and architecture styles

  Technique                 OOO                  DO+OOO               DO+IO                    DO+VLIW
  ISA                       base                 base                 base                     new
  general optimizations     too complex          DO optimizes         DO optimizes             DO optimizes
  path-predictive fetching  I fetch prediction   DO improves pred.    DO improves pred.        DO improves pred.
  code compaction           trace cache          DO performs layout   DO performs layout       DO performs layout
  select insns to issue     wakeup/select logic  wakeup/select logic  DO adapts at exec. time  DO adapts at exec. time
  precise exceptions        register renaming    register renaming    SW recovery code         SW recovery + HW support
  complex insns             decoder cracks       decoder cracks       DO or HW                 DO cracks and layers
  form issue groups         select logic         select logic         issue logic              DO groups packets

SLIDE 24

BOA Processor

SLIDE 25

BOA and DAISY Differences (1)

BOA:
  PowerPC ops from single path
  6 issue
  Ops assigned to FU's in pipeline
  Stall-on-use
  Memop sequence #'s, address comparators

DAISY:
  PowerPC ops from multiple paths
  8-16 issue
  Mini-ICache maps fixed cache locations to FU's
  Stall-on-miss
  Load-Verify instructions

SLIDE 26

BOA and DAISY Differences (2)

BOA:
  Predicated bundles of 3 ops
  1 branch per cycle
  Branch prediction

DAISY:
  Tree instructions
  Up to 3 branches per cycle
  Encode successor cache line in instruction => fetch known ins each cycle

SLIDE 27

Additional Hardware Simplification from Dynamic Optimization

Dynamic optimizer:

  Limits number of register read/write ports, with little performance effect.
  Handles PowerPC quirks, e.g.:
    Condition register used as a 32-bit value, as 8x4-bit values, and as 32x1-bit values. Renaming must account for all cases.
    PowerPC addressing treats register R0 as literal 0. BOA addressing treats all registers uniformly.

SLIDE 28

Speculative Load Support

Use a ctr to assign a sequence number to each LOAD and STORE in a group. The sequence number is part of the opcode.

On STORE, hardware checks whether:
  STORE addr overlaps a prev LOAD addr
  Prev LOAD addr has a higher sequence number than the STORE

If aliasing:
  Rollback group to start and re-execute
  Possibly retranslate to unspeculate the LOAD

SLIDE 29

Speculative Load Support (1)

Use a ctr to assign a sequence number to each LOAD and STORE in a group. The sequence number is part of the opcode:

  PowerPC Code     BOA Group
    LOAD X           1 LOAD X
    ...              3 LOAD Z
    STORE Y          2 STORE Y
    ...              ...
    LOAD Z
    ...

SLIDE 30

Speculative Load Support (2)

On STORE, hardware checks whether:
  STORE addr overlaps a prev LOAD addr
  Prev LOAD addr has a higher sequence number than the STORE

  BOA Group
    1 LOAD X
    3 LOAD Z
    2 STORE Y
    ...

  Z aliases with Y
  Seq #3 > Seq #2
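The check the hardware performs on each STORE can be sketched in a few lines. This is an illustrative model, not BOA's implementation: the data structures and function names are assumptions, and "overlaps" is simplified to address equality.

```python
# Sequence-number alias check (slides 28-30): a STORE conflicts with a
# speculatively hoisted LOAD when their addresses overlap and the LOAD's
# sequence number is HIGHER (i.e., the LOAD was later in program order
# but was moved above the STORE by the dynamic optimizer).
def store_conflicts(store_seq, store_addr, prior_loads):
    """prior_loads: (seq, addr) pairs for LOADs already executed in the group."""
    return any(addr == store_addr and seq > store_seq
               for seq, addr in prior_loads)

# Slide 30's example: LOAD Z (seq 3) was hoisted above STORE Y (seq 2).
# If Z aliases with Y, the check fires: rollback the group and re-execute.
loads = [(1, "X"), (3, "Y")]           # LOAD X; hoisted LOAD Z aliasing Y
print(store_conflicts(2, "Y", loads))  # True -> rollback
```

A LOAD with a lower sequence number than the STORE was earlier in program order anyway, so it never triggers a rollback even when the addresses match.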

SLIDE 31

Speculative Load Support (3)

If aliasing:
  Rollback group to start and re-execute
  Possibly retranslate to unspeculate the LOAD

SLIDE 32

BOA Pipelines

[Figure: Integer pipeline (Fetch 1, Fetch 2, Decode, GPR Rd, Issue, Execute, Broadcast, Writeback) alongside LOAD and STORE pipelines that add AGEN, TLB, Mem 1, Mem 2, and S-CAM stages.]

SLIDE 33

Recirculation

[Figure: pipeline (Fetch 1, Fetch 2, Decode, GPR Rd, Issue, Execute, Broadcast, Writeback) with a Recirculation Buffer]

High frequency => do not send global stall signals. Recirculate ins instead.
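The idea can be modeled as a small queue: instead of stalling the whole pipeline, an op whose input registers are not ready is quashed at issue and fed back through the recirculation buffer to try again later. A toy sketch under that reading (all names and the queue discipline are illustrative assumptions, not BOA's hardware):

```python
from collections import deque

# Toy recirculation model: ops flow from a buffer toward issue; an op with
# unready inputs is quashed and recirculated rather than stalling everyone.
def run(ops, ready):
    """ops: (name, input_regs) pairs in fetch order; ready: available regs.
    Returns the order in which ops finally issue."""
    buffer, issued = deque(ops), []
    while buffer:
        name, inputs = buffer.popleft()
        if all(r in ready for r in inputs):
            issued.append(name)            # input regs ready: issue
            ready.add(name)                # its result becomes available
        else:
            buffer.append((name, inputs))  # quash, recirculate, retry later
    return issued

# 'b' depends on 'a'; it recirculates once and issues after 'a' completes.
print(run([("b", {"a"}), ("a", set())], ready=set()))
```

The payoff is the one stated on the slide: no global stall signal needs to reach every stage in a single cycle, which matters at high frequency.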

SLIDES 34-48

Recirculation (animation)

[Figure: successive frames of the same pipeline diagram. An op whose input regs are not ready is quashed at issue, placed in the Recirculation Buffer, recirculated, and finally issued once its input regs are ready.]

SLIDE 49

BOA Caches

              Size   Line Size   Assoc   Hit Latency
  L1 - Ins    256K   256         4        1
  L1 - Data   64K    128         2        4
  L2 - Joint  4M     128         8       14
  Memory                                 90

SLIDE 50

Benchmarks

Benchmarks: SPECint95, TPC-C

SPECint95 sampling method:
  Uniformly sampled PowerPC traces
  2 million instructions per sample
  50 samples per benchmark

TPC-C sampling method:
  Special-purpose hardware
  170 million instruction trace

SLIDE 51

BOA Baseline CPI

[Figure: CPI breakdown (Base, Excep, Branch, Xlate + DynOpt, Interp, L1-I, L1-D, L2, TLB) for li, perl, m88k, go, ijpeg, vortex, gcc, compress, and tpcc; CPI axis from 0.2 to 1.8]

SLIDE 52

Branch Misprediction Rates

[Figure: misprediction rates (5-30%) for compress, gcc, go, ijpeg, li, m88k, perl, vortex, tpcc, and mean, comparing Basic Block Profiling, Path Prediction, Oracle Prediction with BBlk Profiling, and a Dynamic Predictor]

SLIDE 53

CPI for Block-Structured ISA

SLIDE 54

BOA and Other Dynamic Optimization Approaches

  Produces code for a different underlying architecture, unlike Dynamo.
  Wider issue, higher frequency, more performance-oriented design than Transmeta.
  Runs the whole architecture and is not dependent on the OS, unlike FX!32.
  Unlike Java JITs:
    High-level-language independent
    But not portable across platforms

SLIDE 55

Conclusions

  BOA uses dynamic optimization to simplify processor design.
  BOA achieves good performance.

SLIDE 56

Conclusions

  BOA uses dynamic optimization to simplify processor design.
  BOA achieves good performance.
  Cleaned-up paper available at:

    www.research.ibm.com/vliw/Pdf/wced02.pdf
  or
    www.research.ibm.com/people/m/mikeg/papers/2002_wced.pdf