A Superscalar Out-of-Order x86 Soft Processor for FPGA
June 5, 2019 Stanford University EE380
Henry Wong University of Toronto, Intel
henry@stuffedcow.net
Hi!
– CPU architect, Intel Hillsboro
– Ph.D., University of Toronto
Today:
– Motivation
– High-level design and results
– Microarchitecture details and some circuits
– 20x area, 20x delay cost
– Circuit building blocks are somewhat bigger than logic gates
[Figure: three 6-LUT blocks]
– Often running on a soft processor
– Parallel code and hardware accelerators need effort
– Less effort if soft processors got faster
– Altera Nios II/f: 1100 ALUT (0.2% of Stratix IV), 240 MHz
– Xilinx MicroBlaze: 2100 LUT (0.3% of Virtex-7), 246 MHz
– Thread: Multi-core, multi-thread
– Data: Vectors, SIMD
– Instruction: Pipelining, out-of-order
– Increase instructions per clock cycle (IPC)
– Without decreasing clock cycles per second (MHz)
– And at an affordable FPGA hardware cost
– Length decoding
– Self-modifying code
– Two destination operands (MUL, DIV)
– ...but most hard features are not x86-specific
– But is RISC even the right clean-slate design?
– Benchmarks
– OS, compiler, development tools
– Web browser, JavaScript JIT, ...
– Software in unexpected places: VGA BIOS
– Reasonable per-clock performance (~2× vs. Nios II/f)
– 300 MHz (25% higher than Nios II/f on Stratix IV)
– Binary compatible: OS, dev tools, user programs
– Both system and user mode
– Physical Register File: Reduce SRAM ports, CAM size
– Low-associativity caches: Reduce multiplexers, CAMs
8b 06 03 46 04 89 46 08
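Read as 32-bit x86 code, these eight bytes are three instructions: `mov eax, [esi]`, `add eax, [esi+4]`, `mov [esi+8], eax`. The sketch below shows the length-decoding step for just this case, assuming 32-bit operand and address sizes and no prefix bytes. The function names and the three-entry opcode table are illustrative only and are not taken from the processor's decoder.

```python
# Minimal length decoder for a tiny subset of 32-bit x86 (illustrative only).
# Covers one-byte opcodes that take a ModRM byte and no immediate operand.
MODRM_OPCODES = {
    0x8B: "mov r32, r/m32",
    0x03: "add r32, r/m32",
    0x89: "mov r/m32, r32",
}

def insn_length(code, pos):
    """Return the length in bytes of the instruction starting at code[pos]."""
    opcode = code[pos]
    if opcode not in MODRM_OPCODES:
        raise ValueError(f"opcode {opcode:#04x} is not covered by this sketch")
    modrm = code[pos + 1]
    mod, rm = modrm >> 6, modrm & 7
    length = 2                          # opcode byte + ModRM byte
    if mod != 3 and rm == 4:            # a SIB byte follows ModRM
        length += 1
        if mod == 0 and (code[pos + 2] & 7) == 5:
            length += 4                 # SIB with no base register: disp32
    if mod == 1:
        length += 1                     # 8-bit displacement
    elif mod == 2 or (mod == 0 and rm == 5):
        length += 4                     # 32-bit displacement
    return length

def split_instructions(code):
    """Split a byte string into instructions by decoding lengths in sequence."""
    out, pos = [], 0
    while pos < len(code):
        n = insn_length(code, pos)
        out.append(code[pos:pos + n])
        pos += n
    return out

code = bytes.fromhex("8b 06 03 46 04 89 46 08")
for insn in split_instructions(code):
    print(insn.hex(" "), "->", MODRM_OPCODES[insn[0]])
```

Each instruction's start depends on the length of the one before it, which is why decoding several x86 instructions per cycle is harder than for a fixed-length encoding.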
– 4400 ALM (6.5×), 245 MHz (0.82×)
* Area estimate for partial or unimplemented circuit
[Bar chart, axis 0.0 to 3.0, with a "Slower" direction label. Labels and values are paired in extraction order; this work = 1.00.]
VIA C3, 550 MHz: 2.68
Nios II/f, 100 MHz: 2.73
Pentium, 200 MHz: 2.06
Atom (Bonnell), 1600 MHz: 1.63
AMD K6, 166 MHz: 1.46
Pentium 4, 2800 MHz: 1.58
ARM Cortex-A9, 800 MHz: 1.42
Pentium Pro, 233 MHz: 1.26
This work, ~200 MHz: 1.00
Atom (Silvermont), 2400 MHz: 0.99
VIA Nano, 1000 MHz: 0.87
Opteron K8, 2800 MHz: 0.91
AMD Piledriver, 3500 MHz: 0.68
Core 2 Q9550, 3400 MHz: 0.56
Haswell, 4300 MHz: 0.44
– Wall-clock: 2.23×
– Also 8 KB/256 KB cache – 3-way OoO
– Also 2-way OoO – 32 KB/2 MB cache
– Both user and system modes
– 7% Stratix IV or 1.3% Stratix 10
– Per-clock: ~2.7×
– Frequency: ~0.8×
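The per-clock and frequency ratios combine multiplicatively, since run time is cycles divided by clock frequency. A small worked check follows, using the rounded ratios from this slide and the more precise 2.73 and 0.82 that appear elsewhere in the deck. The function name is only illustrative.

```python
def wall_clock_speedup(per_clock_ratio, frequency_ratio):
    """Speedup in run time = (work per cycle ratio) x (cycles per second ratio)."""
    return per_clock_ratio * frequency_ratio

# Rounded ratios vs. Nios II/f as quoted on this slide.
print(f"{wall_clock_speedup(2.7, 0.8):.2f}x")

# Less rounded ratios that appear elsewhere in the deck.
print(f"{wall_clock_speedup(2.73, 0.82):.2f}x")
```

Both results land near the 2.2× wall-clock figure quoted for SPECint2000; the small differences come from rounding of the inputs.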
– LUT granularity
Pipeline stages: Fetch, Decode, Rename, Schedule, Execute, Memory, Commit
[Figure: histogram of instruction length (bytes, 1 to 15) with frequency and cumulative frequency. Labeled frequencies, in bin order: 0.08, 0.27, 0.36, 0.07, 0.05, 0.12, 0.04, 0.02, 1E-3, 3E-3, 9E-5, 3E-11]
– 3.4 B per instruction
– Fetch 8 B/cycle
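The 3.4 B average can be reproduced from the histogram labels. The sketch below assumes the first eight labeled frequencies belong to lengths of 1 to 8 bytes and ignores the rare longer instructions; that mapping is an inference from the extracted labels, not something the slide states.

```python
# Frequencies read off the instruction-length histogram.
# Assumption: the first eight labeled values map to lengths 1..8 bytes;
# the remaining labels (each 3E-3 or smaller) are left out.
length_freq = {1: 0.08, 2: 0.27, 3: 0.36, 4: 0.07,
               5: 0.05, 6: 0.12, 7: 0.04, 8: 0.02}

def mean_length(freq):
    """Frequency-weighted mean, normalized in case the labels do not sum to 1."""
    total = sum(freq.values())
    return sum(length * f for length, f in freq.items()) / total

FETCH_BYTES_PER_CYCLE = 8   # fetch width from the slide

avg = mean_length(length_freq)
insns_per_fetch = FETCH_BYTES_PER_CYCLE / avg
print(f"average length ~{avg:.1f} B, ~{insns_per_fetch:.1f} instructions per 8 B fetch")
```

On these numbers, an 8-byte fetch supplies a little over two average-length instructions per cycle.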
[Figure: histogram of number of prefix bytes (tick labels 1 to 7+) with frequency and cumulative frequency. Labeled frequencies, in bin order: 0.86, 0.14, 0.0006, 1E-6, 6E-10, 3E-10, 1E-10, 3E-11]
– Prefix bytes uncommon
– Fast decode up to 1
[Figure: histogram of micro-ops per instruction (1 to 15) with frequency and cumulative frequency. Labeled frequencies, in bin order: 0.94, 0.04, 0.02, 1E-4, 2E-4, 4E-5, 3E-5, 1E-4, 7E-6, 5E-5, 8E-9, 1E-5]
– 1 is common case
– Dual-issue up to 2, single-
[Figure: speculative register mapping table (eax, ecx, edx, ebx, esp, ebp, esi, edi, eflags, fpucw, fpusw, tmp0, tmp1) pointing into the physical register file, PRF A]
– ~14 sources/clk
– ~4 destinations/clk
– Each ALU “owns” an RF
[Figure: the speculative register mapping table now points into three physical register files, PRF A, PRF B, and PRF C]
[Figure: a committed register mapping table is added beside the speculative one, each with the same 13 entries, over PRF A, PRF B, and PRF C. Arrow label between the tables: "Copy: pipeline flush"]
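The two mapping tables can be illustrated with a toy renamer. This is a sketch under stated assumptions: it uses a single free list and a single physical register file rather than the PRF A/B/C split, and it takes the "Copy: pipeline flush" arrow to mean that the committed table overwrites the speculative one, which is the conventional arrangement. The class and method names are hypothetical.

```python
class Renamer:
    """Toy register renamer with a speculative and a committed mapping table."""

    def __init__(self, arch_regs, num_phys):
        # Initially, architectural register i lives in physical register i.
        self.spec = {reg: i for i, reg in enumerate(arch_regs)}
        self.committed = dict(self.spec)
        self.free = list(range(len(arch_regs), num_phys))

    def rename(self, dst, srcs):
        """Rename one micro-op. Returns (new_dst_preg, src_pregs, old_dst_preg)."""
        src_pregs = [self.spec[s] for s in srcs]  # read sources before remapping dst
        old = self.spec[dst]
        new = self.free.pop(0)
        self.spec[dst] = new
        return new, src_pregs, old

    def commit(self, dst, new, old):
        """Retire a micro-op: its mapping becomes committed, the old preg is freed."""
        self.committed[dst] = new
        self.free.append(old)

    def flush(self, squashed_pregs):
        """Pipeline flush: copy the committed table over the speculative table."""
        self.spec = dict(self.committed)
        self.free.extend(squashed_pregs)

regs = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi",
        "eflags", "fpucw", "fpusw", "tmp0", "tmp1"]
r = Renamer(regs, num_phys=64)
new1, srcs1, old1 = r.rename("eax", ["eax", "ecx"])   # e.g. add eax, ecx
new2, srcs2, old2 = r.rename("ecx", ["eax"])          # depends on the new eax
r.commit("eax", new1, old1)                           # first micro-op retires
r.flush([new2])                                       # second one is squashed
```

After the flush, `ecx` maps back to its committed physical register while `eax` keeps the mapping that was committed.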
[Figure: rename stage block diagram. Recoverable labels:]
– Inputs: From Decode (μop 0, μop 1), String-op (StringOp × 2), Microcode (μcode × 2), Pause?
– GPR RAT: 13 x 8-bit (64 p.regs), 6r 2w ports, with GPR Forwarding
– SREG RAT: 12 x 4-bit (16 p.regs), 4r 2w ports, with SREG Forwarding
– OCZAPS: 1 x 5-bit (16 p.regs); RF_is_zero: 1 x 1-bit; two blocks labeled "1r 1w ports", each with Forwarding
– Free lists: GPR Free list A, B, C; SREG Free list; OCZAPS Free list A, B
– Choose Operands: forwarding and mux select
– Outputs: ALUop0 [dst,src,src], AGUop0 [dst,src,src,src,seg], ALUop1 [dst,src,src], AGUop1 [dst,src,src,src,seg]
[Figure: IPC (axis 0.1 to 0.9) vs. scheduler capacity in entries (axis 8 to 64)]
– 275 MHz
– Pentium Pro: 20
– Haswell: 60
– Ryzen: 84 (6×14)
– Skylake: 97
– Han-Carlson prefix tree
– Very difficult to code: Synthesizer makes it linear depth again
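For reference, the structure of a Han-Carlson prefix network can be sketched in software. The slide text does not say which operator the tree computes in the scheduler, so the function below takes a generic associative operator; the function name and the stage-counting return value are illustrative. Each list comprehension stands for one level of logic in which all positions update in parallel.

```python
def han_carlson_prefix(xs, op):
    """Inclusive prefix scan of xs under an associative op, Han-Carlson style.

    Returns (prefixes, stages). Each stage reads only the previous stage's
    values, so one stage corresponds to one level of combining nodes.
    """
    n = len(xs)
    v = list(xs)
    stages = 0
    if n > 1:
        # Stage 1: each odd position combines with its even left neighbour.
        v = [op(v[i - 1], v[i]) if i % 2 == 1 else v[i] for i in range(n)]
        stages += 1
    d = 2
    while d < n:
        # Kogge-Stone doubling, restricted to odd positions.
        prev = v
        v = [op(prev[i - d], prev[i]) if (i % 2 == 1 and i >= d) else prev[i]
             for i in range(n)]
        stages += 1
        d *= 2
    if n > 2:
        # Final stage: each even position picks up the prefix from its left neighbour.
        prev = v
        v = [op(prev[i - 1], prev[i]) if (i % 2 == 0 and i >= 2) else prev[i]
             for i in range(n)]
        stages += 1
    return v, stages

# Example: prefix-OR over 8 request bits ("has any earlier entry requested?").
requests = [0, 0, 1, 0, 1, 0, 0, 0]
seen, stages = han_carlson_prefix(requests, lambda a, b: a | b)
print(seen, stages)
```

For a power-of-two width n the stage count is log2(n) + 1, which is the logarithmic depth that is lost if a synthesizer re-associates the operator into a linear chain.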
– Complex, 3+ cycle
– Simple, 1 cycle
– Address generation
– Shifter
– Adder
– Bitwise logic
– Left and right. 32-, 16-, and 8-bit operands
– Rotate, shift, arithmetic shift
– Rotate-through-carry-by-1
– Byte swap (aa bb cc dd
– Sign extension
– 32/16/8-bit NOR gate is 3 LUT levels plus the adder...
– ZF: 3 LUT levels in parallel with adder
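One well-known way to detect a zero sum without waiting for the adder's carry chain is sketched below. The slides only state that ZF takes 3 LUT levels in parallel with the adder, so this identity is an illustration of the idea and not necessarily the circuit used; the function name and parameters are hypothetical. The sum is zero exactly when, at every bit, the carry needed to zero that bit (a XOR b) equals the carry the lower bit would then produce (a OR b), with the carry-in feeding bit 0.

```python
def sum_is_zero(a, b, carry_in=0, width=32):
    """True iff (a + b + carry_in) mod 2**width == 0, computed without an adder."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    needed_carry = a ^ b                                   # carry-in that zeroes each sum bit
    produced_carry = (((a | b) << 1) | carry_in) & mask    # carry each bit would receive
    # The wide equality check plays the role of the NOR-style reduction,
    # but its inputs are available before the adder finishes.
    return needed_carry == produced_carry

# Addition: 1 + 0xFFFFFFFF wraps to 0 in 32 bits.
print(sum_is_zero(1, 0xFFFFFFFF))

# Subtraction a - b is a + ~b + 1, so equality of operands gives ZF = 1.
print(sum_is_zero(5, ~5, carry_in=1))

# Narrower operand sizes only change the mask.
print(sum_is_zero(0x80, 0x80, width=8))
```

All three calls return `True`; the `width` argument covers the 32-, 16-, and 8-bit operand sizes mentioned above.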
– Store: mov [ecx], eax
– Load: mov eax, [ecx]
– For performance (Instruction and data)
– For OS support (TLB and page walking)
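[Figure: Dhrystone and SPECint 2000 bars (axis 0.2 to 1.4) for seven configurations, with an arrow labeled "Bigger L1 cache". Labeled values, read as Dhrystone/SPECint 2000 pairs in extraction order: 0.98/0.46, 0.99/0.60, 1.00/0.74, 1.00/0.85, 1.00/0.91, 1.00/0.94, 1.00/0.97]
– Sensitivity varies with workload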
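[Figure: second Dhrystone and SPECint 2000 chart; the swept parameter is not recoverable. Labeled values, read as Dhrystone/SPECint 2000 pairs in extraction order: 1.00/0.93, 1.00/1.00, 1.00/1.05, 1.00/1.07, 1.00/1.10, 1.00/1.11, 1.00/1.13]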
[Figure: relative IPC vs. in-order blocking (axis 0.9 to 1.35) for Dhrystone and SPECint 2000. Series: Blocking, In-order; Non-blocking, In-order; Non-blocking, Out-of-order. Labeled bar values, in extraction order: 1.01, 1.21, 1.08, 1.32]
– Memory dependence speculation
– Page faults
– TLB Accessed/Dirty bits
– Uncacheable accesses
– Split-cacheline/page
– Stores
– L2 and cache coherence
– TLB lookup
– Cache tag compare
– Cache data rotate (32 B)

– TLB lookup
– 2-to-1 mux (1 bit)
– 2-to-1 mux (32 bit)
– Critical portions hand-mapped to LUTs (blue boxes)
– Many critical paths have logic depth of 5 LUTs
– Nios II/f: TLB+cache circuit is ~6.0 ns (~1.5 stages)
– Area: 6.5× Nios II/f, but affordable
– Performance: 2.2× Nios II/f on SPECint2000
– Adapt microarchitecture to suit circuits
– Or push circuits to make microarchitecture feasible
– LUT-level design is often useful
– Some missing pieces
– Frequency
– Abandoned “RISC” features
– “CISC” features in modern “RISC”
– It’s easy to compile C to instructions.
– It’s easy to crack complex instructions into many micro-ops.
– Fused Multiply-Add: Can’t synthesize from MUL and ADD
– REP MOVS: Must maintain memory ordering
– Binary translation is really hard: Need to keep low-level
– Comparing FPGA vs. custom CMOS and the impact on processor
– Quantifying the gap between FPGA and custom CMOS to aid
– Efficient methods for out-of-order load/store execution for high-
– Microarchitecture and circuits for a 200 MHz Out-of-order soft
– High performance instruction scheduling circuits for out-of-order soft
– High performance instruction scheduling circuits for superscalar out-of-