SLIDE 1

A Superscalar Out-of-Order x86 Soft Processor for FPGA

June 5, 2019 Stanford University EE380

Henry Wong University of Toronto, Intel

henry@stuffedcow.net

SLIDE 2

Hi!

  • CPU architect, Intel Hillsboro
  • Ph.D., University of Toronto
  • Today: x86 OoO processor for FPGA (Ph.D. work)

    – Motivation
    – High-level design and results
    – Microarchitecture details and some circuits

SLIDE 4

FPGA: Field-Programmable Gate Array

  • Is a digital circuit (logic gates and wires)
  • Is field-programmable (at power-on, not in the fab)
  • Pre-fab everything you'll ever need
    – 20× area, 20× delay cost
    – Circuit building blocks are somewhat bigger than logic gates

  [Figure: three 6-LUT blocks joined by programmable routing]
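The 6-LUT blocks in the figure are the FPGA's basic logic element: a 64-entry truth table that can implement any boolean function of up to six inputs, which is why the building blocks are "somewhat bigger than logic gates". A minimal behavioural sketch (illustrative Python model, not vendor tooling):

```python
# A 6-input LUT modelled as a 64-bit truth table. Programming the FPGA
# means filling in the table; evaluating it is a single table lookup.

def make_lut6(func):
    """Precompute the 64-entry truth table of a 6-input boolean function."""
    table = 0
    for i in range(64):
        bits = [(i >> b) & 1 for b in range(6)]
        if func(*bits):
            table |= 1 << i
    return table

def eval_lut6(table, a, b, c, d, e, f):
    """Evaluate the LUT: concatenate the six inputs into a table address."""
    addr = a | (b << 1) | (c << 2) | (d << 3) | (e << 4) | (f << 5)
    return (table >> addr) & 1

# Any 6-input function fits in one LUT -- a 6-input AND and a 3-input XOR
# (unused inputs simply ignored) each cost exactly one lookup.
and6 = make_lut6(lambda a, b, c, d, e, f: a & b & c & d & e & f)
xor3 = make_lut6(lambda a, b, c, d, e, f: a ^ b ^ c)
```

This universality is the "pre-fab everything" trade-off: one LUT replaces any small gate network, at the cost of being larger and slower than the custom gates it emulates.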

SLIDE 5

FPGA Soft Processors

  • FPGA systems often have software components
    – Often running on a soft processor
  • Need more performance?
    – Parallel code and hardware accelerators need effort
    – Less effort if soft processors got faster

SLIDE 8

Current Soft Processors

  • CPUs from the FPGA vendors: small, in-order, 1-way
    – Altera Nios II/f: 1100 ALUT (0.2% of Stratix IV), 240 MHz
    – Xilinx MicroBlaze: 2100 LUT (0.3% of Virtex-7), 246 MHz
  • Some other examples:

    Processor         Type             ALUT    Frequency (Stratix IV)
    Leon 3            In-order 1-way   5600    150 MHz
    OpenRISC OR1200   In-order 1-way   3500    130 MHz
    SPREX             In-order 1-way   1800    150 MHz (Stratix III)
    BOOM (RISC-V)     OoO 2-way        ?       50 MHz (Zynq-7000)
    OPA (RISC-V)      OoO 3-way        12600   215 MHz

SLIDE 11

Faster Soft Processors

  • Faster means extracting parallelism
    – Thread: multi-core, multi-threading
    – Data: vectors, SIMD
    – Instruction: pipelining, out-of-order  ← Least user effort
  • Challenge: What microarchitecture and circuits?
    – Increase instructions per clock cycle (IPC)
    – Without decreasing clock cycles per second (MHz)
    – And at an affordable FPGA hardware cost
  • First OoO hard processors: ~1.5× IPC, ~1.0× frequency

SLIDE 12

ISA: Why not x86

  • A few x86-specific features are really hard
    – Length decoding
    – Self-modifying code
    – Two destination operands (MUL, DIV)
    – ...but most hard features are not x86-specific
  • Existing ISA: can't shift unwanted features to software
  • RISC-V: didn't exist in 2009
    – But is RISC even the right clean-slate design?

SLIDE 14

ISA: Why x86

  • It's easy to implement!
  • The alternative:
    – Benchmarks
      • "Way too much of my time is wasted on corralling benchmarks"
        ― Chris Celio, on porting SPEC to RISC-V, 2015
      • "I'll just use the binaries" ― me
    – OS, compiler, development tools
    – Web browser, JavaScript JIT, ...
    – Software in unexpected places: VGA BIOS
      • Alpha uses x86 emulation in firmware

  … I still can't do any of this

SLIDE 15

Soft Processor Design Goals

  • Performance: two-issue, out-of-order
    – Reasonable per-clock performance (~2× vs. Nios II/f)
    – 300 MHz (25% higher than Nios II/f on Stratix IV)
  • No software rewrite: 32-bit x86 instruction set (P6)
    – Binary compatible: OS, dev tools, user programs
    – Both system and user mode
  • Tested with 16 OSes and >50 user benchmarks
SLIDE 20

How are FPGAs different from CMOS?

  [Chart: FPGA area and delay ratios vs. custom CMOS, log scale (2.2–220), for SRAM (1rw), multiplier, adder, content-addressable memory (CAM), multi-write-port RAM (4r2w), multiplexer, and whole processors; the whole-processor ratio is marked at 22×.]

  • Conclusion: ~conventional processor microarchitecture, but try to avoid expensive circuits
    – Physical register file: reduce SRAM ports, CAM size
    – Low-associativity caches: reduce multiplexers, CAMs

SLIDE 25

Soft Processor Design Methodology

  1. Bochs behavioural simulator
  2. New detailed CPU pipeline model
  3. Verify, optimize microarchitecture
  4. Circuit design, optimization
SLIDE 33

Our Processor's Microarchitecture

  • Fetch
  • Length decode
  • Decode
  • Rename
  • Schedule
  • Execute
  • Commit

  Example fetched bytes: 8b 06 03 46 04 89 46 08

  Main challenges:
  1. Circuit complexity vs. supporting the common case fast and the worst case correctly
  2. Circuit complexity vs. capacity (IPC)
  3. Circuit complexity vs. pipeline latency (IPC)

SLIDE 36

Processor Area and Frequency

  • Compare to Nios II/f with MMU and 32K L1I + 32K L1D
    – 4400 ALM, 245 MHz (this work: 6.5× area, 0.82× frequency)

    Component                  Estimated Area (ALM)   Frequency (MHz)
    Decode *                    6 000                  247
    Renaming                    1 900                  317
    Scheduler *                 4 000                  275
    Register file               2 400                  260
    Execution                   2 300                  240
    Memory system: logic        9 000                  200
    Memory system: caches       5 000
    Commit + ROB *              2 000
    Microcode *                   500
    Total                      28 700                  200

  * Area estimate for partial or unimplemented circuit
  Total is 7% of a Stratix IV. Slide annotations: "Optimize more?" and "OoO stuff" mark parts of the table.

SLIDE 41

Per-clock performance (SPECint2000)

  • Nios II/f: 2.73×
    – Wall-clock: 2.23×
  • Pentium Pro (1995): 1.26×
    – Also 8 KB/256 KB caches
    – 3-way OoO
  • Atom Silvermont (2013): 0.99×
    – Also 2-way OoO
    – 32 KB/2 MB caches
  • Large performance increase vs. Nios II/f
  • Comparable per-clock performance to similar x86 microarchitectures

  [Chart: relative runtime in cycles, normalized to this work = 1.00; larger is slower.]
    Nios II/f 100 MHz           2.73
    VIA C3 550 MHz              2.68
    Pentium 200 MHz             2.06
    Atom (Bonnell) 1600 MHz     1.63
    Pentium 4 2800 MHz          1.58
    AMD K6 166 MHz              1.46
    ARM Cortex-A9 800 MHz       1.42
    Pentium Pro 233 MHz         1.26
    This work ~200 MHz          1.00
    Atom (Silvermont) 2400 MHz  0.99
    Opteron K8 2800 MHz         0.91
    VIA Nano 1000 MHz           0.87
    AMD Piledriver 3500 MHz     0.68
    Core 2 Q9550 3400 MHz       0.56
    Haswell 4300 MHz            0.44

SLIDE 45

Summary 1

  • Designed microarchitecture and circuits for a superscalar out-of-order x86 soft processor
    – Both user and system modes
  • Area: 6.5× Nios II/f, but affordable
    – 7% of Stratix IV or 1.3% of Stratix 10
  • Performance: 2.2× Nios II/f on SPECint2000
    – Per-clock: ~2.7×
    – Frequency: ~0.8×
  • Out-of-order increases soft processor performance without rewriting software
  • x86 is feasible on FPGA
SLIDE 46

Part 2: Pipeline Details

  • Sketch of interesting circuits at each stage
  • Timing budget: ~5 LUT levels (< 3.5 ns)
  • Many circuits designed bottom-up
    – LUT granularity

  Pipeline: Fetch → Length decode → Decode → Rename → Schedule → Execute → Memory → Commit

SLIDE 49

Front end: Fetch-decode

  x86: The worst case is complex, but the common case isn't too bad.

  ICache → Bytes → Instructions → Micro-ops → Renamer

  • Fetch bandwidth
    – 3.4 B per instruction on average
    – Fetch 8 B/cycle
  • Length decode: 8 B/cycle fast path, multi-cycle otherwise
    – Prefix bytes are uncommon
    – Fast decode handles up to 1 prefix
  • Decode into micro-ops
    – 1 micro-op is the common case
    – Dual-issue up to 2 micro-ops, single-issue up to 4

  [Histogram: instruction length (bytes). Lengths of 1–3 bytes cover ~71% of instructions (3 bytes alone is 36%); average 3.4 B.]
  [Histogram: prefix bytes per instruction. 86% have no prefix, 14% have one; two or more are vanishingly rare.]
  [Histogram: micro-ops per instruction. 94% of instructions decode to a single micro-op.]
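The slide's example fetch bytes, 8b 06 03 46 04 89 46 08, decode to mov eax,[esi]; add eax,[esi+4]; mov [esi+8],eax. A toy length decoder covering just this common case — single-byte opcodes with a ModRM byte, no prefixes or immediates; the opcode set and structure here are illustrative, not the thesis's circuit:

```python
# Toy x86 length decoder for the fast-path shape: one-byte opcode +
# ModRM + optional SIB byte + optional displacement. Real length
# decoding must also handle prefixes, two-byte opcodes, immediates, etc.

MODRM_OPCODES = {0x03, 0x89, 0x8B}  # add r,r/m; mov r/m,r; mov r,r/m

def insn_length(code, pos):
    opcode = code[pos]
    if opcode not in MODRM_OPCODES:
        raise NotImplementedError("toy decoder: unsupported opcode")
    length = 2                       # opcode + ModRM
    modrm = code[pos + 1]
    mod = modrm >> 6
    rm = modrm & 7
    if mod != 3 and rm == 4:         # SIB byte follows
        length += 1
    if mod == 1:                     # 8-bit displacement
        length += 1
    elif mod == 2 or (mod == 0 and rm == 5):   # 32-bit displacement
        length += 4
    return length

def split_stream(code):
    """Walk a byte stream, returning the length of each instruction."""
    pos, lengths = 0, []
    while pos < len(code):
        n = insn_length(code, pos)
        lengths.append(n)
        pos += n
    return lengths

# The slide's fetch block: mov eax,[esi]; add eax,[esi+4]; mov [esi+8],eax
stream = bytes([0x8B, 0x06, 0x03, 0x46, 0x04, 0x89, 0x46, 0x08])
```

Note how length depends only on a couple of bytes per instruction in this common case — which is why an 8 B/cycle fast path is feasible even though worst-case x86 length decoding is serial and messy.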

SLIDE 52

Register Renamer

  Map logical reg → physical reg, 2 uops/clock

  • Read: map logical register numbers to physical
    – ~14 sources/clk
  • Write: update mapping table
    – ~4 destinations/clk
  • Allows 1w-port register files
    – Each ALU "owns" an RF
  • Used for recovery from misspeculation
    – On a pipeline flush, the committed mapping table is copied over the speculative table

  [Diagram: speculative and committed register mapping tables (eax, ecx, edx, ebx, esp, ebp, esi, edi, eflags, fpucw, fpusw, tmp0, tmp1) pointing into physical register files PRF A/B/C; a pipeline flush copies the committed table to the speculative table.]
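The renaming scheme on this slide can be sketched behaviourally — a speculative table read for sources and written for destinations, a free list for allocation, and a committed-table copy for flush recovery. A minimal Python model (class and method names are mine, not the processor's RTL):

```python
# Table-based register renaming with a free list and a committed copy
# of the mapping table for misspeculation recovery.

LOGICAL = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

class Renamer:
    def __init__(self, num_physical=64):
        # Initially each logical register maps to a distinct physical one.
        self.speculative = {r: i for i, r in enumerate(LOGICAL)}
        self.committed = dict(self.speculative)
        self.free_list = list(range(len(LOGICAL), num_physical))

    def rename(self, dst, srcs):
        """Read source mappings (read ports), allocate a new physical dst,
        and update the speculative table (write port)."""
        phys_srcs = [self.speculative[s] for s in srcs]
        phys_dst = self.free_list.pop(0)
        self.speculative[dst] = phys_dst
        return phys_dst, phys_srcs

    def commit(self, dst, phys_dst):
        """Retire: free the previously committed mapping, adopt the new one."""
        self.free_list.append(self.committed[dst])
        self.committed[dst] = phys_dst

    def flush(self):
        """Misspeculation: copy the committed table over the speculative one."""
        self.speculative = dict(self.committed)

r = Renamer()
d0, s0 = r.rename("eax", ["esi"])          # mov eax, [esi]
d1, s1 = r.rename("eax", ["eax", "esi"])   # add eax, [esi+4] reads renamed eax
```

The second rename reads the physical register the first one allocated — that dependency chain is exactly what the scheduler will later track.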

SLIDE 53

Renamer Circuit

  • Stage 1: pick two uops; find where each operand comes from
  • Stage 2: a bank of read muxes; write destination regs into the RAT
  • 317 MHz, 1900 ALMs
  • x86: few registers, so a small flip-flop-based RAT — but ≥3 register types

  [Circuit diagram: inputs from decode, string-op and microcode paths feed "choose operands" logic for two uops/clock. Per-type RATs and free lists: GPR RAT 13 × 8-bit (64 physical regs, 6r2w ports) with free lists A/B/C; SREG RAT 12 × 4-bit (16 physical regs, 4r2w ports); OCZAPS flags 1 × 5-bit (16 physical regs, 1r1w) with free lists A/B, plus an RF_is_zero bit. Forwarding and mux-select logic produces ALUop0/1 [dst,src,src] and AGUop0/1 [dst,src,src,src,seg].]

SLIDE 54

Scheduling: Track dependencies

  • Pick a ready operation, execute it, and wake up its dependents
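The pick/execute/wake-up loop above is what a matrix scheduler implements: each entry holds one dependency bit per other entry, an entry is ready when its row is all zeros, and issuing an entry clears its column everywhere. A behavioural sketch of the idea (illustrative model, not the 4-way distributed circuit described later):

```python
# Matrix scheduler model: matrix[e] has bit i set iff entry e is still
# waiting on entry i. Issuing an entry broadcasts a wakeup that clears
# its column in every row.

class MatrixScheduler:
    def __init__(self, size):
        self.size = size
        self.valid = [False] * size
        self.matrix = [0] * size

    def insert(self, entry, producer_entries):
        """Allocate an entry that depends on the given producer entries."""
        self.valid[entry] = True
        self.matrix[entry] = 0
        for p in producer_entries:
            if self.valid[p]:               # only wait on still-pending ops
                self.matrix[entry] |= 1 << p

    def pick(self):
        """Select the first ready valid entry, issue it, wake dependents."""
        for e in range(self.size):
            if self.valid[e] and self.matrix[e] == 0:
                self.valid[e] = False
                clear = ~(1 << e)
                for d in range(self.size):
                    self.matrix[d] &= clear  # wakeup broadcast
                return e
        return None

s = MatrixScheduler(8)
s.insert(0, [])        # independent op
s.insert(1, [0])       # depends on entry 0
s.insert(2, [0, 1])    # depends on both
```

In hardware the "row is all zeros" test and the column clear happen in parallel across all entries, which is what makes the structure fast but area-hungry.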

SLIDE 60

Scheduler Size

  • Can trade capacity (area and frequency) for IPC

  [Plot: IPC vs. scheduler capacity, 8–64 entries. Chosen design point: 32 entries, 275 MHz.]
SLIDE 61

Scheduler Circuit

  • 4-way distributed matrix scheduler
  • 32 entries (10, 10, 7, 5)
    – 275 MHz
  • Comparison (scheduler entries):
    – Pentium Pro: 20
    – Haswell: 60
    – Ryzen: 84 (6×14)
    – Skylake: 97

SLIDE 69

Scheduler Picker: Bit Scan

  • Pick the first ready instruction to execute
  • Logarithmic depth: radix-6
    – Han-Carlson prefix tree
    – Very difficult to code: the synthesizer makes it linear depth again

  [Diagram: ready bits in, one-hot bit-scan out; scan from oldest to newest for the first ready entry.]
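The picker is a priority encoder, and the logarithmic-depth trick is to compute a prefix OR ("is anything at or before me ready?") in a tree rather than a linear chain. A sketch using a radix-2 Kogge-Stone-style prefix tree for clarity — the slide's circuit is a radix-6 Han-Carlson tree matched to 6-input LUTs:

```python
# Find the first (oldest) ready entry: linear scan vs. prefix-OR tree.

def pick_first_linear(ready):
    """Obvious scan: logic depth grows linearly with entry count."""
    for i, r in enumerate(ready):
        if r:
            return i
    return None

def pick_first_prefix(ready):
    """Prefix-OR formulation: entry i is granted iff it is ready and no
    older entry is ready. The prefix OR needs only log2(n) levels."""
    n = len(ready)
    before = list(ready)                  # before[i] = OR of ready[0..i]
    step = 1
    while step < n:                       # log2(n) combining levels
        before = [before[i] | (before[i - step] if i >= step else 0)
                  for i in range(n)]
        step *= 2
    grant = [bool(ready[i]) and not (before[i - 1] if i > 0 else 0)
             for i in range(n)]
    return grant.index(True) if True in grant else None
```

The note about the synthesizer is why circuits like this are hand-built at LUT granularity: written naively in HDL, the tool re-derives the linear-depth chain and the frequency target is lost.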

SLIDE 70

Execution

  • Three different execution units
    – Complex, 3+ cycles
    – Simple, 1 cycle
    – Address generation
  • Latency vs. delay circuit design problem

SLIDE 77

Execution Circuits: Simple ALU

  • Three parts:
    – Shifter
    – Adder
    – Bitwise logic
  • We'll look at the shifter and adder circuits

SLIDE 82

Execution Circuit: Shifter

  • Minimal 32-bit shifter needs 3 LUT levels (a 4-to-1 mux per level)
  • We used a rotate + mask circuit: almost 3 LUT levels
    – Left and right; 32-, 16-, and 8-bit operands
    – Rotate, shift, arithmetic shift
    – Rotate-through-carry by 1
    – Byte swap (aa bb cc dd → dd cc bb aa)
    – Sign extension
  • 2.9 ns, 54% faster (and 46% smaller) than HDL synthesis
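The rotate + mask idea: a single rotator serves every shift direction, and a mask afterwards clears the bits that wrapped around. A behavioural sketch for 32-bit logical shifts (illustrative; the real circuit also folds in 8/16-bit operands, arithmetic shifts, and sign extension):

```python
# One 32-bit rotator + one mask stage implements both shift directions.

MASK32 = 0xFFFFFFFF

def rotr32(x, n):
    """Rotate right by n (mod 32)."""
    n &= 31
    return ((x >> n) | (x << (32 - n))) & MASK32

def shr32(x, n):
    """Logical right shift = rotate right, then clear the top n bits
    that wrapped around."""
    if n >= 32:
        return 0
    return rotr32(x, n) & (MASK32 >> n)

def shl32(x, n):
    """Left shift = rotate right by (32 - n), then clear the low n bits."""
    if n >= 32:
        return 0
    return rotr32(x, 32 - n) & ((MASK32 << n) & MASK32)
```

Sharing the rotator is what keeps the whole operation near the 3-LUT-level floor: the mask is cheap, so the many shift variants differ only in how the mask and rotate amount are derived.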

SLIDE 87

Execution Circuit: Adder

  • FPGAs have carry chains: Can’t improve on them by much
  • Condition codes: ZF means “is the result zero?”

– 32/16/8-bit NOR gate is 3 LUT levels plus the adder...

  • Computing a + b = K does not need addition!

– ZF: 3 LUT levels in parallel with adder

  • 2.3 ns, 24% faster, +55% area (+30 ALMs) vs. HDL synthesis
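The zero-flag shortcut rests on a carry-free identity: a + b wraps to zero (mod 2³²) exactly when (a ^ b) == ((a | b) << 1) (mod 2³²), so ZF can be computed in parallel with the adder rather than after it. A quick software sanity check of that identity (an illustration of the principle, not the hardware circuit):

```python
MASK32 = 0xFFFFFFFF

def zf_without_add(a, b):
    # Carry-free zero detect: at every bit, the XOR of the inputs must
    # match the carry that a zero sum would require, and that required
    # carry is just (a | b) shifted up by one.
    return ((a ^ b) & MASK32) == (((a | b) << 1) & MASK32)

# Demo: 5 + (-5) wraps to zero, 5 + 3 does not.
assert zf_without_add(5, (-5) & MASK32)
assert not zf_without_add(5, 3)
```

The same idea generalizes to comparing a + b against any constant K, which is how the slide's "a + b = K does not need addition" claim is usually realized.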


slide-89
SLIDE 89

89

Memory System Microarchitecture

  • Memory operations

– Store: mov [ecx], eax
– Load: mov eax, [ecx]

  • Caches

– For performance (instruction and data)
– For OS support (TLB and page walking)


slide-93
SLIDE 93

93

Basic Cache Trade-offs

[Charts: relative IPC (Dhrystone, SPECint 2000) vs. L1 cache size, with a 256 KB L2 cache and with no L2 cache]

  • Bigger cache → higher IPC

– Sensitivity varies with workload

  • L1 caches need to be small (We chose 8 KB)
  • L2 cache (256 KB) mostly makes up for this
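A back-of-the-envelope average-memory-access-time (AMAT) calculation shows why a small, fast L1 backed by an L2 can beat a bigger, slower L1 alone. All latencies and miss rates below are invented for illustration; they are not measurements from this design:

```python
def amat(hit_time, miss_rate, miss_penalty):
    # Average memory access time for one cache level, in cycles:
    # every access pays the hit time, and misses also pay the penalty.
    return hit_time + miss_rate * miss_penalty

# Hypothetical numbers: DRAM costs 60 cycles; a 256 KB L2 hits in
# 8 cycles and misses on 20% of the L1 misses it sees.
l2_penalty = amat(8, 0.20, 60)        # effective L1-miss penalty: 20 cycles
small_l1 = amat(1, 0.10, l2_penalty)  # fast 8 KB L1 + L2: 3.0 cycles
big_l1_no_l2 = amat(2, 0.05, 60)      # bigger, slower L1, no L2: 5.0 cycles
```

With these (made-up) numbers the small L1 wins despite its higher miss rate, because the L2 caps the miss penalty while the L1 hit time stays at one cycle.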


slide-94
SLIDE 94

94

More Memory System Trade-offs

[Chart: relative IPC vs. a blocking in-order baseline (Dhrystone, SPECint 2000) for blocking in-order (stall on cache miss), non-blocking in-order (+4 misses in flight), and non-blocking out-of-order memory systems]

  • Multiple in-flight misses
  • Out-of-order

– Memory dependence speculation


slide-97
SLIDE 97

97

L1 Memory System

  • TLB lookup
  • Cache tag compare
  • Cache data rotate (32 B)
  • Long critical path for direct implementation


slide-114
SLIDE 114

114

What happens to a load: simplified

Possible outcomes for a load:

  • Other paging features:

– Page faults
– TLB Accessed/Dirty bits
– Uncacheable accesses


slide-115
SLIDE 115

115

What happens to a load: simplified

Possible outcomes for a load:

  • Replay accesses that fail


slide-116
SLIDE 116

116

What happens to a load: simplified

Possible outcomes for a load:

  • Also:

– Split-cacheline/page accesses
– Stores
– L2 and cache coherence


slide-117
SLIDE 117

117

What happens to a load: simplified

Design a circuit to implement this microarchitecture...


slide-118
SLIDE 118

118

What happens to a load: simplified

  • Long critical path for direct implementation

– TLB lookup
– Cache tag compare
– Cache data rotate (32 B)

  • Shannon expansion on “TLB hit way 1?”

– TLB lookup
– 2-to-1 mux (1 bit)
– 2-to-1 mux (32 bit)
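The effect of the Shannon expansion can be sketched in software: instead of waiting for the late "TLB hit way 1?" signal before doing the downstream lookup, compute the result under both hypotheses in parallel and pick with a single final 2-to-1 mux. The function and variable names here are invented for illustration; the real circuit operates on tag compares and cache data, not Python dicts:

```python
# Toy state: a 2-way TLB entry (physical frame per way) and cache data
# keyed by physical frame.
tlb = (3, 7)                              # way 0 -> frame 3, way 1 -> frame 7
cache_ways = {3: [10, 11], 7: [20, 21]}   # frame -> cached words

def load_result_direct(way1_hit, tlb, cache_ways, idx):
    # Direct implementation: the late way-select sits at the FRONT of
    # the path, so everything after it must wait for it.
    pframe = tlb[1] if way1_hit else tlb[0]
    return cache_ways[pframe][idx]

def load_result_shannon(way1_hit, tlb, cache_ways, idx):
    # Shannon expansion: evaluate the downstream logic for both values
    # of the late signal (two parallel copies in hardware), so the late
    # signal only has to drive the final 2-to-1 mux.
    result_if_way0 = cache_ways[tlb[0]][idx]
    result_if_way1 = cache_ways[tlb[1]][idx]
    return result_if_way1 if way1_hit else result_if_way0
```

Both functions compute the same value; the payoff is purely in hardware timing, since the late select traverses one mux instead of the whole lookup.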


slide-121
SLIDE 121

121

Memory System: L1 Circuit

  • Final design: Many paths are near-critical — delays are balanced

– Critical portions hand-mapped to LUTs (blue boxes)
– Many critical paths have logic depth of 5 LUTs

  • 4.3 ns (~230 MHz)

– Nios II/f: TLB+cache circuit is ~6.0 ns (~1.5 stages)


slide-123
SLIDE 123

123

Summary 2

  • Out-of-order soft processors are useful and feasible

– Area: 6.5× Nios II/f, but affordable
– Performance: 2.2× Nios II/f on SPECint2000

  • ...if you pay attention to circuit design

– Adapt microarchitecture to suit circuits
– Or push circuits to make microarchitecture feasible
– LUT-level design is often useful

  • Avoiding (un)optimization has been painful
slide-124
SLIDE 124

124

Future Work

  • Hardware not yet functional

– Some missing pieces

  • More optimization

– Frequency

  • Add FPU
slide-125
SLIDE 125

125

End

slide-126
SLIDE 126

126

Thoughts on Complex Instructions

  • There is little distinction between modern RISC and CISC

– Abandoned “RISC” features

  • Branch delay slot
  • Register windows

– “CISC” features in modern “RISC”

  • Variable-length instructions
  • Instructions with more than one micro-op
  • Unaligned memory access
slide-127
SLIDE 127

127

Complex Instructions 2

  • It’s easier to lower the abstraction level than to raise it

– It’s easy to compile C to instructions.
– It’s easy to crack complex instructions into many micro-ops.

  • We’ve made a lot of instruction-level promises. Breaking them needs ISA changes

– Fused multiply-add: Can’t synthesize from MUL and ADD
– REP MOVS: Must maintain memory ordering
– Binary translation is really hard: Need to keep low-level promises

  • Can we really execute 5, 10, 20 IPC while keeping all instruction-level ISA promises?

slide-128
SLIDE 128

128

Simulator ISA Support

slide-129
SLIDE 129

129

Publications

  • Comparing FPGA to custom CMOS

– Comparing FPGA vs. custom CMOS and the impact on processor microarchitecture (FPGA 2011)
– Quantifying the gap between FPGA and custom CMOS to aid microarchitectural design (TVLSI 2013)

  • Out-of-order memory system

– Efficient methods for out-of-order load/store execution for high-performance soft processors (FPT 2013)
– Microarchitecture and circuits for a 200 MHz out-of-order soft processor memory system (TRETS 2016)

  • Out-of-order instruction scheduling

– High performance instruction scheduling circuits for out-of-order soft processors (FCCM 2016)
– High performance instruction scheduling circuits for superscalar out-of-order soft processors (TRETS 2017)