Processor Architecture Charles Eric LaForest J. Gregory Steffan - PowerPoint PPT Presentation

Octavo: An FPGA-Centric Processor Architecture Charles Eric LaForest J. Gregory Steffan ECE, University of Toronto FPGA 2012, February 24

Easier FPGA Programming • We focus on overlay architectures – Nios, MicroBlaze, Vector Processors • These inherited their architectures from ASICs – Easy to use with existing software tools – Performance penalty – ASIC architectures poor fit to FPGA hardware! • ASIC ≠ FPGA – ASIC: transistors, poly, vias, metal layers – FPGA: LUTs, BRAMs, DSP Blocks, routing • Fixed widths, depths, other discretizations FPGA-centric processor design? 2

How do FPGAs Want to Compute?

Hardware (Stratix IV)   Width (bits)   Fmax (MHz)
DSP Blocks              36             480
Block RAMs              36             550
ALUTs                   1              800
Nios II/f               32             230

What processor architecture best fits the underlying FPGA? 3

Research Goals 1. Assume threaded data parallelism 2. Run at maximum FPGA frequency 3. Have high performance 4. Never stall 5. Aim for simple, minimal ISA 6. Match architecture to underlying FPGA 4

Result: Octavo • 10 stages, 8 threads, 550 MHz • Family of designs – Word width (8 to 72 bits) – Memory depth (2 to 32k words) – Pipeline depth (8 to 16 stages) Snapshot of work-in-progress 5

Designing Octavo 6

High-Level View of Octavo Unified registers and RAM 7

Octavo vs. Classic RISC • All memories unified (no loads/stores) • How to pipeline Octavo? 8

Design For Speed: Self-Loop Characterization 9

Self-Loop Characterization • Connect module outputs to inputs – Accounts for the FPGA interconnect • Pipeline loop paths to absorb delays • Pointed to other limits than raw delay – Minimum clock pulse widths • DSP Blocks: 480 MHz • BRAMs: 550 MHz We measured some surprising delays… 10

BRAM Self-Loop Characterization (figure: self-loop configurations measured at 656 MHz, 531 MHz, 710 MHz, and 398 MHz, the last marked "routing!") Must connect BRAMs using registers 11

Building Octavo: Memory 12

Building Octavo: Memory 13

(diagram labels: Memory, Instruction, ALU, Result) Replicated “scratchpad” memories with I/O while still exceeding 550 MHz limit. 14

Building Octavo: ALU 15

Building Octavo: ALU • Fully pipelined (4 stages) – Never stalls 16

Building Octavo: ALU • Multiplication – Uses DSP Blocks – Must overcome their 480 MHz limit… 17

Building Octavo: Multiplier • One multiplier is wide enough but too slow 480 MHz • Two multipliers working at half-speed – Send data to both multipliers in alternation 600 MHz 18
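The alternation described on the multiplier slide can be sketched as a small cycle-level model. This is only an illustration under stated assumptions: the function name `dual_multiplier` and the `latency` value are hypothetical and do not come from the slides. It shows that two units, each accepting a new input only every second cycle, can together deliver one product per cycle in the original order.

```python
def dual_multiplier(pairs, latency=2):
    """Model two half-rate multipliers fed in alternation.

    Operands arriving on cycle n go to multiplier n % 2, so each unit
    sees a new input only every second cycle. `latency` is a hypothetical
    per-multiplier delay in full-speed cycles, not a figure from the slides.
    Returns (products, issue_cycles), where issue_cycles[u] lists the
    cycles on which unit u accepted an input.
    """
    in_flight = [[], []]      # per unit: list of (ready_cycle, product)
    issue_cycles = [[], []]   # per unit: cycles on which it was fed
    products = []
    for cycle in range(len(pairs) + latency):
        # Output mux: select the unit that was fed `latency` cycles ago.
        unit = (cycle - latency) % 2
        if in_flight[unit] and in_flight[unit][0][0] == cycle:
            products.append(in_flight[unit].pop(0)[1])
        # Input demux: alternate between the two units.
        if cycle < len(pairs):
            a, b = pairs[cycle]
            target = cycle % 2
            in_flight[target].append((cycle + latency, a * b))
            issue_cycles[target].append(cycle)
    return products, issue_cycles


products, issue_cycles = dual_multiplier([(1, 2), (3, 4), (5, 6), (7, 8)])
print(products)       # [2, 12, 30, 56]
print(issue_cycles)   # [[0, 2], [1, 3]]
```

Each unit is fed on every other cycle only, which is why each multiplier can run at half the rate of the surrounding pipeline.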

Octavo: Putting It All Together 19

Octavo (diagram: pipeline stages 0 to 9) • Pipeline – 10 stages • Actually 8 stages with one exception (more later) – No result forwarding or pipeline interlocks – Scalar, Single-Issue, In-Order, Multi-Threaded 20

Octavo (diagram: pipeline stages 0 to 9; blocks: I) • Instruction Memory – Indexed by current thread PC – Provides a 3-operand instruction – On-chip BRAMs only 21

Octavo (diagram: pipeline stages 0 to 9; blocks: I, A/B, A/B) • A and B Memories – Receive operand addresses from instruction – Provide data operands to ALU and Controller • Some addresses map to I/O ports – On-chip BRAMs only 22

Octavo (diagram: pipeline stages 0 to 9; blocks: I, A/B, A/B) • Pipeline Registers – Avoid an odd number of stages – Separate BRAMs for best speed • Predicted by BRAM self-loop characterization • Unusual but essential design constraint 23

Octavo (diagram: pipeline stages 0 to 9; blocks: I, CTL0, CTL1, A/B, A/B) • Controller – Receives opcode, source/destination operands – Decides branches – Provides current PC of next thread to I memory 24

Octavo (diagram: pipeline stages 0 to 9; blocks: I, CTL0, CTL1, A/B, A/B, ALU0 to ALU3) • ALU – Receives opcode and data – Writes result to all memories 25
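The datapath walked through in the slides above (a 3-operand instruction fetched by PC, operands read from the A and B memories, the result written back with no loads or stores) can be sketched as a tiny functional model. Everything named here is an assumption for illustration: the opcodes `ADD` and `MUL`, the tuple layout `(op, dest, src_a, src_b)`, and the memory depth are hypothetical and are not Octavo's actual ISA. The model ignores threads, pipelining, branches, and I/O ports, and it leaves out the write to instruction memory to stay short.

```python
import operator

# Hypothetical opcodes, for illustration only.
OPS = {"ADD": operator.add, "MUL": operator.mul}


def step(instr_mem, a_mem, b_mem, pc):
    """Execute one 3-operand instruction and return the next PC."""
    op, dest, src_a, src_b = instr_mem[pc]        # instruction memory, indexed by PC
    result = OPS[op](a_mem[src_a], b_mem[src_b])  # A and B each supply one operand
    a_mem[dest] = result                          # the result is written to both
    b_mem[dest] = result                          # replicated data memories
    return pc + 1


def run(program, data, depth=16):
    """Run a straight-line program over replicated A/B memories."""
    a_mem = list(data) + [0] * (depth - len(data))
    b_mem = list(a_mem)                           # B starts as a copy of A
    pc = 0
    while pc < len(program):
        pc = step(program, a_mem, b_mem, pc)
    return a_mem, b_mem


# mem[2] = mem[0] + mem[1]; mem[3] = mem[2] * mem[2]
program = [("ADD", 2, 0, 1), ("MUL", 3, 2, 2)]
a_mem, b_mem = run(program, data=[3, 4])
print(a_mem[:4])   # [3, 4, 7, 49]
```

Because every result is written to both copies, the two memories stay identical, so either one can supply any operand on the next instruction.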

Octavo (diagram: pipeline stages 0 to 9 with threads T0 to T7 in flight; blocks: I, CTL0, CTL1, A/B, A/B, ALU0 to ALU3) • Longest mandatory loop: 8 stages – Along A/B memories and ALU – Fill with 8 threads to avoid stalls 26
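Why 8 threads fill an 8-stage loop can be checked with a short counting model. This is a sketch under assumptions, not the authors' method: it assumes strict round-robin issue of one instruction per cycle and treats every instruction as depending on its thread's previous result, which is the worst case. The function name `count_hazards` is hypothetical.

```python
def count_hazards(num_threads=8, loop_stages=8, cycles=64):
    """Count issue slots where a thread's previous result is not yet written.

    One instruction issues per cycle, rotating through the threads.
    A result becomes visible `loop_stages` cycles after issue. A nonzero
    count means the design would need interlocks or result forwarding.
    """
    result_ready = {}   # thread -> cycle at which its last result is visible
    hazards = 0
    for cycle in range(cycles):
        thread = cycle % num_threads
        if cycle < result_ready.get(thread, 0):
            hazards += 1
        result_ready[thread] = cycle + loop_stages
    return hazards


print(count_hazards(num_threads=8, loop_stages=8))   # 0
print(count_hazards(num_threads=4, loop_stages=8))   # 60
```

With as many threads as loop stages, each thread's next instruction issues exactly when its previous result lands, so no slot needs a stall. With fewer threads than stages, every slot after the first rotation would conflict.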

Octavo (diagram: pipeline stages 0 to 9; blocks: I, CTL0, CTL1, A/B, A/B, ALU0 to ALU3) • Special case longest loop: 10 stages – Along instruction memory and ALU – Does not affect most computations • Adds a delay slot to subroutine and loop code 27

Results: Speed and Area 28

Experimental Framework • Quartus 10.1 targeting Stratix IV (fastest) – Optimize and place for speed – Average speed over 10 placement runs • Varied processor parameters: – Word width – Memory depth – Pipeline depth • Measure Frequency, Area, and Density 29

Maximum Operating Frequency 30

Maximum Operating Frequency (plot annotations: "Timing Slack!", "BRAM hard limit"; direction labels: Faster, Wider) 31

Maximum Operating Frequency • Octavo: 550+ MHz, 36 bits wide • Nios II/f: 230 MHz, 32 bits wide • 2.39x faster, but not a fair comparison 32
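The 2.39x figure is the ratio of the two clock frequencies on this slide. A minimal check, with the helper name `clock_speedup` chosen here for illustration:

```python
def clock_speedup(fast_mhz, slow_mhz):
    """Ratio of two clock frequencies. This compares clock rate only."""
    return fast_mhz / slow_mhz


# 550 MHz and 230 MHz are the two frequencies quoted on the slide.
print(f"{clock_speedup(550, 230):.2f}x")   # 2.39x
```

The ratio compares clock frequency alone, which fits the slide's caution that this is not a fair comparison.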

Maximum Operating Frequency • Multiplier CAD Anomaly! (38 to 54 bits width) • Enough pipeline stages bury the inefficiency 33

Area Density 34

Area Density (plot annotations: 67% used; 26% used (typical); “Sweet spot”; 72 bits, 1024 words; 72 bits, 4096 words) 35

Designing Octavo: Lessons & Future Work 36

Lessons • Soft-processors can hit BRAM Fmax – Octavo: 8 threads, 10 stages, 550 MHz • Self-loop characterization for modules – Helps reason about their pipelining – Shows true operating envelopes on FPGA • Octavo spans a large design space – Significant range of widths, depths, stages Consider FPGA-centric architecture! 37

Future Work 38
