L2: FPGA HARDWARE 18-545: ADVANCED DIGITAL DESIGN PROJECT FALL 2016


SLIDE 1

18-545: ADVANCED DIGITAL DESIGN PROJECT FALL 2016 BRANDON LUCIA

L2: FPGA HARDWARE

SLIDE 2

18-545: FALL 2016

Admin stuff

  • Project proposals happen on Monday
      ◆ Be prepared to give an in-class presentation
  • Lab 1 is due Wednesday, Sept. 14th
  • Reading Assignment #1 is due today
      ◆ Submit a PDF/text file; don't fill in the web form
  • Team assignments are done

SLIDE 3

Admin Stuff

  • Status reports due today
      ◆ No Word docs, please!
      ◆ Be specific about what happened / is going to happen
      ◆ Talk about what YOU did / will do, not just what your group did
  • Grades are on the way, as general feedback

SLIDE 4

Game Plan

  • Overview
  • Why use FPGAs?
  • FPGA Internals

Caveat: I will use Xilinx-specific terminology, since Xilinx is the FPGA company whose parts you will be using. Beware that other companies use different terms.

SLIDE 5

FPGA Overview

  • Field Programmable Gate Array
      ◆ Array of generic logic gates
      ◆ Gates whose logic function can be programmed
      ◆ Programmable interconnection between gates
      ◆ Fielded systems can be programmed, i.e. post-fabrication

SLIDE 6

Xilinx Virtex-5 FPGA

SLIDE 7

Design Platform

  • Virtex-5 Development System
      ◆ Xilinx XC5VLX110T FPGA
          ◆ 17,280 slices of CLB goodness
      ◆ 256MB DDR2 (SODIMM)
      ◆ DVI video port
          ◆ VGA port is for input
      ◆ 10/100/1000 Ethernet port
      ◆ Audio codec (AC97)
      ◆ USB2 port
      ◆ 16x2 LCD, RS-232
      ◆ Compact Flash card slot
      ◆ Expansion connectors

SLIDE 8

Game Plan

  • Overview
  • Why use FPGAs?
  • FPGA Internals

SLIDE 9

Why use FPGAs?

  • System designers have a Goldilocks problem
      ◆ Off-the-shelf parts are not efficient enough
      ◆ Custom ASICs cost too much
      ◆ Need a “just right” solution

SLIDE 10

ASIC Design

  • Difficult to design
      ◆ Large and complex
      ◆ Issues in advanced processes
          ◆ Interconnect delay
          ◆ Device leakage
          ◆ Power density constraints
  • Expensive to design / fabricate
      ◆ Mask set costs
      ◆ Non-recurring engineering costs

Need a high-volume, high-profit market to justify costs!

SLIDE 11

Efficiency View

An efficiency gap exists between ASICs and CPUs

  • N. Zhang et al., “The Cost of Flexibility in Systems on a Chip Design for Signal Processing Applications”

[Figure: energy efficiency (MOPS/mW) and area efficiency (MOPS/mm²) on a log scale from 0.01 to 10000, for designs numbered 1 to 20, grouped as microprocessors, DSPs, and ASICs]

SLIDE 12

Economic View

  • FPGAs: high package costs ($300+), low NRE costs
  • ASICs: low package costs (pennies), high NRE costs ($600K+)

[Figure: Development Cost + Device Cost versus Total Units, with an ASIC trend line and an FPGA trend line. At volumes below the crossover point the FPGA solution has a lower total cost; above it the ASIC solution has a lower total cost. Decreasing FPGA unit cost is pushing the crossover point to the right. (Courtesy Xilinx, Inc.)]

Additional ASIC costs:

  • Increasing NRE charge
  • 58% are late to market, which impacts total volumes shipped
  • ASIC cycle is longer than some market windows
  • Over 50% need to be respun
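
A minimal sketch of the crossover point in the total-cost comparison, modeling total cost as NRE plus unit cost times volume. The $300 FPGA part cost and $600K ASIC NRE come from the slide; the FPGA NRE of $0 and ASIC unit cost of $1 are assumptions for illustration, and the function names are hypothetical.

```python
from math import ceil


def total_cost(nre: float, unit_cost: float, units: int) -> float:
    """Total cost = one-time development (NRE) cost + per-device cost * volume."""
    return nre + unit_cost * units


def crossover_units(fpga_nre: float, fpga_unit: float,
                    asic_nre: float, asic_unit: float) -> int:
    """Smallest volume at which the ASIC's total cost is no higher than the FPGA's."""
    if fpga_unit <= asic_unit:
        raise ValueError("no crossover: FPGA unit cost must exceed ASIC unit cost")
    return max(0, ceil((asic_nre - fpga_nre) / (fpga_unit - asic_unit)))


# From the slide: FPGA parts cost $300+, ASIC NRE is $600K+.
# Assumed for illustration only: FPGA NRE of $0 and ASIC unit cost of $1.
FPGA_NRE, FPGA_UNIT = 0.0, 300.0
ASIC_NRE, ASIC_UNIT = 600_000.0, 1.0

breakeven = crossover_units(FPGA_NRE, FPGA_UNIT, ASIC_NRE, ASIC_UNIT)
print(f"ASIC becomes cheaper at about {breakeven} units")
```

Under these assumed numbers the crossover lands at roughly 2,000 units. Halving the FPGA unit cost in the same model moves the crossover to a higher volume, which is the "pushing the crossover point to the right" trend.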

SLIDE 13

FPGA Advantages

  • Higher performance than CPU solution
  • Lower power than CPU solution (usually)
  • Low NRE costs
      ◆ Off-the-shelf part designed by FPGA vendor
      ◆ You are sharing NRE costs with all other customers
  • Fast design time
      ◆ Low time-to-market
  • Fast re-design / re-fabrication time
      ◆ Easy to correct an error or add functionality in response to a spec change
      ◆ Can even change product after deployment

SLIDE 14

FPGA Disadvantages

  • High per-part costs
      ◆ Good for low to middle volume applications
      ◆ High volume applications should consider ASICs
      ◆ Perhaps use FPGA for prototyping
  • Lower performance than ASIC
  • Higher power than ASIC
  • More specialized design skills than programming a CPU

SLIDE 15

Example uses of FPGAs

  • Rapid prototyping
      ◆ Emulation of ASIC design
      ◆ Design exploration
  • Shipping product
      ◆ Networking
      ◆ Military
      ◆ Microsoft Bing datacenters
  • Reconfigurable computing
  • Research! (http://parallel.princeton.edu/openpiton/)

SLIDE 16

Game Plan

  • Overview
  • Why use FPGAs?
  • FPGA Internals

SLIDE 17

FPGA Breakdown

  • 3 basic components
      ◆ Configurable Logic Blocks
      ◆ General purpose interconnect
      ◆ I/O Blocks
  • Advanced components
      ◆ Hard macros
          ◆ CPUs
          ◆ Block RAM
          ◆ Multipliers
      ◆ Specialized components
          ◆ DSP blocks

VIRTEX-II PRO

SLIDE 18

XILINX XC3020

  • CLB (64 TOTAL)
  • I/O BLOCK (64 TOTAL)
  • GENERAL PURPOSE INTERCONNECT
  • SWITCH MATRIX
  • IOBS HAVE DIRECT ACCESS TO ADJACENT CLBS

(COURTESY XILINX, INC.)

SLIDE 19

ROUTING

  • ZOOMED IN VIEW OF THE CLB MATRIX OF THE FPGA
  • EVEN MORE ZOOMED IN VIEW
  • SPECIFIC INGRESS AND EGRESS CONNECTION OPTIONS (BLACK DOTS) ARE AVAILABLE

(COURTESY XILINX, INC.)

SLIDE 20

ROUTING: THE SWITCH MATRIX

EACH MATRIX HAS 5 CONNECTIONS PER SIDE

(COURTESY XILINX, INC.)

SLIDE 21

ROUTING: THE SWITCH MATRIX

  • EACH MATRIX HAS 5 CONNECTIONS PER SIDE
  • ONLY CERTAIN CONNECTION PATTERNS ARE POSSIBLE

(COURTESY XILINX, INC.)

SLIDE 22

Hierarchical Routing

  • Spartan-2 and more recent families have connections of different lengths between switch matrices
      ◆ Local roads, limited access roads, interstate highways
  • Routes across the entire chip don’t burn lots of short connections

SLIDE 23

Detailed Routing (Spartan 2)

SLIDE 24

Configurable Logic Blocks

  • CLBs get more and more stuff crammed into them over time
  • XC3K family had:
      ◆ LUT (5 variable inputs, 2 FF values, 2 outputs)
      ◆ 2 FFs, clock enable, FF reset (direct / global)
      ◆ 9 muxes
      ◆ ~51 bits of configuration SRAM per CLB

(COURTESY XILINX, INC.)

SLIDE 25

What’s a Look-up-table (LUT)?

  • A direct implementation of a truth table, using memory
      ◆ LUT inputs are memory address values
      ◆ LUT outputs are the memory data value

[Figure: two truth tables with input columns A, B, C, D and output column F]
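
A minimal Python sketch of the truth-table-as-memory idea: the four inputs are packed into a 4-bit address and the output is whatever bit is stored there. The class name `LUT4`, the helper `address_bits`, and the example function are hypothetical, chosen only for illustration.

```python
def address_bits(addr: int) -> tuple:
    """Split a 4-bit address into (a, b, c, d), with a as the most significant bit."""
    return (addr >> 3) & 1, (addr >> 2) & 1, (addr >> 1) & 1, addr & 1


class LUT4:
    """4-input LUT modeled as a 16x1 memory: inputs form the address, output is the data."""

    def __init__(self, contents):
        contents = list(contents)
        if len(contents) != 16 or any(bit not in (0, 1) for bit in contents):
            raise ValueError("a 4-input LUT holds exactly 16 bits")
        self.mem = contents

    @classmethod
    def from_function(cls, fn):
        """'Program' the LUT by writing fn's truth table into the memory."""
        return cls(fn(*address_bits(addr)) & 1 for addr in range(16))

    def read(self, a: int, b: int, c: int, d: int) -> int:
        """Evaluate the logic function with a single memory read."""
        return self.mem[(a << 3) | (b << 2) | (c << 1) | d]


# Hypothetical example function: F = (A AND B) OR (C XOR D)
lut = LUT4.from_function(lambda a, b, c, d: (a & b) | (c ^ d))
print(lut.read(1, 1, 0, 0))
```

Any 4-input function costs the same 16 bits, which is why the LUT works as a generic "gate" whose function is set by configuration.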

SLIDE 26

Another View of LUTs

  • Can view LUT as 16:1 mux
      ◆ Inputs are mux select
      ◆ Config sets mux data inputs
  • Logically same as 16x1 memory
  • Can compact logic if you can route inputs to mux data inputs
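
A sketch of the mux view, assuming a tree of 2:1 muxes (one possible structure, not necessarily how any particular device builds it). `mux16` selects one of 16 data inputs; `five_input_via_mux` illustrates the compaction point by tying each data input to 0, 1, E, or NOT E, so a 5-input function fits in one 16:1 mux. All names here are hypothetical.

```python
def mux2(sel: int, d0: int, d1: int) -> int:
    """2:1 mux: pass d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0


def mux16(select, data) -> int:
    """16:1 mux as a 4-level tree of 2:1 muxes; select = (s3, s2, s1, s0), s3 is the MSB."""
    level = list(data)
    for sel in reversed(select):  # s0 picks within adjacent pairs, then s1, s2, s3
        level = [mux2(sel, level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def five_input_via_mux(fn, a, b, c, d, e) -> int:
    """Compact a 5-input function into one 16:1 mux by routing e to the data inputs.

    For each (a, b, c, d) combination the data pin is tied to 0, 1, e, or NOT e.
    """
    data = []
    for addr in range(16):
        sel_bits = [(addr >> k) & 1 for k in (3, 2, 1, 0)]
        f0, f1 = fn(*sel_bits, 0), fn(*sel_bits, 1)
        data.append(f0 if f0 == f1 else (e if f1 else 1 - e))
    return mux16((a, b, c, d), data)


# LUT view: the 16 config bits sit on the data inputs, A..D drive the selects.
config = [0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1]  # hypothetical contents
assert mux16((1, 1, 0, 0), config) == config[12]
```

With constant data inputs the mux behaves exactly like the 16x1 memory; letting a live signal reach the data inputs is what buys the extra variable.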

SLIDE 27

Look Up Table Additional Functionality

  • Can be configured as:
      ◆ Shift register (16 regs)
      ◆ Small memory (16 bits), i.e. “Distributed RAM”
  • Some other FPGAs use muxes instead of memories to implement the core combinational logic
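
A behavioral sketch of the shift-register mode, assuming the 16 LUT cells shift by one on each clock and the LUT inputs pick which cell is read. The class name `SRL16` and the tap numbering (address 0 is the newest bit) are assumptions of this sketch.

```python
class SRL16:
    """The 16 LUT memory cells reused as a shift register with an addressable tap."""

    def __init__(self):
        self.cells = [0] * 16

    def clock(self, din: int) -> None:
        """One clock edge: the new bit enters cell 0 and every stored bit moves down one cell."""
        self.cells = [din & 1] + self.cells[:-1]

    def read(self, addr: int) -> int:
        """The LUT inputs select which cell is visible, i.e. a delay of addr + 1 clocks."""
        return self.cells[addr]


srl = SRL16()
for bit in [1, 0, 1, 1]:
    srl.clock(bit)
print(srl.read(3))  # the bit that entered 4 clocks ago
```

One LUT used this way stands in for a chain of up to 16 flip-flops, so delay lines need not consume the CLB's own FFs.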

SLIDE 28

Spartan-2 CLB

  • Spartan-2 has 2 LUTs (4 inputs each) feeding a 3rd LUT, 2 FFs (with Preset/Reset, Enable, posedge or negedge clocks) and 16 muxes
  • 12 inputs (plus clock), 4 outputs

(COURTESY XILINX, INC.)

SLIDE 29

Spartan-3

  • CLBs are composed of 4 slices
      ◆ Organized as 2 pairs, one of which is optimized for memory access
      ◆ Each slice has 2 FFs and 2 LUTs

(COURTESY XILINX, INC.)

SLIDE 30

FPGA Families extend Architecture

❏ Devices are built with more capability, but around the same basic architecture
❏ Some additional capabilities
    ◆ Low voltage versions
    ◆ Faster clock rates
    ◆ Different packaging options

(Courtesy Xilinx, Inc.)

SLIDE 31

The need for more stuff

❏ CompEs cannot design on logic, routing, I/O alone
❏ Extreme case from early 90s
    ◆ 16 port ATM switch, designed on a single board
    ◆ Design is limited by I/O to memory chips, so bring them on-chip

[Figure labels: FIFO memory chips; FPGAs (XC3Ks)]

SLIDE 32

Other “Stuff”

❏ Clock managers
    ◆ Global clock buffering, distribution
    ◆ Digital Clock Manager (DCM): eliminate skew, shift phase, multiply or divide clock
❏ Memory
    ◆ Block RAM
    ◆ Distributed RAM (repurposed LUTs)
❏ Shift Registers
❏ Dedicated Multiplexers
❏ Carry Look-Ahead Generators
❏ I/O Blocks
    ◆ SelectIO supports 18 standards (single-ended, differential, various voltage levels, ...)
❏ Embedded Multipliers

SLIDE 33

Hard Macros

  • Hard macros
      ◆ Block RAMs
      ◆ Multipliers
      ◆ CPUs
      ◆ DSPs
  • Soft macros
      ◆ HDL IP Blocks

SLIDE 34

Block RAMs

  • Distributed RAM
      ◆ Use LUTs as memories
      ◆ Low density
      ◆ Poor performance
  • Block RAM
      ◆ Large-ish dedicated memory blocks
          ◆ Xilinx BRAMs = 18Kb
      ◆ Some configurability
          ◆ Dual-port
          ◆ Data width / depth
          ◆ FIFO, CAM, etc.
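
A rough sketch of the width / depth trade-off, assuming the whole 18Kb block is usable at the chosen word width (so width times depth equals capacity). The widths 9, 18, and 36 are example values; real devices support only specific aspect ratios, so check the device user guide. `blocks_needed` is a hypothetical helper.

```python
BRAM_BITS = 18 * 1024  # one 18Kb block, per the slide


def depth_for_width(width_bits: int) -> int:
    """Words available when the fixed-size block is configured with the given word width."""
    if width_bits <= 0 or BRAM_BITS % width_bits:
        raise ValueError("width must evenly divide the block capacity")
    return BRAM_BITS // width_bits


def blocks_needed(width_bits: int, depth_words: int, block_width: int) -> int:
    """BRAMs needed to tile a width x depth memory from blocks configured to block_width."""
    across = -(-width_bits // block_width)                   # ceiling division
    down = -(-depth_words // depth_for_width(block_width))   # ceiling division
    return across * down


for width in (9, 18, 36):  # example widths, assumed for illustration
    print(f"{width:2d} bits wide x {depth_for_width(width)} words deep")
```

Doubling the word width halves the depth, so a memory that is both wide and deep is tiled from several blocks side by side and end to end.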

SLIDE 35

Multipliers

18x18 signed 2’s-complement multiplier

  • Two 18b inputs
  • One 36b output
  • 18b enough for many DSP applications
  • Can gang multiple units together for wider data
  • Faster and lower power than multiplier from CLBs
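
A sketch of ganging multipliers for wider data, assuming a 35x35 signed multiply is split into four 18x18 signed products that are shifted and added. Each operand is cut into a signed upper part and an unsigned 17-bit lower part so that every piece fits an 18-bit signed input. The function names are hypothetical, and the adders are modeled with plain Python integers.

```python
def mul18(a: int, b: int) -> int:
    """One hard multiplier: 18-bit signed inputs, 36-bit signed product."""
    for operand in (a, b):
        if not -(1 << 17) <= operand < (1 << 17):
            raise ValueError("operand does not fit in 18-bit two's complement")
    return a * b


def split35(x: int):
    """Split a 35-bit signed value into a signed high part and an unsigned 17-bit low part."""
    if not -(1 << 34) <= x < (1 << 34):
        raise ValueError("operand does not fit in 35-bit two's complement")
    return x >> 17, x & 0x1FFFF  # x == hi * 2**17 + lo, with 0 <= lo < 2**17


def mul35(a: int, b: int) -> int:
    """35x35 signed multiply built from four 18x18 multipliers plus shifts and adds."""
    a_hi, a_lo = split35(a)
    b_hi, b_lo = split35(b)
    return ((mul18(a_hi, b_hi) << 34)
            + ((mul18(a_hi, b_lo) + mul18(a_lo, b_hi)) << 17)
            + mul18(a_lo, b_lo))


print(mul35(123_456_789, -987_654_321) == 123_456_789 * -987_654_321)
```

The low halves carry only 17 data bits because the 18th bit of each multiplier input is a sign bit, which is why four 18x18 units yield 35 bits of operand width rather than 36.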

SLIDE 36

CPUs – PowerPC 405

XC2VP30 has 2 Embedded PowerPC 405 cores

  • Embedded L1 I and D caches
  • No FPU

SLIDE 37

CPU Connectivity: PLB and OPB

IBM CoreConnect

  • Processor Local Bus (PLB) - fast on-chip communication
  • On-Chip Peripheral Bus (OPB) - optimized for peripherals (UART, etc.)
  • Device Control Register bus (DCR) - used to send and set configuration

SLIDE 38

CPU Connectivity: PLB and OPB (cont.)

SLIDE 39

CPU Connectivity: OCM

On-Chip Memory controller

  • CPU block RAM
  • 2 OCMs – I and D
  • Direct, fast interface
  • Can use dual-port BRAMs for producer-consumer link to FPGA fabric

SLIDE 40

CPU Links

A lot more details on the embedded CPU

  • http://www.xilinx.com/bvdocs/userguides/ppc_ref_guide.pdf
  • http://direct.xilinx.com/bvdocs/userguides/ug018.pdf
  • http://www-3.ibm.com/chips/techlib/techlib.nsf/productfamilies/CoreConnect_Bus_Architecture

SLIDE 41

Zynq 7000

  • Advanced Microcontroller Bus Architecture (AMBA) + Advanced eXtensible Interface (AXI)
      ◆ To memory, FPGA fabric, I/O & peripherals
  • AMBA = ARM’s attempt at The One True Interface

SLIDE 42

Configuration Storage

  • Lots of configuration bits
      ◆ LUTs, routing, I/O configuration
      ◆ Xilinx XC2VP30 has >11Mb
  • Configuration storage technologies
      ◆ Volatile
          ◆ SRAM cells
      ◆ Non-volatile
          ◆ FLASH, EEPROM
          ◆ Anti-fuse

[Figure labels: Actel anti-fuse; 6T SRAM cell with WL, bit, bit_b]

SLIDE 43

Configuration

  • How to load (scan) configuration bits (bitstream)
      ◆ Connect all configuration registers into single long shift register
      ◆ Serially clock in configuration bits
      ◆ Most designs use standard scan interface (JTAG) developed for test
  • Bitstream source
      ◆ Non-volatile memory
          ◆ On-board FLASH, EEPROM, serial memory
          ◆ External media (CF card)
      ◆ Attached workstation
  • Can encrypt bitstream to conceal configuration
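
A toy model of scan-style configuration loading, assuming every configuration register sits in one long shift register and one bit is clocked in per cycle. The 20-bit layout (one 16-bit LUT followed by 4 routing bits) and the class name `ConfigChain` are hypothetical; real bitstreams are far longer and carry framing that this sketch ignores.

```python
class ConfigChain:
    """All configuration registers linked into one long shift register."""

    def __init__(self, length: int):
        self.regs = [0] * length

    def shift_in(self, bit: int) -> None:
        """One configuration clock: the new bit enters at the head, the rest move one place."""
        self.regs = [bit & 1] + self.regs[:-1]

    def load(self, bitstream) -> None:
        """Serially clock in a whole bitstream, one bit per clock."""
        for bit in bitstream:
            self.shift_in(bit)


# Hypothetical layout: one 4-input LUT (16 bits) followed by 4 routing bits.
LUT_BITS, ROUTE_BITS = 16, 4
chain = ConfigChain(LUT_BITS + ROUTE_BITS)

lut_contents = [0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1]
routing = [1, 0, 0, 1]
# The first bit clocked in travels farthest, so the stream is sent in reverse chain order.
bitstream = list(reversed(lut_contents + routing))
chain.load(bitstream)
print(chain.regs[:LUT_BITS] == lut_contents)
```

Load time grows linearly with the number of configuration bits, since a single serial chain accepts only one bit per clock.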

SLIDE 44

Major FPGA Vendors

  • SRAM-based FPGAs
      ◆ Xilinx
      ◆ Altera
      ◆ Atmel
      ◆ Lattice Semiconductor
  • Flash & antifuse FPGAs
      ◆ Actel Corp.
      ◆ Quick Logic Corp.
      ◆ Lattice Semiconductor
      ◆ Xilinx (system-in-a-package solution)

Share over 60% of the market