Reconfiguration Overhead in Dynamic Task-Based Implementations on FPGAs



SLIDE 1

Reconfiguration Overhead in Dynamic Task-Based Implementations on FPGAs

Padmini Nagaraj, University of California, Berkeley (Distributed Mentor Program, Researcher, Summer 2004)
Professor Elaheh Bozorgzadeh, University of California, Irvine (Distributed Mentor Program, Mentor)

SLIDE 2

Padmini Nagaraj - minar@ocf.berkeley.edu


Introduction

Field Programmable Gate Arrays (FPGAs)

Metrics:
  • Performance time
  • Reconfiguration time
  • Resources available

Xilinx Virtex 2 XCV2000E
  • Partially reconfigurable by Configurable Logic Block (CLB) columns

Figure: example Xilinx FPGA chip, drawn as columns of CLBs.

SLIDE 3

Reconfiguration Overhead

  • Reconfiguration delay is crucial in a dynamically reconfigurable architecture when reconfiguration is exploited at runtime.
  • Project: study the trade-off between reconfiguration delay and the performance of a task implemented on an FPGA device.
  • Reconfiguration delay is highly correlated with the physical layout of the implementation.
  • In Xilinx devices, reconfiguration proceeds column by column.
  • The number of columns in the design's layout is therefore highly correlated with reconfiguration delay.
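The column-by-column point above suggests a simple linear model of reconfiguration delay. The sketch below is illustrative only: the function name and the frame and port parameters are assumptions, and the numbers are placeholders rather than values from the slides or from a Xilinx datasheet.

```python
def reconfiguration_time_s(columns, frames_per_column, bits_per_frame, port_bits_per_s):
    """Estimate the time needed to rewrite `columns` CLB columns.

    Assumes (hypothetically) that every column holds the same number of
    configuration frames and that the configuration port runs at a fixed rate.
    """
    if columns < 0:
        raise ValueError("columns must be non-negative")
    total_bits = columns * frames_per_column * bits_per_frame
    return total_bits / port_bits_per_s


# Placeholder parameters, chosen only to show the linear scaling.
t12 = reconfiguration_time_s(12, frames_per_column=50, bits_per_frame=1000, port_bits_per_s=50e6)
t24 = reconfiguration_time_s(24, frames_per_column=50, bits_per_frame=1000, port_bits_per_s=50e6)
```

Under this model, doubling the number of columns doubles the delay, which is why the column count of the layout is used as the proxy for reconfiguration time.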

SLIDE 4

Project Description

Task: Implementation on FPGA devices

Objective: study the trade-off between
  • Performance (application clock frequency)
  • Configuration time (number of CLB columns)
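One way to read this objective is as a total-time comparison between candidate layouts: a wider layout may clock faster but takes longer to configure. The following is a minimal sketch under that reading; `seconds_per_column`, the cycle counts, and the candidate layouts are hypothetical and do not come from the experiments.

```python
def total_time_s(columns, clock_hz, cycles, seconds_per_column):
    """Configuration time plus execution time for one task invocation."""
    return columns * seconds_per_column + cycles / clock_hz


def best_layout(candidates, cycles, seconds_per_column):
    """Pick the (columns, clock_hz) pair with the smallest total time."""
    return min(
        candidates,
        key=lambda c: total_time_s(c[0], c[1], cycles, seconds_per_column),
    )


# Hypothetical layouts: a narrow, slower one and a wide, faster one.
candidates = [(10, 100e6), (16, 125e6)]
short_task = best_layout(candidates, cycles=1e5, seconds_per_column=1e-3)
long_task = best_layout(candidates, cycles=1e7, seconds_per_column=1e-3)
```

With these placeholder numbers the short task favors the 10-column layout, because configuration time dominates, while the long task favors the 16-column layout, because the higher clock frequency pays back the extra configuration time.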

SLIDE 5

FPGA-based Compilation Flow

  • Hardware Description Language: VHDL
  • Simulation: ModelSim SE 5.7g
  • Synthesis: Synplify Pro 7.6.1
  • Place and Route: Xilinx Place and Route Tools + Xilinx COREGen
  • Write to Chip

Report performance and area

SLIDE 6

Experimental Analysis

  • Metrics used (provided by the Xilinx Place and Route tools):
    – CLB columns constrained
    – Maximum clock frequency
    – Maximum pin delay
    – Average delay of 10 worst nets
  • Applications:
    – Matrix Multiply
    – Fast Fourier Transform
    – 2-D Discrete Cosine Transform
    – JPEG
    – Others: CORDIC, Multiply Accumulator, Comb Filter, etc.
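The "average delay of 10 worst nets" metric can be illustrated with a short sketch. It assumes the per-net delays are already available as a plain list of seconds; how they are exported from the place-and-route report is not covered by the slides, and the function name is hypothetical.

```python
import heapq


def average_worst_net_delay_s(net_delays_s, n=10):
    """Average the n largest net delays (all of them if fewer than n exist)."""
    if not net_delays_s:
        raise ValueError("no net delays given")
    worst = heapq.nlargest(n, net_delays_s)
    return sum(worst) / len(worst)


# Made-up delays of 1 ns to 20 ns; the ten worst are 11 ns to 20 ns.
example_delays = [i * 1e-9 for i in range(1, 21)]
example_average = average_worst_net_delay_s(example_delays)
```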

SLIDE 7

Experimental Data: Matrix Multiplier

Figure: Matrix Multiplier Clock Frequency vs. CLB Columns. x-axis: Physical Constraint (Number of CLB Columns): 10, 12, 14, 16, Whole Chip; y-axis: Maximum Clock Frequency (Hz).

Figure: Matrix Multiplier Delays and Clock Period. x-axis: Physical Constraint (Number of CLB Columns): 10, 12, 14, 16, Whole Chip; series: Minimum Clock Period (s), Maximum Pin Delay (s), Worst 10 Net Delays (s).

Figures: Matrix Multiplier constrained at 12 columns; Matrix Multiplier unconstrained.

SLIDE 8

Experimental Data: Fast Fourier Transform

Figure: FFT Clock Frequency vs. CLB Columns. x-axis: Physical Constraint (Number of CLB Columns): 16, 20, 24, 28, 32, Whole Chip; y-axis: Maximum Clock Frequency (Hz).

Figure: FFT Delays and Clock Period. x-axis: Physical Constraint (Number of CLB Columns): 16, 20, 24, 28, 32, Whole Chip; series: Minimum Clock Period (s), Maximum Pin Delay (s), Worst 10 Net Delay (s).

Figures: FFT constrained at 20 columns; FFT unconstrained.

SLIDE 9

Experimental Data: 2-D Discrete Cosine Transform

Figure: 2-D Discrete Cosine Transform Clock Frequency vs. CLB Columns. x-axis: Physical Constraint (Number of CLB Columns): 12, 16, 20, 24, 28, Whole Chip; y-axis: Maximum Clock Frequency (Hz).

Figure: 2-D Discrete Cosine Transform Delays and Clock Period. x-axis: Physical Constraint (Number of CLB Columns): 12, 16, 20, 24, 28, Whole Chip; series: Minimum Clock Period (s), Maximum Pin Delay, Worst 10 Net Delays.

Figures: 2DCT constrained at 28 columns; 2DCT unconstrained.

SLIDE 10

Experimental Data

Application                   Min. CLB   Min. Clock   Max. Clock       Max Pin     Worst 10 Net
                              Columns    Period (s)   Frequency (Hz)   Delay (s)   Delay (s)
Direct Digital Synthesizer    2          4.532E-09    2.207E+08        1.810E-09   1.233E-09
Sine/Cosine Look Up Table     2          0.000E+00    0.000E+00        1.677E-09   1.120E-09
Multiply Accumulator          2          5.443E-09    1.837E+08        3.060E-09   2.388E-09
Cascaded Int. Comb Filter     2          3.380E-09    2.959E+08        1.461E-09   1.009E-09
1-D Disc. Cosine Transform    2          4.857E-09    2.059E+08        2.835E-09   2.360E-09
Digital Down Converter        4          8.373E-09    1.194E+08        3.108E-09   2.377E-09
CORDIC                        4          8.453E-09    1.183E+08        2.876E-09   2.288E-09
Matrix Multiplier             10         6.466E-09    1.547E+08        4.235E-09   3.567E-09
FFT 1024                      12         9.312E-09    1.074E+08        5.462E-09   4.724E-09
2-D Disc. Cosine Transform    14         6.923E-09    1.444E+08        4.040E-09   3.382E-09
FFT                           16         1.053E-08    9.501E+07        6.711E-09   5.617E-09
FFT 256                       20         7.571E-09    1.321E+08        5.228E-09   3.702E-09
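The maximum clock frequency and minimum clock period on this slide should be reciprocals of each other. The sketch below checks that relationship on the reported values; the 0.1% tolerance is an assumption that allows for the four-digit rounding, and the Sine/Cosine Look Up Table row is left out because both of its values are reported as zero.

```python
# (application, minimum clock period in s, maximum clock frequency in Hz)
ROWS = [
    ("Direct Digital Synthesizer", 4.532e-09, 2.207e08),
    ("Multiply Accumulator", 5.443e-09, 1.837e08),
    ("Cascaded Int. Comb Filter", 3.380e-09, 2.959e08),
    ("1-D Disc. Cosine Transform", 4.857e-09, 2.059e08),
    ("Digital Down Converter", 8.373e-09, 1.194e08),
    ("CORDIC", 8.453e-09, 1.183e08),
    ("Matrix Multiplier", 6.466e-09, 1.547e08),
    ("FFT 1024", 9.312e-09, 1.074e08),
    ("2-D Disc. Cosine Transform", 6.923e-09, 1.444e08),
    ("FFT", 1.053e-08, 9.501e07),
    ("FFT 256", 7.571e-09, 1.321e08),
]


def frequency_from_period_hz(period_s):
    """Maximum clock frequency implied by a minimum clock period."""
    return 1.0 / period_s


def relative_error(reported, derived):
    return abs(reported - derived) / reported


# Rows whose reported frequency is not 1/period within the tolerance.
mismatches = [
    name
    for name, period_s, freq_hz in ROWS
    if relative_error(freq_hz, frequency_from_period_hz(period_s)) > 1e-3
]
```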

SLIDE 11

Application: JPEG

JPEG encoding steps:
  • Image Block (8 x 8 Pixels)
  • RGB->YCrCb
  • 2-D Disc. Cosine Transform
  • Quantize
  • Encoding

JPEG decoding steps:
  • Decoding
  • Inverse Quantize
  • Inverse 2-D Disc. Cosine Transform
  • YCrCb->RGB
  • Image Block (8 x 8 Pixels)
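The transform and quantization stages of this pipeline can be sketched in software for a single 8 x 8 block. This is a plain-Python illustration, not the hardware cores evaluated in the project: it assumes an orthonormal DCT-II and one uniform quantization step for all coefficients (real JPEG uses a per-coefficient table), and it leaves out color conversion and entropy coding. All function names are hypothetical.

```python
import math

N = 8  # JPEG block size


def _alpha(k):
    # Scale factors that make the DCT-II orthonormal.
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def dct_1d(v):
    return [
        _alpha(k) * sum(v[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N))
        for k in range(N)
    ]


def idct_1d(c):
    return [
        sum(_alpha(k) * c[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for k in range(N))
        for n in range(N)
    ]


def _transpose(m):
    return [list(row) for row in zip(*m)]


def apply_2d(block, transform_1d):
    """Apply a 1-D transform to every row, then to every column."""
    rows = [transform_1d(r) for r in block]
    return _transpose([transform_1d(c) for c in _transpose(rows)])


def quantize(coeffs, step):
    return [[round(c / step) for c in row] for row in coeffs]


def dequantize(levels, step):
    return [[level * step for level in row] for row in levels]


def roundtrip(block, step):
    """2-D DCT -> quantize -> inverse quantize -> inverse 2-D DCT."""
    levels = quantize(apply_2d(block, dct_1d), step)
    return apply_2d(dequantize(levels, step), idct_1d)
```

For example, `roundtrip([[128] * 8 for _ in range(8)], 16)` reproduces a flat block, because only the DC coefficient is non-zero and it is a multiple of the step.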

SLIDE 12

Application: JPEG

Figure: JPEG Application Frequencies. x-axis: Applications (XAPP637 RGB to YCbCr, 2-D Disc. Cosine Transform, XAPP615 Quantization, XAPP615 Inverse-Quantization, Inverse 2-D Disc. Cosine Transform, XAPP238 YCrCb to RGB); y-axis: Frequency (Hz).

Figure: JPEG Clock Period and Delays. x-axis: Applications (same six as above); series: Clock Period (s), Max Pin Delay, Worst 10 Net Delay.

SLIDE 13

Conclusion

  • Studied the trade-off between reconfiguration delay and performance when implementing applications on an FPGA device.
  • Compared performance at different layout areas for each implementation.
  • Results show the following:
    – In several cases, a more relaxed area constraint lets the tool improve performance; in other cases it does not, for the following reasons:
      • I/O-dominated applications
      • FPGA CAD tools are not yet mature enough to try a small area for better performance