Reconfiguration Overhead in Dynamic Task-Based Implementations on FPGAs

SLIDE 1

Reconfiguration Overhead in Dynamic Task-Based Implementations on FPGAs

Padmini Nagaraj UCB, Distributed Mentor Program, Researcher Summer 2004 Professor Elaheh Bozorgzadeh UCI, Distributed Mentor Program, Mentor

SLIDE 2

Padmini Nagaraj - minar@ocf.berkeley.edu


Outline

I. Introduction
II. Project Description
III. Example Application: Matrix Multiplier
IV. Experimental Data
    A. Matrix Multiplier
    B. Fast Fourier Transform
    C. 2-D Discrete Cosine Transform
    D. Multiple Applications
V. Real World Application: JPEG
VI. Conclusion

SLIDE 3

Introduction

Field Programmable Gate Array

Metrics
    Performance Time
    Reconfiguration Time
    Resources Available

Xilinx Virtex 2 XCV2000E
    Partially Reconfigurable by CLB columns

[Figure: Example Xilinx Chip, showing CLB columns]

SLIDE 4

Introduction (cont…)

Hardware Description Language: VHDL
Simulation: ModelSim SE 5.7g
Synthesis: Synplify Pro 7.6.1
Place and Route: Xilinx Place and Route Tools, Project Navigator 6.2.03i
Write to Chip: No Writes to Chip (iMPACT)

SLIDE 5

Project Description

GOAL: Application configuration time vs. performance time.
    Number of CLB Columns
    Application Clock Frequency

Several small to large independent applications

Real world example: JPEG
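The trade-off named in the goal can be sketched with a simple model: the total time for one application is the time to configure the CLB columns it occupies plus the time to run it at its clock frequency. This is only an illustration. The linear per-column cost, `seconds_per_column`, and the cycle count are hypothetical inputs and are not values measured in this project.

```python
def total_time_s(num_clb_columns, clock_hz, cycles, seconds_per_column):
    """Configuration time plus performance time for one application.

    Assumption: reconfiguration cost grows linearly with the number of
    CLB columns. seconds_per_column and cycles are hypothetical inputs.
    """
    config_time = num_clb_columns * seconds_per_column
    performance_time = cycles / clock_hz
    return config_time + performance_time


# Hypothetical numbers chosen only to illustrate the trade-off.
narrow = total_time_s(num_clb_columns=10, clock_hz=150e6,
                      cycles=1_000_000, seconds_per_column=1e-3)
wide = total_time_s(num_clb_columns=40, clock_hz=170e6,
                    cycles=1_000_000, seconds_per_column=1e-3)
print(f"narrow: {narrow:.6f} s, wide: {wide:.6f} s")
```

With these made-up inputs the narrower placement wins even though it clocks slower, because configuration time dominates. With a much longer run the faster clock would eventually pay off.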

SLIDE 6

Example Application: Matrix Multiplier

8 x 8 Matrix Multiplier

Needs Lots of Data!
    a.) BRAMs
    b.) Lots of I/O pins
    c.) Neither

Interested in seeing effects independent of other chip resources

Really slow! Too much time reading inputs. Okay.

SLIDE 7

Example Application: Matrix Multiplier (cont…)

[Figure: Matrix Multiply Block Diagram. Eight 16-bit input pairs, A0[15:0] through A7[15:0] and B0[15:0] through B7[15:0], feed eight multipliers (Mult). Seven adders (Add) sum the products into Result[15:0].]
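As a software sketch of the datapath in the block diagram, the function below computes one output element from eight multiplies and seven adds, and a second function applies it to every row and column of an 8 x 8 product. Two details are assumptions rather than facts from the slides: operands are treated as unsigned, and the 16-bit Result is taken to wrap modulo 2^16. The balanced shape of the adder tree is also a guess.

```python
MASK16 = 0xFFFF  # assumption: the 16-bit Result wraps modulo 2**16


def dot8(a, b):
    """One output element: eight multiplies reduced by seven adds."""
    assert len(a) == 8 and len(b) == 8
    # Eight Mult blocks, operands limited to 16 bits (unsigned assumed).
    level = [(x & MASK16) * (y & MASK16) for x, y in zip(a, b)]
    # Pairwise reduction: 4 + 2 + 1 = 7 Add blocks.
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] & MASK16


def matmul8(A, B):
    """8 x 8 matrix product built from 64 dot8 evaluations."""
    columns = list(zip(*B))
    return [[dot8(row, col) for col in columns] for row in A]


identity = [[1 if r == c else 0 for c in range(8)] for r in range(8)]
M = [[r * 8 + c for c in range(8)] for r in range(8)]
print(matmul8(identity, M) == M)
```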

SLIDE 8

Example Application: Matrix Multiplier (cont…)

1.) Write code
2.) Simulate - Testbench
3.) Synthesis
4.) Place and Route - Constrain Time and Columns

SLIDE 9

Experimental Data

Xilinx CORE Generator
    Intellectual Property of Xilinx

Metrics Used:
    CLB Columns
    Maximum Clock Frequency
    Maximum Pin Delay
    Average Delay of 10 Worst Nets

SLIDE 10

Experimental Data: Matrix Multiplier

[Chart: Matrix Multiplier Clock Frequency vs. CLB Columns. X axis: Physical Constraint (Number of CLB Columns): 10, 12, 14, 16, Whole Chip. Y axis: Maximum Clock Frequency (Hz).]

[Chart: Matrix Multiplier Delays and Clock Period. X axis: Physical Constraint (Number of CLB Columns): 10, 12, 14, 16, Whole Chip. Series: Minimum Clock Period (s), Maximum Pin Delay (s), Worst 10 Net Delays (s).]

SLIDE 11

Experimental Data: Matrix Multiplier (cont…)

[Figure: Matrix Multiplier constrained at 12 columns]
[Figure: Matrix Multiplier unconstrained]

SLIDE 12

Experimental Data: Matrix Multiplier (cont…)

Physical Constraint (number of CLB columns)   10          12          14          16          Whole Chip
Minimum Clock Period (s)                      6.466E-09   6.476E-09   6.496E-09   6.496E-09   5.930E-09
Maximum Clock Frequency (Hz)                  1.547E+08   1.544E+08   1.539E+08   1.539E+08   1.686E+08
Maximum Pin Delay (s)                         4.235E-09   4.174E-09   3.938E-09   4.120E-09   3.787E-09
Worst 10 Net Delays (s)                       3.567E-09   3.692E-09   3.406E-09   3.470E-09   3.396E-09
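The two clock rows on this slide are related by f = 1 / T. The short check below recomputes the maximum clock frequency from each minimum clock period and compares it with the reported value. The numbers are copied from the slide. The dictionary layout and function name are only illustrative.

```python
# Minimum clock period (s) and reported maximum frequency (Hz), keyed by
# physical constraint (number of CLB columns), copied from this slide.
period_s = {"10": 6.466e-9, "12": 6.476e-9, "14": 6.496e-9,
            "16": 6.496e-9, "Whole Chip": 5.930e-9}
reported_hz = {"10": 1.547e8, "12": 1.544e8, "14": 1.539e8,
               "16": 1.539e8, "Whole Chip": 1.686e8}


def max_frequency_hz(min_period_s):
    """Maximum clock frequency is the reciprocal of the minimum period."""
    return 1.0 / min_period_s


for constraint, period in period_s.items():
    computed = max_frequency_hz(period)
    rel_diff = abs(computed - reported_hz[constraint]) / reported_hz[constraint]
    print(f"{constraint:>10}: {computed:.3e} Hz (relative difference {rel_diff:.1e})")
```

The recomputed values agree with the reported ones to within rounding of the four-digit figures.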

SLIDE 13

Experimental Data: Fast Fourier Transform

[Chart: FFT Clock Frequency vs. CLB Columns. X axis: Physical Constraint (Number of CLB Columns): 16, 20, 24, 28, 32, Whole Chip. Y axis: Maximum Clock Frequency (Hz).]

[Chart: FFT Delays and Clock Period. X axis: Physical Constraint (Number of CLB Columns): 16, 20, 24, 28, 32, Whole Chip. Series: Minimum Clock Period (s), Maximum Pin Delay (s), Worst 10 Net Delay (s).]

SLIDE 14

Experimental Data: Fast Fourier Transform (cont…)

[Figure: FFT constrained at 20 columns]
[Figure: FFT unconstrained]

SLIDE 15

Experimental Data: Fast Fourier Transform (cont…)

Physical Constraint (Number of CLB columns)   16          20          24          28          32          Whole Chip
Minimum Clock Period (s)                      1.053E-08   7.214E-09   8.276E-09   8.276E-09   8.170E-09   8.365E-09
Maximum Clock Frequency (Hz)                  9.501E+07   1.386E+08   1.208E+08   1.208E+08   1.224E+08   1.195E+08
Maximum Pin Delay (s)                         6.711E-09   5.545E-09   6.227E-09   5.397E-09   5.864E-09   5.540E-09
Worst 10 Net Delay (s)                        5.617E-09   4.736E-09   5.404E-09   4.778E-09   5.067E-09   4.776E-09

SLIDE 16

Experimental Data: 2-D Discrete Cosine Transform

[Chart: 2-D Discrete Cosine Transform Clock Frequency vs. CLB Columns. X axis: Physical Constraint (Number of CLB Columns): 12, 16, 20, 24, 28, Whole Chip. Y axis: Maximum Clock Frequency (Hz).]

[Chart: 2-D Discrete Cosine Transform Delays and Clock Period. X axis: Physical Constraint (Number of CLB Columns): 12, 16, 20, 24, 28, Whole Chip. Series: Minimum Clock Period (s), Maximum Pin Delay, Worst 10 Net Delays.]

SLIDE 17

Experimental Data: 2-D Discrete Cosine Transform (cont…)

[Figure: 2DCT constrained at 28 columns]
[Figure: 2DCT unconstrained]

SLIDE 18

Experimental Data: 2-D Discrete Cosine Transform (cont…)

Physical Constraint (number of CLB columns)   12          16          20          24          28          Whole Chip
Minimum Clock Period (s)                      7.169E-09   6.349E-09   6.197E-09   6.286E-09   6.163E-09   7.457E-09
Maximum Clock Frequency (Hz)                  1.395E+08   1.575E+08   1.614E+08   1.591E+08   1.623E+08   1.341E+08
Maximum Pin Delay (s)                         4.798E-09   4.208E-09   4.163E-09   4.088E-09   3.707E-09   6.367E-09
Worst 10 Net Delays (s)                       3.667E-09   3.420E-09   3.373E-09   3.295E-09   3.280E-09   5.711E-09

SLIDE 19

Experimental Data: Multiple Applications

[Chart: Multiple Applications Delays. X axis: Applications (FFT 256, 2-D Disc. Cosine, Matrix Multiplier, Digital Down Converter, Cascaded Int. Comb, Sine/Cosine Look Up). Series: Minimum Clock Period (s), Max Pin Delay (s), Worst 10 net Delay (s).]

[Chart: Multiple Applications Frequencies. X axis: Applications (FFT 256, 2-D Disc. Cosine, Matrix Multiplier, Digital Down Converter, Cascaded Int. Comb, Sine/Cosine Look Up). Y axis: Frequency (Hz).]

SLIDE 20

Experimental Data: Multiple Applications (cont…)

Application                  Minimum Number of CLB columns   Minimum Clock Period (s)   Maximum Clock Frequency (Hz)   Max Pin Delay (s)   Worst 10 net Delay (s)
FFT 256                      20                              7.571E-09                  1.321E+08                      5.228E-09           3.702E-09
FFT                          16                              1.053E-08                  9.501E+07                      6.711E-09           5.617E-09
2-D Disc. Cosine Transform   14                              6.923E-09                  1.444E+08                      4.040E-09           3.382E-09
FFT 1024                     12                              9.312E-09                  1.074E+08                      5.462E-09           4.724E-09
Matrix Multiplier            10                              6.466E-09                  1.547E+08                      4.235E-09           3.567E-09
CORDIC                       4                               8.453E-09                  1.183E+08                      2.876E-09           2.288E-09
Digital Down Converter       4                               8.373E-09                  1.194E+08                      3.108E-09           2.377E-09
1-D Disc. Cosine Transform   2                               4.857E-09                  2.059E+08                      2.835E-09           2.360E-09
Cascaded Int. Comb Filter    2                               3.380E-09                  2.959E+08                      1.461E-09           1.009E-09
Multiply Accumulator         2                               5.443E-09                  1.837E+08                      3.060E-09           2.388E-09
Sine/Cosine Look Up Table    2                               0.000E+00                  0.000E+00                      1.677E-09           1.120E-09
Direct Digital Synthesizer   2                               4.532E-09                  2.207E+08                      1.810E-09           1.233E-09

SLIDE 21

Experimental Data: Multiple Applications (cont…)

[Figure: FFT constrained at 16 columns]
[Figure: 1DCT constrained at 2 columns]

SLIDE 22

Real World Application: JPEG

JPEG encoding steps
1.) Image Block 8 x 8 Pixels
2.) RGB->YCrCb
3.) 2-D Disc. Cosine Transform
4.) Quantize
5.) Encoding

JPEG decoding steps
1.) Decoding
2.) Inverse Quantize
3.) Inverse 2-D Disc. Cosine Transform
4.) YCrCb->RGB
5.) Image Block 8 x 8 Pixels
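The first three encoding steps can be sketched in software for a single 8 x 8 block. This is a reference-style illustration, not the hardware cores used in the project. The color conversion uses the common JFIF full-range coefficients, which may differ from those in the Xilinx application notes. The single quantizer step size is a hypothetical stand-in for a real quantization table, and the entropy encoding step is omitted.

```python
import math


def rgb_to_ycbcr(r, g, b):
    """Color conversion for one 8-bit pixel (JFIF coefficients, an assumption)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def dct_2d(block):
    """Direct, unoptimized 8 x 8 forward 2-D DCT (type II)."""
    n = 8

    def scale(k):
        return math.sqrt(0.5) if k == 0 else 1.0

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * scale(u) * scale(v) * total
    return out


def quantize(coeffs, step=16):
    """Uniform quantizer; 'step' is a hypothetical single step size."""
    return [[round(value / step) for value in row] for row in coeffs]


# A flat gray block: every pixel is (200, 200, 200).
luma = [[rgb_to_ycbcr(200, 200, 200)[0] - 128 for _ in range(8)] for _ in range(8)]
quantized = quantize(dct_2d(luma))
print(quantized[0][0], sum(abs(v) for row in quantized for v in row))
```

For a flat block all of the energy lands in the DC coefficient, which is why only the top-left entry is nonzero after quantization.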

SLIDE 23

Real World Application: JPEG (cont…)

[Chart: JPEG Clock Period and Delays. X axis: Applications (XAPP637 RGB to YCbCr, 2-D Disc. Cosine Transform, XAPP615 Quantization, XAPP615 Inverse-Quantization, Inverse 2-D Disc. Cosine Transform, XAPP238 YCrCb to RGB). Series: Clock Period (s), Max Pin Delay, Worst 10 net Delay.]

[Chart: JPEG Application Frequencies. X axis: Applications (XAPP637 RGB to YCbCr, 2-D Disc. Cosine Transform, XAPP615 Quantization, XAPP615 Inverse-Quantization, Inverse 2-D Disc. Cosine Transform, XAPP238 YCrCb to RGB). Y axis: Frequency (Hz).]

SLIDE 24

Real World Application: JPEG (cont…)

Application                          Num of CLB columns   Clock Period (s)   Clock Frequency (Hz)   Max Pin Delay (s)   Worst 10 net Delay (s)
XAPP637 RGB to YCbCr                 2                    8.343E-09          1.199E+08              3.571E-09           2.712E-09
2-D Disc. Cosine Transform           8                    8.249E-09          1.212E+08              4.097E-09           3.121E-09
XAPP615 Quantization                 6                    8.378E-09          1.194E+08              4.950E-09           4.146E-09
XAPP615 Inverse-Quantization         6                    7.376E-09          1.356E+08              4.847E-09           4.026E-09
Inverse 2-D Disc. Cosine Transform   8                    6.580E-09          1.520E+08              3.583E-09           3.368E-09
XAPP238 YCrCb to RGB                 2                    6.469E-09          1.546E+08              3.130E-09           2.377E-09
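One way to read these figures is to aggregate them per pipeline. The sketch below sums the CLB columns of the encoding and decoding stages and finds the slowest stage in each. It assumes that the stages of a pipeline are resident side by side and share one clock, which the slides do not state. The column counts and clock periods are copied from this slide.

```python
# (stage name, CLB columns, clock period in seconds), from this slide.
encode = [("XAPP637 RGB to YCbCr", 2, 8.343e-9),
          ("2-D Disc. Cosine Transform", 8, 8.249e-9),
          ("XAPP615 Quantization", 6, 8.378e-9)]
decode = [("XAPP615 Inverse-Quantization", 6, 7.376e-9),
          ("Inverse 2-D Disc. Cosine Transform", 8, 6.580e-9),
          ("XAPP238 YCrCb to RGB", 2, 6.469e-9)]


def summarize(stages):
    """Total columns, slowest stage, and the clock rate that stage allows.

    Assumption: all stages share a single clock, so the stage with the
    longest period limits the whole pipeline.
    """
    total_columns = sum(columns for _, columns, _ in stages)
    slowest = max(stages, key=lambda stage: stage[2])
    return total_columns, slowest[0], 1.0 / slowest[2]


for label, stages in (("encode", encode), ("decode", decode)):
    columns, name, hz = summarize(stages)
    print(f"{label}: {columns} columns, limited by {name} at {hz:.3e} Hz")
```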

SLIDE 25

Real World Application: JPEG (cont…)

[Figure: XAPP637 constrained at 2 columns]
[Figure: XAPP238 constrained at 2 columns]

SLIDE 26

Real World Application: JPEG (cont…)

[Figure: Quantize constrained at 8 columns]
[Figure: IQuantize constrained at 8 columns]

SLIDE 27

Conclusion

Place and Route Tools
    Lack sufficient intelligence

Density of application affects everything
    Clock period, maximum pin delay and worst 10 net delay

User defined constraints
    Help Place and Route tools
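The point about user defined constraints can be checked against the Experimental Data slides. The sketch below takes the maximum clock frequency reported for each physical constraint, picks the best constrained placement per application, and compares it with the whole-chip result. The frequencies are copied from the slides. The data layout and function name are only illustrative.

```python
# Maximum clock frequency (Hz) per physical constraint (CLB columns),
# copied from the Experimental Data slides.
results = {
    "Matrix Multiplier": {"10": 1.547e8, "12": 1.544e8, "14": 1.539e8,
                          "16": 1.539e8, "Whole Chip": 1.686e8},
    "FFT": {"16": 9.501e7, "20": 1.386e8, "24": 1.208e8,
            "28": 1.208e8, "32": 1.224e8, "Whole Chip": 1.195e8},
    "2-D DCT": {"12": 1.395e8, "16": 1.575e8, "20": 1.614e8,
                "24": 1.591e8, "28": 1.623e8, "Whole Chip": 1.341e8},
}


def best_constraint(freqs):
    """Best constrained placement and its gain over the whole-chip run."""
    constrained = {k: v for k, v in freqs.items() if k != "Whole Chip"}
    best = max(constrained, key=constrained.get)
    gain = constrained[best] / freqs["Whole Chip"] - 1.0
    return best, gain


for app, freqs in results.items():
    columns, gain = best_constraint(freqs)
    print(f"{app}: best at {columns} columns, {gain:+.1%} vs. whole chip")
```

On this data the FFT and the 2-D DCT run faster with a column constraint than without one, while the Matrix Multiplier is fastest when left unconstrained.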