SLIDE 1

On the Sensitivity of FPGA Architectural Conclusions to Experimental Assumptions, Tools, and Techniques

Andy Yan, Rebecca Cheng, Steve Wilton
University of British Columbia
Vancouver, B.C.
stevew@ece.ubc.ca

SLIDE 2

FPGA Experiments:

Impressive improvement in FPGA Technology:

1994: 25,000 Gates was good
2001: 6,000,000 System Gates

How did this happen?

Improvements in process technology
Improvements in CAD Tools
Improvements in Architectures

The key behind this: Experimentation

SLIDE 3

The Danger of Experimentation:

No matter how careful you are:

You will have to make some assumptions
You will have to settle on an experimental technique
You will have to settle on a CAD tool

But what if these assumptions, techniques, & tools impact the conclusions… Can we believe any of these results?

SLIDE 4

This Talk:

Take a step back and look at some basic experiments:

What is the best LUT size?
What is the best switch block topology?
What is the best cluster size?
What is the best memory size?

The answers have all been published… But how sensitive are they to the Assumptions, Tools, and Techniques?

SLIDE 5

Question 1: What is the best LUT Size?

SLIDE 6

What is the best LUT size?

Intuitively, in terms of area:

A smaller LUT takes up less chip area
But more of them are required for a circuit

Intuitively, in terms of delay:

A smaller LUT is faster
But the critical path passes through more of them (and also through the routing!)

Published results: 4-6 inputs in each LUT is a good choice

SLIDE 7

Baseline Experiment:

[Figure: experimental flow. Inputs: Benchmark Circuits, Architectures. Stages: Optimize (e.g. SIS), Technology Map (e.g. Flowmap), Place and Route (e.g. VPR). Outputs: Area, Delay.]
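
The way such a flow leads to a "best architecture" conclusion can be sketched as follows: sweep one architectural parameter, record area and delay for each value, and pick the value with the smallest area * delay product. This is only an illustrative sketch. The function names are hypothetical, and the numbers in `measurements` are made-up placeholders, not results from this talk.

```python
# HYPOTHETICAL placeholder data, not measurements from the talk.
# Maps a swept parameter value (here: LUT size) to
# (area in minimum-transistor equivalents, critical path delay in seconds).
measurements = {
    2: (5.5e6, 70e-9),
    3: (5.0e6, 55e-9),
    4: (4.5e6, 45e-9),
    5: (4.8e6, 44e-9),
    6: (5.2e6, 43e-9),
    7: (6.0e6, 42e-9),
}


def area_delay_product(area, delay):
    """Combined metric used on the plots: area multiplied by delay."""
    return area * delay


def best_architecture(results):
    """Return the parameter value with the smallest area * delay product."""
    return min(results, key=lambda k: area_delay_product(*results[k]))


if __name__ == "__main__":
    print("best parameter value:", best_architecture(measurements))
```

Changing the tools, circuits, or assumptions changes the entries of `measurements`, and therefore possibly the value that `best_architecture` returns, which is the sensitivity this talk examines.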

SLIDE 8

Technology-Mapping Tool

[Figure: experimental flow. Inputs: Benchmark Circuits, Architectures. Stages: Optimize (e.g. SIS), Technology Map (e.g. Flowmap), Place and Route (e.g. VPR). Outputs: Area, Delay.]

How sensitive is this to the tools?

SLIDE 9

Sensitivity to Technology Mapper

[Plot: y-axis Critical Path Delay (s) * Area (MTE's), 0.05 to 0.30; x-axis LUT Size, 2 to 7; series: Flowmap (Baseline)]

SLIDE 10

Sensitivity to Technology Mapper

[Plot: y-axis Critical Path Delay (s) * Area (MTE's), 0.05 to 0.30; x-axis LUT Size, 2 to 7; series: Flowmap (Baseline), Cutmap]

SLIDE 11

Sensitivity to Technology Mapper

[Plot: y-axis Critical Path Delay (s) * Area (MTE's), 0.05 to 0.30; x-axis LUT Size, 2 to 7; series: Chortle, Flowmap (Baseline), Cutmap]

Conclusion depends on technology-mapper

SLIDE 12

Place and Route Tool

[Figure: experimental flow. Inputs: Benchmark Circuits, Architectures. Stages: Optimize (e.g. SIS), Technology Map (e.g. Flowmap), Place and Route (e.g. VPR). Outputs: Area, Delay.]

How Sensitive are these results?

SLIDE 13

Sensitivity to Place and Route Tool:

[Plot: y-axis Critical Path Delay (s) * Area (MTE's), 0.10 to 0.70; x-axis LUT Size, 2 to 7; series: Normal VPR (Baseline)]

SLIDE 14

Sensitivity to Place and Route Tool:

[Plot: y-axis Critical Path Delay (s) * Area (MTE's), 0.10 to 0.70; x-axis LUT Size, 2 to 7; series: Normal VPR (Baseline), Fast]

SLIDE 15

Sensitivity to Place and Route Tool:

[Plot: y-axis Critical Path Delay (s) * Area (MTE's), 0.10 to 0.70; x-axis LUT Size, 2 to 7; series: UFP, Normal VPR (Baseline), Fast]

SLIDE 16

Sensitivity to Place and Route Tool:

[Plot: y-axis Critical Path Delay (s) * Area (MTE's), 0.10 to 0.70; x-axis LUT Size, 2 to 7; series: Routability-Driven, UFP, Normal VPR (Baseline), Fast]

SLIDE 17

Optimization

[Figure: experimental flow. Inputs: Benchmark Circuits, Architectures. Stages: Optimize (e.g. SIS), Technology Map (e.g. Flowmap), Place and Route (e.g. VPR). Outputs: Area, Delay.]

How sensitive is this to the tools?

SLIDE 18

Optimization Scripts:

[Plot: y-axis Area (MTE's), 3.0x10^6 to 5.5x10^6; x-axis LUT Size, 2 to 7; series: SIS + Flowmap]

SLIDE 19

Optimization Scripts:

[Plot: y-axis Area (MTE's), 3.0x10^6 to 5.5x10^6; x-axis LUT Size, 2 to 7; series: SIS + Flowmap, (SIS + Flowmap)*2]

Optimization of circuits is important!

SLIDE 20

Circuits

[Figure: experimental flow. Inputs: Benchmark Circuits, Architectures. Stages: Optimize (e.g. SIS), Technology Map (e.g. Flowmap), Place and Route (e.g. VPR). Outputs: Area, Delay.]

How sensitive is this to the circuits?

SLIDE 21

Benchmark Circuits:

[Plot: y-axis Critical Path Delay (s) * Area (MTE's), 0.10 to 0.60; x-axis LUT Size, 2 to 7; series: MCNC, Synthesized]

MCNC Circuits behave differently than “real” circuits

SLIDE 22

Quantifying our Results

Want a number that indicates how strongly our conclusions are affected by an experimental variation

Consider an experiment to find the best value of an architectural parameter

Run 1: Baseline
Run 2: Same experiment with one experimental parameter varied

Margin = The difference in conclusion between Run 1 and Run 2

SLIDE 23

Margin: Case 1

[Figure: Area * Delay vs. Sweep of an Architectural Parameter. Curve: RUN 1, with its Best Architecture marked.]

SLIDE 24

Margin: Case 1

[Figure: Area * Delay vs. Sweep of an Architectural Parameter. Curves: RUN 1 and RUN 2, with a single Best Architecture marked and X% and Y% annotated.]

Margin = | X - Y |

SLIDE 25

Margin: Case 2

[Figure: Area * Delay vs. Sweep of an Architectural Parameter. Curves: RUN 1 and RUN 2, each with its own Best Architecture marked and X% and Y% annotated.]

Margin = MAX( X , Y )
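
The two margin cases can be written down as a small function. This is a minimal sketch under stated assumptions: X and Y are taken as given inputs (the percentages annotated on the plots, whose exact definition lives in the figures rather than in this text), and the reading that Case 1 applies when both runs pick the same best architecture and Case 2 when they pick different ones is inferred from the "Best Architecture" labels. The function and parameter names are hypothetical.

```python
def margin(x_pct, y_pct, same_best_architecture):
    """Margin between a baseline run and a varied run, in percent.

    x_pct, y_pct: the X% and Y% values annotated on the plots
        (treated here as inputs; how they are measured is not restated).
    same_best_architecture: True for Case 1 (assumed: both runs select the
        same best architecture), False for Case 2 (assumed: they differ).
    """
    if same_best_architecture:
        # Case 1: Margin = | X - Y |
        return abs(x_pct - y_pct)
    # Case 2: Margin = MAX( X , Y )
    return max(x_pct, y_pct)


if __name__ == "__main__":
    print(margin(10.0, 4.0, same_best_architecture=True))
    print(margin(10.0, 4.0, same_best_architecture=False))
```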

SLIDE 26

Quantifying the Sensitivity:

Categorize Experimental Variations by their Margin:

0%-2%: Not Sensitive
2%-5%: Slightly Sensitive
5%-10%: Sensitive
10%-100%: Very Sensitive
> 100%: Extremely Sensitive

We can have area margins, delay margins, and area * delay margins.
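
The categories above amount to a simple lookup, sketched below. The slide does not say which category a margin exactly on a boundary (2%, 5%, 10%) belongs to; this sketch assumes each range includes its upper bound, which is at least consistent with "> 100%" being the last category. The function name is hypothetical.

```python
def sensitivity_category(margin_pct):
    """Map a margin (in percent) to the sensitivity label from the slide.

    ASSUMPTION: upper bounds are inclusive (e.g. exactly 5% counts as
    "Slightly Sensitive"); the slide leaves boundary handling unspecified.
    """
    if margin_pct < 0:
        raise ValueError("margin must be non-negative")
    if margin_pct <= 2:
        return "Not Sensitive"
    if margin_pct <= 5:
        return "Slightly Sensitive"
    if margin_pct <= 10:
        return "Sensitive"
    if margin_pct <= 100:
        return "Very Sensitive"
    return "Extremely Sensitive"


if __name__ == "__main__":
    for m in (1.0, 8.5, 76, 301):
        print(m, "->", sensitivity_category(m))
```

The same function applies to area margins, delay margins, and area * delay margins alike, since only the percentage is categorized.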

SLIDE 27

Margin Results: Summary

I’ll leave a paper with tabulated results, but here are the variations that had a margin > 5%:

Using Chortle instead of Flowmap: 76%
Optimize and Tech Map circuits twice: 8.5%
Use Routability-Driven Place and Route: 301%
Synthesized Circuits rather than MCNC circuits: 11%
Multiply Minimum Channel width by 1.1: 5.4%
Use Fc=0.3 rather than Fc=0.6: 5.7%
Use Fc=0.4 rather than Fc=0.6: 11%
Use Fc=0.7 rather than Fc=0.6: 5.5%
Use Fc=0.8 rather than Fc=0.6: 11%
Use Segments of Length 1 instead of 4: 8.5%

SLIDE 28

Question 2: What is the best Switch Block Topology?

SLIDE 29

What is the best Switch Block?

Published Switch Blocks

Disjoint switch block (Xilinx)
Universal switch block
Wilton switch block
Imran switch block (combination of Wilton and Disjoint block)

Our FPL paper showed the Imran block was good:

Unlike disjoint, it does not divide routing fabric into segments
Unlike Wilton, it does not suffer from extra transistors in segmented architectures

SLIDE 30

Sensitivity to Place and Route Tool:

[Plot: y-axis Critical Path Delay (s) * Area (MTE's), 0.1 to 0.5; switch blocks compared: Disjoint, Wilton, Universal, Imran; series: VPR (Baseline)]

SLIDE 31

Sensitivity to Place and Route Tool:

[Plot: y-axis Critical Path Delay (s) * Area (MTE's), 0.1 to 0.5; switch blocks compared: Disjoint, Wilton, Universal, Imran; series: VPR (Baseline), UFP]

SLIDE 32

Sensitivity to Place and Route Tool:

[Plot: y-axis Critical Path Delay (s) * Area (MTE's), 0.1 to 0.5; switch blocks compared: Disjoint, Wilton, Universal, Imran; series: VPR (Baseline), UFP, Routability-Driven]

SLIDE 33

Sensitivity to Place and Route Tool:

[Plot: y-axis Critical Path Delay (s) * Area (MTE's), 0.1 to 0.5; switch blocks compared: Disjoint, Wilton, Universal, Imran; series: VPR (Baseline), UFP, Routability-Driven, Fast]

SLIDE 34

Margin Results: Summary

We did many experiments, but here are the variations that had a margin > 5%:

Use Fast Option of VPR: 6.8%
Use Routability-Driven Place and Route: 320%
Synthesized Circuits rather than MCNC circuits: 7.5%
Implement on Double-Sized FPGA: 7.5%
Use Segments of Length 1 instead of 4: 33%
All switches buffered (instead of 50/50): 6.8%

SLIDE 35

Question 3: How Big should each cluster be?

SLIDE 36

What is the best Cluster (LAB) size?

Intuitively:

A larger cluster (LAB) means more local connections
But a larger cluster is slower and has area overhead

Previous Published Results:

Between 4 and 10 LUTs per cluster seem to work well

SLIDE 37

Sensitivity to Place and Route Tool:

[Plot: y-axis Critical Path Delay (s) * Area (MTE's), 0.1 to 0.6; x-axis Cluster Size, 1 to 10; series: Fast, UFP, VPR (Baseline), Routability]

SLIDE 38

The Main Message is This:

Experimental results can be significantly influenced by the assumptions, tools, and techniques used in experimentation

There are many architecture papers out there:

Very few really address how sensitive their results are to the experimental assumptions (at UBC, we are guilty of this too)

The results in this talk show that they should

SLIDE 39

Orthogonal Architecture Assumptions

[Figure: experimental flow. Inputs: Benchmark Circuits, Architectures. Stages: Optimize (e.g. SIS), Technology Map (e.g. Flowmap), Place and Route (e.g. VPR). Outputs: Area, Delay.]

How sensitive is this to the architecture?

SLIDE 40

Orthogonal Architecture Assumptions:

[Plot: y-axis Area (MTE's), 4.5x10^6 to 6.0x10^6; x-axis LUT Size, 2 to 7; series: Fc=0.6 (baseline)]

SLIDE 41

Orthogonal Architecture Assumptions:

[Plot: y-axis Area (MTE's), 4.5x10^6 to 6.0x10^6; x-axis LUT Size, 2 to 7; series: Fc=1.0, Fc=0.6 (baseline), Fc=0.3]

Conclusion does depend on Fc

SLIDE 42

Sensitivity to Fc:

[Plot: y-axis Critical Path Delay (s) * Area (MTE), 0.06 to 0.14; x-axis Cluster Size, 2 to 10; series: Fc=0.5]

SLIDE 43

Sensitivity to Fc:

[Plot: y-axis Critical Path Delay (s) * Area (MTE), 0.06 to 0.14; x-axis Cluster Size, 2 to 10; series: Fc=0.3, Fc=0.5]

SLIDE 44

Sensitivity to Fc:

[Plot: y-axis Critical Path Delay (s) * Area (MTE), 0.06 to 0.14; x-axis Cluster Size, 2 to 10; series: Fc=0.3, Fc=0.5, Fc=0.7]

SLIDE 45

Sensitivity to Fc:

[Plot: y-axis Critical Path Delay (s) * Area (MTE), 0.06 to 0.14; x-axis Cluster Size, 2 to 10; series: Fc=0.3, Fc=0.5, Fc=0.7, Fc=0.9]

Conclusion does depend on Fc

SLIDE 46

Question 4: What is the best Memory Array Size?

SLIDE 47

What is the best Memory Array Size?

Focus on one previous study which investigated the best memory size when memories are used to implement logic.

Intuitively:

A larger memory can implement more logic
But a larger memory is slower and takes up more area

Previous Published Results:

A 2Kbit memory seems to work well

SLIDE 48

Sensitivity to Packing Tool:

[Plot: y-axis Packing Ratio, 1.0 to 4.0; x-axis Bits Per Array, 256 to 4096; series: SMAP (Baseline), SMAP-d, EMBPACK]

Margin (EMBPACK) = 53%

SLIDE 49