Architecture and Synthesis for Multi-Cycle On-Chip Communication
Jason Cong
VLSI CAD Lab
Computer Science Department
1st challenge: Interconnect delay exceeds gate delay (happened i
Source of “timing closure” problem
Happened in mid 1990s. Addressed by new physical synthesis/prot
ITRS'01 0.07 µm technology: 5.63 GHz across-chip clock, 800 mm² die (28.3 mm x 28.3 mm), IPEM BIWS estimations
Buffer size: 100x; driver/receiver size: 100x
Can travel up to 11.4 mm in
Need 5 clock cycles from
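The five-cycle figure can be checked with a quick calculation. The sketch below rests on two readings of the truncated slide text that are assumptions, not statements from it: that 11.4 mm is the distance a signal can travel within one clock cycle, and that the path in question is the corner-to-corner Manhattan distance of the 28.3 mm x 28.3 mm die. The helper name `cycles_to_cross` is hypothetical.

```python
import math


def cycles_to_cross(distance_mm: float, reach_per_cycle_mm: float) -> int:
    """Whole clock cycles needed to cover distance_mm (hypothetical helper)."""
    return math.ceil(distance_mm / reach_per_cycle_mm)


# Figures quoted on the slide.
clock_ghz = 5.63        # across-chip clock
die_edge_mm = 28.3      # 800 mm^2 die, 28.3 mm x 28.3 mm
reach_mm = 11.4         # ASSUMPTION: reach within one clock cycle

period_ps = 1000.0 / clock_ghz           # roughly 177.6 ps per cycle
corner_to_corner_mm = 2 * die_edge_mm    # ASSUMPTION: Manhattan path, 56.6 mm
cycles = cycles_to_cross(corner_to_corner_mm, reach_mm)

print(f"clock period: {period_ps:.1f} ps")
print(f"corner to corner: {corner_to_corner_mm:.1f} mm -> {cycles} cycles")
```

Under these assumptions 56.6 mm divided by 11.4 mm per cycle rounds up to 5, which agrees with the cycle count on the slide.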
2nd challenge:
Not supported by the current CAD toolset
About to happen soon
7.154 ns
300 MHz
3 clock cycles!
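The three numbers above fit together if 7.154 ns is read as the delay of a cross-chip path and 300 MHz as the target clock; that pairing is an assumption about the figure. A minimal sketch of the conversion from path delay to clock cycles follows, with `cycles_for_delay` as a hypothetical helper name.

```python
import math


def cycles_for_delay(delay_ns: float, clock_mhz: float) -> int:
    """Clock cycles needed for a path with the given delay (hypothetical helper)."""
    period_ns = 1000.0 / clock_mhz   # 300 MHz gives about 3.33 ns
    return math.ceil(delay_ns / period_ns)


path_delay_ns = 7.154   # ASSUMPTION: cross-chip path delay from the figure
clock_mhz = 300.0       # ASSUMPTION: target clock from the figure

print(cycles_for_delay(path_delay_ns, clock_mhz))
```

Since 7.154 ns spans a little more than two periods of about 3.33 ns, the path has to be treated as a three-cycle path.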
Device blocks: MegaRAM Blocks (9); DSP Blocks (22); M4K RAM Blocks (364); M512 RAM Blocks (767); Logic Array Blocks (79,040 LEs)
Multiple clock cycles are needed to cross the chip
Proper placement allows retiming to
Placement 1 (a b c d): before retiming, φ = 5.0; after retiming, φ = 3.0
Placement 2 (a c b d): before retiming, φ = 4.0; after retiming, φ = 4.0
Both placements: d(v) = 1, WL = 6, d(e) ∝ WL
Better initial placement!
Architecture (figure labels): Global Interconnect; Local Computational Cluster (LCC); Register File; ADD, MUX, MUL; Cluster with area constraint
Interconnect labels: 1 cycle, 2 cycle, ..., k cycle
7.154 ns
Example (figure): 12 operations bound to two ALUs and two multipliers
FU          Num  Delay
ALU         2    1 ns
Multiplier  2    2 ns
Interconnect delay: long interconnect 2 ns; short interconnect 1 ns
Binding: Alu1: 1, 5, 10; Alu2: 2, 6, 9; Mul1: 4, 11, 12; Mul2: 3, 7, 8
Schedules (figures): one spanning Cycle 1 to Cycle 6 and one spanning Cycle 1 to Cycle 9
Scheduling-driven placement
Schedule (figure): operations 1 to 12 over Cycle 1 to Cycle 8
Simultaneous placement, scheduling and binding
Schedule (figure): operations 1 to 12 over Cycle 1 to Cycle 7
Binding: Alu1: 1, 5, 10; Alu2: 2, 6, 9; Mul1: 4, 8, 12; Mul2: 3, 7, 11
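To illustrate why placement, binding, and scheduling interact, here is a minimal sketch of a greedy scheduling pass in which a value produced on one functional unit needs extra cycles to reach a consumer on another unit. It is not the algorithm used in the slides and does not reproduce their example: the four-operation graph, the unit coordinates, the cycle counts, and the one-cycle-per-grid-step transfer cost are all invented for illustration, and `schedule` is a hypothetical function name.

```python
def schedule(ops, deps, binding, location, op_cycles, cycles_per_hop=1):
    """Greedy in-order schedule with multi-cycle transfers between units.

    ops:       operation ids in topological order
    deps:      op -> list of predecessor ops
    binding:   op -> functional unit name
    location:  unit name -> (x, y) grid position (hypothetical placement)
    op_cycles: op -> execution cycles on its unit
    """
    start = {}
    unit_free = {}  # first cycle at which each unit is idle again
    for op in ops:
        unit = binding[op]
        ready = 0
        for pred in deps.get(op, []):
            x0, y0 = location[binding[pred]]
            x1, y1 = location[unit]
            # Transfer cost grows with the Manhattan distance between units.
            transfer = cycles_per_hop * (abs(x0 - x1) + abs(y0 - y1))
            ready = max(ready, start[pred] + op_cycles[pred] + transfer)
        start[op] = max(ready, unit_free.get(unit, 0))
        unit_free[unit] = start[op] + op_cycles[op]
    latency = max(start[op] + op_cycles[op] for op in ops)
    return start, latency


# Hypothetical graph: ops 1-3 are multiplications, op 4 is an addition.
ops = [1, 2, 3, 4]
deps = {3: [1], 4: [2, 3]}
binding = {1: "Mul1", 2: "Mul2", 3: "Mul1", 4: "Alu1"}
op_cycles = {1: 2, 2: 2, 3: 2, 4: 1}

near = {"Mul1": (0, 0), "Mul2": (1, 1), "Alu1": (0, 1)}
far = {"Mul1": (0, 0), "Mul2": (3, 3), "Alu1": (0, 1)}

_, latency_near = schedule(ops, deps, binding, near, op_cycles)
_, latency_far = schedule(ops, deps, binding, far, op_cycles)
print(latency_near, latency_far)
```

With the same graph and binding, moving Mul2 away from Alu1 lengthens the schedule. That is the effect a scheduler has to account for once interconnect delay spans several cycles.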
Synthesis flow (figure labels)
Inputs: C / VHDL; RDR Arch. Spec.; target clock period
Steps: CDFG generation; resource allocation; functional unit binding; scheduling-driven placement; placement-driven rebinding & scheduling; register and port binding; datapath & FSM generation
Intermediate data: CDFG; resource constraints; Interconnected Component Graph (ICG); location information
Outputs: RTL VHDL files; floorplan constraints; multi-cycle path constraints
Example shown along the flow:
Functional unit binding (ICG nodes): Mult1, Alu2, Mult2, Alu1
Scheduling-driven placement: Alu1: 1, 5, 10; Alu2: 2, 6, 9; Mul1: 4, 8, 11; Mul2: 3, 7, 12
Placement-driven rebinding & scheduling (Cycle 1 to Cycle 7): Alu1: 1, 5, 10; Alu2: 2, 6, 9; Mul1: 4, 8, 12; Mul2: 3, 7, 11
Weight assignment
Binding: Alu1: 1, 5, 10; Alu2: 2, 6, 9; Mul1: 4, 11, 12; Mul2: 3, 7, 8
Schedule (figure): operations 1 to 12 over Cycle 1 to Cycle 8
Graph labels (figure): X48, X40; +3, *1, +4, *2, *6, *5, +8; est(i, j); cpl(i, j)
Synthesis flow (figure labels)
Inputs: C / VHDL; RDR Arch. Spec.; target clock period
Steps: CDFG generation; functional unit allocation & binding; scheduling-driven placement; placement-driven rebinding & rescheduling; register and port binding; datapath & FSM generation
Other step labels: 1, 2, 3; Scheduling; Placement-driven scheduling
Intermediate data: CDFG; interconnected component graph; location information
Outputs: RTL VHDL files; floorplan constraints; multi-cycle path constraints
Back end: commercial FPGA development system
Cycle number, clock period, and overall latency comparison (charts)
Benchmarks: pr, wang, lee, mcm, honda, dir, chem, u5ml12
Total latency comparison (axis: Latency (ns))
Mapped VHDL for Stratix FPGAs Altera Quartus-II
MCAS basic flow vs. Synopsys’ Behavioral Compiler
Charts: Latency and Resource for pr, wang, mcm, honda; series: Synopsys BC and MCAS
Design  Flow         Cycles  Reg  ALU  MULT  fmax (MHz)  LE    Latency (ns)  MCAS vs. BC
pr      Synopsys BC  25      28   5    8     90.31       2945  276.82
pr      MCAS         27      34   6    2     96.74       2476  279.10        100.82%
wang    Synopsys BC  29      36   7    8     83.61       3605  346.85
wang    MCAS         14      35   5    8     103.76      4242  134.93        38.90%
mcm     Synopsys BC  43      142  23   7     79.65       6253  539.86
mcm     MCAS         34      35   6    3     72.05       3876  471.89        87.41%
honda   Synopsys BC  29      44   8    14    85.14       6128  340.62
honda   MCAS         23      42   6    8     87.11       5523  264.03        77.52%
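The Latency column is consistent with the cycle count divided by fmax, and the last column with the ratio of MCAS latency to Synopsys BC latency. The sketch below recomputes both from the cycle and fmax figures in the table. The pairing of each row pair with a design name follows the order pr, wang, mcm, honda and is an assumption, as is the helper name `latency_ns`.

```python
def latency_ns(cycles: int, fmax_mhz: float) -> float:
    """Total latency in ns: cycles times the clock period (hypothetical helper)."""
    return cycles / fmax_mhz * 1000.0


# (cycles, fmax in MHz) per design and flow, copied from the table.
results = {
    "pr":    {"Synopsys BC": (25, 90.31), "MCAS": (27, 96.74)},
    "wang":  {"Synopsys BC": (29, 83.61), "MCAS": (14, 103.76)},
    "mcm":   {"Synopsys BC": (43, 79.65), "MCAS": (34, 72.05)},
    "honda": {"Synopsys BC": (29, 85.14), "MCAS": (23, 87.11)},
}

ratios = {}
for design, flows in results.items():
    bc = latency_ns(*flows["Synopsys BC"])
    mcas = latency_ns(*flows["MCAS"])
    ratios[design] = mcas / bc
    print(f"{design}: BC {bc:.2f} ns, MCAS {mcas:.2f} ns, MCAS vs. BC {ratios[design]:.2%}")
```

A lower cycle count does not by itself guarantee lower latency: for pr, MCAS reaches a higher fmax but uses two more cycles, so the two flows end up nearly equal.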
Figure labels: Rocket I/O Transceivers; PowerPC 405 (four blocks); Programmable Logic
& up to 22 high performance DSP block
Tools Developed:
Converter: translates SpecC to SDM
Simulator: validates the design in SDM and simulates it at different levels of abstraction
SW code generator: generates C source code from SDM for the target platform
HW code generator: generates VHDL source code from SDM for the target platform
Profiler: generates a profile based on the generated SW/HW system
HW synthesis: MCAS system
Design flow (figure labels): Design Spec.; Simulation; Synthesis
Partitioning; Scheduling; Interface Synthesis; SW synthesis; HW synthesis
Platform Info.; Estimation
SW Code Gen; HW Code Gen; C Code; VHDL; Target SW; Target PLD
MCAS system
Image Fragmentation -> DCT -> Quantization -> Entropy Coding
JPEG: a standard for image compression
DCT: Discrete Cosine Transform (ChenDCT)
Four modes of operation in the JPEG standard:
Sequential DCT-based mode
Progressive DCT-based mode
Lossless mode
Hierarchical mode
Run time (10... and rate (%) per configuration; second value in parentheses as given

Module Name    NIOS(SW)              NIOS(SW+HW1)          NIOS(SW+HW2)          NIOS(SW+HW3)
               time        rate(%)   time        rate(%)   time        rate(%)   time        rate(%)
HandleData     50.31       1.22%     50.31       1.92%     50.31       1.84%     50.31       4.59%
               (19878.67)            (19878.67)            (19878.67)            (19878.67)
DCT            3160.56     76.46%    1641.04     62.78%    1756.67     64.35%    123.51      11.26%
               (316.4)               (609.37)              (569.26)              (8096.46)
Quantization   176.42      4.27%     176.42      6.75%     176.42      6.46%     176.42      16.09%
               (5668.41)             (5668.41)             (5668.41)             (5668.41)
HuffmanEncode  746.29      18.05%    746.29      28.55%    746.29      27.34%    746.29      68.06%
               (1339.96)             (1339.96)             (1339.96)             (1339.96)
Total          4133.57     100.00%   2614.05     100.00%   2729.68     100.00%   1096.52     100.00%
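The rate column is each module's share of the total run time for that configuration. The sketch below recomputes the shares and the overall speedup relative to the software-only run. The row order HandleData, DCT, Quantization, HuffmanEncode is an assumption drawn from the scrambled labels, and the time unit is left unspecified because it is truncated in the source. Column sums differ from the printed totals by 0.01, presumably from rounding.

```python
# ASSUMPTION: rows are in this module order.
modules = ["HandleData", "DCT", "Quantization", "HuffmanEncode"]

# Run times per configuration, copied from the table (unit truncated in source).
run_time = {
    "NIOS(SW)":     [50.31, 3160.56, 176.42, 746.29],
    "NIOS(SW+HW1)": [50.31, 1641.04, 176.42, 746.29],
    "NIOS(SW+HW2)": [50.31, 1756.67, 176.42, 746.29],
    "NIOS(SW+HW3)": [50.31, 123.51, 176.42, 746.29],
}


def shares(times):
    """Percentage of the total run time spent in each module."""
    total = sum(times)
    return [100.0 * t / total for t in times]


baseline = sum(run_time["NIOS(SW)"])
for config, times in run_time.items():
    pct = ", ".join(f"{m} {s:.2f}%" for m, s in zip(modules, shares(times)))
    speedup = baseline / sum(times)
    print(f"{config}: total {sum(times):.2f}, speedup {speedup:.2f}x ({pct})")
```

Only the second row changes across configurations, so the remaining modules bound the achievable gain: in the SW+HW3 column the last row alone accounts for about 68% of the run time.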