 
A Framework for Layout-Level Logic Restructuring
Hosung Leo Kim, John Lillis
Motivation: Logical-to-Physical Disconnect
[Diagram: Logic-level Optimization hands a fixed netlist to Physical-level Optimization; the hand-off is labeled "disconnect". Physical-level optimization is limited by the structure obtained from the logic level.]
• Performance is determined largely by physical-level interconnect delay.
• Problem: timing optimization at the logic level ≠ actual performance.
Past Layout-Driven Restructuring Work: Replication Based
• Basic Operations:
  – Gate Splitting
  – Fanout Partitioning; Enables “Path Straightening”
• [Schabas, Brown. ISFPGA03]
• [Beraudo, Lillis. DAC03]
• [Hrkic, Lillis, Beraudo. TCAD06]
• [Chen, Cong. ISFPGA05]
Limitation of Logic Replication
• While interconnect delay can be significantly reduced, the LUT-depth of a path remains unchanged.
• The LUT-depth is typically determined by a technology mapper, which does not have an accurate view of critical paths.
Candidate: Remapping
Other Work
• Redundant Wires (e.g., [Chang, Cheng, Suaris, Marek-Sadowska. DAC00])
  – Rewire connections while keeping logical equivalence.
  – Predictable, but optimization scope limited.
• [Lin, Jagannathan, and Cong. ISFPGA03]
  – Remap based on placement-level timing analysis.
  – Significant restructuring, but placement of remapped cells determined by initial placement (not simultaneous).
• [Singh and Brown. Integration07]
  – Shannon’s expansion / precomputation.
  – Allows late signals to skip logic levels, but relatively local in nature.
Objectives
• Overcome limitations of basic replication (e.g., fixed LUT-depth)
• Large and flexible remapping space
• Explicitly account for placement freedom of remapped LUTs
• Tight coupling with placement
Components of Approach (FPGA Domain)
• Placement-Level Static Timing Analysis
• Timing-Critical Fan-in Cone Extraction
• Induce Replication Tree [Hrkic,TCAD06]
Components of Approach (cont’d)
• Replication tree
• Recursive, Exhaustive Ashenhurst LUT Decomposition
• Subject Graph (Choice Tree)
• Mapper and Embedder (Dynamic Programming)
• Legalizer
[Figure: (a) Given LUT-tree; (b) Choice tree, with choice nodes marked]
Remapping Example
[Figure: (a) Given LUT-tree; (b) “Mini-LUT” tree after LUT-decompositions; (c) Alternative mapping; (d) Corresponding LUT-tree]
Functional Decomposition
• Simple Disjoint Functional Decomposition
• Test for decomposability
  – Ashenhurst’s theorem
• Recursively decompose
[Figure: decomposition chart of f(w, x, y, z) indexed by wz and xy; block diagram in which g takes x and y and feeds 1 bit ("Simple") into f together with w and z, so the two input sets are disjoint]
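
The decomposability test can be illustrated with a small Python sketch. It uses the standard statement of Ashenhurst's theorem: f has a simple disjoint decomposition f = F(g(bound inputs), free inputs) with a 1-bit g exactly when the decomposition chart has at most two distinct columns. The function names, the 0/1-callable encoding of f, and the example function are assumptions made for illustration; they do not come from the slides.

```python
from itertools import product


def column_multiplicity(f, n, bound):
    """Count distinct columns of the decomposition chart of f.

    f     : callable taking n arguments, each 0 or 1, returning 0 or 1
    bound : indices of the inputs that would feed the sub-function g
    The remaining (free) inputs index the rows of the chart.
    """
    bound = list(bound)
    free = [i for i in range(n) if i not in bound]
    columns = set()
    for b_vals in product((0, 1), repeat=len(bound)):
        column = []
        for a_vals in product((0, 1), repeat=len(free)):
            x = [0] * n
            for idx, val in zip(bound, b_vals):
                x[idx] = val
            for idx, val in zip(free, a_vals):
                x[idx] = val
            column.append(f(*x))
        columns.add(tuple(column))
    return len(columns)


def has_simple_disjoint_decomposition(f, n, bound):
    """Ashenhurst test: at most two distinct columns means a 1-bit g exists."""
    return column_multiplicity(f, n, bound) <= 2


# Hypothetical example: f(w, x, y, z) = ((x XOR y) AND w) OR z
def example_f(w, x, y, z):
    return ((x ^ y) & w) | z
```

With inputs ordered (w, x, y, z), the bound set {x, y} (indices 1 and 2) passes the test because f only sees x XOR y, while the bound set {w, x} (indices 0 and 1) does not.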
All Recursive Decompositions
[Figure: all recursive decompositions of an example LUT, built from sub-functions g1 to g8]
Choice Tree [Lehman,TCAD97]
[Figure: (a) Given LUT-tree; (b) Choice tree, with choice nodes marked]
Algorithms
• Mini-LUT Tree Mapping
• Fan-in Tree Embedding [Hrkic,TCAD06]
• Simultaneous Remapping and Embedding
Logic Remapping Formulation
• Formulation
  – Given a “mini-LUT” tree and arrival times at the leaves,
  – map the tree to K-input LUTs, minimizing cost subject to an arrival time constraint at the root.
[Figure: example mini-LUT tree with nodes a to m]
Solution Signature
• (c, a): for a sub-tree rooted at u, a solution is characterized by two parameters:
  – c: cost of the embedding (and remapping) of the sub-tree.
  – a: arrival time at u.
• Dominance Relation
  – (c, a) is not dominated by (c′, a′) when c is better than c′ or a is better than a′.
[Figure: solutions plotted in the cost / arrival-time plane]
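
The dominance relation can be made concrete with a short pruning routine. This is a minimal Python sketch, assuming that "better" means smaller for both cost and arrival time and that ties are resolved by keeping a single copy; the function name `non_dominated` is hypothetical.

```python
def non_dominated(solutions):
    """Return the non-dominated (cost, arrival) pairs, sorted by cost.

    A pair is dropped when some other pair is at least as good in both
    cost and arrival time (lower is better for both).
    """
    kept = []
    best_arrival = float("inf")
    # Sorting by (cost, arrival) means every pair seen earlier has
    # cost <= the current cost, so only the arrival time needs checking.
    for cost, arrival in sorted(set(solutions)):
        if arrival < best_arrival:
            kept.append((cost, arrival))
            best_arrival = arrival
    return kept


frontier = non_dominated([(19, 13), (20, 11), (22, 10), (21, 12), (23, 10)])
```

Here (21, 12) is dominated by (20, 11) and (23, 10) by (22, 10), so `frontier` holds the three remaining pairs, each of which trades cost against arrival time.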
Solution Sets
• $S^{J}_i[u] = \{(c, a)\}$
  – u: signal produced by root LUT
  – i: # inputs of root LUT
  – c: # LUTs in subtrees
  – a: the latest arrival time among the fan-ins.
  – Example: fan-ins h and i with solutions (0,6) and (0,2) give $S^{J}_2[u] = \{(0,6)\}$.
• $S_i[u]$
  – “finalized” solution from $S^{J}_i[u]$.
  – c: # LUTs in subtrees + 1
  – a: the root LUT included.
  – Example: $S_2[u] = \{(1,7)\}$.
• $S[u]$
  – non-dominated_sol($S_2[u], \ldots, S_K[u]$)
$S^{J}_i[u]$ Example
For simplicity:
• one LUT = one unit cost
• one LUT = one unit delay
$S_i[u]$ and $S[u]$ Example
• $S_i[b]$
• $S[b]$ = non-dominated_sol($S_2[b], \ldots, S_K[b]$) = {(1,7)}
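Computation of $S^{J}_i[u]$
For a node u with children L and R, where L supplies i inputs of the root LUT and R supplies the remaining K − i:
• i = 1: no collapsing of u and L.
• i = K − 1: no collapsing of u and R.
• Otherwise: collapsing of u, L, and R.
[Figure: (a) u with children L and R; (b) i = 1, (c) i = 2, (d) i = 3 (= K − 1), for K = 4 and children a and b]
$S^{J}_4[u] = \mathrm{join}(S[a], S^{J}_3[b]) \cup \mathrm{join}(S^{J}_2[a], S^{J}_2[b]) \cup \mathrm{join}(S^{J}_3[a], S[b])$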
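
The recurrence can be sketched as a bottom-up dynamic program. The Python below is an illustrative sketch rather than the authors' implementation. It assumes a binary mini-LUT tree encoded as nested pairs with leaf arrival times as numbers, unit cost and unit delay per LUT, and a `join` that adds costs and takes the later arrival time. All names (`prune`, `join`, `solve`, `side`) are hypothetical.

```python
def prune(sols):
    """Keep only non-dominated (cost, arrival) pairs; lower is better."""
    kept, best_arrival = [], float("inf")
    for cost, arrival in sorted(set(sols)):
        if arrival < best_arrival:
            kept.append((cost, arrival))
            best_arrival = arrival
    return kept


def join(left, right):
    """Combine fan-in solutions: costs add, the later arrival wins."""
    return prune([(c1 + c2, max(a1, a2)) for c1, a1 in left for c2, a2 in right])


def solve(node, K):
    """Return (S, S_open) for a node of a binary mini-LUT tree.

    A leaf is a number (its arrival time); an internal node is a pair
    (left, right). S_open[k] is the not-yet-finalized set for a root LUT
    with k inputs; S is the finalized, pruned set over k = 2..K.
    """
    if not isinstance(node, tuple):
        return [(0, node)], {}
    (S_l, open_l), (S_r, open_r) = solve(node[0], K), solve(node[1], K)

    def side(S, S_open, inputs):
        # One input: the child stays its own finished LUT (or is a leaf).
        # More inputs: the child is collapsed into the root LUT.
        return S if inputs == 1 else S_open.get(inputs, [])

    S_open = {}
    for k in range(2, K + 1):
        candidates = []
        for i in range(1, k):
            candidates += join(side(S_l, open_l, i), side(S_r, open_r, k - i))
        S_open[k] = prune(candidates)
    # Finalize: count the root LUT (unit cost) and add its unit delay.
    S = prune([(c + 1, a + 1) for sols in S_open.values() for c, a in sols])
    return S, S_open


tree = ((6, 2), 3)          # leaves arrive at times 6, 2 and 3
S_k4, open_k4 = solve(tree, 4)
S_k2, _ = solve(tree, 2)
```

For this small tree, K = 4 lets the two mini-LUTs collapse into one LUT, giving the single solution (1, 7), while K = 2 forces two LUTs in series and gives (2, 8).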
Remapping Algorithm Example
[Figure: (a) Subject Tree, a mini-LUT tree with nodes a to m; leaves are annotated with (cost, arrival time) pairs (0,4), (0,6), (0,2), (0,3), (0,2), (0,1), (0,4)]
Tree Embedding [Hrkic,TCAD06]
[Figure: the Embedding Algorithm takes a tree topology, arrival times (e.g., (0,4), (0,3), (0,2)), pin locations, a target layout graph, and cost metrics, and produces an embedded tree]
Simultaneous Remapping and Embedding
• Formulation
  – Given a “mini-LUT” tree with fixed leaves and root, arrival times at the leaves, and a target layout graph,
  – simultaneously map the tree to K-input LUTs and embed it.
Solution Set $S^{J}_i[u][v]$
• The remapped root produces signal u and is placed at v in the target layout graph.
Solution Set $S_i[u][v]$
• Solutions $S^{J}_i[u][w]$ are finalized and drive vertex v in the target layout graph.
• Computed by a shortest weight-constrained path algorithm.
[Figure: a LUT producing u from inputs h, i, j, k, placed at vertex w and routed to vertex v]
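
One way to picture this step is as a multi-label path search that carries (cost, arrival) pairs across the layout graph. The slides name only a "shortest weight-constrained path algorithm", so the label-correcting scheme below is an assumed stand-in and not necessarily the method used. The graph encoding, the wire cost and delay values, and the names `prune` and `propagate` are hypothetical.

```python
from collections import deque


def prune(sols):
    """Keep only non-dominated (cost, arrival) pairs; lower is better."""
    kept, best_arrival = [], float("inf")
    for cost, arrival in sorted(set(sols)):
        if arrival < best_arrival:
            kept.append((cost, arrival))
            best_arrival = arrival
    return kept


def propagate(graph, seeds):
    """Spread finalized solutions from their placement vertices.

    graph : {vertex: [(neighbor, wire_cost, wire_delay), ...]}
    seeds : {vertex w: [(cost, arrival), ...]} solutions finalized at w
    Returns {vertex v: non-dominated (cost, arrival) pairs for driving v}.
    Assumes non-negative wire costs and delays.
    """
    labels = {v: prune(s) for v, s in seeds.items()}
    queue = deque(labels)
    while queue:
        v = queue.popleft()
        for w, wire_cost, wire_delay in graph.get(v, []):
            extended = [(c + wire_cost, a + wire_delay) for c, a in labels[v]]
            merged = prune(labels.get(w, []) + extended)
            if merged != labels.get(w, []):
                labels[w] = merged   # w improved, so revisit its neighbors
                queue.append(w)
    return labels


# Hypothetical layout: a cheap but slow two-hop route and a fast, costly edge.
layout = {"w": [("x", 1, 3), ("v", 5, 1)], "x": [("v", 1, 3)]}
result = propagate(layout, {"w": [(1, 7)]})
```

In this example vertex v ends up with two non-dominated ways of being driven from w, (3, 13) via x and (6, 8) via the direct edge, which is the kind of cost/arrival trade-off the solution sets keep.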
Solution Set $S[u][v]$
• $S[u][v] \leftarrow$ non-dominated-sol($S_2[u][v], \ldots, S_K[u][v]$)
• The best remapping, regardless of the number of inputs, at v in the target layout graph.
Simultaneous Remapping and Embedding Example
[Figure: mini-LUT tree with nodes a to m embedded in the layout graph, root a at vertex $v_{23}$; solutions plotted in the cost / arrival-time plane]
• $S_4[a][v_{23}] = \{(22,10)\}$
• $S[a][v_{23}] = \{(19,13), (20,11), (22,10)\}$
Experiment
• Benchmarks
  – 20 MCNC benchmark circuits
  – At least 20% white space
• Comparisons
  – Timing-driven VPR placer
  – Replication Tree embedder
  – Arbor embedder [Kim,GLSVLSI06]
  – Remapping embedder
• Criteria of Interest
  – LUT depth
  – Clock period of circuits
• Different logic-level mappers and stability effect of the new algorithm
Optimization Flow
• Initial Netlist & Placement
• Static Timing Analysis & Replication Tree Construction
• Modified Netlist
• Tree Embedding
  – Repl Tree embedder
  – Remapping embedder
• Post-Processing & Legalization
• Modified Netlist & Placement
LUT Depth Changes
            Init. ckt            New ckt
ckt         ckt    Crit. Path    ckt    Crit. Path
ex5p        9      8             9      7
tseng       12     12            12     11
apex4       8      8             9      8
misex3      9      8             9      8
alu4        9      9             9      8
diffeq      14     14            13     10
dsip        8      5             8      4
seq         8      8             9      7
apex2       10     10            10     9
s298        16     16            14     13
des         8      5             8      4
bigkey      5      4             5      4
frisc       23     20            22     22
spla        10     10            10     9
elliptic    17     15            18     15
ex1010      10     10            10     9
pdc         11     10            11     8
s38417      10     10            10     8
s38584.1    10     5             10     5
clma        15     14            14     13
Routed Clock Period
[Figure: bar chart of normalized routed clock period per benchmark circuit (axis 0 to 1.2) for VPR, Repl, Arbor, and Remap]
• Average Normalized Clock Period
              T-VPR   Repl    Arbor   Remap
  Avg Delay   1       0.886   0.848   0.826
• Max reduction of REMAP vs Arbor: 11.7%
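
As a quick check, the average reductions relative to T-VPR follow directly from the normalized averages. The arithmetic below uses only the four reported values; the last line, the average gap between Remap and Arbor, is a derived figure and is separate from the 11.7% maximum, which is a per-circuit number.

```latex
\begin{align*}
\text{Repl vs. T-VPR:}  \quad & 1 - 0.886 = 0.114 \quad (11.4\%) \\
\text{Arbor vs. T-VPR:} \quad & 1 - 0.848 = 0.152 \quad (15.2\%) \\
\text{Remap vs. T-VPR:} \quad & 1 - 0.826 = 0.174 \quad (17.4\%) \\
\text{Remap vs. Arbor:} \quad & \frac{0.848 - 0.826}{0.848} \approx 0.026 \quad (2.6\%)
\end{align*}
```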
Different Logic-level Mappers and Stability Effect of Remap
• FlowMap: optimal depth.
• FlowMap-r: relaxed depth.
• ZMap: optimal depth with simultaneous area minimization.
• Praetor: minimized area.
• Daomap
[Figure: bar chart for circuit seq comparing FlowMap, FlowMap-r, ZMap, Praetor, and Daomap under VPR, Repl, and Remap (axis 0 to 90); annotations "Span: 12%" and "Span: 4%"]
Summary
• Study of layout-level restructuring for interconnect optimization.
  – Functional Decomposition
  – Choice Tree
  – Remapping Algorithm
  – Simultaneous remapping and embedding
• Experimental Result
  – Average 17% reduction in clock period compared with T-VPR.
Thank You!