
Advanced Synthesis Techniques, Ramine Roane (slide presentation)



  1. Advanced Synthesis Techniques Ramine Roane

  2. Advanced Synthesis Techniques

  3. Reminder From Last Year
     - Use UltraFast Design Methodology for Vivado
       - www.xilinx.com/ultrafast
     - Recommendations for Rapid Closure
       - HDL: use HDL Language Templates & DRC
       - Constraints: Timing Constraint Wizard, DRC (Tools -> Report -> Report DRC)
       - Iterate in Synthesis (converge within 300ps)
         - Real problems seen post synthesis (long path…)
         - Faster iterations & higher impact
         - Improve area, timing, power
       - Only then, iterate in next steps
         - opt, place, phys_opt, route, phys_opt
     - [Figure callouts: worst path post Synthesis: 4.3ns, 13 levels of logic! Worst path post Route: 4.1ns, 4 levels of logic]

  4. Advanced Synthesis Techniques Overview
     - Advanced Synthesis Techniques for Design Closure
     - Case Study: design closure at Synthesis level

  5. Vivado Synthesis Flow
     - Inputs: VHDL, Verilog, VHDL-2008, SystemVerilog; XDC
       - VHDL-2008, SystemVerilog: more compact (advanced types…), verification friendly (UVM, SVA…)
     - Analyze: syntax check, build file hierarchy
     - Elaborate: design hierarchy, cross-probing, unroll loops
       - Build logic: arithmetic, RAM, FSM, Boolean logic
     - RTL Optimizations: module generators
     - Optimize & Map: Boolean optimization, technology mapping (LUT6)
     - Output: P&R or DCP

  6. • Architecture-Aware Coding • Priority Encoders • Loops • Clocks & Resets • Directives & Strategies • Case Study

  7. Architecture Aware DSP
     - HDL code needs to match DSP hardware (e.g. DSP48E2)
       - Signedness, width of nets, optimal pipelining…
     - Verify that DSP are inferred efficiently
       - Signed arithmetic with pipelining
     - Use templates & coding style examples:
       - Multiply-accumulate
       - Large accumulator
       - Complex multiplier
       - Dynamic pre-adder
       - Rounding (2015.3)
       - Squarer (UG901)
       - FIR (UG579)
       - XOR (2016.1)
       - …
     - [Figure: DSP block diagram with inputs A, B, C, a signed 27-bit path, ACC, XOR and EQ blocks; widths shown: 27, 18, 45, 48]

  8. DSP Block Inference Improvements
     - Squarer: 1 DSP
       - (a - b)^2, (a + b)^2
     - Complex multiplier: 3 DSP
       - (a+bi)*(c+di) = ((c-d)*a + S) + ((c+d)*b + S)i, with S = (a-b)*d
     - Wider arithmetic requires more pipelining
       - e.g. MULT 44x35 requires 4 MULT 27x18 & ADD
     - [Figure: pipelined MULT 44x35 in HDL -> Synthesis -> mapped to 4 DSP blocks (27x18 MULT)]
     - Verify proper inference for full DSP block performance!
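The three-multiplier identity for the complex product can be checked with a small software model. This is only a Python sketch of the arithmetic, not HDL and not what Vivado generates; the function names are made up for illustration.

```python
def complex_mult_3(a, b, c, d):
    """(a + bi) * (c + di) using three real multiplications, as on the slide."""
    s = (a - b) * d        # shared product S
    re = (c - d) * a + s   # expands to a*c - b*d
    im = (c + d) * b + s   # expands to a*d + b*c
    return re, im


def complex_mult_4(a, b, c, d):
    """Reference: textbook form with four real multiplications."""
    return a * c - b * d, a * d + b * c


if __name__ == "__main__":
    print(complex_mult_3(3, -2, 5, 7))
    print(complex_mult_4(3, -2, 5, 7))
```

Both calls should print the same pair. The saving is one multiplier at the cost of extra adders, which is why the pre-adder of the DSP block matters here.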

  9. Architecture-Aware RAM & ROM
     - HDL code needs to match BRAM architecture
       - Registered address (sync read), optional output register
       - 32K configurations
         - Width=1 x Depth=2^15 (32K) = 32Kx1
         - Width=2 x Depth=2^14 (16K) = 16Kx2
         - …
         - Width=32 x Depth=2^10 (1K) = 1Kx32
       - 36K configuration
         - Width=36 x Depth=2^10 (1K) = 1Kx36
     - Wider & deeper memories
       - Automatically inferred by Synthesis
     - Example: single port RAM
     - [Figure: RAMB36 with addr input and registered output Q; 32x1K]
     - Verify that BRAM are inferred efficiently!

  10. RAM Decomposition: Example
      - 32Kx32 RAM, three possible decompositions:
        - High Performance & Power (default w/ timing constraints)
          - 32x 32Kx1 (W=1, D=15)
          - 1 level, 32 BRAM active
        - Low Power & Performance (UltraScale cascade-MUX)
          - 32x 1Kx32 (W=32, D=10)
          - 32 levels, 1 BRAM active
          - (* cascade_height = 32 *) …
        - Performance/Power Trade-off (hybrid LUT & UltraScale cascade)
          - 1Kx32 (W=32, D=10): 8x groups of 4x cascaded BRAM, 8-1 MUX in LUTs
          - 4 levels, 4 BRAM active
          - (* cascade_height = 4 *) …
      - Verify that BRAM are decomposed efficiently!
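The two extreme decompositions of the 32Kx32 example follow from simple tiling arithmetic. The sketch below is a Python model, not a Vivado API. It counts how many primitives sit side by side (width) and how many are stacked in depth, using only the RAMB36 shapes quoted on the slides. The helper name `tile` is hypothetical.

```python
def tile(width, depth, prim_width, prim_depth):
    """Return (columns, levels, total) of primitives needed to build width x depth."""
    columns = -(-width // prim_width)   # ceiling division: primitives side by side
    levels = -(-depth // prim_depth)    # ceiling division: primitives stacked in depth
    return columns, levels, columns * levels


if __name__ == "__main__":
    K = 1024
    for name, (w, d) in {"32Kx1": (1, 32 * K), "16Kx2": (2, 16 * K), "1Kx32": (32, 1 * K)}.items():
        print(name, tile(32, 32 * K, w, d))
```

Every shape needs 32 primitives in total. What changes is how many are read in parallel and how deep the output multiplexing is, which is the performance/power trade-off described above.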

  11. RAM & ROM Recommendations
      - Use pipeline Reg for performance
        - No logic in-between
        - No fanout
        - In same hierarchy!
      - Run phys_opt to move Reg in & out based on timing (BRAM slack<0 vs. BRAM slack>0)
      - Add extra pipeline for best performance!
      - [Figure: BRAM followed by one or more pipeline Reg stages]
      - Verify that BRAM are pipelined efficiently!

  12. Beware of Priority Logic
      - Independent "if" statements:
          if (c0) q = a0;
          if (c1) q = a1;
          if (c2) q = a2;
          if (c3) q = a3;
          if (c4) q = a4;
          if (c5) q = a5; …
      - "if … else if" chain:
          if (c0) q = a0;
          else if (c1) q = a1;
          else if (c2) q = a2;
          else if (c3) q = a3;
          else if (c4) q = a4;
          else if (c5) q = a5; …
      - Removing else's won't help!!
      - Priority encoded logic -> long paths
      - [Figure: two mux chains, one selecting a5 … a0 on c5 … c0, the other selecting a0 … a5 on c0 … c5]
      - Priority logic will hurt Timing Closure!
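Why dropping the "else" does not help can be seen in a small behavioural model. This is a Python sketch, not HDL: independent "if" statements give priority to the last true condition, while the "else if" chain gives priority to the first. Both are therefore priority chains, just in opposite order.

```python
def if_chain(conds, values, default=None):
    """Independent `if` statements: the last true condition wins."""
    q = default
    for c, a in zip(conds, values):
        if c:
            q = a
    return q


def else_if_chain(conds, values, default=None):
    """`if ... else if ...` chain: the first true condition wins."""
    for c, a in zip(conds, values):
        if c:
            return a
    return default


if __name__ == "__main__":
    conds = [1, 0, 1, 0, 0, 0]
    values = ["a0", "a1", "a2", "a3", "a4", "a5"]
    print(if_chain(conds, values), else_if_chain(conds, values))
```

Reversing the order of the conditions turns one form into the other, so neither removes the dependency of each stage on all the stages before it.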

  13. Priority Logic with "case" Statement
      - CASE won't help either! (note: values are variables)
          case (c)
            v0: q = a0;
            v1: q = a1;
            v2: q = a2;
            v3: q = a3;
            v4: q = a4; …
      - If conditions are mutually exclusive, make it clear!
        - In Verilog: case (c) //synthesis parallel_case (watch for simulation mismatch!)
        - In SystemVerilog: unique case (c) // works with "if" too
      - [Figure: BAD: chain of comparisons c==v0 … c==v5 selecting a0 … a5 in priority order; GOOD: a0 … a5 selected in parallel]
      - Note: please use complete conditions
        - .v: full_case (simulation may not match), or default & assign don't_care
        - .sv: priority (for case & if)
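The "simulation mismatch" warning can be illustrated with a toy model. The Python sketch below assumes the parallel version is built as an AND-OR selection, which is one possible implementation and not necessarily what a given tool produces. It shows that the priority and parallel readings agree only when the case labels never match at the same time.

```python
def priority_case(c, labels, values):
    """Simulation semantics of a plain case: the first matching label wins."""
    for v, a in zip(labels, values):
        if c == v:
            return a
    return 0


def parallel_case(c, labels, values):
    """Assumed AND-OR hardware: every matching branch drives the output."""
    q = 0
    for v, a in zip(labels, values):
        if c == v:
            q |= a
    return q


if __name__ == "__main__":
    values = [0b001, 0b010, 0b100]
    print(priority_case(5, [3, 5, 7], values), parallel_case(5, [3, 5, 7], values))
    print(priority_case(5, [5, 5, 7], values), parallel_case(5, [5, 5, 7], values))
```

With distinct labels both functions return the same value. With two labels equal to 5 the results differ, which is the situation where a parallel_case promise would be false.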

  14. Priority Logic Which Should Not Be!
      - BAD: priority chain on 1-hot conditions (here: binary encoded)
          c0 = (S == 0);
          c1 = (S == 1);
          c2 = (S == 2);
          c3 = (S == 3);
          c4 = (S == 4);
          if (c0) q = a0;
          else if (c1) q = a1;
          else if (c2) q = a2;
          else if (c3) q = a3;
          else if (c4) q = a4;
          else if (c5) q = a5; …
      - Automated in most cases… Even with registered conditions!
        - unique if (c0) … in SystemVerilog
      - GOOD:
          case (S)
            0: q = a0
            1: q = a1
            2: q = a2
            …
      - or, GOOD:
          q = A[S]
      - [Figure: BAD: chain of 2-input muxes selecting a0 … a5; GOOD: muxes on S0, S1, S2 selecting among a0..3 and a4..7]
      - If conditions are mutually exclusive, do not use a priority logic
      - Use "unique if" in SystemVerilog
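When the conditions are decoded from a single selector, the priority chain and the indexed read compute the same function. A minimal Python model, not HDL, with hypothetical function names:

```python
def priority_on_decoded(s, a):
    """else-if chain over conditions c_k = (S == k), as in the BAD version."""
    conds = [s == k for k in range(len(a))]
    for c, value in zip(conds, a):
        if c:
            return value
    return None  # no condition true


def indexed(s, a):
    """q = A[S], as in the GOOD version."""
    return a[s] if 0 <= s < len(a) else None


if __name__ == "__main__":
    a = ["a0", "a1", "a2", "a3", "a4", "a5"]
    print([priority_on_decoded(s, a) for s in range(8)])
    print([indexed(s, a) for s in range(8)])
```

Because at most one decoded condition can be true, the order of the chain carries no information, so a multiplexer on S is enough.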

  15. Parallelizing Priority Logic
      - When you can't avoid O(n) priority logic, you can still reduce its depth!
      - BAD: N deep
        - if c0…c63: 64 deep
      - GOOD: N/2 + 1 deep
        - if c0…c31: 32 deep, in parallel with if c32…c63: 32 deep
        - final select on c0 | … | c31: 2 deep (log6(32))
      - or N/4 + 2… or log(N) recursively
      - [Figure: one 64-deep mux chain on c0 … c63 vs. two 32-deep chains joined by a final mux]
      - Improve timing even when conditions are not mutually exclusive!
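The split into two half-length chains can be checked behaviourally. In this Python sketch (a software model with made-up function names) the lower half has priority, and its OR-reduced conditions select between the two half results.

```python
def priority(conds, values, default=None):
    """Full-length priority chain: the first true condition wins."""
    for c, a in zip(conds, values):
        if c:
            return a
    return default


def split_priority(conds, values, default=None):
    """Two half-length chains evaluated independently, then one final select."""
    half = len(conds) // 2
    low = priority(conds[:half], values[:half], default)
    high = priority(conds[half:], values[half:], default)
    return low if any(conds[:half]) else high


if __name__ == "__main__":
    conds = [False] * 64
    conds[40] = conds[50] = True
    print(priority(conds, list(range(64))), split_priority(conds, list(range(64))))
```

The conditions do not need to be mutually exclusive for this to hold, and applying the same split to each half recursively gives the logarithmic variant.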

  16. Priority Logic with "for" loops
      - Same as if…if…if…
          flag = 0;
          for (i=0 ; i<32 ; i=i+1)
            if (c[i])
              flag = 1;
      - Same as if…else if…else if…
          flag = 0;
          for (i=0 ; i<32 ; i=i+1)
            if (c[i]) begin
              flag = 1;
              break; // SystemVerilog
              // or exit in VHDL
            end
      - Break/exit won't help!!
      - [Figure: two mux chains driven by c[31] … c[0] and c[0] … c[31], each selecting constant 1]
      - "break" does not reduce logic!
      - Best code in this case: flag = |c
      - Think Simple!
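All three forms compute the same value, which is why the OR reduction is the preferred way to write it. A minimal Python model (not HDL) of the two loops and the reduction:

```python
def flag_loop(c):
    """Loop without break: every set bit overwrites flag with 1."""
    flag = 0
    for bit in c:
        if bit:
            flag = 1
    return flag


def flag_loop_break(c):
    """Loop with break: stops at the first set bit."""
    flag = 0
    for bit in c:
        if bit:
            flag = 1
            break
    return flag


def flag_reduce(c):
    """OR reduction, the equivalent of flag = |c."""
    return int(any(c))


if __name__ == "__main__":
    c = [0] * 32
    c[17] = 1
    print(flag_loop(c), flag_loop_break(c), flag_reduce(c))
```

In software the break saves iterations. In hardware the loop is unrolled either way, so the break only changes which end of the chain has priority.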

  17. Beware of Loop Unrolling: Avoid "if"
      - BAD: area & depth O(N)
          c = 0;
          for (i=0 ; i<8 ; i=i+1)
            if (a[i])
              c = c+1;
      - Get rid of "if":
          c = 0;
          for (i=0 ; i<8 ; i=i+1)
            c = c+a[i];
      - GOOD: area & depth log3(N)
          c = a[0] + a[1] + a[2] +
              a[3] + a[4] + a[5] +
              a[6] + a[7];
      - [Figure: BAD: chain of +1 incrementers and muxes on a[0] … a[7]; GOOD: adder tree over a[0] … a[7]]
      - "if" in loops can seriously hurt timing!
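For 1-bit inputs the conditional increment and the plain sum are the same function. The sum can be grouped as a tree, while the conditional form reads as a chain. The Python sketch below models both and counts tree levels. Grouping three operands per adder is an assumption made here to match the log3(N) figure on the slide.

```python
def count_with_if(a):
    """Conditional increment: one incrementer and mux per bit, chained."""
    c = 0
    for bit in a:
        if bit:
            c = c + 1
    return c


def ternary_tree_sum(a):
    """Sum a non-empty list with 3-input adders; return (sum, number of levels)."""
    level = list(a)
    depth = 0
    while len(level) > 1:
        level = [sum(level[i:i + 3]) for i in range(0, len(level), 3)]
        depth += 1
    return level[0], depth


if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0, 0, 1, 0]
    print(count_with_if(bits), ternary_tree_sum(bits))
```

For eight inputs the tree needs two levels, against eight chained stages for the conditional version.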

  18. Beware of Loop Unrolling: Arithmetic
      - BAD: up to 36 N-bit adders
          Q = 0
          for i = 0 to 3
            for j = 0 to 3
              Q = Q+A+i+j
        which unrolls to:
          Q = 0 + A+0+0 + A+0+1 + A+0+2 + A+0+3
                + A+1+0 + A+1+1 + A+1+2 + A+1+3
                + A+2+0 + A+2+1 + A+2+2 + A+2+3
                + A+3+0 + A+3+1 + A+3+2 + A+3+3
      - GOOD: 1 (N-3)-bit adder
          Q = … = 16*A + 48 = (A<<4) + 48
      - BETTER: 1 (N-4)-bit adder
          Q = (A + 3) << 4
      - [Figure: adders on A[N-1:4] / A[N-5:0] producing Q[N-4:0] / Q[N-1:4], with constants 48 and 3]
      - Loops (in general) can hurt timing! Here: symbolic arithmetic optimization may not happen
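The closed forms can be confirmed numerically. This Python check works on unbounded integers, so it does not model bit-width truncation. The parentheses around the shift matter, because `+` binds tighter than `<<` in Python as well as in Verilog.

```python
def q_loops(a):
    """Nested loops as written on the slide: 16 accumulations of A + i + j."""
    q = 0
    for i in range(4):
        for j in range(4):
            q = q + a + i + j
    return q


def q_shift_then_add(a):
    """16*A + 48 written with a shift."""
    return (a << 4) + 48


def q_add_then_shift(a):
    """(A + 3) << 4: a narrower add followed by a free shift."""
    return (a + 3) << 4


if __name__ == "__main__":
    print(q_loops(10), q_shift_then_add(10), q_add_then_shift(10))
```

The constant comes from the index sums: i and j each contribute 4 * (0+1+2+3) = 24, giving 48 in total.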

  19. Avoid Gated Clock Transformation
      - Very common in ASIC design (low power)
      - Consolidate the clocks to minimize clock skew
      - [Figure: ASIC style: clk gated by c (CE latched on ~c) driving the flip-flop clock; FPGA style: single clk on the low-skew network (BUFG), with c driving the flip-flop CE through an edge detector]
      - BAD: 2 clocks, 1 gated
      - GOOD: 1 clock
      - Avoid gated clocks: they will hurt timing closure (will cause clock skew)

  20. Avoid [Async] Resets
      - What we recommended
        - Reduce the number of "control sets" {clk, rst, ce}
        - Avoid Reset / avoid Async Reset
      - [Figure: flip-flops with CE and CLR pins driven by clk and rst; callout: does this really remove reset?]
      - BAD: attempt to remove Reset created Enable, and Reset is still Async…
      - Verify that removing Reset did not add Enables
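Counting control sets is a simple bookkeeping exercise. The Python sketch below uses a made-up register list (all signal names are hypothetical, and this is not a Vivado report format) to show how a reset "removed" from one register can turn into an enable and raise the number of distinct {clk, rst, ce} combinations.

```python
def control_sets(registers):
    """Return the set of distinct (clk, rst, ce) combinations used by the registers."""
    return {(r["clk"], r["rst"], r["ce"]) for r in registers}


if __name__ == "__main__":
    before = [
        {"clk": "clk", "rst": "rst", "ce": None},
        {"clk": "clk", "rst": "rst", "ce": None},
    ]
    # Hypothetical outcome: the second register lost its reset but gained an enable.
    after = [
        {"clk": "clk", "rst": "rst", "ce": None},
        {"clk": "clk", "rst": None, "ce": "en_x"},
    ]
    print(len(control_sets(before)), len(control_sets(after)))
```

In this toy case the count goes up rather than down, which is what the check on the slide is meant to catch.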

  21. RTL Synthesis: New Strategies
      - Vivado RTL Synthesis now has 8 Strategies
        - Each Strategy is a combination of options & directives
        - Directives have a specific purpose
      - For quick pipe-cleaning iterations
        - Flow_RuntimeOptimized
      - For best area
        - Flow_AreaMultThresholdDSP
        - Flow_AreaOptimized_medium
        - Flow_AreaOptimized_high
      - For performance
        - Vivado_Synthesis_Default
        - Flow_PerfOptimized_high
        - Flow_PerfThresholdCarry
      - For congested designs
        - Flow_AlternateRoutability
      - [Figure: Strategies in Vivado (synthesis options)]
      - Taking the best of all Strategies can give you 10% better QoR

  22. Case Study
      - Problem
        - Area explosion & bad timing in a design
      - Locating the cause of the issue
        - Find offending module & synthesize it Out Of Context
        - Look for suspicious operators on Elaborated view (how??)
        - Cross-probe to source files
      - Resolution
        - Fix the source code and/or use synthesis options

  23. Case Study: Locating the Cause of the Issue
      - Look for suspicious operators
        - Ctrl-F in Elaborated Schematic
        - Select suspicious operators (here: MULT, MOD…)
        - Press F4 to view schematic
        - Press F7 to cross-probe
