Advanced Synthesis Techniques Ramine Roane

Reminder From Last Year
• Use the UltraFast Design Methodology for Vivado
  – www.xilinx.com/ultrafast
• Recommendations for rapid closure
  – HDL: use HDL Language Templates & DRC
  – Constraints: Timing Constraint Wizard, DRC (Tools -> Report -> Report DRC)
  – Iterate in synthesis (converge within 300 ps)
    • Real problems are seen post-synthesis (long paths…)
    • Faster iterations & higher impact
    • Improve area, timing, power
  – Only then, iterate in the next steps
    • opt, place, phys_opt, route, phys_opt
[Figure: worst path post-synthesis: 4.3 ns, 13 levels of logic!]
[Figure: worst path post-route: 4.1 ns, 4 levels of logic]

Advanced Synthesis Techniques Overview
• Advanced synthesis techniques for design closure
• Case study: design closure at synthesis level

Vivado Synthesis Flow
• Inputs: VHDL, Verilog, VHDL-2008, SystemVerilog
  – VHDL-2008 and SystemVerilog are more compact (advanced types…) and verification friendly (UVM, SVA…)
• Analyze
  – Syntax check
  – Build file hierarchy
• Elaborate
  – Design hierarchy
  – Unroll loops
  – Build logic: arithmetic, RAM, FSM, Boolean logic
  – Cross-probing, XDC
• RTL Optimizations
  – Module generators
• Optimize & Map
  – Boolean optimization
  – Technology mapping (LUT6)
• Output: P&R or DCP

• Architecture-Aware Coding
• Priority Encoders
• Loops
• Clocks & Resets
• Directives & Strategies
• Case Study

Architecture-Aware DSP
• HDL code needs to match the DSP hardware (e.g. DSP48E2)
  – Signedness, width of nets, optimal pipelining…
[Figure: DSP48E2 block diagram (inputs A, B, C; port widths 27, 18, 45, 48; signed 27-bit path, ACC, XOR and EQ blocks)]
• Use templates & coding style examples (signed arithmetic with pipelining):
  – Multiply-accumulate
  – Large accumulator
  – Complex multiplier
  – Dynamic pre-adder
  – Squarer (UG901)
  – FIR (UG579)
  – Rounding (2015.3)
  – XOR (2016.1)
  – …
Verify that DSPs are inferred efficiently

DSP Block Inference Improvements
• Squarer: 1 DSP
  – (a - b)^2, (a + b)^2
• Complex multiplier: 3 DSP
  – (a+bi)*(c+di) = ((c-d)*a + S) + ((c+d)*b + S)i, with S = (a-b)*d
• Wider arithmetic requires more pipelining
  – e.g. MULT 44x35 requires 4 MULT 27x18 & ADD
  – A pipelined MULT 44x35 in HDL is mapped by synthesis to 4 DSP blocks (27x18 MULT)
[Figure: complex multiplier datapath producing Re and Im from three multipliers]
Verify proper inference for full DSP block performance!
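The two arithmetic claims on this slide can be checked numerically. The following is a minimal Python sketch, not HDL and not the tool's algorithm: the function names are illustrative, and the split at bit 17 is an assumption chosen so that every partial product fits a signed 27x18 multiplier.

```python
def complex_mult_3(a, b, c, d):
    """(a+bi)*(c+di) using 3 multiplications, following the slide's identity."""
    s = (a - b) * d          # shared term S
    re = (c - d) * a + s     # equals a*c - b*d
    im = (c + d) * b + s     # equals a*d + b*c
    return re, im


def mult_44x35(a, b):
    """Signed 44x35 multiply from four partial products.

    Assumption: operands are split at bit 17, so the low parts are
    unsigned 17-bit values and the high parts are signed 27-bit (a)
    and signed 18-bit (b) values.
    """
    a_hi, a_lo = a >> 17, a & 0x1FFFF
    b_hi, b_lo = b >> 17, b & 0x1FFFF
    return ((a_hi * b_hi) << 34) + ((a_hi * b_lo + a_lo * b_hi) << 17) + a_lo * b_lo


if __name__ == "__main__":
    print(complex_mult_3(3, -4, 5, 7))
    print(mult_44x35(-(2**43), 2**34 - 1) == -(2**43) * (2**34 - 1))
```

The four products in `mult_44x35` correspond to the "4 MULT 27x18 & ADD" on the slide; the shifts and additions are where the extra pipeline stages would go.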

Architecture-Aware RAM & ROM
• HDL code needs to match the BRAM architecture (RAMB36)
  – Registered address (sync read), optional output register
  – 32K configurations
    • Width=1 x Depth=2^15 (32K) = 32Kx1
    • Width=2 x Depth=2^14 (16K) = 16Kx2
    • …
    • Width=32 x Depth=2^10 (1K) = 1Kx32
  – 36K configuration
    • Width=36 x Depth=2^10 (1K) = 1Kx36
• Wider & deeper memories
  – Automatically inferred by synthesis
[Figure: example single-port RAM (addr in, Q out)]
Verify that BRAMs are inferred efficiently!

RAM Decomposition: Example
• 32Kx32 RAM, three possible decompositions:
  – High performance & power (default w/ timing constraints)
    • 32x 32Kx1 (W=1, D=15)
    • 1 level, 32 BRAM active
  – Low power & performance: UltraScale cascade-MUX
    • 32x 1Kx32 (W=32, D=10)
    • 32 levels, 1 BRAM active
    • (* cascade_height = 32 *) …
  – Performance/power trade-off: hybrid LUT & UltraScale cascade
    • 8x cascades of 4x 1Kx32 (W=32, D=10), with an 8-1 MUX in LUTs
    • 4 levels, 4 BRAM active
    • (* cascade_height = 4 *) …
Verify that BRAMs are decomposed efficiently!
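The two extreme decompositions can be reproduced with simple tiling arithmetic. This is a rough Python sketch under stated assumptions: `tile` is a hypothetical helper, it only models uniform tilings with one of the 32K configurations, and it assumes that only the row of BRAMs holding the addressed word is enabled. The hybrid LUT and cascade option from the slide is not modelled.

```python
BRAM_BITS = 32 * 1024  # data bits of one BRAM in the 32K configurations


def tile(depth, width, bram_width):
    """Tile a depth x width memory with BRAMs configured bram_width wide.

    Returns (total BRAMs, depth-wise levels, BRAMs active per access).
    Assumption: only the BRAM row holding the addressed word is enabled.
    """
    bram_depth = BRAM_BITS // bram_width
    levels = -(-depth // bram_depth)    # ceiling: BRAMs stacked in depth
    columns = -(-width // bram_width)   # ceiling: BRAMs side by side in width
    return levels * columns, levels, columns


if __name__ == "__main__":
    for w in (1, 2, 4, 8, 16, 32):
        total, levels, active = tile(32 * 1024, 32, w)
        print(f"{BRAM_BITS // w}x{w}: {total} BRAMs, {levels} levels, {active} active")
```

Every uniform tiling uses 32 BRAMs; what changes is how many levels sit on the read path and how many BRAMs toggle per access, which is the performance versus power trade-off described above.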

RAM & ROM Recommendations
• Use pipeline registers for performance
  – No logic in-between the BRAM and the register
  – No fanout
  – In the same hierarchy!
• Run phys_opt to move registers in & out based on timing (BRAM slack<0 vs. BRAM slack>0)
• Add extra pipeline for best performance!
[Figure: BRAM followed by pipeline registers, before and after phys_opt]
Verify that BRAMs are pipelined efficiently!

Beware of Priority Logic
• Priority-encoded logic leads to long paths
• Removing the else's won't help!!

  With else:
    if (c0) q = a0;
    else if (c1) q = a1;
    else if (c2) q = a2;
    else if (c3) q = a3;
    else if (c4) q = a4;
    else if (c5) q = a5; …

  Without else:
    if (c0) q = a0;
    if (c1) q = a1;
    if (c2) q = a2;
    if (c3) q = a3;
    if (c4) q = a4;
    if (c5) q = a5; …

[Figure: both versions build a chain of muxes, one per condition (a0…a5 selected by c0…c5), in opposite priority order]
Priority logic will hurt timing closure!
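A small behavioral model shows why dropping the else keywords changes nothing structurally. This is a Python sketch with hypothetical helper names, not synthesizable code: the plain if sequence is still a priority chain, only with the priority order reversed (the last true condition wins instead of the first).

```python
from itertools import product


def chain_if(conds, values, default=None):
    """if (c0) q=a0; if (c1) q=a1; ...  -> the LAST true condition wins."""
    q = default
    for c, a in zip(conds, values):
        if c:
            q = a
    return q


def chain_else_if(conds, values, default=None):
    """if (c0) q=a0; else if (c1) q=a1; ...  -> the FIRST true condition wins."""
    for c, a in zip(conds, values):
        if c:
            return a
    return default


if __name__ == "__main__":
    values = ["a0", "a1", "a2", "a3"]
    for conds in product([False, True], repeat=4):
        # Same chain, read in the opposite direction.
        assert chain_if(conds, values) == chain_else_if(conds[::-1], values[::-1])
    print("if...if...if is an else-if chain with reversed priority")
```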

Priority Logic with "case" Statement
• CASE won't help either! (note: the values are variables)

    case (c)
      v0: q=a0;
      v1: q=a1;
      v2: q=a3;
      v3: q=a4;
      v4: q=a5; …

• If conditions are mutually exclusive, make it clear!
  – In Verilog: case (c) //synthesis parallel_case (watch for simulation mismatch!)
  – In SystemVerilog: unique case (c) // works with "if" too
• Note: please use complete conditions
  – .v: full_case (simulation may not match), or default & assign don't_care
  – .sv: priority (for case & if)
[Figure: BAD: chain of comparators c==v0, c==v1, … c==v5 selecting a0…a5 in priority order; GOOD: a single parallel mux]

Priority Logic Which Should Not Be!
• 1-hot conditions (here: binary encoded) written as a priority chain:

    c0 = (S == 0);
    c1 = (S == 1);
    c2 = (S == 2);
    c3 = (S == 3);
    c4 = (S == 4);
    …
    if (c0) q = a0;
    else if (c1) q = a1;
    else if (c2) q = a2;
    else if (c3) q = a3;
    else if (c4) q = a4;
    else if (c5) q = a5; …

  – Automated in most cases… even with registered conditions!
• Alternatives that state the exclusivity directly:
  – unique if (c0) … in SystemVerilog
  – a case statement:

    case (S)
      0: q = a0;
      1: q = a1;
      2: q = a2;
      …

  – or: q = A[S]
[Figure: BAD: priority mux chain on a0…a5; GOOD: single mux on a0..7 selected by S; GOOD: mux tree (a0..3, a4..7) selected by S0, S1, S2]
If conditions are mutually exclusive, do not use priority logic
Use "unique if" in SystemVerilog
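When the conditions are decoded from a single select value, exactly one of them is true, so the priority order never matters. The Python sketch below (hypothetical function names, non-negative integer data assumed) compares the priority chain, a parallel AND-OR mux and the plain index `A[S]` from the slide.

```python
def priority_select(S, A):
    """if (S==0) q=A[0]; else if (S==1) q=A[1]; ...  (priority chain)."""
    for i, a in enumerate(A):
        if S == i:
            return a
    return None


def and_or_select(S, A):
    """Parallel AND-OR mux; valid because exactly one condition is true."""
    q = 0
    for i, a in enumerate(A):
        mask = -1 if S == i else 0   # all-ones mask for the selected input
        q |= a & mask
    return q


def indexed_select(S, A):
    """q = A[S]"""
    return A[S]


if __name__ == "__main__":
    A = [10, 11, 12, 13, 14, 15, 16, 17]
    for S in range(len(A)):
        assert priority_select(S, A) == and_or_select(S, A) == indexed_select(S, A)
    print("all three forms agree for mutually exclusive conditions")
```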

Parallelizing Priority Logic
• When you can't avoid O(n), you still can!
  – BAD: a single chain "if c0…c63": 64 deep (N deep)
  – GOOD: two chains in parallel, "if c0…c31" and "if c32…c63", each 32 deep
    • Selecting between the two halves: 2 deep (log_6(32))
    • N/2 + 1 deep... or N/4 + 2… or log(N) recursively
[Figure: 64-deep mux chain on a0…a63 vs. two 32-deep chains combined by a final mux]
Improve timing even when conditions are not mutually exclusive!
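The split can be modelled behaviorally to confirm that the result is unchanged even when several conditions are true at once. This is a Python sketch with hypothetical names; depth is only noted in comments, not measured.

```python
def priority_chain(conds, values, default=None):
    """First true condition wins; models an N-deep if/else-if chain."""
    for c, a in zip(conds, values):
        if c:
            return a
    return default


def split_priority(conds, values, default=None):
    """Same result from two half-length chains plus one final select."""
    half = len(conds) // 2
    low = priority_chain(conds[:half], values[:half])             # N/2 deep
    high = priority_chain(conds[half:], values[half:], default)   # N/2 deep, in parallel
    # Final select: the low half has priority if any of its conditions is true.
    return low if any(conds[:half]) else high


if __name__ == "__main__":
    values = list(range(100, 164))
    conds = [False] * 64
    conds[40] = True
    conds[5] = True
    print(priority_chain(conds, values), split_priority(conds, values))
```

Applying the same split recursively to each half gives the N/4 + 2 and log(N) variants mentioned on the slide.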

Priority Logic with "for" loops
• Same as if…if…if…:

    flag = 0;
    for (i=0 ; i<32 ; i=i+1)
      if (c[i])
        flag = 1;

• Same as if…else if…else if…:

    flag = 0;
    for (i=0 ; i<32 ; i=i+1)
      if (c[i]) begin
        flag = 1;
        break; // SystemVerilog
               // or exit in VHDL
      end

• Break/exit won't help!! "break" does not reduce logic!
[Figure: both loops unroll into a chain of muxes on c[0]…c[31]]
• Best code in this case: flag = |c
Think simple!
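A quick behavioral check of the claim that both loops compute nothing more than an OR reduction. This is a Python model with illustrative function names, where `c` is a list of bits.

```python
from itertools import product


def flag_loop(c):
    """Loop without break: every bit is examined."""
    flag = 0
    for bit in c:
        if bit:
            flag = 1
    return flag


def flag_loop_break(c):
    """Loop with break: stops at the first set bit."""
    flag = 0
    for bit in c:
        if bit:
            flag = 1
            break
    return flag


def flag_or(c):
    """Equivalent of `flag = |c`."""
    return int(any(c))


if __name__ == "__main__":
    for c in product([0, 1], repeat=6):
        assert flag_loop(c) == flag_loop_break(c) == flag_or(c)
    print("both loops are an OR reduction")
```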

Beware of Loop Unrolling – Avoid "if"
• BAD: area & depth O(N)

    c = 0;
    for (i=0 ; i<8 ; i=i+1)
      if (a[i])
        c = c+1;

• Get rid of the "if":

    c = 0;
    for (i=0 ; i<8 ; i=i+1)
      c = c+a[i];

• GOOD: area & depth log_3(N)

    c = a[0] + a[1] + a[2] +
        a[3] + a[4] + a[5] +
        a[6] + a[7];

[Figure: BAD: chain of "+1" incrementers and muxes controlled by a[0]…a[7]; GOOD: adder tree over a[0]…a[7]]
"if" in loops can seriously hurt timing!
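The two loops count the same thing, which a short Python model can confirm. The `tree_depth` helper is a hypothetical illustration of the slide's log_3(N) figure, under the assumption that each adder level combines up to three operands.

```python
def count_with_if(a):
    """c = 0; for ... if (a[i]) c = c + 1;"""
    c = 0
    for bit in a:
        if bit:
            c = c + 1
    return c


def count_sum(a):
    """c = 0; for ... c = c + a[i];  (same as a[0] + a[1] + ... + a[7])"""
    c = 0
    for bit in a:
        c = c + bit
    return c


def tree_depth(n, fan_in=3):
    """Levels of an adder tree over n operands (assumed 3 operands per adder)."""
    depth = 0
    while n > 1:
        n = -(-n // fan_in)   # ceiling division
        depth += 1
    return depth


if __name__ == "__main__":
    a = [1, 0, 1, 1, 0, 0, 1, 0]
    print(count_with_if(a), count_sum(a), tree_depth(len(a)))
```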

Beware of Loop Unrolling – Arithmetic
• BAD: up to 36 N-bit adders

    Q = 0
    for i = 0 to 3
      for j = 0 to 3
        Q = Q+A+i+j

  which unrolls to:

    Q = 0 + A+0+0 + A+0+1 + A+0+2 + A+0+3 +
            A+1+0 + A+1+1 + A+1+2 + A+1+3 +
            A+2+0 + A+2+1 + A+2+2 + A+2+3 +
            A+3+0 + A+3+1 + A+3+2 + A+3+3

• GOOD: 1 (N-3)-bit adder

    Q = … = 16*A + 48 = (A<<4) + 48

• BETTER: 1 (N-4)-bit adder

    Q = (A + 3) << 4

[Figure: adder on A[N-1:4] with constant 48 producing Q[N-1:4]; adder on A[N-5:0] with constant 3 producing Q[N-4:0]]
Loops (in general) can hurt timing! Here: symbolic arithmetic optimization may not happen
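The algebra behind the rewrite is easy to verify: the 16 iterations add A sixteen times, and the loop indices contribute 4*(0+1+2+3) twice, i.e. 48. The Python sketch below checks this with unbounded integers (no N-bit truncation is modelled). Note that in Python, as in Verilog, `+` binds tighter than `<<`, so the shift needs parentheses.

```python
def q_loop(A):
    """Direct transcription of the nested loop."""
    Q = 0
    for i in range(4):
        for j in range(4):
            Q = Q + A + i + j
    return Q


def q_good(A):
    return (A << 4) + 48      # 16*A + 48


def q_better(A):
    return (A + 3) << 4       # constant folded into the unshifted operand


if __name__ == "__main__":
    for A in range(0, 1000, 37):
        assert q_loop(A) == 16 * A + 48 == q_good(A) == q_better(A)
    print(q_loop(5), q_good(5), q_better(5))
```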

Avoid Gated Clock Transformation
• Gated clocks are very common in ASIC design (low power)
• On FPGA, consolidate the clocks to minimize clock skew
  – ASIC style (BAD: 2 clocks, 1 gated): clk gated by c, with the gate latched on ~c
  – FPGA style (GOOD: 1 clock): clk on the low-skew network (BUFG), with c driving the flop CE through an edge detector
[Figure: ASIC gated-clock flop vs. FPGA flop with clock enable]
Avoid gated clocks: they will hurt timing closure (will cause clock skew)

Avoid [Async] Resets
• What we recommended
  – Reduce the number of "control sets" {clk, rst, ce}
  – Avoid Reset / avoid Async Reset
• Does this really remove reset?
  – BAD: the attempt to remove Reset created an Enable, and Reset is still Async…
[Figure: flops with CE and CLR pins driven from clk and rst]
Verify that removing Reset did not add Enables
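Counting control sets is a matter of grouping flops by their {clk, rst, ce} combination. The Python sketch below uses an entirely hypothetical flop list (names and signals are made up for illustration) to show how an added enable creates a new control set.

```python
from collections import Counter

# Hypothetical flop list: (name, clk, rst, ce); None means the pin is unused.
FLOPS = [
    ("a_reg", "clk", "rst", None),
    ("b_reg", "clk", "rst", None),
    ("c_reg", "clk", None, "en0"),   # reset removed, but an enable appeared
    ("d_reg", "clk", "rst", "en0"),
]


def control_sets(flops):
    """Count flops per unique {clk, rst, ce} combination."""
    return Counter((clk, rst, ce) for _, clk, rst, ce in flops)


if __name__ == "__main__":
    for (clk, rst, ce), n in control_sets(FLOPS).items():
        print(f"clk={clk} rst={rst} ce={ce}: {n} flop(s)")
```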

RTL Synthesis: New Strategies
• Vivado RTL Synthesis now has 8 Strategies
  – Each Strategy is a combination of options & directives
  – Directives have a specific purpose
• For quick pipe-cleaning iterations
  – Flow_RuntimeOptimized
• For best area
  – Flow_AreaMultThresholdDSP
  – Flow_AreaOptimized_medium
  – Flow_AreaOptimized_high
• For performance
  – Vivado_Synthesis_Default
  – Flow_PerfOptimized_high
  – Flow_PerfThresholdCarry
• For congested designs
  – Flow_AlternateRoutability
• Taking the best of all Strategies can give you 10% better QoR
[Figure: Strategies in Vivado (synthesis options)]

Case Study
• Problem
  – Area explosion & bad timing in a design
• Locating the cause of the issue
  – Find the offending module & synthesize it Out Of Context
  – Look for suspicious operators on the Elaborated view (how??)
  – Cross-probe to source files
• Resolution
  – Fix the source code and/or use synthesis options

Case Study: Locating the Cause of the Issue
• Look for suspicious operators
  – Ctrl-F in Elaborated Schematic
  – Select suspicious operators (here: MULT, MOD…)
  – Press F4 to view schematic
  – Press F7 to cross-probe
