Advanced Synthesis Techniques



  1. Advanced Synthesis Techniques

  2. Reminder From Last Year
     • Use UltraFast Design Methodology for Vivado
       – www.xilinx.com/ultrafast
     • Recommendations for Rapid Closure
       – HDL: use HDL Language Templates & DRC
       – Constraints: Timing Constraint Wizard, DRC (Tools > Report > Report DRC)
       – Iterate in Synthesis (converge within 300ps)
         • Real problems seen post synthesis (long path…)
         • Faster iterations & higher impact
         • Improve area, timing, power
       – Only then, iterate in next steps
         • opt, place, phys_opt, route, phys_opt
     • Example
       – Worst path post Synthesis: 4.3ns, 13 levels of logic!
       – Worst path post Route: 4.1ns, 4 levels of logic

  3. Advanced Synthesis Techniques Overview
     • Advantages of C Synthesis over RTL Synthesis
     • Advanced Synthesis Techniques for Design Closure
     • Case Study: design closure at Synthesis level

  4. HLS & IP Integrator (IPI) vs. RTL Synthesis
     Actual example:
     • Traditional flow
       – RTL (VHDL): 100k lines with VHDL, Verilog; 50k lines with VHDL-2008, SV
       – Flow stages (diagram): RTL, Sim., Synthesis, P&R, HW system debug
       – Test benches (SystemC, driver level); diagram labels: exhaustive functional tests, minimal test
       – Design closure: 240 people*mo (10 people, 2 years)
     • HLS-based flow
       – C++ code (5k lines), from which HLS generates the RTL (VHDL)
       – Flow stages (diagram): C++, HLS, IPI, Synth, P&R, system-level debug
       – Test benches (SystemC, application level); diagram label: exhaustive functional tests
       – Design closure: 16 people*mo (2 people, 8 months), 15x faster
       – Faster for derivative designs: C++ reuse, scales with parameters, device independent
     • Verification advantage, e.g. video processing
       – RTL: 1 frame per ~5 hours
       – C++: 1 frame per second

  5. HLS Automates Micro-architecture Exploration
     • Project specification, first formulation:
         while (i++)
           for j = 0 .. N
             y(i) = y(i) + b(j) * x(i-j)
     • Project specification, second formulation:
         while (i++)
           for j = 0 .. N
             y(i) = y(i-j) + b(j) * x(i)
     • Architecture choices
       [Diagram: two filter structures built from input x(n), coefficients b0, b1, b2 … bm-1, multipliers, adders and Z^-1 delays]
     • Micro-architecture choices
       [Diagram labels: algorithmic delay, pipeline register]
       – Fully parallel: N DSP, no cascade
       – Fully parallel: N DSP + cascade
       – Fully folded: 1 DSP (default)
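
The two loop formulations can be compared with a small software model. The sketch below is plain Python rather than HLS input. The function names `fir_gather` and `fir_scatter` are hypothetical, and writing the second loop as the scatter update `y[i+j] += b[j] * x[i]` is an assumption about what the slide's second formulation intends.

```python
def fir_gather(x, b):
    """First formulation: each output gathers its taps, y[i] = sum_j b[j] * x[i-j]."""
    y = [0] * len(x)
    for i in range(len(x)):
        for j in range(len(b)):
            if i - j >= 0:            # samples before x[0] are treated as zero
                y[i] += b[j] * x[i - j]
    return y


def fir_scatter(x, b):
    """Assumed second formulation: each input sample is scattered into later outputs."""
    y = [0] * len(x)
    for i in range(len(x)):
        for j in range(len(b)):
            if i + j < len(x):        # outputs past the end of the block are dropped
                y[i + j] += b[j] * x[i]
    return y
```

Both functions compute the same samples; only the loop-carried dependency differs, which is the kind of choice that leads to different hardware structures.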

  6. HLS Micro-Architecture Exploration
     • C++ source, shown next to its dataflow graph (inputs x[i], coefficients c0, c1, c2, delays Z^-1 and Z^-2) and the generated RTL:
         while (1)
           M3(i) = M1(i-2) * C2
           A1(i) = M3(i) + x(i)
           M2(i) = M1(i-1) * C1
           A2(i) = A1(i) + M2(i)
           M1(i) = A2(i) * C0
           i++
     • Schedule 1: 14 cycles (iterations i, i+1 end at cycles 14, 28)
       – Sequential process (CPU model)
       – Minimal HW resources: 1 MULT, 1 ADD
     • Schedule 2: 10 cycles (iterations end at cycles 10, 20)
       – Parallelism within each iteration
       – Better performance (~29%)
       – 2 MULT, 1 ADD
     • Schedule 3: 9 cycles (iterations end at cycles 9, 18)
       – Loop pipelining
       – Best performance (~36%)
       – 2 MULT, 1 ADD
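
The quoted gains follow from the cycle counts if they are read as the reduction in cycles per iteration relative to Schedule 1. That reading is an assumption, and the helper name `cycle_reduction` is hypothetical.

```python
def cycle_reduction(baseline_cycles, new_cycles):
    """Fractional reduction in cycles per iteration relative to the baseline."""
    return (baseline_cycles - new_cycles) / baseline_cycles


schedules = {"Schedule 1": 14, "Schedule 2": 10, "Schedule 3": 9}
for name, cycles in schedules.items():
    gain = cycle_reduction(schedules["Schedule 1"], cycles)
    print(f"{name}: {cycles} cycles, {gain:.1%} fewer than Schedule 1")
```

Under that reading, 10 cycles is about 28.6% fewer than 14 and 9 cycles is about 35.7% fewer, which round to the ~29% and ~36% on the slide.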

  7. Vivado Synthesis Flow
     • Inputs: VHDL, Verilog, VHDL-2008, SystemVerilog
       – VHDL-2008 and SystemVerilog are more compact: advanced types…
       – and verification friendly: UVM, SVA…
     • Analyze
       – Syntax check
       – Build file hierarchy
     • Elaborate (with XDC)
       – Design hierarchy
       – Cross-probing
       – Unroll loops
       – Build logic with module generators: arithmetic, RAM, FSM, Boolean logic
     • RTL Optimizations
     • Optimize & Map
       – Boolean optimization
       – Technology mapping (LUT6)
     • Output: P&R or DCP

  8. • Architecture-Aware Coding
     • Priority Encoders
     • Loops
     • Clocks & Resets
     • Directives & Strategies
     • Case Study

  9. Architecture-Aware DSP
     • HDL code needs to match DSP hardware (e.g. DSP48E2)
       – Signedness, width of nets, optimal pipelining…
       [Diagram: DSP block with inputs A, B, C; labels: signed, 27 bit, 18, 45, 48, ACC, XOR, EQ]
     • Verify that DSPs are inferred efficiently
     • Use templates & coding style examples (signed arithmetic with pipelining):
       – Multiply-accumulate
       – Large accumulator
       – FIR (UG579)
       – Squarer (UG901)
       – Complex multiplier
       – Dynamic pre-adder
       – Rounding (2015.3)
       – XOR (2016.1)
       – …

  10. DSP Block Inference Improvements
      • Squarer: 1 DSP
        – (a – b)^2
        – (a + b)^2
      • Complex multiplier: 3 DSP
        – (a+bi)*(c+di) = ((c-d)*a + S) + ((c+d)*b + S)i, with S = (a-b)*d
        [Diagram: three multipliers with pre-subtract and pre-add stages producing Re and Im]
      • Wider arithmetic requires more pipelining
        – e.g. MULT 44x35 requires 4 MULT 27x18 & ADD
        – A pipelined MULT 44x35 in HDL is mapped by Synthesis to 4 DSP blocks (27x18 MULT)
      Verify proper inference for full DSP block performance!
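
Both arithmetic claims can be checked numerically. This is a minimal Python sketch, not a description of what Vivado does internally. The function names are hypothetical, and the 17-bit low-part split used for the 44x35 product is an assumption chosen so that every partial product fits a signed 27x18 multiplier.

```python
def complex_mult_3(a, b, c, d):
    """(a+bi)*(c+di) using three multiplications, following the slide's formula."""
    s = (a - b) * d          # shared product S
    re = (c - d) * a + s     # expands to a*c - b*d
    im = (c + d) * b + s     # expands to a*d + b*c
    return re, im


def fits_signed(value, bits):
    """True if value is representable as a two's-complement number of that width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def wide_mult_4(a, b, low_bits=17):
    """Signed 44x35 product assembled from four partial products plus adds."""
    mask = (1 << low_bits) - 1
    a_hi, a_lo = a >> low_bits, a & mask   # a == (a_hi << low_bits) + a_lo
    b_hi, b_lo = b >> low_bits, b & mask   # low parts are unsigned, high parts signed
    partials = [(a_hi, b_hi), (a_hi, b_lo), (a_lo, b_hi), (a_lo, b_lo)]
    # Assumed constraint: each partial product must fit one signed 27x18 multiplier.
    assert all(fits_signed(p, 27) and fits_signed(q, 18) for p, q in partials)
    hh, hl, lh, ll = (p * q for p, q in partials)
    return (hh << (2 * low_bits)) + ((hl + lh) << low_bits) + ll
```

With this split, 44 = 27 + 17 and 35 = 18 + 17, which is one plausible reason the slide picks 44x35 as the example size. The shifted additions are where the extra pipeline stages would go.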

  11. Architecture-Aware RAM & ROM
      • HDL code needs to match BRAM architecture (RAMB36)
        – Registered address (sync read), optional output register
        – 32K configurations
          • Width=1 x Depth=2^15 (32K) = 32Kx1
          • Width=2 x Depth=2^14 (16K) = 16Kx2
          • …
          • Width=32 x Depth=2^10 (1K) = 1Kx32
        – 36K configuration
          • Width=36 x Depth=2^10 (1K) = 1Kx36
      • Wider & deeper memories
        – Automatically inferred by Synthesis
      • Example: single port RAM
        [Diagram: 32x1K memory with addr input, registered address, output Q]
      Verify that BRAMs are inferred efficiently!

  12. RAM Decomposition: Example
      • 32Kx32 RAM, three decompositions:
        – High Performance & Power (default w/ timing constraints)
          • 32x 32Kx1 (W=1, D=15)
          • 1 level, 32 BRAM active
        – Low Power & Performance: UltraScale cascade-MUX
          • 32x 1Kx32 (W=32, D=10)
          • 32 levels, 1 BRAM active
          • (* cascade_height = 32 *) …
        – Performance/Power Trade-off: hybrid LUT & UltraScale cascade
          • 8x cascades of 4x 1Kx32 (W=32, D=10), combined by an 8-1 MUX in LUTs
          • 4 levels, 4 BRAM active
          • (* cascade_height = 4 *) …
      Verify that BRAMs are decomposed efficiently!
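
The primitive counts in the three decompositions can be reproduced with a simple counting model. This sketch only does the width and depth arithmetic. It assumes 32K data bits per RAMB36 (parity bits ignored), the function `decompose` and its result keys are hypothetical, and it says nothing about which option the tool actually selects.

```python
import math

BRAM_DATA_BITS = 32 * 1024  # assumed data bits in one RAMB36, parity not counted


def decompose(width, depth, prim_width, cascade_height=1):
    """Count the primitives needed to build a width x depth memory."""
    prim_depth = BRAM_DATA_BITS // prim_width
    columns = math.ceil(width / prim_width)    # primitives side by side (data width)
    rows = math.ceil(depth / prim_depth)       # primitives stacked (address depth)
    chains = math.ceil(rows / cascade_height)  # cascade chains a LUT mux must choose between
    return {"brams": columns * rows, "columns": columns,
            "rows": rows, "mux_inputs": chains}


wide = decompose(32, 32 * 1024, prim_width=1)                       # 32x 32Kx1
deep = decompose(32, 32 * 1024, prim_width=32, cascade_height=32)   # one cascade of 32
hybrid = decompose(32, 32 * 1024, prim_width=32, cascade_height=4)  # cascades of 4
```

All three use 32 block RAMs. In this model the hybrid case leaves 8 cascade chains, consistent with the 8-1 MUX on the slide, while the full cascade leaves a single chain and needs no LUT multiplexer.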

  13. RAM & ROM Recommendations
      • Use pipeline Reg for performance
        [Diagram: BRAM followed by pipeline Reg stages]
        – No logic in-between
        – No fanout
        – In same hierarchy!
      • Add extra pipeline for best performance!
        – Run phys_opt to move Reg in & out based on timing (BRAM slack<0 vs. BRAM slack>0)
      Verify that BRAMs are pipelined efficiently!

  14. Beware of Priority Logic
      • Without else:
          if (c0) q = a0;
          if (c1) q = a1;
          if (c2) q = a2;
          if (c3) q = a3;
          if (c4) q = a4;
          if (c5) q = a5; …
      • With else:
          if (c0) q = a0;
          else if (c1) q = a1;
          else if (c2) q = a2;
          else if (c3) q = a3;
          else if (c4) q = a4;
          else if (c5) q = a5; …
      • Both are priority encoded logic, which means long paths
        [Diagram: two chains of 2:1 muxes over a0…a5 and c0…c5, in opposite priority order]
      • Removing else’s won’t help!!
      Priority logic will hurt Timing Closure!
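
A small behavioural model shows why dropping the else keywords changes nothing structurally. In a chain of plain if statements the last true condition wins, and in an else-if chain the first one wins, so each is a priority chain with the order reversed. The Python below is a sketch with hypothetical function names, not synthesizable code.

```python
def if_chain(conds, values, default=None):
    """if (c0) q = a0; if (c1) q = a1; ...  The LAST true condition wins."""
    q = default
    for c, a in zip(conds, values):
        if c:
            q = a
    return q


def else_if_chain(conds, values, default=None):
    """if (c0) q = a0; else if (c1) q = a1; ...  The FIRST true condition wins."""
    for c, a in zip(conds, values):
        if c:
            return a
    return default
```

For every input, `if_chain(conds, values)` equals `else_if_chain` applied to the reversed lists, so the logic depth is the same in both styles.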

  15. Priority Logic with “for” loops
      • Loop without break (same as if…if…if…):
          flag = 0;
          for (i=0 ; i<32 ; i=i+1)
            if (c[i])
              flag = 1;
      • Loop with break (same as if…else if…else if…):
          flag = 0;
          for (i=0 ; i<32 ; i=i+1)
            if (c[i]) begin
              flag = 1;
              break; //SystemVerilog
            end
      • break won’t help!! “break” does not reduce logic!
        [Diagram: two 32-stage mux chains over c[0]…c[31], in opposite order]
      • Best code in this case: flag = |c
      Think Simple!
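
The equivalence behind "flag = |c" can be checked exhaustively on a short vector. This Python model is only a sketch with hypothetical function names. It treats `c` as a list of bits and compares both loop styles with a plain OR reduction.

```python
def flag_loop(c):
    """Loop without break: keeps scanning after the first set bit."""
    flag = 0
    for bit in c:
        if bit:
            flag = 1
    return flag


def flag_loop_break(c):
    """Loop with break: stops at the first set bit."""
    flag = 0
    for bit in c:
        if bit:
            flag = 1
            break
    return flag


def flag_or_reduce(c):
    """Equivalent of the Verilog reduction flag = |c."""
    return int(any(c))
```

All three return the same value for every input, so the loop form only adds a chain that synthesis then has to simplify.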

  16. Priority Logic with “case” Statement
      • CASE won’t help either! (note: values are variables)
          case (c)
            v0: q=a0;
            v1: q=a1;
            v2: q=a2;
            v3: q=a3;
            v4: q=a4; …
      • BAD: chain of comparisons c==v0, c==v1, … c==v5, each stage selecting between its value and the rest of the chain
      • GOOD: parallel selection among a0…a5
      • If conditions are mutually exclusive, make it clear!
        – In Verilog: case (c) //synthesis parallel_case (watch for simulation mismatch!)
        – In SystemVerilog: unique case (c) // works with “if” too
      • Note: please use complete conditions
        – .v: full_case (simulation may not match), or default & assign don’t_care
        – .sv: priority (for case & if)
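
The difference between the BAD and GOOD structures can be modelled as a comparison chain versus an AND-OR selection. The sketch below is Python with hypothetical function names. The parallel form raises an error when two labels match at once, which stands in for the simulation mismatch the slide warns about.

```python
def priority_select(c, labels, values, default=0):
    """case with variable labels, elaborated as a chain: first matching label wins."""
    for v, a in zip(labels, values):
        if c == v:
            return a
    return default


def parallel_select(c, labels, values, default=0):
    """AND-OR form: every comparison is evaluated independently, results are OR-ed."""
    hits = [c == v for v in labels]
    if sum(hits) > 1:
        raise ValueError("labels overlap: parallel and priority results may differ")
    out = 0
    for hit, a in zip(hits, values):
        out |= a if hit else 0
    return out if any(hits) else default
```

When the labels are distinct, the two functions agree for every selector value, which is what a parallel_case or unique annotation promises to the tool.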

  17. Priority Logic Which Should Not Be!
      • BAD: 1-hot conditions (here: binary encoded) tested in a priority chain
          c0 = ( S == 0);
          c1 = ( S == 1);
          c2 = ( S == 2);
          c3 = ( S == 3);
          c4 = ( S == 4);
          …
          if (c0) q = a0;
          else if (c1) q = a1;
          else if (c2) q = a2;
          else if (c3) q = a3;
          else if (c4) q = a4;
          else if (c5) q = a5; …
      • GOOD: case statement
          case ( S )
            0: q = a0
            1: q = a1
            2: q = a2
            …
      • GOOD: indexed selection
          q = A[S]
      • Automated in most cases … even with registered conditions!
      • Or: unique if (c0) … in SystemVerilog
        [Diagram: BAD is a mux chain a0…a5 controlled by S 0..2; GOOD is a mux tree over a0..7 (a0..3 and a4..7) controlled by S0, S1, S2]
      If conditions are mutually exclusive, do not use priority logic. Use “unique if” in SystemVerilog.
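
Because the decoded conditions can never be true together, the else-if chain and the indexed read return the same value. A short Python model makes that explicit. The function names are hypothetical and the list `a` stands in for the signals a0, a1, and so on.

```python
def decoded_chain(s, a):
    """c_k = (S == k) followed by an else-if chain over c0, c1, ..."""
    conds = [s == k for k in range(len(a))]
    assert sum(conds) <= 1          # decoded conditions are mutually exclusive
    for c, value in zip(conds, a):
        if c:
            return value
    return None


def indexed(s, a):
    """q = A[S]: a plain multiplexer addressed by S."""
    return a[s] if 0 <= s < len(a) else None
```

The results match for every `s`, but the chain describes up to N sequential decisions where the indexed form describes a single multiplexer.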

  18. Parallelizing Priority Logic
      • When you can’t avoid O(n) priority logic, you can still reduce its depth!
      • BAD: N deep
        – if c0…c63: 64 deep
      • GOOD: N/2 + 1 deep... or N/4 + 2… or log(N) recursively
        – if c0…c31: 32 deep, evaluated in parallel with if c32…c63: 32 deep
        – Select signal over c0…c31: 2 deep (log6(32))
        [Diagram: 64-stage mux chain over a0…a63 vs. two 32-stage chains joined by a final 2:1 mux]
      Improve timing even when conditions are not mutually exclusive!
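
The split can be modelled as two half-length chains whose results are merged by one extra select. This Python sketch uses hypothetical names. The depth formula in `split_depth` assumes the number of parts is a power of two and that the parts are merged by a binary tree of selects.

```python
def first_hit(conds, values):
    """Sequential priority chain: (hit, value) for the first true condition."""
    for c, a in zip(conds, values):
        if c:
            return True, a
    return False, None


def first_hit_split(conds, values):
    """Same result from two half-length chains plus one final 2:1 select."""
    half = len(conds) // 2
    lo_hit, lo_val = first_hit(conds[:half], values[:half])   # c0 ... c(N/2-1)
    hi_hit, hi_val = first_hit(conds[half:], values[half:])   # evaluated in parallel
    if lo_hit:                       # the lower half keeps priority
        return True, lo_val
    return hi_hit, hi_val


def split_depth(n, parts):
    """Chain depth after splitting into `parts` chains (power of two) and merging."""
    return n // parts + (parts.bit_length() - 1)
```

For N = 64 this gives 33 levels with two parts and 18 with four, matching the N/2 + 1 and N/4 + 2 figures. The conditions do not need to be mutually exclusive for the result to stay correct.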

  19. Beware of Loop Unrolling – Avoid “if”
      • BAD: area & depth O(N)
          c = 0;
          for (i=0 ; i<8 ; i=i+1)
            if (a[i])
              c = c+1;
        [Diagram: chain of +1 incrementers, each bypassed by a mux controlled by a[0] … a[7]]
      • Get rid of “if”:
          c = 0;
          for (i=0 ; i<8 ; i=i+1)
            c = c+a[i];
      • GOOD: area & depth log3(N)
          c = a[0] + a[1] + a[2] +
              a[3] + a[4] + a[5] +
              a[6] + a[7];
        [Diagram: adder tree over a[0] … a[7]]
      “if” in loops can seriously hurt timing!
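
For single-bit `a[i]`, the conditional increment and the plain sum give the same count, and the sum can be regrouped freely into a tree. The Python sketch below checks the equivalence. It also counts tree levels on the assumption that the slide's log3(N) refers to three-input adders, which is an interpretation rather than something the slide states. All function names are hypothetical.

```python
def count_with_if(a):
    """Loop with 'if': a conditional increment per bit."""
    c = 0
    for bit in a:
        if bit:
            c = c + 1
    return c


def count_with_sum(a):
    """Loop without 'if': a plain sum that can be rebalanced into a tree."""
    c = 0
    for bit in a:
        c = c + bit
    return c


def ternary_tree_depth(n):
    """Adder levels needed to sum n operands with three-input adders (assumed)."""
    levels = 0
    while n > 1:
        n = -(-n // 3)   # ceil(n / 3) operands remain after one level
        levels += 1
    return levels
```

For the eight inputs on the slide, `ternary_tree_depth(8)` gives two levels, against eight sequential stages in the conditional chain.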
