
Towards Layout-Friendly High-Level Synthesis



  1. Towards Layout-Friendly High-Level Synthesis
     Jason Cong (UCLA), Bin Liu (UCLA), Guojie Luo (Peking University), Raghu Prabhakar (UCLA)

  2. Outline
     • High-level synthesis and layout-friendly architecture
     • Evaluation of the impact of high-level decisions
     • Evaluation of metrics for scheduling/binding
     • Conclusion

  3. High-Level Synthesis
     • Synthesis as a model refinement process
     • Mature RTL-to-layout flow today
     • Behavior model: one level above RTL
       – C/C++/SystemC/Matlab, etc.
     • High-level synthesis
       – Untimed behavioral model to cycle-accurate RTL
       – Typically: C to Verilog
     [Figure: refinement chain Behavioral Model → RTL → Gate-Level Netlist → Layout]

  4. A Typical Synthesis Flow from Behavior Level
     Example behavior:
       t1 = a + b;
       t2 = c * d;
       t3 = e + f;
       t4 = t1 * t2;
       z = t4 - t3;
     Flow:
     • Compiler transformation: Program -> CDFG
     • Scheduling: CDFG -> FSMD
     • Binding: FSMD -> RTL
     • RTL Synthesis, P&R …
     [Figure: the example's data-flow graph scheduled into control states S0, S1, S2 (3 cycles), then bound to functional units]
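The scheduling step on this slide can be illustrated with a minimal sketch of ASAP (as soon as possible) scheduling applied to the five-operation example. It assumes every operation takes exactly one cycle and that resources are unlimited; neither assumption comes from the slides, and the schedule drawn in the original figure may place operations differently. The names `Op`, `example_dfg`, `asap_schedule`, and `print_schedule` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

#define NUM_OPS 5
#define MAX_PREDS 2

/* One operation of the data-flow graph. preds[] holds the indices of the
 * operations whose results it consumes; -1 marks an operand that is a
 * primary input (a, b, c, d, e, f). */
typedef struct {
    const char *name;
    int preds[MAX_PREDS];
} Op;

/* The slide's example, listed in topological order. */
const Op example_dfg[NUM_OPS] = {
    {"t1 = a + b",   {-1, -1}},
    {"t2 = c * d",   {-1, -1}},
    {"t3 = e + f",   {-1, -1}},
    {"t4 = t1 * t2", { 0,  1}},
    {"z = t4 - t3",  { 3,  2}},
};

/* ASAP scheduling under the assumptions stated above (unit latency,
 * unlimited resources). Writes the control state of each operation into
 * state[] and returns the schedule length in cycles. */
int asap_schedule(const Op *ops, int n, int *state)
{
    int length = 0;
    for (int i = 0; i < n; i++) {
        state[i] = 0;
        for (int p = 0; p < MAX_PREDS; p++) {
            int pred = ops[i].preds[p];
            if (pred < 0)
                continue;            /* primary input, no dependency */
            assert(pred < i);        /* relies on topological order */
            if (state[pred] + 1 > state[i])
                state[i] = state[pred] + 1;
        }
        if (state[i] + 1 > length)
            length = state[i] + 1;
    }
    return length;
}

/* Prints one line per operation in the form "S<state>: <operation>". */
void print_schedule(const Op *ops, int n, const int *state)
{
    for (int i = 0; i < n; i++)
        printf("S%d: %s\n", state[i], ops[i].name);
}
```

Calling `asap_schedule(example_dfg, NUM_OPS, state)` places t1, t2, and t3 in S0, t4 in S1, and z in S2, which agrees with the 3-cycle length shown on the slide. A resource-constrained scheduler would additionally limit how many multipliers or adders may be active in the same state.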

  5. A Short History of High-Level Synthesis
     • 1980s to early 1990s: research and prototype
     • Late 1990s: early commercialization
       – Synopsys Behavioral Compiler, etc.
       – Mostly from behavioral VHDL/Verilog
     • 2000 to present: another wave of commercialization
       – C-based languages (C/C++/SystemC) as input
       – AutoESL (Xilinx), Cadence, Forte, Mentor (Calypto), NEC, Synfora (Synopsys), Synopsys
       – Growing interest driven by design complexity and time-to-market pressure

  6. xPilot: Behavioral-to-RTL Synthesis Flow [SOCC'2006]
     • Advanced transformation/optimizations
       – Loop unrolling/shifting/pipelining
       – Strength reduction / Tree height reduction
       – Bitwidth analysis
       – Memory analysis …
     • Core behavior synthesis optimizations
       – Scheduling
       – Resource binding, e.g., functional unit binding, register/port binding
     • Arch-generation & RTL/constraints generation
       – Verilog/VHDL/SystemC
       – FPGAs: Altera, Xilinx
       – ASICs: Magma, Synopsys, …
     [Figure: flow from behavioral spec. in C/C++/SystemC and platform description, through the frontend compiler and SSDM, to RTL + constraints for FPGAs/ASICs]

  7. AutoPilot Compilation Tool (based on UCLA xPilot system)
     • Platform-based C to FPGA synthesis
     • Synthesize pure ANSI-C and C++, GCC-compatible compilation flow
     • Full support of IEEE-754 floating point data types & operations
     • Efficiently handle bit-accurate fixed-point arithmetic
     • More than 10X design productivity gain
     • High quality-of-results
     [Figure: AutoPilot™ flow. Inputs: design specification (C/C++/SystemC), user constraints, common testbench. Stages: compilation & elaboration, presynthesis optimizations, behavioral & communication synthesis and optimizations (ESL synthesis), with a platform characterization library. Outputs: RTL HDLs & RTL SystemC, timing/power/layout constraints; simulation, verification, and prototyping; FPGA co-processor.]
     Developed by AutoESL, acquired by Xilinx in Jan. 2011

  8. AutoPilot Results: Sphere Decoder (from Xilinx)
     • Wireless MIMO Sphere Decoder
       – ~4000 lines of C code
       – Xilinx Virtex-5 at 225 MHz
     • Compared to optimized IP
       – 11-31% better resource usage

     Metric      RTL Expert   AutoPilot Expert   Diff (%)
     LUTs        32,708       29,060             -11%
     Registers   44,885       31,000             -31%
     DSP48s      225          201                -11%
     BRAMs       128          99                 -26%

     [Figure: top-level block diagram with 4x4, 3x3, and 2x2 stages (matrix multiply, QRD, matrix inverse, back substitution, norm search/reorder), an 8x8 RVD QRD, and a tree-search sphere detector (Stage 1 … Stage 8) followed by min search]
     Source: TCAD April 2011 (keynote paper), "High-Level Synthesis for FPGAs: From Prototyping to Deployment"

  9. AutoPilot Results: DQPSK Receiver (from BDTI)
     • Application
       – DQPSK receiver
       – 18.75 Msamples @ 75 MHz clock speed
     • Area better than hand-coded

                                                                      Hand-coded RTL   AutoPilot
     Xilinx XC3SD3400A chip utilization ratio (lower the better)      5.9%             5.6%

     Source: BDTI evaluation of AutoPilot, http://www.bdti.com/articles/AutoPilot.pd

  10. AutoPilot Results: Optical Flow (from BDTI)
     • Application
       – Optical flow, 1280x720 progressive scan
       – Design too complex for an RTL team
     • Compared to high-end DSP:
       – 30X higher throughput, 40X better cost/fps

     Chip                                           Unit Cost   Highest Frame Rate @ 720p (fps)   Cost/performance ($/frame/second)
     Xilinx Spartan3ADSP XC3SD3400A chip            $27         183                               $0.14
     Texas Instruments TMS320DM6437 DSP processor   $21         5.1                               $4.20

     [Figure: input video and output video frames]
     Source: BDTI evaluation of AutoPilot, http://www.bdti.com/articles/AutoPilot.pdf

  11. Impact on Quality of Result
     • Big impact on QoR due to drastically different architectures
       – Parallel/sequential/pipelined
       – Different ways to map operations to control states
       – Different ways to share functional units/registers/interconnects
     • Opportunity to select from multiple possible implementations
       – Instead of struggling with a sub-optimal RTL
     • Need metrics/models to decide which implementation is superior
       – Performance/throughput/area can be estimated reasonably well in HLS
       – Frequency/congestion is quite hard
       – Some RTL structures lead to long interconnect delay after layout

  12. Interconnect Estimation: the Challenge
     • Estimation of interconnect timing and congestion is hard at a high level
       – Long wires/congestion occur during layout
     • Incorporate layout in synthesis?
       – Reasonable, but time consuming
       – May not be necessary if we just want to estimate whether one solution is better than the other
       – Try to get the more layout-friendly solution
     • In this work
       – Experimentally evaluate the impact of HLS decisions on congestion
       – Evaluate some possible metrics without doing layout

  13. Outline
     • High-level synthesis and layout-friendly architecture
     • Evaluation of the impact of high-level decisions
     • Evaluation of metrics for scheduling/binding
     • Conclusion

  14. Experiment Setup
     • Varying strategies in HLS
       – Impacts of compiler transformation (loop unrolling, memory partitioning, etc.) and synthesis engine (scheduling & binding) evaluated separately
     • 5 DSP benchmarks (lots of multiplication/addition, simple or no control flow) for synthesis engine
     • Flow stages: compiler transformation (Program -> CDFG), scheduling (CDFG -> FSMD), binding (FSMD -> RTL)

     Scheduling strategies:
     #   Scheduling objective          Resource constraint
     1   ASAP (as soon as possible)    None
     2   ALAP (as late as possible)    None
     3   MINREG (reduce registers)     None
     4   ALAP                          #M = ceil(0.25 * m)
     5   ALAP                          #M = ceil(0.25 * m), #A = ceil(0.4 * a)
     6   MINREG                        #M = ceil(0.1 * m), #A = ceil(0.2 * a)

     Binding strategies:
     #    Binding objective                 Constraint
     1    Total area                        None
     2    Total area                        Mux_input <= 4
     3    #R (total number of registers)    Mux_input <= 4
     4    #M                                None
     5    #M                                Mux_input <= 4
     6    #M and #R                         None
     7    #M and #R                         Mux_input <= 4
     8    #M and #A                         None
     9    #M and #A                         Mux_input <= 4
     10   #M and #A and #R                  Mux_input <= 4

     Benchmarks:
             Number of lines in C   Number of nodes in CDFG
     Test1   96                     78
     Test2   20                     90
     Test3   97                     160
     Test4   16                     50
     Test5   87                     390

     #M: number of multipliers; m: number of multiplications
     #A: number of adders; a: number of additions/subtractions
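The resource constraints in the scheduling table are ceilings of a fraction of the operation count. A minimal sketch of how such limits could be computed is given below. It uses integer arithmetic (0.25 = 1/4, 0.4 = 2/5, 0.1 = 1/10, 0.2 = 1/5) so that floating-point rounding cannot push a ceiling up by one. The names `ceil_fraction`, `ResourceLimit`, and `resource_limit`, and the use of -1 for "no constraint", are assumptions made for illustration and are not taken from xPilot.

```c
#include <assert.h>

/* ceil(num * count / den) in integer arithmetic, for non-negative inputs. */
int ceil_fraction(int num, int den, int count)
{
    assert(num >= 0 && den > 0 && count >= 0);
    return (num * count + den - 1) / den;
}

/* Upper bounds on functional units; -1 means "no constraint". */
typedef struct {
    int multipliers; /* #M */
    int adders;      /* #A */
} ResourceLimit;

/* Resource limits for scheduling strategies 1..6 from the table, given
 * m multiplications and a additions/subtractions in the CDFG. */
ResourceLimit resource_limit(int strategy, int m, int a)
{
    ResourceLimit r = {-1, -1};
    switch (strategy) {
    case 4:  /* #M = ceil(0.25 * m) */
        r.multipliers = ceil_fraction(1, 4, m);
        break;
    case 5:  /* #M = ceil(0.25 * m), #A = ceil(0.4 * a) */
        r.multipliers = ceil_fraction(1, 4, m);
        r.adders = ceil_fraction(2, 5, a);
        break;
    case 6:  /* #M = ceil(0.1 * m), #A = ceil(0.2 * a) */
        r.multipliers = ceil_fraction(1, 10, m);
        r.adders = ceil_fraction(1, 5, a);
        break;
    default: /* strategies 1-3 have no resource constraint */
        break;
    }
    return r;
}
```

For a hypothetical design with m = 30 multiplications and a = 12 additions/subtractions, strategy 5 would allow 8 multipliers and 5 adders, while strategy 6 would allow 3 of each.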

  15. The RTL Implementation Flow for Routability Evaluation
     C program
       -> high-level synthesis by xPilot (with different strategies)
       -> Verilog code
       -> RTL elaboration by Quartus
       -> Logic synthesis by ABC
       -> Pack & place by VPACK+VPR
       -> Routing by VPR
       -> Evaluation

  16. Implementation Flow Setup
     • Target platform: island-style FPGA
       – 10 4-LUTs per CLB, with routing channels between CLBs (span = 1 CLB)
       – The number of routing tracks per channel (channel width) is constant
     • Configurations of the toolchain
       – Logic synthesis by ABC with default settings
       – Packing by T-VPACK with default settings
       – Wirelength-driven placement by VPR using simulated annealing
       – Routing by VPR using negotiation-based routing and directed search
         • The channel width is variable and determined by binary search
     • Post-layout characteristics
       – Maximum channel width (CW_max)
       – Average wirelength (WL_avg) = average #tracks per net
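The binary search over channel width mentioned above can be sketched as follows. This is not VPR's code: the callback type `RoutableFn`, the function `min_channel_width`, and the stand-in predicate `routes_at_or_above` are hypothetical, and the sketch assumes routability is monotone (if a design routes at some width, it also routes at every larger width).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: returns nonzero if the design routes with the given
 * number of tracks per channel. A real flow would run the router here. */
typedef int (*RoutableFn)(int channel_width, void *ctx);

/* Smallest width in [lo, hi] for which routable() succeeds, or -1 if even
 * hi fails. Assumes monotonicity as described in the lead-in. */
int min_channel_width(RoutableFn routable, void *ctx, int lo, int hi)
{
    assert(routable != NULL && lo >= 1 && lo <= hi);
    if (!routable(hi, ctx))
        return -1;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (routable(mid, ctx))
            hi = mid;       /* mid routes; try a narrower channel */
        else
            lo = mid + 1;   /* mid fails; a wider channel is needed */
    }
    return lo;
}

/* Stand-in predicate for demonstration only: the design "routes" whenever
 * the width is at least the threshold pointed to by ctx. */
int routes_at_or_above(int channel_width, void *ctx)
{
    return channel_width >= *(const int *)ctx;
}
```

Each probe corresponds to one full routing attempt, so the search needs roughly log2(hi - lo) router runs rather than one run per candidate width.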

  17. Impact of the Synthesis Engine
     • 60 RTLs generated for each design
       – 6 scheduling strategies, 10 binding strategies
       – Some are equivalent
     • Results: min/max for each metric
       – Clearly, very different although behaviorally equivalent

  18. Impact of the Synthesis Engine (min vs max)
     [Figure: four bar charts comparing the minimum and maximum of CW_max, CW_avg, WL_tot, and WL_avg over the generated RTLs for test1 to test5]

  19. Impact of Compiler Transformations
     • A matrix multiplication example
         outer_loop: for (i = 0; i < 8; i++) {
           middle_loop: for (j = 0; j < 8; j++) {
             Result[i][j] = 0;
             inner_loop: for (k = 0; k < 8; k++)
               Result[i][j] += X[i][k] * Y[k][j];
           }
         }
     • Different ways to transform/pipeline the code, partition memory

     #   Loop                                                 Memory
     1   Keep all loops, pipeline inner loop                  As is
     2   Unroll inner loop, pipeline middle loop              Partition X into columns and Y into rows to allow simultaneous accesses
     3   Unroll inner and middle loop, pipeline outer loop    Partition X and Y into scalars, partition Result into columns
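To make strategy 2 concrete, the following sketch writes the inner-loop unrolling out by hand in plain C next to the original triple loop. It is an illustration only: an HLS tool would normally perform the unrolling, pipelining, and memory partitioning itself, and no tool directives are shown. The `int` element type, the function names, and the balanced adder tree are assumptions. The unrolled body reads one element from each column of X and one from each row of Y in the same iteration, which is why the table pairs this transformation with partitioning X into columns and Y into rows.

```c
#include <assert.h>

#define N 8

/* Reference: the triple loop from the slide (strategy 1 keeps this shape). */
void matmul_reference(int X[N][N], int Y[N][N], int Result[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            Result[i][j] = 0;
            for (int k = 0; k < N; k++)
                Result[i][j] += X[i][k] * Y[k][j];
        }
    }
}

/* Strategy 2 written out by hand: the inner loop is fully unrolled, so each
 * (i, j) iteration needs X[i][0..7] and Y[0..7][j] at the same time. */
void matmul_inner_unrolled(int X[N][N], int Y[N][N], int Result[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            /* Eight independent products. */
            int p0 = X[i][0] * Y[0][j];
            int p1 = X[i][1] * Y[1][j];
            int p2 = X[i][2] * Y[2][j];
            int p3 = X[i][3] * Y[3][j];
            int p4 = X[i][4] * Y[4][j];
            int p5 = X[i][5] * Y[5][j];
            int p6 = X[i][6] * Y[6][j];
            int p7 = X[i][7] * Y[7][j];
            /* Balanced adder tree: three addition levels instead of a
             * chain of eight accumulations. */
            Result[i][j] = ((p0 + p1) + (p2 + p3)) + ((p4 + p5) + (p6 + p7));
        }
    }
}
```

Both functions compute the same result as long as no integer overflow occurs; they differ in how many multipliers and memory ports a straightforward hardware mapping would need per iteration.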

  20. Impact of Compiler Transformations

  21. Outline
     • High-level synthesis and layout-friendly architecture
     • Evaluation of the impact of high-level decisions
     • Evaluation of metrics for scheduling/binding
     • Conclusion
