CDA 4253 FPGA System Design: Optimization Techniques


  1. CDA 4253 FPGA System Design: Optimization Techniques. Hao Zheng, Comp Sci & Eng, Univ of South Florida

  2. Extracted from Advanced FPGA Design by Steve Kilts

  3. Optimization for Performance

  4. Performance Definitions
     • Throughput: the number of inputs processed per unit time.
     • Latency: the amount of time for an input to be processed.
     • Maximizing throughput and minimizing latency are often in conflict.
     • Both require timing optimization:
       - Reduce the delay of the critical path.

  5. Achieving High Throughput: Pipelining
     • Divide data processing into stages.
     • Process different data inputs in different stages simultaneously.

     Software loop:
         xpower = 1;
         for (i = 0; i < 3; i++)
           xpower = x * xpower;

     Hardware:
         -- Non-pipelined version
         process (clk) begin
           if rising_edge(clk) then
             if start = '1' then
               cnt    <= 3;
               xpower <= 1;            -- start from 1, as in the software loop
               done   <= '0';
             elsif cnt > 0 then
               cnt    <= cnt - 1;
               xpower <= xpower * x;   -- one multiplication per cycle
             else
               done   <= '1';          -- cnt = 0: result is ready
             end if;
           end if;
         end process;

     Throughput: 1 data / 3 cycles = 0.33 data / cycle.
     Latency: 3 cycles.
     Critical path delay: 1 multiplier delay.

  6. Achieving High Throughput: Pipelining

     Software loop:
         xpower = 1;
         for (i = 0; i < 3; i++)
           xpower = x * xpower;

     Hardware:
         -- Pipelined version
         process (clk, rst) begin
           if rising_edge(clk) then
             -- stage 1
             if start = '1' then
               x1      <= x;
               xpower1 <= x;
             end if;
             done1 <= start;           -- valid flag follows start every cycle
             -- stage 2
             x2      <= x1;
             xpower2 <= xpower1 * x1;
             done2   <= done1;
             -- stage 3
             xpower  <= xpower2 * x2;
             done    <= done2;
           end if;
         end process;

     Throughput: 1 data / cycle.
     Latency: 3 cycles + register delays.
     Critical path delay: 1 multiplier delay.
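
For reference, the pipelined fragment above can be wrapped into a complete design unit. The following is only a sketch: the entity name `power3_pipe`, the 8-bit width, and the truncation of each product to 8 bits are assumptions, not part of the slides.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: name, widths, and truncation are assumptions.
entity power3_pipe is
  port (
    clk    : in  std_logic;
    start  : in  std_logic;              -- x is valid in this cycle
    x      : in  unsigned(7 downto 0);
    xpower : out unsigned(7 downto 0);   -- x**3, truncated to 8 bits
    done   : out std_logic               -- xpower is valid in this cycle
  );
end entity power3_pipe;

architecture rtl of power3_pipe is
  signal x1, x2           : unsigned(7 downto 0) := (others => '0');
  signal xpower1, xpower2 : unsigned(7 downto 0) := (others => '0');
  signal done1, done2     : std_logic := '0';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- stage 1: capture the input
      x1      <= x;
      xpower1 <= x;
      done1   <= start;
      -- stage 2: x**2, first multiplier (16-bit product cut to 8 bits)
      x2      <= x1;
      xpower2 <= resize(xpower1 * x1, 8);
      done2   <= done1;
      -- stage 3: x**3, second multiplier
      xpower  <= resize(xpower2 * x2, 8);
      done    <= done2;
    end if;
  end process;
end architecture rtl;
```

A new `x` can be applied on every clock edge; each result appears three edges later, flagged by `done`. The cost is the second multiplier and the extra stage registers.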

  7. Comparison: iterative implementation vs. pipelined implementation (figure).

  8. Achieving High Throughput: Pipelining
     • Loop unrolling
     (Figure: blocks labeled X, C, Reg, and Y.)

  9. Achieving High Throughput: Pipelining
     • Loop unrolling
     (Figure: X, then C0, C1, ..., Cn with registers, then Y.)

  10. Achieving High Throughput: Pipelining
     • Divide data processing into stages.
     • Process different data inputs in different stages simultaneously.
     (Figure: din to dout.)

  11. Achieving High Throughput: Pipelining
     • Divide data processing into stages.
     • Process different data inputs in different stages simultaneously.
     (Figure: din, stage 1, stage 2, ..., stage n, dout, with registers.)
     Penalty: increase in area, as logic needs to be duplicated for different stages.

  12. Reducing Latency
     • Closely related to reducing critical path delay.
     • Removing pipeline registers reduces latency.
     (Figure: din, stage 1, stage 2, ..., stage n, dout, with registers.)

  13. Reducing Latency
     • Closely related to reducing critical path delay.
     • Removing pipeline registers reduces latency.
     (Figure: din, stage 1, stage 2, ..., stage n, dout, without registers.)

  14. Timing Optimization
     • The maximal clock frequency is determined by the longest path delay through any combinational logic block.
     • Pipelining is one approach.
     (Figure: din, stage 1, stage 2, ..., stage n, dout, with pipeline registers.)
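
The relationship between path delay and clock frequency can be written as a first-order model. This is a common textbook approximation and an assumption here, not taken from the slides: it ignores clock skew and jitter and folds routing delay into the logic term.

```latex
T_{\min} = t_{\text{clk-to-q}} + t_{\text{logic,max}} + t_{\text{setup}},
\qquad
f_{\max} = \frac{1}{T_{\min}}
```

Under this model, splitting the longest combinational path into $n$ balanced stages lowers $t_{\text{logic,max}}$ to roughly $t_{\text{logic,max}}/n$, while the register overhead $t_{\text{clk-to-q}} + t_{\text{setup}}$ stays fixed, so $f_{\max}$ grows by less than a factor of $n$.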

  15. Timing Optimization: Spatial Computing
     • Extract independent operations.
     • Execute independent operations in parallel.

     X = A + B + C + D

     Parallel version:
         process (clk, rst) begin
           if rising_edge(clk) then
             X1 := A + B;
             X2 := C + D;
             X  <= X1 + X2;
           end if;
         end process;
     Critical path delay: 2 adders

     Serial version:
         process (clk, rst) begin
           if rising_edge(clk) then
             X1 := A + B;
             X2 := X1 + C;
             X  <= X2 + D;
           end if;
         end process;
     Critical path delay: 3 adders
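
The parallel version above omits declarations. A self-contained sketch follows; the entity name `add4_tree` and the bit widths (8-bit inputs, 10-bit sum so that no carry is lost) are assumptions made for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: name and widths are assumptions.
entity add4_tree is
  port (
    clk        : in  std_logic;
    a, b, c, d : in  unsigned(7 downto 0);
    x          : out unsigned(9 downto 0)   -- wide enough for 4 * 255
  );
end entity add4_tree;

architecture rtl of add4_tree is
begin
  process (clk)
    -- Variables model the two intermediate sums within one clock cycle.
    variable x1, x2 : unsigned(8 downto 0);
  begin
    if rising_edge(clk) then
      x1 := resize(a, 9) + resize(b, 9);      -- independent of x2
      x2 := resize(c, 9) + resize(d, 9);      -- independent of x1
      x  <= resize(x1, 10) + resize(x2, 10);  -- second adder level
    end if;
  end process;
end architecture rtl;
```

Both versions use three adders; the tree only rearranges them so that the longest register-to-register path passes through two adders instead of three.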

  16. Timing Optimization: Avoid Unwanted Priority
         process (clk, rst) begin
           if rising_edge(clk) then
             if    c(0) = '1' then r(0) <= din;
             elsif c(1) = '1' then r(1) <= din;
             elsif c(2) = '1' then r(2) <= din;
             elsif c(3) = '1' then r(3) <= din;
             end if;
           end if;
         end process;
     Critical path delay: 3-input AND gate + 4x1 MUX.

  17. Timing Optimization: Avoid Unwanted Priority
     (Figure: logic for the if/elsif version.)
     Critical path delay: 3-input AND gate + 4x1 MUX.

  18. Timing Optimization: Avoid Unwanted Priority
         process (clk, rst) begin
           if rising_edge(clk) then
             if c(0) = '1' then r(0) <= din; end if;
             if c(1) = '1' then r(1) <= din; end if;
             if c(2) = '1' then r(2) <= din; end if;
             if c(3) = '1' then r(3) <= din; end if;
           end if;
         end process;
     Critical path delay: 2x1 MUX
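
The independent `if` statements above can also be generated with a loop, which keeps the code short as the number of registers grows. This is a sketch under assumptions: the entity name `reg_bank` and the choice of 1-bit registers held in a 4-bit vector are illustrative only.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: name and port types are assumptions.
entity reg_bank is
  port (
    clk : in  std_logic;
    c   : in  std_logic_vector(3 downto 0);  -- one write enable per register
    din : in  std_logic;
    r   : out std_logic_vector(3 downto 0)
  );
end entity reg_bank;

architecture rtl of reg_bank is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Each iteration is an independent "if", so no enable
      -- has priority over another.
      for i in c'range loop
        if c(i) = '1' then
          r(i) <= din;
        end if;
      end loop;
    end if;
  end process;
end architecture rtl;
```

Note the behavioral difference from the `elsif` chain: if several bits of `c` are set at once, every selected register is written, not just the first one.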

  19. Timing Optimization: Avoid Unwanted Priority
     (Figure: logic for the version with independent if statements.)
     Critical path delay: 2x1 MUX

  20. Timing Optimization: Register Balancing
     • The maximal clock frequency is determined by the longest path delay through any combinational logic block.
     (Figure: two versions of a din, block 1, block 2, dout datapath.)

  21. Timing Optimization: Register Balancing

     Both adders between the same pair of register stages:
         process (clk, rst) begin
           if rising_edge(clk) then
             rA  <= A;
             rB  <= B;
             rC  <= C;
             sum <= rA + rB + rC;
           end if;
         end process;

     One adder per register stage:
         process (clk, rst) begin
           if rising_edge(clk) then
             sumAB <= A + B;
             rC    <= C;
             sum   <= sumAB + rC;
           end if;
         end process;

  22. Timing Optimization: Register Balancing
         process (clk, rst) begin
           if rising_edge(clk) then
             rA  <= A;
             rB  <= B;
             rC  <= C;
             sum <= rA + rB + rC;
           end if;
         end process;

  23. Timing Optimization: Register Balancing
         process (clk, rst) begin
           if rising_edge(clk) then
             sumAB <= A + B;
             rC    <= C;
             sum   <= sumAB + rC;
           end if;
         end process;

  24. Optimization for Area

  25. Area Optimization: Resource Sharing
     • Rolling up the pipeline: share common resources at different times, a form of temporal computing.
     (Figure: a pipeline din, stage 1, stage 2, ..., stage n, dout, replaced by a single block from din to dout that includes all logic in stages 1 to n.)

  26. Area Optimization: Resource Sharing
     • Use registers to hold inputs.
     • Develop an FSM to select which inputs to process in each cycle.

     X = A + B + C + D
     (Figure: three adders computing X from A, B, C, and D.)

  27. Area Optimization: Resource Sharing
     • Use registers to hold inputs.
     • Develop an FSM to select which inputs to process in each cycle.

     X = A + B + C + D
     (Figure: the three-adder datapath next to a datapath with a shared adder and a control block.)
     A, B, C, and D need to hold steady until X is processed.
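
One way to realize the shared-adder idea is sketched below. Everything beyond the slide's description is an assumption: the entity name `add4_shared`, the `start`/`done` handshake, the state names, and the bit widths. A small FSM steps through the operands while a single adder accumulates them.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: name, handshake, and widths are assumptions.
entity add4_shared is
  port (
    clk        : in  std_logic;
    start      : in  std_logic;
    a, b, c, d : in  unsigned(7 downto 0);  -- must hold steady until done
    x          : out unsigned(9 downto 0);
    done       : out std_logic
  );
end entity add4_shared;

architecture rtl of add4_shared is
  type state_t is (idle, add_b, add_c, add_d);
  signal state : state_t := idle;
  signal acc   : unsigned(9 downto 0) := (others => '0');
  signal op    : unsigned(7 downto 0);
  signal sum   : unsigned(9 downto 0);
begin
  -- Control: the FSM state selects which input feeds the adder.
  with state select op <=
    b               when add_b,
    c               when add_c,
    d               when add_d,
    (others => '0') when others;

  -- The single shared adder.
  sum <= acc + op;

  process (clk)
  begin
    if rising_edge(clk) then
      done <= '0';
      case state is
        when idle =>
          if start = '1' then
            acc   <= resize(a, 10);  -- load the first operand
            state <= add_b;
          end if;
        when add_b =>
          acc   <= sum;
          state <= add_c;
        when add_c =>
          acc   <= sum;
          state <= add_d;
        when add_d =>
          acc   <= sum;              -- acc becomes a + b + c + d
          state <= idle;
          done  <= '1';              -- valid together with the final acc
      end case;
    end if;
  end process;

  x <= acc;
end architecture rtl;
```

Compared with the three-adder version, this trades area for time: one adder and an input multiplexer, but four clock edges per result instead of one.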

  28. Area Optimization: Resource Sharing
     Merge duplicate components together.

  29. Area Optimization: Resource Sharing
     Merge duplicate components together (in the example, this saves an 8-bit counter).
