 
CDA 4253 FPGA System Design
Optimization Techniques
Hao Zheng
Comp Sci & Eng, Univ of South Florida
Extracted from Advanced FPGA Design by Steve Kilts
Optimization for Performance
Performance Definitions
• Throughput: the number of inputs processed per unit time.
• Latency: the amount of time for an input to be processed.
• Maximizing throughput and minimizing latency are often in conflict.
• Both require timing optimization:
  - Reduce the delay of the critical path.
Achieving High Throughput: Pipelining
• Divide data processing into stages.
• Process different data inputs in different stages simultaneously.

Computation (software view):
    xpower = 1;
    for (i = 0; i < 3; i++)
        xpower = x * xpower;

    -- Non-pipelined version
    -- Fragment: assumes cnt and xpower are integer signals and that x is
    -- held stable while cnt > 0.
    process (clk) begin
        if rising_edge(clk) then
            if start = '1' then
                cnt    <= 3;
                xpower <= 1;            -- initial value, as in the loop above
                done   <= '0';
            elsif cnt > 0 then
                cnt    <= cnt - 1;
                xpower <= xpower * x;   -- one multiplication per cycle
            else
                done   <= '1';
            end if;
        end if;
    end process;

Throughput: 1 data / 3 cycles = 0.33 data / cycle.
Latency: 3 cycles.
Critical path delay: 1 multiplier delay.
Achieving High Throughput: Pipelining

Computation (software view):
    xpower = 1;
    for (i = 0; i < 3; i++)
        xpower = x * xpower;

    -- Pipelined version
    process (clk, rst) begin
        if rising_edge(clk) then
            -- stage 1
            if start = '1' then
                x1      <= x;
                xpower1 <= x;
            end if;
            done1 <= start;             -- outside the if, so done1 can return to '0'
            -- stage 2
            x2      <= x1;
            xpower2 <= xpower1 * x1;
            done2   <= done1;
            -- stage 3
            xpower  <= xpower2 * x2;
            done    <= done2;
        end if;
    end process;

Throughput: 1 data / cycle.
Latency: 3 cycles + register delays.
Critical path delay: 1 multiplier delay.
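For reference, the pipelined fragment can be wrapped into a complete design unit. The following is a minimal sketch, not the slide author's code: the entity name cube_pipe, the port name xcube, and the 8-bit input width are assumptions chosen for illustration. Widths grow at each stage because numeric_std multiplication returns a result as wide as the sum of its operand widths.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: the names and the 8-bit width are assumptions.
entity cube_pipe is
  port (
    clk   : in  std_logic;
    start : in  std_logic;               -- x is valid in this cycle
    x     : in  unsigned(7 downto 0);
    xcube : out unsigned(23 downto 0);   -- x**3 needs 3 * 8 bits
    done  : out std_logic                -- xcube is valid in this cycle
  );
end entity cube_pipe;

architecture rtl of cube_pipe is
  signal x1, x2       : unsigned(7 downto 0)  := (others => '0');
  signal xpower1      : unsigned(7 downto 0)  := (others => '0');
  signal xpower2      : unsigned(15 downto 0) := (others => '0');
  signal done1, done2 : std_logic := '0';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- stage 1: capture the input and its valid flag
      x1      <= x;
      xpower1 <= x;
      done1   <= start;
      -- stage 2: x**2 (8 x 8 -> 16 bits)
      x2      <= x1;
      xpower2 <= xpower1 * x1;
      done2   <= done1;
      -- stage 3: x**3 (16 x 8 -> 24 bits)
      xcube   <= xpower2 * x2;
      done    <= done2;
    end if;
  end process;
end architecture rtl;
```

A new x can be presented on every clock edge, and each result appears three edges later with done asserted alongside it. Each stage contains at most one multiplier between registers, which matches the critical path stated on the slide.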
Comparison
[Figures: iterative implementation; pipelined implementation]
Achieving High Throughput: Pipelining
• Loop unrolling
[Figure: iterative datapath with input X, logic block C, register Reg, and output Y]
Achieving High Throughput: Pipelining
• Loop unrolling
[Figure: unrolled datapath from input X through logic blocks C0, C1, ..., Cn, with registers (Reg), to output Y]
Achieving High Throughput: Pipelining
• Divide data processing into stages.
• Process different data inputs in different stages simultaneously.
[Figure: a single block from din to dout]
Achieving High Throughput: Pipelining
• Divide data processing into stages.
• Process different data inputs in different stages simultaneously.
[Figure: pipeline from din through stage 1, stage 2, …, stage n to dout, with registers labeled]
Penalty: increase in area, as logic needs to be duplicated for the different stages.
Reducing Latency
• Closely related to reducing critical path delay.
• Reducing pipeline registers reduces latency.
[Figure: din through stage 1, stage 2, …, stage n to dout, with registers labeled]
Reducing Latency
• Closely related to reducing critical path delay.
• Reducing pipeline registers reduces latency.
[Figure: din through stage 1, stage 2, …, stage n to dout, without the register label]
Timing Optimization
• The maximal clock frequency is determined by the longest path delay through any combinational logic block.
• Pipelining is one approach to shortening that path.
[Figure: din through stage 1, stage 2, …, stage n to dout with pipeline registers, shown alongside a din-to-dout path]
Timing Optimization: Spatial Computing
• Extract independent operations.
• Execute independent operations in parallel.

X = A + B + C + D

    -- Parallel version: A + B and C + D are independent
    process (clk, rst) begin
        if rising_edge(clk) then
            X1 := A + B;
            X2 := C + D;
            X  <= X1 + X2;
        end if;
    end process;

Critical path delay: 2 adders

    -- Serial version: each addition waits for the previous one
    process (clk, rst) begin
        if rising_edge(clk) then
            X1 := A + B;
            X2 := X1 + C;
            X  <= X2 + D;
        end if;
    end process;

Critical path delay: 3 adders
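The parallel version above omits the declarations of the variables X1 and X2. A minimal complete sketch follows, assuming 8-bit unsigned inputs and a hypothetical entity name sum4_tree; neither comes from the slides. The operands are widened to 10 bits before adding so that the sum of four 8-bit values cannot overflow.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: the name and the 8-bit input width are assumptions.
entity sum4_tree is
  port (
    clk        : in  std_logic;
    a, b, c, d : in  unsigned(7 downto 0);
    x          : out unsigned(9 downto 0)   -- 4 * 255 fits in 10 bits
  );
end entity sum4_tree;

architecture rtl of sum4_tree is
begin
  process (clk)
    -- Variables model the intermediate wires between the adders.
    variable x1, x2 : unsigned(9 downto 0);
  begin
    if rising_edge(clk) then
      x1 := resize(a, 10) + resize(b, 10);  -- first level, adder 1
      x2 := resize(c, 10) + resize(d, 10);  -- first level, adder 2 (in parallel)
      x  <= x1 + x2;                        -- second level
    end if;
  end process;
end architecture rtl;
```

Because x1 and x2 do not depend on each other, the longest combinational path passes through two adders rather than three.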
Timing Optimization: Avoid Unwanted Priority

    -- The if/elsif chain implies priority: c(0) overrides c(1), and so on.
    process (clk, rst) begin
        if rising_edge(clk) then
            if c(0) = '1' then
                r(0) <= din;
            elsif c(1) = '1' then
                r(1) <= din;
            elsif c(2) = '1' then
                r(2) <= din;
            elsif c(3) = '1' then
                r(3) <= din;
            end if;
        end if;
    end process;

Critical path delay: 3-input AND gate + 4x1 MUX.
Timing Optimization: Avoid Unwanted Priority
[Figure: logic for the if/elsif version]
Critical path delay: 3-input AND gate + 4x1 MUX.
Timing Optimization: Avoid Unwanted Priority

    -- Independent if statements: no control bit depends on another.
    process (clk, rst) begin
        if rising_edge(clk) then
            if c(0) = '1' then
                r(0) <= din;
            end if;
            if c(1) = '1' then
                r(1) <= din;
            end if;
            if c(2) = '1' then
                r(2) <= din;
            end if;
            if c(3) = '1' then
                r(3) <= din;
            end if;
        end if;
    end process;

Critical path delay: 2x1 MUX
Timing Optimization: Avoid Unwanted Priority
[Figure: logic for the version with independent if statements]
Critical path delay: 2x1 MUX
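The priority-free version can also be written as a loop over the control bits. This is a sketch under assumptions: the entity name reg_write_flat, the 4-bit width, and the internal signal r_reg are illustrative and not taken from the slides. The two styles behave identically only when at most one bit of c is set at a time; if several bits are set, this version writes every selected register, whereas the if/elsif chain writes only the lowest-indexed one.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: names and widths are assumptions.
entity reg_write_flat is
  port (
    clk : in  std_logic;
    c   : in  std_logic_vector(3 downto 0);   -- per-bit write enables
    din : in  std_logic;
    r   : out std_logic_vector(3 downto 0)
  );
end entity reg_write_flat;

architecture rtl of reg_write_flat is
  signal r_reg : std_logic_vector(3 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Each bit is guarded only by its own enable, so each flip-flop
      -- sees a single 2x1 selection between its old value and din.
      for i in c'range loop
        if c(i) = '1' then
          r_reg(i) <= din;
        end if;
      end loop;
    end if;
  end process;

  r <= r_reg;
end architecture rtl;
```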
Timing Optimization: Register Balancing
• The maximal clock frequency is determined by the longest path delay through any combinational logic block.
[Figure: two versions of a din-to-dout path through block 1 and block 2]
Timing Optimization: Register Balancing

    -- Unbalanced: both additions sit between the input registers and sum
    process (clk, rst) begin
        if rising_edge(clk) then
            rA  <= A;
            rB  <= B;
            rC  <= C;
            sum <= rA + rB + rC;
        end if;
    end process;

    -- Balanced: one addition is moved in front of the first register stage
    process (clk, rst) begin
        if rising_edge(clk) then
            sumAB <= A + B;
            rC    <= C;
            sum   <= sumAB + rC;
        end if;
    end process;
Timing Optimization: Register Balancing

    process (clk, rst) begin
        if rising_edge(clk) then
            rA  <= A;
            rB  <= B;
            rC  <= C;
            sum <= rA + rB + rC;
        end if;
    end process;
Timing Optimization: Register Balancing

    process (clk, rst) begin
        if rising_edge(clk) then
            sumAB <= A + B;
            rC    <= C;
            sum   <= sumAB + rC;
        end if;
    end process;
Optimization for Area
Area Optimization: Resource Sharing
• Rolling up the pipeline: share common resources at different times, a form of temporal computing.
[Figure: pipeline from din through stage 1, stage 2, …, stage n to dout, rolled up into a single din-to-dout block that includes all logic in stages 1 to n]
Area Optimization: Resource Sharing
• Use registers to hold inputs.
• Develop an FSM to select which inputs to process in each cycle.

X = A + B + C + D
[Figure: adder datapath computing X from A + B and C + D]
Area Optimization: Resource Sharing
• Use registers to hold inputs.
• Develop an FSM to select which inputs to process in each cycle.

X = A + B + C + D
[Figure: adder datapaths for X with inputs A, B, C, D and a control block]
A, B, C, D need to hold steady until X is processed.
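One way to realize the FSM-controlled sharing described above is sketched below. It is an illustration under assumptions, not the design shown in the figure: the entity name sum4_shared, the start/done handshake, the state names, and the 8-bit input width are all hypothetical. A single adder is reused over three cycles, with the FSM state steering the operand multiplexers.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: names, handshake, and widths are assumptions.
entity sum4_shared is
  port (
    clk, start : in  std_logic;
    a, b, c, d : in  unsigned(7 downto 0);   -- must stay stable until done
    x          : out unsigned(9 downto 0);
    done       : out std_logic
  );
end entity sum4_shared;

architecture rtl of sum4_shared is
  type state_t is (idle, add_c, add_d);
  signal state        : state_t := idle;
  signal acc          : unsigned(9 downto 0) := (others => '0');
  signal op_l, op_r   : unsigned(9 downto 0);
  signal adder_out    : unsigned(9 downto 0);
begin
  -- Operand multiplexers, steered by the FSM state.
  op_l <= resize(a, 10) when state = idle else acc;
  with state select
    op_r <= resize(b, 10) when idle,
            resize(c, 10) when add_c,
            resize(d, 10) when add_d;

  -- The single shared adder.
  adder_out <= op_l + op_r;

  process (clk)
  begin
    if rising_edge(clk) then
      done <= '0';
      case state is
        when idle =>
          if start = '1' then
            acc   <= adder_out;   -- a + b
            state <= add_c;
          end if;
        when add_c =>
          acc   <= adder_out;     -- (a + b) + c
          state <= add_d;
        when add_d =>
          acc   <= adder_out;     -- (a + b + c) + d
          done  <= '1';
          state <= idle;
      end case;
    end if;
  end process;

  x <= acc;
end architecture rtl;
```

The trade-off is the one the slides describe: one adder instead of three, at the cost of three cycles per result and the requirement that the inputs stay stable while the sum is accumulated.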
Area Optimization: Resource Sharing
Merge duplicate components together.
[Figure]
Area Optimization: Resource Sharing
Merge duplicate components together: this removes one 8-bit counter.
[Figure]