Digital Circuits and Systems: Pipelining (and Verilog), Shankar Balachandran

SLIDE 1

Shankar Balachandran* Associate Professor, CSE Department Indian Institute of Technology Madras

*Currently a Visiting Professor at IIT Bombay

Digital Circuits and Systems

Spring 2015 Week 8 Module 47

Pipelining (and Verilog)

SLIDE 2

Dataflow Modeling

• GCD algorithm
  • No abstract constructs (for loops) were used
  • Loops were unrolled
  • Basic computing structure was identified
  • The sequence in which data was supplied and written back was handled by a separate control (a state machine)
• Machine had a distinct "Control Path" and a "Data Path"
• Widely known by the name Register Transfer Level Design, RTL for short

SLIDE 3

Characteristics of RTL Design

• Good balance of abstraction vs structure
• Wires and regs are declared, representing connectivity in the circuit
• Verilog statements imply datapath and registers
• Multiplexers and buses are identified
• Clocking mechanism for registers is identified
• Register widths are identified

SLIDE 4

Dataflow Example

input [3:0] a, b;
input [7:0] c;
wire  [7:0] d;

assign d = a*b + c;

• a, b and c arrive at the same time
• Purely combinational

[Figure: a and b feed a multiplier; the product and c feed an adder that drives d]
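As a minimal sketch, the fragment above can be wrapped in a module and exercised with a small testbench. The module names mac_comb and mac_comb_tb and the stimulus values are assumptions made for illustration; they are not on the slide. The output keeps the 8-bit width declared on the slide, so a carry out of the sum is dropped.

```verilog
// Sketch only: module and testbench names are hypothetical.
module mac_comb (
    input  wire [3:0] a, b,
    input  wire [7:0] c,
    output wire [7:0] d
);
    // One multiplier feeding one adder, no registers in between.
    // d is 8 bits wide as on the slide, so the sum wraps modulo 256.
    assign d = a * b + c;
endmodule

module mac_comb_tb;
    reg  [3:0] a, b;
    reg  [7:0] c;
    wire [7:0] d;

    mac_comb dut (.a(a), .b(b), .c(c), .d(d));

    initial begin
        a = 4'd3;  b = 4'd5;  c = 8'd10;
        #1 $display("d = %0d", d);   // 3*5 + 10 = 25
        a = 4'd15; b = 4'd15; c = 8'd100;
        #1 $display("d = %0d", d);   // 225 + 100 = 325, wraps to 69
        $finish;
    end
endmodule
```

The second stimulus shows why the width of d may deserve a ninth bit in a real design.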

SLIDE 5

Registered Output - Blocking

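always @(a, b) begin
    ab = a * b;        // combinational product, blocking assignment
end

always @(posedge clk)
    d <= ab + c;       // sum is registered on the rising clock edge

[Figure: a and b feed the multiplier; the product and c feed the adder; the adder output is registered into d on CLK]

Equivalent to d = a[i]*b[i] + c[i];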

SLIDE 6

Implications

• Addition and multiplication operations are cascaded
• The maximum delay through the combinational logic is TADD+TMULT
• After the delay the register can latch the data
• Meanwhile the input must remain unchanged
• The next input can be given only after the delay TADD+TMULT, so the clock period must be at least the sum of the two delays
• The operation takes one clock cycle and you can perform one operation every clock cycle

[Figure: multiplier and adder cascaded, with d registered on CLK]

SLIDE 7

Model with Nonblocking

always @(posedge clk) begin
    d <= a*b + c;
end

  • Infers the same hardware as the previous one
SLIDE 8

Model with Nonblocking (2)

always @(posedge clk) begin
    ab <= a * b;
    d  <= ab + c;
end

SLIDE 9

Hardware Inference

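[Figure: a and b feed the multiplier, whose output is registered in ab; ab and c feed the adder, whose output is registered in d]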

SLIDE 10

Why?

• Register for ab
  • Assigned inside a clocked always block (always @(posedge clk))
• Register for d
  • Also assigned inside a clocked always block

SLIDE 11

Problem with the Model

• Multiplier works on current a and b
  • The result will be available only after one clock cycle
• Adder works on current c and previous ab
• The equivalent C code:

d = a[i-1]*b[i-1] + c[i];
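The mismatch can be observed in simulation. The following is a sketch under stated assumptions: the module names mac_skewed and mac_skewed_tb, the port widths and the stimulus values are hypothetical. Two data sets are applied on consecutive clock edges, and the value printed combines the product from the first set with c from the second.

```verilog
// Sketch only: names, widths and stimulus are hypothetical.
module mac_skewed (
    input  wire       clk,
    input  wire [3:0] a, b,
    input  wire [7:0] c,
    output reg  [7:0] d
);
    reg [7:0] ab;

    always @(posedge clk) begin
        ab <= a * b;     // product of the current a and b
        d  <= ab + c;    // uses the ab stored on the previous edge
    end
endmodule

module mac_skewed_tb;
    reg        clk = 0;
    reg  [3:0] a, b;
    reg  [7:0] c;
    wire [7:0] d;

    mac_skewed dut (.clk(clk), .a(a), .b(b), .c(c), .d(d));

    always #5 clk = ~clk;

    initial begin
        a = 2; b = 3; c = 1;        // data set 0
        @(posedge clk); #1;
        a = 4; b = 5; c = 100;      // data set 1
        @(posedge clk); #1;
        // d = a[0]*b[0] + c[1] = 6 + 100, not 4*5 + 100
        $display("d = %0d", d);     // prints d = 106
        $finish;
    end
endmodule
```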

SLIDE 12

From Simulation Point of View

• ab is assigned with a nonblocking assignment
  • Its update is deferred until the right-hand sides in the current time step have been evaluated
• d uses the value of ab
  • Value of ab is not updated immediately
  • Reg ab has memory
  • Thus the previous value is used
• Simulation and synthesis are consistent
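To make the deferred update concrete, the sketch below places the same two statements in two clocked blocks, once with nonblocking and once with blocking assignments. All names (nb_vs_blocking, d_nb, d_blk and so on) are hypothetical, and the blocking variant is included only for contrast; reading a blocking-assigned reg from another clocked block can cause simulation races.

```verilog
// Sketch only: names, widths and stimulus are hypothetical.
module nb_vs_blocking (
    input  wire       clk,
    input  wire [3:0] a, b,
    input  wire [7:0] c,
    output reg  [7:0] d_nb,    // result with nonblocking assignments
    output reg  [7:0] d_blk    // result with blocking assignments
);
    reg [7:0] ab_nb, ab_blk;

    always @(posedge clk) begin
        ab_nb <= a * b;        // update deferred to the end of the time step
        d_nb  <= ab_nb + c;    // reads the old ab_nb
    end

    always @(posedge clk) begin
        ab_blk = a * b;        // updated immediately
        d_blk  = ab_blk + c;   // reads the new ab_blk
    end
endmodule

module nb_vs_blocking_tb;
    reg        clk = 0;
    reg  [3:0] a, b;
    reg  [7:0] c;
    wire [7:0] d_nb, d_blk;

    nb_vs_blocking dut (.clk(clk), .a(a), .b(b), .c(c),
                        .d_nb(d_nb), .d_blk(d_blk));

    always #5 clk = ~clk;

    initial begin
        a = 2; b = 3; c = 1;
        @(posedge clk); #1;
        a = 4; b = 5; c = 100;
        @(posedge clk); #1;
        $display("d_nb = %0d, d_blk = %0d", d_nb, d_blk);
        // prints d_nb = 106, d_blk = 120
        $finish;
    end
endmodule
```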

SLIDE 13

Another Verilog Model

always @(posedge clk) begin
    ab   <= a * b;
    ctmp <= c;
    d    <= ab + ctmp;
end

SLIDE 14

Hardware Inferred

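[Figure: a and b feed the multiplier, registered in ab; c is registered in ctmp; ab and ctmp feed the adder, registered in d]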

SLIDE 15

Analysis of the Model

• New reg ctmp copies c
• All the regs ab, ctmp and d get a register
• When ab is computed, c is just copied to ctmp
• Adder always looks at the previous value of ab and ctmp (previous data)
• All data inputs pass through the same number of registers, hence the results are consistent
• Equivalent C code:

d = a[i-1]*b[i-1] + c[i-1];
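A minimal sketch of the balanced model follows, with a testbench that applies two data sets and prints the two results as they emerge, each one clock edge after its inputs were captured. The names mac_pipe and mac_pipe_tb, the port widths and the stimulus are assumptions made for illustration.

```verilog
// Sketch only: names, widths and stimulus are hypothetical.
module mac_pipe (
    input  wire       clk,
    input  wire [3:0] a, b,
    input  wire [7:0] c,
    output reg  [7:0] d
);
    reg [7:0] ab, ctmp;

    always @(posedge clk) begin
        ab   <= a * b;       // stage 1: multiply
        ctmp <= c;           // stage 1: delay c so it stays aligned with ab
        d    <= ab + ctmp;   // stage 2: add the values captured one edge earlier
    end
endmodule

module mac_pipe_tb;
    reg        clk = 0;
    reg  [3:0] a, b;
    reg  [7:0] c;
    wire [7:0] d;

    mac_pipe dut (.clk(clk), .a(a), .b(b), .c(c), .d(d));

    always #5 clk = ~clk;

    initial begin
        a = 2; b = 3; c = 1;        // data set 0
        @(posedge clk); #1;
        a = 4; b = 5; c = 100;      // data set 1
        @(posedge clk); #1;
        $display("d = %0d", d);     // data set 0: 2*3 + 1, prints d = 7
        @(posedge clk); #1;
        $display("d = %0d", d);     // data set 1: 4*5 + 100, prints d = 120
        $finish;
    end
endmodule
```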

SLIDE 16

From Simulation Point of View

• ab is assigned only at the end of the time step
• ctmp is also assigned only at the end of the time step
• Both ab and ctmp are regs and thus retain the old value
• d looks at the values of ab and ctmp from the previous assignment
• Consistent with the synthesis model

SLIDE 17

More Analysis

• Unlike the model with blocking assignments, results are not available immediately. They are delayed by one clock cycle.
• The clock period can now be max(TADD,TMULT) instead of TADD+TMULT
  • Faster clock
• You can supply data once every clock cycle
• You get the results once every clock cycle (except for the very first data)
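A worked example may help here. The delay values below are assumed purely for illustration, and register setup and clock-to-output times are ignored. The pipelined count uses N + 1 cycles because one extra cycle is needed to fill the pipeline before the first result appears.

```latex
% Assumed delays, for illustration only
\begin{align*}
T_{MULT} &= 6\,\text{ns}, \qquad T_{ADD} = 4\,\text{ns} \\
% Minimum clock period
T_{clk}^{\text{unpipelined}} &\ge T_{ADD} + T_{MULT} = 10\,\text{ns} \\
T_{clk}^{\text{pipelined}} &\ge \max(T_{ADD}, T_{MULT}) = 6\,\text{ns} \\
% Time to process N = 100 data sets
t_{\text{unpipelined}} &= N \cdot 10\,\text{ns} = 1000\,\text{ns} \\
t_{\text{pipelined}} &= (N + 1) \cdot 6\,\text{ns} = 606\,\text{ns}
\end{align*}
```

With these numbers the speedup approaches 10/6, about 1.67, as N grows. The gain is largest when the stage delays are equal.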

SLIDE 18

Pipelining

• Note that when the multiplier is working on the current set, the adder is evaluating the result from the previous set
• Thus, the datapath elements are working in tandem. This is called pipelining
  • Data marches through the operations at the command of a clock
• Pipelining is facilitated by many small combinational blocks which work in tandem and the registers between them which separate the data sets

SLIDE 19

Illustration of a Pipelined System

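[Figure: original version, blocks with delays TA and TB cascaded between registers, clock period TA+TB; pipelined version, a register inserted between the two blocks, clock period max(TA,TB)]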

SLIDE 20

Discussion on Pipelined Systems

• Better delay
  • Clocks can be made faster because the critical path for computation is reduced
  • Faster pipeline clocks can be used with slower system clocks to achieve unit cycle operations
• Latency is the cost of using the pipeline
  • Results are available only after several clock cycles, one per pipeline stage
• More latches in the pipelined system than in the original one
• Parallel processing is another alternative to achieve the same thing
  • At the expense of huge amounts of hardware

SLIDE 21

Implications of Latency and Throughput

• Latency is an important factor in microprocessors etc.
  • Most of the operations need to be completed within one clock cycle, with results immediately available
  • Control is simpler because only one data set is current at any time
• Throughput is more important in DSP applications
  • Real-time data need to be acquired and processed
  • Latency is not an issue

SLIDE 22

Example of Pipelining - Convolution

• Popular in DSP
• Definition:

$c = \sum_{i}^{\mathrm{width}} a[i] \cdot b[i]$

a: the set of coefficients for convolution
b: sample set
c: result
width: sample window size

• The sample set b is a moving window and can be arriving in real time

SLIDE 23

Regular Implementation

[Figure: coefficients a[0], a[1], a[2], a[3] combined with the samples from B to produce C; combinational delay TA+TB]

SLIDE 24

Pipelined Implementation

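always @(posedge clk) begin
    ab   <= a * b;        // stage 1: multiply the current coefficient and sample
    ctmp <= ab + ctmp;    // stage 2: accumulate the previous product
    c    <= ctmp;         // moved inside the block: a nonblocking assignment must be in a procedural block
end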

SLIDE 25

Implied Hardware

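[Figure: a circular buffer holding samples supplies B; A and B feed the multiplier, registered in AB; AB and CTMP feed the adder, registered back into CTMP; CTMP drives C]

Equivalent C code:

c = c + a[i]*b[i];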
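The accumulator needs a defined starting value before the running sum means anything in simulation. The sketch below adds a synchronous clear for that purpose. The clr input, the module names conv_mac and conv_mac_tb, the 12-bit accumulator width and the stimulus are all assumptions, not part of the slide. The circular buffer is not modelled; the testbench simply feeds four coefficient and sample pairs and then two zero pairs to flush the pipeline.

```verilog
// Sketch only: clr, the names and the widths are hypothetical.
module conv_mac (
    input  wire        clk,
    input  wire        clr,     // synchronous clear, not on the slide
    input  wire [3:0]  a, b,
    output reg  [11:0] c
);
    reg [7:0]  ab;
    reg [11:0] ctmp;

    always @(posedge clk) begin
        if (clr) begin
            ab   <= 0;
            ctmp <= 0;
            c    <= 0;
        end else begin
            ab   <= a * b;        // stage 1: multiply
            ctmp <= ab + ctmp;    // stage 2: accumulate
            c    <= ctmp;         // registered copy of the running sum
        end
    end
endmodule

module conv_mac_tb;
    reg         clk = 0;
    reg         clr;
    reg  [3:0]  a, b;
    wire [11:0] c;
    integer     i;

    conv_mac dut (.clk(clk), .clr(clr), .a(a), .b(b), .c(c));

    always #5 clk = ~clk;

    initial begin
        clr = 1; a = 0; b = 0;
        @(posedge clk); #1;
        clr = 0;
        for (i = 1; i <= 4; i = i + 1) begin
            a = i; b = i + 1;         // pairs (1,2) (2,3) (3,4) (4,5)
            @(posedge clk); #1;
        end
        a = 0; b = 0;                 // flush the two remaining stages
        @(posedge clk); #1;
        @(posedge clk); #1;
        $display("c = %0d", c);       // 2 + 6 + 12 + 20, prints c = 40
        $finish;
    end
endmodule
```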

SLIDE 26

End of Week 8: Module 47

Thank You
