

SLIDE 1

VLSI programming Systolic Design

Book Parhi, Chp. 7
Rudolf Mak, r.h.mak@tue.nl
18-May-16, TU/e Computer Science

SLIDE 2

Agenda

  • Systolic arrays (what, where)
  • Regular Iterative Algorithms (RIAs)
  • Dependence graphs (regular, reduced)
  • Systolic design techniques
    – Binding (computations to PEs)
    – Scheduling (computations to time slots)
  • Examples
    – FIR filters, matrix multipliers

SLIDE 3

FSM reminder

(figure: Moore machine and Mealy machine, each drawn as combinational logic (CL) plus a state register)

Chaining Mealy machines may lead to too long critical paths!

SLIDE 4

Systolic system (Leiserson)

A systolic system is a set of interconnected Moore machines that operate synchronously and satisfy certain smallness (boundedness) conditions:

1. # states is bounded
2. # input ports is bounded
3. # output ports is bounded
4. # neighbor machines is bounded

("#" stands for "number of")
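The lock-step operation of such a set of Moore machines can be sketched in a few lines (the class, the toy transition function and the chain topology are my own illustration, not part of the deck):

```python
from dataclasses import dataclass

@dataclass
class MoorePE:
    """A Moore-machine PE: its output depends on the stored state only,
    never directly on the current inputs."""
    state: int = 0

    def output(self) -> int:
        return self.state            # Moore: output = f(state)

    def next_state(self, inp: int) -> int:
        return inp                   # toy transition: latch the input

def step(pes, external_in):
    """One synchronous clock tick for a linear chain of PEs: sample all
    outputs first, then update all states in lock step."""
    outs = [pe.output() for pe in pes]
    ins = [external_in] + outs[:-1]  # each PE feeds its right neighbor
    for pe, x in zip(pes, ins):
        pe.state = pe.next_state(x)
    return outs[-1]                  # output of the last PE in the chain

chain = [MoorePE() for _ in range(3)]
trace = [step(chain, x) for x in [7, 8, 9, 0, 0, 0]]
# each PE acts as one D-element, so an input emerges after 3 ticks
```

Because outputs are sampled before any state updates, chaining the machines cannot lengthen a combinational path.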

SLIDE 5

Systolic = Uniform Pipelined SDF

  • Uniform:
    – Each PE (Moore machine) computes the same set of combinatorial functions.
  • Regular:
    – All PEs are connected to a small finite number of neighboring PEs via one or more D-elements according to a regular topology. All connections are point-to-point connections.
  • Synchronous operation:
    – All PEs operate in lock step (fire concurrently); data is pumped through the system, much like the heart pumps blood through the body (hence the name systolic).

SLIDE 6

Relaxations

To obtain better systems, small relaxations to the systolic model are allowed:

1. Not all PEs are identical; small deviations are allowed, especially for PEs at the border of the system.
2. (A limited form of) broadcasting is allowed. This means that PEs have become Mealy machines.
   • These systems are called semi-systolic by Leiserson.
   • Parhi does not make the distinction. Instead he uses the notion fully pipelined for the Moore machine variant.
3. Connections need not be to nearest neighbors, but locality needs to be maintained.

SLIDE 7

Systolic system

(figure: a host connected to a linear array of PEs)

  • Systolic array: Moore machines, such as a dedicated computing engine on a FPGA.
  • Host: a Turing-equivalent machine, such as a PowerPC on a FPGA.

SLIDE 8

Application areas

  • Computationally intensive, regular
    – Basic linear algebra operations
    – Signal processing
    – Image processing
    – Order statistics, sorting
    – Dynamic programming
    – High performance computing
      • e.g., many-particle simulations (in chemistry, physics or astronomy)

SLIDE 9

FIR filter (N-tap)

Spec:

  y(n) = Σ_{k=0}^{N−1} h(k)·x(n−k),   n ≥ 0

RIA: a direct, one-dimensional translation of this spec does not work!!! The dependence of y(n) on x(n−k) has a displacement that varies with k, so it cannot be captured by constant index displacement vectors; a two-dimensional index space is needed.
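The spec itself can be stated executably as a reference; a minimal sketch (the function name and the zero-extension of x for n − k < 0 are my own choices):

```python
def fir(h, x):
    """Direct evaluation of the N-tap FIR spec
    y(n) = sum_{k=0}^{N-1} h(k) * x(n-k), with x(m) = 0 for m < 0."""
    N = len(h)
    return [sum(h[k] * x[n - k] for k in range(N) if n - k >= 0)
            for n in range(len(x))]

# 3-tap example with weights h0, h1, h2 = 1, 2, 3:
y = fir([1, 2, 3], [1, 1, 1, 1])
# y = [1, 3, 6, 6]
```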

SLIDE 10

Regular Iterative Algorithm

A RIA is a triple consisting of

1. An index space
2. A finite set of variables
3. A set of direct dependencies among indexed variables (given as equalities)
   • with associated index displacement vectors
   • also called fundamental edges by Parhi

Canonical forms:

1. Standard input
2. Standard output

Example (FIR): index space { (i, j) | 0 ≤ i, 0 ≤ j < N }, where x(i, 0) is input.

SLIDE 11

FIR-filter: RIA description

Standard output canonical form:

  h(i, j) = h(i−1, j),   h(0, j) = h(j)
  x(i, j) = x(i, j−1),   x(i, 0) = x(i)
  y(i, j) = y(i−1, j+1) + h(i, j)·x(i, j),   with y(i−1, N) taken as 0; output y(i) = y(i, 0)

Index displacement vectors:

  e_x = (0, 1),  e_h = (1, 0),  (0, 0) and (0, 0) for the h- and x-operands of the y-equation,  e_y = (1, −1)

LHS index = RHS index + IDV
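Assuming the recurrences above, the RIA can be evaluated node by node over the 2-D index space and checked against the direct spec (the dictionary-based evaluator is my own sketch):

```python
def fir_ria(h, x):
    """Evaluate the FIR RIA in standard output form over index space
    (i, j), 0 <= j < N (time index i, tap index j):
        h(i, j) = h(j),  x(i, j) = x(i)   (propagated constants)
        y(i, j) = y(i-1, j+1) + h(i, j) * x(i, j)
    with y(., N) taken as 0; the output stream is y(i) = y(i, 0)."""
    N, T = len(h), len(x)
    y = {}
    for i in range(T):
        for j in range(N - 1, -1, -1):
            prev = y.get((i - 1, j + 1), 0)   # 0 beyond the index space
            y[(i, j)] = prev + h[j] * x[i]
    return [y[(i, 0)] for i in range(T)]

# same result as the direct spec:
# fir_ria([1, 2, 3], [1, 1, 1, 1]) == [1, 3, 6, 6]
```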

SLIDE 12

Computational node

A node g consumes a tuple of inputs (a, b, c) and produces a tuple of outputs

  f1(a, b, c), f2(a, b, c), f3(a, b, c)

(figure: node g with three incoming and three outgoing edges)

SLIDE 13

Computational node from RIA

For the FIR RIA, the node g at index (i, j) consumes h(i−1, j), x(i, j−1) and y(i−1, j+1), and produces h(i, j), x(i, j) and y(i, j).

I(g) is the index vector, i.e., the sequence of coordinates of g in index-space.

SLIDE 14

Dependence graphs

1. The nodes of a dependence graph represent (small) computations. There is a separate node for each computation.
2. The edges of a dependence graph represent causal dependencies between computations, i.e., an edge from node u to node v indicates that the result of the computation performed by u is used in the computation performed by v.
3. There is no notion of time in a dependence graph. It is an (index-)space representation.

SLIDE 15

FIR: Dependence graph

(figure: dependence graph with inputs x(0)…x(4), outputs y(0)…y(4) and weights h(0), h(1), h(2))
SLIDE 17

Regular dependence graphs

A dependence graph G is regular when:

1. There is an injective mapping I from the nodes of G to a grid of points in the n-dimensional index space.
2. There exists a finite set E of vectors, called fundamental edges, such that every pair (u, v) of neighboring nodes is mapped to a pair of grid locations that differ by a fundamental edge e ∈ E, i.e., I(v) = I(u) + e.

SLIDE 18

FIR: DG in space representation

(figure: the dependence graph drawn on the grid with fundamental edges (1,−1), (1,0), (0,1))

  E = ( e_h | e_x | e_y ) = ( 1  0   1 )
                            ( 0  1  −1 )    fundamental edges

SLIDE 19

Systolic array design

The design of a systolic array for a computation given in the form of a regular dependence graph involves:

1. Choosing a processor space, i.e., a set of dimensions and a number of PEs per dimension (the array).
2. Mapping each computational node of the graph to a PE of the array.
3. For each PE scheduling the computations of the nodes mapped onto it, i.e., assigning each individual computation to a distinct time slot.

Similar to folding

SLIDE 20

Design parameters

An (n−1)-dimensional systolic design for an n-dimensional regular dependence graph is characterized by:

1. An (n−1) × n processor space matrix P:
   – P·I(x) is the processor that executes node x
2. An n-dimensional scheduling vector s:
   – s^T·I(x) is the time slot at which node x is executed
3. A projection (iteration) vector d:
   – I(x) − I(y) ∝ d implies P·I(x) = P·I(y)

SLIDE 21

Design constraints

  • Computations whose grid locations differ by a multiple of the projection vector execute on the same PE
    – I(x) − I(y) ∝ d implies P·I(x) = P·I(y)
    – hence P·d = 0
  • Computations that execute on the same PE must be scheduled in different time slots
    – s^T·I(x) is the time slot at which node x is executed
    – hence s^T·d ≠ 0
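Both constraints can be checked mechanically for any candidate (P, s, d); a small sketch (helper names are mine):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def check_design(P, s, d):
    """Systolic design constraints:
    P*d = 0    (nodes projected along d land on the same PE),
    s^T*d != 0 (those nodes get distinct time slots)."""
    return all(dot(row, d) == 0 for row in P) and dot(s, d) != 0

# B1 design from the deck: s = (1, 0), d = (1, 0), p = (0, 1)
assert check_design([[0, 1]], [1, 0], [1, 0])
# illegal: a schedule orthogonal to the projection vector
assert not check_design([[0, 1]], [0, 1], [1, 0])
```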

SLIDE 22

Processor allocation:

(figure: FIR dependence graph projected along d; each column of nodes maps to one processor)

  d^T = (1, 0),   p^T = (0, 1)

SLIDE 23

Scheduling:

(figure: FIR dependence graph with time slots 1, 2, 3, 4; all nodes in a column share a slot)

  d^T = (1, 0),   s^T = (1, 0)

SLIDE 24

Hardware Utilization Efficiency (HUE)

Let x and y be computations with index vectors I(x), I(y) that are executed on the same PE.

  • Then I(x) − I(y) = k·d for some integer k ≠ 0.

Let t_x be the time at which x is scheduled and t_y be the time at which y is scheduled.

  • Then t_x − t_y = s^T·(I(x) − I(y)) = k·s^T·d, so |t_x − t_y| ≥ |s^T·d|.

Hence, any PE executes at most 1 computation per |s^T·d| time slots. So

  HUE = 1 / |s^T·d|

Question: what do we call s^T·d?
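The HUE of the designs appearing later in the deck follows directly from this formula:

```python
def hue(s, d):
    """HUE = 1 / |s^T d|: each PE performs at most one computation
    per |s^T d| time slots."""
    std = sum(a * b for a, b in zip(s, d))
    assert std != 0, "invalid design: s^T d must be nonzero"
    return 1 / abs(std)

assert hue([1, 0], [1, 0]) == 1            # B1: fully utilized
assert hue([1, -1], [1, -1]) == 1 / 2      # R1: 2-slow
assert hue([1, 1, 1], [1, 1, 1]) == 1 / 3  # Kung-Leiserson: 3-slow
```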

SLIDE 25

From DG to systolic array

Map a DG onto a systolic array as follows:

  • Nodes:
    – map node x to processing element P·I(x)
  • Edges:
    – map edge x → y to connection P·I(x) → P·I(y)
    – insert s^T·e D-elements in this edge, where e = I(y) − I(x) is a fundamental edge

Note that there are only finitely many fundamental grid edges (independent of the size of the DG), and recall that each edge is a translation of a fundamental edge.
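Since only the finitely many fundamental edges matter, the whole array layout follows from a small table; a sketch for the B1 design (the edge names follow the deck, the code is mine):

```python
def edge_map(P, s, edges):
    """For each fundamental edge e: the PE-to-PE displacement P*e and
    the number of D-elements s^T*e inserted on that connection."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return {name: (tuple(dot(row, e) for row in P), dot(s, e))
            for name, e in edges.items()}

# B1: P = (0, 1), s = (1, 0)
m = edge_map([[0, 1]], [1, 0],
             {"e_h": (1, 0), "e_x": (0, 1), "e_y": (1, -1)})
# e_h stays on one PE with 1 delay, e_x hops one PE with 0 delays
# (broadcast), e_y hops one PE backwards with 1 delay
```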

SLIDE 26

B1: H-stay, X-broadcast, Y-move

  e         p^T·e   s^T·e
  (1, 0)      0       1
  (0, 1)      1       0
  (1, −1)    −1       1

  d^T = (1, 0),   p^T = (0, 1),   s^T = (1, 0)

(figure: linear array of PEs; the h-values stay in place, x is broadcast to all PEs, and y moves one PE per clock tick)
SLIDE 27

B1: H-stay, X-broadcast, Y-move

HUE = 1 / |s^T·d| = 1

(figure: three-PE array with weights h0, h1, h2, broadcast input x(i), internal streams u(i), v(i) and output y(i))

  y(i) = h0·x(i) + v(i−1)
  v(i) = h1·x(i) + u(i−1)
  u(i) = h2·x(i) + 0

  y(i) = h0·x(i) + h1·x(i−1) + h2·x(i−2)
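A cycle-accurate sketch of these three equations (the simulation loop is mine) reproduces the direct FIR output:

```python
def b1_fir(h, xs):
    """Simulate the 3-tap B1 array: x(i) is broadcast to all PEs in the
    same tick; the partial sums u, v live in D-elements between PEs."""
    h0, h1, h2 = h
    u = v = 0                      # D-element contents, initially 0
    ys = []
    for x in xs:                   # one clock tick per input sample
        ys.append(h0 * x + v)      # y(i) = h0*x(i) + v(i-1)
        u, v = h2 * x, h1 * x + u  # u(i) = h2*x(i); v(i) = h1*x(i) + u(i-1)
    return ys

# y(i) = h0*x(i) + h1*x(i-1) + h2*x(i-2)
assert b1_fir([1, 2, 3], [1, 1, 1, 1]) == [1, 3, 6, 6]
```

The simultaneous assignment to u and v mirrors the fact that all D-elements update on the same clock edge.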

SLIDE 28

Determining P, s, and d

  • Trial-and-error approach
    – Pick a combination and check whether the design constraints are fulfilled.
  • Constructive approach
    1. Determine a schedule s.
    2. Determine a projection vector d such that s^T·d ≠ 0.
    3. Let Q = (s^T·d)·I − d·s^T. Then Q is a matrix of rank n−1 such that Q·d = 0. By sweeping, a zero row can be created in Q. Drop this row to obtain an (n−1) × n matrix P.
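Step 3 can be carried out mechanically; a sketch using exact rational arithmetic (the function name and elimination routine are mine):

```python
from fractions import Fraction

def projection_matrix(s, d):
    """Step 3: Q = (s^T d)*I - d*s^T satisfies Q*d = 0 and has rank n-1.
    Sweeping (Gaussian elimination) leaves one zero row; dropping it
    yields an (n-1) x n processor matrix P with P*d = 0."""
    n = len(s)
    std = sum(a * b for a, b in zip(s, d))
    assert std != 0, "need s^T d != 0"
    Q = [[Fraction(std * (i == j) - d[i] * s[j]) for j in range(n)]
         for i in range(n)]
    r = 0
    for c in range(n):                      # forward sweep over columns
        piv = next((i for i in range(r, n) if Q[i][c] != 0), None)
        if piv is None:
            continue
        Q[r], Q[piv] = Q[piv], Q[r]
        for i in range(n):
            if i != r and Q[i][c] != 0:
                f = Q[i][c] / Q[r][c]
                Q[i] = [a - f * b for a, b in zip(Q[i], Q[r])]
        r += 1
    return [row for row in Q if any(row)]   # drop the zero row

# Kung-Leiserson: s = d = (1, 1, 1) gives a 2 x 3 matrix P with P*d = 0
P = projection_matrix([1, 1, 1], [1, 1, 1])
assert len(P) == 2
```

Check: Q·d = (s^T d)·d − d·(s^T d) = 0, so every row of the resulting P is orthogonal to d, as the design constraint requires.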
SLIDE 29

FIR-designs (Parhi)

        s^T       d^T       p^T      p^T·(e_h|e_x|e_y)   s^T·(e_h|e_x|e_y)
  B1    (1, 0)    (1, 0)    (0, 1)   (0, 1, −1)          (1, 0, 1)
  F     (1, 1)    (1, 0)    (0, 1)   (0, 1, −1)          (1, 1, 0)
  W1    (2, 1)    (1, 0)    (0, 1)   (0, 1, −1)          (2, 1, 1)
  W2    (1, 2)    (1, 0)    (0, 1)   (0, 1, −1)          (1, 2, −1)   e_y := −e_y
  DW2   (1, −1)   (1, 0)    (0, 1)   (0, 1, −1)          (1, −1, 2)   e_x := −e_x
  B2    (1, 0)    (1, −1)   (1, 1)   (1, 1, 0)           (1, 0, 1)
  R1    (1, −1)   (1, −1)   (1, 1)   (1, 1, 0)           (1, −1, 2)   e_x := −e_x
  R2    (2, 1)    (1, −1)   (1, 1)   (1, 1, 0)           (2, 1, 1)
  DR2   (1, 2)    (1, −1)   (1, 1)   (1, 1, 0)           (1, 2, −1)   e_y := −e_y

A negative entry in the s^T·e column means that the corresponding fundamental edge must be taken in the reverse direction.
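The last two columns of the table can be recomputed from the fundamental edges e_h = (1,0), e_x = (0,1), e_y = (1,−1); a spot check of a few rows (the verification code is mine):

```python
def verify_fir_designs():
    """Recompute the p^T*E and s^T*E columns of the FIR design table
    and check the design constraints for each row."""
    E = [(1, 0), (0, 1), (1, -1)]                  # (e_h | e_x | e_y)
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    designs = {                                    # name: (s, d, p)
        "B1": ((1, 0), (1, 0), (0, 1)),
        "F":  ((1, 1), (1, 0), (0, 1)),
        "W1": ((2, 1), (1, 0), (0, 1)),
        "R1": ((1, -1), (1, -1), (1, 1)),
    }
    out = {}
    for name, (s, d, p) in designs.items():
        assert dot(p, d) == 0 and dot(s, d) != 0   # design constraints
        out[name] = (tuple(dot(p, e) for e in E),
                     tuple(dot(s, e) for e in E))
    return out

rows = verify_fir_designs()
assert rows["B1"] == ((0, 1, -1), (1, 0, 1))
assert rows["R1"] == ((1, 1, 0), (1, -1, 2))
```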

SLIDE 30

Design R1: dependence graph

(figure: the FIR dependence graph redrawn with the x-edges reversed; inputs x(0)…x(4), outputs y(0)…y(4), weights h(0), h(1), h(2); fundamental edges (1,−1), (1,0), (0,−1))

  E = ( e_h | −e_x | e_y ) = ( 1   0   1 )
                             ( 0  −1  −1 )    fundamental edges

SLIDE 31

Space-time diagram R1

(figure: space-time diagram; lines of constant p^T·I(x) form the processor axis, lines of constant s^T·I(x) the time axis)

  d^T = (1, −1),   p^T = (1, 1),   s^T = (1, −1)

SLIDE 32

Processor allocation R1:

(figure: dependence graph partitioned along the projection direction; nodes with equal i + j map to the same processor)

  d^T = (1, −1),   p^T = (1, 1)

SLIDE 33

Scheduling R1:

(figure: dependence graph with time slots along s^T = (1, −1); successive computations on the same PE are two slots apart)

  d^T = (1, −1),   s^T = (1, −1)

SLIDE 34

R1: H-move, X-move, Y-stay

  e         p^T·e   s^T·e
  (1, 0)      1       1
  (0, −1)    −1       1
  (1, −1)     0       2

  d^T = (1, −1),   p^T = (1, 1),   s^T = (1, −1)

(figure: linear array of PEs; h and x move through the array in opposite directions, y stays in place)

SLIDE 35

R1: H-move, X-move, Y-stay

HUE = 1 / |s^T·d| = 1 / 2  (2-slow)

(figure: snapshot of the R1 array with weights h0, h1, h2; the x-stream, interleaved with don't-care slots, meets the weights while each PE accumulates one y-value)

SLIDE 36

R1: H-move, X-move, Y-stay

(figure: a later snapshot of the same R1 array)

SLIDE 37

(table: cycle-by-cycle space-time trace of the R1 array)

SLIDE 38

Matrix multiplication (N × N): RIA

  C(i, j) = Σ_{k=0}^{N−1} A(i, k)·B(k, j)

RIA (standard output form):

  a(i, j, k) = a(i, j−1, k),   a(i, 0, k) = A(i, k)
  b(i, j, k) = b(i−1, j, k),   b(0, j, k) = B(k, j)
  c(i, j, k) = c(i, j, k−1) + a(i, j, k)·b(i, j, k),   c(i, j, −1) = 0
  C(i, j) = c(i, j, N−1)
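Collapsing the RIA's propagated variables a(i,j,k) = A(i,k) and b(i,j,k) = B(k,j) gives a direct check that the recurrence computes C = A·B (the evaluator is my own sketch):

```python
def matmul_ria(A, B):
    """Evaluate the matrix-multiplication RIA over index space (i, j, k):
        c(i, j, k) = c(i, j, k-1) + a(i, j, k) * b(i, j, k)
    with c(i, j, -1) = 0 and C(i, j) = c(i, j, N-1)."""
    N = len(A)
    C = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            acc = 0                       # c(i, j, -1) = 0
            for k in range(N):
                acc += A[i][k] * B[k][j]  # a(i,j,k)=A(i,k), b(i,j,k)=B(k,j)
            C[i][j] = acc                 # C(i, j) = c(i, j, N-1)
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert matmul_ria(A, B) == [[19, 22], [43, 50]]
```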

SLIDE 39

Dependence graph for N = 3 (finite!)

(figure: 3-D dependence graph with axes i, j, k and the A-, B- and C-streams along the three edge directions)

SLIDE 40

Kung-Leiserson design

  • Scheduling vector
    – s^T = (1, 1, 1)
  • Projection vector
    – d^T = (1, 1, 1)
  • Processor space matrix
    – P = ( 1  0  −1 )
          ( 0  1  −1 )
  • HUE = 1 / 3

  e            P·e        s^T·e
  (1, 0, 0)    (1, 0)       1
  (0, 1, 0)    (0, 1)       1
  (0, 0, 1)    (−1, −1)     1
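The edge table can be recomputed from P and s (the helper function is mine); every fundamental edge ends up with exactly one D-element, which is the one-delay-per-edge property of the Kung-Leiserson array:

```python
def kl_edges():
    """Kung-Leiserson mapping: PE displacement P*e and delay count s^T*e
    for the three fundamental edges of the matrix-multiplication DG."""
    P = [[1, 0, -1], [0, 1, -1]]        # processor coords: x = i-k, y = j-k
    s = (1, 1, 1)
    edges = {"b": (1, 0, 0), "a": (0, 1, 0), "c": (0, 0, 1)}
    dot = lambda u, v: sum(p * q for p, q in zip(u, v))
    return {n: (tuple(dot(row, e) for row in P), dot(s, e))
            for n, e in edges.items()}

m = kl_edges()
assert all(delays == 1 for _, delays in m.values())  # one D-element per edge
assert m["c"] == ((-1, -1), 1)   # the C-stream moves diagonally
```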

SLIDE 41

Kung-Leiserson (3×3)-matrix multiplication systolic array

(figure: hexagonal array with processor coordinates x = i − k, y = j − k)

Delay elements not drawn: one on each edge!

SLIDE 42

KL-array processor allocation (binding)

(figure: binding of dependence-graph nodes to the hexagonal array; note the unbalanced workload)

SLIDE 43

Dependence graph for N = 3

(figure: the 3-D dependence graph with the projection direction d drawn in)

SLIDE 44

KL-array 3-slow schedule

HUE = 1/3

SLIDE 45

KL-array details

In addition to the previous slides, the following issues must be addressed:

  • For both A and B there are 5 input streams. How are the matrix values distributed over them?
  • For C there are 5 output streams. How are the resulting values distributed over them?
  • How are results that become available at an internal PE propagated to the border?
  • How to operate this array for multiple multiplications? Flushing old values can be combined with getting internal results out.

SLIDE 46

Summary

1. Systolic architectures are attractive for implementation media like VLSI circuits and FPGAs.
2. Starting point for systolic design is a RIA (or a dependence graph).
3. RIAs can be mapped to systolic arrays in a systematic fashion.
4. Mapping uses simple linear algebra techniques.
5. A large variety of designs for a single problem can be obtained.

SLIDE 47

Exercise (systolic design)

1. An OCL system is a system that counts (#), for each window of size w on its input stream, the number of times the last received value occurs in that window, i.e., for n ≥ w − 1

     y(n) = #{ i | 0 ≤ i < w : x(n − i) = x(n) },

   where x is the input stream and y the output stream.

   a) Derive a RIA (in standard output form) for this system.
   b) Draw the dependence graph of this RIA for w = 4 (you need to draw only the part with n ≤ 6).
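The spec itself (not the asked-for RIA) can be pinned down with a reference function; a sketch using the notation above (the function name is mine):

```python
def ocl(xs, w=4):
    """Reference for the OCL spec: y(n) counts how often the last
    received value x(n) occurs in the window x(n-w+1), ..., x(n).
    Defined here for n >= w-1 only."""
    return [sum(1 for i in range(w) if xs[n - i] == xs[n])
            for n in range(w - 1, len(xs))]

# e.g. for x = 1, 2, 1, 1, 3, 1 and w = 4:
assert ocl([1, 2, 1, 1, 3, 1], w=4) == [3, 1, 3]
```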

SLIDE 48

Exercise (systolic design)

2. Consider the scheduling, projection and processor vectors

     s^T = (2, 1),   d^T = (1, 0),   p^T = (0, 1)

   a) Construct the systolic array that corresponds to these vectors. You may assume the existence of a comparator operator that takes two input streams and produces an output stream of ones and zeros, for equal and unequal input pairs respectively.
   b) Determine the slowness of your design.

3. Assume that the times to perform comparison and addition are given by Tcmp = 1 ns and Tadd = 3 ns, respectively. Give the maximum throughput and the latency of your design (taking slowness into account). Give the latency both in number of delays and in real time.

SLIDE 49

Exercise (systolic design)

4. Next, replace the scheduling vector by s^T = (1, 0). Compare the throughput and latency of the resulting systolic array with that of the one with s^T = (2, 1).

5. Consider the design of 4.

   a) Eliminate redundant operators, and optimize the throughput by pipelining. Give the resulting throughput and latency.
   b) Next retime the result of a), keeping throughput and latency fixed, to obtain the minimum number of delays.