 
              1 EE 457 Unit 2c Fast Multipliers
2 Multiplication Overview
• Multiplication approaches:
– Sequential: shift-and-add produces one product bit per clock cycle (usually slow)
– Combinational: an array multiplier uses an array of adders
• Can be as simple as N-1 ripple-carry adders for an NxN multiplication
• An AND gate array produces the partial product terms:

                    m3   m2   m1   m0
               x    q3   q2   q1   q0
---------------------------------------
                    m3q0 m2q0 m1q0 m0q0
               m3q1 m2q1 m1q1 m0q1 -
          m3q2 m2q2 m1q2 m0q2 -    -
+    m3q3 m2q3 m1q3 m0q3 -    -    -
---------------------------------------
p7   p6   p5   p4   p3   p2   p1   p0
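As a rough illustration of the two ideas on this slide, the sketch below builds the shifted partial products with a bitwise AND and then accumulates them one per step, as a shift-and-add unit would. The function names and the default width n=4 are assumptions made for this example, and the operands are assumed to fit in n bits.

```python
def partial_products(m, q, n=4):
    """Return the n shifted partial products (m AND q_j) << j."""
    rows = []
    for j in range(n):
        q_j = (q >> j) & 1                 # multiplier bit j
        # ANDing every bit of m with q_j yields either m or 0
        rows.append((m if q_j else 0) << j)
    return rows


def shift_and_add(m, q, n=4):
    """Sequential multiply: accumulate one partial product per step."""
    product = 0
    for row in partial_products(m, q, n):
        product += row
    return product


print(partial_products(0b1011, 0b0110))    # [0, 22, 44, 0]
print(shift_and_add(0b1011, 0b0110))       # 66
```

The loop in shift_and_add stands in for the clock cycles of the sequential design; a combinational array performs the same additions with dedicated adders instead.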
3 Array Multiplier
[Figure: 4x4 array multiplier. Three rows of adders (FA FA FA HA in each row, with carries rippling to the left inside the row) add the partial products m·q1, m·q2 and m·q3 to the running sum and produce P[7:0]. A constant 0 feeds the leftmost adder of the first row, with the annotation "Can this be a HA?"]
• Maximum delay = ?
– Do you look for the longest path or the shortest path between any input and output?
– Compare with the delay of a shift-and-add method
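The structure of the array can be sketched functionally as follows. This is a minimal bit-level model, not a timing model: every cell is treated as a full adder (a half adder is the same cell with carry-in 0), and the names full_adder, ripple_add and array_multiply are assumptions made for this example.

```python
def full_adder(x, y, ci):
    """One-bit full adder: returns (sum, carry_out)."""
    s = x ^ y ^ ci
    co = (x & y) | (x & ci) | (y & ci)
    return s, co


def ripple_add(a_bits, b_bits):
    """Add two equal-length, LSB-first bit lists; the carry ripples left."""
    out, carry = [], 0
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out + [carry]


def array_multiply(m, q, n=4):
    m_bits = [(m >> i) & 1 for i in range(n)]
    q_bits = [(q >> j) & 1 for j in range(n)]
    # Row 0 of the AND array is the initial running sum, padded with a 0.
    running = [mb & q_bits[0] for mb in m_bits] + [0]
    product_bits = []
    for j in range(1, n):
        product_bits.append(running[0])      # lowest bit is final: retire it
        row = [mb & q_bits[j] for mb in m_bits]
        running = ripple_add(running[1:], row)
    product_bits += running                  # remaining n+1 bits
    return sum(bit << i for i, bit in enumerate(product_bits))


print(array_multiply(13, 11))                # 143
```

Each call to ripple_add corresponds to one adder row of the figure, so an n-bit multiply uses n-1 rows.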
4 Pipelined Multiplier
• Now try to pipeline the previous design
[Figure: the 4x4 array multiplier of the previous slide, with inputs m·q0 through m·q3 and three adder rows (HA FA FA HA in the first row, FA FA FA HA in the other two) producing P[7:0], to be cut into pipeline stages.]
• Determine the maximum stage delay to decide the pipeline clock rate. Assume zero delay for the stage latches.
• How does the latency of the pipeline compare with that of the simple combinational array on the previous slide?
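The relationship between stage delay, clock period and latency can be sketched with a few lines of arithmetic. The stage delays below are hypothetical numbers chosen only for illustration; they are not derived from the slide's circuit, and pipeline_timing is a name invented for this example.

```python
def pipeline_timing(stage_delays):
    """Return (clock_period, latency) for a pipeline with zero-delay latches."""
    clock_period = max(stage_delays)              # slowest stage sets the clock
    latency = clock_period * len(stage_delays)    # one clock per stage
    return clock_period, latency


# Hypothetical per-stage delays in gate-delay units (not taken from the slides).
stages = [3, 4, 4]
period, latency = pipeline_timing(stages)
print(period, latency, sum(stages))               # 4 12 11
```

Because every stage is given the full clock period, the latency can never be smaller than the sum of the individual stage delays; pipelining buys throughput, not a shorter time for a single multiplication.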
5 Carry-Save Multiplier
• Instead of propagating the carries to the left in the same row, carries are now sent down to the next stage to reduce stage delay and facilitate pipelining
[Figure: 4x4 carry-save multiplier. Three rows of three full adders (the CSAs) take the partial products m·q0 through m·q3, with 0s on the carry inputs of the first row; a final row of three full adders forms a ripple-carry adder (RCA) with carry-in 0. Outputs P[7:0].]
• The upper three stages are 3-bit carry-save adders (CSAs), each with a 2-gate delay.
• The last stage is a ripple-carry adder (RCA), which has a longer delay. It can be replaced by a CLA for larger multipliers.
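A minimal functional sketch of this organization, using Python integers as bit vectors, is shown below. The names csa and carry_save_multiply are assumptions made for this example, and the final `s + c` stands in for the carry-propagate adder (RCA or CLA) of the last row.

```python
def csa(a, b, c):
    """Carry-save add: reduce three vectors to a sum vector and a carry vector."""
    s = a ^ b ^ c                                  # per-bit sum, no carry chain
    carry = ((a & b) | (a & c) | (b & c)) << 1     # carries move one column left
    return s, carry


def carry_save_multiply(m, q, n=4):
    rows = [(m if (q >> j) & 1 else 0) << j for j in range(n)]
    s, c = rows[0], 0                  # first stage sees zeros on the carry inputs
    for row in rows[1:]:               # one CSA stage per remaining partial product
        s, c = csa(s, c, row)
    return s + c                       # single carry-propagate addition at the end


print(carry_save_multiply(13, 11))     # 143
```

Only the last line involves a carry chain; each loop iteration costs one full-adder delay regardless of the operand width.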
6 Carry Save Adders
• Consider the decimal addition 47 + 96 + 58 = 201
• One way is to add 47 to 96 to get 143 and then add 58
• Here the tens column cannot be added until the carry from the ones column is produced
• In the carry-save style, we add the ones column and the tens column simultaneously

  Conventional            Carry-save
      4 7                     4 7
    + 9 6                     9 6
    -----                   + 5 8
    1 4 3                   -----
    + 5 8                     2 1    (ones column: 7 + 6 + 8)
    -----                   1 8 _    (tens column: 4 + 9 + 5)
    2 0 1                   -----
                            2 0 1
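The decimal example can be reproduced with a short sketch in which each column is summed on its own and the column totals are only combined at the end. The helper names column_sums and resolve, and the default of two digits, are assumptions made for this illustration.

```python
def column_sums(numbers, n_digits=2):
    """Sum each decimal column independently, with no carry between columns."""
    sums = []
    for d in range(n_digits):
        sums.append(sum((x // 10**d) % 10 for x in numbers))
    return sums                        # sums[d] has weight 10**d


def resolve(sums):
    """Combine the column totals in one final carry-propagating step."""
    return sum(s * 10**d for d, s in enumerate(sums))


cols = column_sums([47, 96, 58])
print(cols)                            # [21, 18]
print(resolve(cols))                   # 201
```

The two entries of cols could be computed at the same time, since neither depends on a carry from the other.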
7 Carry-Save (3,2) Adders
• A carry-save adder is also called a (3,2) adder or a (3,2) counter (refer to Computer Arithmetic Algorithms by Israel Koren), as it takes three vectors, adds them up, and reduces them to two vectors, namely a sum vector and a carry vector

      0 1 0 1
      1 0 0 1
    + 1 0 1 1
    ---------
      0 1 1 1      Sum vector
    1 0 0 1 _      Carry vector

• CSAs are based on the principle that carries do not have to be added as soon as possible, but can be combined in a later step
• An n-bit CSA consists of n disjoint full adders
[Figure: 4-bit CSA built from four full adders. The adder at bit i takes A[i], B[i] and C[i] on its X, Y and Z inputs and produces the sum S[i] and a carry output labeled C[i+1]; no carry passes between adders.]
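The binary example on this slide can be checked with a sketch that models the CSA literally as n independent full adders. The names full_adder and csa_bits are assumptions made for this example.

```python
def full_adder(x, y, z):
    """One-bit full adder: returns (sum, carry_out)."""
    s = x ^ y ^ z
    co = (x & y) | (x & z) | (y & z)
    return s, co


def csa_bits(a, b, c, n=4):
    """n disjoint full adders: no carry passes between bit positions."""
    sum_vec, carry_vec = 0, 0
    for i in range(n):
        s, co = full_adder((a >> i) & 1, (b >> i) & 1, (c >> i) & 1)
        sum_vec |= s << i
        carry_vec |= co << (i + 1)     # a carry out of bit i has weight 2^(i+1)
    return sum_vec, carry_vec


s, c = csa_bits(0b0101, 0b1001, 0b1011)
print(format(s, "04b"), format(c, "05b"))   # 0111 10010
print(s + c)                                # 25
```

The sum and carry vectors match the slide (0111 and 1001_), and adding them gives 5 + 9 + 11 = 25.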
8 1-bit FA vs. 1-bit CSA
• Any difference between an ordinary full adder and a 1-bit CSA? NO!
• A 16-bit wide CSA takes ( more / equal / less ) time to produce its outputs compared to an 8-bit wide CSA
• A carry-save adder ( is / is not ) useful in adding only 2 numbers
9 CSA Organization • We can arrange our CSA’s in a linear manner where one partial product is added per CSA (after the first level)
10 Wallace Tree Multiplier
• Using the previous example as a template, to build an n x n multiplier you need (n-1) CSAs, each (n-1) bits wide, followed by a final (n-1)-bit RCA
• Delay = Delay of (n-1) CSAs + Delay of (n-1)-bit RCA = 2 * (n-1) * Delay(FullAdder), since each CSA level costs one full-adder delay and the RCA costs (n-1) full-adder delays
• We can reduce the CSA component of the delay by organizing the CSAs in a tree (i.e. logarithmic delay)
[Figure: Wallace tree block diagram. The eight partial products q7·M through q0·M feed a tree of CSAs, whose two outputs go to a propagation adder that produces the product.]
Note: The vectors (partial products) need to be aligned before summing. These details are not shown in the block diagram.
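The tree organization can be sketched functionally as below: at every level the vectors are grouped in threes, each group goes through one CSA, and any leftover vectors pass through unchanged. This is one plausible grouping, assumed for illustration; the names csa, wallace_sum and tree_multiply are not from the slides, and the partial products are aligned by shifting before they enter the tree.

```python
def csa(a, b, c):
    """Carry-save add: three vectors in, (sum, carry) vectors out."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def wallace_sum(vectors):
    """Reduce the vectors with levels of CSAs; return (total, csa_levels)."""
    vectors = list(vectors)
    levels = 0
    while len(vectors) > 2:
        full = len(vectors) - len(vectors) % 3
        nxt = []
        for i in range(0, full, 3):
            nxt.extend(csa(*vectors[i:i + 3]))   # CSAs of one level work in parallel
        nxt.extend(vectors[full:])               # leftovers wait for the next level
        vectors = nxt
        levels += 1
    return sum(vectors), levels                  # final carry-propagate addition


def tree_multiply(m, q, n=8):
    rows = [(m if (q >> j) & 1 else 0) << j for j in range(n)]   # aligned rows
    return wallace_sum(rows)


print(tree_multiply(200, 123))                   # (24600, 4)
```

For eight partial products the count of vectors goes 8, 6, 4, 3, 2, so this grouping needs four CSA levels before the propagation adder.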
11 Logic Delay
• Consider the gate arrangement for OR'ing 8 bits with 2-input gates
• Linear:
– Delay = 7 gates
• Tree:
– Depth of tree = log2(8) = 3 levels
• Consider OR'ing 16 bits using 4-input OR gates: how many levels would you need?
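The linear and tree depths above can be computed with a small sketch. The function names and the fan_in parameter are assumptions made for this example; the tree is assumed to be as balanced as possible.

```python
def linear_delay(n_inputs):
    """Chain of 2-input gates: each gate adds one more input to the result."""
    return n_inputs - 1


def tree_levels(n_inputs, fan_in=2):
    """Levels of a balanced tree of fan_in-input gates."""
    levels = 0
    while n_inputs > 1:
        n_inputs = -(-n_inputs // fan_in)   # ceiling division: gates at this level
        levels += 1
    return levels


print(linear_delay(8))       # 7
print(tree_levels(8, 2))     # 3
print(tree_levels(16, 4))
```

The last call corresponds to the question on the slide and can be used to check a hand-derived answer.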
12 Wallace Tree Discussion
• A 4-input OR gate reduces 4 literals to 1 (i.e. a factor of 4 reduction)
• A CSA reduces 3 vectors to 2 vectors (i.e. a factor of 1.5)
– This reduction factor may not be convenient for developing an efficient tree to sum 16 or 32 partial products
– A Wallace tree may not achieve a great reduction in delay due to wastage of an extra level
• Also note that the Wallace tree shown earlier does not show:
– Size of buses
– Which bits are "retired" progressively
– Relative significance (alignment) of partial products
– Size of the carry-propagate adder (e.g. RCA or CLA), which needs to be figured out and the overall delay estimated
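The effect of the 3-to-2 reduction factor can be explored with a short sketch that counts CSA levels for a given number of partial products. Two assumptions are made here: the tree groups vectors in threes at every level and passes leftovers through, and the linear arrangement lets its first CSA take three vectors and each later CSA add one more. The function names are invented for this example.

```python
def csa_tree_levels(k):
    """CSA levels needed to reduce k vectors to 2 when grouping in threes."""
    levels = 0
    while k > 2:
        k = 2 * (k // 3) + k % 3    # each full group of 3 becomes 2; leftovers pass
        levels += 1
    return levels


def csa_linear_levels(k):
    """CSA levels in a linear chain: first CSA takes 3 vectors, then 1 per CSA."""
    return max(k - 2, 0)


for k in (6, 8, 16, 32):
    print(k, csa_linear_levels(k), csa_tree_levels(k))
# 6 4 3
# 8 6 4
# 16 14 6
# 32 30 8
```

Because 16 and 32 are not powers of 1.5, some levels carry leftover vectors that cannot be grouped, which is the wasted-level effect mentioned in the bullets above.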
14 [Figure: dot diagrams for a 6x6 multiplier reduced with CSAs. Panels: Original 6x6 Matrix; Reorganized 6x6 matrix; Level 1 CSA; Results of Level 1; Level 2 CSA; Level 3 CSA. Columns are labeled by bit position, 10 down to 0 (11 down to 0 in the last panel).]
15 Credits • These slides were derived from Gandhi Puvvada’s EE 457 Class Notes