

slide-1
SLIDE 1

Performance-Driven Placement for Design of Rotation and Right Arithmetic Shifters in Monolithic 3D ICs

Hao Zhuang, Jingwei Lu, Kambiz Samadi*, Yang Du*, and Chung-Kuan Cheng

  • Dept. Computer Science & Engineering, University of California, San Diego, CA, 92093 USA

*Qualcomm Research, San Diego, CA, 92121, USA

slide-2
SLIDE 2

Outline

  • Motivation

– Monolithic 3D ICs (M3D)
– Our target circuits:

  • Rotation Shifter
  • Arithmetic Shifter (right shift)
  • Optimization Approach of Shifter Designs

– Permutation-based Optimization + M3D technology
– Efficient Simulated Annealing Solver

  • Experiment
  • Conclusions
slide-3
SLIDE 3

Motivation: Resume Moore’s Law – 3D ICs

  • 3D ICs are a promising solution for continued scaling of VLSI.
  • Through-silicon via (TSV)-based 3D ICs

– Dies are fabricated separately.
– Wafers need to be thinned, aligned and then bonded.
– TSVs are large.

  • Monolithic 3D ICs (M3D)

– Tiers are fabricated sequentially.
– Monolithic inter-tier vias (MIVs) serve as vertical connections. They are only about the size of a metal via.

  • TSV diameter = 6 um [1]
  • MIV diameter = 70 nm [1]
  • Standard cell height = 1.4 um [1]

[1] Shreepad Panth, et al. ASPDAC 2012
slide-4
SLIDE 4

Motivation: Monolithic 3D ICs (M3D)/Monolithic Inter-Tier Vias (MIV)

The advantages of M3D/MIV:

  • High-density integration: avoids the large dimensions and area overhead that TSVs impose on 3D IC designs.
  • Copes with interconnect-limited 2D ICs, where most of the problems are essentially caused by the high interconnect density at the gate level.
  • Inserts vertical connections and shortens the distance between connected modules.
  • Reduces the total wire length and power, and improves routability and timing behavior.

slide-5
SLIDE 5

Motivation: Our target circuits

  • Shifter circuits

– An indispensable datapath component in MPUs and ASICs.
– They have a broad spectrum of applications and can affect system performance on a larger scale.
– The wiring inside each shifter module is quite dense, so improving the timing and power behavior of shifters is an important subject.

  • In this work, our specific shifter targets are:

– Rotation shifter
– Arithmetic shifter (right shift)

slide-6
SLIDE 6

Rotation Shifter (Rotator)

This is a linear ordering (LO) design of a rotation shifter (rotator), also known as a cyclic shifter. Rotation requires long wrap-around wires.

[Figure: 8-bit rotator in linear order on an x-y grid. Input cells D0 to D7 sit in row y = 0; 2-input MUX cells (x, y), x = 0..7, fill rows y = 1 to 3; select signals are S0, S1 and S2; outputs are Z0 to Z7.]
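As a functional reference for the rotator, the following is a minimal Python sketch of a logarithmic rotate-right built from rows of 2-input MUXes. It assumes that select signal S_k controls a row that shifts by 2^k positions (the later optimization slide labels the rows ">> 1-bit", ">> 2-bit" and ">> 4-bit"); the names `rotate_right_log`, `to_bits` and `from_bits` are illustrative and do not come from the slides.

```python
def to_bits(value, n):
    """Return the n low bits of value as a list, index 0 = least significant."""
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def rotate_right_log(bits, amount):
    """Rotate right by `amount` using log2(n) rows of 2-input MUXes.

    Row k either passes bit i straight through (select = 0) or takes
    bit (i + 2**k) mod n (select = 1). The modulo corresponds to the
    wrap-around wires of the linear-order layout.
    """
    n = len(bits)
    stages = n.bit_length() - 1
    assert 1 << stages == n, "width is assumed to be a power of two"
    row = list(bits)
    for k in range(stages):
        if (amount >> k) & 1:  # assumed select signal S_k
            shift = 1 << k
            row = [row[(i + shift) % n] for i in range(n)]
    return row
```

For instance, `from_bits(rotate_right_log(to_bits(0b10010110, 8), 3))` gives `0b11010010`.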

slide-8
SLIDE 8

Right Arithmetic Shifter

This is a linear ordering (LO) design of a right arithmetic shifter. The vacated upper positions are filled by extending the original most significant bit (MSB).

[Figure: 8-bit right arithmetic shifter in linear order on an x-y grid. Input cells D0 to D7 sit in row y = 0; 2-input MUX cells (x, y), x = 0..7, fill rows y = 1 to 3; select signals are S0, S1 and S2; outputs are Z0 to Z7.]
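The MSB extension can be made concrete with a small functional model. This is a sketch only, assuming one MUX row per power-of-two shift amount; `shift_right_arith_log`, `to_bits` and `from_bits` are illustrative names, not taken from the slides.

```python
def to_bits(value, n):
    """Return the n low bits of value as a list, index 0 = least significant."""
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def shift_right_arith_log(bits, amount):
    """Arithmetic right shift using log2(n) rows of 2-input MUXes.

    Row k takes bit i + 2**k when that bit exists; otherwise it takes
    the current MSB, which is how the sign bit is extended.
    """
    n = len(bits)
    stages = n.bit_length() - 1
    assert 1 << stages == n, "width is assumed to be a power of two"
    row = list(bits)
    for k in range(stages):
        if (amount >> k) & 1:
            shift = 1 << k
            msb = row[n - 1]
            row = [row[i + shift] if i + shift < n else msb for i in range(n)]
    return row
```

Unlike the rotator, no wire wraps around from the low end to the high end; the MSB fans out to the vacated positions instead.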

slide-9
SLIDE 9

Objectives

  • Objectives:

– Reduce the longest path to improve timing.
– Reduce the total wire length to improve power.

  • The linear order design carries heavy wire loads, caused by the long wrap-around wires.

[Figure: the 8-bit linear order design on the x-y grid, with input cells D0 to D7, MUX cells (x, y), select signals S0 to S2 and outputs Z0 to Z7.]

slide-10
SLIDE 10

Approaches

Our optimization approach combines two aspects:

  • M3D/MIV

– Inserts vertical connections, which may shorten the distance between connected cells.
– By introducing an extra dimension, it reduces the total wire length and dynamic power, and improves routability and timing behavior.

  • Cell order permutations (proposed in our ASPDAC 2007 paper [2])

– Idea/observation: swapping the physical positions of cells in the shifter reduces the longest path and the total wire length.
– It can also compensate for the delay penalty of the detoured routes that appear when a 2D design is naively folded into a 3D IC (shown in the right arithmetic shifter experiment).
– This is the first work to optimize 3D shifters by cell order permutations. Previous work tends to simply fold the 2D linear design into 3D space, which is not efficient.

[2] Haikun Zhu, et al. ASPDAC 2007.

slide-11
SLIDE 11

Optimization (Cell Order Permutation)

[Figure: illustration of permutation-based optimization for an 8-bit rotator. Two layouts are shown, each with MUX rows ">> 1-bit", ">> 2-bit" and ">> 4-bit"; every cell is annotated with the accumulated span of the longest path reaching it.]

  • LO design: the longest path span (LPS) is 13 MUX cells.
  • An optimized solution: the LPS is only 8 MUX cells.
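The LPS figure for the linear order design can be reproduced with a short dynamic program. The sketch below assumes that each MUX cell receives a straight input from the same bit in the previous row and a shifted input from bit (i + shift) mod n, and that a wire's span is the difference between the column positions of its two endpoints. The function name and the row encoding are illustrative.

```python
def longest_path_span(rows, shifts):
    """Longest accumulated horizontal span over all input-to-output paths.

    rows[k][x] is the bit index placed at column x of row k (row 0 holds
    the inputs). shifts[k - 1] is the shift amount of MUX row k.
    """
    n = len(rows[0])
    # Column position of every bit in every row.
    pos = [{bit: x for x, bit in enumerate(row)} for row in rows]
    span = {bit: 0 for bit in rows[0]}
    for k, shift in enumerate(shifts, start=1):
        nxt = {}
        for bit in rows[k]:
            x = pos[k][bit]
            src = (bit + shift) % n
            straight = span[bit] + abs(x - pos[k - 1][bit])
            shifted = span[src] + abs(x - pos[k - 1][src])
            nxt[bit] = max(straight, shifted)
        span = nxt
    return max(span.values())


# Linear order: every row uses the identity ordering 0..7.
linear_order = [list(range(8)) for _ in range(4)]
print(longest_path_span(linear_order, [1, 2, 4]))  # 13
```

Permuting the cells within a row changes only `pos`, so the same function can score any candidate ordering.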

slide-12
SLIDE 12

Optimization (M3D)

  • Fold the 2D LO design into a 3D linear order design.
  • Cut the long wrap-around wires and use short MIVs instead.

[Figure: the 2D design is cut, then the pieces are folded onto different layers (3D).]

slide-13
SLIDE 13

Optimization (M3D)

(1) Place the cells physically, and (2) connect the different layers with MIVs for vertical communication.

slide-14
SLIDE 14

Cell Order Permutation in 3D space

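  • Extend to 3D space permutations

– Wires along the x-y direction are not shown.
– The MIVs are treated as vertical interconnects.
– Assume that an MIV connecting adjacent layers spans 5% of the MUX cell width.
– Swap cells, for example the two highlighted ones, and so on.

[Figure: 3D cell arrangement with input and output sides marked and two cells highlighted for swapping.]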
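One way to score a 3D placement is to extend the span calculation with a layer coordinate. The sketch below assumes a wire's span is |dx| + alpha * |dz| with alpha = 0.05 (the 5% MIV assumption), uses the same straight/shifted MUX connectivity as a rotator, and folds the design by stacking equal halves so that bit b sits at column b mod 4 on layer b div 4. That folding rule and all names are illustrative assumptions.

```python
def longest_path_span_3d(place, shifts, alpha=0.05):
    """Longest accumulated span when cells have (x, z) positions.

    place[k][bit] is the (column, layer) of the cell for `bit` in row k.
    Horizontal distance counts in MUX cell widths; each layer crossed by
    an MIV adds alpha cell widths.
    """
    n = len(place[0])
    span = [0.0] * n
    for k, shift in enumerate(shifts, start=1):
        def wire(src, dst):
            (x0, z0), (x1, z1) = place[k - 1][src], place[k][dst]
            return abs(x1 - x0) + alpha * abs(z1 - z0)

        span = [
            max(span[b] + wire(b, b),
                span[(b + shift) % n] + wire((b + shift) % n, b))
            for b in range(n)
        ]
    return max(span)


flat = [[(b, 0) for b in range(8)] for _ in range(4)]
folded = [[(b % 4, b // 4) for b in range(8)] for _ in range(4)]
print(longest_path_span_3d(flat, [1, 2, 4]))    # 13.0
print(longest_path_span_3d(folded, [1, 2, 4]))  # about 5.1
```

Under these assumptions, folding alone already shortens the 8-bit rotator's longest path, and swapping cells only changes the entries of `place`.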

slide-15
SLIDE 15

Solve the whole optimization via Simulated Annealing-Based Solver (SA)

  • Use slacks for timing

– Take the two fan-out nodes into consideration, rather than only reducing the longest path.
– The weight of net $n_i$: $w_{n_i} = \sum_{e \in n_i} \left(1 - \frac{\mathrm{slack}(e)}{D}\right)^{\theta}$
– Total slack of the shifter network: $W_{slack} = \sum_{n_i \in \mathrm{network}} w_{n_i}$

  • Use total wire length as another cost function

– Power is proportional to wire length.
– $W_{TWL}$ is the total wire length.

  • Auto-normalizing cost function [3][4]

$\Delta cost = \gamma \frac{\Delta W_{slack}}{W_{slack}^{prev}} + (1 - \gamma) \frac{\Delta W_{TWL}}{W_{TWL}^{prev}}$, where $\gamma$ is a tuning parameter.

[3] A. Marquardt, et al. FPGA 2000
[4] K. Eguro, et al. DAC 2008
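The two cost terms can be written down directly. The sketch below assumes the net weight is a sum over the net's edges of (1 - slack/D) raised to the power theta, and that the cost change is the gamma-weighted sum of the relative changes in the slack term and the wire-length term; the function and parameter names are illustrative.

```python
def net_weight(slacks, d, theta):
    """Weight of one net: sum of (1 - slack / d) ** theta over its edges."""
    return sum((1.0 - slack / d) ** theta for slack in slacks)


def delta_cost(gamma, w_slack_prev, w_slack_new, w_twl_prev, w_twl_new):
    """Auto-normalized cost change of one move.

    Each term is divided by its value before the move, so the timing and
    wire-length terms are comparable without manual scaling.
    """
    d_slack = w_slack_new - w_slack_prev
    d_twl = w_twl_new - w_twl_prev
    return gamma * d_slack / w_slack_prev + (1.0 - gamma) * d_twl / w_twl_prev


# A move that lowers the slack term by 10% and raises wire length by 10%
# is cost-neutral when gamma = 0.5.
print(delta_cost(0.5, 10.0, 9.0, 100.0, 110.0))
```

A negative return value means the move improves the combined cost.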

slide-16
SLIDE 16

Scalable SA optimization solver

  • Integer Linear Programming (ILP), which was used in [2], is not scalable in our case.
  • SA is a scalable method for this optimization problem, and it achieves almost the same LPS quality as ILP (shown in Table II, 16-bit rotator cases).
  • "LPS": the span of the longest path along the x-/z-directions, measured in number of MUX cells (the wire span along the y-direction contributes the same amount in all shifters).
  • When optimizing a 32-bit rotator in 2 layers, ILP takes days to obtain the solution, while SA takes only minutes.

[2] Haikun Zhu, et al. ASPDAC2007.
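To illustrate the kind of search involved, here is a minimal simulated annealing sketch for a 2D rotator. It is not the authors' solver: it uses LPS alone as the cost (no slack or wire-length terms), swaps two cells within one inner MUX row per move, keeps the input and output rows in linear order, and uses an arbitrary geometric cooling schedule. All names and parameter values are assumptions.

```python
import math
import random


def longest_path_span(rows, shifts):
    """LPS of a rotator; rows[k][x] is the bit at column x of row k."""
    n = len(rows[0])
    pos = [{bit: x for x, bit in enumerate(row)} for row in rows]
    span = {bit: 0 for bit in rows[0]}
    for k, shift in enumerate(shifts, start=1):
        nxt = {}
        for bit in rows[k]:
            x, src = pos[k][bit], (bit + shift) % n
            nxt[bit] = max(span[bit] + abs(x - pos[k - 1][bit]),
                           span[src] + abs(x - pos[k - 1][src]))
        span = nxt
    return max(span.values())


def anneal(n_bits, seed=0, steps=5000, t0=2.0, cooling=0.999):
    """Return (best_rows, best_lps) found by swap-based annealing."""
    rng = random.Random(seed)
    stages = n_bits.bit_length() - 1
    assert stages >= 2, "needs at least one inner MUX row"
    shifts = [1 << k for k in range(stages)]
    rows = [list(range(n_bits)) for _ in range(stages + 1)]
    cost = longest_path_span(rows, shifts)
    best_rows, best_cost = [r[:] for r in rows], cost
    temp = t0
    for _ in range(steps):
        k = rng.randrange(1, stages)          # inner rows only
        a, b = rng.sample(range(n_bits), 2)
        rows[k][a], rows[k][b] = rows[k][b], rows[k][a]
        new_cost = longest_path_span(rows, shifts)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost                   # accept the move
            if cost < best_cost:
                best_rows, best_cost = [r[:] for r in rows], cost
        else:
            rows[k][a], rows[k][b] = rows[k][b], rows[k][a]  # undo
        temp *= cooling
    return best_rows, best_cost
```

Calling `anneal(8)` starts from the linear order design (LPS 13), so the returned LPS is never worse than that; how close it gets to the optimized value depends on the schedule and seed.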

slide-17
SLIDE 17

Experiment of Shifter Design Optimization

  • The parameters for performance evaluation
  • Notation on the following pages:

– "SA": permutation-based optimization by the simulated annealing-based solver.
– "LO": linear order design in 2D, or folded linear order design in 3D.
– "LPS": the span of the longest path along the x-/z-directions, measured in number of MUX cells (the wire span along the y-direction contributes the same amount in all shifters).
– Delay and power are measured with the following methods.

slide-18
SLIDE 18

Evaluating interconnect effect

  • Based on logical effort model

– Technology independent
– Easy to incorporate interconnect effect

  • Gate delay: $d = g h + p$

– $g$: logical effort, depends only on gate type
– $h$: electrical effort, depends on load capacitance
– $p$: parasitic delay, depends only on gate type

  • Wire load is integrated into $h$

– Electrical effort contributed by wire per column spanned
– Wire length normalized to cell width
– With vertical MIV interconnect (alpha = 0.05)
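A stage delay with wire load folded into the electrical effort might be computed as below. This is a sketch under stated assumptions: the wire adds capacitance in proportion to the number of columns it spans, each layer crossed by an MIV counts as alpha = 0.05 columns, and the load seen by the gate is the sum of gate and wire capacitance divided by the gate's input capacitance. The slide's exact expressions were not preserved, so the parameter names and this form of h are illustrative.

```python
def stage_delay(g, p, c_in, c_gate_load, c_wire_per_col,
                span_cols, span_layers=0, alpha=0.05):
    """Logical-effort delay d = g * h + p with wire load inside h.

    span_cols is the horizontal wire length in cell widths; span_layers
    is the number of layers crossed by MIVs.
    """
    wire_cols = span_cols + alpha * span_layers
    h = (c_gate_load + c_wire_per_col * wire_cols) / c_in
    return g * h + p


# Example values chosen only for illustration.
print(stage_delay(g=2.0, p=4.0, c_in=1.0, c_gate_load=2.0,
                  c_wire_per_col=1.0, span_cols=3, span_layers=1))
```

With these inputs h is 5.05, so the delay is about 14.1 in normalized units.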

slide-19
SLIDE 19

Delay and Power Metrics

  • Delay evaluation: calculate the delay along each path and select the largest value.
  • Power: dynamic power evaluation

– Summation of all effective capacitance, where the input gate capacitance is 8Cg/3, which is also the capacitance of wire per column span (MUX cell width).
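Both metrics reduce to simple aggregations. The sketch below assumes that a path's delay is the sum of its stage delays and that every gate input and every column of wire contributes one unit of 8Cg/3 to the effective capacitance; the slide's own formula was not preserved, so this counting rule and the function names are assumptions.

```python
def worst_path_delay(paths):
    """Largest total delay among paths, each given as a list of stage delays."""
    return max(sum(stage_delays) for stage_delays in paths)


def total_effective_cap(n_gate_inputs, wire_columns, cg=1.0):
    """Sum of effective capacitance, in the same units as cg.

    One gate input and one column of wire are both taken as 8 * cg / 3.
    """
    unit = 8.0 * cg / 3.0
    return unit * (n_gate_inputs + wire_columns)


print(worst_path_delay([[1.0, 2.0, 3.0], [4.0, 1.0], [2.0, 2.0, 2.5]]))  # 6.5
print(total_effective_cap(3, 6, cg=3.0))                                # 72.0
```

Because wire and gate input share the same unit, total capacitance scales directly with total wire length measured in columns.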


slide-20
SLIDE 20

Evaluation Result (I)

Results of 32-, 64- and 128-bit rotators by LO design and SA in 2D/3D ICs

  • SA improves timing by 32% and reduces power by 5% on average (vs. simple 2D LO and 3D LO folding).
  • For example, the combination of M3D (4 layers) and SA reduces dynamic power by 60% and delay by 49% relative to the optimized 2D design of the 128-bit shifter.

slide-21
SLIDE 21

Evaluation Result (II)

Results of 32-, 64- and 128-bit right arithmetic shifters by LO design and SA in 2D/3D ICs

  • Compared to simple 2D LO and 3D LO folding, our SA improves timing by 23% and reduces power by 5% on average. It compensates for the delay caused by detoured routes when the 2D design is naively folded into 3D ICs.
  • For example, the combination of M3D (4 layers) and SA reduces dynamic power by 40% and delay by 18% relative to the 2D design of the 128-bit shifter.

– 2D SA cases are omitted here because the timing and power improvements are marginal in the 2D cases.

slide-22
SLIDE 22

Conclusions

  • High-density integration with vertical interconnect, enabled by monolithic 3D IC technology, provides a promising solution to cope with interconnect-limited 2D ICs.
  • The permutation-based optimization can reduce the longest path and improve timing. The dynamic power is reduced in the majority of our cases.

– The simulated annealing-based solver is highly efficient for permutation-based optimization.
– Compared to simple 2D LO and 3D LO folding, our SA improves timing by around 20% and reduces power by around 5% on average.

  • Our work is the first to optimize 3D shifters and explore the design space via cell order permutations.

slide-23
SLIDE 23

Thanks! Q&A

slide-24
SLIDE 24

Right arithmetic shifter by folding

If it is 64 bits with 2 layers, and the cut happens at logic cell 32:

63 62 … 32 | 31 … 1 0

(32 is in one layer's rightmost cell, 31 is in the other layer's leftmost cell)

The wire length needed to fetch the next logic level's cell changes from X to Y (2D x-span -> x-span after 3D folding):

1. 1 -> 31
2. 2 -> 30
3. 4 -> 28
4. 8 -> 24
5. 16 -> 16
6. 32 -> 0

Total span: 63 -> 129
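The span figures above follow from simple arithmetic. The sketch below assumes the two halves are stacked so that cell b sits at column b mod 32, which makes a wire of 2D span s that crosses the cut span |32 - s| columns after folding; the function name is illustrative.

```python
def folded_spans(n_bits):
    """Per-stage (2D span, folded span) pairs for a two-layer fold.

    A wire of 2D span s between cell j and cell j + s that crosses the
    cut lands at columns j and j + s - width, so its folded span is
    abs(width - s), where width = n_bits // 2.
    """
    width = n_bits // 2
    shifts = [1 << k for k in range(n_bits.bit_length() - 1)]
    return [(s, abs(width - s)) for s in shifts]


pairs = folded_spans(64)
print(pairs)                       # [(1, 31), (2, 30), (4, 28), (8, 24), (16, 16), (32, 0)]
print(sum(s for s, _ in pairs))    # 63
print(sum(f for _, f in pairs))    # 129
```

The short wires are the ones that grow most after a naive fold, which is why folding alone can lengthen paths in the right arithmetic shifter.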

slide-25
SLIDE 25

Right Arithmetic Shifter

[Figure: 8-bit right arithmetic shifter in linear order on an x-y grid, with input cells D0 to D7, MUX cells (x, y), select signals S0 to S2 and outputs Z0 to Z7; one wire is marked.]

The span of the marked wire becomes 3 after folding, whereas previously it was only 1.