An Exact Polynomial Time Algorithm for Clock Tree Sizing for Register Files



slide-1
SLIDE 1

An Exact Polynomial Time Algorithm for Clock Tree Sizing for Register Files

Alexander Berkovich, Lawrence D. Gonzales, Ataur Patwary Intel Corporation Rupesh S. Shelar Synopsys Inc.

slide-2
SLIDE 2

Agenda

  • Introduction
  • Clock Trees for Register Files
  • Sizing Algorithm
  • Experimental Results & Conclusion
slide-3
SLIDE 3

Background

  • Memory hierarchy: on-chip, DRAM, SSDs, (X-point), hard drives
  • On-chip memory: latch/flop, Register files, SRAM
  • Register files for fast on-chip storage
  • Employed in processors meant for low power & high performance applications

  • Have been around for decades
slide-4
SLIDE 4

Register Files

  • Contain bit-cells arranged in rows and columns
  • Bit-cell optimized for area, power, speed in each technology
  • Circuitry for read-/write address decoding
  • Rail-to-rail voltage swing, in contrast to SRAM
  • Typically, 8-transistor cells for 1R1W
  • Bit-cells are larger than SRAM cells
  • Use static domino latch instead of sense amplifiers
  • Commercial tools available for layout of different configurations
slide-5
SLIDE 5

Synchronous Operation

  • Most circuits are synchronous
  • Register files also accessed in the same fashion
  • Requires clock tree sizing considering races
  • Commercial tools avoid races by employing self-timed circuits
  • Comes at the cost of additional area, power
  • May not be desirable for high volume processors
slide-6
SLIDE 6

Why clock trees for register files are important

  • Clock trees consume significant power, in general
  • True for register files also, as all read/write accesses go through clock network
  • Bit-cells highly optimized for each technology
  • Layout is also optimized using regular arrangement in rows & columns
  • Clock trees for RFs have additional constraints, apart from skew, slew
  • Races between read & precharge, read & write
  • Tedious, time-consuming to size manually
slide-7
SLIDE 7

Agenda

  • Introduction
  • Clock Trees for Register Files
  • Sizing Algorithm
  • Experimental Results & Conclusion
slide-8
SLIDE 8

High-level View

Layout summary of 48 entry x 4-bit array

[Figure: array layout with ENTRY 0 to ENTRY 47, read decoder columns (RDDECC) and write decoder columns (WRDECC), read and write address protection latches with 3:8 decoders, an inverter, and the RCB driven by MCLK (nets CKRDM1N22 and CKWRM2N33). Cell labels in the layout: edgerdpdecc0, rddecsdlc0, vn7inc30ls0v1, vn7lap00ls0i1, edgewrckdrv.]

slide-9
SLIDE 9

Clock Tree Topology

  • Clock cell instance naming convention (for this example only)
  • G = Ganged NAND
  • B = buffer
  • R = RDWL
  • P0(1) = L0(1) Precharge
  • A = AND
  • O = OR
  • Numbers in suffix denote the reverse topological order in which solutions are created

[Figure: clock tree from MCLK through the RCB to RDWL[N:0], L0PCH and L1PCH, with clock cells BR1, GR1, BP01, BR2, GR2, BP11, BP12, OP11, AP11, BP13, BP14, RCB and nets CKRDM1N22, CKRDM1N44. Annotations on the figure:
  Hsw default mclk rise: 55 ar
  Absolute spec 0: 27 ar
  Absolute spec 2 (RDWL[N]): 27 ar
  Relative spec 0: R L0PCH 0.05 BR RDWL[N:0]
  Relative spec 1: F L0PCH 5ps AF RDWL[N:0]
  Relative spec 0: R L1PCH 0.05 BR INSTANCE[A:0]/RDBLL0
  Relative spec 1: F L1PCH 5ps AF INSTANCE[A:0]/RDBLL0]

slide-10
SLIDE 10

Relative Timing Constraints between L0PCH and RDWL

  • Relative timing constraint between RDWL & L0PCH
  • Rise transition: earliest (latest) L0PCH rise before earliest of earliest (latest) RDWL receivers
  • ENR: Earliest among nearests
  • EFR: Earliest among farthests
  • Fall transition: latest (earliest) L0PCH falls after latest of latest (earliest) RDWL receivers
  • LNF: Latest among nearests
  • LFF: Latest among farthests

[Figure: memcell[0] to memcell[7] on Level0 Bitline[0] and Level0 Bitline[1] (rdbll0), driven by RDWL[0] to RDWL[7], with the Level0 precharge clock on the left. The timing diagram marks the earliest and latest rise and fall of L0PCH and RDWL, labeled ENR, EFR, LNF and LFF for each signal.]

slide-11
SLIDE 11

Additional Constraints for RF Clock Tree Sizing

  • Absolute required arrival time at a specific entry
  • Relative timing constraints for races between
  • L0PCH and RDWL
  • L1PCH and RDBL
  • RDWL and WRWL
  • Approach
  • Size one RDWL to the absolute arrival time
  • Let the rest of the RDWLs fall as they may
  • Size L0PCH, L1PCH, WRWL for relative timing constraints
slide-12
SLIDE 12

Agenda

  • Introduction
  • Clock Trees for Register Files
  • Sizing Algorithm
  • Experimental Results & Conclusion
slide-13
SLIDE 13

Overall Algorithm

  • Two stage approach
  • Stage 1: Size RDWLs and L0PCH
  • Relative timing constraints for L0PCHs will be met
  • Run timing analysis to find arrival times for RDBLs and RDWLs
  • Generate constraints for absolute arrival time requirements for L1PCHs and WRWLs
  • Stage 2: Size L1PCH and WRWLs considering those constraints
  • Iterate to account for perturbations of RDWL timing during Stage 2
  • Dynamic programming for clock tree sizing considering races
  • Specifically, will show how bottom-up solution generation is carried out
  • Most critical step
  • Also, dominant one from time complexity perspective
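
The two-stage flow above can be outlined as a loop. This is only a sketch of the control flow: `stage1`, `run_timing` and `stage2` are hypothetical callables standing in for the sizing and timing-analysis steps, and stopping when the sizes no longer change is an assumed convergence test, not something the slides specify.

```python
def two_stage_sizing(stage1, run_timing, stage2, max_iters=3):
    """Alternate Stage 1 and Stage 2 until the chosen sizes stop changing.

    All three callables are hypothetical placeholders:
      stage1(sizes)              -> sizes for the RDWL and L0PCH paths
      run_timing(sizes)          -> arrival-time constraints for L1PCH/WRWL
      stage2(sizes, constraints) -> sizes for the L1PCH and WRWL paths
    """
    sizes = {}
    for iteration in range(1, max_iters + 1):
        new_sizes = dict(sizes)
        new_sizes.update(stage1(new_sizes))
        constraints = run_timing(new_sizes)
        new_sizes.update(stage2(new_sizes, constraints))
        if new_sizes == sizes:  # assumed convergence test
            return sizes, iteration
        sizes = new_sizes
    return sizes, max_iters


# Toy stand-ins, only to exercise the loop.
result, iterations = two_stage_sizing(
    stage1=lambda sizes: {"BR1": 3, "BP01": 2},
    run_timing=lambda sizes: {"L1PCH_rat": 10},
    stage2=lambda sizes, constraints: {"BP11": 1},
)
```

With these constant stand-ins the second pass reproduces the first, so the loop stops after two iterations.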
slide-14
SLIDE 14

Example

  • Assume
    – Each clock buffer has 3 strengths
      • Power = 3 unit, delay = 22 unit
      • Power = 2 unit, delay = 24 unit
      • Power = 1 unit, delay = 26 unit
  • Generate solutions in reverse topological order
  • A solution associated with a clock buffer is characterized by a 10-element vector (EFR, LFF, ENR, LNF, RR_spc, RF_spc, rise-cell-delay, fall-cell-delay, cell-power, total-power)

[Figure: clock tree topology from SLIDE 9, repeated]

slide-15
SLIDE 15

Bottom-up Solution Generation

  • Assume
    – Required arrival time at the specified entry is 27 unit
    – L0PCH has to rise/fall before/after RDWL by ≥5 unit
  • Initialize the solutions at leaves, i.e., the receivers of BR1[0:7] and BP01, to (27, 27, 27, 27, 27, 27, 0, 0, 0, 0)
  • Create solutions in the following order:
    – BR1, GR1, BP01, BR2, GR2, RCB
    – BP12, OP11, AP11, BP13, BP14, RCB

[Figure: clock cells BR1, GR1, BP01, BR2, GR2, BP11, BP12, OP11, AP11, BP13, BP14, RCB]
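
As a minimal sketch, the 10-element solution vector and the leaf initialization from the example could be represented as below. The class name, field names and constant names are assumptions made for illustration; only the field order, the three buffer strengths and the value 27 come from the slides.

```python
from typing import NamedTuple


class Solution(NamedTuple):
    """One candidate solution at a clock buffer (field names are assumed)."""
    efr: int          # earliest among farthests
    lff: int          # latest among farthests
    enr: int          # earliest among nearests
    lnf: int          # latest among nearests
    rr_spc: int       # required rise time for the specified entry
    rf_spc: int       # required fall time for the specified entry
    rise_delay: int   # rise cell delay of the chosen size
    fall_delay: int   # fall cell delay of the chosen size
    cell_power: int   # power of the chosen size
    total_power: int  # accumulated power of the subtree


# (power, delay) for the three strengths in the example.
BUFFER_SIZES = ((3, 22), (2, 24), (1, 26))

# Required arrival time at the specified entry, in the example's units.
REQUIRED_ARRIVAL = 27

# Leaf solution: all six timing fields start at the required arrival time.
leaf = Solution(*([REQUIRED_ARRIVAL] * 6), 0, 0, 0, 0)
```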

slide-16
SLIDE 16

Bottom-up Solution Generation (CNTD.)

  • 3 steps:
    – Combine solutions at the output
    – Consider sizes for the current clock buffer to create solutions
    – Prune inferior solutions

slide-17
SLIDE 17

Bottom-up Solution Generation: Combining Solutions at BR1

  • Assume max and min wire-delay to be 6 and 2 unit
  • Combine solutions at the output of BR1
    – (27-6, 27-6, 27-2, 27-2, 27-6, 27-6, 0, 0, 0, 0) = (21, 21, 25, 25, 21, 21, 0, 0, 0, 0)
    – For 2nd and 3rd iteration, RDWL[i] may have different wire-delays, so use the appropriate ones for each BR1[i]

[Figure: clock tree topology from SLIDE 9, repeated]
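
The combine step at the output of BR1 amounts to subtracting wire delays from the leaf vector. A minimal sketch follows; the function name and the `spec_wire` parameter are assumptions, and the value 6 for the spec fields simply mirrors the arithmetic on the slide.

```python
def subtract_wire_delay(sol, max_wire, min_wire, spec_wire):
    """Move required times from the receivers back to the driver output.

    EFR and LFF (farthest receivers) use the maximum wire delay, ENR and LNF
    (nearest receivers) use the minimum, and RR_spc and RF_spc use the wire
    delay assumed for the specified entry. Delay and power fields pass through.
    """
    efr, lff, enr, lnf, rr_spc, rf_spc, *rest = sol
    return (efr - max_wire, lff - max_wire,
            enr - min_wire, lnf - min_wire,
            rr_spc - spec_wire, rf_spc - spec_wire, *rest)


leaf = (27, 27, 27, 27, 27, 27, 0, 0, 0, 0)
at_br1_output = subtract_wire_delay(leaf, max_wire=6, min_wire=2, spec_wire=6)
# at_br1_output == (21, 21, 25, 25, 21, 21, 0, 0, 0, 0)
```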

slide-18
SLIDE 18

Bottom-up Solution Generation: Considering Choices for BR1

  • For 1st iteration, assume slope; use actual values for later ones
  • 3 solutions corresponding to 3 choices
    – Solution 1: (21-22, 21-22, 25-22, 25-22, 21-22, 21-22, 22, 22, 3, 3) = (-1, -1, 3, 3, -1, -1, 22, 22, 3, 3)
    – Solution 2: (-3, -3, 1, 1, -3, -3, 24, 24, 2, 2)
    – Solution 3: (-5, -5, -1, -1, -5, -5, 26, 26, 1, 1)

[Figure: clock tree topology from SLIDE 9, repeated]
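
Considering the three size choices for BR1 can be sketched as follows. The helper name `apply_size` is an assumption; the (power, delay) pairs and the input vector are the ones used in the example.

```python
# (power, delay) for the three buffer strengths in the example.
SIZES = [(3, 22), (2, 24), (1, 26)]


def apply_size(sol, power, delay):
    """Create the solution at a buffer's input for one size choice.

    The cell delay is subtracted from all six timing fields, the delay and
    power of the chosen size are recorded, and the size's power is added to
    the total power accumulated so far.
    """
    required_times = sol[:6]
    total_power = sol[9]
    return tuple(t - delay for t in required_times) + (
        delay, delay, power, total_power + power)


at_br1_output = (21, 21, 25, 25, 21, 21, 0, 0, 0, 0)
br1_solutions = [apply_size(at_br1_output, p, d) for p, d in SIZES]
# br1_solutions[0] == (-1, -1, 3, 3, -1, -1, 22, 22, 3, 3)
```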

slide-19
SLIDE 19

Bottom-up Solution Generation: Pruning at BR1

  • No solution with same RR_spc, RF_spc, but higher power
  • So, keep all 3 solutions
    – Solution 1: (21-22, 21-22, 25-22, 25-22, 21-22, 21-22, 22, 22, 3, 3) = (-1, -1, 3, 3, -1, -1, 22, 22, 3, 3)
    – Solution 2: (-3, -3, 1, 1, -3, -3, 24, 24, 2, 2)
    – Solution 3: (-5, -5, -1, -1, -5, -5, 26, 26, 1, 1)

[Figure: clock tree topology from SLIDE 9, repeated]

slide-20
SLIDE 20

Bottom-up Solution Generation: Combining at GR1

  • Assuming max and min wire-delay to be 6 and 2 unit, and a wire-delay of 4 unit for the specific entry, combining gives the following solutions:
    – Solution 1: (-1-6, -1-6, 3-2, 3-2, -1-4, -1-4, 22, 22, 3, 3*4) = (-7, -7, 1, 1, -5, -5, 22, 22, 3, 12)
    – Solution 2: (-3-6, -3-6, 1-2, 1-2, -3-4, -3-4, 24, 24, 2, 2*4) = (-9, -9, -1, -1, -7, -7, 24, 24, 2, 8)
    – Solution 3: (-5-6, -5-6, -1-2, -1-2, -5-4, -5-4, 26, 26, 1, 1*4) = (-11, -11, -3, -3, -9, -9, 26, 26, 1, 4)

[Figure: clock tree topology from SLIDE 9, repeated]

slide-21
SLIDE 21

Bottom-up Solution Generation: Considering Choices for GR1

  • 3 solutions at the output, (-7, -7, 1, 1, -5, -5, 22, 22, 12, 12), (-9, -9, -1, -1, -7, -7, 24, 24, 8, 8), (-11, -11, -3, -3, -9, -9, 26, 26, 4, 4), and 3 choices for GR1 (Power = 3, 2, 1 unit) result in 9 solutions as follows:
  • Solution 1 (Power=3 unit): (-7-22, -7-22, 1-22, 1-22, -5-22, -5-22, 22, 22, 3, 12+3) = (-29, -29, -21, -21, -27, -27, 22, 22, 3, 15)
  • Solution 2 (Power=3 unit): (-31, -31, -23, -23, -29, -29, 22, 22, 3, 11)
  • Solution 3 (Power=3 unit): (-33, -33, -25, -25, -31, -31, 22, 22, 3, 7)
  • Solution 4 (Power=2 unit): (-7-24, -7-24, 1-24, 1-24, -5-24, -5-24, 24, 24, 2, 12+2) = (-31, -31, -23, -23, -29, -29, 24, 24, 2, 14)
  • Solution 5 (Power=2 unit): (-33, -33, -25, -25, -31, -31, 24, 24, 2, 10)
  • Solution 6 (Power=2 unit): (-35, -35, -27, -27, -33, -33, 24, 24, 2, 6)
  • Solution 7 (Power=1 unit): (-7-26, -7-26, 1-26, 1-26, -5-26, -5-26, 26, 26, 1, 12+1) = (-33, -33, -25, -25, -31, -31, 26, 26, 1, 13)
  • Solution 8 (Power=1 unit): (-35, -35, -27, -27, -33, -33, 26, 26, 1, 9)
  • Solution 9 (Power=1 unit): (-37, -37, -29, -29, -35, -35, 26, 26, 1, 5)


slide-22
SLIDE 22

Bottom-up Solution Generation: Considering Choices for GR1 (CNTD.)

  • Pruning the solutions with same RR_spc, RF_spc but higher power results in the following:
  • Solution 1 (Power=3 unit): (-29, -29, -21, -21, -27, -27, 22, 22, 3, 15)
  • Solution 2 (Power=3 unit): (-31, -31, -23, -23, -29, -29, 22, 22, 3, 11)
  • Solution 3 (Power=3 unit): (-33, -33, -25, -25, -31, -31, 22, 22, 3, 7)
  • Solution 6 (Power=2 unit): (-35, -35, -27, -27, -33, -33, 24, 24, 2, 6)
  • Solution 9 (Power=1 unit): (-37, -37, -29, -29, -35, -35, 26, 26, 1, 5)
  • Note how optimum power solutions are preserved
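
The pruning rule stated above (drop any solution that has the same RR_spc and RF_spc as another but a higher total power) can be sketched as below. Function and variable names are assumptions; the input vectors are the three combined solutions at the output of GR1 from the example, and the sketch implements only the rule as worded on the slide.

```python
SIZES = [(3, 22), (2, 24), (1, 26)]  # (power, delay) from the example

gr1_outputs = [
    (-7, -7, 1, 1, -5, -5, 22, 22, 3, 12),
    (-9, -9, -1, -1, -7, -7, 24, 24, 2, 8),
    (-11, -11, -3, -3, -9, -9, 26, 26, 1, 4),
]


def apply_size(sol, power, delay):
    """Subtract the cell delay from the timing fields and add the cell power."""
    return tuple(t - delay for t in sol[:6]) + (
        delay, delay, power, sol[9] + power)


def prune(solutions):
    """Keep the lowest total-power solution for each (RR_spc, RF_spc) pair."""
    best = {}
    for sol in solutions:
        key = (sol[4], sol[5])
        if key not in best or sol[9] < best[key][9]:
            best[key] = sol
    return list(best.values())


# Solutions 1 to 9, in the same order as listed for GR1.
candidates = [apply_size(out, p, d) for p, d in SIZES for out in gr1_outputs]
kept = prune(candidates)
# Total powers of the survivors: [15, 11, 7, 6, 5] (Solutions 1, 2, 3, 6, 9).
```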

slide-23
SLIDE 23

Bottom-up Solution Generation: Creating Solutions at BP01

  • Assume max and min wire-delay to be 6 and 2 unit
  • Combine solutions at the output of BP01
    – (27-6, 27-6, 27-2, 27-2, 27-6, 27-6, 0, 0, 0, 0) = (21, 21, 25, 25, 21, 21, 0, 0, 0, 0)
  • BP01 is skewed, so let's assume fall is 18 unit slower and that it has 3 sizes with power and rise/fall delays as follows
    • 3 unit, 44/62 unit
    • 2 unit, 48/66 unit
    • 1 unit, 52/70 unit

[Figure: clock tree topology from SLIDE 9, repeated]
slide-24
SLIDE 24

Bottom-up Solution Generation: Creating Solutions at BP01

  • 3 Solutions at BP01 as follows:
  • Solution 1: (21-44, 21-62, 25-44, 25-62, 21-44, 21-62, 44, 62, 3, 3) = (-23, -41, -19, -37, -23, -41, 44, 62, 3, 3)
  • Solution 2: (21-48, 21-66, 25-48, 25-66, 21-48, 21-66, 48, 66, 2, 2) = (-27, -45, -23, -41, -27, -45, 48, 66, 2, 2)
  • Solution 3: (21-52, 21-70, 25-52, 25-70, 21-52, 21-70, 52, 70, 1, 1) = (-31, -49, -27, -45, -31, -49, 52, 70, 1, 1)
  • Pruning is trivial, so all the solutions are preserved in this case

[Figure: clock tree topology from SLIDE 9, repeated]
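
For a skewed cell such as BP01, the rise-related fields (EFR, ENR, RR_spc) take the rise delay and the fall-related fields (LFF, LNF, RF_spc) take the fall delay. A minimal sketch with assumed names, using the BP01 numbers from the example:

```python
# (power, rise delay, fall delay) for the three BP01 sizes in the example.
BP01_SIZES = [(3, 44, 62), (2, 48, 66), (1, 52, 70)]


def apply_skewed_size(sol, power, rise, fall):
    """Create the solution at the input of a cell with unequal rise/fall delay."""
    efr, lff, enr, lnf, rr_spc, rf_spc, _, _, _, total_power = sol
    return (efr - rise, lff - fall,       # farthest receivers: rise, fall
            enr - rise, lnf - fall,       # nearest receivers: rise, fall
            rr_spc - rise, rf_spc - fall,
            rise, fall, power, total_power + power)


at_bp01_output = (21, 21, 25, 25, 21, 21, 0, 0, 0, 0)
bp01_solutions = [apply_skewed_size(at_bp01_output, *size) for size in BP01_SIZES]
# bp01_solutions[0] == (-23, -41, -19, -37, -23, -41, 44, 62, 3, 3)
```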

slide-25
SLIDE 25

Bottom-up Solution Generation: Combining Solutions at fanout of BR2

  • Update the required arrival times for solutions from BP01 and GR1
  • Assume wire-delay from BR2 to BP01 to be 4 unit. So, the solutions from L0PCH path are:
    • Sol. 1: (-27, -45, -23, -41, -27, -45, 22, 31, 3, 3)
    • Sol. 2: (-31, -49, -27, -45, -31, -49, 24, 33, 2, 2)
    • Sol. 3: (-35, -53, -31, -49, -35, -53, 26, 35, 1, 1)
  • Assume wire-delay from BR2 to GR1 to be 6 unit. So, the solutions from RDWL path are:
    • Sol. 1: (-35, -35, -27, -27, -33, -33, 22, 22, 3, 15)
    • Sol. 2: (-37, -37, -29, -29, -35, -35, 22, 22, 3, 11)
    • Sol. 3: (-39, -39, -31, -31, -37, -37, 22, 22, 3, 7)
    • Sol. 6: (-41, -41, -33, -33, -39, -39, 24, 24, 2, 6)
    • Sol. 9: (-43, -43, -35, -35, -41, -41, 26, 26, 1, 5)

[Figure: clock tree topology from SLIDE 9, repeated]

slide-26
SLIDE 26

Bottom-up Solution Generation: Combining Solutions at fanout of BR2

  • Combine solutions from L0PCH path with those from RDWL path iff (∆ = 5 units)
    • Rise transition: EFR_L0PCH ≥ EFR_RDWL + 5 && ENR_L0PCH ≥ ENR_RDWL + 5
    • Fall transition: LFF_L0PCH ≤ LFF_RDWL - 5 && LNF_L0PCH ≤ LNF_RDWL - 5
  • 15 possible solutions:
    • Combination: Sol. 1 (L0PCH) && Sol. 1 (RDWL) invalid (ENR_L0PCH ≥ ENR_RDWL + 5 false)
    • Combination: Sol. 1 (L0PCH) && Sol. 2 (RDWL) valid
    • Combination: Sol. 1 (L0PCH) && Sol. 3 (RDWL) valid
    • Combination: Sol. 1 (L0PCH) && Sol. 6 (RDWL) invalid (LFF_L0PCH ≤ LFF_RDWL - 5 false)
    • Combination: Sol. 1 (L0PCH) && Sol. 9 (RDWL) invalid (LFF_L0PCH ≤ LFF_RDWL - 5 false)
    • Combination: Sol. 2 (L0PCH) && Sol. 1 (RDWL) invalid (EFR_L0PCH ≥ EFR_RDWL + 5 false)
    • Combination: Sol. 2 (L0PCH) && Sol. 2 (RDWL) invalid (ENR_L0PCH ≥ ENR_RDWL + 5 false)
    • Combination: Sol. 2 (L0PCH) && Sol. 3 (RDWL) invalid (ENR_L0PCH ≥ ENR_RDWL + 5 false)
    • Combination: Sol. 2 (L0PCH) && Sol. 6 (RDWL) valid
    • Combination: Sol. 2 (L0PCH) && Sol. 9 (RDWL) valid
    • Sol. 3 (L0PCH) cannot be combined with any solution from RDWL, since ENR_L0PCH ≥ ENR_RDWL + 5 is false for all of them
  • Out of 15, only 4 solutions are valid (that's why the problem is more difficult than the one without relative timing constraints)
  • Also, the combinations of GR1, BR1, and BP01 should be considered simultaneously (an LR-like algorithm is at a disadvantage)
  • For valid combinations, add total powers and keep RATs due to the solutions from RDWL path
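
The validity check above can be sketched directly from the two conditions. The function and variable names are assumptions; each tuple holds only the first four fields (EFR, LFF, ENR, LNF) of the L0PCH-path and RDWL-path solutions at the fanout of BR2.

```python
DELTA = 5  # required separation between L0PCH and RDWL, in the example's units

# (EFR, LFF, ENR, LNF) per solution number.
l0pch = {
    1: (-27, -45, -23, -41),
    2: (-31, -49, -27, -45),
    3: (-35, -53, -31, -49),
}
rdwl = {
    1: (-35, -35, -27, -27),
    2: (-37, -37, -29, -29),
    3: (-39, -39, -31, -31),
    6: (-41, -41, -33, -33),
    9: (-43, -43, -35, -35),
}


def race_free(pch, wl, delta=DELTA):
    """Return True if an L0PCH solution may be combined with an RDWL solution."""
    p_efr, p_lff, p_enr, p_lnf = pch
    w_efr, w_lff, w_enr, w_lnf = wl
    rise_ok = p_efr >= w_efr + delta and p_enr >= w_enr + delta
    fall_ok = p_lff <= w_lff - delta and p_lnf <= w_lnf - delta
    return rise_ok and fall_ok


valid = [(i, j) for i, p in l0pch.items() for j, w in rdwl.items()
         if race_free(p, w)]
# valid == [(1, 2), (1, 3), (2, 6), (2, 9)]: 4 of the 15 pairs
```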
slide-27
SLIDE 27

Bottom-up Solution Generation: Combining Solutions at fanout of BR2

  • 4 Valid combinations:
    • Sol. 2: (-37, -37, -29, -29, -35, -35, 22, 22, 3, 11+3)
    • Sol. 3: (-39, -39, -31, -31, -37, -37, 22, 22, 3, 7+3)
    • Sol. 6: (-41, -41, -33, -33, -39, -39, 24, 24, 2, 6+2)
    • Sol. 9: (-43, -43, -35, -35, -41, -41, 26, 26, 1, 5+2)
  • Consider 3 sizes for BR2 to create 12 possible solutions and then prune, and continue sizing GR2, RCB
  • At the root, choose the solution whose RAT is the closest to that of grid clock arrival
  • Next, perform timing analysis and apply the sizing for L1PCH path with absolute arrival time requirement, and by sizing RCB to the same delay from L0PCH/RDWL sizing iteration (Why?)
  • In the next iteration, size L0PCH and RDWL paths…
  • Typically, in 3 iterations sizes converge or do not change much
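
Merging the four valid combinations and then considering sizes for BR2 can be sketched as follows. Names are assumptions, BR2 is assumed to have the same three (power, delay) choices as the other buffers in the example, and no pruning result is shown because the slides do not give one.

```python
SIZES = [(3, 22), (2, 24), (1, 26)]  # (power, delay), assumed to apply to BR2

# RDWL-path solutions that survive the race check, keyed by solution number.
rdwl = {
    2: (-37, -37, -29, -29, -35, -35, 22, 22, 3, 11),
    3: (-39, -39, -31, -31, -37, -37, 22, 22, 3, 7),
    6: (-41, -41, -33, -33, -39, -39, 24, 24, 2, 6),
    9: (-43, -43, -35, -35, -41, -41, 26, 26, 1, 5),
}
# Total power of the L0PCH-path solution paired with each RDWL solution.
l0pch_power = {2: 3, 3: 3, 6: 2, 9: 2}


def merge(rdwl_sol, other_power):
    """Keep the RDWL-path RATs and add the total powers of both branches."""
    return rdwl_sol[:9] + (rdwl_sol[9] + other_power,)


def apply_size(sol, power, delay):
    """Subtract the cell delay from the timing fields and add the cell power."""
    return tuple(t - delay for t in sol[:6]) + (
        delay, delay, power, sol[9] + power)


merged = [merge(rdwl[k], l0pch_power[k]) for k in rdwl]
br2_candidates = [apply_size(m, p, d) for p, d in SIZES for m in merged]
# merged total powers: [14, 10, 8, 7]; len(br2_candidates) == 12
```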
slide-28
SLIDE 28

Time Complexity & Limitations

  • Time complexity: O(NB^2), N = number of clock buffers in the tree; B = maximum number of size choices for a clock buffer
  • Since only part of the clock tree is sized (remaining instantiations get the same sizes), runtimes are fast, in minutes, excluding timing analysis/extraction, etc.
  • Limitation:
  • Inaccuracies in delays due to actual slopes; these typically diminish in 3 iterations
slide-29
SLIDE 29

Agenda

  • Introduction
  • Clock Trees for Register Files
  • Sizing Algorithm
  • Experimental Results & Conclusion
slide-30
SLIDE 30

Experimental Results

  • Algorithm implemented on top of internal timer
  • Employed for clock tree sizing of ~100 register file blocks in 22 nm
  • Runs in minutes on these blocks
  • Saves ~22% of clock power over traditional sizing
slide-31
SLIDE 31

Conclusion

  • Algorithm results in power savings and productivity improvement
  • Has been proliferated for use in advanced technologies
  • Has also been incorporated in proprietary RF automation flow
  • Resulted in wide use across microprocessors, from low power to high performance applications
  • Applicable to sizing for races in general, as long as those can be captured using timing constraints

slide-32
SLIDE 32

Acknowledgments

  • Authors would like to thank several of their colleagues for review and comments

slide-33
SLIDE 33

Questions