In Search of Lost Time Andrew B. Kahng UCSD CSE and ECE Departments - - PowerPoint PPT Presentation



slide-1
SLIDE 1

1

  • A. B. Kahng, TAU 2016

In Search of Lost Time

Andrew B. Kahng UCSD CSE and ECE Departments abk@ucsd.edu http://vlsicad.ucsd.edu TAU-2016 Friday keynote, Santa Rosa

slide-2
SLIDE 2

2

  • A. B. Kahng, TAU 2016

In Search of Lost Time

slide-3
SLIDE 3

3

  • A. B. Kahng, TAU 2016

What is Time? How do we lose Time? How do we regain Time?

slide-4
SLIDE 4

4

  • A. B. Kahng, TAU 2016

What is Time?

slide-5
SLIDE 5

5

  • A. B. Kahng, TAU 2016

What is Time?

  • Time = Schedule
      • Moore’s Law: 1% = 1 week
  • Time = Things convertible to time
      • mV, σ, µW, nm, $, µm²
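One way to read the "1% = 1 week" rule of thumb is as the compound weekly gain implied by Moore's Law. The sketch below assumes a 2× improvement every 18 or 24 months; that doubling period is an assumption for illustration, not something the slide states.

```python
def weekly_gain(doubling_months: float) -> float:
    """Compound per-week improvement implied by a 2x gain every `doubling_months`."""
    weeks = doubling_months * 52.0 / 12.0  # length of the doubling period in weeks
    return 2.0 ** (1.0 / weeks) - 1.0


if __name__ == "__main__":
    for months in (18, 24):  # assumed doubling periods
        print(f"2x every {months} months -> {100 * weekly_gain(months):.2f}% per week")
```

Under either assumption the weekly gain comes out somewhat below 1%, so a 1% quality improvement is worth roughly one week of schedule on this reading.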

Margin

Time

Product Quality Model and Analysis Accuracy

nm, mV, {skew, jitter, OCV…} power, area, fmax, Iddq,… rms, %, σ

slide-6
SLIDE 6

6

  • A. B. Kahng, TAU 2016

What is Time?

  • Time = Schedule
      • Moore’s Law: 1% = 1 week
  • Time = Things convertible to time
      • mV, σ, µW, nm, $, µm²
  • Time = time itself
      • Flavors: slack, trans, xd, d-trans, …

slide-7
SLIDE 7

7

  • A. B. Kahng, TAU 2016

What is Time?

  • Time = Schedule
      • Moore’s Law: 1% = 1 week
  • Time = Things convertible to time
      • mV, σ, µW, nm, $, µm²
  • Time = time itself
      • Flavors: slack, trans, xd, d-trans, …

Time = Money

slide-8
SLIDE 8

8

  • A. B. Kahng, TAU 2016

What is Time? How do we lose Time?

slide-9
SLIDE 9

9

  • A. B. Kahng, TAU 2016

How Do We Lose Time?

  • It’s tough not to …
slide-10
SLIDE 10

10

  • A. B. Kahng, TAU 2016

Context I: Race to End of Roadmap

  • Paper model to v1.0 SPICE model: ~12 months @N10
  • Many near-term “red bricks”: ArF, Cu, low-k, …
  • Foundry-fabless dynamics: who gives up margin ?
  • Time constants limit design-manufacturing co-evolution

Time constants:

  • (Years) Tech development, app market definition, architecture/front-end design
  • (Months) RTL-to-GDS implementation, reliability qualification
  • (Weeks) Fab latency, cycles of yield learning, design re-spins, mask flows
  • (Days) Process tweaks, design ECOs

Mismatches among these time constants:

  • Model-hardware miscorrelation
  • Model guardbanding
  • Faster node enablement is challenging !!

slide-11
SLIDE 11

11

  • A. B. Kahng, TAU 2016

Context II: Low-Power Grand Challenge

Low power = High complexity

multiple supply voltages, power and clock gating, DVFS, MTCMOS, multi-Lgate, …

Increased timing closure burden

Mobility Big data Green datacenters Cloud Internet of Things

slide-12
SLIDE 12

12

  • A. B. Kahng, TAU 2016

How Do We Lose Time?

  • It’s tough not to …
  • The margining imperative …
slide-13
SLIDE 13

13

  • A. B. Kahng, TAU 2016

Nobody Wants to Own the Scrap

  • Timing model not 100% accurate
  • Add margin to cover unknowns
slide-14
SLIDE 14

14

  • A. B. Kahng, TAU 2016

Stacks of Margins

Design margin = stack of layers of conservatism

[Figure (source: Wu 08): performance PDF with stacked signoff margins for process, temperature, voltage and reliability; labels: nominal Vdd, static IR drop, power grid IR gradient, dynamic IR, HCI/NBTI]

slide-15
SLIDE 15

15

  • A. B. Kahng, TAU 2016

Consequences

  • Diminishing ROI from next node
  • Typical: Moore’s Law-like scaling
  • Worst-case: scales, but worse ROI
  • Signoff with excessive margin: potential gain wiped out
slide-16
SLIDE 16

16

  • A. B. Kahng, TAU 2016

Time: Lose Some, Win Some

[Figure: timeline of timing and signoff concerns by technology node (90nm, 65nm, 45/40nm, 28nm, 20nm, 16/14nm, 10nm, ≤7nm). Labels: BTI, temp inversion, noise, MCMM, maxtrans, EM, AOCV / POCV, PBA, fixed-margin spec, multi-patterning, cell-POCV, MOL and BEOL R ↑, dynamic IR, fill effects, layout rules, BEOL and MOL variations, signoff criteria with AVS, SOC complexity, LVF, MIS, phys-aware timing ECO, min implant]

slide-17
SLIDE 17

17

  • A. B. Kahng, TAU 2016

How Do We Lose Time?

  • It’s tough not to …
  • The margining imperative …
  • We give it away
  • Intentionally

c2q-setup-hold surface

slide-18
SLIDE 18

18

  • A. B. Kahng, TAU 2016

How Do We Lose Time?

  • It’s tough not to …
  • The margining imperative …
  • We give it away
  • Intentionally

Homogeneous BEOL corners (e.g., Cworst)

[Figure: interconnect stack with M1 and M2; capacitance (C) distributions of Layer M1 and Layer M2 with their 3σ points; the homogeneous Cw corner and its pessimism]

slide-19
SLIDE 19

19

  • A. B. Kahng, TAU 2016

How Do We Lose Time?

  • It’s tough not to …
  • The margining imperative …
  • We give it away
  • Intentionally
  • By miscorrelating

[Figure: scatter plot of T2 path slack (ns) vs. T1 path slack (ns); 123 ps divergence]

slide-20
SLIDE 20

20

  • A. B. Kahng, TAU 2016

How Do We Lose Time?

  • It’s tough not to …
  • The margining imperative …
  • We give it away
  • Intentionally
  • By miscorrelating
  • By wasting it

2013 Contest NDA: Without [EDA vendor’s] prior approval, I shall not write or publish any article or presentation that references [EDA vendor’s tool name].

slide-21
SLIDE 21

21

  • A. B. Kahng, TAU 2016

How Do We Lose Time?

  • It’s tough not to …
  • The margining imperative …
  • We give it away
  • Intentionally
  • By miscorrelating
  • By wasting it

“We don’t have enough time to do it right, but we have enough time to do it wrong”

slide-22
SLIDE 22

22

  • A. B. Kahng, TAU 2016

Not Enough Time To Do It Right…

  • Option #1: go with latest available technology = 0.01 AU/year speed
  • Option #2: spend the next ten years to come up with a spaceship = 0.1 AU/year speed

[Figure: timeline, Year: 2016, 2026, 2027, 2031]

  • Earth to Mars
      • Option #1 = 0.5 / 0.01 = 50 years
      • Option #2 = 0.5 / 0.1 + 10 years = 15 years (B << A)
  • Issue: investment for the long haul

  Option #1            Option #2
  Corner-based STA     Statistical STA
  Planar               3D
  Homogeneous CMOS     Heterogeneous CMOS

Need a faster ship
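
The Earth-to-Mars arithmetic can be checked in a few lines. The 0.5 AU distance, the two speeds, the ten-year development time and the 2016 start are taken from the slide; reading the 2027 tick on the timeline as the year in which Option #2 overtakes Option #1 is an interpretation, not something the slide states.

```python
START_YEAR = 2016
DISTANCE_AU = 0.5


def arrival_year(speed_au_per_year: float, development_years: float = 0.0) -> float:
    """Year of arrival when travel starts after `development_years` of preparation."""
    return START_YEAR + development_years + DISTANCE_AU / speed_au_per_year


def overtake_year(slow: float, fast: float, development_years: float) -> float:
    """Year in which the late, fast ship has covered as much distance as the early, slow one.

    Solves slow * t = fast * (t - development_years) for t (years after START_YEAR).
    """
    t = development_years * fast / (fast - slow)
    return START_YEAR + t


if __name__ == "__main__":
    print("Option #1 arrives:", arrival_year(0.01))
    print("Option #2 arrives:", arrival_year(0.1, development_years=10))
    print("Option #2 overtakes Option #1 in:", overtake_year(0.01, 0.1, 10))
```

The ten-year investment pays for itself about one year after launch, which is the point of "Need a faster ship".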

slide-23
SLIDE 23

23

  • A. B. Kahng, TAU 2016

What is Time? How do we lose Time? How do we regain Time?

slide-24
SLIDE 24

24

  • A. B. Kahng, TAU 2016

How Do We Regain Time?

  • Learn !!! (machine learning, Big Data mindset)
slide-25
SLIDE 25

25

  • A. B. Kahng, TAU 2016

Timer Miscorrelation

  • T1 and T2: commercial signoff STA tools with same inputs (.v, .spef, .lib)
  • 123ps slack divergence → 20% performance difference ≈ one node of Moore’s Law scaling

[Figure: scatter plot of T2 path slack (ns) vs. T1 path slack (ns); 123 ps divergence]

[DATE14]

slide-26
SLIDE 26

26

  • A. B. Kahng, TAU 2016

Erase Miscorrelation with Machine Learning!

Can also erase P&R vs. signoff STA miscorrelation

[Figure: modeling flow. ONE-TIME: artificial circuits; train, validate, test; MODELS (path slack, setup time, stage, cell, wire delays). INCREMENTAL: real designs, new designs; if error > threshold, outliers (data points)]

[Figure: scatter plots of T2 path slack (ns) vs. T1 path slack (ns). BEFORE: 123 ps divergence. AFTER ML modeling: 31 ps (~4× reduction)]

[DATE14]

slide-27
SLIDE 27

27

  • A. B. Kahng, TAU 2016

Harder: Non-SI to SI Calibration

  • Complex interplay of electrical, logic structure, and layout parameters
  • Black-box code in STA tools
  • Slack diverges by 81ps (clock period = 1.0ns)
      • ~4 stages of logic at 28nm FDSOI

[Figure: calibration flow. Inputs: .v, .db, .lib, .spef, .sdc, post-P&R database; outputs: non-SI timing reports and SI timing reports. Calibration: recipe to convert non-SI timing report to SI timing report]

[Figure: scatter plot of SI path slack (ns) ($$$) vs. non-SI path slack (ns) ($); 81ps divergence] [SLIP15]

slide-28
SLIDE 28

28

  • A. B. Kahng, TAU 2016

“SI for Free” with Machine Learning

  • Machine learning of incremental transition time, delay due to SI
  • Accurate SI-aware path delays, slacks

[Figure: modeling flow. Timing reports in SI mode and in non-SI mode; create training, validation and testing sets; ANN (2 hidden layers, 5-fold cross-validation); SVM (RBF kernel, 5-fold cross-validation); HSM (weighted predictions from ANN and SVM); save model and exit]

[Figure: predicted path delay (ps) vs. actual path delay (ps); worst absolute error = 8.2ps, average absolute error = 1.7ps]

[Figure: SI path slack (ns) ($$$) vs. non-SI path slack (ns) ($), BEFORE (81ps divergence) and AFTER ML modeling]

[SLIP15]
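
The slide describes the HSM only as "weighted predictions from ANN and SVM" and does not say how the weights are chosen. As a minimal sketch, assuming a single blend weight fitted by least squares on a validation set (an assumption for illustration, as are all the numbers below), the combination step could look like this:

```python
def blend_weight(actual, ann_pred, svm_pred):
    """Least-squares weight w for w*ann + (1-w)*svm, clamped to [0, 1]."""
    num = sum((y - s) * (a - s) for y, a, s in zip(actual, ann_pred, svm_pred))
    den = sum((a - s) ** 2 for a, s in zip(ann_pred, svm_pred))
    if den == 0.0:
        return 0.5  # both models agree everywhere, so any weight gives the same blend
    return min(1.0, max(0.0, num / den))


def blend(w, ann_pred, svm_pred):
    """Weighted combination of the two model outputs."""
    return [w * a + (1.0 - w) * s for a, s in zip(ann_pred, svm_pred)]


# Made-up validation data (path delays in ps), not results from the talk.
actual = [100.0, 200.0, 300.0]
ann_pred = [104.0, 204.0, 304.0]  # hypothetical ANN output, 4 ps high
svm_pred = [98.0, 198.0, 298.0]   # hypothetical SVM output, 2 ps low

w = blend_weight(actual, ann_pred, svm_pred)
hsm_pred = blend(w, ann_pred, svm_pred)

if __name__ == "__main__":
    print("blend weight:", w)
    print("blended predictions:", hsm_pred)
```

In this toy case the two models err in opposite directions, so the blend cancels both errors; real models would only partly do so.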

slide-29
SLIDE 29

29

  • A. B. Kahng, TAU 2016

Sim Results (Dyn.) Activity Factor (Static) Timing/ Noise MTTF & Aging P&R + Optimization Power Analysis Thermal Analysis Task Mapping/ Migration/ (DVFS) Temp Map Power Trace Reliability Report

Tech files, signoff criteria, corners

Slack IR Drop Map Timing / Glitches AVS

Sim vectors Benchmark RTL

Functional Sim

Similar: Closing Multiphysics Analysis Loops

[ASPDAC16]

slide-30
SLIDE 30

30

  • A. B. Kahng, TAU 2016

Sim Results (Dyn.) Activity Factor (Static) Timing/ Noise MTTF & Aging P&R + Optimization Power Analysis Thermal Analysis Task Mapping/ Migration/ (DVFS) Temp Map Power Trace Reliability Report

Tech files, signoff criteria, corners

Slack IR Drop Map Timing / Glitches AVS

Sim vectors Benchmark RTL

Functional Sim

STA-IR loop STA-Thermal loop Workload-Thermal loop STA-Reliability loop

Similar: Closing Multiphysics Analysis Loops

[ASPDAC16]

slide-31
SLIDE 31

31

  • A. B. Kahng, TAU 2016

Multiphysics Analysis is Difficult to Predict

  • IR drop, thermal, reliability, crosstalk, etc.
  • Example: Can we predict “risk map” for embedded memories at floorplan stage?

[Figure: SRAM slack (ps) for SRAM #1 and SRAM #5; annotations: 25ps, 29ps]

[ASPDAC16]

slide-32
SLIDE 32

32

  • A. B. Kahng, TAU 2016

Multiphysics Analysis is Difficult to Predict

  • IR drop, thermal, reliability, crosstalk, etc.
  • Example: Can we predict “risk map” for embedded memories at floorplan stage?

[Figure: SRAM slack (ps) vs. implementation index] [ASPDAC16]

slide-33
SLIDE 33

33

  • A. B. Kahng, TAU 2016

Floorplan Pathfinding

  • Filter bad floorplans (e.g., embedded memory placements, power plans) comprehending downstream PD flow
  • Model f estimates combined effects of netlist, constraints, placement, CTS, routing, optimization, STA
      • y = Slack (w/, w/o IR)
      • x = netlist, constraints, floorplan parameters
      • y = f(x)
      • f = ???

[Figure: PD flow. Gate netlist and constraints; floorplan, powerplan; placement; clock network synthesis; routing; extraction, timing; signoff (extraction, timing, verification). Costly iteration back to the floorplan. Modeling scope: slack (w/, w/o IR)] [ASPDAC16]

slide-34
SLIDE 34

34

  • A. B. Kahng, TAU 2016

Floorplan Pathfinding Model

  • False negatives = 3%
      • Pessimistic predictions → floorplan change that is actually not required
  • False positives = 4%
      • Model incorrectly deems a floorplan to be good

                    Actual Pass             Actual Fail
  Predicted Pass    584                     42 (false positives)
  Predicted Fail    31 (false negatives)    384

Positive slack data points:
  • Precision: tp/(tp + fp) = 93.3%
  • Recall: tp/(tp + fn) = 95.0%

Negative slack data points:
  • Precision: tn/(tn + fn) = 92.5%
  • Recall: tn/(tn + fp) = 90.1%

[ASPDAC16]
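
The quoted percentages can be reproduced from the four counts. The sketch reads them as tp = 584, fp = 42, fn = 31, tn = 384 with "pass" as the positive class, which is the assignment that matches every figure on the slide; it uses one set of symbols for both classes, so precision on the negative class is tn/(tn + fn) and recall is tn/(tn + fp).

```python
# Counts from the slide, with "pass" (positive slack) as the positive class.
tp, fp, fn, tn = 584, 42, 31, 384
total = tp + fp + fn + tn

metrics = {
    "positive precision": tp / (tp + fp),
    "positive recall": tp / (tp + fn),
    "negative precision": tn / (tn + fn),
    "negative recall": tn / (tn + fp),
    "false positives (share of all points)": fp / total,
    "false negatives (share of all points)": fn / total,
}

if __name__ == "__main__":
    for name, value in metrics.items():
        print(f"{name}: {100 * value:.1f}%")
```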

slide-35
SLIDE 35

35

  • A. B. Kahng, TAU 2016

Related: Library Groups ≈ New k-Factors ?

  • Library interpolation with each “physics” modeled as equivalent voltage delta (for example)
      • Voltage
      • Process variation
      • Temperature
      • Aging / reliability
  • Per-instance timing derating for signoff
      • In spirit of old “k-factors”, perhaps

[Figure: per-instance Derating(V1, P1, T1, A1), Derating(V2, P2, T2, A2), Derating(V3, P3, T3, A3); V = voltage, P = process variation, T = temperature, A = aging / reliability]
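
A minimal sketch of the "equivalent voltage delta" idea, under loud assumptions: each effect is converted to a voltage loss, the losses simply add, and cell delay is interpolated linearly between characterized library voltage points. The function names, the additive model and every number are hypothetical; the slide only poses the idea as a question.

```python
def interp_delay(v, lib_points):
    """Piecewise-linear delay lookup from (voltage, delay) library points."""
    pts = sorted(lib_points)
    if v <= pts[0][0]:
        return pts[0][1]   # clamp below the lowest characterized voltage
    if v >= pts[-1][0]:
        return pts[-1][1]  # clamp above the highest characterized voltage
    for (v0, d0), (v1, d1) in zip(pts, pts[1:]):
        if v0 <= v <= v1:
            t = (v - v0) / (v1 - v0)
            return d0 + t * (d1 - d0)


def derating(v_nominal, equiv_dv, lib_points):
    """Per-instance derate: delay at the effective voltage over delay at nominal."""
    v_eff = v_nominal - sum(equiv_dv.values())  # assumed additive voltage losses
    return interp_delay(v_eff, lib_points) / interp_delay(v_nominal, lib_points)


# Hypothetical library points (V, delay in ps) and per-instance voltage deltas (V).
lib_points = [(0.8, 120.0), (0.9, 100.0), (1.0, 90.0)]
equiv_dv = {"ir_drop": 0.03, "temperature": 0.01, "aging": 0.01}
derate = derating(0.9, equiv_dv, lib_points)

if __name__ == "__main__":
    print(f"per-instance derate: {derate:.3f}")
```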

slide-36
SLIDE 36

36

  • A. B. Kahng, TAU 2016

How Do We Regain Time?

  • Learn !!! (machine learning, Big Data mindset)
  • Embrace the “era of optimization”: 1% = 1 week
slide-37
SLIDE 37

37

  • A. B. Kahng, TAU 2016

METRICS (1999): Measure to Improve

[ISQED01]

  • Goal #1: Predict outcome
  • Goal #2: Find sweet spot (field of use) of tool, flow
  • Goal #3: Dial in design-specific tool, flow knobs
slide-38
SLIDE 38

38

  • A. B. Kahng, TAU 2016

Pure Optimization is a Big Lever

  • Project planning and management
  • Unforeseen events (late RTL bugs, timing ECO)
  • Resource co-constraints (e.g., 2 cores per EDA license, 3 concurrent tapeouts)

[Figure: schedule chart of activities A1 to A5, indexed (1) to (3), across three projects; x-axis: work weeks 20 to 42; y-axis: usage (across three projects); reference lines: current servers, datacenter capacity]

  • “How to pack 14 tapeouts into my design center during 2H15?”
  • Schedule cost minimization (SCM)
      • Minimize overall project makespan subject to delay penalties, resource bounds, resource co-constraints, etc.

  • Resource cost minimization (RCM)
  • Minimize number of resources required across all projects

[DAC15 WIP]
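
To make the SCM objective concrete, here is a brute-force sketch on a toy instance. It is not the MILP formulation behind the slide: the activity names, durations, license demands and capacity are made up, and the search (try every activity ordering, start each activity at the earliest time where its predecessors are done and the resource bound holds) only scales to a handful of activities.

```python
from itertools import permutations


def earliest_start(usage, ready, duration, demand, capacity):
    """First time >= ready at which `demand` fits under `capacity` for `duration` steps."""
    t = ready
    while any(usage.get(t + k, 0) + demand > capacity for k in range(duration)):
        t += 1
    return t


def schedule(order, acts, capacity):
    """Makespan of one ordering, or None if the ordering violates precedence."""
    usage, finish = {}, {}
    for name in order:
        duration, demand, preds = acts[name]
        if demand > capacity:
            raise ValueError(f"{name} can never fit under the resource bound")
        if any(p not in finish for p in preds):
            return None
        ready = max((finish[p] for p in preds), default=0)
        start = earliest_start(usage, ready, duration, demand, capacity)
        for k in range(duration):
            usage[start + k] = usage.get(start + k, 0) + demand
        finish[name] = start + duration
    return max(finish.values())


def best_makespan(acts, capacity):
    """Smallest makespan over all precedence-feasible orderings."""
    spans = (schedule(order, acts, capacity) for order in permutations(acts))
    return min(s for s in spans if s is not None)


# Hypothetical instance: name -> (duration in weeks, licenses needed, predecessors).
acts = {
    "P1.place": (2, 1, []),
    "P1.route": (3, 2, ["P1.place"]),
    "P2.place": (2, 1, []),
    "P2.route": (3, 2, ["P2.place"]),
}

if __name__ == "__main__":
    for licenses in (2, 4):
        print(f"{licenses} licenses -> makespan {best_makespan(acts, licenses)} weeks")
```

Running the same instance with different capacities shows the SCM versus RCM tradeoff in miniature: more licenses shorten the schedule until the precedence chains become the limit.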

slide-39
SLIDE 39

39

  • A. B. Kahng, TAU 2016

Example Solver Use Cases (from a design center of a world top-5 semi)

  • Schedule modification after late-breaking bug
      • Three projects, 11 activities/project (e.g., placement, routing, RCX, STA, etc.)
      • Five resource types (#cores, #memory, licenses for P&R, RCX, STA, tools)
      • Industry solution: Makespan of 41 days across all projects
      • SCM solution: Makespan of 34 days across all projects (1.4 weeks saved)
  • Datacenter resource allocation
      • 24 projects, five activities (synthesis, P&R, RCX, STA, PV) per project
      • Forecast-based allocation for #servers in datacenter
      • Industry solution: Purchase 600 additional servers
      • SCM solution: Zero additional servers
  • Human resource allocation
      • Four large projects
      • Four types of human resources (synthesis, P&R, verification, STA)
      • RCM solution: ~$5.2M headcount cost savings for company
  • MILP solver at http://vlsicad.ucsd.edu/MILP/
slide-40
SLIDE 40

40

  • A. B. Kahng, TAU 2016

DARPA’s CRAFT Program (2016-)

  • “Circuit Realization At Faster Timescales” (UCSD leads a team)
  • Goal: reduce SOC design time from 130 weeks to 30 weeks
  • “Iso-PAP” (Performance At Power) at 14/16nm and below
slide-41
SLIDE 41

41

  • A. B. Kahng, TAU 2016

How Do We Regain Time?

  • Learn !!! (machine learning, Big Data mindset)
  • Embrace the “era of optimization”: 1% = 1 week
  • Take back what we’ve given away
slide-42
SLIDE 42

42

  • A. B. Kahng, TAU 2016

Flexible FF Timing → Margin Recovery

  • Setup time, hold time and clock-to-q (c2q) delay of FF ⇒ NOT fixed values
  • Flexible FF timing model considering operating (function/test) modes, path partitioning ⇒ Reduce pessimism in timing analysis
  • Sequential LP: setup-c2q optimization + hold-c2q optimization
  • Objective: Find the best setup/hold time/c2q for each FF

[Figure: c2q-setup-hold surface (c2q vs. setup, c2q vs. hold); setup-hold-c2q fixed model (setup, hold, c2q) vs. setup-hold-c2q flexible model (setup, hold, c2q1 ... c2qn)]

[ISQED14]

slide-43
SLIDE 43

43

  • A. B. Kahng, TAU 2016

“Free” Improvement of Timing

[Figure: flow with steps: netlist (and SPEF, if routed); extract path timing information (STA_FTmax, STA_FTmin); LP formulation with flexible flip-flop timing model; solve sequential LP; solution; annotate new timing model for each flip-flop; timing signoff with annotated timing]

  • Fix timing violations “for free”
  • 48ps average WNS improvement over 5 designs in foundry 65nm technology
  • Other opportunities (jitter, ERCs, non-full rail swing, …)

slide-44
SLIDE 44

44

  • A. B. Kahng, TAU 2016

Homogeneous Corners

  • (1) Define RC corners of each layer separately
  • (2) Use corners from each layer to construct a homogeneous corner for an interconnect stack

Example: worst-case capacitance corner

[Figure: interconnect stack with M1 and M2; capacitance (C) distributions of Layer M1 and Layer M2 with their 3σ points; the homogeneous Cw corner and its pessimism]

[ICCD14]

slide-45
SLIDE 45

45

  • A. B. Kahng, TAU 2016

Homogeneous Corners

  • (1) Define RC corners of each layer separately
  • (2) Use corners from each layer to construct a homogeneous corner for an interconnect stack

Example: worst-case capacitance corner

[Figure: interconnect stack with M1 and M2; capacitance (C) distributions of Layer M1 and Layer M2 with their 3σ points; the homogeneous Cw corner and its pessimism]

When variations in different layers are not fully correlated, pessimism of homogeneous corners increases with #layers
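
A small numerical illustration of why the pessimism grows with the number of layers. It assumes, as a simplification that the slide does not spell out, that the capacitance variation seen by a path is the sum of per-layer Gaussian contributions with standard deviations σ_i and one common pairwise correlation ρ. The homogeneous corner moves every layer by 3σ_i at once, while the statistical 3σ of the sum is 3·sqrt(Σσ_i² + ρ·Σ_{i≠j} σ_i σ_j).

```python
import math


def pessimism_ratio(sigmas, rho=0.0):
    """Homogeneous-corner shift divided by the statistical 3-sigma shift of the stack."""
    homogeneous = 3.0 * sum(sigmas)  # every layer at its own 3-sigma point at once
    cross = sum(si * sj
                for i, si in enumerate(sigmas)
                for j, sj in enumerate(sigmas) if i != j)
    variance = sum(s * s for s in sigmas) + rho * cross
    return homogeneous / (3.0 * math.sqrt(variance))


if __name__ == "__main__":
    for n_layers in (1, 2, 4, 9):
        ratio = pessimism_ratio([1.0] * n_layers, rho=0.0)
        print(f"{n_layers} uncorrelated layers -> ratio {ratio:.2f}")
```

For equal, uncorrelated layers the ratio is sqrt(#layers); with full correlation (ρ = 1) it drops to 1, i.e. the homogeneous corner is then exact.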

slide-46
SLIDE 46

46

  • A. B. Kahng, TAU 2016

Tightened BEOL Corners (“TBC”)

[Figure: two signoff flows.
Conventional signoff: routed design; timing analysis using conventional BEOL corners (CBC); violation = 0? If no, ECO using CBC and repeat; otherwise done.
UCSD, 2014: routed design; classify timing critical paths into GTBC and GCBC; timing analysis using TBC (violation = 0? if no, ECO using TBC); timing analysis using CBC (violation = 0? if no, ECO using CBC); done]

[ICCD14]

slide-47
SLIDE 47

47

  • A. B. Kahng, TAU 2016

Pessimism in Conventional BEOL Corners (CBC)

  • Assumption: a max (setup) path $p_j$ is “safe” when the delay evaluated at a given CBC is larger than nominal delay + $3\sigma_j$

        $d_j(Y_{CBC}) \ge d_j(Y_{typ}) + 3\sigma_j$

  • For a given path, we can compare the statistical delay variation and the delay obtained from a given CBC

        $\alpha_j = 3\sigma_j / \Delta d_j(Y_{CBC})$
        $\Delta d_j(Y_{CBC}) = d_j(Y_{CBC}) - d_j(Y_{typ})$
        $Y_{CBC} \in \{Y_{cw}, Y_{cb}, Y_{rcw}, Y_{rcb}\}$

  • A small $\alpha_j$ implies there is a large pessimism

[Figure: delay axis comparing $d_j(Y_{CBC}) - d_j(Y_{typ})$ with $3\sigma_j$; the gap is marked "Large pessimism"]
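
A worked instance of the scaling factor defined above, α_j = 3σ_j / (d_j(Y_CBC) − d_j(Y_typ)). The path delays and σ below are made-up numbers for illustration; the function names are hypothetical.

```python
def alpha(d_cbc: float, d_typ: float, sigma: float) -> float:
    """Scaling factor: statistical 3-sigma delay shift over the shift seen at the CBC."""
    delta = d_cbc - d_typ
    if delta <= 0:
        raise ValueError("the CBC delay must exceed the typical delay for a max path")
    return 3.0 * sigma / delta


def is_safe(d_cbc: float, d_typ: float, sigma: float) -> bool:
    """The slide's safety assumption: d(Y_CBC) >= d(Y_typ) + 3*sigma, i.e. alpha <= 1."""
    return d_cbc >= d_typ + 3.0 * sigma


if __name__ == "__main__":
    # Hypothetical path: 500 ps typical, 560 ps at the corner, sigma = 10 ps.
    a = alpha(d_cbc=560.0, d_typ=500.0, sigma=10.0)
    print(f"alpha = {a}")  # the corner adds 60 ps where 3-sigma needs only 30 ps
    print("safe:", is_safe(560.0, 500.0, 10.0))
```

An α of 0.5 means the corner adds twice the delay that 3σ coverage requires, which is the pessimism the tightened corners aim to recover.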

slide-48
SLIDE 48

48

  • A. B. Kahng, TAU 2016

Scaling Factor α and Delay Variation

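  • Paths with small $\Delta d_{rcw}$ and $\Delta d_{cw}$ have large $\alpha$
      • E.g., there are paths with $\alpha_j > 0.6$ when (($\Delta d_{rcw} < 3\%$) AND ($\Delta d_{cw} < 3\%$))
  • Identify paths for tightened BEOL corners based on $\Delta d_{rcw}$ and $\Delta d_{cw}$

[Figure: $\alpha$ plotted against $\Delta d(Y_{cw})/d(Y_{typ})$ and $\Delta d(Y_{rcw})/d(Y_{typ})$]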

slide-49
SLIDE 49

49

  • A. B. Kahng, TAU 2016

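Find Paths for Which TBCs Can Be Used

Gtbc = paths which can be safely signed off using tightened corners: paths with ($\Delta d_{cw}$ larger than $A_{cw}$) OR ($\Delta d_{rcw}$ larger than $A_{rcw}$)

[Figure: paths plotted by $\Delta d(Y_{cw})/d(Y_{typ})$ and $\Delta d(Y_{rcw})/d(Y_{typ})$, with thresholds $A_{cw}$ and $A_{rcw}$]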
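
The classification rule can be sketched as a simple filter. The relative delay increases follow the plot axes, Δd(Ycw)/d(Ytyp) and Δd(Yrcw)/d(Ytyp); the threshold values and path delays below are hypothetical, since the slide gives Acw and Arcw only as symbols.

```python
def classify_paths(paths, a_cw, a_rcw):
    """Split paths into Gtbc (tightened corners allowed) and Gcbc (conventional corners)."""
    g_tbc, g_cbc = [], []
    for name, d_typ, d_cw, d_rcw in paths:
        rel_cw = (d_cw - d_typ) / d_typ    # delta d(Ycw) / d(Ytyp)
        rel_rcw = (d_rcw - d_typ) / d_typ  # delta d(Yrcw) / d(Ytyp)
        if rel_cw > a_cw or rel_rcw > a_rcw:
            g_tbc.append(name)
        else:
            g_cbc.append(name)
    return g_tbc, g_cbc


# Hypothetical paths: (name, delay at Ytyp, delay at Ycw, delay at Yrcw), in ps.
paths = [
    ("p1", 500.0, 530.0, 510.0),
    ("p2", 500.0, 505.0, 510.0),
    ("p3", 500.0, 510.0, 540.0),
]
g_tbc, g_cbc = classify_paths(paths, a_cw=0.03, a_rcw=0.03)  # assumed thresholds

if __name__ == "__main__":
    print("Gtbc:", g_tbc)
    print("Gcbc:", g_cbc)
```

Paths left in Gcbc keep the conventional corners, so the split never loosens signoff for paths whose corner delay shift is small relative to their statistical variation.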

slide-50
SLIDE 50

50

  • A. B. Kahng, TAU 2016

Benefits of Tightened BEOL Corners

  • WNS and TNS are reduced by up to 100ps and 53ns
  • #paths with timing violations is reduced by 24% to 100%
  • TBC-0.5 configuration has little benefit because there are not many paths in Gtbc

[Figure: three bar charts (WNS (ns), TNS (ns), #timing violations) for LEON, SUPERBLUE12 and NETCARD, comparing CBC, TBC-0.5, TBC-0.6 and TBC-0.7; correlation factor γ = 0.5]

slide-51
SLIDE 51

51

  • A. B. Kahng, TAU 2016

How Do We Regain Time?

  • Learn correlations!
  • Enter the “era of optimization”: 1% = 1 week
  • Take back what we’ve given away
  • Stop wasting time
slide-52
SLIDE 52

52

  • A. B. Kahng, TAU 2016

Poor Enablement, Poor Results

  • Academic libraries
  • OpenPDK
  • SAED 32/28
  • 15nm FreePDK
  • ISPD Sizing Contest Library
  • Academic enablements (PDKs, libraries, etc.) quite weak
slide-53
SLIDE 53

53

  • A. B. Kahng, TAU 2016

Example: 15nm OpenPDK

  • Cell delays not realistic
  • RC information missing
      • Cannot extract wire capacitances with commercial RCX tools
  • Complex LEF rules missing

STA report from [EDA tool]:

  Des/Clust/Port     Wire Load Model     Library
  wb_dma_top         wl_zero             NanGate_15nm_OCL

  Point                                   Fanout     Cap   Trans    Incr    Path
  clock clk_i (rise edge)                                           0.000   0.000
  clock network delay (ideal)                                       0.000   0.000
  u3_u1_slv_adr_reg_9_/CLK (DFFRNQ_X1)                      0.000   0.000   0.000 r
  u3_u1_slv_adr_reg_9_/Q (DFFRNQ_X1)                        2.494  10.094  10.094 f
  slv0_adr[9] (net)                            1   0.807           0.000  10.094 f
  U3390/ZN (NOR2_X1)                                        5.483   3.642  13.735 r
  n2388 (net)                                  1   1.616           0.000  13.735 r
  U2231/ZN (NAND2_X2)                                       4.255   3.216  16.952 f
  n2593 (net)                                  3   2.164           0.000  16.952 f
  U3389/ZN (INV_X1)                                         3.705   2.917  19.868 r
  n3228 (net)                                  3   1.990           0.000  19.868 r
  U3387/ZN (NAND2_X1)                                       6.314   4.207  24.075 f
  n3230 (net)                                  3   2.198           0.000  24.075 f
  U4136/Z (OR2_X1)                                          3.102   5.762  29.837 f
  n2318 (net)                                  2   1.509           0.000  29.837 f
  U3373/ZN (INV_X1)                                         2.093   1.799  31.636 r
  n3435 (net)                                  1   0.840           0.000  31.636 r
  U3372/Z (BUF_X2)                                         15.410  10.845  42.481 r
  n2367 (net)                                 31  21.353           0.000  42.481 r
  U3992/ZN (AOI22_X1)                                       9.081   5.892  48.373 f
  n3388 (net)                                  1   0.631           0.000  48.373 f
  U3185/ZN (NAND4_X1)                                       7.109   2.862  51.235 r
  u0_N3065 (net)                               1   0.485           0.000  51.235 r
  u0_wb_rf_dout_reg_22_/D (DFFRNQ_X1)                       7.109   0.000  51.235 r
  data arrival time                                                        51.235

  clock clk_i (rise edge)                                          60.000  60.000
  clock network delay (ideal)                                       0.000  60.000
  u0_wb_rf_dout_reg_22_/CLK (DFFRNQ_X1)                             0.000  60.000 r
  library setup time                                               -8.764  51.236
  data required time                                                       51.236

  data required time                                                       51.236
  data arrival time                                                       -51.235

  slack (MET)                                                               0.002

Annotations on the report:

  • Clock Period = 60ps? (1.5ns with 28nm foundry)
  • Stage delay: 2ps~30ps
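
The bottom of the report follows the usual setup-check arithmetic: data required time = capturing clock edge − library setup time, and slack = data required time − data arrival time. The sketch below redoes it with the printed values, taking the report's time unit to be ps as the "60ps" annotation suggests; it uses `Decimal` so that the three-decimal figures are not blurred by binary floating point.

```python
from decimal import Decimal

# Values as printed in the report.
clock_edge = Decimal("60.000")
library_setup = Decimal("8.764")
data_arrival = Decimal("51.235")

data_required = clock_edge - library_setup
slack = data_required - data_arrival

# Ratio between the 28nm foundry clock period quoted on the slide (1.5 ns) and this one.
period_ratio = Decimal("1500") / clock_edge

if __name__ == "__main__":
    print("data required time:", data_required)
    print("slack from printed values:", slack)
    print("clock period ratio:", period_ratio)
```

The printed values give a slack of 0.001 where the report shows 0.002; a difference of one unit in the last digit is what one would expect if the tool subtracts unrounded internal values, though the slide does not say so.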

slide-54
SLIDE 54

54

  • A. B. Kahng, TAU 2016

Example: ISPD13 Gate Sizing Contest Library

  • “Gap” between academic benchmarks and industry designs

  • Unrealistic timing library
  • Missing MCMM
  • Missing multiple power domains
  • Missing multiple clock domains
  • Missing memories / macro cells
  • No support for standard formats (.spef, .v, .sdc, .lib)

See “A2A” from UCSD: “horizontal benchmark extension” http://vlsicad.ucsd.edu/Publications/Conferences/313/c313.pdf

slide-55
SLIDE 55

55

  • A. B. Kahng, TAU 2016

Poor Research Enablement Has Costs

  • “Good” academic sizers cannot be used for industry designs
      • No MCMM vs. MCMM → Resource (memory/runtime) problem
      • Simple vs. Complicated timing models → Timing accuracy problem
      • Few benchmarks vs. industry designs → Academic sizers don’t port well
  • Overtrained on a particular suite of “benchmarks”
      • Timing/power characteristics, intuition mismatched to reality, actual outcomes

[Figure: sizer comparison on benchmark netcard; cSizer1: commercial sizer, aSizer1: academic sizer] [GLSVLSI14]

Commercial sizer wins with foundry technologies (similar leakage, better timing slack, better runtime)

slide-56
SLIDE 56

56

  • A. B. Kahng, TAU 2016

How Do We Regain Time?

  • Learn correlations!
  • Enter the “era of optimization”: 1% = 1 week
  • Take back what we’ve given away
  • Stop wasting time
  • Harvest low-hanging fruits
slide-57
SLIDE 57

57

  • A. B. Kahng, TAU 2016

FEOL: Layout-Dependent Effects

  • Layout dependent variations
      • Variation in poly pitch
      • Well-proximity effects: Closer to well edge → more Vth shift
      • Intentional and unintentional stress: LOD, STI, DSL and SiGe
      • Pattern dependent dishing and oxide erosion

[Mark Zwolinski, ISPD2013]

slide-58
SLIDE 58

58

  • A. B. Kahng, TAU 2016

BEOL: Statistical RC Extraction

[source: R. Jiang, Synopsys, 2005]

  • Statistical RC extraction flow comprehends spatial correlation of interconnect variations

  • Proposed by industry, then dropped …. (???)
slide-59
SLIDE 59

59

  • A. B. Kahng, TAU 2016

Multi-Die: Signoff Corners

  • Example: inter-die process variation limits performance improvement of 3DICs
  • What if SS Tier 0 and SS Tier 1 will never be stacked together?

Mix-and-match: wafer-to-wafer (die-to-wafer) bonding integrates SS wafer/die with FF wafer/die (SS Tier 0 wafer/die + FF Tier 1 wafer, or FF Tier 0 wafer/die + SS Tier 1 wafer)

[Figure: 3D integration of an SS Tier 1 wafer/die with an FF Tier 0 wafer; bar chart of WNS (ps) for SS-SS, SS-FF and FF-SS; annotation: 75ps]

  • XX-YY = XX Tier 0 + YY Tier 1
  • Technology: 28FDSOI
  • 3D netlist is bipartitioned with min-cut

[DATE16]

slide-60
SLIDE 60

60

  • A. B. Kahng, TAU 2016

Multi-Die Design for “Mix-and-Match”

  • Partition netlist such that paths have balanced delay across two tiers → Maximizes timing benefit from mix-and-match
  • 16% performance increase at signoff compared to existing flows

  Design    Clk period
  M0        1.2ns
  AES       1.1ns
  VGA       1.0ns

[Figure: WNS (ps) for ARM M0, AES and VGA; series: Brute-force (orig), Brute-force (opt), Shrunk2D (orig), Shrunk2D (opt), GT2012 (orig), GT2012 (opt); technology: 28FDSOI]

slide-61
SLIDE 61

61

  • A. B. Kahng, TAU 2016

SAMP + Cutmask: Dummy and EOL ΔTiming

  • Self-aligned multiple patterning (SAMP) + Cutmask
  • Cut shapes and locations determine dummy wires, end-of-line (EOL) extensions of wire segments ⇒ affect performance
  • BACUS15: Co-optimization of
      • Cut mask minimum spacing rules
      • EOL extension with usage of multiple cut masks
      • Metal density constraints (dummy fills)
  • Insight into achievable tradeoff of performance and cost

[Figure: original layout (1D wires, cut masks, cut) and final layout (dummy fill, extension)]

[BACUS15]

slide-62
SLIDE 62

62

  • A. B. Kahng, TAU 2016

Timing Impacts

  • Best vs. Worst EOL extension
      • BEST ILP solution: little impact of EOL extensions on timing
      • WORST ILP solution in N5 degrades up to 196ps compared to N7
  • Post-ILP optimization is beneficial to timing
  • Different metal density with up to 14ps difference

[Figure: changes in WNS (ns), BEST vs. WORST, for ARM Cortex M0, AES and JPEG at N7 and N5]

[Figure: change in WNS (ns) for different target metal density (40%, 42.5%, 45%) for ARM Cortex M0, AES and JPEG]

slide-63
SLIDE 63

63

  • A. B. Kahng, TAU 2016

How Do We Regain Time?

  • Learn correlations, predictions → kill loops, risk
  • Enter the “era of optimization”: 1% = 1 week
  • Take back what we’ve given away
  • Stop wasting time
  • Harvest low-hanging fruits
  • + resilience, adaptivity (signoff at TT), reliability pessimism
  • Min cost of resilience (“MinRazor”; OD, AVS-BTI signoff)
  • PVS ROs
  • + new paradigms (stochastic, approximate)
  • + pure optimization QOR (P&R&Opt for N10, N7)
  • ? faster RTL design/optimization; DFT …
slide-64
SLIDE 64

64

  • A. B. Kahng, TAU 2016

Costs of Reliability

[Figure: relation graph between design parameters and reliability mechanisms. Parameters: AF (α), Jrms, temp, wire width, wire spacing, driver size, supply voltage, freq., slew rate, load/fanout, gate length, junction resistance, timing slack, |ΔVthp|, |ΔVthn|, MTTF. Mechanisms labeling the edges: EM, TDDB, NBTI, HCI, general. Legend: inverse relation (if A increases then B decreases); direct relation (if A increases then B increases); tunable at design or runtime; tunable at design]

slide-65
SLIDE 65

65

  • A. B. Kahng, TAU 2016

N10, N7 P&R: “Opt” Methods

DB violation MinIW violation MinIW violation MinOW violation

flipped Cells are moved

slide-66
SLIDE 66

66

  • A. B. Kahng, TAU 2016

In Search of Lost Time

slide-67
SLIDE 67

67

  • A. B. Kahng, TAU 2016

Summary

  • Time = schedule, universal currency, $$$
  • We lose time in many ways
  • Some unavoidable
  • Learning, big data, optimization
  • Take back what we’ve given away
  • Stop pointless waste
  • Many low-hanging fruits
  • Recovering lost Time = “equivalent scaling”
  • EDA continues the Moore’s-Law value trajectory
slide-68
SLIDE 68

68

  • A. B. Kahng, TAU 2016

THANK YOU !