In Search of Lost Time
Andrew B. Kahng, UCSD CSE and ECE Departments
abk@ucsd.edu
http://vlsicad.ucsd.edu
TAU-2016 Friday keynote, Santa Rosa
A. B. Kahng, TAU 2016
What is Time? How do we
1
2
3
4
5
Margin
Product Quality Model and Analysis Accuracy
nm, mV, {skew, jitter, OCV…} power, area, fmax, Iddq,… rms, %, σ
6
7
8
9
10
(Years) Tech development, app market definition, architecture/front-end design
(Months) RTL-to-GDS implementation, reliability qualification
(Weeks) Fab latency, cycles of yield learning, design re-spins, mask flows
(Days) Process tweaks, design ECOs
Mismatches among these time constants
11
12
13
14
Figure: performance PDF with signoff margins for Process, Temperature, Voltage (nominal Vdd, static IR drop, power grid IR gradient, dynamic IR), and Reliability (HCI/NBTI) (source: Wu 08)
15
16
Figure: timing-closure concerns accumulating across technology nodes (90nm, 65nm, 45/40nm, 28nm, 20nm, 16/14nm, 10nm, ≤7nm): BTI, temp inversion, noise, MCMM, maxtrans, EM, AOCV / POCV, PBA, fixed-margin spec, multi-patterning, cell-POCV, MOL and BEOL R ↑, dynamic IR, fill effects, layout rules, BEOL and MOL variations, signoff criteria with AVS, SOC complexity, LVF, MIS, phys-aware timing ECO, min implant
17
18
Homogeneous BEOL corners (e.g., Cworst)
Figure: interconnect stack with M1 and M2; capacitance (C) distributions for Layer M1 and Layer M2, each with its 3σ point; the homogeneous Cw corner takes both layers at 3σ, marked as pessimism
19
Figure: T1 path slack (ns) vs. T2 path slack (ns); 123 ps annotated
20
21
22
Option #1: go with the latest available technology = 0.01 AU/year speed
Option #2: spend the next ten years to come up with a spaceship = 0.1 AU/year speed
Year: 2016 2026 2027 2031
Option #1 = 0.5 / 0.01 = 50 years
Option #2 = 0.5 / 0.1 + 10 years = 15 years (B << A)
Option #1 vs. Option #2: corner-based STA vs. statistical STA; planar vs. 3D; homogeneous CMOS vs. heterogeneous CMOS
Need a faster ship
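The arithmetic behind the two options can be checked with a short sketch. The 0.5 AU trip distance is read off the slide's divisions (0.5 / 0.01 and 0.5 / 0.1); the function name and constants below are illustrative assumptions, not part of the talk.

```python
def arrival_year(start_year, distance_au, speed_au_per_year, wait_years=0.0):
    """Year of arrival if we wait `wait_years`, then travel at constant speed."""
    return start_year + wait_years + distance_au / speed_au_per_year


DISTANCE_AU = 0.5   # assumption: implied by the slide's 0.5 / 0.01 and 0.5 / 0.1
START_YEAR = 2016   # first year on the slide's timeline

# Option #1: launch now with today's technology.
option1 = arrival_year(START_YEAR, DISTANCE_AU, 0.01)
# Option #2: spend ten years building a faster ship, then launch.
option2 = arrival_year(START_YEAR, DISTANCE_AU, 0.1, wait_years=10)

print(f"Option #1 arrives in {option1:.0f}, Option #2 arrives in {option2:.0f}")
```

Under these assumptions Option #2 arrives in 2031, the last year on the slide's timeline, while Option #1 would not arrive until 2066.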
23
24
25
Figure: T1 path slack (ns) vs. T2 path slack (ns); 123 ps annotated
[DATE14]
26
Figure: modeling flow. Artificial Circuits (Train, Validate, Test) → MODELS (path slack, setup time, stage, cell, wire delays) → New Designs. ONE-TIME training on artificial circuits; INCREMENTAL: if error > threshold, outliers (data points) from Real Designs
Figure: T1 path slack (ns) vs. T2 path slack (ns), before and after ML modeling: 123 ps before, 31 ps after (~4× reduction)
[DATE14]
27
Inputs: .db, .lib, .spef, .v, .sdc (post P&R database)
Calibration: recipe to convert non-SI timing report to SI timing report
Figure: SI path slack (ns) ($$$) vs. non-SI path slack (ns) ($); 81ps annotated [SLIP15]
28
Timing Reports in SI Mode, Timing Reports in Non-SI Mode
Create Training, Validation and Testing Sets
ANN (2 Hidden Layers, 5-Fold Cross-Validation)
SVM (RBF Kernel, 5-Fold Cross-Validation)
HSM (Weighted Predictions from ANN and SVM)
Save Model and Exit
Figure: Predicted Path Delay (ps) vs. Actual Path Delay (ps)
Worst absolute error = 8.2ps
Average absolute error = 1.7ps
Figure: SI path slack (ns) ($$$) vs. non-SI path slack (ns) ($); 81ps annotated; ML Modeling
[SLIP15]
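The slide says only that HSM uses weighted predictions from the ANN and SVM; it does not say how the weights are chosen. A minimal sketch, assuming a single blend weight fitted by least squares on held-out paths. The names `fit_blend_weight` and `blend` and all delay values are hypothetical and are not taken from [SLIP15].

```python
def fit_blend_weight(pred_a, pred_b, actual):
    """Least-squares w for the blend w*pred_a + (1-w)*pred_b, clamped to [0, 1]."""
    num = sum((a - b) * (y - b) for a, b, y in zip(pred_a, pred_b, actual))
    den = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    if den == 0.0:
        # Both models agree on every path, so the weight does not matter.
        return 0.5
    return min(1.0, max(0.0, num / den))


def blend(pred_a, pred_b, w):
    """Weighted combination of two models' per-path predictions."""
    return [w * a + (1.0 - w) * b for a, b in zip(pred_a, pred_b)]


# Hypothetical validation data in ps (assumption, for illustration only).
actual = [100.0, 150.0, 200.0, 250.0]
ann_pred = [104.0, 154.0, 204.0, 254.0]  # stand-in for ANN output, biased high
svm_pred = [98.0, 148.0, 198.0, 248.0]   # stand-in for SVM output, biased low

w = fit_blend_weight(ann_pred, svm_pred, actual)
hybrid = blend(ann_pred, svm_pred, w)

abs_errors = [abs(p - y) for p, y in zip(hybrid, actual)]
worst_abs_err = max(abs_errors)
avg_abs_err = sum(abs_errors) / len(abs_errors)
print(f"w={w:.3f} worst={worst_abs_err:.3f}ps avg={avg_abs_err:.3f}ps")
```

The worst and average absolute errors computed at the end are the same two metrics the slide reports (8.2ps and 1.7ps); the toy data here is constructed so that the two models' biases cancel.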
29
Sim Results (Dyn.) Activity Factor (Static) Timing/ Noise MTTF & Aging P&R + Optimization Power Analysis Thermal Analysis Task Mapping/ Migration/ (DVFS) Temp Map Power Trace Reliability Report
Tech files, signoff criteria, corners
Slack IR Drop Map Timing / Glitches AVS
Sim vectors Benchmark RTL
Functional Sim
[ASPDAC16]
30
31
SRAM #1
SRAM Slack (ps)
SRAM #5
25ps 29ps
[ASPDAC16]
32
Implementation Index SRAM Slack (ps) [ASPDAC16]
33
= netlist, constraints, floorplan parameters
= ???
Signoff
Extraction, Timing, Verification Placement Floorplan, Powerplan Routing
Gate Netlist
Slack (w/, w/o IR) Modeling Scope
Constraints
Clock network synthesis Extraction, Timing
Costly Iteration [ASPDAC16]
34
False positives False negatives
Positive slack data points:
Precision: tp/(tp + fp) = 93.3%
Recall: tp/(tp + fn) = 95.0%
Negative slack data points:
Precision: tn/(tn + fn) = 92.5%
Recall: tn/(tn + fp) = 90.1%
[ASPDAC16]
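A minimal sketch of how the four percentages follow from a confusion matrix, using the standard definitions in which negative-class precision is tn/(tn + fn) and negative-class recall is tn/(tn + fp). The counts are hypothetical, chosen only to land near the slide's numbers; they are not data from [ASPDAC16].

```python
def precision_recall(tp, fp, fn, tn):
    """Per-class precision and recall from a 2x2 confusion matrix."""
    return {
        "positive": {"precision": tp / (tp + fp), "recall": tp / (tp + fn)},
        "negative": {"precision": tn / (tn + fn), "recall": tn / (tn + fp)},
    }


# Hypothetical counts (assumption): "positive" means predicted positive slack.
metrics = precision_recall(tp=950, fp=68, fn=50, tn=617)

for label, m in metrics.items():
    print(f"{label}: precision={m['precision']:.1%} recall={m['recall']:.1%}")
```

A false positive here is a path predicted to have positive slack that actually fails, which is the costly case for signoff, so negative-class recall is the number to watch.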
35
Derating(V1, P1, T1, A1) Derating(V2, P2, T2, A2) Derating(V3, P3, T3, A3)
36
37
[ISQED01]
38
tapeouts)
Figure: schedule chart of activities A1 to A5, each tagged (1), (2), or (3), over work weeks 20 to 42; usage (across three projects) plotted against current servers and datacenter capacity
bounds, resource co-constraints, etc.
[DAC15 WIP]
39
STA, etc.)
tools)
saved)
40
41
42
setup c2q hold c2q
c2q-setup-hold surface setup hold c2q
setup hold c2q1 c2qn ...
setup-hold-c2q flexible model
setup-hold-c2q fixed model
[ISQED14]
43
Extract path timing information LP formulation with flexible flip-flop timing model Solve Sequential LP
(STA_FTmax , STA_FTmin)
Annotate new timing model for each flip-flop Solution Netlist (and SPEF, if routed) Timing signoff with annotated timing
technology
non-full rail swing, …)
44
homogeneous corner for an interconnect stack
Figure: interconnect stack with M1 and M2; capacitance (C) distributions for Layer M1 and Layer M2, each with its 3σ point; the homogeneous Cw corner takes both layers at 3σ, marked as pessimism
[ICCD14]
45
homogeneous corner for an interconnect stack
Figure: interconnect stack with M1 and M2; capacitance (C) distributions for Layer M1 and Layer M2, each with its 3σ point; the homogeneous Cw corner takes both layers at 3σ, marked as pessimism
46
Conventional flow: Routed design → Timing analysis using conventional BEOL corners (CBC) → violation = 0? → if No, ECO using CBC and repeat → done
Proposed flow: Routed design → Classify timing critical paths into GTBC and GCBC → Timing analysis using TBC (violation = 0? if No, ECO using TBC) and Timing analysis using CBC (violation = 0? if No, ECO using CBC) → done
[ICCD14]
47
Figure: delay distribution; d_j(Y_CBC) - d_j(Y_typ) compared against 3σ_j, marked as large pessimism
48
α
49
Gtbc = paths which can be safely signed off using tightened corners: paths with (Δd_cw larger than A_cw) OR (Δd_rcw larger than A_rcw)
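The OR rule above can be sketched as a simple path classifier. This is an illustration under stated assumptions: the slide does not define the quantities here, so the code treats Δd_cw and Δd_rcw as per-path delay deltas and A_cw and A_rcw as thresholds, and it assumes every path outside Gtbc falls into Gcbc. The `Path` class, the `classify` function, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    delta_d_cw: float   # Δd_cw for this path (units assumed: ps)
    delta_d_rcw: float  # Δd_rcw for this path (units assumed: ps)


def classify(paths, a_cw, a_rcw):
    """Split paths into (G_tbc, G_cbc) using the OR rule quoted above."""
    g_tbc, g_cbc = [], []
    for p in paths:
        if p.delta_d_cw > a_cw or p.delta_d_rcw > a_rcw:
            g_tbc.append(p.name)   # eligible for tightened corners (TBC)
        else:
            g_cbc.append(p.name)   # assumption: stays on conventional corners
    return g_tbc, g_cbc


# Hypothetical paths and thresholds, for illustration only.
paths = [Path("p1", 12.0, 3.0), Path("p2", 2.0, 1.0), Path("p3", 4.0, 9.0)]
g_tbc, g_cbc = classify(paths, a_cw=10.0, a_rcw=8.0)
print("G_tbc:", g_tbc, "G_cbc:", g_cbc)
```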
50
Figure: WNS (ns), TNS (ns), and #timing violations for LEON, SUPERBLUE12, and NETCARD under CBC, TBC-0.5, TBC-0.6, and TBC-0.7
51
52
53
capacitances with commercial RCX tools
Des/Clust/Port     Wire Load Model     Library
                   wl_zero             NanGate_15nm_OCL

Point                                   Fanout      Cap    Trans     Incr     Path
clock network delay (ideal)                                         0.000    0.000
u3_u1_slv_adr_reg_9_/CLK (DFFRNQ_X1)                       0.000    0.000    0.000 r
u3_u1_slv_adr_reg_9_/Q (DFFRNQ_X1)                         2.494   10.094   10.094 f
slv0_adr[9] (net)                            1    0.807             0.000   10.094 f
U3390/ZN (NOR2_X1)                                         5.483    3.642   13.735 r
n2388 (net)                                  1    1.616             0.000   13.735 r
U2231/ZN (NAND2_X2)                                        4.255    3.216   16.952 f
n2593 (net)                                  3    2.164             0.000   16.952 f
U3389/ZN (INV_X1)                                          3.705    2.917   19.868 r
n3228 (net)                                  3    1.990             0.000   19.868 r
U3387/ZN (NAND2_X1)                                        6.314    4.207   24.075 f
n3230 (net)                                  3    2.198             0.000   24.075 f
U4136/Z (OR2_X1)                                           3.102    5.762   29.837 f
n2318 (net)                                  2    1.509             0.000   29.837 f
U3373/ZN (INV_X1)                                          2.093    1.799   31.636 r
n3435 (net)                                  1    0.840             0.000   31.636 r
U3372/Z (BUF_X2)                                          15.410   10.845   42.481 r
n2367 (net)                                 31   21.353             0.000   42.481 r
U3992/ZN (AOI22_X1)                                        9.081    5.892   48.373 f
n3388 (net)                                  1    0.631             0.000   48.373 f
U3185/ZN (NAND4_X1)                                        7.109    2.862   51.235 r
u0_N3065 (net)                               1    0.485             0.000   51.235 r
u0_wb_rf_dout_reg_22_/D (DFFRNQ_X1)                        7.109    0.000   51.235 r
data arrival time                                                           51.235
clock clk_i (rise edge)                                            60.000   60.000
clock network delay (ideal)                                         0.000   60.000
u0_wb_rf_dout_reg_22_/CLK (DFFRNQ_X1)                               0.000   60.000 r
library setup time                                                 -8.764   51.236
data required time                                                          51.236
data arrival time                                                          -51.235

Clock period = 60ps? (1.5ns with 28nm foundry)
Stage delay: 2ps~30ps
STA report from [EDA tool]
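The closing lines of the report above can be reproduced with the usual setup-check arithmetic: data required time is the capture clock edge plus clock network delay minus library setup time, and slack is required time minus arrival time. A minimal sketch; the function name is hypothetical, and the unit is assumed to be ps based on the slide's "60ps?" remark.

```python
def setup_slack(clock_period, capture_clock_delay, setup_time, arrival):
    """Return (slack, required) for a single-cycle setup check."""
    required = clock_period + capture_clock_delay - setup_time
    return required - arrival, required


# Values copied from the report; the unit (ps) is an assumption.
slack, required = setup_slack(
    clock_period=60.000,
    capture_clock_delay=0.000,  # ideal clock network in this report
    setup_time=8.764,
    arrival=51.235,
)
print(f"required={required:.3f} slack={slack:.3f}")
```

With these inputs the required time matches the report's 51.236, leaving only a sliver of positive slack. The setup time alone consumes roughly 15% of the 60-unit clock period.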
54
See “A2A” from UCSD: “horizontal benchmark extension” http://vlsicad.ucsd.edu/Publications/Conferences/313/c313.pdf
55
Benchmark: netcard
aSizer1 aSizer1 aSizer1
[GLSVLSI14]
Commercial sizer wins with foundry technologies (similar leakage, better timing slack, better runtime)
cSizer1: commercial sizer aSizer1: academic sizer
56
57
[Mark Zwolinski, ISPD2013]
58
[source: R. Jiang, Synopsys, 2005]
59
3D integration with wafer-to-wafer (die-to-wafer) bonding: integrate SS wafer/die with FF wafer/die (SS Tier 0 wafer/die + FF Tier 1 wafer, or FF Tier 0 wafer/die + SS Tier 1 wafer)
Figure: WNS (ps) for SS-SS, SS-FF, and FF-SS tier combinations; 75ps annotated
Mix-and-match
[DATE16]
60
Design    Clk period
M0        1.2ns
AES       1.1ns
VGA       1.0ns
Figure: WNS (ps) for ARM M0, AES, and VGA under Brute-force, Shrunk2D, and GT2012, each shown as (orig) and (opt)
Technology: 28FDSOI
61
Original layout dummy fill Final layout extension 1D wires Cut masks cut
[BACUS15]
62
extensions on timing
up to 196ps compared to N7
to timing
14ps difference
Figure: changes in WNS (ns), best and worst, for ARM Cortex M0, AES, and JPEG at N7 and N5
Figure: change in WNS (ns) for ARM Cortex M0, AES, and JPEG at target metal densities of 40%, 42.5%, and 45%
63
64
Figure: relations among design parameters and reliability mechanisms (EM, TDDB, NBTI, HCI). Parameters: AF (α), Jrms, temp, wire width, wire spacing, MTTF, driver size, supply voltage, timing slack, |ΔVthp|, |ΔVthn|, freq., slew rate, load/fanout, gate length, junction resistance
Legend: inverse relation (if A increases then B decreases); direct relation (if A increases then B increases); tunable at design or runtime; tunable at design
65
DB violation MinIW violation MinIW violation MinOW violation
flipped Cells are moved
66
67
68