ISLPED’17
A Learning Bridge from Architectural Synthesis to Physical Design for Exploring Power Efficient High-Performance Adders
Subhendu Roy¹ Yuzhe Ma² Jin Miao¹ Bei Yu²
¹Cadence Design Systems ²The Chinese University of Hong Kong
1 / 23
Architectural Synthesis → Logic Synthesis → Physical Design
◮ Optimality at one stage does not guarantee optimality at another stage ◮ A data-driven methodology, such as machine learning, becomes imperative
2 / 23
◮ Primary building blocks in the datapath logic of a microprocessor ◮ A fundamental problem in the VLSI industry for the last several decades
3 / 23
4 / 23
5 / 23
6 / 23
◮ Examples: [Matsunaga+,GLSVLSI’07], [Liu+,ICCAD’03], [Zhu+,ASPDAC’05]
◮ Example: [Roy+,TCAD’14]
7 / 23
Figure: (a) Architectural solution space (Max Fanout vs. Node Size); (b) Physical design space (Critical Delay (ns) vs. Area (µm²)).
◮ G1 (lower fan-out, larger size); G2 (higher fan-out, smaller size)
◮ When mapped to the physical solution space, the two groups intermix, so separation in the architectural space does not carry over
8 / 23
Figure: Power (µW) vs. Critical Delay (ps) for [Roy+,TCAD’14] solutions.
9 / 23
Figure: bottom-up generation flow G2 → G3 → G4 → ⋯ → Gn → Gn+1.
◮ Gn = set of prefix graphs of bit-width n ◮ Prefix graphs of higher bit-width are generated in a bottom-up fashion ◮ Several pruning strategies are applied during Gn → Gn+1 for scalability
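The bottom-up frame above can be sketched as follows. This is a toy, attribute-level model, not the paper's node-level construction: a graph is abstracted to a (size, level, max fan-out) tuple, and `extend` is a hypothetical stand-in for adding the new MSB column, while `prune` illustrates one possible pruning strategy (dominance pruning).

```python
def dominated(a, b):
    """True if candidate b is at least as good as a in every metric and
    differs somewhere (tuples are (size, level, max_fanout))."""
    return b != a and all(y <= x for x, y in zip(a, b))

def prune(cands):
    """Dominance pruning: drop duplicates and dominated candidates."""
    cands = set(cands)
    return [c for c in cands if not any(dominated(c, d) for d in cands)]

def extend(g):
    """Hypothetical extension for the new MSB column: either one extra
    node at a deeper level, or two extra nodes with higher fan-out."""
    size, level, mfo = g
    return [(size + 1, level + 1, mfo), (size + 2, level, mfo + 1)]

def synthesize(n_bits):
    """Bottom-up generation G_1 -> G_2 -> ... -> G_n with pruning."""
    gens = [(0, 0, 1)]  # G_1: a single input column, no prefix nodes yet
    for _ in range(1, n_bits):
        gens = prune([h for g in gens for h in extend(g)])
    return gens
```

Pruning keeps each generation small, so the enumeration scales to larger bit-widths instead of growing exponentially.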
10 / 23
◮ The concept is derived from regular adders such as Brent–Kung and Sklansky. ◮ xi and xi+1 are combined to form prefix nodes, where i is even. ◮ Full regularity is enforced only at level L = 1. ◮ For L > 1, enforcing regularity compromises size optimality (forbidden). ◮ Observation: this semi-regularity does not degrade size optimality.
11 / 23
◮ The trivial fan-in of a node is the fan-in sharing its MSB ◮ x4 and i1 are the trivial and non-trivial fan-ins of i2, respectively ◮ Pruning rule: level(non-trivial fan-in) ≥ level(trivial fan-in) ◮ Reduces the search space without degrading size optimality
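The level rule above amounts to a simple filter over candidate nodes; the (trivial_level, nontrivial_level) pairs below are hypothetical examples, not data from the paper.

```python
def keep_candidate(trivial_level, nontrivial_level):
    """Level pruning: keep a candidate prefix node only when its
    non-trivial fan-in sits at a level >= that of its trivial fan-in."""
    return nontrivial_level >= trivial_level

# Example: filter hypothetical (trivial_level, nontrivial_level) pairs.
candidates = [(1, 2), (2, 2), (3, 1)]
kept = [c for c in candidates if keep_candidate(*c)]
```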
12 / 23
◮ The table is for 64-bit adders ◮ [Roy+,TCAD’14] cannot produce solutions for all fan-out bounds ◮ Our solutions are consistently more size-optimal ◮ Runtimes are comparable, and adder synthesis is a one-time cost
13 / 23
Figure: sampled solution spaces, [Roy+,TCAD’14] vs. Ours.
◮ 7,000 random samples from [Roy+,TCAD’14] vs. 3,000 samples from ours ◮ Reason: TCAD’14 misses solutions under bounded fan-out in a few cases
14 / 23
Figure: real Pareto frontier (Real PF), Power (µW) vs. Critical Delay (ps).
15 / 23
◮ Hundreds of thousands of solutions ◮ How to choose training data?
◮ Capture all architectural bins ◮ Select solutions from each bin randomly
Figure: solutions binned by size s (e.g., s = 233 … 246) and max fan-out mfo (e.g., mfo = 4, 6); highlighted: the bin of solutions with s = 246 and mfo = 4.
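The bin-based selection above can be sketched as stratified sampling; the `s`/`mfo` dictionary keys below are hypothetical field names standing in for the size and max fan-out attributes.

```python
import random
from collections import defaultdict

def stratified_sample(solutions, per_bin=2, seed=0):
    """Group solutions into (s, mfo) bins and draw up to `per_bin`
    solutions at random from each bin, so every architectural region
    is represented in the training set."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for sol in solutions:
        bins[(sol["s"], sol["mfo"])].append(sol)
    sample = []
    for members in bins.values():
        sample.extend(rng.sample(members, min(per_bin, len(members))))
    return sample

# Toy data: two bins of five solutions each.
sols = ([{"s": 246, "mfo": 4, "id": i} for i in range(5)]
        + [{"s": 245, "mfo": 6, "id": i} for i in range(5, 10)])
train = stratified_sample(sols, per_bin=2)
```

Sampling per bin rather than uniformly over all solutions prevents densely populated architectural regions from dominating the training set.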
16 / 23
◮ Architectural attributes: size (s), max fan-out (mfo), sum-path fan-out (spfo) ◮ Tool settings: target delay ◮ Best model fit obtained with support vector regression (SVR) using an RBF kernel ◮ Including spfo improves the delay MSE score from 0.232 to 0.164 ◮ Note: linear models are not sufficient for modeling delay
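A minimal sketch of the model fit, assuming scikit-learn (named later in the deck) and synthetic data in place of the real post-layout results; the feature ranges and the toy delay formula are invented for illustration only.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Features per adder: size s, max fan-out mfo, sum-path fan-out spfo,
# and the tool's target delay (ns); ranges are illustrative only.
X = rng.uniform([230, 4, 10, 0.1], [250, 12, 40, 0.3], size=(200, 4))
# Toy nonlinear stand-in for post-P&R critical delay (ps):
y = 300 + 2.0 * X[:, 1] + 0.5 * X[:, 2] + 100 * X[:, 3] ** 2

# SVR with an RBF kernel; features are standardised first, since the
# RBF kernel is sensitive to feature scale.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0))
model.fit(X, y)
pred = model.predict(X)
```

The same pipeline would be fit once per target metric (delay, power, area), with the target delay setting included as an input feature.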
17 / 23
◮ Conventional learning focuses on prediction accuracy; here the goal is recovering the Pareto frontier
◮ Scalarization (α-sweep): minimize α · delay + (1 − α) · power and sweep α over [0, 1] to trace the frontier
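The scalarization step can be sketched as follows; the (power, delay) tuples are hypothetical model predictions, pre-normalised to [0, 1] so the two objectives are comparable.

```python
def alpha_sweep(solutions, steps=11):
    """Trace a predicted Pareto frontier by minimising
    alpha * delay + (1 - alpha) * power for a sweep of alpha in [0, 1].
    Each solution is a (power, delay) pair of model predictions."""
    frontier = set()
    for k in range(steps):
        alpha = k / (steps - 1)
        best = min(solutions, key=lambda s: alpha * s[1] + (1 - alpha) * s[0])
        frontier.add(best)
    return sorted(frontier)

# Example: the dominated point (0.8, 0.8) is never a minimiser.
points = [(0.2, 0.9), (0.5, 0.5), (0.9, 0.2), (0.8, 0.8)]
pf = alpha_sweep(points)
```

Note that a linear scalarization only recovers points on the convex hull of the frontier; sweeping α finely mitigates, but does not eliminate, this limitation.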
18 / 23
◮ Tools: Design Compiler / IC Compiler ◮ Library: non-linear delay model (NLDM) in the 32nm SAED cell library ◮ Tool settings: target delay = 0.1 ns, 0.2 ns, 0.3 ns
◮ C++ for prefix adder synthesis ◮ Python-based machine learning package scikit-learn
◮ UNIX machine with 72 GB RAM ◮ 2.8 GHz CPU
19 / 23
Figure: real vs. predicted Pareto frontiers (Real PF, Predicted PF), Power (µW) vs. Critical Delay (ps).
◮ The training set is randomly selected from 300 samples ◮ Representative adders are quasi-randomly sampled from the other 3,000 samples ◮ The predicted frontier is formed from the 150 best solutions as predicted by the model
20 / 23
Figure: real vs. predicted Pareto frontiers (Real PF, Predicted PF), Area (µm²) vs. Critical Delay (ps).
21 / 23
◮ A learning bridge for power-efficient, high-performance adders ◮ Bridges the gap between the architectural and physical solution spaces ◮ Provides a near-optimal power vs. delay trade-off
◮ Outperforms state-of-the-art adder synthesis algorithms in power/delay/area metrics ◮ Readily adoptable for any cell library
22 / 23
23 / 23