[PPT] - A Learning Bridge from Architectural Synthesis to Physical Design PowerPoint Presentation

SLIDE 1

ISLPED’17

A Learning Bridge from Architectural Synthesis to Physical Design for Exploring Power Efficient High-Performance Adders

Subhendu Roy1 Yuzhe Ma2 Jin Miao1 Bei Yu2

1Cadence Design Systems 2The Chinese University of Hong Kong

1 / 23

SLIDE 2

ISLPED’17

Optimality across EDA stages

Architectural Synthesis Logic Synthesis Physical Design

No 1-1 mapping between metrics across various EDA stages.

◮ Optimality at one stage doesn’t guarantee the same in another stage ◮ Data-driven methodology, such as machine learning, becomes imminent

2 / 23

SLIDE 3

ISLPED’17

Binary Adder Design

◮ Primary building blocks in the datapath logic of a microprocessor ◮ A fundamental problem in VLSI industry for last several decades

What is still unsolved? Closing the gap across adder design stages

3 / 23

SLIDE 4

ISLPED’17

Parallel Prefix Adders

Parallel Prefix Adders → Flexible delay-power trade-off Regular Adders → Sub-optimal Custom Adders → High TAT

4 / 23

SLIDE 5

ISLPED’17

Parallel Prefix Adders

Parallel Prefix Adders → Flexible delay-power trade-off Regular Adders → Sub-optimal Custom Adders → High TAT This Work:

Automatic Cumtom Adders

4 / 23

SLIDE 6

ISLPED’17

Architectural Level: Mapped to Prefix Structures

g7,p7 g0,p0 g1,p1 a0 b0 a1 b1 a7 b7 a1 b1 a2 b2 a0 b0 a7 b7 s7 s0 s2 c1 c6 c0 Cout = c7 s1 Parallel Prefix Structure Pre-processing Post-processing

5 / 23

SLIDE 7

ISLPED’17

Prefix Graph Problem

Carry-computation can be mapped to prefix graph problem

yi = xi − 1 o xi−1 o xi−2 o . . . x1 o x0

y5 x5 x4 x3 x2 x1 x0

Size (s) = No. of prefix nodes = 7 Level (L) = maximum logic level = 3 Max-Fanout (mfo) = 2

6 / 23

SLIDE 8

ISLPED’17

Classifying Prefix Graph Synthesis

Can be classified based on the solution#

Category 1: Limited number of solutions

◮ Example: [Matsunaga+,GLSVLSI’07], [Liu+,ICCAD’03], [Zhu+,ASPDAC’05],

[Roy+,ASPDAC’15]

Not suitable for exploring data-driven methodologies
No analytical model to physical design stage

Category 2: Innumerable solutions

◮ Example: [Roy+,TCAD’14]

Not scalable for bounded fan-out
Computationally expensive to run all solutions through full physical design flow

7 / 23

SLIDE 9

ISLPED’17

Gap between Prefix Structure and Physical Design

180 190 200 210 220 230 240 5 10 15 20 25 30 35 Node Size Max Fanout G1 G2

(a)

1700 1800 1900 2000 2100 2200 2300 0.34 0.36 0.38 0.4 Area (µm2) Critical Delay (ns) G1 G2

(b)

(a) Architectural solution space; (b) Physical design space. ◮ G1 (less fan-out and high size); G2 (high fan-out and low size) ◮ When mapped to physical solution space

Correlation between size and area
Not completely reliable, G1 and G2 get mixed up in physical solution space

8 / 23

SLIDE 10

ISLPED’17

Gap between Prefix Structure and Physical Design

180 190 200 210 220 230 240 5 10 15 20 25 30 35 Node Size Max Fanout G1 G2

(a)

1700 1800 1900 2000 2100 2200 2300 0.34 0.36 0.38 0.4 Area (µm2) Critical Delay (ns) G1 G2

(b)

(a) Architectural solution space; (b) Physical design space. ◮ G1 (less fan-out and high size); G2 (high fan-out and low size) ◮ When mapped to physical solution space

Correlation between size and area
Not completely reliable, G1 and G2 get mixed up in physical solution space

What We Want to Search For:

All Pareto Frontier points with low area, low power, and low critical delay.

8 / 23

SLIDE 11

ISLPED’17

Task 1: Prefix Adder Solution Exploration

6000 6500 7000 7500 8000 320 340 360 380 400 420 Power (µw) Critical Delay (ps)

TCAD‘14

9 / 23

SLIDE 12

ISLPED’17

[Roy+,TCAD’14]– Summary

G2 G3 G3 G4 Gn+1 Gn

◮ Gn = set of prefix graphs of bit-width n ◮ Prefix graphs of higher order generated in bottom-up fashion ◮ Several pruning strategies during Gn → Gn+1 for scaling

For bounded fan-out, these strategies compromises in size-optimality

10 / 23

SLIDE 13

ISLPED’17

Enhancement 1: Imposing Semi-regularity

◮ The concept is derived from regular adders such as Brent-Kung, Sklansky. ◮ xi and xi+1 combined to form prefix nodes, where i is even. ◮ This regularity for only L = 1 ◮ For L > 1, regularity compromises size optimality (Forbidden). ◮ Observation: this semi-regularity doesn’t degrade size-optimality.

x7 x6 x5 x4 x3 x2 x1 x0

11 / 23

SLIDE 14

ISLPED’17

Enhancement 2: Level restriction in Non-trivial Fan-in

◮ Trivial fan-in having same MSB ◮ x4 and i1 are trivial and non-trivial fan-in of i2 ◮ Level (non-trivial fan-in) ≥ level (trivial fan-in) ◮ Reduces search space without degrading size-optimality

y5 i2 i1 x5 x5 x4 x3 x2 x1 x0

12 / 23

SLIDE 15

ISLPED’17

Comparison at Prefix Graph Stage

mfo

Our Approach [Roy+,TCAD’14] size Run-time (s) size Run-time (s) 4 244 302 252 241 6 233 264 238 212 8 222 423

12

201 193

16

191 73 192 149 32 185 0.04 185 0.04

◮ Table is for 64 bit adders ◮ [Roy+,TCAD’14] cannot get solutions for all fanouts. ◮ Our solutions are always more size-optimal. ◮ Runtimes are comparable, adder synthesis is one-time.

13 / 23

SLIDE 16

ISLPED’17

Physical Solution Space Comparison

6000 6500 7000 7500 8000 320 340 360 380 400 420 Power (µw) Critical Delay (ps)

TCAD‘14

Our solutions cover wider space in physical domain

◮ 7000 random samples from [Roy+,TCAD’14] vs. 3000 samples from us ◮ Reason: TCAD’14 misses solutions for bounded fanout in a few cases

14 / 23

SLIDE 17

ISLPED’17

Physical Solution Space Comparison

6000 6500 7000 7500 8000 320 340 360 380 400 420 Power (µw) Critical Delay (ps)

TCAD‘14 Ours

Our solutions cover wider space in physical domain

◮ 7000 random samples from [Roy+,TCAD’14] vs. 3000 samples from us ◮ Reason: TCAD’14 misses solutions for bounded fanout in a few cases

14 / 23

SLIDE 18

ISLPED’17

Task 2: Pareto Frontier Driven Learning

6000 6500 7000 7500 8000 340 360 380 400 420 440 Power(µw) Critical Delay(ps)

Real PF

Rep. Adder

15 / 23

SLIDE 19

ISLPED’17

Quasi-Random Data Sampling

◮ Hundreds of thousands of solutions ◮ How to choose training data?

Cannot run too many architectures as physical design flow costly.
Too few will degrade model accuracy.

Quasi-Random Sampling

Create architectural bins based on mfo and s.

◮ Capture all architectural bins ◮ Select solutions from each bin randomly

s=244 s=245 s=246 s=233 s=234 s=235 mfo=4 mfo=6 Bin of solutions with s=246 and mfo=4

16 / 23

SLIDE 20

ISLPED’17

Feature Selection and Learning Model

◮ Architectural attributes: s, mfo, sum-path-fanout (spfo) ◮ Tool settings: Target delay ◮ Best model fitting by support-vector-regression (SVR) with RBF kernel ◮ Including spfo improves MSE score for delay from 0.232 to 0.164 ◮ Note: linear models not sufficient for modeling delay

i1 y1 x0 x1 x2 x3

spfo(y1) = spfo(x0) + spfo(x1) + fo(x0) + fo(x1) = 0 + 0 + 1 + 1 = 2 spfo(i1) = spfo(x3) + spfo(x2) + fo(x3) + fo(x2) = 0 + 0 + 1 + 2 = 3 spfo(y3) = spfo(i1) + spfo(y1) + fo(i1) + fo(y1) = 3 + 2 + 1 + 2 = 8

17 / 23

SLIDE 21

ISLPED’17

Pareto Frontier Driven Learning

◮ Conventional learning focusses on prediction accuracy

Model accuracy improvement doesn’t guarantee Pareto-frontier improvement
Need for learning integrated Pareto-frontier exploration

◮ Scalarization or α-sweep

Learning output is a linear sum of delay and power (α×Power + Delay)
Model-fitting done with different values of alpha
Sweeping alpha from 0 to a large positive number

18 / 23

SLIDE 22

ISLPED’17

Experimental Setup

Synthesis and placement/routing of adders

◮ Tools: Design Compiler/ IC Compiler ◮ Library: Non-linear-delay-model (NLDM) in 32nm SAED cell-library ◮ Tool settings: Target delay = 0.1ns, 0.2ns, 0.3 ns

Programming Language

◮ C++ for prefix adder synthesis ◮ Python based machine learning package scikit-learn

Machine Configurations

◮ 72GB RAM UNIX machine ◮ 2.8GHz CPU

19 / 23

SLIDE 23

ISLPED’17

Pareto-frontier Comparison

6000 6500 7000 7500 8000 340 360 380 400 420 440 Power(µw) Critical Delay(ps)

Real PF Predicted PF

Rep. Adder

Predicted pareto-frontier almost matches actual pareto-frontier

◮ Training set is randomly selected from 300 samples. ◮ Rep. adders are quasi-random sampled from other 3000 samples ◮ Predicted frontier is from best 150 solutions (predicted)

20 / 23

SLIDE 24

ISLPED’17

Pareto-frontier Comparison

6000 6500 7000 7500 8000 340 360 380 400 420 440 Power(µw) Critical Delay(ps)

Real PF Predicted PF

Rep. Adder

1800 1900 2000 2100 2200 2300 340 360 380 400 420 Area(µm2) Critical Delay(ps)

Real PF Predicted PF

Rep. Adder

Predicted pareto-frontier almost matches actual pareto-frontier

◮ Training set is randomly selected from 300 samples. ◮ Rep. adders are quasi-random sampled from other 3000 samples ◮ Predicted frontier is from best 150 solutions (predicted)

20 / 23

SLIDE 25

ISLPED’17

Comparison with Other Adders

Pareto-points derived from our approach beats other solutions in all metrics (delay, area, power) Method Delay (ps) Area (µm2) Power (mW) Kogge-Stone 347.9 2563.7 8.78 Ours (P1) 340.0 2203.3 7.72 Sklansky 356.1 1792.5 6.1 Ours (P2) 353.0 1753.0 5.9 [Roy+,ASPDAC’15] 348.7 1971.4 6.98 Ours (P3) 346.0 1848.6 6.67

21 / 23

SLIDE 26

ISLPED’17

Conclusion

Machine learning guided design space exploration

◮ For power-efficient high-performance adders ◮ Bridge the gap between architectural and physical solution space ◮ Provide near-optimal power vs. delay trade-off

Our methodology excels

◮ State-of-the-art adder synthesis algorithms in power/delay/area metrics ◮ Readily adoptable for any cell-library

22 / 23

SLIDE 27

ISLPED’17

Thank You

Subhendu Roy (subhroy@cadence.com) Yuzhe Ma (yzma@cse.cuhk.edu.hk) Jin Miao (jmiao@cadence.com) Bei Yu (byu@cse.cuhk.edu.hk)

23 / 23