Graph Sparsification Approaches to Scalable Integrated Circuit Modeling and Simulations (PowerPoint PPT Presentation)


Slide 1

Graph Sparsification Approaches to Scalable Integrated Circuit Modeling and Simulations

Zhuo Feng

ICSICT, Oct, 2014 Design Automation Group

Acknowledgements: My PhD students Xueqian Zhao (MTU) and Lengfei Han (MTU)

Slide 2

Scalable SPICE-Accurate IC Simulations

Figure: original circuit with analog and digital blocks: a regulator stage (Vin, Mp, Vref, Rf1, Rf2, Cout, Vout, Iout, error amp, current amp, Cf) supplying digital circuit blocks through voltage regulators (VRs).

  • Motivation

– Integrated circuit (IC) systems involving billions of transistors and interconnect components must be modeled and analyzed accurately

  • Challenges in large-scale SPICE-accurate IC simulations

– Computational cost grows rapidly with traditional direct solution methods
– Iterative solution methods need to be robust and efficient for general tasks

Power Delivery Network (PDN) w/ Embedded Voltage Regulators (VRs)

Slide 3

Background of SPICE Simulation Algorithms

  • Standard SPICE simulators rely on Newton-Raphson (NR) method

– Step 1: linearize the nonlinear devices (transistors, diodes, etc.)
– Step 2: update the solution through NR iterations

  G(x^k) = ∂f(x)/∂x |_{x = x^k},   C(x^k) = ∂q(x)/∂x |_{x = x^k}

  F(x) = f(x(t)) + d/dt q(x(t)) + u(t) = 0

  • Problem formulation

– Nonlinear differential equations
– f(·) and q(·) denote the static and dynamic nonlinearities, respectively

G and C together form the Jacobian of F(x)
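The two NR steps can be sketched on a hypothetical one-unknown circuit: a diode (assumed parameters Is and Vt) in series with a resistor R to a supply Vdd; all values are made up for illustration, not taken from the deck:

```python
import math

# F(v) = (v - Vdd)/R + Is*(exp(v/Vt) - 1) = 0  (KCL at the diode node)
Vdd, R, Is, Vt = 1.0, 1e3, 1e-14, 0.025   # assumed example values

def F(v):
    return (v - Vdd) / R + Is * (math.exp(v / Vt) - 1.0)

def J(v):
    # Jacobian of F: linearized conductance of resistor + diode
    return 1.0 / R + Is / Vt * math.exp(v / Vt)

v = 0.5                         # initial guess
for _ in range(50):             # NR iteration: v <- v - F(v)/J(v)
    dv = -F(v) / J(v)
    v += dv
    if abs(dv) < 1e-12:
        break

print(round(v, 2))              # → 0.61 (converged diode voltage)
```

Each iteration linearizes the diode at the current solution point (Step 1) and solves the resulting linear system for the update (Step 2), exactly the structure a SPICE engine applies to the full MNA system.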

Slide 4

Prior Works

  • Direct and iterative solvers have been used in SPICE simulations

– Direct solver: LU decomposition (KLU [1])
– Expensive for large-scale post-layout IC problems due to rapidly growing memory and runtime cost
– Krylov-subspace iterative methods: GMRES [2]
– Pros: black-box solver, good memory efficiency, high parallelism
– Cons: problem-dependent convergence properties, potentially worse runtime

– ILU and domain-decomposition based preconditioners, etc

References:
[1] T. Davis, et al. Algorithm 907: KLU, a direct sparse solver for circuit simulation problems. ACM Trans. Math. Softw., 2010.
[2] Y. Saad, et al. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 1986.
[3] D. A. Spielman, et al. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. ACM STOC, 2004.
[4] M. Bern, et al. Support-graph preconditioners. SIAM J. Matrix Anal. Appl., 2006.

  • Our contribution: a circuit-oriented preconditioning approach

– Novel circuit-oriented preconditioners (compared to matrix-oriented ones)
– Rigorous mathematical foundation: graph sparsification research [3-4]
– Consistent performance when solving transistor-level nonlinear circuits

Slide 5

Graph Sparsification Techniques

  • Graph sparsification basics

– Find a subgraph P approximating the original graph G in some measure (pairwise distances, cut values, graph Laplacian, etc.)
– Maintain the same set of vertices so that P can be used as a proxy for G in numerical computations without introducing much error
– A good graph sparsifier should keep very few edges to limit computation and storage cost

Figure source: I. Koutis, G. L. Miller and R. Peng. A fast solver for a class of linear systems. Commun. ACM, 2012.

Figure: the original graph G and its sparsifier P.
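As a minimal sketch (toy graph with assumed weights), the simplest such sparsifier is a maximum-weight spanning tree extracted with Kruskal's algorithm; edge weights play the role of conductances:

```python
def max_spanning_tree(n, edges):
    """edges: list of (u, v, w); returns the kept tree edges."""
    parent = list(range(n))

    def find(a):                            # union-find with path halving
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree = []
    for u, v, w in sorted(edges, key=lambda e: -e[2]):  # heaviest first
        ru, rv = find(u), find(v)
        if ru != rv:                                    # no cycle formed
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

# 4-node example graph G with 5 weighted edges
G_edges = [(0, 1, 9), (0, 2, 1), (1, 2, 8), (2, 3, 3), (1, 3, 2)]
P = max_spanning_tree(4, G_edges)
print(len(P))     # → 3 (a tree on 4 vertices keeps n-1 edges)
```

Keeping the heaviest edges preserves the strongest couplings of G while reducing the edge count to n − 1, the extreme point of the sparsity/quality tradeoff discussed on the following slides.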

Slide 6

  • Support-graph preconditioner (SGP)

– Example: find a spanning tree of the original graph
– Compute matrix factors without introducing any fill-ins for the spanning tree

  • The condition number of P⁻¹G can be greatly reduced

Support-Graph Preconditioner

Figure: a 9-node weighted example graph G and its spanning-tree support graph P, with the corresponding Laplacian-type matrices (diagonal entries d_i, off-diagonal entries given by the edge weights).
Eigenvalues (1st-6th) and condition numbers:

Matrix | 1st    | 2nd    | 3rd    | 4th    | 5th   | 6th   | cond
G      | 26.170 | 23.182 | 17.572 | 11.514 | 9.373 | 6.673 | 135.948
P      | 25.239 | 23.540 | 17.579 | 10.909 | 9.865 | 6.822 | 16.752
P⁻¹G   | 1.431  | 1.204  | 1.062  | 1.000  | 1.000 | 1.000 | 17.442

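A small numerical illustration of the same effect (assumed 4-node toy graph, node 0 grounded so the reduced Laplacians are SPD; the data is made up, not the slide's 9-node example):

```python
import numpy as np

def grounded_laplacian(n, edges, ground=0):
    """Weighted graph Laplacian with the ground row/column removed."""
    L = np.zeros((n, n))
    for u, v, w in edges:
        L[u, u] += w; L[v, v] += w
        L[u, v] -= w; L[v, u] -= w
    keep = [i for i in range(n) if i != ground]
    return L[np.ix_(keep, keep)]

G_edges = [(0, 1, 9.0), (1, 2, 8.0), (2, 3, 3.0), (1, 3, 2.0), (0, 2, 1.0)]
T_edges = G_edges[:3]                     # spanning-tree subset of G

G = grounded_laplacian(4, G_edges)
P = grounded_laplacian(4, T_edges)

eigs = np.linalg.eigvals(np.linalg.solve(P, G)).real
print(round(eigs.min(), 3))               # → 1.0 (P "supports" G)
print(np.linalg.cond(np.linalg.solve(P, G)) < np.linalg.cond(G))  # → True
```

Because P is a subgraph of G with the same weights, x^T P x ≤ x^T G x for every x, so all eigenvalues of P⁻¹G are at least 1 and the preconditioned system is much better conditioned than G itself.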

Slide 7

  • A naïve support-circuit preconditioner (SCP)

– Sparsifies the linear networks of the original circuit network
– Takes advantage of existing sparse matrix techniques (Cholesky, LU, etc.)
– Nearly-linear complexity for analyzing nanoscale (parasitics-dominated) ICs
– E.g. clock networks, power delivery networks, etc.

Support-Circuit Preconditioner

Figure: the support-circuit preconditioner is the support graph of the original network, retaining the digital circuit blocks and voltage regulators (VRs) on a sparsified interconnect.

Slide 8

  • General-purpose support-circuit preconditioner (GPSCP)

– Extracts a sparsified network from the linearized circuit of the original circuit
– Leverages existing sparse matrix solution techniques
– Nearly-linear complexity for analyzing more general nonlinear circuit systems

Support-Circuit Preconditioner (Cont.)

Figure: a nonlinear circuit (a MOSFET with terminals d/g/s and resistors R1-R5) is linearized into a small-signal circuit (elements gds, Cds, Cgs, Cgd, the controlled source gmVgs, and conductances g1-g5), from which the support circuit is extracted.

Slide 9

Figure: nonlinear circuit (a MOSFET with terminals d/g/s and resistors R1-R5).

Support-Circuit Preconditioner Extraction (1)

  • Directed weighted graph corresponding to a linearized circuit

– Can be obtained around a solution point during NR iterations
– Will be sparsified through graph decomposition and sparsification

Figure: preconditioner extraction flow. (1) The linearized circuit (gds, Cds, Cgs, Cgd, gmVgs, g1-g5) is mapped to a directed weighted graph, with capacitors entering as C/h for time step h; (2) the directed graph is symmetrized into an undirected weighted graph; (3) a support graph is extracted by sparsification.

Slide 10

Figure: the controlling sources (gmVgs) are kept alongside the support graph (gds, Cds/h, Cgd/h, g1-g3, g5).

Support-Circuit Preconditioner Extraction (2)

  • Support-circuit preconditioner extraction

– Combine the support graph with other components (e.g. controlling sources)
– Factor the Jacobian matrix of the support circuit to create the preconditioner

Figure: the resulting support circuit and the general-purpose support circuit, in which per-block support circuits (Spt-CKT) are stitched into the overall network.

Slide 11

Quality Quantification of Support Graph Preconditioners

  • Convergence of support-graph preconditioners

– Convergence depends on the condition number of the matrix pencil (G, P)
– The support of the pencil (G, P) is defined as:

  σ(G, P) = min{ τ | x^T (τP − G) x ≥ 0 for all x ∈ ℝ^n }

– The eigenvalues of the pencil (G, P) are bounded, giving the condition number

  κ(G, P) = λ_max(G, P) / λ_min(G, P)

– A smaller κ(G, P) means faster convergence

  • Spanning-tree support graph as a preconditioner

– May require many iterations to converge if the support σ (the mismatch) is too large
– σ can be estimated by comparing the Joule heating of the two resistive networks:

  power dissipated by G: x^T G x;  power dissipated by P: x^T P x
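A small numerical check of these definitions (toy grounded SPD Laplacians with made-up weights): the support is the largest generalized eigenvalue of G x = λ P x, and the Joule-heating ratio x^T G x / x^T P x of any test vector bounds it from below:

```python
import numpy as np

G = np.array([[19., -8., -2.],
              [-8., 12., -3.],
              [-2., -3.,  5.]])        # full resistive network
P = np.array([[17., -8.,  0.],
              [-8., 11., -3.],
              [ 0., -3.,  3.]])        # spanning-tree subnetwork

# support sigma(G, P): largest generalized eigenvalue of (G, P)
sigma = np.linalg.eigvals(np.linalg.solve(P, G)).real.max()

rng = np.random.default_rng(0)
ratios = []
for _ in range(2000):
    x = rng.standard_normal(3)
    ratios.append(x @ G @ x / (x @ P @ x))   # Joule heating in G vs. P

print(max(ratios) <= sigma + 1e-9)           # → True (Rayleigh bound)
print(min(ratios) >= 1.0 - 1e-9)             # → True (P is a subgraph of G)
```

Every random ratio lies in [1, σ]: at least 1 because removing edges can only lower the dissipated power, and at most σ by the generalized Rayleigh-quotient bound, which is why a small σ guarantees fast preconditioned convergence.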

Slide 12

Ultra-Sparsifier Support Graph (1)

  • Ultra-sparsifier (non-tree) support graphs

– An ultra-sparsifier contains at most n − 1 + k edges (a spanning tree plus k extra edges)
– It is k-ultra-sparse and approximates the original graph with high probability [1]
– Adding extra edges to the spanning tree better approximates the original graph (e.g. its eigenvalues and power dissipation)

Figure: a spanning tree vs. an ultra-sparsifier (spanning-tree edges plus a few extra edges).

[1] D. A. Spielman and S. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proc. ACM STOC, 2004.

Slide 13

Ultra-Sparsifier Support Graph (2)

  • Sparsity control of an ultra-sparsifier support graph

– Provides a tradeoff between the quality and efficiency of preconditioners
– The weighted degree of a vertex v in a graph A is defined as:

– Example: for a 2D mesh grid, 1 ≤ wd(v) ≤ 4
– If wd(v) → 1: one dominant edge
– If wd(v) → 4: four evenly critical edges

  wd(v) = vol(v) / max_{u ∈ neighbor(v)} w(u, v)

vol(v): total edge weight incident to node v; w(u, v): the weight of the edge connecting nodes u and v
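The definition can be checked with a tiny helper (toy graphs; `weighted_degrees` is a hypothetical name, not from the deck):

```python
from collections import defaultdict

def weighted_degrees(edges):
    """edges: list of (u, v, w); returns {node: wd(node)}."""
    inc = defaultdict(list)                 # incident edge weights per node
    for u, v, w in edges:
        inc[u].append(w)
        inc[v].append(w)
    # wd(v) = vol(v) / max incident weight
    return {v: sum(ws) / max(ws) for v, ws in inc.items()}

# interior node 0 of a uniform 2D mesh: four unit edges -> wd = 4
mesh = [(0, 1, 1.0), (0, 2, 1.0), (0, 3, 1.0), (0, 4, 1.0)]
print(weighted_degrees(mesh)[0])            # → 4.0

# one dominant edge -> wd close to 1
dom = [(0, 1, 100.0), (0, 2, 0.1), (0, 3, 0.1)]
print(round(weighted_degrees(dom)[0], 3))   # → 1.002
```

The two cases reproduce the slide's extremes: a node with evenly critical edges has wd near its degree, while a node dominated by one edge has wd near 1 and can safely drop its weak edges.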

Slide 14

Ultra-Sparsifier Support Graph (3)

  • Iterative ultra-sparsifier support graph construction

– Define θ as the matching-factor threshold (0 < θ < 1) on the node weighted degree
– Step 1: compute the weighted degree wd of each node in the original graph A
– Step 2: compute the support graph A' with weighted degree wd'
– Step 3: recover edges into A' until wd'/wd > θ for each node in the support graph A'
– Step 4: return the final ultra-sparsifier support graph A' for support-circuit preconditioning

Figure: edges are recovered while wd'/wd < θ; once wd'/wd > θ everywhere, the spanning tree plus the extra edges forms the ultra-sparsifier.
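The four steps can be sketched as follows (toy graph; the tie-breaking rule, always recover the heaviest missing edge touching a violating node, is an assumption for illustration):

```python
from collections import defaultdict

def weighted_degree(edges):
    inc = defaultdict(list)
    for u, v, w in edges:
        inc[u].append(w); inc[v].append(w)
    return {v: sum(ws) / max(ws) for v, ws in inc.items()}

def max_spanning_tree(n, edges):
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]; a = parent[a]
        return a
    tree = []
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

def ultra_sparsifier(n, edges, theta):
    wd = weighted_degree(edges)                  # Step 1: wd in A
    A1 = max_spanning_tree(n, edges)             # start from a spanning tree
    pool = sorted(set(edges) - set(A1), key=lambda e: -e[2])
    while True:
        wd1 = weighted_degree(A1)                # Step 2: wd' in A'
        bad = {v for v in wd if wd1.get(v, 0.0) / wd[v] <= theta}
        if not bad or not pool:
            break
        for i, (u, v, w) in enumerate(pool):     # Step 3: recover an edge
            if u in bad or v in bad:
                A1.append(pool.pop(i))
                break
        else:
            break
    return A1                                    # Step 4: ultra-sparsifier

edges = [(0, 1, 9.0), (0, 2, 1.0), (1, 2, 8.0), (2, 3, 3.0), (1, 3, 2.0)]
sp = ultra_sparsifier(4, edges, theta=0.7)
print(len(sp))   # → 4: spanning tree (3 edges) plus one recovered edge
```

With θ = 0.7 only node 3 violates the threshold (its tree wd'/wd is 0.6), so exactly one extra edge touching it is restored; raising θ would recover more edges and densify the preconditioner.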

Slide 15

Performance Model Guided Sparsification

  • Runtime performance model can help find the optimal θ

– Which is better: a denser or sparser support graph?

Total runtime:  T_tot = N · T_GMRES + T_LU

– Denser preconditioner: greater LU factorization time T_LU, but fewer GMRES iterations N
– Sparser preconditioner: less LU factorization time T_LU, but more GMRES iterations N

Goal: minimize T_tot by finding a proper matching-factor threshold θ.
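The tradeoff can be sketched with a hypothetical cost model (all constants are made up; the real flow uses measured or symbolically predicted costs):

```python
# T_tot(theta) = N(theta) * T_GMRES + T_LU(theta)
# A larger theta recovers more edges, so T_LU grows while N shrinks.

T_GMRES = 0.4                      # assumed cost per GMRES iteration (s)

def t_lu(theta):                   # denser preconditioner: costlier LU
    return 50.0 * theta ** 1.5

def n_iters(theta):                # denser preconditioner: fewer iterations
    return max(5.0, 200.0 * (1.0 - theta))

def t_tot(theta):
    return n_iters(theta) * T_GMRES + t_lu(theta)

thetas = [i / 100 for i in range(10, 100)]
best = min(thetas, key=t_tot)
# the optimum beats both extremes of the sweep
print(t_tot(best) <= t_tot(0.1) and t_tot(best) <= t_tot(0.99))  # → True
```

Under any model of this monotone shape the optimum sits strictly between the sparsest and densest settings, which is exactly what the performance-model-guided selection of θ exploits.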

Slide 16

Finding the Optimal Weighted Degree Threshold θ

  • Optimal weighted degree threshold θ

– Exploit symbolic matrix factorization results to quickly identify the optimal θ
– E.g. find the θ that maximizes the change in the flop count of Cholesky factorization

Slide 17

Performance Modeling Results

  • Experimental results on the IBM power grid benchmarks

Runtime and flops vs. weighted degree threshold θ

Runtime results of manual and automatic sparsification schemes

Slide 18

Test Cases for Experiments

CKT  | #nunk | #Mos | #R  | #C   | #L  | #I
ldo1 | 3M    | 84K  | 6M  | 250K | 7K  | 250K
ldo2 | 5M    | 71K  | 10M | 422K | 12K | 422K
pg1  | 3M    | 144  | 6M  | 250K | 7K  | 250K
pg2  | 6M    | 144  | 11M | 490K | 14K | 490K
clk1 | 3M    | 65K  | 6M  | 3M   | -   | -
clk2 | 6M    | 65K  | 11M | 6M   | -   | -

  • Circuit Design Parameters:
  • #nunk: number of unknowns in the circuits
  • #Mos: number of MOSFETs
  • #R: number of resistors
  • #L: number of inductors
  • #C: number of capacitors
  • #I: number of current sources

Three Circuit Design Types:

  • ldo: large PDNs with on-chip VRs
  • pg: large PDNs with power gating
  • clk: clock distribution network
Slide 19

Results of Performance Model Guided Sparsification

  • Experimental results for a large PDN with multiple VRs

– The performance-model-guided sparsification approach achieves nearly-optimal runtime

Figure: runtime of a single NR step using different θ.

Slide 20

Experimental Results

  • Runtime comparison for transient analysis (100 time steps):

CKT  | #NR | Direct Time(s) | GPSCP #GMRES | GPSCP Time(s) | Speedup
ldo1 | 237 | 279,629 | 4,130 | 15,368 | 18X
ldo2 | 314 | -       | 3,979 | 23,793 | -
pg1  | 222 | 108,784 | 3,381 | 10,204 | 11X
pg2  | 421 | 185,892 | 3,478 | 14,206 | 13X
clk1 | 132 | 50,688  | 1,452 | 3,493  | 14X
clk2 | 219 | 112,497 | 2,555 | 8,001  | 14X

  • Memory comparison:

CKT  | Direct | GPSCP (mem/reduction)
ldo1 | 4.2GB  | 0.8GB / 5X
ldo2 | -      | 1.1GB / -
pg1  | 3.2GB  | 0.8GB / 4X
pg2  | 7.8GB  | 1.6GB / 5X
clk1 | 4.3GB  | 0.8GB / 5X
clk2 | 10.0GB | 1.4GB / 7X

Slide 21

Experimental Results (2)

  • A large PDN with multiple embedded VRs
Slide 22

RF Simulation Methods

  • For nonlinear RF circuits, output is usually quasi-periodic

– SPICE may require simulating many periods to reach steady state
– Time-domain shooting methods cannot handle distributed devices

  • Harmonic Balance (HB) analysis for steady-state RF simulation

– HB analysis can capture the steady-state spectral response directly
– Harmonic balance also refers to balancing the currents between the linear and nonlinear portions at every harmonic frequency

Figure: a nonlinear circuit driven by cos(ωt) produces an output v that may contain frequencies other than ω (time-domain waveform in ps, frequency-domain spectrum in MHz/dB).

Slide 23

HB Analysis of RF Circuits

  • Non-autonomous circuit analysis [1]

  f(x(t)) + (d/dt) q(x(t)) + ∫ y(t − τ) x(τ) dτ + b(t) = 0

– x(t): state variables
– y(t): impulse response function of the linear circuit components
– q(·): dynamic nonlinearities
– f(·): static nonlinearities
– b(t): time-dependent excitation sources
– x(t), q(·), and f(·) are typically periodic functions

[1] K. S. Kundert and A. Sangiovanni-Vincentelli. Simulation of Nonlinear Circuits in the Frequency Domain, IEEE Trans. CAD, 1986

Slide 24

HB Analysis of RF Circuits (2)

  • HB Jacobian matrix (frequency domain)

– Γ and Γ⁻¹ represent the Fast Fourier Transform (FFT) and the Inverse Fast Fourier Transform (IFFT), respectively
– G and C denote the linearizations of f(·) and q(·) at s time-domain sample points (s = 2k + 1, where k is the number of positive frequencies)
– J_hb includes many dense blocks introduced by Γ G Γ⁻¹ and Γ C Γ⁻¹

  J_hb = Γ G Γ⁻¹ + j 2π f Ω Γ C Γ⁻¹ + Y

  G = diag(∂f/∂x|_{t_1}, ∂f/∂x|_{t_2}, …, ∂f/∂x|_{t_s}),   C = diag(∂q/∂x|_{t_1}, …, ∂q/∂x|_{t_s})

  Ω = diag(−kI, …, −I, 0, I, …, kI)
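Why those blocks are dense can be checked numerically: conjugating a diagonal time-domain linearization by the DFT matrix (a toy stand-in for Γ, made-up sample values) yields a dense circulant block:

```python
import numpy as np

s = 5                                   # s = 2k+1 sample points (k = 2)
Gamma = np.fft.fft(np.eye(s)) / s       # DFT matrix acting as Gamma
Gamma_inv = np.linalg.inv(Gamma)

g = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])  # time-varying conductance samples
Gf = Gamma @ g @ Gamma_inv              # frequency-domain block of J_hb

# circulant: each row is a cyclic shift of the previous one
first = Gf[0]
circ = np.array([np.roll(first, i) for i in range(s)])
print(np.allclose(Gf, circ))            # → True
```

Pointwise multiplication in time becomes circular convolution across harmonics in frequency, so a perfectly sparse time-domain diagonal turns into a fully dense circulant block, which is the structural reason HB Jacobians are expensive to factor directly.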

Slide 25

Challenges in Harmonic Balance (HB) Analysis

  • Direct Methods for RF HB circuit simulation (A. Mehrotra et al, DAC’09)

– Challenged by solving large yet non-sparse Jacobian matrices
– Cons: computational/memory cost grows quickly with circuit size

  • Traditional iterative methods for HB analysis (P. Feldmann et al, CICC’96; W. Dong et al, TCAD’09)

– Pros: black-box, matrix-oriented, memory-efficient
– E.g. ILU preconditioners, domain-decomposition preconditioners
– Cons: inefficient/unreliable for strongly nonlinear RF systems

  Γ · diag(g_{t_1}, g_{t_2}, …, g_{t_s}) · Γ⁻¹ = circulant(G_1, G_2, …, G_s)

  where [G_1, G_2, …, G_s]^T = FFT([g_{t_1}, g_{t_2}, …, g_{t_s}]^T)

Dense circulant matrices arise from the FFT/IFFT operations.

Slide 26

Graph Sparsification Approach to HB Analysis

  • From graph sparsification to Jacobian matrix sparsification

– Modified nodal analysis (MNA) matrix reduction: 20%-38% fewer entries
– Fill-ins during LU reduction: 60% fewer
– LU factorization speedup: 50X

Figure: before graph sparsification, the MNA matrix maps to an HB Jacobian whose LU factorization produces many (block) fill-ins; after sparsification, both the MNA matrix and the HB Jacobian factor with far fewer fill-ins.

Slide 27

Conclusion

  • Graph sparsification approaches to circuit simulations

– MNA matrix decomposition into Laplacian and complement matrices
– Performance-guided graph sparsification of the Laplacian matrix
– Support-circuit preconditioner construction

  • Our preliminary results

– Highly reliable convergence for time- and frequency-domain simulations
– Up to 18X (21X) speedup and 7X (6X) memory reduction for time-domain (frequency-domain) simulations

– Scalable to large post-layout integrated circuits

  • Future work

– Will explore spectral graph sparsification methods
– Will exploit heterogeneous CPU-GPU computing platforms

Slide 28

Nonlinear Devices Evaluation in HB

  • Evaluation of nonlinear devices

– Freq → Time: terminal voltage waveforms
– Time domain: evaluate current (and derivative) waveforms
– Time → Freq: currents (derivatives) in the frequency domain

Terminal voltage spectrum → IFFT/IAPDFT → terminal voltage samples → device evaluation → Ids samples → FFT/APDFT (Almost-Periodic DFT) → Ids spectrum

  • Terminal voltage samples

– Need samples at 2k + 1 time points (k is the number of positive frequencies) according to the Nyquist-Shannon sampling theorem.

Slide 29

Support-Circuit Preconditioner for HB Analysis

  • Step 1: MNA matrix decomposition of linearized RF circuit

– Laplacian Matrix (P): passive devices such as resistors, capacitors, etc

– Complement Matrix (A): active devices such as transconductances, etc

Figure: an RF circuit (M1, R1, R2, L1, L2, C1, C2) is linearized at sample times t_1 … t_s; each linearized circuit is split into a Laplacian matrix P_ti (passive devices) and a complement matrix A_ti (active devices such as gmVgs), where t_1 … t_s are the s sampled time points.

Slide 30

Support-Circuit Preconditioner for HB Analysis (2)

  • Step 2: Representative Laplacian matrix construction

– Different sampled time points have different entry values
– Normalize the scaled Laplacian matrices of all sampled time points, then average them

Figure: P_t1, P_t2, …, P_ts → normalize → average → representative Laplacian matrix.
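A minimal sketch of this step (toy matrices; the normalization choice, dividing by the largest diagonal entry, is an assumption for illustration):

```python
import numpy as np

def representative_laplacian(laplacians):
    """Normalize each sampled-time Laplacian, then average, so that no
    single time point dominates the representative matrix."""
    normed = [L / np.abs(np.diag(L)).max() for L in laplacians]
    return sum(normed) / len(normed)

# two sampled time points with the same topology but scaled entries
P_t = [np.array([[ 2.0,  -1.0], [ -1.0,  2.0]]),
       np.array([[20.0, -10.0], [-10.0, 20.0]])]

P_rep = representative_laplacian(P_t)
print(P_rep)   # both samples normalize to [[1, -0.5], [-0.5, 1]]
```

Because both samples share one connectivity pattern, the representative matrix preserves that pattern while averaging out the time-varying magnitudes, which is what the subsequent sparsification step needs.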

Slide 31

Support-Circuit Preconditioner for HB Analysis (3)

Figure: the representative Laplacian matrix is viewed as a weighted graph (edge weights such as g1 + C2/h, gds + Cds/h, C1/h, Cgd/h, g2 over nodes 1-5), sparsified into an ultra-sparsifier, and converted back into the sparsified representative Laplacian matrix; combining it with the complement matrix yields the sparsification-pattern matrix.

  • Step 3: sparsification pattern extraction

– Convert the matrix to a weighted graph
– Sparsify the weighted graph and convert it back to matrix form
– Combine with the complement matrix

Slide 32

Support-Circuit Preconditioner for HB Analysis (4)

  • Step 4: MNA matrix sparsification

Figure: each system MNA matrix at t_1, t_2, …, t_s is masked with the sparsification-pattern matrix to produce the corresponding sparsified system MNA matrix.
Slide 33

Figure: the support-circuit preconditioner and the permuted matrix.

Support-Circuit Preconditioner for HB Analysis (5)

  • Step 5: support-circuit block preconditioner generation

– Original matrix: all variables of a single harmonic grouped together
– Permuted matrix: all harmonics of a single variable grouped together
– The circulant matrix structure in HB comes from the FFT/IFFT operations

Figure: the sparsified MNA matrices are mapped through the FFT into the block-circulant structure Γ · diag(g_1, …, g_s) · Γ⁻¹ = circulant(G_1, …, G_s), with [G_1, …, G_s]^T = FFT([g_1, …, g_s]^T); a permutation then groups all harmonics of each variable together before factorization.

Slide 34

Case Study : Double-balanced Gilbert Mixer

  • MOSFET linearization model

Figure: double-balanced Gilbert mixer schematic (M1-M6, R1-R10, L0-L3, C0, C1, inputs Vlo+/Vlo−, Vrf+/Vrf−, supply VDD), with bracketed node indices [xx].

  • Linearized passive network (Laplacian matrix) extraction

Figure: MOSFET linearization model with terminals G, D, S, B and elements Rds, gmVgs, gnVbs, Cgd, Cgs; [xx] denotes the node index.

Slide 35

Case Study : Double-balanced Gilbert Mixer (cont.)

  • Ultra-sparsifier support graph construction

– Step 1: extract the maximum spanning tree
– Step 2: restore critical edges until reaching the desired approximation quality

Figure: the mixer's Laplacian graph (nodes 1-27), its maximum spanning tree, and the ultra-sparsifier obtained by restoring critical edges.

Slide 36

HB Simulation Engine on CPU-GPU Platform

Figure: HB simulation flow. Start → NR loop { device evaluation → support-circuit preconditioner construction → preconditioner factorization → GMRES iterations → convergence checking } → End.

  • Decompose the MNA matrix into passive and active matrices, then:
    1. performance-modeling-based sparsification configuration
    2. construct the representative passive matrix
    3. extract the sparsification pattern
    4. sparsify the MNA matrix
    5. generate the support-circuit preconditioner
  • GPU-based block LU decomposition
  • Matrix-free iterative solver
Slide 37

Runtime Performance Modeling

  • Lookup table (LUT) for runtime performance modeling

– 2D LUTs predict LU factorization runtime on the GPU
– Two LUTs are created: one for GPU matrix multiplications and one for matrix divisions

Figure: runtime performance lookup table for GPU-based matrix operations, indexed by matrix-operation batch size and matrix size, with bilinear interpolation between entries.
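A minimal LUT sketch (made-up table values and axes; `predict` is a hypothetical helper) showing bilinear interpolation between the measured grid points:

```python
import bisect

batch_axis = [1, 10, 100]            # measured batch sizes
size_axis  = [8, 16, 32]             # measured matrix sizes
# runtime_ms[i][j]: measured runtime for batch_axis[i], size_axis[j]
runtime_ms = [[0.1, 0.2, 0.6],
              [0.4, 0.9, 2.5],
              [3.0, 7.0, 20.0]]

def predict(batch, size):
    """Bilinear interpolation; assumes queries within the measured range."""
    i = max(1, min(bisect.bisect_right(batch_axis, batch), len(batch_axis) - 1))
    j = max(1, min(bisect.bisect_right(size_axis, size), len(size_axis) - 1))
    i0, j0 = i - 1, j - 1
    tx = (batch - batch_axis[i0]) / (batch_axis[i] - batch_axis[i0])
    ty = (size - size_axis[j0]) / (size_axis[j] - size_axis[j0])
    r = runtime_ms
    return ((1 - tx) * (1 - ty) * r[i0][j0] + tx * (1 - ty) * r[i][j0]
            + (1 - tx) * ty * r[i0][j] + tx * ty * r[i][j])

print(predict(10, 16))               # → 0.9 (exact grid point)
print(round(predict(55, 24), 4))     # → 7.6 (midpoint of the top cell)
```

A handful of offline measurements per operation type is enough to predict the factorization cost of any candidate preconditioner before committing to it.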

Slide 38

Parallel Sparse Block LU Factorization

  • Representative Sparsified MNA Matrix (test matrix)

– Approximates the properties of the block sparse matrices
– Created by averaging all sparsified MNA matrices
– Factorized symbolically to get the fill-in locations

Figure: the sparsified system MNA matrices at t_1 … t_s are averaged into a test matrix; its LU factorization reveals the fill-in locations in the L and U factors.

Slide 39

Parallel Sparse Block LU Factorization (cont.)

  • Data dependency graph

– Column k depends on column j when U(j, k) != 0 [1]
– The dependency graph can be derived from the U factor

Figure: a 9-column example matrix and its data dependency graph, levelized into Levels 0-4.

[1] J. Gilbert and T. Peierls. Sparse partial pivoting in time proportional to arithmetic operations. SIAM J. Sci. Stat. Comput., 9(5):862–873, 1988.
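A minimal sketch (toy sparsity pattern; `levelize` is a hypothetical helper) of deriving those levels from the U factor: level(k) is one more than the maximum level among the columns it depends on.

```python
def levelize(U_pattern, n):
    """U_pattern: set of (j, k) with j < k and U[j, k] != 0.
    Returns the dependency level of each column."""
    level = [0] * n
    for k in range(n):                  # columns in increasing order
        deps = [j for (j, kk) in U_pattern if kk == k]
        if deps:
            level[k] = 1 + max(level[j] for j in deps)
    return level

# hypothetical 6-column upper-triangular nonzero pattern
U = {(0, 2), (1, 2), (2, 4), (3, 4), (4, 5)}
lv = levelize(U, 6)
print(lv)   # → [0, 0, 1, 0, 2, 3]
```

All columns sharing a level are mutually independent, so they can be factorized simultaneously, which is what enables the batched GPU execution on the next slides.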

Slide 40

Parallel Sparse Block LU Factorization (cont.)

  • Modified data dependency graph

– Identify “fake” dependencies: U(j, k) != 0 but L(j+1:n, j) == 0, so column j contributes no update to column k
– Eliminating these fake dependencies exposes more parallelism

Figure: after eliminating fake dependencies, the original five-level dependency graph (Levels 0-4) collapses to three levels (Levels 0-2).

Slide 41

Parallel Sparse Block LU Factorization (cont.)

  • GPU-based block sparse matrix LU factorization

– Levelize the factorization according to the data dependency graph
– Each level contains only matrix multiplication and division operations
– Use the batched matrix multiplication and inversion functions provided by CUBLAS

Figure: levelized block LU factorization on the GPU; within each level, independent block columns are processed with batched matrix multiplications (×) and divisions (÷).

Slide 42

Experiment Setup

Note:

  • Freqs: Number of harmonics
  • Nunk: Number of unknowns

#  | CKT Name      | Nodes | Tones | Freqs | Nunk
1  | mixer 1       | 302   | 2     | 25    | 14,798
2  | mixer 2       | 1,988 | 2     | 41    | 161,028
3  | mixer 3       | 5,262 | 2     | 5     | 47,358
4  | mixer 4       | 7,532 | 2     | 13    | 188,300
5  | LNA + mixer 1 | 343   | 3     | 63    | 42,875
6  | LNA + mixer 2 | 5,303 | 3     | 14    | 143,181
7  | LNA + mixer 3 | 7,573 | 3     | 14    | 204,471

  • Widely used RF circuits serve as the benchmarks
Slide 43

  • Support-circuit preconditioned HB (SCPHB) method

– High robustness and efficiency
– Runtime speedup: up to 21X (compared with the direct solver in DAC’09)
– Memory reduction: up to 6X (compared with the direct solver in DAC’09)

Runtime and Memory Efficiency on CPU

CKT | Direct Time(s) | Direct Mem(GB) | BD Time(s) | BD K-Its | SCPHB Time(s) | SCPHB Mem(GB) | SCPHB K-Its | Speedup
1   | 471.9    | 0.23 | 24.9    | 821  | 145.5   | 0.10 | 204 | 3.24X
2   | 19,263.1 | 7.95 | 5,637.6 | 6731 | 1,408   | 1.72 | 383 | 13.7X
3   | 686.4    | 0.36 | 92.2    | 165  | 69.5    | 0.06 | 229 | 9.8X
4   | 14,153.5 | 4.26 | 1,072.3 | 273  | 1,035.6 | 0.73 | 355 | 21.3X
5   | 2,561.6  | 1.92 | DNF     | DNF  | 821.5   | 1    | 194 | 3.1X
6   | 4,040.9  | 3.34 | DNF     | DNF  | 414.7   | 0.67 | 328 | 9.74X
7   | 6,633.6  | 5.21 | DNF     | DNF  | 791     | 0.83 | 255 | 8.38X

K-Its: GMRES iteration count; DNF: did not finish within 1000 Newton iterations

Slide 44

  • Simulation runtime vs. input power of the LNA + mixer

– BD preconditioner: runtime increases exponentially with input power
– SCPHB preconditioner: runtime remains nearly constant

Runtime Efficiency for Strong Nonlinearities

Slide 45

Scalability

  • Nearly-linear runtime and memory scalability

(a) Runtime scalability (b) Memory scalability