Heuristics for Nested Dissection to Reduce Fill-in in Sparse Matrix - - PowerPoint PPT Presentation

heuristics for nested dissection to reduce fill in in
SMART_READER_LITE
LIVE PREVIEW

Heuristics for Nested Dissection to Reduce Fill-in in Sparse Matrix - - PowerPoint PPT Presentation

Heuristics for Nested Dissection to Reduce Fill-in in Sparse Matrix Factorizations Miguel Vargas-Flix, Salvador Botello-Rionda miguelvargas@cimat.mx, botello@cimat.mx Morelia, November 6, 2013 1/28 Sparse matrices In most problems of finite


slide-1
SLIDE 1

Heuristics for Nested Dissection to Reduce Fill-in in Sparse Matrix Factorizations

Miguel Vargas-Félix, Salvador Botello-Rionda

miguelvargas@cimat.mx, botello@cimat.mx

Morelia, November 6, 2013 1/28

slide-2
SLIDE 2

Sparse matrices

In most problems of finite element method, finite volume or isogeometric analysis we have to solve a linear system of equations A x=b. When the stiffness matrix is assembled, the relation between adjacent nodes is captured as entries in a matrix. Because a node has adjacency with only a few others, the resulting matrix has a very sparse structure. i j k

(

∘ ∘ ∘ ∘ ∘ ∘ ∘ ⋯ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ⋯ ∘ ai i ∘ ai j ∘ ∘ ⋯ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ⋯ ∘ a j i ∘ a j j ∘ ∘ ⋯ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ⋯ ∘ ∘ ∘ ak k ∘ ⋯ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ⋯ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱)

http://www.cimat.mx/~miguelvargas 2/28

slide-3
SLIDE 3

Lets define the notation η(A), it indicates the number of non-zero entries of A. For example, A∈ℝ

556×556 has 309,136 entries, with η(A)=1810, this means that only the 0.58% of

the entries are non zero.

Black dots indicates a non zero entry in the matrix

In finite element problems all matrices have symmetric structure, and depending on the problem symmetric values or not.

http://www.cimat.mx/~miguelvargas 3/28

slide-4
SLIDE 4

Cholesky factorization for sparse matrices

For full matrices the computational complexity of Cholesky factorization A=LL

T is O(n 3).

To calculate entries of L Li j= 1 L j j(Ai j−∑

k=1 j−1

Li k L j k), for i> j L j j=√A j j−∑

k=1 j−1

L j k

2 .

We use four strategies to reduce time and memory usage when performing this factorization on sparse matrices:

  • 1. Reordering of rows and columns of the matrix to reduce fill-in in L. This is equivalent to use a

permutation matrix to reorder the system (P AP

T)(Px)=(P b).

  • 2. Use symbolic Cholesky factorization to obtain an exact L factor (non zero entries in L).
  • 3. Organize operations to improve cache usage.
  • 4. Parallelize the factorization.

http://www.cimat.mx/~miguelvargas 4/28

slide-5
SLIDE 5

Matrix reordering

We want to reorder rows and columns of A, in a way that the number of non-zero entries of L are

  • reduced. η(L) indicates the number of non-zero entries of L.

A=

L=

The stiffness matrix to the left A∈ℝ

556×556, with η(A)=1810. To the right the lower triangular

matrix L, with η(L)=8729. There are several heuristics like the minimum degree algorithm [Geor81] or a nested dissection method [Kary99].

http://www.cimat.mx/~miguelvargas 5/28

slide-6
SLIDE 6

By reordering we have a matrix A' with η(A' )=1810 and its factorization L' with η(L' )=3215. Both factorizations solve the same system of equations. A'=

L'=

We reduce the factorization fill-in by η(L' )=3215 η(L)=8729 =0.368. To determine a “good” reordering for a matrix A that minimize the fill-in of L is an NP complete problem [Yann81].

http://www.cimat.mx/~miguelvargas 6/28

slide-7
SLIDE 7

Symbolic Cholesky factorization

The algorithm to determine the Li j entries that area non-zero is called symbolic Cholesky factorization [Gall90]. Let be, for all columns j=1…n, a j ≝ {k > j ∣ Ak j≠0}, l j ≝ {k > j ∣ Lk j≠0}. The sets r j will register the columns of L which structure will affect the column j of L. for i ← 1,2,…,n

·

ri ← ∅ for i ← 1,2,…,n

·

l i ← ai

·

for each j∈ri

· ·

l i ← l i∪l j∖{i}

·

if l i≠∅

· ·

p ← min { j∈li}

· ·

r p ← r p∪{i }

A=( a11 a12 a16 a21 a22 a23 a24 a32 a33 a35 a42 a44 a53 a55 a56 a61 a65 a66) a2={3,4} L=( l 11 l 21 l 22 l 32 l 33 l 42 l 43 l 44 l 53 l 54 l55 l 61 l 62 l 63 l 64 l65 l 66)

l 2={3,4,6}

This algorithm is very efficient, its complexity in time and space has an order of O(η( L)).

http://www.cimat.mx/~miguelvargas 7/28

slide-8
SLIDE 8

How efficient is it?

The next table shows results solving a 2D Poisson equation problem, comparing Cholesky and conjugate gradient with Jacobi preconditioning.

Number of equations nnz(A) nnz(L) Cholesky time [s] CGJ time [s] 1,006 6,140 14,722 0.086 0.081 3,110 20,112 62,363 0.137 0.103 10,014 67,052 265,566 0.309 0.184 31,615 215,807 1’059,714 1.008 0.454 102,233 705,689 4’162,084 3.810 2.891 312,248 2’168,286 14’697,188 15.819 19.165 909,540 6’336,942 48’748,327 69.353 89.660 3’105,275 21’681,667 188’982,798 409.365 543.110 10’757,887 75’202,303 743’643,820 2,780.734 3,386.609

1,000 10,000 100,000 1,000,000 10,000,000 1 10 100 1,000 10,000 0.1 0.1 0.2 0.5 2.9 19.2 89.7 543.1 3,386.6 0.1 0.1 0.3 1.0 3.8 15.8 69.4 409.4 2,780.7

Cholesky CGJ Equations Time [s]

1,000 10,000 100,000 1,000,000 10,000,000 1 10 100 1,000 10,000 0.4 1.3 4.3 13.6 44.1 134.9 393.2 1,343.5 4,656.1 0.7 2.6 10.2 38.2 145.3 512.5 1,747.3 7,134.4 32,703.9

Cholesky CGJ Equations Memory [Mega bytes]

http://www.cimat.mx/~miguelvargas 8/28

slide-9
SLIDE 9

Graph reordering

Let G=(V , E) be a graph with n vertex V with m edges E,

1 2 3 4 5 6 7

A ordering (or tagging) α of G is simply a mapping from the set {1,2,…,n} in V, where n is the number of vertex of G. The graph tagged by α will be written G

α=(V α , E α).

5 4 7 6 2 1 3

http://www.cimat.mx/~miguelvargas 9/28

slide-10
SLIDE 10

Let A be a sparse matrix, symmetric, of size n×n, the ordered graph of A, G

A=(V A , E A), in it

{vi ,v j}∈E

A if and only if Ai j=A j i≠0, i≠ j. Here vi is a vertex of V A with tag A.

A=(

  • ·
  • ·
  • ·
  • ·

· ·

  • ·
  • ·
  • ·

·

  • ·
  • ·

·

  • ·

·

  • ·
  • ·
  • ·

· · ·

  • ·

· · ·

  • ·
  • )

G

A=(V A , E A)

1 2 3 4 5 6 7

An ordering α applied to G

A=(V A , E A) is equivalent to a apply a permutation matrix.

G

α=(V α , E α)

5 4 7 6 2 1 3

PA P

T=(

  • ·

· ·

  • ·

· ·

  • ·

· ·

  • ·
  • ·

·

  • ·

· · ·

  • ·
  • ·

·

  • ·

·

  • ·

·

  • ·

·

  • ·
  • )

For the fill-in of the Cholesky factorization, to find a “good” permutation P for A that reduces the number of non-zero entries on L means to find a “good”ordering of its graph [Geor81].

http://www.cimat.mx/~miguelvargas 10/28

slide-11
SLIDE 11

Nested dissection algorithm

Separator

Lets consider a separator set S⊂V in G=(V , E), when removed from G the graph is disconnected in two non-empty sets of vertex R⊂V and S⊂V , also R∩S=∅, S∩T =∅, R∩T=∅ y R∪S∪T=V In the next example V ={1,2,…,13}, let S={1,10,13} be a separator of G=(V , E), when removed from the graph we obtain R={4,3,7,8,11} and T={2,5,6,9,12}

1 2 3 4 5 6 7 8 9 10 11 12 13 2 3 4 5 6 7 8 9 11 12

http://www.cimat.mx/~miguelvargas 11/28

slide-12
SLIDE 12

This procedure applied in a recursive way is known as the generalized nested dissection algorithm [Lipt79]. The idea is to split the graph trying to make R and T of the same size with a small separator. This is an divide-and-conquer recursive method.

Dissection G (V , E) If ∣G∣<β · The vertexes V are ordered with any order (the minimum degree algorithm can be used) else · Search for a separator set S⊂V such that after removing it from G we get two sets of disconnected vertexes R and T , with ∣R∣≈∣T∣. · Remove all edges from E that have connections with entries in S. · Dissection G (R , E) · Dissection G (T , E) · Order the vertexes, putting the ones in S at the end of the order.

This algorithm is recursive and the resulting order of vertexes of G produces an efficient Cholesky factorization. An improved version of this algorithm is presented in [Kary99].

http://www.cimat.mx/~miguelvargas 12/28

slide-13
SLIDE 13

This is an exaple of the nested dissection algorithm:

Graph Dependency tree

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

5, 2, 10, 13, 17 1, 4, 3, 7, 8, 11, 14 6, 12, 18, 20, 9, 15, 16, 19

http://www.cimat.mx/~miguelvargas 13/28

slide-14
SLIDE 14

Graph Dependency tree

1 3 4 6 7 8 9 11 12 14 15 16 18 19 20

5, 2, 10, 13, 17 1, 4, 3, 7, 8, 11, 14 6, 12, 18, 20, 9, 15, 16, 19

1 3 4 6 7 8 9 11 12 14 15 16 18 19 20

5, 2, 10, 13, 17 7, 8 12, 15 11, 14 18, 20 1, 4, 3 6, 9, 16, 19

http://www.cimat.mx/~miguelvargas 14/28

slide-15
SLIDE 15

Graph Dependency tree

1 3 4 6 9 11 14 16 18 19 20

5, 2, 10, 13, 17 7, 8 12, 15 11, 14 18, 20 1, 4, 3 6, 9, 16, 19

1 3 4 6 9 11 14 16 18 19 20

5, 2, 10, 13, 17 7, 8 12, 15 11, 14 18, 20 4 9, 16 1 3 6 19

http://www.cimat.mx/~miguelvargas 15/28

slide-16
SLIDE 16

Fron the final dependency tree an ordering is creating numerating the vertexes filling the root at last up to the leaves.

5, 2, 10, 13, 17 7, 8 12, 15 11, 14 18, 20 4 9, 16 1 3 6 19

The elimination sequence is 6, 19, 9, 16, 18, 20, 12, 15, 1, 3, 4, 11, 14, 7, 8, 5, 2, 10, 13, 17. This

  • rdering is used to generate A', the equivalent P is

P=

(

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0)

.

http://www.cimat.mx/~miguelvargas 16/28

slide-17
SLIDE 17

This is an example of a finite element mesh reordered using nested dissection A has a size 1,263×1,263, with η(A)=14,131. The factorization produces η(L)=128,476 A= L=

http://www.cimat.mx/~miguelvargas 17/28

slide-18
SLIDE 18

Reordering using nested dissection A

r=

L

r=

The factorization produces η(L

r)=39,465. For this case

η(L

r)=39,465

η(L)=128,476=0.3072.

http://www.cimat.mx/~miguelvargas 18/28

slide-19
SLIDE 19

Heuristics to obtain “good” separators

Basically we have to goals: a) Split the graph in two approximatelly equal sub-graphs. b) Split the graph using the minimum number of vertexes in the separator. We will try to locate the “center” of the graph and then we will try to generate a plane that cuts the graph in approximately two equal parts. The difficult part is that this has to be done without knowing the coordinates of nodes of the mesh.

http://www.cimat.mx/~miguelvargas 19/28

slide-20
SLIDE 20

Generating ‘balls’

Layer 0 Layer 1 Layer 2

Balls are created selecting a vertex randomly, lets call it ‘layer 0’. Next a layer of all adjacent vertexes to ‘layer 0’ is added, forming ‘layer 1’. Same process to form ‘layer 2’, etc. This procedure works for 3D meshes.

http://www.cimat.mx/~miguelvargas 20/28

slide-21
SLIDE 21

Heuristic to locate the center of the graph

This is done creating several balls, each one has n

2 +α vertexes. The intersection of all balls will be

choosen as the center of the graph.

http://www.cimat.mx/~miguelvargas 21/28

slide-22
SLIDE 22

Separator creation

Two nodes of the ‘center’ of the graph are randomly selected, they are used as center of balls. Layers of both balls are grown at the same time. When a vertex is part of the layer of both balls it is maked as a separator and removed from the layers.

http://www.cimat.mx/~miguelvargas 22/28

slide-23
SLIDE 23

Several separators are created, the best one is used. It is choosen the one that has the minimun

  • fitness. The fitness function is:

f =s( nmin nmax), where s is the separator size, nmin is the number of nodes of the partition with less nodes, and nmax is the number of nodes of the partition with more nodes.

http://www.cimat.mx/~miguelvargas 23/28

slide-24
SLIDE 24

Examples

These are examples of the recursive evolution of the nested dissection.

http://www.cimat.mx/~miguelvargas 24/28

slide-25
SLIDE 25

Numerical experiments

We tested these heuristics against the METIS library [Kary99].

n nnz(A) FEMT nnz(L) METIS nnz(L) Difference 998 6,490 15,760 14,149 6.4% 3,361 22,311 66,280 60,492 9.5% 10,013 67,819 255,221 231,332 5.7% 33,258 228,502 1,052,032 952,013 7.1% 100,529 695,927 3,843,843 3,434,915 7.6% 334,643 2,328,077 15,590,679 13,760,163 7.3% 1,008,572 7,034,658 52,843,928 48,576,349 4.0% 3,343,042 23,354,982 190,399,689 185,091,059 2.9% 10,136,296 70,900,872 648,691,337 629,026,259 3.1%

100 1,000 10,000 100,000 1,000,000 10,000,000 10,000 100,000 1,000,000 10,000,000 100,000,000 1,000,000,000 FEMT nnz(L) METIS nnz(L)

Number of equations nnz(L)

http://www.cimat.mx/~miguelvargas 25/28

slide-26
SLIDE 26

Future work

Improve speed. Dynamically set the number of separators tried by the number of nodes. Improve speed by parallelizing the creation of separators. Improve the fitness function. Use nested dissection as a partitioning strategy [Kary99].

http://www.cimat.mx/~miguelvargas 26/28

slide-27
SLIDE 27

Questions?

miguelvargas@cimat.mx

http://www.cimat.mx/~miguelvargas 27/28

slide-28
SLIDE 28

References

[Geor81] A. George, J. W. H. Liu. Computer solution of large sparse positive definite systems. Prentice-Hall, 1981. [Kary99] G. Karypis, V. Kumar. A Fast and Highly Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, Vol. 20-1, pp. 359-392, 1999. [Lipt79]

  • R. J. Lipton, D. J. Rose, R. E. Tarjan. Generalized Nested Dissection. Computer Science

Department, Stanford University, 1997. [Yann81] M. Yannakakis. Computing the minimum fill-in is NP-complete. SIAM Journal on Algebraic Discrete Methods, Volume 2, Issue 1, pp 77-79, March, 1981. [Gall90]

  • K. A. Gallivan, M. T. Heath, E. Ng, J. M. Ortega, B. W. Peyton, R. J. Plemmons, C. H.

Romine, A. H. Sameh, R. G. Voigt. Parallel Algorithms for Matrix Computations. SIAM, 1990.

http://www.cimat.mx/~miguelvargas 28/28