SLIDE 1

Implementing a Parallel Graph Clustering Algorithm with Sparse Matrix Computation

Jun Chen, Peigang Zou
High Performance Computing Center, Institute of Applied Physics and Computational Mathematics
chenjun@iapcm.ac.cn

SLIDE 2

OUTLINE

• Graph clustering
  - Peer pressure clustering (PPCL)
• Challenges
  - A solution: based on large graph platforms/libraries
• Large graph platforms/libraries
  - Combinatorial BLAS
• Related works
• Parallel PPCL algorithm with matrix computation
• Numerical Results
• Discussions
• Conclusion

SLIDE 3

1. Graph Clustering

• A wide class of algorithms that classify the vertices of a graph into clusters, such that vertices in the same cluster have higher connectivity to each other than to vertices in other clusters.

SLIDE 4

Graph clustering vs. vector clustering

Clustering: find natural groups.
• Vector clustering: points are classified by the distances between them.
• Graph clustering: vertices are classified by the relationships (connectivity) between them.

SLIDE 5

Graph clustering (cont.)

Application areas: machine learning, pattern recognition, bioinformatics, image analysis.

Graph clustering methods: random walks, peer pressure clustering (PPCL), minimum cuts, multi-way partition, genetic algorithms, ...

SLIDE 6

2. Challenges
SLIDE 7

2.1 The size of graphs is growing

E.g., # of Facebook users > 1 billion. A big graph!

SLIDE 8

2.2 Large Graph Computation

• Parallel graph clustering is difficult to implement. It requires:
  - a well-suited description of the natural sparse locality,
  - storing it effectively,
  - high performance computing.
• High performance challenges:
  - Scalability: time should be <= O(n), and memory consumption < O(n x n).
  - The parallel patterns used for solving PDEs in typical scientific computing are based on dense computations; they are not suitable for the sparse character of large graph computation.
  - The MapReduce pattern for big-data problems has low efficiency.
• A solution: base the implementation on a large graph platform/library.

SLIDE 9

3. Large graph platforms/libraries

• ScaleGraph (TITECH, Japan)
  - Implements the Pregel model provided by Google, and optimizes its collective communication and memory management.
  - Goal: analyze graphs containing 10 billion nodes and edges.
• DistBelief (Google)
  - A parallel framework for deep learning.
• PBGL (Parallel Boost Graph Library) (Indiana University, USA)
  - A C++ graph library.
• GAPDT (Graph Algorithm and Pattern Discovery Toolbox) (UCSB, USA)
  - Provides interactive graph operations, and can run in parallel on Star-P, a parallel version of MATLAB.
  - Uses distributed sparse arrays to describe the parallel operations.

SLIDE 10

• GraphBLAS
  - Defines a core set of matrix-based graph operations that can be used to implement a wide class of graph algorithms in a wide range of programming environments.
• etc.

SLIDE 11

Combinatorial BLAS

CombBLAS:
• A representative implementation of GraphBLAS.
• Multi-core parallelism based on MPI.
• A collective set of basic linear algebra operations.

Table 1. Many basic linear algebra operations in CombBLAS (table not reproduced here).

SLIDE 12

[Buluç A, Gilbert J R. The Combinatorial BLAS: design, implementation, and applications]

SLIDE 13

4. Related works

• Introduction of the random walks method
  - The cluster assignment of a vertex will be the same as that of most of its neighbors.
• Current representative parallel PPCL methods
  - Parallel PPCL in SPARQL
  - Parallel PPCL in STAR-P

SLIDE 14

Related works about parallel PPCL

PPCL in STAR-P (STAR-P is a parallel implementation of MATLAB.)
Results:
• Maximum processor number: 128
• R-MAT graph: 2,097,152 vertices, 118,305,177 edges (scale 21)
• Low performance.

PPCL in SPARQL (SPARQL is an SQL-like query tool for RDF graphs.)
Results:
• Maximum processor number: 64
• RDF graph: 10,000 vertices, 232,000 edges
• About 200 seconds.

RDF graph: stores metadata for web resources. Its vertices identify resources; its arcs describe resource attributes.

SLIDE 15

5. Parallel PPCL Algorithm with Matrix Computation

• Standard PPCL algorithm
• Alternative PPCL algorithm based on linear algebra
• Parallel PPCL algorithm based on linear algebra
• Parallel PPCL implementation on CombBLAS
SLIDE 16

5.1 Standard PPCL algorithm

Iteration flow: 1. Given an approximation G'; 2. Initialize; 3. Vote; 4. Tally; 5. Form a new cluster approximation G''. If G' != G'', set G' = G'' and repeat; if G' == G'', the result is G''.

Algorithm 1:

PeerPressure(G(V, E), C_i)
1  for (u, v, w) ∈ E
2      do T(v)(C_i(u)) ← T(v)(C_i(u)) + w
3  for n ∈ V
4      do C_f(n) := j such that T(n)(i) ≤ T(n)(j) ∀ i ∈ V
5  if C_i == C_f
6      then return C_f
7  else return PeerPressure(G(V, E), C_f)

It starts with an initial cluster assignment, e.g., each vertex being in its own cluster. Each iteration performs an election at each vertex to select its cluster number; the votes are the cluster assignments of its neighbors. Ties are settled by selecting the lowest cluster ID to keep the algorithm deterministic. The algorithm converges when two consecutive iterations have only a tiny difference between them.
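A minimal serial sketch of one election step of this algorithm, assuming an undirected weighted edge list; the names (Edge, peer_pressure_step) are illustrative and not from the paper.

    #include <cstddef>
    #include <map>
    #include <vector>

    struct Edge { std::size_t u, v; double w; };

    // One vote/tally step: cluster[v] is the current cluster ID of vertex v.
    std::vector<std::size_t> peer_pressure_step(std::size_t n,
                                                const std::vector<Edge>& edges,
                                                const std::vector<std::size_t>& cluster) {
        // tally[v][c] = total weight of v's neighbors currently assigned to cluster c
        std::vector<std::map<std::size_t, double>> tally(n);
        for (const Edge& e : edges) {
            tally[e.v][cluster[e.u]] += e.w;   // u votes at v for u's cluster
            tally[e.u][cluster[e.v]] += e.w;   // undirected graph: v also votes at u
        }
        std::vector<std::size_t> next(cluster);
        for (std::size_t v = 0; v < n; ++v) {
            double best = -1.0;
            for (const auto& [c, votes] : tally[v])                  // std::map iterates in key order,
                if (votes > best) { best = votes; next[v] = c; }     // so ties keep the lowest cluster ID
        }
        return next;   // the caller repeats this step until next == cluster
    }

The caller would loop this step until the assignment stops changing, which mirrors the recursion in Algorithm 1.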

SLIDE 17

5.2 PPCL algorithm based on linear algebra

Iteration flow: 1. Given an approximation G'; 2. Initialize; 3. Vote; 4. Tally; 5. Form a new cluster approximation G''. If G' != G'', set G' = G'' and repeat; if G' == G'', the result is G''.

Algorithm 2:

PeerPressure(A : R^(N×N), C_i : B^(N×N))
1  T : R^(N×N), m : R^N
2  T = C_i A
3  m = T.max
4  C_f = (m .== T)
5  if C_i == C_f
6      then return C_f
7  else return PeerPressure(A, C_f)

1. Starting approximation G': each vertex is its own cluster.
2. Initialization: ensure all vertices have equal votes.
3. Vote: each node votes for its neighbors.
4. Tally: (1) normalize; (2) settle ties, i.e., decide what to do if two clusters tie for the maximum number of votes for a vertex.
5. Form a new approximation G''.
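As a hedged illustration of Algorithm 2, here is a small dense sketch of one iteration, assuming C is an n x n 0/1 cluster-by-vertex matrix and A is the degree-normalized adjacency matrix; a real implementation would of course use sparse matrices, as in Section 5.3.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    using Matrix = std::vector<std::vector<double>>;   // dense, row-major stand-in

    // One iteration: T = C * A (vote), column max (tally), keep only the maxima.
    Matrix ppcl_matrix_step(const Matrix& C, const Matrix& A) {
        const std::size_t n = C.size();
        Matrix T(n, std::vector<double>(n, 0.0));
        for (std::size_t i = 0; i < n; ++i)            // T = C * A
            for (std::size_t k = 0; k < n; ++k)
                if (C[i][k] != 0.0)
                    for (std::size_t j = 0; j < n; ++j)
                        T[i][j] += C[i][k] * A[k][j];

        Matrix Cf(n, std::vector<double>(n, 0.0));
        for (std::size_t j = 0; j < n; ++j) {          // per vertex (column): find the max vote
            double m = 0.0;
            for (std::size_t i = 0; i < n; ++i) m = std::max(m, T[i][j]);
            for (std::size_t i = 0; i < n; ++i)        // lowest cluster row wins a tie
                if (m > 0.0 && T[i][j] == m) { Cf[i][j] = 1.0; break; }
        }
        return Cf;   // iterate until Cf == C
    }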
SLIDE 18

Worked example (the matrices shown on the slide are not reproduced here):
(a) the object graph G; (b) the adjacency matrix A of G after initialization (degree-normalized entries such as 0.25, 0.2, 0.33); (c) temporary results of matrix C; (d) temporary matrix C after the first ties-settling.

Fig. 1 The procedure when applying Algorithm 2 to the object graph G.

SLIDE 19

5.3 Parallel PPCL algorithm based on linear algebra

Algorithm 3:

Input: a matrix-based target graph A : R^(N×N) and a matrix-based initial approximation graph C_i : B^(N×N).
Output: a matrix-based clustering result.
procedure PeerPressure(A, C_i)
1  SpParMat<unsigned, double, SpDCCols<unsigned, double>> A, C;
2  DenseParVec<unsigned, double> rowsums = A.Reduce(Row, plus<double>());   /* reduce along Row: columns are collapsed to single entries */
3  rowsums.Apply(multinv<double>());   /* multinv<double> is a user-defined functor */
4  A.DimScale(Row, rowsums);           /* normalize A: scale each column with the given vector */
5  while (C != T) do
6      SpParMat<unsigned, double, SpDCCols<unsigned, double>> T = SpGEMM(C, A);   /* vote */
7      Renormalize(T);                 /* renormalize T */
8      settling_ties(T);               /* settle ties */

Steps: initialization (lines 1-4), vote (line 6), normalization (line 7), settling ties (line 8).

SLIDE 20

5.4 Parallel PPCL implementation on CombBLAS

• Data distribution and storage
  - DCSC storage structure
• Algorithm expansion & MPI implementation
  - Parallel voting
  - Renormalization
  - Parallel ties-settling

SLIDE 21

5.4.1 Data distribution and storage

• Distribute the sparse matrices on a 2D <Pr, Pc> processor grid.
  - Processor P(i, j) stores the sub-matrix Aij of dimensions (m/Pr) x (n/Pc) in its local memory.
• HyperSparseGEMM operates on the O(nnz) DCSC data structure.
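A minimal sketch of this 2D block distribution, assuming an m x n matrix on a Pr x Pc grid with both dimensions evenly divisible (the even distribution mentioned later in the discussion); the helper name owner_of is illustrative.

    #include <cstddef>

    // Which processor P(i, j) owns global entry (r, c), and its local indices there.
    struct Owner { std::size_t pi, pj, local_r, local_c; };

    Owner owner_of(std::size_t r, std::size_t c,
                   std::size_t m, std::size_t n,
                   std::size_t Pr, std::size_t Pc) {
        const std::size_t block_rows = m / Pr;   // each processor holds an (m/Pr) x (n/Pc) block
        const std::size_t block_cols = n / Pc;
        return Owner{ r / block_rows, c / block_cols,    // grid coordinates of the owner
                      r % block_rows, c % block_cols };  // indices inside the local block
    }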

SLIDE 22

(1) DCSC storage structure

• CSC or CSR structure: keeps one pointer per column (or row), including empty ones, so its total storage is O(nnz + n).
• DCSC structure: total storage O(nnz).
• DCSC is more efficient than CSR or CSC for storing large-scale sparse matrices.

Fig. 1.1 Triple description of matrix A; Fig. 1.2 CSC structure of matrix A; Fig. 1.3 DCSC structure of matrix A (figures not reproduced).
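A rough sketch of the two layouts, using the field names common in the DCSC literature (JC, CP, IR, NUM); the structs are only illustrative, not the CombBLAS classes.

    #include <cstddef>
    #include <vector>

    // Classic CSC: one pointer per column, even for empty columns -> O(nnz + n) storage.
    struct CSC {
        std::vector<std::size_t> colptr;   // length n+1; column j spans [colptr[j], colptr[j+1])
        std::vector<std::size_t> rowidx;   // length nnz; row index of each stored value
        std::vector<double>      values;   // length nnz
    };

    // DCSC (doubly compressed sparse columns): only non-empty columns keep a pointer,
    // so hypersparse blocks (nnz much smaller than n) need just O(nnz) storage.
    struct DCSC {
        std::vector<std::size_t> JC;       // indices of the non-empty columns (length nzc)
        std::vector<std::size_t> CP;       // column pointers for those columns (length nzc+1)
        std::vector<std::size_t> IR;       // row indices (length nnz)
        std::vector<double>      NUM;      // stored values (length nnz)
    };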

SLIDE 23

5.4.2 Algorithm expansion & MPI implementation

(1) Voting: sparse SUMMA algorithm, our MPI implementation.
(2) Normalization: Reduce() and DimApply() in CombBLAS, our MPI implementation.
(3) Settling ties: both MPI and MPI-OpenMP implementations (about 90% of the time in the full code).

SLIDE 24

(1) Parallel voting

• SpGEMM algorithm
SLIDE 25

(2) Renormalization

• Leverage two primitives in CombBLAS so that the elements of matrix T end up as 1 or 0.

Algorithm 4. Renormalize the T matrix
Renormalize(SpParMat &T)
    DenseParVec colmaxs = T.Reduce(Column, max);
    T.DimApply(Column, colmaxs, equal_to<double>());
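In plain terms: reduce each column to its maximum, then compare every entry with that maximum, so surviving (maximal) entries become 1 and all others 0. A small dense stand-in for that step (illustrative only, not the CombBLAS call sequence):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    using Matrix = std::vector<std::vector<double>>;   // dense, row-major stand-in

    // Keep only the maximal vote(s) in each column: entries equal to the column
    // maximum become 1.0, everything else becomes 0.0.
    void renormalize(Matrix& T) {
        if (T.empty()) return;
        const std::size_t rows = T.size(), cols = T[0].size();
        for (std::size_t j = 0; j < cols; ++j) {
            double colmax = 0.0;
            for (std::size_t i = 0; i < rows; ++i) colmax = std::max(colmax, T[i][j]);
            for (std::size_t i = 0; i < rows; ++i)
                T[i][j] = (colmax > 0.0 && T[i][j] == colmax) ? 1.0 : 0.0;
        }
    }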

SLIDE 26

(3) Settling ties

Algorithm 5.

Settling_ties(SpParMat<unsigned, double, SpDCCols<unsigned, double>> &T)
1  for all processors P(i, j) in parallel do
2      vector<int> v = T.CreatVec();   /* create a vector tallying the processor number */
3      MPI_Allreduce(v, min_v, k, MPI_INT, MPI_MIN, Colworld());   /* do MPI_Allreduce on every processor column */
4      T.PruneMat(min_v);              /* generate a new clustering result matrix */

Function: select the lowest-numbered cluster with the highest number of votes. Steps: recording, communication, selection.

The "for ... in parallel do" construct indicates that all of the "do" code blocks are executed in parallel by all the processors. In line 2, we construct a vector in which every element corresponds to one column of matrix T: if that column contains a 1, the processor's ID is tallied into the corresponding element; otherwise the maximal integer is stored.
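A minimal MPI sketch of the reduction in line 3, assuming each process has already filled local_owner[j] with its own rank where it holds a maximal vote in column j and with INT_MAX elsewhere; col_comm stands in for Colworld() above, and the function name is illustrative.

    #include <mpi.h>
    #include <vector>

    // For every column (vertex), pick the smallest rank among the processes that
    // hold a maximal vote in that column.
    std::vector<int> settle_ties_min(const std::vector<int>& local_owner, MPI_Comm col_comm) {
        std::vector<int> min_owner(local_owner.size());
        // Element-wise minimum across the processor column; ranks that contributed
        // INT_MAX (no maximal vote) never win.
        MPI_Allreduce(local_owner.data(), min_owner.data(),
                      static_cast<int>(local_owner.size()),
                      MPI_INT, MPI_MIN, col_comm);
        return min_owner;   // each process then prunes entries it does not own (cf. PruneMat)
    }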

SLIDE 27

6. Numerical Experiments

• Platform and Input Graph
• Testing CombBLAS performance on Dawning
• Testing our parallel implementation

SLIDE 28

6.1 Platform and Input Graph

Dawning supercomputer:
• Intel Omni-Path network, 100 Gbps two-sided connection;
• 172 nodes, each with 24 2.5 GHz Intel E5-2680 processor cores and 64 GB of memory;
• MVAPICH2.

Input graphs:
1. A permuted R-MAT graph of scale 16 with self-loops added: 65,536 vertices, 490,563 edges.
2. A permuted R-MAT graph of scale 21 with self-loops added: 2,097,152 vertices, 118,305,177 edges.

R-MAT graph: R-MAT is one of the standard graph generation models. Its algorithm is simple, and the resulting graph follows a power-law degree distribution.
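For context, a compact sketch of how one R-MAT edge is drawn (recursively choosing one of four quadrants with probabilities a, b, c, d until a single cell remains). The parameter values below are the common Graph500-style defaults, not necessarily the ones used for these slides.

    #include <cstddef>
    #include <random>
    #include <utility>

    // Draw one edge of an R-MAT graph with 2^scale vertices (illustrative sketch).
    std::pair<std::size_t, std::size_t> rmat_edge(int scale, std::mt19937& rng,
                                                  double a = 0.57, double b = 0.19,
                                                  double c = 0.19 /* d = 1 - a - b - c */) {
        std::uniform_real_distribution<double> U(0.0, 1.0);
        std::size_t row = 0, col = 0;
        for (int level = 0; level < scale; ++level) {
            const double r = U(rng);
            row <<= 1; col <<= 1;                           // descend one level of the recursion
            if      (r < a)         { /* top-left quadrant: both bits stay 0 */ }
            else if (r < a + b)     { col |= 1; }           // top-right
            else if (r < a + b + c) { row |= 1; }           // bottom-left
            else                    { row |= 1; col |= 1; } // bottom-right
        }
        return {row, col};   // endpoints of the generated edge
    }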

SLIDE 29

6.2 Testing CombBLAS performance on Dawning

• Using the BFS code included in CombBLAS.
• Input: an R-MAT graph of scale 17 with self-loops added; its sparse matrix has 131,072 rows and columns.
• Performance:
  - Speedup: 15 on 64 processors.
  - MTEPS (mega traversed edges per second):

# of processor cores    MTEPS
1                       74
4                       198
16                      406
64                      945

Table 1. MTEPS of the CombBLAS BFS code running on Dawning with different processor counts.

The results did not reach the expected performance. The reason may be that the CombBLAS version we used is not optimized for the Dawning supercomputer.

SLIDE 30

6.3 Testing our parallel implementation

• Case 1: R-MAT graph of scale 16
• Case 2: R-MAT graph of scale 21

SLIDE 31

(1) Case 1: R-MAT with scale 16

Fig. 2 Time vs. processor number; Fig. 3 Speedup vs. processor number (figures not reproduced).

Parallel efficiency on 4 processors: 92.5% ... parallel efficiency on 64 processors: 62.7%.

SLIDE 32

(2) Case 2: R-MAT of scale 21

Fig. 4 Time vs. processor number; Fig. 5 Speedup vs. processor number (figures not reproduced).

Parallel efficiency on 4 processors: 87.5% ... parallel efficiency on 1024 processors: 75%.

SLIDE 33

Settling ties performance: R-MAT of scale 21

Parallel efficiency on 4 processors: 87.5% ... parallel efficiency on 1024 processors: 86.4%.

Fig. 8 MPI implementation: (left) parallel time; (right) speedup (figures not reproduced).

SLIDE 34

From the tests...

• Sparse degree of the input graphs
  - Both input graphs have a sparse degree above 99% (sparse degree = (# of zeros)/(# of total elements)); a worked check follows this list.
  - They also have a highly irregular structure.
• Scalability of our algorithm
  - Under the same conditions, apart from graph size and sparse degree, the speedups measured for the small and large graphs do not differ much, which also demonstrates the scalability of the algorithm.
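For example, taking the stored edges of the scale-16 graph as its nonzeros, the sparse degree is about 1 - 490,563 / 65,536^2 ≈ 1 - 1.14 x 10^-4 ≈ 99.99%, comfortably above 99%.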

SLIDE 35

7. Discussions

• Good scalability of our parallel algorithm.
• Limit on the number of cores: the matrix is m x m
  - # of columns = # of rows.
  - The current version distributes the matrix evenly between processors.
• The settling-ties step consumes most of the time
  - Reason: all-to-all communication between processors.

SLIDE 36

• The BFS tests on Dawning indicate that the relevant algebra operations in CombBLAS need to be optimized for the target supercomputer.
• There are not many published results on parallel PPCL for large-scale computation. Our implementation achieves high performance on more than a thousand processors for large-scale graphs.

SLIDE 37

• Our implementation vs. direct parallelization
  - It is feasible to parallelize the PPCL algorithm directly.
  - But the tedious handling of the graph's irregular data structures cannot be ignored; it also causes large time costs and memory consumption.
  - We translated the graph computations into a series of matrix operations, so the computations on the irregular data structures of the graph are transformed into a structured representation based on sparse matrices. Thus high performance can be achieved.

SLIDE 38

Conclusion

• We design and implement a parallel PPCL algorithm based on CombBLAS, representing the PPCL algorithm as sparse matrix computations.
• When the input is a permuted R-MAT graph of scale 21 with self-loops, the MPI implementation achieves up to a 809.6x speedup on 1024 cores of the Dawning supercomputer.
• Next steps include:
  - Optimizing the performance of the parallel PPCL implementation and testing larger graphs for high scalability.
  - Optimizing the performance of the related matrix algebra building blocks in CombBLAS to match the underlying computer architecture.
  - Studying more parallel irregular algorithms expressed as sparse matrix operations.
SLIDE 39

THANKS!