

SLIDE 1

Multithreaded distributed BFS with regularized communication pattern on the cluster with Angara interconnect

T-PLATFORMS

www.t-platforms.com GraphHPC-2016

March 3, 2016 Artem Osipov Alexander Daryin

SLIDE 2

BFS Revisited

  • Breadth-First Search (BFS) on distributed memory systems
  • Part of the Graph500 benchmark
  • Performance is limited by:
    • memory access latency
    • network latency
    • network bandwidth

SLIDE 3

Compressed Sparse-Row Graph (CSR)

[Figure: distributed CSR layout. A rowstarts array indexed by local_id, and a columns array whose entries give each destination vertex as a global_id, i.e. the (rank, local_id) pair of its owner.]
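The figure maps naturally onto a small data structure. Below is a minimal sketch of such a distributed CSR, assuming a neighbour's global id packs the owner rank and its local id into a single 64-bit word; the names (Csr, owner, local, make_global) and the 40-bit split are illustrative assumptions, not taken from the slides.

    // Minimal sketch of a distributed CSR graph (assumption: a neighbour's
    // global id packs its owner rank and local id into one 64-bit word).
    #include <cstdint>
    #include <vector>

    struct Csr {
        std::vector<std::uint64_t> rowstarts; // |V| + 1 offsets into columns
        std::vector<std::uint64_t> columns;   // one packed (rank, local_id) entry per edge
    };

    constexpr int kLocalBits = 40;            // assumed rank/local split, not from the slides

    inline int owner(std::uint64_t gid)           { return int(gid >> kLocalBits); }
    inline std::uint64_t local(std::uint64_t gid) { return gid & ((1ULL << kLocalBits) - 1); }
    inline std::uint64_t make_global(int rank, std::uint64_t local_id) {
        return (std::uint64_t(rank) << kLocalBits) | local_id;
    }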

SLIDE 4

Reference Algorithm

Data: CSR (rowstarts R, columns C), current queue Q
function ProcessQueue(R, C, Q)
    for v in Q do
        for e in {R[v]..R[v + 1]} do
            send (v, local(C[e])) to owner(C[e])
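For concreteness, here is roughly what the reference loop looks like in C++, reusing the Csr, owner and local sketch from the CSR slide; send() stands for whatever per-edge transmission primitive the runtime provides and is only a placeholder.

    // Direct rendering of the reference ProcessQueue pseudocode.
    // send() is a hypothetical runtime hook, not a real API.
    #include <cstdint>
    #include <vector>

    void send(int peer, std::uint64_t parent, std::uint64_t local_dst); // assumed primitive

    void ProcessQueue(const Csr& g, const std::vector<std::uint64_t>& queue) {
        for (std::uint64_t v : queue) {                       // every vertex in the current frontier
            for (std::uint64_t e = g.rowstarts[v]; e < g.rowstarts[v + 1]; ++e) {
                std::uint64_t c = g.columns[e];
                send(owner(c), v, local(c));                  // one tiny message per edge
            }
        }
    }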

SLIDE 5

Reference Algorithm (continued)

Data: CSR (rowstarts R, columns C), current queue Q
function ProcessQueue(R, C, Q)
    for v in Q do
        for p in {R[v]..R[v + 1]} do
            send (v, local(C[p])) to owner(C[p])

Issues with the reference scheme:
  • Message coalescing (sketched below):
    • small message size → low bandwidth
    • separate send buffer for each peer process → large memory overhead; multiple threads - ?
  • Irregular communication pattern → connection overhead
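A common reading of the coalescing point above: give every peer rank its own fixed-size send buffer and only flush full buffers, so the wire sees large messages. The sketch below illustrates this, with an assumed mpi_send_batch() wrapper and an arbitrary buffer size; with per-thread buffers the memory cost multiplies by the thread count, which is exactly the overhead the slide flags.

    // Per-peer message coalescing (illustrative; mpi_send_batch() and the
    // buffer size are assumptions, not part of the presented implementation).
    #include <cstdint>
    #include <vector>

    void mpi_send_batch(int dst, const std::uint64_t* data, std::size_t count); // assumed wrapper

    struct CoalescingSender {
        static constexpr std::size_t kFlushAt = 4096;     // entries per peer buffer
        std::vector<std::vector<std::uint64_t>> buf;      // one buffer per peer rank

        explicit CoalescingSender(int num_procs) : buf(num_procs) {}

        void send(int dst, std::uint64_t parent, std::uint64_t local_dst) {
            buf[dst].push_back(parent);
            buf[dst].push_back(local_dst);
            if (buf[dst].size() >= kFlushAt) flush(dst);  // only large messages hit the wire
        }

        void flush(int dst) {
            if (buf[dst].empty()) return;
            mpi_send_batch(dst, buf[dst].data(), buf[dst].size());
            buf[dst].clear();
        }

        void flush_all() { for (int d = 0; d < int(buf.size()); ++d) flush(d); }
    };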

SLIDE 6

Goals

  • Optimize memory access
  • Regularize communication pattern
  • Reduce memory consumption
  • Allow for multithreaded design
  • Possible solution:
    • partition graph data for each peer node
    • process partitions independently (and possibly concurrently)

SLIDE 7

Regularized Communication Pattern

  • Partition graph data for each peer node
  • Process partitions independently (and concurrently)
  • Problem:
    • unacceptable memory overhead with the standard CSR representation: |rowstarts| = |V| for every partition
  • Solution:
    • pack rowstarts

SLIDE 8

Packed Partitioned CSR

[Figure: Packed Partitioned CSR. Edges are split into one partition per destination rank (dst 0, dst 1, dst 2, dst 3); each partition keeps a packed rowstarts array of (local_id, offset) pairs for only those source vertices that have edges to that destination, plus the corresponding column entries.]
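Read as a data structure, the packed partitioned CSR might look like the sketch below; the field names are illustrative and chosen to line up with the V, D, C notation of the next slide (one extra rowstart entry per partition marks the end of the last row).

    // Illustrative packed partitioned CSR: one partition per destination rank,
    // listing only the source vertices that actually have edges to that rank.
    #include <cstdint>
    #include <vector>

    struct PackedPartition {
        std::vector<std::uint32_t> vertex;    // V[p][i]: local id of the i-th source vertex
        std::vector<std::uint32_t> rowstart;  // D[p][i]: offset of its first edge (size |V[p]| + 1)
        std::vector<std::uint32_t> columns;   // C: destination local ids on rank p
    };

    struct PackedCsr {
        std::vector<PackedPartition> part;    // index = destination (peer) rank
    };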

SLIDE 9

Packed CSR Algorithm

Data: Packed CSR (vertex indices V, row offsets D, columns C),
      current queue Q (bit mask), number of processes NP
function ProcessQueue(V, D, C, Q)
    for p in {0..NP-1} do
        for i in {0..|V[p]|-1} do
            if V[p][i] ∈ Q then
                for e in {D[p][i]..D[p][i + 1]} do
                    send (V[p][i], C[e]) to p
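In C++ terms, and reusing the PackedCsr sketch above together with the placeholder send() primitive, the loop could be written roughly as follows; the bit mask is represented as a plain std::vector<bool> for brevity.

    // Packed-CSR queue processing following the slide's pseudocode: all traffic
    // destined for rank p is generated in one contiguous pass over part[p].
    #include <cstdint>
    #include <vector>

    void ProcessQueue(const PackedCsr& g, const std::vector<bool>& in_queue, int num_procs) {
        for (int p = 0; p < num_procs; ++p) {
            const PackedPartition& part = g.part[p];
            for (std::size_t i = 0; i < part.vertex.size(); ++i) {
                std::uint32_t v = part.vertex[i];
                if (!in_queue[v]) continue;               // v is not in the current frontier
                for (std::uint32_t e = part.rowstart[i]; e < part.rowstart[i + 1]; ++e)
                    send(p, v, part.columns[e]);          // regular, per-rank communication
            }
        }
    }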

SLIDE 10

Direction Optimization

  • Standard algorithm: when sending the (u, v) edge for an already visited u, we hope that v is unvisited
  • At some point it becomes easier to hit a visited v than an unvisited one
  • Backward stepping algorithm: send the (u, v) edge for an unvisited u and hope that v is visited
  • Reduce communications: use edge probing
  • To increase the effectiveness of probing, sort columns by degree (see the sketch below)
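A plausible reading of the last bullet: within each row, reorder the adjacency list so that high-degree neighbours come first, since high-degree vertices tend to be visited early and the first probed edge is then most likely to find a visited parent. A minimal sketch over the plain Csr structure, assuming the degrees of (possibly remote) neighbours are available through some lookup that would have to be precomputed or exchanged beforehand:

    // Sort each row's columns by descending neighbour degree so that probing
    // the first edge has the best chance of hitting a visited vertex.
    #include <algorithm>
    #include <cstdint>
    #include <functional>
    #include <vector>

    void SortColumnsByDegree(Csr& g, const std::function<std::uint64_t(std::uint64_t)>& degree) {
        for (std::size_t v = 0; v + 1 < g.rowstarts.size(); ++v) {
            auto first = g.columns.begin() + g.rowstarts[v];
            auto last  = g.columns.begin() + g.rowstarts[v + 1];
            std::sort(first, last, [&](std::uint64_t a, std::uint64_t b) {
                return degree(a) > degree(b);             // highest-degree neighbours first
            });
        }
    }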

SLIDE 11

Backwards Stepping

Data: Packed CSR (vertex indices V, row offsets D, columns C),
      bit mask of visited vertices M, number of processes NP
function ProcessUnvisited()
    // probe first edges
    for p in {0..NP-1} do
        for i in {0..|V[p]|-1} do
            if V[p][i] ∉ M then
                send (V[p][i], C[D[p][i]]) to p
    flush send buffers
    wait for all acks
    // probe other edges
    for p in {0..NP-1} do
        for i in {0..|V[p]|-1} do
            if V[p][i] ∉ M then
                for e in {D[p][i] + 1..D[p][i + 1]} do
                    send (V[p][i], C[e]) to p
    flush send buffers

SLIDE 12

Multithreading

  • Basic concepts:
    • all MPI communication goes through the main thread
    • all received data is handled by a dedicated thread
    • other threads prepare data to send
    • all communication between threads goes through concurrent queues (Intel Threading Building Blocks was used); a minimal sketch follows below
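A stripped-down sketch of this thread layout using TBB concurrent queues is given below. The buffer type, queue names and sentinel convention are illustrative only; the real implementation adds per-round coordination, acknowledgements and the per-destination work lists shown on the following slides.

    // Illustrative thread roles: workers fill send buffers and push them to the
    // MPI (main) thread; the handler thread consumes whatever the main thread
    // received. Names and the sentinel are assumptions for this sketch.
    #include <cstdint>
    #include <vector>
    #include <tbb/concurrent_queue.h>

    struct Buffer { int dst = -1; std::vector<std::uint64_t> edges; };

    tbb::concurrent_bounded_queue<Buffer> ready_to_send;  // worker threads -> main thread
    tbb::concurrent_bounded_queue<Buffer> received;       // main thread -> handler thread

    void WorkerThread(int dst) {
        Buffer buf;
        buf.dst = dst;
        // ... fill buf.edges from the packed CSR partition for dst ...
        ready_to_send.push(buf);                           // hand off to the MPI thread
    }

    void HandlerThread() {
        Buffer buf;
        for (;;) {
            received.pop(buf);                             // blocks until the main thread delivers
            if (buf.dst < 0) break;                        // sentinel: round finished
            // ... update visited bitmap / next-level queue from buf.edges ...
        }
    }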

SLIDE 13

Multithreading

  • thread 0: performs MPI communication and coordinates worker threads
  • thread 1: processes received messages
  • threads 2..k-1: each processes the packed CSR for one dst at a time
  • shared data: send buffers, recv buffers, dst list

SLIDE 14

Forward stepping

[Diagram: one forward-stepping round. MainThread(0): InitNewRound, Send/Recv, Allgather updated count; MsgHandlerThread(1): WaitMainThread, ProcessLocal, ProcessRecv; QueueThread(2..k-1): WaitMainThread, ProcessGlobal. Threads exchange work through SendRequests, RecvRequests and RanksToProcess, each request queue backed by a BuffersQueue and an ActionsQueue.]

SLIDE 15

Backward stepping

[Diagram: backward-stepping round, part 1 (main and message-handler threads). MainThread(0): InitNewRound, Send/Recv, Allgather updated count; MsgHandlerThread(1): WaitMainThread, ProcessLocal, ProcessBckRecv. Incoming work arrives through FwdRecvRequests (BuffersQueue, ActionsQueue).]

SLIDE 16

Backward stepping

[Diagram: backward-stepping round, part 2 (main and queue threads). MainThread(0) and QueueThread(2..k-1); states include InitNewRound, Send/Recv, Allgather updated count, WaitMainThread, ProcessGlobal and ProcessFwdRecv. Request queues BckSendRequests, FwdSendRequests and BckRecvRequests (each with a BuffersQueue and an ActionsQueue) plus RanksToProcess connect the threads.]

SLIDE 17

Performance: “Angara K1”

[Chart: BFS performance on 16 nodes of the “Angara K1” cluster for graph scales 22-30; several MPI×OpenMP configurations (16x2 and 16x8 processes, with 4, 6 and 10 OMP threads).]

SLIDE 18

Performance: “Angara K1”

[Chart: BFS performance on 32 nodes for graph scales 22-30; configurations 32x1 (6 OMP threads), 32x4 and 32x8.]

SLIDE 19

Future optimizations

  • Graph redistribution (maximal clustering per node)
  • Multithreaded heavy

SLIDE 20

THANK YOU!

www.t-platforms.com