

SLIDE 1

Multithreaded distributed BFS with regularized communication pattern on the cluster with Angara interconnect

T-PLATFORMS

www.t-platforms.com GraphHPC-2016

March 3, 2016 Artem Osipov Alexander Daryin

SLIDE 2

BFS Revisited

  • Breadth-First Search (BFS) on distributed memory systems
  • Part of the Graph500 benchmark
  • Performance is limited by:
    • memory access latency
    • network latency
    • network bandwidth

SLIDE 3

Compressed Sparse-Row Graph (CSR)

[Figure: distributed CSR layout. A rowstarts array indexed by local_id, and a columns array whose entries give each destination vertex as a global_id, i.e. the (rank, local_id) pair of its owner.]
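The figure maps naturally onto a small data structure. Below is a minimal sketch of such a distributed CSR, assuming a neighbour's global id packs the owner rank and its local id into a single 64-bit word; the names (Csr, owner, local, make_global) and the 40-bit split are illustrative assumptions, not taken from the slides.

    // Minimal sketch of a distributed CSR graph (assumption: a neighbour's
    // global id packs its owner rank and local id into one 64-bit word).
    #include <cstdint>
    #include <vector>

    struct Csr {
        std::vector<std::uint64_t> rowstarts; // |V| + 1 offsets into columns
        std::vector<std::uint64_t> columns;   // one packed (rank, local_id) entry per edge
    };

    constexpr int kLocalBits = 40;            // assumed rank/local split, not from the slides

    inline int owner(std::uint64_t gid)           { return int(gid >> kLocalBits); }
    inline std::uint64_t local(std::uint64_t gid) { return gid & ((1ULL << kLocalBits) - 1); }
    inline std::uint64_t make_global(int rank, std::uint64_t local_id) {
        return (std::uint64_t(rank) << kLocalBits) | local_id;
    }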

SLIDE 4

Reference Algorithm

Data: CSR (rowstarts R, columns C), current queue Q
function ProcessQueue(R, C, Q)
    for v in Q do
        for e in {R[v]..R[v + 1]} do
            send (v, local(C[e])) to owner(C[e])
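For concreteness, here is roughly what the reference loop looks like in C++, reusing the Csr, owner and local sketch from the CSR slide; send() stands for whatever per-edge transmission primitive the runtime provides and is only a placeholder.

    // Direct rendering of the reference ProcessQueue pseudocode.
    // send() is a hypothetical runtime hook, not a real API.
    #include <cstdint>
    #include <vector>

    void send(int peer, std::uint64_t parent, std::uint64_t local_dst); // assumed primitive

    void ProcessQueue(const Csr& g, const std::vector<std::uint64_t>& queue) {
        for (std::uint64_t v : queue) {                       // every vertex in the current frontier
            for (std::uint64_t e = g.rowstarts[v]; e < g.rowstarts[v + 1]; ++e) {
                std::uint64_t c = g.columns[e];
                send(owner(c), v, local(c));                  // one tiny message per edge
            }
        }
    }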

SLIDE 5

Reference Algorithm (continued)

Data: CSR (rowstarts R, columns C), current queue Q
function ProcessQueue(R, C, Q)
    for v in Q do
        for p in {R[v]..R[v + 1]} do
            send (v, local(C[p])) to owner(C[p])

Issues with the reference scheme:
  • Message coalescing (sketched below):
    • small message size → low bandwidth
    • separate send buffer for each peer process → large memory overhead; multiple threads - ?
  • Irregular communication pattern → connection overhead
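A common reading of the coalescing point above: give every peer rank its own fixed-size send buffer and only flush full buffers, so the wire sees large messages. The sketch below illustrates this, with an assumed mpi_send_batch() wrapper and an arbitrary buffer size; with per-thread buffers the memory cost multiplies by the thread count, which is exactly the overhead the slide flags.

    // Per-peer message coalescing (illustrative; mpi_send_batch() and the
    // buffer size are assumptions, not part of the presented implementation).
    #include <cstdint>
    #include <vector>

    void mpi_send_batch(int dst, const std::uint64_t* data, std::size_t count); // assumed wrapper

    struct CoalescingSender {
        static constexpr std::size_t kFlushAt = 4096;     // entries per peer buffer
        std::vector<std::vector<std::uint64_t>> buf;      // one buffer per peer rank

        explicit CoalescingSender(int num_procs) : buf(num_procs) {}

        void send(int dst, std::uint64_t parent, std::uint64_t local_dst) {
            buf[dst].push_back(parent);
            buf[dst].push_back(local_dst);
            if (buf[dst].size() >= kFlushAt) flush(dst);  // only large messages hit the wire
        }

        void flush(int dst) {
            if (buf[dst].empty()) return;
            mpi_send_batch(dst, buf[dst].data(), buf[dst].size());
            buf[dst].clear();
        }

        void flush_all() { for (int d = 0; d < int(buf.size()); ++d) flush(d); }
    };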

SLIDE 6

Goals

  • Optimize memory access
  • Regularize communication pattern
  • Reduce memory consumption
  • Allow for multithreaded design
  • Possible solution:
    • partition graph data for each peer node
    • process partitions independently (and possibly concurrently)

SLIDE 7

Regularized Communication Pattern

  • Partition graph data for each peer node
  • Process partitions independently (and concurrently)
  • Problem:
    • unacceptable memory overhead with the standard CSR representation: |rowstarts| = |V| for every partition
  • Solution:
    • pack rowstarts

SLIDE 8

Packed Partitioned CSR

[Figure: Packed Partitioned CSR. Edges are split into one partition per destination rank (dst 0, dst 1, dst 2, dst 3); each partition keeps a packed rowstarts array of (local_id, offset) pairs for only those source vertices that have edges to that destination, plus the corresponding column entries.]
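Read as a data structure, the packed partitioned CSR might look like the sketch below; the field names are illustrative and chosen to line up with the V, D, C notation of the next slide (one extra rowstart entry per partition marks the end of the last row).

    // Illustrative packed partitioned CSR: one partition per destination rank,
    // listing only the source vertices that actually have edges to that rank.
    #include <cstdint>
    #include <vector>

    struct PackedPartition {
        std::vector<std::uint32_t> vertex;    // V[p][i]: local id of the i-th source vertex
        std::vector<std::uint32_t> rowstart;  // D[p][i]: offset of its first edge (size |V[p]| + 1)
        std::vector<std::uint32_t> columns;   // C: destination local ids on rank p
    };

    struct PackedCsr {
        std::vector<PackedPartition> part;    // index = destination (peer) rank
    };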

SLIDE 9

Packed CSR Algorithm

Data: Packed CSR (vertex indices V, row offsets D, columns C),
      current queue Q (bit mask), number of processes NP
function ProcessQueue(V, D, C, Q)
    for p in {0..NP-1} do
        for i in {0..|V[p]|-1} do
            if V[p][i] ∈ Q then
                for e in {D[p][i]..D[p][i + 1]} do
                    send (V[p][i], C[e]) to p
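In C++ terms, and reusing the PackedCsr sketch above together with the placeholder send() primitive, the loop could be written roughly as follows; the bit mask is represented as a plain std::vector<bool> for brevity.

    // Packed-CSR queue processing following the slide's pseudocode: all traffic
    // destined for rank p is generated in one contiguous pass over part[p].
    #include <cstdint>
    #include <vector>

    void ProcessQueue(const PackedCsr& g, const std::vector<bool>& in_queue, int num_procs) {
        for (int p = 0; p < num_procs; ++p) {
            const PackedPartition& part = g.part[p];
            for (std::size_t i = 0; i < part.vertex.size(); ++i) {
                std::uint32_t v = part.vertex[i];
                if (!in_queue[v]) continue;               // v is not in the current frontier
                for (std::uint32_t e = part.rowstart[i]; e < part.rowstart[i + 1]; ++e)
                    send(p, v, part.columns[e]);          // regular, per-rank communication
            }
        }
    }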

SLIDE 10

Direction Optimization

  • Standard algorithm: when sending the (u, v) edge for an already visited u, we hope that v is unvisited
  • At some point it becomes easier to hit a visited v than an unvisited one
  • Backward stepping algorithm: send the (u, v) edge for an unvisited u and hope that v is visited
  • Reduce communications: use edge probing
  • To increase the effectiveness of probing, sort columns by degree (see the sketch below)
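A plausible reading of the last bullet: within each row, reorder the adjacency list so that high-degree neighbours come first, since high-degree vertices tend to be visited early and the first probed edge is then most likely to find a visited parent. A minimal sketch over the plain Csr structure, assuming the degrees of (possibly remote) neighbours are available through some lookup that would have to be precomputed or exchanged beforehand:

    // Sort each row's columns by descending neighbour degree so that probing
    // the first edge has the best chance of hitting a visited vertex.
    #include <algorithm>
    #include <cstdint>
    #include <functional>
    #include <vector>

    void SortColumnsByDegree(Csr& g, const std::function<std::uint64_t(std::uint64_t)>& degree) {
        for (std::size_t v = 0; v + 1 < g.rowstarts.size(); ++v) {
            auto first = g.columns.begin() + g.rowstarts[v];
            auto last  = g.columns.begin() + g.rowstarts[v + 1];
            std::sort(first, last, [&](std::uint64_t a, std::uint64_t b) {
                return degree(a) > degree(b);             // highest-degree neighbours first
            });
        }
    }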

SLIDE 11

Backwards Stepping

Data: Packed CSR (vertex indices V, row offsets D, columns C),
      bit mask of visited vertices M, number of processes NP
function ProcessUnvisited()
    // probe first edges
    for p in {0..NP-1} do
        for i in {0..|V[p]|-1} do
            if V[p][i] ∉ M then
                send (V[p][i], C[D[p][i]]) to p
    flush send buffers
    wait for all acks
    // probe other edges
    for p in {0..NP-1} do
        for i in {0..|V[p]|-1} do
            if V[p][i] ∉ M then
                for e in {D[p][i] + 1..D[p][i + 1]} do
                    send (V[p][i], C[e]) to p
    flush send buffers

SLIDE 12

Multithreading

  • Basic concepts:
    • all MPI communication goes through the main thread
    • all received data is handled by a dedicated thread
    • other threads prepare data to send
    • all communication between threads goes through concurrent queues (Intel Threading Building Blocks was used); a minimal sketch follows below
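A stripped-down sketch of this thread layout using TBB concurrent queues is given below. The buffer type, queue names and sentinel convention are illustrative only; the real implementation adds per-round coordination, acknowledgements and the per-destination work lists shown on the following slides.

    // Illustrative thread roles: workers fill send buffers and push them to the
    // MPI (main) thread; the handler thread consumes whatever the main thread
    // received. Names and the sentinel are assumptions for this sketch.
    #include <cstdint>
    #include <vector>
    #include <tbb/concurrent_queue.h>

    struct Buffer { int dst = -1; std::vector<std::uint64_t> edges; };

    tbb::concurrent_bounded_queue<Buffer> ready_to_send;  // worker threads -> main thread
    tbb::concurrent_bounded_queue<Buffer> received;       // main thread -> handler thread

    void WorkerThread(int dst) {
        Buffer buf;
        buf.dst = dst;
        // ... fill buf.edges from the packed CSR partition for dst ...
        ready_to_send.push(buf);                           // hand off to the MPI thread
    }

    void HandlerThread() {
        Buffer buf;
        for (;;) {
            received.pop(buf);                             // blocks until the main thread delivers
            if (buf.dst < 0) break;                        // sentinel: round finished
            // ... update visited bitmap / next-level queue from buf.edges ...
        }
    }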

SLIDE 13

Multithreading

  • thread 0: performs MPI communication and coordinates worker threads
  • thread 1: processes received messages
  • threads 2..k-1: each processes the packed CSR for one dst at a time
  • shared data: send buffers, recv buffers, dst list

SLIDE 14

Forward stepping

[Diagram: one forward-stepping round. MainThread(0): InitNewRound, Send/Recv, Allgather updated count; MsgHandlerThread(1): WaitMainThread, ProcessLocal, ProcessRecv; QueueThread(2..k-1): WaitMainThread, ProcessGlobal. Threads exchange work through SendRequests, RecvRequests and RanksToProcess, each request queue backed by a BuffersQueue and an ActionsQueue.]

SLIDE 15

Backward stepping

[Diagram: backward-stepping round, part 1 (main and message-handler threads). MainThread(0): InitNewRound, Send/Recv, Allgather updated count; MsgHandlerThread(1): WaitMainThread, ProcessLocal, ProcessBckRecv. Incoming work arrives through FwdRecvRequests (BuffersQueue, ActionsQueue).]

SLIDE 16

Backward stepping

[Diagram: backward-stepping round, part 2 (main and queue threads). MainThread(0) and QueueThread(2..k-1); states include InitNewRound, Send/Recv, Allgather updated count, WaitMainThread, ProcessGlobal and ProcessFwdRecv. Request queues BckSendRequests, FwdSendRequests and BckRecvRequests (each with a BuffersQueue and an ActionsQueue) plus RanksToProcess connect the threads.]

SLIDE 17

Performance: “Angara K1”

[Chart: BFS performance on 16 nodes of the “Angara K1” cluster for graph scales 22-30; several MPI×OpenMP configurations (16x2 and 16x8 processes, with 4, 6 and 10 OMP threads).]

SLIDE 18

Performance: “Angara K1”

[Chart: BFS performance on 32 nodes for graph scales 22-30; configurations 32x1 (6 OMP threads), 32x4 and 32x8.]

SLIDE 19

Future optimizations

  • Graph redistribution (maximal clustering per node)
  • Multithreaded heavy

SLIDE 20

THANK YOU!

www.t-platforms.com