SLIDE 1

Early Experiences on Accelerating Dijkstra’s Algorithm Using Transactional Memory

Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Computing Systems Laboratory School of Electrical and Computer Engineering National Technical University of Athens {anastop,knikas,goumas,nkoziris}@cslab.ece.ntua.gr http://www.cslab.ece.ntua.gr

May 31, 2009


SLIDE 2

Outline

1. Dijkstra’s Basics
2. Straightforward Parallelization Scheme
3. Helper-Threading Scheme
4. Experimental Evaluation
5. Conclusions

Anastopoulos et al. (NTUA), MTAAP’09, May 31, 2009

SLIDE 3

The Basics of Dijkstra’s Algorithm

SSSP Problem

Directed graph G = (V, E), weight function w : E → R+, source vertex s
∀v ∈ V : compute δ(v) = min{w(p) : p is a path s ⇝ v}
Shortest path estimate d(v) gradually converges to δ(v) through relaxations
relax(v, w): d(w) = min{d(w), d(v) + w(v, w)}
◮ can we find a better path s ⇝ w by going through v?

Three partitions of vertices
Settled: d(v) = δ(v)
Queued: d(v) ≥ δ(v) and d(v) < ∞
Unreached: d(v) = ∞
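The relaxation step can be sketched in runnable Python (a minimal illustration with plain dictionaries; the names `relax`, `d`, and `pred` are chosen here, not taken from the paper):

```python
INF = float("inf")

def relax(d, pred, v, u, weight):
    """relax edge (v, u): can we reach u more cheaply by going through v?"""
    if d[v] + weight < d[u]:
        d[u] = d[v] + weight   # tighten the shortest-path estimate of u
        pred[u] = v            # remember the predecessor on the better path
        return True            # estimate improved
    return False               # no improvement
```

A relaxation never increases an estimate, so d(u) only moves downward toward δ(u).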


SLIDE 4

The Basics of Dijkstra’s Algorithm

Serial algorithm

Input: G = (V, E), w : E → R+, source vertex s, min-priority queue Q
Output: shortest distance array d, predecessor array π

foreach v ∈ V do
    d[v] ← ∞;
    π[v] ← nil;
    Insert(Q, v);
end
d[s] ← 0;
while Q ≠ ∅ do
    u ← ExtractMin(Q);
    foreach v adjacent to u do
        sum ← d[u] + w(u, v);
        if d[v] > sum then
            DecreaseKey(Q, v, sum);
            d[v] ← sum;
            π[v] ← u;
        end
    end
end

[Figure: example run on a small graph with source S and vertices A–E; edge weights 5, 2, 10, 10, 15, 20, 7; unreached vertices shown with d = ∞.]
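The serial algorithm above can be sketched in runnable Python. Since the standard `heapq` module has no DecreaseKey, this sketch uses the common lazy-deletion workaround (stale heap entries are skipped on pop), which keeps the same asymptotic bound; it is an illustration, not the paper's implementation:

```python
import heapq

def dijkstra(graph, s):
    """graph: {u: [(v, weight), ...]}; returns (d, pred)."""
    INF = float("inf")
    d = {v: INF for v in graph}
    pred = {v: None for v in graph}
    d[s] = 0
    pq = [(0, s)]                          # (distance estimate, vertex)
    while pq:
        du, u = heapq.heappop(pq)
        if du > d[u]:                      # stale entry: u was already settled
            continue
        for v, w in graph[u]:
            if d[u] + w < d[v]:            # relaxation test
                d[v] = d[u] + w
                pred[v] = u
                heapq.heappush(pq, (d[v], v))  # stands in for DecreaseKey
    return d, pred
```

The extra stale entries are at most one per relaxation, so the cost stays within O((|E| + |V|) log |V|).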

SLIDE 5

The Basics of Dijkstra’s Algorithm

[Figure: DecreaseKey on a binary min-heap; decreasing the key of node i from 17 triggers a sequence of parent-child swaps up the heap.]

Min-priority queue implemented as binary min-heap
maintains all but the settled vertices
min-heap property: ∀i : d(parent(i)) ≤ d(i)
amortizes the cost of multiple ExtractMin’s and DecreaseKey’s
◮ O((|E| + |V|) log |V|) time complexity
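A DecreaseKey-capable binary min-heap needs a position map from vertex to heap slot, so the decreased entry can be located and sifted up through parent-child swaps. A minimal single-threaded sketch (illustrative names, not the paper's code):

```python
class MinHeap:
    def __init__(self):
        self.a = []        # list of (key, vertex) pairs
        self.pos = {}      # vertex -> index in self.a

    def _swap(self, i, j):
        self.a[i], self.a[j] = self.a[j], self.a[i]
        self.pos[self.a[i][1]] = i
        self.pos[self.a[j][1]] = j

    def _sift_up(self, i):
        while i > 0 and self.a[(i - 1) // 2][0] > self.a[i][0]:
            self._swap((i - 1) // 2, i)    # one parent-child swap
            i = (i - 1) // 2

    def _sift_down(self, i):
        n = len(self.a)
        while True:
            small = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and self.a[c][0] < self.a[small][0]:
                    small = c
            if small == i:
                return
            self._swap(i, small)
            i = small

    def insert(self, key, v):
        self.a.append((key, v))
        self.pos[v] = len(self.a) - 1
        self._sift_up(len(self.a) - 1)

    def extract_min(self):
        self._swap(0, len(self.a) - 1)
        key, v = self.a.pop()
        del self.pos[v]
        if self.a:
            self._sift_down(0)
        return key, v

    def decrease_key(self, v, new_key):
        i = self.pos[v]
        self.a[i] = (new_key, v)
        self._sift_up(i)   # a sequence of swaps along a leaf-to-root path
```

It is exactly these leaf-to-root swap sequences that the parallel schemes later try to run concurrently.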

SLIDE 6

Straightforward Parallelization

Fine-grain parallelization at the inner loop level

Fine-Grain Multi-Threaded

/* Initialization phase same as in the serial code */
while Q ≠ ∅ do
    Barrier
    if tid = 0 then
        u ← ExtractMin(Q);
    Barrier
    for v adjacent to u in parallel do
        sum ← d[u] + w(u, v);
        if d[v] > sum then
            Begin-Atomic
            DecreaseKey(Q, v, sum);
            End-Atomic
            d[v] ← sum;
            π[v] ← u;
    end
end

[Figure: the same example graph as before (source S, vertices A–E).]

Issues:
◮ speedup bounded by the average out-degree
◮ concurrent heap updates due to DecreaseKey’s
◮ barrier synchronization overhead
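The structure of this scheme can be sketched with `threading.Barrier` in Python; this is an illustration only (the paper uses Pthreads in C), and to keep it short the sketch replaces the binary heap with a linear-scan queue and guards relaxations with one coarse lock, i.e. the cgs-lock variant:

```python
import threading

INF = float("inf")

def parallel_dijkstra(graph, source, num_threads=2):
    """graph: {u: [(v, weight), ...]}; fine-grain inner-loop parallelization."""
    d = {v: INF for v in graph}
    d[source] = 0
    Q = set(graph)
    lock = threading.Lock()                  # cgs-lock: one lock for the queue
    barrier = threading.Barrier(num_threads)
    state = {"u": None}

    def worker(tid):
        while True:
            barrier.wait()
            if tid == 0:                     # only thread 0 extracts the min
                state["u"] = min(Q, key=lambda v: d[v]) if Q else None
                if state["u"] is not None:
                    Q.discard(state["u"])
            barrier.wait()
            u = state["u"]
            if u is None:
                return
            adj = graph[u]
            # static partition of u's out-edges across the threads;
            # d[u] is final once u is extracted (Dijkstra's invariant),
            # so reading it outside the lock is safe
            for i in range(tid, len(adj), num_threads):
                v, w = adj[i]
                s = d[u] + w
                with lock:                   # atomic relaxation
                    if d[v] > s:
                        d[v] = s

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return d
```

Even in this tiny sketch the pattern shows the issues listed above: two barriers per extracted vertex, and at most out-degree(u) useful relaxations per step.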


SLIDE 7

Concurrent Heap Updates with Locks

[Figure: two concurrent DecreaseKey operations on the binary min-heap; each is a sequence of parent-child swaps along a leaf-to-root path.]

Coarse-grain synchronization (cgs-lock)
◮ enforces atomicity at the level of a DecreaseKey operation
◮ one lock for the entire heap
◮ serializes DecreaseKey’s

Fine-grain synchronization (fgs-lock)
◮ enforces atomicity at the level of a single swap
◮ allows multiple swap sequences to execute in parallel as long as they are temporally non-overlapping
◮ separate locks for each parent-child pair
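The granularity difference can be illustrated by a sift-up whose individual swaps each take only the two locks covering the parent-child pair involved. A single-threaded Python sketch of the locking pattern (the class name and per-slot lock layout are assumptions for illustration, not the paper's implementation; note each swap costs 2 locks + 2 unlocks, as slide 8 observes):

```python
import threading

class LockedMinHeap:
    """Binary min-heap of plain keys; each sift-up swap is guarded by the
    locks of the parent and child slots, approximating fgs-lock."""
    def __init__(self, keys):
        self.a = list(keys)
        self.locks = [threading.Lock() for _ in self.a]  # one lock per slot

    def decrease_key(self, i, new_key):
        self.a[i] = new_key
        while i > 0:
            p = (i - 1) // 2
            # fine-grain: lock only this parent-child pair; the parent
            # (smaller index) is always taken first, giving a global lock
            # order and hence no deadlock between concurrent sift-ups
            with self.locks[p], self.locks[i]:
                if self.a[p] <= self.a[i]:
                    break                     # heap property restored
                self.a[p], self.a[i] = self.a[i], self.a[p]
            i = p
```

Under cgs-lock the whole `decrease_key` would instead run under one heap-wide lock, serializing every concurrent DecreaseKey.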

SLIDE 8

Performance of FGMT with Locks

[Plot: multithreaded speedup (y-axis, 0.1–1.3) vs. number of threads (2–16) for cgs-lock, perfbar+cgs-lock, and perfbar+fgs-lock.]

Software barriers dominate total execution time: 72% with 2 threads, 88% with 8; they are therefore replaced with idealized (simulated) zero-latency barriers. The fgs-lock scheme is more scalable, but still fails to outperform the serial code due to locking overhead (2 locks + 2 unlocks per swap).


SLIDE 9

Concurrent Heap Updates with TM

[Figure: the same concurrent heap-update example, now with the swap sequences enclosed in transactions.]

TM-based coarse-grain synchronization (cgs-tm)
◮ enclose DecreaseKey within a transaction
◮ allows multiple swap sequences to execute in parallel as long as they are spatially (and temporally) non-overlapping
◮ a conflicting transaction stalls and retries, or aborts

TM-based fine-grain synchronization (fgs-tm)
◮ enclose each swap operation within a transaction
◮ atomicity as in fgs-lock
◮ shorter but more transactions

SLIDE 10

Performance of FGMT with TM

[Plot: multithreaded speedup vs. number of threads (2–16) for perfbar+cgs-lock, perfbar+fgs-lock, perfbar+cgs-tm, and perfbar+fgs-tm.]

TM-based schemes offer speedups of up to ∼1.1; cgs-tm incurs less overhead, yet is equally able to exploit the available concurrency.


SLIDE 11

Helper-Threading Scheme

Motivation
◮ expose more parallelism to each thread
◮ eliminate costly barrier synchronization

Rationale
◮ in the serial algorithm, relaxations are performed only from the extracted (settled) vertex
◮ allow relaxations for out-edges of queued vertices, hoping that some of them might already be settled

◮ the main thread operates as in the serial algorithm
◮ assign the next t vertices in the queue (x2 . . . xt+1) to t helper threads
◮ helper thread k relaxes all out-edges of vertex xk

[Figure: the example graph at steps i−1 and i; helper threads work on the next vertices in the queue.]

Speculation on the status of d(xk)
◮ if already optimal, the main thread is offloaded
◮ if not optimal, any suboptimal relaxations will eventually be corrected by the main thread
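Why this speculation is safe: a relaxation from a queued vertex x uses d(x) ≥ δ(x), so it can only produce estimates that are still upper bounds on the true distances, and the main thread's own relaxations later tighten them. This can be checked with a small single-threaded experiment (a sketch: the "helpers" are emulated here by extra interleaved relaxations from randomly chosen, possibly unsettled vertices, rather than by real concurrent threads):

```python
import heapq
import random

def dijkstra_ref(graph, s):
    """Plain serial Dijkstra (lazy-deletion heap) as the reference."""
    INF = float("inf")
    d = {v: INF for v in graph}
    d[s] = 0
    pq = [(0, s)]
    while pq:
        du, u = heapq.heappop(pq)
        if du > d[u]:
            continue
        for v, w in graph[u]:
            if du + w < d[v]:
                d[v] = du + w
                heapq.heappush(pq, (d[v], v))
    return d

def dijkstra_speculative(graph, s, rng):
    """After each ExtractMin, also relax the out-edges of a few random
    vertices, mimicking helper threads. Final distances stay correct."""
    INF = float("inf")
    d = {v: INF for v in graph}
    d[s] = 0
    pq = [(0, s)]
    while pq:
        du, u = heapq.heappop(pq)
        if du > d[u]:
            continue
        # u plus speculative, possibly unsettled, extra source vertices
        sources = [u] + rng.sample(list(graph), k=min(2, len(graph)))
        for x in sources:
            for v, w in graph[x]:
                if d[x] + w < d[v]:      # never drops below δ(v)
                    d[v] = d[x] + w
                    heapq.heappush(pq, (d[v], v))
    return d
```

Because every speculative update goes through the same relaxation test, a settled vertex (d(v) = δ(v)) can never be disturbed, matching the slide's correctness argument.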


SLIDE 12

Execution Pattern

[Figure: execution timelines for Serial, FGMT, and Helper Threads over steps k, k+1, k+2; the phases are extract-min, relax edges, and read tid-th min; the main thread kills the helpers at each step boundary.]

the main thread stops all helpers at the end of each iteration
unfinished work will be corrected, just like mis-speculated distances


SLIDE 13

Helper-Threading Scheme

Main thread

while Q ≠ ∅ do
    u ← ExtractMin(Q);
    done ← 0;
    foreach v adjacent to u do
        sum ← d[u] + w(u, v);
        Begin-Xact
        if d[v] > sum then
            DecreaseKey(Q, v, sum);
            d[v] ← sum;
            π[v] ← u;
        End-Xact
    end
    Begin-Xact
    done ← 1;
    End-Xact
end

Helper thread

while Q ≠ ∅ do
    while done = 1 do ;
    x ← ReadMin(Q, tid);
    stop ← 0;
    foreach y adjacent to x, while stop = 0 do
        Begin-Xact
        if done = 0 then
            sum ← d[x] + w(x, y);
            if d[y] > sum then
                DecreaseKey(Q, y, sum);
                d[y] ← sum;
                π[y] ← x;
        else
            stop ← 1;
        End-Xact
    end
end

for a single neighbour, the check for relaxation, the updates to the heap, and the updates to the d and π arrays are enclosed within a transaction
◮ performed “all-or-none”
◮ on a conflict, only one thread commits

the interruption of helper threads is implemented through TM as well


SLIDE 14

Helper-Threading Scheme

(Main-thread and helper-thread pseudocode repeated from the previous slide.)

Why with TM?

composable
◮ all dependent atomic sub-operations are composed into one large atomic operation, without limiting concurrency

optimistic

easily programmable


SLIDE 15

Experimental Setup

Full-system simulation
◮ Simics 3.0.31 in conjunction with the GEMS toolset 2.1
◮ boots unmodified Solaris 10 (UltraSPARC III Cu)

LogTM (“Signature Edition”)
◮ eager version management
◮ eager conflict detection: on a conflict, a transaction stalls and either retries or aborts
◮ HYBRID conflict resolution policy: favors older transactions

Hardware platform
◮ single CMP system (configurations up to 32 cores)
◮ private L1 caches (64KB), shared L2 cache (2MB)

Software
◮ Pthreads for threading and synchronization
◮ Simics “magic” instructions to simulate idealized barriers
◮ Sun Studio 12 C compiler (-xO3)


SLIDE 16

Graphs

Three graph families (GTgraph graph generator)
◮ Random: G(n, m) model
◮ SSCA#2: cliques of varying size (1 – C) connected with probability P
◮ R-MAT: power-law degree distributions

Fixed #nodes (10K), varying density
◮ sparse (∼10K edges)
◮ medium (∼100K edges)
◮ dense (∼200K edges)


SLIDE 17

Speedups

[Plots: multithreaded speedup vs. number of threads (2–16) for perfbar+cgs-lock, perfbar+cgs-tm, perfbar+fgs-lock, perfbar+fgs-tm, and helper, on six graphs: ssca2-10000x28351, ssca2-10000x118853, rmat-10000x10000, rmat-10000x200000, rand-10000x100000, rand-10000x200000.]

Helper-Threading
◮ speedups in 6 out of 9 cases (not all shown), up to 1.46
◮ performance improves with increasing density
◮ main thread not obstructed by helpers (<1% abort rate in all cases)

FGMT with TM
◮ speedups only with perfect barriers
◮ optimistic parallelism does exist in concurrent queue updates


SLIDE 18

Conclusions

FGMT
◮ conventional synchronization mechanisms incur unacceptable overhead
◮ TM reduces overheads and highlights the existence of parallelism, but still requires very efficient barriers to offer any speedup

HT with TM
◮ exposes more parallelism and eliminates barrier synchronization
◮ noteworthy speedups with minimal code extensions

Future work
◮ more aggressive parallelization schemes
◮ dynamic adaptation of helper threads to the algorithm’s execution phases
◮ explore the impact of TM characteristics
◮ applicability of HT to other SSSP algorithms (∆-stepping, Bellman-Ford) and other similar (“greedy”) applications


SLIDE 19

Thank you!

Questions?
