

SLIDE 1

Callisto-RTS: Fine-Grain Parallel Loops

Tim Harris (Oracle Labs), Stefan Kaestle (ETH Zurich) ATC 2015


SLIDE 2

Callisto-RTS

◮ OpenMP-like runtime system for (NUMA) shared-memory machines
◮ Aims to scale better than OpenMP, while requiring less manual tuning
◮ ‘Automatic’ handling of nested loops
◮ Scales well to very small batch sizes (~1K cycles)


SLIDE 3

Implementation – API

“Our initial workloads are graph analytics algorithms generated by a compiler from the Green-Marl DSL [12]. Therefore, while we aim for the syntax to be reasonably clear, our main goal is performance.”


SLIDE 4

Implementation – API

struct example_1 {
    atomic<int> total {0};
    void work(int idx) {
        total += idx;   // Atomic add
    }
} e1;

parallel_for<example_1, int>(e1, 0, 10);
cout << e1.total;


SLIDE 5

Implementation – API 2

◮ Adding functions void fork(thread_data_t &) and void join(thread_data_t &) allows thread-local accumulation (see the sketch after this list).
◮ They considered C++ lambdas, but “performance using current compilers appears to depend a great deal on the behavior of optimization heuristics.”
◮ Each parallel loop ends with an implicit barrier (so preemptions are a problem).
◮ Nested parallel loops need an additional loop level, with 0 being the innermost loop.
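A minimal sketch of how the fork/join hooks might be used for a per-thread sum, following the example_1 style from the previous slide. The contents of thread_data_t and the exact work() signature used together with per-thread data are assumptions for illustration, not taken from the slides.

struct example_2 {
    atomic<int> total {0};

    // Assumed per-thread state; the real thread_data_t layout is not shown on the slides.
    struct thread_data_t { int local_sum; };

    void fork(thread_data_t &td) { td.local_sum = 0; }              // run once per thread, before its iterations
    void work(thread_data_t &td, int idx) { td.local_sum += idx; }  // no atomics on the hot path (signature assumed)
    void join(thread_data_t &td) { total += td.local_sum; }         // merge per-thread results at the implicit barrier
} e2;

parallel_for<example_2, int>(e2, 0, 10);
cout << e2.total;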


SLIDE 6

Implementation

(a) Thread t1 executes sequential code. Other threads wait for work.
(b) Thread t1 enters a loop at level 0. All threads participate in the loop.
(c) Thread t1 enters a loop at level 1. Threads t1 and t5 participate. Threads t6–t8 now wait for work from t5.
(d) Threads t1 and t5 enter loops at level 0; threads participate in their respective leader’s loop.

◮ Threads are organized in a tree, with a fixed level for each thread.
◮ A thread at level n becomes a leader when it hits a loop with level k < n.
◮ A follower (= not a leader) at level n executes iterations in a loop when its leader hits a loop at level k ≤ n (see the sketch below).
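A tiny sketch restating the leader/follower rule above as code. The comparison operators are reconstructed from the t1/t5 example (the relation symbols were lost in extraction), and the function names are illustrative, not part of the Callisto-RTS API.

// Reconstructed rule: a loop's level is compared against each thread's fixed level.
bool becomes_leader(int thread_level, int loop_level) {
    return loop_level < thread_level;     // e.g. t5 (level 1) leads loops started at level 0
}
bool follower_executes(int follower_level, int loop_level) {
    return loop_level <= follower_level;  // e.g. t6 (level 0) runs iterations of level-0 loops
}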


SLIDE 7

Work stealing

(a) Distribution during a level-0 loop led by t1 in which all threads participate, using separate request combiners on each socket.
(b) Distribution during a level-0 loop led by t1 (left) and by t5 (right).

Work distributors, combined in a hierarchy (a sketch of the simplest case follows):

◮ Shared counter – a single iteration counter, modified with atomic instructions.
◮ Distributed counters – work is evenly distributed among ‘stripes’; threads complete the work in their own stripe before moving to others.
◮ Request combiner – aggregates multiple requests for work and forwards them to the next level up.
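A minimal sketch of the simplest distributor above: a single shared iteration counter from which each thread claims a fixed-size batch with one atomic instruction. Names, types, and the batch-size handling are illustrative assumptions, not the Callisto-RTS implementation.

#include <atomic>
#include <utility>

struct shared_counter {
    std::atomic<long> next{0};   // next unclaimed iteration
    long end{0};                 // loop bound (exclusive), set when the loop starts
    long batch{64};              // iterations handed out per request

    // Claim one batch; returns [start, stop), empty when the loop is exhausted.
    std::pair<long, long> claim() {
        long start = next.fetch_add(batch, std::memory_order_relaxed);
        if (start >= end) return {end, end};
        long stop = start + batch;
        return {start, stop < end ? stop : end};
    }
};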


SLIDE 8

Request combining

my_slot->start = REQ;
while (1) {
    if (!trylock(&combiner->lock)) {
        while (is_locked(&combiner->lock)) ;
    } else {
        // collect requests from other threads,
        // issue aggregated request,
        // distribute work
        unlock(&combiner->lock);
    }
    if (my_slot->start != REQ) {
        return (my_slot->start, my_slot->end);
    }
}
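A hedged sketch of what the combiner's leader path (the three comments above) could look like: count the waiting slots, make one aggregated claim on a parent counter, then hand a batch back to each requester. The slot array, parent counter, sentinel value, and all names are assumptions for illustration, not the actual Callisto-RTS code.

#include <atomic>

const long REQ = -1;                          // assumed sentinel: slot owner is waiting for work
struct slot_t { std::atomic<long> start{REQ}; long end{0}; };

void combine(slot_t *slots, int n_slots,
             std::atomic<long> &parent_counter, long batch) {
    int waiting = 0;
    for (int i = 0; i < n_slots; i++)         // collect requests from other threads
        if (slots[i].start.load() == REQ) waiting++;

    long base = parent_counter.fetch_add((long)waiting * batch);  // one aggregated request upstream
                                                                  // (loop-bound clamping omitted)
    for (int i = 0; i < n_slots && waiting > 0; i++) {            // distribute work
        if (slots[i].start.load() != REQ) continue;
        slots[i].end = base + batch;          // publish the end first...
        slots[i].start.store(base);           // ...then start, which the requesters poll on
        base += batch;
        waiting--;
    }
}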


SLIDE 9

Evaluation – Hardware

2-socket Xeon E5-2650 (Ivy Bridge)

◮ 8 cores per socket
◮ L1 and L2 per core
◮ 2 hardware threads per core

8-socket SPARC T5

◮ 16 cores per socket
◮ L1 and L2 per core
◮ 8 hardware threads per core


SLIDE 10

Evaluation – Microbenchmarks 1

[Plots: normalized speedup vs. batch size for the Global, Per-socket, Per-core, Per-thread, Per-core combine-sync, and Per-core combine-async distributors.]

Even distribution: 2-socket Xeon (X4-2), 4 threads (left), 16 threads (center), 32 threads (right)


Skewed distribution: 2-socket Xeon (X4-2), 4 threads (left), 16 threads (center), 32 threads (right)


SLIDE 11

Evaluation – Microbenchmarks 2

[Plots: normalized speedup vs. batch size; same distributor legend as on the previous slide.]

Even distribution: 8-socket T5 (T5-8), 128 threads (left), 512 threads (center), 1024 threads (right)


Skewed distribution: 8-socket T5 (T5-8), 128 threads (left), 512 threads (center), 1024 threads (right)


SLIDE 12

Evaluation – Graph Algorithms

[Four heat maps of normalized execution time over thread count × batch size; Best = 1.00, 1.63, 1.11, 0.83.]

T5-8, OpenMP (left), single global counter, per-socket counters, per-core counters with async combining (right), PageRank – LiveJournal. The best OpenMP execution took 0.26s (512 threads, 1024 batch size).

[Four heat maps of normalized execution time over thread count × batch size; Best = 1.00, 1.73, 0.94, 0.94.]

T5-8, OpenMP (left), single global counter, per-socket counters, per-core counters with async combining (right), PageRank – Twitter. The best OpenMP execution took 6.0s (512 threads, 1024 batch size).

[Four heat maps of normalized execution time over thread count × batch size; Best = 1.00, 1.21, 0.66, 0.61.]

T5-8, OpenMP (left), single global counter, per-socket counters, per-core counters with async combining (right), Triangle counting – LiveJournal. The best OpenMP execution took 0.21s (256 threads, 256 batch size).



SLIDE 13

Evaluation – Comparison to Galois

[Plots: speedup over sequential vs. threads, C-RTS vs. Galois.]

2-socket Xeon (X4-2), PageRank with LiveJournal input (left) and Twitter input (right).

[Plots: speedup over sequential vs. threads, for C-RTS, C-RTS (per-socket), and Galois.]

8-socket T5 (T5-8), PageRank with LiveJournal input (left) and Twitter input (right).

Galois uses per-socket queues to dispatch work blocks, which worker threads draw from.


SLIDE 14

Evaluation – Nested parallelism

[Plot: execution time (s) vs. threads, with no nesting and with nesting at 4, 2, and 1 cores.]

Figure 7: Betweenness centrality using nested parallelism at different levels.

◮ With 128 threads + flat parallelism: 9.8% miss rate in the L2-D cache
◮ With 1024 threads + flat parallelism: 29%
◮ With 1024 threads + nested parallelism: 10.8%


SLIDE 15

Discussion

◮ What to do when there are other processes? Busy waiting and barriers don’t really work then.
◮ What is the relation to Callisto?
◮ What is the problem with C++ lambdas?
