Callisto-RTS: Fine-Grain Parallel Loops
Tim Harris (Oracle Labs), Stefan Kaestle (ETH Zurich) ATC 2015
Overview: Callisto-RTS is an OpenMP-like runtime system for (NUMA) shared-memory machines. It aims to scale better than OpenMP, while requiring …
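As a rough illustration of the kind of fine-grain parallel loop the runtime provides, the sketch below shows a batched dynamic-scheduling loop in C++. This is a hypothetical simplification, not the actual Callisto-RTS API: worker threads claim fixed-size batches of iterations from a single shared atomic counter, and the batch size trades contention on the counter against load balance (the trade-off the later plots explore).

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical sketch (not the real Callisto-RTS interface): each worker
// repeatedly claims a batch of iterations from a shared counter and runs
// the loop body on that batch until the iteration space is exhausted.
template <typename Fn>
void parallel_for(int64_t begin, int64_t end, int64_t batch, Fn body) {
    std::atomic<int64_t> next{begin};
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t) {
        workers.emplace_back([&] {
            for (;;) {
                int64_t start = next.fetch_add(batch);  // claim a batch
                if (start >= end) break;                // no work left
                int64_t stop = std::min(start + batch, end);
                for (int64_t i = start; i < stop; ++i) body(i);
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

With a small batch size, every claim hits the shared counter (good balance, high contention); with a large one, contention drops but skewed workloads balance poorly — which is why the evaluation sweeps batch sizes from 4 to 1024.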
(a) Thread t1 executes sequential code; other threads wait for work. (b) Thread t1 enters a loop at level 0; all threads participate in the loop. (c) Thread t1 enters a loop at level 1; threads t1 and t5 …
(d) Threads t1 and t5 enter loops at level 0; the remaining threads participate in the respective loops.
(a) Distribution during a level-0 loop led by t1 in which all threads participate, using separate request combiners on each socket. (b) Distribution during a level-0 loop led by t1 (left) and by t5 (right).
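The per-socket distribution in the figure can be sketched in C++ as a two-level counter: each socket refills a local iteration range from a single global counter in large chunks, so cross-socket traffic occurs once per chunk rather than once per batch. All names here are illustrative, not from the runtime, and a real implementation would claim lock-free rather than under a mutex.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <mutex>
#include <utility>

// Hypothetical per-socket counter: a local range [next, limit) refilled
// from the global counter in chunks much larger than a worker's batch.
struct SocketCounter {
    std::mutex m;      // simplification; the runtime avoids a lock here
    int64_t next = 0;  // next unclaimed iteration in the local range
    int64_t limit = 0; // end of the local range
};

// Claim up to `batch` iterations for one worker on this socket.
// Returns {start, stop}; returns {end, end} when the loop is finished.
std::pair<int64_t, int64_t> claim(SocketCounter& s,
                                  std::atomic<int64_t>& global,
                                  int64_t batch, int64_t chunk,
                                  int64_t end) {
    std::lock_guard<std::mutex> g(s.m);
    if (s.next >= s.limit) {                      // local range exhausted:
        int64_t start = global.fetch_add(chunk);  // one global access per chunk
        if (start >= end) return {end, end};
        s.next = start;
        s.limit = std::min(start + chunk, end);
    }
    int64_t start = s.next;
    s.next = std::min(s.next + batch, s.limit);
    return {start, s.next};
}
```

Workers pay the cheap local claim on the common path; only one claim per chunk crosses sockets, which is the effect the per-socket and combining schemes in the following plots are measuring.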
[Plots: normalized speedup vs. batch size (4–1024) for six counter schemes: Global, Per-socket, Per-core, Per-thread, Per-core combine-sync, Per-core combine-async.]
Even distribution: 2-socket Xeon (X4-2), 4 threads (left), 16 threads (center), 32 threads (right).
[Plots: normalized speedup vs. batch size (4–1024), same six schemes.]
Skewed distribution: 2-socket Xeon (X4-2), 4 threads (left), 16 threads (center), 32 threads (right).
[Plots: normalized speedup vs. batch size (4–1024) for six counter schemes: Global, Per-socket, Per-core, Per-thread, Per-core combine-sync, Per-core combine-async.]
Even distribution: 8-socket T5 (T5-8), 128 threads (left), 512 threads (center), 1024 threads (right).
[Plots: normalized speedup vs. batch size (4–1024), same six schemes.]
Skewed distribution: 8-socket T5 (T5-8), 128 threads (left), 512 threads (center), 1024 threads (right).
[Heat maps: normalized execution time over thread count (32–1024) and batch size (4–1024); best normalized times 1.00, 1.63, 1.11, 0.83 for the four configurations below, left to right.]
T5-8, OpenMP (left), single global counter, per-socket counters, per-core counters with async combining (right), PageRank – LiveJournal. The best OpenMP execution took 0.26s (512 threads, 1024 batch size).
[Heat maps: normalized execution time over thread count (32–1024) and batch size (4–1024); best normalized times 1.00, 1.73, 0.94, 0.94 for the four configurations below, left to right.]
T5-8, OpenMP (left), single global counter, per-socket counters, per-core counters with async combining (right), PageRank – Twitter. The best OpenMP execution took 6.0s (512 threads, 1024 batch size).
[Heat maps: normalized execution time over thread count (32–1024) and batch size (4–1024); best normalized times 1.00, 1.21, 0.66, 0.61 for the four configurations below, left to right.]
T5-8, OpenMP (left), single global counter, per-socket counters, per-core counters with async combining (right), Triangle counting – LiveJournal. The best OpenMP execution took 0.21s (256 threads, 256 batch size).
[Plots: speedup over sequential vs. thread count (1–32) for C-RTS and Galois.]
2-socket Xeon (X4-2), PageRank with LiveJournal input (left) and Twitter input (right).
[Plots: speedup over sequential vs. thread count (1–1024) for C-RTS, C-RTS (per-socket), and Galois.]
8-socket T5 (T5-8), PageRank with LiveJournal input (left) and Twitter input (right).
[Plot: execution time (s) vs. thread count (32–1024); series: No nesting, 4 cores, 2 cores, 1 core.]
Figure 7: Betweenness centrality using nested parallelism at different levels.