  1. Callisto-RTS: Fine-Grain Parallel Loops. Tim Harris (Oracle Labs), Stefan Kaestle (ETH Zurich). ATC 2015.

  2. Callisto-RTS
  ◮ OpenMP-like runtime system for (NUMA) shared-memory machines
  ◮ Aims to scale better than OpenMP, while requiring less manual tuning
  ◮ ‘Automatic’ handling of nested loops
  ◮ Scales well to very small block sizes (1K cycles)

  3. Implementation – API
  “Our initial workloads are graph analytics algorithms generated by a compiler from the Green-Marl DSL [12]. Therefore, while we aim for the syntax to be reasonably clear, our main goal is performance.”

  4. Implementation – API

      struct example_1 {
        atomic<int> total{0};
        void work(int idx) {
          total += idx; // Atomic add
        }
      } e1;

      parallel_for<example_1, int>(e1, 0, 10);
      cout << e1.total;

  5. Implementation – API (2)
  ◮ Adding functions void fork(thread_data_t &) and void join(thread_data_t &) allows thread-local accumulation (see the sketch below).
  ◮ They considered C++ lambdas, but “performance using current compilers appears to depend a great deal on the behavior of optimization heuristics.”
  ◮ Each parallel loop ends with an implicit barrier (so preemptions are a problem).
  ◮ Nested parallelized loops need an additional loop-level argument, with 0 being the innermost loop.
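  A minimal sketch of the fork/join extension, following the example_1 pattern above. The slides name only the two hook signatures; everything else here (a user-visible thread_data_t, its field, a work() overload that also receives the per-thread data) is my assumption, not the Callisto-RTS API.

      #include <atomic>

      // Hypothetical per-thread state for illustration only.
      struct thread_data_t { long local_sum; };

      struct example_2 {
        std::atomic<long> total{0};
        void fork(thread_data_t &td) { td.local_sum = 0; }             // once per worker, before its iterations
        void work(int idx, thread_data_t &td) { td.local_sum += idx; } // no atomics in the hot loop
        void join(thread_data_t &td) { total += td.local_sum; }        // one atomic add per worker
      } e2;

      // Usage would mirror example_1: parallel_for<example_2, int>(e2, 0, 10);
      // leaving total == 45 without per-iteration atomic traffic.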

  6. Implementation
  (a) Thread t1 executes sequential code. Other threads wait for work.
  (b) Thread t1 enters a loop at level 0. All threads participate in the loop.
  (c) Thread t1 enters a loop at level 1. Threads t1 and t5 participate. Threads t6–t8 now wait for work from t5.
  (d) Threads t1 and t5 enter loops at level 0; threads participate in the respective loops.
  ◮ Threads are organized in a tree, with a fixed level for each thread.
  ◮ A thread at level n becomes a leader when it hits a loop with level k ≤ n.
  ◮ A follower (= not a leader) at level n executes iterations in a loop when its leader hits a loop at level k ≤ n.
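  A tiny self-contained illustration of the two rules (my own encoding, not Callisto-RTS code), replaying panels (b) and (c) for a tree with worker threads at level 0 and a socket leader such as t5 at level 1:

      #include <cstdio>

      // A thread at level n leads a loop it hits when the loop level k satisfies k <= n.
      bool leads(int thread_level, int loop_level) { return loop_level <= thread_level; }

      // A follower at level n executes iterations when its leader's loop has level k <= n.
      bool participates(int follower_level, int loop_level) { return loop_level <= follower_level; }

      int main() {
        // Panel (b): level-0 loop; level-0 followers t2-t4 and level-1 t5 all run it.
        std::printf("%d %d\n", participates(0, 0), participates(1, 0)); // 1 1
        // Panel (c): level-1 loop; t5 (level 1) runs it, t2-t4 (level 0) wait.
        std::printf("%d %d\n", participates(1, 1), participates(0, 1)); // 1 0
        return 0;
      }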

  7. Work stealing
  (a) Distribution during a level-0 loop led by t1 in which all threads participate, using separate request combiners on each socket.
  (b) Distribution during a level-0 loop led by t1 (left) and by t5 (right).
  Work distributors, combined in a hierarchy:
  ◮ Shared counter: a single iteration counter, modified with an atomic instruction.
  ◮ Distributed counters: work is evenly distributed among ‘stripes’; threads complete the work in their own stripe before moving to others.
  ◮ Request combiner: aggregates multiple requests for work and forwards them to the next level up.
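  A minimal sketch of the simplest distributor above, the shared counter: each thread claims a fixed-size batch of iterations with one atomic fetch-add. The class name and batching details are my own; the slide only describes the idea.

      #include <algorithm>
      #include <atomic>
      #include <utility>

      struct shared_counter {
        std::atomic<long> next{0};  // next unclaimed iteration
        long limit;                 // one past the last iteration
        long batch;                 // iterations claimed per request

        shared_counter(long limit, long batch) : limit(limit), batch(batch) {}

        // Claim [start, end); an empty range signals that the loop is done.
        std::pair<long, long> claim() {
          long start = next.fetch_add(batch, std::memory_order_relaxed);
          if (start >= limit) return {limit, limit};
          return {start, std::min(start + batch, limit)};
        }
      };

  The distributed-counter and request-combiner variants exist precisely because, at small batch sizes, this one shared cache line becomes the bottleneck.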

  8. Request combining

      my_slot->start = REQ;
      while (1) {
        if (!trylock(&combiner->lock)) {
          while (is_locked(&combiner->lock)) ;
        } else {
          // collect requests from other threads,
          // issue aggregated request,
          // distribute work
          unlock(&combiner->lock);
        }
        if (my_slot->start != REQ) {
          return (my_slot->start, my_slot->end);
        }
      }
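  For reference, a compilable C++ rendering of the pseudocode above. The slot layout, the spinlock built from std::atomic_flag, and the function names are my scaffolding (std::atomic_flag::test requires C++20), and the combining step is elided exactly as on the slide.

      #include <atomic>
      #include <utility>

      constexpr long REQ = -1;

      struct slot { std::atomic<long> start{0}, end{0}; };

      struct combiner_t {
        std::atomic_flag lock;  // default-clear since C++20
        bool trylock()   { return !lock.test_and_set(std::memory_order_acquire); }
        bool is_locked() { return lock.test(std::memory_order_relaxed); } // C++20
        void unlock()    { lock.clear(std::memory_order_release); }
      };

      std::pair<long, long> request_work(slot *my_slot, combiner_t *combiner) {
        my_slot->start.store(REQ);
        for (;;) {
          if (!combiner->trylock()) {
            // Lost the race: the current lock holder will serve our slot.
            while (combiner->is_locked()) { /* spin */ }
          } else {
            // Collect requests from other threads' slots, issue one
            // aggregated request upward, distribute work across slots (elided).
            combiner->unlock();
          }
          if (my_slot->start.load() != REQ)
            return {my_slot->start.load(), my_slot->end.load()};
        }
      }

  The point of the pattern: threads that fail to take the lock do not retry their own request; they wait for the lock holder to batch it with the others, reducing contention on the counter one level up.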

  9. Evaluation – Hardware
  2-socket Xeon E5-2650 (Ivy Bridge):
  ◮ 8 cores per socket
  ◮ L1 and L2 per core
  ◮ 2 hardware threads per core
  8-socket SPARC T5:
  ◮ 16 cores per socket
  ◮ L1 and L2 per core
  ◮ 8 hardware threads per core

  10. Evaluation – Microbenchmarks 1
  [Figure: normalized speedup vs. batch size (1024 down to 4) for six distributor variants: Global, Per-socket, Per-core, Per-thread, Per-core combine-sync, Per-core combine-async. Top row: even distribution; bottom row: skewed distribution. Platform: 2-socket Xeon (X4-2) with 4 threads (left), 16 threads (center), 32 threads (right).]

  11. Evaluation – Microbenchmarks 2
  [Figure: normalized speedup vs. batch size (1024 down to 4) for the same six distributor variants. Top row: even distribution; bottom row: skewed distribution. Platform: 8-socket T5 (T5-8) with 128 threads (left), 512 threads (center), 1024 threads (right).]

  12. Evaluation – Graph Algorithms
  [Figure: heat maps of normalized execution time over thread count (32–1024) and batch size (1024 down to 4) on the T5-8, comparing OpenMP (left) against a single global counter, per-socket counters, and per-core counters with async combining (right).]
  ◮ PageRank – LiveJournal: best configurations 1.00 / 1.63 / 1.11 / 0.83, normalized to OpenMP's best. The best OpenMP execution took 0.26s (512 threads, 1024 batch size).
  ◮ PageRank – Twitter: best configurations 1.00 / 1.73 / 0.94 / 0.94. The best OpenMP execution took 6.0s (512 threads, 1024 batch size).
  ◮ Triangle counting – LiveJournal: best configurations 1.00 / 1.21 / 0.66 / 0.61. The best OpenMP execution took 0.21s (256 threads, 256 batch size).

  13. Evaluation – Comparison to Galois
  [Figure: speedup over sequential execution vs. thread count for C-RTS and Galois. Top: 2-socket Xeon (X4-2), 1–32 threads, PageRank with LiveJournal input (left) and Twitter input (right). Bottom: 8-socket T5 (T5-8), 1–1024 threads, also including C-RTS (per-socket), with the same inputs.]
  Galois uses per-socket queues to dispatch work blocks, which worker threads draw from.

  14. Evaluation – Nested parallelism
  [Figure 7: betweenness centrality using nested parallelism at different levels. Execution time (s) vs. thread count (32–1024) for no nesting and for nesting at 4, 2, and 1 cores.]
  ◮ With 128 threads + flat parallelism: 9.8% misses in the L2-D cache
  ◮ With 1024 threads + flat parallelism: 29%
  ◮ With 1024 threads + nested parallelism: 10.8%

  15. Discussion
  ◮ What to do when there are other processes? Busy waiting and barriers don't really work then.
  ◮ What is the relation to Callisto?
  ◮ What is the problem with C++ lambdas?
