

SLIDE 1

Callisto-RTS: Fine-Grain Parallel Loops

Tim Harris (Oracle Labs), Stefan Kaestle (ETH Zurich) ATC 2015


SLIDE 2

Callisto-RTS

◮ OpenMP-like runtime system for (NUMA) shared-memory machines
◮ Aims to scale better than OpenMP, while requiring less manual tuning
◮ ‘Automatic’ handling of nested loops
◮ Scales well to very small batch sizes (~1K cycles)


SLIDE 3

Implementation – API

“Our initial workloads are graph analytics algorithms generated by a compiler from the Green-Marl DSL [12]. Therefore, while we aim for the syntax to be reasonably clear, our main goal is performance.”


SLIDE 4

Implementation – API

struct example_1 {
    atomic<int> total {0};
    void work(int idx) {
        total += idx;   // Atomic add
    }
} e1;

parallel_for<example_1, int>(e1, 0, 10);
cout << e1.total;


SLIDE 5

Implementation – API 2

◮ Adding functions void fork(thread_data_t &) and void join(thread_data_t &) allows thread-local accumulation (see the sketch after this list).
◮ They considered C++ lambdas, but “performance using current compilers appears to depend a great deal on the behavior of optimization heuristics.”
◮ Each parallel loop ends with an implicit barrier (so preemptions are a problem).
◮ Nested parallel loops need an additional loop level, with 0 being the innermost loop.
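A minimal sketch of how the fork/join hooks might be used for a per-thread sum, following the example_1 style from the previous slide. The contents of thread_data_t and the exact work() signature used together with per-thread data are assumptions for illustration, not taken from the slides.

struct example_2 {
    atomic<int> total {0};

    // Assumed per-thread state; the real thread_data_t layout is not shown on the slides.
    struct thread_data_t { int local_sum; };

    void fork(thread_data_t &td) { td.local_sum = 0; }              // run once per thread, before its iterations
    void work(thread_data_t &td, int idx) { td.local_sum += idx; }  // no atomics on the hot path (signature assumed)
    void join(thread_data_t &td) { total += td.local_sum; }         // merge per-thread results at the implicit barrier
} e2;

parallel_for<example_2, int>(e2, 0, 10);
cout << e2.total;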


SLIDE 6

Implementation

(a) Thread t1 executes sequential code. Other threads wait for work.
(b) Thread t1 enters a loop at level 0. All threads participate in the loop.
(c) Thread t1 enters a loop at level 1. Threads t1 and t5 participate. Threads t6–t8 now wait for work from t5.
(d) Threads t1 and t5 enter loops at level 0; threads participate in their respective leader’s loop.

◮ Threads are organized in a tree, with a fixed level for each thread.
◮ A thread at level n becomes a leader when it hits a loop with level k < n.
◮ A follower (= not a leader) at level n executes iterations in a loop when its leader hits a loop at level k ≤ n (see the sketch below).
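A tiny sketch restating the leader/follower rule above as code. The comparison operators are reconstructed from the t1/t5 example (the relation symbols were lost in extraction), and the function names are illustrative, not part of the Callisto-RTS API.

// Reconstructed rule: a loop's level is compared against each thread's fixed level.
bool becomes_leader(int thread_level, int loop_level) {
    return loop_level < thread_level;     // e.g. t5 (level 1) leads loops started at level 0
}
bool follower_executes(int follower_level, int loop_level) {
    return loop_level <= follower_level;  // e.g. t6 (level 0) runs iterations of level-0 loops
}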


SLIDE 7

Work stealing

(a) Distribution during a level-0 loop led by t1 in which all threads participate, using separate request combiners on each socket.
(b) Distribution during a level-0 loop led by t1 (left) and by t5 (right).

Work distributors, combined in a hierarchy (a sketch of the simplest case follows):

◮ Shared counter – a single iteration counter, modified with atomic instructions.
◮ Distributed counters – work is evenly distributed among ‘stripes’; threads complete the work in their own stripe before moving to others.
◮ Request combiner – aggregates multiple requests for work and forwards them to the next level up.
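A minimal sketch of the simplest distributor above: a single shared iteration counter from which each thread claims a fixed-size batch with one atomic instruction. Names, types, and the batch-size handling are illustrative assumptions, not the Callisto-RTS implementation.

#include <atomic>
#include <utility>

struct shared_counter {
    std::atomic<long> next{0};   // next unclaimed iteration
    long end{0};                 // loop bound (exclusive), set when the loop starts
    long batch{64};              // iterations handed out per request

    // Claim one batch; returns [start, stop), empty when the loop is exhausted.
    std::pair<long, long> claim() {
        long start = next.fetch_add(batch, std::memory_order_relaxed);
        if (start >= end) return {end, end};
        long stop = start + batch;
        return {start, stop < end ? stop : end};
    }
};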


SLIDE 8

Request combining

my_slot->start = REQ;
while (1) {
    if (!trylock(&combiner->lock)) {
        while (is_locked(&combiner->lock)) ;
    } else {
        // collect requests from other threads,
        // issue aggregated request,
        // distribute work
        unlock(&combiner->lock);
    }
    if (my_slot->start != REQ) {
        return (my_slot->start, my_slot->end);
    }
}
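A hedged sketch of what the combiner's leader path (the three comments above) could look like: count the waiting slots, make one aggregated claim on a parent counter, then hand a batch back to each requester. The slot array, parent counter, sentinel value, and all names are assumptions for illustration, not the actual Callisto-RTS code.

#include <atomic>

const long REQ = -1;                          // assumed sentinel: slot owner is waiting for work
struct slot_t { std::atomic<long> start{REQ}; long end{0}; };

void combine(slot_t *slots, int n_slots,
             std::atomic<long> &parent_counter, long batch) {
    int waiting = 0;
    for (int i = 0; i < n_slots; i++)         // collect requests from other threads
        if (slots[i].start.load() == REQ) waiting++;

    long base = parent_counter.fetch_add((long)waiting * batch);  // one aggregated request upstream
                                                                  // (loop-bound clamping omitted)
    for (int i = 0; i < n_slots && waiting > 0; i++) {            // distribute work
        if (slots[i].start.load() != REQ) continue;
        slots[i].end = base + batch;          // publish the end first...
        slots[i].start.store(base);           // ...then start, which the requesters poll on
        base += batch;
        waiting--;
    }
}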


SLIDE 9

Evaluation – Hardware

2-socket Xeon E5-2650 (Ivy Bridge)

◮ 8 cores per socket
◮ L1 and L2 per core
◮ 2 hardware threads per core

8-socket SPARC T5

◮ 16 cores per socket
◮ L1 and L2 per core
◮ 8 hardware threads per core


SLIDE 10

Evaluation – Microbenchmarks 1

[Plots: normalized speedup vs. batch size for the Global, Per-socket, Per-core, Per-thread, Per-core combine-sync, and Per-core combine-async distributors.]

Even distribution: 2-socket Xeon (X4-2), 4 threads (left), 16 threads (center), 32 threads (right)


Skewed distribution: 2-socket Xeon (X4-2), 4 threads (left), 16 threads (center), 32 threads (right)


SLIDE 11

Evaluation – Microbenchmarks 2

[Plots: normalized speedup vs. batch size; same distributor legend as on the previous slide.]

Even distribution: 8-socket T5 (T5-8), 128 threads (left), 512 threads (center), 1024 threads (right)


Skewed distribution: 8-socket T5 (T5-8), 128 threads (left), 512 threads (center), 1024 threads (right)


SLIDE 12

Evaluation – Graph Algorithms

[Four heat maps of normalized execution time over thread count × batch size; Best = 1.00, 1.63, 1.11, 0.83.]

T5-8, OpenMP (left), single global counter, per-socket counters, per-core counters with async combining (right), PageRank – LiveJournal. The best OpenMP execution took 0.26s (512 threads, 1024 batch size).

[Four heat maps of normalized execution time over thread count × batch size; Best = 1.00, 1.73, 0.94, 0.94.]

T5-8, OpenMP (left), single global counter, per-socket counters, per-core counters with async combining (right), PageRank – Twitter. The best OpenMP execution took 6.0s (512 threads, 1024 batch size).

[Four heat maps of normalized execution time over thread count × batch size; Best = 1.00, 1.21, 0.66, 0.61.]

T5-8, OpenMP (left), single global counter, per-socket counters, per-core counters with async combining (right), Triangle counting – LiveJournal. The best OpenMP execution took 0.21s (256 threads, 256 batch size).



SLIDE 13

Evaluation – Comparison to Galois

[Plots: speedup over sequential vs. threads, C-RTS vs. Galois.]

2-socket Xeon (X4-2), PageRank with LiveJournal input (left) and Twitter input (right).

[Plots: speedup over sequential vs. threads, for C-RTS, C-RTS (per-socket), and Galois.]

8-socket T5 (T5-8), PageRank with LiveJournal input (left) and Twitter input (right).

Galois uses per-socket queues to dispatch work blocks, which worker threads draw from.


SLIDE 14

Evaluation – Nested parallelism

[Plot: execution time (s) vs. threads, with no nesting and with nesting at 4, 2, and 1 cores.]

Figure 7: Betweenness centrality using nested parallelism at different levels.

◮ With 128 threads + flat parallelism: 9.8% miss rate in the L2-D cache
◮ With 1024 threads + flat parallelism: 29%
◮ With 1024 threads + nested parallelism: 10.8%


SLIDE 15

Discussion

◮ What to do when there are other processes? Busy waiting and barriers don’t really work then.
◮ What is the relation to Callisto?
◮ What is the problem with C++ lambdas?
