Parallel Performance Optimization
ASD Shared Memory HPC Workshop
Computer Systems Group, ANU Research School of Computer Science
Australian National University, Canberra, Australia
February 13, 2020

Schedule - Day 4
Computer Systems (ANU) Parallel Perf Optimization Feb 13, 2020 2 / 63
NUMA systems
Outline
1 NUMA systems
2 Profiling Codes
3 Intel TBB
4 Lock Free Synchronization
5 Transactional Memory
Non-Uniform Memory Access: Basic Ideas
- NUMA means there is some hierarchy in the main memory system's structure
- all memory is available to the programmer (single address space), but some memory takes longer to access than others
- modular memory systems with interconnects: UMA/NUMA vs NUMA

[Diagram: two memory organizations. Left: cores with caches sharing one interconnect to all memory modules. Right: each pair of cores has its own local memory, with an interconnect between the memory modules]

- on a NUMA system, there are two effects that may be important:
    - thread affinity: once a thread is assigned to a core, ensure that it stays there
    - NUMA affinity: memory used by a process must only be allocated on the socket of the core that it is bound to
Examples of NUMA Configurations
Intel Xeon5500, with QPI
(courtesy qdpma.com)
4-socket Opteron; note extra NUMA level within a socket!
(courtesy qdpma.com)
Examples of NUMA Configurations (II)
8-way ‘glueless’ system (processors are directly connected)
(courtesy qdpma.com)
Case Study: Why NUMA Matters
- MetUM global atmosphere model, 1024 × 769 × 70 grid, on an NCI Intel X5570 / Infiniband supercomputer (2011)

[Graph: Effect of Process and NUMA Affinity on Scaling. Note differing values for t16!]

- on the X5570, local:remote memory access is 65:105 cycles
- the observed slowdown indicates a significant amount of L3$ misses
Case Study: Why NUMA Matters (II)
Time breakdown for no NUMA affinity, 1024 processes (dual socket nodes, 4 cores per socket).
Note: spikes in compute times were always in groups of 4 processes (e.g. socket 0).
Process and Thread Affinity
- in general, the OS is free to decide which core (virtual CPU) a process or thread runs on
- we can restrict which CPUs it will run on by specifying an affinity mask of the CPU ids it may be scheduled to run on
- this has 2 benefits (assuming other active processes/threads are excluded from the specified CPUs):
    - ensure maximum speed for that process/thread
    - minimize cache / TLB pollution caused by context switches
e.g. on an 8-CPU system, create 8 threads to run on different CPUs:
pthread_t threadHandle[8];
cpu_set_t cpu;
for (int i = 0; i < 8; i++) {
    pthread_create(&threadHandle[i], NULL, threadFunc, NULL);
    CPU_ZERO(&cpu);
    CPU_SET(i, &cpu);
    pthread_setaffinity_np(threadHandle[i], sizeof(cpu_set_t), &cpu);
}
for a process, it is similar:
sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpu);
NUMActl: Controlling NUMA from the Shell
- on a NUMA system, we generally wish to bind a process and its memory image to a particular 'node' (= NUMA domain)
- the NUMA API provides a way of controlling policies of memory allocation on a per-node or per-process basis
    - policies are default, bind, interleave, preferred
run a program on a CPU on node 0, with all memory allocated on node 0:
numactl --membind=0 --cpunodebind=0 ./prog -args
similar, but force to be run on CPU 0 (which must be on node 0):
numactl --physcpubind=0 --membind=0 ./prog ./args
- optimize bandwidth for a crucial program to utilize multiple memory controllers (at the expense of other processes!):

numactl --interleave=all ./memhog ...

- numactl --hardware shows available nodes etc.
LibNUMA: Controlling NUMA from within a Program
- with libnuma, we can similarly change (the current thread of) an executing process's node affinity and memory allocation policy
- run from now on on a CPU on node 0, with all memory allocated on node 0:

numa_run_on_node(0);
numa_set_preferred(0);

or:

nodemask_t mask;
nodemask_zero(&mask);
nodemask_set(&mask, 0);
numa_bind(&mask);
to allow it to run on all nodes again:
numa_run_on_node_mask(&numa_all_nodes);
execute a memory hogging function, with all its (new) memory fully interleaved, and then restore to previous state:
nodemask_t prevmask = numa_get_interleave_mask();
numa_set_interleave_mask(&numa_all_nodes);
memhog(...);
numa_set_interleave_mask(&prevmask);
Hands-on Exercise: NUMA Effects
Objective: Explore the effects of Non-Uniform Memory Access (NUMA), that is, the general benefit of ensuring a process and its memory are in the same NUMA domain
Profiling Codes
Profiling: Basics
Profiling is the process of recording information during execution of a program to form an aggregate view of its dynamic behaviour
Compare with tracing, which records an ordered log of events that can be used to reconstruct dynamic behaviour
- Used to understand program performance and find bottlenecks
- At certain points in execution, record program state (instruction pointer, calling context, hardware performance counters, ...)
- Sampling (recurrent event trigger) vs. instrumentation (probes at specific points in the program)
- Real vs. simulated execution
Sampling
[Diagram: sampling. A recurring interrupt (every 10 ms) reads the CPU's program counter and hardware counters (cycles, cache misses, flops), adds them to the function table entry for the current function (Main, Asterix or Obelix), and resets the counters]
When event trigger occurs, record instruction pointer (+ call context) and performance counters: low overhead but subject to sampling error
Source: EuroMPI’12: Introduction to Performance Engineering
Instrumentation
...
Function Obelix (...)
    call monitor("Obelix", "enter")
    ...
    call monitor("Obelix", "exit")
end Obelix
...
[Diagram: instrumentation. Each monitor(routine, location) call updates the function table entry for the routine with the current counter values on "enter" and "exit"]
Inject 'trampolines' into source or binary code: accurate but higher overhead
Source: EuroMPI’12: Introduction to Performance Engineering
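The enter/exit probes above can be mimicked in plain C++ with a scope-based timer; a minimal sketch (the names ScopedProbe, profile and obelix are illustrative, not from any profiling library):

```cpp
#include <chrono>
#include <string>
#include <unordered_map>

// Hypothetical monitor: accumulates call counts and inclusive time per
// function, mimicking the enter/exit probes that instrumentation inserts.
struct FuncStats { long long calls = 0; double seconds = 0.0; };
static std::unordered_map<std::string, FuncStats> profile;

class ScopedProbe {                 // constructor = "enter" probe
    std::string name;
    std::chrono::steady_clock::time_point start;
public:
    explicit ScopedProbe(std::string n)
        : name(std::move(n)), start(std::chrono::steady_clock::now()) {}
    ~ScopedProbe() {                // destructor = "exit" probe
        auto end = std::chrono::steady_clock::now();
        FuncStats &s = profile[name];
        s.calls += 1;
        s.seconds += std::chrono::duration<double>(end - start).count();
    }
};

long long obelix(int n) {           // an instrumented function
    ScopedProbe probe("Obelix");
    long long sum = 0;
    for (int i = 0; i < n; i++) sum += i;   // sums 0..n-1
    return sum;
}
```

Each call to obelix updates the "Obelix" entry of the table, just as the injected monitor calls update the function table in the diagram.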
The 80/20 Rule & Life Cycle
- Programs typically spend 80% of their time in 20% of the code
- Programmers typically spend 20% of their effort to get 80% of the possible speedup
- → optimize for the common case
[Diagram: tuning cycle. Measurement, Analysis, Ranking, Refinement]
[Diagram: life cycle. Coding, Performance Analysis, Program Tuning, Production]
Source: EuroMPI’12: Introduction to Performance Engineering
perf and VTune
perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences in Linux performance measurements and presents a simple command-line interface. perf is based on the perf_events interface exported by recent versions of the Linux kernel.
Intel's VTune is a commercial-grade profiling tool for complex applications, driven via the command-line or a GUI.
perf Reference Material
- http://www.brendangregg.com/perf.html
- Perf for User-Space Program Analysis
- Perf Wiki
VTune Reference Material
- https://software.intel.com/en-us/intel-vtune-amplifier-xe
- Documentation URL
perf for Linux
perf is both a kernel syscall interface and a collection of tools to collect, analyze and present hardware performance counter data either via counting or sampling
perf for Linux
% perf list
  branch-instructions OR branches    [Hardware event]
  branch-misses                      [Hardware event]
  bus-cycles                         [Hardware event]
  cache-misses                       [Hardware event]
  cache-references                   [Hardware event]
  cpu-cycles OR cycles               [Hardware event]
  instructions                       [Hardware event]
  ref-cycles                         [Hardware event]
  ...
  branch-load-misses                 [Hardware cache event]
  branch-loads                       [Hardware cache event]
  dTLB-load-misses                   [Hardware cache event]
  dTLB-loads                         [Hardware cache event]
  dTLB-store-misses                  [Hardware cache event]
  dTLB-stores                        [Hardware cache event]
  iTLB-load-misses                   [Hardware cache event]
  iTLB-loads                         [Hardware cache event]
VTune
Hands-on Exercise: Perf and VTune
Objective:
- Use perf to measure performance of matrix multiply code
- Use VTune to both measure and improve code performance
Intel TBB
Intel Threading Building Blocks (TBB)
- Template library extending C++ for parallelism using tasks
- Focus on divide-and-conquer algorithms
- Thread-safe data structures
- Work stealing scheduler
- Efficient low-level atomic operations
- Scalable memory allocation
- Free software under GPLv2
Content adapted from https://software.intel.com/sites/default/files/IntelAcademic Parallel 08 TBB.pdf
Using TBB - Task Based Approach
- TBB provides C++ constructs that allow you to express parallel solutions in terms of task objects
- Task scheduler manages a thread pool
- Task scheduler avoids common performance problems of programming with threads:
    - Oversubscription - one scheduler thread per hardware thread
    - Fair scheduling - non-preemptive unfair scheduling
    - High overhead - programmer specifies tasks, not threads
    - Load imbalance - work-stealing balances load
Using Task based approach
Fibonacci calculation example: the function fibTBB calculates the nth Fibonacci number using a TBB task_group.
int fibTBB(int n) {
    if (n < 10) {
        return fibSerial(n);
    } else {
        int x, y;
        tbb::task_group g;
        g.run([&]{ x = fibTBB(n-1); });   // spawn a task
        g.run([&]{ y = fibTBB(n-2); });   // spawn another task
        g.wait();                         // wait for both tasks to complete
        return x + y;
    }
}

Content adapted from https://software.intel.com/sites/default/files/m/d/4/1/d/8/1-6-AppThr - Using Tasks Instead of Threads.pdf
Using Tasks
- Developers express the logical parallelism with tasks
- Runtime library schedules these tasks onto an internal pool of worker threads
- Tasks are much lighter weight than threads, hence it is possible to express parallelism at a much finer granularity
- Apart from the task interface, TBB also provides high-level algorithms that implement some of the most common task patterns, such as parallel_invoke, parallel_for, parallel_reduce etc.
TBB Algorithms
- parallel_for, parallel_for_each: load-balanced parallel execution of loop iterations where iterations are independent
- parallel_reduce: load-balanced parallel execution of independent loop iterations that perform a reduction (e.g. summation of array elements)
- parallel_scan: load-balanced computation of a parallel prefix
- parallel_do: load-balanced parallel execution of independent loop iterations with the ability to add more work during its execution
- parallel_sort: parallel sort
- parallel_invoke: parallel execution of function objects or pointers to functions
parallel for
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

template<typename Range, typename Func>
Func parallel_for(const Range& range, const Func& f
                  [, task_group_context& group]);

template<typename Index, typename Func>
Func parallel_for(Index first, Index last [, Index step],
                  const Func& f [, task_group_context& group]);
- Template function parallel_for recursively divides the loop
- Partitions the original range into subranges, and deals out subranges to worker threads in a way that:
    - Balances load
    - Uses cache efficiently
    - Scales
Loop Splitting
- blocked_range<T> is a splittable type representing a 1D iteration space over type T
- Similarly, blocked_range2d<T> for 2D block splitting
- Separate the loop body as an object or lambda expression
void ParallelDecrement(float* a, size_t n) {
    parallel_for(blocked_range<size_t>(0, n),
        [=](const blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                a[i]--;
        }
    );
}
An Example using parallel for
Independent iterations and fixed/known bounds Serial code:
const int N = 100000;

void change_array(float *array, int M) {
    for (int i = 0; i < M; i++) {
        array[i] *= 2;
    }
}

int main() {
    float A[N];
    initialize_array(A);
    change_array(A, N);
    return 0;
}
An Example using parallel for
Using parallel for
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

using namespace tbb;

void parallel_change_array(float* array, size_t M) {
    parallel_for(blocked_range<size_t>(0, M, IdealGrainSize),
        [=](const blocked_range<size_t>& r) -> void {
            for (size_t i = r.begin(); i != r.end(); i++)
                array[i] *= 2;
        }
    );
}
Generic Programming vs Lambda functions
Generic Programming:
class ChangeArrayBody {
    float *array;
public:
    ChangeArrayBody(float *a): array(a) {}
    void operator()(const blocked_range<size_t>& r) const {
        for (size_t i = r.begin(); i != r.end(); i++) {
            array[i] *= 2;
        }
    }
};

void parallel_change_array(float *array, size_t M) {
    parallel_for(blocked_range<size_t>(0, M, IdealGrainSize),
                 ChangeArrayBody(array));
}
Lambda functions:
void parallel_change_array(float *array, size_t M) {
    parallel_for(blocked_range<size_t>(0, M, IdealGrainSize),
        [=](const blocked_range<size_t>& r) -> void {
            for (size_t i = r.begin(); i != r.end(); i++)
                array[i] *= 2;
        });
}
Mutual Exclusion in TBB
Multiple tasks computing the minimum value in an array using parallel_for (note: the accesses to the shared variable min are unsynchronized, a data race):

void ParallelMin(int* a, int n) {
    parallel_for(blocked_range<int>(0, n),
        [=](const blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i)
                if (a[i] < min) min = a[i];   // race on the shared min!
        }
    );
}
Mutex Flavours
TBB provides several flavours of mutex:
- spin_mutex: non-scalable, unfair, fast
- queuing_mutex: scalable, fair, slower
- spin_rw_mutex, queuing_rw_mutex: as above, with reader locks
- mutex and recursive_mutex: wrappers around the native implementation (e.g. Pthreads)
Avoid locks wherever possible
Mutual Exclusion in TBB
Scoped mutex inside critical section
typedef spin_mutex ReductionMutex;
ReductionMutex minMutex;

void ParallelMin(int* a, int n) {
    parallel_for(blocked_range<int>(0, n),
        [=](const blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i) {
                ReductionMutex::scoped_lock lock(minMutex);
                if (a[i] < min) min = a[i];
            }
        }
    );
}
parallel reduce in TBB
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

template<typename Range, typename Value,
         typename Func, typename ReductionFunc>
Value parallel_reduce(const Range& range, const Value& identity,
                      const Func& func, const ReductionFunc& reductionFunc
                      [, partitioner [, task_group_context& group]]);
- parallel_reduce partitions the original range into subranges like parallel_for
- The function func is applied to these subranges; each returned result is then merged with the others (or with identity if there is none) using the function reductionFunc
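The split/apply/merge semantics can be illustrated with a serial sketch (reduce_sketch is a hypothetical stand-in, not TBB; the real parallel_reduce runs the subranges as tasks and merges partial results as they complete):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Serial sketch of parallel_reduce semantics: split [first,last) into
// grain-sized subranges, apply func to each subrange starting from the
// identity value, then merge the partial results with the reduction.
template <typename Value, typename Func, typename ReductionFunc>
Value reduce_sketch(std::size_t first, std::size_t last, std::size_t grain,
                    Value identity, Func func, ReductionFunc merge) {
    std::vector<Value> partials;
    for (std::size_t lo = first; lo < last; lo += grain) {
        std::size_t hi = std::min(lo + grain, last);
        partials.push_back(func(lo, hi, identity));  // one "task" per subrange
    }
    Value result = identity;
    for (const Value &p : partials)
        result = merge(result, p);                   // reduction step
    return result;
}
```

For example, summing 0..99 with a grain of 7 produces the same answer regardless of how the range was split, which is exactly why the reduction function must be associative.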
parallel reduce in TBB - Serial Example
#include <limits>

// Find index of smallest element in a[0...n-1]
size_t serialMinIndex(const float a[], size_t n) {
    float value_of_min = numeric_limits<float>::max();
    size_t index_of_min = 0;
    for (size_t i = 0; i < n; ++i) {
        float value = a[i];
        if (value < value_of_min) {
            value_of_min = value;
            index_of_min = i;
        }
    }
    return index_of_min;
}
parallel reduce in TBB - Parallel Version
#include <limits>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

size_t parallelMinIndex(const float a[], size_t n) {
    return parallel_reduce(
        blocked_range<size_t>(0, n, 10000),
        size_t(0),
        [=](blocked_range<size_t> &r, size_t index_of_min) -> size_t {
            float value_of_min = a[index_of_min];
            for (size_t i = r.begin(); i != r.end(); ++i) {
                float value = a[i];
                if (value < value_of_min) {
                    value_of_min = value;   // accumulate result
                    index_of_min = i;
                }
            }
            return index_of_min;
        },
        [=](size_t i1, size_t i2) {
            return (a[i1] < a[i2]) ? i1 : i2;   // reduction operator
        }
    );
}
Hands-on Exercise: Programming with TBB
Objective: Implement a parallel heat stencil application using TBB and profile it using VTune
Lock Free Synchronization
‘Lock-free’ Data Structures: Motivation
consider the atomic test-and-set operation
atomic int testAndSet(volatile int *Lock) {
    int lv = *Lock; *Lock = 1; return lv;
}
- synchronizes the whole memory system (down to the LLC), costs ≈ 50 cycles, degrades memory access progress for all
- not scalable: energy and time costs are O(N²) (N is the number of cores)
mutual exclusion via test-and-set can be modelled by:
volatile int lock = 0;   // 0..1; 1=locked, 0=unlocked

// thread 0
while (1) {
    // non-critical section 0
    ...
    while (testAndSet(&lock) != 0)
        { /* spin */ }
    // critical section 0
    ...
    lock = 0;
}

// thread 1
while (1) {
    // non-critical section 1
    ...
    while (testAndSet(&lock) != 0)
        { /* spin */ }
    // critical section 1
    ...
    lock = 0;
}
with p threads under heavy contention (time in the non-critical section becomes small), there will be O(p²) atomics!! Can we do better?
Improving Locked Access to Data Structures
- only try atomics when we see the lock is free: reduces #atomics to O(p)

do {
    while (lock == 1) { /* spin */ }
} while (testAndSet(&lock) != 0);

- use adaptive spin-locks: if the lock is not acquired after a certain time, the thread yields
    - Pthreads mutexes should do this; especially useful if p > N
- use algorithms requiring a (small) constant number of atomic operations per access of the critical region
    - can be a general mutex algorithm, or data structure-specific
- better still (possibly), use algorithms using no atomic operations
    - Lamport's Bakery Algorithm: still has O(p²) costs, assumes the order of updates made by one thread is seen by other threads
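The test-and-test-and-set loop above can be written with C++11 atomics; a minimal sketch (the names TTASLock, counter and add_n are illustrative):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Test-and-test-and-set spin lock: spin on a plain load (which hits in
// the local cache) and only attempt the atomic exchange once the lock
// appears free, reducing the number of expensive atomic operations.
class TTASLock {
    std::atomic<int> locked{0};
public:
    void lock() {
        do {
            while (locked.load(std::memory_order_relaxed)) { /* spin */ }
        } while (locked.exchange(1, std::memory_order_acquire) != 0);
    }
    void unlock() { locked.store(0, std::memory_order_release); }
};

long long counter = 0;      // shared data protected by counterLock
TTASLock counterLock;

void add_n(int n) {
    for (int i = 0; i < n; i++) {
        counterLock.lock();
        counter++;          // critical section
        counterLock.unlock();
    }
}
```

With four threads each performing 10000 locked increments, the final count is exact, unlike the racy unlocked version.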
Atomic Operation-based Bakery Algorithm
analogous to protocol of buying seafood at Woolworths
uses:

atomic int fetchAndIncrement(volatile int *v) {
    int old = *v; (*v)++; return old;
}

initialize: ticketNumber = screenNumber = 0;
acquire:  myTicket = fetchAndIncrement(&ticketNumber);
          wait until (myTicket == screenNumber)
release:  screenNumber++;   // does not need to be atomic
- performance: good if H/W has direct support for fetchAndIncrement(), else may require an unbounded number of compare-and-swaps
- problems under contention: a large amount of cache line invalidations due to all threads accessing the 'scoreboard' variable screenNumber
    - can we restrict the # threads accessing their 'scoreboard' to just 2?
- MCS lock: Woolworths analogy: instead of watching a screen, the person just ahead tells you when they are served
    - requires an atomic dual word copy on lock acquire and an atomic compare-and-swap on release
- CLH lock: requires only a single fetch-and-store atomic
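The ticket-based acquire/release protocol above maps directly onto C++11 atomics; a sketch (TicketLock, served and customers are assumed names):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Ticket ("bakery") lock: take a ticket with fetch-and-increment, then
// spin until the "now serving" screen shows that ticket. The release is
// a plain increment of the screen, since only the lock holder writes it.
class TicketLock {
    std::atomic<unsigned> ticketNumber{0};   // next ticket to hand out
    std::atomic<unsigned> screenNumber{0};   // ticket currently served
public:
    void lock() {
        unsigned myTicket = ticketNumber.fetch_add(1, std::memory_order_relaxed);
        while (screenNumber.load(std::memory_order_acquire) != myTicket)
            { /* spin */ }
    }
    void unlock() {
        // only the holder writes screenNumber, so no atomic RMW is needed
        screenNumber.store(screenNumber.load(std::memory_order_relaxed) + 1,
                           std::memory_order_release);
    }
};

long long served = 0;       // shared data protected by the lock
TicketLock bakery;

void customers(int n) {
    for (int i = 0; i < n; i++) {
        bakery.lock();
        served++;           // critical section
        bakery.unlock();
    }
}
```

Unlike test-and-set, this lock is fair: threads enter the critical section in ticket order.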
Synchronization: Barriers
- wait for all to arrive at the same point in the computation before proceeding (none may leave until all have arrived)
- central barrier with p threads (initialize globals ctr to p, sense to 0; each thread initializes its own threadSense to 0):

threadSense = !threadSense;
if (fetchAndDecrement(&ctr) == 1)
    ctr = p, sense = !sense;    // last to reach toggles global sense
else
    wait until (sense == threadSense)

- sense is required for repeated barriers
- caution: deadlock occurs if 1 thread does not participate!
- problems:
    - most processors do not support atomic decrement of a memory location
    - not (very) scalable: p atomic decrements per barrier
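The pseudocode above can be sketched with C++11 atomics, where fetchAndDecrement becomes fetch_sub (CentralBarrier is an assumed name):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Central sense-reversing barrier for p threads: the last arrival resets
// the counter and flips the global sense; the others spin until the
// global sense matches their thread-local sense.
class CentralBarrier {
    const int p;
    std::atomic<int> ctr;
    std::atomic<int> sense{0};
public:
    explicit CentralBarrier(int nthreads) : p(nthreads), ctr(nthreads) {}
    void wait(int &threadSense) {       // threadSense starts at 0
        threadSense = !threadSense;
        if (ctr.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            ctr.store(p, std::memory_order_relaxed);  // reset for reuse
            sense.store(threadSense, std::memory_order_release); // last toggles
        } else {
            while (sense.load(std::memory_order_acquire) != threadSense)
                { /* spin */ }
        }
    }
}; 
```

A quick check: each thread increments a shared counter before the barrier; after the barrier every thread must observe all p increments.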
Synchronization: Combining Tree Barrier
- each pair of threads points to a leaf node in a tree
- each node has a ctr (init. to 2) and a sense flag (init. to 0)
- algorithm: begins on all leaf nodes (each thread has a threadSense=1 flag):

if (fetchAndDecrement(&ctr) == 1)
    call algorithm on parent node
    ctr = 2, sense = !sense
wait until (threadSense == sense)
then, as leaving the barrier, threadSense = !threadSense

- notes:
    - the last thread to reach each node continues up the tree
    - the thread that reaches the root begins the 'wakeup' (reversing sense)
    - upon wakeup, a thread releases its siblings at each node along its path
- performance: 2× the atomic operations, but can distribute the memory locations of node data (e.g. across different LLC banks)
- atomics can be avoided by the scalable tree barrier! (simply replace ctr with 2 adjacent byte flags, one for each thread)
Single Reader, Single Writer Bounded Buffer
Strategy: simply put burden of condition synchronization (retry) on client
class LFBoundedBuffer {
    int N, in, out, *buf;
public:
    LFBoundedBuffer(int Nc) { N = Nc; in = out = 0; buf = new int[N]; }
    bool put(int v) {
        int inNxt = (in + 1) % N;
        if (inNxt == out)           // buffer full
            return false;
        buf[in] = v; in = inNxt;
        return true;
    }
    bool get(int *v) {
        if (in == out)              // buffer empty
            return false;
        *v = buf[out]; out = (out + 1) % N;
        return true;
    }
};
- Why does this work? The two threads do not operate on the same variable. Consider N=2.
- Need to assume updates on in/out don't overtake the add/remove on buf from the other thread.
- Unbounded version: use linked lists; the reader advances a read pointer, leaving the writer to remove previously read nodes.
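That ordering assumption can be made explicit with acquire/release atomics on in and out; a C++11 sketch of the same buffer (SPSCBuffer is an assumed name):

```cpp
#include <atomic>
#include <vector>

// Single-reader, single-writer bounded buffer. The release store on
// in/out guarantees the slot update in buf is visible before the index
// moves, which is exactly the ordering the plain-variable slide version
// relies on informally.
class SPSCBuffer {
    std::vector<int> buf;
    std::atomic<int> in{0}, out{0};
public:
    explicit SPSCBuffer(int n) : buf(n) {}
    bool put(int v) {                        // called only by the writer
        int i = in.load(std::memory_order_relaxed);
        int inNxt = (i + 1) % (int)buf.size();
        if (inNxt == out.load(std::memory_order_acquire))
            return false;                    // buffer full
        buf[i] = v;
        in.store(inNxt, std::memory_order_release);  // publish the slot
        return true;
    }
    bool get(int *v) {                       // called only by the reader
        int o = out.load(std::memory_order_relaxed);
        if (o == in.load(std::memory_order_acquire))
            return false;                    // buffer empty
        *v = buf[o];
        out.store((o + 1) % (int)buf.size(), std::memory_order_release);
        return true;
    }
};
```

Note the full/empty test means a buffer of size N holds at most N-1 items, as in the slide's version.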
‘Lock-Free’ Stack
class LFStack {
    int N, top, nPop, *buf;
public:
    LFStack(int Nc) { N = Nc; nPop = top = 0; buf = new int[N]; }
    bool push(int v) {
        while (1) {
            int oldTop = top;
            if (oldTop == N)
                return false;       // caller must try again
            if (dcas(&top, oldTop, oldTop+1,
                     &buf[oldTop], buf[oldTop], v))
                return true;        // oldTop remained a valid stack top
        }
    }
    bool pop(int *v) {
        while (1) {
            int oldTop = top, oldNPop = nPop;
            if (oldTop == 0)
                return false;       // caller must try again
            *v = buf[oldTop-1];
            if (dcas(&top, oldTop, oldTop-1,
                     &nPop, oldNPop, oldNPop+1))
                return true;        // oldTop remained a valid stack top
        }
    }
};
// note: dcas(&top, oldTop, oldTop-1, v, *v, buf[oldTop]) won't work!
dcas() is a ‘double’ compare-and-swap operation: in this case we use the op on the 1st word to ensure safety, and the 2nd to ensure the data is added / removed atomically
Summary: Lock Free Synchronization
- Use fine-grained locking to reduce contention in operations on shared data structures
- A non-blocking solution avoids overheads due to locks, but still requires appropriate memory fence operations
- Lock-free design does not eliminate contention
Transactional Memory
Transactional Memory - Motivations
or: problems with lock-based solutions:
- convoying: a thread holding a lock is de-scheduled, possibly holding up other threads waiting for the lock
- priority inversion: a lower-priority thread is pre-empted while holding a lock needed by a higher-priority thread
- deadlock: two threads attempt to acquire the same locks but do so in a different order
- inherently unsafe: the relationship between data objects and locks is entirely by convention
- lacks the composition property: difficult to combine 2 concurrent operations using locks into a single larger operation
- pessimistic: in low-contention situations, their protection is only needed < 1% of the time!
    - recall the overheads of acquiring a lock (even if free)
Transactions - Basic Ideas
- a transaction is a series of reads and writes (to global data) made by a single thread which execute atomically, i.e.:
    - if it succeeds, no other thread sees any of the writes until it completes
    - if it fails, no change to (global) data can be observed
- it must also execute with consistency:
    - the thread does not see any (interfering) writes from other threads
- it has the following steps: begin, reads and writes to (transactional) variables, end
- if consistency is met, the end results in a commit; otherwise it aborts
- a condition that meets this is serializability:
    - if transactions cannot execute concurrently, consistency is assured
    - note: this assumes that (transactional) variables are only written (in some cases, read) by transactions
- can also achieve this if no other thread accesses the affected variables during the transaction
Transactions - Programming Model
e.g. to add an item newItem to a queue with tail pointer tail:
atomic {
    struct node *newTail = malloc(sizeof(struct node));
    newTail->item = newItem;
    newTail->next = tail;
    // tail must not be changed between these 2 points!
    tail = newTail;
}
- here, tail is the single transactional variable that is both read and written
- the atomic block is executed atomically (it can be assumed that the underlying system will re-execute it until it can commit; but what about the malloc?)
e.g. code from Unstructured Adaptive benchmark of NAS suite
for (ije1 = 0; ije1 < nnje; ije1++) {
    ...
    for (ije2 = 1; ije2 < nnje; ije2++)
        for (col = 1-shift; col < lx1-shift; col++) {
            ig = idmo[v_end[ije2], col, ije1, ije2, iface, ie];
            #pragma omp atomic
            tmor[ig] += temp[v_end[ije2], col, ije1] * 0.5;
            ...
        }
}
Limited Compare-and-Swap Implementation
the atomic compare-and-swap operation may be expressed as:
typedef unsigned long uint64;

atomic uint64 cas(uint64 *x, uint64 x_old, uint64 x_new) {
    uint64 x_now = *x;
    if (x_now == x_old) *x = x_new;
    return x_now;
}
(the first 2 lines are implemented by a single cas instruction)
we can implement the (essentially atomic part of the) code for queue insertion by:
do {
    newTail->next = tail;
} while (cas(&tail, newTail->next, newTail) != newTail->next);
and for the Unstructured Adaptive benchmark
do {
    x_old = tmor[ig];
    x_new = temp[v_end[ije2], col, ije1] * 0.5 + x_old;
} while (cas(&tmor[ig], x_old, x_new) != x_old);
note: recent OpenMP runtime systems implemented atomic sections using a global lock (why does this perform abysmally?)
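The cas retry loops above map directly onto C++11 compare_exchange_weak, which reloads the expected value on failure; a sketch (atomic_add and tmor_ig are illustrative stand-ins for the benchmark's update):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// CAS retry loop: read the old value, compute the new one, and retry if
// another thread changed the location in between (the same pattern as
// the slide's do/while around cas()).
std::atomic<long> tmor_ig{0};   // stands in for tmor[ig] from the slide

void atomic_add(long contrib) {
    long x_old = tmor_ig.load(std::memory_order_relaxed);
    long x_new;
    do {
        x_new = x_old + contrib;
        // compare_exchange_weak updates x_old with the current value on
        // failure, so each retry recomputes x_new from fresh data.
    } while (!tmor_ig.compare_exchange_weak(x_old, x_new,
                                            std::memory_order_acq_rel,
                                            std::memory_order_relaxed));
}
```

Under contention each thread may retry, but every contribution is applied exactly once.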
Hardware TM: Basis on Coherency Protocols
- idea: tentative values can be held in cache lines (assuming each processor has a separate cache)
- review the standard MESI cache coherency protocol
- cache line states:
    - Modified: only this cache holds the data; it is 'dirty'
    - Exclusive: as before, but not 'dirty'
    - Shared: > 1 caches hold the data, not 'dirty'
    - Invalid: this line holds no data
- key transitions:
    (a) A: load x (miss; then line in E state)
    (b) B: load x (miss; then lines are in S state (A and B))
    (c) B: store x (hit; then lines in M state (B) and I state (A))
    (d) A: load x (miss; then B copies back line to memory and lines in S state (A and B))
when a line is evicted (due to a conflict with a new address needing to be loaded), the data is copied-back to memory
Hardware Transactional Memory (Bounded)
- add a transactional bit (T) to the state of each cache line
    - T=1 if the entry was placed in the cache on behalf of a transaction (otherwise 0)
- simply extend the MESI protocol as follows:
    - if a line with T=1 is invalidated or evicted, abort (with no copy-back)
        - note: can instead record the fact that the transaction will later abort
    - an abort causes lines with T=1 to be invalidated (no copy-back, even if dirty)
    - a commit requires all dirty lines with T=1 to be written atomically to memory (T bit is cleared)
- requires transactional variants of the load and store instructions, plus an (attempt to) commit instruction (which clears all T bits)
- why this works:
    - if a line with T=1 is invalidated, we have a R-W or W-W conflict
    - if a line with T=1 is evicted, the transaction cannot complete
- note: the size of a transaction is limited by the cache size (in practice, smaller)
- upon abort, hardware needs to indicate whether to retry (synchronization conflict) or not (error or SW-detected resource exhaustion)
HTM in the Intel Haswell/Broadwell
- has new instructions to begin, end and abort a transaction:
    - xbegin addr: addr is the address of the retry path, in case of abort
    - xend: all loads/stores since xbegin are transactional; commit is attempted here
    - xabort: explicitly abort
- no guarantee of progress! (indefinitely repeated aborts)
- e.g. xbegin; load x; load y; store z;
[Diagram: three T/flag/tag/state tables showing the cache during the transaction (T=1 on the lines for x, y and z; the line for v is untouched), after an abort (all T=1 lines invalidated, v still Modified), and after a commit (T bits cleared, dirty T=1 lines written back)]

- cache state after the above:
    - if x's line becomes invalidated (remote core writes to x) before xend: abort
    - if v's line becomes invalidated (remote core writes to v) before xend: commit
Uses of HTM
- may be used to create custom atomic operations
- e.g. the atomic double compare-and-swap operation may be expressed as:
typedef unsigned long uint64;

int dcas(uint64 *x, uint64 x_old, uint64 x_new,
         uint64 *y, uint64 y_old, uint64 y_new) {
    int swapped = 0;
    xbegin abortaddr;
    uint64 x_now = *x, y_now = *y;
    if (x_now == x_old && y_now == y_old)
        *x = x_new, *y = y_new, swapped = 1;
    xend;
    return swapped;     // success
abortaddr:
    return 0;           // failure
}
note: x86 provides a cmpxchg16b instruction which can do the above if x and y are in consecutive memory locations
Summary: Transactional Memory
- Atomic construct: aims to increase simplicity of synchronization without significantly sacrificing performance
- Implementation: many variants that differ in versioning policy (eager/lazy), conflict detection (pessimistic/optimistic), and detection granularity
- Hardware transactional memory: versioned data kept in caches, conflict detection as part of the coherence protocol
Hands-on Exercise: Lock and Barrier Performance
Objective: Understand performance of atomic operations in various implementations of locks and barriers
Summary
Topics covered today - Parallel Performance Optimization:
- Non-uniform memory access hardware
- Profiling using VTune
- Intel TBB
- Lock free data structures and transactional memory
Tomorrow - Parallel Software Design!