
slide-1
SLIDE 1

Parallel Performance Optimization ASD Shared Memory HPC Workshop

Computer Systems Group, ANU

Research School of Computer Science Australian National University Canberra, Australia

February 13, 2020

slide-2
SLIDE 2

Schedule - Day 4

Computer Systems (ANU) Parallel Perf Optimization Feb 13, 2020 2 / 63

slide-3
SLIDE 3

NUMA systems

Outline

1. NUMA systems
2. Profiling Codes
3. Intel TBB
4. Lock Free Synchronization
5. Transactional Memory

slide-4
SLIDE 4

NUMA systems

Non-Uniform Memory Access: Basic Ideas

NUMA means there is some hierarchy in the main memory system’s structure:
all memory is available to the programmer (single address space), but some memory takes longer to access than other memory
modular memory systems with interconnects: UMA vs. NUMA

[Diagram: UMA (all cores reach all memory modules through a shared interconnect) vs. NUMA (each pair of cores has its own local memory; remote memory is reached over the interconnect)]

On a NUMA system, there are two effects that may be important:

thread affinity: once a thread is assigned to a core, ensure that it stays there
NUMA affinity: memory used by a process must only be allocated on the socket of the core that it is bound to


slide-5
SLIDE 5

NUMA systems

Examples of NUMA Configurations

Intel Xeon5500, with QPI

(courtesy qdpma.com)

4-socket Opteron; note extra NUMA level within a socket!

(courtesy qdpma.com)

slide-6
SLIDE 6

NUMA systems

Examples of NUMA Configurations (II)

8-way ‘glueless’ system (processors are directly connected)

(courtesy qdpma.com)

slide-7
SLIDE 7

NUMA systems

Case Study: Why NUMA Matters

MetUM global atmosphere model, 1024 × 769 × 70 grid, on an NCI Intel X5570/InfiniBand supercomputer (2011): effect of process and NUMA affinity on scaling; note the differing values for t16!

On the X5570, local:remote memory access latency is 65:105 cycles; this indicates a significant amount of L3$ misses


slide-8
SLIDE 8

NUMA systems

Case Study: Why NUMA Matters (II)

Time breakdown with no NUMA affinity, 1024 processes (dual-socket nodes, 4 cores per socket). Note: spikes in compute times were always in groups of 4 processes (e.g. socket 0)


slide-9
SLIDE 9

NUMA systems

Process and Thread Affinity

in general, the OS is free to decide which core (virtual CPU) a process or thread runs on
we can restrict which CPUs it will run on by specifying an affinity mask of the CPU ids it may be scheduled to run on
this has 2 benefits (assuming other active processes/threads are excluded from the specified CPUs):

ensure maximum speed for that process/thread
minimize cache / TLB pollution caused by context switches

e.g. on an 8-CPU system, create 8 threads to run on different CPUs:

    pthread_t threadHandle[8];
    cpu_set_t cpu;
    for (int i = 0; i < 8; i++) {
        pthread_create(&threadHandle[i], NULL, threadFunc, NULL);
        CPU_ZERO(&cpu);
        CPU_SET(i, &cpu);
        pthread_setaffinity_np(threadHandle[i], sizeof(cpu_set_t), &cpu);
    }

for a process, it is similar:

    sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpu);

slide-10
SLIDE 10

NUMA systems

numactl: Controlling NUMA from the Shell

On a NUMA system, we generally wish to bind a process and its memory image to a particular ‘node’ (= NUMA domain)
the NUMA API provides a way of controlling policies of memory allocation on a per-node or per-process basis

policies are default, bind, interleave, preferred

run a program on a CPU on node 0, with all memory allocated on node 0:

    numactl --membind=0 --cpunodebind=0 ./prog -args

similar, but force it to be run on CPU 0 (which must be on node 0):

    numactl --physcpubind=0 --membind=0 ./prog -args

optimize bandwidth for a crucial program by utilizing multiple memory controllers (at the expense of other processes!):

    numactl --interleave=all ./memhog ...

numactl --hardware shows the available nodes etc.

slide-11
SLIDE 11

NUMA systems

LibNUMA: Controlling NUMA from within a Program

with libnuma, we can similarly change (the current thread of) an executing process’s node affinity and memory allocation policy

run from now on on a CPU on node 0, with all memory allocated on node 0:

    numa_run_on_node(0);
    numa_set_preferred(0);

or, equivalently:

    nodemask_t mask;
    nodemask_zero(&mask);
    nodemask_set(&mask, 0);
    numa_bind(&mask);

to allow it to run on all nodes again:

    numa_run_on_node_mask(&numa_all_nodes);

execute a memory hogging function, with all its (new) memory fully interleaved, and then restore the previous state:

    nodemask_t prevmask = numa_get_interleave_mask();
    numa_set_interleave_mask(&numa_all_nodes);
    memhog(...);
    numa_set_interleave_mask(&prevmask);

slide-12
SLIDE 12

NUMA systems

Hands-on Exercise: NUMA Effects

Objective: Explore the effects of Non-Uniform Memory Access (NUMA), that is, the general benefit of ensuring a process and its memory are in the same NUMA domain


slide-13
SLIDE 13

Profiling Codes

Outline

1. NUMA systems
2. Profiling Codes
3. Intel TBB
4. Lock Free Synchronization
5. Transactional Memory

slide-14
SLIDE 14

Profiling Codes

Profiling: Basics

Profiling is the process of recording information during execution of a program to form an aggregate view of its dynamic behaviour

Compare with tracing, which records an ordered log of events that can be used to reconstruct dynamic behaviour

Used to understand program performance and find bottlenecks
At certain points in execution, record program state (instruction pointer, calling context, hardware performance counters, ...)
Sampling (recurrent event trigger) vs. instrumentation (probes at specific points in the program)
Real vs. simulated execution


slide-15
SLIDE 15

Profiling Codes

Sampling

[Diagram: sampling. A recurring interrupt (e.g. every 10 ms) records the program counter and hardware counters (cycles, cache misses, flops), accumulating them per function (Main, Asterix, Obelix) in a function table]

When an event trigger occurs, record the instruction pointer (+ call context) and performance counters: low overhead, but subject to sampling error

Source: EuroMPI’12: Introduction to Performance Engineering

slide-16
SLIDE 16

Profiling Codes

Instrumentation

[Diagram: instrumentation. Probe calls monitor("Obelix", "enter") and monitor("Obelix", "exit") injected at function entry and exit update per-function counters (e.g. cache misses) in a function table]

Inject ‘trampolines’ into source or binary code: accurate, but higher overhead

Source: EuroMPI’12: Introduction to Performance Engineering

slide-17
SLIDE 17

Profiling Codes

The 80/20 Rule & Life Cycle

Programs typically spend 80% of their time in 20% of the code
Programmers typically spend 20% of their effort to get 80% of the possible speedup
→ optimize for the common case

[Diagram: performance tuning life cycle. Coding, then a performance-analysis loop of measurement, analysis, ranking and refinement (program tuning), then production]

Source: EuroMPI’12: Introduction to Performance Engineering

slide-18
SLIDE 18

Profiling Codes

perf and VTune

perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences in Linux performance measurements and presents a simple command-line interface. perf is based on the perf_events interface exported by recent versions of the Linux kernel. Intel’s VTune is a commercial-grade profiling tool for complex applications, usable via the command line or a GUI.


slide-19
SLIDE 19

Profiling Codes

perf Reference Material

http://www.brendangregg.com/perf.html
Perf for User-Space Program Analysis
Perf Wiki


slide-20
SLIDE 20

Profiling Codes

VTune Reference Material

https://software.intel.com/en-us/intel-vtune-amplifier-xe
Documentation URL


slide-21
SLIDE 21

Profiling Codes

perf for Linux

perf is both a kernel syscall interface and a collection of tools to collect, analyze and present hardware performance counter data either via counting or sampling


slide-22
SLIDE 22

Profiling Codes

perf for Linux

% perf list
  branch-instructions OR branches   [Hardware event]
  branch-misses                     [Hardware event]
  bus-cycles                        [Hardware event]
  cache-misses                      [Hardware event]
  cache-references                  [Hardware event]
  cpu-cycles OR cycles              [Hardware event]
  instructions                      [Hardware event]
  ref-cycles                        [Hardware event]
  ...
  branch-load-misses                [Hardware cache event]
  branch-loads                      [Hardware cache event]
  dTLB-load-misses                  [Hardware cache event]
  dTLB-loads                        [Hardware cache event]
  dTLB-store-misses                 [Hardware cache event]
  dTLB-stores                       [Hardware cache event]
  iTLB-load-misses                  [Hardware cache event]
  iTLB-loads                        [Hardware cache event]

slide-23
SLIDE 23

Profiling Codes

VTune


slide-24
SLIDE 24

Profiling Codes

Hands-on Exercise: Perf and VTune

Objective:
Use perf to measure the performance of a matrix multiply code
Use VTune to both measure and improve code performance


slide-25
SLIDE 25

Intel TBB

Outline

1. NUMA systems
2. Profiling Codes
3. Intel TBB
4. Lock Free Synchronization
5. Transactional Memory

slide-26
SLIDE 26

Intel TBB

Intel Threading Building Blocks (TBB)

Template library extending C++ for parallelism using tasks
Focus on divide-and-conquer algorithms
Thread-safe data structures
Work-stealing scheduler
Efficient low-level atomic operations
Scalable memory allocation
Free software under GPLv2

Content adapted from https://software.intel.com/sites/default/files/IntelAcademic Parallel 08 TBB.pdf

slide-27
SLIDE 27

Intel TBB

Using TBB - Task Based Approach

TBB provides C++ constructs that allow you to express parallel solutions in terms of task objects
The task scheduler manages the thread pool, and avoids the common performance problems of programming with threads:

Oversubscription - one scheduler thread per hardware thread
Fair scheduling - non-preemptive unfair scheduling
High overhead - the programmer specifies tasks, not threads
Load imbalance - work-stealing balances load


slide-28
SLIDE 28

Intel TBB

Using Task based approach

Fibonacci calculation example: the function fibTBB calculates the nth Fibonacci number using a TBB task_group.

    int fibTBB(int n) {
        if (n < 10) {
            return fibSerial(n);
        } else {
            int x, y;
            tbb::task_group g;
            g.run([&]{ x = fibTBB(n-1); });  // spawn a task
            g.run([&]{ y = fibTBB(n-2); });  // spawn another task
            g.wait();                        // wait for both tasks to complete
            return x + y;
        }
    }

Content adapted from https://software.intel.com/sites/default/files/m/d/4/1/d/8/1-6-AppThr - Using Tasks Instead of Threads.pdf

slide-29
SLIDE 29

Intel TBB

Using Tasks

Developers express the logical parallelism with tasks
The runtime library schedules these tasks onto an internal pool of worker threads
Tasks are much lighter weight than threads, so it is possible to express parallelism at a much finer granularity
Apart from the task interface, TBB also provides high-level algorithms that implement some of the most common task patterns, such as parallel_invoke, parallel_for, parallel_reduce etc.


slide-30
SLIDE 30

Intel TBB

TBB Algorithms

parallel_for, parallel_for_each: load-balanced parallel execution of loop iterations, where iterations are independent
parallel_reduce: load-balanced parallel execution of independent loop iterations that perform a reduction (e.g. summation of array elements)
parallel_scan: load-balanced computation of a parallel prefix
parallel_do: load-balanced parallel execution of independent loop iterations, with the ability to add more work during its execution
parallel_sort: parallel sort
parallel_invoke: parallel execution of function objects or pointers to functions


slide-31
SLIDE 31

Intel TBB

parallel for

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>

    template<typename Range, typename Func>
    Func parallel_for(const Range& range, const Func& f
                      [, task_group_context& group]);

    template<typename Index, typename Func>
    Func parallel_for(Index first, Index last [, Index step],
                      const Func& f [, task_group_context& group]);

The template function parallel_for recursively divides the loop: it partitions the original range into subranges, and deals out subranges to worker threads in a way that:

balances load
uses the cache efficiently
scales


slide-32
SLIDE 32

Intel TBB

Loop Splitting

blocked_range<T> is a splittable type representing a 1D iteration space over type T
similarly blocked_range2d<T> for 2D block splitting
separate the loop body as an object or lambda expression:

    void ParallelDecrement(float* a, size_t n) {
        parallel_for(blocked_range<size_t>(0, n),
            [=](const blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    a[i]--;
            }
        );
    }

slide-33
SLIDE 33

Intel TBB

An Example using parallel for

Independent iterations and fixed/known bounds. Serial code:

    const int N = 100000;

    void change_array(float* array, int M) {
        for (int i = 0; i < M; i++) {
            array[i] *= 2;
        }
    }

    int main() {
        float A[N];
        initialize_array(A);
        change_array(A, N);
        return 0;
    }

slide-34
SLIDE 34

Intel TBB

An Example using parallel for

Using parallel_for:

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>

    using namespace tbb;

    void parallel_change_array(float* array, size_t M) {
        parallel_for(blocked_range<size_t>(0, M, IdealGrainSize),
            [=](const blocked_range<size_t>& r) -> void {
                for (size_t i = r.begin(); i != r.end(); i++)
                    array[i] *= 2;
            }
        );
    }

slide-35
SLIDE 35

Intel TBB

Generic Programming vs Lambda functions

Generic Programming:

    class ChangeArrayBody {
        float *array;
    public:
        ChangeArrayBody(float *a): array(a) {}
        void operator()(const blocked_range<size_t>& r) const {
            for (size_t i = r.begin(); i != r.end(); i++) {
                array[i] *= 2;
            }
        }
    };

    void parallel_change_array(float *array, size_t M) {
        parallel_for(blocked_range<size_t>(0, M, IdealGrainSize),
                     ChangeArrayBody(array));
    }

Lambda functions:

    void parallel_change_array(float *array, size_t M) {
        parallel_for(blocked_range<size_t>(0, M, IdealGrainSize),
            [=](const blocked_range<size_t>& r) -> void {
                for (size_t i = r.begin(); i != r.end(); i++)
                    array[i] *= 2;
            });
    }

slide-36
SLIDE 36

Intel TBB

Mutual Exclusion in TBB

Multiple tasks computing the minimum value in an array using parallel_for:

    int min = INT_MAX;  // shared global (declaration assumed; not shown on the slide)

    void ParallelMin(int* a, int n) {
        parallel_for(blocked_range<int>(0, n),
            [=](const blocked_range<int>& r) {
                for (int i = r.begin(); i != r.end(); ++i)
                    if (a[i] < min) min = a[i];  // data race on min!
            }
        );
    }

slide-37
SLIDE 37

Intel TBB

Mutex Flavours

TBB provides several flavours of mutex:

spin_mutex: non-scalable, unfair, fast
queuing_mutex: scalable, fair, slower
spin_rw_mutex, queuing_rw_mutex: as above, with reader locks
mutex and recursive_mutex: wrappers around the native implementation (e.g. Pthreads)

Avoid locks wherever possible


slide-38
SLIDE 38

Intel TBB

Mutual Exclusion in TBB

Scoped mutex inside critical section

    typedef spin_mutex ReductionMutex;
    ReductionMutex minMutex;

    void ParallelMin(int* a, int n) {
        parallel_for(blocked_range<int>(0, n),
            [=](const blocked_range<int>& r) {
                for (int i = r.begin(); i != r.end(); ++i) {
                    ReductionMutex::scoped_lock lock(minMutex);
                    if (a[i] < min) min = a[i];
                }
            }
        );
    }

slide-39
SLIDE 39

Intel TBB

parallel reduce in TBB

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_reduce.h>

    template<typename Range, typename Value,
             typename Func, typename ReductionFunc>
    Value parallel_reduce(const Range& range, const Value& identity,
                          const Func& func, const ReductionFunc& reductionFunc
                          [, partitioner [, task_group_context& group]]);

parallel_reduce partitions the original range into subranges, like parallel_for. The function func is applied to these subranges; each returned result is then merged with the others (or with identity if there is none) using the function reductionFunc.


slide-40
SLIDE 40

Intel TBB

parallel reduce in TBB - Serial Example

    #include <limits>

    // Find index of smallest element in a[0...n-1]
    size_t serialMinIndex(const float a[], size_t n) {
        float value_of_min = numeric_limits<float>::max();
        size_t index_of_min = 0;
        for (size_t i = 0; i < n; ++i) {
            float value = a[i];
            if (value < value_of_min) {
                value_of_min = value;
                index_of_min = i;
            }
        }
        return index_of_min;
    }

slide-41
SLIDE 41

Intel TBB

parallel reduce in TBB - Parallel Version

    #include <limits>
    #include <tbb/blocked_range.h>
    #include <tbb/parallel_reduce.h>

    size_t parallelMinIndex(const float a[], size_t n) {
        return parallel_reduce(
            blocked_range<size_t>(0, n, 10000),
            size_t(0),
            [=](blocked_range<size_t>& r, size_t index_of_min) -> size_t {
                float value_of_min = a[index_of_min];
                for (size_t i = r.begin(); i != r.end(); ++i) {
                    float value = a[i];
                    if (value < value_of_min) {
                        value_of_min = value;   // accumulate result
                        index_of_min = i;
                    }
                }
                return index_of_min;
            },
            [=](size_t i1, size_t i2) {
                return (a[i1] < a[i2]) ? i1 : i2;  // reduction operator
            }
        );
    }

slide-42
SLIDE 42

Intel TBB

Hands-on Exercise: Programming with TBB

Objective: Implement a parallel heat stencil application using TBB, and profile it using VTune


slide-43
SLIDE 43

Lock Free Synchronization

Outline

1. NUMA systems
2. Profiling Codes
3. Intel TBB
4. Lock Free Synchronization
5. Transactional Memory

slide-44
SLIDE 44

Lock Free Synchronization

‘Lock-free’ Data Structures: Motivation

consider the atomic test-and-set operation

    atomic int testAndSet(volatile int *Lock) {
        int lv = *Lock; *Lock = 1; return lv;
    }

synchronizes the whole memory system (down to LLC), costs ≈ 50 cycles, degrades memory access progress for all

not scalable: energy and time costs are O(N²) (N is the number of cores)

mutual exclusion via test-and-set can be modelled by:

    volatile int lock = 0; // 0..1; 1 = locked, 0 = unlocked

    // thread 0
    while (1) {
        // non-critical section 0
        ...
        while (testAndSet(&lock) != 0)
            { /* spin */ }
        // critical section 0
        ...
        lock = 0;
    }

    // thread 1
    while (1) {
        // non-critical section 1
        ...
        while (testAndSet(&lock) != 0)
            { /* spin */ }
        // critical section 1
        ...
        lock = 0;
    }

with p threads under heavy contention (time in the non-critical section becomes small), there will be O(p²) atomics!! Can we do better?


slide-45
SLIDE 45

Lock Free Synchronization

Improving Locked Access to Data Structures

only try the atomic operation when we see the lock is free: reduces the number of atomics to O(p)

    do {
        while (lock == 1) { /* spin */ }
    } while (testAndSet(&lock) != 0);

use adaptive spin-locks: if lock is not acquired after a certain time, the thread yields

Pthreads mutexes should do this; especially useful if p > N

use algorithms requiring a (small) constant number of atomic operations per access of the critical region

can be a general mutex algorithm, or data structure-specific

better still (possibly), use algorithms using no atomic operations

Lamport’s Bakery Algorithm: still has O(p²) costs, and assumes the order of updates made by one thread is seen by other threads


slide-46
SLIDE 46

Lock Free Synchronization

Atomic Operation-based Bakery Algorithm

analogous to protocol of buying seafood at Woolworths

uses:

    atomic int fetchAndIncrement(volatile int *v) {
        int old = *v; (*v)++; return old;
    }

initialize: ticketNumber = screenNumber = 0;
acquire: myTicket = fetchAndIncrement(&ticketNumber); wait until (myTicket == screenNumber)
release: screenNumber++; (does not need to be atomic)

performance: good if H/W has direct support of fetchAndIncrement(), else may require unbounded number of compare-and-swaps

problems under contention: a large number of cache line invalidations, due to all threads accessing the ‘scoreboard’ variable screenNumber. Can we restrict the number of threads accessing their ‘scoreboard’ to just 2?

MCS lock: Woolworths analogy: instead of watching a screen, the person just ahead tells you when they are served

requires atomic dual word copy on lock acquire and atomic compare-and-swap on release

CLH lock: requires only a single fetch-and-store atomic


slide-47
SLIDE 47

Lock Free Synchronization

Synchronization: Barriers

wait for all threads to arrive at the same point in the computation before proceeding (none may leave until all have arrived)
central barrier with p threads (initialize globals ctr to p and sense to 0; each thread initializes its own threadSense to 0):

    threadSense = !threadSense;
    if (fetchAndDecrement(&ctr) == 1)
        ctr = p, sense = !sense;  // last to reach toggles global sense
    wait until (sense == threadSense)

sense is required for repeated barriers

caution: deadlock occurs if 1 thread does not participate! problems:

most processors do not support atomic decrement of a memory location
not (very) scalable: p atomic decrements per barrier


slide-48
SLIDE 48

Lock Free Synchronization

Synchronization: Combining Tree Barrier

each pair of threads points to a leaf node in a tree

each node has a ctr (init. to 2), and a sense flag (init. to 0)

algorithm: begins on all leaf nodes (each thread has a threadSense=1 flag):

    if (fetchAndDecrement(&ctr) == 1)   // last to arrive at this node
        call algorithm on parent node
        ctr = 2, sense = !sense
    wait until (threadSense == sense)
    then, on leaving the barrier, threadSense = !threadSense

notes:

last thread to reach each node continues up the tree
the thread that reaches the root begins the ‘wakeup’ (reversing sense)
upon wakeup, a thread releases its siblings at each node along its path

performance: 2× the atomic operations, but can distribute the memory locations of node data (e.g. across different LLC banks)

atomics can be avoided by the scalable tree barrier! (simply replace ctr with 2 adjacent byte flags, one for each thread)


slide-49
SLIDE 49

Lock Free Synchronization

Single Reader, Single Writer Bounded Buffer

Strategy: simply put burden of condition synchronization (retry) on client

    class LFBoundedBuffer {
        int N, in, out, *buf;
    public:
        LFBoundedBuffer(int Nc) { N = Nc; in = out = 0; buf = new int[N]; }
        bool put(int v) {
            int inNxt = (in+1) % N;
            if (inNxt == out)        // buffer full
                return false;
            buf[in] = v; in = inNxt;
            return true;
        }
        bool get(int *v) {
            if (in == out)           // buffer empty
                return false;
            *v = buf[out]; out = (out+1) % N;
            return true;
        }
    };

Why does this work? The two threads do not update the same variable (consider N=2). We need to assume that updates to in/out do not overtake the corresponding add/remove on buf, as seen by the other thread. Unbounded variant: use a linked list; the reader advances a read pointer and leaves the writer to remove previously-read nodes


slide-50
SLIDE 50

Lock Free Synchronization

‘Lock-Free’ Stack

    class LFStack {
        int N, top, nPop, *buf;
    public:
        LFStack(int Nc) { N = Nc; nPop = top = 0; buf = new int[N]; }
        bool push(int v) {
            while (1) {
                int oldTop = top;
                if (oldTop == N)
                    return false;               // caller must try again
                if (dcas(&top, oldTop, oldTop+1,
                         &buf[oldTop], buf[oldTop], v))
                    return true;                // oldTop remained a valid stack top
            }
        }
        bool pop(int *v) {
            while (1) {
                int oldTop = top, oldNPop = nPop;
                if (oldTop == 0)
                    return false;               // caller must try again
                *v = buf[oldTop-1];
                if (dcas(&top, oldTop, oldTop-1,
                         &nPop, oldNPop, oldNPop+1))
                    return true;                // oldTop remained a valid stack top
            }
        }
    };
    // note: dcas(&top, oldTop, oldTop-1, v, *v, buf[oldTop]) won’t work!

dcas() is a ‘double’ compare-and-swap operation: in this case we use the op on the 1st word to ensure safety, and the 2nd to ensure the data is added / removed atomically


slide-51
SLIDE 51

Lock Free Synchronization

Summary: Lock Free Synchronization

Use fine-grained locking to reduce contention in operations on shared data structures
Non-blocking solutions avoid the overheads of locks, but still require appropriate memory fence operations
Lock-free design does not eliminate contention


slide-52
SLIDE 52

Transactional Memory

Outline

1. NUMA systems
2. Profiling Codes
3. Intel TBB
4. Lock Free Synchronization
5. Transactional Memory

slide-53
SLIDE 53

Transactional Memory

Transactional Memory - Motivations

Or: problems with lock-based solutions:

convoying: a thread holding a lock is de-scheduled, possibly holding up other threads waiting for the lock
priority inversion: a lower-priority thread is pre-empted while holding a lock needed by a higher-priority thread
deadlock: two threads attempting to acquire the same locks do so in different orders
inherently unsafe: the relationship between data objects and locks is entirely by convention
lacks the composition property: it is difficult to combine 2 concurrent operations using locks into a single larger operation
pessimistic: in low-contention situations, their protection is only needed < 1% of the time!
recall the overheads of acquiring a lock (even if free)


slide-54
SLIDE 54

Transactional Memory

Transactions - Basic Ideas

a transaction is a series of reads and writes (to global data) made by a single thread which execute atomically, i.e.:

if it succeeds, no other thread sees any of the writes until it completes
if it fails, no change to (global) data can be observed

it must also execute with consistency:

the thread does not see any (interfering) writes from other threads

it has the following steps: begin; reads and writes to (transactional) variables; end
if consistency is met, the end results in a commit; otherwise it aborts
a condition that meets this is serializability:

if transactions cannot execute concurrently, consistency is assured
note: this assumes that (transactional) variables are only written (in some cases, read) by transactions
consistency can also be achieved if no other thread accesses the affected variables during the transaction


slide-55
SLIDE 55

Transactional Memory

Transactions - Programming Model

e.g. to add an item newItem to a queue with tail pointer tail:

    atomic {
        struct node *newTail = malloc(sizeof(struct node));
        newTail->item = newItem;
        newTail->next = tail;
        // tail must not be changed between these 2 points!
        tail = newTail;
    }

here, tail is the single transactional variable that is both read and written
the atomic block is executed atomically; it can be assumed that the underlying system will re-execute it until it can commit (but what about the malloc?)

e.g. code from Unstructured Adaptive benchmark of NAS suite

    for (ije1 = 0; ije1 < nnje; ije1++) {
        ...
        for (ije2 = 1; ije2 < nnje; ije2++)
            for (col = 1-shift; col < lx1-shift; col++) {
                ig = idmo[v_end[ije2], col, ije1, ije2, iface, ie];
                #pragma omp atomic
                tmor[ig] += temp[v_end[ije2], col, ije1] * 0.5;
                ...
            }
    }

slide-56
SLIDE 56

Transactional Memory

Limited Compare-and-Swap Implementation

the atomic compare-and-swap operation may be expressed as:

    typedef unsigned long uint64;
    atomic uint64 cas(uint64 *x, uint64 x_old, uint64 x_new) {
        uint64 x_now = *x;
        if (x_now == x_old) *x = x_new;
        return (x_now);
    }

(the compare and conditional store are implemented by a single cas instruction)
we can implement the (essentially atomic part of the) code for queue insertion by:

    do {
        newTail->next = tail;
    } while (cas(&tail, newTail->next, newTail) != newTail->next);

and for the Unstructured Adaptive benchmark

    do {
        x_old = tmor[ig];
        x_new = temp[v_end[ije2], col, ije1] * 0.5 + x_old;
    } while (cas(&tmor[ig], x_old, x_new) != x_old);

note: recent OpenMP runtime systems implemented atomic sections using a global lock (why does this perform abysmally?)


slide-57
SLIDE 57

Transactional Memory

Hardware TM: Basis on Coherency Protocols

idea: tentative values can be held in cache lines (assuming each processor has a separate cache)
review the standard MESI cache coherency protocol; cache line states:

Modified: only this cache holds the data; it is ‘dirty’
Exclusive: as before, but not ‘dirty’
Shared: > 1 caches hold the data, not ‘dirty’
Invalid: this line holds no data

key transitions:

(a) A: load x (miss; line then in E state)
(b) B: load x (miss; lines then in S state (A and B))
(c) B: store x (hit; lines then in M state (B) and I state (A))
(d) A: load x (miss; B copies the line back to memory, and lines are in S state (A and B))

when a line is evicted (due to a conflict with a new address needing to be loaded), the data is copied-back to memory
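The four key transitions above can be checked with a toy model; a minimal sketch for a single address shared by two caches (all names are hypothetical, and eviction/copy-back is not modelled):

```c
#include <assert.h>

enum mesi { M, E, S, I };

/* state of one cache line in each of two caches (A = 0, B = 1) */
struct sys { enum mesi st[2]; };

static void load(struct sys *s, int c) {
    int other = 1 - c;
    if (s->st[c] != I) return;      /* hit: state unchanged              */
    if (s->st[other] == I) {
        s->st[c] = E;               /* miss, no other copy: Exclusive    */
    } else {
        /* other cache holds it (an M copy is copied back); both share */
        s->st[other] = S;
        s->st[c] = S;
    }
}

static void store(struct sys *s, int c) {
    s->st[c] = M;                   /* this copy becomes dirty           */
    s->st[1 - c] = I;               /* any other copy is invalidated     */
}
```

Running transitions (a)-(d) in order against this model reproduces exactly the states listed on the slide.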


SLIDE 58

Hardware Transactional Memory (Bounded)

add a transactional bit (T) for the state of each cache line

T=1 if entry is placed in cache on behalf of a transaction (o.w. 0)

simply extend the MESI protocol as follows:

if a line with T=1 is invalidated or evicted, abort (with no copy-back)
(note: can instead record the fact that the transaction will later abort)

abort causes lines with T=1 to be invalidated (no copy-back, even if dirty)
a commit requires all dirty lines with T=1 to be written atomically to memory (the T bit is cleared)
requires transactional variants of the load and store instructions, plus an (attempt to) commit instruction (which clears all T bits)
why this works:

if a line with T=1 is invalidated, we have a R-W or W-W conflict
if a line with T=1 is evicted, the transaction cannot complete

note: the size of a transaction is limited by the cache size (in practice, smaller)
upon abort, the hardware needs to indicate whether to retry (synchronization conflict) or not (error, or SW-detected resource exhaustion)
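The commit/abort rules can likewise be sketched on top of a per-line MESI state plus T bit (illustrative only; the function names are hypothetical):

```c
#include <assert.h>

enum mesi { M, E, S, I };

struct line { enum mesi st; int T; };   /* T = transactional bit */

/* a remote invalidation: if the line was transactional, the
   transaction must abort (R-W or W-W conflict); no copy-back */
static int invalidate(struct line *l) {
    int must_abort = l->T;
    l->st = I;
    l->T = 0;
    return must_abort;
}

/* abort: discard tentative (T=1) data; other lines are untouched */
static void abort_tx(struct line *l) {
    if (l->T) { l->st = I; l->T = 0; }
}

/* commit: dirty transactional lines are written atomically to
   memory, so they become clean (E); all T bits are cleared */
static void commit(struct line *l) {
    if (l->T && l->st == M) l->st = E;
    l->T = 0;
}
```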


SLIDE 59

HTM in the Intel Haswell/Broadwell

has new instructions to begin, end and abort a transaction

xbegin addr: addr is the address of the abort handler (typically a retry loop)
xend: marks the end of the transaction; all loads/stores in between are transactional
xabort: explicitly aborts the current transaction

no guarantee of progress! (a transaction may abort indefinitely often)
e.g. xbegin; load x; load y; store z;

cache state after the above (left: during the transaction; middle: after an abort; right: after a commit):

T flag tag      T flag tag      T flag tag
  S    u          S    u          S    u
1 E    x          I    x          E    x
1 S    y          I    y          S    y
  M    v          M    v          M    v
1 M    z          I    z          E    z
  I               I               I

if x’s line becomes invalidated (remote core writes to x) before xend: abort
if v’s line becomes invalidated (remote core writes to v) before xend: the transaction can still commit
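Because repeated aborts are possible, real code wraps xbegin in a bounded retry loop with a lock-based fallback path; a minimal sketch of that pattern (`try_xbegin` is a hypothetical stand-in for `_xbegin()` that here always reports an abort, so the sketch runs on any machine and exercises the fallback):

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_flag fallback_lock = ATOMIC_FLAG_INIT;
static int shared_counter = 0;

/* hypothetical stand-in for _xbegin(): returns 1 if the transaction
   started, 0 on abort; always aborts in this sketch */
static int try_xbegin(void) { return 0; }

static void increment(void) {
    for (int attempt = 0; attempt < 3; attempt++) {
        if (try_xbegin()) {
            shared_counter++;   /* transactional under real HTM */
            /* _xend() would go here */
            return;
        }
        /* aborted: retry a bounded number of times */
    }
    /* fallback: guaranteed progress via an ordinary spinlock */
    while (atomic_flag_test_and_set(&fallback_lock))
        ;
    shared_counter++;
    atomic_flag_clear(&fallback_lock);
}
```

Under real RTM the transaction must also read the fallback lock inside the transactional region, so that a concurrent lock holder causes transactions to abort rather than race with it.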


SLIDE 60

Uses of HTM

may be used to create custom atomic operations; e.g. the atomic double compare-and-swap operation may be expressed as:

typedef unsigned long uint64;

int dcas(uint64 *x, uint64 x_old, uint64 x_new,
         uint64 *y, uint64 y_old, uint64 y_new) {
    int swapped = 0;
    xbegin abortaddr;
    uint64 x_now = *x, y_now = *y;
    if (x_now == x_old && y_now == y_old)
        *x = x_new, *y = y_new, swapped = 1;
    xend;
    return (swapped);  // success
abortaddr:
    return (0);        // failure
}

note: x86 provides a cmpxchg16b instruction which can do the above if x and y occupy consecutive (16-byte-aligned) memory locations
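Without HTM or cmpxchg16b, dcas can be emulated with an ordinary lock; a minimal C11 sketch (the spinlock makes it correct but not lock-free; names are illustrative):

```c
#include <assert.h>
#include <stdatomic.h>

typedef unsigned long uint64;

static atomic_flag dcas_lock = ATOMIC_FLAG_INIT;

/* swap both words only if both still hold their expected old values */
static int dcas(uint64 *x, uint64 x_old, uint64 x_new,
                uint64 *y, uint64 y_old, uint64 y_new) {
    int swapped = 0;
    while (atomic_flag_test_and_set(&dcas_lock))
        ;                               /* acquire spinlock */
    if (*x == x_old && *y == y_old) {
        *x = x_new;
        *y = y_new;
        swapped = 1;
    }
    atomic_flag_clear(&dcas_lock);      /* release spinlock */
    return swapped;
}
```

The lock serializes all dcas callers, which is exactly the global-lock cost that HTM (or cmpxchg16b) is meant to avoid.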


SLIDE 61

Summary: Transactional Memory

Atomic construct: aims to increase the simplicity of synchronization without significantly sacrificing performance
Implementation: many variants, differing in versioning policy (eager/lazy), conflict detection (pessimistic/optimistic), and detection granularity
Hardware transactional memory: versioned data kept in caches, conflict detection as part of the coherence protocol


SLIDE 62

Hands-on Exercise: Lock and Barrier Performance

Objective: Understand performance of atomic operations in various implementations of locks and barriers


SLIDE 63

Summary

Topics covered today - Parallel Performance Optimization:

Non-uniform memory access hardware
Profiling using VTune
Intel TBB
Lock-free data structures and transactional memory

Tomorrow - Parallel Software Design!
