SYNCHRONIZATION IS BAD, BUT IF YOU MUST… (S9329)
Olivier Giroux, NVIDIA


SLIDE 1

SYNCHRONIZATION IS BAD, BUT IF YOU MUST… (S9329)

Olivier Giroux, Distinguished Architect, ISO C++ Chair of Concurrency & Parallelism.

SLIDE 2

My coordinates: ogiroux@nvidia.com

[Diagram: the communities this work connects: WG21, the memory model community, ISO C++ users, NVIDIA GPU architects and engineers.]

SLIDE 3

WHAT THIS TALK IS ABOUT:

  • Not: cudaDeviceSynchronize()
  • Not: __syncthreads()
  • Not: __shfl_sync()
  • ✅ Using atomics to do blocking synchronization.

SLIDE 4

PSA: DON’T RUN SERIAL CODE IN THREADS

[Timeline diagram: threads T0 to T6, almost all sitting in "Blocked" states ("SIGH!!", "Dammit", "Still blocked") while one runs.]

All blocked and no play makes T6 a dull thread…

SLIDE 5

PSA: RARE CONTENTION IS FINE

[Timeline diagram: threads T0 to T6, each blocked only occasionally.]

SLIDE 6

UNCONTENDED EXCHANGE LOCK

struct mutex {                                  // suspend atomic<> disbelief for now
    __host__ __device__ void lock() {
        while(1 == l.exchange(1, memory_order_acquire))
            ;
    }
    __host__ __device__ void unlock() {
        l.store(0, memory_order_release);
    }
    atomic<int> l = ATOMIC_VAR_INIT(0);
};

[Chart: critical sections per second (1e6 to 1e9) vs. thread occupancy (1 to 2048), V100 vs. CPU.]

🎊 Thanks for attending my talk. 🎊

Awesome.
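For context, here is a minimal usage sketch of my own (not from the deck) of how a kernel might exercise this lock; the kernel name exercise_lock and its parameters are illustrative, and it assumes the mutex above plus a cuda::std-style atomic<int>.

// Hypothetical usage sketch: a short critical section guarded by the exchange lock.
__global__ void exercise_lock(mutex* m, int* counter, int iterations) {
    for(int i = 0; i < iterations; ++i) {
        m->lock();        // blocking acquire: spin on exchange until we observe 0
        ++(*counter);     // short critical section
        m->unlock();      // release store publishes the update to the next owner
    }
}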

SLIDE 7

Deadlock. 🐱🕴 Deadlock. Deadlock. Deadlock.

SLIDE 8

SIMT ATOMIC CONCERN SCALE:

  • 👷 No atomics
  • 😂 Atomic result ignored
  • 🙃 Atomic result feeds arithmetic
  • 🤩 Atomic result feeds branch, outside loop
  • 😭 Atomic result feeds branch, inside loop
  • 😲 Atomic result feeds branch, closes loop
  • 😏 Atomic result feeds branch, closes loop, Volta+

(A sketch of the riskiest pattern follows.)
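As an illustration of my own (not from the slides), the "feeds branch, closes loop" row is the shape of every spin lock: the value returned by an atomic decides whether the loop exits, so one thread's exit depends on another thread's store.

// Illustrative sketch of "atomic result feeds branch, closes loop" (the spin-lock shape).
// Pre-Volta SIMT can starve the lock holder if it shares a warp with the spinning thread;
// Volta's independent thread scheduling provides the forward progress this loop relies on.
__device__ void spin_acquire(atomic<int>& l) {
    while(1 == l.exchange(1, memory_order_acquire)) {   // atomic result...
        /* spin */                                       // ...feeds the branch that closes this loop
    }
}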

SLIDE 9

SIMT FAMILY HISTORY

  • 1966: SIMD. Time zero for this family tree.
  • 1984: Pixar CHAP. Scalar channel programs.
  • 2007: Tesla SIMT. Scalar thread programs. Forward-progress = Nope ☹
  • 2017: Volta SIMT. Scalar thread programs. Forward-progress = YES! 🤙

Source: Wikipedia, SIGGRAPH proceedings, IEEE Micro.

SLIDE 10

APPLICABILITY

SLIDE 11

SYNCHRONIZATION DECISION CHECKLIST

PROs:

  • 1. Algorithmic gains.
  • 2. Latency hiding.
  • 3. Throughput is high.

CONs:

  • 1. Serialization is bad.
  • 2. Critical path / Amdahl’s law.
  • 3. Latency is high.

TL;DR: Sometimes, it’s a win.

SLIDE 12

APP #1: GPU-RESIDENT METHODS

Keep local state in registers & shared memory, with synchronization: 20x faster for RNN.

[Diagram: repeated Grid0<<<>>>, Grid1<<<>>> launches cause state invalidation; a cooperative grid with a global barrier keeps the state resident.]

See Greg Diamos’ GTC 2016 talk for more.
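A minimal sketch (my addition, not from the deck) of the cooperative-grid pattern above, using CUDA cooperative groups; the kernel name rnn_steps and its arguments are illustrative.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// One resident kernel iterates time steps, keeping state in registers/shared memory,
// and uses a grid-wide barrier instead of relaunching a kernel per step.
__global__ void rnn_steps(float* state, int steps) {
    cg::grid_group grid = cg::this_grid();
    for(int t = 0; t < steps; ++t) {
        // ... compute this time step; local state stays resident ...
        grid.sync();                      // global barrier across the whole grid
    }
}

// Host side must use the cooperative launch API so grid.sync() is valid:
// void* args[] = { &state, &steps };
// cudaLaunchCooperativeKernel((void*)rnn_steps, gridDim, blockDim, args);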

SLIDE 13

APP #2: LOCK-FREE IS NOT ALWAYS FASTER

// *continue* to suspend atomic<> disbelief for now
__host__ __device__
bool lock_free_writer_version(atomic<int>& a, atomic<int>& b) {
    int expected = -1;
    if(a.compare_exchange_strong(expected, 1, memory_order_relaxed))
        b.store(1, memory_order_relaxed);     // exposed dependent latency: the store waits on the CAS
    return expected == -1;
}

// This version is a ~60% speedup at GPU application level, despite progress hazards.
__host__ __device__
bool starvation_free_writer_version(atomic<int>& a, atomic<int>& b) {
    int expected_a = -1, expected_b = -1;
    bool success_a = a.compare_exchange_strong(expected_a, 1, memory_order_relaxed),
         success_b = b.compare_exchange_strong(expected_b, 1, memory_order_relaxed);  // the two atomics overlap
    if(success_a)                     // Note: we almost always succeed at both.
        while(!success_b)             // <-- This loop makes this a deadlock-free algorithm.
            success_b = b.compare_exchange_strong(expected_b = -1, 1, memory_order_relaxed);
    else if(success_b)
        b.store(-1, memory_order_relaxed);
    return expected_a == -1;
}

The first version has exposed dependent latency; the second overlaps the two atomics. The rarely-taken loop moves this algorithm into a different category of forward progress.

SLIDE 14

APP #3: CONCURRENT DATA STRUCTURES

Even if mutexes hide in every node, GPUs can build tree structures fast.

[Chart: tree construction, multi-threading (CPU) vs. acceleration (RTX 2070).]

For more, see my CppCon 2018 talk on YouTube, and the ‘Parallel Forall’ blog post.

SLIDE 15

PRE-REQUISITES

SLIDE 16

PR #1: FORWARD-PROGRESS

Concurrent algorithm taxonomy (Maurice Herlihy and Nir Shavit. 2011. On the nature of progress. In Proceedings of the 15th International Conference on Principles of Distributed Systems, OPODIS'11):

  • What is guaranteed: every thread succeeds, or only some thread succeeds.
  • What the scheduler must provide: nothing (any thread scheduler), threads eventually run isolated, or critical sections eventually complete.

App #2 above sits in this taxonomy. Compute_6x → Compute_7x: Volta adds the forward progress that the stronger guarantees rely on.
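To make the taxonomy concrete, here is a small illustration of my own (not from the deck): a lock-free retry loop needs nothing from the scheduler, while a blocking spin needs the current owner to eventually run and finish its critical section.

#include <atomic>
using std::atomic; using std::memory_order_relaxed;
using std::memory_order_acquire; using std::memory_order_release;

// Lock-free: some thread always succeeds, under any thread scheduler.
void increment_lock_free(atomic<int>& counter) {
    int old = counter.load(memory_order_relaxed);
    while(!counter.compare_exchange_weak(old, old + 1, memory_order_relaxed))
        ;   // a failed CAS means some other thread succeeded
}

// Blocking (deadlock-free at best): progress here depends on the current owner
// being scheduled so that its critical section eventually completes.
void increment_blocking(atomic<int>& flag, int& counter) {
    while(1 == flag.exchange(1, memory_order_acquire))
        ;   // spin until the owner releases
    ++counter;
    flag.store(0, memory_order_release);
}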

SLIDE 17

PR #2: MEMORY CONSISTENCY

Mapping ISO C++ 11 atomics onto Volta+ (columns: ISO C++ 11 · classic CUDA C++ with inline PTX, CUDA 9.0-10.2 · the same ISO signatures natively, CUDA 10.3):

  • atomic<int>::load(memory_order_seq_cst)        → fence.sc.sys; ld.acquire.sys.b32
  • atomic<int>::load(memory_order_acquire)        → ld.acquire.sys.b32
  • atomic<int>::load(memory_order_relaxed)        → ld.relaxed.sys.b32, or: x = *(volatile int*)ptr;
  • atomic<int>::store(int, memory_order_seq_cst)  → fence.sc.sys; st.relaxed.sys.b32
  • atomic<int>::store(int, memory_order_release)  → st.release.sys.b32
  • atomic<int>::store(int, memory_order_relaxed)  → st.relaxed.sys.b32, or: *(volatile int*)ptr = x;
  • atomic<int>::exchange(int, memory_order_seq_cst) → fence.sc.sys; atom.exch.acquire.sys.b32
  • atomic<int>::exchange(int, memory_order_acq_rel) → atom.exch.acq_rel.sys.b32
  • atomic<int>::exchange(int, memory_order_release) → atom.exch.release.sys.b32
  • atomic<int>::exchange(int, memory_order_acquire) → atom.exch.acquire.sys.b32
  • atomic<int>::exchange(int, memory_order_relaxed) → atom.exch.relaxed.sys.b32, or: y = atomicExch_system(ptr, x);

And so on...

Classic CUDA C++: write the asm yourself (see PTX 6 chapter 8 for the asm). 🎊 The native atomic<T> column arrives later this year! 🎊

Our ASPLOS 2019 paper: https://github.com/NVlabs/ptxmemorymodel.
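As a sketch of the "classic CUDA C++" column, here is a wrapper of my own, assuming only the PTX shown in the table; the function name is illustrative.

// Acquire load at system scope, written the pre-atomic<T> way with inline PTX.
// Requires Volta+ (sm_70) for the .acquire / .sys qualifiers.
__device__ int load_acquire_sys(const int* ptr) {
    int value;
    asm volatile("ld.acquire.sys.b32 %0, [%1];"
                 : "=r"(value)
                 : "l"(ptr)
                 : "memory");
    return value;
}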

SLIDE 18

PR #3: TRUE SHARING

Platform / allocator vs. sharing support (load/store sharing, atomics at low contention, atomics at high contention):

  • Any ARM / Windows / Mac, or unmanaged memory: Nope. Not at all.
  • x86 Linux (CPU/GPU), managed memory: Yes. Technically… but no.
  • x86 Linux (GPU/GPU), managed memory: YES! TRY IT!
  • POWER Linux (all pairs), managed memory

  • Concurrent data sharing between CPU and GPU is a new possibility.
  • Real usefulness has some more conditions.
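A tiny sketch of what true sharing means in practice (my illustration, not from the deck): a managed allocation touched concurrently by a system-scope GPU atomic and a CPU atomic. It is only meaningful on the platform rows marked yes above.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void gpu_bump(int* counter, int n) {
    for(int i = 0; i < n; ++i)
        atomicAdd_system(counter, 1);      // .sys scope: visible to the CPU, not just the GPU
}

int main() {
    int* counter = nullptr;
    cudaMallocManaged(&counter, sizeof(int));   // the managed allocator is the pre-requisite
    *counter = 0;
    gpu_bump<<<1, 64>>>(counter, 1000);
    __sync_fetch_and_add(counter, 1);           // the CPU shares the same location concurrently
    cudaDeviceSynchronize();
    printf("counter = %d\n", *counter);
    cudaFree(counter);
    return 0;
}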

SLIDE 19

PRELIMINARIES

SLIDE 20

CONTENTION IS THE ISSUE, DIFFERENTLY.

// 'a' is a shared atomic<int> counter defined elsewhere; each thread takes its turn
// advancing it from my_thread to my_thread + 1, striding by total_threads.
__host__ __device__ void test(int my_thread, int total_threads, int final_value) {
    for(int old; my_thread < final_value; my_thread += total_threads)
        while(!a.compare_exchange_weak(old = my_thread, my_thread + 1, memory_order_relaxed))
            ;
}

[Chart: latency (seconds, 1e-8 to 1e-5) vs. contending threads (1 to 2048), for V100, POWER, and x86.]

  • BW = 1/Lat_NUMA is a punishing depressor of CPU perf.
  • The bathtub curve is due to the statistical likelihood of finding a peer in the pipeline.
  • Little’s Law finally kicks in.

SLIDE 21

CONTENDING PROCESSORS ARE CRUSHED…

[Charts: latency (seconds, log scale) vs. threads (GPU x CPU); both V100 and x86 climb to roughly ½ millisecond under heavy contention. "Crushed?"]

SLIDE 22

…UNLESS THE PROCESSORS ARE NVLINK’ED.

[Charts: latency (seconds) vs. threads (GPU x CPU); left panel: x86 + V100 (PCIE), right panel: POWER + V100 (NVLINK).]

SLIDE 23

ALL OF THE FOLLOWING SLIDES ARE NVLINK’ED.

[Chart: latency (seconds) vs. threads (GPU x CPU) on the NVLINK system.]

And not log scale, because it’s legible in linear scale now. Thanks.

SLIDE 24

CONTENDED MUTEXES

SLIDE 25

CONTENDED MUTEXES

AS AN EXERCISE TO THINK ABOUT THROUGHPUT AND FAIRNESS

SLIDE 26

CONTENDED EXCHANGE LOCK

struct mutex {
    __host__ __device__ void lock() {
        while(1 == l.exchange(1, memory_order_acquire))
            ;
    }
    __host__ __device__ void unlock() {
        l.store(0, memory_order_release);
    }
    atomic<int> l = ATOMIC_VAR_INIT(0);
};

[Chart: latency (seconds) vs. threads (GPU x CPU) for the same exchange lock, now contended.]

🎊 Stay. Keep attending my talk. 🎊

Not awesome.

SLIDE 27

CONTENDED EXCHANGE LOCK

[Chart: latency (seconds) vs. threads (GPU x CPU).]

Heavy system pressure:

  • A lot of requests.
  • Each request is slow.

SLIDE 28

BACKOFF: LESS PRESSURE VIA FORECASTING

  • K bounds forecast relative error (orange line): Latency_response > K_delay * Latency_impulse.
  • Pick an arbitrary K_delay; say 1.5 for 50% error.
  • Some benefit to a stochastic choice, to avoid coupling.
  • The ceiling trades bandwidth & maximum error: t_polling / (lat_loaded + lat_backoff) > BW_polling.
  • Pick an arbitrary BW_polling; say 0.5 * BW_contended.
  • The floor protects the fast corner (green box): Latency_response > Latency_floor.
  • Minimum CPU sleep (Linux) is ~50,000 ns.
  • Minimum sleep on V100 is ~0 ns.

[Impulse-response diagram (seconds vs. seconds, 1e-8 to 1e-4 on both axes): instantaneous-response line, with the K, ceiling, floor (CPU), and floor (GPU) bounds marked.]

SLIDE 29

CONTENDED EXCHANGE LOCK + BACKOFF

__host__ __device__ void lock() {
    uint32_t history = 1<<8;                        // 256ns   <-- FLOOR
    while(1 == l.exchange(1, memory_order_acquire)) {
        uint32_t delay = history >> 1;              // 50%     <-- K
#ifdef __CUDA_ARCH__
        __nanosleep(delay);
#else
        if(delay > (1<<15))                         // 32us
            std::this_thread::sleep_for(
                std::chrono::nanoseconds(delay));
        else {
            std::this_thread::yield();
            delay = 0;
        }
#endif
        history += (1<<8) + delay;
        if(history > (1<<18))                       // 256us   <-- CEILING
            history = 1<<18;
    }
}

[Charts: latency (seconds) vs. threads (GPU x CPU), before and after adding backoff.]

We seem to have fixed the slow corner.
SLIDE 30

FAST LOCKS, SLOW APPLICATIONS 😲

[Charts: latency (seconds) vs. threads (GPU x CPU), and a histogram of critical sections executed per thread (1e-1 to 1e6).]

  • Fast because: the lock is disproportionately granted to some threads.
  • Slow because: top-level performance often depends on fairness.

The single-thread rate is a strong attractor.

SLIDE 31

RECALL: FORWARD-PROGRESS

CONTENDED EXCHANGE LOCK WITH BACKOFF: Some thread succeeds. Critical sections eventually complete.

Maurice Herlihy and Nir Shavit. 2011. On the nature of progress. In Proceedings of the 15th International Conference on Principles of Distributed Systems (OPODIS'11).

SLIDE 32

WHEN IS DEADLOCK-FREE SUITABLE?

  1. Very low contention.
  2. Top-level algorithms resilient to tail effects.

Luckily, this is still pretty common!

SLIDE 33

TICKET LOCK + PROPORTIONAL BACKOFF

struct alignas(128) ticket_mutex {
    __host__ __device__ void lock() {
        auto const my = in.fetch_add(1, memory_order_acquire);
        while(1) {
            auto const now = out.load(memory_order_acquire);
            if(now == my)
                break;
            auto const delta = my - now;          // how many tickets are still ahead of us
            auto const delay = (delta << 8);      // * 256
#ifdef __CUDA_ARCH__
            __nanosleep(delay);
#else
            if(delay > (1<<15))                   // 32us
                std::this_thread::sleep_for(std::chrono::nanoseconds(delay));
            else
                std::this_thread::yield();
#endif
        }
    }
    __host__ __device__ void unlock() {
        out.fetch_add(1, memory_order_release);
    }
    atomic<unsigned> in  = ATOMIC_VAR_INIT(0);
    atomic<unsigned> out = ATOMIC_VAR_INIT(0);
};

Don’t need either K or ceiling here; delta is an accurate forecast! ☺

[Chart: latency (seconds) vs. threads (GPU x CPU).]

SLIDE 34

TICKET LOCK + PROPORTIONAL BACKOFF

[Charts: latency (seconds) vs. threads (GPU x CPU), and a histogram of critical sections executed per thread (1e-1 to 1e6).]

SLIDE 35

AGAIN: FORWARD-PROGRESS

TICKET LOCK: Every thread succeeds. Critical sections eventually complete.

Maurice Herlihy and Nir Shavit. 2011. On the nature of progress. In Proceedings of the 15th International Conference on Principles of Distributed Systems (OPODIS'11).

SLIDE 36

WHEN IS STARVATION-FREE SUITABLE?

This is your default, when deadlock-free is unsuitable.

WHAT ELSE IS THERE FOR MUTEXES?

Wish we could use queue locks (e.g. MCS), but we can’t: they use O(P) storage 😲 and local stack pointers (MCS). A sketch of the textbook shape follows.
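For reference, a textbook-style MCS sketch of my own (not from the deck) that shows exactly where the O(P) nodes and the pointers into other threads' stacks come from:

#include <atomic>

// Each waiter brings its own node, usually a stack-local object...
struct mcs_node {
    std::atomic<mcs_node*> next{nullptr};
    std::atomic<bool>      locked{false};
};

struct mcs_mutex {
    std::atomic<mcs_node*> tail{nullptr};   // ...so the lock's queue threads through P nodes.

    void lock(mcs_node& me) {
        me.next.store(nullptr, std::memory_order_relaxed);
        me.locked.store(true, std::memory_order_relaxed);
        mcs_node* prev = tail.exchange(&me, std::memory_order_acq_rel);
        if(prev) {
            prev->next.store(&me, std::memory_order_release);   // pointer into another thread's stack
            while(me.locked.load(std::memory_order_acquire))
                ;                                               // spin only on our own node
        }
    }
    void unlock(mcs_node& me) {
        mcs_node* succ = me.next.load(std::memory_order_acquire);
        if(!succ) {
            mcs_node* expected = &me;
            if(tail.compare_exchange_strong(expected, nullptr, std::memory_order_acq_rel))
                return;                                         // nobody queued behind us
            while(!(succ = me.next.load(std::memory_order_acquire)))
                ;                                               // successor is still linking in
        }
        succ->locked.store(false, std::memory_order_release);
    }
};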

SLIDE 37

BARRIERS

SLIDE 38

BARRIERS

AS A TYPICALLY-GPU THING AND ALSO TO THINK ABOUT LATENCY

SLIDE 39

CENTRAL BARRIER + PROPORTIONAL BACKOFF

// phase_arrived packs an arrival count plus a phase flag (phase_bit, defined elsewhere).
__host__ __device__ void arrive_and_wait() {
    auto const _expected = expected;
    auto const old = phase_arrived.fetch_add(1, memory_order_acq_rel);
    auto current = old + 1;
    if((old & phase_bit) != (current & phase_bit)) {
        phase_arrived.fetch_add(phase_bit - _expected);   // this thread's arrival flipped the phase
    }
    else while(1) {
        current = phase_arrived.load(memory_order_acquire);
        if((old & phase_bit) != (current & phase_bit))
            break;
        auto const delta = phase_bit - (current & ~phase_bit);
        auto const delay = (delta << 8);                   // * 256
#ifdef __CUDA_ARCH__
        __nanosleep(delay);
#else
        if(delay > (1<<15))                                // 32us
            std::this_thread::sleep_for(std::chrono::nanoseconds(delay));
        else
            std::this_thread::yield();
#endif
    }
}
uint32_t const expected = 0;
atomic<uint32_t> phase_arrived = ATOMIC_VAR_INIT(0);

[Chart: latency (seconds) vs. threads (GPU x CPU).]

SLIDE 40

CENTRAL BARRIER + PROPORTIONAL BACKOFF

[Chart: latency (seconds) vs. threads (GPU x CPU).]

  • The centralized barrier is bad for the CPU.
  • Coherence protocols strongly prefer fancy barrier algorithms: tree, tournament, dissemination…
  • Because: BW_contended = 1/Lat_NUMA.
  • The GPU just hangs on for a while longer.
  • But: fancy algorithms introduce high-latency levels of indirection.
  • Each indirection needs a 1:100x .. 1:1000x improvement in BW to justify itself.

SLIDE 41

EASY AND EFFECTIVE GPU TREE BARRIER

__host__ __device__ void arrive_and_wait() {
#ifdef __CUDA_ARCH__
    auto const c = __syncthreads_count(1);   // the whole block arrives as one unit
    if(threadIdx.x == 0)
        __arrive_and_wait(c);
    __syncthreads();
#else
    __arrive_and_wait();
#endif // __CUDA_ARCH__
}
__host__ __device__ void __arrive_and_wait(int c = 1) {
    auto const _expected = expected;
    auto const old = phase_arrived.fetch_add(c, memory_order_acq_rel);
    auto current = old + c;
    //...
}

  • The 2nd level of hierarchy is ~free, in blocks.
  • Up to 1:1024 bandwidth reduction!

[Chart: latency (seconds) vs. threads (GPU x CPU).]
SLIDE 42

“Remember, if you actually need a GPU barrier, then you should use cooperative groups instead.”

  • My inner CUDA engineer voice.

https://devblogs.nvidia.com/cooperative-groups/
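For completeness, a hedged sketch (mine, not from the post) of the host-side plumbing that a cooperative-groups grid barrier requires; my_kernel stands in for whatever grid-synchronizing kernel you have.

#include <cuda_runtime.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void my_kernel(int* data) {
    // ... phase 1 ...
    cg::this_grid().sync();        // the grid-wide barrier that a cooperative launch makes legal
    // ... phase 2 ...
}

int main() {
    int device = 0, supported = 0;
    cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, device);
    if(supported) {
        int* data = nullptr;
        cudaMalloc(&data, 1024 * sizeof(int));
        void* args[] = { &data };
        // Cooperative launch: all blocks must be co-resident for grid sync to work.
        cudaLaunchCooperativeKernel((void*)my_kernel, dim3(32), dim3(256), args, 0, 0);
        cudaDeviceSynchronize();
        cudaFree(data);
    }
    return 0;
}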

SLIDE 43

WHAT ABOUT CPU-GPU BARRIERS, THOUGH?

  • As you can see, a new barrier algorithm is necessary.
  • Perhaps partitioned strategies, by processor type?

Seriously, I’m asking. Somebody should try it! 😆

  • I don’t know what it would be for, though. So no rush.
SLIDE 44

WHAT ELSE IS THERE FOR BARRIERS?

For multi-GPU systems:

  • You can replicate arrivals to trade atomics vs. polling.
  • Not done by CG, but it has been done at NVIDIA.

For a DGX-2 (2.6 million threads):

  • You might benefit from a 3rd level of barrier, barely.
  • I don’t think it’s been done at NVIDIA yet.

SLIDE 45

IN SHORT

SLIDE 46

USE CASES · PRE-REQS · KEEP IN MIND

Pre-reqs: → Compute_7x. Critical sections eventually complete.

Keep in mind:

  • 1. Contention bandwidth is a major issue for synchronization. See: atomic story.
  • 2. If you use back-offs, keep an eye on fairness. See: mutex story.
  • 3. If you use indirection, the GPU needs a 100..1000x saving. See: barrier story.

SLIDE 47

CUDA::STD::ATOMIC<T> IS COMING SOON

Should come to the CUDA C++ toolkit this year, in 2019. A preview is here: https://github.com/ogiroux/freestanding. My CppCon 2018 talk has more; stream it on YouTube.
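A small sketch of what that could look like (my guess at the eventual spelling, modeled on std::atomic; the header path and namespace are assumptions based on the preview above):

#include <cuda/std/atomic>    // assumed header, per the freestanding preview

// One atomic type, ISO C++ semantics, usable from host and device code alike.
__host__ __device__ void enter(cuda::std::atomic<int>& lock) {
    while(1 == lock.exchange(1, cuda::std::memory_order_acquire))
        ;                     // same exchange lock as slide 6, no inline PTX needed
}

__host__ __device__ void leave(cuda::std::atomic<int>& lock) {
    lock.store(0, cuda::std::memory_order_release);
}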

SLIDE 48

Concurrency at this scale has never been easier. If you have IBM + V100 systems, try new algorithms! We want to see what you’ll do with them.

EXTREME SHARED-MEMORY CONCURRENCY

SLIDE 49