SYNCHRONIZATION IS BAD, BUT IF YOU MUST (S9329)
Olivier Giroux, Distinguished Architect, ISO C++ Chair of Concurrency & Parallelism
MY COORDINATES
ogiroux@nvidia.com
[Diagram: the communities I sit between: ISO C++ users, NVIDIA GPU engineers, WG21, the memory model community, architects.]
WHAT THIS TALK IS ABOUT:
Not: cudaDeviceSynchronize(), __syncthreads(), __shfl_sync().
✅ Using atomics to do blocking synchronization.
PSA: DON'T RUN SERIAL CODE IN THREADS
[Diagram: threads T0 through T6 over time, each blocked nearly the whole time.]
All blocked and no play makes T6 a dull thread…

PSA: RARE CONTENTION IS FINE
[Diagram: threads T0 through T6 over time, each blocked only occasionally.]
struct mutex { // suspend atomic<> disbelief for now
    __host__ __device__ void lock() {
        while(1 == l.exchange(1, memory_order_acquire))
            ;
    }
    __host__ __device__ void unlock() {
        l.store(0, memory_order_release);
    }
    atomic<int> l = ATOMIC_VAR_INIT(0);
};
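To make the intended use concrete, here is a minimal sketch (my illustration, not from the slides) of a kernel taking this lock; the names m, counter and increment_under_lock are hypothetical, and it keeps the same suspension of atomic<> disbelief. Note that the atomic result feeds the branch that closes the spin loop, which matters for SIMT forward progress (more on that below).

__device__ mutex m;           // the mutex defined above, in device memory
__device__ int counter = 0;   // state protected by m

__global__ void increment_under_lock(int iterations) {
    for(int i = 0; i < iterations; ++i) {
        m.lock();             // blocking synchronization built from atomics
        ++counter;            // short, rarely-contended critical section
        m.unlock();
    }
}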
UNCONTENDED EXCHANGE LOCK
[Plot: critical sections per second vs. thread occupancy, V100 and CPU.]
Awesome. 🎊 Thanks for attending my talk. 🎊
Deadlock. 🐱🕴 Deadlock. Deadlock. Deadlock.
SIMT ATOMIC CONCERN SCALE:
👷 No atomics
😂 Atomic result ignored
🙃 Atomic result feeds arithmetic
🤩 Atomic result feeds branch, outside loop
😭 Atomic result feeds branch, inside loop
😲 Atomic result feeds branch, closes loop
😏 Atomic result feeds branch, closes loop, Volta+
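For a concrete instance of the scariest rung, an atomic result feeding the branch that closes a loop is just a spin-wait. A minimal sketch (hypothetical flag, same suspended atomic<> disbelief):

__device__ atomic<int> flag;   // set to 1 by some other thread

__device__ void spin_until_set() {
    // The loaded value decides whether the loop exits: 😲 on pre-Volta SIMT this
    // can hang if the writer is masked off in the same warp; 😏 on Volta+ every
    // thread makes forward progress, so it completes.
    while(flag.load(memory_order_acquire) != 1)
        ;
}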
SIMT FAMILY HISTORY
[Timeline, 1966 / 1970 / 1984 / 2007 / 2017:]
SIMD (1966): time zero.
Pixar CHAP (1984): scalar channel programs.
Tesla SIMT (2007): scalar thread programs. Forward-progress = Nope ☹
Volta SIMT (2017): scalar thread programs. Forward-progress = YES! 🤙
Source: Wikipedia, SIGGRAPH proceedings, IEEE Micro.
APPLICABILITY
SYNCHRONIZATION DECISION CHECKLIST
CONs:
1. Serialization is bad.
2. Critical path / Amdahl's law.
3. Latency is high.
PROs:
1. Algorithmic gains.
2. Latency hiding.
3. Throughput is high.
TL;DR: Sometimes, it's a win.
APP #1: GPU-RESIDENT METHODS
[Diagram: Grid0<<<>>> then Grid1<<<>>>, with state invalidation between launches, vs. a single cooperative grid with a global barrier.]
Keep local state in registers & shared memory, with synchronization: 20x faster for RNN.
See Greg Diamos' GTC 2016 talk for more.
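As a sketch of the cooperative-grid pattern (my illustration, not Diamos' code): the kernel name and update rule below are hypothetical, but the API is CUDA cooperative groups, and such a kernel must be launched with cudaLaunchCooperativeKernel for grid.sync() to be legal.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void gpu_resident_step(float* state, int steps) {
    cg::grid_group grid = cg::this_grid();
    unsigned long long const i = grid.thread_rank();
    float local = state[i];                // local state lives in registers...
    for(int s = 0; s < steps; ++s) {
        float const peer = state[i ^ 1];   // read a neighbor's previous value
        grid.sync();                       // ...and a global barrier replaces a new launch
        local = 0.5f * (local + peer);     // hypothetical update rule
        state[i] = local;                  // publish for the next step
        grid.sync();
    }
}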
APP #2: LOCK-FREE IS NOT ALWAYS FASTER

// *continue* to suspend atomic<> disbelief for now
__host__ __device__
bool lock_free_writer_version(atomic<int>& a, atomic<int>& b) {  // exposed dependent latency
    int expected = -1;
    if(a.compare_exchange_strong(expected, 1, memory_order_relaxed))
        b.store(1, memory_order_relaxed);
    return expected == -1;
}

// This version is a ~60% speedup at GPU application level, despite progress hazards.
__host__ __device__
bool starvation_free_writer_version(atomic<int>& a, atomic<int>& b) {  // overlapped
    int expected_a = -1, expected_b = -1;
    bool success_a = a.compare_exchange_strong(expected_a, 1, memory_order_relaxed),
         success_b = b.compare_exchange_strong(expected_b, 1, memory_order_relaxed);
    if(success_a)               // Note: we almost always succeed at both.
        while(!success_b)       // <-- This loop makes this a deadlock-free algorithm.
            success_b = b.compare_exchange_strong(expected_b = -1, 1, memory_order_relaxed);
    else if(success_b)
        b.store(-1, memory_order_relaxed);
    return expected_a == -1;
}

The rarely-taken loop changes this algorithm to a different category.
APP #3: CONCURRENT DATA STRUCTURES
[Chart: multi-threading (CPU) vs. acceleration (RTX 2070).]
Even if mutexes hide in every node, GPUs can build tree structures fast.
For more, see my CppCon 2018 talk on YouTube, and the 'Parallel Forall' blog post.
PRE-REQUISITES
PR #1: FORWARD-PROGRESS
Compute_6x → Compute_7x.

Concurrent algorithm taxonomy (class names per Herlihy & Shavit; cf. App #2):

                                      Every thread succeeds    Some thread succeeds
  No scheduling requirements
  (any thread scheduler)              wait-free                lock-free
  Eventually run isolated             obstruction-free
  Critical sections
  eventually complete                 starvation-free          deadlock-free

Maurice Herlihy and Nir Shavit. 2011. On the nature of progress. In Proc. OPODIS'11.
PR #2: MEMORY CONSISTENCY

ISO C++11 form (usable directly in CUDA 10.3, Volta+: 🎊 later this year! 🎊) → what you write today (CUDA 9.0-10.2, Volta+; see PTX 6 chapter 8 for the asm):

int atomic<int>::load(memory_order_seq_cst)
    → asm("fence.sc.sys;"); asm("ld.acquire.sys.b32 %0, [%1];" ::: "memory");
int atomic<int>::load(memory_order_acquire)
    → asm("ld.acquire.sys.b32 %0, [%1];" ::: "memory");
int atomic<int>::load(memory_order_relaxed)
    → asm("ld.relaxed.sys.b32 %0, [%1];" ::: "memory");  OR: x = *(volatile int*)ptr;
void atomic<int>::store(int, memory_order_seq_cst)
    → asm("fence.sc.sys;"); asm("st.relaxed.sys.b32 [%0], %1;" ::: "memory");
void atomic<int>::store(int, memory_order_release)
    → asm("st.release.sys.b32 [%0], %1;" ::: "memory");
void atomic<int>::store(int, memory_order_relaxed)
    → asm("st.relaxed.sys.b32 [%0], %1;" ::: "memory");  OR: *(volatile int*)ptr = x;
int atomic<int>::exchange(int, memory_order_seq_cst)
    → asm("fence.sc.sys;"); asm("atom.exch.acquire.sys.b32 %0, [%1], %2;" ::: "memory");
int atomic<int>::exchange(int, memory_order_acq_rel)
    → asm("atom.exch.acq_rel.sys.b32 %0, [%1], %2;" ::: "memory");
int atomic<int>::exchange(int, memory_order_release)
    → asm("atom.exch.release.sys.b32 %0, [%1], %2;" ::: "memory");
int atomic<int>::exchange(int, memory_order_acquire)
    → asm("atom.exch.acquire.sys.b32 %0, [%1], %2;" ::: "memory");
int atomic<int>::exchange(int, memory_order_relaxed)
    → asm("atom.exch.relaxed.sys.b32 %0, [%1], %2;" ::: "memory");  OR: y = atomicExch_system(ptr, x);
And so on... The "OR:" forms are classic CUDA C++.

Our ASPLOS 2019 paper: https://github.com/NVlabs/ptxmemorymodel
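The classic CUDA C++ ("OR:") forms are usable today without inline asm; a minimal self-contained sketch of those spellings (ptr is assumed to point at naturally aligned, system-visible memory; the _system atomic needs compute_60 or newer):

__device__ int relaxed_load(int* ptr) {
    return *(volatile int*)ptr;          // atomic<int>::load(memory_order_relaxed)
}
__device__ void relaxed_store(int* ptr, int x) {
    *(volatile int*)ptr = x;             // atomic<int>::store(x, memory_order_relaxed)
}
__device__ int relaxed_exchange(int* ptr, int x) {
    return atomicExch_system(ptr, x);    // atomic<int>::exchange(x, memory_order_relaxed)
}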
PR #3: TRUE SHARING

Platform / allocator                 Load/store sharing    Atomic (low cont'n)    Atomic (high cont'n)
Any: ARM/Windows/Mac / Unmanaged     Nope. Not at all.
x86 Linux (CPU/GPU) / Managed        Yes.                  Technically… but no.
x86 Linux (GPU/GPU) / Managed        YES!                  TRY IT!
POWER Linux (all pairs) / Managed

- Concurrent data sharing between CPU and GPU is a new possibility.
- Real usefulness has some more conditions.
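What "TRY IT!" might look like in practice, as a minimal sketch (hypothetical names; it keeps the talk's host-and-device-usable atomic<int>, and assumes a Linux system where managed memory supports concurrent CPU/GPU access):

__global__ void wait_for_cpu(atomic<int>* flag) {
    while(flag->load(memory_order_acquire) != 1)   // GPU spins on data the CPU will write
        ;
}

void try_it() {
    atomic<int>* flag = nullptr;
    cudaMallocManaged(&flag, sizeof(atomic<int>)); // one object visible to both processors
    new (flag) atomic<int>(0);
    wait_for_cpu<<<1, 1>>>(flag);
    flag->store(1, memory_order_release);          // true sharing: CPU releases, GPU acquires
    cudaDeviceSynchronize();
    cudaFree(flag);
}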
PRELIMINARIES
// 'a' is a shared atomic<int> counter, initialized to 0.
__host__ __device__ void test(int my_thread, int total_threads, int final_value) {
    for(int old; my_thread < final_value; my_thread += total_threads)
        while(!a.compare_exchange_weak(old = my_thread, my_thread + 1, memory_order_relaxed))
            ;
}
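A sketch of how such a test might be driven on the GPU side (hypothetical wrapper; total_threads counts every contender across both processors, so the increments land in order 0, 1, 2, …):

__global__ void test_kernel(int first_thread, int total_threads, int final_value) {
    int const my_thread = first_thread + blockIdx.x * blockDim.x + threadIdx.x;
    test(my_thread, total_threads, final_value);
}
// Host side: time the run and divide by final_value to get latency per contended atomic.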
CONTENTION IS THE ISSUE, DIFFERENTLY
[Plot: latency (seconds) vs. contending threads (count), for V100, POWER, x86.]
BW = 1/Lat_NUMA is a punishing depressor of CPU perf. The bathtub curve is due to the statistical likelihood of finding a peer in the pipeline. Little's Law finally kicks in.
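For intuition, with illustrative numbers (an assumption, not from the slides): if the contended line's NUMA round trip is Lat_NUMA ≈ 1 µs, the whole machine completes at most BW = 1/Lat_NUMA ≈ 10^6 contended updates per second, however many threads pile on; Little's Law (outstanding = throughput × latency) then says extra contenders only show up as extra latency.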
CONTENDING PROCESSORS ARE CRUSHED…
[Plots: latency (seconds) vs. threads (GPU x CPU), and vs. thread occupancy, for V100 and x86.]
Crushed? ~½ millisecond.
…UNLESS THE PROCESSORS ARE NVLINK'ED
[Plots: latency (seconds) vs. threads (GPU x CPU): x86 + V100 (PCIE) vs. POWER + V100 (NVLINK).]
ALL OF THE FOLLOWING SLIDES ARE NVLINK'ED
[Plot: latency (seconds) vs. threads (GPU x CPU).]
And not log scale, because it's legible in linear scale now. Thanks.
CONTENDED MUTEXES
AS AN EXERCISE TO THINK ABOUT THROUGHPUT AND FAIRNESS
CONTENDED EXCHANGE LOCK
struct mutex {
    __host__ __device__ void lock() {
        while(1 == l.exchange(1, memory_order_acquire))
            ;
    }
    __host__ __device__ void unlock() {
        l.store(0, memory_order_release);
    }
    atomic<int> l = ATOMIC_VAR_INIT(0);
};
[Plot: latency (seconds) vs. threads (GPU x CPU).]
Not awesome. 🎊 Stay. Keep attending my talk. 🎊
CONTENDED EXCHANGE LOCK
[Plot: latency (seconds) vs. threads (GPU x CPU).]
Heavy system pressure:
- A lot of requests.
- Each request is slow.
BACKOFF: LESS PRESSURE VIA FORECASTING
[Impulse-response diagram (seconds vs. seconds), with an instantaneous-response reference line and callouts for K, ceiling, floor (CPU), floor (GPU).]
- K bounds forecast relative error (orange line): Latency_response > K_delay * Latency_impulse.
  - Pick arbitrary K_delay; say 1.5 for 50% error.
  - Some benefit to stochastic choice, to avoid coupling.
- Ceiling trades bandwidth & maximum error: t_polling / (lat_loaded + lat_backoff) > BW_polling.
  - Pick arbitrary BW_polling; say 0.5 * BW_contended.
- Floor protects the fast corner (green box): Latency_response > Latency_floor.
  - Minimum CPU sleep (Linux) is ~= 50000 ns.
  - Minimum sleep on V100 is ~= 0 ns.
CONTENDED EXCHANGE LOCK + BACKOFF
__host__ __device__ void lock() {
    uint32_t history = 1<<8;                     // 256ns
    while(1 == l.exchange(1, memory_order_acquire)) {
        uint32_t delay = history >> 1;           // 50%
#ifdef __CUDA_ARCH__
        __nanosleep(delay);
#else
        if(delay > (1<<15))                      // 32us
            std::this_thread::sleep_for(
                std::chrono::nanoseconds(delay));
        else {
            std::this_thread::yield();
            delay = 0;
        }
#endif
        history += (1<<8) + delay;
        if(history > (1<<18))                    // 256us
            history = 1<<18;
    }
}
[Plot: latency (seconds) vs. threads (GPU x CPU), with FLOOR, CEILING and K callouts on the code.]
We seem to have fixed the slow corner.
FAST LOCKS, SLOW APPLICATIONS 😲
[Plots: latency (seconds) vs. threads (GPU x CPU); critical sections executed, per thread.]
- Fast because: the lock is disproportionately granted to some threads.
- Slow because: top-level performance often depends on fairness.
Single-thread rate is a strong attractor.
RECALL: FORWARD-PROGRESS
Contended exchange lock with backoff: some thread succeeds, provided critical sections eventually complete. (That is deadlock-free.)
Maurice Herlihy and Nir Shavit. 2011. On the nature of progress. In Proc. OPODIS'11.
WHEN IS DEADLOCK-FREE SUITABLE?
1. Very low contention.
2. Top-level algorithms resilient to tail effects.
Luckily, this is still pretty common!
TICKET LOCK + PROPORTIONAL BACKOFF
struct alignas(128) ticket_mutex {
    __host__ __device__ void lock() {
        auto const my = in.fetch_add(1, memory_order_acquire);
        while(1) {
            auto const now = out.load(memory_order_acquire);
            if(now == my)
                break;
            auto const delta = my - now;
            auto const delay = (delta << 8); // * 256
#ifdef __CUDA_ARCH__
            __nanosleep(delay);
#else
            if(delay > (1<<15)) // 32us
                std::this_thread::sleep_for(std::chrono::nanoseconds(delay));
            else
                std::this_thread::yield();
#endif
        }
    }
    __host__ __device__ void unlock() {
        out.fetch_add(1, memory_order_release);
    }
    atomic<unsigned> in  = ATOMIC_VAR_INIT(0);
    atomic<unsigned> out = ATOMIC_VAR_INIT(0);
};
Don't need either K or ceiling here; delta is an accurate forecast! ☺
[Plot: latency (seconds) vs. threads (GPU x CPU).]
TICKET LOCK + PROPORTIONAL BACKOFF
[Plots: latency (seconds) vs. threads (GPU x CPU); critical sections executed, per thread.]
AGAIN: FORWARD-PROGRESS
Ticket lock: every thread succeeds, provided critical sections eventually complete. (That is starvation-free.)
Maurice Herlihy and Nir Shavit. 2011. On the nature of progress. In Proc. OPODIS'11.

WHEN IS STARVATION-FREE SUITABLE?
This is your default, when deadlock-free is unsuitable.
WHAT ELSE IS THERE FOR MUTEXES?
Wish we could use queue locks (e.g. MCS), but we can't: they use O(P) storage 😲 and local stack pointers (MCS).
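For reference, a minimal sketch of the MCS shape (the standard algorithm, not from the slides) shows where both objections come from: every thread brings its own queue node, usually on its stack, and spins on that local node.

struct mcs_node {
    atomic<mcs_node*> next   = ATOMIC_VAR_INIT(nullptr);
    atomic<int>       locked = ATOMIC_VAR_INIT(0);
};

struct mcs_mutex {
    atomic<mcs_node*> tail = ATOMIC_VAR_INIT(nullptr);
    __host__ __device__ void lock(mcs_node& my) {         // one node per thread: O(P) storage
        my.next.store(nullptr, memory_order_relaxed);
        mcs_node* const prev = tail.exchange(&my, memory_order_acq_rel);
        if(prev) {
            my.locked.store(1, memory_order_relaxed);
            prev->next.store(&my, memory_order_release);  // pointer into another thread's stack
            while(my.locked.load(memory_order_acquire))   // spin on the local node only
                ;
        }
    }
    __host__ __device__ void unlock(mcs_node& my) {
        mcs_node* succ = my.next.load(memory_order_acquire);
        if(!succ) {
            mcs_node* expected = &my;
            if(tail.compare_exchange_strong(expected, nullptr, memory_order_acq_rel))
                return;                                   // no successor: lock is free again
            while(!(succ = my.next.load(memory_order_acquire)))
                ;                                         // successor is mid-enqueue; wait for it
        }
        succ->locked.store(0, memory_order_release);      // hand the lock to the next in line
    }
};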
BARRIERS
AS A TYPICALLY-GPU THING AND ALSO TO THINK ABOUT LATENCY
CENTRAL BARRIER + PROPORTIONAL BACKOFF
// phase_bit: a constant single bit of phase_arrived used to track the phase (defined elsewhere).
__host__ __device__ void arrive_and_wait() {
    auto const _expected = expected;
    auto const old = phase_arrived.fetch_add(1, memory_order_acq_rel);
    auto current = old + 1;
    if((old & phase_bit) != (current & phase_bit)) {
        phase_arrived.fetch_add(phase_bit - _expected);
    }
    else while(1) {
        current = phase_arrived.load(memory_order_acquire);
        if((old & phase_bit) != (current & phase_bit))
            break;
        auto const delta = phase_bit - (current & ~phase_bit);
        auto const delay = (delta << 8); // * 256
#ifdef __CUDA_ARCH__
        __nanosleep(delay);
#else
        if(delay > (1<<15)) // 32us
            std::this_thread::sleep_for(std::chrono::nanoseconds(delay));
        else
            std::this_thread::yield();
#endif
    }
}
uint32_t const expected = 0;
atomic<uint32_t> phase_arrived = ATOMIC_VAR_INIT(0);
[Plot: latency (seconds) vs. threads (GPU x CPU).]
CENTRAL BARRIER + PROPORTIONAL BACKOFF
- Centralized barrier is bad for the CPU.
  - Coherence protocols strongly prefer fancy barrier algorithms: tree, tournament, dissemination…
  - Because: BW_contended = 1/Lat_NUMA.
- GPU just hangs on for a while longer.
  - But: fancy algorithms introduce high-latency levels of indirection.
  - Each indirection needs a 1:100x .. 1:1000x improvement in BW to justify itself.
EASY AND EFFECTIVE GPU TREE BARRIER
__host__ __device__ void arrive_and_wait() {
#ifdef __CUDA_ARCH__
    auto const c = __syncthreads_count(1);
    if(threadIdx.x == 0)
        __arrive_and_wait(c);
    __syncthreads();
#else
    __arrive_and_wait();
#endif // __CUDA_ARCH__
}
__host__ __device__ void __arrive_and_wait(int c = 1) {
    auto const _expected = expected;
    auto const old = phase_arrived.fetch_add(c, memory_order_acq_rel);
    auto current = old + c;
    //...
}
[Plot: latency (seconds) vs. threads (GPU x CPU).]
- 2nd level of hierarchy is ~free, in blocks.
- Up to 1:1024 bandwidth reduction!
"Remember, if you actually need a GPU barrier, then you should use cooperative groups instead."
- My inner CUDA engineer voice.
https://devblogs.nvidia.com/cooperative-groups/
WHAT ELSE IS THERE FOR BARRIERS?
What about CPU-GPU barriers, though?
- As you can see, a new barrier algorithm is necessary.
- Perhaps partitioned strategies, by processor type?
  Seriously, I'm asking. Somebody should try it! 😆
- I don't know what it would be for, though. So no rush.
For multi-GPU systems:
- You can replicate arrivals to trade atomics vs. polling.
- Not done by CG, but it has been done at NVIDIA.
For a DGX-2 (2.6 million threads):
- You might benefit from a 3rd level of barrier, barely.
- I don't think it's been done at NVIDIA yet.
IN SHORT
Pre-requisites: → Compute_7x; critical sections eventually complete.
1. Contention bandwidth is a major issue for synchronization. See: atomic story.
2. If you use back-offs, keep an eye on fairness. See: mutex story.
3. If you use indirection, the GPU needs a 100..1000x saving. See: barrier story.