  1. Synchronization on Manycore Machines
  John Owens, Associate Professor, Electrical and Computer Engineering, University of California, Davis

  2. Announcements
  • If anyone's going back to Boston near a T station immediately after the end of the conference on Friday, I'd love a ride. (Faster than the train alternative. I'm happy to get back ASAP.)
  • Stuff I'm not talking about but might be interesting to some of you:
    • Tridiagonal solvers
    • bzip2-style lossless compression
    • Heterogeneous multi-node global-illumination rendering (substitute your hard heterogeneous problem here)

  3. GPU Programming Model
  • A kernel is executed as a grid of thread blocks
  • A thread block is a fixed-maximum-size (~512) batch of threads that can cooperate with each other by:
    • Efficiently sharing data through shared memory
    • Synchronizing their execution
  • Two threads from two different blocks cannot cooperate
  • Blocks are independent
  [Figure: the host launches Kernel 1 and Kernel 2 onto the device, each as a grid of blocks; each block, e.g. Block (1, 1), is itself a grid of threads]
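  To make the model concrete, here is a minimal kernel sketch (my own illustration, not from the talk; the kernel name, block size, and arguments are assumptions): each thread handles one data element, and the launch specifies the grid of blocks and the threads per block.

      // Hypothetical example: one thread per data element.
      __global__ void scale(float *data, float alpha, int n)
      {
          // Build a global index from the block/thread hierarchy described above.
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n)
              data[i] *= alpha;
      }

      // Launch: a grid of ceil(n/256) blocks, each a batch of 256 threads.
      // scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);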

  4. GPU Hardware, High Level
  • Hardware is responsible for assigning blocks to "SMs" ("streaming multiprocessors" or "cores"; think of them as virtual blocks).
  • Different GPUs have different numbers of SMs.
  [Figure: the host feeds an input assembler and thread execution manager, which dispatch to an array of thread processors, each with a parallel data cache, sharing a load/store path to global memory]

  5. SM Multithreaded Multiprocessor
  • Each SM runs a block of threads
  • SM has 32 SP thread processors
    • Run as a "warp" in lockstep
    • 99 GFLOPS peak x 16 SMs at 1.544 GHz (1 MAD/clock/SP)
    • IEEE 754 32-bit floating point
    • Scalar ISA
  • Up to 768 threads, hw multithreaded
  • 16 or 48 KB shared memory, 48 or 16 KB hardware-managed cache
  [Figure: an SM with its MT issue unit, SP thread processors, and shared memory]

  6. Mapping SW to HW
  • Exposed computational hierarchy
  [Figure: the software grid/block/thread hierarchy on the host side mapped onto the hardware: blocks onto SMs (MT issue unit, SPs, shared memory), threads onto thread processors with their parallel data caches]

  7. Synchronization Toolbox (1)
  • Within a thread block & within a warp:
    • In hardware, warps run synchronously
    • Hardware manages branch divergence (idle threads go to sleep)
    • The width of a warp is only vaguely exposed by the programming model
      • Different for different vendors (Intel: 16, NVIDIA: 32, AMD: 64)
    • Warps have __all, __any, and __ballot hardware intra-warp functions (see the sketch below)
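  A minimal sketch of those warp vote intrinsics (the kernel, its inputs, and the predicate are my own illustration, not the speaker's code):

      __global__ void warp_vote_demo(const int *in, unsigned int *ballots, int *all_flags, int *any_flags)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          int pred = (in[i] > 0);                 // per-thread predicate

          int all_true  = __all(pred);            // nonzero iff pred holds for every thread in the warp
          int any_true  = __any(pred);            // nonzero iff pred holds for any thread in the warp
          unsigned bits = __ballot(pred);         // bit k set iff lane k's pred is nonzero (Fermi+)

          if ((threadIdx.x & 31) == 0) {          // one lane per 32-thread warp records the results
              int warp = i / 32;
              ballots[warp]   = bits;
              all_flags[warp] = all_true;
              any_flags[warp] = any_true;
          }
      }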

  8. Synchronization Toolbox (2)
  • Within a thread block & across warps:
    • __syncthreads is a barrier for all threads within a block (sketch below)
    • No need to synchronize between threads within a warp
    • The newest NVIDIA GPUs add __syncthreads_count(p), __syncthreads_or(p), and __syncthreads_and(p) for a predicate p
    • __threadfence_block: all memory accesses are visible to all threads within the block
    • __threadfence: all memory accesses are visible to all threads on the GPU
    • __threadfence_system: all memory accesses are visible to threads on the GPU and also the CPU
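  A sketch of the block-wide barrier (illustrative only; the kernel and its assumption that n is a multiple of the block size are mine):

      // __syncthreads() is a barrier across all warps in a block; threads within a
      // single warp already run in lockstep and need no explicit barrier.
      __global__ void reverse_each_block(int *d)
      {
          extern __shared__ int s[];              // one int per thread, sized at launch
          int t = threadIdx.x;
          int base = blockIdx.x * blockDim.x;
          s[t] = d[base + t];
          __syncthreads();                        // every thread's write to s[] is now visible
          d[base + t] = s[blockDim.x - 1 - t];    // safely read an element written by another warp
      }

      // Launch sketch (n assumed to be a multiple of 256):
      // reverse_each_block<<<n / 256, 256, 256 * sizeof(int)>>>(d_data);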

  9. Synchronization Toolbox (3)
  • Threads within a block can read/write shared memory
  • The best approximation of the shared-memory model is CREW: concurrent reads, exclusive writes
    • Hardware makes no guarantees about which thread wins on concurrent writes
  • Memory accesses can be guaranteed to compile into actual reads/writes with the volatile qualifier
  • Atomics on shared memory: 32b and 64b ints; 32b float for exch and add (example below)
    • add, sub, exch, min, max, inc, dec, CAS, bitwise {and, or, xor}
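  For example, a per-block histogram built with shared-memory atomics (an illustrative sketch; the 64-bin layout and kernel name are assumptions, and the final merge uses the same atomics on global memory):

      __global__ void histogram64(const unsigned char *in, int n, unsigned int *global_bins)
      {
          __shared__ unsigned int bins[64];
          for (int b = threadIdx.x; b < 64; b += blockDim.x)
              bins[b] = 0;                         // clear this block's private histogram
          __syncthreads();

          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n)
              atomicAdd(&bins[in[i] & 63], 1u);    // exclusive update of a shared counter
          __syncthreads();

          for (int b = threadIdx.x; b < 64; b += blockDim.x)
              atomicAdd(&global_bins[b], bins[b]); // fold into the global histogram
      }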

  10. Synchronization Toolbox (4)
  • Threads within a block can read/write global memory
    • Same atomics as shared memory
    • Memory accesses can be guaranteed to compile into actual reads/writes with the volatile qualifier
  • Fermi has a per-block L1 cache and a global L2 cache
    • On Fermi, volatile means "bypass the L1 cache"
  • Implicit global-memory barrier between dependent kernels (Volkov & Demmel, SC '08: synchronous kernel invocation 10–14 µs, asynchronous 3–7 µs):
      vec_minus<<<nblocks, blksize>>>(a, b, c);
      vec_dot<<<nblocks, blksize>>>(c, c);
  • No other synchronization instructions! Why? Let's pop up a level and talk about CUDA's goals.
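  To see why that implicit barrier matters, here is an illustrative pair of dependent kernels (my own example, not the speaker's): the second kernel reads neighbors that may have been written by a different block of the first, an ordering no intra-kernel primitive can provide.

      __global__ void square(float *buf, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) buf[i] = buf[i] * buf[i];
      }

      __global__ void blur3(const float *in, float *out, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          // in[i - 1] or in[i + 1] may have been written by a *different* block of
          // square(); only the implicit barrier between the two launches guarantees
          // those writes are visible here.
          if (i > 0 && i < n - 1)
              out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
      }

      // Host side: the second launch implicitly waits for the first to finish.
      // square<<<(n + 255) / 256, 256>>>(d_buf, n);
      // blur3<<<(n + 255) / 256, 256>>>(d_buf, d_out, n);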

  11. Big Ideas in the GPU Model
  1. One thread maps to one data element (lots of threads!)
  2. Write programs as if they run on one thread
  3. CPUs mitigate latency. GPUs hide latency by switching to another piece of work.
  4. Blocks within a kernel are independent

  12. Scaling the Architecture
  • Same program runs on both GPUs
  • Scalable performance!
  [Figure: two GPUs of different widths, each with a host interface, input assembler, thread execution manager, thread processors with parallel data caches, and load/store to global memory]

  13. Consequences of Independence
  • Any possible interleaving of blocks must be valid
    • Blocks presumed to run to completion without preemption
    • Can run in any order
    • Can run concurrently OR sequentially
  • Therefore, blocks may coordinate but not synchronize or communicate
    • Can't have a global barrier: blocks running to completion may block other blocks from launching
    • Can't ask block A to wait for block B to do something, or for B to send to A: A might launch before B

  14. Outline
  • Persistent threads
  • Persistent thread global barriers
  • Spin-locks for shared resources
  • Higher-order (and better) synchronization primitives
  • Hardware biases (permutation)
  • Work queues

  15. Tree-Based Parallel Reductions
  • Commonly done in traditional GPGPU
    • Ping-pong between render targets, reduce by 1/2 at a time
    • Completely bandwidth bound using the graphics API
    • Memory writes and reads are off-chip, no reuse of intermediate sums
  • CUDA solves this by exposing on-chip shared memory
    • Reduce blocks of data in shared memory to save bandwidth
  [Figure: tree reduction of 3 1 7 0 4 1 6 3 -> 4 7 5 9 -> 11 14 -> 25]
  (Slide credit: S05, High Performance Computing with CUDA)

  17. Traditional reductions
  • Ideal: n reads, 1 write.
  • Block size 256 threads. Thus:
    • Kernel 1: read n items, write back n/256 items.
    • Implicit synchronization between kernels, and possibly round-trip communication (400 µs) to the CPU to launch the second kernel.
    • Kernel 2: read n/256 items, write back 1 item. If too big for one block, recurse.
  • Or could sum using an atomic add, but we'll ignore that for the moment.
  A sketch of this two-kernel pattern follows below.
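  Here is that sketch (block size 256; the code is mine and follows the shared-memory approach from the earlier slides, not the speaker's exact implementation):

      __global__ void reduce_sum(const float *in, float *out, int n)
      {
          __shared__ float s[256];                // assumes blockDim.x == 256
          int t = threadIdx.x;
          int i = blockIdx.x * blockDim.x + t;
          s[t] = (i < n) ? in[i] : 0.0f;
          __syncthreads();

          // Tree reduction in shared memory: 256 -> 128 -> ... -> 1
          for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
              if (t < stride)
                  s[t] += s[t + stride];
              __syncthreads();
          }
          if (t == 0)
              out[blockIdx.x] = s[0];             // one partial sum per block
      }

      // Kernel 1 reduces n items to n/256 partial sums; kernel 2 (the same kernel,
      // launched again) reduces those. Recurse if the partial sums still exceed one block.
      // reduce_sum<<<(n + 255) / 256, 256>>>(d_in, d_partial, n);
      // reduce_sum<<<1, 256>>>(d_partial, d_result, (n + 255) / 256);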
