SLIDE 1

Synchronization on Manycore Machines

John Owens Associate Professor, Electrical and Computer Engineering University of California, Davis

SLIDE 2

Announcements

  • If anyone's going back to Boston near a T station immediately after the end of the conference on Friday, I'd love a ride. (Faster than the train alternative. I'm happy to get back ASAP.)
  • Stuff I'm not talking about but might be interesting to some of you:
  • Tridiagonal solvers
  • bzip2-style lossless compression
  • Heterogeneous multi-node global-illumination rendering (substitute your hard heterogeneous problem here)

SLIDE 3

GPU Programming Model

  • A kernel is executed as a grid of thread blocks
  • A thread block is a fixed-maximum-size (~512 threads) batch of threads that can cooperate with each other by:
  • Efficiently sharing data through shared memory
  • Synchronizing their execution
  • Two threads from two different blocks cannot cooperate
  • Blocks are independent

[Figure: the host launches Kernel 1 and Kernel 2; on the device, each kernel runs as a grid of thread blocks (e.g., Grid 1 with Blocks (0,0) through (2,1)), and each block is a 2D array of threads]
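
To make the model concrete, here is a minimal CUDA sketch (not from the talk; the kernel, sizes, and names are illustrative): one thread per element, intra-block cooperation through shared memory and __syncthreads(), and no communication between blocks.

    // Minimal illustrative kernel: one thread per element. Threads in a block
    // cooperate through shared memory and __syncthreads(); blocks never interact.
    const int BLOCK = 256;                             // threads per block (example value)

    __global__ void square(const float *in, float *out, int n)
    {
        __shared__ float tile[BLOCK];                  // visible to this block only
        int i = blockIdx.x * blockDim.x + threadIdx.x; // global element index
        if (i < n) tile[threadIdx.x] = in[i];
        __syncthreads();                               // barrier over the block, not the grid
        if (i < n) out[i] = tile[threadIdx.x] * tile[threadIdx.x];
    }

    int main()
    {
        int n = 1 << 20;
        float *in, *out;
        cudaMalloc(&in, n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));
        int grid = (n + BLOCK - 1) / BLOCK;            // enough blocks to cover n elements
        square<<<grid, BLOCK>>>(in, out, n);           // a kernel is a grid of blocks
        cudaDeviceSynchronize();
        cudaFree(in); cudaFree(out);
        return 0;
    }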

SLIDE 4
GPU Hardware, High Level

  • Hardware is responsible for assigning blocks to "SMs" ("streaming multiprocessors" or "cores"); think of them as virtual blocks.
  • Different GPUs have different numbers of SMs.

[Figure: the host and input assembler feed a thread execution manager, which distributes blocks across an array of SMs; each SM has its own thread processors and parallel data cache, and all SMs load from and store to global memory]

SLIDE 5

SM Multithreaded Multiprocessor

  • Each SM runs a block of threads
  • SM has 32 SP thread processors
  • Run as a "warp" in lockstep
  • 99 GFLOPS peak × 16 SMs at 1.544 GHz (1 MAD/clock/SP)
  • IEEE 754 32-bit floating point
  • Scalar ISA
  • Up to 768 threads, hardware multithreaded
  • 16 or 48 KB shared memory, 48 or 16 KB hardware-managed cache

[Figure: one SM, containing a multithreaded instruction unit (MT IU), 32 SPs, and shared memory]

SLIDE 6

Mapping SW to HW

[Figure: the software hierarchy (kernel → grid → blocks → threads) mapped onto the hardware hierarchy (GPU → SMs with shared memory → SPs)]

Exposed computational hierarchy

SLIDE 7

Synchronization Toolbox (1)

  • Within a thread block & within a warp:
  • In hardware, warps run synchronously
  • Hardware manages branch divergence (idle threads go to sleep)
  • The width of a warp is only vaguely exposed by the programming model
  • Different for different vendors (Intel: 16, NVIDIA: 32, AMD: 64)
  • Warps have __all, __any, __ballot hardware intra-warp functions

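
A hedged sketch of the warp-vote intrinsics in CUDA (the kernel and names are illustrative, not from the talk; newer CUDA versions spell these __all_sync/__any_sync/__ballot_sync with an explicit lane mask):

    // Each thread computes a predicate; the warp votes on it with no explicit
    // synchronization, because the warp's 32 lanes execute in lockstep.
    __global__ void vote_demo(const int *flags, int *warp_counts, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int p = (tid < n) && (flags[tid] > 0);        // per-thread predicate

        if (__all(p)) { /* every lane in this warp satisfies p */ }
        if (__any(p)) { /* at least one lane satisfies p       */ }
        unsigned ballot = __ballot(p);                // bit i = lane i's predicate

        if ((threadIdx.x & 31) == 0)                  // one lane per warp writes
            warp_counts[tid / 32] = __popc(ballot);   // lanes in this warp with p set
    }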

SLIDE 8

Synchronization Toolbox (2)

  • Within a thread block & across warps:
  • __syncthreads is a barrier for all threads within a block
  • No need to synchronize between threads within a warp
  • Newest NVIDIA GPUs add __syncthreads_count(p), __syncthreads_or(p), __syncthreads_and(p) for predicate p
  • __threadfence_block: all memory accesses are visible to all threads within the block
  • __threadfence: all memory accesses visible to all threads on the GPU
  • __threadfence_system: all memory accesses visible to threads on the GPU and also the CPU

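
A hedged sketch (not from the talk) of __syncthreads_count used as a block-wide barrier that also reduces a predicate, plus a __threadfence to publish the per-block result:

    __global__ void count_positive(const float *x, int *block_counts, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int p = (i < n) && (x[i] > 0.0f);         // per-thread predicate

        // Barrier across the whole block; every thread also receives the
        // number of threads in the block whose predicate was nonzero.
        int count = __syncthreads_count(p);

        if (threadIdx.x == 0) {
            block_counts[blockIdx.x] = count;     // one result per block
            __threadfence();                      // make the write visible GPU-wide
        }
    }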

SLIDE 9

Synchronization Toolbox (3)

  • Threads within a block can read/write shared memory
  • Best approximation of the shared-memory model is CREW: concurrent reads, exclusive write
  • Hardware makes no guarantees about who will win if there are concurrent writes
  • Memory accesses can be guaranteed to compile into actual reads/writes with the volatile qualifier
  • Atomics on shared memory: 32-bit and 64-bit ints; 32-bit float for exch and add
  • add, sub, exch, min, max, inc, dec, CAS, bitwise {and, or, xor}

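
A hedged example of shared-memory atomics (my own illustration, not from the slides): a per-block histogram accumulated on-chip with atomicAdd, then merged into global memory.

    #define BINS 64

    __global__ void histogram(const unsigned char *data, unsigned int *global_hist, int n)
    {
        __shared__ unsigned int local[BINS];
        for (int b = threadIdx.x; b < BINS; b += blockDim.x)
            local[b] = 0;                            // clear the block-private bins
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&local[data[i] % BINS], 1u);   // contended, but stays on-chip
        __syncthreads();

        for (int b = threadIdx.x; b < BINS; b += blockDim.x)
            atomicAdd(&global_hist[b], local[b]);    // one global atomic per bin per block
    }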

SLIDE 10

Synchronization Toolbox (4)

  • Threads within a block can read/write global memory
  • Same atomics as shared memory
  • Memory accesses can be guaranteed to compile into actual reads/writes with the volatile qualifier
  • Fermi has a per-SM L1 cache and a global L2 cache
  • On Fermi, volatile means "bypass the L1 cache"
  • Implicit global-memory barrier between dependent kernels:

    vec_minus<<<nblocks, blksize>>>(a, b, c);
    vec_dot<<<nblocks, blksize>>>(c, c);

  • No other synchronization instructions! Why? Let's pop up a level and talk about CUDA's goals.


Volkov & Demmel (SC '08): synchronous kernel invocation: 10–14 µs; asynchronous: 3–7 µs.
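
A hedged sketch of why the implicit barrier matters (kernel bodies are illustrative, not the talk's code): launches on the same stream execute in issue order, so the second kernel sees all of the first kernel's global-memory writes without any explicit synchronization.

    __global__ void vec_minus(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] - b[i];
    }

    __global__ void vec_dot(const float *c, float *result, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(result, c[i] * c[i]);   // simple (contended) global reduction
    }

    // Host side: the second launch is ordered after the first on the same stream,
    // so c is fully written before vec_dot reads it.
    //   vec_minus<<<nblocks, blksize>>>(a, b, c, n);
    //   vec_dot<<<nblocks, blksize>>>(c, result, n);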

SLIDE 11

Big Ideas in the GPU Model

  • 1. One thread maps to one data element (lots of threads!)
  • 2. Write programs as if they run on one thread
  • 3. CPUs mitigate latency. GPUs hide latency by switching to another piece of work.

  • 4. Blocks within a kernel are independent
SLIDE 12
Scaling the Architecture

  • Same program runs on both GPUs
  • Scalable performance!

[Figure: the same architecture at two scales: a GPU with many SMs and one with only a few, each SM with its parallel data cache and thread processors, fed by a thread execution manager and input assembler and backed by global memory]

SLIDE 13

Consequences of Independence

  • Any possible interleaving of blocks must be valid
  • Blocks are presumed to run to completion without preemption
  • Can run in any order
  • Can run concurrently OR sequentially
  • Therefore, blocks may coordinate but not synchronize or communicate
  • Can't have a global barrier: blocks running to completion may block other blocks from launching
  • Can't ask block A to wait for block B to do something, or for B to send to A: A might launch before B

SLIDE 14

Outline

  • Persistent threads
  • Persistent thread global barriers
  • Spin-locks for shared resources
  • Higher-order (and better) synchronization primitives
  • Hardware biases (permutation)
  • Work queues
SLIDE 15


Tree-Based Parallel Reductions

  • Commonly done in traditional GPGPU
  • Ping-pong between render targets, reduce by 1/2 at a time
  • Completely bandwidth bound using the graphics API
  • Memory writes and reads are off-chip, no reuse of intermediate sums
  • CUDA solves this by exposing on-chip shared memory
  • Reduce blocks of data in shared memory to save bandwidth

[Figure: tree-based reduction: pairwise partial sums 4 7 5 9, then 11 14, then the final sum 25]


SLIDE 17

Traditional reductions

  • Ideal: n reads, 1 write.
  • Block size 256 threads. Thus:
  • Read n items, write back n/256 items. (Kernel 1)
  • Implicit synchronization between kernels, and possibly round-trip communication (400 µs) to the CPU to launch the second kernel.
  • Read n/256 items, write back 1 item. If too big for one block, recurse. (Kernel 2)
  • Or we could sum using an atomic add, but we'll ignore that for the moment.
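
A hedged sketch of the two-kernel reduction described above (256-thread blocks; kernel names and details are illustrative, not the talk's code):

    #define BLOCK 256

    __global__ void reduce_pass(const float *in, float *out, int n)
    {
        __shared__ float s[BLOCK];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        s[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // Tree reduction in shared memory: 256 values -> 1 partial sum per block.
        for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                s[threadIdx.x] += s[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            out[blockIdx.x] = s[0];            // n inputs -> n/256 partial sums
    }

    // Host side: launch once over n items, then again over the n/256 partials
    // (recursing if the partials still don't fit in one block). The implicit
    // barrier between dependent kernel launches provides the synchronization.
    //   reduce_pass<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_partial, n);
    //   reduce_pass<<<1, BLOCK>>>(d_partial, d_sum, (n + BLOCK - 1) / BLOCK);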

SLIDE 18

Persistent Threads

  • The GPU programming model suggests one thread per item
  • What if you filled the machine with just enough blocks to keep all processors busy, then asked each thread to stay alive until the input was complete?
  • More like a traditional CPU program
  • Essentially replaces the hardware scheduler with software
  • Reduction example: intermediate results are now O(number of SMs) rather than O(input size)

[Figure (Duane Merrill): a small number of persistent threads, Thread 0 through Thread 4, each looping over many input elements m0 through m15 and folding them in with ⊕]
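
A hedged sketch of a persistent-thread reduction: launch only enough blocks to fill the machine and let each thread loop over many elements with a grid-stride loop. Names and the atomicAdd finish are my own choices, not the talk's code.

    #define BLOCK 256

    __global__ void reduce_persistent(const float *in, float *result, int n)
    {
        __shared__ float s[BLOCK];
        float acc = 0.0f;

        // Each "persistent" thread serially accumulates many items.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += gridDim.x * blockDim.x)
            acc += in[i];

        // One tree reduction per block over the per-thread accumulators.
        s[threadIdx.x] = acc;
        __syncthreads();
        for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                s[threadIdx.x] += s[threadIdx.x + stride];
            __syncthreads();
        }

        // Intermediate results are O(number of blocks), not O(n).
        if (threadIdx.x == 0)
            atomicAdd(result, s[0]);
    }

    // Host: pick a grid size around a small multiple of the number of SMs, e.g.
    //   reduce_persistent<<<numSMs * 2, BLOCK>>>(d_in, d_result, n);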

SLIDE 19

Persistent Threads

  • Use cases:
  • Avoid CPU-GPU round-trip synchronization
  • Load-balancing: PT can use a software queue to (re)distribute irregularly-{produced,consumed} work
  • Producer-consumer locality within a kernel
  • Cheaper global synchronization (next slide)
  • Minus: more overhead per thread (register pressure)
  • Minus: violent anger of vendors

Recent work in our group. In submission.

[joint work with Kshitij Gupta and Jeff Stuart]

SLIDES 20–29

GPU Lock-Based Synchronization

[Figure, built up over several slides: a global counter g_mutex starts at 0. Each of blocks #1 through #N executes atomicAdd(g_mutex, 1) and then spins, testing g_mutex == G (the goal value, here the number of blocks). When every block has added its 1 and g_mutex reaches N, all blocks pass the barrier together.]

  • S. Xiao and W. Feng. Inter-block GPU communication via fast barrier synchronization. In IPDPS, Apr. 2010.
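
A hedged CUDA sketch of this lock-based barrier (my rendering of the mechanism in the figure, not Xiao & Feng's published code):

    // Assumes a persistent-thread launch in which all goalVal blocks are
    // resident at once; otherwise the spin loop can deadlock.
    __device__ volatile int g_mutex = 0;

    __device__ void gpu_sync_lock_based(int goalVal)
    {
        if (threadIdx.x == 0) {
            atomicAdd((int *)&g_mutex, 1);               // announce this block's arrival
            while (g_mutex != goalVal) { /* spin */ }    // wait until all blocks arrive
        }
        __syncthreads();                                 // rest of the block waits on thread 0
    }

    // For repeated barriers, either reset g_mutex between uses or raise goalVal
    // by the number of blocks on each call.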

SLIDES 30–43

GPU Lock-Free Synchronization

  • Implementation

[Figure, built up over several slides: two arrays Ain and Aout, one entry per block. Each block #i sets Ain[i] = 1. Threads #1 through #N of block #1 each watch one entry, spinning until Ain[i] == 1; after a barrier synchronization within block #1, each of those threads sets Aout[i] = 1. Every block #i then spins until Aout[i] == 1, at which point the inter-block barrier is complete.]

Note: the goal value is 1 for the first time, and then increased by 1 each time __gpu_sync() is called.

Volkov & Demmel (SC '08): synchronous kernel invocation: 10–14 µs; asynchronous: 3–7 µs. This method: 1.3–2 µs.
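
A hedged CUDA sketch of the lock-free barrier (again my rendering, not the published code; the choice of block 0 as the checker and the requirement blockDim.x >= gridDim.x are assumptions of this sketch):

    __device__ void gpu_sync_lock_free(int goalVal,
                                       volatile int *Ain, volatile int *Aout)
    {
        int bid = blockIdx.x;
        int nblocks = gridDim.x;

        // Each block announces its arrival; no atomics needed, one writer per slot.
        if (threadIdx.x == 0)
            Ain[bid] = goalVal;

        // Block 0: thread i waits for block i, then releases block i.
        if (bid == 0) {
            if (threadIdx.x < nblocks) {
                while (Ain[threadIdx.x] != goalVal) { /* spin */ }
            }
            __syncthreads();                       // barrier within block 0
            if (threadIdx.x < nblocks)
                Aout[threadIdx.x] = goalVal;
        }

        // Every block waits to be released.
        if (threadIdx.x == 0) {
            while (Aout[bid] != goalVal) { /* spin */ }
        }
        __syncthreads();
    }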

SLIDE 44

Spin Lock

    function CPU: CreateSpinLock
        X ← AllocateGPUWord()
        *X ← 0
        return X

    function GPU: SpinLock(Lock)
        Locked ← false
        while Locked = false do
            OldVal ← atomicExch(Lock, 1)
            if OldVal = 0 then
                Locked ← true
            end if
        end while

    function GPU: SpinUnlock(Lock)
        atomicExch(Lock, 0)

  • Lock stores 0 if unlocked, 1 if locked
  • To lock, swap 1 with the lock
  • Succeeded if we get a 0 back
  • Otherwise try again
  • To unlock, swap 0 with the lock
  • More predictable than volatile-write + threadfence
  • Bad: high atomic contention
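
A hedged CUDA rendering of the spin lock above (one possible realization, not necessarily the talk's exact code):

    __device__ void spin_lock(int *lock)
    {
        // atomicExch returns the old value: 0 means we took the lock.
        while (atomicExch(lock, 1) != 0) { /* busy-wait */ }
    }

    __device__ void spin_unlock(int *lock)
    {
        atomicExch(lock, 0);           // release the lock
    }

    // Caution (my note, not the slides'): under heavy contention this serializes
    // badly, and intra-warp use needs care because a warp's threads run in lockstep.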
SLIDE 45
Lots 'o Stats

  • Important parameters for synchronization design:
  • Atomic:volatile ratio, especially under contention. Are spin locks viable?
  • Contentious:noncontentious ratio. Do sleeping algorithms make sense?
  • Atomic capture: does an atomic hold a cache line hostage?

                                                  Tesla Reads   Tesla Writes   Fermi Reads   Fermi Writes
    Volatiles                                         1.44×          3.67×        11.49×         6.03×
    Atomics                                          92.79×         79.12×         3.38×         4.71×
    Volatiles preceded by Atomic                      1.54×          4.01×        11.78×        16.48×

                                                  Tesla Reads   Tesla Writes   Fermi Reads   Fermi Writes
    Contentious Atomics                              92.46×         94.57×         2.99×         8.40×
    Noncontentious Atomics                            1.43×          4.38×        10.16×        10.76×
    Contentious Volatile preceded by Atomic           1.08×          1.10×         2.98×         4.71×
    Noncontentious Volatile preceded by Atomic        1.02×          1.01×         2.91×         1.72×

                                                  Tesla Reads (ms)   Tesla Writes (ms)   Fermi Reads (ms)   Fermi Writes (ms)
    Contentious Volatile                                 0.848              0.829              0.494              0.175
    Noncontentious Volatile                              0.590              0.226              0.043              0.029
    Contentious Atomic                                  78.407             78.404              1.479              1.470
    Noncontentious Atomic                                0.845              0.991              0.437              0.312
    Contentious Volatile preceded by Atomic              0.923              0.915              1.473              0.824
    Noncontentious Volatile preceded by Atomic           0.601              0.228              0.125              0.050

Recent work in our group. Unpublished; on arXiv. Looking for a venue.

SLIDE 46

Synchronization Primitive Design

  • Evaluated designs for barrier, mutex, semaphore
  • General strategies:
  • Minimize atomics
  • Avoid contentious atomics
  • Sleeping is often a win

    function GPU: SpinMutexLock(Mutex)
        Acquired ← false
        while Acquired = false do
            OldVal ← atomicExch(Mutex, 1)
            if OldVal = 0 then
                Acquired ← true
            else if Acquired = false ∧ UseBackoff = true then
                Backoff()
            end if
        end while

    function GPU: FAMutexLock(Mutex)
        TicketNumber ← atomicInc(Mutex.ticket)
        while TicketNumber ≠ Mutex.turn do
            Backoff()
        end while

this design also ensures fairness: service in order of arrival

[joint work with Jeff Stuart]
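
A hedged CUDA sketch of the fetch-and-add (ticket) mutex above: one atomic takes a ticket, then the thread waits for its turn. The struct layout and the omitted backoff policy are assumptions of this sketch.

    struct TicketMutex {
        unsigned int ticket;   // next ticket to hand out
        unsigned int turn;     // ticket currently being served
    };

    __device__ void fa_mutex_lock(TicketMutex *m)
    {
        unsigned int my_ticket = atomicAdd(&m->ticket, 1u);   // one atomic per acquire
        while (*(volatile unsigned int *)&m->turn != my_ticket) {
            // spin; a real implementation would insert Backoff() here
        }
    }

    __device__ void fa_mutex_unlock(TicketMutex *m)
    {
        __threadfence();              // publish the critical section's writes
        atomicAdd(&m->turn, 1u);      // admit the next ticket holder
    }

    // Fair by construction: lockers are served in order of arrival.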

SLIDE 47

Knuth’s Algorithm for shuffling

    Algorithm 1 Parallel version of Knuth's algorithm
    procedure KnuthPermuteParallel(int a[])
        for i = 1 to n do {in parallel}
            j = rand(n - i) + i
            lock(a[i]); lock(a[j])
            swap(a[j], a[i])
            unlock(a[i]); unlock(a[j])
        end for

  • For each item i (left to right), swap that item with a randomly chosen item j where j ≥ i
  • How do you tell if it worked?

Recent work in our group. Unpublished. Written in an unconventional style. Looking for a venue.

[joint work with Andrew Davidson and Anjul Patney]

SLIDE 48

Original implementation

[Figure: results for the original implementation (values shown: 509, 15508)]

SLIDE 49

Reindex warps

[Figure: results after reindexing warps (values shown: 604, 7488)]

SLIDE 50

Reindex warps

[Figure: reindexing changes which cards (C = card #) the threads of a warp handle, so a single warp no longer touches consecutive card indices; values shown: 604, 7488]

SLIDE 51

Random waits on warps

[Figure: results with random waits on warps (values shown: 1672, 2190)]

SLIDE 52

Random swap left/right

[Figure: results with random swap left/right (values shown: 1806, 2130)]

SLIDE 53

Recursive Subdivision is Irregular

[Figure: recursive subdivision of patches to different levels (patches labeled 1 or 2), producing an irregular workload]

Patney, Ebeida, and Owens. "Parallel View-Dependent Tessellation of Catmull-Clark Subdivision Surfaces". HPG '09.

SLIDE 54

SLIDE 55
SLIDE 56

Static Task List

[Figure: static task list: each SM consumes tasks from a fixed input array and appends new tasks to an output array through an atomically incremented pointer; when the input is exhausted, the kernel is restarted with the output as the new input]

  • Daniel Cederman and Philippas Tsigas. On Dynamic Load Balancing on Graphics Processors. Graphics Hardware 2008, June 2008.
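
A hedged sketch of a static task list in this style (loosely after Cederman & Tsigas; the task type and splitting logic are placeholders, not their code): blocks consume tasks from a fixed input array and append newly generated tasks to an output array via an atomically bumped pointer, and the host restarts the kernel with input and output swapped.

    struct Task { int data; };

    __global__ void process_static_list(const Task *in, int n_in,
                                        Task *out, int *out_count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_in) return;

        Task t = in[i];
        if (t.data > 1) {                          // placeholder "needs more work" test
            int slot = atomicAdd(out_count, 2);    // reserve two output slots
            out[slot].data     = t.data / 2;
            out[slot + 1].data = t.data - t.data / 2;
        }
    }

    // Host loop (sketch): while the produced count is nonzero, swap in/out and
    // relaunch; the kernel restart is the global synchronization point.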

SLIDE 57

Private Work Queue Approach

  • Allocate a private work queue of tasks per core
  • Each core can add work to or remove work from its local queue
  • Cores mark themselves idle if {queue exhausts storage, queue is empty}
  • Cores periodically check a global idle counter
  • If the global idle counter reaches a threshold, rebalance work

  • gProximity: Fast Hierarchy Operations on GPU Architectures. Lauterbach, Mo, and Manocha. EG '10.

SLIDE 58

Work Stealing & Donating

[Figure: one I/O deque per SM, each guarded by a lock; SMs take work from their own deque and can steal from or donate to others]

  • Cederman and Tsigas: stealing == best performance and scalability (follows Arora et al.'s CPU-based work)
  • We showed how to do this with multiple kernels in an uberkernel and a persistent-thread programming style
  • We added donating to minimize memory usage

Stanley Tzeng, Anjul Patney, and John D. Owens. Task Management for Irregular-Parallel Workloads on the GPU. HPG ’10.
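
A hedged, highly simplified sketch of per-SM work queues with stealing in a persistent-thread style (details only loosely follow Cederman & Tsigas and Tzeng et al.; all names, the lock scheme, and the termination test are illustrative):

    struct Queue {
        int *tasks;      // storage for task ids
        int  head;       // next task to pop
        int  tail;       // one past the last task
        int  lock;       // 0 = free, 1 = held
    };

    __device__ bool try_pop(Queue *q, int *task)
    {
        bool got = false;
        if (atomicExch(&q->lock, 1) == 0) {          // try to take this queue's lock
            if (q->head < q->tail) { *task = q->tasks[q->head++]; got = true; }
            atomicExch(&q->lock, 0);                 // release
        }
        return got;
    }

    __global__ void uberkernel(Queue *queues, int nqueues)
    {
        __shared__ int task;
        __shared__ int have;
        while (true) {
            if (threadIdx.x == 0) {
                have = try_pop(&queues[blockIdx.x], &task);      // local queue first
                for (int v = 1; !have && v < nqueues; ++v)       // then try to steal
                    have = try_pop(&queues[(blockIdx.x + v) % nqueues], &task);
            }
            __syncthreads();
            if (!have) break;                // nothing found anywhere (simplified exit)
            // ... switch on the task type and run the matching pipeline stage ...
            __syncthreads();
        }
    }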

SLIDE 59

Ingredients for Our Scheme

Implementation questions that we need to address:

  • What is the proper granularity for tasks?
  • How many threads to launch?
  • How to avoid global synchronizations?
  • How to distribute tasks evenly?

Ingredients: warp-size work granularity, uberkernels, persistent threads, task donation.

SLIDE 60

The Programmable Pipeline

[Figure: three pipelines assembled from programmable stages: a graphics pipeline (Input Assembly, Tess. Shading, Vertex Shading, Geom. Shading, Raster, Frag. Shading, Compose), a Reyes-style pipeline (Split, Dice, Shading, Sampling, Composition), and a ray tracer (Ray Generation, Shading, Ray Traversal, Ray-Primitive Intersection)]

SLIDE 61

The Programmable Pipeline

[Figure: the same three pipelines as the previous slide]

Bricks & mortar: how do we allow programmers to build pipeline stages without worrying about assembling them together?

SLIDE 62

Thanks to ...

  • The organizers for the invitation
  • Wu Feng, Jeff Stuart, Duane Merrill, Anjul Patney, and Stanley Tzeng for helpful comments and slide material
  • Funding agencies: Department of Energy (SciDAC Institute for Ultrascale Visualization, Early Career Principal Investigator Award), NSF, Intel Science and Technology Center for Visual Computing, LANL, BMW, NVIDIA, HP, UC MICRO, Microsoft, ChevronTexaco, Rambus

SLIDE 63

“If you were plowing a field, which would you rather use? Two strong buffalo or 1024 chickens?”

—Seymour Cray