SLIDE 1

Prefix sums on GPUs

Bruce Merry

Department of Computer Science, University of Cape Town

GPGPU2 Workshop 2014

Sections: Definition and Applications (Motivating Problem, Definitions, Other Applications); Parallel Algorithms (Kogge-Stone, Brent-Kung); GPU Strategies (Reduce-then-Scan, Two-Level Prefix Sum); Summary

SLIDE 2

Outline

1. Definition and Applications: Motivating Problem, Definitions, Other Applications
2. Parallel Algorithms: Kogge-Stone, Brent-Kung
3. GPU Strategies: Reduce-then-Scan, Two-Level Prefix Sum


SLIDE 4

Problem Statement

For every object in a set, output a list of the other objects that differ from it by less than some amount. This is deliberately vague: it could apply to n-body simulation, clustering, or scattered data interpolation.


SLIDE 6

Output Format

The lists should be packed together contiguously.

[Figure: a single output array with the per-object lists packed back to back: A0 A1 A2 | B0 B1 | D0 | E0 E1]

Assuming one workitem per object, how do the workitems know where to start?

SLIDE 11

Solution

This can be solved with a multi-pass approach:

1. Every workitem counts how many records it will emit, and writes this number to a buffer.
2. The buffer of counts is processed to determine the start position for each object, and these positions are written to a buffer.
3. Each workitem reads its start position from this buffer, and emits its records in the right place.
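The three passes can be sketched sequentially in Python. This is only an illustration under assumptions that are not in the slides: objects are plain numbers, "differ by less than some amount" means an absolute difference below `radius`, and the names `neighbours` and `pack_neighbour_lists` are hypothetical. Each loop iteration stands in for one workitem.

```python
from itertools import accumulate

def neighbours(values, i, radius):
    """Indices of the other objects whose value differs from values[i] by less than radius."""
    return [j for j, v in enumerate(values) if j != i and abs(v - values[i]) < radius]

def pack_neighbour_lists(values, radius):
    n = len(values)
    # Pass 1: each "workitem" counts how many records it will emit.
    counts = [len(neighbours(values, i, radius)) for i in range(n)]
    # Pass 2: an exclusive prefix sum of the counts gives each start position.
    starts = list(accumulate(counts, initial=0))[:-1]
    # Pass 3: each "workitem" writes its records at its own start position.
    out = [None] * sum(counts)
    for i in range(n):
        for k, j in enumerate(neighbours(values, i, radius)):
            out[starts[i] + k] = j
    return counts, starts, out

counts, starts, out = pack_neighbour_lists([1.0, 1.5, 9.0, 1.2, 5.0], radius=1.0)
print(counts)  # [2, 2, 0, 2, 0]
print(starts)  # [0, 2, 4, 4, 6]
print(out)     # [1, 3, 0, 3, 0, 1]
```

Objects with no neighbours simply get a start position equal to the next object's, so they occupy no space in the packed output.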


SLIDE 15

Exclusive Prefix Sum

Given an operator $\oplus$ and an identity element $I$, the exclusive prefix sum of $(a_0, a_1, \ldots, a_{n-1})$ is

$$(I,\; a_0,\; a_0 \oplus a_1,\; a_0 \oplus a_1 \oplus a_2,\; \ldots,\; a_0 \oplus \cdots \oplus a_{n-2}) = \left( \bigoplus_{j=0}^{i-1} a_j \right)$$

In other words, element $i$ is the sum of all elements strictly before $i$.

Example (with $+$, so $I = 0$):

Input:    4   3   7   9   2   3
Output:   0   4   7  14  23  25
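As a sequential reference, the definition can be written as a short Python function. This is a sketch, not a GPU implementation; the name `exclusive_scan` and its parameters are chosen here for illustration.

```python
import operator

def exclusive_scan(values, op=operator.add, identity=0):
    """Element i of the result combines all inputs strictly before position i."""
    result = []
    running = identity
    for v in values:
        result.append(running)    # emit the total *before* including v
        running = op(running, v)
    return result

print(exclusive_scan([4, 3, 7, 9, 2, 3]))  # [0, 4, 7, 14, 23, 25]
```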


SLIDE 17

Inclusive Prefix Sum

Given an operator $\oplus$ and an identity element $I$, the inclusive prefix sum of $(a_0, a_1, \ldots, a_{n-1})$ is

$$(a_0,\; a_0 \oplus a_1,\; a_0 \oplus a_1 \oplus a_2,\; \ldots,\; a_0 \oplus \cdots \oplus a_{n-1}) = \left( \bigoplus_{j=0}^{i} a_j \right)$$

In other words, element $i$ is the sum of all elements before and including $i$.

Example (with $+$):

Input:    4   3   7   9   2   3
Output:   4   7  14  23  25  28
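A sequential Python sketch of the inclusive form, using the standard library's `itertools.accumulate`. It also shows one way to derive the exclusive result from the inclusive one (shift right and put the identity in front), and that the operator need not be addition. The function names are illustrative.

```python
from itertools import accumulate
import operator

def inclusive_scan(values, op=operator.add):
    """Element i of the result combines all inputs up to and including position i."""
    return list(accumulate(values, op))

def exclusive_from_inclusive(values, op=operator.add, identity=0):
    """Shift the inclusive scan right by one and put the identity in front."""
    return ([identity] + inclusive_scan(values, op))[:len(values)]

data = [4, 3, 7, 9, 2, 3]
print(inclusive_scan(data))            # [4, 7, 14, 23, 25, 28]
print(exclusive_from_inclusive(data))  # [0, 4, 7, 14, 23, 25]
print(inclusive_scan(data, max))       # [4, 4, 7, 9, 9, 9]
```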


SLIDE 20

Other Applications

  • Compaction: select all objects that satisfy a predicate
  • Partitioning: rearrange the objects so that those satisfying a predicate come before the others
  • Sorting: radix sort is just repeated partitioning
  • Visibility: an object is visible if it is not preceded by a taller one (using the max operator instead of +)
  • Meshing: each cell produces a variable number of triangles
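Compaction is perhaps the simplest of these to sketch. In the Python illustration below (sequential, with the hypothetical name `compact`), each item gets a 0/1 flag, and the exclusive prefix sum of the flags gives the output slot of every kept item.

```python
from itertools import accumulate

def compact(items, predicate):
    """Keep the items that satisfy predicate, preserving their order."""
    flags = [1 if predicate(x) else 0 for x in items]
    # Exclusive prefix sum of the flags: the output slot for each kept item.
    slots = list(accumulate(flags, initial=0))[:-1]
    out = [None] * sum(flags)
    for x, keep, slot in zip(items, flags, slots):
        if keep:
            out[slot] = x   # each kept item has a unique slot, so writes never collide
    return out

print(compact([6, 9, 2, 3, 6, 1, 6, 3], lambda x: x % 2 == 0))  # [6, 2, 6, 6]
```

Because every kept item writes to a distinct slot, the final loop could run in any order, which is what makes the pattern suitable for one workitem per item.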


SLIDE 25

Atomics

Atomics offer an alternative way to allocate unique memory per work-item, but they:

  • Suffer heavy contention, which is slow (but getting better all the time)
  • Do not preserve the original ordering
  • Do not give reproducible ordering

Atomics have the advantage of allowing for single-pass algorithms.


SLIDE 27

Idea (Kogge-Stone)

Let $s_i^t$ be the sum of the (up to) $t$ inputs ending with $a_i$. Then

$$s_i^{2t} = s_{i-t}^t \oplus s_i^t.$$

We start with $(s_i^1) = (a_i)$, then compute $(s_i^2)$, $(s_i^4)$, $(s_i^8)$ and so on, up to $(s_i^N)$, in $O(\log_2 N)$ iterations, to give an inclusive prefix sum.

SLIDE 28

Example (Kogge-Stone)

$s^1$:   6   9   2   3   6   1   6   3
$s^2$:   6  15  11   5   9   7   7   9
$s^4$:   6  15  17  20  20  12  16  16
$s^8$:   6  15  17  20  26  27  33  36

SLIDE 29

Pseudo-code (Kogge-Stone)

foreach power-of-two t from 1 to N do
    for i ← t to N − 1 do in parallel
        a[i] ← a[i − t] ⊕ a[i];
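The pseudo-code can be simulated sequentially in Python. In this sketch (the name `kogge_stone_scan` is illustrative), the "in parallel" step is modelled by building the whole next array from the old one before replacing it, so no element sees a value written during the same step.

```python
import operator

def kogge_stone_scan(values, op=operator.add):
    """Inclusive scan in about log2(N) data-parallel steps (simulated sequentially)."""
    a = list(values)
    n = len(a)
    t = 1
    while t < n:
        # "In parallel": every new a[i] is computed from the *old* array,
        # so build the next array before replacing the current one.
        a = [op(a[i - t], a[i]) if i >= t else a[i] for i in range(n)]
        t *= 2
    return a

print(kogge_stone_scan([6, 9, 2, 3, 6, 1, 6, 3]))
# [6, 15, 17, 20, 26, 27, 33, 36]
```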

SLIDE 30

Work-item Pseudo-code (Kogge-Stone)

i ← workitem ID;
foreach power-of-two t from 1 to N do
    x ← a[i];
    if t ≤ i then
        x ← a[i − t] ⊕ x;
    barrier();
    a[i] ← x;
    barrier();
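To make the role of the two barriers concrete, here is a toy Python sketch that runs one thread per work-item and uses `threading.Barrier` in place of `barrier()`. This is purely illustrative: Python threads are not GPU work-items, addition is assumed as the operator, and the function name is made up for this example.

```python
import threading

def kogge_stone_workitems(values):
    """Run one Python thread per work-item, with threading.Barrier as barrier()."""
    a = list(values)
    n = len(a)
    if n == 0:
        return a
    barrier = threading.Barrier(n)

    def workitem(i):
        t = 1
        while t < n:
            x = a[i]
            if t <= i:
                x = a[i - t] + x
            barrier.wait()   # everyone has read the old values
            a[i] = x
            barrier.wait()   # everyone has written the new values
            t *= 2

    threads = [threading.Thread(target=workitem, args=(i,)) for i in range(n)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return a

print(kogge_stone_workitems([6, 9, 2, 3, 6, 1, 6, 3]))
# [6, 15, 17, 20, 26, 27, 33, 36]
```

Without the first barrier a work-item could overwrite `a[i]` before a neighbour has read the old value; without the second, a neighbour could start the next iteration and read a stale value.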

SLIDE 31

Optimizations

  • The working register x can be reused between loop iterations without reloading.
  • The if statement can be eliminated by padding at the front with zeros.
  • Shared memory can be used to reduce global memory accesses.
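The padding optimization can be sketched in Python as follows. The assumptions here are mine: addition is the operator, the padding holds the identity (0), and `n` padding elements are used, which is more than strictly needed but keeps the indexing simple. Reads that would fall off the front of the array land in the padding instead, so no bounds check is required.

```python
def kogge_stone_padded(values, identity=0):
    """Kogge-Stone scan for addition where the bounds check is replaced by front padding."""
    n = len(values)
    pad = n                      # enough identity elements to cover every offset t < n
    a = [identity] * pad + list(values)
    t = 1
    while t < n:
        # No "if t <= i": reads that fall off the front land in the padding.
        a = a[:pad] + [a[pad + i - t] + a[pad + i] for i in range(n)]
        t *= 2
    return a[pad:]

print(kogge_stone_padded([6, 9, 2, 3, 6, 1, 6, 3]))
# [6, 15, 17, 20, 26, 27, 33, 36]
```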


SLIDE 34

Properties (Kogge-Stone)

  • It is work-inefficient: it performs $O(N \log N)$ operations in total
  • About $2 \log_2 N$ barriers
  • About $N \log N$ reads and $N \log N$ writes
  • Memory access pattern is good: sequential accesses
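The operation count can be checked against the pseudo-code. In the step with offset $t$, only the work-items with $i \ge t$ apply $\oplus$, which is $N - t$ operations. Summing over $t = 1, 2, 4, \ldots, N/2$, and assuming for simplicity that $N$ is a power of two:

```latex
\sum_{k=0}^{\log_2 N - 1} \left( N - 2^k \right) = N \log_2 N - (N - 1)
```

For $N = 8$ this gives $7 + 6 + 4 = 17$ operations, compared with the $N - 1 = 7$ that a sequential scan needs.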


SLIDE 39

Idea (Brent-Kung)

For an exclusive scan:

  • Add pairs of adjacent elements: $p_i = a_{2i} \oplus a_{2i+1}$
  • Recursively scan these sums: $q_i = \bigoplus_{j=0}^{i-1} p_j = \bigoplus_{j=0}^{2i-1} a_j$
  • Use these sums to compute the result: $s_{2i} = q_i$, $s_{2i+1} = q_i \oplus a_{2i}$
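The three steps translate almost directly into a recursive Python function. This is a sequential sketch with an illustrative name, and it assumes the input length is a power of two (at least 1).

```python
import operator

def pairwise_exclusive_scan(a, op=operator.add, identity=0):
    """Exclusive scan by pairing, recursing on the pair sums, then expanding.

    Assumes len(a) is a power of two (an assumption of this sketch).
    """
    n = len(a)
    if n == 1:
        return [identity]
    # p[i] = a[2i] (+) a[2i+1]
    p = [op(a[2 * i], a[2 * i + 1]) for i in range(n // 2)]
    # q[i] = combination of everything before a[2i]
    q = pairwise_exclusive_scan(p, op, identity)
    s = [identity] * n
    for i in range(n // 2):
        s[2 * i] = q[i]
        s[2 * i + 1] = op(q[i], a[2 * i])
    return s

print(pairwise_exclusive_scan([6, 9, 2, 3, 6, 1, 6, 3]))
# [0, 6, 15, 17, 20, 26, 27, 33]
```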

SLIDE 40

Example (Brent-Kung)

[Figure: a binary tree built up over several slides. Reading it level by level:]

Building the tree of pair sums, from the inputs up to the root:

Inputs:      6   9   2   3   6   1   6   3
Pair sums:  15   5   7   9
            20  16
            36

Replacing the root with the identity (0) and expanding back down:

             0
             0  20
             0  15  20  27
Result:      0   6  15  17  20  26  27  33

SLIDE 48

Memory Arrangement

In-place: Each sum replaces the second element of the pair being summed. No extra memory, but has bad bank conflicts.

Out-of-place: Each level of the tree is stored contiguously. Requires double the memory, but conflicts are only 2-way.
SLIDE 50

Pseudo-code (Brent-Kung)

Out-of-place exclusive sum, for N = 2^n:

Copy a[i] to b[i + N] for i ∈ [0, N);
for t ← n − 1 downto 1 do
    for i ← 0 to 2^t − 1 do in parallel
        b[2^t + i] ← b[2^(t+1) + 2i] ⊕ b[2^(t+1) + 2i + 1];
// Exclusive prefix sum of two elements
b[3] ← b[2];
b[2] ← I;
for t ← 1 to n − 1 do
    for i ← 0 to 2^t − 1 do in parallel
        b[2^(t+1) + 2i + 1] ← b[2^t + i] ⊕ b[2^(t+1) + 2i];
        b[2^(t+1) + 2i] ← b[2^t + i];
Copy b[i + N] to s[i] for i ∈ [0, N);
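A sequential Python sketch of the out-of-place version follows. It assumes the tree is stored in heap order, so that level t occupies b[2^t .. 2^(t+1) − 1] and node 2^t + i has children 2^(t+1) + 2i and 2^(t+1) + 2i + 1. The function name is illustrative, and the sketch requires a power-of-two length of at least 2.

```python
import operator

def brent_kung_exclusive(a, op=operator.add, identity=0):
    """Out-of-place exclusive scan on a heap-ordered tree; len(a) must be 2**n with n >= 1."""
    N = len(a)
    n = N.bit_length() - 1
    assert N >= 2 and N == 1 << n, "this sketch assumes a power-of-two length of at least 2"
    b = [identity] * (2 * N)
    b[N:] = a                                  # leaves live in b[N .. 2N-1]
    # Upward pass: each node becomes the sum of its two children.
    for t in range(n - 1, 0, -1):
        for i in range(1 << t):
            b[(1 << t) + i] = op(b[(2 << t) + 2 * i], b[(2 << t) + 2 * i + 1])
    # Exclusive prefix sum of the two level-1 nodes.
    b[3] = b[2]
    b[2] = identity
    # Downward pass: the right child gets parent (+) old left child, the left child gets parent.
    for t in range(1, n):
        for i in range(1 << t):
            parent = b[(1 << t) + i]
            left = (2 << t) + 2 * i
            b[left + 1] = op(parent, b[left])
            b[left] = parent
    return b[N:]

print(brent_kung_exclusive([6, 9, 2, 3, 6, 1, 6, 3]))
# [0, 6, 15, 17, 20, 26, 27, 33]
```

The order of the two assignments in the downward pass matters: the right child must be computed while the left child still holds its subtree sum.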

SLIDE 51

Per-workitem Pseudo-code (Brent-Kung)

Uses N/2 work-items:

i ← work-item ID;
b[i + N] ← a[i];
b[i + N + N/2] ← a[i + N/2];
barrier();
for t ← n − 1 downto 1 do
    if i < 2^t then
        b[2^t + i] ← b[2^(t+1) + 2i] ⊕ b[2^(t+1) + 2i + 1];
    barrier();
if i = 0 then
    b[3] ← b[2];
    b[2] ← I;
barrier();
for t ← 1 to n − 1 do
    if i < 2^t then
        b[2^(t+1) + 2i + 1] ← b[2^t + i] ⊕ b[2^(t+1) + 2i];
        b[2^(t+1) + 2i] ← b[2^t + i];
    barrier();
s[i] ← b[i + N];
s[i + N/2] ← b[i + N + N/2];

SLIDE 52

Properties (Brent-Kung)

  • Work-efficient: $O(N)$ addition operations
  • Still requires about $2 \log_2 N$ barriers
  • Requires about 4N reads and 3N writes
  • Only N/2 work-items required
  • Has branching, but it is coherent


SLIDE 57

Motivation

Applying one of these at a larger (multi-workgroup) scale has issues:

  • Synchronisation: no inter-workgroup synchronisation, so barriers must be kernel-instance boundaries
  • Memory usage: need $O(N)$ working space
  • Bandwidth: requires $O(N \log N)$ for Kogge-Stone, about 7N for Brent-Kung


SLIDE 59

Generalizing Brent-Kung

The Brent-Kung tree doesn’t have to be binary.

[Figure: a two-level tree with fan-out 4, built up over several slides. Reading it level by level, with 0 as the identity:]

Inputs:          6  9  2  3 |  6  1  6  3 |  3  1  5  4 |  9  7  3  3
Block sums:              20 |          16 |          13 |          22   (total 71)
Scanned sums:             0 |          20 |          36 |          49
Result:          0  6 15 17 | 20 26 27 33 | 36 39 40 45 | 49 58 65 68

SLIDE 65

Reduce-then-Scan strategy

1. Divide elements into blocks of size M.
2. Use a workgroup per block to compute the sum of each block.
3. Recursively prefix-sum the block sums.
4. Use a workgroup per block to prefix-sum each block, starting from the result from the previous level.

Steps 2 and 4 can use any parallel reduction/prefix sum algorithm.
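The four steps can be sketched sequentially in Python. The assumptions here are mine: addition is the operator, each block is processed by a plain loop standing in for a workgroup, and M is at least 2 so that the recursion shrinks. The name `reduce_then_scan` is illustrative.

```python
def reduce_then_scan(a, M, identity=0):
    """Exclusive scan (addition) using blocks of size M; each block stands in for a workgroup."""
    assert M >= 2, "block size must be at least 2 for the recursion to shrink"
    if len(a) <= M:
        # Small enough for a single "workgroup": scan directly.
        out, running = [], identity
        for v in a:
            out.append(running)
            running = running + v
        return out
    # Steps 1-2: split into blocks and reduce each block to one sum.
    blocks = [a[k:k + M] for k in range(0, len(a), M)]
    block_sums = [sum(block, identity) for block in blocks]
    # Step 3: recursively scan the block sums to get each block's starting offset.
    offsets = reduce_then_scan(block_sums, M, identity)
    # Step 4: scan each block, seeded with its offset.
    out = []
    for block, running in zip(blocks, offsets):
        for v in block:
            out.append(running)
            running = running + v
    return out

data = [6, 9, 2, 3, 6, 1, 6, 3, 3, 1, 5, 4, 9, 7, 3, 3]
print(reduce_then_scan(data, M=4))
# [0, 6, 15, 17, 20, 26, 27, 33, 36, 39, 40, 45, 49, 58, 65, 68]
```

Each block is read twice (once to reduce, once to scan), which is where the roughly 2N global reads of this strategy come from.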


slide-70
SLIDE 70

Analysis

Assuming that M is reasonably large:
  • About log_M N kernel instances
  • Most memory accesses can be to local memory
  • Slightly over 2N global reads
  • Slightly over N global writes
  • O(N log M) barrier instructions

slide-76
SLIDE 76

Reducing Parallelism

The more parallelism one uses in a prefix sum, the higher the overheads become. Therefore, only use as much parallelism as is necessary to saturate the hardware.

slide-78
SLIDE 78

Fixed Block Count

Use the same reduce-then-scan strategy, but:
  • Fix the number of blocks at C and set M = N/C.
  • Fix a work-group size G.
  • C should be tuned so that C × G workitems saturate the device.
  • C should be small enough that only 2 levels are required.
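
A CPU sketch of the fixed-block-count variant is given below. It assumes an exclusive prefix sum and rounds the block size up to M = ceil(N / C) so that C blocks always cover the input. The three phases stand in for the three kernel launches, and the name `two_level_scan` is hypothetical.

```python
def two_level_scan(values, c):
    """Exclusive prefix sum with a fixed number of blocks c (three phases)."""
    if c < 1:
        raise ValueError("need at least one block")
    n = len(values)
    m = max(1, -(-n // c))  # block size M = ceil(N / C)
    # Trailing blocks may be empty when N is small relative to C.
    blocks = [values[i * m:(i + 1) * m] for i in range(c)]
    # Kernel 1 (C workgroups): reduce each block to its sum.
    block_sums = [sum(block) for block in blocks]
    # Kernel 2 (one workgroup): exclusive scan of the C block sums.
    offsets = []
    running = 0
    for s in block_sums:
        offsets.append(running)
        running += s
    # Kernel 3 (C workgroups): scan each block starting from its offset.
    out = []
    for block, offset in zip(blocks, offsets):
        running = offset
        for v in block:
            out.append(running)
            running += v
    return out
```

Because C is fixed, the middle phase always handles exactly C values regardless of N, which is why no further recursion is needed.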

slide-79
SLIDE 79

Prefix-Summing a Block

Each block has size M but workgroups only have G workitems. How does a workgroup prefix-sum a block?

  • Serially, in sub-blocks of size G or 2G.
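
The sub-block loop might look like the sketch below, which walks through a block G elements at a time and carries the running total from one sub-block to the next. It is an illustration under assumptions: the output is an exclusive prefix sum, each sub-block is scanned with a Kogge-Stone pattern emulated on the CPU, and both function names are hypothetical.

```python
def kogge_stone_inclusive(sub):
    """Inclusive scan in about log2(len(sub)) lock-step rounds."""
    data = list(sub)
    stride = 1
    while stride < len(data):
        # Each "workitem" i reads the previous round's values. On a GPU a
        # barrier separates rounds; building a fresh list plays that role.
        data = [data[i] + data[i - stride] if i >= stride else data[i]
                for i in range(len(data))]
        stride *= 2
    return data


def scan_block_in_subblocks(block, g, start=0):
    """Exclusive scan of one block, processed g elements at a time."""
    if g < 1:
        raise ValueError("workgroup size g must be positive")
    out = []
    carry = start  # running total carried between sub-blocks
    for base in range(0, len(block), g):
        inclusive = kogge_stone_inclusive(block[base:base + g])
        # Shift by one element to turn the inclusive scan into an exclusive one.
        out.append(carry)
        out.extend(carry + x for x in inclusive[:-1])
        carry += inclusive[-1]
    return out
```

The `start` argument is where the block's offset from the scan of the block sums would be passed in.
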
slide-82
SLIDE 82

Advantages

  • Only three kernel instances, two of which use the full GPU
  • Only O(N log G) barrier instructions
  • Only O(C) extra global memory

slide-85
SLIDE 85

Summary

  • Parallel prefix sum is hard work
  • GPUs need parallelism, but the algorithm works best with the least parallelism
  • With a good implementation, it can be bandwidth-limited

slide-88
SLIDE 88

References

Guy E. Blelloch. Prefix sums and their applications. Technical Report CMU-CS-90-190, Computer Science Department, Carnegie Mellon University, November 1990.

Duane Merrill and Andrew Grimshaw. Parallel scan for stream architectures. Technical Report CS2009-14, Department of Computer Science, University of Virginia, December 2009.