

SLIDE 1

Lecture 10: Parallel Patterns: The What and How of Parallel Programming

G63.2011.002/G22.2945.001 · November 9, 2010

SLIDE 2

Tentative Plan for Rest of Class

  • Today: Parallel Patterns
  • Nov 16: Load Balancing
  • Nov 23: More performance tricks, tools
  • Nov 30: Odds and Ends in GPU Land
  • Dec 7: moved to Dec 14 (still ok?)
  • Dec 14, 21: Final Project Presentations
  • Will assign presentation date this week.

SLIDE 3

Tentative Plan for Rest of Class

  • Today: Parallel Patterns
  • Nov 16: Load Balancing
  • Nov 23: More performance tricks, tools
  • Nov 30: Odds and Ends in GPU Land
  • Dec 7: moved to Dec 14 (still ok?)
  • Dec 14, 21: Final Project Presentations
  • Will assign presentation date this week.

Anything not on here that you would like covered?

SLIDE 4

Tentative Plan for Rest of Class

  • Today: Parallel Patterns
  • Nov 16: Load Balancing
  • Nov 23: More performance tricks, tools
  • Nov 30: Odds and Ends in GPU Land
  • Dec 7: moved to Dec 14 (still ok?)
  • Dec 14, 21: Final Project Presentations
  • Will assign presentation date this week.

Will post HW3 solution soon. (list message) Graded HW3 next week.

SLIDE 5

Today

“Traditional” parallel programming in a nutshell.

Key question:

  • Data Dependencies

SLIDE 6

Outline

Embarrassingly Parallel · Partition · Pipelines · Reduction · Scan

SLIDE 7

Outline

Embarrassingly Parallel · Partition · Pipelines · Reduction · Scan

SLIDE 8

Embarrassingly Parallel

yi = fi(xi)

where i ∈ {1, . . . , N}. Notation: (also for rest of this lecture)

  • xi: inputs
  • yi: outputs
  • fi: (pure) functions (i.e. no side effects)

SLIDE 9

Embarrassingly Parallel

yi = fi(xi)

where i ∈ {1, . . . , N}. Notation: (also for rest of this lecture)

  • xi: inputs
  • yi: outputs
  • fi: (pure) functions (i.e. no side effects)

When does a function have a “side effect”? In addition to producing a value, it

  • modifies non-local state, or
  • has an observable interaction with the outside world.

SLIDE 10

Embarrassingly Parallel

yi = fi(xi)

where i ∈ {1, . . . , N}. Notation: (also for rest of this lecture)

  • xi: inputs
  • yi: outputs
  • fi: (pure) functions (i.e. no side effects)

SLIDE 11

Embarrassingly Parallel

yi = fi(xi)

where i ∈ {1, . . . , N}. Notation: (also for rest of this lecture)

  • xi: inputs
  • yi: outputs
  • fi: (pure) functions (i.e. no side effects)

Often: f1 = · · · = fN. Then

  • Lisp/Python function map
  • C++ STL std::transform
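For concreteness, a minimal sequential sketch in C++ (not from the slides; the function and data are made up): because no output element depends on any other, the std::transform below could be split across any number of workers without changing the logic.

  #include <algorithm>
  #include <cmath>
  #include <iostream>
  #include <vector>

  int main()
  {
    std::vector<double> x = {1, 2, 3, 4}, y(x.size());

    // y[i] = f(x[i]); no element depends on any other, so the
    // iterations may run in any order or fully in parallel.
    std::transform(x.begin(), x.end(), y.begin(),
                   [](double v) { return std::sqrt(v); });

    for (double v : y)
      std::cout << v << "\n";
  }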

SLIDE 12

Embarrassingly Parallel: Graph Representation

[Graph: nine independent tasks; each yi is computed from xi by fi, with no edges between different i.]

SLIDE 13

Embarrassingly Parallel: Graph Representation

[Graph: nine independent tasks; each yi is computed from xi by fi, with no edges between different i.]

Trivial? Often: no.

SLIDE 14

Embarrassingly Parallel: Examples

Surprisingly useful:

  • Element-wise linear algebra: addition, scalar multiplication (not inner product)
  • Image Processing: shift, rotate, clip, scale, . . .
  • Monte Carlo simulation
  • (Brute-force) Optimization
  • Random Number Generation
  • Encryption, Compression (after blocking)
  • Software compilation: make -j8

SLIDE 15

Embarrassingly Parallel: Examples

Surprisingly useful:

  • Element-wise linear algebra: addition, scalar multiplication (not inner product)
  • Image Processing: shift, rotate, clip, scale, . . .
  • Monte Carlo simulation
  • (Brute-force) Optimization
  • Random Number Generation
  • Encryption, Compression (after blocking)
  • Software compilation: make -j8

But: Still needs a minimum of coordination. How can that be achieved?

SLIDE 16

Mother-Child Parallelism

Mother-Child parallelism (formerly called “Master-Slave”):

[Figure: a mother process sends initial data to children 1, 2, 3, 4 and collects their results.]
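A small illustrative sketch of the mother-child scheme on a shared-memory machine, with std::async standing in for “sending work to a child” (this is an assumption for illustration, not the course's reference implementation):

  #include <future>
  #include <iostream>
  #include <vector>

  int child(int work_item)          // each child works on its own piece
  {
    return work_item * work_item;   // stand-in for real work
  }

  int main()
  {
    std::vector<std::future<int>> results;

    // Mother: send initial data to the children...
    for (int i = 1; i <= 4; ++i)
      results.push_back(std::async(std::launch::async, child, i));

    // ...and collect the results.
    for (auto &r : results)
      std::cout << r.get() << "\n";
  }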

SLIDE 17

Embarrassingly Parallel: Issues

  • Process Creation: dynamic or static?
  • (MPI-2 supports dynamic process creation)
  • Job Assignment (‘Scheduling’): dynamic or static?
  • Operations/data light- or heavy-weight?
  • Variable-size data?
  • Load Balancing: here, easy

SLIDE 18

Embarrassingly Parallel: Issues

  • Process Creation: dynamic or static?
  • (MPI-2 supports dynamic process creation)
  • Job Assignment (‘Scheduling’): dynamic or static?
  • Operations/data light- or heavy-weight?
  • Variable-size data?
  • Load Balancing: here, easy

Can you think of a load balancing recipe?

SLIDE 19

Outline

Embarrassingly Parallel · Partition · Pipelines · Reduction · Scan

SLIDE 20

Partition

yi = fi(xi−1, xi, xi+1)

where i ∈ {1, . . . , N}.

SLIDE 21

Partition

yi = fi(xi−1, xi, xi+1)

where i ∈ {1, . . . , N}. Includes straightforward generalizations to dependencies on a larger (but not O(P)-sized!) set of neighbor inputs.
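As a sequential sketch of what each partition computes (a 3-point stencil; the weights below are made up): every output needs only a small neighborhood of inputs, so the index range can be split among processors and only the halo values at partition boundaries have to be exchanged.

  #include <cstddef>
  #include <iostream>
  #include <vector>

  // One step of y[i] = f(x[i-1], x[i], x[i+1]) on the interior points.
  std::vector<double> step(const std::vector<double> &x)
  {
    std::vector<double> y(x.size(), 0.0);
    // Each i is independent of every other i, so the range 1..n-2 can be
    // partitioned among processors; only boundary (halo) values are shared.
    for (std::size_t i = 1; i + 1 < x.size(); ++i)
      y[i] = 0.25 * x[i-1] + 0.5 * x[i] + 0.25 * x[i+1];
    return y;
  }

  int main()
  {
    std::vector<double> x = {0, 1, 2, 3, 4, 5, 6};
    std::vector<double> y = step(x);
    for (double v : y)
      std::cout << v << " ";
    std::cout << "\n";
  }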

SLIDE 22

Partition: Graph

[Graph: each yi (i = 1 . . . 5) depends on the three neighboring inputs xi−1, xi, xi+1 drawn from x0 . . . x6.]

SLIDE 23

Partition: Examples

  • Time-marching (in particular: PDE solvers)
  • (Including finite differences → HW3!)
  • Iterative Methods
  • Solve Ax = b (Jacobi, . . . )
  • Optimization (all P on single problem)
  • Eigenvalue solvers
  • Cellular Automata (Game of Life :-)

SLIDE 24

Partition: Issues

  • Only useful when the computation is mainly local
  • Responsibility for updating one datum rests with one processor
  • Synchronization, Deadlock, Livelock, . . .
  • Performance Impact
  • Granularity
  • Load Balancing: thorny issue → next lecture
  • Regularity of the Partition?

SLIDE 25

Rendezvous Trick

  • Assume an irregular partition.
  • Assume problem components i, j on unknown partitions pi, pj need to communicate.
  • How can pi find pj (and vice versa)?

[Figure: components i and j sitting on unknown partitions pi and pj.]

SLIDE 26

Rendezvous Trick

  • Assume an irregular partition.
  • Assume problem components i, j on unknown partitions pi, pj need to communicate.
  • How can pi find pj (and vice versa)?

Communicate via a third party, pf(i,j). For f : think ‘hash function’.

[Figure: components i and j on partitions pi and pj, plus the rendezvous partition pf(i,j).]

SLIDE 27

Rendezvous Trick

  • Assume an irregular partition.
  • Assume problem components i, j on unknown partitions pi, pj need to communicate.
  • How can pi find pj (and vice versa)?

Communicate via a third party, pf(i,j). For f : think ‘hash function’.

[Figure: pi registers with the rendezvous partition pf(i,j): “I’m in pi.”]

SLIDE 28

Rendezvous Trick

  • Assume an irregular partition.
  • Assume problem components i, j on unknown partitions pi, pj need to communicate.
  • How can pi find pj (and vice versa)?

Communicate via a third party, pf(i,j). For f : think ‘hash function’.

[Figure: pj registers with the rendezvous partition pf(i,j): “I’m in pj.”]

SLIDE 29

Rendezvous Trick

  • Assume an irregular partition.
  • Assume problem components i, j on unknown partitions pi, pj need to communicate.
  • How can pi find pj (and vice versa)?

Communicate via a third party, pf(i,j). For f : think ‘hash function’.

[Figure: the rendezvous partition pf(i,j) now knows where both i and j live.]

SLIDE 30

Rendezvous Trick

  • Assume an irregular partition.
  • Assume problem components i, j on unknown partitions pi, pj need to communicate.
  • How can pi find pj (and vice versa)?

Communicate via a third party, pf(i,j). For f : think ‘hash function’.

[Figure: pi and pj have been put in contact via pf(i,j).]
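A minimal sketch of the only nontrivial ingredient, picking the third-party partition from (i, j). The mixing constants below are arbitrary; the only requirement is that every partition evaluates the same function, so pi and pj independently agree on the meeting point.

  #include <cstdint>
  #include <iostream>

  // Both sides compute the same rank from (i, j) and register there.
  int rendezvous_rank(std::uint64_t i, std::uint64_t j, int num_partitions)
  {
    // simple hash-style mix of the pair (i, j)
    std::uint64_t h = i * 0x9e3779b97f4a7c15ULL ^ (j + 0x7f4a7c15ULL);
    return static_cast<int>(h % num_partitions);
  }

  int main()
  {
    std::cout << rendezvous_rank(17, 42, 64) << "\n";
  }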

SLIDE 31

Outline

Embarrassingly Parallel · Partition · Pipelines · Reduction · Scan

SLIDE 32

Pipelined Computation

y = fN(· · · f2(f1(x)) · · · ) = (fN ◦ · · · ◦ f1)(x)

where N is fixed.
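A sequential sketch with three made-up stages standing in for f1, . . . , fN: composed this way the computation is serial, but in a pipelined execution each stage runs on its own processor, with item k + 1 entering stage 1 while item k is still in stage 2.

  #include <iostream>
  #include <vector>

  double f1(double x) { return x + 1; }   // stage 1
  double f2(double x) { return 2 * x; }   // stage 2
  double f3(double x) { return x * x; }   // stage 3

  int main()
  {
    std::vector<double> items = {1, 2, 3, 4};

    // y = (f3 ∘ f2 ∘ f1)(x). In a pipelined version each stage would be a
    // separate worker connected by queues, so successive items overlap.
    for (double x : items)
      std::cout << f3(f2(f1(x))) << "\n";
  }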

SLIDE 33

Pipelined Computation: Graph

[Graph: x flows through the stages f1, f2, . . . in sequence to produce y.]

SLIDE 34

Pipelined Computation: Graph

[Graph: x flows through the stages f1, f2, . . . in sequence to produce y.]

Processor Assignment?

SLIDE 35

Pipelined Computation: Examples

  • Image processing
  • Any multi-stage algorithm
  • Pre/post-processing or I/O
  • Out-of-core algorithms

Specific simple examples:

  • Sorting (insertion sort)
  • Triangular linear system solve (‘backsubstitution’)
  • Key: pass on values as soon as they’re available

(will see more efficient algorithms for both later)

SLIDE 36

Pipelined Computation: Issues

  • Non-optimal while the pipeline fills or empties
  • Often communication-inefficient for large data
  • Needs some attention to synchronization, deadlock avoidance
  • Can accommodate some asynchrony

But don’t want:

  • Pile-up
  • Starvation

SLIDE 37

Outline

Embarrassingly Parallel · Partition · Pipelines · Reduction · Scan

SLIDE 38

Reduction

y = f (· · · f (f (x1, x2), x3), . . . , xN)

where N is the input size.

SLIDE 39

Reduction

y = f (· · · f (f (x1, x2), x3), . . . , xN)

where N is the input size. Also known as. . .

  • Lisp/Python function reduce (Scheme: fold)
  • C++ STL std::accumulate
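For example, a left fold with f = + over the values that also appear in the reduction figures later in this lecture (a plain sequential sketch, not GPU code):

  #include <functional>
  #include <iostream>
  #include <numeric>
  #include <vector>

  int main()
  {
    std::vector<double> x = {3, 1, 7, 0, 4, 1, 6, 3};

    // Left fold: f(...f(f(x1, x2), x3)..., xN) with f = +.
    double sum = std::accumulate(x.begin(), x.end(), 0.0, std::plus<double>());
    std::cout << sum << "\n";   // 25
  }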

SLIDE 40

Reduction: Graph

[Graph: sequential chain; x1 and x2 are combined first, then x3 is folded in, and so on up to x6, producing y.]

SLIDE 41

Reduction: Graph

[Graph: sequential chain; x1 and x2 are combined first, then x3 is folded in, and so on up to x6, producing y.]

Painful! Not parallelizable.

SLIDE 42

Approach to Reduction

Can we do better? The “tree” is very imbalanced. What property of f(x, y) would allow ‘rebalancing’?

SLIDE 43

Approach to Reduction

Can we do better? The “tree” is very imbalanced. What property of f(x, y) would allow ‘rebalancing’?

f(f(x, y), z) = f(x, f(y, z))

Looks less improbable if we let x ◦ y = f(x, y):

x ◦ (y ◦ z) = (x ◦ y) ◦ z

This has a very familiar name: Associativity.
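A sequential sketch of the regrouped, tree-shaped reduction that associativity permits (here with f = +); each level of pairings is independent work that could run in parallel:

  #include <cstddef>
  #include <iostream>
  #include <vector>

  // Pairwise ("tree") reduction: associativity lets us regroup
  // f(f(f(x0,x1),x2),x3) as f(f(x0,x1), f(x2,x3)).
  double tree_sum(std::vector<double> v)
  {
    while (v.size() > 1)
    {
      std::vector<double> next;
      for (std::size_t i = 0; i + 1 < v.size(); i += 2)
        next.push_back(v[i] + v[i+1]);     // one level of the tree
      if (v.size() % 2)                    // odd element carried over
        next.push_back(v.back());
      v = next;
    }
    return v.front();
  }

  int main()
  {
    std::cout << tree_sum({3, 1, 7, 0, 4, 1, 6, 3}) << "\n";   // 25
  }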

SLIDE 44

Reduction: A Better Graph

[Graph: balanced binary tree combining x0 . . . x7 pairwise into y in log2 8 = 3 levels.]

SLIDE 45

Reduction: A Better Graph

[Graph: balanced binary tree combining x0 . . . x7 pairwise into y in log2 8 = 3 levels.]

Processor allocation?

SLIDE 46

Mapping Reduction to the GPU

  • Obvious: Want to use tree-based approach.
  • Problem: Two scales, Work group and Grid
  • Need to occupy both to make good use of the machine.
  • In particular, need synchronization after each tree stage.

With material by M. Harris (Nvidia Corp.)

SLIDE 47

Mapping Reduction to the GPU

  • Obvious: Want to use tree-based approach.
  • Problem: Two scales, Work group and Grid
  • Need to occupy both to make good use of the machine.
  • In particular, need synchronization after each tree stage.
  • Solution: Use a two-scale algorithm.

[Figure: several work groups each tree-reduce their own eight values in local memory (3 1 7 0 4 1 6 3 → 4 7 5 9 → 11 14 → 25); a subsequent kernel launch reduces the per-group results.]

In particular: Use multiple grid invocations to achieve inter-workgroup synchronization.

With material by M. Harris (Nvidia Corp.)
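The same two-scale structure, sketched on the host in plain C++ rather than OpenCL (the group size and data are made up): each pass reduces fixed-size groups to one partial result, and repeating the pass plays the role of the repeated grid invocations that provide inter-workgroup synchronization.

  #include <cstddef>
  #include <iostream>
  #include <vector>

  // One "kernel launch": each group of up to group_size elements is
  // reduced to a single partial sum (what one work group does locally).
  std::vector<double> reduce_pass(const std::vector<double> &in, std::size_t group_size)
  {
    std::vector<double> out;
    for (std::size_t start = 0; start < in.size(); start += group_size)
    {
      double s = 0;
      for (std::size_t i = start; i < in.size() && i < start + group_size; ++i)
        s += in[i];
      out.push_back(s);
    }
    return out;
  }

  int main()
  {
    std::vector<double> data(1 << 20, 1.0);

    // Repeated passes = repeated grid invocations; the gap between passes
    // is where the inter-workgroup synchronization happens.
    while (data.size() > 1)
      data = reduce_pass(data, 256);

    std::cout << data[0] << "\n";   // 1048576
  }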

SLIDE 48

Kernel V1

kernel void reduce0(
    global T *g_idata, global T *g_odata,
    unsigned int n, local T *ldata)
{
  unsigned int lid = get_local_id(0);
  unsigned int i = get_global_id(0);

  ldata[lid] = (i < n) ? g_idata[i] : 0;
  barrier(CLK_LOCAL_MEM_FENCE);

  for (unsigned int s = 1; s < get_local_size(0); s *= 2)
  {
    if ((lid % (2*s)) == 0)
      ldata[lid] += ldata[lid + s];
    barrier(CLK_LOCAL_MEM_FENCE);
  }

  if (lid == 0)
    g_odata[get_group_id(0)] = ldata[0];
}

SLIDE 49

Interleaved Addressing

[Figure: interleaved addressing, worked example on 16 values in shared memory. Steps 1 to 4 use strides 1, 2, 4, 8; in each step the active work items (those whose ID is a multiple of twice the stride) add in the element one stride away, and the group total (41) ends up in element 0.]

With material by M. Harris (Nvidia Corp.)

SLIDE 50

Interleaved Addressing

[Figure: interleaved addressing, worked example on 16 values in shared memory. Steps 1 to 4 use strides 1, 2, 4, 8; in each step the active work items (those whose ID is a multiple of twice the stride) add in the element one stride away, and the group total (41) ends up in element 0.]

Issue: Slow modulo, Divergence

With material by M. Harris (Nvidia Corp.)

SLIDE 51

Kernel V2

kernel void reduce2(
    global T *g_idata, global T *g_odata,
    unsigned int n, local T *ldata)
{
  unsigned int lid = get_local_id(0);
  unsigned int i = get_global_id(0);

  ldata[lid] = (i < n) ? g_idata[i] : 0;
  barrier(CLK_LOCAL_MEM_FENCE);

  for (unsigned int s = get_local_size(0)/2; s > 0; s >>= 1)
  {
    if (lid < s)
      ldata[lid] += ldata[lid + s];
    barrier(CLK_LOCAL_MEM_FENCE);
  }

  if (lid == 0)
    g_odata[get_group_id(0)] = ldata[0];
}

SLIDE 52

Sequential Addressing

[Figure: sequential addressing, worked example on 16 values in shared memory. Steps 1 to 4 use strides 8, 4, 2, 1; in each step work items 0 . . . stride−1 add in the element one stride away, and the group total (41) ends up in element 0.]

With material by M. Harris (Nvidia Corp.)

SLIDE 53

Sequential Addressing

[Figure: sequential addressing, worked example on 16 values in shared memory. Steps 1 to 4 use strides 8, 4, 2, 1; in each step work items 0 . . . stride−1 add in the element one stride away, and the group total (41) ends up in element 0.]

Better! But still not “efficient”. Only half of all work items after first round, then a quarter, . . .

With material by M. Harris (Nvidia Corp.)

SLIDE 54

Thinking about Parallel Complexity

Distinguish:

  • Time on P processors: TP
  • Step Complexity/Span T∞: minimum number of steps taken if an infinite number of processors are available
  • Work per step: St
  • Work Complexity/Work T1 = ∑t=1…T∞ St: total number of operations performed
  • Parallelism T1/T∞: average amount of work along the span; P > T1/T∞ doesn’t make sense.

Algorithm-specific!

SLIDE 55

Thinking about Parallel Complexity

Distinguish:

  • Time on P processors: TP
  • Step Complexity/Span T∞: minimum number of steps taken if an infinite number of processors are available
  • Work per step: St
  • Work Complexity/Work T1 = ∑t=1…T∞ St: total number of operations performed
  • Parallelism T1/T∞: average amount of work along the span; P > T1/T∞ doesn’t make sense.

Algorithm-specific!

Lower Bounds:

  • TP ≥ . . . ? (in terms of T1)
  • TP ≥ . . . ? (in terms of T∞)

SLIDE 56

Parallel Complexity for Reduction

Number of items: N. Actual work to be done: W = O(N) additions.

Step Complexity: let d = ⌈log2 N⌉. Then T∞ = d and St = O(2^(d−t)).

Work Complexity: T1 = ∑t=1…T∞ St = O(∑t=1…d 2^(d−t)) = O(2^d) = O(N)

SLIDE 57

Parallel Complexity for Reduction

Number of items: N. Actual work to be done: W = O(N) additions.

Step Complexity: let d = ⌈log2 N⌉. Then T∞ = d and St = O(2^(d−t)).

Work Complexity: T1 = ∑t=1…T∞ St = O(∑t=1…d 2^(d−t)) = O(2^d) = O(N)

“Work-efficient:” T1 ∼ W .

SLIDE 58

Greedy Scheduling

Theorem (Graham ‘68, Brent ‘75)

A parallel algorithm with span T∞ and work complexity T1 can be executed on a shared-memory machine with P processors in no more than

TP ≤ T1/P + T∞

steps.

Observations:

  • Think of T∞ as the length of the “critical path”.
  • The first summand can be made to go away by increasing P.
  • Only valid for shared memory.

SLIDE 59

Greedy Scheduling

Theorem (Graham ‘68, Brent ‘75)

A parallel algorithm with span T∞ and work complexity T1 can be executed on a shared-memory machine with P processors in no more than

TP ≤ T1/P + T∞

steps.

Observations:

  • Think of T∞ as the length of the “critical path”.
  • The first summand can be made to go away by increasing P.
  • Only valid for shared memory.

Estimate for P = 1? Proof sketch?

SLIDE 60

Brent for Reduction

Again: number of items N. Brent says

TP = O(T1/P + T∞) = O(N/P + log N).

Within a work group: N = P ⇒ TN = O(log N).

SLIDE 61

Brent for Reduction

Again: number of items N. Brent says

TP = O(T1/P + T∞) = O(N/P + log N).

Within a work group: N = P ⇒ TN = O(log N).

But: work groups are an illusion! The machine has finite width, thus TP > O(log N)! How low can we take P before we hurt our asymptotic runtime TP?

SLIDE 62

Brent for Reduction

Again: number of items N. Brent says

TP = O(T1/P + T∞) = O(N/P + log N).

Within a work group: N = P ⇒ TN = O(log N).

But: work groups are an illusion! The machine has finite width, thus TP > O(log N)! How low can we take P before we hurt our asymptotic runtime TP?

Asymptotically optimal TP = O(log N) for P ≥ N/log N. Result: we are free to reduce P by a factor of log N without increasing TP. ⇒ Do log N items in sequence per work item without increasing the asymptotic TP.
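A plain C++ sketch of that cascading idea (the sizes are made up): roughly N/log2 N work items each first sum about log2 N elements sequentially, and only their partial results go through the logarithmic tree stage.

  #include <cmath>
  #include <cstddef>
  #include <iostream>
  #include <vector>

  int main()
  {
    std::size_t n = 1 << 20;
    std::vector<double> x(n, 1.0);

    // Cascading: each "work item" sums ~log2(N) elements sequentially
    // (cheap, uses registers), producing N/log2(N) partial results.
    std::size_t chunk = static_cast<std::size_t>(std::log2(static_cast<double>(n)));
    std::vector<double> partial;
    for (std::size_t start = 0; start < n; start += chunk)
    {
      double s = 0;
      for (std::size_t i = start; i < n && i < start + chunk; ++i)
        s += x[i];                       // sequential part, one per work item
      partial.push_back(s);
    }

    // Tree-reduce the partial sums: the O(log N) part.
    while (partial.size() > 1)
    {
      std::vector<double> next;
      for (std::size_t i = 0; i + 1 < partial.size(); i += 2)
        next.push_back(partial[i] + partial[i+1]);
      if (partial.size() % 2)
        next.push_back(partial.back());
      partial = next;
    }
    std::cout << partial[0] << "\n";     // 1048576
  }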

SLIDE 63

Brent for Reduction

Again: number of items N. Brent says

TP = O(T1/P + T∞) = O(N/P + log N).

Within a work group: N = P ⇒ TN = O(log N).

But: work groups are an illusion! The machine has finite width, thus TP > O(log N)! How low can we take P before we hurt our asymptotic runtime TP?

Asymptotically optimal TP = O(log N) for P ≥ N/log N. Result: we are free to reduce P by a factor of log N without increasing TP. ⇒ Do log N items in sequence per work item without increasing the asymptotic TP.

Think of this in terms of cost: Cost = P × TP

SLIDE 64

Brent for Reduction

Again: number of items N. Brent says

TP = O(T1/P + T∞) = O(N/P + log N).

Within a work group: N = P ⇒ TN = O(log N).

But: work groups are an illusion! The machine has finite width, thus TP > O(log N)! How low can we take P before we hurt our asymptotic runtime TP?

Asymptotically optimal TP = O(log N) for P ≥ N/log N. Result: we are free to reduce P by a factor of log N without increasing TP. ⇒ Do log N items in sequence per work item without increasing the asymptotic TP.

Think of this in terms of cost: Cost = P × TP

Brent gives a lower bound on P. Fewer processors ⇒ less cost!

SLIDE 65

Brent for Reduction

Again: number of items N. Brent says

TP = O(T1/P + T∞) = O(N/P + log N).

Within a work group: N = P ⇒ TN = O(log N).

But: work groups are an illusion! The machine has finite width, thus TP > O(log N)! How low can we take P before we hurt our asymptotic runtime TP?

Asymptotically optimal TP = O(log N) for P ≥ N/log N. Result: we are free to reduce P by a factor of log N without increasing TP. ⇒ Do log N items in sequence per work item without increasing the asymptotic TP.

Think of this in terms of cost: Cost = P × TP

Brent gives a lower bound on P. Fewer processors ⇒ less cost!

“Algorithm cascading”

SLIDE 66

Kernel V3 Part 1

kernel void reduce6(
    global T *g_idata, global T *g_odata,
    unsigned int n, volatile local T *ldata)
{
  unsigned int lid = get_local_id(0);
  unsigned int i = get_group_id(0)*(get_local_size(0)*2) + get_local_id(0);
  unsigned int gridSize = GROUP_SIZE*2*get_num_groups(0);

  ldata[lid] = 0;
  while (i < n)
  {
    ldata[lid] += g_idata[i];
    if (i + GROUP_SIZE < n)
      ldata[lid] += g_idata[i+GROUP_SIZE];
    i += gridSize;
  }
  barrier(CLK_LOCAL_MEM_FENCE);

SLIDE 67

Kernel V3 Part 2

  if (GROUP_SIZE >= 512)
  {
    if (lid < 256) { ldata[lid] += ldata[lid + 256]; }
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  // ...
  if (GROUP_SIZE >= 128) { /* ... */ }

  if (lid < 32)
  {
    if (GROUP_SIZE >= 64) { ldata[lid] += ldata[lid + 32]; }
    if (GROUP_SIZE >= 32) { ldata[lid] += ldata[lid + 16]; }
    // ...
    if (GROUP_SIZE >= 2) { ldata[lid] += ldata[lid + 1]; }
  }

  if (lid == 0)
    g_odata[get_group_id(0)] = ldata[0];
}

SLIDE 68

Performance Comparison

[Plot: time (ms, log scale, 0.01 to 10) vs. number of elements (131072 to 33554432) for the successive kernel versions:
  1: Interleaved addressing: divergent branches
  2: Interleaved addressing: bank conflicts
  3: Sequential addressing
  4: First add during global load
  5: Unroll last warp
  6: Completely unroll
  7: Multiple elements per thread (max 64 blocks)]

With material by M. Harris (Nvidia Corp.)

SLIDE 69

Reduction: Examples

  • Sum, Inner Product, Norm
  • Occurs in iterative methods
  • Minimum, Maximum
  • Data Analysis
  • Evaluation of Monte Carlo Simulations
  • List Concatenation, Set Union
  • Matrix-Vector product (but. . . )

SLIDE 70

Reduction: Issues

  • When adding: floating point cancellation?
  • Serial order goes faster: can use registers for intermediate results
  • Requires availability of a neutral element
  • GPU-Reduce: optimization sensitive to data type

SLIDE 71

Map-Reduce

y = f (· · · f (f (g(x1), g(x2)), g(x3)), . . . , g(xN))

where N is the input size.

  • Lisp naming, again
  • Mild generalization of reduction

SLIDE 72

Map-Reduce: Graph

[Graph: each input xi is first mapped through g; the mapped values are then combined by a reduction tree into y.]

SLIDE 73

MapReduce: Discussion

MapReduce ≥ map + reduce:

  • Used by Google (and many others) for large-scale data processing
  • Map generates (key, value) pairs
  • Reduce operates only on pairs with identical keys
  • Remaining output is sorted by key
  • Represents all data as character strings; the user must convert to/from the internal representation
  • Messy implementation: parallelization, fault tolerance, monitoring, data management, load balancing, re-running “stragglers”, data locality
  • Works for Internet-size data
  • Simple to use even for inexperienced users
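A toy, single-process word-count sketch of the map/shuffle/reduce structure (the documents are made up; a real MapReduce run distributes each phase over many machines):

  #include <iostream>
  #include <map>
  #include <string>
  #include <utility>
  #include <vector>

  int main()
  {
    std::vector<std::string> documents = {"the quick fox", "the lazy dog", "the fox"};

    // Map: each document independently emits (key, value) pairs.
    std::vector<std::pair<std::string, int>> pairs;
    for (const std::string &doc : documents)
    {
      std::string word;
      for (char c : doc)
      {
        if (c == ' ') { if (!word.empty()) pairs.emplace_back(word, 1); word.clear(); }
        else word += c;
      }
      if (!word.empty()) pairs.emplace_back(word, 1);
    }

    // Shuffle + Reduce: group by key, then reduce each key's values.
    std::map<std::string, int> counts;
    for (const auto &kv : pairs)
      counts[kv.first] += kv.second;      // reduce = sum

    for (const auto &kv : counts)
      std::cout << kv.first << ": " << kv.second << "\n";
  }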

SLIDE 74

MapReduce: Examples

  • String search
  • Hit count from a log (e.g. per URL)
  • Reverse web-link graph (desired: (target URL, sources))
  • Sort
  • Indexing (desired: (word, document IDs))
  • Machine Learning, Clustering, . . .

SLIDE 75

Outline

Embarrassingly Parallel · Partition · Pipelines · Reduction · Scan

SLIDE 76

Scan

y1 = x1
y2 = f(y1, x2)
. . .
yN = f(yN−1, xN)

where N is the input size.

  • Also called “prefix sum”.
  • Or cumulative sum (‘cumsum’) in Matlab/NumPy.
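A minimal sequential example of an inclusive scan with f = + (C++'s std::partial_sum; the input values are the same ones used in the earlier reduction figures):

  #include <iostream>
  #include <numeric>
  #include <vector>

  int main()
  {
    std::vector<int> x = {3, 1, 7, 0, 4, 1, 6, 3}, y(x.size());

    // Inclusive scan: y[i] = x[0] + x[1] + ... + x[i].
    std::partial_sum(x.begin(), x.end(), y.begin());

    for (int v : y)
      std::cout << v << " ";            // 3 4 11 11 15 16 22 25
    std::cout << "\n";
  }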

SLIDE 77

Scan: Graph

[Graph: chain of adds; each yi is formed from yi−1 and xi (Id nodes pass values through), so the last output depends on everything before it.]

SLIDE 78

Scan: Graph

[Graph: chain of adds; each yi is formed from yi−1 and xi (Id nodes pass values through), so the last output depends on everything before it.]

This can’t possibly be parallelized. Or can it?

SLIDE 79

Scan: Graph

[Graph: chain of adds; each yi is formed from yi−1 and xi (Id nodes pass values through), so the last output depends on everything before it.]

This can’t possibly be parallelized. Or can it?

Again: need assumptions on f . Associativity, commutativity.

SLIDE 80

Scan: Implementation

SLIDE 81

Scan: Implementation

Work-efficient?

SLIDE 82

Scan: Implementation II

Two sweeps: upward and downward, both tree-shaped.

On the upward sweep:

  • Get values L and R from the left and right child
  • Save L in a local variable Mine
  • Compute Tmp = L + R and pass it to the parent

On the downward sweep:

  • Get value Tmp from the parent
  • Send Tmp to the left child
  • Send Tmp + Mine to the right child

SLIDE 83

Scan: Implementation II

Two sweeps: upward and downward, both tree-shaped.

On the upward sweep:

  • Get values L and R from the left and right child
  • Save L in a local variable Mine
  • Compute Tmp = L + R and pass it to the parent

On the downward sweep:

  • Get value Tmp from the parent
  • Send Tmp to the left child
  • Send Tmp + Mine to the right child

Work-efficient? Span rel. to first attempt?
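A sequential, array-based sketch of the same two-sweep idea (often called the Blelloch scan), assuming the length n is a power of two; it produces an exclusive prefix sum, and within each level d all updates are independent, which is where the parallelism comes from.

  #include <cstddef>
  #include <iostream>
  #include <vector>

  void scan_exclusive(std::vector<int> &a)
  {
    std::size_t n = a.size();

    // Upward sweep: build partial sums up the tree.
    for (std::size_t d = 1; d < n; d *= 2)
      for (std::size_t i = 0; i + 2*d <= n; i += 2*d)
        a[i + 2*d - 1] += a[i + d - 1];

    // Downward sweep: push prefixes back down.
    a[n - 1] = 0;
    for (std::size_t d = n / 2; d >= 1; d /= 2)
      for (std::size_t i = 0; i + 2*d <= n; i += 2*d)
      {
        int t = a[i + d - 1];            // "Mine"
        a[i + d - 1] = a[i + 2*d - 1];   // Tmp goes to the left child
        a[i + 2*d - 1] += t;             // Tmp + Mine goes to the right child
      }
  }

  int main()
  {
    std::vector<int> a = {3, 1, 7, 0, 4, 1, 6, 3};
    scan_exclusive(a);
    for (int v : a)
      std::cout << v << " ";             // 0 3 4 11 11 15 16 22
    std::cout << "\n";
  }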

SLIDE 84

Scan: Examples

  • Anything with a loop-carried dependence
  • One row of Gauss-Seidel
  • One row of a triangular solve
  • Segment numbering if boundaries are known
  • Low-level building block for many higher-level algorithms
  • FIR/IIR Filtering
  • G. E. Blelloch: Prefix Sums and their Applications

SLIDE 85

Scan: Issues

  • Subtlety: inclusive vs. exclusive scan
  • Pattern sometimes hard to recognize
  • But shows up surprisingly often
  • Need to prove associativity/commutativity
  • Useful in implementation: algorithm cascading
  • Do a sequential scan on parts, then parallelize at coarser granularities

SLIDE 86

Questions?

