Lecture 10: Parallel Patterns: The What and How of Parallel Programming
G63.2011.002/G22.2945.001 · November 9, 2010
Outline: Embarrassingly Parallel · Partition · Pipelines · Reduction · Scan
Tentative Plan for Rest of Class
Today: Parallel Patterns
Nov 16: …
Anything not on here that you would like covered?
Will post HW3 solutions soon (see list message). Graded HW3 back next week.
“Traditional” parallel programming in a nutshell. Key question: …
Embarrassingly Parallel
yi = fi(xi), where i ∈ {1, . . . , N}.
Notation (also for the rest of this lecture): the fi are assumed to be side-effect-free functions of their input.

When does a function have a “side effect”? In addition to producing a value, it modifies state visible outside the function, e.g. a global variable, memory reachable through its arguments, or I/O.
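A tiny illustration (not from the slides) of a side-effecting function, in C:

/* f both returns a value and mutates global state -- so two calls
   to f cannot safely run in parallel or be reordered. */
int call_count = 0;

int f(int x)
{
  ++call_count;   /* side effect: writes non-local state */
  return 2 * x;   /* the value f produces */
}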
Often: f1 = · · · = fN = f. Then yi = f(xi) for all i — a “map”.
[Figure: nine independent tasks — each fi maps its input xi to an output yi, i = 0, . . . , 8, with no communication between tasks.]
Trivial? Often: no.
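As a minimal sketch of a map in OpenCL (not from the slides; the particular f, a scale-and-clip, is made up):

__kernel void map_f(__global const float *x, __global float *y,
                    unsigned int n)
{
  unsigned int i = get_global_id(0);
  if (i < n)
    y[i] = fmin(2.0f * x[i], 1.0f);  /* y[i] = f(x[i]): scale, then clip */
}

Every work item handles one independent element; no synchronization is needed.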
Surprisingly useful:
- Addition, scalar multiplication (not inner product)
- clip, scale, . . .
- . . . (after blocking)
But: still needs a minimum of coordination. How can that be achieved?
Mother-Child parallelism (formerly called “Master-Slave”):
[Figure: a mother process sends initial data to children 1–4, then collects their results.]
Issues:
- Division of work: dynamic or static?
- Process creation: dynamic or static? Is creation heavy-weight?
Can you think of a load balancing recipe?
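One common recipe (a sketch under assumptions, not the lecture's answer): keep a shared “bag of tasks” and let workers claim the next task as they finish their current one. In OpenCL (using the 1.1 atomic built-ins; the host initializes *next_task to 0):

__kernel void bag_of_tasks(__global const float *x, __global float *y,
                           unsigned int n,
                           volatile __global unsigned int *next_task)
{
  for (;;)
  {
    unsigned int i = atomic_inc(next_task);  /* claim the next task index */
    if (i >= n)
      break;
    y[i] = 2.0f * x[i] * x[i];  /* stand-in for the real per-task work */
  }
}

Fast workers automatically grab more tasks, so the load balances itself.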
Partition
yi = f(xi−1, xi, xi+1), where i ∈ {1, . . . , N}.
Includes straightforward generalizations to dependencies on a larger (but not O(P)-sized!) set of neighbor inputs.
[Figure: each output yi, i = 1, . . . , 5, depends on the neighboring inputs xi−1, xi, xi+1.]
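A minimal sketch of this pattern in OpenCL (not from the slides; f here is a made-up three-point average, with boundary entries left untouched):

__kernel void stencil3(__global const float *x, __global float *y,
                       unsigned int n)
{
  unsigned int i = get_global_id(0);
  if (i > 0 && i < n - 1)
    y[i] = (x[i-1] + x[i] + x[i+1]) / 3.0f;  /* yi = f(xi-1, xi, xi+1) */
}

Note the separate output array: writing in place would race with neighbors still reading x.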
Examples: in particular, PDE solvers.
Properties:
- Communication is mainly local
- Each datum rests with one processor
Issues: deadlock, livelock, . . .
Finding communication partners across a partition:
Components i, j on unknown partitions pi, pj need to communicate. How does pi find pj (and vice versa)?
Communicate via a third party, pf (i,j). For f : think ‘hash function’.
[Figure: pi announces “i is in pi” and pj announces “j is in pj” to pf (i,j), which then puts the two partitions in touch.]
Pipelines
y = fN(· · · f2(f1(x)) · · ·), where N is fixed.
[Figure: a pipeline — x flows through stages f1, f2, . . . , f6 to produce y.]
Processor assignment?
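A quick back-of-the-envelope illustration (not from the slides): with one processor per stage, N stages, and M items streaming through, the pipeline finishes in N + M − 1 steps instead of the serial N · M, for a speedup of N·M/(N + M − 1) → N as M grows. The pattern only pays off while the pipeline is kept full.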
Specific simple examples:
- Triangular solve (‘backsubstitution’)
- Sorting: emit results as soon as they’re available
(Will see more efficient algorithms for both later.)
Issues:
- Pipeline fills and empties
- Synchronization, deadlock avoidance
- Want: asynchrony. But don’t want: . . .
Reduction
y = f(· · · f(f(x1, x2), x3), . . . , xN), where N is the input size.
Also known as: fold, accumulate (e.g. C++’s std::accumulate).
[Figure: sequential reduction — a left-leaning chain folding x1, . . . , x6 into y one element at a time.]
Painful! Not parallelizable.
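The chain above, as a serial loop (a sketch, with f = addition):

/* y = f(...f(f(x[0], x[1]), x[2])..., x[n-1]) -- each step depends
   on the previous one, so there is no parallelism to exploit. */
float reduce_seq(const float *x, int n)
{
  float y = x[0];
  for (int i = 1; i < n; ++i)
    y = y + x[i];  /* y = f(y, x[i]) */
  return y;
}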
Can we do better? The “tree” above is very imbalanced. What property of f would let us rebalance it?

f (f (x, y), z) = f (x, f (y, z))

Looks less improbable if we let x ◦ y = f (x, y):

x ◦ (y ◦ z) = (x ◦ y) ◦ z

Has a very familiar name: associativity.
[Figure: a balanced binary reduction tree combining x0, . . . , x7 into y.]
Processor allocation?
With material by M. Harris (Nvidia Corp.)
In particular: Use multiple grid invocations to achieve inter-workgroup synchronization.
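A host-side sketch of how such a multi-pass reduction is typically driven (an assumption, not code from the slides; in_buf/out_buf are hypothetical cl_mem handles, buffer setup and error checking elided; the argument layout matches the reduce kernels below):

/* Each pass leaves one partial result per work group; the kernel
   relaunch is the only inter-work-group synchronization point. */
size_t group_size = 256;
unsigned int n = N;  /* number of input elements */
while (n > 1)
{
  size_t num_groups = (n + group_size - 1) / group_size;
  size_t global_size = num_groups * group_size;

  clSetKernelArg(knl, 0, sizeof(cl_mem), &in_buf);
  clSetKernelArg(knl, 1, sizeof(cl_mem), &out_buf);
  clSetKernelArg(knl, 2, sizeof(unsigned int), &n);
  clSetKernelArg(knl, 3, group_size * sizeof(float), NULL);  /* ldata */
  clEnqueueNDRangeKernel(queue, knl, 1, NULL,
                         &global_size, &group_size, 0, NULL, NULL);

  cl_mem tmp = in_buf; in_buf = out_buf; out_buf = tmp;  /* ping-pong */
  n = (unsigned int) num_groups;
}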
__kernel void reduce0(
    __global T *g_idata, __global T *g_odata,
    unsigned int n, __local T *ldata)
{
  unsigned int lid = get_local_id(0);
  unsigned int i = get_global_id(0);

  ldata[lid] = (i < n) ? g_idata[i] : 0;
  barrier(CLK_LOCAL_MEM_FENCE);

  for (unsigned int s = 1; s < get_local_size(0); s *= 2)
  {
    if ((lid % (2*s)) == 0)
      ldata[lid] += ldata[lid + s];
    barrier(CLK_LOCAL_MEM_FENCE);
  }

  if (lid == 0)
    g_odata[get_group_id(0)] = ldata[0];
}
[Figure: interleaved-addressing reduction of 16 values in local (“shared”) memory. Step 1 uses stride 1, step 2 stride 2, step 3 stride 4, step 4 stride 8; the active thread IDs are spread out across the array in each step.]
Issue: slow modulo operation, branch divergence.
__kernel void reduce2(
    __global T *g_idata, __global T *g_odata,
    unsigned int n, __local T *ldata)
{
  unsigned int lid = get_local_id(0);
  unsigned int i = get_global_id(0);

  ldata[lid] = (i < n) ? g_idata[i] : 0;
  barrier(CLK_LOCAL_MEM_FENCE);

  for (unsigned int s = get_local_size(0)/2; s > 0; s >>= 1)
  {
    if (lid < s)
      ldata[lid] += ldata[lid + s];
    barrier(CLK_LOCAL_MEM_FENCE);
  }

  if (lid == 0)
    g_odata[get_group_id(0)] = ldata[0];  /* one partial result per group */
}
[Figure: sequential-addressing reduction of 16 values in local memory. Step 1 uses stride 8, step 2 stride 4, step 3 stride 2, step 4 stride 1; the active thread IDs are contiguous in each step.]
Better! But still not “efficient”: only half of all work items are active after the first round, then a quarter, . . .
Distinguish:
- Step complexity (“span”) T∞: number of steps taken if an infinite number of processors are available.
- Work complexity T1 = ∑_{t=1}^{T∞} St: total number of operations performed, where St is the number of operations in step t.
Both are algorithm-specific!
Lower bounds: . . .
Number of items: N. Actual work to be done: W = O(N) additions.
Step complexity: let d = ⌈log2 N⌉. Then T∞ = d, with St = O(2^(d−t)) operations in step t.
Work complexity: T1 = ∑_{t=1}^{T∞} St = ∑_{t=1}^{d} O(2^(d−t)) = O(2^d) = O(N).
“Work-efficient”: T1 ∼ W. (True here: T1 = O(N) = O(W).)
Theorem (Graham ’68, Brent ’75)
A parallel algorithm with span T∞ and work complexity T1 can be executed on a shared-memory machine with P processors in no more than
TP ≤ T1/P + T∞
steps.
Observations: Estimate for P = 1? Proof sketch?
Again: number of items N. Brent says
TP = O(T1/P + T∞) = O(N/P + log N).
Within a work group: N = P ⇒ TN = O(log N).
But: work groups are an illusion! The machine has finite width, so in reality TP > O(log N). How low can we take P before we hurt our asymptotic runtime TP?
Asymptotically optimal TP = O(log N) is retained for P ≥ N/log N.
Result: we are free to reduce P by a factor of log N without increasing TP. ⇒ Process log N items in sequence per work item without increasing the asymptotic TP.
Think of this in terms of cost: Cost = P · TP. Brent gives a lower bound on P: fewer processors ⇒ less cost!
“Algorithm cascading”
__kernel void reduce6(
    __global T *g_idata, __global T *g_odata,
    unsigned int n, volatile __local T *ldata)
{
  unsigned int lid = get_local_id(0);
  unsigned int i = get_group_id(0)*(get_local_size(0)*2) + get_local_id(0);
  unsigned int gridSize = GROUP_SIZE*2*get_num_groups(0);

  ldata[lid] = 0;
  while (i < n)
  {
    ldata[lid] += g_idata[i];
    if (i + GROUP_SIZE < n)
      ldata[lid] += g_idata[i+GROUP_SIZE];
    i += gridSize;
  }
  barrier(CLK_LOCAL_MEM_FENCE);
  if (GROUP_SIZE >= 512)
  {
    if (lid < 256) { ldata[lid] += ldata[lid + 256]; }
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  // ...
  if (GROUP_SIZE >= 128) { /* ... */ }

  if (lid < 32)
  {
    if (GROUP_SIZE >= 64) { ldata[lid] += ldata[lid + 32]; }
    if (GROUP_SIZE >= 32) { ldata[lid] += ldata[lid + 16]; }
    // ...
    if (GROUP_SIZE >= 2) { ldata[lid] += ldata[lid + 1]; }
  }

  if (lid == 0)
    g_odata[get_group_id(0)] = ldata[0];
}
[Performance plot: time (ms, log scale from 0.01 to 10) vs. number of elements (131072 to 33554432) for the seven kernel versions — 1: interleaved addressing, divergent branches; 2: interleaved addressing, bank conflicts; 3: sequential addressing; 4: first add during global load; 5: unroll last warp; 6: completely unrolled; 7: multiple elements per thread (max 64 blocks).]
Examples: simulations, . . .
Notes:
- Floating point: cancellation? Results can be sensitive to the data type.
- Can use registers for intermediate results (one work item per element?).
Map-Reduce: y = f(· · · f(f(g(x1), g(x2)), g(x3)), . . . , g(xN)), where N is the input size.
[Figure: tree reduction by f over the mapped values g(x0), . . . , g(x7).]
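A sketch of the fused pattern in OpenCL (an assumption, not the lecture's code): apply the map g while loading, then reduce in local memory, so the data is read only once.

/* g is a placeholder map, here squaring -- the combination computes
   a per-group partial sum of squares. */
float g(float v) { return v * v; }

__kernel void map_reduce(__global const float *x, __global float *y,
                         unsigned int n, __local float *ldata)
{
  unsigned int lid = get_local_id(0);
  unsigned int i = get_global_id(0);

  ldata[lid] = (i < n) ? g(x[i]) : 0.0f;  /* map during load */
  barrier(CLK_LOCAL_MEM_FENCE);

  for (unsigned int s = get_local_size(0)/2; s > 0; s >>= 1)
  {
    if (lid < s)
      ldata[lid] += ldata[lid + s];  /* reduce by f = + */
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  if (lid == 0)
    y[get_group_id(0)] = ldata[0];
}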
MapReduce (Google) ≥ map + reduce:
- Aimed at large-scale data processing
- Groups and combines values with identical keys between the map and reduce stages
- The runtime handles data management, load balance, re-running “stragglers”, data locality
Scan
y1 = x1, y2 = f (y1, x2), . . . , yN = f (yN−1, xN), where N is the input size.
[Figure: serial scan — starting from the identity Id, each yi is computed as f of the previous result and xi.]
This can’t possibly be parallelized. Or can it?
Again: need assumptions on f : associativity, commutativity.
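For concreteness, the serial scan as a loop (a sketch, with f = addition):

/* Inclusive scan: y[i] = x[0] + ... + x[i]. The loop-carried
   dependence on y[i-1] is what appears to forbid parallelism. */
void scan_seq(const float *x, float *y, int n)
{
  y[0] = x[0];
  for (int i = 1; i < n; ++i)
    y[i] = y[i-1] + x[i];  /* y[i] = f(y[i-1], x[i]) */
}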
First parallel attempt: in step s, every element adds in the value s entries to its left, doubling s each step. Work-efficient?
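A work-group-local sketch of that naive scheme (an assumption of the approach, not the lecture's code; double-buffered in local memory so each step reads consistent values):

__kernel void scan_naive(__global const float *x, __global float *y,
                         __local float *a, __local float *b)
{
  unsigned int lid = get_local_id(0);
  unsigned int n = get_local_size(0);

  a[lid] = x[get_global_id(0)];
  barrier(CLK_LOCAL_MEM_FENCE);

  for (unsigned int s = 1; s < n; s *= 2)
  {
    b[lid] = (lid >= s) ? a[lid - s] + a[lid] : a[lid];
    barrier(CLK_LOCAL_MEM_FENCE);
    __local float *t = a; a = b; b = t;  /* swap buffers */
  }
  y[get_global_id(0)] = a[lid];  /* inclusive prefix sums */
}

Each of the ⌈log2 N⌉ steps does O(N) additions, so T1 = O(N log N): not work-efficient.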
Two sweeps: upward, downward, both tree-shaped.
On upward sweep:
- Each parent combines the partial results of its two children (a reduction), remembering the left child’s value.
On downward sweep:
- Each parent passes the prefix of everything to its left back down to its children.
Work-efficient? Span relative to the first attempt?
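A serial sketch of the two sweeps (assuming f = + with identity 0 and n a power of two; illustration only, not the lecture's code):

/* In-place exclusive scan (Blelloch-style). After the upward sweep,
   data[i] at each subtree root holds that subtree's sum; the downward
   sweep then distributes left-prefixes back down. */
void scan_two_sweep(float *data, int n)
{
  for (int s = 1; s < n; s *= 2)        /* upward (reduce) sweep */
    for (int i = 2*s - 1; i < n; i += 2*s)
      data[i] += data[i - s];

  data[n - 1] = 0;                      /* identity at the root */
  for (int s = n / 2; s >= 1; s /= 2)   /* downward sweep */
    for (int i = 2*s - 1; i < n; i += 2*s)
    {
      float t = data[i - s];
      data[i - s] = data[i];            /* left child: parent's prefix */
      data[i] += t;                     /* right child: prefix + left sum */
    }
}

Both sweeps do O(N) additions over O(log N) tree levels: work-efficient, at the price of roughly twice the span of the naive scheme.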
Scan applies wherever a sequential dependence of this form appears; many applications are known, and scan serves as a building block in higher-level algorithms.
See G. Blelloch, “Prefix Sums and their Applications”.
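One classic application from that reference: stream compaction, where an exclusive scan of 0/1 keep-flags yields each survivor’s output position. A serial sketch:

/* Keep the entries satisfying a (placeholder) predicate. 'pos' is
   exactly the running exclusive scan of the flags -- in parallel,
   these positions come from a scan, and the writes are independent. */
int compact(const float *x, float *out, int n)
{
  int pos = 0;
  for (int i = 0; i < n; ++i)
  {
    int flag = (x[i] > 0.0f);  /* predicate: keep positive entries */
    if (flag)
      out[pos] = x[i];
    pos += flag;
  }
  return pos;  /* number of elements kept */
}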
Takeaways:
- Learn to recognize the patterns
- Exploit associativity/commutativity
- Use algorithm cascading
- Parallelize at coarser granularities