How Much Parallelism is There in Irregular Applications?
Milind Kulkarni, Martin Burtscher, Rajasekhar Inkulu, Keshav Pingali, and Călin Caşcaval
Application Domain        Algorithms
Data-mining               Agglomerative clustering, k-means
Bayesian inference        Belief propagation, survey propagation
Compilers                 Iterative dataflow, elimination-based dataflow
Functional interpreters   Graph reduction, static/dynamic dataflow
Maxflow                   Preflow-push, augmenting paths
Minimum spanning trees    Prim's, Kruskal's, Boruvka's
N-body methods            Barnes-Hut, fast multipole
Graphics                  Ray-tracing
Linear solvers            Sparse MVM, sparse Cholesky factorization
Event-driven simulation   Time warp, Chandy-Misra-Bryant
Meshing                   Delaunay mesh refinement, triangulation
removing "cavity" and re-triangulating
triangles
processed in any order
when worklist is empty
[Figure: mesh before and after refinement]
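The worklist pattern on this slide (take elements in any order, let a fix create new work, stop when the worklist is empty) can be sketched without a real mesh. The code below is only a 1D stand-in, not Delaunay refinement: a "bad" element is assumed to be a segment longer than `max_len`, and "fixing" it means splitting it in half. The names `refine`, `max_len` and `pick` are hypothetical.

```python
import random

def refine(segments, max_len, pick=None):
    """Split every segment longer than max_len until none remain.

    segments: list of (lo, hi) tuples. A 1D stand-in for a mesh:
    "bad" here just means "too long" (an assumption for illustration).
    pick: optional function n -> index in range(n), choosing which
    worklist element to process next.
    """
    good = [s for s in segments if s[1] - s[0] <= max_len]
    worklist = [s for s in segments if s[1] - s[0] > max_len]
    while worklist:                          # done when the worklist is empty
        i = pick(len(worklist)) if pick else len(worklist) - 1
        lo, hi = worklist.pop(i)             # any element may be taken next
        mid = (lo + hi) / 2
        for piece in ((lo, mid), (mid, hi)):
            if piece[1] - piece[0] > max_len:
                worklist.append(piece)       # the fix created new bad work
            else:
                good.append(piece)
    return sorted(good)

in_order = refine([(0.0, 8.0)], 1.0)
shuffled = refine([(0.0, 8.0)], 1.0, pick=random.Random(0).randrange)
```

In this toy the final result does not depend on the processing order, which is what makes the worklist "unordered".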
can generate new events to send to other nodes
in global time order
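A minimal sequential sketch of this ordered pattern, assuming events are `(time, node, payload)` triples and each node has a handler that may return new events for other nodes. The `simulate` function, the handler signature and the `end_time` cutoff are hypothetical; this is not Time Warp or Chandy-Misra-Bryant, just the baseline "always process the globally earliest event" loop.

```python
import heapq
import itertools

def simulate(initial_events, handlers, end_time):
    """Process events in global time order.

    initial_events: iterable of (time, node, payload).
    handlers: dict node -> function(time, payload) returning a list of
              (delay, target_node, payload) describing new events.
    """
    seq = itertools.count()                  # tie-breaker for equal times
    queue = [(t, next(seq), node, p) for t, node, p in initial_events]
    heapq.heapify(queue)
    log = []
    while queue:
        time, _, node, payload = heapq.heappop(queue)   # earliest event
        if time > end_time:
            break
        log.append((time, node, payload))
        # Processing an event can generate new events for other nodes.
        for delay, target, new_payload in handlers[node](time, payload):
            heapq.heappush(queue, (time + delay, next(seq), target, new_payload))
    return log

handlers = {
    "A": lambda t, p: [(2, "B", p)],   # A forwards to B two time units later
    "B": lambda t, p: [],              # B only consumes events
}
log = simulate([(3, "A", "x"), (1, "A", "y")], handlers, end_time=10)
```

The priority queue is what enforces the ordering constraint: unlike the mesh worklist, elements cannot be taken in arbitrary order.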
unordered worklists of active nodes
accessing neighborhood
nodes
non-overlapping neighborhoods in parallel
respect ordering constraints
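One way to picture "non-overlapping neighborhoods in parallel" is a greedy pass that claims graph elements for each active node in turn. This is a sketch under the assumption that every active node can report the set of elements it will touch; `pick_parallel_round` and the dictionary representation are hypothetical, and ordering constraints are ignored here.

```python
def pick_parallel_round(neighborhoods):
    """Greedily choose active nodes with pairwise disjoint neighborhoods.

    neighborhoods: dict mapping active node -> set of graph elements it
    reads or writes (a hypothetical representation).
    Returns the nodes that could safely be processed in the same round.
    """
    claimed = set()      # elements already owned by a chosen node
    chosen = []
    for node, hood in neighborhoods.items():
        if claimed.isdisjoint(hood):
            chosen.append(node)
            claimed |= hood
    return chosen

round_one = pick_parallel_round({"a": {1, 2}, "b": {2, 3}, "c": {4}})
# "a" and "b" both touch element 2, so only one of them is chosen.
```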
DAG
unit time
process all ready
step
[Figure: dependence DAG (LOAD A1, LOAD B1, LOAD A2, LOAD B2, MUL, MUL, ADD) and its parallelism profile, available parallelism vs. computation step]
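The profile for a DAG can be computed by repeatedly counting and retiring all ready nodes, one unit-time step at a time. The sketch below assumes the seven-instruction example is wired the obvious way (each MUL depends on one A/B pair of loads, the ADD on both MULs); that wiring, and the names `parallelism_profile`, `MUL 1` and `MUL 2`, are assumptions for illustration.

```python
def parallelism_profile(deps):
    """Return the number of ready nodes at each unit-time step.

    deps: dict node -> set of nodes it depends on.
    Every step processes all nodes whose dependences are satisfied.
    """
    remaining = {n: set(d) for n, d in deps.items()}
    profile = []
    while remaining:
        ready = [n for n, d in remaining.items() if not d]
        if not ready:
            raise ValueError("dependence graph has a cycle")
        profile.append(len(ready))
        for n in ready:                      # retire the whole step
            del remaining[n]
        for d in remaining.values():         # release dependents
            d.difference_update(ready)
    return profile

dag = {
    "LOAD A1": set(), "LOAD B1": set(), "LOAD A2": set(), "LOAD B2": set(),
    "MUL 1": {"LOAD A1", "LOAD B1"},
    "MUL 2": {"LOAD A2", "LOAD B2"},
    "ADD": {"MUL 1", "MUL 2"},
}
profile = parallelism_profile(dag)   # [4, 2, 1]
```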
graph, not a DAG
independent elements to process
lead to different amounts of parallelism
resources!
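Picking independent elements from a conflict graph amounts to choosing a maximal independent set, and different choices can give different amounts of parallelism. A small sketch, assuming the conflict graph is given as an adjacency dictionary; the greedy routine and the star-shaped example graph are illustrative, not taken from the slides.

```python
def maximal_independent_set(conflicts, order):
    """Greedy maximal independent set of a conflict graph.

    conflicts: dict node -> set of nodes it conflicts with (undirected).
    order: the sequence in which nodes are considered.
    """
    chosen = set()
    for node in order:
        if conflicts[node].isdisjoint(chosen):   # no conflict with picks so far
            chosen.add(node)
    return chosen

# Star graph: "hub" conflicts with every other element.
conflicts = {
    "hub": {"a", "b", "c"},
    "a": {"hub"}, "b": {"hub"}, "c": {"hub"},
}
hub_first = maximal_independent_set(conflicts, ["hub", "a", "b", "c"])
hub_last = maximal_independent_set(conflicts, ["a", "b", "c", "hub"])
```

Both results are maximal (nothing more can be added), yet one allows a single element to run and the other allows three.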
[Figure: conflict graph]
during execution
scheduling a priori
➡ Solution: execute in stages, recalculate conflict graph after each stage
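The staged approach can be sketched as a loop that re-derives conflicts from the current worklist at every stage. In this sketch the conflict graph is not materialized: a stage is chosen greedily from neighborhood overlaps, which gives the same result as a greedy maximal independent set over the conflict graph. `staged_profile`, `neighborhood` and `execute` are hypothetical names, and the toy input is an assumption.

```python
def staged_profile(worklist, neighborhood, execute):
    """Run in stages; return [(elements_executed, worklist_size), ...].

    neighborhood(item) -> set of elements the item would touch now.
    execute(item) -> list of new work items created by processing it.
    """
    worklist = list(worklist)
    profile = []
    while worklist:
        claimed, stage, deferred = set(), [], []
        for item in worklist:
            hood = neighborhood(item)        # re-evaluated every stage
            if claimed.isdisjoint(hood):
                stage.append(item)
                claimed |= hood
            else:
                deferred.append(item)        # conflicts; retry next stage
        profile.append((len(stage), len(worklist)))
        new_work = []
        for item in stage:
            new_work.extend(execute(item))
        worklist = deferred + new_work
    return profile

# Toy input: item i touches elements {i, i + 1}; no new work is created.
profile = staged_profile(range(6), lambda i: {i, i + 1}, lambda i: [])
percent_executed = [100 * done / size for done, size in profile]
```

The first number of each pair is the available parallelism of that stage; dividing it by the worklist size gives the percentage of the worklist executed.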
Generate parallelism profile by tracking #
[Figure: available parallelism vs. computation step. Input: 100,000 triangle mesh, 47,000 bad triangles]
[Figure: % of worklist executed vs. computation step]
[Figure: available parallelism vs. computation step. Input: 100,000 triangle mesh, 47,000 bad triangles]
[Figure: available parallelism vs. computation step. Input: 100,000 triangle mesh, 47,000 bad triangles]
given a set of points
worklist
triangles, flipping edges
[Figure: available parallelism vs. computation step]
[Figure: % of worklist executed vs. computation step]
[Figure: % of worklist executed vs. computation step]
[Figure: available parallelism vs. computation step]
solving SAT problems
and variables
variables with possible truth values
variables, 4200 clauses
[Figure: % of worklist executed vs. computation step]
[Figure: available parallelism vs. computation step]
based on similarity
point clusters
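For readers unfamiliar with agglomerative clustering, here is a naive sketch of the bottom-up merging it performs. It assumes 1D points and uses centroid distance as the similarity measure; both choices, and the name `agglomerate`, are assumptions for illustration and not necessarily what the measured implementation does.

```python
def agglomerate(points):
    """Repeatedly merge the two closest clusters; return merges in order.

    points: list of numbers (1D points, an assumption for brevity).
    Each cluster is a tuple of points; closeness is centroid distance.
    """
    def centroid(cluster):
        return sum(cluster) / len(cluster)

    clusters = [(p,) for p in points]
    merges = []
    while len(clusters) > 1:
        pairs = ((a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        i, j = min(pairs, key=lambda ab: abs(centroid(clusters[ab[0]])
                                             - centroid(clusters[ab[1]])))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

merges = agglomerate([0.0, 1.0, 10.0])
```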
algorithms
parallel implementation
in paper)
significant parallelism
maxflow algorithms ...
number of threads may be useful