
slide-1
SLIDE 1

How Much Parallelism is There in Irregular Applications?

Milind Kulkarni, Martin Burtscher, Rajasekhar Inkulu and Keshav Pingali
Călin Caşcaval

slide-3
SLIDE 3

Introduction

  • We understand parallelism in regular algorithms
  • e.g., in N×N matrix-matrix multiply, can do N³ multiplications concurrently
  • What about irregular algorithms?
  • Operate on complex, pointer-based data structures such as graphs, trees, etc.
  • Is there much parallelism?

3
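
To make the regular case concrete, here is a small Python sketch (illustrative only, not from the slides): every one of the N³ partial products is independent of the others, so on an ideal machine they could all execute in the same step, and only the per-entry reductions impose ordering.

    # Illustrative sketch: the N^3 multiplications in matrix-matrix multiply
    # have no dependences among themselves; only the additions do.
    N = 4
    a = [[i + j for j in range(N)] for i in range(N)]
    b = [[i * j + 1 for j in range(N)] for i in range(N)]

    # All N^3 partial products are mutually independent (could run concurrently).
    partial = {(i, j, k): a[i][k] * b[k][j]
               for i in range(N) for j in range(N) for k in range(N)}

    # The reduction per output entry is the only source of dependences.
    c = [[sum(partial[i, j, k] for k in range(N)) for j in range(N)]
         for i in range(N)]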

slide-4
SLIDE 4

Example Algorithms

Application Domain          Algorithms
Data-mining                 Agglomerative clustering, k-means
Bayesian inference          Belief propagation, survey propagation
Compilers                   Iterative dataflow, elimination-based dataflow
Functional interpreters     Graph reduction, static/dynamic dataflow
Maxflow                     Preflow-push, augmenting paths
Minimum spanning trees      Prim's, Kruskal's, Boruvka's
N-body methods              Barnes-Hut, fast multipole
Graphics                    Ray-tracing
Linear solvers              Sparse MVM, sparse Cholesky factorization
Event-driven simulation     Time warp, Chandy-Misra-Bryant
Meshing                     Delaunay mesh refinement, triangulation

4

slide-5
SLIDE 5

Example: Delaunay mesh refinement

  • Worklist of bad triangles
  • Process bad triangles by removing “cavity” and re-triangulating
  • May create new bad triangles
  • Triangles can be processed in any order
  • Algorithm terminates when worklist is empty

[Figure: mesh before and after refinement]

5
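
A minimal Python sketch of the refinement loop just described (illustrative only; find_cavity, retriangulate and is_bad are hypothetical stand-ins for the real geometric kernels and are passed in as parameters):

    # Sketch of the Delaunay mesh refinement worklist loop.
    # find_cavity, retriangulate and is_bad are hypothetical placeholders.
    def refine(mesh, bad_triangles, find_cavity, retriangulate, is_bad):
        worklist = list(bad_triangles)
        while worklist:                          # terminates when the worklist is empty
            tri = worklist.pop()                 # bad triangles may be taken in any order
            if tri not in mesh:                  # may already have been removed by an earlier cavity
                continue
            cavity = find_cavity(mesh, tri)      # the triangles surrounding the bad one
            new_triangles = retriangulate(mesh, cavity)  # remove the cavity, re-triangulate it
            worklist.extend(t for t in new_triangles if is_bad(t))  # new bad triangles go back on the worklist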

slide-6
SLIDE 6

Example: Event-driven Simulation

  • Network of nodes
  • Worklist of events, ordered by timestamp
  • Nodes process events, can generate new events to send to other nodes
  • Events must be processed in global time order

[Figure: two-node network (A, B) with pending events labeled by their timestamps]

6
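
In its sequential form the worklist is a priority queue ordered by timestamp; a minimal Python sketch follows (the per-node handler process is a hypothetical stand-in that returns any newly generated events):

    import heapq

    # Sketch of sequential event-driven simulation over a worklist ordered by timestamp.
    def simulate(initial_events, process):
        events = list(initial_events)       # each event is a tuple (timestamp, node, payload)
        heapq.heapify(events)                # worklist ordered by timestamp
        while events:
            time, node, payload = heapq.heappop(events)      # globally earliest event
            for new_event in process(node, time, payload):   # a node may send events to other nodes
                heapq.heappush(events, new_event)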

slide-11
SLIDE 11

Amorphous Data Parallelism

  • Data structure: graph
  • Operate over ordered or unordered worklists of active nodes
  • Process an active node by accessing neighborhood
  • May generate new active nodes
  • Can process nodes with non-overlapping neighborhoods in parallel
  • Ordered worklists: must respect ordering constraints

7
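
Two activities may run in parallel exactly when their neighborhoods do not overlap; a minimal sketch of that test (assuming a hypothetical neighborhood function that returns the set of graph elements an activity reads or writes):

    # Sketch: two active nodes conflict iff their neighborhoods share a graph element.
    def conflicts(a, b, neighborhood):
        return not neighborhood(a).isdisjoint(neighborhood(b))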

slide-12
SLIDE 12

“Available Parallelism”

  • A measure of the maximum amount of parallelism that can be extracted from a program
  • Profile the algorithm, not the system
  • Disregard communication/synchronization costs, run-time overheads and locality concerns

8

slide-13
SLIDE 13

Measuring Parallelism

  • Represent program as a DAG
  • Nodes: operations
  • Edges: dependences
  • Execution strategy:
  • Assume operations take unit time
  • Execute “greedily” – process all ready operations in each step
  • Parallelism profile: # of operations executed in each step

9

[Figure: dependence DAG of a two-element dot product (LOAD A1, LOAD B1, LOAD A2, LOAD B2 → MUL, MUL → ADD) and its parallelism profile: available parallelism vs. computation step]
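
A minimal Python sketch of this measurement for a static DAG (the DAG maps each unit-time operation to the operations that depend on it; the names below are illustrative):

    # Sketch: greedy level-by-level execution of a dependence DAG,
    # recording how many operations execute in each step.
    def parallelism_profile(dag):
        indegree = {op: 0 for op in dag}
        for consumers in dag.values():
            for c in consumers:
                indegree[c] += 1
        ready = [op for op, d in indegree.items() if d == 0]
        profile = []
        while ready:
            profile.append(len(ready))       # operations executed in this step
            next_ready = []
            for op in ready:                 # greedily execute everything that is ready
                for c in dag[op]:
                    indegree[c] -= 1
                    if indegree[c] == 0:
                        next_ready.append(c)
            ready = next_ready
        return profile

    # A dot-product DAG like the one on the slide: four loads feed two multiplies,
    # which feed one add, so the profile is [4, 2, 1].
    dag = {"ldA1": ["mul1"], "ldB1": ["mul1"], "ldA2": ["mul2"], "ldB2": ["mul2"],
           "mul1": ["add"], "mul2": ["add"], "add": []}
    assert parallelism_profile(dag) == [4, 2, 1]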

slide-17
SLIDE 17

Amorphous Data Parallel Algorithms

10

  • No notion of ordering
  • Represent program as a graph, not a DAG
  • Execution: choose set of independent elements to process
  • Different scheduling choices lead to different amounts of parallelism
  • Even with unlimited resources!

slide-19
SLIDE 19

Greedy scheduling

  • Finding schedule to maximize parallelism is NP-hard
  ➡ Solution: Schedule greedily
  • Attempt to maximize work done in current step
  • Choose a maximal independent set in conflict graph

12
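
A minimal sketch of the greedy choice (Python, illustrative only; the conflict graph is given as a map from each active element to the set of elements it conflicts with):

    # Sketch: greedy maximal (not maximum) independent set of a conflict graph.
    def maximal_independent_set(conflict_graph):
        selected = set()
        blocked = set()
        for elem in conflict_graph:          # any visiting order yields a maximal set
            if elem not in blocked:
                selected.add(elem)
                blocked.add(elem)
                blocked.update(conflict_graph[elem])  # its conflicting neighbors are now excluded
        return selected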

slide-20
SLIDE 20

Incremental Execution

  • Conflict graph can change during execution
  • New work generated
  • New conflicts
  • Cannot perform scheduling a priori
  ➡ Solution: execute in stages, recalculate conflict graph after each stage

13

[Figure: conflict graph]

slide-23
SLIDE 23

ParaMeter

  • Tool to generate parallelism profiles for amorphous data-parallel applications
  • Uses greedy scheduling and incremental execution to handle dynamic nature of computation

15

slide-25
SLIDE 25

ParaMeter Execution Strategy

  • While work left:
  • Generate conflict graph for current worklist
  • Execute maximal independent set of nodes in graph
  • Add newly generated work to worklist

16

  • Generate parallelism profile by tracking # of nodes executed in each step
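
Putting the pieces together, a minimal Python sketch of this staged execution (illustrative only; neighborhood and execute are hypothetical stand-ins, and maximal_independent_set is the greedy routine sketched earlier):

    # Sketch of a ParaMeter-style measurement loop (helpers are stand-ins).
    def parameter_profile(initial_work, neighborhood, execute):
        work = set(initial_work)
        profile = []
        while work:                              # while work left
            # Generate the conflict graph for the current worklist.
            conflict_graph = {
                a: {b for b in work
                    if b != a and not neighborhood(a).isdisjoint(neighborhood(b))}
                for a in work
            }
            step = maximal_independent_set(conflict_graph)
            profile.append(len(step))            # parallelism profile: # of nodes executed this step
            new_work = set()
            for elem in step:                    # "execute" the independent set in one step
                new_work.update(execute(elem))
            work = (work - step) | new_work      # add newly generated work to the worklist
        return profile
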
slide-26
SLIDE 26

Experiments

  • Profiled 7 applications:
  • Delaunay mesh refinement
  • Delaunay triangulation
  • Augmenting paths maxflow
  • Preflow push maxflow
  • Survey propagation
  • Agglomerative clustering (unordered)
  • Agglomerative clustering (ordered)

17

slide-27
SLIDE 27

Delaunay Mesh Refinement

[Figure: available parallelism vs. computation step]

Input: 100,000 triangle mesh, 47,000 bad triangles

18

slide-28
SLIDE 28

Parallelism Intensity

  • Available parallelism shows absolute amount of parallelism in program
  • Is parallelism low because there is little work? Or many conflicts?
  • Parallelism intensity: measure what percentage of worklist is executed in parallel

19
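
In other words, intensity normalizes the per-step count by the size of the worklist at the start of that step; as a tiny Python sketch (names illustrative):

    # Sketch: parallelism intensity of one computation step, in percent.
    def parallelism_intensity(num_executed, worklist_size):
        return 100.0 * num_executed / worklist_size

    # e.g. executing 2,500 of 5,000 pending items in a step gives 50% intensity.
    assert parallelism_intensity(2500, 5000) == 50.0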

slide-29
SLIDE 29

Mesh Refinement: Parallelism Intensity

[Figure: % of worklist executed vs. computation step]

20

slide-30
SLIDE 30

Effects of Scheduling on Parallelism

Input: 100,000 triangle mesh, 47,000 bad triangles

21

[Figure: available parallelism vs. computation step]

slide-31
SLIDE 31

Effects of Scheduling on Parallelism

Input: 100,000 triangle mesh, 47,000 bad triangles

22

slide-32
SLIDE 32

Effects of Scheduling on Parallelism

Input: 100,000 triangle mesh, 47,000 bad triangles

23

[Figure: available parallelism vs. computation step]

slide-33
SLIDE 33

Delaunay Triangulation

  • Build a Delaunay mesh given a set of points
  • Points in an unordered worklist
  • Insert points by splitting triangles, flipping edges
  • Input: 10,000 points

[Figures: available parallelism and % of worklist executed vs. computation step]

24

slide-34
SLIDE 34

Survey Propagation

  • Heuristic approach to solving SAT problems
  • Bipartite graph of clauses and variables
  • Iteratively update variables with possible truth values
  • Input: formula with 1000 variables, 4200 clauses

[Figures: available parallelism and % of worklist executed vs. computation step]

25

slide-35
SLIDE 35

Agglomerative Clustering

  • Cluster a set of points based on similarity
  • Unordered worklist of point clusters
  • Builds a tree bottom-up
  • Input: 20,000 points

[Figures: available parallelism and % of worklist executed vs. computation step]

26

slide-36
SLIDE 36

Conclusions

  • ParaMeter: first tool to measure parallelism in irregular algorithms
  • Provides insight into an algorithm, rather than a particular parallel implementation
  • Also: ordered worklists, constrained parallelism (details in paper)
  • ParaMeter shows that important irregular algorithms have significant parallelism
  • Mesh algorithms, SAT solvers, data mining algorithms, maxflow algorithms ...
  • Parallelism varies during execution ➔ adaptive control of number of threads may be useful

27

slide-37
SLIDE 37

Thank you!

http://www.ices.utexas.edu/~milind milind@ices.utexas.edu