

SLIDE 1

Lonestar: A Suite of Parallel Irregular Programs

Milind Kulkarni, Martin Burtscher, Călin Caşcaval, and Keshav Pingali

Tuesday, April 21, 2009


SLIDE 3

Why Another Benchmark Suite?

  • We understand parallelism in regular algorithms: e.g., in N×N matrix-matrix multiply, all N³ multiplications can be done concurrently
  • What about irregular algorithms?
  • They operate on complex, pointer-based data structures such as graphs and trees
  • Their behavior is very input dependent
  • Is there much parallelism? Can this parallelism be exploited?
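The regular case can be made concrete with a short sketch (illustrative Python, not part of the suite): every output cell of the product depends only on one row of A and one column of B, so the N³ scalar multiplications are all independent.

```python
# Illustrative sketch of regular parallelism: each C[i][j] depends only
# on row i of A and column j of B, so all N^3 scalar products
# A[i][k] * B[k][j] are independent and could run concurrently,
# with only the per-cell sum needing to be combined.
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)]

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # -> [[19, 22], [43, 50]]
```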


SLIDE 4

Example Algorithms

Application Domain       Algorithms
Data-mining              Agglomerative clustering, k-means
Bayesian inference       Belief propagation, survey propagation
Compilers                Iterative dataflow, elimination-based dataflow
Functional interpreters  Graph reduction, static/dynamic dataflow
Maxflow                  Preflow-push, augmenting paths
Minimum spanning trees   Prim's, Kruskal's, Boruvka's
N-body methods           Barnes-Hut, fast multipole
Graphics                 Ray tracing
Linear solvers           Sparse MVM, sparse Cholesky factorization
Event-driven simulation  Time warp, Chandy-Misra-Bryant
Meshing                  Delaunay mesh refinement, triangulation


SLIDE 5

Example: Delaunay Mesh Refinement

  • Worklist of bad triangles
  • Process a bad triangle by removing its “cavity” and re-triangulating
  • Re-triangulation may create new bad triangles
  • Triangles can be processed in any order
  • Algorithm terminates when the worklist is empty

[Figure: mesh before and after refinement]
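The worklist structure of refinement can be sketched as follows (a toy Python stand-in: real Delaunay cavity computation is geometric, whereas here "badness" is just a number above 1.0 and "re-triangulation" splits an element in two; both are hypothetical, chosen only to show the control structure):

```python
from collections import deque

# Toy sketch of the worklist pattern behind Delaunay mesh refinement.
# An element is "bad" while its value exceeds 1.0; refining it removes
# it and adds two elements of half the value, which may themselves be bad.
def refine(elements, is_bad=lambda e: e > 1.0):
    mesh = list(elements)
    worklist = deque(e for e in mesh if is_bad(e))
    while worklist:                 # any order works (unordered worklist)
        e = worklist.popleft()
        if e not in mesh:           # may already have been removed
            continue
        mesh.remove(e)              # "remove the cavity"
        new = [e / 2, e / 2]        # "re-triangulate"
        mesh.extend(new)
        worklist.extend(x for x in new if is_bad(x))  # new bad elements
    return mesh

print(sorted(refine([4.0, 0.5])))  # -> [0.5, 1.0, 1.0, 1.0, 1.0]
```

The loop terminates exactly when the worklist is empty, mirroring the algorithm described on the slide.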


SLIDE 6

Example: Event-driven Simulation

  • Network of nodes
  • Worklist of events, ordered by timestamp
  • Nodes process events and can generate new events to send to other nodes
  • Events must be processed in global time order
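The event loop can be sketched with a priority queue (minimal Python; the two-node network and the 3-time-unit forwarding delay are made up purely for illustration):

```python
import heapq

# Minimal sketch of sequential event-driven simulation: a priority
# queue keyed by timestamp guarantees global time order. Each node
# forwards an event to the other node with a fixed delay, up to a
# time horizon.
def simulate(initial_events, horizon=10):
    queue = list(initial_events)            # (timestamp, node) pairs
    heapq.heapify(queue)
    log = []
    while queue:
        time, node = heapq.heappop(queue)   # earliest event first
        if time > horizon:
            continue
        log.append((time, node))
        other = "B" if node == "A" else "A"
        heapq.heappush(queue, (time + 3, other))  # generate a new event
    return log

print(simulate([(0, "A")], horizon=9))
# -> [(0, 'A'), (3, 'B'), (6, 'A'), (9, 'B')]
```

Because events pop in timestamp order, the log is always globally time-ordered, which is the correctness condition the slide states.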

[Figure: nodes A and B exchanging timestamped events]


SLIDE 11

A Unified Approach to Irregular Algorithms

  • Want to raise the level of abstraction and find commonalities between algorithms
  • Inspired by Wirth’s aphorism, “Program = Algorithm + Data Structure”
  • Abstract data structure: a graph
  • Abstract algorithm:
  • Operate over ordered or unordered worklists of active nodes
  • Process an active node by accessing its neighborhood
  • May generate new active nodes



SLIDE 14

Amorphous Data Parallelism

  • Where’s the parallelism?
  • Nodes with non-overlapping neighborhoods can be processed in parallel
  • Ordered worklists: must also respect ordering constraints
  • In general, must use optimistic/speculative parallelism


SLIDE 15

Lonestar Benchmark Suite

  • Suite of irregular programs that exhibit amorphous data parallelism

  • Agglomerative clustering (AC)
  • Barnes-Hut (BH)
  • Delaunay mesh refinement (DMR)
  • Delaunay triangulation (DT)
  • Survey propagation (SP)


SLIDE 16

Why These Applications?

  • Real-world applications that perform a substantial amount of work
  • Algorithms have significant potential for parallelism
  • Parallel implementations exhibit significant speedup



SLIDE 18

Performance Characteristics

  • Used performance counters on a SPARC IV platform to gather performance characteristics of sequential execution

Application  Input Size                   Iterations  Memory Footprint  Instructions/Iteration  Memory Acc/Iteration  L1d Miss Rate
AC           2M points                    1,999,999   1,039 MB          67,920                  13,832                7.1%
BH           220K bodies                  220,000     41 MB             199,167                 49,789                14.1%
DMR          550K triangles               1,297,380   2,545 MB          72,747                  22,684                31.7%
DT           80K points                   80,000      927 MB            262,952                 91,547                41.1%
SP           500 variables, 2100 clauses  4,492,403   8 MB              177,885                 42,016                2.1%

SLIDE 19

Summary of Characteristics

  • In each benchmark, an average iteration executes tens of thousands of instructions
  • Benchmarks perform many memory accesses
  • Three of five benchmarks exhibit high L1 data cache miss rates



SLIDE 21

Measuring Potential Parallelism

  • How much parallelism actually exists in these applications, independent of implementation details, particular architectures, etc.?
  • Used ParaMeter to measure available parallelism in Lonestar applications
  • Kulkarni et al., “How Much Parallelism is There in Irregular Applications?”
  • Key idea: determine how many active nodes have non-overlapping neighborhoods
  • Generate parallelism profiles: number of parallel computations executed in each step
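The profile computation can be sketched as a greedy per-step scheduler (greatly simplified Python; real ParaMeter replays the actual benchmark execution, and the toy graph below is hypothetical):

```python
# Sketch of the ParaMeter idea: in each step, greedily select a maximal
# set of active nodes whose neighborhoods (sets of graph elements) do
# not overlap; the size of that set is the available parallelism for
# the step. Nodes not selected are retried in the next step.
def parallelism_profile(active, neighborhood):
    profile = []
    remaining = list(active)
    while remaining:
        touched, selected, deferred = set(), [], []
        for node in remaining:
            hood = neighborhood(node)
            if touched.isdisjoint(hood):    # no conflict in this step
                touched |= hood
                selected.append(node)
            else:
                deferred.append(node)       # conflicts; retry next step
        profile.append(len(selected))
        remaining = deferred
    return profile

# Toy graph: each node's neighborhood is itself plus its neighbor, so
# nodes 1/2 conflict with each other, as do nodes 3/4.
adj = {1: {1, 2}, 2: {2, 1}, 3: {3, 4}, 4: {4, 3}}
print(parallelism_profile([1, 2, 3, 4], lambda n: adj[n]))  # -> [2, 2]
```

This omits one feature of the real tool: processing an active node can generate new active nodes, which would simply be appended to `remaining`.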


SLIDE 22

ParaMeter Results – Delaunay Mesh Refinement

  • Input: 220K bad triangles
  • ~50% of triangles are badly shaped
  • Bell-shaped profile reflects increasing size of mesh

[Plot: available parallelism vs. computation step]

SLIDE 23

[Plot: available parallelism vs. computation step]

ParaMeter Results – Agglomerative Clustering

  • Data-mining algorithm that clusters points based on similarity
  • Builds a binary tree in bottom-up manner
  • Input: 1,000,000 points
  • Parallelism determined by the structure of the binary tree


SLIDE 24

[Plot: available parallelism vs. computation step]

ParaMeter Results – Delaunay Triangulation

  • Constructs a Delaunay mesh from a set of input points
  • Input size: 40,000 points
  • Bell-shaped profile similar to DMR’s
  • Less parallelism in the beginning because the mesh starts very small


SLIDE 25

ParaMeter Results – Survey Propagation

  • SAT-solving heuristic
  • Formula represented as a bipartite graph
  • Iterate over variables, updating the guess for each truth value
  • Input: 350 variables, 1470 clauses
  • Parallelism profile is more uniform, as the graph doesn’t change dramatically
  • Parallelism drops as variables are assigned truth values

[Plot: available parallelism vs. computation step]

SLIDE 26

Available Parallelism Summary

  • All Lonestar benchmarks have significant available parallelism
  • Different benchmarks display different parallelism behaviors: parallelism is clearly application dependent
  • Available parallelism increases for each benchmark as input size increases



SLIDE 28

Parallel Implementation

  • Parallelized all benchmarks using the Galois system
  • Kulkarni et al., “Optimistic Parallelism Requires Abstractions”
  • All code written in Java 1.6, run on the HotSpot VM
  • Evaluated on three platforms:
  • 128-processor UltraSPARC IV
  • 16-core UltraSPARC T1
  • 16-core x86
  • Measured speedup vs. a sequential baseline implementation


SLIDE 29

Speedups – AC

[Plot: speedup vs. number of cores]

SLIDE 30

Speedups – BH

[Plot: speedup vs. number of cores]

SLIDE 31

Speedups – DMR

[Plot: speedup vs. number of cores]

SLIDE 32

Speedups – DT

[Plot: speedup vs. number of cores]

SLIDE 33

Speedups – SP

[Plot: speedup vs. number of cores]

SLIDE 34

Speedups Summary

  • 4 of 5 benchmarks achieve 50% parallel efficiency up to 16 cores
  • Survey propagation scales least and has the lowest available parallelism
  • Many benchmarks scale better with larger inputs
  • Many benchmarks are bandwidth bound: poorer speedup on the bandwidth-limited x86 system
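For reference, parallel efficiency is speedup divided by core count, so 50% efficiency at 16 cores corresponds to a speedup of at least 8× over the sequential baseline (a one-line sketch):

```python
# Parallel efficiency = speedup / number of cores.
# "50% parallel efficiency up to 16 cores" therefore implies
# a speedup of at least 0.5 * 16 = 8x over the sequential baseline.
def efficiency(speedup, cores):
    return speedup / cores

print(efficiency(8, 16))  # -> 0.5
```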


SLIDE 35

Conclusions

  • Lonestar benchmarks are a set of irregular programs that exhibit amorphous data parallelism
  • Real-world applications spanning many important domains
  • Each benchmark performs a significant amount of work
  • Benchmarks contain significant parallelism, both in theory and in practice


SLIDE 36

Lonestar Version 2.0

  • Lonestar version 2.0 available for download at: http://www.iss.ices.utexas.edu/lonestar

  • Still adding new applications
  • Looking for more input sets


SLIDE 37

Thank you!

http://www.ices.utexas.edu/~milind
milind@ices.utexas.edu
