

SLIDE 1

Lonestar: A Suite of Parallel Irregular Programs

Milind Kulkarni, Martin Burtscher, Călin Caşcaval, and Keshav Pingali

Tuesday, April 21, 2009


SLIDE 3

Why Another Benchmark Suite?

  • We understand parallelism in regular algorithms: e.g., in N×N matrix-matrix multiply, all N³ multiplications can be done concurrently
  • What about irregular algorithms?
  • They operate on complex, pointer-based data structures such as graphs and trees
  • Their behavior is very input dependent
  • Is there much parallelism? Can this parallelism be exploited?
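The regular case can be made concrete with a short sketch (illustrative Python, not part of the suite): every output cell of the product depends only on one row of A and one column of B, so the N³ scalar multiplications are all independent.

```python
# Illustrative sketch of regular parallelism: each C[i][j] depends only
# on row i of A and column j of B, so all N^3 scalar products
# A[i][k] * B[k][j] are independent and could run concurrently,
# with only the per-cell sum needing to be combined.
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)]

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # -> [[19, 22], [43, 50]]
```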


SLIDE 4

Example Algorithms

Application Domain       Algorithms
Data-mining              Agglomerative clustering, k-means
Bayesian inference       Belief propagation, survey propagation
Compilers                Iterative dataflow, elimination-based dataflow
Functional interpreters  Graph reduction, static/dynamic dataflow
Maxflow                  Preflow-push, augmenting paths
Minimum spanning trees   Prim's, Kruskal's, Boruvka's
N-body methods           Barnes-Hut, fast multipole
Graphics                 Ray tracing
Linear solvers           Sparse MVM, sparse Cholesky factorization
Event-driven simulation  Time warp, Chandy-Misra-Bryant
Meshing                  Delaunay mesh refinement, triangulation


SLIDE 5

Example: Delaunay Mesh Refinement

  • Worklist of bad triangles
  • Process a bad triangle by removing its “cavity” and re-triangulating
  • Re-triangulation may create new bad triangles
  • Triangles can be processed in any order
  • Algorithm terminates when the worklist is empty

[Figure: mesh before and after refinement]
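The worklist structure of refinement can be sketched as follows (a toy Python stand-in: real Delaunay cavity computation is geometric, whereas here "badness" is just a number above 1.0 and "re-triangulation" splits an element in two; both are hypothetical, chosen only to show the control structure):

```python
from collections import deque

# Toy sketch of the worklist pattern behind Delaunay mesh refinement.
# An element is "bad" while its value exceeds 1.0; refining it removes
# it and adds two elements of half the value, which may themselves be bad.
def refine(elements, is_bad=lambda e: e > 1.0):
    mesh = list(elements)
    worklist = deque(e for e in mesh if is_bad(e))
    while worklist:                 # any order works (unordered worklist)
        e = worklist.popleft()
        if e not in mesh:           # may already have been removed
            continue
        mesh.remove(e)              # "remove the cavity"
        new = [e / 2, e / 2]        # "re-triangulate"
        mesh.extend(new)
        worklist.extend(x for x in new if is_bad(x))  # new bad elements
    return mesh

print(sorted(refine([4.0, 0.5])))  # -> [0.5, 1.0, 1.0, 1.0, 1.0]
```

The loop terminates exactly when the worklist is empty, mirroring the algorithm described on the slide.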


SLIDE 6

Example: Event-driven Simulation

  • Network of nodes
  • Worklist of events, ordered by timestamp
  • Nodes process events and can generate new events to send to other nodes
  • Events must be processed in global time order
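The event loop can be sketched with a priority queue (minimal Python; the two-node network and the 3-time-unit forwarding delay are made up purely for illustration):

```python
import heapq

# Minimal sketch of sequential event-driven simulation: a priority
# queue keyed by timestamp guarantees global time order. Each node
# forwards an event to the other node with a fixed delay, up to a
# time horizon.
def simulate(initial_events, horizon=10):
    queue = list(initial_events)            # (timestamp, node) pairs
    heapq.heapify(queue)
    log = []
    while queue:
        time, node = heapq.heappop(queue)   # earliest event first
        if time > horizon:
            continue
        log.append((time, node))
        other = "B" if node == "A" else "A"
        heapq.heappush(queue, (time + 3, other))  # generate a new event
    return log

print(simulate([(0, "A")], horizon=9))
# -> [(0, 'A'), (3, 'B'), (6, 'A'), (9, 'B')]
```

Because events pop in timestamp order, the log is always globally time-ordered, which is the correctness condition the slide states.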

[Figure: nodes A and B exchanging timestamped events]


SLIDE 11

A Unified Approach to Irregular Algorithms

  • Want to raise the level of abstraction and find commonalities between algorithms
  • Inspired by Wirth’s aphorism, “Program = Algorithm + Data Structure”
  • Abstract data structure: a graph
  • Abstract algorithm:
  • Operate over ordered or unordered worklists of active nodes
  • Process an active node by accessing its neighborhood
  • May generate new active nodes



SLIDE 14

Amorphous Data Parallelism

  • Where’s the parallelism?
  • Nodes with non-overlapping neighborhoods can be processed in parallel
  • Ordered worklists: must also respect ordering constraints
  • In general, must use optimistic/speculative parallelism


SLIDE 15

Lonestar Benchmark Suite

  • Suite of irregular programs that exhibit amorphous data parallelism

  • Agglomerative clustering (AC)
  • Barnes-Hut (BH)
  • Delaunay mesh refinement (DMR)
  • Delaunay triangulation (DT)
  • Survey propagation (SP)


SLIDE 16

Why These Applications?

  • Real-world applications that perform a substantial amount of work
  • Algorithms have significant potential for parallelism
  • Parallel implementations exhibit significant speedup



SLIDE 18

Performance Characteristics

  • Used performance counters on a SPARC IV platform to gather performance characteristics of sequential execution

Application  Input Size                   Iterations  Memory Footprint  Instructions/Iteration  Memory Acc/Iteration  L1d Miss Rate
AC           2M points                    1,999,999   1,039 MB          67,920                  13,832                7.1%
BH           220K bodies                  220,000     41 MB             199,167                 49,789                14.1%
DMR          550K triangles               1,297,380   2,545 MB          72,747                  22,684                31.7%
DT           80K points                   80,000      927 MB            262,952                 91,547                41.1%
SP           500 variables, 2100 clauses  4,492,403   8 MB              177,885                 42,016                2.1%

SLIDE 19

Summary of Characteristics

  • In each benchmark, an average iteration executes tens of thousands of instructions
  • Benchmarks perform many memory accesses
  • Three of five benchmarks exhibit high L1 data cache miss rates



SLIDE 21

Measuring Potential Parallelism

  • How much parallelism actually exists in these applications, independent of implementation details, particular architectures, etc.?
  • Used ParaMeter to measure available parallelism in Lonestar applications
  • Kulkarni et al., “How Much Parallelism is There in Irregular Applications?”
  • Key idea: determine how many active nodes have non-overlapping neighborhoods
  • Generate parallelism profiles: number of parallel computations executed in each step
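The profile computation can be sketched as a greedy per-step scheduler (greatly simplified Python; real ParaMeter replays the actual benchmark execution, and the toy graph below is hypothetical):

```python
# Sketch of the ParaMeter idea: in each step, greedily select a maximal
# set of active nodes whose neighborhoods (sets of graph elements) do
# not overlap; the size of that set is the available parallelism for
# the step. Nodes not selected are retried in the next step.
def parallelism_profile(active, neighborhood):
    profile = []
    remaining = list(active)
    while remaining:
        touched, selected, deferred = set(), [], []
        for node in remaining:
            hood = neighborhood(node)
            if touched.isdisjoint(hood):    # no conflict in this step
                touched |= hood
                selected.append(node)
            else:
                deferred.append(node)       # conflicts; retry next step
        profile.append(len(selected))
        remaining = deferred
    return profile

# Toy graph: each node's neighborhood is itself plus its neighbor, so
# nodes 1/2 conflict with each other, as do nodes 3/4.
adj = {1: {1, 2}, 2: {2, 1}, 3: {3, 4}, 4: {4, 3}}
print(parallelism_profile([1, 2, 3, 4], lambda n: adj[n]))  # -> [2, 2]
```

This omits one feature of the real tool: processing an active node can generate new active nodes, which would simply be appended to `remaining`.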


SLIDE 22

ParaMeter Results – Delaunay Mesh Refinement

  • Input: 220K bad triangles
  • ~50% of triangles are badly shaped
  • Bell-shaped profile reflects increasing size of mesh

[Plot: available parallelism vs. computation step]

SLIDE 23

[Plot: available parallelism vs. computation step]

ParaMeter Results – Agglomerative Clustering

  • Data-mining algorithm that clusters points based on similarity
  • Builds a binary tree in bottom-up manner
  • Input: 1,000,000 points
  • Parallelism determined by the structure of the binary tree


SLIDE 24

[Plot: available parallelism vs. computation step]

ParaMeter Results – Delaunay Triangulation

  • Constructs a Delaunay mesh from a set of input points
  • Input size: 40,000 points
  • Bell-shaped profile similar to DMR’s
  • Less parallelism in the beginning because the mesh starts very small


SLIDE 25

ParaMeter Results – Survey Propagation

  • SAT-solving heuristic
  • Formula represented as a bipartite graph
  • Iterate over variables, updating the guess for each truth value
  • Input: 350 variables, 1470 clauses
  • Parallelism profile is more uniform, as the graph doesn’t change dramatically
  • Parallelism drops as variables are assigned truth values

[Plot: available parallelism vs. computation step]

SLIDE 26

Available Parallelism Summary

  • All Lonestar benchmarks have significant available parallelism
  • Different benchmarks display different parallelism behaviors: parallelism is clearly application dependent
  • Available parallelism increases for each benchmark as input size increases



SLIDE 28

Parallel Implementation

  • Parallelized all benchmarks using the Galois system
  • Kulkarni et al., “Optimistic Parallelism Requires Abstractions”
  • All code written in Java 1.6, run on the HotSpot VM
  • Evaluated on three platforms:
  • 128-processor UltraSPARC IV
  • 16-core UltraSPARC T1
  • 16-core x86
  • Measured speedup vs. a sequential baseline implementation


SLIDE 29

Speedups – AC

[Plot: speedup vs. number of cores]

SLIDE 30

Speedups – BH

[Plot: speedup vs. number of cores]

SLIDE 31

Speedups – DMR

[Plot: speedup vs. number of cores]

SLIDE 32

Speedups – DT

[Plot: speedup vs. number of cores]

SLIDE 33

Speedups – SP

[Plot: speedup vs. number of cores]

SLIDE 34

Speedups Summary

  • 4 of 5 benchmarks achieve 50% parallel efficiency up to 16 cores
  • Survey propagation scales least and has the lowest available parallelism
  • Many benchmarks scale better with larger inputs
  • Many benchmarks are bandwidth bound: poorer speedup on the bandwidth-limited x86 system
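For reference, parallel efficiency is speedup divided by core count, so 50% efficiency at 16 cores corresponds to a speedup of at least 8× over the sequential baseline (a one-line sketch):

```python
# Parallel efficiency = speedup / number of cores.
# "50% parallel efficiency up to 16 cores" therefore implies
# a speedup of at least 0.5 * 16 = 8x over the sequential baseline.
def efficiency(speedup, cores):
    return speedup / cores

print(efficiency(8, 16))  # -> 0.5
```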


SLIDE 35

Conclusions

  • Lonestar benchmarks are a set of irregular programs that exhibit amorphous data parallelism
  • Real-world applications spanning many important domains
  • Each benchmark performs a significant amount of work
  • Benchmarks contain significant parallelism, both in theory and in practice


SLIDE 36

Lonestar Version 2.0

  • Lonestar version 2.0 available for download at: http://www.iss.ices.utexas.edu/lonestar

  • Still adding new applications
  • Looking for more input sets


SLIDE 37

Thank you!

http://www.ices.utexas.edu/~milind
milind@ices.utexas.edu
