SLIDE 1

Optimistic Parallelism Benefits from Data Partitioning

Milind Kulkarni, Keshav Pingali, Ganesh Ramanarayanan, Bruce Walter, Kavita Bala and L. Paul Chew


SLIDE 3

Parallelism in Irregular Programs

  • Many irregular programs use iterative algorithms over worklists of various kinds:
  • Delaunay mesh refinement
  • Image segmentation using graph cuts
  • Agglomerative clustering
  • Delaunay triangulation
  • SAT solvers
  • Iterative data-flow analysis
  • ...

SLIDE 4

Running Example: Mesh Refinement

wl.add(mesh.badTriangles());
while (wl.size() != 0) {
    Element e = wl.get();
    if (e no longer in mesh) continue;
    Cavity c = new Cavity(e);
    c.expand();
    c.retriangulate();
    mesh.update(c);
    wl.add(c.badTriangles());
}
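
The pseudocode above follows a general worklist pattern: pop an element, process it, and push any new work the processing creates. A minimal runnable sketch of that pattern, with a stand-in split step in place of real cavity retriangulation (the `run` method and its integer "elements" are illustrative, not the paper's API):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class WorklistDemo {
    /** Pop, process, push new work: the loop shape of the refinement slide.
        Stand-in rule: an "element" larger than 1 splits into two halves. */
    public static int run(int seed) {
        Deque<Integer> wl = new ArrayDeque<>(); // stack order, as in the sequential version
        wl.push(seed);
        int processed = 0;
        while (!wl.isEmpty()) {
            int e = wl.pop();      // Element e = wl.get();
            processed++;           // stand-in for expand / retriangulate / update
            if (e > 1) {           // "bad" work remains: add new elements mid-iteration
                wl.push(e / 2);
                wl.push(e / 2);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(run(4)); // processes 7 work items in total
    }
}
```

Note that new work is added while the loop runs, which is exactly what makes the iteration space dynamic and hard to parallelize statically.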

SLIDE 5

Generalized Data Parallelism

  • Process elements from worklist in parallel
  • Deciding if cavities overlap must be done at runtime
  • Can use optimistic parallelism
  • Speculatively process two triangles from worklist
  • If cavities overlap, roll back one iteration
  • Implementation: Galois System (PLDI ’07)
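
The speculate-and-roll-back idea can be sketched with per-element ownership: an iteration claims the elements its cavity touches, and if any element is already claimed by another in-flight iteration, it undoes its partial claims. This is an illustrative sketch under our own ownership-array model, not the Galois implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SpeculationDemo {
    /** owner[n] = id of the iteration speculating on element n, or -1 if free. */
    public static boolean tryClaim(int[] owner, int iter, int... cavity) {
        List<Integer> claimed = new ArrayList<>();
        for (int n : cavity) {
            if (owner[n] != -1 && owner[n] != iter) { // overlap with another iteration
                for (int c : claimed) owner[c] = -1;  // roll back: release partial claims
                return false;
            }
            owner[n] = iter;
            claimed.add(n);
        }
        return true; // whole cavity claimed; iteration may proceed and later commit
    }

    public static void main(String[] args) {
        int[] owner = new int[8];
        Arrays.fill(owner, -1);
        System.out.println(tryClaim(owner, 0, 1, 2, 3)); // true: cavity is free
        System.out.println(tryClaim(owner, 1, 3, 4));    // false: element 3 overlaps
        System.out.println(owner[4]);                    // -1: partial claim rolled back
    }
}
```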

SLIDE 6

Scalability Issues

  • General Parallelization Issues
  • Maintaining locality
  • Reducing contention for shared data structures
  • Optimistic Parallelization Issues
  • Reducing mis-speculation
  • Lowering cost of run-time conflict detection

SLIDE 7

Locality vs. Parallelism in Mesh Refinement

  • In sequential version, worklist implemented as stack, for locality
  • If run in parallel, high likelihood of cavity overlap
  • Another option: assign work to cores randomly, to reduce likelihood of conflict
  • Reduces locality

SLIDE 8

Outline

  • Overview of Galois System
  • Addressing Scalability
  • Data Partitioning
  • Computation Partitioning
  • Lock Coarsening
  • Evaluation and Conclusion

SLIDE 9

The Galois System

  • Programming model and implementation to support optimistic parallelization of irregular programs
  • User code: what to parallelize
  • Class Libraries + Runtime: how to parallelize correctly

“Optimistic Parallelism Requires Abstractions,” PLDI 2007

(Diagram: User Code / Class Libraries / Runtime layers)

SLIDE 10

User Code

  • Sequential semantics
  • Use optimistic set iterator to expose opportunities for exploiting data parallelism
  • Can add new elements to set during iteration

foreach e in Set s do B(e)

SLIDE 11

What to Parallelize

wl.add(mesh.badTriangles());
while (wl.size() != 0) {
    Element e = wl.get();
    if (e no longer in mesh) continue;
    Cavity c = new Cavity(e);
    c.expand();
    c.retriangulate();
    mesh.update(c);
    wl.add(c.badTriangles());
}

SLIDE 12

What to Parallelize

wl.add(mesh.badTriangles());
foreach Element e in wl {
    if (e no longer in mesh) continue;
    Cavity c = new Cavity(e);
    c.expand();
    c.retriangulate();
    mesh.update(c);
    wl.add(c.badTriangles());
}

SLIDE 13

Execution Model

  • Shared memory encapsulated in objects
  • Program runs sequentially until set iterator encountered
  • Multiple threads execute iterations from worklist
  • Scheduler assigns work to threads

SLIDE 14

Class Libraries + Runtime

  • Ensure that iterations run in parallel only if independent
  • Detect dependences between iterations using semantic commutativity
  • Uses semantic properties of objects to determine dependence
  • If conflict, roll back using undo methods
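
Semantic commutativity can be illustrated on a set: two invocations conflict only if their order would change a result or the final state. For instance, `contains(x)` commutes with `add(y)` unless x equals y and x was absent. A hypothetical sketch of that one check (a simplification we chose for illustration, not the Galois library's code):

```java
import java.util.HashSet;
import java.util.Set;

public class CommuteDemo {
    /** Would contains(b) and add(a) return the same results in either order,
        given the current set state? If yes, the two operations commute and the
        iterations issuing them are independent with respect to this pair. */
    public static boolean containsCommutesWithAdd(Set<Integer> s, int b, int a) {
        // add(a) can change the answer of contains(b) only when a == b
        // and a is not already present (i.e. the add actually changes state).
        return b != a || s.contains(a);
    }

    public static void main(String[] args) {
        Set<Integer> s = new HashSet<>();
        s.add(1);
        System.out.println(containsCommutesWithAdd(s, 1, 2)); // true: different elements
        System.out.println(containsCommutesWithAdd(s, 2, 2)); // false: add(2) flips contains(2)
        System.out.println(containsCommutesWithAdd(s, 1, 1)); // true: 1 already present
    }
}
```

The point of checking commutativity rather than raw memory reads and writes is that two adds to the same set object commute even though they touch the same memory.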

SLIDE 15

Outline

  • Overview of Galois System
  • Addressing Scalability
  • Data Partitioning
  • Computation Partitioning
  • Lock Coarsening
  • Evaluation and Conclusion

SLIDE 16

Data Partitioning

(Diagram: Graph mapped to Physical Cores)

SLIDE 17

Abstract Domain

  • Set of abstract processors mapped to physical cores
  • Data structure elements mapped to abstract processors
  • Allows for overdecomposition
  • More abstract processors than cores
  • Useful in many contexts (e.g. load balancing)
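
The two-level mapping can be sketched directly: elements map to abstract processors, abstract processors map to cores, and with more abstract processors than cores a whole abstract processor can be reassigned for load balancing without remapping individual elements. The modulo mappings below are our own illustrative stand-ins for a real partitioner:

```java
public class AbstractDomain {
    final int numAbstract;  // number of abstract processors (partitions)
    final int[] coreOf;     // abstract processor -> physical core

    public AbstractDomain(int numAbstract, int numCores) {
        this.numAbstract = numAbstract;
        this.coreOf = new int[numAbstract];
        for (int p = 0; p < numAbstract; p++)
            coreOf[p] = p % numCores;       // initial round-robin placement
    }

    /** Element -> abstract processor (stand-in for a real graph partitioner). */
    public int partitionOf(int elementId) {
        return elementId % numAbstract;
    }

    /** Element -> core, always via its abstract processor. */
    public int coreOf(int elementId) {
        return coreOf[partitionOf(elementId)];
    }

    /** Overdecomposition pays off here: move one partition to another core. */
    public void migrate(int partition, int newCore) {
        coreOf[partition] = newCore;
    }

    public static void main(String[] args) {
        AbstractDomain d = new AbstractDomain(8, 4); // 8 partitions on 4 cores
        System.out.println(d.coreOf(5));             // element 5 -> partition 5 -> core 1
        d.migrate(5, 3);                             // rebalance: partition 5 to core 3
        System.out.println(d.coreOf(5));             // now core 3
    }
}
```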


SLIDE 19

Abstract Domain

(Diagram: Graph elements mapped to the Abstract Domain, mapped to Physical Cores)

SLIDE 20

Logical Partitioning

  • Elements of data structure (e.g. triangles in the mesh) are mapped to abstract processors
  • Add “color” to data structure elements
  • Promotes locality
  • Cavities small and contiguous → likely to be in a single partition

SLIDE 21

Physical Partitioning

  • Reimplementation of data structure to leverage logical partitioning
  • e.g. Worklist
  • Allows different partitions of data structure to be accessed concurrently
  • Reduces contention
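
One way to realize a physically partitioned worklist is a sub-worklist per partition, each with its own lock, so threads working in different partitions never contend. A simplified sketch under those assumptions (the slide does not show the Galois library's actual classes):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PartitionedWorklist {
    private final Deque<Integer>[] lists;

    @SuppressWarnings("unchecked")
    public PartitionedWorklist(int partitions) {
        lists = new Deque[partitions];
        for (int p = 0; p < partitions; p++) lists[p] = new ArrayDeque<>();
    }

    /** Threads touching different partitions lock different sub-worklists. */
    public void add(int partition, int item) {
        synchronized (lists[partition]) { lists[partition].push(item); }
    }

    /** Returns null when the partition's sub-worklist is empty. */
    public Integer poll(int partition) {
        synchronized (lists[partition]) { return lists[partition].poll(); }
    }

    public static void main(String[] args) {
        PartitionedWorklist wl = new PartitionedWorklist(4);
        wl.add(0, 10);
        wl.add(3, 30);
        System.out.println(wl.poll(3)); // 30
        System.out.println(wl.poll(1)); // null: nothing queued for partition 1
    }
}
```

A single shared worklist would serialize every add and poll; here the lock granularity shrinks to one partition.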


SLIDE 23

Computation Partitioning


SLIDE 26

Data + Computation Partitioning

  • Data Partitioning → most cavities contained within a single partition
  • Computation Partitioning → each partition touched mostly by one core

➡ Partitions are effectively “bound” to cores

  • Maintains good locality
  • Reduces mis-speculation

SLIDE 27

Outline

  • Overview of Galois System
  • Addressing Scalability
  • Data Partitioning
  • Computation Partitioning
  • Lock Coarsening
  • Evaluation and Conclusion

SLIDE 28

Overheads from Conflict Checking

  • Significant source of overhead in Galois: conflict checks
  • Checks themselves computationally expensive
  • Checks for each object must serialize to ensure correctness → bottleneck
  • Can we take advantage of partitioning?

SLIDE 29

Optimization: Lock Coarsening

  • Can often replace conflict checks with lightweight, distributed checks
  • Iteration locks partitions as needed
  • Lock owned by someone else → conflict
  • Release locks when iteration completes
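
The coarse check can be sketched as follows: each iteration acquires a lock per partition on first touch, a lock held by another iteration signals a conflict, and all locks are released when the iteration completes. Illustrative only; assume the caller rolls back its iteration whenever `acquire` returns false:

```java
public class PartitionLocks {
    private final int[] owner; // owner[p] = iteration holding partition p, or -1

    public PartitionLocks(int partitions) {
        owner = new int[partitions];
        java.util.Arrays.fill(owner, -1);
    }

    /** Cheap distributed check: one comparison per partition touched. */
    public synchronized boolean acquire(int partition, int iter) {
        if (owner[partition] != -1 && owner[partition] != iter)
            return false;            // held by another iteration -> conflict
        owner[partition] = iter;     // re-acquiring an owned lock is free
        return true;
    }

    /** Called when the iteration commits or rolls back. */
    public synchronized void releaseAll(int iter) {
        for (int p = 0; p < owner.length; p++)
            if (owner[p] == iter) owner[p] = -1;
    }

    public static void main(String[] args) {
        PartitionLocks locks = new PartitionLocks(4);
        System.out.println(locks.acquire(0, 7)); // true
        System.out.println(locks.acquire(0, 7)); // true: same iteration, no extra cost
        System.out.println(locks.acquire(0, 9)); // false: conflict detected
        locks.releaseAll(7);
        System.out.println(locks.acquire(0, 9)); // true after release
    }
}
```

Compared with per-object commutativity checks, this trades precision for cost: one lock word per partition instead of a serialized check per object access.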

SLIDE 30

Upshot

  • Synchronization dramatically reduced
  • While iteration stays within a single partition, only one lock is acquired
  • Conflict checks are distributed, eliminating bottleneck

SLIDE 31

Overdecomposition

  • Lock coarsening is an imprecise way to check for conflicts
  • Overdecompose to reduce likelihood of conflict
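
Why overdecomposition helps the imprecise check: if each of two concurrent iterations locks one partition chosen uniformly at random out of P, they collide with probability 1/P, so quadrupling the partition count cuts the expected conflict rate by 4x. The uniform-random model is our own simplifying assumption, not the paper's analysis; a birthday-style computation of the collision probability for k iterations:

```java
public class OverdecompDemo {
    /** Probability that k iterations, each locking one uniformly random
        partition out of p, do NOT all pick distinct partitions. */
    public static double conflictProb(int k, int p) {
        double noConflict = 1.0;
        for (int i = 0; i < k; i++)
            noConflict *= (double) (p - i) / p;   // i-th pick avoids the first i
        return 1.0 - noConflict;
    }

    public static void main(String[] args) {
        System.out.println(conflictProb(2, 4));  // 0.25: 2 iterations, 4 partitions
        System.out.println(conflictProb(2, 16)); // 0.0625: 4x partitions, 1/4 the conflicts
    }
}
```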


SLIDE 34

Implementation

  • Modify run-time to support computation partitioning
  • Extend classes in Class Library to support data partitioning and/or lock coarsening
  • User code only needs to change object instantiation

(Diagram: GraphInterface implemented by Graph, PartitionedGraph, and Physically PartitionedGraph, each with a conflict-check layer; Physically PartitionedGraph optionally uses lock coarsening)

SLIDE 35

Outline

  • Overview of Galois System
  • Addressing Scalability
  • Data Partitioning
  • Computation Partitioning
  • Lock Coarsening
  • Evaluation and Conclusion

SLIDE 36

Evaluation

  • Four-core system
  • Intel Xeon processors @ 2 GHz
  • Implementation in Java 1.6

SLIDE 37

Benchmarks

  • Delaunay mesh refinement
  • Augmenting-paths maxflow
  • Preflow-push maxflow
  • Agglomerative clustering

SLIDE 38

Different Parallelization Strategies

  • Baseline Galois (gal)
  • Partitioned Galois (par)
  • Lock coarsening (lco)
  • Lock coarsening + overdecomposition (ovd)
  • Measure speedup versus sequential execution time

SLIDE 39

Delaunay Mesh Refinement

(Chart: speedup vs. number of cores, 1–4, for OVD, LCO, PAR, and GAL; speedup axis 1–3)

SLIDE 40

Augmenting Paths

(Chart: speedup vs. number of cores, 1–4, for OVD, LCO, PAR, and GAL; speedup axis 0.5–2.5)

SLIDE 41

Preflow Push

(Chart: speedup vs. number of cores, 1–4, for OVD, LCO, PAR, and GAL; speedup axis 0.5–3)

SLIDE 42

Agglomerative Clustering

(Chart: speedup vs. number of cores, 1–4, for PAR and GAL; speedup axis 1–1.8)


SLIDE 46

Summary

  • Addressed issues that arise in any optimistic parallelization system:
  • Tradeoff between locality and parallelism → logical partitioning + computation partitioning
  • Contention for shared data structures → physical partitioning
  • Overhead of conflict checks → lock coarsening + overdecomposition
  • Low programmer overhead

SLIDE 47

Questions/Comments?

milind@cs.cornell.edu http://www.cs.cornell.edu/w8/~milind