  1. Optimistic Parallelism Benefits from Data Partitioning Milind Kulkarni, Keshav Pingali, Ganesh Ramanarayanan, Bruce Walter, Kavita Bala and L. Paul Chew

  2. Parallelism in Irregular Programs • Many irregular programs use iterative algorithms over worklists of various kinds • Delaunay mesh refinement • Image segmentation using graph cuts • Agglomerative clustering • Delaunay triangulation • SAT solvers • Iterative data-flow analysis • ...

  3. Running Example: Mesh Refinement

    wl.add(mesh.badTriangles());
    while (wl.size() != 0) {
        Element e = wl.get();
        if (e no longer in mesh) continue;
        Cavity c = new Cavity(e);
        c.expand();
        c.retriangulate();
        mesh.update(c);
        wl.add(c.badTriangles());
    }

  4. Generalized Data Parallelism • Process elements from worklist in parallel • Deciding if cavities overlap must be done at runtime • Can use optimistic parallelism • Speculatively process two triangles from worklist • If cavities overlap, roll back one iteration • Implementation: Galois System (PLDI ’07)

  5. Scalability Issues • General parallelization issues: • Maintaining locality • Reducing contention for shared data structures • Optimistic parallelization issues: • Reducing mis-speculation • Lowering cost of run-time conflict detection

  6. Locality vs. Parallelism in Mesh Refinement • In sequential version, worklist implemented as a stack, for locality • If run in parallel, high likelihood of cavity overlap • Another option: assign work to cores randomly, to reduce likelihood of conflict • Reduces locality

  7. Outline • Overview of Galois System • Addressing Scalability • Data Partitioning • Computation Partitioning • Lock Coarsening • Evaluation and Conclusion

  8. The Galois System • Programming model and implementation to support optimistic parallelization of irregular programs • User code: what to parallelize • Class libraries + runtime: how to parallelize correctly • “Optimistic Parallelism Requires Abstractions,” PLDI 2007 [Figure: stack of User Code over Class Libraries over Runtime]

  9. User Code • Sequential semantics • Use optimistic set iterator to expose opportunities for exploiting data parallelism: foreach e in Set s do B(e) • Can add new elements to set during iteration

  10. What to Parallelize

    wl.add(mesh.badTriangles());
    while (wl.size() != 0) {
        Element e = wl.get();
        if (e no longer in mesh) continue;
        Cavity c = new Cavity(e);
        c.expand();
        c.retriangulate();
        mesh.update(c);
        wl.add(c.badTriangles());
    }

  11. What to Parallelize

    wl.add(mesh.badTriangles());
    foreach Element e in wl {
        if (e no longer in mesh) continue;
        Cavity c = new Cavity(e);
        c.expand();
        c.retriangulate();
        mesh.update(c);
        wl.add(c.badTriangles());
    }

  12. Execution Model • Shared memory encapsulated in objects • Program runs sequentially until set iterator encountered • Multiple threads execute iterations from worklist • Scheduler assigns work to threads

  13. Class Libraries + Runtime • Ensure that iterations run in parallel only if independent • Detect dependences between iterations using semantic commutativity • Uses semantic properties of objects to determine dependence • If conflict, roll back using undo methods (see the sketch below)
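
To make semantic commutativity concrete, here is a minimal sketch of how a library class might expose it, assuming a hypothetical set wrapper with an undo log (class and method names are illustrative, not the actual Galois API):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    // Conflicts are defined semantically: two operations conflict only
    // if they do not commute (e.g., add(x) and add(y) commute for x != y,
    // so iterations adding different elements remain independent).
    class CommutativeSet<T> {
        private final Set<T> elems = new HashSet<T>();
        // Undo log for the current speculative iteration.
        private final Deque<T> added = new ArrayDeque<T>();

        synchronized boolean add(T x) {
            if (elems.add(x)) { added.push(x); return true; }
            return false;
        }

        synchronized boolean contains(T x) {
            // contains(x) commutes with add(y) for y != x, but not with
            // add(x): that pair would be flagged as a conflict.
            return elems.contains(x);
        }

        // Undo method: on conflict, reverse the aborted iteration's
        // operations in LIFO order.
        synchronized void rollback() {
            while (!added.isEmpty()) elems.remove(added.pop());
        }
    }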

  14. Outline • Overview of Galois System • Addressing Scalability • Data Partitioning • Computation Partitioning • Lock Coarsening • Evaluation and Conclusion

  15. Data Partitioning [Figure: graph data mapped directly onto physical cores]

  16. Abstract Domain • Set of abstract processors mapped to physical cores • Data structure elements mapped to abstract processors • Allows for overdecomposition • More abstract processors than cores • Useful in many contexts (e.g., load balancing)
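
A minimal sketch of the two-level mapping; the hash-based placement below is an illustrative assumption, not the paper's partitioner:

    // Element -> abstract processor -> physical core. With
    // overdecomposition, numAbstract > numCores, and the second-level
    // map can be changed at run time (e.g., for load balancing).
    class AbstractDomain {
        final int numAbstract;  // number of abstract processors
        final int numCores;     // number of physical cores

        AbstractDomain(int numAbstract, int numCores) {
            this.numAbstract = numAbstract;
            this.numCores = numCores;
        }

        // First level: data structure elements to abstract processors.
        int abstractProcOf(Object element) {
            return (element.hashCode() & 0x7fffffff) % numAbstract;
        }

        // Second level: abstract processors to physical cores.
        int coreOf(int abstractProc) {
            return abstractProc % numCores;
        }
    }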

  17. Abstract Domain [Figure: graph elements mapped to an abstract domain, which is in turn mapped to physical cores]

  18. Logical Partitioning • Elements of data structure (e.g., triangles in the mesh) are mapped to abstract processors • Add “color” to data structure elements • Promotes locality • Cavities small and contiguous → likely to be in a single partition
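
One way to assign contiguous "colors", sketched as breadth-first region growing from seed triangles (an illustrative heuristic; the paper's actual partitioner may differ):

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Queue;

    class Triangle {
        static final int UNASSIGNED = -1;
        int color = UNASSIGNED;  // owning abstract processor ("color")
        final List<Triangle> neighbors = new ArrayList<Triangle>();
    }

    class LogicalPartitioner {
        // Grow partitions breadth-first, one seed per partition, so each
        // partition is a contiguous region of the mesh. A small,
        // contiguous cavity then usually falls inside a single partition.
        static void color(List<Triangle> seeds) {
            Queue<Triangle> frontier = new ArrayDeque<Triangle>();
            for (int p = 0; p < seeds.size(); p++) {
                seeds.get(p).color = p;
                frontier.add(seeds.get(p));
            }
            while (!frontier.isEmpty()) {
                Triangle t = frontier.remove();
                for (Triangle n : t.neighbors) {
                    if (n.color == Triangle.UNASSIGNED) {
                        n.color = t.color;  // claim uncolored neighbors
                        frontier.add(n);
                    }
                }
            }
        }
    }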

  19. Physical Partitioning • Reimplementation of data structure to leverage logical partitioning (e.g., the worklist; see the sketch below) • Allows different partitions of data structure to be accessed concurrently • Reduces contention
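
A minimal sketch of a physically partitioned worklist, reusing the Triangle coloring above (the structure and names are assumptions for illustration):

    import java.util.ArrayDeque;
    import java.util.Deque;

    // One sub-list per partition: threads working on different
    // partitions synchronize on different locks, so they no longer
    // contend on a single shared worklist.
    class PartitionedWorklist {
        private final Deque<Triangle>[] lists;

        @SuppressWarnings("unchecked")
        PartitionedWorklist(int numPartitions) {
            lists = new Deque[numPartitions];
            for (int i = 0; i < numPartitions; i++) {
                lists[i] = new ArrayDeque<Triangle>();
            }
        }

        void add(Triangle t) {
            Deque<Triangle> l = lists[t.color];  // route by "color"
            synchronized (l) { l.push(t); }
        }

        Triangle poll(int partition) {
            Deque<Triangle> l = lists[partition];
            synchronized (l) { return l.isEmpty() ? null : l.pop(); }
        }
    }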

  20. Computation Partitioning [Figure-only slides: work from each partition is assigned to a specific core]

  21. Data + Computation Partitioning • Data partitioning → most cavities contained within a single partition • Computation partitioning → each partition touched mostly by one core ➡ Partitions are effectively “bound” to cores • Maintains good locality • Reduces mis-speculation
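
A sketch of the resulting schedule, assuming the PartitionedWorklist above and a static partition-to-core binding (the work-stealing and termination details of a real scheduler are omitted):

    // Each worker thread (one per core) preferentially drains the
    // worklists of the partitions bound to it, so a partition's data
    // stays hot in one core's cache and cavities rarely cross cores.
    class BoundWorker implements Runnable {
        private final PartitionedWorklist wl;
        private final int[] myPartitions;  // partitions bound to this core

        BoundWorker(PartitionedWorklist wl, int[] myPartitions) {
            this.wl = wl;
            this.myPartitions = myPartitions;
        }

        public void run() {
            boolean progress = true;
            while (progress) {
                progress = false;
                for (int p : myPartitions) {
                    Triangle t;
                    while ((t = wl.poll(p)) != null) {
                        refine(t);  // one speculative iteration
                        progress = true;
                    }
                }
            }
        }

        private void refine(Triangle t) {
            // Build, expand, and retriangulate the cavity around t,
            // rolling back on conflict (elided).
        }
    }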

  22. Outline • Overview of Galois System • Addressing Scalability • Data Partitioning • Computation Partitioning • Lock Coarsening • Evaluation and Conclusion

  23. Overheads from Conflict Checking • Significant source of overhead in Galois: conflict checks • Checks themselves computationally expensive • Checks for each object must serialize to ensure correctness → bottleneck • Can we take advantage of partitioning?

  24. Optimization: Lock Coarsening • Can often replace conflict checks with lightweight, distributed checks • Iteration locks partitions as needed • Lock owned by someone else → conflict • Release locks when iteration completes
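
A minimal sketch of the coarsened check, assuming one atomic owner word per partition (iteration ids and lock bookkeeping are simplified):

    import java.util.Set;
    import java.util.concurrent.atomic.AtomicIntegerArray;

    // Instead of a commutativity check per object access, an iteration
    // takes one lock per partition it touches; finding a partition lock
    // held by another iteration signals a conflict.
    class PartitionLocks {
        private static final int FREE = -1;
        private final AtomicIntegerArray owner;  // iteration id, or FREE

        PartitionLocks(int numPartitions) {
            owner = new AtomicIntegerArray(numPartitions);
            for (int i = 0; i < numPartitions; i++) owner.set(i, FREE);
        }

        // Returns false on conflict: the caller rolls back its iteration.
        boolean acquire(int partition, int iterId, Set<Integer> held) {
            if (held.contains(partition)) return true;  // already ours
            if (owner.compareAndSet(partition, FREE, iterId)) {
                held.add(partition);
                return true;
            }
            return false;  // held by another iteration
        }

        // Called when the iteration commits or aborts.
        void releaseAll(int iterId, Set<Integer> held) {
            for (int p : held) owner.compareAndSet(p, iterId, FREE);
            held.clear();
        }
    }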

  25. Upshot • Synchronization dramatically reduced • While an iteration stays within a single partition, only one lock is acquired • Conflict checks are distributed, eliminating the bottleneck

  26. Overdecomposition • Lock coarsening is an imprecise way to check for conflicts • Overdecompose to reduce likelihood of conflict
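
As a rough illustration of why this helps (a back-of-envelope model, not a result from the paper): if two concurrent iterations each touch one partition chosen roughly uniformly at random, they collide with probability about 1/P. Going from P = 4 partitions (one per core) to P = 16 (4x overdecomposition) cuts the chance of a false partition-level conflict from 25% to about 6%.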

  27. Implementation • Modify run-time to support computation partitioning • Extend classes in class library to support data partitioning and/or lock coarsening • User code only needs to change object instantiation [Figure: class hierarchy rooted at GraphInterface, with Graph (conflict check), PartitionedGraph (conflict check), and two PhysicallyPartitionedGraph variants, one using conflict checks and one using lock coarsening]
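
The "only change object instantiation" claim, as a hedged fragment using the class names from the hierarchy above (the constructor signatures are assumptions):

    class UserSetup {
        GraphInterface makeMesh(int numPartitions) {
            // Baseline Galois: flat graph with per-object conflict checks.
            GraphInterface mesh = new Graph();
            // Switching strategy is a one-line change; the foreach loop
            // over bad triangles is untouched:
            //   mesh = new PartitionedGraph(numPartitions);
            //   mesh = new PhysicallyPartitionedGraph(numPartitions);
            return mesh;
        }
    }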

  28. Outline • Overview of Galois System • Addressing Scalability • Data Partitioning • Computation Partitioning • Lock Coarsening • Evaluation and Conclusion

  29. Evaluation • Four-core system • Intel Xeon processors @ 2 GHz • Implementation in Java 1.6

  30. Benchmarks • Delaunay mesh refinement • Augmenting-paths maxflow • Preflow-push maxflow • Agglomerative clustering

  31. Different Parallelization Strategies • Baseline Galois (gal) • Partitioned Galois (par) • Lock coarsening (lco) • Lock coarsening + overdecomposition (ovd) • Measure speedup versus sequential execution time

  32. Delaunay Mesh Refinement [Chart: speedup vs. number of cores (1-4) for GAL, PAR, LCO, and OVD]

  33. Augmenting Paths [Chart: speedup vs. number of cores (1-4) for GAL, PAR, LCO, and OVD]

  34. Preflow Push [Chart: speedup vs. number of cores (1-4) for GAL, PAR, LCO, and OVD]

  35. Agglomerative Clustering [Chart: speedup vs. number of cores (1-4) for GAL and PAR]

  36. Summary • Addressed issues that arise in any optimistic parallelization system: • Tradeoff between locality and parallelism → Logical Partitioning + Computation Partitioning • Contention for shared data structures → Physical Partitioning • Overhead of conflict checks → Lock Coarsening + Overdecomposition • Low programmer overhead

  37. Questions/Comments? milind@cs.cornell.edu http://www.cs.cornell.edu/w8/~milind
