How Much Parallelism is There in Irregular Applications?

Milind Kulkarni, Martin Burtscher, Rajasekhar Inkulu, Călin Caşcaval and Keshav Pingali
  1–2. How Much Parallelism is There in Irregular Applications? Milind Kulkarni, Martin Burtscher, Rajasekhar Inkulu, Călin Caşcaval and Keshav Pingali

  3. Introduction
    • We understand parallelism in regular algorithms: e.g., in N × N matrix-matrix multiply, all N³ multiplications can be done concurrently
    • What about irregular algorithms, which operate on complex, pointer-based data structures such as graphs and trees?
    • Is there much parallelism?
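To ground the regular case, here is a small NumPy sketch (ours, not from the talk) showing that all N³ scalar products in an N × N matrix multiply are mutually independent; only the reduction that forms C imposes any ordering:

```python
import numpy as np

def matmul_all_products(A, B):
    """All N^3 scalar products of an N x N matrix multiply at once.

    Each product A[i, k] * B[k, j] reads only the inputs, so all N^3 of
    them are mutually independent; only the sum that forms C[i, j]
    imposes any ordering (and even that parallelizes as a reduction).
    """
    # products[i, j, k] = A[i, k] * B[k, j]  -- N^3 independent multiplies
    products = A[:, None, :] * B.T[None, :, :]
    return products.sum(axis=2)  # reduce over k to get C = A @ B

A = np.arange(4.0).reshape(2, 2)
B = np.ones((2, 2))
assert np.allclose(matmul_all_products(A, B), A @ B)
```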

  4. Example Algorithms
    • Data-mining: agglomerative clustering, k-means
    • Bayesian inference: belief propagation, survey propagation
    • Compilers: iterative dataflow, elimination-based dataflow
    • Functional interpreters: graph reduction, static/dynamic dataflow
    • Maxflow: preflow-push, augmenting paths
    • Minimum spanning trees: Prim's, Kruskal's, Boruvka's
    • N-body methods: Barnes-Hut, fast multipole
    • Graphics: ray-tracing
    • Linear solvers: sparse MVM, sparse Cholesky factorization
    • Event-driven simulation: time warp, Chandy-Misra-Bryant
    • Meshing: Delaunay mesh refinement, triangulation

  5. Example: Delaunay Mesh Refinement
    • Worklist of bad triangles
    • Process a bad triangle by removing its "cavity" and re-triangulating; this may create new bad triangles
    • Triangles can be processed in any order
    • Algorithm terminates when the worklist is empty
    [Figure: the mesh before and after refining one bad triangle]
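A minimal sketch of this worklist pattern, with a hypothetical `process` callback standing in for the cavity/re-triangulation kernel:

```python
from collections import deque

def refine_mesh(bad_triangles, process):
    """Skeleton of the Delaunay refinement worklist loop.

    `process(triangle)` is a hypothetical callback: it removes the cavity
    around one bad triangle, re-triangulates it, and returns any newly
    created bad triangles. The geometric kernel is not reproduced here.
    """
    worklist = deque(bad_triangles)
    while worklist:
        tri = worklist.pop()           # any order is legal: the work is unordered
        worklist.extend(process(tri))  # may create new bad triangles
    # terminates when the worklist is empty
```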

  6–10. Example: Event-Driven Simulation
    • Network of nodes
    • Worklist of events, ordered by timestamp
    • Nodes process events and can generate new events to send to other nodes
    • Events must be processed in global time order
    [Animation: events with timestamps 2, 3, 4, and 6 propagating between nodes A and B]
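The timestamp-ordered worklist is naturally a priority queue. A minimal sketch of the sequential event loop (illustrative names, not from the talk):

```python
import heapq

def simulate(initial_events, handlers):
    """Minimal discrete-event loop: a worklist ordered by timestamp.

    `handlers` maps a node id to a hypothetical callback
    (time, payload) -> iterable of (delay, dest_node, payload)
    describing newly generated events. Names are illustrative.
    """
    pq = [(t, i, node, msg) for i, (t, node, msg) in enumerate(initial_events)]
    heapq.heapify(pq)
    seq = len(pq)  # tie-breaker so the heap never compares payloads
    while pq:
        time, _, node, msg = heapq.heappop(pq)  # earliest event first
        for delay, dest, out in handlers[node](time, msg):
            heapq.heappush(pq, (time + delay, seq, dest, out))
            seq += 1
```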

  11. Amorphous Data Parallelism
    • Data structure: graph
    • Operate over ordered or unordered worklists of active nodes
    • Process an active node by accessing its neighborhood; this may generate new active nodes
    • Nodes with non-overlapping neighborhoods can be processed in parallel (see the sketch below)
    • Ordered worklists: must respect ordering constraints
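As a minimal illustration of the independence condition (our sketch, assuming a plain adjacency-set graph representation):

```python
def independent(graph, a, b):
    """True iff active nodes a and b have non-overlapping neighborhoods.

    `graph` maps each node to a set of neighbors (an adjacency-set
    representation assumed for illustration). Independent nodes can be
    processed in parallel; overlapping ones conflict.
    """
    return not (({a} | graph[a]) & ({b} | graph[b]))
```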

  12. “Available Parallelism”
    • A measure of the maximum amount of parallelism that can be extracted from a program
    • Profiles the algorithm, not the system: communication/synchronization costs, run-time overheads, and locality concerns are disregarded

  13–16. Measuring Parallelism
    • Represent the program as a DAG: nodes are operations, edges are dependences
    • Execution strategy: assume operations take unit time and execute "greedily", processing all ready operations in each step
    • Parallelism profile: the number of operations executed in each step
    [Animation: the DAG for LOAD A1, LOAD B1, LOAD A2, LOAD B2 feeding two MULs and an ADD, alongside the parallelism profile (available parallelism vs. computation step) it produces]
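This execution model is easy to state in code. A small sketch (ours, not the paper's tooling) that replays a DAG greedily and records the profile:

```python
def parallelism_profile(deps):
    """Greedy unit-time replay of a DAG given as {op: set of prerequisites}.

    Each step executes every currently ready operation; the returned list
    is the parallelism profile (operations executed per computation step).
    """
    done = set()
    profile = []
    while len(done) < len(deps):
        ready = {op for op, pre in deps.items() if op not in done and pre <= done}
        profile.append(len(ready))
        done |= ready
    return profile

# The slide's example: two loads feed each MUL, the MULs feed the ADD
dag = {"LD_A1": set(), "LD_B1": set(), "LD_A2": set(), "LD_B2": set(),
       "MUL1": {"LD_A1", "LD_B1"}, "MUL2": {"LD_A2", "LD_B2"},
       "ADD": {"MUL1", "MUL2"}}
print(parallelism_profile(dag))  # [4, 2, 1]
```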

  17–18. Amorphous Data-Parallel Algorithms
    • No notion of ordering: the program is represented as a graph, not a DAG
    • Execution: choose a set of independent elements to process
    • Different scheduling choices lead to different amounts of parallelism, even with unlimited resources
    [Figure: conflict graph over the active elements]

  19. Greedy Scheduling
    • Finding the schedule that maximizes parallelism is NP-hard
    ➡ Solution: schedule greedily, attempting to maximize the work done in the current step by choosing a maximal independent set in the conflict graph
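A sketch of the greedy choice, assuming the conflict graph is given as adjacency sets (illustrative, not ParaMeter's implementation):

```python
def maximal_independent_set(conflict_graph):
    """Greedy maximal (not maximum) independent set of a conflict graph.

    `conflict_graph` maps each work item to the set of items it conflicts
    with. Since maximizing parallelism is NP-hard, we settle for a maximal
    set: take items one by one, skipping anything that conflicts with an
    item already chosen.
    """
    chosen, blocked = set(), set()
    for item in conflict_graph:
        if item not in blocked:
            chosen.add(item)
            blocked.add(item)
            blocked |= conflict_graph[item]  # its neighbors are now excluded
    return chosen
```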

  20–22. Incremental Execution
    • The conflict graph can change during execution: new work is generated and new conflicts appear
    • Scheduling therefore cannot be performed a priori
    ➡ Solution: execute in stages and recalculate the conflict graph after each stage
    [Animation: the conflict graph evolving from stage to stage]

  23. ParaMeter
    • A tool that generates parallelism profiles for amorphous data-parallel applications
    • Uses greedy scheduling and incremental execution to handle the dynamic nature of the computation

  24–25. ParaMeter Execution Strategy
    • While work is left:
      • Generate the conflict graph for the current worklist
      • Execute a maximal independent set of nodes in the graph
      • Add newly generated work to the worklist
    • The parallelism profile is generated by tracking the number of nodes executed in each step (see the sketch below)
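Putting the pieces together, a sketch of the whole measurement loop (illustrative: `build_conflict_graph` and `execute` are hypothetical callbacks, and `maximal_independent_set` is the greedy sketch above):

```python
def parameter_profile(initial_work, build_conflict_graph, execute):
    """Sketch of ParaMeter's measurement loop (illustrative callbacks).

    build_conflict_graph(worklist) -> {item: set of conflicting items}
    execute(item) -> iterable of newly generated work items
    Returns the parallelism profile: items executed per computation step.
    """
    worklist = set(initial_work)
    profile = []
    while worklist:                                # while work left
        graph = build_conflict_graph(worklist)     # stage boundary
        step = maximal_independent_set(graph)      # greedy MIS, defined above
        profile.append(len(step))                  # track the profile
        worklist -= step
        for item in step:
            worklist.update(execute(item))         # newly generated work
    return profile
```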

  26. Experiments
    • Profiled 7 applications:
      • Delaunay mesh refinement
      • Delaunay triangulation
      • Augmenting paths maxflow
      • Preflow push maxflow
      • Survey propagation
      • Agglomerative clustering (unordered)
      • Agglomerative clustering (ordered)

  27. Delaunay Mesh Refinement
    [Chart: available parallelism (0–8,000) vs. computation step (0–60)]
    Input: 100,000-triangle mesh, 47,000 bad triangles

  28. Parallelism Intensity
    • Available parallelism shows the absolute amount of parallelism in the program
    • Is parallelism low because there is little work, or because there are many conflicts?
    • Parallelism intensity: the percentage of the worklist executed in parallel in each step
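In terms of the measurement loop above, intensity is just the executed fraction of each step's worklist (a hypothetical helper we add for illustration):

```python
def parallelism_intensity(executed_per_step, worklist_sizes):
    """Percent of each step's worklist that executed in parallel.

    Hypothetical helper: pairs the parallelism profile with the size of
    the worklist at the start of each step.
    """
    return [100.0 * e / w for e, w in zip(executed_per_step, worklist_sizes)]
```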

  29. Mesh Refinement: Parallelism Intensity
    [Chart: % of worklist executed (0–100) vs. computation step (0–60)]

  30–32. Effects of Scheduling on Parallelism
    [Charts: available parallelism vs. computation step under different schedules, including a zoomed view spanning 5,000–7,500 over steps 0–20]
    Input: 100,000-triangle mesh, 47,000 bad triangles

  33. Delaunay Triangulation
    • Build a Delaunay mesh given a set of points
    • Points are kept in an unordered worklist
    • Insert points by splitting triangles and flipping edges
    • Input: 10,000 points
    [Charts: available parallelism (0–300) and % of worklist executed vs. computation step (0–100)]

  34. Survey Propagation
    • Heuristic approach to solving SAT problems
    • Bipartite graph of clauses and variables
    • Iteratively update variables with possible truth values
    • Input: formula with 1,000 variables and 4,200 clauses
    [Charts: available parallelism (0–1,000) and % of worklist executed vs. computation step (0–14,000)]
