Scheduling Strategies for Optimistic Parallel Execution of Irregular Programs




slide-1
SLIDE 1

Scheduling Strategies for Optimistic Parallel Execution of Irregular Programs

Milind Kulkarni, Patrick Carribault, Keshav Pingali, Ganesh Ramanarayanan, Bruce Walter, Kavita Bala and L. Paul Chew

University of Texas at Austin, Cornell University

slide-2
SLIDE 2

Amorphous Data Parallelism

  • Many irregular programs implement iterative algorithms over worklists
  • Mesh refinement, agglomerative clustering, maxflow algorithms, compiler analyses, ...
  • Complex dependences between iterations
  • But many iterations can be executed in parallel
  • New elements can be added to worklist
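The worklist pattern described in these bullets can be illustrated with a toy problem. The sketch below is not taken from the talk: the class name WorklistSketch and the segment-splitting task are hypothetical stand-ins, chosen only because processing one item can add new items to the worklist.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical toy example: split any segment longer than maxLen in half.
// Processing one worklist item may add new items, as in the bullets above.
final class WorklistSketch {
    static List<Double> refine(double length, double maxLen) {
        Deque<Double> worklist = new ArrayDeque<>();
        List<Double> done = new ArrayList<>();
        worklist.push(length);
        while (!worklist.isEmpty()) {
            double seg = worklist.pop();
            if (seg <= maxLen) {
                done.add(seg);          // "good" element: nothing left to do
            } else {
                worklist.push(seg / 2); // "bad" element: replace it with two
                worklist.push(seg / 2); // smaller ones and revisit them later
            }
        }
        return done;
    }

    public static void main(String[] args) {
        System.out.println(refine(8.0, 1.0).size()); // prints 8
    }
}
```

In this toy the segments are independent, so any processing order gives the same result. The irregular programs listed above differ in that iterations can interfere with one another through a shared data structure.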

slide-3
SLIDE 3

Delaunay Mesh Refinement (DMR)

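Worklist wl;
wl.add(mesh.badTriangles());              // seed the worklist with all bad triangles
while (wl.size() != 0) {
    Triangle t = wl.get();
    if (t no longer in mesh) continue;    // pseudocode test: t was removed by an earlier retriangulation
    Cavity c = new Cavity(t);
    c.expand();
    c.retriangulate();
    mesh.update(c);
    wl.add(c.badTriangles());             // retriangulation may create new bad triangles
}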


No ordering constraints on processing of worklist items

slide-5
SLIDE 5

Parallelism in DMR

  • Can process bad triangles concurrently
  • As long as cavities do not overlap
  • Cannot determine this until run time
  • Example of amorphous data parallelism
  • Our approach: Galois system for optimistic parallelization [PLDI’07, ASPLOS’08]
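The overlap condition can be sketched as a set-disjointness test. This is an illustration only: the slides do not show how Galois detects conflicts, and representing a cavity as a set of integer triangle ids, along with the name CavityConflictSketch, is an assumption made for the example.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical sketch: a cavity is modeled as the set of triangle ids it covers.
final class CavityConflictSketch {
    // Two iterations may run concurrently only if their cavities share no triangle.
    static boolean conflict(Set<Integer> cavityA, Set<Integer> cavityB) {
        return !Collections.disjoint(cavityA, cavityB);
    }

    public static void main(String[] args) {
        Set<Integer> a = Set.of(1, 2, 3);
        Set<Integer> b = Set.of(3, 4);
        Set<Integer> c = Set.of(7, 8);
        System.out.println(conflict(a, b)); // true: both contain triangle 3
        System.out.println(conflict(a, c)); // false: no shared triangle
    }
}
```

Because a cavity is only known after it has been expanded, a check of this kind can run only while the program executes, which is why the parallelism cannot be determined at compile time.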

slide-6
SLIDE 6

Galois System

  • User code
      • Optimistic iterators
      • Sequential semantics
  • Class libraries
      • Data structures
      • Conflict conditions
  • Runtime system
      • Optimistic parallelization
      • Conflict detection & handling

Optimistic iterator in user code: foreach e in Set s do B(e)

slide-7
SLIDE 7

DMR User Code

Worklist wl;
wl.add(mesh.badTriangles());
while (wl.size() != 0) {
    Triangle t = wl.get();
    if (t no longer in mesh) continue;
    Cavity c = new Cavity(t);
    c.expand();
    c.retriangulate();
    mesh.update(c);
    wl.add(c.badTriangles());
}

slide-8
SLIDE 8

DMR User Code

Worklist wl;
wl.add(mesh.badTriangles());
foreach Triangle t in wl {
    if (t no longer in mesh) continue;
    Cavity c = new Cavity(t);
    c.expand();
    c.retriangulate();
    mesh.update(c);
    wl.add(c.badTriangles());
}

slide-9
SLIDE 9

Scheduling Impact: DMR

[Figure: speedup (y-axis, 0.8 to 2.2) versus number of cores (x-axis, 1 to 4) for the stack and random schedules]

Evaluation platform: 4-core Xeon system, running Java 1.6 HotSpot JVM
Input mesh: 100K triangles, ~40K bad triangles

slide-10
SLIDE 10

Scheduling in OpenMP

  • OpenMP provides parallel DO-ALL loops for regular programs
  • Major scheduling concerns are load-balancing and overhead
  • OpenMP scheduling policies address these issues
  • static, dynamic, guided
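To make the three policy names concrete, the following sketch computes the chunk sizes each policy would hand out for a loop of n iterations. It is a simplified model written for this transcript, not OpenMP code: the class ChunkingSketch and its method names are hypothetical, and real OpenMP runtimes are free to divide iterations somewhat differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of how static, dynamic, and guided policies size their chunks.
final class ChunkingSketch {
    // static: one near-equal contiguous block per thread, fixed before the loop runs.
    static List<Integer> staticChunks(int n, int threads) {
        List<Integer> sizes = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            sizes.add(n / threads + (t < n % threads ? 1 : 0));
        }
        return sizes;
    }

    // dynamic: fixed-size chunks, claimed by whichever thread becomes idle.
    static List<Integer> dynamicChunks(int n, int chunk) {
        List<Integer> sizes = new ArrayList<>();
        for (int start = 0; start < n; start += chunk) {
            sizes.add(Math.min(chunk, n - start));
        }
        return sizes;
    }

    // guided: each chunk is about remaining/threads, but not below minChunk
    // (except for a final, smaller leftover chunk).
    static List<Integer> guidedChunks(int n, int threads, int minChunk) {
        List<Integer> sizes = new ArrayList<>();
        int remaining = n;
        while (remaining > 0) {
            int size = Math.max(minChunk, (remaining + threads - 1) / threads);
            size = Math.min(size, remaining);
            sizes.add(size);
            remaining -= size;
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(staticChunks(10, 4));     // [3, 3, 2, 2]
        System.out.println(dynamicChunks(10, 4));    // [4, 4, 2]
        System.out.println(guidedChunks(100, 4, 5)); // [25, 19, 14, 11, 8, 6, 5, 5, 5, 2]
    }
}
```

Larger chunks reduce scheduling overhead, while smaller chunks claimed on demand improve load balance. Guided scheduling starts with large chunks and shrinks them to trade off the two.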

slide-11
SLIDE 11

Amorphous Data Parallelism Issues

  • Algorithmic – The efficiency of the algorithm or data structures
  • Conflicts – The likelihood that two iterations executed in parallel will conflict
  • Locality – The temporal or spatial locality exhibited in the data structures
  • Dynamically created work
  • Load-balancing and contention still an issue

slide-12
SLIDE 12

Scheduling Basics

  • Each iteration is executed by a single core
  • Each core executes a set of iterations in a linear order
  • Scheduling maps work from an “iteration space” to positions in an “execution schedule”
  • Each iteration is mapped to a core, and a position in that core’s execution schedule

slide-13
SLIDE 13

Scheduling Functions

  • Clustering – Groups iterations into clusters; each cluster is executed on a single core
  • Labeling – Maps clusters to cores; each core can have multiple clusters
  • Ordering – Specifies a serial execution order for each core
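One way to read the three functions is as a pipeline from iterations to per-core schedules. The sketch below is a minimal offline model under that reading. It assumes iterations are numbered 0 to n-1 and that all of them are known up front, which is a simplification; SchedulingFunctionsSketch and its parameter names are hypothetical and are not the Galois API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.IntUnaryOperator;

// Hypothetical offline model: compose clustering, labeling, and ordering
// into one ordered list of iterations per core.
final class SchedulingFunctionsSketch {
    static Map<Integer, List<Integer>> schedule(
            int iterations,
            IntUnaryOperator clustering,   // iteration -> cluster id
            IntUnaryOperator labeling,     // cluster id -> core id
            Comparator<Integer> ordering)  // serial order within one core
    {
        Map<Integer, List<Integer>> perCore = new TreeMap<>();
        for (int i = 0; i < iterations; i++) {
            int core = labeling.applyAsInt(clustering.applyAsInt(i));
            perCore.computeIfAbsent(core, k -> new ArrayList<>()).add(i);
        }
        for (List<Integer> list : perCore.values()) {
            list.sort(ordering);
        }
        return perCore;
    }

    public static void main(String[] args) {
        // Clusters of two iterations, dealt round-robin to two cores, run in ascending order.
        Map<Integer, List<Integer>> s =
            schedule(8, i -> i / 2, c -> c % 2, Comparator.naturalOrder());
        System.out.println(s); // {0=[0, 1, 4, 5], 1=[2, 3, 6, 7]}
    }
}
```

Swapping any one argument changes a single aspect of the schedule. For example, passing Comparator.reverseOrder() as the ordering keeps the same assignment of iterations to cores but runs each core's list back to front.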

slide-20
SLIDE 20

Scheduling Functions

[Figure: execution schedules for cores P0 and P1, each laid out along a time axis]

Functions can be defined “online”


slide-21
SLIDE 21

Example Instantiations

  • OpenMP’s chunked self-scheduling
      • Clustering: chunked
      • Labeling: dynamic
      • Ordering: cluster-major
  • DMR’s “generator-computes”
      • Clustering: chunked + generator-computes
      • Labeling: dynamic
      • Ordering: LIFO

The Galois system provides a number of built-in scheduling policies
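The first instantiation, chunked self-scheduling, can be sketched with a shared counter from which worker threads claim chunks. This is an illustrative model rather than OpenMP or Galois code: the class name ChunkedSelfSchedulingSketch is hypothetical, and the loop body is replaced by a per-iteration counter so the result can be checked.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical sketch of chunked self-scheduling over iterations 0..n-1.
final class ChunkedSelfSchedulingSketch {
    // Returns how many times each iteration was executed (expected: once each).
    static AtomicIntegerArray run(int n, int chunk, int threads) {
        AtomicInteger next = new AtomicInteger(0);               // start of the next unclaimed chunk
        AtomicIntegerArray executed = new AtomicIntegerArray(n); // per-iteration execution count
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                while (true) {
                    int start = next.getAndAdd(chunk);           // dynamic labeling: idle thread claims a chunk
                    if (start >= n) break;
                    int end = Math.min(start + chunk, n);
                    for (int i = start; i < end; i++) {          // cluster-major: finish the chunk in order
                        executed.incrementAndGet(i);             // stand-in for the loop body
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return executed;
    }

    public static void main(String[] args) {
        AtomicIntegerArray counts = run(1000, 16, 4);
        System.out.println(counts.get(0) + " " + counts.get(999)); // 1 1
    }
}
```

The atomic getAndAdd guarantees that no two threads claim the same chunk, so each iteration runs exactly once regardless of how the threads interleave.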

slide-22
SLIDE 22

Evaluated Applications

  • Delaunay mesh refinement
  • Delaunay triangulation
  • Augmenting-paths maxflow
  • Preflow-push maxflow
  • Agglomerative clustering


slide-23
SLIDE 23

Sample Schedules for DMR

  • random – default Galois schedule
  • stack – LIFO schedule
  • partitioned – data-centric schedule, based on partitioning of mesh
  • generator-computes – random schedule; new work is immediately processed by the core that created it
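The difference between the stack and random schedules comes down to which pending item the worklist hands out next. The sketch below models just those two policies behind a common interface. The interface and class names are hypothetical and do not reflect the actual Galois worklist classes; the partitioned and generator-computes schedules are not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of two worklist policies with the same add/get interface.
final class WorklistPolicySketch {
    interface Worklist<T> {
        void add(T item);
        T get();            // removes and returns the next item to process
        boolean isEmpty();
    }

    // "stack": the most recently added item is processed first (LIFO).
    static final class StackWorklist<T> implements Worklist<T> {
        private final List<T> items = new ArrayList<>();
        public void add(T item) { items.add(item); }
        public T get() { return items.remove(items.size() - 1); }
        public boolean isEmpty() { return items.isEmpty(); }
    }

    // "random": a uniformly random pending item is processed next.
    static final class RandomWorklist<T> implements Worklist<T> {
        private final List<T> items = new ArrayList<>();
        private final Random rng;
        RandomWorklist(long seed) { rng = new Random(seed); }
        public void add(T item) { items.add(item); }
        public T get() {
            int i = rng.nextInt(items.size());
            T picked = items.get(i);
            // Swap-remove: drop the tail, then move it into slot i if i was not the tail.
            T last = items.remove(items.size() - 1);
            if (i < items.size()) items.set(i, last);
            return picked;
        }
        public boolean isEmpty() { return items.isEmpty(); }
    }

    public static void main(String[] args) {
        Worklist<Integer> stack = new StackWorklist<>();
        stack.add(1); stack.add(2); stack.add(3);
        System.out.println(stack.get()); // 3
    }
}
```

Because both classes implement the same interface, a worklist loop can switch policies without changing the loop body, which is the kind of separation the sample schedules rely on.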

slide-24
SLIDE 24

DMR Results

[Figure: speedup (y-axis, 1 to 3) versus number of cores (x-axis, 1 to 4) for the generator-computes, partitioned, stack, and random schedules]

slide-25
SLIDE 25

Summary of Results

Application                 Clustering                 Labeling                 Ordering
Delaunay Mesh Refinement    random / inherited         dynamic / random         - / LIFO
Delaunay Triangulation      data-centric / -           static / data-centric    cluster-major / random
Augmenting Paths Maxflow    data-centric / inherited   static / data-centric    cluster-major / LIFO
Preflow Push Maxflow        data-centric / inherited   static / data-centric    cluster-major / LIFO
Agglomerative Clustering    unit / custom              dynamic / custom         - / -

  • Best combination of policies for each application

slide-27
SLIDE 27

Conclusions

  • Developed a general framework for scheduling programs with amorphous data parallelism
  • Subsumes OpenMP scheduling policies
  • Implemented framework in Galois system
  • Provides several default scheduling policies
  • Allows programmers to specify their own scheduling policies when needed