

SLIDE 1

Optimistic Parallelism Requires Abstractions

Milind Kulkarni, Keshav Pingali – The University of Texas at Austin
Bruce Walter, Ganesh Ramanarayanan, Kavita Bala and L. Paul Chew – Cornell University

SLIDE 3

PLDI 2007 June 11th, 2007

Motivation

✦ Parallel programming very important
  ✦ Multicore processors
✦ Parallel programming is hard!
✦ Limited success in domains which deal with structured data
  ✦ Array programs
  ✦ Database applications
✦ What about irregular applications which deal with unstructured data?
  ✦ Compile time techniques have failed

SLIDE 4

Galois System: Core Beliefs

✦ Irregular applications have worklist-style data parallelism
✦ Optimistic parallelization is crucial
✦ Parallelism should be hidden within natural syntactic constructs
✦ High level application semantics are critical for parallelization

SLIDE 5

Outline

✦ Two challenge problems
✦ Galois programming model and implementation
✦ Evaluation
✦ Related Work
✦ Conclusions

SLIDE 6

Delaunay Mesh Refinement

✦ Iterative refinement procedure to produce guaranteed quality meshes

SLIDE 8

Delaunay Pseudo-code

Mesh mesh = /* read in mesh */
WorkList wl;
wl.add(mesh.badTriangles());
while (wl.size() != 0) {
  Element e = wl.get();
  if (e no longer in mesh) continue;
  Cavity c = new Cavity(e);
  c.expand();
  c.retriangulate();
  mesh.update(c);
  wl.add(c.badTriangles());
}

Worklist idiom
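
The worklist idiom in this pseudo-code can be tried out on a much simpler problem. The following is a minimal Python sketch on a hypothetical one-dimensional analogue, not the Delaunay algorithm itself: a "bad" element is assumed to be a segment longer than `max_len`, and refining it means splitting it at its midpoint. The names `refine`, `segments` and `max_len` are illustrative and do not come from the talk.

```python
from collections import deque


def refine(segments, max_len):
    """Split every segment longer than max_len until none remain.

    Hypothetical 1-D stand-in for mesh refinement: an element is a
    (lo, hi) tuple and it is "bad" when hi - lo > max_len.
    """
    mesh = set(segments)                          # current elements
    worklist = deque(s for s in mesh if s[1] - s[0] > max_len)
    while worklist:                               # while (wl.size() != 0)
        lo, hi = worklist.popleft()               # wl.get()
        if (lo, hi) not in mesh:                  # element no longer in mesh
            continue
        mid = (lo + hi) / 2.0
        pieces = [(lo, mid), (mid, hi)]           # stands in for retriangulate()
        mesh.remove((lo, hi))                     # stands in for mesh.update(c)
        mesh.update(pieces)
        # Newly created bad elements go back on the worklist.
        worklist.extend(s for s in pieces if s[1] - s[0] > max_len)
    return sorted(mesh)
```

Calling `refine([(0.0, 4.0)], 1.0)` splits the single segment twice over and returns four unit-length segments. As in the slide, the loop body both consumes work and produces new work, which is what makes the iteration count unknown in advance.
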

SLIDE 9

Finding Parallelism

✦ Can expand multiple cavities in parallel
  ✦ Provided cavities do not overlap
✦ Determining this statically is impossible
✦ Solution: Optimistic parallel execution

SLIDE 10

Agglomerative Clustering

✦ Create binary tree of points in a space in bottom-up fashion
✦ Always choose two closest points to cluster

[Figure: points a, b, c, d, e shown as (a) Data points, (b) Hierarchical clusters, (c) Dendrogram]

SLIDE 11

Agglomerative Clustering

✦ Two key data structures
  ✦ Priority Queue – Keeps pairs of points <p,n> where n is the nearest neighbor of p
    ✦ Ordered by distance
  ✦ KD-tree – Spatial structure to find nearest neighbors
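
How the two data structures cooperate can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the speakers' implementation: a brute-force `nearest` function stands in for the KD-tree, a merged cluster is represented by the midpoint of its two children, and all input points are assumed distinct. The names `nearest` and `cluster` are hypothetical.

```python
import heapq
import itertools
import math


def nearest(p, live):
    """Brute-force nearest neighbour of p among live points (KD-tree stand-in)."""
    others = [q for q in live if q != p]
    return min(others, key=lambda q: math.dist(p, q)) if others else None


def cluster(points):
    """Return the merges performed, as (left, right, merged) tuples."""
    live = set(points)
    tie = itertools.count()          # tie-breaker for equal distances
    pq = []                          # entries: (distance, tie, p, n)
    for p in points:
        n = nearest(p, live)
        if n is not None:
            heapq.heappush(pq, (math.dist(p, n), next(tie), p, n))
    merges = []
    while pq:
        _, _, p, n = heapq.heappop(pq)
        if p not in live:            # p was already clustered
            continue
        if n not in live:            # stale neighbour: look it up again
            m = nearest(p, live)
            if m is not None:
                heapq.heappush(pq, (math.dist(p, m), next(tie), p, m))
            continue
        # Assumed representative of the new cluster: midpoint of p and n.
        c = tuple((a + b) / 2.0 for a, b in zip(p, n))
        live -= {p, n}
        live.add(c)
        merges.append((p, n, c))
        m = nearest(c, live)
        if m is not None:
            heapq.heappush(pq, (math.dist(c, m), next(tie), c, m))
    return merges
```

For n distinct points the sketch performs n - 1 merges; the order of the returned list, read bottom-up, gives the binary tree. The priority queue plays the role of the worklist here: entries are consumed in distance order and new entries are added as clusters form.
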

SLIDE 12

Finding Parallelism

✦ Priority queue functions as a worklist
✦ Seems to be completely sequential
✦ If clusters are independent, can be done in parallel

[Figure: points a, b, c, d, e]

SLIDE 13

Lessons Learned

✦ Worklist-style data parallelism
  ✦ May be dependences between iterations
✦ However, worklist abstractions are missing from the code
✦ Concurrent access to shared objects a must
  ✦ worklist, priority queue, kd-tree

SLIDE 14

Galois Programming Model and Implementation

SLIDE 15

Programming Model

✦ Object-based shared memory model
✦ Client code must invoke methods to access object state
✦ Client code has sequential semantics
✦ But runtime system may execute code in parallel

[Figure: Client Code and Galois Objects]

SLIDE 16

Worklist Abstractions

✦ Iterators over collections
  ✦ foreach e in set S do B(e)
    ✦ Iterations can execute in any order
    ✦ As in Delaunay mesh refinement
  ✦ foreach e in poSet S do B(e)
    ✦ Iterations must respect ordering of S
    ✦ As in agglomerative clustering
✦ May be dependences between iterations
✦ Sets can change during execution
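
The sequential meaning of the two iterators can be sketched as follows. This Python sketch only models the semantics on one thread; it does not attempt parallel execution. It assumes that "respect ordering of S" means the smallest pending element always runs next, and the names `foreach_unordered`, `foreach_ordered` and `halve` are hypothetical.

```python
import heapq


def foreach_unordered(workset, body):
    """Run body(e, add) for each element; the order is unspecified.

    body may call add(x) to put new work into the set during the loop.
    """
    pending = list(workset)
    while pending:
        e = pending.pop()                  # any pending element is acceptable
        body(e, pending.append)


def foreach_ordered(items, body):
    """Like foreach_unordered, but the smallest pending element runs next."""
    heap = list(items)
    heapq.heapify(heap)

    def add(x):
        heapq.heappush(heap, x)

    while heap:
        body(heapq.heappop(heap), add)


def halve(log):
    """Example loop body: record n, then add n // 2 as new work while n > 1."""
    def body(n, add):
        log.append(n)
        if n > 1:
            add(n // 2)
    return body
```

Both loops visit the same multiset of elements for the same input; only the ordered version fixes the sequence in which they are visited. In both, the collection grows while it is being iterated, which is the property the slide points out.
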

SLIDE 17

Delaunay Example

Mesh mesh = /* read in mesh */
WorkList wl;
wl.add(mesh.badTriangles());
while (wl.size() != 0) {
  Element e = wl.get();
  if (e no longer in mesh) continue;
  Cavity c = new Cavity(e);
  c.expand();
  c.retriangulate();
  mesh.update(c);
  wl.add(c.badTriangles());
}

SLIDE 18

Delaunay Example

Mesh mesh = /* read in mesh */
WorkList wl;
wl.add(mesh.badTriangles());
foreach Element e in wl {
  if (e no longer in mesh) continue;
  Cavity c = new Cavity(e);
  c.expand();
  c.retriangulate();
  mesh.update(c);
  wl.add(c.badTriangles());
}

rest of code unchanged

SLIDE 19

Delaunay Example

Mesh mesh = /* read in mesh */
WorkList wl;
wl.add(mesh.badTriangles());
foreach Element e in wl {
  if (e no longer in mesh) continue;
  Cavity c = new Cavity(e);
  c.expand();
  c.retriangulate();
  mesh.update(c);
  wl.add(c.badTriangles());
}

Iterators expose worklist abstraction to runtime system

SLIDE 20

Execution Model

✦ Master thread begins execution
✦ When it encounters an iterator, it uses helper threads to aid in execution of iterations
✦ Iterations assigned to thread according to scheduling policy (for now, dynamic to ensure load balance)
✦ Parallel execution of iterator must respect sequential semantics of iterator
  ✦ Concurrent access control
  ✦ Serializability of iterations

SLIDE 21

Concurrent Access

✦ Concurrent invocations to a shared object must not interfere
✦ Our current implementation uses locks
✦ Can use other techniques such as transactional memory (TM)

[Figure: concurrent invocations S.add(x) and S.add(y) on a shared object S]

SLIDE 22

Serializability

[Figure, panel (a): calls S.contains?(x), S.remove(x) and S.add(x) on S. Interleaving is illegal.]
[Figure, panel (b): repeated S.add() and ... = S.get() calls on a Workset. Interleaving is legal (and necessary).]

SLIDE 23

Semantic Commutativity

✦ Method calls which commute can be interleaved
  ✦ Else, commutativity violation
✦ Property of abstract data type
  ✦ Implementation independent
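
What it means for two method calls to commute can be checked directly on a small set abstraction. The Python sketch below is an illustration only: it models calls as `(method, argument)` pairs on a built-in `set`, and its `commutes` rule is a conservative assumption (calls commute when their arguments differ) rather than a rule stated on this slide. The helper names are hypothetical.

```python
def commutes(call_a, call_b):
    """Conservative rule assumed for this sketch: two set calls commute
    when their arguments differ; same-argument pairs count as conflicts."""
    return call_a[1] != call_b[1]


def run(state, call):
    """Apply one (method, argument) call to a Python set; return its result."""
    method, x = call
    if method == "add":
        state.add(x)
    elif method == "remove":
        state.discard(x)
    elif method == "contains":
        return x in state
    else:
        raise ValueError(method)
    return None


def same_either_order(initial, call_a, call_b):
    """True if a-then-b and b-then-a give the same results and final set."""
    s1 = set(initial)
    ra1, rb1 = run(s1, call_a), run(s1, call_b)   # a first, then b
    s2 = set(initial)
    rb2, ra2 = run(s2, call_b), run(s2, call_a)   # b first, then a
    return s1 == s2 and ra1 == ra2 and rb1 == rb2
```

With an empty initial set, `add(1)` and `contains(1)` do not pass `same_either_order`, because the answer of `contains` depends on which call ran first. The check refers only to return values and the abstract set contents, never to how the set is stored, which is the sense in which the property is implementation independent.
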

SLIDE 25

Galois Classes

✦ Inverse methods
  ✦ Allow for rollback when commutativity violated
✦ Commutativity and inverse specified through interface annotation

class SetInterface {
  void add(T x);
    [commutes]
      add(y) {y != x}
      remove(y) {y != x}
      contains(y) {y != x}
    [inverse]
      remove(x)
  bool contains(T x);
    [commutes]
      add(y) {y != x}
      remove(y) {y != x}
  ...
}

Galois Classes expose abstractions to the runtime system
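
The role of inverse methods in rollback can be sketched with an undo log. The Python class below is a hypothetical illustration, not the Galois API: it assumes a single log per object, and it records an inverse only when a call actually changed the set, so that undoing `add(x)` never removes an element that was already present.

```python
class UndoableSet:
    """Set that logs an inverse for each state-changing call."""

    def __init__(self, items=()):
        self._items = set(items)
        self._undo = []                       # (inverse function, argument)

    def add(self, x):
        if x not in self._items:              # log only real changes
            self._items.add(x)
            self._undo.append((self._items.discard, x))   # inverse: remove(x)

    def remove(self, x):
        if x in self._items:
            self._items.discard(x)
            self._undo.append((self._items.add, x))       # inverse: add(x)

    def contains(self, x):
        return x in self._items               # read-only, nothing to undo

    def commit(self):
        """Make all changes permanent by forgetting their inverses."""
        self._undo.clear()

    def rollback(self):
        """Undo uncommitted changes, most recent first."""
        while self._undo:
            inverse, x = self._undo.pop()
            inverse(x)
```

Starting from `UndoableSet({1})`, calling `add(2)`, `remove(1)` and then `rollback()` restores the set to `{1}`. Running inverses in reverse order matters when several calls touch the same element.
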

SLIDE 26

Runtime System

✦ Two main components:
  ✦ Global commit pool
    ✦ Manages iterations
    ✦ Similar to reorder buffer in out-of-order execution (OOE) processors
  ✦ Per object conflict logs
    ✦ Detects commutativity violations
    ✦ Triggers aborts if commutativity violated
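
A per-object conflict log can be sketched as a list of calls made by iterations that have not yet committed. The Python sketch below is a simplified, single-threaded illustration with several assumptions of its own: calls from different iterations are taken to commute only when their arguments differ, the iteration that makes the later call is the one aborted, and the names `ConflictLog`, `Abort`, `record` and `release` are hypothetical.

```python
class Abort(Exception):
    """Signals that an iteration must be rolled back."""


class ConflictLog:
    """Log of calls made on one object by uncommitted iterations."""

    def __init__(self):
        self._calls = []                  # (iteration id, method, argument)

    def record(self, iteration, method, arg):
        """Log a call, or raise Abort if it conflicts with another iteration."""
        for other, _, other_arg in self._calls:
            # Assumed rule: calls from different iterations commute
            # only if their arguments differ.
            if other != iteration and other_arg == arg:
                raise Abort(f"iteration {iteration} conflicts with {other}")
        self._calls.append((iteration, method, arg))

    def release(self, iteration):
        """Drop an iteration's entries once it commits or aborts."""
        self._calls = [c for c in self._calls if c[0] != iteration]
```

If iteration 1 has logged `add("x")`, a later `contains("x")` from iteration 2 raises `Abort`; once iteration 1 is released, the same call from iteration 2 is accepted. A real runtime would also need to synchronize access to the log and coordinate with the commit pool, both of which this sketch leaves out.
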

SLIDE 27

Evaluation

✦ Evaluation platform:
  ✦ Implementation in C++
  ✦ gcc compiler on Red Hat Linux
  ✦ 4 processor, shared memory system
  ✦ Itanium 2 @ 1.5 GHz

SLIDE 28

Evaluation – Delaunay

✦ Three different versions of benchmark
  ✦ reference – purely sequential code
  ✦ FGL – hand-written, optimistic parallel code using fine-grained locking
  ✦ meshgen – Galois version of code
✦ Input mesh generated using Triangle
  ✦ ~10K triangles
  ✦ ~4K bad triangles

SLIDE 29

Abort Ratios

✦ Optimism must be warranted
  ✦ Conflicts lead to rollbacks, which waste work
✦ FGL and meshgen have abort ratios <1% on 4 processors
✦ Closely tied to scheduling policy
  ✦ Choice of proper scheduling policy is crucial for good performance

SLIDE 31

Evaluation – Delaunay

[Figure: two plots against # of processors (1 to 4), each with series reference, FGL and meshgen. One plots Speedup, the other Execution Time (s).]

~3x speedup

SLIDE 32

Performance Breakdown

[Figure: two bar charts for 1 proc and 4 proc, broken down into Client, Object and Runtime]

Cycles (billions): 13.8951 (1 proc), 18.8501 (4 proc)
Instructions (billions): 16.8889 (1 proc), 17.4675 (4 proc)

SLIDE 33

Related Work

✦ Weihl, 1988 – Concurrency control using commutativity properties of ADTs
✦ Rinard & Diniz, 1996 – Static commutativity analysis for parallelization
✦ Wu & Padua, 1998 – Exploiting semantic properties of containers in compilation
✦ Ni et al, 2007 – Open nesting using abstract locks

SLIDE 34

Conclusions

✦ Optimistic parallelism necessary to parallelize irregular, worklist-based applications
✦ Need to exploit high-level semantics
  ✦ Iterators to expose parallelism
  ✦ Galois classes to expose semantics of objects

SLIDE 35

Thank You!

Email: milind@cs.utexas.edu