Parallel Functional Arrays
Ananya Kumar, Guy Blelloch, Robert Harper
Carnegie Mellon University
Goals
- Functional arrays
- Efficient (constant time)
- Parallel
- Well-defined cost semantics
Previous Work - Monads
- Thread mutable state
- Enforce single reference to array
- Need completely different code
- Not parallel
Previous Work – Specialized Type System
- Enforce single threadedness of arrays
- Not available in most languages
- Hard to reason about
Previous Work – Reference Counting
- Check reference counts
- If one, update in place, else copy
- Depends on compiler
- Hard to reason about
Sequences
A = NEW(5, 0)
B = SET(A, 0, 3)
C = SET(A, 2, 3)
D = SET(C, 4, 14)
E = SET(D, 1, 11)
[Figure: contents of the resulting sequences]
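The operations above follow the usual functional-array interface: SET returns a new sequence and leaves its argument unchanged, so A can still be used after B has been derived from it. A minimal Python sketch of that behavior, using naive copying rather than the implementation presented in this talk (the helper names `new`, `set_`, and `get` are illustrative assumptions):

```python
def new(size, init):
    """Return an immutable sequence holding `size` copies of `init`."""
    return (init,) * size


def set_(seq, i, v):
    """Return a copy of `seq` with index `i` replaced by `v`.

    This naive version copies the whole sequence, so it costs O(N) per SET.
    """
    return seq[:i] + (v,) + seq[i + 1:]


def get(seq, i):
    """Return the element at index `i`."""
    return seq[i]


a = new(5, 0)
b = set_(a, 0, 3)
c = set_(a, 2, 3)   # A is unchanged by the SET that produced B
d = set_(c, 4, 14)
e = set_(d, 1, 11)
```

The O(N) copy on every SET is the cost that the approaches discussed in the rest of the talk aim to avoid.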
Previous Work
- N = size of array
- Dietz – O(log log N) per operation
- Trailer arrays – O(1) for leaves
- Improvements by Chuang, O'Neill
- No support for concurrency
Our Approach
- Functional
- Efficient – O(1) for leaves, fast for interior
- Parallel – wait-free
- Well-defined cost semantics
Sequence Implementation
[Figure: implementation diagram showing sequences C, D, and E]
Main Sections
- Cost dynamics
- Concurrent implementation
Fork-Join Parallelism
(1+2) || (3+4)
- Fork: 1+2 and 3+4 are evaluated in parallel
- 1+2 evaluates to 3; 3+4 evaluates to 7
- Join: the result is the pair (3, 7)
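The fork and join steps can be sketched in Python with the standard library's thread pool. This is only an illustration of the control structure; the helper name `fork_join` is an assumption, and CPython threads do not speed up CPU-bound work like this.

```python
from concurrent.futures import ThreadPoolExecutor


def fork_join(left, right):
    """Run two zero-argument callables in parallel and return both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        left_future = pool.submit(left)     # fork the left branch
        right_future = pool.submit(right)   # fork the right branch
        # Join: wait for both branches and pair up their results.
        return left_future.result(), right_future.result()


result = fork_join(lambda: 1 + 2, lambda: 3 + 4)
```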
Work and Span
- Cost tree nodes: N, log(N), 1, 1, 1, 1
- Work: size of cost tree = N + log(N) + 4
- Span: depth of cost tree = N + log(N) + 2
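Work and span can be computed by a simple recursion over a cost tree: work adds up every node, while span adds along sequential composition and takes the maximum across parallel branches. The sketch below assumes a tree shape consistent with the totals above (steps of cost N and log(N) in sequence, followed by a fork whose two branches each take two unit-cost steps); the class names `Step`, `Serial`, and `Parallel` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Step:
    cost: float          # a single node of the cost tree


@dataclass
class Serial:
    parts: list          # sub-trees executed one after another


@dataclass
class Parallel:
    branches: list       # sub-trees executed side by side (fork-join)


def work(tree):
    """Total cost of all nodes in the tree."""
    if isinstance(tree, Step):
        return tree.cost
    children = tree.parts if isinstance(tree, Serial) else tree.branches
    return sum(work(child) for child in children)


def span(tree):
    """Cost of the longest dependency chain in the tree."""
    if isinstance(tree, Step):
        return tree.cost
    if isinstance(tree, Serial):
        return sum(span(part) for part in tree.parts)
    return max(span(branch) for branch in tree.branches)


n = 1024
tree = Serial([
    Step(n),
    Step(math.log2(n)),
    Parallel([
        Serial([Step(1), Step(1)]),
        Serial([Step(1), Step(1)]),
    ]),
])
```

For this tree, `work(tree)` equals N + log(N) + 4 and `span(tree)` equals N + log(N) + 2.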
Scheduling Theorems
- Work and span give the execution cost on a P-processor machine
- Goal: evaluate the cost of using sequences on a P-processor machine
- Sufficient to evaluate work and span
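One standard result of this kind is the greedy-scheduling bound, often called Brent's theorem; whether it is the exact theorem this talk relies on is an assumption. Writing W for work, S for span, and T_P for the running time on P processors under a greedy scheduler:

```latex
\max\!\left(\frac{W}{P},\; S\right) \;\le\; T_P \;\le\; \frac{W}{P} + S
```

The lower bound holds for any schedule, so a greedy scheduler is within a factor of two of optimal, which is why bounding work and span is enough.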
Parallel Structural Dynamics
- Cost of running program with ∞ processors
- Deterministic
Interleaved Structural Dynamics
- Cost of running program with 1 processor
- Non-deterministic
- Store which sequences are interior and which are leaf
Work = Non-Deterministic
A (leaf), size N; operations: GET, SET, GET, GET, then Join
Work (Good Interleaving)
- Step 1: current work 1, total work 1
- Step 2: current work 1, total work 2
- Step 3: current work 1, total work 3
- Step 4: current work 1, total work 4
Work (Bad Interleaving)
- Step 1: current work 1, total work 1
- Step 2: current work 1, total work 2
- Step 3: current work log(N), total work 2 + log(N)
- Step 4: current work log(N), total work 2 + 2log(N)
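The dependence of work on the interleaving can be reproduced with a small simulation. The cost model here is inferred from the slides and should be read as an assumption: every operation on a leaf costs 1, the SET turns A into an interior sequence, and each later GET on A costs log(N). Which operations belong to which branch of the fork is also assumed; the function name `total_work` is hypothetical.

```python
import math


def total_work(schedule, n):
    """Sum the cost of operations on A, executed in the given order.

    Only a single SET on a leaf is modeled; a SET on an interior
    sequence is outside the scope of this sketch.
    """
    a_is_leaf = True
    total = 0.0
    for op in schedule:
        if op == "GET":
            # Cheap while A is a leaf, expensive once A is interior.
            total += 1 if a_is_leaf else math.log2(n)
        elif op == "SET":
            total += 1
            a_is_leaf = False   # A becomes interior after the update
        else:
            raise ValueError(f"unknown operation: {op}")
    return total


n = 1024
# Every GET runs while A is still a leaf.
good = total_work(["GET", "GET", "GET", "SET"], n)
# The SET runs before the last two GETs, so they become expensive.
bad = total_work(["GET", "SET", "GET", "GET"], n)
```

Under these assumptions the two schedules give total work 4 and 2 + 2log(N), matching the good and bad interleavings above.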
GET-GET Case
A (leaf), size N GET GET GET GET Join
SET-GET Case
A (leaf), size N GET SET GET GET Join
SET-SET Case
A (leaf), size N GET SET SET GET Join
Upper Bounding Work
- Deterministic evaluational dynamics
- Store which sequences are leaf and interior
- Store the number of “cheap” (cost = 1) GETs on each sequence
- At the join, if a sequence was modified on one side, make the GETs expensive (cost = log(N))
Upper Bounding Work
- Showed that the upper bounds are valid for all interleavings
- Showed that the upper bound is tight*
A = NEW(5, 0)
- Seq A: Version 1
- ArrayData 1 (Version = 1)
B = SET(A, 2, 5)
- Seq A: Version 1
- Seq B: Version 2
- ArrayData 1 (Version = 2), with 5 stored at index 2
- Log entry records a version (Version 1) and a value
Naïve SET
- Implementation of SET(A, i, v)
- First set values[i] = v
- Then add a log entry to arraydata
GET-SET Race
- Sequence A, version 1
- Array data AD, version 1
- Values = [0, 0, 0, 0, 0]
- Logs = empty
- Step 1 (Thread 1): Values[2] = 5
- Step 2 (Thread 2): GET(A, 2) returns 5, although A at version 1 should still see 0
- Step 3 (Thread 1): Add log entry to Logs[2]
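The three-step interleaving can be replayed deterministically in a single thread to see why the order of the two writes matters. The reader logic and the `(version, old_value)` log layout below are assumptions made for illustration, not the talk's actual GET; the function name `replay` is hypothetical.

```python
def replay(log_first):
    """Replay the interleaving with the reader between the writer's two steps.

    Returns the value that GET(A, 2) observes for sequence A at version 1.
    """
    values = [0, 0, 0, 0, 0]
    logs = [[] for _ in values]   # logs[i]: list of (version, old_value)
    reader_version = 1            # the reader holds sequence A, version 1
    old_value = values[2]         # what version 1 is supposed to see

    def get(i):
        # Hypothetical reader: prefer a logged old value for this version.
        for version, logged in logs[i]:
            if version >= reader_version:
                return logged
        return values[i]

    def write_value():
        values[2] = 5

    def write_log():
        logs[2].append((1, old_value))

    if log_first:
        first, second = write_log, write_value
    else:
        first, second = write_value, write_log

    first()              # Step 1 (Thread 1)
    observed = get(2)    # Step 2 (Thread 2)
    second()             # Step 3 (Thread 1)
    return observed
```

With the naive order, `replay(log_first=False)` observes the new value 5; with the log entry written first, `replay(log_first=True)` observes the old value 0. This covers only this one interleaving and says nothing about the other orderings a real concurrent GET has to handle.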
A Wait-Free Solution
- Can be fixed by adding the log entry before mutating the values array
- Other issues in GET require careful ordering
- Other issues in GET require careful ordering
- Other issues in SET require compare & swap
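The first fix above (log before mutate) can be placed in a single-threaded sketch of a versioned array with per-index logs. This is an assumption-laden illustration, not the wait-free implementation from the talk: it has no compare & swap and no concurrent ordering, the log layout and all names (`ArrayData`, `Seq`, `new`, `get`, `set_`, `to_list`) are hypothetical, and a SET on an interior sequence simply copies.

```python
from bisect import bisect_left


class ArrayData:
    """Shared mutable storage behind one or more sequences."""

    def __init__(self, values):
        self.values = list(values)
        self.version = 1
        # logs[i] holds (version, old_value) pairs in increasing version order.
        self.logs = [[] for _ in self.values]


class Seq:
    """An immutable view: array data plus the version this view sees."""

    def __init__(self, data, version):
        self.data = data
        self.version = version


def new(size, init):
    return Seq(ArrayData([init] * size), 1)


def get(seq, i):
    data = seq.data
    if seq.version == data.version:       # leaf: read directly
        return data.values[i]
    log = data.logs[i]
    # First entry whose version is >= seq.version; the 1-tuple sorts
    # before any (version, old_value) pair with the same version.
    k = bisect_left(log, (seq.version,))
    return log[k][1] if k < len(log) else data.values[i]


def set_(seq, i, v):
    data = seq.data
    if seq.version == data.version:       # leaf: update in place
        data.logs[i].append((seq.version, data.values[i]))  # log first
        data.values[i] = v                                   # then mutate
        data.version += 1
        return Seq(data, data.version)
    # Interior: copy the values this version sees, then write to the copy.
    fresh = ArrayData(get(seq, j) for j in range(len(data.values)))
    fresh.values[i] = v
    return Seq(fresh, 1)


def to_list(seq):
    return [get(seq, j) for j in range(len(seq.data.values))]


a = new(5, 0)
b = set_(a, 0, 3)    # A was a leaf: B shares A's array data
c = set_(a, 2, 3)    # A is now interior: C gets a copy
d = set_(c, 4, 14)
e = set_(d, 1, 11)
```

In this sketch a GET on a leaf is a direct array read, while a GET on an interior sequence searches the log for that index, which mirrors the cheap and expensive GETs in the cost discussion.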
Experimental Results
- Compared sequences to regular arrays
- Random & sequential accesses
- Writing: 2-3 times slower
- Reading: under 10% slower
Concurrent Results
- Compared
  – 1 thread reading a million times
  – 2 threads reading half a million times
- 2 threads were > 1.75 times faster
Summary
- Functional array implementation
- O(1) operations for leaf
- Wait-free concurrent
- Well-defined cost semantics
Future Work
- Prove concurrent costs of sequence implementation
- Tighter cost bounds
- Extend to disjoint sets, unordered sets
- Lower bound for functional array costs
Acknowledgements
- Joe Tassarotti for lots of advice on correctness proof
- Danny Sleator for ideas on lower bounds for functional array costs
- NSF, Air Force Office, Intel for grants