Parallel Objects for Multicores: A Glimpse at the Parallel Language Encore

Dave Clarke & Tobias Wrigstad, Uppsala University. SFM Summer School, Bertinoro, June 2015.


SLIDE 1

Parallel Objects for Multicores

A Glimpse at the Parallel Language Encore

Dave Clarke & Tobias Wrigstad, Uppsala University

SFM Summer School, Bertinoro, June 2015

SLIDE 2

Overview

SLIDE 3

Tutorial Overview

  • Background and Motivation
  • Language Design Inversion
  • Encore Language Design (5 Inversions)
  • (Exercise Session)

SLIDE 4

Motivation

SLIDE 5

Background

In the early 2000s hardware hit a wall

– "Too much power used too inefficiently"
– CPU temperature approaching that of the sun's surface
– Adding 2x transistors yields 2% speedup

Solution: multi- and manycore machines

– Use 2x transistors to build 2x cores
– 200% speedup, in theory
– Essentially pushes the problem over to software
– "'No one' knows how to program these machines"

SLIDE 6

Object-Oriented Parallel Programming

Combining object-orientation and parallelism is hard

– Aliasing makes reasoning about efficient parallelism difficult
– Abstract dynamic structures stress memory bottlenecks
– Compositionality of concurrency control…

One root cause: classical languages evolved in a predominantly sequential setting

– Support for concurrency & parallelism as an afterthought
– Thread libraries are easily integrated, but hard to use
– Essentially pushes the problem over to application programmers
– "'No one' knows how to program with lots of threads"

SLIDE 7

Aliasing Problem: Shared Mutable State

(Diagram: objects A and B.)

SLIDE 8

Aliasing Problem: Shared Mutable State

Even worse with concurrency/parallelism.

(Diagram: objects A and B.)

SLIDE 9

Locks

Must acquire a lock before accessing a certain resource.

(Diagram: A and B; accesses labelled write and read.)

SLIDE 10

Locking Too Little

(Diagram: A and B; accesses labelled write and read.)

SLIDE 11

Locking Too Much

Forces interleaved access even for commuting operations.

(Diagram: A and B; accesses labelled read and read.)

SLIDE 12

Compositionality of Locks

(Diagram: A and B, deadlocked.)

acquire A, B;
acquire B, A;
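The deadlock above comes from two threads taking the same two locks in opposite orders. A minimal sketch in Python (not Encore) of the usual workaround, a single global acquisition order; the names `transfer`, `acquire_in_order` and the `balance` table are made up for illustration:

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
balance = {"A": 100, "B": 100}

def acquire_in_order(*locks):
    """Sort the locks by a global key (here: id) so that every thread
    acquires them in the same order, whatever order the caller named."""
    return sorted(locks, key=id)

def transfer(src, dst, src_lock, dst_lock, amount, rounds):
    for _ in range(rounds):
        first, second = acquire_in_order(src_lock, dst_lock)
        with first, second:
            balance[src] -= amount
            balance[dst] += amount

def run_transfers(rounds=1000):
    # Thread 1 names the locks A, B; thread 2 names them B, A.
    t1 = threading.Thread(target=transfer, args=("A", "B", lock_a, lock_b, 1, rounds))
    t2 = threading.Thread(target=transfer, args=("B", "A", lock_b, lock_a, 1, rounds))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return dict(balance)
```

The ordering rule has to be respected by every piece of code that touches both locks, which is exactly the compositionality burden the slide points at.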

SLIDE 13

Locks are Good, Locks are Bad

  • Threads and locks are easy to add to a programming language with minimal changes
  • Places the burden on the programmer instead of the programming language designer
  • Code that requires synchronisation is indistinguishable from code that does not
  • Locks perform quite well quite often
  • Uncontended locks are cheap
  • Highly contended locks are expensive
  • Coarse-grained locking is simpler but reduces parallelism
  • Fine-grained locking allows parallelism, but is harder (e.g. deadlocks)

SLIDE 14

Rethink object-oriented programming languages

– Remove sequential bias in classical languages
– Keep a sufficiently object-oriented programming model
– Save industry investments in OOSD

End goal: make massively parallel programming in OO-languages possible & affordable

SLIDE 15

Language Design Inversion

SLIDE 16

Inversion

  • Most modern languages are designed first for sequential programming, with parallel programming constructs tacked on; Erlang is one exception.
  • Mutability, possible data dependencies, shared state, poor locality, etc. all limit possible parallelism and scalability.
  • Inversion = adopt defaults that favour parallelism and scalability.

SLIDE 17

Inversion in design of Encore

  • Concurrent-by-default
  • (Data)Parallel-by-default
  • Data-race-free-by-default
  • Isolated-by-default
  • Asynchronous-by-default
  • Linear-by-default
  • Immutable-by-default
  • Local-by-default
  • Multi-object-by-default
  • …

SLIDE 18

… by-default

  • Defaults can be overridden, at the cost of additional code overhead.
  • Some defaults conflict; the conflicts need to be addressed.

SLIDE 19

Concurrent-by-Default

SLIDE 20

Concurrency-by-Default

  • Java objects were designed for sequential access.
  • Threads trample over objects.
  • Locks/monitors added to protect objects.
  • Erlang has concurrency by default (actors), but it is not object-oriented.

SLIDE 21

Actors/Active Objects

SLIDE 22

Active Object Characteristics

  • Mailbox
  • Single thread of control
  • Isolation
  • Asynchronous communication
– Saturation of asynchronous operations on different objects enables efficient use of parallel machines
  • Method suites defined in classes + usually OO
  • Return values handled using futures

SLIDE 23

Actor Characteristics

(Diagram: Active Obj. A with messages m1 and m2; a direct access from Active Obj. B is marked "not allowed".)

SLIDE 24

Actor Characteristics

(Diagram: Active Obj. B calls a.m2() on Active Obj. A, whose queue Q holds m1 and m2 and which is running m1. Each call has a record with the fields status, value, action and run mode; status is one of waiting, running, suspended, finished; the run-mode labels read "by recv." and "by anyone".)
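The characteristics listed for active objects (mailbox, single thread of control, asynchronous calls returning futures) can be sketched in Python. This is only an illustration, not how Encore or its runtime implements them; `ActiveObject`, `send` and `Counter` are hypothetical names:

```python
import queue
import threading
from concurrent.futures import Future

class ActiveObject:
    """Hypothetical base class: one mailbox, one thread, futures for results."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, method_name, *args):
        """Asynchronous call: enqueue a message and return a future at once."""
        fut = Future()
        self._mailbox.put((method_name, args, fut))
        return fut

    def stop(self):
        self._mailbox.put(None)
        self._thread.join()

    def _run(self):
        # Single thread of control: messages are processed one at a time,
        # so the object's fields need no locks.
        while True:
            msg = self._mailbox.get()
            if msg is None:
                break
            method_name, args, fut = msg
            try:
                fut.set_result(getattr(self, method_name)(*args))
            except Exception as exc:
                fut.set_exception(exc)

class Counter(ActiveObject):
    def __init__(self):
        self.count = 0          # only touched by the object's own thread
        super().__init__()

    def increment(self, by):
        self.count += by
        return self.count

def demo():
    c = Counter()
    futures = [c.send("increment", 1) for _ in range(100)]
    last = futures[-1].result()   # blocks until fulfilled, like `get`
    c.stop()
    return last
```

Callers never touch `count` directly; they only enqueue messages, which is the isolation property the "not allowed" arrow expresses.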

SLIDE 25

Active Object-based Parallelism

(Diagram labels: synchronous, asynchronous, single thread of control.)

SLIDE 26

Active Object-based Parallelism

(Diagram labels: synchronous, asynchronous, single thread of control; one object has a BIG JOB TO DO.)

SLIDE 27

BIG JOB TO DO

Fork multiple actors

SLIDE 28

Thread Ring Example (A Little Bit Boring)

class Main
  def main(): void
    let index = 1
        first = new Worker(index)
        next = null : Worker
        nhops = 50000000
        ring_size = 503
        current = first
    in {
      while (index < ring_size) {
        index = index + 1;
        next = new Worker(index);
        current ! setNext(next);
        current = next;
      };
      current ! setNext(first);
      first ! run(nhops);
    }

class Worker
  id : int
  next : Worker
  def init(id : int): void
    this.id = id
  def setNext(next: Worker): void
    this.next = next
  def run(n : int): void
    if (n > 0) then
      this.next!run(n-1)
    else
      print(this.id)
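The message flow of the thread ring can be traced with a small single-threaded simulation in Python. It is a sketch of the protocol only (one shared FIFO stands in for all mailboxes), not of Encore's scheduling; `thread_ring` is a made-up name:

```python
from collections import deque

def thread_ring(ring_size, nhops):
    """Simulate the ring with one FIFO of pending messages.
    Each message is (worker_id, n), mirroring `worker ! run(n)`."""
    # next_of[i] is the id of the worker that worker i forwards to.
    next_of = {i: i + 1 for i in range(1, ring_size)}
    next_of[ring_size] = 1            # close the ring

    pending = deque([(1, nhops)])     # first ! run(nhops)
    while pending:
        worker_id, n = pending.popleft()
        if n > 0:
            pending.append((next_of[worker_id], n - 1))
        else:
            return worker_id          # the Encore version prints this id
```

After h hops the token sits at worker (h mod ring_size) + 1, so with the slide's parameters the printed id is (50 000 000 mod 503) + 1 = 292.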

SLIDE 29

Threadring Benchmark [pl shootout bench]

(Bar chart: speedup normalised on Ruby, logarithmic axis from 1 to 100, for Go, Clojure, Racket, C, OCaml, Java, C++, Ruby and Encore; the OO languages are marked, and one bar is labelled 51x.)

Tested on a 4-core laptop. Note: higher is better.

PonyRT inside!

(Slide footer: Tobias Wrigstad (UU), Brussels, 26.02.15)

SLIDE 30

Sieve of Eratosthenes

(Figure: the numbers 2 to 100 laid out in a grid.)

SLIDE 40

Parallel Sieve of Eratosthenes

(Figure: the numbers 2 to 100 flow from a Source to workers W1, W2, W3, W4 and W5.)

SLIDE 41

Prime Sieve Benchmark

(Diagram: each filter is an active object holding its primes and a sending buffer.)

~200 LOC Encore + 130 LOC from libraries

SLIDE 42

Parallel Prime Sieve in a Nutshell

(Diagram: active objects, each responsible for a range of numbers. The ranges start at 3–√N, 679–, 5341–, 1345–, 2011–, 2677–, 3343–, 4009–, 4675–, 6007–, 8005–; the rest is omitted.)

Found primes are sent to children.

~200 LOC Encore + 130 LOC from libraries

SLIDE 43

Parallel Prime Sieve in a Nutshell

(Diagram: the same ranges as on the previous slide, with the prime 3 being passed down from object to object; the rest is omitted.)

Root:
  • Scans vector of numbers linearly to find primes
  • Forwards each prime P to its immediate children

Children:
  • Cancel all multiples of P in their range
  • Forward each prime P to their immediate children

SLIDE 44

Parallel Prime Sieve in a Nutshell

(Diagram: nodes with partial results A, B, C and the total D; the rest is omitted.)

  • When done, send result to parent, e.g. "A primes found"
  • Aggregate result with children, send to parent
  • Aggregate result with children, display D = A + B + C

50847534!
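The scheme above (find the small primes once, hand them to workers that each cancel multiples in their own range, then add up the counts) can be sketched in Python. This is a flat, segmented variant under stated assumptions, not the tree of active objects from the slides; `base_primes`, `count_in_segment` and `count_primes` are made-up names, and CPython threads will not show a real speedup here:

```python
from concurrent.futures import ThreadPoolExecutor
from math import isqrt

def base_primes(limit):
    """Plain sequential sieve for the primes up to `limit` (the root's job)."""
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]
    for p in range(2, isqrt(limit) + 1):
        if is_prime[p]:
            for m in range(p * p, limit + 1, p):
                is_prime[m] = False
    return [n for n, flag in enumerate(is_prime) if flag]

def count_in_segment(lo, hi, primes):
    """Worker: cancel multiples of each received prime in [lo, hi), count survivors."""
    alive = [True] * (hi - lo)
    for p in primes:
        start = max(p * p, ((lo + p - 1) // p) * p)   # first multiple to cancel
        for m in range(start, hi, p):
            alive[m - lo] = False
    return sum(alive)

def count_primes(n, workers=4):
    """Count primes <= n by splitting (sqrt(n), n] into `workers` segments."""
    root = isqrt(n)
    primes = base_primes(root)
    lo, hi = root + 1, n + 1
    step = max(1, (hi - lo + workers - 1) // workers)
    segments = [(a, min(a + step, hi)) for a in range(lo, hi, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda seg: count_in_segment(seg[0], seg[1], primes), segments)
        # Aggregate: the root's own primes plus every worker's count.
        return len(primes) + sum(counts)
```

The final line plays the role of the aggregation step D = A + B + C on the last sieve slide.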

SLIDE 45

Strong Scalability (Normalised on 1, calculating 1.6B primes)

(Chart: speedup on a logarithmic axis marked 10x and 100x, for 1, 3, 7, 15, 31, 64 and 127 actors mapped onto 1–64 cores; annotations: 0.3 seconds, 30x.)

SLIDE 46

Back to the Futures

  • A future is a placeholder for a value
  • Asynchronous methods return futures
  • When the method is complete, its result is assigned to the future: the future is fulfilled.

(Diagram: queue Q holding m1 and m2, with m1 running; a record with the fields status, value, action and run mode; status is one of waiting, running, suspended, finished.)

SLIDE 47

Accessing a future: get

get :: Fut t -> t

Returns the value associated with a future if it is available; otherwise blocks the current active object until it is.

get immediately after a call ~ synchronous call

SLIDE 48

(Diagram: A performs the asynchronous call x ! foo() on B and later reads from the future; B, with its single thread of control, writes the return value. Labels: synchronous, x ! foo(), single thread of control.)

SLIDE 49

Sequential chain

p = b.loadPageSource();
i = p.loadImages();
display.render(p, i);

(Diagram: A calls x ! foo() on B, then performs get f; hopefully, f is fulfilled before this happens.)

SLIDE 50

Sequential chain

p = get b.loadPageSource();
i = get p.loadImages();
display.render(p, i);

(Diagram: A calls x ! foo() on B, then performs get f; hopefully, f is fulfilled before this happens.)

SLIDE 51

"Fork-join"

i = p.loadImages();
a = b.loadAds();
display.render(get i, get a);

(Diagram: A calls x ! foo() on B, then performs get f; hopefully, f is fulfilled before this happens.)
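The difference between the sequential chain and the fork-join pattern can be tried out with Python's `concurrent.futures`, where `result()` plays the role of `get`. A minimal sketch; the loader functions are hypothetical stand-ins for the slide's `loadPageSource`, `loadImages` and `loadAds`:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the slide's page-loading calls.
def load_page_source():
    return "<html>page</html>"

def load_images(page):
    return ["img for " + page]

def load_ads():
    return ["ad"]

def render(*parts):
    return list(parts)

def sequential_chain(pool):
    # Calling result() right after submit() behaves like a synchronous call.
    p = pool.submit(load_page_source).result()
    i = pool.submit(load_images, p).result()
    return render(p, i)

def fork_join(pool, page):
    # Start both computations first, then `get` both results.
    i = pool.submit(load_images, page)
    a = pool.submit(load_ads)
    return render(i.result(), a.result())

def demo():
    with ThreadPoolExecutor(max_workers=2) as pool:
        return sequential_chain(pool), fork_join(pool, "<html>page</html>")
```

In `fork_join` the two loads can overlap; in `sequential_chain` each step waits for the previous one.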

SLIDE 52

Operations on Futures

await :: Fut t -> t
– like get, but relinquishes control of the active object until a value in the future is available, then returns that value

poll :: Fut t -> Bool
– checks whether the future has been fulfilled

+ chaining (next slide)

(Diagram: A, B and queue Q.)

SLIDE 53

Sequential chain

chain :: Fut t -> (t -> t') -> Fut t'
– apply a function asynchronously to the result of a future, returning a future for the result

b.loadPageSource() ~~> λ p -> p.searchAdWords() ~~> λ w -> getAds(w);

Creates a "workflow" that is disconnected from A, which avoids blocking A.

(Diagram labels: synchronous, x ! foo(), single thread of control.)

SLIDE 54

Sequential chain

chain :: Fut t -> (t -> t') -> Fut t'
– apply a function asynchronously to the result of a future, returning a future for the result

b.loadPageSource() ~~> λ p -> p.searchAdWords() ~~> λ w -> getAds(w);

Creates a "workflow" that is disconnected from A, which avoids blocking A.

(Diagram labels: synchronous, x ! foo(), single thread of control, ~~>, ~~> (get f).)

SLIDE 55

Sequential chain

b.loadPageSource() ~~> λ p -> p.searchAdWords() ~~> λ w -> getAds(w);

Creates a "workflow" that is disconnected from A, which avoids blocking A.

Two "run modes" depending on how the environment is captured:

  • Detached mode: closure is "self-contained" and can be run by any thread
  • Attached mode: closure captures (mutable) local state and must be run by its creator
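The `chain` signature can be mimicked on top of Python futures. A minimal sketch, assuming `concurrent.futures.Future`; it is not Encore's implementation, and the callback simply runs on whichever thread fulfils the future, which is closest to the detached mode:

```python
from concurrent.futures import Future, ThreadPoolExecutor

def chain(fut, fn):
    """chain :: Fut t -> (t -> t') -> Fut t'
    Returns a new future fulfilled with fn(value) once `fut` is fulfilled."""
    result = Future()

    def on_done(done):
        try:
            result.set_result(fn(done.result()))
        except Exception as exc:      # propagate failures down the chain
            result.set_exception(exc)

    fut.add_done_callback(on_done)    # does not block the caller
    return result

def demo():
    with ThreadPoolExecutor(max_workers=1) as pool:
        page = pool.submit(lambda: "encore actors futures")
        words = chain(page, lambda p: p.split())
        count = chain(words, lambda w: len(w))
        return count.result()         # a final `get` is still needed for the value
```

Neither call to `chain` blocks; only the last `result()` waits.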

SLIDE 56

Cooperative Multi-Tasking

  • await (Fut t -> t): like get, but if the future has not been fulfilled it relinquishes control of the active object to process another message (if there is one)
  • suspend relinquishes control of the active object to process another message
  • Both require the active object to reestablish its class invariants before relinquishing control

Essentially the aliasing problem, but without the concurrency
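Why invariants must hold before control is relinquished can be illustrated with Python's `asyncio`, whose `await` is also cooperative. A sketch under that analogy only; the `Account` class and its invariant are made up:

```python
import asyncio

class Account:
    """Hypothetical object with the invariant: total == sum(entries)."""

    def __init__(self):
        self.entries = []
        self.total = 0
        self.log = []

    async def deposit(self, amount, slow_check):
        # Update both fields BEFORE awaiting, so the invariant holds
        # whenever another message gets to run.
        self.entries.append(amount)
        self.total += amount
        ok = await slow_check          # relinquish control until fulfilled
        self.log.append(("deposit", amount, ok))

    async def audit(self):
        # May run while deposit() is suspended at its await.
        self.log.append(("audit", self.total == sum(self.entries)))

async def main():
    acct = Account()
    loop = asyncio.get_running_loop()
    check = loop.create_future()
    d = asyncio.create_task(acct.deposit(10, check))
    a = asyncio.create_task(acct.audit())
    await asyncio.sleep(0)             # let both tasks start
    check.set_result(True)             # fulfil the future deposit() waits on
    await asyncio.gather(d, a)
    return acct.log

def demo():
    return asyncio.run(main())
```

If `deposit` awaited between its two field updates, `audit` could observe a broken invariant even though only one thread is involved.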

SLIDE 57

Comparison

  • get and await are costly when the future has not been fulfilled, as they require copying and storing the current calling context (stack)
  • chaining is cheaper, but eventually a get is needed if you need the value

SLIDE 58

Data-race-free-by-Default and Isolation-by-Default

SLIDE 59

Passive Objects

  • Not all objects need their own (logical) thread of control
  • Synchronous communication, "borrows" the thread of control of the caller
  • Sharing passive objects across active objects is unsafe, so they must be isolated
  • Passive objects act as regular objects, without synchronisation overhead
  • It is possible to reason about how their state changes during an operation

SLIDE 60

Gradual Sharing?

  • 1. Isolation (so trivially race-free)
  • 2. Sharing, but sharing in race-free manner
  • 3. Sharing with races
  • Who controls race-freedom?
– Guaranteed by system (enforced at declaration-site)
– Guaranteed by programmer (enforced at use-site | not at all)

Explain DRF here

SLIDE 61

Basic Isolation

  • Fields can only be accessed by their active object.
  • But what about objects in fields?
  • Isolation by enforcing copying values across active objects
  • … by using a powerful type system to enable transfer, cooperation, read-sharing, etc.

SLIDE 62

Benefits & Costs of Isolation

Benefits
  • Per Active Object GC, without synchronisation!
  • Single Thread of Control abstraction inside each active object

Costs
  • Cloning is expensive
  • No sharing of mutable state

SLIDE 63

Data-race Freedom

  • Data-race freedom is achieved because there is only one thread of control per active object
  • Fields and passive objects are only accessed by one thread, under the control of the active object's concurrency control
  • Thus no data races
  • Of course, DRF does not imply determinism
  • The order of messages in queues is non-deterministic

SLIDE 64

(Data)Parallel-by-Default

SLIDE 65

(Data)Parallel-by-default

  • Most languages are sequential by default, adding constructs for parallelism on top.
  • Encore explores parallel-by-default by integrating parallel computation as a first-class entity.
  • Parallel computations are manipulated by parallel combinators.
  • Work in progress

SLIDE 66

Futures are a handle on one parallel computation.

Generalise to support many parallel computations.

SLIDE 67

Parallel Types and Combinators

  • Parallel combinators express parallelism within an active object (and beyond)
  • Typed, higher-order, and functional; inspired by Haskell, Orc, LINQ, and others
  • Recall: Fut t = a handle to just one parallel computation
  • Par t = handle to parallel computation producing multiple t-typed values
  • Analogy: Par t ≈ [Fut t]
  • Except that Par t is an abstract type (don't want to rely on orderings, etc.)

SLIDE 68

Parallel Combinators: Interaction with Active Objects I

  • By analogy, [o1.m1(), o2.m2(), o3.m3()] :: [Fut a] is a parallel value
  • In Encore, par(o1.m1(), o2.m2(), o3.m3()) :: Par a
  • each :: [a] -> Par a converts a list into a parallel value

SLIDE 69

Parallel Combinators: Interaction with Active Objects II

"Big variables": multi-association between classes suggests parallelism

(Class diagram: Bank →* Customer →* Account, where Account has a field balance : int.)

b.getCustomers() :: Par Customer

SLIDE 70

Parallel Combinators: Example

"Sum up the total value of all accounts in the bank with more than 9900 Euro"

class Main
  customers : Person*
  def main(): void
    let sum = this.customers . get_accounts . get_balance . (filter > 9900) . sum
    in print("Total: {}\n", sum)

(Pipeline: each, accounts, balance, filter, sum.)

SLIDE 71

Parallel Combinators: Example

"Sum up the total value of all accounts in the bank with more than 9900 Euro"

class Main
  customers : Person*
  def main(): void
    this.customers
      ~~> bindp get_accounts               -- flatten accounts
      ~~> pmap get_balance                 -- get balance per account
      ~~> filter ( \ x:int -> x > 9900 )   -- filter accounts
      ~~> sum                              -- reduce operation
      ~~> ( \ sum:int -> print("Total: {}\n", sum) )

(Pipeline: each, bindp, pmap, filter, sum.)

SLIDE 72

Parallel Combinators: Example

"Sum up the total value of all accounts in the bank with more than 9900 Euro"

class Main
  def main(): void
    let customers = get_customers()   -- get customers id
        par = each(customers)         -- List t -> Par t
    in {
      par = bindp(par, get_accounts);                 -- flatten accounts
      par = pmap(par, get_balance);                   -- get balance per account
      par = filter(par, \(x: int) -> { x > 9900 });   -- filter accounts
      print("Total: {}\n", sum(par));                 -- reduce operation
    }

(Pipeline: each, bindp, pmap, filter, sum.)

SLIDE 73

Parallel Combinators: Example

"Sum up the total value of all accounts in the bank with more than 9900 Euro"

(Diagram: each fans out into several parallel bindp, pmap, filter pipelines that feed a single sum; a question mark is attached to the diagram.)

SLIDE 74

Parallel Combinators (More Examples)

bindp :: Par a -> (a -> Par b) -> Par b
– generalises monadic bind = map, then flatten

otherwise :: Par a -> (() -> Par a) -> Par a
– if first parallel value is empty, return the value of the second argument

filter :: Par a -> (a -> Bool) -> Par a
– keeps values matching predicate

select :: Par a -> Fut (Maybe a)
– returns the first finished result, if there is one

selectAndKill :: Par a -> Maybe a
– returns the first finished result, if there is one, and kills all remaining

SLIDE 75

Parallel Combinators: From Parallel Types to Regular Values

Synchronisation

sync :: Par t -> [t]
– synchronises a parallel value, giving list of results

Reduction

sum :: Par Int -> Int
– performs parallel sum of the results of a parallel integer-valued computation

Many such functions exist.
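The combinators each, bindp, pmap, filter, sync and sum can be approximated in Python by treating Par t as a bag of futures. This is a rough sketch, not Encore's semantics: the `Par` class and the `BANK` data are hypothetical, each stage synchronises before the next one starts, and `bindp` takes a function returning a plain list rather than a Par:

```python
from concurrent.futures import ThreadPoolExecutor

class Par:
    """Hypothetical Par t: a bag of futures producing t-values."""

    def __init__(self, pool, futures):
        self._pool = pool
        self._futures = list(futures)

    @staticmethod
    def each(pool, items):
        """each :: [a] -> Par a"""
        return Par(pool, [pool.submit(lambda x=x: x) for x in items])

    def sync(self):
        """sync :: Par t -> [t]"""
        return [f.result() for f in self._futures]

    def pmap(self, fn):
        """Apply fn to every value, one task per value."""
        return Par(self._pool, [self._pool.submit(fn, v) for v in self.sync()])

    def bindp(self, fn):
        """Map, then flatten (fn returns a list in this sketch)."""
        nested = self.pmap(fn).sync()
        return Par.each(self._pool, [b for bs in nested for b in bs])

    def filter(self, pred):
        """Keep the values matching the predicate."""
        return Par.each(self._pool, [v for v in self.sync() if pred(v)])

    def sum(self):
        """sum :: Par Int -> Int"""
        return sum(self.sync())

# Hypothetical bank data: customer name -> accounts.
BANK = {
    "ann": [{"balance": 12000}, {"balance": 500}],
    "bob": [{"balance": 9900}, {"balance": 10000}],
    "cy": [],
}

def total_large_balances(threshold=9900):
    with ThreadPoolExecutor(max_workers=4) as pool:
        return (Par.each(pool, BANK.keys())
                   .bindp(lambda name: BANK[name])          # flatten accounts
                   .pmap(lambda acct: acct["balance"])      # balance per account
                   .filter(lambda bal: bal > threshold)     # filter accounts
                   .sum())                                  # reduce operation
```

The pipeline in `total_large_balances` follows the same each, bindp, pmap, filter, sum order as the bank example.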

SLIDE 76

Parallel Combinators: Challenges

  • Integration with OO fragment
– Capabilities handle race conditions: "if you have a reference, you can use it fully"
  • Optimisation
– Parallel semantics by default opens the door to many optimisations and scheduling strategies
  • Program Methodology
– Case studies shall reveal design patterns for using parallel combinators and active objects in unison

SLIDE 77

Unique-by-default

SLIDE 78

Alias Freedom is a Strong and Useful Property

  • Strong updates
– Change type of object (e.g., typestate, verification)
  • Optimisations
– Explode the object into registers, no need to synch with main memory
  • Reasoning
– Sequential reasoning, pre/postconditions, no need for taking locks
  • Ownership transfer
– E.g. enable object transfer through pointer swizzle

SLIDE 79

  • Mainstream OOPLs make sharing default
– Benefit: keeps things simple for the programmer (cf. Rust)
– Price: hard to establish (and maintain) actual uniqueness
  • Analysis of object-oriented code shows that:
– Most variables are never null
– Most objects are not shared across threads
– Most objects are not aliased on the heap
– However, most mainstream programming languages do not capture that

SLIDE 80

(Diagram: "Normal OOP" next to "Encore". Each side has a variable x : Foo; the Encore side is labelled Exclusive and Safe, the normal OOP side carries a question mark.)

SLIDE 81

(Diagram: "Normal OOP" next to "Encore", each with x : Foo; question mark on the normal OOP side.)

SLIDE 82

(Diagram: "Normal OOP" next to "Encore", each with x : Foo and y : Foo; the second reference lives in a separate thread or active object.)

SLIDE 83

(Diagram: "Normal OOP" with x : Foo next to "Encore" with x : Bar and y : Bar; the second reference lives in a separate thread or active object.)

SLIDE 84

(Diagram: "Normal OOP" next to "Encore", with the labels x : Foo, x : Baz, y : Frob, y : Foo and z : Quux; references live in separate threads or active objects.)

SLIDE 85

(Diagram: "Normal OOP" next to "Encore", each with x : Foo and y : Foo; the second reference lives in a separate thread or active object.)

SLIDE 86

(Slide labels: Strong pair, Weak pair, Two-faced Stream; Linear, ReadOnly.)

class Pair = Cell ⨂ Cell { … }

class Pair = Cell ⨁ Cell { … }

linear trait Put {
  def yield(Object o) : void …
}

readonly trait Take {
  def read() : Object …
  def next() : Take …
}

class TwoFacedStream = Put ⨂ Take { … }

SLIDE 87

(SPMCQ)

producer : Put
consumer1 : Take
consumer2 : Take
consumerN : Take

class TwoFacedStream = Put ⨂ Take { … }

linear trait Put {
  def yield(Object o) : void …
}

readonly trait Take {
  def read() : Object …
  def next() : Take …
}

SLIDE 88

(SPSCQ)

producer : Put
consumer1 : Take
consumer2 : Take
consumerN : Take

class TwoFacedStream = Put ⨂ Take { … }

linear trait Put {
  def yield(Object o) : void …
}

linear trait Take {
  def read() : Object …
  def next() : Take …
}

SLIDE 89

Not All Aliasing is Evil

(Diagram: a linked list with references head, tail and next.)

SLIDE 90

Not All Aliasing is Evil

(Diagram: a linked list with references head, tail and next.)

Possibility 1: next and tail reference different parts of the object

SLIDE 91

Not All Aliasing is Evil

(Diagram: a linked list with references head, tail and next; label: locked capability.)

Possibility 2: list is constructed from parts that may be freely aliased

SLIDE 92

Not All Aliasing is Evil

(Diagram: a linked list with references head : Hd, tail : Tl and next : Hd.)

Possibility 3: introduce aliasing in a tractable way

Link = Hd ⋁ Tl

Programmer may only dereference Hd or Tl, never both

if head != tail then tail ⋁ tail.next = new Link(…) else …

SLIDE 93

Unique-as-Default

  • Slightly more tricky programming
– Intentional sharing incurs syntactic cost, becomes clearly visible
– Need to work harder in some cases to maintain uniqueness
  • Sometimes, type system is not strong enough to track uniqueness
– Thread-locality gives many similar guarantees modulo transfer
– Use capabilities that protect against data races
– Will be revisited in the talk on ownership types soon

SLIDE 94

Locality-by-default

SLIDE 95

Encore Memory Management

(Diagram: a linked list with header LH and links L1 to L5, shown twice: as it looks in the programmer's mind, and in reality, where the links sit in memory in the order L1 L3 L4 L2 L5.)

SLIDE 96

Encore Memory Management

Projecting the list onto an array

(Diagram: LH L1 L3 L4 L2 L5.)

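SLIDE 97

Problem: Bad Memory Efficiency

(Diagram: elements e1, e2, e3, … stored one after another, each with the fields f1 f2 f3 f4:

  f1 f2 f3 f4 | f1 f2 f3 f4 | f1 f2 f3 f4 | …

compared with a split layout that groups the same field of all elements:

  f1* f1 f1 … | f2* f2 f2 … | f3* f3 f3 … | f4* f4 f4 …

A bracket marks the cache line size; * = aligned with cache line start.)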

SLIDE 98

(Diagram: elements e1, e2, e3, … each laid out as f1 f2 f3 f4; every e.f1 access loads a full cache line, of which part is used and ~40% is waste.)

def maybe_inc(e:element) : void
  if (e.f1) e.f2++

repeat i <- 1024
  maybe_inc(elements[i])

  • 1024 accesses
  • Assume e not in cache, cost of e.f1 ≈ 100 cycles
  • Access e.f2 will be a hit, cost ≈ 1 cycle
  • = 102400 units = 41370 units of waste
  • Each turn in the loop will stall! (modulo misalignment and prefetching)

SLIDE 99

(Diagram: split layout f1* f1 f1 … | f2* f2 f2 … | f3* f3 f3 … | f4* f4 f4 …; the cache lines loaded by the first e.f1 access and the first e.f2 access are used (100%), while the f3 and f4 lines are never loaded.)

def maybe_inc(e:element) : void
  if (e.f1) e.f2++

repeat i <- 1024
  maybe_inc(elements[i])

  • 1024 accesses
  • First access to e.f1 a miss ≈ 100 cycles
  • 2 subsequent items hits ≈ 2 cycles
  • As soon as we have more than ~0% waste
  • At most 1/3 elements will stall
  • 40% fewer memory accesses: faster program!
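The split layout can be written down as a transformation from an array of structures to one array per field. A minimal Python sketch for illustration only: `make_aos`, `split` and `maybe_inc_all` are made-up names, and Python will not exhibit the cache effects the slides describe; the point is just which data the loop touches:

```python
from array import array

def make_aos(n):
    """Array of structures: each element keeps f1..f4 together."""
    return [{"f1": i % 2, "f2": 0, "f3": 0, "f4": 0} for i in range(n)]

def split(elements):
    """Structure splitting: one contiguous array per field."""
    return {name: array("q", (e[name] for e in elements))
            for name in ("f1", "f2", "f3", "f4")}

def maybe_inc_all(soa):
    # Only the f1 and f2 arrays are touched; f3 and f4 are never loaded.
    f1, f2 = soa["f1"], soa["f2"]
    for i in range(len(f1)):
        if f1[i]:
            f2[i] += 1

def demo(n=1024):
    soa = split(make_aos(n))
    maybe_inc_all(soa)
    return sum(soa["f2"])
```

In a language with control over memory layout, the `f1` and `f2` arrays would each fill whole cache lines with useful data.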

SLIDE 100

Encore Memory Management

  • Locality-by-default
– Allocate objects building up large structures from the same memory pool
– Locality requires different placement strategies for different data structures (e.g., hierarchical for trees, linear for linked lists)
  • Structure splitting
– Especially good for performing many similar operations on part of a big structure (e.g., column-wise accesses, vectorisation)
– "Small updates" may cause more writes to disjoint locations = more invalidation, i.e., not a silver bullet
– "Maximal splitting" seems to work well in the general case, but grouping certain substructures may be an optimisation

SLIDE 101

Ordering Data in Pools

  • Linked/array-like structures are simple to organise in memory
  • No "best" organisation strategy: it depends on the data structure definition and use

For example, consider a binary tree

(Diagram: alternative memory orderings of a binary tree.)

Which one is best?

SLIDE 102

Data Representation (of a simple pair)

(Diagram: three layouts of a pair with components fst and snd: embed both, embed one, embed none.)

Externalise: Make it possible to change between these possibilities at use-site, without touching the "business logic" of the pair

SLIDE 103

Linked Pools

  • There is a deception in the linked list example: commonly, the list does not embed its elements, rather they are stored in the list by pointer only
  • If element objects are spread across more than one pool, little is accomplished
  • If element objects are mixed with link objects, less locality
  • Optimal case: element objects in a single pool (modulo splitting) and order in element pool is linked to the order in the link pool

SLIDE 104

Linked Pool Example

(Diagram: Pool 1, Pool 2 and Pool 3, labelled Links, Elements and Links; the pools hold L1 L3 L4 L2 L5 in the same order, connected by an ordering dependency.)

SLIDE 105

The Case for Object-Relative Addressing

(Diagram: a heap region is copied from A to B. Cells and their relative links: v1 +4, v2 +4, v3 +8, v4 +8, v5 −4, v6 null.)

Copy: All object-relative addresses on A's heap are valid when copied to B's heap. Hence, copying N links can be reduced to a "memcpy" of start–end addresses.

SLIDE 106

The Case for Object-Relative Addressing

(Diagram: a heap region is copied from A to B. Cells and their relative links: v1 +4, v2 +4, v3 +8, v4 +8, v5 −4, v6 null.)

Example win: Can fit 32 pointers in a single cache line as opposed to 8; can store many small subtrees in a single cache line in the tree hierarchy example
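Why relative links survive a block copy can be shown with a toy heap in Python, where a link stores the offset to the successor cell instead of its absolute index. This is an illustration only; `walk` and `copy_into` are made-up names and the cell layout is an assumption:

```python
def walk(heap, start):
    """Follow relative links from index `start` and collect the values.
    Each cell is [value, rel]; rel is the offset to the successor or None."""
    out, i = [], start
    while i is not None:
        value, rel = heap[i]
        out.append(value)
        i = None if rel is None else i + rel
    return out

def copy_into(dst_heap, src_cells):
    """Append the cells to another heap as one block; no link is rewritten."""
    start = len(dst_heap)
    dst_heap.extend([cell[:] for cell in src_cells])
    return start

def demo():
    # v1 links forward by 2, v2 links backward by 1, v3 ends the list.
    heap_a = [["v1", 2], ["v3", None], ["v2", -1]]
    heap_b = [["other", None] for _ in range(5)]   # B already holds data
    start = copy_into(heap_b, heap_a)
    return walk(heap_a, 0), walk(heap_b, start)
```

With absolute indices every link would have to be adjusted by `start` after the copy. As for the 32-versus-8 figure, one reading (an assumption, not stated on the slide) is a 64-byte cache line holding eight 8-byte absolute pointers or thirty-two 2-byte offsets.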

SLIDE 107

Exercises

SLIDE 108

Design Exercises

"Implement" the system described in the handout using ideas from Encore.

  • Which objects should be active? Which passive?
  • How is data distributed among the active objects?
  • What is the amount of data passed between active objects?
  • What are the dependencies?
  • What is the degree of parallelism? Locality?

SLIDE 109

Crowd Simulation

SLIDE 110

Parallel Object-Oriented Programming

  • The Encore programming language
– Make the design defaults give good properties "for free" (focus on parallelism)
  • Starting with active objects and futures as the vanilla model
  • "Secret sauces"
– Parallel combinators, fancy capability-based types, modular layout specification, …
  • A lot of what I have shown you is in some incomplete state of implementation
  • We are looking for collaborators at any level
  • We are also looking for users that can tell us what their pains are

Thanks for listening!