SLIDE 1

State of Multicore OCaml

KC Sivaramakrishnan

University of Cambridge OCaml Labs

SLIDE 2

Outline

  • Overview of the multicore OCaml project
  • Multicore OCaml runtime design
  • Future directions
SLIDE 3

Multicore OCaml

SLIDE 5

Multicore OCaml

  • Add native support for concurrency and (shared-memory) parallelism to OCaml

  • History

    ★ Jan 2014: Initiated by Stephen Dolan and Leo White
    ★ Sep 2014: Multicore OCaml design @ OCaml workshop
    ★ Jan 2015: KC joins the project at OCaml Labs
    ★ Sep 2015: Effect handlers @ OCaml workshop
    ★ Jan 2016: Native code backend for amd64 on Linux and OS X
    ★ Jun 2016: Multicore rebased to 4.02.2 from 4.00.0
    ★ Sep 2016: Reagents library, Multicore backend for Links @ OCaml workshop
    ★ Apr 2017: ARM64 backend

SLIDE 8

Multicore OCaml

  • History continued…

    ★ Jun 2017: Handlers for Concurrent System Programming @ TFP
    ★ Sep 2017: Memory model proposal @ OCaml workshop
    ★ Sep 2017: CPS translation for handlers @ FSCD
    ★ Apr 2018: Multicore rebased to 4.06.1 (will track releases going forward)
    ★ Jun 2018: Memory model @ PLDI

  • Looking forward…

    ★ Q3’18–Q4’18: Implement missing features, upstream prerequisites to trunk
    ★ Q1’19–Q2’19: Submit feature-based PRs to upstream

SLIDE 14

Components

[Diagram: the three components, Multicore Runtime + Domains, Effect Handlers, and Effect System]

  • Multicore Runtime
    ★ Multicore GC + Domains (creating and managing parallel threads)
  • Effect handlers
    ★ Fibers: runtime-system support for linear delimited continuations
  • Effect system
    ★ Track user-defined effects in the type system
    ★ Statically rule out the possibility of unhandled effects

(Diagram legend: the multicore runtime and effect handlers are the current implementation; the effect system is work-in-progress.)
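
As a taste of the Domains component, here is a minimal sketch of spawning and joining a domain. It uses the Domain API as it later shipped in OCaml 5 (Domain.spawn and Domain.join); the branch described in this talk exposed a similar but not identical interface.

    (* Minimal sketch: run a computation in parallel on another domain.
       Uses the OCaml 5 Domain API; the 2018 multicore branch differed
       in detail. *)
    let () =
      let d = Domain.spawn (fun () ->
        (* runs in parallel with the main domain *)
        List.fold_left (+) 0 [1; 2; 3; 4]) in
      let here = 10 * 10 in
      Printf.printf "spawned: %d, main: %d\n" (Domain.join d) here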

SLIDE 20

Multicore GC

[Diagram: Domains 0–3, each with a private minor heap, all sharing a single major heap]

  • Independent per-domain minor collection
    ★ Read barrier for mutable fields + promotion to the major heap
  • A new major allocator based on StreamFlow [1]: lock-free multithreaded allocation
  • A new major GC based on VCGC [2], adapted to fibers, ephemerons, and finalisers

[1] Scott Schneider, Christos D. Antonopoulos, and Dimitrios S. Nikolopoulos. "Scalable, locality-conscious multithreaded memory allocation." ISMM 2006.
[2] Lorenz Huelsbergen and Phil Winterbottom. "Very concurrent mark-&-sweep garbage collection without fine-grain synchronization." ISMM 1998.

SLIDE 21

Major GC

  • Concurrent, incremental, mark-and-sweep
    ★ Uses a deletion (Yuasa) barrier
    ★ Upper bound on marking work per cycle (not fixed, due to weak refs)

  • Three phases:
    ★ Sweep-and-mark-main
    ★ Mark-final
    ★ Sweep-ephe

SLIDE 30

Major GC: Sweep-and-mark-main

[Timeline diagram: Domains 0 and 1 each interleave Mark Roots, Sweep, Mark, and Ephe Mark slices with Mutator slices, meeting at a global barrier]

  • Domains begin by marking roots
  • Domains alternate between sweeping their own garbage and running the mutator
  • Domains alternate between marking objects and running the mutator
  • Domains alternate between marking ephemerons, marking other objects, and running the mutator
  • Global barrier to switch to the next phase
    ★ Reading weak keys may make unreachable objects reachable
    ★ Verify that the phase-termination conditions hold

SLIDE 34

Major GC: mark-final

[Timeline diagram: each domain runs Update final first, then interleaves Mark, Ephe Mark, and Mutator slices up to a global barrier]

  • Domains update the Gc.finalise finalisers (these take values) and mark those values
    ★ Preserves the per-domain order of evaluation of finalisers, cf. trunk
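
For reference, the two finaliser flavours processed in this phase and the next can be attached as in the sketch below, using the stock Gc API. Gc.finalise receives the finalised value, so the GC must mark it and keep it alive until the finaliser runs; Gc.finalise_last (handled in sweep-ephe) receives nothing.

    (* Sketch: attaching both finaliser flavours to a heap value. *)
    let () =
      let v = ref 42 in
      Gc.finalise (fun r -> Printf.printf "finalising %d\n" !r) v;
      Gc.finalise_last (fun () -> print_endline "storage reclaimed") v;
      ignore (Sys.opaque_identity !v);
      Gc.full_major ()  (* may run the finalisers once v is unreachable *)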

SLIDE 39

Major GC: sweep-ephe

[Timeline diagram: each domain runs Update final last, then interleaves Ephe Sweep and Mutator slices up to a global barrier]

  • Domains prepare the Gc.finalise_last finaliser list (these finalisers do not take values)
    ★ Preserves the per-domain order of evaluation of finalisers, cf. trunk

  • Swap the meaning of the GC colour bits (sketched below):
    ★ MARKED → UNMARKED
    ★ UNMARKED → GARBAGE
    ★ GARBAGE → MARKED

  • Major GC algorithm verified in SPIN model checker
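
The swap reinterprets the header bits globally instead of re-colouring every object. A conceptual sketch of the rotation, with names of our choosing rather than the runtime's C code:

    (* Conceptual sketch of the colour-bit rotation between major cycles. *)
    type colour = Marked | Unmarked | Garbage

    let meaning_next_cycle = function
      | Marked -> Unmarked    (* survivors must be proven live again *)
      | Unmarked -> Garbage   (* unreached objects become sweepable *)
      | Garbage -> Marked     (* already-swept slots read as live *)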
slide-40
SLIDE 40

Memory Model

slide-41
SLIDE 41

Memory Model

  • Goal: Balance comprehensibility and performance
slide-42
SLIDE 42

Memory Model

  • Goal: Balance comprehensibility and performance
  • Generalise

★ SC-DRF property

Data-race-free programs have sequential semantics

★ to local DRF

Data-race-free parts of programs have sequential semantics

slide-43
SLIDE 43

Memory Model

  • Goal: Balance comprehensibility and performance
  • Generalise the SC-DRF property to local DRF (LDRF)
    ★ SC-DRF: data-race-free programs have sequential semantics
    ★ Local DRF: data-race-free parts of programs have sequential semantics

  • Bounds data races in space and time
    ★ Data races on one location do not affect the sequential semantics of another
    ★ Data races in the past or the future do not affect the sequential semantics of non-racy accesses
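
To illustrate the locality, here is a sketch using the Atomic and Domain modules as they later shipped in OCaml 5 (both postdate this talk). The race on the non-atomic ref is bounded in space: it cannot disturb sequential reasoning about the race-free atomic counter.

    (* Sketch of local DRF: [racy] is raced on by two domains,
       [counter] is race-free and keeps its sequential semantics. *)
    let () =
      let racy = ref 0 in
      let counter = Atomic.make 0 in
      let d = Domain.spawn (fun () ->
        racy := 1;               (* races with the write below *)
        Atomic.incr counter) in
      racy := 2;                 (* data race, confined to [racy] *)
      Atomic.incr counter;
      Domain.join d;
      (* Whatever the race produced, [counter] is exactly 2. *)
      assert (Atomic.get counter = 2)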

SLIDE 47

Memory Model

  • We have developed a memory model that has LDRF

    ★ Atomic and non-atomic locations (no relaxed operations yet)
    ★ Proven correct (on paper) compilation to x86 and ARMv8

  • Is it practical?

    ★ SC has LDRF, and SRA is conjectured to have LDRF, but neither is practical due to the performance impact

  • Must preserve load-store ordering
    ★ Most compiler optimisations remain valid (CSE, LICM); no redundant store elimination across a load
    ★ Free on x86; low overhead on ARM (0.6%) and POWER (2.9%)
SLIDE 52

Runtime support for Effect handlers

  • Linear delimited continuations
    ★ Linearity enforced by the runtime
    ★ An exception is raised when a continuation is resumed more than once
    ★ A finaliser discontinues continuations that are never resumed
  • Fibers: heap-managed stack segments
    ★ Require stack-overflow checks at function entry
    ★ Static analysis removes the checks in small leaf functions
  • C calls need to be performed on the C stack
    ★ < 1% performance slowdown on average for this feature
    ★ DWARF magic allows full backtraces across nested handler calls, C calls, and callbacks
  • WIP: support for capturing continuations that include C frames, cf. “Threads Yield Continuations”
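
To make the handler and continuation machinery concrete, here is a minimal sketch in the Effect API that later shipped in OCaml 5; the 2018 multicore branch used dedicated effect syntax instead. The continuation k is one-shot: a second continue k would raise Effect.Continuation_already_resumed, which is the linearity check described above.

    (* Sketch: a one-shot "Ask" effect and a deep handler for it. *)
    open Effect
    open Effect.Deep

    type _ t += Ask : int t

    let comp () = perform Ask + perform Ask

    let () =
      let r =
        try_with comp ()
          { effc = fun (type b) (eff : b t) ->
              match eff with
              | Ask ->
                Some (fun (k : (b, _) continuation) ->
                  (* [continue k] consumes the one-shot continuation *)
                  continue k 21)
              | _ -> None }
      in
      Printf.printf "%d\n" r  (* prints 42 *)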

SLIDE 53

Status

  • Major GC and fiber implementations are stable modulo bugs

★ TODO: Effect System

  • Laundry list of minor features

★ https://github.com/ocamllabs/ocaml-multicore/projects/3

  • We need

    ★ Benchmarks
    ★ Benchmarking tools and infrastructure
    ★ Performance tuning

SLIDE 57

Future Directions: Memory Model

  • The memory model only supports atomic and non-atomic locations
    ★ Extend the memory model with weaker atomics and “new ref” while preserving the LDRF theorem

  • Avoid becoming C++: multiple weak atomics with subtle interactions
    ★ Could we expose restricted APIs to the programmer?

  • Verify multicore OCaml programs
    ★ Explore (semi-)automated SMT-aided verification
    ★ Challenge problem: verify the k-CAS at the heart of the Reagents library
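
For context, k-CAS atomically updates k locations if each still holds its expected value. The sketch below only states that specification (names are ours; this is not the lock-free Reagents implementation, which uses descriptors and Atomic.compare_and_set); the verification challenge is to prove that the real algorithm behaves as if this body ran in a single atomic step.

    (* Specification-level sketch of k-CAS; NOT atomic as written. *)
    type 'a cas = { loc : 'a Atomic.t; expected : 'a; desired : 'a }

    let kcas_spec (ops : 'a cas list) : bool =
      if List.for_all (fun c -> Atomic.get c.loc == c.expected) ops
      then (List.iter (fun c -> Atomic.set c.loc c.desired) ops; true)
      else false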

SLIDE 61

Future Directions: Multicore MirageOS

  • MirageOS rewrite to take advantage of typed effect handlers and multicore parallelism
    ★ Typed effects for better error handling and concurrency

  • Better concurrency model over Xen block devices
    ★ Extricate ourselves from dependence on the POSIX API
    ★ Discriminate between concurrency levels (CPU, application, I/O) in the scheduler
    ★ Failure and back pressure as first-class operations

  • Multicore-capable Irmin, a branch-consistent database library
SLIDE 62

Future Directions: Heterogeneous Systems

  • Programming heterogeneous, non-Von Neumann architectures
    ★ How do we capture the computational model in a richer type system?
    ★ How do we compile efficiently to such a system?