Retrofitting Parallelism onto OCaml


slide-1
SLIDE 1

Retrofitting Parallelism onto OCaml

KC Sivaramakrishnan, Stephen Dolan, Leo White, Sadiq Jaffer, Tom Kelly, Anmol Sahoo, Sudha Parimala, Atul Dhiman, Anil Madhavapeddy

OCaml Labs

slide-3
SLIDE 3

The Astrée Static Analyzer

Industry Projects

No multicore support!

slide-6
SLIDE 6

Multicore OCaml

  • Adds native support for concurrency and shared-memory parallelism to OCaml

  • Focus of this work is parallelism

✦ Building a multicore GC for OCaml

  • Key parallel GC design principle

✦ Backwards compatibility before parallel scalability
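
As a rough illustration of the shared-memory parallelism this work targets, the sketch below sums an array on two domains. It assumes the `Domain` module as shipped in the OCaml 5 standard library (`Domain.spawn`, `Domain.join`); the helper names `sum_range` and `parallel_sum` are invented for this example and do not come from the talk.

```ocaml
(* Sum the slice [lo, hi) of an array. *)
let sum_range (a : int array) lo hi =
  let s = ref 0 in
  for i = lo to hi - 1 do
    s := !s + a.(i)
  done;
  !s

(* Split the work in two: one half runs on a new domain, the other on the
   current one. Both domains read the same array in shared memory. *)
let parallel_sum (a : int array) =
  let n = Array.length a in
  let mid = n / 2 in
  let d = Domain.spawn (fun () -> sum_range a 0 mid) in
  let right = sum_range a mid n in
  Domain.join d + right

let () =
  let a = Array.init 1_000 (fun i -> i + 1) in
  Printf.printf "sum = %d\n" (parallel_sum a)
```

The array is only read here, so no synchronisation beyond `Domain.join` is needed; the sequential helper `sum_range` is unchanged ordinary OCaml.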

slide-9
SLIDE 9

Challenges

  • Millions of lines of legacy code

✦ Weak references, ephemerons, lazy values, finalisers
✦ Low-level C API that bakes in GC invariants
✦ Cost of refactoring sequential code itself is prohibitive

  • Type safety

✦ Dolan et al., “Bounding Data Races in Space and Time”, PLDI’18

✦ Strong guarantees (including type safety) under data races

  • Low-latency and predictable performance

✦ Thanks to the GC design

slide-17
SLIDE 17

Stock OCaml GC

  • A generational, non-moving, incremental, mark-and-sweep GC

Minor Heap

  • Small (2 MB default)
  • Bump pointer allocation
  • Survivors copied to major heap

Major Heap

Incremental and non-moving

[Figure: timeline of one major cycle running alongside the mutator: start of major cycle, Idle, Mark Roots (mark roots), Mark (mark main), Sweep (sweep), end of major cycle]

  • Fast allocations, no read barriers
  • Max GC latency < 10 ms, 99th percentile latency < 1 ms
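
The "2 MB default" figure for the minor heap can be checked with a little arithmetic: the OCaml manual documents the default minor heap size as 256k words, and a word is 8 bytes on a 64-bit machine. The sketch below does that conversion and reads the setting in effect through the standard `Gc` module. `bytes_of_words` is a helper name invented here.

```ocaml
(* Convert a size in words to bytes for a given word size in bits. *)
let bytes_of_words ~word_size words = words * (word_size / 8)

let () =
  (* Documented default: 256k words, i.e. 262_144 words. *)
  let default_words = 256 * 1024 in
  Printf.printf "default minor heap on 64-bit: %d bytes\n"
    (bytes_of_words ~word_size:64 default_words);
  (* The value actually in effect for this process; OCAMLRUNPARAM may
     override the default. *)
  let current = (Gc.get ()).Gc.minor_heap_size in
  Printf.printf "current minor heap: %d words = %d bytes\n"
    current (bytes_of_words ~word_size:Sys.word_size current)
```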
slide-20
SLIDE 20

Requirements

  • 1. Feature backwards compatibility

✦ Serial programs do not break on parallel runtime
✦ No separate serial and parallel modes

  • 2. Performance backwards compatibility

✦ Serial programs behave similarly on parallel runtime in terms of running time, GC pause time and memory usage

  • 3. Parallel responsiveness and scalability

✦ Parallel programs remain responsive
✦ Parallel programs scale with additional cores
slide-24
SLIDE 24

Multicore OCaml: Major GC

  • Multicore-aware allocator

✦ Based on Streamflow [Schneider et al. 2006]
✦ Thread-local, size-segmented free lists for small objects + malloc for large allocations
✦ Sequential performance on par with OCaml’s allocators

  • A mostly-concurrent, non-moving, mark-and-sweep collector

✦ Based on VCGC [Huelsbergen and Winterbottom 1998]

[Figure: major cycle timelines for Domain 0 and Domain 1, each running Mark Roots, Mark and Sweep between start of major cycle and end of major cycle; mark and sweep phases may overlap]
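
To make "size-segmented free lists for small objects + malloc for large allocations" concrete, here is a toy model of the idea. It is not Streamflow and not the Multicore OCaml allocator: blocks are plain integer "addresses", the threshold `max_small` is made up, and all names (`allocator`, `alloc`, `free`, `large_allocs`) are hypothetical. In the real design each domain would own one such structure, which is what makes the small-object path thread-local.

```ocaml
(* Largest size, in words, served from the free lists (made-up value). *)
let max_small = 4

type allocator = {
  free_lists : int list array; (* free_lists.(sz): free blocks of sz words *)
  mutable next_addr : int;     (* bump pointer used to carve fresh blocks *)
  mutable large_allocs : int;  (* requests routed to the large-object path *)
}

let create () =
  { free_lists = Array.make (max_small + 1) []; next_addr = 0; large_allocs = 0 }

(* Carve a fresh block of [words] words out of unused space. *)
let fresh a words =
  let addr = a.next_addr in
  a.next_addr <- addr + words;
  addr

let alloc a words =
  if words > max_small then begin
    (* Stand-in for the malloc path used for large allocations. *)
    a.large_allocs <- a.large_allocs + 1;
    fresh a words
  end else
    match a.free_lists.(words) with
    | addr :: rest -> a.free_lists.(words) <- rest; addr (* reuse a freed block *)
    | [] -> fresh a words

(* Return a small block to the free list of its size class. *)
let free a addr words =
  if words <= max_small then
    a.free_lists.(words) <- addr :: a.free_lists.(words)
```

Because every list holds blocks of a single size, allocation and deallocation of small objects are constant-time list operations with no searching.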

slide-25
SLIDE 25

Multicore OCaml: Major GC

slide-29
SLIDE 29

Multicore OCaml: Major GC

  • Extend support to weak references, ephemerons, (2 different kinds of) finalisers, fibers, lazy values
  • Ephemerons are tricky in a concurrent multicore GC

✦ A generalisation of weak references
✦ Introduce conjunction in the reachability property
✦ Requires multiple rounds of ephemeron marking
✦ Cycle-delimited handshaking without global barrier

  • A barrier each for the two kinds of finalisers

✦ 3 barriers / cycle worst case

  • Verified in the SPIN model checker
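
For readers unfamiliar with the features listed above, the sketch below shows a weak reference and a one-key ephemeron from user code. The "conjunction" is that the ephemeron's data stays alive only while both the ephemeron and its key are reachable. The sketch assumes the `Ephemeron.K1.make` and `Ephemeron.K1.query` functions of recent OCaml standard libraries (4.14 and 5.x); `demo` is a name invented here. It only checks the case where the key is still alive, since observing collection reliably from user code is not guaranteed.

```ocaml
let demo () =
  (* A weak reference: the array [w] does not keep [v] alive on its own. *)
  let v = ref 42 in
  let w = Weak.create 1 in
  Weak.set w 0 (Some v);

  (* A one-key ephemeron: [data] is retained only while both the
     ephemeron [e] and [key] are reachable. *)
  let key = ref "key" in
  let data = ref "data" in
  let e = Ephemeron.K1.make key data in

  Gc.full_major ();

  (* [v] and [key] are still used below, so both lookups succeed. *)
  let alive = Weak.check w 0 in
  let found = Option.map (fun d -> !d) (Ephemeron.K1.query e key) in
  ignore (Sys.opaque_identity (v, key));
  (alive, found)

let () =
  let alive, found = demo () in
  Printf.printf "weak slot full: %b, ephemeron data: %s\n" alive
    (Option.value found ~default:"<collected>")
```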
slide-32
SLIDE 32

Concurrent Minor GC

  • Based on [Doligez and Leroy 1993] but lazier as in [Marlow and Peyton Jones 2011] collector for GHC

[Figure: Domain 0 to Domain 3, each with its own Minor Heap, sharing a single Major Heap]

  • Each domain can independently collect its minor heap
  • Major to minor pointers allowed

✦ Prevents early promotion & mirrors sequential behaviour
✦ Read barrier required for mutable field + promotion

slide-35
SLIDE 35

Read Barriers

  • Stock OCaml does not have read barriers

✦ Read barriers need to be efficient for performance backwards compatibility

  • 3 instructions in x86: VMM + bit-twiddling tricks

✦ Proof of correctness available in the paper
✦ Minimal performance impact on sequential code

  • Unfortunately, read barriers break the C API (feature backwards compatibility)

slide-39
SLIDE 39

Read Barriers

  • Service promotion requests on read faults to avoid deadlock

✦ Mutable reads are GC safe points!

  • C API written with explicit knowledge of when GC may happen

✦ Need to manually refactor tricky code

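[Figure: a major heap and two minor heaps, one each for Domain 0 and Domain 1, holding the objects x, y, a and b. The domains read !y and !x, which raises the promotion requests promote (!y) and promote (!x)]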

slide-45
SLIDE 45

Parallel Minor GC

  • Stop-the-world parallel minor collection

✦ Similar to GHC’s minor collection

  • No need for read barriers!
  • Quickly bring all the domains to a barrier

✦ Insert poll points in code for timely inter-domain interrupt handling [Feeley 1993]

[Figure: timelines for Dom 0 and Dom 1 under both designs. ConcMinor: Mutator, Minor GC and Major slice segments between Start major and End major. ParMinor: Mutator and Major slice segments, with Start minor and End minor marked between Start major and End major; slop space filled with major slices]
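
The role of poll points can be pictured with a user-level analogy: a loop that would otherwise never yield checks a shared flag on every iteration, so another domain can bring it to a stop promptly. This is only an analogy written with the OCaml 5 `Atomic` and `Domain` modules. It is not the runtime's mechanism, where the checks are inserted by the compiler. The names `stop_requested`, `worker` and `run` are invented here.

```ocaml
(* Set by one domain to ask the other to stop at its next poll point. *)
let stop_requested = Atomic.make false

(* A long-running loop that does not otherwise yield. The explicit check
   of [stop_requested] plays the role of a poll point. *)
let worker () =
  let iterations = ref 0 in
  while not (Atomic.get stop_requested) do
    incr iterations
  done;
  !iterations

let run () =
  Atomic.set stop_requested false;
  let d = Domain.spawn worker in
  (* Request the stop; the worker notices it at its next poll. *)
  Atomic.set stop_requested true;
  Domain.join d

let () =
  Printf.printf "worker stopped after %d iterations\n" (run ())
```

Without the check inside the loop, the request would go unnoticed and `Domain.join` would never return, which is the situation poll points are meant to rule out.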

slide-48
SLIDE 48

Evaluation

  • 2 x 14-core Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz

✦ 24 cores isolated for performance evaluation

  • Sequential Throughput — compared to stock OCaml

✦ ConcMinor 4.9% slower and ParMinor 3.5% slower
✦ ConcMinor 54% lower peak memory and ParMinor 61% lower peak memory

  • Sequential GC pause times on par with stock OCaml
slide-51
SLIDE 51

Parallel Scalability

ConcMinor suffers due to read faults

Unbalanced allocation leads to inopportune minor GCs in ParMinor

slide-54
SLIDE 54

ParMinor vs ConcMinor

  • Parallel GC latency roughly similar between ParMinor and ConcMinor

  • ParMinor wins over ConcMinor

✦ Does not break the C API
✦ Performs similarly to the ConcMinor on 24 cores

  • OCaml 5.00 will have multicore support and use ParMinor

✦ May revisit ConcMinor later for manycore future

slide-55
SLIDE 55

Thanks!

  • Multicore OCaml

✦ https://github.com/ocaml-multicore/ocaml-multicore

  • Sandmark — benchmark suite for (Multicore) OCaml

✦ https://github.com/ocaml-bench/sandmark/

  • SPIN models

✦ https://github.com/ocaml-multicore/multicore-ocaml-verify

  • Parallel Programming with Multicore OCaml

✦ https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml