SLIDE 1

RCUArray: An RCU-like Parallel-Safe Distributed Resizable Array

By Louis Jenkins

SLIDES 2–9

The Problem: Parallel-Safe Resizing

  • Not inherently thread-safe to access memory while it is being resized
    • Memory has to be ‘moved’ from the smaller storage into larger storage
  • Concurrent loads and stores can result in undefined behavior
    • Stores issued after memory is moved can be lost entirely
    • Loads and stores issued after the smaller storage is reclaimed can produce undefined behavior
  • Why not just synchronize access?
    • Not scalable
  • What do we need?
    • 1. Allow concurrent access to both the smaller and the larger storage
    • 2. Ensure safe memory management of the smaller storage
    • 3. Ensure that stores to the old memory are visible in the larger storage
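The lost-store hazard described above can be reproduced in a few lines. This is an illustrative sketch in Python (the original work is in Chapel); the variable names are mine, not the slides':

```python
# Sketch of the lost-store hazard during an unsynchronized resize.
old = [0] * 4                # smaller storage
new = old[:] + [0] * 4       # "move" the data into larger storage

old[2] = 42                  # a store lands in the old storage after the copy...
current = new                # ...and is lost once the new storage is published

assert current[2] == 0       # the store to index 2 vanished
assert old[2] == 42          # it only exists in the reclaimed-to-be storage
```

The three requirements above exist precisely to close this window: both storages must remain accessible, the old one must not be reclaimed early, and late stores must become visible in the new storage.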

SLIDES 10–17

Read-Copy-Update (RCU)

  • Synchronization strategy that favors the performance of readers over writers
    • Read the current snapshot 𝑡
    • Copy 𝑡 to create 𝑡′
    • Update applied to 𝑡′; 𝑡′ becomes the new current snapshot
  • Not applicable in all situations
    • Must be safe to access at least two different snapshots of the same data
  • Read-Copy-Update vs. Reader-Writer Locks
    • RCU: readers concurrent with readers; writers mutually exclusive with writers; readers concurrent with writers
    • Reader-Writer Locks: readers concurrent with readers; writers mutually exclusive with writers; readers mutually exclusive with writers
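The read–copy–update cycle above can be sketched as a minimal RCU-style cell. Python stands in for Chapel here, and the class name and methods are illustrative, not from the slides; CPython's atomic reference assignment plays the role of the snapshot publication:

```python
import threading

class RCURef:
    """Minimal RCU-style cell: readers grab the current snapshot pointer;
    writers copy it, update the copy, then publish the new snapshot."""

    def __init__(self, snapshot):
        self._snapshot = snapshot             # reference assignment is atomic in CPython
        self._writer_lock = threading.Lock()  # writers mutually exclusive with writers

    def read(self):
        return self._snapshot                 # readers never block, even during updates

    def update(self, fn):
        with self._writer_lock:
            t = self._snapshot                # read the current snapshot t
            t_prime = fn(tuple(t))            # copy t, apply the update to t'
            self._snapshot = t_prime          # t' becomes the new current snapshot

cell = RCURef(("c1",))
cell.update(lambda t: t + ("c2",))
# read() now returns ("c1", "c2"); readers holding the old ("c1",) are unaffected
```

Note the RCU-specific property: a reader that grabbed the old snapshot before the update keeps a valid, immutable view, which is exactly the "two snapshots live at once" requirement from the last bullet.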
slide-18
SLIDE 18

Distributed RCU

  • Privatization and Snapshots
  • Each node in the cluster has its own local snapshot

Locale #0 Locale #1 Locale #2 Locale #3

𝑇 = 𝑐1 𝑇 = 𝑐1 𝑇 = 𝑐1 𝑇 = 𝑐1

P P P P

slide-19
SLIDE 19

Distributed RCU

  • Privatization and Snapshots
  • Each node in the cluster has its own local snapshot
  • All local snapshots point to the same block

Locale #0 Locale #1 Locale #2 Locale #3

𝑇 = 𝑐1 𝑇 = 𝑐1 𝑇 = 𝑐1 𝑇 = 𝑐1

P P P P 𝑐1

slide-20
SLIDE 20

Distributed RCU

  • Privatization and Snapshots
  • Each node in the cluster has its own local snapshot
  • All local snapshots point to the same block
  • Reader Concurrency
  • Readers will read from local snapshot only
  • All readers regardless of node will see same block
  • All stores to 𝑐1 are seen by any snapshot or node

Locale #0 Locale #1 Locale #2 Locale #3

𝑇 = 𝑐1 𝑇 = 𝑐1 𝑇 = 𝑐1 𝑇 = 𝑐1

P P P P 𝑐1 Reader Reader Reader Reader

slide-21
SLIDE 21

Distributed RCU

  • Privatization and Snapshots
  • Each node in the cluster has its own local snapshot
  • All local snapshots point to the same block
  • Reader Concurrency
  • Readers will read from local snapshot only
  • All readers regardless of node will see same block
  • All stores to 𝑐1 are seen by any snapshot or node
  • Writer Mutual Exclusion
  • Use a distributed lock

Locale #0 Locale #1 Locale #2 Locale #3

𝑇 = 𝑐1 𝑇 = 𝑐1 𝑇 = 𝑐1 𝑇 = 𝑐1

P P P P 𝑐1 Reader Reader Reader Reader

slide-22
SLIDE 22

Distributed RCU

  • Privatization and Snapshots
  • Each node in the cluster has its own local snapshot
  • All local snapshots point to the same block
  • Reader Concurrency
  • Readers will read from local snapshot only
  • All readers regardless of node will see same block
  • All stores to 𝑐1 are seen by any snapshot or node
  • Writer Mutual Exclusion
  • Use a distributed lock
  • Perform each update local to each node

Locale #0 Locale #1 Locale #2 Locale #3 𝑇′ = 𝑐1, 𝑐2 𝑇′ = 𝑐1, 𝑐2 𝑇′ = 𝑐1, 𝑐2 𝑇′ = 𝑐1, 𝑐2 P P P P 𝑐1 Reader Reader Reader Reader 𝑐2

slide-23
SLIDE 23

Distributed RCU

  • Privatization and Snapshots
  • Each node in the cluster has its own local snapshot
  • All local snapshots point to the same block
  • Reader Concurrency
  • Readers will read from local snapshot only
  • All readers regardless of node will see same block
  • All stores to 𝑐1 are seen by any snapshot or node
  • Writer Mutual Exclusion
  • Use a distributed lock
  • Perform each update local to each node
  • Results
  • Fast and parallel-safe loads/stores across multiple nodes
  • Allow for loads and stores to be immediately visible
  • 40x faster resizing than naïve Block Distribution at 32-nodes

Locale #0 Locale #1 Locale #2 Locale #3 𝑇′ = 𝑐1, 𝑐2 𝑇′ = 𝑐1, 𝑐2 𝑇′ = 𝑐1, 𝑐2 𝑇′ = 𝑐1, 𝑐2 P P P P 𝑐1 Reader Reader Reader Reader 𝑐2
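The privatization scheme above can be sketched with plain Python lists standing in for locales and blocks. All names here are illustrative, and a `threading.Lock` stands in for the distributed lock; the key property is that every per-locale snapshot references the same underlying blocks, so stores are visible everywhere:

```python
import threading

NUM_LOCALES = 4
BLOCK_SIZE = 1024

c1 = [0] * BLOCK_SIZE                              # one shared block
snapshots = [[c1] for _ in range(NUM_LOCALES)]     # each locale's local snapshot T = [c1]
cluster_lock = threading.Lock()                    # stands in for the distributed writer lock

def read(locale, idx):
    t = snapshots[locale]                          # readers touch only their local snapshot
    block = t[idx // BLOCK_SIZE]
    return block[idx % BLOCK_SIZE]

def resize_append(new_block):
    with cluster_lock:                             # writers are mutually exclusive
        for loc in range(NUM_LOCALES):             # perform the update locally on each node
            snapshots[loc] = snapshots[loc] + [new_block]  # publish t' per locale

c2 = [0] * BLOCK_SIZE
resize_append(c2)
c1[5] = 99       # a store into c1 is visible through every locale's snapshot
```

Because the blocks themselves are shared and only the snapshot (the directory of blocks) is copied per locale, a store to `c1` needs no propagation step, matching the "stores to 𝑐1 are visible from any snapshot on any node" bullet.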

SLIDE 24

RCUArray – Resizing Example

Set of readers 𝑆 begin using snapshot 𝑡

SLIDE 25

RCUArray – Resizing

Writer acquires the Cluster Lock

SLIDE 26

RCUArray – Resizing

Writer clones 𝑡 to create 𝑡′

SLIDE 27

RCUArray – Resizing

Writer appends block 𝑐2 to 𝑡′

SLIDE 28

RCUArray – Resizing

Writer updates the current snapshot to 𝑡′

SLIDE 29

RCUArray – Resizing

Set of readers 𝑆′ begin accessing 𝑡′

SLIDE 30

RCUArray – Resizing

Readers 𝑆 finish using 𝑡

SLIDE 31

RCUArray – Resizing

Writer reclaims 𝑡

SLIDE 32

RCUArray – Resizing

Writer releases the Cluster Lock
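The nine-step resize sequence in slides 24–32 can be condensed into one sketch. This is an illustrative Python rendering, not the Chapel implementation: a reader count plus a condition variable stands in for detecting that the readers 𝑆 of the old snapshot have finished:

```python
import threading

class RCUArraySketch:
    """Sketch of the resize protocol: clone snapshot, append block,
    publish, wait out old readers, reclaim, unlock."""

    def __init__(self):
        self.snapshot = (["c1-data"],)       # current snapshot t -> block c1
        self.readers = 0                     # readers currently using a snapshot
        self.cv = threading.Condition()
        self.cluster_lock = threading.Lock()

    def read_all(self):
        with self.cv:
            self.readers += 1                # readers S begin using snapshot t
        t = self.snapshot
        out = [item for block in t for item in block]
        with self.cv:
            self.readers -= 1                # readers S finish using t
            self.cv.notify_all()
        return out

    def append_block(self, block):
        with self.cluster_lock:              # writer acquires the Cluster Lock
            t = self.snapshot
            t_prime = t + (block,)           # clone t and append block c2 -> t'
            self.snapshot = t_prime          # t' becomes the current snapshot
            with self.cv:
                while self.readers:          # wait until readers of t are done...
                    self.cv.wait()
        # ...then t can be reclaimed (here, garbage-collected) and the lock released

arr = RCUArraySketch()
arr.append_block(["c2-data"])
```

New readers arriving after the publish step see 𝑡′ immediately; the writer only has to wait for the bounded set of readers still inside 𝑡 before reclaiming it, which is why readers never block.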

SLIDE 33

Network Atomics vs. Remote-Execution Atomics

  • In Chapel, pointers to potentially remote memory are widened to 128 bits
    • 64-bit address, 32-bit locale id, 32-bit sub-locale id (NUMA)
  • Cray’s Aries NIC only supports 64-bit network atomic operations
  • Atomics via remote execution prove to be significantly slower than network atomics
  • Distributed wait-free algorithms can scale with network atomics
    • Must have a low constant bound on inter-node communication

Network atomics vs. remote execution: 26x faster (32 nodes); 20x faster (32 nodes)
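The motivation for compact descriptors follows from the NIC constraint above: a 128-bit wide pointer cannot be the target of a 64-bit network atomic, but a packed descriptor can. A sketch of such packing, with field widths that are my own illustrative choice rather than RCUArray's actual layout:

```python
# Pack a (locale id, block index) descriptor into 64 bits so it fits a
# 64-bit network atomic word. Field widths here are illustrative assumptions.
LOCALE_BITS, INDEX_BITS = 16, 48

def pack(locale, index):
    assert locale < (1 << LOCALE_BITS) and index < (1 << INDEX_BITS)
    return (locale << INDEX_BITS) | index

def unpack(desc):
    return desc >> INDEX_BITS, desc & ((1 << INDEX_BITS) - 1)

desc = pack(3, 12345)
assert desc < (1 << 64)              # fits in a single 64-bit atomic word
assert unpack(desc) == (3, 12345)    # round-trips losslessly
```

Trading a direct address for an index also requires a translation table on each node, which is exactly the role the RCUArray's blocks play in the next slide.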

SLIDE 34

RCUArray as a Dynamic Heap

  • Replacing Wide Pointers
    • Blocks have locality information
    • 64 bits vs. 128 bits
    • Network atomics
  • Recycling Memory
    • Each node recycles indices to local blocks
  • Dynamic Heap
    • Parallel-safe and fast resizing
    • Distributed across multiple locales
    • Great as a per-data-structure heap
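Per-node index recycling can be sketched as a free list of slot indices that is consulted before the heap grows. The class and method names below are illustrative, not from the slides:

```python
class LocalIndexHeap:
    """Sketch of one node's slice of the dynamic heap: freed slot indices
    go on a local free list and are reused before allocating new storage."""

    def __init__(self):
        self.blocks = []
        self.free = []                 # recycled indices, local to this node

    def alloc(self, value):
        if self.free:
            idx = self.free.pop()      # reuse a recycled index
            self.blocks[idx] = value
        else:
            idx = len(self.blocks)     # otherwise grow the local storage
            self.blocks.append(value)
        return idx

    def dealloc(self, idx):
        self.blocks[idx] = None
        self.free.append(idx)

h = LocalIndexHeap()
a = h.alloc("x")
h.dealloc(a)
b = h.alloc("y")                       # b reuses a's index
```

Keeping the free list per node means recycling never requires inter-node communication, which fits the "fast resizing, distributed across multiple locales" goals above.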
SLIDE 35

Conclusion

  • Chapel makes RCU easier…
    • Lots of abstractions and language constructs
      • Privatization
      • Parallel remote tasks
  • Including Distributed RCU…
    • RCUArray as a distribution
    • Exploring implementation under the Domain map Standard Interface (DSI)
  • Memory-Management-Related Efforts
    • Current effort to add a Quiescent-State-Based “Garbage Collector” to the language
      • Runtime changes 75% finished… but on hold
    • Plans to introduce an Epoch-Based “Garbage Collector” as a Chapel module…
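The epoch-based scheme mentioned last can be sketched in miniature. This is a generic epoch-based-reclamation outline, not the planned Chapel module: retired objects are freed only after every thread has advanced past the epoch in which they were retired.

```python
class EpochGC:
    """Tiny epoch-based reclamation sketch (illustrative names).
    Objects retired in epoch e become reclaimable once the global epoch
    has advanced to e + 2, i.e. every thread has observed epoch e + 1."""

    def __init__(self, nthreads):
        self.global_epoch = 0
        self.local = [0] * nthreads         # last epoch observed per thread
        self.retired = []                   # (epoch, object) pairs

    def enter(self, tid):
        self.local[tid] = self.global_epoch # thread pins the current epoch

    def retire(self, obj):
        self.retired.append((self.global_epoch, obj))

    def try_advance(self):
        if all(e == self.global_epoch for e in self.local):
            self.global_epoch += 1
            cutoff = self.global_epoch - 1
            safe = [o for (e, o) in self.retired if e < cutoff]
            self.retired = [(e, o) for (e, o) in self.retired if e >= cutoff]
            return safe                     # now safe to reclaim
        return []

gc = EpochGC(nthreads=2)
gc.enter(0); gc.enter(1)
gc.retire("old-block")
first = gc.try_advance()    # nothing reclaimable yet: grace period not over
gc.enter(0); gc.enter(1)    # both threads observe the new epoch
second = gc.try_advance()   # "old-block" is now safe to free
```

Unlike the quiescent-state approach, no thread has to announce an explicit quiescent point; re-entering with the new epoch is evidence enough, which is what makes it attractive as a library module rather than a runtime change.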