What Came First? The Ordering of Events in Systems @kavya719 - - PowerPoint PPT Presentation



SLIDE 1

What Came First?

The Ordering of Events in Systems

@kavya719

SLIDE 2

kavya

SLIDE 3

the design of concurrent systems

SLIDE 4

SLIDE 5

Slack architecture on AWS

SLIDE 6

concurrent actors

systems with multiple independent actors: nodes in a distributed system, threads in a multithreaded program.

SLIDE 7

user-space or system threads

threads

SLIDE 8

threads

user-space or system threads

var tasks []Task

func main() {
    for {
        if len(tasks) > 0 {
            task := dequeue(tasks)
            process(task)
        }
    }
}

SLIDE 9

multiple threads:

// Shared variable
var tasks []Task

func worker() {
    for len(tasks) > 0 {
        task := dequeue(tasks)
        process(task)
    }
}

func main() {
    // Spawn fixed-size pool of worker threads.
    startWorkers(3, worker)
    // Populate task queue.
    for _, t := range hellaTasks {
        tasks = append(tasks, t)
    }
}

data race: when two or more threads concurrently access a shared memory location, and at least one access is a write.

SLIDE 10

…multiple threads provide concurrency, but may introduce data races.

SLIDE 11

nodes

processes, i.e. logical nodes (though the term can also refer to machines, i.e. physical nodes).
communicate by message-passing, i.e. connected by an unreliable network, no shared memory.
are sequential. no global clock.

SLIDE 12

distributed key-value store.
three nodes: a master (M) and two replicas (R, R).

[diagram: userX's ADD apple crepe and userY's ADD blueberry crepe go through the master; cart goes from [ ] to [ apple crepe, blueberry crepe ]]

SLIDE 13

distributed key-value store.
three nodes with three equal replicas. read_quorum = write_quorum = 1. eventually consistent.

[diagram: cart starts as [ ] on N1, N2, N3; userX's ADD apple crepe gives one replica cart: [ apple crepe ], while userY's ADD blueberry crepe gives another cart: [ blueberry crepe ]]

SLIDE 14

…multiple nodes accepting writes provide availability, but may introduce conflicts.

SLIDE 15

given we want concurrent systems, we need to deal with data races and conflict resolution.

SLIDE 16

riak:

distributed key-value store

channels:

Go concurrency primitive

stepping back:

similarities,
 meta-lessons

SLIDE 17

riak

a distributed datastore

SLIDE 18

riak

  • Distributed key-value database:

    // A data item = <key: blob>
    {"uuid1234": {"name": "ada"}}

  • v1.0 released in 2011. Based on Amazon's Dynamo.

  • Eventually consistent: uses optimistic replication, i.e. replicas can temporarily diverge, but will eventually converge.

  • Highly available: data partitioned and replicated, decentralized, sloppy quorum.

An AP system (CAP theorem).

SLIDE 19

[diagram: N1, N2, N3 each start with cart: [ ]; ADD apple crepe reaches one replica (cart: [ apple crepe ]) while ADD blueberry crepe reaches another (cart: [ blueberry crepe ]); later, UPDATE to date crepe against cart: [ apple crepe ] yields cart: [ date crepe ]]

conflict resolution; causal updates.

SLIDE 20

how do we determine causal vs. concurrent updates?

SLIDE 21

[diagram: userX writes { cart: [ A ] } to N1; userY writes { cart: [ B ] } to N2; userX fetches { cart: [ A ] } and writes { cart: [ D ] } via N3]

events A, B, C, D — concurrent events?

A: apple, B: blueberry, D: date

SLIDE 22

[diagram: events A, B, C, D across N1, N2, N3]

concurrent events?

SLIDE 23

[diagram: events A, B, C, D across N1, N2, N3]

A, C: not concurrent — same sequential actor

SLIDE 24

[diagram: events A, B, C, D across N1, N2, N3]

A, C: not concurrent — same sequential actor
C, D: not concurrent — fetch/update pair

SLIDE 25

happens-before

X ≺ Y IF one of:
— same actor
— are a synchronization pair
— X ≺ E ≺ Y, across actors (transitivity)
IF X not ≺ Y and Y not ≺ X: concurrent!

Orders events; establishes causality and concurrency (actors may be threads or nodes). Formulated in Lamport's "Time, Clocks, and the Ordering of Events in a Distributed System" paper, 1978.

SLIDE 26

causality and concurrency

[diagram: events A, B, C, D across N1, N2, N3]

A ≺ C (same actor)
C ≺ D (synchronization pair)
So, A ≺ D (transitivity)
SLIDE 27

causality and concurrency

[diagram: events A, B, C, D across N1, N2, N3]

…but B not ≺ D, and D not ≺ B. So, B, D concurrent!

SLIDE 28

[diagram: events A ({ cart: [ A ] }), B ({ cart: [ B ] }), C ({ cart: [ A ] }), D ({ cart: [ D ] }) across N1, N2, N3]

A ≺ D, so D should update A.
B, D concurrent, so B and D need resolution.

SLIDE 29

how do we implement happens-before?

SLIDE 30

vector clocks

a means to establish happens-before edges.

[diagram: per-node counters at n1, n2, n3, incrementing as events occur]

SLIDE 31

vector clocks

a means to establish happens-before edges.

[diagram: per-node counters at n1, n2, n3, advancing with each local event]

SLIDE 32

vector clocks

a means to establish happens-before edges.

[diagram: per-node counters at n1, n2, n3; clocks are attached to messages]

SLIDE 33

vector clocks

a means to establish happens-before edges.

[diagram: on receiving a message, a node merges clocks entry-wise: max((2, 0, 0), (0, 1, 0))]

SLIDE 34

vector clocks

a means to establish happens-before edges.

[diagram: merged clock = max((2, 0, 0), (0, 1, 0))]

happens-before comparison: X ≺ Y iff VCx < VCy

SLIDE 35

[diagram: vector clocks at events A, B, C, D across N1, N2, N3]

VC at A < VC at D. So, A ≺ D.

SLIDE 36

[diagram: vector clocks at events A, B, C, D across N1, N2, N3]

VC at B and VC at D are incomparable. So, B, D concurrent.

SLIDE 37

causality tracking in riak

Riak stores a vector clock with each version of the data (in a more precise form, the "dotted version vector"). GET and PUT operations on a key pass around a causal context object that contains the vector clocks. Therefore, Riak is able to detect conflicts.

SLIDE 38

…what about resolving those conflicts?

SLIDE 39

conflict resolution in riak

Behavior is configurable. Assuming vector clock analysis is enabled:

  • last-write-wins

    i.e. the version with the higher timestamp is picked.

  • merge, iff the underlying data type is a CRDT

  • return conflicting versions to the application

    riak stores "siblings", i.e. conflicting versions, which are returned to the application for resolution.

SLIDE 40

return conflicting versions to application:

B: { cart: [ "blueberry crepe" ] }
D: { cart: [ "date crepe" ] }

Riak stores both versions; the next operation returns both to the application; the application must resolve the conflict, e.g. { cart: [ "blueberry crepe", "date crepe" ] }, which creates a causal update.

SLIDE 41

…what about resolving those conflicts? riak doesn't (default behavior).

instead, it exposes the happens-before graph to the application for conflict resolution.

SLIDE 42

riak: uses vector clocks to track causality and conflicts. exposes happens-before graph to the user for conflict resolution.

SLIDE 43

channels

Go concurrency primitive

SLIDE 44

multiple threads:

// Shared variable
var tasks []Task

func worker() {
    for len(tasks) > 0 {
        task := dequeue(tasks)
        process(task)
    }
}

func main() {
    // Spawn fixed-size pool of worker threads.
    startWorkers(3, worker)
    // Populate task queue.
    for _, t := range hellaTasks {
        tasks = append(tasks, t)
    }
}

data race: when two or more threads concurrently access a shared memory location, and at least one access is a write.

SLIDE 45

memory model

specifies when one event happens before another.

X ≺ Y IF one of:
— same thread
— are a synchronization pair
— X ≺ E ≺ Y
IF X not ≺ Y and Y not ≺ X: concurrent!

example: X is x = 1, Y is print(x).
synchronization pairs: unlock/lock on a mutex, send/recv on a channel, spawn/first event of a thread, etc.

SLIDE 46

goroutines

The unit of concurrent execution: goroutines — user-space threads; use as you would threads:
> go handle_request(r)

The Go memory model is specified in terms of goroutines:
within a goroutine, reads and writes are ordered;
with multiple goroutines, shared data must be synchronized… else data races!

SLIDE 47

synchronization

The synchronization primitives are:

mutexes, condition variables, …
> import "sync"
> mu.Lock()

atomics
> import "sync/atomic"
> atomic.AddUint64(&myInt, 1)

channels

SLIDE 48

channels

"Do not communicate by sharing memory; instead, share memory by communicating."

A standard type in Go — chan — safe for concurrent use. A mechanism for goroutines to communicate, and synchronize. Conceptually similar to Unix pipes:

> ch := make(chan int)       // Initialize
> go func() { ch <- 1 }()    // Send
> <-ch                       // Receive; blocks until sent.

SLIDE 49

// Shared variable
var tasks []Task

func worker() {
    for len(tasks) > 0 {
        task := dequeue(tasks)
        process(task)
    }
}

func main() {
    // Spawn fixed-size pool of workers.
    startWorkers(3, worker)
    // Populate task queue.
    for _, t := range hellaTasks {
        tasks = append(tasks, t)
    }
}

want:

main:
* give tasks to workers.

worker:
* get a task.
* process it.
* repeat.

SLIDE 50

var taskCh = make(chan Task, n)
var resultCh = make(chan Result)

func worker() {
    for {
        // Get a task.
        t := <-taskCh
        r := process(t)
        // Send the result.
        resultCh <- r
    }
}

func main() {
    // Spawn fixed-size pool of workers.
    startWorkers(3, worker)
    // Populate task queue.
    for _, t := range hellaTasks {
        taskCh <- t
    }
    // Wait for and amalgamate results.
    var results []Result
    for r := range resultCh {
        results = append(results, r)
    }
}

SLIDE 51

// Shared variable
var tasks []Task

func worker() {
    for len(tasks) > 0 {
        task := dequeue(tasks)
        process(task)
    }
}

func main() {
    // Spawn fixed-size pool of workers.
    startWorkers(3, worker)
    // Populate task queue.
    for _, t := range hellaTasks {
        tasks = append(tasks, t)
    }
}

[diagram: mutexes (mu) guarding each access to tasks]

…but workers can exit early. mutex?

SLIDE 52

want:

worker:
* wait for task
* process it
* repeat

main:
* send tasks

[diagram: main sends task; worker waits for, receives, and processes it]

channel semantics (as used): send task happens before worker runs.

…channels allow us to express happens-before constraints.

SLIDE 53

channels: allow, and force, the user to express happens-before constraints.

SLIDE 54

stepping back…

SLIDE 55

similarities

first principle: happens-before

riak (distributed key-value store) and channels (Go concurrency primitive) both surface happens-before to the user.

SLIDE 56

meta-lessons

SLIDE 57

new technologies cleverly decompose into old ideas.
SLIDE 58

the “right” boundaries for abstractions are flexible.

SLIDE 59

@kavya719

happens-before riak channels

https://speakerdeck.com/kavya719/what-came-first

SLIDE 60

riak: a note (or two)…

nodes in Riak:
> virtual nodes ("vnodes")
> key-space partitioning by consistent hashing, 1 vnode per partition.
> sequential because they are Erlang processes, which use message queues.

replicas:
> N, R, W, etc. configurable by key.
> on network partition, defaults to sloppy quorum with hinted handoff.

conflict resolution:
> by read-repair, active anti-entropy.

SLIDE 61

riak: dotted version vectors

problem with standard vector clocks: false concurrency.

userX: PUT "cart": "A", {} —> (1, 0); "A"
userY: PUT "cart": "B", {} —> (2, 0); ["A", "B"]
userX: PUT "cart": "C", {(1, 0); "A"} —> (1, 0) !< (2, 0) —> (3, 0); ["A", "B", "C"]

This is false concurrency; it leads to "sibling explosion".

dotted version vectors: a fine-grained mechanism to detect causal updates. Decompose each vector clock into its set of discrete events, so:

userX: PUT "cart": "A", {} —> (1, 0); "A"
userY: PUT "cart": "B", {} —> (2, 0); [(1, 0)->"A", (2, 0)->"B"]
userX: PUT "cart": "C", {} —> (3, 0); [(2, 0)->"B", (3, 0)->"C"]

SLIDE 62

riak: CRDTs

Conflict-free / Convergent / Commutative Replicated Data Type
> a data structure with the property that replicas can be updated concurrently without coordination, and it's mathematically possible to always resolve conflicts.
> two types: op-based (commutative) and state-based (convergent).
> examples: G-Set (Grow-Only Set), G-Counter, PN-Counter
> Riak DT is state-based CRDTs.

SLIDE 63

channels: implementation

ch := make(chan int, 3)

hchan struct:
buf — ring buffer (fixed size)
sendq — waiting senders
recvq — waiting receivers
lock — mutex

SLIDE 64

[diagram: g1 performs ch <- t1 … ch <- t4; sends fill hchan's buf under lock, with sendq/recvq shown]
SLIDE 65

[diagram: g1's ch <- t1 and g2's <-ch interact through hchan's buf]

SLIDE 66

[diagram: g2's <-ch dequeues from buf; g1 proceeds]

SLIDE 67

[diagram: g2 receives; a slot frees in buf, so g1's waiting ch <- t4 completes]

SLIDE 68

// Shared variable
var count = 0
var ch = make(chan bool, 1)

func setCount() {
    count++
    ch <- true
}

func printCount() {
    <-ch
    print(count)
}

go setCount()
go printCount()

[diagram: g1: A (write count), B (send); g2: C (recv), D (read count)]

B ≺ C. So, A ≺ D.

  • 1. send happens-before corresponding receive.
SLIDE 69

  • 2. nth receive on a channel of size C happens-before n+Cth send completes.

var maxOutstanding = 3
var taskCh = make(chan int, maxOutstanding)

func worker() {
    for {
        t := <-taskCh
        processAndStore(t)
    }
}

func main() {
    go worker()
    tasks := generateHellaTasks()
    for _, t := range tasks {
        taskCh <- t
    }
}

SLIDE 70

  • 1. send happens-before corresponding receive.

If the channel is empty: the receiver goroutine is paused, and resumed after a channel send occurs.
If the channel is not empty: the receiver gets the first unreceived element, i.e. the buffer is a FIFO queue. Sends must have completed due to the mutex.
SLIDE 71

  • 2. nth receive on a channel of size C happens-before n+Cth send completes.

Fixed-size, circular buffer. "2nd receive happens-before 5th send."
send #3 can occur; send #4 can occur after receive #1; send #5 can occur after receive #2.

SLIDE 72

  • 2. nth receive on a channel of size C happens-before n+Cth send completes.

If the channel is full: the sender goroutine is paused, and resumed after a channel receive occurs.
If the channel is not empty: the receiver gets the first unreceived element, i.e. the buffer is a FIFO queue. The send of that element must have completed due to the channel mutex.