Distributed PubSub: Non-Abstract Large System Design (NALSD)


slide-1
SLIDE 1

Distributed PubSub

Non-Abstract Large System Design

...

slide-2
SLIDE 2
  • “Non-Abstract Large System Design”
  • Alternatively: SRE Classroom
  • Large (“planet scale”) system design questions
  • Hands-on workshops and exercises
  • Non-abstract component:

○ Crunch numbers ○ Provision the system

  • Resilient software systems
  • Distributed architecture patterns

NALSD

slide-3
SLIDE 3
  • Introduction and problem statement
  • “Let’s do it together”
  • Breakout session 1: Design for single datacenter
  • Single datacenter sample solution
  • Breakout session 2: Design for multiple datacenters
  • Multiple datacenters sample solution
  • Breakout session 3: Provision the system
  • Provision the system sample solution
  • Wrap-up and conclusions

Agenda

slide-4
SLIDE 4

Introduction

slide-5
SLIDE 5
  • Publish-Subscribe (PubSub)
  • Asynchronous communication

through message-passing

Introduction: PubSub

slide-6
SLIDE 6
  • Publishers: “producers” or “writers”

○ Senders of messages ○ Sends ordered messages ○ Messages grouped by topic

Introduction: PubSub

slide-7
SLIDE 7
  • Subscribers: “consumers” or “readers”

○ Subscribes to topics ○ Receives messages only for subscribed topics

Introduction: PubSub

slide-8
SLIDE 8
  • Publishers do not directly communicate with Subscribers
  • Subscribers do not directly communicate with Publishers
  • Scale publishers/subscribers

independently

Introduction: PubSub

slide-9
SLIDE 9

Introduction: PubSub

Publisher A Message F1 Topic Foo Message B1 Topic Bar Subscriber X Subscriber Y

slide-10
SLIDE 10

Introduction: PubSub

Publisher A Message F1 Topic Foo Message B1 Topic Bar Subscriber X Subscriber Y

slide-11
SLIDE 11

Introduction: PubSub

Publisher A Message F1 Topic Foo Message B1 Topic Bar Subscriber X Subscriber Y

slide-12
SLIDE 12

Problem Statement

Let’s identify the problem at hand

slide-13
SLIDE 13

Design a PubSub service that clients all over the world can use to read and write messages.

slide-14
SLIDE 14

Gather Requirements

Let’s identify what we know and what we need

slide-15
SLIDE 15

Requirements

Correctness Availability Latency

slide-16
SLIDE 16

What we have:

  • Three datacenters (DCs):

○ New York ○ Seattle ○ Kansas City

  • Reliable storage system

○ Distributed!

  • Reliable network
  • Authentication &

Authorization

Background

slide-17
SLIDE 17

What we need:

  • A way to publish messages

○ Ordered ○ Grouped by topic

  • A way to receive messages

○ Ordered ○ Grouped by topic

  • Message persistence

Requirements

Publisher A

Message F1 Topic Foo Message B1 Topic Bar

Subscriber X Subscriber Y

slide-18
SLIDE 18
  • Each DC runs the PubSub service we are designing
  • Clients all over the world read and write messages
  • Large volume of messages per day
  • Uneven distribution of traffic over time

Requirements

Seattle New York Kansas City

slide-19
SLIDE 19
  • Communicate ordered messages, grouped by topic
  • Readers/writers can connect to any DC
  • Users expect the same level of service from all DCs
  • If a DC goes down, the user will automatically get connected

to another one (this is already provided as a service)

  • Once a DC recovers, it goes back to full service

Requirements - What Does PubSub Do?

slide-20
SLIDE 20
  • Topics are identified by their topic_id.
  • Readers are identified by their consumer_id.
  • Readers will explicitly subscribe to topics.
  • Subscribe(topic_id, consumer_id):

Subscribes the given consumer to the given topic.

Requirements - PubSub API

Subscriber X

Topic Foo Topic Bar

slide-21
SLIDE 21
  • Push(topic_id, message):

Append the message to the given topic.

Requirements - PubSub API

Publisher A Message F1 Topic Foo Message B1 Topic Bar

slide-22
SLIDE 22
  • Pop(topic_id, consumer_id):

Read the next message (in order) for the given topic.

Requirements - PubSub API

Message F1 Topic Foo Message B1 Topic Bar

Subscriber X
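
Taken together, the three calls above are the core of the API for this exercise (List() on a later slide is out of scope). A minimal single-process sketch, assuming an in-memory store; the dictionaries and helper names are illustrative stand-ins, not the system the workshop goes on to design:

```python
from collections import defaultdict

class PubSub:
    """Illustrative, single-process stand-in for the PubSub API on these slides."""

    def __init__(self):
        self.topics = defaultdict(list)        # topic_id -> ordered list of messages
        self.subscriptions = defaultdict(set)  # consumer_id -> set of subscribed topic_ids
        self.positions = {}                    # (topic_id, consumer_id) -> next index to read

    def subscribe(self, topic_id, consumer_id):
        """Subscribe(topic_id, consumer_id): subscribe the given consumer to the given topic."""
        self.subscriptions[consumer_id].add(topic_id)
        self.positions.setdefault((topic_id, consumer_id), 0)

    def push(self, topic_id, message):
        """Push(topic_id, message): append the message to the given topic."""
        self.topics[topic_id].append(message)

    def pop(self, topic_id, consumer_id):
        """Pop(topic_id, consumer_id): read the next message (in order) for the given topic."""
        if topic_id not in self.subscriptions[consumer_id]:
            raise KeyError("consumer is not subscribed to this topic")
        index = self.positions[(topic_id, consumer_id)]
        if index >= len(self.topics[topic_id]):
            return None  # nothing new to read yet
        self.positions[(topic_id, consumer_id)] = index + 1
        return self.topics[topic_id][index]
```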

slide-23
SLIDE 23

Requirements - PubSub API

time

Publisher Topic Foo Publisher Topic Foo Subscriber Topic Foo

push subscribe push

Subscriber Topic Foo

pop

Message F2

Publisher Topic Foo

push

Subscriber Topic Foo

pop

Message F3 Message F1 Message F2 Message F3

slide-24
SLIDE 24
  • List():

Returns a list of all available topics.

  • Not in scope for this exercise.

Requirements - PubSub API

Topic Bar Topic Foo

Subscriber X

...

slide-25
SLIDE 25

Service Level Terminology

  • SLI: service level indicator

A quantifiable (numeric) measure of service reliability.

  • SLO: service level objective

A reliability target for an SLI.

  • SLA: service level agreement

SLO + consequences when SLO is violated

slide-26
SLIDE 26

Availability

  • PubSub must continue working under peak load even if one

datacenter goes down

Latency

  • 99% of API calls must complete within 500ms
  • 99% of pushed messages must be available for pop anywhere

in the world within 1s

Requirements - SLO

slide-27
SLIDE 27

Correctness

  • At-Least-Once delivery
  • 100 day message retention
  • System can lose 0.01% of enqueued messages per year

Further details, including volumes of data, are in the workbook handouts.

Requirements - SLO

slide-28
SLIDE 28

Let’s do it together: push()

slide-29
SLIDE 29
  • Global PubSub Service
  • Three datacenters (DCs):

○ New York ○ Seattle ○ Kansas City

  • Clients all over the world write (push) and read (pop)
  • Large volume of messages per day
  • Uneven distribution of traffic over time

Requirements Recap

slide-30
SLIDE 30

push() Let’s design the API call that receives messages.

slide-31
SLIDE 31

Pushing a message

push() Message

slide-32
SLIDE 32

Start by storing the messages...

push() Message MessageStore

slide-33
SLIDE 33

Message ID Service push() Message MessageStore

Assign message IDs for storage...

slide-34
SLIDE 34

More on the Message ID Service

  • Assign unique IDs for messages within a topic
  • Assign ordered message IDs for simple ordered lookup

Message

Message ID Service

Message Message Message Message Message
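
One way to picture the Message ID Service on this slide: a per-topic counter that hands out consecutive IDs, so messages can be looked up in order. A minimal sketch under that assumption; a real service would persist the counter and survive failover, which the deck addresses later with consensus:

```python
import threading
from collections import defaultdict

class MessageIdService:
    """Sketch: assign unique, consecutive message IDs within each topic."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next_id = defaultdict(int)  # topic_id -> next ID to hand out

    def allocate(self, topic_id):
        """Return the next ordered message ID for the given topic."""
        with self._lock:
            message_id = self._next_id[topic_id]
            self._next_id[topic_id] += 1
            return message_id
```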

slide-35
SLIDE 35

Batch Operations

  • Address bandwidth or throughput bottlenecks
  • May be supported alongside singular operations
  • Basically: stuff multiple requests into a single RPC

Request

RPC Service

Request Request Request Request Request Request

RPC Service

Request Request Request Request Request
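
A sketch of the batching idea: buffer individual requests on the client and flush them as one RPC once the batch is big enough or old enough. The send_batch_rpc stub, batch size, and delay threshold are illustrative assumptions:

```python
import time

class BatchingClient:
    """Sketch: stuff multiple requests into a single RPC."""

    def __init__(self, send_batch_rpc, max_batch=100, max_delay_s=0.01):
        self._send_batch_rpc = send_batch_rpc  # stand-in for the real batched RPC stub
        self._max_batch = max_batch
        self._max_delay_s = max_delay_s
        self._pending = []
        self._oldest = None

    def submit(self, request):
        """Queue one request; flush when the batch is full or has waited long enough."""
        self._pending.append(request)
        self._oldest = self._oldest or time.monotonic()
        if (len(self._pending) >= self._max_batch
                or time.monotonic() - self._oldest >= self._max_delay_s):
            self.flush()

    def flush(self):
        """Send everything queued so far as one RPC."""
        if self._pending:
            self._send_batch_rpc(self._pending)
            self._pending, self._oldest = [], None
```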

slide-36
SLIDE 36

More on the Message ID Service

  • Assign unique IDs for messages within a topic
  • Assign ordered message IDs for simple ordered lookup
  • Performance optimizations: batch operations

Message

Message ID Service

Message Message Message Message Message Message

Message ID Service

Message Message Message Message Message

slide-37
SLIDE 37

MessageStore

More on the MessageStore

Key: Topic ID, Message ID Value: Message Content

Topic 1 Message 1 … Message Content ... Topic 1 Message 1 … Message Content ... Topic 1 Message 1 … Message Content ...

slide-38
SLIDE 38

More on the MessageStore

black-box distributed file system

  • Distributed file system

○ Storage abstractions ○ write(), read(), implemented already ○ Supports configurable replication strategy

MessageStore
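
Putting the last two slides together, the MessageStore can be pictured as a thin layer that maps (topic ID, message ID) keys onto the black-box distributed file system. The path scheme below and the exact shape of the dfs write()/read() calls are assumptions for illustration:

```python
class MessageStore:
    """Sketch: store message content keyed by (topic_id, message_id)."""

    def __init__(self, dfs):
        self._dfs = dfs  # the black-box distributed file system from the slide

    @staticmethod
    def _path(topic_id, message_id):
        # One object per message; the naming scheme is illustrative only.
        return f"/pubsub/topics/{topic_id}/{message_id:020d}"

    def write(self, topic_id, message_id, content):
        self._dfs.write(self._path(topic_id, message_id), content)

    def read(self, topic_id, message_id):
        return self._dfs.read(self._path(topic_id, message_id))
```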

slide-39
SLIDE 39

Message Store Sharding

  • Need to retain 100 days worth of messages
  • 100 days * … = 25TB of data → too big for one machine :(
slide-40
SLIDE 40
  • Address storage size bottlenecks
  • Basically: split your data into multiple buckets, and store those

buckets separately, possibly multiple copies of each bucket

  • Sharding mechanism should be flexible
  • Consistency and fault tolerance
  • A single disk failure should not cause data loss
  • Consider replicating shards locally (local reads are cheapest)

Sharding

(Diagram: unsharded = all of A, B, C, D on one machine; sharded = A, B, C, D split across machines; sharded + replicated locally = each shard stored on more than one machine)

slide-41
SLIDE 41

Message Store Sharding

  • Need to retain 100 days worth of messages
  • 100 days * … = 25TB of data → too big for one machine :(
  • Sharding to the rescue!
  • Keep multiple copies (replicas) of each shard:

○ Greater resilience ○ … and performance too (local reads are cheap)!
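
One plausible way to realize the sharding and local replication just described: deterministically map each (topic, message) key to a shard, and keep each shard on more than one local machine. The hash scheme, shard count, and replica count are illustrative assumptions, not part of the handout:

```python
import hashlib

NUM_SHARDS = 16          # assumption for illustration
REPLICAS_PER_SHARD = 2   # "keep multiple copies (replicas) of each shard"

def shard_for(topic_id, message_id, num_shards=NUM_SHARDS):
    """Deterministically map a message key to a shard."""
    key = f"{topic_id}:{message_id}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % num_shards

def replica_machines(shard, machines, replicas=REPLICAS_PER_SHARD):
    """Place each shard's replicas on distinct machines within the local datacenter."""
    return [machines[(shard + i) % len(machines)] for i in range(replicas)]
```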

slide-42
SLIDE 42

Flow overview: push()

1. Get message ID from Message ID Service
2. Write message to MessageStore
3. Ack receipt of message

MessageStore Message ID Service push() Message
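
The same three steps as code, reusing the MessageIdService and MessageStore sketches from the earlier slides; error handling is deliberately left out here and picked up again for pop():

```python
def push(topic_id, message, id_service, message_store):
    """Sketch of the push() flow."""
    message_id = id_service.allocate(topic_id)            # 1. get message ID from Message ID Service
    message_store.write(topic_id, message_id, message)    # 2. write message to MessageStore
    return {"status": "ok", "message_id": message_id}     # 3. ack receipt of the message
```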

slide-43
SLIDE 43
  • Designs will be different, with different abstractions: that’s okay!
  • Focus on the process of designing something end-to-end
  • Think about high-level concepts, rather than nitty-gritty details
  • Think about trade-offs of different design decisions
  • Make assumptions explicit
  • Call out risks
  • Simplify the problem
  • If working in a group, discuss ideas and use each other as

resources!

Reminder: don’t sweat it!

slide-44
SLIDE 44
  • Assume good intent
  • Respect each other
  • Speak up and share information
  • Let everybody speak
  • Ask questions

Most importantly, have fun!

Rules of engagement

slide-45
SLIDE 45

Breakout Session 1: Single Datacenter (40 minutes) Goal: Design a working system that fits in a single datacenter.

slide-46
SLIDE 46

Break: 5 Minutes

slide-47
SLIDE 47

Reading a message

pop() Consumer

slide-48
SLIDE 48

Reading a message

pop() Consumer MessageStore

slide-49
SLIDE 49

Reading: getting the “next” message

pop() Consumer MessageStore Subscription Position Service

slide-50
SLIDE 50

Next, read the messages on demand...

pop() Consumer MessageStore Subscription Position Service Message ID Service

slide-51
SLIDE 51

Message ID Service push() Message MessageStore

Reminder of how push() works...

slide-52
SLIDE 52

push()

Error Handling: pop()

  • Message IDs are consecutive… almost.
  • Gaps can arise if push() service crashes after allocating ID, but

before message is successfully written to storage.

MessageStore Message ID Service Message

✔ ✘

slide-53
SLIDE 53

Error Handling: pop()

  • Detect error upon read
  • Increment ID and keep reading until the next message is found
  • Do not read past the end of the topic
  • Some latency impact; expected to be rare
  • Performance optimizations:

○ Batch reads ○ Readahead cache ○ Bloom filter on storage service
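
The skip-the-gap strategy as a small helper: last_written_id would come from the Message ID Service, and store.read() is assumed to return None for an ID that was allocated but never written:

```python
def read_next(topic_id, last_read_id, last_written_id, store):
    """Return (message_id, message) for the next readable message, or None at end of topic."""
    candidate = last_read_id + 1
    while candidate <= last_written_id:            # do not read past the end of the topic
        message = store.read(topic_id, candidate)  # assumed to return None for a gap
        if message is not None:
            return candidate, message
        candidate += 1                             # gap detected: increment and keep reading
    return None                                    # nothing new to deliver yet
```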

slide-54
SLIDE 54

Flow Overview: pop()

1. Get latest written message ID from Message ID Service
2. Get latest read message ID from Subscription Position Service
3. Increment the read message ID
4. If at the end of topic, return
5. Read message from storage
6. Return the message to consumer
7. Update subscription position for consumer and topic

MessageStore Message ID Service pop() Consumer Subscription Position Service
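
The seven steps end to end, leaning on the read_next() helper above; the last_written(), get(), and set() methods on the ID and Subscription Position services are assumed names for illustration:

```python
def pop(topic_id, consumer_id, id_service, position_service, store):
    """Sketch of the pop() flow."""
    last_written_id = id_service.last_written(topic_id)         # 1. latest written message ID
    last_read_id = position_service.get(topic_id, consumer_id)  # 2. latest read message ID
    result = read_next(topic_id, last_read_id, last_written_id, store)  # 3-5. increment, check end, read
    if result is None:
        return None                                             # 4. at the end of the topic
    message_id, message = result
    position_service.set(topic_id, consumer_id, message_id)     # 7. update subscription position
    return message                                              # 6. return the message to the consumer
```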

slide-55
SLIDE 55

Breakout Session 2: Multiple Datacenters (30 minutes) Goal: Extend the design to work correctly in multiple datacenters.

slide-56
SLIDE 56

Break: 5 Minutes

slide-57
SLIDE 57

Single Datacenter Design

MessageStore Message ID Service push() Message pop() Consumer Subscription Position Service

slide-58
SLIDE 58

One for each datacenter…?

Seattle Kansas City New York

MessageStore Message ID Service push() Message pop() Consumer Subscription Position Service MessageStore Message ID Service push() Message pop() Consumer Subscription Position Service MessageStore Message ID Service push() Message pop() Consumer Subscription Position Service

slide-59
SLIDE 59

Partitioned MessageStore

MessageStore Message ID Service push() pop() Subscription Position Service

Seattle New York

MessageStore Message ID Service push() Topic1, Msg X pop() Subscription Position Service UserX, Topic1

slide-60
SLIDE 60

MessageStore Replication

  • Pushes can arrive at any datacenter
  • Need to be able to pop messages from any datacenter, even at

a different datacenter than where it arrived

  • Need to replicate messages to every datacenter
  • Factors to consider:

○ Consistency ○ Fault tolerance ○ Availability

slide-61
SLIDE 61

Replication: synchronous

Seattle Kansas City New York

MessageStore MessageStore MessageStore

Message push()

slide-62
SLIDE 62

Replication: asynchronous

Seattle Kansas City New York

MessageStore MessageStore MessageStore

Message push()

slide-63
SLIDE 63

Replication: hybrid

Seattle Kansas City New York

MessageStore MessageStore MessageStore

Message push()

slide-64
SLIDE 64

MessageStore Replication: Tradeoffs

                           Push Latency   Pop Latency   Data Durability
Synchronous Replication    High           Low           High
Asynchronous Replication   Low            High          Low
Hybrid Replication         Medium         Medium        Medium

slide-65
SLIDE 65
  • Asynchronous writes: ~10ms response time
  • Can we afford the data loss?
  • Reminder:

○ Can lose 0.01% of pushed messages per year ○ 99% of messages must be available for pop from any location in 1 second or less

5,000 topics * 10,000 msg / day / topic = 50M msg / day → can lose 5k messages per day.

MessageStore Replication

slide-66
SLIDE 66

90k sec/day * 1 msg/sec/thread = 90k msg / day / thread

Parallelize processing to handle the entire load: (50M msg / day) / (90k msg / day / thread) = ~600 threads (i.e. ~600 messages in flight at any time)

Async Replication

Reminders:

  • 50M msg / day
  • 99% of messages must be

available for pop from any location in 1 second or less

  • ~90k seconds / day
  • Assume 1 second replication

delay

slide-67
SLIDE 67
  • Each machine failure =

lose all in-flight messages = lose ~600 messages

  • Machine would have to fail ~8

times / day for us to lose 5k messages (0.01% of incoming messages) We can afford it!

Async Replication

Reminders:

  • Can lose 5k msg / day
  • ~600 in-flight msg / sec
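
A quick sanity check of the arithmetic on the last two slides; every input comes straight from the slides, and the small differences from the slide figures are just rounding:

```python
messages_per_day = 5_000 * 10_000                  # 5,000 topics * 10,000 msg/day/topic = 50M msg/day
loss_budget_per_day = messages_per_day * 0.0001    # may lose 0.01% of messages -> 5,000 msg/day

seconds_per_day = 90_000                           # the deck's rounded day length
in_flight = messages_per_day / seconds_per_day     # ~556, which the slide rounds up to ~600

failures_to_blow_budget = loss_budget_per_day / 600  # using the slide's rounded in-flight count
print(int(loss_budget_per_day), round(in_flight), round(failures_to_blow_budget, 1))
# -> 5000 556 8.3: a machine would have to fail ~8 times a day before the budget is at risk
```
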
slide-68
SLIDE 68

Let’s use replication...

MessageStore Message ID Service push() Message pop() Consumer Subscription Position Service

Kansas City

Message ID Service push() Message pop() Consumer Subscription Position Service

Seattle

MessageStore Message ID Service push() Message pop() Consumer Subscription Position Service

New York

MessageStore

File Replication
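
Since the design settles on asynchronous file replication between the MessageStores, a sketch of that write path: acknowledge the publisher once the local write is durable, and let the remote copies catch up in the background. The store_for_dc factory and the thread pool are illustrative assumptions:

```python
import concurrent.futures

DCS = ["new_york", "seattle", "kansas_city"]
_replicator = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def write_with_async_replication(topic_id, message_id, content, local_dc, store_for_dc):
    """Sketch: durable local write, then asynchronous fan-out to the other datacenters."""
    store_for_dc(local_dc).write(topic_id, message_id, content)  # local write (~10 ms), then ack
    for dc in DCS:
        if dc != local_dc:
            # Remote copies catch up in the background (~1 s replication delay assumed).
            _replicator.submit(store_for_dc(dc).write, topic_id, message_id, content)
    return "ack"
```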

slide-69
SLIDE 69

Message ID Conflicts

MessageStore Message ID Service push() Topic1, Msg Y pop() Subscription Position Service

Seattle New York

MessageStore Message ID Service push() Topic1, Msg X pop() Subscription Position Service

slide-70
SLIDE 70

Message ID Service Message ID Service Message ID Service

Let’s use consensus...

MessageStore push() Message pop() Consumer Subscription Position Service

Kansas City

Paxos-based consensus

push() Message pop() Consumer Subscription Position Service

Seatule

MessageStore push() Message pop() Consumer Subscription Position Service

New York

MessageStore

slide-71
SLIDE 71

Distributed Consensus

  • Distributed components reliably and consistently:

○ Agree on a single source of truth ○ Identify leaders for specific operations ○ Divide pieces of work ○ Make other decisions

  • Unreliable components → reliable decisions
  • Consistent decisions, even when sub-components fail
  • Recover orphaned datacenters
  • Eventual at-most-once semantics
  • Paxos, FastPaxos, Raft
slide-72
SLIDE 72

Message ID Service Message ID Service Message ID Service

Let’s use consensus...

MessageStore push() Message pop() Consumer Subscription Position Service

Kansas City

Paxos-based consensus

push() Message pop() Consumer Subscription Position Service

Seattle

MessageStore push() Message pop() Consumer Subscription Position Service

New York

MessageStore Message ID Service Message ID Service Message ID Service

slide-73
SLIDE 73

Partitioned/Stale Subscription Positions

MessageStore Message ID Service push() UserX, Topic1 pop() Subscription Position Service

Seattle New York

MessageStore Message ID Service push() UserX, Topic1 pop() Subscription Position Service

slide-74
SLIDE 74

Message ID Service Message ID Service Message ID Service

Let’s use consensus...

MessageStore push() Message pop() Consumer Subscription Position Service

Kansas City

Paxos-based consensus

push() Message pop() Consumer Subscription Position Service

Seattle

MessageStore push() Message pop() Consumer Subscription Position Service

New York

MessageStore Message ID Service Message ID Service Message ID Service Subscription Position Service Subscription Position Service Subscription Position Service

slide-75
SLIDE 75

Replicating/Sharding Services

MessageStore Message ID Service push() Message pop() Consumer Subscription Position Service

slide-76
SLIDE 76

Breakout Session 3: Provision the System (35 minutes) Goal: Identify how many machines you need. Determine if SLOs are viable.
slide-77
SLIDE 77

Break: 5 Minutes

slide-78
SLIDE 78
  • Provisioning is an art.
  • Simplify where possible
  • Over-provision by default
  • Granularity: units of one machine

Provisioning

(Diagram: compare system needs against single-machine capacity; negligible needs can be ignored, otherwise provision in whole machines, e.g. 3 machines)

slide-79
SLIDE 79

Storage

Message content: 50M msg / day * 5 kB / msg = 250 GB / day
IDs: 50M msg / day * 128 bits / msg = 800 MB / day
Total: ~250 GB / day

Key: Topic ID, Message ID Value: Message Content MessageStore Topic ID = 64 bits Msg ID = 64 bits Average msg size = 5 kB Machine: 128GB RAM, 2TB SSD 1 x 4TB HDD

slide-80
SLIDE 80

Storage

100 days retention: 250 GB / day * 100 days = 25 TB
⌈25 TB / (4 TB HDD / machine)⌉ = 7 machines … per DC … per copy

Key: Topic ID, Message ID Value: Message Content MessageStore Topic ID = 64 bits Msg ID = 64 bits Average msg size = 5 kB Machine: 128GB RAM, 2TB SSD 1 x 4TB HDD

slide-81
SLIDE 81

Storage

100 days retention: 7 machines / DC / copy
7 machines / DC / copy * 2 copies / DC * 3 DCs = 42 machines

Key: Topic ID, Message ID Value: Message Content MessageStore Topic ID = 64 bits Msg ID = 64 bits Average msg size = 5 kB Machine: 128GB RAM, 2TB SSD 1 x 4TB HDD
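
The storage arithmetic from the last three slides, collected in one place:

```python
import math

MSG_PER_DAY = 50_000_000
AVG_MSG_KB = 5
ID_BITS = 128                                                   # 64-bit topic ID + 64-bit message ID

content_gb_per_day = MSG_PER_DAY * AVG_MSG_KB / 1e6             # 250 GB/day of message content
id_gb_per_day = MSG_PER_DAY * ID_BITS / 8 / 1e9                 # 0.8 GB/day of IDs (negligible)
retained_tb = (content_gb_per_day + id_gb_per_day) * 100 / 1e3  # ~25 TB for 100 days of retention

machines_per_dc_per_copy = math.ceil(retained_tb / 4)           # 4 TB HDD per machine -> 7 machines
total_machines = machines_per_dc_per_copy * 2 * 3               # 2 copies per DC * 3 DCs -> 42 machines
print(round(retained_tb, 1), machines_per_dc_per_copy, total_machines)   # 25.1 7 42
```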

slide-82
SLIDE 82

Which hardware to choose?

        latency   per-machine   machine count
RAM     0.01ms    128 GB        1176
SSD     1ms       2 TB          78
HDD     15ms      4 TB          42

MessageStore

slide-83
SLIDE 83

Bandwidth: push

  • Peak load = 1.25x avg load

= 250 GB / day * 1.25 = ~315 GB / day

  • 315 GB / day

= ~4 MB / s = ~30 Mbps inbound

  • Outbound ~= Inbound

30 Mbps inbound, 30 Mbps outbound

~250 GB / day Machine: 10Gbps ethernet 100Gbps cross-DC

MessageStore Message ID Service push() Message

slide-84
SLIDE 84

Bandwidth: pop

  • Avg load

= 10k consumers * 5 topics / consumer * 10k msg / topic / day * 5 kB / msg = 2.5 TB / day

?? / day Machine: 10Gbps ethernet 100Gbps cross-DC

MessageStore Message ID Service pop() Consumer Subscription Position Service

slide-85
SLIDE 85

Bandwidth: pop

  • Peak load = 1.25x avg load

= 2.5 TB / day * 1.25 = ~3.15 TB / day

  • 3.15 TB / day

= ~37 MB / s = ~300 Mbps outbound

  • Internal ~= Outbound

300 Mbps outbound, 300 Mbps internal

~2.5 TB / day Machine: 10Gbps ethernet 100Gbps cross-DC

MessageStore Message ID Service pop() Consumer Subscription Position Service
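
The same back-of-the-envelope treatment for bandwidth; the peak factor, message counts, and sizes are the slide inputs, and the results land a little under the slides' rounded figures:

```python
SECONDS_PER_DAY = 90_000    # the deck's rounded day length
PEAK_FACTOR = 1.25

push_gb_per_day = 250                                                # everything is written once
push_mbps = push_gb_per_day * PEAK_FACTOR * 8_000 / SECONDS_PER_DAY  # GB/day -> Mbps at peak
# ~28 Mbps inbound (slide rounds to ~30 Mbps); outbound for replication is about the same

pop_gb_per_day = 10_000 * 5 * 10_000 * 5 / 1e6   # 10k consumers * 5 topics * 10k msg * 5 kB = 2.5 TB/day
pop_mbps = pop_gb_per_day * PEAK_FACTOR * 8_000 / SECONDS_PER_DAY
print(round(push_mbps), round(pop_gb_per_day), round(pop_mbps))      # 28 2500 278 (~300 Mbps on the slide)
```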

slide-86
SLIDE 86

Message ID Service Message ID Service Message ID Service

Is it reliable enough?

MessageStore push() Message pop() Consumer Subscription Position Service

Kansas City

Paxos-based consensus

push() Message pop() Consumer Subscription Position Service

Seattle

MessageStore push() Message pop() Consumer Subscription Position Service

New York

MessageStore Message ID Service Message ID Service Message ID Service Subscription Position Service Subscription Position Service Subscription Position Service

slide-87
SLIDE 87

Availability Partition Tolerance

(Latency)

CAP Theorem

Consistency

(Correctness)

slide-88
SLIDE 88

Latency: push

  • Determine ID: ~200ms
  • Store message: ~150ms

○ Synchronous ○ Bound by slowest connection to remote datacenter

  • Write message: ~10ms

Total = 200ms + 150ms + 10ms = 360ms

Reminders:

  • 99% ops complete in <500ms
  • Paxos takes ~200ms
  • Inter-continental = ~150ms
  • Local write takes ~10ms

MessageStore Message ID Service push() Message

slide-89
SLIDE 89

Latency: pop

  • Determine ID: ~0.5ms local,

~150ms remote

  • Read message: ~15ms local,

~150ms remote

  • Deliver message: ~negligible
  • Update position: ~200ms

Total = 150ms + 150ms + 200ms = 500ms

Reminders:

  • 99% ops complete in <500ms
  • Paxos takes ~200ms
  • Inter-continental = ~150ms
  • Disk seek+read takes ~15ms

MessageStore Message ID Service pop() Consumer Subscription Position Service
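
Both latency budgets in one place; every number is a slide estimate, and the takeaway is that push() clears the 500 ms SLO comfortably while worst-case pop() lands exactly on it:

```python
PAXOS_MS = 200        # one consensus round
CROSS_DC_MS = 150     # inter-continental round trip
LOCAL_WRITE_MS = 10   # local storage write
DISK_READ_MS = 15     # local disk seek + read (the cheap, local-read case)

push_ms = PAXOS_MS + CROSS_DC_MS + LOCAL_WRITE_MS  # determine ID + synchronous store + write = 360 ms
pop_ms = CROSS_DC_MS + CROSS_DC_MS + PAXOS_MS      # worst case: remote ID + remote read + position update = 500 ms
print(push_ms, pop_ms)                             # 360 500 -> both within "99% of calls under 500 ms"
```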

slide-90
SLIDE 90

Bill of Materials

Final count of machines: 2 push + 2 pop + 3 Message ID Service + 3 Subscription Position Service + 14 MessageStore = 24 per DC * 3 DCs * 1.25 (for load spikes) = 90 machines
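
And the final tally, using the per-role machine counts from the provisioning exercise:

```python
import math

per_dc = 2 + 2 + 3 + 3 + 14            # push + pop + Message ID + Subscription Position + MessageStore
total = math.ceil(per_dc * 3 * 1.25)   # 3 DCs, with 1.25x headroom for load spikes
print(per_dc, total)                   # 24 machines per DC, 90 machines in total
```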

slide-91
SLIDE 91

Last thoughts

  • Start simple and iterate
  • See the big picture
  • Details, details, details!
  • But also, be reasonably pragmatic
  • Flexible vs. premature future-proofing
  • Cultivate discipline in your problem-solving approach
  • Make data-driven decisions

Take breaks and enjoy the process!

slide-92
SLIDE 92

Distributed PubSub

Non-Abstract Large System Design

...