Andrew Godwin: Distributed Systems



SLIDE 1

SLIDE 2

Hi, I'm Andrew Godwin

Django core developer
Senior Software Engineer at
Used to complain about migrations a lot

SLIDE 3

Distributed Systems

SLIDE 4

c = 299,792,458 m/s

SLIDE 5

Early CPUs

Clock: 5 MHz (chip ~2cm)
c = 60m propagation distance per cycle

SLIDE 6

Modern CPUs

Clock: 3 GHz
c = 10cm propagation distance per cycle
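
A quick back-of-the-envelope check of those two numbers (a sketch; it just divides the speed of light by the clock frequency):

```python
# Distance a signal can propagate, at most, during one clock cycle.
C = 299_792_458  # speed of light in m/s (slide 4)

def distance_per_cycle(clock_hz: float) -> float:
    return C / clock_hz

print(distance_per_cycle(5e6))  # early CPU at 5 MHz  -> ~60 m
print(distance_per_cycle(3e9))  # modern CPU at 3 GHz -> ~0.1 m (10 cm)
```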

SLIDE 7

Distributed systems are made of independent components

SLIDE 8

They are slower and harder to write than synchronous systems

SLIDE 9

But they can be scaled up much, much further

SLIDE 10

Trade-offs

SLIDE 11

There is never a perfect solution.

SLIDE 12

Fast · Good · Cheap

SLIDE 13

SLIDE 14

Load Balancer → WSGI Worker ×3

SLIDE 15

Load Balancer → WSGI Worker ×3 → Cache

SLIDE 16

Load Balancer → WSGI Worker ×3 → Cache ×3

SLIDE 17

Load Balancer → WSGI Worker ×3 → Database

SLIDE 18

CAP Theorem

SLIDE 19

Consistent · Available · Partition Tolerant

SLIDE 20

PostgreSQL: CP

Consistent everywhere
Handles network latency/drops
Can't write if main server is down

SLIDE 21

Cassandra: AP

Can read/write to any node
Handles network latency/drops
Data can be inconsistent

SLIDE 22

It's hard to design a product that might be inconsistent

SLIDE 23

But if you take the tradeoff, scaling is easy

SLIDE 24

Otherwise, you must find other solutions
SLIDE 25

Read Replicas

(often called master/slave)

Load Balancer → WSGI Worker ×3 → Main + 2 Replicas

SLIDE 26

Replicas scale reads forever... But writes must go to one place

SLIDE 27

If a request writes to a table, later reads in that request must be pinned to the main database so they do not get old data
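
In Django terms this is usually done with a database router; here is a minimal sketch, assuming hypothetical "replica1"/"replica2" aliases in DATABASES and a thread-local flag for pinning (neither detail is from the talk):

```python
import random
import threading

_pinned = threading.local()  # set once the current request has written

class ReplicaRouter:
    """Read from replicas until the request writes, then pin reads to the main DB."""

    replicas = ["replica1", "replica2"]  # hypothetical DATABASES aliases

    def db_for_read(self, model, **hints):
        if getattr(_pinned, "value", False):
            return "default"              # we already wrote: avoid stale replica reads
        return random.choice(self.replicas)

    def db_for_write(self, model, **hints):
        _pinned.value = True              # remember the write for later reads
        return "default"

    def allow_relation(self, obj1, obj2, **hints):
        return True                       # every alias is the same logical database
```

A real setup would also clear the flag at the start of every request (middleware is one place to do it) and register the router in the DATABASE_ROUTERS setting.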

SLIDE 28

When your write load is too high, you must then shard

SLIDE 29

Vertical Sharding

Users | Tickets | Events | Payments

SLIDE 30

Horizontal Sharding

Users 0-2 | Users 3-5 | Users 6-8 | Users 9-A
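
A sketch of how requests might pick a shard, assuming user IDs are hashed to a hex digit and bucketed into the four ranges on the slide (the hashing scheme is illustrative, not from the talk):

```python
import hashlib

# Hypothetical shard map mirroring the slide's "0 - 2", "3 - 5", "6 - 8", "9 - A" split;
# digits above A are lumped into the last bucket here as an assumption.
SHARDS = {
    "users_0_2": "012",
    "users_3_5": "345",
    "users_6_8": "678",
    "users_9_a": "9abcdef",
}

def shard_for_user(user_id: int) -> str:
    digit = hashlib.md5(str(user_id).encode()).hexdigest()[0]
    for shard, digits in SHARDS.items():
        if digit in digits:
            return shard
    raise ValueError(f"no shard for digit {digit!r}")

print(shard_for_user(12345))  # always maps a given user to the same shard
```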

SLIDE 31

Both

Users 0-2 | Users 3-5 | Users 6-8 | Users 9-A
Events 0-2 | Events 3-5 | Events 6-8 | Events 9-A
Tickets 0-2 | Tickets 3-5 | Tickets 6-8 | Tickets 9-A

SLIDE 32

Both plus caching

Users 0-2 | Users 3-5 | Users 6-8 | Users 9-A + User Cache
Events 0-2 | Events 3-5 | Events 6-8 | Events 9-A + Event Cache
Tickets 0-2 | Tickets 3-5 | Tickets 6-8 | Tickets 9-A + Ticket Cache

SLIDE 33

Teams have to scale too; nobody should have to understand everything in a big system.

SLIDE 34

Services allow complexity to be reduced - for a tradeoff of speed
SLIDE 35

User Service → User Cache + Users 0-2 | 3-5 | 6-8 | 9-A
Event Service → Event Cache + Events 0-2 | 3-5 | 6-8 | 9-A
Ticket Service → Ticket Cache + Tickets 0-2 | 3-5 | 6-8 | 9-A

SLIDE 36

WSGI Server → User Service | Event Service | Ticket Service

SLIDE 37

Each service is its own, smaller project, managed and scaled separately.

SLIDE 38

But how do you communicate between them?

SLIDE 39

Direct Communication

Service 1, Service 2, Service 3, each talking directly to the others

SLIDE 40

Service 1, Service 2, Service 3, Service 4, Service 5

SLIDE 41

Service 1, Service 2, Service 3, Service 4, Service 5, Service 6, Service 7, Service 8

SLIDE 42

Message Bus

Service 1, Service 2, Service 3, each connected only to a central Message Bus

SLIDE 43

A single point of failure is not always bad - if the alternative is multiple, fragile ones

SLIDE 44

Channels and ASGI provide a standard message bus built with certain tradeoffs

SLIDE 45

Django Channels Project

Django → Channels Library → ASGI (Channel Layer) → Backing Store (e.g. Redis, RabbitMQ)

SLIDE 46

Pure Python

ASGI (Channel Layer) → Backing Store (e.g. Redis, RabbitMQ)
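
As a rough sketch of what using this message bus looks like from application code (assuming channels and a configured CHANNEL_LAYERS backend such as channels_redis; the channel name and message are made up):

```python
from asgiref.sync import async_to_sync
from channels.layers import get_channel_layer

channel_layer = get_channel_layer()

# Producer: put a message onto a named channel on the bus.
async_to_sync(channel_layer.send)(
    "thumbnails",                                   # hypothetical channel name
    {"type": "thumbnail.request", "image_id": 42},
)

# Consumer: take the next message off that channel.
message = async_to_sync(channel_layer.receive)("thumbnails")
print(message["image_id"])
```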

SLIDE 47

Failure Mode

At most once: messages either do not arrive, or arrive once
At least once: messages arrive once, or arrive multiple times
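
With at-least-once delivery the consumer has to tolerate duplicates; a common way to do that (an illustrative sketch, not something from the talk) is to deduplicate on a message ID:

```python
# Idempotent consumer: at-least-once delivery may hand us the same message twice,
# so remember which IDs have already been processed.
processed_ids = set()  # in production this would live in shared storage such as Redis

def handle(message):
    msg_id = message["id"]                # assumes producers attach a unique ID
    if msg_id in processed_ids:
        return                            # duplicate delivery: already handled
    print("processing", message["body"])  # the real (non-repeatable) work goes here
    processed_ids.add(msg_id)

handle({"id": "abc", "body": "issue ticket"})
handle({"id": "abc", "body": "issue ticket"})  # duplicate: second call is a no-op
```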

SLIDE 48

Guarantees vs. Latency

Low latency: messages arrive very quickly but go missing more
Low loss rate: messages are almost never lost but arrive slower

SLIDE 49

Queuing Type

First In First Out: consistent performance for all users
First In Last Out: hides backlogs but makes them worse

SLIDE 50

Queue Sizing

Finite queues: sending can fail
Infinite queues: make problems even worse
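
With a finite queue, the sender has to be prepared for the send to fail; in Channels that shows up as a ChannelFull exception (a sketch; the channel name and the drop-on-full policy are assumptions):

```python
from asgiref.sync import async_to_sync
from channels.exceptions import ChannelFull
from channels.layers import get_channel_layer

channel_layer = get_channel_layer()

def send_or_drop(channel, message):
    """Try to enqueue; if the finite queue is full, degrade instead of blocking."""
    try:
        async_to_sync(channel_layer.send)(channel, message)
        return True
    except ChannelFull:
        # Queue is at capacity: log, retry later, or do the work inline -
        # whatever failure mode the product can actually tolerate.
        return False

send_or_drop("thumbnails", {"type": "thumbnail.request", "image_id": 42})
```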

SLIDE 51

You must understand what you are making

(This is surprisingly uncommon)

SLIDE 52

Design as much as possible around shared-nothing

SLIDE 53

Per-machine caches
On-demand thumbnailing
Signed cookie sessions
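
Two of these are plain Django settings; a minimal shared-nothing sketch using standard backends (shown only as an illustration):

```python
# settings.py (excerpt)

# Per-machine cache: each server keeps its own in-process cache,
# so there is no shared cache cluster to coordinate.
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
    }
}

# Signed cookie sessions: session data lives in the user's cookie,
# signed with SECRET_KEY, so no shared session store is needed.
SESSION_ENGINE = "django.contrib.sessions.backends.signed_cookies"
```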

SLIDE 54

Has to be shared? Try to split it

SLIDE 55

Has to be shared? Try sharding it.

SLIDE 56

Django's job is to be slowly replaced by your code

SLIDE 57

Just make sure you match the API contract of what you're replacing!
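
For example, swapping in your own cache means honouring the contract of Django's BaseCache; a toy, dict-backed sketch (illustrative only, stand in your real storage):

```python
from django.core.cache.backends.base import DEFAULT_TIMEOUT, BaseCache

class MyShardedCache(BaseCache):
    """Toy replacement for a Django cache backend: same API, your own storage."""

    def __init__(self, location, params):
        super().__init__(params)
        self._store = {}  # stand-in for real sharded/remote storage

    def get(self, key, default=None, version=None):
        return self._store.get(self.make_key(key, version), default)

    def set(self, key, value, timeout=DEFAULT_TIMEOUT, version=None):
        self._store[self.make_key(key, version)] = value

    def delete(self, key, version=None):
        return self._store.pop(self.make_key(key, version), None) is not None

    def add(self, key, value, timeout=DEFAULT_TIMEOUT, version=None):
        k = self.make_key(key, version)
        if k in self._store:
            return False
        self._store[k] = value
        return True
```

Pointing CACHES["default"]["BACKEND"] at this class's dotted path drops it in without the rest of the code noticing.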

SLIDE 58

Don't try to scale too early; you'll pick the wrong tradeoffs.

SLIDE 59

Thanks.

Andrew Godwin

@andrewgodwin

channels.readthedocs.io