Jeremy Edberg, QconSF 2012: Building a Reliable Data Store


SLIDE 1

Tweet @jedberg with feedback!

Jeremy Edberg

QconSF 2012

SLIDE 2

SLIDE 3

Building a Reliable Data Store

SLIDE 4

Agenda

  • CAP theory and how it applies to reliability
  • How reddit and Netflix maintain reliable data stores
  • Best Practices
  • War stories -- surviving real outages
SLIDE 5

CAP Theorem

  • Consistent
  • Available
  • Partition-tolerant
SLIDE 6

ATM

?

SLIDE 7

ATM

AP: limits liability by allowing only small transactions

SLIDE 8

Flight Reservations

?

SLIDE 9

Flight Reservations

AP: This is why overbooking occurs
SLIDE 10

SLIDE 11

The problem with CAP

  • Daniel Abadi had a problem with CAP
  • The weightings were uneven
  • A is essential in all scenarios
  • C is more important than P
  • Latency wasn’t accounted for at all
SLIDE 12

PACELC

If there is a partition (P), how does the system trade off between availability and consistency (A and C); else (E), when the system is running as normal in the absence of partitions, how does the system trade off between latency (L) and consistency (C)?

SLIDE 13

Partitioning

SLIDE 14

Thinking like a coder

Partitions are like code branches

SLIDE 15

Some examples

  • ACID systems (Postgres, Oracle, MySQL, etc.) are PC/EC
  • Cassandra is PA/EL
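
A PACELC label is two independent choices written side by side. As a minimal sketch, assuming a hypothetical `Pacelc` record that is not part of the talk, the two classifications on this slide could be encoded like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pacelc:
    """Hypothetical record: one choice under partition, one otherwise."""
    on_partition: str  # "A" (availability) or "C" (consistency)
    otherwise: str     # "L" (latency) or "C" (consistency)

    def label(self) -> str:
        return f"P{self.on_partition}/E{self.otherwise}"


# The two classifications given on the slide.
SYSTEMS = {
    "Postgres": Pacelc(on_partition="C", otherwise="C"),
    "Cassandra": Pacelc(on_partition="A", otherwise="L"),
}

for name, choice in SYSTEMS.items():
    print(name, choice.label())
```
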
SLIDE 16

SLIDE 17

Reliability and $$

SLIDE 18

Building for redundancy

SLIDE 19

We want to make sure we are building for survival

SLIDE 20

1 > 2 > 3

Going from two to three is hard

SLIDE 21

1 > 2 > 3

Going from one to two is harder

SLIDE 22

Build for Three

If possible, plan for 3 or more from the beginning.

SLIDE 23

“Build for three” is the secret to success

SLIDE 24

SLIDE 25

reddit

SLIDE 26

Architecture

SLIDE 27

Postgres

SLIDE 28

Database Resiliency with Sharding

SLIDE 29

Sharding

  • reddit split writes across four master databases
  • Links/Accounts/Subreddits, Comments, Votes and Misc
  • Each has at least one slave in another zone
  • Avoid reading from the master if possible
  • Wrote their own database access layer, called the “thing” layer

SLIDE 30

Sample Schema

link_thing
  int id
  timestamp date
  int ups
  int downs
  bool deleted
  bool spam

link_data
  int thing_id
  string name
  string value
  char kind

SLIDE 31

The thing layer

  • Postgres is used like a key/value store
  • Thing table has denormalized data
  • Data table has arbitrary keys
  • Lots of indexes tuned for our specific queries
  • Thing and data tables are on the same box, but don’t have to be
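
The thing/data split can be sketched in a few lines. This is an illustration only, not reddit's actual code: it uses Python's built-in `sqlite3` in place of Postgres, borrows the `link_thing` and `link_data` names from the sample schema, and the helpers `save_link` and `load_link` are hypothetical. Storing the Python type name in `kind` is also an assumption about what that column holds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE link_thing (
        id      INTEGER PRIMARY KEY,
        date    TEXT,
        ups     INTEGER DEFAULT 0,
        downs   INTEGER DEFAULT 0,
        deleted INTEGER DEFAULT 0,
        spam    INTEGER DEFAULT 0
    );
    CREATE TABLE link_data (
        thing_id INTEGER,
        name     TEXT,
        value    TEXT,
        kind     TEXT
    );
    CREATE INDEX link_data_thing_id ON link_data (thing_id);
""")


def save_link(conn, date, **attrs):
    """Insert one thing row plus one data row per arbitrary attribute."""
    cur = conn.execute("INSERT INTO link_thing (date) VALUES (?)", (date,))
    thing_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO link_data (thing_id, name, value, kind) VALUES (?, ?, ?, ?)",
        [(thing_id, k, str(v), type(v).__name__) for k, v in attrs.items()],
    )
    return thing_id


def load_link(conn, thing_id):
    """Reassemble a link as a dict from both tables."""
    ups, downs = conn.execute(
        "SELECT ups, downs FROM link_thing WHERE id = ?", (thing_id,)
    ).fetchone()
    data = dict(conn.execute(
        "SELECT name, value FROM link_data WHERE thing_id = ?", (thing_id,)
    ))
    return {"id": thing_id, "ups": ups, "downs": downs, **data}


link_id = save_link(conn, "2012-11-08", title="Hello", url="http://example.com")
print(load_link(conn, link_id))
```

New attributes need no schema change: they become extra rows in the data table, which is what lets a relational database behave like a key/value store.
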

SLIDE 32

I love memcache

I make heavy use of memcached

SLIDE 33

A B C 3 2 1

SLIDE 34

A B C 3 2 1 D

SLIDE 35

Cassandra

SLIDE 36

SLIDE 37

Netflix

SLIDE 38

Data

What does Netflix do with it all?

SLIDE 39

We store it!

  • Cache (memcached)
  • Cassandra
  • RDS (MySQL)
SLIDE 40

I love memcache

I make heavy use of memcached

SLIDE 41

RDS (Relational Database Service)

SLIDE 42

Cassandra

SLIDE 43

A/B Testing

SLIDE 44

A/B Testing

Online Data
  • Test Cell allocation
  • Test Metadata
  • Start/End date
  • UI Directives

Offline Data
  • Test tracking
  • Retention
  • Fraction Viewed
  • Pages Viewed

SLIDE 45

Atlas

SLIDE 46

AWS Usage

Dollar amounts have been carefully removed

SLIDE 47

Chronos

SLIDE 48

More Things Netflix Stores in Cassandra

  • Video Quality
  • Network issues
  • Usage History
  • Playback Errors
SLIDE 49

Service based architecture

SLIDE 50

Netflix on AWS

[Diagram; surviving labels: 2012, IPv6]

SLIDE 51

Abstraction

  • Data sources are abstracted away behind RESTful interfaces
  • Each application owns its own consistency
  • Each application can scale independently based on load

SLIDE 52

Netflix autoscaling

[Chart; surviving labels: Traffic, Peak]

SLIDE 53

The Big Oracle Database

SLIDE 54

Circuit Breakers

Be liberal in what you accept, strict in what you send

SLIDE 55

Cassandra

SLIDE 56

Priam

SLIDE 57

Cassandra Architecture

SLIDE 58

Cassandra Architecture

SLIDE 59

How it works

  • Replication factor
  • Quorum reads / writes
  • Bloom Filter for fast negative lookups
  • Immutable files for fast writes
  • Seed nodes
  • Multi-region
  • Gossip protocol
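
Replication factor and quorum reads/writes fit together through simple arithmetic: with N replicas, a read that contacts R of them is guaranteed to overlap a write that reached W of them whenever R + W > N. The function names below are illustrative and not part of Cassandra's API; this is only a sketch of that arithmetic.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    """True when every read set must overlap every write set."""
    return r + w > n


n = 3
q = quorum(n)
print(q, reads_see_latest_write(n, r=q, w=q))
print(reads_see_latest_write(n, r=1, w=1))
```

With a replication factor of 3, quorum reads and quorum writes each touch 2 replicas, so they always share at least one. Reading and writing a single replica gives lower latency but no such guarantee.
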
SLIDE 60

Cassandra Benefits

  • Fast writes
  • Fast negative lookups
  • Easy incremental scalability
  • Distributed -- No SPoF
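
Fast negative lookups come from the Bloom filter: it can say "definitely not here" without touching disk, at the cost of occasional false positives. The toy filter below is a sketch of the general technique, not Cassandra's implementation; the class name, sizes and hashing scheme are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive several bit positions per key by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means the key was never added; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
bf.add("row-1")
print(bf.might_contain("row-1"))
```

A `False` answer lets the read path skip a file entirely; only a `True` answer requires the real lookup.
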
SLIDE 61

Why Cassandra?

  • Availability over consistency
  • Writes over reads
  • We know Java
  • Open source + support
SLIDE 62

SLIDE 63

We live in an unreliable world

SLIDE 64

SLIDE 65

SLIDE 66

SLIDE 67

Tips and Tricks

SLIDE 68

Queues are your friend

  • Votes
  • Comments
  • Thumbnail scraper
  • Precomputed queries
  • Spam
      • processing
      • corrections
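
The idea behind queueing work such as votes is that the request handler only enqueues, and a separate worker applies the change later. The sketch below uses Python's standard `queue` and `threading` modules in a single process; a real deployment would use a message broker, and the names `votes`, `totals` and `worker` are hypothetical.

```python
import queue
import threading

votes = queue.Queue()
totals = {}


def worker():
    """Drain the queue and apply each vote; None is the shutdown signal."""
    while True:
        item = votes.get()
        if item is None:
            votes.task_done()
            break
        link_id, direction = item
        totals[link_id] = totals.get(link_id, 0) + direction
        votes.task_done()


t = threading.Thread(target=worker)
t.start()

# The "web request" side: enqueue and return immediately.
for vote in [("link-1", 1), ("link-1", 1), ("link-2", -1)]:
    votes.put(vote)

votes.put(None)
votes.join()
t.join()
print(totals)
```

If the worker falls behind or the database is briefly unavailable, votes wait in the queue instead of failing the user's request.
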
SLIDE 69

Caching is a good way to hide your failures
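
One way to read this slide: if the backend fails, keep serving the last value you saw. The sketch below is a hypothetical "stale on error" wrapper, not something shown in the talk; `FallbackCache`, `fetch` and `state` are made-up names, and the backend is simulated.

```python
class FallbackCache:
    """Remember the last good value per key and serve it when fetch fails."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._store = {}

    def get(self, key):
        try:
            value = self._fetch(key)
        except Exception:
            if key in self._store:
                return self._store[key]  # hide the failure with stale data
            raise  # nothing cached, so the failure is visible
        self._store[key] = value
        return value


state = {"up": True}  # simulated backend health


def fetch(key):
    if not state["up"]:
        raise ConnectionError("backend down")
    return key.upper()


cache = FallbackCache(fetch)
print(cache.get("front-page"))
state["up"] = False
print(cache.get("front-page"))
```

The trade-off is that users may see stale data during an outage.
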

SLIDE 70

Sometimes users notice your data inconsistency

SLIDE 71

A B C 3 2 1 D

+

EVCache

SLIDE 72

Do you even need a cache?

SLIDE 73

Think of SSDs as cheap RAM, not expensive disk

SLIDE 74

Going multi-zone or multi-datacenter

SLIDE 75

Benefits of Amazon’s Zones

  • Loosely connected
  • Low latency between zones
  • 99.95% uptime guarantee per zone
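
To see why spreading across zones helps, here is a rough calculation. It treats the 99.95% figure as the probability that a zone is up and assumes zone failures are independent; both are simplifying assumptions, and real zone failures can be correlated.

```python
def combined_availability(per_zone: float, zones: int) -> float:
    """Probability that at least one of `zones` independent zones is up."""
    return 1 - (1 - per_zone) ** zones


for zones in (1, 2, 3):
    print(zones, f"{combined_availability(0.9995, zones):.10f}")
```

Under these assumptions each extra zone multiplies the chance of total unavailability by 0.0005, so the benefit depends on the application actually being able to run from any single zone.
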
SLIDE 76

Going Multi-region

SLIDE 77

Leveraging Multi-region

  • 100% uptime is theoretically possible.
  • You have to replicate your data
  • This will cost money
SLIDE 78

Other options

  • Backup datacenter
  • Backup provider
SLIDE 79

Cause chaos

SLIDE 80

The Monkey Theory

  • Simulate things that go wrong
  • Find things that are different
SLIDE 81

The simian army

  • Chaos -- Kills random instances
  • Latency -- Slows the network down
  • Conformity -- Looks for outliers
  • Doctor -- Looks for passing health checks
  • Janitor -- Cleans up unused resources
  • Howler -- Yells about bad things
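
The Chaos entry can be pictured with a toy simulation: remove a random instance from a fleet, then check whether the service still meets its minimum. This is not Netflix's Simian Army code; the functions and instance IDs below are hypothetical, and nothing here talks to a real cloud API.

```python
import random


def chaos_monkey(instances, rng):
    """Terminate one randomly chosen instance and return its name."""
    victim = rng.choice(sorted(instances))
    instances.remove(victim)
    return victim


def service_is_up(instances, minimum=1):
    """The service survives if enough instances remain."""
    return len(instances) >= minimum


fleet = {"i-a", "i-b", "i-c"}  # hypothetical instance IDs
rng = random.Random(0)         # seeded so runs are repeatable
killed = chaos_monkey(fleet, rng)
print(killed, service_is_up(fleet))
```

A fleet built for three survives losing one instance; a fleet of one would not, which is the kind of weakness this exercise is meant to expose early.
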

SLIDE 82

The Chaos Gorilla

SLIDE 83

Automate all the things!

SLIDE 84

Automate all the things!

  • Application startup
  • Configuration
  • Code deployment
  • System deployment
SLIDE 85

Incident Reviews

Ask the key questions:

  • What went wrong?
  • How could we have detected it sooner?
  • How could we have prevented it?
  • How can we prevent this class of problem in the future?
  • How can we improve our behavior for next time?

SLIDE 86

The Netflix way

  • Everything is “built for three”
  • Fully automated build tools to test and make packages

  • Fully automated machine image bakery
  • Fully automated image deployment
SLIDE 87

All systems choices assume some part will fail at some point.

SLIDE 88

Best Practices

  • Keep data in multiple Availability Zones / DCs
  • Avoid keeping state on a single instance
SLIDE 89

Best Practices

  • Isolated Services
  • Three Balanced AZs
  • Triple replicated persistence
  • Isolated Regions
SLIDE 90

Best Practices

  • Don’t trust your dependencies
  • Have good fallbacks
  • Use circuit breakers/dependency commands
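
A circuit breaker combines the three points above: it stops calling a dependency that keeps failing and returns a fallback instead. The class below is a minimal sketch of the general pattern, not Netflix's implementation; the names, thresholds and the injectable `clock` are assumptions.

```python
import time


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: skip the dependency entirely
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


def broken_dependency():
    raise RuntimeError("dependency down")


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
for _ in range(3):
    print(breaker.call(broken_dependency, lambda: "fallback value"))
```

After two failures the breaker opens, so the third call returns the fallback without touching the dependency, which gives the failing service room to recover.
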

SLIDE 91

Best Practices

  • Be generous in what you accept and stingy in what you give

SLIDE 92

Best Practices

  • Hope for the best, assume the worst

SLIDE 93

SLIDE 94

War Stories

SLIDE 95

April 2011 EBS outage

SLIDE 96

June 29th Outage

  • Due to a severe storm, power went out in one AZ
  • Netflix did not do well because of a bug in our internal mid-tier load balancer
  • However, Cassandra held up just fine!
SLIDE 97

October 29th Outage

  • EBS degradation in one Zone
  • We did much better this time
  • Cassandra just kept running
  • MySQL did not do as well, but fallbacks kicked in
SLIDE 98

Hurricane Sandy

The outage that never was

SLIDE 99

Just a quick reminder...

(Some of) Netflix is open source: https://netflix.github.com/

SLIDE 100

Another reminder...

reddit is also open source: https://github.com/reddit

Patches are now being accepted!

SLIDE 101

Netflix is hiring

http://jobs.netflix.com/jobs.html

- or -

email talent@netflix.com and tell them jedberg sent you

SLIDE 102

Questions?

SLIDE 103

SLIDE 104

Getting in touch

Email: jedberg@gmail.com
Twitter: @jedberg
Web: www.jedberg.net
Facebook: facebook.com/jedberg
LinkedIn: www.linkedin.com/in/jedberg
reddit: www.reddit.com/user/jedberg