Honey, I shrunk the database! Resilience and recoverability in Cloud Native services

slide-1
SLIDE 1

Honey, I shrunk the database!

Resilience and recoverability in Cloud Native services

JEFFREY FARBER SIDNEY SHEK

slide-2
SLIDE 2
slide-3
SLIDE 3

Cloud infrastructure = reliable services

right?

slide-4
SLIDE 4

SUPER RESILIENT CLOUD-BASED ARCHITECTURE

Canary Progressive Rollouts
Blue-Green Deployments
5 minute backups
Distributed Multi-Region Cassandra Database

slide-5
SLIDE 5

WITH LOTS OF DEPENDENCIES

Reliability Promise = 99.999%
Recovery Time Objective = 1 hour
Recovery Point Objective = 5 minutes

slide-6
SLIDE 6

Until…

slide-7
SLIDE 7

BET YOU DIDN’T SEE THIS COMING (THIS IS A TRUE STORY)

slide-8
SLIDE 8

WE RESTORE…

2 hour old snapshot

slide-9
SLIDE 9

WE RESTORE… BUT WHAT ABOUT OUR DEPENDENCIES?

LOST DATA

WRONG(?) DATA

MAY HAVE CORRECT DATA, BUT HOW TO SYNC?

10X NORMAL LOAD?

slide-10
SLIDE 10

💪 will happen.

Accept and plan for it.

Emergent behaviour

Our systems are complex and unpredictable.

Broad spectrum solutions

Incorporate general recovery methods to handle the unexpected.

slide-13
SLIDE 13

PATTERNS FOR GUARDING AGAINST THE IMPOSSIBLE

  • 1. Event sourcing

Minimize data loss after a restore

  • 2. Easily recoverable replication

Get downstream systems back in sync

  • 3. Local and distributed redundancy

Having fallbacks for fallbacks for fallbacks…

slide-16
SLIDE 16

Event sourcing

Minimize data loss after a restore

slide-17
SLIDE 17

Let’s add recovery here

slide-18
SLIDE 18

Event Sourcing: Goals

Recover from bugs ruining data

Databases can replicate your data across regions. They also replicate your bugs.

Minimize data loss of DB restore (RPO)

Critical applications can’t afford hours of data loss.

Accuracy & Time (RTO)

We need confidence in our restored data, and we need it quickly!

slide-21
SLIDE 21

INSERT INTO table DELETE FROM table SELECT FROM table

Service

Event Sourcing

Goals Generating Write Events Recovery

slide-22
SLIDE 22

INSERT INTO table DELETE FROM table SELECT FROM table

Event Sourcing

Goals Generating

Writes Reads

Write Events Recovery

slide-23
SLIDE 23

INSERT INTO table DELETE FROM table historical command store SELECT FROM table INSERT/APPEND

Event Sourcing

Goals Generating

Writes

Write Events Recovery

slide-24
SLIDE 24

INSERT INTO table DELETE FROM table historical command store SELECT FROM table INSERT/APPEND

Event Sourcing

Goals Generating

Writes Replay events to main database

Write Events Recovery

slide-25
SLIDE 25

INSERT INTO table DELETE FROM table historical command store SELECT FROM table INSERT/APPEND

Event Sourcing

Goals Generating

MAKE SURE THIS IS AN INDEPENDENT STORE!

Writes

Write Events Recovery
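
The diagram above has the service append every write to a historical command store as well as applying it to the main database. A minimal in-memory sketch of that dual write follows. The names `CommandStore` and `WriteService` are hypothetical, and the dict standing in for the main database is an assumption; in practice the command store would be a separate, independent system.

```python
from dataclasses import dataclass, field


@dataclass
class CommandStore:
    """Append-only log of write commands, kept apart from the main database."""
    events: list = field(default_factory=list)

    def append(self, event: dict) -> None:
        # Store a copy; existing entries are never updated or deleted.
        self.events.append(dict(event))


@dataclass
class WriteService:
    command_store: CommandStore
    main_db: dict = field(default_factory=dict)  # stand-in for the main database

    def grant(self, sequence: int, user: str, resource: str, permission: str) -> None:
        event = {
            "sequence": sequence,
            "stream": user,
            "commandType": "grant_permission",
            "params": {"user": user, "resource": resource, "permission": permission},
        }
        # Record the command first so a restored main_db can be caught up later.
        self.command_store.append(event)
        self.main_db.setdefault((user, resource), set()).add(permission)


store = CommandStore()
service = WriteService(store)
service.grant(20, "user123", "issueABC", "view")
```

Reads keep going to the main database only; the command store exists purely so that writes can be replayed.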

slide-26
SLIDE 26

Generating

[
  {
    "sequence": 20,
    "stream": "user123",
    "commandType": "grant_permission",
    "params": { "user": "user123", "resource": "issueABC", "permission": "view" },
    "timestamp": "2019-07-24 3:30 PM",
    "actor": "jira_share_service"
  }
]

Event Sourcing

Goals Write Events Recovery

slide-27
SLIDE 27

Generating

[ { "sequence": 20, "stream": "user123", "commandType": "grant_permission", "params": { "user": "user123", "resource": "issueABC", "permission": "view" }, "timestamp": "2019-07-24 3:30 PM", "actor": "jira_share_service" } ]

Event Sourcing

Order all writes, so we can replay in same order Goals Write Events Recovery

1

slide-28
SLIDE 28

Generating

Event Sourcing

Order all writes, so we can replay in same order Goals Write Events Recovery

STRICTLY ORDERED

Stream    Sequence
user123   19

SET sequence = sequence + 1 WHERE sequence = 19

"sequence": 20

sequences -> 19, 20, 21...

1
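
The conditional update on this slide is a compare-and-set: the sequence only advances if it still holds the value the writer last saw. The sketch below shows the idea with SQLite from the Python standard library. The table name `stream_sequence`, the extra `stream = ?` condition, and the function name are assumptions for illustration, not taken from the talk.

```python
import sqlite3


def claim_next_sequence(conn, stream, expected):
    """Advance the stream's sequence only if it still equals `expected`.

    Returns the new sequence, or None if another writer got there first.
    """
    cur = conn.execute(
        "UPDATE stream_sequence SET sequence = sequence + 1 "
        "WHERE stream = ? AND sequence = ?",
        (stream, expected),
    )
    conn.commit()
    # rowcount is 1 when the WHERE clause matched, 0 when the value had moved on.
    return expected + 1 if cur.rowcount == 1 else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stream_sequence (stream TEXT PRIMARY KEY, sequence INTEGER)")
conn.execute("INSERT INTO stream_sequence VALUES ('user123', 19)")

first = claim_next_sequence(conn, "user123", 19)   # matches, sequence becomes 20
second = claim_next_sequence(conn, "user123", 19)  # stale expectation, rejected
```

A writer that gets `None` back would re-read the current sequence and retry, which is what makes the database a single point of coordination for strictly ordered writes.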

slide-29
SLIDE 29

Generating

Event Sourcing

Order all writes, so we can replay in same order Goals Write Events Recovery

"sequence": 20,

MOSTLY ORDERED

sequence = {timestamp}{unique_node_id}

sequences -> 1565312340, 1565323450, ...

1

slide-30
SLIDE 30

Generating

Event Sourcing

Order all writes, so we can replay in same order Goals Write Events Recovery

"sequence": 20,

MOSTLY ORDERED

sequence = {timestamp}{unique_node_id}

sequences -> 1565312340, 1565323450, ...

  • No SPOF (database)
  • Only certain writes need strict ordering
  • Clock skew window is small (< 1 sec)
  • Don’t know previous sequence

1
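
One way to read `sequence = {timestamp}{unique_node_id}` is as a number whose high digits are the timestamp and whose low digits are the node id. The sketch below assumes millisecond timestamps and fewer than 1000 writer nodes; both choices, and the function name, are illustrative rather than from the talk.

```python
import time

NODE_ID_SLOTS = 1000  # assumption: node ids fit in three decimal digits


def make_sequence(node_id, timestamp_ms=None):
    """Build a mostly ordered sequence: timestamp digits followed by node id digits."""
    if not 0 <= node_id < NODE_ID_SLOTS:
        raise ValueError("node_id out of range")
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    return timestamp_ms * NODE_ID_SLOTS + node_id


a = make_sequence(node_id=7, timestamp_ms=1565312340000)
b = make_sequence(node_id=3, timestamp_ms=1565312340001)
c = make_sequence(node_id=3, timestamp_ms=1565312340000)
```

No shared counter is involved, so there is no database round trip per write. Ordering across nodes is only as good as their clocks, and two nodes writing in the same millisecond are ordered by node id rather than by real arrival order.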

slide-31
SLIDE 31

timestamp2 > timestamp1

Generating

Event Sourcing

Order all writes, so we can replay in same order Goals Write Events Recovery

"sequence": 20,

MOSTLY ORDERED + CUSTOMER-DICTATED STRICT ORDERING

/create

write1 {sequence/timestamp1}

/delete

write2 ?after={timestamp1} {sequence/timestamp2}

1
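
For the few writes that need strict ordering, the caller passes the sequence of the earlier write in `after`, and the service must hand out a later one. The sketch below rejects a write whose candidate sequence is not greater than `after`, so the caller can retry. Rejecting (rather than waiting or bumping the value) is an assumption, as are the names `assign_sequence` and `OrderingConflict`.

```python
class OrderingConflict(Exception):
    """Raised when a write cannot be ordered after the write it depends on."""


def assign_sequence(candidate, after=None):
    """Return `candidate` as the write's sequence if it respects `after`.

    candidate: the sequence/timestamp this node would assign right now.
    after:     the sequence returned for the earlier write, if the caller sent one.
    """
    if after is not None and candidate <= after:
        raise OrderingConflict(f"{candidate} is not after {after}")
    return candidate


write1 = assign_sequence(1565312340)                # /create
write2 = assign_sequence(1565312341, after=write1)  # /delete?after={timestamp1}

try:
    # A node whose clock is behind would produce an earlier candidate.
    assign_sequence(1565312339, after=write1)
    skewed_rejected = False
except OrderingConflict:
    skewed_rejected = True
```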

slide-32
SLIDE 32

Generating

[ { "sequence": 20, "stream": "user123", "commandType": "grant_permission", "params": { "user": "user123", "resource": "issueABC", "permission": "view" }, "timestamp": "2019-07-24 3:30 PM", "actor": "jira_share_service" } ]

Event Sourcing

Streams guarantee order Parallelize across streams Goals Write Events Recovery

2

slide-33
SLIDE 33

Generating

[ { "sequence": 20, "stream": "user123", "commandType": "grant_permission", "params": { "user": "user123", "resource": "issueABC", "permission": "view" }, "timestamp": "2019-07-24 3:30 PM", "actor": "jira_share_service" } ]

Event Sourcing

Goals Write Events Recovery

user123

{sequence: 20, commandType: "grant", permission: "view"}
{sequence: 21, commandType: "revoke", permission: "view"}

Streams guarantee order Parallelize across streams

2

slide-34
SLIDE 34

Generating

[ { "sequence": 20, "stream": "user123", "commandType": "grant_permission", "params": { "user": "user123", "resource": "issueABC", "permission": "view" }, "timestamp": "2019-07-24 3:30 PM", "actor": "jira_share_service" } ]

Event Sourcing

2

Goals Write Events Recovery

user456

{sequence: 74, commandType: "grant", permission: "edit"}
{sequence: 75, commandType: "revoke", permission: "edit"}

user123

{sequence: 20, commandType: "grant", permission: "view"}
{sequence: 21, commandType: "revoke", permission: "view"}

Streams guarantee order Parallelize across streams

slide-35
SLIDE 35

Event Sourcing

Generating Goals Write Events Recovery

1

Restore snapshot

slide-36
SLIDE 36

Event Sourcing

Generating Goals Write Events Recovery

1 2

Recover streams in parallel

user123 user456

21, 20 19

Restore snapshot

73 ..., 76, 75, 74

slide-37
SLIDE 37

Event Sourcing

Generating Goals Write Events Recovery

1 2

Recover streams in parallel Bonus: process all stream events in-memory

user123 user456

21, 20 19

Restore snapshot

73 ..., 76, 75, 74
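
Because order only matters within a stream, recovery can replay each stream in sequence order while different streams run side by side. The sketch below uses a thread pool from the standard library. The function names and the `replay_stream` callback are hypothetical, and the callback here just returns the sequences it saw.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_stream(events):
    """Bucket events per stream and sort each bucket by sequence."""
    streams = defaultdict(list)
    for event in events:
        streams[event["stream"]].append(event)
    for stream_events in streams.values():
        stream_events.sort(key=lambda e: e["sequence"])
    return streams


def recover_in_parallel(events, replay_stream, max_workers=4):
    """Replay each stream's events in order; streams run concurrently."""
    streams = group_by_stream(events)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            stream: pool.submit(replay_stream, stream, stream_events)
            for stream, stream_events in streams.items()
        }
        return {stream: future.result() for stream, future in futures.items()}


events = [
    {"stream": "user456", "sequence": 75},
    {"stream": "user123", "sequence": 21},
    {"stream": "user456", "sequence": 74},
    {"stream": "user123", "sequence": 20},
]
result = recover_in_parallel(events, lambda stream, evs: [e["sequence"] for e in evs])
```

Each worker holds one stream's events in memory, which is what allows the "process all stream events in-memory" bonus mentioned on the slide.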

slide-38
SLIDE 38

Event Sourcing

Events for stream "user123"

[ { "stream": "user123", "sequence": 20, "commandType": "grant_permission", "params": { "user": "user123", "resource": "issueABC", "permission": "view" }, ... },
  { "stream": "user123", "sequence": 21, "commandType": "grant_permission", "params": { "user": "user123",

Main Datastore

Stream    Sequence
user123   19

User      Resource   Permissions
user123   issueABC   []

slide-39
SLIDE 39

Event Sourcing

Events for stream "user123"

[ { "stream": "user123", "sequence": 20, "commandType": "grant_permission", "params": { "user": "user123", "resource": "issueABC", "permission": "view" }, ... },
  { "stream": "user123", "sequence": 21, "commandType": "grant_permission", "params": { "user": "user123",

Main Datastore

Stream    Sequence
user123   20

User      Resource   Permissions
user123   issueABC   ["view"]

slide-40
SLIDE 40

Event Sourcing

    "user": "user123", "resource": "issueABC", "permission": "view" }, ... },
  { "stream": "user123", "sequence": 21, "commandType": "grant_permission", "params": { "user": "user123", "resource": "issueABC", "permission": "edit" }, ... } ]

Main Datastore

Stream    Sequence
user123   20

User      Resource   Permissions
user123   issueABC   ["view"]

slide-41
SLIDE 41

Event Sourcing

    "user": "user123", "resource": "issueABC", "permission": "view" }, ... },
  { "stream": "user123", "sequence": 21, "commandType": "grant_permission", "params": { "user": "user123", "resource": "issueABC", "permission": "edit" }, ... } ]

Main Datastore

Stream    Sequence
user123   21

User      Resource   Permissions
user123   issueABC   ["view", "edit"]
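
The walkthrough above (restored sequence 19, then events 20 and 21 applied in turn) can be sketched as a replay loop that skips anything the snapshot already contains. The function name `replay` and the `revoke_permission` command type are assumptions; the two dicts stand in for the Stream/Sequence and User/Resource/Permissions tables.

```python
def replay(events, sequences, permissions):
    """Apply events newer than the restored sequence, updating both tables in place."""
    for event in sorted(events, key=lambda e: e["sequence"]):
        stream = event["stream"]
        if event["sequence"] <= sequences.get(stream, 0):
            continue  # already reflected in the restored data
        params = event["params"]
        granted = permissions.setdefault((params["user"], params["resource"]), [])
        if event["commandType"] == "grant_permission":
            if params["permission"] not in granted:
                granted.append(params["permission"])
        elif event["commandType"] == "revoke_permission":  # hypothetical command name
            if params["permission"] in granted:
                granted.remove(params["permission"])
        sequences[stream] = event["sequence"]


def grant_event(sequence, permission):
    return {
        "stream": "user123",
        "sequence": sequence,
        "commandType": "grant_permission",
        "params": {"user": "user123", "resource": "issueABC", "permission": permission},
    }


sequences = {"user123": 19}                      # state after restoring the snapshot
permissions = {("user123", "issueABC"): []}
events = [grant_event(20, "view"), grant_event(21, "edit")]

replay(events, sequences, permissions)
replay(events, sequences, permissions)  # running it again changes nothing
```

Tracking the last applied sequence per stream is what makes a second run harmless, so an interrupted recovery can simply be restarted.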

slide-42
SLIDE 42

THE DATABASE JUST LOST HALF ITS DATA!

ALERT - INCOMING PAGE!

slide-43
SLIDE 43

INCIDENT RECOVERY

1. Stop new writes (depends on architecture)

write API -> error

slide-44
SLIDE 44

INCIDENT RECOVERY

1. Stop new writes (depends on architecture)

write API -> error

2. Restore from backup

2 hour old snapshot

slide-45
SLIDE 45

INCIDENT RECOVERY

1. Stop new writes (depends on architecture)

write API -> error

2. Restore from backup

2 hour old snapshot

3. Catchup via command store

(a) Command store queryable by time (b) Logs (c) Worst case -> check all streams

slide-46
SLIDE 46

INCIDENT RECOVERY

1. Stop new writes (depends on architecture)

write API -> error

2. Restore from backup

2 hour old snapshot

3. Catchup via command store

(a) Command store queryable by time (b) Logs (c) Worst case -> check all streams

4. Re-enable writes (if applicable)

Go relax!
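
The four recovery steps can be strung together roughly as below. This is a sketch under several assumptions: the command store is a plain list that can be filtered by time (option (a) on the slide), `WriteGate` is a hypothetical switch the write API checks before accepting a request, and the restore and apply steps are passed in as callbacks.

```python
from datetime import datetime


class WriteGate:
    """Switch the write API consults; closed means writes return an error."""

    def __init__(self):
        self.open = True

    def check(self):
        if not self.open:
            raise RuntimeError("writes are disabled during recovery")


def events_since(command_store, snapshot_time):
    """Option (a): the command store is queryable by time."""
    return [e for e in command_store if e["timestamp"] > snapshot_time]


def recover(gate, restore_snapshot, command_store, snapshot_time, apply_event):
    gate.open = False                                     # 1. stop new writes
    restore_snapshot()                                    # 2. restore from backup
    missed = events_since(command_store, snapshot_time)   # 3. catch up via command store
    for event in sorted(missed, key=lambda e: (e["stream"], e["sequence"])):
        apply_event(event)
    gate.open = True                                      # 4. re-enable writes
    return len(missed)


gate = WriteGate()
applied = []
command_store = [
    {"stream": "user123", "sequence": 19, "timestamp": datetime(2019, 7, 24, 13, 0)},
    {"stream": "user123", "sequence": 20, "timestamp": datetime(2019, 7, 24, 15, 30)},
]
snapshot_time = datetime(2019, 7, 24, 13, 30)  # illustrative time of the restored snapshot
caught_up = recover(gate, lambda: None, command_store, snapshot_time, applied.append)
```

Writes are only re-enabled once catch-up finishes; if any step raises, the gate stays closed so an operator can investigate.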

slide-47
SLIDE 47

Easily Recoverable Replication

Get downstream systems back in sync

slide-48
SLIDE 48

Recover this data!

slide-49
SLIDE 49

Our goals

Recover quickly

Minimise downtime of all systems

slide-50
SLIDE 50

Recover correctly

Confidently serve consistent data across systems through idempotency

Our goals

Recover quickly

Minimise downtime of all systems

slide-51
SLIDE 51

Recover correctly

Confidently serve consistent data across systems through idempotency

Our goals

Control the chaos

Avoid self-inflicted denial of services during recovery

Recover quickly

Minimise downtime of all systems

slide-52
SLIDE 52

SENDING EVENTS TO CONSUMERS

INSERT INTO table

Write Service

search

Downstream Services

ecosystem

Events sent to event bus

slide-53
SLIDE 53

Downstream Services Write Service

Operation-Based

user123

{sequence: 22, commandType: "grant", permission: "edit"}
{sequence: 23, commandType: "revoke", permission: "edit"}

  • Can build from internal event sourcing events
  • Requires perfect ordering. Cannot lose a single message
  • Recovery requires replaying from sequence=0

search

slide-54
SLIDE 54

Write Service
Downstream Services: search

Operation-Based

user123

{sequence: 22, commandType: "grant", permission: "edit"}
{sequence: 23, commandType: "revoke", permission: "edit"}

  • Can build from internal event sourcing events
  • Requires perfect ordering. Cannot lose a single message
  • Recovery requires replaying from sequence=0

State-Based

user123

{sequence: 22, permissions: ["edit"]}
{sequence: 23, permissions: []}

  • Each message is independent
  • Recovery requires single message
  • Requires another DB query to produce state
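
To make the difference concrete, the sketch below turns the operation-based events for `user123` into the state-based messages shown on the slide. Folding the operations in memory is an assumption made to keep the example self-contained; the talk describes the write service querying its database to produce the state. The function name is hypothetical.

```python
def to_state_events(operations, initial=()):
    """Fold grant/revoke operations into one full-state message per sequence."""
    permissions = list(initial)
    state_events = []
    for op in sorted(operations, key=lambda o: o["sequence"]):
        if op["commandType"] == "grant" and op["permission"] not in permissions:
            permissions.append(op["permission"])
        elif op["commandType"] == "revoke" and op["permission"] in permissions:
            permissions.remove(op["permission"])
        # Each message carries the complete state, so it stands on its own.
        state_events.append({"sequence": op["sequence"], "permissions": list(permissions)})
    return state_events


operations = [
    {"sequence": 22, "commandType": "grant", "permission": "edit"},
    {"sequence": 23, "commandType": "revoke", "permission": "edit"},
]
state_events = to_state_events(operations)
```

A consumer that missed message 22 still ends up correct after receiving 23, which is why recovery for state-based events needs only the latest message per stream.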

slide-55
SLIDE 55

INSERT INTO table

Write Service Downstream Services

search ecosystem

Operation-Based State-Based

Query database to generate state-based event

Events sent to event bus

slide-56
SLIDE 56

INSERT INTO table

Write Service Downstream Services

search ecosystem

Operation-Based State-Based

Events sent to event bus

Save outbound event Query database to generate state-based event

slide-57
SLIDE 57

ERROR DETECTION AND RECOVERY

Synchronizing APIs: /resendLatest /latestSequence /allStreamIds

Stream    Sequence
user123   23
user456   75

Downstream Services: search

Stream    Sequence
user123   23
user456   48

slide-58
SLIDE 58

ERROR DETECTION AND RECOVERY

Synchronizing APIs: /resendLatest /latestSequence /allStreamIds

Stream    Sequence
user123   23
user456   75

Downstream Services: search

Stream    Sequence
user123   23
user456   48

/latestSequence?stream=user123 -> sequence: 23

slide-59
SLIDE 59

ERROR DETECTION AND RECOVERY

Synchronizing APIs: /resendLatest /latestSequence /allStreamIds

Stream    Sequence
user123   23
user456   75

Downstream Services: search

Stream    Sequence
user123   23
user456   48

/latestSequence?stream=user456 -> sequence: 75

slide-61
SLIDE 61

ERROR DETECTION AND RECOVERY

Synchronizing APIs: /resendLatest /latestSequence /allStreamIds

Stream    Sequence
user123   23
user456   75

Downstream Services: search

Stream    Sequence
user123   23
user456   48

/resendLatest?stream=user456 -> {sequence: 75, permissions: [latest state]}

slide-62
SLIDE 62

ERROR DETECTION AND RECOVERY

Synchronizing APIs: /resendLatest /latestSequence /allStreamIds

Stream    Sequence
user123   23
user456   75

Downstream Services: search

Stream    Sequence
user123   23
user456   75

/resendLatest?stream=user456 -> {sequence: 75, permissions: [latest state]}

slide-63
SLIDE 63

ERROR DETECTION AND RECOVERY

Synchronizing APIs: /resendLatest /latestSequence /allStreamIds

Stream    Sequence
user123   23
user456   75

Downstream Services: search

Stream    Sequence
user123   23

/allStreamIds -> [user123, user456,..., (paginated)]
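
The three synchronizing endpoints can be pictured as methods over the write service's latest state. The class below is an in-memory sketch: the class and method names, the page-number style of pagination, and the permission values are all assumptions for illustration; only the endpoint names and response shapes come from the slides.

```python
class SyncApi:
    """In-memory stand-in for /latestSequence, /resendLatest and /allStreamIds."""

    def __init__(self, state):
        # state maps stream id -> (latest sequence, latest permissions)
        self._state = state

    def latest_sequence(self, stream):
        return {"sequence": self._state[stream][0]}

    def resend_latest(self, stream):
        sequence, permissions = self._state[stream]
        return {"sequence": sequence, "permissions": list(permissions)}

    def all_stream_ids(self, page=0, page_size=100):
        ids = sorted(self._state)
        start = page * page_size
        return ids[start:start + page_size]


# Sequences match the slides; the permission lists are made up for the example.
api = SyncApi({"user123": (23, []), "user456": (75, ["view"])})

latest = api.latest_sequence("user456")
resent = api.resend_latest("user456")
first_page = api.all_stream_ids(page=0, page_size=1)
second_page = api.all_stream_ids(page=1, page_size=1)
```

`/latestSequence` is the cheap check, `/resendLatest` repairs a stream that is behind, and `/allStreamIds` lets a consumer discover streams it has never seen, as with `user456` missing entirely on the last slide.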

slide-64
SLIDE 64

SEARCH SERVICE HAS BEEN DROPPING MESSAGES FOR 4 HOURS!

ALERT - INCOMING PAGE!

slide-65
SLIDE 65

INCIDENT RECOVERY

1. Determine dropped streams

(a) Command store queryable by time (b) Logs (c) Worst case -> check all streams

slide-66
SLIDE 66

INCIDENT RECOVERY

1. Determine dropped streams

(a) Command store queryable by time (b) Logs (c) Worst case -> check all streams

2. Compare via /latestSequence

  • Depending on (1), could help to improve speed

slide-67
SLIDE 67

INCIDENT RECOVERY

1. Determine dropped streams

(a) Command store queryable by time (b) Logs (c) Worst case -> check all streams

2. Compare via /latestSequence

  • Depending on (1), could help to improve speed

3. Resend state via /resendLatest

  • Messages are idempotent
  • Can occur alongside new messages
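
From the downstream side, the three steps above might look like the sketch below. The two callables stand in for HTTP calls to `/latestSequence` and `/resendLatest`; the function names, the shape of the local store, and the permission values are assumptions.

```python
def apply_state(local, stream, message):
    """Idempotent apply: keep a message only if it is newer than what we hold."""
    current = local.get(stream)
    if current is None or message["sequence"] > current["sequence"]:
        local[stream] = message
        return True
    return False


def reconcile(local, candidate_streams, latest_sequence, resend_latest):
    """Compare each candidate stream with the source and repair the ones behind."""
    repaired = []
    for stream in candidate_streams:
        ours = local.get(stream, {}).get("sequence", -1)
        if latest_sequence(stream)["sequence"] > ours:
            if apply_state(local, stream, resend_latest(stream)):
                repaired.append(stream)
    return repaired


# Source of truth held by the write service (permission values are illustrative).
source = {
    "user123": {"sequence": 23, "permissions": []},
    "user456": {"sequence": 75, "permissions": ["view"]},
}
# The search service's copy: user456 stopped at 48 while messages were dropped.
local = {
    "user123": {"sequence": 23, "permissions": []},
    "user456": {"sequence": 48, "permissions": []},
}

repaired = reconcile(
    local,
    ["user123", "user456"],
    latest_sequence=lambda s: {"sequence": source[s]["sequence"]},
    resend_latest=lambda s: dict(source[s]),
)
stale_applied = apply_state(local, "user456", {"sequence": 48, "permissions": []})
```

Because an older message never overwrites a newer one, the repair can run while live messages keep arriving, in any interleaving.
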
slide-68
SLIDE 68

Local and Distributed Redundancy

Having fallbacks for fallbacks for fallbacks…

slide-69
SLIDE 69

Redundancy?

slide-70
SLIDE 70

Our goals

“Always online” even during incidents

Debugging and data recovery take time.

slide-71
SLIDE 71

Eliminate DB as single point of failure

Replicas don’t guard against data corruption, cluster config errors, etc.

Our goals

“Always online” even during incidents

Debugging and data recovery take time.

slide-72
SLIDE 72

Eliminate DB as single point of failure

Replicas don’t guard against data corruption, cluster config errors, etc.

Our goals

Do better than region failover

Cross-region call latency is always higher than local calls

“Always online” even during incidents

Debugging and data recovery take time.

slide-73
SLIDE 73

REDUNDANT DATA STORES

App layer: serve reads

Store                             Consistency                  Latency            Scale
Main data store, e.g. Cassandra   CONSISTENT                   MODERATE LATENCY   HIGHLY SCALABLE
Cache, e.g. Redis                 CONSISTENT? INVALIDATIONS?   LOW LATENCY        MODERATE SCALE
Event store, e.g. S3              EVENTUALLY CONSISTENT        HIGHER LATENCY     V. HIGHLY SCALABLE

slide-74
SLIDE 74

REDUNDANT DATA STORES

Serve reads

Writes

Main data store e.g. Cassandra
Cache e.g. Redis
Event store e.g. S3

Reads

Invalidate
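
A read path over the three stores could be sketched as follows: try the cache, then the main data store, and fall back to the event store if the main store is unavailable. The class name, the use of `ConnectionError` to signal an outage, and the dicts standing in for Redis, Cassandra and S3 are all assumptions.

```python
class RedundantReader:
    def __init__(self, cache, read_main, read_event_store):
        self.cache = cache                        # dict standing in for e.g. Redis
        self.read_main = read_main                # callable; may raise when the DB is down
        self.read_event_store = read_event_store  # callable; slower but independent

    def read(self, key):
        if key in self.cache:
            return self.cache[key], "cache"
        try:
            value = self.read_main(key)
        except ConnectionError:
            # Main store unavailable: serve a possibly stale value instead of an error.
            return self.read_event_store(key), "event store"
        self.cache[key] = value  # only values from the main store are cached
        return value, "main"

    def invalidate(self, key):
        """Called on writes so the next read goes back to the main store."""
        self.cache.pop(key, None)


def failing_main(key):
    raise ConnectionError("database overloaded")


main = {"user123": ["view"]}
event_store = {"user123": ["view"]}

healthy = RedundantReader({}, main.__getitem__, event_store.__getitem__)
first = healthy.read("user123")
second = healthy.read("user123")
healthy.invalidate("user123")
third = healthy.read("user123")

degraded = RedundantReader({}, failing_main, event_store.__getitem__)
fallback = degraded.read("user123")
```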

slide-75
SLIDE 75

Redundancy!

slide-76
SLIDE 76

DISTRIBUTED RESULTS WITH VALIDATABLE TOKENS Identity Consumer 2 Consumer 1

Check X Token Request + Token Validate token Validate token

slide-77
SLIDE 77

Things to consider

Signed, with the right lifetime

Fast validation algorithm, appropriate lifetime, https://bitbucket.org/atlassian/asap

slide-78
SLIDE 78

Blacklists for invalidation

Low volume, easily distributable and cacheable

Things to consider

Signed, with the right lifetime

Fast validation algorithm, appropriate lifetime, https://bitbucket.org/atlassian/asap

slide-79
SLIDE 79

Blacklists for invalidation

Low volume, easily distributable and cacheable

Things to consider

Centralise logic with sidecars

Easier maintenance, avoid bugs due to inconsistency

Signed, with the right lifetime

Fast validation algorithm, appropriate lifetime, https://bitbucket.org/atlassian/asap
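
The idea of a result that consumers can validate locally, with a lifetime and a small blacklist, can be sketched with the Python standard library. This example signs with HMAC-SHA256 and a shared secret purely to stay self-contained; it is not the ASAP library's format or API, and the function names, claim names and secret are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret"  # assumption: a key shared by issuer and consumers


def _sign(payload: bytes) -> str:
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def issue_token(subject, token_id, lifetime_s, now=None):
    now = time.time() if now is None else now
    payload = json.dumps({"sub": subject, "jti": token_id, "exp": now + lifetime_s}).encode()
    return base64.urlsafe_b64encode(payload).decode() + "." + _sign(payload)


def validate_token(token, blacklist=frozenset(), now=None):
    """Return the claims if the token is intact, unexpired and not blacklisted."""
    now = time.time() if now is None else now
    try:
        encoded, signature = token.split(".")
        payload = base64.urlsafe_b64decode(encoded)
    except ValueError:
        return None
    if not hmac.compare_digest(signature, _sign(payload)):
        return None
    claims = json.loads(payload)
    if claims["exp"] <= now or claims["jti"] in blacklist:
        return None
    return claims


token = issue_token("consumer-1", "t1", lifetime_s=60, now=1000)
valid = validate_token(token, now=1030)
expired = validate_token(token, now=1061)
revoked = validate_token(token, blacklist={"t1"}, now=1030)
tampered = validate_token(token[:-1] + ("0" if token[-1] != "0" else "1"), now=1030)
```

Validation needs no call back to the issuer, so consumers keep working while the identity service is down. The lifetime bounds how long a stale result can circulate, and the blacklist covers the rare early revocation.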

slide-80
SLIDE 80

DATABASE IS OVERLOADED & FAILING ALL QUERIES

ALERT - INCOMING PAGE!

slide-81
SLIDE 81

DATABASE IS OVERLOADED & FAILING QUERIES

Token

Hystrix flips to serve fallback datastore
Hystrix flips to serve primary datastore
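
The flip to the fallback datastore and back is circuit-breaker behaviour. The sketch below is a simplified breaker in plain Python, not Hystrix's API: after a number of consecutive failures it serves the fallback without touching the primary, then probes the primary again once a reset window has passed. The class name, thresholds and injected clock are assumptions.

```python
import time


class CircuitBreaker:
    def __init__(self, primary, fallback, max_failures=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.primary = primary
        self.fallback = fallback
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return self.fallback(*args)  # open: leave the primary alone
            # Reset window elapsed: let this call probe the primary.
        try:
            result = self.primary(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return self.fallback(*args)
        self.failures = 0
        self.opened_at = None  # primary is healthy again: flip back
        return result


now = [0.0]
healthy = [False]
primary_calls = []


def primary(key):
    primary_calls.append(key)
    if not healthy[0]:
        raise ConnectionError("database overloaded")
    return "primary:" + key


breaker = CircuitBreaker(primary, lambda key: "fallback:" + key,
                         max_failures=2, reset_after_s=30.0, clock=lambda: now[0])
during_outage = [breaker.call("k") for _ in range(3)]
calls_during_outage = len(primary_calls)

healthy[0] = True
now[0] = 31.0
after_recovery = breaker.call("k")
```

While the circuit is open, the overloaded database receives no traffic from this caller, which gives it room to recover.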

slide-82
SLIDE 82

DATABASE IS OVERLOADED & FAILING QUERIES

slide-83
SLIDE 83

Looking back… to look forward

slide-84
SLIDE 84

💪 will happen.

Accept and plan for it.

Emergent behaviour

Our systems are complex and unpredictable.

Broad spectrum solutions

Incorporate general recovery methods to handle the unexpected.

slide-87
SLIDE 87

PATTERNS FOR GUARDING AGAINST THE IMPOSSIBLE

  • 1. Event sourcing

Minimize data loss after a restore

  • 2. Easily recoverable replication

Get downstream systems back in sync

  • 3. Local and distributed redundancy

Having fallbacks for fallbacks for fallbacks…

slide-90
SLIDE 90

How do you take on the impossible?

slide-91
SLIDE 91

Thank you!

JEFFREY FARBER SIDNEY SHEK

slide-92
SLIDE 92

Rate today’s session

Session page on conference website O’Reilly Events App
