Honey, I shrunk the database!
Resilience and recoverability in Cloud Native services
JEFFREY FARBER SIDNEY SHEK
Cloud infrastructure = reliable services, right?
SUPER RESILIENT CLOUD-BASED ARCHITECTURE
Canary Progressive Rollouts
Blue-Green Deployments
5 minute backups
Distributed Multi-Region Cassandra Database
WITH LOTS OF DEPENDENCIES
Reliability Promise = 99.999%
Recovery Time Objective = 1 hour
Recovery Point Objective = 5 minutes
BET YOU DIDN’T SEE THIS COMING (THIS IS A TRUE STORY)
WE RESTORE…
2 hour old snapshot
… BUT WHAT ABOUT OUR DEPENDENCIES?
CORRECT DATA, BUT HOW TO SYNC?
NORMAL LOAD?
Accept and plan for it.
Our systems are complex and unpredictable.
Incorporate general recovery methods to handle the unexpected.
PATTERNS FOR GUARDING AGAINST THE IMPOSSIBLE
Having fallbacks for fallbacks for fallbacks…
Minimize data loss after a restore
Get downstream systems back in sync
Minimize data loss after a restore
Let’s add recovery here
Databases can replicate your data across regions. They also replicate your bugs.
Critical applications can’t afford hours of data loss.
We need confidence in our restored data, and we need it quickly!
Generating Write Events
Service
Writes: INSERT INTO table, DELETE FROM table
Reads: SELECT FROM table
Writes also INSERT/APPEND to a historical command store
Recovery: replay events to main database
MAKE SURE THIS IS AN INDEPENDENT STORE!
[
  {
    "sequence": 20,
    "stream": "user123",
    "commandType": "grant_permission",
    "params": {
      "user": "user123",
      "resource": "issueABC",
      "permission": "view"
    },
    "timestamp": "2019-07-24 3:30 PM",
    "actor": "jira_share_service"
  }
]
1. Order all writes, so we can replay in same order
STRICTLY ORDERED
Stream    Sequence
user123   19
SET sequence = sequence + 1 WHERE sequence = 19
"sequence": 20
sequences -> 19, 20, 21...
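The conditional update above can be sketched as a compare-and-set on a per-stream counter. This is a minimal sketch, not the speakers' implementation: it uses SQLite from the Python standard library in place of the real database, and the table name `stream_sequences` and the function `next_sequence` are hypothetical.

```python
import sqlite3

def next_sequence(conn, stream, expected):
    """Advance the stream's sequence from `expected` to `expected + 1`.

    Returns the new sequence, or None if another writer already moved it.
    """
    cur = conn.execute(
        "UPDATE stream_sequences SET sequence = sequence + 1 "
        "WHERE stream = ? AND sequence = ?",
        (stream, expected),
    )
    conn.commit()
    # rowcount is 1 only when the WHERE clause still matched `expected`.
    return expected + 1 if cur.rowcount == 1 else None

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stream_sequences (stream TEXT PRIMARY KEY, sequence INTEGER)"
)
conn.execute("INSERT INTO stream_sequences VALUES ('user123', 19)")

first = next_sequence(conn, "user123", 19)   # wins the race: 19 -> 20
second = next_sequence(conn, "user123", 19)  # loses: the row is already at 20
```

A writer that gets `None` back would re-read the current sequence and retry. That retry loop against a single row is the price of strict ordering.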
MOSTLY ORDERED
sequence = {timestamp}{unique_node_id}
sequences -> 1565312340, 1565323450, ...
No SPOF (database)
Only certain writes need strict ordering
Clock skew window is small (< 1 sec)
Don’t know previous sequence
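One way to build a `{timestamp}{unique_node_id}` sequence without a shared counter is to pack both parts into a single integer. The sketch below is an assumption about how this could be done, not the format used in the talk: the bit layout, `NODE_BITS`, and the `SequenceGenerator` class are all hypothetical.

```python
import time

NODE_BITS = 10  # assumption: at most 1024 writer nodes

class SequenceGenerator:
    """Generates sequences of the form {timestamp_ms}{node_id} as one integer."""

    def __init__(self, node_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= node_id < (1 << NODE_BITS):
            raise ValueError("node_id out of range")
        self.node_id = node_id
        self.clock = clock
        self.last_ts = 0

    def next(self):
        # Never go backwards on this node, even if the wall clock does.
        ts = max(self.clock(), self.last_ts + 1)
        self.last_ts = ts
        # Timestamp in the high bits keeps sequences roughly time-ordered;
        # node id in the low bits keeps them unique across nodes.
        return (ts << NODE_BITS) | self.node_id

node_a = SequenceGenerator(node_id=1)
node_b = SequenceGenerator(node_id=2)
a1, a2, b1 = node_a.next(), node_a.next(), node_b.next()
```

Sequences from one node always increase. Sequences from different nodes are unique but only ordered up to the clock skew between those nodes, which is why this is "mostly" ordered.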
MOSTLY ORDERED + CUSTOMER-DICTATED STRICT ORDERING
/create -> write1 {sequence/timestamp1}
/delete ?after={timestamp1} -> write2 {sequence/timestamp2}
timestamp2 > timestamp1
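The `?after=` parameter lets the caller state which earlier write a new write must follow. A minimal sketch of how a server might honour it is shown below. The function name `assign_sequence` is hypothetical, and bumping the sequence past `after` is an assumption: the slides only state that `timestamp2 > timestamp1` must hold, so a real service could equally reject or delay the write.

```python
def assign_sequence(clock_sequence, after=None):
    """Pick the sequence for a write.

    clock_sequence: the sequence this node would assign from its own clock.
    after: the sequence returned by the write this one depends on, if any.
    """
    if after is not None and clock_sequence <= after:
        # This node's clock is behind the node that handled the earlier
        # write, so move just past it to preserve the caller's order.
        return after + 1
    return clock_sequence

# /create handled by a node whose clock reads 1565312340
write1 = assign_sequence(1565312340)
# /delete?after={write1} handled by a node whose clock is one tick behind
write2 = assign_sequence(1565312339, after=write1)
```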
2. Streams guarantee order
Parallelize across streams
user123
{sequence: 20, commandType: "grant", permission: "view"}
{sequence: 21, commandType: "revoke", permission: "view"}
user456
{sequence: 74, commandType: "grant", permission: "edit"}
{sequence: 75, commandType: "revoke", permission: "edit"}
Recovery
1. Restore snapshot
2. Recover streams in parallel
Bonus: process all stream events in-memory
Stream    Snapshot sequence   Events to replay
user123   19                  20, 21
user456   73                  74, 75, 76, ...
Events for stream "user123"
[
  {
    "stream": "user123",
    "sequence": 20,
    "commandType": "grant_permission",
    "params": { "user": "user123", "resource": "issueABC", "permission": "view" },
    ...
  },
  {
    "stream": "user123",
    "sequence": 21,
    "commandType": "grant_permission",
    "params": { "user": "user123", "resource": "issueABC", "permission": "edit" },
    ...
  }
]
Main Datastore
Restored snapshot: Stream user123, Sequence 19; User user123, Resource issueABC, Permissions []
After sequence 20: Stream user123, Sequence 20; User user123, Resource issueABC, Permissions ["view"]
After sequence 21: Stream user123, Sequence 21; User user123, Resource issueABC, Permissions ["view", "edit"]
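The replay walked through above can be sketched as follows. This is an illustrative sketch under assumptions, not the speakers' code: the in-memory dictionaries stand in for the restored snapshot and the command store, the `revoke_permission` command type and the helper names `replay_stream` and `event` are hypothetical, and a thread pool is just one way to parallelize across streams.

```python
from concurrent.futures import ThreadPoolExecutor

def replay_stream(snapshot_seq, permissions, events):
    """Apply events newer than the snapshot, in sequence order, in memory."""
    perms = list(permissions)
    seq = snapshot_seq
    for ev in sorted(events, key=lambda e: e["sequence"]):
        if ev["sequence"] <= seq:
            continue  # already in the snapshot; skipping keeps replay idempotent
        permission = ev["params"]["permission"]
        if ev["commandType"] == "grant_permission" and permission not in perms:
            perms.append(permission)
        elif ev["commandType"] == "revoke_permission" and permission in perms:
            perms.remove(permission)
        seq = ev["sequence"]
    return seq, perms

def event(sequence, command_type, permission):
    # Helper for the demo data only.
    return {"sequence": sequence, "commandType": command_type,
            "params": {"permission": permission}}

# Restored snapshot: stream -> (last applied sequence, permissions)
snapshot = {"user123": (19, []), "user456": (73, [])}
command_store = {
    "user123": [event(21, "grant_permission", "edit"),
                event(19, "revoke_permission", "view"),
                event(20, "grant_permission", "view")],
    "user456": [event(74, "grant_permission", "edit"),
                event(75, "revoke_permission", "edit")],
}

# Order matters within a stream, not across streams, so each stream
# can be replayed independently.
with ThreadPoolExecutor() as pool:
    futures = {
        stream: pool.submit(replay_stream, seq, perms, command_store.get(stream, []))
        for stream, (seq, perms) in snapshot.items()
    }
    restored = {stream: f.result() for stream, f in futures.items()}
```

Because events at or below the stored sequence are skipped, running the replay a second time leaves the result unchanged.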
ALERT - INCOMING PAGE!
INCIDENT RECOVERY
1. Stop new writes (depends on architecture): write API returns error
2. Restore from backup: 2 hour old snapshot
3. Catchup via command store
(a) Command store queryable by time
(b) Logs
(c) Worst case -> check all streams
4. Re-enable writes (if applicable)
Go relax!
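Option (a) in step 3, a command store queryable by time, can be sketched as a filter that groups recent commands by stream. The sketch is hypothetical throughout: a plain list stands in for the command store, `events_since` is an invented name, and the safety `margin` is an assumption added to cover clock skew around the snapshot time.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def events_since(commands, snapshot_time, margin=timedelta(minutes=5)):
    """Group commands recorded since the snapshot (minus a margin) by stream."""
    cutoff = snapshot_time - margin
    by_stream = defaultdict(list)
    for command in commands:
        if command["timestamp"] >= cutoff:
            by_stream[command["stream"]].append(command)
    for events in by_stream.values():
        events.sort(key=lambda c: c["sequence"])
    return dict(by_stream)

snapshot_time = datetime(2019, 7, 24, 13, 30)
commands = [
    {"stream": "user999", "sequence": 5, "timestamp": datetime(2019, 7, 24, 12, 0)},
    {"stream": "user123", "sequence": 19, "timestamp": datetime(2019, 7, 24, 13, 27)},
    {"stream": "user123", "sequence": 21, "timestamp": datetime(2019, 7, 24, 14, 0)},
    {"stream": "user123", "sequence": 20, "timestamp": datetime(2019, 7, 24, 13, 45)},
    {"stream": "user456", "sequence": 74, "timestamp": datetime(2019, 7, 24, 15, 0)},
]
to_replay = events_since(commands, snapshot_time)
```

Over-fetching with the margin is harmless as long as replay skips events the snapshot already contains.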
Get downstream systems back in sync
Recover this data!
Recover quickly
Minimise downtime of all systems
Recover correctly
Confidently serve consistent data across systems through idempotency
Control the chaos
Avoid self-inflicted denial of service during recovery
SENDING EVENTS TO CONSUMERS
Write Service: INSERT INTO table
Events sent to event bus
Downstream Services: search, ecosystem
Operation-Based (Write Service -> search, stream user123)
{sequence: 22, commandType: "grant", permission: "edit"}
{sequence: 23, commandType: "revoke", permission: "edit"}
Can build from internal event sourcing events
Requires perfect ordering. Cannot lose a single message
Recovery requires replaying from sequence=0
State-Based (Write Service -> search, stream user123)
{sequence: 22, permissions: ["edit"]}
{sequence: 23, permissions: []}
Each message is independent
Recovery requires single message
Requires another DB query to produce state
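The trade-off above can be made concrete with a consumer for state-based events. Because each message carries the full state and a sequence, the consumer can ignore duplicates and late arrivals. This is a minimal sketch: the dictionary standing in for the downstream store and the function name `apply_state_event` are hypothetical.

```python
def apply_state_event(store, stream, event):
    """Apply a state-based event; return True if it changed the store."""
    current = store.get(stream)
    if current is not None and current["sequence"] >= event["sequence"]:
        return False  # stale or duplicate delivery, safe to drop
    store[stream] = {
        "sequence": event["sequence"],
        "permissions": list(event["permissions"]),
    }
    return True

search_store = {}
# Delivered out of order, with a duplicate: 23, then 22, then 23 again.
results = [
    apply_state_event(search_store, "user123", {"sequence": 23, "permissions": []}),
    apply_state_event(search_store, "user123", {"sequence": 22, "permissions": ["edit"]}),
    apply_state_event(search_store, "user123", {"sequence": 23, "permissions": []}),
]
```

An operation-based consumer cannot do this: applying "revoke" before "grant", or missing either one, leaves it with the wrong permissions.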
Write Service
INSERT INTO table
Save outbound event
Query database to generate state-based event
Events sent to event bus (Operation-Based or State-Based)
Downstream Services: search, ecosystem
ERROR DETECTION AND RECOVERY
Synchronizing APIs: /resendLatest, /latestSequence, /allStreamIds
Write Service
Stream    Sequence
user123   23
user456   75
Downstream Services: search
Stream    Sequence
user123   23
user456   48
/latestSequence?stream=user123 -> sequence: 23
/latestSequence?stream=user456 -> sequence: 75
/resendLatest?stream=user456 -> {sequence: 75, permissions: [latest state]}
search is now at sequence 75 for user456
/allStreamIds -> [user123, user456,..., (paginated)], for when search has no row at all for a stream
ALERT - INCOMING PAGE!
INCIDENT RECOVERY
1. Determine dropped streams
(a) Command store queryable by time
(b) Logs
(c) Worst case -> check all streams
2. Compare via /latestSequence
3. Resend state via /resendLatest
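The three steps above can be sketched as a reconciliation loop. In this sketch the callables stand in for HTTP calls to `/latestSequence` and `/resendLatest`, the function `resync` and its parameter names are hypothetical, and plain dictionaries replace the write service and the search consumer.

```python
def resync(stream_ids, latest_sequence, consumer_sequence, resend_latest):
    """Resend state for every stream where the consumer is behind or missing.

    latest_sequence(stream)   -> stands in for /latestSequence?stream=...
    consumer_sequence(stream) -> the consumer's stored sequence, or None
    resend_latest(stream)     -> stands in for /resendLatest?stream=...
    """
    resent = []
    for stream in stream_ids:
        have = consumer_sequence(stream)
        if have is None or have < latest_sequence(stream):
            resend_latest(stream)
            resent.append(stream)
    return resent

write_service = {"user123": 23, "user456": 75, "user789": 4}
search = {"user123": 23, "user456": 48}  # user456 is behind, user789 is missing

def resend(stream):
    # In this sketch the resent state-based event is applied immediately.
    search[stream] = write_service[stream]

resent = resync(
    stream_ids=list(write_service),  # stands in for /allStreamIds
    latest_sequence=write_service.__getitem__,
    consumer_sequence=search.get,
    resend_latest=resend,
)
```

In the worst case (checking all streams) a loop like this can generate a great deal of traffic, so a real one would likely be throttled to avoid the self-inflicted denial of service mentioned in the goals.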
Having fallbacks for fallbacks for fallbacks…
Redundancy?
“Always online” even during incidents
Debugging and data recovery take time.
Eliminate DB as single point of failure
Replicas don’t guard against data corruption, cluster config errors, etc.
Do better than region failover
Cross-region call latency is always higher than local calls
REDUNDANT DATA STORES
App layer: Writes, Reads, Invalidate; Serve reads
Main data store e.g. Cassandra
Cache e.g. Redis
Event store e.g. S3
Diagram labels: CONSISTENT; CONSISTENT? INVALIDATIONS?; EVENTUALLY CONSISTENT; LOW LATENCY; MODERATE LATENCY; HIGHER LATENCY; MODERATE SCALE; SCALABLE; HIGHLY SCALABLE
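Serving reads from redundant stores can be sketched as an ordered chain of readers, where each store is tried only if the previous one fails. The function `read_with_fallbacks`, the stand-in reader functions, and the order cache, then main store, then event store are all assumptions made for illustration. The slides do not state the order.

```python
def read_with_fallbacks(key, stores):
    """Try each (name, read) pair in order; return (name, value) from the first that answers."""
    errors = []
    for name, read in stores:
        try:
            return name, read(key)
        except Exception as exc:  # any failure moves on to the next store
            errors.append((name, exc))
    raise RuntimeError(f"all data stores failed for {key!r}: {errors}")

def cache_read(key):
    raise ConnectionError("cache unavailable")

def main_read(key):
    raise TimeoutError("main data store overloaded")

def event_store_read(key):
    return {"user": key, "permissions": ["view"]}

served_by, value = read_with_fallbacks(
    "user123",
    [("cache", cache_read), ("main", main_read), ("event store", event_store_read)],
)
```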
Redundancy!
DISTRIBUTED RESULTS WITH VALIDATABLE TOKENS
Diagram: Identity answers "Check X" with a Token; the Request + Token then goes to Consumer 1 and Consumer 2, each of which validates the token.
Signed, with the right lifetime
Fast validation algorithm, appropriate lifetime, https://bitbucket.org/atlassian/asap
Blacklists for invalidation
Low volume, easily distributable and cacheable
Centralise logic with sidecars
Easier maintenance, avoid bugs due to inconsistency
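The three properties above (signed, limited lifetime, blacklist for invalidation) can be sketched with the Python standard library. This is a deliberately simplified stand-in, not ASAP or JWT: it signs with a shared-secret HMAC, the token layout is invented, and the names `issue_token`, `validate_token`, and the `jti` claim used for blacklisting are assumptions. A real deployment would more likely use asymmetric signatures so that consumers can validate tokens without being able to mint them.

```python
import base64
import hashlib
import hmac
import json
import time

def issue_token(claims, key, lifetime_s, now=None):
    """Return 'payload.signature' where payload carries the claims and an expiry."""
    now = time.time() if now is None else now
    body = dict(claims, exp=now + lifetime_s)
    payload = base64.urlsafe_b64encode(json.dumps(body, sort_keys=True).encode())
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature

def validate_token(token, key, blacklist=frozenset(), now=None):
    """Return the claims if the token is authentic, unexpired and not blacklisted."""
    now = time.time() if now is None else now
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None  # tampered with, or signed with a different key
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if claims["exp"] <= now or claims.get("jti") in blacklist:
        return None  # expired, or explicitly invalidated
    return claims

key = b"demo-shared-secret"  # hypothetical key, for illustration only
token = issue_token({"sub": "user123", "jti": "token-1"}, key, lifetime_s=60, now=1000)
claims = validate_token(token, key, now=1010)
```

Validation needs no call back to the identity service, which is what lets consumers keep working while that service or its database is down. The short lifetime bounds how long a revoked permission can linger, and the blacklist covers the cases that cannot wait.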
ALERT - INCOMING PAGE!
DATABASE IS OVERLOADED & FAILING QUERIES
Hystrix flips to serve fallback datastore
Hystrix flips to serve primary datastore
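The flip to the fallback datastore and back can be illustrated with a small circuit breaker. Hystrix is a Java library, and the sketch below does not use it or reproduce its API: the `CircuitBreaker` class, its thresholds, and the injected clock are all hypothetical and only show the general pattern.

```python
import time

class CircuitBreaker:
    """Serve from `primary`; after repeated failures, serve from `fallback` for a while."""

    def __init__(self, primary, fallback, max_failures=3,
                 reset_after_s=30.0, clock=time.monotonic):
        self.primary = primary
        self.fallback = fallback
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Open: do not add load to the failing primary.
                return self.fallback(*args)
            # Reset period elapsed: fall through and probe the primary once.
        try:
            result = self.primary(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed probe
            return self.fallback(*args)
        # Primary answered: close the circuit and flip back.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open the overloaded database receives no queries from this caller, which gives it room to recover.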
Accept and plan for it.
Our systems are complex and unpredictable.
Incorporate general recovery methods to handle the unexpected.
PATTERNS FOR GUARDING AGAINST THE IMPOSSIBLE
Having fallbacks for fallbacks for fallbacks…
Minimize data loss after a restore
Get downstream systems back in sync