Honey, I shrunk the database! Resilience and recoverability in Cloud Native services

  1. Honey, I shrunk the database! Resilience and recoverability in Cloud Native services. Jeffrey Farber, Sidney Shek

  2. Cloud infrastructure = reliable services, right?

  3. SUPER RESILIENT CLOUD-BASED ARCHITECTURE: Canary Progressive Rollouts, Distributed Blue-Green Deployments, Multi-Region Cassandra Database, 5-minute backups

  4. WITH LOTS OF DEPENDENCIES: Reliability Promise = 99.999%, Recovery Time Objective = 1 hour, Recovery Point Objective = 5 minutes
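
To put the 99.999% promise on slide 4 in perspective, the arithmetic below converts an availability target into a yearly downtime budget. It is only a rough sketch: the 365-day year and the function name `downtime_budget_minutes` are assumptions made for illustration, not something taken from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year that an availability target allows."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


budget = downtime_budget_minutes(99.999)
print(f"{budget:.2f} minutes per year")  # prints "5.26 minutes per year"
```

Under these assumptions, a single incident that uses the full 1 hour Recovery Time Objective would consume roughly eleven years of that budget.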

  5. Until…

  6. BET YOU DIDN’T SEE THIS COMING (THIS IS A TRUE STORY)

  7. WE RESTORE… a 2-hour-old snapshot

  8. WE RESTORE… BUT WHAT ABOUT OUR DEPENDENCIES? Lost data. Wrong (?) data. May have correct data, but how to sync? 10x normal load?

  9. Big 💪 will happen. Accept and plan for it. Emergent behaviour: our systems are complex and unpredictable. Broad spectrum solutions: incorporate general recovery methods to handle the unexpected.

  12. PATTERNS FOR GUARDING AGAINST THE IMPOSSIBLE: (1) Event sourcing: minimize data loss after a restore. (2) Easily recoverable replication: get downstream systems back in sync. (3) Local and distributed redundancy: having fallbacks for fallbacks for fallbacks…

  15. Event sourcing: minimize data loss after a restore

  16. Let’s add recovery here

  17. Event Sourcing: Goals. Minimize data loss after a DB restore (RPO): critical applications can’t afford hours of data loss. Recover from bugs ruining data: databases can replicate your data across regions, and they also replicate your bugs. Accuracy & Time (RTO): we need confidence in our restored data, and we need it quickly! (Sidebar labels: Goals, Write Events, Generating, Recovery.)

  20. Event Sourcing. A service issues INSERT INTO table, SELECT FROM table, and DELETE FROM table.

  21. Event Sourcing. Writes: INSERT INTO table, DELETE FROM table. Reads: SELECT FROM table.

  22. Event Sourcing. SELECT FROM table. Writes: INSERT INTO table, DELETE FROM table, and INSERT/APPEND to the historical command store.

  23. Event Sourcing. SELECT FROM table. Writes: INSERT INTO table, DELETE FROM table, and INSERT/APPEND to the historical command store. Replay events to main database.

  24. Event Sourcing. SELECT FROM table. Writes: INSERT INTO table, DELETE FROM table, and INSERT/APPEND to the historical command store. MAKE SURE THIS IS AN INDEPENDENT STORE!
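
The write path on slides 22 to 24 can be sketched roughly as follows: every write is appended to a command history as well as applied to the main store, and the main store can later be rebuilt by replaying that history. This is a minimal in-memory illustration, not the speakers' implementation. The class names, the `revoke_permission` command name, and the choice to append to the history before applying the write are all assumptions; in a real system the two stores would be independent systems, as slide 24 insists.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class MainStore:
    """Current state: (user, resource) -> granted permissions."""
    permissions: Dict[Tuple[str, str], Set[str]] = field(default_factory=dict)

    def apply(self, command: dict) -> None:
        params = command["params"]
        key = (params["user"], params["resource"])
        granted = self.permissions.setdefault(key, set())
        if command["commandType"] == "grant_permission":
            granted.add(params["permission"])
        elif command["commandType"] == "revoke_permission":  # hypothetical name
            granted.discard(params["permission"])
        else:
            raise ValueError(f"unknown command: {command['commandType']}")


@dataclass
class CommandStore:
    """Append-only history (stands in for an independent store)."""
    commands: List[dict] = field(default_factory=list)

    def append(self, command: dict) -> None:
        # Keep a private copy so later mutation cannot rewrite history.
        self.commands.append(copy.deepcopy(command))


def handle_write(main: MainStore, history: CommandStore, command: dict) -> None:
    """Record the command, then apply it (ordering is an assumption)."""
    history.append(command)
    main.apply(command)


def replay(history: CommandStore) -> MainStore:
    """Rebuild a main store from nothing but the command history."""
    rebuilt = MainStore()
    for command in history.commands:
        rebuilt.apply(command)
    return rebuilt


main, history = MainStore(), CommandStore()
handle_write(main, history, {
    "sequence": 20,
    "stream": "user123",
    "commandType": "grant_permission",
    "params": {"user": "user123", "resource": "issueABC", "permission": "view"},
})
rebuilt = replay(history)
```

Because `replay` reads only from the history, losing the main store does not lose the writes, which is the reason for keeping the two stores independent.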

  25. Event Sourcing. Example event: [ { "sequence": 20, "stream": "user123", "commandType": "grant_permission", "params": { "user": "user123", "resource": "issueABC", "permission": "view" }, "timestamp": "2019-07-24 3:30 PM", "actor": "jira_share_service" } ]

  26. Event Sourcing. (1) Order all writes, so we can replay in same order. Shown beside the example event from slide 25, with its "sequence": 20 field.

  27. Event Sourcing. "sequence": 20. (1) Order all writes, so we can replay in same order. STRICTLY ORDERED: a table with columns Stream and Sequence holds the row user123, 19; a write runs SET sequence = sequence + 1 WHERE sequence = 19, giving sequences -> 19, 20, 21...
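
The conditional update on slide 27 amounts to a compare-and-set: the sequence only advances if it still holds the value the writer expects. Below is a small sketch of that idea using Python's built-in `sqlite3` module purely as a stand-in database. The table name `stream_sequence`, the function `next_sequence`, and the extra `stream = ?` filter are assumptions for illustration; the talk does not show the actual schema.

```python
import sqlite3
from typing import Optional


def next_sequence(conn: sqlite3.Connection, stream: str, expected: int) -> Optional[int]:
    """Advance the stream from `expected` to `expected + 1`.

    Returns the new sequence, or None if the stored value was not `expected`
    (for example because another writer advanced it first).
    """
    cursor = conn.execute(
        "UPDATE stream_sequence SET sequence = sequence + 1 "
        "WHERE stream = ? AND sequence = ?",
        (stream, expected),
    )
    conn.commit()
    return expected + 1 if cursor.rowcount == 1 else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stream_sequence (stream TEXT PRIMARY KEY, sequence INTEGER NOT NULL)"
)
conn.execute("INSERT INTO stream_sequence VALUES ('user123', 19)")
conn.commit()

first = next_sequence(conn, "user123", 19)  # succeeds: 19 -> 20
stale = next_sequence(conn, "user123", 19)  # fails: the row now holds 20
```

A writer that gets `None` back would have to re-read the current sequence and retry, which is the price of strict ordering through a single authoritative row.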

  28. Event Sourcing. "sequence": 20. (1) Order all writes, so we can replay in same order. MOSTLY ORDERED: sequence = {timestamp}{unique_node_id}; sequences -> 1565312340, 1565323450, ...

  29. Event Sourcing. "sequence": 20. (1) Order all writes, so we can replay in same order. MOSTLY ORDERED: sequence = {timestamp}{unique_node_id}; sequences -> 1565312340, 1565323450, ... Notes: no SPOF (database); only certain writes need strict ordering; clock skew window is small (< 1 sec); don’t know previous sequence.
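
One way to read `sequence = {timestamp}{unique_node_id}` from slides 28 and 29 is as a number whose high digits are a timestamp and whose low digits identify the node that handled the write. The sketch below assumes millisecond timestamps and node ids of at most three decimal digits; both choices, and the name `make_sequence`, are illustrative assumptions rather than details from the talk.

```python
import time
from typing import Optional

NODE_ID_DIGITS = 3  # assumption: node ids fit in three decimal digits


def make_sequence(node_id: int, timestamp_ms: Optional[int] = None) -> int:
    """Build a sortable sequence from a millisecond timestamp and a node id."""
    if not 0 <= node_id < 10 ** NODE_ID_DIGITS:
        raise ValueError("node_id out of range")
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000  # local clock, may be skewed
    return timestamp_ms * 10 ** NODE_ID_DIGITS + node_id


# Two nodes writing in the same millisecond, then a later write.
a = make_sequence(node_id=7, timestamp_ms=1565312340000)
b = make_sequence(node_id=2, timestamp_ms=1565312340000)
c = make_sequence(node_id=2, timestamp_ms=1565312340001)
ordered = sorted([a, b, c])
```

No shared counter is needed, so there is no single point of failure, but writes from different nodes inside the clock skew window are ordered by clock and node id rather than by true arrival order. That is why the ordering is only "mostly" right.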

  30. Event Sourcing. "sequence": 20. (1) Order all writes, so we can replay in same order. MOSTLY ORDERED + CUSTOMER-DICTATED STRICT ORDERING: write1 calls /create and gets back {sequence/timestamp1}; write2 calls /delete?after={timestamp1} and gets back {sequence/timestamp2}, with timestamp2 > timestamp1.
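
The slide only states the guarantee (`timestamp2 > timestamp1` when the caller passes `?after={timestamp1}`), not how the service enforces it. One possible approach, sketched here as an assumption, is to bump the assigned timestamp past the caller's `after` value when the local clock is behind; a real service might equally wait or reject the request. The function name `assign_timestamp` is hypothetical.

```python
from typing import Optional


def assign_timestamp(now_ms: int, after_ms: Optional[int] = None) -> int:
    """Pick a timestamp for a write, strictly later than `after_ms` if given."""
    if after_ms is None:
        return now_ms
    return max(now_ms, after_ms + 1)


# write1 (/create) is handled by a node whose clock runs slightly ahead.
timestamp1 = assign_timestamp(now_ms=1565312340500)

# write2 (/delete?after=timestamp1) lands on a node whose clock is behind.
timestamp2 = assign_timestamp(now_ms=1565312340200, after_ms=timestamp1)
```

Only callers that care about the relative order of two writes need to pass `after`, which fits the note on slide 29 that only certain writes need strict ordering.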

  31. Event Sourcing. (2) Streams guarantee order. Parallelize across streams. Shown beside the example event from slide 25, with its "stream": "user123" field.

  32. Event Sourcing. (2) Streams guarantee order. Parallelize across streams. Stream user123: {sequence: 20, commandType: "grant", permission: "view"}, then {sequence: 21, commandType: "revoke", permission: "view"}.

  33. Event Sourcing. (2) Streams guarantee order. Parallelize across streams. Stream user123: {sequence: 20, commandType: "grant", permission: "view"}, then {sequence: 21, commandType: "revoke", permission: "view"}. Stream user456: {sequence: 74, commandType: "grant", permission: "edit"}, then {sequence: 75, commandType: "revoke", permission: "edit"}.
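
The two streams on slide 33 illustrate the rule: order matters within a stream, not across streams. A small sketch of partitioning a mixed batch of events by stream and ordering each partition by sequence follows; the function name `partition_by_stream` and the flattened event shape are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List

# Events from slide 33, deliberately shuffled across streams.
events = [
    {"stream": "user456", "sequence": 75, "commandType": "revoke", "permission": "edit"},
    {"stream": "user123", "sequence": 21, "commandType": "revoke", "permission": "view"},
    {"stream": "user456", "sequence": 74, "commandType": "grant", "permission": "edit"},
    {"stream": "user123", "sequence": 20, "commandType": "grant", "permission": "view"},
]


def partition_by_stream(batch: List[dict]) -> Dict[str, List[dict]]:
    """Group events by stream and sort each group by sequence."""
    streams: Dict[str, List[dict]] = defaultdict(list)
    for event in batch:
        streams[event["stream"]].append(event)
    for stream_events in streams.values():
        stream_events.sort(key=lambda e: e["sequence"])
    return dict(streams)


streams = partition_by_stream(events)
```

Each resulting list can be handed to a separate worker, since a grant and revoke for user123 never need to be ordered against anything in user456.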

  34. Event Sourcing. (1) Restore snapshot.

  35. Event Sourcing. (1) Restore snapshot. (2) Recover streams in parallel: user123 is at 19 in the snapshot, with 20, 21 to replay; user456 is at 73, with 74, 75, 76, ... to replay.

  36. Event Sourcing. (1) Restore snapshot. (2) Recover streams in parallel: user123 is at 19 in the snapshot, with 20, 21 to replay; user456 is at 73, with 74, 75, 76, ... to replay. Bonus: process all stream events in-memory.
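
The recovery steps on slides 34 to 36 can be sketched as: start from the snapshot state of each stream, skip events the snapshot already contains, fold the newer events in memory, and run the streams independently. The code below is an illustrative sketch only. The snapshot layout, the permissions assumed for user456, and the function name `recover_stream` are assumptions; the event values come from slide 33.

```python
from concurrent.futures import ThreadPoolExecutor


def recover_stream(snapshot: dict, events: list) -> dict:
    """Fold events newer than the snapshot into the stream's state, in memory."""
    sequence = snapshot["sequence"]
    permissions = set(snapshot["permissions"])
    for event in sorted(events, key=lambda e: e["sequence"]):
        if event["sequence"] <= sequence:
            continue  # already reflected in the snapshot
        if event["commandType"] == "grant":
            permissions.add(event["permission"])
        elif event["commandType"] == "revoke":
            permissions.discard(event["permission"])
        else:
            raise ValueError(f"unknown command: {event['commandType']}")
        sequence = event["sequence"]
    return {"sequence": sequence, "permissions": permissions}


# Restored snapshot: last applied sequence and state per stream (assumed shape).
snapshot = {
    "user123": {"sequence": 19, "permissions": set()},
    "user456": {"sequence": 73, "permissions": {"view"}},  # assumed prior state
}

# Command history per stream, read from the independent store.
command_log = {
    "user123": [
        {"sequence": 20, "commandType": "grant", "permission": "view"},
        {"sequence": 21, "commandType": "revoke", "permission": "view"},
    ],
    "user456": [
        {"sequence": 74, "commandType": "grant", "permission": "edit"},
        {"sequence": 75, "commandType": "revoke", "permission": "edit"},
    ],
}

# Streams are independent, so each one can be recovered by its own worker.
with ThreadPoolExecutor() as pool:
    futures = {
        stream: pool.submit(recover_stream, snapshot[stream], command_log[stream])
        for stream in snapshot
    }
    recovered = {stream: future.result() for stream, future in futures.items()}
```

Folding all of a stream's events in memory means the main datastore only needs one write per stream at the end, rather than one per replayed event. Threads are used here just to show the independence of streams; a real recovery job would pick whatever parallelism suits its datastore.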

  37. Event Sourcing. Main Datastore, sequence table (Stream, Sequence): user123, 19. Main Datastore, permissions table (User, Resource, Permissions): user123, issueABC, []. Events for stream “user123” (the second event is cut off in the source): [ { "stream": "user123", "sequence": 20, "commandType": "grant_permission", "params": { "user": "user123", "resource": "issueABC", "permission": "view" }, ... }, { "stream": "user123", "sequence": 21, "commandType": "grant_permission", "params": { "user": "user123",
