Availability, Latency and Cost: Withstanding Regional Outages


  1. Availability, Latency and Cost: Withstanding Regional Outages @aaronblohowiak aaronb@netflix.com

  2. What to expect ● Why? ● Overview! ● Algebraic Models ○ Availability! ○ Latency! ○ Cost! ● Architecture!

  3. Why?

  4. You never let a serious crisis go to waste. And what I mean by that is it's an opportunity to do things you think you could not do before. - Rahm Emanuel

  5. Good, not great.

  6. Good, not great. 1. Instability

  7. Good, not great. 1. Instability 2. Infrequency

  8. Good, not great. 1. Instability 2. Infrequency 3. GOTO 1.

  9. Source: https://martinfowler.com/bliki/FrequencyReducesDifficulty.html

  10. One of my favorite soundbites is: if it hurts, do it more often. - Martin Fowler

  11. Operational Burden 1. Alerts

  12. Operational Burden 1. Alerts 2. Canaries

  13. Operational Burden 1. Alerts 2. Canaries 3. WoW Metrics

  14. From Burden to Advantage

  15. In general, freedom and rapid recovery is better than trying to prevent error. We are in a creative business, not a safety-critical business. - jobs.netflix.com/culture

  16. Overview

  17. Problem Description Number of Regions

  20. 100% Capacity

  21. Problem Description Number of Regions

  22. N+1 Architecture

  23. 100% 1+0 (no spare)

  24. 100% 100% 1+1

  25. 100% 100% 1+1 = 200%

  26. 2+1 50% 50% 50%

  27. 2+1 = 150% 50% 50% 50%

  28. 2+1 = 150% ?!?!?!?!?! 50% 50% 50%
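
The N+1 figures on the slides above can be checked with a little arithmetic. The sketch below assumes, as the slides imply, that peak demand is split evenly over N active regions and that one extra region of the same size is kept as a spare. The function name `n_plus_one_capacity` is hypothetical and does not come from the talk.

```python
from fractions import Fraction


def n_plus_one_capacity(n: int) -> tuple[Fraction, Fraction]:
    """Return (per-region capacity, total capacity) as fractions of peak demand.

    Assumption: demand splits evenly over n active regions, and one spare
    region of the same size is provisioned on top of them.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    per_region = Fraction(1, n)    # each region carries 1/n of peak demand
    total = per_region * (n + 1)   # n active regions plus one spare
    return per_region, total


for n in (1, 2, 4):
    per_region, total = n_plus_one_capacity(n)
    print(f"{n}+1: {float(per_region):.0%} per region, {float(total):.0%} total")
```

For 1+1 this gives 100% per region and 200% in total, and for 2+1 it gives 50% per region and 150% in total, matching the slides.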

  29. 2+1 Overview

  33. Excess Risk

  37. Algebraic Models

  38. All models are wrong but some are useful - George Box

  39. Availability

  40. Distribution of Change Number of Regions Balance of Traffic

  47. Distribution of Change Number of Regions Balance of Traffic Empirical Risk

  48. Latency

  49. Which Latency?

  50. Normal vs Failover

  51. Latency ??? Availability Cost

  52. If you’re successful, hourly demand maps to population by longitude. - Blohowiak’s Third Law

  53. Measuring Latency

  59. Cost

  60. 2+1 50% 50% 50%

  62. In N+1 Architecture, minimal failover overhead is 1/N.

  63. In N+1 Architecture, minimal failover overhead is 1/N. Cost = 100% + 1/N

  64. In N+1 Architecture, minimal failover overhead is 1/N. Cost = 100% + 1/N, if costs are pure throughput.

  65. 100%

  66. Throughput Portion 100% Database Portion

  67. 2+1

  68. 2+1 All data everywhere

  69. 2+1 All data everywhere >150%

  70. Database Portion (DBP), Region Replication Factor (RRF)

  71. In RRF=All: T is Throughput Cost, T = (1 - DBP) * (1 + 1/N). D is DB Cost, D = DBP * (N + 1). Total = T + D

  73. In RRF=2: T is Throughput Cost, T = (1 - DBP) * (1 + 1/N). D is DB Cost, D = DBP * 2. Total = T + D

  76. Cost Summary ● 50% throughput overhead plus tripled database cost for 3-region RRF=all. ● 25% throughput overhead plus doubled database cost for 5-region RRF=2, plus a lot of complexity.
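
The cost formulas from the slides (T = (1 - DBP) * (1 + 1/N), D = DBP * copies, Total = T + D) are easy to try with your own numbers. The sketch below is one way to write them down. The function name `total_cost`, the use of `None` to mean RRF=All, and the 25% database portion in the example are all assumptions made for illustration, not figures from the talk.

```python
from typing import Optional


def total_cost(n: int, dbp: float, rrf: Optional[int] = None) -> float:
    """Relative cost of an N+1 deployment versus one region with no spare.

    n   : number of regions needed to carry peak load (N on the slides)
    dbp : database portion of the single-region cost, between 0 and 1
    rrf : region replication factor; None stands for RRF=All (N + 1 copies)
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= dbp <= 1.0:
        raise ValueError("dbp must be between 0 and 1")
    copies = n + 1 if rrf is None else rrf
    throughput = (1 - dbp) * (1 + 1 / n)   # T = (1 - DBP) * (1 + 1/N)
    database = dbp * copies                # D = DBP * number of data copies
    return throughput + database


# Hypothetical example: 25% of the single-region bill is database cost.
three_region_all = total_cost(n=2, dbp=0.25)        # 2+1, RRF=All
five_region_rrf2 = total_cost(n=4, dbp=0.25, rrf=2) # 4+1, RRF=2
print(f"3-region RRF=All: {three_region_all:.4f}x")
print(f"5-region RRF=2:   {five_region_rrf2:.4f}x")
```

With the made-up 25% database portion, the two configurations come out at 1.875 and 1.4375 times the single-region cost. Setting `dbp=0` recovers the pure-throughput case, Cost = 100% + 1/N.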

  77. Architecture

  78. Multi-Site Fault Isolation ● No cross-region Requests! ● Stateless or Async* Replication! ○ Cache Replication! ● Change One Region at a Time!

  79. To shard or not to shard? That is the question.

  80. To shard or not to shard? That is the question. ● Steering

  81. To shard or not to shard? That is the question. ● Steering ● Rebalancing & Rehoming

  82. To shard or not to shard? That is the question. ● Steering ● Rebalancing & Rehoming ● Cost

  83. To shard or not to shard? That is the question. ● Steering ● Rebalancing & Rehoming ● Cost ● Satellites

  84. To shard or not to shard? That is the question. ● Steering ● Rebalancing & Rehoming ● Cost ● Satellites ● Graph vs Multi-tenant

  85. How to RRF=2 with 1/N overhead? ● Central Savior ● Ring ● Custom Hashing

  86. Central Savior

  88. Ring Regions
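
The slide text gives only the title "Ring Regions", so the following is a guess at what a ring layout with RRF=2 could look like, not the speaker's design. It assumes each home region keeps one copy of its data locally and a second copy in the next region around the ring. The function names and the region labels "A", "B", "C" are hypothetical.

```python
def ring_replicas(regions: list[str]) -> dict[str, list[str]]:
    """Map each home region to the two regions holding its data (RRF=2).

    Assumed placement: one copy in the home region and a second copy in
    the next region around the ring.
    """
    n = len(regions)
    if n < 2:
        raise ValueError("a ring needs at least two regions")
    return {
        home: [home, regions[(i + 1) % n]]
        for i, home in enumerate(regions)
    }


def surviving_copies(
    placement: dict[str, list[str]], failed: str
) -> dict[str, list[str]]:
    """Regions that still hold each home's data after `failed` goes down."""
    return {
        home: [region for region in copies if region != failed]
        for home, copies in placement.items()
    }


placement = ring_replicas(["A", "B", "C"])
print(placement)
print(surviving_copies(placement, failed="A"))
```

Under this assumed layout every home region's data keeps at least one live copy after a single regional failure. What capacity overhead the scheme needs is not recoverable from the slide text, so the sketch makes no claim about it.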

  91. One More Thing

  92. What percentage of your outages come from regional failures?

  93. Many of the availability benefits come from isolation, not regions.

  94. What percentage of your outages come from database failures?

  95. Maybe for you and your org having logical stacks makes the most sense.

  96. Closing Thoughts

  97. Questions?
