
Architecting Distributed Databases for Failure: A Case Study with Druid



  1. Architecting Distributed Databases for Failure: A Case Study with Druid. Fangjin Yang, Cofounder @ Imply

  2. Overview: The Bad; The Really Bad; The Catastrophic; Best Practices: Operations

  3. Everything is going to fail!

  4. Requirements: Scalable (tens of thousands of nodes, petabytes of raw data). Available (24 x 7 x 365 uptime). Performant (run as smoothly as possible when things go wrong).

  5. Druid: Open source distributed data store. Column-oriented storage of event data. Low-latency OLAP queries and low-latency data ingestion. Initially designed to power a SaaS for online advertising (in AWS). Our real-world example case study.

  6. The Bad

  7. Single Server Failures: Common. Occur for every imaginable and unimaginable reason (hardware malfunction, kernel panic, network outage, etc.). Minimal impact. Standard solution: replication.

  8. Druid Segments (diagram): a table of events with Timestamp, Dimensions, and Measures columns is partitioned by time into one segment per day. Rows 2015-01-01T00 and 2015-01-01T01 go to Segment_2015-01-01/2015-01-02, rows 2015-01-02T05 and 2015-01-02T07 go to Segment_2015-01-02/2015-01-03, and rows 2015-01-03T05 and 2015-01-03T07 go to Segment_2015-01-03/2015-01-04.
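
The time partitioning on this slide can be sketched in a few lines. This is only an illustration, assuming day-sized intervals and events represented as dictionaries with a `timestamp` key; `segment_id` and `partition_by_time` are hypothetical helper names, not Druid APIs.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def segment_id(ts: datetime) -> str:
    """Name the day-sized segment (start/end) that a timestamp falls into."""
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return f"Segment_{day:%Y-%m-%d}/{day + timedelta(days=1):%Y-%m-%d}"


def partition_by_time(events):
    """Group events into segments keyed by their daily interval."""
    segments = defaultdict(list)
    for event in events:
        segments[segment_id(event["timestamp"])].append(event)
    return dict(segments)


events = [
    {"timestamp": datetime(2015, 1, 1, 0)},
    {"timestamp": datetime(2015, 1, 1, 1)},
    {"timestamp": datetime(2015, 1, 2, 5)},
]
for name, rows in partition_by_time(events).items():
    print(name, len(rows))
```

Because each segment covers a fixed interval, a query over a time range only needs the segments whose intervals overlap that range.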

  9. Replication Example (diagram): a client sends queries to the Druid Brokers, which forward them to the Druid Historicals. Three Historicals load segments: one holds Segment_1 and Segment_2, one holds Segment_1 and Segment_3, and one holds Segment_2 and Segment_3, so every segment has two replicas. Segment_1, Segment_2, and Segment_3 are each labeled Segment_2015-01-01/2015-01-02.

  10. Query Segment_1 (diagram): the same layout as slide 9, showing a query for Segment_1.

  11. Query Segment_1 (diagram): the same layout as slide 10.
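
The routing idea behind these slides can be sketched as follows: a query for a segment is sent to any live server that holds a replica. This is a minimal sketch under stated assumptions, not Druid's broker logic; `pick_replica`, the `placement` map, and the server names are hypothetical, and random choice stands in for whatever balancing a real broker uses.

```python
import random


def pick_replica(segment, placement, live_servers, rng=random):
    """Return a live server holding `segment`, or raise if none is left."""
    candidates = [s for s in placement.get(segment, ()) if s in live_servers]
    if not candidates:
        raise LookupError(f"no live replica for {segment}")
    return rng.choice(candidates)


# Placement mirroring the slide: every segment lives on two servers.
placement = {
    "Segment_1": ["historical_a", "historical_b"],
    "Segment_2": ["historical_a", "historical_c"],
    "Segment_3": ["historical_b", "historical_c"],
}

# historical_a has failed, so Segment_1 can only be served by historical_b.
live = {"historical_b", "historical_c"}
print(pick_replica("Segment_1", placement, live))
```

With two replicas per segment, any single server can fail and every segment still has one live copy.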

  12. Multi-Server Failures: Common: 1 server fails. Less common: >1 server fails. Data center issues (rack failure). Two strategies: fast recovery, multi-datacenter replication.

  13. Fast Recovery: Complete data availability in the face of multi-server failures is hard! Focus on fast recovery instead. Be careful of the pitfalls of fast recovery. More viable in the cloud.

  14. Fast Recovery Example (diagram): a client sends queries to the Druid Brokers, which forward them to three Druid Historicals; the Historicals load Segment_1, Segment_2, and Segment_3 from Deep Storage (S3/HDFS).

  15. Fast Recovery Example (diagram): Deep Storage (S3/HDFS) and the Druid Historicals, which together hold two copies each of Segment_1, Segment_2, and Segment_3.

  16. Fast Recovery Example (diagram): only the Historical holding Segment_1 and Segment_2 is still shown next to Deep Storage (S3/HDFS).

  17. Fast Recovery Example (diagram): the Druid Coordinator issues two load instructions, "Load Segment_1, Segment_3" and "Load Segment_2, Segment_3", and the segments are loaded from Deep Storage (S3/HDFS).

  18. Fast Recovery Example (diagram): with the Druid Coordinator's loads complete, the Historicals again hold two copies each of Segment_1, Segment_2, and Segment_3.

  19. Dangers of Fast Recovery: Easy to create bottlenecks (prioritize how resources are spent during recovery; Druid prioritizes data availability and throttles replication). Beware query hotspots (intelligent load balancing during recovery is important).
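
One way to read "prioritizes data availability and throttles replication" is as an ordering rule for recovery work: segments with no live copy are loaded first, and loads that only add extra replicas are capped. The sketch below illustrates that rule only; `plan_loads` and its parameters are hypothetical and do not reflect the Druid Coordinator's actual implementation.

```python
def plan_loads(desired_replicas, live_replicas, replication_throttle):
    """Order recovery work: unavailable segments first, then capped replication.

    desired_replicas: segment -> number of copies wanted
    live_replicas:    segment -> number of copies currently served
    replication_throttle: max number of replica-only loads per planning round
    """
    unavailable = []       # zero live copies: queries for these fail right now
    under_replicated = []  # still queryable, but short of the desired copies
    for segment in sorted(desired_replicas):
        have = live_replicas.get(segment, 0)
        if have == 0:
            unavailable.append(segment)
        elif have < desired_replicas[segment]:
            under_replicated.append(segment)
    # Availability is never throttled; replication is.
    return unavailable + under_replicated[:replication_throttle]


desired = {"Segment_1": 2, "Segment_2": 2, "Segment_3": 2}
live = {"Segment_1": 0, "Segment_2": 1, "Segment_3": 1}
print(plan_loads(desired, live, replication_throttle=1))
```

Capping replica-only loads keeps recovery traffic from saturating deep storage and the network while the missing segments are still being restored.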

  20. Fast Recovery Example (diagram): Deep Storage (S3/HDFS) and the Druid Historicals, which together hold two copies each of Segment_1, Segment_2, and Segment_3.

  21. Fast Recovery Example (diagram): a single Druid Historical loads Segment_1, Segment_2, and Segment_3 from Deep Storage (S3/HDFS) and is marked "Overloaded!".

  22. The Really Bad

  23. Data Center Outage: Very uncommon. Power loss. Can be extremely disruptive without proper planning. Solution: multi-datacenter replication. Beware pitfalls of multi-datacenter replication.

  24. Multi-Datacenter Replication (diagram): a client sends queries to the Druid Brokers; a Druid Coordinator manages Druid Historicals in Data Center 1 and Data Center 2, which hold replicas of Segment_1, Segment_2, and Segment_3.

  25. Multi-Datacenter Pitfalls: Coordination and leader election can be tricky. Communication can require non-trivial network time. Coordination is usually done with heartbeats and quorum decisions. Writes, failovers, and consistent reads require round trips.
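
The "heartbeats and quorum decisions" point can be illustrated with a small check: a decision is allowed only if a strict majority of cluster members have sent a heartbeat recently. This is a simplified sketch; `has_quorum`, the node names, and the timeout value are assumptions for illustration, and a production system would rely on a consensus service rather than a check like this.

```python
def has_quorum(last_heartbeat, now, timeout, cluster_size):
    """True if a strict majority of the cluster sent a heartbeat within `timeout`.

    last_heartbeat: node -> time of the most recent heartbeat received
    """
    alive = sum(1 for seen in last_heartbeat.values() if now - seen <= timeout)
    return alive > cluster_size // 2


heartbeats = {"dc1-a": 100.0, "dc1-b": 99.0, "dc2-a": 90.0}

# Two of three nodes were heard from within the last 5 seconds: quorum holds.
print(has_quorum(heartbeats, now=101.0, timeout=5.0, cluster_size=3))
```

Across data centers, the timeout has to account for the longer network round trips, which is one reason failovers take longer there than inside a single data center.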

  26. Multi-Datacenter Replication (diagram): a client, Data Center 1, and Data Center 2.

  27. The Catastrophic

  28. “Why are things slow today?”: Poor performance is much worse than things completely failing. Causes: heavy concurrent usage (multi-tenancy), hotspots and variability, bad software updates.

  29. Architecting for Multi-tenancy: Small units of computation. No single query should starve out a cluster.

  30. Druid Multi-tenancy (diagram): queries arriving at a Druid Historical are split into per-segment units (Segment_query_1, Segment_query_2, Segment_query_3, Segment_query_4), and the processing order interleaves units from different queries.
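
The "small units of computation" idea can be sketched as a scheduler that breaks each query into per-segment units and takes one unit from each query in turn. Round-robin is an assumption made here for simplicity, and `interleave` is a hypothetical helper; the slides do not specify the exact ordering policy Druid uses.

```python
from collections import deque


def interleave(queries):
    """Return (query_id, segment) work units in round-robin order across queries.

    queries: query_id -> list of segments that query must scan
    """
    queues = deque((qid, deque(segs)) for qid, segs in queries.items() if segs)
    order = []
    while queues:
        qid, segs = queues.popleft()
        order.append((qid, segs.popleft()))
        if segs:  # the query still has units left: send it to the back of the line
            queues.append((qid, segs))
    return order


work = interleave({
    "query_1": ["Segment_1", "Segment_2", "Segment_3"],
    "query_2": ["Segment_1"],
})
print(work)
```

Because a large query only holds a processing slot for one segment at a time, a small query arriving later waits for at most a few units rather than for the whole large query.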

  31. Architecting for Multi-tenancy: Resource prioritization and isolation. Not all queries are equal. Not all users are equal.

  32. Druid Multi-tenancy (diagram): a client sends queries to the Druid Brokers; the Druid Historicals are split into Tier 1 (older data) and Tier 2 (newer data), with resources dedicated for older data and for newer data respectively.

  33. Hotspots: Incredible variability in query performance among nodes. Nodes may become slow but not fail. Difficult to detect as there is nothing obviously wrong. Solutions: hedged requests, selective replication, latency-induced probation.

  34. Hedged Requests (diagram): a client, the Druid Brokers, and three Druid Historicals that together hold two copies each of Segment_1, Segment_2, and Segment_3.

  35. Hedged Requests (diagram): the same layout as slide 34.
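
A hedged request sends the query to one replica and, if no answer arrives within a short delay, sends the same query to a second replica and uses whichever answer comes back first. The sketch below uses Python's `asyncio`; `hedged_request`, `send`, and the replica names are hypothetical, and the delay values are arbitrary. It does not handle the case where the first reply to finish is an error.

```python
import asyncio


async def hedged_request(send, replicas, hedge_delay):
    """Query replicas[0]; after `hedge_delay` seconds without a reply, also
    query replicas[1] and return whichever result arrives first."""
    primary = asyncio.create_task(send(replicas[0]))
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay)
    if done:
        return primary.result()

    # The primary is slow: issue the backup (hedge) request.
    backup = asyncio.create_task(send(replicas[1]))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # the slower request is no longer needed
    return next(iter(done)).result()


async def fake_send(replica):
    """Stand-in for a network call: the 'slow' replica takes much longer."""
    await asyncio.sleep(0.5 if replica == "slow" else 0.01)
    return replica


print(asyncio.run(hedged_request(fake_send, ["slow", "fast"], hedge_delay=0.05)))
```

The hedge delay trades extra load for lower tail latency: a delay near the typical high-percentile latency sends backup requests only for the small fraction of queries that are already slow.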

  36. Minimizing Variability: Selective replication. Latency-induced probation. Great paper: https://web.stanford.edu/class/cs240/readings/tail-at-scale.pdf
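
Latency-induced probation can be sketched as a tracker that keeps a smoothed latency per node and stops routing to nodes whose smoothed latency exceeds a threshold. This is a minimal sketch under stated assumptions: `ProbationTracker`, the exponential smoothing factor, and the threshold are all hypothetical choices.

```python
class ProbationTracker:
    """Exclude nodes whose smoothed latency is above a threshold."""

    def __init__(self, threshold_ms, alpha=0.5):
        self.threshold_ms = threshold_ms
        self.alpha = alpha  # weight given to the newest observation
        self.avg_ms = {}

    def record(self, node, latency_ms):
        """Update the exponentially weighted average latency for `node`."""
        prev = self.avg_ms.get(node, latency_ms)
        self.avg_ms[node] = self.alpha * latency_ms + (1 - self.alpha) * prev

    def eligible(self, nodes):
        """Nodes not on probation; fall back to all nodes if every one is slow."""
        ok = [n for n in nodes if self.avg_ms.get(n, 0.0) <= self.threshold_ms]
        return ok or list(nodes)


tracker = ProbationTracker(threshold_ms=100)
tracker.record("historical_a", 10)
tracker.record("historical_b", 500)
print(tracker.eligible(["historical_a", "historical_b"]))
```

A node on probation needs some way back in. This sketch only lowers a node's average when new measurements are recorded for it, so a real system would keep sending it a trickle of probe or shadow requests to notice when it has recovered.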

  37. Bad Software Updates: It is very difficult to simulate production traffic (testing/staging clusters mostly verify correctness). No noticeable failures for a long time. Common cause of cascading failures.

  38. Rolling Upgrades: Be able to update different components with no downtime. Backwards compatibility is extremely important. Roll back if things are bad.

  39. Rolling Upgrades (diagram): a client sends queries to the Druid Brokers and Druid Historicals; one of the five nodes shown runs V2 and the other four run V1.

  40. Rolling Upgrades (diagram): the same layout; three of the five nodes run V2 and two run V1.

  41. Rolling Upgrades (diagram): the same layout; four of the five nodes run V2 and one runs V1.
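
The rolling upgrade shown across these slides can be sketched as a loop that upgrades one node at a time, checks its health, and rolls everything back if a check fails. The callbacks `upgrade`, `healthy`, and `rollback` are hypothetical hooks standing in for real deployment tooling; the one-node-at-a-time policy is an assumption.

```python
def rolling_upgrade(nodes, upgrade, healthy, rollback):
    """Upgrade nodes one at a time; undo all upgrades if any node is unhealthy.

    Returns True if every node was upgraded, False if the rollout was reverted.
    """
    upgraded = []
    for node in nodes:
        upgrade(node)
        upgraded.append(node)
        if not healthy(node):
            # Revert in reverse order so the cluster returns to the old version.
            for done in reversed(upgraded):
                rollback(done)
            return False
    return True


versions = {"historical_1": "V1", "historical_2": "V1", "historical_3": "V1"}
ok = rolling_upgrade(
    list(versions),
    upgrade=lambda n: versions.__setitem__(n, "V2"),
    healthy=lambda n: True,
    rollback=lambda n: versions.__setitem__(n, "V1"),
)
print(ok, versions)
```

While the loop runs, V1 and V2 nodes serve queries side by side, which is why the slides stress backwards compatibility between versions.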

  42. Best Practices: Operations

  43. Monitoring: Detection of when things go badly. Define your critical metrics and acceptable values.

  44. Alerts: Alert on critical errors (out of disk space, out of cluster capacity, etc.). Design alerts to reduce “noise” (distinguish warnings and alerts).
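
Separating warnings from alerts can be as simple as two thresholds per critical metric. The sketch below is illustrative only; `classify` and the disk-usage thresholds are hypothetical values, not recommendations from the talk.

```python
def classify(value, warn_at, alert_at):
    """Map a metric value to 'ok', 'warning', or 'alert' using two thresholds."""
    if value >= alert_at:
        return "alert"    # needs immediate attention, e.g. page someone
    if value >= warn_at:
        return "warning"  # worth a look, but should not wake anyone up
    return "ok"


# Disk usage as a fraction of capacity on three hypothetical nodes.
for node, used in {"node_a": 0.50, "node_b": 0.85, "node_c": 0.97}.items():
    print(node, classify(used, warn_at=0.80, alert_at=0.95))
```

Keeping the warning tier out of the paging path is one way to reduce alert noise without losing early signals.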

  45. Exploratory Analytics: Extremely critical to diagnosing root causes quickly. Not many organizations do this.

  46. Takeaways: Everything is going to fail! Use replication for single server failures. Use fast recovery for multi-server failures (when you don’t want to set up another data center). Use multi-datacenter replication when availability really matters. Alerting, monitoring, and exploratory analysis are critical.

  47. Thanks! @implydata @druidio @fangjin imply.io druid.io
