Architecting Distributed Databases for Failure: A Case Study with Druid (PowerPoint PPT Presentation)


SLIDE 1

Architecting Distributed Databases for Failure

A Case Study with Druid

Fangjin Yang

Cofounder @ Imply

SLIDE 2

Overview

  • The Bad
  • The Really Bad
  • The Catastrophic
  • Best Practices: Operations

SLIDE 3

Everything is going to fail!

SLIDE 4

Requirements

Scalable

  • Tens of thousands of nodes
  • Petabytes of raw data

Available

  • 24 x 7 x 365 uptime

Performant

  • Run as smoothly as possible when things go wrong
SLIDE 5

Druid

  • Open source distributed data store
  • Column-oriented storage of event data
  • Low latency OLAP queries & low latency data ingestion
  • Initially designed to power a SaaS for online advertising (in AWS)
  • Our real-world example case study

SLIDE 6

The Bad

SLIDE 7

Single Server Failures

Common: occurs for every imaginable and unimaginable reason

  • Hardware malfunction, kernel panic, network outage, etc.
  • Minimal impact

Standard solution: replication

SLIDE 8

Druid Segments

[Table: one table of events with columns Timestamp, Dimensions, Measures and rows at 2015-01-01T00, 2015-01-01T01, 2015-01-02T05, 2015-01-02T07, 2015-01-03T05, 2015-01-03T07, split into three per-day tables: {2015-01-01T00, 2015-01-01T01}, {2015-01-02T05, 2015-01-02T07}, {2015-01-03T05, 2015-01-03T07}.]

Partition by time: Segment_2015-01-01/2015-01-02, Segment_2015-01-02/2015-01-03, Segment_2015-01-03/2015-01-04
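The partition-by-time step can be sketched as follows (a minimal illustration, not Druid's actual segment code; `segment_for` is a hypothetical helper):

```python
from collections import defaultdict
from datetime import date, timedelta

def segment_for(timestamp: str) -> str:
    """Map an event timestamp to its day-granularity segment interval."""
    day = date.fromisoformat(timestamp[:10])
    return f"Segment_{day}/{day + timedelta(days=1)}"

# Bucket the slide's events into per-day segments.
events = ["2015-01-01T00", "2015-01-01T01", "2015-01-02T05",
          "2015-01-02T07", "2015-01-03T05", "2015-01-03T07"]
segments = defaultdict(list)
for ts in events:
    segments[segment_for(ts)].append(ts)
# segments now maps three day-long intervals to two events each
```

In real Druid the segment granularity (hour, day, etc.) is configurable per datasource; day granularity is used here to match the slide.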

SLIDE 9

Replication Example

[Diagram: Segment_2015-01-01/2015-01-02 is split into Segment_1, Segment_2, and Segment_3; each is loaded onto two of three Druid Historicals (two replicas per segment). Druid Brokers receive client queries and route them to the historicals.]

SLIDE 10

Query Segment_1

[Diagram: same cluster; a client query for Segment_1 is routed by a broker to one of the two historicals serving Segment_1.]

SLIDE 11

Query Segment_1

[Diagram: same cluster; because Segment_1 has two replicas, the broker can route the query to the other historical serving it.]

SLIDE 12

Multi-Server Failures

Common: 1 server fails
Less common: >1 server fails, or data center issues (rack failure)
Two strategies:

  • fast recovery
  • multi-datacenter replication
SLIDE 13

Fast Recovery

Complete data availability in the face of multi-server failures is hard!
Focus on fast recovery instead.
Beware the pitfalls of fast recovery.
Fast recovery is more viable in the cloud.

SLIDE 14

Fast Recovery Example

[Diagram: the replicated cluster from before, now with a Deep Storage layer (S3/HDFS) from which the historicals load segments.]

SLIDE 15

Fast Recovery Example

[Diagram: three historicals serving two segment replicas each (Segment_1+Segment_2, Segment_1+Segment_3, Segment_2+Segment_3), loading from Deep Storage (S3/HDFS).]

SLIDE 16

Fast Recovery Example

[Diagram: a historical has failed; only Segment_1 and Segment_2 are currently served, while Deep Storage (S3/HDFS) still holds all segments.]

SLIDE 17

Fast Recovery Example

[Diagram: the Druid Coordinator detects the loss and issues load commands: one historical is told to load Segment_1 and Segment_3, another to load Segment_2 and Segment_3, from Deep Storage (S3/HDFS).]

SLIDE 18

Fast Recovery Example

[Diagram: the remaining historicals have loaded the reassigned segments (Segment_1+Segment_3 and Segment_2+Segment_3) from deep storage; every segment is served again.]

SLIDE 19

Dangers of Fast Recovery

Easy to create bottlenecks

  • Prioritize how resources are spent during recovery
  • Druid prioritizes data availability and throttles replication

Beware query hotspots

  • Intelligent load balancing during recovery is important
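Druid's prioritization during recovery can be sketched like this (illustrative only, not the actual coordinator code; the throttle constant stands in for a setting like Druid's `replicationThrottleLimit`):

```python
# Assumed throttle on how many extra replica copies load per coordinator run.
MAX_REPLICA_LOADS_PER_RUN = 2

def plan_loads(desired, live_copies):
    """desired: segment -> wanted replica count; live_copies: segment -> live count."""
    unavailable = [s for s in desired if live_copies.get(s, 0) == 0]
    under_replicated = [s for s in desired
                        if 0 < live_copies.get(s, 0) < desired[s]]
    # Availability first: segments with zero live copies load unconditionally.
    # Replication second: extra copies are throttled so they cannot starve
    # the loads that restore availability.
    return unavailable + under_replicated[:MAX_REPLICA_LOADS_PER_RUN]
```

The key design choice is ordering: a segment with no live copies is invisible to queries, while an under-replicated segment merely has reduced fault tolerance, so the former always wins the recovery bandwidth.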
SLIDE 20

Fast Recovery Example

[Diagram: a healthy cluster of three historicals, each serving two segment replicas loaded from Deep Storage (S3/HDFS).]

SLIDE 21

Fast Recovery Example

[Diagram: after failures, a single historical is left serving Segment_1, Segment_2, and Segment_3 by itself, and becomes overloaded.]

SLIDE 22

The Really Bad

SLIDE 23

Data Center Outage

Very uncommon (e.g., power loss), but can be extremely disruptive without proper planning.
Solution: multi-datacenter replication.
Beware the pitfalls of multi-datacenter replication.

SLIDE 24

Multi-Datacenter Replication

[Diagram: Data Center 1 and Data Center 2 each run Druid Historicals serving replicas of Segment_1, Segment_2, and Segment_3; Druid Brokers answer client queries and a Druid Coordinator manages segment assignment across both datacenters.]

SLIDE 25

Multi-Datacenter Pitfalls

Coordination and leader election can be tricky.
Communication can require non-trivial network time.
Coordination is usually done with heartbeats and quorum decisions.
Writes, failovers, & consistent reads require cross-datacenter round trips.

SLIDE 26

Multi-Datacenter Replication

[Diagram: a client connected to both Data Center 1 and Data Center 2.]

SLIDE 27

The Catastrophic

SLIDE 28

“Why are things slow today?”

Poor performance is much worse than things failing completely.
Causes:

  • Heavy concurrent usage (multi-tenancy)
  • Hotspots & variability
  • Bad software update
SLIDE 29

Architecting for Multi-tenancy

Small units of computation

  • No single query should starve out a cluster
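The "small units of computation" idea can be sketched as follows (illustrative, not Druid's actual processing loop): queries are broken into per-segment chunks and the node round-robins across them, so a query touching many segments cannot starve a small one.

```python
from collections import deque

def process(queries: dict) -> list:
    """queries: query_id -> list of segment chunks; returns processing order."""
    ready = deque(queries.items())
    order = []
    while ready:
        qid, chunks = ready.popleft()
        order.append(f"{qid}:{chunks.pop(0)}")  # do one segment's worth of work
        if chunks:
            ready.append((qid, chunks))         # re-queue the remaining work
    return order
```

A query over three segments and a query over one segment interleave, so the small query finishes after two chunks of work instead of waiting behind all three.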
SLIDE 30

Druid Multi-tenancy

[Diagram: a Druid Historical's processing order interleaves per-segment chunks from four queries (Segment_query_1, Segment_query_2, Segment_query_1, Segment_query_3, Segment_query_2, Segment_query_1, Segment_query_4), so no single query monopolizes the node.]

SLIDE 31

Architecting for Multi-tenancy

Resource prioritization and isolation

  • Not all queries are equal
  • Not all users are equal
SLIDE 32

Druid Multi-tenancy

[Diagram: historicals are split into tiers: Tier 1 dedicated to older data, and Tier 2 (two node groups) dedicated to newer data; Druid Brokers route client queries to the appropriate tier.]

SLIDE 33

Hotspots

Incredible variability in query performance among nodes.
Nodes may become slow but not fail.
Difficult to detect, as there is nothing obviously wrong.
Solutions:

  • Hedged requests
  • Selective replication
  • Latency-induced probation
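A hedged request can be sketched like this (a minimal illustration with stand-in replica functions; real systems typically also cancel the losing request):

```python
import concurrent.futures as cf
import time

def hedged(primary, backup, request, hedge_after=0.05):
    """Send to the primary replica; if it hasn't answered within
    hedge_after seconds, also send to a backup replica and return
    whichever response arrives first."""
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(primary, request)]
        done, _ = cf.wait(futures, timeout=hedge_after)
        if not done:  # primary is slow: hedge to the backup replica
            futures.append(pool.submit(backup, request))
        done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        return next(iter(done)).result()

# Stand-in replicas: the primary is suffering a hotspot.
def slow_replica(req):
    time.sleep(0.3)
    return "slow:" + req

def fast_replica(req):
    return "fast:" + req

answer = hedged(slow_replica, fast_replica, "Segment_1")
```

The hedge delay is the key tuning knob: set it near the expected p95 latency so the second request fires only for genuine stragglers, keeping the extra load small.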
SLIDE 34

Hedged Requests

[Diagram: replicated cluster; a broker sends a client query to one historical holding the target segment.]

SLIDE 35

Hedged Requests

[Diagram: the broker also sends the same query to a second historical holding a replica, and uses whichever response returns first.]

SLIDE 36

Minimizing Variability

  • Selective replication
  • Latency-induced probation

Great paper: "The Tail at Scale" (Dean & Barroso), https://web.stanford.edu/class/cs240/readings/tail-at-scale.pdf

SLIDE 37

Bad Software Updates

It is very difficult to simulate production traffic

  • Testing/staging clusters mostly verify correctness

There may be no noticeable failures for a long time.
Bad updates are a common cause of cascading failures.

SLIDE 38

Rolling Upgrades

Be able to update different components with no downtime.
Backwards compatibility is extremely important.
Roll back if things go bad.
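The rollout can be sketched as a loop (hypothetical orchestration code, not a Druid tool; `upgrade`, `healthy`, and `rollback` are placeholders for real deployment tooling):

```python
def rolling_upgrade(nodes, upgrade, healthy, rollback):
    """Upgrade one node at a time; verify health before moving on,
    and roll back that node and stop the rollout if things go bad."""
    for node in nodes:
        previous = node["version"]
        upgrade(node)
        if not healthy(node):
            rollback(node, previous)
            return False
    return True

# Example: the third node fails its health check after upgrading.
cluster = [{"name": f"historical-{i}", "version": "v1"} for i in range(3)]

def upgrade(node): node["version"] = "v2"
def healthy(node): return node["name"] != "historical-2"
def rollback(node, version): node["version"] = version

ok = rolling_upgrade(cluster, upgrade, healthy, rollback)
```

Because only one node is ever mid-upgrade, the cluster keeps serving queries throughout, which is why backwards compatibility between adjacent versions matters.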

SLIDE 39

Rolling Upgrades

[Diagram: one historical upgraded to V2; the remaining historicals and the brokers are still on V1 and continue serving queries.]

SLIDE 40

Rolling Upgrades

[Diagram: all historicals now on V2; the brokers remain on V1 and continue serving queries.]

SLIDE 41

Rolling Upgrades

[Diagram: the rollout continues to the brokers; one broker is upgraded to V2 while a V1 broker still serves client queries.]

SLIDE 42

Best Practices: Operations

SLIDE 43

Monitoring

Detect when things go badly.
Define your critical metrics and their acceptable values.
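"Critical metrics and acceptable values" can be as simple as a threshold table (a sketch; the metric names and limits below are assumed examples, not Druid's emitted metric names):

```python
# Assumed example metrics and acceptable limits.
THRESHOLDS = {
    "query_p99_ms": 500,         # 99th-percentile query latency
    "segment_scan_pending": 100,  # queued segment scans on a historical
    "disk_used_pct": 85,          # local segment cache disk usage
}

def check(metrics: dict) -> list:
    """Return the names of critical metrics exceeding their limits."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]
```

Breached thresholds feed the alerting on the next slide; everything else is recorded for exploratory analysis.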

SLIDE 44

Alerts

Alert on critical errors

  • Out of disk space, out of cluster capacity, etc.

Design alerts to reduce “noise”

  • Distinguish warnings and alerts
SLIDE 45

Exploratory Analytics

Extremely critical for diagnosing root causes quickly.
Not many organizations do this.

SLIDE 46

Takeaways

Everything is going to fail!

  • Use replication for single server failures
  • Use fast recovery for multi-server failures (when you don’t want to set up another data center)

  • Use multi-datacenter replication when availability really matters
  • Alerting, monitoring, and exploratory analysis are critical
SLIDE 47

Thanks!

@implydata @druidio @fangjin imply.io druid.io