Things break. riak bends.
Justin Sheehy
justin@basho.com
Perfection is Unattainable
A system cannot perform as well during a storm of component failure as it can on a sunny day.
Know How You Degrade
You might prevent whole system failure if you’re lucky and good, but what happens during partial failure? Plan it and understand it before your users do.
You think you know which parts will break.
You are wrong.
Harvest and yield are in tension with each other: (harvest * yield) ~ constant
Goal: failures cause a known, linear reduction to one of these.
plan ahead, know when you care!
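A small arithmetic sketch of the harvest/yield trade-off, under stated assumptions: yield is taken as the fraction of requests answered and harvest as the fraction of the data reflected in an answer. The function name `degrade`, the policy names, and the evenly partitioned cluster where each request touches one partition are all hypothetical, chosen only to show a linear reduction in one of the two numbers.

```python
def degrade(nodes_total: int, nodes_failed: int, policy: str):
    """Return (harvest, yield) for a hypothetical evenly partitioned cluster.

    Assumption: data is spread evenly over the nodes and each request
    needs exactly one partition, so losing k of n nodes loses k/n of both.
    """
    alive = (nodes_total - nodes_failed) / nodes_total
    if policy == "keep_yield":
        # Answer every request, but only from the partitions still reachable.
        return alive, 1.0
    if policy == "keep_harvest":
        # Only give complete answers; requests that need lost data fail.
        return 1.0, alive
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    for policy in ("keep_yield", "keep_harvest"):
        harvest, yld = degrade(10, 2, policy)
        print(policy, "harvest =", harvest, "yield =", yld)
```

Either policy gives the same product for the same failure. The choice is which of the two numbers takes the hit, and that is the part to decide ahead of time.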
Assume that failures will happen.
Designing whole systems and components with individual failures in mind is a plan for predictable success.
Layered, multi-scale resilience is key!
Worst case: whole DB corrupted!
Typical mitigation: write-ahead logging for repair
Drawbacks: logging adds I/O, repair can be slow
Alternative: append-only main storage ("log-structured" databases)
Example: bitcask
simple append-only file format, as a component of something bigger
What about a half-written write?
Two problems: detection, minimization.
Detection: minimum-length check, CRC-check per record
Minimization: invalidate only the end-failed record, not the file
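The detection and minimization steps can be sketched as follows. The record layout here (CRC32, key length, value length, key, value) is a simplified assumption for illustration and not bitcask's actual on-disk format. `encode_record` and `scan` are hypothetical names.

```python
import struct
import zlib

# Hypothetical header: crc32, key length, value length (big-endian uint32s).
HEADER = struct.Struct(">III")


def encode_record(key: bytes, value: bytes) -> bytes:
    """Serialize one record; the CRC covers the lengths, key and value."""
    body = struct.pack(">II", len(key), len(value)) + key + value
    return struct.pack(">I", zlib.crc32(body)) + body


def scan(log: bytes):
    """Return (records, valid_length), stopping at the first bad record.

    In an append-only file a half-written write can only be the last
    record, so everything before valid_length stays usable.
    """
    records = []
    offset = 0
    while offset < len(log):
        # Minimum-length check: is there room for a whole header?
        if len(log) - offset < HEADER.size:
            break
        crc, klen, vlen = HEADER.unpack_from(log, offset)
        end = offset + HEADER.size + klen + vlen
        if end > len(log):
            break  # body was cut short
        # CRC check per record: recompute over everything after the CRC.
        if zlib.crc32(log[offset + 4:end]) != crc:
            break
        key_start = offset + HEADER.size
        records.append((log[key_start:key_start + klen],
                        log[key_start + klen:end]))
        offset = end
    return records, offset


if __name__ == "__main__":
    good = encode_record(b"a", b"1") + encode_record(b"b", b"2")
    torn = good + encode_record(b"c", b"3")[:-2]  # simulate a torn write
    records, valid = scan(torn)
    print(records, valid, len(torn))
```

A caller could truncate the file to `valid_length` or simply ignore the tail. Either way only the end-failed record is lost.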
Bugs can lurk anywhere. Unpredictability, eek.
Typical mitigation: complex exception-management
Stronger mitigation: supervision trees and "let it crash"
Added bonus: simpler and clearer code
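A rough sketch of the "let it crash" idea, written in Python rather than Erlang and therefore only an analogy: the worker contains no defensive error handling, and a separate supervisor replaces it with a fresh instance when it fails. `FlakyCounter`, `Supervisor`, and the simple restart counter are hypothetical. Real Erlang/OTP supervisors limit restarts within a time window and can be nested into trees.

```python
class FlakyCounter:
    """Hypothetical worker: sums integer requests, crashes on anything else."""

    def __init__(self):
        self.count = 0

    def handle(self, request):
        if not isinstance(request, int):
            # No recovery logic here: just fail and let the supervisor act.
            raise TypeError("unexpected request")
        self.count += request
        return self.count


class Supervisor:
    """Replaces a crashed worker with a fresh one, up to max_restarts times."""

    def __init__(self, make_worker, max_restarts=3):
        self.make_worker = make_worker
        self.max_restarts = max_restarts
        self.restarts = 0
        self.worker = make_worker()

    def call(self, request):
        try:
            return ("ok", self.worker.handle(request))
        except Exception as exc:
            if self.restarts >= self.max_restarts:
                # Too many crashes: escalate to whoever supervises us.
                raise RuntimeError("restart limit reached") from exc
            self.restarts += 1
            self.worker = self.make_worker()  # start again from known-good state
            return ("error", "crashed")


if __name__ == "__main__":
    sup = Supervisor(FlakyCounter)
    print(sup.call(2))
    print(sup.call("oops"))
    print(sup.call(3))
```

The worker stays short because it handles only the expected case. All recovery policy lives in one place, the supervisor.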
Many storage instances per server. If one fails, whole system is okay.
Also good for operational sanity when adding or removing hosts.
Back to the half-written write: we invalidate only the end-failed record, not the file.
Isn't this still a busted record?
Diagram: three replicas answer {ok, Value}, {ok, Value}, {error, not_found}; the client receives {ok, Value}
helps with nearly any local error:
Diagram: three replicas answer {ok, Value}, {ok, Value}, {error, bad_crc}; the client receives {ok, Value}
Diagram: replicas {ok, Value}, {ok, Value}, {error, bad_crc}; client; {ok, Value}, {ok, Value}
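The diagrams above suggest a read that succeeds as long as enough replicas answer with a value, even when one reports a local error. A minimal sketch, assuming replies are tagged tuples and the most common value wins. That majority vote is a simplification that ignores versioning, and `quorum_read`, the parameter `r`, and the returned repair list are hypothetical rather than riak's API.

```python
from collections import Counter


def quorum_read(replies, r=2):
    """Combine per-replica replies into one answer for the client.

    replies: list of ("ok", value) or ("error", reason), one per replica.
    Returns (result, repair), where repair lists the indexes of replicas
    that did not return the winning value and could be rewritten.
    """
    values = Counter(value for tag, value in replies if tag == "ok")
    value, votes = values.most_common(1)[0] if values else (None, 0)
    if votes < r:
        return ("error", "quorum_not_met"), []
    repair = [i for i, reply in enumerate(replies) if reply != ("ok", value)]
    return ("ok", value), repair


if __name__ == "__main__":
    replies = [("ok", "v1"), ("ok", "v1"), ("error", "bad_crc")]
    print(quorum_read(replies))
```

The client never sees the `bad_crc`. The replica at index 2 is merely flagged so that a good copy can be written back to it.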
Computers fail all the time.
From a distributed system's point of view, a whole server can be seen as "a component."
How can the overall system continue to perform?
Diagram: two replicas answer {ok, Value}, {ok, Value}; the client receives {ok, Value}
Diagram: the client sends PUT Value
sloppy quorums work for reads too!
Diagram: four nodes answer {ok, Value}; in the next step, three do
also a fix for inconsistent view of membership!
still live!
(will catch up later)
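A sketch of how a sloppy quorum can keep writes live when a primary node is down, and how that node catches up later. It assumes a simple ring walked clockwise and a per-write hint naming the node a stand-in is covering for. The Dynamo paper calls this "hinted handoff", a term the slides do not use. `sloppy_put`, `handoff`, and the data shapes are hypothetical.

```python
def sloppy_put(ring, live, start, value, n=3):
    """Write to up to n live nodes, walking the ring clockwise from start.

    Returns {node: (value, hinted_for)}. hinted_for is None on a primary,
    or the name of the unreachable primary a fallback node is covering for.
    """
    order = [ring[(start + i) % len(ring)] for i in range(len(ring))]
    primaries, fallbacks = order[:n], order[n:]
    writes = {p: (value, None) for p in primaries if p in live}
    spares = (f for f in fallbacks if f in live)
    for primary in (p for p in primaries if p not in live):
        stand_in = next(spares, None)
        if stand_in is None:
            break  # not enough live nodes: fewer than n copies get written
        writes[stand_in] = (value, primary)
    return writes


def handoff(writes, live):
    """Move hinted data back to its primary once that primary is live again."""
    result = {}
    for node, (value, hinted_for) in writes.items():
        if hinted_for is not None and hinted_for in live:
            result[hinted_for] = (value, None)
        else:
            result[node] = (value, hinted_for)
    return result


if __name__ == "__main__":
    ring = ["a", "b", "c", "d", "e"]
    writes = sloppy_put(ring, live={"a", "c", "d", "e"}, start=0, value="v")
    print(writes)                                   # "d" stands in for "b"
    print(handoff(writes, live={"a", "b", "c", "d", "e"}))
```

Because the stand-in is picked from whichever nodes are reachable, the same mechanism also tolerates a node that is merely unknown to the writer, not only one that is down.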