SLIDE 1

Autopsy of an automation disaster

Simon J Mudd (Senior Database Engineer) Percona Live, 25th April 2017

SLIDE 2

To err is human To really foul things up requires a computer[1] (or a script)

[1]: http://quoteinvestigator.com/2010/12/07/foul-computer/

SLIDE 3

Booking.com

  • Based in Amsterdam since 1996
  • Online Hotel/Accommodation/Travel Agent (OTA):
    • +1,200,000 properties in 227 countries
    • +1,200,000 room nights reserved daily
    • +40 languages (website and customer service)
    • +15,700 people working in 187 offices worldwide
  • Part of the Priceline Group, PCLN on Nasdaq
  • And we use MySQL:
    • Thousands (1000s) of servers, ~90% replicating
    • >150 masters: ~30 with more than 50 slaves & ~10 with more than 100 slaves

SLIDE 4

Session Summary

  1. MySQL replication at Booking.com
  2. Automation disaster: external eye
  3. Chain of events: analysis
  4. Learning / takeaway

SLIDE 5

MySQL replication at Booking.com

  • Typical MySQL replication deployment at Booking.com:

                  +---+
                  | M |
                  +---+
                    |
     +------+-- ... --+---------------+-------- ...
     |      |         |               |
   +---+  +---+     +---+           +---+
   | S1|  | S2|     | Sn|           | M1|
   +---+  +---+     +---+           +---+
                                      |
                               +-- ... --+
                               |         |
                             +---+     +---+
                             | T1|     | Tm|
                             +---+     +---+

SLIDE 6

MySQL replication at Booking.com’

  • We use and contribute to Orchestrator:

SLIDE 7

MySQL replication at Booking.com’’

  • Orchestrator allows us to:
  • Visualize our replication deployments
  • Move slaves for planned maintenance of an intermediate master
  • Automatically replace an intermediate master in case of its unexpected failure

(thanks to pseudo-GTIDs when we have not deployed GTIDs)
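The pseudo-GTID idea can be sketched in a few lines of Python. This is a hypothetical simplification (the marker text, the list-of-statements log representation, and the helper names are illustrative, not orchestrator's actual code): a uniquely identifiable statement is injected into the binary log periodically, so two servers' positions can be matched by locating the same marker in both logs.

```python
# Toy model of pseudo-GTID position matching (illustrative, not orchestrator code).

def last_marker(binlog):
    """Index of the most recent pseudo-GTID marker in a list of binlog events."""
    for i in range(len(binlog) - 1, -1, -1):
        if binlog[i].startswith("drop view if exists _pseudo_gtid_"):
            return i
    raise ValueError("no pseudo-GTID marker found")

def match_position(slave_log, master_log):
    """Find the master-log index equivalent to the end of slave_log.

    Locate the slave's latest marker in the master's log, then skip the
    same number of events the slave executed after that marker.
    """
    i = last_marker(slave_log)
    marker = slave_log[i]
    j = master_log.index(marker)      # the unique statement appears once per log
    return j + (len(slave_log) - i)   # events already applied after the marker

# Toy logs: the master is one event ahead of the slave.
master = ["insert 1", "drop view if exists _pseudo_gtid_42", "insert 2", "insert 3"]
slave  = ["insert 1", "drop view if exists _pseudo_gtid_42", "insert 2"]
print(match_position(slave, master))  # resume replication at master event 3
```

The point is that any uniquely identifiable statement that replicates can serve as a synthetic GTID; orchestrator does this matching against real binary/relay logs.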

SLIDE 8

MySQL replication at Booking.com’’

SLIDE 9

MySQL replication at Booking.com’’

  • Orchestrator allows us to:
  • Visualize our replication deployments
  • Move slaves for planned maintenance of an intermediate master
  • Automatically replace an intermediate master in case of its unexpected failure

(thanks to pseudo-GTIDs where we have not deployed GTIDs widely, though Orchestrator handles both types)

  • Automatically replace a master in case of a failure (failing over to a slave)

SLIDE 10

MySQL replication at Booking.com’’

SLIDE 11

MySQL replication at Booking.com’’

  • Orchestrator allows us to:
  • Visualize our replication deployments
  • Move slaves for planned maintenance of an intermediate master
  • Automatically replace an intermediate master in case of its unexpected failure

(thanks to pseudo-GTIDs when we have not deployed GTIDs)

  • Automatically replace a master in case of a failure (failing over to a slave)
  • But Orchestrator cannot replace a master alone:
  • Booking.com uses DNS for master discovery
  • So Orchestrator calls custom hooks (a script) to repoint DNS (and to do other magic)
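A DNS repointing hook of this kind can be written defensively, which (as we will see) matters later in this story. The sketch below is hypothetical: `resolve` and `repoint` stand in for real DNS provider API calls, and the check shown is one possible guard, not the actual Booking.com script. The idea: refuse to move the record unless it still points at the master being replaced.

```python
# Hypothetical defensive DNS repointing hook (illustrative helper names).

def repoint_master_dns(record, failed_master, new_master, resolve, repoint):
    current = resolve(record)
    if current != failed_master:
        # The record was already moved (e.g. by an earlier failover):
        # do not steal it from a server we were not asked to replace.
        raise RuntimeError(
            f"refusing to repoint {record}: it points at {current}, "
            f"not the failed master {failed_master}"
        )
    repoint(record, new_master)

# Toy in-memory "DNS": the record points at X, but this failover believes
# it is replacing A -- the defensive hook refuses instead of stealing the entry.
dns = {"db-master.example.com": "X"}
try:
    repoint_master_dns("db-master.example.com", "A", "B",
                       resolve=dns.get,
                       repoint=dns.__setitem__)
except RuntimeError as e:
    print(e)
```

With this guard in place, the second failover described later would have failed loudly instead of silently repointing writes at an outdated server.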

SLIDE 12

MySQL replication at Booking.com’’

  • We also contribute to Orchestrator
  • To allow for better scaling
  • To improve integration with our own tooling
  • To ensure that it can work on all our systems (MySQL, MariaDB) with GTID or pseudo-GTID
  • To ensure it can provide us with an HA service
  • Shlomi has not stopped improving it and others contribute too

SLIDE 13

MySQL replication at Booking.com’’

  • So it can handle this:

SLIDE 14

Our subject database

  • Simple replication deployment (in two data centers):

DNS (master)    +---+
points here --> | A |
                +---+
                  |
       +----------+-----------+
       |                      |
Reads          +---+        +---+
happen here -> | B |        | X |
               +---+        +---+
                              |
                            +---+     And reads
                            | Y | <-- happen here
                            +---+

SLIDE 15

Split brain: 1st event

  • A and B (two servers in same data center) fail at the same time:

DNS (master)     +\-/+
points here -->  | A |
but accesses     +/-\+
are now failing
Reads            +\-/+      +---+
happen here -->  | B |      | X |
but accesses     +/-\+      +---+
are now failing               |
                            +---+     And reads
                            | Y | <-- happen here
                            +---+

(I will cover how/why this happened later.)

SLIDE 16

Split brain: 1st event’

  • Orchestrator fixes things:

                 +\-/+
                 | A |
                 +/-\+
Reads            +\-/+      +---+     Now, DNS (master)
happen here -->  | B |      | X | <-- points here
but accesses     +/-\+      +---+
are now failing               |
                            +---+     Reads
                            | Y | <-- happen here
                            +---+

SLIDE 17

Split brain: disaster

  • A few things happen overnight and we wake up to this:

                 +\-/+
                 | A |
                 +/-\+
DNS              +---+      +---+
points here -->  | B |      | X |
                 +---+      +---+
                              |
                            +---+
                            | Y |
                            +---+

SLIDE 18

Split brain: disaster’

  • And to make things worse, reads are still happening on Y:

                 +\-/+
                 | A |
                 +/-\+
DNS (master)     +---+      +---+
points here -->  | B |      | X |
                 +---+      +---+
                              |
                            +---+     Reads
                            | Y | <-- happen here
                            +---+

SLIDE 19

Split brain: disaster’’

  • This is not good:
  • When A and B failed, X was promoted as the new master
  • Something made DNS point to B (we will see what later)

→ writes are now happening on B

  • But B is outdated: all writes to X (after the failure of A) did not reach B
  • So we have data on X that cannot be read on B
  • And we have new data on B that is not read on Y

                 +\-/+
                 | A |
                 +/-\+
DNS (master)     +---+      +---+
points here -->  | B |      | X |
                 +---+      +---+
                              |
                            +---+     Reads
                            | Y | <-- happen here
                            +---+

SLIDE 20

Split-brain: analysis

  • Digging more in the chain of events, we find that:
  • After the 1st failure of A, a 2nd failure was detected and Orchestrator promoted B
  • So after their failures, A and B came back and formed an isolated replication chain

                 +\-/+
                 | A |
                 +/-\+
                 +\-/+      +---+     DNS (master)
                 | B |      | X | <-- points here
                 +/-\+      +---+
                              |
                            +---+     Reads
                            | Y | <-- happen here
                            +---+

SLIDE 21

Split-brain: analysis

  • Digging more in the chain of events, we find that:
  • After the 1st failure of A, a 2nd failure was detected and Orchestrator promoted B
  • So after their failures, A and B came back and formed an isolated replication chain
  • And something caused a failure of A

                 +---+
                 | A |
                 +---+
                   |
                 +---+      +---+     DNS (master)
                 | B |      | X | <-- points here
                 +---+      +---+
                              |
                            +---+     Reads
                            | Y | <-- happen here
                            +---+

SLIDE 22

Split-brain: analysis

  • Digging more in the chain of events, we find that:
  • After the 1st failure of A, a 2nd failure was detected and Orchestrator promoted B
  • So after their failures, A and B came back and formed an isolated replication chain
  • And something caused a failure of A
  • But how did DNS end up pointing to B?
  • The failover to B called the DNS repointing script
  • The script stole the DNS entry from X

and pointed it to B

                 +\-/+
                 | A |
                 +/-\+
                 +---+      +---+     DNS (master)
                 | B |      | X | <-- points here
                 +---+      +---+
                              |
                            +---+     Reads
                            | Y | <-- happen here
                            +---+

SLIDE 23

Split-brain: analysis

  • Digging more in the chain of events, we find that:
  • After the 1st failure of A, a 2nd failure was detected and Orchestrator promoted B
  • So after their failures, A and B came back and formed an isolated replication chain
  • And something caused a failure of A
  • But how did DNS end up pointing to B?
  • The failover to B called the DNS repointing script
  • The script stole the DNS entry from X

and pointed it to B

  • But is that all: what made A fail?

                 +\-/+
                 | A |
                 +/-\+
DNS (master)     +---+      +---+
points here -->  | B |      | X |
                 +---+      +---+
                              |
                            +---+     Reads
                            | Y | <-- happen here
                            +---+

SLIDE 24

Split-brain: analysis’

  • What made A fail?
  • Once A and B came back up as a new replication chain, they had outdated data
  • If B had come back before A, it could have been re-cloned under X

                 +---+
                 | A |
                 +---+
                   |
                 +---+      +---+     DNS (master)
                 | B |      | X | <-- points here
                 +---+      +---+
                              |
                            +---+     Reads
                            | Y | <-- happen here
                            +---+

SLIDE 25

Split-brain: analysis’

  • What made A fail?
  • Once A and B came back up as a new replication chain, they had outdated data
  • If B had come back before A, it could have been re-cloned under X
  • But as A came back before re-cloning, it injected heartbeat and p-GTID into B

                 +\-/+
                 | A |
                 +/-\+
                 +---+      +---+     DNS (master)
                 | B |      | X | <-- points here
                 +---+      +---+
                              |
                            +---+     Reads
                            | Y | <-- happen here
                            +---+

SLIDE 26

Split-brain: analysis’

  • What made A fail?
  • Once A and B came back up as a new replication chain, they had outdated data
  • If B had come back before A, it could have been re-cloned under X
  • But as A came back before re-cloning, it injected heartbeat and p-GTID into B
  • Then B could have been re-cloned without problems

                 +---+
                 | A |
                 +---+
                   |
                 +---+      +---+     DNS (master)
                 | B |      | X | <-- points here
                 +---+      +---+
                              |
                            +---+     Reads
                            | Y | <-- happen here
                            +---+

SLIDE 27

Split-brain: analysis’

  • What made A fail?
  • Once A and B came back up as a new replication chain, they had outdated data
  • If B had come back before A, it could have been re-cloned under X
  • But as A came back before re-cloning, it injected heartbeat and p-GTID into B
  • Then B could have been re-cloned without problems
  • But A was re-cloned instead (human error #1)

                 +---+
                 | A |
                 +---+
                 +\-/+      +---+     DNS (master)
                 | B |      | X | <-- points here
                 +/-\+      +---+
                              |
                            +---+     Reads
                            | Y | <-- happen here
                            +---+

SLIDE 28

Split-brain: analysis’

  • What made A fail?
  • Once A and B came back up as a new replication chain, they had outdated data
  • If B had come back before A, it could have been re-cloned under X
  • But as A came back before re-cloning, it injected heartbeat and p-GTID into B
  • Then B could have been re-cloned without problems
  • But A was re-cloned instead (human error #1)
  • Why did Orchestrator not fail over right away?
  • B was promoted hours after A was brought down…
  • Because A was downtimed only for 4 hours (human error #2)

                 +\-/+
                 | A |
                 +/-\+
                 +---+      +---+     DNS (master)
                 | B |      | X | <-- points here
                 +---+      +---+
                              |
                            +---+     Reads
                            | Y | <-- happen here
                            +---+

SLIDE 29

Orchestrator anti-flapping

  • Orchestrator has a failover throttling/acknowledgment mechanism[1]:
  • Automated recovery will happen
  • for an instance in a cluster that has not recently been recovered
  • unless such recent recoveries were acknowledged
  • In our case:
  • the anti-flap timeout had expired
  • the downtime timeout had expired
  • maybe Orchestrator should not have failed over the second time

[1]: https://github.com/github/orchestrator/blob/master/docs/topology-recovery.md#blocking-acknowledgments-anti-flapping
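The throttling rule described above can be sketched as follows. This is a hypothetical simplification: `BLOCK_SECONDS`, `Recovery`, and `recovery_allowed` are illustrative names, and orchestrator's real logic (per-cluster configuration, acknowledgment workflow) is richer.

```python
# Toy model of failover anti-flapping: block automated recovery for a cluster
# that was recently recovered, unless that recovery was acknowledged.

from dataclasses import dataclass

BLOCK_SECONDS = 3600  # assumed anti-flapping window


@dataclass
class Recovery:
    cluster: str
    at: float            # unix timestamp of the recovery
    acknowledged: bool


def recovery_allowed(cluster, now, history):
    for r in history:
        if (r.cluster == cluster and not r.acknowledged
                and now - r.at < BLOCK_SECONDS):
            return False  # recent unacknowledged recovery: block this one
    return True


history = [Recovery("cluster-a", at=1000.0, acknowledged=False)]
print(recovery_allowed("cluster-a", now=1500.0, history=history))   # False: blocked
print(recovery_allowed("cluster-a", now=10000.0, history=history))  # True: window expired
```

In our incident the window had expired and the first recovery was treated as settled, so the second failover was allowed; a longer window or a required acknowledgment would have blocked it.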

SLIDE 30

Split brain: summary

  • So in summary, this disaster was caused by:

1. A fancy failure: 2 servers failing in the same data center at the same time
2. Probably too short a timeout for recent failovers
3. Edge-case recovery: both servers forming a new replication topology
4. Re-cloning of the wrong server (A instead of B)
5. Too short a downtime for the re-cloning
6. DNS repointing script not defensive enough

SLIDE 31

Fancy failure: more details

  • Why did A and B fail at the same time?
  • Deployment error: the two servers in the same rack/failure domain?
  • And/or very unlucky?
  • Very unlucky because…

10 to 20 servers failed that day in the same data center, because of human operations and sensitive hardware.

DNS (master)      +\-/+
points here --->  | A |
but now accesses  +/-\+
are failing
                  +\-/+     +---+
                  | B |     | X |
                  +/-\+     +---+
                              |
                            +---+
                            | Y |
                            +---+

SLIDE 32

How to fix such a situation?

  • Fixing non-intersecting data on B and X is hard.
  • Some solutions are:
  • Kill B or X (and lose data)
  • Replay writes from B on X (manually or with replication)
  • But AUTO_INCREMENTs are in the way:
  • up to i used on A before the 1st failover
  • i-n to j1 used on X after recovery
  • i to j2 used on B after the 2nd failover
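A toy illustration of why those overlapping AUTO_INCREMENT ranges make replay hard (the numbers are hypothetical): both sides handed out ids from roughly the same starting point, so the same primary key can name two different rows.

```python
# After the split brain, X and B both continue allocating AUTO_INCREMENT ids
# from around the last value used on A -- producing overlapping key ranges.

def next_ids(start, count):
    """Simulate an AUTO_INCREMENT counter handing out `count` ids."""
    return list(range(start, start + count))

ids_on_X = next_ids(100, 5)   # X, promoted first, continues from ~i
ids_on_B = next_ids(101, 5)   # B, promoted later, also continues from ~i

conflicts = sorted(set(ids_on_X) & set(ids_on_B))
print(conflicts)  # [101, 102, 103, 104]: same PKs, different rows on each side
```

Replaying B's writes on X therefore hits duplicate-key errors on every conflicting id, and each conflict needs a human decision about which row wins.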

                 +\-/+
                 | A |
                 +/-\+
DNS (master)     +---+      +---+
points here -->  | B |      | X |
                 +---+      +---+
                              |
                            +---+     Reads
                            | Y | <-- happen here
                            +---+

SLIDE 33

Takeaway

  • Twisted situations happen
  • Automation is not simple:
  • code automation scripts defensively
  • Handling unexpected failures is really hard
  • Need to understand interactions between different tools in great detail
  • Human handling of infrequent events can be problematic
  • Downtime more rather than less, or only turn off the downtime state once the system is working

SLIDE 34

Takeaway

  • More testing of failure scenarios, though this will never be complete
  • Shut down slaves first
  • Use something other than auto-increment columns for PKs
  • monotonically increasing UUIDs[1][2]?
  • We actually have an auto-increment service but were not using it

[1]: https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/
[2]: http://mysql.rjweb.org/doc.php/uuid
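The "monotonically increasing UUID" idea from the linked posts can be sketched as below. Assumptions are flagged in the comments: the field rearrangement shown mirrors what the Percona post describes (and what MySQL 8.0's UUID_TO_BIN(u, 1) swap flag does), and the two example UUIDs are hand-built to make the ordering visible.

```python
# A UUIDv1 stores its timestamp low bits first, so its raw byte order is not
# time order. Rearranging the fields (time_hi, time_mid, time_low first)
# yields a 16-byte key that increases with time -- friendlier as an InnoDB PK.

import uuid

def ordered_uuid(u: uuid.UUID) -> bytes:
    """Rearrange a UUIDv1 into a roughly time-ordered 16-byte key."""
    h = u.hex  # 32 hex chars: time_low(8) time_mid(4) time_hi_and_version(4) rest(16)
    return bytes.fromhex(h[12:16] + h[8:12] + h[0:8] + h[16:])

# Two hand-built v1-style UUIDs: "later" has the larger embedded timestamp,
# yet sorts before "earlier" in raw byte order.
earlier = uuid.UUID("ffffffff-ffff-11e7-aaaa-aaaaaaaaaaaa")
later   = uuid.UUID("00000000-0000-11e8-aaaa-aaaaaaaaaaaa")

print(later.bytes < earlier.bytes)                  # True: raw order is wrong
print(ordered_uuid(earlier) < ordered_uuid(later))  # True: rearranged order is right
```

Unlike per-server AUTO_INCREMENT counters, keys like these generated on B and X during a split brain would not have collided.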

SLIDE 35

Improvements in Orchestrator

  • Orchestrator mainly did what it was supposed to do
  • Automation around Orchestrator had unintended effects
  • Should Orchestrator be changed?
  • It depends…
  • Not easy to define what should be changed
  • Not everyone wants the same thing (for legitimate reasons)
  • Suggestions welcome for describing problems seen or suggesting solutions

https://github.com/github/orchestrator/issues

  • In any case, Orchestrator is being changed and is getting better

SLIDE 36

Links

  • Booking.com
  • Blog: https://blog.booking.com/
  • Careers: https://workingatbooking.com/
  • Orchestrator
  • On Github: https://github.com/github/orchestrator/
  • Myself:
  • Blog: http://blog.wl0.org/
  • https://www.linkedin.com/in/simon-j-mudd-397bb01/

SLIDE 37

Oh, and Booking.com is hiring!

  • Almost any role:
  • MySQL Engineer / DBA
  • System Administrator
  • System Engineer
  • Site Reliability Engineer
  • Developer
  • Designer
  • Technical Team Lead
  • Product Owner
  • Data Scientist
  • And many more…
  • https://workingatbooking.com/

SLIDE 38

Questions?

SLIDE 39

Thanks

Simon J Mudd simon.mudd@booking.com