Autopsy of an automation disaster


  1. Autopsy of an automation disaster
     Simon J Mudd (Senior Database Engineer)
     Percona Live, 25th April 2017

  2. To err is human
     To really foul things up requires a computer [1] (or a script)
     [1]: http://quoteinvestigator.com/2010/12/07/foul-computer/

  3. Booking.com
     ● Based in Amsterdam since 1996
     ● Online Hotel/Accommodation/Travel Agent (OTA):
       ● +1,200,000 properties in 227 countries
       ● +1,200,000 room nights reserved daily
       ● +40 languages (website and customer service)
       ● +15,700 people working in 187 offices worldwide
     ● Part of the Priceline Group, PCLN on Nasdaq
     ● And we use MySQL:
       ● Thousands (1000s) of servers, ~90% replicating
       ● >150 masters: ~30 with >50 slaves and ~10 with >100 slaves

  4. Session Summary
     1. MySQL replication at Booking.com
     2. Automation disaster: external eye
     3. Chain of events: analysis
     4. Learning / takeaway

  5. MySQL replication at Booking.com
     ● Typical MySQL replication deployment at Booking.com:

        +---+
        | M |
        +---+
          |
          +------+-- ... --+---------------+-------- ...
          |      |         |               |
        +---+  +---+     +---+           +---+
        | S1|  | S2|     | Sn|           | M1|
        +---+  +---+     +---+           +---+
                                           |
                                  +-- ... --+
                                  |         |
                                +---+     +---+
                                | T1|     | Tm|
                                +---+     +---+

  6. MySQL replication at Booking.com'
     ● We use and contribute to Orchestrator:

  7. MySQL replication at Booking.com''
     ● Orchestrator allows us to:
       ● Visualize our replication deployments
       ● Move slaves for planned maintenance of an intermediate master
       ● Automatically replace an intermediate master in case of its unexpected failure
         (thanks to pseudo-GTIDs when we have not deployed GTIDs)

  8. MySQL replication at Booking.com''

  9. MySQL replication at Booking.com''
     ● Orchestrator allows us to:
       ● Visualize our replication deployments
       ● Move slaves for planned maintenance of an intermediate master
       ● Automatically replace an intermediate master in case of its unexpected failure
         (thanks to pseudo-GTIDs where we have not deployed GTIDs widely, though Orchestrator handles both types)
       ● Automatically replace a master in case of a failure (failing over to a slave)

  10. MySQL replication at Booking.com''

  11. MySQL replication at Booking.com''
      ● Orchestrator allows us to:
        ● Visualize our replication deployments
        ● Move slaves for planned maintenance of an intermediate master
        ● Automatically replace an intermediate master in case of its unexpected failure
          (thanks to pseudo-GTIDs when we have not deployed GTIDs)
        ● Automatically replace a master in case of a failure (failing over to a slave)
      ● But Orchestrator cannot replace a master alone:
        ● Booking.com uses DNS for master discovery
        ● So Orchestrator calls custom hooks (a script) to repoint DNS and to do other magic
          (a minimal sketch of such a hook follows this slide)
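      What such a hook can look like, as a minimal sketch only: this is not Booking.com's
      actual script, the helper name, DNS names and argument layout are invented for
      illustration, and it assumes Orchestrator has been configured to run the script on
      master recovery and to pass it the cluster alias and the promoted host.

          #!/usr/bin/env python
          """Hypothetical Orchestrator master-failover hook: repoint the master DNS name."""
          import sys

          def set_master_record(alias, target_host):
              # Placeholder for a call to an internal DNS service (hypothetical API).
              print("pointing %s at %s" % (alias, target_host))

          def main(argv):
              # Assumed argument layout: <cluster-alias> <promoted-host>
              cluster_alias, promoted_host = argv[1], argv[2]
              master_alias = "%s-master.example.com" % cluster_alias   # invented naming
              set_master_record(master_alias, promoted_host)
              # The "other magic" mentioned on the slide (whatever else the
              # environment needs after a master move) would go here.

          if __name__ == "__main__":
              main(sys.argv)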

  12. MySQL replication at Booking.com''
      ● We also contribute to Orchestrator:
        ● To allow for better scaling
        ● To improve integration with our own tooling
        ● To ensure that it can work on all our systems (MySQL, MariaDB) with GTID or pseudo-GTID
        ● To ensure it can provide us an HA service
      ● Shlomi has not stopped improving it, and others contribute too

  13. MySQL replication at Booking.com''
      ● So it can handle this:

  14. Our subject database
      ● Simple replication deployment (in two data centers):

        DNS (master)       +---+
        points here -->    | A |
                           +---+
                             |
                  +----------+------------+
                  |                       |
        Reads           +---+           +---+
        happen here --> | B |           | X |
                        +---+           +---+
                                          |
                                        +---+     And reads
                                        | Y |  <-- happen here
                                        +---+

  15. Split brain: 1st event
      ● A and B (two servers in the same data center) fail at the same time:

        DNS (master)       +\-/+
        points here -->    | A |
        but accesses       +/-\+
        are now failing

        Reads              +\-/+        +---+
        happen here -->    | B |        | X |
        but accesses       +/-\+        +---+
        are now failing                   |
                                        +---+     And reads
                                        | Y |  <-- happen here
                                        +---+

        (I will cover how/why this happened later.)

  16. Split brain: 1st event'
      ● Orchestrator fixes things:

                           +\-/+
                           | A |
                           +/-\+

        Reads              +\-/+        +---+     Now, DNS (master)
        happen here -->    | B |        | X |  <-- points here
        but accesses       +/-\+        +---+
        are now failing                   |
                                        +---+     Reads
                                        | Y |  <-- happen here
                                        +---+

  17. Split brain: disaster
      ● A few things happen overnight and we wake up to this:

                           +\-/+
                           | A |
                           +/-\+

        DNS                +---+        +---+
        points here -->    | B |        | X |
                           +---+        +---+
                                          |
                                        +---+
                                        | Y |
                                        +---+

  18. Split brain: disaster’ ● And to make things worse, reads are still happening on Y: +\-/+ | A | +/-\+ DNS (master) +---+ +---+ points here --> | B | | X | +---+ +---+ | +---+ Reads | Y | <-- happen here +---+ 18

  19. Split brain: disaster’’ ● This is not good: ● When A and B failed, X was promoted as the new master ● Something made DNS point to B (we will see what later) à writes are now happening on B ● But B is outdated: all writes to X (after the failure of A) did not reach B +\-/+ ● So we have data on X that cannot be read on B | A | +/-\+ ● And we have new data on B that is not read on Y DNS (master) +---+ +---+ points here --> | B | | X | +---+ +---+ | +---+ Reads | Y | <-- happen here 19 +---+

  20. Split-brain: analysis
      ● Digging more into the chain of events, we find that:
        ● After the 1st failure of A, a 2nd failure was detected and Orchestrator promoted B
        ● So after their failures, A and B came back and formed an isolated replication chain

                           +\-/+
                           | A |
                           +/-\+
                           +\-/+        +---+     DNS (master)
                           | B |        | X |  <-- points here
                           +/-\+        +---+
                                          |
                                        +---+     Reads
                                        | Y |  <-- happen here
                                        +---+

  21. Split-brain: analysis
      ● Digging more into the chain of events, we find that:
        ● After the 1st failure of A, a 2nd failure was detected and Orchestrator promoted B
        ● So after their failures, A and B came back and formed an isolated replication chain
        ● And something caused a failure of A

                           +---+
                           | A |
                           +---+
                             |
                           +---+        +---+     DNS (master)
                           | B |        | X |  <-- points here
                           +---+        +---+
                                          |
                                        +---+     Reads
                                        | Y |  <-- happen here
                                        +---+

  22. Split-brain: analysis
      ● Digging more into the chain of events, we find that:
        ● After the 1st failure of A, a 2nd failure was detected and Orchestrator promoted B
        ● So after their failures, A and B came back and formed an isolated replication chain
        ● And something caused a failure of A
        ● But how did DNS end up pointing to B?
          ● The failover to B called the DNS repointing script
          ● The script stole the DNS entry from X and pointed it to B

                           +\-/+
                           | A |
                           +/-\+
                           +---+        +---+     DNS (master)
                           | B |        | X |  <-- points here
                           +---+        +---+
                                          |
                                        +---+     Reads
                                        | Y |  <-- happen here
                                        +---+

  23. Split-brain: analysis
      ● Digging more into the chain of events, we find that:
        ● After the 1st failure of A, a 2nd failure was detected and Orchestrator promoted B
        ● So after their failures, A and B came back and formed an isolated replication chain
        ● And something caused a failure of A
        ● But how did DNS end up pointing to B?
          ● The failover to B called the DNS repointing script
          ● The script stole the DNS entry from X and pointed it to B
            (see the sketch after this slide)
        ● But is that all: what made A fail?

                           +\-/+
                           | A |
                           +/-\+
        DNS (master)       +---+        +---+
        points here -->    | B |        | X |
                           +---+        +---+
                                          |
                                        +---+     Reads
                                        | Y |  <-- happen here
                                        +---+
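      The dangerous step, sketched under the same assumptions as the earlier hook example
      (all helper names are hypothetical): the repointing script reacted only to "a master
      was promoted" and never asked where the master record currently pointed. A guard
      along these lines would have refused to steal the entry from X:

          def set_master_record(alias, target_host):
              # Same hypothetical DNS helper as in the earlier hook sketch.
              print("pointing %s at %s" % (alias, target_host))

          def repoint_master(alias, failed_host, promoted_host, current_target):
              # current_target: the host the master alias resolves to right now (here: X).
              # Without this check, the script happily moves DNS from X to B even though
              # X, not the host that just failed, is the current master.
              if current_target != failed_host:
                  raise RuntimeError(
                      "refusing to repoint %s: it points at %s, not at the failed %s"
                      % (alias, current_target, failed_host))
              set_master_record(alias, promoted_host)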

  24. Split-brain: analysis’ ● What made A fail ? ● Once A and B came back up as a new replication chain, they had outdated data ● If B would have come back before A, it could have been re-cloned under X +---+ | A | +---+ | +---+ +---+ DNS (master) | B | | X | <-- points here +---+ +---+ | +---+ Reads | Y | <-- happen here 24 +---+

  25. Split-brain: analysis’ ● What made A fail ? ● Once A and B came back up as a new replication chain, they had outdated data ● If B would have come back before A, it could have been re-cloned under X ● But as A came back before re-cloning, it injected heartbeat and p-GTID into B +\-/+ | A | +/-\+ +---+ +---+ DNS (master) | B | | X | <-- points here +---+ +---+ | +---+ Reads | Y | <-- happen here 25 +---+

  26. Split-brain: analysis’ ● What made A fail ? ● Once A and B came back up as a new replication chain, they had outdated data ● If B would have come back before A, it could have been re-cloned under X ● But as A came back before re-cloning, it injected heartbeat and p-GTID into B ● Then B could have been re-cloned without problems +---+ | A | +---+ | +---+ +---+ DNS (master) | B | | X | <-- points here +---+ +---+ | +---+ Reads | Y | <-- happen here 26 +---+

  27. Split-brain: analysis’ ● What made A fail ? ● Once A and B came back up as a new replication chain, they had outdated data ● If B would have come back before A, it could have been re-cloned under X ● But as A came back before re-cloning, it injected heartbeat and p-GTID into B ● Then B could have been re-cloned without problems +---+ ● | A | But A was re-cloned instead ( human error #1 ) +---+ +\-/+ +---+ DNS (master) | B | | X | <-- points here +/-\+ +---+ | +---+ Reads | Y | <-- happen here 27 +---+
