How not to screw up when building HA cluster, FOSDEM PGDay 2019


slide-1
SLIDE 1

How not to screw up when building HA cluster

FOSDEM PGDay 2019, Brussels
Alexander Kukushkin

01-02-2019

slide-2
SLIDE 2

ABOUT ME

Alexander Kukushkin
  • Database Engineer @ZalandoTech
  • The Patroni guy
  • alexander.kukushkin@zalando.de
  • Twitter: @cyberdemn

slide-3
SLIDE 3

WE BRING FASHION TO PEOPLE IN 17 COUNTRIES
  • 17 markets
  • 7 fulfillment centers
  • 23 million active customers
  • 4.5 billion € net sales 2017
  • 200 million visits per month
  • 15,000 employees in Europe

slide-4
SLIDE 4

FACTS & FIGURES
  • > 300 databases on premise
  • > 650 clusters in the Cloud (AWS)

slide-5
SLIDE 5

AGENDA
  • What is High Availability?
  • What HA will not solve?
  • Disaster recovery
  • Automatic failover done right
  • Examples of real incidents
  • Wrap it up

slide-6
SLIDE 6

What is High Availability?

slide-7
SLIDE 7

Availability

slide-8
SLIDE 8

Causes of Downtime
  • Scheduled downtime (often excluded from availability)
    ○ Hardware/BIOS/Firmware upgrade
    ○ Software update
  • Unscheduled downtime
    ○ Datacenter failure (natural disasters, fire, power outage)
    ○ Network splits
    ○ Hardware failure (CPU, network card, disk controller, disk)
    ○ Software/Data corruption (bugs in application/OS code)
    ○ User error (rm -fr $PGDATA, DROP/TRUNCATE table, UPDATE/DELETE without WHERE clause)

slide-9
SLIDE 9

Availability                        Downtime per
                                    Year      Month      Week       Day
99% (“Two nines”)                   3.65 d    7.31 h     1.68 h     14.4 m
99.9% (“Three nines”)               8.77 h    43.83 m    10.08 m    1.44 m
99.95% (“Three and a half nines”)   4.38 h    21.92 m    5.04 m     43.2 s
99.99% (“Four nines”)               52.6 m    4.38 m     1.01 m     8.64 s
99.999% (“Five nines”)              5.26 m    26.3 s     6.05 s     864 ms
99.9999% (“Six nines”)              31.56 s   2.63 s     604.8 ms   86.4 ms
99.99999% (“Seven nines”)           3.16 s    262.98 ms  60.48 ms   8.64 ms
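
The figures in the table can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function name and the period lengths (a 365.25-day year, a month as one twelfth of it) are assumptions chosen because they match the table, not something stated on the slide.

```python
from datetime import timedelta

# Period lengths. A year of 365.25 days and a month of 1/12 year are
# assumptions that happen to reproduce the numbers in the table.
PERIODS = {
    "year": timedelta(days=365.25),
    "month": timedelta(days=365.25 / 12),
    "week": timedelta(weeks=1),
    "day": timedelta(days=1),
}


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Return the longest downtime that still meets the availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return period * unavailable_fraction


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        row = ", ".join(
            f"{name}: {allowed_downtime(target, length).total_seconds():.2f} s"
            for name, length in PERIODS.items()
        )
        print(f"{target}% -> {row}")
```

Each extra nine divides the allowed downtime by ten, which is why the higher targets leave no room for manual intervention.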

slide-10
SLIDE 10

What is HA anyway?
  • No official definition appears to exist!
  • Wikipedia:
    ○ High availability (HA) is a characteristic of a system, which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.

slide-11
SLIDE 11

SLA, SLI, and SLO
  • A Service-Level Agreement (SLA) is an agreement between a service provider and a client.
    ○ Type of service to be provided
    ○ Desired performance level (especially availability, reliability and responsiveness)
    ○ Monitoring process and service level reporting
    ○ Steps for reporting issues
    ○ Response and issue resolution time-frame
  • A Service-Level Indicator (SLI) is a measure of the service level provided by a service provider to a customer
    ○ Availability
    ○ Latency
    ○ Throughput
  • A Service-Level Objective (SLO) is a key element of an SLA; a goal that the service provider wants to reach
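
To make the distinction concrete: an SLI is a number you measure, and an SLO is the threshold you compare it with. The following is a minimal sketch under the assumption that availability is measured as the share of successful requests; the function names are illustrative and do not come from the talk.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Measured availability (the SLI) as a fraction between 0 and 1."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    return good_requests / total_requests


def slo_met(sli: float, slo: float) -> bool:
    """True if the measured indicator reaches the objective."""
    return sli >= slo


if __name__ == "__main__":
    sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
    print(f"SLI = {sli:.4%}")
    print("meets 99.9% SLO: ", slo_met(sli, 0.999))
    print("meets 99.99% SLO:", slo_met(sli, 0.9999))
```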

slide-12
SLIDE 12

Causes of Unscheduled Downtime
  • Hardware failure
  • Network splits
  • Datacenter failure
  • Software failure/Data corruption
  • User error

[Diagram labels grouping the causes above: Automatic failover, Disaster recovery]

slide-13
SLIDE 13

  • What is High Availability?
  • What HA will not solve?
  • Disaster recovery (current section)
  • Automatic failover done right
  • Examples of real incidents
  • Wrap it up

slide-14
SLIDE 14

Disaster recovery
  • Involves a set of policies, tools and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster
  • Recovery point objective (RPO) and recovery time objective (RTO) are two important measurements in disaster recovery and downtime
    ○ A recovery point objective (RPO) is defined by business continuity planning. It is the maximum targeted period in which data (transactions) might be lost from an IT service due to a major incident
    ○ The recovery time objective (RTO) is the targeted duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity

slide-15
SLIDE 15

Disaster recovery

[Figure: RPO/RTO timeline, https://en.wikipedia.org/wiki/File:RPO_RTO_example_converted.png]
[Cost ($) labels on the figure: Data loss price, Service downtime price, Data recovery price, High Availability price]

slide-16
SLIDE 16

RPO, RTO & PostgreSQL
  • Automatic failover won’t help to back up and restore data
    ○ Enable backups and log archiving
        ■ archive_timeout - forces a WAL segment switch, which bounds how long changes can stay unarchived
        ■ pg_receivewal
    ○ Recovery from the backup might take hours
        ■ Consider having a delayed replica (recovery_min_apply_delay)
  • If RTO is higher than 15 minutes, you don’t need automatic failover!
    ○ Unless you are running hundreds of clusters
  • Synchronous replication - to prevent data loss during failover
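
A rough way to reason about these settings is to estimate the worst-case data-loss window each backup strategy leaves and compare it with the RPO. The sketch below is a simplified model, not a statement about PostgreSQL internals: it assumes that without WAL archiving everything since the last backup is lost, and that with archiving the loss is bounded by archive_timeout (ignoring upload time and idle periods). All function names are hypothetical.

```python
from datetime import timedelta
from typing import Optional


def worst_case_data_loss(
    backup_interval: timedelta,
    archive_timeout: Optional[timedelta] = None,
) -> timedelta:
    """Simplified upper bound on lost data if the primary is destroyed.

    Without WAL archiving only the last backup survives, so up to a full
    backup interval of changes can be lost. With archiving, a segment
    switch is forced at least every archive_timeout (model assumption).
    """
    if archive_timeout is None:
        return backup_interval
    return min(backup_interval, archive_timeout)


def meets_rpo(loss_window: timedelta, rpo: timedelta) -> bool:
    """True if the estimated loss window fits within the RPO."""
    return loss_window <= rpo


if __name__ == "__main__":
    rpo = timedelta(minutes=5)
    daily_only = worst_case_data_loss(timedelta(days=1))
    with_archiving = worst_case_data_loss(timedelta(days=1), timedelta(seconds=60))
    print("daily backups only:", daily_only, "meets RPO:", meets_rpo(daily_only, rpo))
    print("with WAL archiving:", with_archiving, "meets RPO:", meets_rpo(with_archiving, rpo))
```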

slide-17
SLIDE 17

Sub-second Automatic Failover
  • In general it is possible, but VERY expensive
  • This is the price of the complexity of such a system
    ○ Complexity often decreases availability
    ○ The more elements a system has, the more reliable each element has to be
  • Trade-off between the speed of failure detection and false positives
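
The point about the number of elements can be illustrated with the standard series-availability formula. This is a sketch under two stated assumptions: every component is required for the system to work, and components fail independently. The function names are illustrative.

```python
from math import prod
from typing import Iterable


def series_availability(component_availabilities: Iterable[float]) -> float:
    """Availability of a system that needs ALL of its components to be up.

    Assumes independent failures, so the availabilities multiply.
    """
    return prod(component_availabilities)


def required_component_availability(target: float, components: int) -> float:
    """Availability each of N equal components needs to reach the target."""
    if components < 1:
        raise ValueError("components must be at least 1")
    return target ** (1 / components)


if __name__ == "__main__":
    # Three components at 99.9% each give less than 99.9% overall.
    print(f"{series_availability([0.999, 0.999, 0.999]):.6f}")
    # To keep 99.9% overall with five components, each must be better than that.
    print(f"{required_component_availability(0.999, 5):.6f}")
```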

slide-18
SLIDE 18

High Availability and Disaster Recovery Need Each Other!

slide-19
SLIDE 19

  • What is High Availability?
  • What HA will not solve?
  • Disaster recovery
  • Automatic failover done right (current section)
  • Examples of real incidents
  • Wrap it up

slide-20
SLIDE 20

Multimaster?
  • PostgreSQL XC/XL
    ○ Data nodes + Coordinators + 2PC + GTM (SPOF)
  • BDR
    ○ logical replication + conflict resolution
        ■ eventual consistency
  • Postgres Pro Enterprise (proprietary)
    ○ logical replication + E3PC

slide-21
SLIDE 21

A good HA system
  • Quorum
    ○ Helps to deal with network splits
    ○ Requires at least 3 nodes
  • Fencing
    ○ Make sure the old primary is inaccessible. STONITH!
  • Watchdog
    ○ Primary should not run if the supervising HA process has failed
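
Why quorum needs at least 3 nodes follows from majority arithmetic: with 2 nodes the majority is 2, so losing either node (or the link between them) leaves nobody able to decide. A small sketch, with illustrative function names:

```python
def quorum_size(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return nodes - quorum_size(nodes)


def has_quorum(reachable: int, nodes: int) -> bool:
    """True if the reachable nodes (including ourselves) form a majority."""
    return reachable >= quorum_size(nodes)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(f"{n} nodes: quorum={quorum_size(n)}, tolerated failures={tolerated_failures(n)}")
```

Note that an even node count buys nothing: 4 nodes tolerate the same single failure as 3.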

slide-22
SLIDE 22

No Quorum and no Fencing

[Diagram: Primary and Standby, connected by a WAL stream and a health check]

slide-23
SLIDE 23

No Quorum and no Fencing

[Diagram: both nodes are now labelled Primary]

https://github.com/MasahikoSawada/pg_keeper

slide-24
SLIDE 24

Witness node is making decisions

[Diagram: Primary, Standby and a witness; two health-check links and a WAL stream between Primary and Standby]

slide-25
SLIDE 25

Witness node dies

[Diagram: Primary, Standby and the witness; only the WAL stream between Primary and Standby remains]

slide-26
SLIDE 26

Witness and no Fencing

[Diagram: Primary, Standby and a witness; two health-check links and a WAL stream between Primary and Standby]

slide-27
SLIDE 27

Witness and no Fencing

[Diagram: a witness with a single health-check link; both database nodes are now labelled Primary]

slide-28
SLIDE 28

Automatic failover done right

[Diagram: Primary, Standby and a Quorum; message labels: “I am the leader”, “Leader changed?”]
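
The messages in the diagram suggest a leader key kept in a quorum-backed store: the leader keeps renewing it, and the others keep checking whether it has changed. The toy model below is only an in-memory sketch of that idea under stated assumptions; the class and function names are hypothetical, the store is a plain Python object rather than a real consensus system, and it is not the implementation of any particular HA tool. Time is passed in explicitly so that the behaviour is deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    expires_at: float


class LeaderKey:
    """Toy stand-in for a leader key with a TTL in a quorum-backed store."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self._lease: Optional[Lease] = None

    def leader(self, now: float) -> Optional[str]:
        """Return the current leader, or None once the lease has expired."""
        if self._lease is not None and now < self._lease.expires_at:
            return self._lease.holder
        return None

    def acquire_or_renew(self, node: str, now: float) -> bool:
        """Take a free lease or extend our own; never steal a live one.

        A real store would have to make this check-and-set atomic.
        """
        current = self.leader(now)
        if current is None or current == node:
            self._lease = Lease(holder=node, expires_at=now + self.ttl)
            return True
        return False


def ha_loop_step(key: LeaderKey, node: str, now: float) -> str:
    """One iteration of a node's HA loop: decide which role to run in."""
    return "primary" if key.acquire_or_renew(node, now) else "standby"


if __name__ == "__main__":
    key = LeaderKey(ttl=30.0)
    print(ha_loop_step(key, "node-a", now=0.0))    # node-a takes the free lease
    print(ha_loop_step(key, "node-b", now=10.0))   # lease is held by node-a
    print(ha_loop_step(key, "node-b", now=31.0))   # node-a stopped renewing
    print(ha_loop_step(key, "node-a", now=35.0))   # node-a must now demote
```

A lease alone is not enough: the old primary has to stop accepting writes before its lease expires, which is where the fencing and watchdog requirements come in.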

slide-29
SLIDE 29

  • What is High Availability?
  • What HA will not solve?
  • Disaster recovery
  • Automatic failover done right
  • Examples of real incidents (current section)
  • Wrap it up

slide-30
SLIDE 30

GoCardless Incident
Learn your HA system

[Diagram: three nodes (Primary, Sync, Async), each with Pacemaker, plus a VIP]

slide-31
SLIDE 31

GoCardless Incident
  • Failed RAID controller on the primary
  • Primary was manually terminated (in the hope of an automatic failover)
  • Automatic failover didn’t work due to a coincident crash of postgres on the sync replica
  • Spent 1h30m trying to trigger a failover using Pacemaker!
  • Manually promoted the sync replica
  • Total outage: 1h50m

slide-32
SLIDE 32

GoCardless Lessons
  • HA systems are usually quite complex
  • Running them is similar to flying a modern airplane
    ○ Mostly autopilot
    ○ But sometimes it fails
    ○ You need to know how to “fly” manually
  • Learn your HA system
    ○ Try to break it and fix it afterwards

slide-33
SLIDE 33

GitHub Incident
Resource Planning

[Diagram: two datacenters with 60 ms latency between them. US West: replicas; pages, notifications. US East: primary and replicas; pages, jobs, notifications, github.com]

slide-34
SLIDE 34

GitHub Incident
  • Network split due to network maintenance
  • Automatic failover from East to West Coast datacenter
  • Applications in the East were slow due to latency between East and West
  • Switching back to East wasn’t possible due to a few seconds of writes which were not replicated
  • Rebuild of all replicas in the East took nearly 16 hours
  • Total time of incident: 24h11m

slide-35
SLIDE 35

GitHub Lessons
  • Avoid doing cross-region failover if you don’t have 100% resource symmetry
  • MySQL can’t do pg_rewind :)

slide-36
SLIDE 36

GitLab Incident
Broken Disaster Recovery procedures
  • Primary-Replica setup (no automatic failover)
  • Increased database load on the primary resulted in replica falling behind
    ○ WAL segment needed for replica was recycled by primary
  • A few attempts to rebuild replica with pg_basebackup (--checkpoint=spread)
  • rm -fr $PGDATA on the primary! (human error)
  • Three different backups were done only once a day (no WAL archiving)
    ○ pg_dump was always failing due to major version mismatch!
    ○ Azure disk snapshots were disabled for database servers!
    ○ LVM snapshots were working and periodically tested by restoring them to staging
        ■ Incident happened nearly 24 hours after the last snapshot was taken!
        ■ “Luckily”, someone manually created the snapshot 6 hours before the incident
  • Recovery from LVM snapshot took longer than 18 hours

slide-37
SLIDE 37

GitLab Lessons
  • RPO and RTO were not set or not adequate to their business needs
    ○ Daily snapshots only and no WAL archiving (RPO = 24 hours)
    ○ Streaming replication can’t be used for Disaster Recovery
        ■ Unless it is a “delayed” replica
  • Runbooks can’t replace fire-drills
  • Backups must be monitored and tested
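
Monitoring backups can start with something as simple as comparing the age of the last successful backup, and of the last successful restore test, against agreed limits. The sketch below assumes you can obtain those two timestamps from your backup tooling; the function name, parameters and alert texts are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def backup_alerts(
    last_backup: datetime,
    last_restore_test: datetime,
    now: datetime,
    rpo: timedelta,
    max_restore_test_age: timedelta,
) -> List[str]:
    """Return human-readable alerts; an empty list means all checks passed."""
    alerts = []
    backup_age = now - last_backup
    if backup_age > rpo:
        alerts.append(f"last backup is {backup_age} old, RPO is {rpo}")
    restore_test_age = now - last_restore_test
    if restore_test_age > max_restore_test_age:
        alerts.append(f"last restore test was {restore_test_age} ago")
    return alerts


if __name__ == "__main__":
    now = datetime(2019, 2, 1, 12, 0, tzinfo=timezone.utc)
    alerts = backup_alerts(
        last_backup=now - timedelta(hours=23),
        last_restore_test=now - timedelta(days=90),
        now=now,
        rpo=timedelta(hours=1),
        max_restore_test_age=timedelta(days=30),
    )
    for alert in alerts:
        print("ALERT:", alert)
```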

slide-38
SLIDE 38

  • What is High Availability?
  • What HA will not solve? (current section)
  • Disaster recovery
  • Automatic failover done right
  • Examples of real incidents
  • Wrap it up

slide-39
SLIDE 39

Monitoring
  • HA doesn’t solve all problems with postgres, it won’t cover:
    ○ Hardware errors, CPU load, Memory, etc...
    ○ Disk space for $PGDATA, tablespaces and pg_wal
    ○ autovacuums, checkpoints
    ○ Table and index bloat
    ○ Query performance
    ○ etc...
  • Depending on RPO you may not need HA at all, but monitoring is a must
    ○ Don’t forget to monitor your HA system!

slide-40
SLIDE 40

Everything must be monitored

[Diagram labels: Monitoring, High Availability, Disaster Recovery]

slide-41
SLIDE 41

Configuration tuning
  • OS configuration tuning
    ○ Huge pages, shared memory, semaphores, overcommit, etc...
  • PostgreSQL configuration tuning
    ○ shared_buffers, max_wal_size, checkpoint_completion_target, random_page_cost, etc...
  • HA won’t do it for you!

slide-42
SLIDE 42

  • What is High Availability?
  • What HA will not solve?
  • Disaster recovery
  • Automatic failover done right
  • Examples of real incidents
  • Wrap it up (current section)

slide-43
SLIDE 43

Wrap it up
  • Always start with Disaster Recovery planning
    ○ Define RPO and RTO
        ■ Depending on RTO you may not need HA
    ○ Build the Availability you need, not the Availability you want
  • Test everything
    ○ High Availability system
    ○ Backups!
  • Do regular fire-drills

slide-44
SLIDE 44

Thank you!

Questions?

Feedback: https://2019.fosdempgday.org/f