Large databases lots of servers on premises in the cloud GET THEM - - PowerPoint PPT Presentation

large databases lots of servers on premises in the cloud
SMART_READER_LITE
LIVE PREVIEW

Large databases lots of servers on premises in the cloud GET THEM - - PowerPoint PPT Presentation

Large databases lots of servers on premises in the cloud GET THEM ALL! Flavio Gurgel pgDay Paris 2019 DBA leboncoin Mars 12, 2019 A common stack The Internet Border, DMZ http Application Replication Master Standby A common stack


slide-1
SLIDE 1
slide-2
SLIDE 2

Large databases lots of servers

  • n premises

in the cloud GET THEM ALL!

pgDay Paris 2019 Mars 12, 2019 Flavio Gurgel DBA leboncoin

slide-3
SLIDE 3

A common stack

Master Standby Replication Application http Border, DMZ The Internet

slide-4
SLIDE 4
slide-5
SLIDE 5

A common stack

leboncoin circa 2009 Master Standby Replication Application http Border, DMZ The Internet

slide-6
SLIDE 6

A growing stack

leboncoin circa 2013 Master Standby Replication Application http Border, DMZ The Internet Standby Standby (spare)

slide-7
SLIDE 7

A growing company

leboncoin still at 2013

Master Standby (short) Replication Application http Border, DMZ The Internet Standby (long) Standby (spare) OLAP DB BI Stuff (ETL, OLAP, web frontend, etc) Internal users

slide-8
SLIDE 8

classifield ads online

27 million

more than

new ads every day

800 000

*Source Médiamétrie Net Ratings, avril 2018

28,1 million

unique visitors* categories

73

slide-9
SLIDE 9

Stack? Incomplete chart - 2 DCs + AWS

leboncoin at 2019

slide-10
SLIDE 10

>20 Gbits/s

  • utflows & a

database of

3 To 300M

images

50k

req/s on leboncoin Tech team of more than

200

people A strong « open-source » culture: PostgreSQL, Go, React, Python, Hadoop, Kubernetes …

2

Datacenters &

1

Cloud provider

2000

Virtual machines

slide-11
SLIDE 11

To handle all that: automation

slide-12
SLIDE 12

Availability - is replication enough?

  • Hardware
  • Warranty
  • Power
  • RAID 10
  • Battery
  • ECC RAM
  • Network
  • Fans
  • Alerting
slide-13
SLIDE 13

Replication minimum requirements

And let’s think about load-balancing too

  • Standby
  • Streaming
  • Replication slots
  • Geographic distribution
  • Path
  • Load balancing
  • Spare
slide-14
SLIDE 14

Getting critical

Master Standby Standby Standby (spare)

DC A DC B

HAProxy

Read-only endpoint for applications Read-write endpoint for applications

slide-15
SLIDE 15

pg_dump

  • Nightly
  • Archiving DBs not dumped
  • Custom mode
  • Directory mode (from 300 GB)
  • Encrypted
  • Sent to the cloud
  • Retention -> GDPR
slide-16
SLIDE 16

pg_dump (and restore) strategy

Dump server AWS Storage Gateway DC A S3 bucket Local cache disk On premise Databases All DCs NFS mount AWS Storage Gateway DC B Local cache disk Encryption happens here

slide-17
SLIDE 17

Testing pg_dumps

  • Mandatory
  • Corruption
  • Procedures
  • Bugs
  • Time to restore
slide-18
SLIDE 18

Physical backups

  • Barman
  • Basebackups (based on WAL/day)

○ Daily -> from 1 TB ○ Twice a week -> between 100 GB and 1 TB ○ Twice a month -> up to 100 GB

  • PITR
  • Tests
slide-19
SLIDE 19

Barman tips

  • Postgres method

○ pg_receivewal ○ Pg_basebackup ○ Replication slot

  • Geographic distribution
  • Archive
  • Disk space
  • Retention
slide-20
SLIDE 20

Barman strategy

Barman 2 DC B On premise Databases All DCs Barman 1 DC B Barman 1 DC A Barman 1 DC A

slide-21
SLIDE 21

Monitoring and Alerting

pgwatch2

slide-22
SLIDE 22
slide-23
SLIDE 23
slide-24
SLIDE 24
slide-25
SLIDE 25
slide-26
SLIDE 26

Data flows

slide-27
SLIDE 27

Minor version upgrades

  • Release notes
  • DBA + SRE + Developper
  • Standby -> Master
  • Automation
  • 1h total
  • Site always up
  • Services cut for seconds
slide-28
SLIDE 28
  • Same version everywhere
  • Current is 10
  • Decide
  • QA + Staging
  • Production
  • pg_upgrade
  • 3h total
  • Site always up

Major version upgrades

slide-29
SLIDE 29
  • Near zero downtime
  • Logical replication to 11
  • Stop origin on 10
  • Update sequences
  • Point to new origin
  • Start production
  • New physical replica

Major version (new generation) upgrades

slide-30
SLIDE 30
  • row_to_json
  • Parallel query
  • Plan
  • Auto-analyze
  • Function
  • Replication lag
  • DDL locks
  • Replication slots

How it was to migrate from 9.3 to 10?

slide-31
SLIDE 31

Applying DDL

slide-32
SLIDE 32
  • Replication lag
  • Changing execution plans
  • Sqitch
  • Unattended upgrade

Incidents we faced

slide-33
SLIDE 33
  • Instance types
  • On premises cost
  • Variables
  • Physical backups/replicas
  • Lock-in
  • New scenarios
  • Small, elastic, internal services
  • Decommissioned DBs

Cloud? (speaking only of databases)

slide-34
SLIDE 34
  • NoSQL

○ InfluxDB ○ Elasticsearch ○ Redis

  • Other engines

○ PostgreSQL - more than 70 servers ○ MySQL - some servers ○ MSSQL - one server

  • Ditch PostgreSQL for (generic NoSQL here)

○ Never

Other DB engine? NoSQL?

slide-35
SLIDE 35

*Source Médiamétrie Net Ratings, avril 2018 **Source Baromètre de satisfaction BVA novembre 2017

50,6

is on device mobile An app (iOS + Android)

slide-36
SLIDE 36

Let’s keep in touch…

leboncoin Engineering blog @leboncoinEng github.com/leboncoin

slide-37
SLIDE 37