ZERO-DOWNTIME DATACENTER FAILOVERS (SWITCHING HOSTING PROVIDERS FOR YOUR 1.5TB MYSQL DATABASE FOR DUMMIES)

Mar 28, 2024

SLIDE 1

ZERO-DOWNTIME DATACENTER FAILOVERS

(SWITCHING HOSTING PROVIDERS FOR YOUR 1.5TB MYSQL DATABASE FOR DUMMIES)

SLIDE 2

INTRO

(WHO IS THIS GUY)

SLIDE 3

migrating an entire company's infrastructure from Rackspace to AWS

SLIDE 4

60 virtual machines 3 baremetal boxes (db)

SLIDE 5

the migration took 2 months to execute but a year and a half to prepare

SLIDE 6

FOUND STATE

SLIDE 7

everything continuously deployed
no concept of stable

SLIDE 8

hand-crafted build server
10 GB git repo

SLIDE 9

no local dev environments
horrible code review tool

SLIDE 10

CDN is some weird magical thing
no access to LB config, has a bunch of magic in it

SLIDE 11

no insight into server metrics / perfdata

SLIDE 12

still hosting custom PHP code
even tho majority of codebase is now java and python

SLIDE 13

same mysql account used by everyone everywhere

SLIDE 14

that mysql account is "root"

SLIDE 15

that mysql db is 1.5 TB big

SLIDE 16

half the company has to VPN into production to get any work done

SLIDE 17

no db schema migration system == no db versioning
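
For contrast, a minimal sketch of what a schema migration system provides: an ordered list of changes plus a recorded version number. This is not the tooling from the talk. It uses SQLite from the Python standard library instead of MySQL so that it runs anywhere, and the table name `schema_version` and the two migrations are hypothetical.

```python
import sqlite3

# Hypothetical migrations, applied strictly in order.
MIGRATIONS = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)",
    "ALTER TABLE users ADD COLUMN created_at TEXT",
]


def current_version(conn: sqlite3.Connection) -> int:
    """Return the highest applied migration number, or 0 for a fresh database."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


def migrate(conn: sqlite3.Connection) -> int:
    """Apply every migration newer than the recorded version; return the new version."""
    version = current_version(conn)
    for number, statement in enumerate(MIGRATIONS, start=1):
        if number > version:
            conn.execute(statement)
            conn.execute("INSERT INTO schema_version (version) VALUES (?)", (number,))
    conn.commit()
    return current_version(conn)


conn = sqlite3.connect(":memory:")
first_run = migrate(conn)   # applies both migrations
second_run = migrate(conn)  # nothing left to apply, version unchanged
print(first_run, second_run)
```

With the version recorded in the database itself, any server can be asked which schema it is running, which is the "db versioning" the slide says was missing.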

SLIDE 18

half the servers are not deployable from scratch
or their deployability is unknown

SLIDE 19

no access to disaster recovery instance
in case the primary DC went down

SLIDE 20

but Rackspace was a constant pain to deal with
unexpected outages of unexplained causes
unresponsive support team
zero flexibility

SLIDE 21

HOW LONG WOULD IT TAKE TO MIGRATE THIS?

> conservatively: 3 months
> realistically: 6-9 months

SLIDE 22

NO LEADERSHIP BUY-IN

> 2 failed attempts to get buy-in
> Infrastructure team makes a pact
> Do Things The Right Way From Now On

SLIDE 23

A YEAR AND A HALF LATER...

majority of the issues were fixed
or at least significantly improved

SLIDE 24

RACKSPACE STARTS FALLING APART

SLIDE 25

> New estimate: 19 man-days (after final push for preparation)

SLIDE 26

Savings estimate

$12k / mo

SLIDE 27

GOT APPROVAL!

SLIDE 28

> Actually executed in 25-30 man-days
over 2 months

SLIDE 29

HOW?

SLIDE 30

> all LB logic slowly moved to our own haproxies
> CDN magic moved to our haproxies

SLIDE 31

> VPN bridge between DCs
> ~20 MB/s, ~20 ms ping
good enough to treat as a "local" connection for shorter periods of time
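
A rough back-of-envelope check of why that link is only "local" for shorter periods, using the figures from the slides. Two things here are assumptions: that the full 1.5 TB would be copied over this link, and the per-request query counts. The talk states neither.

```python
# Back-of-envelope numbers for the inter-DC link (figures from the slides).
DB_SIZE_BYTES = 1.5e12        # 1.5 TB database (decimal terabytes assumed)
LINK_BYTES_PER_SEC = 20e6     # ~20 MB/s over the VPN bridge
RTT_SECONDS = 0.020           # ~20 ms ping


def bulk_copy_hours(size_bytes: float, bytes_per_sec: float) -> float:
    """Hours needed to push size_bytes through the link at full speed."""
    return size_bytes / bytes_per_sec / 3600


def added_latency_ms(sequential_queries: int, rtt_seconds: float) -> float:
    """Extra milliseconds a request pays when each query crosses the link."""
    return sequential_queries * rtt_seconds * 1000


print(f"full copy: {bulk_copy_hours(DB_SIZE_BYTES, LINK_BYTES_PER_SEC):.1f} h")
for n in (1, 10, 50):  # hypothetical numbers of sequential queries per request
    print(f"{n} cross-DC queries add {added_latency_ms(n, RTT_SECONDS):.0f} ms")
```

Under these assumptions a full copy takes most of a day, and a request that makes dozens of sequential cross-DC queries pays about a second of extra latency. That is tolerable during a cutover but not as a permanent setup.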

SLIDE 32

> mysql master-master replication between DCs
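
One common concern with two writable masters is colliding auto-increment keys. MySQL has the `auto_increment_increment` and `auto_increment_offset` server variables for this. The sketch below is a simplified model of how they interleave IDs. The talk does not say whether these settings were used, and the specific values are hypothetical.

```python
from itertools import islice
from typing import Iterator


def auto_increment_ids(offset: int, increment: int) -> Iterator[int]:
    """Simplified model: IDs start at `offset` and step by `increment`."""
    value = offset
    while True:
        yield value
        value += increment


# Hypothetical settings: two masters, one per datacenter, increment of 2.
rackspace_ids = list(islice(auto_increment_ids(offset=1, increment=2), 5))
aws_ids = list(islice(auto_increment_ids(offset=2, increment=2), 5))

print(rackspace_ids)  # odd IDs only
print(aws_ids)        # even IDs only

# Inserts on either master can never pick the same primary key.
assert not set(rackspace_ids) & set(aws_ids)
```

This only keeps new rows from clashing. Concurrent updates to the same existing row on both masters can still conflict, which is one reason to keep the window with writes in both DCs short.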

SLIDE 33

> app servers in both DCs

SLIDE 34

> haproxies in both DCs

SLIDE 35

> failover with DNS at CloudFlare
near-instantly, but even stray requests would get handled
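
A toy model of why stray requests are harmless. It assumes, based on the preceding slides, that the old DC keeps a full working stack (haproxy, app servers, a MySQL master) after the DNS switch. The class, the function and the cache expiry times are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Datacenter:
    name: str
    serving: bool = True  # haproxy + app servers + MySQL master are up

    def handle(self, request_id: int) -> str:
        if not self.serving:
            raise RuntimeError(f"{self.name} is down")
        return f"request {request_id} served by {self.name}"


def resolve(cached_until: float, now: float, old: Datacenter, new: Datacenter) -> Datacenter:
    """A client keeps using the old DNS answer until its cached record expires."""
    return old if now < cached_until else new


old_dc = Datacenter("rackspace")
new_dc = Datacenter("aws")

# DNS is switched at t=0; hypothetical per-client cache expiries in seconds.
cache_expiries = [0.0, 30.0, 300.0]
now = 10.0

results = [
    resolve(expiry, now, old_dc, new_dc).handle(i)
    for i, expiry in enumerate(cache_expiries)
]
for line in results:
    print(line)
```

Clients with a stale DNS answer still land on a DC that can serve them, so the cutover does not depend on every resolver picking up the change at once.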

SLIDE 36

> metrics, metrics, metrics
Datadog ftw

SLIDE 37

RESULTS

SLIDE 38

> core production migrated in days

SLIDE 39

> internal tools migrated within a week or two

SLIDE 40

> developer tools migrated within a month
git hosting, build server, etc.

SLIDE 41

> obscure legacy services migrated within 2 months

SLIDE 42

> all hardware at Rackspace decommissioned within 3 months

SLIDE 43

> and it was good
