ZERO-DOWNTIME DATACENTER FAILOVERS (SWITCHING HOSTING PROVIDERS FOR - - PowerPoint PPT Presentation



slide-1
SLIDE 1

ZERO-DOWNTIME DATACENTER FAILOVERS

(SWITCHING HOSTING PROVIDERS FOR YOUR 1.5TB MYSQL DATABASE FOR DUMMIES)

slide-2
SLIDE 2

INTRO

(WHO IS THIS GUY)

slide-3
SLIDE 3

migrating an entire company's infrastructure from Rackspace to AWS

slide-4
SLIDE 4

60 virtual machines 3 baremetal boxes (db)

slide-5
SLIDE 5

the migration took 2 months to execute but a year and a half to prepare

slide-6
SLIDE 6

FOUND STATE

slide-7
SLIDE 7
  • everything continuously deployed
  • no concept of stable
slide-8
SLIDE 8
  • hand-crafted build server
  • 10 GB git repo
slide-9
SLIDE 9
  • no local dev environments
  • horrible code review tool
slide-10
SLIDE 10
  • CDN is some weird magical thing
  • no access to LB config, has a bunch of magic in it
slide-11
SLIDE 11
  • no insight into server metrics / perfdata
slide-12
SLIDE 12
  • still hosting custom PHP code
  • even tho majority of codebase is now java and python
slide-13
SLIDE 13
  • same mysql account used by everyone everywhere
slide-14
SLIDE 14
  • that mysql account is "root"
slide-15
SLIDE 15
  • that mysql db is 1.5 TB big
slide-16
SLIDE 16
  • half the company has to VPN into production to get any work done
slide-17
SLIDE 17
  • no db schema migration system == no db versioning
slide-18
SLIDE 18
  • half the servers are not deployable from scratch
  • or their deployability is unknown
slide-19
SLIDE 19
  • no access to disaster recovery instance

in case the primary DC went down

slide-20
SLIDE 20
  • but Rackspace was a constant pain to deal with
  • unexpected outages of unexplained causes
  • unresponsive support team
  • zero flexibility
slide-21
SLIDE 21

HOW LONG WOULD IT TAKE TO MIGRATE THIS?

> conservatively: 3 months
> realistically: 6-9 months

slide-22
SLIDE 22

NO LEADERSHIP BUY-IN

> 2 failed attempts to get buy-in
> Infrastructure team makes a pact
> Do Things The Right Way From Now On

slide-23
SLIDE 23

A YEAR AND A HALF LATER...

majority of the issues were fixed

  • or at least significantly improved
slide-24
SLIDE 24

RACKSPACE STARTS FALLING APART

slide-25
SLIDE 25

> New estimate: 19 man-days (after final push for preparation)

slide-26
SLIDE 26

Savings estimate

$12k / mo

slide-27
SLIDE 27

GOT APPROVAL!

slide-28
SLIDE 28

> Actually executed in 25-30 man-days

  • over 2 months
slide-29
SLIDE 29

HOW?

slide-30
SLIDE 30

> all LB logic slowly moved to our own haproxies
> CDN magic moved to our haproxies

slide-31
SLIDE 31

> VPN bridge between DCs
> ~20 MB/s, ~20 ms ping

good enough to treat as a "local" connection for shorter periods of time
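
As a rough sanity check on these numbers, the sketch below estimates how long a one-off copy of the 1.5 TB database would take over a ~20 MB/s link, and how much time a ~20 ms round trip adds to a request that issues its queries one after another across the bridge. The function names and the 50-query figure are illustrative assumptions, not details from the talk, and decimal units (1 TB = 10^12 bytes) are assumed.

```python
def transfer_hours(size_bytes: float, bytes_per_second: float) -> float:
    """Hours needed to push size_bytes through a link of the given throughput."""
    return size_bytes / bytes_per_second / 3600


def added_latency_ms(sequential_round_trips: int, ping_ms: float) -> float:
    """Extra latency when each round trip has to cross the inter-DC bridge."""
    return sequential_round_trips * ping_ms


DB_SIZE = 1.5e12      # 1.5 TB in decimal units (assumption)
LINK_SPEED = 20e6     # ~20 MB/s, from the slide
PING_MS = 20.0        # ~20 ms, from the slide

copy_hours = transfer_hours(DB_SIZE, LINK_SPEED)
extra_ms = added_latency_ms(50, PING_MS)  # 50 queries is a made-up example

print(f"initial copy: ~{copy_hours:.1f} h")
print(f"50 sequential cross-DC queries: +{extra_ms:.0f} ms")
```

Under these assumptions the initial copy takes roughly 21 hours, and a chatty request pays about a second of extra latency, which fits the slide's caveat that the bridge is only "local" enough for shorter periods of time.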

slide-32
SLIDE 32

> mysql master-master replication between DCs
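
One well-known hazard of master-master replication is both masters handing out the same auto-increment key at the same time. The talk does not say how this was handled; the sketch below only illustrates the common approach of interleaving ID ranges using MySQL's `auto_increment_increment` and `auto_increment_offset` server variables. The `generated_ids` helper and the DC labels are hypothetical, and the code simulates the arithmetic rather than talking to a database.

```python
from itertools import count, islice


def generated_ids(offset: int, increment: int):
    """Return an iterator over offset, offset + increment, offset + 2*increment, ..."""
    return count(start=offset, step=increment)


# Two masters, so the increment is 2; each master gets its own offset.
rackspace_ids = list(islice(generated_ids(offset=1, increment=2), 5))
aws_ids = list(islice(generated_ids(offset=2, increment=2), 5))

print(rackspace_ids)
print(aws_ids)

# The two sequences never overlap, so concurrent inserts cannot collide on the key.
assert not set(rackspace_ids) & set(aws_ids)
```

With this scheme one master only ever generates odd IDs and the other only even ones, so rows inserted on either side replicate to the other without primary-key conflicts.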

slide-33
SLIDE 33

> app servers in both DCs

slide-34
SLIDE 34

> haproxies in both DCs

slide-35
SLIDE 35

> failover with DNS at CloudFlare

near-instantly, but even stray requests would get handled
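
Why stray requests are harmless can be shown with a toy model: because both DCs run haproxies and app servers in front of replicated databases, a client that still holds the old DNS answer simply gets served by the old DC. Everything in this sketch (the `Datacenter` class, the IP addresses, the cached answer) is a hypothetical illustration, not the actual setup from the talk.

```python
from dataclasses import dataclass


@dataclass
class Datacenter:
    name: str
    ip: str

    def handle(self, request: str) -> str:
        # Each DC has its own haproxy and app servers, so it can serve any request.
        return f"{request} served by {self.name}"


old_dc = Datacenter("rackspace", "192.0.2.10")    # documentation-range IPs
new_dc = Datacenter("aws", "198.51.100.10")
by_ip = {dc.ip: dc for dc in (old_dc, new_dc)}

dns_record = old_dc.ip
cached_answer = dns_record    # a client resolved the name before the switch

dns_record = new_dc.ip        # failover: point the record at the new DC

fresh = by_ip[dns_record].handle("GET /a")      # client with a fresh lookup
stray = by_ip[cached_answer].handle("GET /b")   # client with a stale lookup

print(fresh)
print(stray)
```

Both requests get an answer; the only difference is which DC produced it, which is what makes the DNS switch safe to perform at any time.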

slide-36
SLIDE 36

> metrics, metrics, metrics

Datadog ftw

slide-37
SLIDE 37

RESULTS

slide-38
SLIDE 38

> core production migrated in days

slide-39
SLIDE 39

> internal tools migrated within a week or two

slide-40
SLIDE 40

> developer tools migrated within a month

git hosting, build server, etc.

slide-41
SLIDE 41

> obscure legacy services migrated within 2 months

slide-42
SLIDE 42

> all hardware at Rackspace decommissioned within 3 months

slide-43
SLIDE 43

> and it was good

slide-44
SLIDE 44
slide-45
SLIDE 45

QUESTIONS?