MySQL Infrastructure Testing Automation @ GitHub
Jonah Berquist, Tom Krouper. GitHub. Percona Live 2018.
SLIDE 1 How people build software
  • MySQL Infrastructure Testing Automation @ GitHub
  • Jonah Berquist, Tom Krouper. GitHub. Percona Live 2018
SLIDE 2
  • Agenda
  • Intros
  • MySQL @ GitHub
  • Backup/restores
  • Schema migrations
  • Failovers
SLIDE 3
  • About Tom
  • Sr. Infrastructure Engineer
  • Member of the Database Infrastructure Team
  • Working with MySQL since 2003 (MySQL 4.0 release era)
  • Worked on MySQL at Twitter, Booking.com, and Box prior to GitHub. Several other places too.
https://github.com/tomkrouper https://twitter.com/CaptainEyesight
SLIDE 4
  • About Jonah
  • Infrastructure Engineering Manager
  • Member of the Database Infrastructure team
  • Proud manager of 5 lovely team members
https://github.com/jonahberquist https://twitter.com/hashtagjonah
SLIDE 5
  • GitHub
  • The world’s largest Octocat t-shirt and stickers store
  • And plush Octocats
  • And hoodies
  • And software development platform
SLIDE 6
  • MySQL at GitHub
  • GitHub stores repositories in git, and uses MySQL as the backend database for all related metadata.
  • We run a few (a growing number of) clusters, totaling over 100 MySQL servers.
  • The setup isn’t very large, but it is very busy.
SLIDE 7
  • MySQL at GitHub
  • Our MySQL servers must be available, responsive, and in a good state.
  • GitHub has a 99.95% SLA.
  • Availability issues must be handled quickly, as automatically as possible.
SLIDE 8
  • Backups
SLIDE 9
  • Your data
It’s important
SLIDE 10
  • Backups
  • xtrabackup
  • On busy clusters, dedicated backup servers.
  • Backups taken from replicas in each DC.
  • We monitor the number of “success” events in the past ~24 hours, per cluster.
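The per-cluster "count success events in the trailing window" check can be sketched as a small function. The event-tuple shape, the 24-hour window, and the threshold of one success are illustrative assumptions, not GitHub's actual monitoring pipeline.

```python
# Sketch of the backup monitoring check: flag any cluster that lacks
# enough "success" backup events inside the trailing window.
from datetime import datetime, timedelta

def clusters_missing_backups(events, now, window=timedelta(hours=24), min_success=1):
    """events: iterable of (cluster, status, timestamp) tuples.
    Returns cluster names with fewer than `min_success` recent successes."""
    counts = {}
    for cluster, status, ts in events:
        if status == "success" and now - ts <= window:
            counts[cluster] = counts.get(cluster, 0) + 1
    clusters = {cluster for cluster, _, _ in events}
    return sorted(c for c in clusters if counts.get(c, 0) < min_success)
```

A cluster whose only success is older than the window is flagged even if it has emitted recent failure events.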
SLIDE 11
  • Restores
  • Something bad happened and you need that data
  • Building a new host
  • Rebuilding a broken one
  • All the time!
SLIDE 12
  • Restores - the old way
  • Dedicated restore servers.
  • One per cluster.
  • Continuously restores, catches up with replication, restores, catches up with replication, restores, …
  • Sends a “success” event at the end of each cycle.
  • We monitor the number of “success” events in the past ~24 hours, per cluster.
SLIDE 13
  • [Diagram: production topology: master, production replicas, backup replica, and auto-restore replicas]
SLIDE 14
  • Restores - the new way
  • Database-class servers in Kubernetes.
  • Data is not persistent.
  • Database-cluster agnostic.
  • Continuously restores, catches up with replication, restores, catches up with replication, restores, …
  • Sends a “success” event at the end of each cycle.
  • We monitor the number of “success” events in the past ~24 hours, per cluster.
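One pass of the restore cycle above can be sketched as a loop; the callables are hypothetical stand-ins for the real workers that drive xtrabackup restores and MySQL replication.

```python
# Minimal sketch of the auto-restore cycle: pick a backup, restore it,
# replicate until caught up, emit a "success" event, move to the next cluster.
def run_restore_cycle(clusters, restore, catch_up, emit_success):
    """One pass over all clusters; the real loop repeats this forever."""
    for cluster in clusters:
        backup = restore(cluster)     # restore the chosen backup for this cluster
        catch_up(cluster, backup)     # start replication, wait until lag is ~0
        emit_success(cluster)         # monitoring counts these events per ~24h
```

Because the cluster-agnostic worker discards its data between iterations, each pass exercises a fresh backup end to end.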
SLIDE 15
  • [Diagram: auto-restore replicas on k8s]
SLIDE 16
  • Auto-restore
  • Picks a backup from cluster A
SLIDE 17
  • Auto-restore
  • starts replicating from cluster A
SLIDE 18
  • replication catches up
SLIDE 19
  • Auto-restore
  • moves on to backup of cluster B
SLIDE 20
  • Auto-restore
  • replicates from cluster B
SLIDE 21
  • replication catches up
SLIDE 22
  • auto-restore replica not always running
SLIDE 23
  • Restores
  • New host provisioning uses the same flow as restore.
  • A human may kick off a restore/reclone manually.
  • This can grab the latest, or really any, backup we have.
  • We can also restore from another running host.
SLIDE 24
  • Restore failure
  • A specific backup/restore may fail because computers.
  • No reason for panic.
  • Previous backups/restores are proven to be working.
  • At most we lose time.
  • Lack of a successful restore for a cluster in the last ~24 hours is an issue to be investigated.
SLIDE 25
  • Restore: delayed replica
  • One delayed replica per cluster
  • Lagging at 4 hours
SLIDE 26
  • Backup/restore: logical
  • We routinely run a logical backup of all individual tables (independently).
  • We can load a specific table from a specific logical backup onto a non-production server.
  • No need for a DBA. The table is allocated in a developer’s space.
  • The operation is audited.
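Dumping each table independently is what makes single-table loads possible. A sketch of building one dump invocation per table follows; the output directory and file-naming scheme are assumptions, and the slides do not say which dump tool GitHub actually uses (mysqldump is shown as one standard option).

```python
# Sketch: one logical backup command per table, so any single table can
# later be loaded on its own onto a non-production server.
def dump_command(database, table, out_dir="/backups/logical"):
    """Build a mysqldump invocation for exactly one table."""
    return [
        "mysqldump",
        "--single-transaction",   # consistent InnoDB snapshot without table locks
        f"--result-file={out_dir}/{database}.{table}.sql",
        database,
        table,                    # mysqldump dumps only this table
    ]

def dump_all(database, tables):
    return [dump_command(database, t) for t in tables]
```

Each resulting `.sql` file is self-contained, so restoring `github.repositories` does not require touching any other table's dump.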
SLIDE 27
  • Schema migrations
SLIDE 28
  • Is your data correct?
The data you see is merely a ghost of your original data
SLIDE 29
  • gh-ost
  • Young. 1yr old.
  • In production at GitHub since birth.
  • Software
  • Bugs
  • Development
  • Bugs
SLIDE 30
  • gh-ost
  • Overview
SLIDE 31
  • [Diagram: synchronous, trigger-based migration (pt-online-schema-change, oak-online-alter-table, LHM): triggers on the original table replay insert/update/delete onto the ghost table as replace/delete statements]
SLIDE 32
  • [Diagram: triggerless, binlog-based migration (gh-ost): no triggers on the original table; insert/update/delete events are read from the binary log and applied to the ghost table]
SLIDE 33
  • Binlog based design implications
  • Binary logs can be read from anywhere
  • gh-ost prefers connecting to a replica, offloading work from the master
  • gh-ost controls the entire data flow
  • It can truly throttle, suspending all writes on the migrated server
  • gh-ost writes are decoupled from the master workload
  • Write concurrency on the master becomes irrelevant
  • gh-ost’s design is to issue all writes sequentially
  • Completely avoiding locking contention
  • The migrated server only sees a single connection issuing writes
  • The migration algorithm is simplified
SLIDE 34
  • [Diagram: binlog-based migration utilizing a replica: gh-ost reads the binary log on a replica and applies writes on the master]
SLIDE 35
  • gh-ost testing
  • gh-ost works perfectly well on our data
  • Tested, re-tested, and tested again
  • Full coverage of production tables
SLIDE 36
  • gh-ost testing servers
  • Dedicated servers that run continuous tests
SLIDE 37
  • [Diagram: production topology: master, production replicas, a testing replica, and gh-ost testing replicas]
SLIDE 38
  • gh-ost testing
  • Trivial ENGINE=INNODB migration
  • Stop replication
  • Cut-over, cut-back
  • Checksum both tables, compare
  • Checksum failure: stop the world, alert
  • Success/failure: event
  • Drop ghost table
  • Catch up
  • Next table
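The "checksum both tables, compare" step above can be sketched as follows. This is not gh-ost's actual implementation; rows are hashed order-independently here so that read order on the replica does not matter, and primary keys are assumed unique.

```python
# Sketch of comparing the original table against the ghost table after a
# test migration: hash every row, combine order-independently, compare.
import hashlib

def table_checksum(rows):
    """Order-independent checksum over row tuples (pk first, then columns)."""
    digest = 0
    for row in rows:
        h = hashlib.sha256(repr(row).encode()).hexdigest()
        digest ^= int(h, 16)     # XOR makes the combined digest order-independent
    return digest

def tables_match(original_rows, ghost_rows):
    return table_checksum(original_rows) == table_checksum(ghost_rows)
```

A mismatch corresponds to the "stop the world, alert" branch on the slide; equality lets the tester drop the ghost table and move on.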
SLIDE 39
  • gh-ost development cycle
  • Work on branch

.deploy gh-ost/mybranch to prod/mysql_role=ghost_testing
  • Let continuous tests run
  • Depending on the nature of the change, observe for hours/days/more.
  • Merge
  • Tests run regardless of deployed branch
SLIDE 40
  • Failovers
SLIDE 41
  • MySQL setup @ GitHub
  • Plain-old single-writer master-replicas
  • Semi-sync
  • Cross DC, multiple data centers
  • 5.7, RBR
  • Servers with special roles: production replica, backup, migration-test, analytics, …
  • 2-3 tiers of replication
  • Occasional cluster split (functional sharding)
  • Very dynamic, always changing
SLIDE 42
  • Points of failure
  • Master failure, sev1
  • Intermediate master failure
SLIDE 43
  • orchestrator
  • Topology discovery
  • Refactoring
  • Failovers for masters and intermediate masters
  • Open source, Apache 2 license
  • github.com/github/orchestrator
SLIDE 44
  • orchestrator failovers @ GitHub
  • Automated master & intermediate master failovers for all clusters.
  • On failover, runs GitHub-specific hooks
  • Grabbing VIP/DNS
  • Updating server role
  • Kicking services (e.g. pt-heartbeat)
  • Notifying chat
  • Running puppet
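A hook setup along these lines is expressed in orchestrator's JSON configuration. `PostMasterFailoverProcesses` and the `{failureCluster}`/`{successorHost}` placeholders come from orchestrator itself; the script names below are purely hypothetical stand-ins for GitHub's private hooks.

```json
{
  "PostMasterFailoverProcesses": [
    "/scripts/grab-vip-and-dns.sh {failureCluster} {successorHost}",
    "/scripts/update-server-role.sh {successorHost}",
    "/scripts/kick-pt-heartbeat.sh {successorHost}",
    "/scripts/notify-chat.sh 'master failover on {failureCluster}'",
    "/scripts/run-puppet.sh {successorHost}"
  ]
}
```

orchestrator expands the placeholders at recovery time and runs each process in order after promoting the new master.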
SLIDE 45
  • Testing cluster
  • Dedicated testing cluster in production
  • Does not take production traffic
  • “load-test” traffic
  • Resembles a production topology:
  • OS, MySQL Versions
  • Data centers
  • Server roles
  • DNS
  • Proxy
  • Used for many of our deployment tests
SLIDE 46
  • Failover testing
  • Multiple times per day:
  • Set up the cluster in the desired topology layout
  • Inject failure (kill/block/reject)
  • Wait, expect recovery
  • Check topology:
  • Expect new master, correct DNS changes, replica capacity, …
  • Restore old master from backup
  • (an implicit backup/restore test)
  • “success/failure” event
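One iteration of the failover test above can be sketched as follows; the injected callables are hypothetical stand-ins for the real cluster tooling, and the timeout values are illustrative.

```python
# Sketch of a single failover-test iteration: inject a failure, wait for
# recovery, verify the resulting topology, then emit a success/failure event.
import time

def run_failover_test(inject_failure, recovered, check_topology, emit,
                      timeout=60.0, poll=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    inject_failure()                      # e.g. kill/block/reject the master
    deadline = clock() + timeout
    while not recovered():                # wait for a new master to be promoted
        if clock() > deadline:
            emit("failure")               # recovery never happened in time
            return False
        sleep(poll)
    ok = check_topology()                 # new master, DNS, replica capacity, ...
    emit("success" if ok else "failure")
    return ok
```

Restoring the old master from backup afterwards (the implicit backup/restore test on the slide) would be a separate step driven by the same restore flow described earlier.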
SLIDE 47
  • Failover in production
  • We expect < 30s failover
  • Normal case is 10-13s
  • Intermediate master failover has low impact on a subset of users, depending on cluster/DC/server
  • Master failover implies an outage
  • Planned master switchover takes a few seconds
SLIDE 48
  • A moment of reflection
SLIDE 49
  • What builds trust in failovers?
A testing environment?
SLIDE 50
  • Chaos testing in production
  • First steps into regular testing
  • Manual
  • Supported by our peers
  • Learning, understanding impact
SLIDE 51
  • Tests that go wrong
  • Many things can go wrong
  • Corrupt replication
  • Invalidated servers
  • Unassigned DNS
  • Cleanups
SLIDE 52
  • Conclusion
  • Backup & restore
  • Failovers
  • Schema migrations
SLIDE 53
  • Thank you!
Questions? Jonah Berquist @hashtagjonah Tom Krouper @CaptainEyesight