MySQL Infrastructure Testing Automation @ GitHub
Jonah Berquist, Tom Krouper. GitHub. Percona Live 2018.
SLIDE 1 How people build software
  • MySQL Infrastructure Testing Automation @ GitHub
  • Jonah Berquist, Tom Krouper. GitHub. Percona Live 2018
SLIDE 2
  • Agenda
  • Intros
  • MySQL @ GitHub
  • Backup/restores
  • Schema migrations
  • Failovers
SLIDE 3
  • About Tom
  • Sr. Infrastructure Engineer
  • Member of the Database Infrastructure Team
  • Working with MySQL since 2003 (MySQL 4.0 release era)
  • Worked on MySQL at Twitter, Booking.com, and Box prior to GitHub. Several other places too.
https://github.com/tomkrouper https://twitter.com/CaptainEyesight
SLIDE 4
  • About Jonah
  • Infrastructure Engineering Manager
  • Member of the Database Infrastructure team
  • Proud manager of 5 lovely team members
https://github.com/jonahberquist https://twitter.com/hashtagjonah
SLIDE 5
  • GitHub
  • The world’s largest Octocat t-shirt and stickers store
  • And plush Octocats
  • And hoodies
  • And software development platform
SLIDE 6
  • MySQL at GitHub
  • GitHub stores repositories in git, and uses MySQL as the backend database for all related metadata.
  • We run a few (a growing number of) clusters, totaling over 100 MySQL servers.
  • The setup isn’t very large, but it is very busy.
SLIDE 7
  • MySQL at GitHub
  • Our MySQL servers must be available, responsive, and in a good state.
  • GitHub has a 99.95% SLA.
  • Availability issues must be handled quickly, as automatically as possible.
SLIDE 8
  • Backups
SLIDE 9
  • Your data
It’s important
SLIDE 10
  • Backups
  • xtrabackup
  • On busy clusters, dedicated backup servers.
  • Backups taken from replicas in each DC.
  • We monitor the number of “success” events in the past ~24 hours, per cluster.
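The per-cluster "count success events in the trailing window" check can be sketched as a small function. The event-tuple shape, the 24-hour window, and the threshold of one success are illustrative assumptions, not GitHub's actual monitoring pipeline.

```python
# Sketch of the backup monitoring check: flag any cluster that lacks
# enough "success" backup events inside the trailing window.
from datetime import datetime, timedelta

def clusters_missing_backups(events, now, window=timedelta(hours=24), min_success=1):
    """events: iterable of (cluster, status, timestamp) tuples.
    Returns cluster names with fewer than `min_success` recent successes."""
    counts = {}
    for cluster, status, ts in events:
        if status == "success" and now - ts <= window:
            counts[cluster] = counts.get(cluster, 0) + 1
    clusters = {cluster for cluster, _, _ in events}
    return sorted(c for c in clusters if counts.get(c, 0) < min_success)
```

A cluster whose only success is older than the window is flagged even if it has emitted recent failure events.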
SLIDE 11
  • Restores
  • Something bad happened and you need that data
  • Building a new host
  • Rebuilding a broken one
  • All the time!
SLIDE 12
  • Restores - the old way
  • Dedicated restore servers.
  • One per cluster.
  • Continuously restores, catches up with replication, restores, catches up with replication, restores, …
  • Sends a “success” event at the end of each cycle.
  • We monitor the number of “success” events in the past ~24 hours, per cluster.
SLIDE 13
  • [Diagram: production topology: master, production replicas, backup replica, and auto-restore replicas]
SLIDE 14
  • Restores - the new way
  • Database-class servers in Kubernetes.
  • Data is not persistent.
  • Database-cluster agnostic.
  • Continuously restores, catches up with replication, restores, catches up with replication, restores, …
  • Sends a “success” event at the end of each cycle.
  • We monitor the number of “success” events in the past ~24 hours, per cluster.
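One pass of the restore cycle above can be sketched as a loop; the callables are hypothetical stand-ins for the real workers that drive xtrabackup restores and MySQL replication.

```python
# Minimal sketch of the auto-restore cycle: pick a backup, restore it,
# replicate until caught up, emit a "success" event, move to the next cluster.
def run_restore_cycle(clusters, restore, catch_up, emit_success):
    """One pass over all clusters; the real loop repeats this forever."""
    for cluster in clusters:
        backup = restore(cluster)     # restore the chosen backup for this cluster
        catch_up(cluster, backup)     # start replication, wait until lag is ~0
        emit_success(cluster)         # monitoring counts these events per ~24h
```

Because the cluster-agnostic worker discards its data between iterations, each pass exercises a fresh backup end to end.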
SLIDE 15
  • [Diagram: auto-restore replicas on k8s]
SLIDE 16
  • Auto-restore
  • Picks a backup from cluster A
SLIDE 17
  • Auto-restore
  • starts replicating from cluster A
SLIDE 18
  • replication catches up
SLIDE 19
  • Auto-restore
  • moves on to backup of cluster B
SLIDE 20
  • Auto-restore
  • replicates from cluster B
SLIDE 21
  • replication catches up
SLIDE 22
  • auto-restore replica not always running
SLIDE 23
  • Restores
  • New host provisioning uses the same flow as restore.
  • A human may kick off a restore/reclone manually.
  • This can grab the latest, or really any, backup we have.
  • We can also restore from another running host.
SLIDE 24
  • Restore failure
  • A specific backup/restore may fail because computers.
  • No reason for panic.
  • Previous backups/restores are proven to be working.
  • At most we lose time.
  • Lack of a successful restore for a cluster in the last ~24 hours is an issue to be investigated.
SLIDE 25
  • Restore: delayed replica
  • One delayed replica per cluster
  • Lagging at 4 hours
SLIDE 26
  • Backup/restore: logical
  • We routinely run a logical backup of all individual tables (independently).
  • We can load a specific table from a specific logical backup onto a non-production server.
  • No need for a DBA. The table is allocated in a developer’s space.
  • The operation is audited.
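Dumping each table independently is what makes single-table loads possible. A sketch of building one dump invocation per table follows; the output directory and file-naming scheme are assumptions, and the slides do not say which dump tool GitHub actually uses (mysqldump is shown as one standard option).

```python
# Sketch: one logical backup command per table, so any single table can
# later be loaded on its own onto a non-production server.
def dump_command(database, table, out_dir="/backups/logical"):
    """Build a mysqldump invocation for exactly one table."""
    return [
        "mysqldump",
        "--single-transaction",   # consistent InnoDB snapshot without table locks
        f"--result-file={out_dir}/{database}.{table}.sql",
        database,
        table,                    # mysqldump dumps only this table
    ]

def dump_all(database, tables):
    return [dump_command(database, t) for t in tables]
```

Each resulting `.sql` file is self-contained, so restoring `github.repositories` does not require touching any other table's dump.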
SLIDE 27
  • Schema migrations
SLIDE 28
  • Is your data correct?
The data you see is merely a ghost of your original data
SLIDE 29
  • gh-ost
  • Young. 1yr old.
  • In production at GitHub since birth.
  • Software
  • Bugs
  • Development
  • Bugs
SLIDE 30
  • gh-ost
  • Overview
SLIDE 31
  • [Diagram: synchronous, trigger-based migration (pt-online-schema-change, oak-online-alter-table, LHM): triggers on the original table replay insert/update/delete onto the ghost table as replace/delete statements]
SLIDE 32
  • [Diagram: triggerless, binlog-based migration (gh-ost): no triggers on the original table; insert/update/delete events are read from the binary log and applied to the ghost table]
SLIDE 33
  • Binlog based design implications
  • Binary logs can be read from anywhere
  • gh-ost prefers connecting to a replica, offloading work from the master
  • gh-ost controls the entire data flow
  • It can truly throttle, suspending all writes on the migrated server
  • gh-ost writes are decoupled from the master workload
  • Write concurrency on the master becomes irrelevant
  • gh-ost’s design is to issue all writes sequentially
  • Completely avoiding locking contention
  • The migrated server only sees a single connection issuing writes
  • The migration algorithm is simplified
SLIDE 34
  • [Diagram: binlog-based migration utilizing a replica: gh-ost reads the binary log on a replica and applies writes on the master]
SLIDE 35
  • gh-ost testing
  • gh-ost works perfectly well on our data
  • Tested, re-tested, and tested again
  • Full coverage of production tables
SLIDE 36
  • gh-ost testing servers
  • Dedicated servers that run continuous tests
SLIDE 37
  • [Diagram: production topology: master, production replicas, a testing replica, and gh-ost testing replicas]
SLIDE 38
  • gh-ost testing
  • Trivial ENGINE=INNODB migration
  • Stop replication
  • Cut-over, cut-back
  • Checksum both tables, compare
  • Checksum failure: stop the world, alert
  • Success/failure: event
  • Drop ghost table
  • Catch up
  • Next table
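The "checksum both tables, compare" step above can be sketched as follows. This is not gh-ost's actual implementation; rows are hashed order-independently here so that read order on the replica does not matter, and primary keys are assumed unique.

```python
# Sketch of comparing the original table against the ghost table after a
# test migration: hash every row, combine order-independently, compare.
import hashlib

def table_checksum(rows):
    """Order-independent checksum over row tuples (pk first, then columns)."""
    digest = 0
    for row in rows:
        h = hashlib.sha256(repr(row).encode()).hexdigest()
        digest ^= int(h, 16)     # XOR makes the combined digest order-independent
    return digest

def tables_match(original_rows, ghost_rows):
    return table_checksum(original_rows) == table_checksum(ghost_rows)
```

A mismatch corresponds to the "stop the world, alert" branch on the slide; equality lets the tester drop the ghost table and move on.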
SLIDE 39
  • gh-ost development cycle
  • Work on branch

.deploy gh-ost/mybranch to prod/mysql_role=ghost_testing
  • Let continuous tests run
  • Depending on the nature of the change, observe for hours/days/more.
  • Merge
  • Tests run regardless of deployed branch
SLIDE 40
  • Failovers
SLIDE 41
  • MySQL setup @ GitHub
  • Plain-old single-writer master-replicas
  • Semi-sync
  • Cross DC, multiple data centers
  • 5.7, RBR
  • Servers with special roles: production replica, backup, migration-test, analytics, …
  • 2-3 tiers of replication
  • Occasional cluster split (functional sharding)
  • Very dynamic, always changing
SLIDE 42
  • Points of failure
  • Master failure, sev1
  • Intermediate master failure
SLIDE 43
  • orchestrator
  • Topology discovery
  • Refactoring
  • Failovers for masters and intermediate masters
  • Open source, Apache 2 license
  • github.com/github/orchestrator
SLIDE 44
  • orchestrator failovers @ GitHub
  • Automated master & intermediate master failovers for all clusters.
  • On failover, runs GitHub-specific hooks
  • Grabbing VIP/DNS
  • Updating server role
  • Kicking services (e.g. pt-heartbeat)
  • Notifying chat
  • Running puppet
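A hook setup along these lines is expressed in orchestrator's JSON configuration. `PostMasterFailoverProcesses` and the `{failureCluster}`/`{successorHost}` placeholders come from orchestrator itself; the script names below are purely hypothetical stand-ins for GitHub's private hooks.

```json
{
  "PostMasterFailoverProcesses": [
    "/scripts/grab-vip-and-dns.sh {failureCluster} {successorHost}",
    "/scripts/update-server-role.sh {successorHost}",
    "/scripts/kick-pt-heartbeat.sh {successorHost}",
    "/scripts/notify-chat.sh 'master failover on {failureCluster}'",
    "/scripts/run-puppet.sh {successorHost}"
  ]
}
```

orchestrator expands the placeholders at recovery time and runs each process in order after promoting the new master.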
SLIDE 45
  • Testing cluster
  • Dedicated testing cluster in production
  • Does not take production traffic
  • “load-test” traffic
  • Resembles a production topology:
  • OS, MySQL Versions
  • Data centers
  • Server roles
  • DNS
  • Proxy
  • Used for many of our deployment tests
SLIDE 46
  • Failover testing
  • Multiple times per day:
  • Set up the cluster in the desired topology layout
  • Inject failure (kill/block/reject)
  • Wait, expect recovery
  • Check topology:
  • Expect new master, correct DNS changes, replica capacity, …
  • Restore old master from backup
  • (an implicit backup/restore test)
  • “success/failure” event
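One iteration of the failover test above can be sketched as follows; the injected callables are hypothetical stand-ins for the real cluster tooling, and the timeout values are illustrative.

```python
# Sketch of a single failover-test iteration: inject a failure, wait for
# recovery, verify the resulting topology, then emit a success/failure event.
import time

def run_failover_test(inject_failure, recovered, check_topology, emit,
                      timeout=60.0, poll=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    inject_failure()                      # e.g. kill/block/reject the master
    deadline = clock() + timeout
    while not recovered():                # wait for a new master to be promoted
        if clock() > deadline:
            emit("failure")               # recovery never happened in time
            return False
        sleep(poll)
    ok = check_topology()                 # new master, DNS, replica capacity, ...
    emit("success" if ok else "failure")
    return ok
```

Restoring the old master from backup afterwards (the implicit backup/restore test on the slide) would be a separate step driven by the same restore flow described earlier.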
SLIDE 47
  • Failover in production
  • We expect < 30s failover
  • Normal case is 10-13s
  • Intermediate master failover has low impact on a subset of users, depending on cluster/DC/server
  • Master failover implies an outage
  • Planned master switchover takes a few seconds
SLIDE 48
  • A moment of reflection
SLIDE 49
  • What builds trust in failovers?
A testing environment?
SLIDE 50
  • Chaos testing in production
  • First steps into regular testing
  • Manual
  • Supported by our peers
  • Learning, understanding impact
SLIDE 51
  • Tests that go wrong
  • Many things can go wrong
  • Corrupt replication
  • Invalidated servers
  • Unassigned DNS
  • Cleanups
SLIDE 52
  • Conclusion
  • Backup & restore
  • Failovers
  • Schema migrations
SLIDE 53
  • Thank you!
Questions? Jonah Berquist @hashtagjonah Tom Krouper @CaptainEyesight