Percona Live Europe 2016: Launching Vitess (Anthony Yeh, Dan Rogart)

SLIDE 1

Percona Live Europe 2016 Launching Vitess

Anthony Yeh, Dan Rogart Amsterdam, Netherlands | October 3 – 5, 2016

SLIDE 2

Overview

http://vitess.io

SLIDE 3


Why Vitess?

[Diagram: "Their App" talks to its own sharding magic in front of MySQL; YouTube's app and "Your App" talk to Vitess, whose sharding magic sits in front of MySQL]

SLIDE 4


Why not Vitess?

Vitess is...

  • an opinionated cluster
    ○ Many ways to scale; this is one.
    ○ More on those opinions next.
  • a powerful tool
    ○ Huge problems get easier.
    ○ Simple things get more complex.

Vitess is not...

  • a proxy
    ○ Understands the query.
    ○ Generates queries of its own.
  • plug-and-play
    ○ ... yet.
    ○ This talk is about the gaps.

SLIDE 5

Launching Vitess

http://vitess.io/user-guide/launching.html

SLIDE 6

Scalability Philosophy

SLIDE 7


Horizontal Scaling

Small Instances

  • Many instances per host
  • Faster replication, backup/restore
  • Less contention, outages isolated

Self-Healing, Automation

  • Health checks
  • Ops work should be O(1)

Cluster Orchestration

  • Containers isolate ports, files, compute
  • Scheduling for resilience
  • Improves HW utilization
SLIDE 8


Durability and Consistency

Durability through replication

  • Disk is not durable
    ○ sync_binlog off
  • Data must be on multiple machines
    ○ semisync
    ○ lossless failover
    ○ routine reparent

Sharded consistency model

  • Single-shard transactions
    ○ Same guarantees as MySQL
  • Cross-shard transactions
    ○ May fail partially across shards
    ○ Work in progress on 2PC
  • Cross-shard reads
    ○ Even with 2PC, may read from shards in different states
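The cross-shard read caveat can be made concrete with a toy simulation (illustrative Python, not Vitess code): a transaction spanning two shards commits on each shard separately, so a concurrent cross-shard read can observe one shard's new state and the other shard's old state.

```python
# Toy illustration of the cross-shard read anomaly. Two dicts stand in
# for two shards; a "transfer" commits on shard A, then on shard B.

shard_a = {"alice": 100}
shard_b = {"bob": 100}

def read_total():
    """Cross-shard read: one fetch per shard, then sum."""
    return shard_a["alice"] + shard_b["bob"]

def transfer(amount):
    """Cross-shard 'transaction': debit shard A, credit shard B."""
    shard_a["alice"] -= amount   # commit on shard A lands first...
    snapshot = read_total()      # ...a reader sneaks in between commits
    shard_b["bob"] += amount     # commit on shard B lands second
    return snapshot

seen = transfer(30)
print(seen)          # 170: the in-between total, not 200
print(read_total())  # 200: consistent again once both shards committed
```

The reader saw a total that never "existed" logically; 2PC fixes partial failure but not this read-skew, which is why the slide calls it out separately.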

SLIDE 9


Globally Distributed

Multi-Cell Deployment

  • Cell = Zone | Availability Zone
    ○ Possible shared fate within cell
    ○ But failures shouldn't propagate
  • Multi-Region
    ○ Survive fiber cuts, regional outages
    ○ Lower regional read latency
  • Single-Master
    ○ Writes redirected at frontend
    ○ Only one inter-cell roundtrip
    ○ DB writes intra-cell

Cluster Metadata ("Topology")

  • Distributed, consistent, highly available key-value store
    ○ e.g. etcd, ZooKeeper
  • Global Topology Store
    ○ Quorum across multiple cells
    ○ Survives any given cell death
  • Local Topology Store
    ○ Quorum within a single cell
    ○ Independent of any other cell

SLIDE 10

Production Planning

SLIDE 11


Testing

Integration Tests

  • Run app tests against Vitess
    ○ Use real schema
    ○ Test sharding
  • py/vttest
    ○ Small footprint to run on 1 machine
    ○ Emulate a full cluster for tests
    ○ Loads schema from .sql files
    ○ 1 vtcombo = all Vitess servers
    ○ 1 mysqld = all shards

Query Compatibility

  • Bind Variables
    ○ Client-side prepared statements
    ○ Vitess query plan cache
  • Tablet Types
    ○ master: writes, read-after-write
    ○ replica: live site read traffic
    ○ rdonly: batch jobs, backups
  • Query Support
    ○ Vitess SQL parser is incomplete
    ○ Report important use cases
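Bind variables matter because a plan cache is keyed by the statement with its literals pulled out, so queries that differ only in values share one entry. A minimal sketch of that idea (regex-based toy, not the real Vitess parser):

```python
import re

# Toy normalizer: replace numeric and quoted string literals with :vN
# placeholders so a plan cache keys on the statement shape, not values.

def normalize(sql):
    bind_vars = {}
    def repl(match):
        name = "v%d" % (len(bind_vars) + 1)
        bind_vars[name] = match.group(0)
        return ":" + name
    normalized = re.sub(r"'[^']*'|\b\d+\b", repl, sql)
    return normalized, bind_vars

plan_cache = {}
for q in ("SELECT * FROM users WHERE id = 1",
          "SELECT * FROM users WHERE id = 42"):
    key, binds = normalize(q)
    plan_cache.setdefault(key, "plan")  # both queries map to one entry

print(len(plan_cache))  # 1
```

Client-side prepared statements give the same effect for free: the driver ships the placeholder form, so the server never sees a literal-specific statement.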

SLIDE 12


Replication

Binary Logging

  • Enabled everywhere (slaves too)
  • Statement-based
    ○ Rewrite to PK lookups
  • GTID required
  • Used for master management, resharding, update stream, schema swap, etc.

Side Effects

  • Triggers
  • Stored procedures
  • Foreign key constraints
  • These can break resharding
SLIDE 13


Monitoring

Status URLs (vtgate, vttablet, etc.)

  • /debug/status
  • /debug/vars

○ Prometheus, InfluxDB

  • /healthz
  • /queryz
  • /schemaz

Coming soon...

○ Realtime fleet-wide health map
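/debug/vars serves JSON counters (Go expvar style), which is what Prometheus/InfluxDB scrape. A runnable sketch of consuming two polls and turning monotonic counters into rates; in production you would GET http://host:port/debug/vars, and the variable names below are illustrative rather than an exact vttablet schema:

```python
import json

# Two stubbed polls of /debug/vars, 60 seconds apart. Counters are
# cumulative, so rates come from the delta between polls.

poll_1 = json.loads('{"Queries": 1000, "Errors": 3}')
poll_2 = json.loads('{"Queries": 1600, "Errors": 4}')
interval_seconds = 60

qps = (poll_2["Queries"] - poll_1["Queries"]) / interval_seconds
error_rate = (poll_2["Errors"] - poll_1["Errors"]) / interval_seconds

print(qps)  # 10.0
print(error_rate)
```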

SLIDE 14


Backups

Built-in Backups

  • Part of cloning, schema swap
    ○ Restores every day
  • Storage Plugins
    ○ Filesystem (NFS, etc.)
    ○ Google Cloud Storage
    ○ Amazon S3
    ○ Ceph
  • Needs to be triggered periodically
SLIDE 15

Migration Strategies

Tribute

SLIDE 16


Migration

New Workloads

  • Getting Started + Launch Guide

Offline Migration

  • Import data to Vitess

Online Migration

  • Run Vitess above existing MySQL
  • Previously Unsharded
  • Already Sharded
    ○ Custom Vindex
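A custom vindex is how an already-sharded app keeps its existing row-to-shard mapping: it exposes its legacy "which shard owns this key" function to Vitess as a mapping from sharding key to keyspace_id. Real vindexes are Go code inside Vitess; this is a conceptual Python sketch, and `legacy_keyspace_id` is a hypothetical stand-in for the app's existing hash:

```python
import hashlib

# Conceptual custom-vindex sketch: map a sharding key to a keyspace_id
# using the app's pre-existing scheme, then route by keyspace_id range.

def legacy_keyspace_id(user_id):
    # Example legacy scheme: first 8 bytes of md5 of the key.
    return hashlib.md5(str(user_id).encode()).digest()[:8]

def shard_for(user_id):
    # With two shards named '-80' and '80-', routing is a range check
    # on the keyspace_id (lexicographic byte comparison).
    return "-80" if legacy_keyspace_id(user_id) < b"\x80" else "80-"

for uid in (1, 2, 3):
    print(uid, shard_for(uid))
```

The point is that resharding and routing stay correct without rewriting keys, because Vitess only ever asks the vindex for the keyspace_id.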

SLIDE 17

YouTube Production

Dan Rogart, YouTube SRE

SLIDE 18


Run Vitess the SRE Way!

  • Cattle, not pets
  • Systemic failure is more important than individual failure
  • Failure is constant
  • Automate responses to failure when appropriate
  • Or detect and alert a human if required
  • The atomic unit is a mysql instance - for durability, availability, replacement

SLIDE 19


"If I have seen further than others, it is by standing upon the shoulders of giants" -- Isaac Newton

  • s/seen/scaled/
  • Vitess runs on MySQL...
  • MySQL runs on Borg (Google's container cloud)...
  • Borg runs on Google datacenters and networks...
  • Each level is supported by amazing teams and we rely heavily upon their work

SLIDE 20


Vitess runs on MySQL on Borg

  • YouTube/Vitess did not fully migrate into Borg until 2013
  • So, it's actually a pretty good example of how a Vitess integration with an existing MySQL stack went (pretty well, so far)
  • MoB had a lot of mature tools that Vitess leveraged:
    ○ Backups
    ○ Failover
    ○ Schema Management
SLIDE 21


Decider

[Diagram: one shard with a master and several replica mysqld+vttablet pairs (including batch replicas), fronted by vtgates, alongside vtctld; the decider monitors the mysqld instances]

SLIDE 22


Decider... (vastly simplified):

  • Polls all mysql instances every n seconds
  • If the old master is unhealthy, it elects a new master from the replica pool
  • It re-masters all the other replicas to properly replicate from the new master
  • Is the reason TabletExternallyReparented exists in Vitess
  • Total failover times for YouTube Vitess are around 5 seconds
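The loop above can be sketched with mock objects (this is illustrative Python, not YouTube's implementation):

```python
# Vastly simplified decider: poll health, elect a new master if the old
# one is unhealthy, and re-point every other replica at it.

class Mysql:
    def __init__(self, name):
        self.name, self.healthy, self.master = name, True, None

def decide(instances, current_master):
    if current_master.healthy:
        return current_master          # nothing to do this poll
    # Elect the first healthy replica as the new master...
    candidates = [i for i in instances
                  if i.healthy and i is not current_master]
    new_master = candidates[0]
    new_master.master = None
    # ...and re-master all the other instances to replicate from it.
    for i in instances:
        if i is not new_master:
            i.master = new_master
    return new_master

a, b, c = Mysql("a"), Mysql("b"), Mysql("c")
a.healthy = False                      # old master fails a health poll
master = decide([a, b, c], a)
print(master.name)                     # b
```

After the external tool does this, it tells Vitess what happened, which is exactly the role of TabletExternallyReparented.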
SLIDE 23


Schema Management (small changes)

  • Autoschema
  • A "small" change is basically an ALTER against a table with < 2M rows
  • When executed on a replica, it won't block the replication stream
  • Defined paths in source control are monitored
  • When a peer-reviewed file containing SQL is submitted...
  • ...autoschema will validate the change and apply it to all masters in a cluster
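The validation gate can be sketched as follows. The real checks are internal to YouTube, so the row-count source and the 2M threshold here are illustrative assumptions taken from the slide:

```python
import re

# Hypothetical autoschema-style validation: accept only ALTER TABLE
# statements, and only against tables under the small-change threshold.

ROW_COUNTS = {"users": 1_500_000, "videos": 900_000_000}  # illustrative
SMALL_CHANGE_MAX_ROWS = 2_000_000

def validate(sql):
    match = re.match(r"\s*ALTER\s+TABLE\s+(\w+)\b", sql, re.IGNORECASE)
    if not match:
        return False, "only ALTER TABLE statements are auto-applied"
    table = match.group(1)
    if ROW_COUNTS.get(table, 0) >= SMALL_CHANGE_MAX_ROWS:
        return False, "table too big for autoschema; needs a pivot"
    return True, "ok"

print(validate("ALTER TABLE users ADD COLUMN bio TEXT"))
print(validate("ALTER TABLE videos ADD COLUMN tag VARCHAR(64)"))
```

Changes that fail the size check fall through to the pivot process on the next slide.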

SLIDE 24


Schema Management (big changes)

  • Pivot
  • A "big" change is basically an ALTER that will block traffic for too long on the master or block replication too long when executed on a slave
  • Defined paths in source control are monitored
  • When a peer-reviewed file containing SQL is submitted...
  • ...an SRE will start a pivot
  • The ALTER is applied to a single replica and a seed backup is taken
  • All other replicas are restarted such that they restore from the backup that contains the change
  • Finally, the master is done last: a replica with the change is promoted
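The pivot sequence can be summarized as pseudocode-style Python (mock step names, not the real tooling):

```python
# Pivot orchestration sketch: alter one replica, seed a backup from it,
# restore everyone else from that backup, and swap the master last.

def pivot(replicas, master, alter):
    steps = []
    seed = replicas[0]
    steps.append(f"apply '{alter}' on {seed}")         # 1. alter one replica
    steps.append(f"take seed backup from {seed}")      # 2. backup has the change
    for r in replicas[1:]:
        steps.append(f"restore {r} from seed backup")  # 3. others restore
    steps.append(f"promote {seed}; demote {master}")   # 4. master is done last
    return steps

steps = pivot(["r1", "r2", "r3"], "m", "ALTER TABLE big ADD c INT")
for step in steps:
    print(step)
```

Because every replica picks up the change by restoring a backup rather than replaying the ALTER, the master never blocks on it.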

SLIDE 25


Schema Management

  • Autoschema changes take minutes
  • Pivots take days
  • At YouTube, all schema changes must be forwards and backwards compatible with code. Enforced with extensive automated tests.
  • Sometimes dangerous: a common example is removing a column using a pivot. This can break replication, so we have to block access.
  • Sometimes confusing for our developers: they shouldn't really care about how a change happens
  • Open source pivot is coming.
SLIDE 26


Resharding Automation

  • Online copy of data performed n times
  • Final offline copy of data to sync to a gtid
  • Filtered replication
  • Traffic redirect
  • ???
  • Profit!
SLIDE 27


Resharding Automation (online copy)

[Diagram: an unsharded source keyspace and target shards 0 and 1, each with a master and replica mysqld+vttablet pairs; a vtworker copies data between source and target]

  • Replication running
  • Read chunks from source
  • Read chunks from target
  • Reconcile and write diff to target
  • Adaptive throttle
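The reconcile step can be sketched like this (illustrative Python, not vtworker itself): for one primary-key chunk, compare source and target rows and write only the difference.

```python
# Chunk reconciliation sketch: both inputs are {pk: row} dicts covering
# the same PK range. Output is the minimal set of writes for the target.

def reconcile_chunk(source_rows, target_rows):
    inserts = {pk: r for pk, r in source_rows.items()
               if pk not in target_rows}
    updates = {pk: r for pk, r in source_rows.items()
               if pk in target_rows and target_rows[pk] != r}
    deletes = [pk for pk in target_rows if pk not in source_rows]
    return inserts, updates, deletes

source = {1: "a", 2: "b", 3: "c"}
target = {1: "a", 2: "stale", 4: "orphan"}
ins, upd, dels = reconcile_chunk(source, target)
print(ins)   # {3: 'c'}
print(upd)   # {2: 'b'}
print(dels)  # [4]
```

Writing only the diff is what makes repeated online passes cheap: each pass converges the target while the adaptive throttle bounds the write load.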
SLIDE 28


Resharding Automation (offline copy)

[Diagram: same topology as the online copy, with the vtworker copying from the unsharded source to target shards 0 and 1]

  • Replication stopped
  • Read chunks from source
  • Read chunks from target
  • Reconcile and write diff to target
  • Adaptive throttle
SLIDE 29


Resharding Automation (filtered repl)

[Diagram: unsharded source keyspace and target shards 0 and 1; no vtworker in this phase, as the target masters pull binlogs from source replicas directly]

  • Target master tablets connect to a source replica
  • Parse binlogs and apply statements that belong in that shard
  • GTID is stored and replicated on target to survive restarts
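The "belongs in that shard" filter can be sketched as follows (toy Python, not Vitess's binlog parser; `keyspace_id_of` is a hypothetical hash-based mapping standing in for the keyspace's vindex):

```python
import hashlib

# Filtered replication sketch: each target shard applies only the binlog
# events whose rows' keyspace_ids fall in its range.

def keyspace_id_of(pk):
    return hashlib.md5(str(pk).encode()).digest()[:8]

def belongs_to_shard(pk, lo, hi):
    """Shard covers keyspace_ids in [lo, hi); None means open-ended."""
    kid = keyspace_id_of(pk)
    return (lo is None or kid >= lo) and (hi is None or kid < hi)

# Shard '-80' covers [start, 0x80); shard '80-' covers [0x80, end).
events = [("INSERT", 1), ("INSERT", 2), ("DELETE", 3)]
for_minus_80 = [e for e in events if belongs_to_shard(e[1], None, b"\x80")]
for_80_dash = [e for e in events if belongs_to_shard(e[1], b"\x80", None)]
print(len(for_minus_80) + len(for_80_dash))  # 3: each event lands on exactly one shard
```

Because the shard ranges partition the keyspace_id space, every statement is applied by exactly one target shard, which is what keeps the copy consistent.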

SLIDE 30


Resharding Automation (redirection)

  • Finally, application traffic is redirected:
  • vtctl-prod MigrateServedTypes keyspace_name/0 replica
    ○ (sends replica traffic from unsharded to sharded)
  • vtctl-prod MigrateServedTypes keyspace_name/0 master
    ○ (master cutover, point of no return)
  • < 5s of downtime during master cutover (faster than a normal decider failover, since only the Vitess layer is touched)

SLIDE 31


Regression Testing

  • We use the Yahoo Cloud Serving Benchmark (YCSB)
  • Allows for comparison of Vitess to other storage solutions using the same workloads
  • A daily Vitess/YCSB sandbox is run to measure qps per core and latency
  • Deviations from previous results (positive or negative) are noted and investigated
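The deviation check can be sketched like this (illustrative Python; the 5% threshold is an assumption, not the slide's number):

```python
# Flag any metric that moved more than THRESHOLD in either direction
# relative to the previous daily run.

THRESHOLD = 0.05

def deviations(previous, current):
    flagged = {}
    for metric, old in previous.items():
        change = (current[metric] - old) / old
        if abs(change) > THRESHOLD:
            flagged[metric] = change
    return flagged

previous = {"qps_per_core": 2000.0, "p99_latency_ms": 12.0}
current = {"qps_per_core": 2300.0, "p99_latency_ms": 12.1}

print(deviations(previous, current))
```

Note that positive swings are flagged too: an unexplained improvement often means the benchmark changed, not the system.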

SLIDE 32


Rate My Session!