One System To Fit Them All: Shared MySQL Hosting At Facebook



slide-1
SLIDE 1

Andrew Regner

Production Engineer | MySQL Infrastructure

One System To Fit Them All:


Shared MySQL Hosting At Facebook

slide-2
SLIDE 2
slide-3
SLIDE 3

Data choices @Facebook

  • Everyone has data to persist
  • Also have:
  • ZippyDB, ODS, Scuba, HBase, Scribe, RocksDB, TAO
  • MySQL is the most mature
slide-4
SLIDE 4

"Anything" Database

XDB

  • Larger and/or older use cases of MySQL will have their own *db
  • XDB is supposed to be the answer for everyone else
slide-5
SLIDE 5
Our History

  • c. 2004
  • Started with "CDB", allocating resources manually and with little logic behind it.
  • In the last few years, we've grown the MySQL teams by a few engineers, but the company has grown > 10x.

slide-6
SLIDE 6

Things that store data

  • video encoding (queue)
  • data warehouse (metadata)
  • job scheduling
  • server management
  • internal tools (tasks, wiki)
  • hack-a-thon toys
  • visitor sign-in system
  • backup systems (more than I can count)
  • qualitative analysis of search results
  • machine learning models

Hundreds of Teams, Thousands of Shards

slide-7
SLIDE 7

Terminology

slide-8
SLIDE 8

[Diagram: one server hosts multiple instances; each instance hosts many shards]

slide-9
SLIDE 9
slide-10
SLIDE 10
slide-11
SLIDE 11
slide-12
SLIDE 12

[Diagram: a master replicating to two slaves]

slide-13
SLIDE 13

Replica Set

[Diagram: three instances linked by replication, each holding the same shards: tasks, burger, pong_1, feed, videos, tags, wiki]

slide-14
SLIDE 14

[Diagram: replication]

slide-15
SLIDE 15
slide-16
SLIDE 16

Move Fast

Philosophy of FB Infrastructure

  • Enable engineers to do what they need, when they need it.
  • They understand the importance and scope of their product the best.

slide-17
SLIDE 17

Build Stable Infrastructure

Philosophy of FB MySQL Infra

  • K.I.S.S. works at scale, too
  • No: foreign keys, views, events, triggers, procedures, replication lag

  • Yes: good indexes, sharding, planning
slide-18
SLIDE 18

Story Time

slide-19
SLIDE 19

Robert's New Feature

xdb.profile_events

slide-20
SLIDE 20

Robert's New Feature

xdb.profile_events

slide-21
SLIDE 21

xdb.profile_events

Robert's New Feature

Robert lands a configuration change that causes all records in his database to be re-processed. Everything is still in test mode, so he is not worried about the impact on anything else.

slide-22
SLIDE 22

xdb.profile_events (master)

[Graph series: Connected, Running, Lock time]

slide-23
SLIDE 23

xdb.profile_events

Robert's New Feature

His system limits concurrency to 30 executions at once. He didn't realize that the script opens a new connection for each of the 100 queries per execution. :-(
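The arithmetic behind this incident can be sketched as follows. The figures 30 and 100 come from the slide; everything else is an assumption for illustration. `FakeConnection` is a stand-in that only counts how often a connection is opened, not a real MySQL client, and both job functions are hypothetical.

```python
class FakeConnection:
    """Stand-in for a database connection; counts how many get opened."""
    opened = 0

    def __init__(self):
        FakeConnection.opened += 1

    def query(self, sql):
        return sql


def run_job_connection_per_query(queries):
    """The pattern from the slide: a new connection for every query."""
    for sql in queries:
        FakeConnection().query(sql)


def run_job_shared_connection(queries):
    """One connection reused for all queries in an execution."""
    conn = FakeConnection()
    for sql in queries:
        conn.query(sql)


def connections_opened(job, executions=30, queries_per_execution=100):
    """Count connections opened across `executions` runs of `job`."""
    FakeConnection.opened = 0
    queries = ["SELECT 1"] * queries_per_execution
    for _ in range(executions):
        job(queries)
    return FakeConnection.opened


print(connections_opened(run_job_connection_per_query))  # 3000
print(connections_opened(run_job_shared_connection))     # 30
```

A concurrency limit of 30 therefore bounds the number of executions, not the number of connections the master sees.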

slide-24
SLIDE 24

xdb.profile_events

Robert's New Feature

  • Query comments tell us where the job is coming from
  • Use internal UIs to kill the job, then page Robert
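Query comments of this kind might look like the sketch below. The slides do not show Facebook's actual comment format, so the `job=` and `owner=` fields and both helper names are hypothetical; the only fact relied on is that a leading `/* ... */` comment travels with the statement and so is visible wherever the query text is.

```python
import re


def _strip_comment_markers(value):
    """Remove comment delimiters so a tag cannot break out of the comment."""
    return value.replace("*/", "").replace("/*", "")


def tag_query(sql, job, owner):
    """Prefix a SQL statement with a comment naming its source."""
    job = _strip_comment_markers(job)
    owner = _strip_comment_markers(owner)
    return f"/* job={job} owner={owner} */ {sql}"


# Matches the tag written by tag_query (values must not contain spaces).
TAG_RE = re.compile(r"^/\*\s*job=(\S+)\s+owner=(\S+)\s*\*/")


def parse_tag(sql):
    """Recover the job and owner from a tagged statement, or None."""
    match = TAG_RE.match(sql)
    if match is None:
        return None
    return {"job": match.group(1), "owner": match.group(2)}


tagged = tag_query("SELECT 1", "reprocess_profile_events", "robert")
print(tagged)
print(parse_tag(tagged))
```

With a tag like this on every statement, whoever is on call can map a runaway query back to a job and a person without reading the application code.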

slide-25
SLIDE 25

Remember All The Tickets

xdb.ticket_processing

slide-26
SLIDE 26

xdb.ticket_processing

Remembering All The Tickets

Every time something changes with a Help Center support ticket, some metadata is created. Have to hold onto parts of it for legal reasons.

slide-27
SLIDE 27

xdb.ticket_processing

alarm
 1 day ago

slide-28
SLIDE 28

xdb.ticket_processing

Remembering All The Tickets

  • Guidance on maximum ideal shard size
  • Use cases vary too much to enforce
slide-29
SLIDE 29

Forgetting some things

Remembering All The Tickets

  • A tool that an intern (now a full-time employee) wrote deletes large amounts of data based on arbitrary SQL, in chunks, using range queries.
  • OSC on a larger host to reclaim disk space, then replace all the instances.
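The internal tool itself is not shown in the slides. A minimal sketch of the general technique (deleting by primary-key range so that each statement touches a bounded number of rows) could look like this. It uses Python's built-in `sqlite3` purely so the example runs anywhere; the table name, the `id` column, and the function name are assumptions.

```python
import sqlite3


def delete_in_chunks(conn, table, where_sql, chunk_size=1000):
    """Delete rows matching `where_sql`, one primary-key range at a time.

    `table` and `where_sql` are interpolated directly, so they must come
    from a trusted operator, never from user input.
    """
    lo, hi = conn.execute(f"SELECT MIN(id), MAX(id) FROM {table}").fetchone()
    if lo is None:  # empty table
        return 0
    deleted = 0
    start = lo
    while start <= hi:
        end = start + chunk_size  # half-open range [start, end)
        cur = conn.execute(
            f"DELETE FROM {table} WHERE id >= ? AND id < ? AND ({where_sql})",
            (start, end),
        )
        conn.commit()  # commit per chunk so each transaction stays small
        deleted += cur.rowcount
        start = end
    return deleted


# Demo data: 10,000 rows, every odd id marked stale.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticket_meta (id INTEGER PRIMARY KEY, stale INTEGER)")
conn.executemany(
    "INSERT INTO ticket_meta (id, stale) VALUES (?, ?)",
    [(i, i % 2) for i in range(1, 10001)],
)
conn.commit()
print(delete_in_chunks(conn, "ticket_meta", "stale = 1"))  # 5000
```

A production version would likely also pause between chunks and watch replication lag; those parts are omitted here.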

slide-30
SLIDE 30

What we learned

Remembering All The Tickets

  • Existing automation is very good at hiding this
  • Put shard sizes in front of the user ASAP
  • Proactively reach out to owners of the top 5% by shard size
  • Automatically notify owners when their growth looks "dangerous"
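The slides do not define what "dangerous" growth means. One possible heuristic, offered only as a sketch, is to project the days left before a shard reaches its size limit from recent daily samples. The function names, the linear projection, and the 30-day threshold are all assumptions.

```python
def days_until_full(samples_gb, limit_gb):
    """Project days until `limit_gb` from daily size samples (oldest first).

    Uses the average daily growth over the window; returns None when there
    are too few samples or the shard is not growing.
    """
    if len(samples_gb) < 2:
        return None
    daily_growth = (samples_gb[-1] - samples_gb[0]) / (len(samples_gb) - 1)
    if daily_growth <= 0:
        return None
    return max(0.0, (limit_gb - samples_gb[-1]) / daily_growth)


def looks_dangerous(samples_gb, limit_gb, warn_days=30):
    """True when the projected time to the limit is within `warn_days`."""
    remaining = days_until_full(samples_gb, limit_gb)
    return remaining is not None and remaining <= warn_days


# A shard growing 10 GB/day with 70 GB of headroom has 7 days left.
print(days_until_full([100, 110, 120, 130], 200))  # 7.0
print(looks_dangerous([100, 110, 120, 130], 200))  # True
```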

slide-31
SLIDE 31

Cleaning up some data

xdb.analytics

slide-32
SLIDE 32

xdb.analytics

Cleaning up some data

Engineer somewhere in the world wants to clean up some stale records

  • DELETE FROM table_one WHERE table_one.id IN (SELECT foreign_id FROM table_two WHERE some_random_thing = 'foobar')
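The slides do not spell out why this statement caused trouble. One plausible reading is that a single large DELETE driven by a subquery runs as one long statement on the master. Under that assumption, a gentler shape is to resolve the subquery first as a plain read and then delete by primary key in small batches. The sketch below uses `sqlite3` so it is runnable; the table and column names come from the slide, while the function name and batch size are hypothetical.

```python
import sqlite3


def delete_by_lookup(conn, batch_size=500):
    """Two-step replacement for DELETE ... WHERE id IN (SELECT ...)."""
    # Step 1: resolve the subquery once, as a plain read.
    ids = [
        row[0]
        for row in conn.execute(
            "SELECT foreign_id FROM table_two WHERE some_random_thing = ?",
            ("foobar",),
        )
    ]
    # Step 2: delete by primary key, one small batch per transaction.
    deleted = 0
    for i in range(0, len(ids), batch_size):
        batch = ids[i:i + batch_size]
        placeholders = ",".join("?" * len(batch))
        cur = conn.execute(
            f"DELETE FROM table_one WHERE id IN ({placeholders})", batch
        )
        conn.commit()
        deleted += cur.rowcount
    return deleted


# Demo data: 1,200 rows, of which the even ids are flagged 'foobar'.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE table_one (id INTEGER PRIMARY KEY)")
db.execute("CREATE TABLE table_two (foreign_id INTEGER, some_random_thing TEXT)")
db.executemany("INSERT INTO table_one (id) VALUES (?)", [(i,) for i in range(1, 1201)])
db.executemany(
    "INSERT INTO table_two (foreign_id, some_random_thing) VALUES (?, ?)",
    [(i, "foobar" if i % 2 == 0 else "other") for i in range(1, 1201)],
)
db.commit()
print(delete_by_lookup(db))  # 600
```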

slide-33
SLIDE 33

xdb.analytics

slide-34
SLIDE 34

xdb.analytics

Cleaning up some data

  • Once it has happened, the only response is to ask them not to do it again
  • Replace instances if we need to
  • If we catch it early enough, we can kill the query on the master

slide-35
SLIDE 35

The nicest bad neighbor ever

xdb.scheduler / xdb.looks_like

slide-36
SLIDE 36

xdb.scheduler / xdb.looks_like

The nicest bad neighbor ever

  • Shared pool of general-purpose XDB shards
  • A bunch of little shards on a few replica sets for a queue workload
  • A large shard on the same replica set for archival

slide-37
SLIDE 37

xdb.scheduler / xdb.looks_like

[Graph series: Running Threads, History List Length]

slide-38
SLIDE 38

Current Tools

slide-39
SLIDE 39

===============================================================
Instance: xdb0123.prn1:3307
Report UUID: 91a5d53e-2fcf-49e8-8546-26df7fda31d1
Time started: 2016-09-28 08:56:11
Length: 30s
Reason: dbstatus disabled instance for lag
===============================================================
Total sampled queries: 2753

myservice_data (2750):
  2019  LOAD DATA INFILE ? REPLACE INTO TABLE `all_the_data` FIELDS TERMINATED BY ? ENCLOSED BY '?\\?\n?
   420  LOAD DATA INFILE ? IGNORE INTO TABLE `some_more_data` FIELDS TERMINATED BY ? ENCLOSED BY '?\\?\n?
   127  UPDATE `sig_tw_jobs` t SET t.status = ? WHERE t.shard_id = ? AND t.handle = ? AND t.status = ?
    10  UPDATE `sig_model_snapshot` s, `sig_model` m SET s.removed = ?, m.active_snapshot_id = ? WHERE s.model_snapshot_id = ? AND m.model_...70 more bytes
     2  UPDATE `sigrid_model_snapshot` SET model_output = ? WHERE model_snapshot_id = ?

finding the cause of lag

dba replblame
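The implementation of `dba replblame` is not shown. The report above suggests that it samples queries, replaces literals with `?`, and counts the resulting shapes per schema. A rough sketch of that idea follows. The regular expressions are a simplification rather than a real SQL parser, and the function names and sample format are hypothetical.

```python
import re
from collections import Counter

# Single-quoted string literals, allowing backslash escapes.
_STRING = re.compile(r"'(?:[^'\\]|\\.)*'")
# Standalone numeric literals (digits inside identifiers are left alone).
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")


def normalize(sql):
    """Replace literals with '?' so queries of the same shape compare equal."""
    sql = _STRING.sub("?", sql)
    sql = _NUMBER.sub("?", sql)
    return re.sub(r"\s+", " ", sql).strip()


def blame(samples):
    """Count normalized queries per schema.

    `samples` is an iterable of (schema, sql) pairs.
    """
    report = {}
    for schema, sql in samples:
        report.setdefault(schema, Counter())[normalize(sql)] += 1
    return report


samples = [
    ("myservice_data", "UPDATE `sig_tw_jobs` t SET t.status = 'done' WHERE t.shard_id = 42"),
    ("myservice_data", "UPDATE `sig_tw_jobs` t SET t.status = 'new'  WHERE t.shard_id = 7"),
    ("myservice_data", "UPDATE `sigrid_model_snapshot` SET model_output = 'x' WHERE model_snapshot_id = 3"),
]
for schema, counts in blame(samples).items():
    print(f"{schema} ({sum(counts.values())}):")
    for query, n in counts.most_common():
        print(f"  {n:>4}  {query}")
```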

slide-40
SLIDE 40

$ xdb task xdb.fb_learning_mysql --template size
assigned_to=1369320034
tags=[u'dba', u'xdb', u'oncall', u'xdb_enforcement', u'disk_space']
title=XDB xdb.learning_mysql exceeding allowed disk space
desc=An xdb that you own ( xdb.learning_mysql ) has exceeded disk space limits. Please cleanup some data immediately (see https://wiki.fb.com/out_of_space ).

Instance sizes: https://ods.fb.com/455023229
Table sizes: https://ods.fb.com/455023234

Table sizes (information_schema):
Schema          Table                        Size(GB)
learning_mysql  channels                      607.990
learning_mysql  workflow_runs                 157.080
learning_mysql  operator_plans                133.090
learning_mysql  retention                      60.550
learning_mysql  operator_runs                  31.510
learning_mysql  job_instance_status            28.230
learning_mysql  job_instance_status_updates    24.980
learning_mysql  operator_run_outputs           19.290

tell someone there's a problem

xdb task

slide-41
SLIDE 41

DB Portal

slide-42
SLIDE 42

DB Portal

xdb.myservice_data

slide-43
SLIDE 43

Looking Forward

slide-44
SLIDE 44

Some random thoughts from the roadmap

Automatic tasks / pages for shard owners

  • Share our monitoring subscriptions

Automatic detection (& killing) of bad queries

Stricter enforcement of quotas / capacity
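How such detection would work is not described in the slides. As a sketch under stated assumptions, a first version could scan rows shaped like the output of MySQL's `SHOW PROCESSLIST` (columns `Id`, `User`, `Command`, `Time`, `Info`) and emit `KILL QUERY` statements for anything running too long. The 60-second threshold, the protected-user list, and both function names are hypothetical.

```python
def find_kill_candidates(processlist, max_seconds=60,
                         protected_users=("system user",)):
    """Return processlist rows for long-running, non-protected queries."""
    candidates = []
    for row in processlist:
        if row["Command"] != "Query":
            continue  # idle connections, replication threads, etc.
        if row["User"] in protected_users:
            continue
        if row["Time"] >= max_seconds:
            candidates.append(row)
    return candidates


def kill_statements(candidates):
    """Build the statements an operator (or automation) would run."""
    return [f"KILL QUERY {row['Id']}" for row in candidates]


processlist = [
    {"Id": 11, "User": "system user", "Command": "Connect", "Time": 9000, "Info": None},
    {"Id": 42, "User": "analytics", "Command": "Query", "Time": 1800,
     "Info": "DELETE FROM table_one WHERE table_one.id IN (SELECT ...)"},
    {"Id": 43, "User": "scheduler", "Command": "Query", "Time": 1, "Info": "SELECT 1"},
    {"Id": 44, "User": "scheduler", "Command": "Sleep", "Time": 500, "Info": None},
]
print(kill_statements(find_kill_candidates(processlist)))
```

Running time alone is a crude signal; a real policy would presumably also consider what the query is and who owns it.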

slide-45
SLIDE 45

Andrew Regner

Production Engineer | MySQL Infrastructure | aregner@fb.com

One System To Fit Them All:


Shared MySQL Hosting At Facebook