Backing up Wikipedia Databases - Jaime Crespo & Manuel Aróstegui



slide-1
SLIDE 1

Backing up Wikipedia Databases

Jaime Crespo & Manuel Aróstegui

slide-2
SLIDE 2

Data Persistence

Subteam, Site Reliability Engineering

slide-3
SLIDE 3

Contents

1) Existing Environment
2) Design
3) Implementation Details
4) Results
5) Planned Work & Lessons Learned

What we are going to mention in this talk is our experience and our learnings. This is what worked for our environment at the time. Your needs and requirements may be different.

slide-4
SLIDE 4

Existing Environment

slide-5
SLIDE 5

Why backups?

  • We use RAID 10, read replicas, and multiple DCs for High Availability
  • Public XML dumps
  • But what about...
    ○ Checking a concrete record back in time?
    ○ An application bug changing data on all servers?
    ○ An operator mistake?
    ○ Abuse by an external user?

slide-6
SLIDE 6

Database context (mid-2019)

  • Aside from the English Wikipedia, 800 other wikis in 300 languages
  • ~550 TB of relational data over 24+ replica groups
  • ~60 TB of that is unique data, of which:
    ○ ~24 TB of compressed MediaWiki insert-only content
    ○ The rest is metadata, local content, misc services, disk cache, analytics, backups, ...

slide-7
SLIDE 7

Brief description of our environment

  • Self-hosted on bare metal
  • Only open source software
  • 2 DCs holding data - at the moment, one active and one passive
  • Normal replication topology with several intermediate masters

https://dbtree.wikimedia.org/

slide-8
SLIDE 8

We were using only mysqldump

    ○ Coordinates were not being saved
    ○ No good monitoring in place; failures could be missed
    ○ Single file with the whole database (100 GB+ compressed file)
    ○ Slow to back up and recover

slide-9
SLIDE 9

Backup hosts were different from production

  • Used TokuDB for compression and to maximize disk space, whilst production runs InnoDB
  • Running multisource replication
    ○ It could not be used for an automatic provisioning system

slide-10
SLIDE 10

Hardware needed to be refreshed

  • Hardware was old and prone to issues
  • More disk and IOPS needed
  • Lack of proper DC redundancy

slide-11
SLIDE 11

Design

slide-12
SLIDE 12
slide-13
SLIDE 13

New backup system requirements

  • For simplicity, we started with full backups only
  • Cross-DC redundancy
  • Scale over several instances for flexibility and performance
  • Aiming for 30 minute TTR
  • Row granularity
  • 90 day retention
  • Fully automated creation and recovery
slide-14
SLIDE 14

Storage

  • Bacula is used as cold, long-term storage, primarily because it’s the tool shared with the rest of the infrastructure backups
  • Data deduplication was considered, but no good solution fit our needs
    ○ Space savings at the application side, InnoDB compression and parallel gzip were considered good enough

slide-15
SLIDE 15

Logical Backups vs Snapshots

  • Logical backups provide great flexibility, small size, good compatibility, and are less prone to data corruption
  • Logical backups are fast to generate but slow to recover
  • Snapshots are faster to recover, but take more space and are less flexible

slide-16
SLIDE 16
  • We decided to do both!

    ○ Snapshots will be used for full disaster recovery and provisioning
    ○ Dumps will be used for long-term archival and small-scale recoveries

* Image from Old El Paso commercial owned by General Mills, Inc. Used under fair use.

slide-17
SLIDE 17

mysqlpump vs mysqldump vs mydumper

  • mysqlpump was discarded early due to incompatibilities (MariaDB GTID)
  • mysqldump is the standard tool, but required hacks to make it parallel and was too slow to recover
  • mydumper has good MariaDB support, integrated compression, a flexible dump format, and is fast and multithreaded

Our choice: mydumper

slide-18
SLIDE 18

LVM vs Xtrabackup vs Cold Backup vs Delayed slave (I)

  • LVM

    ○ Disk-efficient (especially for multiple copies)
    ○ Fast to recover if kept locally
    ○ Requires dedicated partition
    ○ Needs to be done locally and then moved remotely to be stored

slide-19
SLIDE 19

LVM vs Xtrabackup vs Cold Backup vs Delayed slave (II)

  • xtrabackup*
    ○ Requires a --prepare step
    ○ Can be piped through the network
    ○ More resources needed on generation
    ○ xtrabackup works at the InnoDB level and LVM at the filesystem level

* We use mariabackup, as xtrabackup isn’t supported for MariaDB

Our choice: xtrabackup (mariabackup)

slide-20
SLIDE 20

LVM vs Xtrabackup vs Cold Backup vs Delayed slave (III)

  • Cold backups

    ○ Requires stopping MySQL
    ○ Consistent at the file level
    ○ Combined with LVM it can give good results

slide-21
SLIDE 21

LVM vs Xtrabackup vs Cold Backup vs Delayed slave (IV)

  • Delayed slave

    ○ Faster recovery for a given time period
    ○ We used to have it and had bad experiences
    ○ Not great for provisioning new hosts

slide-22
SLIDE 22

Provisioning & testing

  • Backups will not be tested just in a lab
    ○ New hosts will be provisioned from the existing backups
  • Dedicated backup testing hosts:
    ○ Replication will automatically validate most “live data”
    ○ We already have production row-by-row data comparison

slide-23
SLIDE 23

Implementation Details

slide-24
SLIDE 24

Hardware

  • 5 dedicated replicas with 2 mysql instances each (consolidation)
  • 2 provisioning hosts (SSDs + HDs)
  • 1 new bacula host
    ○ 1 disk array dedicated for databases
  • 1 test host (same spec as regular replicas)

Per Datacenter

slide-25
SLIDE 25

Development

  • Python 3 for gluing underlying applications
  • WMF-specific development and deployment is done through Puppet, so it is not a portable “product”
    ○ WMFMariaDBpy: https://phabricator.wikimedia.org/diffusion/OSMD/
    ○ Our Puppet: https://phabricator.wikimedia.org/source/operations-puppet/
  • Very easy to add new backup methods (see the sketch after the example class below)
slide-26
SLIDE 26

class NullBackup:

    config = dict()

    def __init__(self, config, backup):
        """Initialize commands"""
        self.config = config
        self.backup = backup
        self.logger = backup.logger

    def get_backup_cmd(self, backup_dir):
        """
        Return list with binary and options to execute to generate a new
        backup at backup_dir
        """
        return '/bin/true'

    def get_prepare_cmd(self, backup_dir):
        """
        Return list with binary and options to execute to prepare an existing
        backup. Return none if prepare is not necessary (nothing will be
        executed in that case).
        """
        return ''
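The stub above doubles as the interface a new backup method has to implement. As a rough illustration of how another method could be plugged in, the hypothetical subclass below builds a mydumper command line; the class name, the option handling and the configuration keys are assumptions made for this sketch, not part of the actual WMFMariaDBpy code.

class MyDumperBackup(NullBackup):
    """Hypothetical plugin: generate a parallel, per-table compressed dump."""

    def get_backup_cmd(self, backup_dir):
        # Command the orchestrator would execute to create the backup.
        return ['/usr/bin/mydumper',
                '--compress',                                # gzip each table file
                '--threads', str(self.config.get('threads', 4)),
                '--host', self.config.get('host', 'localhost'),
                '--port', str(self.config.get('port', 3306)),
                '--outputdir', backup_dir]

    def get_prepare_cmd(self, backup_dir):
        # Logical dumps do not need a prepare step.
        return ''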

slide-27
SLIDE 27

Configuration

root@cumin1001:~$ cat /etc/mysql/backups.cnf
type: snapshot
rotate: True
retention: 4
compress: True
archive: False
statistics:
  host: db1115.eqiad.wmnet
  database: zarcillo
sections:
  s1:
    host: db1139.eqiad.wmnet
    port: 3311
    destination: dbprov1002.eqiad.wmnet
    stop_slave: True
    order: 2
  s2:
    host: db1095.eqiad.wmnet
    port: 3312
    destination: dbprov1002.eqiad.wmnet
    order: 4
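The file above is YAML-like, with per-section settings overriding the global defaults. A minimal sketch of how the Python glue code might read it, assuming it parses as plain YAML (the function name and default handling are assumptions, not the real implementation):

import yaml  # PyYAML

def read_sections(path='/etc/mysql/backups.cnf'):
    """Load the backup configuration and yield one job per section."""
    with open(path) as f:
        config = yaml.safe_load(f)
    defaults = {'type': config.get('type', 'dump'),
                'retention': config.get('retention', 4)}
    for name, section in (config.get('sections') or {}).items():
        job = dict(defaults)      # section settings override the global ones
        job.update(section)
        job['section'] = name
        yield job

for job in read_sections():
    print(job['section'], job['host'], job['destination'])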
slide-28
SLIDE 28
  • Backups are taken from dedicated replicas for convenience
  • A cron job starts the backup on the provisioning servers, running mydumper (see the sketch below)
  • Several threads are used to dump in parallel; the result is automatically compressed per table
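A minimal sketch of the kind of mydumper invocation such a cron-driven wrapper might run; the thread count is arbitrary, the output path and host reuse names that appear elsewhere in the slides, and the exact flags used by the real wrapper are not shown in this talk.

import subprocess

# Dump one section in parallel, compressing each table file (sketch only).
subprocess.run(['mydumper',
                '--host', 'db1139.eqiad.wmnet', '--port', '3311',
                '--threads', '8',      # several threads dump in parallel
                '--compress',          # produces per-table .sql.gz files
                '--outputdir', '/srv/backups/dumps/ongoing/dump.s1'],
               check=True)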

slide-29
SLIDE 29
  • Snapshots have to be coordinated remotely, as they require a file transfer
  • The xtrabackup installed on the source db is used, to prevent incompatibilities
  • Content is piped directly through the network to avoid a local disk write step (see the sketch below)
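A conceptual sketch of that pipeline, not the actual transfer.py: the mariabackup stream is compressed with pigz and sent over netcat so nothing is written to the source's local disk. Hostnames, the port, the netcat variant and credential handling are assumptions.

import subprocess

# Sender side (source database host): stream the backup over the network.
backup = subprocess.Popen(
    ['mariabackup', '--backup', '--stream=xbstream', '--port', '3311'],
    stdout=subprocess.PIPE)
compress = subprocess.Popen(['pigz', '-c'],
                            stdin=backup.stdout, stdout=subprocess.PIPE)
# The receiver (dbprov1002) would run something like:
#   nc -l -p 4444 | pigz -d | mbstream -x -C /srv/backups/snapshots/ongoing
send = subprocess.Popen(['nc', '-q', '0', 'dbprov1002.eqiad.wmnet', '4444'],
                        stdin=compress.stdout)
send.wait()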

slide-30
SLIDE 30

root@cumin1001:~$ transfer.py --help
usage: transfer.py [-h] [--port PORT] [--type {file,xtrabackup,decompress}]
                   [--compress | --no-compress] [--encrypt | --no-encrypt]
                   [--checksum | --no-checksum] [--stop-slave]
                   source target [target ...]

positional arguments:
  source [...]
  target [...]

optional arguments:
  -h, --help           show this help message and exit
  --port PORT          Port used for netcat listening on the source. By default, 4444,
                       but it must be changed if more than 1 transfer to the same host
                       happens at the same time, or the second copy will fail to open
                       the socket again. This port has its firewall disabled during
                       transfer automatically with an extra iptables rule.
  --type {file,xtrabackup,decompress}
                       file: regular file or directory recursive copy
                       xtrabackup: runs mariabackup on source
  --compress           Use pigz to compress the stream using gzip format (ignored in
                       decompress mode)
  --no-compress        Do not use compression on streaming
  --encrypt            Enable encryption using openssl and the chacha20 algorithm (default)
  --no-encrypt         Disable encryption: send data using an unencrypted stream
  --checksum           Generate a checksum of files before transmission which will be
                       used for checking integrity after the transfer finishes. It only
                       works for file transfers, as there is no good way to checksum a
                       running mysql instance or a tar.gz
  --no-checksum        Disable checksums
  --stop-slave         Only relevant in xtrabackup mode: attempt to stop the slave on
                       the mysql instance before running xtrabackup, and start the
                       slave after it completes, to try to speed up the backup by
                       preventing many changes being queued in the xtrabackup_log. By
                       default, it doesn't try to stop replication.

  • A wrapper utility to transfer files, precompressed tarballs, and to pipe xtrabackup output
slide-31
SLIDE 31
  • Postprocessing both types of backups involves:
    ○ xtrabackup --prepare
    ○ consolidation of files
    ○ metadata gathering (see the sketch below)
    ○ compression
    ○ validation
  • Main monitoring is done from the backup metadata database
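A rough sketch of what the metadata-gathering step could look like: walk a finished backup directory and record one row per file, mirroring the backup_files table shown two slides later. The client library, connection handling and helper name are assumptions, not the actual implementation.

import os
import pymysql  # assumed client library; the real code may use something else

def record_files(backup_id, backup_dir, conn):
    """Walk a finished backup and insert one row per file into backup_files."""
    with conn.cursor() as cur:
        for root, _dirs, files in os.walk(backup_dir):
            for name in files:
                path = os.path.join(root, name)
                st = os.stat(path)
                cur.execute(
                    'INSERT INTO backup_files '
                    '(backup_id, file_path, file_name, size, file_date) '
                    'VALUES (%s, %s, %s, %s, FROM_UNIXTIME(%s))',
                    (backup_id, os.path.relpath(root, backup_dir), name,
                     st.st_size, int(st.st_mtime)))
    conn.commit()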

slide-32
SLIDE 32

root@dbprov2001:/srv$ tree
├── backups
│ ├── dumps
│ │ ├── archive
...
│ │ ├── latest
│ │ │ ├── dump.m2.2019-09-10--00-00-01
│ │ │ │ ├── debmonitor.auth_group_permissions-schema.sql.gz
│ │ │ │ ├── debmonitor.auth_group-schema.sql.gz
...
│ │ │ │ ├── wikidatawiki.wbt_item_terms.00000.sql.gz
│ │ │ │ ├── wikidatawiki.wbt_item_terms.00001.sql.gz
│ │ │ │ ├── wikidatawiki.wbt_item_terms.00002.sql.gz
│ │ │ ├── dump.x1.2019-09-10--00-00-01
│ │ │ │ ├── 10wikipedia.gz.tar
│ │ │ │ ├── aawikibooks.gz.tar
│ │ │ │ ├── aawiki.gz.tar
│ │ │ │ ├── aawiktionary.gz.tar
│ │ │ │ ├── abwiki.gz.tar
│ │ └── ongoing
│ └── snapshots
│ ├── archive
│ │ ├── snapshot.m5.2019-05-07--20-00-02.tar.gz
│ │ ├── snapshot.s4.2019-09-24--21-45-51.tar.gz
│ │ ├── snapshot.s5.2019-09-25--01-08-39.tar.gz
│ │ ├── snapshot.s6.2019-09-25--02-55-21.tar.gz
│ │ ├── snapshot.s8.2019-09-24--19-00-01.tar.gz
│ │ └── snapshot.x1.2019-09-25--06-52-57.tar.gz
│ ├── latest
│ └── ongoing

  • Large tables are split into several files
  • Small databases are consolidated into one file
  • At least 2 (normally 3) copies are kept of each backup, from different timestamps

slide-33
SLIDE 33

Backup validation & monitoring

  • Backup failures cannot be 100% avoided
  • Once backups are done, a few checks are performed:
    ○ Did the process exit with an error?
    ○ Were any errors logged?
    ○ Are the expected final files present?
  • Alerting is based on metadata heuristics (see the sketch below):
    ○ Does a correct backup for the section, type and datacenter exist?
    ○ With a size larger than X bytes?
    ○ Newer than X days?
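A minimal sketch of such a heuristic check against the backups metadata table shown on the next slide; the client library, the thresholds and the exact query are assumptions (the datacenter is implied here by which metadata database is queried).

import pymysql  # assumed client library

def backup_is_healthy(conn, section, backup_type, min_bytes, max_age_days):
    """Return True if a fresh, finished and large-enough backup exists."""
    with conn.cursor() as cur:
        cur.execute(
            'SELECT total_size FROM backups '
            'WHERE section = %s AND type = %s AND status = %s '
            '  AND start_date > NOW() - INTERVAL %s DAY '
            'ORDER BY start_date DESC LIMIT 1',
            (section, backup_type, 'finished', max_age_days))
        row = cur.fetchone()
    return row is not None and row[0] is not None and row[0] >= min_bytes

# Example: alert if the latest s1 dump is older than 8 days or under 100 GB.
# if not backup_is_healthy(conn, 's1', 'dump', 100 * 1024**3, 8):
#     alert()   # hypothetical alerting hook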

slide-34
SLIDE 34

db1115[zarcillo]> SELECT * FROM backups WHERE [..]\G
******************** 1. row ********************
        id: 2921
      name: dump.s1.2019-09-24--03-27-38
    status: finished
    source: db1139.eqiad.wmnet:3311
      host: dbprov1002.eqiad.wmnet
      type: dump
   section: s1
start_date: 2019-09-24 03:27:38
  end_date: 2019-09-24 05:00:01
total_size: 159537777604
******************** 2. row ********************
        id: 1310
      name: snapshot.s1.2019-05-09--20-38-02
    status: failed
    source: db2097.codfw.wmnet:3311
      host: dbprov2002.codfw.wmnet
      type: snapshot
   section: s1
start_date: 2019-05-09 22:10:53
  end_date: NULL
total_size: NULL
2 rows in set (0.00 sec)

db1115[zarcillo]> SELECT * FROM backup_files WHERE [..]\G
*********************** 1. row ***********************
       backup_id: 2930
       file_path: enwiki
       file_name: recentchanges.frm
            size: 8412
       file_date: 2019-09-24 20:26:18
backup_object_id: NULL
*********************** 2. row ***********************
       backup_id: 2930
       file_path: enwiki
       file_name: recentchanges.ibd
            size: 3573547008
       file_date: 2019-09-24 20:35:25
backup_object_id: NULL
*********************** 3. row ***********************
       backup_id: 2930
       file_path: enwiki
       file_name: revision.frm
            size: 4926
       file_date: 2019-09-24 20:26:21
backup_object_id: NULL
*********************** 4. row ***********************
       backup_id: 2930
       file_path: enwiki
       file_name: revision.ibd
            size: 186025771008
       file_date: 2019-09-24 20:35:25
backup_object_id: NULL

slide-35
SLIDE 35
slide-36
SLIDE 36
  • Regular day-to-day provisioning is done with the exact same workflow
  • Recovery can be done from logical backups or snapshots, in both hot and cold storage

slide-37
SLIDE 37

root@dbprov2002:~$ recover_dump.py --help
usage: recover_dump.py [-h] [--host HOST] [--port PORT] [--threads THREADS]
                       [--user USER] [--password PASSWORD] [--socket SOCKET]
                       [--database DATABASE] [--replicate]
                       section

Recover a logical backup

positional arguments:
  section              Section name or absolute path of the directory to recover
                       ("s3", "/srv/backups/archive/dump.s3.2022-11-12--19-05-35")

optional arguments:
  -h, --help           show this help message and exit
  --host HOST          Host to recover to
  --port PORT          Port to recover to
  --threads THREADS    Maximum number of threads to use for recovery
  --user USER          User to connect for recovery
  --password PASSWORD  Password to recover
  --socket SOCKET      Socket to recover to
  --database DATABASE  Only recover this database
  --replicate          Enable binlog on import, for imports to a master that have to
                       be replicated (but it makes the load slower). By default,
                       binlog writes are disabled.

  • A myloader wrapper simplifies the recovery
  • The per-table .sql.gz files are easy to process and recover individually (see the sketch below)
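For example, restoring a single table chunk from one of those per-table files amounts to decompressing it into a mysql client; the file path, target host and database below are illustrative only, and the real recover_dump.py wrapper automates this via myloader.

import subprocess

# Restore one table chunk into an existing database (sketch only).
dump_file = ('/srv/backups/dumps/latest/dump.s8.2019-09-10--00-00-01/'
             'wikidatawiki.wbt_item_terms.00000.sql.gz')   # hypothetical path
zcat = subprocess.Popen(['zcat', dump_file], stdout=subprocess.PIPE)
mysql = subprocess.Popen(['mysql', '--host', 'localhost', 'wikidatawiki'],
                         stdin=zcat.stdout)
mysql.wait()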

slide-38
SLIDE 38
  • Binlogs are obtained directly from the master with mysqlbinlog and archived on the provisioning servers for point-in-time recovery (see the sketch below)
  • Not implemented yet
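Since this is still planned work, the following is only one standard way it could be done: mysqlbinlog streaming raw binlogs continuously from a master to the provisioning host. The host, paths and starting binlog file name are hypothetical.

import subprocess

# Stream raw binlogs from the master and archive them for point-in-time recovery.
subprocess.run(['mysqlbinlog',
                '--read-from-remote-server',
                '--host', 'db1139.eqiad.wmnet',
                '--raw',                 # keep the binlog files as-is
                '--stop-never',          # keep following new events
                '--result-file', '/srv/backups/binlogs/db1139.',  # filename prefix
                'db1139-bin.000001'],    # hypothetical first binlog to fetch
               check=True)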
slide-39
SLIDE 39
  • Content databases are special because they are append-only
  • Incremental logical backups are sent to cold storage (see the sketch below)
  • Not yet implemented
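As a purely conceptual sketch of what an incremental logical backup of append-only content could look like, only the rows added since the previous run are dumped; the table and column names are placeholders, not the real schema, since this was not implemented at the time of the talk.

import subprocess

# Conceptual sketch: dump only the rows appended since the previous run.
last_id = 123456789   # highest id captured by the previous run (hypothetical)
with open('/srv/backups/content/blobs.incr.sql', 'w') as out:
    subprocess.run(['mysqldump', '--host', 'localhost',
                    '--single-transaction',
                    '--where', 'blob_id > %d' % last_id,   # placeholder column
                    'enwiki', 'blobs'],                    # placeholder db/table
                   stdout=out, check=True)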
slide-40
SLIDE 40

Results

slide-41
SLIDE 41

Total dataset backed up & retention policy

  • Per run, 18 TB of metadata and misc from source hosts + 15 TB of read-write content
  • Weekly, 1.4 TB of dumps after compression
    ○ Also 12 TB of content dumps
  • The 3 latest dumps are stored on hot storage
    ○ The latest 3 months (~12 copies) on cold storage
  • 2.7 TB of snapshots every other day
    ○ Retention of 1 week (3 copies)

Per Datacenter

slide-42
SLIDE 42

Available disk & Example Size

  • Total database backup storage available at the moment (hot + cold): 75 TB
  • Example: English Wikipedia metadata (enwiki) - Sept 2019
    ○ Production host: 2.0 TB
    ○ Backup source: 1.3 TB (no binlogs, InnoDB compressed)
    ○ Mydumper, compressed: 149 GB
    ○ Snapshot, compressed: 371 GB

Per Datacenter

slide-43
SLIDE 43

Time to backup

  • 4 dumps + 2 snapshot jobs are processed in parallel on each datacenter
  • Total backup time:
    ○ All dumps: ~7 hours
    ○ All snapshots: ~12 hours
  • enwiki (2 TB) takes:
    ○ 1h25m for mydumper + 10m for post-processing
    ○ 1h20m for the xtrabackup transfer + 1h20m for post-processing
  • Replication is stopped on replicas with high write throughput

slide-44
SLIDE 44
slide-45
SLIDE 45
slide-46
SLIDE 46

Time to Recovery

  • The fastest time our enwiki database (2 TB) can be recovered from the provisioning host is 12m30s:
    ○ Not all steps have been automated yet (not a real TTR)
    ○ Requires 10 Gbit
    ○ Requires resources (network, CPU) not always available
    ○ The large number of small files has extra overhead
  • Realistically: 30m-60m for a full cluster

slide-47
SLIDE 47

Planned Work & Lessons Learned

slide-48
SLIDE 48

Coming next...

  • Fully automated provisioning & testing cycle
  • Improve monitoring
  • Fully automated content backups
  • Automated point-in-time recovery
  • Research incremental methods
  • Offline backups
slide-49
SLIDE 49

Lessons Learned

  • Parallelize (and add redundancy)
  • Get data about your backups
  • Plan, but be open to changes
  • Think about recovery first; design your backups for it
  • Have a plan B, plan C, ... and even a plan D...

slide-50
SLIDE 50

* Screenshot of article by Chris Taylor from Mashable: https://mashable.com/article/moon-library-beresheet-crash-wikipedia. Used under fair use.

slide-51
SLIDE 51

Authors: Jaime Crespo & Manuel Aróstegui, Wikimedia Foundation
License: CC-BY-SA-3.0 (except where noted)

Thank you!

Special thanks: Alex, Ariel, Effie, Mark, Rubén, WMF SRE Team and Percona Live Committee

slide-52
SLIDE 52

Authors: Jaime Crespo & Manuel Aróstegui, Wikimedia Foundation
License: CC-BY-SA-3.0 (except where noted)

Please rate us! We are hiring:

https://wikimediafoundation.org/about/jobs/