Errant GTIDs breaking replication Dieter Adriaenssens Ghent - - PowerPoint PPT Presentation
Errant GTIDs breaking replication Dieter Adriaenssens Ghent - - PowerPoint PPT Presentation
Errant GTIDs breaking replication Dieter Adriaenssens Ghent University Who am I? Dieter Adriaenssens Linux System Administrator MySQL DBA Works at Ghent University Open Source : former phpMyAdmin team member Lives in
2
Who am I?
Dieter Adriaenssens
- Linux System Administrator
- MySQL DBA
- Works at Ghent University
- Open Source : former phpMyAdmin team member
- Lives in Ghent, Belgium
- Climber
- E-mail : dieter.adriaenssens@ugent.be
- Twitter : @dcadriaenssens
- Pictures :
−
(c) Thomas Kindermans https://en.wikipedia.org/wiki/File:Gent_Graslei_zonsondergang.JPG
−
(c) Rouslan Placella
Errant GTIDs breaking replication
How to detect and avoid them
4
Context
Replicating MySQL cluster
- Primary with some replicas
- Master failover
- Replication breaks from new primary
(previous replica) to remaining replicas
- Where did those transactions on the new
primary come from?
‘[..] the master has purged binary logs containing GTIDs that the slave requires.’
5
Overview
- Replication and GTID
- Errant GTIDs on a replica
- Detection
- Where do they come from?
- How to avoid them
- How to fix them
- Demo
Introduction
Replication, GTID, data consistency
7
Replication
MySQL server replication :
- Primary (master)
- Several replicas (slaves)
- Identical state across several servers
Primar y Replic a Replic a
8
Replication : Why
- High availability
- Master failover
- Disaster recovery
- Scaling load
- Regional distribution
- ...
9
Replication : How
MySQL master-slave replication :
- All nodes start from an identical state
- Replicas apply transactions from the
primary node’s binary logs
- Transaction in correct order
- Asynchronous : replica initiates
- Maintain identical state
- Read/write on primary
- Replicas should be readonly
10
Replication : How
Keeping track of binlog position :
- Several binlogfiles : mysql_bin.xxxxxx
- Rotation (size based, restart server)
- Every transaction has a unique
(incremental) position
- Replication start :
- Binary log file name
- position in the binlog
- Problematic
- multiple primaries
- Master failover
- Restore from backup
11
GTID
Global Transaction identifier
- Unique identifier of a transaction
- GTID = Origin ID + transaction ID
- Origin ID : server UUID
- Transaction ID : incremental number
Server_uuid:transaction_id 3E11FA47-71CA-11E1-9E33-C80AA9429562:23
SELECT @@GLOBAL.server_uuid; 3E11FA47-71CA-11E1-9E33-C80AA9429562
12
GTID set
Executed GTID set
- Set of applied GTIDs on a server
- Defines a state of data consistency
- Easy to compare between nodes
- Replica applies GTIDs of a primary
- Simplifies failovers (no need to
remember binlog position)
SHOW SLAVE STATUS; SHOW MASTER STATUS; SELECT @@GLOBAL.gtid_executed; Executed_Gtid_Set:
27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113, 50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-80192973, c004c0eb-c84e-11e6-8efc-aa00009002fd: 1-6084195:6140951-6141015
13
ROW based replication
- Changes in rows are replicated
- <-> transaction-based : actual queries
- In combination with GTID => data
consistency across the cluster nodes
14
Replicating cluster with GTID
MySQL Server cluster
- Primary (R/W)
- Replicas (R/O)
- GTID enabled
- Row based replication
- ProxySQL to redirect traffic to correct
cluster node
- Orchestrator managing the cluster nodes
(automatic master failover)
- rchestrator -c topology –alias=demo
node1:3306 [0s,ok,5.7.25,rw,ROW,>>,GTID] + node2:3306 [0s,ok,5.7.25,ro,ROW,>>,GTID] + node3:3306 [0s,ok,5.7.25,ro,ROW,>>,GTID]
Nod e1 Node 2 Node 3 Prox y SQL
Orchestrator
Errant GTID
Definition, consequences, detection, examples, how to avoid, fixes
16
Errant GTID
- Unreplicated transaction existing only on
a replica
- Data is not the same on all nodes
- Cluster is no longer in a consistent state
17
Errant GTID : consequences
- Everything is fine?
- Inconsistent state between nodes
- Split brain
- Different data when reading from
that replica
- Unexpected behaviour when a replica is
promoted to master
- Replication might fail
- If GTID is purged from binlog, on
master failover → replication stops
‘[..] the master has purged binary logs containing GTIDs that the slave requires.’
Errant GTID detection
19
Errant GTID detection
- Compare executed GTID sets between
primary node and replica nodes
- Replica has more GTIDs than primary =>
errant GTID
20
GTID executed set
# primary SELECT @@GLOBAL.gtid_executed; 27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120 # replica SELECT @@GLOBAL.gtid_executed; 27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113, 50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4
21
GTID subset
SELECT GTID_SUBSET( '27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113', '27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120') AS is_subset; +-----------+ | is_subset | +-----------+ | 1 | +-----------+ 1 row in set (0.00 sec)
Replica GTID set is a subset of primary GTID set : OK SELECT GTID_SUBSET('<gtid_executed_replica>', '<gtid_executed_primary>');
22
GTID subset
SELECT GTID_SUBSET( '27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113, 50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4', '27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120') AS is_subset; +-----------+ | is_subset | +-----------+ | 0 | +-----------+ 1 row in set (0.00 sec)
Replica GTID set is NOT a subset of primary GTID set => Errant GTID on replica SELECT GTID_SUBSET('<gtid_executed_replica>', '<gtid_executed_primary>');
23
Find errant GTIDs
SELECT GTID_SUBTRACT( '27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113, 50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4', '27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120') AS errant_gtid; +------------------------------------------+ | errant_gtid | +------------------------------------------+ | 50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4 | +------------------------------------------+ 1 row in set (0.00 sec)
Result is subset of errant GTIDs SELECT GTID_SUBTRACT('<gtid_executed_replica>', '<gtid_executed_primary>');
24
Errant GTID : automatic detection
- Icinga monitoring check
- Orchestrator
- Other tools :
- Mysqlfailover
- Mysqlrpladmin
25
Errant GTID : detection
Monitoring check
- Automate checking for errant GTID
- Icinga compatible output format
- Uses orchestrator for cluster info
- https://github.com/UGent-DICT/check_mysql_gtid
26
Errant GTID : monitoring check
./check_mysql_gtid demo MYSQL_CLUSTER_GTID OK - GTIDs on all nodes are replicated in the cluster! Cluster demo (primary : node1) :
- node2 : OK
- node3 : OK
./check_mysql_gtid <clustername> Everything is fine!
27
Errant GTID : monitoring check
./check_mysql_gtid demo MYSQL_CLUSTER_GTID WARNING : replicas containing unreplicated GTIDs : node2 Cluster demo (primary : node1) :
- node2 : GTIDs only exist on the replica :
50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4
- node3 : OK
./check_mysql_gtid <clustername> Check reports unreplicated (errant) GTIDs on one of the nodes
28
Errant GTID : detection
Orchestrator
- Reports errant GTIDs (>= v3.0.13)
- Provides ways to fix them (>= v3.0.14)
29
Errant GTID
Find transaction
- Look for GTID in binary logs
- Each binlog mentions the executed GTID
set (initial state)
- Select relevant binlog
- Find transaction in that binlog
30
Errant GTID : find transaction
mysqlbinlog --base64-output=DECODE-ROWS --verbose node2-bin.00* | grep 50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1 -A6 SET @@SESSION.GTID_NEXT= '50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1'/*!*/; # at 23890384 #180305 6:25:02 server id 590100 end_log_pos 23890464 CRC32 0x2fe8266b Query thread_id=3848059 exec_time=0 error_code=0 SET TIMESTAMP=1520227502/*!*/; SET @@session.sql_mode=1073741824/*!*/; FLUSH ERROR LOGS /*!*/;
31
Errant GTID : Examples
Transactions on a replica
- Manual transactions (by accident on a replica)
- Scripted maintenance tasks (eg. config management)
- User creation
- Database creation
- Master failover gone bad :
- Split brain
- Writes redirected to a replica (eg. host is r/w by
accident, or after a restart)
- Log flushes
32
Intermezzo : flush-logs
- FLUSH LOGS; → not added to binlog (1)
- FLUSH ERROR LOGS; → added to binlog (2)
- FLUSH [LOCAL | NO_WRITE_TO_BINLOG] ERROR LOGS; → not added to binlog
- mysqladmin flush-logs executes FLUSH LOGS;
- mysqladmin flush-logs error (3) executes FLUSH ERROR LOGS;
- (1) expected behaviour, according to documentation : “FLUSH LOGS, FLUSH BINARY LOGS, FLUSH
TABLES WITH READ LOCK (with or without a table list), and FLUSH TABLES tbl_name ... FOR EXPORT are not written to the binary log in any case because they would cause problems if replicated to a slave.“ https://dev.mysql.com/doc/refman/8.0/en/flush.html
- (2) Related bug report : https://bugs.mysql.com/bug.php?id=88720
- (3) introduced in MySQL 5.7.4
Avoid errant GTIDs
34
Avoid errant GTIDs
Avoid transactions on replica nodes:
- Don’t execute (write) statements on a replica
35
Avoid errant GTIDs
Use read_only:
- Set read_only on all replicas
- Preferably in the config file, to avoid a writable node after
restart
- Orchestrator can set a previous master to read_only on a
failover : ApplyMySQLPromotionAfterMasterFailover = true
36
Avoid errant GTIDs
Use super_read_only:
- Users with SUPER privileges can still write when read_only is set
- Limit SUPER privileges/users
- Set super_read_only on all replicas
- Orchestrator can set a previous master to super_read_only on a failover :
UseSuperReadOnly = true (>= v3.0.7)
37
Avoid errant GTIDs
Disable binlog for some transactions:
- Some statements don’t have to be replicated
- Examples:
- log flushes
- Fixing an inconsistent state on a replica (eg. errant
GTID)
- Use @@SESSION variable!
SET @@SESSION.SQL_LOG_BIN=0; <statement>; SET @@SESSION.SQL_LOG_BIN=1;
Fix errant GTIDs
39
Fix errant GTIDs
Examine situation :
- Examine transaction (binlog)
- Does it change data?
- Can’t find transaction?
- Check consistency with pt-table-checksum and pt-table-sync
- Is data consistent?
40
Fix errant GTIDs
Possible fixes:
- Insert empty transactions on other nodes (including primary)
- ‘Remove’ GTIDs from replica binlog
- Rollback transactions : Unsplit brain
- Talk Shlomi Noach @ FOSDEM 2019
https://fosdem.org/2019/schedule/event/unplitmysql/
- Restore data from primary/backup
41
Errant GTID : Insert empty transactions
SET gtid_next='50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1'; BEGIN; COMMIT; SET gtid_next='50d5e9eb-c5d3-11e6-b86b-aa00009002f7:2'; BEGIN; COMMIT; SET gtid_next=automatic;
On all nodes (or only on the primary of replication still works):
- Repeat for each errant GTID
42
Errant GTID : Remove from binlog
STOP SLAVE; RESET MASTER; SET GLOBAL GTID_PURGED="27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120"; START SLAVE;
On the replica
SELECT @@GLOBAL.gtid_executed; 27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120
On the primary RESET MASTER : Binlogs are purged on the replica!
43
Errant GTID : Fix them
Orchestrator (>= v3.0.14)
- Inject empty transactions
- Reset master (remove GTIDs)
Demo
Click to add text
45
Conclusion
Avoid errant GTIDs
- It can break replication
- Data consistency across the cluster is lost
- Avoid manual transactions on replicas
- Set read_only
- Set super_read_only
- Temporarily Disable binlog if you have to do
something on a replica
- Monitor for errant GTIDs
46
Acknowledgements
Thanks to:
- Colleagues at Ghent University
- Tibor Korocz from Percona
- Blogposts :
- https://www.percona.com/blog/2014/05/19/errant-transactions-major-hurdle-for-gtid-based-failover-in-mysql-5-6/
- https://dzone.com/articles/how-createrestore-slave-using
- https://dzone.com/articles/mysql-replication-errant-transactions-in-gtid-base
- https://severalnines.com/blog/mysql-replication-and-gtid-based-failover-deep-dive-errant-transactions
- https://www.percona.com/blog/2013/03/26/repair-mysql-5-6-gtid-replication-by-injecting-empty-transactions/
47
Questions?
- Contact : @dcadriaenssens
- Monitoring check: https://github.com/UGent-DICT/check_mysql_gtid
48