Understanding Percona XtraDB Cluster 5.7 Operation and Key Algorithms
Krunal Bauskar, PXC Product Lead (Percona Inc.)
Objective
- “I want to use Percona XtraDB Cluster, but is it suitable for my needs and can it handle my workloads?”
- “As an active user of Percona XtraDB Cluster, I wonder why my transactions keep failing and whether this workload will run correctly with the software?”
Agenda
- PXC technology
- Understanding State Snapshot Transfer (SST)
- How replication works?
- How certification works?
- Different types of failures
- Understanding IST (Incremental State Transfer)
- Common causes of failures with PXC
- What's new in PXC-5.7
- Introducing pxc_strict_mode
- Monitoring PXC through PFS
- Securing PXC cluster
- Geo-distributed Cluster Setup
- Proxy-SQL compatible
PXC technology
- PXC is a multi-master solution that provides High Availability using MySQL and the Galera replicator technology, based on a synchronous replication scheme.
- A PXC node can operate in 2 modes: standalone (Percona Server compatible), or with the Galera replicator loaded as a PXC node. PXC therefore differs from Percona Server at the binary level.
- Layer-1: Percona Server with the write-set replication plugin.
- Layer-2: Galera Replicator.
- Layer-3: gcomm communication channel.
Understanding SST
- Concept:
○ The initial/first PXC node is started in cluster bootstrap mode (wsrep_cluster_address=gcomm:// or --wsrep-new-cluster). This node creates a new cluster.
○ Follow-up nodes connect to the existing cluster and sync with the cluster state before they start processing workload.
○ This gives rise to a DONOR and JOINER relationship in PXC. The DONOR donates the data (also known as write-sets); the JOINER receives it. This process is called SST, where-in an almost complete data-dir is copied over from the DONOR.
- PXC supports 3 different ways of doing SST:
○ rsync
○ mysqldump
○ xtrabackup (RECOMMENDED and DEFAULT)
Understanding SST
- How does PXC detect and handle this?
○ Each node of the cluster locally maintains a graph/picture of the cluster.
○ When a new node joins, it is made part of the cluster and the graph is updated. This is important so that newly generated write-sets can be delivered to the new node too.
○ The new node then detects that its local state (-1) is behind the cluster state (say N).
○ The node then searches for a DONOR.
○ The JOINER sends a request to the DONOR for SST (that is, complete state transfer).
○ The DONOR node enters the DONOR/DESYNC state and starts servicing the request. While the DONOR is servicing write-sets to the JOINER, it continues to receive cluster write-sets too.
○ Once the JOINER gets the needed snapshot, it can fill in pending write-sets from its own gcache to get in sync with the CLUSTER state.
○ The new node is now ready to service workload.
- Why Xtrabackup?
○ SST is a time-consuming operation, especially when you have TBs of data.
○ XB is optimized to use BACKUP LOCKS, where-in the DONOR node is not paused during data transmission.
○ It also has an option to secure PXC SST traffic.
Understanding SST (DONOR selection)
- SST DONOR Selection
○ Search by name (if specified by the user via --wsrep_sst_donor).
■ If the node's state restricts it from acting as DONOR, things get delayed:

Member 0.0 (n3) requested state transfer from 'n2', but it is impossible to select State Transfer donor: Resource temporarily unavailable

○ Search for a donor by state:
■ Scan all nodes.
■ Check if a node can act as DONOR (avoid DESYNCed nodes and the arbitrator).
- YES: Is the node part of the same segment as the JOINER? -> Yes -> Got DONOR.
- NO: Keep searching, but cache this remote-segment node too; the worst case will use it.
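The search above can be sketched as a small function. The node dictionaries and their fields are illustrative stand-ins for the actual Galera structures.

```python
def select_sst_donor(nodes, joiner_segment, requested_name=None):
    """Pick an SST donor: by name if the user set wsrep_sst_donor,
    otherwise by state, preferring a node in the joiner's own segment."""
    usable = [n for n in nodes
              if n["state"] not in ("DESYNCED", "JOINER") and not n["arbitrator"]]
    if requested_name:
        named = [n for n in usable if n["name"] == requested_name]
        if not named:
            # mirrors: "requested state transfer from 'n2', but it is
            # impossible to select State Transfer donor"
            raise RuntimeError("Resource temporarily unavailable")
        return named[0]["name"]
    remote = None
    for n in usable:
        if n["segment"] == joiner_segment:
            return n["name"]          # local-segment donor wins immediately
        remote = remote or n          # cache a remote candidate as fallback
    if remote:
        return remote["name"]
    raise RuntimeError("no donor available")
```

For example, with n2 in DESYNC state, requesting it by name raises the "Resource temporarily unavailable" error quoted above, while the by-state search falls back to another node.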
How replication works?
- Short answer:
○ Using binary log events (the same unit MySQL uses for Master-Slave replication).
- Transaction execution steps:
○ The user initiates a transaction on a given node.
○ The node processes the transaction and keeps track of each data-object being modified.
○ On commit, a write-set is generated with the binlog events + certification data (what is modified).
○ This write-set is then replicated on the group channel.
○ All nodes (including the originating node) listen to the group channel.
○ If node = originating node, it has to attest receipt of the write-set and update counters.
○ If node != originating node, it has to consume the write-set: certify it, apply it, and then commit it.
○ After the originating node submits the packet on the channel, it certifies the packet and commits the transaction.
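The steps above can be sketched as a toy group channel. The Node class and its methods are illustrative stand-ins, not the wsrep API; conflicts are ignored here (certification is covered later).

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.acked = 0          # write-sets attested as originator
        self.committed = []     # committed write-sets, in order

    def ack(self, ws):
        self.acked += 1         # attest receipt and update counters

    def certify(self, ws):
        return True             # no conflicting write-sets in this sketch

    def apply(self, ws):
        pass                    # execute the binlog events in the write-set

    def commit_local(self, ws):
        self.committed.append(ws)

def commit(origin, nodes, write_set):
    """Replicate a write-set on the group channel; every node listens."""
    for node in nodes:
        if node is origin:
            node.ack(write_set)           # attest receipt, update counters
            if node.certify(write_set):   # originator certifies its own trx
                node.commit_local(write_set)
        else:
            if node.certify(write_set):   # certify, apply, then commit
                node.apply(write_set)
                node.commit_local(write_set)
```

Running `commit(n1, [n1, n2, n3], "ws1")` leaves the same write-set committed on all three nodes, which is the whole point of the flow.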
How replication works?
[Diagram: Node-1 processes a transaction and puts the write-set on the group channel; Node-2 and Node-3 each certify, apply, and commit it; Node-1 ACKs the write-set, updates its counters, then certifies and commits locally.]
How replication works? (TOI replication)
- TOI stands for Total Order Isolation.
- All DDL and MyISAM replication execute in TOI fashion. Any statement that is not allowed to fail due to conflict is TOI-executed (for example DDL, or MyISAM, since MyISAM is non-transactional).
- Let’s understand the TOI flow:
○ The user initiates a TOI statement (say DDL).
○ The query is wrapped as a packet to replicate and, along with a certification key (in this case the db and table name), is added to the channel.
○ The packet is serviced using a special TOI flow path that locks the complete operation by holding the Apply and Commit Monitors for the period of TOI execution. This ensures no other transaction is allowed to proceed and existing transactions are done with their work.
- If there is no parallel transaction, then why do we need a certification key?
○ Say N1 executed drop table <table>, which is replicated to N2 while N2 is trying to insert into the same table.
○ Without certification, N2 would replicate, certify, and proceed to commit. In the meantime, the table removal path discovers that the table is in use, which then generates conflicts.
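The drop-vs-insert race above is exactly what the certification key prevents; a minimal sketch, with a simplified key-set model (the real certification index is keyed and versioned, not a flat set):

```python
def certify(write_set_keys, certified_keys):
    """First committer wins: a write-set fails certification if it touches
    a key already certified by an earlier-ordered conflicting write-set."""
    return write_set_keys.isdisjoint(certified_keys)

certified = set()

drop_keys = {"db.t"}        # TOI certification key for DROP TABLE db.t on N1
assert certify(drop_keys, certified)
certified |= drop_keys      # the DROP is ordered first and certifies

insert_keys = {"db.t"}      # INSERT into db.t replicated from N2
assert not certify(insert_keys, certified)   # conflict caught at certification
```

So the conflict is resolved deterministically at certification time instead of surfacing as a table-in-use error during removal.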
How replication works? (Parallel Replication)
- PXC can apply write-sets in parallel, thereby improving overall throughput.
- wsrep_slave_threads helps control this option.
- Let’s quickly understand how parallel replication works.
○ 3-node cluster: N1, N2, N3 are all healthy and processing write-sets.
○ Say N1 and N2 generated non-conflicting write-sets; N2 goes first, followed by N1.
○ Let’s take the view from N3’s perspective. N3 has set wsrep_slave_threads = 2, so it can apply both write-sets in parallel. (Note: this is just like MySQL executing 2 workloads from 2 different clients.)
○ While apply can proceed in parallel, commits are ordered. That is, N2’s write-set has to commit first even if N1’s replication thread is allowed to proceed. (These coordinating units are called monitors (Apply, Commit). There is also a Local monitor meant for local node action coordination.)
○ Also note that, since the apply action is only about executing the write-set, it is quick and not resource-hogging unless the query demands it. (Parsing, filtering, etc. have all been taken care of on the host machine that initiated the transaction, so in general a write-set is an optimized unit for application.)
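The "parallel apply, ordered commit" rule can be sketched with two threads, where a Condition plays the role of the Commit Monitor. Names are illustrative, not server internals.

```python
import threading

commit_order = threading.Condition()
next_commit = [1]        # seqno allowed to commit next
log = []                 # observed global commit order

def applier(seqno):
    # ... the write-set is applied here, possibly in parallel with others ...
    with commit_order:
        while next_commit[0] != seqno:   # commits must follow seqno order
            commit_order.wait()
        log.append(seqno)                # ordered commit
        next_commit[0] += 1
        commit_order.notify_all()

# N1's write-set (seqno 2) may start applying before N2's (seqno 1) commits,
# but seqno 1 still commits first:
threads = [threading.Thread(target=applier, args=(s,)) for s in (2, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Regardless of thread scheduling, `log` always ends up as [1, 2]: apply order is free, commit order is total.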
How certification works?
- Let’s now understand the most crucial part of replication: how certification works.
- Basic principles:
○ The ORIGINATOR NODE also certifies its own transaction.
○ FIRST COMMITTER TO THE GROUP CHANNEL WINS.
- Example setup:
○ t1 (i int, primary key pk(i)) with rows (1, 2, 3)
○ t2 (i int, primary key pk(i)) with rows (11, 22, 33)
○ N1: update t1 set i = i + 10;
○ N2: update t1 set i = i + 100;
○ N3: update t2 set i = i + 10;
○ N1-wset: {db.t1.r1, db.t1.r2, db.t1.r3}
○ N2-wset: {db.t1.r1, db.t1.r2, db.t1.r3}
○ N3-wset: {db.t2.r1, db.t2.r2, db.t2.r3}
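The example can be sketched with a toy certification index (the CCV shown in the next slides). This is a simplification: the real index is versioned by seqno, but the first-committer-wins outcome is the same.

```python
ccv = {}   # certification index: modified key -> origin of winning write-set

def certify(origin, keys):
    """Certify a write-set: fail if any key was already certified by a
    conflicting write-set that reached the group channel first."""
    if any(k in ccv and ccv[k] != origin for k in keys):
        return False
    for k in keys:
        ccv[k] = origin
    return True

# N2's write-set reaches the group channel first and certifies:
assert certify("N2", ["db.t1.r1", "db.t1.r2", "db.t1.r3"])
# N1's conflicting write-set on the same rows is rejected:
assert not certify("N1", ["db.t1.r1", "db.t1.r2", "db.t1.r3"])
# N3 touches different rows (db.t2.*) and certifies fine:
assert certify("N3", ["db.t2.r1", "db.t2.r2", "db.t2.r3"])
```

Because every node sees write-sets in the same total order and runs the same deterministic check, all three nodes reach identical verdicts.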
How certification works? (N1, N2, N3)
The outcome is identical on all three nodes, because certification is deterministic and every node sees write-sets in the same total order:
- N2-wset reached the group channel first, so the N2-writeset is certified; the CCV records db.t1.r1 -> N2, db.t1.r2 -> N2, db.t1.r3 -> N2.
- N1-wset then conflicts on db.t1.r1, db.t1.r2, db.t1.r3 (all owned by N2 in the CCV: N2 != N1), so the N1-writeset is rejected.
- N3-wset touches only db.t2.* rows, so the N3-writeset is certified and the CCV adds db.t2.r1 -> N3, db.t2.r2 -> N3, db.t2.r3 -> N3.
Different types of failures
- Brute-Force Abort: (wsrep_local_bf_aborts)
○ N1 and N2 are both modifying the same data-set (conflicting workload). N2 is done and serializes its write-set to the cluster. N1 is still executing; in the meantime N1’s replication thread tries to apply N2’s transaction and finds it conflicting with N1’s active transaction. N1’s replication thread will forcefully abort N1’s active transaction.
Example:
N2: begin; insert into t values (1); commit;
N1: begin; insert into t values (1); -- still open
N1 applier: Apply_event (from group channel) finds the conflicting local transaction and FORCE ABORTs it.

2016-09-18T06:52:11.376545Z 3 [Note] WSREP: cluster conflict due to high priority abort for threads:
2016-09-18T06:52:11.376566Z 3 [Note] WSREP: Winning thread: THD: 3, mode: applier, state: executing, conflict: no conflict, seqno: 3 SQL: (null)
2016-09-18T06:52:11.376580Z 3 [Note] WSREP: Victim thread: THD: 6, mode: local, state: idle, conflict: no conflict, seqno: -1 SQL: (null)
2016-09-18T06:52:11.376593Z 3 [Note] WSREP: BF kill (1, seqno: 3), victim: (6) trx: 1287
2016-09-18T06:52:11.376599Z 3 [Note] WSREP: Aborting query: void
Different types of failures
- Certification Failure: (wsrep_local_cert_failures)
○ N2 generated a transaction and N1 is applying it. Reaching the apply stage means the transaction has already been certified.
○ N1 then initiates a conflicting transaction and tries to commit it. Before the transaction is allowed to commit, it is first written to the group channel, followed by a certification check. At this stage N1 detects there is a conflicting transaction that has already been certified, so N1’s certification fails. This is called a local certification failure.
Example:
N2: insert into t values (1); -- replicated; on N1 it is CERTIFIED and applied
N1: insert into t values (1); COMMIT -- local certification fails -> ABORT

2016-09-18T07:11:51.017630Z 6 [Note] WSREP: cluster conflict due to certification failure for threads:
2016-09-18T07:11:51.017655Z 6 [Note] WSREP: Victim thread: THD: 6, mode: local, state: executing, conflict: cert failure, seqno: 5 SQL: insert into t values (10)
2016-09-18T07:11:51.027158Z 6 [Note] WSREP: cleanup transaction for LOCAL_STATE: insert into t values (10)
2016-09-18T07:11:51.027276Z 6 [Note] WSREP: cert failure, thd: 6 is_AC: 1, retry: 1 - 1 SQL: insert into t values (10)
Certification failures are retried. Check wsrep_retry_autocommit.
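The two failure types above differ only in how far the losing local transaction got; a hedged sketch of the decision (the names are illustrative, not server internals):

```python
def resolve_conflict(local_trx_replicated: bool) -> str:
    """An applier (high-priority) transaction always wins over a local one.
    Which counter the loss hits depends on whether the local transaction
    had already reached the group channel."""
    if not local_trx_replicated:
        # local trx still executing: the applier force-aborts it
        return "brute_force_abort"     # bumps wsrep_local_bf_aborts
    # local trx reached the channel but certifies after the winner
    return "cert_failure"              # bumps wsrep_local_cert_failures

assert resolve_conflict(False) == "brute_force_abort"
assert resolve_conflict(True) == "cert_failure"
```

Either way the applier wins; the local transaction is the victim, matching the "Winning thread: mode: applier / Victim thread: mode: local" lines in both logs.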
Understanding IST (Incremental State Transfer)
- Let’s say that for some reason the user needs to take down a node (maintenance mode, or some unexpected error like we saw before). When the node re-joins, the user expects it to do a delta transfer (IST) of the changes and not a full snapshot (SST).
- That’s what IST is about: getting delta changes from the DONOR.
- The replication unit in PXC is the write-set, and each write-set has a seqno that is unique across the cluster. If a node is taken down for a short time, on reboot the node can catch up with the cluster state by requesting the missing write-sets.
- Like SST, IST also has DONOR and JOINER nodes, where-in the JOINER requests the missing write-sets. While IST is progressing, the JOINER node can’t service workload as it is still catching up with the cluster.
2016-09-19T06:30:08.967395Z 2 [Note] WSREP: State transfer required:
Group state: 627bb4e7-7e32-11e6-a942-065bdedeed0a:9
Local state: 627bb4e7-7e32-11e6-a942-065bdedeed0a:6
Understanding IST (DONOR selection)
- IST DONOR Selection
○ Let’s assume a 3-node (N1, N2, N3) cluster that has processed write-sets {A1, A2, A3, A4, A5} since inception.
○ All nodes are in sync and N3 is shut down for maintenance.
○ While N3 is down, the cluster further processes {A6, A7, A8}. Given space limitations, N1 purges {A1, A2} and N2 purges {A1, A2, A3}.
○ N3 comes back, sees it is lagging, and raises a demand for IST. It needs write-sets {A6, A7, A8}.
○ N3 actively starts looking for a DONOR node.
○ Given the current state, both nodes can act as donor, but there is more to consider.
■ By the time evaluation is done and the request is generated, a DONOR node may make progress, thereby purging more write-sets. To avoid this, Galera adds what it calls a Safety_Gap. N3 needs {A6, A7, A8} and the available write-sets in the cluster as a whole are {A3, A4, A5, A6, A7, A8}, but it will project consumability as {A4, A5, A6, A7, A8} so that if more purging takes place things are still safe (still within N3’s needed limit).
■ If, say, N3 went down earlier and comes back demanding write-sets {A3+x - A8}, then IST can’t be done and the algorithm falls back to SST (though theoretically it could still be serviced using IST).
IST DONOR selection
○ So with the updated criteria we have two nodes that can service the request. Which one to select?
○ Select the one that is far away from the lowest demanded write-set’s purge point, so the probability of a purge affecting it is reduced.
■ N1: {A3, A4, A5, A6, A7, A8}
■ N2: {A4, A5, A6, A7, A8}
■ Needed: {A6, A7, A8}
■ Select N2, as it is far away from wsrep_local_cached_downto (that is, A3).
○ Done!!!! Not yet. What if N2 is located in a different segment (read: data-center)? That means IST latency will be higher. Well, in that case fall back to selecting N1. (So 2 candidates are always shortlisted: a local donor and a remote donor. The local donor is preferred; if there is no local donor, fall back to the remote donor.)
○ FINALLY DONE. IST can now commence. So next time you need to predict which node is going to act as DONOR, you can do so and schedule maintenance workloads accordingly. You can also understand why SST was selected and not IST.
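The whole selection (safety gap, "farthest from the needed range" preference, local-segment fallback) can be sketched as follows. The SAFETY_GAP value and the candidate tuple layout are assumptions for illustration.

```python
SAFETY_GAP = 1   # illustrative margin reserved against concurrent purging

def pick_ist_donor(candidates, first_needed, joiner_segment):
    """candidates: list of (name, cached_downto, segment).
    Return a donor name, or None to fall back to SST."""
    usable = [(name, downto, seg) for name, downto, seg in candidates
              if downto + SAFETY_GAP <= first_needed]   # can safely serve IST
    if not usable:
        return None                   # no safe IST donor: fall back to SST
    local = [c for c in usable if c[2] == joiner_segment]
    pool = local or usable            # prefer a donor in our own segment
    # prefer the donor whose cache starts closest to the needed range,
    # i.e. the highest cached_downto that still satisfies the safety gap
    return max(pool, key=lambda c: c[1])[0]

# N1 caches down to A3, N2 down to A4; joiner needs A6..A8 (same segment):
print(pick_ist_donor([("N1", 3, 0), ("N2", 4, 0)], 6, 0))   # -> N2
```

Move N2 to a remote segment and the choice flips to N1, matching the fallback described above; shrink first_needed below the safety margin and the function returns None (SST).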
Common causes of failure with PXC usage
- Inconsistent or different configuration on different nodes of the same cluster. Ideally all cluster nodes should have the same configuration (except for physical limits like memory size, disk size, etc., but it is good to match these values too, especially gcache.size).
- Use of experimental features: non-transactional storage engines, LOCK TABLES, tables without a primary key, large transactions (this is being fixed in galera-4.x).
- Unstable network connectivity. Most of the issues we see at Percona are related to network connectivity. Theoretically we should have a consistent network, but since that is not practically possible, PXC/Galera has different timeouts to configure.
- Avoid network hogging by booting multiple nodes at the same time. Add one node at a time.
- For a Geo-distributed cluster, look at timeouts and segments.
- Understand the effect of operations like RSU, FTWRL, DONOR/DESYNC, etc.
Understanding timeouts
[Diagram: 3-node cluster (N1, N2, N3) on a group channel, shown in four states.]
- State-1: 3-node cluster, all healthy. When there is no workload, a keep-alive signal is sent every 1 sec.
- State-2: N1 has a flaky network connection and can’t respond within the set inactive_check_period. The cluster (N2, N3) marks N1 for the delayed list and adds it once it crosses the delayed margin.
- State-3: N1’s network connection is restored, but to get it out of the delayed list, N2 and N3 need assurance from N1 for delayed_keep_period.
- State-4: N1 again has a flaky connection and doesn’t respond back for inactive_timeout (the node is already in the delayed list). This causes the node to be pronounced DEAD.
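The four states can be sketched as a small classifier. The timeout names mirror the evs.* options mentioned above, but the logic is a deliberate simplification (the real protocol also tracks the delayed margin and per-node reply history).

```python
def classify(last_reply_age, in_delayed_list,
             inactive_check_period=0.5, inactive_timeout=15.0):
    """How the rest of the cluster views a node, given how long ago it
    last answered a keep-alive (ages in seconds; defaults illustrative)."""
    if last_reply_age >= inactive_timeout:
        return "DEAD"            # pronounced dead and dropped from the view
    if last_reply_age >= inactive_check_period or in_delayed_list:
        return "DELAYED"         # must answer for delayed_keep_period to clear
    return "HEALTHY"

print(classify(0.1, False))   # responsive, not delayed -> HEALTHY
print(classify(2.0, False))   # missed the check period -> DELAYED
print(classify(20.0, True))   # silent past inactive_timeout -> DEAD
```

Note that a node already on the delayed list stays DELAYED even while answering (State-3): it has to keep answering for delayed_keep_period before it is cleared.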
What’s new in PXC-5.7?
- Save your workload from experimental features:
○ Introducing pxc-strict-mode.
- Extended support for PFS.
○ Now monitor Galera and wsrep instrument through PFS.
- Support for encrypted tablespaces in Multi-Master Topology.
○ PXC can wire encrypted tablespaces to a newly booting node.
- ProxySQL compatible PXC
○ PXC is ProxySQL compatible with a new single-step configuration.
What’s new in PXC-5.7?
- PMM enabled monitoring for PXC
○ Effectively monitor PXC using PMM.
- MySQL/PS-5.7.14 and galera-3.17 compatibility
○ Bug fixes, improved logging, and a lot more.
- Simplified packaging for PXC
○ PXC packages now take care of galera installation too.
- Support for the latest Xtrabackup with enhanced security checks.
○ Feel the power of the improved and latest XB with PXC.
Introducing pxc_strict_mode
- PXC/Galera limitations:
○ Supports only transactional storage engines.
○ Explicit locks don’t fit in a multi-master topology.
○ Needs a primary key for certification data generation.
○ XA transactions are not supported.
- pxc_strict_mode helps control execution of experimental features, for example MyISAM table replication, explicit locking, etc.
Introducing pxc_strict_mode
- pxc_strict_mode can be set to the following 4 different values:
- ENFORCING (DEFAULT/RECOMMENDED): Use of experimental features raises an error (during startup the server refuses to start; at runtime the operation is blocked and the error is logged).
- DISABLED: PXC-5.6 compatible. Allows experimental features. No error, no warning.
- PERMISSIVE: Use of an experimental feature results in a warning at startup and runtime. The server continues to accept the setting and operate.
- MASTER: Same as ENFORCING for all experimental features, except that explicit-table-locking validation checks are not performed under this mode.
- pxc_strict_mode is local to a given node; if the user plans to toggle it, it should be done on all the nodes for cluster consistency and correctness.
Introducing pxc_strict_mode
So what is not allowed under pxc_strict_mode?
- DML and DDL operations (except CREATE/DROP) are not permitted on non-transactional storage engines.
- A table can be converted from a non-transactional SE to a transactional SE using ALTER.
- Trying to enable MyISAM replication is blocked.
- binlog-format has to be ROW.
- DML to tables without a primary key is not allowed.
- log-output has to be directed to FILE or DISABLED (NONE).
- Explicit table locking features (LOCK TABLES, GET_LOCK, FLUSH TABLES WITH READ LOCK, setting the SERIALIZABLE transaction level) are blocked.
- auto-increment mode has to be INTERLEAVED.
- Combining schema and DML changes in a single statement, like CTAS, is not permitted.
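The interaction of the four mode values with an experimental feature can be summarized in one function. This is a sketch of the rules stated above, not server code; the feature names and return values are illustrative.

```python
def validate(mode: str, feature: str) -> str:
    """Return 'error', 'warning', or 'allow' for an experimental feature
    under a given pxc_strict_mode value."""
    if mode == "DISABLED":
        return "allow"                    # PXC-5.6 compatible, no warning
    if mode == "PERMISSIVE":
        return "warning"                  # logged, but the operation proceeds
    if mode == "MASTER" and feature == "explicit_table_locking":
        return "allow"                    # the one check MASTER skips
    return "error"                        # ENFORCING (and MASTER otherwise)

print(validate("ENFORCING", "myisam_replication"))     # -> error
print(validate("MASTER", "explicit_table_locking"))    # -> allow
```

Remember the earlier caveat: the mode is node-local, so a PERMISSIVE node would happily replicate what an ENFORCING node rejects; keep the value identical cluster-wide.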
Monitoring PXC through Performance Schema
- The traditional method of monitoring PXC or MySQL through log files is time consuming and may need special tools even to detect the occurrence of an event.
- Performance Schema is an effective way and has become the de facto standard for monitoring different elements of MySQL.
- Till 5.6, PXC had limited support for performance_schema, where-in only wsrep-related instruments were exposed, and that too in limited fashion.
- Starting with PXC-5.7 we have taken a big step further, enabling monitoring of galera instruments and other wsrep instruments as part of performance schema.
Monitoring PXC through Performance Schema
- Instruments that are monitored:
a. THREADS: applier, rollback, service_thd, gcomm conn, receiver, sst/ist threads, etc.
b. LOCK/COND_VARIABLES: from the wsrep and galera library.
c. FILE: record-set file, ring-buffer file (default gcache), gcache-page file.
d. STAGES: the different stages threads are passing through.
- With this information, the user should be able to track some of the most important instruments and gain insight into where the server is really spending time, or into out-of-path flows like rollback of transactions.
Securing PXC
Secure PXC while in TRANSIT and while at REST.
During TRANSIT, PXC has the following 3 kinds of data traffic:
- SST traffic, which goes through some independent tool (rsync, mysqldump, xtrabackup).
- IST traffic (inter-node traffic), which is controlled internally by PXC/Galera (socket.ssl* options).
- Replication traffic for general write-set replication.
During REST:
- MySQL introduced encrypted tablespaces in 5.7. PXC inherited this and added support to move these tablespaces across the cluster during the SST operation, thereby fully supporting encrypted tablespaces (in turn securing data at REST).
Geo-distributed Cluster
- Nodes of the cluster are distributed across data-centers, but all nodes are still part of one big cluster.
[Diagram: two 3-node segments (N1-N3 and N4-N6) connected over a WAN.]
- Important aspects to consider:
○ gmcast.segment, which helps notify topology. All local nodes should have the same gmcast.segment. This helps in optimizing replication traffic and of course SST and IST.
○ Tune timeouts if needed for your WAN setup.
○ Avoid flow-control and fragmentation at the galera level (gcs.max_packet_size=1048576; evs.send_window=512; evs.user_send_window=256).
○ Fragmentation at the binlog level: binlog-row-event-max-size.
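The tuning values above all go into a single wsrep_provider_options setting in my.cnf. A tiny helper sketch to compose that string (option names are the ones from the slide; the helper itself is purely illustrative):

```python
def provider_options(opts: dict) -> str:
    """Join Galera provider options into one my.cnf value, e.g.
    wsrep_provider_options="gmcast.segment=1;..." """
    return ";".join(f"{k}={v}" for k, v in opts.items())

wan = provider_options({
    "gmcast.segment": 1,                 # per data-center segment id
    "gcs.max_packet_size": 1048576,      # avoid fragmentation at galera level
    "evs.send_window": 512,
    "evs.user_send_window": 256,
})
print(wan)
```

Each data-center would use its own gmcast.segment value while sharing the rest of the options.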
Proxy-SQL compatible
- ProxySQL is a load balancer / MySQL proxy.
- PXC is ProxySQL compatible, and Percona has developed some really cool scripts for monitoring cluster state.
- Also, default configuration of ProxySQL should be a single-step process with the new proxysql-admin script.
- You can also download packages from the Percona site or talk to Percona support for this.