SLIDE 1

Understanding Percona XtraDB Cluster 5.7 Operation and Key Algorithms

Krunal Bauskar PXC Product Lead (Percona Inc.)

SLIDE 2

Objective

  • “I want to use Percona XtraDB Cluster, but is it suitable for my needs and can it handle my workloads?”

  • “As an active user of Percona XtraDB Cluster, I wonder why my transactions keep failing and whether this workload will run correctly with the software.”

SLIDE 3

Agenda

  • PXC technology
  • Understanding State Snapshot Transfer (SST)
  • How replication works?
  • How certification works?
  • Different types of failures
  • Understanding IST (Incremental State Transfer)
  • Common causes of failures with PXC
  • What's new in PXC 5.7
  • Introducing pxc_strict_mode
  • Monitoring PXC through PFS
  • Securing a PXC cluster
  • Geo-distributed cluster setup
  • ProxySQL compatibility

SLIDE 4

PXC technology

  • PXC is a multi-master solution that provides high availability using MySQL and the Galera replicator, based on a synchronous replication model.

  • A PXC node can operate in 2 modes: standalone (Percona Server compatible) or with the Galera replicator loaded (as a PXC node). Note that PXC ships as a different binary from Percona Server.

    Layer-1: Percona Server with the write-set replication plugin.
    Layer-2: Galera replicator.
    Layer-3: gcomm communication channel.
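
A quick way to confirm which layer is active on a node is to query the standard wsrep variables and status counters; a minimal sketch (plain SQL, nothing beyond the usual wsrep names):

    -- Is the Galera replicator loaded? (shows the library path when loaded, none when standalone)
    SHOW GLOBAL VARIABLES LIKE 'wsrep_provider';
    -- Size and health of the cluster this node belongs to
    SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
    SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';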

SLIDE 5

Understanding SST

  • Concept:

    ○ The initial/first PXC node is started in cluster bootstrap mode (wsrep_cluster_address=gcomm:// or --wsrep_new_cluster). This node creates a new cluster.
    ○ Follow-up nodes connect to the existing cluster and sync with the cluster state before they start processing the workload.
    ○ This gives rise to a DONOR and JOINER relationship in PXC. The DONOR donates the data (also known as write-sets); the JOINER receives it. This process is called SST, where-in an almost complete data-dir is copied over from the DONOR.

  • PXC supports 3 different ways of doing SST:

    ○ rsync
    ○ mysqldump
    ○ xtrabackup (RECOMMENDED and DEFAULT)
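
Which SST method a node will use, and whether it is currently acting as DONOR or JOINER, can be checked with the usual wsrep variables; a minimal sketch:

    -- SST method (xtrabackup-v2 is the default in PXC 5.7)
    SHOW GLOBAL VARIABLES LIKE 'wsrep_sst_method';
    -- Current node state (e.g. Synced, or Donor/Desynced while servicing an SST)
    SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';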

SLIDE 6

Understanding SST

  • How does PXC detect and handle this?

    ○ Each node of the cluster locally maintains a graph/picture of the cluster.
    ○ When a new node joins the cluster, the node is made part of the cluster and the graph is updated. This is important so that newly generated write-sets can be delivered to the new node too.
    ○ The new node then detects that its local state (-1) is behind the cluster state (say N).
    ○ The node then searches for a DONOR.
    ○ The JOINER then sends a request to the DONOR for SST (that is, a complete state transfer).
    ○ The DONOR node enters the DONOR/DESYNC state and starts servicing the request. While the DONOR node is servicing write-sets to the JOINER, it continues to receive cluster write-sets too.
    ○ Once the JOINER gets the needed snapshot, it can fill in the pending write-sets from its own gcache to get in sync with the CLUSTER state.
    ○ The new node is now ready to service the workload.

  • Why Xtrabackup?

    ○ SST is a time-consuming operation, especially when you have TBs of data.
    ○ XB is optimized to use BACKUP LOCKS, where-in the DONOR node is not paused during data transmission.
    ○ It also has an option to secure PXC SST traffic.

SLIDE 7

Understanding SST (DONOR selection)

  • SST DONOR selection:

    ○ Search by name (if specified by the user via --wsrep_sst_donor).
      ■ If the node's state restricts it from acting as DONOR, things get delayed:

        Member 0.0 (n3) requested state transfer from 'n2', but it is impossible to select State Transfer donor: Resource temporarily unavailable

    ○ Search for a donor by state:
      ■ Scan all nodes.
      ■ Check if a node can act as DONOR (avoid DESYNCed nodes, arbitrator).
        • YES: Is the node part of the same segment as the JOINER? -> Yes -> Got the DONOR.
        • NO: Keep searching, but cache this remote-segment candidate too. In the worst case it will be used.
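
A preferred donor can be hinted per node; a minimal sketch (the node names n2, n1 are just the placeholders from the log above, and the exact donor-list syntax should be checked against your PXC version):

    -- Preferred donor(s); usually set in my.cnf, e.g. wsrep_sst_donor=n2,n1
    SHOW GLOBAL VARIABLES LIKE 'wsrep_sst_donor';
    -- Ending the list with a comma (e.g. 'n2,') lets the joiner fall back to any other donor
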
SLIDE 8

How replication works?

  • Short answer:

    ○ Using binary log events (the same unit that MySQL uses for master-slave replication).

  • Transaction execution steps:

    ○ The user initiates a transaction on a given node.
    ○ The node processes the transaction and keeps track of each data object being modified.
    ○ On commit, a write-set is generated with the binlog events + certification data (what is modified).
    ○ This write-set is then replicated on the group channel.
    ○ All nodes (including the originating node) listen to the group channel.
    ○ If node = originating node, it has to acknowledge receipt of the write-set and update counters.
    ○ If node != originating node, it has to consume the write-set: certify it, apply it and then commit it.
    ○ After the originating node submits the packet on the channel, it certifies the packet and commits the transaction.
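
Each step of this flow is reflected in standard wsrep status counters, which makes it easy to observe on a live node; a minimal sketch:

    -- Write-sets this node replicated to the group vs. write-sets it received from the group
    SHOW GLOBAL STATUS LIKE 'wsrep_replicated';
    SHOW GLOBAL STATUS LIKE 'wsrep_received';
    -- Seqno of the last write-set committed on this node
    SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';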

SLIDE 9

How replication works?

[Diagram: Node-1 processes a transaction and on commit puts its write-set on the group channel. Node-1 certifies and commits it locally, acknowledges the write-set and updates counters, while Node-2 and Node-3 each certify, apply and commit the same write-set.]

SLIDE 10

How replication works? (TOI replication)

  • TOI stands for Total Order Isolation.
  • All DDL and MyISAM replication execute in TOI fashion. Any statement that is not allowed to fail due to a conflict is TOI-executed (for example DDL, or MyISAM since MyISAM is non-transactional). A short example follows below.
  • Let's understand the TOI flow:

    ○ The user initiates a TOI statement (say DDL).
    ○ The query is wrapped as a packet to replicate and, along with a certification key (which in this case is the db and table name), is added to the channel.
    ○ The packet is serviced using a special TOI flow path that locks the complete operation by holding the Apply and Commit monitors for the period of TOI execution. This ensures no other transaction is allowed to proceed and existing transactions are done with their work.

  • If there is no parallel transaction, then why do we need a certification key?

    ○ Say N1 executed DROP TABLE <table>, which is replicated to N2, and N2 is trying to insert into the same table.
    ○ Without certification, N2 would replicate, certify and proceed to commit. In the meantime, the table is attempted for removal, but the removal path discovers that the table is in use, which then generates a conflict.
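
For reference, DDL runs under TOI because the online-schema-upgrade method defaults to TOI; a minimal sketch (the table db.t1 is just illustrative, and session-level use of the variable is an assumption worth verifying on your version):

    -- TOI (Total Order Isolation) is the default OSU method; RSU (Rolling Schema Upgrade) is the alternative
    SHOW GLOBAL VARIABLES LIKE 'wsrep_OSU_method';
    -- A DDL issued under TOI is replicated and executed in the same total order on every node
    SET SESSION wsrep_OSU_method = 'TOI';
    ALTER TABLE db.t1 ADD COLUMN j INT;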

SLIDE 11

How replication works? (Parallel replication)

  • PXC can apply write-sets in parallel, thereby improving overall throughput.
  • wsrep_slave_threads helps control this option.
  • Let's quickly understand how parallel replication works:

    ○ A 3-node cluster: N1, N2, N3 are all healthy and processing write-sets.
    ○ Say N1 and N2 generated non-conflicting write-sets; N2 goes first, followed by N1.
    ○ Let's take a view from N3's perspective. N3 has set wsrep_slave_threads = 2, so it can apply both write-sets in parallel. (Note: this is just like MySQL executing 2 workloads from 2 different clients.)
    ○ While the apply can proceed in parallel, commits are ordered. That is, N2 has to commit first even if N1's replication thread is allowed to proceed. (These coordinating units are called monitors (Apply, Commit). There is also a Local monitor meant for local node action coordination.)
    ○ Also note that, since the apply action is only about executing the write-set, it is pretty quick and less resource hogging unless the query demands it. (Parsing, filtering, etc. have all been taken care of on the host machine that initiated the transaction, so in general a write-set is an optimized unit for application.)
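
Two knobs/counters worth knowing here: the applier thread count, and the certification-dependency distance, which hints at how much parallelism the workload actually allows; a minimal sketch:

    -- Number of parallel applier threads on this node (dynamic)
    SET GLOBAL wsrep_slave_threads = 2;
    -- Average gap between the lowest and highest seqno that may be applied in parallel
    SHOW GLOBAL STATUS LIKE 'wsrep_cert_deps_distance';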

SLIDE 12

How certification works?

  • Let's now understand the most crucial part of replication: how certification works.
  • Basic principles:

    ○ The ORIGINATOR NODE also certifies its own transaction.
    ○ FIRST COMMITTER TO THE GROUP CHANNEL WINS.

  • Example setup (each node N1, N2, N3 generates its own write-set: N1-wset, N2-wset, N3-wset):

    t1: (i int, primary key pk(i)) with rows (1, 2, 3)
    t2: (i int, primary key pk(i)) with rows (11, 22, 33)
    N1: update t1 set i = i + 10;
    N2: update t1 set i = i + 100;
    N3: update t2 set i = i + 10;
    N1-wset: {db.t1.r1, db.t1.r2, db.t1.r3}
    N2-wset: {db.t1.r1, db.t1.r2, db.t1.r3}
    N3-wset: {db.t2.r1, db.t2.r2, db.t2.r3}
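
The same setup expressed as plain SQL, for reference (a minimal sketch; db, t1 and t2 are the names used in the example above):

    CREATE TABLE db.t1 (i INT, PRIMARY KEY pk(i));
    CREATE TABLE db.t2 (i INT, PRIMARY KEY pk(i));
    INSERT INTO db.t1 VALUES (1), (2), (3);
    INSERT INTO db.t2 VALUES (11), (22), (33);

    UPDATE db.t1 SET i = i + 10;    -- on N1: write-set keys {db.t1.r1, db.t1.r2, db.t1.r3}
    UPDATE db.t1 SET i = i + 100;   -- on N2: same keys, conflicts with N1's write-set
    UPDATE db.t2 SET i = i + 10;    -- on N3: different keys, no conflict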

SLIDE 13

How certification works? (N1)

[Diagram: certification as seen on N1. N2-wset wins (first committer to the group channel) and is certified; the CCV records db.t1.r1 -> N2, db.t1.r2 -> N2, db.t1.r3 -> N2. N1-wset is then certified against the CCV and conflicts on db.t1.r1/r2/r3 (held by N2, N2 != N1), so the N1 write-set is rejected. N3-wset touches only db.t2.*, is certified, and the CCV additionally records db.t2.r1 -> N3, db.t2.r2 -> N3, db.t2.r3 -> N3.]

SLIDE 14

How certification works? (N2)

[Diagram: the same certification sequence as seen on N2. The outcome is identical: N2-wset and N3-wset are certified, N1-wset is rejected on the conflicts db.t1.r1/r2/r3 (N2 != N1), and the CCV ends up mapping db.t1.* -> N2 and db.t2.* -> N3.]

SLIDE 15

How certification works? (N3)

[Diagram: the same certification sequence as seen on N3. Every node reaches the same decision independently: N2-wset and N3-wset certified, N1-wset rejected, and the CCV mapping db.t1.* -> N2 and db.t2.* -> N3.]

SLIDE 16

Different types of failures

  • Brute-Force Abort (wsrep_local_bf_aborts):

    ○ N1 and N2 are both modifying the same data-set (conflicting workload). N2 is done and serializes its write-set to the cluster. N1 is still executing; in the meantime N1's replication thread tries to apply N2's transaction and finds it conflicting with N1's active transaction. N1's replication thread will forcefully abort N1's active transaction.

    On N2:
        begin; insert into t values (1); commit;

    On N1:
        begin; insert into t values (1);
        Apply_event (from group channel). Found conflicting transaction. FORCE ABORT it. -> N1's transaction is aborted.

    2016-09-18T06:52:11.376545Z 3 [Note] WSREP: cluster conflict due to high priority abort for threads:
    2016-09-18T06:52:11.376566Z 3 [Note] WSREP: Winning thread: THD: 3, mode: applier, state: executing, conflict: no conflict, seqno: 3 SQL: (null)
    2016-09-18T06:52:11.376580Z 3 [Note] WSREP: Victim thread: THD: 6, mode: local, state: idle, conflict: no conflict, seqno: -1 SQL: (null)
    2016-09-18T06:52:11.376593Z 3 [Note] WSREP: BF kill (1, seqno: 3), victim: (6) trx: 1287
    2016-09-18T06:52:11.376599Z 3 [Note] WSREP: Aborting query: void
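
The node counts how many local transactions were brute-force aborted by the applier, and the victim session typically sees a deadlock error; a minimal sketch (the client-side error text shown is an assumption and may vary by version):

    -- Local transactions aborted by high-priority (replicated) transactions on this node
    SHOW GLOBAL STATUS LIKE 'wsrep_local_bf_aborts';
    -- The aborted session usually gets something like:
    --   ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction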

SLIDE 17

Different types of failures

  • Certification Failure (wsrep_local_cert_failures):

    ○ N2 generated a transaction that N1 is trying to apply. Reaching the apply stage means the transaction has already been certified.
    ○ N1 then initiates a conflicting transaction and tries to commit it. Before the transaction is allowed to commit, it is first written to the group channel, followed by a certification check. At this stage N1 detects there is a conflicting transaction that has already been certified, so N1's certification fails. This is called a local certification failure.

    On N2: insert into t values (1);   (replicated, CERTIFIED, and applied on N1)
    On N1: insert into t values (1); COMMIT -> certification fails -> ABORT

    2016-09-18T07:11:51.017630Z 6 [Note] WSREP: cluster conflict due to certification failure for threads: 2016-09-18T07:11:51.017655Z 6 [Note] WSREP: Victim thread: THD: 6, mode: local, state: executing, conflict: cert failure, seqno: 5 SQL: insert into t values (10) 2016-09-18T07:11:51.027158Z 6 [Note] WSREP: cleanup transaction for LOCAL_STATE: insert into t values (10) 2016-09-18T07:11:51.027276Z 6 [Note] WSREP: cert failure, thd: 6 is_AC: 1, retry: 1 - 1 SQL: insert into t values (10)

    Certification failures (for autocommit statements) are retried. Check wsrep_retry_autocommit.
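
As with brute-force aborts there is a per-node counter, plus a variable that controls how many times a failed autocommit statement is retried transparently; a minimal sketch:

    -- Write-sets generated locally that failed the certification test
    SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures';
    -- Automatic retries for autocommit statements after a certification failure (default 1)
    SHOW GLOBAL VARIABLES LIKE 'wsrep_retry_autocommit';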

SLIDE 18

Understanding IST (Incremental State Transfer)

  • Let's say for some reason the user needs to take down a node (maintenance mode, or some unexpected error like we saw before). When the node re-joins, the user expects it to do a delta transfer (IST) of the changes and not a full snapshot (SST).

  • That's what IST is about: getting the delta changes from a DONOR.

  • The replication unit in PXC is the write-set, and each write-set has a seqno that is unique across the cluster. If a node is taken down for a short time, on reboot the node can catch up with the cluster state by requesting the missing write-sets.

  • Like SST, IST also has DONOR and JOINER nodes, where-in the JOINER requests the missing write-sets. While IST is progressing, the JOINER node can't service the workload as it is still catching up with the cluster.

2016-09-19T06:30:08.967395Z 2 [Note] WSREP: State transfer required: Group state: 627bb4e7-7e32-11e6-a942-065bdedeed0a:9 Local state: 627bb4e7-7e32-11e6-a942-065bdedeed0a:6

SLIDE 19

Understanding IST (DONOR selection)

  • IST DONOR selection:

    ○ Let's assume a 3-node cluster (N1, N2, N3) that has processed write-sets A1-A5 since inception: {A1, A2, A3, A4, A5}.
    ○ All nodes are in sync and N3 is shut down for maintenance.
    ○ While N3 is down, the cluster further processes {A6, A7, A8}. Given space limitations, N1 purges {A1, A2} and N2 purges {A1, A2, A3}.
    ○ N3 comes back, sees it is lagging and raises a demand for IST. It needs write-sets {A6, A7, A8}.
    ○ N3 actively starts looking for a DONOR node.
    ○ Given the current state both nodes can act as donor, but there is more to consider.
      ■ By the time the evaluation is done and the request is generated, the DONOR node may make progress, thereby purging more write-sets. To avoid this Galera adds what it calls a safety gap. N3 needs {A6, A7, A8} and the available write-sets in the cluster as a whole are {A3, A4, A5, A6, A7, A8}, but it will project consumability as {A4, A5, A6, A7, A8} so that if more purging takes place things are still safe (still within the limit N3 needs).
      ■ If, say, N3 went down earlier and comes back demanding write-sets {A3+x - A8}, then IST can't be done and the algorithm falls back to SST (though theoretically it could still be serviced using IST).

SLIDE 20

IST DONOR selection

    ○ So with the updated criteria we have two nodes that can service the request. Which one to select?
    ○ Select the one that is far away (closer to the lowest demanded write-set), so the probability of purging affecting it is reduced.
      ■ N1: {A3, A4, A5, A6, A7, A8}
      ■ N2: {A4, A5, A6, A7, A8}
      ■ Needed: {A6, A7, A8}
      ■ Select N2, as it is far away from the lowest wsrep_local_cached_downto (that is A3).
    ○ Done? Not yet. What if N2 is located in a different segment (read: data-center)? That means IST latency will be higher. Well, in that case fall back to selecting N1. (So 2 candidates are always shortlisted: a local donor and a remote donor. The local donor is preferred; if there is no local donor, fall back to the remote donor.)
    ○ FINALLY DONE. IST can now commence. So next time you need to predict which node is going to act as DONOR, you can do so and schedule maintenance or workloads accordingly. You can also understand why SST was selected and not IST.
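
wsrep_local_cached_downto, used in the selection above, is exposed as a status variable, so you can check how far back each candidate donor's gcache reaches; a minimal sketch:

    -- Lowest seqno still present in this node's gcache; a joiner that needs anything
    -- older than this cannot take IST from this node and falls back to SST
    SHOW GLOBAL STATUS LIKE 'wsrep_local_cached_downto';
    -- Last committed seqno, for comparison with the joiner's demand
    SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';
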
SLIDE 21

Common causes of failure with PXC usage

  • Inconsistent or different configuration on different nodes of the same cluster. Ideally all cluster nodes should have the same configuration (except for physical limits like memory size, disk size, etc., but it is good to match these values too, especially gcache.size).

  • Use of experimental features: non-transactional storage engines, LOCK TABLES, tables without a primary key, large transactions (this is being fixed in Galera 4.x).

  • Lack of stable network connectivity. Most of the issues we see at Percona are related to network connectivity. Theoretically we should have a consistent network, but given that this is not practically possible, PXC/Galera has different timeouts to configure.

  • Avoid network hogging by booting multiple nodes at the same time. Add one node at a time.

  • For a geo-distributed cluster, look at timeouts and segments.

  • Understand the effect of operations like RSU, FTWRL, DONOR/DESYNC, etc.

SLIDE 22

Understanding timeouts

[Diagram: a 3-node cluster (N1, N2, N3) connected over the group channel, shown in four states.]

State-1: 3-node cluster, all nodes healthy. When there is no workload, a keep-alive signal is sent every 1 sec.

State-2: N1 has a flaky network connection and can't respond within the configured inactive_check_period. The cluster (N2, N3) marks N1 to be added to the delayed list, and adds it once it crosses the delayed margin.

State-3: N1's network connection is restored, but to get it out of the delayed list N2 and N3 need assurance from N1 for delayed_keep_period.

State-4: N1 again has a flaky connection and doesn't respond back for inactive_timeout (the node is already in the delayed list). This causes the node to be pronounced DEAD.
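
These periods correspond to Galera's evs.* options, which are packed into wsrep_provider_options; a minimal sketch of how to inspect them (the option names are standard EVS settings):

    -- Look for evs.keepalive_period, evs.inactive_check_period, evs.delayed_margin,
    -- evs.delayed_keep_period and evs.inactive_timeout in the option string
    SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';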

SLIDE 23

What's new in PXC 5.7?

  • Save your workload from experimental features:

    ○ Introducing pxc-strict-mode.

  • Extended support for PFS.

    ○ Now monitor Galera and wsrep instruments through PFS.

  • Support for encrypted tablespaces in a multi-master topology.

    ○ PXC can wire an encrypted tablespace to a newly booting node.

  • ProxySQL-compatible PXC.

    ○ PXC is ProxySQL compatible with a new single-step configuration.

SLIDE 24

What's new in PXC 5.7?

  • PMM-enabled monitoring for PXC.

    ○ Effectively monitor PXC using PMM.

  • MySQL/PS 5.7.14 and Galera 3.17 compatibility.

    ○ Bug fixes, improved logging and a lot more.

  • Simplified packaging for PXC.

    ○ PXC packages now take care of the Galera installation too.

  • Support for using the latest Xtrabackup with enhanced security checks.

    ○ Feel the power of the improved, latest XB with PXC.

SLIDE 25

Introducing pxc_strict_mode

  • PXC/Galera limitations:

    ○ Supports only transactional storage engines.
    ○ Explicit locks don't fit in a multi-master topology.
    ○ A primary key is needed for certification-data generation.
    ○ XA transactions are not supported.

  • pxc_strict_mode helps control the execution of experimental features, for example MyISAM table replication, explicit locking, etc.

SLIDE 26

Introducing pxc_strict_mode

  • pxc_strict_mode can be set to the following 4 values:

    ○ ENFORCING (DEFAULT/RECOMMENDED): use of an experimental feature raises an error (during startup the server refuses to start; at runtime the operation is blocked and the error is logged).
    ○ DISABLED: PXC 5.6 compatible. Allows experimental features. No error, no warning.
    ○ PERMISSIVE: use of an experimental feature results in a warning at startup and at runtime. The server continues to accept the setting and operate.
    ○ MASTER: same as ENFORCING for all experimental features, except that explicit-table-locking validation checks are not performed under this mode.

  • pxc_strict_mode is local to a given node; if the user plans to toggle it, it should be done on all nodes for cluster consistency and correctness.
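
pxc_strict_mode is a regular dynamic variable, so checking and toggling it per node is straightforward; a minimal sketch (remember to apply the same value on every node):

    -- Current mode on this node (ENFORCING is the default in PXC 5.7)
    SHOW GLOBAL VARIABLES LIKE 'pxc_strict_mode';
    -- Change it at runtime; repeat on every node to keep the cluster consistent
    SET GLOBAL pxc_strict_mode = 'PERMISSIVE';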

SLIDE 27

Introducing pxc_strict_mode

So what is not allowed under pxc_strict_mode? A few of these are illustrated after the list.

  • DML and DDL operations (except CREATE/DROP) are not permitted on non-transactional storage engines.
  • A table can be converted from a non-transactional SE to a transactional SE using ALTER.
  • Trying to enable MyISAM replication is blocked.
  • binlog-format has to be ROW.
  • DML on tables without a primary key is not allowed.
  • log-output has to be directed to FILE or disabled (NONE).
  • Explicit table-locking features (LOCK TABLES, GET_LOCK, FLUSH TABLES WITH READ LOCK, setting the SERIALIZABLE transaction level) are blocked.
  • The auto-increment lock mode has to be INTERLEAVED.
  • Combining schema and DML changes in a single statement (like CTAS) is not permitted.
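
A couple of these restrictions illustrated as SQL under pxc_strict_mode=ENFORCING (a hedged sketch; the table names are illustrative and the exact error messages may differ across PXC versions):

    -- MyISAM: CREATE/DROP is allowed, but DML on the table is blocked
    CREATE TABLE t_myisam (i INT PRIMARY KEY) ENGINE=MyISAM;
    INSERT INTO t_myisam VALUES (1);        -- rejected under ENFORCING
    ALTER TABLE t_myisam ENGINE=InnoDB;     -- converting to a transactional SE is permitted

    -- DML on a table without a primary key is also rejected
    CREATE TABLE t_nopk (i INT) ENGINE=InnoDB;
    INSERT INTO t_nopk VALUES (1);          -- rejected under ENFORCING
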
SLIDE 28

Monitoring PXC through Performance Schema

  • The traditional method of monitoring PXC or MySQL through the log file is time consuming and may need special tooling even to detect the occurrence of an event.

  • Performance Schema is an effective way and has become the de facto standard for monitoring different elements of MySQL.

  • Till 5.6, PXC had limited support for performance_schema, where-in only wsrep-related instruments were exposed through Performance Schema, and that too in a limited fashion.

  • Starting with PXC 5.7 we have taken a big step further, enabling monitoring of Galera instruments and other wsrep instruments as part of Performance Schema.

SLIDE 29

Monitoring PXC through Performance Schema

  • Instruments that are monitored:

    a. THREADS: applier, rollback, service_thd, gcomm connection, receiver, SST/IST threads, etc.
    b. LOCKS/COND VARIABLES: from the wsrep and Galera libraries.
    c. FILES: record-set file, ring-buffer file (default gcache), gcache page files.
    d. STAGES: the different stages threads are passing through.

    With this information, the user should be able to track some of the most important instruments, which can help get insight into where the server is really spending time, or into out-of-path flows like rollback of transactions.
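
A hedged example of pulling the wsrep/Galera instruments out of Performance Schema (the LIKE patterns are assumptions; the exact instrument names can be confirmed from setup_instruments on your build):

    -- Background threads registered by the wsrep/Galera layers
    SELECT NAME, TYPE, PROCESSLIST_STATE
      FROM performance_schema.threads
     WHERE NAME LIKE '%wsrep%' OR NAME LIKE '%galera%';

    -- Which wsrep/Galera instruments exist and whether they are enabled
    SELECT NAME, ENABLED
      FROM performance_schema.setup_instruments
     WHERE NAME LIKE '%wsrep%' OR NAME LIKE '%galera%';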

SLIDE 30

Securing PXC

Secure PXC both in TRANSIT and at REST.

In TRANSIT, PXC has the following 3 kinds of data traffic:

  • SST traffic, which goes through an independent tool (rsync, mysqldump, xtrabackup).
  • IST traffic (inter-node traffic), which is controlled internally by PXC/Galera (socket.ssl* options).
  • Replication traffic for general write-set replication.

At REST:

  • MySQL introduced encrypted tablespaces in 5.7. PXC inherited this and added support to move these tablespaces across the cluster during the SST operation, there-by fully supporting encrypted tablespaces (in turn securing data at rest).
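
The socket.ssl* options mentioned above live in wsrep_provider_options; a minimal sketch of verifying that replication/IST traffic is encrypted (the option names are standard Galera settings):

    -- Look for socket.ssl_cert, socket.ssl_key and socket.ssl_ca in the option string;
    -- when they are set, Galera replication and IST traffic runs over TLS
    SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';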

SLIDE 31

Geo-distributed Cluster

  • Nodes of the cluster are distributed across data-centers, yet all nodes are part of one big cluster.

    [Diagram: six nodes (N1-N6) spread across data-centers, connected over a WAN as one cluster.]

  • Important aspects to consider:

    ○ gmcast.segment, which helps describe the topology. All local nodes should have the same gmcast.segment. This helps in optimizing replication traffic and, of course, SST and IST.
    ○ Tune timeouts if needed for your WAN setup.
    ○ Avoid flow control and fragmentation at the Galera level (gcs.max_packet_size=1048576; evs.send_window=512; evs.user_send_window=256).
    ○ Fragmentation at the binlog level: binlog-row-event-max-size.
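
A minimal sketch of checking the segment and the WAN-related Galera options on a node (gmcast.segment itself is normally set via wsrep_provider_options in my.cnf, one value per data-center):

    -- gmcast.segment, gcs.max_packet_size, evs.send_window and evs.user_send_window
    -- all appear in this option string; nodes in the same data-center share a segment value
    SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';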

SLIDE 32

ProxySQL compatible

  • ProxySQL is a load balancer / MySQL proxy.
  • PXC is ProxySQL compatible, and Percona has developed some really cool scripts for monitoring the cluster state.
  • Also, default configuration of ProxySQL should be a single-step process with the new proxysql-admin script.
  • You can also download packages from the Percona site, or talk to Percona Support about this.

SLIDE 33

Q&A

Write to me at krunal.bauskar@percona.com Or DROP in at Percona Booth