
Percona XtraDB Cluster: Failure Scenarios and their Recovery

Krunal Bauskar (PXC Lead, Percona) Alkin Tezuysal (Sr. Technical Manager, Percona)


Who we are?

Krunal Bauskar

  • Database enthusiast.
  • Practicing databases (MySQL) for over a decade now.
  • Wide interest in data handling and management.
  • Worked on some real big data that powered applications @ Yahoo, Oracle, and Teradata.

Alkin Tezuysal (@ask_dba)

  • Open Source Database Evangelist
  • Global Database Operations Expert
  • Cloud Infrastructure Architect (AWS)
  • Inspiring Technical and Strategic Leader
  • Creative Team Builder
  • Speaker, Mentor, and Coach
  • Outdoor Enthusiast

Agenda

  • Quick sniff at PXC
  • Failure Scenarios and their recovery
  • PXC Genie - You wish. We implement.
  • Q & A

Quick Sniff at PXC


What is PXC?

  • Auto-node provisioning
  • Multi-master
  • Performance tuned
  • Enhanced security
  • Flexible topology
  • Network protection (geo-distributed)


Failure Scenarios and their recovery

Scenario: New node fails to connect to the cluster

(Joiner log shown on slide)

  • The DONOR log doesn't have any trace of the JOINER trying to JOIN, and the administrator has reviewed that configuration settings like IP addresses are sane and valid. Still, the JOINER fails to connect.
  • The usual culprit: SELinux/AppArmor.
  • Don't confuse this error with SST, since the node has not yet been offered membership of the cluster. SST comes post-membership.
  • Solution-1:
    ○ Set the mode to PERMISSIVE or DISABLED.
  • Solution-2:
    ○ Configure a policy to allow access in ENFORCING mode.
    ○ Related blogs:
      ■ “Lock Down: Enforcing SELinux with Percona XtraDB Cluster”. It probes which permissions are needed and adds rules accordingly.
      ■ “Lock Down: Enforcing AppArmor with Percona XtraDB Cluster”
      ■ Using these, we can continue to run with SELinux enabled. (You can also refer to the SELinux configuration notes on the Codership site.)
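
A rough sketch of both solutions on an SELinux host (assuming RHEL-style tooling; on AppArmor systems the analogous tools are aa-complain/aa-enforce, and the blog posts above walk through the full policy):

  # Solution-1: put SELinux in permissive mode right away ...
  sudo setenforce 0
  # ... and persist the change across reboots
  sudo sed -i 's/^SELINUX=.*/SELINUX=permissive/' /etc/selinux/config

  # Solution-2: stay in ENFORCING mode and extend the policy instead.
  # Collect the denials hit by mysqld and turn them into a local policy module.
  sudo ausearch -c 'mysqld' --raw | audit2allow -M pxc_local   # pxc_local is an arbitrary module name
  sudo semodule -i pxc_local.pp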

PXC can operate with SELinux/AppArmor.


Scenario: Catching up with the cluster (SST, IST)

  • SST: a complete copy-over of the data directory.
    ○ SST has multiple external components: the SST script, XB (xtrabackup), the network aspect, etc. Some of these are outside the control of the PXC process.
  • IST: only the missing write-sets (as the node is already a member of the cluster).
    ○ Intrinsic to the PXC process space.


#1 (Joiner log): SST failed on the DONOR

  • Cause: wsrep_sst_auth was not set on the DONOR.
  • wsrep_sst_auth should be set on the DONOR (users often set it on the JOINER and things still fail). Post SST, the JOINER will copy over the said user from the DONOR.

#2 (Donor log): authentication failure

Possible causes:

  • The specified wsrep_sst_auth user doesn't exist.
  • The credentials are wrong.
  • The user has insufficient privileges.
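
A minimal sketch of a working setup on the DONOR, assuming the xtrabackup-v2 SST method (the user name, password, and exact privilege list are illustrative; check the documentation for your version):

  # my.cnf on the DONOR (and ideally on every node), under [mysqld]:
  #   wsrep_sst_method = xtrabackup-v2
  #   wsrep_sst_auth   = sstuser:passw0rd

  # Create the matching account locally on the DONOR:
  mysql -uroot -p -e "
    CREATE USER 'sstuser'@'localhost' IDENTIFIED BY 'passw0rd';
    GRANT RELOAD, LOCK TABLES, PROCESS, REPLICATION CLIENT ON *.* TO 'sstuser'@'localhost';
    FLUSH PRIVILEGES;"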
#3 (Joiner log): version mismatch

  • Trying to get an old-version JOINER to join from a new-version DONOR is not supported. The opposite direction is naturally allowed.

#4 (Joiner and Donor logs): receive address not set

  WSREP_SST: [WARNING] wsrep_node_address or wsrep_sst_receive_address not set. Consider setting them if SST fails.
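
A minimal sketch of pinning these addresses on a multi-homed JOINER (the file path and IP are placeholders):

  # my.cnf on the JOINER, under [mysqld]:
  wsrep_node_address        = 192.168.70.63
  wsrep_sst_receive_address = 192.168.70.63:4444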

#5: Faulty SSL configuration

PXC recommends the same configuration on all nodes of the cluster.

  • An old DONOR with a new JOINER is OK (the reverse is not supported).
  • XB is an external tool and has its own set of controllable configuration (passed through the PXC my.cnf). Errors are often local to XB; check the XB log file, which can give a hint about the error.
  • The SST user should be present on the DONOR.
  • Look at both the DONOR and the JOINER logs.
  • wsrep_sst_receive_address/wsrep_node_address is needed.
  • Advanced encryption options must match: keyring on the DONOR but no keyring on the JOINER is not allowed.
  • Ensure a stable network link between DONOR and JOINER.
  • Check network rules (firewall, etc.): SST uses port 4444, IST uses port 4568.
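
A quick sketch of opening the relevant ports (assuming firewalld; these are the standard PXC/Galera ports):

  sudo firewall-cmd --permanent --add-port=3306/tcp   # MySQL clients
  sudo firewall-cmd --permanent --add-port=4567/tcp   # Galera group replication traffic
  sudo firewall-cmd --permanent --add-port=4568/tcp   # IST
  sudo firewall-cmd --permanent --add-port=4444/tcp   # SST
  sudo firewall-cmd --reload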


Scenario: Cluster doesn’t come up on restart

  • All your nodes are located in the same data center (DC).
  • The DC hits a power failure and all nodes are restarted.
  • On restart, the recovery flow is executed to recover the wsrep coordinates.

Cluster still fails to come up.

  • A close look at the log shows the original bootstrapping node has safe_to_bootstrap set to 0, so it refuses to come up.
  • The other nodes of the cluster are left dangling (in non-primary state) in the absence of the original cluster-forming node.

Galera/PXC expects the user to identify the node that has the latest data and then use that node to bootstrap; safe_to_bootstrap was added as that safety check.

Recovery steps:

  • Identify the node that has the latest data (look at the wsrep recovery coordinates).
  • Set safe_to_bootstrap to 1 in grastate.dat in that node's data directory.
  • Bootstrap that node.
  • Restart the other non-primary nodes (if they fail to auto-join).
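
A rough sketch of those steps (the data-directory path and systemd unit name are common PXC defaults and may differ on your install):

  # 1. On every node, check the recovered position
  cat /var/lib/mysql/grastate.dat        # uuid, seqno, safe_to_bootstrap
  # 2. On the node with the highest seqno, allow bootstrapping
  sudo sed -i 's/^safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat
  # 3. Bootstrap that node, then start the remaining nodes normally
  sudo systemctl start mysql@bootstrap.service
  sudo systemctl start mysql             # on each of the other nodes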

“I have the exact same setup but I never face this issue. My cluster auto-restores on power failure. Am I losing data or doing something wrong?”

  • That is because you bootstrapped your node using wsrep_cluster_address=<node-ip> with pc.recovery=true (the default).
  • The error is observed if you bootstrapped with wsrep_cluster_address="gcomm://", or with wsrep_cluster_address="<node-ips>" but pc.recovery=false.
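
A minimal sketch of the configuration that gives you this auto-recovery behavior (the IPs are placeholders):

  # my.cnf on every node: list the members instead of an empty gcomm://
  [mysqld]
  wsrep_cluster_address  = gcomm://192.168.70.61,192.168.70.62,192.168.70.63
  wsrep_provider_options = "pc.recovery=true"   # default; re-forms the primary component after a full outage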

PXC can auto-restart on DC failure, depending on the configuration options used.


Scenario: Data inconsistency

  • There are 2 kinds of inconsistencies:
    ○ Physical inconsistency: hardware issues
    ○ Logical inconsistency: data issues

Logical inconsistency is caused by cluster-unsafe, node-local operations like locks, RSU, wsrep_on=off, etc.

PXC has zero tolerance for inconsistency, so it immediately isolates a node on detecting an inconsistency.

  • When an inconsistency is detected on one node, that node is isolated (SHUTDOWN) while the rest of the cluster stays healthy and running.
  • If the inconsistency is detected on multiple nodes, those nodes shut down (their state is marked UNSAFE) and the remaining node(s) drop to non-primary, splitting the cluster into a majority group and a minority group.

Case 1: the minority group has the GOOD DATA.

Recovery:

  • If there are multiple nodes in the minority group, identify the node that has the latest data.
  • Set pc.bootstrap=1 on the selected node. A single-node cluster is formed.
  • Boot the other (majority-group) nodes; they will join through SST.
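
A minimal sketch of that bootstrap step, run on the chosen minority node:

  mysql -uroot -p -e "SET GLOBAL wsrep_provider_options = 'pc.bootstrap=1';"
  # Verify before restarting the others: status should be Primary with cluster size 1
  mysql -uroot -p -e "SHOW GLOBAL STATUS WHERE Variable_name IN
                      ('wsrep_cluster_status','wsrep_cluster_size');"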

CLUSTER RESTORED

Case 2: the same split, but now the majority group has the GOOD DATA.

Recovery:

  • The nodes in the majority group are already SHUTDOWN. Initiate a SHUTDOWN of the nodes from the minority group.
  • Fix grastate.dat for the nodes from the majority group (the inconsistency shutdown sequence has marked their state UNSAFE). A valid uuid can be copied over from a minority-group node.
  • Bootstrap the cluster using one of the nodes from the majority group and eventually get the other majority nodes to join.
  • Remove grastate.dat from the minority-group nodes and restart them to join the newly formed cluster.
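
A rough sketch of that sequence (paths and unit names are common PXC defaults; the uuid shown is a placeholder copied from a minority node's grastate.dat):

  # 1. Stop the remaining (minority) nodes
  sudo systemctl stop mysql
  # 2. On the majority-group node with good data, repair grastate.dat:
  #    restore the valid cluster uuid and mark the node bootstrappable, e.g.
  #      uuid:              5ee99582-bb8d-11e8-8c1f-1b0b4c2b4dd5   (copied from a minority node)
  #      safe_to_bootstrap: 1
  sudo vi /var/lib/mysql/grastate.dat
  # 3. Bootstrap from that node, then start the other majority nodes normally
  sudo systemctl start mysql@bootstrap.service
  # 4. On the minority nodes, drop the stale state so they rejoin cleanly
  sudo rm /var/lib/mysql/grastate.dat && sudo systemctl start mysql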

CLUSTER RESTORED


Scenario: Another aspect of data inconsistency

  • One of the nodes from the minority group tries to rejoin.
  • That node has transactions up to X; the rest of the cluster has transactions up to X - 1, because transaction X is the one that caused the inconsistency, so it never made it to those nodes.
  • Membership is rejected, as the newly joining node has one extra transaction compared to the cluster state.
  • Meanwhile, the 2-node cluster is up and has started processing transactions, moving the state of the cluster from X to X + 3.
  • Now the node gets membership, and it even joins through IST. How?
  • The node has transactions up to X, and the cluster says it has transactions up to X + 3. The joining logic doesn't evaluate the data; it all depends on the seqno.
  • The user failed to remove grastate.dat, and that is what caused all this confusion.

The result: the nodes now hold transactions with the same seqno (trx-seqno=x) but different updates. The cluster is restored, only to carry more inconsistency (which may be detected in the future).

Avoid running node-local operations. If the cluster enters an inconsistent state, carefully follow the step-by-step guide to recover (don't fear SST, it is for your own good).


Scenario: Delayed purging

  • The gcache is a staging area that holds replicated transactions.
  • A transaction is replicated and staged on every node; once all nodes have finished applying it, it can be removed from the gcache.
  • Each node, at a configured interval, notifies the other nodes/cluster about its transaction-committed status.
  • This is controlled by 2 conditions:
    ○ gcache.keep_pages_size and gcache.keep_pages_count
    ○ a static limit on the number of keys (1K), transactions (128), and bytes (128M).
  • Accordingly, each node evaluates the cluster-level lowest watermark and initiates a gcache purge.

  • Each node updates its local graph and evaluates the cluster purge watermark, e.g. N1_purged_upto: x+1, N2_purged_upto: x+1, N3_purged_upto: x.
  • Accordingly, all nodes will purge their local gcache up to X (cluster-purge-water-mark = X), and the gcache pages get created and purged.

From the log:

  New COMMIT CUT 2360 after 2360 from 1
  purging index up to 2360
  releasing seqno from gcache 2360
  Got commit cut from GCS: 2360

Regularly, each node communicates its committed-up-to watermark, and then, as per the protocol explained above, purging initiates.

Now suppose one node STOPS processing transactions (FTWRL, RSU, or any other action that causes the node to pause and desync), and transactions start to pile up in the gcache.

  • Given that this node is not making progress, it does not emit its transaction-committed status.
  • This freezes the cluster-purge-water-mark, as the lowest transaction continues to hold it down.
  • This means that, even though the other nodes are making progress, they will continue to pile up their galera cache.

Galera has protection against this: if the number of transactions continues to grow beyond some hard limits, it will force a purge.

From the log:

  trx map size: 16511 - check if status.last_committed is incrementing
  purging index up to 11264
  releasing seqno from gcache 11264

  • This is the in-built mechanism to force a purge: purging can get delayed, but it does not halt.
  • Purging means these entries are removed from the galera-maintained purge array. (Physical removal of the gcache.page.0000xx files is controlled by gcache.keep_pages_size and gcache.keep_pages_count.)

  • All nodes should have the same configuration.
  • Keep a close watch if you plan to run a backup or any other operation that can cause a node to halt.
  • Monitor that each node is making progress by keeping a watch on wsrep_last_applied/wsrep_last_committed.
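
A minimal sketch of such a watch, polling the relevant status counters on each node (credentials assumed to be in ~/.my.cnf; the interval is a placeholder):

  # wsrep_last_committed should keep increasing on a healthy, non-desynced node
  while true; do
    mysql -Nse "SHOW GLOBAL STATUS WHERE Variable_name IN
                ('wsrep_last_committed','wsrep_local_state_comment')"
    sleep 10
  done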


Scenario: Network latency and related failures

  • Why? What caused this weird behavior?
  • The cluster is neither completely down nor completely up. What's going on?
  • All my writes are going to a single node, yet I am still getting conflicts?

All nodes are able to reach each other.

  • If the link between 2 of the nodes is broken, packets can be relayed through a 3rd node that is reachable from both of the nodes.
  • Suppose the said node has a flaky network connection, or has higher latency.

Each node tracks the health of its peers using a set of timeouts (all runtime configurable):

  • Each node monitors the other nodes of the cluster every inactive_check_period (0.5 sec).
  • If a node is not reachable from a given node past peer_timeout (3 sec), the cluster enables relaying of messages.
  • If all nodes vote for the said node's inactivity (suspect_timeout, 5 sec), it is pronounced DEAD. While suspect_timeout needs consensus, inactive_timeout (15 sec) doesn't: if the node doesn't respond, it is marked DEAD.
  • If a node detects a delay in the response from a given node, it will try to add it to the delayed list. A node waits for delayed_margin (1 sec) before adding a node to the delayed list, and even if the node becomes active again, it takes delayed_keep_period (30 sec) to remove it from the list.
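
A minimal sketch of tuning the corresponding evs.* provider options for a higher-latency link (values are illustrative, not recommendations; Galera expresses these as ISO-8601 durations, and if your version rejects a dynamic change, set them in my.cnf under wsrep_provider_options instead):

  mysql -uroot -p -e "
    SET GLOBAL wsrep_provider_options =
      'evs.suspect_timeout=PT30S; evs.inactive_timeout=PT1M; evs.send_window=512; evs.user_send_window=256';"
  mysql -uroot -p -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options'\G"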

#1: Latency between n1 and n2 is < 1 ms; the links to n3 are 7 sec.

  • Start a sysbench workload. Given that the RTT between n1 and n3 is 7 sec, each trx needs 7 sec to complete, even though it gets an ACK from n2 in < 1 ms.
  • TPS hits 0 for 5 secs and then resumes. This is because a trx is waiting for an ACK from n3, which would take 7 sec, but in the meantime the suspect_timeout timer goes off and marks n3 as DEAD, so the workload resumes after 5 secs.
  • This temporarily makes the complete cluster unavailable. Unfortunately, the protocol design demands an ACK from the farthest node to ensure consistency. (Of course, a latency of 7 sec is not realistic.)

#2: The latency is reduced from 7 sec to 2 sec.

  • Because of this, every 2 sec (less than the 5 sec suspect_timeout) there was some communication between the nodes, and this prevented n3 from being marked as DEAD.
  • After 10 secs we reverted the latency back to its original value, so the snag is seen for 10 secs.

#3: All my writes are going to a single node, yet I am still getting conflicts?

  • Because when the view changes, the initial position is re-assigned, thereby purging history from the certification index. A follow-up transaction in certification that has a dependency on an old trx (one that got purged) faces this conflict.

  • The farthest node dictates how the cluster operates, so latency is important.
  • A geo-distributed cluster has millisecond-level latency, so the timeouts should be configured to avoid marking a node as UNSTABLE due to the added latency.
  • For a geo-distributed cluster, segments and the window settings are other parameters to configure.
  • Flaky nodes are not good for overall transaction processing (they can cause certification failures).


Scenario: Blocking Transaction and related failures

  • Symptom: failing to load a table with N rows.
  • Why?
    ○ Because PXC has a limit on how much data it can wrap in a write-set and replicate across the cluster.
    ○ The current limit allows a transaction of size 2G (controlled through wsrep_max_ws_size).

But have you ever wondered why that limitation exists?

On the originating node (N1) a transaction goes through: execute -> prepare -> replicate -> commit. On the other nodes (N2) it goes through: apply -> commit.

  • A transaction first executes on the local node. During this execution the transaction doesn't block other, non-dependent transactions.
  • The transaction replicates after it has been executed on the local node but before it is committed. Replication involves transporting the write-set (binlog) to the other nodes.
  • To maintain data consistency across the cluster, the protocol needs transactions to commit in the same order on all the nodes.
  • This means that even though the transactions following the largest transaction are non-dependent and have completed their APPLY action before the largest transaction, they can't commit.
  • The bigger the transaction, the bigger the backlog of small transactions; this eventually causes FLOW_CONTROL.

The first snag appears when the originating node blocks all resources to replicate a long-running transaction. The second snag appears when the replicating nodes emit flow control.

  • PXC doesn't like long-running transactions.
  • To load data, use LOAD DATA INFILE, which causes an intermediate commit every 10K rows. Note: a random failure can cause partial data to get committed.
  • DDL can block/stall the complete cluster workload, as it needs to execute in total isolation. (The alternative is to use RSU, but be careful, as it is an operation local to the node.)
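
A minimal sketch of the relevant knobs and the bulk-load approach (the database, table, and file names are placeholders; wsrep_load_data_splitting is the variable behind the 10K-row intermediate commits and may be deprecated in newer versions):

  # Check the write-set size limit and the LOAD DATA splitting behavior
  mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN
            ('wsrep_max_ws_size','wsrep_load_data_splitting')"

  # Bulk-load through LOAD DATA INFILE instead of one huge multi-row INSERT
  mysql mydb -e "LOAD DATA INFILE '/var/lib/mysql-files/rows.csv'
                 INTO TABLE t1 FIELDS TERMINATED BY ','"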

One last important note

  • The majority of errors are due to misconfiguration or differences in configuration between nodes.
  • PXC recommends the same configuration on all nodes of the cluster.


PXC Genie: You Wish. We implement


  • We would like to hear from you: what do you want next in PXC?
  • Any specific module where you expect improvements?
  • How can Percona help you with PXC or HA?
  • Log issues (mark them as new improvements): https://jira.percona.com/projects/PXC/issue
  • The PXC forum is another way to reach us.

Questions and Answers


Thank You Sponsors!!