Reducing Risk When Upgrading Your MySQL Environment Kenny Gryp - - PowerPoint PPT Presentation

reducing risk when upgrading your mysql environment
SMART_READER_LITE
LIVE PREVIEW

Reducing Risk When Upgrading Your MySQL Environment Kenny Gryp - - PowerPoint PPT Presentation

Reducing Risk When Upgrading Your MySQL Environment Kenny Gryp MySQL Practice Manager My Experience as MySQL Consultant On Upgrading MySQL it's quite complex... Kenny Gryp MySQL Practice Manager Table of Contents The Ocial Documentation


slide-1
SLIDE 1

Reducing Risk When Upgrading Your MySQL Environment

Kenny Gryp MySQL Practice Manager

slide-2
SLIDE 2

My Experience as MySQL Consultant On Upgrading MySQL

it's quite complex...

Kenny Gryp MySQL Practice Manager

slide-3
SLIDE 3

Table of Contents

The Ocial Documentation Make Your Own Documentation Potential Risks Establish Upgrade Method For A Single Server Rollback Scenario Testing Test Writes Test Individual Reads Workload Testing Establish (& Test) Migration Process Migration In Production (Rollback) Post-Migration Assessment

3 / 77

slide-4
SLIDE 4

The Ocial Documentation

4 / 77

slide-5
SLIDE 5

Oracle's Recommended Process

Backup your data Read all release notes and assess https://dev.mysql.com/doc/relnotes/mysql/5.7/en/ Read Changes Affecting Upgrades to MySQL 5.7 https://dev.mysql.com/doc/refman/5.7/en/upgrading-from- previous-series.html

5 / 77

slide-6
SLIDE 6

Oracle's Recommended Process

Upgrade Slaves First

6 / 77

slide-7
SLIDE 7

Oracle's Recommended Process

Upgrade Slaves First In-Place Upgrade: Clean shutdown (innodb_fast_shutdown=0) Run mysql_upgrade Logical Upgrade: mysqldump data Import data again Run mysql_upgrade to x mysql schema

http://dev.mysql.com/doc/refman/5.7/en/upgrading.html

7 / 77

slide-8
SLIDE 8

Oracle's Recommended Process (cont.)

A Lot of Risk: No guarantee queries will execute the same No guarantee queries will be same speed or faster No guarantee all your queries will still work (new default stricter sql_mode) There is no ocial support to upgrade from <5.6 to 5.7 but we might actually be able to do that

8 / 77

slide-9
SLIDE 9

do-it-yourself

Documenting The Process

9 / 77

slide-10
SLIDE 10

Documenting The Process

PEBKAC: Human errors happen and create issues import data using wrong character set setting up replica using wrong binlog le/pos ... Document every step, we need to repeat it multiple times

10 / 77

slide-11
SLIDE 11

making you afraid to upgrade by describing

Potential Risks

11 / 77

slide-12
SLIDE 12

Optimizer Changes

Example: index_merge_intersection Often seen during migrations to MySQL 5.6 Affects environments with sub-optimal indexing Queries with c1='a' AND c2='b' when composite index (c1,c2) is missing Is often slower when selectivity with 1 of the 2 columns is bad (and it happens frequently) Result: a lot of queries were slower in new environment Need SELECT performance tests between versions

https://www.percona.com/blog/2012/12/14/the-optimization-that-often-isnt-index-merge-intersection/

12 / 77

slide-13
SLIDE 13

New Defaults In MySQL 5.7

The new defaults in MySQL 5.7 make a lot of sense: More use of available features and performance enhancements out of the box More strictness with data/query validation New Reserved words Applications might not be ready for it. Drupal 7 - https://www.drupal.org/node/2545480 They will/might break the application more easily: sql_mode=ONLY_FULL_GROUP_BY, STRICT_TRANS_TABLES, NO_ZERO_IN_DATE, NO_ZERO_DATE, ERROR_FOR_DIVISION_BY_ZERO, NO_AUTO_CREATE_USER, NO_ENGINE_SUBSTITUTION innodb_strict_mode=1 Needs SELECT & DML query validity tests between versions

13 / 77

slide-14
SLIDE 14

Other Changes in MySQL 5.7

Passwords that use the older pre-4.1 password hashing format is removed.

14 / 77

slide-15
SLIDE 15

MySQL 5.0.37 +-------+ | 0 | +-------+ MySQL 5.0.45 +-------+ | 1 | +-------+

Minor Versions Also At Risk

CREATE TABLE date (d DATE); INSERT INTO date VALUES ('2017-04-19'); SELECT COUNT(*) FROM date WHERE d < NOW()-INTERVAL 1 DAY; Seen with DELETE FROM date WHERE d < NOW()- INTERVAL 1 DAY in binlog_format=STATEMENT environments. Needs SELECT & DML query result tests between versions

15 / 77

slide-16
SLIDE 16

Workload

SYNC_BINLOG=1 in MySQL 5.7 Can impact certain environments, might not be noticed when looking at a single query InnoDB LRU Flushing changes require tuning for heavy workloads in 5.6 (innodb_lru_scan_depth) When switching to MySQL 8.0 with the new data dictionary ... Need to do Workload Testing between versions

http://mysqlentomologist.blogspot.com/2015/10/fun-with-bugs-38-regression-bugs-in.html http://lefred.be/content/sync_binlog-1-in-5-7/

16 / 77

slide-17
SLIDE 17

How Do We Reduce All This Risk?

17 / 77

slide-18
SLIDE 18

Testing!

18 / 77

slide-19
SLIDE 19

establish

Upgrade Method For A Single Server

19 / 77

slide-20
SLIDE 20

Upgrade Method For A Single Server

Follow MySQL documentation: http://dev.mysql.com/doc/refman/5.7/en/upgrading.html Ensure to document every command Restore from backup Or take a replica you can miss

20 / 77

slide-21
SLIDE 21

Upgrade Method For A Single Server

21 / 77

slide-22
SLIDE 22

Upgrade Method For A Single Server

22 / 77

slide-23
SLIDE 23

Replication Consistency

Testing Writes

23 / 77

slide-24
SLIDE 24

Writes - Replication Consistency

pt-table-checksum: validate consistency in a replication topology Identify problems caused by PEBKAC Ensure events replicate properly (binlog_format=STATEMENT) Upgrade a replica or add a replica which is using the modied version. Do it on production, will have no result in test/staging

https://www.percona.com/doc/percona-toolkit/3.0/pt-table-checksum.html

24 / 77

slide-25
SLIDE 25

Writes - Replicate Test Server

25 / 77

slide-26
SLIDE 26
  • ften left behind is

Rollback Scenario Testing

26 / 77

slide-27
SLIDE 27

Rollback Scenario Testing

Possibility to fall back in case something went wrong during migration Can be done using replication, but has to be tested!

27 / 77

slide-28
SLIDE 28

Writes - Rollback Testing

28 / 77

slide-29
SLIDE 29

Rollback Scenario Testing

You might need to change some settings to your new my.cnf to be able to support replicating back. Example: binlog_checksum = NONE binlog_row_image = FULL binlog_rows_query_log_events = OFF log_bin_use_v1_row_events = 1 gtid_mode = OFF log_slave_updates=1 skip-slave-start

29 / 77

slide-30
SLIDE 30

Writes - Checksums - GTID

30 / 77

slide-31
SLIDE 31

Writes - Checksums - Non-GTID

31 / 77

slide-32
SLIDE 32

Writes - Checkums - ROW

32 / 77

slide-33
SLIDE 33

Writes - Checkums - ROW

33 / 77

slide-34
SLIDE 34

Where To Run pt-table-checksum?

GTID: pt-table-checksum can only be run on Master (Errant Transactions) Or scratch the pt-table-checksum host after tests non-GTID: pt-table-checksum can be run on intermediate master binlog_format=ROW:

  • nly 1 tier below can be checksummed

run on every tier that has a replica (for rollback) pt-table-checksum can bring prod overhead when run on active master Let replication run for a while before checksumming

34 / 77

slide-35
SLIDE 35

pt-table-checksum results

On every replica (including rollback): SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks FROM percona.checksum WHERE (master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)) GROUP BY db, tbl; +----+-----------------+------------+--------+ | db | tbl | total_rows | chunks | +----+-----------------+------------+--------+ | db | telephone_debit | 44342 | 1 | | db | orderline | 21451 | 3 | | db | orders | 25125215 | 12 | +----+-----------------+------------+--------+

35 / 77

slide-36
SLIDE 36

pt-table-checksum - Analysis

Troubleshooting starts now... What went wrong?

36 / 77

slide-37
SLIDE 37

pt-table-checksum - Analysis

Which chunks failed? db: db tbl: telephone_debit chunk: 100 chunk_time: 0.4956125 chunk_index: PRIMARY lower_boundary: 5014733 upper_boundary: 5059074 this_crc: 7fd37eb9 this_cnt: 44342 master_crc: b7babd94 master_cnt: 44342 ts: 2013-02-05 01:59:48

37 / 77

slide-38
SLIDE 38

pt-table-checksum - Analysis

Which chunks failed? db: db tbl: telephone_debit chunk: 100 chunk_time: 0.4956125 chunk_index: PRIMARY lower_boundary: 5014733 upper_boundary: 5059074 this_crc: 7fd37eb9 this_cnt: 44342 master_crc: b7babd94 master_cnt: 44342 ts: 2013-02-05 01:59:48

38 / 77

slide-39
SLIDE 39

pt-table-checksum - Analysis

SELECT * INTO outfile '/tmp/telephone_debit_mysql56' FROM db.telephone_debit WHERE id BETWEEN 5014733 AND 5059074; SELECT * INTO outfile '/tmp/telephone_debit_mysql57' FROM db.telephone_debit WHERE id BETWEEN 5014733 AND 5059074; # diff -u /tmp/telephone_debit_mysql5{6,7}

39 / 77

slide-40
SLIDE 40

pt-table-checksum - Analysis

SELECT * INTO outfile '/tmp/telephone_debit_mysql56' FROM db.telephone_debit WHERE id BETWEEN 5014733 AND 5059074; SELECT * INTO outfile '/tmp/telephone_debit_mysql57' FROM db.telephone_debit WHERE id BETWEEN 5014733 AND 5059074; # diff -u /tmp/telephone_debit_mysql5{6,7} Use twindb_table_compare! https://github.com/twindb/twindb_table_compare

40 / 77

slide-41
SLIDE 41

pt-table-checksum - Analysis

Wrong upgrade method backups wrong replication le/pos ... binlog_format=STATEMENT using (UUID()...) Common Seen Issues replicating older versions: Floating point differences: Storing currencies in a DOUBLE Temporal data types Invalid dates converted to zero dates Trailing spaces in CHAR elds

41 / 77

slide-42
SLIDE 42

Testing Writes

Consistency Checks Process: Checksum Check for differences On new environment On rollback environment For each inconsistency Analyze diff Find root cause Fix problem Document problem & solution Repeat checksum again

42 / 77

slide-43
SLIDE 43

Testing Individual Reads

43 / 77

slide-44
SLIDE 44

Testing Reads - Collect Queries

44 / 77

slide-45
SLIDE 45

Testing Reads - Collect Queries

Collection Techniques: Slow Query Log long_query_time=0 Careful when ~+10000 QPS Percona Server: log_slow_rate_limit tcpdump 'packets lost' in libpcap Application/Load Balancer queries Ensure: Get the full workload (long enough) Get data from Master & Replicas Collect batchjob queries running at night

https://www.percona.com/doc/percona-server/5.7/diagnostics/slow_extended.html

45 / 77

slide-46
SLIDE 46

Testing Reads - Setup 2 Environments

46 / 77

slide-47
SLIDE 47

Testing Reads - Setup 2 Environments

Need 2 Test Servers: Reuse servers from checksum + rollback Ensure they have the same data (break replication at same time) Same HW specications Similar Congurations on buffer pool, flatc... Fast enough to more or less resemble production Optionally can be done using 1 machine (pt-upgrade --save-results)

47 / 77

slide-48
SLIDE 48

Testing Reads - pt-upgrade

48 / 77

slide-49
SLIDE 49

Testing Reads - pt-upgrade

pt-upgrade: runs one query at a time on both test environments compares differences: warnings/errors resultset (even different order) query response time Run pt-upgrade on third host with similar network latency Run twice to warm up buffer pool rst (need to be equal) Can also compare writes for execution time & warnings Filter slowlog initially to limit similar queries pt-query-digest

  • -no-report --output slowlog --samples 20

https://www.percona.com/doc/percona-toolkit/3.0/pt-upgrade.html

49 / 77

slide-50
SLIDE 50

Testing Reads - pt-upgrade

Reporting class because there are 1000 row diffs. Total queries 10 Unique queries 10 Discarded queries 0 select ... from ... ## ## Row diffs: 10 ##

  • - 1.

@ row 2 < 13178,"dim0",37,2,21,,,0,0,0,1,NULL,NULL > 13178,"dimø",37,2,21,,,0,0,0,1,NULL,NULL ...

50 / 77

slide-51
SLIDE 51

Testing Reads - pt-upgrade

Reporting class because it has diffs, but hasn't been reported yet. SELECT * FROM `database`.table WHERE treeid = '' AND productid='0' ## Warning diffs: 2 Code: 1366 Level: Warning Message: Incorrect integer value: '' for column 'treeid' at row 1 vs. No warning 1366

51 / 77

slide-52
SLIDE 52

Testing Reads - pt-upgrade

SELECT * FROM `database`.client_orders WHERE client=? AND blacklist=? LIMIT ? ## Query time diffs: 1

  • - 1.

0.000513 vs. 0.036395 seconds (70.9x increase) SELECT * FROM `database`.client_orders WHERE client=57450 AND blacklist=1 LIMIT 1

52 / 77

slide-53
SLIDE 53

Testing Reads Process

Collect queries Run pt-upgrade (twice) For each entry in report Figure out why it is reported Deploy x in Prod Application Make schema changes Document analysis Run pt-upgrade again

53 / 77

slide-54
SLIDE 54
  • ne of the most challenging is

Testing Workload

54 / 77

slide-55
SLIDE 55

Workload Testing - Percona Playback

55 / 77

slide-56
SLIDE 56

Workload Testing - Query Playback

Uses slowlog to replay queries Needs long_query_time=0 - challenging on busy servers Enough data during peak workload Tries to execute workload as realistically as possible same connections, same transactions, same delays between queries Run against both environments, compare speed Think about preloading buffer on both the same way Active development by Marius Wachtler (ex)-DropBox! Thank you! (unocal product of Percona, no support)

56 / 77

slide-57
SLIDE 57

Workload Testing - ProxySQL Mirroring

57 / 77

slide-58
SLIDE 58

Workload Testing - ProxySQL Mirroring

Mirror queries from Load Balancer to test environment Good Blogpost: https://www.pythian.com/blog/using-proxysql-validate-mysql- updates/

58 / 77

slide-59
SLIDE 59

establish (& test)

Migration Process

59 / 77

slide-60
SLIDE 60

Migration Process

Create Migration Plan Different for every environment/application Upgrade a replica rst for a couple of days/weeks? How to switch masters? How is failover being handled nowadays? MHA, Orchestrator, Manual, GTID/msyqlrpladmin...? Test in staging!

60 / 77

slide-61
SLIDE 61

the actual

Migration In Production

61 / 77

slide-62
SLIDE 62

Migration - Create Slave Environments

62 / 77

slide-63
SLIDE 63

Migration - Redirect Read Trac

63 / 77

slide-64
SLIDE 64

Migration - Application Switchover - 1

64 / 77

slide-65
SLIDE 65

Migration - Application Switchover - 2

65 / 77

slide-66
SLIDE 66

you (think you) will never need to do a

Rollback

66 / 77

slide-67
SLIDE 67

Rollback

67 / 77

slide-68
SLIDE 68

Rollback

What went wrong? I did not follow the full process! (or I forgot to document it) Do consistency checks again!

68 / 77

slide-69
SLIDE 69

after all that testing, it's ok to spend time doing

Post-Migration Assessment

69 / 77

slide-70
SLIDE 70

Post-Migration

Check trending for different behavior more cpu load? more disk IO? higher amount of innodb_rows_* and handler_* threads_running stability? do some query optimization If all looks good, scratch the 5.6 rollback & make it 5.7 Remove the rollback specic conguration options

70 / 77

slide-71
SLIDE 71

Post Migration Cleanup

71 / 77

slide-72
SLIDE 72

small recap

Summary

72 / 77

slide-73
SLIDE 73

Multi-Use

(Minor MySQL version upgrades) Major MySQL version upgrades Switching Hardware from Intel -> AMD archicture Using a new kernel/libc/memory allocator Switching storage engines MariaDB/Percona Server/MySQL ...

73 / 77

slide-74
SLIDE 74

Do I really have to go through this?

Many success stories: Have done several MySQL upgrades from 4.1 -> 5.5 without intermediate slaves Upgraded environments with major schema changes in the mix (mssql-style environments using stored procedures only) Found numerous application bugs using this process Optimized many customers schemas/queries in the meantime As long as you follow this process completely, the risk of running into problems is quite small.

74 / 77

slide-75
SLIDE 75

Do I really have to go through this?

It Depends: Your business might be risk-averse: every change has to be thoroughly tested Other companies just upgrade a replica in production and see how it goes My suggestion to do this at least for: Major MySQL version upgrades Switching storage engines

75 / 77

slide-76
SLIDE 76

Summary

Test Step Skip? Document Upgrade Single Server Really? Why? Rollback Scenarios Not Recommended Consistency Checks Required, No Debate! Read Tests Strongly Suggested Workload Tests Possible (Early Adopter Alert) Migration Tests Not Recommended To Skip

76 / 77

slide-77
SLIDE 77

Reducing Risk When Upgrading Your MySQL Environment

Q&A!

Kenny Gryp MySQL Practice Manager