SLIDE 1

Performance Guide for MySQL Cluster

Mikael Ronström, Ph.D, Senior MySQL Architect, Sun Microsystems

SLIDE 2

MySQL Cluster

[Architecture diagram: applications access the cluster either through the MySQL Client API to a MySQL Server, or directly through the NDB API; both MySQL Servers and NDB API applications communicate over the cluster interconnect with the NDB Kernel data nodes.]

SLIDE 3

Aspects of Performance

  • Response times
  • Throughput
  • Low variation of response times

SLIDE 4

Experience Base

  • DBT2 (similar to TPC-C) using SQL
  • DBT2 using NDB API
  • TPC-W
  • Prototyping efforts with customers in the area of real-time systems
  • Loads of benchmarks executed using the NDB API

SLIDE 5

Possible Areas for Improving Performance

  • Use of the low-level API (NDB API)
  • Use of new features in MySQL Cluster Carrier Grade Edition version 6.3 (currently at version 6.3.13)
  • Ensure proper partitioning of your tables
  • Use of HW
  • Use of features in MySQL Cluster 5.0

SLIDE 6

Use of low-level NDB API for Application Programming

  • NDB API is a C++ record access API
  • Supports sending parallel record operations within the same transaction or in different transactions (see the sketch below)
  • Two modes, synchronous/asynchronous
  • Hints to select transaction coordinator
  • Simple interpreter for filters and simple additions/subtractions
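
A minimal sketch of a synchronous NDB API insert, assuming a cluster reachable through the default connectstring and a table t(pk INT PRIMARY KEY, a INT) in database "test"; the names and the sparse error handling are illustrative, not taken from the slides:

    #include <NdbApi.hpp>
    #include <cstdio>

    int main()
    {
      ndb_init();                                  // initialise the NDB API library
      Ndb_cluster_connection conn;                 // default connectstring (localhost:1186)
      if (conn.connect(4, 5, 1) != 0 ||            // retry a few times, then give up
          conn.wait_until_ready(30, 0) != 0)
        return 1;

      Ndb ndb(&conn, "test");                      // database name is an assumption
      ndb.init();

      NdbTransaction *trans = ndb.startTransaction();
      NdbOperation *op = trans->getNdbOperation("t");
      op->insertTuple();                           // record-level insert, no SQL involved
      op->equal("pk", 1);                          // primary key value
      op->setValue("a", 42);                       // non-key column

      int res = trans->execute(NdbTransaction::Commit);  // one round trip to the data nodes
      printf("insert %s\n", res == 0 ? "ok" : trans->getNdbError().message);

      ndb.closeTransaction(trans);
      ndb_end(0);
      return 0;
    }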

SLIDE 7

Looking at performance

[Diagram: message flows between the application and the NDB Kernel (database nodes) for three cases]

  • Five synchronous insert transactions: 10 x TCP/IP time
  • Five inserts in one synchronous transaction: 2 x TCP/IP time (see the sketch below)
  • Five asynchronous insert transactions: 2 x TCP/IP time
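
As a sketch of the middle case (five inserts in one synchronous transaction), several operations can be defined on the same transaction before a single execute(), so all five inserts share one round trip; this reuses the Ndb object and the table assumptions from the earlier sketch:

    NdbTransaction *trans = ndb.startTransaction();
    for (int i = 0; i < 5; i++)
    {
      NdbOperation *op = trans->getNdbOperation("t");
      op->insertTuple();
      op->equal("pk", i);          // five different primary keys
      op->setValue("a", i * 10);
    }
    // All five inserts are sent and committed as one batch:
    // roughly one request/response exchange instead of five.
    trans->execute(NdbTransaction::Commit);
    ndb.closeTransaction(trans);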

SLIDE 8

Example of prototyping using NDB API

  • Step 1: Develop a prototype using the MySQL C API
    => Performance: X, Response time: Y
  • Step 2: Develop the same functionality using the synchronous NDB API
    => Performance: 3X, Response time: ~0.5Y
  • Step 3: Develop the same functionality using the asynchronous NDB API (see the sketch below)
    => Performance: 6X, Response time: ~0.25Y
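
A minimal sketch of the asynchronous variant, assuming the same connection, table and column names as above; executeAsynchPrepare() queues a transaction without sending it, and sendPollNdb() then sends all prepared transactions in parallel and waits for their completion callbacks:

    // Completion callback, invoked from sendPollNdb() when a transaction finishes.
    static void insert_done(int result, NdbTransaction *trans, void *arg)
    {
      if (result != 0)
        printf("insert failed: %s\n", trans->getNdbError().message);
    }

    void five_async_inserts(Ndb &ndb)
    {
      NdbTransaction *trans[5];
      for (int i = 0; i < 5; i++)                  // five independent transactions
      {
        trans[i] = ndb.startTransaction();
        NdbOperation *op = trans[i]->getNdbOperation("t");
        op->insertTuple();
        op->equal("pk", i);
        op->setValue("a", i * 10);
        // Prepare for asynchronous execution; nothing is sent yet.
        trans[i]->executeAsynchPrepare(NdbTransaction::Commit, insert_done, NULL);
      }
      // Send all five transactions and wait until they have completed.
      ndb.sendPollNdb(3000, 5);
      for (int i = 0; i < 5; i++)
        ndb.closeTransaction(trans[i]);
    }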

SLIDE 9

Conclusion on when to use NDB API

  • When performance is critical
  • When real-time response time is critical
  • When scalability of the application is important (in terms of threads, application nodes, data nodes)

SLIDE 10

Conclusion on when NOT to use NDB API

  • When design time is critical
  • When use of standard API’s is critical
  • For complex queries where it makes sense to let the MySQL optimiser handle writing the query plan

SLIDE 11

Use of new features in MySQL Cluster Carrier Grade Edition version 6.3.13

  • Polling based communication
  • Epoll replacing select system call (Linux)
  • Send buffer gathering
  • Real-time scheduler for threads
  • Lock threads to CPU
  • Distribution Awareness
  • Avoid read before Update/Delete with PK

SLIDE 12

Polling-based communication

  • Avoids wake-up delay in conjunction with new messages
  • Avoids interrupt delay for new messages
  • Drawback: CPU is used heavily even at lower throughput
  • Significant response time improvement
  • If used in combination with real-time scheduling, also gives very reliable response times (e.g. 100% within 3 millisecond response time at fairly high load)

SLIDE 13

Interrupt Handling in Dolphin SuperSockets

  • Dolphin HW has checksums integrated: no interrupt processing required to process the network protocol
  • Interrupt processing only required to wake a sleeping process waiting for events on the Dolphin SuperSockets socket

SLIDE 14

Socket Interface to Interrupts

  • Interrupts enabled when no data is available in a select/poll call whose timeout is > 0
  • Interrupts enabled after a blocking receive call with no data available
  • Otherwise interrupts are disabled

=> No interrupts occur when using polling-based communication

SLIDE 15

Polling-based communication Benchmark Results

  • Improves performance when the CPU isn’t limited
  • Decreases performance when the CPU is the limiting factor (e.g. 1 data node per core)
  • 10% performance improvement on 2-, 4- and 8-data-node clusters using DBT2
  • 20% improvement using Dolphin Express

all dump 506 200
(spin for 200 microseconds before going to sleep; will call select(0)/epoll_wait(0) while spinning)

SLIDE 16

Epoll replacing select system call

  • Decreases overhead of the select system call in large clusters (see the sketch below)
  • Increases interrupt overhead of the Intel e1000 Ethernet driver
  • Improved performance by 20% on 32-node clusters
  • Improved performance by up to 10-15% also on smaller clusters where the CPU wasn’t a bottleneck (together with polling mode, a 20% improvement)
  • Slight decrease in performance on CPU-limited configurations (= 1 data node per CPU)
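
For context, a minimal sketch of the two Linux system call patterns (plain socket API, not MySQL Cluster code): with select() the interest set has to be rebuilt and scanned on every call, while with epoll the kernel keeps the interest list and only ready descriptors are returned, which is what lowers the per-call overhead in large clusters:

    #include <sys/epoll.h>
    #include <sys/select.h>
    #include <stddef.h>

    // select(): fd set rebuilt and scanned linearly on every call.
    int wait_with_select(int *fds, int nfds)
    {
      fd_set readfds;
      FD_ZERO(&readfds);
      int maxfd = -1;
      for (int i = 0; i < nfds; i++) {
        FD_SET(fds[i], &readfds);
        if (fds[i] > maxfd) maxfd = fds[i];
      }
      return select(maxfd + 1, &readfds, NULL, NULL, NULL);  // block until data arrives
    }

    // epoll: interest list registered once, epoll_wait() returns only ready fds.
    int wait_with_epoll(int *fds, int nfds)
    {
      int epfd = epoll_create(nfds);
      struct epoll_event ev, events[64];
      for (int i = 0; i < nfds; i++) {
        ev.events = EPOLLIN;
        ev.data.fd = fds[i];
        epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev);
      }
      return epoll_wait(epfd, events, 64, -1);               // -1 = block until ready
    }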

SLIDE 17

Extra Round of Execution before Sending Messages

  • Adapts the NDB scheduler to receive another round of messages and execute them before sending messages
  • Larger size of messages sent
    => Increases throughput, increases response time

all dump 502 50
(set all nodes to continue executing until 50 microseconds have passed)

SLIDE 18

Setting Threads to Real-time

  • Use real-time scheduling in the NDB kernel
  • Maintenance threads at higher priority
  • Main thread at lower priority
    => Avoids decreased priority at high loads, decreases response time

3 dump 503 1
(set the node 3 process to real-time priority)

SLIDE 19

Locking Threads to CPUs

  • Lock maintenance threads (connection threads, watchdog, file system threads) to a CPU
  • Lock the main thread to a CPU
    => No cache thrashing due to threads moving between CPUs
  • Interacts with real-time priority and the new scheduler in NDB, with the main thread owning its CPU (see the config sketch below)

2 dump 505 1
(locks maintenance threads on node 2 to CPU 1)
2 dump 504 0
(locks main thread on node 2 to CPU 0)
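
The DUMP codes above take effect at runtime. The same behaviour can also be set from the cluster configuration; a sketch, assuming the CGE 6.3-era config.ini parameters RealtimeScheduler, SchedulerSpinTimer, SchedulerExecutionTimer, LockExecuteThreadToCPU and LockMaintThreadsToCPU (verify the names against the documentation of your exact version):

    [ndbd]
    # Run the data node threads with real-time priority (cf. DUMP 503)
    RealtimeScheduler=1
    # Spin for 200 microseconds before going to sleep (cf. DUMP 506 200)
    SchedulerSpinTimer=200
    # Keep executing up to 50 microseconds before sending messages (cf. DUMP 502 50)
    SchedulerExecutionTimer=50
    # Lock the main (execution) thread to CPU 0 (cf. DUMP 504 0)
    LockExecuteThreadToCPU=0
    # Lock the maintenance threads to CPU 1 (cf. DUMP 505 1)
    LockMaintThreadsToCPU=1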

SLIDE 20

MySQL Cluster RT solution on Dual Core

[Diagram: CPU 0 runs the connection threads, watchdog thread, file system threads and the rest; CPU 1 is dedicated to the main thread and SuperSocket read/write.]

SLIDE 21

MySQL Cluster RT solution on a Quad-Core computer using 4 data nodes: CPU-optimized architecture using Dolphin SuperSockets and polling-based communication

[Diagram: each of CPU 0-3 hosts one complete data node: its connection threads, watchdog thread and file system threads together with its main thread and SuperSockets read/write.]

SLIDE 22

MySQL Cluster RT solution on a Quad-Core computer using 3 data nodes: CPU-optimized architecture using Ethernet

[Diagram: CPU 0 handles interrupt handling plus the connection threads, watchdog thread and file system threads; CPUs 1-3 each run one data node main thread with its Ethernet send/receive.]

SLIDE 23

MySQL Cluster RT solution on an Eight-Core computer using 6 data nodes: CPU-optimized architecture using Ethernet

[Diagram: six of the eight cores each run one data node main thread with its Ethernet send/receive; the remaining cores handle interrupt handling plus the connection threads, watchdog thread and file system threads.]

SLIDE 24

Old “truths” revisited

  • The previous recommendation was to run 1 data node per computer
  • This was due to bugs in multi-node failure handling
  • This recommendation was dropped more than a year ago
  • The quality of running multiple nodes per computer is good now

SLIDE 25

Distribution Awareness

  • Start the transaction coordinator on the node that the first query of the transaction is using (see the sketch below)
  • E.g. SELECT * FROM t WHERE pk = x
    => Map x to a partition; the partition is then mapped to the node containing the primary replica of the record
  • 100-200% improvement when the application is distribution aware
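
On the NDB API side, the placement of the transaction coordinator can be hinted when the transaction is started. A sketch, assuming the startTransaction() variant that takes the table and primary key data as a hint, and the same table t as in the earlier sketches:

    // Fetch the table definition from the dictionary (needed for the hint).
    const NdbDictionary::Table *tab = ndb.getDictionary()->getTable("t");

    int pk = 12;   // primary key of the first record the transaction touches
    // Ask for the transaction coordinator on the node holding the primary
    // replica of pk = 12, so most of the work stays within one node group.
    NdbTransaction *trans =
        ndb.startTransaction(tab, (const char *)&pk, sizeof(pk));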

SLIDE 26

Remove read before PK update

  • UPDATE t SET a = const1 WHERE pk = x;
  • No need to do a read before the UPDATE; all data is already known
  • ~10% improvement on DBT2
  • Applies to DELETE as well

SLIDE 27

Ensure Proper Partitioning of Data Model

  • Proper partitioning is important to ensure transaction execution is as localised to one node group as possible (works together with Distribution Awareness)
  • Transactions spanning several node groups mean much more communication

SLIDE 28

Synchronous Replication: Low failover time

[Diagram: the transaction coordinator (TC) sends Prepare to the primary replica, which forwards it through the backup replicas; Commit follows the same path (1. Prepare F1, 1. Prepare F2, 2. Commit F1, 2. Commit F2).]

Example showing a transaction with two operations using three replicas:

messages = 2 x operations x (replicas + 1)

i.e. 2 x 2 x (3 + 1) = 16 messages for this transaction.

SLIDE 29

Partitioning in DBT2 almost entirely on Warehouse ID

  • Partitioning on the primary key makes all transactions fully distributed over the entire cluster
  • PARTITION BY KEY (warehouse_id)
  • PARTITION BY HASH (warehouse_id)
    => Gives more or less perfect partitioning

SLIDE 30

Other Partitioning tricks

  • If there is a table that has a lot of index scans (not primary key) on it, partitioning this table to be in only one node group can be a good idea. Partition syntax for this:
    PARTITION BY KEY (id) (PARTITION p0 NODEGROUP 0);

SLIDE 31

Use of features in MySQL Cluster version 5.0

  • Lock Memory
  • Batching of IN (..) primary key access
  • INSERT batching
  • Condition pushdown (faster table scans)

SLIDE 32

Lock Memory in Main Memory

  • Ensure no swapping occurs in the NDB kernel (see the sketch below)
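
A one-line config.ini sketch, assuming the LockPagesInMainMemory data node parameter is available in your version; it locks the data node process in memory so the operating system cannot swap it out:

    [ndbd]
    # Lock the data node process in main memory so it is never swapped out
    LockPagesInMainMemory=1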

SLIDE 33

Batching IN (…) with primary keys

  • 100 x SELECT * FROM t WHERE pk = x;
  • SELECT * FROM t WHERE pk IN (x1, ..., x100);
  • The IN statement is around 10x faster than 100 single-record PK SELECTs

SLIDE 34

Use of multi-INSERT

  • 100 x INSERT INTO t VALUES (x);
  • INSERT INTO t VALUES (x1),(x2),...,(x100);
  • Multi-insert is up to about 10x faster

SLIDE 35

Use of features in MySQL Cluster CGE version 6.4

  • Multi-threaded data nodes
    Currently no benefit using DBT2; has been shown to increase throughput by 40% for some NDB API benchmarks

SLIDE 36

Use of HW, CPU choice

  • Pentium D @ 2.8GHz -> Core 2 Duo @ 2.8GHz => 75% improvement
  • Doubling the L2 cache size seems to double the thread scalability of MySQL Cluster (experience using DBT2)
  • Multi-core CPUs can be used, but require multiple data nodes per server

SLIDE 37

Use of HW, Interconnect choice

  • The choice of the Dolphin Express interconnect has been shown to increase throughput by between 10% and 400%, depending on the use case
  • Response time improvements from 20% to 700% have been seen

SLIDE 38

Dolphin SuperSockets

  • Implementation of the Socket API using Dolphin Express Interconnect HW
  • Latency of a socket-layer ping-pong down to a few microseconds
  • High-Availability features integrated
  • Multi-channel support integrated
  • PCI Express cards => 700 MByte/sec on server hardware

SLIDE 39

Minimal Cluster, 2 data nodes, 2 MySQL Servers

[Chart: throughput vs. parallel activity (1-32) for eth, sci, eth + rt and sci + rt.]

SLIDE 40

Distribution aware (8 data nodes on 2 Quad Cores)

[Charts: throughput vs. parallel activity (1-256) for eth, sci, eth+rt and sci+rt, and the improvement (%) of sci, eth+rt and sci+rt compared to Ethernet.]

SLIDE 41

Non-distribution aware (4 data nodes on 4 Quad Cores)

[Charts: throughput vs. parallel activity (1-256) for eth and sci, and the improvement (%) of sci compared to Ethernet.]

SLIDE 42

Non-distribution aware (12 data nodes on 3 Quad Cores)

[Charts: throughput vs. parallel activity (1-96) for sci and eth, and the improvement (%) of sci compared to Ethernet.]

SLIDE 43

Important MySQL Server Parameters (5.1)

  • --ndb-index-stat-enable=0 (Bug if enabled)
  • --ndb-use-exact-count=0 (100%)
  • --ndb-force-send=1 (20%)
  • --engine-condition-pushdown=1 (~10%)

SLIDE 44

Scalability of Threads using DBT2

  • Linear scalability 1->2->4 threads
  • Special case of 1->2 threads on smaller clusters gives a 200% increase
  • ~40-70% increase 4->8 threads
  • ~10-30% increase 8->16 threads
  • Decreasing performance going beyond 16 threads

SLIDE 45

Scalability of MySQL Servers using DBT2

  • Linear scalability when adding MySQL Servers
  • Maximum performance where #MySQL Servers = 3 x number of data nodes
  • #MySQL Servers = number of data nodes: 25% less maximum performance
  • #MySQL Servers = 2 x number of data nodes: 5% less maximum performance

SLIDE 46

Scalability of Data Nodes using DBT2 with proper partitioning using Ethernet

  • Measured using #Data Nodes == #MySQL Servers and at least 2 cores per data node
  • 2 nodes: Max = 27,000 tpm
  • 4 nodes: Max = 40,000 tpm (~50%)
  • 8 nodes: Max = 66,000 tpm (~65%)
  • 16 nodes: Max = 91,000 tpm (~40%)
  • 32 nodes: Max = 132,000 tpm (~40%)

SLIDE 47

Scalability of Data Nodes using DBT2 with proper partitioning using Dolphin Express

  • 2 nodes: 25,000 tpm
  • 8 nodes: 100,000 tpm
  • Scalability using Dolphin Express is much improved compared to Ethernet scalability

SLIDE 48

Future SW performance improvements (1)

  • Batched Key Access: improves execution of joins, especially where joins use lookups of many primary key accesses (0-400%); a preview of this feature is already available
  • Improved scan protocol (~15%)
  • Improved NDB wire protocol, decreasing the number of bits transported to almost half (~20%): less cost for communication, less cost for memory copying in NDB code

SLIDE 49

Future SW performance improvements (2)

  • Incremental backups
  • Optimised backup code
  • Parallel I/O on index scans using disk data
  • Various local code optimisations
  • Using Solaris features for locking to CPUs, fixed scheduler priority, interrupts on a dedicated core
  • Compiler improvements (see my blog for how this improved MySQL/InnoDB on Niagara boxes)
  • Improved scalability inside one MySQL Server
  • Increase maximum number of data nodes from 48 to 128

SLIDE 50

So how will MySQL Cluster work on a Niagara-II with 256 GB memory?

Unpublished results from 2002

  • Benchmark load:
    • Simple read: read 100 bytes of data through the primary key
    • Simple update: update 8 bytes of data through the primary key
    • Both are transactional
  • HW: 72-CPU SunFire 15k, 256 GB memory
  • CPUs: UltraSPARC-III @ 900MHz
  • 32-node NDB Cluster, 1 data node locked to 1 CPU
  • Results (database size = 88 GByte, ~900 million records):
    • Simple read: 1.5 million reads per second
    • Simple update: 340,000 updates per second