SLIDE 1

POLARDB for MyRocks

Extending shared storage to MyRocks

Zhang, Yuan Alibaba Cloud Apr, 2018

SLIDE 2

MORE THAN JUST CLOUD

About me

  • Yuan Zhang
  • Database engineer
  • 5 years at Alibaba
  • Focus on MySQL & MyRocks
  • Email: zhangyuan.zy@alibaba-inc.com
SLIDE 3

Agenda

  • Background
  • Basic Architecture
  • Implementation details
  • Performance Improvement
  • Future plan
SLIDE 4

Background

Why POLARDB for MyRocks

Benefits from MyRocks

  • Great space efficiency, better compression
  • Great write efficiency, lower write amplification
  • Fast data loading
  • Compatible with MySQL

Benefits from shared storage (Polarstore)

  • Promises data consistency
  • Read nodes can be scaled out immediately, without a full copy of the data

MyRocks + Polarstore

SLIDE 5

Basic Architecture

Primary

  • Accepts read/write workload

Replica

  • Accepts read-only workload
  • Shares SST/WAL files with the Primary
SLIDE 6

Let’s Begin

Prepare for RocksDB WAL replication

  • Based on AliSQL 5.7
  • Port MyRocks from Facebook
  • Support only the RocksDB and MyISAM engines
  • Convert system tables to RocksDB
SLIDE 7

Convert system tables to RocksDB

Prepare for RocksDB WAL replication

  • Convert system tables to RocksDB
  • Exceptions: mysql.slow_log and mysql.general_log stay on local disk, so the Primary and each Replica have their own copies of these two tables.

SLIDE 8

RocksDB WAL/Manifest replication

Architecture

SLIDE 9

RocksDB WAL/Manifest replication

Asynchronous replication

WAL Replication

  • Replay PUT/DELETE/MERGE

Manifest Replication

  • Replay flush & compaction

WAL and Manifest Coordination

  • Apply a VEdit only once the applied LSN > the VEdit's LSN
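
The coordination rule above can be sketched as a small queue on the Replica. This is an illustrative model only: `PendingEdit`, `ReplicaApplier`, and the idea that each manifest edit carries the LSN at which it was logged are assumptions, not names or structures taken from the slides or from RocksDB.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical stand-in for a manifest record (a VEdit) waiting to be applied.
struct PendingEdit {
  uint64_t lsn;  // assumed: WAL position at which the edit was logged
  int id;        // illustrative payload
};

class ReplicaApplier {
 public:
  // Manifest edits are queued in LSN order until the WAL replay catches up.
  void EnqueueEdit(PendingEdit e) { pending_.push_back(e); }

  // Called after WAL records up to `lsn` have been replayed.
  // Returns the ids of the edits that became applicable.
  std::vector<int> AdvanceWal(uint64_t lsn) {
    applied_lsn_ = lsn;
    std::vector<int> applied;
    // Rule from the slide: apply only while applied LSN > VEdit LSN.
    while (!pending_.empty() && applied_lsn_ > pending_.front().lsn) {
      applied.push_back(pending_.front().id);
      pending_.pop_front();
    }
    return applied;
  }

 private:
  uint64_t applied_lsn_ = 0;
  std::deque<PendingEdit> pending_;
};
```

Holding edits back this way keeps a flush or compaction from becoming visible on the Replica before the WAL records it depends on have been replayed.
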
SLIDE 10

RocksDB WAL/Manifest replication

Control Primary WAL and SST files deletion

WAL deletion: the original WAL deletion logic would remove WAL files that a Replica has not replayed yet

  • Lm: min_log_number on the Primary
  • Ln: the minimum min_log_number across all Replicas
  • new_min_log_number = min(Lm, Ln)
  • A WAL can be deleted when its number < new_min_log_number

SST deletion: the original SST deletion logic would leave a Replica unable to find an SST, causing it to crash

  • min_version_number: the lowest version number any Replica is still using
  • An SST can be deleted only when neither the Primary nor any Replica will use it again
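
A minimal sketch of the two deletion rules above. The function names are hypothetical. The SST check additionally assumes that each obsolete SST is tagged with the version number from which the Primary stopped referencing it; the slides do not say how that bookkeeping is done.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// new_min_log_number = min(Lm, Ln), with Ln taken over all Replicas.
uint64_t NewMinLogNumber(uint64_t primary_min_log,
                         const std::vector<uint64_t>& replica_min_logs) {
  uint64_t result = primary_min_log;
  for (uint64_t n : replica_min_logs) result = std::min(result, n);
  return result;
}

// A WAL is deletable when its number is below the combined minimum.
bool CanDeleteWal(uint64_t wal_number, uint64_t new_min_log_number) {
  return wal_number < new_min_log_number;
}

// Assumption: `obsolete_since_version` is the first version in which the
// Primary no longer references the SST. The file can go once every Replica
// uses that version or a later one.
bool CanDeleteSst(uint64_t obsolete_since_version,
                  uint64_t min_version_number) {
  return obsolete_since_version <= min_version_number;
}
```

For example, with Lm = 12 and Replicas at 9 and 15, `NewMinLogNumber` yields 9, so WAL 8 may be removed while WAL 9 must stay.
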
SLIDE 11

DDL & Cache replication

Architecture

SLIDE 12

DDL Replication

Remove frm, par files

frm, par files

  • Hold table metadata
  • If the Primary and Replicas share the frm and par files, DDL replication must be synchronous

Remove frm, par files

  • Store their contents in RocksDB instead
  • A Replica can read multiple versions of a table schema
  • DDL replication becomes asynchronous
SLIDE 13

DDL Replication

Remove frm, par files

DDL replication is asynchronous

  • Multiple table schema versions in RocksDB
  • Row data also has different versions
SLIDE 14

DDL Replication

Primary

  • Log the start and end of each MDL lock.

Replica

  • Replay MDL lock start
      A. Lock MDL
  • Replay MDL lock end
      A. Update the table cache in MyRocks
      B. Unlock MDL

An MDL lock protects DDL operations on the Primary. The Replica needs the same lock while it replays DDL.
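
The Replica-side replay order can be sketched as follows. `ReplicaDdlReplayer` is a hypothetical class, the `std::set` merely models which tables are MDL-locked, and the refresh callback stands in for the MyRocks table cache update; a real server would go through MySQL's MDL subsystem.

```cpp
#include <functional>
#include <set>
#include <string>
#include <utility>

class ReplicaDdlReplayer {
 public:
  explicit ReplicaDdlReplayer(
      std::function<void(const std::string&)> refresh_table_cache)
      : refresh_table_cache_(std::move(refresh_table_cache)) {}

  // Replay "MDL lock start": take the lock so readers wait for the DDL.
  void OnDdlStart(const std::string& table) { locked_.insert(table); }

  // Replay "MDL lock end": refresh the cached schema first, then unlock,
  // so no reader sees the stale table definition.
  void OnDdlEnd(const std::string& table) {
    refresh_table_cache_(table);
    locked_.erase(table);
  }

  bool IsLocked(const std::string& table) const {
    return locked_.count(table) > 0;
  }

 private:
  std::function<void(const std::string&)> refresh_table_cache_;
  std::set<std::string> locked_;  // models held MDL locks
};
```
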

SLIDE 15

Cache Replication

ACL, Procedure, Query cache replication

Primary

  • Log cache changes in the RocksDB WAL

Replica (ACL, Procedure)

  • Replay the change from the WAL and invalidate the cache
SLIDE 16

Index Statistics Replication

Persistent

  • Partial index statistics are persisted in each SST
  • Aggregate index statistics are stored in INDEX_STATISTICS

Memory

  • Rdb_key_def::m_stats

Update

  • Analyze table
  • Flush memtable
  • Compact

The Replica listens for PUT operations on INDEX_STATISTICS and reloads the statistics into memory.
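
A rough sketch of that listener. The textual key prefix, the `WalOp` enum, and the single `row_count` value are simplifications invented for illustration; the real INDEX_STATISTICS records use MyRocks' binary dictionary encoding and carry more fields.

```cpp
#include <cstdint>
#include <map>
#include <string>

enum class WalOp { kPut, kDelete };

// Hypothetical textual prefix standing in for the INDEX_STATISTICS type.
const std::string kIndexStatsPrefix = "INDEX_STATISTICS/";

class ReplicaIndexStats {
 public:
  // Called for every WAL record the Replica replays.
  void OnWalRecord(WalOp op, const std::string& key, uint64_t row_count) {
    if (op != WalOp::kPut) return;  // only PUTs carry new statistics
    if (key.compare(0, kIndexStatsPrefix.size(), kIndexStatsPrefix) != 0)
      return;  // not an INDEX_STATISTICS record
    // Reload the in-memory copy for this index.
    rows_[key.substr(kIndexStatsPrefix.size())] = row_count;
  }

  // Returns 0 when no statistics are cached for the index.
  uint64_t Rows(const std::string& index) const {
    auto it = rows_.find(index);
    return it == rows_.end() ? 0 : it->second;
  }

 private:
  std::map<std::string, uint64_t> rows_;
};
```
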

SLIDE 17

New Log Format

Log changes for replication

Log Types

  • DDL (START, END)
  • Cache change, ACL/Proc

Log format

  • PUT/DELETE

Log storage location

  • __system__ column family
SLIDE 18

New Log Format

New type in data dictionary

// Data dictionary types
enum DATA_DICT_TYPE {
  DDL_ENTRY_INDEX_START_NUMBER = 1,
  INDEX_INFO = 2,
  CF_DEFINITION = 3,
  BINLOG_INFO_INDEX_NUMBER = 4,
  DDL_DROP_INDEX_ONGOING = 5,
  INDEX_STATISTICS = 6,
  MAX_INDEX_ID = 7,
  DDL_CREATE_INDEX_ONGOING = 8,
  POLAR_LOG = 100,  // for polar replication
  END_DICT_INDEX_ID = 255
};

enum POLAR_LOG_TYPE {
  TABLE_DDL = 1,
  CACHE_CHANGE = 2,
  ……
  END_POLAR_ROCK_TYPE = 255
};

SLIDE 19

New Log Format

New type in data dictionary

DDL_START

  • type: PUT
  • key: POLAR_LOG+TABLE_DDL+dbname.tablename
  • value: NULL

DDL_END

  • type: DELETE
  • key: POLAR_LOG+TABLE_DDL+dbname.tablename
  • value: NULL

CACHE_CHANGE

  • type: PUT
  • key: POLAR_LOG+CACHE_CHANGE+ACL/Proc
  • value: NULL
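
One way the keys above might be assembled. The slides give only the logical layout (POLAR_LOG + type + name); the 4-byte big-endian encoding of the two numbers is an assumption made for this sketch, and `MakePolarLogKey` and `AppendBigEndian32` are hypothetical helpers.

```cpp
#include <cstdint>
#include <string>

// Subsets of the enums from the slides, repeated so the sketch compiles alone.
enum DATA_DICT_TYPE : uint32_t { POLAR_LOG = 100 };
enum POLAR_LOG_TYPE : uint32_t { TABLE_DDL = 1, CACHE_CHANGE = 2 };

// Assumed encoding: fixed-width, big-endian, so keys sort by type.
void AppendBigEndian32(std::string* out, uint32_t v) {
  for (int shift = 24; shift >= 0; shift -= 8)
    out->push_back(static_cast<char>((v >> shift) & 0xFF));
}

// Builds POLAR_LOG + <type> + <suffix>, e.g. suffix "dbname.tablename".
std::string MakePolarLogKey(POLAR_LOG_TYPE type, const std::string& suffix) {
  std::string key;
  AppendBigEndian32(&key, POLAR_LOG);
  AppendBigEndian32(&key, type);
  key += suffix;
  return key;
}
```

Under this layout, DDL_START is a PUT of `MakePolarLogKey(TABLE_DDL, "db1.t1")` with an empty value, and DDL_END is a DELETE of the same key.
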
SLIDE 20

New Log Format

Problems

DDL_START

  • type: PUT
  • key: POLAR_LOG+TABLE_DDL+dbname.tablename
  • value: NULL

DDL_END

  • type: DELETE
  • key: POLAR_LOG+TABLE_DDL+dbname.tablename
  • value: NULL

DDL_START and DDL_END must be a pair.

Problem 1: Primary Crash

  • If the Primary crashes after DDL_START, it resends DDL_START on restart, and the DDL_END for the previous DDL_START is lost.
  • The Replica has replayed DDL_START and holds the MDL lock, but no DDL_END arrives to unlock it.

SLIDE 21

New Log Format

Problems

DDL_START and DDL_END must be a pair.

Problem 1: Primary Crash

Solution

  • On restart, the Primary scans RocksDB for TABLE_DDL records. If one is found, the Primary resends DDL_END, and the Replica releases the stale lock.
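
A minimal sketch of this restart scan on the Primary. A `std::map` stands in for the `__system__` column family, the textual prefix is a hypothetical replacement for the binary POLAR_LOG+TABLE_DDL prefix, and "resending DDL_END" is modeled as deleting the leftover key.

```cpp
#include <map>
#include <string>
#include <vector>

// Ordered map as a stand-in for the __system__ column family.
using SystemCf = std::map<std::string, std::string>;

// Hypothetical textual prefix standing in for POLAR_LOG+TABLE_DDL.
const std::string kTableDdlPrefix = "POLAR_LOG/TABLE_DDL/";

// Tables whose DDL_START (a PUT) was never followed by DDL_END (a DELETE).
std::vector<std::string> FindUnfinishedDdl(const SystemCf& cf) {
  std::vector<std::string> tables;
  for (auto it = cf.lower_bound(kTableDdlPrefix);
       it != cf.end() &&
       it->first.compare(0, kTableDdlPrefix.size(), kTableDdlPrefix) == 0;
       ++it) {
    tables.push_back(it->first.substr(kTableDdlPrefix.size()));
  }
  return tables;
}

// On Primary restart: emit DDL_END for every leftover record so that
// Replicas release the MDL locks they still hold.
std::vector<std::string> RecoverPrimary(SystemCf* cf) {
  std::vector<std::string> resent = FindUnfinishedDdl(*cf);
  for (const std::string& table : resent) cf->erase(kTableDdlPrefix + table);
  return resent;
}
```
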

SLIDE 22

New Log Format

Problems

DDL_START and DDL_END must be a pair.

Problem 2: Replica Crash

  • If a Replica crashes after DDL_START, it continues by replaying DDL_END on restart.
  • But the lock taken for DDL_START no longer exists after the restart, so replaying DDL_END tries to release an MDL lock that does not exist.

SLIDE 23

New Log Format

Problems

DDL_START and DDL_END must be a pair.

Problem 2: Replica Crash

Solution

  • On restart, the Replica scans RocksDB for TABLE_DDL records. If one is found, the Replica replays DDL_START to take the lock again.
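
The Replica-side fix can be sketched the same way. As before, the `std::map` models the `__system__` column family and the `std::set` models held MDL locks; the function names and the caller-supplied prefix are hypothetical.

```cpp
#include <map>
#include <set>
#include <string>

// Ordered map as a stand-in for the __system__ column family.
using SystemCf = std::map<std::string, std::string>;

// On Replica restart: every surviving TABLE_DDL record means a DDL is still
// in flight, so replay DDL_START for it and take the lock again.
std::set<std::string> RelockAfterReplicaRestart(const SystemCf& cf,
                                                const std::string& ddl_prefix) {
  std::set<std::string> locked;
  for (auto it = cf.lower_bound(ddl_prefix);
       it != cf.end() &&
       it->first.compare(0, ddl_prefix.size(), ddl_prefix) == 0;
       ++it) {
    locked.insert(it->first.substr(ddl_prefix.size()));
  }
  return locked;
}

// Replaying DDL_END now finds the lock it expects; returns false when
// there was nothing to unlock.
bool ReplayDdlEnd(std::set<std::string>* locked, const std::string& table) {
  return locked->erase(table) == 1;
}
```
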

SLIDE 24

MVCC

MVCC based on RocksDB snapshot

Control compaction on the Primary

  • Compaction on the Primary must take the Replicas' snapshots into account
  • Only delete a record when sequence >= Sn, where Sn is the last sequence in the Replica
  • The Primary's snapshot list is merged with the Replicas' snapshot lists

Keep a consistent snapshot in Replica

  • Otherwise, a Replica can no longer read a record once the Primary has compacted it away
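
To illustrate why the snapshot lists are merged, here is a small sketch. `MergeSnapshots` follows the slide. `MustKeepOldVersion` is not from the slides: it applies the usual snapshot-aware compaction rule (keep an overwritten version if some snapshot falls between it and its successor) purely as an assumption about how the merged list would be used.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using SequenceNumber = uint64_t;

// Merge the Primary's and the Replicas' snapshot sequence numbers into one
// sorted list without duplicates.
std::vector<SequenceNumber> MergeSnapshots(
    std::vector<SequenceNumber> primary,
    const std::vector<SequenceNumber>& replica) {
  primary.insert(primary.end(), replica.begin(), replica.end());
  std::sort(primary.begin(), primary.end());
  primary.erase(std::unique(primary.begin(), primary.end()), primary.end());
  return primary;
}

// Assumed rule: a version written at `old_seq` and overwritten at `new_seq`
// must survive compaction if a snapshot s with old_seq <= s < new_seq exists.
// `snapshots` must be sorted ascending.
bool MustKeepOldVersion(const std::vector<SequenceNumber>& snapshots,
                        SequenceNumber old_seq, SequenceNumber new_seq) {
  auto it = std::lower_bound(snapshots.begin(), snapshots.end(), old_seq);
  return it != snapshots.end() && *it < new_seq;
}
```

With Primary snapshots {10, 30} alone, a version written at 15 and overwritten at 25 would be dropped; once a Replica snapshot at 20 is merged in, the same version is kept.
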
SLIDE 25

MVCC

MVCC based on RocksDB snapshot

Keep a consistent snapshot in Replica

SLIDE 26

Performance Improvement

Optimize write performance

  • Async-commit
  • Optimize auto_increment
SLIDE 27

Performance Improvement

Async-commit

Original pipeline write

SLIDE 28

Performance Improvement

Async-commit

Async-commit

SLIDE 29

Performance Improvement

Optimize write performance

Optimize auto_increment

  • Every write needs a uniqueness check
  • So a Get is issued before the write
  • Get is expensive

In practice, most uniqueness checks on auto_increment columns are unnecessary, especially when every auto_increment value is generated automatically.

SLIDE 30

Performance Improvement

Optimize write performance

Optimize auto_increment

  • max_specify_pk: the largest auto_increment value specified by a user
  • if pk > max_specify_pk, skip the unique check
  • if pk <= max_specify_pk, the unique check is needed

max_specify_pk is updated whenever a user supplies an explicit auto_increment value.
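
The rule above fits in a few lines. `AutoIncUniqueCheck` and its method names are hypothetical, and starting `max_specify_pk` at 0 assumes no user-specified value has been seen yet.

```cpp
#include <cstdint>

class AutoIncUniqueCheck {
 public:
  // Called when a user inserts an explicit auto_increment value.
  void OnUserSpecified(uint64_t pk) {
    if (pk > max_specify_pk_) max_specify_pk_ = pk;
  }

  // Keys above every user-specified value can only have come from the
  // generator, so the Get-before-write can be skipped for them.
  bool NeedsUniqueCheck(uint64_t pk) const { return pk <= max_specify_pk_; }

 private:
  uint64_t max_specify_pk_ = 0;  // assumed initial value
};
```

For example, after a user inserts pk = 100 explicitly, generated keys 101, 102, ... skip the check, while an insert with pk = 50 still performs it.
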

SLIDE 31

Future

Feature

  • Online DDL
  • Multiple-Master

Performance

  • Compaction optimization
SLIDE 32

Q&A
