POLARDB for MyRocks
Extending shared storage to MyRocks
Zhang, Yuan
Alibaba Cloud
Apr, 2018
MORE THAN JUST CLOUD
About me
- Yuan Zhang
- database engineer
- Worked at Alibaba for 5 years
- Focus on MySQL & MyRocks
- email: zhangyuan.zy@alibaba-inc.com
Agenda
- Background
- Basic Architecture
- Implementation details
- Performance Improvement
- Future plan
Background
Why POLARDB for MyRocks
Benefits from MyRocks
- Great space efficiency, better compression
- Great write efficiency, lower write amplification
- Fast data loading
- Compatible with MySQL
Benefits from shared storage (PolarStore)
- Guaranteed data consistency
- Ability to add read nodes immediately, without a full copy of the data
MyRocks + Polarstore
Basic Architecture
Primary
- Accepts read/write workload
Replica
- Accepts read workload only
- Shares SST/WAL files with the Primary
Let's Begin
Prepare for RocksDB WAL replication
- Based on AliSQL 5.7
- Port MyRocks from Facebook
- Support only the RocksDB and MyISAM engines
- Convert system tables to RocksDB
Convert system tables to RocksDB
Prepare for RocksDB WAL replication
- Convert system tables to RocksDB
- Exceptions: mysql.slow_log and mysql.general_log are stored on local disk; the Primary and each Replica have their own mysql.slow_log and mysql.general_log tables.
RocksDB WAL/Manifest replication
Architecture
RocksDB WAL/Manifest replication
Asynchronous replication
WAL replication
- Replay PUT/DELETE/MERGE
Manifest replication
- Replay flush & compaction
WAL and Manifest coordination
- Apply a VersionEdit (VEdit) only when the applied LSN > the VEdit's LSN
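The coordination rule above can be sketched as a small queue on the Replica. This is a minimal illustration, not the POLARDB implementation: `PendingEdit`, `ReplicaApplier`, and the assumption that each manifest edit carries the LSN at which the Primary produced it are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical stand-in for a manifest VersionEdit tagged with an LSN.
struct PendingEdit {
    uint64_t lsn;  // LSN recorded with the edit (assumed field)
    int edit_id;   // placeholder for the real flush/compaction payload
};

class ReplicaApplier {
public:
    // Called after the Replica has replayed WAL records up to `lsn`.
    void OnWalApplied(uint64_t lsn) {
        assert(lsn >= applied_lsn_);  // WAL replay only moves forward
        applied_lsn_ = lsn;
    }

    // Manifest edits are assumed to arrive in LSN order and wait here.
    void EnqueueEdit(const PendingEdit& e) { pending_.push_back(e); }

    // Apply every queued edit that satisfies "applied LSN > VEdit LSN".
    // Returns the ids of the edits applied by this call.
    std::vector<int> DrainApplicableEdits() {
        std::vector<int> applied;
        while (!pending_.empty() && applied_lsn_ > pending_.front().lsn) {
            applied.push_back(pending_.front().edit_id);
            pending_.pop_front();
        }
        return applied;
    }

private:
    uint64_t applied_lsn_ = 0;
    std::deque<PendingEdit> pending_;
};
```

Holding edits back this way means a flush or compaction is never replayed before the WAL records it depends on have been applied.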
RocksDB WAL/Manifest replication
Control deletion of WAL and SST files on the Primary
WAL deletion: the original WAL deletion logic would cause a Replica to lose WAL files it still needs
- Lm: min_log_number on the Primary
- Ln: the minimum min_log_number across all Replicas
- new_min_log_number = min(Lm, Ln)
- A WAL can be deleted when its number < new_min_log_number
SST deletion: the original SST deletion logic would leave a Replica unable to find an SST and crash
- min_version_number: the minimum version number a Replica is using
- An SST can be deleted only when it will no longer be used by the Primary or any Replica
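A minimal sketch of the two deletion checks, under stated assumptions. The WAL rule follows the formula on the slide. The SST rule is modelled with a hypothetical `obsolete_since_version` (the first version that no longer references the file), which is not from the talk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// new_min_log_number = min(Lm, Ln), where Ln is the smallest
// min_log_number reported by any Replica. With no Replicas attached,
// only the Primary's value constrains deletion.
uint64_t NewMinLogNumber(uint64_t primary_min_log,
                         const std::vector<uint64_t>& replica_min_logs) {
    uint64_t result = primary_min_log;  // Lm
    for (uint64_t ln : replica_min_logs) {
        result = std::min(result, ln);
    }
    return result;
}

// A WAL file may be deleted when its number is below the new minimum.
bool CanDeleteWal(uint64_t wal_number, uint64_t new_min_log_number) {
    return wal_number < new_min_log_number;
}

// Assumed model: an SST is unused once the oldest version still held by
// the Primary and by every Replica is at or past the version that
// dropped the file.
bool CanDeleteSst(uint64_t obsolete_since_version,
                  uint64_t primary_min_version,
                  uint64_t replica_min_version_number) {
    uint64_t oldest_in_use =
        std::min(primary_min_version, replica_min_version_number);
    return obsolete_since_version <= oldest_in_use;
}
```

For example, with Lm = 12 and Replicas reporting 9 and 15, new_min_log_number is 9, so WAL 8 can be deleted and WAL 9 cannot.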
DDL & Cache replication
Architecture
DDL Replication
Remove .frm and .par files
.frm and .par files
- Hold table metadata
- If the Primary and Replica share .frm and .par files, DDL replication must be synchronous
Remove .frm and .par files
- Store their contents in RocksDB
- The Replica can read multiple versions of a table schema
- DDL replication becomes asynchronous
DDL Replication
Remove .frm and .par files
DDL replication is asynchronous
- Multiple table schema versions are kept in RocksDB
- Row data also has different versions
DDL Replication
Primary
- Log the start and end of the MDL lock.
Replica
- Replay MDL lock start
- A. Lock MDL
- Replay MDL lock end
- A. Update the table cache in MyRocks
- B. Unlock MDL
An MDL lock protects DDL operations on the Primary. The same lock is also needed when the Replica replays DDL.
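The replay steps above can be sketched as follows. This only models the ordering. `ReplicaDdlReplayer` is a hypothetical class, and real MDL acquisition and table-cache handling inside the server are replaced by a set and a counter.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical Replica-side state that models MDL locks as a set of
// table names and the MyRocks table cache as a reload counter.
class ReplicaDdlReplayer {
public:
    // Replay "MDL lock start": take the metadata lock for the table.
    void OnDdlStart(const std::string& table) { locked_.insert(table); }

    // Replay "MDL lock end": refresh the cached schema first, then
    // release the lock, so readers never see a stale definition.
    void OnDdlEnd(const std::string& table) {
        ++cache_reloads_;      // A. update table cache in MyRocks
        locked_.erase(table);  // B. unlock MDL
    }

    bool IsLocked(const std::string& table) const {
        return locked_.count(table) > 0;
    }
    int cache_reloads() const { return cache_reloads_; }

private:
    std::set<std::string> locked_;
    int cache_reloads_ = 0;
};
```

Queries on the Replica that need the table would block between `OnDdlStart` and `OnDdlEnd`, which is the same protection the MDL lock gives on the Primary.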
Cache Replication
ACL, Procedure, and Query cache replication
Primary
- Log cache changes in the RocksDB WAL
Replica (ACL, Procedure)
- Replay the change from the WAL and invalidate the cache
Index Statistics Replication
Persistent
- Partial index statistics are persisted in each SST
- Total index statistics are stored in INDEX_STATISTICS
Memory
- Rdb_key_def::m_stats
Update
- Analyze table
- Flush memtable
- Compaction
The Replica listens for PUT operations on INDEX_STATISTICS and reloads the statistics into memory.
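A rough sketch of such a listener. The key layout is an assumption: dictionary keys are taken to start with a 4-byte big-endian type number, and `StatsListener` is a hypothetical name. The value 6 is the INDEX_STATISTICS entry of DATA_DICT_TYPE.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

constexpr uint32_t INDEX_STATISTICS = 6;  // from DATA_DICT_TYPE

// Decode the assumed 4-byte big-endian type prefix of a dictionary key
// and compare it with `type`.
bool HasDictType(const std::string& key, uint32_t type) {
    if (key.size() < 4) return false;
    uint32_t v = 0;
    for (int i = 0; i < 4; ++i) {
        v = (v << 8) | static_cast<unsigned char>(key[i]);
    }
    return v == type;
}

// Hypothetical hook invoked for every PUT the Replica replays.
struct StatsListener {
    int reloads = 0;

    void OnPut(const std::string& key) {
        if (HasDictType(key, INDEX_STATISTICS)) {
            ++reloads;  // stands in for reloading stats into memory
        }
    }
};
```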
New Log Format
Log changes for replication
Log Types
- DDL(START, END)
- Cache change, ACL/Proc
Log format
- PUT/DELETE
Log store location
- __system__ column family
New Log Format
New type in data dictionary
// Data dictionary types
enum DATA_DICT_TYPE {
  DDL_ENTRY_INDEX_START_NUMBER = 1,
  INDEX_INFO = 2,
  CF_DEFINITION = 3,
  BINLOG_INFO_INDEX_NUMBER = 4,
  DDL_DROP_INDEX_ONGOING = 5,
  INDEX_STATISTICS = 6,
  MAX_INDEX_ID = 7,
  DDL_CREATE_INDEX_ONGOING = 8,
  POLAR_LOG = 100,  // for polar replication
  END_DICT_INDEX_ID = 255
};
enum POLAR_LOG_TYPE {
  TABLE_DDL = 1,
  CACHE_CHANGE = 2,
  ……
  END_POLAR_ROCK_TYPE = 255
};
New Log Format
New type in data dictionary
DDL_START
- type: PUT
- key: POLAR_LOG+TABLE_DDL+dbname.tablename
- value: NULL
DDL_END
- type: DELETE
- key: POLAR_LOG+TABLE_DDL+dbname.tablename
- value: NULL
CACHE_CHANGE
- type: PUT
- key: POLAR_LOG+CACHE_CHANGE+ACL/Proc
- value: NULL
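One way to read the key layouts above is as a concatenation of two type numbers and a name. The sketch below assumes each type number is written as 4 bytes in big-endian order. That width and byte order, and the helper names `AppendBE32`, `MakePolarLogKey`, and `DdlKey`, are illustrative assumptions rather than details from the talk.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Only the enum values needed here, copied from the slide.
enum DATA_DICT_TYPE : uint32_t { POLAR_LOG = 100 };
enum POLAR_LOG_TYPE : uint32_t { TABLE_DDL = 1, CACHE_CHANGE = 2 };

// Append a 32-bit value in big-endian order (assumed encoding).
void AppendBE32(std::string* out, uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8) {
        out->push_back(static_cast<char>((v >> shift) & 0xFF));
    }
}

// key = POLAR_LOG + <log type> + <suffix>
std::string MakePolarLogKey(POLAR_LOG_TYPE type, const std::string& suffix) {
    std::string key;
    AppendBE32(&key, POLAR_LOG);
    AppendBE32(&key, type);
    key += suffix;
    return key;
}

// DDL_START is a PUT of this key; DDL_END is a DELETE of the same key.
std::string DdlKey(const std::string& db, const std::string& table) {
    return MakePolarLogKey(TABLE_DDL, db + "." + table);
}
```

Because DDL_START and DDL_END use the same key, a TABLE_DDL record that is still present in RocksDB marks a DDL that has started but not yet ended.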
New Log Format
Problems
DDL_START and DDL_END must be a pair.
Problem 1: Primary crash
- If the Primary crashes after DDL_START, it resends DDL_START on restart, and the DDL_END for the earlier DDL_START is lost.
- The Replica has replayed the earlier DDL_START and holds the MDL lock, which no DDL_END will release.
Solution
- On restart, the Primary scans RocksDB for TABLE_DDL records; if one is found, the Primary resends DDL_END so that the Replica releases the old lock.
New Log Format
Problems
Problem 2: Replica crash
- If the Replica crashes after DDL_START, it continues replaying on restart and reaches DDL_END.
- The lock taken for DDL_START no longer exists after the restart, so replaying DDL_END tries to release an MDL lock that does not exist.
Solution
- On restart, the Replica scans RocksDB for TABLE_DDL records; for each one found, the Replica replays DDL_START to take the lock again.
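Both solutions start from the same scan for leftover TABLE_DDL records, so they can be sketched side by side. The scan itself is assumed to have happened already and is represented by a list of table names. `PendingDdl`, `PrimaryRecover`, and `ReplicaRecover` are hypothetical names.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Tables that still have a TABLE_DDL record in RocksDB, i.e. a
// DDL_START (PUT) without a matching DDL_END (DELETE).
using PendingDdl = std::vector<std::string>;

// Primary restart: emit a DDL_END for each leftover record so that the
// Replica releases the lock taken for the interrupted DDL.
std::vector<std::string> PrimaryRecover(const PendingDdl& leftover) {
    std::vector<std::string> log;
    for (const std::string& table : leftover) {
        log.push_back("DDL_END " + table);
    }
    return log;
}

// Replica restart: take the MDL lock again for each leftover record so
// that a later DDL_END finds a lock to release.
std::set<std::string> ReplicaRecover(const PendingDdl& leftover) {
    return std::set<std::string>(leftover.begin(), leftover.end());
}
```

In both cases the recovery step restores the pairing: every DDL_START is eventually matched by a DDL_END that releases a lock that actually exists.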
MVCC
MVCC based on RocksDB snapshots
Control compaction on the Primary
- Compaction on the Primary must take the Replica's snapshots into account
- Only delete a record when its sequence >= Sn, where Sn is the latest sequence on the Replica
- The Primary's snapshot list is merged with the Replica's snapshot list
Keep a consistent snapshot on the Replica
- Without this, the Replica can't get a record after the Primary compacts it
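The snapshot-list merge can be pictured as a union of two sorted lists of sequence numbers, which compaction then treats as the set of versions to preserve. This is a sketch under the assumption that both lists are kept in ascending order; `MergeSnapshots` is a hypothetical helper, not a RocksDB API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Merge the Primary's and the Replica's live snapshot sequence numbers
// into one ascending list. Both inputs must already be sorted.
std::vector<uint64_t> MergeSnapshots(const std::vector<uint64_t>& primary,
                                     const std::vector<uint64_t>& replica) {
    assert(std::is_sorted(primary.begin(), primary.end()));
    assert(std::is_sorted(replica.begin(), replica.end()));
    std::vector<uint64_t> merged;
    std::set_union(primary.begin(), primary.end(),
                   replica.begin(), replica.end(),
                   std::back_inserter(merged));
    return merged;
}
```

With the merged list in hand, a compaction on the Primary keeps any record version that a Replica snapshot can still see, even if no snapshot on the Primary needs it.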
Performance Improvement
Optimize write performance
- Async-commit
- Optimize auto_increment
Performance Improvement
Async-commit
(Figure: original pipeline write)
(Figure: async-commit)
Performance Improvement
Optimize write performance
Optimize auto_increment
- A write needs a uniqueness check
- It does a Get first, then the write
- Get is expensive
In practice, most auto_increment uniqueness checks are unnecessary, especially when every value in the auto_increment column is generated automatically.
Performance Improvement
Optimize write performance
Optimize auto_increment
- max_specify_pk: the largest auto_increment value specified by a user
- If pk > max_specify_pk, skip the uniqueness check
- If pk <= max_specify_pk, the uniqueness check is needed
max_specify_pk is updated whenever a user supplies an explicit auto_increment value.
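The rule above fits in a few lines. In this minimal sketch, `AutoIncUniqueCheck` and its method names are hypothetical, and starting `max_specify_pk` at 0 (no user-specified value seen yet) is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

class AutoIncUniqueCheck {
public:
    // Called when a user supplies an explicit auto_increment value.
    void OnUserSpecified(uint64_t pk) {
        max_specify_pk_ = std::max(max_specify_pk_, pk);
    }

    // True when the write must do a Get first to verify uniqueness.
    // Keys above max_specify_pk can only have been generated
    // automatically, so they cannot collide with a user-chosen key.
    bool NeedsUniqueCheck(uint64_t pk) const {
        return pk <= max_specify_pk_;
    }

private:
    uint64_t max_specify_pk_ = 0;  // assumed initial value
};
```

For a table where users never specify the key, `NeedsUniqueCheck` stays false for every insert, so the expensive Get is skipped entirely.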
Future
Feature
- Online DDL
- Multi-master
Performance
- Compaction optimization