Building a Transactional Key-Value Store That Scales to 100+ Nodes
Siddon Tang at PingCAP
(Twitter: @siddontang; @pingcap)
About Me
○ Chief Engineer at PingCAP
○ Leader of the TiKV project
○ My other open-source projects:
  ○ go-mysql
  ○ go-mysql-elasticsearch
  ○ LedisDB
  ○ raft-rs
  ○ etc.
(Diagram: layered comparison - Google: Client → F1 → Spanner; PingCAP: MySQL Client → TiDB → TiKV.)
(Chart: the Rust learning curve plotted against time.)
Raft node states and transitions:
○ Follower → Candidate: election timeout, start a new election
○ Candidate → Leader: receive votes from a majority
○ Candidate → Follower: find the leader or receive a higher-term message
○ Candidate → Candidate: election timeout, re-campaign
○ Leader → Follower: receive a higher-term message
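The transitions above can be sketched in Rust (illustrative types only, not the real raft-rs API):

```rust
// A minimal sketch of the Raft role transitions (names are made up).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Role {
    Follower,
    Candidate,
    Leader,
}

#[derive(Debug, Clone, Copy)]
enum Event {
    ElectionTimeout, // follower/candidate election timer fires
    MajorityVotes,   // candidate received votes from a majority
    HigherTermMsg,   // any role sees a message with a higher term
    FoundLeader,     // candidate discovers an existing leader
}

fn step(role: Role, event: Event) -> Role {
    match (role, event) {
        // Follower -> Candidate: election timeout, start a new election.
        (Role::Follower, Event::ElectionTimeout) => Role::Candidate,
        // Candidate -> Candidate: timeout again, re-campaign.
        (Role::Candidate, Event::ElectionTimeout) => Role::Candidate,
        // Candidate -> Leader: received votes from a majority.
        (Role::Candidate, Event::MajorityVotes) => Role::Leader,
        // Candidate -> Follower: found a leader or saw a higher term.
        (Role::Candidate, Event::FoundLeader)
        | (Role::Candidate, Event::HigherTermMsg) => Role::Follower,
        // Leader -> Follower: saw a higher term.
        (Role::Leader, Event::HigherTermMsg) => Role::Follower,
        // Everything else leaves the role unchanged.
        (r, _) => r,
    }
}

fn main() {
    let mut role = Role::Follower;
    role = step(role, Event::ElectionTimeout); // -> Candidate
    role = step(role, Event::MajorityVotes);   // -> Leader
    assert_eq!(role, Role::Leader);
    role = step(role, Event::HigherTermMsg);   // -> Follower
    assert_eq!(role, Role::Follower);
}
```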
(Diagram: a Raft group of three replicas. Each node has a Raft module, a log, and a state machine; the log entries "a <- 1" and "b <- 2" are replicated to every node and applied, in order, to each state machine.)
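The replicated-state-machine idea above - every replica applies the same log in the same order - can be sketched as follows (all names are illustrative):

```rust
use std::collections::HashMap;

// Each replica keeps a Raft log and a key-value state machine, and
// applies committed entries to the state machine in log order.
#[derive(Clone, Debug, PartialEq)]
struct LogEntry {
    key: String,
    value: i64,
}

#[derive(Default)]
struct Replica {
    log: Vec<LogEntry>,
    state: HashMap<String, i64>, // the key-value state machine
}

impl Replica {
    // Append a committed entry and apply it to the state machine.
    fn apply(&mut self, entry: LogEntry) {
        self.state.insert(entry.key.clone(), entry.value);
        self.log.push(entry);
    }
}

fn main() {
    // Three replicas of one Raft group receive the same entries.
    let mut replicas = vec![Replica::default(), Replica::default(), Replica::default()];
    let entries = [
        LogEntry { key: "a".into(), value: 1 }, // a <- 1
        LogEntry { key: "b".into(), value: 2 }, // b <- 2
    ];
    for r in replicas.iter_mut() {
        for e in &entries {
            r.apply(e.clone());
        }
    }
    // Every replica ends up with an identical state machine.
    assert!(replicas.iter().all(|r| r.state["a"] == 1 && r.state["b"] == 2));
}
```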
Two sharding strategies:
○ Range Sharding (used by TiKV) - the key space is split into contiguous ranges, e.g. (-∞, a), [a, b), [b, +∞), stored as Chunk 1, Chunk 2, Chunk 3.
○ Hash Sharding - each key of the dataset is hashed, and the hash decides which chunk the key lands in.
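A minimal sketch of the two lookup styles, with made-up helper names:

```rust
// Range sharding: find the chunk whose range contains the key.
// bounds = ["a", "b"] gives the ranges (-inf, a), [a, b), [b, +inf).
fn range_shard(key: &str, bounds: &[&str]) -> usize {
    bounds.iter().take_while(|b| key >= **b).count()
}

// Hash sharding: hash the key and take it modulo the shard count.
// (A tiny FNV-1a here purely for illustration.)
fn hash_shard(key: &str, shards: usize) -> usize {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in key.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    (h % shards as u64) as usize
}

fn main() {
    let bounds = ["a", "b"];
    // Range sharding keeps adjacent keys together ...
    assert_eq!(range_shard("apple", &bounds), 1);  // in [a, b)
    assert_eq!(range_shard("banana", &bounds), 2); // in [b, +inf)
    assert_eq!(range_shard("0001", &bounds), 0);   // in (-inf, a)
    // ... while hash sharding scatters them across shards.
    assert!(hash_shard("apple", 3) < 3);
}
```

Range sharding makes scans over adjacent keys cheap, which is why TiKV chooses it; hash sharding spreads load evenly but destroys key locality.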
(Diagram: range sharding in TiKV - the key space is split into Region 1 [A, B), Region 2 [B, C), Region 3 [C, D); each Region is replicated three times, and the replicas of one Region form one Raft group.)
(Diagram: Region Split and Merge - when a Region grows too large, each of its replicas splits into two Regions (Region A → Region A + Region B) on Node 1, Node 2, and Node 3; Merge is the reverse, folding two small adjacent Regions back into one.)
How to move Region A from Node 1 to Node 2?
1. Add Replica - create a new replica A’ of Region A on Node 2 and let it catch up.
2. Transfer Leader - transfer the Raft leadership from the old replica to the new one.
3. Remove Replica - remove the old replica from Node 1, leaving Region A only on Node 2.
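The three-step move can be sketched as follows (hypothetical types, not the real TiKV/PD API):

```rust
// Illustrative model of a Region's replica placement.
#[derive(Debug, Clone, PartialEq)]
struct Region {
    replicas: Vec<u64>, // ids of the nodes holding a replica
    leader: u64,        // id of the node holding the Raft leader
}

fn add_replica(r: &mut Region, node: u64) {
    // Step 1: create a new replica on the target node.
    r.replicas.push(node);
}

fn transfer_leader(r: &mut Region, node: u64) {
    // Step 2: hand Raft leadership to the replica on the target node.
    assert!(r.replicas.contains(&node));
    r.leader = node;
}

fn remove_replica(r: &mut Region, node: u64) {
    // Step 3: drop the old replica; never remove the current leader.
    assert_ne!(r.leader, node);
    r.replicas.retain(|&n| n != node);
}

fn main() {
    // Region A starts with its only replica (and leader) on node 1.
    let mut region_a = Region { replicas: vec![1], leader: 1 };
    add_replica(&mut region_a, 2);
    transfer_leader(&mut region_a, 2);
    remove_replica(&mut region_a, 1);
    assert_eq!(region_a.replicas, vec![2]);
    assert_eq!(region_a.leader, 2);
}
```

The ordering matters: leadership must move before the old replica is removed, otherwise the Region would briefly lose its leader.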
(Diagram: a transaction spanning two Raft groups - Begin; Set a = 1 goes to Region 1; Set b = 2 goes to Region 2; Commit must then apply atomically to both Raft groups, each of which is replicated across the nodes.)
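TiKV commits such cross-Region transactions with a Percolator-style two-phase commit. A drastically simplified sketch of the prewrite/commit shape (illustrative only, not the real protocol or API):

```rust
use std::collections::HashMap;

// A toy Region that supports two-phase commit: prewrite locks and
// stages a key; commit makes it visible and releases the lock.
#[derive(Default)]
struct Region {
    data: HashMap<String, i64>,
    locks: HashMap<String, u64>, // key -> start timestamp of the locking txn
}

impl Region {
    // Phase 1: lock the key; fail on conflict with another transaction.
    fn prewrite(&mut self, key: &str, start_ts: u64) -> bool {
        if self.locks.contains_key(key) {
            return false; // another transaction holds the lock
        }
        self.locks.insert(key.to_string(), start_ts);
        true
    }

    // Phase 2: make the write visible and release the lock.
    fn commit(&mut self, key: &str, value: i64) {
        self.locks.remove(key);
        self.data.insert(key.to_string(), value);
    }
}

fn main() {
    let (mut r1, mut r2) = (Region::default(), Region::default());
    let start_ts = 100;
    // Begin; Set a = 1 (Region 1); Set b = 2 (Region 2); Commit.
    let ok = r1.prewrite("a", start_ts) && r2.prewrite("b", start_ts);
    assert!(ok); // commit only if every prewrite succeeded
    r1.commit("a", 1);
    r2.commit("b", 2);
    assert_eq!(r1.data["a"], 1);
    assert_eq!(r2.data["b"], 2);
}
```

If any prewrite fails, the transaction rolls back instead of committing, which is what gives atomicity across Raft groups.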
(Diagram: three TiKV instances, each serving clients over gRPC. Inside every instance the layers are, top to bottom: Txn KV API → Transaction → Raft → RocksDB; the Raft layers of the instances form Raft groups with each other.)
(Diagram: a cluster of many TiKV nodes - “We are Gods!!!”)
Placement Driver (PD)

(Diagram: a cluster of three PD nodes. Each TiKV store sends Store Heartbeats and Region Heartbeats to PD; PD replies with schedule operators such as Add Replica, Remove Replica, Transfer Leader, ...)
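A toy sketch of heartbeat-driven scheduling (names like `StoreHeartbeat` and `schedule` are made up; PD's real policies are far richer):

```rust
// PD-style scheduling sketch: stores report state via heartbeats,
// and the scheduler answers with an operator to execute.
#[derive(Debug, PartialEq)]
enum Operator {
    AddReplica { region: u64, to_store: u64 },
    None,
}

struct StoreHeartbeat {
    store_id: u64,
    region_count: usize,
}

// Toy policy: if the busiest store has at least two more Regions than
// the emptiest, place a replica on the emptier store.
// Assumes `hbs` is non-empty.
fn schedule(hbs: &[StoreHeartbeat], region: u64) -> Operator {
    let busiest = hbs.iter().max_by_key(|h| h.region_count).unwrap();
    let idlest = hbs.iter().min_by_key(|h| h.region_count).unwrap();
    if busiest.region_count >= idlest.region_count + 2 {
        Operator::AddReplica { region, to_store: idlest.store_id }
    } else {
        Operator::None
    }
}

fn main() {
    let hbs = vec![
        StoreHeartbeat { store_id: 1, region_count: 5 },
        StoreHeartbeat { store_id: 2, region_count: 1 },
    ];
    // Store 2 is underloaded, so the scheduler targets it.
    assert_eq!(
        schedule(&hbs, 42),
        Operator::AddReplica { region: 42, to_store: 2 }
    );
}
```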
(Diagram: Region balance - PD redistributes Regions R1–R6 so that each store holds a similar number of Regions.)
Regions’ sizes are not the same:

(Diagram: balancing by size - before, one store holds R1, R2, R3 at 0 MB each while another holds R4 (64 MB), R5 (64 MB), R6 (96 MB); after, PD swaps R2 and R5 so the data size is spread more evenly across the stores.)
Some Regions are very hot for reads/writes:

(Diagram: balancing by hotness - hot Regions R1 and R2 start on the same store; PD moves one of them away so hot Regions are spread out, with normal and cold Regions R3–R6 filling the rest.)
○ Weight Balance - a TiKV store with a higher weight will hold more data
○ Evict Leader Balance - some TiKV nodes are not allowed to hold any Raft leader
(Diagram: replica placement across data centers and racks - if both replicas of a Region sit in the same DC or on the same rack, losing that DC or rack loses the Region; spreading R1 and R2 so that each DC and rack holds at most one replica of a Region keeps them available.)
(Diagram: cross-DC placement - three data centers, Seattle 1, Seattle 2, and Santa Clara, each hold one replica of R1 and one replica of R2 on separate racks; PD may move replicas (shown as R1’ and R2’) between DCs while preserving the one-replica-per-DC layout.)
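The placement constraint can be sketched as a simple check (illustrative only, not PD's real placement rules):

```rust
use std::collections::HashSet;

// A replica's physical location: which data center and which rack.
struct Location {
    dc: &'static str,
    rack: &'static str,
}

// The constraint sketched above: every replica on its own (DC, rack)
// pair, and the replicas spread over more than one DC.
fn placement_ok(replicas: &[Location]) -> bool {
    let racks: HashSet<_> = replicas.iter().map(|l| (l.dc, l.rack)).collect();
    let dcs: HashSet<_> = replicas.iter().map(|l| l.dc).collect();
    racks.len() == replicas.len() && dcs.len() > 1
}

fn main() {
    // One replica per DC, each on its own rack: acceptable.
    let good = [
        Location { dc: "Seattle 1", rack: "rack-1" },
        Location { dc: "Seattle 2", rack: "rack-1" },
        Location { dc: "Santa Clara", rack: "rack-1" },
    ];
    assert!(placement_ok(&good));

    // Two replicas share a DC and a rack: losing that rack loses quorum.
    let bad = [
        Location { dc: "Seattle 1", rack: "rack-1" },
        Location { dc: "Seattle 1", rack: "rack-1" },
        Location { dc: "Santa Clara", rack: "rack-2" },
    ];
    assert!(!placement_ok(&bad));
}
```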
○ Strong consistency
○ ACID compliance
○ Horizontal scalability
○ Cloud-native architecture
○ So far: TiDB (MySQL), TiSpark (SparkSQL), Toutiao.com (metadata service for their own S3), Ele.me (Redis protocol layer)
○ Sky is the limit!