SLIDE 1

Building a Transactional Key-Value Store That Scales to 100+ Nodes

Siddon Tang at PingCAP

(Twitter: @siddontang; @pingcap)

SLIDE 2

About Me

  • Chief Engineer at PingCAP
  • Leader of TiKV project
  • My other open-source projects:

○ go-mysql
○ go-mysql-elasticsearch
○ LedisDB
○ raft-rs
○ etc.

SLIDE 3

Agenda

  • Why did we build TiKV?
  • How do we build TiKV?
  • Going beyond TiKV

SLIDE 4

Why?

Is it worthwhile to build another Key-Value store?

SLIDE 5

We want to build a distributed relational database to solve the scaling problem of MySQL!!!

SLIDE 6

Inspired by Google F1 + Spanner

[Diagram: Google's stack (Client → F1 → Spanner) maps onto ours (MySQL Client → TiDB → TiKV)]

SLIDE 7

How?

SLIDE 8

A High Building, A Low Foundation

SLIDE 9

What we need to build...

  • 1. A high-performance Key-Value engine to store data
  • 2. A consensus model to keep data consistent across machines
  • 3. A transaction model to provide ACID guarantees across machines
  • 4. A network framework for communication
  • 5. A scheduler to manage the whole cluster

SLIDE 10

Choose a Language!

SLIDE 11

Hello Rust

SLIDE 12

Rust...?

SLIDE 13

Rust - Cons (2 years ago):

  • Makes you think differently
  • Long compile time
  • Lack of libraries and tools
  • Few Rust programmers
  • Uncertain future

[Chart: the Rust learning curve over time]

SLIDE 14

Rust - Pros:

  • Blazing Fast
  • Memory safety
  • Thread safety
  • No GC
  • Fast FFI
  • Vibrant package ecosystem

SLIDE 15

Let’s start from the beginning!

SLIDE 16

Key-Value engine

SLIDE 17

Why RocksDB?

  • High Write/Read Performance
  • Stability
  • Easy to embed in Rust (see the sketch below)
  • Rich functionality
  • Continuous development
  • Active community
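TiKV embeds RocksDB through Rust bindings. As a rough illustration of how little ceremony that takes, here is a minimal sketch using the community `rocksdb` crate (the crate choice and the `/tmp` path are our assumptions, not TiKV's vendored bindings):

```rust
use rocksdb::{Options, DB};

fn main() -> Result<(), rocksdb::Error> {
    // Open (or create) a RocksDB instance in a local directory.
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, "/tmp/rocksdb-demo")?;

    // Plain key-value writes and reads; TiKV layers Raft and
    // transactions on top of primitives like these.
    db.put(b"a", b"1")?;
    match db.get(b"a")? {
        Some(v) => println!("a = {}", String::from_utf8_lossy(&v)),
        None => println!("a not found"),
    }
    Ok(())
}
```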

SLIDE 18

RocksDB alone keeps the data on one machine. We need fault tolerance.

SLIDE 19

Consensus Algorithm

SLIDE 20

Raft - Roles

  • Leader
  • Follower
  • Candidate

SLIDE 21

Raft - Election

[State diagram: Raft election]

  • Follower → Candidate: election timeout, start a new election
  • Candidate → Leader: receive votes from a majority
  • Candidate → Follower: find the leader, or receive a message with a higher term
  • Candidate → Candidate: election timeout again, re-campaign
  • Leader → Follower: receive a message with a higher term
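These transitions are mechanical enough to express directly. A minimal, dependency-free sketch of the role state machine (the names are ours, not raft-rs's):

```rust
#[derive(Debug, Clone, Copy)]
enum Role {
    Follower,
    Candidate,
    Leader,
}

#[derive(Debug, Clone, Copy)]
enum Event {
    ElectionTimeout,
    MajorityVotes,
    FoundLeaderOrHigherTerm,
}

// Apply one election event to the current role, following the
// transition diagram above.
fn step(role: Role, event: Event) -> Role {
    match (role, event) {
        (Role::Follower, Event::ElectionTimeout) => Role::Candidate, // start an election
        (Role::Candidate, Event::ElectionTimeout) => Role::Candidate, // split vote: re-campaign
        (Role::Candidate, Event::MajorityVotes) => Role::Leader,
        // Any node that finds the leader or sees a higher term steps down.
        (_, Event::FoundLeaderOrHigherTerm) => Role::Follower,
        (other, _) => other,
    }
}

fn main() {
    let mut role = Role::Follower;
    for event in [Event::ElectionTimeout, Event::MajorityVotes] {
        role = step(role, event);
        println!("{:?} -> {:?}", event, role);
    }
}
```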

SLIDE 22

Raft - Log Replicated State Machine

[Diagram: a client sends writes (a <- 1, b <- 2) to the Raft module; the entries are replicated into each of the three replicas' logs and applied, in order, to each replica's state machine, so every copy ends with a = 1, b = 2]
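The whole idea in miniature: replicas only need to agree on the log, because applying the same log in the same order always reproduces the same state. An in-memory sketch (our illustration, not TiKV code):

```rust
use std::collections::HashMap;

// One log entry: set `key` to `value`.
struct Entry {
    key: String,
    value: String,
}

// Applying the log in order deterministically rebuilds the state,
// so every replica that holds this log ends up identical.
fn apply(log: &[Entry]) -> HashMap<String, String> {
    let mut state = HashMap::new();
    for e in log {
        state.insert(e.key.clone(), e.value.clone());
    }
    state
}

fn main() {
    let log = vec![
        Entry { key: "a".into(), value: "1".into() },
        Entry { key: "b".into(), value: "2".into() },
    ];
    // Prints {"a": "1", "b": "2"} on every replica that applies this log.
    println!("{:?}", apply(&log));
}
```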

SLIDE 23

Raft - Optimization

  • Leader appends logs and sends msgs in parallel
  • Prevote
  • Pipeline
  • Batch
  • Learner
  • Lease-based Read (see the sketch below)
  • Follower Read
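Lease-based read deserves a note: once a leader has heard from a quorum, no new leader can be elected for one election timeout, so during that lease the leader may answer reads from local state without a log round trip. A minimal sketch of the lease check (clock drift and lease renewal are ignored; real implementations must bound them):

```rust
use std::time::{Duration, Instant};

struct LeaderLease {
    // When the heartbeats that a quorum acknowledged were sent.
    quorum_ack_at: Instant,
    election_timeout: Duration,
}

impl LeaderLease {
    // Until quorum_ack_at + election_timeout, no other node can have
    // won an election, so serving reads locally is safe.
    fn can_read_locally(&self, now: Instant) -> bool {
        now < self.quorum_ack_at + self.election_timeout
    }
}

fn main() {
    let lease = LeaderLease {
        quorum_ack_at: Instant::now(),
        election_timeout: Duration::from_millis(1000),
    };
    println!("serve read locally: {}", lease.can_read_locally(Instant::now()));
}
```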

SLIDE 24

A single Raft group can't manage a huge dataset. So we need Multi-Raft!!!

SLIDE 25

Multi-Raft: Data sharding

  • Range Sharding (used by TiKV): the dataset is split into contiguous key ranges, e.g. (-∞, a), [a, b), [b, +∞)
  • Hash Sharding: each key is hashed to choose its chunk (Chunk 1, Chunk 2, Chunk 3)
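With range sharding, routing a key to its Region is an ordered-map lookup: find the last range whose start key is ≤ the key. A minimal sketch (our own illustration, not TiKV's router):

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

// Map each Region's start key to a label; the empty start key stands
// for -infinity, so every key falls into some range.
fn region_for<'a>(regions: &'a BTreeMap<Vec<u8>, String>, key: &[u8]) -> &'a str {
    regions
        .range::<[u8], _>((Bound::Unbounded, Bound::Included(key)))
        .next_back()
        .map(|(_, name)| name.as_str())
        .expect("the empty start key covers all keys")
}

fn main() {
    let mut regions = BTreeMap::new();
    regions.insert(b"".to_vec(), "Region 1: (-inf, a)".to_string());
    regions.insert(b"a".to_vec(), "Region 2: [a, b)".to_string());
    regions.insert(b"b".to_vec(), "Region 3: [b, +inf)".to_string());

    // "apple" sorts at or after "a" and before "b": Region 2.
    println!("{}", region_for(&regions, b"apple"));
}
```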

SLIDE 26

Multi-Raft in TiKV

[Diagram: range sharding produces Regions covering A–B, B–C, and C–D; each Region has a replica on each of Node 1, Node 2, and Node 3, and a Region's three replicas form one Raft group]

SLIDE 27

Multi-Raft: Split and Merge

  • Split: a Region that grows too large splits in two (Region A → Region A + Region B)
  • Merge: two small adjacent Regions merge back into one (Region A + Region B → Region A)

slide-28
SLIDE 28

Multi-Raft: Scalability

How to move Region A from Node 1 to Node 2?

[Diagram: starting layout of Region A's replicas across Node 1 and Node 2]

slide-29
SLIDE 29

Multi-Raft: Scalability

How to move Region A from Node 1 to Node 2?

Step 1 - Add Replica: create a new replica of Region A on Node 2.

[Diagram: a new Region A replica appears on Node 2]

slide-30
SLIDE 30

Multi-Raft: Scalability

How to move Region A from Node 1 to Node 2?

Step 2 - Transfer Leader: transfer Region A's leadership to the new replica on Node 2.

[Diagram: the replica on Node 2 becomes the leader]

slide-31
SLIDE 31

Multi-Raft: Scalability

How to move Region A from Node 1 to Node 2?

Step 3 - Remove Replica: remove the old replica of Region A from Node 1.

[Diagram: only the new replica on Node 2 remains]
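The move is thus a three-step schedule, and at every step a quorum of Region A's replicas stays healthy. A sketch of the sequence as data (the types are ours, not PD's):

```rust
// One step of a Region-move schedule, as the scheduler would issue it.
#[derive(Debug)]
enum Step {
    AddReplica { region: &'static str, to_node: u32 },
    TransferLeader { region: &'static str, to_node: u32 },
    RemoveReplica { region: &'static str, from_node: u32 },
}

// Move Region A from Node 1 to Node 2 without losing availability.
fn move_region_steps() -> Vec<Step> {
    vec![
        Step::AddReplica { region: "A", to_node: 2 },
        Step::TransferLeader { region: "A", to_node: 2 },
        Step::RemoveReplica { region: "A", from_node: 1 },
    ]
}

fn main() {
    for step in move_region_steps() {
        println!("{:?}", step);
    }
}
```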


slide-32
SLIDE 32

How do we ensure data consistency across Regions?

SLIDE 33

Distributed Transaction

[Diagram: a transaction spanning two Raft groups: Begin; Set a = 1 (Region 1); Set b = 2 (Region 2); Commit - each Region is replicated three ways]

SLIDE 34

Transaction in TiKV

  • Optimized two-phase commit, inspired by Google Percolator (see the sketch after this list)
  • Multi-version concurrency control
  • Optimistic commit
  • Snapshot isolation
  • A Timestamp Oracle allocates unique timestamps for transactions
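A much-simplified, single-process sketch of the Percolator flow: take a start timestamp from the Timestamp Oracle, lock every key with one key chosen as primary (prewrite), then commit at a commit timestamp; committing the primary key is the atomic commit point. Conflict checks against the write column, rollback, and lock cleanup are all omitted:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

// Toy Timestamp Oracle: a monotonically increasing counter.
static TSO: AtomicU64 = AtomicU64::new(1);
fn get_ts() -> u64 {
    TSO.fetch_add(1, Ordering::SeqCst)
}

#[derive(Default)]
struct Store {
    data: HashMap<(String, u64), String>, // (key, start_ts) -> value
    lock: HashMap<String, (String, u64)>, // key -> (primary key, start_ts)
    write: HashMap<(String, u64), u64>,   // (key, commit_ts) -> start_ts
}

impl Store {
    // Phase 1: lock every key; the first mutation's key is the primary.
    fn prewrite(&mut self, start_ts: u64, mutations: &[(String, String)]) -> bool {
        let primary = mutations[0].0.clone();
        if mutations.iter().any(|(k, _)| self.lock.contains_key(k)) {
            return false; // another transaction holds a lock: conflict
        }
        for (k, v) in mutations {
            self.lock.insert(k.clone(), (primary.clone(), start_ts));
            self.data.insert((k.clone(), start_ts), v.clone());
        }
        true
    }

    // Phase 2: recording the commit and clearing the lock on the primary
    // key commits the whole transaction; secondaries can follow lazily.
    fn commit(&mut self, start_ts: u64, commit_ts: u64, keys: &[String]) {
        for k in keys {
            self.lock.remove(k);
            self.write.insert((k.clone(), commit_ts), start_ts);
        }
    }
}

fn main() {
    let mut store = Store::default();
    let muts = vec![("a".into(), "1".into()), ("b".into(), "2".into())];

    let start_ts = get_ts();
    assert!(store.prewrite(start_ts, &muts));

    let commit_ts = get_ts();
    let keys: Vec<String> = muts.iter().map(|(k, _)| k.clone()).collect();
    store.commit(start_ts, commit_ts, &keys);
    println!("committed a = 1, b = 2 at ts {}", commit_ts);
}
```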

SLIDE 35

Percolator Optimization

  • Use a latch in TiDB to support pessimistic commit
  • Concurrent Prewrite

○ We are formally proving it with TLA+

SLIDE 36

How do the nodes communicate with each other? An RPC framework!

SLIDE 37

Hello gRPC

SLIDE 38

Why gRPC?

  • Widely used
  • Supported by many languages
  • Works with Protocol Buffers and FlatBuffers
  • Rich interface
  • Benefits from HTTP/2

SLIDE 39

TiKV Stack

[Diagram: each TiKV instance stacks, bottom to top: RocksDB, Raft, Transaction, Txn KV API, gRPC; clients reach instances over gRPC, and matching Regions across instances form Raft groups]

SLIDE 40

How to manage 100+ nodes?

SLIDE 41

Scheduler in TiKV

[Diagram: a cluster of TiKV nodes overseen by a three-node PD (Placement Driver) cluster: "We are Gods!!!"]

SLIDE 42

Scheduler - How

[Diagram: TiKV nodes send Store heartbeats and Region heartbeats to the PD cluster; PD replies with schedule operators such as Add Replica, Remove Replica, Transfer Leader, ...]
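On each Region heartbeat, PD compares the reported state with the placement it wants and may piggyback an operator on the response. A minimal sketch of that exchange (the types are our simplification, not the kvproto definitions):

```rust
// What a TiKV node reports for one Region (simplified).
struct RegionHeartbeat {
    region_id: u64,
    leader_store: u64,
    replica_stores: Vec<u64>, // store ids currently holding a replica
}

// What PD may send back on the heartbeat response.
#[derive(Debug)]
enum ScheduleOperator {
    AddReplica { store: u64 },
    RemoveReplica { store: u64 },
    TransferLeader { store: u64 },
}

// A toy placement rule: keep exactly three replicas.
fn handle_heartbeat(hb: &RegionHeartbeat, all_stores: &[u64]) -> Option<ScheduleOperator> {
    if hb.replica_stores.len() < 3 {
        // Under-replicated: add a replica on any store that lacks one.
        let target = all_stores.iter().find(|&&s| !hb.replica_stores.contains(&s))?;
        return Some(ScheduleOperator::AddReplica { store: *target });
    }
    if hb.replica_stores.len() > 3 {
        // Over-replicated: drop a non-leader replica.
        let victim = hb.replica_stores.iter().find(|&&s| s != hb.leader_store)?;
        return Some(ScheduleOperator::RemoveReplica { store: *victim });
    }
    None // placement is fine; no operator this round
}

fn main() {
    let hb = RegionHeartbeat {
        region_id: 1,
        leader_store: 1,
        replica_stores: vec![1, 2],
    };
    println!("region {}: {:?}", hb.region_id, handle_heartbeat(&hb, &[1, 2, 3, 4]));
}
```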

SLIDE 43

Scheduler - Goal

  • Keep load and data size balanced across nodes
  • Avoid hotspot performance issues

SLIDE 44

Scheduler - Region Count Balance

Assume the Regions are about the same size

[Diagram: Regions R1–R6 are redistributed so every store holds about the same number of Regions]

SLIDE 45

Scheduler - Region Count Balance

Regions’ sizes are not the same

[Diagram: Region counts are balanced, but sizes are not: R1 0 MB, R2 0 MB, R3 0 MB, R4 64 MB, R5 64 MB, R6 96 MB]

SLIDE 46

Scheduler - Region Size Balance

Use Region size in the balance calculation

[Diagram: balancing by size moves Regions so the stores end up holding roughly equal totals, e.g. (R1 0 MB, R5 64 MB), (R3 0 MB, R4 64 MB), (R2 0 MB, R6 96 MB)]
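The core idea fits in a toy greedy pass: score each store by its total Region size and move a Region from the fullest store to the emptiest while that narrows the gap (our simplification; PD's real scoring also weighs leaders, hot spots, and operator cost):

```rust
use std::collections::BTreeMap;

// store id -> that store's (region name, size in MB) pairs
type Stores = BTreeMap<u32, Vec<(&'static str, u64)>>;

fn total(regions: &[(&'static str, u64)]) -> u64 {
    regions.iter().map(|&(_, s)| s).sum()
}

// One greedy step: move the biggest useful Region from the fullest
// store to the emptiest store, if that strictly narrows their gap.
fn balance_step(stores: &mut Stores) -> bool {
    let fullest = *stores.iter().max_by_key(|(_, r)| total(r)).unwrap().0;
    let emptiest = *stores.iter().min_by_key(|(_, r)| total(r)).unwrap().0;
    let gap = total(&stores[&fullest]) - total(&stores[&emptiest]);

    // Moving a Region of size s changes the pair's gap to |gap - 2s|,
    // so any 0 < s < gap is an improvement; take the largest such s.
    let candidate = stores[&fullest]
        .iter()
        .enumerate()
        .filter(|&(_, &(_, size))| size > 0 && size < gap)
        .max_by_key(|&(_, &(_, size))| size)
        .map(|(i, _)| i);

    match candidate {
        Some(i) => {
            let region = stores.get_mut(&fullest).unwrap().remove(i);
            stores.get_mut(&emptiest).unwrap().push(region);
            true
        }
        None => false,
    }
}

fn main() {
    let mut stores: Stores = BTreeMap::new();
    stores.insert(1, vec![("R1", 0), ("R2", 0)]);
    stores.insert(2, vec![("R3", 0), ("R4", 64)]);
    stores.insert(3, vec![("R5", 64), ("R6", 96)]);

    while balance_step(&mut stores) {}
    for (id, regions) in &stores {
        println!("store {}: {:?} = {} MB", id, regions, total(regions));
    }
}
```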

SLIDE 47

Scheduler - Region Size Balance

Some Regions are very hot for reads/writes

[Diagram: Regions R1–R6 labeled Hot, Normal, or Cold]

SLIDE 48

Scheduler - Hot Balance

[Diagram: hot Regions are spread out so that no single store serves them all]

TiKV reports each Region's read/write traffic to PD

SLIDE 49

Scheduler - More

  • More balance policies…

○ Weight Balance - a TiKV node with a higher weight stores more data
○ Evict Leader Balance - designated TiKV nodes must not hold any Raft leader

  • OpInfluence - avoids overly frequent rebalancing

SLIDE 50

Geo-Replication

SLIDE 51

Scheduler - Cross DC

[Diagram: Regions R1 and R2 each keep replicas in different racks and different DCs, so losing a rack or a whole DC still leaves replicas alive elsewhere]

SLIDE 52

Scheduler - Three DCs in Two Cities

[Diagram: three DCs in two cities - Seattle 1, Seattle 2, and Santa Clara; replicas of Regions R1 and R2 are placed across racks in all three DCs]
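PD drives this with location labels (e.g. dc, rack) attached to each store, preferring placements that spread a Region's replicas over as many distinct DCs, then racks, as possible. A toy version of the diversity comparison (the labels and scoring are our simplification):

```rust
use std::collections::HashSet;

// Each store advertises its location labels.
struct Store {
    dc: &'static str,
    rack: &'static str,
}

// Score a placement by (distinct DCs, distinct racks); a higher
// score means better isolation from rack and DC failures.
fn diversity(replicas: &[&Store]) -> (usize, usize) {
    let dcs: HashSet<_> = replicas.iter().map(|s| s.dc).collect();
    let racks: HashSet<_> = replicas.iter().map(|s| (s.dc, s.rack)).collect();
    (dcs.len(), racks.len())
}

fn main() {
    let s1 = Store { dc: "seattle-1", rack: "a" };
    let s2 = Store { dc: "seattle-2", rack: "a" };
    let s3 = Store { dc: "santa-clara", rack: "a" };
    let s4 = Store { dc: "seattle-1", rack: "b" };

    // Three distinct DCs beats two DCs spread over three racks.
    let spread = [&s1, &s2, &s3];
    let packed = [&s1, &s4, &s2];
    assert!(diversity(&spread) > diversity(&packed));
    println!("spread={:?} packed={:?}", diversity(&spread), diversity(&packed));
}
```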

SLIDE 53

Going beyond TiKV

SLIDE 54

TiDB HTAP Solution

[Diagram: the TiDB HTAP stack, coordinated by a three-node PD cluster]

SLIDE 55

Cloud-Native


SLIDE 56

Who’s Using TiKV Now?

SLIDE 57

To sum up, TiKV is ...

  • An open-source, unifying distributed storage layer that supports:

○ Strong consistency
○ ACID compliance
○ Horizontal scalability
○ Cloud-native architecture

  • A building block that simplifies building other systems

○ So far: TiDB (MySQL), TiSpark (SparkSQL), Toutiao.com (metadata service for their in-house S3), Ele.me (Redis protocol layer)
○ The sky is the limit!

SLIDE 58

Thank you!

Email: tl@pingcap.com
Github: siddontang
Twitter: @siddontang; @pingcap
