SLIDE 1

Building a Transactional Key-Value Store That Scales to 100+ Nodes

Siddon Tang at PingCAP

(Twitter: @siddontang; @pingcap)

SLIDE 2

About Me

  • Chief Engineer at PingCAP
  • Leader of TiKV project
  • My other open-source projects:

○ go-mysql
○ go-mysql-elasticsearch
○ LedisDB
○ raft-rs
○ etc.

SLIDE 3

Agenda

  • Why did we build TiKV?
  • How do we build TiKV?
  • Going beyond TiKV

SLIDE 4

Why?

Is it worthwhile to build another Key-Value store?

SLIDE 5

We want to build a distributed relational database to solve the scaling problem of MySQL!!!

SLIDE 6

Inspired by Google F1 + Spanner

[Diagram: Google's stack (Client → F1 → Spanner) maps onto ours (MySQL Client → TiDB → TiKV)]

SLIDE 7

How?

SLIDE 8

A High Building, A Low Foundation

SLIDE 9

What we need to build...

  • 1. A high-performance Key-Value engine to store data
  • 2. A consensus model to keep data consistent across machines
  • 3. A transaction model to provide ACID guarantees across machines
  • 4. A network framework for communication
  • 5. A scheduler to manage the whole cluster

SLIDE 10

Choose a Language!

SLIDE 11

Hello Rust

SLIDE 12

Rust...?

SLIDE 13

Rust - Cons (2 years ago):

  • Makes you think differently
  • Long compile time
  • Lack of libraries and tools
  • Few Rust programmers
  • Uncertain future

[Chart: the Rust learning curve over time]

SLIDE 14

Rust - Pros:

  • Blazing Fast
  • Memory safety
  • Thread safety
  • No GC
  • Fast FFI
  • Vibrant package ecosystem

SLIDE 15

Let’s start from the beginning!

SLIDE 16

Key-Value engine

SLIDE 17

Why RocksDB?

  • High Write/Read Performance
  • Stability
  • Easy to embed in Rust (see the sketch below)
  • Rich functionality
  • Continuous development
  • Active community
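TiKV embeds RocksDB through Rust bindings. As a rough illustration of how little ceremony that takes, here is a minimal sketch using the community `rocksdb` crate (the crate choice and the `/tmp` path are our assumptions, not TiKV's vendored bindings):

```rust
use rocksdb::{Options, DB};

fn main() -> Result<(), rocksdb::Error> {
    // Open (or create) a RocksDB instance in a local directory.
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, "/tmp/rocksdb-demo")?;

    // Plain key-value writes and reads; TiKV layers Raft and
    // transactions on top of primitives like these.
    db.put(b"a", b"1")?;
    match db.get(b"a")? {
        Some(v) => println!("a = {}", String::from_utf8_lossy(&v)),
        None => println!("a not found"),
    }
    Ok(())
}
```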

SLIDE 18

RocksDB alone keeps the data on one machine. We need fault tolerance.

SLIDE 19

Consensus Algorithm

SLIDE 20

Raft - Roles

  • Leader
  • Follower
  • Candidate

SLIDE 21

Raft - Election

[State diagram: Raft election]

  • Follower → Candidate: election timeout, start a new election
  • Candidate → Leader: receive votes from a majority
  • Candidate → Follower: find the leader, or receive a message with a higher term
  • Candidate → Candidate: election timeout again, re-campaign
  • Leader → Follower: receive a message with a higher term
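These transitions are mechanical enough to express directly. A minimal, dependency-free sketch of the role state machine (the names are ours, not raft-rs's):

```rust
#[derive(Debug, Clone, Copy)]
enum Role {
    Follower,
    Candidate,
    Leader,
}

#[derive(Debug, Clone, Copy)]
enum Event {
    ElectionTimeout,
    MajorityVotes,
    FoundLeaderOrHigherTerm,
}

// Apply one election event to the current role, following the
// transition diagram above.
fn step(role: Role, event: Event) -> Role {
    match (role, event) {
        (Role::Follower, Event::ElectionTimeout) => Role::Candidate, // start an election
        (Role::Candidate, Event::ElectionTimeout) => Role::Candidate, // split vote: re-campaign
        (Role::Candidate, Event::MajorityVotes) => Role::Leader,
        // Any node that finds the leader or sees a higher term steps down.
        (_, Event::FoundLeaderOrHigherTerm) => Role::Follower,
        (other, _) => other,
    }
}

fn main() {
    let mut role = Role::Follower;
    for event in [Event::ElectionTimeout, Event::MajorityVotes] {
        role = step(role, event);
        println!("{:?} -> {:?}", event, role);
    }
}
```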

SLIDE 22

Raft - Log Replicated State Machine

[Diagram: a client sends writes (a <- 1, b <- 2) to the Raft module; the entries are replicated into each of the three replicas' logs and applied, in order, to each replica's state machine, so every copy ends with a = 1, b = 2]
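The whole idea in miniature: replicas only need to agree on the log, because applying the same log in the same order always reproduces the same state. An in-memory sketch (our illustration, not TiKV code):

```rust
use std::collections::HashMap;

// One log entry: set `key` to `value`.
struct Entry {
    key: String,
    value: String,
}

// Applying the log in order deterministically rebuilds the state,
// so every replica that holds this log ends up identical.
fn apply(log: &[Entry]) -> HashMap<String, String> {
    let mut state = HashMap::new();
    for e in log {
        state.insert(e.key.clone(), e.value.clone());
    }
    state
}

fn main() {
    let log = vec![
        Entry { key: "a".into(), value: "1".into() },
        Entry { key: "b".into(), value: "2".into() },
    ];
    // Prints {"a": "1", "b": "2"} on every replica that applies this log.
    println!("{:?}", apply(&log));
}
```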

SLIDE 23

Raft - Optimization

  • Leader appends logs and sends msgs in parallel
  • Prevote
  • Pipeline
  • Batch
  • Learner
  • Lease-based Read (see the sketch below)
  • Follower Read
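Lease-based read deserves a note: once a leader has heard from a quorum, no new leader can be elected for one election timeout, so during that lease the leader may answer reads from local state without a log round trip. A minimal sketch of the lease check (clock drift and lease renewal are ignored; real implementations must bound them):

```rust
use std::time::{Duration, Instant};

struct LeaderLease {
    // When the heartbeats that a quorum acknowledged were sent.
    quorum_ack_at: Instant,
    election_timeout: Duration,
}

impl LeaderLease {
    // Until quorum_ack_at + election_timeout, no other node can have
    // won an election, so serving reads locally is safe.
    fn can_read_locally(&self, now: Instant) -> bool {
        now < self.quorum_ack_at + self.election_timeout
    }
}

fn main() {
    let lease = LeaderLease {
        quorum_ack_at: Instant::now(),
        election_timeout: Duration::from_millis(1000),
    };
    println!("serve read locally: {}", lease.can_read_locally(Instant::now()));
}
```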

SLIDE 24

A single Raft group can't manage a huge dataset. So we need Multi-Raft!!!

SLIDE 25

Multi-Raft: Data sharding

  • Range Sharding (used by TiKV): the dataset is split into contiguous key ranges, e.g. (-∞, a), [a, b), [b, +∞)
  • Hash Sharding: each key is hashed to choose its chunk (Chunk 1, Chunk 2, Chunk 3)
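With range sharding, routing a key to its Region is an ordered-map lookup: find the last range whose start key is ≤ the key. A minimal sketch (our own illustration, not TiKV's router):

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

// Map each Region's start key to a label; the empty start key stands
// for -infinity, so every key falls into some range.
fn region_for<'a>(regions: &'a BTreeMap<Vec<u8>, String>, key: &[u8]) -> &'a str {
    regions
        .range::<[u8], _>((Bound::Unbounded, Bound::Included(key)))
        .next_back()
        .map(|(_, name)| name.as_str())
        .expect("the empty start key covers all keys")
}

fn main() {
    let mut regions = BTreeMap::new();
    regions.insert(b"".to_vec(), "Region 1: (-inf, a)".to_string());
    regions.insert(b"a".to_vec(), "Region 2: [a, b)".to_string());
    regions.insert(b"b".to_vec(), "Region 3: [b, +inf)".to_string());

    // "apple" sorts at or after "a" and before "b": Region 2.
    println!("{}", region_for(&regions, b"apple"));
}
```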

SLIDE 26

Multi-Raft in TiKV

[Diagram: range sharding produces Regions covering A–B, B–C, and C–D; each Region has a replica on each of Node 1, Node 2, and Node 3, and a Region's three replicas form one Raft group]

SLIDE 27

Multi-Raft: Split and Merge

  • Split: a Region that grows too large splits in two (Region A → Region A + Region B)
  • Merge: two small adjacent Regions merge back into one (Region A + Region B → Region A)

slide-28
SLIDE 28

Multi-Raft: Scalability

How to move Region A from Node 1 to Node 2?

[Diagram: starting layout of Region A's replicas across Node 1 and Node 2]

slide-29
SLIDE 29

Multi-Raft: Scalability

How to move Region A from Node 1 to Node 2?

Step 1 - Add Replica: create a new replica of Region A on Node 2.

[Diagram: a new Region A replica appears on Node 2]

slide-30
SLIDE 30

Multi-Raft: Scalability

How to move Region A from Node 1 to Node 2?

Step 2 - Transfer Leader: transfer Region A's leadership to the new replica on Node 2.

[Diagram: the replica on Node 2 becomes the leader]

slide-31
SLIDE 31

Multi-Raft: Scalability

How to move Region A from Node 1 to Node 2?

Step 3 - Remove Replica: remove the old replica of Region A from Node 1.

[Diagram: only the new replica on Node 2 remains]
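The move is thus a three-step schedule, and at every step a quorum of Region A's replicas stays healthy. A sketch of the sequence as data (the types are ours, not PD's):

```rust
// One step of a Region-move schedule, as the scheduler would issue it.
#[derive(Debug)]
enum Step {
    AddReplica { region: &'static str, to_node: u32 },
    TransferLeader { region: &'static str, to_node: u32 },
    RemoveReplica { region: &'static str, from_node: u32 },
}

// Move Region A from Node 1 to Node 2 without losing availability.
fn move_region_steps() -> Vec<Step> {
    vec![
        Step::AddReplica { region: "A", to_node: 2 },
        Step::TransferLeader { region: "A", to_node: 2 },
        Step::RemoveReplica { region: "A", from_node: 1 },
    ]
}

fn main() {
    for step in move_region_steps() {
        println!("{:?}", step);
    }
}
```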


slide-32
SLIDE 32

How do we ensure data consistency across Regions?

SLIDE 33

Distributed Transaction

[Diagram: a transaction spanning two Raft groups: Begin; Set a = 1 (Region 1); Set b = 2 (Region 2); Commit - each Region is replicated three ways]

SLIDE 34

Transaction in TiKV

  • Optimized two-phase commit, inspired by Google Percolator (see the sketch after this list)
  • Multi-version concurrency control
  • Optimistic commit
  • Snapshot isolation
  • A Timestamp Oracle allocates unique timestamps for transactions
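A much-simplified, single-process sketch of the Percolator flow: take a start timestamp from the Timestamp Oracle, lock every key with one key chosen as primary (prewrite), then commit at a commit timestamp; committing the primary key is the atomic commit point. Conflict checks against the write column, rollback, and lock cleanup are all omitted:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

// Toy Timestamp Oracle: a monotonically increasing counter.
static TSO: AtomicU64 = AtomicU64::new(1);
fn get_ts() -> u64 {
    TSO.fetch_add(1, Ordering::SeqCst)
}

#[derive(Default)]
struct Store {
    data: HashMap<(String, u64), String>, // (key, start_ts) -> value
    lock: HashMap<String, (String, u64)>, // key -> (primary key, start_ts)
    write: HashMap<(String, u64), u64>,   // (key, commit_ts) -> start_ts
}

impl Store {
    // Phase 1: lock every key; the first mutation's key is the primary.
    fn prewrite(&mut self, start_ts: u64, mutations: &[(String, String)]) -> bool {
        let primary = mutations[0].0.clone();
        if mutations.iter().any(|(k, _)| self.lock.contains_key(k)) {
            return false; // another transaction holds a lock: conflict
        }
        for (k, v) in mutations {
            self.lock.insert(k.clone(), (primary.clone(), start_ts));
            self.data.insert((k.clone(), start_ts), v.clone());
        }
        true
    }

    // Phase 2: recording the commit and clearing the lock on the primary
    // key commits the whole transaction; secondaries can follow lazily.
    fn commit(&mut self, start_ts: u64, commit_ts: u64, keys: &[String]) {
        for k in keys {
            self.lock.remove(k);
            self.write.insert((k.clone(), commit_ts), start_ts);
        }
    }
}

fn main() {
    let mut store = Store::default();
    let muts = vec![("a".into(), "1".into()), ("b".into(), "2".into())];

    let start_ts = get_ts();
    assert!(store.prewrite(start_ts, &muts));

    let commit_ts = get_ts();
    let keys: Vec<String> = muts.iter().map(|(k, _)| k.clone()).collect();
    store.commit(start_ts, commit_ts, &keys);
    println!("committed a = 1, b = 2 at ts {}", commit_ts);
}
```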

SLIDE 35

Percolator Optimization

  • Use a latch in TiDB to support pessimistic commit
  • Concurrent Prewrite

○ We are formally proving it with TLA+

SLIDE 36

How do the nodes communicate with each other? An RPC framework!

SLIDE 37

Hello gRPC

SLIDE 38

Why gRPC?

  • Widely used
  • Supported by many languages
  • Works with Protocol Buffers and FlatBuffers
  • Rich interface
  • Benefits from HTTP/2

SLIDE 39

TiKV Stack

[Diagram: each TiKV instance stacks, bottom to top: RocksDB, Raft, Transaction, Txn KV API, gRPC; clients reach instances over gRPC, and matching Regions across instances form Raft groups]

SLIDE 40

How to manage 100+ nodes?

SLIDE 41

Scheduler in TiKV

[Diagram: a cluster of TiKV nodes overseen by a three-node PD (Placement Driver) cluster: "We are Gods!!!"]

SLIDE 42

Scheduler - How

[Diagram: TiKV nodes send Store heartbeats and Region heartbeats to the PD cluster; PD replies with schedule operators such as Add Replica, Remove Replica, Transfer Leader, ...]
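On each Region heartbeat, PD compares the reported state with the placement it wants and may piggyback an operator on the response. A minimal sketch of that exchange (the types are our simplification, not the kvproto definitions):

```rust
// What a TiKV node reports for one Region (simplified).
struct RegionHeartbeat {
    region_id: u64,
    leader_store: u64,
    replica_stores: Vec<u64>, // store ids currently holding a replica
}

// What PD may send back on the heartbeat response.
#[derive(Debug)]
enum ScheduleOperator {
    AddReplica { store: u64 },
    RemoveReplica { store: u64 },
    TransferLeader { store: u64 },
}

// A toy placement rule: keep exactly three replicas.
fn handle_heartbeat(hb: &RegionHeartbeat, all_stores: &[u64]) -> Option<ScheduleOperator> {
    if hb.replica_stores.len() < 3 {
        // Under-replicated: add a replica on any store that lacks one.
        let target = all_stores.iter().find(|&&s| !hb.replica_stores.contains(&s))?;
        return Some(ScheduleOperator::AddReplica { store: *target });
    }
    if hb.replica_stores.len() > 3 {
        // Over-replicated: drop a non-leader replica.
        let victim = hb.replica_stores.iter().find(|&&s| s != hb.leader_store)?;
        return Some(ScheduleOperator::RemoveReplica { store: *victim });
    }
    None // placement is fine; no operator this round
}

fn main() {
    let hb = RegionHeartbeat {
        region_id: 1,
        leader_store: 1,
        replica_stores: vec![1, 2],
    };
    println!("region {}: {:?}", hb.region_id, handle_heartbeat(&hb, &[1, 2, 3, 4]));
}
```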

SLIDE 43

Scheduler - Goal

  • Keep load and data size balanced across nodes
  • Avoid hotspot performance issues

SLIDE 44

Scheduler - Region Count Balance

Assume the Regions are about the same size

[Diagram: Regions R1–R6 are redistributed so every store holds about the same number of Regions]

SLIDE 45

Scheduler - Region Count Balance

Regions’ sizes are not the same

[Diagram: Region counts are balanced, but sizes are not: R1 0 MB, R2 0 MB, R3 0 MB, R4 64 MB, R5 64 MB, R6 96 MB]

SLIDE 46

Scheduler - Region Size Balance

Use Region size in the balance calculation

[Diagram: balancing by size moves Regions so the stores end up holding roughly equal totals, e.g. (R1 0 MB, R5 64 MB), (R3 0 MB, R4 64 MB), (R2 0 MB, R6 96 MB)]
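The core idea fits in a toy greedy pass: score each store by its total Region size and move a Region from the fullest store to the emptiest while that narrows the gap (our simplification; PD's real scoring also weighs leaders, hot spots, and operator cost):

```rust
use std::collections::BTreeMap;

// store id -> that store's (region name, size in MB) pairs
type Stores = BTreeMap<u32, Vec<(&'static str, u64)>>;

fn total(regions: &[(&'static str, u64)]) -> u64 {
    regions.iter().map(|&(_, s)| s).sum()
}

// One greedy step: move the biggest useful Region from the fullest
// store to the emptiest store, if that strictly narrows their gap.
fn balance_step(stores: &mut Stores) -> bool {
    let fullest = *stores.iter().max_by_key(|(_, r)| total(r)).unwrap().0;
    let emptiest = *stores.iter().min_by_key(|(_, r)| total(r)).unwrap().0;
    let gap = total(&stores[&fullest]) - total(&stores[&emptiest]);

    // Moving a Region of size s changes the pair's gap to |gap - 2s|,
    // so any 0 < s < gap is an improvement; take the largest such s.
    let candidate = stores[&fullest]
        .iter()
        .enumerate()
        .filter(|&(_, &(_, size))| size > 0 && size < gap)
        .max_by_key(|&(_, &(_, size))| size)
        .map(|(i, _)| i);

    match candidate {
        Some(i) => {
            let region = stores.get_mut(&fullest).unwrap().remove(i);
            stores.get_mut(&emptiest).unwrap().push(region);
            true
        }
        None => false,
    }
}

fn main() {
    let mut stores: Stores = BTreeMap::new();
    stores.insert(1, vec![("R1", 0), ("R2", 0)]);
    stores.insert(2, vec![("R3", 0), ("R4", 64)]);
    stores.insert(3, vec![("R5", 64), ("R6", 96)]);

    while balance_step(&mut stores) {}
    for (id, regions) in &stores {
        println!("store {}: {:?} = {} MB", id, regions, total(regions));
    }
}
```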

SLIDE 47

Scheduler - Region Size Balance

Some Regions are very hot for reads/writes

[Diagram: Regions R1–R6 labeled Hot, Normal, or Cold]

SLIDE 48

Scheduler - Hot Balance

[Diagram: hot Regions are spread out so that no single store serves them all]

TiKV reports each Region's read/write traffic to PD

SLIDE 49

Scheduler - More

  • More balance policies…

○ Weight Balance - a TiKV node with a higher weight stores more data
○ Evict Leader Balance - designated TiKV nodes must not hold any Raft leader

  • OpInfluence - avoids overly frequent rebalancing

SLIDE 50

Geo-Replication

SLIDE 51

Scheduler - Cross DC

[Diagram: Regions R1 and R2 each keep replicas in different racks and different DCs, so losing a rack or a whole DC still leaves replicas alive elsewhere]

SLIDE 52

Scheduler - Three DCs in Two Cities

[Diagram: three DCs in two cities - Seattle 1, Seattle 2, and Santa Clara; replicas of Regions R1 and R2 are placed across racks in all three DCs]
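PD drives this with location labels (e.g. dc, rack) attached to each store, preferring placements that spread a Region's replicas over as many distinct DCs, then racks, as possible. A toy version of the diversity comparison (the labels and scoring are our simplification):

```rust
use std::collections::HashSet;

// Each store advertises its location labels.
struct Store {
    dc: &'static str,
    rack: &'static str,
}

// Score a placement by (distinct DCs, distinct racks); a higher
// score means better isolation from rack and DC failures.
fn diversity(replicas: &[&Store]) -> (usize, usize) {
    let dcs: HashSet<_> = replicas.iter().map(|s| s.dc).collect();
    let racks: HashSet<_> = replicas.iter().map(|s| (s.dc, s.rack)).collect();
    (dcs.len(), racks.len())
}

fn main() {
    let s1 = Store { dc: "seattle-1", rack: "a" };
    let s2 = Store { dc: "seattle-2", rack: "a" };
    let s3 = Store { dc: "santa-clara", rack: "a" };
    let s4 = Store { dc: "seattle-1", rack: "b" };

    // Three distinct DCs beats two DCs spread over three racks.
    let spread = [&s1, &s2, &s3];
    let packed = [&s1, &s4, &s2];
    assert!(diversity(&spread) > diversity(&packed));
    println!("spread={:?} packed={:?}", diversity(&spread), diversity(&packed));
}
```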

SLIDE 53

Going beyond TiKV

SLIDE 54

TiDB HTAP Solution

[Diagram: the TiDB HTAP stack, coordinated by a three-node PD cluster]

SLIDE 55

Cloud-Native


SLIDE 56

Who’s Using TiKV Now?

SLIDE 57

To sum up, TiKV is ...

  • An open-source, unifying distributed storage layer that supports:

○ Strong consistency
○ ACID compliance
○ Horizontal scalability
○ Cloud-native architecture

  • A building block that simplifies building other systems

○ So far: TiDB (MySQL), TiSpark (SparkSQL), Toutiao.com (metadata service for their in-house S3), Ele.me (Redis protocol layer)
○ The sky is the limit!

SLIDE 58

Thank you!

Email: tl@pingcap.com
Github: siddontang
Twitter: @siddontang; @pingcap
