SLIDE 1

Introducing LSM-tree into PostgreSQL, making it as a data gobbler

By Shichao Jin

From https://zhuanlan.zhihu.com/p/56259025
SLIDE 2

About Me

  • 8 years experience in DBMS R&D
  • AWS Redshift and Facebook RocksDB
  • Ex-PhD student at UWaterloo
  • Fan of PostgreSQL
  • Founder of VidarDB


SLIDE 3

Outline


  • Experiment & Analysis
  • LSM-tree Introduction
  • Our Implementation
SLIDE 4

Background

[Diagram: DBMS → storage engine → data structure]

  • The storage engine accounts for a big chunk of DBMS performance
  • What is a "storage engine"? The incarnation of a data structure

SLIDE 5

Background


  • B+tree/B-tree: InnoDB, BerkeleyDB, WiredTiger
  • LSM-tree: LevelDB, RocksDB
SLIDE 6

Experiment Setting

Hardware:
  • CPU: Xeon E3-1265L, 2.5 GHz, 4 cores
  • Disk: 7200 RPM
  • RAM: 16 GB DDR3

Software:
  • TPC-H 10 GB, shuffled!
  • Ubuntu 18.04, PostgreSQL 11.6, RocksDB 6.2, default settings
  • libpqxx
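The load driver itself is not shown in the slides; below is a minimal, hypothetical libpqxx sketch of how shuffled rows could be inserted (connection string, table, and column names are invented for illustration, not the actual benchmark code).

```cpp
// Hypothetical bulk-load sketch via libpqxx: rows arrive in shuffled
// (non-primary-key) order and are inserted with a parameterized statement.
#include <pqxx/pqxx>
#include <iostream>

int main() {
    try {
        pqxx::connection conn("dbname=tpch");   // assumed connection string
        pqxx::work txn(conn);
        // One shuffled row; the real loader would loop over the TPC-H data.
        txn.exec_params(
            "INSERT INTO lineitem_demo (l_orderkey, l_comment) VALUES ($1, $2)",
            42, "shuffled row");
        txn.commit();
    } catch (const std::exception &e) {
        std::cerr << e.what() << std::endl;
        return 1;
    }
    return 0;
}
```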

SLIDE 7

Experiment Results


SLIDE 8

Why huge difference?


Data structure! PostgreSQL uses a B+tree. So what's wrong with the B+tree?

SLIDE 9

B+tree Reading Pattern


  • Rare case: data arrive in order
  • Common case: data arrive in random order
  • Red rectangle = the disk page is cached

SLIDE 10

Problem with B+tree

I only talk about the insertion case here, but the root cause is the reading pattern of the B+tree.

  • Every insertion into a B+tree triggers random reads to locate the exact position in the tree. Note: reads use the cache!

SLIDE 11

Problem with B+tree (Cont.)


The data is large and usually does not arrive in primary-key order, so the previously cached disk blocks are useless! B+tree insertion causes too many disk accesses!

SLIDE 12

Disk vs. SSD vs. RAM in my laptop

[Table: sequential vs. random access speed of magnetic disk, SSD, and RAM; the measured values are not recoverable from the transcript]

SLIDE 13

Solution

Do not touch the disk (except the WAL) on insertion! Tiered design. The LSM-tree (Log-Structured Merge-tree) comes to the rescue!

[Diagram: tiered design — RAM on top of SSD/Disk]

SLIDE 14

LSM-tree

[Diagram: LSM-tree layout — in RAM, a mutable memtable receiving new tuples, an immutable memtable, a tail index, and a cache; on disk, levels L0/L1/L2 of SSTs (Sorted String Tables), each SST covering a key range such as A~G, H~O, P~Z]
SLIDE 15

LSM-tree

[Diagram: the same LSM-tree layout, zooming in on the mutable memtable]

The mutable memtable is a skiplist of sorted key-value pairs, with O(log N) insert and lookup.
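Not from the slides: a toy sketch of the memtable idea, using std::map (any ordered structure works) as a stand-in for RocksDB's skiplist; the class and file names are invented for illustration.

```cpp
// Toy memtable: an ordered in-memory map standing in for the skiplist.
// Inserts and lookups are O(log N); iterating yields keys in sorted order,
// which is exactly what a flush to an SST file needs.
#include <map>
#include <string>
#include <fstream>

class ToyMemtable {
public:
    void put(const std::string &key, const std::string &value) { data_[key] = value; }

    // Write all entries in sorted key order, mimicking a flush to an L0 SST.
    void flush(const std::string &path) const {
        std::ofstream out(path);
        for (const auto &kv : data_)
            out << kv.first << '\t' << kv.second << '\n';
    }

private:
    std::map<std::string, std::string> data_;  // kept sorted by key
};

int main() {
    ToyMemtable mt;
    mt.put("banana", "2019");
    mt.put("apple", "2020");
    mt.flush("L0_sst_000001.txt");  // entries come out as apple, then banana
}
```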

SLIDE 16

Basic Operations


Put(key, value), Delete(key)

[Diagram: a Put is appended to the WAL and inserted into the mutable memtable; key K now has two versions, K | 2020 and K | 2019]

  • Overwrite: a newer Put of the same key simply shadows the older version (K | 2020 supersedes K | 2019)
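For concreteness, a minimal RocksDB sketch of Put, overwrite, and Delete (the database path and keys are made up):

```cpp
// Put / overwrite / Delete against RocksDB. Writes go to the WAL and the
// mutable memtable only; no B+tree-style random page reads are needed.
#include <rocksdb/db.h>
#include <cassert>

int main() {
    rocksdb::DB *db;
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/lsm_demo", &db);
    assert(s.ok());

    db->Put(rocksdb::WriteOptions(), "K", "2019");
    db->Put(rocksdb::WriteOptions(), "K", "2020");  // overwrite: newer version shadows 2019
    db->Delete(rocksdb::WriteOptions(), "K");       // delete is also just a new write

    delete db;
}
```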
SLIDE 17

Basic Operations

Get(key): returns the value

[Diagram: a Get checks the mutable memtable, then the immutable memtable, then the L0/L1/L2 SSTs from newest to oldest]
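The corresponding point lookup, continuing the sketch above:

```cpp
// Point lookup: RocksDB consults the memtables first, then the SST levels.
#include <rocksdb/db.h>
#include <iostream>

int main() {
    rocksdb::DB *db;
    rocksdb::Options options;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/lsm_demo", &db);
    if (!s.ok()) { std::cerr << s.ToString() << std::endl; return 1; }

    std::string value;
    s = db->Get(rocksdb::ReadOptions(), "K", &value);
    if (s.IsNotFound())
        std::cout << "K was deleted or never written" << std::endl;
    else if (s.ok())
        std::cout << "K = " << value << std::endl;

    delete db;
}
```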
SLIDE 18

Basic Operations


Iterator(): seek & next

[Diagram: the iterator merges the memtables and all SST levels into one sorted stream, so keys scattered across components (A, B, C, D, E, F, G) come back in key order]
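A minimal iterator sketch (Seek to a start key, then Next), again on the same made-up database:

```cpp
// Range scan: the iterator merge-sorts memtables and SSTs on the fly,
// so keys come back in order regardless of where they physically live.
#include <rocksdb/db.h>
#include <iostream>

int main() {
    rocksdb::DB *db;
    rocksdb::Options options;
    if (!rocksdb::DB::Open(options, "/tmp/lsm_demo", &db).ok()) return 1;

    rocksdb::Iterator *it = db->NewIterator(rocksdb::ReadOptions());
    for (it->Seek("C"); it->Valid(); it->Next())   // start at "C", walk forward in key order
        std::cout << it->key().ToString() << " = " << it->value().ToString() << std::endl;

    delete it;
    delete db;
}
```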
SLIDE 19

Advanced Operations

Flush(): RAM → L0, usually done by a background thread

[Diagram: the immutable memtable in RAM is written out as a new SST file in L0]
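Flushes are normally triggered in the background, but the API can also be driven manually; a minimal sketch:

```cpp
// Force the memtable to be written to an L0 SST instead of waiting for
// the background flush thread.
#include <rocksdb/db.h>

int main() {
    rocksdb::DB *db;
    rocksdb::Options options;
    options.create_if_missing = true;
    if (!rocksdb::DB::Open(options, "/tmp/lsm_demo", &db).ok()) return 1;

    db->Put(rocksdb::WriteOptions(), "A", "1");
    db->Flush(rocksdb::FlushOptions());   // memtable -> new L0 SST on disk

    delete db;
}
```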
SLIDE 20

Advanced Operations

Compact(k1, k2): L0 → L1, L1 → L2, usually run by a background thread

[Diagram: compaction merges overlapping SSTs from one level into the next]

Trade-offs: RAM size vs. write cost vs. read cost
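Manual compaction of a key range in RocksDB, roughly matching Compact(k1, k2) above (the range A–Z is arbitrary):

```cpp
// Ask RocksDB to compact a key range down through the levels. Normally the
// background compaction threads decide this on their own.
#include <rocksdb/db.h>

int main() {
    rocksdb::DB *db;
    rocksdb::Options options;
    options.create_if_missing = true;
    if (!rocksdb::DB::Open(options, "/tmp/lsm_demo", &db).ok()) return 1;

    rocksdb::Slice begin("A"), end("Z");
    db->CompactRange(rocksdb::CompactRangeOptions(), &begin, &end);  // ~ Compact(k1, k2)

    delete db;
}
```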

SLIDE 21

Advanced Operations

Snapshot(seq): based on a sequence number, which is always increasing; a snapshot reads only versions up to its sequence number

[Diagram: the same LSM-tree; each tuple carries the sequence number as a component, which enables multi-versioning]
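A minimal snapshot sketch showing that a snapshot pins a point in the sequence-number history:

```cpp
// A snapshot fixes a sequence number; reads through it ignore later writes.
#include <rocksdb/db.h>
#include <iostream>

int main() {
    rocksdb::DB *db;
    rocksdb::Options options;
    options.create_if_missing = true;
    if (!rocksdb::DB::Open(options, "/tmp/lsm_demo", &db).ok()) return 1;

    db->Put(rocksdb::WriteOptions(), "K", "2019");
    const rocksdb::Snapshot *snap = db->GetSnapshot();
    db->Put(rocksdb::WriteOptions(), "K", "2020");   // newer version, higher sequence number

    rocksdb::ReadOptions ro;
    ro.snapshot = snap;
    std::string value;
    db->Get(ro, "K", &value);                        // sees "2019", not "2020"
    std::cout << value << std::endl;

    db->ReleaseSnapshot(snap);
    delete db;
}
```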

SLIDE 22

Implementation


  • Foreign data wrapper (FDW)
  • RocksDB is based on a multi-threaded model (PostgreSQL runs one process per connection)
  • Shared memory

[Diagram: PostgreSQL backend processes (Select, Update, ...) communicate through shared memory with the multi-threaded RocksDB engine]
SLIDE 23

More To Do

  • Make the engine recognize PostgreSQL data types
  • Transactions?
  • Migrate to the pluggable storage engine interface (table AM)
SLIDE 24

Motivation for designing a new data structure

  1. Random R/W in RAM is 2000X faster than on magnetic disk, but 120X more expensive
  2. The RAM/disk size ratio is 40~200X larger than in the 1970s, when the B+tree was designed
  3. New storage choices: NVRAM
  4. Choosing among different engines and data systems requires expertise

SLIDE 25


Thanks!

Shichao@vidardb.com Twitter: @jsc0218

github.com/vidardb/vidardb github.com/vidardb/PostgresForeignDataWrapper