

SLIDE 1

Ceph & RocksDB

변일수 (Ilsoo Byun), Cloud Storage Team

SLIDE 2

Ceph Basics

SLIDE 3

Placement Group

Diagram: the object "myobject" is written into the pool "mypool", which contains PG#1, PG#2, and PG#3.

hash(myobject) = 4;  4 % 3 (# of PGs) = 1  ← Target PG
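
A minimal sketch of the mapping on this slide, assuming a generic string hash: the object name is hashed and the hash is reduced modulo the pool's PG count. Ceph's real implementation uses its own hash function and PG mapping, so this only illustrates the modulo idea, not the actual code.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

// target_pg is a hypothetical helper, not Ceph's real code; it only shows the
// "hash the object name, take it modulo the PG count" step from the slide.
uint32_t target_pg(const std::string& object_name, uint32_t pg_num) {
    uint32_t h = static_cast<uint32_t>(std::hash<std::string>{}(object_name));
    return h % pg_num;  // e.g. hash = 4, 4 % 3 == 1  ->  target PG 1
}

int main() {
    std::cout << "myobject -> PG " << target_pg("myobject", 3) << "\n";
}
```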

SLIDE 4

CRUSH

Diagram: CRUSH maps each PG in mypool (PG#1, PG#2, PG#3) to a set of OSDs (e.g. OSD#1, OSD#3, OSD#12).
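
The key property of CRUSH is that the PG-to-OSD mapping is computed, not looked up in a central table. The toy sketch below only shows that idea: a made-up `place_pg` function that deterministically derives a set of distinct OSD ids from the PG id. Real CRUSH walks a weighted bucket hierarchy (straw2) and respects failure domains, which this sketch does not.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <set>
#include <vector>

// place_pg is a made-up stand-in for CRUSH: it hashes (pg, attempt) until it
// has picked `replicas` distinct OSD ids. Only the "pure function of the PG,
// no lookup table" property is illustrated here.
std::vector<int> place_pg(uint32_t pg, int num_osds, int replicas) {
    std::vector<int> acting;
    std::set<int> used;
    for (uint64_t attempt = 0; acting.size() < static_cast<size_t>(replicas); ++attempt) {
        size_t h = std::hash<uint64_t>{}((static_cast<uint64_t>(pg) << 32) | attempt);
        int osd = static_cast<int>(h % num_osds);
        if (used.insert(osd).second) acting.push_back(osd);  // skip duplicates
    }
    return acting;
}

int main() {
    for (int osd : place_pg(/*pg=*/1, /*num_osds=*/16, /*replicas=*/3))
        std::cout << "OSD#" << osd << " ";
    std::cout << "\n";
}
```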

SLIDE 5

Recovery

Diagram: PG#1, PG#2, PG#3 in mypool mapped to OSD#1, OSD#3, OSD#12, illustrating how PGs are recovered onto other OSDs after an OSD failure.

SLIDE 6

OSD

Diagram (inside an OSD): the OSD handles Replication, ???, …, Peering, Heartbeat, …; underneath it, an ObjectStore backend (FileStore or BlueStore) writes to the physical drive.

https://www.scan.co.uk/products/4tb-toshiba-mg04aca400e-enterprise-hard-drive-35-hdd-sata-iii-6gb-s-7200rpm-128mb-cache-oem

SLIDE 7

ObjectStore

https://ceph.com/community/new-luminous-bluestore/

SLIDE 8

OSD Transaction

SLIDE 9

Consistency is enforced here!

CRUSH

SLIDE 10

Diagram (BlueStore write path): Request → p_wq → kv_committing (sync transaction to RocksDB) → write / flush to the SSD → shard finisher queue → pipe / out_q → Ack.

  • To maintain ordering within each PG, ordering within each shard must be guaranteed (see the sketch below).

BlueStore Transaction
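
Not the actual OSD code, only a sketch of the ordering rule in the bullet above: requests are routed to a shard as a pure function of their PG id, and each shard is drained in FIFO order, so per-PG ordering survives even with many worker threads.

```cpp
#include <cstdint>
#include <deque>
#include <iostream>
#include <vector>

struct Op { uint32_t pg; int seq; };

int main() {
    const int num_shards = 4;
    std::vector<std::deque<Op>> shards(num_shards);

    // Enqueue: the shard is a pure function of the PG id, so all ops of one PG
    // land in the same queue.
    for (int seq = 0; seq < 8; ++seq) {
        Op op{static_cast<uint32_t>(seq % 3), seq};
        shards[op.pg % num_shards].push_back(op);
    }

    // Drain: each shard (a worker thread in the real OSD) pops FIFO, so ops on
    // the same PG complete in submission order.
    for (int s = 0; s < num_shards; ++s)
        for (const Op& op : shards[s])
            std::cout << "shard " << s << " pg " << op.pg << " seq " << op.seq << "\n";
}
```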

SLIDE 11
  • Metadata is stored in RocksDB.
  • Only after the metadata has been stored atomically does the data become visible to users (see the WriteBatch sketch below).

Diagram (RocksDB write path and group commit): a Request is appended to the transaction log (Logfile / WAL) and written into the Memtable; Memtables are later flushed to SST files.
Single writer (#1 Thread): JoinBatchGroup (leader) → PreprocessWrite → WriteToWAL → MarkLogsSynced → ExitAsBatchGroupLeader → write to memtable.
Group commit (#2 and #3 Threads): both call JoinBatchGroup and AwaitState; the leader runs PreprocessWrite → WriteToWAL → LaunchParallelFollower → MarkLogsSynced, the follower runs ExitAsBatchGroupFollower / CompleteParallelWorker, and the memtable writes happen concurrently.

RocksDB Group Commit
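
A hedged RocksDB sketch of the atomicity bullets above: several metadata keys go into one WriteBatch and are committed with sync=true, so they become visible together or not at all, and RocksDB's group commit amortizes the WAL sync across concurrent writers. The key names and the database path are placeholders, not BlueStore's real key encoding.

```cpp
#include <cassert>
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/write_batch.h>

int main() {
    rocksdb::DB* db = nullptr;
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/bluestore_md_demo", &db);
    assert(s.ok());

    // All-or-nothing metadata update for one object write.
    rocksdb::WriteBatch batch;
    batch.Put("onode:myobject", "size=4096,...");   // illustrative keys only
    batch.Put("alloc:0x1000", "used");

    rocksdb::WriteOptions wo;
    wo.sync = true;               // fsync the WAL before acknowledging
    s = db->Write(wo, &batch);    // concurrent synced writers are group-committed
    assert(s.ok());

    delete db;
}
```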

SLIDE 12

Thread Scalability

Chart (Shard Scalability): IOPS from 10,000 to 60,000 for 1 shard vs 10 shards.

Diagram: PUTs into RocksDB with the WAL enabled vs with disableWAL.
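
The WAL vs disableWAL comparison in the figure corresponds to a single RocksDB write option; a minimal sketch follows (path and keys are placeholders). Disabling the WAL trades crash durability for throughput, which is why Ceph keeps it on for consistency-critical metadata.

```cpp
#include <cassert>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
    rocksdb::DB* db = nullptr;
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/wal_demo", &db);
    assert(s.ok());

    rocksdb::WriteOptions with_wal;    // default: every write also goes to the WAL
    s = db->Put(with_wal, "k1", "v1");

    rocksdb::WriteOptions no_wal;
    no_wal.disableWAL = true;          // skip the WAL: faster, but lost on a crash
    s = db->Put(no_wal, "k2", "v2");

    delete db;
}
```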

SLIDE 13

RadosGW

SLIDE 14
  • RadosGW is an application of RADOS (see the librados sketch below).

RadosGW

Diagram (Ceph architecture): OSDs, Mons, and Mgrs make up RADOS; RadosGW and CephFS are built on top of RADOS.
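
To make "an application of RADOS" concrete, here is a hedged librados sketch (C API, callable from C++) that stores one object directly in a pool; the pool and object names are placeholders and error handling is omitted. RadosGW does the same kind of thing, translating S3/Swift requests into RADOS objects and index updates.

```cpp
#include <cstring>
#include <rados/librados.h>

int main() {
    rados_t cluster;
    rados_create(&cluster, "admin");                       // connect as client.admin
    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
    rados_connect(cluster);

    rados_ioctx_t io;
    rados_ioctx_create(cluster, "mypool", &io);            // open the pool

    const char* data = "hello rados";
    rados_write_full(io, "myobject", data, strlen(data));  // store a whole object

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}
```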

SLIDE 15
  • All atomic operations depend on RocksDB (a conceptual sketch of the put sequence follows below).

RadosGW Transaction

Diagram: a Put Object request to RADOS is carried out as Prepare Index → Write Data → Complete Index, touching a bucket index object and a data object; the index entries are key-value pairs ultimately stored in RocksDB.
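
A conceptual sketch of the three-step put in the diagram, with made-up helper names (index_prepare, write_data_object, index_complete); this is not RadosGW's actual code. The point is that each index step is a single atomic operation on the bucket index object, which the OSD ultimately persists through a RocksDB transaction.

```cpp
#include <iostream>
#include <string>

// Hypothetical stand-ins for the RADOS operations issued by the gateway.
void index_prepare(const std::string& bucket, const std::string& key) {
    std::cout << "index[" << bucket << "]: " << key << " -> pending\n";
}
void write_data_object(const std::string& key, const std::string& data) {
    std::cout << "data object " << key << " <- " << data.size() << " bytes\n";
}
void index_complete(const std::string& bucket, const std::string& key) {
    std::cout << "index[" << bucket << "]: " << key << " -> committed\n";
}

void put_object(const std::string& bucket, const std::string& key,
                const std::string& data) {
    index_prepare(bucket, key);      // 1. Prepare Index
    write_data_object(key, data);    // 2. Write Data
    index_complete(bucket, key);     // 3. Complete Index
}

int main() { put_object("mybucket", "myobject", "hello"); }
```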

SLIDE 16

Diagram (BlueStore write path): Request → p_wq → kv_committing (sync transaction to RocksDB) → write / flush to the SSD → shard finisher queue → pipe / out_q → Ack.

bstore_shard_finishers = true

  • To maintain ordering within each PG, ordering within each shard must be guaranteed.

BlueStore Transaction

SLIDE 17

Performance Issue

SLIDE 18

Tail Latency

SLIDE 19

Performance Metrics

SLIDE 20
  • "SILK: Preventing Latency Spikes in Log-Structured Merge Key-Value

Stores" (ATC'19)

RocksDB Compaction Overhead
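
Not from the talk: a hedged example of two stock RocksDB knobs that bound the background compaction load discussed here (the numbers are placeholders). SILK (ATC'19) goes further and schedules compaction work around the client load rather than applying a fixed cap.

```cpp
#include <cassert>
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/rate_limiter.h>

int main() {
    rocksdb::Options options;
    options.create_if_missing = true;
    options.max_background_jobs = 4;                  // cap concurrent flushes/compactions
    options.rate_limiter.reset(
        rocksdb::NewGenericRateLimiter(32 << 20));    // cap background I/O at ~32 MB/s

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/compaction_demo", &db);
    assert(s.ok());
    delete db;
}
```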

SLIDE 21
  • Ceph depends heavily on RocksDB.
  • Ceph's strong consistency is implemented using RocksDB transactions.
  • The performance of Ceph also depends on RocksDB, especially for small I/O.
  • But RocksDB has performance issues of its own: flushing the WAL and compaction.
  • Contact: ilsoobyun@linecorp.com

Conclusions

SLIDE 22

THANK YOU