Improvements to MongoRocks in 2017
Mark Callaghan, Member of Technical Staff, Facebook
MongoDB Engines
- mmapv1 - update-in-place B-Tree
- WiredTiger - copy-on-write B-Tree
- MongoRocks - log-structured merge tree (LSM)
Why MongoRocks?
- Best space efficiency
- Better write efficiency
- Good read efficiency
- Effective with SSD & disk
RUM = Read, Update, Memory
- An algorithm can’t be optimal for all 3
- See Designing Access Methods: The RUM Conjecture
Making MongoRocks better
We make MongoRocks better by making RocksDB better:
- Smarter compaction
- Support for database >> RAM
- Less response-time variance
- Much more done and in progress
Things to make better in MongoRocks
- RocksDB has too many options
- CPU bound on IO-heavy tests
- Percona Server isn’t using recent RocksDB releases
MongoDB versions tested

Percona Server for MongoDB
- 3.0.15 with RocksDB 4.1.0 (October 2015)
- 3.2.15 with RocksDB 4.4.1 (February 2016)
- 3.4.6 with RocksDB 4.13.5 (December 2016)

Compiled from source
- MongoDB 3.4.7 with RocksDB 5.7.3 (August 2017)
Insert benchmark
- Supports MongoDB and MySQL
- Written in Python; needs a rewrite in something faster
- N collections (N is 1 or 16)
- 1 PK, 3 secondary indexes per collection
- Inserts are in PK order, random for secondary indexes
- Queries are short range scans
Runs in 4 phases:
1. Insert-only - load X million rows
2. Scan PK, secondary indexes in sequence
3. N clients do queries, N clients each insert 1000/s
4. N clients do queries, N clients each insert 100/s

Configuration options
- Database in memory vs IO-bound
- Client per collection
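The load phase can be sketched in a few lines. This is a sketch, not the benchmark's actual code; the field names are assumptions in the style of iibench (the real schema may differ):

```python
import random

def make_docs(start_pk, count):
    """Generate insert-benchmark-style documents: the PK (_id) is
    sequential, while the secondary-index fields are random."""
    for pk in range(start_pk, start_pk + count):
        yield {
            "_id": pk,                                 # inserted in PK order
            "cashregisterid": random.randrange(1000),  # secondary index 1
            "productid": random.randrange(10000),      # secondary index 2
            "price": random.uniform(0, 500),           # secondary index 3
            "data": "x" * 100,                         # filler payload
        }

# One batch of the load phase; a real client would insert_many() these.
batch = list(make_docs(0, 1000))
```

Sequential PK inserts are the cheap case for every engine; the random secondary-index values are what force the engines apart once the indexes no longer fit in RAM.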
Finding database stalls for 10 years
Load throughput
Large server, 16 clients

IO-bound (inserts/second)
        mmapv1    WiredTiger    MongoRocks
3.0     13,157    17,286         7,827
3.2     13,983    14,179        47,990
3.4     14,446    14,234        51,943
In-memory (inserts/second)
        mmapv1    WiredTiger    MongoRocks
3.0     17,778    113,533        7,931
3.2     21,122     94,841       50,241
3.4     24,286     93,284       54,933
IO-bound load efficiency
Large server, 16 clients, database >> RAM, v3.4.6

             Avg IPS   IO read/insert   IO KB read/insert   IO KB write/insert   CPU/insert   Size (GB)
mmapv1        14,446   3.95             22.72               25.53                  595        11xx
WiredTiger    14,234   1.81             26.26               39.31                2,082        577
MongoRocks    51,493   0.05              2.83               10.32                  461        487
Scan secondary indexes
Large server, 16 clients

IO-bound (seconds)
        mmapv1    WiredTiger    MongoRocks
3.0      1,334     2,078        13,904
3.2      1,461     2,149         3,022
3.4      1,474     2,182         3,538

In-memory (seconds)
        mmapv1    WiredTiger    MongoRocks
3.0        103       159         1,932
3.2        116       231           340
3.4        121       230           363
IO-bound: read-write
Large server, 16 clients, database >> RAM

Queries/second
        mmapv1    WiredTiger    MongoRocks
3.0        103     4,025           432
3.2        120       484         5,596
3.4        175       630         5,390

Inserts/second
        mmapv1    WiredTiger    MongoRocks
3.0     11,944    12,812         7,119
3.2     12,277    13,930        15,845
3.4     12,253    13,221        15,845
IO-bound: read-write
Large server, 16 clients, database >> RAM, v3.4.6

             Avg IPS   Avg QPS   IO MB read/second   IO MB write/second   CPU
mmapv1        12,253     175     416                 324                   8.3
WiredTiger    13,221     630     631                 567                  42.4
MongoRocks    15,845   5,390     914                 504                  21.2
In-memory: read-write
Large server, 16 clients

Queries/second
        mmapv1    WiredTiger    MongoRocks
3.0      1,010    32,256        12,820
3.2      7,218    29,379        23,349
3.4      9,822    29,214        21,272

Inserts/second
        mmapv1    WiredTiger    MongoRocks
3.0     15,276    15,364         7,029
3.2     15,352    15,845        15,845
3.4     15,754    15,845        15,845
Load: oplog impact
Large server, 16 clients, v3.4.6

IO-bound (inserts/second)
            mmapv1    WiredTiger    MongoRocks
oplog       14,446    14,234        51,493
no oplog    14,720    14,976        71,395
In-memory (inserts/second)
            mmapv1    WiredTiger    MongoRocks
oplog       24,286     93,284       54,933
no oplog    25,649    145,180       77,906
Load: benefit from new features
Small server, 1 client, database >> RAM, v3.4.7

               Avg IPS   IO read/insert   IO KB read/insert   IO KB write/insert   CPU/insert
old features     9,319   0.04             4.42                18.68                6,124
new features    10,806   0.02             1.91                14.17                5,283
Thank you
rocksdb.org | mongorocks.org | smalldatum.blogspot.com | twitter.com/markcallaghan
An LSM in one slide

Writes go to the memtable (after the write-ahead log). Flushed memtables become Level-0 files, which may overlap in key range; compaction merges them into non-overlapping files in deeper, larger levels.

memtable + write-ahead log
Level-0:  1000.sst [0:999]   999.sst [0:999]     998.sst [0:999]     997.sst [0:999]
Level-1:  993.sst [0:249]    994.sst [250:499]   995.sst [500:749]   996.sst [750:999]
Level-2:  802.sst [0:24]     610.sst [25:50]     …   471.sst [950:974]   480.sst [975:999]
Level-3:  49.sst [0:1]       1001.sst [2:4]      …   2.sst [995:996]     1.sst [997:999]
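The structure above can be sketched as a toy (names and sizes mine; the write-ahead log and compaction are omitted). It shows why Level-0 runs overlap and why a point read must consult the newest data first:

```python
class ToyLSM:
    """Minimal LSM sketch: a mutable memtable plus immutable sorted runs.
    Flushed runs may overlap in key range, like Level-0 files; deeper
    non-overlapping levels would be produced by compaction (not modeled)."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.runs = []  # newest first; each run is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # A real engine appends to the write-ahead log before this step.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # The memtable becomes an immutable sorted run (an "SST file").
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Newest data wins: memtable first, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:
            for k, v in run:
                if k == key:
                    return v
        return None
```

Because old versions linger in older runs until compaction rewrites them, reads may touch several runs; that is exactly the read cost that Bloom filters (later slide) and compaction keep in check.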
Efficiency: RocksDB vs a B-Tree

Space efficiency
- Fragmentation
- Fixed page size
- Per-row metadata
- Key prefix encoding
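Key prefix encoding is easy to sketch. This illustrates the idea only; RocksDB's actual block format additionally uses restart points and varint lengths:

```python
def prefix_encode(sorted_keys):
    """Sketch of key prefix encoding: each key stores only the length of
    the prefix it shares with the previous key, plus its distinct suffix."""
    out, prev = [], b""
    for key in sorted_keys:
        n = 0
        while n < min(len(prev), len(key)) and prev[n] == key[n]:
            n += 1
        out.append((n, key[n:]))  # (shared_prefix_len, suffix)
        prev = key
    return out

def prefix_decode(encoded):
    """Inverse: rebuild each key from the previous one."""
    keys, prev = [], b""
    for shared, suffix in encoded:
        prev = prev[:shared] + suffix
        keys.append(prev)
    return keys

keys = [b"user:1000:name", b"user:1000:phone", b"user:1001:name"]
encoded = prefix_encode(keys)
```

Because SST files store keys in sorted order, adjacent keys share long prefixes and the suffixes are short; a B-Tree can prefix-encode within a page, but the LSM's large immutable blocks make it cheaper.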
Write efficiency
- B-Tree uses more space = more data to write
- B-Tree writes back sizeof(page) per sizeof(row) modified
- LSM does large, sequential writes
- LSM writes the delta, not the whole page
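The write-efficiency gap shows up directly in the IO-bound load efficiency table earlier in the talk; dividing the bytes written per insert (IO KB write/insert, v3.4.6) gives the relative write cost:

```python
# IO KB written per insert, from the IO-bound load efficiency table (v3.4.6).
kb_per_insert = {"mmapv1": 25.53, "WiredTiger": 39.31, "MongoRocks": 10.32}

# Bytes written relative to MongoRocks: the B-Tree engines pay roughly
# the sizeof(page)/sizeof(row) write-back cost, while the LSM amortizes
# many rows over each large sequential write.
relative = {engine: round(kb / kb_per_insert["MongoRocks"], 1)
            for engine, kb in kb_per_insert.items()}
```

So on this workload WiredTiger wrote about 3.8x more bytes per insert than MongoRocks, and mmapv1 about 2.5x more.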
Read efficiency
- More data in cache & less data to cache
- Bloom filter
- Spend less on writes, use more for reads
- Read-free index maintenance
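A minimal Bloom filter sketch (not RocksDB's implementation, which uses faster non-cryptographic hashes and a block-based layout) shows how point reads skip SST files whose filter says "definitely absent":

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: lets an LSM skip SST files that cannot contain
    a key, so a point read touches fewer files and does less IO."""

    def __init__(self, bits=1024, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.array = 0  # an int used as a bit array

    def _positions(self, key):
        # Derive several bit positions per key from independent hashes.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, key):
        for pos in self._positions(key):
            self.array |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent; True means "maybe present".
        return all(self.array & (1 << pos) for pos in self._positions(key))
```

A "maybe present" answer still requires reading the file, so Bloom filters trade a small false-positive rate (and a little memory) for skipping most files that do not hold the key.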