  1. Time-Series Data in MongoDB on a Budget Peter Schwaller – Senior Director Server Engineering, Percona Santa Clara, California | April 23rd – 25th, 2018

  2. TIME SERIES DATA in MongoDB on a Budget

  3. What is Time-Series Data?
     Characteristics:
     • Arriving data is stored as a new value as opposed to overwriting existing values
     • Usually arrives in time order
     • Accumulated data size grows over time
     • Time is the primary means of organizing/accessing the data

  4. Time Series Data in MONGODB on a Budget

  5. Why MongoDB?
     • General purpose database
     • Specialized time-series DBs do exist
     • Do not use the MMAPv1 (mmap) storage engine

  6. Data Retention Options
     • Purge old entries
       • Set up a MongoDB index with the TTL option (be careful if this index is your shard key)
     • Aggregate data and store summaries
       • Create a summary document, delete the original raw data
       • Huge compression possible (seconds -> minutes -> hours -> days -> months -> years)
     • Measurement buckets
       • Store all entries for a time window in a single document
       • Avoids storing duplicate metadata
     • Individual documents for each measurement
       • Useful when data is sparse or intermittent (e.g., events rather than sensors)
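As a rough illustration of the purge and bucket options above, here is a minimal PyMongo sketch; the collection names, field names (ts, samples, speed), and connection string are assumptions for illustration, not from the slides:

    from datetime import datetime, timezone
    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://localhost:27017")   # assumed connection string
    db = client["DBName"]

    # Purge old entries: a TTL index removes documents whose "ts" is older than 30 days.
    # If the TTL field is also your shard key, note the caveat above.
    db.measurements.create_index([("ts", ASCENDING)], expireAfterSeconds=30 * 24 * 3600)

    # Measurement buckets: one document per sensor per hour, with individual
    # readings pushed into an embedded array so shared metadata is stored once.
    now = datetime.now(timezone.utc)
    bucket_id = f"sensor42:{now:%Y%m%d%H}"              # hypothetical bucket key
    db.measurement_buckets.update_one(
        {"_id": bucket_id},
        {
            "$setOnInsert": {"sensor": "sensor42"},
            "$push": {"samples": {"t": now, "speed": 64.2}},
            "$inc": {"count": 1},
        },
        upsert=True,
    )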

  7. Potential Problems with Data Collection
     • Duplicate entries
       • Utilize a unique index in MongoDB to reject duplicate entries
     • Delayed entries
     • Out-of-order entries
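A unique index is the usual guard against duplicates; here is a minimal PyMongo sketch (the sensor/ts field names, collection name, and connection string are assumptions for illustration):

    from datetime import datetime, timezone
    from pymongo import MongoClient, ASCENDING
    from pymongo.errors import DuplicateKeyError

    coll = MongoClient("mongodb://localhost:27017")["DBName"]["measurements"]   # assumed names

    # A unique compound index on (sensor, ts) makes the server reject a second
    # copy of the same measurement instead of silently storing it twice.
    coll.create_index([("sensor", ASCENDING), ("ts", ASCENDING)], unique=True)

    measurement = {"sensor": "sensor42",
                   "ts": datetime(2018, 4, 23, 10, 15, tzinfo=timezone.utc),
                   "speed": 64.2}
    coll.insert_one(dict(measurement))
    try:
        coll.insert_one(dict(measurement))   # the same entry is delivered again
    except DuplicateKeyError:
        pass                                 # duplicate rejected by the unique index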

  8. Problems with Delayed and Out-of-Order Entries
     • Alert/event generation
     • Incremental backup

  9. Enable Streaming of Data
     • Add a recordedTime field (in addition to the existing field with the timestamp)
     • Utilize the $currentDate feature of db.collection.update(): $currentDate: { recordedTime: true }
       • You cannot use this field as a shard key!
     • Requires use of update instead of insert
       • Which in turn requires specification of the _id field
       • Consider constructing your _id to solve the duplicate entries issue at the same time
     Allows applications to reliably process each document once and only once.
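A minimal PyMongo sketch of this pattern, combining a constructed _id with a server-side recordedTime stamp (collection and field names other than recordedTime, and the connection string, are assumptions):

    from datetime import datetime, timezone
    from pymongo import MongoClient

    coll = MongoClient("mongodb://localhost:27017")["DBName"]["TimeSeries"]   # assumed names

    measurement = {"sensor": "sensor42",
                   "time": datetime(2018, 4, 23, 10, 15, tzinfo=timezone.utc),
                   "speed": 64.2}

    # Build _id from the measurement identity so a re-delivered entry upserts
    # onto itself, solving the duplicate-entries problem at the same time.
    doc_id = f"{measurement['sensor']}:{measurement['time'].isoformat()}"

    # The server stamps recordedTime with its own clock, so consumers can stream
    # documents in recordedTime order even when measurements arrive late or out of order.
    coll.update_one(
        {"_id": doc_id},
        {"$set": measurement, "$currentDate": {"recordedTime": True}},
        upsert=True,
    )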

  10. Accessing Your Data It’s only *mostly* write-only.

  11. Create Appropriate Indexes
     • Avoid collection scans!
       • Consider using: db.adminCommand( { setParameter: 1, notablescan: 1 } )
       • Avoid queries that might as well be collection scans
     • Create the indexes you need (but no more)
       • Don't depend on index intersection
     • Don't over-index
       • Each index can take up a lot of disk/memory
       • Consider using partial indexes: { partialFilterExpression: { speed: { $gt: 75.0 } } }
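A minimal PyMongo sketch of the notablescan parameter and a partial index (the speed threshold comes from the slide; the connection string and collection name are assumptions):

    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://localhost:27017")   # assumed connection string
    db = client["DBName"]

    # Fail any query that would need a collection scan; this is a server-wide
    # parameter, so enable it only on development or test instances.
    client.admin.command({"setParameter": 1, "notablescan": 1})

    # Partial index: only index documents above the threshold, keeping the index
    # small while still serving the "show me violations" query efficiently.
    db.TimeSeries.create_index(
        [("speed", ASCENDING)],
        partialFilterExpression={"speed": {"$gt": 75.0}},
    )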

  12. Check Your Indexes
     • Use .explain() liberally
     • Check which indexes are actually used: db.collection.aggregate( [ { $indexStats: {} } ] )
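A minimal PyMongo sketch of both checks (the query shape, dates, and collection name are illustrative assumptions):

    from datetime import datetime, timezone
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["DBName"]   # assumed names

    start = datetime(2018, 4, 23, tzinfo=timezone.utc)
    end = datetime(2018, 4, 24, tzinfo=timezone.utc)

    # Explain a typical time-range query; look for IXSCAN rather than COLLSCAN
    # in the winning plan.
    plan = db.TimeSeries.find({"time": {"$gte": start, "$lt": end}}).explain()
    print(plan["queryPlanner"]["winningPlan"])

    # Per-index usage counters since server start; indexes that are never used
    # are candidates for removal.
    for stats in db.TimeSeries.aggregate([{"$indexStats": {}}]):
        print(stats["name"], stats["accesses"]["ops"])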

  13. Adding Data Getting the Speed You Need

  14. API Methods
     • Insert array
       database[collection].insert(doc_array)
     • Insert unordered bulk
       bulk = database[collection].initialize_unordered_bulk_op()
       bulk.insert(doc)  # loop here
       bulk.execute()
     • Upsert unordered bulk
       bulk = database[collection].initialize_unordered_bulk_op()
       bulk.find({"_id": doc["_id"]}).upsert().update_one({"$set": doc})  # loop here
       bulk.execute()
     • Insert single
       database[collection].insert(doc)
     • Upsert single
       database[collection].update_one({"_id": doc["_id"]}, {"$set": doc}, upsert=True)
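The snippets above use PyMongo's legacy bulk API (initialize_unordered_bulk_op), which was deprecated in PyMongo 3.x and removed in 4.0. A rough modern equivalent of the unordered upsert-bulk method, assuming each document already carries a precomputed _id (connection string and collection name are assumptions):

    from pymongo import MongoClient, UpdateOne

    coll = MongoClient("mongodb://localhost:27017")["DBName"]["TimeSeries"]   # assumed names

    def flush(docs):
        """Upsert one batch of documents in a single unordered bulk round trip."""
        ops = [UpdateOne({"_id": d["_id"]}, {"$set": d}, upsert=True) for d in docs]
        if ops:
            coll.bulk_write(ops, ordered=False)

The benchmarks later in the deck use this upsert-bulk style with 1,000 documents per batch, so flush() would be called with a similarly sized list.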

  15. Relative Performance Comparison of API Methods (chart: Docs/Sec for Insert Array, Insert Unordered Bulk, Update Unordered Bulk, Insert Single, Update Single)

  16. Benchmarks … and other lies. Answering, "Why can't I just use a gigantic HDD RAID array?"

  17. Benchmark Environment
     • VMs
       • 4-core Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz
       • 8 GB RAM
       • SanDisk Ultra II 960GB SSD
       • WD 5TB 7200rpm HDD
     • MongoDB
       • 3.4.13
       • WiredTiger
         • 4GB cache
         • Snappy collection compression
       • Standalone server (no replica set, no mongos)
     • Data
       • 178 bytes per document in 6 fields
       • 3 indexes (2 compound)
       • Disk usage: 40% storage, 60% indexes
       • Using the update unordered bulk method, 1000 docs per bulk.execute()

  18. Benchmark: SSD vs. HDD (chart: Inserts/Sec, SSD vs. HDD)

  19. SSD Benchmark: 60 Minutes

  20. SSD Benchmark: 0:30-1:00

  21. HDD Benchmark: 0:30-1:30

  22. HDD Benchmark: 0:30-8:45 (42M documents)

  23. HDD Benchmark: Last Hour

  24. SSD Benchmark: 0:30-2:10 (42M documents)

  25. Benchmark: SSD vs. HDD, Last Hour (chart: Inserts/Sec, SSD vs. HDD)

  26. 96 Hour Test

  27. TL;DR
     • Don't trust someone else's benchmarks (especially mine!)
     • Benchmark using your own "schema" and indexes
     • Artificially accelerate the point at which index size exceeds available memory

  28. Time Series Data in MongoDB on a BUDGET

  29. Replica Set Rollout Options
     • Follow standard advice
       • 3-server replica sets (Primary, Secondary, Secondary)
       • Every replica set server on its own hardware
       • Disk mirroring
     • Cost-cutting options
       • Primary, Secondary, Arbiter
       • Locate multiple replica set servers on the same hardware (but NOT from the SAME replica set)
       • No disk mirroring (how many copies do you really need?)
     • "I love downtime and don't care about my data"
       • Single instance servers instead of replica sets
       • RAID0 ("no wasted disk space!")
       • No backups

  30. Storing Lots of Data: "Sharding is a method for distributing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations."

  31. Conventional Sharding
     • Non-sharded data kept in the default replica set
     • Shard key hashed on timestamp to evenly distribute data
     • Pros:
       • Increases insert rate
       • Arbitrarily large data storage
     • Cons:
       • All shard replica sets should have comparable hardware
       • All shards start thrashing at the same time
       • Expanding means a LOT of rebalancing
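For reference, a minimal sketch of the conventional hashed-timestamp approach issued through PyMongo against a mongos (the database/collection names match those used later in the deck; the mongos address is an assumption):

    from pymongo import MongoClient

    mongos = MongoClient("mongodb://mongos-host:27017")   # assumed mongos address

    # Enable sharding for the database, then shard the collection on a hashed
    # timestamp so writes spread evenly across all shards.
    mongos.admin.command("enableSharding", "DBName")
    mongos.admin.command("shardCollection", "DBName.TimeSeries", key={"time": "hashed"})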

  32. Data Access Patterns
     • New writes are always very recent
     • Reads are almost always of recent data
     • Reads of old data are "intuitively" slower
     … let's take advantage of that.

  33. Sharding by Zone
     • Non-sharded data kept in the default replica set
     • Most recent time-series data stored in a "fast" replica set
     • Older time-series data stored in "slow" replica sets
     • Pros:
       • Pay for speed where we need it
       • Swap "fast" to "slow" before thrashing kills performance
       • "Infinite" data size
     • Cons:
       • Ceiling on insert speed

  34. Prerequisites for Zone Sharding
     • Sharded cluster configured (config replica set, mongos, etc.)
     • Existing replica set rsmain (primary shard) contains your normal (non-time-series) data
     • TimeSeries collection with an index on "time"
     • New replica set for time-series data (e.g., rs001) added as a shard

  35. Initial Zone Ranges
     • Run on mongos:
       use admin
       sh.enableSharding('DBName')
       sh.shardCollection('DBName.TimeSeries', { time: 1 })
       sh.addShardTag('rsmain', 'future')
       sh.addShardTag('rs001', 'ts001')
       sh.addTagRange('DBName.TimeSeries', { time: new Date("2099-01-01") }, { time: MaxKey }, 'future')
       sh.addTagRange('DBName.TimeSeries', { time: MinKey }, { time: new Date("2099-01-01") }, 'ts001')
       // sh.splitAt('DBName.TimeSeries', { "time": new Date("2099-01-01") })

  36. Adding a New Time-Series Replica Set Step 1 – Create New Replica Set
     • When?
       • Well before you run out of available fast storage
       • Before your input capacity is lowered too close to your needs
     • Where?
       • On the same server with fast storage as the current time-series replica set
     • Run on mongos:
       use admin
       db.runCommand({ addShard: "rs002/hostname:port", name: "rs002" })
       sh.addShardTag('rs002', 'ts002')
       var configdb = db.getSiblingDB("config");
       configdb.tags.update({ tag: "ts001" }, { $set: { 'max.time': new ISODate("2018-04-26") } })
       sh.addTagRange('DBName.TimeSeries', { time: new Date("2018-04-26") }, { time: new Date("2099-01-01") }, 'ts002')
       // sh.splitAt('DBName.TimeSeries', { "time": new ISODate("2018-04-26") })

  37. Adding a New Time-Series Replica Set Step 2 – Wait before Relocation
     • Initially nothing changes – all data is added into the previous replica set
     • Eventually, new entries match the min.time of the new replica set and will be stored there
     • How long to wait before relocation?
       • Make sure you don't fill up your fast storage
       • How far back in time do "normal" queries go?
         • Queries to the previous replica set will get slower after relocation

  38. Adding a New Time-Series Replica Set Step 3 – Relocate to Slow Storage
     • Follow the standard procedure for moving a replica set
     • Multiple server instances can share the same server/storage
       • Use unique ports
       • Set --wiredTigerCacheSizeGB appropriately

  39. Pause for Questions
