Time-Series Data in MongoDB on a Budget - Peter Schwaller, Senior Director Server Engineering, Percona - PowerPoint Presentation




SLIDE 1

Peter Schwaller – Senior Director Server Engineering, Percona Santa Clara, California | April 23rd – 25th, 2018

Time-Series Data in MongoDB on a Budget
SLIDE 2

TIME SERIES DATA in MongoDB on a Budget


SLIDE 3

What is Time-Series Data?

Characteristics:

  • Arriving data is stored as a new value as opposed to overwriting existing values

  • Usually arrives in time order
  • Accumulated data size grows over time
  • Time is the primary means of organizing/accessing the data
SLIDE 4

Time Series Data in MONGODB on a Budget


SLIDE 5

Why MongoDB?

  • General purpose database
  • Specialized Time-Series DBs do exist
  • Do not use mmap storage engine
SLIDE 6

Data Retention Options

  • Purge old entries
      • Set up a MongoDB index with the TTL option (be careful if this index is your shard key)
  • Aggregate data and store summaries
      • Create a summary document, delete the original raw data
      • Huge compression possible (seconds -> minutes -> hours -> days -> months -> years)
  • Measurement buckets
      • Store all entries for a time window in a single document
      • Avoids storing duplicate metadata
  • Individual documents for each measurement
      • Useful when data is sparse or intermittent (e.g., events rather than sensors)
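The measurement-bucket pattern above can be sketched as a plain filter/update pair. This is a minimal Python sketch, not from the slides: the field names and the one-hour bucket size are assumptions, and the commented PyMongo call shows where the pair would be used.

```python
from datetime import datetime

def bucket_update(sensor_id, ts, value):
    """Build the filter/update pair for a one-hour measurement bucket.

    All readings from one sensor in the same hour land in a single
    document, so per-reading metadata (sensor id, hour) is stored once.
    """
    hour = ts.replace(minute=0, second=0, microsecond=0)
    flt = {"_id": {"sensor": sensor_id, "hour": hour}}
    upd = {
        "$push": {"readings": {"t": ts, "v": value}},  # append the new value
        "$inc": {"count": 1},                          # running bucket size
    }
    return flt, upd

# Illustrative PyMongo call (collection is a pymongo Collection):
# collection.update_one(*bucket_update("s1", datetime.utcnow(), 21.5), upsert=True)
```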
SLIDE 7

Potential Problems with Data Collection

  • Duplicate entries
      • Utilize a unique index in MongoDB to reject duplicate entries
  • Delayed entries
  • Out-of-order entries
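One way to make the unique-index rejection work is to derive `_id` deterministically from a measurement's identity, so a re-sent reading maps to the same key. A sketch, assuming a source-plus-timestamp identity (the key format is hypothetical):

```python
from datetime import datetime

def make_id(source, ts):
    """Deterministic _id: the same (source, timestamp) always yields the
    same key, so re-inserting a duplicate reading fails on the _id
    unique index instead of silently creating a second document."""
    return "{}|{}".format(source, ts.isoformat())

# Illustrative PyMongo call: the second insert of the same reading
# raises DuplicateKeyError.
# collection.insert_one({"_id": make_id("sensor7", ts), "v": 42})
```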
SLIDE 8

Problems with Delayed and Out-of-Order Entries

  • Alert/Event generation
  • Incremental Backup
SLIDE 9

Enable Streaming of Data

  • Add a recordedTime field (in addition to the existing timestamp field)
  • Utilize the $currentDate feature of db.collection.update():

$currentDate: { recordedTime: true }

  • You cannot use this field as a shard key!
  • Requires use of update instead of insert
      • Which in turn requires specification of the _id field
      • Consider constructing your _id to solve the duplicate-entries issue at the same time

Allows applications to reliably process each document once and only once.
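Putting these pieces together, the upsert that lets the server stamp recordedTime might look like the following Python sketch (the caller-supplied document fields are hypothetical; only recordedTime and $currentDate come from the slide):

```python
def streaming_upsert(doc):
    """Build the filter/update pair for an upsert where the server
    stamps recordedTime via $currentDate. The caller must supply _id
    (update, not insert), which also handles duplicate entries."""
    flt = {"_id": doc["_id"]}
    upd = {
        "$set": {k: v for k, v in doc.items() if k != "_id"},
        "$currentDate": {"recordedTime": True},  # server-side wall clock
    }
    return flt, upd

# Illustrative PyMongo call:
# collection.update_one(*streaming_upsert(doc), upsert=True)
```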

SLIDE 10

Accessing Your Data

It’s only *mostly* write-only.

SLIDE 11

Create Appropriate Indexes

  • Avoid collection scans!
      • Consider using: db.adminCommand( { setParameter: 1, notablescan: 1 } )
      • Avoid queries that might as well be collection scans
  • Create the indexes you need (but no more)
      • Don’t depend on index intersection
  • Don’t over-index
      • Each index can take up a lot of disk/memory
      • Consider using partial indexes:

{ partialFilterExpression: { speed: { $gt: 75.0 } } }
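As a concrete sketch, the speed filter above could back a partial index created from PyMongo. The key list and the 75.0 threshold are illustrative, not from a real schema:

```python
# Index only documents with speed > 75.0: queries that include that
# predicate can use the index, and the index stays small in disk/memory.
partial_index = {
    "keys": [("time", 1), ("speed", 1)],
    "options": {"partialFilterExpression": {"speed": {"$gt": 75.0}}},
}

# Illustrative PyMongo call:
# collection.create_index(partial_index["keys"], **partial_index["options"])
```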

SLIDE 12

Check Your Indexes

  • Use .explain() liberally
  • Check which indexes are actually used:

db.collection.aggregate( [ { $indexStats: {} } ] )
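The $indexStats output can be scanned for indexes that have never been used since server start. A small Python helper; the sample documents in the test mimic the real result shape (one document per index, with an accesses.ops counter):

```python
def unused_indexes(index_stats):
    """Given $indexStats result documents, return the names of indexes
    with zero recorded accesses -- candidates for removal, but only
    after watching over a long enough window (stats reset on restart)."""
    return [s["name"] for s in index_stats if s["accesses"]["ops"] == 0]

# Illustrative PyMongo call:
# stats = list(collection.aggregate([{"$indexStats": {}}]))
# print(unused_indexes(stats))
```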

SLIDE 13

Adding Data

Getting the Speed You Need

SLIDE 14

API Methods

  • Insert array

database[collection].insert(doc_array)

  • Insert unordered bulk

bulk = database[collection].initialize_unordered_bulk_op()
bulk.insert(doc)  # loop here
bulk.execute()

  • Upsert unordered bulk

bulk = database[collection].initialize_unordered_bulk_op()
bulk.find({"_id": doc["_id"]}).upsert().update_one({"$set": doc})  # loop here
bulk.execute()

  • Insert single

database[collection].insert(doc)

  • Upsert single

database[collection].update_one({"_id": doc["_id"]}, {"$set": doc}, upsert=True)
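The bulk helpers above use the legacy PyMongo bulk API, which was removed in PyMongo 4; the unordered-upsert pattern translates to bulk_write. A sketch that builds the operation list as plain (filter, update) pairs so it runs without a server; the commented lines show the assumed PyMongo 3.x+ translation:

```python
def build_upsert_ops(docs):
    """One upsert per document, keyed on _id. Submitted unordered
    (ordered=False), the server can apply them in parallel and keeps
    going past individual errors such as duplicate keys."""
    return [({"_id": d["_id"]}, {"$set": d}) for d in docs]

# With pymongo installed this becomes:
# from pymongo import UpdateOne
# requests = [UpdateOne(f, u, upsert=True) for f, u in build_upsert_ops(docs)]
# collection.bulk_write(requests, ordered=False)
```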

SLIDE 15

Relative Performance

[Bar chart: Comparison of API Methods, Docs/Sec – Insert Array, Insert Unordered Bulk, Update Unordered Bulk, Insert Single, Update Single]

SLIDE 16

Benchmarks… and other lies.

Answering, “Why can’t I just use a gigantic HDD RAID array?”

SLIDE 17

Benchmark Environment

  • VMs
      • 4-core Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz
      • 8 GB RAM
      • SanDisk Ultra II 960GB SSD
      • WD 5TB 7200rpm HDD
  • MongoDB
      • 3.4.13
      • WiredTiger
      • 4GB cache
      • Snappy collection compression
      • Standalone server (no replica set, no mongos)
  • Data
      • 178 bytes per document in 6 fields
      • 3 indexes (2 compound)
      • Disk usage: 40% storage, 60% indexes
      • Using the update unordered bulk method, 1000 docs per bulk.execute()
SLIDE 18

Benchmark SSD vs. HDD

[Bar chart: Inserts/Sec, SSD vs. HDD]

SLIDE 19

SSD Benchmark 60 Minutes

SLIDE 20

SSD Benchmark 0:30-1:00

SLIDE 21

HDD Benchmark 0:30-1:30

SLIDE 22

HDD Benchmark 0:30-8:45 (42M documents)

SLIDE 23

HDD Benchmark Last Hour

SLIDE 24

SSD Benchmark 0:30-2:10 (42M documents)

SLIDE 25

Benchmark SSD vs. HDD Last Hour

[Bar chart: Inserts/Sec over the last hour, SSD vs. HDD]

SLIDE 26

96 Hour Test

SLIDE 27

TL;DR

  • Don’t trust someone else’s benchmarks (especially mine!)
  • Benchmark using your own “schema” and indexes
  • Artificially accelerate the point at which index size exceeds available memory (e.g., by limiting cache)
SLIDE 28

Time Series Data in MongoDB on a BUDGET

SLIDE 29

Replica Set Rollout Options

  • Follow standard advice
      • 3-server replica sets (Primary, Secondary, Secondary)
      • Every replica set server on its own hardware
      • Disk mirroring
  • Cost-cutting options
      • Primary, Secondary, Arbiter
      • Locate multiple replica set servers on the same hardware (but NOT from the SAME replica set)
      • No disk mirroring (how many copies do you really need?)
  • “I love downtime and don’t care about my data”
      • Single-instance servers instead of replica sets
      • RAID0 (“no wasted disk space!”)
      • No backups
SLIDE 30

Storing Lots of Data

“Sharding is a method for distributing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations.”

SLIDE 31

Conventional Sharding

  • Non-sharded data kept in the default replica set
  • Shard key hashed on timestamp to evenly distribute data
  • Pros:
      • Increases insert rate
      • Arbitrarily large data storage
  • Cons:
      • All shard replica sets should have comparable hardware
      • All shards start thrashing at the same time
      • Expanding means a LOT of rebalancing
SLIDE 32

Data Access Patterns

  • New writes are always very recent
  • Reads are almost always of recent data
  • Reads of old data are “intuitively” slower

… let’s take advantage of that.

SLIDE 33

Sharding by Zone

  • Non-sharded data kept in the default replica set
  • Most recent time-series data stored in a “fast” replica set
  • Older time-series data stored in “slow” replica sets
  • Pros:
      • Pay for speed where we need it
      • Swap “fast” to “slow” before thrashing kills performance
      • “Infinite” data size
  • Cons:
      • Ceiling on insert speed
SLIDE 34

Prerequisites for Zone Sharding

  • Sharded cluster configured (config replica set, mongos, etc.)
  • Existing replica set rsmain (primary shard) contains your normal (non-time-series) data
  • TimeSeries collection with an index on “time”
  • New replica set for time-series data (e.g., rs001) added as a shard
SLIDE 35

Initial Zone Ranges

  • Run on mongos:

use admin
sh.enableSharding('DBName')
sh.shardCollection('DBName.TimeSeries', { time: 1 })
sh.addShardTag('rsmain', 'future')
sh.addShardTag('rs001', 'ts001')
sh.addTagRange('DBName.TimeSeries', { time: new Date("2099-01-01") }, { time: MaxKey }, 'future')
sh.addTagRange('DBName.TimeSeries', { time: MinKey }, { time: new Date("2099-01-01") }, 'ts001')
// sh.splitAt('DBName.TimeSeries', { "time": new Date("2099-01-01") })

SLIDE 36

Adding a New Time-Series Replica Set Step 1 – Create new Replica Set

  • When?
      • Well before you run out of available fast storage
      • Before your input capacity is lowered too close to your needs
  • Where?
      • On the same server with fast storage as the current time-series replica set
  • Run on mongos:

use admin
db.runCommand({ addShard: "rs002/hostname:port", name: "rs002" })
sh.addShardTag('rs002', 'ts002')
var configdb = db.getSiblingDB("config");
configdb.tags.update({ tag: "ts001" }, { $set: { 'max.time': new ISODate("2018-04-26") } })
sh.addTagRange('DBName.TimeSeries', { time: new Date("2018-04-26") }, { time: new Date("2099-01-01") }, 'ts002')
// sh.splitAt('DBName.TimeSeries', { "time": new ISODate("2018-04-26") })
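The rotation above has two halves: cap the old zone's range at the cutoff date, then point a new zone at the gap up to the far-future horizon. A Python sketch that builds those command documents as plain dicts (the function and its parameters are hypothetical; the dates and tag names come from the slide):

```python
from datetime import datetime

def rotate_zone(ns, old_tag, new_tag, cutoff, horizon):
    """Build the config.tags update that caps the old zone at `cutoff`,
    and the tag-range document for the new zone [cutoff, horizon)."""
    cap_old = (
        {"tag": old_tag},                    # filter on the old zone tag
        {"$set": {"max.time": cutoff}},      # shrink its upper bound
    )
    new_range = {
        "ns": ns,
        "min": {"time": cutoff},
        "max": {"time": horizon},
        "tag": new_tag,
    }
    return cap_old, new_range

cap, rng = rotate_zone("DBName.TimeSeries", "ts001", "ts002",
                       datetime(2018, 4, 26), datetime(2099, 1, 1))
```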

SLIDE 37

Adding a New Time-Series Replica Set Step 2 – Wait before Relocation

  • Initially nothing changes – all data is added to the previous replica set
  • Eventually, new entries match the min.time of the new replica set and will be stored there
  • How long to wait before relocation?
      • Make sure you don’t fill up your fast storage
      • How far back in time do “normal” queries go?
      • Queries to the previous replica set will get slower after relocation
SLIDE 38

Adding a New Time-Series Replica Set Step 3 – Relocate to Slow Storage

  • Follow the standard procedure for moving a replica set
  • Multiple server instances can share the same server/storage
      • Use unique ports
      • Set --wiredTigerCacheSizeGB appropriately
SLIDE 39

Pause for Questions

SLIDE 40

Wrap Up

  1. Determine your anticipated time-series data rate
  2. Mock up a benchmark app matching your use case
      • Focus on indexed fields and their cardinality
  3. Benchmark on a single server
      • Fast storage
      • Limited memory to accelerate index thrashing
      • Ensure benchmarks run long enough
  4. Iterate, adjusting the following tradeoffs:
      • single vs. bulk/array
      • upsert vs. insert
      • size of bulk/array insert/upsert
      • if using measurement buckets, the size of the bucket
  5. If you achieve your needed data rate, use shard tags to push old data to slower (cheaper) servers

SLIDE 41

Rate My Session

SLIDE 42

Thank You Sponsors!!

SLIDE 43

Thank You!