Deploying MongoDB in Production
Monday, November 5, 2018 9:00 AM - 12:00 PM Bull
Agenda
○ Hardware and OS configuration
○ MongoDB in Production
○ Backups and Monitoring
○ Q&A

Terminology: Data
○ Document: a single *SON object, often nested
○ Field: a single field in a document
○ Collection: a grouping of documents
○ Database: a grouping of collections
○ Capped Collection: a fixed-size FIFO collection
Terminology: Replication
○ Oplog: a special capped collection for replication
○ Primary: a replica set node that can receive writes
○ Secondary: a read-only replica of the Primary
○ Voting: the process used to elect a Primary node
○ Hidden-Secondary: a replica that cannot become Primary
Magnetic disks vs SSD/NVMe
RAID levels compared:
○ Performance: High / Redundancy: None / Overhead (parity): None
○ Performance: Low / Redundancy: Yes / Overhead (parity): Yes
○ Performance: High / Redundancy: None / Overhead (parity): Yes
○ Performance: High / Redundancy: Yes / Overhead (parity): Yes
Source: http://www.icc-usa.com/raid-calculator.html
Network-attached storage is generally slower than local/fiber-connected disks
Linux I/O schedulers: NOOP, DEADLINE, CFQ
○ Use XFS or EXT4
  ■ Use XFS only with WiredTiger
  ■ EXT4 “data=ordered” mode recommended
○ Btrfs not tested, yet!
○ Set ‘noatime’ on MongoDB data volumes in ‘/etc/fstab’
○ Remount the filesystem after an options change, or reboot
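A minimal ‘/etc/fstab’ entry with ‘noatime’ might look like the following sketch (the device, mount point and filesystem type here are hypothetical examples):

```
# hypothetical MongoDB data volume - adjust device, mount point and fs type
/dev/sdb1  /var/lib/mongo  xfs  defaults,noatime  0  0
```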
/etc/udev/rules.d/60-mongodb-disk.rules:
# set deadline scheduler and 32-sector/16KB read-ahead for /dev/sda
ACTION=="add|change", KERNEL=="sda", ATTR{queue/scheduler}="deadline", ATTR{bdi/read_ahead_kb}="16"
Check or change read-ahead at runtime with ‘sudo blockdev --getra <device>’ and ‘sudo blockdev --setra <sectors> <device>’ (values are in 512-byte sectors, so 32 sectors = 16KB)
○ Faster cores don't necessarily mean a faster database.
○ Almost all databases take advantage of multiple cores for good performance.
Interleave memory allocations on NUMA systems:
○ In the server BIOS
○ Using ‘numactl’ in init scripts BEFORE the ‘mongod’ command (recommended for future compatibility):
numactl --interleave=all /usr/bin/mongod <other flags>
More: https://blog.2ndquadrant.com/postgresql-vs-kernel-versions/
○ Number of user-level processes
○ Number of open files
○ CPU seconds
○ Scheduling priority

○ MongoDB should probably have a dedicated VM, container or server
○ Creates a new process
○ Creates an open file for each active data file on disk
vm.swappiness:
○ Linux default: 60
○ To avoid disk-based swap: 1 (not zero!)
○ To allow some disk-based swap: 10
○ ‘0’ can cause more swapping than ‘1’ on recent kernels
More on this here: https://www.percona.com/blog/2014/04/28/oom-relation-vm-swappiness0-new-kernel/
○ Set limits via a /etc/security/limits.d file, a systemd service, or an init script
○ Example: the PSMDB RPM (systemd)
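As a sketch, a limits.d drop-in could look like this (the file name and the 64000 values are illustrative, mirroring common MongoDB guidance; tune them to your workload):

```
# /etc/security/limits.d/99-mongodb.conf - example values only
mongod  soft  nofile  64000
mongod  hard  nofile  64000
mongod  soft  nproc   64000
mongod  hard  nproc   64000
```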
○ mongodb_consistent_backup relies on time sync, for example!
○ “It’s ok if everyone is equally wrong”
○ Run an NTP daemon on all MongoDB and monitoring hosts
○ Enable the service so it starts on reboot
○ Check whether your VM platform has an “agent” syncing time
○ VMware and Xen are known to have their own time sync
○ If no time sync is provided, install an NTP daemon
○ Add the sysctl tunings to /etc/sysctl.conf
○ Run “/sbin/sysctl -p” as root to apply them
○ Run “/sbin/sysctl -a” to verify the changes
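The tunings themselves were shown on a slide that is not reproduced here; as a hedged example, a ‘/etc/sysctl.conf’ fragment covering the swappiness advice above might read:

```
# example values only - see the swappiness discussion above
vm.swappiness = 1          # avoid disk-based swap (not zero!)
vm.zone_reclaim_mode = 0   # commonly recommended on NUMA database hosts
```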
○ Network Edge
○ Public Server VLAN
○ Backend Server VLAN
○ Data VLAN
○ Try to use 10GbE for low latency
○ Use jumbo frames for efficiency
○ Try to keep all MongoDB nodes on the same segment
○ Databases don’t need to talk to the internet*
○ In the cloud, try to replicate the above with your provider’s features
https://www.percona.com/blog/2016/08/12/tuning-linux-for-mongodb/
○ Every 60 seconds or >= 2GB of data changes
○ Journal buffer size: 128KB
○ Synced every 50 ms (as of 3.2)
○ Or on every change with the Journaled write concern
○ Between write operations, while journal records remain in the buffer, updates can be lost following a hard shutdown!
○ In heap: ~50% of available system memory, uncompressed WT pages
○ Filesystem cache: ~50% of available system memory, compressed pages
○ Journal: always enable unless data is transient - default true
○ Always enable the journal on cluster config servers
○ Max time between journal syncs - default 100ms
○ Max time between data file flushes - default 60 seconds
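In mongod.conf these settings live under the ‘storage’ section; a sketch with the default values described above (option names per the MongoDB configuration reference):

```yaml
storage:
  syncPeriodSecs: 60        # max time between data file flushes
  journal:
    enabled: true           # always enable unless data is transient
    commitIntervalMs: 100   # max time between journal syncs
```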
“Think of the network like a public place” ~ Unknown
○ Run as the ‘mongod’ or ‘mongodb’ user on most systems
○ Ensure the data path, log file and key file(s) are owned by this user+group
○ Data path mode: 0750
○ Log file mode: 0640 - it contains real queries and their fields!
○ Key files (keyFile and SSL certificates or keys) mode: 0600
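The modes above can be sketched as follows, demonstrated on throwaway paths created with ‘mktemp’ (a real deployment would target the actual data path, log file and keyFile, and ‘chown’ them to the mongod user):

```shell
# stand-in paths - substitute your real data path, log file and key file
demo=$(mktemp -d)
touch "$demo/mongod.log" "$demo/keyfile"
chmod 0750 "$demo"             # data path: 0750
chmod 0640 "$demo/mongod.log"  # log file: 0640 (contains real queries!)
chmod 0600 "$demo/keyfile"     # keyFile / SSL keys: 0600
stat -c '%a' "$demo" "$demo/mongod.log" "$demo/keyfile"
```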
○ Mainly port 27017
○ Supported in PSMDB and MongoDB Enterprise
3.6.8-20 does have encryption at rest using a keyfile (BETA)
○ Asynchronous
  ■ Write Concerns can provide pseudo-synchronous replication
  ■ Changelog-based, using the “Oplog”
○ Maximum of 50 members
○ Maximum of 7 voting members
  ■ Use “votes: 0” for members beyond the 7th
○ The “oplog.rs” capped collection in the “local” database, storing changes to data
○ Read by secondary members for replication
○ Written to by the local node after “apply” of each operation
○ Events in the oplog are idempotent
○ Each event in the oplog represents a single document inserted, updated or deleted
○ The oplog has a default size depending on the OS and the storage engine
  ■ From 3.6 the size can be changed at runtime using the replSetResizeOplog admin command
Reads can be configured to run on secondaries while the primary is offline
○ Minimum of 3 physical servers required for High Availability
○ Ensure only 1 member per Replica Set is on a single physical server!!!
○ Place Replica Set members in an odd number of Availability Zones, in the same region
○ Use a hidden-secondary node for Backup and Disaster Recovery in another region
○ Entire Availability Zones have been lost before!
○ Reflects an earlier state of the dataset
○ Useful for recovering from unsuccessful application upgrades and operator errors, and for backups
○ Must be priority = 0: cannot be elected as Primary, but votes during elections
MongoDB HA, what can go wrong? Igor Donchovski Wed 7th 12:20PM 1:10PM @Bull
“The problem with troubleshooting is trouble shoots back” ~ Unknown
○ Original Query
○ Parsed Query
○ Query Runtime
○ Locking details
○ { "$ownOps": true } == only show operations for the current user
○ https://docs.mongodb.com/manual/reference/method/db.currentOp/#examples
○ Document-data size (dataSize)
○ Index-data size (indexSize)
○ Real-storage size (storageSize)
○ Average Object Size
○ Number of Indexes
○ Number of Objects
○ https://docs.mongodb.com/manual/reference/method/db.stats/
○ Slow queries
○ Storage engine details (sometimes)
○ Index operations
○ Sharding
  ■ Chunk moves
○ Elections / Replication
○ Authentication
○ Network
  ■ Connections
○ Verbosity can be controlled using db.setLogLevel()
○ https://docs.mongodb.com/manual/reference/method/db.setLogLevel/
2018-09-19T20:58:03.896+0200 I COMMAND [conn175] command config.locks appName: "MongoDB Shell" command: findAndModify { findAndModify: "locks", query: { ts: ObjectId('59c168239586572394ae37ba') }, update: { $set: { state: 0 } }, writeConcern: { w: "majority", wtimeout: 15000 }, maxTimeMS: 30000 } planSummary: IXSCAN { ts: 1 } update: { $set: { state: 0 } } keysExamined:1 docsExamined:1 nMatched:1 nModified:1 keysInserted:1 keysDeleted:1 numYields:0 reslen:604 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 } }, Collection: { acquireCount: { w: 1 } }, Metadata: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_command 106ms
○ Capped Collection “system.profile” in each database, default 1MB
○ The collection is capped, ie: profile data doesn’t last forever
○ Start with a very high threshold and decrease it in steps
○ Usually 50-100ms is a good threshold
○ Enable in mongod.conf:
operationProfiling:
  mode: slowOp
  slowOpThresholdMs: 100
○ keysExamined: # of index keys examined
○ docsExamined: # of docs examined to achieve the result
○ writeConflicts: # of write conflicts encountered during updates
○ numYields: # of times the operation yielded for others
○ locks: detailed lock statistics
○ Winning Plan
  ■ Query stages
  ■ Index chosen by the optimiser
○ Rejected Plans
○ mlogfilter --scan <file>
  ■ Shows all collection scan queries
○ mlogfilter --slow <ms> <file>
  ■ Shows all queries that are slower than X milliseconds
○ Shows all queries of a given operation type (eg: find, aggregate, etc)
https://github.com/rueckstiess/mtools
○ Number of inserted/updated/deleted/read documents
○ Percentage of WiredTiger cache in use/dirty
○ Number of flushes to disk
○ Inbound/outbound traffic
○ Only use strings if required
○ Do not store numbers as strings!
○ Look for {field: "123456"} instead of {field: 123456}
  ■ "12345678" moved to an integer uses 25% less space
  ■ Range queries on proper integers are more efficient
○ Example JavaScript to convert a field in an entire collection:
db.items.find().forEach(function(x) {
  var newItemId = parseInt(x.itemId);
  db.items.update({ _id: x._id }, { $set: { itemId: newItemId } });
});
○ Do not store dates as strings!
  ■ The field "2017-08-17 10:00:04 CEST" stored as a date takes 52.5% less space!
○ Do not store booleans as strings!
  ■ "true" -> true = 47% less space wasted
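A rough back-of-the-envelope illustration of the string overhead (per the BSON specification a string element carries a 4-byte int32 length prefix plus a trailing NUL byte, while an int32 value is a fixed 4 bytes; the shared field-name overhead, which narrows the gap toward the slide's 25% figure, is ignored here):

```javascript
// BSON value sizes, ignoring the field-name overhead (identical either way)
function bsonStringSize(s) {
  return 4 + Buffer.byteLength(s, "utf8") + 1; // length prefix + bytes + NUL
}

const asString = bsonStringSize("12345678"); // 13 bytes
const asInt32 = 4;                           // fixed-width BSON int32
console.log(asString, asInt32);              // prints: 13 4
```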
○ Index creation is a really heavy task
○ Use real performance data to make indexing decisions - find out before Production!
○ Index entries must be maintained on every insert/update/delete
○ Try to cover .sort() with an index, and match its direction!
Background index creation:
○ Doesn’t lock the collection
○ The collection can be used by other queries
○ Takes longer than foreground creation
○ Unpredictable performance!
Rolling foreground index creation:
○ Detach a SECONDARY from the Replica Set, create the index in the foreground, then reconnect it to the Replica Set
○ Repeat for all the SECONDARY nodes
○ Finally, detach the PRIMARY:
  ■ Wait for the election and detach the node once it is a SECONDARY
  ■ Create the foreground index
  ■ Reconnect the node to the Replica Set
○ Several fields supported
○ Fields can be in forward or backward direction
  ■ Consider any .sort() query options and match the sort direction!
○ Composite Keys are read Left -> Right
  ■ The index can be partially read
  ■ Left-most fields do not need to be duplicated!
  ■ Duplicate indexes must be dropped
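For example, with hypothetical fields, the first two index specifications below are left-most prefixes of the third, so they are duplicates that should be dropped:

```
{ country: 1 }                   <- duplicate: prefix of the index below
{ country: 1, city: 1 }          <- duplicate: prefix of the index below
{ country: 1, city: 1, zip: 1 }  <- keep: it serves all three prefixes
```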
○ Read-heavy apps benefit from pre-computed results
○ Consider moving expensive read computation to insert/update/delete time
○ Example 1: an app that does ‘count’ queries often
  ■ Move the .count() read query to a summary document with counters
  ■ Increment/decrement a single count value at write time
○ Example 2: an app that does groupings of data
  ■ Move the .aggregate() read query that is in-line to the user to a backend summary worker
  ■ Read from a summary collection, like a view
○ Reduce indexing as much as possible
○ Consider batching or a decentralised model with lazy updating (eg: social media graph)
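Example 1 above can be sketched with plain JavaScript objects standing in for collections (no MongoDB required; the collection and field names are hypothetical):

```javascript
// Write-time counter pattern: pay the cost on insert, not on every read
const orders = [];                        // stands in for the orders collection
const summary = { orders: { count: 0 } }; // stands in for a summary document

function insertOrder(doc) {
  orders.push(doc);
  summary.orders.count += 1; // the $inc that replaces repeated .count() reads
}

for (let i = 0; i < 3; i++) insertOrder({ _id: i });
console.log(summary.orders.count); // prints: 3
```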
○ MongoDB returns entire documents unless fields are specified
○ Only return the fields required for an application operation!
○ Covered-index operations require only the index fields to be specified
○ MongoDB (or any RDBMS) doesn’t handle large lists of $and or $or efficiently
○ Try to avoid this sort of model with:
  ■ Data locality
  ■ Background Summaries / Views
○ Atomicity
○ Consistency
○ Isolation
○ Durability

○ To use transactions on a standalone server, you need to start it as a Replica Set
○ Transaction support for sharded clusters is scheduled for 4.2
Some commands are not allowed inside transactions, like createUser, getParameter, etc.
○ Use transactions when you have a lot of 1:N and N:N relationships between different collections and you care about data consistency
○ Use transactions when you manage commercial/financial and/or really sensitive data
○ Use transactions when your app needs guaranteed data consistency
○ Transactions incur a greater performance cost than single-document writes
○ Transactions should not be a replacement for effective schema design
  ■ Embed documents as much as possible
  ■ A denormalised data model continues to be optimal
  ■ Single-document writes are always atomic
Use multi-document ACID transactions in MongoDB 4.0 - Corrado Pandiani - Wed 7th 2:20PM-3:10PM @Bull
What’s new in MongoDB 4.0 - Vinicius Gripps - Tue 6th 11:20AM-12:10PM @Bull
○ Testing a different storage engine
○ Testing different hardware
○ Testing a different OS configuration
○ mongoreplay record -i eth0 -e "port 27017" -p ~/recordings/playback
○ mongoreplay play -p ~/recordings/playback --report ~/reports/replay_stats.json --host mongodb://192.168.0.4:27018
○ mongoreplay monitor -i eth0 -e 'port 27017' --report ~/reports/monitor-live.json --collect json
○ setProfilingLevel set to 2
○ The replayer can send these ops to databases as fast as possible, to test limits
○ Or replay ops in accordance with their original timestamps, which imitates regular traffic
○ 60-300 seconds of data is not enough!
○ Problems can begin/end in seconds
○ Store more than you graph
○ Example: PMM gathers 700-900 metrics per polling interval
○ Use monitoring to troubleshoot Production events / incidents
○ Iterate and improve monitoring
  ■ Add graphing for whatever made you SSH to a host
  ■ Blind QA with someone unfamiliar with the problem
○ Operation counters
○ Cache Traffic and Capacity
○ Checkpoints
○ Concurrency Tickets (WiredTiger)
○ Document and Index scanning

○ CPU
○ Disk
  ■ Bandwidth / Util
  ■ Average Wait Time
○ Memory and Network
○ Prometheus
○ Grafana
○ Go Language
Monitoring MongoDB with Percona Monitoring and Management (PMM)