Running MongoDB in Production
Tim Vaillancourt Sr Technical Operations Architect, Percona
2
{ name: "tim", lastname: "vaillancourt", employer: "percona", techs: [ "mongodb", "mysql", "cassandra", "redis", "rabbitmq", "solr", "python", "golang" ] }
3
4
○ Document: single *SON object, often nested
○ Field: single field in a document
○ Collection: grouping of documents
○ Database: grouping of collections
○ Capped Collection: a fixed-size FIFO collection
○ Oplog: a special capped collection used for replication
○ Primary: a replica set node that can receive writes
○ Secondary: a replica of the Primary that is read-only
○ Voting: the process used to elect a Primary node
○ Hidden-Secondary: a replica that cannot become Primary
“An admin is only worth the backups they keep” ~ Unknown
6
○ Multi-threaded dumping in 3.2+
○ Optional gzip compression of data
○ Optional dumping of the oplog for single-node consistency
○ Replica set awareness (Read Preference)
■ i.e.: primary, primaryPreferred, secondary, secondaryPreferred, nearest
○ Tool issues .find() with a $snapshot query
○ Stores BSON data in a file per collection
○ Stores BSON oplog data in "oplog.bson"
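○ Example (a minimal sketch; replica set name, host names and output path are placeholders):
mongodump --host myReplSet/db1.example.com:27017,db2.example.com:27017 \
    --readPreference secondary --oplog --gzip --out /data/backup/$(date +%F)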
7
○ Not Sharding aware
○ Sharded backups are not Point-in-Time consistent
○ Fetching from the storage engine, serialization, networking, etc. is very inefficient
○ Indexes are rebuilt serially (per collection)
■ Only index metadata is in the backup
■ Often the indexing process takes longer than restoring the data!
○ Wire Protocol Compression (added in 3.4+) not supported: https://jira.mongodb.org/browse/TOOLS-1668 (please vote/watch the issue!)
○ Requires the oplog to be as long as the backup run time
8
○ Cold Backup
○ LVM Snapshot
○ Hot Backup
■ Percona Server for MongoDB (FREE!)
■ MongoDB Enterprise Hot Backup (non-free)
■ NOTE: MMAPv1 not supported
○ Indexes are backed up == faster restore!
○ Storage-engine format backed up == faster backup AND restore!
○ Increased backup storage requirements
○ Compression is storage-engine dependent
9
○ CPU Architecture limitations (64-bit vs 32-bit) ○ Cascading corruption ○ Batteries not included
■ Not Sharding aware ■ Not Replica Set aware
○ Cold Backup
■ Stop a mongod SECONDARY, copy/archive dbPath
○ LVM Snapshot
■ Optionally call 'db.fsyncLock()' (not required in 3.2+ with Journaling)
■ Create an LVM snapshot of the dbPath
■ Copy/Archive the dbPath
10
○ LVM Snapshot
■ Remove the LVM snapshot (as quickly as possible!)
■ NOTE: LVM snapshots can cause up to 30%* write latency impact to disk (due to COW)
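■ Example (a minimal shell sketch of the LVM snapshot flow above; volume group, LV names, sizes and paths are placeholders):
mongo --eval 'db.fsyncLock()'                     # optional on 3.2+ with journaling
lvcreate --snapshot --size 10G --name mdb-snap vg0/mongodb-data
mongo --eval 'db.fsyncUnlock()'
mount /dev/vg0/mdb-snap /mnt/mdb-snap
tar -czf /backups/mongodb-$(date +%F).tar.gz -C /mnt/mdb-snap .
umount /mnt/mdb-snap
lvremove -f vg0/mdb-snap                          # remove the snapshot as soon as possible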
○ Hot Backup (PSMDB or MongoDB Enterprise)
■ Pay $$$ for MongoDB Enterprise or download PSMDB for free(!)
■ db.adminCommand({ createBackup: 1, backupDir: "/data/mongodb/backup" })
■ Copy/archive the output path
■ Delete the backup output path
■ NOTE: RocksDB-based createBackup creates filesystem hardlinks whenever possible!
■ NOTE: Delete the RocksDB backupDir as soon as possible to reduce bloom filter overhead!
11
○ Dynamic nature of Replica Set ○ Impact of backup on live nodes
○ Place a ‘hidden: true’ SECONDARY in another location ○ Optionally use cloud object store (AWS S3, Google GS, etc)
○ "tags" allow fine-grained server selection with key/value pairs
○ Use key/value pairs to fence various application workflows
○ Example:
■ { "role": "backup" } == Backup Node
■ { "role": "application" } == App Node
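○ Example (a minimal sketch of tagging a member for backups; the member index is a placeholder, and hidden members must also have priority 0):
cfg = rs.conf()
cfg.members[2].priority = 0
cfg.members[2].hidden = true
cfg.members[2].tags = { "role": "backup" }
rs.reconfig(cfg)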
12
○ Replica Set and Sharded Cluster awareness ○ Cluster-wide Point-in-time consistency ○ In-line Oplog backup (vs post-backup) ○ Notifications of success / failure
○ Remote Upload (AWS S3, Google Cloud Storage and Rsync) ○ Archiving (Tar or ZBackup deduplication and optional AES-at-rest) ○ CentOS/RHEL7 RPMs and Docker-based releases (.deb soon!)
13
○ Single Python PEX binary ○ Multithreaded / Concurrent ○ Auto-scales to available CPUs
○ Tool focuses on low impact
○ Uses Secondary nodes only
○ Scores candidate nodes, considering:
■ Replication Lag
■ Replication Priority
■ Replication Health / State
■ Hidden-Secondary State (preferred by the tool)
■ Fails if the chosen Secondary becomes Primary (on purpose)
14
○ Multi-threaded Rsync Upload ○ Replica Set Tags support ○ Support for MongoDB SSL / TLS connections and client auth ○ Rotation / Expiry of old backups (locally-stored only)
○ Incremental Backups ○ Binary-level Backups (Hot Backup, Cold Backup, LVM, Cloud-based, etc) ○ More Notification Methods (PagerDuty, Email, etc) ○ Restore Helper Tool ○ Instrumentation / Metrics ○ <YOUR AWESOME IDEA HERE> we take GitHub PRs (and it’s Python)!
15
○ Seamless restore: “mongorestore --oplogReplay --gzip --dir /path/to/backup”
○ Mongorestore backups of config servers
■ If restoring old/SCCC config servers, restore to every node
■ If restoring replica-set config servers
○ Update "config.shards" documents if shard hosts/ports changed (see the sketch below)
○ Mongorestore each shard from its backup subdirectory (matches the shard name)
○ Start a mongos process and test / QA
■ Tip: stopping the balancer may simplify troubleshooting any problems
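○ Example (a minimal sketch of the config.shards update; the shard name and new hosts are placeholders):
use config
db.shards.update(
    { _id: "shard01" },
    { $set: { host: "shard01/newhost1:27017,newhost2:27017,newhost3:27017" } }
)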
16
“Think of the network like a public place” ~ Unknown
18
○ Database User: Read or Write data from collections
■ “All Databases” or Single-database
○ Database Admin: Non-RW commands (create/drop/list/etc)
○ Backup and Restore: backup/restore operations
○ Cluster Admin: Add/Drop/List shards
○ Superuser/Root: All capabilities
○ Custom roles: exact Resource + Action specification
○ Very fine-grained ACLs
■ DB + Collection specific
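○ Example (a minimal sketch of a collection-scoped custom role and user; database, collection, role and user names are placeholders):
db.getSiblingDB("admin").createRole({
    role: "ordersReadOnly",
    privileges: [
        { resource: { db: "shop", collection: "orders" }, actions: [ "find" ] }
    ],
    roles: []
})
db.getSiblingDB("admin").createUser({
    user: "reporting",
    pwd: "secret",
    roles: [ { role: "ordersReadOnly", db: "admin" } ]
})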
19
○ Service user: 'mongod' or 'mongodb' on most systems
○ Ensure the data path, log file and key file(s) are owned by this user and group
○ Data path (dbPath) mode: 0750
○ Log file mode: 0640
■ Log files contain real queries and their fields!!!
■ See Log Redaction for PSMDB (or MongoDB Enterprise) to remove these fields
○ Key files (keyFile, SSL certificates or keys) mode: 0600
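○ Example (a minimal shell sketch; paths follow common RPM-package defaults and may differ on your system):
chown -R mongod:mongod /var/lib/mongo /var/log/mongodb /etc/mongod-keyfile
chmod 0750 /var/lib/mongo
chmod 0640 /var/log/mongodb/mongod.log
chmod 0600 /etc/mongod-keyfile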
20
○ Single TCP port
■ MongoDB Client API ■ MongoDB Replication API ■ MongoDB Sharding API
○ Sharding
■ Only the ‘mongos’ process needs access to shards ■ Client driver does not need to reach shards directly
○ Replication
■ All nodes must be accessible to the driver
21
○ Linux Kernel Built-in Security mechanism ○ Massively reduces the attack possibilities on a system by using ACLs/policies ○ Modes
■ Enforcing: Do not allow policy violations ■ Permissive: Log and allow policy violations ■ Disabled: I really don’t like security!
○ Enforcing mode supported with Percona Server for MongoDB when using CentOS / RHEL 7+ RPMs
■ SELinux NOT supported by MongoDB Community or Enterprise binaries!!
22
○ Supported in PSMDB and MongoDB Enterprise ○ The following components are necessary for external authentication to work
■ LDAP Server: Remotely stores all user credentials (i.e. user name and associated password)
■ SASL Daemon: Used as a MongoDB server-local proxy for the remote LDAP service
■ SASL Library: Used by the MongoDB client and server to create authentication mechanism-specific data
○ Creating a User:
db.getSiblingDB("$external").createUser( {user : christian, roles: [{role: "read", db: "test"} ]} );
○ Authenticating as a User:
db.getSiblingDB("$external").auth({ mechanism:"PLAIN", user:"christian", pwd:"secret", digestPassword:false})
○ Other auth methods possible with MongoDB Enterprise
23
○ Supported since MongoDB 2.6.x
■ May need to compile it in yourself on older binaries
■ Supported 100% in Percona Server for MongoDB
○ Minimum of 128-bit key length for security ○ Relaxed and strict (requireSSL) modes ○ System (default) or Custom Certificate Authorities are accepted
○ MongoDB supports x.509 certificate authentication for use with a secure TLS/SSL connection as of 2.6.x. ○ The x.509 client authentication allows clients to authenticate to servers with certificates rather than with a username and password. ○ Enabled with: security.clusterAuthMode: x509
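○ Example (a minimal mongod.conf sketch using 3.x-era option names; certificate paths are placeholders):
net:
  ssl:
    mode: requireSSL
    PEMKeyFile: /etc/ssl/mongodb.pem
    CAFile: /etc/ssl/ca.pem
security:
  clusterAuthMode: x509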
24
○ Encryption supported in Enterprise binaries ($$$)
○ Use CryptFS/LUKS block device for encryption of data volume ○ Documentation published (or coming soon) ○ Completely open-source / Free
○ Selectively encrypt only required fields in application ○ Benefits
■ The data is only readable by the application (reduced touch points) ■ The resource cost of encryption is lower when it’s applied selectively ■ Offloading of encryption overhead from database
25
○ Default port 27017 ○ This does not include monitoring tools, etc
■ Percona PMM requires inbound connectivity to 1-2 TCP ports
○ Application servers only need access to ‘mongos’ ○ Block direct TCP access from application -> shard/mongod instances
■ Unless ‘mongos’ is bound to localhost!
○ Move inter-node replication to own network fabric, VLAN, etc ○ Accept client connections on a Public interface
26
28
○ 60 - 300 seconds is not enough! ○ Problems can begin/end in seconds
○ Store more than you graph
○ Example: PMM gathers 700-900 metrics per polling interval
○ Use to troubleshoot Production events / incidents ○ Iterate and Improve monitoring
■ Add graphing for whatever made you SSH to a host ■ Blind QA with someone unfamiliar with the problem
29
○ Operation counters ○ Cache Traffic and Capacity ○ Checkpoint / Compaction Performance ○ Concurrency Tickets (WiredTiger and RocksDB) ○ Document and Index scanning ○ Various engine-specific details
○ CPU ○ Disk ○ Bandwidth / Util ○ Average Wait Time ○ Memory and Network
30
○ Prometheus
○ Grafana
○ Go Language
32
○ Asynchronous
■ Write Concerns can provide pseudo-synchronous replication
■ Changelog-based, using the "Oplog"
○ Maximum 50 members
○ Maximum 7 voting members
■ Use "votes: 0" for members beyond the 7th voter (see the sketch below)
○ Oplog
■ The "oplog.rs" capped collection in the "local" database, storing changes to data
■ Read by secondary members for replication
■ Written to by the local node after "apply" of the operation
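○ Example (a minimal sketch of removing votes from an extra member; the member index is a placeholder, and non-voting members must also have priority 0):
cfg = rs.conf()
cfg.members[7].votes = 0
cfg.members[7].priority = 0
rs.reconfig(cfg)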
33
○ Minimum of 3 x physical servers required for High-Availability ○ Ensure only 1 x member per Replica Set is on a single physical server!!!
○ Place Replica Set members in an odd number of Availability Zones, in the same region
○ Use a hidden-secondary node for Backup and Disaster Recovery in another region
○ Entire Availability Zones have been lost before!
35
○ Buy some really amazing, expensive hardware ○ Buy some crazy expensive license
■ Don’t run a lot of servers due to above
○ Scale up:
■ Buy even more amazing hardware for monolithic host ■ Hardware came on a truck
○ HA: When it rains, it pours
○ Everything fails, nothing is precious ○ Elastic infrastructures (“The cloud”, Mesos, etc) ○ Scale up: add more cheap, commodity servers ○ HA: lots of cheap, commodity servers - still up!
36
○ Run mongod dbPaths on a separate volume
○ Optionally, run the mongod journal on a separate volume
○ RAID 10 == performance/durability sweet spot
○ RAID 0 == fast and dangerous
○ SSDs benefit MMAPv1 a lot
○ SSDs benefit WT and RocksDB a bit less
○ Keep about 30% free for internal GC on the SSD
37
○ Risks / Drawbacks
■ Exponentially more things to break
■ Block device requests wrapped in TCP is extremely slow
■ You probably already paid for some fast local disks
■ More difficult (sometimes nearly impossible) to troubleshoot
■ MongoDB doesn't really benefit from remote storage features/flexibility
38
○ Lots of cores > faster cores (4 CPU minimum recommended) ○ Thread-per-connection Model
○ 'cpufreq': a daemon for dynamic scaling of the CPU frequency
○ A terrible idea for databases or any predictability!
○ Disable it, or set the governor to 100% frequency always, i.e. mode: 'performance' (see the sketch below)
○ Disable any BIOS-level performance/efficiency tunables
○ ENERGY_PERF_BIAS
■ A CentOS/RedHat tuning for the energy vs performance balance
■ RHEL 6 = 'performance'
■ RHEL 7 = 'normal' (!)
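○ Example (a minimal shell sketch for forcing the 'performance' governor; tool availability varies by distribution):
cpupower frequency-set --governor performance
# or directly via sysfs:
for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > "$f"; done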
39
○ Network Edge
○ Public Server VLAN
■ Servers with Public NAT and/or port forwards from the Network Edge
■ Examples: Proxies, Static Content, etc
■ Calls backends in the Backend VLAN
○ Backend Server VLAN
■ Servers with port forwarding from the Public Server VLAN (w/ Source IP ACLs)
■ Optional load balancer for stateless backends
■ Examples: Webserver, Application Server/Worker, etc
■ Calls data stores in the Data VLAN
○ Data VLAN
■ Servers, filers, etc with port forwarding from the Backend Server VLAN (w/ Source IP ACLs)
■ Examples: Databases, Queues, Filers, Caches, HDFS, etc
40
○ Try to use 10GbE for low latency
○ Use Jumbo Frames for efficiency
○ Try to keep all MongoDB nodes on the same segment
■ Goal: few or no network hops between nodes
■ Check with 'traceroute'
○ Databases don't need to talk to the internet*
■ Store a copy of your Yum, DockerHub, etc repos locally
■ Deny any access to the public internet, or have no route to it
■ Attackers will try to move a dump of your data out of the network!!
○ Try to replicate the above with features of your provider
41
○ Single-instance performance is important, but deal-breaking
○ Not hardware anymore
43
○ Override with the --syncDelay <seconds> flag
○ If a server with no journal crashes it can lose 1 minute of data!!!
○ Synced every 30ms if 'journal' is on a different disk
○ Or every 100ms
○ Or 1/3rd of the above if the change uses the Journaled write concern (explained later)
44
○ Can cause serious slowdowns on scans, range queries, etc
○ db.<collection>.stats()
■ Shows various storage info for a collection
■ Fragmentation can be computed by dividing 'storageSize' by 'size' (see the sketch below)
■ Any value > 1 indicates fragmentation
○ Compact when you near a value of 2, by rebuilding secondaries or using the 'compact' command
○ WiredTiger and RocksDB have little to no fragmentation due to checkpoints / compaction
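○ Example (a minimal mongo shell sketch of the fragmentation check; the collection name is a placeholder):
var s = db.mycollection.stats()
var ratio = s.storageSize / s.size    // > 1 indicates fragmentation, ~2 is time to compact
print("fragmentation ratio: " + ratio.toFixed(2))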
45
○ Checkpoint: every 60 seconds or >= 2GB of data changes
○ Journal buffer size: 128kb
○ Journal synced every 50 ms (as of 3.2)
○ Or on every change with the Journaled write concern (explained later)
○ In between write operations, while the journal records remain in the buffer, updates can be lost following a hard shutdown!
46
○ Built-in Compression ○ Block and Filesystem caches
○ Tiered level compaction
○ Follows the same logic as MMAPv1 for journal buffering
○ A layer between RocksDB and MongoDB’s storage engine API ○ Developed in partnership with Facebook
47
○ In-heap cache
■ 50% of available system memory
■ Uncompressed WT pages
○ Filesystem Cache
■ 50% of available system memory
■ Compressed pages
○ Internal testing planned from Percona in the future ○ 30% in-heap cache recommended by Facebook / Parse
48
○ Default since 2.0 on 64-bit builds
○ Always enable unless data is transient
○ Always enable on cluster config servers
○ Max time between journal syncs
○ Max time between data file flushes
49
○ External monitoring is recommended
○ Will be deprecated in 3.6+
○ In most situations this is not necessary unless
■ You use MMAPv1, and
■ It is a Development / Test environment
■ You have 100s-1000s of databases with very little data inside (unlikely)
○ Unless troubleshooting an issue / intentional
51
52
○ But no databases want to use it :(
○ In the Server BIOS
○ Using 'numactl' in init scripts BEFORE the 'mongod' command (recommended for future compatibility):
numactl --interleave=all /usr/bin/mongod <other flags>
53
○ Reboot the system
■ Disabling THP online does not clear previously-allocated huge pages
■ Rebooting also tests that your system will come back up!
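○ Example (a minimal shell sketch for disabling Transparent HugePages at runtime; persist it via your init system or a tuned profile):
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag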
54
○ mongodb_consistent_backup relies on time sync, for example!
○ “It’s ok if everyone is equally wrong”
○ Run NTP daemon on all MongoDB and Monitoring hosts ○ Enable service so it starts on reboot
○ Check if your VM platform has an “agent” syncing time ○ VMWare and Xen are known to have their own time sync ○ If no time sync provided install NTP daemon
55
○ CFQ ("Completely Fair Queue")
■ Default scheduler in 2.6-era Linux distributions
■ Perhaps too clever/inefficient for database workloads
■ Probably good for a laptop
○ Deadline
■ Best general default IMHO
■ Predictable I/O request latencies
■ Use with virtualised servers
■ Use with real-hardware BBU RAID controllers
56
○ Use XFS or EXT4, not EXT3
■ EXT3 has very poor pre-allocation performance
■ Use XFS only on WiredTiger
■ EXT4 "data=ordered" mode recommended
○ Btrfs not tested, yet!
○ Set 'noatime' on MongoDB data volumes in '/etc/fstab':
○ Remount the filesystem after an options change, or reboot
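○ Example (a minimal /etc/fstab sketch; device, mount point and filesystem are placeholders):
/dev/sdb1  /var/lib/mongo  xfs  defaults,noatime  0 0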
57
○ High read-ahead only helps if:
■ There is a sequential read pattern
■ Something will benefit from the extra cached blocks
○ Too high a read-ahead wastes cache space
○ Increases eviction work
■ MongoDB tends to have very random disk patterns
○ Let MongoDB worry about optimising the pattern
58
○ Add a file to '/etc/udev/rules.d'
■ /etc/udev/rules.d/60-mongodb-disk.rules:
# set deadline scheduler and 32/16kb read-ahead for /dev/sda
ACTION=="add|change", KERNEL=="sda", ATTR{queue/scheduler}="deadline", ATTR{bdi/read_ahead_kb}="16"
○ Reboot (or use CLI tools to apply)
59
○ Dirty pages: stored in-cache, but need to be written to storage
○ vm.dirty_ratio: max percent of total memory that can be dirty
■ The VM stalls and flushes when this limit is reached
■ Start with '10'; the default (30) is too high
○ vm.dirty_background_ratio: a separate threshold for background dirty page flushing
■ Flushes without pauses
■ Start with '3'; the default (15) is too high
60
○ Linux default: 60
○ To avoid disk-based swap: 1 (not zero!)
○ To allow some disk-based swap: 10
○ '0' can cause more swapping than '1' on recent kernels
■ More on this here: https://www.percona.com/blog/2014/04/28/oom-relation-vm-swappiness0-new-kernel/
61
○ Number of User-level Processes
○ Number of Open Files
○ CPU Seconds
○ Scheduling Priority
○ And others…
○ Should probably have a dedicated VM, container or server
○ Creates a new process
■ For every new connection to the Database
■ Plus various background tasks / threads
○ Creates an open file for each active data file on disk
■ 64,000 open files and 64,000 max processes is a good start
62
○ /etc/security/limits.d file ○ Systemd Service ○ Init script
○ Example on left: PSMDB RPM (Systemd)
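○ Example (a minimal sketch of /etc/security/limits.d/99-mongod.conf; the service user may be 'mongodb' on Debian-based systems):
mongod  soft  nofile  64000
mongod  hard  nofile  64000
mongod  soft  nproc   64000
mongod  hard  nproc   64000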
63
○ Add the above sysctl tunings to /etc/sysctl.conf ○ Run “/sbin/sysctl -p” as root to set the tunings ○ Run “/sbin/sysctl -a” to verify the changes
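○ Example (a minimal /etc/sysctl.conf sketch using the starting values recommended in the earlier slides):
vm.dirty_ratio = 10
vm.dirty_background_ratio = 3
vm.swappiness = 1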
64
65
○ A "framework" for applying tunings to Linux
■ RedHat/CentOS 7 only for now
■ Debian added tuned, not sure if compatible yet
■ Cannot tune NUMA, file system type or fs mount opts
■ Sysctls, THP, I/O sched, etc
○ https://github.com/Percona-Lab/tuned-percona-mongodb
“The problem with troubleshooting is trouble shoots back” ~ Unknown
67
○ Collection-level locks
○ Document-level locks
○ Software mutex/semaphore
○ Max connections
○ Operation rate limits
○ Resource limits
○ Lack of IOPS, RAM, CPU, network, etc
68
○ System CPU
○ FS cache
○ Networking
○ Disk I/O
○ Threading
○ Compression (WiredTiger and RocksDB)
○ Session Management
○ BSON (de)serialisation
○ Filtering / scanning / sorting
69
○ Optimiser
○ Disk
○ Data file read/writes
○ Journaling
○ Error logging
○ Network
○ Query request/response
○ Replication
○ Journaling ○ Oplog Reads / Writes ○ Background Flushing / Compactions / etc
70
○ Page Faults (data not in cache) ○ Swapping
○ Client API ○ Replication ○ Sharding
■ Chunk Moves ■ Mongos -> Shards
71
○ Original Query ○ Parsed Query ○ Query Runtime ○ Locking details
○ { "$ownOps": true } == Only show operations for the current user ○ https://docs.mongodb.com/manual/reference/method/db.currentOp/#examples
72
○ Document-data size (dataSize) ○ Index-data size (indexSize) ○ Real-storage size (storageSize) ○ Average Object Size ○ Number of Indexes ○ Number of Objects
73
74
○ Slow queries
○ Storage engine details (sometimes)
○ Index operations
○ Sharding
■ Chunk moves
○ Elections / Replication
○ Authentication
○ Network
■ Connections
75
2017-09-19T20:58:03.896+0200 I COMMAND [conn175] command config.locks appName: "MongoDB Shell" command: findAndModify { findAndModify: "locks", query: { ts: ObjectId('59c168239586572394ae37ba') }, update: { $set: { state: 0 } }, writeConcern: { w: "majority", wtimeout: 15000 }, maxTimeMS: 30000 } planSummary: IXSCAN { ts: 1 } update: { $set: { state: 0 } } keysExamined:1 docsExamined:1 nMatched:1 nModified:1 keysInserted:1 keysDeleted:1 numYields:0 reslen:604 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 } }, Collection: { acquireCount: { w: 1 } }, Metadata: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_command 106ms
76
○ Capped collection "system.profile" in each database, 1mb by default
○ The collection is capped, i.e. profile data doesn't last forever
○ Start with a very high threshold and decrease it in steps
○ Usually 50-100ms is a good threshold
○ Enable in mongod.conf:
operationProfiling:
  mode: slowOp
  slowOpThresholdMs: 100
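○ Example (a minimal sketch of enabling and reading the profiler at runtime, per database):
db.setProfilingLevel(1, 100)                                        // profile operations slower than 100ms
db.system.profile.find().sort({ millis: -1 }).limit(5).pretty()     // slowest recent operations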
77
○ keysExamined: # of index keys examined
○ docsExamined: # of docs examined to achieve the result
○ writeConflicts: # of write conflicts encountered during the update (e.g. concurrent updates to the same document)
○ numYields: # of times the operation yielded to others
○ locks: detailed lock statistics
78
○ Winning Plan
■ Query stages
■ Index chosen by the optimiser
○ Rejected Plans
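○ Example (a minimal sketch; collection and filter are placeholders):
db.orders.find({ customerId: 12345 }).explain("executionStats")
// inspect: queryPlanner.winningPlan, queryPlanner.rejectedPlans,
// executionStats.totalKeysExamined, executionStats.totalDocsExamined, executionStats.nReturned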
79
80
○ Use .find() queries to view Cluster Metadata
○ actionlog (3.0+)
○ changelog
○ databases
○ collections
○ shards
○ chunks
○ settings
○ mongos
○ locks
○ lockpings
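○ Example (a minimal sketch of reading cluster metadata from a mongos; the namespace is a placeholder):
use config
db.databases.find()
db.chunks.find({ ns: "shop.orders" }).sort({ min: 1 }).limit(5)
db.mongos.find({}, { _id: 1, ping: 1 })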
81
○ i.e.: "{ item: 123456 }" -> "{ item: ##### }"
82
83
mlogfilter --scan <file>
Shows all collection scan queries
mlogfilter --slow <ms> <file>
Shows all queries that are slower than X milliseconds
mlogfilter --op <op-type> <file>
Shows all queries of the operation type X (e.g. find, aggregate, etc)
84
○ removeShard Doesn’t Complete
■ Check the ‘dbsToMove’ array of the removeShard response
mongos> db.adminCommand({removeShard:"test2"}) { "msg" : "draining started successfully", "state" : "started", "shard" : "test2", "note" : "you need to drop or movePrimary these databases", "dbsToMove" : [ "wikipedia" ], "ok" : 1 }
■ Why?
mongos> use config switched to db config mongos> db.databases.find() { "_id" : "wikipedia", "primary" : "test2", "partitioned" : true }
85
○ removeShard Doesn't Complete
■ Try dropping or running movePrimary on each listed database, then re-run removeShard
○ removeShard starts the draining of the shard
○ If the draining and removing is complete, removeShard will respond with success
○ Jumbo Chunks
■ Will prevent balancing from occurring
■ The config.chunks collection document will contain jumbo: true as a key/value pair
■ Sharding 'split' commands can be used to reduce the chunk size (sh.splitAt, etc)
■ https://www.percona.com/blog/2016/04/11/dealing-with-jumbo-chunks-in-mongodb/
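○ Example (a minimal sketch of finding and splitting a jumbo chunk; namespace and shard key value are placeholders):
use config
db.chunks.find({ jumbo: true })
sh.splitAt("shop.orders", { customerId: 500000 })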
“The problem with troubleshooting is trouble shoots back” ~ Unknown
87
○ Only use strings if required
○ Do not store numbers as strings!
○ Look for { field: "123456" } instead of { field: 123456 }
■ "12345678" moved to an integer uses 25% less space
■ Range queries on proper integers are more efficient
○ Example JavaScript to convert a field in an entire collection:
db.items.find().forEach(function(x) {
    var newItemId = parseInt(x.itemId);
    db.items.update({ _id: x._id }, { $set: { itemId: newItemId } });
});
88
○ Do not store dates as strings!
■ Stored as a BSON date instead of the string "2017-08-17 10:00:04 CEST", the field uses 52.5% less space!
○ Do not store booleans as strings!
■ “true” -> true = 47% less space wasted
○ DBRefs provide pointers to another document ○ DBRefs can be cross-collection
○ Higher precision for floating-point numbers
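○ Example (a minimal sketch of preferring native BSON types over strings; collection, fields and values are placeholders):
db.events.insert({
    createdAt: ISODate("2017-08-17T10:00:04Z"),   // not "2017-08-17 10:00:04 CEST"
    active: true,                                 // not "true"
    itemId: NumberLong(12345678),                 // not "12345678"
    price: NumberDecimal("19.99")                 // Decimal128, 3.4+
})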
89
○ Default behaviour
○ Runs indexing in the background, avoiding pauses
○ Hard to monitor and troubleshoot progress
○ Unpredictable performance impact
○ Use real performance data to make indexing decisions, find out before Production!
○ Try to cover .sort() with index and match direction!
90
○ Several fields supported ○ Fields can be in forward or backward direction
■ Consider any .sort() query options and match sort direction!
○ Composite Keys are read Left -> Right
■ The index can be partially read
■ Left-most fields do not need to be duplicated!
■ All indexes below are duplicates:
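○ Example (a minimal sketch of the left-most prefix rule; collection and fields are placeholders):
db.orders.createIndex({ customerId: 1, status: 1, createdAt: -1 })
// also served by the index above, so separate indexes on these would be duplicates:
//   { customerId: 1 }
//   { customerId: 1, status: 1 }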
91
○ Index: keysExamined / nreturned ○ Document: docsExamined / nreturned
○ Tip: when using covered indexes zero documents are fetched (docsExamined: 0)! ○ Example: a query scanning 10 documents to return 1 has efficiency 0.1 ○ Scanning zero docs is possible if using a covered index!
92
○ Great cache/disk-footprint efficiency
○ Centralised schemas may create a hotspot for write locking
○ MongoDB rarely stores data sequentially on disk
○ Multi-document operations are less efficient
○ Less potential for hotspots/write locking
○ Increased overhead due to fan-out of updates
○ Example: Social Media status update, graph relationships, etc
○ More on this later..
93
○ Read-heavy apps benefit from pre-computed results
○ Consider moving expensive read computation to insert/update/delete time
○ Example 1: An app does 'count' queries often
■ Move the .count() read query to a summary document with counters (see the sketch below)
■ Increment/decrement a single count value at write-time
○ Example 2: An app that does groupings of data
■ Move the .aggregate() read query that is in-line to the user to a backend summary worker
■ Read from a summary collection, like a view
○ Reduce indexing as much as possible
○ Consider batching or a decentralised model with lazy updating (eg: social media graph)
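○ Example (a minimal sketch of Example 1, maintaining a counter at write-time; collection and key names are placeholders):
db.orders.insert({ _id: 1001, customerId: 42, total: 19.99 })
db.summary.update(
    { _id: "orders_customer_42" },
    { $inc: { count: 1 } },
    { upsert: true }
)
// the read side becomes a single document fetch instead of a .count()
db.summary.findOne({ _id: "orders_customer_42" })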
94
○ Requires fewer network commands
○ Allows the server to do some internal batching
○ Operations will be slower overall
○ Suited for queue-worker scenarios batching many changes
○ Traditional user-facing database traffic should aim to operate on a single (or few) document(s)
○ 1 x DB operation = 1 x CPU core only
○ Executing Parallel Reads
■ Large batch queries benefit from several parallel sessions
■ Break the query range or conditions into several client->server threads
■ Not recommended for Primary nodes or Secondaries with heavy reads
95
○ MongoDB returns entire documents unless fields are specified ○ Only return the fields required for an application operation! ○ Covered-index operations require only the index fields to be specified
○ This executes JavaScript with a global lock
○ MongoDB (or any RDBMS) doesn’t handle large lists of $and or $or efficiently ○ Try to avoid this sort of model with
■ Data locality ■ Background Summaries / Views
96
○ Decentralised ○ Data is eventually written in many locations ○ Complex write path (several updates)
■ Good use-case for Queue/Worker model ■ Batching possible
○ Simple read path (data locality)
○ Centralised ○ Simple Write path
■ Possible Write locking
○ Complex Read Path
■ Potential for latency due to network
“The problem with troubleshooting is trouble shoots back” ~ Unknown
98
○ Online Marketing / Publishing
■ Paid for clicks coming in ■ Downtime = revenue + traffic (paid for) loss
○ Warehousing / Pricing SaaS
■ Store real items in warehouses/stores/etc ■ Downtime = many businesses (customers)/warehouses/etc at stand-still ■ Integrity problems =
○ Moved on to Gaming, Percona
2010
99
○ Finds the last point of consistency to disk
○ Searches the journal file(s) for the record matching the checkpoint
○ Applies all changes in the journal since the last point of consistency
100
○ Allow control of the data integrity of a write to a Replica Set
○ Write Concern Modes
■ "w: <num>" - Writes must be acknowledged by the defined number of nodes
■ "majority" - Writes must be acknowledged by a majority of nodes
■ "<replica set tag>" - Writes must be acknowledged by a member with the specified replica set tags
○ Durability
■ By default write concerns are NOT durable
■ "j: true" - Optionally, wait for node(s) to acknowledge journaling of the operation
■ In 3.4+ "writeConcernMajorityJournalDefault" allows enforcement of "j: true" via the replica set configuration!
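○ Example (a minimal sketch of a durable, majority-acknowledged write; collection and timeout are placeholders):
db.orders.insert(
    { _id: 1002, status: "new" },
    { writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
)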
101
○ A PRIMARY writes 10 documents with a w:1 Write Concern to the oplog, then dies
○ SECONDARY (2x) nodes applied 5 and 7 of the changes written
○ The SECONDARY with 7 changes wins the PRIMARY election
○ The PRIMARY that died comes back alive
○ The old PRIMARY becomes RECOVERING, then SECONDARY
○ 3 documents are "rolled back" to disk
■ A BSON file is written to the 'rollback' dir on disk when a PRIMARY crashes while ahead of the SECONDARYs
■ Monitor for this file existing on disk!!
102
○ "local" - Default; returns the current node's most-recent version of the data
○ "majority" - Reads return the most-recent version of the data that has been acknowledged on a majority of nodes. Not supported on MMAPv1.
○ "linearizable" (3.4+) - Reads return data that reflects a "majority" read of all changes prior to the read
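○ Example (a minimal sketch, 3.2+ for "majority" on non-MMAPv1 engines; collection and filter are placeholders):
db.orders.find({ customerId: 42 }).readConcern("majority")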
103
○ Ordered by time
○ Written to locally after apply-time of
■ a Client API change
■ or a replication change
○ A crashed node will resume replication using the last position from its local oplog
○ Size of the Oplog
■ Monitor this closely!
■ The length of time from start to end of the oplog affects the impact of adding new nodes
■ If a node is brought online with a backup within the window it avoids a full sync
■ If a node is brought online with a backup older than the window it will full sync!!!
○ Due to async replication, lag is possible
○ Use of Read Concerns and/or Write Concerns can work around this!
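○ Example (a minimal sketch of checking the oplog window on a member):
rs.printReplicationInfo()                          // oplog size and "log length start to end"
db.getSiblingDB("local").oplog.rs.stats().maxSize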
104
○ Lots of Replica Set members
○ Read and Write Concern
○ Proper Geolocation / Node Redundancy
○ Cheaper, non-redundant storage becomes possible
■ JBOD
■ RAID 0
■ InMemory (faster)
105
○ Store a Hidden Secondary in another location
○ Sharding: Look into Tag-aware Sharding ○ Replica Set: Multi-locations of members
“The problem with troubleshooting is trouble shoots back” ~ Unknown
107
○ “true” vs true ○ “123456” vs 123456 ○ "2017-09-19T16:50:58.347Z" vs ISODate("2017-09-19T16:50:58.347Z")
○ Read-heavy apps benefit from pre-computed results ○ Consider moving expensive reads computation to insert/update/delete
■ Example: move .count() read query to a summary-document, increment/decrement summary count at write-time
○ MongoDB is fast, memcache is even faster (although very simple)
108
○ primary (default) ○ primaryPreferred ○ secondary ○ secondaryPreferred (recommended for Read Scaling!) ○ nearest
○ Select nodes based on key/value pairs (one or more) ○ Often used for
■ Datacenter awareness, eg: { “dc”: “eu-east” } ■ Specific workflows, eg: Analytics, BI, Batch summaries, Backups
109
○ rs.add() new Replicas with no data triggers initial sync from Primary!
■ This can also happen if the backup is too old
○ 50 member maximum, Primary included
○ Ensure configuration replSetName and key files match ○ Logical Restore
■ mongorestore a mongodump-based backup (containing the oplog)
■ Use bsondump to find the last document in the "oplog.bson" file
■ Create the oplog and insert the last "oplog.bson" document:
use local
db.runCommand({ create: "oplog.rs", capped: true, size: (20 * 1024 * 1024 * 1024) });
db.oplog.rs.insert(<last doc>);
110
○ Logical Restore
■ Create the oplog and insert the last "oplog.bson" document
■ Gather the "system.replset" document from the "local" database of an existing node:
use local
db.system.replset.find()
■ Insert the "system.replset" document into the "local" database of the new node:
use local
db.system.replset.insert(< document >);
■ On the Primary, add the new node with rs.reconfig() or rs.add()
“The problem with troubleshooting is trouble shoots back” ~ Unknown
112
○ ‘mongos’ binary is the router ○ Applications connect to the Sharding Router ○ Abstraction layer to driver
○ Read by the Sharding Routers ('mongos')
○ Config Servers
■ 1+ 'mongod' servers storing the metadata
■ 3.4+ config servers require Replica Sets (CSRS)
○ 1+ ‘mongod’ instances that store an exclusive piece of the cluster data ○ Often a Replica Set
■ 3 x members or more recommended
○ Standalone nodes are supported (misconception)
113
○ databases: Sharding-enabled databases ○ collections: Sharding-enabled collections
■ Shard Key ■ Shard Primary
○ chunks: List of collection chunks
■ Chunk range of shard key ■ Mapping of responsible shard
○ shards: List of shards in cluster ○ actionlog: Log of performed actions ○ locks: Distributed Locks ○ lockpings: Pings to Distributed Locks ○ mongos: List of mongos instances (online and offline, check ‘ping’)
114
115
116
117
○ Useful for app that will scale up eventually ○ Start with 1 shard and a good shard key ○ Add replica set members and/or shards as the system scales
○ Chunks that grow larger than 64mb are split by the shard Primary
■ Chunk splits impact write performance!
○ Pre-creating an even distribution of chunks spanning the shard key range improves write performance greatly!
○ Example: a shard key with a possible value of 1-100, pre-split on 10 shards, will result in 10 pre-split chunks per shard
○ More on this topic: https://docs.mongodb.com/manual/tutorial/create-chunks-in-sharded-cluster/
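○ Example (a minimal sketch of pre-splitting a numeric shard key range; namespace, key and split points are placeholders):
sh.enableSharding("shop")
sh.shardCollection("shop.scores", { score: 1 })
for (var i = 10; i < 100; i += 10) {
    sh.splitAt("shop.scores", { score: i })
}
// optionally move chunks onto specific shards before loading data
sh.moveChunk("shop.scores", { score: 10 }, "shard02")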
118
○ Isolate a subset of data on a specific set of shards
○ Ensure that the most relevant data resides on shards that are geographically closest to the application servers
○ Route data based on Hardware / Performance
119
○ Use "nearest" Read Preference to route reads to the closest servers
○ Use shard tags to route ops to the correct shard(s)
○ At least 2 x per datacenter or running locally to application servers
○ Replica Set based (CSRS) set required ○ At least 2 x Config Servers per Datacenter ○ 50 max members in Config Server set ○ No votes required (7 voter limit)
120
122
○ Previously did not exist in MongoDB except TokuMX
○ bindIp set to localhost ○ Addition of source IP restrictions
○ Watch this feature closely! ○ NTP may no longer be necessary
123
124