Running MongoDB in Production - Tim Vaillancourt, Sr. Technical Operations Architect, Percona



slide-1
SLIDE 1

Running MongoDB in Production

Tim Vaillancourt Sr Technical Operations Architect, Percona

slide-2
SLIDE 2

2

`whoami`

{ name: "tim", lastname: "vaillancourt", employer: "percona", techs: [ "mongodb", "mysql", "cassandra", "redis", "rabbitmq", "solr", "python", "golang" ] }

slide-3
SLIDE 3

3

Agenda

  • Backups
  • Security
  • Monitoring
  • Architecture and High-Availability
  • Hardware
  • Tuning MongoDB
  • Tuning Linux
  • Troubleshooting
  • Schema
  • Data Integrity
  • Scaling (Read/Writes)
slide-4
SLIDE 4

4

  • Data

○ Document: single *SON object, often nested
○ Field: single field in a document
○ Collection: grouping of documents
○ Database: grouping of collections
○ Capped Collection: a fixed-size FIFO collection

  • Replication

○ Oplog: a special capped collection for replication
○ Primary: a replica set node that can receive writes
○ Secondary: a replica of the Primary that is read-only
○ Voting: a process to elect a Primary node
○ Hidden-Secondary: a replica that cannot become Primary

Terminology

slide-5
SLIDE 5

Backups

“An admin is only worth the backups they keep” ~ Unknown

slide-6
SLIDE 6

6

Backups: Logical

  • ‘mongodump’ tool from mongo-tools project
  • Supports

○ Multi-threaded dumping in 3.2+
○ Optional gzip compression of data
○ Optional dumping of oplog for single-node consistency
○ Replica set awareness (Read Preference)

■ i.e.: primary, primaryPreferred, secondary, secondaryPreferred, nearest

  • Process

○ Tool issues .find() with $snapshot query
○ Stores BSON data in a file per collection
○ Stores BSON oplog data in "oplog.bson"
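
For illustration, a minimal mongodump invocation matching the process above (replica set name, hostname and output path are placeholders):

# dump a secondary with oplog capture and gzip compression
mongodump --host rs0/mongodb1.example.net:27017 --readPreference secondaryPreferred --oplog --gzip --out /backup/$(date +%Y%m%d)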

slide-7
SLIDE 7

7

Backups: Logical

  • Limitations

○ Not Sharding aware
○ Sharded backups are not Point-in-Time consistent
○ Fetching from the storage engine, serialization, networking, etc. is very inefficient
○ Indexes are rebuilt in serial (per-collection)

■ Index metadata only in backup
■ Often the indexing process takes longer than restoring the data!

○ Wire Protocol Compression (added in 3.4+) not supported: https://jira.mongodb.org/browse/TOOLS-1668 (Please vote/watch the issue!)
○ Requires the oplog to be as long as the backup run time

slide-8
SLIDE 8

8

Backups: Binary

  • Options

○ Cold Backup
○ LVM Snapshot
○ Hot Backup

■ Percona Server for MongoDB (FREE!)
■ MongoDB Enterprise Hot Backup (non-free)
■ NOTE: MMAPv1 not supported

  • Benefits

○ Indexes are backed up == faster restore!
○ Storage-engine format backed up == faster backup AND restore!

  • Limitations

○ Increased backup storage requirements
○ Compression is storage-engine dependent

slide-9
SLIDE 9

9

Backups: Binary

  • Limitations

○ CPU Architecture limitations (64-bit vs 32-bit)
○ Cascading corruption
○ Batteries not included

■ Not Sharding aware
■ Not Replica Set aware

  • Process

○ Cold Backup

■ Stop a mongod SECONDARY, copy/archive dbPath

○ LVM Snapshot

■ Optionally call 'db.fsyncLock()' (not required in 3.2+ with Journalling)
■ Create LVM snapshot of the dbPath
■ Copy/Archive dbPath

slide-10
SLIDE 10

10

Backups: Binary

  • Process

○ LVM Snapshot

■ Remove LVM snapshot (as quickly as possible!)
■ NOTE: LVM snapshots can cause up to 30%* write latency impact to disk (due to COW)

○ Hot Backup (PSMDB or MongoDB Enterprise)

■ Pay $$$ for MongoDB Enterprise or download PSMDB for free(!)
■ db.adminCommand({ createBackup: 1, backupDir: "/data/mongodb/backup" })
■ Copy/archive the output path
■ Delete the backup output path
■ NOTE: RocksDB-based createBackup creates filesystem hardlinks whenever possible!
■ NOTE: Delete RocksDB backupDir as soon as possible to reduce bloom filter overhead!

slide-11
SLIDE 11

11

Backups: Architecture

  • Risks

○ Dynamic nature of Replica Set ○ Impact of backup on live nodes

  • Example: Cheap Disaster-Recovery

○ Place a ‘hidden: true’ SECONDARY in another location ○ Optionally use cloud object store (AWS S3, Google GS, etc)

  • Example: Replica Set Tags

○ "tags" allow fine-grained server selection with key/value pairs
○ Use key/value pairs to fence various application workflows
○ Example:

■ { "role": "backup" } == Backup Node
■ { "role": "application" } == App Node
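
A minimal sketch of assigning such a tag to a replica set member from the mongo shell (the member index and tag values are illustrative):

cfg = rs.conf()
cfg.members[2].tags = { "role": "backup" }   // mark member #2 as the backup node
rs.reconfig(cfg)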

slide-12
SLIDE 12

12

Backups: mongodb_consistent_backup

  • Python project by Percona-Lab for consistent backups
  • URL: https://github.com/Percona-Lab/mongodb_consistent_backup
  • Best-effort support, not a “Percona Product”
  • Created to solve limitations in MongoDB backup tools:

○ Replica Set and Sharded Cluster awareness
○ Cluster-wide Point-in-time consistency
○ In-line Oplog backup (vs post-backup)
○ Notifications of success / failure

  • Extra Features

○ Remote Upload (AWS S3, Google Cloud Storage and Rsync)
○ Archiving (Tar or ZBackup deduplication and optional AES-at-rest)
○ CentOS/RHEL7 RPMs and Docker-based releases (.deb soon!)

slide-13
SLIDE 13

13

Backups: mongodb_consistent_backup

  • Extra Features

○ Single Python PEX binary
○ Multithreaded / Concurrent
○ Auto-scales to available CPUs

  • Low-Impact

○ Tool focuses on low impact
○ Uses Secondary nodes only
○ Considers (Scoring)

■ Replication Lag
■ Replication Priority
■ Replication Health / State
■ Hidden-Secondary State (preferred by tool)
■ Fails if chosen Secondary becomes Primary (on purpose)

slide-14
SLIDE 14

14

Backups: mongodb_consistent_backup

  • 1.2.0

○ Multi-threaded Rsync Upload
○ Replica Set Tags support
○ Support for MongoDB SSL / TLS connections and client auth
○ Rotation / Expiry of old backups (locally-stored only)

  • Future

○ Incremental Backups
○ Binary-level Backups (Hot Backup, Cold Backup, LVM, Cloud-based, etc)
○ More Notification Methods (PagerDuty, Email, etc)
○ Restore Helper Tool
○ Instrumentation / Metrics
○ <YOUR AWESOME IDEA HERE> we take GitHub PRs (and it's Python)!

slide-15
SLIDE 15

15

Backups: mongodb_consistent_backup

  • Simple Restore

○ Seamless restore: "mongorestore --oplogReplay --gzip --dir /path/to/backup"

  • Restore an Entire Cluster

○ Mongorestore backups of config servers

■ If restoring old/SCCC config servers, restore to every node
■ If restoring replica-set config servers:

  • Ensure Replica Set is initiated (rs.initiate() / rs.config())
  • Ensure SECONDARY members are added (via PRIMARY)
  • Restore to PRIMARY only

○ Update "config.shards" documents if shard hosts/ports changed
○ Mongorestore each shard from its backup subdirectory (matches shard name)
○ Start mongos process and test / QA

■ Tip: stopping the balancer may simplify troubleshooting any problems

slide-16
SLIDE 16

16

More on Backups (some overlap)

Room: Field Suite #2 Time: Wednesday, 16:30 - 16:55

slide-17
SLIDE 17

Security

“Think of the network like a public place” ~ Unknown

slide-18
SLIDE 18

18

Security: Authorization

  • Always enable auth on Production Installs!
  • Built-in Roles

○ Database User: Read or Write data from collections

■ “All Databases” or Single-database

○ Database Admin: Non-RW commands (create/drop/list/etc)
○ Backup and Restore:
○ Cluster Admin: Add/Drop/List shards
○ Superuser/Root: All capabilities

  • User-Defined Roles

○ Exact Resource+Action specification ○ Very fine-grained ACLs

■ DB + Collection specific
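
As an illustration of a user-defined role scoped to a single database and collection (role, database and collection names are hypothetical):

db.getSiblingDB("admin").createRole({
  role: "reportsReader",
  privileges: [
    { resource: { db: "app", collection: "reports" }, actions: [ "find" ] }
  ],
  roles: []
})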

slide-19
SLIDE 19

19

Security: Filesystem Access

  • Use a service user+group

○ ‘mongod’ or ‘mongodb’ on most systems ○ Ensure data path, log file and key file(s) are owned by this user+group

  • Data Path

○ Mode: 0750

  • Log File

○ Mode: 0640 ○ Contains real queries and their fields!!!

■ See Log Redaction for PSMDB (or MongoDB Enterprise) to remove these fields

  • Key File(s)

○ Files Include: keyFile and SSL certificates or keys ○ Mode: 0600
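
A quick sketch of applying those ownership and mode recommendations (paths and the service user are examples; adjust to your install):

chown -R mongod:mongod /var/lib/mongo /var/log/mongodb /etc/mongodb-keyfile
chmod 0750 /var/lib/mongo                  # data path
chmod 0640 /var/log/mongodb/mongod.log     # log file
chmod 0600 /etc/mongodb-keyfile            # keyFile / certificates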

slide-20
SLIDE 20

20

Security: Network Access

  • Firewall

○ Single TCP port

■ MongoDB Client API
■ MongoDB Replication API
■ MongoDB Sharding API

○ Sharding

■ Only the ‘mongos’ process needs access to shards ■ Client driver does not need to reach shards directly

○ Replication

■ All nodes must be accessible to the driver

  • Internal Authentication: Use a key to use inter-node replication/sharding
  • Creating a dedicated network segment for Databases is recommended!
  • DO NOT allow MongoDB to talk to the internet, under any circumstances!!!
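
For internal authentication, a common way to generate and deploy a key file (the path is an example):

openssl rand -base64 756 > /etc/mongodb-keyfile
chmod 0600 /etc/mongodb-keyfile
chown mongod:mongod /etc/mongodb-keyfile
# then reference it on every node via security.keyFile in mongod.conf
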
slide-21
SLIDE 21

21

Security: System Access

  • Recommended to restrict system access to Database Administrators
  • A “shell” on a system can be enough to take the system over!
  • SELinux

○ Linux Kernel Built-in Security mechanism
○ Massively reduces the attack possibilities on a system by using ACLs/policies
○ Modes

■ Enforcing: Do not allow policy violations
■ Permissive: Log and allow policy violations
■ Disabled: I really don't like security!

○ Enforcing mode supported with Percona Server for MongoDB when using CentOS / RHEL 7+ RPMs

■ SELinux NOT supported by MongoDB Community or Enterprise binaries!!

slide-22
SLIDE 22

22

Security: External Authentication

  • LDAP Authentication

○ Supported in PSMDB and MongoDB Enterprise ○ The following components are necessary for external authentication to work

■ LDAP Server: Remotely stores all user credentials (i.e. user name and associated password).
■ SASL Daemon: Used as a MongoDB server-local proxy for the remote LDAP service.
■ SASL Library: Used by the MongoDB client and server to create authentication mechanism-specific data.

○ Creating a User:

db.getSiblingDB("$external").createUser( { user: "christian", roles: [ { role: "read", db: "test" } ] } );

○ Authenticating as a User:

db.getSiblingDB("$external").auth({ mechanism:"PLAIN", user:"christian", pwd:"secret", digestPassword:false})

○ Other auth methods possible with MongoDB Enterprise

slide-23
SLIDE 23

23

Security: SSL Connections and Auth

  • SSL / TLS Connections

○ Supported since MongoDB 2.6x

■ May need to compile it in yourself on older binaries
■ Supported 100% in Percona Server for MongoDB

○ Minimum of 128-bit key length for security
○ Relaxed and strict (requireSSL) modes
○ System (default) or Custom Certificate Authorities are accepted

  • SSL Client Authentication (x509)

○ MongoDB supports x.509 certificate authentication for use with a secure TLS/SSL connection as of 2.6.x.
○ x.509 client authentication allows clients to authenticate to servers with certificates rather than with a username and password.
○ Enabled with: security.clusterAuthMode: x509
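
A sketch of the relevant mongod.conf options for a 3.x-era deployment (file paths are placeholders; verify option names against your server version):

net:
  ssl:
    mode: requireSSL
    PEMKeyFile: /etc/ssl/mongodb.pem
    CAFile: /etc/ssl/mongodb-ca.pem
security:
  clusterAuthMode: x509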

slide-24
SLIDE 24

24

Security: Encryption at Rest

  • MongoDB Enterprise

○ Encryption supported in Enterprise binaries ($$$)

  • Percona Server for MongoDB

○ Use CryptFS/LUKS block device for encryption of data volume ○ Documentation published (or coming soon) ○ Completely open-source / Free

  • Application-Level

○ Selectively encrypt only required fields in application ○ Benefits

■ The data is only readable by the application (reduced touch points)
■ The resource cost of encryption is lower when it's applied selectively
■ Offloading of encryption overhead from the database

slide-25
SLIDE 25

25

Security: Network Firewall

  • MongoDB only requires a single TCP port to be reachable (to all nodes)

○ Default port 27017 ○ This does not include monitoring tools, etc

■ Percona PMM requires inbound connectivity to 1-2 TCP ports

  • Restrict TCP port access to nodes that require it!
  • Sharded Cluster

○ Application servers only need access to ‘mongos’ ○ Block direct TCP access from application -> shard/mongod instances

■ Unless ‘mongos’ is bound to localhost!

  • Advanced

○ Move inter-node replication to own network fabric, VLAN, etc ○ Accept client connections on a Public interface
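
For illustration, an iptables rule pair restricting the MongoDB port to an application subnet, per the access restrictions above (addresses are examples):

# allow the app subnet to reach port 27017, drop everything else
iptables -A INPUT -p tcp --dport 27017 -s 10.0.1.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 27017 -j DROP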

slide-26
SLIDE 26

26

More on Security (some overlap)

Room: Field Suite #2 Time: Tuesday, 17:25 to 17:50

slide-27
SLIDE 27

Monitoring

slide-28
SLIDE 28

28

Monitoring: Methodology

  • Monitor often

○ 60 - 300 seconds is not enough! ○ Problems can begin/end in seconds

  • Correlate Database and Operating System together!
  • Monitor a lot

○ Store more than you graph
○ Example: PMM gathers 700-900 metrics per poll

  • Process

○ Use to troubleshoot Production events / incidents ○ Iterate and Improve monitoring

■ Add graphing for whatever made you SSH to a host ■ Blind QA with someone unfamiliar with the problem

slide-29
SLIDE 29

29

Monitoring: Important Metrics

  • Database

○ Operation counters
○ Cache Traffic and Capacity
○ Checkpoint / Compaction Performance
○ Concurrency Tickets (WiredTiger and RocksDB)
○ Document and Index scanning
○ Various engine-specific details

  • Operating System

○ CPU
○ Disk
○ Bandwidth / Util
○ Average Wait Time
○ Memory and Network

slide-30
SLIDE 30

30

Monitoring: Percona PMM

  • Open-source monitoring from Percona!
  • Based on open-source technology

Prometheus

Grafana

Go Language

  • Simple deployment
  • Examples in this demo are from PMM!
  • Correlation of OS and DB Metrics
  • 800+ metrics per ping
slide-31
SLIDE 31

Architecture and High-Availability

slide-32
SLIDE 32

32

High Availability

  • Replication

○ Asynchronous

■ Write Concerns can provide pseudo-synchronous replication
■ Changelog based, using the "Oplog"

○ Maximum of 50 members
○ Maximum of 7 voting members

■ Use "votes: 0" for members $gt 7 (see the example below)

○ Oplog

■ The "oplog.rs" capped collection in "local" storing changes to data
■ Read by secondary members for replication
■ Written to by the local node after "apply" of the operation
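
A sketch of adding a non-voting member via the mongo shell (hostname and _id are placeholders):

cfg = rs.conf()
cfg.members.push({ _id: 8, host: "mongodb8.example.net:27017", votes: 0, priority: 0 })
rs.reconfig(cfg)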

slide-33
SLIDE 33

33

Architecture

  • Datacenter Recommendations

○ Minimum of 3 x physical servers required for High-Availability ○ Ensure only 1 x member per Replica Set is on a single physical server!!!

  • EC2 / Cloud Recommendations

○ Place Replica Set members in an odd number of Availability Zones, same region
○ Use a hidden-secondary node for Backup and Disaster Recovery in another region
○ Entire Availability Zones have been lost before!

slide-34
SLIDE 34

Hardware

slide-35
SLIDE 35

35

Hardware: Mainframe vs Commodity

  • Databases: The Past

○ Buy some really amazing, expensive hardware ○ Buy some crazy expensive license

■ Don’t run a lot of servers due to above

○ Scale up:

■ Buy even more amazing hardware for monolithic host ■ Hardware came on a truck

○ HA: When it rains, it pours

  • Databases: A New Era

○ Everything fails, nothing is precious ○ Elastic infrastructures (“The cloud”, Mesos, etc) ○ Scale up: add more cheap, commodity servers ○ HA: lots of cheap, commodity servers - still up!

slide-36
SLIDE 36

36

Hardware: Block Devices

  • Isolation

Run Mongod dbPaths on separate volume

Optionally, run Mongod journal on separate volume

  • RAID Level

RAID 10 == performance/durability sweet spot

RAID 0 == fast and dangerous

  • SSDs

Benefit MMAPv1 a lot

Benefit WT and RocksDB a bit less

Keep about 30% free for internal GC on the SSD

slide-37
SLIDE 37

37

Hardware: Block Devices

  • EBS / NFS / iSCSI

Risks / Drawbacks

Exponentially more things to break

Block device requests wrapped in TCP is extremely slow

You probably already paid for some fast local disks

More difficult (sometimes nearly-impossible) to troubleshoot

MongoDB doesn’t really benefit from remote storage features/flexibility

  • Built-in High-Availability of data via replication
  • MongoDB replication can bootstrap new members
  • Strong write concerns can be specified for critical data
slide-38
SLIDE 38

38

Hardware: CPUs

  • Cores vs Core Speed

○ Lots of cores > faster cores (4 CPU minimum recommended) ○ Thread-per-connection Model

  • CPU Frequency Scaling

○ 'cpufreq': a daemon for dynamic scaling of the CPU frequency
○ Terrible idea for databases or any predictability!
○ Disable or set the governor to 100% frequency always, i.e. mode: 'performance'
○ Disable any BIOS-level performance/efficiency tuneable
○ ENERGY_PERF_BIAS

■ A CentOS/RedHat tuning for energy vs performance balance
■ RHEL 6 = 'performance'
■ RHEL 7 = 'normal' (!)

  • My advice: use ‘tuned’ to set to ‘performance’
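
For illustration, two common ways to pin the CPU governor (the tuned profile name here is a stock RHEL 7 profile, not necessarily the slide's exact choice):

cpupower frequency-set -g performance      # set the cpufreq governor directly
tuned-adm profile latency-performance      # or apply a performance-oriented tuned profile
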
slide-39
SLIDE 39

39

Hardware: Network Infrastructure

  • Datacenter Tiers

○ Network Edge ○ Public Server VLAN

■ Servers with Public NAT and/or port forwards from Network Edge ■ Examples: Proxies, Static Content, etc ■ Calls backends in Backend VLAN

○ Backend Server VLAN

■ Servers with port forwarding from Public Server VLAN (w/Source IP ACLs) ■ Optional load balancer for stateless backends ■ Examples: Webserver, Application Server/Worker, etc ■ Calls data stores in Data VLAN

○ Data VLAN

■ Servers, filers, etc with port forwarding from Backend Server VLAN (w/Source IP ACLs) ■ Examples: Databases, Queues, Filers, Caches, HDFS, etc

slide-40
SLIDE 40

40

Hardware: Network Infrastructure

  • Network Fabric

○ Try to use 10GbE for low latency
○ Use Jumbo Frames for efficiency
○ Try to keep all MongoDB nodes on the same segment

■ Goal: few or no network hops between nodes ■ Check with ‘traceroute’

  • Outbound / Public Access

○ Databases don’t need to talk to the internet*

■ Store a copy of your Yum, DockerHub, etc repos locally
■ Deny any access to the Public internet or have no route to it
■ Hackers will try to upload a dump of your data out of the network!!

  • Cloud?

○ Try to replicate the above with features of your provider

slide-41
SLIDE 41

41

Hardware: Why So Quick?

  • MongoDB allows you to scale reads and writes with more nodes

○ Single-instance performance is important, but not a deal-breaker

  • You are the most expensive resource!

○ Not hardware anymore

slide-42
SLIDE 42

Tuning MongoDB

slide-43
SLIDE 43

43

Tuning MongoDB: MMAPv1

  • A kernel-level function to map file blocks to memory
  • MMAPv1 syncs data to disk once per 60 seconds (default)

Override with the --syncDelay <seconds> flag

If a server with no journal crashes it can lose 1 min of data!!!

  • In memory buffering of Journal

Synced every 30ms if 'journal' is on a different disk

Or every 100ms

Or 1/3rd of above if change uses Journaled write concern (explained later)

slide-44
SLIDE 44

44

Tuning MongoDB: MMAPv1

  • Fragmentation

Can cause serious slowdowns on scans, range queries, etc

db.<collection>.stats()

Shows various storage info for a collection

Fragmentation can be computed by dividing ‘storageSize’ by ‘size’

Any value > 1 indicates fragmentation

Compact when you near a value of 2 by rebuilding secondaries or using the ‘compact’ command

WiredTiger and RocksDB have little to no fragmentation due to checkpoints / compaction
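
A quick mongo shell sketch of that calculation (collection name is a placeholder):

var s = db.mycoll.stats()
s.storageSize / s.size    // > 1 indicates fragmentation; consider compacting as this nears 2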

slide-45
SLIDE 45

45

Tuning MongoDB: WiredTiger

  • WT syncs data to disk in a process called

“Checkpointing”:

Every 60 seconds or >= 2GB data changes

  • In-memory buffering of Journal

Journal buffer size 128kb

Synced every 50 ms (as of 3.2)

Or every change with Journaled write concern (explained later)

In between write operations while the journal records remain in the buffer, updates can be lost following a hard shutdown!

slide-46
SLIDE 46

46

Tuning MongoDB: RocksDB

  • Level-based strategy using immutable data

level files

○ Built-in Compression ○ Block and Filesystem caches

  • RocksDB uses “compaction” to apply

changes to data files

Tiered level compaction

Follows same logic as MMAPv1 for journal buffering

  • MongoRocks

○ A layer between RocksDB and MongoDB’s storage engine API ○ Developed in partnership with Facebook

slide-47
SLIDE 47

47

Tuning MongoDB: Storage Engine Caches

  • WiredTiger

○ In heap

■ 50% available system memory ■ Uncompressed WT pages

○ Filesystem Cache

■ 50% available system memory ■ Compressed pages

  • RocksDB

○ Internal testing planned from Percona in the future ○ 30% in-heap cache recommended by Facebook / Parse

slide-48
SLIDE 48

48

Tuning MongoDB: Durability

  • storage.journal.enabled = <true/false>

Default since 2.0 on 64-bit builds

Always enable unless data is transient

Always enable on cluster config servers

  • storage.journal.commitIntervalMs = <ms>

Max time between journal syncs

  • storage.syncPeriodSecs = <secs>

Max time between data file flushes
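
A sketch of those settings in mongod.conf YAML (the values shown are illustrative, not recommendations):

storage:
  journal:
    enabled: true
    commitIntervalMs: 100
  syncPeriodSecs: 60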

slide-49
SLIDE 49

49

Tuning MongoDB: Don’t Enable!

  • “cpu”

○ External monitoring is recommended

  • “rest”

○ Will be deprecated in 3.6+

  • “smallfiles”

○ In most situations this is not necessary unless

■ You use MMAPv1, and
■ It is a Development / Test environment
■ You have 100s-1000s of databases with very little data inside (unlikely)

  • Profiling mode ‘2’

○ Unless troubleshooting an issue / intentional

slide-50
SLIDE 50

Tuning Linux

slide-51
SLIDE 51

51

Tuning Linux: The Linux Kernel

  • Linux 2.6.x?
  • Avoid Linux earlier than 3.10.x - 3.12.x
  • Large improvements in parallel efficiency in 3.10+ (for Free!)
  • More: https://blog.2ndquadrant.com/postgresql-vs-kernel-versions/
slide-52
SLIDE 52

52

Tuning Linux: NUMA

  • A memory architecture that takes into account the

locality of memory, caches and CPUs for lower latency

○ But no databases want to use it :(

  • MongoDB codebase is not NUMA “aware”, causing

unbalanced memory allocations on NUMA systems

  • Disable NUMA

In the Server BIOS

Using ‘numactl’ in init scripts BEFORE ‘mongod’ command (recommended for future compatibility):

numactl --interleave=all /usr/bin/mongod <other flags>

slide-53
SLIDE 53

53

Tuning Linux: Transparent HugePages

  • Introduced in RHEL/CentOS 6, Linux 2.6.38+
  • Merges memory pages in background (Khugepaged process)
  • Decreases overall performance when used with MongoDB!
  • “AnonHugePages” in /proc/meminfo shows usage
  • Disable TransparentHugePages!
  • Add “transparent_hugepage=never” to kernel command-line (GRUB)

Reboot the system

■ Disabling online does not clear previous TH pages
■ Rebooting tests that your system will come back up!
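
For illustration, the GRUB approach described above (file locations and commands vary by distro; this is the RHEL/CentOS 7 pattern):

# /etc/default/grub
GRUB_CMDLINE_LINUX="... transparent_hugepage=never"
# rebuild the grub config, then reboot
grub2-mkconfig -o /boot/grub2/grub.cfg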

slide-54
SLIDE 54

54

Tuning Linux: Time Source

  • Replication and Clustering needs consistent clocks

○ mongodb_consistent_backup relies on time sync, for example!

  • Use a consistent time source/server

○ “It’s ok if everyone is equally wrong”

  • Non-Virtualized

○ Run NTP daemon on all MongoDB and Monitoring hosts ○ Enable service so it starts on reboot

  • Virtualised

○ Check if your VM platform has an “agent” syncing time ○ VMWare and Xen are known to have their own time sync ○ If no time sync provided install NTP daemon

slide-55
SLIDE 55

55

Tuning Linux: I/O Scheduler

  • Algorithm kernel uses to commit reads and writes to disk
  • CFQ

“Completely Fair Queue”

Default scheduler in 2.6-era Linux distributions

Perhaps too clever/inefficient for database workloads

Probably good for a laptop

  • Deadline

Best general default IMHO

Predictable I/O request latencies

  • Noop

Use with virtualised servers

Use with real-hardware BBU RAID controllers

slide-56
SLIDE 56

56

Tuning Linux: Filesystems

  • Filesystem Types

Use XFS or EXT4, not EXT3

EXT3 has very poor pre-allocation performance

Use XFS only on WiredTiger

EXT4 “data=ordered” mode recommended

Btrfs not tested, yet!

  • Filesystem Options

Set ‘noatime’ on MongoDB data volumes in ‘/etc/fstab’:

Remount the filesystem after an options change, or reboot
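
The slide's fstab entry is not included in this export; an illustrative line (device, mount point and filesystem are placeholders):

/dev/sdb1  /var/lib/mongo  xfs  defaults,noatime  0 0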

slide-57
SLIDE 57

57

Tuning Linux: Block Device Readahead

  • Tuning that causes data ahead of a block on disk to be read and then

cached

  • Assumption:

There is a sequential read pattern

Something will benefit from the extra cached blocks

  • Risk

Too high wastes cache space

Increases eviction work

MongoDB tends to have very random disk patterns

  • A good start for MongoDB volumes is a ’32’ (16kb) read-ahead

○ Let MongoDB worry about optimising the pattern

slide-58
SLIDE 58

58

Tuning Linux: Block Device Readahead

  • Change ReadAhead

Add file to ‘/etc/udev/rules.d’

/etc/udev/rules.d/60-mongodb-disk.rules:

# set deadline scheduler and 32/16kb read-ahead for /dev/sda
ACTION=="add|change", KERNEL=="sda", ATTR{queue/scheduler}="deadline", ATTR{bdi/read_ahead_kb}="16"

Reboot (or use CLI tools to apply)

slide-59
SLIDE 59

59

Tuning Linux: Virtual Memory Dirty Pages

  • Dirty Pages

Pages stored in-cache, but needs to be written to storage

  • Dirty Ratio

Max percent of total memory that can be dirty

VM stalls and flushes when this limit is reached

Start with ’10’, default (30) too high

  • Dirty Background Ratio

Separate threshold for background dirty page flushing

Flushes without pauses

Start with ‘3’, default (15) too high
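
Those starting values expressed as /etc/sysctl.conf entries:

vm.dirty_ratio = 10
vm.dirty_background_ratio = 3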

slide-60
SLIDE 60

60

Tuning Linux: Swappiness

  • A Linux kernel sysctl setting for preferring

RAM or disk for swap

Linux default: 60

To avoid disk-based swap: 1 (not zero!)

To allow some disk-based swap: 10

‘0’ can cause more swapping than ‘1’ on recent kernels

■ More on this here: https://www.percona.com/blog/2014/04/28/oom-relation-vm-swappiness0-new-kernel/

slide-61
SLIDE 61

61

Tuning Linux: Ulimit

  • Allows per-Linux-user resource constraints

Number of User-level Processes

Number of Open Files

CPU Seconds

Scheduling Priority

And others…

  • MongoDB

Should probably have a dedicated VM, container or server

Creates a new process

■ For every new connection to the Database ■ Plus various background tasks / threads

○ Creates an open file for each active data file on disk

■ 64,000 open files and 64,000 max processes is a good start
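
An illustrative /etc/security/limits.d entry implementing that starting point (the service user name may differ on your system):

mongod  soft  nofile  64000
mongod  hard  nofile  64000
mongod  soft  nproc   64000
mongod  hard  nproc   64000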

slide-62
SLIDE 62

62

Tuning Linux: Ulimit

  • Setting ulimits

○ /etc/security/limits.d file ○ Systemd Service ○ Init script

  • Ulimits are set by

Percona and MongoDB packages!

○ Example on left: PSMDB RPM (Systemd)

slide-63
SLIDE 63

63

Tuning Linux: Network Stack

  • Defaults are not good for > 100mbps Ethernet
  • Suggested starting point (see the illustrative sysctl example below):
  • Set Network Tunings:

○ Add the sysctl tunings to /etc/sysctl.conf
○ Run "/sbin/sysctl -p" as root to set the tunings
○ Run "/sbin/sysctl -a" to verify the changes
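
The slide's exact sysctl values are not in this export; the entries below are illustrative examples of the kind of network tunings the linked Percona post discusses, not the slide's originals:

net.core.somaxconn = 4096
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_max_syn_backlog = 4096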

slide-64
SLIDE 64

64

Tuning Linux: More on this...

https://www.percona.com/blog/2016/08/12/tuning-linux-for-mongodb/

slide-65
SLIDE 65

65

Tuning Linux: “Tuned”

  • Tuned

A “framework” for applying tunings to Linux

RedHat/CentOS 7 only for now

Debian added tuned, not sure if compatible yet

Cannot tune NUMA, file system type or fs mount opts

Sysctls, THP, I/O sched, etc

  • My apology to the community for writing “Tuning Linux for MongoDB”:

○ https://github.com/Percona-Lab/tuned-percona-mongodb

slide-66
SLIDE 66

Troubleshooting

“The problem with troubleshooting is trouble shoots back” ~ Unknown

slide-67
SLIDE 67

67

Troubleshooting: Usual Suspects

  • Locking

Collection-level locks

Document-level locks

Software mutex/semaphore

  • Limits

Max connections

Operation rate limits

Resource limits

  • Resources

Lack of IOPS, RAM, CPU, network, etc

slide-68
SLIDE 68

68

Troubleshooting: MongoDB Resources

  • Memory
  • CPU

System CPU

FS cache

Networking

Disk I/O

Threading

  • User CPU (MongoDB)

Compression (WiredTiger and RocksDB)

Session Management

BSON (de)serialisation

Filtering / scanning / sorting

slide-69
SLIDE 69

69

Troubleshooting: MongoDB Resources

  • User CPU (MongoDB)

Optimiser

Disk

Data file read/writes

Journaling

Error logging

Network

Query request/response

Replication

  • Disk I/O

○ Journaling ○ Oplog Reads / Writes ○ Background Flushing / Compactions / etc

slide-70
SLIDE 70

70

Troubleshooting: MongoDB Resources

  • Disk I/O

○ Page Faults (data not in cache) ○ Swapping

  • Network

○ Client API ○ Replication ○ Sharding

■ Chunk Moves ■ Mongos -> Shards

slide-71
SLIDE 71

71

Troubleshooting: db.currentOp()

  • A function that dumps status info about running operations and various

lock/execution details

  • Only queries currently in progress are shown.
  • Provided Query ID number can be used to kill long running queries.
  • Includes

○ Original Query
○ Parsed Query
○ Query Runtime
○ Locking details

  • Filter Documents

○ { "$ownOps": true } == Only show operations for the current user
○ https://docs.mongodb.com/manual/reference/method/db.currentOp/#examples
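
For illustration, finding long-running operations and killing one by its opid (the opid value is an example):

db.currentOp({ "active": true, "secs_running": { "$gt": 5 } })   // ops running longer than 5 seconds
db.killOp(12345)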

slide-72
SLIDE 72

72

Troubleshooting: db.stats()

  • Returns

○ Document-data size (dataSize) ○ Index-data size (indexSize) ○ Real-storage size (storageSize) ○ Average Object Size ○ Number of Indexes ○ Number of Objects

slide-73
SLIDE 73

73

Troubleshooting: db.currentOp()

slide-74
SLIDE 74

74

Troubleshooting: Log File

  • Interesting details are logged to the mongod/mongos log files

Slow queries

Storage engine details (sometimes)

Index operations

Sharding

Chunk moves

Elections / Replication

Authentication

Network

Connections

  • Errors
  • Client / Inter-node connections
slide-75
SLIDE 75

75

Troubleshooting: Log File - Slow Query

2017-09-19T20:58:03.896+0200 I COMMAND [conn175] command config.locks appName: "MongoDB Shell" command: findAndModify { findAndModify: "locks", query: { ts: ObjectId('59c168239586572394ae37ba') }, update: { $set: { state: 0 } }, writeConcern: { w: "majority", wtimeout: 15000 }, maxTimeMS: 30000 } planSummary: IXSCAN { ts: 1 } update: { $set: { state: 0 } } keysExamined:1 docsExamined:1 nMatched:1 nModified:1 keysInserted:1 keysDeleted:1 numYields:0 reslen:604 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 } }, Collection: { acquireCount: { w: 1 } }, Metadata: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_command 106ms

slide-76
SLIDE 76

76

Troubleshooting: Operation Profiler

  • Writes slow database operations to a new MongoDB collection for

analysis

Capped Collection “system.profile” in each database, default 1mb

The collection is capped, ie: profile data doesn’t last forever

  • Support for operationProfiling data in Percona Monitoring and Management is among current/future goals

  • Enable operationProfiling in “slowOp” mode

Start with a very high threshold and decrease it in steps

Usually 50-100ms is a good threshold

Enable in mongod.conf

operationProfiling:
  slowOpThresholdMs: 100
  mode: slowOp

slide-77
SLIDE 77

77

Troubleshooting: Operation Profiler

  • Useful Profile Metrics

  • op/ns/query: operation type, namespace and query of a profiled operation

keysExamined: # of index keys examined

docsExamined: # of docs examined to achieve result

writeConflicts: # of write conflicts (WriteConflictExceptions) encountered during update

numYields: # of times operation yielded for others

locks: detailed lock statistics

slide-78
SLIDE 78

78

Troubleshooting: .explain()

  • Shows the query explain plan for query cursors
  • This will include

○ Winning Plan

■ Query stages

  • Query stages may include sharding info in clusters

■ Index chosen by optimiser

○ Rejected Plans
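
A quick example of requesting execution statistics for a query (collection and filter are placeholders):

db.mycoll.find({ status: "active" }).explain("executionStats")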

slide-79
SLIDE 79

79

Troubleshooting: .explain() and Profiler

slide-80
SLIDE 80

80

Troubleshooting: Cluster Metadata

  • The “config” database on Cluster Config servers

○ Use .find() queries to view Cluster Metadata

  • Contains

actionlog (3.0+)

changelog

databases

collections

shards

chunks

settings

mongos

locks

lockpings

slide-81
SLIDE 81

81

Troubleshooting: Percona PMM QAN

  • The Query Analytics tool enables DBAs and developers to analyze queries over periods of time and find performance problems.
  • Helps you optimise database performance by making sure that queries are executed as expected and within the shortest time possible.
  • Central, web-based location for visualising data.
  • Data is collected by an agent from the MongoDB Profiler (required).
  • Great for reducing access to systems while providing valuable data to development teams!
  • Query Normalization

○ i.e.: "{ item: 123456 }" -> "{ item: ##### }"

  • Command-line Equivalent: pt-mongodb-query-digest tool
slide-82
SLIDE 82

82

Troubleshooting: Percona PMM QAN

slide-83
SLIDE 83

83

Troubleshooting: mlogfilter

  • A useful tool for processing mongod.log files
  • A log-aware replacement for ‘grep’, ‘awk’ and friends
  • Generally focus on

mlogfilter --scan <file>

Shows all collection scan queries

mlogfilter --slow <ms> <file>

Shows all queries that are slower than X milliseconds

mlogfilter --op <op-type> <file>

Shows all queries of the operation type X (eg: find, aggregate, etc)

  • More on this tool here

https://github.com/rueckstiess/mtools/wiki/mlogfilter

slide-84
SLIDE 84

84

Troubleshooting: Common Problems

  • Sharding

○ removeShard Doesn’t Complete

■ Check the ‘dbsToMove’ array of the removeShard response

mongos> db.adminCommand({ removeShard: "test2" })
{
  "msg" : "draining started successfully",
  "state" : "started",
  "shard" : "test2",
  "note" : "you need to drop or movePrimary these databases",
  "dbsToMove" : [ "wikipedia" ],
  "ok" : 1
}

■ Why?

mongos> use config
switched to db config
mongos> db.databases.find()
{ "_id" : "wikipedia", "primary" : "test2", "partitioned" : true }

slide-85
SLIDE 85

85

Troubleshooting: Common Problems

  • Sharding

○ removeShard Doesn’t Complete

■ Try

  • Use movePrimary to move database(s) Primary-role to other shards
  • Run the removeShard command once the shard being removed is NOT primary for any database

○ This starts the draining of the shard

  • Run the same removeShard command to check on progress.

If the draining and removing is complete this will respond with success

○ Jumbo Chunks

■ Will prevent balancing from occurring
■ The config.chunks collection document will contain jumbo:true as a key/value pair
■ Sharding 'split' commands can be used to reduce the chunk size (sh.splitAt, etc)
■ https://www.percona.com/blog/2016/04/11/dealing-with-jumbo-chunks-in-mongodb/

slide-86
SLIDE 86

Schema Design & Workflow

“The problem with troubleshooting is trouble shoots back” ~ Unknown

slide-87
SLIDE 87

87

Schema Design: Data Types

  • Strings

○ Only use strings if required
○ Do not store numbers as strings!
○ Look for {field: "123456"} instead of {field: 123456}

■ "12345678" moved to an integer uses 25% less space
■ Range queries on proper integers are more efficient

○ Example JavaScript to convert a field in an entire collection

■ db.items.find().forEach(function(x) {
    var newItemId = parseInt(x.itemId);                              // convert the string to an integer
    db.items.update({ _id: x._id }, { $set: { itemId: newItemId } });
  });

slide-88
SLIDE 88

88

Schema Design: Data Types

  • Strings

○ Do not store dates as strings!

■ The field "2017-08-17 10:00:04 CEST" stored as a BSON date uses 52.5% less space!

○ Do not store booleans as strings!

■ “true” -> true = 47% less space wasted

  • DBRefs

○ DBRefs provide pointers to another document ○ DBRefs can be cross-collection

  • NumberDecimal (3.4+)

○ A higher-precision decimal type (Decimal128) for floating-point numbers

slide-89
SLIDE 89

89

Schema Design: Indexes

  • MongoDB supports BTree, text and geo indexes

Default behaviour

  • Collection lock until indexing completes
  • {background:true}

Runs indexing in the background avoiding pauses

Hard to monitor and troubleshoot progress

Unpredictable performance impact

  • Avoid drivers that auto-create indexes

○ Use real performance data to make indexing decisions, find out before Production!

  • Too many indexes hurt write performance for an entire collection
  • Indexes have a forward or backward direction

○ Try to cover .sort() with index and match direction!

slide-90
SLIDE 90

90

Schema Design: Indexes

  • Compound Indexes

○ Several fields supported ○ Fields can be in forward or backward direction

■ Consider any .sort() query options and match sort direction!

○ Composite Keys are read Left -> Right

■ Index can be partially read
■ Left-most fields do not need to be duplicated!
■ All indexes below the first are duplicates (redundant prefixes):

  • {username: 1, status: 1, date: 1, count: -1}
  • {username: 1, status: 1, date: 1}
  • {username: 1, status: 1 }
  • {username: 1 }
  • Use db.collection.getIndexes() to view current Indexes
slide-91
SLIDE 91

91

Schema Design: Query Efficiency

  • Query Efficiency Ratios

○ Index: keysExamined / nreturned ○ Document: docsExamined / nreturned

  • End goal: Examine only as many Index Keys/Docs as you return!

○ Tip: when using covered indexes zero documents are fetched (docsExamined: 0)!
○ Example: a query scanning 10 documents to return 1 has an efficiency of 0.1
○ Scanning zero docs is possible if using a covered index!

slide-92
SLIDE 92

92

Schema Workflow

  • MongoDB optimised for single-document operations
  • Single Document / Centralised

Great cache/disk-footprint efficiency

Centralised schemas may create a hotspot for write locking

  • Multi Document / Decentralised

○ MongoDB rarely stores data sequentially on disk
○ Multi-document operations are less efficient
○ Less potential for hotspots/write locking
○ Increased overhead due to fan-out of updates
○ Example: Social Media status update, graph relationships, etc
○ More on this later..

slide-93
SLIDE 93

93

Schema Workflow

  • Read Heavy Workflow

Read-heavy apps benefit from pre-computed results

Consider moving expensive reads computation to insert/update/delete

Example 1: An app does ‘count’ queries often

Move .count() read query to a summary document with counters

Increment/decrement single count value at write-time

Example 2: An app that does groupings of data

Move .aggregate() read query that is in-line to the user to a backend summary worker

Read from a summary collection, like a view

  • Write Heavy Workflow

Reduce indexing as much as possible

Consider batching or a decentralised model with lazy updating (eg: social media graph)

slide-94
SLIDE 94

94

Schema Workflow

  • Batching Inserts/Updates

Requires less network commands

Allows the server to do some internal batching

Operations will be slower overall

Suited for queue worker scenarios batching many changes

Traditional user-facing database traffic should aim to operate on a single (or few) document(s)

  • Thread-per-connection model

○ 1 x DB operation = 1 x CPU core only ○ Executing Parallel Reads

■ Large batch queries benefit from several parallel sessions
■ Break query range or conditions into several client->server threads
■ Not recommended for Primary nodes or Secondaries with heavy reads

slide-95
SLIDE 95

95

Schema Workflow

  • No list of fields specified in .find()

○ MongoDB returns entire documents unless fields are specified
○ Only return the fields required for an application operation!
○ Covered-index operations require only the index fields to be specified

  • Using $where operators

○ This executes JavaScript with a global lock

  • Many $and or $or conditions

○ MongoDB (or any RDBMS) doesn’t handle large lists of $and or $or efficiently ○ Try to avoid this sort of model with

■ Data locality ■ Background Summaries / Views

slide-96
SLIDE 96

96

Fan Out / Fan In

  • Fan-Out Systems

○ Decentralised ○ Data is eventually written in many locations ○ Complex write path (several updates)

■ Good use-case for Queue/Worker model ■ Batching possible

○ Simple read path (data locality)

  • Fan-In

○ Centralised ○ Simple Write path

■ Possible Write locking

○ Complex Read Path

■ Potential for latency due to network

slide-97
SLIDE 97

Data Integrity

“The problem with troubleshooting is trouble shoots back” ~ Unknown

slide-98
SLIDE 98

98

Data Integrity: `whoami` (continued)

  • Very Paranoid
  • Previous RDBMs

○ Online Marketing / Publishing

■ Paid for clicks coming in ■ Downtime = revenue + traffic (paid for) loss

○ Warehousing / Pricing SaaS

■ Store real items in warehouses/stores/etc ■ Downtime = many businesses (customers)/warehouses/etc at stand-still ■ Integrity problems =

  • Orders shipped but not paid for
  • Orders paid for but not shipped, etc

○ Moved on to Gaming, Percona

  • So why MongoDB?

2010

slide-99
SLIDE 99

99

Data Integrity: Storage and Journaling

  • The Journal provides durability in the event of failure of the

server

  • Changes are written ahead to the journal for each write operation
  • On crash recovery, the server

Finds the last point of consistency to disk

Searches the journal file(s) for the record matching the checkpoint

Applies all changes in the journal since the last point of consistency

  • Journal data is stored in the ‘journal’ subdirectory of the

server data path (dbPath)

  • Dedicated disks for data (random I/O) and journal (sequential

I/O) improve performance

slide-100
SLIDE 100

100

Data Integrity: Write Concern

  • MongoDB Replication is Asynchronous
  • Write Concerns

○ Allow control of data integrity of a write to a Replica Set
○ Write Concern Modes

■ "w: <num>" - Writes must be acknowledged by the defined number of nodes
■ "majority" - Writes must be acknowledged on a majority of nodes
■ "<replica set tag>" - Writes are acknowledged by a member with the specified replica set tags

○ Durability

■ By default write concerns are NOT durable
■ "j: true" - Optionally, wait for node(s) to acknowledge journaling of the operation
■ In 3.4+ "writeConcernMajorityJournalDefault" allows enforcement of "j: true" via replica set configuration!

  • Must specify "j: false" or alter "writeConcernMajorityJournalDefault" to disable
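
A quick mongo shell illustration of setting a write concern per operation (collection and document are placeholders):

db.orders.insert({ item: "abc", qty: 1 }, { writeConcern: { w: "majority", j: true, wtimeout: 5000 } })
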
slide-101
SLIDE 101

101

Data Integrity: Replica Set Rollbacks

  • Consider this when using “w:1” Write Concern

A PRIMARY writes 10 documents with w:1 Write Concern to the oplog, then dies

SECONDARY (2x) nodes applied 5 and 7 of the changes written

The SECONDARY with 7 changes wins PRIMARY election

The PRIMARY that died comes back alive

The old-PRIMARY node becomes RECOVERING then SECONDARY

3 documents are “rolled back” to disk

A BSON file is written to the 'rollback' dir on disk when a PRIMARY crashes while ahead of the SECONDARYs

Monitor for this file existing on disk!!

  • Risk: the application and/or end-user thinks this was written!
  • Majority Write Concern and correct Read Concern can avoid this!
slide-102
SLIDE 102

102

Data Integrity: Read Concern

  • New feature in MongoDB and PSMDB 3.2+
  • Like write concerns, the consistency of reads can be

tuned per session or operation

  • Levels

“local” - Default, return the current node’s most-recent version of the data

"majority" - Reads return the most-recent version of the data that has been ack'd on a majority of nodes. Not supported on MMAPv1.

“linearizable” (3.4+) - Reads return data that reflects a “majority” read of all changes prior to the read

slide-103
SLIDE 103

103

Data Integrity: Replication

  • Oplog

○ Ordered by time ○ Written to locally after apply-time of

■ Client API change ■ Or replication change

○ A crashed node will resume replication using last position from local oplog ○ Size of Oplog

■ Monitor this closely!
■ The length of time from start to end of the oplog affects the impact of adding new nodes
■ If a node is brought online with a backup within the window it avoids a full sync
■ If a node is brought online with a backup older than the window it will full sync!!!

  • Lag

○ Due to async replication, lag is possible
○ Use of Read Concerns and/or Write Concerns can work around this!

slide-104
SLIDE 104

104

Data Integrity: Replication

  • Replset as Data Redundancy (use at own risk)

Lots of Replset Members

Read and Write Concern

Proper Geolocation/Node Redundancy

Cheaper, non-redundant storage becomes possible

■ JBOD ■ RAID0 ■ InMemory (faster)

slide-105
SLIDE 105

105

Data Integrity: Disaster Recovery

  • Storing data in many physical locations provides improved recovery options and integrity
  • Hidden Secondaries

○ Store a Hidden Secondary in another location

  • Upload completed backups to another location
  • Use a geo-distributed architecture

○ Sharding: Look into Tag-aware Sharding ○ Replica Set: Multi-locations of members

slide-106
SLIDE 106

Scaling Reads

“The problem with troubleshooting is trouble shoots back” ~ Unknown

slide-107
SLIDE 107

107

Scaling Reads: Schema and Summaries

  • Correct data types

○ “true” vs true ○ “123456” vs 123456 ○ "2017-09-19T16:50:58.347Z" vs ISODate("2017-09-19T16:50:58.347Z")

  • Predictive Workflow

○ Read-heavy apps benefit from pre-computed results ○ Consider moving expensive reads computation to insert/update/delete

■ Example: move .count() read query to a summary-document, increment/decrement summary count at write-time

  • External Caching

○ MongoDB is fast, memcache is even faster (although very simple)

slide-108
SLIDE 108

108

Scaling Reads: Read Preference

  • Modes

○ primary (default)
○ primaryPreferred
○ secondary
○ secondaryPreferred (recommended for Read Scaling!)
○ nearest

  • Tags

○ Select nodes based on key/value pairs (one or more)
○ Often used for

■ Datacenter awareness, eg: { "dc": "eu-east" }
■ Specific workflows, eg: Analytics, BI, Batch summaries, Backups
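
For illustration, setting a read preference with a tag from the mongo shell (the tag mirrors the example above):

db.getMongo().setReadPref("secondaryPreferred", [ { "dc": "eu-east" } ])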

slide-109
SLIDE 109

109

Scaling Reads: Secondary Members

  • Linear* scale-out of reads, utilization of all nodes!
  • Be aware

○ rs.add() new Replicas with no data triggers initial sync from Primary!

■ This can also happen if the backup is too old

○ 50 member maximum, Primary included

  • Adding a Replica

○ Ensure configuration replSetName and key files match ○ Logical Restore

■ mongorestore a mongodump-based backup (containing oplog)
■ Use bsondump to find the last document in the "oplog.bson" file
■ Create the oplog and insert the last "oplog.bson" document

use local
db.runCommand({ create: "oplog.rs", capped: true, size: (20 * 1024 * 1024 * 1024) });
db.oplog.rs.insert(<last doc>);

slide-110
SLIDE 110

110

Scaling Reads: Secondary Members

  • Adding a Replica

○ Logical Restore

■ Create the oplog and insert the last "oplog.bson" document
■ Gather the "system.replset" document from the "local" database of an existing node:

use local
db.system.replset.find()
{ … }

■ Insert the "system.replset" document into the "local" database of the new node:

use local
db.system.replset.insert(< document >);

■ On the Primary, add the new node with rs.reconfig() or rs.add()

  • Adding the node as “hidden: true” during recovery causes less visible lag
slide-111
SLIDE 111

Scaling Writes (Sharding)

“The problem with troubleshooting is trouble shoots back” ~ Unknown

slide-112
SLIDE 112

112

Sharding: Components

  • Sharding Routers

○ ‘mongos’ binary is the router ○ Applications connect to the Sharding Router ○ Abstraction layer to driver

  • Config Metadata

○ Read by the Sharding Routers ('mongos')
○ Config Servers

■ 1+ 'mongod' servers storing the metadata
■ 3.4+ config servers require Replica Sets

  • Shards

○ 1+ 'mongod' instances that each store an exclusive piece of the cluster data
○ Often a Replica Set

■ 3 x members or more recommended

○ Standalone nodes are supported (misconception)

slide-113
SLIDE 113

113

Sharding: Config Metadata

  • Metadata Collections

○ databases: Sharding-enabled databases ○ collections: Sharding-enabled collections

■ Shard Key ■ Shard Primary

○ chunks: List of collection chunks

■ Chunk range of shard key ■ Mapping of responsible shard

○ shards: List of shards in cluster
○ actionlog: Log of performed actions
○ locks: Distributed Locks
○ lockpings: Pings to Distributed Locks
○ mongos: List of mongos instances (online and offline, check 'ping')

slide-114
SLIDE 114

114

Sharding: Shard Key

  • The shard key determines the distribution of the collection’s documents

among the cluster’s shards.

  • The shard key is either an indexed field or indexed compound fields that

exists in every document in the collection.

  • MongoDB partitions data in the collection using ranges of shard key

values.

  • Each range defines a non-overlapping range of shard key values and is

associated with a chunk.
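
A brief sketch of enabling sharding and choosing a shard key from a mongos (database, collection and key are hypothetical):

sh.enableSharding("app")
sh.shardCollection("app.orders", { customerId: 1 })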

slide-115
SLIDE 115

115

Sharding: Chunks

  • A contiguous range of shard key values within a particular shard
  • Chunk ranges are inclusive of the lower boundary and exclusive of the

upper boundary (minKey and maxKey)

  • MongoDB splits chunks when they grow beyond the configured chunk

size, which by default is 64 megabytes.

  • MongoDB migrates chunks when a shard contains too many chunks
slide-116
SLIDE 116

116

Sharding: Balancer

  • The balancer is a background process that monitors the number of

chunks on each shard.

  • When the number of chunks on a given shard reaches thresholds, the

balancer attempts to automatically migrate chunks between shards and reach an equal number of chunks per shard.

  • The balancer itself does not move any data! The data is migrated directly from shard to shard.

  • Serial operation in pre-3.4, multiple migrations possible in 3.4+
slide-117
SLIDE 117

117

Sharding: Deployment Methods

  • Small Deployment / Getting Started

○ Useful for app that will scale up eventually ○ Start with 1 shard and a good shard key ○ Add replica set members and/or shards as the system scales

  • Chunk Pre-Splitting

○ Chunks that grow larger than 64mb are split by the shard Primary

■ Chunk splits impact write performance!

○ Pre-creating an even distribution of chunks spanning the shard key range improves write performance greatly!
○ Example: A shard key with possible values of 1-100 pre-split across 10 shards will result in 10 pre-split chunks per shard
○ More on this topic: https://docs.mongodb.com/manual/tutorial/create-chunks-in-sharded-cluster/

slide-118
SLIDE 118

118

Sharding: Tag Aware Sharding

  • In sharded clusters, you can tag specific

ranges of the shard key and associate those tags with a shard or subset of shards.

  • MongoDB routes reads and writes that fall

into a tagged range only to those shards configured for the tag.

  • Use Cases

Isolate a subset of data on a specific set of shards.

Ensure that the most relevant data reside on shards that are geographically closest to the application servers.

Route data based on Hardware / Performance
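
For illustration, associating a shard with a tag and routing a shard key range to it (shard name, tag, namespace and range are hypothetical):

sh.addShardTag("shard0000", "EU")
sh.addTagRange("app.orders", { customerId: MinKey }, { customerId: 50000 }, "EU")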

slide-119
SLIDE 119

119

Sharding: Multi-Datacenter

  • Application Usage

○ Use "nearest" Read Preference to route reads to the closest servers
○ Use shard tags to route ops to the correct shard(s)

  • Mongos / Routers

○ At least 2 x per datacenter or running locally to application servers

  • Config Servers

○ Replica Set based (CSRS) set required
○ At least 2 x Config Servers per Datacenter
○ 50 max members in Config Server set
○ No votes required (7 voter limit)

slide-120
SLIDE 120

120

Sharding: Multi-Datacenter

slide-121
SLIDE 121

Coming in 3.6...

slide-122
SLIDE 122

122

MongoDB 3.6

  • Transactions

○ Previously did not exist in MongoDB except TokuMX

  • Better Default Security

○ bindIp set to localhost ○ Addition of source IP restrictions

  • Better concurrency on metadata refreshing to remove slow downs
  • Metadata refresh can try 10 times not just 3 now
  • New cluster based clock vs system based clocks

○ Watch this feature closely! ○ NTP may no longer be necessary

  • RepairDB removed
  • Move Chunk command reports statistics
slide-123
SLIDE 123

123

MongoDB 3.6: More on this...

Speaker: David Murphy Location: Goldsmith 2 Date: Tuesday, 17:55-18:20

slide-124
SLIDE 124

124

Questions?