MongoDB Analysis with Prometheus and Grafana
Akira Kurogane, Percona
Talk Overview
- The 'math' in MongoDB metrics
○ MongoDB's counters and 'gauges'
○ mongodb_exporter metrics
○ Prometheus equations
- PMM's Grafana dashboards
- How to cook new dashboards
The 'Math' in MongoDB Metrics
Implementation
- Mostly counters
  - E.g. opcounters, bytes transferred
- Gauge values
  - E.g. open cursors, WT 'tickets' in use
- Histograms
  - Counters put in sub-ranges
Exported as timeseries to a monitoring server.
Mostly any software's metrics eventually enter your brain as:
- A graph
- An alert threshold
The graph / threshold value is:
- A rate function on a counter
- The instantaneous value of a gauge
- Derivative functions combining several of the above
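The counter-vs-gauge distinction matters for everything later, so here is a minimal sketch (plain Python, hypothetical sample data) of how a counter becomes a rate while a gauge is read directly:

```python
# Hedged sketch (made-up sample data) of the two ways a metric value
# reaches a graph: a counter is graphed via a rate function,
# a gauge is graphed as its instantaneous value.

def rate(samples):
    """Per-second rate between consecutive (timestamp, value) samples
    of a monotonically increasing counter."""
    return [(v2 - v1) / (t2 - t1)
            for (t1, v1), (t2, v2) in zip(samples, samples[1:])]

# e.g. opcounters.insert scraped every 10s (a counter):
counter_samples = [(0, 1000), (10, 1600), (20, 2400)]
insert_rates = rate(counter_samples)        # ops/sec in each interval

# e.g. open cursors scraped every 10s (a gauge) - plotted as-is:
gauge_samples = [(0, 12), (10, 9), (20, 14)]
gauge_points = [v for _, v in gauge_samples]
```

The same shape reappears later as PromQL's rate() over counters versus plain selectors over gauges.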
Roll-call
Counters
- command executed (by basic type)
- command executed (by exact type)
- network requests
- count of documents returned
- connections created
- asserts (error, warning, lesser)
- journal commits
- journal (to memory) write time
- journal disk write time
- journal commit (fsync) wait times
- time (all connections) have waited to acquire lock
- time (all connections) have held a lock
- lock acquisition counts
- count of lock acquisition attempts that failed
Gauges
- (current) open connections
- active (op in progress) connections
- active reading connections
- active writing connections
- (transaction) sessions in progress
- time needed for the last transaction cleanup
- replset election id
- replset node status (as number)
- latest op time? uptime?
- number sharded collections
- last configsvr optime
- number of shards currently waiting for metadata refresh
- inactive transactions sessions
- 'watchdog' interval setting
- replset member id
Constants (per restart)
- hostname
- pid
- version
- replset name
- security cert name
- security cert expiry
- configsvr conn string
- storage engine name
- TLS version
- 32 or 64 bit architecture
- configured max repl buffer size
Counters (cont.)
- op latencies sum (reads)
- op latencies sum (writes)
- op latencies sum (other cmd)
- op counts by readConcernType
- op counts by writeConcernType
- op counts applied in replication thread
- chunk moves
- time spent during chunk moves
- chunk move failures
- aggregate time spent in chunk commit
- number of docs in moved chunks
- sharding metadata refresh wait time
- sharding metadata refresh fail count
- transactions committed
- aborted transactions count
- retried transactions count
- documents deleted
- documents inserted
- documents updated
- write conflicts
- scanAndOrder steps in query execution
- index entries scanned
- whole data objects scanned
- replication batches applied
- time spent during repl batch apply
- total ops applied during repl batches
- initial syncs (completed)
- failed initial syncs
Gauges (cont)
- mem resident
- mem virtual
- cursors open
- timeout cursors open
- pinned cursors open
- replication buffer size
Counters (cont.)
- TTL index delete batch iterations
- documents deleted during TTL
- getMore command executed for replication
- time spent doing repl getMore cmds
- repl getMore commands created
- repl prefetch stage docs fetched
- repl prefetch stage fetch time
- freelist 'bucket' misses
- count of freelist searches
- record allocations found in freelist searches
- number of cursors that involve 2+ shards
- 'watchdog' checks iterated
- 'watchdog' filesystem checks
- lsm_work_queue_app;
- lsm_work_queue_manager;
- lsm_rows_merged;
- lsm_checkpoint_throttle;
- lsm_merge_throttle;
- lsm_work_queue_switch;
- lsm_work_units_discarded;
- lsm_work_units_done;
- lsm_work_units_created;
- lsm_work_queue_max;
- async_cur_queue;
- async_max_queue;
- async_alloc_race;
- async_flush;
- async_alloc_view;
- async_full;
- async_nowork;
- async_op_alloc;
- async_op_compact;
- async_op_insert;
- async_op_remove;
- async_op_search;
- async_op_update;
- block_preload;
- block_read;
- block_write;
- block_byte_read;
- block_byte_write;
- block_byte_write_checkpoint
;
- block_map_read;
- block_byte_map_read;
- cache_read_app_count;
- cache_read_app_time;
- cache_write_app_count;
- cache_write_app_time;
- cache_bytes_image;
- cache_bytes_lookaside;
- cache_bytes_inuse;
- cache_bytes_dirty_total;
- cache_bytes_other;
- cache_bytes_read;
- cache_bytes_write;
- cache_lookaside_cursor_wai
t_application;
- cache_lookaside_cursor_wai
t_internal;
- cache_lookaside_score;
- cache_lookaside_entries;
- cache_lookaside_insert;
- cache_lookaside_remove;
- cache_eviction_checkpoint;
- cache_eviction_get_ref;
- cache_eviction_get_ref_empty;
- cache_eviction_get_ref_empty2;
- cache_eviction_aggressive_set;
- cache_eviction_empty_score;
- cache_eviction_walk_passes;
- cache_eviction_queue_empty;
- cache_eviction_queue_not_empty;
- cache_eviction_server_evicting;
- cache_eviction_server_slept;
- cache_eviction_slow;
- cache_eviction_state;
- cache_eviction_target_page_lt10;
- cache_eviction_target_page_lt32;
- cache_eviction_target_page_ge128;
- cache_eviction_target_page_lt64;
- cache_eviction_target_page_lt128;
- cache_eviction_walks_abandoned;
- cache_eviction_walks_stopped;
- cache_eviction_walks_gave_up_no_targets;
- cache_eviction_walks_gave_up_ratio;
- cache_eviction_walks_ended;
- cache_eviction_walk_from_root;
- cache_eviction_walk_saved_pos;
- cache_eviction_active_workers;
- cache_eviction_worker_created;
- cache_eviction_worker_evicting;
- cache_eviction_worker_removed;
- cache_eviction_stable_state_workers;
- cache_eviction_force_fail;
- cache_eviction_force_fail_time;
- cache_eviction_walks_active;
- cache_eviction_walks_started;
- cache_eviction_force_retune;
- cache_eviction_hazard;
- cache_hazard_checks;
- cache_hazard_walks;
- cache_hazard_max;
- cache_inmem_splittable;
- cache_inmem_split;
- cache_eviction_internal;
- cache_eviction_split_internal;
- cache_eviction_split_leaf;
- cache_bytes_max;
- cache_eviction_maximum_page_size;
- cache_eviction_dirty;
- cache_eviction_app_dirty;
- cache_timed_out_ops;
- cache_read_overflow;
+ WiredTiger Metrics
- cache_eviction_deepen;
- cache_write_lookaside;
- cache_pages_inuse;
- cache_eviction_force;
- cache_eviction_force_time;
- cache_eviction_force_delete;
- cache_eviction_force_delete_time;
- cache_eviction_app;
- cache_eviction_pages_queued;
- cache_eviction_pages_queued_urgent;
- cache_read;
- cache_read_deleted;
- cache_read_deleted_prepared;
- cache_read_lookaside;
- cache_read_lookaside_checkpoint;
- cache_read_lookaside_skipped;
- cache_pages_requested;
- cache_eviction_pages_seen;
- cache_eviction_fail;
- cache_eviction_walk;
- cache_write;
- cache_read_lookaside_delay;
- cache_read_lookaside_delay_checkpoint;
- cache_write_restore;
- cache_overhead;
- cache_eviction_pages_queued_oldest;
- cache_bytes_internal;
- cache_bytes_leaf;
- cache_bytes_dirty;
- cache_pages_dirty;
- cache_eviction_clean;
- fsync_all_fh_total;
- fsync_all_fh;
- fsync_all_time;
- capacity_threshold;
- capacity_bytes_read;
- capacity_bytes_ckpt;
- capacity_bytes_evict;
- capacity_bytes_log;
- capacity_bytes_written;
- capacity_time_total;
- capacity_time_ckpt;
- capacity_time_evict;
- capacity_time_log;
- capacity_time_read;
- cond_auto_wait_reset;
- cond_auto_wait;
- time_travel;
- file_open;
- memory_allocation;
- memory_free;
- memory_grow;
- cond_wait;
- rwlock_read;
- rwlock_write;
- fsync_io;
- read_io;
- write_io;
- cursor_cached_count;
- cursor_cache;
- cursor_create;
- cursor_insert;
- cursor_modify;
- cursor_next;
- cursor_restart;
- cursor_prev;
- cursor_remove;
- cursor_reserve;
- cursor_reset;
- cursor_search;
- log_force_write_skip;
- log_compress_writes;
- log_compress_write_fails;
- log_compress_small;
- log_release_write_lsn;
- log_scans;
- log_scan_rereads;
- log_write_lsn;
- log_write_lsn_skip;
- log_sync;
- log_sync_duration;
- log_sync_dir;
- log_sync_dir_duration;
- log_writes;
- log_slot_consolidated;
- log_max_filesize;
- log_prealloc_max;
- log_prealloc_missed;
- log_prealloc_files;
- log_prealloc_used;
- log_scan_records;
- log_slot_close_race;
- log_slot_close_unbuf;
- log_slot_closes;
- log_slot_races;
- log_slot_yield_race;
- log_slot_immediate;
- log_slot_yield_close;
- log_slot_yield_sleep;
- log_slot_yield;
- log_slot_active_closed;
- log_slot_yield_duration;
- log_slot_no_free_slots;
- log_slot_unbuffered;
- log_compress_mem;
- log_buffer_size;
- log_compress_len;
- log_slot_coalesced;
- log_close_yields;
- perf_hist_fsread_latency_lt50;
- perf_hist_fsread_latency_lt100;
- perf_hist_fsread_latency_lt250;
- perf_hist_fsread_latency_lt500;
- perf_hist_fsread_latency_lt1000;
- perf_hist_fsread_latency_gt1000;
- perf_hist_fswrite_latency_lt50;
- perf_hist_fswrite_latency_lt100;
- perf_hist_fswrite_latency_lt250;
- perf_hist_fswrite_latency_lt500;
- perf_hist_fswrite_latency_lt1000;
- perf_hist_fswrite_latency_gt1000;
- perf_hist_opread_latency_lt250;
- perf_hist_opread_latency_lt500;
- perf_hist_opread_latency_lt1000;
- perf_hist_opread_latency_lt10000;
- perf_hist_opread_latency_gt10000;
- perf_hist_opwrite_latency_lt250;
- perf_hist_opwrite_latency_lt500;
- perf_hist_opwrite_latency_lt1000;
- perf_hist_opwrite_latency_lt10000;
- perf_hist_opwrite_latency_gt10000;
- rec_page_delete_fast;
- rec_pages;
- rec_pages_eviction;
- rec_page_delete;
- rec_split_stashed_bytes;
- rec_split_stashed_objects;
- session_open;
- session_query_ts;
- session_table_alter_fail;
- session_table_alter_success;
- session_table_alter_skip;
- session_table_compact_fail;
- session_table_compact_success;
- session_table_create_fail;
- session_table_create_success;
- session_table_drop_fail;
- session_table_drop_success;
- session_table_rebalance_fail;
- session_table_rebalance_success;
- session_table_rename_fail;
- session_table_rename_success;
- session_table_salvage_fail;
- session_table_salvage_success;
- session_table_truncate_fail;
- session_table_truncate_success;
- session_table_verify_fail;
- session_table_verify_success;
- thread_fsync_active;
- thread_read_active;
- thread_write_active;
- txn_read_queue_empty;
- txn_read_queue_head;
- txn_read_queue_inserts;
- txn_read_queue_len;
- txn_rollback_to_stable;
- txn_rollback_upd_aborted;
- txn_rollback_las_removed;
- txn_set_ts;
- txn_set_ts_commit;
- txn_set_ts_commit_upd;
- txn_set_ts_oldest;
- txn_set_ts_oldest_upd;
- txn_set_ts_stable;
- txn_set_ts_stable_upd;
- txn_begin;
- txn_checkpoint_running;
- txn_checkpoint_generation;
- txn_checkpoint_time_max;
- txn_checkpoint_time_min;
- txn_checkpoint_time_recent;
- txn_checkpoint_scrub_target;
- txn_checkpoint_scrub_time;
- txn_checkpoint_time_total;
- txn_checkpoint;
- txn_checkpoint_skipped;
- txn_fail_cache;
- txn_checkpoint_fsync_post;
- txn_checkpoint_fsync_post_duration
;
- txn_pinned_range;
- txn_pinned_checkpoint_range;
- txn_pinned_snapshot_range;
- txn_pinned_timestamp;
- txn_pinned_timestamp_checkpoint;
- txn_pinned_timestamp_oldest;
- txn_sync;
- txn_commit;
- txn_rollback;
- txn_update_conflict;
- bloom_false_positive;
- bloom_hit;
- bloom_miss;
- bloom_page_evict;
- bloom_page_read;
- bloom_count;
- lsm_chunk_count;
- lsm_generation_max;
- lsm_lookup_no_bloom;
- lsm_checkpoint_throttle;
- lsm_merge_throttle;
- bloom_size;
- block_extension;
- block_alloc;
- block_free;
- block_checkpoint_size;
- allocation_size;
- block_reuse_bytes;
- block_magic;
- block_major;
- block_size;
- block_minor;
- btree_checkpoint_generation;
- btree_column_fix;
- btree_column_internal;
- btree_column_rle;
- btree_column_deleted;
- btree_column_variable;
- btree_fixed_len;
"That Can't Be Sane!"
It gets scarier: the metric op rate is very high. 100,000+ queries or updates per second means several million counter increments per second. Plus daemon 'housework' threads:
- Storage engine, Replication, Network, Journal, etc.
One MongoDB Metric Increment

The layers (bottom up):
- sub-atomic particles
- electrons
- silicon transistor gates
- 64-bit register
- assembly
- C++ std::atomic<uint64> metric_counter_X // a class member var
- metric_counter_X.fetchAndAdd(1) at some class's gateway point, e.g. CollectionIndexUsageTracker::recordIndexAccess(...), ServiceEntryPointCommon::handleRequest(...)
- User level: find() / update() / delete() / count() / aggregate() / ... etc. etc.
The layers, WiredTiger variant:
- sub-atomic particles
- electrons
- silicon transistor gates
- 64-bit register
- assembly
- (WiredTiger) uint64 member in a global struct, e.g. __wt_xxxxxx_stats, incremented via WT_STAT_CONN_INCR(session, metric_X)
- C++ std::atomic metric_counter_X (some Command class), e.g. CollectionIndexUsageTracker::recordIndexAccess(...), ServiceEntryPointCommon::handleRequest(...)
- User level: find() / update() / delete() / count() / aggregate() / ... etc. etc.

namespace mongo {
...
void CollectionIndexUsageTracker::recordIndexAccess(StringData indexName) {
    invariant(!indexName.empty());
    dassert(_indexUsageMap.find(indexName) != _indexUsageMap.end());
    _indexUsageMap[indexName].accesses.fetchAndAdd(1);
}

void CollectionIndexUsageTracker::registerIndex(StringData indexName,
                                               const BSONObj& indexKey) {
    invariant(!indexName.empty());
    dassert(_indexUsageMap.find(indexName) == _indexUsageMap.end());
    // Create map entry.
    _indexUsageMap[indexName] = IndexUsageStats(_clockSource->now(), indexKey);
}
...
CollectionIndexUsageMap CollectionIndexUsageTracker::getUsageStats() const {
    return _indexUsageMap;
}
Gauges, Histograms
Counter:
- metricX.fetchAndAdd()
- WT_STAT_*_INCR(..., metric_x)
Gauge:
- metricX.set(x)
- WT_STAT_SET(..., metric_x)
Histogram:
- Counters, put in bucket ranges automatically.
- (Only one so far: $collStats' latencyStats)
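The three update styles can be sketched in plain Python (hypothetical classes, standing in for the C++ std::atomic members and WT_STAT_* macros named above):

```python
from bisect import bisect_right

class Counter:
    """Increment-only, in the manner of metricX.fetchAndAdd() / WT_STAT_*_INCR."""
    def __init__(self):
        self.value = 0
    def inc(self, n=1):
        self.value += n

class Gauge:
    """Set to the current level, in the manner of metricX.set(x) / WT_STAT_SET."""
    def __init__(self):
        self.value = 0
    def set(self, x):
        self.value = x

class Histogram:
    """Counters put in bucket ranges automatically,
    in the manner of $collStats' latencyStats."""
    def __init__(self, upper_bounds):
        self.bounds = sorted(upper_bounds)
        self.buckets = [0] * (len(self.bounds) + 1)  # last bucket = overflow
    def observe(self, v):
        self.buckets[bisect_right(self.bounds, v)] += 1

ops = Counter()
ops.inc(); ops.inc()
cursors_open = Gauge()
cursors_open.set(7)
latency = Histogram([1, 10, 100])     # hypothetical bucket boundaries
for sample in (0.5, 3, 250):
    latency.observe(sample)
```

The real implementations are lock-free atomics on a hot path; this only illustrates the update semantics, not the concurrency.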
Reading the MongoDB Metrics
db.serverStatus()
- OpCounter
- Network, Connections
- WiredTiger (or MMAP)
- ReplicationInfo, OplogInfo
- Sharding, ShardingStatistics
- Transactions, LogicalSession
- GlobalLock, LockStats
- OpReadConcern, OpWriteConcern
- Storage, DataFileSync, DurSSS
- ..., ...
SS does NOT include these stats:
- Database
- Collection
- Index
E.g. doc count, avg. size, storageSize, access counts per db / coll / index. Iterate each DB, collection & index to get those, in addition to calling serverStatus.
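A hedged sketch of that iteration (Python; the pymongo-style method names `list_database_names` / `list_collection_names` / `command` are assumptions, and fake in-memory objects stand in for a live mongod so the sketch runs anywhere):

```python
def gather_storage_sizes(client):
    """Walk every database and collection to collect storageSize, since
    db.serverStatus() does not include per-db/coll/index stats."""
    sizes = {}
    for db_name in client.list_database_names():
        db = client[db_name]
        for coll_name in db.list_collection_names():
            stats = db.command("collStats", coll_name)  # per-collection stats
            sizes[(db_name, coll_name)] = stats["storageSize"]
    return sizes

# Fake stand-ins so the sketch runs without a server; with pymongo a
# MongoClient would be passed instead and the same calls apply.
class FakeDB:
    def __init__(self, colls):
        self.colls = colls
    def list_collection_names(self):
        return list(self.colls)
    def command(self, name, coll):
        return {"storageSize": self.colls[coll]}

class FakeClient:
    def __init__(self, dbs):
        self.dbs = dbs
    def list_database_names(self):
        return list(self.dbs)
    def __getitem__(self, db_name):
        return FakeDB(self.dbs[db_name])

sizes = gather_storage_sizes(
    FakeClient({"shop": {"orders": 4096, "users": 1024}}))
```

This is essentially what mongodb_exporter does when its per-database / per-collection collectors are enabled.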
Side-Topic: FTDC
= Metrics persisted to disk once per second
Code search: FTDC controller's addPeriodicCollector
- serverStatus 935 stats
- replSetGetStatus 81 stats
- collStats on local.oplog.rs 152 stats
- linux/windows OS metrics 211 stats
- connpool stats (on mongos)
Not purely internal - try the { getDiagnosticData: 1 } command. Impact: a 'cheap' cache re-read, but a lot of BSON/JSON data. N.b. does NOT include per-database, per-collection and per-index stats.
MongoDB --> Prometheus
mongodb_exporter
Passively awaits the Prometheus server's call once per x seconds.
Returns 200+ metrics from:
- serverStatus
- replsetGetStatus
- Optionally
○ dbStats
○ collStats (to become $collStats)
○ $indexStats
Installing / Running
PMM:
pmm-admin add mongodb [NAME] [OPTIONS]
pmm-admin [list | stop | restart | remove | ...] [NAME] [OPTIONS]
Command line:
./mongodb_exporter \
  -mongodb.uri "mongodb://user:pwd@localhost:27017/...." \
  -collect.database \
  -collect.collection \
  -collect.indexusage \
  ...
E.g. Prometheus Metrics (serverStatus)
metrics.queryExecutor.scannedObject =
mongodb_mongod_metrics_query_executor_total{state="scanned_objects"}
wiredTiger.cache["bytes written from cache"] =
mongodb_mongod_wiredtiger_cache_bytes_total{type="written"}
connections.current =
mongodb_connections{state="current"}
network.bytesIn =
mongodb_network_bytes_total{state="in_bytes"}
...
E.g. Prometheus Metrics ($collStats, etc.)
Collection's aggregate indexes size =
mongodb_mongod_db_coll_indexes_size{db="x",collection="y"}
Index accesses =
mongodb_mongod_index_usage_count{db="x",collection="y",index="z"}
Read latency =
mongodb_mongod_op_latencies_latency_total{type="read",db="x",collection="y"}
Write latency =
mongodb_mongod_op_latencies_latency_total{type="write",db="x",collection="y"} ...
Viewing With Prometheus's Own GUI
Prometheus graph page. In PMM2:
https://<host>/prometheus/graph
(PMM admin role users only)
A web console to:
- Discover metric names
- View as graph
- View labelled metrics at a single point of time.
How Long Can You Store the Statistics?
Depends on disk space budget. By my subjective picture of small and big budgets:

Resolution   Small budget   Big budget
1 sec   😚   1 day          1 week
10 sec       1 week         1 month
1 min   😟   6 months       Years

N.b. no automatic downscaling of metrics resolution for older time ranges.
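Those budget rows come down to simple multiplication. A back-of-envelope sketch, assuming roughly 1.5 bytes per sample after TSDB compression (a hypothetical figure; real compression varies with the data):

```python
def retention_bytes(n_series, scrape_interval_s, retention_days,
                    bytes_per_sample=1.5):
    """Approximate TSDB disk need: series x samples-per-series x bytes/sample.
    bytes_per_sample is an assumed post-compression average."""
    samples_per_series = retention_days * 86400 / scrape_interval_s
    return n_series * samples_per_series * bytes_per_sample

# e.g. 1000 series at 1-second resolution, kept one week:
one_week_gb = retention_bytes(1000, 1, 7) / 1e9
```

Scaling the interval from 1s to 10s cuts the figure tenfold, which is why the budget table trades resolution for retention.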
Summary So Far...
- MongoDB has a very large number of statistics
- Very high resolution: Typically many metrics updated every microsecond
- 100+ metrics exported to Prometheus.
(Optionally enable per-collection etc. metrics as well.)
- Metric names changed - but similar.
- Counters still counters; Gauges still gauges.
- Labels: metric_aa_x, metric_aa_y => m_aa{label="x"}, m_aa{label="y"}
- Now you can access MongoDB metric history.
Displaying the Metrics
"Pass to Graphing GUI, Done. Right?"
No - DBAs need to see a 'natural' picture, and there are many different 'natural' concepts (Grrr! Humans! Grrr!). Related metrics also need to be joined in Prometheus - tricky. Counter metric types and gauge metric types both become a y value over a time x-axis in graphs.
"So it's a learning step for the first two or so, but after that it's all the same right?"
Various Shapes of Prometheus Equations
Cursors open, WT tickets:      x
Ops, bytes, ...:               rate(x[interval])
Lag:                           x{state="PRIMARY"} - x{state="SECONDARY"}
Mem threshold:                 x{server=hhh} / y{server=hhh}
Reads on all secondaries:      (z{state="SECONDARY"} - z{state="SECONDARY"}) + x
Cluster totals:                sum(.....<various> ....) by (cluster)
Worst latency on a primary:    max(x{cluster=ccc} + (z{state="P.."} - z{state="P.."}))
Shard imbalance of x:          max(sum(x) by (replset)) - min(sum(x) by (replset))
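To see what the last shape does, the sum() by () and max/min steps can be simulated over hypothetical samples in plain Python:

```python
# Hypothetical instant vector: one sample per mongod, labelled by replset.
samples = [
    ({"replset": "rs0", "instance": "a"}, 120.0),
    ({"replset": "rs0", "instance": "b"}, 118.0),
    ({"replset": "rs1", "instance": "c"}, 80.0),
    ({"replset": "rs1", "instance": "d"}, 82.0),
]

def sum_by(samples, label):
    """Rough equivalent of PromQL: sum(x) by (label)."""
    totals = {}
    for labels, value in samples:
        totals[labels[label]] = totals.get(labels[label], 0.0) + value
    return totals

# max(sum(x) by (replset)) - min(sum(x) by (replset))
per_shard = sum_by(samples, "replset")
imbalance = max(per_shard.values()) - min(per_shard.values())
```

Each aggregation collapses the instance label away, leaving one value per shard to compare.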
Prometheus Vector Matching - Tricky
metric_x + metric_y (or *, /, -)

is similar in concept to:

SELECT x.value + y.value, x.label_1, x.label_2, ...
FROM metric_x x
INNER JOIN metric_y y
  ON x.label_1 = y.label_1
 AND x.label_2 = y.label_2
 AND ... (join all labels)

Runtime error if the vector match 'join' is invalid. Vector-label modifying operators are often needed:
- on(...), ignoring(...), group_left(...), unless(...), etc.
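The join analogy can be made concrete with a small Python sketch of one-to-one vector matching (hypothetical data; real PromQL also handles many-to-one matching, staleness, etc.):

```python
def vector_add(x, y):
    """One-to-one vector matching for `metric_x + metric_y`: join series
    whose full label sets are equal; unmatched series drop out of the
    result (many-to-many matches are a runtime error in real PromQL)."""
    y_index = {frozenset(labels.items()): value for labels, value in y}
    result = []
    for labels, value in x:
        key = frozenset(labels.items())
        if key in y_index:
            result.append((labels, value + y_index[key]))
    return result

x = [({"instance": "db1"}, 5.0), ({"instance": "db2"}, 7.0)]
y = [({"instance": "db1"}, 1.0), ({"instance": "db3"}, 9.0)]
matched = vector_add(x, y)
```

Silently dropped series are exactly why on(...) / ignoring(...) modifiers are so often needed: they relax which labels take part in the 'join'.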
Many Unique, Complex, Equations
In Grafana you can edit/explore a graph to see the equation.
Some are simpler, some are more complex. The above is average.
Prometheus as a Graph's Datasource
- Same Prometheus equation
- Accepts substitute $variable values from the GUI.
- Every timeseries + label combination becomes its own line.
- Unless:
  ○ Filter to single
  ○ Aggregate to single
  ○ Use 'repeat graph on $variable' option.
PMM's MongoDB Dashboards
Percona Monitoring and Management
PMM Server containerizes:
- Prometheus
- Grafana
- pmm-managed daemon
- Additional web services (e.g. QAN)
- Backing DBs for the above (e.g. tsdb for Prometheus)
After the PMM Server is started, install the PMM Client on the MongoDB host servers. Then:
pmm-admin config --server <ip_address>
pmm-admin add mongodb [OPTIONS]
Default MongoDB Dashboards in PMM2
For comparing all MongoDB nodes in the environment at once:
- Services Overview
For a cluster (or subset):
- Cluster Summary: sharding stats
- Overview: various stats aggregated up
- Summary: a subset of Overview
- Replset: elections, oplog lag, oplog volume
- Storage Engines: WiredTiger, MMAP, RocksDB, InMemory
- Compare: instance side-by-side comparison
Roll Your Own Dashboards
"Democratize Metrics"
Grafana / Raintank 2015:
"Make the tools of observability accessible to everyone in an organization, not just the single Ops person." See and edit dashboards and graphs through the same Web GUI. (Edit for "admin" role users, at least.)
Editing PMM's MongoDB Dashboards
PMM's packaged MongoDB dashboards are generic. Specialize them for your environment:
- Save a copy as a new dashboard and edit as you like.
- Ignore / cut out what you don't need.
- Merge with other dashboards' graphs, etc.
If you edit packaged dashboards in place, they will be overwritten by PMM updates.
MongoDB Dashboard Mash-up Demo
(Graph demo in Grafana front-end)