Expectations on Remote Data: Supporting the Prometheus Remote Storage API



SLIDE 1

Expectations on Remote Data

Supporting the Prometheus Remote Storage API

Alfred Landrum
Engineer @ Sysdig
Twitter: @alfred-landrum
GitHub: alfred-landrum

SLIDE 2

Sysdig Backend

  • Sysdig Data Engine and Store
  • Host/Node based Agents
  • Sysdig Dashboards/Alerts/Topology

Distributed Datastore

  • Time/Group Aggregation
  • RBAC
  • Downsampling

Status Data

  • Orchestrator State
  • Service Topology
  • Application Checks

Time Series Data

  • StatsD
  • JMX
  • Prometheus
  • ...
SLIDE 3

Sysdig Backend

  • PromQL Evaluation
  • Sysdig Data Engine and Store
  • Host/Node based Agents
  • Grafana PromQL Dashboards/Alerts
  • Sysdig Dashboards/Alerts/Topology
  • Prometheus HTTP API
  • Sysdig Data API

Status Data

  • Orchestrator State
  • Service Topology
  • Application Checks

Time Series Data

  • StatsD
  • JMX
  • Prometheus
  • ...
SLIDE 4

API / PromQL / Storage

API: range query from t0 to t1, step 10s
PromQL: up{env="prod"} > 1
Storage:
  labels: __name__ = "up", env = "prod"
  time: start: (t0 - 5m), end: t1
  …

API: range query from t0 to t1, step 10s
PromQL: rate(alerts_total[1m])
Storage:
  labels: __name__ = "alerts_total"
  time: start: t0, end: t1
  read hints: func: "rate"
  …

  • 1. Why ask for an extra 5 minutes?
  • 2. What’s a “func” hint?
  • 3. What does “rate” mean?
SLIDE 5

Storage data model: a set of time series, identified by metric name and labels. No alignment guarantees.

time value

SLIDE 6

Instant Query: evaluate an expression at a particular time.

time value

SLIDE 7

time value

start end step

Range Query: logically, a repeated instant query on [start,end] every step.
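
The "repeated instant query" reading can be sketched in a few lines of Python. This is only an illustration of the logical model, not how Prometheus implements range evaluation; `range_query` and `instant_eval` are hypothetical names, and timestamps are assumed to be integer seconds.

```python
from typing import Callable, List, Optional, Tuple


def range_query(
    instant_eval: Callable[[int], Optional[float]],
    start: int,
    end: int,
    step: int,
) -> List[Tuple[int, Optional[float]]]:
    """Evaluate instant_eval at start, start+step, ... up to and including end."""
    if step <= 0:
        raise ValueError("step must be positive")
    return [(t, instant_eval(t)) for t in range(start, end + 1, step)]


# Toy expression: the "value" is simply the evaluation time itself.
points = range_query(lambda t: float(t), start=0, end=30, step=10)
```

Each element of `points` pairs one evaluation time with the result of the instant query at that time.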

SLIDE 8

What if there’s no sample at the evaluation time?

time value

SLIDE 9

Range Vector Selector

PromQL: avg_over_time(queue_depth[1m])

time value

1m
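
As a rough illustration of what a range vector selector feeds to a function like avg_over_time, the sketch below collects the samples inside the window that ends at the evaluation time and averages them. The function name mirrors the PromQL function but is a hypothetical Python helper; the window is assumed to be open on the left and closed on the right, and exact boundary handling in Prometheus may differ by version.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


def avg_over_time(samples: List[Sample], eval_time: float, window: float) -> Optional[float]:
    """Average the samples with eval_time - window < timestamp <= eval_time.

    Assumption: left-open, right-closed window. Returns None when the
    window contains no samples.
    """
    in_window = [v for t, v in samples if eval_time - window < t <= eval_time]
    if not in_window:
        return None
    return sum(in_window) / len(in_window)


queue_depth = [(0, 1.0), (20, 2.0), (40, 3.0), (60, 4.0), (80, 5.0)]
result = avg_over_time(queue_depth, eval_time=80, window=60)  # uses samples at 40, 60, 80
```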

SLIDE 10

Instant vector selector

PromQL: queue_depth

The most recent value found at or before the evaluation time.

time value

SLIDE 11

time value

start end step

Same for range queries, applied at each step.

SLIDE 12

time value

start end step

PromQL now has aligned values, for calculations, comparisons, etc.

SLIDE 13

time

How long after the last sample does a query still see its value? This is controlled in two different ways:

????? evaluation time

value

last scraped datapoint

SLIDE 14

time

The first way is a configuration setting: lookbackDelta. The default is 5 minutes.

evaluation time

value

last scraped datapoint lookbackDelta
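
A minimal sketch of instant vector selection with a lookback limit, assuming samples sorted by timestamp in seconds. `instant_value` and `LOOKBACK_DELTA` are hypothetical names, and whether a sample exactly lookbackDelta old is still visible is an assumption here; Prometheus versions differ on that boundary.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

LOOKBACK_DELTA = 5 * 60  # seconds; mirrors the 5 minute default


def instant_value(
    samples: List[Tuple[float, float]],
    eval_time: float,
    lookback: float = LOOKBACK_DELTA,
) -> Optional[float]:
    """Return the most recent value at or before eval_time, or None if the
    newest such sample is older than the lookback limit."""
    timestamps = [t for t, _ in samples]
    i = bisect_right(timestamps, eval_time)  # samples[:i] are at or before eval_time
    if i == 0:
        return None
    t, v = samples[i - 1]
    if eval_time - t > lookback:  # assumption: exactly `lookback` old still counts
        return None
    return v


scraped = [(0, 1.0), (60, 2.0), (120, 3.0)]
```

With this data, an evaluation at 130 sees the sample from 120, while an evaluation at 421 sees nothing because the last sample is more than 5 minutes old.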

SLIDE 15

time

Same for range queries, applied at each step.

evaluation time

value

start step

last scraped datapoint

lookbackDelta

SLIDE 16

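Consider an alert that should fire if there’s no value. With lookbackDelta alone, the last sample stays visible for up to 5 minutes, so the alert fires late.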

time

lookbackDelta

evaluation time last scraped datapoint

value

SLIDE 17

The second way: stale markers. The scraping logic adds them 1-2 intervals after the last sample.

time

evaluated value no value scraped datapoint

value

stale marker
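
The effect of a stale marker can be sketched as follows. Prometheus encodes the marker as a special NaN value; in this sketch a hypothetical sentinel object `STALE` stands in for it, and `instant_value` is an illustrative helper rather than a real API.

```python
from typing import Any, List, Optional, Tuple

STALE = object()  # hypothetical stand-in for Prometheus's stale marker


def instant_value(
    samples: List[Tuple[float, Any]],
    eval_time: float,
    lookback: float = 300.0,
) -> Optional[float]:
    """Most recent value at or before eval_time; None if that entry is a
    stale marker or is older than the lookback limit."""
    latest = None
    for t, v in samples:  # samples are assumed sorted by timestamp
        if t > eval_time:
            break
        latest = (t, v)
    if latest is None:
        return None
    t, v = latest
    if v is STALE or eval_time - t > lookback:
        return None
    return v


with_marker = [(0, 7.0), (60, 8.0), (120, STALE)]
without_marker = [(0, 7.0), (60, 8.0)]
```

Evaluating at 130 returns no value for `with_marker` but still returns 8.0 for `without_marker`, which is why an "absent value" alert can fire sooner when stale markers are present.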

SLIDE 18

Sysdig Backend

  • PromQL Evaluation
  • Sysdig Data Engine and Store
  • Host/Node based Agents
  • Grafana PromQL Dashboards/Alerts
  • Sysdig Dashboards/Alerts/Topology
  • Prometheus HTTP API
  • Sysdig Data API

Status Data

  • Orchestrator State
  • Service Topology
  • Application Checks

Time Series Data

  • StatsD
  • JMX
  • Prometheus
  • ...
SLIDE 19

API / PromQL / Storage

API: range query from t0 to t1, step 10s
PromQL: up{env="prod"} > 1
Storage:
  labels: __name__ = "up", env = "prod"
  time: start: (t0 - 5m), end: t1
  …

API: range query from t0 to t1, step 10s
PromQL: rate(alerts_total[1m])
Storage:
  labels: __name__ = "alerts_total"
  time: start: t0, end: t1
  read hints: func: "rate"
  …

✔ Sample Alignment

  • 2. What’s a “func” hint?
  • 3. What does “rate” mean?
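
The extra 5 minutes in the first storage request follows from the lookback behaviour: to evaluate an instant vector selector at t0, PromQL may need a sample scraped up to lookbackDelta earlier. A small sketch of that widening, where `storage_time_range` is a hypothetical helper and the 5 minute value is the default mentioned earlier:

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

LOOKBACK_DELTA = timedelta(minutes=5)  # default lookbackDelta


def storage_time_range(
    query_start: datetime,
    query_end: datetime,
    lookback: timedelta = LOOKBACK_DELTA,
) -> Tuple[datetime, datetime]:
    """Widen the start of the requested range so that the first evaluation
    step can still find the most recent earlier sample."""
    return query_start - lookback, query_end


t0 = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
t1 = t0 + timedelta(hours=1)
storage_start, storage_end = storage_time_range(t0, t1)
```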
SLIDE 20

time

Scraping intervals typically on the order of 1 minute. A query for a month’s data would take ~45k samples. That’s likely more than the pixel width of your display.
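
The ~45k figure can be checked with a little arithmetic, assuming exactly one sample per minute with no gaps. `samples_in` is a hypothetical helper used only for this check.

```python
SCRAPE_INTERVAL_SECONDS = 60  # "on the order of 1 minute"


def samples_in(days: int, interval_seconds: int = SCRAPE_INTERVAL_SECONDS) -> int:
    """Number of samples one series produces over the given number of days."""
    return days * 24 * 60 * 60 // interval_seconds


thirty_days = samples_in(30)      # 43200
thirty_one_days = samples_in(31)  # 44640
```

Either way, a single series yields far more points than a typical chart has horizontal pixels.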

SLIDE 21

time

Store an aggregation of many samples within some fixed resolution. What representative value should you store?

SLIDE 22

time

Average

SLIDE 23

time

Maximum

SLIDE 24

time

Sum

05 05 12 16 06 14
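
The three candidate aggregations (average, maximum, sum) can be computed together in one pass over fixed-resolution buckets. This is a sketch only: `downsample` is a hypothetical function, timestamps are assumed to be integer seconds, and the sample data is made up for illustration rather than taken from the slides.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def downsample(
    samples: List[Tuple[int, float]], resolution: int
) -> Dict[int, Dict[str, float]]:
    """Group samples into buckets of `resolution` seconds, keyed by bucket
    start time, and store several aggregations per bucket."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for t, v in samples:
        buckets[t - t % resolution].append(v)
    return {
        start: {"avg": sum(vs) / len(vs), "max": max(vs), "sum": sum(vs)}
        for start, vs in sorted(buckets.items())
    }


raw = [(0, 1.0), (60, 3.0), (120, 2.0), (300, 4.0), (360, 6.0)]
aggregated = downsample(raw, resolution=300)
```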

SLIDE 25

Not limited to a single aggregation - store several. How to select the best one for a query?

time

05 05 12 16 06 14
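
One plausible way to pick among stored aggregations is to key the choice on the "func" read hint that accompanies a remote read request. The mapping below is purely an assumption for illustration: the aggregation names, the "counter" label, and the default are hypothetical and not taken from the talk.

```python
# Hypothetical mapping from the PromQL function named in the read hint
# to the stored aggregation that best preserves its result.
AGGREGATION_FOR_FUNC = {
    "max_over_time": "max",
    "sum_over_time": "sum",
    "avg_over_time": "avg",
    "rate": "counter",
    "increase": "counter",
}

DEFAULT_AGGREGATION = "avg"  # assumption: fall back to the average


def select_aggregation(func_hint: str) -> str:
    """Choose a stored aggregation for the given func hint ("" if absent)."""
    return AGGREGATION_FOR_FUNC.get(func_hint, DEFAULT_AGGREGATION)


chosen_for_rate = select_aggregation("rate")
chosen_without_hint = select_aggregation("")
```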

SLIDE 26

API / PromQL / Storage

API: range query from t0 to t1, step 10s
PromQL: up{env="prod"} > 1
Storage:
  labels: __name__ = "up", env = "prod"
  time: start: (t0 - 5m), end: t1
  …

API: range query from t0 to t1, step 10s
PromQL: rate(alerts_total[1m])
Storage:
  labels: __name__ = "alerts_total"
  time: start: t0, end: t1
  read hints: func: "rate"
  …

✔ Sample Alignment
✔ Aggregation Selection

  • 3. What does “rate” mean?
SLIDE 27
SLIDE 28

A decrease in value indicates a reset occurred. A common reason for a reset is a restarted instance.

time

SLIDE 29

rate(): divide the difference in events by a time duration.

time

SLIDE 30

No resets? Sum the deltas between samples.

time

SLIDE 31

Reset? Add post-reset value.

time

SLIDE 32

Effectively: slide everything up after each reset.

time
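
The reset handling described on the last few slides can be sketched as below. `increase` and `simple_rate` are hypothetical helpers; the real PromQL rate() also extrapolates towards the window boundaries, which this simplified version leaves out.

```python
from typing import List, Optional, Tuple


def increase(values: List[float]) -> float:
    """Sum the deltas between consecutive counter samples. A decrease is
    treated as a reset, so the post-reset value itself is added."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            total += cur  # reset: the counter restarted from zero
    return total


def simple_rate(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Events per second between the first and last sample (no extrapolation)."""
    if len(samples) < 2:
        return None
    duration = samples[-1][0] - samples[0][0]
    if duration <= 0:
        return None
    return increase([v for _, v in samples]) / duration


counter = [(0, 1.0), (15, 4.0), (30, 6.0), (45, 2.0), (60, 5.0)]
events = increase([v for _, v in counter])  # 3 + 2 + 2 (after reset) + 3
per_second = simple_rate(counter)
```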

SLIDE 33

What kind of aggregation would you need for rate?

time

SLIDE 34

How many events occurred between t1 and t2?

time

t1 t2

SLIDE 35

How many events occurred between t1 and t2?

time

t1 t2

SLIDE 36

time

t1 t2

How many events occurred between t1 and t2?

SLIDE 37

time

t1 t2

How many events occurred between t1 and t2?

SLIDE 38

time

t1 t2

How many events occurred between t1 and t2?

SLIDE 39

time

t1 t2

How many events occurred between t1 and t2?

SLIDE 40

How many events occurred between t1 and t2? In this case, the difference between:

  • the sum of events in the t2 window
  • the last value before t1

time

t1 t2

SLIDE 41

How many events occurred between t3 and t4?

time

t3 t4

SLIDE 42

How many events occurred between t3 and t4?

time

t3 t4

SLIDE 43

How many events occurred between t3 and t4? Can we just take the difference between the last raw value and the sum?

time

t3 t4

SLIDE 44

No: a border reset means the values in t3 don’t matter; only the t4 event sum counts. We need to store the first and last raw values to detect border resets.

time

t3 t4

boundary aligned reset!

SLIDE 45

Store the first and last raw values, plus the sum of events.

time
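
One reading of this scheme, sketched under explicit assumptions: each window stores its first raw value, its last raw value, and an event sum defined here as the first raw value plus the reset-adjusted increases inside the window. That definition is an interpretation of the slides, not something they state, and `CounterBucket`, `summarize`, and `events_between` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CounterBucket:
    first: float   # first raw counter value in the window
    last: float    # last raw counter value in the window
    events: float  # assumed: first value plus reset-adjusted increases


def summarize(values: List[float]) -> CounterBucket:
    """Reduce the raw counter samples of one window to a CounterBucket."""
    events = values[0]
    for prev, cur in zip(values, values[1:]):
        events += cur - prev if cur >= prev else cur  # add post-reset value
    return CounterBucket(first=values[0], last=values[-1], events=events)


def events_between(prev: CounterBucket, cur: CounterBucket) -> float:
    """Events from the end of the previous window to the end of the current one."""
    if cur.first < prev.last:
        return cur.events          # border reset: previous values don't matter
    return cur.events - prev.last  # otherwise subtract the last earlier value


t1, t2 = summarize([1.0, 3.0]), summarize([5.0, 8.0, 2.0, 4.0])
t3, t4 = summarize([7.0, 10.0]), summarize([2.0, 5.0])
no_border_reset = events_between(t1, t2)  # 12 - 3
border_reset = events_between(t3, t4)     # just the t4 event sum
```

The first pair contains a reset inside the t2 window, which the stored sum already accounts for; the second pair has the reset exactly on the border, which only the first/last comparison can reveal.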

SLIDE 46

How to turn this into a response for PromQL remote read?

time

SLIDE 47

Generate a response sequence for each query.

time

t3 t4 t1 t2 t0

SLIDE 48

PromQL sees a monotonically increasing value, with no resets.

time

t3 t4 t1 t2 t0
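
A possible way to build such a response is to accumulate the per-window event counts into a running total and return one sample per window. The sketch assumes each stored window carries a timestamp, its first and last raw values, and an event sum equal to the first value plus reset-adjusted in-window increases; that layout and the name `response_sequence` are assumptions for illustration.

```python
from typing import List, Optional, Tuple

# (timestamp, first raw value, last raw value, event sum) for one window
Bucket = Tuple[int, float, float, float]


def response_sequence(buckets: List[Bucket]) -> List[Tuple[int, float]]:
    """Turn downsampled counter windows into a non-decreasing series, so that
    rate() on the PromQL side never encounters a reset."""
    out: List[Tuple[int, float]] = []
    running = 0.0
    prev_last: Optional[float] = None
    for ts, first, last, events in buckets:
        if prev_last is None or first < prev_last:
            running += events              # first window, or border reset
        else:
            running += events - prev_last  # continue from the previous window
        out.append((ts, running))
        prev_last = last
    return out


windows = [(0, 1.0, 3.0, 3.0), (300, 5.0, 4.0, 12.0), (600, 2.0, 5.0, 5.0)]
series = response_sequence(windows)
```

The second window hides an internal reset and the third starts after a border reset, yet the generated values only ever increase.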

SLIDE 49

API / PromQL / Storage

API: range query from t0 to t1, step 10s
PromQL: up{env="prod"} > 1
Storage:
  labels: __name__ = "up", env = "prod"
  time: start: (t0 - 5m), end: t1
  …

API: range query from t0 to t1, step 10s
PromQL: rate(alerts_total[1m])
Storage:
  labels: __name__ = "alerts_total"
  time: start: t0, end: t1
  read hints: func: "rate"
  …

✔ Sample Alignment
✔ Aggregation Selection
✔ Counter Downsampling

SLIDE 50

Sysdig Backend

  • PromQL Evaluation
  • Sysdig Data Engine and Store
  • Host/Node based Agents
  • Grafana PromQL Dashboards/Alerts
  • Sysdig Dashboards/Alerts/Topology
  • Prometheus HTTP API
  • Sysdig Data API

Status Data

  • Orchestrator State
  • Service Topology
  • Application Checks

Time Series Data

  • StatsD
  • JMX
  • Prometheus
  • ...
SLIDE 51

Thanks!