Cortex: Prometheus as a Service, One Year On (Tom Wilkie, PromCon 2017)



SLIDE 1

Cortex: Prometheus as a Service, One Year On

Tom Wilkie, PromCon 2017 tom.wilkie@gmail.com

https://github.com/weaveworks/cortex

SLIDE 2
SLIDE 3

https://www.youtube.com/watch?v=3Tb4Wc0kfCM

SLIDE 4

Cortex: Prometheus as a Service

  • Natively multi-tenant; isolate different customers in the same services.
  • Different story around scaling & HA
  • “Virtually infinite” retention and durability
  • Opportunities for performance enhancements

[Diagram. Labels: Cortex, Your Jobs, Your Prometheus (HA), Your Alertmanager, Your Grafana.]

SLIDE 5

[Diagram. Labels: Prometheus, Your Jobs, Frontend, Distributor, Ingester, Consul, Memcache, DynamoDB, S3. Arrow legend: write requests, read requests, control requests.]

Cortex Architecture
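
The diagram names a Distributor, Ingesters, and Consul, and a later slide mentions "ring storage". Assuming the Distributor places each series on Ingesters by way of a consistent hash ring, a toy version might look like the sketch below. The class name, token count, hash function, and series key format are all illustrative assumptions, not Cortex's actual code.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a 32-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:4], "big")


class Ring:
    """Toy consistent-hash ring; each ingester owns several tokens."""

    def __init__(self, ingesters, tokens_per_ingester=64):
        # Every ingester registers many tokens so load spreads evenly.
        self._tokens = sorted(
            (_hash(f"{name}-{i}"), name)
            for name in ingesters
            for i in range(tokens_per_ingester)
        )
        self._positions = [pos for pos, _ in self._tokens]

    def ingesters_for(self, series_key: str, replicas: int = 1):
        """Walk clockwise from the key's position, collecting distinct ingesters."""
        start = bisect.bisect_right(self._positions, _hash(series_key))
        found = []
        for offset in range(len(self._tokens)):
            _, name = self._tokens[(start + offset) % len(self._tokens)]
            if name not in found:
                found.append(name)
            if len(found) == replicas:
                break
        return found


ring = Ring(["ingester-0", "ingester-1", "ingester-2"])
series_key = 'tenant-a:http_requests_total{job="api"}'  # hypothetical key format
owners = ring.ingesters_for(series_key, replicas=2)
```

The same key always maps to the same ingesters, and adding or removing an ingester only moves the series adjacent to its tokens. In this picture the shared ring state would be what lives in Consul.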

SLIDE 6

A Year’s Evolution

SLIDE 7

Problem #1: DynamoDB Write Throughput

SLIDE 8

https://github.com/weaveworks/cortex/issues/254

SLIDE 9

[Diagram. Labels: Prometheus, Your Jobs, Frontend, Distributor, Ingester, Table Manager, Consul, Memcache, DynamoDB, S3. Arrow legend: write requests, read requests, control requests.]

Cortex Architecture

SLIDE 10

Problem #2: DynamoDB Write Throughput, again

SLIDE 11

Original schema:

  • Hash Key: <user ID>:<hour>:<metric name>
  • Range Key: <label name>:<label value>:<chunk ID>

New schema:

  • Hash Key: <user ID>:<day>:<metric name>:<label name>
  • Range Key: <chunk ID>:<chunk end time>

https://github.com/weaveworks/cortex/pull/262
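
To make the two key layouts concrete, here is a minimal sketch that builds both from the templates on the slide. The function names, the use of epoch seconds, and the hour/day bucket arithmetic are assumptions for illustration; the slide gives only the key templates.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400


def original_keys(user_id, timestamp, metric, label_name, label_value, chunk_id):
    """Original layout: one hash key per user, hour, and metric."""
    hour = timestamp // SECONDS_PER_HOUR  # assumed bucket: hours since epoch
    hash_key = f"{user_id}:{hour}:{metric}"
    range_key = f"{label_name}:{label_value}:{chunk_id}"
    return hash_key, range_key


def new_keys(user_id, timestamp, metric, label_name, chunk_id, chunk_end):
    """New layout: the label name moves into the hash key."""
    day = timestamp // SECONDS_PER_DAY  # assumed bucket: days since epoch
    hash_key = f"{user_id}:{day}:{metric}:{label_name}"
    range_key = f"{chunk_id}:{chunk_end}"
    return hash_key, range_key


# One chunk of a series with three labels (all values hypothetical).
labels = {"job": "api", "instance": "a:80", "code": "200"}
ts, chunk_end = 1_500_000_000, 1_500_003_600

old_hash_keys = {
    original_keys("u1", ts, "up", name, value, "c1")[0]
    for name, value in labels.items()
}
new_hash_keys = {
    new_keys("u1", ts, "up", name, "c1", chunk_end)[0]
    for name in labels
}
```

In this example the original layout sends all three index entries to one hash key, while the new layout spreads them over three. DynamoDB partitions data by hash key, so spreading writes for a busy metric across more hash keys is presumably what eases the write-throughput problem.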

SLIDE 12

Problem #3: Queries of Death

SLIDE 13

[Diagram. Labels: Prometheus, Your Jobs, Frontend, Distributor, Querier, Ingester, Table Manager, Consul, Memcache, DynamoDB, S3. Arrow legend: write requests, read requests, control requests.]

Cortex Architecture

SLIDE 14

Problem #3: Recording rules and alerts

SLIDE 15

[Diagram. Labels: Prometheus, Your Jobs, Frontend, Distributor, Querier, Ruler, Ingester, Table Manager, Consul, Memcache, DynamoDB, S3. Arrow legend: write requests, read requests, control requests.]

Cortex Architecture

SLIDE 16

Problem #4: Long tail

SLIDE 17
SLIDE 18

https://www.weave.works/blog/the-long-tail-tools-to-investigate-high-long-tail-latency/

SLIDE 19

Problem #5: Cost

SLIDE 20

                            S3        DynamoDB
IOP Cost ($/IOP)            5x10^-6   2x10^-7
Storage Cost ($/GB/Month)   0.023     0.250

https://github.com/weaveworks/cortex/issues/141
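
The two price rows imply a crossover object size: DynamoDB is cheaper per operation, S3 is cheaper per stored gigabyte. A rough sketch of that arithmetic follows, using only the slide's four numbers. The cost model is an assumption: one write per object, one month of retention, and each write counted as a single IOP regardless of size.

```python
# Prices from the slide.
S3_IOP, S3_STORAGE = 5e-6, 0.023      # $/IOP, $/GB/month
DDB_IOP, DDB_STORAGE = 2e-7, 0.250    # $/IOP, $/GB/month


def object_cost(iop_cost, storage_cost, size_gb, months=1.0):
    """Cost of writing one object once and keeping it for `months`."""
    return iop_cost + storage_cost * size_gb * months


def break_even_size_gb(months=1.0):
    """Object size at which S3 and DynamoDB cost the same."""
    return (S3_IOP - DDB_IOP) / ((DDB_STORAGE - S3_STORAGE) * months)


size = break_even_size_gb()           # roughly 2.1e-5 GB, i.e. about 21 KB
small = 1e-6                          # 1 KB object: DynamoDB is cheaper
large = 1e-3                          # 1 MB object: S3 is cheaper
ddb_wins_small = object_cost(DDB_IOP, DDB_STORAGE, small) < object_cost(S3_IOP, S3_STORAGE, small)
s3_wins_large = object_cost(S3_IOP, S3_STORAGE, large) < object_cost(DDB_IOP, DDB_STORAGE, large)
```

Under these assumptions, objects smaller than about 21 KB are cheaper in DynamoDB and larger ones are cheaper in S3. Longer retention moves the crossover toward smaller objects.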

SLIDE 21

[Chart: Cost ($) against Object size (GB), with one line each for S3 and DynamoDB.]

SLIDE 22

[Diagram. Labels: Prometheus, Your Jobs, Frontend, Distributor, Querier, Ruler, Ingester, Table Manager, Consul, Memcache, DynamoDB. Arrow legend: write requests, read requests, control requests.]

Cortex Architecture

SLIDE 23

Problem #6: DynamoDB, again

SLIDE 24

[Diagram. Labels: Prometheus, Your Jobs, Frontend, Distributor, Querier, Ruler, Ingester, Table Manager, Consul, Memcache, BigTable. Arrow legend: write requests, read requests, control requests.]

Cortex Architecture

SLIDE 25

                                     DynamoDB   BigTable
99th Percentile Write Latency (ms)   70-100     50-150
99th Percentile Read Latency (ms)    100-2500   ~250
LOC                                  ~2000      ~400

DynamoDB numbers courtesy of Weaveworks

SLIDE 26

Closing thoughts

SLIDE 27
  • 1. DynamoDB Write Throughput
  • 2. DynamoDB Write Throughput, again
  • 3. Recording rules and alerts
  • 4. Long tail
  • 5. Cost
  • 6. DynamoDB, again
SLIDE 28

Running for >12 months

  • Availability: ~99.9% (querier unavailable for <12 hrs)
  • Durability: >99.5% (lost <2 days of data)
  • 99th percentile write performance: ~60ms
  • 99th percentile query performance: <200ms
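
As a rough check on the availability and durability percentages, the sketch below converts the stated downtime and data loss into fractions. The 365-day window is an assumption; the slide says only "more than 12 months", and the 12 hours and 2 days are upper bounds.

```python
HOURS_PER_YEAR = 365 * 24  # assumed observation window of exactly one year


def fraction_ok(lost, total):
    """Share of the window that was not lost."""
    return 1.0 - lost / total


# Upper bounds from the slide: <12 hours of querier downtime, <2 days of data lost.
availability = fraction_ok(12, HOURS_PER_YEAR)
durability = fraction_ok(2, 365)

availability_pct = round(availability * 100, 2)
durability_pct = round(durability * 100, 2)
```

With these inputs availability comes out at about 99.86%, in line with the slide's ~99.9%. Durability comes out at about 99.45%, so the slide's >99.5% implies somewhat less than two full days lost or a window longer than one year.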

SLIDE 29

Future

  • Direct chunk writes from Prometheus to Cortex Chunk Store
  • Separate ingester index for better load balancing
  • Use prometheus/tsdb for the ingesters
  • Etcd & gossip for ring storage
  • Chunks in Google Cloud Storage
SLIDE 30

One more thing…

SLIDE 31

I left Weaveworks at the beginning of June to focus on Prometheus & Cortex development. Since then I’ve teamed up with David to develop some ideas around Prometheus, logging, and tracing.

We’re available for Prometheus hosting, consulting, training and support.

email: hello@kausal.co

SLIDE 32

Metrics

SLIDE 33

Logs

SLIDE 34

Traces

SLIDE 35

Thank you!

Questions?