Graphite@Scale: How to store million metrics per second



slide-1
SLIDE 1

Graphite@Scale:

How to store million metrics per second

Vladimir Smirnov, System Administrator

LinuxCon Europe 2016

5 October 2016

slide-2
SLIDE 2

Why might you need to store your metrics? The most common cases:

◮ Capacity planning
◮ Troubleshooting and postmortems
◮ Visualization of business data
◮ And more...

slide-3
SLIDE 3

Graphite and its modular architecture

From graphiteapp.org

◮ Stores time-series data
◮ Easy to use: text protocol and HTTP API
◮ You can create any data flow you want
◮ Modular: you can replace any part of it
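As an illustration of how simple the text protocol is, here is a minimal Python sketch. It assumes the usual plaintext format of one data point per line, `<metric path> <value> <unix timestamp>`, sent over TCP. The helper names, host, and port are placeholders for this example (2003 is the commonly used default plaintext port, but a given setup may differ).

```python
import socket


def format_metric(path: str, value: float, timestamp: int) -> bytes:
    """Encode one data point as '<path> <value> <timestamp>' plus a newline."""
    return f"{path} {value} {timestamp}\n".encode("ascii")


def send_metrics(lines, host="127.0.0.1", port=2003):
    """Send already-encoded lines over a single TCP connection.

    host and port are placeholders for this sketch, not values from the talk.
    """
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"".join(lines))


# Build one line; the timestamp is just an example Unix time in seconds.
line = format_metric("sys.server.cpu.user", 42.5, 1475625600)
```

Calling `send_metrics([line])` would deliver the point to whatever is listening on the chosen host and port, for example a relay or a cache.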

slide-4
SLIDE 4

Open Source stack

[Diagram: servers, apps, etc. send Metrics to carbon-relay and carbon-aggregator, which feed carbon-cache on Store1 and Store2 in DC1 and DC2; User Requests reach graphite-web through a LoadBalancer, with graphite-web also running on every store.]

slide-5
SLIDE 5

Breaking graphite: our problems at scale

[Diagram: the same Open Source stack as on slide 4.]

What’s wrong with this schema?

◮ carbon-relay is a single point of failure (SPOF)
◮ Doesn't scale well
◮ Stores may have different data after failures
◮ Render time increases with more store servers

slide-6
SLIDE 6

Replacing carbon-relay

[Diagram: the same stack with carbon-relay replaced. Components shown: carbon-c-relay on each server (Servers, Apps, etc.), further carbon-c-relay instances on the Metrics path, carbon-cache and graphite-web on Store1 and Store2 in DC1 and DC2, and a LoadBalancer with graphite-web for User Requests.]

slide-7
SLIDE 7

Replacing carbon-relay

carbon-c-relay:

◮ Written in C
◮ Routes 1M data points per second using only 2 cores
◮ L7 load balancer for the Graphite line protocol (round-robin with sticking)
◮ Can do aggregations
◮ Buffers the data if the upstream is unavailable

slide-8
SLIDE 8

Zipper stack: Solution

Query: target=sys.server.cpu.user

Result:

Node1: t0 V V V V V t1
Node2: t0 V V V V V t1
Zipped metric: t0 V V V V V V V t1
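The idea on this slide can be sketched in a few lines of Python: each node returns the same time range with some points missing, and the zipped result takes, for every time slot, a value from whichever node has one. This is only an illustration of the concept. The function name, the use of `None` for a missing point, and the sample values are assumptions for this sketch, and carbonzipper's actual merge logic may differ.

```python
from typing import List, Optional

Series = List[Optional[float]]


def zip_series(replicas: List[Series]) -> Series:
    """Merge equal-length replicas of one metric, slot by slot.

    For each time slot, take the first non-missing value across replicas.
    None marks a gap (an assumption of this sketch).
    """
    if not replicas:
        return []
    length = len(replicas[0])
    if any(len(r) != length for r in replicas):
        raise ValueError("replicas must cover the same time range and step")
    merged: Series = []
    for points in zip(*replicas):
        merged.append(next((p for p in points if p is not None), None))
    return merged


# Two nodes with different gaps between t0 and t1 (made-up sample data).
node1 = [1.0, 2.0, None, None, 5.0, 6.0, None]
node2 = [1.0, None, 3.0, None, 5.0, None, 7.0]
zipped = zip_series([node1, node2])
```

A slot stays empty only when every replica is missing it, which is why the zipped metric has more points than either node alone.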

slide-9
SLIDE 9

Zipper stack: architecture

[Diagram: User Requests enter through a LoadBalancer to graphite-web; components shown behind it are carbonzipper and, on Store1 and Store2 in DC1 and DC2, carbonserver next to carbon-cache.]

slide-10
SLIDE 10

Zipper stack: results

◮ Written in Go
◮ Can query store servers in parallel
◮ Can "zip" the data
◮ carbonzipper ⇔ carbonserver: 2700 RPS, compared with graphite-web ⇔ carbon-cache: 80 RPS
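To show what querying store servers in parallel means in practice, here is a small Python sketch of the fan-out pattern. The real components are written in Go; this is not their code. `fetch_from_store` is a hypothetical stand-in that returns canned data instead of making a network request, and the store names are invented.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_from_store(store: str, target: str):
    """Hypothetical stand-in for a request to one store server."""
    fake_data = {
        "store1": [1.0, None, 3.0],
        "store2": [1.0, 2.0, None],
    }
    return fake_data[store]


def fetch_all(stores, target):
    """Query every store concurrently; return replies in store order."""
    with ThreadPoolExecutor(max_workers=len(stores)) as pool:
        futures = [pool.submit(fetch_from_store, s, target) for s in stores]
        return [f.result() for f in futures]


replies = fetch_all(["store1", "store2"], "sys.server.cpu.user")
```

With this pattern the total wait is roughly the time of the slowest store rather than the sum over all stores, so adding store servers does not have to lengthen each request.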

slide-11
SLIDE 11

Metric distribution: how it works

Up to 20% difference in the worst case

slide-12
SLIDE 12

Metric distribution: jump hash
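The slide gives no details, so the following is a hedged sketch of jump consistent hashing as published by Lamping and Veach, ported to Python. Turning the metric name into a 64-bit key with FNV-1a is an assumption made for this example, and `pick_store` is a made-up helper name; how any particular relay derives its key and orders its servers may differ.

```python
MASK64 = (1 << 64) - 1


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a hash, used here only to turn a name into a key."""
    h = 0xCBF29CE484222325  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & MASK64  # FNV prime, wrapped to 64 bits
    return h


def jump_hash(key: int, num_buckets: int) -> int:
    """Map a 64-bit key to a bucket in [0, num_buckets)."""
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & MASK64
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


def pick_store(metric: str, num_stores: int) -> int:
    """Hypothetical helper: choose a store index for a metric name."""
    return jump_hash(fnv1a_64(metric.encode("utf-8")), num_stores)
```

The useful property is that growing from n to n+1 buckets moves a key either nowhere or to the new bucket, so only about 1/(n+1) of the metrics are relocated. The trade-off is that buckets are plain indexes, so servers can be added at the end of the list but not removed from the middle without remapping.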

slide-13
SLIDE 13

Rewriting Frontend in Go: carbonapi

[Diagram: components shown are a LoadBalancer for User Requests, graphite-web and carbonapi as frontends, carbonzipper, carbon-c-relay, and carbonserver next to carbon-cache on Store1 and Store2 in DC1.]

slide-14
SLIDE 14

Rewriting Frontend in Go: result

◮ Significantly reduced response time for users (15s ⇒ 0.8s)
◮ Allows more complex queries because it's faster
◮ Easier to implement new heavy math functions
◮ Also available as a Go library

slide-15
SLIDE 15

Replication techniques and their pros and cons

a,h c,a b,c d,e e,f g,b f,d h,g

Replication Factor 2

slide-16
SLIDE 16

Replication techniques and their pros and cons

a,e c,g b,f d,h a,e c,g b,f d,h

Replication Factor 1

slide-17
SLIDE 17

Replication techniques and their pros and cons

a,e c,g b,f d,h a,g h,e c,f b,d

Replication Factor 1, randomized
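One way to compare the three layouts is to count which pairs of failed servers would leave some data with no surviving copy. The sketch below does that in Python. It rests on an assumption about how to read the slides: each comma-separated pair (for example `a,h`) is taken as one server holding those two shards, giving eight servers per layout. The dictionary keys reuse the slide captions, and the function names are invented for this example. It is not the author's test from graphite-rf-test.

```python
from itertools import combinations

# Each string is one server; each letter is a shard stored on it (assumed reading).
LAYOUTS = {
    "Replication Factor 2": ["ah", "ca", "bc", "de", "ef", "gb", "fd", "hg"],
    "Replication Factor 1": ["ae", "cg", "bf", "dh", "ae", "cg", "bf", "dh"],
    "Replication Factor 1, randomized": ["ae", "cg", "bf", "dh", "ag", "he", "cf", "bd"],
}


def lost_shards(layout, failed):
    """Return the shards with no surviving copy when `failed` servers are down."""
    alive = set()
    for idx, shards in enumerate(layout):
        if idx not in failed:
            alive.update(shards)
    return set("".join(layout)) - alive


def fatal_pairs(layout):
    """List every pair of servers whose joint failure loses at least one shard."""
    pairs = combinations(range(len(layout)), 2)
    return [p for p in pairs if lost_shards(layout, p)]


summary = {
    name: (
        len(fatal_pairs(layout)),
        max(len(lost_shards(layout, p)) for p in fatal_pairs(layout)),
    )
    for name, layout in LAYOUTS.items()
}
```

Under this reading, out of 28 possible server pairs the mirrored layout has 4 fatal pairs that each lose 2 shards, while the other two layouts have 8 fatal pairs that each lose 1 shard. Fewer failure combinations hurt in the mirrored case, but each one hurts more.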

slide-18
SLIDE 18

Replication techniques and their pros and cons

slide-19
SLIDE 19

Replication techniques and their pros and cons

slide-20
SLIDE 20

Our current setup

◮ 32 frontend servers
◮ 200 RPS on the frontend
◮ 30k metric requests per second
◮ 11 Gbps traffic on the backend
◮ 200 store servers in 2 DCs
◮ 2M unique metrics per second (8M hitting stores)
◮ 130 TB of metrics in total
◮ Replaced all the components*

* except for carbon-cache

slide-21
SLIDE 21

What’s next?

◮ Metadata search (in progress)
◮ Solve problems with missing cache (in progress)
◮ Find a replacement for Whisper
◮ Improve aggregators
◮ Replace the Graphite line protocol between components

slide-22
SLIDE 22

It’s all Open Source!

◮ carbonzipper: github.com/dgryski/carbonzipper
◮ carbonserver: github.com/grobian/carbonserver
◮ carbonapi: github.com/dgryski/carbonapi
◮ carbon-c-relay: github.com/grobian/carbon-c-relay
◮ carbonmem: github.com/dgryski/carbonmem
◮ replication factor test: github.com/Civil/graphite-rf-test

slide-23
SLIDE 23

Questions? vladimir.smirnov@booking.com

slide-24
SLIDE 24

Thanks! We are hiring! https://workingatbooking.com