monitors distributed systems: a long time ago in a galaxy far far away...



slide-1
SLIDE 1

monitors distributed systems

slide-2
SLIDE 2

a long time ago in a galaxy far far away...

slide-3
SLIDE 3
slide-4
SLIDE 4

Distributed architectures are hard


slide-5
SLIDE 5

Monitoring distributed systems

  • Time windows
  • Rates
  • Percentiles
  • Cluster monitoring
  • Correlation between metrics
  • State transitions (OK => KO)
  • Alerts (mail, slack, pagerduty…)
  • Flexibility
  • ...

[Figure: write rates of 100k, 105k, 1k and 101k write/sec]

slide-6
SLIDE 6


slide-7
SLIDE 7
Riemann

  • Created by Kyle Kingsbury (Aphyr)
  • Event processing
  • Clojure
  • Monitoring

slide-8
SLIDE 8

An immutable event

  :host "foo.bar.com"
  :service "df_percent_bytes_used_root"
  :state "critical"
  :time 1493243041
  :metric 90
  :description "Disk is full"
  :tags ["disk"]
  :ttl 60
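A minimal sketch, in plain Clojure with no Riemann dependency, of what "immutable" means for the event above. It assumes an event can be modeled as an ordinary map; the names `event` and `updated` are illustrative only.

```clojure
;; Plain Clojure model of the event from the slide (not Riemann's own event type).
(def event
  {:host        "foo.bar.com"
   :service     "df_percent_bytes_used_root"
   :state       "critical"
   :time        1493243041
   :metric      90
   :description "Disk is full"
   :tags        ["disk"]
   :ttl         60})

;; assoc builds a new map; `event` itself is left untouched.
(def updated (assoc event :state "ok"))

(println (:state event))   ; the original still says "critical"
(println (:state updated)) ; the copy says "ok"
```

Because every change produces a new value, two streams that receive the same event cannot affect each other.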

slide-9
SLIDE 9

[Figure: event sources and transports]

  • Collectd, Telegraf, K8s/Heapster, Statsd, Graphite ...
  • Syslog-ng, Logstash, Fluentd ...
  • Kafka, Nagios check, Chef ...
  • Java, Haskell, Go, Python, Perl ...

Transports: UDP, TCP, HTTP, Graphite, TLS, OpenTSDB

Figure labels: Good, Drop packets, Slow, Compat

slide-10
SLIDE 10

Streams


slide-11
SLIDE 11

Input events:

  :host "foo1.com" :service "api_rate" :time 1493243041 :metric 90
  :host "foo1.com" :service "foobar" :time 1493243041 :metric 90
  :host "foo1.com" :service "api_rate" :time 1493243044 :metric 90

where = service "api_rate"

Output events:

  :host "foo1.com" :service "api_rate" :time 1493243041 :metric 90
  :host "foo1.com" :service "api_rate" :time 1493243044 :metric 90

slide-12
SLIDE 12

Input events:

  :host "foo1.com" :service "api_rate" :time 1493243041 :metric 90
  :host "foo1.com" :service "api_rate" :time 1493243044 :metric 90

fixed-time-window 10

Output (one window containing both events):

  :host "foo1.com" :service "api_rate" :time 1493243041 :metric 90
  :host "foo1.com" :service "api_rate" :time 1493243044 :metric 90

slide-13
SLIDE 13

Input window:

  :host "foo1.com" :service "api_rate" :time 1493243041 :metric 90
  :host "foo1.com" :service "api_rate" :time 1493243044 :metric 90

smap sum

Output event:

  :host "foo1.com" :service "api_rate" :time 1493243044 :metric 180

slide-14
SLIDE 14

Input event:

  :host "foo1.com" :service "api_rate" :time 1493243044 :metric 180

where < metric 200

Output event (180 < 200, so the event passes):

  :host "foo1.com" :service "api_rate" :time 1493243044 :metric 180

slide-15
SLIDE 15

Input event:

  :host "foo1.com" :service "api_rate" :time 1493243044 :metric 180

email "ops@riemann.io"

slide-16
SLIDE 16

where = service "api_rate"
fixed-time-window 10
smap sum
where < metric 200
email "ops@riemann.io"

slide-17
SLIDE 17

(where (= service "api_rate")
  (fixed-time-window 10
    (smap folds/sum                  ;; riemann.folds/sum adds the metrics of a window
      (where (< metric 200)
        (email "ops@riemann.io")))))
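The pipeline built up over the previous slides can be modeled in plain Clojure. This is only a sketch of the data flow, not Riemann's implementation: it treats the whole input as a single 10-second window, and `sum-events` and `alerts` are hypothetical helper names.

```clojure
;; Plain Clojure model of the pipeline; sum-events and alerts are hypothetical helpers.
(def events
  [{:host "foo1.com" :service "api_rate" :time 1493243041 :metric 90}
   {:host "foo1.com" :service "foobar"   :time 1493243041 :metric 90}
   {:host "foo1.com" :service "api_rate" :time 1493243044 :metric 90}])

(defn sum-events
  "Merge a window of events into its last event, with the metrics added up."
  [window]
  (assoc (last window) :metric (reduce + (map :metric window))))

(defn alerts
  "Events that would be emailed for one 10-second window of input."
  [window]
  (let [api    (filter #(= "api_rate" (:service %)) window) ; where = service "api_rate"
        summed (when (seq api) (sum-events api))]            ; smap sum
    (if (and summed (< (:metric summed) 200))                ; where < metric 200
      [summed]
      [])))

(println (alerts events))
```

With the three sample events, the two `api_rate` events are summed into one event with metric 180, which is below 200 and would therefore be emailed.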

slide-18
SLIDE 18

(where (= service "api_rate")
  (fixed-time-window 10
    (smap folds/sum                  ;; riemann.folds/sum adds the metrics of a window
      (where (< metric 200)
        (email "ops@riemann.io")))))

Use map!

slide-19
SLIDE 19

Configuration as code (your config is 100% Clojure)

slide-20
SLIDE 20

(with {:description "Disk is full" :state "critical"} child)

Input event:

  :host "foo.bar.com" :service "df_home_mathieu" :state "ok" :time 1493243041 :metric 90

Event sent to child:

  :host "foo.bar.com" :service "df_home_mathieu" :state "critical" :time 1493243041 :metric 90 :description "Disk is full"

slide-21
SLIDE 21

(where (service "foo")
  (with {:description "cat"}         ;; first child of where
    (email "ops@riemann.io"))
  (with {:description "dog"}         ;; second child of where
    (email "dev@riemann.io")))

where has 2 children.

slide-22
SLIDE 22

(where (service "foo")
  (with {:description "cat"}         ;; first child of where
    (email "ops@riemann.io"))        ;; only child of the first with
  (with {:description "dog"}         ;; second child of where
    (email "dev@riemann.io")))       ;; only child of the second with

where has 2 children. Each with has 1 child.
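The parent/child structure can be sketched in plain Clojure by treating a stream as a function of one event that calls its children. The helpers `where*`, `with*` and `collect!` are hypothetical stand-ins written for this illustration; they are not Riemann's `where`, `with` or `email`.

```clojure
;; Hypothetical helpers (where*, with*, collect!) modeling streams as functions.
(defn where*
  "Returns a stream that forwards an event to every child when (pred event) is truthy."
  [pred & children]
  (fn [event]
    (when (pred event)
      (doseq [child children]
        (child event)))))

(defn with*
  "Returns a stream that forwards a copy of the event with `fields` merged in."
  [fields & children]
  (fn [event]
    (doseq [child children]
      (child (merge event fields)))))

;; Stand-in for the email outputs: record what each address would receive.
(def sent (atom []))

(defn collect! [address]
  (fn [event] (swap! sent conj [address (:description event)])))

(def stream
  (where* #(= "foo" (:service %))
    (with* {:description "cat"} (collect! "ops@riemann.io"))
    (with* {:description "dog"} (collect! "dev@riemann.io"))))

(stream {:service "foo" :metric 1})
(stream {:service "bar" :metric 2}) ; filtered out by where*

(println @sent)
```

Each child receives its own copy of the event, so the "cat" description set for the first child never leaks into the second.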

slide-23
SLIDE 23

  • Clojure data structures
  • Immutability
  • No side effects between streams

slide-24
SLIDE 24

(where (service "df_percent_bytes_used_var_log"))      ;; exact service name

(where (service #"^df_percent_bytes_used_"))           ;; regular expression

(where (and (service #"^df_percent_bytes_used_")       ;; regular expression and metric threshold
            (> (:metric event) 80)))

slide-25
SLIDE 25

(default :ttl 60 child)

Input event:

  :host "foo.bar.com" :service "df_home_mathieu" :state "ok" :time 1493243041 :metric 90

Event sent to child:

  :host "foo.bar.com" :service "df_home_mathieu" :state "ok" :time 1493243041 :metric 90 :ttl 60

slide-26
SLIDE 26

(smap (fn [event] (assoc event :ttl 60)) child)

Input event:

  :host "foo.bar.com" :service "df_home_mathieu" :state "ok" :time 1493243041 :metric 90

Event sent to child:

  :host "foo.bar.com" :service "df_home_mathieu" :state "ok" :time 1493243041 :metric 90 :ttl 60

slide-27
SLIDE 27

(fixed-time-window 60 child1 child2)

[Figure: timeline t with marks at 60, 120, 180, 240; events are grouped into consecutive 60-second windows]
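A rough model of fixed time windows in plain Clojure. It makes a simplifying assumption that windows are aligned on multiples of the width and that empty windows are skipped, which may differ from how Riemann chooses window boundaries. `fixed-windows` is a hypothetical helper name.

```clojure
;; Hypothetical helper: bucket events into consecutive `width`-second windows.
;; Assumption: windows are aligned on multiples of `width`; empty windows are skipped.
(defn fixed-windows
  [width events]
  (->> events
       (sort-by :time)
       (partition-by #(quot (:time %) width))
       (map vec)))

(def events
  [{:time 5 :metric 1} {:time 59 :metric 2}
   {:time 60 :metric 3} {:time 130 :metric 4}])

(println (fixed-windows 60 events))
```

The events at times 5 and 59 share the first window, while the events at 60 and 130 each land in a window of their own.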

slide-28
SLIDE 28

(moving-time-window 60 child)

[Figure: timeline t with marks at 60, 120, 180, 240]

slide-29
SLIDE 29

(fixed-event-window 3 child)

[Figure: timeline t]

slide-30
SLIDE 30

(moving-event-window 3 child)

[Figure: timeline t]

slide-31
SLIDE 31

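(rate 5 child)

[Figure: timeline t with marks at 5, 10, 15. Metrics received per 5-second interval: 5, 1, 1, then 2, 1, then 9, 4, 5. Rates emitted: 1.4, 0.6, 3.6]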
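The numbers on this slide follow from dividing the sum of the metrics in each interval by the interval length. A small plain-Clojure check, assuming the metrics are grouped into three 5-second intervals as the emitted rates suggest (`rate-per-second` is a hypothetical helper, not Riemann's `rate`):

```clojure
;; Metrics received in three consecutive 5-second intervals (values from the slide).
(def intervals [[5 1 1] [2 1] [9 4 5]])

(defn rate-per-second
  "Sum of the metrics seen in one interval, divided by its length in seconds."
  [interval-seconds metrics]
  (/ (reduce + metrics) (double interval-seconds)))

(println (map #(rate-per-second 5 %) intervals)) ; prints (1.4 0.6 3.6)
```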

slide-32
SLIDE 32

(scale (/ 1 1024 1024 1024) child)

[Figure: the same metric over time t, in bytes before scale and in gigabytes after]

slide-33
SLIDE 33

(ddt child)

[Figure: timeline t; values shown: 5, 25, 65, 15, 38, 2, 4.75, 19]

slide-34
SLIDE 34

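(changed :state {:init "ok"} child)

[Figure: timeline t of event states: ok, ok, ko, ko, ko, ok, ko, ko, ok, ko. Only the events whose state differs from the previous one are sent to child]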
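A plain-Clojure sketch of this state-transition filter, replaying the sequence of states from the figure. `changed-events` is a hypothetical helper and the `:time` values are made-up indices used only to identify events.

```clojure
;; Hypothetical helper modeling (changed :state {:init "ok"} ...).
(defn changed-events
  "Keeps only the events whose :state differs from the previous state, starting from `init`."
  [init events]
  (loop [prev init, remaining events, out []]
    (if-let [event (first remaining)]
      (recur (:state event)
             (rest remaining)
             (if (= prev (:state event)) out (conj out event)))
      out)))

(def states ["ok" "ok" "ko" "ko" "ko" "ok" "ko" "ko" "ok" "ko"])

;; Illustrative events: :time is just the position in the sequence.
(def events (map-indexed (fn [i s] {:time i :state s}) states))

(println (map (juxt :time :state) (changed-events "ok" events)))
```

Out of ten events, only the five transitions (at positions 2, 5, 6, 8 and 9) reach the child, which is what keeps alerting quiet while a state is stable.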

slide-35
SLIDE 35

(by [:host :service]
  (changed :state {:init "ok"} child))

[Figure: two timelines t, one per [host service] pair, each tracking its own state changes: ok, ko, ko, ok and ko, ko, ok, ok, ok]

  :host "foo.com" :service "kafka lag"
  :host "foo.com" :service "disk /root %"

slide-36
SLIDE 36

(where (service "api request")
  (percentiles 60 [0.5 0.99] child))

[Figure: timeline t over a 60-second interval containing events with metrics 10, 20, 30]

Events sent to child:

  :host "riemann.io" :service "api request 0.5" :metric 20
  :host "riemann.io" :service "api request 0.99" :metric 30
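One simple way to pick percentiles from a window is to sort the metrics and index into them. The sketch below uses a floor(n * p) index, which reproduces the slide's values (20 and 30); whether Riemann uses exactly this rule is an assumption, and `percentile` is a hypothetical helper.

```clojure
;; Hypothetical helper; assumes `metrics` is non-empty.
(defn percentile
  "Value at fraction p (0 to 1) of the sorted metrics, using a floor(n * p) index."
  [p metrics]
  (let [sorted (vec (sort metrics))
        n      (count sorted)
        idx    (min (dec n) (int (Math/floor (* n p))))]
    (nth sorted idx)))

(def window [10 20 30]) ; metrics seen during the 60-second window on the slide

(println (percentile 0.5 window) (percentile 0.99 window)) ; prints 20 30
```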

slide-37
SLIDE 37

(where (state "critical")
  (throttle 2 3600
    (email "foo@riemann.io")))

[Figure: timeline t over a 3600-second interval; event states: critical, ok, critical, critical, critical, critical, critical; events forwarded to child]
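A plain-Clojure model of throttling: at most n events pass per window of dt seconds, and the rest are dropped. The sketch assumes a window opens at the first event seen after the previous window closed, which may not match Riemann's exact timing; `throttled` and the event times are hypothetical.

```clojure
;; Hypothetical helper modeling (throttle n dt ...).
(defn throttled
  "Keeps at most n events per dt-second window. A window opens at the first
   event seen after the previous window closed."
  [n dt events]
  (:out
   (reduce
    (fn [{:keys [start seen out]} event]
      (let [t           (:time event)
            new-window? (or (nil? start) (>= t (+ start dt)))
            start       (if new-window? t start)
            seen        (if new-window? 0 seen)]
        (if (< seen n)
          {:start start :seen (inc seen) :out (conj out event)}
          {:start start :seen seen :out out})))
    {:start nil :seen 0 :out []}
    events)))

;; Illustrative critical events at made-up times (seconds).
(def criticals
  (map (fn [t] {:state "critical" :time t}) [0 10 20 30 3600 3610]))

(println (map :time (throttled 2 3600 criticals))) ; prints (0 10 3600 3610)
```

With `2` and `3600`, a burst of critical events produces at most two emails per hour instead of one per event.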

slide-38
SLIDE 38

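(where (service "cpu %")
  (coalesce 10
    (smap folds/maximum)))           ;; riemann.folds/maximum keeps the event with the highest metric

[Figure: host 1, host 2, ... each send "cpu %" events; every 10s, coalesce emits the latest event from each host]

  :host "host 2" :service "cpu %" :metric 20
  :host "host 1" :service "cpu %" :metric 10
  ...

Max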

slide-39
SLIDE 39

  • InfluxDB, Elasticsearch, Graphite, Kafka, Logstash, Datadog, Cloudwatch, Riemann ...
  • Email, Hipchat, Slack, Mailgun ...
  • Nagios, Shinken ...
  • Pagerduty, VictorOps, Twilio, Alerta ...

slide-40
SLIDE 40

(batch 100 1                         ;; batch size = 100 every 1 sec
  (async-queue! :influxdb            ;; create a threadpool
    {:queue-size 10000
     :core-pool-size 4}
    (influxdb {:host "127.0.0.1"     ;; forward to influx
               :db "riemann"})))

slide-41
SLIDE 41

(exception-stream (email "alert@riemann.io")
  (influxdb {:host "127.0.0.1"
             :db "riemann"}))

slide-42
SLIDE 42

Configuration as code

Split your configuration


slide-43
SLIDE 43

(def check-critical-state            ;; a var containing a stream
  (where (state "critical")
    (email "admin@riemann.io")))

(defn check-state                    ;; a function returning a stream
  [s email-addr]
  (where (state s)
    (email email-addr)))

(streams
  check-critical-state
  (check-state "critical" "admin@riemann.io"))

slide-44
SLIDE 44

/etc/riemann/riemann.config
/mycorp/app/elasticsearch.clj
/mycorp/output/mail.clj
/mycorp/system/disk.clj
/mycorp/system/ram.clj
…

+ A plugin system

slide-45
SLIDE 45

Configuration as code

Tests


slide-46
SLIDE 46

(scale (/ 1 1024 1024 1024)
  (tap :scale-tap)
  child)

(tests
  (deftest foo-test
    (is (= (:scale-tap (inject! [{:metric 1000}]))
           [{:metric (/ 1000 1024 1024 1024)}]))))

slide-47
SLIDE 47

The index

  • In-memory data structure (hashmap)
  • Key: [host service]
  • Value: an event
  • The index stream adds events to the index

(where (service "ram_percent") (index))

slide-48
SLIDE 48

Index content at time 3 (rows: host, columns: service):

  :host "foo.bar"   :service "cpu_percent" :metric 40 :ttl 60  :time 1
  :host "fizz.buzz" :service "cpu_percent" :metric 90 :ttl 60  :time 3
  :host "foo.bar"   :service "ram_percent" :metric 65 :ttl 120 :time 2
  :host "fizz.buzz" :service "ram_percent" :metric 80 :ttl 120 :time 2

slide-49
SLIDE 49

Index content at time 10 (rows: host, columns: service):

  :host "foo.bar"   :service "cpu_percent" :metric 45 :ttl 60  :time 10
  :host "fizz.buzz" :service "cpu_percent" :metric 90 :ttl 60  :time 3
  :host "foo.bar"   :service "ram_percent" :metric 65 :ttl 120 :time 2
  :host "fizz.buzz" :service "ram_percent" :metric 80 :ttl 120 :time 2

slide-50
SLIDE 50

Index content at time 64 (rows: host, columns: service):

  :host "foo.bar"   :service "cpu_percent" :metric 45 :ttl 60  :time 10
  :host "fizz.buzz" :service "cpu_percent" :metric 90 :ttl 60  :time 3
  :host "foo.bar"   :service "ram_percent" :metric 65 :ttl 120 :time 2
  :host "fizz.buzz" :service "ram_percent" :metric 80 :ttl 120 :time 2

slide-51
SLIDE 51

Index content at time 64 (rows: host, columns: service):

  :host "foo.bar"   :service "cpu_percent" :metric 45 :ttl 60  :time 10
  :host "fizz.buzz" :service "cpu_percent" :metric 90 :ttl 60  :time 3
  :host "foo.bar"   :service "ram_percent" :metric 65 :ttl 120 :time 2
  :host "fizz.buzz" :service "ram_percent" :metric 80 :ttl 120 :time 2

Expired events (here fizz.buzz cpu_percent, since 3 + 60 < 64) are reinjected in Riemann with :state "expired"
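The index and its expiry rule can be modeled in plain Clojure as a map keyed by [host service]. This is a sketch under the assumption that an event expires once its :time plus :ttl is in the past; the real index keeps fewer fields on expired events, and `index-event`, `expired?` and `expire` are hypothetical helpers.

```clojure
;; Hypothetical model: the index is a map from [host service] to the latest event.
(defn index-event [index event]
  (assoc index [(:host event) (:service event)] event))

(defn expired?
  "True once the event's :time plus its :ttl is in the past."
  [now event]
  (< (+ (:time event) (:ttl event)) now))

(defn expire
  "Splits the index at time `now` into the live index and the expired events."
  [index now]
  (let [dead (filter #(expired? now %) (vals index))]
    {:index   (reduce (fn [idx e] (dissoc idx [(:host e) (:service e)])) index dead)
     :expired (map #(assoc % :state "expired") dead)}))

;; Events from the slides, in arrival order; the last one replaces foo.bar cpu_percent.
(def index
  (reduce index-event {}
          [{:host "foo.bar"   :service "cpu_percent" :metric 40 :ttl 60  :time 1}
           {:host "fizz.buzz" :service "cpu_percent" :metric 90 :ttl 60  :time 3}
           {:host "foo.bar"   :service "ram_percent" :metric 65 :ttl 120 :time 2}
           {:host "fizz.buzz" :service "ram_percent" :metric 80 :ttl 120 :time 2}
           {:host "foo.bar"   :service "cpu_percent" :metric 45 :ttl 60  :time 10}]))

(def result (expire index 64))

(println (map (juxt :host :service :state) (:expired result)))
```

At time 64 only the fizz.buzz cpu_percent entry is past its deadline (3 + 60 = 63), so it leaves the index and comes back as an expired event, while the other three entries stay.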

slide-52
SLIDE 52

(expired (email "expired@riemann.io"))

slide-53
SLIDE 53

[Figure: a client queries the INDEX]

Query: service = "cpu_percent"

Response: the matching events (...)

slide-54
SLIDE 54

[Figure: two clients query the INDEX]

First client:

  Query: service = "cpu_percent"
  Response: the matching events (...)

Second client (Websocket / SSE):

  Query: service = "ram_percent" and host = "fizz.buzz"
  Response: a stream of events (...)

slide-55
SLIDE 55


slide-56
SLIDE 56

JVM

slide-57
SLIDE 57

  • Multithreaded (<3 Clojure)
  • Netty
  • Protobuf
  • In memory
  • Back pressure

FAST

slide-58
SLIDE 58
slide-59
SLIDE 59

No HA

  • Sharding
  • Send events to 2 instances
  • Keepalived
slide-60
SLIDE 60

But Clojure is hard! (((((lisp)))))

slide-61
SLIDE 61

Prometheus, Zabbix … easy?

[Figures: Prometheus, Zabbix]

slide-62
SLIDE 62

But Clojure is hard! (((((lisp)))))

slide-63
SLIDE 63

Thanks! Questions?

riemann.io