
SUMMARY Example of what Satellite can do: When swap falls below 1 day, turn the host off; if 20% of the cluster is turned off, send a PagerDuty alert.

Satellite is an application Two Sigma wrote to monitor, alert, and auto-administer our Mesos clusters.

  • Monitor: provide a global view of the cluster
  • Alert: communicate status changes to the outside world
  • Administer: perform actions on status changes that affect the cluster

SUMMARY Mesos exposes limited information through its HTTP REST API. With Satellite, you can expose arbitrary host metrics, either per host or aggregated across the cluster. As an example, I’d like to know in real time:

  • What percent of the cluster had high swap utilization
  • What the median max_allowed_age is on the cluster
  • How many slaves have a max_allowed_age less than 1 day
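As a rough illustration of the three aggregate views listed above, the sketch below computes them over a plain vector of per-host maps. The keys `:swap` and `:max-allowed-age-days`, the 0.9 cutoff for "high" swap, and the sample hosts are assumptions made up for this example; they are not Satellite's data model.

```clojure
;; Hypothetical per-host readings; values are invented for illustration.
(def hosts
  [{:host "a" :swap 0.95 :max-allowed-age-days 0.5}
   {:host "b" :swap 0.10 :max-allowed-age-days 3.0}
   {:host "c" :swap 0.92 :max-allowed-age-days 7.0}
   {:host "d" :swap 0.30 :max-allowed-age-days 2.0}])

(defn proportion
  "Fraction of items in a non-empty coll for which pred is truthy."
  [pred coll]
  (/ (count (filter pred coll)) (count coll)))

(defn median
  "Median of a non-empty collection of numbers."
  [xs]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        mid    (quot n 2)]
    (if (odd? n)
      (sorted mid)
      (/ (+ (sorted (dec mid)) (sorted mid)) 2))))

;; Proportion of the cluster with high swap utilization (assumed cutoff: 0.9).
(def high-swap-proportion
  (proportion #(>= (:swap %) 0.9) hosts))

;; Median max_allowed_age across the cluster, in days.
(def median-age
  (median (map :max-allowed-age-days hosts)))

;; Number of slaves whose max_allowed_age is under one day.
(def short-age-count
  (count (filter #(< (:max-allowed-age-days %) 1) hosts)))
```

In a real deployment these values would presumably be derived from the event stream rather than from a static vector.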

SUMMARY Alerting means communicating this aggregate view you have derived. Ex: In the swap case, say we want to receive an email whenever a host makes a state transition from < 90% swap utilization to >= 90% swap utilization. And we want to get a pagerduty alert when 50% of the cluster is >= 90%.
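The two alert conditions in this example can be sketched as pure functions over maps of host to swap utilization. This is only an illustration of the logic: the function names and the sample data are hypothetical, and actually sending the email or the page is left out.

```clojure
(defn swap-state
  "Classify a swap utilization (0.0 to 1.0) against the 90% threshold."
  [utilization]
  (if (>= utilization 0.9) "critical" "ok"))

(defn transitions
  "Hosts that moved from below 90% to at or above 90% swap utilization
  between two snapshots. Hosts missing from prev are ignored."
  [prev curr]
  (for [[host util] curr
        :let  [before (get prev host)]
        :when (and before
                   (= "ok" (swap-state before))
                   (= "critical" (swap-state util)))]
    host))

(defn page?
  "True when at least half of the hosts in a non-empty snapshot are critical."
  [snapshot]
  (>= (/ (count (filter #(= "critical" (swap-state %)) (vals snapshot)))
         (count snapshot))
      1/2))

;; Made-up snapshots: host "a" crosses the threshold, "b" was already over it.
(def prev {"a" 0.50 "b" 0.95 "c" 0.20})
(def curr {"a" 0.93 "b" 0.96 "c" 0.30})
```

Here `(transitions prev curr)` would be the hosts to email about, and `(page? curr)` decides whether the cluster-wide alert should fire.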


SUMMARY Auto-administration is the ability to programmatically control when a host will receive new tasks. Satellite offers two special primitives, off-host and on-host, which stop sending new tasks to a given host and resume sending tasks to it, respectively. Ex: Suppose that when a host has 90% of its swap used, you want to turn it off, and when the pressure is relieved, you want it to turn back on automatically. This is something you can do easily within Satellite.


SUMMARY Automation without overrides is trouble. We initially wrote Satellite without manual overrides and found times we wanted to turn off hosts that were task black holes (say there was a bad deployment on that host). Unfortunately, Satellite would turn them back on immediately. Satellite lets you set manual overrides over an HTTP REST interface; Satellite will ignore the auto-administration command while you have set a host to be on/off manually. Ex: If you want to perform some maintenance work on a set of hosts, you can loop through those hosts in a bash script and make sure no new tasks are sent to them during your maintenance window.
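The precedence rule described here (a manual override wins over the automated decision for as long as it is set) can be sketched as follows. The function names, the shape of the `overrides` map, and the host names are hypothetical; the sketch does not show Satellite's REST interface or its internal state.

```clojure
(defn auto-state
  "State the automation would pick from swap utilization alone
  (illustrative 90% threshold)."
  [swap-utilization]
  (if (>= swap-utilization 0.9) :off :on))

(defn effective-state
  "A manual override, when present for the host, takes precedence over
  the automated decision."
  [overrides host swap-utilization]
  (get overrides host (auto-state swap-utilization)))

;; Hypothetical override set during a maintenance window.
(def overrides {"host-1" :off})

(effective-state overrides "host-1" 0.10) ; override keeps the host off
(effective-state overrides "host-2" 0.10) ; no override, automation turns it on
(effective-state overrides "host-2" 0.95) ; no override, automation turns it off
```

Removing the entry from `overrides` hands control of the host back to the automation.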


SUMMARY High-level view of the Satellite architecture: In the normal Mesos architecture there are two types of hosts, master and slave hosts, on which the mesos-master and mesos-slave processes reside, respectively. In Satellite, this architecture is preserved; there is a satellite-master process that co-exists on each master host, and a satellite-slave process that co-exists on each slave host. Each Satellite slave periodically pushes a status update to the Satellite masters. It communicates with the masters over TCP, and each message is a Riemann event.


SUMMARY The update is a Riemann event. Riemann events are just key-value maps. A Riemann event is identified by the host it is coming from, the service, which is a string name, and the time the event is valid for. Conventionally, and optionally, there are also fields like “state” and “metric” that we will talk about later.
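For concreteness, here is what such an event could look like as a Clojure map. The host name, timestamp, and values are invented; `event-key` is a hypothetical helper that just picks out the identifying fields named above.

```clojure
;; A made-up event: time is in Unix epoch seconds, ttl in seconds.
(def event
  {:host    "slave-17.example.com"
   :service "mesos/slave/swap"
   :time    1420070400
   :state   "ok"
   :metric  0.42
   :ttl     300})

(defn event-key
  "The fields that identify an event: host, service, and time."
  [e]
  (select-keys e [:host :service :time]))

;; Two readings with the same identity but a different metric.
(def same-identity?
  (= (event-key event)
     (event-key (assoc event :metric 0.97 :state "critical"))))
```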


SUMMARY The satellite-slave takes a user-specified list of tests. A comet (test) is just a shell command that we run periodically and whose output we convert into a list of Riemann events.


;; A comet (a Satellite Slave test)
{:command "echo 17"
 :schedule (every (-> 60 seconds))
 :output (fn [out err exit]
           ...
           [{:state (if (zero? exit) "ok" "critical")
             :metric exit
             :ttl 300
             :service "echo returns"}])}


;; A comet (a Satellite Slave test)
{:command "echo 17"
 :schedule (every (-> 60 seconds))
 :output (fn [out err exit]
           ...
           [{:state (if (zero? exit) "ok" "critical")
             :metric (if (zero? exit) 1 0)
             :ttl 300
             :service "num echo returns"}])}
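To make the flow from shell command to events concrete, the following sketch runs a comet's command once and feeds the result to its output function. It is not Satellite's runner: `run-comet` and `echo-output` are hypothetical names, scheduling is omitted, and the command is executed through `sh -c` using Clojure's standard `clojure.java.shell/sh`.

```clojure
(require '[clojure.java.shell :refer [sh]])

(defn echo-output
  "Convert a finished command's stdout, stderr, and exit code into a
  list of Riemann-style event maps."
  [out err exit]
  [{:state   (if (zero? exit) "ok" "critical")
    :metric  exit
    :ttl     300
    :service "echo returns"}])

(defn run-comet
  "Run the comet's command once and return the events its :output
  function produces. A real runner would repeat this on the schedule."
  [{:keys [command output]}]
  (let [{:keys [out err exit]} (sh "sh" "-c" command)]
    (output out err exit)))

;; The same comet as above, minus the :schedule entry.
(def comet {:command "echo 17" :output echo-output})
```

Calling `(run-comet comet)` on a host with a POSIX shell would produce one event for the `echo returns` service.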


SUMMARY Overview of a Satellite test


SUMMARY Each slave emits its events to the masters. Now we’re finished with the slaves.


SUMMARY Satellite is able to perform its monitoring and alerting because we embed Riemann, a stream processor written by Kyle Kingsbury aka @aphyr, in the same JVM that Satellite runs in. Riemann is a stream processing system that provides many primitives and functions for monitoring and alerting. What you don’t find in Riemann, you can make yourself: every Riemann config is a Clojure/Java program, so you have a full programming language available to you. These are the reasons we chose Riemann: it was easy to extend, its data model suits our use case, and we already had experience with it. It also explains why the Satellite project is written in Clojure: Riemann is too.


;; only send tasks if enough swap
(where (service #"mesos/slave/swap")
  (where (> metric 0.9)
    (off-host host)
    (else (on-host host))))
...
(def pd (pagerduty MY-SWEET-API-KEY))
(where (service #"mesos/prop-available-hosts")
  (where (< metric 0.7)
    (:trigger pd)
    (else (:resolve pd))))


SUMMARY Example of writing a Riemann config for Satellite that auto-administers for high swap utilization and sends an alert via PagerDuty if cluster availability falls below a threshold.


SUMMARY Every master should see every message. We ensure that, generally, each event leads to one and only one action by having the leader be the stream processor. Any state changes (hosts that are turned on/off and the reasons the transition happened) are written to Zookeeper by the leader. These changes are read by the follower masters, so they are up to date during a Mesos failover.


SUMMARY Satellite communicates to Mesos through the whitelist file
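A minimal sketch of that hand-off, assuming the whitelist is a plain text file with one hostname per line and that only hosts currently turned on should appear in it. The function names, the `host-states` map, and the file path are hypothetical; the source does not show how Satellite actually writes the file.

```clojure
(require '[clojure.string :as str])

(defn whitelist-content
  "Render the hosts whose state is :on as one hostname per line,
  sorted so the output is stable between writes."
  [host-states]
  (->> host-states
       (filter (fn [[_ state]] (= :on state)))
       (map first)
       sort
       (str/join "\n")))

(defn write-whitelist!
  "Write the rendered whitelist to path, ending with a newline."
  [path host-states]
  (spit path (str (whitelist-content host-states) "\n")))

;; Made-up host states: "b" has been turned off and is left out.
(def host-states {"c" :on "b" :off "a" :on})
```

A production version would likely write to a temporary file and rename it into place, so that a reader never sees a half-written list.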


SUMMARY Future work includes

  • More recipes
  • Improving the web UI
  • Being able to hot reload configs through a SIGHUP

SUMMARY

  • In production for almost a year at Two Sigma
  • Manages multiple clusters and thousands of non-commodity hosts

When I deployed Satellite the first time, 20% of the cluster was turned off immediately. At first I thought I had done something wrong, but Satellite was doing exactly what it should. In fact, we had a lot of hosts whose swap was completely utilized. It turns out there were a number of jobs that were stuck and had been thrown into swap; they had been sitting there for weeks. What’s worse is that this was cutting into the quota of the users who had originally launched those jobs. This fast failure doesn’t extinguish any fires, but rather forces you to pay attention when things are truly awry.
