slide-1
SLIDE 1

Autopilot: workload autoscaling at Google

Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski, Przemysław Zych, Przemek Broniek, Jarek Kuśmierek, Paweł Nowak, Beata Strack, Piotr Witusowski, Steven Hand, John Wilkes (Google)

EuroSys 2020, April 2020

slide-2
SLIDE 2

Proprietary + Confidential

Google runs in containers

In any given week, we launch over two billion containers across Google.

slide-3
SLIDE 3

Resource limits are crucial to isolate workloads

container limit: max amount of CPU/mem a container can use

container usage: CPU/mem used

container slack: CPU/mem wasted

slide-4
SLIDE 4

Borg, our scheduler, packs containers to machines by resource limits.

[Figure: machines. Image source: http://dx.doi.org/10.1145/2741948.2741964 (Verma et al., EuroSys’15)]

slide-5
SLIDE 5

Limits are fine-grained: CPU in milli-cores, memory in bytes.

Source: http://dx.doi.org/10.1145/2741948.2741964 [Verma et al., EuroSys’15]

slide-6
SLIDE 6

We pack containers to machines by limits. So, precise limits are crucial for efficiency and reliability.

[Figure: containers A, B and C placed on a machine; labels: limit, machine capacity]
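Why limit precision matters for packing can be illustrated with a toy example. The sketch below uses plain first-fit bin packing, which is a stand-in chosen for brevity and not Borg's actual placement algorithm; the function name `first_fit`, the capacity of 8 and all limit values are hypothetical.

```python
def first_fit(limits, machine_capacity):
    """Pack container limits onto machines with first-fit; return per-machine lists."""
    machines = []  # each entry is the list of limits placed on one machine
    for limit in limits:
        if limit > machine_capacity:
            raise ValueError("a container limit exceeds the machine capacity")
        for placed in machines:
            if sum(placed) + limit <= machine_capacity:
                placed.append(limit)
                break
        else:
            # No existing machine has room, so open a new one.
            machines.append([limit])
    return machines


# Hypothetical values: the same six containers, once with limits close to
# their usage and once with over-generous limits.
precise = [3, 3, 2, 2, 3, 3]
padded = [5, 5, 4, 4, 5, 5]

machines_precise = len(first_fit(precise, machine_capacity=8))
machines_padded = len(first_fit(padded, machine_capacity=8))
print(machines_precise, machines_padded)
```

Because the packer only sees limits, the padded limits spread the same workload over more machines even though actual usage is unchanged.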

slide-7
SLIDE 7

We pack containers to machines by limits. So, precise limits are crucial for efficiency and reliability.

[Figure: containers A, B and C placed on a machine; labels: limit, usage, machine capacity. Annotation: precise limits, good!]

slide-8
SLIDE 8

We pack containers to machines by limits. So, precise limits are crucial for efficiency and reliability.

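[Figure: containers A, B and C placed on a machine; labels: limit, usage, requested usage, container limit, machine capacity. Annotations:]

  • precise limits: good!

  • limit above usage: resources are wasted (underallocated machine): bad!

  • requested usage above the container limit: out-of-resources crash: bad!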

slide-9
SLIDE 9

Autopilot acts as a controller for Borg limits.

Autopilot continuously adjusts resource limits: CPU/mem limits for containers (vertical scaling) and the number of replicas (horizontal scaling).

[Figure labels: container limits, container counts; container limits, start/stop containers]

slide-10
SLIDE 10

Autopilot Recommenders

slide-11
SLIDE 11

Moving window recommenders

  • Exponentially-decaying samples (half-life of 48 hours)

  • Compute statistics over the samples, e.g. 95%ile

  • Add a safety margin

[Figure: resources over time; labels: usage, 98%ile]

slide-12
SLIDE 12

Moving window recommenders

  • Exponentially-decaying samples (half-life of 48 hours)

  • Compute statistics over the samples, e.g. 95%ile

  • Add a safety margin

[Figure: resources over time; labels: usage, exponential decay, 98%ile]

slide-13
SLIDE 13

Moving window recommenders

  • Exponentially-decaying samples (half-life of 48 hours)

  • Compute statistics over the samples, e.g. 95%ile

  • Add a safety margin

[Figure: resources over time; labels: usage, exponential decay, safety margin, limit]
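The three steps of the moving window recommender can be sketched as follows. This is a minimal illustration, not Autopilot's implementation: the 48-hour half-life is from the slide, while the function names, the weighted-percentile definition and the multiplicative 10% safety margin are assumptions made for the example.

```python
HALF_LIFE_HOURS = 48.0  # half-life stated on the slide


def decay_weight(age_hours, half_life_hours=HALF_LIFE_HOURS):
    """Weight of a sample that is age_hours old; it halves every half-life."""
    return 0.5 ** (age_hours / half_life_hours)


def weighted_percentile(values, weights, q):
    """Smallest value whose cumulative weight reaches fraction q of the total."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= q * total:
            return value
    return pairs[-1][0]  # guard against floating-point shortfall


def recommend_limit(samples, q=0.95, safety_margin=0.10):
    """samples: list of (age_hours, usage) pairs.

    Returns the decay-weighted q-percentile of usage, scaled up by the
    safety margin (assumed multiplicative here).
    """
    if not samples:
        raise ValueError("need at least one usage sample")
    ages, usages = zip(*samples)
    weights = [decay_weight(age) for age in ages]
    return weighted_percentile(usages, weights, q) * (1.0 + safety_margin)


# Hypothetical usage history: a recent sample and one from 48 hours ago.
history = [(0, 10.0), (48, 100.0)]
print(recommend_limit(history, q=0.5))   # dominated by the recent sample
print(recommend_limit(history, q=0.95))  # the old peak still counts at a high percentile
```

Older samples never disappear abruptly; they just count for less, so a peak from two days ago has half the weight of one seen now.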

slide-14
SLIDE 14

Machine learning recommenders

  • Each model is an arg-max algorithm picking a limit value.

  • Each model is parametrized by the decay rate and the safety margin.

  • The recommender picks the model performing the best over a longer time period.

[Figure labels: decay rate, limit; model 1, model 2, ..., model n]
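One way to picture "many parametrized models, pick the best over a longer period" is the sketch below. Everything beyond the two parameters named on the slide is an assumption: the `Model` class, the decayed-running-peak rule inside `run_model`, and the cost function (slack when the limit holds, a heavier penalty when usage exceeds it) are hypothetical stand-ins, and the selection is written as minimising that cost.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Model:
    decay_rate: float     # fraction of the running estimate kept each step (0..1)
    safety_margin: float  # fraction added on top of the estimate


def run_model(model, usage):
    """Return the limit the model would have set before seeing each sample."""
    limits = []
    estimate = usage[0]
    for sample in usage:
        limits.append(estimate * (1.0 + model.safety_margin))
        # Decayed running peak: old peaks fade, new peaks register at once.
        estimate = max(sample, model.decay_rate * estimate)
    return limits


def cost(limits, usage, overrun_penalty=10.0):
    """Hypothetical cost: slack when the limit holds, a penalty when it does not."""
    total = 0.0
    for limit, sample in zip(limits, usage):
        if sample > limit:
            total += overrun_penalty * (sample - limit)
        else:
            total += limit - sample
    return total


def pick_best_model(models, usage):
    """Choose the model with the lowest cost over the whole (longer) trace."""
    return min(models, key=lambda m: cost(run_model(m, usage), usage))


# Hypothetical grid of models and a short usage trace with occasional peaks.
models = [Model(d, s) for d, s in product((0.5, 0.9, 0.99), (0.05, 0.15, 0.5))]
usage_trace = [4.0, 4.2, 3.9, 6.0, 4.1, 4.0, 5.8, 4.2, 4.1, 6.1]
best = pick_best_model(models, usage_trace)
print(best)
```

Scoring over the whole trace rather than the latest sample keeps the choice of model from flapping on a single spike.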

slide-15
SLIDE 15

Evaluation: observational study of production jobs

Focus on memory

slide-16
SLIDE 16

Autopilot efficiency - reduction of slack

slack(t) = limit(t) - usage(t)

absolute slack: ∫ slack(t) dt = ∫ limit(t) dt - ∫ usage(t) dt

unit: capacity of a single (largish) machine

relative slack: (av_limit - 95%ile usage) / av_limit

av_limit: average limit during the day

95%ile usage: 95th percentile of usage during the day
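The two slack metrics can be computed from sampled series as in the sketch below. It assumes equally spaced samples over one day, values already expressed as fractions of one machine's capacity, and a nearest-rank percentile; the function names are hypothetical.

```python
import math


def absolute_slack(limit, usage, dt=1.0):
    """Riemann-sum approximation of ∫ limit(t) dt - ∫ usage(t) dt."""
    return sum(l - u for l, u in zip(limit, usage)) * dt


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least fraction q of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def relative_slack(limit, usage):
    """(av_limit - 95%ile usage) / av_limit for one day of samples."""
    av_limit = sum(limit) / len(limit)
    return (av_limit - percentile(usage, 0.95)) / av_limit


# Hypothetical day: a constant limit and four usage samples.
limit = [10.0, 10.0, 10.0, 10.0]
usage = [5.0, 6.0, 7.0, 8.0]
print(absolute_slack(limit, usage))
print(relative_slack(limit, usage))
```

Relative slack compares the limit with near-peak usage, so a job is not penalised for headroom it genuinely needs at its daily peak.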

slide-17
SLIDE 17

Autopiloted jobs have significantly smaller relative slack.

A random sample of 5000 jobs in each category.

[Figure: cumulative distribution function of relative slack, (av_limit - 95%ile usage) / av_limit; axis annotated "better" and "worse"]

slide-18
SLIDE 18

Autopiloted jobs save significant capacity.

A random sample of 5000 jobs in each category.

[Figure: cumulative distribution function of absolute slack [machines]; axis annotated "better" and "worse"]

slide-19
SLIDE 19

When jobs migrate to Autopilot, their slack is significantly reduced.

A random sample of 500 jobs that migrated to Autopilot in a certain month, m0. CDFs for slack for 2 months before and after migration.

[Figure: cumulative distribution function of the reduction of relative slack; axis annotated "better" and "worse"]

slide-20
SLIDE 20

Autopilot reliability: how frequent are out-of-memory errors?

We count terminations of containers. We weight the number of terminations by the average number of containers of a job.
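A minimal sketch of that weighting, assuming "weight" means dividing a job's termination count by its average number of containers (the slide does not spell out the formula); the function name and the example figures are hypothetical.

```python
def relative_ooms(oom_terminations, avg_containers):
    """Terminations per container, using the job's average container count."""
    if avg_containers <= 0:
        raise ValueError("avg_containers must be positive")
    return oom_terminations / avg_containers


# Hypothetical jobs with the same raw count but very different sizes.
small_job = relative_ooms(5, 10)
large_job = relative_ooms(5, 1000)
print(small_job, large_job)
```

Normalising this way keeps a large job from looking unreliable merely because it runs many containers.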

[Figure: out-of-resources crash; labels: requested usage, container limit, machine capacity]

slide-21
SLIDE 21

Autopilot reduces the frequency of out-of-memory events.

OOMs are rare: 99.5% of autopiloted jobs have no OOMs.

[Figure: cumulative distribution function; axis annotated "better" and "worse"]

slide-22
SLIDE 22

DevOps: Autopiloted jobs account for over 48% of Google’s fleet-wide resource usage.

slide-23
SLIDE 23

Autopilot’s dynamic limits could help to keep the job running despite bugs.

slide-24
SLIDE 24

1. Efficient scheduling requires fine-grained control of jobs’ limits.

2. Humans are bad at setting the limits precisely.

3. Autopilot uses past usage to drive future limits.

4. Autopilot reduces relative slack by 2x ...and it reduces the number of jobs severely impacted by OOMs 10x.

Autopilot: workload autoscaling at Google