Serenity MESOS OVERSUBSCRIPTION MODULE Szymon Konefa SOFTWARE - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Serenity

MESOS OVERSUBSCRIPTION MODULE

slide-2
SLIDE 2

Szymon Konefał

SOFTWARE ENGINEER

INTEL CORPORATION

slide-3
SLIDE 3

Agenda

• Oversubscription Basics
• Oversubscription in Mesos
• Serenity Architecture
• Next steps for Serenity & Mesos

slide-4
SLIDE 4

Oversubscription Basics

OVERSUBSCRIPTION FROM MESOS PERSPECTIVE

slide-5
SLIDE 5

Oversubscription Basics

• Recycle resources that are reserved but unused
• Spin up revocable ("best effort", BE) tasks on them
• Throttle or revoke BE tasks when a production task needs more resources (Quality of Service)
• Goal: increase overall data center utilization
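The "reserved but unused" idea can be put into numbers with a minimal sketch. The function name `slackCpus` and its parameters are hypothetical and are not part of Mesos or Serenity; the sketch only assumes that slack is the gap between what production tasks were allocated and what they currently use.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not a Mesos or Serenity API): CPUs that production
// tasks hold but do not currently use. Both arguments are in CPU cores.
double slackCpus(double allocated, double used) {
  assert(allocated >= 0.0 && used >= 0.0);
  // Measured usage can briefly exceed the allocation; never report
  // negative slack in that case.
  return std::max(0.0, allocated - used);
}
```

For example, a production task allocated 8 CPUs and using 3 would leave 5 CPUs that could be offered to best-effort tasks.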

slide-6
SLIDE 6

Oversubscription Basics

RESOURCE ESTIMATOR & BEST EFFORT TASKS

• Exposes slack resources to the Mesos agent, which passes them to the allocator
• The allocator offers slack resources to frameworks
• Frameworks which are registered as consumers of oversubscribed resources can reserve them
• Jobs running on slack resources are considered "revocable"

slide-7
SLIDE 7

Oversubscription Basics

QUALITY OF SERVICE & TASK THROTTLING AND REVOCATION

• Throttle best-effort tasks when a production task needs more of its isolated compressible resource, e.g. CPU time
• Revoke best-effort tasks when a production task needs more of a shared or non-compressible resource
• Competition for a shared resource is considered a "noisy neighbour" situation
• Shared resource examples: L3 CPU cache*, memory bandwidth

* Actually, you can isolate that using Intel Cache Allocation Technology ;-)
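The throttle-versus-revoke rule above can be sketched as a small decision function. The enum and function names are illustrative assumptions, not Mesos or Serenity types.

```cpp
// Hypothetical classification of the resource a production task is
// short of; these names are illustrative only.
enum class ResourceKind {
  IsolatedCompressible,  // e.g. CPU time
  NonCompressible,       // e.g. memory
  Shared                 // e.g. L3 cache, memory bandwidth
};

enum class Action { Throttle, Revoke };

// Throttling is enough only when the contended resource is both
// isolated and compressible; otherwise the best-effort task has to go.
Action correctionFor(ResourceKind contended) {
  return contended == ResourceKind::IsolatedCompressible ? Action::Throttle
                                                         : Action::Revoke;
}
```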

slide-8
SLIDE 8

Oversubscription Modules

POWERED BY YOU

slide-9
SLIDE 9

Mesos Oversubscription API

• Introduced in Mesos 0.23.0
• Defines the Resource Estimator and the Quality of Service controller
• Mesos ships with a fixed RE and a stubbed QoS controller
• You are expected to provide your own modules if you want to use oversubscription features

slide-10
SLIDE 10

Mesos Oversubscription API

RESOURCE ESTIMATOR

class ResourceEstimator
{
public:
  virtual Try<Nothing> initialize(
      const lambda::function<process::Future<ResourceUsage>()>& usage) = 0;

  virtual process::Future<Resources> oversubscribable() = 0;
};
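To illustrate the shape of this interface without pulling in Mesos headers, here is a simplified sketch of a "fixed" estimator. Everything in it is a stand-in: `Resources`, `ResourceUsage`, and `SimpleEstimator` are hypothetical types, and the methods are synchronous, whereas the real interface returns `Try<Nothing>` and `process::Future<Resources>`.

```cpp
#include <functional>
#include <utility>

// Stand-in types for illustration; the real ones come from Mesos.
struct Resources { double cpus; };
struct ResourceUsage { double allocatedCpus; double usedCpus; };

// Simplified, synchronous version of the estimator interface (assumption:
// the real API is asynchronous and uses libprocess futures).
class SimpleEstimator {
public:
  virtual ~SimpleEstimator() = default;
  virtual void initialize(std::function<ResourceUsage()> usage) = 0;
  virtual Resources oversubscribable() = 0;
};

// Always advertises the same amount of slack, regardless of actual usage.
class FixedEstimator : public SimpleEstimator {
public:
  explicit FixedEstimator(double cpus) : fixed_{cpus} {}

  void initialize(std::function<ResourceUsage()> usage) override {
    // Kept only to mirror the interface; a fixed estimator never calls it.
    usage_ = std::move(usage);
  }

  Resources oversubscribable() override { return fixed_; }

private:
  Resources fixed_;
  std::function<ResourceUsage()> usage_;
};
```

A usage-aware estimator would call the stored `usage_` callback inside `oversubscribable()` and derive the slack from the returned statistics.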

slide-11
SLIDE 11

Mesos Oversubscription API

QOS CONTROLLER

class QoSController
{
public:
  virtual Try<Nothing> initialize(
      const lambda::function<process::Future<ResourceUsage>()>& usage) = 0;

  virtual process::Future<std::list<QoSCorrection>> corrections() = 0;
};

slide-12
SLIDE 12

Mesos Oversubscription API

FRAMEWORK

• A framework needs to register with the REVOCABLE_RESOURCES capability set

slide-13
SLIDE 13

Serenity Architecture

POWER OVERWHELMING

slide-14
SLIDE 14

Serenity Architecture

• Flexible solution with interchangeable components
• Estimation and correction are done in a pipeline approach
• Filters inside pipelines smooth, shape and transform the input
• Open source on GitHub: https://github.com/mesosphere/serenity
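The pipeline idea can be sketched as a chain of interchangeable stages. This is a generic illustration and not Serenity's actual class design: `Filter` and `runPipeline` are hypothetical names, and each stage is assumed to transform a single numeric sample.

```cpp
#include <functional>
#include <vector>

// Hypothetical stage type: takes one sample and returns the transformed one.
using Filter = std::function<double(double)>;

// Feed the input through every stage in order; swapping, adding, or
// removing a stage does not affect the others.
double runPipeline(const std::vector<Filter>& filters, double input) {
  double value = input;
  for (const auto& filter : filters) {
    value = filter(value);
  }
  return value;
}
```

A pipeline built from a clamping stage followed by a scaling stage, for instance, would first cut negative samples to zero and then scale the result.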

slide-15
SLIDE 15

Serenity Architecture

• A pipeline can consist of different components:
  • Input smoothing: Exponential Moving Average filter
  • Input shaping: PR-executor pass filter, Ignore new executors
  • Interference signal indicator: Changepoint detector
  • Flow control: Valve filter, Utilization threshold
  • Slack Resource Estimator – estimates slack
  • QoS Controller – decides which BE tasks need to be revoked
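As an example of the input-smoothing stage, here is a minimal sketch of an exponential moving average. The class name `EmaFilter` and the parameter `alpha` are assumptions for illustration; Serenity's own filter may be structured differently.

```cpp
#include <cassert>

// Exponential moving average: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.
// A larger alpha reacts faster to new samples; a smaller one smooths more.
class EmaFilter {
public:
  explicit EmaFilter(double alpha) : alpha_(alpha) {
    assert(alpha > 0.0 && alpha <= 1.0);
  }

  // Returns the smoothed value after taking the new sample into account.
  double update(double sample) {
    if (!initialized_) {
      // The first sample seeds the average.
      value_ = sample;
      initialized_ = true;
    } else {
      value_ = alpha_ * sample + (1.0 - alpha_) * value_;
    }
    return value_;
  }

private:
  double alpha_;
  double value_ = 0.0;
  bool initialized_ = false;
};
```

With `alpha = 0.5`, samples of 10 and then 20 yield smoothed values of 10 and then 15, so a single spike in the usage statistics moves the estimate only halfway.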

slide-16
SLIDE 16

Resource Estimator Pipeline

slide-17
SLIDE 17

Serenity Quality of Service

• We look at the HW performance counters of production tasks to identify a noisy neighbour situation
• The QoS Controller revokes BE tasks until the HW counters return to their previous values
• To make the environment more stable during resource contention, the QoS Controller sends a StopOversubscription message to the RE Valve filter
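The interaction between contention detection and the valve can be sketched as follows. This is a simplified illustration under stated assumptions: the counter is treated as one number where higher is better (for example instructions per cycle), detection is a plain relative-drop threshold rather than a real changepoint detector, and `contentionDetected` and `Valve` are hypothetical names.

```cpp
#include <cassert>

// Assumed detection rule: contention is flagged when the counter has
// dropped by more than `maxDrop` (a fraction) relative to its baseline.
bool contentionDetected(double baseline, double current, double maxDrop) {
  assert(baseline > 0.0 && maxDrop >= 0.0);
  return (baseline - current) / baseline > maxDrop;
}

// Hypothetical valve in the estimator pipeline: while closed, no slack
// is advertised, so no new best-effort tasks are started.
class Valve {
public:
  void stopOversubscription() { open_ = false; }
  void resumeOversubscription() { open_ = true; }

  // Passes the slack estimate through when open, zero when closed.
  double pass(double slackCpus) const { return open_ ? slackCpus : 0.0; }

private:
  bool open_ = true;
};
```

A controller loop built on these pieces would close the valve as soon as contention is detected, revoke best-effort tasks until the counter recovers, and only then reopen the valve.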

slide-18
SLIDE 18

Serenity & Mesos Future

IN A WORLD OF MAGNETS AND MIRACLES THERE'S A HUNGER STILL UNSATISFIED

slide-19
SLIDE 19

Next steps for Serenity

• Make QoS algorithms more sophisticated
• Expose noisy neighbour situations as a hint for schedulers
• Cluster-level Serenity?
• Pipelines drawn & configured in a simple config file
• Integrate with Application Performance Metrics

slide-20
SLIDE 20

Mesos Environment

• Enable oversubscription features in frameworks
• Enable CPU Set isolator
• Enable Cache Partitioning isolator

slide-21
SLIDE 21

What’s left to answer in Mesos?

• How to fully isolate BE tasks and latency-critical tasks at the CPU level?
• What does it mean when a BE task has "4 cpus"?
• How to signal to a framework that the performance of its tasks is affected?
• What to do with BE jobs when a PR job finishes its work?

slide-22
SLIDE 22

Application Performance Metrics

THE NEXT BIG THING

slide-23
SLIDE 23

Application Performance Metrics

• Let frameworks report their Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
• Report global and local cluster performance
• Support in identifying noisy neighbour situations
• Still in design exploration
• Design docs: http://bit.ly/MesosAPM

slide-24
SLIDE 24

https://github.com/mesosphere/serenity