

SLIDE 1

Looking for the perfect VM scheduler

Fabien Hermenier — placing rectangles since 2006
fabien.hermenier@nutanix.com · @fhermeni · https://fhermeni.github.io

SLIDE 2

2006 - 2010: PhD - Postdoc. Thesis: "Gestion dynamique des tâches dans les grappes, une approche à base de machines virtuelles" (dynamic task management in clusters, a virtual-machine-based approach).
2011: Postdoc. "How to design a better testbed: Lessons from a decade of network experiments".
2011 - 2016: Associate professor. VM scheduling, green computing.
SLIDE 3

VM scheduling, resource management. Virtualization. Enterprise cloud company: "Going beyond hyperconverged infrastructures".
SLIDE 4

Inside a private cloud

SLIDE 5

Clusters

from 2 to x physical servers
isolated applications: virtual machines, containers
storage layer: SAN-based (converged infrastructure) or shared over the nodes (hyper-converged infrastructure)

SLIDE 6

The VM scheduler consumes monitoring data and the VM queue, reasons over a model, and produces decisions that actuators apply.

SLIDE 7

VM scheduling: find a server for every VM to run on,

such that:
- the hardware is compatible
- there is enough pCPU, enough RAM, enough storage, enough of whatever else

while:
- minimizing or maximizing something
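The "such that" part is a per-server capacity check. A minimal sketch of that feasibility test, with hypothetical Vm and Server records and a placement array mapping each VM to a server index (illustration only, not BtrPlace code):

import java.util.List;

record Vm(int cpu, int ram, int disk) {}
record Server(int cpu, int ram, int disk) {}

class PlacementCheck {
    /** @param placement maps each VM to the index of its host server */
    static boolean isValid(List<Vm> vms, List<Server> servers, int[] placement) {
        int n = servers.size();
        int[] cpu = new int[n], ram = new int[n], disk = new int[n];
        // accumulate the demand each server receives
        for (int v = 0; v < vms.size(); v++) {
            int s = placement[v];
            cpu[s] += vms.get(v).cpu();
            ram[s] += vms.get(v).ram();
            disk[s] += vms.get(v).disk();
        }
        // every server must stay within its capacity, in every dimension
        for (int s = 0; s < n; s++) {
            if (cpu[s] > servers.get(s).cpu()
                    || ram[s] > servers.get(s).ram()
                    || disk[s] > servers.get(s).disk()) {
                return false;
            }
        }
        return true;
    }
}

The scheduler then searches, among all valid placements, for one that minimizes or maximizes the chosen objective.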

SLIDE 8

A good VM scheduler provides: bigger business value, same infrastructure.

SLIDE 9

A good VM scheduler provides: same business value, smaller infrastructure.

SLIDE 10

KEEP CALM AND CONSOLIDATE AS HELL

VDI workload: 12+ vCPU / 1 pCPU, 100+ VMs / server.

SLIDE 11

static schedulers: consider the VM queue; deployed everywhere [1,2,3,4]; fragmentation issues

dynamic schedulers: live-migrations [5] to address fragmentation; costly (storage, migration latency); thousands of articles [10-13]; over-hyped? [9]; yet used in private clouds [6,7,8] (steady workloads?)
SLIDE 12

Placement constraints

- dimension various concerns: performance, security, power efficiency, legal agreements, high-availability, fault-tolerance, …
- manipulated concepts: state, placement, resource allocation, action schedule, counters, etc.
- enforcement level: hard or soft
- spatial or temporal
SLIDE 13

discrete constraints: spread(VM[1,2]); ban(VM1, N1); ban(VM2, N2). A "simple" spatial problem [15].

continuous constraints: >>spread(VM[1,2]); ban(VM1, N1); ban(VM2, N2). A harder scheduling problem (think about how actions interleave).

SLIDE 14

hard constraints: must be satisfied; an all-or-nothing approach; satisfiable or not; not always meaningful. E.g. spread(VM[1..50])

soft constraints: internal or external penalty model; harder to implement and scale; hard to standardise? E.g. mostlySpread(VM[1..50], 4, 6) [6]

SLIDE 15

High-availability

x-FT: VMs must survive any crash of x nodes (0-FT, 1-FT, …, x-FT)

exact approach: solve n placement problems [17]
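To make the property concrete, here is a brute-force sketch of the x-FT check in a single resource dimension: enumerate every set of x crashed nodes and try to repack the displaced VMs on the survivors. A greedy first-fit repack stands in for the exact per-scenario solve of [17], so it may wrongly reject a placement an exact solver would accept; all names are hypothetical.

import java.util.*;

class FtCheck {

    static boolean isXFt(int[] vmLoad, int[] nodeCap, int[] placement, int x) {
        for (Set<Integer> crashed : crashSets(nodeCap.length, x)) {
            int[] free = nodeCap.clone();
            List<Integer> displaced = new ArrayList<>();
            for (int v = 0; v < placement.length; v++) {
                if (crashed.contains(placement[v])) displaced.add(v);
                else free[placement[v]] -= vmLoad[v];
            }
            for (int v : displaced) { // greedy repack on the surviving nodes
                int host = -1;
                for (int n = 0; n < free.length && host < 0; n++) {
                    if (!crashed.contains(n) && free[n] >= vmLoad[v]) host = n;
                }
                if (host < 0) return false; // this crash scenario is fatal
                free[host] -= vmLoad[v];
            }
        }
        return true;
    }

    // all subsets of {0..n-1} of size x
    static List<Set<Integer>> crashSets(int n, int x) {
        List<Set<Integer>> out = new ArrayList<>();
        build(0, n, x, new ArrayDeque<>(), out);
        return out;
    }

    private static void build(int from, int n, int left, Deque<Integer> cur,
                              List<Set<Integer>> out) {
        if (left == 0) {
            out.add(new HashSet<>(cur));
            return;
        }
        for (int i = from; i <= n - left; i++) {
            cur.push(i);
            build(i + 1, n, left - 1, cur, out);
            cur.pop();
        }
    }
}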
SLIDE 16

The VMware DRS way

cluster-based, slot-based: take the x biggest nodes out and check the remaining free slots. Simple and scalable, but wasteful with heterogeneous VMs.
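A sketch of that slot-based estimate, in a single dimension and with hypothetical names: the slot is sized after the largest VM reservation, the x biggest nodes are written off, and the remaining free slots are counted. Sizing every slot after the biggest VM is exactly why heterogeneous VMs waste capacity.

import java.util.Arrays;

class SlotBasedHa {

    static boolean admits(int[] nodeCap, int[] vmReservation, int x) {
        int slot = Arrays.stream(vmReservation).max().orElse(1); // slot = biggest VM
        int[] caps = nodeCap.clone();
        Arrays.sort(caps); // ascending: the x biggest nodes sit at the tail
        int slots = 0;
        for (int i = 0; i < caps.length - x; i++) {
            slots += caps[i] / slot; // free slots on the surviving nodes
        }
        return slots >= vmReservation.length;
    }
}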

SLIDE 17

The constraint catalog evolves

- Dynamic Power Management (DRS 3.1): 2009?
- VM-VM affinity (DRS): 2010?
- Dedicated instances (EC2): mar. 2011
- VM-host affinity (DRS 4.1): apr. 2011
- MaxVMsPerServer (DRS 5.1): sep. 2012
- the constraint needed in 2014: 2016

SLIDE 18

the objective

provider side: min(x) or max(x)

SLIDE 19

atomic objectives

min(penalties), min(Total Cost of Ownership), min(unbalance)

SLIDE 20

composite objectives, using weights: min(αx + βy), e.g. min(α·TCO + β·VIOLATIONS)

useful to model something you don't understand? How to estimate the coefficients?

€ as a common quantifier: max(REVENUES)

SLIDE 21

Optimize or satisfy?

satisfy: threshold-based; composable; verifiable; demands domain-specific expertise
optimize: min(…) or max(…); composable through weighting magic; hardly provable; easy to say

SLIDE 22

Acropolis Dynamic Scheduler [18]: hotspot mitigation

- Trigger: thresholds (85%), affinity constraints, resource demand (from machine learning)
- Maintain / Minimize: Σ mig. cost, CPU and storage-CPU hotspots
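As a rough illustration only (this is not the ADS implementation, and every name below is made up), a threshold-based trigger paired with a migration-cost objective could look like:

import java.util.*;

// Hypothetical sketch of a threshold-based trigger: detect a CPU hotspot,
// then pick the candidate reconfiguration plan with the lowest Σ mig. cost.
class HotspotMitigation {
    static final double CPU_THRESHOLD = 0.85; // the 85% threshold of the slide

    record Plan(List<String> migrations, int cost) {}

    static Optional<Plan> mitigate(double[] nodeCpuUsage, List<Plan> candidates) {
        boolean hotspot = Arrays.stream(nodeCpuUsage).anyMatch(u -> u > CPU_THRESHOLD);
        if (!hotspot) {
            return Optional.empty(); // nothing to mitigate
        }
        return candidates.stream().min(Comparator.comparingInt(Plan::cost));
    }
}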
SLIDE 23

adapt the VM placement depending on pluggable expectations

network and memory-aware migration scheduler, VM-(VM|PM) affinities, resource matchmaking, node state manipulation, counter based restrictions, energy efficiency, discrete or continuous restrictions

SLIDE 24

BtrPlace: interaction through a DSL, an API, or JSON messages

spread(VM[2..3]); preserve(VM1, 'cpu', 3); offline(@N4);

The reconfiguration plan:

0'00 to 0'02: relocate(VM2, N2)
0'00 to 0'04: relocate(VM6, N2)
0'02 to 0'05: relocate(VM4, N1)
0'04 to 0'08: shutdown(N4)
0'05 to 0'06: allocate(VM1, 'cpu', 3)
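The Java API equivalent of that script, adapted from the public BtrPlace tutorials (package names and signatures may differ slightly across versions):

import java.util.*;
import org.btrplace.model.*;
import org.btrplace.model.constraint.*;
import org.btrplace.model.view.ShareableResource;
import org.btrplace.plan.ReconfigurationPlan;
import org.btrplace.scheduler.choco.DefaultChocoScheduler;

public class Demo {
    public static void main(String[] args) {
        Model mo = new DefaultModel();
        Node n1 = mo.newNode(), n2 = mo.newNode(), n4 = mo.newNode();
        VM vm1 = mo.newVM(), vm2 = mo.newVM(), vm3 = mo.newVM();

        Mapping map = mo.getMapping();
        map.addOnlineNode(n1);
        map.addOnlineNode(n2);
        map.addOnlineNode(n4);
        map.addRunningVM(vm1, n1);
        map.addRunningVM(vm2, n4);
        map.addRunningVM(vm3, n4);

        // a cpu view: capacity 8 per node, default consumption 1 per VM
        mo.attach(new ShareableResource("cpu", 8, 1));

        List<SatConstraint> cstrs = new ArrayList<>();
        cstrs.add(new Spread(new HashSet<>(Arrays.asList(vm2, vm3)))); // spread(VM[2..3])
        cstrs.add(new Preserve(vm1, "cpu", 3));                        // preserve(VM1,'cpu',3)
        cstrs.add(new Offline(n4));                                    // offline(@N4)

        ReconfigurationPlan plan = new DefaultChocoScheduler().solve(mo, cstrs);
        System.out.println(plan); // the action schedule, as above
    }
}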
SLIDE 25

An open-source Java library for constraint programming

the right model for the right problem; deterministic composition; high-level constraints

SLIDE 26

BtrPlace core CSP

models a reconfiguration plan; 1 transition model per element; action durations as constants*

D(v) ∈ ℕ, st(v) ∈ [0, H − D(v)], ed(v) = st(v) + D(v), d(v) = ed(v) − st(v), d(v) = D(v), ed(v) < H, d(v) < H, h(v) ∈ {0, …, |N| − 1}

transition models: boot(v ∈ V), relocatable(v ∈ V), shutdown(v ∈ V), suspend(v ∈ V), resume(v ∈ V), kill(v ∈ V), bootable(n ∈ N), haltable(n ∈ N)
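Those relations translate almost literally into solver variables. A minimal sketch in plain Choco (the constants H and D and the variable names are illustrative, not BtrPlace's actual encoding):

import org.chocosolver.solver.Model;
import org.chocosolver.solver.variables.IntVar;

public class RelocatableSketch {
    public static void main(String[] args) {
        int H = 8, D = 3, nbNodes = 4;
        Model m = new Model("relocatable(v)");
        IntVar st = m.intVar("st", 0, H - D);     // st(v) ∈ [0, H − D(v)]
        IntVar ed = m.intVar("ed", D, H);         // end within the horizon
        m.arithm(ed, "=", st, "+", D).post();     // ed(v) = st(v) + D(v)
        IntVar h = m.intVar("h", 0, nbNodes - 1); // h(v) ∈ {0, …, |N| − 1}
        if (m.getSolver().solve()) {
            System.out.println(st + " " + ed + " " + h);
        }
    }
}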

SLIDE 27

Views bring additional concerns: new variables and relations

ShareableResource(r) ::= …
Network() ::= …
Power() ::= …
High-Availability() ::= …

SLIDE 28

Constraints state new relations

SLIDE 29

vector packing problem

items with a finite volume to place inside finite bins; the basis for modelling the infrastructure; 1 dimension = 1 resource; a generalisation of the bin packing problem (e.g. VM1 and VM3 on N1, VM2 and VM4 on N2, over the cpu and mem dimensions)

an NP-hard problem
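A classic heuristic for this problem is first-fit decreasing. A self-contained sketch over the two dimensions of the slide (cpu, mem), with hypothetical arrays demand[vm][dim] and capacity[node][dim]:

import java.util.*;

class VectorPacking {
    /** @return vm -> node index, or null if the heuristic fails */
    static int[] firstFitDecreasing(int[][] demand, int[][] capacity) {
        // sort VMs by their largest demand, biggest first
        Integer[] order = new Integer[demand.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingInt(
                (Integer v) -> Math.max(demand[v][0], demand[v][1])).reversed());

        int[][] free = new int[capacity.length][];
        for (int n = 0; n < capacity.length; n++) free[n] = capacity[n].clone();

        int[] placement = new int[demand.length];
        for (int v : order) {
            int host = -1; // first node that fits the VM in every dimension
            for (int n = 0; n < free.length && host < 0; n++) {
                if (free[n][0] >= demand[v][0] && free[n][1] >= demand[v][1]) host = n;
            }
            if (host < 0) return null; // heuristic failure, not an infeasibility proof
            free[host][0] -= demand[v][0];
            free[host][1] -= demand[v][1];
            placement[v] = host;
        }
        return placement;
    }
}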

SLIDE 30

how to support migrations?

during a migration, resources are temporarily used on both the source and the destination nodes

SLIDE 31

Migrations are costly

[Figure: migration duration (min.) vs. allocated bandwidth (Mbit/s) for the workloads 1000*10K, 1000*100K, and 1000*200K]

SLIDE 32

dynamic schedulers using vector packing [10,12]

4 nodes N1..N4 (cpu and mem dimensions), 6 VMs VM1..VM6; objective: min(#onlineNodes) = 3

SLIDE 36

sol #1: 1m, 1m, 2m

SLIDE 42

dynamic schedulers using vector packing [10,12]: same instance, two solutions
sol #1: 1m, 1m, 2m
sol #2: 1m, 2m 1m (lower MTTR, faster)

SLIDE 43

dynamic scheduling using vector packing [10, 12]

offline(N2) + no CPU sharing (nodes N1..N5 with cpu and mem dimensions; VMs VM1..VM7)

SLIDE 44

Dependency management (nodes N1..N5; VMs VM1..VM7)

SLIDE 46

Dependency management:
1) migrate VM2, migrate VM4, migrate VM5
2) shutdown(N2), migrate VM7

SLIDE 47

coarse-grain staging delays actions

stage 1: mig(VM2), mig(VM4), mig(VM5)
stage 2: off(N2), mig(VM7)

SLIDE 48

[Figure: the plan as a per-node timeline over N1..N5, with the migrations of VM2, VM4 and VM5 preceding off(N2) and the migration of VM7; time marks at 3, 4 and 8]

Resource-Constrained Project Scheduling Problem [14]

SLIDE 49

Resource-Constrained Project Scheduling Problem

- 1 resource per (node x dimension), with a bounded capacity
- tasks model the VM lifecycle: the height models a consumption, the width a duration
- at any moment, the cumulative task consumption on a resource cannot exceed its capacity
- comfortable for expressing continuous optimisation
- an NP-hard problem
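In Choco, the solver underneath BtrPlace, this maps to one cumulative constraint per (node x dimension). A minimal sketch with made-up durations, heights and capacity:

import org.chocosolver.solver.Model;
import org.chocosolver.solver.variables.IntVar;
import org.chocosolver.solver.variables.Task;

public class Rcpsp {
    public static void main(String[] args) {
        int H = 10;            // scheduling horizon
        Model m = new Model("node cpu");
        int[] dur = {2, 3, 4}; // task durations (widths)
        int[] cpu = {2, 3, 3}; // per-task cpu consumption (heights)
        Task[] tasks = new Task[dur.length];
        IntVar[] heights = new IntVar[dur.length];
        for (int i = 0; i < dur.length; i++) {
            IntVar s = m.intVar("s" + i, 0, H - dur[i]);
            IntVar e = m.intVar("e" + i, dur[i], H);
            tasks[i] = new Task(s, m.intVar(dur[i]), e); // enforces e = s + d
            heights[i] = m.intVar(cpu[i]);
        }
        // at any moment, Σ heights of running tasks <= capacity (4)
        m.cumulative(tasks, heights, m.intVar(4)).post();
        if (m.getSolver().solve()) {
            for (Task t : tasks) System.out.println(t);
        }
    }
}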

SLIDE 50

From a theoretical to a practical solution

In practice, durations may be longer than planned, so the timed schedule
0:3 - migrate VM4; 0:3 - migrate VM5; 0:4 - migrate VM2; 3:8 - migrate VM7; 4:8 - shutdown(N2)
is converted to an event-based schedule:
start: migrate VM4, migrate VM5, migrate VM2
!migrate(VM2) & !migrate(VM4): shutdown(N2)
!migrate(VM5): migrate VM7
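The conversion replaces timestamps by completion events: an action fires as soon as its predecessors finish, whatever the real durations turn out to be. A sketch of that execution with CompletableFuture (the plan above, with the actual actions stubbed out):

import java.util.concurrent.CompletableFuture;

public class EventBasedPlan {

    static CompletableFuture<Void> run(String name) {
        return CompletableFuture.runAsync(() -> {
            /* drive the hypervisor here */
            System.out.println(name + " done");
        });
    }

    public static void main(String[] args) {
        CompletableFuture<Void> migVm4 = run("migrate(VM4)");
        CompletableFuture<Void> migVm5 = run("migrate(VM5)");
        CompletableFuture<Void> migVm2 = run("migrate(VM2)");

        // !migrate(VM2) & !migrate(VM4): shutdown(N2)
        CompletableFuture<Void> offN2 = CompletableFuture.allOf(migVm2, migVm4)
                .thenRun(() -> System.out.println("shutdown(N2)"));
        // !migrate(VM5): migrate VM7
        CompletableFuture<Void> migVm7 = migVm5
                .thenRun(() -> System.out.println("migrate(VM7)"));

        CompletableFuture.allOf(offN2, migVm7).join(); // wait for the whole plan
    }
}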

SLIDE 51

Extensibility in practice: looking for a better migration scheduler

BtrPlace vanilla is network- and workload-blind [btrplace vanilla, entropy, cloudsim, …]

[Figure: per-VM migration durations (sec.), VM1..VM8, with BtrPlace vanilla]

SLIDE 52

Extensibility in practice: looking for a better migration scheduler

BtrPlace + a network- and workload-aware migration scheduler [16]

[Figure: per-VM migration durations (sec.), VM1..VM8, with btrplace + the migration scheduler of [16]]

SLIDE 53

Extensibility in practice: solver-side

- Network model: heterogeneous network; cumulative constraints; +/- 300 SLOC
- Migration model: memory- and network-aware; +/- 200 SLOC
- Constraints model: restricts the migration models; +/- 100 SLOC

[Figure: bandwidth over time for VM1..VM3 migrating through a core switch]
SLIDE 54

Nobody's perfect: scaling problems

placement: a vector packing problem; scheduling: a multi-mode resource-constrained project scheduling problem. Both are NP-hard.

1000 VMs over 10 nodes -> 10^1000 possible assignments

exact approaches: optimal but slow; heuristic approaches: fast but approximate

SLIDE 55

the search heuristic

one per objective; guides Choco toward instantiations of interest; at each search node it decides 1. which variable to focus on and 2. which value to try; it does not alter the theoretical problem

.[1/2] relocatable(vm#0).dSlice_hoster = {31}
..[1/2] relocatable(vm#1).dSlice_hoster = {31}
...[1/2] relocatable(vm#2).dSlice_hoster = {31}
....[1/2] relocatable(vm#3).dSlice_hoster = {31}
.....[1/2] relocatable(vm#4).dSlice_hoster = {31}
......[1/2] relocatable(vm#5).dSlice_hoster = {31}
.........[1/2] shutdownableNode(node#3).start = {0}
..........[1/2] shutdownableNode(node#2).start = {0}
...........[1/2] shutdownableNode(node#1).start = {0}
............[1/2] shutdownableNode(node#0).start = {0}
..............[1/2] relocatable(vm#97).cSlice_end = {1}
..................[2/2] relocatable(vm#202).cSlice_end \ {2}
...................[1/2] relocatable(vm#202).cSlice_end = {4}
....................[1/2] relocatable(vm#203).cSlice_end = {2}
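The plumbing for such a heuristic, in plain Choco (not BtrPlace's actual strategies): a search strategy pairs a variable selector with a value selector and leaves the constraints untouched.

import org.chocosolver.solver.Model;
import org.chocosolver.solver.Solver;
import org.chocosolver.solver.search.strategy.Search;
import org.chocosolver.solver.search.strategy.selectors.values.IntDomainMin;
import org.chocosolver.solver.search.strategy.selectors.variables.FirstFail;
import org.chocosolver.solver.variables.IntVar;

public class GuidedSearch {
    public static void main(String[] args) {
        Model m = new Model();
        IntVar[] hosts = m.intVarArray("host", 6, 0, 3); // 6 VMs, 4 nodes
        Solver s = m.getSolver();
        // branch on the most constrained VM first, try the lowest node id first
        s.setSearch(Search.intVarSearch(new FirstFail(m), new IntDomainMin(), hosts));
        if (s.solve()) {
            for (IntVar h : hosts) System.out.println(h);
        }
    }
}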
SLIDE 56

static model analysis 101: manage only the supposedly misplaced VMs. Beware of under-estimations!

spread({VM3,VM2,VM8}); lonely({VM7}); preserve({VM1},'ucpu', 3); offline(@N6);
ban($ALL_VMS,@N8); fence(VM[1..7],@N[1..4]); fence(VM[8..12],@N[5..8]);

scheduler.doRepair(true)

SLIDE 57

independent sub-problems solved in parallel. Beware of resource fragmentation!

spread({VM3,VM2,VM8}); lonely({VM7}); preserve({VM1},'ucpu', 3); offline(@N6);
ban($ALL_VMS,@N8); fence(VM[1..7],@N[1..4]); fence(VM[8..12],@N[5..8]);

s.setInstanceSolver(new StaticPartitioning())

SLIDE 58

Repair benefits; partitioning benefits (2013 perf numbers; /!\ non-Nutanix workloads)

[Figures: solving time (sec) vs. virtual machines (x 1,000) for LI and NR, with and without repair/filtering; and solving time (sec) vs. partition size (servers, 1000 to 5000) for the same variants]
SLIDE 59

Master the problem

understand the workload, tune the model, tune the solver, tune the heuristics

(benchmarked on my laptop; /!\ non-Nutanix workloads)
SLIDE 60

"current" performance (/!\ non-Nutanix workloads)

[Figure: solving time (sec) vs. virtual machines (x 1,000) for the li and nr kinds, on Xeon servers]
SLIDE 61

RECAP
SLIDE 62

The VM scheduler makes cloud benefits real
SLIDE 63

think about what is costly
SLIDE 64

static scheduling for a peaceful life
SLIDE 65

dynamic scheduling to seize the day
SLIDE 66

no holy grail
SLIDE 67

master the problem
SLIDE 68

with great power comes great responsibility
SLIDE 69

BtrPlace: http://btrplace.org

production ready: live demo, stable user API, documented tutorials, issue tracker, support chat room

SLIDE 70

WE WANT YOU

Member of Technical Staff (once graduated): San Jose, California
Postdoc (2 yrs.), "Efficiently connecting CLOUD & EDGE", resource management in edge computing: Sophia, France
SLIDE 71

References

1. Omega: flexible, scalable schedulers for large compute clusters. EuroSys '13
2. Sparrow: distributed, low-latency scheduling. SOSP '13
3. Large-scale cluster management at Google with Borg. EuroSys '15
4. Firmament: fast, centralized cluster scheduling at scale. OSDI '16
5. Live migration of virtual machines. NSDI '05
6. VMware DRS. 2006
7. OpenStack Watcher. 2016
8. Nutanix Acropolis Dynamic Scheduler. 2017
9. Virtual Machine Consolidation in the Wild. Middleware 2014
10. Entropy: a consolidation manager for clusters. VEE 2009
11. pMapper: power and migration cost aware application placement in virtualized systems. Middleware 2009
12. Memory Buddies: exploiting page sharing for smart colocation in virtualized data centers. VEE 2009
13. Energy-aware resource allocation heuristics for efficient management of data centers for cloud computing. FGCS 2012
14. BtrPlace: a flexible consolidation manager for highly available applications. TDSC 2013
15. Higher SLA satisfaction in datacenters with continuous VM placement constraints. HotDep 2013
16. Scheduling live migrations for fast, adaptable and energy-efficient relocation operations. UCC 2015
17. Guaranteeing high availability goals for virtual machine placement. ICDCS 2011
18. The Acropolis Dynamic Scheduler. http://nutanixbible.com/