[PPT] - advanced instance scheduling by PowerPoint Presentation

SLIDE 1

How can placement help you achieve advanced instance scheduling by integrating with different services

Zhenyu Zheng irc: Kevin_Zheng Huawei Yikun Jiang irc: Yikun Huawei Sheng Hu irc: Tommylikehu Huawei

November 2018, Berlin

SLIDE 2

About us

Zhenyu Zheng: Huawei Technologies Co., Ltd. OpenStack Nova Contributor, Upstream Developer.

Yikun Jiang: Huawei Technologies Co., Ltd. OpenStack Nova & Cinder Contributor, Upstream Developer.

Sheng Hu: Huawei Technologies Co., Ltd. OpenStack Cinder Core Reviewer, Upstream Developer.

SLIDE 3

Contents

Placement in a Nutshell
How Nova uses Placement and what problems it solved
How other services interact with Nova through Placement, and what features we achieved

What can users expect for Stein?

SLIDE 4

Placement in a Nutshell

SLIDE 5

Placement in a Nutshell - History

Introduced in the Newton (14.0.0) release as a part of Nova.
Goal: Enable more effective accounting of resources in an OpenStack deployment and better scheduling of various entities in the cloud [1].
A separate RESTful API stack and data model used to track resource provider inventories and usages, along with different classes of resources [1].

SLIDE 6

Placement in a Nutshell - Now

Many features were added over the past 4 releases, making it more powerful and easier to use [1]:
Custom resource classes can be defined (Ocata)
Traits APIs were added in Pike, and querying RPs by traits was added in Queens
Allocation candidates can be restricted to members of given aggregates (Rocky)
More and more services are starting to consider adopting Placement.
Placement is now being extracted into its own project.

SLIDE 7

Placement in a Nutshell - Contents

This session will focus on high-level resource abstractions and workflows.
Eric Fried and Ed Leafe gave a very good session in Vancouver about implementation details, with a very interesting example:
https://www.openstack.org/videos/vancouver-2018/placement-present-and-future-in-nova-and-beyond

SLIDE 8

Placement in a Nutshell - From 1000 Feet

The Placement service is straightforward:
A WSGI application that receives and responds to JSON requests;
An RDBMS for data persistence.
State is managed solely in the DB. Scaling the placement service can therefore be done by increasing the number of WSGI app instances and scaling the RDBMS using traditional database scaling techniques.

[Diagram: Placement API, RDBMS]

SLIDE 9

Placement in a Nutshell - Deployment

A sample deployment at CERN [2].
Placement is deployed together with Cells V2.
Overall, 16 Placement services and 70 cells (200 compute nodes in each cell) made up a successful deployment of 14,000 compute nodes in one region.

[2] https://www.openstack.org/videos/vancouver-2018/moving-from-cellsv1-to-cellsv2-at-cern

SLIDE 10

Placement in a Nutshell - Concepts

Resource Providers: An abstract data model representing an object that provides a certain type and quantity of resources tracked by the placement service, such as a compute node or a storage pool.

Resource Class: The type of a resource. There are standard resource classes (for example DISK_GB, MEMORY_MB, and VCPU) and custom resource classes (prefixed with CUSTOM_).

Inventories: The quantity of each resource class that a resource provider can provide. For example, RP_1 has an inventory of 100 DISK_GB, 2048 MEMORY_MB and 8 VCPU.

Traits: Describe qualitative aspects of a resource provider. For example, the DISK_GB provided by RP_1 might be solid state drives (SSD), so we can set an ``is_SSD`` trait on RP_1.
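The concepts above can be sketched as a small in-memory model. This is only an illustration of how providers, resource classes, inventories and traits relate to each other; the `ResourceProvider` class and the `is_custom_resource_class` helper are hypothetical names and not part of the Placement code base.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ResourceProvider:
    """Hypothetical stand-in for a Placement resource provider."""
    name: str
    # Inventories: resource class name -> total quantity provided.
    inventories: Dict[str, int] = field(default_factory=dict)
    # Traits: qualitative properties of the provider.
    traits: Set[str] = field(default_factory=set)


def is_custom_resource_class(name: str) -> bool:
    """Custom resource classes carry the CUSTOM_ prefix."""
    return name.startswith("CUSTOM_")


# RP_1 from the slide: 100 DISK_GB, 2048 MEMORY_MB, 8 VCPU, backed by SSDs.
rp_1 = ResourceProvider(
    name="RP_1",
    inventories={"DISK_GB": 100, "MEMORY_MB": 2048, "VCPU": 8},
    traits={"is_SSD"},
)
```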

SLIDE 11

Placement in a Nutshell – Concepts

Consumers: The users of resources from the resource providers. For example, an instance is a consumer of RP_1 (a compute node) that consumes 10 DISK_GB, 1024 MEMORY_MB and 4 VCPU.

Allocations: The data model that stores the resource usage relationship between resource providers and consumers. A typical allocation record could be: consumer_1 occupies 4 units of resource_1 from resource_provider_1.

Allocation Candidates: Placement returns a group of resource providers that can satisfy a request; these are called the allocation candidates. Callers can then use these candidates as input for their own filtering and sorting process to select the best candidate.
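A minimal sketch of how allocations reduce free capacity and how candidates fall out of that, using the RP_1 numbers from the concept slides. It deliberately ignores details such as reserved amounts and allocation ratios; the `Allocation` class and both helper functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    """One record: a consumer occupies `used` units of a class on a provider."""
    consumer: str
    provider: str
    resource_class: str
    used: int


INVENTORIES = {"RP_1": {"DISK_GB": 100, "MEMORY_MB": 2048, "VCPU": 8}}

ALLOCATIONS = [
    Allocation("instance_1", "RP_1", "DISK_GB", 10),
    Allocation("instance_1", "RP_1", "MEMORY_MB", 1024),
    Allocation("instance_1", "RP_1", "VCPU", 4),
]


def free_capacity(provider, resource_class):
    """Total inventory minus everything already allocated on this provider."""
    total = INVENTORIES[provider].get(resource_class, 0)
    used = sum(
        a.used
        for a in ALLOCATIONS
        if a.provider == provider and a.resource_class == resource_class
    )
    return total - used


def allocation_candidates(request):
    """Providers that can still satisfy every requested amount."""
    return [
        provider
        for provider in INVENTORIES
        if all(free_capacity(provider, rc) >= amount
               for rc, amount in request.items())
    ]
```

With these numbers RP_1 still has 4 VCPU free, so a request for 4 VCPU yields RP_1 as a candidate while a request for 5 VCPU yields none.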

SLIDE 12

Placement in a Nutshell – Concepts

Nested Resource Providers: In the Queens release, placement introduced the ability to represent hierarchical relationships between resource providers. This is very useful for representing resources such as NUMA nodes and SR-IOV virtual functions (SRIOV_NET_VF).

[Diagram: an example resource provider tree, showing resource providers, inventories and traits]
Compute Node 1: ID = 0, Parent = NULL, Root = 0; inventory 100 DISK_GB; trait SSD
NUMA Cell 0: ID = 1, Parent = 0, Root = 0; inventory 8 VCPU, 4096 MEMORY_MB
NUMA Cell 1: ID = 2, Parent = 0, Root = 0; inventory 8 VCPU, 4096 MEMORY_MB
Physical Function 0: ID = 3, Parent = 1, Root = 0; inventory 8 SRIOV_NET_VF
Physical Function 1: ID = 4, Parent = 2, Root = 0; inventory 8 SRIOV_NET_VF
Trait shown on a Physical Function: HW_NIC_OFFLOAD_GENEVE
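The tree in the diagram can be sketched as a parent-pointer table. The dictionary layout and the helpers `root_of` and `tree_totals` are illustrative assumptions, not how Placement stores or queries its data.

```python
# Provider ID -> name, parent ID and inventory, following the diagram.
PROVIDERS = {
    0: {"name": "Compute Node 1", "parent": None,
        "inventory": {"DISK_GB": 100}},
    1: {"name": "NUMA Cell 0", "parent": 0,
        "inventory": {"VCPU": 8, "MEMORY_MB": 4096}},
    2: {"name": "NUMA Cell 1", "parent": 0,
        "inventory": {"VCPU": 8, "MEMORY_MB": 4096}},
    3: {"name": "Physical Function 0", "parent": 1,
        "inventory": {"SRIOV_NET_VF": 8}},
    4: {"name": "Physical Function 1", "parent": 2,
        "inventory": {"SRIOV_NET_VF": 8}},
}


def root_of(provider_id):
    """Follow parent pointers until a provider without a parent is reached."""
    current = provider_id
    while PROVIDERS[current]["parent"] is not None:
        current = PROVIDERS[current]["parent"]
    return current


def tree_totals(root_id):
    """Sum the inventories of every provider that belongs to the given tree."""
    totals = {}
    for provider_id, provider in PROVIDERS.items():
        if root_of(provider_id) != root_id:
            continue
        for resource_class, amount in provider["inventory"].items():
            totals[resource_class] = totals.get(resource_class, 0) + amount
    return totals
```

Every provider in the example resolves to root 0, which matches the Root = 0 annotation on each box of the diagram.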

SLIDE 13

How Nova uses Placement & what problems did it solve

SLIDE 14

Nova workflow - Problems

Incorrect resource usage reporting: For legacy reasons, Nova assumes that resources are only reported by compute nodes. When reporting, Nova naively calculates resource usage and availability by simply summing amounts across all compute nodes in its database, which causes a number of problems.

Large Scale Problem: When scheduling, the Nova scheduler retrieves a list of all compute nodes in the entire deployment and runs each of them through all the filters enabled in the deployment. This is extremely wasteful, and the inefficiency gets worse as the deployment grows.

SLIDE 15

Nova workflow - Problems

Cross-project scheduling: It was very hard to combine Nova scheduling with advanced scheduling features from other services (such as the routed networks functionality provided by Neutron) to achieve a more advanced scheduling process. Using a more generic resource management service makes this much easier.

SLIDE 16

Nova workflow - Report

Compute Node reports to placement:

Logic was added in nova-compute (resource_tracker) to report its available resources to placement as a resource provider with related inventories.
Currently it only reports VCPU, DISK_GB, MEMORY_MB and VGPU, and it has also started to report some CPU features as traits [3].
Syncing happens periodically and during initialization, through a request to placement:

PUT /resource_providers/{rp_uuid}/inventories
{ "resource_provider_generation": 66, "inventories": { "VCPU": { "total": 16 }, … } }
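As a rough sketch, the request above could be assembled like this. It assumes each inventory entry is an object with a `total` field, as in the Placement API, and leaves out optional fields such as reserved amounts; the helper names are hypothetical and no HTTP call is made.

```python
import json


def inventory_path(rp_uuid):
    """Path of the inventories resource for one resource provider."""
    return f"/resource_providers/{rp_uuid}/inventories"


def build_inventory_update(generation, totals):
    """Build the PUT body: the provider generation plus one entry per class.

    The generation lets the server detect concurrent updates to the same
    provider, so the caller sends back the value it last saw.
    """
    return {
        "resource_provider_generation": generation,
        "inventories": {
            resource_class: {"total": total}
            for resource_class, total in totals.items()
        },
    }


body = build_inventory_update(66, {"VCPU": 16})
print("PUT", inventory_path("{rp_uuid}"))
print(json.dumps(body, indent=2))
```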

SLIDE 17

Nova workflow – Scheduling

Nova-Scheduler gathers all scheduling-related parameters (such as VCPU, MEMORY_MB, DISK_GB, etc.). ``Flavor Extra_Specs`` and ``image_properties`` will be translated to ``Traits`` requests.

Nova-Scheduler will then call Placement as:

GET /allocation_candidates?resources=VCPU:1,MEMORY_MB:1024,DISK_GB:100&required=SSD
OpenStack-API-Version: placement 1.10 (Maximum in Pike)
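A small sketch of how the query string in this request can be built from the requested amounts and required traits. The function name `allocation_candidates_url` is hypothetical; only the standard library is used and nothing is sent over the network.

```python
from urllib.parse import urlencode


def allocation_candidates_url(resources, required=()):
    """Build the GET /allocation_candidates path with its query string.

    resources: mapping of resource class -> requested amount
    required:  iterable of trait names every candidate must have
    """
    params = {
        "resources": ",".join(
            f"{resource_class}:{amount}"
            for resource_class, amount in resources.items()
        )
    }
    if required:
        params["required"] = ",".join(required)
    # Keep ':' and ',' unescaped so the result reads like the slide.
    return "/allocation_candidates?" + urlencode(params, safe=":,")


url = allocation_candidates_url(
    {"VCPU": 1, "MEMORY_MB": 1024, "DISK_GB": 100},
    required=["SSD"],
)
```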

SLIDE 18

Nova workflow – Scheduling

A typical response will be:

{
    "provider_summaries": {
        "0bd25bea-5adc-4b39-ac4d-acd6e98d2439": {
            "traits": [ "HW_CPU_X86_SSE2" ... ],
            "resources": {
                "VCPU": { "used": 6, "capacity": 256 },
                "MEMORY_MB": { "used": 5120, "capacity": 59568 },
                "DISK_GB": { "used": 22, "capacity": 837 }
            }
        }
        …
    },
    "allocation_requests": [
        {
            "allocations": {
                "0bd25bea-5adc-4b39-ac4d-acd6e98d2439": {
                    "resources": { "VCPU": 1, "MEMORY_MB": 512, "DISK_GB": 1 }
                }
            }
        }
    ]
}

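To show how a caller might read such a response, here is a sketch that computes the remaining capacity per provider and lists the providers named in each allocation request. The sample is a trimmed copy of the response on this slide with the elided parts left out; the helper names are hypothetical.

```python
import json

RP_UUID = "0bd25bea-5adc-4b39-ac4d-acd6e98d2439"

# Trimmed copy of the slide's response (elided entries omitted).
RESPONSE = json.loads("""
{
  "provider_summaries": {
    "0bd25bea-5adc-4b39-ac4d-acd6e98d2439": {
      "traits": ["HW_CPU_X86_SSE2"],
      "resources": {
        "VCPU": {"used": 6, "capacity": 256},
        "MEMORY_MB": {"used": 5120, "capacity": 59568},
        "DISK_GB": {"used": 22, "capacity": 837}
      }
    }
  },
  "allocation_requests": [
    {
      "allocations": {
        "0bd25bea-5adc-4b39-ac4d-acd6e98d2439": {
          "resources": {"VCPU": 1, "MEMORY_MB": 512, "DISK_GB": 1}
        }
      }
    }
  ]
}
""")


def remaining(summary):
    """Capacity minus usage for each resource class of one provider."""
    return {
        resource_class: values["capacity"] - values["used"]
        for resource_class, values in summary["resources"].items()
    }


def providers_per_request(response):
    """For each allocation request, the providers it would allocate from."""
    return [
        sorted(request["allocations"])
        for request in response["allocation_requests"]
    ]


free = remaining(RESPONSE["provider_summaries"][RP_UUID])
```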
SLIDE 19

Nova workflow – Scheduling

Nova-Scheduler will then go through the enabled filters and weighers as before to select the best hosts for instance creation.

The resource claim has been moved earlier, from nova-compute to the scheduler, which calls PUT /allocations/{consumer_uuid} on the placement API with all claimed resources as the body. This way we maintain a single source of truth for resources and avoid conflicts and reschedules.

In current Nova, because cells v2 is used and rescheduling only works within one cell, the scheduler selects a best host and then selects alternates from the same cell.
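The "best host plus alternates from the same cell" rule can be sketched as follows. This is an illustration only, assuming the hosts arrive already ranked best-first; the function name, the `(host, cell)` tuples and the `max_alternates` parameter are hypothetical and do not mirror Nova's internal code.

```python
def select_with_alternates(ranked_hosts, max_alternates=2):
    """Pick the best host, then alternates restricted to the same cell.

    ranked_hosts: list of (host_name, cell_name) tuples, best first.
    Returns (best_host, [alternate_host, ...]).
    """
    if not ranked_hosts:
        raise ValueError("no hosts to choose from")
    best_host, best_cell = ranked_hosts[0]
    # Rescheduling only works inside one cell, so skip hosts in other cells.
    alternates = [
        host for host, cell in ranked_hosts[1:] if cell == best_cell
    ][:max_alternates]
    return best_host, alternates


ranked = [
    ("host-a", "cell1"),
    ("host-b", "cell2"),
    ("host-c", "cell1"),
    ("host-d", "cell1"),
    ("host-e", "cell1"),
]
selection = select_with_alternates(ranked)
```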

SLIDE 20

Nova workflow – Scheduling

In a large and sparse cloud (e.g. 10,000 empty compute nodes), a request for a very small (or even a very ordinary) instance can return 10,000 candidates. This has implications for memory and performance on both the placement service and Nova-Scheduler.

Thus in Queens we added two config options [4]:

[scheduler]/max_placement_results, default=1000
[placement]/randomize_allocation_candidates, default=False

In Rocky, pre-placement filters were added to the early phase of the scheduling process, improving the overall scheduling process. Currently they support filtering resource provider aggregates by project_id and/or availability zone.
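The effect of the two options can be sketched like this: cap the number of candidates, and optionally shuffle first so the cap does not always keep the same providers. The function below is a hypothetical illustration of the idea and not the code that Placement or Nova actually run.

```python
import random


def limit_candidates(candidates, max_results=1000, randomize=False, rng=None):
    """Return at most max_results candidates.

    max_results mirrors [scheduler]/max_placement_results and randomize
    mirrors [placement]/randomize_allocation_candidates. rng may be a
    random.Random instance to make the shuffle reproducible.
    """
    pool = list(candidates)
    if randomize:
        (rng or random).shuffle(pool)
    return pool[:max_results]


# 10,000 empty compute nodes, as in the example on this slide.
all_nodes = [f"node-{i}" for i in range(10000)]
limited = limit_candidates(all_nodes)
shuffled = limit_candidates(all_nodes, randomize=True, rng=random.Random(0))
```

Without randomization the same first 1000 providers are returned every time; with it, load is spread across the whole deployment.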

SLIDE 21

How other services interact with Nova through Placement and what feature did we achieve

SLIDE 22

Workflow – Nova & Neutron Interaction

Goal: Enable instance scheduling based on the network bandwidth available on the hosts.

Workflow:

1. Nova creates the root RP of the compute node RP tree;
2. Neutron creates the networking RP tree of a compute node under the compute node root RP and reports bandwidth inventories;
3. Neutron provides the resource_request of a port in the Neutron API;

SLIDE 23

Workflow – Nova & Neutron Interaction

Workflow – cont’d:

4. Nova takes the ports’ resource_request and includes it in the GET /allocation_candidates request;
5. Nova selects the best host from the allocation candidates and claims the resources in Placement;
6. Nova also passes the allocation information to Neutron during port binding.
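Step 4 can be sketched as merging each port's resource_request into the query parameters next to the flavor resources. The shape of `port_request`, the numbered `resourcesN`/`requiredN` parameter form, and the resource class and trait names below are assumptions made for illustration, not details taken from these slides.

```python
def format_resources(resources):
    """Render {"VCPU": 1, ...} as "VCPU:1,..."."""
    return ",".join(f"{rc}:{amount}" for rc, amount in resources.items())


def build_query_params(flavor_resources, port_requests):
    """Combine flavor resources with one numbered group per port request."""
    params = {"resources": format_resources(flavor_resources)}
    for index, request in enumerate(port_requests, start=1):
        params[f"resources{index}"] = format_resources(request["resources"])
        if request.get("required"):
            params[f"required{index}"] = ",".join(request["required"])
    return params


# Hypothetical resource_request of a single port.
port_request = {
    "resources": {"NET_BW_EGR_KILOBIT_PER_SEC": 1000},
    "required": ["CUSTOM_PHYSNET_PHYSNET0"],
}

params = build_query_params({"VCPU": 1, "MEMORY_MB": 1024}, [port_request])
```

Keeping each port in its own numbered group expresses that the bandwidth and the traits of one port must come from the same provider.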

SLIDE 24

Workflow – Nova & Cyborg Interaction

Cyborg:

Cyborg is an OpenStack project that aims to provide a general management framework for acceleration resources, such as FPGA, ASIC, GPU, etc. [5].
Launched in Pike and growing fast.
The acceleration resources are attached to instances as devices, so Nova/Cyborg joint scheduling is needed to make this work. This can be done with the help of Placement.

SLIDE 25

Workflow – Nova & Cyborg Interaction

Workflow:

Cyborg will discover and manage the accelerator resources and abstract them as resource providers in Placement;
Like Neutron, Cyborg will report the accelerator resources as child resource providers of a compute node root resource provider;
Users will have to specify the accelerator request in flavor_extra_specs, image_properties or scheduler_hints when booting an instance;
Nova scheduler will use this information during the scheduling process;
A new ``os-acc`` lib will be used for the attach/detach process.

SLIDE 26

What can users expect for Stein?

SLIDE 27

Stein Features

Placement in a separate project
NUMA Topology with Resource Providers [6]:
Use a resource provider tree to express the relationship between a root RP (compute node) and one or more NUMA nodes, each of which has separate resources (memory, PCI devices)
Network Bandwidth RPs [7]:
Model network bandwidth as a child resource provider of a parent RP (compute node) to achieve bandwidth-based scheduling
Support ANY traits in allocation_candidates query [8]:
Able to query RPs with ANY of the required traits
Support filtering by forbidden aggregate [9]:
Able to use the ``!member_of`` syntax to query RPs that are not members of the specified aggregate
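The semantics of the last two features can be sketched as an in-memory filter: keep a provider if it has at least one of the wanted traits and belongs to none of the forbidden aggregates. The data layout, trait names and function name are hypothetical, and the sketch makes no claim about the final query syntax.

```python
def filter_providers(providers, any_traits=(), forbidden_aggregates=()):
    """Return names of providers that pass both checks.

    any_traits:           keep providers having at least one of these traits
                          (no trait filtering when empty)
    forbidden_aggregates: drop providers that are members of any of these
    """
    wanted = set(any_traits)
    forbidden = set(forbidden_aggregates)
    selected = []
    for provider in providers:
        if wanted and not (provider["traits"] & wanted):
            continue
        if provider["aggregates"] & forbidden:
            continue
        selected.append(provider["name"])
    return selected


PROVIDERS = [
    {"name": "rp_a", "traits": {"CUSTOM_T1"}, "aggregates": {"agg_1"}},
    {"name": "rp_b", "traits": {"CUSTOM_T2"}, "aggregates": {"agg_2"}},
    {"name": "rp_c", "traits": {"CUSTOM_T3"}, "aggregates": set()},
]
```

For example, asking for any of CUSTOM_T1 or CUSTOM_T2 while forbidding agg_1 leaves only rp_b.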

SLIDE 28

References

[1] https://docs.openstack.org/nova/latest/user/placement.html
[2] https://www.openstack.org/videos/vancouver-2018/moving-from-cellsv1-to-cellsv2-at-cern
[3] https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/report-cpu-features-as-traits.html
[4] https://specs.openstack.org/openstack/nova-specs/specs/queens/implemented/allocation-candidates-limit.html
[5] https://wiki.openstack.org/wiki/Cyborg
[6] https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/numa-topology-with-rps.html
[7] https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/bandwidth-resource-provider.html
[8] https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/placement-any-traits-in-allocation_candidates-query.html
[9] https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/negative-aggregate-membership.html

SLIDE 29

Q & A?

SLIDE 30