Improving Resource Availability in CERN Cloud



slide-1
SLIDE 1
slide-2
SLIDE 2

Improving Resource Availability in CERN Cloud

José Castro León & Spyros Trigazis

CERN Cloud Infrastructure

slide-3
SLIDE 3

Outline

  • Introduction
  • CERN Cloud service
  • Get the most out of cloud resources
      • Automation
      • Optimization
      • Preemptibles
      • Containers on Baremetal

slide-4
SLIDE 4

European Organization for Nuclear Research

  • World's largest particle physics laboratory
  • Founded in 1954
  • 23 member states
  • Fundamental research in physics
slide-5
SLIDE 5

European Organization for Nuclear Research

slide-6
SLIDE 6

CERN Cloud Service

  • Infrastructure as a Service
  • In production since July 2013
  • CentOS 7 based
  • Geneva and Wigner computer centres
  • Highly scalable architecture: more than 70 nova cells
      • 2 regions
  • Currently running the Rocky release

slide-7
SLIDE 7

slide-8
SLIDE 8

CERN Cloud Infrastructure – initial offering

IaaS components (diagram labels):

  • Compute: nova
  • Storage: glance
  • Identity: keystone
  • Web UI: horizon

slide-9
SLIDE 9

CERN Cloud Infrastructure

IaaS and IaaS+ components (diagram labels):

  • Compute: nova, ironic
  • Storage: cinder, glance, manila
  • Network: neutron
  • Identity: keystone
  • Web UI: horizon
  • Orchestration: heat
  • Key manager: barbican
  • Container Orchestration: magnum
  • Automation: mistral
  • Optimization: watcher
  • Benchmark: rally

slide-10
SLIDE 10

Back in 2012

[Chart: GRID, ATLAS, CMS, LHCb and ALICE across Run 1 to Run 4, annotated "we were there" and "what we can afford"]

  • LHC computing and data requirements were increasing
  • Constant team size
  • LS1 ahead, next window in 2019
  • Other deployments had surpassed CERN

3 core areas:

  • Centralized Monitoring
  • Configuration management
  • IaaS based on OpenStack

“All servers shall be virtual!”

slide-11
SLIDE 11

Situation now

  • ~300k-core cloud and increasing
      • Addition of new services
      • Continuous improvements to existing ones
  • No change in number of staff
  • Improvement areas
      • Code efficiency
      • Improve algorithms with machine learning
      • Use of compute accelerators (GPUs / FPGAs)
      • Resource availability

slide-12
SLIDE 12

Improve resource availability

  • Continuous improvement process
      • Evaluate current cloud status
      • Find room for improvement
      • Develop new solutions and services
      • Make those services available to our users
  • Get the most out of cloud resources
      • Performance
      • Availability

slide-13
SLIDE 13

[Overview diagram: Automation (mistral), Optimization (watcher), Preemptibles (aardvark), Baremetal Containers (Ironic + magnum)]

slide-14
SLIDE 14

CERN Cloud Automation

[Diagram labels: mistral, HR, Resources, cornerstone, collectd, grafana, GNI, watcher, rally]

slide-15
SLIDE 15

Main objectives of automation

  • Simplify resource management
      • Focus on getting the last bit of performance
  • Optimize user experience
  • Maximize resources available
      • Cleanup of orphaned resources
      • Expire unused resources
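To make "cleanup of orphaned resources" concrete, here is a minimal sketch. It assumes that "orphaned" means a resource whose owning project no longer exists, which the slide does not state. The `Resource` shape, the `find_orphans` helper and all identifiers are hypothetical and are not part of the CERN tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A cloud resource as seen by a cleanup job (hypothetical shape)."""
    resource_id: str
    kind: str          # e.g. "server" or "volume"
    project_id: str


def find_orphans(resources, active_project_ids):
    """Return resources whose owning project is not in the active set.

    Assumption: a resource counts as orphaned when its project is gone.
    """
    active = set(active_project_ids)
    return [r for r in resources if r.project_id not in active]


resources = [
    Resource("vm-1", "server", "proj-a"),
    Resource("vol-7", "volume", "proj-gone"),
    Resource("vm-2", "server", "proj-b"),
]
orphans = find_orphans(resources, ["proj-a", "proj-b"])
print([r.resource_id for r in orphans])  # ['vol-7']
```

A real cleanup would presumably list resources per service and act on the result. This sketch only covers the selection step.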

slide-16
SLIDE 16

Resource Lifecycle Management

  • Types of projects: Personal and Shared
  • Provisioning and cleanup in Mistral workflows
      • Service inter-dependencies
      • Multi-region support

[Lifecycle diagram: events Affiliation Expired, User Disabled, User Deletion]

  • Shared projects: Promote
  • Personal projects: Stop, Delete

slide-17
SLIDE 17

Resource Lifecycle Management in detail

  • Set of interconnected workbooks to manage
      • Projects
      • Services

[Diagram labels: mistral workflows project_delete (keystone.project_get, service_delete, keystone.project_delete) and service_delete (magnum, barbican, heat, nova, cinder, manila, s3, glance, neutron)]
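The service inter-dependencies mentioned earlier imply that per-service cleanup has to run in a safe order. The sketch below derives one with a topological sort. The dependency map is an illustrative assumption (for example, that magnum clusters should go before the heat stacks backing them), not the ordering encoded in the CERN workbooks, and `deletion_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical map: service -> services whose resources should be deleted
# BEFORE this service's own resources. Illustrative only.
DELETE_AFTER = {
    "heat": {"magnum"},     # clusters before the stacks backing them
    "nova": {"heat"},       # stacks before the servers they own
    "cinder": {"nova"},     # servers before their attached volumes
    "neutron": {"nova"},    # servers before their ports and networks
}


def deletion_order(delete_after):
    """Return a service order in which every prerequisite comes first."""
    return list(TopologicalSorter(delete_after).static_order())


order = deletion_order(DELETE_AFTER)
print(order)
```

Services with no mutual dependency may appear in any relative order. `TopologicalSorter` raises `CycleError` if the map contains a cycle.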

slide-18
SLIDE 18

Multi region support

  • We’ve just added a 2nd region

[Diagram labels: mistral service_delete (magnum, barbican, heat, nova, cinder, manila, s3, glance, neutron); launch_per_region (get_regions, region_loop); launch_per_region (get_override, launch_override, launch_default)]

slide-19
SLIDE 19

Multi region support (code)

    ...
    launch_per_region:
      input:
        - name
        - type
        - id
      tasks:
        # Collect the distinct regions that expose a public endpoint
        # for the requested service type, in sorted order.
        get_regions:
          action: std.noop
          publish:
            regions: <% let(type => $.type) -> $.openstack.service_catalog.catalog.where($.type = $type).endpoints.flatten().where($.interface = 'public').select($.region).distinct().orderBy($) %>
          on-success:
            - region_loop
        # Launch the per-region sub-workflow once for every region found.
        region_loop:
          with-items: region in <% $.regions %>
          workflow: launch_region_with_override
          input:
            name: <% $.name %>
            id: <% $.id %>
            region: <% $.region %>
    ...
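For readers unfamiliar with YAQL, the `get_regions` expression can be mirrored in plain Python. This is a sketch for reading purposes only. It assumes the service catalog is a list of services, each with a `type` and a list of `endpoints` carrying `interface` and `region` keys. The function name and the sample region names are hypothetical.

```python
def public_regions(catalog, service_type):
    """Sorted, de-duplicated regions with a public endpoint for a service type."""
    regions = {
        endpoint["region"]
        for service in catalog
        if service["type"] == service_type
        for endpoint in service["endpoints"]
        if endpoint["interface"] == "public"
    }
    return sorted(regions)


# Hypothetical catalog excerpt; region names are placeholders.
catalog = [
    {"type": "compute", "endpoints": [
        {"interface": "public", "region": "region-b"},
        {"interface": "internal", "region": "region-b"},
        {"interface": "public", "region": "region-a"},
    ]},
    {"type": "image", "endpoints": [
        {"interface": "public", "region": "region-a"},
    ]},
]

print(public_regions(catalog, "compute"))  # ['region-a', 'region-b']
```

The `region_loop` task then corresponds to iterating over this list and starting one sub-workflow per region.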

slide-20
SLIDE 20

Optimize resource availability - Expiration

  • Each VM in a personal project has an expiration date
  • Set shortly after creation and evaluated daily
  • Configured to 180 days and renewable
  • Reminder mails starting 30 days before expiration
  • Implemented as a Workbook in Mistral

[Timeline diagram: states ACTIVE and EXPIRED; events Reminder, Expiration, Deletion]
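The daily evaluation described on this slide can be sketched as a small date calculation. The 180-day lifetime and the 30-day reminder window come from the slide. The function names, the returned labels and the rule that a renewal restarts the full lifetime from the renewal date are assumptions made for illustration, not the Mistral workbook's actual logic.

```python
from datetime import date, timedelta

LIFETIME = timedelta(days=180)         # from the slide
REMINDER_WINDOW = timedelta(days=30)   # from the slide


def expiration_date(set_on):
    """Expiration date for a VM whose expiry was set or renewed on `set_on`."""
    return set_on + LIFETIME


def daily_check(expires_on, today):
    """Classify a VM for the daily run: 'active', 'remind' or 'expired'."""
    if today >= expires_on:
        return "expired"
    if today >= expires_on - REMINDER_WINDOW:
        return "remind"
    return "active"


expires = expiration_date(date(2019, 1, 1))
print(expires)                                  # 2019-06-30
print(daily_check(expires, date(2019, 5, 1)))   # active
print(daily_check(expires, date(2019, 5, 31)))  # remind
print(daily_check(expires, date(2019, 6, 30)))  # expired
```

What happens between expiration and deletion is not covered here, since the slide only names those events.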

slide-21
SLIDE 21

Expiration of Personal Instances

  • 1000 unused VMs
  • 3000 cores freed

slide-22
SLIDE 22

[Overview diagram: Automation (mistral), Optimization (watcher), Preemptibles (aardvark), Baremetal Containers (Ironic + magnum)]

slide-23
SLIDE 23

Towards Optimization service at CERN

  • Successful evaluation of Watcher service
  • Recently involved with upstream community
      • Corne Lukken @D4ntali0n
  • Room for improvement
      • Execution at scale
      • Additional datasources
      • Strategy improvements

slide-24
SLIDE 24

Get the most out of the infrastructure

  • Per-cell audit on the Cloud
      • Improve Cloud service user perception (fair share)
      • Early discovery of performance issues
  • Dynamically adjust workloads in hyperconverged environments
      • Keeping free resources for IO
      • Avoid impact on compute
      • Automatic live-migration

slide-25
SLIDE 25

Watcher strategy as preemptible scheduler?

  • Use case:
      • Hardware procurement 2 times per year
      • Once provisioned, users start to use the new resources
      • On decommission, they are slowly drained
  • Issue: unused resources
  • A Watcher automatic audit could create preemptible instances with BOINC workloads
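As an illustration of the idea, an audit could work out how many preemptible instances fit into the idle capacity of each host. The sketch below is not a Watcher strategy and uses no Watcher API. The function name, the host names and the fixed instance size are all assumptions.

```python
def preemptible_plan(free_cores_by_host, cores_per_instance):
    """Map each host to the number of preemptible instances its idle cores can hold."""
    if cores_per_instance <= 0:
        raise ValueError("cores_per_instance must be positive")
    plan = {}
    for host, free_cores in free_cores_by_host.items():
        count = free_cores // cores_per_instance
        if count > 0:
            plan[host] = count
    return plan


# Hypothetical snapshot of idle cores on freshly provisioned or draining hosts.
idle = {"host-1": 10, "host-2": 3, "host-3": 0}
print(preemptible_plan(idle, 4))  # {'host-1': 2}
```

Hosts without enough idle cores for a single instance are left out of the plan.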

slide-26
SLIDE 26

Optimization service status

  • Execution at scale
      • Audit Scope
  • Datasources
      • Grafana-proxy
  • Strategies
      • Per-cell workload balancer
      • Hyperconverged balancer
      • Preemptible scheduler

slide-27
SLIDE 27

[Overview diagram: Automation (mistral), Optimization (watcher), Preemptibles (aardvark), Baremetal Containers (Ironic + magnum)]

slide-28
SLIDE 28

Preemptibles

[Diagram labels: user VMs, pre (preemptible instances), aardvark]
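The core idea of preemptible instances is that they give way when a regular user VM needs the capacity. The sketch below picks which preemptible instances to remove to free a given number of cores, using a simple largest-first greedy rule. This rule is an assumption chosen for brevity, not aardvark's actual selection logic, and the function and instance names are hypothetical.

```python
def select_for_preemption(preemptibles, cores_needed):
    """Choose preemptible instances to remove, largest first.

    `preemptibles` maps instance name -> cores. Returns the chosen names,
    or None when even removing all of them would not free enough cores.
    """
    chosen, freed = [], 0
    by_size = sorted(preemptibles.items(), key=lambda item: item[1], reverse=True)
    for name, cores in by_size:
        if freed >= cores_needed:
            break
        chosen.append(name)
        freed += cores
    return chosen if freed >= cores_needed else None


running = {"pre-1": 2, "pre-2": 4, "pre-3": 1}
print(select_for_preemption(running, 5))   # ['pre-2', 'pre-1']
print(select_for_preemption(running, 10))  # None
```

A production policy would probably also weigh instance age or the amount of work lost. Largest-first only keeps the number of removed instances small.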

slide-29
SLIDE 29

Preemptible Service Demo

Demo: https://youtu.be/d-qO1knInHM?t=424

slide-30
SLIDE 30

Preemptible Service Status

  • Upstream work
      • Add instance state PENDING (spec, code)
      • Allow rebuilding instances in cell0 (spec, code)
  • Users
      • LHC@home
      • Opportunistic Batch

slide-31
SLIDE 31

[Overview diagram: Automation (mistral), Optimization (watcher), Preemptibles (aardvark), Baremetal Containers (Ironic + magnum)]

slide-32
SLIDE 32

Containers on Baremetal

  • Get the last bit of performance
      • Put together OpenStack managed containers and baremetal
  • Batch farm runs in VMs as well
      • 3% performance overhead, 0% with containers
  • Federated kubernetes for cluster integration
slide-33
SLIDE 33

Containers on Baremetal Status

  • Typical deployment
      • Masters in VMs
      • Minions on physical nodes
  • Users
      • Batch farm
          • Clusters available
          • Adapting own Terraform templates
          • HTCondor queues
          • Job submission
slide-34
SLIDE 34

One more thing…

slide-35
SLIDE 35

Tech Blog

  • Backfilling Kubernetes Clusters by Ricardo Rocha

https://techblog.web.cern.ch/techblog/post/priority-preemption-boinc-backfill/

  • Splitting the CERN OpenStack Cloud into Two Regions by Belmiro Moreira

https://techblog.web.cern.ch/techblog/post/region-split/

  • Expiry of VMs in the CERN cloud by José Castro León

https://techblog.web.cern.ch/techblog/post/expiry-of-vms-in-cern-cloud/

  • Maximizing resource utilization with Preemptible Instances by Theodoros Tsioutsias

https://techblog.web.cern.ch/techblog/post/maximizing-resource-utilization-with/

slide-36
SLIDE 36

Thank you

  • gitlab.cern.ch/cloud-infrastructure
  • cern.ch/techblog
  • jose.castro.leon@cern.ch (@josecastroleon)
  • spyridon.trigazis@cern.ch (@strigazi)

slide-37
SLIDE 37