Towards a self automated CERN Cloud, José Castro León



slide-1
SLIDE 1
slide-2
SLIDE 2

Towards a self automated CERN Cloud

José Castro León

CERN Cloud Infrastructure

slide-3
SLIDE 3

Who am I? CERN Cloud Team

slide-4
SLIDE 4

Outline

  • Introduction
  • CERN Cloud service
  • Automation status
  • Upcoming challenges
  • Improvement plan
  • Source code
slide-5
SLIDE 5

European Organization for Nuclear Research

  • World's largest particle physics laboratory
  • Founded in 1954
  • 22 member states
  • Fundamental research in physics
slide-6
SLIDE 6

CERN Cloud Service

  • Infrastructure as a Service
  • In production since July 2013
  • CentOS 7 based
  • Geneva and Wigner computer centres
  • Highly scalable architecture with more than 70 nova cells
  • Currently running the Rocky release

slide-7
SLIDE 7


slide-8
SLIDE 8

CERN Cloud Infrastructure – initial offering

IaaS

  • Compute: nova
  • Storage: glance
  • Identity: keystone
  • Web UI: horizon

slide-9
SLIDE 9

CERN Cloud Infrastructure - now

IaaS

  • Compute: nova, ironic
  • Storage: cinder, glance, manila
  • Network: neutron
  • Identity: keystone
  • Web UI: horizon

IaaS+

  • Orchestration: heat
  • Container Orchestration: magnum
  • Automation: mistral
  • Key manager: barbican

slide-10
SLIDE 10

Automation in the CERN Cloud

[Diagram labels: mistral; HR; Resources; cornerstone; collectd; grafana; GNI]

slide-11
SLIDE 11

Back in 2012

[Chart: Run 1 to Run 4; series: GRID, ATLAS, CMS, LHCb, ALICE]

  • LHC computing and data requirements were increasing
  • Constant team size
  • Improve manageability and efficiency
  • Automation
      • Considered early on
      • Exercise it as much as possible

slide-12
SLIDE 12

Situation now

  • 300k core cloud and increasing
      • Addition of new services
      • Continuous improvements on existing ones
  • No change in number of staff
  • Automation is key
      • Keep service knowledge
      • Offload common tasks
      • Simplify management

slide-13
SLIDE 13

Automation in the CERN Cloud @today

  • Resource lifecycle management
  • Host and service monitoring
  • Optimize resource availability
  • Improve VM availability and performance

slide-14
SLIDE 14

Host and Service Monitoring

  • Monitor HW events with Collectd
  • Collect service logs through Flume
  • General Notification Infrastructure
      • Support tickets for repairs
  • Service alarms in Grafana
  • Rundeck jobs
      • Time-scheduled jobs to fix common issues
      • Offload ticket handling
      • Schedule interventions

slide-15
SLIDE 15

RunDeck: Task delegation

[Diagram labels: collectd, GNI]

  • Rely on Rundeck for offloading tasks to different teams
      • Procurement
      • Repair Team
      • Resource Coordinator
      • Cloud Service operations
  • Example: disk replacement
      • Repair Team

slide-16
SLIDE 16

Resource Lifecycle Management

  • Types of projects
  • Provisioning and cleanup in Mistral workflows
      • Service inter-dependencies

[Diagram labels: Affiliation Expired, User Disabled, User Deletion; Shared: Promote; Personal: Stop, Delete]

slide-17
SLIDE 17

Resource Lifecycle Management in detail

  • Set of workbooks interconnected to manage
      • Projects
      • Services

[Diagram labels: project_delete (keystone.project_get, service_delete, keystone.project_delete); service_delete for magnum, barbican, heat, nova, cinder, manila, s3, glance, neutron; mistral]
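The project and service workbooks named on this slide can be pictured as a loop that cleans each service before the project record itself is removed. What follows is a minimal Python sketch under stated assumptions, not the actual Mistral workbook: the service order is simply the order listed on the slide, and `cleanup_service` and `delete_keystone_project` are hypothetical callables standing in for the `service_delete` workflow and the `keystone.project_delete` action.

```python
from typing import Callable, List

# Service order as listed on the slide; assumed here to be the cleanup order.
SERVICES: List[str] = [
    "magnum", "barbican", "heat", "nova", "cinder",
    "manila", "s3", "glance", "neutron",
]


def delete_project(
    project_id: str,
    cleanup_service: Callable[[str, str], None],
    delete_keystone_project: Callable[[str], None],
) -> List[str]:
    """Clean every service for the project, then delete the project itself.

    Both callables are hypothetical stand-ins for the Mistral workflows.
    Returns the ordered list of steps that were executed.
    """
    steps: List[str] = []
    for service in SERVICES:
        cleanup_service(service, project_id)
        steps.append(f"service_delete:{service}")
    # The project record is removed only after all services are clean.
    delete_keystone_project(project_id)
    steps.append("keystone.project_delete")
    return steps
```

Keeping the service list as data mirrors the idea that the workbooks are meant to be extended: adding a service is one more entry rather than a new code path.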

slide-18
SLIDE 18

Resource Lifecycle Management for end user

slide-19
SLIDE 19

Optimize resource availability - Expiration

  • Each VM in a personal project has an expiration date
  • Set shortly after creation and evaluated daily
  • Configured to 180 days and renewable
  • Reminder mails starting 30 days before expiration
  • Implemented as a workbook in Mistral

[Timeline: ACTIVE, then EXPIRED; events: Reminder, Expiration, Deletion]
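The expiration rules on this slide (180 days, daily evaluation, reminders from 30 days before) reduce to simple date arithmetic. The sketch below is illustrative only and is not the Mistral workbook: the function names are hypothetical, and the delay between expiration and deletion is left out because the slide does not state it.

```python
from datetime import date, timedelta

EXPIRATION_DAYS = 180  # lifetime granted at creation or renewal (from the slide)
REMINDER_DAYS = 30     # reminders start this many days before expiration


def initial_expiration(created: date) -> date:
    """Expiration date assigned shortly after the VM is created."""
    return created + timedelta(days=EXPIRATION_DAYS)


def daily_action(expire_at: date, today: date) -> str:
    """Return 'none', 'reminder' or 'expire' for one daily evaluation."""
    if today >= expire_at:
        return "expire"
    if today >= expire_at - timedelta(days=REMINDER_DAYS):
        return "reminder"
    return "none"
```

Under these assumptions, renewing a VM is just calling `initial_expiration` again with the renewal date.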

slide-20
SLIDE 20


Expiration of Personal Instances

slide-21
SLIDE 21

Expiration workbook in detail

  • Based on project expiration tag and expire_at instance attribute

[Diagram labels: daily_expiration_global (retrieve_projects, daily.project_expiration); daily_expiration_project (retrieve_instances, daily.instance_expiration); daily_expiration_instance (check_status, check_expiration, fix_expiration, process_expiration: reminder, expire, delete)]
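The workflow names on this slide suggest a fan-out from a global run to projects and then to instances. The following Python sketch shows that shape under several assumptions: the dictionary layout and the tag value `"expiration"` are hypothetical, and mapping a missing `expire_at` to `fix_expiration` is a guess from the task name, not something the slide states.

```python
from datetime import date
from typing import Dict, Iterable, List, Tuple

EXPIRATION_TAG = "expiration"  # assumed tag value; the slide does not give it


def plan_daily_run(
    projects: Iterable[Dict], today: date
) -> List[Tuple[str, str, str]]:
    """Walk projects, then instances, and decide one action per instance."""
    plan: List[Tuple[str, str, str]] = []
    for project in projects:
        # Only projects carrying the expiration tag are processed.
        if EXPIRATION_TAG not in project.get("tags", []):
            continue
        for instance in project.get("instances", []):
            expire_at = instance.get("expire_at")
            if expire_at is None:
                action = "fix_expiration"      # assumed: attribute missing
            elif expire_at <= today:
                action = "process_expiration"  # assumed: date has passed
            else:
                action = "none"
            plan.append((project["name"], instance["name"], action))
    return plan
```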

slide-22
SLIDE 22

Improve VM availability and performance

  • Hyperconverged servers
      • Compute + storage nodes
      • Local Ceph pool
          • Instances
          • Volumes
      • Ease management
      • Low IO latency
      • Increased disk capacity
      • Use cases:
          • DB and storage services
slide-23
SLIDE 23

Automation in the CERN Cloud @next

  • Add new services
  • Root Cause Analysis
  • Kubernetes jobs
  • Further improve availability and performance

slide-24
SLIDE 24

Continuous addition of new services

  • Project management workbooks are prepared to be extended
  • Latest addition is the S3 service through RadosGW
  • Uses AdminOps API for quota operations
      • python-radosgw-admin
      • python-mistral-radosgw-actions
  • Modify workflows accordingly

disable_user:
  join: all
  action: radosgw.user_update
  input:
    uid: <% $.id %>
    suspended: true
    secret_key: <% $.access_key %>
    access_key: <% $.secret_key %>
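For orientation, the disable_user task above sets `suspended: true` on a RadosGW user. The Python sketch below builds the kind of request that the Ceph Admin Ops "modify user" operation expects, as far as its public documentation describes it (`POST` on the `admin/user` resource with `uid` and `suspended` parameters). It is an assumption-laden illustration: `build_suspend_request` is a hypothetical helper, the endpoint is a placeholder, S3-style request signing is deliberately left out, and this is not how python-radosgw-admin is implemented.

```python
from typing import Tuple
from urllib.parse import urlencode


def build_suspend_request(
    endpoint: str, uid: str, suspended: bool = True
) -> Tuple[str, str]:
    """Return (method, url) for an Admin Ops 'modify user' call.

    Authentication (S3-style request signing) is intentionally omitted.
    """
    params = {
        "format": "json",
        "uid": uid,
        "suspended": "true" if suspended else "false",
    }
    url = f"{endpoint.rstrip('/')}/admin/user?{urlencode(params)}"
    return "POST", url
```

A call such as `build_suspend_request("https://s3.example.org", "alice")` only produces the method and URL; sending it would still require credentials with the appropriate admin capabilities.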

slide-25
SLIDE 25

Root Cause Analysis

  • Find root cause of issues
      • Degradation of response of an application
          • CPU issue? Kernel degradation?
  • Improve alarms with scope
      • Automatically list impacted services
  • Find hidden service dependencies
  • Trigger automatic resolutions
      • Run healing workflows

[Diagram labels: mistral, collectd, vitrage, cloud]

slide-26
SLIDE 26

Kubernetes jobs

  • Moving towards running control plane in Kubernetes
      • Based on Helm charts
      • Healing operations added as jobs
  • All automated tasks in Rundeck can be “dockerized”
  • Rundeck now interfaces with Kubernetes
  • Start moving tasks into jobs
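To make "moving tasks into jobs" concrete, here is a minimal Python sketch that builds a Kubernetes `batch/v1` Job manifest for a single containerised task. The helper name, the image, the command and the `backoffLimit` value are all hypothetical placeholders; the slides do not show the manifests or Helm charts actually used.

```python
from typing import Dict, List


def healing_job(name: str, image: str, command: List[str]) -> Dict:
    """Build a batch/v1 Job manifest that runs one containerised task once."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 2,  # assumed value: retry a failed pod at most twice
            "template": {
                "spec": {
                    # Jobs require a restart policy of Never or OnFailure.
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                }
            },
        },
    }
```

The resulting dictionary can be serialised to JSON or YAML and submitted with the usual Kubernetes tooling; a scheduler such as Rundeck would only need to fill in the name, image and command.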
slide-27
SLIDE 27

Get even more performance

  • Hyperconverged servers
      • Fixed CPU allocation for protecting IO operations
  • Dynamically adjust CPU usage in the setup
      • Keeping free resources for IO
      • Avoid impact on compute
      • Automatic live-migration

[Diagram label: watcher]

slide-28
SLIDE 28

Improve Cloud utilization

[Diagram labels: user VMs, pre; aardvark]

  • Interested in preemptibles: "Preemptible Instances at CERN" on Thursday Nov 15th, 1:40pm, Hall A3

slide-29
SLIDE 29

Improve Cloud utilization

  • Dynamic allocation of preemptible instances

[Diagram labels: user VMs, pre; watcher, aardvark]

slide-30
SLIDE 30


#talk is cheap show me the code

slide-31
SLIDE 31

Here are the links

  • https://gitlab.cern.ch/cloud-infrastructure/
      • cinder, horizon, ironic, keystone, mistral, neutron and nova
      • mistral-workflows
      • mistral-radosgw-actions (python-radosgw-admin)
      • hzrequestspanel
      • cci-scripts
      • cci-tools

slide-32
SLIDE 32

Thank you

  • gitlab.cern.ch/cloud-infrastructure
  • openstack-in-production.blogspot.ch
  • jose.castro.leon@cern.ch
  • @josecastroleon

slide-33
SLIDE 33
slide-34
SLIDE 34

BACKUP SLIDES