From Hire to Retire!

Server Life-cycle Management with Ironic at CERN

Arne Wiebalck & Surya Seetharaman

CERN Cloud Infrastructure Team


CERN and CERN IT in 1 Minute ...


➔ Understand the mysteries of the universe!

➢ Large Hadron Collider
➢ 100 m underground
➢ 27 km circumference
➢ 4 main detectors
➢ Cathedral-sized
➢ O(10) GB/s
➢ Initial reconstruction
➢ Permanent storage
➢ World-wide distribution



The 2½ CERN IT Data Centers

➢ Meyrin (CH): ~12’800 servers
➢ Budapest (HU): ~2’200 servers
➢ LHCb Point-8 (FR): ~800 servers



OpenStack Deployment in CERN IT

In production since 2013!

  • 8’500 hosts with ~300k cores
  • ~35k instances in ~80 cells
  • 3 main regions (+ test regions)
  • Wide use case spectrum
  • Control plane a use case as well

Ironic controllers are VMs on compute nodes, which are themselves physical instances that Nova created in Ironic ...

[Timeline: OpenStack releases from Essex to Train, tracking the evolution of the CERN Production Infrastructure]


Ironic & CERN’s Ironic Deployment

[Diagram: Ironic services: api (behind httpd), conductor, inspector]

➢ 3x Ironic controllers
➢ in a bare metal cell (1 CN for 3’500 nodes!)
➢ currently on Stein++



CERN’s Ironic Deployment: Node Growth

➢ New deliveries

  • Ironic-only

➢ Data center repatriation

  • adoption imminent

➢ Scaling issues

  • power sync
  • resource tracker
  • image conversion



Server Life-cycle Management with Ironic

[Life-cycle diagram: preparation (racks, power, network) → physical installation → registration → health check → burn-in → benchmark → configure → provision → repair → adopt → retire → physical removal; steps are marked as in production, work in progress, or planned]


The Early Life-cycle Steps ...

[Diagram: the early steps registration, health check, burn-in, and benchmark, marked as in production, work in progress, or planned]

➔ Registration: currently done with a CERN-only auto-registration image

➢ could move to Ironic, unclear if we want to do this

➔ Health check: currently done by “manually” checking the inventory

➢ should move to Ironic’s introspection rules (S3)

➔ Burn-in: will become a set of cleaning steps “burnin-{cpu,disk,memory,network}” (see the sketch after this list)

➢ rely on standard tools like badblocks
➢ stops upon failure

➔ Benchmark: will become a cleaning step “benchmark”

➢ launches a container which will know what to do
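
A hedged sketch of how these burn-in steps could be triggered via manual cleaning with openstacksdk; the step names and their interface follow the slide and are assumptions, not a released Ironic API:

    # Minimal sketch: trigger manual cleaning with the planned burn-in
    # steps. Step names ("burnin-cpu", ...) follow the slide and are
    # assumptions; the interface they land on may differ.
    import openstack

    conn = openstack.connect(cloud='cern')  # hypothetical clouds.yaml entry

    clean_steps = [
        {'interface': 'deploy', 'step': 'burnin-cpu'},
        {'interface': 'deploy', 'step': 'burnin-disk'},
        {'interface': 'deploy', 'step': 'burnin-memory'},
        {'interface': 'deploy', 'step': 'burnin-network'},
    ]

    # Manual cleaning runs the steps in order and stops upon failure;
    # the node must be in the 'manageable' state.
    conn.baremetal.set_node_provision_state('NODE_UUID', 'clean',
                                            clean_steps=clean_steps)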


Configure: Clean-time Software RAID


➔ Vast majority of our 15’000 nodes rely on Software RAID

➢ Redundancy & space

➔ The lack of support in Ironic required re-installations

➢ Additional installation based on user-provided kickstart file
➢ Other deployments do have similar constructs for such setups

➔ With the upstream team, we added Software RAID support

➢ Available in Train
➢ *Initial* support
➢ In analogy to hardware RAID, implemented as part of ‘manual’ cleaning
➢ In-band via the Ironic Python Agent

➢ Set the raid_interface
➢ Set the target_raid_config
➢ Trigger manual cleaning
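
A hedged sketch of these three steps with openstacksdk, plus one raw REST call for the target_raid_config (openstacksdk may not expose it directly); the node UUID, cloud name, and RAID layout are placeholders:

    # Minimal sketch of the three steps above. Node UUID and the RAID
    # layout are placeholders; the node must be 'manageable' for
    # manual cleaning.
    import openstack
    from keystoneauth1 import adapter

    conn = openstack.connect(cloud='cern')  # hypothetical clouds.yaml entry
    node = 'NODE_UUID'

    # (1) Select the in-band (agent) Software RAID implementation.
    conn.baremetal.update_node(node, raid_interface='agent')

    # (2) Describe the desired setup, e.g. two Software RAID-1 devices,
    # via PUT /v1/nodes/{node}/states/raid.
    ironic_api = adapter.Adapter(session=conn.session,
                                 service_type='baremetal',
                                 default_microversion='1.55')
    ironic_api.put(f'/v1/nodes/{node}/states/raid',
                   json={'logical_disks': [
                       {'size_gb': 100, 'raid_level': '1',
                        'controller': 'software'},
                       {'size_gb': 'MAX', 'raid_level': '1',
                        'controller': 'software'},
                   ]})

    # (3) Trigger manual cleaning to (re-)create the configuration.
    conn.baremetal.set_node_provision_state(node, 'clean', clean_steps=[
        {'interface': 'raid', 'step': 'delete_configuration'},
        {'interface': 'raid', 'step': 'create_configuration'},
    ])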



Configure: Clean-time Software RAID

➔ How does Ironic configure the RAID?

[Sequence diagram, node in ‘manageable’ state before and after: (1) the operator triggers manual cleaning, (2) Ironic gets the target_raid_config, (3) the node boots the Ironic Python Agent (IPA), (4) Ironic passes the clean steps (& triggers the bootloader install during deploy), (5) the IPA configures the RAID]



Configure: Clean-time Software RAID

[Disk layout diagram: two holder disks, /dev/sda and /dev/sdb, each with an MBR; /dev/sda1 + /dev/sdb1 form the md device /dev/md0 (RAID-1), /dev/sda2 + /dev/sdb2 form /dev/md1 (RAID-N); /dev/md0p{1,2} serve as deploy device and config drive, /dev/md1 is the “payload” device]



Configure: Clean-time Software RAID

➔ What about GPT/UEFI and disk selection?

➢ Initial implementation uses MBR, BIOS, and a fixed partition for the root file system ...
➢ GPT works (needed mechanism to find root fs)
➢ UEFI will require additional work … ongoing!
➢ Disk selection not yet possible … ongoing!

➔ How to repair a broken RAID?

➢ “broken” == “broken beyond repair” (!= degraded)
➢ Do it the cloudy way: delete the instance!
➢ At CERN: {delete,create}_configuration steps are part of our custom hardware manager (sketched below)
➢ What about ‘nova rebuild’?
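
For illustration, a hedged sketch of such a custom hardware manager against the Ironic Python Agent’s hardware manager API; the class name and the step bodies are illustrative, not CERN’s actual code:

    # Illustrative sketch of a custom IPA hardware manager exposing
    # RAID clean steps; not CERN's actual code.
    from ironic_python_agent import hardware

    class CernSoftwareRaidHardwareManager(hardware.HardwareManager):
        HARDWARE_MANAGER_NAME = 'CernSoftwareRaidHardwareManager'
        HARDWARE_MANAGER_VERSION = '1.0'

        def evaluate_hardware_support(self):
            # Claim precedence over the generic manager.
            return hardware.HardwareSupport.SERVICE_PROVIDER

        def get_clean_steps(self, node, ports):
            # Advertise the steps; priority 0 means manual cleaning only.
            return [{'step': 'delete_configuration', 'interface': 'raid',
                     'priority': 0, 'reboot_requested': False,
                     'abortable': True},
                    {'step': 'create_configuration', 'interface': 'raid',
                     'priority': 0, 'reboot_requested': False,
                     'abortable': True}]

        def delete_configuration(self, node, ports):
            pass  # stop the md devices and wipe the RAID superblocks

        def create_configuration(self, node, ports):
            pass  # partition the holder disks and assemble the md devices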

[Graph: cleaning 400 nodes triggered the creation of their Software RAIDs]



Provision: The Instance Life-cycle

Why Ironic?

➔ Single interface for virtual and physical resources
➔ Same request approval workflow and tools
➔ Satisfies requests where VMs are not apt
➔ Consolidates the accounting

[Provision state diagram: available → deploying → active → deleting → cleaning → back to available; consumed by OpenStack hypervisors and other users]
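
Because Ironic sits behind Nova, a physical instance is requested like any VM; a hedged openstacksdk sketch, with the cloud, flavor, image, and network names as placeholders:

    # Minimal sketch: physical instances are requested through Nova
    # like VMs. Flavor, image and network names are placeholders; the
    # flavor is assumed to map to the nodes' resource class.
    import openstack

    conn = openstack.connect(cloud='cern')  # hypothetical clouds.yaml entry

    server = conn.compute.create_server(
        name='physical-worker-001',
        flavor_id=conn.compute.find_flavor('p1.metal').id,
        image_id=conn.image.find_image('cc7-base').id,
        networks=[{'uuid': conn.network.find_network('cern-main').id}],
    )
    # Nova schedules onto an 'available' Ironic node and deploys it.
    server = conn.compute.wait_for_server(server, wait=1200)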



Provision Physical Instances as Hypervisors

[Diagram: regions ‘Services’ and ‘Batch’. In the ‘bare metal’ cell, a nova-compute with the Ironic driver maps Ironic node ‘abc’ to compute node CN=’abc’ with resource provider RP=’abc’, tracked by the resource tracker and backing the physical instance ‘def’. In a ‘compute’ cell, a nova-compute with the virt driver runs on CN=’xyz’ (RP=’xyz’) and hosts virtual instances such as ‘pqr’. Physical instances provisioned from the bare metal cell become the hypervisors of the compute cells.]



Adopt: “Take over the world!”

➔ Ironic provides “adoption” of nodes, but this does not include instance creation!
➔ Our procedure to adopt production nodes into Ironic (sketched below):

➢ Enroll the node, including its resource_class (now in ‘manageable’)
➢ Set fake drivers for this node in Ironic
➢ Provide the node (now in ‘available’)
➢ Create the port in Ironic (usually created by inspection)
➢ Let Nova discover the node
➢ Add the node to the placement aggregate
➢ Wait for the resource tracker
➢ Create the instance in Nova (with a flavor matching the above resource_class)
➢ Set the real drivers and interfaces in Ironic
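
A hedged openstacksdk sketch of the Ironic-side steps; the driver names, MAC address, and resource class are placeholders, and the Nova/placement steps are left as comments:

    # Minimal sketch of the Ironic-side adoption steps; driver names,
    # MAC and resource class are placeholders.
    import openstack

    conn = openstack.connect(cloud='cern')  # hypothetical clouds.yaml entry

    # Enroll with a fake driver so no BMC is touched during adoption.
    node = conn.baremetal.create_node(name='adopted-node-001',
                                      driver='fake-hardware',
                                      resource_class='BM_LARGE')
    conn.baremetal.set_node_provision_state(node, 'manage')   # -> manageable
    conn.baremetal.set_node_provision_state(node, 'provide')  # -> available

    # The port is usually created by inspection; create it by hand here.
    conn.baremetal.create_port(node_id=node.id,
                               address='aa:bb:cc:dd:ee:ff')

    # ... let Nova discover the node, add it to the placement aggregate,
    # wait for the resource tracker, create the instance with a flavor
    # matching the resource class ...

    # Finally, switch from the fake driver to the real one.
    conn.baremetal.update_node(node, driver='ipmi')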




Repair: Even with Ironic, nodes do break!

➔ The OpenStack team does not directly intervene on nodes

➢ Dedicated repair team (plus workflow framework based on Rundeck & Mistral)

➔ Incidents: scheduled vs. unplanned

➢ BMC firmware upgrade campaign vs. motherboard failure

➔ Introduction of Ironic required workflow adaptations, training, and ...

➢ New concepts like “physical instance”
➢ Reinstallations

➔ … upstream code changes in Ironic and Nova

➢ power synchronisation (the “root of all evil”)



Repair: The Nova / Ironic Power Sync (Problem Statement)

[Animation, frame 1: a power outage hits the data center; the physical instance goes down, and both the Ironic database and the Nova database move from POWER_ON to POWER_OFF]



[Animation, frame 2: the physical instance comes back up; the Ironic database moves from POWER_OFF to POWER_ON while the Nova database still says POWER_OFF, so POWER_ON != POWER_OFF ????]



[Animation, frame 3: Nova forces a power state update to reflect what it feels is right, and the Ironic database moves back from POWER_ON to POWER_OFF]



➔ Unforeseen events like a power outage:

➢ Physical instance goes down

  • Nova puts the instance into SHUTDOWN state

○ through the ``_sync_power_states`` periodic task
○ hypervisor regarded as the source of truth

➢ Physical instance comes back up without Nova knowing

  • Nova again puts the instance back into SHUTDOWN state

○ through the ``_sync_power_states`` periodic task
○ database regarded as the source of truth

Nova should not force the instance to POWER_OFF when it comes back up



Repair: The Nova / Ironic Power Sync (Implemented Solution)

[Animation, frame 1: the power outage takes the physical instance down; Ironic moves from POWER_ON to POWER_OFF and sends a power update event with target_power_state = POWER_OFF to Nova via os-server-external-events, so the databases agree: POWER_OFF == POWER_OFF]



[Animation, frame 2: the physical instance comes back up; Ironic moves from POWER_OFF to POWER_ON and sends a power update event with target_power_state = POWER_ON via os-server-external-events, so again POWER_ON == POWER_ON]

There can be race conditions depending on the sequence of occurrence of events.


Ironic sends power state change callbacks to Nova:

➔ Operator has to set the [nova].send_power_notifications config option to True
➔ JSON request body sent from Ironic
➔ Done via the os-server-external-events Nova API
➔ Read the power_update spec and documentation for more details

{ "events": [ { "name": "power-update", "server_uuid": "3df201cf-2451-44f2-8d25-a4ca826fc1f3", "tag": target_power_state } ] }


➔ JSON response body sent from Nova
➔ Nova updates its database to reflect the power state change
➢ Ironic regarded as the source of truth



{ "events": [ { "code": 200, "name": "power-update", "server_uuid": "3df201cf-2451-44f2-8d25-a4ca826fc1f3", "status": "completed", "tag": target_power_state } ] }

Available from … [release logo]



Retire: The End of the Cycle

➔ A simple procedure for now (sketched below)

➢ Deleting instances triggers cleaning
➢ Setting maintenance avoids instance creation
➢ The maintenance reason marks them for removal
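
A hedged openstacksdk sketch of this procedure; the instance and node identifiers are placeholders:

    # Minimal sketch of the retirement procedure; instance and node
    # identifiers are placeholders.
    import openstack

    conn = openstack.connect(cloud='cern')  # hypothetical clouds.yaml entry

    # Deleting the instance sends the node through automated cleaning.
    conn.compute.delete_server('INSTANCE_UUID')

    # Maintenance keeps the node out of scheduling; the reason marks
    # it for physical removal.
    conn.baremetal.set_node_maintenance(
        'NODE_UUID', reason='retirement: remove after 2020-06')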

➔ No explicit “node retirement” support in Ironic

➢ Time windows allow for re-use
➢ Explicit tagging would be helpful

➔ Proposals to introduce a retirement flag (or even state)

➢ Spec: “Add support for node retirement” https://review.opendev.org/#/c/656799/


A Pain Point: Resource Tracking


➔ The resource tracker loops sequentially over compute nodes

➢ Holds a semaphore and hence blocks instance creation (see the toy sketch at the end of this section)
➢ OK for virtual machines, a scaling issue for bare metal deployments

➔ This creates a noticeable dead time for instance creation

➢ For 1 compute node with 3’500 physical nodes the turn-around time is ~60 mins
➢ … this is already with upstream and local patches to reduce the overhead

➔ Stop-gap solution: adapt the resource tracker cycle

➢ Compromise between blockage and placement updates

➔ Potential solutions: n-compute sharding and per-instance locking
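
To illustrate the blocking, a toy sketch (not Nova’s actual code) of the periodic update loop and an instance build contending for one lock:

    # Toy sketch (not Nova's actual code): a periodic resource-update
    # loop and an instance build contending for the same lock.
    import threading
    import time

    COMPUTE_RESOURCE_SEMAPHORE = threading.Lock()

    def update_available_resource(nodes):
        # Periodic task: refreshes every node under one lock.
        with COMPUTE_RESOURCE_SEMAPHORE:
            for node in nodes:
                time.sleep(0.01)  # stand-in for per-node inventory work

    def build_instance(node):
        # A build needs the same lock to claim resources, so it waits
        # for the whole loop; with thousands of nodes this is the
        # "dead time" described above.
        with COMPUTE_RESOURCE_SEMAPHORE:
            print(f'claimed resources on {node}')

    t = threading.Thread(target=update_available_resource,
                         args=(range(3500),))
    t.start()
    time.sleep(0.1)              # let the update loop grab the lock first
    build_instance('node-0042')  # blocks until the whole loop finishes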



Summary

➔ At CERN, Ironic is used for the majority of a server’s life-cycle steps
➔ We are working on the remaining steps and more features, and we plan to pass 10k servers!

[Chart: power consumption in the container data center during the allocation of a new delivery, annotated with (partial) burn-in, nodes available, physical instances created, and virtual instances created]

謝謝啦

Thank you!