SLIDE 1

Fabric Management with ELFms

Presented by U. Schwickerath – CERN/IT

SLIDE 2

German Cancio – CERN/IT – n° 2

Outline

The ELFms framework

Quattor, Lemon, SLS, LEAF
SLIDE 3

Fabric Management with ELFms (I)

ELFms stands for 'Extremely Large Fabric management system'. Subsystems:

  • Quattor: configuration, installation and management of nodes
  • Lemon: system / service monitoring
  • LEAF: hardware / state management

ELFms manages and controls most of the nodes in the CERN CC:

~ 4700 nodes out of ~ 5500 – and increasing! Multiple functionalities and cluster sizes (batch nodes, disk servers, tape servers, DB, web, …)

Heterogeneous hardware (CPU, memory, HD size, …). Supported OS: Linux (RHES 3/4, Scientific Linux 3/4 – 32/64 bit) and Solaris (RIP…)
SLIDE 4

Fabric Management with ELFms (II)

  • Quattor/Lemon are used in production in and outside CERN
  • LCG T1/T2 sites, ranging from 50 to 1000 nodes/site
  • Complete configuration of the system and the LCG Grid middleware via Quattor
  • Integration with Grid services, e.g. monitoring (GridICE, MonALISA)
  • ELFms (Quattor/Lemon) was started in the scope of the EU DataGrid project.
  • Development is now coordinated by CERN/IT in collaboration with other HEP institutes
SLIDE 5

http://quattor.org

SLIDE 6

Quattor

Quattor takes care of the configuration, installation and management of fabric nodes

A Configuration Database holds the 'desired state' of all fabric elements:

  • Node setup (CPU, HD, memory, software RPMs/PKGs, network, system services, location, audit info…)
  • Cluster (name and type, batch system, load balancing info…)

Autonomous management agents run on each node for:

  • Base installation
  • Service (re-)configuration
  • Software installation and management
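The 'desired state' idea above can be sketched as a convergence loop: an agent compares the node's actual state against the profile pulled from the CDB and applies only the differences. This is an illustrative sketch – the dict-based profiles are invented; real Quattor profiles are written in Pan, and the work is split between NCM components and the SPMA:

```python
# Sketch of desired-state convergence, the model behind NCM and SPMA.
# The dict "profiles" below are invented for illustration; Quattor's
# real profiles are written in Pan and compiled to XML.
desired = {"packages": {"lsf": "5.1", "kernel": "2.6.9-55"},
           "services": {"sshd": "on", "httpd": "off"}}
actual = {"packages": {"lsf": "4.2", "kernel": "2.6.9-55"},
          "services": {"sshd": "on", "httpd": "on"}}

def converge(desired, actual):
    """Return the actions needed to bring `actual` in line with `desired`."""
    actions = []
    for pkg, ver in desired["packages"].items():
        if actual["packages"].get(pkg) != ver:
            actions.append(f"install {pkg}-{ver}")   # SPMA's job
    for svc, state in desired["services"].items():
        if actual["services"].get(svc) != state:
            actions.append(f"set {svc} {state}")     # an NCM component's job
    return actions

print(converge(desired, actual))  # → ['install lsf-5.1', 'set httpd off']
```

Running the loop again after the actions have been applied would return an empty list – the node has converged to its desired state.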
SLIDE 7

Architecture

[Architecture diagram] A configuration server hosts the CDB (edited via CLI, GUI and scripts over SOAP; SQL and XML backends) and serves XML configuration profiles over HTTP. An install server delivers the base OS via HTTP/PXE (system installer plus install manager). SW servers expose a SW repository of RPMs over HTTP. On the managed nodes, the Node Configuration Manager (NCM) runs components (CompA, CompB, CompC) that configure the corresponding services, and the SW Package Manager (SPMA) installs the RPMs/PKGs.

SLIDE 8

Configuration Information

Configuration is expressed using a language called Pan. Information is arranged into templates, so common properties are set only once. Using templates it is possible to create hierarchies that match service structures:

CERN CC (name_srv1: 137.138.16.5, time_srv1: ip-time-1)
 ├─ lxbatch (cluster_name: lxbatch, master: lxmaster01, pkg_add(lsf5.1))
 ├─ lxplus (cluster_name: lxplus, pkg_add(lsf5.1))
 │   ├─ lxplus001 (eth0/ip: 137.138.4.246, pkg_add(lsf6_beta))
 │   ├─ lxplus020 (eth0/ip: 137.138.4.225)
 │   └─ lxplus029
 └─ disk_srv
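The hierarchy above can be modelled as nested overrides: each template starts from its parent and sets only what differs. A minimal sketch with plain dicts (real Quattor templates are written in Pan, not Python; the `lsf` and `ip` keys are illustrative):

```python
# Template inheritance as in the lxplus example: a child inherits the
# parent's properties and overrides or adds only its own.
def derive(parent, **overrides):
    child = dict(parent)      # start from the parent template
    child.update(overrides)   # apply the child's own settings
    return child

cern_cc = {"name_srv1": "137.138.16.5", "time_srv1": "ip-time-1"}
lxplus = derive(cern_cc, cluster_name="lxplus", lsf="5.1")
# lxplus001 overrides the cluster-wide LSF version with a beta:
lxplus001 = derive(lxplus, ip="137.138.4.246", lsf="6_beta")
lxplus020 = derive(lxplus, ip="137.138.4.225")

print(lxplus001["name_srv1"])  # inherited from CERN CC: 137.138.16.5
print(lxplus001["lsf"])        # overridden locally: 6_beta
print(lxplus020["lsf"])        # cluster default: 5.1
```

This is why common properties are "set only once": changing `name_srv1` at the CERN CC level propagates to every node template that does not override it.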

SLIDE 9

Quattor Deployment

Quattor is in complete control of the Linux boxes (~ 4700 nodes, to grow to ~ 6000-8000 in 2008)

CDB holds information on all systems in the CERN CC. Over 100 NCM configuration components have been developed, from basic system configuration to Grid services setup (including desktops).

SPMA is used for managing all software:

  • Security and functional updates (including kernel upgrades)
  • E.g. a KDE security upgrade (~ 300 MB per node) and an LSF client upgrade in 30 minutes, without service interruption
  • Handles (occasional) downgrades as well

Developments ongoing:

  • CDB: fine-grained ACL protection of templates, namespaces, improved SQL backend…
  • Security: deployment of HTTPS instead of HTTP (usage of host certificates)
  • Proxy architecture for enhanced scalability…

SLIDE 10

Proxy server setup

[Diagram] A backend ("master") server M, with replica M', feeds the frontend over DNS-load-balanced HTTP: L1 proxies form a server cluster, and L2 proxies ("head" nodes, H) sit in each rack (Rack 1, Rack 2, …, Rack N). The proxies distribute installation images, RPMs and configuration profiles.

SLIDE 11

Quattor outside CERN

Many sites (a dozen, including LAL/IN2P3, NIKHEF, DESY, …) adopt Quattor as their fabric management framework (see the Quattor tool survey: quattor.org/documentation/misc/feedback-poll-0605.htm)

… leading to improved core software robustness and completeness:

Identified and removed site dependencies and assumptions. Documentation, installation guides, bug tracking, release cycles.

Components are available for a fully automated LCG configuration.

SLIDE 12

http://cern.ch/lemon

SLIDE 13

Lemon – LHC Era Monitoring

[Architecture diagram] On each node, a Monitoring Agent runs sensors and forwards samples over TCP/UDP to the Monitoring Repository (SQL repository backend). Correlation engines and the Lemon CLI query the repository via SOAP; users view status from their workstations in a web browser (RRDTool / PHP on Apache, over HTTP).

SLIDE 14

Deployment and Enhancements

Smooth production running of the Monitoring Agent and the Oracle-based repository at the CERN CC:

~ 400 metrics sampled every 30 s to 1 d; ~ 2 GB of data / day from ~ 4500 nodes

Usage outside the CERN CC, collaborations:

GridICE (> 100 LCG sites), CMS-Online, IN2P3, INFN/CNAF, others…

Correlation and fault recovery:

Light-weight local self-healing module (e.g. /tmp cleanup, restarting daemons)

Security for sample transport (TCP and UDP) (BARC). Status and performance visualization pages…
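A quick back-of-envelope check of the numbers above. The per-sample storage size is an assumption made for this sketch, not a figure from the talk:

```python
# Rough consistency check of "~400 metrics, 30s-1d sampling, ~2 GB/day
# on ~4500 nodes". The 50-byte average sample size is an assumption.
NODES = 4500
METRICS_PER_NODE = 400
DATA_PER_DAY_BYTES = 2e9
SAMPLE_BYTES = 50  # assumed average stored size of one sample

samples_per_day = DATA_PER_DAY_BYTES / SAMPLE_BYTES
per_metric_per_day = samples_per_day / (NODES * METRICS_PER_NODE)
avg_interval_min = 24 * 60 / per_metric_per_day

print(f"~{per_metric_per_day:.0f} samples per metric per day")
print(f"average sampling interval ~{avg_interval_min:.0f} min")
# An average interval of about an hour is consistent with a mix of
# 30-second and once-per-day metrics skewed towards the slow ones.
```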

SLIDE 15

Monitoring the Fabric

Using a web-based status display:

CC Overview

SLIDE 16

Monitoring the Fabric

Using a web-based status display:

CC Overview, Clusters and nodes

SLIDE 17

Monitoring the Fabric

Using a web-based status display:

CC Overview, Clusters and nodes, VOs

SLIDE 18

Monitoring the Fabric

Using a web-based status display:

CC Overview, Clusters and nodes, VOs, Power

SLIDE 19

Monitoring the Fabric

Using a web-based status display:

CC Overview, Clusters and nodes, VOs, Power, Error trending

SLIDE 20

Monitoring the Fabric

Using a web-based status display:

CC Overview, Clusters and nodes, VOs, Power, Error trending, Batch system

SLIDE 21

LAS (Lemon Alarm System)

An alarm system for operators: allows 24/7 operators to receive, acknowledge, ignore, hide and process alarms received via Lemon.

Recently put into production at CERN, replacing the old legacy SURE system.

SLIDE 22

Quattor-LEMON integration

Quattor and Lemon are tightly integrated at CERN. Note, however, that Quattor and Lemon have no mutual dependencies!

Configuration of the Lemon Agent and Server:

CDB holds the definitions of all sensors, metric classes, and metric instances. An NCM component (ncm-fmonagent) generates the Agent configuration file; another NCM component updates the Oracle server configuration.

Configuration of the Lemon web pages:

Information on which clusters exist, and which nodes belong to which cluster, is extracted from CDBSQL.

SLIDE 23

Quattor-LEMON integration (II)

Visualization of the Quattor configuration:

Indexed CDB templates available, linked to the node and cluster status pages. XML profiles display.

Alarm generation:

E.g. generate an alarm if the configured kernel version differs from the actual one.

Visualization of CC equipment:

Geometry of the CC (racks, robots, etc.); location of each node in the CC (which rack).
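The kernel-version alarm can be sketched as a simple comparison. In the real setup the desired version comes from the node's Quattor profile and the actual one from the running system (e.g. `uname -r`); both are passed in here so the sketch stays self-contained, and the function and node names are invented:

```python
# Sketch of the "configured vs. actual kernel" alarm described above.
def kernel_alarm(node, desired, actual):
    """Return an alarm message if the running kernel differs, else None."""
    if actual != desired:
        return f"ALARM {node}: running {actual}, configured {desired}"
    return None

# A mismatch raises an alarm; a matching kernel stays quiet:
print(kernel_alarm("lxb0001", "2.6.9-55.ELsmp", "2.6.9-42.ELsmp"))
print(kernel_alarm("lxb0002", "2.6.9-55.ELsmp", "2.6.9-55.ELsmp"))  # → None
```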
SLIDE 24

http://cern.ch/sls

SLS

SLIDE 25

SLS (Service Level Status)

Service-based views (user / management perspective): a synoptic view of which services are running, and how – appropriate for end users and managers.

http://cern.ch/sls – see the screenshots on the next slides.
SLIDE 26

Using a web-based status display:

(Meta-)Services Overview

SLS

SLIDE 27

Using a web-based status display:

(Meta-)Services Overview; Drilling down to one meta-service

SLS

SLIDE 28

Using a web-based status display:

(Meta-)Services Overview; Drilling down to one meta-service; More details: Tier-1 sites

SLS

SLIDE 29

Using a web-based status display:

(Meta-)Services Overview; Drilling down to one meta-service; More details: Tier-1 sites; A specific Tier-1 site: Availability history

SLS

SLIDE 30

Using a web-based status display:

(Meta-)Services Overview; Drilling down to one meta-service; More details: Tier-1 sites; A specific Tier-1 site: Availability history; Service-specific information

SLS

SLIDE 31

Using a web-based status display:

(Meta-)Services Overview; Drilling down to one meta-service; More details: Tier-1 sites; A specific Tier-1 site: Availability history; Service-specific information; Other entry views: which services users are interested in

SLS

SLIDE 32

Using a web-based status display:

(Meta-)Services Overview; Drilling down to one meta-service; More details: Tier-1 sites; A specific Tier-1 site: Availability history; Service-specific information; Other entry views: which services users are interested in; Can be used for any kind of service

SLS

SLIDE 33

Service availability and status

  • Service fully (100%) available
  • Service 95% available, still marked as fully available (above the highest threshold)
  • Service 87% available, marked as affected (below the highest threshold)
  • Service 50% available, marked as degraded (below the medium threshold)
  • Service 13% available, marked as not available (below the lowest threshold)
  • Service info expired, update not available

Different status thresholds mean a different status for services with the same availability (more at http://cern.ch/SLS/help.php)
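The threshold logic above can be written as a small mapping function. The numeric threshold values here are assumptions for this sketch; in SLS each service defines its own thresholds, which is exactly why the same availability can yield different statuses for different services:

```python
# Sketch of the SLS availability -> status mapping. The default
# thresholds (90/60/20) are assumed values chosen to reproduce the
# slide's examples, not SLS's actual defaults.
def sls_status(availability, high=90.0, medium=60.0, low=20.0):
    """Map an availability percentage to an SLS-like status."""
    if availability >= high:
        return "fully available"   # at or above the highest threshold
    if availability >= medium:
        return "affected"          # below the highest threshold
    if availability >= low:
        return "degraded"          # below the medium threshold
    return "not available"         # below the lowest threshold

# The slide's examples, with the assumed thresholds:
for pct in (100, 95, 87, 50, 13):
    print(f"{pct:3d}% -> {sls_status(pct)}")
```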

SLIDE 34

http://cern.ch/leaf

SLIDE 35

LEAF – LHC Era Automated Fabric

LEAF is a collection of workflows for high-level node hardware and state management, on top of Quattor and LEMON:

HMS (Hardware Management System):

Tracks systems through all physical steps in their lifecycle, e.g. installation, moves, vendor calls, retirement. Automatically requests installs, retirements etc. from technicians. GUI to locate equipment physically. The HMS implementation is CERN-specific (based on Remedy workflows), but the concepts and design should be generic.

SMS (State Management System):

Automated handling (and tracking) of high-level configuration steps
  • Reconfigure and reboot all cluster nodes for a new kernel and/or a physical move
  • Drain and reconfigure nodes for diagnosis / repair operations
Issues all necessary (re)configuration commands via Quattor. Extensible framework – plug-ins for site-specific operations are possible.
SLIDE 36

Use Case: Move a rack of machines

[Sequence diagram; actors: Node, HMS, NW DB, SMS, Quattor CDB, ServiceMgr, Technicians]

  • 1. New location
  • 2. Set to standby
  • 3. Update
  • 4. Refresh
  • 5. Take out of production
      • Close queues and drain jobs
      • Disable alarms
  • 6. Request move
  • 7a. Update
  • 7b. Update
  • 9. Install work order
  • 10. Set to production
  • 11. Update
  • 12. Refresh
  • 13. Put into production
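The SMS side of the steps above (standby, drain, and the return to production) can be sketched as a small node state machine. The states, method names and the job counter are invented for illustration; the real SMS drives Quattor reconfigurations and batch-queue operations rather than a Python object:

```python
# Illustrative node state machine for the rack-move use case above.
class Node:
    def __init__(self, name, running_jobs=3):
        self.name = name
        self.state = "production"
        self.alarms_enabled = True
        self.jobs = running_jobs  # pretend batch jobs still running

    def take_out_of_production(self):
        """Steps 2-5: standby, close queues, drain jobs, disable alarms."""
        self.state = "standby"
        while self.jobs:          # drain: wait for running jobs to finish
            self.jobs -= 1
        self.alarms_enabled = False

    def put_into_production(self):
        """Steps 10-13: reconfigure (via Quattor), re-enable alarms."""
        self.alarms_enabled = True
        self.state = "production"

node = Node("lxb5432")
node.take_out_of_production()
print(node.state, node.jobs)  # → standby 0
node.put_into_production()
print(node.state)             # → production
```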
SLIDE 37

LEAF Deployment

HMS is in full production for all nodes in the CC, and was heavily used during the CC node migration (~ 1500 nodes). SMS is in production for all Quattor-managed nodes.

Current work:

More automation, and handling of other HW types for HMS. More service-specific SMS clients (e.g. tape & disk servers).

Developing an 'asset management' GUI (CCTracker) -> BARC:
  • Multiple select, drag & drop nodes to automatically initiate HMS moves and SMS operations
  • Interface to the LEMON GUI
SLIDE 38

Managing the Fabric

Visualize, locate and manage CC objects using high-level workflows

Visualize

physical location of equipment
SLIDE 39

Managing the Fabric

Visualize, locate and manage CC objects using high-level workflows

Visualize

physical location of equipment; properties
SLIDE 40

Managing the Fabric

Visualize, locate and manage CC objects using high-level workflows

Visualize

physical location of equipment; properties

Initiate and track workflows on hardware and services, e.g. add / remove / retire operations, update properties, kernel and OS upgrades, etc.

SLIDE 41

Summary

ELFms = Quattor + LEMON + LEAF

ELFms is deployed in production at CERN:

Stabilized results from five years of development within EDG and LCG. Established technology – from prototype to production. Consistent full-lifecycle management and a high level of automation, providing real added value for day-to-day operations.

Quattor, LEMON and SLS are generic software; other projects and sites are getting involved.

Site-specific workflows and "glue scripts" (LEAF HMS and SMS) can be put on top for smooth integration with existing fabric environments.

More information:

http://cern.ch/elfms

SLIDE 42

Contacts

  • Quattor: German Cancio Melia
  • Lemon: Miroslav Siket
  • SLS: Sebastian Lobienski
  • LEAF: Bill Tomlin