Fabric Management with ELFms
Presented by U. Schwickerath – CERN/IT
German Cancio – CERN/ IT - n° 2
Outline
The ELFms framework
Quattor
Lemon
SLS
LEAF
Fabric Management with ELFms (I)
ELFms stands for ‘Extremely Large Fabric management system’
Subsystems:
ELFms manages and controls most of the nodes in the CERN CC
~4700 nodes out of ~5500, and increasing
Multiple functionality and cluster size (batch nodes, disk servers, tape servers, DB, web, …)
Heterogeneous hardware (CPU, memory, HD size, ...)
Supported OS: Linux (RHES3/4, Scientific Linux 3/4 – 32/64bit) and Solaris (RIP..)
Fabric Management with ELFms (II)
via Quattor
http://quattor.org
Quattor
Quattor takes care of the configuration, installation and management
A Configuration Database holds the ‘desired state’ of all fabric elements
(… services, location, audit info…)
Autonomous management agents running on the node for
Architecture
Configuration server: CDB with SQL backend and XML backend; accessed by CLI, GUI and scripts (SOAP, SQL); serves XML configuration profiles over HTTP
Install server: Install Manager and system installer; base OS installed over HTTP / PXE
SW server(s): SW Repository serving RPMs over HTTP
Managed nodes: Node Configuration Manager (NCM) with components (CompA, CompB, CompC) configuring services (ServiceA, ServiceB, ServiceC); SW Package Manager (SPMA) handling RPMs / PKGs
Configuration Information
Configuration is expressed using a language called Pan
Information is arranged into templates
Common properties set only once
Using templates it is possible to create hierarchies to match service structures
CERN CC
  name_srv1: 137.138.16.5
  time_srv1: ip-time-1
  lxbatch
    cluster_name: lxbatch
    master: lxmaster01
    pkg_add (lsf5.1)
  lxplus
    cluster_name: lxplus
    pkg_add (lsf5.1)
    lxplus001
      eth0/ip: 137.138.4.246
      pkg_add (lsf6_beta)
    lxplus020
      eth0/ip: 137.138.4.225
    lxplus029
  disk_srv
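The template hierarchy on this slide can be read as layers, where each level sets or overrides values for the levels beneath it. The following is a minimal Python sketch of that idea. It is not Pan, and it is not Quattor's actual resolution algorithm. The `resolve` function, the dictionary layout, and the split of `lsf5.1` / `lsf6_beta` into a package name plus a version are assumptions made for illustration.

```python
def resolve(*levels):
    """Merge template levels from most general to most specific.

    Assumption: a more specific level simply overrides a value with the
    same key set by a more general one.
    """
    profile = {"properties": {}, "packages": {}}
    for level in levels:
        profile["properties"].update(level.get("properties", {}))
        profile["packages"].update(level.get("packages", {}))
    return profile


# Values taken from the slide; the name/version split is hypothetical.
cern_cc = {"properties": {"name_srv1": "137.138.16.5", "time_srv1": "ip-time-1"}}
lxplus = {"properties": {"cluster_name": "lxplus"}, "packages": {"lsf": "5.1"}}
lxplus001 = {
    "properties": {"eth0/ip": "137.138.4.246"},
    "packages": {"lsf": "6_beta"},
}
lxplus020 = {"properties": {"eth0/ip": "137.138.4.225"}}

profile_001 = resolve(cern_cc, lxplus, lxplus001)
profile_020 = resolve(cern_cc, lxplus, lxplus020)
```

Under these assumptions, lxplus001 ends up with the beta LSF version while lxplus020 keeps the cluster default, and both inherit the name and time servers that are set once at the top.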
Quattor Deployment
Quattor in complete control of Linux boxes (~4700 nodes, to grow to ~6000-8000 in 2008)
CDB holding information of all systems in CERN-CC
Over 100 NCM configuration components developed
From basic system configuration to Grid services setup… (including desktops)
SPMA used for managing all software
security and functional updates (including kernel upgrades)
… 30 mins, without service interruption
Handles (occasional) downgrades as well
Developments ongoing:
CDB: Fine-grained ACL protection to templates, namespaces, improved SQL backend …
Security: Deployment of HTTPS instead of HTTP (usage of host certificates)
Proxy architecture for enhanced scalability …
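SPMA is described here as managing all software on a node, including upgrades and occasional downgrades. One way to picture this is as a comparison between a desired package list and what is currently installed. The sketch below is a simplified model under that assumption, not SPMA's implementation. The `plan_changes` function, the tuple version format, the removal of unlisted packages, and the package names and versions are all hypothetical.

```python
def plan_changes(desired, installed):
    """Return the operations needed to move `installed` to `desired`.

    Both arguments map a package name to a version tuple, e.g. (5, 1).
    Assumption: packages absent from `desired` are removed.
    """
    operations = []
    for name, version in sorted(desired.items()):
        current = installed.get(name)
        if current is None:
            operations.append(("install", name, version))
        elif version > current:
            operations.append(("upgrade", name, version))
        elif version < current:
            operations.append(("downgrade", name, version))
    for name in sorted(installed.keys() - desired.keys()):
        operations.append(("remove", name, installed[name]))
    return operations


# Hypothetical node state and target configuration.
desired = {"kernel": (2, 6, 9), "lsf": (5, 1)}
installed = {"kernel": (2, 4, 21), "lsf": (6, 0), "oldtool": (1, 0)}
plan = plan_changes(desired, installed)
```

In this example the plan upgrades the kernel, downgrades lsf and removes the unlisted package, so a downgrade is just another difference between the desired and the installed state.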
Proxy server setup
Backend (“Master”): M, M’
Frontend: server cluster of L1 proxies, reached via DNS-load balanced HTTP
L2 proxies (“Head” nodes, H) for Rack 1, Rack 2, … Rack N
Served content: installation images, RPMs, configuration profiles
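The proxy layout on this slide can be sketched as a chain: a node asks the head node of its rack, the head node asks a frontend (L1) proxy, and the frontend asks the backend. The minimal Python model below assumes that each proxy caches what it has fetched; the slide does not state the caching policy. The class names, the file path, the number of racks and the request counts are illustrative only.

```python
class Backend:
    """Stands in for the backend ("Master") server holding the files."""

    def __init__(self, files):
        self.files = files
        self.requests = 0  # how many requests reached the backend

    def get(self, path):
        self.requests += 1
        return self.files[path]


class CachingProxy:
    """Fetches a path from its upstream once, then serves it from cache."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.cache = {}

    def get(self, path):
        if path not in self.cache:
            self.cache[path] = self.upstream.get(path)
        return self.cache[path]


backend = Backend({"/rpms/example.rpm": "rpm-bytes"})  # hypothetical file
l1_proxy = CachingProxy(backend)
head_nodes = [CachingProxy(l1_proxy) for _ in range(3)]  # three racks (assumed)

# 40 nodes in each rack request the same file through their head node.
for head in head_nodes:
    for _ in range(40):
        head.get("/rpms/example.rpm")
```

With caching at both levels, the 120 node requests in this model result in a single request to the backend, which is the kind of load reduction a proxy hierarchy is meant to provide.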
Quattor outside CERN
Many sites (a dozen, including LAL/IN2P3, NIKHEF, DESY, ...) adopt quattor as fabric management framework…
See Quattor tool survey: quattor.org/documentation/misc/feedback-poll-0605.htm
… leading to improved core software robustness and completeness
Identified and removed site dependencies and assumptions
Documentation, installation guides, bug tracking, release cycles
Components available for a fully automated LCG configuration
http://cern.ch/lemon
Lemon – LHC Era Monitoring
Nodes: Monitoring Agent with Sensors; samples sent to the Monitoring Repository over TCP/UDP
Monitoring Repository: repository backend (SQL), accessed via SOAP
Correlation Engines (SOAP)
User workstations: Lemon CLI (SOAP); web browser over HTTP to RRDTool / PHP pages served by apache
Deployment and Enhancements
Smooth production running of Monitoring Agent and Oracle-based repository at CERN-CC
~400 metrics sampled at intervals from 30 s to 1 day; ~2 GB of data / day on ~4500 nodes
Usage outside CERN-CC, collaborations
GridICE (> 100 LCG sites), CMS-Online, IN2P3, INFN/CNAF, others…
Correlation and Fault Recovery
Light-weight local self-healing module (e.g. /tmp cleanup, restart daemons)
Security for sample transport (TCP and UDP) (BARC)
Status and performance visualization pages …
Monitoring the Fabric
Using a web-based status display:
CC Overview
Clusters and nodes
VO’s
Power
Error trending
Batch system
LAS (Lemon Alarm System)
Alarm system for operators (LAS, Lemon Alarm System)
Allows operators working 24 hours a day, 7 days a week to receive, acknowledge, ignore, hide and process alarms received via Lemon
Recently put in production at CERN, replacing the legacy SURE system
Quattor-LEMON integration
Quattor and Lemon are tightly integrated at CERN
Note however that Quattor and Lemon have no mutual dependencies!
Configuration of Lemon Agent and Server:
CDB holds definitions of all sensors, metric classes, and metric instances
An NCM component (ncm-fmonagent) generates the Agent config file
Another NCM component updates the Oracle Server configuration
Configuration of Lemon Web Pages:
Information on what clusters exist, and what nodes belong to which cluster, is extracted from CDBSQL
Quattor-LEMON integration (II)
Visualization of Quattor configuration
Indexed CDB templates available, linked to node and cluster status pages
XML profiles display
Alarm generation
E.g. generate an alarm if the configured kernel version differs from the actual one
Visualization of CC equipment
Geometry of CC (racks, robots, etc)
Location of each node in the CC (what rack)
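The kernel alarm example on this slide amounts to comparing a value from the configuration with a value measured on the node. The following Python sketch shows that comparison only. It does not use any Quattor or Lemon API; the function name `kernel_mismatch_alarm`, the alarm text and the version strings are hypothetical, and the running kernel is read with the standard library call `platform.release()` when no value is passed in.

```python
import platform


def kernel_mismatch_alarm(configured, running=None):
    """Return an alarm message if the configured and running kernels differ.

    `configured` would come from the node's configuration profile;
    `running` defaults to the kernel release reported by the OS.
    """
    if running is None:
        running = platform.release()
    if configured != running:
        return (
            f"ALARM: configured kernel {configured} "
            f"differs from running kernel {running}"
        )
    return None


# Hypothetical version strings, passed explicitly so the result is predictable.
ok = kernel_mismatch_alarm("2.6.9-42", running="2.6.9-42")
alarm = kernel_mismatch_alarm("2.6.9-42", running="2.4.21-47")
```

A mismatch typically means that a kernel upgrade has been configured but the node has not yet been rebooted into it.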
http://cern.ch/sls
SLS (Service Level Status)
Service based views (user/mgmt perspective)
Synoptic view of which services are running and how well; appropriate for end users and managers
http://cern.ch/sls
Using a web-based status display:
(Meta-)Services Overview
Drilling down to one meta-service
More details: Tier-1 sites
A specific Tier-1 site: Availability history
Service-specific information
Other entry views: What services users are interested in
Can be used for any kind of service
Service availability and status
Service fully (100%) available
Service 95% available, still marked as fully available
Service 87% available, marked as affected
Service 50% available, marked as degraded
Service 13% available, marked as not available
Service info expired, update not available
Different status thresholds mean different status for services with the same availability
(more at http://cern.ch/SLS/help.php)
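The mapping from availability to status listed on this slide can be sketched as a threshold lookup. The Python function below is an illustration only, not the SLS implementation. The function name, the status label used for expired information, and the default thresholds (90, 70, 40) are assumptions, chosen so that the five examples on the slide produce the statuses shown there.

```python
def service_status(availability, thresholds=(90, 70, 40), expired=False):
    """Map an availability percentage to a status label.

    `thresholds` holds the lower bounds for "available", "affected" and
    "degraded", in that order. The default values are hypothetical.
    """
    if expired:
        return "info expired"
    available, affected, degraded = thresholds
    if availability >= available:
        return "available"
    if availability >= affected:
        return "affected"
    if availability >= degraded:
        return "degraded"
    return "not available"


# The examples from the slide, using the assumed default thresholds.
statuses = {pct: service_status(pct) for pct in (100, 95, 87, 50, 13)}

# The same 87% availability under different (also hypothetical) thresholds.
lenient = service_status(87, thresholds=(80, 60, 30))
```

With the defaults, 87% is reported as affected, while the more lenient thresholds report the same 87% as available. This is the effect described by the last point on the slide.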
http://cern.ch/leaf
LEAF - LHC Era Automated Fabric
LEAF is a collection of workflows for high level node hardware and state management, on top of Quattor and LEMON:
HMS (Hardware Management System):
Track systems through all physical steps in lifecycle, e.g. installation, moves, vendor calls, retirement
Automatically requests installs, retires etc. to technicians
GUI to locate equipment physically
HMS implementation is CERN specific (based on Remedy workflows), but concepts and design should be generic
SMS (State Management System):
Automated handling (and tracking of) high-level configuration steps
Use Case: Move rack of machines
Participants: Node, HMS, NW DB, SMS, Quattor CDB, ServiceMgr, Technicians
LEAF Deployment
HMS in full production for all nodes in CC
HMS heavily used during CC node migration (~1500 nodes)
SMS in production for all quattor managed nodes
Current work:
More automation, and handling of other HW types for HMS
More service specific SMS clients (e.g. tape & disk servers)
Developing ‘asset management’ GUI (CCTracker) -> BARC
Multiple select, drag&drop nodes to automatically initiate HMS moves and SMS operations
Interface to LEMON GUI
Managing the Fabric
Visualize, locate and manage CC objects using high-level workflows
Visualize
physical location of equipment
properties
Initiate and track workflows on hardware and services
e.g. add/remove/retire operations, update properties, kernel and OS upgrades, etc.
Summary
ELFms is deployed in production at CERN
Stabilized results from 5 years of development within EDG and LCG
Established technology - from Prototype to Production
Consistent full-lifecycle management and high automation level
Providing real added value for day-to-day operations
Quattor, LEMON and SLS are generic software
Other projects and sites getting involved
Site-specific workflows and “glue scripts” can be put on top for smooth integration with existing fabric environments
LEAF HMS and SMS
More information:
http://cern.ch/elfms
Contacts
Quattor: German Cancio Melia
Lemon: Miroslav Siket
SLS: Sebastian Lobienski
LEAF: Bill Tomlin