Fabric Management with ELFms, presented by U. Schwickerath (CERN/IT)


  1. Fabric Management with ELFms
     Presented by U. Schwickerath, CERN/IT

  2. Outline
     • The ELFms framework
     • Quattor
     • Lemon
     • SLS
     • LEAF
     German Cancio, CERN/IT

  3. Fabric Management with ELFms (I)
     ELFms stands for 'Extremely Large Fabric management system'.
     Subsystems:
     • Quattor: configuration, installation and management of nodes
     • Lemon: system / service monitoring
     • LEAF: hardware / state management
     [Diagram labels: Node, Configuration Management, Node Management]
     • ELFms manages and controls most of the nodes in the CERN CC
       • ~4700 nodes out of ~5500, and increasing
       • Nodes vary in function and cluster size (batch nodes, disk servers, tape servers, DB, web, …)
       • Heterogeneous hardware (CPU, memory, HD size, ...)
       • Supported OS: Linux (RHES 3/4, Scientific Linux 3/4, 32/64-bit) and Solaris (RIP..)

  4. Fabric Management with ELFms (II)
     • ELFms (Quattor/Lemon) was started in the scope of EU DataGrid.
     • Development is now coordinated by CERN/IT in collaboration with other HEP institutes.
     • Quattor/Lemon are used in production inside and outside CERN
       • LCG T1/T2 sites, ranging from 50 to 1000 nodes per site
       • Complete configuration of system and LCG Grid middleware via Quattor
       • Integration with Grid services, e.g. monitoring (GridICE, MonALISA)

  5. http://quattor.org

  6. Quattor
     Quattor takes care of the configuration, installation and management of fabric nodes.
     • A Configuration Database holds the 'desired state' of all fabric elements
       • Node setup (CPU, HD, memory, software RPMs/PKGs, network, system services, location, audit info…)
       • Cluster (name and type, batch system, load balancing info…)
     • Autonomous management agents running on the node for
       • Base installation
       • Service (re-)configuration
       • Software installation and management

  7. Architecture
     [Diagram; labels recoverable from the slide:]
     • Configuration server: CDB with SQL backend and XML backend; access via CLI, GUI, scripts, SOAP and SQL
     • XML configuration profiles, served over HTTP
     • SW server(s): SW Repository (RPMs), served over HTTP
     • Install server: Install Manager, base OS, System installer (HTTP / PXE)
     • Managed Nodes: Node Configuration Manager (NCM) with components CompA, CompB, CompC configuring ServiceA, ServiceB, ServiceC; SW Package Manager (SPMA) handling RPMs / PKGs

  8. Configuration Information
     • Configuration is expressed using a language called Pan
       • Information is arranged into templates
       • Common properties set only once
     • Using templates it is possible to create hierarchies to match service structures
     [Diagram: template hierarchy. Root "CERN CC" (name_srv1: 137.138.16.5, time_srv1: ip-time-1); cluster level: lxbatch, lxplus, disk_srv; node level: lxplus001, lxplus020, lxplus029. Cluster-level properties shown: cluster_name: lxbatch, cluster_name: lxplus, master: lxmaster01, pkg_add (lsf5.1). Node-level properties shown: eth0/ip: 137.138.4.246, eth0/ip: 137.138.4.225, pkg_add (lsf6_beta).]
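The idea on slide 8, that common properties are set once and more specific templates override them, can be sketched in a few lines. This is Python, not Pan, and it does not reproduce Pan's actual semantics; the function `resolve` and the variable names are hypothetical, and the assignment of the `lsf6_beta` override to one particular node is an assumption, since the slide text does not preserve which node it belongs to.

```python
# Hypothetical template data, loosely based on the labels in the slide's diagram.
site = {"name_srv1": "137.138.16.5", "time_srv1": "ip-time-1"}
cluster_lxplus = {"cluster_name": "lxplus", "packages": {"lsf": "5.1"}}
node_override = {"packages": {"lsf": "6_beta"}}  # assumed node-level override


def resolve(*templates):
    """Merge templates from most general to most specific; later ones win.

    Nested dictionaries (one level deep) are merged key by key, so a node can
    override a single package without restating the whole package list.
    """
    profile = {}
    for template in templates:
        for key, value in template.items():
            if isinstance(value, dict) and isinstance(profile.get(key), dict):
                # Build a new dict so the source templates are never mutated.
                profile[key] = {**profile[key], **value}
            else:
                profile[key] = value
    return profile


profile = resolve(site, cluster_lxplus, node_override)
print(profile["name_srv1"])        # inherited from the site level
print(profile["cluster_name"])     # inherited from the cluster level
print(profile["packages"]["lsf"])  # overridden at the node level
```

The node ends up with the site-wide name server, the cluster name, and its own LSF version, without any of the shared values being repeated in the node template.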

  9. Quattor Deployment
     • Quattor in complete control of Linux boxes (~4700 nodes, to grow to ~6000-8000 in 2008)
     • CDB holding information of all systems in CERN-CC
     • Over 100 NCM configuration components developed
       • From basic system configuration to Grid services setup… (including desktops)
     • SPMA used for managing all software
       • Security and functional updates (including kernel upgrades)
       • E.g. KDE security upgrade (~300 MB per node) and LSF client upgrade in 30 mins, without service interruption
       • Handles (occasional) downgrades as well
     • Developments ongoing:
       • CDB: fine-grained ACL protection to templates, namespaces, improved SQL backend …
       • Security: deployment of HTTPS instead of HTTP (usage of host certificates)
       • Proxy architecture for enhanced scalability …
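Managing software from a 'desired state' means comparing the configured package list with what is installed and deriving the operations that close the gap. The sketch below illustrates that comparison only; it is not SPMA's implementation, the function `plan_package_changes` is hypothetical, and the package names and versions are invented for the example. Because it replaces any package whose version differs, rather than ordering versions, it covers upgrades and downgrades alike.

```python
def plan_package_changes(desired, installed):
    """Return the operations needed to move `installed` to `desired`.

    Both arguments map package name -> version string.
    """
    ops = []
    for name, version in sorted(desired.items()):
        if name not in installed:
            ops.append(("install", name, version))
        elif installed[name] != version:
            # Covers upgrades and downgrades: only inequality is tested.
            ops.append(("replace", name, installed[name], version))
    # Anything installed but not listed in the desired state is removed.
    for name in sorted(installed.keys() - desired.keys()):
        ops.append(("remove", name, installed[name]))
    return ops


# Invented example data.
desired = {"kdelibs": "3.5.4", "lsf-client": "6.1", "openssl": "0.9.7a"}
installed = {"kdelibs": "3.5.1", "lsf-client": "6.2", "telnet-server": "0.17"}

for op in plan_package_changes(desired, installed):
    print(op)
```

Running the plan against an already compliant node yields an empty list, which is what makes repeated runs safe.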

  10. Proxy server setup
      [Diagram; labels recoverable from the slide:]
      • Backend: server cluster M, M' ("Master") holding installation images, RPMs, configuration profiles
      • Frontend: L1 proxies, DNS-load balanced, HTTP
      • L2 proxies ("Head" nodes H), one per rack: Rack 1, Rack 2, … Rack N
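The scalability argument for the layered proxy setup on slide 10 is that a file requested by many nodes in a rack reaches the backend only once. A minimal sketch of that behaviour follows, assuming simple read-through caching at each tier; the class `Tier`, the tier names and the profile path are hypothetical, and real HTTP proxies also deal with expiry and invalidation, which are ignored here.

```python
class Tier:
    """One layer in the chain: head node -> L1 proxy -> backend."""

    def __init__(self, name, upstream=None, content=None):
        self.name = name
        self.upstream = upstream
        self.cache = dict(content or {})
        self.upstream_fetches = 0  # how often this tier had to ask upstream

    def get(self, path):
        if path not in self.cache:
            if self.upstream is None:
                raise KeyError(path)  # the backend itself does not have it
            self.upstream_fetches += 1
            self.cache[path] = self.upstream.get(path)
        return self.cache[path]


# Hypothetical content and topology.
backend = Tier("backend", content={"/profiles/node1.xml": "<profile/>"})
l1_proxy = Tier("l1-proxy", upstream=backend)
rack_head = Tier("rack1-head", upstream=l1_proxy)

# Three nodes in the same rack request the same profile.
for _ in range(3):
    rack_head.get("/profiles/node1.xml")

print(rack_head.upstream_fetches, l1_proxy.upstream_fetches)
```

Only the first request travels up the chain; the other two are answered from the head node's cache.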

  11. Quattor outside CERN
      • Many sites (a dozen, including LAL/IN2P3, NIKHEF, DESY, ...) adopt Quattor as fabric management framework…
        • See Quattor tool survey: quattor.org/documentation/misc/feedback-poll-0605.htm
      • … leading to improved core software robustness and completeness
        • Identified and removed site dependencies and assumptions
        • Documentation, installation guides, bug tracking, release cycles
      • Components available for a fully automated LCG configuration

  12. http://cern.ch/lemon

  13. Lemon: LHC Era Monitoring
      [Diagram; labels recoverable from the slide:]
      • Nodes: Monitoring Agent with Sensors, sending samples over TCP/UDP
      • Monitoring Repository with SQL backend, accessed via SOAP
      • Correlation Engines
      • RRDTool / PHP behind apache, serving HTTP to a Web browser
      • User Workstations: Lemon CLI, Web browser

  14. Deployment and Enhancements
      • Smooth production running of Monitoring Agent and Oracle-based repository at CERN-CC
        • ~400 metrics, sampled at intervals from 30 s to 1 day; ~2 GB of data per day on ~4500 nodes
      • Usage outside CERN-CC, collaborations
        • GridICE (>100 LCG sites)
        • CMS-Online
        • IN2P3
        • INFN/CNAF
        • Others…
      • Correlation and Fault Recovery
        • Light-weight local self-healing module (e.g. /tmp cleanup, restart daemons)
      • Security for sample transport (TCP and UDP) (BARC)
      • Status and performance visualization pages …
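The volume figures on slide 14 can be turned into a per-node rate with simple arithmetic. The calculation below takes the slide's approximate numbers at face value and assumes that "~2 GB" means decimal gigabytes, so the results are order-of-magnitude estimates only.

```python
nodes = 4500                 # "~4500 nodes" from the slide
bytes_per_day = 2 * 10**9    # "~2 GB of data / day", assumed decimal GB
seconds_per_day = 24 * 60 * 60

per_node_per_day = bytes_per_day / nodes
per_node_per_second = per_node_per_day / seconds_per_day

print(f"{per_node_per_day / 1000:.0f} kB per node per day")
print(f"{per_node_per_second:.1f} bytes per node per second")
```

This prints about 444 kB per node per day, or roughly 5 bytes per node per second.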

  15. Monitoring the Fabric
      Using a web-based status display:
      • CC Overview

  16. Monitoring the Fabric
      Using a web-based status display:
      • CC Overview
      • Clusters and nodes

  17. Monitoring the Fabric
      Using a web-based status display:
      • CC Overview
      • Clusters and nodes
      • VOs

  18. Monitoring the Fabric
      Using a web-based status display:
      • CC Overview
      • Clusters and nodes
      • VOs
      • Power

  19. Monitoring the Fabric
      Using a web-based status display:
      • CC Overview
      • Clusters and nodes
      • VOs
      • Power
      • Error trending

  20. Monitoring the Fabric
      Using a web-based status display:
      • CC Overview
      • Clusters and nodes
      • VOs
      • Power
      • Error trending
      • Batch system

  21. LAS (Lemon Alarm System)
      • Alarm system for operators
      • Allows round-the-clock (24/7) operators to receive, acknowledge, ignore, hide and process alarms received via Lemon
      • Recently put in production at CERN, replacing the legacy SURE system

  22. Quattor-LEMON integration
      Quattor and Lemon are tightly integrated at CERN
      • Note however that Quattor and Lemon have no mutual dependencies!
      • Configuration of Lemon Agent and Server:
        • CDB holds definitions of all sensors, metric classes, and metric instances
        • An NCM component (ncm-fmonagent) generates the Agent config file
        • Another NCM component updates the Oracle Server configuration
      • Configuration of Lemon Web Pages:
        • Information on what clusters exist, and what nodes belong to which cluster, is extracted from CDBSQL

  23. Quattor-LEMON integration (II)
      • Visualization of Quattor configuration
        • Indexed CDB templates available, linked to node and cluster status pages
        • XML profiles display
      • Alarm generation
        • E.g. generate an alarm if the configured kernel version differs from the actual one
      • Visualization of CC equipment
        • Geometry of CC (racks, robots, etc.)
        • Location of each node in the CC (which rack)
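The kernel alarm mentioned on slide 23 comes down to comparing a configured value with a measured one. The sketch below shows that comparison in isolation; it is not a Lemon sensor, the function `kernel_alarm` and the version strings are hypothetical, and in a real setup the configured version would come from the node's profile rather than a literal.

```python
import platform


def kernel_alarm(configured, actual=None):
    """Return an alarm message if the running kernel differs, else None.

    If `actual` is not given, the running kernel release is read from the
    local system via platform.release().
    """
    if actual is None:
        actual = platform.release()
    if configured != actual:
        return f"kernel mismatch: configured {configured}, running {actual}"
    return None


# Invented version strings for illustration.
print(kernel_alarm("2.6.9-42.EL", actual="2.6.9-42.EL"))
print(kernel_alarm("2.6.9-42.EL", actual="2.6.9-34.EL"))
```

The first call returns `None` (no alarm); the second returns a message naming both versions, which is the information an operator needs to decide whether a reboot is pending.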

  24. SLS
      http://cern.ch/sls

  25. SLS (Service Level Status)
      • Service-based views (user/management perspective)
      • Synoptic view of which services are running, and how well; appropriate for end users and managers
      • http://cern.ch/sls
      • See screenshots on the next slides

  26. SLS
      Using a web-based status display:
      • (Meta-)Services Overview

  27. SLS
      Using a web-based status display:
      • (Meta-)Services Overview
      • Drilling down to one meta-service

  28. SLS
      Using a web-based status display:
      • (Meta-)Services Overview
      • Drilling down to one meta-service
      • More details: Tier-1 sites

  29. SLS
      Using a web-based status display:
      • (Meta-)Services Overview
      • Drilling down to one meta-service
      • More details: Tier-1 sites
      • A specific Tier-1 site: Availability history

  30. SLS
      Using a web-based status display:
      • (Meta-)Services Overview
      • Drilling down to one meta-service
      • More details: Tier-1 sites
      • A specific Tier-1 site: Availability history
      • Service-specific information

  31. SLS
      Using a web-based status display:
      • (Meta-)Services Overview
      • Drilling down to one meta-service
      • More details: Tier-1 sites
      • A specific Tier-1 site: Availability history
      • Service-specific information
      • Other entry views: What services users are interested in
