WP4 Fabric Management - 3rd EU Review - Maite Barroso - CERN



SLIDE 1

DataGrid is a project funded by the European Commission under contract IST-2000-25182

3rd EU Review – 19-20/02/2004

WP4 Fabric Management

3rd EU Review

Maite Barroso - CERN Maite.Barroso.Lopez@cern.ch

SLIDE 2

Outline

  • Objectives (3’) (Summary of objectives for the whole project)
  • Achievements (5’) (Summary of all useful products)
  • Lessons learned (3’)
  • Future & Exploitation (4’)
  • Questions (10’)

SLIDE 3

WP4: main objective

“To deliver a computing fabric comprised of all the necessary tools to manage a center providing grid services on clusters of thousands of nodes.”

  • User job management (Grid and local)
  • Automated management of large clusters

SLIDE 4

WP4 objective

“To deliver a computing fabric comprised of all the necessary tools to manage a center providing grid services on clusters of thousands of nodes.”

  • User job management (Grid and local)
  • Automated management of large clusters

The development work is divided into 6 subtasks:

  • Configuration Mgt
  • Installation Mgt
  • Monitoring
  • Fault Tolerance
  • Resource Mgt
  • Gridification

SLIDE 5

DataGrid Architecture

[Figure: DataGrid architecture diagram, with side labels Local Computing, Grid and Fabric, and a WP4 marker. Layers and components:]

  • Local Computing: Local Application, Local Database
  • Grid Application Layer: Data Management, Job Management, Metadata Management, Object to File Mapping
  • Collective Services: Information & Monitoring, Replica Manager, Grid Scheduler
  • Underlying Grid Services: Computing Element Services, Authorization Authentication and Accounting, Replica Catalog, Storage Element Services, SQL Database Services, Service Index
  • Fabric services: Configuration Management, Node Installation & Management, Monitoring and Fault Tolerance, Resource Management, Fabric Storage Management

SLIDE 6

WP4 Architecture design and the ideas behind

  • Information model. Configuration is distinct from monitoring
      • Configuration == desired state (what we want)
      • Monitoring == actual state (what we have)
  • Aggregation of configuration information
      • Good experience with LCFG concepts with central configuration template hierarchies
  • Node autonomy. Resolve local problems locally if possible
      • Cache node configuration profile and local monitoring buffer
  • Scheduling of intrusive actions
  • Plug-in authorization and credential mapping
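
The split between desired state (configuration) and actual state (monitoring), together with the aggregation of template hierarchies, can be illustrated with a minimal Python sketch. It is not taken from any WP4 tool: the function names, the template contents and the version strings are all hypothetical, and the merge rule (more specific templates override more general ones) is an assumption.

```python
from copy import deepcopy


def merge(base: dict, override: dict) -> dict:
    """Recursively overlay `override` on `base`; the more specific template wins."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


def build_profile(templates: list[dict]) -> dict:
    """Aggregate a template hierarchy (most general first) into one node profile."""
    profile: dict = {}
    for template in templates:
        profile = merge(profile, template)
    return profile


def flatten(tree: dict, prefix: str = "") -> dict:
    """Turn nested settings into {'a/b': value} so two states compare key by key."""
    flat = {}
    for key, value in tree.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def drift(desired: dict, actual: dict) -> dict:
    """Return {path: (desired, actual)} for every setting where the node deviates."""
    want, have = flatten(desired), flatten(actual)
    return {
        path: (value, have.get(path))
        for path, value in want.items()
        if have.get(path) != value
    }


# Hypothetical template hierarchy: site defaults -> cluster -> single node.
site = {"packages": {"openssh": "3.6"}, "services": {"ntpd": "on"}}
cluster = {"packages": {"batch-client": "1.2"}, "services": {"nfs": "on"}}
node = {"services": {"nfs": "off"}}

# Desired state: what we want (configuration).
desired = build_profile([site, cluster, node])

# Actual state: what we have (as a monitoring agent might report it).
actual = {
    "packages": {"openssh": "3.6", "batch-client": "1.1"},
    "services": {"ntpd": "on", "nfs": "off"},
}

report = drift(desired, actual)
print(report)
```

Keeping the two states in separate structures means the comparison itself stays trivial; what to do about each deviation can then be decided locally on the node or centrally.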

SLIDE 7

Automated management of large clusters

[Figure: the Grid and the fabric. Components shown: Computing Element, RMS, Configuration System, Monitoring System, Fault Tolerance, Installation System]

SLIDE 8

Automated management of large clusters

[Figure: a node and the Configuration System, Monitoring System, Installation System and Fault Tolerance System]

SLIDE 9

Automated management of large clusters

[Figure: a node and the WP4 Fault Tolerance framework]

SLIDE 10

User job management (Grid and local)

[Figure: job path through the Computing Element. Labels recovered from the diagram:]

  • Grid Workload Mgt System (WP1)
  • ComputingElement (CE)
  • LCAS plug-ins: static list, wallclock time, quota check
  • LCMAPS: uid/gid, other tokens
  • Job repository
  • RMS
  • farms
  • StorageElement (SE) (WP5)
  • Legend: Gridification component, WP4 non-gridification, Non-WP4 subsystem, External to fabric, Internal to fabric
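
The diagram labels suggest a two-step gate in front of the local farm: authorization plug-ins (static list, wallclock time, quota check) followed by a mapping of the grid identity to local credentials (uid/gid). The following Python sketch illustrates that pattern only. It is not the LCAS or LCMAPS API: every name, subject string, time window and limit below is hypothetical, and the rule that all plug-ins must agree is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class JobRequest:
    subject: str        # certificate subject of the grid user
    hour: int           # local hour at submission time (0-23)
    running_jobs: int   # jobs this user already has in the farm


@dataclass(frozen=True)
class LocalAccount:
    uid: int
    gid: int


# Hypothetical site policy data.
ALLOWED_SUBJECTS = {"/O=Example/CN=Alice", "/O=Example/CN=Bob"}
OPEN_HOURS = range(8, 20)
MAX_RUNNING_JOBS = 10
ACCOUNT_MAP = {
    "/O=Example/CN=Alice": LocalAccount(uid=5001, gid=500),
    "/O=Example/CN=Bob": LocalAccount(uid=5002, gid=500),
}


def static_list(req: JobRequest) -> bool:
    """Allow only subjects on a fixed list."""
    return req.subject in ALLOWED_SUBJECTS


def wallclock_time(req: JobRequest) -> bool:
    """Allow submissions only inside the site's open hours."""
    return req.hour in OPEN_HOURS


def quota_check(req: JobRequest) -> bool:
    """Refuse users who already reached their job limit."""
    return req.running_jobs < MAX_RUNNING_JOBS


# Plug-ins are plain callables, so a site can add or remove checks freely.
PLUGINS: list[Callable[[JobRequest], bool]] = [static_list, wallclock_time, quota_check]


def authorize(req: JobRequest) -> bool:
    """Authorization step: every plug-in must accept the request."""
    return all(plugin(req) for plugin in PLUGINS)


def map_credentials(req: JobRequest) -> Optional[LocalAccount]:
    """Mapping step: translate the grid identity into a local uid/gid."""
    return ACCOUNT_MAP.get(req.subject)


def accept_job(req: JobRequest) -> Optional[LocalAccount]:
    """Return the local account to run under, or None if the job is refused."""
    if not authorize(req):
        return None
    return map_credentials(req)


print(accept_job(JobRequest("/O=Example/CN=Alice", hour=10, running_jobs=2)))
print(accept_job(JobRequest("/O=Example/CN=Mallory", hour=10, running_jobs=0)))
```

Separating the yes/no decision from the identity mapping keeps each site policy small and independently replaceable.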

SLIDE 11

Achievements

  • Quattor: long-term solution for system installation and configuration; a modular, robust, reliable and scalable system which addresses the needs of large computing clusters
  • LCFGng: interim solution proposed to the EU DataGrid testbed as installation and configuration management toolkit while the final quattor framework was developed
  • Lemon: framework for monitoring of performance, system status and environmental changes for all resources contained in a fabric

SLIDE 12

Achievements

  • RMS: Resource Management System. Its main task is to maintain control over the fabric’s farm resources and to ensure the efficient scheduling and execution of user (grid or local) jobs and their coordination with maintenance tasks
  • Fault Tolerance Framework: framework for automatic fault detection and correction
  • Gridification components: Computing Element, Local Centre Authorization Service (LCAS) and Local Credential Mapping Service (LCMAPS). They provide the mechanism for grid services to access the local fabric services: secure job submission and job control
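
Automatic fault detection and correction, as described for the fault tolerance framework above, can be pictured as rules that pair a detection test on monitoring data with a corrective action. The Python sketch below shows that idea only. It does not reproduce the WP4 framework: the `Rule` class, the metric names, the thresholds and the action strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Metrics = dict[str, float]


@dataclass
class Rule:
    name: str
    is_faulty: Callable[[Metrics], bool]   # detection, evaluated on a monitoring sample
    correct: Callable[[Metrics], str]      # correction, returns a description of the action


def run_rules(metrics: Metrics, rules: list[Rule]) -> list[str]:
    """Evaluate every rule on one sample and collect the corrective actions taken."""
    actions = []
    for rule in rules:
        if rule.is_faulty(metrics):
            actions.append(f"{rule.name}: {rule.correct(metrics)}")
    return actions


# Hypothetical rules; a real actuator would run a command instead of returning text.
RULES = [
    Rule(
        name="tmp-full",
        is_faulty=lambda m: m["tmp_used_pct"] > 90,
        correct=lambda m: "clean /tmp",
    ),
    Rule(
        name="daemon-down",
        is_faulty=lambda m: m["batch_daemon_up"] == 0,
        correct=lambda m: "restart batch daemon",
    ),
]

healthy = {"tmp_used_pct": 40.0, "batch_daemon_up": 1.0}
degraded = {"tmp_used_pct": 97.0, "batch_daemon_up": 1.0}

print(run_rules(healthy, RULES))
print(run_rules(degraded, RULES))
```

Because the rules only read a local monitoring sample, they can run on the node itself, which fits the goal of resolving local problems locally where possible.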

SLIDE 13

Lessons learned

  • Fabric Management components are not grid components themselves, but they are essential for a working grid.
  • Experience with existing tools and prototypes helped to gather requirements and early feedback from users.
  • There is a real need to be able to install, configure and manage the sites:
      • Correctly, to avoid configuration errors that may affect not only the site but the response of the whole grid
      • Automatically, to reduce the workload of system administrators
      • Supporting adaptability, properly managing resource reconfigurations in a fault-tolerant way
      • In a reproducible way

SLIDE 14

Future & Exploitation

  • Are all the WP4 partners committed to continuing support for the WP4 middleware? To be discussed during the workshop.
  • Technical evolution (commitment from partners not needed; could be for whoever wants to work in this field in the future):
      • Gridification components: the components will evolve in the directions set by GGF for authorization and authentication (LCAS: GGF standards for expressing access policies; LCMAPS: support for more services, such as file access using GridFTP, and better OS insulation). Support and extension will be undertaken by EGEE.
      • RMS: evolution towards use for resource management in data-intensive cluster computing. Evolution towards OGSA.
      • LCFGng: no support or evolution after the end of the project.
      • Quattor: some open issues are being tackled by the partners: an overall installation toolkit and comprehensive end-user documentation. Future work on security enhancements (e.g. fine-grained authorization for access to CDB, data encryption). Porting to Solaris 9 and to future RH versions or other Linux distributions.
      • Lemon: displays/GUIs, enhancement of the simple data model, sensors for other platforms (Windows)
      • Fault Tolerance: improvements to rule design (web spider?), user FT API

SLIDE 15

Future & Exploitation

WP4 products have been deployed not only within the EDG testbed, but also within other sites and Grid projects/environments (map of Europe with all the sites?):

  • Virtual laboratory for E-science project (The Netherlands)
  • Fermilab’s Site Authentication and Authorization service (SAZ). This triggered the development of the authorization call-out mechanism within Globus
  • LHC Computing Grid project (LCG)
  • CrossGrid
  • GridIce project
  • CERN Computing Centre (~2000 nodes)
  • Universidad Autonoma de Madrid (Spain)
  • University of Liverpool (UK)
  • NIKHEF (The Netherlands)
  • LAL (Laboratoire de l'Accélérateur Linéaire, Orsay, France)
  • Zuse Institute Berlin (ZIB)

SLIDE 16

Future & Exploitation

An excellent example of WP4 product exploitation by a production site is CERN:

  • The CERN Computer Centre was one of the main sources of WP4 requirements
  • Very close collaboration to test and evaluate some of the WP4 products (Lemon and quattor)
  • After a successful evaluation, they adopted them and made the necessary changes to run them in the production clusters (~2000 nodes)
  • Will support and future evolution be taken over by them?

SLIDE 17

Future & Exploitation

General concepts:

  • Move from testbeds to production fabrics
  • A production fabric has
      • Inertia … as a virtue!
      • Charted QoS
      • Scalability
      • Procedures and Manageability
  • Cautious introduction
      • Retain qualities and add functionality!

SLIDE 18

Prototype

Service Lifecycle Focuses

  • Proliferation, Elaboration
  • Focus on functionality
  • Performance and scalability
  • Simplification, Automation
  • Focus on uniformity, minimisation
  • Process and procedure
  • Availability and reliability
  • Stability and robustness

Production

Risks

  • Destabilisation
  • Workload

SLIDE 19

Questions?
