SLIDE 1

Challenges for Grids

Markus Schulz CERN IT GD LCG/EGEE

SLIDE 2

Disclaimer

  • All views expressed are mine and are not necessarily shared by the projects or organizations that I am associated with
    – Don’t blame: EGEE, LCG, CERN…
    – Critique, flames, and the like should be directed to:
      • Markus.schulz@cern.ch
SLIDE 3

Approach

  • Thinking a few years ahead
    – Based on what we know
    – Ignoring problems like:
      • software quality (far from perfect)
      • lack of fabric management on sites
      • site admins’ fear of losing total control
    – Focused on structural problems:
      • Make production grids work at the required scale
      • Expand the systems to other domains (industry, micro-VOs, …)
      • Move closer to the grid vision
SLIDE 4

Babylonian Confusion

  • What is called “Grid” covers:
    – Standalone clusters
    – Clusters for scaling a single service
    – Intra-organizational clusters
      • With central administrative control
    – Community computing
      • SETI@home, BOINC
    – I. Foster’s definition <------- This is what I will use:
      • “Coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations.”
      • “On-demand, ubiquitous access to computing, data, and services.”

SLIDE 5

The Dangers of Success

  • Early success
    – Constraints from existing infrastructures
      • Users depend on them
    – Research ---> production transition is very hard
    – Restricts standardization
      • The curse of backwards compatibility
      • Example: EGEE, WLCG, OSG, ARC
        – > 70 VOs

SLIDE 6

EGEE:
> 190 sites, 40 countries
> 24,000 processors, ~ 5 PB storage
~ 70 Virtual Organizations

[Charts: EGEE Grid Sites, Q1 2006. Growth in the number of sites (Apr 2004 to Dec 2005) and in the number of CPUs (Apr 2004 to Feb 2006).]

SLIDE 7

EGEE Operations

  • Grid operator on duty
    – 6 teams working in weekly rotation
      • CERN, IN2P3, INFN, UK/I, Ru, Taipei
    – Crucial in improving site stability and management
    – Expanding to all ROCs in EGEE-II
  • Operations coordination
    – Weekly operations meetings
    – Regular ROC managers meetings
    – Series of EGEE Operations Workshops
      • Nov 04, May 05, Sep 05, June 06
  • Geographically distributed responsibility for operations:
    – There is no “central” operation
    – Tools are developed/hosted at different sites:
      • GOC DB (RAL), SFT (CERN), GStat (Taipei), CIC Portal (Lyon)
  • Procedures described in the Operations Manual
    – Introducing new sites
    – Site downtime scheduling
    – Suspending a site
    – Escalation procedures
    – etc.

SLIDE 8

Use of the infrastructure

[Charts: number of jobs/day, total and non-LCG (Jan 2005 to Apr 2006); CPU time delivered in SI2K-hours/month and in CPU-years/month (Jun 2005 to Apr 2006) for the VOs alice, atlas, biomed, cms, geant4, and lhcb.]

Sustained & regular workloads of > 30K jobs/day
  • spread across the full infrastructure
  • doubling/tripling in the last 6 months, with no effect on operations
  • will increase to at least 150K jobs/day in the next 18 months

SLIDE 9

Use of the infrastructure

  • Massive data transfers > 1.5 GB/s
  • Several applications now depend on EGEE as their primary computing resource
  • Sustainability:
    – Usage can (and does) grow without need for additional operational effort
SLIDE 10

A global, federated e-Infrastructure

EGEE infrastructure:
  • ~ 200 sites in 39 countries
  • ~ 20,000 CPUs
  • > 5 PB storage
  • > 35,000 concurrent jobs per day
  • > 80 Virtual Organisations

[Map: related infrastructures: EUIndiaGrid, EUMedGrid, SEE-GRID, EELA, BalticGrid, EUChinaGrid, OSG, NAREGI]

SLIDE 11

OSG – Currently ~20,000 Jobs/Day

[Chart: OSG jobs per day by VO: CDF, ATLAS, CMS, GLOW, STAR, D0]

SLIDE 12

This all looks very promising….

  • But…
    – Interoperation between grids
      • Lack of standardization
      • Several larger sites have to support multiple interfaces
    – Managing diversity inside grids
      • OS versions
        – Applications are sensitive and sites have preferences
        – Sites and users move independently
      • Batch systems
        – Each requires extensive work to interface
        – Limited to the smallest set of shared functionality (see the sketch after this list)
          » Frustrates users AND resource managers
          » Lack of standardization
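To make the “smallest set of shared functionality” point concrete, here is a minimal sketch of an abstraction layer over several batch systems. It is not any project’s actual API; the class and method names are invented for illustration. The point it shows: only operations that every backend supports can appear in the common interface, so backend-specific features are lost behind it.

```python
from abc import ABC, abstractmethod

class BatchSystem(ABC):
    """Hypothetical least-common-denominator interface over batch systems.

    Only operations that *every* backend (PBS, LSF, Condor, ...) provides
    can appear here; anything a single backend lacks (array jobs, advance
    reservations, fine-grained priorities, ...) cannot be exposed.
    """

    @abstractmethod
    def submit(self, job_script: str) -> str:
        """Submit a job description; return an opaque job id."""

    @abstractmethod
    def status(self, job_id: str) -> str:
        """Return a coarse state: 'queued', 'running', or 'done'."""

    @abstractmethod
    def cancel(self, job_id: str) -> None:
        """Cancel the job; the only control operation all backends share."""
```

Each additional batch system added to the grid shrinks, or at best preserves, this common surface, which is why both users and resource managers end up frustrated.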

SLIDE 13

More problems….

  • Storage, DBs…
    – Different storage management systems are established
      • HSMs, disk pools with shared file systems
    – Different security and storage models; lack of standards
  • VO management
    – Creation of a VO is straightforward
    – Getting access to resources requires:
      • Negotiation with resource providers
      • Significant effort by sites to host an additional VO
    – Accounting, dynamic prioritization, and quotas are problematic
      • on a global level (between different VOs)
      • inter-VO
      • constrained by national privacy laws
    – No market for resources

SLIDE 14

More problems….

  • Achievable reliability is limited
    – The more complex services have to interact, the higher the probability that the overall service fails (a back-of-envelope calculation follows below)
      • ‘Russian Doll Performance Sink’, here: file open
    – Applies to many services
  • Grid interfaces need to be native interfaces
    – STANDARDS

[Diagram: a file open traversing nested service layers (GFAL, SRM, MSS); information system interactions are left out]
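The “Russian Doll” effect can be quantified with a back-of-envelope calculation: when a file open has to traverse several services in series, the end-to-end success rate is the product of the per-service rates. The 99% figures below are assumptions for illustration only.

```python
import math

# Assumed per-service success rates for one file open passing through a
# chain of services (e.g. client library, catalogue, storage manager,
# mass storage system); 0.99 each is an illustrative guess.
rates = [0.99, 0.99, 0.99, 0.99]

end_to_end = math.prod(rates)               # product of the chain
print(f"end-to-end success: {end_to_end:.3f}")  # ~0.961
print(f"failure rate: {1 - end_to_end:.1%}")    # ~3.9% of opens fail
```

Four individually respectable services already lose roughly one open in twenty-five, which is the argument for native grid interfaces instead of additional wrapper layers.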

SLIDE 15

State of Standardization

  • First round of tentative standards
    – Mostly based on research work
      • Missed the deployment- and operations-related parts
    – Production grids started with ‘de facto standards’
    – Now: OGSA
      • Much more detailed, recycles established standards
      • But: additional layers, and old services will be wrapped!!!

Diagram from Globus Alliance

SLIDE 16

[Diagram (from the Globus Alliance): OGSA service families: Context Services, Information Services, Infrastructure Services, Security Services, Resource Mgmt Services, Execution Mgmt Services, Data Services, and Self Mgmt Services; individual services include policy mgmt, VO mgmt, access, integration, transfer, replication, boundary traversal, integrity, authorization, authentication, WSRF, WSN, WSDM, event mgmt, monitoring, discovery, job mgmt, logging, execution planning, workflow mgmt, workload mgmt, provisioning, execution, deployment, configuration, reservation, naming, heterogeneity mgmt, service level attainment, QoS mgmt, and optimization]

SLIDE 17

Relevant Specifications

Grid computing, distributed computing, and utility computing are different views of the same important problem domain.

[Diagram: map of relevant specifications across systems management, utility computing, and grid computing: core services and the Base Profile, HTTP(S)/SOAP, WSDL, WS-Addressing, WS-Security, WS-Base Notification, WSRF-RP/-RL/-RAP, WSDM, CIM/JSIM, SAML/XACML, X.509, OGSA-EMS, ByteIO, WS-DAI, GFD-C.16, GGF-UR data model; cross-cutting concerns include naming, discovery, trust, privacy, information, VO management, and data transport; use cases & applications include distributed query processing, ASP, data centre, collaboration, multimedia, and persistent archive]

SLIDE 18

Is there Hope?

  • Diversity on the OS level
    – Virtualization is making progress (Xen, …)
  • Experience-based standardization
    – Information systems, etc.
  • Interoperation efforts start to influence standardization
  • Core services start to work on native grid interfaces
    – DBs, batch systems, storage
    – Still in an early state, but with huge potential
      • Solid, well-managed standards are needed
      • Otherwise a wrapper is the ‘best’ solution
SLIDE 19

Detailed ‘Solvable’ Problem 1

  • Easy introduction and destruction of VOs is at the core of the grid vision
  • We can ease the config work, but access to resources is still based on negotiations
    – The N*M problem (see the sketch below)
  • For VOs and resource providers a system is needed for:
    – Trading resources (resource against resource, or money)
    – Managing global priorities
    – Managing priorities between different groups inside a VO
    – And the same for quotas
    – Needed for: CPU, storage, and bandwidth
    – Has to be dynamic and leave control with the resource owners
    – For oil and frozen orange juice the problem has been solved….
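A quick illustration of the N*M problem, using roughly the scale quoted earlier in this talk (~70 VOs, ~200 sites): bilateral negotiation grows with the product of the two sides, while a common trading or brokering system that each party integrates with once grows only with their sum.

```python
vos = 70         # virtual organizations (order of magnitude from EGEE)
providers = 200  # resource-providing sites

# Every VO negotiating separately with every resource provider:
print(vos * providers)   # 14000 bilateral agreements

# A shared market / clearing house each party connects to once:
print(vos + providers)   # 270 integrations
```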

SLIDE 20

Illustration from HEP

  • The ATLAS VO has ~20 research groups (b-physics, top, Higgs, …)
    – The members of these groups have different roles (about 5)
      • User, storage admin, leading researcher…
    – There are several experiments with a similar structure
    – The association can be expressed via the VOMS proxy extensions
  • On Monday ATLAS has a standard split of:
    – 10% for b-physics
    – 20% for top
    – 60% for Higgs
    – The rest equally split…
    – The lead researcher should get top priority
  • On Tuesday rumors spread that the student Judith from the SUSY team of CMS has an indication of a signal (a signal is a ticket to Stockholm)
    – ATLAS now needs, in almost real time, to:
      • Shift 90% of their resources and top priority to student Jack of their corresponding team
  • On Friday Judith gives a presentation in which she explains that she mixed Monte Carlo data with real data
    – ATLAS now has to switch quickly back to standard mode… (a sketch of such a share table follows below)
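A minimal sketch of the share table such a prioritization system would have to manage; the percentages come from the slide, while the function and group names are hypothetical. VOMS proxy extensions already tell a site which group and role a job carries; what is missing is machinery to map them onto shares that can be shifted in near real time and rolled back.

```python
# Standard ATLAS split from the slide; 'other' is the equally split rest.
standard_split = {"b-physics": 0.10, "top": 0.20, "higgs": 0.60, "other": 0.10}

def emergency_split(base, hot_group, hot_share):
    """Give hot_group hot_share of the VO's resources and scale the
    standard split into whatever is left, so shares still sum to 1."""
    rest = 1.0 - hot_share
    shares = {group: share * rest for group, share in base.items()}
    shares[hot_group] = shares.get(hot_group, 0.0) + hot_share
    return shares

shares = emergency_split(standard_split, "susy", 0.90)  # Tuesday: shift 90%
shares = standard_split                                 # Friday: roll back
```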

SLIDE 21

The Resource Provider’s Story

  • There are a few hundred or even thousands of resource providers
  • We pick one:
    – The computing center of the physics department of College Town
  • Funding by:
    – A national grid project, the department’s budget (which is in CMS), a donation by the foundation for top physics, …
  • The center is open to all ATLAS and CMS groups
    – But, over the long term, resources have to be provided based on funding
    – This is currently solved with static configuration of fair-share schedulers (see the sketch below)
      • Because there is NO trading system or currency
  • The site can’t change the configuration on the fly
    – As at most grid sites, a fraction of an admin runs the grid aspect
  • A system that would allow management of computing currencies and provide a market to establish a price would simplify the situation
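A sketch of what the static fair-share configuration amounts to: entitlements frozen in proportion to funding. The three funding sources are the ones named on the slide; the amounts and identifiers are invented for illustration.

```python
# Invented contribution figures for the funding sources named above.
funding = {
    "national_grid_project": 50_000,
    "cms_department_budget": 30_000,
    "top_physics_foundation": 20_000,
}

total = sum(funding.values())
fair_shares = {source: amount / total for source, amount in funding.items()}
print(fair_shares)   # {'national_grid_project': 0.5, ..., 0.2}

# Without a trading system or currency, this table is baked into the
# scheduler configuration and cannot follow demand on the fly.
```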

SLIDE 22

Detailed ‘Solvable’ Problem 2

  • Access to storage
    – For large files, where latency is a minor issue, solutions are underway
      • Interfaces to MSS, FTS for reliable transport, replica catalogue
      • Latency is on the order of several seconds to minutes
  • Missing:
    – The replacement for the user’s home directory on the grid
    – Characterization:
      • Many, many files (> 10^6 per user)
      • Average size is small (~1 MB per file; totals from 1 GB to a few 100 GB)
      • In a work session the user will create several files
      • And access quite a few, O(100)
      • Access is almost random
      • Latency matters since the user works interactively with these files
        – Statistical data, plots, etc. (a back-of-envelope latency budget follows below)
    – Hint:
      • Central storage, or replicating all files to all sites, is not an acceptable solution
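The interactivity requirement can be quantified with a back-of-envelope budget. The O(100) accesses per session comes from the slide; the per-open latencies are assumptions, contrasting a local disk with the several-seconds MSS path mentioned above.

```python
# O(100) file accesses per interactive work session (from the slide).
accesses = 100

# Assumed per-open latencies: local disk vs. a grid MSS/SRM path whose
# latency the slide puts at several seconds to minutes.
for label, open_latency_s in [("local disk", 0.001), ("grid MSS path", 5.0)]:
    session_wait = accesses * open_latency_s
    print(f"{label}: {session_wait:.1f} s waiting per session")

# local disk: 0.1 s; grid MSS path: 500.0 s (over eight minutes of pure
# waiting), which rules out interactive work without a low-latency tier.
```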