

SLIDE 1

Grid Deployment & Operations in the UK

Jeremy Coles, GridPP Production Manager, UK&I Operations for EGEE
J.Coles@rl.ac.uk

Wednesday 3rd May, ISGC 2006, Taipei

SLIDE 2

Overview

1 Background to e-Science – the UK Grid projects NGS & GridPP
2 The deployment and operations models and vision
3 GridPP performance measures
4 Progress in GridPP against LCG requirements
5 Future plans
6 Summary

SLIDE 3

UK e-Science

  • National initiatives began in 2001
  • UK e-Science programme
    – Application-focused/led developments
    – Varying degrees of "infrastructure" …

'e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.'

John Taylor, Director General of Research Councils, Office of Science and Technology

http://www.rcuk.ac.uk/escience/

SLIDE 4

UK e-Infrastructure directions

[Diagram: UK e-infrastructure layers]
  • Users get common access, tools, information and nationally supported services through the NGS
  • National facilities: LHC, ISIS TS2, HPCx + HECToR
  • Integrated internationally
  • VRE, VLE, IE
  • Regional and campus grids
  • Community grids

SLIDE 5

UK e-Infrastructure directions

SLIDE 6

Applications

  • Many, but not all, applications cover traditional computational sciences
    – Both user and pre-installed software
  • Several data-focused activities
  • Common features:
    – Distributed data and/or collaborators
    – Common infrastructure/interfaces
  • Not just pre-existing large collaborations
    – Explicitly encourage new users

Example application areas:
  • Thermodynamic integration
  • Molecular dynamics
  • Systems biology
  • Neutron scattering
  • Econometric analysis
  • Climate modelling
  • Nano-particles
  • Protein folding
  • Ab-initio protein structure prediction
  • Radiation transport (radiotherapy)
  • IXI (medical imaging)
  • Biological membranes
  • Micromagnetics
  • Archaeology
  • Text mining
  • Lattice QCD (analysis)
  • Astronomy (VO services)
SLIDE 7

National Grid Service

SLIDE 8

The UK & Ireland contribution to EGEE SA1 – deployment & operations

Consisted of 3 partners in EGEE-I:
  • The National Grid Service (NGS)
  • Grid Ireland
  • GridPP

[Chart: number of registered NGS users by date, 14 January 2004 to 14 December 2005, with a linear trend line; y-axis 50–300 users]

SLIDE 9

The UK & Ireland contribution to EGEE SA1 – deployment & operations

Consisted of 3 partners in EGEE-I:
  • The National Grid Service (NGS)
  • Grid Ireland
  • GridPP

Grid-Ireland focus:
  • National computational grid for Ireland, built over the Higher Education Authority network
  • Central operations from Dublin
  • Have developed an auto-build system for EGEE components

SLIDE 10

The UK & Ireland contribution to EGEE SA1 – deployment & operations

Consisted of 3 partners in EGEE-I:
  • The National Grid Service (NGS)
  • Grid Ireland
  • GridPP, composed of 4 regional Tier-2s and a Tier-1 as per the LCG Tier model

In EGEE-II:
  • NGS and Grid-Ireland unchanged
  • The lead institute in each of the GridPP Tier-2s becomes a partner

SLIDE 11

What UK structures are involved?

[Diagram, not to scale: GridPP shown within the UK Core e-Science Programme, linking Institutes, Tier-2 Centres, the Tier-1/A, Middleware/Security/Networking, Experiments, applications development (Apps Dev) and integration (Apps Int), the Grid Support Centre, and CERN (LCG, EGEE)]

SLIDE 12

Focus: GridPP structure

[Organisation chart: GridPP deployment structure]
  • Production Manager
  • Tier-2 Coordinators: NorthGrid, SouthGrid, ScotGrid, London Tier-2
  • Tier-2 support and a Site Administrator at each Tier-2
  • Tier-1 Manager, Tier-1 Technical Coordinator, Tier-1 support & administrators
  • Support groups: Storage Group, Networking Group, VOMS support, Catalogue support, Helpdesk support
  • Boards: Tier-2 Board, Tier-1 Board, Deployment Board, User Board, Project Management Board, Collaboration Board, Oversight Committee

SLIDE 13

GridPP structure and work areas

(Organisation chart as on the previous slide, annotated with the work areas listed below.)

  • Deployment of new hardware
  • Information exchange
  • Maintaining site services
  • Maintaining production services
  • LCG service challenges
  • GridPP challenges
  • Monitoring use of resources
  • Reporting
  • Running helpdesks
  • Interoperation – parallel deployment
  • Supporting dCache
  • Supporting DPM
  • Developing plug-ins
  • Constructing data views
  • Supporting network testing
  • Running core services
  • Ticket process management
  • Pre-production service
  • UK testzone
  • Pre-release testing
  • Updating project plans
  • Agreeing resource allocations
  • Checking project direction
  • Tracking documentation
  • VO interaction/support
  • Portal development

Example activities from across these areas. Recent output from some of these areas follows…

SLIDE 14

How effectively are resources being used?

A Tier-1-developed script uses one simple measure: sum(CPU time) / sum(wall time). The low efficiencies recorded for 2005 were generally a few jobs making the situation look worse than it was.

http://www.gridpp.rl.ac.uk/stats/ (chart annotation: problems with SEs, 2006)
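As a minimal sketch of this measure (the actual Tier-1 script is not shown in the slides, and the job-record format here is an assumption for illustration):

```python
# Sketch of the efficiency measure quoted above:
# efficiency = sum(CPU time) / sum(wall time) over a set of job records.

def cpu_efficiency(jobs):
    """jobs: iterable of (cpu_seconds, wall_seconds) pairs."""
    total_cpu = sum(cpu for cpu, _ in jobs)
    total_wall = sum(wall for _, wall in jobs)
    return total_cpu / total_wall if total_wall else 0.0

# A few jobs that sit idle (low CPU time, high wall time) drag the
# aggregate down, which is the effect noted for the 2005 figures.
jobs = [(3500, 3600), (3400, 3600), (10, 36000)]
print(f"efficiency = {cpu_efficiency(jobs):.0%}")  # ~16% despite two busy jobs
```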

SLIDE 15

RTM data views - efficiency

http://gridportal.hep.ph.ic.ac.uk/rtm/reports.html (* data shown for Q4 2005)

What are the underlying reasons for the big differences in overall efficiency?

SLIDE 16

RTM data views - usage

http://gridportal.hep.ph.ic.ac.uk/rtm/reports.html (* data shown for Q4 2005)

Does the usage distribution make sense?

SLIDE 17

RTM data views – job distribution

http://gridportal.hep.ph.ic.ac.uk/rtm/reports.html (* data shown for Q4 2005)

Operations needs to check mappings and discover why some sites are not used.

SLIDE 18

Site performance measures

  • Storage provided

SLIDE 19
Site performance measures

  • Storage provided
  • Scheduled downtime

[Bar chart: hours of scheduled downtime (0–600) in October, November and December for each site – Liverpool, Manchester*, UCL-HEP, Brunel, Sheffield, Durham, Queen Mary UL, Birmingham, Royal Holloway UL, IC HEP, UCL-CCC, IC LeSC**, Glasgow, Edinburgh, Oxford, Cambridge, RALPP, RAL Tier-1, Lancaster, Bristol]
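For illustration, a minimal sketch of how a scheduled-downtime measure like the one charted above could be computed automatically; the record format and dates are assumptions, not GridPP's actual tooling:

```python
from datetime import datetime

# Hypothetical downtime records: (site, start, end) of scheduled downtimes.
downtimes = [
    ("Liverpool", datetime(2005, 10, 3, 9), datetime(2005, 10, 5, 17)),
    ("Oxford",    datetime(2005, 10, 28, 8), datetime(2005, 11, 2, 12)),
]

def hours_in_window(start, end, win_start, win_end):
    """Overlap of [start, end] with [win_start, win_end], in hours."""
    lo, hi = max(start, win_start), min(end, win_end)
    return max((hi - lo).total_seconds(), 0) / 3600

# Sum per-site downtime hours for the October reporting window,
# clipping downtimes that straddle the month boundary.
october = (datetime(2005, 10, 1), datetime(2005, 11, 1))
totals = {}
for site, start, end in downtimes:
    totals[site] = totals.get(site, 0) + hours_in_window(start, end, *october)
print(totals)  # {'Liverpool': 56.0, 'Oxford': 88.0}
```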

SLIDE 20

Site performance measures

  • Storage provided
  • Scheduled downtime
  • Estimated occupancy

[Bar charts: hours of scheduled downtime per site (October–December), and average occupancy (0–50%) alongside each site's contribution to UK Tier-2 processing]

SLIDE 21

Site performance measures

  • Storage provided
  • Scheduled downtime
  • Estimated occupancy
  • SFT failures

[Bar charts: scheduled downtime and occupancy as on the previous slides, plus the number of critical SFT failures (0–140) per site in October, November and December]

SLIDE 22

Site performance measures

  • Storage provided
  • Scheduled downtime
  • Estimated occupancy
  • SFT failures
  • Tickets & responsiveness

[Bar charts as on the previous slides, plus the number of tickets and the average time in hours to resolve tickets per site for Q3 & Q4 2005]

SLIDE 23

Site performance measures

  • Storage provided
  • Scheduled downtime
  • Estimated occupancy
  • SFT failures
  • Tickets & responsiveness
  • # VOs supported

[Bar charts as on the previous slides, plus the number of supported VOs (0–16) per site]

SLIDE 24

Site performance measures

  • Storage provided
  • Scheduled downtime
  • Estimated occupancy
  • SFT failures
  • Tickets & responsiveness
  • # VOs supported
  • + others…

WHAT MAKES A SITE BETTER (beyond manpower)?
  • Need more data over longer periods
  • Ideally need more automated data! (see the sketch below)
  • Importance will increase in meeting MoU/SLA targets
  • How reliable are the metrics?

[Bar charts as on the previous slides]
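To make the automation point concrete, here is a minimal sketch that combines several of the charted measures into a single per-site view; the metric names, normalisation denominators (taken loosely from the chart axes above) and equal weighting are illustrative assumptions, not GridPP's actual method:

```python
# Hypothetical per-site metrics: lower is better for downtime, SFT
# failures and ticket-resolution time; higher is better for occupancy.
metrics = {
    "SiteA": {"downtime_h": 40, "sft_failures": 5, "occupancy": 0.35, "ticket_h": 12},
    "SiteB": {"downtime_h": 300, "sft_failures": 60, "occupancy": 0.10, "ticket_h": 90},
}

def score(m):
    # Normalise each metric to [0, 1] against a nominal worst case and
    # average, inverting "lower is better" metrics so that a higher
    # score means a better-performing site.
    return ((1 - min(m["downtime_h"] / 600, 1))
            + (1 - min(m["sft_failures"] / 140, 1))
            + min(m["occupancy"] / 0.5, 1)
            + (1 - min(m["ticket_h"] / 120, 1))) / 4

# Rank sites from best to worst.
for site, m in sorted(metrics.items(), key=lambda kv: -score(kv[1])):
    print(f"{site}: {score(m):.2f}")
```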

SLIDE 25

Meeting the LCG challenge

Example: Tier-2 individual transfer tests

[Diagram – example rates from throughput tests: transfers between Tier-2 sites (QMUL, IC-HEP, Birmingham, Oxford, Cambridge, Durham, RAL-PPD, Glasgow, Edinburgh, Manchester, Lancaster) and the RAL Tier-1, with individual rates ranging from 74 Mb/s to ~800 Mb/s]

  • Big variation in what sites could achieve
    – Internal networking configuration issues
    – Site connectivity (and contention)
    – SRM setup and level of optimisation
  • Rates to RAL were generally better than rates from RAL
    – Availability and setup of gridFTP servers at Tier-2s
    – SRM setup and level of optimisation
  • Scheduling tests was not straightforward
    – Availability of local site staff
    – Status of hardware deployment
    – Availability of the Tier-1
    – Need to avoid first tests during certain periods (local impacts)

http://wiki.gridpp.ac.uk/wiki/Service_Challenge_Transfer_Tests

Initial focus was on getting SRMs understood and deployed…
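For reference, the Mb/s figures quoted above are average rates; a minimal sketch of the arithmetic, assuming we simply time a transfer of known size:

```python
# Average transfer rate in megabits per second (Mb/s), as quoted above.
def rate_mbps(bytes_transferred: int, seconds: float) -> float:
    return (bytes_transferred * 8) / (seconds * 1e6)

# e.g. a 1 GB file moved in 40 s gives 200 Mb/s, within the range
# seen in the Tier-2 tests above.
print(f"{rate_mbps(10**9, 40):.0f} Mb/s")  # 200 Mb/s
```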

SLIDE 26

Meeting the LCG challenge

Example: Tier-1 & Tier-2 combined transfer tests

http://wiki.gridpp.ac.uk/wiki/SC4_Aggregate_Throughput

  • Early attempts revealed unexplained dropouts
  • Dropouts later traced to a firewall
  • A rate cap at RAL was introduced for later tests
  • Tests repeated to check RAL capping
  • Rate was stretched further by using an OPN link to Lancaster

SLIDE 27

Meeting the LCG challenge

Tier-1 & Tier-2 combined transfer tests – rerun

http://wiki.gridpp.ac.uk/wiki/SC4_Aggregate_Throughput

SLIDE 28

GridPP operations: What is next?

  • SRM deployments are now stable, and focus has shifted to improving site configurations and optimisations
  • Sites are now more comfortable with the release/reporting process, but concerns remain – gLite 3.0
  • We need to continue improving site transfer performance, but also extend the tests to include such things as sustained simultaneous reading and writing
  • Several sites are receiving new equipment – we need to ensure a smooth deployment. 64-bit machines are being deployed in some cases.
  • GridPP mapped its Tier-2s to experiments for closer working and "proving" of the Tier-2 capabilities. Some progress already, but much more is needed.
  • Data is becoming available for understanding the performance of sites, but integration and automation are far from ideal
  • The installation of network monitoring "boxes" at UK sites
  • Security – several areas, but extending the ROC security challenge and implementing an approach for joint logging are in progress
  • More interoperation (and joint operations) with NGS
SLIDE 29

Summary

1 UK e-Science has a broad vision with the NGS a central part
2 There will be increasing interoperation between UK activities
3 The UK particle physics grid remains one of the largest projects
4 Operational focus will shift to performance measures
5 Progress is being made for the LHC pilot service, but not always smoothly
6 There are clear areas where further work is required