Distributed Operations into the EGI era: EGEE and EGI



SLIDE 1

Distributed Operations into the EGI era: EGEE and EGI

Vera Hansper

TERENA TF-NOC meeting, NORDUNET, 3rd May 2010

SLIDE 2

Brief History

  • 2007
  • Solely NDGF-T1 operations
  • 2008
  • Began collaboration with EGEE operations
  • New Head of Operations appointed in October
  • NE ROC (SNIC) merged operations with NDGF (Nov 2008)
  • 2009
  • New operations team comprising the SNIC and NDGF teams
  • New operations combining NDGF ops and EGEE ops
  • Update of procedures
  • New ticketing system
  • NOC joins to cover T1 24/7 requirements
  • 2010
  • Change in rostering style
  • EGI ... new challenges

SLIDE 3

EGEE Era – The Team

  • NDGF
  • Vera Hansper (Finnish Node Co-ord., Head of Operations)
  • Jens Larsson (Swedish Node Co-ord.)
  • Tore Mauset (Norwegian Node Co-ord.)
  • Anders Rhod Gregersen (Danish Node Co-ord., weekend on call)
  • Mattias Wadenstein (Systems Integrator, weekend on call)
  • SNIC
  • Zeeshan Ali Shah (PDC) (only 2009)
  • Thomas Bellman (NSC)
  • Michaela Lechner (PDC)
  • Andreas Davour (PDC) (New)
  • Roger Oscarsson (HPC2N)
  • Åke Sandgren (HPC2N)

SLIDE 4

What does EGEE era mean?

NDGF was set up as a distributed T1 center for LHC resources from the Nordic countries

  • Operations focused on those resources only
  • Based on ARC middleware

NE ROC, as part of EGEE, covers resources based on gLite middleware

  • Includes other Tier resources
  • Covers more than just Nordic countries
  • Operations fell into 2 categories
  • NDGF operations (OoD)
  • EGEE operations (ROD)

SLIDE 5

What's done where

NDGF-T1 is a distributed T1!

  • NDGF Operations cover sites which are running the ARC middleware. These include
  • Denmark
  • Finland
  • Norway
  • Slovenia
  • Sweden
  • Switzerland
  • NE ROC ROD duty covers sites which fall under the NE ROC Nordics region and also run gLite. These include
  • Baltic Grid
  • Finland
  • Norway
  • Sweden

SLIDE 6

NDGF Operations from 2007

  • NDGF team on 5 week rotation roster
  • One person per shift
  • Information about operations is found at https://portal.nordu.net/display/ndgfwiki/Operation
  • This is constantly under development
  • Alerts from Nagios are sent by email to the whole team
  • OoD also get SMSs
  • OoD must fill in a daily operations log
  • Attend WLCG weekly ops meeting
  • Communication lines are numerous!
  • Jabber
  • Email
  • Wiki logs – updated daily!
  • Phone ....
  • The occasional pigeon ...
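
The alert fan-out listed above (email to the whole team, plus an SMS to the operator on duty) could be sketched roughly as follows. This is only an illustration: the function name, the addresses and the phone number are hypothetical and are not taken from the actual NDGF Nagios configuration.

```python
def alert_recipients(team_emails, ood_phone):
    """Return (channel, address) pairs for a single alert.

    Every team member is emailed; only the operator on duty (OoD)
    additionally receives an SMS.
    """
    recipients = [("email", address) for address in team_emails]
    recipients.append(("sms", ood_phone))
    return recipients


# Hypothetical placeholder contacts.
team = ["op1@example.org", "op2@example.org"]
for channel, address in alert_recipients(team, "+000000000"):
    print(channel, address)
```
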
SLIDE 7

NDGF Operations merge with SNIC

  • Merger occurred November 2008
  • Runs 8/7, with SNIC and NDGF teams alternating on a 6 week rotation roster
  • One person per shift
  • Weekend on call shifts handled by NDGF ops team
  • Information about operations can be found at https://portal.nordu.net/display/ndgfwiki/Operation
  • Alerts from Nagios are sent by email to both teams
  • NDGF on duty operators also get SMSs
  • On duty operator must fill in a daily operations log (moved to the weekly meetings)

  • Attend the weekly WLCG meeting
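
A rough sketch of how such an alternating roster might be generated is shown below. It assumes three operators per team, so that each operator is on duty one week in six; the function name and the operator names are placeholders, and the weekend on-call shifts are not modelled.

```python
def build_roster(ndgf_ops, snic_ops, weeks):
    """Return (week, team, operator) tuples, alternating teams week by week."""
    roster = []
    for week in range(weeks):
        # Even-numbered weeks go to NDGF, odd-numbered weeks to SNIC.
        if week % 2 == 0:
            team_name, team = "NDGF", ndgf_ops
        else:
            team_name, team = "SNIC", snic_ops
        # Each team cycles through its own operators on its own weeks.
        operator = team[(week // 2) % len(team)]
        roster.append((week + 1, team_name, operator))
    return roster


# Hypothetical placeholder names, three operators per team.
ndgf = ["ndgf-op-1", "ndgf-op-2", "ndgf-op-3"]
snic = ["snic-op-1", "snic-op-2", "snic-op-3"]

for week, team, operator in build_roster(ndgf, snic, 6):
    print(f"week {week}: {team} / {operator}")
```

With three operators in each list, the pattern repeats every six weeks, which matches the one-person-per-shift, six-week rotation described on this slide.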

SLIDE 8

NDGF Operations from 2009 - 24/7

  • 24/7 operations are required by MoU with the LHC/WLCG.
  • NORDUnet NOC agreed to cover this need.
  • NOC handle after hours (17:00 CET – 09:00+1 CET) ops
  • Monitor Level 1 critical services
  • Receive NAGIOS alerts via email after hours
  • Communicate directly with active responsible persons in case of emergency
  • Usually Anders, Gerd or Mattias …
  • Have their own independent roster
  • First iteration of 24/7 began on the 9th of July 2009
  • Initial coverage was from 17:00 to 22:00 CET
  • Final steps to have full 24/7 started September 2009
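
The after-hours window can be expressed as a simple time check. The sketch below is an assumption-laden illustration, not the NOC's actual tooling: the function name is made up, times are treated as CET wall-clock times, the start of the window is inclusive and the end exclusive, and weekends are not handled.

```python
from datetime import time


def noc_covers(t, start=time(17, 0), end=time(9, 0)):
    """Return True if wall-clock time t falls inside the NOC window.

    The default window (17:00 to 09:00 the next day) wraps past midnight;
    a window such as 17:00 to 22:00 does not.
    """
    if start <= end:
        # Window contained within a single day.
        return start <= t < end
    # Window wraps past midnight: late evening or early morning.
    return t >= start or t < end


print(noc_covers(time(23, 30)))                    # full overnight window
print(noc_covers(time(23, 30), end=time(22, 0)))   # initial 17:00-22:00 window
```
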

SLIDE 9

EGEE Operations

  • EGEE operations moved from a centrally managed system (COD) to a regionally managed model (ROD) in 2009.
  • NE ROC has been in the regional model since the beginning of 2009 and NDGF has been instrumental in the creation process of the structure of the model.
  • There are various layers to the EGEE operations
  • Site Administrators
  • 1st Line Support
  • ROD
  • C-COD
  • Monitors site availability through SAM tests
  • Managed through the CIC portal
  • The Regional Dashboard provides the dashboard and tools for ROD.
  • Extra Documentation specific to EGEE
  • https://twiki.cern.ch/twiki/bin/view/EGEE/OperationalDocumentation
  • CERN EDMS repository

SLIDE 10

Tools

  • NDGF monitoring tools
  • Nagios
  • Ganglia
  • dCache dashboard
  • FTS monitoring
  • WLCG/EGEE monitoring systems
  • SAM tests
  • GRIDMAP
  • GRIDVIEW
  • EGEE monitoring tools
  • SAM (Based on a CERN developed monitoring system)
  • Changed to NAGIOS based early 2010
  • GRIDMAP
  • GRIDVIEW
  • ROD Dashboard
  • Tools/Services linked to dashboard

SLIDE 11

A view of a tool ...

  • Developed by CIC at IN2P3 (France)
  • Integrates monitoring and database services

SLIDE 12

A view of a tool cont.

Alarms from failing tests (NAGIOS) can be monitored

  • Sites can be contacted through the notepad

GGUS tickets can be created, monitored and solved directly

  • There are time constraints for these too

SLIDE 13

Communication channels: Ticketing System

  • NOC operations moved their ticketing system to a JIRA based system at the beginning of 2009
  • NDGF adopted this system a short time later
  • Sites can subscribe to receive ticket information
  • Tickets or issues can be tailored to send them to sites, or keep the issue internal
  • Used by NDGF for NDGF-T1 operations
  • Still use EGEE system (GGUS) in tandem
  • Used mainly for EGEE operations
  • NDGF-T1 also receives GGUS tickets
  • Outside users submit these

SLIDE 14

Communication : customers

  • Sites and Site administrators == Customers
  • Site admins are encouraged to subscribe to the NDGF ticketing system
  • http://mail.ndgf.org/mailman/listinfo/ndgfticket
  • Small volume list, mainly to notify admins about central (NDGF-T1) service maintenance
  • Site admins should subscribe to EGEE alarm notifications
  • https://cic.gridops.org/index.php?section=roc&page=alertnotification
  • Can be done on a site or node basis
  • Operators are encouraged to be proactive
  • Alert sites/admins about a problem immediately – minimise ticket creation (EGEE-style)
  • The faster a problem is solved, the better the overall availability of the site
  • (Important for both NDGF-T1 and EGEE level operations!)

SLIDE 15

Operations is also a support unit

  • Operators can issue downtime in the GOCDB for sites
  • The mailing list is actively read
  • Admins can freely use it to communicate with the operators and admins
  • Developers also read this list
  • Help – advice, training, etc.
  • Some of the operators are cheap – you only need to mention beer!
  • No question is too trivial

SLIDE 16

Operations at NDGF: EGEE era

  • Combined daily efforts of NDGF and SNIC operations teams, out of hours handled by NOC
  • NDGF/SNIC operators rostered one week in six
  • NOC has its own rostering system
  • Lots of communication channels
  • JABBER room used extensively
  • NDGF and EGEE ticketing
  • Many monitoring tools and logging systems
  • NAGIOS – NDGF and CERN
  • GANGLIA, FTS, dCACHE
  • WIKI
  • DASHBOARD
  • Attend numerous meetings
  • Weekly WLCG meeting
  • Daily WLCG operations meeting
  • Weekly NDGF meeting
  • Nordic NE ROC phone meeting

SLIDE 17

EGEE era – room for improvement

Complex Structure of Operations:

  • 2 aspects to consider
  • NDGF-T1
  • EGEE
  • 2 teams to administer
  • NDGF handle weekend on call
  • SNIC only handle normal work days
  • Teams rostered on alternate weeks
  • Roster structure complicated …
  • Lots of meetings
  • Chat room meetings
  • Phone meetings
  • ... also Face to Face meetings

SLIDE 18

Successes

  • OLD style EGEE model has worked
  • Co-operation between teams
  • Co-operation in group as a whole
  • A common work space – JABBER!
  • Moving to EGI era should not be too complex
  • BUT ...

SLIDE 19

Nomenclature

  • Acronyms so far ...
  • NDGF – Nordic Data Grid Facility
  • NOC – Network Operations Center
  • EGEE – Enabling Grids for E-Science
  • SNIC – Swedish National Infrastructure for Computing
  • NE ROC – Northern Europe Regional Operations Center
  • NEW acronyms
  • NGI – National Grid Infrastructure
  • EGI – European Grid Initiative/Infrastructure
  • EGI-InSPIRE – Integrated Sustainable Pan-European Infrastructure for Researchers in Europe

SLIDE 20

NDGF and EGI : operations

  • NDGF == NGI
  • Technical organisation for ease of EGI tools
  • One team – one set of duties
  • Details are fluid – but basically the same type of duties as now.
  • Team comprised of operations support from each NGI
  • No distinction between NDGF OoD and ROD duties
  • Only referred to as OoD
  • Tools streamlining?
  • NAGIOS is here to stay
  • Need to maintain 2 instances – NDGF-T1 and CERN
  • Still use NDGF tools
  • May adapt use of EGI tools/services
  • GOCDB, GGUS, Dashboard

SLIDE 21

Operations Team

  • Denmark
  • TBA
  • Finland
  • Vera Hansper
  • TBA
  • Norway
  • Tore Mauset
  • Sweden (still to be finalised)
  • Jens Larsson
  • Thomas Bellman
  • Roger Oscarsson
  • Andreas Davour

SLIDE 22

Face of EGI operations

  • Required to maintain T1 level resources (as now)
  • Required to monitor and assist sites at our NGIs
  • NO LONGER NE ROC
  • Baltic states will still be monitored
  • Some more formal arrangement for the long term
  • Use of some tools is mandatory – concept of Regional versus Centralised Tools
  • NAGIOS
  • Dashboard
  • GOCDB
  • Accounting (SGAS)
  • NDGF tools will remain (basically) the same

SLIDE 23

Face of EGI operations

  • Operators will have certain responsibilities to EGI operations outside of the normal roster
  • This will in some part be determined by their NGI, and partly in negotiation with NDGF
  • Political structure has changed
  • Responsibility for providing operations teams falls onto NGIs.
  • Communication channels
  • Ticketing system will flow more smoothly
  • Keep JABBER
  • NE ROC meetings disappear – absorbed into NDGF meetings
  • EGEE meetings transform to EGI meetings
  • Less Face to Face
  • More video conferencing anticipated

SLIDE 24

Challenges

  • Streamlining tools
  • Some of this is out of our control
  • We need to make sure we use the ones available to us efficiently!
  • Keeping logs
  • Where to store this information?
  • How to ensure that these are completed in a timely manner?
  • The WIKI and JIRA
  • Improvements will be made to the Operations pages
  • Improve readability
  • Reduce repetition
  • CALENDAR system
  • Need something better than GOOGLE?
  • Move to use embedded calendar in Confluence/WIKI

SLIDE 25

Resources

  • NDGF WIKI
  • EGI.EU pages
  • Operations Documentation
  • Tool Documentation
  • PEOPLE!
  • By far the greatest resource!

SLIDE 26

The Undiscovered Country – milestones to come

  • Start of LHC – Dec 2009
  • Data is flowing into NDGF resources!
  • 2010 – EGI and beyond
  • Transitional challenges
  • People moving to new responsibilities
  • Where to fit sites/NGIs outside of the Nordics?
  • Case by case basis
  • 2011 – NDGF or something else?
  • News is that NDGF will stay. Exact entity is being determined now
  • The transition should be viewed as an opportunity
  • Challenges will keep Operations on our toes
  • Still requires the unique team spirit that has embodied this group
  • If we don't like the way a tool looks – get involved!

SLIDE 27

Why does it work?

DISTRIBUTED == CO-OPERATION

  • Group dynamics are good
  • People are friendly, collaborative, supportive
  • Communication involves humour
  • Friendships develop
  • Nordic countries are small individually
  • Present a larger user base and resource collectively
  • Common cultural background
  • Helps with building a community
  • Eases cross-cultural lines
  • Nordic people are open to others

SLIDE 28

New horizons?

Operations are here to stay – even if under different hats and other guises

HOW does this fit to your models?

The key is communication.

SLIDE 29

QUESTIONS?