
Distributed Operations into the EGI era: EGEE and EGI



  1. Distributed Operations into the EGI era: EGEE and EGI
     Vera Hansper
     TERENA TF-NOC meeting, NORDUNET, 3rd May 2010

  2. Brief History
     • 2007
       - Solely NDGF-T1 operations
     • 2008
       - Began collaboration with EGEE operations
       - New Head of Operations appointed in October
       - NE ROC (SNIC) merged operations with NDGF (Nov 2008)
     • 2009
       - New operations team comprising the SNIC and NDGF teams
       - New operations combining NDGF ops and EGEE ops
       - Update of procedures
       - New ticketing system
       - NOC joins to cover T1 24/7 requirements
     • 2010
       - Change in rostering style
       - EGI ... new challenges

  3. EGEE Era – The Team
     • NDGF
       - Vera Hansper (Finnish Node Co-ord., Head of Operations)
       - Jens Larsson (Swedish Node Co-ord.)
       - Tore Mauset (Norwegian Node Co-ord.)
       - Anders Rhod Gregersen (Danish Node Co-ord., weekend on call)
       - Mattias Wadenstein (Systems Integrator, weekend on call)
     • SNIC
       - Zeeshan Ali Shah (PDC) (only 2009)
       - Thomas Bellman (NSC)
       - Michaela Lechner (PDC)
       - Andreas Davour (PDC) (New)
       - Roger Oscarsson (HPC2N)
       - Åke Sandgren (HPC2N)

  4. What does EGEE era mean?
     • NDGF was set up as a distributed T1 center for LHC resources from Nordic countries
       - Operations focused on those resources only
       - Based on ARC middleware
     • NE ROC, as part of EGEE, covers resources based on gLite middleware
       - Includes other Tier resources
       - Covers more than just Nordic countries
     • Operations fell into 2 categories
       - NDGF operations (OoD)
       - EGEE operations (ROD)

  5. What's done where
     • NDGF-T1 is a distributed T1!
     • NDGF Operations cover sites which are running the ARC middleware. These include
       - Denmark
       - Finland
       - Norway
       - Slovenia
       - Sweden
       - Switzerland
     • NE ROC ROD duty covers sites which fall under the NE ROC Nordics region and also run gLite. These include
       - Baltic Grid
       - Finland
       - Norway
       - Sweden

  6. NDGF Operations from 2007
     • NDGF team on 5 week rotation roster
       - One person per shift
     • Information about operations is found at https://portal.nordu.net/display/ndgfwiki/Operation
       - This is constantly under development
     • Alerts from Nagios are sent by email to the whole team
       - OoD also get SMSs
     • OoD must fill in a daily operations log
       - Attend WLCG weekly ops meeting
     • Communication lines are numerous!
       - Jabber
       - Email
       - Wiki logs – updated daily!
       - Phone ....
       - The occasional pigeon ...
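The alert fan-out described on this slide (email to the whole team, plus an SMS to the operator on duty) can be sketched as below. This is only an illustration: the function name, the channel labels, and the operator names are hypothetical, and nothing here reflects the actual Nagios notification configuration used by NDGF.

```python
def route_alert(team, on_duty):
    """Return (channel, recipient) pairs for a single alert.

    Assumption: every team member gets an email, and the operator
    on duty (OoD) additionally gets an SMS.
    """
    deliveries = [("email", member) for member in team]
    if on_duty in team:
        # Only the OoD is paged by SMS; everyone else sees email only.
        deliveries.append(("sms", on_duty))
    return deliveries


# Hypothetical team of three operators, with "op-b" on duty.
DELIVERIES = route_alert(["op-a", "op-b", "op-c"], "op-b")
```
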

  7. NDGF Operations merge with SNIC
     • Merger occurred November 2008
     • Runs 8/7, with SNIC and NDGF teams alternating on a 6 week rotation roster
       - One person per shift
       - Weekend on call shifts handled by NDGF ops team
     • Information about operations can be found at https://portal.nordu.net/display/ndgfwiki/Operation
     • Alerts from Nagios are sent by email to both teams
       - NDGF on duty operators also get SMSs
     • On duty operator must fill in a daily operations log (moved to the weekly meetings)
       - Attend the weekly WLCG meeting
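A minimal sketch of how a six week roster with alternating teams could be computed. The slide does not say how the roster is actually produced, so the interleaving rule, the week-0 start date, and the operator names are all assumptions made for illustration.

```python
from datetime import date
from itertools import chain, zip_longest


def build_rotation(ndgf, snic):
    """Interleave the two teams so that duty alternates between them."""
    pairs = zip_longest(ndgf, snic)
    return [name for name in chain.from_iterable(pairs) if name is not None]


def operator_on_duty(rotation, start, day):
    """Return the operator for the week containing `day`.

    `start` is the Monday of week 0 (an assumed reference date).
    """
    weeks_elapsed = (day - start).days // 7
    return rotation[weeks_elapsed % len(rotation)]


# Hypothetical names: three operators per team gives a six week cycle.
ROTATION = build_rotation(["ndgf-a", "ndgf-b", "ndgf-c"],
                          ["snic-a", "snic-b", "snic-c"])
START = date(2008, 11, 3)  # a Monday, chosen only for the example
```

With six people in the rotation, each operator is on duty one week in six, and the same person comes up again 42 days later.
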

  8. NDGF Operations from 2009 - 24/7
     • 24/7 operations are required by MoU with the LHC/WLCG.
     • NORDUnet NOC agreed to cover this need.
       - NOC handle after hours (17:00 CET – 09:00+1 CET) ops
       - Monitor Level 1 critical services
       - Receive NAGIOS alerts via email after hours
       - Communicate directly with active responsible persons in case of emergency
         - Usually Anders, Gerd or Mattias …
       - Have their own independent roster
     • First iteration of 24/7 began on the 9th of July 2009
       - Initial coverage was from 17:00 to 22:00 CET
       - Final steps to have full 24/7 started September 2009
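The two coverage windows on this slide (the initial 17:00 to 22:00 and the full 17:00 to 09:00 the next day) differ in that the second one wraps past midnight. A small sketch of a window check that handles both cases follows; the function and constant names are hypothetical, times are assumed to be already expressed in CET, and weekend handling is deliberately left out because the slide does not describe it.

```python
from datetime import time


def in_window(t, start, end):
    """True if time `t` falls in [start, end), allowing the window to wrap midnight."""
    if start <= end:
        return start <= t < end
    # Wrapping window, e.g. 17:00 -> 09:00 the next day.
    return t >= start or t < end


# Assumed constants taken from the times quoted on the slide.
INITIAL_NOC_WINDOW = (time(17, 0), time(22, 0))  # first iteration, July 2009
FULL_NOC_WINDOW = (time(17, 0), time(9, 0))      # full after-hours coverage


def handled_by_noc(t, window=FULL_NOC_WINDOW):
    """True if an alert at local time `t` falls in the NOC's after-hours window."""
    return in_window(t, *window)
```

For example, an alert at 23:00 falls outside the initial window but inside the full one, while 09:00 sharp is treated as the start of the daytime shift.
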

  9. EGEE Operations
     • EGEE operations moved from a centrally managed system (COD) to a regionally managed model (ROD) in 2009.
       - NE ROC has been in the regional model since the beginning of 2009 and NDGF has been instrumental in the creation process of the structure of the model.
     • There are various layers to the EGEE operations
       - Site Administrators
       - 1st Line Support
       - ROD
       - C-COD
     • Monitors site availability through SAM tests
       - Managed through the CIC portal
       - The Regional Dashboard provides the dashboard and tools for ROD.
     • Extra Documentation specific to EGEE
       - https://twiki.cern.ch/twiki/bin/view/EGEE/OperationalDocumentation
       - CERN EDMS repository

  10. Tools
      • NDGF monitoring tools
        - Nagios
        - Ganglia
        - DCACHE dashboard
        - FTS monitoring
        - WLCG/EGEE monitoring systems
          - SAM tests
          - GRIDMAP
          - GRIDVIEW
      • EGEE monitoring tools
        - SAM (Based on a CERN developed monitoring system)
          - Changed to NAGIOS based early 2010
        - GRIDMAP
        - GRIDVIEW
        - ROD Dashboard
          - Tools/Services linked to dashboard

  11. A view of a tool ...
      • Developed by CIC at IN2P3 (France)
      • Integrates monitoring and database services

  12. A view of a tool cont.
      • Alarms from failing tests (NAGIOS) can be monitored
        - Sites can be contacted through the notepad
      • GGUS tickets can be created, monitored and solved directly
        - There are time constraints for these too

  13. Communication channels: Ticketing System
      • NOC operations moved their ticketing system to a JIRA based system at the beginning of 2009
        - NDGF adopted this system a short time later
      • Sites can subscribe to receive ticket information
        - Tickets or issues can be tailored to send them to sites, or keep the issue internal
        - Used by NDGF for NDGF-T1 operations
      • Still use EGEE system (GGUS) in tandem
        - Used mainly for EGEE operations
        - NDGF-T1 also receives GGUS tickets
        - Outside users submit these

  14. Communication: customers
      • Sites and Site administrators == Customers
        - Site admins are encouraged to subscribe to the NDGF ticketing system
          - http://mail.ndgf.org/mailman/listinfo/ndgfticket
          - Small volume list, mainly to notify admins about central (NDGF-T1) service maintenance
        - Site admins should subscribe to EGEE alarm notifications
          - https://cic.gridops.org/index.php?section=roc&page=alertnotification
          - Can be done on a site or node basis
      • Operators are encouraged to be proactive
        - Alert sites/admins about a problem immediately – minimise ticket creation (EGEE-style)
        - The faster a problem is solved, the better the overall availability of the site
        - (Important for both NDGF-T1 and EGEE level operations!)

  15. Operations is also a support unit
      • Operators can issue downtime in the GOCDB for sites
      • The mailing list is actively read
        - Admins can freely use it to communicate with the operators and admins
        - Developers also read this list
      • Help – advice, training, etc.
        - Some of the operators are cheap – you only need to mention beer!
      • No question is too trivial

  16. Operations at NDGF: EGEE era
      • Combined daily efforts of NDGF and SNIC operations teams, out of hours handled by NOC
        - NDGF/SNIC operators rostered one week in six
        - NOC has its own rostering system
      • Lots of communications channels
        - JABBER room used extensively
        - NDGF and EGEE ticketing
      • Many monitoring tools and logging systems
        - NAGIOS – NDGF and CERN
        - GANGLIA, FTS, dCACHE
        - WIKI
        - DASHBOARD
      • Attend numerous meetings
        - Weekly WLCG meeting
        - Daily WLCG operations meeting
        - Weekly NDGF meeting
        - Nordic NE ROC phone meeting
